Dhavan Shah, Maier-Bascom Professor, School of Journalism and Mass Communication*

Chris Wells, Assistant Professor, School of Journalism and Mass Communication*

Alex Hanna, Doctoral Student, Department of Sociology*

JungHwan Yang, Doctoral Student, School of Journalism and Mass Communication*

*The University of Wisconsin-Madison

How big must data be to merit being called “Big Data”? Too big to fit on a thumb drive? Over a hundred gigabytes? Measured in terabytes or petabytes? As these questions suggest, the term Big Data is misleading: the defining issue is often not the volume of the data but their emergence from traces of the behavior of many independent and interacting entities, the velocity at which they are produced, and the variability of their form (Dumbill, 2012). Drawing inferences, making predictions, and developing theory from such data often requires computational approaches that complement more conventional statistical techniques. These new approaches to communication science, often including applications of network analysis, natural language processing, machine learning, and other tools, have the potential to provide novel insights into existing questions about digital journalism practices, online political talk, and social network structure. Of course, to do this work, one must first overcome the challenges associated with the acquisition, archiving, and analysis of data at a massive scale.

Notably, recent decreases in prices for storage capacity, boosts in processing power, and the availability of shared clusters of computing resources have greatly expanded researchers’ ability to collect and utilize these sorts of data, allowing academics to do what was only recently the purview of industry and government. As a result, Big Data are now being used by academics to shape policy debates, providing insights about public sentiment in the absence of polling data, political alignments in the midst of contentious politics and routine political behavior, and social behaviors within online communities. One important source for these data is Twitter, which is unusual both in its openness, which makes interactions among members observable, and in its accessibility to scholars. In the Social Media & Democracy research group (http://smad.journalism.wisc.edu/), which is part of the Mass Communication Research Center within the School of Journalism and Mass Communication, we have been archiving 40-50 million tweets per day for the last 18 months, and have been collecting focused samples of up to 80,000 political elites and their followers since before the 2010 midterm elections.

In one recent paper, we explore the clustering of these political networks during the 2010 Midterm, mapping 23,466 followers of 409 candidates running for U.S. House, Senate, and Governorships. To do so, we conduct multidimensional scaling of hashtag use within 9 million tweets, allowing us to see how citizens self-organize around common language use. Moving beyond a simple right-left division, we identify and describe five unique clusters within these candidate follower networks. Compared to the considerable cohesion on the political left, the multiple clusters observed on the right are defined by the emergence of the Tea Party and its alignment with conservative media outlets, the growing role of women in the Republican Party and their status in high visibility races, as well as the perspectives and priorities of conventional conservatives. We then predict cluster membership at the individual level from survey data of a subset of users. It is worth noting that these distinctions on the political right continued to define the 2012 election cycle as well as the 2013 government shutdown.
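Multidimensional scaling takes as input a matrix of pairwise dissimilarities. The sketch below shows one way such a matrix might be built from hashtag use. It is an illustration only, not the procedure used in the paper: the choice of cosine distance, the function names, and the toy tweets are all assumptions made for this example.

```python
import math
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")


def hashtag_profile(tweets):
    """Count how often each hashtag (lowercased) appears in a user's tweets."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two hashtag count profiles."""
    dot = sum(a[tag] * b[tag] for tag in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a user with no hashtags is treated as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def distance_matrix(tweets_by_user):
    """Build a symmetric user-by-user distance matrix (users in sorted order)."""
    users = sorted(tweets_by_user)
    profiles = {u: hashtag_profile(tweets_by_user[u]) for u in users}
    matrix = [
        [cosine_distance(profiles[u], profiles[v]) for v in users]
        for u in users
    ]
    return users, matrix


# Invented toy data, not drawn from the study.
toy = {
    "user_a": ["Vote today #tcot #teaparty", "#tcot rally"],
    "user_b": ["#teaparty meeting #tcot"],
    "user_c": ["Health care now #p2", "#p2 #dems"],
}
users, dist = distance_matrix(toy)
```

In this toy case, user_a and user_b share hashtags and end up close together, while user_c shares none with user_a and sits at the maximum distance of 1.0. A matrix like `dist` could then be passed to any multidimensional scaling routine to place users in a low-dimensional space.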

Another study examines the effectiveness of Twitter as a campaign tool during this same election cycle. By analyzing follower relationships, @mentions, and retweet activities among the Midterm candidates and their followers, we explore the factors related to the successful use of Twitter as a campaign tool. We constructed several Twitter effectiveness measures and examined the relationships among those measures and select external factors. Findings suggest that Twitter amplifies existing campaign advantages: prominent candidates from bigger races earn more followers, and candidates who have more followers are more likely to be mentioned and retweeted. However, the findings also clearly indicate that influence on Twitter is not necessarily determined by candidates’ campaign resources, as third-party candidates and those with limited campaign finances also exert significant influence in this online network, suggesting the democratic potential of social media in political campaigns.
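As a rough illustration of what simple effectiveness measures of this kind could look like, the sketch below tallies the @mentions and retweets each candidate receives and normalizes them by follower count. The measures, the reliance on the "RT @handle" retweet convention, the function names, and the toy tweets are assumptions for this example and are not the measures constructed in the study.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")
# Assumption: retweets follow the manual "RT @handle" convention.
RETWEET = re.compile(r"^RT @(\w+)")


def count_attention(tweets, candidates):
    """Tally @mentions and retweets received by each candidate handle."""
    wanted = {c.lower() for c in candidates}
    mentions, retweets = Counter(), Counter()
    for text in tweets:
        rt = RETWEET.match(text)
        rt_target = rt.group(1).lower() if rt else None
        if rt_target in wanted:
            retweets[rt_target] += 1
        # Count each handle at most once per tweet; the retweeted handle
        # is credited as a retweet rather than as a mention.
        for handle in {h.lower() for h in MENTION.findall(text)}:
            if handle in wanted and handle != rt_target:
                mentions[handle] += 1
    return mentions, retweets


def per_follower(counts, followers):
    """Normalize counts by audience size, skipping candidates with no followers."""
    return {c: counts[c] / n for c, n in followers.items() if n > 0}


# Invented toy data, not drawn from the study.
tweets = [
    "RT @cand_a: Join us tonight",
    "Great debate by @cand_a and @cand_b",
    "@cand_b what about jobs?",
    "RT @cand_b: thanks @cand_a",
]
followers = {"cand_a": 1000, "cand_b": 10}

mentions, retweets = count_attention(tweets, followers)
mention_rate = per_follower(mentions, followers)
```

Here both hypothetical candidates receive the same raw number of mentions, but the candidate with far fewer followers has a much higher per-follower rate, which is one way a low-resource campaign's influence could show up in such data.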

In another study, we capitalize on our rich data to examine political polarization and citizen engagement during the US and French presidential campaigns. We use the Twitter Gardenhose collection to filter tweets by keyword within a 50-day window, from March 19, 2012 to May 8, 2012 for the French election and from September 19, 2012 to November 8, 2012 for the US election, focusing in particular on engagement during the only French presidential debate and the first US presidential debate, on May 2, 2012 and October 3, 2012, respectively. First, we examine the minute-by-minute “share of voice” earned by each candidate in the hours before, during, and after the debates. From these data, we constructed partisan alignments based on hashtag usage and retweet networks, examining the social connections and partisan alignments of French and US citizens during periods of national conversation. We observed starker political polarization in the French case, while the US case demonstrated less clear ideological division. Although debates are moments of national conversation in both democracies, this comparative work reveals important differences in how citizens respond to major broadcast events and connect with one another through new media tools.
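A minute-by-minute share of voice can be approximated by counting, for each minute, how many tweets mention each candidate and dividing by the total number of candidate mentions in that minute. The sketch below is a minimal version under stated assumptions: simple lowercase substring matching on keywords, a tweet that names both candidates counting once for each, and invented timestamps and tweet texts. It is not the group's actual pipeline.

```python
from collections import Counter, defaultdict
from datetime import datetime


def share_of_voice(tweets, keywords):
    """Compute each candidate's share of candidate mentions per minute.

    tweets: iterable of (datetime, text) pairs.
    keywords: dict mapping a candidate label to a list of lowercase terms.
    Returns a dict mapping each minute to {candidate: share}.
    """
    per_minute = defaultdict(Counter)
    for ts, text in tweets:
        minute = ts.replace(second=0, microsecond=0)
        lowered = text.lower()
        for candidate, terms in keywords.items():
            if any(term in lowered for term in terms):
                per_minute[minute][candidate] += 1

    shares = {}
    for minute, counts in sorted(per_minute.items()):
        total = sum(counts.values())  # always > 0 for minutes recorded above
        shares[minute] = {c: counts[c] / total for c in keywords}
    return shares


# Invented toy data; timestamps and texts are illustrative only.
keywords = {"obama": ["obama"], "romney": ["romney"]}
tweets = [
    (datetime(2012, 10, 3, 21, 0, 5), "Obama on jobs"),
    (datetime(2012, 10, 3, 21, 0, 40), "Romney responds"),
    (datetime(2012, 10, 3, 21, 0, 50), "Romney and Obama spar"),
    (datetime(2012, 10, 3, 21, 1, 10), "romney zinger"),
]
shares = share_of_voice(tweets, keywords)
```

Minutes in which no candidate is mentioned are simply absent from the result, which avoids dividing by zero; a fuller analysis would likely fill those gaps and smooth the series before plotting.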

Needless to say, there is a great deal more to do with data of this form and scale; we are only scratching the surface of the possible insights that might be gleaned in the coming years, especially when these data are connected with external indicators such as news media content, polling data, and other behavioral indicators such as campaign contributions. Our efforts so far have made us aware of the need for more sophisticated computational and algorithmic techniques to deal with these data, and the need for training and sharing of these skills among the next generation of researchers. Theoretical and conceptual development will also be necessary to make the most of the new perspectives on communication and political phenomena. Big Data are certainly not the answer to all research questions, but they are an important addition to established methods, as computational approaches are used to complement conventional tools.

Using Big Data for New Political Communication Insights