GDELT and Big Data: Why Theory Still Matters

I’m really excited by the announcement of the GDELT (Global Database of Events, Language, and Tone) data set. Foreign Policy has published a great summary, and the Dart-Throwing Chimp has written an insightful commentary on GDELT’s potential for the evolution of political science. You can also read the authors’ paper, which was presented at the ISA conference a few weeks ago.

At the risk of oversimplifying things, GDELT is a step towards building a giant database of everything. Political events such as diplomatic overtures, threats, and demonstrations can be mapped visually and tracked over time. Because each event is coded, dated, and geolocated, one can potentially use the data as a predictive tool, as James Yonamine did in his paper on tracking violence in Afghanistan.
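To make the event-tracking idea concrete, here is a minimal sketch in Python (rather than R, purely for illustration) of counting daily protest events from a GDELT download. The file name, the assumption of a header row, and the exact columns used are my own choices, not GDELT’s; raw daily files are tab-delimited and may ship without headers, so check the official codebook before relying on any of this.

```python
# A sketch of counting daily protest events from a GDELT extract.
# Assumptions (mine, not GDELT's): the file "gdelt_extract.tsv" is
# tab-delimited, has a header row, and includes the columns named
# below. Raw GDELT daily files may ship without headers, so consult
# the official codebook for the real field order and names.
import pandas as pd

cols = ["SQLDATE", "EventRootCode", "ActionGeo_CountryCode"]

events = pd.read_csv("gdelt_extract.tsv", sep="\t", usecols=cols,
                     dtype={"SQLDATE": str, "EventRootCode": str})

# CAMEO root code 14 covers protest events (demonstrations, rallies, etc.).
protests = events[events["EventRootCode"] == "14"].copy()
protests["date"] = pd.to_datetime(protests["SQLDATE"], format="%Y%m%d")

# Daily protest counts per country: the kind of time series one could
# feed into a forecasting model.
daily = (protests.groupby(["ActionGeo_CountryCode", "date"])
                 .size()
                 .rename("protest_count")
                 .reset_index())
print(daily.head())
```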


More real-time work is being done by Alex Hanna on the Arab Spring and by Rolf Fredheim on Russian protests (thanks for the tips so far!), and I look forward to contributing as soon as I become more familiar with R. But in this post, I’ll focus on the normative and theoretical implications of GDELT and Big Data.

In Viktor Mayer-Schoenberger and Kenneth Cukier’s recent book, “Big Data: A Revolution That Will Transform How We Live, Work, and Think”, the authors emphasize how having more “messy” data can transform our traditional theories of causation in social science. Whereas social scientists once needed high-quality, representative samples for their data sets, researchers of today and tomorrow will have the luxury of an overwhelming deluge of data instead. The necessity of a “valid” sample diminishes when you can potentially have “n = all” instead.
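To make the “n = all” point concrete, here is a toy sketch (entirely hypothetical numbers, not GDELT data): with the full population of events in hand, a quantity of interest is computed directly, whereas a classical sample only estimates it.

```python
# A toy illustration of "n = all" versus a classical sample,
# using entirely hypothetical numbers.
import random

random.seed(0)

# Hypothetical population of one million coded events:
# 1 = protest event, 0 = any other event type.
population = [1 if random.random() < 0.07 else 0 for _ in range(1_000_000)]

# With n = all, the share of protest events is computed directly.
true_share = sum(population) / len(population)

# With a survey-sized random sample, it is only estimated,
# subject to sampling error.
sample = random.sample(population, 1_000)
sample_share = sum(sample) / len(sample)

print(f"population share: {true_share:.4f}")
print(f"sample estimate:  {sample_share:.4f}")
```

Of course, removing sampling error does nothing about errors in how the events were coded in the first place, which is part of what “messy” data means here.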

In a recent interview, Viktor Mayer-Schoenberger summarized how our age-old model of the scientific method may be reshaped:

“So Big Data enables us not to test the hypothesis, but to let the data speak and tell us what hypothesis is best. And in that way it completely reshapes what we call the scientific method or — more generally speaking — how we understand and make sense of the world.”

The Big Data movement may evolve our traditional notions of what constitutes “theory”, and in turn improve our understanding of what the world is like. Yet it becomes even more important to remain skeptical of the data being presented. Skilled researchers must be aware of both GDELT’s internal limitations and its external implications for policymakers. John Beieler has also written a great piece on the theoretical implications of big data, in particular what it means for social science theory.

“…I think the social sciences, and science in general, is about asking interesting questions of the data that will often require more finesse than taking an “ANALYZE ALL THE DATA” approach. Thus, while datasets like GDELT provide new opportunities, they are not opportunities to relax and let the data do the talking. If anything, big data generating processes will require more work on the part of the researcher than previous data sources.”

My takeaway from John’s points is that data is not neutral in itself; we must become more self-reflective about which parts of the data we are using. But I am even more concerned with the external and practical uses of the data we have access to.

Take the example of aid transparency: “Big Data” is inherently geared towards the policy, academic, and donor elites who actually have the power to interpret these numbers and act on them, while the beneficiaries of aid (often the rural and technologically illiterate poor) are disadvantaged. During a recent roundtable at ISA2013, I heard a story about an African government that had asked an aid transparency organization to disclose the names and locations of civil society groups in order to find out where the aid was going. In this case, the government wanted to use the data to target the NGOs (which can be a threat to the state’s legitimacy in certain contexts) and put them in jail. The commentator went on to emphasize that no one really knows who is using open data, and there is no clear way to determine for what purpose.

In conclusion, information and “Big Data” are not apolitical, for information in itself is power. Power remains asymmetrically biased towards the actors who not only have access to big data and information, but also have the capacity to do things with it. Furthermore, big data cannot tell us what we should do with that information.

Regardless, I remain optimistic about the potential of big data to evolve our traditional ways of thinking. I agree with John and the authors of “Big Data” that ever more “messy” data will be useful for improving our theories. Yet I remind myself that theories of power and politics are still relevant, perhaps more than ever.

“Big data is a resource and a tool. It is meant to inform, rather than explain; it points us toward understanding, but it can still lead to misunderstanding, depending on how well or poorly it is wielded. And however dazzling we find the power of big data to be, we must never let its seductive glimmer blind us to its inherent imperfections.” (Viktor Mayer-Schoenberger and Kenneth Cukier, “Big Data: A Revolution That Will Transform How We Live, Work, and Think”, p. 197)