Highlighted Selections from:

The Parable of Google Flu: Traps in Big Data Analysis

DOI: 10.1126/science.1248506

Lazer, D. et al. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343.6176 (2014): 1203–1205. CrossRef. Web.

p.1: Research on whether search or social media can predict x has become commonplace (5–7) and is often put in sharp contrast with traditional methods and hypotheses. Although these studies have shown the value of these data, we are far from a place where they can supplant more traditional methods or theories (8). We explore two issues that contributed to Google Flu Trends (GFT)’s mistakes, big data hubris and algorithm dynamics, and offer lessons for moving forward in the big data age. -- Highlighted mar 23, 2014

p.1: “Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis -- Highlighted mar 23, 2014

p.2: searches for treatments for the flu and searches for information on differentiating the cold from the flu track closely with GFT’s errors (SM). This points to the possibility that the explanation for changes in relative search behavior is “blue team” dynamics—where the algorithm producing the data (and thus user utilization) has been modified by the service provider in accordance with their business model. -- Highlighted mar 23, 2014

p.2: In improving its service to customers, Google is also changing the data-generating process. Modifications to the search algorithm are presumably implemented so as to support Google’s business model—for example, in part, by providing users useful information quickly and, in part, to promote more advertising revenue. Recommended searches, usually based on what others have searched, will increase the relative magnitude of certain searches. Because GFT uses the relative prevalence of search terms in its model, improvements in the search algorithm can adversely affect GFT’s estimates. Oddly, GFT bakes in an assumption that relative search volume for certain terms is statically related to external events, but search behavior is not just exogenously determined, it is also endogenously cultivated by the service provider. -- Highlighted mar 23, 2014
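
Note on the highlight above: a toy sketch of how a model that treats relative search volume as statically related to flu activity can drift when the service provider changes search behavior. This is not GFT's actual model. The linear fit, the data, and the names `SHARE_PER_FLU_POINT` and `SUGGESTION_BOOST` are all hypothetical, chosen only to illustrate the mechanism the passage describes.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (no dependencies)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical training period: flu rate (percent of doctor visits) per week.
true_flu = [1.0, 2.0, 3.0, 4.0, 5.0]

# Assumption: during training, the flu-related share of all queries is
# proportional to the flu rate.
SHARE_PER_FLU_POINT = 0.002
train_share = [SHARE_PER_FLU_POINT * f for f in true_flu]

# The "static" model: learned once, then applied unchanged.
intercept, slope = fit_line(train_share, true_flu)


def estimate_flu(query_share):
    """Map the relative prevalence of flu queries to an estimated flu rate."""
    return intercept + slope * query_share


# Assumption: a later change to recommended searches raises the flu-related
# share of queries by 30% without any change in actual flu activity.
SUGGESTION_BOOST = 1.3
actual_flu = 3.0
share_before = SHARE_PER_FLU_POINT * actual_flu
share_after = share_before * SUGGESTION_BOOST

estimate_before = estimate_flu(share_before)
estimate_after = estimate_flu(share_after)

print(f"actual flu rate:           {actual_flu:.2f}")
print(f"estimate before change:    {estimate_before:.2f}")
print(f"estimate after the change: {estimate_after:.2f}")
```

In this sketch the flu rate never moves, yet the estimate rises after the assumed algorithm change, because the relationship the model learned between query share and flu activity no longer holds.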

p.2: Although it does not appear to be an issue in GFT, scholars should also be aware of the potential for “red team” attacks on the systems we monitor. Red team dynamics occur when research subjects (in this case Web searchers) attempt to manipulate the data-generating process to meet their own goals, such as economic or political gain. Twitter polling is a clear example of these tactics. Campaigns and companies, aware that news media are monitoring Twitter, have used numerous tactics to make sure their candidate or product is trending (23, 24). -- Highlighted mar 23, 2014

p.2: Similar use has been made of Twitter and Facebook to spread rumors about stock prices and markets. Ironically, the more successful we become at monitoring the behavior of people using these open sources of information, the more tempting it will be to manipulate those signals. -- Highlighted mar 23, 2014

p.3: Transparency and Replicability. Replication is a growing concern across the academy. The supporting materials for the GFT-related papers did not meet emerging community standards. Neither were core search terms identified nor larger search corpus provided. It is impossible for Google to make its full arsenal of data available to outsiders, nor would it be ethically acceptable, given privacy issues. However, there is no such constraint regarding the derivative, aggregated data. Even if one had access to all of Google’s data, it would be impossible to replicate the analyses of the original paper from the information provided regarding the analysis. -- Highlighted mar 23, 2014

p.3: What is at stake is twofold. First, science is a cumulative endeavor, and to stand on the shoulders of giants requires that scientists be able to continually assess work on which they are building (25). Second, accumulation of knowledge requires fuel in the form of data. There is a network of researchers waiting to improve the value of big data projects and to squeeze more actionable information out of these types of data. -- Highlighted mar 23, 2014

p.3: Study the Algorithm. Twitter, Facebook, Google, and the Internet more generally are constantly changing because of the actions of millions of engineers and consumers. Researchers need a better understanding of how these changes occur over time. -- Highlighted mar 23, 2014