Finally, someone has very publicly thrown cold water on the wild claims made for the potential of ‘big data’. I like the title: “Why Big Data is Not Truth.”
It seems like every week now, I hear or read about someone in the news, typically an engineer or a computer scientist but very rarely a social scientist, breathlessly extolling the potential of ‘big data’ to yield transformative insights into social phenomena or individual behavior. Almost inevitably this is illustrated with an utterly banal example of a finding, usually fit for nothing more than cocktail party conversation: perhaps people with small heads (as inferred from the sizes of the hats they buy) consume unusually large numbers of mangoes on Tuesdays and Thursdays. That is a made-up example, but to me it is representative of the sorts of trivial and atheoretical ‘findings’ that are too often hauled out in puff pieces about the golden world of opportunity offered by big data. The banality of these ‘findings’ illustrates the fundamental challenge that we face when we seek insight into underlying processes or mechanisms from observational data on people: describing a relationship is not the same as understanding or explaining it.
Correlation is not causality, and the problem doesn’t disappear no matter how much data we throw at it. Whether a dataset contains one thousand records with one hundred variables, or one trillion records with one million variables, if it is observational data collected ‘in the wild’ or via a survey, any association observed in it is still just an empirical finding, albeit a potentially important one, until it is replicated in different settings with different data and has a credible explanation. A larger dataset or more variables don’t magically compensate for the fact that the data is based on observation, as opposed to being generated by a controlled experiment with random assignment to treatment and control groups. If we’re lucky, there may be something in the data that can be thought of as an exogenous shock experienced by a random subset of the subjects, in which case differences between subjects who experienced the shock and those who didn’t may be interpreted as a genuine effect of the shock.
Lest anyone accuse me of being prejudiced against large datasets with many variables, let me be the first to say that some of my best friends are large datasets. Indeed, for the last twenty years, I have helped create large historical datasets, analyze them, and release them to the public in the hope that others will be able to find applications for them that I could never imagine. We have created datasets that follow people who lived in China in the eighteenth and nineteenth centuries from birth to death, recording at regular intervals their social and economic status, their household and community context, and their demographic behavior and socioeconomic attainment. I will probably continue helping to compile and analyze such datasets for the rest of my career, because that is how I roll, and because no one has shown up at my doorstep with a suitcase full of cash that would be mine if only I would join them on some sort of outlandish caper like you would expect in a Ross Thomas novel.
It is this experience with large datasets that has made me wary of the more extravagant claims for big data. My collaborators and I have learned a great deal about life in the past in China, and about demographic behavior in general, from careful analysis of these data. I want to continue compiling, analyzing, and releasing these and other data. I am sure that others who work with the data we have publicly released will make even more spectacular and important discoveries, not just about China, but about human populations more generally.
All the effort we have expended in the construction and analysis of these large datasets has made me painfully aware of what it is realistic to hope for. We can describe important empirical regularities in great detail. Many of these are of considerable interest in their own right, even if we can only suggest possible explanations for them, because they illuminate life in another time. They are worth publishing in the same way that some fascinating but inexplicable astronomical phenomenon is worth publishing.
For some findings, an explanation is fairly straightforward and very credible. We find that married women who had not yet borne children for their husbands, or had borne only daughters, had higher death rates than women who had borne sons. This makes sense, since in the past in China, the primary responsibility of married women was to bear and then raise an heir for their husband’s family, and until they had at least borne a son, they were probably on a sort of probationary status, with limited access to family resources. Once they had borne a son, they were probably fully enfranchised members of their husband’s household. And we find that death rates rose and birth rates fell when grain prices were high, presumably because of economic adversity.
If we’re lucky, we find something that may have some relevance for the contemporary era. For example, we found that babies born soon after their elder siblings (within 24 months) had elevated death rates in old age. We speculated that this reflected the effects of maternal depletion on the newborn. Linked to contemporary findings on the apparently adverse short-term consequences of a short preceding birth interval, this might tell us something important about human physiology.
But we also find perplexing results that we can’t explain, even though they are robust to alternative specifications and persist no matter what subset of the data we look at. We find that high status males actually had higher death rates than other males. We don’t know why, and can only speculate. Perhaps their status and wealth allowed them to make what our son’s elementary school refers to as ‘bad choices’: maybe they squandered their money on debauchery in Shenyang (at the time, Fengtian) and died early as a result of liver failure or tertiary syphilis. We just don’t know.
More relevant to my rant, we periodically observe statistically significant associations, some of them quite fascinating, that disappear when we expand the dataset, use a different subset of the data, or make slight modifications to our model. If I had a dime for every association like this that we have come across, I’d be a rich man. I suppose that if a result were interesting enough, we could come up with some post hoc rationalization of why it only appears in a specific subset of the dataset when the model is specified in a very particular way, and try to publish it, but that sort of thing makes us queasy, because we are aware that if you measure enough associations, the phenomenon of mass significance will lead at least some of them to appear significant even if they aren’t. Again, we feel more comfortable making a claim if a result appears under multiple alternative specifications of the model, and across different subsets of the dataset.
I’m happy to continue plugging away with this sort of analysis indefinitely because I feel like an astronomer, except that instead of peering through a telescope at distant stars or galaxies and then trying to work backwards to develop an explanation for the regularities I observe, I am observing people in the past who I will never meet (unless I can buy a Tardis on Craigslist from a dissipated Time Lord whose alimony, child support, gambling debts and coke habit have made him desperate for money) and trying to discern and provide explanations for the regularities that I observe. Some of the explanations or interpretations I come up with may be overturned as people uncover even better data or apply better methods, but I am pleased to have made some incremental contribution to our understanding of life in the past.
If the starry-eyed proselytizers of the salvation to be delivered to us by the collection and analysis of ‘big data’ were willing to put down their Kool-Aid for a moment and limit themselves to the more cautious prediction that large quantities of data will allow us to observe empirical regularities and every once in a while come up with some genuine insight about the determinants of specific behaviors, I would be happy. But too often, ‘big data’ proselytizers seem to imagine a future like the one in Isaac Asimov’s Foundation trilogy, which I enjoyed so much in middle school, where simply by sifting through enough data, it is possible to predict not only individual behavior, but social change, decades or centuries in advance. To put it mildly, they’re getting somewhat ahead of the field in their optimism about the possibility of going from observation of individuals to predictions about their behavior.
To me, the biggest challenge to the use of ‘big data’ is some version of the phenomenon of ‘mass significance,’ which I referred to earlier in the context of our own experience. If you have hundreds or thousands of variables that in reality have nothing to do with each other, and in fact are all series of random numbers generated by die rolls or some other process, and you calculate pairwise correlations between them, inevitably by luck of the draw some percentage of them will appear to have an association that is statistically significant at some threshold. But if you collect the same data again in another time period, a completely different set of variables may be associated with each other. In other words, an association that appears statistically significant in data collected in one time period may vanish entirely in a second time period. Companies that find that people whose last names end in Y, or who like to fill their cars with gas on Wednesdays, are especially receptive to discounts on artichokes in one time period may be disappointed in the next time period when they offer special deals on artichokes to such people.
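To make the point concrete, here is a minimal simulation sketch in Python (the sample sizes are arbitrary, and nothing here comes from any real dataset) showing how pure noise generates ‘significant’ correlations:

```python
# A minimal sketch of 'mass significance': correlate many columns of pure
# noise and count how many pairs clear the conventional p < 0.05 threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
n_subjects, n_vars = 1000, 100                 # hypothetical sizes
data = rng.normal(size=(n_subjects, n_vars))   # 100 columns of random noise

significant, total = 0, 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        _, p = stats.pearsonr(data[:, i], data[:, j])
        total += 1
        if p < 0.05:
            significant += 1

# Expect roughly 5% of the 4,950 pairs (about 250) to look 'significant',
# even though every column is random by construction.
print(f"{significant} of {total} pairs significant at p < 0.05")
```

Rerun the simulation with a new seed, standing in for a second round of data collection, and a different set of pairs clears the threshold.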
Another problem, well known from previous analyses of observational data, is the possibility that observed relationships are not causal, but reflect complex influences of other variables that we don’t observe. These might be variables that affect the chances of particular types of people being observed in our data, or variables that affect the values of the variables that we do observe. Whether spurious relationships observed in data are the result of selection biases or the influence of an unobserved variable on the variables that we do observe, any relationship we observe is unlikely to be causal, and changing behavior or making policy based on it may be premature, to say the least. And in spite of the claims made for various approaches, I don’t think there is any statistical voodoo that fixes the situation and allows anyone to make solid claims of causality from purely observational data, except in very limited situations where at least one of the variables appears to be genuinely exogenous, in which case instrumental variables or other approaches may offer some insight.
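To see how easily an unobserved variable can manufacture an association, here is a toy simulation, a Python sketch with invented numbers rather than any real data, in which a hidden variable z drives both x and y:

```python
# A toy sketch of a spurious association: an unobserved variable z drives
# both x and y, so x and y are strongly correlated even though neither has
# any effect on the other.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n = 100_000
z = rng.normal(size=n)               # the unobserved common cause
x = 2.0 * z + rng.normal(size=n)     # x depends only on z
y = 3.0 * z + rng.normal(size=n)     # y depends only on z

r_raw, _ = stats.pearsonr(x, y)
print(f"raw correlation of x and y: {r_raw:.2f}")        # large

# Residualize both variables on z; the partial correlation of x and y
# given z is approximately zero.
resid_x = x - np.polyval(np.polyfit(z, x, 1), z)
resid_y = y - np.polyval(np.polyfit(z, y, 1), z)
r_partial, _ = stats.pearsonr(resid_x, resid_y)
print(f"correlation after removing z: {r_partial:.2f}")  # near zero
```

Of course, in real observational data we cannot residualize on z, because we never observed it; that is the whole problem.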
This would all be fine if the goal of sifting through large amounts of data and identifying regularities was solely to develop a better understanding of the world, in the same way that astronomers sift through enormous amounts of data to develop a steadily better understanding of the universe. There would not be any harm if all we wanted to do was observe empirical regularities, hypothesize about relationships, and then wait to see whether the next round of data collection confirmed our hypotheses. I love doing that with historical data, since if I am wrong, no one is going to die because of some misguided policy that I propose, because everyone I study is already dead. And of course I love it when the same thing is done with contemporary data. I don’t work that much with contemporary data myself, but others do, and we learn all sorts of remarkable things.
The scarier and probably more likely scenario, however, is that analysts will attempt to translate empirical regularities observed in ‘big data’ into government policy, company strategy, or individual behavior change without deep consideration of the possibility that the observed relationship is spurious, and perhaps can’t even be explained. At best, this will lead to wasted effort, because the relationship of concern was spurious to begin with, and changing policy or changing behavior will have no effect. In a worst case scenario, however, it could be destructive.
We already have many examples of policies, or at least recommendations, based solely on observational data that had downright pernicious effects. Hormone replacement therapy comes to mind. Large observational studies based on what at the time was ‘big data’ led to the conclusion that hormone replacement therapy would reduce the risk of breast cancer. Eventually, better designed studies revealed that hormone replacement therapy didn’t reduce the risk of breast cancer, and probably increased it. That is but one example. The health and public policy literature is littered with other examples of recommendations for dietary or other lifestyle changes that were made based on surveys or other observational studies, but were not borne out in later, more rigorous studies.
I am terrified that as we move forward into an era of ‘big data’, results from the correlation of millions of variables with each other will be reported uncritically, and we will be subjected to an endless stream of breathless reports based on observed but ultimately spurious relationships: perhaps that people who eat mangoes on Tuesdays are more likely to be struck by lightning, or that people whose last names contain three or more vowels are more likely to buy yellow cars. If you think that is paranoia, just consider how many studies based on observational data are already published every week suggesting that some slight diet modification raises or lowers the chances of some obscure cancer.
What is to be done?
I’m all in favor of continuing to collect and analyze data, including ‘big data’. Every once in a while, a relationship may emerge that really matters. And in many cases, empirical regularities are useful and interesting to observe even if we can’t explain them. Traffic planners may find it very useful to learn that a certain street is especially likely to be clogged with traffic on days of the month that are prime numbers, even if they have no idea why. Companies may find it very useful to know about patterns in customer behavior, even ones they can’t explain.
That said, we need to retain some healthy skepticism about the implications of associations observed in the analysis of ‘big data.’ Basically, we need to accept that ‘big data’ is not a magic bullet that makes the more fundamental issues of inference vanish. Based on the results of past efforts by social scientists, I’m doubtful that having orders of magnitude more data will suddenly allow us to predict individual behavior with great specificity, or predict dramatic social changes. Life will probably remain stochastic at both the individual and aggregate levels. We may develop models that are useful for predicting the frequency of particular types of behavior in a sort of actuarial fashion, where we predict that on average X percent of people with specified characteristics will do Y over some time period, but I doubt that we will ever have models that predict that individual i, who has specified characteristics, will do something on a specified date. In other words, we may have lots of data that is useful in actuarial calculations about average outcomes for aggregates of people, but I doubt we’ll get to the point where we can reliably predict the behavior of specific individuals in the short term.
The nightmare scenario is that our current situation, in which we already see almost weekly news reports based on dubious, never-replicated analyses suggesting that doing X increases our chances of suffering Y, turns into a worse one in which we face a daily or hourly stream of results claiming that individuals who do X raise their risk of experiencing Y, or that companies, cities, counties, or states that implement policy X will likely experience outcome Y. Data mining may lead to a spasmodic, panicked, ever-changing set of recommendations to individuals, companies, and governments that eventually produces cynicism, and perhaps a backlash in which nobody believes anything based on empirical observation.
At the very least, this suggests a need for a very high bar for claiming that observed associations are suggestive of causal relationships that in turn lead to policy prescriptions or recommendations for changes in behavior. Ideally, associations will need to be observed in multiple, independent datasets, and will need to be accompanied by some plausible account of the underlying mechanism or process generating the relationship. In an ideal world, empirical observations of potentially important relationships would be followed up by more rigorous analyses, like the ones much in vogue among economists, that would try to establish causality, or at least provide some evidence for it.
This isn’t to say that we need to fetishize causality and turn our noses up at any analysis that doesn’t rely on instrumental variables, a natural experiment, or some sort of randomized field experiment. Rather, the prescriptions for behavior or policy that we develop based on observations from big data have to be calibrated according to the import of the outcome, the plausibility of the proposed underlying mechanism or process, and the cost of the proposed change in behavior or policy. If analysis of ‘big data’ suggests that people who avoid wearing plaid on Thursdays have a lower risk of being bitten by rabid squirrels, it wouldn’t cost much to avoid wearing plaid on Thursdays for a few months until the result is confirmed. But if analysis of ‘big data’ suggests that carrying around bricks of depleted uranium substantially reduces our chances of being attacked by seagulls, we might want to hold off doing anything pending some careful thought and further investigation.
Along these lines, it would be a good idea for the engineers and computer scientists who are plunging ahead with the collection and analysis of ‘big data’ to learn from the experience of social scientists who have been grappling with the limitations of observational data for decades. As Bismarck said, “Fools learn from experience. I prefer to learn from the experience of others.” Those who are now collecting and analyzing ‘big data’ should learn from the experience of social scientists, rather than reinvent the wheel and repeat the same mistakes social scientists have made over the last few decades. The most important lesson is perhaps to be humble, and to be aware of the limitations of observational data. Perhaps we should invite computer scientists and engineers working with social data into our research methods classes, not to teach them new statistical techniques, but to teach them the fundamentals of study design: the difference between experimental and observational designs, the circumstances under which an inference of causality may be justified, and the dangers posed by selection processes and omitted variables.
Conversely, as social scientists, we need to incorporate training in the management of large and complex datasets into the undergraduate and graduate social science curriculum. Right now, our quantitative training typically provides students with predigested datasets that don’t require any manipulation, and then teaches them a variety of flavors of regression, some very exotic, that they can use to estimate models on those datasets. We almost never offer systematic training to students in how to manipulate those datasets to create new variables. And we almost never offer any systematic training in how to take ‘found’ data (perhaps the output from a web server log, or administrative data) and suck it into STATA or some other program, and organize it.
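As a purely hypothetical illustration of the kind of wrangling I have in mind (the log lines and field names below are invented, not drawn from any actual course or dataset), here is a short Python sketch that turns raw web server log lines into a tidy table:

```python
# A hypothetical sketch of 'found data' wrangling: parse Apache-style
# common-log-format lines into a tidy table suitable for analysis.
import re
import pandas as pd

raw_lines = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.7 - - [10/Oct/2023:13:56:02 -0700] "POST /search HTTP/1.0" 404 512',
]

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d+) (?P<bytes>\d+)'
)

# Keep only lines that match the expected format.
records = [m.groupdict() for line in raw_lines if (m := pattern.match(line))]
df = pd.DataFrame(records)
df["status"] = df["status"].astype(int)
df["bytes"] = df["bytes"].astype(int)
df["timestamp"] = pd.to_datetime(df["timestamp"],
                                 format="%d/%b/%Y:%H:%M:%S %z")
print(df)
```

None of this is statistically deep, which is exactly the point: it is the unglamorous step that our curriculum skips.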
As a result, we have students who know how to take a dataset that someone hands them and run a five-stage least squares regression with a cubic spline for age, income instrumented by the level of solar background radiation, and a Heckman sampling correction. But hand them a more complex longitudinal dataset like the CMGPD, which may require some simple manipulation to create variables measuring household or community characteristics for inclusion in a discrete-time event history analysis via a simple logistic regression, and they’re stuck. In the years I spent teaching regression, it was clear to me that for many students, the biggest problem was not choosing variables, estimating a regression, and interpreting the results, but preparing the data for estimation.
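To give a flavor of that preparation step, here is a minimal sketch using entirely synthetic data and hypothetical variable names (not the actual CMGPD layout): it expands individual records into person-period rows and then fits the discrete-time hazard as a simple logistic regression:

```python
# A minimal sketch of discrete-time event history analysis on synthetic
# data: build a person-period file, then fit the hazard with a logit.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=1)
rows = []
for pid in range(500):
    household_size = int(rng.integers(2, 12))   # invented covariate
    hazard = 0.05 + 0.01 * household_size       # made-up per-period risk
    for t in range(1, 11):                      # follow up to 10 periods
        event = int(rng.random() < hazard)
        rows.append({"person_id": pid, "period": t,
                     "household_size": household_size, "event": event})
        if event:                               # stop observing after the event
            break

person_periods = pd.DataFrame(rows)

# The discrete-time hazard model is just a logistic regression on the
# person-period file.
model = smf.logit("event ~ household_size + period",
                  data=person_periods).fit()
print(model.params)
```

The regression itself is the easy part; the loop that builds the person-period file is precisely where untrained students get stuck.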
There are already many excellent social scientists who create and work with absolutely ginormous datasets. I would speculate, though, that when it comes to techniques for managing those large and complex datasets, most of them are self-taught, collaborate with computer scientists who have expertise in database management, or came in from other fields. But we can’t rely on graduate students or faculty with the skills for manipulating large datasets to keep falling from the sky the way they have in the past. We have to produce them systematically.
Now to put my tinfoil hat on: another serious concern I have about ‘big data’ is that it may not turn out to be that useful in terms of improving our understanding of the processes and mechanisms by which individual context and characteristics affect individual behavior or outcomes, but will likely prove to be a goldmine for post hoc extraction of information about individuals’ past behavior that could be used to embarrass or blackmail them. In other words, it may turn out that big data leads to little in the way of important, fundamental insights about human behavior, but will facilitate the creation of individual dossiers full of tidbits that can be hauled out to embarrass people whenever they seek political office, blow the whistle on their employer, or who knows what. Various totalitarian states collected enormous amounts of information on their citizens via surveillance and the reports of informers. I’m not sure that the data ever allowed any of those states to predict individual behavior or social change. If the data could have been exploited to make accurate predictions about individuals or society itself, some of those totalitarian states might still be around. What we learned, however, is that the information was less useful for prediction than for control.
Note: I have been going back and modifying this as I have had more thoughts, or received feedback. An exchange with Mark Hayward was particularly inspiring because it drew attention to the need for social scientists to develop a response.