An avalanche of newly accessible datasets – popularly called “Big Data” – is transforming research questions and processes across the social sciences. Dialogo sat down with Professor Howard C. Nusbaum and Professor James A. Evans to discuss the impact and opportunities surrounding these changes.
Professor Howard C. Nusbaum is the Stella M. Rowley Professor of Psychology and director of the Center for Practical Wisdom at the University of Chicago. He is internationally recognized for his multi-disciplinary studies of the nature of wisdom and the cognitive and neural mechanisms that mediate communication and thinking. Nusbaum’s past research has investigated the effects of sleep on learning, adaptive processes in language learning, and the neural mechanisms of speech communication. His current research investigates how experience can increase wisdom and produce changes in insight and economic decisions, and examines the role of sleep in cognitive creativity and abstraction.
Professor Evans is a professor in the Department of Sociology, director of Knowledge Lab at the University of Chicago, and faculty director of the Master's program in Computational Social Sciences. In his research, Evans explores how social and technical institutions shape knowledge (science, scholarship, law, news, religion) and how these understandings reshape the social and technical world. He has studied how industry collaboration shapes the ethos, secrecy and organization of academic science; the web of individuals and institutions that produce innovations; and markets for ideas and their creators, as well as the impact of the Internet on knowledge in society.
Dialogo: What does big data mean in the realm of the social sciences?
James Evans: Big data can mean many different things. The classic triptych is high volume, high variety, and high velocity. In the social sciences especially, it's increasingly high volume and high variety. Each does a different kind of thing. Large-scale data comes off of highly instrumented social processes. For example, our cell phones and all of the transactions that we engage in online and in many other contexts are instrumented by an ensemble of sensors. Those sensors create large streams of data that allow us to ask and answer questions about social processes at higher levels of resolution than we could have conceived of before, and with much larger-scale data over many different kinds of interactions and time periods, et cetera.
The variety part means that we can also explore the relationship between different kinds of social action because they exist in this common format in a way that was previously only conceivable in contexts like ethnography, where people were looking at multiple modes but in very small scales.
Overall, it's a game changer in social sciences.
Howard Nusbaum: For a long time, social scientists have used survey instruments like the General Social Survey, which is a very structured set of questions that people answer, and tracked that data over a long period of time. We used to consider that big data, but now there are projects like the Kavli HUMAN Project at New York University, where they intend to survey 10,000 people. To extend this into a place where there are 10,000 people instrumented across the boroughs of New York City gives access to multidimensional data in a way that we've never had before. One can think about it as the Hubble telescope of social sciences, moving the social sciences into the realm of something where we have evidence about people's movements, people's choices, people's feelings, interactions between individuals.
Evans: Recently, we published a study that used all of the Amazon.com book purchase data, along with data from Barnes and Noble and other online booksellers, to identify the association between preferences for political books, on the red or blue side, and the consumption of science and other literature. That's a transaction trace, but it also clearly reveals insights about the way in which people who hold or consume information about a certain ideology also consume other kinds of things.
We are also using eye-tracking data of a variety of types, which again, increasingly is able to provide really rich interaction signals. We're able to instrument in ways that before were specific to like one or two labs. Now, you can run a virtual laboratory of 10,000 or 15,000 or 100,000 people and get detailed interaction traces that capture arousal and attention, and other things.
In another study, we took data for tens of thousands of publications related to gene/drug interactions in the literature and aligned them with data from a high-throughput experiment on gene/drug interactions that replicated about 1.7 million of those interactions. We used the trace of collaboration and a whole host of variables that we extracted from the original papers to predict, in some cases with enormous success, the degree to which different kinds of communities produced knowledge that was more or less replicable in the future. This would have been impossible without the ability to perform high-throughput experiments, on the one hand, or use computational tools to extract information en masse from publications, on the other.
In short, there is data that we previously didn't think of as data, like full text, government documents, and user-generated images and videos, from which we can pull signals that are, in some cases, enormously predictive.
Nusbaum: Finding signals that were heretofore unused or hidden or latent is interesting. That standard model of a meta-analysis, which you're alluding to as an upgraded approach, the standard modeling of published research in psychology and other fields, is taking studies -- specifically the summary statistics recorded in those studies -- and analyzing them for consistency across conditions reported in the studies for those statistics. You'd say, 'Oh look, nine out of ten studies, or 100 out of 150 studies show a certain kind of pattern of data consistent with the conclusions,' so you have this sense of reproducibility. Now there are new methods of using data in publications that can lead to new insights.
For example, take functional magnetic resonance imaging (fMRI) papers. Those publications have data tables in them, and the data tables have X-Y-Z coordinates corresponding to neural activity in specific spots in the brain. Instead of just taking summary statistics, an approach called NeuroSynth is used to recreate an idealized version of the data from the data tables in each of these studies, generating a new synthetic data set at a much finer-grained resolution than the old approach to meta-analysis. This actually lets you do new experiments on data that has been published. This is a way of doing new studies that are a type of synthetic research.
Dialogo: Are conclusions in research stronger because of the volume of data that is available?
Evans: The answer to this question has to be 'yes and no,' right? Because the 'yes' acknowledges that, okay, we're able to access data from new places, and at new scales. And the 'no' highlights that digital data is data from the wild, so to speak… data from transactions or data from clicks online, or from online activity, or dating sites, or wherever, has this deep problem of algorithmic confounding. You have data on choices (e.g., “clicks”), but those choices were given to you because it was predicted that they would most appeal to you, and so as researchers we don’t know what part of online activity is a result of people’s preferences and what part is a result of the “smart” algorithms that were used to predict them. As a result, there's way more data on these huge global platforms, but the platforms are smart and that smartness shapes the results of the experiment that you're performing every time you go online and search for things. It creates enormous opportunities and enormous challenges.
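The algorithmic confounding Evans describes can be illustrated with a small sketch. Everything here is hypothetical and invented for illustration: the four items, their click probabilities, the platform's appeal scores, and the helper names are not drawn from any study or platform. The sketch assumes a platform that shows each user only the two items it predicts to be most appealing, and compares the click shares a researcher would see in the platform's logs with the shares that would appear if every item were shown.

```python
# Hypothetical: probability that a user clicks each item *if it is shown*.
true_click_prob = {"A": 0.30, "B": 0.28, "C": 0.26, "D": 0.24}

# Hypothetical: the platform's own prediction of how appealing each item is.
predicted_appeal = {"A": 0.9, "B": 0.2, "C": 0.8, "D": 0.1}


def shown_items(predicted, k):
    """Return the k items the platform ranks highest and therefore displays."""
    return sorted(predicted, key=predicted.get, reverse=True)[:k]


def expected_click_share(true_prob, shown):
    """Expected share of all clicks per item, given which items are shown."""
    expected = {
        item: (prob if item in shown else 0.0)
        for item, prob in true_prob.items()
    }
    total = sum(expected.values())
    return {item: clicks / total for item, clicks in expected.items()}


shown = shown_items(predicted_appeal, k=2)

# What the click log reveals when the algorithm filters exposure.
observed_share = expected_click_share(true_click_prob, shown)

# What the log would reveal if every item were shown to everyone.
unfiltered_share = expected_click_share(true_click_prob, list(true_click_prob))

for item in true_click_prob:
    print(item, round(observed_share[item], 3), round(unfiltered_share[item], 3))
```

In this toy setup, item B receives no clicks in the filtered log even though users would click it almost as often as item A if they saw it. A researcher who reads the log as a direct measure of preference would mistake the platform's ranking for what people want.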
Nusbaum: Every method, regardless of where it comes from, has its pluses and minuses. In the past, social psychologists, sociologists, and political scientists would go to NORC (a social sciences research organization at the University of Chicago) and collect stratified samples of data from populations according to a certain kind of model. Cognitive psychologists like me would run 10, 20, maybe 30 people if we were lucky, in our laboratories and collect data. Then Amazon came along with Mechanical Turk, and researchers started running studies online. We can ramp up a study now from 10 people in the lab to 1500 people (online) basically in a week.
And those people are sitting at home, perhaps watching TV while they're doing a study. There could be kids running around and dogs barking. If you're doing an auditory study, the quality of the headphones differs. We have to insert new kinds of quality control checks to see if people are actually engaged in a task. We have to collect demographics in a way we didn't before. We have to find ways to collect online measures over the Internet. Yet, at the same time, we can see that we generally get very similar results from 1500 people in the field to 10 people in our laboratory. Because of that, it gives us a methodological reach in new directions that we didn't have before.
At the same time, by going out on the web, we can now collect data from a wide variety of people. For example, there are supposed to be roughly one in 10,000 people who have absolute pitch, the ability to name the note when you hear a note played. As it turns out, we can find them all over the place, and if you set up a website and do testing for absolute pitch, we can start to bring them in to the site for testing. We can now get people with different kinds of backgrounds interested in doing our studies from all over the world. We never could have done that before.
That's not big data, per se, but it's a kind of reach into a data space that we didn't have before.
Evans: It's big sampling.
Nusbaum: It's big sampling, that's right. One of the things that is interesting about big data is we're now trying to think about data in different ways. Scientists in social sciences who perhaps never thought about those issues are confronting those kinds of forces as well, grappling with how to think about the kind of person who has produced these data. How do you think about the framework and situation under which data was collected? What were the intents of the researchers compared to the participants?
Dialogo: What are some of the other challenges that come out of big data?
Evans: The institutions with the greatest sense or reach into human activity are not public researchers. They're private companies like Facebook, and LinkedIn, and Google. They have more touches of more individuals than anyone else, including any government agency.
That creates a couple of different kinds of challenges. One is that there's a hierarchy of access: select individuals who have select relationships with important people at these companies get in, which creates a kind of random access to this data for the social sciences. A second associated challenge is that the government has decided to invest less in social science data, which puts at risk the possibility that more and more of the science that emerges from these social data streams becomes private rather than public science.
Nusbaum: There are also problems of meaning. For example, suppose you want to do a study that looks at the neighborhood safety of older people in different income brackets. How do you go about that? There are different ways of deciding what the meaning of safety is and how that translates into existing data or collecting new data, such as converting street view images from different addresses into visual measures of local safety.
The other problem that the data scientists talk about is sustainability. As more data is collected, it piles up and the set of data gets bigger and bigger. How do you organize the data? How do you organize and allow access to the data, maintaining privacy and security? How do you maintain reproducibility as opposed to replicability such that the same data can be processed by the same kinds of tools and get the same result or the same conceptual kind of tools even as software develops over time?
Evans: Reproducibility and replicability have become deeper issues on platforms that are constantly changing. That's the polar opposite of something like a NORC survey, which has been the same for 50 years. The technology for filtering signal from noise is itself changing dramatically. I completely agree with Howard that often data science teams are just looking for variation and they rediscover things over and over again that have maybe been discovered 50 or 100 years before in small-scale data.
One of the biggest challenges for the new kinds of insight that come from big data is that as social scientists we have a taste for certain kinds of questions and we like answers of a certain resolution that conform to a certain kind of story, in the same way that a blockbuster movie has to be between an hour and a half and two and a half hours. You can't have a five-hour movie or a half-hour movie. It’s a problem both to digest those new studies and to take them seriously, but it's also an opportunity, because it holds the possibility of expanding the collective imagination of the social sciences.
Nusbaum: Early on, the Journal of Cognitive Neuroscience required that every time you published a paper, you had to put your data in a central repository they maintained. The problem is data came from different scanners, different instruments with different properties, with different work flows, with different kinds of data structures. Nobody could easily make use of it.
Now there is the opportunity for somebody who doesn't know anything about, for example, Alzheimer's disease to basically analyze brain data in a unique way that psychologists or neuroscientists haven't thought about, because we had preconceived notions of the disease. A computational approach that looks for data patterns can come up with some new information that can provide fundamental insights. Getting more of that behavioral and neuroscience data into common formats that are publicly accessible, with common data modeling and analysis tools, is critical for making breakthroughs in a lot of areas.
Evans: But many of these methods are fundamentally not “statistical” yet. They're so high dimensional. There's no meaningful articulation of a confidence interval or anything like that because we have no precise sense of the search space that methods like multi-layer neural nets explore to identify their answers. And it remains unclear how these kinds of issues are going to shake out in the social sciences.
Dialogo: How is the availability of big data impacting methodologies?
Evans: We see it even in the traditional survey. Today – and this has really been pushed and piloted by social media and information companies like Facebook and Google – there has been this development of what I'll call active and interactive learning surveys, where you're predicting the answer to the next question that a person might be posed.
Rather than asking a thousand questions, you might ask six personally sequenced questions to maximize that information, which means you have more space and time to ask about a whole host of other things. That's a big shift, using these models and using prediction in the context of performing surveys, and I would say observational data is similar.
Nusbaum: That’s an extension of an old process that's taken place in other areas, often called adaptive testing. The GRE uses this kind of adaptive testing. On the one hand, it's efficient and often effective, and predicts performance in other circumstances. On the other hand, in other cases, it can miss out on some things. We know from work by people like (U.S.C. Professor) Norbert Schwarz that the context of the questions matters as much as anything. If the context changes by adaptive testing, then there may be things changing that we're not aware of. One question primes you in one way. For example, if you say, 'How good is your life?' and then you say, 'How good is your marriage?' that gets one kind of ordering, versus if you say, 'How good is your marriage?' and then, 'How good is your life?' It gives another kind of pattern of response. So thinking about those kinds of things is going to nuance the way we approach these things. That will be a developmental process, I think, over time.
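The idea of sequencing questions to maximize information can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method used by any survey platform or by the GRE: it assumes each respondent belongs to one of a few latent types, that each yes/no question has a known probability of a "yes" for each type, and that the next question is the one with the largest expected reduction in uncertainty (entropy) about the type. The type names, question names, and probabilities are all hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy, in bits, of a distribution over respondent types."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def posterior(prior, p_yes, answer):
    """Bayes update of the belief over types after a yes/no answer."""
    unnormalized = {
        t: prior[t] * (p_yes[t] if answer else 1 - p_yes[t]) for t in prior
    }
    total = sum(unnormalized.values())
    return {t: value / total for t, value in unnormalized.items()}


def expected_information_gain(prior, p_yes):
    """Expected drop in entropy from asking a question with the given p_yes."""
    p_answer_yes = sum(prior[t] * p_yes[t] for t in prior)
    gain = entropy(prior)
    for answer, p_answer in ((True, p_answer_yes), (False, 1 - p_answer_yes)):
        if p_answer > 0:  # skip answers that cannot occur
            gain -= p_answer * entropy(posterior(prior, p_yes, answer))
    return gain


def next_question(prior, questions):
    """Pick the question expected to be most informative right now."""
    return max(
        questions, key=lambda q: expected_information_gain(prior, questions[q])
    )


# Hypothetical belief about a respondent and two candidate questions.
prior = {"type_1": 0.5, "type_2": 0.5}
questions = {
    "q_informative": {"type_1": 0.9, "type_2": 0.1},
    "q_uninformative": {"type_1": 0.5, "type_2": 0.5},
}

print(next_question(prior, questions))
```

With these numbers the sketch picks `q_informative`, because both types answer `q_uninformative` the same way and it therefore reveals nothing. The sketch treats each answer as independent of the questions that came before it, so it does not capture the question-order effects Nusbaum raises.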
Evans: This highlights that more than just forms of data gathering are changing. There was a recent paper by Tom Griffiths then at Berkeley, now Princeton, which talked about experimental design as algorithm design. The idea is when you're using these algorithms to optimally collect data then, all of a sudden, the whole idea of collecting data and then analyzing it, with a strong wall between the two processes, doesn't make sense anymore, right?
Nusbaum: We don't yet have the precision of understanding our instruments in the way that physicists often do because they're building them from scratch. Our instruments are much more context dependent than their instruments. The algorithms we use for designing our studies and the algorithms that we use for analyzing our studies are slightly mismatched.
Evans: By building these models, you figure out what it is that you know firmly, and what you know only loosely.
Dialogo: As big data continues to become bigger, what changes do you anticipate? What do you think will happen in future research?
Nusbaum: From my perspective, we're seeing a convergence of different kinds of research methods. James and I were part of a common National Science Foundation research project. The conversations that we had suggested a common approach in conceptualization and different kinds of data that we can bring to bear on the same question.
One of the things we're seeing in the social sciences is sociologists are taking blood spots. Political scientists are taking buccal swabs. Economists are doing fMRI and using methods from neuroscience. We're getting biological data. We're getting behavioral data. We're getting location and movement data. We're getting choice data. We're getting all kinds of data, and it doesn't matter what discipline you are coming from.
Finding the causal links between the individual and the group by looking at how the individual's choices and behavior are influenced by the invisible forces of society is fundamental whether we're talking about linguistics or psychology. Social science research is moving in a direction where we can start to address that, because we have data with the grain of the individual and data with the grain of the group. We can look at the big forces, and we can look at the individual in relationship to them. That’s one place in which we're going to have traction where we have not had good traction in the past.
One thing that relates to this notion of multiple levels of resolution that are studied by different fields coming together is a shift from a focus on establishing necessary causal conditions to establishing sufficient conditions. This is a distinction that we talk a lot about in explanations in the social sciences. Is the factor that you're observing necessary for the operation of a mechanism? Is it necessary for the outcome that you observe? Or, is it sufficient? With the integration of different perspectives, and with large-scale data, there is an increasing taste for sufficient explanations that hold in different contexts and situations.
That's driven almost all activity in the quantitative social sciences over the last 100 years; find something statistically significant, but typically ... it's really small, not really substantial. Increasingly by integrating all these levels of analysis, we're able to explain sometimes 90, 95, 98, 99 percent of the activity of an individual or of a group in a particular setting.
It changes a lot, right? This can make social science more potentially applied, because now we are talking about effects that are reliable but maybe not substantial, and we're talking about reproducing phenomena. This shift in stance will provide more opportunities for us to quickly send insights out into the world of systems that generate values for people.
Evans: Our theories become shaped in a different way. That moves us closer to physics in certain ways. As theories about various phenomena become more complex, seeing the relationships among those kinds of structures becomes much more straightforward. We have this problem in genetics. People in genetics used to have these simple causal theories, 'This gene produces this outcome.' Now there are statistical theories, 'This pattern of genes gives this population.' There's no causal theory there. It's only a statistical association. They don't go from genes to proteins to neurons to behavior, or structure. They're in search of the same kind of problem and solutions that we are. They have very complex, highly dimensional problems with data, big data that relate these things. They don't know how to connect them. There will likely be forms of theoretical solution that may be common amongst different fields now that didn't used to occur because those disciplines weren’t viewed in common. I think that's going to be a huge change.
Dialogo: Closing thoughts?
Evans: There are a number of different potential worlds that could come out of this Big Data moment. In one world, I could imagine that the computational social sciences and behavioral sciences move so quickly and aggressively, and adopt or embrace other epistemological levels of analysis and styles that they separate, and you are left with psychology, and sociology, and political science on the one hand, and you separately have a computational social science that has speciated from those things. On the other hand, you could have a world in which computational approaches just become the way of doing good sociology or psychology or economics, which brings all of those fields a little closer.
Questions also remain about whether the biggest tranches of data are going to be locked up in such a way that the science that comes out of them is really also locked up in databases and services, and can only be used by the proprietary producers of those things. Or, are they going to become part of a broader interchange, and feed the individual social sciences that gave rise to them? I don't know. I think none of us knows.
Nusbaum: This is a particular challenge we see right now. 23andMe has collected many people's genetic data. If you want to ask questions and get a guaranteed 10,000 responses with genetic analysis, it'll cost you six figures. Essentially, you can pay them hundreds of thousands of dollars and run a social science study on the genetic database and you have guaranteed results. If you just imagine the fact that there's this huge database of hundreds of thousands of people's genetic data as a potential pool that you could sample, consider what kind of social science you could do.
As neuroscience tools have become more effective and cheaper, there has started to be a schism within the field of psychology. As a cognitive neuroscientist, you might be using fMRI to address basic questions of mechanisms in language understanding and decision making, and you train students in neuroscience methods and forget about the deep theoretical background coming from psychology. There is now a whole cadre of people studying brains and forgetting that we know a lot about behavior and psychology. Hopefully in the future these perspectives will be merged together. In fact, as we've seen more biological methods used in other parts of the social sciences, there's a hope that there can actually be a broader convergence of disciplines and methods, moving from the past separation of psychology versus sociology versus political science to a better understanding of the questions and theories that span the social sciences. I think that's a real opportunity that's been missing for a long time.