University of California,
In the first part of this essay (Perspectives, Fall 2015), I suggested things are looking pretty good for sociological theory, an optimism grounded in my appreciation of emergent sociological sub-fields where interesting theoretical work is being paired with innovative new measurement regimes to create different kinds of sociological insights. I pointed to the field of computational sociology (or Big Data social science) as an example. In this second part, I offer a few reasons why I think this area of research will continue to need more and better theory in the years ahead. I highlight three causes, what I call: (1) the paradigm effect, (2) the data effect, and (3) the culture effect.
1. The Paradigm Effect
(Kuhn 1962, p. 111).
Sociology’s dominant framework for social measurement dates back to the years around World War II when it first coalesced into something like a coherent methodological paradigm. Survey methods emerged as the dominant means of gathering social data in the postwar years, anchoring the core of what was essentially a new methodological field. Many of the relevant technologies, organizational forms, and intellectual developments of survey techniques had already been pioneered and refined in the decades before the war, but WWII was an important catalyst. Scholars from multiple disciplines were brought together in large, well-resourced, practical project teams. For example, the U.S. War Department surveyed over a half million active-duty American soldiers about their experiences in combat, unit morale, racial prejudice, and many more topics (Stouffer et al. 1949). Such efforts led to an accelerated articulation between the theory and the method of survey analysis, as well as an elevation in the legitimacy of the work. After the war, academics like Paul Lazarsfeld further theorized, formalized, and publicized these methods, facilitating their rapid institutionalization (Converse 1987; Mohr and Rawlings 2010, 2015; Platt 1996). Friedland and Alford (1991, p. 248) argue that the institutionalization of fields depends upon the development of a common core institutional logic, which they define as “a set of material practices and symbolic constructions which constitutes [the field’s] organizing principles and which is available to organizations and individuals to elaborate.” They emphasize that this is driven both through coherence within the logic and through its difference from other logics.
In this case, I think we can trace the emergence of a dominant logic of social measurement that wove together a constellation of survey techniques, material technologies, and practices, along with a deep-level set of shared understandings about what it means to measure, collect, conceptualize, and analyze data about the social world (Knorr Cetina 2001; Latour 2005; MacKenzie 2009; Mirowski 2002; Mohr and Ghaziani 2014; Wagner-Pacifici, Mohr, and Breiger 2015).
This dominant measurement logic emerged out of and remains grounded in the theory and practice of survey analysis. Thus, it follows a trajectory of scientific investigation that begins from a statistically controlled sample of respondents, willingly answering a set of precisely worded questions to measure both subjective and objective characteristics about themselves. These responses are then statistically extrapolated to help us understand how those characteristics are distributed, inter-related, and (ideally) causally connected across the larger population. At its heart, and in its brilliance, this is a model of discovery that depends upon the ever more efficient leveraging of scarce information to learn about some larger unmeasured social whole.
Amir Goldberg has written pointedly about the intellectual limitations of working within the dominant paradigm. In a marvelous article on analyzing Big Data entitled “In Defense of Forensic Social Science,” Goldberg (2015) links the embrace of hypothesis testing as a master trope in social scientific work back to the problem of informational scarcity in the postwar era. Goldberg writes that during this period, “[t]wo methods of research became particularly prominent: surveys and laboratory experiments. Both are costly and time-consuming, and require a significant investment in infrastructure and personnel. Such an upfront investment makes exploratory research potentially wasteful and therefore highly risky” (p. 2). Goldberg explains how this differs from his experience of working with Big Data. He writes that along “with the pain of drowning in an ocean of amorphous data also comes the liberation of unshackling oneself from the blinders of one’s limited imagination…the analytical focus shifts from thinking about the most cost-effective data that one needs to collect in order to support, or refute, a hypothesis, to figuring out how to structure a mountain of data into meaningful categories of knowledge” (p. 2).
Content analysis comes from this same place. Again, WWII served as a crucible. Harold Lasswell led a team of social scientists at the U.S. Experimental Division for the Study of Wartime Communications. Building on existing content analysis practices, Lasswell’s group developed and refined a new set of formal procedures for systematically extracting the core bits of information from a textual corpus. This produced quantitative datasets that could be reliably employed, in the formal sense of having a reliability metric, to map the distribution of information across the larger, unmeasured textual space. Using these content analysis techniques, Lasswell’s unit employed teams of human “coders” to read foreign—especially enemy—newspapers to gather war intelligence. After the war, Lasswell worked to help refine and institutionalize the wartime methodologies into the modern research program of formal content analysis (Lasswell and Leites 1949; Lasswell, Lerner, and Pool 1952). Notice, the underlying paradigm is the same. Content analysis, from its beginning, has sought to measure meaning by extrapolating from small bits of textual information that have been carefully selected so as to best represent a larger, more complex, unmeasured—in this case, discursive—whole (Mohr, Wagner-Pacifici, and Breiger 2015).
As brilliant and scientifically accomplished as both of these research programs grew to become over the decades, the underlying measurement framework defining both is the careful and calculated leveraging of scarce information. As the contemporary era of Big Data so vividly illustrates, this is only one way to measure the social, and it is a measurement paradigm that has quickly proven inadequate for organizing the work of Big Data social scientists.
Some problems of applying quantitative practices predicated on data scarcity to the world of Big Data are simply practical. Sample sizes quickly become so large that traditional statistical measures of significance are rendered useless because everything becomes significant (McFarland and McFarland 2015). There are also systematic distortions of the social world that come embedded in Big Data formats (Adams and Brückner 2015; Diesner 2015; Lewis 2015; Shaw 2015). But more than this, as Monica Lee and John Levi Martin (2015) complain, our most basic conventions for measuring the social world become a drag on scientific productivity. For example, thinking through the lens of traditional causal modeling as we engage with Big Data has left us using “new tools to accomplish old tasks. In a word, we have been trying to make things insignificant. Like a neurologically impaired subject with dilated pupils, we are putting our hands over our eyes and hoping to peep through our fingers” (Lee and Martin 2015, p. 1). Lee and Martin give the example of outliers—data points far from the central tendency—which represent a classic technical concern of traditional linear modeling. With Big Data:
a group of outliers, though a very small percentage of the total population, may still consist of thousands of people—many times more than the total respondents to the GSS. They should not be expunged as outliers but understood as a significant population in its own right…moving from one average man that poorly represents one big population to multiple average men that represent segments of the total population more accurately... (Lee and Martin 2015, p. 2)
Contrary to traditional measurement logics, Big Data social science regularly presents us with methodological conundrums about how to effectively carve into an overwhelming abundance of information in a theoretically meaningful way. Lee and Martin put it starkly: now “[l]ike a real scientist, our problem isn’t running out of information, but choosing which path to follow” (Lee and Martin 2015, p. 4). They propose a path that takes us away from the traditional measurement paradigm “…towards cartography—the construction of question-independent, though theoretically organized, reductions of information to make possible the answering of many questions” (p. 4). They envision a Blau-space-like multi-dimensional map that “allows us to scan our eyes up, down, left, and right, to draw both horizontal and vertical comparisons—how people in the population relate to each other in terms of demographics or any single surface (e.g. psychobilly concert attendance), as well as which factors contribute [to] concert attendance for each subpopulation” (p. 3).
A similar pragmatism is evident among the computer scientists who build these tools. As Paul DiMaggio (2015) observes,
[w]hereas social scientists customarily obsess over causality and rely on formal tests of statistical significance, computer scientists using supervised models focus on results. The first topic-model presentation I attended used the method to identify public records particularly likely to require redaction, out of a set of records too immense for humans to screen by hand. The only measure that mattered was whether the models improved prediction (which they did).
In sum, sociologists’ dominant paradigm for measuring the social appears increasingly out of step with the methodological problems that come to the fore as we move from an era of data scarcity into an era of Big Data social science. Lacking an established measurement paradigm to do the conceptual heavy lifting, research scientists working with Big Data will need help in theorizing where to look, how to look, what to look for, and what to make of what they are looking at. Hence, my assertion: the old measurement paradigm is beginning to crack (or, rather, is cracking in some new ways), and Big Data social scientists are going to need more and better theory—two kinds of theories, in fact. On the one hand, we need to re-theorize the practice of social measurement itself, toward more exploratory and less simple-minded hypothesis-testing approaches to the use of data. On the other, we need new (and reinvigorated) theories of the social world, theories which can now find a new and possibly more illuminating empirical footing in the plenitude of information that has begun to come our way.
2. The Data Effect
The explosion in digital information has been accompanied by the emergence of an equally dynamic technical field focused on analyzing these data. Tech firms, in close collaboration with research universities, have created a range of new tools for reading this expanding universe of textual data (think Google). Consider topic modeling—a way of automatically coding the thematic content of large textual corpora—which is increasingly used by both social scientists and digital humanists to radically change the scale of data queried and the kinds of questions asked (Blei, Ng, and Jordan 2003; Mohr and Bogdanov 2013). Scholars have used these tools to explore a wide array of topics.
Daniel McFarland and colleagues (2013) analyzed the ProQuest database—containing more than a million dissertation abstracts—to study the emergence of boundaries in scientific fields. Paul DiMaggio’s (2013) team studied some 8,000 articles concerning public funding of the arts in five major U.S. newspapers over the course of a decade. They showed how political events and the political leanings of the newspapers affected the types of stories that were published. Ian Miller (2013) used topic models to study the Qing Dynasty Veritable Records, containing thousands of reports of social unrest submitted to the Chinese emperor over the course of many decades. Miller used this to show how social constructions of crime changed in Chinese society. In short, these technologies are fundamentally changing the way that scholars read and interrogate textual data.
These types of developments create the historical conditions for a new age of computational hermeneutics that builds productively on the proliferation of new text mining tools and datasets (Mohr et al. 2013, 2015). We argue this heralds the development of a style of content analysis that no longer (by techno-methodological necessity) throws away the nuances of textual information, but which instead seeks to identify as much nuance as possible. Unfettered by the limitations of Lasswell’s human coders, this new analysis relies upon a vast array of algorithmic coding and measurement devices. But, as work in the Digital Humanities illustrates, what Big Data researchers really need is more and better theory: theories of reading, semiotics, and narrativity; theories about how identity, agency, communication, discourse, and social institutions are meaningfully ordered and constituted. Once again, the old scarcity-based measurement paradigm of content analysis provides few clues for helping contemporary text analysts to think about how to approach such a bewildering array of measurement options.
3. The Culture Effect
“…gaining access to the conceptual world in which our subjects live so that we can, in some extended sense of the term, converse with them” (Geertz 1973, p. 24).
Births, deaths, marriages, incomes, occupational categories, and years of education—since sociologists started transforming social observations into measurable units, there has been a distinction between things that are more easily quantified and the wide range of cultural, cognitive, and hermeneutic qualities of social life, which have been far less easily translatable into reliable metrics (Mohr and Ghaziani 2014; Jepperson and Swidler 1994). The more data-intensive side of sociology has conventionally leaned heavily toward structural, resource, and demographic factors that seemed to be more easily quantifiable.
One of the most interesting things about the Big Data revolution is that it inverts this old imbalance. We are now inundated with textual data, visual data, audio data, and other kinds of highly nuanced cultural data, as the social world continues to digitize its subjective experience of selfhood. Such data call for ways to theorize language and speech, image and vision, hearing and sound. Moreover, because this information often comes through the “contextlessness” of digital space, scholars now have a comparatively harder time identifying the concrete measures of social relations, social structure, and social demography that have served as the mainstay of quantitative social science over these last many decades.
My main point: for all of the reasons I have proposed here and more, social scientists working with Big Data are going to need lots of new theories, and a new generation of theorists, to accomplish the work that needs to be done (Venturini, Jensen, and Latour 2015). Make no mistake, action is required. When social scientists don’t step up, physicists and engineers have no incentive to wait. The early years of Big Data have suggested that when there is a vacuum of good social scientific theory, Big Data researchers are more than happy to ad hoc a problem and call it theory. As many papers in the Big Data and Society special issue testify, naïve empiricism and large leaps of faith often result in bad social science because of the Big Gap that separates social life and the purported representation of that world with Big Data. As Breiger (2015, p. 2) puts it, “whereas many studies have been undertaken of massively large systems such as social networking sites, an under-researched question is the extent to which the behavioral findings of these studies ‘scale down,’ i.e. apply to human groups and organizations of moderate size (dozens or hundreds), where most human social life takes place and is likely to continue to do so.”
Acknowledgements: Thanks to Roger Friedland, Erin McDonnell, and Craig Rawlings for useful suggestions and comments on this essay (not all of which I was able to respond to). Thanks also to Ronald Breiger and Robin Wagner-Pacifici for their comments on this essay and for their collaboration on the larger project of which this is but a small piece. Finally, special thanks to Erin McDonnell and Damon Mayrl for their support, patience, and superb editorial work.
Adams, Julia, and Hannah Brückner. 2015. “Wikipedia, Sociology, and the Promise and Pitfalls of Big Data.” Big Data and Society 2(2).
Algee-Hewitt, Mark, Sarah Allison, Marissa Gemma, Ryan Heuser, Franco Moretti, and Hannah Walser. 2015. “Canon/Archive. Large-scale Dynamics in the Literary Field.” Stanford Literary Lab Pamphlet #11. Available online at https://litlab.stanford.edu/pamphlets/.
Allison, Sarah, Marissa Gemma, Ryan Heuser, Franco Moretti, Amir Tevel, and Irena Yamboliev. 2013. “Style at the Scale of the Sentence.” Stanford Literary Lab Pamphlet #5. Available online at https://litlab.stanford.edu/pamphlets/.
Bail, Christopher A. 2014. “The Cultural Environment: Measuring Culture with Big Data.” Theory and Society 43(3-4): 465-82.
______. 2015. “Lost in a Random Forest: Using Big Data to Study Rare Events.” Big Data and Society 2(2) DOI: 10.1177/2053951715604333.
Bearman, Peter L. 2015. “Big Data and Historical Social Science.” Big Data and Society 2(2) DOI: 10.1177/2053951715612497.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993-1022.
Breiger, Ronald. L. 2015. “Scaling Down.” Big Data and Society 2(2) DOI: 10.1177/2053951715602497.
Converse, Jean M. 1987. Survey Research in the United States: Roots and Emergence, 1890-1960. New Brunswick, NJ: Transaction Publishers.
De Nooy, Wouter. 2015. “Structure from Interaction Events.” Big Data and Society 2(2) DOI: 10.1177/2053951715603732.
Diesner, Jana. 2015. “Small Decisions with Big Impact on Data Analytics.” Big Data and Society 2(2) DOI: 10.1177/2053951715617185.
DiMaggio, Paul J. 2015. “Adapting Computational Text Analysis to Social Science (and Vice Versa).” Big Data and Society 2(2) DOI: 10.1177/2053951715602908.
DiMaggio, Paul, Manish Nag, and David Blei. 2013. “Exploiting Affinities between Topic Modeling and the Sociological Perspective on Culture: Application to Newspaper Coverage of U.S. Government Arts Funding.” Poetics 41(6): 570–606.
Friedland, Roger, and Robert R. Alford. 1991. “Bringing Society Back In: Symbols, Practices, and Institutional Contradictions.” Pp. 232-63 in The New Institutionalism in Organizational Analysis, edited by Walter W. Powell and Paul J. DiMaggio. Chicago: University of Chicago Press.
Geertz, Clifford. 1973. “Thick Description: Toward an Interpretive Theory of Culture.” Pp. 3-30 in The Interpretation of Cultures: Selected Essays. New York: Basic Books.
Goldberg, Amir. 2011. “Mapping Shared Understandings Using Relational Class Analysis: The Case of the Cultural Omnivore Reexamined.” American Journal of Sociology 116(5): 1397-1436.
______. 2015. “In Defense of Forensic Social Science.” Big Data and Society 2(2) DOI: 10.1177/2053951715601145.
Jepperson, Ronald L., and Ann Swidler. 1994. “What Properties of Culture Should We Measure?” Poetics 22(4): 359-71.
Jockers, Matthew L. 2013. Macroanalysis: Digital Methods and Literary History. Champaign: University of Illinois Press.
Knorr Cetina, Karin. 2001. “Objectual Practice.” Pp. 175-188 in The Practice Turn in Contemporary Theory, edited by Theodore R. Schatzki, Karin Knorr Cetina, and Eike von Savigny. New York: Routledge.
Kuhn, Thomas. 1962. The Structure of Scientific Revolutions. Chicago: University of Chicago Press.
Lasswell, Harold D., and Nathan Leites (eds.). 1949. Language of Politics: Studies in Quantitative Semantics. New York: George W. Stewart.
Lasswell, Harold D., Daniel Lerner, and Ithiel de Sola Pool. 1952. The Comparative Study of Symbols: An Introduction. Stanford, CA: Stanford University Press.
Latour, Bruno. 2005. Reassembling the Social: An Introduction to Actor-Network Theory. New York: Oxford University Press.
Lee, Monica, and John Levi Martin. 2015. “Surfeit and Surface.” Big Data and Society 2(2) DOI: 10.1177/2053951715604334.
Lewin, Kurt. 1951. Field Theory in Social Science. New York: Harper.
Lewis, Kevin. 2015. “Three Fallacies of Digital Footprints.” Big Data and Society 2(2) DOI: 10.1177/2053951715602496.
Liu, Alan. 2013. “The Meaning of the Digital Humanities.” PMLA 128: 409-23.
MacKenzie, Donald. 2009. Material Markets: How Economic Agents are Constructed. Oxford: Oxford University Press.
McFarland, Daniel A., and H. Richard McFarland. 2015. “Big Data and the Danger of Being Precisely Inaccurate.” Big Data and Society 2(2) DOI: 10.1177/2053951715602495.
McFarland, Daniel A., Daniel Ramage, Jason Chuang, Jeffrey Heer, Christopher D. Manning, and Daniel Jurafsky. 2013. “Differentiating Language Usage through Topic Models.” Poetics 41(6): 607-25.
Miller, Ian Matthew. 2013. “Rebellion, Crime, and Violence in Qing China, 1722-1911: A Topic Modeling Approach.” Poetics 41(6): 626-49.
Mirowski, Philip. 2002. Machine Dreams: Economics Becomes a Cyborg Science. Cambridge: Cambridge University Press.
Mohr, John W., and Petko Bogdanov. 2013. “Topic Models: What They Are and Why They Matter.” Poetics 41(6): 545-69.
Mohr, John W., and Amin Ghaziani. 2014. “Problems and Prospects for Measurement in the Study of Culture.” Theory and Society 43(3-4): 225-46.
Mohr, John W., and Craig Rawlings. 2010. “Formal Models of Culture.” Pp. 118-128 in The Routledge Handbook of Cultural Sociology, edited by John Hall, Laura Grindstaff, and Ming-Cheng Lo. New York: Routledge.
______. 2015. “Formal Methods of Cultural Analysis.” Pp. 357-67 in International Encyclopedia of the Social & Behavioral Sciences, edited by James D. Wright. 2nd edition, Vol. 9. Oxford: Elsevier.
Mohr, John W., Robin Wagner-Pacifici, and Ronald Breiger. 2015. “Toward a Computational Hermeneutics.” Big Data and Society 2(2) DOI: 10.1177/2053951715613809.
Mohr, John W., Robin Wagner-Pacifici, Ronald Breiger, and Petko Bogdanov. 2013. “Graphing the Grammar of Motives in U.S. National Security Strategies: Cultural Interpretation, Automated Text Analysis and the Drama of Global Politics.” Poetics 41(6): 670-700.
Moretti, Franco. 2011. “Network Theory, Plot Analysis.” Stanford Literary Lab Pamphlet #2. Available online at https://litlab.stanford.edu/pamphlets/.
______. 2013. Distant Reading. London: Verso Books.
Moretti, Franco, and Dominique Pestre. 2015. “Bankspeak: The Language of World Bank Reports, 1946-2012.” Stanford Literary Lab Pamphlet #9. Available online at https://litlab.stanford.edu/pamphlets/.
Platt, Jennifer. 1996. A History of Sociological Research Methods in America: 1920 – 1960. Cambridge: Cambridge University Press.
Shaw, Ryan. 2015. “Big Data and Reality.” Big Data and Society 2(2) DOI: 10.1177/2053951715608877.
Stouffer, Samuel A., Edward A. Suchman, Leland C. DeVinney, Shirley A. Star, and Robin M. Williams, Jr. 1949. Studies in Social Psychology in World War II: The American Soldier. Vol. 1, Adjustment During Army Life. Princeton: Princeton University Press.
Thornton, Patricia H., William Ocasio, and Michael Lounsbury. 2012. The Institutional Logics Perspective: A New Approach to Culture, Structure, and Process. New York: Oxford University Press.
Underwood, Ted. 2015. “The Literary Uses of High-Dimensional Space.” Big Data and Society 2(2) DOI: 10.1177/2053951715602494.
Vaisey, Stephen. 2009. “Motivation and Justification: A Dual-Process Model of Culture in Action.” American Journal of Sociology 114(6): 1675-1715.
Venturini, Tommaso, Pablo Jensen, and Bruno Latour. 2015. “Fill in the Gap: A New Alliance for Social and Natural Sciences.” Journal of Artificial Societies and Social Simulation 18(2) 11 <http://jasss.soc.surrey.ac.uk/18/2/11.html> DOI: 10.18564/jasss.2729.
Wagner-Pacifici, Robin, John W. Mohr, and Ronald L. Breiger. 2015. “Ontologies, Methodologies, and New Uses of Big Data in the Social and Cultural Sciences.” Big Data and Society 2(2) DOI: 10.1177/2053951715613810.