Too big to ignore: When does big data provide big value?

Much of the language surrounding big data conveys a muddled conception of what data, “big” or otherwise, means to the majority of organizations pursuing analytics strategies. Big data is shrouded in hyperbole and confusion, which can be a breeding ground for strategic errors. Big data is a big deal, but it is time to separate the signal from the noise.

“Beware of false knowledge; it is more dangerous than ignorance.”—George Bernard Shaw

The Signal and the Noise

By now, we have all heard the claims. Data is the new oil. Big data is different.1 Big data is a management revolution.2 Big data is the next frontier for innovation, competition, and productivity.3 Big data makes the scientific method obsolete.4

On the one hand, it is easy to see why big data has generated so much excitement both in the business press and in the larger culture. New business models, scientific breakthroughs, and profound societal transformations are all observable effects of big data.

Familiar examples abound. Google Translate exploits word associations in massive databases of free-form text to yield a tool that can, if you wish, instantly translate Icelandic to Indonesian. Social, political, economic, and professional relationships—and the societal and marketing mores surrounding them—are rapidly evolving along with the evolution of social networking and social media technologies. Political campaigns increasingly harness the power of social networks and social media to raise funds, motivate constituencies, and get out the vote. Companies use detailed databases about their customers’ behavior and lifestyles not only to better target them, but to create such innovative “data products” as playlists, newsfeeds, next-best offers, and recommendation engines for items ranging from airplane tickets to romantic partners. The raw material for all of these innovations—and surely many more to come—is large amounts of data.

So the topic is undoubtedly important. Yet at the same time, much of the language that has come to surround big data conveys a muddled conception of what data, “big” or otherwise, means to the majority of organizations pursuing analytics strategies. Big data is indeed a signature issue of our time. But it is also shrouded in hyperbole and confusion, which can be a breeding ground for strategic errors. In short, big data is a big deal, but it is time to separate the signal from the noise.

V is for …

Of course, business applications of data analysis and predictive modeling go back decades.5 For example, credit scoring dates back to the ENIAC-era late 1950s, and actuaries have long analyzed industrial-strength data to price insurance contracts and, more recently, to set aside appropriate loss reserves as well as to guide underwriting and claim adjustment decisions. And in the decade since the appearance of Michael Lewis’s Moneyball, statistical approaches to improved business decision making have spread to realms as disparate as improving patient safety, making better hiring decisions, and warding off lapses in child support payments. What, then, is new about big data?

The term is somewhat hard to pin down, in part because it is commonly used in at least two distinct senses. In everyday discussions, “big data” is increasingly used as shorthand for the granular and varied data sources that go into the sorts of projects described above.6 Here “big” is used in the sense of “as rich and detailed as practical given the business context.” Such data is “big” relative to the small, “clean,” easily accessible datasets that can be manipulated in spreadsheets and have traditionally been fodder for mainstream academic statistical research. While colloquial, this is actually a useful conception of “big data” for reasons that will become clear.

More formally, “big data” denotes data sources whose very size creates problems for standard data management and analysis tools. Examples include the data continuously emanating in vast quantities from digital sensors, audio and video recording devices, mobile computing devices, Internet searches, social networking and media technologies, and so on. Such examples motivate Doug Laney’s widely accepted “three V’s” characterization of big data:7

  • Volume: Here, “big” is often taken to mean multiple terabyte- or petabyte-class data, motivating the use of such highly parallel next-generation data management technologies as MapReduce, Hadoop, and NoSQL.8 (A toy single-machine sketch of the map/reduce pattern follows this list.)
  • Variety: Big data goes beyond numbers in databases, a.k.a. “structured data.” Such “unstructured” data sources as Internet search log files, tweets, call center transcripts, telecom messages, email, and data from sensor networks, video, and photographs can equally well be considered data. The multi-structured nature of big data in part accounts for its large volume and often high degree of “messiness” and “noisiness.”
  • Velocity: Because much of it emanates from sensors, web search logs, real-time feeds, or mobile devices, big data is often generated continuously and at a rapid clip.
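
To make the Volume bullet above concrete, here is a minimal, single-machine Python sketch of the map/reduce pattern that technologies such as Hadoop parallelize across clusters of machines. The function names and the toy documents are invented for illustration; this is a stand-in for the pattern, not the Hadoop API or production code.

    from collections import defaultdict
    from itertools import chain

    def map_phase(document):
        # Map step: emit a (word, 1) pair for every word in one document.
        # In a real MapReduce job this runs in parallel across many machines.
        return [(word.lower(), 1) for word in document.split()]

    def reduce_phase(mapped_pairs):
        # Reduce step: sum the counts for each word
        # (the "shuffle" stage groups the keys first).
        counts = defaultdict(int)
        for word, count in mapped_pairs:
            counts[word] += count
        return dict(counts)

    documents = [
        "big data is big",
        "data science is not magic",
    ]

    # Single-machine stand-in for a distributed job: map each document, then reduce.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    print(reduce_phase(mapped))
    # {'big': 2, 'data': 2, 'is': 2, 'science': 1, 'not': 1, 'magic': 1}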

Big data is said to be different and revolutionary because its size, scope, and fluidity are so great as to enable it to “speak to us,” often in real time, in ways not seen before. This point of view is clearly articulated in Chris Anderson’s influential essay, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”:9

There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

The thinking is that in the pre-big data era, statistical science was necessary to make up for the inherent limitations of incomplete data samples. Statisticians and scientists were forced to cleanse, hypothesize, sample, model, and analyze data to arrive at contingent lessons that are at least consistent with, and at best strongly supported by, limited data. Today, so the thinking goes, organizations increasingly have access to something approaching a complete sample. Therefore, the process of “learning from data” becomes more akin to a problem in algorithm and architecture design than one of learning from and quantifying uncertain knowledge using statistical science.

The mode of thought represented by Anderson’s essay can be seductive. But for reasons we will explore, most organizations would do well to resist this particular seduction.

Is Big Different?

Fitzgerald: “The rich are different from you and me.” Hemingway: “Yes, they have more money.”

Google Translate is a case in point of how big data can be dramatically different. In their widely cited article, “The Unreasonable Effectiveness of Data,”10 Google researchers Alon Halevy, Peter Norvig, and Fernando Pereira describe an approach to translation that hinges on mining the messy patterns from enormous collections of translations that exist “in the wild.” The approach is notable in that it bypasses traditional statistical methodology that involves laboriously cleansing, sampling, exploring, and modeling data. The very size and completeness of the data now available allow word associations to do the work that linguistic rules and complicated models did in previous, less effective, approaches.

Halevy, Norvig, and Pereira write:

Invariably, simple models and a lot of data trump more elaborate models based on less data. … Currently, statistical translation models consist mostly of large memorized phrase tables that give candidate mappings between specific source- and target-language phrases.
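
As a purely illustrative toy, and emphatically not Google’s production system, the “memorized phrase table” idea can be pictured as little more than counting which target phrase was observed most often alongside each source phrase in a parallel corpus. The miniature corpus below is invented for the example.

    from collections import Counter, defaultdict

    # Hypothetical miniature "parallel corpus" of (source, target) phrase pairs,
    # standing in for the billions of pairs harvested from text "in the wild".
    observed_pairs = [
        ("bonjour", "hello"), ("bonjour", "hello"), ("bonjour", "good day"),
        ("merci beaucoup", "thank you very much"), ("merci beaucoup", "thanks a lot"),
        ("merci beaucoup", "thank you very much"),
    ]

    # Build the phrase table: for each source phrase, count candidate targets.
    phrase_table = defaultdict(Counter)
    for source, target in observed_pairs:
        phrase_table[source][target] += 1

    def translate(phrase):
        # "Simple model, lots of data": pick the most frequently observed mapping.
        candidates = phrase_table.get(phrase)
        return candidates.most_common(1)[0][0] if candidates else phrase

    print(translate("merci beaucoup"))  # thank you very much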

Analogous points could be made, for example, about recommendation engines, online marketing experiments performed in rapid succession, and the use of Internet search data to predict flu outbreaks or “nowcast” economic trends. Thanks to its near-completeness and real-time nature, big data can be an “unreasonably effective” (to use Halevy, Norvig, and Pereira’s phrase) force for innovative data products.

In short, why ask why? In many applications, we only care about a good enough next answer, decision, or recommendation. And this is what big data, processed by the right computer algorithms, gives us. Or does it?

How Data Science Is a Science

“The whole of science is nothing more than a refinement of everyday thinking.” —Albert Einstein

As important as Google Translate and any number of other big data products are, cursory thinking about their significance can muddle important issues ranging from the epistemological to the economic to the strategic.

First, a fundamental distinction should not be forgotten: Data is not the same thing as information. In itself, data is nothing more than an inert raw material: uninterpreted symbols on a page or in a database. Furthermore, the size of a dataset is often a poor proxy for the amount of useful information it contains. A single Eminem video accounts for more of the world’s exabytes of data than do the complete works of Einstein. A typical collection of holiday snapshots from the Galapagos Islands will occupy more disk space than On the Origin of Species.

Second, it is deeply misguided to view the skills needed to convert raw data into usable information in software engineering terms. Contrary to the view articulated by Chris Anderson, these skills are inherently scientific in nature. Indeed, they are increasingly labeled by the helpful umbrella term “data science.”11 Data science should be viewed as a synthesis of statistical science, computer science, and domain knowledge appropriate to the problem at hand.12 The term originated from a far-sighted group of statisticians who understood that as data continues to grow in volume and availability, the ability to interact with, visually explore, and generally compute with data becomes an inescapable part of doing serious statistics.13

Business projects with data science at their core—a.k.a. business analytics projects—have three primary phases: design, analysis, and execution. Critical thinking, scientific judgment, creativity, and pragmatism are inherent to each of them. It is, generally speaking, unrealistic to expect that any of these phases could be automated or outsourced to computer algorithms processing big data.

Strategy and design: Perhaps a data scientist’s most important skill is in understanding (and often helping to articulate) an organization’s questions, problems, or strategic challenges and then translating them into the design of one or more data analysis projects. John Tukey’s famous slogan captures the need for such a strategy-led approach: “Better to have an approximate answer to the right question than a precise answer to the wrong question.”

To illustrate, consider a hypothetical chief medical officer of a hospital group seeking to improve patient outcomes. Should the focus of data analysis be on preventing medical errors? Identifying physicians at high risk of being sued for malpractice? Identifying patients with potentially high medical utilization? Identifying patients likely to fall off their treatments? Predictive approaches to guiding diagnosis and treatment decisions? Even after the various alternatives have been articulated and prioritized, there remains the design challenge of outlining a sensible series of data analysis and/or predictive modeling steps. Indeed the decision of which issues to tackle and in what order might be partially informed by the opinion of a seasoned data scientist regarding their relative feasibility.

Analytical process: The technical aspect of an analytics project itself comprises three distinct phases: data scrubbing and preparation, data analysis, and validation. Each requires judgment and is therefore generally resistant to automation.

The need for statistical programming (computing with data) is rightly emphasized because data comes in such messy and disparate forms: scanned documents, web logs, electronic medical records, billing records and other financial transactions, geospatial coordinates, audio/video streams, and, increasingly, free-form unstructured text. But it would be mistaken to conceive of this process purely as a programming challenge.

[xkcd cartoon: news headline reads “Green Jelly Beans Linked to Acne!”]

Analogous to Halevy, Norvig, and Pereira’s observation that more data trumps more elaborate models, we have repeatedly found that the creation of innovative data “features” (or “synthetic variables”) from raw data elements does more to enhance the effectiveness of analytics projects than does elaborate methodology. Examples of creating explanatory or predictive power where none existed before include calculating the distance between two addresses, calculating social network centrality using one or more shared associations, using behavioral data in a novel way, or creating index variables to serve as proxies for latent traits not directly observable.14 In short, characterizing the feature creation aspect of data analysis as a programming exercise would be akin to characterizing creative writing as word processing.15
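
To make the first of those examples concrete, here is a minimal sketch of turning two locations into a distance feature, assuming the addresses have already been geocoded to latitude and longitude. The function name, coordinates, and branch-distance scenario are hypothetical.

    from math import radians, sin, cos, asin, sqrt

    def haversine_miles(lat1, lon1, lat2, lon2):
        # Great-circle distance between two points, in miles (Earth radius ~3,959 mi).
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        dlat, dlon = lat2 - lat1, lon2 - lon1
        a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
        return 2 * 3959 * asin(sqrt(a))

    # Example: the distance from a (geocoded) customer address to the nearest
    # branch becomes a new model feature where none existed in the raw data.
    customer = (41.88, -87.63)   # hypothetical Chicago-area coordinates
    branch = (42.05, -87.68)
    print(round(haversine_miles(*customer, *branch), 1))  # roughly 12 miles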

After data scrubbing comes data exploration, analysis, and validation. Each of these steps should be viewed as an exercise in creative scientific investigation, aided with the tools of data visualization, statistics, and machine learning. Many datasets used in analytics projects contain considerable amounts of random noise, spurious correlations, ambiguity, and redundancy alongside the abiding “signal” that we wish to capture in a model or analysis.

A particularly diabolical and underappreciated problem is known in the vernacular as “multiple comparisons.” Imagine 100 people in a room flipping fair coins. Because so many coins are being repeatedly flipped, it is likely that at least one of the coins will—by chance—produce so many heads or so many tails that the coin will be found to be biased with a high degree of “statistical significance.”16 This is despite the fact that the coin thus selected is in reality no different from the other coins and is likely to produce heads in roughly half of the future flips.
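
A short simulation makes the point; the 100-flip sample size and the 5 percent significance threshold below are illustrative choices, not figures from the text.

    import random
    from math import sqrt

    n_people, n_flips = 100, 100

    # Each of 100 people flips a fair coin 100 times.
    head_counts = [sum(random.random() < 0.5 for _ in range(n_flips))
                   for _ in range(n_people)]

    # A naive two-sided test at roughly the 5 percent level flags any coin whose
    # head count strays more than about two standard deviations from 50
    # (standard deviation = sqrt(100 * 0.5 * 0.5) = 5).
    threshold = 1.96 * sqrt(n_flips * 0.25)
    flagged = [h for h in head_counts if abs(h - n_flips / 2) > threshold]

    print(len(flagged), "of 100 coins look 'significantly' biased")
    # Typically around 5 coins are flagged, even though every coin is fair.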

In fields such as medicine, drug testing, and psychology, essentially the same phenomenon is known as the “file drawer problem”: A purely mechanical approach of filing away nonsignificant (or sometimes unwanted) results and publicizing “significant” results systematically yields spurious findings that will likely not hold up over time.17 The xkcd cartoon reproduced above conveys the issue humorously and succinctly.18

While this is a problem for any large-scale analysis involving many possible data dimensions and relationships, it is particularly acute in the age of big data and brute-force algorithms for extracting interesting relationships. Furthermore, as will be discussed below, the cost of such false positives should be taken into account.

The problem of multiple comparisons is but one of many reasons why data science should not, in general, be viewed as the brute-force harvesting of patterns from stores of big data. For others, see the inset, “Putting the science in data science.”

Business execution: As the statistician George Box wrote, “All models are wrong, but some are useful.” A major implication of this statement is that it is generally advisable to blend model indications with common sense and expert judgment to arrive at a decision. This happens often even in bona fide big data applications. For example, if Jim uses Google Translate to write to a francophone friend, he will hopefully know enough to correct various mistakes and change the second-person singular from the formal vous to the familiar tu. Similarly, the recommendations of book or movie collaborative filtering algorithms are often useful but at other times should be taken with a grain of salt.19

In these everyday examples, the stakes are low, and it is easy to blend model indications (translations or recommendations) with judgment founded on experience (our prior beliefs about the correct translation or what we will enjoy). In more high-stakes business analytics applications, such as using models to guide medical diagnoses, employee screening, security ratings, fraud investigations, or complex loan and insurance underwriting decisions, the blending of expert judgment with model indications should be viewed as a risk management issue of strategic importance.20 It is important that the data scientist guiding the analytical process play a role in communicating the assumptions and limitations of the analysis or model so that these issues are effectively addressed.21 This is yet another reason why data science should not be framed in purely programming or engineering terms and why big data should not be characterized as an automatic source of reliable predictions regardless of context.

Amassing repositories of big data and purchasing software are therefore not sufficient for business analytics. Data scientists—a.k.a. people—are essential to the process. It should also be kept in mind that in many situations the appropriate data, at least to start, will not be “big” in the three V’s sense, and much can be done with open-source statistical computing software. Again, domain knowledge and scientific judgment are important factors in such decisions.

A venerable motto of computer science is GIGO: “garbage in, garbage out.” Perhaps an analogous motto for the nascent field of data science ought to be NIINO: “no insight in, none out.”

What Happens in Vagueness Stays in Vagueness

It is natural to find the discussions of big data in the business press and blogosphere bewildering. The field is inherently multidisciplinary, and terms such as “big data,” “business analytics,” and “data science” mean different things to different people.22

Confusion also may arise for a more fundamental reason: Concepts relating to statistical uncertainty simply do not come naturally to the human mind. The same body of psychological research that underpins behavioral economics also suggests that we are very poor natural statisticians. We are naturally prone to see patterns in data where none exist, latch on to causal narratives that rest on sketchy statistical evidence, ignore population base rates when estimating probabilities for individual cases, be overconfident in our judgments, and generally be “fooled by randomness.”23 Little wonder that misleading narratives about big data have multiplied.

Putting the science in data science (Or, Those who ignore statistics are condemned to reinvent it.)24

Not only is information different from data in general, there is no automatic, purely algorithmic way to extract the right islands of information from oceans of raw data. Generally speaking, this process requires a combination of domain knowledge, creativity, critical thinking, an understanding of statistical reasoning, and the ability to visualize and program with data. Theory and sound causal understanding remain important checks on the innate human tendency to be “fooled by randomness.” Far from making the scientific method obsolete, increasing volumes of data are making data science a core strategic capability of many organizations. Here are some reasons why.

Data contains too few patterns:

  • In many situations, one can fit many models to, and draw a variety of conclusions from, the same data. Examples are all around and include forecasting the profitability of a cohort of insurance policies, estimating Value at Risk (VaR) of a portfolio of securities, evaluating the effectiveness of an ad campaign or human resources policy, analyzing a financial time series, and predicting the outcome of a political election. While more data is often helpful, it is misleading to characterize these activities as “data driven.” Rather they are driven by the creative and judgmental application of scientific principles of data analysis. Data—“big” or otherwise—is an input into the process, not the driving force of the process.
  • Human creativity and domain knowledge are often necessary to create synthetic data features. Examples include body mass index, measures of social network centrality, distances between relevant physical addresses, composite measures of employee performance, and proxy variables for unobservable latent traits. Once again data is a raw material, not a source of automatic insight.

Data contains too many patterns:

  • Datasets contain a mixture of “signal” and “noise,” and many big data sources have low signal-to-noise ratios, perhaps earning a fourth “V”: “vagueness.” While statistical and machine learning techniques continue to improve in their ability to separate signal from noise, they are often best viewed as tools that facilitate an inherently judgmental process. Examples include selecting which variables and variable interactions should be considered for inclusion in a model, which data points should be excluded or down-weighted for various reasons, and whether various apparently linear or nonlinear relationships among variables are real or spurious.
  • The problem of “multiple comparisons”: It often happens that the more associations you test, the more apparently significant patterns you will detect, even when nothing is actually happening. This is generally regarded as a basic fact of statistical life and becomes more challenging as datasets become larger and messier.

Big datasets are unnecessarily big:

  • Summary statistics can be sufficient for the task at hand. An elementary example: Suppose a coin has been repeatedly tossed. Assuming a binomial process, two numbers (the total number of tosses and the number of heads) contain as much information about the probability of heads on the next toss as a complete history of the previous tosses. More bytes do not always translate to more information. (A short sketch follows this list.)
  • Scores and composite indices (such as credit scores, social media sentiment scores, or lifestyle clusters) can reduce hundreds of data elements to a handful of numbers.
  • Often a small, carefully chosen sample of data contains as much usable information as a large “messy” dataset.
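
A short sketch of the coin-toss point in the first bullet above: the estimate computed from the full toss-by-toss history is identical to the one computed from the two summary numbers, so storing the full record adds bytes but no information about the probability of heads.

    # Full history of tosses (1 = heads, 0 = tails) versus its two-number summary.
    history = [1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    n_tosses, n_heads = len(history), sum(history)

    # Under a binomial model, the estimated probability of heads on the next toss
    # depends on the data only through these two numbers; the toss-by-toss record
    # carries no additional information about that probability.
    estimate_from_history = sum(history) / len(history)
    estimate_from_summary = n_heads / n_tosses

    print(n_tosses, n_heads)                               # 20 12
    print(estimate_from_history == estimate_from_summary)  # True
    print(estimate_from_summary)                           # 0.6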

Big datasets are too small:

  • Another broadly accepted fact of statistical life is “the curse of dimensionality.” In many situations, even the largest conceivable dataset is “sparse” because of the large number of dimensions involved and/or a rare “outcome” variable (such as fraud or infrequent purchases). Imagine, for example, a marketing researcher confronting a dataset containing few or no purchases for many of the hundreds of different products offered; an actuary confronted with setting professional liability insurance rates for a multitude of specialty/geography combinations; or a geneticist analyzing many thousands of gene combinations for a relatively small patient population. (A simulated illustration follows this list.)
  • Biased sampling frames: In 1936 the Literary Digest conducted a poll that received over 2 million responses and predicted that Alfred Landon would prevail over Franklin Roosevelt by a double-digit margin. In fact, Roosevelt won by a landslide. The Literary Digest erred by drawing its sample largely from telephone directories, automobile registration lists, and its own subscriber rolls, all of which at the time skewed toward the wealthy. In this sense the huge sample was still too “small.” Analogously, petabytes of social media data today are not guaranteed to accurately reflect the membership or sentiments of the population of interest.25
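
To illustrate the sparsity point in the first bullet above, here is a minimal simulation with made-up numbers; the specialty, state, and firm-size categories and the 1 percent claim rate are invented for illustration. Even a one-million-policy book of business spreads thinly once it is cut by enough rating dimensions.

    from itertools import product
    from random import choice, random

    specialties = [f"specialty_{i}" for i in range(40)]
    states = [f"state_{i}" for i in range(50)]
    firm_sizes = ["solo", "small", "medium", "large", "national"]

    # A hypothetical one-million-policy book of business, roughly 1 percent of
    # which have a claim in a given year.
    policies = [(choice(specialties), choice(states), choice(firm_sizes), random() < 0.01)
                for _ in range(1_000_000)]

    # Count claims in each specialty/state/firm-size cell.
    claims_per_cell = {cell: 0 for cell in product(specialties, states, firm_sizes)}
    for specialty, state, size, has_claim in policies:
        claims_per_cell[(specialty, state, size)] += has_claim

    empty = sum(1 for count in claims_per_cell.values() if count == 0)
    print(f"{empty} of {len(claims_per_cell):,} cells have zero claims")
    # With ~10,000 claims spread across 10,000 cells, roughly a third of the
    # cells contain no claims at all -- far too sparse to set rates cell by cell.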

The patterns in the data are trivial:

While “garbage in, garbage out” is a well-known danger in data science, a less frequently noted variant is “gigabytes in, generalities out.” Often, an automatic approach to data analysis yields generalities that are obvious, nonactionable, or both. For example, a first attempt at a recommendation engine might suggest the “obvious,” most popular movies or songs to most people; similarly, a novice analysis of insurance claim severity might uncover obvious facts such as back injuries costing more than fractured arms. Human insight is required to design data analysis approaches that go beyond the obvious.

On the other hand, “obviousness” is sometimes a good thing. Some of the most valuable models insightfully combine a number of fairly obvious variables. The value of such models is due less to incorporating surprise nuggets of insight (although those are always nice) than to outperforming the human mind’s ability to quantify their relative importance and interactions. Also important is that such models help enforce consistency on groups of cognitively bounded, and perhaps distracted and tired, human professionals.26

The patterns in the data are diabolically misleading:

  • A famous example of “Simpson’s Paradox” occurred when the University of California, Berkeley, was sued for gender bias in its graduate school admissions decisions: In 1973 Berkeley admitted 44 percent of its male graduate school applicants but only 35 percent of its female applicants. However, when the data was broken down by department, the apparent bias disappeared. Why? Women tended to apply to departments (such as humanities) with lower admission rates—doh! (The arithmetic is reproduced with illustrative numbers after this list.) Similarly, naive analyses can spuriously suggest that higher prices lead to higher demand, or that marketing campaigns lead to lower sales.
  • Studies find that highly intelligent women tend to marry men who are less intelligent than they are.27 Top-performing companies tend to do worse over time. These facts do not require sociological or management-theoretic explanations. They are instances of “regression to the mean.” The concept was first identified by Francis Galton, a cousin of Charles Darwin and the inventor of regression analysis, when he studied such phenomena as tall parents having shorter offspring (and vice versa). Misunderstanding of this phenomenon can lead to such bad business decisions as paying for the past performance of sports players who have experienced winning streaks, or investing in movie genres and franchises whose moment has passed.
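
The arithmetic behind Simpson’s Paradox can be reproduced with two invented departments; the numbers below are illustrative, not Berkeley’s actual figures. Women apply more heavily to the more selective department, so the pooled admission rate flips direction even though women fare at least as well within each department.

    # Hypothetical applicant counts: (admitted, applied) by gender for two departments.
    departments = {
        "Engineering (easy to enter)": {"men": (480, 800), "women": (60, 100)},
        "Humanities (hard to enter)": {"men": (40, 200), "women": (210, 900)},
    }

    def rate(admitted, applied):
        return admitted / applied

    # Within each department, women are admitted at the same or a higher rate.
    for name, groups in departments.items():
        m, w = groups["men"], groups["women"]
        print(f"{name}: men {rate(*m):.0%}, women {rate(*w):.0%}")

    # Pooled across departments, the direction reverses: 52% vs. 27%.
    men_total = [sum(x) for x in zip(*(g["men"] for g in departments.values()))]
    women_total = [sum(x) for x in zip(*(g["women"] for g in departments.values()))]
    print(f"Overall: men {rate(*men_total):.0%}, women {rate(*women_total):.0%}")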

Nevertheless, avoiding misconceptions about big data should be regarded as a prerequisite for steering clear of analytics projects with negative ROI. First, data should not be confused with useful information. Deriving insight from data, as discussed above, should generally not be viewed as a form of programming or software implementation but as a type of scientific investigation requiring the judgmental evaluation of ambiguous data.

It is equally important to pay attention to the economic and strategic aspects of big data. While the potential benefits of big data get a lot of attention, less attention is given to the costs. Many of these costs are straightforward: It costs money to acquire, store, back up, secure, integrate, manage, audit, document, and make available any data source.28 And while inexpensive multi-terabyte drives are available at the local electronics store, they bear little resemblance to the enterprise-grade infrastructure many organizations need to manage big data.

More subtle economic points can be equally relevant. First, economic decisions should be made at the margin. Therefore, a big data project should not be evaluated in isolation but in terms of how much insight or predictive power it is likely to add over and above a less costly analytics project.

Second, big data projects often carry opportunity costs. Many organizations have a menu of potential analytics projects with limited resources to execute them. For a large retailer wishing to make real-time next-best offers or an Internet company aiming to enjoy the “winner take all” effect of network externalities, big data naturally rises to the top of the list of priorities. For many other organizations, a more likely path to analytics success is paved with a sequence of smaller, well-targeted projects, with the benefits of one helping fund the next. For such organizations, a focus on big data could be an expensive distraction.

Finally, the human capital and organizational costs required to work with, analyze, and act upon big data should not be underestimated. Data science skills are clearly in demand and, therefore, can be difficult and expensive to acquire. While this is subject to change as supply ramps up to meet demand, a deeper and more abiding point was made over 40 years ago by the artificial intelligence pioneer and management theorist Herbert Simon, who wrote:

In an information-rich world, the wealth of information means a dearth of something else: a scarcity of whatever it is that information consumes. What information consumes is rather obvious: It consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.29

To be sure, there is ample evidence that when properly analyzed and acted upon, data helps enable organizations to make decisions more accurately, consistently, and economically.30 But this does not imply the cost-effectiveness of big data technology in any particular application.

Moneyball is often held up as a prime example of the sort of transformative innovation enabled by data-driven decision making. In fact, variations on “Big Data: Moneyball for Business” are common titles for articles on the topic. But Billy Beane’s achievement did not result from using terabyte- or petabyte-class data. As author Michael Lewis recounts, it resulted from an inspired use of the right data (for example, on-base percentage rather than batting average) to address the right business opportunity (an inefficient market for talent due to a widespread culture of intuition-driven decision making). Much the same could be said about the use of statistical analysis and predictive models to help guide medical decisions; loan or insurance underwriting decisions; hiring, admissions, and resource allocation decisions; and so on.

This last point sheds light on the “data is the new oil” metaphor. The metaphor is helpful in driving home the point that data should be treated as a valuable asset that, analogous to oil, can be refined to help power insights and better decisions. But unlike oil, data is not an undifferentiated quantity. More volumes and varieties of data are not necessarily better. Indeed, if not pursued strategically, they can lead to missed opportunities, expensive distractions, and what Simon called “a poverty of attention.” Rather than let the data tail wag the strategic dog, it is crucial to begin with a plan within which the organization can judge which data is the right data and which analysis is the right analysis.

Two Cheers for Big Data

Is big data a big deal? Yes. Instances of innovative big data-driven products, business models, and scientific breakthroughs are already common and are likely to multiply in the future.

But for a leader seeking an appropriate business analytics strategy specific to his or her organization, the answer to this question is less straightforward. There is indeed abundant evidence that organizations—large and small, public and private—benefit from using data analysis to guide strategic, operational, and tactical decisions. But this does not always entail processing terabyte- or petabyte-class data on Hadoop clusters.

Above all, business strategy should guide the choice of data and technology, not vice versa. Laying out a clear vision and analytical roadmap helps avoid being side-tracked by narratives in which the phrase “big data” is defined one way and used in another; data is conflated with usable information; the judgment-infused process of data science is mischaracterized as mining patterns and associations from raw data; and the economics of big data are downplayed.

Properly harnessed, the right data can indeed be an organization’s new oil. But it is important not to lose sight of two fundamental points. First, analytics initiatives ultimately do not begin with data; they begin with clearly articulated problems to be addressed and opportunities to be pursued. Second, more data does not guarantee better decisions. But the right data—properly analyzed and acted upon—often does. Organizations that lose sight of these principles risk experiencing big data not as the new oil, but as the new turmoil.