Big Data is More than Just 'Three V's'
Thirteen years ago, Gartner (then META Group) coined the famous “three V’s” to characterize the emerging big data revolution.
First is Volume: “E-commerce channels increase the depth/breadth of data available about a transaction (or any point of interaction).” The idea is that falling costs of storage and computing enable a higher volume of data to be preserved and communicated. Think, for example, of the amount of data generated by Facebook users.
Second is Velocity: “E-commerce has also increased point-of-interaction (POI) speed and, consequently, the pace data [sic] used to support interactions and generated by interactions.” The emergence of ubiquitous high-speed wired and wireless connectivity enables real-time communication of all kinds of data, enabling faster decision-making. Think, for example, of the astronomical amount of financial data collected in real time, or of manufacturers who collect real-time data about operations on the factory floor to enhance efficiency and maintenance. This is vastly faster than, say, tasking individuals to take surveys.
Third is Variety: “Through 2003/04, no greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures, and inconsistence [sic] data semantics.” As more kinds of things become data-generating, data becomes more variegated. One of the great software breakthroughs of recent years was the development of tools capable of handling such diverse and heterogeneous datasets. This has been instrumental in some key applications of big data: making sense of the vast array of social media data and metadata, and combining seismic, physical, and chemical data, as is necessary in “smart drilling” for oil and natural gas (a less popular, though highly illustrative, instance of new big data techniques).
The three V’s slogan has become common currency in the emerging literature on big data. The aptness of the slogan is confirmed by certain key examples – consumer-preference data (e.g. Amazon), data-driven health technologies, Google Flu Trends predictions, etc. – well known to anyone with a passing familiarity with big data trends. Few who write about big data, it seems, manage without the perfunctory invocation of the three V’s and reference to these and similar instances of new data analytics capabilities.
But it’s time to move beyond – or at least to supplement – the three V’s.
To be sure, each V has its value. There is no doubt that the amount of data has exploded (and will continue to expand); and there is no doubt that, in many cases, the ability to analyze and process huge volumes and diverse kinds of data pays dividends. Similarly, the pace of data-collection represents an important change in how we communicate information.
But the drawbacks to the three V’s paradigm are twofold.
First, the V’s are subjective. How big is the “big” in big data? What’s the benchmark? What’s the metric? One finds surprisingly little agreement about how to answer these questions in the literature. What about the much-heralded zettabyte? The incomprehensibility of numbers this large tells us a lot about the inadequacy of the frameworks we are using to evaluate the impact of big data. Would we, in 1890, count the drops of ink in all of the books and newspapers in circulation to capture the socioeconomic implications of the industrialization of the printing press?
How fast is big data? Information has been communicated at the speed of light since the telegraph. It doesn’t get any faster than that without throwing out the laws of physics (not an impossibility, I concede, but I wouldn’t hold my breath). Moreover, network communication is often slower today than the light-speed telegraph was in 1890, thanks to the infrastructural challenges associated with the amount of data in transit today. What’s distinctive about big data is that it brings such high velocity to everything – not just Morse code and telephone calls.
The least subjective V may be variety. The heterogeneity of datasets is (or was), for information systems, a wall as real as any. But this, of course, is an artifact of the state of technology, and we have already scaled this wall. It might be worth recalling that, at a certain point in our history, oral and written language were heterogeneous datasets, given the scarcity of writing tools and papyrus among the general populace.
Second, and more importantly, not one of these V’s is a necessary condition for big data, nor are they jointly necessary.
Not all tools and technologies we associate with big data involve high volumes of data. Consider a new health app, iBlueButton, which enables the consolidation and sharing of personal health records on personal smart devices (tablets, smartphones, etc.). One’s personal health records are “small” compared to the amount of data recorded on Facebook or about an oil field. In fact, iBlueButton touts the fact that these data are stored not in the cloud or with any third party, but locally, on your device. Not much volume, nor much variety, here.
Next consider Google Ngrams. This new service is a big data phenomenon, as its creators emphasize. But while Ngrams depend upon huge volumes of data – in fact, all the words in more than 30 million books – there isn’t much variety here: all the data are words. Nor is there much velocity, even though, when you type in a word, you get an answer quickly. Google search does that too; what makes Ngrams unique is that they make so much accessible so easily. (For example, one can easily find out when people started saying “the” United States instead of “these” United States. Hint: it wasn’t until long after the Civil War, contrary to historians’ beliefs.) Financial transactions, meanwhile, are saliently high-velocity and high-volume, but have very little variety.
Or take a utility smart meter, which records the flow of kilowatt-hours into someone’s home. Here we have no variety, some velocity, but lots of volume (since there are millions of such meters now). What makes smart meters useful is not combining diverse datasets, but making so much geographically dispersed data available in real time.
Of course, there are great examples of all three V’s coming together. The Google self-driving car comes to mind: it requires high-velocity data to meet the challenges of real-time navigation, and enormous volumes of highly varied data, from speed, radar, and camera readings to lidar and GPS. But many “big data” phenomena simply don’t adhere to the three V’s paradigm.
So where does that leave us? It leaves us searching for an improved framework or paradigm for the big data revolution. A few things to keep in mind as we undertake this search:
More important than volume, velocity, or variety is the imminent ubiquity of data – the “datafication” of everything. What makes the era of big data unique is not just that there are lots of data available, but that new kinds of objects and domains of social and economic life are becoming data-generating – from industrial machines to social interactions to biological processes. This is possible thanks to advances in storage capacity and smart technology. Sensors can, increasingly, be embedded into anything – people, pills, machines, etc. And it is possible to store lots of these data, since storage keeps getting cheaper.
On the analytics side, what is distinctive about the era of big data are the software tools. New kinds, larger volumes, and a wider range of data are now manageable where once they were not. We can rely on, or supplement our knowledge with, the predictive capabilities these new analytics techniques furnish.
The era of big data involves the wedding of datafication with this analytics revolution. These ingredients are only just beginning to come together. Once they are seamlessly blended, we won’t be counting bytes anymore.