How does one define Big Data, and is “big” the best adjective to describe it? Many voices are trying to answer this topical question. Gartner and Forrester both agree that a better word would be “extreme”. Between them, the two major consulting firms have identified four characteristics that “extreme” can qualify. They agree on three: volume, velocity and variety. On the fourth they diverge: Forrester proposes variability, while Gartner prefers complexity. These are reasonable contributions and may form the foundation for the definition of Big Data that the Open Methodology Group is seeking to create within their open architecture Mike 2.0.
However, the definition still falls short of the mark, since any combination of these characteristics can be found in many of today’s large data warehouses and parallel databases operating in outsourced or in-house data centers. No matter how extreme the data, Moore’s Law* and technology will eventually, asymptotically, accommodate and govern it. I could suggest that the missing attribute is volatility, or the rate of change, but that too can be applied to currently serviced capabilities. Another important attribute, all too often missed by analysts, is that Big Data is world data: it is data in many formats and many languages, contributed by almost every nationality and culture, along with the noise generated by the systems and devices they employ.
Yet the characteristic that seems to address this definitional shortfall best is openness, where openness means accessible (addressable directly or through an API), shareable and unrestricted. This may be controversial, as it raises key issues around privacy, property and rights, but those problems still need to be resolved for Big Data independent of any definition. Why openness? Here are six observations:
- Any data that is not open, i.e. that is private, covert or obscured, is by default protected and confined to the private architecture and data model(s) of that closed system. While it shares many of the attributes of Big Data, and possibly the same data sources, at best it can represent only a subset of Big Data as a whole.
- Big Data does not and cannot have a single owner, supplier or agent (heed well, ye walled gardens); it is the sum of many parts including, amongst others, social media streams, communication channels and complex signal networks.
- There will never be a single Big Data analytic application or engine; instead there will be a multitude of them, each working on different or slightly different subsets of the whole.
- Big Data analysis will demand multi-pass processing, including some form of abstract notation. Private systems will develop their own notations, but public notation standards will evolve, and open notation standards will improve the speed and consistency of analysis.
- Big Data volumes are not just expanding, they are accelerating, especially as visual and graphic data communication becomes established (currently trending). Cloning and copying of Big Data will expand global storage requirements exponentially. Enterprises will recognize the impracticality of this model and support industry standards that provide a robust and accessible information environment.
- As enterprises cross into crowd-sourcing and collaboration in the public domain, it will become increasingly difficult and expensive to maintain private information and to integrate or cross-reference it with public Big Data. The need to go open to survive will be accompanied by the recognition that contributing private data, and potentially intellectual property, is more economic and more supportive of rapid open innovation.
The conclusion remains that one of the intrinsic attributes of Big Data is that it is, and must be maintained as, “open”.
3 comments
May 3, 2012 at 3:41 pm
Alan Berkson (@berkson0)
Colin,
Excellent post. One aspect of Big Data we discussed in The Lab is, perhaps, data for which we don’t yet have attributes, like the “trend” data for 2020F. How would that fit in, do you think?
Alan Berkson
Intelligist Group
May 4, 2012 at 3:26 pm
chopemurray
Thanks Alan. You raise an excellent point; the aetiology of trending is vital to our understanding of influence and how ideas are communicated. Evidence suggests that we are still in the early days of deconstructing trends, as the paper “Structural Trend Analysis for Online Social Networks” by Budak, Agrawal and El Abbadi indicates: http://bit.ly/Irwjhf. In particular, the observation that “…trends are time-sensitive (and) offline solutions that require a non-constant number of passes on data are impractical” suggests that the definition, communication and implementation of as-yet-unidentified standard attributes is imperative.
This gives even greater impetus for establishing an open collaborative community to progress a common understanding of these attributes so that our evolving and future information requirements can be supported and improved over time.
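To make the time-sensitivity point concrete: a single-pass trend tracker can score topics as a stream arrives, rather than re-reading the data. The following is a minimal illustrative sketch in Python, not taken from the Budak et al. paper; the class name, the exponential-decay scheme and the `half_life` parameter are all assumptions chosen for clarity.

```python
from collections import defaultdict

class DecayedTrendCounter:
    """Single-pass trend tracker: every topic's score decays with each
    new event, so recently frequent topics rise to the top without any
    second pass over the data."""

    def __init__(self, half_life=100.0):
        # Score halves after `half_life` subsequent events.
        self.decay = 0.5 ** (1.0 / half_life)
        self.scores = defaultdict(float)

    def observe(self, topic):
        # Decay all existing scores, then credit the observed topic.
        for t in self.scores:
            self.scores[t] *= self.decay
        self.scores[topic] += 1.0

    def top(self, k=3):
        # Topics ranked by current (recency-weighted) score.
        return sorted(self.scores, key=self.scores.get, reverse=True)[:k]
```

Decaying every score on each event is O(topics) and is done here only for readability; production streaming systems typically apply the decay lazily or use sketch data structures, but the single-pass property is the same.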
May 5, 2012 at 4:16 pm
Big Data: The Answer is 42 | Intelligent Catalyst
[…] One area in particular is Big Data. There have been many discussions about what exactly is Big Data, which I’m not going to get into here. Some big thinkers have weighed in on Big Data (worth it for Merv Adrian’s quote at the end) and I had an interesting discussion recently with a colleague about the definition of Big Data. […]