How does one define Big Data, and is “big” the best adjective to describe it? There are many voices trying to come up with answers to this topical question. Gartner and Forrester both agree that a better word would be “extreme”. Between them, the two major consulting firms have identified four characteristics that “extreme” can qualify. They agree on three: volume, velocity and variety. On the fourth they diverge: Forrester postulates variability, while Gartner prefers complexity. These are reasonable contributions and may form the foundation for the definition of big data that the Open Methodology Group is seeking to create within its open architecture, MIKE2.0.
However, the definition still falls short of the mark, as any combination of these characteristics can be found in many of today’s large data warehouses and parallel databases operating in outsourced or in-house data centers. No matter how extreme the data, Moore’s Law* and technology will eventually, asymptotically, accommodate and govern it. I could suggest that the missing attribute is volatility, or the rate of change, but that too can be applied to current serviced capabilities. Another important attribute, all too often missed by analysts, is that Big Data is world data: data in many formats and many languages, contributed by almost every nationality and culture, together with the noise generated by the systems and devices they employ.
Yet the characteristic that seems to address this definition shortfall best is openness, where openness means accessible (addressable or available through an API), shareable and unrestricted. This may be controversial, as it raises some key issues around privacy, property and rights, but these problems still need to be resolved for big data independently of any definition. Why openness? Here are six observations:
- Any data that is not open, i.e. that is private, covert or obscured, is by default protected and confined to the private architecture and data model(s) of that closed system. While it may share many of the attributes of “big data”, and possibly the same data sources, at best it can only represent a subset of big data as a whole.
- Big data does not and cannot have a single owner, supplier or agent (heed well, ye walled gardens). It is the sum of many parts, including, amongst others, social media streams, communication channels and complex signal networks.
- There will never be a single Big Data analytic application or engine, but there will be a multitude of them, each working on different or slightly different subsets of the whole.
- Big Data analysis will demand multi-pass processing, including some form of abstract notation. Private systems will develop their own notation, but public notation standards will evolve, and open notation standards will improve the speed and consistency of analysis.
- Big Data volumes are not just expanding, they are accelerating, especially as visual/graphic data communication becomes established (currently trending). Cloning and copying of Big Data will expand global storage requirements exponentially. Enterprises will recognize the impractical economics of this model and support industry standards that provide a robust and accessible information environment.
- As enterprises cross into crowd-sourcing and collaboration in the public domain, it will be increasingly difficult and expensive to maintain private information and to integrate or cross-reference it with public Big Data. The need to go open to survive will be accompanied by the recognition that contributing private data, and potentially intellectual property, is more economical and more supportive of rapid open innovation.
The conclusion remains that one of the intrinsic attributes of Big Data is that it is, and must be maintained as, “open”.
- Gartner and Forrester “Nearly” Agree on Extreme / Big Data
- Single-atom transistor is ‘end of Moore’s Law’ and ‘beginning of quantum computing’.