The Big (D) — Ways of Looking at Big Data

Recently I have been involved in several efforts regarding what is often referred to as Big Data, but of a particular kind.  Oftentimes the term, first defined by Doug Laney (now at Gartner), is taken to mean the monetization of collected consumer information, allowing businesses to focus their advertising, marketing, and product development.  More generally, however, big data distinguishes itself from normal relational database management by its volume, variety, and velocity (Laney's original three Vs), to which variability and complexity are often added.  The Wikipedia definition differs slightly, adding the attribute of veracity.

Recently, in a blog post at AITS.org, I discussed the role of data normalization, which prompted some discussion via e-mail.  Under the rubric of an Integrated Digital Environment (IDE), I have identified the need to find the lexicon of big data, based on its purpose, in order to solve the issues inherent in it, particularly those of variability and complexity.  The questions raised were whether the IDE concept specified the solutions, thereby limiting its sustainability as a concept.  It was certainly not my intention to specify the method of normalization and integration.  I proposed what current technology allows and left open other methods, and I am keen to hear any ideas on alternatives that will push the technology further.  But my concepts of how best to approach the issue of Big Data are certainly influenced by the data I encounter.

For example, the world I inhabit is focused on project management.  Traditional “tools” have focused on very small slices of the overall data being collected in order to derive insight into performance, progress, and effectiveness.  Normalizing the same ‘kind’ of data coming from different proprietary solutions improves an organization's ability to optimize the use of data and turn it into information.  Breaking down proprietary barriers also reduces the need to invest in multiple information solutions, along with the overhead associated with them.  But in order to achieve this ‘agnosticism’, some approaches are more effective than others.
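
To make that concrete, here is a minimal sketch of the kind of normalization I have in mind.  The two source tools and their export formats are hypothetical; the earned value labels are simply stand-ins for any two proprietary vocabularies that mean the same thing.

```python
# A minimal, hypothetical sketch: two project management tools export the same
# 'kind' of record under different proprietary field names, and a simple mapping
# normalizes both into a common, tool-agnostic vocabulary.

FIELD_MAP = {
    "tool_a": {"BCWS": "planned_value", "BCWP": "earned_value", "ACWP": "actual_cost"},
    "tool_b": {"PlannedCost": "planned_value", "EarnedCost": "earned_value", "Actuals": "actual_cost"},
}

def normalize(record: dict, source: str) -> dict:
    """Rename one exported record's fields into the common vocabulary."""
    mapping = FIELD_MAP[source]
    return {mapping.get(name, name): value for name, value in record.items()}

# The same 'kind' of data, labeled two different ways, becomes directly comparable.
a = normalize({"BCWS": 100.0, "BCWP": 90.0, "ACWP": 95.0}, "tool_a")
b = normalize({"PlannedCost": 100.0, "EarnedCost": 90.0, "Actuals": 95.0}, "tool_b")
assert a == b
```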

Information, when it comes to big data, is usually of two types: information derived from the patterns found in the various types of data, and the information content inherent in the data itself.

The first does not necessarily require normalization, since it simply requires access to the data through standard interfaces designed to scale.  The second, which is the data I most often deal with, can also be accessed through interfaces, but then the value of the information must be derived either during or just after retrieval.  That is, either the interface must be made ‘smart’, or the raw data, once delivered, must be properly identified and processed in order to derive its significance.  The first method is easy when technology solutions play in their own sandbox and can readily interpret their own specialized terminology; not so good for the agnostic, non-proprietary ideal.  The second method prescribes a neutral schema in which proprietary terminology is eschewed in favor of clarity, so that we can easily equate apples with apples.  Some processing must still occur during or just after retrieval, but not to the extent required by direct access to the data.
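
As a rough illustration of the second type of information, the sketch below assumes records have already been retrieved through some standard interface and normalized into a neutral vocabulary; only the step just after retrieval turns the raw values into information.  The field names are my own assumption, but the ratios are the standard earned value performance indices.

```python
# A sketch of deriving significance just after retrieval, assuming records are
# already expressed in a neutral vocabulary (planned_value, earned_value, actual_cost).

from typing import Iterable

def derive_performance(records: Iterable[dict]) -> list[dict]:
    """Post-retrieval step: compute standard earned value indices for each record."""
    derived = []
    for r in records:
        pv, ev, ac = r["planned_value"], r["earned_value"], r["actual_cost"]
        derived.append({
            **r,
            "cost_performance_index": ev / ac if ac else None,      # CPI = EV / AC
            "schedule_performance_index": ev / pv if pv else None,  # SPI = EV / PV
        })
    return derived

print(derive_performance([{"planned_value": 100.0, "earned_value": 90.0, "actual_cost": 95.0}]))
```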

In the real world, however, this is not an either-or proposition.  Data can be very discrete, even when it is part of a “big data” environment.  For example, the time-phased systems used to manage projects use different terminologies for the same normalized concept represented by the data.  Oftentimes direct access to some of these elements is impossible because of the complexity of the algorithms that derive their values.  In those cases a schema, applied after retrieval and integration, resolves any variation in properly representing the interrelationships between the elements, especially to ensure accurate representation of time-phasing.  Other data is more ‘flat’ and, by itself, can inform and be appended to more complex elements of data without requiring extensive processing.  In those cases, standard interfaces scaled to the task at hand work just fine.
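
The distinction between time-phased and ‘flat’ data might look something like the following sketch.  The structures are hypothetical, but the point is that the time-phased rows keep their period boundaries intact through integration, while the flat lookup simply appends a descriptive attribute to them without further processing.

```python
# A hypothetical sketch: time-phased values are kept as explicit (period, value)
# rows so their interrelationships survive integration, while a 'flat' lookup
# table merely appends a descriptive attribute to them.

from datetime import date

# Time-phased data: the period boundaries and their order matter.
time_phased = [
    {"task_id": "1.1", "period_start": date(2015, 1, 1), "planned_value": 50.0},
    {"task_id": "1.1", "period_start": date(2015, 2, 1), "planned_value": 75.0},
]

# 'Flat' data: one row per key, no time dimension.
task_names = {"1.1": "Requirements Analysis"}

# Appending the flat data is a simple join; the time-phased structure is untouched.
integrated = [{**row, "task_name": task_names[row["task_id"]]} for row in time_phased]
print(integrated)
```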

Thus, we must utilize different methodologies based on the characteristics of the data; one size does not fit all.  Instead, what we need in approaching big data is a flexible solution that utilizes all of the methods at hand: schemas, normalization of data after retrieval, direct access to tables (for example through OLE DB), programming interfaces (APIs), and calls to web services.  Some of these methods are application neutral; others utilize the proprietary interfaces inherent in the data itself to take it outside of its sandbox.  Combined with a flexible user interface, this new paradigm moves away from best-of-breed or stovepiped “tools” that perform only one primary function in interpreting a restricted set of information.
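
To close, here is a sketch of what such a flexible retrieval layer might look like, with each access method behind a common dispatch.  The handler bodies are placeholders rather than real connectors; in practice each would wrap a neutral schema reader, a direct table query (for example through OLE DB or ODBC), a tool's own API, or a web service call.

```python
# A sketch of the 'use every method at hand' idea: one retrieval layer dispatches
# to whichever access method suits the characteristics of a given data source.
# The handlers below are hypothetical placeholders.

from typing import Callable

def read_schema_file(source: dict) -> list[dict]:
    return []  # e.g., parse and validate a neutral XML/JSON schema file

def query_tables(source: dict) -> list[dict]:
    return []  # e.g., direct table access through an OLE DB/ODBC driver

def call_api(source: dict) -> list[dict]:
    return []  # e.g., a proprietary tool's programming interface

def call_web_service(source: dict) -> list[dict]:
    return []  # e.g., a hosted web service endpoint

HANDLERS: dict[str, Callable[[dict], list[dict]]] = {
    "schema_file": read_schema_file,
    "direct_table": query_tables,
    "api": call_api,
    "web_service": call_web_service,
}

def retrieve(source: dict) -> list[dict]:
    """Dispatch to the access method declared by the source."""
    return HANDLERS[source["access_method"]](source)
```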