I’ll have an article that elaborates on some of the ramifications of data streams and data reservoirs on AITS.org, so stay tuned there. In the meantime, I’ve had many opportunities lately to focus, in a practical way, on data quality and on approaches to handling data. There is some criticism in our industry of using metaphors to describe concepts in computing.
Like any form of literature, however, there are good and bad metaphors. Opposing them in general, I think, is contrarian posing. Metaphors, after all, often allow us to discover insights into an otherwise opaque process, clarifying in our mind’s eye what is being observed by deriving similarities to something more familiar. Strong metaphors allow us to identify analogues among the phenomena being observed, providing a ready path to establishing a hypothesis. Having established that hypothesis, we can then test it to see whether the metaphor actually contributes to understanding.
I think we have a strong set of metaphors in the case of data streams and data reservoirs. So let’s define our terms.
Traditionally, a data stream in communications theory is a set of data packets transmitted in sequence. For the purposes of systems theory, a data stream is data submitted between two entities either sequentially in real time or on a regular periodic basis. A data reservoir is just what it sounds like: streams can be diverted to feed a reservoir, which collects data for a specific purpose. The reservoir is thus a repository of all data from the selected streams, and from any alternative streams, including legacy data. The usefulness of the metaphors is found in the way in which we treat these data.
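To make the metaphor concrete, here is a minimal sketch of streams being diverted into a reservoir. The class, field names, and sample records are purely illustrative assumptions, not drawn from any particular system:

```python
from dataclasses import dataclass, field


@dataclass
class DataReservoir:
    """A finite repository fed by selected streams (names are illustrative)."""
    records: list = field(default_factory=list)

    def divert(self, stream):
        # Diverting a stream appends its records to the historical repository.
        self.records.extend(stream)


# Two hypothetical periodic streams, e.g. monthly cost and schedule submissions.
cost_stream = [{"period": "2015-01", "metric": "cost", "value": 100}]
schedule_stream = [{"period": "2015-01", "metric": "schedule", "value": 0.95}]

reservoir = DataReservoir()
reservoir.divert(cost_stream)
reservoir.divert(schedule_stream)
print(len(reservoir.records))  # 2
```

The reservoir does nothing in real time; it simply accumulates whatever the selected streams deliver, which is what makes it finite and bounded by its sources.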
So, for example, in practical terms the data streams in project and business management are the artifacts that represent the work being performed. This can be data relating to planning, production, financial management and execution, earned value, scheduling, technical performance, and risk for each period of measurement. This data, then, requires real-time analysis, inference, and distribution to decision makers. Over time, this data provides trending and other important information that measures the inertia of the effort and supplies leading and predictive indicators.
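One simple way to turn periodic stream data into a trending indicator is a moving average over recent measurement periods. This is only a sketch; the cost-performance-index figures below are hypothetical:

```python
def trend(values, window=3):
    """Simple moving average over the last `window` periods:
    a basic trending indicator derived from periodic stream data."""
    recent = values[-window:]
    return sum(recent) / len(recent)


# Hypothetical cost-performance index (CPI) values by reporting period.
cpi_by_period = [1.02, 0.98, 0.95, 0.93]
print(round(trend(cpi_by_period), 3))  # 0.953
```

A declining average like this one is the kind of leading indicator the streams are meant to surface before the end-state is baked in.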
Efficiencies can be realized by identifying duplication in data streams, especially when the data feeding the streams is derived from a common dataset. Streams can be modified to expand the data that is submitted, eliminating alternative streams that add little value on their own; that is, streams that are stovepiped and suboptimized at the expense of the overall system’s efficiency.
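Identifying duplication across streams drawn from a common dataset can be pictured as merging on a shared key. The (period, metric) key below is an assumption chosen for illustration:

```python
def merge_streams(*streams):
    """Merge several streams, dropping records that duplicate the same
    (period, metric) key: the common-dataset case described above."""
    seen = set()
    merged = []
    for stream in streams:
        for record in stream:
            key = (record["period"], record["metric"])
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


# Two hypothetical streams that overlap because both derive cost data
# from the same underlying dataset.
stream_a = [{"period": "2015-01", "metric": "cost", "value": 100}]
stream_b = [{"period": "2015-01", "metric": "cost", "value": 100},
            {"period": "2015-01", "metric": "risk", "value": 0.2}]

print(len(merge_streams(stream_a, stream_b)))  # 2
```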
In the case of data reservoirs, what these contain is somewhat different from the large repositories of metadata that must be mined. By contrast, a data reservoir contains a finite set of data, since what is in the reservoir is derived from the streams. As such, these reservoirs contain much of the essential historical information needed to derive parametrics, and sufficient data from which to derive organizational knowledge and lessons learned. Rather than processing data in real time, a data reservoir is handled by appending to the historical record of existing efforts, providing a fuller picture of performance and trending, and a record of closed-out efforts that can inform systems approaches to similar future efforts. While not quite fitting into the category of Big Data, such reservoirs can probably best be classified as Small Big Data.
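As a sketch of deriving a parametric from the reservoir’s historical record, consider a cost-growth factor computed from closed-out efforts. All names and figures here are invented for illustration:

```python
# Closed-out efforts retained in the reservoir; figures are illustrative.
closed_efforts = [
    {"name": "Effort A", "budget": 100, "final_cost": 120},
    {"name": "Effort B", "budget": 200, "final_cost": 210},
]


def cost_growth_factor(efforts):
    """Parametric derived from the historical record: the average ratio
    of final cost to original budget across closed-out efforts."""
    ratios = [e["final_cost"] / e["budget"] for e in efforts]
    return sum(ratios) / len(ratios)


print(round(cost_growth_factor(closed_efforts), 3))  # 1.125
```

Because the reservoir is finite and its sources are known, a parametric like this is grounded in a complete, bounded history rather than sampled from an open-ended data lake.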
Efficiencies from the streams into the reservoir can be realized if the data can be further definitized through the application of structured schemas, combined with flexible Data Exchange Instructions (DEIs) that standardize the lexicon, allowing for both data normalization and rationalization. Still, there may be data that is not incorporated into such schemas, especially if legacy metadata predates the schema specified for the applicable data streams. In this case, data rationalization must be undertaken, combined with standard APIs, to provide consistency and structure to the data. Even then, given that the data set is finite and specific to a system that uses a fairly standard lexicon, such rationalization will yield valid results.
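A rough illustration of rationalizing legacy data against a schema and a standard lexicon might look like the following. The lexicon mapping and field names are assumptions for the sketch, not an actual DEI:

```python
# Hypothetical lexicon mapping legacy field names to the schema's terms.
LEXICON = {"actuals": "final_cost", "baseline": "budget"}
SCHEMA_FIELDS = {"budget", "final_cost", "period"}


def rationalize(record):
    """Rename legacy fields per the lexicon, then keep only schema fields,
    normalizing a legacy record to the standard structure."""
    renamed = {LEXICON.get(key, key): value for key, value in record.items()}
    return {key: value for key, value in renamed.items() if key in SCHEMA_FIELDS}


legacy = {"baseline": 100, "actuals": 120, "vendor_note": "n/a", "period": "2014-12"}
print(rationalize(legacy))
# {'budget': 100, 'final_cost': 120, 'period': '2014-12'}
```

Because the lexicon for such a system is fairly standard, a mapping table like this covers most of the legacy vocabulary, which is why the rationalized results remain valid.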
Needless to say, applications that are agnostic to data and that provide on-the-fly flexibility in UI configuration by calling standard operating environment objects (also known as fourth generation software) have the greatest applicability to this new data paradigm. This is because they most effectively leverage both the flexibility needed as the data streams evolve toward maximum efficiency, and the lessons learned that emerge when data previously walled off from complementary data is integrated, identifying and clarifying systems interdependencies.