Do You Believe in Magic? — Big Data, Buzz Phrases, and Keeping Feet Planted Firmly on the Ground

My alternative title for this post was “Money for Nothing,” which is along the same lines.  I have been engaged in discussions regarding Big Data, which has become a bit of a buzz phrase of late in both business and government.  Under the current drive to maximize the value of existing data, every data source, stream, lake, and repository (and the list goes on) has been subsumed by this concept.  So, at the risk of being a killjoy, let me point out that not all large collections of data are “Big Data.”  Furthermore, once a category of data gets tagged as Big Data, the discussion of how to approach and use that data seems to depart further and further from the world of reality.  So for those of you who find yourselves in this situation, let’s take a collective deep breath and engage our critical thinking skills.

So what exactly is Big Data?  Quite simply, as noted in this article in Forbes by Gil Press, the term is a relative one, but it generally means, in the words of a McKinsey study, “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”  This subjective definition is a purposeful one, since Moore’s Law tends to change what is viewed as simply digital data as opposed to big data.  I would add some characteristics to assist in defining the term based on present challenges.  Big data at first approach tends to be unstructured, variable in format, and not adherent to a schema.  Thus, size is not the only criterion for the definition; so is the chaotic nature of the data that makes it hard to approach.  For once we find a standard means of normalizing, rationalizing, or converting digital data, it is no longer beyond the ability of standard database tools to use it effectively.  Furthermore, the very process of taming it thereby renders it non-big data or, if it is an exceedingly large dataset, perhaps “small big data.”

Thus, having defined our terms and the attributes of the challenge we are engaging, we now can eliminate many of the suppositions that are floating around in organizations.  For example, there is a meme that I have come upon that asserts that disparate application file data can simply be broken down into its elements and placed into database tables for easy access by analytical solutions to derive useful metrics.  This is true in some ways but both wrong and dangerous in its apparent simplicity.  For there are many steps missing in this process.

Let’s take, for example, the least complex case: structured data submitted as proprietary files.  On its surface this is an easy challenge to solve.  Once someone begins breaking the data into its constituent parts, however, greater complexity is found, since the indexing inherent to data interrelationships and structures is necessary for its effective use.  Furthermore, there will be corruption and non-standard use of user-defined and custom fields, especially in data that has not undergone domain scrutiny.  The originating third-party software is pre-wired to be able to extract this data properly.  But relying on it means using and learning multiple proprietary applications, with their concomitant idiosyncrasies, issues of sustainability, and overhead; such a multivariate approach defeats the goal of establishing a data repository in the first place by keeping the data in silos, preventing integration.  The indexing across, say, financial systems or planning systems is different.  So how do we solve this issue?

In approaching big data, or small big data, or datasets from disparate sources, the core concept in realizing return on investment and finding new insights is known as Knowledge Discovery in Databases, or KDD.  This was all the rage about 20 years ago, but its tenets are solid and proven and have evolved with advances in technology.  Back then, the means of achieving KDD from existing databases was the use of data mining.

The necessary first step in the data mining approach is pre-processing of data.  That is, once you get the data into tables it is all flat.  Every piece of data is the same–it is all noise.  We must add significance and structure to that data.  Keep in mind that we live in this universe, so there is a cost to every effort known as entropy.  Computing is as close as you’ll get to defeating entropy, but only because it has shifted the burden somewhere else.  For large datasets it is pushed to pre-processing, either manual or automated.  In the brute force world of data mining, we hire data scientists to pre-process the data, find commonalities, and index it.  So let’s review this “automated” process.  We take a lot of data and then add a labor-intensive manual effort to it in order to derive KDD.  Hmmm..  There may be ROI there, or there may not be.

But twenty years is a long time and we do have alternatives, especially in using Fourth Generation software that is focused on data usage without the limitations of hard-coded “tools.”  These alternatives apply when using data on existing databases, even disparate databases, or file data structured under a schema with well-defined data exchange instructions that allow for a consistent manner of posting that data to database tables. The approach in this case is to use APIs.  The API, like OLE DB or the older ODBC, can be used to read and leverage the relative indexing of the data.  It will still require some code to point it in the right place and “tell” the solution how to use and structure the data, and its interrelationship to everything else.  But at least we have a means for reducing the cost associated with pre-processing.  Note that we are, in effect, still pre-processing data.  We just let the CPU do the grunt work for us, oftentimes very quickly, while giving us control over the decision of relative significance.
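As a minimal sketch of what “reading the relative indexing” through an API can look like, the Python below asks a database for its own declared relationships rather than having a person rediscover them by hand.  It is an illustration under stated assumptions, not a description of any particular product: Python’s built-in sqlite3 module stands in for an OLE DB or ODBC connection, and the table and column names (wbs, schedule_activity, cost_record) are hypothetical.

```python
import sqlite3


def describe_relationships(conn):
    """Return {table: [(column, referenced_table, referenced_column), ...]}."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    relationships = {}
    for table in tables:
        # Each row is: id, seq, table, from, to, on_update, on_delete, match
        rows = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
        relationships[table] = [(r[3], r[2], r[4]) for r in rows]
    return relationships


# Hypothetical project data: schedule and cost both hang off a common WBS key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wbs (wbs_id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE schedule_activity (
    activity_id TEXT PRIMARY KEY,
    wbs_id TEXT REFERENCES wbs(wbs_id),
    finish_date TEXT);
CREATE TABLE cost_record (
    record_id INTEGER PRIMARY KEY,
    wbs_id TEXT REFERENCES wbs(wbs_id),
    actual_cost REAL);
""")

relationships = describe_relationships(conn)
for table, links in relationships.items():
    for column, ref_table, ref_column in links:
        print(f"{table}.{column} -> {ref_table}.{ref_column}")
```

The code still has to be pointed at the right place, and someone with domain knowledge still decides which of the discovered relationships matter.  The CPU only does the grunt work of finding them.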

So now let’s take the meme that I described above and add greater complexity to it.  You have all kinds of data coming into the stream in all kinds of formats, including specialized XML, open, black-boxed data, and closed proprietary files.  This data is non-structured.  It is then processed and “dumped” into a non-relational (NoSQL) database.  How do we approach this data?  The answer has been to return to a hybrid of pre-processing, data mining, and the use of APIs.  But note that there is no silver bullet here.  These efforts are long-term and extremely labor intensive at this point.  There is no magic.  I have heard time and again from decision makers the question: “Why can’t we just dump the data into a database to solve all our problems?”  You can’t, unless you’re ready for a significant programmatic investment in data scientists, database engineers, and other IT personnel.  In the end, what they deploy, when it gets deployed, may very well be obsolete and have wasted a good deal of money.

So, once again, what are the proper alternatives?  In my experience we need to get back to first principles.  Each business and industry has commonalities that transcend proprietary software limitations by virtue of the professions and disciplines that comprise them.  Thus, it is domain expertise to the specific business that drives the solution.  For example, in program and project management (you knew I was going to come back there) a schedule is a schedule, EVM is EVM, financial management is financial management.

Software manufacturers will, apart from issues regarding relative ease of use, scalability, flexibility, and functionality, attempt to defend their space by establishing proprietary lexicons and data structures.  Not being open, while not serving the needs of customers, helps incumbents avoid disruption from new entries.  But there often comes a time when it is apparent that these proprietary definitions are only euphemisms for a well-understood concept in a discipline or profession.  Cat = Feline.  Dog = Canine.
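The “Cat = Feline” point can be sketched in a few lines of Python.  This is only an illustration: the two vendors and the second vendor’s field names are invented, while BCWP, ACWP, and BCWS are the traditional EVM abbreviations for earned value, actual cost, and planned value.  Two proprietary vocabularies are translated into one common lexicon, and anything that cannot be translated is set aside for a domain expert rather than silently dropped.

```python
# Hypothetical vendor vocabularies mapped to one common lexicon.
VENDOR_LEXICONS = {
    "vendor_a": {"BCWP": "earned_value", "ACWP": "actual_cost",
                 "BCWS": "planned_value"},
    "vendor_b": {"EarnedAmt": "earned_value", "ActualAmt": "actual_cost",
                 "BudgetAmt": "planned_value"},
}


def to_common_terms(record, vendor):
    """Split a record into (translated fields, fields with no known mapping)."""
    lexicon = VENDOR_LEXICONS[vendor]
    translated, unmapped = {}, {}
    for field, value in record.items():
        if field in lexicon:
            translated[lexicon[field]] = value
        else:
            unmapped[field] = value  # e.g. user-defined fields needing review
    return translated, unmapped


record_a, unmapped_a = to_common_terms(
    {"BCWP": 120.0, "ACWP": 135.0, "BCWS": 150.0}, "vendor_a")
record_b, unmapped_b = to_common_terms(
    {"EarnedAmt": 120.0, "ActualAmt": 135.0, "BudgetAmt": 150.0,
     "UDF_07": "x"}, "vendor_b")

print(record_a == record_b)  # the two vendors were describing the same thing
print(unmapped_b)            # the custom field is left for domain scrutiny
```

The mapping table itself is where the domain expertise lives; the code around it is trivial by comparison.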

For a cohesive and well-defined industry the solution is to make all data within particular domains open.  This is accomplished through the acceptance and establishment of a standard schema.  For less cohesive industries, but where the data or incumbents through the use of common principles have essentially created a de facto schema, APIs are the way to extract this data for use in analytics.  This approach has been applied on a broader basis for the incorporation of machine data and signatures in social networks.  For closed or black-boxed data, the business or industry will need to execute gap analysis in order to decide if database access to such legacy data is truly essential to its business, or given specification for a more open standard from “time-now” will eventually work out suboptimization in data.

Most important of all, in the end our results must provide metrics and visualizations that are understandable, valid, important, material, and right.

The Water is Wide — Data Streams and Data Reservoirs

I’ll have an article that elaborates on some of the ramifications of data streams and data reservoirs on AITS.org, so stay tuned there.  In the meantime, I’ve had a lot of opportunities lately, in a practical way, to focus on data quality and approaches to data.  There is some criticism in our industry about using metaphors to describe concepts in computing.

As in any form of literature, however, there are good and bad metaphors.  Opposing them in general, I think, is contrarian posing.  Metaphors, after all, often allow us to discover insights into an otherwise opaque process, clarifying in our mind’s eye what is being observed through the process of deriving similarities to something more familiar.  Strong metaphors allow us to identify analogues among the phenomena being observed, providing a ready path to establishing a hypothesis.  Once the metaphor has served this purpose, we can test that hypothesis to see if the metaphor contributes to our understanding.

I think we have a strong set of metaphors in the case of data streams and data reservoirs.  So let’s define our terms.

Traditionally a data stream in communications theory is a set of data packets that are submitted in sequence.  For the purpose of systems theory, a data stream is data that is submitted between two entities either sequentially in real time or on a regular periodic basis.  A data reservoir is just what it sounds like it is.  Streams can be diverted to feed a reservoir, which holds data for a specific purpose.  Thus, the reservoir is a repository of all data from the selected streams, and any alternative streams, including legacy data.  The usefulness of the metaphors is found in the way in which we treat these data.

So, for example, data streams in practical terms in project and business management are the artifacts that represent the work that is being performed.  This can be data relating to planning, production, financial management and execution, earned value, scheduling, technical performance, and risk for each period of measurement.  This data, then, requires real time analysis, inference, and distribution to decision makers.  Over time, this data provides trending and other important information that measures the inertia of the efforts in providing leading and predictive indicators.

Efficiencies can be realized by identifying duplication in data streams, especially if the data being provided into the streams is derived from a common dataset.  Streams can be modified to expand the data that is submitted, so as to eliminate alternative streams of data that add little value on their own, that is, streams that are stovepiped and suboptimized contrary to the maximum efficiency of the system.
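To make the idea of duplication across streams concrete, here is a rough Python sketch.  The stream names, field names, and the choice of wbs_id and period as shared keys are all hypothetical, and the check compares field names only, not values, so it flags candidates for consolidation rather than proving redundancy.

```python
from collections import defaultdict


def duplicated_fields(streams, key_fields):
    """Map each non-key field carried by more than one stream to its streams."""
    carriers = defaultdict(set)
    for stream_name, records in streams.items():
        for record in records:
            for field in record:
                if field not in key_fields:
                    carriers[field].add(stream_name)
    return {field: sorted(names)
            for field, names in carriers.items() if len(names) > 1}


def redundant_streams(streams, key_fields):
    """List streams whose every non-key field is also carried elsewhere."""
    duplicated = duplicated_fields(streams, key_fields)
    redundant = []
    for stream_name, records in streams.items():
        payload = {f for r in records for f in r if f not in key_fields}
        if payload and all(field in duplicated for field in payload):
            redundant.append(stream_name)
    return sorted(redundant)


# Hypothetical monthly submissions, all drawn from the same underlying dataset.
streams = {
    "cost_report": [
        {"wbs_id": "1.1", "period": "2015-03", "actual_cost": 135.0}],
    "evm_report": [
        {"wbs_id": "1.1", "period": "2015-03", "actual_cost": 135.0,
         "earned_value": 120.0}],
    "schedule_status": [
        {"wbs_id": "1.1", "period": "2015-03", "percent_complete": 0.8}],
}
keys = {"wbs_id", "period"}

overlap = duplicated_fields(streams, keys)
candidates = redundant_streams(streams, keys)
print(overlap)
print(candidates)
```

In this toy case the cost report adds nothing that the EVM report does not already carry, which makes it a candidate for elimination once the remaining stream is confirmed to be sufficient.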

In the case of data reservoirs, what these contain is somewhat different from the large repositories of metadata that must be mined.  On the contrary, a data reservoir contains a finite set of data, since what is contained in the reservoir is derived from the streams.  As such, these reservoirs contain much essential historical information from which to derive parametrics, and sufficient data from which to derive organizational knowledge and lessons learned.  Rather than being processed in real time, data reservoirs are handled so as to append the historical record of existing efforts, providing a fuller picture of performance and trending, and the record of closed out efforts that can inform systems approaches to similar future efforts.  While not quite fitting into the category of Big Data, such reservoirs can probably best be classified as Small Big Data.

Efficiencies from the streams into the reservoir can be realized if the data can be further definitized through the application of structured schemas, combined with flexible Data Exchange Instructions (DEIs) that standardize the lexicon, allowing for both data normalization and rationalization.  Still, there may be data that is not incorporated into such schemas, especially if the legacy metadata predates the schema specified for the applicable data streams.  In this case, data rationalization must be undertaken, combined with standard APIs, to provide consistency and structure to the data.  Even in this case, however, given that the set is finite and the data is specific to a system that uses a fairly standard lexicon, such rationalization will yield results that are valid.
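A small Python sketch may help show how a schema governs what flows from a stream into the reservoir.  Everything here is assumed for illustration: the three-field schema, the field names, and the idea of representing the reservoir as a simple list.  Records that conform are normalized to the schema’s types and appended; records that do not, such as legacy submissions that predate the schema, are returned for separate rationalization.

```python
# Hypothetical schema: field name -> type each value is normalized to.
SCHEMA = {"wbs_id": str, "period": str, "actual_cost": float}


def load_into_reservoir(reservoir, records, schema=SCHEMA):
    """Append conforming records to the reservoir; return the ones set aside."""
    set_aside = []
    for record in records:
        if set(record) != set(schema):
            set_aside.append(record)  # wrong or legacy field names
            continue
        try:
            row = {field: kind(record[field]) for field, kind in schema.items()}
        except (TypeError, ValueError):
            set_aside.append(record)  # right fields, unusable values
            continue
        reservoir.append(row)
    return set_aside


reservoir = []
incoming = [
    {"wbs_id": "1.1", "period": "2015-03", "actual_cost": "135.0"},
    {"wbs_id": "1.2", "period": "2015-03", "actual_cost": "n/a"},
    {"WBS": "1.3", "Month": "2015-03", "Cost": 90},  # predates the schema
]
set_aside = load_into_reservoir(reservoir, incoming)

print(reservoir)
print(len(set_aside))
```

The records that are set aside are not discarded.  They are the finite, well-bounded remainder on which the rationalization effort described above has to be spent.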

Needless to say, applications that are agnostic to data and that provide on-the-fly flexibility in UI configuration by calling standard operating environment objects–also known as fourth generation software–have the greatest applicability to this new data paradigm.  This is because they most effectively leverage both the flexibility in the evolution of the data streams to reach maximum efficiency and the lessons learned that are derived from the integration of data that was previously walled off from complementary data, an integration that will identify and clarify systems interdependencies.

 

One-Trick Pony — Software apps and the new Project Management paradigm

Recently I have been engaged in an exploration and discussion regarding the utilization of large amounts of data and how applications derive importance from that data.  In an on-line discussion with the ever insightful Dave Gordon, I first postulated that we need to transition into a world where certain classes of data are open so that the qualitative content can be normalized.  This is what for many years was called the Integrated Digital Environment (IDE for short).  Dave responded with his own post at the AITS.org blogging alliance, countering that while such standards are necessary in very specific and limited applications, modern APIs provide most of the solution.  I then responded directly to Dave here, countering that IDE is nothing more than data neutrality.  Then, also at AITS.org, I expanded on what I proposed to be a general approach in understanding big data, noting the dichotomy in the software approaches that organize the external characteristics of the data to generalize systems and note trends, as opposed to those that are focused on the qualitative content within the data.

It should come as no surprise then, given these differences in approaching data, that we also find similar differences in the nature of applications that are found on the market.  With the recent advent of on-line and hosted solutions, there are literally thousands of applications in some categories of software that propose to do one thing with data, or that are focused one-trick pony applications that can be mixed and matched to somehow provide an integrated solution.

There are several problems with this sudden explosion of applications of this nature.

The first is in the very nature of the explosion.  This is a classic tech bubble, albeit limited to a particular segment of the software market, and it will soon burst.  As soon as consumers find that all of that information traveling over the web with the most minimal of protections is compromised by the next trophy hack, or that too many software providers have entered the market prematurely–not understanding the full needs of their targeted verticals–it will hit like the last one in 2000.  It only requires a precipitating event that triggers a tipping point.

You don’t have to take my word for it.  Just type a favorite keyword into your browser now (and I hope you’re using a VPN doing it) for a type of application for which you have a need–let’s say “knowledge base” or “software ticket systems.”  What you will find is that there are literally hundreds if not thousands of apps built for this function.  You cannot test them all.  Basic information economics, however, dictates that you must invest some effort in understanding the capabilities and limitations of the systems on the market.  Surely there are a couple of winners out there.  But basic economics also dictates that 95% of those presently in the market will be gone in short order.  Being the “best” or the “best value” does not always win in this winnowing out.  Certainly chance, the vagaries of your standing in the search engine results, industry contacts–virtually any number of factors–will determine who is still standing and who is gone a year from now.

Aside from this obvious problem with the bubble itself, the approach of the application makers harkens back to an earlier generation of one-off applications that attempt to achieve integration through marketing while actually achieving, at best, only old-fashioned interfacing.  In the world of project management, for example, organizations can little afford to revert to the division of labor, which is what would be required to align with these approaches in software design.  It’s almost as if, having made their money in an earlier time, software entrepreneurs cannot extend themselves beyond their comfort zones in taking advantage of the last TEN software generations that provide new, more flexible approaches to data optimization.  All they can think to do is party like it’s 1995.

For the new paradigm in project management is to get beyond the traditional division of labor.  For example, is scheduling such a highly specialized discipline, rising to the level of a profession, that it is separate from all of the other aspects of project management?  Of course not.  Scheduling is a discipline–a sub-specialty actually–that is inextricably linked to all other aspects of project management in a continuum.  The artifacts of the process of establishing project systems and controls constitute the project itself.

No doubt there are entities and companies that still ostensibly organize themselves into specialties as they did twenty years ago: cost analysts, schedule analysts, risk management specialists, among others.  But given that the information from these systems (schedule, cost management, project financial management, risk management, technical performance, and all the rest) can be integrated at the appropriate level of their interrelationships to provide us a cohesive, holistic view of the complex system that we call a project, is such division still necessary?  In practice the industry has already moved to position itself for integration, realizing the urgency of making the shift.

For example, to utilize an application to query cost management information in 1995 was a significant achievement during the first wave of software deployment that mimicked the division of labor.  In 2015, not so much.  Introducing a one-trick pony EVM “tool” in 2015 is laziness–hoping to turn back the clock in ignoring the obsolescence of such an approach–regardless of which slick new user interface is selected.

I recently attended a project management meeting of senior government and industry representatives.  During one of my side sessions I heard a colleague propose the discipline of Project Management Analyst in lieu of previously stove-piped specialties.  His proposal is a breath of fresh air in an industry that develops and manufactures the latest aircraft and space technology, but has hobbled itself with systems and procedures designed for an earlier era that no longer align with the needs of doing business.  I believe the timely deployment of systems has suffered as a result during this period of transition.

Software must lead and accelerate the transition to the new integration paradigm.

Thus, in 2015 the choice is not between applications that adhere to conventions of data neutrality and those that utilize data access via APIs; it is in favor of applications that do both.

It is not between different hard-coded applications that provide the old “what-you-see-is-what-you-get” approach.  It is instead between such limited hard-coded applications and those that provide flexibility, so that business managers can choose among a nearly unlimited palette of choices of how and which data, converted into information, is available to the user or classes of user based on their role and need to know, aggregated at the appropriate level of detail for the consumer to derive significance from the information being presented.

It is not between “best-of-breed” and “mix-and-match” solutions that leverage interfaces to achieve integration.  It is instead between such solution “consortiums” that drive up implementation and sustainment costs, bringing with them high overhead, against those that achieve integration by leveraging the source of the data itself, reducing the number of applications that need to be managed, allowing data to be enriched in an open and flexible environment, achieving transformation into useful information.

Finally, the choice isn’t among applications that save their attributes in a proprietary format so that the customer must commit themselves to a proprietary solution.  Instead, it is between such restrictive applications and those that open up data access, clearly establishing that it is the consumer that owns the data.

Note: I have made minor changes from the original version of this post for purposes of clarification.