Potato, Potahto, Tomato, Tomahto: Data Normalization vs. Standardization, Why the Difference Matters

In my vocation I run a technology company devoted to program management solutions that is primarily concerned with taking data and converting it into information to establish a knowledge-based environment. Similarly, in my avocation I deal with the meaning of information and how to turn it into insight and knowledge. This latter activity concerns the subject areas of history, sociology, and science.

In my travels just prior to and since the New Year, I have come upon a number of experts and fellow enthusiasts in these respective fields. The overwhelming majority of these encounters have been productive, educational, and cordial. We respectfully disagree in some cases about the significance of a particular approach, or about governance when it comes to project and program management policy, but generally there is a great deal of agreement, particularly on basic facts and terminology. But some areas of disagreement–particularly those that come from left field–tend to be the most interesting because they create an opportunity to clarify a larger issue.

In a recent venue I encountered just such an example, centered on the use of the phrase “data normalization.” The objection was that “data normalization” suggested some statistical methodology for reconciling data into a standard schema. Instead, it was suggested, the term “data standardization” was more appropriate.

These phrases do not describe the same thing, but they do describe processes that are symbiotic, not mutually exclusive. So what about data normalization? No doubt there is a statistical use of the term, but here we are dealing with the definition as used in digital technology, just as “standardization” was suggested in that same context. There are many examples of technical terminology that do not have the same meaning when used in different contexts. Here is the definition of normalization as applied to data science from Techopedia, which is the proper use of the term in this case:

Normalization is the process of reorganizing data in a database so that it meets two basic requirements: (1) There is no redundancy of data (all data is stored in only one place), and (2) data dependencies are logical (all related data items are stored together). Normalization is important for many reasons, but chiefly because it allows databases to take up as little disk space as possible, resulting in increased performance.

Normalization is also known as data normalization.
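To make the definition concrete, here is a minimal sketch–using entirely invented project records, not any particular schema–in which redundant flat rows are reorganized so that each contractor is stored only once:

```python
# Minimal, hypothetical illustration of normalization: the flat rows repeat
# contractor details; the normalized form stores each contractor once and
# has the tasks refer to it by key.
flat_rows = [
    {"task_id": 101, "task": "Design review",    "contractor": "Acme Corp", "cage": "1ABC2"},
    {"task_id": 102, "task": "Integration test", "contractor": "Acme Corp", "cage": "1ABC2"},
    {"task_id": 103, "task": "Flight test",      "contractor": "Beta LLC",  "cage": "9XYZ8"},
]

contractors = {}
tasks = []
for row in flat_rows:
    contractors[row["cage"]] = {"name": row["contractor"]}          # stored in one place
    tasks.append({"task_id": row["task_id"], "task": row["task"], "cage": row["cage"]})

print(contractors)   # no redundancy
print(tasks)         # related items reference the contractor by key
```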

This is pretty basic (and necessary) stuff. I have written at length about data normalization, but I also pair it with two other terms: data rationalization and contextualization. Here is a short definition of rationalization:

What is the benefit of Data Rationalization? To be able to effectively exploit, manage, reuse, and govern enterprise data assets (including the models which describe them), it is necessary to be able to find them. In addition, there is (or should be) a wealth of semantics (e.g. business names, definitions, relationships) embedded within an organization’s models that can be exposed for improved analysis and knowledge transfer. By linking model objects (across or within models) it is possible to discover the higher order conceptual objects for any given object. Conversely, it is possible to identify what implementation artifacts implement a higher order model object. For example, using data rationalization, one can traverse from a conceptual model entity to a logical model entity to a physical model table to a database table, etc. Similarly, Data Rationalization enables understanding of a database table by traversing up through the different model levels.
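A rough sketch of that traversal–with made-up model object names–links an implementation artifact back up through the logical and conceptual levels:

```python
# Hypothetical linkage of model objects across levels, so a physical table
# can be traced up to its logical and conceptual parents (and back down).
lineage = {
    "db.table.CONTRACT_CLIN": {
        "physical": "CONTRACT_CLIN",
        "logical": "ContractLineItem",
        "conceptual": "Contract",
    },
}

def trace_up(artifact: str) -> list:
    """Return the chain from implementation artifact to conceptual object."""
    levels = lineage[artifact]
    return [levels["physical"], levels["logical"], levels["conceptual"]]

print(trace_up("db.table.CONTRACT_CLIN"))
# ['CONTRACT_CLIN', 'ContractLineItem', 'Contract']
```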

Finally, we have contextualization. Here is a good definition from Wikipedia:

Context or contextual information is any information about any entity that can be used to effectively reduce the amount of reasoning required (via filtering, aggregation, and inference) for decision making within the scope of a specific application.[2] Contextualisation is then the process of identifying the data relevant to an entity based on the entity’s contextual information. Contextualisation excludes irrelevant data from consideration and has the potential to reduce data from several aspects including volume, velocity, and variety in large-scale data intensive applications.
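In code terms, contextualization amounts to something like the toy filter below, where the records, the entity, and its contextual information are all invented for illustration:

```python
# Toy contextualization: keep only the records relevant to a given entity,
# reducing volume before any heavier reasoning or analytics is applied.
records = [
    {"wbs": "1.2.3", "period": "2019-01", "cost": 1200},
    {"wbs": "1.2.3", "period": "2019-02", "cost": 1350},
    {"wbs": "4.5.6", "period": "2019-01", "cost": 900},
]

context = {"wbs": "1.2.3", "periods": {"2019-01", "2019-02"}}

relevant = [r for r in records
            if r["wbs"] == context["wbs"] and r["period"] in context["periods"]]
print(relevant)   # only the rows that bear on this entity remain
```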

Within the domain of data and computer science, none of these terms involves approximating or merely reflecting the accuracy of data. Nor are statistical methods involved to approximate what needs to be accomplished precisely. The basic skill required to accomplish these tasks–given that the data is structured and pre-conditioned–is to reconcile the various lexicons from differing sources, much as, in my avocation, I reconcile the meaning of words and phrases across periods in history and across languages.

In this discussion we are dealing with the issue of different words used to describe a process or phenomenon. Similarly, we find this challenge in data.

So where does this leave data standardization? In terms of data and computer science, this describes a completely different method. Here is a definition from Wikipedia, which is the proper contextual use of the term under “Standard data model”:

A standard data model or industry standard data model (ISDM) is a data model that is widely applied in some industry, and shared amongst competitors to some degree. They are often defined by standards bodies, database vendors or operating system vendors.

In the context of project and program management, particularly as it relates to government data submission and international open standards across vendors in an industry, standardization takes the form of a common schema. In this case there is a DoD version of a UN/CEFACT XML file currently set as the standard, soon to be replaced by a new standard using the JSON file structure.
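For illustration only–the field names below are invented and do not reflect the actual UN/CEFACT XML elements or the forthcoming DoD JSON schema–a standardized submission expressed as JSON might look something like this:

```python
import json

# Purely illustrative submission record; the keys are invented, not the real schema.
submission = {
    "contract_id": "HQ0000-00-C-0000",
    "reporting_period": "2019-06",
    "wbs_elements": [
        {"wbs": "1.1", "bcws": 1000.0, "bcwp": 950.0, "acwp": 990.0},
    ],
}

# A common schema means every vendor's tool can emit and read the same file.
print(json.dumps(submission, indent=2))
```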

In any event, what is clear here is that, while standardization is a necessary part of a data policy to allow for sharing of information, the strength of the chosen schema and the instructions regarding it will vary–and this variation will have an effect on the quality of the information shared. But that is not all.

This is where data normalization, rationalization, and contextualization come into play. In order to create data for a standardized format, it is first necessary to convert what is otherwise an opaque set of data–opaque because of differences among sources–into a cohesive lexicon. In data, this is accomplished by reconciling data dictionaries to determine which items are describing the same thing, process, measure, or phenomenon. In a domain like program management, this is a finite set. But it is also specialized knowledge, and it is where the value is added to any end product that is produced. Then, once we know how to identify the data, we must be able to map those terms to the standard schema and–keeping an eye on how the data will be used down the line–properly structure it and ensure that its interrelationships are established and/or maintained for effective use. This is no mean task, and it is why all data transformation methods and companies are not the same.
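A highly simplified sketch of that reconciliation step–with invented source field names standing in for two vendors’ data dictionaries–is a mapping from each source lexicon into the standard schema’s terms:

```python
# Hypothetical reconciliation of two source data dictionaries to one
# standard lexicon, applied before records are structured into the target schema.
SOURCE_TO_STANDARD = {
    "vendor_a": {"WBS_ID": "wbs", "CUM_ACWP": "acwp", "CUM_BCWP": "bcwp"},
    "vendor_b": {"Element": "wbs", "ActualCost": "acwp", "EarnedValue": "bcwp"},
}

def to_standard(source: str, record: dict) -> dict:
    """Rename a source record's fields into the standard schema's terms."""
    mapping = SOURCE_TO_STANDARD[source]
    return {mapping[field]: value for field, value in record.items() if field in mapping}

print(to_standard("vendor_a", {"WBS_ID": "1.1", "CUM_ACWP": 990.0, "CUM_BCWP": 950.0}))
print(to_standard("vendor_b", {"Element": "1.1", "ActualCost": 990.0, "EarnedValue": 950.0}))
# Both sources now yield the same standardized keys: wbs, acwp, bcwp.
```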

Furthermore, these functions can be accomplished efficiently or inefficiently. The inefficient method is the old-fashioned business intelligence approach that has been around since the 1980s and before, where a team of data scientists and analysts deals with data as if it were flat and, essentially, reinvents the wheel in establishing the meaning and proper context of the data. Given enough time and money anything can be accomplished, but brute force labor will not defeat the Second Law of Thermodynamics.

In computing, which comes close to minimizing that physical law, we know that data has already been imbued with meaning upon its initial processing. In lieu of brute force labor we apply intelligence and knowledge to accomplish this requirement. This is called normalization, rationalization, and contextualization of data. It requires a small fraction of the time and effort demanded by other methods, and it is far more transparent.

Using these methods is also where innovation, efficiency, performance, accuracy, scalability, and the anticipation of future requirements based on the latest technology trends come into play. Establishing a seamless flow of data integration allows, for example, more data to be captured and properly structured in a database, which lays the groundwork for the transition from 2D to 3D and 4D (that is, what is often called integrated) program management, as well as more effective analytics.

The term “standardization” also suffers from a weakness in data and computer science that requires that it be qualified. After all, data standardization in an enterprise or organization does not preclude the prescription of a proprietary dataset. In government, this is contrary to both statutory and policy mandates. Furthermore, even given an effective, open standard, there will be a large pool of legacy and other non-conforming data that will still require capture and transformation.

The Section 809 Panel study dealt directly with this issue:

Use existing defense business system open-data requirements to improve strategic decision making on acquisition and workforce issues…. DoD has spent billions of dollars building the necessary software and institutional infrastructure to collect enterprise wide acquisition and financial data. In many cases, however, DoD lacks the expertise to effectively use that data for strategic planning and to improve decision making. Recommendation 88 would mitigate this problem by implementing congressional open-data mandates and using existing hiring authorities to bolster DoD’s pool of data science professionals.

Section 809 Volume 3, Section 9, p.477

As operating environment companies expose more and more capability into the market through middleware and other open systems methods of visualizing data, the key to a system no longer resides in its ability to produce charts and graphs. The use of Excel as an ad hoc data repository–with its vulnerability to error and manipulation, and its resistance to the establishment of an optimized data management and corporate knowledge environment–is a symptom of the larger issue.

Data and its proper structuring are at the core of organizational success and process improvement. Standardization alone will not address barriers to data optimization. According to RAND studies in 2015 and 2017,* these are:

  • Data Quality and Discontinuities
  • Data Silos and Underutilized Repositories
  • Timeliness of Data for use by SMEs and Decision-makers
  • Lack of Access and Contextualization
  • Traceability and Auditability
  • Lack of the Ability to Apply Discovery in the Data
  • The issue of Contractual Technical Data and Proprietary Data

That these issues also exist in private industry demonstrates the universality of the issue. Thus, yes, standardize by all means. But also ensure that the standard is open and that transformation is traceable and auditable from the source system to the standard schema, and then into the target database. Only then will the enterprise, the organization, and the government agency have full ownership of the data it requires to efficiently and effectively carry out its purpose.
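One way to make that traceability concrete–sketched here with hypothetical source and field names–is to carry lineage metadata with every transformed value, so the path from source system to standard schema to target database can be audited:

```python
# Hypothetical lineage record: every value loaded into the target database
# carries where it came from and when it was transformed.
from datetime import datetime, timezone

def transform_with_lineage(source_system: str, source_field: str,
                           standard_field: str, value: float) -> dict:
    """Wrap a transformed value with an auditable trail back to its source."""
    return {
        "standard_field": standard_field,
        "value": value,
        "lineage": {
            "source_system": source_system,
            "source_field": source_field,
            "transformed_at": datetime.now(timezone.utc).isoformat(),
        },
    }

row = transform_with_lineage("vendor_a_evms", "CUM_ACWP", "acwp", 990.0)
print(row["lineage"])   # traceable from source system to standard schema to target
```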

*RAND Corporation studies are “Issues with Access to Acquisition Data and Information in the DoD: Doing Data Right in Weapons System Acquisition” (RR880, 2017), and “Issues with Access to Acquisition Data and Information in the DoD: Policy and Practice” (RR1534, 2015). These can be found here.

Takin’ Care of Business — Information Economics in Project Management

Neoclassical economics abhors inefficiency, and yet inefficiencies exist.  Among the core issues that create inefficiencies is the asymmetrical nature of information.  Asymmetry is an accepted cornerstone of economics that leads to inefficiency.  We can see in our daily lives and employment the effects of one party in a transaction having more information than the other:  knowing whether the used car you are buying is a lemon, measuring risk in the purchase of an investment and, apropos to this post, identifying how our information systems allow us to manage complex projects.

Regarding this last proposition we can peel this onion down through its various levels: the asymmetry in the information between the customer and the supplier, the asymmetry in information between the board and stockholders, the asymmetry in information between management and labor, the asymmetry in information between individual SMEs and the project team, etc.–it’s elephants all the way down.

This asymmetry, which drives inefficiency, is exacerbated in markets that are dominated by monopoly, monopsony, and oligopoly power.  When informed by the work of Hart and Holmström regarding contract theory, which recently garnered the Nobel in economics, we have a basis for understanding the internal dynamics of projects in seeking efficiency and productivity.  What is interesting about contract theory is that it incorporates the concept of asymmetrical information (labeled as adverse selection), but expands this concept in human transactions at the microeconomic level to include considerations of moral hazard and the utility of signalling.

The state of asymmetry and inefficiency is exacerbated by the patchwork quilt of “tools”–software applications that are designed to address only a very restricted portion of the total contract and project management system–that are currently deployed as the state of the art.  These tend to require the insertion of a new class of SME to manage data by essentially reversing the efficiencies in automation, involving direct effort to reconcile differences in data from differing tools. This is a sub-optimized system.  It discourages optimization of information across the project, reinforces asymmetry, and is economically and practically unsustainable.

The key in all of this is ensuring that sub-optimal behavior is discouraged, and that those activities and behaviors that support more transparent sharing of information and, therefore, contribute to greater efficiency and productivity are rewarded.  It should be noted that more transparent organizations tend to be more sustainable and healthier, with a higher degree of employee commitment.

The path forward where there is monopsony power–where there is a dominant buyer–is to impose the conditions for normative behavior that would otherwise be leveraged through practice in a more open market.  For open markets not dominated by one player as either buyer or seller, the key is to institute practices that reward behavior that reduces the effects of asymmetrical information, and to build disincentives against such asymmetry into contracts for business transactions on the open market.

In the information management market as a whole, the trends working against asymmetry and inefficiency involve the reduction of data streams, the construction of cross-domain data repositories (or reservoirs) that allow for the satisfaction of multiple business stakeholders, and the introduction of systems that are more open and adaptable to the needs of the project system in lieu of a limited portion of the project team.  These solutions exist, yet their adoption is hindered by the long-term infrastructure that is put in place in complex project management.  This infrastructure is supported by incumbents with an interest in reinforcing the status quo.  Because of this, the time from when a market innovation is introduced to when it is adopted in project-focused organizations usually runs to several years.

This argues for establishing an environment that is more nimble.  This involves the adoption of a series of approaches to achieve the goals of broader information symmetry and efficiency in the project organization.  These are:

a. Institute contractual relationships, both internally and externally, that encourage project personnel to identify risk.  This would include incentives to kill efforts that have breached their framing assumptions, or to consolidate the progress that the project has achieved to date–sending it as-is to production–while killing further effort that would breach framing assumptions.

b. Institute policy and incentives on the data supply end to reduce the number of data streams.  Toward this end both acquisition and contracting practices should move to discourage proprietary data dead ends by encouraging normalized and rationalized data schemas that describe the environment using a common or, at least, compatible lexicon.  This reduces the inefficiency derived from opaqueness as it relates to software and data.

c.  Institute policy and incentives on the data consumer end to leverage the economies derived from the increased computing power from Moore’s Law by scaling data to construct interrelated datasets across multiple domains that will provide a more cohesive and expansive view of project performance.  This involves the warehousing of data into a common repository or reduced set of repositories (a toy sketch of this consolidation follows this list).  The goal is to satisfy multiple project stakeholders from multiple domains using as few streams as necessary and to encourage KDD (Knowledge Discovery in Databases).  This reduces the inefficiency derived from data opaqueness, but also from the traditional line-and-staff organization that has tended to stovepipe expertise and information.

d.  Institute acquisition and market incentives that encourage software manufacturers to engage in positive signalling behavior that reduces the opaqueness of the solutions being offered to the marketplace.
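Regarding point (c), a toy version of that cross-domain consolidation–with invented field names and a hypothetical WBS key–simply relates cost and schedule records, formerly held in separate stovepipes, in one repository:

```python
# Toy cross-domain consolidation: cost and schedule data, formerly in
# separate stovepipes, are related on a shared (hypothetical) WBS key.
cost = {"1.1": {"acwp": 990.0, "bcwp": 950.0}}
schedule = {"1.1": {"start": "2019-01-07", "finish": "2019-03-29", "float_days": 4}}

repository = {
    wbs: {**cost.get(wbs, {}), **schedule.get(wbs, {})}
    for wbs in set(cost) | set(schedule)
}
print(repository["1.1"])
# One record now answers both the cost analyst's and the scheduler's questions.
```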

In summary, the current state of project data is one characterized by “best-of-breed” patchwork quilt solutions that tend to increase direct labor, reduce and limit productivity, and drive up cost.  At the end of the day the ability of the project to handle risk and adapt to technical challenges rests on the reliability and efficiency of its information systems.  A patchwork system fails to meet the needs of the organization as a whole and, at the end of the day, is not “takin’ care of business.”

Stay Open — Open and Proprietary Databases (and Why It Matters)

The last couple of weeks have been fairly intense workwise and so blogging has lagged a bit.  Along the way the matter of databases came up at a customer site–specifically, what constitutes open data and what comprises proprietary data.  The reason why this issue matters to customers rests on several foundations.

First, in any particular industry or niche there is a wide variety of specialized apps that have blossomed.  This is largely due to Moore’s Law.  Looking at the number of hosted and web apps alone can be quite overwhelming, particularly given the opaqueness of what one is buying at any particular time when it comes to software technology.

Second, given this explosion, it goes without saying that the market will apply its scythe ruthlessly in thinning it.  Despite the ranting of ideologues, this thinning applies to both good and bad ideas, both sound and unsound businesses equally.  The few that remain are lucky, good, or good and lucky.  Oftentimes it is being first to market on an important market discriminator, regardless of the quality in its initial state, that determines winners.

Third, most of these technology solutions will run their software on proprietary database structures.  This undermines the concept that the customer owns the data.

The reasons why software solution providers do this are multifaceted.  For example, the database structure is established to enhance the functionality and responsiveness of the application, where the structure is leveraged to work optimally with the application’s logic.

But there are also more prosaic reasons for proprietary database structures.  First, the targeted vertical or segment may not be very structured regarding the type of data, so there is wide variation on database configuration and structure.  But there is also a more base underlying motivation to keep things this way:  the database structure is designed to protect the application’s data from easy access from third party tools and, as a result, make their solution “sticky” within the market segment that is captured.  That is, database structure is a way to build barriers to competition.

For incumbents that are stable, the main disadvantages to the customer lie in the use of the database as a means of tying them to the solution as a barrier to exit.  At the same time incumbents erect artificial barriers to data entry.  For software markets with a great deal of new entries and innovation that will lead to some thinning, picking the wrong solution using proprietary data structures can lead to real problems when attempting to transition to more stable alternatives.  For example, in the case of hosted applications not only is data not on the customer’s own database servers, but that data could be located far from the worksite or even geographically dispersed outside of the physical control of the customer.

Open APIs, data mining and its variations, and the unstructured and non-relational databases that the Shamans of Big Data prescribe have served, at least in everyone’s mind, to minimize such proprietary concerns.  After all, it is thought, we can just crack open the data–right?  Well…not so fast.  Given enough data scientists, data analysts, and open API object tools, the mainframe types can regain the status they lost with the introduction of the PC and spend months building systems that will eventually rationalize data that has been locked in proprietary prisons.  Or perhaps not.  The bigger the data the bigger the problem.  The bigger the question the more one must bring in those who understand the difference between correlation and causation.  In the end it comes down to the mathematics and valid methods of determining, in real terms, the behavior of systems.

Or if you are a small or medium-sized business or organization you can just decide that the data is irretrievable, or effectively so, since the ROI is not there to make it retrievable.

Or you can avoid these dead ends and, if you do business in a highly structured market such as project management, utilize an open standard such as the UN/CEFACT XML.  Then, when communicating with the market to choose a COTS solution, require that databases must, at a minimum, conform to the open standard in their design.  This provides maximum flexibility to the customer, who can then perform value analysis on competing products based on an analysis of functionality, flexibility, and sustainability.

This places the customer back into the role of owning the data.

One-Trick Pony — Software apps and the new Project Management paradigm

Recently I have been engaged in an exploration and discussion regarding the utilization of large amounts of data and how applications derive importance from that data.  In an on-line discussion with the ever insightful Dave Gordon, I first postulated that we need to transition into a world where certain classes of data are open so that the qualitative content can be normalized.  This is what for many years was called the Integrated Digital Environment (IDE for short).  Dave responded with his own post at the AITS.org blogging alliance, countering that while such standards are necessary in very specific and limited applications, modern APIs provide most of the solution.  I then responded directly to Dave here, countering that IDE is nothing more than data neutrality.  Then, also at AITS.org, I expanded on what I proposed to be a general approach to understanding big data, noting the dichotomy between the software approaches that organize the external characteristics of the data to generalize systems and note trends, and those that are focused on the qualitative content within the data.

It should come as no surprise then, given these differences in approaching data, that we also find similar differences in the nature of the applications that are found on the market.  With the recent advent of on-line and hosted solutions, there are literally thousands of applications in some categories of software that propose to do one thing with data, or that are focused, one-trick pony applications that can be mixed and matched to somehow provide an integrated solution.

There are several problems with this sudden explosion of applications of this nature.

The first is in the very nature of the explosion.  This is a classic tech bubble, albeit limited to a particular segment of the software market, and it will soon burst.  As soon as consumers find that all of that information traveling over the web with the most minimal of protections is compromised by the next trophy hack, or that too many software providers have entered the market prematurely–not understanding the full needs of their targeted verticals–it will hit like the last one in 2000.  It only requires a precipitating event that triggers a tipping point.

You don’t have to take my word for it.  Just type a favorite keyword into your browser now (and I hope you’re using a VPN while doing it) for a type of application for which you have a need–let’s say “knowledge base” or “software ticket systems.”  What you will find is that there are literally hundreds if not thousands of apps built for this function.  You cannot test them all.  Basic information economics, however, dictates that you must invest some effort in understanding the capabilities and limitations of the systems on the market.  Surely there are a couple of winners out there.  But basic economics also dictates that 95% of those presently in the market will be gone in short order.  Being the “best” or the “best value” does not always win in this winnowing out.  Certainly chance, the vagaries of your standing in the search engine results, industry contacts–virtually any number of factors–will determine who is still standing and who is gone a year from now.

Aside from this obvious problem with the bubble itself, the approach of the application makers harkens back to an earlier generation of one-off applications that attempt to achieve integration through marketing while actually achieving, at best, only old-fashioned interfacing.  In the world of project management, for example, organizations can little afford to revert to the division of labor, which is what would be required to align with these approaches in software design.  It’s almost as if, having made their money in an earlier time, software entrepreneurs cannot extend themselves beyond their comfort zones to take advantage of the last TEN software generations that provide new, more flexible approaches to data optimization.  All they can think to do is party like it’s 1995.

For the new paradigm in project management is to get beyond the traditional division of labor.  For example, is scheduling such a highly specialized discipline, rising to the level of a profession, that it is separate from all of the other aspects of project management?  Of course not.  Scheduling is a discipline–a sub-specialty actually–that is inextricably linked to all other aspects of project management in a continuum.  The artifacts of the process of establishing project systems and controls constitute the project itself.

No doubt there are entities and companies that still ostensibly organize themselves into specialties as they did twenty years ago: cost analysts, schedule analysts, risk management specialists, among others.  But given that the information from these systems–schedule, cost management, project financial management, risk management, technical performance, and all the rest–can be integrated at the appropriate level of their interrelationships to provide us a cohesive, holistic view of the complex system that we call a project, is such division still necessary?  In practice the industry has already moved to position itself toward integration, realizing the urgency of making the shift.

For example, to utilize an application to query cost management information in 1995 was a significant achievement during the first wave of software deployment that mimicked the division of labor.  In 2015, not so much.  Introducing a one-trick pony EVM “tool” in 2015 is laziness–hoping to turn back the clock while ignoring the obsolescence of such an approach–regardless of which slick new user interface is selected.

I recently attended a project management meeting of senior government and industry representatives.  During one of my side sessions I heard a colleague propose the discipline of Project Management Analyst in lieu of previously stove-piped specialties.  His proposal is a breath of fresh air in an industry that develops and manufactures the latest aircraft and space technology, but has hobbled itself with systems and procedures designed for an earlier era that no longer align with the needs of doing business.  I believe the timely deployment of systems has suffered as a result during this period of transition.

Software must lead, and accelerate the transition to the new integration paradigm.

Thus, in 2015 the choice is not between data that adheres to conventions of data neutrality and data that is accessed via APIs; it is in favor of applications that do both.

It is not between different hard-coded applications that provide the old “what-you-see-is-what-you-get” approach.  It is instead between such limited hard-coded applications and those that provide flexibility, so that business managers can choose among a nearly unlimited palette of choices of how and which data, converted into information, is available to the user or classes of user based on their role and need to know–aggregated at the appropriate level of detail for the consumer to derive significance from the information being presented.

It is not between “best-of-breed” and “mix-and-match” solutions that leverage interfaces to achieve integration.  It is instead between such solution “consortiums” that drive up implementation and sustainment costs, bringing with them high overhead, against those that achieve integration by leveraging the source of the data itself, reducing the number of applications that need to be managed, allowing data to be enriched in an open and flexible environment, achieving transformation into useful information.

Finally, the choice isn’t among applications that save their attributes in a proprietary format so that the customer must commit themselves to a proprietary solution.  Instead, it is between such restrictive applications and those that open up data access, clearly establishing that it is the consumer that owns the data.

Note: I have made minor changes from the original version of this post for purposes of clarification.

Over at AITS.org Dave Gordon takes me to task on data normalization — and I respond with Data Neutrality

Dave Gordon at AITS.org takes me to task on my post regarding recommending using common schemas for certain project management data.  Dave’s alternative is to specify common APIs instead.   I am not one to dismiss alternative methods of reconciling disparate and, in their natural state, non-normalized data to find the most elegant solution.  My initial impression, though, is: been there, done that.

Regardless of the method used to derive significance from disparate sources of data of a common type, one still must obtain the cooperation of the players involved.  The ANSI X12 standard has been in use in the transportation industry for quite some time and has worked quite well, leaving the preference for a proprietary solution up to the individual shippers.  The rule has been, however, that if you are going to write solutions for that industry, you need to allow the shipping info needed by any receiver to conform to a particular format so that it can be read regardless of the software involved.

Recently the U.S. Department of Defense, which had used certain ANSI X12 formats for particular data for quite some time, has published and required a new set of schemas for a broader set of data under the rubric of the UN/CEFACT XML.  Thus, it has established the same approach as the transportation industry: taking an agnostic stand regarding software preferences while specifying that submitted data must conform to a common schema so that one proprietary file type is not given preference over another.

A little background is useful.  In developing major systems, contractors are required to provide project performance data in order to ensure that public funds are being expended properly for the contracted effort.  This is the oversight responsibility portion of the equation.  The other side concerns project and program management.  Given the cost-plus contract type most often used, the government program management office, in cooperation with its commercial counterpart, looks to identify the manifestation of cost, schedule, and/or technical risk early enough to allow that risk to be handled as necessary.  Also, at the end of this process, which is only now being explored, is the usefulness of years of historical data across contract types, technologies, and suppliers that can be used to benefit the public interest: by demonstrating which contractors perform better, by showing the inherent risk associated with particular technologies through parametric methods, and through a host of insights that can be derived via econometric project management trending and modeling.

So let’s assume that we can specify APIs in requesting the data, in lieu of specifying that the customer receive an application-agnostic file that can be read by any application that conforms with the data standard.  What is the difference?  My immediate observation is that it reverses the relationship of who owns the data.  In the case of the API the proprietary application becomes the gatekeeper.  In the case of an agnostic file structure it is open to everyone and the consumer owns the data.
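The contrast can be sketched in a few lines (everything here is hypothetical): an application-agnostic file can be read with nothing but the standard library and the published schema, while the API route puts the vendor’s service between the consumer and the same data:

```python
import io
import json

# An application-agnostic file conforming to an open schema (contents invented):
neutral_file = io.StringIO('{"contract_id": "HQ0000-00-C-0000", "acwp": 990.0}')
data = json.load(neutral_file)   # standard library only; the consumer owns the data
print(data)

# The API alternative, sketched as comments because this vendor SDK is invented:
#   client = VendorClient(api_key="...")
#   data = client.get_performance_report("HQ0000-00-C-0000")
# Here the proprietary application is the gatekeeper to the same data.
```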

In the API scenario large players can do what they want to limit competition and extensions to their functionality.  Since they can black-box the manner in which data is structured, it also becomes increasingly difficult to make qualitative selections from the data.  The very example that Dave uses–the plethora of one-off mobile apps–illustrates this: each usually must exist only in its own ecosystem.

So it seems to me that the real issue isn’t that Big Brother wants to control data structure.  What it comes down to is that specifying an open data structure defeats the ability of one or a group of solution providers from controlling the market through restrictions on accessing data.  This encourages maximum competition and innovation in the marketplace–Data Neutrality.

I look forward to additional information from Dave on this issue.  Each of the methods of achieving the end of Data Neutrality isn’t an end in itself.  Any method that is less structured and provides more flexibility is welcome.  I’m just not sure that we’re there yet with APIs.