Potato, Potahto, Tomato, Tomahto: Data Normalization vs. Standardization, Why the Difference Matters

In my vocation I run a technology company devoted to program management solutions that is primarily concerned with taking data and converting it into information to establish a knowledge-based environment. Similarly, in my avocation I deal with the meaning of information and how to turn it into insight and knowledge. This latter activity concerns the subject areas of history, sociology, and science.

In my travels just prior to and since the New Year, I have come upon a number of experts and fellow enthusiasts in these respective fields. The overwhelming majority of these encounters have been productive, educational, and cordial. We respectfully disagree in some cases about the significance of a particular approach, or about governance when it comes to project and program management policy, but generally there is a great deal of agreement, particularly on basic facts and terminology. But some areas of disagreement–particularly those that come from left field–tend to be the most interesting because they create an opportunity to clarify a larger issue.

In a recent venue I encountered just such an example, where the issue was the use of the phrase “data normalization.” The objection was that “data normalization” suggested some statistical methodology for reconciling data into a standard schema. Instead, it was suggested, the term “data standardization” was more appropriate.

These phrases do not describe the same thing, but they do describe processes that are symbiotic, not mutually exclusive. So what about data normalization? No doubt there is a statistical use of the term, but we are dealing with the definition as used in digital technology here, just as the use of “standardization” was suggested in the same context. There are many examples of technical terminology that do not have the same meaning when used in different contexts. Here is the definition of normalization applied to data science from Techopedia, which is the proper use of the term in this case:

Normalization is the process of reorganizing data in a database so that it meets two basic requirements: (1) There is no redundancy of data (all data is stored in only one place), and (2) data dependencies are logical (all related data items are stored together). Normalization is important for many reasons, but chiefly because it allows databases to take up as little disk space as possible, resulting in increased performance.

Normalization is also known as data normalization
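
Since the discussion turns on what normalization means in practice, here is a minimal sketch in Python of the database sense of the term, using hypothetical program-management records: a flat table that repeats vendor details for every task is reorganized so that each fact is stored in only one place.

```python
# A minimal sketch of database normalization; the field names are hypothetical.
flat_rows = [
    {"task_id": "T1", "task_name": "Design Review",    "vendor_id": "V1", "vendor_name": "Acme Corp"},
    {"task_id": "T2", "task_name": "Integration Test", "vendor_id": "V1", "vendor_name": "Acme Corp"},
]

# Normalized form: vendor attributes live in one table; tasks reference them by key.
vendors = {row["vendor_id"]: {"vendor_name": row["vendor_name"]} for row in flat_rows}
tasks = [
    {"task_id": row["task_id"], "task_name": row["task_name"], "vendor_id": row["vendor_id"]}
    for row in flat_rows
]

print(vendors)  # {'V1': {'vendor_name': 'Acme Corp'}} -- the vendor is stored once, not per task
print(tasks)
```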

This is pretty basic (and necessary) stuff. I have written at length about data normalization, but I also pair it with two other terms: data rationalization and contextualization. Here is a short definition of rationalization:

What is the benefit of Data Rationalization? To be able to effectively exploit, manage, reuse, and govern enterprise data assets (including the models which describe them), it is necessary to be able to find them. In addition, there is (or should be) a wealth of semantics (e.g. business names, definitions, relationships) embedded within an organization’s models that can be exposed for improved analysis and knowledge transfer. By linking model objects (across or within models) it is possible to discover the higher order conceptual objects for any given object. Conversely, it is possible to identify what implementation artifacts implement a higher order model object. For example, using data rationalization, one can traverse from a conceptual model entity to a logical model entity to a physical model table to a database table, etc. Similarly, Data Rationalization enables understanding of a database table by traversing up through the different model levels.
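
To make the traversal described in that definition concrete, here is a minimal sketch of linking model objects so that a database table can be traced up to its conceptual entity and back. The object names are entirely hypothetical; this is not a real metadata repository API.

```python
# Hypothetical links from implementation artifacts up to higher-order model objects.
links = {
    "db.table.PROJ_COST":   "physical.ProjectCost",
    "physical.ProjectCost": "logical.ProjectCost",
    "logical.ProjectCost":  "conceptual.Cost",
}

def trace_up(obj):
    """Walk from an implementation artifact up through the model levels."""
    chain = [obj]
    while obj in links:
        obj = links[obj]
        chain.append(obj)
    return chain

print(trace_up("db.table.PROJ_COST"))
# ['db.table.PROJ_COST', 'physical.ProjectCost', 'logical.ProjectCost', 'conceptual.Cost']
```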

Finally, we have contextualization. Here is a good definition from Wikipedia:

Context or contextual information is any information about any entity that can be used to effectively reduce the amount of reasoning required (via filtering, aggregation, and inference) for decision making within the scope of a specific application.[2] Contextualisation is then the process of identifying the data relevant to an entity based on the entity’s contextual information. Contextualisation excludes irrelevant data from consideration and has the potential to reduce data from several aspects including volume, velocity, and variety in large-scale data intensive applications
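
As a minimal illustration of that definition, the sketch below (with invented program-management fields) uses an entity’s contextual information to exclude irrelevant records before any analysis is attempted.

```python
# Hypothetical records and context; contextualization filters to what is relevant.
records = [
    {"program": "ABC", "period": "2019-01", "cost": 1200},
    {"program": "ABC", "period": "2018-12", "cost": 1100},
    {"program": "XYZ", "period": "2019-01", "cost": 900},
]

context = {"program": "ABC", "period": "2019-01"}  # what we know about the entity of interest

relevant = [r for r in records if all(r.get(k) == v for k, v in context.items())]
print(relevant)  # only the ABC / 2019-01 record survives the filter
```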

Within the domain of data and computer science, none of these terms involves approximating or estimating the accuracy of data, nor are statistical methods used to approximate what must be accomplished precisely. The basic skill required to accomplish these tasks–given that the data is structured and pre-conditioned–is to reconcile the various lexicons from differing sources, much as I reconcile in my avocation the meaning of words and phrases across periods in history and across languages.

In this discussion we are dealing with the issue of different words used to describe a process or phenomenon. Similarly, we find this challenge in data.

So where does this leave data standardization? In terms of data and computer science, this describes a completely different method. Here is a definition from Wikipedia, which is the proper contextual use of the term under “Standard data model”:

A standard data model or industry standard data model (ISDM) is a data model that is widely applied in some industry, and shared amongst competitors to some degree. They are often defined by standards bodies, database vendors or operating system vendors.

In the context of project and program management, particularly as it relates to government data submission and international open standards across vendors in an industry, this means the use of a common schema. In this case there is a DoD version of a UN/CEFACT XML file currently set as the standard, but soon to be replaced by a new standard using the JSON file structure.
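
To illustrate what such a schema transition involves, here is a notional sketch of converting an XML-style performance record into JSON. The element names are invented for illustration and do not reflect the actual UN/CEFACT XML or the forthcoming DoD JSON schema.

```python
import json
import xml.etree.ElementTree as ET

# A notional XML submission; the tags are hypothetical, not the real standard.
xml_submission = """
<PerformanceReport>
  <ProgramID>ABC-123</ProgramID>
  <PeriodEnd>2019-06-30</PeriodEnd>
  <BCWS>1500.0</BCWS>
  <BCWP>1450.0</BCWP>
</PerformanceReport>
"""

root = ET.fromstring(xml_submission)
record = {child.tag: child.text for child in root}
record["BCWS"] = float(record["BCWS"])
record["BCWP"] = float(record["BCWP"])

print(json.dumps(record, indent=2))  # the same particulars, now in a JSON structure
```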

In any event, what is clear here is that, while standardization is a necessary part of a data policy to allow for sharing of information, the strength of the chosen schema and the instructions regarding it will vary–and this variation will have an effect on the quality of the information shared. But that is not all.

This is where data normalization, rationalization, and contextualization come into play. In order to prepare data for a standardized format, it is first necessary to convert what is otherwise an opaque set of data, due to differences among sources, into a cohesive lexicon. In data, this is accomplished by reconciling data dictionaries to determine which items are describing the same thing, process, measure, or phenomenon. In a domain like program management, this is a finite set. But it is also specialized knowledge, and it is where the value is added to any end product that is produced. Then, once we know how to identify the data, we must be able to map those terms to the standard schema and, keeping an eye on the use of the data down the line, properly structure it and ensure that interrelationships among the data are established and/or maintained so that it can be used effectively. This is no mean task, and it is why all data transformation methods and companies are not the same.
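
Here is a minimal sketch of that reconciliation and mapping step. The source field names, the mapping, and the standard names are hypothetical; the point is simply that the reconciled lexicon, not brute force, does the work of conversion.

```python
# Hypothetical reconciled lexicon: different source terms that describe the same measure
# are mapped to a single field name in the standard schema.
standard_mapping = {
    "budgeted_cost_scheduled": "BCWS",
    "planned_value":           "BCWS",   # different source term, same concept
    "budgeted_cost_performed": "BCWP",
    "earned_value":            "BCWP",
}

def to_standard(source_record):
    """Rename reconciled fields to the standard schema; flag anything unmapped for review."""
    return {standard_mapping.get(k, "UNMAPPED_" + k): v for k, v in source_record.items()}

print(to_standard({"planned_value": 1500.0, "earned_value": 1450.0}))
print(to_standard({"budgeted_cost_scheduled": 900.0, "actual_cost": 880.0}))
```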

Furthermore, these functions can be accomplished efficiently or inefficiently. The inefficient approach is the old-fashioned business intelligence method that has been around since the 1980s and before, in which a team of data scientists and analysts treats the data as if it were flat and, essentially, reinvents the wheel in establishing its meaning and proper context. Given enough time and money anything can be accomplished, but brute force labor will not defeat the Second Law of Thermodynamics.

In computing, which comes close to minimizing that physical law, we know that data has already been imbued with meaning upon its initial processing. In lieu of brute force labor we apply intelligence and knowledge to accomplish this requirement. This is called normalization, rationalization, and contextualization of data. It requires a small fraction of the time and effort of other methods, and is infinitely more transparent.

Using these methods is also where innovation, efficiency, performance, accuracy, scalability, and the anticipation of future requirements based on the latest technology trends come into play. Establishing a seamless flow of data integration allows, for example, more data to be captured and properly structured in a database, which lays the groundwork for the transition from 2D to 3D and 4D (that is, what is often called integrated) program management, as well as for more effective analytics.

The term “standardization” also suffers from a weakness in data and computer science that requires that it be qualified. After all, data standardization in an enterprise or organization does not preclude the prescription of a proprietary dataset. In government, this is contrary to both statutory and policy mandates. Furthermore, even given an effective, open standard, there will be a large pool of legacy and other non-conforming data that will still require capture and transformation.

The Section 809 Panel study dealt directly with this issue:

Use existing defense business system open-data requirements to improve strategic decision making on acquisition and workforce issues…. DoD has spent billions of dollars building the necessary software and institutional infrastructure to collect enterprise wide acquisition and financial data. In many cases, however, DoD lacks the expertise to effectively use that data for strategic planning and to improve decision making. Recommendation 88 would mitigate this problem by implementing congressional open-data mandates and using existing hiring authorities to bolster DoD’s pool of data science professionals.

Section 809 Volume 3, Section 9, p. 477

As operating environment companies expose more and more capability into the market through middleware and other open systems methods of visualizing data, the key to a system no longer resides in its ability to produce charts and graphs. The use of Excel as an ad hoc data repository, with its vulnerability to error and manipulation and its resistance to the establishment of an optimized data management and corporate knowledge environment, is a symptom of the larger issue.

Data and its proper structuring is at the core of organizational success and process improvement. Standardization alone will not address barriers to data optimization. According to RAND studies in 2015 and 2017* these are:

  • Data Quality and Discontinuities
  • Data Silos and Underutilized Repositories
  • Timeliness of Data for use by SMEs and Decision-makers
  • Lack of Access and Contextualization
  • Traceability and Auditability
  • Lack of the Ability to Apply Discovery in the Data
  • The issue of Contractual Technical Data and Proprietary Data

That these issues also exist in private industry demonstrates the universality of the issue. Thus, yes, standardize by all means. But also ensure that the standard is open and that transformation is traceable and auditable from the source system to the standard schema, and then into the target database. Only then will the enterprise, the organization, and the government agency have full ownership of the data it requires to efficiently and effectively carry out its purpose.

*RAND Corporation studies are “Issues with Access to Acquisition Data and Information in the DoD: Doing Data Right in Weapons System Acquisition” (RR880, 2017) and “Issues with Access to Acquisition Data and Information in the DoD: Policy and Practice” (RR1534, 2015). These can be found here.

Open: Strategic Planning, Open Data Systems, and the Section 809 Panel

Sundays are usually days reserved for music and the group Rhye was playing in the background when this topic came to mind.

I have been preparing for my presentation in collaboration with my Navy colleague John Collins for the upcoming Integrated Program Management Workshop in Baltimore. This presentation will be a non-proprietary/non-commercial talk about understanding the issue of unlocking data to support national defense systems, but the topic has broader interest.

Thus, in advance of that formal presentation in Baltimore, there are issues and principles that are useful to cover, given that data capture and its processing, delivery, and use are at the heart of all systems in government, private industry, and other organizations.

Top Data Trends in Industry and Their Relationship to Open Data Systems

According to Shohreh Gorbhani, Director, Project Control Academy, the following are the top five data trends being pursued by private industry and technology companies. My own comments follow as they relate to open data systems.

  1. Open Technologies that transition from 2D Program Management to 3D and 4D PM. This point is consistent with the College of Performance Management’s emphasis on IPM, but note that the stipulation is the use of open technologies. This is an important distinction technologically, and one that I will explore further in this post.
  2. Real-time Data Capture. This means capturing data in the moment so that the status of our systems is up-to-date without the present delays associated with manual data management and conditioning. This does not preclude the collection of structured, periodic data, but it does include the capture of transactions from real-time integrated systems where appropriate.
  3. Seamless Data Flow Integration. From the perspective of companies in manufacturing and consumer products, technologies such as IoT and Cloud are just now coming into play. But, given the underlying premises of items 1 and 2, this also means the proper automated contextualization of data using an open technology approach that flows in such a way as to be traceable.
  4. The use of Big Data. The term has lost a good deal of its meaning because of its transformation into a buzz-phrase and marketing term. But Big Data refers to the expansion in the depth and breadth of available data driven by the economic forces that drive Moore’s Law. What this means is that we are entering a new frontier of data processing and analysis that will, no doubt, break down assumptions regarding the validity and strength of certain predictive analytics. The old assumptions that restrict access to data due to limitations of technology and higher cost no longer apply. We are now in the age of Knowledge Discovery in Data (KDD). The old approach of reporting assumed that we already know what we need to know. The use of data challenges old assumptions and allows us to follow the data where it will lead us.
  5. AI Forecasting and Analysis. No doubt predictive AI will be important as we move forward with machine learning and other similar technologies. But this infant is not yet a rug rat. The initial experiences with AI are that they tend to reflect the biases of the creators. The danger here is that this defeats KDD, which results in stagnation and fugue. But there are other areas where AI can be taught to automate mundane, value-neutral tasks relating to raw data interpretation.

The 809 Panel Recommendation

Given that industry is the driving force behind these trends, which will transform the way we view information in our day-to-day work, it is not surprising that the 809 Panel had this to say about existing defense business systems:

“Use existing defense business system open-data requirements to improve strategic decision making on acquisition and workforce issues…. DoD has spent billions of dollars building the necessary software and institutional infrastructure to collect enterprise wide acquisition and financial data. In many cases, however, DoD lacks the expertise to effectively use that data for strategic planning and to improve decision making. Recommendation 88 would mitigate this problem by implementing congressional open-data mandates and using existing hiring authorities to bolster DoD’s pool of data science professionals.”

Section 809 Volume 3, Section 9, p. 477

At one point in my military career, I was assigned as the Materiel, Fuels, and Transportation Officer of Naval Air Station, Norfolk. As a major naval air base, transportation hub, and home to a Naval Aviation Depot, we shipped and received materiel and supplies across the world. In doing so, our transportation personnel would use what at the time was new digital technology to complete an electronic bill of lading that specified what and when items were being shipped, the common or military carrier, the intended recipient, and the estimated date of arrival, among other essential information.

The customer and receiving end of this workflow received an open systems data file that contained these particulars. The file was an early version of open data known as an X12 file, for which the commercial transportation industry was an early adopter. Shipping and receiving activities and businesses used their own types of local software, and there were a number of customized and commercial choices out there, as well as those used by common carriers such as various trucking and shipping firms, the USPS, FedEx, DHL, UPS, and others. The X12 file was the DMZ that made the information open. Software manufacturers, if they wanted to stay relevant in the market, could not impose a proprietary data solution.

Furthermore, standardization of terminology and concepts ensured that the information was readable and comprehensible wherever the items landed–whether across receiving offices in the United States, Japan, Europe, or even Istanbul. While DoD needs the skill sets to be able to optimize its data, achieving this end-state did not require an army of data scientists. It required the right data science expertise in the right places, and the dictates of transportation consumers to move the technology market to provide the solution.

Over the years both industry and government have developed a number of schema standards focused on specific types of data, progressing from X12 to XML and now projected to use JSON-based schemas. Each of them, in its initial iterations, automated the submission of physical reports that had been required either by contract or by operations. These focused on a small subset of the full dataset relating to program management and project controls.
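
As a rough illustration of that progression, the sketch below carries the same shipment particulars first in a delimited, segment-style record (in the spirit of X12, but not actual X12 syntax) and then as the equivalent JSON object. The field names and layout are invented.

```python
import json

# A notional segment-style record; illustrative only, not real X12.
segment_style = "SHP*ABC123*20190215*FEDEX*NAVSTA NORFOLK*20190222"

fields = ["record_type", "shipment_id", "ship_date", "carrier", "consignee", "eta"]
record = dict(zip(fields, segment_style.split("*")))

print(json.dumps(record, indent=2))  # the same bill-of-lading particulars as a JSON object
```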

This progression made sense.

When digitized technology is first introduced into an intensive direct-labor environment, the initial focus is to automate the production of artifacts and their underlying processes in order to phase in the technology’s acceptance. This also allows the organization to realize immediate returns on investment and improvements in productivity. But this is the first step, not the final one.

For project controls, the current state consists of the UN/CEFACT XML for program performance management data and the contract cost and labor data collection file known as the FlexFile. Clearly the latter file, given that the recipient is the Office of the Secretary of Defense Cost Assessment and Program Evaluation (OSD CAPE), is one of many feedback loops that support that office's role in coordinating the planning, programming, budgeting, and evaluation (PPBE) system related to military strategic investments and budgeting–but only one. The program performance information is also a vital part of the PPBE process in evaluation and in future planning.

For most of the U.S. economy, market forces and consumer requirements are the driving force in digital innovation. The trends noted by Ms. Gorbhani can be confirmed through a Google search of any one of the many technology magazines and websites. The 809 Panel, drawn as it was from specialists in industry and government, was tasked “to provide recommendations that would allow DoD to adapt and deliver capability at market speeds, while ensuring that DoD remains true to its commitment to promote competition, provide transparency in its actions, and maintain the integrity of the defense acquisition system.”

Given that the work of the DoD is unique, creating a type of monopsony, it is up to leadership within the Department to create the conditions and mandates necessary to recreate in microcosm the positive effects of market forces. The DoD also has a very special, vital mission in defending the nation.

When an individual business cobbles together its mission statement, it is that mission that defines the necessary elements of data collection that are then essential in making decisions. In today’s world, best commercial sector practice is to establish a Master Data Management (MDM) approach in defining data requirements and practices. In the case of DoD, a similar approach would be beneficial. Concurrent with the period of the 809 Panel’s efforts, RAND Corporation delivered a paper in 2017 that made recommendations related to data governance that are consistent with the 809 Panel’s recommendations. We will be discussing these specific recommendations in our presentation.

Meeting the mission and readiness are the key components of data governance in DoD. Absent such guidance, specialized software solution providers, in particular, will engage in what is called “rent-seeking” behavior. This is an economic term describing an “entity (that) seeks to gain added wealth without any reciprocal contribution of productivity.”

No doubt, given the marketing of software solution providers, it is hard for decision-makers to tell what constitutes an open data system. The motivation of a software solution provider is to make its product as “sticky” as possible, and it does that by enticing a customer to commit to proprietary definitions, structures, and database schemas. Usually there are “black-boxed” portions of the software that make traceability impossible, complicate the question of who exactly owns the data, and limit the customer’s ability to optimize and utilize the data as the mission dictates.

Furthermore, data visualization components like dashboards are ubiquitous in the market. A cursory stroll through a tradeshow looks like a dashboard smorgasbord combined with different practical concepts of what constitutes “open” and “integration”.

As one DoD professional recently told me, it is hard to tell the software systems apart. To do this it is necessary to understand what underlies the software. Thus, a proposed honest-broker definition of an open data system is useful and the place to start, given that this is not a notional concept: such systems have already been successfully established.

The Definition of Open Data Systems

Practical experience in implementing open data systems toward the goal of optimizing essential information from our planning, acquisition, financial, and systems engineering systems informs the following proposed definition, which is based on commercial best practice. This proposal is also based on the principle that the customer owns the data.

  1. An open data system is one based on non-proprietary neutral schemas that allow for the effective capture of all essential elements from third-party proprietary and customized software for reporting and integration necessary to support both internal and external stakeholders.
  2. An open data system allows for complete traceability and transparency from the underlying database structure of the third-party software data, through the process of data capture, transformation, and delivery of data in the neutral schema.
  3. An open data system targets the loading of the underlying source data for analysis and use into a neutral database structure that replicates the structure of the neutral schema. This allows for 100% traceability and audit of data elements received through the neutral schema, and ensures that the receiving organization owns the data.
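
A minimal sketch of points 2 and 3 above follows: each source record is transformed into a neutral-schema record that carries provenance fields, so that every element loaded into the target structure can be traced back to its origin. The field names, mapping, and schema here are invented for illustration, not an actual neutral standard.

```python
# Hypothetical source record from a third-party tool, and its traceable transformation
# into a neutral-schema record that retains provenance for audit.
source_record = {"proj": "ABC-123", "pv": 1500.0, "ev": 1450.0}

field_map = {"proj": "program_id", "pv": "BCWS", "ev": "BCWP"}  # reconciled lexicon

def to_neutral(record, source_system):
    neutral = {field_map[k]: v for k, v in record.items()}
    # Provenance: record where each element came from, so the load is traceable and auditable.
    neutral["_source_system"] = source_system
    neutral["_source_fields"] = {field_map[k]: k for k in record}
    return neutral

neutral_record = to_neutral(source_record, source_system="VendorTool-A")
print(neutral_record)
# The neutral record can now be loaded into a database whose structure mirrors the neutral schema.
```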

Under this definition, data from its origination to its destination is more easily validated and traced, ensuring quality and fidelity, and establishing confidence in its value. Given these characteristics, integration of data from disparate domains becomes possible. The tracking of conflicting indicators is mitigated, since open system data allows for its effective integration without the bias of proprietary coding or restrictions on data use. Finally, both government and industry will not only establish ownership of their data–a routine principle in commercial business–but also be free to utilize new technologies that optimize the use of that data.

In closing, Gahan Wilson, a cartoonist whose work appeared in National Lampoon, The New Yorker, Playboy, and other magazines, recently passed away.

When thinking of the barriers to the effective use of data, I came across this cartoon in The New Yorker:

Open Data is the key to effective integration and reporting–to the optimal use of information. Once mandated and achieved, our defense and business systems will be better informed and be able to test and verify assumed knowledge, address risk, and eliminate dogmatic and erroneous conclusions. Open Data is the driver of organizational transformation keyed to the effective understanding and use of information, and all that entails. Finally, Open Data is necessary to the mission and planning systems of both industry and the U.S. Department of Defense.

The Future — Data Focus vs. “Tools” Focus

The title in this case is from the Leonard Cohen song.

Over the last few months I’ve come across this issue quite a bit and it goes to the heart of where software technology is leading us.  The basic question that underlies this issue can be boiled down into the issue of whether software should be thought of as a set of “tools” or an overarching solution that can handle data in a way that the organization requires.  It is a fundamental question because what we call Big Data–despite all of the hoopla–is really a relative term that changes with hardware, storage, and software scalability.  What was Big Data in 1997 is not Big Data in 2016.

As Moore’s Law expands scalability at lower cost, organizations and SMEs are finding that the dedicated software tools at hand are insufficient to leverage the additional information that can be derived from that data.  The reason for this is simple.  A COTS tools publisher will determine the functionality required based on a structured set of data that is to be used and code to that requirement.  The timeframe is usually extended and the approach highly structured.  There are very good reasons for this approach in particular industries where structure is necessary and the environment is fairly stable.  The list of industries that fall into this category is rapidly becoming smaller.  Thus, there is a large gap that must be filled by workarounds, custom code, and suboptimized use of Excel.  Organizations and people cannot wait until the self-styled software SMEs get around to providing that upgrade two years from now so that people can do their jobs.

Thus, the focus must be shifted to data and to the software technologies that maximize its immediate exploitation for business purposes to meet organizational needs.  The key here is the rise of Fourth Generation applications that leverage object-oriented programming languages that most closely replicate the flexibility of open source.  What this means is that, in lieu of buying a set of “tools”–each focused on solving a specific problem and stitched together by a common platform or through data transfer–software that deals with both data and UI in an agnostic fashion is now available.

The availability of flexible Fourth Generation software is of great concern, as one would imagine, to incumbents who have built their business model on defending territory based on a set of artifacts provided in the software.  Oftentimes these artifacts are nothing more than automatically filled-in forms that previously were filled in manually.  That model was fine during the first and second waves of automation in the 1980s and 1990s, but such capabilities are trivial in 2016 given software focused on data that can be quickly adapted to provide functionality as needed.  This development also eliminates and makes trivial those old checklists that IT shops used to send out as a lazy way of assessing the relative capabilities of software to simplify the competitive range.

Tools restrict themselves, by definition, to a subset of data in order to provide a specific set of capabilities.  Software that expands to include any set of data, and allows that data to be displayed and processed as necessary through user configuration, adapts itself more quickly and effectively to organizational needs.  It also tends to eliminate the need for multiple “best-of-breed” toolset approaches that are not the best of any breed and, more importantly, it goes beyond the limited functionality and limited ways of deriving importance from data found in structured tools.  The reason for this is that the data drives what is possible and important, rather than the tools imposing a well-trod interpretation of importance based on a limited set of data stored in a proprietary format.

An important effect of Fourth Generation software that provides flexibility in UI and functionality driven by the user is that it puts the domain SME back in the driver’s seat.  This is an important development.  For too long SMEs have had to content themselves with recommending and advocating for functionality in software while waiting for the market (software publishers) to respond.  Essential business functionality with limited market commonality often required that organizations either wait until the remainder of the market drove software publishers to meet their needs, finance expensive custom development (either organic or contracted), or fill gaps with suboptimized and ad hoc internal solutions.  With software that adapts its UI and functionality based on any data that can be accessed, using simple configuration capabilities, SMEs can fill these gaps with a consistent solution that maintains data fidelity and aids in the capture and sustainability of corporate knowledge.

Furthermore, for all of the talk about Agile software techniques, one cannot implement Agile using software languages and approaches that were designed in an earlier age and that resist optimization of the method.  Fourth Generation software lends itself most effectively to Agile, since configuration using simple object-oriented language gets us to the ideal–without a reliance on single points of failure–of releasable solutions at the end of a two-week sprint.  No doubt there are developers out there making good money who may challenge this assertion, but they are the exceptions to the rule that prove the point.  An organization should be able to optimize the pool of contributors to solution development and rollout in supporting essential business processes.  Otherwise Agile is just a pretext to overcome suboptimized developmental approaches, software languages, and the self-interest of developers who can’t plan or produce a releasable product in a timely manner within budgetary constraints.

In the end the change in mindset from tools to data goes to the issue of who owns the data: the organization that creates and utilizes the data (the customer), or the proprietary software tool publishers?  Clearly the economics will win out in favor of the customer.  It is time to displace “tools” thinking.

Note:  I’ve revised the title of the blog for clarity.

Living in the Material World — A Call for Open Data from Materials Science

Since advocating for transparent and open data schemas and data repositories, I’ve run into other high tech professionals who have opined that such data requirements are only applicable to my core competency in the U.S. Aerospace & Defense vertical.  Now, in the 26 November on-line edition of the journal Science, comes an opinion piece from a number of Chinese corrosion scientists under the title “Materials science: Share corrosion data,” which advocates the development of open data infrastructures to make the age and life of existing metallurgic infrastructure easily accessible in a non-proprietary format.

As the article states, “corrosion costs six cents for every dollar of gross domestic product in the United States. Globally, that amounts to more than US$4 trillion a year — equivalent to damages from 40 Hurricane Katrinas. Half of that cost is in corrosion prevention and control, the other half in damages and lost productivity.”

In the software technology industry, the recent issues regarding the Trans-Pacific Partnership (TPP) have revealed strains in U.S.-Chinese relations on intellectual property and proprietary systems.  So the danger is always that the source of the advocacy will be ignored, despite the clear economic argument in favor of what the editorial proposes.

But the concern is transnational in nature.  As the piece points out, the U.S. Department of Energy has initiated its own program in alignment with the U.S. National Institute of Standards and Technology (NIST) Materials Genome Initiative (MGI) to establish open repositories of data in support of the alternative energy industry.  Given the aging U.S. infrastructure and the dangers from sudden failure from these engineering systems, it seems wise to track and share data for materials corrosion and lifespans.  Think of the sudden bridge failures that have hit headlines over the last several years.

Since the cost of not doing so can be measured in lives as well as economic cost, such data normalization and rationalization in this area takes on the role of being an essential element of governance.