Potato, Potahto, Tomato, Tomahto: Data Normalization vs. Standardization, Why the Difference Matters

In my vocation I run a technology company devoted to program management solutions that is primarily concerned with taking data and converting it into information to establish a knowledge-based environment. Similarly, in my avocation I deal with the meaning of information and how to turn it into insight and knowledge. This latter activity concerns the subject areas of history, sociology, and science.

In my travels just prior to and since the New Year, I have come upon a number of experts and fellow enthusiasts in these respective fields. The overwhelming numbers of these encounters have been productive, educational, and cordial. We respectfully disagree in some cases about the significance of a particular approach, governance when it comes to project and program management policy, but generally there is a great deal of agreement, particularly on basic facts and terminology. But some areas of disagreement–particularly those that come from left field–tend to be the most interesting because they create an opportunity to clarify a larger issue.

In a recent venue I encountered this last example where the issue was the use of the phrase data normalization. The issue at hand was that the use of “data normalization” suggested some statistical methodology in reconciling data into a standard schema. Instead, it was suggested, the term “data standardization” was more appropriate.

These phrases do not describe the same thing, but they do describe processes that are symbiotic, not mutually exclusive. So what about data normalization? No doubt there is a statistical use of the term, but we are dealing with the definition as used in digital technology here, just as the use of “standardization” was suggested in the same context. There are many examples of technical terminology that do not have the same meaning when used in different contexts. Here is the definition of normalization applied to data science from Technopedia, which is the proper use of the term in this case:

Normalization is the process of reorganizing data in a database so that it meets two basic requirements: (1) There is no redundancy of data (all data is stored in only one place), and (2) data dependencies are logical (all related data items are stored together). Normalization is important for many reasons, but chiefly because it allows databases to take up as little disk space as possible, resulting in increased performance.

Normalization is also known as data normalization

This is pretty basic (and necessary) stuff. I have written at length about data normalization, but also pair it with two other terms. This is data rationalization and contextualization. Here is a short definition of rationalization:

What is the benefit of Data Rationalization? To be able to effectively exploit, manage, reuse, and govern enterprise data assets (including the models which describe them), it is necessary to be able to find them. In addition, there is (or should be) a wealth of semantics (e.g. business names, definitions, relationships) embedded within an organization’s models that can be exposed for improved analysis and knowledge transfer. By linking model objects (across or within models) it is possible to discover the higher order conceptual objects for any given object. Conversely, it is possible to identify what implementation artifacts implement a higher order model object. For example, using data rationalization, one can traverse from a conceptual model entity to a logical model entity to a physical model table to a database table, etc. Similarly, Data Rationalization enables understanding of a database table by traversing up through the different model levels.

Finally, we have contextualization. Here is a good definition using Wikipedia:

Context or contextual information is any information about any entity that can be used to effectively reduce the amount of reasoning required (via filtering, aggregation, and inference) for decision making within the scope of a specific application.[2] Contextualisation is then the process of identifying the data relevant to an entity based on the entity’s contextual information. Contextualisation excludes irrelevant data from consideration and has the potential to reduce data from several aspects including volume, velocity, and variety in large-scale data intensive applications

There is no approximation of reflecting the accuracy of data in any of these terms wihin the domain of data and computer science. Nor are there statistical methods involved to approximate what needs to be accomplished precisely. The basic skill required to accomplish these tasks–knowing that the data is structured and pre-conditioned–is to reconcile the various lexicons from differing sources, much as I reconcile in my avocation the meaning of words and phrases across periods in history and across languages.

In this discussion we are dealing with the issue of different words used to describe a process or phenomenon. Similarly, we find this challenge in data.

So where does this leave data standardization? In terms of data and computer science, this describes a completely different method. Here is a definition from Wikipedia, which is the proper contextual use of the term under “Standard data model”:

A standard data model or industry standard data model (ISDM) is a data model that is widely applied in some industry, and shared amongst competitors to some degree. They are often defined by standards bodies, database vendors or operating system vendors.

In the context of project and program management, particularly as it relates to government data submission and international open standards across vendors in an industry, is the use of a common schema. In this case there is a DoD version of a UN/CEFACT XML file currently set as the standard, but soon to be replaced by a new standard using the JSON file structure.

In any event, what is clear here is that, while standardization is a necessary part of a data policy to allow for sharing of information, the strength of the chosen schema and the instructions regarding it will vary–and this variation will have an effect on the quality of the information shared. But that is not all.

This is where data normalization, rationalization, and contextualization come into play. In order to create data for the a standardized format, it is first necessary to convert what is an otherwise opaque set of data due to differences into a cohesive lexicon. In data, this is accomplished by reconciling data dictionaries to determine which items are describing the same thing, process, measure, or phenomenon. In a domain like program management, this is a finite set. But it is also specialized knowledge and where the value is added to any end product that is produced. Then, once we know how to identify the data, we must be able to map those terms to the standard schema but, keeping on eye on the use of the data down the line, must be able to properly structure and ensure interrelationships of the data are established and/or maintained to ensure its effective use. This is no mean task and why all data transformation methods and companies are not the same.

Furthermore, these functions can be accomplished efficiently or inefficiently. The inefficient method is to take the old-fashioned business intelligence method that has been around since the 1980s and before, where a team of data scientists and analysts deal with data as if it is flat and, essentially, reinvents the wheel in establishing the meaning and proper context of the data. Given enough time and money anything can be accomplished, but brute force labor will not defeat the Second Law of Thermodynamics.

In computing, which comes close to minimizing that physical law, we know that data has already been imbued with meaning upon its initial processing. In lieu of brute force labor we apply intelligence and knowledge to accomplish this requirement. This is called normalization, rationalization, and contextualization of data. It requires a small fraction of other methods in terms of time and effort, and is infinitely more transparent.

Using these methods is also where innovation, efficiency, performance, accuracy, scalability, and anticipating future requirements based on the latest technology trends comes into play. Establishing a seamless flow of data integration allows, for example, the capture of more data being able to be properly structured in a database, which lays the ground for the transition from 2D to 3D and 4D (that is, what is often called integrated) program management, as well as more effective analytics.

The term “standardization” also suffers from a weakness in data and computer science that requires that it be qualified. After all, data standardization in an enterprise or organization does not preclude the prescription of a propriety dataset. In government, this is contrary to both statutory and policy mandates. Furthermore, even given an effective, open standard, there will be a large pool of legacy and other non-conforming data that will still require capture and transformation.

The Section 809 Panel study dealt directly with this issue:

Use existing defense business system open-data requirements to improve strategic decision making on acquisition and workforce issues…. DoD has spent billions of dollars building the necessary software and institutional infrastructure to collect enterprise wide acquisition and financial data. In many cases, however, DoD lacks the expertise to effectively use that data for strategic planning and to improve decision making. Recommendation 88 would mitigate this problem by implementing congressional open-data mandates and using existing hiring authorities to bolster DoD’s pool of data science professionals.

Section 809 Volume 3, Section 9, p.477

As operating environment companies expose more and more capability into the market through middleware and other open systems methods of visualizing data, the key to a system no longer resides in its ability to produce charts and graphs. The use of Excel as an ad hoc data repository with its vulnerability to error, to manipulation, and for its resistance to the establishment of an optimized data management and corporate knowledge environment is a symptom of the larger issue.

Data and its proper structuring is at the core of organizational success and process improvement. Standardization alone will not address barriers to data optimization. According to RAND studies in 2015 and 2017* these are:

  • Data Quality and Discontinuities
  • Data Silos and Underutilized Repositories
  • Timeliness of Data for use by SMEs and Decision-makers
  • Lack of Access and Contextualization
  • Traceability and Auditability
  • Lack of the Ability to Apply Discovery in the Data
  • The issue of Contractual Technical Data and Proprietary Data

That these issues also exist in private industry demonstrates the universality of the issue. Thus, yes, standardize by all means. But also ensure that the standard is open and that transformation is traceable and auditable from the the source system to the standard schema, and then into the target database. Only then will the enterprise, the organization, and the government agency have full ownership of the data it requires to efficiently and effectively carry out its purpose.

*RAND Corporation studies are “Issues with Access to Acquisition Data and Information in the DoD: Doing Data Right in Weapons System Acquisition” (RR880, 2017), and “Issues with Access to Acquisition Data and Information in the DoD: Policy and Practice (RR1534, 2015). These can be found here.

Learning the (Data) — Data-Driven Management, HBR Edition

The months of December and January are usually full of reviews of significant events and achievements during the previous twelve months. Harvard Business Review makes the search for some of the best writing on the subject of data-driven transformation by occasionally publishing in one volume the best writing on a critical subject of interest to professional through the magazine OnPoint. It is worth making part of your permanent data management library.

The volume begins with a very concise article by Thomas C. Redman with the provocative title “Does Your Company Know What to Do with All Its Data?” He then goes on to list seven takeaways of optimizing the use of existing data that includes many of the themes that I have written about in this blog: better decision-making, innovation, what he calls “informationalize products”, and other significant effects. Most importantly, he refers to the situation of information asymmetry and how this provides companies and organizations with a strategic advantage that directly affects the bottom line–whether that be in negotiations with peers, contractual relationships, or market advantages. Aside from the OnPoint article, he also has some important things to say about corporate data quality. Highly recommended and a good reason to implement systems that assure internal information systems fidelity.

Edd Wilder-James also covers a theme that I have hammered home in a number of blog posts in the article “Breaking Down Data Silos.” The issue here is access to data and the manner in which it is captured and transformed into usable analytics. His recommended approach to a task that is often daunting is to find the path of least resistance in finding opportunities to break down silos and maximize data to apply advanced analytics. The article provides a necessary balm that counteracts the hype that often accompanies this topic.

Both of these articles are good entrees to the subject and perfectly positioned to prompt both thought and reflection of similar experiences. In my own day job I provide products that specifically address these business needs. Yet executives and management in all too many cases continue to be unaware of the economic advantages of data optimization or the manner in which continuing to support data silos is limiting their ability to effectively manage their organizations. There is no doubt that things are changing and each day offers a new set of clients who are feeling their way in this new data-driven world, knowing that the promises of almost effort-free goodness and light by highly publicized data gurus are not the reality of practitioners, who apply the detail work of data normalization and rationalization. At the end it looks like magic, but there is effort that needs to be expended up-front to get to that state. In this physical universe under the Second Law of Thermodynamics there are no free lunches–energy must be borrowed from elsewhere in order to perform work. We can minimize these efforts through learning and the application of new technology, but managers cannot pretend not to have to understand the data that they intend to use to make business decisions.

All of the longer form articles are excellent, but I am particularly impressed with the Leandro DalleMule and Thomas H. Davenport article entitled “What’s Your Data Strategy?” from the May-June 2017 issue of HBR. Oftentimes when addressing big data at professional conferences and in visiting businesses the topic often runs to the manner of handling the bulk of non-structured data. But as the article notes, less than half of an organization’s relevant structured data is actually used in decision-making. The most useful artifact that I have permanently plastered at my workplace is the graphic “The Elements of Data Strategy”, and I strongly recommend that any manager concerned with leveraging new technology to optimize data do the same. The graphic illuminates the defensive and offensive positions inherent in a cohesive data strategy leading an organization to the state: “In our experience, a more flexible and realistic approach to data and information architectures involves both a single source of truth (SSOT) and multiple versions of the truth (MVOTs). The SSOT works at the data level; MVOTs support the management of information.” Elimination of proprietary data silos, elimination of redundant data streams, and warehousing of data that is accessed using a number of analytical methods achieve the necessary states of SSOT that provides the basis for an environment supporting MVOTs.

The article “Why IT Fumbles Analytics” by Donald A. Marchand and Joe Peppard from 2013, still rings true today. As with the article cited above by Wilder-James, the emphasis here is with the work necessary to ensure that new data and analytical capabilities succeed, but the emphasis shifts to “figuring out how to use the information (the new system) generates to make better decisions or gain deeper…insights into key aspects of the business.” The heart of managing the effort in providing this capability is to put into place a project organization, as well as systems and procedures, that will support the organizational transformation that will occur as a result of the explosion of new analytical capability.

The days of simply buying an off-the-shelf silo-ed “tool” and automating a specific manual function are over, especially for organizations that wish to be effective and competitive–and more profitable–in today’s data and analytical environment. A more comprehensive and collaborative approach is necessary. As with the DalleMule and Davenport article, there is a very useful graphic that contrasts traditional IT project approaches against Analytics and Big Data (or perhaps “Bigger” Data) Projects. Though the prescriptions in the article assume an earlier concept of Big Data optimization focused on non-structured data, thereby making some of these overkill, an implementation plan is essential in supporting the kind of transformation that will occur, and managers act at their own risk if they fail to take this effect into account.

All of the other articles in this OnPoint issue are of value. The bottom line, as I have written in the past, is to keep a focus on solving business challenges, rather than buying the new bright shiny object. Alternatively, in today’s business environment the day that business decision-makers can afford to stay within their silo-ed comfort zone are phasing out very quickly, so they need to shift their attention to those solutions that address these new realities.

So why do this apart from the fancy term “data optimization”? Well, because there is a direct return-on-investment in transforming organizations and systems to data-driven ones. At the end of the day the economics win out. Thus, our organizations must be prepared to support and have a plan in place to address the core effects of new data-analytics and Big Data technology:

a. The management and organizational transformation that takes place when deploying the new technology, requiring proactive socialization of the changing environment, the teaching of new skill sets, new ways of working, and of doing business.

b. Supporting transformation from a sub-optimized silo-ed “tell me what I need to know” work environment to a learning environment, driven by what the data indicates, supporting the skills cited above that include intellectual curiosity, engaging domain expertise, and building cross-domain competencies.

c. A practical plan that teaches the organization how best to use the new capability through a practical, hands-on approach that focuses on addressing specific business challenges.