Potato, Potahto, Tomato, Tomahto: Data Normalization vs. Standardization, Why the Difference Matters

In my vocation I run a technology company devoted to program management solutions that is primarily concerned with taking data and converting it into information to establish a knowledge-based environment. Similarly, in my avocation I deal with the meaning of information and how to turn it into insight and knowledge. This latter activity concerns the subject areas of history, sociology, and science.

In my travels just prior to and since the New Year, I have come upon a number of experts and fellow enthusiasts in these respective fields. The overwhelming majority of these encounters have been productive, educational, and cordial. We respectfully disagree in some cases about the significance of a particular approach or about governance when it comes to project and program management policy, but generally there is a great deal of agreement, particularly on basic facts and terminology. But some areas of disagreement–particularly those that come from left field–tend to be the most interesting because they create an opportunity to clarify a larger issue.

In a recent venue I encountered this last example where the issue was the use of the phrase data normalization. The issue at hand was that the use of “data normalization” suggested some statistical methodology in reconciling data into a standard schema. Instead, it was suggested, the term “data standardization” was more appropriate.

These phrases do not describe the same thing, but they do describe processes that are symbiotic, not mutually exclusive. So what about data normalization? No doubt there is a statistical use of the term, but we are dealing with the definition as used in digital technology here, just as the use of “standardization” was suggested in the same context. There are many examples of technical terminology that do not have the same meaning when used in different contexts. Here is the definition of normalization applied to data science from Techopedia, which is the proper use of the term in this case:

Normalization is the process of reorganizing data in a database so that it meets two basic requirements: (1) There is no redundancy of data (all data is stored in only one place), and (2) data dependencies are logical (all related data items are stored together). Normalization is important for many reasons, but chiefly because it allows databases to take up as little disk space as possible, resulting in increased performance.

Normalization is also known as data normalization
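To make the two requirements in that definition concrete, here is a minimal sketch in Python. The records, field names, and the normalize function are hypothetical, invented only for illustration. It takes flat rows in which a project name is repeated on every task and splits them so that the name is stored in only one place, with each task carrying just a reference to its project.

```python
# Hypothetical flat records: the project name repeats on every task row.
flat_rows = [
    {"task_id": 1, "task": "Design", "project_id": "P1", "project_name": "Radar Upgrade"},
    {"task_id": 2, "task": "Test", "project_id": "P1", "project_name": "Radar Upgrade"},
    {"task_id": 3, "task": "Design", "project_id": "P2", "project_name": "Hull Refit"},
]


def normalize(rows):
    """Split flat rows into a projects table and a tasks table."""
    projects = {}
    tasks = []
    for row in rows:
        # Store each project name once, keyed by its identifier.
        projects[row["project_id"]] = {"project_name": row["project_name"]}
        # Each task keeps only a reference to its project, not the name.
        tasks.append(
            {
                "task_id": row["task_id"],
                "task": row["task"],
                "project_id": row["project_id"],
            }
        )
    return projects, tasks


projects, tasks = normalize(flat_rows)
print(projects)
print(tasks)
```

Renaming a project now means changing one entry in the projects table rather than every task row that mentions it, which is the practical payoff of removing the redundancy.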

This is pretty basic (and necessary) stuff. I have written at length about data normalization, but I also pair it with two other terms: data rationalization and contextualization. Here is a short definition of rationalization:

What is the benefit of Data Rationalization? To be able to effectively exploit, manage, reuse, and govern enterprise data assets (including the models which describe them), it is necessary to be able to find them. In addition, there is (or should be) a wealth of semantics (e.g. business names, definitions, relationships) embedded within an organization’s models that can be exposed for improved analysis and knowledge transfer. By linking model objects (across or within models) it is possible to discover the higher order conceptual objects for any given object. Conversely, it is possible to identify what implementation artifacts implement a higher order model object. For example, using data rationalization, one can traverse from a conceptual model entity to a logical model entity to a physical model table to a database table, etc. Similarly, Data Rationalization enables understanding of a database table by traversing up through the different model levels.

Finally, we have contextualization. Here is a good definition using Wikipedia:

Context or contextual information is any information about any entity that can be used to effectively reduce the amount of reasoning required (via filtering, aggregation, and inference) for decision making within the scope of a specific application. Contextualisation is then the process of identifying the data relevant to an entity based on the entity’s contextual information. Contextualisation excludes irrelevant data from consideration and has the potential to reduce data from several aspects including volume, velocity, and variety in large-scale data intensive applications.

None of these terms, as used within the domain of data and computer science, involves approximating the accuracy of data. Nor are statistical methods involved to approximate what needs to be accomplished precisely. The basic skill required to accomplish these tasks–knowing that the data is structured and pre-conditioned–is to reconcile the various lexicons from differing sources, much as I reconcile in my avocation the meaning of words and phrases across periods in history and across languages.

In this discussion we are dealing with the issue of different words used to describe a process or phenomenon. Similarly, we find this challenge in data.

So where does this leave data standardization? In terms of data and computer science, this describes a completely different method. Here is a definition from Wikipedia, which is the proper contextual use of the term under “Standard data model”:

A standard data model or industry standard data model (ISDM) is a data model that is widely applied in some industry, and shared amongst competitors to some degree. They are often defined by standards bodies, database vendors or operating system vendors.

In the context of project and program management, particularly as it relates to government data submission and international open standards across vendors in an industry, standardization means the use of a common schema. In this case there is a DoD version of a UN/CEFACT XML file currently set as the standard, but soon to be replaced by a new standard using the JSON file structure.

In any event, what is clear here is that, while standardization is a necessary part of a data policy to allow for sharing of information, the strength of the chosen schema and the instructions regarding it will vary–and this variation will have an effect on the quality of the information shared. But that is not all.

This is where data normalization, rationalization, and contextualization come into play. In order to create data for a standardized format, it is first necessary to convert what is an otherwise opaque set of data, opaque because of differences among sources, into a cohesive lexicon. In data, this is accomplished by reconciling data dictionaries to determine which items are describing the same thing, process, measure, or phenomenon. In a domain like program management, this is a finite set. But it is also specialized knowledge, and it is where the value is added to any end product that is produced. Then, once we know how to identify the data, we must be able to map those terms to the standard schema. Keeping an eye on the use of the data down the line, we must also structure it properly and ensure that its interrelationships are established and maintained so that it can be used effectively. This is no mean task, and it is why all data transformation methods and companies are not the same.
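The two steps just described, reconciling data dictionaries and then mapping the reconciled terms to a standard schema, can be sketched in a few lines of Python. Everything named here is an assumption made up for illustration: the two source systems, their field names, the shared concepts, and the standard field names do not come from any actual tool or from the DoD schema.

```python
# Hypothetical data dictionaries from two source systems. Each maps a
# local field name to the shared concept that the field describes.
SOURCE_DICTIONARIES = {
    "scheduling_tool": {"BL_Start": "baseline_start", "ActCost": "actual_cost"},
    "cost_tool": {"ACWP": "actual_cost", "BaselineStartDate": "baseline_start"},
}

# Hypothetical standard schema: shared concept -> field name in the standard.
STANDARD_SCHEMA = {
    "baseline_start": "baselineStartDate",
    "actual_cost": "actualCost",
}


def to_standard(source, record):
    """Translate one source record into standard-schema field names.

    Returns the translated record plus the source fields that could not
    be reconciled, so nothing is dropped silently.
    """
    dictionary = SOURCE_DICTIONARIES[source]
    result = {}
    unmapped = []
    for field, value in record.items():
        concept = dictionary.get(field)
        if concept is None or concept not in STANDARD_SCHEMA:
            unmapped.append(field)
            continue
        result[STANDARD_SCHEMA[concept]] = value
    return result, unmapped


a, a_unmapped = to_standard(
    "scheduling_tool",
    {"BL_Start": "2019-01-15", "ActCost": 1200.0, "Notes": "n/a"},
)
b, b_unmapped = to_standard(
    "cost_tool",
    {"ACWP": 1200.0, "BaselineStartDate": "2019-01-15"},
)
print(a, a_unmapped)
print(b, b_unmapped)
```

The two sources use different words for the same things, yet both records come out with identical standard field names. The specialized knowledge lives in the dictionaries, not in the loop, which is why the dictionaries are where the real work and the real value are.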

Furthermore, these functions can be accomplished efficiently or inefficiently. The inefficient method is to take the old-fashioned business intelligence approach that has been around since the 1980s and before, where a team of data scientists and analysts deals with data as if it were flat and, essentially, reinvents the wheel in establishing the meaning and proper context of the data. Given enough time and money anything can be accomplished, but brute force labor will not defeat the Second Law of Thermodynamics.

In computing, which comes close to minimizing that physical law, we know that data has already been imbued with meaning upon its initial processing. In lieu of brute force labor we apply intelligence and knowledge to accomplish this requirement. This is called normalization, rationalization, and contextualization of data. It requires a small fraction of the time and effort of other methods, and is far more transparent.

Using these methods is also where innovation, efficiency, performance, accuracy, scalability, and the anticipation of future requirements based on the latest technology trends come into play. Establishing a seamless flow of data integration allows, for example, more data to be captured and properly structured in a database, which lays the groundwork for the transition from 2D to 3D and 4D (that is, what is often called integrated) program management, as well as more effective analytics.

The term “standardization” also suffers from a weakness in data and computer science that requires that it be qualified. After all, data standardization in an enterprise or organization does not preclude the prescription of a proprietary dataset. In government, this is contrary to both statutory and policy mandates. Furthermore, even given an effective, open standard, there will be a large pool of legacy and other non-conforming data that will still require capture and transformation.

The Section 809 Panel study dealt directly with this issue:

Use existing defense business system open-data requirements to improve strategic decision making on acquisition and workforce issues…. DoD has spent billions of dollars building the necessary software and institutional infrastructure to collect enterprise wide acquisition and financial data. In many cases, however, DoD lacks the expertise to effectively use that data for strategic planning and to improve decision making. Recommendation 88 would mitigate this problem by implementing congressional open-data mandates and using existing hiring authorities to bolster DoD’s pool of data science professionals.

Section 809 Volume 3, Section 9, p.477

As operating environment companies expose more and more capability into the market through middleware and other open systems methods of visualizing data, the key to a system no longer resides in its ability to produce charts and graphs. The use of Excel as an ad hoc data repository, with its vulnerability to error and manipulation and its resistance to the establishment of an optimized data management and corporate knowledge environment, is a symptom of the larger issue.

Data and its proper structuring are at the core of organizational success and process improvement. Standardization alone will not address barriers to data optimization. According to RAND studies in 2015 and 2017,* these barriers are:

  • Data Quality and Discontinuities
  • Data Silos and Underutilized Repositories
  • Timeliness of Data for use by SMEs and Decision-makers
  • Lack of Access and Contextualization
  • Traceability and Auditability
  • Lack of the Ability to Apply Discovery in the Data
  • The issue of Contractual Technical Data and Proprietary Data

That these issues also exist in private industry demonstrates the universality of the issue. Thus, yes, standardize by all means. But also ensure that the standard is open and that transformation is traceable and auditable from the source system to the standard schema, and then into the target database. Only then will the enterprise, the organization, and the government agency have full ownership of the data it requires to efficiently and effectively carry out its purpose.
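What traceable and auditable can mean in practice is sketched below in Python. This is only an illustration under stated assumptions: the function name, the field map, the source system name, and the shape of the lineage entries are all hypothetical. The idea is that every output value carries a record of which source field it came from, fields that were not mapped are logged rather than silently dropped, and a fingerprint of the original record lets an auditor confirm which input produced which output.

```python
import hashlib
import json


def transform_with_lineage(source_system, record, field_map):
    """Rename fields per field_map and record where each value came from."""
    output = {}
    lineage = []
    for source_field, value in record.items():
        target_field = field_map.get(source_field)
        entry = {
            "source_system": source_system,
            "source_field": source_field,
            "target_field": target_field,
            "action": "mapped" if target_field is not None else "skipped",
        }
        lineage.append(entry)
        if target_field is not None:
            output[target_field] = value
    # SHA-256 fingerprint of the original record (keys sorted so the
    # result does not depend on field order).
    fingerprint = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return output, lineage, fingerprint


# Hypothetical source record and mapping to standard field names.
record = {"ACWP": 1200.0, "Comment": "draft"}
field_map = {"ACWP": "actualCost"}

output, lineage, fingerprint = transform_with_lineage("cost_tool", record, field_map)
print(output)
for entry in lineage:
    print(entry)
print(fingerprint)
```

In a real pipeline the lineage entries and fingerprint would be stored alongside the target database rows, so that any value in the standard schema can be walked back to its source.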

*RAND Corporation studies are “Issues with Access to Acquisition Data and Information in the DoD: Doing Data Right in Weapons System Acquisition” (RR1534, 2017), and “Issues with Access to Acquisition Data and Information in the DoD: Policy and Practice” (RR880, 2015). These can be found here.

Ring Out the Old, Ring in the New: Data Transformation Podcasting

Robin Williams at Innovate IPM interviewed me a few weeks ago and has a new podcast up to cap off the year. Our discussion began as a wide-ranging one but, as it turned out, its main thrust settled on digital transformation and the changes and developments that I’ve seen in this area over the last three decades.

I met Rob at a recent project controls conference. He is a professional, curious, and engaging individual who quickly puts one at ease. We found that we had a lot in common regarding our perspectives on project management and project controls, and I agreed to the podcast interview. Our discussion was no different than many that I’ve had with other professionals in my areas of interest in my own living room, and the discussion comes off as a similarly engaging and informal conversation between like-minded individuals.

Before he posted the podcast, I managed to get a preview. Despite years of doing interviews, hosting symposiums, an occasional emcee or radio spot, home movies, and other recordings, I still cannot get over the strange feeling of hearing my own voice during a long conversation. I am constantly looking for faults, and cringed with the utterance of each “ah” or “um” while listening to myself–returning in my head to the admonitions of my supervisors when I was taught to be a Navy instructor–though, thankfully, they are few.

Still, thanks to the magic of editing, Rob managed to keep the focus on the main point of the conversation when I strayed into some side discussion. The interview caught me at a time when I was working on a paper on digital transformation to present to DoD professionals, and so it caught me in real time while I was developing in my mind two main concepts that I had picked up from the literature: establishing a Master Data Management (MDM) strategy, and establishing a knowledge management environment. While I do not mention these items in the interview, the discussion allowed me to subsequently sort out where these concepts apply.

In any event, the podcast can be found here: https://www.innovateipm.com/podcast/episode/206e7fbd/13-history-of-digital-transformation-with-nick-pisano. I hope you find it interesting and informative.

Sledgehammer: Pisano Talks!

My blogging hiatus is coming to an end as I take a sledgehammer to the writer’s block wall.

I’ve traveled far and wide over the last six months to various venues across the country and have collected a number of new and interesting perspectives on the issues of data transformation, integrated project management, and business analytics and visualization. As a result, I have developed some very strong opinions about which trends work and which don’t in these areas and will be sharing these perspectives (with the appropriate supporting documentation per usual) in following posts.

To get things started this post will be relatively brief.

First, I will be speaking along with co-presenter John Collins, who is a Senior Acquisition Specialist at the Navy Engineering & Logistics Office, at the Integrated Program Management Workshop at the Hyatt Regency in beautiful downtown Baltimore’s Inner Harbor 10-12 December. So come on down! (or over) and give us a listen.

The topic is “Unlocking Data to Improve National Defense Systems”. Today anyone can put together pretty visualizations of data from Excel spreadsheets and other sources–and some have made quite a bit of money doing so. But accessing the right data at the right level of detail, transforming it so that its information content can be exploited, and contextualizing it properly through integration will provide the most value to organizations.

Furthermore, our presentation will link the data that is necessary to national defense systems with the construction of the artifacts that support the Department of Defense’s Planning, Programming, Budgeting and Execution (PPBE) process and what eventually becomes the Future Years Defense Program (FYDP).

Traditionally information capture and reporting has been framed as a question of oversight, reporting, and regulation related to contract management, capital investment cost control, and DoD R&D and acquisition program management. But organizations that fail to leverage the new powerful technologies that double processing and data storage capability every 18 months, allowing for both the depth and breadth of data to expand exponentially, are setting themselves up to fail. In national defense, this is a condition that cannot be allowed to occur.

If DoD doesn’t collect this information, which we know from the reports of cybersecurity agencies that other state actors are collecting, we will be at a serious strategic disadvantage. We are in a new frontier of knowledge discovery in data. Our analysts and program managers think they know what they need to be viewing, but integration adds new perspectives that will result in new indicators and predictive analytics that will, no doubt, overtake current practice. Furthermore, that information can now be processed and contribute more timely and better intelligence to the process of strategic and operational planning.

The presentation will be somewhat wonky and directed at policymakers and decisionmakers in both government and industry. But anyone can play, and that is the cool aspect of our community. The presentation will be non-commercial, despite my day job. Mixing in the day job is a line I haven’t crossed up to this point in this blog, but here that will be changing to some extent.

Back in early 2018 I became the sole proprietor of SNA Software LLC–an industry technology leader in data transformation–particularly in capturing datasets that traditionally have been referred to as “Big Data”–and a hybrid point solution that is built on an open business intelligence framework. Our approach leverages the advantages of COTS (delivering the 80% solution out of the box) with open business intelligence that allows for rapid configuration to adapt the solution to an organization’s needs and culture. Combined with COTS data capture and transformation software–the key to transforming data into information and then combining it to provide intelligence at the right time and to the right place–the latency in access to trusted intelligence is reduced significantly.

Along these lines, I have developed some very specific opinions about how to achieve this transformation–and have put those concepts into practice through SNA and delivered those solutions to our customers. The result has been to reduce both the effort and time needed to capture large datasets from pre-processed source data, and to cut direct labor and the duration to information delivery by more than 99%. The path to get there is not to apply an army of data scientists and data analysts that deals with all data as if it were flat and reinvents the wheel–only to deliver a suboptimized solution sometime in the future after unnecessarily expending time and resources. This is a devolution to the same labor-intensive business intelligence approaches that we used back in the 1980s and 1990s. The answer is not to throw labor at data that already has its meaning embedded into its information content. The answer is to apply smarts through technology, and that’s what we do.

Further along these lines, if you are using hard-coded point solutions (also called purpose-built software) and knitted-together best-of-breed products, chances are that you will find that you are poorly positioned to exploit new technology and will be obsolete within the next five years, if not sooner. The model of selling COTS solutions and walking away except for traditional maintenance and support is dying. The new paradigm will be to be part of the solution and that requires domain knowledge that translates into technology delivery.

More on these points in future posts, but I’ve placed my stakes in the ground and we’ll see how they hold up to critique and comment.

Finally, I recently became aware of an extremely informative and cutting-edge website that includes podcasts from thought leaders in the area of integrated program management. It is entitled InnovateIPM and is operated and moderated by a gentleman named Rob Williams. He is a domain expert in project cost development, with over 20 years of experience in the oil, gas, and petrochemical industries. Robin has served in a variety of roles throughout his career and now focuses on cost estimating and Front-End Loading quality assurance. His current role is advanced project cost estimator at Marathon Petroleum’s Galveston Bay Refinery in Texas City.

Rob was also nice enough to continue a discussion we started at a project controls symposium and interviewed me for a podcast. I’ll post additional information once it is available.