Potato, Potahto, Tomato, Tomahto: Data Normalization vs. Standardization, Why the Difference Matters

In my vocation I run a technology company devoted to program management solutions that is primarily concerned with taking data and converting it into information to establish a knowledge-based environment. Similarly, in my avocation I deal with the meaning of information and how to turn it into insight and knowledge. This latter activity concerns the subject areas of history, sociology, and science.

In my travels just prior to and since the New Year, I have come upon a number of experts and fellow enthusiasts in these respective fields. The overwhelming numbers of these encounters have been productive, educational, and cordial. We respectfully disagree in some cases about the significance of a particular approach, governance when it comes to project and program management policy, but generally there is a great deal of agreement, particularly on basic facts and terminology. But some areas of disagreement–particularly those that come from left field–tend to be the most interesting because they create an opportunity to clarify a larger issue.

In a recent venue I encountered this last example where the issue was the use of the phrase data normalization. The issue at hand was that the use of “data normalization” suggested some statistical methodology in reconciling data into a standard schema. Instead, it was suggested, the term “data standardization” was more appropriate.

These phrases do not describe the same thing, but they do describe processes that are symbiotic, not mutually exclusive. So what about data normalization? No doubt there is a statistical use of the term, but we are dealing with the definition as used in digital technology here, just as the use of “standardization” was suggested in the same context. There are many examples of technical terminology that do not have the same meaning when used in different contexts. Here is the definition of normalization applied to data science from Technopedia, which is the proper use of the term in this case:

Normalization is the process of reorganizing data in a database so that it meets two basic requirements: (1) There is no redundancy of data (all data is stored in only one place), and (2) data dependencies are logical (all related data items are stored together). Normalization is important for many reasons, but chiefly because it allows databases to take up as little disk space as possible, resulting in increased performance.

Normalization is also known as data normalization

This is pretty basic (and necessary) stuff. I have written at length about data normalization, but also pair it with two other terms. This is data rationalization and contextualization. Here is a short definition of rationalization:

What is the benefit of Data Rationalization? To be able to effectively exploit, manage, reuse, and govern enterprise data assets (including the models which describe them), it is necessary to be able to find them. In addition, there is (or should be) a wealth of semantics (e.g. business names, definitions, relationships) embedded within an organization’s models that can be exposed for improved analysis and knowledge transfer. By linking model objects (across or within models) it is possible to discover the higher order conceptual objects for any given object. Conversely, it is possible to identify what implementation artifacts implement a higher order model object. For example, using data rationalization, one can traverse from a conceptual model entity to a logical model entity to a physical model table to a database table, etc. Similarly, Data Rationalization enables understanding of a database table by traversing up through the different model levels.

Finally, we have contextualization. Here is a good definition using Wikipedia:

Context or contextual information is any information about any entity that can be used to effectively reduce the amount of reasoning required (via filtering, aggregation, and inference) for decision making within the scope of a specific application.[2] Contextualisation is then the process of identifying the data relevant to an entity based on the entity’s contextual information. Contextualisation excludes irrelevant data from consideration and has the potential to reduce data from several aspects including volume, velocity, and variety in large-scale data intensive applications

There is no approximation of reflecting the accuracy of data in any of these terms wihin the domain of data and computer science. Nor are there statistical methods involved to approximate what needs to be accomplished precisely. The basic skill required to accomplish these tasks–knowing that the data is structured and pre-conditioned–is to reconcile the various lexicons from differing sources, much as I reconcile in my avocation the meaning of words and phrases across periods in history and across languages.

In this discussion we are dealing with the issue of different words used to describe a process or phenomenon. Similarly, we find this challenge in data.

So where does this leave data standardization? In terms of data and computer science, this describes a completely different method. Here is a definition from Wikipedia, which is the proper contextual use of the term under “Standard data model”:

A standard data model or industry standard data model (ISDM) is a data model that is widely applied in some industry, and shared amongst competitors to some degree. They are often defined by standards bodies, database vendors or operating system vendors.

In the context of project and program management, particularly as it relates to government data submission and international open standards across vendors in an industry, is the use of a common schema. In this case there is a DoD version of a UN/CEFACT XML file currently set as the standard, but soon to be replaced by a new standard using the JSON file structure.

In any event, what is clear here is that, while standardization is a necessary part of a data policy to allow for sharing of information, the strength of the chosen schema and the instructions regarding it will vary–and this variation will have an effect on the quality of the information shared. But that is not all.

This is where data normalization, rationalization, and contextualization come into play. In order to create data for the a standardized format, it is first necessary to convert what is an otherwise opaque set of data due to differences into a cohesive lexicon. In data, this is accomplished by reconciling data dictionaries to determine which items are describing the same thing, process, measure, or phenomenon. In a domain like program management, this is a finite set. But it is also specialized knowledge and where the value is added to any end product that is produced. Then, once we know how to identify the data, we must be able to map those terms to the standard schema but, keeping on eye on the use of the data down the line, must be able to properly structure and ensure interrelationships of the data are established and/or maintained to ensure its effective use. This is no mean task and why all data transformation methods and companies are not the same.

Furthermore, these functions can be accomplished efficiently or inefficiently. The inefficient method is to take the old-fashioned business intelligence method that has been around since the 1980s and before, where a team of data scientists and analysts deal with data as if it is flat and, essentially, reinvents the wheel in establishing the meaning and proper context of the data. Given enough time and money anything can be accomplished, but brute force labor will not defeat the Second Law of Thermodynamics.

In computing, which comes close to minimizing that physical law, we know that data has already been imbued with meaning upon its initial processing. In lieu of brute force labor we apply intelligence and knowledge to accomplish this requirement. This is called normalization, rationalization, and contextualization of data. It requires a small fraction of other methods in terms of time and effort, and is infinitely more transparent.

Using these methods is also where innovation, efficiency, performance, accuracy, scalability, and anticipating future requirements based on the latest technology trends comes into play. Establishing a seamless flow of data integration allows, for example, the capture of more data being able to be properly structured in a database, which lays the ground for the transition from 2D to 3D and 4D (that is, what is often called integrated) program management, as well as more effective analytics.

The term “standardization” also suffers from a weakness in data and computer science that requires that it be qualified. After all, data standardization in an enterprise or organization does not preclude the prescription of a propriety dataset. In government, this is contrary to both statutory and policy mandates. Furthermore, even given an effective, open standard, there will be a large pool of legacy and other non-conforming data that will still require capture and transformation.

The Section 809 Panel study dealt directly with this issue:

Use existing defense business system open-data requirements to improve strategic decision making on acquisition and workforce issues…. DoD has spent billions of dollars building the necessary software and institutional infrastructure to collect enterprise wide acquisition and financial data. In many cases, however, DoD lacks the expertise to effectively use that data for strategic planning and to improve decision making. Recommendation 88 would mitigate this problem by implementing congressional open-data mandates and using existing hiring authorities to bolster DoD’s pool of data science professionals.

Section 809 Volume 3, Section 9, p.477

As operating environment companies expose more and more capability into the market through middleware and other open systems methods of visualizing data, the key to a system no longer resides in its ability to produce charts and graphs. The use of Excel as an ad hoc data repository with its vulnerability to error, to manipulation, and for its resistance to the establishment of an optimized data management and corporate knowledge environment is a symptom of the larger issue.

Data and its proper structuring is at the core of organizational success and process improvement. Standardization alone will not address barriers to data optimization. According to RAND studies in 2015 and 2017* these are:

  • Data Quality and Discontinuities
  • Data Silos and Underutilized Repositories
  • Timeliness of Data for use by SMEs and Decision-makers
  • Lack of Access and Contextualization
  • Traceability and Auditability
  • Lack of the Ability to Apply Discovery in the Data
  • The issue of Contractual Technical Data and Proprietary Data

That these issues also exist in private industry demonstrates the universality of the issue. Thus, yes, standardize by all means. But also ensure that the standard is open and that transformation is traceable and auditable from the the source system to the standard schema, and then into the target database. Only then will the enterprise, the organization, and the government agency have full ownership of the data it requires to efficiently and effectively carry out its purpose.

*RAND Corporation studies are “Issues with Access to Acquisition Data and Information in the DoD: Doing Data Right in Weapons System Acquisition” (RR880, 2017), and “Issues with Access to Acquisition Data and Information in the DoD: Policy and Practice (RR1534, 2015). These can be found here.

Takin’ Care of Business — Information Economics in Project Management

Neoclassical economics abhors inefficiency, and yet inefficiencies exist.  Among the core issues that create inefficiencies is the asymmetrical nature of information.  Asymmetry is an accepted cornerstone of economics that leads to inefficiency.  We can see in our daily lives and employment the effects of one party in a transaction having more information than the other:  knowing whether the used car you are buying is a lemon, measuring risk in the purchase of an investment and, apropos to this post, identifying how our information systems allow us to manage complex projects.

Regarding this last proposition we can peel this onion down through its various levels: the asymmetry in the information between the customer and the supplier, the asymmetry in information between the board and stockholders, the asymmetry in information between management and labor, the asymmetry in information between individual SMEs and the project team, etc.–it’s elephants all the way down.

This asymmetry, which drives inefficiency, is exacerbated in markets that are dominated by monopoly, monopsony, and oligopoly power.  When informed by the work of Hart and Holmström regarding contract theory, which recently garnered the Nobel in economics, we have a basis for understanding the internal dynamics of projects in seeking efficiency and productivity.  What is interesting about contract theory is that it incorporates the concept of asymmetrical information (labeled as adverse selection), but expands this concept in human transactions at the microeconomic level to include considerations of moral hazard and the utility of signalling.

The state of asymmetry and inefficiency is exacerbated by the patchwork quilt of “tools”–software applications that are designed to address only a very restricted portion of the total contract and project management system–that are currently deployed as the state of the art.  These tend to require the insertion of a new class of SME to manage data by essentially reversing the efficiencies in automation, involving direct effort to reconcile differences in data from differing tools. This is a sub-optimized system.  It discourages optimization of information across the project, reinforces asymmetry, and is economically and practically unsustainable.

The key in all of this is ensuring that sub-optimal behavior is discouraged, and that those activities and behaviors that are supportive of more transparent sharing of information and, therefore, contribute to greater efficiency and productivity are rewarded.  It should be noted that more transparent organizations tend to be more sustainable, healthier, and with a higher degree of employee commitment.

The path forward where there is monopsony power, where there is a dominant buyer, is to impose the conditions for normative behavior that would otherwise be leveraged through practice in a more open market.  For open markets not dominated by one player as either supplier or seller, instituting practices that reward behavior that reduces the effects of asymmetrical information, and contracting disincentives in business transactions on the open market is the key.

In the information management market as a whole the trends that are working against asymmetry and inefficiency involve the reduction of data streams, the construction of cross-domain data repositories (or reservoirs) that allow for the satisfaction of multiple business stakeholders, and the introduction of systems that are more open and adaptable to the needs of the project system in lieu of a limited portion of the project team.  These solutions exist, yet their adoption is hindered because of the long-term infrastructure that is put in place in complex project management.  This infrastructure is supported by incumbents that are reinforcing to the status quo.  Because of this, from the time a market innovation is introduced to the time that it is adopted in project-focused organizations usually involves the expenditure of several years.

This argues for establishing an environment that is more nimble.  This involves the adoption of a series of approaches to achieve the goals of broader information symmetry and efficiency in the project organization.  These are:

a. Instituting contractual relationships, both internally and externally, that encourage project personnel to identify risk.  This would include incentives to kill efforts that have breached their framing assumptions, or to consolidate progress that the project has achieved to date–sending it as it is to production–while killing further effort that would breach framing assumptions.

b. Institute policy and incentives on the data supply end to reduce the number of data streams.  Toward this end both acquisition and contracting practices should move to discourage proprietary data dead ends by encouraging normalized and rationalized data schemas that describe the environment using a common or, at least, compatible lexicon.  This reduces the inefficiency derived from opaqueness as it relates to software and data.

c.  Institute policy and incentives on the data consumer end to leverage the economies derived from the increased computing power from Moore’s Law by scaling data to construct interrelated datasets across multiple domains that will provide a more cohesive and expansive view of project performance.  This involves the warehousing of data into a common repository or reduced set of repositories.  The goal is to satisfy multiple project stakeholders from multiple domains using as few streams as necessary and encourage KDD (Knowledge Discovery in Databases).  This reduces the inefficiency derived from data opaqueness, but also from the traditional line-and-staff organization that has tended to stovepipe expertise and information.

d.  Institute acquisition and market incentives that encourage software manufacturers to engage in positive signalling behavior that reduces the opaqueness of the solutions being offered to the marketplace.

In summary, the current state of project data is one that is characterized by “best-of-breed” patchwork quilt solutions that tend to increase direct labor, reduces and limits productivity, and drives up cost.  At the end of the day the ability of the project to handle risk and adapt to technical challenges rests on the reliability and efficiency of its information systems.  A patchwork system fails to meet the needs of the organization as a whole and at the end of the day is not “takin’ care of business.”

Big Time — Elements of Data Size in Scaling

I’ve run into additional questions about scalability.  It is significant to understand the concept in terms of assessing software against data size, since there are actually various aspect of approaching the issue.

Unlike situations where data is already sorted and structured as part of the core functionality of the software service being provided, this is in dealing in an environment where there are many third-party software “tools” that put data into proprietary silos.  These act as barriers to optimizing data use and gaining corporate intelligence.  The goal here is to apply in real terms the concept that the customers generating the data (or stakeholders who pay for the data) own the data and should have full use of it across domains.  In project management and corporate governance this is an essential capability.

For run-of-the-mill software “tools” that are focused on solving one problem, this often is interpreted as just selling a lot more licenses to a big organization.  “Sure we can scale!” is code for “Sure, I’ll sell you more licenses!”  They can reasonably make this assertion, particularly in client-server or web environments, where they can point to the ability of the database system on which they store data to scale.  This also comes with, usually unstated, the additional constraint that their solution rests on a proprietary database structure.  Such responses, through, are a form of sidestepping the question, nor is it the question being asked.  Thus, it is important for those acquiring the right software to understand the subtleties.

A review of what makes data big in the first place is in order.  The basic definition, which I outlined previously, came from NASA in describing data that could not be held in local memory or local storage.  Hardware capability, however, continues to grow exponentially, so that what is big data today is not big data tomorrow.  But in handling big data, it then becomes incumbent on software publishers to drive performance to allow their customers to take advantage of the knowledge contained in these larger data sets.

The elements that determine the size of data are:

a.  Table size

b.  Row or Record size

c.  Field size

d.  Rows per table

e.  Columns per table

f.  Indexes per table

Note the interrelationships of these elements in determining size.  Thus, recently I was asked how many records are being used on the largest tables being accessed by a piece of software.  That is fine as shorthand, but the other elements add to the size of the data that is being accessed.  Thus, a set of data of say 800K records may be just as “big” as one containing 2M records because of the overall table size of fields, and the numbers of columns and indices, as well as records.  Furthermore, this question didn’t take into account the entire breadth of data across all tables.

Understanding the definition of data size then leads us to understanding the nature of software scaling.  There are two aspects to this.

The first is the software’s ability to presort the data against the database in such as way as to ensure that latency–the delay in performance when the data is loaded–is minimized.  The principles applied here go back to database management practices back in the day when organizations used to hire teams of data scientists to rationalize data that was processed in machine language–especially when it used to be stored in ASCII or, for those who want to really date themselves, EBCDIC, which were incoherent by today’s generation of more human-readable friendly formats.

Quite simply, the basic steps applied has been to identify the syntax, translate it, find its equivalents, and then sort that data into logical categories that leverage database pointers.  What you don’t want the software to do is what used to be done during the earliest days of dealing with data, which was smaller by today’s standards, of serially querying ever data element in order to fetch only what the user is calling.  Furthermore, it also doesn’t make much sense to deal with all data as a Repository of Babel to apply labor-intensive data mining in non-relational databases, especially in cases where the data is well understood and fairly well structured, even if in a proprietary structure.  If we do business in a vertical where industry standards in data apply, as in the use of the UN/CEFACT XML convention, then much of the presorting has been done for us.  In addition, more powerful industry APIs (like OLE DB and ODBC) that utilize middleware (web services, XML, SOAP, MapReduce, etc.) multiply the presorting capabilities of software, providing significant performance improvements in accessing big data.

The other aspect is in the software’s ability to understand limitations in data communications hardware systems.  This is a real problem, because the backbone in corporate communication systems, especially to ensure security, is still largely done over a wire.  The investments in these backbones is usually categorized as a capital investment, and so upgrades to the system are slow.  Furthermore, oftentimes backbone systems are embedded in physical plant building structures.  So any software performance is limited by the resistance and bandwidth of wiring.  Thus, we live in a world where hardware storage and processing is doubling every 12-18 months, and software is designed to better leverage such expansion, but the wires over which data communication depends remains stuck in the past–constrained by the basic physics of CAT 6 or Fiber Optic cabling.

Needless to say, software manufacturers who rely on constant communications with the database will see significantly degraded performance.  Some software publishers who still rely on this model use the “check out” system, treating data like a lending library, where only one user or limited users can access the same data.  This, of course, reduces customer flexibility.  Strategies that are more discrete in handling data are the needed response here until day-to-day software communications can reap the benefits of physical advancements in this category of hardware.  Furthermore, organizations must understand that the big Cloud in the sky is not the answer, since it is constrained by the same physics as the rest of the universe–with greater security risks.

All of this leads me to a discussion I had with a colleague recently.  He opened his iPhone and did a query in iTunes for an album.  In about a second or so his query selected the artist and gave a list of albums–all done without a wire connection.  “Why can’t we have this in our industry?” he asked.  Why indeed?  Well, first off, Apple iTunes has sorted its data to optimize performance with its app, and it is performed within a standard data stream and optimized for the Apple iOS and hardware.  Secondly, the variables of possible queries in iTunes are predefined and tied to a limited and internally well-defined set of data.  Thus, the data and application challenges are not equivalent as found in my friend’s industry vertical.  For example, aside from the challenges of third party data normalization and rationalization, iTunes is not dealing with dynamic time-phased or trending data that requires multiple user updates to reflect changes using predictive analytics which is then served to different classes of users in a highly secure environment.  Finally, and most significantly, Apple spent more on that system than the annual budget of my friend’s organization.  In the end his question was a good one, but in discussing this it was apparent that just saying “give me this” is a form of magical thinking and hand waving.  The devil is in the details, though I am confident that we will eventually get to an equivalent capability.

At the end of the day IT project management strategy must take into account the specific needs of classes of users in making determinations of scaling.  What this means is a segmented approach: thick-client users, compartmentalized local installs with subsets of data, thin-client, and web/mobile or terminal services equivalents.  The practical solution is still an engineered one that breaks the elephant into digestible pieces that lean forward to leveraging advances in hardware, database, and software operating environments.  These are the essential building blocks to data optimization and scaling.

Living in the Material World — A Call for Open Data from Materials Science

Since advocating for transparent and open data schemas and data repositories, I’ve run into other high tech professionals who have opined that such data requirements are only applicable to my core competency in the U.S. Aerospace & Defense vertical.  Now, in the 26 November on-line edition of the journal Science, comes an opinion piece from a number of Chinese corrosion scientists under the title “Materials science: Share corrosion data” have advocated the development of open data infrastructures to make the age and life of existing metallurgic infrastructure easily accessible to technology in a non-proprietary format.

As the article states, “corrosion costs six cents for every dollar of gross domestic product in the United States. Globally, that amounts to more than US$4 trillion a year — equivalent to damages from 40 Hurricane Katrinas. Half of that cost is in corrosion prevention and control, the other half in damages and lost productivity.”

In the software technology industry, the recent issues regarding the Trans-Pacific Partnership (TPP) has revealed strains in U.S.-Chinese relations on intellectual property and proprietary systems.  So the danger is always that the source of the advocacy will be ignored, despite the clear economic argument in favor of what the editorial proposes.

But the concern is transnational in nature.  As the piece points out, the U.S. Department of Energy has initiated its own program in alignment with the U.S. National Institute of Standards and Technology (NIST) Materials Genome Initiative (MGI) to establish open repositories of data in support of the alternative energy industry.  Given the aging U.S. infrastructure and the dangers from sudden failure from these engineering systems, it seems wise to track and share data for materials corrosion and lifespans.  Think of the sudden bridge failures that have hit headlines over the last several years.

Since the cost of not doing so can be measured in lives as well as economic cost, such data normalization and rationalization in this area takes on the role of being an essential element of governance.

Over at AITS.org — Big Data and the Repository of Babel

In 1941, the Argentine writer Jorge Luis Borges (1899-1986) published a short story entitled “The Library of Babel.”  In the story Borges imagines a universe, known as the Library, which is described by the story’s narrator as made up of adjacent hexagonal rooms.  To find out more about how the Library is an effective metaphor for, among other things, how we should approach Big Data, please take a look at my article here.