Potato, Potahto, Tomato, Tomahto: Data Normalization vs. Standardization, Why the Difference Matters

In my vocation I run a technology company devoted to program management solutions that is primarily concerned with taking data and converting it into information to establish a knowledge-based environment. Similarly, in my avocation I deal with the meaning of information and how to turn it into insight and knowledge. This latter activity concerns the subject areas of history, sociology, and science.

In my travels just prior to and since the New Year, I have come upon a number of experts and fellow enthusiasts in these respective fields. The overwhelming numbers of these encounters have been productive, educational, and cordial. We respectfully disagree in some cases about the significance of a particular approach, governance when it comes to project and program management policy, but generally there is a great deal of agreement, particularly on basic facts and terminology. But some areas of disagreement–particularly those that come from left field–tend to be the most interesting because they create an opportunity to clarify a larger issue.

In a recent venue I encountered this last example where the issue was the use of the phrase data normalization. The issue at hand was that the use of “data normalization” suggested some statistical methodology in reconciling data into a standard schema. Instead, it was suggested, the term “data standardization” was more appropriate.

These phrases do not describe the same thing, but they do describe processes that are symbiotic, not mutually exclusive. So what about data normalization? No doubt there is a statistical use of the term, but we are dealing with the definition as used in digital technology here, just as the use of “standardization” was suggested in the same context. There are many examples of technical terminology that do not have the same meaning when used in different contexts. Here is the definition of normalization applied to data science from Technopedia, which is the proper use of the term in this case:

Normalization is the process of reorganizing data in a database so that it meets two basic requirements: (1) There is no redundancy of data (all data is stored in only one place), and (2) data dependencies are logical (all related data items are stored together). Normalization is important for many reasons, but chiefly because it allows databases to take up as little disk space as possible, resulting in increased performance.

Normalization is also known as data normalization

This is pretty basic (and necessary) stuff. I have written at length about data normalization, but also pair it with two other terms. This is data rationalization and contextualization. Here is a short definition of rationalization:

What is the benefit of Data Rationalization? To be able to effectively exploit, manage, reuse, and govern enterprise data assets (including the models which describe them), it is necessary to be able to find them. In addition, there is (or should be) a wealth of semantics (e.g. business names, definitions, relationships) embedded within an organization’s models that can be exposed for improved analysis and knowledge transfer. By linking model objects (across or within models) it is possible to discover the higher order conceptual objects for any given object. Conversely, it is possible to identify what implementation artifacts implement a higher order model object. For example, using data rationalization, one can traverse from a conceptual model entity to a logical model entity to a physical model table to a database table, etc. Similarly, Data Rationalization enables understanding of a database table by traversing up through the different model levels.

Finally, we have contextualization. Here is a good definition using Wikipedia:

Context or contextual information is any information about any entity that can be used to effectively reduce the amount of reasoning required (via filtering, aggregation, and inference) for decision making within the scope of a specific application.[2] Contextualisation is then the process of identifying the data relevant to an entity based on the entity’s contextual information. Contextualisation excludes irrelevant data from consideration and has the potential to reduce data from several aspects including volume, velocity, and variety in large-scale data intensive applications

There is no approximation of reflecting the accuracy of data in any of these terms wihin the domain of data and computer science. Nor are there statistical methods involved to approximate what needs to be accomplished precisely. The basic skill required to accomplish these tasks–knowing that the data is structured and pre-conditioned–is to reconcile the various lexicons from differing sources, much as I reconcile in my avocation the meaning of words and phrases across periods in history and across languages.

In this discussion we are dealing with the issue of different words used to describe a process or phenomenon. Similarly, we find this challenge in data.

So where does this leave data standardization? In terms of data and computer science, this describes a completely different method. Here is a definition from Wikipedia, which is the proper contextual use of the term under “Standard data model”:

A standard data model or industry standard data model (ISDM) is a data model that is widely applied in some industry, and shared amongst competitors to some degree. They are often defined by standards bodies, database vendors or operating system vendors.

In the context of project and program management, particularly as it relates to government data submission and international open standards across vendors in an industry, is the use of a common schema. In this case there is a DoD version of a UN/CEFACT XML file currently set as the standard, but soon to be replaced by a new standard using the JSON file structure.

In any event, what is clear here is that, while standardization is a necessary part of a data policy to allow for sharing of information, the strength of the chosen schema and the instructions regarding it will vary–and this variation will have an effect on the quality of the information shared. But that is not all.

This is where data normalization, rationalization, and contextualization come into play. In order to create data for the a standardized format, it is first necessary to convert what is an otherwise opaque set of data due to differences into a cohesive lexicon. In data, this is accomplished by reconciling data dictionaries to determine which items are describing the same thing, process, measure, or phenomenon. In a domain like program management, this is a finite set. But it is also specialized knowledge and where the value is added to any end product that is produced. Then, once we know how to identify the data, we must be able to map those terms to the standard schema but, keeping on eye on the use of the data down the line, must be able to properly structure and ensure interrelationships of the data are established and/or maintained to ensure its effective use. This is no mean task and why all data transformation methods and companies are not the same.

Furthermore, these functions can be accomplished efficiently or inefficiently. The inefficient method is to take the old-fashioned business intelligence method that has been around since the 1980s and before, where a team of data scientists and analysts deal with data as if it is flat and, essentially, reinvents the wheel in establishing the meaning and proper context of the data. Given enough time and money anything can be accomplished, but brute force labor will not defeat the Second Law of Thermodynamics.

In computing, which comes close to minimizing that physical law, we know that data has already been imbued with meaning upon its initial processing. In lieu of brute force labor we apply intelligence and knowledge to accomplish this requirement. This is called normalization, rationalization, and contextualization of data. It requires a small fraction of other methods in terms of time and effort, and is infinitely more transparent.

Using these methods is also where innovation, efficiency, performance, accuracy, scalability, and anticipating future requirements based on the latest technology trends comes into play. Establishing a seamless flow of data integration allows, for example, the capture of more data being able to be properly structured in a database, which lays the ground for the transition from 2D to 3D and 4D (that is, what is often called integrated) program management, as well as more effective analytics.

The term “standardization” also suffers from a weakness in data and computer science that requires that it be qualified. After all, data standardization in an enterprise or organization does not preclude the prescription of a propriety dataset. In government, this is contrary to both statutory and policy mandates. Furthermore, even given an effective, open standard, there will be a large pool of legacy and other non-conforming data that will still require capture and transformation.

The Section 809 Panel study dealt directly with this issue:

Use existing defense business system open-data requirements to improve strategic decision making on acquisition and workforce issues…. DoD has spent billions of dollars building the necessary software and institutional infrastructure to collect enterprise wide acquisition and financial data. In many cases, however, DoD lacks the expertise to effectively use that data for strategic planning and to improve decision making. Recommendation 88 would mitigate this problem by implementing congressional open-data mandates and using existing hiring authorities to bolster DoD’s pool of data science professionals.

Section 809 Volume 3, Section 9, p.477

As operating environment companies expose more and more capability into the market through middleware and other open systems methods of visualizing data, the key to a system no longer resides in its ability to produce charts and graphs. The use of Excel as an ad hoc data repository with its vulnerability to error, to manipulation, and for its resistance to the establishment of an optimized data management and corporate knowledge environment is a symptom of the larger issue.

Data and its proper structuring is at the core of organizational success and process improvement. Standardization alone will not address barriers to data optimization. According to RAND studies in 2015 and 2017* these are:

  • Data Quality and Discontinuities
  • Data Silos and Underutilized Repositories
  • Timeliness of Data for use by SMEs and Decision-makers
  • Lack of Access and Contextualization
  • Traceability and Auditability
  • Lack of the Ability to Apply Discovery in the Data
  • The issue of Contractual Technical Data and Proprietary Data

That these issues also exist in private industry demonstrates the universality of the issue. Thus, yes, standardize by all means. But also ensure that the standard is open and that transformation is traceable and auditable from the the source system to the standard schema, and then into the target database. Only then will the enterprise, the organization, and the government agency have full ownership of the data it requires to efficiently and effectively carry out its purpose.

*RAND Corporation studies are “Issues with Access to Acquisition Data and Information in the DoD: Doing Data Right in Weapons System Acquisition” (RR880, 2017), and “Issues with Access to Acquisition Data and Information in the DoD: Policy and Practice (RR1534, 2015). These can be found here.

For What It’s Worth — More on the Materiality and Prescriptiveness Debate and How it Affects Technological Solutions

The underlying basis on the materiality vs. prescriptiveness debate that I previously wrote about lies in two areas:  contractual compliance, especially in the enforcement of public contracts, and the desired outcomes under the establishment of a regulatory regime within an industry.  Sometimes these purposes are in agreement and sometimes they are in conflict and work at cross-purposes to one another.

Within a simple commercial contractual relationship, there are terms and conditions established that are based on the expectation of the delivery of supplies and services.  In the natural course of business these transactions are usually cut-and-dried: there is a promise for a promise, a meeting of the minds, consideration, and performance.  Even in cases that are heavily reliant on services, where the terms are bit more “fuzzy,” the standard is that the work being performed be done in a “workmanlike” or “professional” manner, usually defined by the norms of the trade or profession involved.  There is some judgment here depending on the circumstances, and so disputes tend to be both contentious and justice oftentimes elusive where ambiguity reigns.

In research and development contracts the ambiguities and contractual risks are legion.  Thus, the type of work and the ability to definitize that work will, to the diligent contract negotiator, determine the contract type that is selected.  In most cases in the R&D world, especially in government, contract types reflect a sharing and handling of risk that is reflected in the use of cost-plus type contracts.

Under this contract type, the effort is reimbursed to the contractor, who must provide documentation on expenses, labor hours, and work accomplished.  Overhead, G&A, and profit is negotiated based on a determination of what is fair and reasonable against benchmarks in the industry, which will be ultimately determined through negotiation of the parties.  A series of phases and milestones are established to mark the type of work that is expected to be accomplished over time.  The ultimate goal is the produce a prototype end item application that meets the needs of the agency, whether that agency is the Department of Defense or some other civilian agency in the government.

The period of performance of the contracts in these cases, depending on the amount of risk in research and development in pushing the requisite technology, usually involving several years.  Thus, the areas of concern given the usually high dollar value, inherent risk, and extended periods, involve:

  1. The reliability, accuracy, quality, consistency, and traceability of the underlying systems that report expenditures, effort, and progress;
  2. Measures that are indicative of whether all of the aspects of the eventual end item will meet elements that constitute expectations and standards of effectiveness, performance, and technical achievement.  These measures are conducted within the overall cost and schedule constraints of the contracted effort;
  3. Assessment over the lifecycle of the contract regarding external, as well as internal technical, qualitative, and quantitative risks of the effort;
  4. The ability of items 1 through 3 above to provide an effective indication or early warning that the contractual vehicle will significantly vary from either the contractual obligations or the established elements outlining the physical, performance, and technical characteristics of the end item.
  5. The more mundane, but no less important, verification of contractual performance against the terms and conditions to avoid a condition of breach.

Were these the only considerations in public contracting related to project management our work in evaluating these relationships, while challenging, would be fairly cut-and-dried given that they would be looked at from a contracting perspective.  But there is also a systemic purpose for a regulatory regime.  These are often in conflict with one another.  Such requirements as compliance, surveillance, process improvement, and risk mitigation are looking at the same systems, but from different perspectives with, ultimately, differing reactions, levels of effectiveness, and results.  What none of these purposes includes is a punitive purpose or result–a line oftentimes overstepped, in particular, by private parties.  This does not mean that some regulations that require compliance with a law do not come with civil penalties, but we’ll get to that in a moment.

The underlying basis of any regulatory regime is established in law.  The sovereign–in our case the People of the United States through the antecedent documents of governance, including the U.S. Constitution and Constitutions of the various states, as well as common law–possesses an inherent right to regulate the health, safety, and welfare of the people.  The Preamble of the U.S. Constitution actually specifies this purpose in writing, but in broader terms.  Thus, the purposes of a regulatory regime when it comes to this specific issue are what are at issue.

The various reasons usually are as follows:

  1. To prevent an irreversible harm from occurring.
  2. To enforce a particular level of professional conduct.
  3. To ensure compliance with a set of regulations or laws, especially where ambiguities in civil and common law have yielded judicial uncertainty.
  4. To determine the level of surveillance of a regulated system that is needed based on a set of criteria.
  5. To encourage particular behaviors.
  6. To provide the basis for system process improvement.

Thus, in applying a regulation there are elements that go beyond the overarching prescriptiveness vs. materiality debate.  Materiality only speaks to relevance or significance, while prescriptiveness relates to “block checking”–the mindless work of the robotic auditor.

For example, let’s take the example of two high profile examples of regulation in the news today.

The first concerns the case of where Volkswagen falsified its emissions test results for a good many of its vehicles.  The role of the regulator in this case was to achieve a desired social end where the state has a compelling interest–the reduction of air pollution from automobiles.  The regulator–the Environmental Protection Agency (EPA)–found the discrepancy and issued a notice of violation of the Clean Air Act.  The EPA, however, did not come across this information on its own.  Since we are dealing with multinational products, the initial investigation occurred in Europe under a regulator there and the results passed to the EPA.  The corrective action is to recall the vehicles and “make the parties whole.”  But in this case the regulator’s remedy may only be the first line of product liability.  It will be hard to recall the pollutants released into the atmosphere and breach of implicit contract with the buyers of the automobiles.  Whether a direct harm can be proven is now up to the courts, but given that executives in an internal review (article already cited) admitted that executives knew about deception, the remedies now extend to fraud.  Regardless of the other legal issues,

The other high profile example is the highly toxic levels of lead in the drinking water of Flint, Michigan.  In this case the same regulator, the EPA, has issued a violation of federal law in relation to safe drinking water.  But as with the European case, the high levels of lead were first discovered by local medical personnel and citizens.  Once the discrepancy was found a number of actions were required to be taken to secure proper drinking water.  But the damage has been done.  Children in particular tend to absorb lead in their neurological systems with long term adverse results.  It is hard to see how the real damage that has been inflicted will make the damaged parties whole.

Thus, we can see two things.  First, the regulator is firmly within the tradition of regulating the health, safety, and welfare, particularly the first category and second categories.  Secondly, the regulatory regime is reactive.

While obviously the specific illnesses caused by the additional pollution form Volkswagen vehicles is probably not directly traceable to harm, the harm in the case of elevated lead levels in Flint’s water supply is both traceable and largely irreversible.

Thus, in comparing these two examples, we can see that there are other considerations than the black and white construct of materiality and prescriptiveness.  For example, there are considerations of irreversible harm, prevention, proportionality, judgment, and intentional results.

The first reason for regulation listed above speaks to irreversible harm.  In these cases proportionality and prevention are the main concerns.  Ensuring that those elements are in place that will prevent some catastrophic or irreversible harm through some event or series of events is the primary goal in these cases.  When I say harm I do not mean run of the mill, litigious, constructive “harm” in the legal or contractual sense, but great harm–life and death, resulting disability, loss of livelihood, catastrophic market failure, denial of civil rights, and property destruction kind of harm.  In enforcing such goals, these fall most in line with prescriptiveness–the establishment of particular controls which, if breached, would make it almost impossible to fully recover without a great deal of additional cost or effort.  Furthermore, when these failures occur a determination of culpability or non-culpability is determined.  The civil penalties in these cases, where not specified by statute, are ideally determined by proportionality of the damage.  Oftentimes civil remedies are not appropriate since these often involve violations of law.  This occurs, in real life, from the two main traditional approaches to audit and regulation being rooted in prescriptive and judgmental approaches.

The remainder of the reasons for regulation provide degrees of oversight and remedy that are not only proportional to the resulting findings and effects, but also to the goal of the regulation and its intended behavioral response.  Once again, apart from the rare and restricted violations given in the first category above, these regulations are not intended to be enforced in a punitive manner, though there can be penalties for non-compliance.  Thus, proportionality, purpose, and reasonableness are additional considerations to take into account.  These oftentimes fall within the general category of materiality.

Furthermore, going beyond prescriptiveness and materiality, a paper entitled Applying Systems-Thinking to Reduce Check-the-Box Decisions in the Audit of Complex Estimates, by Anthony Bucaro at the University of Illinois at Urbana-Champaign, proposes an alternative auditing approach that also is applicable to other types of regulation, including contract management.  The issue that he is addressing is the fact that today, in using data, a new approach is needed to shift the emphasis to judgment and other considerations in whether a discrepancy warrants a finding of some sort.

This leads us, then, to the reason why I went down this line of inquiry.  Within project management, either a contractual or management prerogative already exists to apply a set of audits and procedures to ensure compliance with established business processes.  Particular markets are also governed by statutes regulating private conduct of a public nature.  In the government sphere, there is an added layer of statutes that prescribe a set of legal and administrative guidance.  The purposes of these various rules varies.  Obviously breaking a statute will garner the most severe and consequential penalties.  But the set of regulatory and administrative standards often act at cross purposes, and in their effect, do not rise to the level of breaking a law, unless they are necessary elements in complying with that law.

Thus, a whole host of financial and performance data assessing what, at the core, is a very difficult “thing” to execute (R&D leading to a new end item), offers some additional hazards under these rules.  The underlying question, outside of statute, concerns what the primary purpose should be in ensuring their compliance.  Does it pass the so-what? test if a particular administrative procedure is not followed to the letter?

Taking a broader approach, including a data-driven and analytical one, removes much of the arbitrariness when judgment and not box-checking is the appropriate approach.  Absent a consistent and wide pattern that demonstrates a lack of fidelity and traceability of data within the systems that have been established, auditors and public policymakers must look at the way that behavior is affected.  Are there incentives to hide or avoid issues, and are there sufficient incentives to find and correct deficiencies?  Are the costs associated with dishonest conclusions adequately addressed, and are there ways of instituting a regime that encourages honesty?

At the core is technology–both for the regulated and the regulator.  If the data that provides the indicators of compliance come, unhindered, from systems of record, then dysfunctional behaviors are minimized.  If that data is used in the proper manner by the regulator in driving a greater understanding of the systemic conditions underlying the project, as well as minimizing subjectivity, then the basis for trust is established in determining the most appropriate means of correcting a deficiency.  The devil is in the details, of course.  If the applied technology simply reproduces the check-block mentality, then nothing has been accomplished.  Business intelligence and systems intelligence must be applied in order to achieve the purposes that I outlined earlier.

 

Second Foundation — More on a General Theory of Project Management

In ending my last post on developing a general theory of project management, I introduced the concept of complex adaptive systems (CAS) and posited that projects and their ecosystems fall into this specific category of systems theory.  I also posited that it is through the tools of CAS that we will gain insight into the behavior of projects.  The purpose is not only to identify commonalities in these systems across what is frequently asserted are irreconcilable across economic market verticals, but to identify regularities and the proper math in determining the behavior of these systems.

A brief overview of some of the literature is in order so that we can define our terms, since CAS is a Protean term that has evolved with its application.  Aside from the essential work at the Santa Fe Institute, some of which I linked in my last post on the topic, I would first draw your attention to an overview of CAS by Serena Chan at MIT.  Ms. Chan wrote her paper in 2001, and so her perspective in one important way has proven to be limited, which I will shortly address.  Ms. Chan correctly defines complexity and I will leave it to the reader to go to the link above to read the paper.  The meat of her paper is her definition of CAS by identifying its characteristics.  These are: distributed control, connectivity, co-evolution, sensitive dependence on initial conditions, emergence, distance from equilibrium, and existence in a state of paradox.  She then posits some tools that may be useful in studying the behavior of CAS and then concludes with an odd section on the application of CAS to engineering systems, positing that engineering systems cannot be CAS because they are centrally controlled and hence do not exhibit emergence (non-preprogrammed behavior).  She interestingly uses the example of the internet as her proof.  In the year 2015, I don’t think one can seriously make this claim.  Even in 2001 such an assertion would be specious for it had been ten years since the passage of the High Performance Computing and Communication Act of 1991 (also called the Gore Bill) which commercialized ARPANET.  (Yes, he really did have a major hand in “inventing” the internet as we know it).  It was also eight years from the introduction of Mosaic.  Thus, the internet, as many engineering systems requiring collaboration and human interaction, fall under the rubric of CAS as defined by Ms. Chan.

The independent consultant Peter Fryer at his Trojan Mice blog adds a slightly different spin to identifying CAS.  He asserts that CAS properties are emergence, co-evolution, suboptimal, requisite variety, connectivity, simple rules, iteration, self-organizing, edge of chaos, and nested systems.  My only pique with many of these stated characteristics is that they seem to be slightly overlapping and redundant, splitting hairs without adding to our understanding.  They also tend to be covered by the larger definitions of systems theory and complexity.  Perhaps its worth reducing them within CAS because they provide specific avenues in which to study these types of systems.  We’ll explore this in future posts.

An extremely useful book on CAS is by John H. Miller and Scott E. Page under the rubric of the Princeton Studies in Complexity entitled Complex Adaptive Systems: An Introduction to Computational Models of Social Life.  I strongly recommend it.  In the book Miller and Page explore the concepts of emergence, self-organized criticality, automata, networks, diversity, adaptation, and feedback in CAS.  They also recommend mathematical models to study and assess the behavior of CAS.  In future posts I will address the limitations of mathematics and its inability to contribute to learning, as opposed to providing logical proofs of observed behavior.  Needless to say, this critique will also discuss the further limitations of statistics.

Still, given these stated characteristics, can we state categorically that a project organization is a complex adaptive system?  After, all people attempt to control the environment, there are control systems in place, oftentimes work and organizations are organized according to the expenditure of resources, there is a great deal of planning, and feedback occurs on a regular basis.  Is there really emergence and diversity in this kind of environment?  I think so.  The reason why I think so is because of the one obvious factor that is measures despite the best efforts to exert control, which in reality consists of multiple agents: the presence of risk.  We think we have control of our projects, but in reality we only can exert so much control. Oftentimes we move the goalposts to define success.  This is not necessarily a form of cheating, though sometimes it can be viewed in that context.  The goalposts change because in human CAS we deal with the concept of recursion and its effects.  Risk and recursion are sufficient to land project efforts clearly within the category of CAS.  Furthermore, that projects clearly fall within the definition of CAS follows below.

It is within an extremely useful paper written on CAS from a practical standpoint that was published in 2011 and written by Keith L. Green of the Institute for Defense Analysis (IDA) entitled Complex Adaptive Systems in Military Analysis that we find a clear and comprehensive definition.  In borrowing from A. S. Elgazzar, of both the mathematics departments of El-Arish, Egypt and Al-Jouf King Saud University in the Kingdom of Saudi Arabia; and A. S. Hegazi of the Mathematics Department, Faculty of Science at Mansoura, Egypt–both of whom have contributed a great deal of work on the study of the biological immune systems as a complex adaptive system–Mr. Green states:

A complex adaptive system consists of inhomogeneous, interacting adaptive agents.  Adaptive means capable of learning.  In this instance, the ability to learn does not necessarily imply awareness on the part of the learner; only that the system has memory that affects its behavior in the environment.  In addition to this abstract definition, complex adaptive systems are recognized by their unusual properties, and these properties are part of their signature.  Complex adaptive systems all exhibit non-linear, unpredictable, emergent behavior.  They are self-organizing in that their global structures arise from interactions among their constituent elements, often referred to as agents.  An agent is a discrete entity that behaves in a given manner within its environment.  In most models or analytical treatments, agents are limited to a simple set of rules that guide their responses to the environment.  Agents may also have memory or be capable of transitioning among many possible internal states as a consequence of their previous interactions with other agents and their environment.  The agents of the human brain, or of any brain in fact, are called neurons, for example.  Rather than being centrally controlled, control over the coherent structure is distributed as an emergent property of the interacting agents.  Collectively, the relationships among agents and their current states represent the state of the entire complex adaptive system.

No doubt, this definition can be viewed as having a specific biological bias.  But when applied to the artifacts and structures of more complex biological agents–in our case people–we can clearly see that the tools we use must been broader than those focused on a specific subsystem that possesses the attributes of CAS.  It calls for an interdisciplinary approach that utilizes not only mathematics, statistics, and networks, but also insights from the areas of the physical and computational sciences, economics, evolutionary biology, neuroscience, and psychology.  In understanding the artifacts of human endeavor we must be able to overcome recursion in our observations.  It is relatively easy for an entomologist to understand the structures of ant and termite colonies–and the insights they provide of social insects.  It has been harder, particularly in economics and sociology, for the scientific method to be applied in a similarly detached and rigorous method.  One need only look to the perverse examples of Spencer’s Social Statics and Murray and Herrnstein’s The Bell Curve as but two examples where selection bias, ideology, class bias, and racism have colored such attempts regarding more significant issues.

It is my intent to avoid bias by focusing on the specific workings of what we call project systems.  My next posts on the topic will focus on each of the signatures of CAS and the elements of project systems that fall within them.