I’ve had a lot of discussions lately on data normalization, including being asked the question of what constitutes normalization when dealing with legacy data, specifically in the field of project management. A good primer can be found at About.com, but there are also very good older papers out on the web from various university IS departments. The basic principals of data normalization today consist of finding a common location in the database for each value, reducing redundancy, properly establishing relationships among the data elements, and providing flexibility so that the data can be properly retrieved and further processed into intelligence in such as way as the objects produced possess significance.
The reason why answering this question is so important is because our legacy data is of such a size and of such complexity that it falls into the broad category of Big Data. The condition of the data itself provides wide variations in terms of quality and completeness. Without understanding the context, interrelationships, and significance of the elements of the data, the empirical approach to project management is threatened, since being able to use this data for purposes of establishing trends and parametric analysis is limited.
A good paper that deals with this issue was authored by Alleman and Coonce, though it was limited to Earned Value Management (EVM). I would argue that EVM, especially in the types of industries in which the discipline is used, is pretty well structured already. The challenge is in the other areas that are probably of more significance in getting a fuller understanding of what is happening in the project. These areas of schedule, risk, and technical performance measures.
In looking at the Big Data that has been normalized to date–and I have participated with others in putting a significant dent in this area–it is apparent that processes in these other areas lack discipline, consistency, completeness, and veracity. By normalizing data in sub-specialties that have experienced an erosion in enforcing standards of quality and consistency, technology becomes a driver for process improvement.
A greybeard in IT project management once said to me (and I am not long in joining that category): “Data is like water, the more it flows downstream the cleaner it becomes.” What he meant is that the more that data is exposed in the organizational stream, the more it is questioned and becomes a part of our closed feedback loop: constantly being queried, verified, utilized in decision making, and validated against reality. Over time more sophisticated and reliable statistical methods can be applied to the data, especially if we are talking about performance data of one sort or another, that takes periodic volatility into account in trending and provides us with a means for ensuring credibility in using the data.
In my last post on Four Trends in Project Management, I posited that the question wasn’t more or less data but utilization of data in a more effective manner, and identifying what is significant and therefore “better” data. I recently heard this line repeated back to me as a means of arguing against providing data. This conclusion was a misreading of what I was proposing. One level of reporting data in today’s environment is no more work than reporting on any other particular level of a project hierarchy. So cost is no longer a valid point for objecting to data submission (unless, of course, the one taking that position must admit to the deficiencies in their IT systems or the unreliability of their data).
Our projects must be measured against the framing assumptions in which they were first formed, as well as the established measures of effectiveness, measures of performance, and measures of technical achievement. In order to view these factors one must have access to data originating from a variety of artifacts: the Integrated Master Schedule, the Schedule and Cost Risk Analysis, and the systems engineering/technical performance plan. I would propose that project financial execution metrics are also essential in getting a complete, integrated, view of our projects.
There may be other supplemental data that is necessary as well. For example, the NDIA Integrated Program Management Division has a proposed revision to what is known as the Integrated Baseline Review (IBR). For the uninitiated, this is a process in which both the supplier and government customer project teams can come together, review the essential project artifacts that underlie project planning and execution, and gain a full understanding of the project baseline. The reporting systems that identify the data that is to be reported against the baseline are identified and verified at this review. But there are also artifacts submitted here that contain data that is relevant to the project and worthy of continuing assessment, precluding manual assessments and reviews down the line.
We don’t yet know the answer to these data issues and won’t until all of the data is normalized and analyzed. Then the wheat from the chaff can be separated and a more precise set of data be identified for submittal, normalized and placed in an analytical framework to give us more precise information that is timely so that project stakeholders can make decisions in handling any risks that manifest themselves during the window that they can be handled (or make the determination that they cannot be handled). As the farmer says in the Chinese proverb: “We shall see.”
3 thoughts on “Days of Future Passed — Legacy Data and Project Parametrics”
Normalization, as defined by Edgar F. Codd and popularized by Chris Date, refers to relational databases. It is a design approach that seeks to reduce the occurrence of duplicate data. Relational databases managed using a single application, or a group of applications that respect the referential integrity rules of the system of record, benefit from this rigorous approach – it reduces the risk of errors creeping into the data as records are added or maintained, and makes management reporting queries more efficient.
Historical records, on the other hand, are not maintained; they are queried. Careful controls are used to ensure that historical records are not updated, corrected, tweaked, or otherwise dinked with. That said, not all historical records are consumable in their native state. Many require some form of rationalization before they can be used. For example, two databases containing shoe sizes may have incompatible values, if the data was collected in different countries. But there is a more difficult conundrum: unaccounted-for states.
In an earlier time, a field called “Sex” would have two mutually exclusive values. Then someone noted that the transgendered might not self-identify with either value. In other jurisdictions, where privacy matters are given more weight, and the entity collecting the data has to be able to justify the request, a “None of your damned business” value is required. And as more records are prepared by software actors, we’ve found the need for “We don’t know, exactly.” Consequently, commingling history with new-age records requires more than simple translation – it requires an understanding of the original request that resulted in the recorded response, and the nuances implied in the recorded values. That level of understanding can usually be provided only through transformation at the source, rather than at the destination.
All that said: if knowledge is an artifact, then it merits design. But just as there is more than one valid, effective design for a soup spoon, so will there be multiple valid designs for the knowledge accumulated during a project. I would suggest that, while normalization focuses on standardizing design, it does not address history, nor does it account for data nuances. I would also point out that current query technology and natural-language processing has made the analysis of history much less dependent on the relational models used to collect and maintain it. In order to support management reporting across lots of databases, you need to be able to make queries that are meaningful in the context of the question at hand, without regard to the underlying values in the database, or the primary and foreign keys. Aggregating the transformed responses requires a certain amount of faith in the sources, but just as relational designs make records less subject to error by eliminating redundant values, so does localized transformation reduce errors by eliminating redundant rules.
One day, our brilliant designs will be seen as quaint, or at least insufficient. With luck, we’ll live long enough to be embarrassed by them. Peace be with you.
Comments are closed.