Innervisions: The Connection Between Data and Organizational Vision

During my day job I provide a number of fairly large customers with support to determine their needs for software that meets the criteria from my last post. That is, I provide software that takes an open data systems approach to data transformation and integration. My team and I deliver this capability with an open user interface based on Windows and .NET components augmented by time-phased and data management functionality that puts SMEs back in the driver’s seat of what they need in terms of analysis and data visualization. In virtually all cases our technology obviates the need for the extensive, time consuming, and costly services of a data scientist or software developer.

Over the course of my career both as a consumer and a provider of technology solutions, I have seen an evolution in software that began with simple point solutions being developed to automate particular manual processes, to more sophisticated solutions that are designed to automate a complex function. In most of these cases, a customer has identified a gap or deficiency in their requirements that represents an inefficiency or sub-optimization of their processes and then seek a software “tool” to acquire in order to address that specific purpose. The application of these “tools” combine to meet the overall vision of the organization or sub-system within the organization.

What Do You Do With A Problem Like “Tools”

The capabilities of software in terms of data handling capabilities and functionality double every 12-18 months in today’s environment. The use of the term “tools” for software, which is really based on a pre-2000 concept, is that in the mind’s eye software is analogous to any other tool. In the literature, particularly in that authored by consultants, this analogy is oftentimes extended to common household or construction tools: a wrench, a screwdriver, or a power drill. Under this concept each tool has a specific purpose and it is up to the SME to determine which tool is best for a specific job.

The problem with this concept is that not only is it obsolete, but it does great harm financially to the organization in terms of overhead costs, organizational efficiency and effectiveness.

First of all, most physical tools are fairly static in their specific use. A hammer is still a hammer, even if some sort of power is extended to give it power. It’s purpose remains to use force to insert a connective fastener, like a nail, into a medium, like a piece of wood. A nail gun, for instance, is a type of hammer. It is more powerful and efficient but, still, it is a glorified hammer. It is a superior tool in construction because it is more efficient, provides a consistency in quality, and is faster. It also eliminates the factors of arm strength, physical coordination, and visual alignment skills of the user; as anyone who has experienced a sore thumb as a result of a misaligned strike can attest. But a nail gun is still restricted to its specific function–sinking nails for the purpose of fastening.

Software, as it has evolved, was similarly based on the concept of a tool. The physical functions of a specific vocation were the first to undergo digitization: accountants and business operations personnel had spreadsheet software applications, secretarial and clerical staffs (yes, they used to exist) had word processing software, marketing and middle management could relay their ideas with presentation software, and the list went on.

As the power of software improved it followed the functions of traditional line-and-staff organizations. Many of these were built to replace the physical calculation of formulae and concepts that required a slide rule and, later, a scientific calculator. Soon scheduling software replaced manual GANTT planning, earned value software automated the calculation of basic EVM analytics, and risk software allowed for the complex formulation involved in assessing risk for the branch of a plan using simulated Monte Carlo analysis.

Each of these software applications targeted a specific occupation, and incorporated specific knowledge (functionality) required of that occupation.

Organizational software for multiple functions usually consisted of a suite of tools under the rubric of an ERP or Business Intelligence System. Modules and “bolt-ons” consisted of tying together business processes and point software requirements augmented by large software consulting staffs to customize the solutions. In actual practice, however, these were software tools tied together though a common brand and operating environment. Oftentimes the individual bolt-ons and tools weren’t even authored by the same development team with a common vision in mind, but a reaction to market forces that required a gap be filled through acquisition of a company or intellectual property.

Needless to say, these “enterprise” solutions aren’t that at all. Instead, they are a business-driven means to penetrate a vertical by providing scattershot functionality. Once inside a company or organization the other bolt-ons and modules are marketed in order to take over other business processes. Integration is achieved across domains through data transfer or other interpretive methods.

This approach has been successful, as it has been since the halcyon days when IBM dominated the computing market, especially among the larger software firms. It also meets many of the emotional and psychic needs of many senior managers. After all, the software firm–given its economic size–feels solid. The numbers of specialists introduced into the organization to augment staff provide a feeling of safety and accomplishment. C-level management and stockholders feel that risk is handled given that their software needs are being met at some level.

What this approach did not, and does not, meet is genuine data integration, especially given the realization that the data we have been using has been inadequate and artificially restricted based on what software providers were convincing their customers was the art of the possible. The term “Big Data” began to be introduced into the lexicon, and with it the economic realization that capturing and integrating datasets that were previously “impossible” to capture and integrate was (and presently is) an economic imperative.

But the approach of incumbents, whose priority is to remain “sticky” and to defend territory against new technologies, was to respond: “we have a tool for that.” Thus, the result has been the further introduction of inefficient individual applications with their inability to fully exploit data. Among these tools are largely “dumb”–that is, viewing data flat–data visualization tools that essentially paint pretty pictures from Excel or, when they need to be applied on a larger scale, default to the old business intelligence brute force approach of applying labor to derive the importance in data. Old habits are hard to change and what one person has done another can do. But this is the economic equivalent of what is called rent-seeking behavior. That is, it is inefficient and exploitative.

After all, if you buy what was advertised as a sports car you expect to see an engine under the hood and a transmission connected to a drivetrain and a pretty powerful one at that. What one does not expect is to buy the car but have to design and build the features of these essential systems while a team of individuals are paid by the hour to push us to where we want to go. Yet, organizations (and especially consultants) seem to be happy with this model when it comes to information management.

Thus, when a technology company like mine comes across a request for proposal, an informal invitation to participate in market research, or in exploratory professional meetings (largely virtual as of this writing), the emphasis and terminology is on software “tools”, which limits the ability of consumers to exploit technology because it mentally paints a picture that limits the definition of what software should do and can do.

This mindset, however, is beginning to change and, no doubt, our current predicament under the Coronavirus crisis will accelerate that transition.

To take our analogy one step further, we are long past the time when we must buy each component of an automobile individually and then assemble it in our own garage. Point solutions, which are set and inelastic, are like individual parts of the car.

Enterprise solutions consisting of different modules and datasets, oftentimes constructed from incompatible foundations, exacerbate this situation and add the element of labor to a supposedly automated process, like buying OEM products and having to upgrade the automobile we supposed bought to do its job, but still needed (with the help of a mechanic) to perform the normal functions of steering, stopping, and accelerating.

Open systems solutions provide more flexibility, but they can be both a blessing and a curse. The challenge is to provide the right balance of out-of-the-box point solution-type functionality while still providing enough flexibility for adaptability. Taking a common data approach is key to achieving this balance. This will require the abandonment of the concept of software “tools” and shifting the focus on data.

Data and Information Take Over: Two Models

The economic imperative for data integration and optimization developed from the needs of the organization and its practitioners–whether it be managers, analysts, or auditors working in a company, a business unit, a governmental agency, or a program or project organization–is to be positioned facing forward.

In order to face forward one must first establish a knowledge-based organization or, as oftentimes identified, a data-driven organization. What this means in real terms is that data is captured, processed, and contextualized so that its importance and meaning can be derived in a timely manner so that something can be done about what is happening. During our own present situation this is not just an economic imperative, but for public health an existential one for many of us.

Thus, we are faced with several key dimensions that must be addressed: size, manner of integration, contextualization, timeliness, and target. This applies to both known and unknown datasets.

Our known datasets are those that are already being used and populated in existing systems. We know, for example, that in program and project management that we require an estimate and plan, a schedule, a manner of organizing and tracking our progress, financial management and material management systems and others. These represent our pool of structured data, and understanding the lexicon of these systems is what is necessary to normalize and rationalize the data through a universal translator.

Our unknown datasets are those that require collection but, when done, is collected and processed in an ad hoc manner. Usually the need for this data collection is learned through the school of hard knocks. In other cases, the information is not collected at all or accidentally, such as when management relies on outside experts and anecdotal information. This is the equivalent of an organizational JOHARI window shown below.

Overview of Johari Window with quadrants
showing the relationships of self-knowledge and understanding

The Johari Window explains our perceptions and our relationship to the outside world. Our universe is not a construction of our own making or imagination. We cannot make our own reality nor are there “alternative facts.” The most colorful example of refuting this specious philosophical mind game is relayed to us in Boswell’s Life of Samuel Johnson.

After we came out of the church, we stood talking for some time together of Bishop Berkeley’s ingenious sophistry to prove the nonexistence of matter, and that every thing in the universe is merely ideal. I observed, that though we are satisfied his doctrine is not true, it is impossible to refute it. I never shall forget the alacrity with which Johnson answered, striking his foot with mighty force against a large stone, till he rebounded from it — “I refute it thus.”

We can deny what we do not know, or construct magical thinking. but reality is unmoved. In the case of Johnson he kicked the stone and the stone, also unmoved, kicked back in the form of the pain that Johnson felt when he “rebounded from it”.

Nor are the quadrants equal in our perceptual windows. Some people and organizations are very well informed and others less so, but the tension and conflict of our lives–both internally and externally–relates to expanding the “open” and “facade” portions of the Johari window so that we are not only informed of how others register us, but also to uncover the unknown, and to attempt to control how others perceive us in our various roles and guises.

We see this playing out in tracking the current Coronavirus pandemic. The absence of reliable widespread tests and testing infrastructure has impeded an understanding of the virus and the most effective strategies to deploy in dealing with it. Absent data, health and governmental agencies have been left with no choice but to use the same social distancing and travel restrictions deployed during the 1918 Influenza Pandemic and then, if lifting some of these, hope for the best.

This is the situation despite the fact that national risk assessments and risk registers, such as the U.S. National Security Council Pandemic Playbook and the U.K. National Risk Register, outlined measures to be taken given certain particular indicators. No doubt there are lessons to be learned here, but at the core lesson is the fact that, absent reliable and timely data that is converted into information that can be used in a decisive and practical manner, an organization, a state, or a nation risks its survival when it fails to imagine what information it needs to collect, absent the prosaic information that comes from performing the day-to-day routine.

Admittedly, there is no great insight here regarding this need (or, at least, there shouldn’t be). This condition is the reason why intelligence systems and agencies were created in the first place. It is why military and health services imagine scenarios and war-game them, and why organizations deploy brain-storming. Individuals and organizations that go into the world uninformed or self-deluded do not last long, and history is replete with such examples. Blanche DuBois relied on the kindness of strangers and we are best served by her experience as an archetype.

And yet, we still find ourselves struggling to properly collect, integrate, and utilize information at the same time that we have come to the realization that we need to collect and process information from larger pools of data. The root cause of this condition, as asserted above, rests in the mental framing of how to approach data and the problem that needs to be solved. It requires us to change the conceptual framework that relies on the concept of “tools.”

We can make this adjustment by realigning the object of the challenge so that it conforms with what we imagine to be the desired end-state. But, still, how do we determine what we need to collect? This is first a question of perception as opposed to one regarding knowledge: what one views as not only necessary but within the realm of possibility.

Once again, this dilemma is best served by models and, in this case, it is not unlike the Overton Window. Those preferring to eschew Wikipedia entries can also find a more detailed and nuanced definition at the source through the Mackinac Center for Public Policy website.

Overton Windows showing degrees of acceptability as modified by Joshua Trevino

Joseph Overton described the window as one of defining acceptable political policies in the mind of the public. He used the terms “more free” and “less free” to describe policies that think tanks recommend to describe the amount of government intervention, avoiding the left-right comparisons used by polemicists. Various adjustments and variations to the basic window have been proposed since his original use of the model, but it has been expanded to describe public perceptions in general on a host of socioeconomic concerns.

As with the Johari Window, I would posit that there is an analogous Overton Window in relation to information that frames what is viewed as the art of the possible. These perceptions influence the actions of decision-makers in assessing the risk involved in buying software solutions. When it comes to the rapidly developing field of data capture, transformation, and effective utilization, the perception from the start suggests some degree of risk and the danger of moving too quickly. For those in the field of data optimization, given that new technology capacity increases exponentially in shorter periods of time, the barrier here is to shift the informational Overton Window so that the market is educated on the risk-reward equation.

A Unified Model for Aligning Our Data

We have discussed two models up to this point in our exploration: an Informational Johari Window and an Informational Overton Window. Each of these models, using a simplified method, isolates different dimensions of the problem of data, which when freed of the concept of “tools” unlocking it, provides us with a clearer picture of the essential nature of its capture and utilization, and to what purposes.

We are now ready to take the next step in defining how to approach data to serve the strategic interests of the enterprise or organization.

For those of us in the information field, especially in the early years when applying solutions to line-and-staff organizations, what we found is that the very introduction of the new technology changed both the structure and nature of the organization. Initially we noted a sophisticated and accelerated version of the Hawthorne Effect. But there was something more elemental and significant going on.

Digital technology is amazingly attuned, especially when properly designed and deployed, to extend the functions of human knowledge gathering and processing. In this way it can be interpreted as an extension of human evolution–of the nature of human society acting as a complex adaptive system. In fact, there are so many connections between early physical, methodological, and industrial societal developments to digitization, such as the connection between the development of the Jacquard Loom to the development of the computer punch card (and there are others) that it seems that human society would have found a way to get to this point regardless of the existence of the intervening human pioneers, though their actual contributions are clear. (For further information on the waves of development see the books Future Shock and The Third Wave by Alvin Toffler.)

When many of us first applied digitized technology to knowledge workers (in my case in the field of contract management) we found that the very introduction of the technology changed perceptions, work habits, and organizational structures in very essential ways. Like the effect of the idea of evolution as described by Daniel Dennett, the application of digital evolution is like a universal acid–it eats through and transforms everything it touches.

For example, a report that, in the past, would have taken a week or two to complete, mostly because of the research required, now took a day or so. Procurement Action Lead Times (PALT) realized significant improvements since information previously only available in paper form was now provided on-line. At the same time, systems were now able to handle greater volumes of demand. As a result, customers’ expectations changed so much that they no longer felt that they had to hold back requests for fear of overloading the system and depend on human intervention. Suppliers, seeing many commodities experiencing steady and stable growth, reverted to just-in-time manufacturing.

Over time, typing pools and secretarial staffs, the former being commonplace well into the 1980s and the latter into the 1990s, except as symbols of privilege or prestige, disappeared. Middle management and many support staffs followed this trend in the early 2000s. Today, consulting services consisting of staffing personnel to apply non-value added manual solutions such as Excel spreadsheets and PowerPoint slides to display data that has already been captured and processed, still manage to hold on in isolated pockets. That this model is not sustainable nor efficient should be obvious except for the continued support these models lend to the self-serving concept of “tools.”

Thus, the next step in the alignment of data capture and utilization to organizational vision is the interplay between our models. Practical experience suggests, though anecdotal, that as forward-facing organizations adopt more powerful digitized technologies designed to capture more and larger datasets, and to better utilize that data, that they tend to move to expand their self-awareness–their Informational Johari Window.

This, in turn, allows them to distinguish between structured and unstructured data and the value–the qualitative information content–of these datasets. This knowledge is then applied to reduce the labor and custom code required for larger data capture and utilization. In the end, these developments then determine what is the art of the possible by moving and expanding the Informational Overton Window.

Combining these concepts from a data perspective results in a combined model as illustrated below from the perspective of the subject:

Data Window of Perception and Possibility (Subject)

Extending this concept to the external subject (object or others) results in the following:

Data Window of Perception and Possibility (Object or Others)

This simplistic model describes several ways of looking at the problem of data and how to align it with its use to serve our purposes. When we gather data from the world the result can be symmetrical or asymmetrical. That is, each of us does not have the capacity to collect the same data that may be relevant to our existence or the survival of our organizations or institutions.

This same concept of symmetry and asymmetry applies to our ability to process data into information and–further–to properly apply information to when it will contribute to a decisive outcome in terms of knowledge, understanding, insight, or action.

As with the psychological Johari Window, our model takes it account the unknown within the much larger data space. Think of our Big Blue Ball (which is not so big) within the context of space. All of space represents the data of the universe. We are finding that the secrets of vast space-time are found in quanta as well in the observations of large and distant celestial events and objects. Data is everywhere. Yet, we can perceive only a small part of the universe. That is why our Data Window does not encompass the entire data space.

The quadrants, of course, are rarely co-equal, but for purposes of simplicity they are shown as such. As with the psychological Johari Window of self-awareness, the tension and conflict within the individual and its relationship with the external world is in the adjustment of the sizes of the quadrants that, hopefully, tend toward more self-awareness and openness. From the perspective of data, the equivalent is toward the expansion of the physical expansion of the Data Window while the quadrants within the window expand to minimize asymmetry of external knowledge and the unknown.

The physical limitations of symmetry, asymmetry, and the unknown portions of the data space is further limited by our perceptions. Our understanding of what is possible, acceptable, sensible, radical, unthinkable, and impossible is influenced by these perceptions. Those areas of information management that fall within some mean or midpoint of the limitations of our perceptions represent current practice and which, as with the original Johari Window, I label as “policy,” though a viable alternative label would be “practice.”

Note that there perceptions vary by the position of the subject. In the case of our own perceptions, as for those reading this post, the first variation of the model is aligned vertically. For the case of the perceptions of others, which are important in understanding their position when advocating a particular course of action, the perception model is aligned horizontally across the quadrants.

The interplay of the quadrants within the Data Window directly affect how we perceive the use of data and its potential. Thus, I have labeled the no-man’s-land portion that pushes into areas that are unknown to the subject and external object is labeled as “The Frontier.”

To an American a “frontier” is an unexplored country while, historically, in the Old World a “frontier” is a border. The former promises not only risk, but, also opportunity and invites exploration. The latter is a limitation. No doubt, my use of the term is culturally biased to the first definition.

Intellectually and physically, as we enter the frontier and learn what secrets await us there, we learn. For data we may first see a Repository of Babel and deal with it as if it were flat. But, given enough exploration we will learn its lexicon and underlying structure and, eventually, learn how to process it into information and harness its content. This, in turn, will influence the size of the Data Window, the relative sizes of the quadrants, and our perceptions of the art of the possible.

Conception to Application

This model, I believe, is a useful antecedent concept in approaching and making comprehensible what is often called Big Data. The model also helps us be more precise in how we perceive and define the term as technology changes, given that exponential increases in hardware storage and processing capabilities expand our Data Window.

Furthermore, understanding the interplay of how wee approach data, and the consequences of our perceptions of it, allow us to weigh the risk when looking at new technologies and the characteristics they need to possess in order to meet organizational goals and vision. The initial bias, as noted by Paul Kahneman in his book Thinking, Fast and Slow, is for people to stick with the status quo or the familiar–the devil they know–in lieu of something new and innovative, even when the advantages of adoption of the new innovation are clearly obvious. It requires a reorientation of thinking to allow the acceptance of the new.

Our familiar patterns when thinking about information is to look for solutions that are “tools.” The new, unfamiliar concept that we find challenging is the understanding that we do not know what we do not know when it come to data and its potential–that we must push into the frontier in order to do so–and doing so will require not only new technology that is oriented toward the optimization of data, its processing from information to knowledge, and its use, but also a new way of thinking about it and how it will align with our organizational strategy.

This can only be done by first starting with a benchmark–to practically take stock–of where we individually as organizations and where we need to be in terms of understanding our mission or purpose. For project controls and project management there is no area more at odds with this alignment.

Recently, Dave Gordon in his blog The Practicing IT Project Manager argued why project managers needed to align their projects with organizational strategy. He noted that in 2015, during the development of the “Talent Triangle” that the Project Management Institute found that a major deficiency noted by organizations was that project managers needed to take an active role in aligning their projects with organizational strategy.

As I previously noted, there are a number of project management tools on the market today and a number of data visualization tools. Yet, there are significant gaps not only in the capture, quality, and processing of data, but also in the articulation of a consistent data strategy that aligns with the project organization and the overarching organization’s business strategy, goals, and priorities.

For example, in government, program managers spend a large portion of the year defending their programs to show that they are effectively and efficiently overseeing the expenditure of resources: that they are “executing program.” Failure to execute program will result in a budget mark, or worse, result in a re-baseline, or possible restructuring or cancellation. Projected production may be scaled back in favor of more immediate priorities.

Yet, none of our so-called “tools” fully capture program execution as it is defined by agencies and Congress. We have performance management tools, earned value tools, and the list can go on. A typical program manager in government spends almost five months assessing and managing program execution, and defending program and only a few minutes each month reviewing performance. This fact alone should be indicative that our priorities are misaligned.

The intersection of organizational alignment and program management in this case is related to resource utilization and program execution. No doubt, project controls and performance management contribute to our understanding of program execution, but they are removed from informing both the program manager and the organization in a comprehensive manner about execution, risk, and opportunity–and whether those elements conflict with or align with the agency’s goals. They are even further removed from an understanding of decisions related to program execution on the interrelationships across spectrum of the project and program portfolio.

The reason for this condition is that the data is currently not being captured and processed in a comprehensive manner to be positioned for its effective exploitation and utilization in meeting the needs of the various levels of the organization, nor does the perception of the specific data needed align with organizational needs.

Correspondingly, in construction and upstream oil and gas, project managers and stakeholders are most concerned with scope, timeliness, and the inevitable questions of claims–especially the avoidance or equitable settlement of the last.

As with government, our data strategy must align with our organizational goals and vision from the perspective of all stakeholders in the effort. At the heart of this alignment is data and those technologies “fitted” to exploit it and align it with our needs.

Potato, Potahto, Tomato, Tomahto: Data Normalization vs. Standardization, Why the Difference Matters

In my vocation I run a technology company devoted to program management solutions that is primarily concerned with taking data and converting it into information to establish a knowledge-based environment. Similarly, in my avocation I deal with the meaning of information and how to turn it into insight and knowledge. This latter activity concerns the subject areas of history, sociology, and science.

In my travels just prior to and since the New Year, I have come upon a number of experts and fellow enthusiasts in these respective fields. The overwhelming numbers of these encounters have been productive, educational, and cordial. We respectfully disagree in some cases about the significance of a particular approach, governance when it comes to project and program management policy, but generally there is a great deal of agreement, particularly on basic facts and terminology. But some areas of disagreement–particularly those that come from left field–tend to be the most interesting because they create an opportunity to clarify a larger issue.

In a recent venue I encountered this last example where the issue was the use of the phrase data normalization. The issue at hand was that the use of “data normalization” suggested some statistical methodology in reconciling data into a standard schema. Instead, it was suggested, the term “data standardization” was more appropriate.

These phrases do not describe the same thing, but they do describe processes that are symbiotic, not mutually exclusive. So what about data normalization? No doubt there is a statistical use of the term, but we are dealing with the definition as used in digital technology here, just as the use of “standardization” was suggested in the same context. There are many examples of technical terminology that do not have the same meaning when used in different contexts. Here is the definition of normalization applied to data science from Technopedia, which is the proper use of the term in this case:

Normalization is the process of reorganizing data in a database so that it meets two basic requirements: (1) There is no redundancy of data (all data is stored in only one place), and (2) data dependencies are logical (all related data items are stored together). Normalization is important for many reasons, but chiefly because it allows databases to take up as little disk space as possible, resulting in increased performance.

Normalization is also known as data normalization

This is pretty basic (and necessary) stuff. I have written at length about data normalization, but also pair it with two other terms. This is data rationalization and contextualization. Here is a short definition of rationalization:

What is the benefit of Data Rationalization? To be able to effectively exploit, manage, reuse, and govern enterprise data assets (including the models which describe them), it is necessary to be able to find them. In addition, there is (or should be) a wealth of semantics (e.g. business names, definitions, relationships) embedded within an organization’s models that can be exposed for improved analysis and knowledge transfer. By linking model objects (across or within models) it is possible to discover the higher order conceptual objects for any given object. Conversely, it is possible to identify what implementation artifacts implement a higher order model object. For example, using data rationalization, one can traverse from a conceptual model entity to a logical model entity to a physical model table to a database table, etc. Similarly, Data Rationalization enables understanding of a database table by traversing up through the different model levels.

Finally, we have contextualization. Here is a good definition using Wikipedia:

Context or contextual information is any information about any entity that can be used to effectively reduce the amount of reasoning required (via filtering, aggregation, and inference) for decision making within the scope of a specific application.[2] Contextualisation is then the process of identifying the data relevant to an entity based on the entity’s contextual information. Contextualisation excludes irrelevant data from consideration and has the potential to reduce data from several aspects including volume, velocity, and variety in large-scale data intensive applications

There is no approximation of reflecting the accuracy of data in any of these terms wihin the domain of data and computer science. Nor are there statistical methods involved to approximate what needs to be accomplished precisely. The basic skill required to accomplish these tasks–knowing that the data is structured and pre-conditioned–is to reconcile the various lexicons from differing sources, much as I reconcile in my avocation the meaning of words and phrases across periods in history and across languages.

In this discussion we are dealing with the issue of different words used to describe a process or phenomenon. Similarly, we find this challenge in data.

So where does this leave data standardization? In terms of data and computer science, this describes a completely different method. Here is a definition from Wikipedia, which is the proper contextual use of the term under “Standard data model”:

A standard data model or industry standard data model (ISDM) is a data model that is widely applied in some industry, and shared amongst competitors to some degree. They are often defined by standards bodies, database vendors or operating system vendors.

In the context of project and program management, particularly as it relates to government data submission and international open standards across vendors in an industry, is the use of a common schema. In this case there is a DoD version of a UN/CEFACT XML file currently set as the standard, but soon to be replaced by a new standard using the JSON file structure.

In any event, what is clear here is that, while standardization is a necessary part of a data policy to allow for sharing of information, the strength of the chosen schema and the instructions regarding it will vary–and this variation will have an effect on the quality of the information shared. But that is not all.

This is where data normalization, rationalization, and contextualization come into play. In order to create data for the a standardized format, it is first necessary to convert what is an otherwise opaque set of data due to differences into a cohesive lexicon. In data, this is accomplished by reconciling data dictionaries to determine which items are describing the same thing, process, measure, or phenomenon. In a domain like program management, this is a finite set. But it is also specialized knowledge and where the value is added to any end product that is produced. Then, once we know how to identify the data, we must be able to map those terms to the standard schema but, keeping on eye on the use of the data down the line, must be able to properly structure and ensure interrelationships of the data are established and/or maintained to ensure its effective use. This is no mean task and why all data transformation methods and companies are not the same.

Furthermore, these functions can be accomplished efficiently or inefficiently. The inefficient method is to take the old-fashioned business intelligence method that has been around since the 1980s and before, where a team of data scientists and analysts deal with data as if it is flat and, essentially, reinvents the wheel in establishing the meaning and proper context of the data. Given enough time and money anything can be accomplished, but brute force labor will not defeat the Second Law of Thermodynamics.

In computing, which comes close to minimizing that physical law, we know that data has already been imbued with meaning upon its initial processing. In lieu of brute force labor we apply intelligence and knowledge to accomplish this requirement. This is called normalization, rationalization, and contextualization of data. It requires a small fraction of other methods in terms of time and effort, and is infinitely more transparent.

Using these methods is also where innovation, efficiency, performance, accuracy, scalability, and anticipating future requirements based on the latest technology trends comes into play. Establishing a seamless flow of data integration allows, for example, the capture of more data being able to be properly structured in a database, which lays the ground for the transition from 2D to 3D and 4D (that is, what is often called integrated) program management, as well as more effective analytics.

The term “standardization” also suffers from a weakness in data and computer science that requires that it be qualified. After all, data standardization in an enterprise or organization does not preclude the prescription of a propriety dataset. In government, this is contrary to both statutory and policy mandates. Furthermore, even given an effective, open standard, there will be a large pool of legacy and other non-conforming data that will still require capture and transformation.

The Section 809 Panel study dealt directly with this issue:

Use existing defense business system open-data requirements to improve strategic decision making on acquisition and workforce issues…. DoD has spent billions of dollars building the necessary software and institutional infrastructure to collect enterprise wide acquisition and financial data. In many cases, however, DoD lacks the expertise to effectively use that data for strategic planning and to improve decision making. Recommendation 88 would mitigate this problem by implementing congressional open-data mandates and using existing hiring authorities to bolster DoD’s pool of data science professionals.

Section 809 Volume 3, Section 9, p.477

As operating environment companies expose more and more capability into the market through middleware and other open systems methods of visualizing data, the key to a system no longer resides in its ability to produce charts and graphs. The use of Excel as an ad hoc data repository with its vulnerability to error, to manipulation, and for its resistance to the establishment of an optimized data management and corporate knowledge environment is a symptom of the larger issue.

Data and its proper structuring is at the core of organizational success and process improvement. Standardization alone will not address barriers to data optimization. According to RAND studies in 2015 and 2017* these are:

  • Data Quality and Discontinuities
  • Data Silos and Underutilized Repositories
  • Timeliness of Data for use by SMEs and Decision-makers
  • Lack of Access and Contextualization
  • Traceability and Auditability
  • Lack of the Ability to Apply Discovery in the Data
  • The issue of Contractual Technical Data and Proprietary Data

That these issues also exist in private industry demonstrates the universality of the issue. Thus, yes, standardize by all means. But also ensure that the standard is open and that transformation is traceable and auditable from the the source system to the standard schema, and then into the target database. Only then will the enterprise, the organization, and the government agency have full ownership of the data it requires to efficiently and effectively carry out its purpose.

*RAND Corporation studies are “Issues with Access to Acquisition Data and Information in the DoD: Doing Data Right in Weapons System Acquisition” (RR880, 2017), and “Issues with Access to Acquisition Data and Information in the DoD: Policy and Practice (RR1534, 2015). These can be found here.

(Data) Transformation–Fear and Loathing over ETL in Project Management

ETL stands for data extract, transform, and load. This essential step is the basis for all of the new capabilities that we wish to acquire during the next wave of information technology: business analytics, big(ger) data, interdisciplinary insight into processes that provide insights into improving productivity and efficiency.

I’ve been dealing with a good deal of fear and loading regarding the introduction of this concept, even though in my day job my organization is a leading practitioner in the field in its vertical. Some of this is due to disinformation by competitors in playing upon the fears of the non-technically minded–the expected reaction of those who can’t do in the last throws of avoiding irrelevance. Better to baffle them with bullshit than with brilliance, I guess.

But, more importantly, part of this is due to the state of ETL and how it is communicated to the project management and business community at large. There is a great deal to be gained here by muddying the waters even by those who know better and have the technology. So let’s begin by clearing things up and making this entire field a bit more coherent.

Let’s start with the basics. Any organization that contains the interaction of people is a system. For purposes of a project management team, a business enterprise, or a governmental body we deal with a special class of systems known as Complex Adaptive Systems: CAS for short. A CAS is a non-linear learning system that reacts and evolves to its environment. It is complex because of the inter-relationships and interactions of more than two agents in any particular portion of the system.

I was first introduced to the concept of CAS through readings published out of the Santa Fe Institute in New Mexico. Most noteworthy is the work The Quark and the Jaguar by the physicist Murray Gell-Mann. Gell-Mann is received the Nobel in physics in 1969 for his work on elementary particles, such as the quark, and is co-founder of the Institute. He also was part of the team that first developed simulated Monte Carlo analysis during a period he spent at RAND Corporation. Anyone interested in the basic science of quanta and how the universe works that then leads to insights into subjects such as day-to-day probability and risk should read this book. It is a good popular scientific publication written by a brilliant mind, but very relevant to the subjects we deal with in project management and information science.

Understanding that our organizations are CAS allows us to apply all sorts of tools to better understand them and their relationship to the world at large. From a more practical perspective, what are the risks involved in the enterprise in which we are engaged and what are the probabilities associated with any of the range of outcomes that we can label as success. For my purposes, the science of information theory is at the forefront of these tools. In this world an engineer by the name of Claude Shannon working at Bell Labs essentially invented the mathematical basis for everything that followed in the world of telecommunications, generating, interpreting, receiving, and understanding intelligence in communication, and the methods of processing information. Needless to say, computing is the main recipient of this theory.

Thus, all CAS process and react to information. The challenge for any entity that needs to survive and adapt in a continually changing universe is to ensure that the information that is being received is of high and relevant quality so that the appropriate adaptation can occur. There will be noise in the signals that we receive. What we are looking for from a practical perspective in information science are the regularities in the data so that we can make the transformation of receiving the message in a mathematical manner (where the message transmitted is received) into the definition of information quality that we find in the humanities. I believe that we will find that mathematical link eventually, but there is still a void there. A good discussion of this difference can be found here in the on-line publication Double Dialogues.

Regardless of this gap, the challenge of those of us who engage in the business of ETL must bring to the table the ability not only to ensure that the regularities in the information are identified and transmitted to the intended (or necessary) users, but also to distinguish the quality of the message in the terms of the purpose of the organization. Shannon’s equation is where we start, not where we end. Given this background, there are really two basic types of data that we begin with when we look at a set of data: structured and unstructured data.

Structured data are those where the qualitative information content is either predefined by its nature or by a tag of some sort. For example, schedule planning and performance data, regardless of the idiosyncratic/proprietary syntax used by a software publisher, describes the same phenomena regardless of the software application. There are only so many ways to identify snow–and, no, the Inuit people do not have 100 words to describe it. Qualifiers apply in the humanities, but usually our business processes more closely align with statistical and arithmetic measures. As a result, structured data is oftentimes defined by its position in a hierarchical, time-phased, or interrelated system that contains a series of markers, indexes, and tables that allow it to be interpreted easily through the identification of a Rosetta stone, even when the system, at first blush, appears to be opaque. When you go to a book, its title describes what it is. If its content has a table of contents and/or an index it is easy to find the information needed to perform the task at hand.

Unstructured data consists of the content of things like letters, e-mails, presentations, and other forms of data disconnected from its source systems and collected together in a flat repository. In this case the data must be mined to recreate what is not there: the title that describes the type of data, a table of contents, and an index.

All data requires initial scrubbing and pre-processing. The difference here is the means used to perform this operation. Let’s take the easy path first.

For project management–and most business systems–we most often encounter structured data. What this means is that by understanding and interpreting standard industry terminology, schemas, and APIs that the simple process of aligning data to be transformed and stored in a database for consumption can be reduced to a systemic and repeatable process without the redundancy of rediscovery applied in every instance. Our business intelligence and business analytics systems can be further developed to anticipate a probable question from a user so that the query is pre-structured to allow for near immediate response. Further, structuring the user interface in such as way as to make the response to the query meaningful, especially integrated with and juxtaposed other types of data requires subject matter expertise to be incorporated into the solution.

Structured ETL is the place that I most often inhabit as a provider of software solutions. These processes are both economical and relatively fast, particularly in those cases where they are applied to an otherwise inefficient system of best-of-breed applications that require data transfers and cross-validation prior to official reporting. Time, money, and effort are all saved by automating this process, improving not only processing time but also data accuracy and transparency.

In the case of unstructured data, however, the process can be a bit more complicated and there are many ways to skin this cat. The key here is that oftentimes what seems to be unstructured data is only so because of the lack of domain knowledge by the software publisher in its target vertical.

For example, I recently read a white paper published by a large BI/BA publisher regarding their approach to financial and accounting systems. My own experience as a business manager and Navy Supply Corps Officer provide me with the understanding that these systems are highly structured and regulated. Yet, business intelligence publishers treated this data–and blatantly advertised and apparently sold as state of the art–an unstructured approach to mining this data.

This approach, which was first developed back in the 1980s when we first encountered the challenge of data that exceeded our expertise at the time, requires a team of data scientists and coders to go through the labor- and time-consuming process of pre-processing and building specialized processes. The most basic form of this approach involves techniques such as frequency analysis, summarization, correlation, and data scrubbing. This last portion also involves labor-intensive techniques at the microeconomic level such as binning and other forms of manipulation.

This is where the fear and loathing comes into play. It is not as if all information systems do not perform these functions in some manner, it is that in structured data all of this work has been done and, oftentimes, is handled by the database system. But even here there is a better way.

My colleague, Dave Gordon, who has his own blog, will emphasize that the identification of probable questions and configuration of queries in advance combined with the application of standard APIs will garner good results in most cases. Yet, one must be prepared to receive a certain amount of irrelevant information. For example, the query on Google of “Fun Things To Do” that you may use if you are planning for a weekend will yield all sorts of results, such as “50 Fun Things to Do in an Elevator.”  This result includes making farting sounds. The link provides some others, some of which are pretty funny. In writing this blog post, a simple search on Google for “Google query fails” yields what can only be described as a large number of query fails. Furthermore, this approach relies on the data originator to have marked the data with pointers and tags.

Given these different approaches to unstructured data and the complexity involved, there is a decision process to apply:

1. Determine if the data is truly unstructured. If the data is derived from a structured database from an existing application or set of applications, then it is structured and will require domain expertise to inherit the values and information content without expending unnecessary resources and time. A structured, systemic, and repeatable process can then be applied. Oftentimes an industry schema or standard can be leveraged to ensure consistency and fidelity.

2. Determine whether only a portion of the unstructured data is relative to your business processes and use it to append and enrich the existing structured data that has been used to integrate and expand your capabilities. In most cases the identification of a Rosetta Stone and standard APIs can be used to achieve this result.

3. For the remainder, determine the value of mining the targeted category of unstructured data and perform a business case analysis.

Given the rapidly expanding size of data that we can access using the advancing power of new technology, we must be able to distinguish between doing what is necessary from doing what is impressive. The definition of Big Data has evolved over time because our hardware, storage, and database systems allow us to access increasingly larger datasets that ten years ago would have been unimaginable. What this means is that–initially–as we work through this process of discovery, we will be bombarded with a plethora of irrelevant statistical measures and so-called predictive analytics that will eventually prove out to not pass the “so-what” test. This process places the users in a state of information overload, and we often see this condition today. It also means that what took an army of data scientists and developers to do ten years ago takes a technologist with a laptop and some domain knowledge to perform today. This last can be taught.

The next necessary step, aside from applying the decision process above, is to force our information systems to advance their processing to provide more relevant intelligence that is visualized and configured to the domain expertise required. In this way we will eventually discover the paradox that effectively accessing larger sets of data will yield fewer, more relevant intelligence that can be translated into action.

At the end of the day the manager and user must understand the data. There is no magic in data transformation or data processing. Even with AI and machine learning it is still incumbent upon the people within the organization to be able to apply expertise, perspective, knowledge, and wisdom in the use of information and intelligence.

Post-Blogging NDIA Blues — The Latest News (Project Management Wonkish)

The National Defense Industrial Association’s Integrated Program Management Division (NDIA IPMD) just had its quarterly meeting here in sunny Orlando where we braved the depths of sub-60 degrees F temperatures to start out each day.

For those not in the know, these meetings are an essential coming together of policy makers, subject matter experts, and private industry practitioners regarding the practical and mundane state-of-the-practice in complex project management, particularly focused on the concerns of the the federal government and the Department of Defense.  The end result of these meetings is to publish white papers and recommendations regarding practice to support continuous process improvement and the practical application of project management practices–allowing for a cross-pollination of commercial and government lessons learned.  This is also the intersection where innovation among the large and small are given an equal vetting and an opportunity to introduce new concepts and solutions.  This is an idealized description, of course, and most of the petty personality conflicts, competition, and self-interest that plagues any group of individuals coming together under a common set of interests also plays out here.  But generally the days are long and the workshops generally produce good products that become the de facto standard of practice in the industry. Furthermore the control that keeps the more ruthless personalities in check is the fact that, while it is a large market, the complex project management community tends to be a relatively small one, which reinforces professionalism.

The “blues” in this case is not so much borne of frustration or disappointment but, instead, from the long and intense days that the sessions offer.  The biggest news from an IT project management and application perspective was twofold. The data stream used by the industry in sharing data in an open systems manner will be simplified.  The other was the announcement that the technology used to communicate will move from XML to JSON.

Human readable formatting to Data-focused formatting.  Under Kendall’s Better Buying Power 3.0 the goal of the Department of Defense (DoD) has been to incorporate better practices from private industry where they can be applied.  I don’t see initiatives for greater efficiency and reduction of duplication going away in the new Administration, regardless of what a new initiative is called.

In case this is news to you, the federal government buys a lot of materials and end items–billions of dollars worth.  Accountability must be put in place to ensure that the money is properly spent to acquire the things being purchased.  Where technology is pushed and where there are no commercial equivalents that can be bought off the shelf, as in the systems purchased by the Department of Defense, there are measures of progress and performance (given that the contract is under a specification) that are submitted to the oversight agency in DoD.  This is a lot of data and to be brutally frank the method and format of delivery has been somewhat chaotic, inefficient, and duplicative.  The Department moved to address this by a somewhat modest requirement of open systems submission of an application-neutral XML file under the standards established by the UN/CEFACT XML organization.  This was called the Integrated Program Management Report (IMPR).  This move garnered some improvement where it has been applied, but contracts are long-term, so incorporating improvements though new contractual requirements tends to take time.  Plus, there is always resistance to change.  The Department is moving to accelerate addressing these inefficiencies in their data streams by eliminating the unnecessary overhead associated with specifications of formatting data for paper forms and dealing with data as, well, data.  Great idea and bravo!  The rub here is that in making the change, the Department has proposed dropping XML as the technology used to transfer data and move to JSON.

XML to JSON. Before I spark another techie argument about the relative merits of each, there are some basics to understand here.  First, XML is a language, JSON is simply data exchange format.  This means that XML is specifically designed to deal with hierarchical and structured data that can be queried and where validation and fidelity checks within the data are inherent in the technology. Furthermore, XML is known to scale while maintaining the integrity of the data, which is intended for use in relational databases.  Furthermore, XML is hard to break.  It is meant for editing and will maintain its structure and integrity afterward.

The counter argument encountered is that JSON is new! and uses fewer characters! (which usually turns out to be inconsequential), and people are talking about it for Big Data and NoSQL! (but this happened after the fact and the reason for shoehorning it this way is discussed below).

So does it matter?  Yes and no.  As a supplier specializing in delivering solutions that normalize and rationalize data across proprietary file structures and leverage database capabilities, I don’t care.  I can adapt quickly and will have a proof-of-concept solution out within 30 days of receiving the schema.

The risk here, which applies to DoD and the industry, is that the decision to go to JSON is made only because it is the shiny new thing used by gamers and social networking developers.  There has also been a move to adapt to other uses because of the history of significant security risks that had been found in Java, so much so that an entire Wikipedia page is devoted to them.  Oracle just killed off Java applets, though Java hangs on.  JSON, of course, isn’t Java, but it was designed from birth as JavaScript Object Notation (hence the acronym JSON), with the purpose of handling relatively small bits of data across web servers in a number of proprietary settings.

To address JSON deficiencies relative to XML, a number of tools have been and are being developed to replicate the fidelity and reliability found in XML.  Whether this is sufficient to be effective against a structured LANGUAGE is to be seen.  Much of the overhead that technies complain about in XML is due to the native functionality related to the power it brings to the table.  No doubt, a bicycle is simpler than a Formula One racer–and this is an apt comparison.  Claiming “simpler” doesn’t pass the “So What?” test knowing the business processes involved.  The technology needs to be fit to the solution.  The purpose of data transmission using APIs is not only to make it easy to produce but for it to–you know–achieve the goals of normalization and rationalization so that it can be used on the receiving end which is where the consumer (which we usually consider to be the customer) sits.

At the end of the day the ability to scale and handle hierarchical, structured data will rely on the quality and strength of the schema and the tools that are published to enforce its fidelity and compliance.  Otherwise consuming organizations will be receiving a dozen different proprietary JSON files, and that does not address the present chaos but simply adds to it.  These issues were aired out during the meeting and it seems that everyone is aware of the risks and that they can be addressed.  Furthermore, as the schema is socialized across solutions providers, it will be apparent early if the technology will be able handle the project performance data resulting from the development of a high performance aircraft or a U.S. Navy destroyer.

Takin’ Care of Business — Information Economics in Project Management

Neoclassical economics abhors inefficiency, and yet inefficiencies exist.  Among the core issues that create inefficiencies is the asymmetrical nature of information.  Asymmetry is an accepted cornerstone of economics that leads to inefficiency.  We can see in our daily lives and employment the effects of one party in a transaction having more information than the other:  knowing whether the used car you are buying is a lemon, measuring risk in the purchase of an investment and, apropos to this post, identifying how our information systems allow us to manage complex projects.

Regarding this last proposition we can peel this onion down through its various levels: the asymmetry in the information between the customer and the supplier, the asymmetry in information between the board and stockholders, the asymmetry in information between management and labor, the asymmetry in information between individual SMEs and the project team, etc.–it’s elephants all the way down.

This asymmetry, which drives inefficiency, is exacerbated in markets that are dominated by monopoly, monopsony, and oligopoly power.  When informed by the work of Hart and Holmström regarding contract theory, which recently garnered the Nobel in economics, we have a basis for understanding the internal dynamics of projects in seeking efficiency and productivity.  What is interesting about contract theory is that it incorporates the concept of asymmetrical information (labeled as adverse selection), but expands this concept in human transactions at the microeconomic level to include considerations of moral hazard and the utility of signalling.

The state of asymmetry and inefficiency is exacerbated by the patchwork quilt of “tools”–software applications that are designed to address only a very restricted portion of the total contract and project management system–that are currently deployed as the state of the art.  These tend to require the insertion of a new class of SME to manage data by essentially reversing the efficiencies in automation, involving direct effort to reconcile differences in data from differing tools. This is a sub-optimized system.  It discourages optimization of information across the project, reinforces asymmetry, and is economically and practically unsustainable.

The key in all of this is ensuring that sub-optimal behavior is discouraged, and that those activities and behaviors that are supportive of more transparent sharing of information and, therefore, contribute to greater efficiency and productivity are rewarded.  It should be noted that more transparent organizations tend to be more sustainable, healthier, and with a higher degree of employee commitment.

The path forward where there is monopsony power, where there is a dominant buyer, is to impose the conditions for normative behavior that would otherwise be leveraged through practice in a more open market.  For open markets not dominated by one player as either supplier or seller, instituting practices that reward behavior that reduces the effects of asymmetrical information, and contracting disincentives in business transactions on the open market is the key.

In the information management market as a whole the trends that are working against asymmetry and inefficiency involve the reduction of data streams, the construction of cross-domain data repositories (or reservoirs) that allow for the satisfaction of multiple business stakeholders, and the introduction of systems that are more open and adaptable to the needs of the project system in lieu of a limited portion of the project team.  These solutions exist, yet their adoption is hindered because of the long-term infrastructure that is put in place in complex project management.  This infrastructure is supported by incumbents that are reinforcing to the status quo.  Because of this, from the time a market innovation is introduced to the time that it is adopted in project-focused organizations usually involves the expenditure of several years.

This argues for establishing an environment that is more nimble.  This involves the adoption of a series of approaches to achieve the goals of broader information symmetry and efficiency in the project organization.  These are:

a. Instituting contractual relationships, both internally and externally, that encourage project personnel to identify risk.  This would include incentives to kill efforts that have breached their framing assumptions, or to consolidate progress that the project has achieved to date–sending it as it is to production–while killing further effort that would breach framing assumptions.

b. Institute policy and incentives on the data supply end to reduce the number of data streams.  Toward this end both acquisition and contracting practices should move to discourage proprietary data dead ends by encouraging normalized and rationalized data schemas that describe the environment using a common or, at least, compatible lexicon.  This reduces the inefficiency derived from opaqueness as it relates to software and data.

c.  Institute policy and incentives on the data consumer end to leverage the economies derived from the increased computing power from Moore’s Law by scaling data to construct interrelated datasets across multiple domains that will provide a more cohesive and expansive view of project performance.  This involves the warehousing of data into a common repository or reduced set of repositories.  The goal is to satisfy multiple project stakeholders from multiple domains using as few streams as necessary and encourage KDD (Knowledge Discovery in Databases).  This reduces the inefficiency derived from data opaqueness, but also from the traditional line-and-staff organization that has tended to stovepipe expertise and information.

d.  Institute acquisition and market incentives that encourage software manufacturers to engage in positive signalling behavior that reduces the opaqueness of the solutions being offered to the marketplace.

In summary, the current state of project data is one that is characterized by “best-of-breed” patchwork quilt solutions that tend to increase direct labor, reduces and limits productivity, and drives up cost.  At the end of the day the ability of the project to handle risk and adapt to technical challenges rests on the reliability and efficiency of its information systems.  A patchwork system fails to meet the needs of the organization as a whole and at the end of the day is not “takin’ care of business.”

Big Time — Elements of Data Size in Scaling

I’ve run into additional questions about scalability.  It is significant to understand the concept in terms of assessing software against data size, since there are actually various aspect of approaching the issue.

Unlike situations where data is already sorted and structured as part of the core functionality of the software service being provided, this is in dealing in an environment where there are many third-party software “tools” that put data into proprietary silos.  These act as barriers to optimizing data use and gaining corporate intelligence.  The goal here is to apply in real terms the concept that the customers generating the data (or stakeholders who pay for the data) own the data and should have full use of it across domains.  In project management and corporate governance this is an essential capability.

For run-of-the-mill software “tools” that are focused on solving one problem, this often is interpreted as just selling a lot more licenses to a big organization.  “Sure we can scale!” is code for “Sure, I’ll sell you more licenses!”  They can reasonably make this assertion, particularly in client-server or web environments, where they can point to the ability of the database system on which they store data to scale.  This also comes with, usually unstated, the additional constraint that their solution rests on a proprietary database structure.  Such responses, through, are a form of sidestepping the question, nor is it the question being asked.  Thus, it is important for those acquiring the right software to understand the subtleties.

A review of what makes data big in the first place is in order.  The basic definition, which I outlined previously, came from NASA in describing data that could not be held in local memory or local storage.  Hardware capability, however, continues to grow exponentially, so that what is big data today is not big data tomorrow.  But in handling big data, it then becomes incumbent on software publishers to drive performance to allow their customers to take advantage of the knowledge contained in these larger data sets.

The elements that determine the size of data are:

a.  Table size

b.  Row or Record size

c.  Field size

d.  Rows per table

e.  Columns per table

f.  Indexes per table

Note the interrelationships of these elements in determining size.  Thus, recently I was asked how many records are being used on the largest tables being accessed by a piece of software.  That is fine as shorthand, but the other elements add to the size of the data that is being accessed.  Thus, a set of data of say 800K records may be just as “big” as one containing 2M records because of the overall table size of fields, and the numbers of columns and indices, as well as records.  Furthermore, this question didn’t take into account the entire breadth of data across all tables.

Understanding the definition of data size then leads us to understanding the nature of software scaling.  There are two aspects to this.

The first is the software’s ability to presort the data against the database in such as way as to ensure that latency–the delay in performance when the data is loaded–is minimized.  The principles applied here go back to database management practices back in the day when organizations used to hire teams of data scientists to rationalize data that was processed in machine language–especially when it used to be stored in ASCII or, for those who want to really date themselves, EBCDIC, which were incoherent by today’s generation of more human-readable friendly formats.

Quite simply, the basic steps applied has been to identify the syntax, translate it, find its equivalents, and then sort that data into logical categories that leverage database pointers.  What you don’t want the software to do is what used to be done during the earliest days of dealing with data, which was smaller by today’s standards, of serially querying ever data element in order to fetch only what the user is calling.  Furthermore, it also doesn’t make much sense to deal with all data as a Repository of Babel to apply labor-intensive data mining in non-relational databases, especially in cases where the data is well understood and fairly well structured, even if in a proprietary structure.  If we do business in a vertical where industry standards in data apply, as in the use of the UN/CEFACT XML convention, then much of the presorting has been done for us.  In addition, more powerful industry APIs (like OLE DB and ODBC) that utilize middleware (web services, XML, SOAP, MapReduce, etc.) multiply the presorting capabilities of software, providing significant performance improvements in accessing big data.

The other aspect is in the software’s ability to understand limitations in data communications hardware systems.  This is a real problem, because the backbone in corporate communication systems, especially to ensure security, is still largely done over a wire.  The investments in these backbones is usually categorized as a capital investment, and so upgrades to the system are slow.  Furthermore, oftentimes backbone systems are embedded in physical plant building structures.  So any software performance is limited by the resistance and bandwidth of wiring.  Thus, we live in a world where hardware storage and processing is doubling every 12-18 months, and software is designed to better leverage such expansion, but the wires over which data communication depends remains stuck in the past–constrained by the basic physics of CAT 6 or Fiber Optic cabling.

Needless to say, software manufacturers who rely on constant communications with the database will see significantly degraded performance.  Some software publishers who still rely on this model use the “check out” system, treating data like a lending library, where only one user or limited users can access the same data.  This, of course, reduces customer flexibility.  Strategies that are more discrete in handling data are the needed response here until day-to-day software communications can reap the benefits of physical advancements in this category of hardware.  Furthermore, organizations must understand that the big Cloud in the sky is not the answer, since it is constrained by the same physics as the rest of the universe–with greater security risks.

All of this leads me to a discussion I had with a colleague recently.  He opened his iPhone and did a query in iTunes for an album.  In about a second or so his query selected the artist and gave a list of albums–all done without a wire connection.  “Why can’t we have this in our industry?” he asked.  Why indeed?  Well, first off, Apple iTunes has sorted its data to optimize performance with its app, and it is performed within a standard data stream and optimized for the Apple iOS and hardware.  Secondly, the variables of possible queries in iTunes are predefined and tied to a limited and internally well-defined set of data.  Thus, the data and application challenges are not equivalent as found in my friend’s industry vertical.  For example, aside from the challenges of third party data normalization and rationalization, iTunes is not dealing with dynamic time-phased or trending data that requires multiple user updates to reflect changes using predictive analytics which is then served to different classes of users in a highly secure environment.  Finally, and most significantly, Apple spent more on that system than the annual budget of my friend’s organization.  In the end his question was a good one, but in discussing this it was apparent that just saying “give me this” is a form of magical thinking and hand waving.  The devil is in the details, though I am confident that we will eventually get to an equivalent capability.

At the end of the day IT project management strategy must take into account the specific needs of classes of users in making determinations of scaling.  What this means is a segmented approach: thick-client users, compartmentalized local installs with subsets of data, thin-client, and web/mobile or terminal services equivalents.  The practical solution is still an engineered one that breaks the elephant into digestible pieces that lean forward to leveraging advances in hardware, database, and software operating environments.  These are the essential building blocks to data optimization and scaling.

I Can See Clearly Now — Knowledge Discovery in Databases, Data Scalability, and Data Relevance

I recently returned from a travel and much of the discussion revolved around the issues of scalability and the use of data.  What is clear is that the conversation at the project manager level is shifting from a long-running focus on reports and metrics to one focused on data and what can be learned from it.  As with any technology, information technology exploits what is presented before it.  Most recently, accelerated improvements in hardware and communications technology has allowed us to begin to collect and use ever larger sets of data.

The phrase “actionable” has been thrown around quite a bit in marketing materials, but what does this term really mean?  Can data be actionable?  No.  Can intelligence derived from that data be actionable?  Yes.  But is all data that is transformed into intelligence actionable?  No.  Does it need to be?  No.

There are also kinds and levels of intelligence, particularly as it relates to organizations and business enterprises.  Here is a short list:

a. Competitive intelligence.  This is intelligence derived from data that informs decision makers about how their organization fits into the external environment, further informing the development of strategic direction.

b. Business intelligence.  This is intelligence derived from data that informs decision makers about the internal effectiveness of their organization both in the past and into the future.

c. Business analytics.  The transformation of historical and trending enterprise data used to provide insight into future performance.  This includes identifying any underlying drivers of performance, and any emerging trends that will manifest into risk.  The purpose is to provide sufficient early warning to allow risk to be handled before it fully manifests, therefore keeping the effort being measured consistent with the goals of the organization.

Note, especially among those of you who may have a military background, that what I’ve outlined is a hierarchy of information and intelligence that addresses each level of an organization’s operations:  strategic, operational, and tactical.  For many decision makers, translating tactical level intelligence into strategic positioning through the operational layer presents the greatest challenge.  The reason for this is that, historically, there often has been a break in the continuity between data collected at the tactical level and that being used at the strategic level.

The culprit is the operational layer, which has always been problematic for organizations and those individuals who find themselves there.  We see this difficulty reflected in the attrition rate at this level.  Some individuals cannot successfully make this transition in thinking. For example, in the U.S. Army command structure when advancing from the battalion to the brigade level, in the U.S. Navy command structure when advancing from Department Head/Staff/sea command to organizational or fleet command (depending on line or staff corps), and in business for those just below the C level.

Another way to look at this is through the traditional hierarchical pyramid, in which data represents the wider floor upon which each subsequent, and slightly reduced, level is built.  In the past (and to a certain extent this condition still exists in many places today) each level has constructed its own data stream, with the break most often coming at the operational level.  This discontinuity is then reflected in the inconsistency between bottom-up and top-down decision making.

Information technology is influencing and changing this dynamic by addressing the main reason for the discontinuity existing–limitations in data and intelligence capabilities.  These limitations also established a mindset that relied on limited, summarized, and human-readable reporting that often was “scrubbed” (especially at the operational level) as it made its way to the senior decision maker.  Since data streams were discontinuous, there were different versions of reality.  When aspects of the human equation are added, such as selection bias, the intelligence will not match what the data would otherwise indicate.

As I’ve written about previously in this blog, the application of Moore’s Law in physical computing performance and storage has pushed software to greater needs in scaling in dealing with ever increasing datasets.  What is defined as big data today will not be big data tomorrow.

Organizations, in reaction to this condition, have in many cases tended to simply look at all of the data they collect and throw it together into one giant pool.  Not fully understanding what the data may say, a number of ad hoc approaches have been taken.  In some cases this has caused old labor-intensive data mining and rationalization efforts to once again rise from the ashes to which they were rightly consigned in the past.  On the opposite end, this has caused a reliance on pre-defined data queries or hard-coded software solutions, oftentimes based on what had been provided using human-readable reporting.  Both approaches are self-limiting and, to a large extent, self-defeating.  In the first case because the effort and time to construct the system will outlive the needs of the organization for intelligence, and in the second case, because no value (or additional insight) is added to the process.

When dealing with large, disparate sources of data, value is derived through that additional knowledge discovered through the proper use of the data.  This is the basis of the concept of what is known as KDD.  Given that organizations know the source and type of data that is being collected, it is not necessary to reinvent the wheel in approaching data as if it is a repository of Babel.  No doubt the euphemisms, semantics, and lexicon used by software publishers differs, but quite often, especially where data underlies a profession or a business discipline, these elements can be rationalized and/or normalized given that the appropriate business cross-domain knowledge is possessed by those doing the rationalization or normalization.

This leads to identifying the characteristics* of data that is necessary to achieve a continuity from the tactical to the strategic level that will achieve some additional necessary qualitative traits such as fidelity, credibility, consistency, and accuracy.  These are:

  1. Tangible.  Data must exist and the elements of data should record something that correspondingly exists.
  2. Measurable.  What exists in data must be something that is in a form that can be recorded and is measurable.
  3. Sufficient.  Data must be sufficient to derive significance.  This includes not only depth in data but also, especially in the case of marking trends, across time-phasing.
  4. Significant.  Data must be able, once processed, to contribute tangible information to the user.  This goes beyond statistical significance noted in the prior characteristic, in that the intelligence must actually contribute to some understanding of the system.
  5. Timely.  Data must be timely so that it is being delivered within its useful life.  The source of the data must also be consistently provided over consistent periodicity.
  6. Relevant.  Data must be relevant to the needs of the organization at each level.  This not only is a measure to test what is being measured, but also will identify what should be but is not being measured.
  7. Reliable.  The sources of the data be reliable, contributing to adherence to the traits already listed.

This is the shorthand that I currently use in assessing a data requirements and the list is not intended to be exhaustive.  But it points to two further considerations when delivering a solution.

First, at what point does the person cease to be the computer?  Business analytics–the tactical level of enterprise data optimization–oftentimes are stuck in providing users with a choice of chart or graph to use in representing such data.  And as noted by many writers, such as this one, no doubt the proper manner of representing data will influence its interpretation.  But in this case the person is still the computer after the brute force computing is completed digitally.  There is a need for more effective significance-testing and modeling of data, with built-in controls for selection bias.

Second, how should data be summarized to the operational and strategic levels so that “signatures” can be identified that inform information?  Furthermore, it is important to understand what kind of data must supplement the tactical level data at those other levels.  Thus, data streams are not only minimized to eliminate redundancy, but also properly aligned to the level of data intelligence.

*Note that there are other aspects of data characteristics noted by other sources here, here, and here.  Most of these concern themselves with data quality and what I would consider to be baseline data traits, which need to be separately assessed and tested, as opposed to antecedent characteristics.

 

Living in the Material World — A Call for Open Data from Materials Science

Since advocating for transparent and open data schemas and data repositories, I’ve run into other high tech professionals who have opined that such data requirements are only applicable to my core competency in the U.S. Aerospace & Defense vertical.  Now, in the 26 November on-line edition of the journal Science, comes an opinion piece from a number of Chinese corrosion scientists under the title “Materials science: Share corrosion data” have advocated the development of open data infrastructures to make the age and life of existing metallurgic infrastructure easily accessible to technology in a non-proprietary format.

As the article states, “corrosion costs six cents for every dollar of gross domestic product in the United States. Globally, that amounts to more than US$4 trillion a year — equivalent to damages from 40 Hurricane Katrinas. Half of that cost is in corrosion prevention and control, the other half in damages and lost productivity.”

In the software technology industry, the recent issues regarding the Trans-Pacific Partnership (TPP) has revealed strains in U.S.-Chinese relations on intellectual property and proprietary systems.  So the danger is always that the source of the advocacy will be ignored, despite the clear economic argument in favor of what the editorial proposes.

But the concern is transnational in nature.  As the piece points out, the U.S. Department of Energy has initiated its own program in alignment with the U.S. National Institute of Standards and Technology (NIST) Materials Genome Initiative (MGI) to establish open repositories of data in support of the alternative energy industry.  Given the aging U.S. infrastructure and the dangers from sudden failure from these engineering systems, it seems wise to track and share data for materials corrosion and lifespans.  Think of the sudden bridge failures that have hit headlines over the last several years.

Since the cost of not doing so can be measured in lives as well as economic cost, such data normalization and rationalization in this area takes on the role of being an essential element of governance.

Stay Open — Open and Proprietary Databases (and Why It Matters)

The last couple of weeks have been fairly intense workwise and so blogging has lagged a bit.  Along the way the matter of databases came up at a customer site and what constitutes open data and what comprises proprietary data.  The reason why this issue matters to customers rests of several foundations.

First, in any particular industry or niche there is a wide variety of specialized apps that have blossomed.  This is largely due to Moore’s Law.  Looking at the number of hosted and web apps alone can be quite overwhelming, particularly given the opaqueness of what one is buying at any particular time when it comes to software technology.

Second, given this explosion, it goes without saying that the market will apply its scythe ruthlessly in thinning it.  Despite the ranting of ideologues, this thinning applies to both good and bad ideas, both sound and unsound businesses equally.  The few that remain are lucky, good, or good and lucky.  Oftentimes it is being first to market on an important market discriminator, regardless of the quality in its initial state, that determines winners.

Third, most of these technology solutions will run their software on proprietary database structures.  This undermines the concept that the customer owns the data.

The reasons why software solutions providers do this is multifaceted.  For example, the database structure is established to enhance the functionality and responsiveness of the application where the structure is leveraged to work optimally with the application’s logic.

But there are also more prosaic reasons for proprietary database structures.  First, the targeted vertical or segment may not be very structured regarding the type of data, so there is wide variation on database configuration and structure.  But there is also a more base underlying motivation to keep things this way:  the database structure is designed to protect the application’s data from easy access from third party tools and, as a result, make their solution “sticky” within the market segment that is captured.  That is, database structure is a way to build barriers to competition.

For incumbents that are stable, the main disadvantages to the customer lie in the use of the database as a means of tying them to the solution as a barrier to exit.  At the same time incumbents erect artificial barriers to data entry.  For software markets with a great deal of new entries and innovation that will lead to some thinning, picking the wrong solution using proprietary data structures can lead to real problems when attempting to transition to more stable alternatives.  For example, in the case of hosted applications not only is data not on the customer’s own database servers, but that data could be located far from the worksite or even geographically dispersed outside of the physical control of the customer.

Open APIs in using data mining and variations of it as the Shaman of Big Data prescribe unstructured and non-relational databases has served to, at least in everyone’s mind, minimize such proprietary concerns.  After all, it thought, we can just crack open the data–right?  Well…not so fast.  Given a number of data scientists, data analysts, and open API object tools mainframe types can regain the status they lost with the introduction of the PC and spend months building systems that will eventually rationalize data that has been locked in proprietary prisons.  Or perhaps not.  The bigger the data the bigger the problem.  The bigger the question the more one must bring in those who understand the difference between correlation and causation.  In the end it comes down to the mathematics and valid methods of determining in real terms the behavior of systems.

Or if you are a small or medium-sized business or organization you can just decide that the data is irretrievable, or effectively so, since the ROI is not there to make it retrievable.

Or you can avoid the inevitable and, if you do business in a highly structured market, such as project management, utilize some open standard such as the UN/CEFACT XML.  Then, when choosing a COTS solution in communicating with the market, determine that databases must, at a minimum, conform to the open standard in database design.  This provides maximum flexibility to the customer, who can then perform value analysis on competing products, based on a analysis of functionality, flexibility, and sustainability.

This places the customer back into the role of owning the data.

When You’re a Jet You’re a Jet all the Way — Software as a Change Agent for Professional Development

Earlier in the week Dave Gordon at his blog responded to my post on data normalization and rightly introduced the need for data rationalization.  I had omitted the last concept in my own post, but strongly implied that the two were closely aligned in my broad definition of normalization beyond the boundaries of eliminating redundancies.  In the end in thinking about this, I prefer Dave’s dichotomy because it more clearly defines what we are doing.

Later in the week I found myself elaborating on these issues in discussions with customers and other professionals in the project management discipline.  In the projects in which I am involved, what I have found is that the process of normalizing and rationalizing data, even historical data which, contrary to Dave’s assertion can be maintained–at least in my business–acts as a change agent in defining the agnostic characteristics what defines the type of data being normalized and rationalized.

What I mean here is that, for instance, a time-phased CPM schedule that eventually becomes an integrated master schedule has an analogue.  For years we have been told, mostly by marketing types working for software manufacturers, that there is a secret sauce that they provide that cannot be reconciled against their competitors.  As a result, entire professional organizations, conferences, white papers, and presentations have been given to prove this assertion.  When looking at the data, however, the assertion is invalid.

The key differentiator between CPM scheduling applications is the optimization engine.  That is the secret sauce and the black box where the valuable IP lies.  It is the algorithms in the optimization that identifies for us those schedule activities that are on the critical and near critical paths.  But when you run these engines side by side on the same schedule, their results are well within one standard deviation of one another.

Keep in mind that I’m talking about differences in data related to normalization and rationalization and whether these differences can be reconciled.  There are other differences in features between the applications that do make a difference in their use and functionality: whether they can lock down a baseline, manage multiple baselines, prevent future work from being planned and executed in the past (yes, this happens), handle hammocks, scale properly, etc.  Because of these functional differences the same data may have been given a different value in the table or the file.  As Dave Gordon rightly points out, reconciling what on the surface are irreconcilable values requires specialized knowledge.  Well, if you have that specialized knowledge then you can achieve what otherwise seems impossible.  Once you achieve this “impossible” feat, it quickly becomes apparent that the features and functions involved are based on a very limited number of values that are common across CPM scheduling applications.

This should not be surprising for those of you out there that have been doing this a long time.  Back in the 1980s we would use visual display boards to map out short segments of the schedule.  We would manually construct schedules in very rudimentary (by today’s standards) mainframe computers and get very long dot matrix representations of the schedule to tape to the “War Room” walls.  The resources, risks, etc. had to be drawn on the schedule.  This manual process required an understanding of CPM schedule construction similar to someone still using long division today.  There was actually a time when people had to memorize their log and square root tables.  it was not very efficient, but the deep understanding of the analogue schedule has since been lost with the introduction of new technology.  This came to mind when I saw on LinkedIn a question of the types of questions that should be asked of a master scheduler in an interview.

As a result of new technology, schedulers aligned themselves into camps based on the application they selected or was selected for them in their job.  Over time I have seen brand loyalty turn into partisanship.  Once again, this should not be surprising.  If you spent ten years of your career on a very popular scheduling application, anything that may undermine one’s investment in that choice–and which makes employment possible–will be deemed as a threat.

I first came upon this behavior years ago when I was serving as CIO for a project management organization.  Some PMOs could not share information–not even e-mail and documents–because most were using PCs and some were using Macs.  Problem was that the key PMO was using Mac.  This was before Microsoft and Apple got together and solved this for us.  Needless to say this undermined organizational effectiveness.  My attempt to get everyone on the same page in terms of operating system compatibility sparked a significant backlash.  Luckily for me Microsoft soon introduced its first solution to address this issue.  So, in the end, the “Macintites”, as we good naturedly called them, could use their Macs for business common to other parts of the organization.

This almost cultish behavior finds itself in new places today: in the iPhone and Droid wars, in the use of Agile, and among CPM scheduling application partisans.  It is true that those of us in the software industry certainly want to see brand loyalty.  It is one of the key measures of success in proving the product’s value and effectiveness.  But it need not undermine the fact that a scheduler is a key specialist in the project management discipline.  If you are a Jet, you don’t need to be a Jet all the way.

Since creating generic analogues of schedules from submitted third-party data, I have found that insights into project performance can be noted that previously were not available.  The power of digitization along with normalization and rationalization allows the data to be effectively integrated at the proper point of intersection with other dimensions of project performance such as cost performance and risk.  Freed from the shackles of having to learn the specific idiosyncrasies of particular applications, the deep understanding of scheduling is being reintroduced.  This is a long time in coming.