I Can See Clearly Now — Knowledge Discovery in Databases, Data Scalability, and Data Relevance

I recently returned from travel, and much of the discussion revolved around the issues of scalability and the use of data.  What is clear is that the conversation at the project manager level is shifting from a long-running focus on reports and metrics to one focused on data and what can be learned from it.  As with any technology, information technology exploits what is presented before it.  Most recently, accelerated improvements in hardware and communications technology have allowed us to begin to collect and use ever larger sets of data.

The word “actionable” has been thrown around quite a bit in marketing materials, but what does it really mean?  Can data be actionable?  No.  Can intelligence derived from that data be actionable?  Yes.  But is all data that is transformed into intelligence actionable?  No.  Does it need to be?  No.

There are also kinds and levels of intelligence, particularly as they relate to organizations and business enterprises.  Here is a short list:

a. Competitive intelligence.  This is intelligence derived from data that informs decision makers about how their organization fits into the external environment, further informing the development of strategic direction.

b. Business intelligence.  This is intelligence derived from data that informs decision makers about the internal effectiveness of their organization both in the past and into the future.

c. Business analytics.  The transformation of historical and trending enterprise data used to provide insight into future performance.  This includes identifying any underlying drivers of performance, and any emerging trends that will manifest into risk.  The purpose is to provide sufficient early warning to allow risk to be handled before it fully manifests, therefore keeping the effort being measured consistent with the goals of the organization.

Note, especially among those of you who may have a military background, that what I’ve outlined is a hierarchy of information and intelligence that addresses each level of an organization’s operations:  strategic, operational, and tactical.  For many decision makers, translating tactical level intelligence into strategic positioning through the operational layer presents the greatest challenge.  The reason for this is that, historically, there often has been a break in the continuity between data collected at the tactical level and that being used at the strategic level.

The culprit is the operational layer, which has always been problematic for organizations and those individuals who find themselves there.  We see this difficulty reflected in the attrition rate at this level.  Some individuals cannot successfully make this transition in thinking:  for example, in the U.S. Army command structure when advancing from the battalion to the brigade level, in the U.S. Navy command structure when advancing from Department Head/Staff/sea command to organizational or fleet command (depending on line or staff corps), and in business for those just below the C level.

Another way to look at this is through the traditional hierarchical pyramid, in which data represents the wider floor upon which each subsequent, and slightly reduced, level is built.  In the past (and to a certain extent this condition still exists in many places today) each level has constructed its own data stream, with the break most often coming at the operational level.  This discontinuity is then reflected in the inconsistency between bottom-up and top-down decision making.

Information technology is influencing and changing this dynamic by addressing the main reason for the discontinuity existing–limitations in data and intelligence capabilities.  These limitations also established a mindset that relied on limited, summarized, and human-readable reporting that often was “scrubbed” (especially at the operational level) as it made its way to the senior decision maker.  Since data streams were discontinuous, there were different versions of reality.  When aspects of the human equation are added, such as selection bias, the intelligence will not match what the data would otherwise indicate.

As I’ve written about previously in this blog, the application of Moore’s Law in physical computing performance and storage has pushed software to scale in order to deal with ever-increasing datasets.  What is defined as big data today will not be big data tomorrow.

Organizations, in reaction to this condition, have in many cases tended to simply look at all of the data they collect and throw it together into one giant pool.  Not fully understanding what the data may say, a number of ad hoc approaches have been taken.  In some cases this has caused old labor-intensive data mining and rationalization efforts to once again rise from the ashes to which they were rightly consigned in the past.  On the opposite end, this has caused a reliance on pre-defined data queries or hard-coded software solutions, oftentimes based on what had been provided using human-readable reporting.  Both approaches are self-limiting and, to a large extent, self-defeating.  In the first case because the effort and time to construct the system will outlive the needs of the organization for intelligence, and in the second case, because no value (or additional insight) is added to the process.

When dealing with large, disparate sources of data, value is derived through the additional knowledge discovered through the proper use of the data.  This is the basis of the concept known as Knowledge Discovery in Databases (KDD).  Given that organizations know the source and type of data that is being collected, it is not necessary to reinvent the wheel by approaching data as if it were a repository of Babel.  No doubt the euphemisms, semantics, and lexicon used by software publishers differ, but quite often, especially where data underlies a profession or a business discipline, these elements can be rationalized and/or normalized, provided that the appropriate business cross-domain knowledge is possessed by those doing the rationalization or normalization.

This leads to identifying the characteristics* of data that are necessary to achieve continuity from the tactical to the strategic level, along with additional necessary qualitative traits such as fidelity, credibility, consistency, and accuracy.  These are:

  1. Tangible.  Data must exist and the elements of data should record something that correspondingly exists.
  2. Measurable.  What exists in data must be something that is in a form that can be recorded and is measurable.
  3. Sufficient.  Data must be sufficient to derive significance.  This includes not only depth in data but also, especially in the case of marking trends, across time-phasing.
  4. Significant.  Data must be able, once processed, to contribute tangible information to the user.  This goes beyond statistical significance noted in the prior characteristic, in that the intelligence must actually contribute to some understanding of the system.
  5. Timely.  Data must be timely so that it is being delivered within its useful life.  The source of the data must also provide it at a consistent periodicity.
  6. Relevant.  Data must be relevant to the needs of the organization at each level.  This not only is a measure to test what is being measured, but also will identify what should be but is not being measured.
  7. Reliable.  The sources of the data must be reliable, contributing to adherence to the traits already listed.
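
Some of these characteristics can be screened mechanically before anyone interprets the data.  What follows is a minimal Python sketch, not a definitive implementation, of checks for two of them: sufficiency and timeliness.  The function names, the six-observation minimum, and the 30-day reporting period with a 5-day tolerance are all assumptions made for illustration.

```python
from datetime import date


def check_sufficient(observations, min_points=6):
    """Sufficient: enough observations to mark a trend.

    The default minimum of six points is an illustrative assumption.
    """
    return len(observations) >= min_points


def check_timely(report_dates, expected_gap_days=30, tolerance_days=5, today=None):
    """Timely: reports arrive at a consistent period and the latest is still fresh.

    The 30-day period and 5-day tolerance are illustrative assumptions.
    """
    today = today or date.today()
    dates = sorted(report_dates)
    if len(dates) < 2:
        return False
    # Days between each pair of consecutive reports.
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    consistent = all(abs(gap - expected_gap_days) <= tolerance_days for gap in gaps)
    fresh = (today - dates[-1]).days <= expected_gap_days + tolerance_days
    return consistent and fresh
```

Checks like these only establish that the data is there and arriving on schedule; whether it is significant or relevant still takes domain judgment.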

The seven characteristics above are the shorthand that I currently use in assessing data requirements; the list is not intended to be exhaustive.  But it points to two further considerations when delivering a solution.

First, at what point does the person cease to be the computer?  Business analytics–the tactical level of enterprise data optimization–oftentimes are stuck in providing users with a choice of chart or graph to use in representing such data.  And as noted by many writers, such as this one, no doubt the proper manner of representing data will influence its interpretation.  But in this case the person is still the computer after the brute force computing is completed digitally.  There is a need for more effective significance-testing and modeling of data, with built-in controls for selection bias.
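
One modest step toward taking the person out of the computer's role is to let the software screen every new value against the full history, rather than leaving the reader to eyeball a chart.  The Python sketch below does this with a simple z-score; it is a stand-in for more rigorous significance testing, and the function name and the threshold of two standard deviations are assumptions.  Using the complete history instead of hand-picked periods is only a partial guard against selection bias.

```python
from statistics import mean, stdev


def flag_outlier(history, latest, threshold=2.0):
    """Return (z_score, flagged) for `latest` measured against all of `history`.

    `threshold` is the number of sample standard deviations treated as
    significant; the default of 2.0 is an illustrative assumption.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical observations")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A flat history: any departure at all is treated as significant.
        if latest == mu:
            return 0.0, False
        return (float("inf") if latest > mu else float("-inf")), True
    z = (latest - mu) / sigma
    return z, abs(z) >= threshold
```

Fed a performance index that has hovered around 1.0, a call such as `flag_outlier(history, 0.90)` would return a large negative score and a flag, without anyone having to choose a chart type first.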

Second, how should data be summarized to the operational and strategic levels so that “signatures” can be identified that inform information?  Furthermore, it is important to understand what kind of data must supplement the tactical level data at those other levels.  Thus, data streams are not only minimized to eliminate redundancy, but also properly aligned to the level of data intelligence.
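
To illustrate the alignment of data streams, here is a hedged Python sketch in which the operational and strategic summaries are both computed from the same tactical records, so that no level maintains its own version of reality.  The record layout (program, project, planned, earned) and the function name are hypothetical, and the sketch does not attempt the harder problems of identifying signatures or supplementing the data at higher levels.

```python
from collections import defaultdict

# Hypothetical tactical-level records: one row per task.
tasks = [
    {"program": "A", "project": "A1", "planned": 100.0, "earned": 90.0},
    {"program": "A", "project": "A1", "planned": 50.0, "earned": 50.0},
    {"program": "A", "project": "A2", "planned": 200.0, "earned": 150.0},
    {"program": "B", "project": "B1", "planned": 80.0, "earned": 80.0},
]


def roll_up(records, key):
    """Sum planned and earned values for each distinct value of `key`."""
    totals = defaultdict(lambda: {"planned": 0.0, "earned": 0.0})
    for record in records:
        totals[record[key]]["planned"] += record["planned"]
        totals[record[key]]["earned"] += record["earned"]
    return {group: dict(values) for group, values in totals.items()}


# Both summaries are derived from the same tactical stream.
operational = roll_up(tasks, "project")
strategic = roll_up(tasks, "program")
```

Because both roll-ups read the same records, the strategic totals always reconcile with the operational ones.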

*Note that there are other aspects of data characteristics noted by other sources here, here, and here.  Most of these concern themselves with data quality and what I would consider to be baseline data traits, which need to be separately assessed and tested, as opposed to antecedent characteristics.

 

The Future — Data Focus vs. “Tools” Focus

The title in this case is from the Leonard Cohen song.

Over the last few months I’ve come across this issue quite a bit and it goes to the heart of where software technology is leading us.  The basic question that underlies this issue boils down to whether software should be thought of as a set of “tools” or as an overarching solution that can handle data in the way that the organization requires.  It is a fundamental question because what we call Big Data, despite all of the hoopla, is really a relative term that changes with hardware, storage, and software scalability.  What was Big Data in 1997 is not Big Data in 2016.

As Moore’s Law expands scalability at lower cost, organizations and SMEs are finding that the dedicated software tools at hand are insufficient to leverage the additional information that can be derived from that data.  The reason for this is simple.  A COTS tools publisher will determine the functionality required based on a structured set of data that is to be used and code to that requirement.  The timeframe is usually extended and the approach highly structured.  There are very good reasons for this approach in particular industries where structure is necessary and the environment is fairly stable.  The list of industries that fall into this category is rapidly becoming smaller.  Thus, there is a large gap that must be filled by workarounds, custom code, and suboptimized use of Excel.  Organizations and people cannot wait until the self-styled software SMEs get around to providing that upgrade two years from now so that people can do their jobs.

Thus, the focus must be shifted to data and the software technologies that maximize its immediate exploitation for business purposes to meet organizational needs.  The key here is the rise of Fourth Generation applications that leverage object-oriented programming languages that most closely replicate the flexibility of open source.  What this means is that, in lieu of buying a set of “tools” (each focused on solving a specific problem, stitched together by a common platform or through data transfer), software that deals with both data and UI in an agnostic fashion is now available.

The availability of flexible Fourth Generation software is of great concern, as one would imagine, to incumbents who have built their business model on defending territory based on a set of artifacts provided in the software.  Oftentimes these artifacts are nothing more than automatically filled in forms that previously were filled in manually.  That model was fine during the first and second waves of automation from the 1980s and 1990s, but such capabilities are trivial in 2016 given software focused on data that can be quickly adapted to provide functionality as needed.  What this development also does is eliminate and make trivial those old checklists that IT shops used to send out in a lazy way of assessing relative capabilities of software to simplify the competitive range.

Tools restrict themselves to a subset of data by definition to provide a specific set of capabilities.  Software that expands to include any set of data and allows that data to be displayed and processed as necessary through user configuration adapts itself more quickly and effectively to organizational needs.  It also tends to eliminate the need for multiple “best-of-breed” toolset approaches that are not the best of any breed and, more importantly, goes beyond the limited functionality and ways of deriving importance from data found in structured tools.  The reason for this is that the data drives what is possible and important, rather than tools imposing a well-trod interpretation of importance based on a limited set of data stored in a proprietary format.

An important effect of Fourth Generation software that provides flexibility in UI and functionality driven by the user is that it puts the domain SME back in the driver’s seat.  This is an important development.  For too long SMEs have had to content themselves with recommending and advocating for functionality in software while waiting for the market (software publishers) to respond.  Essential business functionality with limited market commonality often required that organizations either wait until the remainder of the market drove software publishers to meet their needs, finance expensive custom development (either organic or contracted), or fill gaps with suboptimized and ad hoc internal solutions.  With software that adapts its UI and functionality based on any data that can be accessed, using simple configuration capabilities, SMEs can fill these gaps with a consistent solution that maintains data fidelity and aids in the capture and sustainability of corporate knowledge.

Furthermore, for all of the talk about Agile software techniques, one cannot implement Agile using software languages and approaches that were designed in an earlier age and that resist optimization of the method.  Fourth Generation software lends itself most effectively to Agile, since configuration using simple object-oriented language gets us to the ideal of releasable solutions at the end of a two-week sprint, without a reliance on single points of failure.  No doubt there are developers out there making good money who may challenge this assertion, but they are the exceptions to the rule that prove the point.  An organization should be able to optimize the pool of contributors to solution development and rollout in supporting essential business processes.  Otherwise Agile is just a pretext to overcome suboptimized developmental approaches, software languages, and the self-interest of developers who can’t plan or produce a releasable product in a timely manner within budgetary constraints.

In the end the change in mindset from tools to data goes to the issue of who owns the data: the organization that creates and utilizes the data (the customer), or the proprietary software tool publishers?  Clearly the economics will win out in favor of the customer.  It is time to displace “tools” thinking.

Note:  I’ve revised the title of the blog for clarity.

The End (of Analysis) Is the Beginning Is the End

Been back in the woodshed for a bit.  I just completed my latest post for AITS.org, which should be published sometime in mid-July.  In the meantime, I’ve been looking at issues of data visualization, process improvement, and performance management–and their interdependencies.  The APQC blog has some interesting things to say about project management challenges which, to be quite honest, sound a lot like “mom, apple pie, and Chevrolet.”

But there are nuggets of gold in there which I will save for another post, while focusing on another article by Holly Lyke-Ho-Gland on the top challenges in organizational performance management.  There are essentially three challenges.  The first is “establishing a performance culture.”  Given that APQC’s mission is broader than what I would view as traditional complex project management, this first statement is more than gratuitous.  The second is “identifying the right benchmarks and their source.”  At first blush this gets a big “duh”, but in every profession and discipline this is an area with a pretty consistent failing, especially on the back end of that statement.  For example, if one transitions from processed, human-readable reporting to just accessing the source data, should not the results be the same?  I have been told otherwise both in meetings and during private conversations at project management conferences, which should be a counterfactual and raise some eyebrows.  The third and last is “defining and using process measures (leading, in-process, and lagging) in the business.”

While somewhat conceptual and non-specific, I would view all three of these challenges as elements necessary to a successful performance management system.  Furthermore, what is interesting here is that Ms. Lyke-Ho-Gland illustrates the connection between process and performance management.  The source of the data, and its credibility, is as important as collecting data.  Furthermore, I would posit that the job doesn’t stop at finding anomalies in the data or variances in performance.  This is just the beginning of the process in determining root causes of the issues and appropriate corrective action.  Thus, information analysis isn’t the end of the process, but the beginning of the process that will lead us to the ends.