Potato, Potahto, Tomato, Tomahto: Data Normalization vs. Standardization, Why the Difference Matters

In my vocation I run a technology company devoted to program management solutions that is primarily concerned with taking data and converting it into information to establish a knowledge-based environment. Similarly, in my avocation I deal with the meaning of information and how to turn it into insight and knowledge. This latter activity concerns the subject areas of history, sociology, and science.

In my travels just prior to and since the New Year, I have come upon a number of experts and fellow enthusiasts in these respective fields. The overwhelming numbers of these encounters have been productive, educational, and cordial. We respectfully disagree in some cases about the significance of a particular approach, governance when it comes to project and program management policy, but generally there is a great deal of agreement, particularly on basic facts and terminology. But some areas of disagreement–particularly those that come from left field–tend to be the most interesting because they create an opportunity to clarify a larger issue.

In a recent venue I encountered this last example where the issue was the use of the phrase data normalization. The issue at hand was that the use of “data normalization” suggested some statistical methodology in reconciling data into a standard schema. Instead, it was suggested, the term “data standardization” was more appropriate.

These phrases do not describe the same thing, but they do describe processes that are symbiotic, not mutually exclusive. So what about data normalization? No doubt there is a statistical use of the term, but we are dealing with the definition as used in digital technology here, just as the use of “standardization” was suggested in the same context. There are many examples of technical terminology that do not have the same meaning when used in different contexts. Here is the definition of normalization applied to data science from Technopedia, which is the proper use of the term in this case:

Normalization is the process of reorganizing data in a database so that it meets two basic requirements: (1) There is no redundancy of data (all data is stored in only one place), and (2) data dependencies are logical (all related data items are stored together). Normalization is important for many reasons, but chiefly because it allows databases to take up as little disk space as possible, resulting in increased performance.

Normalization is also known as data normalization

This is pretty basic (and necessary) stuff. I have written at length about data normalization, but also pair it with two other terms. This is data rationalization and contextualization. Here is a short definition of rationalization:

What is the benefit of Data Rationalization? To be able to effectively exploit, manage, reuse, and govern enterprise data assets (including the models which describe them), it is necessary to be able to find them. In addition, there is (or should be) a wealth of semantics (e.g. business names, definitions, relationships) embedded within an organization’s models that can be exposed for improved analysis and knowledge transfer. By linking model objects (across or within models) it is possible to discover the higher order conceptual objects for any given object. Conversely, it is possible to identify what implementation artifacts implement a higher order model object. For example, using data rationalization, one can traverse from a conceptual model entity to a logical model entity to a physical model table to a database table, etc. Similarly, Data Rationalization enables understanding of a database table by traversing up through the different model levels.

Finally, we have contextualization. Here is a good definition using Wikipedia:

Context or contextual information is any information about any entity that can be used to effectively reduce the amount of reasoning required (via filtering, aggregation, and inference) for decision making within the scope of a specific application.[2] Contextualisation is then the process of identifying the data relevant to an entity based on the entity’s contextual information. Contextualisation excludes irrelevant data from consideration and has the potential to reduce data from several aspects including volume, velocity, and variety in large-scale data intensive applications

There is no approximation of reflecting the accuracy of data in any of these terms wihin the domain of data and computer science. Nor are there statistical methods involved to approximate what needs to be accomplished precisely. The basic skill required to accomplish these tasks–knowing that the data is structured and pre-conditioned–is to reconcile the various lexicons from differing sources, much as I reconcile in my avocation the meaning of words and phrases across periods in history and across languages.

In this discussion we are dealing with the issue of different words used to describe a process or phenomenon. Similarly, we find this challenge in data.

So where does this leave data standardization? In terms of data and computer science, this describes a completely different method. Here is a definition from Wikipedia, which is the proper contextual use of the term under “Standard data model”:

A standard data model or industry standard data model (ISDM) is a data model that is widely applied in some industry, and shared amongst competitors to some degree. They are often defined by standards bodies, database vendors or operating system vendors.

In the context of project and program management, particularly as it relates to government data submission and international open standards across vendors in an industry, is the use of a common schema. In this case there is a DoD version of a UN/CEFACT XML file currently set as the standard, but soon to be replaced by a new standard using the JSON file structure.

In any event, what is clear here is that, while standardization is a necessary part of a data policy to allow for sharing of information, the strength of the chosen schema and the instructions regarding it will vary–and this variation will have an effect on the quality of the information shared. But that is not all.

This is where data normalization, rationalization, and contextualization come into play. In order to create data for the a standardized format, it is first necessary to convert what is an otherwise opaque set of data due to differences into a cohesive lexicon. In data, this is accomplished by reconciling data dictionaries to determine which items are describing the same thing, process, measure, or phenomenon. In a domain like program management, this is a finite set. But it is also specialized knowledge and where the value is added to any end product that is produced. Then, once we know how to identify the data, we must be able to map those terms to the standard schema but, keeping on eye on the use of the data down the line, must be able to properly structure and ensure interrelationships of the data are established and/or maintained to ensure its effective use. This is no mean task and why all data transformation methods and companies are not the same.

Furthermore, these functions can be accomplished efficiently or inefficiently. The inefficient method is to take the old-fashioned business intelligence method that has been around since the 1980s and before, where a team of data scientists and analysts deal with data as if it is flat and, essentially, reinvents the wheel in establishing the meaning and proper context of the data. Given enough time and money anything can be accomplished, but brute force labor will not defeat the Second Law of Thermodynamics.

In computing, which comes close to minimizing that physical law, we know that data has already been imbued with meaning upon its initial processing. In lieu of brute force labor we apply intelligence and knowledge to accomplish this requirement. This is called normalization, rationalization, and contextualization of data. It requires a small fraction of other methods in terms of time and effort, and is infinitely more transparent.

Using these methods is also where innovation, efficiency, performance, accuracy, scalability, and anticipating future requirements based on the latest technology trends comes into play. Establishing a seamless flow of data integration allows, for example, the capture of more data being able to be properly structured in a database, which lays the ground for the transition from 2D to 3D and 4D (that is, what is often called integrated) program management, as well as more effective analytics.

The term “standardization” also suffers from a weakness in data and computer science that requires that it be qualified. After all, data standardization in an enterprise or organization does not preclude the prescription of a propriety dataset. In government, this is contrary to both statutory and policy mandates. Furthermore, even given an effective, open standard, there will be a large pool of legacy and other non-conforming data that will still require capture and transformation.

The Section 809 Panel study dealt directly with this issue:

Use existing defense business system open-data requirements to improve strategic decision making on acquisition and workforce issues…. DoD has spent billions of dollars building the necessary software and institutional infrastructure to collect enterprise wide acquisition and financial data. In many cases, however, DoD lacks the expertise to effectively use that data for strategic planning and to improve decision making. Recommendation 88 would mitigate this problem by implementing congressional open-data mandates and using existing hiring authorities to bolster DoD’s pool of data science professionals.

Section 809 Volume 3, Section 9, p.477

As operating environment companies expose more and more capability into the market through middleware and other open systems methods of visualizing data, the key to a system no longer resides in its ability to produce charts and graphs. The use of Excel as an ad hoc data repository with its vulnerability to error, to manipulation, and for its resistance to the establishment of an optimized data management and corporate knowledge environment is a symptom of the larger issue.

Data and its proper structuring is at the core of organizational success and process improvement. Standardization alone will not address barriers to data optimization. According to RAND studies in 2015 and 2017* these are:

  • Data Quality and Discontinuities
  • Data Silos and Underutilized Repositories
  • Timeliness of Data for use by SMEs and Decision-makers
  • Lack of Access and Contextualization
  • Traceability and Auditability
  • Lack of the Ability to Apply Discovery in the Data
  • The issue of Contractual Technical Data and Proprietary Data

That these issues also exist in private industry demonstrates the universality of the issue. Thus, yes, standardize by all means. But also ensure that the standard is open and that transformation is traceable and auditable from the the source system to the standard schema, and then into the target database. Only then will the enterprise, the organization, and the government agency have full ownership of the data it requires to efficiently and effectively carry out its purpose.

*RAND Corporation studies are “Issues with Access to Acquisition Data and Information in the DoD: Doing Data Right in Weapons System Acquisition” (RR880, 2017), and “Issues with Access to Acquisition Data and Information in the DoD: Policy and Practice (RR1534, 2015). These can be found here.

Big Time — Elements of Data Size in Scaling

I’ve run into additional questions about scalability.  It is significant to understand the concept in terms of assessing software against data size, since there are actually various aspect of approaching the issue.

Unlike situations where data is already sorted and structured as part of the core functionality of the software service being provided, this is in dealing in an environment where there are many third-party software “tools” that put data into proprietary silos.  These act as barriers to optimizing data use and gaining corporate intelligence.  The goal here is to apply in real terms the concept that the customers generating the data (or stakeholders who pay for the data) own the data and should have full use of it across domains.  In project management and corporate governance this is an essential capability.

For run-of-the-mill software “tools” that are focused on solving one problem, this often is interpreted as just selling a lot more licenses to a big organization.  “Sure we can scale!” is code for “Sure, I’ll sell you more licenses!”  They can reasonably make this assertion, particularly in client-server or web environments, where they can point to the ability of the database system on which they store data to scale.  This also comes with, usually unstated, the additional constraint that their solution rests on a proprietary database structure.  Such responses, through, are a form of sidestepping the question, nor is it the question being asked.  Thus, it is important for those acquiring the right software to understand the subtleties.

A review of what makes data big in the first place is in order.  The basic definition, which I outlined previously, came from NASA in describing data that could not be held in local memory or local storage.  Hardware capability, however, continues to grow exponentially, so that what is big data today is not big data tomorrow.  But in handling big data, it then becomes incumbent on software publishers to drive performance to allow their customers to take advantage of the knowledge contained in these larger data sets.

The elements that determine the size of data are:

a.  Table size

b.  Row or Record size

c.  Field size

d.  Rows per table

e.  Columns per table

f.  Indexes per table

Note the interrelationships of these elements in determining size.  Thus, recently I was asked how many records are being used on the largest tables being accessed by a piece of software.  That is fine as shorthand, but the other elements add to the size of the data that is being accessed.  Thus, a set of data of say 800K records may be just as “big” as one containing 2M records because of the overall table size of fields, and the numbers of columns and indices, as well as records.  Furthermore, this question didn’t take into account the entire breadth of data across all tables.

Understanding the definition of data size then leads us to understanding the nature of software scaling.  There are two aspects to this.

The first is the software’s ability to presort the data against the database in such as way as to ensure that latency–the delay in performance when the data is loaded–is minimized.  The principles applied here go back to database management practices back in the day when organizations used to hire teams of data scientists to rationalize data that was processed in machine language–especially when it used to be stored in ASCII or, for those who want to really date themselves, EBCDIC, which were incoherent by today’s generation of more human-readable friendly formats.

Quite simply, the basic steps applied has been to identify the syntax, translate it, find its equivalents, and then sort that data into logical categories that leverage database pointers.  What you don’t want the software to do is what used to be done during the earliest days of dealing with data, which was smaller by today’s standards, of serially querying ever data element in order to fetch only what the user is calling.  Furthermore, it also doesn’t make much sense to deal with all data as a Repository of Babel to apply labor-intensive data mining in non-relational databases, especially in cases where the data is well understood and fairly well structured, even if in a proprietary structure.  If we do business in a vertical where industry standards in data apply, as in the use of the UN/CEFACT XML convention, then much of the presorting has been done for us.  In addition, more powerful industry APIs (like OLE DB and ODBC) that utilize middleware (web services, XML, SOAP, MapReduce, etc.) multiply the presorting capabilities of software, providing significant performance improvements in accessing big data.

The other aspect is in the software’s ability to understand limitations in data communications hardware systems.  This is a real problem, because the backbone in corporate communication systems, especially to ensure security, is still largely done over a wire.  The investments in these backbones is usually categorized as a capital investment, and so upgrades to the system are slow.  Furthermore, oftentimes backbone systems are embedded in physical plant building structures.  So any software performance is limited by the resistance and bandwidth of wiring.  Thus, we live in a world where hardware storage and processing is doubling every 12-18 months, and software is designed to better leverage such expansion, but the wires over which data communication depends remains stuck in the past–constrained by the basic physics of CAT 6 or Fiber Optic cabling.

Needless to say, software manufacturers who rely on constant communications with the database will see significantly degraded performance.  Some software publishers who still rely on this model use the “check out” system, treating data like a lending library, where only one user or limited users can access the same data.  This, of course, reduces customer flexibility.  Strategies that are more discrete in handling data are the needed response here until day-to-day software communications can reap the benefits of physical advancements in this category of hardware.  Furthermore, organizations must understand that the big Cloud in the sky is not the answer, since it is constrained by the same physics as the rest of the universe–with greater security risks.

All of this leads me to a discussion I had with a colleague recently.  He opened his iPhone and did a query in iTunes for an album.  In about a second or so his query selected the artist and gave a list of albums–all done without a wire connection.  “Why can’t we have this in our industry?” he asked.  Why indeed?  Well, first off, Apple iTunes has sorted its data to optimize performance with its app, and it is performed within a standard data stream and optimized for the Apple iOS and hardware.  Secondly, the variables of possible queries in iTunes are predefined and tied to a limited and internally well-defined set of data.  Thus, the data and application challenges are not equivalent as found in my friend’s industry vertical.  For example, aside from the challenges of third party data normalization and rationalization, iTunes is not dealing with dynamic time-phased or trending data that requires multiple user updates to reflect changes using predictive analytics which is then served to different classes of users in a highly secure environment.  Finally, and most significantly, Apple spent more on that system than the annual budget of my friend’s organization.  In the end his question was a good one, but in discussing this it was apparent that just saying “give me this” is a form of magical thinking and hand waving.  The devil is in the details, though I am confident that we will eventually get to an equivalent capability.

At the end of the day IT project management strategy must take into account the specific needs of classes of users in making determinations of scaling.  What this means is a segmented approach: thick-client users, compartmentalized local installs with subsets of data, thin-client, and web/mobile or terminal services equivalents.  The practical solution is still an engineered one that breaks the elephant into digestible pieces that lean forward to leveraging advances in hardware, database, and software operating environments.  These are the essential building blocks to data optimization and scaling.

I Can See Clearly Now — Knowledge Discovery in Databases, Data Scalability, and Data Relevance

I recently returned from a travel and much of the discussion revolved around the issues of scalability and the use of data.  What is clear is that the conversation at the project manager level is shifting from a long-running focus on reports and metrics to one focused on data and what can be learned from it.  As with any technology, information technology exploits what is presented before it.  Most recently, accelerated improvements in hardware and communications technology has allowed us to begin to collect and use ever larger sets of data.

The phrase “actionable” has been thrown around quite a bit in marketing materials, but what does this term really mean?  Can data be actionable?  No.  Can intelligence derived from that data be actionable?  Yes.  But is all data that is transformed into intelligence actionable?  No.  Does it need to be?  No.

There are also kinds and levels of intelligence, particularly as it relates to organizations and business enterprises.  Here is a short list:

a. Competitive intelligence.  This is intelligence derived from data that informs decision makers about how their organization fits into the external environment, further informing the development of strategic direction.

b. Business intelligence.  This is intelligence derived from data that informs decision makers about the internal effectiveness of their organization both in the past and into the future.

c. Business analytics.  The transformation of historical and trending enterprise data used to provide insight into future performance.  This includes identifying any underlying drivers of performance, and any emerging trends that will manifest into risk.  The purpose is to provide sufficient early warning to allow risk to be handled before it fully manifests, therefore keeping the effort being measured consistent with the goals of the organization.

Note, especially among those of you who may have a military background, that what I’ve outlined is a hierarchy of information and intelligence that addresses each level of an organization’s operations:  strategic, operational, and tactical.  For many decision makers, translating tactical level intelligence into strategic positioning through the operational layer presents the greatest challenge.  The reason for this is that, historically, there often has been a break in the continuity between data collected at the tactical level and that being used at the strategic level.

The culprit is the operational layer, which has always been problematic for organizations and those individuals who find themselves there.  We see this difficulty reflected in the attrition rate at this level.  Some individuals cannot successfully make this transition in thinking. For example, in the U.S. Army command structure when advancing from the battalion to the brigade level, in the U.S. Navy command structure when advancing from Department Head/Staff/sea command to organizational or fleet command (depending on line or staff corps), and in business for those just below the C level.

Another way to look at this is through the traditional hierarchical pyramid, in which data represents the wider floor upon which each subsequent, and slightly reduced, level is built.  In the past (and to a certain extent this condition still exists in many places today) each level has constructed its own data stream, with the break most often coming at the operational level.  This discontinuity is then reflected in the inconsistency between bottom-up and top-down decision making.

Information technology is influencing and changing this dynamic by addressing the main reason for the discontinuity existing–limitations in data and intelligence capabilities.  These limitations also established a mindset that relied on limited, summarized, and human-readable reporting that often was “scrubbed” (especially at the operational level) as it made its way to the senior decision maker.  Since data streams were discontinuous, there were different versions of reality.  When aspects of the human equation are added, such as selection bias, the intelligence will not match what the data would otherwise indicate.

As I’ve written about previously in this blog, the application of Moore’s Law in physical computing performance and storage has pushed software to greater needs in scaling in dealing with ever increasing datasets.  What is defined as big data today will not be big data tomorrow.

Organizations, in reaction to this condition, have in many cases tended to simply look at all of the data they collect and throw it together into one giant pool.  Not fully understanding what the data may say, a number of ad hoc approaches have been taken.  In some cases this has caused old labor-intensive data mining and rationalization efforts to once again rise from the ashes to which they were rightly consigned in the past.  On the opposite end, this has caused a reliance on pre-defined data queries or hard-coded software solutions, oftentimes based on what had been provided using human-readable reporting.  Both approaches are self-limiting and, to a large extent, self-defeating.  In the first case because the effort and time to construct the system will outlive the needs of the organization for intelligence, and in the second case, because no value (or additional insight) is added to the process.

When dealing with large, disparate sources of data, value is derived through that additional knowledge discovered through the proper use of the data.  This is the basis of the concept of what is known as KDD.  Given that organizations know the source and type of data that is being collected, it is not necessary to reinvent the wheel in approaching data as if it is a repository of Babel.  No doubt the euphemisms, semantics, and lexicon used by software publishers differs, but quite often, especially where data underlies a profession or a business discipline, these elements can be rationalized and/or normalized given that the appropriate business cross-domain knowledge is possessed by those doing the rationalization or normalization.

This leads to identifying the characteristics* of data that is necessary to achieve a continuity from the tactical to the strategic level that will achieve some additional necessary qualitative traits such as fidelity, credibility, consistency, and accuracy.  These are:

  1. Tangible.  Data must exist and the elements of data should record something that correspondingly exists.
  2. Measurable.  What exists in data must be something that is in a form that can be recorded and is measurable.
  3. Sufficient.  Data must be sufficient to derive significance.  This includes not only depth in data but also, especially in the case of marking trends, across time-phasing.
  4. Significant.  Data must be able, once processed, to contribute tangible information to the user.  This goes beyond statistical significance noted in the prior characteristic, in that the intelligence must actually contribute to some understanding of the system.
  5. Timely.  Data must be timely so that it is being delivered within its useful life.  The source of the data must also be consistently provided over consistent periodicity.
  6. Relevant.  Data must be relevant to the needs of the organization at each level.  This not only is a measure to test what is being measured, but also will identify what should be but is not being measured.
  7. Reliable.  The sources of the data be reliable, contributing to adherence to the traits already listed.

This is the shorthand that I currently use in assessing a data requirements and the list is not intended to be exhaustive.  But it points to two further considerations when delivering a solution.

First, at what point does the person cease to be the computer?  Business analytics–the tactical level of enterprise data optimization–oftentimes are stuck in providing users with a choice of chart or graph to use in representing such data.  And as noted by many writers, such as this one, no doubt the proper manner of representing data will influence its interpretation.  But in this case the person is still the computer after the brute force computing is completed digitally.  There is a need for more effective significance-testing and modeling of data, with built-in controls for selection bias.

Second, how should data be summarized to the operational and strategic levels so that “signatures” can be identified that inform information?  Furthermore, it is important to understand what kind of data must supplement the tactical level data at those other levels.  Thus, data streams are not only minimized to eliminate redundancy, but also properly aligned to the level of data intelligence.

*Note that there are other aspects of data characteristics noted by other sources here, here, and here.  Most of these concern themselves with data quality and what I would consider to be baseline data traits, which need to be separately assessed and tested, as opposed to antecedent characteristics.

 

Do You Believe in Magic? — Big Data, Buzz Phrases, and Keeping Feet Planted Firmly on the Ground

My alternative title for this post was “Money for Nothing,” which is along the same lines.  I have been engaged in discussions regarding Big Data, which has become a bit of a buzz phrase of late in both business and government.  Under the current drive to maximize the value of existing data, every data source, stream, lake, and repository (and the list goes on) has been subsumed by this concept.  So, at the risk of being a killjoy, let me point out that not all large collections of data is “Big Data.”  Furthermore, once a category of data gets tagged as Big Data, the further one seems to depart from the world of reality in determining how to approach and use the data.  So for of you who find yourself in this situation, let’s take a collective deep breath and engage our critical thinking skills.

So what exactly is Big Data?  Quite simply, as noted by this article in Forbes by Gil Press, term is a relative one, but generally means from a McKinsey study, “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”  This subjective definition is a purposeful one, since Moore’s Law tends to change what is viewed as simply digital data as opposed to big data.  I would add some characteristics to assist in defining the term based on present challenges.  Big data at first approach tends to be unstructured, variable in format, and does not adhere to a schema.  Thus, not only is size a criteria for the definition, but also the chaotic nature of the data that makes it hard to approach.  For once we find a standard means of normalizing, rationalizing, or converting digital data, it no longer is beyond the ability of standard database tools to effectively use it.  Furthermore, the very process of taming it thereby renders it non-big data, or perhaps, if a exceedingly large dataset, perhaps “small big data.”

Thus, having defined our terms and the attributes of the challenge we are engaging, we now can eliminate many of the suppositions that are floating around in organizations.  For example, there is a meme that I have come upon that asserts that disparate application file data can simply be broken down into its elements and placed into database tables for easy access by analytical solutions to derive useful metrics.  This is true in some ways but both wrong and dangerous in its apparent simplicity.  For there are many steps missing in this process.

Let’s take, for example, the least complex example in the use of structured data submitted as proprietary files.  On its surface this is an easy challenge to solve.  Once someone begins breaking the data into its constituent parts, however, greater complexity is found, since the indexing inherent to data interrelationships and structures are necessary for its effective use.  Furthermore, there will be corruption and non-standard use of user-defined and custom fields, especially in data that has not undergone domain scrutiny.  The originating third-party software is pre-wired to be able to extract this data properly.  Absent having to use and learn multiple proprietary applications with their concomitant idiosyncrasies, issues of sustainability, and overhead, such a multivariate approach defeats the goal of establishing a data repository in the first place by keeping the data in silos, preventing integration.  The indexing across, say, financial systems or planning systems are different.  So how do we solve this issue?

In approaching big data, or small big data, or datasets from disparate sources, the core concept in realizing return on investment and finding new insights, is known as Knowledge Discovery in Databases or KDD.  This was all the rage about 20 years ago, but its tenets are solid and proven and have evolved with advances in technology.  Back then, the means of extracting KDD from existing databases was the use of data mining.

The necessary first step in the data mining approach is pre-processing of data.  That is, once you get the data into tables it is all flat.  Every piece of data is the same–it is all noise.  We must add significance and structure to that data.  Keep in mind that we live in this universe, so there is a cost to every effort known as entropy.  Computing is as close as you’ll get to defeating entropy, but only because it has shifted the burden somewhere else.  For large datasets it is pushed to pre-processing, either manual or automated.  In the brute force world of data mining, we hire data scientists to pre-process the data, find commonalities, and index it.  So let’s review this “automated” process.  We take a lot of data and then add a labor-intensive manual effort to it in order to derive KDD.  Hmmm..  There may be ROI there, or there may not be.

But twenty years is a long time and we do have alternatives, especially in using Fourth Generation software that is focused on data usage without the limitations of hard-coded “tools.”  These alternatives apply when using data on existing databases, even disparate databases, or file data structured under a schema with well-defined data exchange instructions that allow for a consistent manner of posting that data to database tables. The approach in this case is to use APIs.  The API, like OLE DB or the older ODBC, can be used to read and leverage the relative indexing of the data.  It will still require some code to point it in the right place and “tell” the solution how to use and structure the data, and its interrelationship to everything else.  But at least we have a means for reducing the cost associated with pre-processing.  Note that we are, in effect, still pre-processing data.  We just let the CPU do the grunt work for us, oftentimes very quickly, while giving us control over the decision of relative significance.

So now let’s take the meme that I described above and add greater complexity to it.  You have all kinds of data coming into the stream in all kinds of formats including specialized XML, open, black-boxed data, and closed proprietary files.  This data is non-structured.  It is then processed and “dumped” into a non-relational database such as NoSQL.  How do we approach this data?  The answer has been to return to a hybrid of pre-processing, data mining, and the use of APIs.  But note that there is no silver bullet here.  These efforts are long-term and extremely labor intensive at this point.  There is no magic.  I have heard time and again from decision makers the question: “why can’t we just dump the data into a database to solve all our problems?”  No, you can’t, unless you’re ready for a significant programmatic investment in data scientists, database engineers, and other IT personnel.  At the end, what they deploy, when it gets deployed, may very well be obsolete and have wasted a good deal of money.

So, once again, what are the proper alternatives?  In my experience we need to get back to first principles.  Each business and industry has commonalities that transcend proprietary software limitations by virtue of the professions and disciplines that comprise them.  Thus, it is domain expertise to the specific business that drives the solution.  For example, in program and project management (you knew I was going to come back there) a schedule is a schedule, EVM is EVM, financial management is financial management.

Software manufacturers will, apart from issues regarding relative ease of use, scalability, flexibility, and functionality, attempt to defend their space by establishing proprietary lexicons and data structures.  Not being open, while not serving the needs of customers, helps incumbents avoid disruption from new entries.  But there often comes a time when it is apparent that these proprietary definitions are only euphemisms for a well-understood concept in a discipline or profession.  Cat = Feline.  Dog = Canine.

For a cohesive and well-defined industry the solution is to make all data within particular domains open.  This is accomplished through the acceptance and establishment of a standard schema.  For less cohesive industries, but where the data or incumbents through the use of common principles have essentially created a de facto schema, APIs are the way to extract this data for use in analytics.  This approach has been applied on a broader basis for the incorporation of machine data and signatures in social networks.  For closed or black-boxed data, the business or industry will need to execute gap analysis in order to decide if database access to such legacy data is truly essential to its business, or given specification for a more open standard from “time-now” will eventually work out suboptimization in data.

Most important of all and in the end, our results must provide metrics and visualizations that can be understood, are valid, important, material, and be right.

Sixteen Tons — Data Mining, Big Data, and the Asymmetry of variables and observations

Last Thursday I came upon what I can only interpret as an ironic comment at Mark Thoma’s Economist’s View blog entitled “Data Mining Can be a Productive Activity.”  I went to the link and it went to a VOX article by Castle and Hendry entitled “Data Mining with more variables than observations.”  All I could think after the opening line:  “A typical Oxford University econometrics exam question might take the form: ‘Data mining is bad, so mining with more candidate variables than observations must be pernicious. Discuss.'” was: are these people serious?

Data mining is a general term in high tech and not a specific approach to finding patterns and trends in large elements of data.  The authors–and I’m guessing that they are not alone in the econometrics profession–seem to be addressing a “Just Say No” approach to performing what for most of us who deal in statistical analysis and modeling of large datasets do every day, largely based on the fact that it involves these scary things called computers that run this mysterious thing behind the scenes called “code.”  Who knows what horrors may await us as we mistakenly draw causations from correlations by anything more than the use of Access or Excel spreadsheets?  It seems that Oxford dons need to get out more.

The use of microeconomic data mining has been in general use for quite some time in many businesses and business disciplines with a great deal of confidence and success (too much success in the medical insurance, financial services, and social networking fields to raise legal and ethical objections).  So the assertion that seems to be based on those of a single group of econometricians does seem to be odd.  In the end it seems to be a setup for a proprietary set of calculations placed within an Excel spreadsheet given the name “Autometrics.”  This largely argues for the proper approach to the organization of data rather than a criticism of data mining in general.

The discriminators among data mining and data mining-like technologies involve purpose, cost, ease of use, scalability, and sustainability.  New technologies are arising every year that allow for increased speed, more efficient grouping, and compression to allow organizations to handle what previously was thought to be “big data.”  Thus the concept of data mining and big data is a shifting one as our technologies drive toward greater capability in integrating and interpreting large datasets.  The authors cite the techniques of taking large data to prove anthropomorphic global warming as one of the success stories of large scale modeling based on large data.  Implicit in acknowledging this is that not every variable needs to be included in a model–only the relevant variables that drive and explain the results being produced.  There is no doubt that reification of statistical results is a common fallacy, but people had been doing that long before the development of silicon-based artificial intelligence.

There is no doubt that someday we will reach the limit of computational capabilities.  But for someone who lived through the nonexistent “crisis” of limited memory in the early ’90s followed not too after by the bogus Y2K “bug,” I am not quite ready to throw in the towel on the ability of data mining and modeling to effectively provide the tools for the more general discipline of econometrics.  We are only beginning to crack heuristic models in approaching big data and on the cusp of strong AI.