(Data) Transformation–Fear and Loathing over ETL in Project Management

ETL stands for data extract, transform, and load. This essential step is the basis for all of the new capabilities that we wish to acquire during the next wave of information technology: business analytics, big(ger) data, and interdisciplinary insight into the processes that improve productivity and efficiency.

I’ve been dealing with a good deal of fear and loathing regarding the introduction of this concept, even though in my day job my organization is a leading practitioner in the field in its vertical. Some of this is due to disinformation by competitors playing upon the fears of the non-technically minded–the expected reaction of those who can’t do, in the last throes of avoiding irrelevance. Better to baffle them with bullshit than with brilliance, I guess.

But, more importantly, part of this is due to the state of ETL and how it is communicated to the project management and business community at large. There is a great deal to be gained here by muddying the waters even by those who know better and have the technology. So let’s begin by clearing things up and making this entire field a bit more coherent.

Let’s start with the basics. Any organization that contains the interaction of people is a system. For purposes of a project management team, a business enterprise, or a governmental body we deal with a special class of systems known as Complex Adaptive Systems: CAS for short. A CAS is a non-linear learning system that reacts and evolves to its environment. It is complex because of the inter-relationships and interactions of more than two agents in any particular portion of the system.

I was first introduced to the concept of CAS through readings published out of the Santa Fe Institute in New Mexico. Most noteworthy is the work The Quark and the Jaguar by the physicist Murray Gell-Mann. Gell-Mann received the Nobel Prize in Physics in 1969 for his work on elementary particles, such as the quark, and is a co-founder of the Institute. He also was part of the team that first developed simulated Monte Carlo analysis during a period he spent at RAND Corporation. Anyone interested in the basic science of quanta and how the universe works, which then leads to insights into subjects such as day-to-day probability and risk, should read this book. It is a good popular scientific publication written by a brilliant mind, and very relevant to the subjects we deal with in project management and information science.

Understanding that our organizations are CAS allows us to apply all sorts of tools to better understand them and their relationship to the world at large. From a more practical perspective, what are the risks involved in the enterprise in which we are engaged, and what are the probabilities associated with any of the range of outcomes that we can label as success? For my purposes, the science of information theory is at the forefront of these tools. In this world an engineer by the name of Claude Shannon, working at Bell Labs, essentially invented the mathematical basis for everything that followed in the world of telecommunications: generating, interpreting, receiving, and understanding intelligence in communication, and the methods of processing information. Needless to say, computing is the main recipient of this theory.

Thus, all CAS process and react to information. The challenge for any entity that needs to survive and adapt in a continually changing universe is to ensure that the information that is being received is of high and relevant quality so that the appropriate adaptation can occur. There will be noise in the signals that we receive. What we are looking for from a practical perspective in information science are the regularities in the data so that we can make the transformation of receiving the message in a mathematical manner (where the message transmitted is received) into the definition of information quality that we find in the humanities. I believe that we will find that mathematical link eventually, but there is still a void there. A good discussion of this difference can be found here in the on-line publication Double Dialogues.

Regardless of this gap, those of us who engage in the business of ETL must bring to the table the ability not only to ensure that the regularities in the information are identified and transmitted to the intended (or necessary) users, but also to distinguish the quality of the message in terms of the purpose of the organization. Shannon’s equation is where we start, not where we end. Given this background, there are really two basic types of data that we begin with when we look at a set of data: structured and unstructured data.
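As a small illustration of that starting point, here is a minimal sketch in Python of Shannon's entropy measure, H = sum of p(x) * log2(1 / p(x)), which gives the average information per symbol in bits. I am assuming that this is the equation meant here; the function name `shannon_entropy` and the sample strings are my own, chosen purely for illustration.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Return the Shannon entropy, in bits per symbol, of a sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over symbols of p * log2(1 / p), with p the relative frequency
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    # Hypothetical sample messages, not real project data
    for message in ("AAAAAAAA", "ABABABAB", "ABCDABCD"):
        print(message, shannon_entropy(message))
```

A perfectly regular message carries no surprise and scores zero; the more evenly the symbols are spread, the higher the score. The measure says nothing about whether the message is useful to the organization, which is exactly the gap described above.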

Structured data are those where the qualitative information content is either predefined by its nature or by a tag of some sort. For example, schedule planning and performance data, regardless of the idiosyncratic/proprietary syntax used by a software publisher, describes the same phenomena regardless of the software application. There are only so many ways to identify snow–and, no, the Inuit people do not have 100 words to describe it. Qualifiers apply in the humanities, but usually our business processes more closely align with statistical and arithmetic measures. As a result, structured data is oftentimes defined by its position in a hierarchical, time-phased, or interrelated system that contains a series of markers, indexes, and tables that allow it to be interpreted easily through the identification of a Rosetta stone, even when the system, at first blush, appears to be opaque. When you go to a book, its title describes what it is. If its content has a table of contents and/or an index it is easy to find the information needed to perform the task at hand.

Unstructured data consists of the content of things like letters, e-mails, presentations, and other forms of data disconnected from its source systems and collected together in a flat repository. In this case the data must be mined to recreate what is not there: the title that describes the type of data, a table of contents, and an index.

All data requires initial scrubbing and pre-processing. The difference here is the means used to perform this operation. Let’s take the easy path first.

For project management–and most business systems–we most often encounter structured data. What this means is that, by understanding and interpreting standard industry terminology, schemas, and APIs, the simple process of aligning data to be transformed and stored in a database for consumption can be reduced to a systemic and repeatable process without the redundancy of rediscovery applied in every instance. Our business intelligence and business analytics systems can be further developed to anticipate a probable question from a user so that the query is pre-structured to allow for near immediate response. Further, structuring the user interface in such a way as to make the response to the query meaningful, especially when integrated with and juxtaposed against other types of data, requires subject matter expertise to be incorporated into the solution.

Structured ETL is the place that I most often inhabit as a provider of software solutions. These processes are both economical and relatively fast, particularly in those cases where they are applied to an otherwise inefficient system of best-of-breed applications that require data transfers and cross-validation prior to official reporting. Time, money, and effort are all saved by automating this process, improving not only processing time but also data accuracy and transparency.
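To make the idea of a repeatable structured process concrete, here is a minimal sketch in Python using only the standard library. Everything specific in it is an assumption made for illustration: the two vendor names, their column names, the common schema, and the sample rows are all hypothetical. The point is only that the mapping from each proprietary syntax to a common schema is written down once and then reused.

```python
import csv
import io
import sqlite3

# Hypothetical column names from two scheduling tools, each mapped once
# to a common schema. These names are illustrative, not a real standard.
FIELD_MAPS = {
    "vendor_a": {"TaskID": "task_id", "Start": "start_date", "PctComplete": "percent_complete"},
    "vendor_b": {"act_id": "task_id", "early_start": "start_date", "pct": "percent_complete"},
}

# Hypothetical extracts, inlined as CSV text so the sketch is self-contained.
SAMPLE_SOURCES = {
    "vendor_a": "TaskID,Start,PctComplete\nA-1,2015-01-05,50\n",
    "vendor_b": "act_id,early_start,pct\nB-7,2015-02-01,25.5\n",
}


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows, source):
    """Transform: rename source columns to the common schema and fix types."""
    mapping = FIELD_MAPS[source]
    records = []
    for row in rows:
        record = {common: row[src].strip() for src, common in mapping.items()}
        record["percent_complete"] = float(record["percent_complete"])
        record["source"] = source
        records.append(record)
    return records


def load(conn, records):
    """Load: insert the normalized records into a single table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks "
        "(task_id TEXT, start_date TEXT, percent_complete REAL, source TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks VALUES (:task_id, :start_date, :percent_complete, :source)",
        records,
    )
    conn.commit()


def run_etl(conn, sources):
    """Run extract, transform, and load for every source in turn."""
    for source, text in sources.items():
        load(conn, transform(extract(text), source))


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    run_etl(connection, SAMPLE_SOURCES)
    for row in connection.execute("SELECT * FROM tasks ORDER BY task_id"):
        print(row)
```

Adding a third application means adding one more entry to the mapping rather than rediscovering the data each time, which is where the savings in time and effort come from.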

In the case of unstructured data, however, the process can be a bit more complicated and there are many ways to skin this cat. The key here is that oftentimes what seems to be unstructured data is only so because of the lack of domain knowledge by the software publisher in its target vertical.

For example, I recently read a white paper published by a large BI/BA publisher regarding their approach to financial and accounting systems. My own experience as a business manager and Navy Supply Corps Officer provides me with the understanding that these systems are highly structured and regulated. Yet this publisher applied an unstructured approach to mining this data–and blatantly advertised and apparently sold it as state of the art.

This approach, which was first developed back in the 1980s when we first encountered the challenge of data that exceeded our expertise at the time, requires a team of data scientists and coders to go through the labor- and time-consuming process of pre-processing and building specialized processes. The most basic form of this approach involves techniques such as frequency analysis, summarization, correlation, and data scrubbing. This last portion also involves labor-intensive techniques at the microeconomic level such as binning and other forms of manipulation.
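For readers unfamiliar with these basic techniques, here is a minimal sketch in Python of three of them: frequency analysis, binning, and correlation. The function names and the sample figures are hypothetical and exist only to show the mechanics; real pre-processing work involves far more scrubbing than this.

```python
from collections import Counter
from statistics import mean, pstdev


def frequency(values):
    """Frequency analysis: count how often each value occurs."""
    return Counter(values)


def bin_values(values, width):
    """Binning: count values per fixed-width bin, keyed by the bin's lower edge."""
    return Counter((v // width) * width for v in values)


def pearson(xs, ys):
    """Correlation: Pearson's r for two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    covariance = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return covariance / (pstdev(xs) * pstdev(ys))


if __name__ == "__main__":
    # Hypothetical task durations (days) and costs, for illustration only
    durations = [3, 7, 12, 18, 25]
    costs = [30, 75, 110, 190, 240]
    print(frequency(["open", "closed", "open", "open"]))
    print(bin_values(durations, 10))
    print(pearson(durations, costs))
```

Each of these is trivial on five values. The labor comes from deciding, for data with no title, table of contents, or index, which values to count, how wide the bins should be, and which pairs are worth correlating.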

This is where the fear and loathing comes into play. It is not as if all information systems do not perform these functions in some manner; it is that in structured data all of this work has been done and, oftentimes, is handled by the database system. But even here there is a better way.

My colleague, Dave Gordon, who has his own blog, will emphasize that the identification of probable questions and configuration of queries in advance combined with the application of standard APIs will garner good results in most cases. Yet, one must be prepared to receive a certain amount of irrelevant information. For example, the query on Google of “Fun Things To Do” that you may use if you are planning for a weekend will yield all sorts of results, such as “50 Fun Things to Do in an Elevator.”  This result includes making farting sounds. The link provides some others, some of which are pretty funny. In writing this blog post, a simple search on Google for “Google query fails” yields what can only be described as a large number of query fails. Furthermore, this approach relies on the data originator to have marked the data with pointers and tags.

Given these different approaches to unstructured data and the complexity involved, there is a decision process to apply:

1. Determine if the data is truly unstructured. If the data is derived from a structured database from an existing application or set of applications, then it is structured and will require domain expertise to inherit the values and information content without expending unnecessary resources and time. A structured, systemic, and repeatable process can then be applied. Oftentimes an industry schema or standard can be leveraged to ensure consistency and fidelity.

2. Determine whether only a portion of the unstructured data is relevant to your business processes and use it to append and enrich the existing structured data that has been used to integrate and expand your capabilities. In most cases the identification of a Rosetta Stone and standard APIs can be used to achieve this result.

3. For the remainder, determine the value of mining the targeted category of unstructured data and perform a business case analysis.
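The three steps above can be sketched as a simple triage function. This is a minimal illustration in Python under stated assumptions: the names `Handling` and `triage` are hypothetical, and reducing the business case analysis in step 3 to a comparison of expected value against mining cost is my own simplification, not a substitute for a real analysis.

```python
from enum import Enum


class Handling(Enum):
    STRUCTURED_ETL = "apply a repeatable, schema-based process"
    ENRICH = "append to and enrich existing structured data"
    MINE = "mine it; the business case supports the cost"
    SKIP = "leave it; the business case does not support the cost"


def triage(from_structured_source, relevant_to_processes, expected_value, mining_cost):
    """Return the recommended handling for one category of data."""
    # Step 1: data derived from a structured application is structured.
    if from_structured_source:
        return Handling.STRUCTURED_ETL
    # Step 2: the relevant portion enriches what is already integrated.
    if relevant_to_processes:
        return Handling.ENRICH
    # Step 3: for the remainder, a (simplified) business case decides.
    if expected_value > mining_cost:
        return Handling.MINE
    return Handling.SKIP


if __name__ == "__main__":
    # Hypothetical categories: (structured source?, relevant?, value, cost)
    print(triage(True, True, 0, 0))
    print(triage(False, True, 0, 0))
    print(triage(False, False, 50_000, 200_000))
```

The ordering matters: the expensive option is only considered after the two cheaper paths have been ruled out.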

Given the rapidly expanding size of data that we can access using the advancing power of new technology, we must be able to distinguish doing what is necessary from doing what is impressive. The definition of Big Data has evolved over time because our hardware, storage, and database systems allow us to access increasingly larger datasets that ten years ago would have been unimaginable. What this means is that–initially–as we work through this process of discovery, we will be bombarded with a plethora of irrelevant statistical measures and so-called predictive analytics that will eventually prove not to pass the “so-what” test. This process places the users in a state of information overload, and we often see this condition today. It also means that what took an army of data scientists and developers to do ten years ago takes a technologist with a laptop and some domain knowledge to perform today. This last can be taught.

The next necessary step, aside from applying the decision process above, is to force our information systems to advance their processing to provide more relevant intelligence that is visualized and configured to the domain expertise required. In this way we will eventually discover the paradox that effectively accessing larger sets of data will yield less, but more relevant, intelligence that can be translated into action.

At the end of the day the manager and user must understand the data. There is no magic in data transformation or data processing. Even with AI and machine learning it is still incumbent upon the people within the organization to be able to apply expertise, perspective, knowledge, and wisdom in the use of information and intelligence.

I Can’t Drive 55 — The New York Times and Moore’s Law

Yesterday the New York Times published an article about Moore’s Law.  It is interesting in that John Markoff, who is the Times science writer, speculates that in about 5 years the computing industry will be “manipulating material as small as atoms” and therefore may hit a wall in what has become a back of the envelope calculation of the multiplicative nature of computing complexity and power in the silicon age.

This article prompted a follow on from Brian Feldman at NY Mag, noting that the Institute of Electrical and Electronics Engineers (IEEE) has anticipated a broader definition of the phenomenon of the accelerating rate of computing power to take into account quantum computing.  Note here that the definition used in this context is the literal one: the doubling of the number of transistors over time that can be placed on a microchip.  That is a correct summation of what Gordon Moore said, but it is not how Moore’s Law is viewed or applied within the tech industry.

Moore’s Law (which is really a rule of thumb or guideline in lieu of an ironclad law) has been used, instead, as an analogue to describe the geometric acceleration that has been seen in computer power over the last 50 years.  As Moore originally described the phenomenon, the doubling of transistors occurred every two years.  Then it was revised later to occur about every 18 months or so, and now it is down to 12 months or less.  Furthermore, aside from increasing transistors, there are many other parallel strategies that engineers have applied to increase speed and performance.  When we combine the observation of Moore’s Law with other principles tied to the physical world, such as Landauer’s Principle and Information Theory, we begin to find a coherence in our observations that is truly tied to physics.  Thus, rather than being a break from Moore’s Law (and the observations of these other principles and theory noted above), quantum computing, to which the articles refer, sits on a continuum with these concepts.
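To see how much the doubling period matters, here is a minimal sketch in Python of the rule of thumb as plain compound doubling: growth = 2 ^ (years / doubling period). The function name `growth_factor` and the ten-year horizon are my own choices for illustration, and the model assumes a fixed doubling period, which real hardware has never strictly followed.

```python
def growth_factor(years, doubling_period_years):
    """Multiplicative growth after `years`, given a fixed doubling period."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # The three doubling periods mentioned above, over one decade
    for months in (24, 18, 12):
        factor = growth_factor(10, months / 12)
        print(f"doubling every {months} months: x{factor:,.0f} in 10 years")
```

Over a decade, a 24-month doubling period gives a factor of 32, an 18-month period gives roughly 100, and a 12-month period gives 1,024. That spread is why the rule is better treated as an analogue for geometric acceleration than as a precise forecast.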

Bottom line: computing, memory, and storage systems are becoming more powerful, faster, and expandable.

Thus, Moore’s Law in terms of computing power looks like this over time:

Moore's Law Chart

Furthermore, when we calculate the cost associated with erasing a bit of memory we begin to approach identifying the Demon* in defying the Second Law of Thermodynamics.

Moore's Law Cost Chart

Note, however, that the Second Law is not really being defied; it is just that we are constantly approaching zero, though never actually achieving it.  But the principle here is that the marginal cost associated with each additional bit of information becomes vanishingly small to the point of not passing the “so what” test, at least in everyday life.  Though, of course, when we get to neural networks and strong AI such differences are very large indeed–akin to mathematics being somewhat accurate when we want to travel from, say, San Francisco to London, but requiring more rigor and fidelity when traveling from Kennedy Space Center to Gale Crater on Mars.
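For a sense of scale, Landauer's Principle puts the theoretical floor on the energy needed to erase one bit at k_B * T * ln 2, where k_B is the Boltzmann constant and T is the absolute temperature. The sketch below, in Python, simply evaluates that formula; the function name and the choice of 300 K as "room temperature" are mine, and the figure is a physical lower bound, not a measurement of any actual chip.

```python
import math

# Boltzmann constant in joules per kelvin (exact by SI definition)
BOLTZMANN_J_PER_K = 1.380649e-23


def landauer_limit_joules(temperature_kelvin):
    """Minimum energy, in joules, to erase one bit at the given temperature."""
    if temperature_kelvin <= 0:
        raise ValueError("temperature must be above absolute zero")
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


if __name__ == "__main__":
    per_bit = landauer_limit_joules(300.0)  # assumed room temperature
    print(f"floor per bit at 300 K: {per_bit:.3e} J")
    print(f"floor per gigabyte erased: {per_bit * 8e9:.3e} J")
```

At 300 K the floor works out to about 2.87e-21 joules per bit. The bound is small but never zero at any real temperature, which is why the Second Law stays intact.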

The challenge, then, in computing is to be able to effectively harness such power.  Our current programming languages and operating environments are only scratching the surface of how to do this, and the joke in the industry is that the speed of software is inversely proportional to the advance in computing power provided by Moore’s Law.  The issue is that our brains, and thus the languages we harness to utilize computational power, are based in an analog understanding of the universe, while the machines we are harnessing are digital.  For now this knowledge can only build bad software and robots, but given our drive into the brave new world of heuristics, may lead us to Skynet and the AI apocalypse if we are not careful–making science fiction, once again, science fact.

Back to present time, however, what this means is that for at least the next decade, we will see an acceleration of the ability to use more and larger sets of data.  The risk, which we seem to have to relearn due to a new generation of techies entering the market who lack a well rounded liberal arts education, is that the basic statistical and scientific rules in the conversion, interpretation, and application of intelligence and information can still be roundly abused and violated.  Bad management, bad decision making, bad leadership, bad mathematics, bad statisticians, specious logic, and plain old common human failings are just made worse, with greater impact on more people, given the misuse of that intelligence and information.

The watchman against these abuses, then, must be incorporated into the solutions that use this intelligence and information.  This is especially critical given the accelerated pace of computing power, and the greater interdependence of human and complex systems that this acceleration creates.

*Maxwell’s Demon

Note:  I’ve defaulted to the Wikipedia definitions of both Landauer’s Principle and Information Theory for the sake of simplicity.  I’ve referenced more detailed work on these concepts in previous posts and invite readers to seek those out in the archives of this blog.

For the Weekend: Music, Data, and Florence + The Machine

Saturdays–and some Sundays–have usually been set aside for music as an interlude from all things data, information technology, and my work in general.  Admittedly, blogging has suffered because of the demands of work and, you know, having a life, especially with family.  But flying back from a series of important meetings that will, no doubt, make up for the lack of blogging in the near future, I settled in finally to listen to Ms. Welch’s latest.

As a fan from the beginning, I have not been impressed with the early singles that were released from her album, How Big, How Blue, How Beautiful.  My reaction to the title song, using a single syllable sound, was “meh.”  Same for the song “What Kind of Man,” which, apparently grasping for some kind of significance, struck me as inarticulate at best and largely muddled.  The message in this case, at least for me, didn’t save the medium.

So I kicked back on the plane after another 12 hour (or so) day and was intent on not giving up on her artistry.  So I listened to the album mostly with eyes closed, but with occasional forays into checking out the beautiful moonlit dome of the sky while traveling over the eastern seaboard with the glittering lights of the houses and towns 35,000 feet below.  (A series of “Supermoon” events are happening).  About four songs in I found myself taken in by what can only be described as another strong song cycle that possesses more subtlety and maturity than the bang-on pyrotechnics of Ceremonials.

The red-headed Celtic Goddess can still drive a tune and a theme that, having experienced one of her concerts in the desert of New Mexico under a cloudless night sky with the expanse of the Milky Way overhead, can become both transcendent and almost spooky, especially as her acolytes dance and sway in the trance state induced by her music.  Thus, I have come to realize that releasing any of her songs on their own from this album is largely a mistake because they cannot hold up as “singles” in the American tradition of Tin Pan Alley–nor even as prog rock.  Listening to the entire album from start to finish gives you the perspective from which you need to assess its artistic merit.  

For me, her lyrics and themes hark back and forth across the dimension of human experience, tying them together and, thus, fusing time in the process, opening up pathways in the mind to an almost elemental suggestion of the essence of existence which is communicated through the beat and expanse of the music.

Therefore, rather than a sample from YouTube, which I usually post at this point, I instead strongly recommend that you give the album a listen.  It’ll keep the band in business making more beautiful music as well.

Before I am accused by some readers of going off the deep end in exhaustion or overstatement in describing the effect of Ms. Welch’s music on me, I would caution that there is a scientific basis for it.  Many other writers and artists have noted the power of music without the need for other stimuli to have this same effect on them, as documented by the recently passed neuroscientist Oliver Sacks.

Proust used music to delve into his inner consciousness to inform his writings.  Tolstoy was so taken by music that he was careful about when and what it was to which he listened since when he immersed himself in it he felt himself to be taken to an altered mental state.  Clinical experience documents that many Parkinson’s and Tourette’s patients are affected–and sometimes coerced–by the power of music into involuntary states.  On the darker side of human experience, it is no coincidence that music is used by oppressive regimes and militaries to coerce, and sometimes manipulate, prisoners and captives.  On the positive side in my own experience, I was able to come to a mathematical solution to a problem in one afternoon by immersing myself fully in John Coltrane’s “A Love Supreme.”

Aside from being an aural experience that stimulates neurobiological systems, underlying music is mathematics, and underlying the mathematics are digital packets of information.  We live in a digital world.  (And, yes, Madonna is a digital girl).  No doubt the larger implications of this view are somewhat controversial (though compelling) in the scientific community, with the questions surrounding it falling under the discipline of digital physics.

But if we view music as information (which at many levels it is) and our minds as the decoders, then the images and states of consciousness that we enter are implicit in the message, with bias introduced by our conscious minds in attempting to provide both structure and coherence.  It is the same with any data.  We can listen to a single song, but find ourselves placing undue emphasis on just one small aspect of the whole, missing out on what is significant.

Our own digital systems approaches are often similar.  When we concentrate on a sliver of information we bias our perspectives.  We see this all the time in business systems and project management.  Sometimes you just have to listen to the whole album, or step up to bigger data.

Note:  The post has been edited from the original to correct grammatical errors and for clarity.


Got to Make the Best of (My information)

While finding some respite from intense op-tempo and some minor physical maladies, I’m easing back into blogging.  To start out I thought that some obvious insight is useful in outlining information and how it applies to all kinds of systems.

The photo below is an example of Anolis carolinensis, also known as the green anole.  It is very common and competes with the introduced Anolis sagrei here in the American southeast.  (Photo thanks to GeorgiaInfo).  It is a complex adaptive biological system.

Green Anole

The genome of this amniote has been sequenced.  It consists of 1.78Gb, that is, Giga base pairs.  Information is everywhere and in everything.  Deriving its significance–tempered by wisdom and humanity–will lead us to the core, essential truth of our universe; including how to live our lives within it.

Over at AITS.org — Maxwell’s Demon: Planning for Obsolescence in Acquisitions

I’ve posted another article at AITS.org’s Blogging Alliance, this one dealing with the issue of software obsolescence and the acquisition strategy that applies given what we know about the nature of software.  I also throw in a little background on information theory and the physical limitations of software as we now know it (virtually none).  As a result, we require a great deal of agility inserted into our acquisition systems for new technologies.  I’ll have a follow up article over there that provides specifics on acquisition planning and strategies.  Random thoughts on various related topics will also appear here.  Blogging has been sporadic of late due to op-tempo but I’ll try to keep things interesting and more frequent.