Big Data and the Repository of Babel

In 1941, the Argentine writer Jorge Luis Borges (1899-1986) published a short story entitled “The Library of Babel.” In the story, Borges imagines a universe, known as the Library, which the narrator describes as made up of adjacent hexagonal rooms.

Each room of the Library is poorly lit, with one side acting as the entrance and exit, and four of the five remaining walls holding bookshelves whose books are arranged in a completely uniform style, though their contents are entirely random.

That randomness means apparent gibberish is mixed in with every coherent book about every conceivable topic ever written, in every known and unknown language, including languages that have not yet evolved. Thus, if one could organize the books, parse their contents, find the common keys that reconcile the different languages of the coherent books, and index and catalogue that knowledge, then one would have access to a complete understanding of the universe, including its future.

But in this universe, the magnitude of the undertaking demoralizes the librarians tasked with the challenge, and every form of human frustration manifests itself as a result of their inability to meet it.

Challenges in the Repository of Babel

It is apparent that Borges’ infinite Library of Babel is a metaphor for the limits of human understanding, though he anticipates many of the data challenges we face in the modern world. Luckily, those of us who must deal with large amounts of data do so at much smaller scales than the information held in that infinite universe. But the repositories of data we have been able to construct are no less incomprehensible or daunting when first approached.

The secret to approaching our own repositories, which contain data from a finite universe of sources (the Repository of Babel), is to understand that there is a lexicon common to each type of information. This concept must go beyond simply identifying and leveraging a common key, which we have been doing for quite some time in relational databases, or finding patterns in data and returning them to the user. I often like to call the latter “Magical Software Systems”: solutions based on the belief that standard access methods will somehow select relevant data and serve it to the user, without added value, normalization, or processing, in some preconfigured format that magically transforms it into information.

It is true that there are cases where juxtaposing related data, originating from different ways of measuring a common element, can lead the user, with additional processing, to turn the insights derived from that juxtaposition into information. But if relying on the user to act as an extension of the “computer” is the best we can do, at least twenty software generations after these methods were first developed, then the big data technology market is truly in trouble.

In the Repository of Babel we have data that is incomprehensible because an otherwise common lexicon has been rendered opaque by the insertion of proprietary terminology and structures. For example, many individuals use scheduling applications to plan and execute work. The principles of scheduling remain the same regardless of the specific software application chosen. Now imagine that you want to manage a portfolio of projects across an organization whose subcontractors have selected different software applications.

The choices in the past were to designate one proprietary scheduling solution as superior to the others, forcing conformity and therefore raising costs and reducing flexibility and openness to innovation; to build a hard-coded data mining solution with its own costs, risks, and maintenance burden; or to transfer data between proprietary solutions with the concomitant risks of incompatibility. In the case of scheduling applications, reliance on proprietary optimization algorithms for critical path calculations will skew the results of the individually reported subcontractor schedules in our example.
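As a rough illustration of the alternative, the sketch below (in Python, with invented vendor export formats and field names used purely as stand-ins) normalizes two hypothetical subcontractor exports into a common task structure and then runs one neutral critical path calculation over the combined schedule, rather than trusting each vendor’s proprietary algorithm.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """A task expressed in a common scheduling lexicon."""
    task_id: str
    duration: int                      # working days
    predecessors: list = field(default_factory=list)

def from_vendor_a(record: dict) -> Task:
    # Hypothetical vendor A export: {"ID": ..., "Dur": ..., "Preds": [...]}
    return Task(record["ID"], record["Dur"], record.get("Preds", []))

def from_vendor_b(record: dict) -> Task:
    # Hypothetical vendor B export: {"activity": ..., "days": ..., "depends_on": [...]}
    return Task(record["activity"], record["days"], record.get("depends_on", []))

def critical_path(tasks: list) -> tuple:
    """Standard forward/backward pass; returns (project duration, critical task ids)."""
    by_id = {t.task_id: t for t in tasks}

    early_finish = {}
    def ef(tid):
        # Earliest finish = latest earliest finish among predecessors + own duration.
        if tid not in early_finish:
            start = max((ef(p) for p in by_id[tid].predecessors), default=0)
            early_finish[tid] = start + by_id[tid].duration
        return early_finish[tid]

    project_duration = max(ef(t.task_id) for t in tasks)

    successors = {t.task_id: [] for t in tasks}
    for t in tasks:
        for p in t.predecessors:
            successors[p].append(t.task_id)

    late_finish = {}
    def lf(tid):
        # Latest finish = earliest "latest start" among successors (or project end).
        if tid not in late_finish:
            late_finish[tid] = min((lf(s) - by_id[s].duration for s in successors[tid]),
                                   default=project_duration)
        return late_finish[tid]

    critical = {t.task_id for t in tasks if lf(t.task_id) - ef(t.task_id) == 0}
    return project_duration, critical

# Two subcontractors, two export formats, one neutral calculation.
schedule = [from_vendor_a({"ID": "A1", "Dur": 5}),
            from_vendor_a({"ID": "A2", "Dur": 3, "Preds": ["A1"]}),
            from_vendor_b({"activity": "B1", "days": 7, "depends_on": ["A1"]})]
print(critical_path(schedule))   # 12 days; A1 and B1 are critical
```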

The Issue

The problem for the user is the loss of transparency and, with it, sometimes the loss of control over the data. This aspect of information economics was first identified by DeLong and Froomkin in their 1997 paper “The Next Economy?” For the authors, the loss of transparency concerned the consumer’s difficulty in determining whether a software product’s functionality and performance met her needs. They found that a significant investment in evaluating the technology was required simply to make this initial determination.

But the lack of transparency has gone far beyond this restricted definition. For business systems in particular, a contentious debate has broken out between software suppliers and consumers over who actually owns the data. During the late 1990s and early 2000s, many companies and organizations were surprised to learn that software license provisions attempted to prevent them from using third-party tools to access their own data, as processed and stored by those systems, even though the systems were self-hosted.

We often see this same model in the deployment of APIs and web services. These methods require that one remain within the publisher’s proprietary sandbox, using only the approved means of data access and integration. The problem, once again, is that the consumer is separated from her own data. The software provider defends its turf by ensuring that the manner in which it processes and stores the consumer’s data is not only incomprehensible but also closed off to all others unless, of course, they pay to play.

For businesses and government organizations, especially the latter in their additional role as honest broker, such incomprehensibility of data not only reduces flexibility in deploying newer and better solutions in the future, but also undermines the organization’s ability to leverage the full value of its virtual corporate knowledge. It reduces organizational effectiveness, communication, and coordination, costing significant amounts of money devoted to having people organize, manipulate, and manage data.

The Solution

The solution is to break down these proprietary barriers and restore title to the owner of the data: the consumer. Our solutions for dealing with Big Data and the Repository of Babel require that we not only be good programmers but also that we understand the nature and use of the data in question, and how to turn it into information. Looking for patterns in data without understanding the relevance of the individual elements, or their relative importance to one another, is as much a fetish in our own time as it was in Borges’ story. Instead, our approach to Big Data requires an understanding of the common lexicon of the discipline that a solution addresses, so that the correct data is selected. Armed with this understanding, we can normalize proprietary terminology and reconcile it with the common lexicon. Our solution can then process that data so that what is delivered to the individual user passes the “So What?” test: it delivers only useful information, in lieu of mindless flat data uninformed by significance.
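A minimal sketch of what this normalization and filtering might look like follows; the vendor field names, mappings, and the zero-float threshold are hypothetical illustrations, not drawn from any particular product.

```python
# Hypothetical mappings from each vendor's proprietary field names to the common lexicon.
COMMON_LEXICON = {
    "vendor_x": {"ACT_ID": "task_id", "TOTAL_FLT": "total_float", "BL_VAR": "variance"},
    "vendor_y": {"taskCode": "task_id", "slack": "total_float", "baselineDelta": "variance"},
}

def normalize(record: dict, vendor: str) -> dict:
    """Translate a proprietary record into common-lexicon terms."""
    mapping = COMMON_LEXICON[vendor]
    return {common: record[proprietary]
            for proprietary, common in mapping.items()
            if proprietary in record}

def passes_so_what_test(record: dict, float_threshold: int = 0) -> bool:
    """Keep only records that carry significance for the user, e.g. tasks that
    have gone critical (no float) or have slipped against the baseline."""
    return record.get("total_float", 1) <= float_threshold or record.get("variance", 0) > 0

raw = [({"ACT_ID": "A100", "TOTAL_FLT": 0, "BL_VAR": 3}, "vendor_x"),
       ({"taskCode": "B200", "slack": 12, "baselineDelta": 0}, "vendor_y")]

report = [r for r in (normalize(rec, v) for rec, v in raw) if passes_so_what_test(r)]
print(report)   # only the critical, slipped task A100 survives the filter
```

The point is not the particular mapping but that the common lexicon, not the vendor, defines the terms in which information is delivered to the user.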

This goes beyond the popular concept of “data transparency,” though it is a part of it. The issue is one of preventing the appropriation, the plunder, of both the commons and organizational knowledge within the private entity, combined and balanced with protecting the intellectual property of inventors, authors, and others within their lifetimes. Given our experience with the recent abuse of ML/AI (such as they are), Cambridge Analytica, X (the social network formerly known as Twitter), Facebook algorithms, and other enforced proprietary structures among pedestrian software application publishers, it is the ethical duty of digital transformation managers to exclude from their source determinations proprietary solutions that deny users the full use of their own data and information. That will immediately influence the accepted guardrails in the market and maintain innovation and competition.

Author note: Most of the content in this post first appeared on AITS.org on January 23, 2015. It has been updated to reflect the current perspective of the author.