Let’s Get Physical — Pondering the Physics of Big Data

Big Data has been a hot topic of late, and I’ll have a longer and less wonky article on it and related topics next week at AITS.org’s Blogging Alliance.  It also concerns the business line in which I engage, and so it is time to sweep away a lot of the foolishness concerning it: what it can do, its value, and its limitations.

As a primer, a useful commentary on the ethical uses of Big Data was published today at Salon.com in an excerpt from Jacob Silverman’s book, Terms of Service: Social Media and the Price of Constant Connection.  Silverman takes a different approach from the one that I outline in my article, but he tackles the economics of new media that were identified by Brad DeLong and A. Michael Froomkin back in the late 1990s and first decade of the 21st century.  This article on First Monday from 2000 regarding speculative microeconomics emerging from new media nicely summarizes their thesis.  Silverman rejects reforming the system in economic terms, entering instead the same ethical terrain on personal data collection that Rebecca Skloot explored regarding the medical profession’s collection and use of genetic tissue from biopsies in her book, The Immortal Life of Henrietta Lacks.

What Silverman’s book does make clear–and which is essential in understanding the issue–is that not all big data is the same.  To our brute-force machines, data is data absent software to distinguish it, since they are not yet conscious in the manner that would pass a Turing test.  Even with software, such machines still cannot pass such a test, though I personally believe that strong AI is inevitable.

Thus, there is Big Data that is swept up–often without deliberate consent by the originator of the data–from the larger pool of society at large by commercial companies that have established themselves as surveillance “statelets” in gathering data from business transactions, social media preferences, and other electronic means.

And there is data that is deliberately stored and, oftentimes shared, among conscious actors for a specific purpose.  These actors are often government agencies, corporations, and related organizations that cooperatively share business information from their internal processes and systems for the purpose of developing predictive systems toward a useful public purpose, oftentimes engaged in joint enterprises toward the development of public goods and services.  It is in this latter domain that I operate.  I like to call this “small” Big Data, since we operate in what can realistically be characterized as closed systems.

Data and computing have a physical and mathematical basis.  For anyone who has studied the history of computing (or has coded) this is a self-evident fact.  But for the larger community of users it appears–especially if one listens to the hype of our industry–that the sky is the limit.  But perhaps that is a good comparison after all, for anyone who has flown in a plane knows that the sky does indeed have limits.  To fly requires a knowledge of gravity, the atmosphere, lift, turbulence, aerodynamics, and propulsion, among other disciplines and sciences.  All of these have their underpinnings in physics and mathematics.

The relevant equation in computing is known as Landauer’s Principle, which sets the minimum energy required to erase one bit of information.  It is as follows:

kT ln 2,

where k is the Boltzmann constant, T is the temperature of the circuit in kelvins, and ln 2 is the natural logarithm of 2.
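To make the magnitude concrete, here is a minimal sketch (my own illustration, not from the original discussion) computing the Landauer limit at an assumed room temperature of 300 K:

```python
import math

# Boltzmann constant in joules per kelvin (CODATA value)
k = 1.380649e-23

# Assume a circuit at room temperature, roughly 300 K
T = 300.0

# Landauer limit: minimum energy that must be dissipated to erase one bit
e_bit = k * T * math.log(2)

print(f"Minimum energy to erase one bit at {T:.0f} K: {e_bit:.3e} J")
```

The result is on the order of 3 × 10⁻²¹ joules per bit, which gives a sense of just how far real hardware sits above the theoretical floor.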

This equation follows from the laws of thermodynamics established earlier in physics.  What it means is that the inherent entropy in a system–its onward inevitable journey toward a state of disorder–cannot be reduced; it can only be expelled from the system.  For Landauer, who worked at IBM in physical computing, entropy is expelled in the form of heat and energy.  For the longest time, given the close correlation and applied proofs of the Principle, this was seen as a physical law, but modern computing seems to be undermining the manner in which entropy is expelled.

Big Data runs up against the physics identified in Landauer’s Principle because heat and energy are not the only ways to expel entropy.  For really Big Data, entropy is expelled by the iron law of Boltzmann’s constant: the calculation of the probable states of disorder in the system.  The larger the system, the larger the number of probable states of disorder, and the more our results in processing such information become a function of probability.  This may or may not matter, depending on the fidelity of the probabilistic methods and their application.
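A rough way to see this scaling (a sketch of my own, using Boltzmann’s entropy formula S = k ln W rather than anything from the original text) is that a system of n binary states has W = 2ⁿ possible microstates, so its entropy grows linearly in the number of bits while the count of possible states grows astronomically:

```python
import math

k = 1.380649e-23  # Boltzmann constant, J/K

def boltzmann_entropy(n_bits: int) -> float:
    """Entropy S = k * ln(W) for a system of n_bits binary states,
    where the number of microstates W = 2**n_bits."""
    # ln(2**n) = n * ln(2), so we never compute the enormous W directly
    return k * n_bits * math.log(2)

# A byte, a megabit, and roughly a terabyte of binary states
for n in (8, 1_000_000, 8_000_000_000_000):
    print(f"{n:>16} bits -> S = {boltzmann_entropy(n):.3e} J/K")
```

The entropy per bit is fixed, but the number of possible states of disorder explodes with system size, which is why very large, uncurated datasets can only be characterized probabilistically.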

For “small” Big Data, the acceptability of variations from the likely outcome is much narrower.  We need to approach being 100% correct, 100% of the time, though small variations are acceptable depending on the type of system.  So, for example, in project management systems, we can be a percent or two off on rolling up data, since accountability is not an issue.  Financial systems compliance is a different matter.

In “small” Big Data, entropy can be expelled by pre-processing the data in the form of effort expended toward standardization, normalization, and rationalization.  Our equation, kT ln 2, is the lower bound; that is, it identifies the minimum entropy that must be expelled in order to process a bit.  In reality we will never reach this lower bound, but we can approach it until the difference between the lower bound of entropy and the “cost” of processing data is vanishingly small.  Once we have expelled entropy by limiting the states of instability in the data, expelling the cost of entropy through the data pipeline, we can then process the data to derive its significance with a high degree of confidence.
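To make the pre-processing idea concrete, here is a minimal sketch (the field values and canonicalization rules are purely illustrative, not drawn from any actual system) showing how standardization collapses the number of distinct states a downstream pipeline has to handle:

```python
# Raw values as they might arrive from different source systems
raw_contractors = ["ACME Corp.", "acme corp", "ACME CORP", "Acme Corporation",
                   "BetaWorks", "betaworks ", "Beta Works"]

# Illustrative canonicalization map; in practice this would come from a
# master-data or reference table agreed on by the sharing parties
CANONICAL = {
    "acme corp": "ACME Corp",
    "acme corporation": "ACME Corp",
    "betaworks": "BetaWorks",
    "beta works": "BetaWorks",
}

def normalize(value: str) -> str:
    """Standardize whitespace, punctuation, and case, then map the
    result to a canonical name when one is known."""
    key = value.strip().rstrip(".").lower()
    return CANONICAL.get(key, key)

normalized = {normalize(v) for v in raw_contractors}
print(f"{len(set(raw_contractors))} raw states -> {len(normalized)} canonical states")
```

Seven raw variants reduce to two canonical ones: each state of instability removed up front is entropy the pipeline no longer has to expel probabilistically.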

But this is only the start.  For once “small” Big Data undergoes a process to ensure its fidelity, the same pattern recognition algorithms used in Big Data can be applied, but to more powerful and credible effect.  Early warning “signatures” of project performance can be collected and applied to provide decision-makers with information early enough to affect the outcome of efforts before risk is fully manifested, with the calculated probabilities of cost, schedule, and technical impacts possessing a higher level of certainty.
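As a sketch of what such an early-warning “signature” check might look like (the index, threshold, and window are my own illustration, not a published method), earned-value measures such as the cost performance index can be trended and flagged before the risk fully manifests:

```python
from statistics import mean

def cost_performance_index(ev: float, ac: float) -> float:
    """CPI = earned value / actual cost; below 1.0 indicates cost overrun."""
    return ev / ac

def early_warning(cpi_history: list[float], threshold: float = 0.95) -> bool:
    """Flag a project when its recent CPI trend sits below a threshold.
    A simple moving-average rule; a real system would fit probability
    distributions to historical performance signatures instead."""
    recent = cpi_history[-3:]  # last three reporting periods
    return mean(recent) < threshold

# Illustrative monthly CPI series for one project, drifting downward
history = [1.02, 0.99, 0.96, 0.93, 0.91]
print("Flag for management attention:", early_warning(history))
```

The point is that once the underlying data is trustworthy, even a simple trend rule fires early enough for decision-makers to act, and more sophisticated probabilistic signatures only sharpen that advantage.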