Sixteen Tons — Data Mining, Big Data, and the Asymmetry of Variables and Observations

Last Thursday I came upon what I can only interpret as an ironic comment at Mark Thoma’s Economist’s View blog entitled “Data Mining Can be a Productive Activity.”  I followed the link to a VOX article by Castle and Hendry entitled “Data mining with more variables than observations.”  After reading the opening line — “A typical Oxford University econometrics exam question might take the form: ‘Data mining is bad, so mining with more candidate variables than observations must be pernicious. Discuss.'” — all I could think was: are these people serious?

Data mining is a general term in high tech, not one specific approach to finding patterns and trends in large bodies of data.  The authors — and I’m guessing they are not alone in the econometrics profession — seem to be taking a “Just Say No” approach to what most of us who deal in statistical analysis and modeling of large datasets do every day, largely because it involves those scary things called computers running that mysterious thing behind the scenes called “code.”  Who knows what horrors may await us as we mistakenly draw causation from correlation by any means more sophisticated than an Access database or an Excel spreadsheet?  It seems that Oxford dons need to get out more.

Microeconomic data mining has been in general use for quite some time across many businesses and business disciplines, with a great deal of confidence and success (enough success in the medical insurance, financial services, and social networking fields to raise legal and ethical objections).  So an assertion that seems to reflect the views of a single group of econometricians does seem odd.  In the end the article reads like a setup for a proprietary set of calculations, placed within an Excel spreadsheet, given the name “Autometrics.”  That largely amounts to an argument for the proper organization of data rather than a criticism of data mining in general.

The discriminators among data mining and data mining-like technologies involve purpose, cost, ease of use, scalability, and sustainability.  New technologies arise every year that allow for increased speed, more efficient grouping, and compression, letting organizations handle what was previously thought to be “big data.”  Thus the concepts of data mining and big data are shifting ones, as our technologies drive toward greater capability in integrating and interpreting large datasets.  The authors cite the use of large datasets to demonstrate anthropogenic global warming as one of the success stories of large-scale modeling built on large data.  Implicit in that acknowledgment is the point that not every variable needs to be included in a model — only the relevant variables that drive and explain the results being produced.  There is no doubt that reification of statistical results is a common fallacy, but people had been committing it long before the development of silicon-based artificial intelligence.
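That point — that a sound model keeps only the relevant variables, even when candidates outnumber observations — is exactly what modern sparse-selection methods automate.  As a minimal sketch (using scikit-learn's lasso on synthetic data, not the Autometrics procedure the VOX authors describe), here is a penalized regression picking three true drivers out of 200 candidate variables from only 50 observations:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                        # far more candidate variables than observations
X = rng.normal(size=(n, p))

beta = np.zeros(p)                    # only three variables actually drive the outcome
beta[[0, 1, 2]] = [3.0, -2.0, 1.5]
y = X @ beta + rng.normal(scale=0.5, size=n)

# The L1 penalty shrinks irrelevant coefficients to exactly zero,
# leaving a sparse model even though p >> n.
model = Lasso(alpha=0.5).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("variables kept:", selected)
```

The penalty strength (`alpha=0.5` here) is the analyst's judgment call: too small and spurious correlations survive, too large and genuine drivers are discarded — which is the real content of the "data mining is dangerous" warning.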

There is no doubt that someday we will reach the limit of our computational capabilities.  But as someone who lived through the nonexistent “crisis” of limited memory in the early ’90s, followed not long after by the bogus Y2K “bug,” I am not quite ready to throw in the towel on the ability of data mining and modeling to provide effective tools for the more general discipline of econometrics.  We are only beginning to crack heuristic models for approaching big data, and are on the cusp of strong AI.


One thought on “Sixteen Tons — Data Mining, Big Data, and the Asymmetry of Variables and Observations”

  1. The hard part of Big Data is monetizing the results of the analysis. It’s not uncommon to get critical insights from Big Data, or even the canned reports in your HR system, that the leadership team simply refuses to act on. They’d rather “go with their gut.” Which, in many cases, is located just above their head.

    Heinlein to the contrary, those in power don’t GAS about facts. They care about looking like they’re in control, especially when they aren’t. “Proving” anthropogenic causes for global warming is a classic example of a failure to influence with facts. The respective Democratic and Republican positions on the matter have nothing to do with data; they are staked out purely as opposing viewpoints. Faith-based governance is simply one more reason why the United States is following Radio Shack down the chute toward oblivion.

