Sixteen Tons: Data Mining, Big Data, and the Asymmetry of Variables and Observations

Last Thursday I came across what I can only interpret as an ironic post at Mark Thoma’s Economist’s View blog entitled “Data Mining Can be a Productive Activity.”  The link led to a VOX article by Castle and Hendry entitled “Data mining with more variables than observations.”  Its opening line reads: “A typical Oxford University econometrics exam question might take the form: ‘Data mining is bad, so mining with more candidate variables than observations must be pernicious. Discuss.’”  All I could think was: are these people serious?

Data mining is a general term in high tech, not one specific approach to finding patterns and trends in large datasets.  The authors (and I’m guessing they are not alone in the econometrics profession) seem to be taking a “Just Say No” approach to what most of us who deal in statistical analysis and modeling of large datasets do every day, largely because it involves those scary things called computers running that mysterious thing behind the scenes called “code.”  Who knows what horrors may await us as we mistakenly infer causation from correlation using anything more than an Access or Excel spreadsheet?  It seems that Oxford dons need to get out more.
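To make the point concrete, here is a minimal sketch of the sort of everyday pattern-finding I am describing.  The tools (Python, pandas, NumPy) and the synthetic data are my own illustration, not anything drawn from the VOX article:

```python
import numpy as np
import pandas as pd

# Synthetic wide table: 1,000 observations of 20 candidate variables.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 20)),
                  columns=[f"x{i}" for i in range(20)])
# Plant one real relationship for the screen to find.
df["x5"] = 0.8 * df["x3"] + 0.2 * rng.standard_normal(1000)

# Rank variable pairs by absolute correlation.  A pattern surfaced this
# way is a correlation, not a causation; that inference is on us.
corr = df.corr().abs()
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
print(corr.where(upper).stack().sort_values(ascending=False).head(5))
```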

Microeconomic data mining has been in general use for quite some time across many businesses and business disciplines, with a great deal of confidence and success (so much success in the medical insurance, financial services, and social networking fields that it has raised legal and ethical objections).  So an assertion that seems to rest on the views of a single group of econometricians does seem odd.  In the end the article reads as a setup for a proprietary model-selection routine given the name “Autometrics.”  This largely argues for the proper organization of data rather than amounting to a criticism of data mining in general.

The discriminators among data mining and related technologies are purpose, cost, ease of use, scalability, and sustainability.  New technologies arise every year that deliver increased speed, more efficient grouping, and better compression, allowing organizations to handle what was previously thought to be “big data.”  Thus the concepts of data mining and big data are moving targets as our technologies drive toward greater capability in integrating and interpreting large datasets.  The authors cite the use of large datasets to demonstrate anthropogenic global warming as one of the success stories of large-scale modeling based on large data.  Implicit in acknowledging this is that not every variable needs to be included in a model, only the relevant variables that drive and explain the results being produced; a sketch of that idea follows below.  There is no doubt that reification of statistical results is a common fallacy, but people were committing it long before the development of silicon-based artificial intelligence.
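Autometrics itself is proprietary, but the underlying idea, selecting a handful of relevant variables from a much larger pool of candidates, is easy to demonstrate.  Here is a stand-in sketch using the lasso from scikit-learn; this is my substitution for illustration, not the Castle and Hendry algorithm, and the data are synthetic:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# More candidate variables (200) than observations (50).
rng = np.random.default_rng(42)
n_obs, n_vars = 50, 200
X = rng.standard_normal((n_obs, n_vars))

# Only 3 of the 200 candidates actually drive the outcome.
beta = np.zeros(n_vars)
beta[[10, 50, 120]] = [3.0, -2.0, 1.5]
y = X @ beta + 0.5 * rng.standard_normal(n_obs)

# Cross-validated lasso shrinks irrelevant coefficients to zero.
model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(f"Kept {selected.size} of {n_vars} candidates: {selected}")
```

The estimator discards most of the 200 candidates on its own, which is the point: more candidate variables than observations poses a selection problem, not a prohibition.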

There is no doubt that someday we will reach the limit of computational capabilities.  But for someone who lived through the nonexistent “crisis” of limited memory in the early ’90s, followed not long after by the bogus Y2K “bug,” I am not quite ready to throw in the towel on the ability of data mining and modeling to provide effective tools for the more general discipline of econometrics.  We are only beginning to crack heuristic models for approaching big data, and we are on the cusp of strong AI.