Big Time — Elements of Data Size in Scaling

I’ve run into additional questions about scalability.  It is important to understand the concept in terms of assessing software against data size, since there are several distinct ways of approaching the issue.

Unlike situations where data is already sorted and structured as part of the core functionality of the software service being provided, here we are dealing with an environment in which many third-party software “tools” put data into proprietary silos.  These act as barriers to optimizing data use and gaining corporate intelligence.  The goal here is to apply, in real terms, the concept that the customers generating the data (or the stakeholders who pay for it) own the data and should have full use of it across domains.  In project management and corporate governance this is an essential capability.

For run-of-the-mill software “tools” that are focused on solving one problem, this often is interpreted as just selling a lot more licenses to a big organization.  “Sure we can scale!” is code for “Sure, I’ll sell you more licenses!”  They can reasonably make this assertion, particularly in client-server or web environments, where they can point to the ability of the database system on which they store data to scale.  This also comes with the usually unstated constraint that their solution rests on a proprietary database structure.  Such responses, though, sidestep the question actually being asked.  Thus, it is important for those acquiring software to understand the subtleties.

A review of what makes data big in the first place is in order.  The basic definition, which I outlined previously, came from NASA in describing data that could not be held in local memory or local storage.  Hardware capability, however, continues to grow exponentially, so that what is big data today is not big data tomorrow.  In handling big data, it then becomes incumbent on software publishers to drive performance so that their customers can take advantage of the knowledge contained in these larger data sets.

The elements that determine the size of data are:

a.  Table size

b.  Row or Record size

c.  Field size

d.  Rows per table

e.  Columns per table

f.  Indexes per table

Note the interrelationships of these elements in determining size.  Recently, for example, I was asked how many records are in the largest tables accessed by a piece of software.  That is fine as shorthand, but the other elements also add to the size of the data being accessed.  A dataset of, say, 800K records may be just as “big” as one containing 2M records because of the field sizes and the number of columns and indexes per table, not just the record count.  Furthermore, the question didn’t take into account the entire breadth of data across all tables.
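To make that interplay concrete, here is a rough back-of-envelope sketch in Python.  All of the byte counts, column counts, and the index overhead factor are assumptions chosen purely for illustration, not benchmarks of any particular database.

```python
# Rough, illustrative sizing arithmetic (all figures are assumptions for the example):
# a nominally "smaller" table can outweigh a "larger" one once row width and
# index count are taken into account.

def table_size_bytes(rows, avg_field_bytes, columns, indexes, index_overhead=0.15):
    """Estimate raw table size: rows x row width, plus a per-index overhead factor."""
    row_bytes = avg_field_bytes * columns
    data_bytes = rows * row_bytes
    index_bytes = data_bytes * index_overhead * indexes
    return data_bytes + index_bytes

# 800K wide, heavily indexed records...
wide_table = table_size_bytes(rows=800_000, avg_field_bytes=120, columns=60, indexes=8)
# ...versus 2M narrow, lightly indexed records.
narrow_table = table_size_bytes(rows=2_000_000, avg_field_bytes=40, columns=15, indexes=2)

print(f"800K wide rows : {wide_table / 1e9:.2f} GB")
print(f"2M narrow rows : {narrow_table / 1e9:.2f} GB")
```

Under these assumed numbers the 800K-record table comes out several times larger than the 2M-record one, which is exactly the point: record count alone is not data size.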

Understanding the definition of data size then leads us to understanding the nature of software scaling.  There are two aspects to this.

The first is the software’s ability to presort the data against the database in such a way as to ensure that latency (the delay in performance when the data is loaded) is minimized.  The principles applied here go back to database management practices from the days when organizations hired teams of data scientists to rationalize data that was processed in machine language, especially when it was stored in ASCII or, for those who really want to date themselves, EBCDIC, formats that look incoherent next to today’s generation of more human-readable ones.

Quite simply, the basic steps applied have been to identify the syntax, translate it, find its equivalents, and then sort the data into logical categories that leverage database pointers.  What you don’t want the software to do is what used to be done in the earliest days of dealing with data (which was smaller by today’s standards): serially querying every data element in order to fetch only what the user is asking for.  Furthermore, it also doesn’t make much sense to treat all data as a Repository of Babel and apply labor-intensive data mining in non-relational databases, especially in cases where the data is well understood and fairly well structured, even if the structure is proprietary.  If we do business in a vertical where industry data standards apply, as in the use of the UN/CEFACT XML convention, then much of the presorting has been done for us.  In addition, more powerful industry APIs (like OLE DB and ODBC) that utilize middleware (web services, XML, SOAP, MapReduce, etc.) multiply the presorting capabilities of software, providing significant performance improvements in accessing big data.
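As a minimal sketch of that presorting sequence (identify the syntax, translate it, find its equivalents, sort into logical categories), consider the Python fragment below.  The proprietary field names and the canonical mapping are hypothetical, invented only for the example.

```python
# Minimal sketch of the presorting idea: translate proprietary field labels into
# canonical equivalents, then bucket records into logical categories before they
# are loaded, so downstream queries never have to scan every element serially.
# The field names and categories here are hypothetical, not from any real tool.

from collections import defaultdict

# "Identify the syntax, translate it, find its equivalents..."
CANONICAL_FIELDS = {
    "proj_id": "project_id", "ProjectID": "project_id", "PRJ_NO": "project_id",
    "bcws": "planned_value", "BCWS": "planned_value",
    "bcwp": "earned_value",  "BCWP": "earned_value",
}

def translate(record: dict) -> dict:
    """Rename proprietary keys to their canonical equivalents; keep the rest as-is."""
    return {CANONICAL_FIELDS.get(k, k): v for k, v in record.items()}

def presort(records: list) -> dict:
    """'...then sort that data into logical categories' keyed for direct lookup."""
    buckets = defaultdict(list)
    for rec in records:
        rec = translate(rec)
        buckets[rec.get("project_id", "unassigned")].append(rec)
    return buckets

raw = [
    {"PRJ_NO": "A-100", "BCWS": 1200.0, "BCWP": 950.0},
    {"proj_id": "B-200", "bcws": 300.0, "bcwp": 310.0},
]
by_project = presort(raw)
print(by_project["A-100"])
```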

The other aspect is the software’s ability to understand the limitations of data communications hardware.  This is a real problem, because the backbone of corporate communication systems, especially where security must be ensured, is still largely run over a wire.  The investments in these backbones are usually categorized as capital investments, so upgrades to the system are slow.  Furthermore, backbone systems are often embedded in physical plant and building structures.  So software performance is limited by the resistance and bandwidth of the wiring.  Thus, we live in a world where hardware storage and processing are doubling every 12-18 months, and software is designed to better leverage that expansion, but the wires over which data communication depends remain stuck in the past, constrained by the basic physics of CAT 6 or fiber optic cabling.
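A bit of back-of-envelope arithmetic shows why the wire matters.  The dataset size, link speeds, and efficiency factor below are assumptions chosen only to illustrate the order of magnitude.

```python
# Back-of-envelope transfer times over a fixed backbone (illustrative numbers only).

def transfer_seconds(gigabytes: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Seconds to move a dataset over a link, allowing for protocol overhead."""
    bits = gigabytes * 8e9
    return bits / (link_gbps * 1e9 * efficiency)

dataset_gb = 50  # assumed working set
for label, gbps in [("CAT 6 (1 Gbps)", 1.0), ("10 Gbps fiber", 10.0)]:
    print(f"{label}: {transfer_seconds(dataset_gb, gbps) / 60:.1f} minutes")
```

However fast the storage and processors at either end become, that transfer time is the floor any chatty client-to-database design has to live with.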

Needless to say, software manufacturers who rely on constant communication with the database will see significantly degraded performance.  Some software publishers who still rely on this model use a “check out” system, treating data like a lending library in which only one user, or a limited number of users, can access the same data at once.  This, of course, reduces customer flexibility.  Strategies that handle data in more discrete pieces are the needed response until day-to-day software communications can reap the benefits of physical advancements in this category of hardware.  Furthermore, organizations must understand that the big Cloud in the sky is not the answer, since it is constrained by the same physics as the rest of the universe, and with greater security risks.
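One hedged sketch of what a more discrete access pattern can look like, assuming an ODBC source reached through the pyodbc library: issue a single set-based query and pull the results back in batches, rather than a chatty loop of one round trip per record.  The DSN, table, and column names are placeholders for illustration.

```python
# Sketch of a more "discrete" access pattern: one set-based query fetched in
# batches, instead of a chatty loop of per-record round trips over the wire.
# The DSN, table, and column names are placeholders for illustration.
import pyodbc

conn = pyodbc.connect("DSN=corporate_dw")  # hypothetical data source name
cursor = conn.cursor()

# One round trip to define the result set...
cursor.execute(
    "SELECT task_id, status, finish_date FROM schedule_tasks WHERE project_id = ?",
    ("A-100",),
)

# ...then pull it back in chunks sized for the network, not row by row.
while True:
    batch = cursor.fetchmany(5000)
    if not batch:
        break
    for task_id, status, finish_date in batch:
        pass  # hand each batch to the local cache or analytic layer

conn.close()
```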

All of this leads me to a discussion I had with a colleague recently.  He opened his iPhone and ran a query in iTunes for an album.  In about a second or so his query selected the artist and gave a list of albums, all done without a wired connection.  “Why can’t we have this in our industry?” he asked.  Why indeed?  Well, first off, Apple has sorted the iTunes data to optimize performance with its app, and the query runs within a standard data stream optimized for Apple’s iOS and hardware.  Secondly, the variables of possible queries in iTunes are predefined and tied to a limited, internally well-defined set of data.  Thus, the data and application challenges are not equivalent to those found in my friend’s industry vertical.  For example, aside from the challenges of third-party data normalization and rationalization, iTunes is not dealing with dynamic, time-phased, or trending data that requires multiple user updates to reflect changes using predictive analytics, which is then served to different classes of users in a highly secure environment.  Finally, and most significantly, Apple spent more on that system than the annual budget of my friend’s organization.  In the end his question was a good one, but in discussing it, it became apparent that just saying “give me this” is a form of magical thinking and hand waving.  The devil is in the details, though I am confident that we will eventually get to an equivalent capability.

At the end of the day, IT project management strategy must take into account the specific needs of classes of users when making determinations about scaling.  What this means is a segmented approach: thick-client users, compartmentalized local installs with subsets of data, thin clients, and web/mobile or terminal-services equivalents.  The practical solution is still an engineered one that breaks the elephant into digestible pieces while leaning forward to leverage advances in hardware, database, and software operating environments.  These are the essential building blocks of data optimization and scaling.
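As a rough illustration of what such segmentation might look like on paper, here is a small Python sketch; the user classes, deployment channels, and data scopes are assumptions for the example rather than a prescription.

```python
# Illustrative sketch of the segmented approach: each class of user gets a
# deployment channel and a data scope sized to its needs. Classes and scopes
# here are assumptions for the example, not a prescription.
from dataclasses import dataclass

@dataclass
class UserSegment:
    name: str
    deployment: str   # thick client, local install, thin client, web/mobile
    data_scope: str   # how much of the repository the segment actually touches

SEGMENTS = [
    UserSegment("analysts",        "thick client",            "full repository"),
    UserSegment("field offices",   "compartmentalized local", "project-level subset"),
    UserSegment("program reviews", "thin client",             "rolled-up summaries"),
    UserSegment("executives",      "web / mobile",            "dashboards and KPIs"),
]

for seg in SEGMENTS:
    print(f"{seg.name:>15}: {seg.deployment:<25} -> {seg.data_scope}")
```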

Do You Believe in Magic? — Big Data, Buzz Phrases, and Keeping Feet Planted Firmly on the Ground

My alternative title for this post was “Money for Nothing,” which is along the same lines.  I have been engaged in discussions regarding Big Data, which has become a bit of a buzz phrase of late in both business and government.  Under the current drive to maximize the value of existing data, every data source, stream, lake, and repository (and the list goes on) has been subsumed by this concept.  So, at the risk of being a killjoy, let me point out that not all large collections of data are “Big Data.”  Furthermore, once a category of data gets tagged as Big Data, people seem to depart further from the world of reality in determining how to approach and use it.  So for those of you who find yourselves in this situation, let’s take a collective deep breath and engage our critical thinking skills.

So what exactly is Big Data?  Quite simply, as noted in this Forbes article by Gil Press, the term is a relative one, but it generally means, per a McKinsey study, “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”  This subjective definition is a purposeful one, since Moore’s Law tends to change what is viewed as simply digital data as opposed to big data.  I would add some characteristics to assist in defining the term based on present challenges.  Big data, at first approach, tends to be unstructured, variable in format, and not adherent to a schema.  Thus, not only is size a criterion for the definition, but so is the chaotic nature of the data that makes it hard to approach.  For once we find a standard means of normalizing, rationalizing, or converting digital data, it is no longer beyond the ability of standard database tools to use effectively.  Furthermore, the very process of taming it renders it non-big data, or perhaps, if it is an exceedingly large dataset, “small big data.”

Thus, having defined our terms and the attributes of the challenge we are engaging, we can now eliminate many of the suppositions floating around in organizations.  For example, there is a meme I have come upon that asserts that disparate application file data can simply be broken down into its elements and placed into database tables for easy access by analytical solutions to derive useful metrics.  This is true in some ways, but it is both wrong and dangerous in its apparent simplicity, for there are many steps missing from this process.

Let’s take the least complex case: structured data submitted as proprietary files.  On its surface this is an easy challenge to solve.  Once someone begins breaking the data into its constituent parts, however, greater complexity emerges, since the indexing inherent to the data’s interrelationships and structures is necessary for its effective use.  Furthermore, there will be corruption and non-standard use of user-defined and custom fields, especially in data that has not undergone domain scrutiny.  The originating third-party software is pre-wired to extract this data properly.  But aside from having to use and learn multiple proprietary applications, with their concomitant idiosyncrasies, sustainability issues, and overhead, such a multivariate approach defeats the goal of establishing a data repository in the first place by keeping the data in silos and preventing integration.  The indexing across, say, financial systems and planning systems is different.  So how do we solve this issue?
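To illustrate why the indexing matters, here is a minimal sketch using Python’s built-in sqlite3 module (chosen only so the example is self-contained); the tables and fields are hypothetical.  The point is that file data loaded into a repository should carry its keys and indexes with it rather than land in one flat table.

```python
# Sketch of the point about indexing: rather than dumping file data into one flat
# table, preserve the interrelationships as keys and indexes when it is loaded.
# sqlite3 is used only so the example is self-contained; names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE projects (
        project_id TEXT PRIMARY KEY,
        name       TEXT
    )""")
conn.execute("""
    CREATE TABLE tasks (
        task_id    TEXT PRIMARY KEY,
        project_id TEXT NOT NULL REFERENCES projects(project_id),
        finish     TEXT
    )""")
# The index carries the relationship the proprietary file format knew implicitly.
conn.execute("CREATE INDEX idx_tasks_project ON tasks(project_id)")

conn.execute("INSERT INTO projects VALUES ('A-100', 'Ground Station Upgrade')")
conn.execute("INSERT INTO tasks VALUES ('T-1', 'A-100', '2015-06-30')")

# Cross-table questions now resolve through keys instead of repeated file parsing.
rows = conn.execute("""
    SELECT p.name, t.task_id, t.finish
    FROM tasks t JOIN projects p ON p.project_id = t.project_id
""").fetchall()
print(rows)
conn.close()
```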

In approaching big data, or small big data, or datasets from disparate sources, the core concept for realizing return on investment and finding new insights is known as Knowledge Discovery in Databases, or KDD.  This was all the rage about 20 years ago, but its tenets are solid and proven and have evolved with advances in technology.  Back then, the means of achieving KDD from existing databases was data mining.

The necessary first step in the data mining approach is pre-processing of the data.  That is, once you get the data into tables, it is all flat.  Every piece of data is the same; it is all noise.  We must add significance and structure to that data.  Keep in mind that we live in this universe, so there is a cost to every effort, known as entropy.  Computing is as close as you’ll get to defeating entropy, but only because it shifts the burden somewhere else.  For large datasets that burden is pushed to pre-processing, whether manual or automated.  In the brute-force world of data mining, we hire data scientists to pre-process the data, find commonalities, and index it.  So let’s review this “automated” process: we take a lot of data and then add a labor-intensive manual effort to it in order to derive KDD.  Hmmm.  There may be ROI there, or there may not be.

But twenty years is a long time, and we do have alternatives, especially in Fourth Generation software that is focused on data usage without the limitations of hard-coded “tools.”  These alternatives apply when using data in existing databases, even disparate databases, or file data structured under a schema with well-defined data exchange instructions that allow the data to be posted to database tables in a consistent manner.  The approach in this case is to use APIs.  An API such as OLE DB, or the older ODBC, can be used to read and leverage the relative indexing of the data.  It will still require some code to point it to the right place and “tell” the solution how to use and structure the data and its interrelationship to everything else.  But at least we have a means of reducing the cost associated with pre-processing.  Note that we are, in effect, still pre-processing data; we just let the CPU do the grunt work for us, often very quickly, while keeping control over the decision of relative significance.
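Here is a hedged sketch of that API-driven approach, again assuming an ODBC source reachable through the pyodbc library; the connection string is a placeholder, and the final mapping step stands in for the human decision about relative significance.

```python
# Sketch of letting the API "do the grunt work" of pre-processing: read the
# source's own table and column metadata through ODBC, then decide in code how
# it maps to the target structure. Connection string and names are placeholders.
import pyodbc

conn = pyodbc.connect("DSN=third_party_tool")  # hypothetical source
cursor = conn.cursor()

# Discover what the source exposes instead of hand-cataloguing it.
tables = [row.table_name for row in cursor.tables(tableType="TABLE")]

schema = {}
for table in tables:
    schema[table] = [(col.column_name, col.type_name) for col in cursor.columns(table=table)]

# The remaining human decision: which source columns map to which canonical
# fields (the "relative significance" referred to above).
print(schema)
conn.close()
```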

So now let’s take the meme I described above and add greater complexity to it.  You have all kinds of data coming into the stream in all kinds of formats: specialized XML, open data, black-boxed data, and closed proprietary files.  This data is unstructured.  It is then processed and “dumped” into a non-relational (NoSQL) database.  How do we approach this data?  The answer has been to return to a hybrid of pre-processing, data mining, and the use of APIs.  But note that there is no silver bullet here.  These efforts are long-term and extremely labor-intensive at this point.  There is no magic.  I have heard the question from decision makers time and again: “Why can’t we just dump the data into a database to solve all our problems?”  No, you can’t, unless you’re ready for a significant programmatic investment in data scientists, database engineers, and other IT personnel.  And in the end, what they deploy, when it finally gets deployed, may very well be obsolete and have wasted a good deal of money.
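As a sketch of what that hybrid can look like in practice, the following Python fragment routes incoming payloads by detected format: schema-conformant XML and open JSON go to structured handlers, while opaque proprietary blobs fall back to the labor-intensive mining queue.  The classification rules and handler names are deliberately simplistic and purely illustrative.

```python
# Sketch of the hybrid approach: detect what kind of data arrived, route the
# well-structured material to schema or API handlers, and queue the genuinely
# black-boxed material for labor-intensive mining. All names are illustrative.
import json
import xml.etree.ElementTree as ET

def classify(payload: bytes) -> str:
    text = payload.lstrip()
    if text.startswith(b"<"):
        return "xml"          # e.g., a UN/CEFACT-style schema document
    if text.startswith(b"{") or text.startswith(b"["):
        return "json"
    return "opaque"           # closed / proprietary: falls back to mining

def route(payload: bytes):
    kind = classify(payload)
    if kind == "xml":
        return ("schema_parser", ET.fromstring(payload))
    if kind == "json":
        return ("api_loader", json.loads(payload))
    return ("mining_queue", payload)  # no silver bullet for this bucket

print(route(b'{"task_id": "T-1", "status": "complete"}')[0])
print(route(b"\x00\x17proprietary-binary-blob")[0])
```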

So, once again, what are the proper alternatives?  In my experience we need to get back to first principles.  Each business and industry has commonalities that transcend proprietary software limitations by virtue of the professions and disciplines that comprise them.  Thus, it is domain expertise specific to the business that drives the solution.  For example, in program and project management (you knew I was going to come back there) a schedule is a schedule, EVM is EVM, and financial management is financial management.

Software manufacturers will, apart from issues regarding relative ease of use, scalability, flexibility, and functionality, attempt to defend their space by establishing proprietary lexicons and data structures.  Not being open, while it does not serve the needs of customers, helps incumbents avoid disruption from new entrants.  But there often comes a time when it becomes apparent that these proprietary definitions are only euphemisms for a well-understood concept in a discipline or profession.  Cat = Feline.  Dog = Canine.

For a cohesive and well-defined industry, the solution is to make all data within particular domains open.  This is accomplished through the acceptance and establishment of a standard schema.  For less cohesive industries, where the data or the incumbents (through the use of common principles) have essentially created a de facto schema, APIs are the way to extract this data for use in analytics.  This approach has been applied on a broader basis for the incorporation of machine data and signatures in social networks.  For closed or black-boxed data, the business or industry will need to perform a gap analysis in order to decide whether database access to such legacy data is truly essential to its business, or whether specifying a more open standard from “time-now” forward will eventually work the suboptimization out of the data.

Most important of all, in the end our results must provide metrics and visualizations that can be understood, and that are valid, important, material, and right.