I’ve run into additional questions about scalability. It is important to understand the concept in terms of assessing software against data size, since there are various aspects to approaching the issue.
Unlike situations where data is already sorted and structured as part of the core functionality of the software service being provided, here we are dealing with an environment in which many third-party software “tools” put data into proprietary silos. These silos act as barriers to optimizing data use and gaining corporate intelligence. The goal here is to apply in real terms the concept that the customers generating the data (or the stakeholders who pay for it) own the data and should have full use of it across domains. In project management and corporate governance this is an essential capability.
For run-of-the-mill software “tools” that are focused on solving one problem, scaling often is interpreted as just selling a lot more licenses to a big organization. “Sure we can scale!” is code for “Sure, I’ll sell you more licenses!” They can reasonably make this assertion, particularly in client-server or web environments, where they can point to the ability of the database system on which they store data to scale. This usually comes with an unstated additional constraint: their solution rests on a proprietary database structure. Such responses, though, sidestep the question, and they do not answer the question actually being asked. Thus, it is important for those acquiring software to understand the subtleties.
A review of what makes data big in the first place is in order. The basic definition, which I outlined previously, came from NASA in describing data that could not be held in local memory or local storage. Hardware capability, however, continues to grow exponentially, so that what is big data today is not big data tomorrow. But in handling big data, it then becomes incumbent on software publishers to drive performance to allow their customers to take advantage of the knowledge contained in these larger data sets.
The elements that determine the size of data are:
a. Table size
b. Row or Record size
c. Field size
d. Rows per table
e. Columns per table
f. Indexes per table
Note the interrelationships of these elements in determining size. Recently, for example, I was asked how many records are in the largest tables accessed by a piece of software. That is fine as shorthand, but the other elements also add to the size of the data being accessed. A data set of, say, 800K records may be just as “big” as one containing 2M records because of field sizes and the numbers of columns and indexes, not just the record count. Furthermore, the question didn’t take into account the breadth of data across all tables.
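The interplay of these elements can be sketched as a back-of-the-envelope size estimate. The function and all figures below are hypothetical, purely to illustrate how a table with fewer records can still be the larger one:

```python
# Back-of-the-envelope estimate of table size. All figures are hypothetical,
# chosen only to show how record count alone understates data size.

def table_size_bytes(rows, avg_row_bytes, indexes, avg_index_entry_bytes):
    """Approximate size: row data plus one entry per row for each index."""
    data = rows * avg_row_bytes
    index_overhead = rows * indexes * avg_index_entry_bytes
    return data + index_overhead

# 800K wide records with many columns and a dozen indexes...
wide = table_size_bytes(rows=800_000, avg_row_bytes=4_000,
                        indexes=12, avg_index_entry_bytes=40)

# ...versus 2M narrow records with few columns and two indexes.
narrow = table_size_bytes(rows=2_000_000, avg_row_bytes=400,
                          indexes=2, avg_index_entry_bytes=40)

print(wide > narrow)  # True: the smaller record count yields the larger table
```

Under these assumed figures the 800K-record table weighs in at roughly 3.6 GB against under 1 GB for the 2M-record table, which is exactly the point about interrelated elements.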
Understanding the definition of data size then leads us to understanding the nature of software scaling. There are two aspects to this.
The first is the software’s ability to presort the data against the database in such a way that latency (the delay in performance when the data is loaded) is minimized. The principles applied here go back to database management practices from the days when organizations hired teams of data scientists to rationalize data that was processed in machine language, especially when it was stored in ASCII or, for those who really want to date themselves, EBCDIC, formats that are opaque compared with today’s more human-readable ones.
Quite simply, the basic steps applied have been to identify the syntax, translate it, find its equivalents, and then sort that data into logical categories that leverage database pointers. What you don’t want the software to do is what used to be done in the earliest days of dealing with data (which was smaller by today’s standards): serially querying every data element in order to fetch only what the user is requesting. Furthermore, it doesn’t make much sense to treat all data as a Repository of Babel, applying labor-intensive data mining in non-relational databases, especially in cases where the data is well understood and fairly well structured, even if in a proprietary structure. If we do business in a vertical where industry data standards apply, as in the use of the UN/CEFACT XML convention, then much of the presorting has been done for us. In addition, more powerful industry APIs (like OLE DB and ODBC) that utilize middleware (web services, XML, SOAP, MapReduce, etc.) multiply the presorting capabilities of software, providing significant performance improvements in accessing big data.
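The difference between presorted, pointer-driven access and serially querying every element can be sketched with SQLite, used here purely as a convenient stand-in for any relational store (the table and index names are illustrative):

```python
import sqlite3

# In-memory database standing in for a proprietary store (illustrative only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (id INTEGER, category TEXT, payload TEXT)")
con.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [(i, f"cat{i % 10}", f"data-{i}") for i in range(10_000)],
)

# "Presorting" in effect: an index lets the engine follow pointers straight
# to the matching rows instead of serially examining every record.
con.execute("CREATE INDEX idx_category ON records (category)")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM records WHERE category = 'cat3'"
).fetchall()
print(plan)  # the plan reports a SEARCH using idx_category, not a full SCAN
```

The same query without the index forces a full table scan, which is precisely the serial element-by-element fetching described above, just hidden inside the engine.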
The other aspect is the software’s ability to account for limitations in data communications hardware. This is a real problem, because the backbone of corporate communication systems, especially where security must be ensured, still largely runs over a wire. Investments in these backbones are usually categorized as capital investments, so upgrades to the system are slow. Furthermore, backbone systems are often embedded in the physical plant of buildings. So software performance is limited by the resistance and bandwidth of the wiring. Thus, we live in a world where hardware storage and processing double every 12-18 months, and software is designed to better leverage such expansion, but the wires on which data communication depends remain stuck in the past, constrained by the basic physics of Cat 6 or fiber-optic cabling.
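To put the wire constraint in concrete terms, a quick back-of-the-envelope calculation helps. The 70% efficiency factor below is an assumption standing in for protocol and switching overhead; even at Cat 6’s nominal 1 Gbps, a modest big-data set takes minutes, not seconds, to move:

```python
# Rough transfer-time arithmetic for a wired backbone. The efficiency
# factor is an assumed allowance for protocol overhead, not a measurement.
GBPS = 1_000_000_000  # Cat 6 nominal line rate, bits per second

def transfer_seconds(gigabytes, line_rate_bps=GBPS, efficiency=0.7):
    """Time to move `gigabytes` of data at a discounted effective rate."""
    bits = gigabytes * 8 * 1_000_000_000
    return bits / (line_rate_bps * efficiency)

print(round(transfer_seconds(50)))  # 571 seconds: nearly ten minutes for 50 GB
```

No amount of faster storage or CPU on either end changes that number; only the wire does.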
Needless to say, software manufacturers who rely on constant communication with the database will see significantly degraded performance. Some software publishers who still rely on this model use a “check out” system, treating data like a lending library, where only one user or a limited number of users can access the same data at once. This, of course, reduces customer flexibility. Strategies that are more discrete in handling data are the needed response until day-to-day software communications can reap the benefits of physical advancements in this category of hardware. Furthermore, organizations must understand that the big Cloud in the sky is not the answer, since it is constrained by the same physics as the rest of the universe, with greater security risks.
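A sketch of what a more discrete strategy might look like: pull data in bounded batches and do the work locally, rather than holding a chatty connection open. The `fetch_batch` function here is a hypothetical stand-in for any paged query (e.g., `SELECT ... LIMIT ? OFFSET ?`):

```python
# Minimal sketch of batched ("discrete") data access: fetch bounded chunks,
# process them locally, and count round-trips instead of querying per item.
# fetch_batch is a hypothetical stand-in for a paged database query.

def fetch_batch(source, offset, limit):
    """Stand-in for a paged query such as SELECT ... LIMIT ? OFFSET ?."""
    return source[offset:offset + limit]

def process_in_batches(source, batch_size=1_000):
    offset, round_trips, total = 0, 0, 0
    while True:
        batch = fetch_batch(source, offset, batch_size)
        if not batch:
            break
        total += sum(batch)      # local work done on the chunk
        offset += len(batch)
        round_trips += 1
    return total, round_trips

data = list(range(5_000))
print(process_in_batches(data))  # (12497500, 5): five round-trips, not 5,000
```

The point of the design is that the number of trips over the wire scales with the batch count, not the record count, which is exactly where the wired backbone is the bottleneck.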
All of this leads me to a discussion I had with a colleague recently. He opened his iPhone and ran a query in iTunes for an album. In about a second his query found the artist and returned a list of albums, all without a wired connection. “Why can’t we have this in our industry?” he asked. Why indeed? Well, first off, Apple has sorted the iTunes data to optimize performance with its app, and it operates within a standard data stream optimized for Apple iOS and hardware. Secondly, the variables of possible queries in iTunes are predefined and tied to a limited, internally well-defined set of data. Thus, the data and application challenges are not equivalent to those found in my friend’s industry vertical. For example, aside from the challenges of third-party data normalization and rationalization, iTunes is not dealing with dynamic time-phased or trending data that requires multiple user updates to reflect changes using predictive analytics, which is then served to different classes of users in a highly secure environment. Finally, and most significantly, Apple spent more on that system than the annual budget of my friend’s organization. In the end his question was a good one, but in discussing it, it was apparent that just saying “give me this” is a form of magical thinking and hand waving. The devil is in the details, though I am confident that we will eventually get to an equivalent capability.
At the end of the day, IT project management strategy must take into account the specific needs of classes of users in making determinations about scaling. What this means is a segmented approach: thick-client users, compartmentalized local installs with subsets of data, thin clients, and web/mobile or terminal-services equivalents. The practical solution is still an engineered one that breaks the elephant into digestible pieces while leveraging advances in hardware, database, and software operating environments. These are the essential building blocks of data optimization and scaling.