New data warehouse economics


Published on

The is my original text before GigaOM mangled it without my approval

Published in: Technology, Business
1 Like
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

New data warehouse economics

  1. 1. ©2013 Hired Brains Inc. All Rights Reserved 1 The New Data Warehouse Economics Executive Summary The real message about the economics of data warehousing is not that everything is less costly as a result of Moore’s Law1 , but rather, that the vastly expanded mission of data warehousing, and analytics in general, is only feasible today because of those economics. While “classical” data warehousing was focused on creating a process for gathering and integrating data from many mostly internal sources and provisioning data for reporting and analysis, today organizations want to know not only what happened, but what is likely to happen and how to react to it. This presents tremendous opportunities, but a very different economic model. For twenty-five years, data warehousing has proven to be a useful approach to report and understand an organization’s operations. Before data warehousing, reporting and provisioning data from disparate systems was an expensive, time-consuming and often futile attempt to satisfy an organization’s need for actionable information. Data warehousing promised clean, integrated data from a single repository. The ability to connect a wide variety of reporting tools to a single model of the data catalyzed an entire industry, Business Intelligence (BI). However, the application of the original concepts of data warehouse architecture and methodology are today weighed down by complicated methods and designs, inadequate tools and high development, maintenance and infrastructure costs. Until recently, computing resources were prohibitively costly, and data warehousing was constrained by a sense of “managing from scarcity.” Instead, approaches were designed to minimize the size of the databases 1The number of transistors per square inch on integrated circuits had doubled every year since the integrated circuit was invented. Moore predicted that this trendwould continue for the foreseeable future. Current thinking now is every eighteen months.
  2. 2. ©2013 Hired Brains Inc. All Rights Reserved 2 through aggregating data, creating complicated designs of sub-setted databases and closely monitoring the usage of resources. Proprietary hardware and database software are a “sticky” economic proposition – market forcesthat drive down the cost of data warehousing are blocked by the locked-in nature of these tools.Database professionals continue to build structures to maintain adequate performance, such asdata marts, cubes, specialized schemas tuned for a single application and multi- stage ETL2 thatslow the update of information and hinder innovation. Market forces and, in particular, the economics of data warehousing are intervening to change the picture entirely. Today, the economics of data warehousing are significantly different. Managing from scarcity has given way to a rush to find the insight from all of the data, internal and external, and to expand the scope of uses to everyone. Tools, disciplines, opportunities and risk have been turned upside-down. How do the new economics of data warehouse play out? There are already enormous proven benefits to organizations such as: - Offloading analytical work from data warehouses to analytical platforms that are purpose built for the work - Capturing data more quickly, in greater detail, decreasing the time to decisions - Delaying or even foregoing expensive upgrades to existing platforms - Shortening the lag to analyzing new sources of data, or even eliminating it - Broadening the audience of those who perform analytics and those who benefit from it - Shortening the time needed to add new subject areas for analysis from months to days 2ETL, or Extract, Transform and Load tools manage the process of moving data from original source systems to data warehouses and often include operations for profiling, cleansing and integrating the data to a single model.
  3. 3. ©2013 Hired Brains Inc. All Rights Reserved 3 Economic realities present a new era in Enterprise Business Intelligence, disseminating analytical processing throughout and beyond the organization, not just to a handful of analysts. With the accumulated effect of Moore’s Law, the relentless doubling of computing and storage capacity3 every two years, with a corresponding drop in prices, plus engineering innovation and Linux- based and open-source software, a redrawing of the data warehouse model is needed. Expanding the scope and range of data warehousing would be prohibitively expensive based on the economics of twenty-five years ago, or even ten years ago. The numbers are in your favor. The Numbers Part of the driving force behind the new economics of data warehousing is simply the advance of technology. For example: - From 1993 to 2013, the cost of a terabyte of disk storage has dropped from $2,000,000 to less than a $100 (this is the cost of the drives alone. Storage management hardware and software are separate) - Dynamic RAM (DRAM), the volatile memory of the computer, has seen a similar drop in price ($250/4Mb of 8-bit memory versus $450/32Gb of 64- bit memory). Today’s servers can be fitted with 64,000 times as much memory for only twice the cost twenty years ago. - Cost/Gb is only part of the story. There is a similar dramatic increase in density(from 1Mb to 8Gb/chip), allowing for smaller installations, less space and using less energy - A proprietary Unix server with four Intel 486 chips, 384Mb RAM and 32Gb of storage cost $650,000 in 1993, greater than $1 million in constant 3 Though technically not a result of Moore’s Law, the capacity and cost for storage follows a similar and even moredramatic curve
  4. 4. ©2013 Hired Brains Inc. All Rights Reserved 4 dollars. Today’s mid-range laptops of the same capacity cost between $2,000-$4,000. To put these numbers in perspective, since a chart simply cannot adequately convey the magnitude of the effect of Moore’s Law over time, the table below compares the specifications of a 1988 Ferrari Testarossa, one of the true supercars of its era, its twenty-five year’s later version in the 2013, the F12, and the effect of twenty-five years of Moore’s Law on the 1988 version for comparison: 1988 Ferrari Testarossa 2013 Ferrari F12 1988 Ferrari “enhanced” By Moore’s Law Weight 3300 lb. 3355 lb. 1.65 oz. MPG 11 12 352,000 Cost New $200,000 $330,000 $6.25 Horsepower 390 730 12,500,000 0-60 5.3 seconds 3.1 seconds .0002 seconds Top Speed 180 mph 211 mph 4,800,000 mph Expensive to Build and Maintain The reality today is that moderately priced, high-performance hardware is widely available. Thecost of CPUs, memory and storage devices have plummeted in the past five to ten years, but datawarehouse design, completely out of step with market realities, is still based on getting extremeoptimization from expensive and proprietary hardware. A great deal
  5. 5. ©2013 Hired Brains Inc. All Rights Reserved 5 of the effort involved in datamodeling a data warehouse is performance- related. In some popular methodologies, six or sevendifferent schema types are employed, with as many as twenty or thirty or more separate schemas various use cases. This has the effect of providing adequate query performance for mixed workloads, but the cost islimited access to the data and extreme inflexibility borne of the complexity of the design. Agility is difficult due to the number of elements that must be designed and modified over time. Undertoday’s cost structure, it makes more sense to ramp up the hardware and simplify the modelsfor the sake of access and flexibility. Annual data warehouse maintenance costs are considerably higher than operational systems. This is partly due to the iterative nature of analytics, with changing focus based on business conditions, but that is only part of the equation. The major cause is that the architecture and best practices are no longer aligned with the technology. It takes a great deal of time to maintaincomplex structures and it vastly complicates the understanding and usability of the utility on thefront end. For those BI tools that do provide a useful semantic layer4 , allowing business models tobe expressed in business terms, not database terms, user queries quickly overwhelm resourceswith complex queries (complex from the database perspective.) Many BI tools still operate withouta semantic layer between the user and the physical database objects, forcing non-technical usersto deal with concepts like tables, columns, keys and joins which makes navigation nearlyimpossible. All of these problems can be solved today by taking advantage of the tools madepossible by new market economics. In today’s BI environments, with “best practices” data warehouses, the business models areburned into the circuitry of the relational database 4 A semantic layer translates physical names of data in source systems into common names for analysis. For example, System 1 and 2 may name “dog” CANINE_D4 and “Chien,” respectively, but the semantic layer abstracts these terms to a common naming convention of “dog” for understandability and consistency
  6. 6. ©2013 Hired Brains Inc. All Rights Reserved 6 schemas. The range of analysis possible islimited by the ability of the database server to process the queries. Some BI software has thecapability to do much more, but it is constrained by the computing resources in the databaseserver. These tools can present a visual model to business users, allow them to manipulate,modify, enhance and share their models. When a request for analysis is invoked, the toolsdynamically generate thousands of lines of SQL per second, against the largest databases in theworld, asking the most difficult questions. The use of these tools, though they are precisely whatknowledge workers need, is very limited today because most data warehouses are built onproprietary platforms that are an order of magnitude more expensive and an order or two of magnitudeslower than those available from newer entrants into the market like Vertica. How This Affects Operations and Strategy The dramatic change in cost of computing may seem compelling, and is is, but relative hardware costs are not the entire picture. When considering the “economics” of data warehousing, it is just as important to take into accountthe whole picture: the “push” from technology and the lessening of managing from scarcity: For example: - A shift in labor from highly skilled data management staff to more flexible scenarios where more powerful platforms can take over more of the work through both brute force (faster, more complex processing) and better software designed for the new platforms - The amount of data captured for analytics has also grown dramatically, to some extent, but not entirely, normalizing the relative cost decrease. In addition, as today’s analytical applications become more operational, the need for speed and reduced latency (time to value) require faster devices
  7. 7. ©2013 Hired Brains Inc. All Rights Reserved 7 such as solid-state storage devices (SSD) which cost 5x to 10x the price of conventional disk drives (but will continue to drop in relative price) - There is far wider audience for reporting and analytics as a result of better presentation tools and mobile platformswhich drive the requirement for more fresh data and analysis, particularly to mobile devices - Data warehousing and BI were predominantly an inward-focused operation, providing information and reporting to staff. This is no longer the case. Now analytics are embedded in connections to customers, suppliers, regulators and investors. - The availability of cloud computing offers some very promising opportunities, in many cases, particularly in agility and scale A reversal of role of the data warehouse as the single repository of integrated historical data with an emphasis on the repository function to supply reporting needs rather than analytics has been reversed. BI is still useful and needed to explain the Who, What and Where questions, but there is a driving force across all organizations today to understand the “Why,” and to do something about it. To understand the “Why” requires direct access to much more detailed and diverse data not possible in first generation data warehouse architectures, and it has to be handled by systems capable of orders of magnitude faster processing. New database technology designed specifically for analytics and data warehouses such asdata warehouse/analytics appliances, columnar- orientation, optimizers designed for analytics, large memory models and cloud orientation appeared a few years ago and are mature products today. Most organizations are taking a fresh look at data warehouse architecture more purpose-built for fast implementation, greater scale and concurrency, rapid and almost continuous update, lower cost of maintenance and dramatically more detailed data for analysis to deliver better outcomes to the organization.
  8. 8. ©2013 Hired Brains Inc. All Rights Reserved 8 One thing is certain: the data warehousing calculus changed dramatically in the past few years, with many options not available previously. As a result, the cost/benefit analysis has altered. In particular, data warehouse technology today can offer more function and value, often at lower costs, especially when considering Total Cost of Ownership. Looking at Total Cost of Ownership Total Cost of Ownership is a financial metric that is part of a collection of tools to measure the financial impact of investments. Unlike ROI, Return on Investment or IRR, Internal Rate of Return, TCO does not take into account, directly, the benefits of an investment. What it does is attempt to calculate, either prospectively or retrospectively, the cost of implementing and running a project over a fixed period of time, often three to five years. A simple TCO calculation examines the replacement cost of something plus its operating costs, and compares them with some other option, such as not replacing the thing. Alternatively, it may compare the TCO of various options. People often perform an informal TCO analysis when buying a car, considering say a five- year TCO including purchase or lease, fuel, insurance and maintenance. The value of a TCO model is to reduce the prominence of initial startup costs in decision making in favor of taking a longer view. For example, of three candidates for data warehouse platform, the one with the higher purchase cost may have a lower longer term running cost because of maintenance costs, energy costs or labor. One application of TCO analysis in the new economics of data warehousing might compare the costs of implementing Hadoop as a data integration platform instead of existing tools such as enterprise ETL/ELT (Extraction, Transform & Load). Because Hadoop is Open Source software it has no purchase or license costs (though there are commercial distributions that
  9. 9. ©2013 Hired Brains Inc. All Rights Reserved 9 have conventional licensing but at modest cost) and appears to be a very appealing alternative using TCO: Software License Purchase Software Annual Maintenance Server Purchase w/utilities Server Annual Maintenance ETL $500,000 $100,000 $175,000 $35,000 Hadoop 0 0 $40,000 0 ETL Hadoop At Inception $810,000* $40,000 Year 2 135,000 0 Year 3 135,000 0 Year 4 135,000 0 Year 5 135,000 0 Total 5-year TCO $1,215,000 $40,000 * Maintenance fees are typically paid a year in advance This example does not demonstrate the desirability of Hadoop of ETL. It demonstrates how easy it is to make mistakes with TCO by not considering all of the factors. For example, ETL technology is in its third decade of development and is designed purposely for the one application of the integration of data for data warehouses. Hadoop is a general-purpose collection of tools for managing large sets of messy data. What the model above excludes is the cost of programming effort needed for Hadoop to perform what ETL already can without programming. IN addition, ETL tools have added benefits that are not often evident until a few years into the
  10. 10. ©2013 Hired Brains Inc. All Rights Reserved 10 project, such as connectors to every iniginable type of data without programming, mature metadata models and services that ease enhancement and agility, developer’s environments with version control, administrative controls and even run-time robots that advise of defects or anomalies as they occur. Once the costs of labor, delays and downtime are factoring, the TCO model provide a very different answer. The point of this example is not to promote ETL over Hadoop or vice-versa, it is only to illustrate why it is so important in this fairly turbulent time of change to build your models carefully. Unlike ROI and it various derivatives, TCO does not discount future cash flows for two reasons: first, because the time periods are relatively short and second, because TCO costs are not actually investments. The goal of TCO is to find (or measure) the cost of the implementation over time. Keep in mind that a comparative analysis using TCO does not address the question of whether you should make the investments; it only compares various alternatives. The actual decision has to be based on an investment methodology that considers benefits as well as costs. Big Data and Data Warehouses Until now, this discussion centered on data warehouses and BI. Big Data is the newest complication in the broader field of analytics, but its ramifications affect data warehousing to a great extent. Much is made of big data being a “game changer” and the successor to data warehousing. It is not. If anything, it represents an opportunity for data warehousing to fulfill, or at a least extend its reach closer to its original purpose: to be the source of useful, actionable data for all analytical processing. Data warehouse thinking will of necessity have to abandon its vision of one physical structure to do this, however. Instead, it will have to cooperate with many different data sources as a virtualized structure, a sort of surround strategy of “quiet” historical data warehouses, extreme, unconstrained analytical databases providing both
  11. 11. ©2013 Hired Brains Inc. All Rights Reserved 11 real-time update and near-time response and non-relational, big data clusters such as Hadoop. Big data forces organizations to extend the scale of their analytical operations, both in terms of volume and variety of inputs, and, equally important, to stretch their vision of how IT can expand and enhance the use of technology both within and beyond the organization’s boundaries. Big data is useful for a wide range of applications in gathering and sifting through data, and most often repeated uses are for gathering and analyzing information after the fact to draw new insights not possible previously. However, in 3-5 years, big data will move into production of all sorts of real- time activities such as biometric recognition, self-driving cars (maybe 10 years for that) and cognitive computing including natural language question and answering systems. This will only be possible because computing resources continue to expand in power and drop in relative cost. Examples include in-memory processing, ubiquitous cloud computing, text and sentiment analysis and better devices for using the information. However, as crucial as the hardware and software components are, the real advances will happen with the wide adoption of semantic technologies that use ontology5 , graph analysis6 and open data7 . Economic realities are ushering in a new era in Enterprise Business Intelligence,disseminating analytical processingthroughout and beyond the organization. BI is currently too expensive, too limited in scope and too rigid.The economics of computing are not reflected in current data warehousing best practices. Thesolution to problems of latency and depth of data is the application of the vast computing andstorage resources now available at reasonable prices. Due to Moore’s Law and innovationsin query 5Ontologies are a way to represent information in a logical relationship format. One benefit is the ability to create them and link the together with other, even external, ontologies providing a way to use the meaning of things across domains 6 Relational databases employ a logical concept of tables, two-dimensional arrangements of identically structured rows of records. Graphs are a completely different arrangement: things connected to other things by relationships Graphs are capable of traversing massive amounts of data quickly. Often in used in search, path analysis etc. 7 Open data is a collaborative effort of organizations and people to provide access to clean, consistent shared information
  12. 12. ©2013 Hired Brains Inc. All Rights Reserved 12 processing techniques, workarounds and limitations that have become “bestpractices” can be abandoned. Forward-looking companies like H- P/Vertica are positioned toprovide the kinds of capacity, easy scalability, affordability (both purchase cost and TCO8)and simplicity that allow for broadly inclusive environments with far greater flexibility.Tedious design work and endless optimization that slow a project can be eliminated with new approaches at a much lower cost. In addition to servicing the reporting and analysis needs of business people, data warehouses will soon be positioned to support a new class of applications - unattended decisionsupport. So-called hybrid or composite systems, that mix operational and analytical processing inreal-time are already in limited production and will become mainstream within the next few years.Most BI architectures today are not equipped to meet these needs.Making data warehouses more current, flexible and useful will lead to much greater propagationof BI, improving comprehension and retention of business information, facilitating collaborationand increasing the depth and breadth of business intelligence. Bringing data warehouse practices up to speed with the state of technology requires a reversal inpolarity: the focus of data warehousing is still solidly on data and databases and needs to shift tobusiness users and practices that need its functionality. The preponderance of effort in deliveringBI solutions today is still highly technical and dominated by technologists. For the effort to berelevant, the focus and control must be turned over to the stakeholders. This is impossible withcurrent best practices. Unlike systems designed to support business processes such as supply chain management,fulfillment, accounting or even human resources, where the business processes can be clearlydefined (even though it may be arduous), BI isn’t really a businessprocess at all, at least not yet. Traditional systems architecture and development methodologiespresuppose that a distinct set of 8Total Cost of Ownership
  13. 13. ©2013 Hired Brains Inc. All Rights Reserved 13 requirements can be discovered at the beginning ofan implementation. Asking knowledge workers to describe their needs and“requirements” for BI is often a fruitless effort because the most important needs areusually unmet and largely unknown. Thinking out of the box is extremely difficult in thiscontext. The entire process of “requirements gathering” in technology initiatives, and BIin particular, is questionable because the potential beneficiaries of the new and improvedcapabilities are operating under a mental model formed by the existing technology. A popular alternative to this approach is the iterative or “spiral” methodology that simply repeats the requirements gathering in steps, implementing partial solutions in between. Nevertheless, this process is still an result of managing from scarcity – models have to be built in anticipation of use. But the new economics of data warehousing are changing that. It is now possible to gather data and start investigations before the requirements are known and to continuously increment the breadth of a data warehouse because the tools are so much more capable of processing information. Prescription for Realignment No amount of interviewing or “requirements gathering” can break through and find out what sortsof questions people would have if they allowed themselves to think freely. The alternative is toramp up the capacity, allow exploration and discovery, guiding it over a period of time. What’sneeded is a new paradigm that breaks open the data warehouse to business users, enabling adhoc and iterative analyses on full sets of detailed and dynamic data. Unfortunately, old habits die hard. When it comes to BI, the industry is largely constrained by a drag of technology. What passes asacceptable BI in organizations today is rarely much more than re-platforming reports and queriesthat are ten- to twenty-years old. For data warehousing and BI to truly pay off in organizations, IT needs to shift its focus fromdeciding the informational needs of the organization through
  14. 14. ©2013 Hired Brains Inc. All Rights Reserved 14 technical architecture and discipline,to one of responding to those needs as quickly as they arise and change. Conclusion: Walk in Green Fields The new economics of data warehousing, more than anything else, free data from the doctrine of scarcity. Many ordinary business questions are dependent on examining attributes of the mostatomic pieces of data. These bits of information are lost as the data is summarized. Someexamples of queries that are simple in appearance but cannot be resolved with summarizeddata are: - Complementarity (a.k.a. market basket analysis): What options are most likely to be purchased, sorted by profitability, in connection with others? Extra credit: which aren’t? - Metric Filters (a metric derived within a query is used as a filter): Who are the top ten salespersons based on the number of times they have been in the top ten monthly(inclusion based on % of quota)? - Lift Analysis (what do we get from giving something away): How many cardholders will wesend a gift to for making a purchase greater than $250 on their VISA card in the next 90 daysand which of them did so because they were given an incentive? What if the amount is $300? How a visualization of the lift from $50 to $500? Extra credit: greater than triple their average charge in the trailing 90 days before they were sent the promotion? - Portfolio Analysis (to evaluate concentration risk, for example): Aggregating data obscuresinformation, especially in a portfolio where the positive or negative impact of one or a smallproportion of entities can adversely affect the whole collection. Evaluating and understandingall of the attributes of each security, for example, is essential, not an average. - Valuations (such as asset/liability matching in life insurance): Each coverage and its associated attributes (face amount, age at issue, etc.) is
  15. 15. ©2013 Hired Brains Inc. All Rights Reserved 15 modeled with interest assumption to generate cash flows to demonstrate solvency for a book of business. All of these examples are just extrapolations of classic analytics, but enhanced by the ability to make more observations (data points) and fewer assumptions, and to do it in near real-time. In addition to these examples,the following are examples of hybrid processing, involving the coordination of analytics andoperational processing: - Manufacturing Quality: Real-time feeds from the shop floor are monitored and combined withhistorical data to spot anomalies, resulting in drastically reduced turnaround times for quality problems. - Structured/Unstructured Analysis: The Web is monitored for press releases that indicateprice changes by competitors, causing the release to be combined with relevant marketshare reports from the data warehouse and broadcast to appropriate people, all on an unattended basis. - Personalization: B2B and B2C eCommerce dialogues are dynamically customized based onanalysis of rich integrated data from the data warehouse in real-time and real-time feeds - Dynamic Pricing: Operational systems, like airline reservations, embed real-time analysis toprice each transaction (also known as Yield Management). This list is endless. Providing people (and processes) access to this level of detail is consideredtoo difficult and too expensive. Porting these applications to newer, commodity-based platforms and purpose-built software that exploits the new economics can empower a multitude of powerful new analytical applications. Conclusion The new economics of data warehousing provide attractive alternatives in both costs and benefits. While big data gets most of the attention, evolved data warehousing will play an important role for the foreseeable future. In
  16. 16. ©2013 Hired Brains Inc. All Rights Reserved 16 order to be relevant, data warehouse design and operation needs to be simplified, taking advantage of the greatly improved hardware, software and methods. On the benefits side, careful location of processing to separate data integration and stewardship activities from those of an operational and analytical nature is clearly required. The industry has moved beyond a single relational database to do everything. There are more choices today than ever before. New BI tools are arriving every day that take reporting and analysis to new levels of visualization, collaboration and distribution. Embedded analytical processes in operational systems for decision-making rely on fast, deep and clean data from data warehouses. Data warehousing can trace its antecedents to traditional Information Technology. Big data, and especially its most prominent tool, Hadoop, trace their lineage to search and e-business. While these two streams are more or less incompatible, they are converging. Data warehouse providers are quickly adding Hadoop distributions, or even their own versions of Hadoop, into their architecture, adding further cost advantages to collection of extremely large data sets. Finding and retaining the talent to manage in this newly converged environment will not be easy. Current estimates are that in the U.S. alone, there is an existing undersupply of “data scientists” with the skill and training to be effective. To change the mix in the educational system will take more than a decade, but the economy will adjust much more quickly. New and better software has already arrived allowing knowledge workers without Ph.D.’s and/or extensive background in quantitative methods to analyze data and to build and execute models. Driven by the new economics of data warehousing, the proper design and placement of a data warehouse today would be almost unrecognizable to a practitioner of a decade ago. The opportunities are there.
  17. 17. ©2013 Hired Brains Inc. All Rights Reserved 17 ABOUT THE AUTHOR Neil Raden, based in Santa Fe, NM, is an industry analyst and active consultant, widely published author and speaker and the founder of Hired Brains, Inc., Hired Brains provides consulting, systems integration and implementation services in Data Warehousing, Business Intelligence, “big data,” Decision Automation and Advanced Analytics for clients worldwide. Hired Brains Research provides consulting, market research, product marketing and advisory services to the software industry. Neil was a contributing author to one of the first (1995) books on designing data warehouses and he is more recently the co-author of Smart (Enough) Systems: How to Deliver Competitive Advantage by Automating Hidden Decisions, Prentice-Hall, 2007. He welcomes your comments at or at his blog at Competing on Decisions.
  18. 18. ©2013 Hired Brains Inc. All Rights Reserved 18 Appendix Relational Database Inverted: What Columnar Databases Do For a complete description of the features and functions of Vertica, please refer to their website at What follows is a more general description of the concept. It is very slightly technical. Classic relational databases store information in records and collections of records in tables (actually, Codd, the originator of the relational model, referred to them as tuples and relations, hence, relational databases). This scheme is excellent for dealing with the insertion, deletion or retrieval of individual records as the elements of each record, or the attributes, are logically grouped. These records can get very wide, such as network elements in a telecom company or register tickets from a Point of Sale application. This physical arrangement closely mirrors the events or transactions of a business process. When accessing a record, however, the database accesses the entire record, even if only one attribute is needed in the query. In fact, since records are physically stored in “pages,” the database has to retrieve more than one record, often dozens, to access one piece of data. Furthermore, wide database tables are often very “sparse,” meaning there are many attributes with null values that still have to be processed when used in a query. These are major hurdles in using relational databases for analytics. Columnar database technology inverts the database structure and stores each attribute separately. This completely eliminates the wasteful retrieval (and burden of carrying the needless data) as the query is satisfied. As an added benefit, when dealing with just one attribute at a time, the odds are vastly increased that instances are not unique which leads to often dramatic compression of the data (depending on the “distinct”-ness of each collection of attributes.) As a result, much more data can fit in memory, and moving data into memory is much faster.
  19. 19. ©2013 Hired Brains Inc. All Rights Reserved 19 In summary, the Vertica Analytical Database innovations include: Shared-nothing massively parallel processing Columnar architecture residing in commodity hardware Aggressive data compression leading to smaller, faster databases Auto-administration: design, optimization, failover, recovery Concurrent loading and querying