Big Data: A New Breed of Database Vendor

Non-Consensus Idea Series: Software
July 1, 2011

Big Data: A New Breed of Database Vendor Means Trouble For The Existing Order

Analysts: Peter Goldmacher, (415) 646-7206, peter.goldmacher@cowen.com; Joe del Callar, (415) 646-7228, joe.delcallar@cowen.com

Consensus View: The rapid explosion of unstructured and machine data, catalyzed by the pervasiveness of the Internet and massive growth in a variety of computing devices, has created a new market for capturing and analyzing non-traditional data. The primary beneficiaries of this data explosion are the traditional database vendors like Oracle that sell hardware and software systems purpose-built for data capture and analysis. As unstructured and machine data grow, these traditional vendors will be able to sell more hardware, software and storage to address their customers' increasing data needs.

Our View: We believe the vast majority of data growth is coming in the form of data sets that are not well suited to traditional relational database vendors like Oracle. Not only is the data too unstructured and/or too voluminous for a traditional RDBMS, the software and hardware costs required to crunch through these new data sets using traditional RDBMS technology are prohibitive. To capitalize on the Big Data trend, a new breed of Big Data companies has emerged, leveraging commodity hardware, open source and proprietary technology to capture and analyze these new data sets. We believe the incumbent vendors are unlikely to be a major force in the Big Data trend, primarily due to pricing issues and not a lack of technical know-how. Using current Oracle pricing models, we estimate Oracle would charge about 9x more than a blended average of the Big Data vendors to solve similar problems. As corporate data sets grow, we are skeptical that Oracle could retain its pricing power with a blended database offering of traditional and Big Data solutions if it were to try to compete against the Big Data players.

What's Inside? This note defines Big Data, outlines who the players are, reviews the competitive landscape, provides a technical overview of legacy versus new technology, and gives detailed pricing analyses.

Please see the addendum of this report for important disclosures. www.cowen.com
Table Of Contents

Big Data
  • The Database is No Longer a Product, It's a Category
  • Big Data is Creating New Companies
  • The Competitive Landscape
Data in the Internet Age: Enormous Growth
  • The Internet Today
  • Consumer Content Creation and Consumption is Increasing
  • Organizations are Following Consumer Content
What Issues Do Today's Data Pose?
  • Data Volume
  • Lack of Structure
The Data Market Today
  • Database Players in the Market Today
  • Current Database Solutions: Use Cases, Structure & Data Sets
  • Cataloging Software
If Relational Databases Can Handle Big Data Volumes, Why Bother? Cost.
  • Price Caveats
  • Performance
A Brief Background on Databases
  • Productivity Motivated the Adoption of the RDBMS
Data Use Cases – A Classification
  • By Volume
  • By Structure and Accessibility to Context
  • By Nature of End User Access/Use Case
  • Dominant Use Cases
Key Characteristics and Shortcomings of Relational Databases Today
  • Record Structure
  • Relational Operation
  • Transactional Integrity
Tools of the Trade: Dealing with Data Volumes
  • Split up the Problem: Parallelization
  • Get Rid of the Parallel Coordination Bottleneck: Shared Nothing
  • Turn the Problem on its Side: Columnar Databases
  • Get Rid of the Hard Drive Bottleneck: In-Memory Databases
  • Speed up the Hard Drive Bottleneck: Solid State Drives
  • Go Retro: NoSQL
Data Management Providers and Solutions
  • The Database Establishment
  • Other Large Companies with Solutions
  • Proprietary Upstarts
  • Open Source
Cataloging Software: Extracting Meaning from Unstructured Data
  • Why? Even Unstructured Data Needs Structured Analysis.
  • The Market for Meaning
  • The Players
Appendix: System Costs for Big Data
  • A Database for Marketing Analysis Incorporating External Data
  • An Application for Analyzing Web Log Machine Data
  • A Database for Search and Provisioning of Unstructured Data
Big Data

Defining Big Data is a matter of perspective. Wikipedia, which probably has the most objective perspective, defines it as:

"Big Data are datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing. This trend continues because of the benefits of working with larger and larger datasets allowing analysts to spot business trends, prevent diseases, combat crime. Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data. Scientists regularly encounter this problem in meteorology, genomics, connectomics, complex physics simulations, biological research, Internet search, finance and business informatics. Data sets also grow in size because they are increasingly being gathered by ubiquitous information-sensing mobile devices, software logs, cameras, microphones, RFID readers, wireless sensor networks and so on. One current feature of big data is the difficulty working with it using relational databases and desktop statistics/visualization packages, requiring instead massively parallel software running on tens, hundreds, or even thousands of servers. The size of Big Data varies depending on the capabilities of the organization managing the set. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."

This definition is technical, accurate and dramatically understates the impact of Big Data on the technology landscape and our daily lives. The most pervasive trend in technology since the creation of ENIAC, the world's first computer, is that as time passes, prices drop and markets expand. As markets demand, devices proliferate.
[Figure: As Prices Decline, Devices Proliferate. Source: Cowen and Company.]

The practical impact of the massive proliferation of devices is the exponential growth in the creation of data. There is tremendous value in this non-ERP data, and harnessing that value has created multiple $100B companies like Google and Facebook.

[Figure: As Devices Proliferate, Data Creation Explodes. Source: Cowen and Company, Company Websites.]
The Database is No Longer a Product, It's a Category

From our perspective, Big Data is the notion that the database has transitioned from a product to a category. Most investors associate the term database with relational databases and the demand drivers that have propelled Oracle and IBM to the top of the IT heap. However, we believe the trends driving Big Data are significantly larger and will have a far more profound impact on the future than just helping companies automate traditional business processes. We define Big Data as all the other data that exists beyond the structured corporate data that automates business processes. Big Data spans a multitude of other categories including, but certainly not limited to, data sets like machine data, click streams, RFID data, search indexing, phone calls, pictures, videos, genomics, etc. We have no doubt Big Data and the need to manage and understand it will dwarf the opportunity in the relational database market over the last 30 years.

Investors can't continue thinking of a database as a relational database. A relational database is a kind of database. The graphic below makes the point that even though you have two products with the same name, in this case "Sting Ray", the use case of each Sting Ray is dramatically different. The Schwinn Sting Ray on the left represents the legacy paradigm of the RDBMS world, and the Corvette Stingray on the right represents the future of Big Data.

[Figure: Legacy Data vs. Big Data. Source: Cowen and Company.]

Big Data is Creating New Companies

The creation of Big Data has spawned new technologies and entirely new categories of companies that are leveraging Big Data to create new markets. Most people would describe Google and Facebook as media companies because they derive the bulk of their revenues from ad sales. However, we view Google and Facebook and a myriad of other high-growth, Internet-oriented companies as Big Data companies because their businesses have been created due to their ability to harness Big Data.
Google's search indexing technology is a Big Data solution. Facebook's ability to collect and correlate data about people and their preferences is a Big Data solution. There is an ever-growing number of companies that are either being created or being reinvented to harness the power of Big Data, companies like Zynga, Twitter, Intuit, etc.

The Competitive Landscape

The competitive landscape around Big Data is as diverse as the data sets that define Big Data. The opportunity is so large that companies and technologies are springing up like weeds to address opportunities and define the space. While we normally have a negative reaction to broad-based and vaguely defined marketing terms like "B2B eCommerce", "Collaboration", "Cloud" or "Social", we believe the term "Big Data" appropriately defines the space for now. Over time, the market will be more clearly defined and the technologies will be standardized, but for now it is still a free-for-all.

The graphic below is our take on the competitive landscape. Right now there are a lot of startups generating between $10M and $100M in revenues in 2011. Some of the large-cap tech names that don't have a legacy RDBMS business to protect have been active on the M&A front, paying between 10x and 30x revenues for promising startups. Oracle and IBM are circling, watching and marketing, but waiting to see how the space develops. Other players like Dell and SAP have focused on applying existing technology to legacy solutions...for now.

[Figure: The Big Data Competitive Landscape. Source: Cowen and Company, Company Websites.]
Data in the Internet Age: Enormous Growth

The Internet, amplified by consumer endpoints such as smartphones and tablets and coupled with the rise of social networking, has democratized content creation on the Web. Web consumption and contribution include log data, text, pictures, audio and video. In the beginning, content creation was the domain of professional web designers and bloggers, but social networking sites have since let virtually everyone become an author. High demand for data creation has spawned a whole class of companies whose strategy revolves around collecting and provisioning as much data as possible to as many consumers as possible. Companies are rushing to exploit this data to more effectively address their customers and to improve their operations.

We see a chasm between content and corporate analysis that is poorly served by existing relational database technologies, which were not designed to handle the volumes of virtually freeform data being created. This space is rapidly being filled by a developing set of technologies and a growing developer community. We argue that there is significant economic opportunity in bridging this gap, and we are in the early stages of the next generation of data management vendors being created to address this demand. We believe the massive quantity of unstructured data created by the widespread use of the Internet presents a growth opportunity that will be significantly larger than the little $25B relational database industry dominated by Oracle, IBM and Microsoft.

The Internet Today

No one knows the full size of the Internet, but anecdotal data points underscore its reach:

• The UN's International Telecommunication Union (ITU) estimates that the number of people in the world who have access to the Internet surpassed 2B in 2010.
• Our Cowen colleague Matt Hoffman projects smartphone sales of over 400M units in 2011.
• Akamai, which claims 15-30% of all Web traffic, delivers daily Web traffic reaching more than 4 Terabits per second. This suggests that total Web traffic is somewhere between 13 Tbps and 27 Tbps, equivalent to between 13M and 27M photos downloaded every second, or between 13M and 27M decent-quality videos being viewed at the same time (see the arithmetic sketch after this list).
• Verisign reports that there are 196M registered domain names (e.g., cowen.com) that are used to identify websites.
• At one time, Google claimed to have 1 trillion URLs in its index. Although this number has reportedly been pared down to remove duplicates, the starting point of 1 trillion is still big.
• Together, the Google and Verisign data points imply that the average registered domain is associated with hundreds of URLs/web pages.
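The following back-of-envelope sketch (ours, not from the report) reproduces the traffic arithmetic in the Akamai bullet. The per-item sizes are our illustrative assumptions: roughly 1 megabit (~125 KB) for a "decent-quality" photo, and a ~1 Mbps stream per video, the same bit rate the report assumes for video later in this section.

    # Back-of-envelope check of the Akamai-based traffic estimate.
    # Assumptions (ours, for illustration): ~1 Mbit (~125 KB) per photo
    # and ~1 Mbps per decent-quality video stream.

    AKAMAI_TBPS = 4.0                    # Akamai's delivered traffic, Tbps
    share_low, share_high = 0.15, 0.30   # Akamai's claimed share of all Web traffic

    total_low = AKAMAI_TBPS / share_high   # ~13 Tbps if Akamai carries 30%
    total_high = AKAMAI_TBPS / share_low   # ~27 Tbps if Akamai carries 15%

    MBITS_PER_PHOTO = 1.0   # ~125 KB per photo
    VIDEO_MBPS = 1.0        # ~1 Mbps per video stream

    for tbps in (total_low, total_high):
        mbits_per_sec = tbps * 1e6   # Tbps -> Mbps
        photos_per_sec = mbits_per_sec / MBITS_PER_PHOTO
        video_streams = mbits_per_sec / VIDEO_MBPS
        print(f"{tbps:.0f} Tbps ~ {photos_per_sec/1e6:.0f}M photos/s "
              f"or {video_streams/1e6:.0f}M concurrent 1 Mbps streams")
    # -> 13 Tbps ~ 13M photos/s; 27 Tbps ~ 27M photos/s

The photo and video equivalences in the bullet fall out directly once a ~1 Mbit item size is assumed.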
And there are large amounts of existing unstructured content:

• We estimate that there are in excess of 30B photos uploaded on Facebook.
• Google index data suggests there are over 600M videos on YouTube.
• iTunes hosts over 13M songs and over 300,000 apps.
• The University of Toronto's BlogScope project has indexed more than 53M blogs.
• The ITU estimates that there were over 6 trillion text messages sent in 2010, an average of nearly 200K every second.
• Digital phone calls are another large and growing data set.

Consumer Content Creation and Consumption is Increasing

The volume of content on the Internet has increased significantly over the last five years, and data from seminal video and audio websites suggests that the rate at which this data is created and consumed is increasing.

A major example of this trend is YouTube. Since the site was founded at the beginning of 2005, the rate at which data is created and consumed there has grown rapidly, with an eight-fold increase in video uploads in the last four years and a 30x increase in video views over the last five.

[Figure: Video Creation and Consumption on YouTube is Increasing; hours of video uploaded per minute and videos viewed per day (M), Jan-05 through May-11. Source: Cowen and Company, the official YouTube blog on Google.]

Since its founding in February 2005, YouTube's use has steadily climbed to the point where the company announced that users were uploading 35 hours of video every minute. As of May 2010, videos on YouTube were viewed 2B times per day, 20x the rate when Google acquired the company in 2006.
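For a sense of the raw storage volume behind that upload rate, here is a rough calculation of ours using the ~1 Mbps average bit-rate assumption the report states in the next paragraph; the result is illustrative only.

    # Rough daily upload volume implied by "35 hours of video per minute",
    # using the report's ~1 Mbps average bit-rate assumption.

    HOURS_PER_MIN = 35     # hours of footage uploaded per wall-clock minute
    BIT_RATE_MBPS = 1.0    # assumed average encoding rate

    # seconds of footage per day = hours/min * 3600 s/hour * 1440 min/day
    footage_seconds_per_day = HOURS_PER_MIN * 3600 * 60 * 24
    bits_per_day = footage_seconds_per_day * BIT_RATE_MBPS * 1e6
    terabytes_per_day = bits_per_day / 8 / 1e12

    print(f"~{terabytes_per_day:.0f} TB of new video per day")   # ~23 TB/day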
We note that there could be significant variance in our estimates, as bit rates can vary from about 250 Kbps for teleconferencing-quality video to 5 Mbps for high definition. We assume a bit rate of 1 Mbps in our estimates. We also note that growth in video use on YouTube is not an isolated case, as Apple's iTunes store (music) and Facebook (photos) have shown increased use during this time period as well.

[Figure: Audio Consumption on Apple iTunes and Photo Creation on Facebook Also Increased over the Last 5 Years; iTunes song downloads per day (M) and Facebook photo uploads per day (M), Jan-05 through Oct-10. Source: Cowen and Company, Company Releases.]

Organizations are Following Consumer Content

In conjunction with and in response to this increase in consumer usage of the Internet, organizations are following consumers online, as evidenced by a growing number of corporations with an online presence. These companies are increasingly attempting to leverage the data they collect.
The following data from Verisign on domain name registrations indicates that companies and organizations continue to build an online presence. Through June, domain name registrations have grown at an average of nearly 20% annually over a six-year period.

[Figure: Registered Domains Have Grown; millions of domains registered, Dec-04 through Dec-10, a 19.3% six-year CAGR. Source: Cowen and Company, Verisign Domain Name Report.]

Underscoring the point, we believe increasing interest in being able to record, catalog and analyze this data is evident in the 36% growth of machine analytics provider Omniture (part of Adobe, ADBE, NC), 11% organic growth of unstructured content management and indexing provider Autonomy, and 12% growth in BI software overall. Notably, these all exceed the 4% growth observed in the enterprise software category overall, and this strong growth occurred in the midst of a recession that strained IT budgets. And we are still at the very beginning.
[Figure: Omniture, Autonomy and BI Overall Outperformed the Enterprise Software Sector, 2007-2010; Adobe Omniture unit revenue ($M, 40% three-year CAGR), Autonomy revenue incl. acquisitions ($M, 11% organic three-year CAGR), BI software sales ($B, 12% three-year CAGR) and enterprise software sales ($B, 4.2% three-year CAGR). Source: Gartner Dataquest "Forecast: Enterprise Software Markets, Worldwide, 2008-2015, 1Q11 Update" by Graham et al.; company reports; calculations by Cowen and Company.]
What Issues Do Today's Data Pose?

Data being generated today challenges the performance of relational databases and calls into question the suitability of common relational databases to the problem. In particular:

• Data Volume. Data volumes are growing beyond what current relational databases can feasibly handle.
• Structure (or Lack Thereof). Most new data being created does not have the rigid record structure common to today's databases, where each data item has a set of attributes with particular significance or meaning. Some semblance of structure is key, as users typically need some metrics by which to analyze trends in the data.

The issues above are often discussed in the same breath as Big Data, as unstructured data is often large in and of itself (e.g., a single picture is much larger than the numbers and words stored in a typical database transaction).

However, it is useful to think about the issue of structure separately from the issue of volume, because a lack of metadata (data about the data) can render data useless for analysis. Furthermore, there is a significant need to retrieve and serve fewer bits of non-text unstructured data, such as videos, a challenge that is different from aggregating and analyzing Big Data-size amounts of data.

Data Volume

We believe the large volumes of structured and unstructured data being created and served up require massive investments in hardware if implemented with traditional database systems. We break out three main issues with data volume.

• There Are a Lot of Individual Data Items. Many data sets today contain billions of data items, with some data sets containing over a trillion items. While analysis of this much data alone may seem daunting, it is virtually impossible to apply typical relational database techniques, such as retrieving records from more than one table. For example, if one were to retrieve millions of rows of data from table A and millions of rows of data from table B and analyze the combined data set, the database would potentially have to sift through trillions of possible combinations of records (see the sketch after this list).
• Each Data Item Can Be Big. Unlike records for structured data, which are typically sized in kilobytes, unstructured data ranges from tens of kilobytes for web pages and email, to megabytes for song files, to gigabytes for high definition video. These requirements are orders of magnitude larger than those of most structured data records and therefore demand much more in terms of resources. For example, throughput in traditional disk-based systems may not be high enough to serve images and video, leading many sites that serve this type of data to turn to alternative technologies that typically leverage main or solid state memory.
• Data Items are Being Created Rapidly. The high rate at which data is created can also put high demands on system resources, particularly if transaction integrity is required. A traditional data store would typically have to retain an additional copy of data, or only allow one process at a time to update a specific set of data, in order to ensure the validity of the data update even if the system fails. The former places high requirements on storage in the system as duplicate data is needed, while the latter slows the system down as updating processes get delayed waiting on each other for a particular piece of data.
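To make the first bullet concrete, the sketch below (ours, with hypothetical table sizes) shows why naively combining two large tables is intractable: a join that cannot exploit an index degenerates into comparing every row of one table against every row of the other.

    # Illustrative only: why joining two big tables can mean trillions of
    # row comparisons. Table names and sizes are hypothetical.

    rows_a = 2_000_000   # rows retrieved from table A
    rows_b = 1_500_000   # rows retrieved from table B

    # A naive nested-loop join considers every pair of rows:
    comparisons = rows_a * rows_b
    print(f"{comparisons:,} candidate combinations")   # 3,000,000,000,000

    def nested_loop_join(table_a, table_b, key_a, key_b):
        """Emit matching row pairs by brute force: O(len(a) * len(b))."""
        for ra in table_a:
            for rb in table_b:
                if ra[key_a] == rb[key_b]:
                    yield ra, rb

Real databases avoid this worst case with indexes, hash joins and sort-merge joins, but at billions of rows even those strategies strain memory, disk throughput and query planners.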
Lack of Structure

In many ways, the data volume problem is the easier issue to deal with: one can throw a lot of hardware and software (and therefore a lot of money) at it, and the extent of the solution varies with the amount of the investment. In contrast, the lack of structure in unstructured data is challenging for the simple reason that consumers search or analyze data using some attribute of the data, such as the date a video was taken, who the subject of a photo is, or whether or not a blog gives a positive view of a product. Some form of metadata typically has to tag the data item or be generated on the fly. If there is no such metadata, potential consumers of an unstructured data item will have difficulty finding it or may not be able to find it at all.

This is not an easy issue to solve without human involvement. For example, it is not easy for software to tell if a picture is of a yellow rose, of a yellow Labrador named Rose, or of a girl named Rose in a yellow dress. There are rapidly improving solutions for both text and non-text unstructured data. For example, solutions for text analysis and metadata creation (often called "text mining") are provided by Autonomy, SAP Inxight, IBM SPSS and a few others, while enterprise software that analyzes voice and audio data is available from Nuance.
The Data Market Today

Below we list the players in the markets for databases (systems intended to organize, store, and retrieve data) and cataloging software (systems intended to help catalog unstructured data). While the former is a proven market with potential for significant expansion due to the needs of large data volumes, the latter could also prove to be a source of economic opportunity.

In the remainder of this note, we define and provide a very brief history of databases to understand where we are today, and then delve into our taxonomy of the characteristics that define today's data and the challenges of managing data within each classification.

We also look at properties and shortcomings of existing relational databases, and the techniques applied to handle large data. We subsequently explore opportunities open to traditional database providers and areas where their solutions are likely to get displaced. We use an analysis of the costs to build and maintain a YouTube-like system to frame the discussion.

We then explore the established and up-and-coming database providers, focusing on the technologies they are applying to handle large or unstructured data sets.

We subsequently list current providers of data cataloging solutions that help extract context and create metadata from unstructured data.

Database Players in the Market Today

The following summarizes the established and up-and-coming solutions in the market today, organized by the level of structure they are designed to handle and the volume of data each system is capable of managing. These assessments are independent of price/performance ratio (e.g., Oracle can handle petabyte data sets, but we did not say that a sufficient solution would be inexpensive). We also only consider the capabilities of each database "out-of-the-box". For example, MySQL is often cited for use in petabyte-scale datasets, but these deployments typically involve much custom coding where the secret sauce is embedded in the middleware that distributes data to thousands of MySQL instances (hence we list Infobright as a MySQL proprietary solution).
[Table: Capabilities of Existing Database Solutions; vendors arrayed by the structure of data handled (record-based, machine data, unstructured text, other unstructured) and the database size handled (gigabytes, terabytes, petabytes). Petabyte-scale entries include Oracle, DB2 (IBM), Netezza (IBM), Teradata, Aster Data (Teradata), Greenplum (EMC), IQ (SAP/Sybase), BigTable (Google), MarkLogic, ParAccel, GigaSpaces and the open source Cassandra, CouchDB, HBase, Hive, Hypertable and MongoDB. Commercial support: Cassandra by DataStax and Acunu; HBase by Cloudera, Datameer and Hadapt; Hive by DataStax, Cloudera, Datameer and Hadapt; Membase by Couchbase; CouchDB by Cloudant and Couchbase; MongoDB by 10gen. Source: Cowen and Company, Company Reports.]

We note that Oracle can handle all of the major cases of big data loads, while IBM DB2 and Microsoft SQL Server can handle most of the use cases outlined above. However, we remind the reader that the chart above is independent of price; although the statement "there is not enough money in the world for Oracle" is hyperbole, it is apropos. We quantify the price of Oracle systems that can handle big data cases and compare it to emerging big data-specific solutions in a later section.

Current Database Solutions: Use Cases, Structure & Data Sets

The following table shows the applicability of existing databases to a comprehensive range of use cases and structures of data, along with their applicability to large data sets. We also include database appliances in the table below. We make a distinction between databases and caches: databases make a reasonable attempt to ensure that data is persistent (i.e., it does not get erased when there is a prolonged loss of power, for example), which is needed for the search, transaction and analytics use cases we delve into in this note.
While often used in lieu of databases, we do not generally deal with caches, such as Oracle Coherence or memcached, in this note.

[Table: Applicability of Existing Solutions to Use Cases, Structure and Size of Contemporary Data Sets; for each product, the table marks the structures of data handled (machine data, unstructured text, other unstructured), the access patterns/use cases supported (transaction management, search, analytics) and suitability for datasets over 1 PB. Products covered: Oracle (Oracle, Exadata, MySQL, TimesTen), IBM (DB2 mainframe and non-mainframe, Netezza, Informix, solidDB), SAP (ASE, IQ with VLDB option, In-Memory DB and HANA), Microsoft (SQL Server, "just over 1 PB"), Teradata (Teradata, Aster Data), Ingres (Ingres, VectorWise), HP (Vertica), EMC (Greenplum), Amazon (SimpleDB), Google (BigTable), other proprietary providers (1010data, Infobright, Splunk, ParAccel, MarkLogic, GigaSpaces) and open source projects (Cassandra, CouchDB with Hadoop add-on, HBase, Hive, Hypertable, Membase, MongoDB, PostgreSQL as the basis for Greenplum, Redis, Voldemort). Source: Cowen and Company, Company Reports.]

The following table shows the techniques these various solutions apply to handle the large data sets of today. We only mark dataset characteristics that a database system was intended to handle without considerable effort. MySQL, for example, has been heavily modified to work in various systems that handle petabyte data volumes (such as at Facebook), but we did not mark it as a petabyte database, as it does not generally handle petabyte volumes standalone.
[Table: Techniques Used by Databases to Handle Large and/or Unstructured Data; for each provider and product, the table marks use of clustering (<1K nodes), massive parallelism (>1K nodes), shared-nothing architecture, appliance packaging, in-memory operation, solid state drives, columnar storage and NoSQL approaches. Highlights: Oracle RAC option for clustering, with Exadata as an appliance using solid state and partial columnar storage; IBM Parallel Sysplex and pureScale options for clustering, Netezza as a shared-nothing appliance, solidDB for in-memory; SAP ASE Cluster Edition and IQ Multiplex for clustering, with the In-Memory DB/HANA line spanning in-memory and columnar; Teradata appliances spanning most categories; and the open source projects (Cassandra, Hadoop, HBase, Hive, Hypertable, Membase, MongoDB, Redis, Voldemort) generally combining massive parallelism, shared nothing and NoSQL. Source: Cowen and Company, Company Reports.]
Cataloging Software

Cataloging software generates metadata to help consumers of unstructured data find the data. In general, most consumers find or organize data via a text tag; without these tags, unstructured data is virtually useless, as consumers will not be able to find it. Cataloging software helps generate these tags in an automated manner.

The following lists providers of software that helps generate metadata for unstructured data sets, along with the media each covers.

Data Cataloging Software Providers

Larger Providers:
• Autonomy (iSAS and various other products; ticker AU., London): text, images, audio, video
• IBM (Content Analytics): text
• SAP (Inxight): text
• SAS (Text Analytics): text

Startups:
• CallMiner: audio
• IQ Engines: images
• kooaba: images
• Leximancer: text
• Megaputer: text
• Nexidia: text, audio, video

Source: Cowen and Company, Company Websites.
If Relational Databases Can Handle Big Data Volumes, Why Bother? Cost.

We believe the choice between SQL databases and upstart solutions today sits at a crossroads similar to the 1970s choice between those SQL databases and their predecessors. We believe the success of one class of solutions vs. another will ultimately boil down to:

• the price of the chosen solution, and
• the performance that can be achieved by a class of solutions.

Our conclusion is that high-performance solutions from traditional database vendors are not likely to meet the price points of high-performance solutions put together using open source software and commodity hardware. We therefore believe this will drive companies to consider next generation systems when implementing new systems for new non-transactional workloads, even when faced with the concomitant cost of developing expertise in the emerging alternatives.

To illustrate this cost discrepancy, we quantify the costs of the following big data use cases.

• External Marketing Data Analysis. This use case combines product-relevant data scraped from the web with internal data for the purposes of analyzing the efficacy and impact of product and marketing programs.
• Web Log Machine Data Analysis. This use case covers the analysis of data generated by user visits to a company's website. It is based on known transaction volumes from a popular SaaS provider.
• Search and Provisioning of Unstructured Data. This use case looks at the provisioning of large volumes of unstructured data, such as documents and training videos, internally or externally. Volume and throughput are based on YouTube.

The following table compares our estimates of the up front and ongoing costs of a system built using Oracle to a comparable system built using emerging big data solutions.

Emerging Solutions are Much Cheaper ($M)

                                             Oracle                Emerging Solutions     Savings
                                             Up Front  Recurring   Up Front  Recurring    Up Front  Recurring
External Marketing Data Analysis             $33.5     $5.4        $3.4      $0.8         90%       86%
Web Log Machine Data Analysis                $8.5      $1.6        $0.6      $0.2         93%       85%
Search and Provisioning of Unstructured Data $589.4    $99.0       $104.2    $15.1        82%       85%
Average Savings for Emerging Big Data Solutions                                           88%       85%

(Recurring costs are annual.) Source: Cowen and Company, Company Websites and White Papers.
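As a quick sanity check, the savings percentages follow directly from the cost columns; the sketch below simply recomputes them from the table's own figures (small differences vs. the published percentages reflect rounding in the underlying estimates).

    # Recompute the savings percentages in the table above from its cost columns.
    cases = {
        # name: (oracle_upfront, oracle_annual, emerging_upfront, emerging_annual) in $M
        "External Marketing Data Analysis":            (33.5,  5.4,   3.4,  0.8),
        "Web Log Machine Data Analysis":               ( 8.5,  1.6,   0.6,  0.2),
        "Search and Provisioning of Unstructured Data": (589.4, 99.0, 104.2, 15.1),
    }

    for name, (o_up, o_ann, e_up, e_ann) in cases.items():
        up_savings = 1 - e_up / o_up
        ann_savings = 1 - e_ann / o_ann
        print(f"{name}: {up_savings:.0%} up front, {ann_savings:.0%} recurring")
    # -> roughly 90%/85%, 93%/88%, 82%/85%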
We see that, on average, emerging solutions are nearly 90% cheaper up front and 85% cheaper on an ongoing basis than an equivalent traditional solution from Oracle. While the phrase "there's not enough money in the world for Oracle", recently uttered by a big data company CEO, is not completely accurate, it's close. We believe this significant price differential will prompt very strong interest from companies looking to manage large data volumes.

However, despite the compelling cost advantage of next generation systems, the pervasiveness of current generation solutions and accompanying expertise within IT organizations, coupled with the increasing capabilities of relational solutions to handle big data sets, will make it more difficult for next generation systems to replace existing relational systems at current clients that require only incremental upgrades, such as adding several extra terabytes of space. The opportunity for these next generation systems is in non-traditional, non-transactional workloads, which is where we believe all the growth in data is. Given these two opposing dynamics, we believe traditional database systems and next generation systems are likely to coexist for a while, with the presence of next generation solutions limiting upside for traditional commercial database solutions.

Price Caveats

Our calculations above suggest solutions based on alternative big data software are significantly cheaper, from both an initial investment perspective and a maintenance fee perspective, than traditional commercial databases. However, the cost benefits of alternative systems come at the price of more expensive application development. In very simplistic terms, today's commercial relational databases are generally easier to program than the up-and-comers, although there are exceptions. This is important in terms of time to market and cost, particularly for budget- and headcount-constrained organizations. Much of the code of the non-SQL solutions in particular has a very old feel to it, which we illustrate below using a Google BigTable example right out of the company's seminal white paper, on which Hadoop was based. Hadoop code has a very similar structure.

Google BigTable Code has a 70's Feel

40-year-old COBOL code (open the data file into a memory structure, then loop through it and do work):

    OPEN INPUT StudentFile
    READ StudentFile
        AT END SET EndOfStudentFile TO TRUE
    END-READ
    PERFORM UNTIL EndOfStudentFile
        DISPLAY StudentId SPACE StudentName SPACE CourseCode SPACE YOBirth
        READ StudentFile
            AT END SET EndOfStudentFile TO TRUE
        END-READ
    END-PERFORM
    CLOSE StudentFile

Google BigTable code (the same pattern: open the data into a memory structure, then loop through it and do work):

    Scanner scanner(T);
    ScanStream *stream;
    stream = scanner.FetchColumnFamily("anchor");
    stream->SetReturnAllVersions();
    scanner.Lookup("com.cnn.www");
    for (; !stream->Done(); stream->Next()) {
        printf("%s %s %lld %s\n",
               scanner.RowName(),
               stream->ColumnName(),
               stream->MicroTimestamp(),
               stream->Value());
    }

Source: Cowen and Company, Company Report, Google Whitepaper.

At the risk of over-generalizing, much of the code for high-performing solutions such as Hadoop and in-line MapReduce has a very verbose structure, and we thus believe that developers (though many may protest) are likely to be only about as productive as their Cobol/Codasyl counterparts. Our example in the Productivity section below shows that Codasyl code can require as much as 10x as many lines of code and concomitant effort vs. SQL, suggesting that upstart solutions such as Hadoop currently require 10x as much effort as well.
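To give a flavor of the verbosity claim, here is a minimal MapReduce-style sketch of ours (not from the report, and independent of any particular framework) for a task that SQL expresses in a single statement.

    # Illustrative sketch: the classic word-count task, written MapReduce-style.
    # In SQL this is roughly one line:
    #   SELECT word, COUNT(*) FROM docs GROUP BY word;
    # In the MapReduce model, the developer writes the map/reduce plumbing.

    from itertools import groupby
    from operator import itemgetter

    def map_phase(documents):
        """Emit a (word, 1) pair for every word occurrence."""
        for doc in documents:
            for word in doc.split():
                yield (word, 1)

    def reduce_phase(pairs):
        """Group the emitted pairs by word and sum the counts."""
        keyed = sorted(pairs, key=itemgetter(0))   # the 'shuffle' step
        for word, group in groupby(keyed, key=itemgetter(0)):
            yield (word, sum(count for _, count in group))

    docs = ["big data big market", "big opportunity"]
    print(dict(reduce_phase(map_phase(docs))))
    # -> {'big': 3, 'data': 1, 'market': 1, 'opportunity': 1}

Even in this toy form, the declarative one-line query becomes two hand-written phases plus an explicit shuffle; in a real Hadoop job, class definitions, job configuration and serialization add considerably more boilerplate.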
In reality, much of the code in this new generation of solutions is cut-and-paste boilerplate. Furthermore, the lack of a rigid structure in many emerging solutions allows for development shortcuts as well. Hence, anecdotal evidence and our own experience suggest that one would need "only" 4x as much effort to develop on these platforms instead of using commercial SQL databases. We believe these development cost savings will not offset the CapEx and annual cost savings from using alternative open source/commodity hardware solutions. For reference, consider our data search and provisioning case, which assumes a YouTube-like volume. The original YouTube team had 5 people working 80-hour weeks, the equivalent load of 10 normal engineers, to produce the original video delivery platform. If we apply our 4x productivity ratio, this would be equivalent to 2.5 engineers on a commercial database system. At $200K per year in fully loaded costs per engineer, this translates to annual savings of only $1.5M, much less than the $80M in support costs saved in our data provisioning case by using a big data solution instead of Oracle.

Performance

Much has been made of the performance gains available from replacing typical commercial databases and middleware with alternative solutions for managing large data sets. We note that many of these techniques can actually be implemented within one of the major databases, and it is folly to believe that existing database vendors will stand pat. The following shows previously discussed big data techniques currently available in offerings from the traditional commercial vendors (excluding recent acquisitions such as Netezza, Vertica, Greenplum and Aster Data). We exclude NoSQL approaches, as these have not yet been adopted on a major scale by the mainstream vendors.

[Table: Commercial Database Vendors are Incorporating Big Data Solutions; for each provider, products delivering smaller-scale parallelism (<1,000 nodes), massive parallelism (>1,000 nodes), columnar storage, in-memory operation, solid state drives and shared-nothing architecture. Oracle: RAC for clustering; Exadata for partial columnar, solid state and (with MySQL) shared nothing; TimesTen and Oracle (limited) for in-memory. IBM: pureScale for clustering; solidDB for in-memory; DB2 UDB for shared nothing. Microsoft: SQL Server 2008 R2 for clustering, with columnar planned for the next (2011) release. CA: Ingres/VectorWise for clustering; VectorWise for columnar. SAP: IQ/ASE for clustering; IQ and HANA for columnar; HANA for in-memory. Teradata: all major products, plus the Extreme Data and Extreme Performance appliances. Source: Cowen and Company, Company Reports.]

As one can see, a significant number of commercial databases have already integrated many of the performance enhancements used by big data providers into their core databases, add-on options and appliances. Furthermore, while few of these vendors can exploit parallelism across thousand-server installations the way Hadoop and Aster Data can, we expect that at least a few of the providers will eventually produce systems that scale closer to this. It would be naïve to assume that major companies such as Oracle and IBM would not invest in continued improvements to their technology.

We do note, however, that many of these improvements mostly apply to the database's ability to handle transaction data. Oracle text performance, for example, is known to have its limits, as it indexes each word in what can become a massive and unmanageable table.
On the other hand, while we do believe some organizations will look long and hard at alternatives given their significantly lower cost, existing systems and applications that are simply growing their transaction data to big data size may be hard pressed to make the leap to a completely new infrastructure. Hence, this portion of the customer base may continue on enhanced versions of current generation systems, and the presence of these systems may make the sale of true next generation systems more challenging at these customers.
A Brief Background on Databases

In the simplest terms, a database is software intended to organize, store, and retrieve data in a persistent manner (for our purposes, we exclude popular caches like memcached that some lump into this segment). Most popular database systems today, such as those from Oracle and IBM, are relational, which means that these systems allow users to match data by using common characteristics found within the data set. The pervasiveness of these relational database solutions has made them virtually synonymous with the database market.

However, relational has not always been the sole (or even dominant) segment in databases. It is important to understand that databases have been around in one form or another since the computer was invented. Unlike current relational databases, the vast majority of older databases were custom designed, sacrificing flexibility for speed. As the performance of computers improved in the 1960s, general-purpose databases began to emerge. Two of the significant general-purpose database developments of that decade were databases based on the Codasyl standard, which likely helped boost the popularity of COBOL for business computing, and IBM's IMS, a similar but more rigidly structured database that was popular on System/360 mainframes. Other contemporaries included CA's IDMS and DATACOM, Unisys' DMS II, and Software AG's Adabas. These databases were very similar to flat files in their management, some with support for rudimentary relational models such as hierarchies of records.

The next decade saw the introduction of the relational database and the closely associated SQL language. Relational databases have since gained popularity over other forms of databases due to their productivity advantages and have become the dominant database management system technology.

Productivity Motivated the Adoption of the RDBMS

The combination of the SQL programming language and the underlying database that supported it helped improve development productivity. In prior systems like Codasyl, developers had to tell the database both what data to retrieve and how to retrieve it. If an application needed to combine data from two sources, such as employee records and company records, developers had to write the code to manually find the matching records. In contrast, SQL developers simply specify what data to retrieve, and the underlying database uses its metadata to determine how to obtain the data. The following example illustrates the amount of work saved by using SQL vs. equivalent Codasyl-era code.
SQL Allowed for Brevity, Improving Productivity vs. its Contemporaries

3 lines of SQL code:

    SELECT E.EmployeeName, C.CompanyName
    FROM EmployeeFile E, CompanyFile C
    WHERE E.EmployeeCompany = C.CompanyID

Equivalent 27+ lines of Codasyl code:

    OPEN INPUT CompanyFile
    READ CompanyFile
        AT END SET EndOfCompanyFile TO TRUE
    END-READ
    PERFORM UNTIL EndOfCompanyFile
        READ CompanyFile
            AT END SET EndOfCompanyFile TO TRUE
        END-READ
        IF EndOfCompanyFile = FALSE
            OPEN INPUT EmployeeFile
            READ EmployeeFile
                AT END SET EndOfEmployeeFile TO TRUE
            END-READ
            PERFORM UNTIL EndOfEmployeeFile
                IF EmployeeCompany = CompanyID
                    DISPLAY EmployeeName SPACE CompanyName
                    SET EndOfEmployeeFile TO TRUE
                ELSE
                    READ EmployeeFile
                        AT END SET EndOfEmployeeFile TO TRUE
                    END-READ
                END-IF.
            END-PERFORM
            CLOSE EmployeeFile
        END-IF.
    END-PERFORM
    CLOSE CompanyFile

Source: Cowen and Company, Company Report.

We note that the SQL code is one tenth the size of the equivalent Codasyl code, which could translate to as little as one tenth the developer effort. Relational databases presented a compelling advancement in productivity, and this drove their rise to supplant Codasyl and similar databases of the time. Over time, improved hardware performance helped minimize the performance gap between SQL and non-SQL databases such as Codasyl, to the point where developer productivity became the more important priority for IT organizations. This led to the gradual rise of commercial SQL databases, displacing the simpler databases of the past.

History shows time and again that performance is only part of the reason for adopting a database technology. Total cost of ownership, which includes productivity improvements, has always been a factor in the adoption of technology. Going forward, we believe that as data volumes increase, the price/performance of traditional relational databases will likely be outstripped by new technology to the extent that the historical labor cost savings of relational solutions become irrelevant in many cases.
Data Use Cases – A Classification

In addition to understanding how the data being generated today is used, a thorough categorization of the data is useful. Unfortunately, there are too many varying use cases for data today to make a useful definitive list. To help one think through these myriad variations, we present a framework for classifying data use cases below. These classifications are orthogonal to each other and hence will overlap: for example, one may think of large-volume structured machine data used for running analytics.

By Volume

While somewhat arbitrary, many consider one petabyte the frontier for traditional relational databases such as Oracle, DB2 and SQL Server. As the amount of data approaches this threshold, most traditional relational databases manifest performance and manageability issues. It is interesting to note that some customers' databases currently stored in major relational offerings could eventually approach this size as well, particularly if transaction data is retained (as opposed to purged and put onto tape). It will be interesting to observe how customers deal with data growth.

By Structure and Accessibility to Context

It is also helpful to distinguish the type of data by the degree to which its structure is defined. At the most rigid end of the spectrum are record-based structures representing a particular item that decomposes into a rigid set of fields, each describing a particular aspect of the item (e.g., a record could represent a manufactured item, and its attributes would describe the color, weight, width and height of the item). At the other end of the spectrum are non-text data items such as video, images, music files and others that have little if any common structure beyond their file formats. Metadata (text data that defines other data) typically has to be associated with these data items in order to make them accessible. For example, one would not be able to search for iPhone review videos if the videos were not tagged with metadata identifying the subject as an iPhone. One would also not be able to determine whether a review was positive or negative without watching the video if that metadata were not available. The following summarizes the spectrum of data in terms of rigidity of structure.

[Figure: Range of Structure in Various Classes of Data; from more structured to less structured: record-based, machine data, unstructured text, other unstructured data. Source: Cowen and Company.]
