Big Data: A New Breed of Database Vendor
Transcript

  • 1. Non-Consensus Idea Series: Software
Big Data: A New Breed of Database Vendor Means Trouble For The Existing Order
July 1, 2011
Analysts: Peter Goldmacher, (415) 646-7206, peter.goldmacher@cowen.com; Joe del Callar, (415) 646-7228, joe.delcallar@cowen.com
Consensus View: The rapid explosion of unstructured and machine data catalyzed by the pervasiveness of the Internet and massive growth in a variety of computing devices has created a new market for capturing and analyzing non-traditional data. The primary beneficiaries of this data explosion are the traditional database vendors like Oracle that sell hardware and software systems purpose built for data capture and analysis. As unstructured and machine data grow, these traditional vendors will be able to sell more hardware, software and storage to address their customers' increasing data needs.
Our View: We believe the vast majority of data growth is coming in the form of data sets that are not well suited for traditional relational database vendors like Oracle. Not only is the data too unstructured and/or too voluminous for a traditional RDBMS, the software and hardware costs required to crunch through these new data sets using traditional RDBMS technology are prohibitive. To capitalize on the Big Data trend, a new breed of Big Data companies has emerged, leveraging commodity hardware, open source and proprietary technology to capture and analyze these new data sets. We believe the incumbent vendors are unlikely to be a major force in the Big Data trend, primarily due to pricing issues and not a lack of technical know-how. Using current Oracle pricing models, we estimate Oracle would charge about 9x more than a blended average of the Big Data vendors to solve similar problems. As corporate data sets grow, we are skeptical that Oracle could retain its pricing power with a blended database offering of traditional and Big Data solutions if it were to try to compete against the Big Data players.
What's Inside? This note defines Big Data, outlines who the players are, reviews the competitive landscape, provides a technical overview of legacy versus new technology, and gives detailed pricing analyses.
Please see the addendum of this report for important disclosures. www.cowen.com
  • 2. Table Of Contents (Page)
Big Data ... 4
  The Database is No Longer a Product, It's a Category ... 6
  Big Data is Creating New Companies ... 6
  The Competitive Landscape ... 7
Data in the Internet Age: Enormous Growth ... 8
  The Internet Today ... 8
  Consumer Content Creation and Consumption is Increasing ... 9
  Organizations are Following Consumer Content ... 10
What Issues Do Today's Data Pose? ... 13
  Data Volume ... 13
  Lack of Structure ... 14
The Data Market Today ... 15
  Database Players in the Market Today ... 15
  Current Database Solutions: Use Cases, Structure & Data Sets ... 16
  Cataloging Software ... 19
If Relational Databases Can Handle Big Data Volumes, Why Bother? Cost. ... 20
  Price Caveats ... 21
  Performance ... 22
A Brief Background on Databases ... 24
  Productivity Motivated the Adoption of the RDBMS ... 24
Data Use Cases – A Classification ... 26
  By Volume ... 26
  By Structure and Accessibility to Context ... 26
  By Nature of End User Access/Use Case ... 27
  Dominant Use Cases ... 28
Key Characteristics and Shortcomings of Relational Databases Today ... 29
  Record Structure ... 29
  Relational Operation ... 30
  • 3. Table Of Contents, continued (Page)
  Transactional Integrity ... 31
Tools of the Trade: Dealing with Data Volumes ... 33
  Split up the Problem: Parallelization ... 33
  Get Rid of the Parallel Coordination Bottleneck: Shared Nothing ... 36
  Turn the Problem on its Side: Columnar Databases ... 37
  Get Rid of the Hard Drive Bottleneck: In Memory Databases ... 39
  Speed up the Hard Drive Bottleneck: Solid State Drives ... 39
  Go Retro: NoSQL ... 40
Data Management Providers and Solutions ... 43
  The Database Establishment ... 43
  Other Large Companies with Solutions ... 46
  Proprietary Upstarts ... 47
  Open Source ... 48
Cataloging Software: Extracting Meaning from Unstructured Data ... 54
  Why? Even Unstructured Data Needs Structured Analysis. ... 54
  The Market for Meaning ... 55
  The Players ... 55
Appendix: System Costs for Big Data ... 58
  A Database for Marketing Analysis Incorporating External Data ... 58
  An Application for Analyzing Web Log Machine Data ... 63
  A Database for Search and Provisioning of Unstructured Data ... 66
  • 4. Big Data
Defining Big Data is a matter of perspective. Wikipedia, which probably has the most objective perspective, defines it as:
"Big Data are datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing. This trend continues because of the benefits of working with larger and larger datasets allowing analysts to spot business trends, prevent diseases, combat crime. Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data. Scientists regularly encounter this problem in meteorology, genomics, connectomics, complex physics simulations, biological research, Internet search, finance and business informatics. Data sets also grow in size because they are increasingly being gathered by ubiquitous information-sensing mobile devices, software logs, cameras, microphones, RFID readers, wireless sensor networks and so on. One current feature of big data is the difficulty working with it using relational databases and desktop statistics/visualization packages, requiring instead massively parallel software running on tens, hundreds, or even thousands of servers. The size of Big Data varies depending on the capabilities of the organization managing the set. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."
This definition is technical, accurate and dramatically understates the impact of Big Data on the technology landscape and our daily lives. The most pervasive trend in technology since the creation of ENIAC, the world's first computer, is that as time passes, prices drop and markets expand. As markets demand, devices proliferate.
  • 5. Chart: As Prices Decline, Devices Proliferate. Source: Cowen and Company.
The practical impact of the massive proliferation of devices is the exponential growth in the creation of data. There is tremendous value in this non-ERP data, and harnessing that value has created multiple $100B companies like Google and Facebook.
Chart: As Devices Proliferate, Data Creation Explodes. Source: Cowen and Company, Company Websites.
  • 6. The Database is No Longer a Product, It's a Category
From our perspective, Big Data is the notion that the database has transitioned from a product to a category. Most investors associate the term database with relational databases and the demand drivers that have propelled Oracle and IBM to the top of the IT heap. However, we believe that the trends that are driving Big Data are significantly larger and will have a far more profound impact on the future than just helping companies automate traditional business processes. We define Big Data as all the other data that exists beyond the structured corporate data that automates business processes. Big Data is a multitude of other categories, including but certainly not limited to data sets like machine data, click streams, RFID data, search indexing, phone calls, pictures, videos, genomics, etc. We have no doubt Big Data and the need to manage and understand it will dwarf the opportunity in the relational database market over the last 30 years.
Investors can't continue thinking of a database as a relational database. A relational database is a kind of database. The graphic below makes the point that even though you have two products with the same name, in this case "Sting Ray", the use case of each Sting Ray is dramatically different. The Schwinn Sting Ray on the left represents the legacy paradigm of the RDBMS world and the Corvette Stingray on the right represents the future of Big Data.
Chart: Legacy Data vs. Big Data. Source: Cowen and Company.
Big Data is Creating New Companies
The creation of Big Data has spawned the creation of new technologies and entirely new categories of companies that are leveraging Big Data to create new markets. Most people would describe Google and Facebook as media companies because they derive the bulk of their revenues from ad sales. However, we view Google and Facebook and a myriad of other high growth internet oriented companies as Big Data companies because their businesses have been created due to their ability to harness Big Data.
  • 7. Google's search indexing technology is a big data solution. Facebook's ability to collect and correlate data about people and their preferences is a big data solution. There is an ever growing number of companies that are either being created or being reinvented to harness the power of Big Data. These are companies like Zynga, Twitter, Intuit, etc.
The Competitive Landscape
The competitive landscape around Big Data is as diverse as the data sets that define Big Data. The opportunity is so large that companies and technologies are springing up like weeds to address opportunities and define the space. While we normally have a negative reaction to broad based and vaguely defined marketing terms like "B2B eCommerce", "Collaboration", "Cloud" or "Social", we believe the term "Big Data" appropriately defines the space for now. Over time, the market will be more clearly defined and the technologies will be standardized, but for now it is still a free for all.
The following competitive slide is our take on the competitive landscape. Right now there are a lot of startups that generated between $10M and $100M in revenues in 2011. Some of the large cap tech names that don't have a legacy RDBMS business to protect have been active on the M&A front, paying between 10x and 30x revenues for promising startups. Oracle and IBM are circling, watching, marketing, but waiting to see how the space develops. Other players like Dell and SAP have focused on applying existing technology to legacy solutions...for now.
Chart: The Big Data Competitive Landscape. Source: Cowen and Company, Company Websites.
  • 8. Data in the Internet Age: Enormous Growth
The Internet, amplified by consumer endpoints such as smartphones and tablets and coupled with the rise of social networking, has democratized content creation on the Web. Web consumption and contribution includes log data, text, pictures, audio and video. In the beginning, content creation was the domain of professional web designers and bloggers, but now social networking sites have let virtually everyone become an author. High demand for data creation has spawned a whole class of companies whose strategy revolves around collecting and provisioning as much data as possible to as many consumers as possible. Companies are rushing to exploit this data to more effectively address their customers and to improve their operations.
We see a chasm between content and corporate analysis that is poorly served by existing relational database technologies, which were not designed to handle the volumes of virtually freeform data being created. This space is rapidly being filled by a developing set of technologies and a growing developer community. We argue that there is significant economic opportunity in bridging this gap, and we are in the early stages of the next generation of data management vendors being created to address this demand. We believe the massive quantity of unstructured data created by the widespread use of the Internet presents a growth opportunity that will be significantly larger than the little $25B relational database industry dominated by Oracle, IBM and Microsoft.
The Internet Today
No one knows the full size of the Internet, but anecdotal data points underscore its reach:
• The UN's International Telecommunication Union (ITU) estimates that the number of people in the world who have access to the Internet surpassed 2B in 2010.
• Our Cowen colleague Matt Hoffman projects smartphone sales of over 400M units in 2011.
• Akamai, which claims 15-30% of all Web traffic, delivers daily Web traffic reaching more than 4 Terabits per second. This suggests that total Web traffic is somewhere between 13 Tbps and 27 Tbps, equivalent to between 13M and 27M photos downloaded every second or between 13M and 27M decent-quality videos being viewed at the same time (see the sketch following this list).
• Verisign reports that there are 196M registered domain names (e.g., cowen.com) that are used to identify websites.
• At one time, Google claimed to have 1 trillion URLs in its index. Although this number has reportedly been pared down to remove duplicates, the starting point of 1 trillion is still big.
• Taken together, the Google and Verisign data points imply that the average registered domain is associated with hundreds of URLs/web pages.
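To make the traffic arithmetic in the Akamai bullet concrete, a rough back-of-the-envelope sketch is below. The figures are the ones quoted above; the ~1 megabit size assumed for a photo or one second of decent-quality video is an illustrative assumption, not a disclosed input:
    # Rough sketch of the traffic math above (Akamai carries 15-30% of all Web
    # traffic at ~4 Tbps; a photo or one second of video assumed to be ~1 Mb).
    AKAMAI_TBPS = 4.0
    SHARE_HIGH, SHARE_LOW = 0.30, 0.15   # Akamai's assumed share of total traffic

    total_low = AKAMAI_TBPS / SHARE_HIGH   # ~13 Tbps if Akamai is 30% of traffic
    total_high = AKAMAI_TBPS / SHARE_LOW   # ~27 Tbps if Akamai is 15% of traffic

    OBJECT_MBITS = 1.0  # assumed size of a photo / one second of video, in megabits
    objects_low = total_low * 1e6 / OBJECT_MBITS    # Tbps -> Mbps -> objects per second
    objects_high = total_high * 1e6 / OBJECT_MBITS

    print(f"Total traffic: {total_low:.0f}-{total_high:.0f} Tbps")
    print(f"~{objects_low/1e6:.0f}M-{objects_high/1e6:.0f}M photos or 1 Mbps video streams at once")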
  • 9. And there are large amounts of existing unstructured content:
• We estimate that there are in excess of 30B photos uploaded on Facebook.
• Google index data suggests there are over 600M videos on YouTube.
• iTunes hosts over 13M songs and over 300,000 apps.
• The University of Toronto's BlogScope project has indexed more than 53M blogs.
• The ITU estimates that there were over 6 trillion text messages sent in 2010, an average of nearly 200K every second.
• Digital phone calls are another large and growing data set.
Consumer Content Creation and Consumption is Increasing
The volume of content on the Internet has increased significantly over the last five years, and data from seminal video and audio websites suggests that the rate at which this data is created and consumed is increasing.
A major example of this trend in content use is YouTube. Since it was founded at the beginning of 2005, the rate at which data is used and consumed on the site has grown rapidly, with an eight-fold increase in video uploads in the last four years and a 30x increase in video views over the last five.
Chart: Video Creation and Consumption on YouTube is Increasing (hours of video uploaded per minute and videos viewed per day, M, Jan-05 to May-11). Source: Cowen and Company, the official YouTube blog on Google.
Since it was founded in February 2005, YouTube's use has steadily climbed to the point where it announced that users were uploading 35 hours of video every minute. As of May 2010, videos on YouTube were viewed 2B times per day, 20x the rate when Google acquired the company in 2006.
  • 10. We note that there could be significant variance in our estimates, as bit rates can vary from about 250 Kbps for teleconferencing-quality video to 5 Mbps for high definition. We assume a bit rate of 1 Mbps in our estimates (see the sketch below). We also note that growth in video use on YouTube is not an isolated case, as Apple's iTunes store (music) and Facebook (photos) have shown increased use during this time period as well.
Chart: Audio Consumption on Apple iTunes and Photo Creation on Facebook is Increasing over the Last 5 Years as Well (iTunes song downloads per day, M, and Facebook photo uploads per day, M, Jan-05 to Oct-10). Source: Cowen and Company, Company Releases.
Organizations are Following Consumer Content
In conjunction with and in response to this increase in consumer usage of the Internet, organizations are following consumers online, as evidenced by a growing number of corporations with an online presence.
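As an illustration of what these figures imply for infrastructure sizing, the arithmetic can be sketched as follows. It uses only figures stated in this note (35 hours of video uploaded to YouTube every minute, encoded at roughly 1 Mbps); replication, multiple encodings and serving overhead are ignored:
    # Rough sizing sketch using the report's stated assumptions.
    UPLOAD_HOURS_PER_MIN = 35
    BIT_RATE_MBPS = 1.0                                   # assumed average bit rate

    hours_per_day = UPLOAD_HOURS_PER_MIN * 60 * 24        # 50,400 hours of video per day
    megabits_per_hour = BIT_RATE_MBPS * 3600              # 3,600 Mb per hour of video
    bytes_per_day = hours_per_day * megabits_per_hour * 1e6 / 8

    print(f"Raw video ingested per day: ~{bytes_per_day / 1e12:.0f} TB")
    print(f"Raw video ingested per year: ~{bytes_per_day * 365 / 1e15:.1f} PB")
Even before copies and transcodes, the raw uploads alone land in petabyte-per-year territory, which is the volume frontier discussed later in this note.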
  • 11. These companies are increasingly attempting to leverage the data they collect.
The following data from Verisign on domain name registrations indicates that companies and organizations continue to build an online presence. Through June, domain name registrations have grown at an average of nearly 20% per year over a six-year period.
Chart: Registered Domains Have Grown Over the Last 5 Years (millions of domains registered, Dec-04 to Dec-10; 19.3% 6-year CAGR). Source: Cowen and Company, Verisign Domain Name Report.
Making the point, we believe that increasing interest in being able to record, catalog and analyze this data is evident in the strong 36% growth at machine analytics provider Omniture (part of Adobe, ADBE, NC), 11% organic growth at unstructured content management and indexing provider Autonomy, and 12% growth in BI software overall. Notably, these all exceed the 4% growth observed in the enterprise software category overall, and this strong growth occurred in the midst of a recession that strained IT budgets. And we are still at the very beginning.
  • 12. Charts: Omniture, Autonomy and BI Overall Outperformed the Enterprise Software Sector (revenue, 2007-2010; the CAGR calculation is sketched below):
• Adobe Omniture unit ($M): 40% 3-year CAGR
• Autonomy ($M, incl. acquisitions): 11% organic 3-year CAGR
• BI software sales ($B): 12% 3-year CAGR
• Enterprise software sales ($B): 4.2% 3-year CAGR
Source: Gartner Dataquest "Forecast: Enterprise Software Markets, Worldwide, 2008-2015, 1Q11 Update" by Graham et al. Company reports. Calculations by Cowen and Company.
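For reference, a compound annual growth rate over the 2007-2010 window is computed as below. The revenue figures in the example are placeholders for illustration, not the underlying Gartner or company data:
    # CAGR over n years: (ending / starting) ** (1 / n) - 1.
    def cagr(start_value: float, end_value: float, years: int) -> float:
        """Compound annual growth rate between two values."""
        return (end_value / start_value) ** (1.0 / years) - 1.0

    # Example with placeholder revenues: a unit growing from $100M in 2007 to
    # $274M in 2010 shows a ~40% 3-year CAGR.
    print(f"{cagr(100.0, 274.0, 3):.1%}")   # -> 39.9%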
  • 13. What Issues Do Today's Data Pose?
Data being generated today challenges the performance of relational databases and calls into question the suitability of common relational databases to the problem. In particular:
• Data Volume. Data volumes are growing beyond what current relational databases can feasibly handle.
• Structure (or Lack Thereof). Most new data being created is not in the rigid record structure common to today's databases, where each data item has a set of attributes that has particular significance or meaning. Some semblance of structure is key, as users typically need some metrics by which to analyze trends in the data.
The issues above are often discussed in the same breath as Big Data, as unstructured data is often large in and of itself (e.g., a single picture is much larger than the numbers and words stored in a typical database transaction). However, it is useful to think about the issue of structure separately from the issue of volume, because a lack of metadata (data about the data) can render data useless for analysis. Furthermore, there is a significant need to retrieve and serve discrete items of non-text unstructured data, such as videos, a challenge that is different from aggregating and analyzing Big Data-size amounts of data.
Data Volume
We believe the large volumes of structured and unstructured data being created and served up require massive investments in hardware if implemented with traditional database systems. We break out three main issues with data volume.
• There Are a Lot of Individual Data Items. Many data sets today contain billions of data items, with some data sets containing over a trillion items. While analysis of this much data alone may seem daunting, it is virtually impossible to apply typical relational database techniques, such as retrieving records from more than one table. For example, if one were to retrieve millions of rows of data from table A and millions of rows of data from table B and analyze the combined data set, the database would have to potentially sift through trillions of possible combinations of records (a rough illustration follows this list).
• Each Data Item Can Be Big. Unlike records for structured data, which are typically sized in the kilobytes, unstructured data ranges from the tens of kilobytes for web pages and email, to megabytes for song files, to gigabytes for high definition video. These requirements are orders of magnitude larger than those of most structured data records and therefore demand much more in terms of resources. For example, throughput in traditional disk-based systems may not be high enough to serve images and video, leading many sites that serve this type of data to turn to alternative technologies that typically leverage main or solid state memory.
• Data Items are Being Created Rapidly. The high rate at which data is created can also put high demands on system resources, particularly if transaction integrity is required. A traditional data store would typically have to retain an additional copy of data, or only allow one process at a time to update a specific set of data, in order to ensure the validity of the data update even if the system fails. The former places high requirements on storage in the system, as duplicate data is needed, while the latter slows down the system, as updating processes can get delayed while they wait on each other for a particular piece of data.
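The combinatorial point in the first bullet can be made concrete with a quick sketch. The row counts below are arbitrary placeholders chosen only to show how quickly the candidate-pair space grows when two large tables are matched without a selective index:
    # Why joining two big tables is painful: without a selective index, a join
    # must in principle consider every pairing of rows from the two tables.
    rows_table_a = 5_000_000       # placeholder: e.g., records scraped from the web
    rows_table_b = 2_000_000       # placeholder: e.g., internal customer records

    candidate_pairs = rows_table_a * rows_table_b
    print(f"Candidate row pairs to consider: {candidate_pairs:,}")   # 10,000,000,000,000

    # Even at 100 million comparisons per second, a naive pass takes ~28 hours.
    seconds = candidate_pairs / 100_000_000
    print(f"Naive scan time at 100M comparisons/sec: ~{seconds / 3600:.0f} hours")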
  • 14. Lack of Structure
In many ways, the data volume problem is an easier issue to deal with: to solve it, one can throw a lot of hardware and software (and therefore a lot of money) at it, and the extent of the solution varies with the amount of the investment.
In contrast, the lack of structure in unstructured data is challenging for the simple reason that consumers search or analyze data using some attribute of the data, such as the date a video was taken, who the subject of a photo is, or whether or not a blog gives a positive view of a product. Some form of metadata typically has to tag the data item or be generated on the fly. If there is no such metadata, potential consumers of the unstructured data item will have difficulty finding it, or may not be able to find it at all (a small sketch below illustrates the point).
This is not an easy issue to solve without human involvement. For example, it is not easy for software to tell if a picture is of a yellow rose, of a yellow Labrador named Rose, or of a girl named Rose in a yellow dress.
There are rapidly improving solutions for both text and non-text unstructured data. For example, solutions for text analysis and metadata creation (often called "text mining") are provided by Autonomy, SAP Inxight, IBM SPSS and a few others, while enterprise software that analyzes voice and audio data is available from Nuance.
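A minimal sketch of why metadata matters for findability. The items, tags and query below are invented for illustration; the point is simply that a keyword search can only surface items that carry a text tag, however relevant the untagged item may be:
    # Toy illustration: keyword search only finds unstructured items that carry
    # text metadata. The items and tags here are made up for illustration.
    media_items = [
        {"file": "IMG_0001.jpg", "tags": ["yellow rose", "garden"]},
        {"file": "IMG_0002.jpg", "tags": ["Rose", "labrador", "dog"]},
        {"file": "IMG_0003.jpg", "tags": []},   # untagged: invisible to search
    ]

    def search(items, keyword):
        keyword = keyword.lower()
        return [i["file"] for i in items if any(keyword in t.lower() for t in i["tags"])]

    print(search(media_items, "rose"))   # ['IMG_0001.jpg', 'IMG_0002.jpg']
    # IMG_0003.jpg is never returned, even if it is actually a photo of a rose.
Note that the query for "rose" also returns the Labrador named Rose, which echoes the ambiguity problem described above.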
  • 15. The Data Market Today
Below we list the players in the markets for databases (systems intended to organize, store, and retrieve data) and cataloging software (systems intended to help catalog unstructured data). While the former is a proven market with potential for significant expansion due to the needs of large data volumes, the latter could also prove to be a source of economic opportunity.
In the remainder of this note, we define and provide a very brief history of databases to understand where we are today, and then we delve into our above taxonomy of various characteristics that define today's data and the challenges of managing data within each classification.
We also look at properties and shortcomings of existing relational databases, and the techniques applied to handle large data. We subsequently explore opportunities open to traditional database providers and areas where their solutions are likely to get displaced. We use an analysis of the costs to build and maintain a YouTube-like system to frame the discussion.
We then explore the established and up and coming database providers, focusing on the technologies they are applying to handle large or unstructured data sets.
We subsequently list current providers of data cataloging solutions that help extract context and create metadata based on unstructured data.
Database Players in the Market Today
The following summarizes the established and up-and-coming solutions in the market today along the lines of the level of structure they are designed to handle and the volume of data each system is capable of. These assessments are independent of price/performance ratio (e.g., Oracle can handle petabyte data sets, but we did not say that a sufficient solution would be inexpensive). We also only consider the capabilities of each database "out-of-the-box". For example, MySQL is often cited for use in petabyte-scale datasets, but these typically involve much custom coding where the secret sauce is embedded in the middleware that distributes data to thousands of MySQL instances (hence we list Infobright as a MySQL proprietary solution).
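For readers unfamiliar with what that middleware layer does, a minimal sketch follows: the application code, not the database, decides which MySQL instance owns each record. The hashing scheme and shard count below are invented for illustration and are not drawn from any specific deployment:
    # Minimal sketch of the sharding middleware described above.
    import hashlib

    NUM_SHARDS = 4096   # placeholder; real deployments spread data over thousands of instances

    def shard_for(key: str) -> int:
        """Map a record key to one of NUM_SHARDS MySQL instances."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    print(shard_for("user:31415"))   # the shard that stores this user's rows
    print(shard_for("user:27182"))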
  • 16. Table: Capabilities of Existing Database Solutions. The matrix plots solutions by the volume of data they handle out-of-the-box (gigabytes, terabytes, petabytes) against the structure of the data (more structured to less structured: record-based, machine data, unstructured text, other unstructured data). Solutions assessed include Informix (IBM), SimpleDB (Amazon), solidDB (IBM), TimesTen (Oracle), 1010data, ASE (SAP/Sybase), Infobright, Ingres, Membase, PostgreSQL, Redis, SQL Server (Microsoft), MySQL (Oracle), Splunk, Voldemort, Vertica (HP), SAP In-Memory DB/HANA, VectorWise (Ingres), Aster Data (Teradata), BigTable (Google), Cassandra, CouchDB, DB2 (IBM), GigaSpaces, Greenplum (EMC), HBase, Hive, Hypertable, IQ (SAP/Sybase), MarkLogic, MongoDB, Netezza (IBM), Oracle, ParAccel and Teradata.
Notes: commercial support for Cassandra is provided by DataStax and Acunu; for HBase by Cloudera, Datameer and Hadapt; for Hive by DataStax, Cloudera, Datameer and Hadapt; for Membase by Couchbase; for CouchDB by Cloudant and Couchbase; for MongoDB by 10gen. Source: Cowen and Company, Company Reports.
We note that Oracle can handle all of the major cases of big data loads, while IBM DB2 and Microsoft SQL Server can handle most of the use cases outlined above. However, we remind the reader that the chart above is independent of price, and although the statement "there is not enough money in the world for Oracle" is hyperbole, it is apropos. We quantify the price of Oracle systems that can handle big data cases and compare it to emerging big data-specific solutions in a later section.
Current Database Solutions: Use Cases, Structure & Data Sets
The following table shows the applicability of existing databases to a comprehensive range of use cases and structures of data, along with their applicability to large data sets. We also include database appliances in the table below. We make a distinction between databases and caches, as databases make a reasonable attempt to ensure that data is persistent (i.e., it does not get erased when there is a prolonged loss of power, for example), which is needed for the search, transaction and analytics use cases we delve into in this note.
  • 17. While often used in lieu of databases, we do not generally deal with caches, such as Oracle Coherence or memcached, in this note.
Table: Applicability of Existing Solutions to Use Cases, Structure and Size of Contemporary Data Sets. For each product the matrix marks the structure of data handled (machine data, unstructured text, other unstructured), the nature of access/use case supported (transaction management, search, analytics), and whether the solution handles datasets over 1 PB. Products assessed: Oracle (Oracle, Exadata, MySQL, TimesTen; excluding BerkeleyDB and RDB); IBM (DB2 mainframe and non-mainframe, Netezza, Informix, solidDB); SAP (ASE, IQ with the VLDB option, SAP In-Memory DB and HANA); Microsoft SQL Server (rated at just over 1 PB); Teradata (Teradata, Aster Data); Ingres (Ingres, VectorWise); HP Vertica; EMC Greenplum; Amazon SimpleDB; Google BigTable; other proprietary software providers (1010data, Infobright, Splunk, ParAccel, MarkLogic, GigaSpaces); and open source projects (Cassandra, CouchDB, Hadoop HBase, Hive, Hypertable, Membase, MongoDB, PostgreSQL – the basis for Greenplum, Redis, Voldemort). Source: Cowen and Company, Company Reports.
The following shows the techniques these various solutions apply to handle the large data sets of today. In the table below, we only marked off dataset characteristics that the database systems were intended to handle without considerable effort. MySQL, for example, has been heavily modified to work in various systems that handle petabyte data volumes (such as at Facebook), but we did not mark it off as a petabyte database, as it does not generally handle petabyte volumes standalone.
  • 18. Table: Techniques Used by Databases to Handle Large and/or Unstructured Data. For the same set of products, the matrix marks which techniques each applies: parallel processing via clustering (fewer than 1,000 nodes) or massive parallelism (more than 1,000 nodes), shared nothing architecture, appliance packaging, in-memory operation, solid state storage, columnar storage and NoSQL. Notable entries include Oracle's RAC clustering and In-Memory options and the Exadata appliance (marked as partial for columnar via its storage engine), IBM's Parallel Sysplex and pureScale clustering options, solidDB and the Netezza appliance, SAP's ASE Cluster Edition, IQ Multiplex and HANA, Teradata's Extreme Performance Appliance, EMC Greenplum's High Performance DCA, and the open source projects (Cassandra, CouchDB, Hadoop, HBase, Hive, Hypertable, Membase, MongoDB, Redis, Voldemort), most of which combine massive parallelism, shared nothing designs and NoSQL models. Source: Cowen and Company, Company Reports.
  • 19. Cataloging Software
Cataloging software generates metadata to help consumers of unstructured data find the data. In general, most consumers of data find or organize the data via a text tag. Conversely, unstructured data is virtually useless without these tags, as consumers will not be able to find the data. Cataloging software helps generate these tags in an automated manner.
The following lists various providers of software that helps generate metadata for unstructured data sets.
Table: Data Cataloging Software Providers (coverage across text, images, audio and video).
Larger providers: Autonomy (iSAS and various other products; AU., London): text, images, audio and video; IBM (Content Analytics): text; SAP (Inxight): text; SAS (Text Analytics): text.
Startups: CallMiner, IQ Engines, kooaba, Leximancer, Megaputer, Nexidia.
Source: Cowen and Company, Company Websites.
  • 20. If Relational Databases Can Handle Big Data Volumes, Why Bother? Cost.
We believe the choice between SQL databases and upstart solutions today sits at a crossroads similar to the 1970s choice between those same SQL databases and their predecessors. We believe that the success of one class of solutions vs. another ultimately will boil down to
• the price of the chosen solution, and
• the performance that can be achieved by a class of solutions.
Our conclusion is that high-performance solutions from traditional database vendors are not likely to meet the price points of high-performance solutions put together using open source software and commodity hardware. We therefore believe this will drive companies to consider these next generation systems when implementing new systems for new non-transactional workloads, even when faced with the concomitant cost of developing expertise in the emerging alternatives.
To illustrate this cost discrepancy, we quantify the costs of the following big data use cases.
• External Marketing Data Analysis. This use case combines product-relevant data scraped from the web with internal data for the purposes of analyzing the efficacy and impact of product and marketing programs.
• Web Log Machine Data Analysis. This use case is for the analysis of data generated by user visits to a company's website. This is based on known transaction volumes from a popular SaaS provider.
• Search and Provisioning of Unstructured Data. This use case looks at the provisioning of large volumes of unstructured data such as documents and training videos, internally or externally. Volume and throughput is based on YouTube.
The following table compares our estimates of the up front and ongoing costs of a system built using Oracle to a comparable system built using emerging big data solutions.
Table: Emerging Solutions are Much Cheaper ($M; savings are for the emerging solution vs. Oracle).
External Marketing Data Analysis: Oracle $33.5 up front / $5.4 annual recurring; emerging solution $3.4 / $0.8; savings 90% up front / 86% recurring.
Web Log Machine Data Analysis: Oracle $8.5 / $1.6; emerging solution $0.6 / $0.2; savings 93% / 85%.
Search and Provisioning of Unstructured Data: Oracle $589.4 / $99.0; emerging solution $104.2 / $15.1; savings 82% / 85%.
Average savings for emerging big data solutions: 88% up front, 85% recurring.
Source: Cowen and Company, Company Websites and White Papers.
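A quick sketch of how the savings percentages in the table follow from the cost estimates; small differences from the published figures come from rounding in the $M inputs:
    # Savings % = 1 - (emerging solution cost / Oracle cost), computed for the
    # up front and annual recurring columns of the table above ($M).
    cases = {
        "External Marketing Data Analysis": ((33.5, 5.4), (3.4, 0.8)),
        "Web Log Machine Data Analysis": ((8.5, 1.6), (0.6, 0.2)),
        "Search and Provisioning of Unstructured Data": ((589.4, 99.0), (104.2, 15.1)),
    }

    for name, ((ora_upfront, ora_annual), (alt_upfront, alt_annual)) in cases.items():
        upfront_savings = 1 - alt_upfront / ora_upfront
        annual_savings = 1 - alt_annual / ora_annual
        print(f"{name}: {upfront_savings:.0%} up front, {annual_savings:.0%} recurring")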
  • 21. We see that, on average, emerging solutions are nearly 90% cheaper up front and 85% cheaper on an ongoing basis than an equivalent traditional solution from Oracle. While the phrase "there's not enough money in the world for Oracle" recently uttered by a big data company CEO is not completely accurate, it's close. We believe this significant price differential will prompt very strong interest from companies looking to manage large data volumes.
However, despite the compelling cost advantage of next generation systems, the pervasiveness of current generation solutions and accompanying expertise within IT organizations, coupled with the increasing capabilities of relational solutions to handle big data sets, will make it more difficult for next generation systems to replace existing relational systems at current clients that require only incremental upgrades, such as adding several extra terabytes of space. The opportunity for these next generation systems is in non-traditional, non-transactional workloads, which is where we believe all the growth in data is. Given these two opposite dynamics, we believe traditional database systems and next generation systems are likely to coexist for a while, with the presence of next generation solutions limiting upside for traditional commercial database solutions.
Price Caveats
Our calculations above suggest solutions based on alternative big data software are significantly cheaper, from both an initial investment perspective and a maintenance fee perspective, than traditional commercial databases. However, the cost benefits of alternative systems come at the price of more expensive application development. In very simplistic terms, today's commercial relational databases are generally easier to program than the up and comers, although there are exceptions to this. This is important in terms of time to market and cost, particularly for budget and headcount constrained organizations.
Much of the code of the non-SQL solutions in particular has a very old feel to it, which we illustrate below using a Google BigTable example right out of the company's seminal white paper on which Hadoop was based. Hadoop code has a very similar structure.
Google BigTable Code has a 70's Feel (Source: Cowen and Company, Company Report, Google Whitepaper). Both samples follow the same pattern: open the data into a memory structure, then loop through the memory structure and do work.
40-Year Old Cobol Code:
    OPEN INPUT StudentFile
    READ StudentFile
        AT END SET EndOfStudentFile TO TRUE
    END-READ
    PERFORM UNTIL EndOfStudentFile
        DISPLAY StudentId SPACE StudentName SPACE CourseCode SPACE YOBirth
        READ StudentFile
            AT END SET EndOfStudentFile TO TRUE
        END-READ
    END-PERFORM
    CLOSE StudentFile
Google BigTable Code:
    Scanner scanner(T);
    ScanStream *stream;
    stream = scanner.FetchColumnFamily("anchor");
    stream->SetReturnAllVersions();
    scanner.Lookup("com.cnn.www");
    for (; !stream->Done(); stream->Next()) {
        printf("%s %s %lld %s\n",
               scanner.RowName(),
               stream->ColumnName(),
               stream->MicroTimestamp(),
               stream->Value());
    }
At the risk of over-generalizing, much of the code for high-performing solutions such as Hadoop and in-line MapReduce has a very verbose structure, and we thus believe that developers (though many may protest) are likely to be only about as productive as their Cobol/Codasyl counterparts. Our example on p. 25 shows that Codasyl code can require as much as 10x as many lines of code and concomitant effort vs. SQL, suggesting that upstart solutions such as Hadoop currently require 10x as much effort as well.
  • 22. In reality, much of the code in this new generation of solutions is cut-and-paste boilerplate. Furthermore, the lack of a rigid structure in many emerging solutions allows for development shortcuts as well. Hence, anecdotal evidence and our experience suggest that one would need "only" 4x as much effort to develop in these languages instead of using commercial SQL databases. We believe the development cost savings of commercial databases will not offset the CapEx and annual cost savings from using alternative open source/commodity hardware solutions.
For reference, we look at our data search and provisioning case, which assumes a YouTube-like volume. The original YouTube team had 5 people working 80-hour weeks, the equivalent load of 10 normal engineers, to produce the original video delivery platform. If we apply our 4x productivity ratio, this would be equivalent to 2.5 engineers on a commercial database system. At $200K per year in fully-loaded costs per engineer, this translates to annual savings of only $1.5M. This is much less than the $80M in support costs saved in our data provisioning case by using a big data solution instead of Oracle. (The arithmetic is sketched after this slide.)
Performance
Much has been made of the performance gains available from replacing typical commercial databases and middleware with alternative solutions for managing large data sets. We note that many of the solutions can actually be implemented within one of the major databases, and it is folly to believe that existing database vendors will stand pat. The following shows previously discussed big data techniques currently available in offerings from the traditional commercial vendors (excluding recent acquisitions such as Netezza, Vertica, Greenplum and Aster Data). We exclude NoSQL approaches, as these have not yet been adopted on a major scale by the mainstream vendors.
Table: Commercial Database Vendors are Incorporating Big Data Solutions. For each provider (ORCL: Oracle, MySQL, TimesTen, BerkeleyDB; IBM: DB2 mainframe, DB2 UDB, solidDB; MSFT: SQL Server; CA: Ingres, VectorWise; SAP: ASE, IQ; TDC: Teradata), the table marks which techniques are already offered across smaller scale parallelism (<1,000 nodes), massive parallelism (>1,000 nodes), columnar storage, in-memory operation, solid state drives and shared nothing architecture: for example, Oracle's RAC and Exadata, IBM's pureScale and solidDB, Microsoft's SQL Server 2008 R2, Ingres VectorWise, SAP's IQ and HANA, and Teradata's Extreme Data and Extreme Performance appliances. Source: Cowen and Company, Company Reports.
As one can see, a significant number of commercial databases have already integrated many of the performance enhancements used by big data providers in their core databases, add-on database options and appliances. Furthermore, while not many of these vendors can exploit parallelism in thousand-server installations like Hadoop and Aster Data can, we expect that at least a few of the providers will eventually produce systems that can scale closer to this. It would be naïve to assume that major companies such as Oracle and IBM would not invest in continued improvements in their technology.
We do note, however, that many of these improvements mostly apply to the database's ability to handle transaction data. Oracle text performance is known to have its limits, for example, as it indexes each word in what can become a massive and unmanageable table.
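Referring back to the engineering cost comparison above, the arithmetic behind the $1.5M figure is simple enough to sketch; the headcounts, productivity ratio and fully-loaded cost are the report's stated assumptions:
    # Development cost savings of a commercial SQL database vs. a big data stack,
    # using the report's assumptions for the YouTube-like provisioning case.
    BIG_DATA_ENGINEERS = 10          # 5 people at 80-hour weeks ~= 10 normal engineers
    PRODUCTIVITY_RATIO = 4           # big data stack assumed to need 4x the effort
    COST_PER_ENGINEER = 200_000      # fully-loaded annual cost per engineer ($)

    sql_engineers = BIG_DATA_ENGINEERS / PRODUCTIVITY_RATIO          # 2.5 engineers
    dev_savings = (BIG_DATA_ENGINEERS - sql_engineers) * COST_PER_ENGINEER

    print(f"Annual development cost savings of the SQL route: ${dev_savings/1e6:.1f}M")
    # ~$1.5M per year, versus ~$80M per year in support costs saved by avoiding Oracle.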
  • 23. On the other hand, while we do believe some organizations will look long and hard at alternatives given their significantly lower cost, existing systems and applications that are simply growing their transaction data to big data size may be hard pressed to make the leap to a completely new infrastructure. Hence, this portion of the customer base may continue on enhanced versions of current generation systems, and the presence of these systems may make the sale of true next generation systems more challenging at these customers.
  • 24. A Brief Background on Databases
In the simplest terms, a database is software intended to organize, store, and retrieve data in a persistent manner (for our purposes, we exclude popular caches like memcached that some lump into this segment). Most popular database systems today, such as those from Oracle and IBM, are relational, which means that these systems allow users to match data by using common characteristics found within the data set. The pervasiveness of these relational database solutions has made them virtually synonymous with the database market. However, relational has not always been the sole (or even dominant) segment in databases.
It is important to understand that databases have been around in one form or another since the computer was invented. Unlike current relational databases, the vast majority of older databases were custom designed, sacrificing flexibility for speed. As the performance of computers improved in the 1960s, general-purpose databases began to emerge. Two of the significant general purpose database developments that arose in this decade were databases based on the Codasyl standard, which likely helped boost the popularity of COBOL for business computing, and IBM's IMS, a similar but more rigidly structured database that was popular on System/360 mainframes. Other contemporaries included CA's IDMS and DATACOM, Unisys DMS II, and Software AG's Adabas. These databases were very similar to flat files in their management, some with support for rudimentary relational models such as hierarchies of records.
The next decade saw the introduction of the relational database and the closely associated SQL language. Relational databases have since gained popularity over other forms of databases due to their productivity advantages and have become the dominant database management system technology.
Productivity Motivated the Adoption of the RDBMS
The combination of the SQL programming language and the underlying database that supported it helped improve development productivity. In prior systems like Codasyl, developers had to tell the database what data to retrieve and how to retrieve it. If the application needed to combine data from two sources, such as employee records and company records, developers had to create the code to manually find the matched records. In contrast, SQL developers simply need to specify what data to retrieve, and the underlying database will use its metadata to determine how to obtain the data. The following example illustrates the amount of work saved by writing SQL vs. equivalent Codasyl code.
  • 25. SQL Allowed for Brevity, Improving Productivity vs. its Contemporaries (Source: Cowen and Company, Company Report)
3 Lines of SQL Code:
    SELECT E.EmployeeName, C.CompanyName
    FROM EmployeeFile E, CompanyFile C
    WHERE E.EmployeeCompany = C.CompanyID
Equivalent 27+ Lines of Codasyl Code:
    OPEN INPUT CompanyFile
    READ CompanyFile
        AT END SET EndOfCompanyFile TO TRUE
    END-READ
    PERFORM UNTIL EndOfCompanyFile
        READ CompanyFile
            AT END SET EndOfCompanyFile TO TRUE
        END-READ
        IF EndOfCompanyFile = FALSE
            OPEN INPUT EmployeeFile
            READ EmployeeFile
                AT END SET EndOfEmployeeFile TO TRUE
            END-READ
            PERFORM UNTIL EndOfEmployeeFile
                IF EmployeeCompany = CompanyID
                    DISPLAY EmployeeName SPACE CompanyName
                    SET EndOfEmployeeFile TO TRUE
                ELSE
                    READ EmployeeFile
                        AT END SET EndOfEmployeeFile TO TRUE
                    END-READ
                END-IF.
            END-PERFORM
            CLOSE EmployeeFile
        END-IF.
    END-PERFORM
    CLOSE CompanyFile
We note that the SQL code is 1/10th the size of its equivalent Codasyl code. This could translate to as little as 1/10th the developer effort. Relational databases presented a compelling advancement in productivity, and this drove the solution's rise to supplant Codasyl and similar databases of the time. Over time, improved hardware performance helped minimize the performance gap between relational databases and non-SQL databases such as Codasyl, such that developer productivity became a more important priority for IT organizations. This led to the gradual rise of commercial SQL databases, displacing the simpler databases from the past.
History shows time and again that performance is only part of the reason for adopting a database technology. Total cost of ownership, which includes productivity improvements, has always been a factor in the adoption of technology. Going forward, we believe that as data volumes increase, the price/performance of traditional relational databases will likely be outstripped by new technology to the extent that the historical labor cost savings of relational solutions become irrelevant in many cases.
  • 26. Data Use Cases – A Classification
In addition to understanding how the data being generated today is used, a thorough categorization of the data is useful. Unfortunately, there are too many varying use cases for data today to make a useful definitive list. In order to help one think through these myriad variations, we present a framework to classify data use cases below. These classification dimensions are orthogonal to each other, and hence the categories will overlap. For example, one may think of large volume, structured machine data used for the purposes of running analytics.
By Volume
While somewhat arbitrary, many consider one petabyte the frontier for traditional relational databases such as Oracle, DB2 and SQL Server. As the amount of data approaches this threshold, most traditional relational databases manifest performance and manageability issues. It is interesting to note that some customers' databases currently stored in major relational offerings could eventually reach this size neighborhood as well, particularly if transaction data is retained (as opposed to purged and put onto tape). It will be interesting to observe how customers deal with data growth.
By Structure and Accessibility to Context
It is also helpful to distinguish the type of data by the degree to which its structure is defined. At the most rigid end of the spectrum, we have record-based structures that represent a particular item that decomposes into a rigid set of fields, each describing a particular aspect of the item (e.g., a record could represent a manufactured item and its attributes would describe the color, weight, width and height of the item). At the other end of the spectrum are the non-text data items such as video, images, music files and others that have little if any common structure other than their file formats. Metadata (text data that defines other data) typically has to be associated with these data items in order to make them accessible. For example, one would not be able to search for iPhone review videos if the videos are not tagged with metadata that identifies the subject as an iPhone. One would also not be able to determine whether the review were positive or negative without watching the video if the metadata were not available.
The following summarizes the spectrum of data in terms of rigidity of structure.
Chart: Range of Structure in Various Classes of Data (from more structured to less structured): record-based, machine data, unstructured text, other unstructured data. Source: Cowen and Company.
  • 27. As data moves across the bar from left to right, it becomes progressively more unstructured and often more difficult to manage in terms of size, organization, and availability for user search and consumption.
• Transaction Record-Based Data. This is data similar to records in relational databases today. Data is generally stored in the record's attributes, which help define the context of the record. For example, a record of a person may store that person's name, address, SS# and phone number. Each entity with a different type of data (e.g., a company vs. a person) typically has a different structure and set of attributes.
• Machine Data. Machine data is generally structured data generated by machines to record events. For example, a web server may create a log of who loaded a website, a plane may generate records of its flight, and a utility company may collect electronic meter readings on a frequent periodic basis. Machine data's level of structure and accessibility to context can vary depending on the system, from being very similar to traditional structured transaction records to fairly unstructured. For example, data from water meters can be very structured, with a meter identifier, date, time and amount in the meter reading. On the other hand, a machine data log may just record errors in freeform text.
• Unstructured Text Data. This category includes unstructured text data such as content posted to blogs, news feeds, documents, and social media outlets such as Twitter and Facebook. We make the distinction between this and other forms of unstructured data as there is fairly extensive development from companies such as Nuance (NUAN, NC) and BI providers such as Cognos (IBM, Katri, Outperform) and Business Objects (SAP, Underperform) to extract metadata from unstructured text to make it consumable for search, analysis and other operations. Extracting context from this data is also an order of magnitude simpler compared to other sources such as video, as one can extract words from unstructured text, while other sources require some work to associate them with metadata.
• Other Unstructured Data. Under this category, we include photos, videos, sound files and other forms of data that by default do not have any text information on their subject. Traditional analysis and search on this data is impossible unless it is associated with metadata that describes the object. For example, a user searching for a video on cooking apples will not be able to find any such videos unless they have been properly tagged or titled "cooking apples".
By Nature of End User Access/Use Case
We also categorize data based on how it is used by the end user. Use of data has significant implications on storage, input, output, and processing of data in database systems, thereby determining what big data techniques will be suitable for managing the data.
The following table summarizes access and upload patterns for each of the categories of data use. We divide updates into granular updates, where the update of data occurs in small chunks, and more wholesale updates, such as when Google's continuously running web crawlers update data on a web page. Access is divided into freeform access, where the user enters any manner of criterion they want to access the data, and structured access, where the user may access the data along several parameters. We note that the search use case is applicable both to data that is updated on a batch/rolling basis and to data that is created on a granular basis.
We note that the search use case applies both to data that is updated on a batch/rolling basis and to data that is created on a granular basis.

[Table: General Access and Update Characteristics by Nature of Data Access/Use — access (Structured vs. Freeform) and updates (Batch/Rolling vs. Granular) for the Transactional, Search and Provisioning, and Analytics use cases. Source: Cowen and Company.]

Dominant Use Cases

• Transaction Management. This is the traditional use case for relational databases. For the purposes of this note, this is data of any form, regardless of structure, that is regularly updated and accessed at a relatively granular level, often within the context of recording or executing a corresponding business action. For example, a database at Amazon may be used to record and track customers' orders. Data access is generally based on a well-defined set of specified parameters, such as the name of a customer, the date of their order or possibly their order number.

• Search and Provisioning. This use case involves users entering text criteria for a document to be retrieved from a system. Data access is based on virtually freeform parameters, and the manner in which data is created in search databases varies from granular feeds that arrive as data is created in source systems to batch loads as data is aggregated and put into the system (e.g., by web crawlers). This use case is relevant for applications as well, such as searching for contacts in a CRM application, although application-level search is usually restricted to relatively well-defined parameters. One key consequence of search is that the document to be retrieved may be large (e.g., a medical image, design schematic or training video), thereby requiring significant resource capacity to provision.

• Analytics. This use case encompasses the analysis of data, almost always at an aggregate level. Users specify a set of parameters by which they want to aggregate, narrow down and view the data. For example, a retailer may want to see the percentage of website reviews of its dresses that are positive, may narrow this down to only red dresses, and may then view the feedback by the website on which each review was found. Other uses include sifting through transactions to detect failures or fraud, or inferring consumer preferences from the amount of time spent on a particular site, video or audio file. To be used in this mode, the database must be able to aggregate large amounts of data quickly.
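To make the three access patterns concrete, the following minimal Python sketch (our illustration, not from any particular vendor; the data and field names are hypothetical) contrasts a granular transactional lookup, a freeform search, and an aggregate analytics query over toy in-memory data:

```python
# Illustrative sketch (not from the report): the three dominant use cases
# expressed as toy Python queries over in-memory data.

orders = {1001: {"customer": "John Doe", "date": "2011-06-01", "total": 42.50}}
documents = {"doc1": "iPhone review: battery life is excellent",
             "doc2": "Cooking apples the easy way"}
reviews = [{"product": "dress", "color": "red", "positive": True},
           {"product": "dress", "color": "blue", "positive": False}]

# Transaction management: granular access along well-defined parameters.
order = orders[1001]

# Search and provisioning: freeform text criteria against documents.
hits = [doc_id for doc_id, text in documents.items() if "iphone" in text.lower()]

# Analytics: aggregate over many records, narrowed by parameters.
red = [r for r in reviews if r["product"] == "dress" and r["color"] == "red"]
pct_positive = 100.0 * sum(r["positive"] for r in red) / len(red)

print(order, hits, pct_positive)
```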
Key Characteristics and Shortcomings of Relational Databases Today

We believe the requirements for serving, managing and analyzing massive amounts of content pose significant challenges for traditional relational database technologies. Before looking at the challenges of managing unstructured data, we therefore introduce certain significant properties of relational databases. We then examine the catch associated with each of these properties, so that one can understand the problems solved by the new crop of database solutions.

Record Structure

What is It?

Relational databases today primarily store data in highly structured records. An entity's data (such as a person's surname) is stored together in the record's attributes. For example, a record of a person may store that person's name, address, SS# and phone number. Conventionally, this record is stored as a row in a table in the database. Each entity with a different type of data (e.g., a company vs. a person) is typically stored in a different table.

Example of Records Stored in a Table in a Typical Row-Based Database (attributes of the same record are stored together as columns; each record is stored separately as a row)
Record # | First Name | Last Name  | Sex    | SS#          | Birthdate | Street Address        | City          | State | Zip
1        | John       | Doe        | Male   | 123-456-7890 | 1/2/34    | 123 Sesame Street     | New York      | NY    | 01234
2        | Jane       | Doe        | Female | 098-765-4321 | 3/4/56    | 123 Sesame Street     | New York      | NY    | 01234
…
XXX      | George     | Washington | Male   | 111-111-1111 | 2/22/32   | 1600 Pennsylvania Ave | Washington DC | DC    | 20500
Source: Cowen and Company.

Many relational databases also support other data structures. However, support for unstructured data types is limited, and any unstructured data (e.g., Oracle LOBs) is still typically stored within a record structure.

The Catch

Having all related data stored together as a record works well when several attributes of the record are accessed and manipulated together, for example when creating and viewing individual customer orders. However, this layout is wasteful when only a portion of the data is accessed, as the whole record is retrieved even if the user is only after a particular subset of it (e.g., John Doe's birthday). This is an issue in many use cases, and analytics in particular, as it generates an unnecessarily large amount of traffic when much less data is needed.
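As a rough illustration of this catch (our sketch, with hypothetical records), a row store hands back entire records even when a query needs only a single attribute:

```python
# Illustrative sketch: row-oriented storage returns whole records even when
# the caller only needs one attribute (e.g., birth dates for an analysis).

rows = [
    {"first": "John", "last": "Doe", "ss": "123-456-7890", "birthdate": "1/2/34",
     "city": "New York", "state": "NY", "zip": "01234"},
    {"first": "Jane", "last": "Doe", "ss": "098-765-4321", "birthdate": "3/4/56",
     "city": "New York", "state": "NY", "zip": "01234"},
]

# The analysis only needs birth dates ...
birthdates = []
for record in rows:                             # ... but every full record is read,
    birthdates.append(record["birthdate"])      # dragging along six-plus unused fields.

print(birthdates)
```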
Relational Operation

What is It?

Databases provide a facility to match data attributes that reside in different tables. In most databases today, this facility is provided through the ubiquitous Structured Query Language (SQL). This allows different records, such as records of companies and records of persons, to be associated in order to determine a relationship such as employment. In the example below, John, Tim and Maria all work for Widgets-R-Us, and this is identified through the company number attribute, also known as a key.

Illustration of Records Associated to One Another in a Relational Database

Companies                      Employees
Company # | Name               Works for Company # | Name
1         | Widgets-R-Us       1                   | John Doe
2         | Big Mart           1                   | Tim Posey
3         | Go Mommy           1                   | Maria Chapman
                               2                   | Jane Doe
                               2                   | Obadiah Stark
                               3                   | George Washington
                               3                   | Peter Porker
Source: Cowen and Company.

The Catch

While not a performance-degrading aspect of the design in and of itself, adherence to this design philosophy leads to data being spread across multiple tables. When data is retrieved from multiple tables, it is often to analyze bits of information that are distributed among them, which means that the records in these tables have to be matched with each other. This matching process can often be very intensive, particularly when the number of records being matched is large. In systems where many users are accessing large amounts of data, this can tax the system and slow all processes down. Without some of the special techniques discussed in this note, performance on a relational database can degrade quickly as data volumes increase. These data stresses are also a reason that some NoSQL systems forego the ability to join data across datasets, at least in an automated manner.
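A minimal sketch (ours, using the hypothetical tables above) of what the database does when it matches records on a key, and why the work grows quickly with table size:

```python
# Illustrative nested-loop join: every employee row is compared against
# company rows to resolve the company number key into a company name.

companies = [(1, "Widgets-R-Us"), (2, "Big Mart"), (3, "Go Mommy")]
employees = [(1, "John Doe"), (1, "Tim Posey"), (1, "Maria Chapman"),
             (2, "Jane Doe"), (2, "Obadiah Stark"),
             (3, "George Washington"), (3, "Peter Porker")]

matches = []
for comp_id, emp_name in employees:          # len(employees) iterations ...
    for cid, comp_name in companies:         # ... times len(companies) comparisons
        if comp_id == cid:
            matches.append((emp_name, comp_name))

# Work scales with the product of the table sizes unless indexes or hashing
# are used, which is why large joins can tax a shared system.
print(matches)
```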
Transactional Integrity

What is It?

Even when data is stored across many machines, the databases in use today ensure that data is added or updated reliably, even if part or all of the system goes down. In the parlance, these integrity guidelines are referred to as the ACID (Atomicity, Consistency, Isolation and Durability) properties of the database. No matter how small, updates of data require some time, during which the system can fail. To ensure that failures do not lead to inconsistent or corrupt data, modern database systems either (1) make processes that update data reserve the associated records, preventing other processes from updating them, or (2) keep the original copies of the data while it is being updated.

[Figure: Processes can Wait in Line to Update Data to Ensure Integrity — processes queue up to update records in the database. Source: Cowen and Company.]

[Figure: Or Update Copies of the Data to Ensure Integrity — each process works against its own copy of the data. Source: Cowen and Company.]
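A minimal sketch (ours, not any vendor's implementation) of approach (1): updaters reserve the record with a lock so a concurrent writer cannot leave it half-updated:

```python
# Illustrative sketch of approach (1): a lock "reserves" the record so only
# one updater at a time can touch it; others wait in line.
import threading

account = {"balance": 100}
account_lock = threading.Lock()

def transfer_in(amount):
    with account_lock:                           # reserve the record
        current = account["balance"]             # read ...
        account["balance"] = current + amount    # ... then write, atomically
        # lock released here; the next waiting process may proceed

threads = [threading.Thread(target=transfer_in, args=(10,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(account["balance"])  # 150 regardless of how the threads interleave
```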
The Catch

Both of these methods can consume significant resources and slow the system overall. Additionally, in a deployment where data is distributed among several servers (as is common today in handling Big Data), this process is further delayed because the writer of the data needs to wait for all of the targeted systems to verify that the data has been written.
Tools of the Trade: Dealing with Data Volumes

The primary thrust of the database industry up to this point has focused on addressing the data volume management issue. The following outlines the solutions various vendors have taken to deal with terabyte and larger data sets. A simplistic approach to meeting higher processing requirements is to throw more powerful hardware at the problem. However, there are limits to the computing capability of a single processor and a single storage device. Hence, various other solutions have been applied to the big data issue.

We note that there are additional techniques applied by other databases (such as the vector calculations in Ingres' VectorWise, discussed later), but we focus on the more popular solutions applied to the big data issue by both commercial and open source providers.

Split up the Problem: Parallelization

A common solution to the data volume issue is to distribute the workload among many CPUs or over many computers. By intelligently splitting up the workload, each computing resource deals with a simpler portion of the overall problem and can almost always turn the job around faster.

We view massively parallel solutions as likely the most effective parallelization approach. The capabilities of most standalone systems will likely remain well below the level required to manage today's enormous volumes of data, given data's almost exponential growth. We believe the most cost-effective way to keep up with this increasing processing demand will be to keep adding standalone servers to a networked system of servers that work in parallel, rather than buying bigger and more expensive appliances. Such a system of servers can exceed the boundaries of traditional cluster systems, growing into the thousands of nodes over time.
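A minimal sketch (ours) of the idea: split a job into chunks, let a pool of workers process the chunks in parallel, and then combine the partial results:

```python
# Illustrative sketch: split a large job into chunks, fan the chunks out to
# a pool of worker processes, then combine the partial results.
from multiprocessing import Pool

def count_errors(chunk):
    # Each worker handles a simpler portion of the overall problem.
    return sum(1 for line in chunk if "ERROR" in line)

if __name__ == "__main__":
    log_lines = ["ok", "ERROR disk full", "ok", "ERROR timeout"] * 250_000
    chunk_size = len(log_lines) // 4
    chunks = [log_lines[i:i + chunk_size] for i in range(0, len(log_lines), chunk_size)]

    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_errors, chunks)  # work runs in parallel

    print(sum(partial_counts))  # combined answer
```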
[Figure: Parallelization Splits the Work up Among Multiple Resources — a user's job enters at a main entry point and is divided among parallel resources. Source: Cowen and Company.]

We divide this solution into three groups:
• Parallel machines or appliances
• Generic commercial clustering solutions available today
• Massive custom parallelization efforts

Parallel Machines and Appliances

These are single-box, scaled-up systems with multiple processors and multiple storage components, often designed to perform well at a specific set of tasks (e.g., retrieving data from a data warehouse or delivering formatted data to many users). These systems allow the operating system or database software to divide a task among the appliance's components. Examples include Teradata's various appliances, the pre-configured HP and IBM systems running SAP's HANA, and Oracle's Exadata machine.

Clusters

Also known as "grid" systems, these are sets of computers linked closely together to behave in many respects like a single larger computer, with the operating system or other software automatically distributing the workload. Oracle RAC and IBM pureScale are examples of grid or clustering solutions. Oracle's Exadata, for example, uses Oracle RAC software to manage data across multiple Exadata databases.
Grouped computers can afford better performance and availability than a single computer and, more importantly, can be less expensive overall than a single monolithic computer with similar performance. For example, if one were willing to forego the administrative capabilities and high availability of Solaris, a close replacement for Oracle's mid-range $180K SPARC-based Enterprise M5000 server could be had in the form of two or three of Oracle's $16K Intel-based Sun Fire X2270 M2 servers. We note the high-availability features of the M5000 are important in production systems that often need to be running 24x7.

Oracle M5000 vs. X2270: Clusters can be Cheaper
System                                       | Tested  | SPEC Rate CINT2006 | SPEC Rate CFP2006 | Unit Price | Number | Equivalent System Cost
SPARC Enterprise M5000                       | 11/2010 | 352                | 278               | $178,026   | 1x     | $178,026
Sun Fire X2270 M2 (Intel Xeon X5670 2.93GHz) | 7/2010  | 342                | 236               | $16,294    | 2x+    | $32,588+
Source: Cowen and Company, Oracle Corporation website. SPEC® and the benchmark name SPEC CPU2006® are registered trademarks of the Standard Performance Evaluation Corporation. Competitive benchmark results stated above reflect results published on www.spec.org as of Feb 22, 2011. The comparison presented above is based on the latest test results for servers currently shipped by Oracle Corporation. For the latest SPEC CPU benchmark results, visit http://www.spec.org/cpu2006/results/.

Massive Custom Parallelization Efforts

Clustered systems where an out-of-the-box application can manage the workload automatically, such as Oracle RAC and IBM pureScale, are generally limited to hundreds of physical processing and/or storage components. A few commercial solutions can exceed the 1,000-server level, but many believe that performance for specific tasks can be improved by using a more close-to-the-metal level of programming with simpler coordination mechanisms, leading to the rise of NoSQL (discussed later). With effort, companies such as salesforce.com (CRM), Facebook and Google have strung together systems that scale into the thousands and tens of thousands of databases and/or servers to manage and manipulate their immense quantities of data. These deployments often cobble together databases using custom-coded software.

Development Frameworks for Massively Parallel Systems

Taking full advantage of massively parallel systems is difficult without tools and frameworks to appropriately distribute the data load among the component systems. Several popular frameworks appear to be emerging to manage massively parallel systems.

• MapReduce. MapReduce is a software construct first used (and patented) by Google that allows relatively fine-grained control of how calculations on large data sets are handled by a large number of computers. It is composed of a Map step that parcels out the work for a problem to many computers and a Reduce step that combines the answers from all the computers into some desired output. Companies such as Greenplum and Aster Data, and open source efforts such as Hadoop, are attempting to bring this framework to a broader audience.
• Hadoop. Inspired by Google's MapReduce and the Google File System, Apache Hadoop provides an open source file system and framework designed to manage enormous data volumes. Hadoop consists of common APIs with which programmers can access its supported file systems, along with the Hadoop project's implementation of MapReduce. While not a database in and of itself, Hadoop's file system (HDFS) can serve as a storage engine for a database (such as Apache's HBase and Hive).

It is important to note that Google holds a patent for MapReduce that, if enforced, could threaten the development of this industry. However, we note that very similar distribution/collection methods for massively parallel systems have been researched for at least the last 30 years (e.g., spawn/sync in Cilk). At worst, the name may change. At best, use of MapReduce by an industry stalwart such as Google will help promote its broad adoption.

It is also important to moderate the current hype around MapReduce. MapReduce is another form of the parallel computing techniques we recall from university classes in the 1980s, only with different labels, such as the aforementioned spawn/sync syntax. Also, Teradata has been selling a commercial DBMS utilizing MapReduce-like techniques for more than 20 years. All modern database systems have provided some level of MapReduce-like distribution and aggregation functionality for quite a while, starting with the Illustra engine around 1995 and progressing to today's releases of Oracle, DB2 and SQL Server. These existing database systems generally allow only limited control (if any) over parallel execution. A skilled programmer can create faster, more efficient applications using MapReduce than with the automated and semi-automated methods of current databases, at the expense of some additional labor. Our experience indicates, however, that there are relatively few such programmers, so labor shortages may put a damper on MapReduce adoption. For reference, comprehensive parallel programming work is typically reserved for graduate-level classes at universities.

Get Rid of the Parallel Coordination Bottleneck: Shared Nothing

One will often hear of shared-nothing architectures within the context of parallel database solutions. Most contemporary databases parallelize using a shared-everything architecture in which processors share a common main memory and disk storage. Shared-everything helps in instances where data needs to be available to every user and server in the system. However, it makes memory and/or disk access a bottleneck. Shared-nothing is an alternative architecture wherein each processor is independent and self-sufficient. This can significantly enhance performance, as each processor need not coordinate with the others and never has to wait for another processor to finish with a portion of memory or disk. This approach works best with sets of data that are mostly independent of each other, or when data is replicated on more than one node but does not need to be kept up-to-date (i.e., transactional integrity need not be observed).
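A minimal sketch (ours, with toy in-memory "nodes") of how a shared-nothing layout splits data across independent nodes and answers an aggregate question by merging per-node results, essentially the Map and Reduce steps described above:

```python
# Illustrative shared-nothing sketch: rows are hash-partitioned across
# independent "nodes"; an aggregate query is answered by asking each node
# for a partial result and merging (a scatter/gather, MapReduce-style flow).

NUM_NODES = 4
nodes = [[] for _ in range(NUM_NODES)]   # each list stands in for a server's local storage

def route(row):
    # Each row lives on exactly one node, chosen by hashing its key.
    return hash(row["customer"]) % NUM_NODES

for row in [{"customer": "john", "amount": 10},
            {"customer": "jane", "amount": 25},
            {"customer": "john", "amount": 5},
            {"customer": "maria", "amount": 40}]:
    nodes[route(row)].append(row)

# Map step: each node computes its own partial sum without coordinating.
partial_sums = [sum(r["amount"] for r in node) for node in nodes]

# Reduce step: a coordinator merges the per-node answers.
print(sum(partial_sums))  # 80
```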
[Figure: Shared Everything vs. Shared Nothing — in a shared-everything layout, servers share common data storage; in a shared-nothing layout, each server has its own. Source: Cowen and Company.]

However, the shared-nothing approach does have its shortcomings. In instances where data is split among all processors but needs to be reported on together, an additional layer must be in place to combine the data. On the other hand, if data is replicated among several servers, updating it across all the servers that are supposed to stay in sync consumes network and system resources. These communication issues compound quickly as the number of nodes in the system increases.

Commercial shared-nothing architectures include Teradata, Greenplum, MySQL and IBM DB2 UDB (i.e., not the mainframe version of DB2 nor, interestingly, PureScale), and a partially shared-nothing layout can be implemented in Oracle Exadata as well. Google's BigTable is a proprietary database implementation that uses a shared-nothing architecture. It is interesting to note that IBM's highly touted non-mainframe PureScale solution eschews UDB's shared-nothing architecture for a shared-everything architecture in order to minimize inter-system communication. That inter-system communication is necessary to maintain transactional integrity, which, incidentally, is a characteristic that is relaxed in the NoSQL databases discussed later.

Turn the Problem on its Side: Columnar Databases

As we have pointed out, traditional relational databases store data in a transaction record structure, pulling the entire record whenever data in the record needs to be accessed. However, in many instances where data needs to be analyzed, only a specific set of attributes needs to be retrieved (e.g., determining how many people were born in each decade requires only birth date data). The limited set of data needed in typical analyses is the impetus behind columnar databases and their recent rise in popularity. In contrast to typical relational databases that store data in rows, columnar databases from Sybase, ParAccel, Aster Data (Teradata), Vertica and others essentially flip the storage on its side and store data in columns, as shown in the next illustration.
Example of Records Stored in a Columnar Database (each record is broken up and its attributes are stored separately; like attributes, e.g., zip codes, are stored together)
Record #     | 1                 | 2                 | … | XXX
First Name   | John              | Jane              | … | George
Last Name    | Doe               | Doe               | … | Washington
Sex          | Male              | Female            | … | Male
SS#          | 123-456-7890      | 098-765-4321      | … | 111-111-1111
Birthdate    | 1/2/34            | 3/4/56            | … | 2/22/32
Home Address | 123 Sesame Street | 123 Sesame Street | … | 1600 Pennsylvania Ave
City         | New York          | New York          | … | Washington DC
State        | NY                | NY                | … | DC
Zip          | 01234             | 01234             | … | 20500
Source: Cowen and Company.

This architecture allows a very small subset of the data to be retrieved, thereby lowering communication costs in the database. In the illustration above, the system need only retrieve the birth date attribute to determine how many people were born in each decade. In contrast, a row-based database would have retrieved all the records along with all their attributes, 10x as much data. Having like attributes stored together also has the side benefit of making compression more effective, allowing for more efficient storage and retrieval of data and reducing both disk cost and communication bottlenecks.

The natural question to ponder is: if columnar databases are so much faster, why do row-based databases from IBM and Oracle still dominate the market? The tradeoff for using this structure to improve read performance is that writing data to the database is much more expensive. One needs to write data to each stored attribute vs. simply writing one record to the database. Recent columnar database implementations have gotten around this bottleneck by leveraging parallelism and using several processes to write one record into the database. Vertica, ParAccel, Infobright and Sybase IQ are examples of columnar databases. Some database vendors are also offering "hybrid" columnar models, where a table or its partitions are organized so that data can be stored either along rows or columns, providing some of the compression and retrieval benefits of "pure" columnar databases. Examples of this include Aster Data and Oracle (only on Exadata).
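A minimal sketch (ours, using the hypothetical records above) of the row vs. column tradeoff: the column layout lets the birth-date analysis touch one array instead of whole records, while a single-record write must touch every column:

```python
# Illustrative sketch: the same records laid out row-wise and column-wise.
row_store = [
    {"first": "John", "last": "Doe", "birthdate": "1/2/34", "zip": "01234"},
    {"first": "Jane", "last": "Doe", "birthdate": "3/4/56", "zip": "01234"},
]

column_store = {
    "first":     ["John", "Jane"],
    "last":      ["Doe", "Doe"],
    "birthdate": ["1/2/34", "3/4/56"],
    "zip":       ["01234", "01234"],
}

# Analytics read: the column store touches only the birthdate column ...
birthdates = column_store["birthdate"]
# ... whereas the row store drags every attribute of every record along.
birthdates_from_rows = [r["birthdate"] for r in row_store]

# The catch: a single-record write must update every column list.
new = {"first": "George", "last": "Washington", "birthdate": "2/22/32", "zip": "20500"}
for attr, value in new.items():
    column_store[attr].append(value)
```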
Get Rid of the Hard Drive Bottleneck: In-Memory Databases

A big bottleneck in the storage and retrieval of information is the disks on which the information is stored. Accessing data on memory chips is nearly instantaneous once a memory address is provided. In contrast, for disk access one needs to physically move an object (the head) across the diameter and along the circumference of a disk to retrieve the data. Each seek takes only a little time, but added up, the time spent seeking data on disk can be significant.

To solve this issue, modern database vendors including Oracle (TimesTen), IBM (Cognos TM1 and solidDB) and SAP (Sybase) have commercialized databases where the primary means of data storage is the server's main chip-based memory instead of a hard drive. Some mainstream vendors, such as Oracle, also offer a more limited in-memory capability in their bread-and-butter commercial databases, by letting the administrator specify a limited portion of the database to be stored in memory.

While many of the database solutions today can emulate an in-memory solution as disk data gets cached into main memory, we focus here on databases that are purpose-designed to have their data stored in memory rather than brought into memory as part of a caching mechanism. The main constraint on these systems is that large amounts of chip-based memory can get very expensive, and that system motherboards have to become very large to accommodate the banks of memory needed to support large databases. At a higher level, many distributed database implementations, such as those at YouTube and Facebook, use an open source database cache called memcached, which allows the application to first check whether the data is available in memcached's memory before accessing the underlying disk-based database.

Ease the Hard Drive Bottleneck: Solid State Drives

A compromise relative to in-memory databases is to replace the hard drives in an appliance with solid state drives (SSDs), which use a form of chip-based memory that is cheaper than traditional main memory. This form of memory is up to two orders of magnitude faster than traditional hard drives (0.1 ms access time vs. 5 ms-10 ms access time), at a fraction of the price of comparable chip-based main memory. The solid state drive is accessed by the CPU much like a traditional disk drive. SSD memory is cheaper than main memory, but is also about 10x slower. However, SSDs are still a significant performance improvement over disk, and we believe they could be a cost-effective alternative to true in-memory databases: SSDs cost roughly 6% of the price of main memory but only about 2x the cost of high-performance disk.
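A minimal sketch (ours; the function and key names are hypothetical, and a plain dictionary stands in for memcached) of the cache-aside pattern described above, in which the application checks memory before falling back to the disk-based database:

```python
# Illustrative cache-aside sketch: check an in-memory cache first, and only
# fall back to the (slow) disk-based database on a miss.

cache = {}  # stand-in for a memcached-style in-memory store

def read_from_disk_database(user_id):
    # Placeholder for an expensive query against the disk-based database.
    return {"id": user_id, "name": "John Doe"}

def get_user(user_id):
    key = f"user:{user_id}"
    if key in cache:                               # fast path: served from memory
        return cache[key]
    record = read_from_disk_database(user_id)      # slow path: hit the disk
    cache[key] = record                            # populate the cache for next time
    return record

get_user(42)   # first call reads from disk and fills the cache
get_user(42)   # second call is served from memory
```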
Recent High-End Storage Pricing at a Major Hardware Components Retailer
Category           | Brand            | Capacity (GB) | Price   | Price per GB
Main Memory        | Crucial          | 16            | $849.99 | $53.12
                   | Kingston         | 8             | $256.99 | $32.12
                   | Patriot Memory   | 8             | $239.99 | $30.00
                   | Wintec           | 8             | $229.99 | $28.75
                   | Corsair          | 4             | $164.99 | $41.25
                   | Lenovo           | 2             | $109.99 | $55.00
                   | Mushkin          | 1             | $29.99  | $29.99
                   | Weighted Average |               |         | $40.04
Solid State Drives | Crucial          | 256           | $619.00 | $2.42
                   | Corsair          | 256           | $749.99 | $2.93
                   | Intel            | 250           | $614.99 | $2.46
                   | Weighted Average |               |         | $2.60
Hard Drives        | Seagate          | 600           | $439.99 | $0.73
                   | HP               | 600           | $999.99 | $1.67
                   | IBM              | 600           | $799.99 | $1.33
                   | Cisco            | 73            | $319.99 | $4.38
                   | Weighted Average |               |         | $1.37
Source: Cowen and Company; major hardware components retailer website (beginning of June 2011).

The higher speed of SSDs will strain many servers designed for the lower throughput of disks, however, so only higher-end servers or purpose-designed systems such as appliances will be able to take full advantage of SSDs. Contemporary database appliances such as Oracle's Exadata and SAP's HANA appliances make extensive use of SSDs.

Go Retro: NoSQL

Many attempts to improve database performance simplify the database by relaxing or removing record constructs and transactional integrity, and by discouraging the use of relational constructs. The nom du jour for databases that implement these solutions is NoSQL, a bit of a misnomer, as some of these databases actually use SQL; developers realized they could not live without the productivity gains it offers. As an aside, many now consider NoSQL to stand for "Not only SQL."

NoSQL is currently generating significant buzz in the industry, as many view it as a fundamental rethinking of what a database is. This presumed breakthrough may lead to significant economic opportunity. However, the performance difference offered by these fairly radical departures from the norm may be offset by applying some of the preceding performance improvements to existing commercial databases.
Many NoSQL solutions are open source and are likely to be available at a lower price than traditional commercial counterparts. Examples of this class of solutions include BigTable from Google, the Apache Project's Cassandra, and MarkLogic.

Relaxing Relational Constraints

The previously described solutions seek to preserve the record structure, relational capabilities and transactional integrity. These are akin to figuring out the fastest way to drive from New York to San Francisco. The NoSQL movement questions the underlying principles of commercial databases, ditching the car and flying instead. The following explains how relaxing each of the previously explained characteristics of typical commercial databases helps manage large sets of data.

• Record Structure. Some NoSQL solutions relax or completely do away with structured records. This simplifies the database internals by removing the need to translate the metadata used to define the tables and their structure, thereby allowing for a simpler, faster database. This approach also lends itself to more efficiently managing unstructured data, as the source data object, if stored, need not be shoehorned into a record structure.

• Relational Operations. Similar to relaxing the constraints of adhering to record structures, many NoSQL solutions remove the ability to match records in one or more tables with each other in order to simplify the database and increase processing speed. This has a secondary benefit in that the lack of support for associating two sets of data discourages application programmers from storing data for multiple entities in different tables/data stores. Hence, applications built on NoSQL systems will spend little time executing expensive matching operations.

• Transactional Integrity. Many of the nascent NoSQL databases forego ensuring transactional integrity. Instead of having updated data available on all servers in the system simultaneously, particularly in a shared-nothing architecture, updates eventually propagate to all the servers in the system. This saves the system from the overhead of keeping old copies of the data, or the delay of queuing up operations, in order to guarantee the integrity of its transactions.

Classes of NoSQL Solutions

The generally accepted definition of NoSQL, the class of databases that simplify existing paradigms and relax traditional database constraints, is very broad, and there are many emerging NoSQL databases with very different characteristics and uses. There do appear to be several emerging classes of NoSQL solutions, based on the primary data storage scheme, outlined below.
Emerging Classes of NoSQL Solutions

• Key Value Store. Description: generally stores data in memory; data is accessed using a hash key that generally retrieves the block of memory in which the data is located. These solutions are generally very fast, but are limited by the expense of main (chip-based) memory. Common use case: most often used as a cache for web apps. Examples: Amazon S3, Redis, Voldemort, memcached.

• Document Store. Description: stores XML or other forms of documents, normally with some limited SQL functionality to access the data; most solutions of this type also implement a shared-nothing parallel architecture. Common use case: apps that need access to documents without the need for predefined attributes. Examples: MarkLogic, CouchDB, MongoDB.

• Column Store. Description: solutions for very large data storage that still use some form of record with attributes; most are characterized by low-level (more difficult to program) APIs and support for parallelism through Map/Reduce. Common use case: very high access volume, high storage volume systems. Examples: Cassandra, Google BigTable, HBase.

Source: Cowen and Company, various industry presentations.
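A minimal sketch (ours, with toy in-memory replicas) of the relaxed transactional integrity many NoSQL systems adopt: a write lands on one node immediately and propagates to the other replicas afterwards, so readers may briefly see stale data:

```python
# Illustrative eventual-consistency sketch: a write is acknowledged by the
# primary replica immediately and copied to the other replicas afterwards.
import queue
import threading
import time

replicas = [{}, {}, {}]          # toy stand-ins for three servers
replication_log = queue.Queue()  # pending updates to ship to replicas 1 and 2

def write(key, value):
    replicas[0][key] = value             # acknowledged as soon as node 0 has it
    replication_log.put((key, value))    # propagation happens in the background

def replicator():
    while True:
        key, value = replication_log.get()
        time.sleep(0.1)                  # simulated network delay
        for replica in replicas[1:]:
            replica[key] = value         # eventually consistent

threading.Thread(target=replicator, daemon=True).start()

write("user:42", "John Doe")
print(replicas[1].get("user:42"))        # likely None: replica not yet caught up
time.sleep(0.3)
print(replicas[1].get("user:42"))        # now "John Doe"
```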
Data Management Providers and Solutions

The Database Establishment

Oracle

Enterprise software provider Oracle produces a range of tools for managing business data, supporting business operations, and facilitating collaboration and application development. Oracle also offers business applications for data warehousing, customer relationship management, and supply chain management. Despite significant acquisitions in the applications and hardware spaces, the database and middleware business is still Oracle's bread and butter, accounting for roughly 40% of the company's revenues.

• Oracle. The first commercial database to use SQL in 1979, Oracle's flagship database today is a jack of all trades. It handles unstructured data through its Oracle Text and Oracle Multimedia features, which are included in its Enterprise and Standard Editions, and handles machine data through other included features such as Clickstream. The database can also handle petabyte-size data sets, albeit currently not as efficiently as up-and-coming solutions. Oracle also offers its Universal Content Management middleware to help improve management of unstructured data.

• Exadata. Exadata is Oracle's database appliance based on its namesake database, with heavily modified hardware and software components designed to speed access to stored data. The system allows for parallelism through Oracle's RAC option and leverages solid state drives. Although data is still fundamentally stored in rows, it uses a clever storage scheme to take advantage of columnar compression.

• MySQL. Acquired by Oracle through its Sun purchase, MySQL is an open source database that does not use its own storage engine to manage how data is actually stored, but rather can use one of several commercially available or custom storage engines. The default storage engine is the InnoDB engine provided by Oracle. The flexibility and low cost offered by its open source nature and swappable storage engine have made MySQL a popular choice for large unstructured data deployments such as Facebook.

• TimesTen. Originally developed at HP Labs, TimesTen was spun off in 1996 and acquired by Oracle in 2005. TimesTen is an in-memory database popular in the telecommunications industry, where response times in the milliseconds or even microseconds are required. Unlike other in-memory databases or caches, such as the popular open source memcached, TimesTen is accessed with standard SQL. Today TimesTen can be implemented standalone or serve as an in-memory feature for the mainstream Oracle database.

In addition to the databases above, Oracle also maintains the RDB and open source BerkeleyDB databases. Oracle also has a middleware data cache product called Coherence that is very similar to memcached in function and can serve as a high-speed cache on top of its database products.
IBM

IBM is the world's largest provider of computer products and services. Among the leaders in almost every market in which it competes, the company makes mainframes and servers, storage systems, and peripherals. The company is also one of the largest providers of software (ranking #3, behind Microsoft and Oracle). Much like Oracle, IBM has a myriad of database and appliance offerings.

• DB2. One of the first databases to use SQL, IBM's flagship DB2 database is available in mainframe and Linux/Unix/Windows (LUW) flavors that are based on two different code bases (hence features in the mainframe version are not always available in the LUW versions and vice versa). DB2 is Oracle's main rival in the enterprise-class OLTP database arena. DB2 can store non-text unstructured data, but to our knowledge it does not offer additional functionality to help manage this data.

• Netezza. IBM subsidiary Netezza (acquired in 2010) produces and markets the TwinFin and Skimmer database appliances. These appliances target the analytics market using a version of row-based PostgreSQL distributed in a shared-nothing architecture. The TwinFin appliance can be scaled to handle petabyte-scale data.

• solidDB. solidDB is IBM's relational, in-memory database for which IBM touts microsecond query response times. Much like Oracle's TimesTen, solidDB can be deployed as a standalone database or used as a middleware cache. Current deployments include Cisco, HP, Alcatel, and Nokia.

In addition to the databases above, IBM also sells Informix as its database for embedded and smaller corporation/department-level applications. IBM also has its own distribution of the open source file management engine Hadoop under its InfoSphere suite, and has announced a $100M investment to research petabyte-scale data management.

SAP

With its 2010 acquisition of Sybase, SAP significantly expanded its previously small presence in the database market. SAP currently offers Sybase's ASE as its primary OLTP database and the IQ analytics database, alongside the SAP In-Memory Database on the HANA appliance platform.

• ASE. Adaptive Server Enterprise (ASE) is SAP's primary relational database management system for mission-critical transaction processing. ASE supports clustering on up to 32 servers/nodes, which can help its performance on large data sets.

• IQ. IQ is SAP's analytics database product. For many years, IQ was the only columnar database sold, although this has changed recently with the advent of many columnar database competitors including Aster Data, Vertica, Greenplum, ParAccel and Infobright. In addition to leveraging a columnar layout to optimize analytics performance, IQ customers can also choose its Multiplex option to further boost performance and capacity into the petabyte range.

• SAP In-Memory Database and HANA. SAP's in-memory database is primarily sold through its HANA database appliance platform. Data in HANA can be stored in both columnar format for analytics applications and row format for OLTP applications.
SAP intends to use HANA for its core OLTP applications. HANA platforms are currently being sold on HP and IBM hardware.

In addition, the company continues to market Sybase's RAP database, which is specifically targeted at the financial services industry, and SAP's prior in-house database, MaxDB.

Microsoft

Microsoft entered the data management arena in 1989 in a partnership with Sybase and Ashton-Tate, with code based on Sybase's Unix database (now Sybase ASE). Although widely considered an entry-level database, Microsoft has slowly added capabilities to this database, including text search. The database is scalable enough, and can handle unstructured data well enough, that it is used by the University of Hawaii's Institute for Astronomy to store photographs of the sky to facilitate its search for Earth-bound asteroids and comets. This project produces 1.4 terabytes of data per day, and the database is designed to handle up to 1.1 petabytes of data.

Teradata

Founded in 1979, Teradata develops and markets products specifically targeted at data warehousing and analytics. TDC was spun off from NCR in 2007 and recently acquired columnar data provider Aster Data.

• Teradata. Teradata's core products are based on the Teradata database, a shared-nothing parallel database intended for analytics. Teradata also markets branded appliances based on the Teradata database.

• Aster Data. Aster Data is an analytics-focused database that runs on a shared-nothing parallel architecture and scales into the petabyte range. It stores its data in both row and column format in order to allow users to optimize for various data access scenarios. To help manage parallel execution, the database also includes a built-in implementation of the MapReduce API and is compatible with Hadoop. Aster Data customers include Barnes & Noble, Myspace and Akamai.

Although these two sets of products seem to fulfill the same customer requirements, the row-based nature of the Teradata database makes it more suitable for operational and real-time functions (such as up-to-the-minute dashboards), while Aster Data's ability to store data in column format makes it more suitable for deeper and broader data analyses.

Ingres

Privately held Ingres develops, supports and markets the open-sourced Ingres database for transaction processing and the vector-processed columnar database VectorWise for analytics. After a stint under its umbrella, CA divested Ingres to private equity firm Garnett & Helfrich Capital in 2005.

• Ingres. Started as a research project at the University of California at Berkeley in the early 1970s, Ingres today is a commercially supported open-source SQL database targeted at commercial and government applications.
Ingres is fully open source with a growing global community of contributors, but its namesake company shepherds its development, producing and supporting certified binaries.

• VectorWise. VectorWise is Ingres' columnar database. The solution goes beyond columnar optimizations, like retrieving only the more selective data via columns and data compression, to add vector processing, i.e., batching similar operations together to speed up computations.

Other Large Companies with Solutions

HP (Vertica)

Founded in 2005, HP's Vertica subsidiary developed and first marketed its columnar database in 2007. In a few short years, Vertica successfully signed on significant blue-chip clients, including Verizon, Guess, Zynga, Capital IQ, Mozilla and Comcast. HP acquired Vertica in 2011. Vertica's database product is a columnar database targeted for use in analytics that runs on a shared-nothing parallel architecture. The database also integrates with Hadoop and is able to use Hadoop's MapReduce API implementation. To our knowledge, there are no petabyte-level implementations of Vertica at this time, although one of Vertica's customers is reportedly approaching this data volume.

EMC (Greenplum)

EMC's San Mateo, CA-based Greenplum subsidiary develops and markets the Greenplum database and a series of appliances built on it. The Greenplum database is based on the open-source database PostgreSQL, heavily modified to support a cohesive parallel computation structure along with a columnar structure. However, the Greenplum database itself is currently not open-sourced. Greenplum achieves its high performance by allowing administrators to use their choice of row-based or columnar storage. For example, a table can have newer data stored in rows to optimize transaction performance and older data stored in columns to optimize analytics performance. The database layers this flexible data structure onto the ability to take advantage of parallel computing on multiple servers using a shared-nothing architecture, further improving its performance on large data sets. To help control its parallel execution, Greenplum includes a native implementation of the MapReduce API.

Amazon (SimpleDB)

Introduced in 2007, SimpleDB is an Amazon Web Services (AWS) NoSQL distributed database that integrates with AMZN's Elastic Compute Cloud (EC2) and S3. As with EC2 and S3, Amazon charges fees for SimpleDB storage, transfer, and throughput over the Internet. SimpleDB is free for the first 1 GB of data and 25 machine hours per month.
Google (BigTable)

BigTable is a compressed, high-performance, proprietary database system built on the Google File System and a few other Google programs. It is currently not distributed or used outside of Google, although Google offers access to it as part of its Google App Engine.

Proprietary Upstarts

1010data

1010data was founded in 2000 by developers of large-scale data systems on Wall Street. The company developed a web-based service and the underlying software to acquire, organize, manage, and analyze large volumes of data. Its software is architected to handle multi-terabyte databases at a lower cost and with better performance than other data management approaches. 1010data customers include Bank of America, Credit Suisse, Equifax, Goldman Sachs, P&G, GameStop, UBS, Save Mart and Rite Aid.

Infobright

Canada's Infobright produces a database analytics engine targeted at machine data that can be used as the underlying database engine by MySQL (as we have previously noted, MySQL does not actually have its own data management engine and can integrate with several, including Infobright). The analytics engine is a columnar engine with a novel indexing scheme that saves space and, as a result, also reduces the internal communication load (CPU/disk) of the server. Its engine ships with a distribution of MySQL. Infobright's customers include Nokia Siemens Networks, 8x8, AdSafe and LiveRail.

GigaSpaces

Founded in 2000 and based in New York, NY, privately held GigaSpaces Technologies is a provider of virtualized application platforms. Its flagship product, eXtreme Application Platform (XAP), is an in-memory application server that includes a high-performance in-memory database. The company currently has over 350 customers worldwide, including Dow Jones, NYSE, Société Générale, Virgin Mobile, and Sears. GigaSpaces is funded by FTVentures, BRM Capital, Intel Capital, and Formula Vision.

ParAccel

Based in San Diego, CA, privately held ParAccel develops and markets a columnar analytic database that uses a shared-nothing architecture. Investors include MDV, Bay Partners, Walden International, Tao Ventures and Menlo Ventures. ParAccel customers include Merkle, OfficeMax and SecureAlert.

MarkLogic

Headquartered in Silicon Valley within earshot of database giant Oracle, MarkLogic develops and markets a database purpose-built for unstructured information. MarkLogic's server stores unstructured data using XML, a much more flexible semi-structured representation than the record structure used by mainstream record-based databases. It claims performance improvements of 10x to 100x over traditional databases when managing unstructured data, helped by its specialized engine, caching and compression.
MarkLogic ostensibly has developed a strong presence in the media market, with customers including the Oxford University Press, Pearson Education, Simon & Schuster and Zinio.com, among others. Other customer verticals include government and financial services. The company is privately held, with investors Sequoia Capital and Tenaya Capital.

Splunk

Founded in 2004 and headquartered in San Francisco, CA, Splunk produces and markets a search and analytics database focused on machine-generated data. Splunk's early success was based on providing users with search capabilities for their machine data. Customers use Splunk to index, search and report across their machine data to troubleshoot issues, find root causes, investigate security incidents and produce reports. With over 2,600 licensed customers, Splunk has the largest known number of commercial deployments of the MapReduce paradigm. Customers include BT, Chevron, Cisco, Comcast, Credit Suisse, Dow Jones, LinkedIn, Macy's, Motorola, NTT, New York Life, Pegasus Solutions, Raytheon, Swisscom, Symantec, T-Mobile, Telstra, Verizon, Visa, Vodafone and the U.S. Department of Energy.

Open Source

Given that multiple companies often support one or more open source solutions, we have listed the solutions and the companies separately in this section.

The Technologies

Open source database technologies are listed below. Proprietary databases with similar technological bents, such as Google BigTable and Amazon SimpleDB, are not open source and are listed in the Other Large Companies section.

Cassandra

Apache Cassandra is an open source distributed database management system designed to handle very large amounts of data spread out across many commodity servers while providing a highly available service with no single point of failure. Cassandra evolved from work at Google, Amazon and Facebook, and is in use at companies such as Twitter, Netflix, Rackspace and Cisco. Cassandra is progressing towards increasing compatibility with Hadoop. Companies that provide commercial products and support for Cassandra include DataStax and Acunu.

CouchDB

Apache CouchDB is an open source document-oriented database designed for local replication and to scale horizontally across a wide range of devices. CouchDB is not a relational database. Rather, it is designed to manage collections of JSON-formatted documents (JSON is a text-based open standard designed for human-readable data interchange). CouchDB development began around 2005 and became an Apache project in 2008.
It is currently maintained at the Apache Software Foundation with backing from IBM and is commercially supported by Couchbase and Cloudant.

Hadoop

A project of the Apache Foundation, Hadoop features a novel distributed file system and parallelization support via the MapReduce API. It is designed to scale to thousands of nodes and petabytes of data. Although the Hadoop release includes a high-performance distributed file system, it can work with other compatible file systems as well. While not a database in and of itself, Hadoop is seeing interest as a high-performance file system for storing and managing large numbers of unstructured data objects.

This framework can also underpin databases that provide more structured data management and query functionality, such as the Apache Foundation databases described below.

• HBase. HBase is an open source, distributed, versioned, column-oriented store modeled after Google BigTable. HBase can serve as both the input and output for MapReduce jobs run in Hadoop.

• Hive. Hive is an open source data warehouse infrastructure that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive projects a more formal data structure on top of the underlying Hadoop file system and provides query capability into this database with a SQL-like language.

Yahoo! is the largest contributor to Hadoop, and it is used extensively within the company. The company recently decided to spin off its Hadoop effort into a company called Hortonworks. In addition, Cloudera, Datameer and Hadapt currently provide products and support for Hadoop, while Cassandra support provider DataStax recently announced a support offering for a Hive/Hadoop/Cassandra distribution. IBM also has a distribution of Hadoop for which it provides support.

Hypertable

Hypertable is a high-performance open source distributed database inspired by Google's BigTable and designed to scale up to handle petabyte-size data sets. Hypertable is designed to manage the storage and processing of information on a large cluster of commodity servers, providing resilience to machine and component failures. Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). This database is relatively new among its NoSQL contemporaries and has not yet acquired a commercial support provider.

Membase

Created by the developers of the popular memcached data cache (used by YouTube, Zynga, Facebook and Twitter), Membase is an open source distributed NoSQL database optimized for storing data behind interactive web applications. Membase is designed to provide all the simplicity and performance of memcached while providing the persistence and query capabilities of a database. Membase is designed to scale from a single machine to very large-scale deployments.
Membase has been deployed by Zynga, The Knot, Tribal Crossing and others. Commercial maintenance and support for Membase is provided by Couchbase.

MongoDB

MongoDB is an open source, high-performance, schema-free, document-oriented database, managing collections of documents stored in a JSON format (a text-based open standard designed for human-readable data interchange). Its architecture allows data to be nested in complex hierarchies and still be queryable. Paid support is primarily provided by 10gen, which first released MongoDB in 2009.

PostgreSQL

The other major open source relational database (besides MySQL), PostgreSQL was developed by the original developers of Ingres with the intention of supporting different types of data objects, including binary objects such as pictures, sounds, or video. Greenplum's database is based on a heavily modified form of PostgreSQL. Other prominent users include Yahoo! and Heroku (now part of Salesforce.com). Commercial support is officially available from many different companies, including PostgreSQL Inc., EnterpriseDB, the Avivo Group and others.

Redis

Redis is an open source, networked, in-memory, NoSQL data store. Redis typically holds the whole dataset in RAM, but can be configured to use virtual memory. Although Redis can be used as a pure cache, we have included it here as it has facilities to copy out its cached contents for persistent storage. Development of Redis is sponsored by VMware.

Voldemort

Voldemort is an open source database developed at LinkedIn. It uses a parallel shared-nothing architecture where data is automatically partitioned so each server contains only a subset of the total data. This database is relatively new among its NoSQL contemporaries and has not yet acquired a commercial support provider.
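A minimal sketch (ours; the documents and fields are hypothetical, and a plain Python list stands in for a document database) of the document model used by stores such as CouchDB and MongoDB: records are self-describing JSON documents that can nest data and need no predefined schema:

```python
# Illustrative document-store sketch: schema-free JSON documents that nest
# related data, queried without a predefined table structure.
import json

collection = []  # stand-in for a document database collection

collection.append(json.loads("""
{
  "type": "order",
  "customer": {"name": "Jane Doe", "city": "New York"},
  "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}]
}
"""))
collection.append(json.loads(
    '{"type": "order", "customer": {"name": "John Doe", "city": "Boston"}, "items": []}'))

# Query: orders placed by customers in New York, despite no fixed schema.
ny_orders = [doc for doc in collection
             if doc.get("customer", {}).get("city") == "New York"]

print(len(ny_orders))  # 1
```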
The Companies

Company     | Open Source Technology
10gen       | MongoDB
Acunu       | Cassandra
Cloudera    | Hadoop
Cloudant    | CouchDB
Couchbase   | CouchDB, Membase
DataStax    | Cassandra, Hadoop/Hive
Datameer    | Hadoop
Hadapt      | Hadoop
Hortonworks | Hadoop
Source: Cowen and Company, company reports.

10gen

10gen develops the open source MongoDB database, and offers production support, training, and consulting for the database. The focus of the MongoDB project is to combine the best traits of the NoSQL model, including high scalability, performance, and certain aspects that ease development, with important features common in traditional databases, such as dynamic queries and indexes. 10gen customers include Shutterfly (SFLY), Intuit (INTU), Foursquare, IGN and SourceForge. 10gen is funded by Flybridge Capital Partners, Sequoia Capital, and Union Square Ventures.

Acunu

Founded in 2009 by researchers from Cambridge and Oxford Universities, London-based Acunu distributes and supports an open source storage platform that includes a version of Cassandra reworked to use Acunu's optimized storage software. Its storage platform is also compatible with Amazon S3. Acunu is backed by Eden Ventures, Oxford Technology Management and Pentech Ventures, along with support from Finance Southeast and the Carbon Trust.

Cloudant

Cloudant is a Massachusetts-based software company that provides hosting, administrative tools, analytics and commercial support for CouchDB and BigCouch.
Cloudant's CouchDB service offers customers the functionality of standalone CouchDB with the added advantage of customer data being redundantly distributed over multiple machines. Cloudant's software was motivated by the need to manage the multi-petabyte data sets from experiments such as the Large Hadron Collider. The company was formed in 2008. In addition to BigCouch and CouchDB, Cloudant supports open source projects including HAProxy, Opscode Chef, and Rebar. Cloudant customers include Meteor, Scoopler, Collecta and EasyBib. The company claims over 2,500 customers for its hosted service. Investors include Y Combinator and Avalon Ventures.

Cloudera

Cloudera is a provider of Apache Hadoop-based software and services and works with customers in the financial services, web, telecommunications, government and other industries. Cloudera's software distributions include Apache Hadoop and several integrated, complementary open source products in the Hadoop ecosystem, including the Hive and HBase databases. Its Enterprise distribution includes some proprietary Hadoop management tools. Cloudera has stated that it generates more revenue from its Hadoop tools business than from its Hadoop support business. Cloudera offers its core bundle of Apache Hadoop for free, but charges for its proprietary bundle and for support. Cloudera customers include AOL Advertising, comScore, Groupon, Samsung Bioinformatics, Trend Micro and Trulia. Cloudera is backed by Accel Partners, Greylock Partners, Meritech Capital and In-Q-Tel.

Couchbase

Based in Mountain View, CA, privately held Couchbase was formed from the merger of NoSQL database providers Membase and CouchOne. The combined company supports the open source Apache CouchDB document database, the memcached distributed caching technology and the Membase data flow and cluster management system. Couchbase solutions scale from large thousand-server clusters down to mobile phones and tablets. Couchbase customers include Zynga, AOL, the BBC and ShareThis. Although not customers of Couchbase, 18 of the top 20 largest websites, including Google, use its open source technologies. Couchbase backers include Accel Partners, Mayfield Fund, North Bridge Venture Partners and Redpoint Ventures.

Datameer

Founded in 2009 and based in San Mateo, CA, Datameer is led by enterprise software veterans in the distributed computing space. The founders and staff are highly experienced in delivering large-scale data analytics with Apache Hadoop, including a claimed 25,000-server deployment at Yahoo! Current Datameer customers include McAfee, Orange Labs and Nurago.

DataStax

Based in Burlingame, CA, privately held DataStax (formerly Riptano) provides commercial support for Apache Cassandra. The company also recently announced a Cassandra solution integrated with Hive and Hadoop. DataStax has over 50 customers, spanning verticals including web, financial services, telecommunications, logistics and government.
Customers include Constant Contact (CTCT), HP (HPQ), Netflix (NFLX) and Rackspace (RAX). The company is backed by Lightspeed Venture Partners, Sequoia Capital and Rackspace Hosting.

Hadapt

Hadapt develops and markets an analytical platform, built on Hadoop, for performing complex analytics on structured and unstructured data. Hadapt brings a more complete SQL interface, a built-in query load-balancing capability, and a hybrid storage engine to handle structured as well as unstructured data on a single platform. Its Adaptive Query Execution load-balancing capability allows analytical workloads to be automatically split between relational database engines and Hadoop to get the best possible performance out of the system.

Hortonworks

Yahoo! recently spun off its Hadoop development effort in a joint venture with Silicon Valley venture capital firm Benchmark Capital. The JV, named Hortonworks after the elephant from Dr. Seuss's books, will employ 25-30 current YHOO engineers.
Cataloging Software: Extracting Meaning from Unstructured Data

Although not currently as robust a market as the market for large data set solutions, we believe there is significant market potential in extracting meaning and context from unstructured data, particularly as companies seek to leverage the wealth of data about their products on the Internet to drive revenue.

Why? Even Unstructured Data Needs Structured Analysis.
The collection of unstructured data in and of itself is of little strategic use to a business (beyond consumption by its customers) if the business cannot analyze the data. Analysis is conducted around key parameters that describe the data, such as the color of the products being discussed in a blog review. It is very important to understand that without these parameters, there is no way a system can aggregate the information for meaningful presentation, and there is no way for a user to specify what they want to evaluate. For example, when picking out spots for advertising, a marketer may want to know what types of video content are being viewed on a site by age group. Even when just conceiving of running this analysis on unstructured data (e.g., video), the user has determined two parameters (age group and type of video) by which they'd like to analyze the data. A marketer would learn from the sample below that if they want to target teenagers, they are best served by placing their ads in music videos.

[Chart: Example: Categories of Videos Viewed by Age Group — share of views by content category (Music, Entertainment, People, News, Movies, Other) for the age groups <20, 20-35, 36-60 and >60. Source: Cowen and Company]

We note the example above combines data from the full spectrum of structured and unstructured data. The counts of user views are obtained from semi-structured log data, the age of the user comes from a structured database, and the type of video content is extracted from the unstructured video files.
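To make the mechanics concrete, the sketch below shows how such a breakdown could be assembled once the three inputs are available: view events from semi-structured logs, viewer ages from a structured customer table, and a content-type label extracted from the video files themselves. This is a minimal illustration using hypothetical field names and sample records, not a description of any particular vendor's pipeline.

    from collections import Counter, defaultdict

    # Hypothetical, simplified inputs.
    views = [("u1", "v1"), ("u1", "v2"), ("u2", "v1"), ("u3", "v3")]   # from semi-structured logs
    user_age = {"u1": 16, "u2": 42, "u3": 71}                          # from a structured database
    video_category = {"v1": "Music", "v2": "Movies", "v3": "News"}     # metadata extracted from video

    def age_group(age):
        if age < 20:
            return "<20"
        if age <= 35:
            return "20-35"
        if age <= 60:
            return "36-60"
        return ">60"

    # Count views by (age group, category), then express each count as a share of its age group.
    counts = Counter((age_group(user_age[u]), video_category[v]) for u, v in views)
    group_totals = defaultdict(int)
    for (group, _category), n in counts.items():
        group_totals[group] += n
    shares = {key: n / group_totals[key[0]] for key, n in counts.items()}
    print(shares)   # e.g. {('<20', 'Music'): 0.5, ('<20', 'Movies'): 0.5, ...}

Note that none of this is possible unless the content-type label has already been extracted from the unstructured source, which is the role of the cataloging software discussed below.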
We also note that the above analysis would not have been possible if the videos' content types were inaccessible for analysis. This content type data is typically referred to as metadata. The metadata can be created by the authors of the original content, but relying on human authors is unreliable at best (most may not even bother to specify a content type for the video).

The Market for Meaning
We believe that human sources of data may or may not appropriately tag their information. Given this human unreliability, the need to analyze unstructured data and the exploding volumes of unstructured data can eventually drive strong demand for automated means of extracting and cataloging metadata from unstructured data.

Autonomy estimates the market potential for unstructured text data alone at $20B; the company was able to grow revenues 18% to $870M last year, and expectations are for 12% growth in 2011. We believe that a similar size market for non-text unstructured data (audio, image and video) could arise as well, suggesting a total unstructured data market of $40B.

We note that businesses built around vertical-specific unstructured data applications have emerged as non-text unstructured data volumes have grown. Attensity and Clarabridge are full-package text analysis applications, while fellow startup VideoMining analyzes consumer behavior from in-store video against characteristics such as customer destinations, customer shopping time and others for retailers and manufacturers. Witness Systems (VRNT), NICE Systems (NICE) and RightNow (RNOW) all have applications that allow for analysis of audio data from contact centers, while Amazon's SnapTell provides a photo recognition application for marketers. We believe the success of these vertical applications underscores the nascent potential for other companies to exploit and analyze the non-text unstructured data they collect from their customers and the Internet.

The Players
For the purposes of this analysis, we focus on enterprise platforms that allow users to catalog and extract metadata to perform unstructured data analysis. We do not include purpose-built analytics applications such as the previously mentioned solutions from Attensity, Clarabridge, VideoMining, AMZN, VRNT, NICE and RNOW. We also exclude consumer solutions such as photo-recognition software from Apple and Adobe.

Larger Providers

Autonomy
Founded in 1996, UK-based Autonomy has leveraged technologies conceived at Cambridge University to become Europe's second-largest software vendor (after SAP). The company's Intelligent Data Operating Layer (IDOL) is a database for unstructured information. Its IDOL Server collects indexed data from connectors and stores it in its own structure, optimized for processing and retrieval of data. IDOL Server customers include DLA Piper, ABN Amro and KPMG.
IBM
With roughly $100B in annual sales, IBM is one of the world's largest providers of computer products and services. Among the leaders in almost every market in which it competes, the company is one of the largest providers of software (ranking #3, behind Microsoft and very slightly behind Oracle). Its Content Analytics application classifies and extracts metadata from unstructured text content. Metadata is made available through IBM's open source Unstructured Information Management Architecture framework.

SAP
SAP is the world's largest enterprise business applications firm. The SAP BusinessObjects Text Analysis application is designed to assist analysis of unstructured text sources such as blogs, Web sites, e-mails, support logs, research, and surveys. This software can extract, categorize, and summarize freeform text sources to identify concepts, sentiments, people, organizations, places, and other information in over 30 languages.

SAS
Founded in 1976, SAS Institute (Cary, NC) is the world's largest privately held software company and, with over $1.3B in revenues, the biggest BI pure play as well. SAS' Text Analytics products allow for the extraction of metadata, including sentiment, from Web sites and social media outlets, as well as internal organizational text sources. Text Analytics customers include American Honda, Sub-Zero, Whirlpool and AFA Insurance.

Startups

CallMiner
Founded in 2002 in the UK, CallMiner has expanded its presence to the US. CallMiner's speech analytics products automatically convert customer interactions into searchable, measurable and reportable data. Its applications and APIs generate metadata from audio conversations, including reasons for the call, competitors mentioned, level of customer agitation, "hot" topics, and other categorical information. The company's APIs allow for both extraction of the audio file metadata and direct search from the company's proprietary application. Customers include British Gas, Sentry Credit, Comcast, Microsoft and Daimler. Investors in the company include The Florida Growth Fund, the CIA's In-Q-Tel venture fund, Inflexion, Intersouth Partners, Sigma Partners and Village Ventures.

IQ Engines
IQ Engines' Vision-as-a-Service API takes in photos, identifies the key content, and returns key content metadata as text. The company has also released an iPhone application called oMoby that, among other things, serves as a demonstration of the company's technology. IQ Engines' verticals include retailers, photo publishers and mobile app developers. IQ Engines was founded in 2008 as a collaboration of computational neuroscientists at UC Berkeley and UC Davis.
kooaba
kooaba was founded in 2006 as a spin-off company from the Swiss Federal Institute of Technology's Computer Vision Lab. kooaba has made its patent-pending Smart Visuals recognition platform available as a service via API, enabling enterprises to build solutions based on image recognition. Subscription pricing for the API currently starts at roughly $2,000 per month. kooaba also develops mobile applications that allow advertisers to interact with consumers via photos taken from their smartphones. Platform customers include iPhone app developer Monarchy and Swiss comparison shopping provider Comparis.

Leximancer
Incorporated in 2007 after several years of incubation at the University of Queensland in Australia, Leximancer provides applications and technology for the analysis of unstructured, qualitative, textual data. Its applications are offered to customers in both on-demand and on-premise versions. Its Enterprise on-premise offering includes APIs which allow customers to extract metadata from unstructured text. Customers include Aviva and the US Army.

Megaputer
Founded in 1997 and headquartered in Bloomington, IN, Megaputer Intelligence develops data and text mining software for predictive modeling and knowledge discovery in large volumes of structured data and unstructured text. Its TextAnalyst software summarizes text, including keywords that distill the meaning of the text. The company has 500 customers including government offices, consulting and law firms, medical centers, scientific organizations, electronic book publishers, customer support centers, and political institutions. Customers include Cambridge Technology Partners, Electronic Data Services and Southwest Airlines.

Nexidia
Founded in 2000 to market technology developed at Georgia Tech, Nexidia develops phonetic-based technology that can analyze, monitor, search, and archive text, audio and video content. The company's products simplify and automate extraction of metadata from video, audio calls, chat, text, blogs, email and social forums. The company sells its solutions in the contact center, media, legal and government vertical markets. Customers include Microsoft, Endeca, EarthLink, Vonage, and several state Blue Cross/Blue Shield plans.
Appendix: System Costs for Big Data

In this section, we compare the costs of implementing several types of applications on Oracle systems and on generic hardware. To simplify our analysis, and as we do not have publicly accessible OEM hardware support costs for the generic hardware systems, we generally exclude hardware support costs. However, we note that Oracle hardware is generally more expensive than the equivalent generic systems, and we therefore infer that Oracle hardware support is more expensive as well.

A Database for Marketing Analysis Incorporating External Data
It is common practice today for major corporations to analyze their own (internal) order data to more effectively sell additional products to their own customers. Companies accomplish this today using existing database and data mining technologies, along with custom solutions leveraging various parallel algorithm implementations (including Hadoop). The latter approach is often used by very high-volume retailers such as Amazon and eBay, or for analysis of very specific data such as music patterns (Pandora) or video content (Netflix).

Today, we envision that the treasure trove of information consumers intentionally and unintentionally drop in their online commerce, gaming, social and mobile activities will prompt corporate interest in analysis of a broader and richer data set in order to determine where to allocate marketing dollars and to develop targeted marketing campaigns, including making purchase recommendations. For example, a department store may discover that its customer (named John) plays shooters on Xbox Live and writes a blog about paintball guns. The store can then recommend army fatigues for John's weekend paintball forays. The store may also determine that the best way to prompt a purchase from John (since he is perpetually playing video games or out in the paintball arena) is through in-game ad placement in the forthcoming expansion pack for Call of Duty.

We believe that early interest in mining this external social information is evident in big data marketing-vertical customers such as Aster Data customers Acerno, InsightExpress, Mobclix, Specific Media, and Buddy Media; Cloudera customers AOL Advertising, AdGooroo, Adconion, and comScore; Cloudant customer Meteor; and Couchbase customers AdAction and Red Aril. We therefore decided to illustrate the cost advantages of emerging big data solutions in a situation where this analysis of a company's information found on the public Internet is brought in house by the company.

Marketing Analysis System Cost: Oracle Exadata vs. Open Source, Commodity HW
                                              | Capital Expenditures ($M)   | Annual Expenses, ex HW Support ($M)
System                                        | Hardware | Software | Total | Staff | Support/Subs | Total
Oracle Exadata                                | $10.3    | $23.2    | $33.5 | $0.3  | $5.1         | $5.4
Alternative: Open Source, Commodity Hardware  | $3.4     | $0.0     | $3.4  | $0.3  | $0.5         | $0.8
Source: Cowen and Company, Company Webstores.
In this particular use case, we see that the initial capital investment for an Oracle system is roughly 10x that of a commodity system, and ongoing costs are roughly 7x those of the commodity system as well.

Furthermore, while both systems have similar networking and disk capacities, the open source system has twice as many CPUs (16 for open source vs. 8 for Exadata), which means that the open source system has double the potential computing capacity for more intensive analysis, although both systems should theoretically have enough horsepower to satisfy a handful of analysts. Providing similar computing capacity to the Exadata system would add another $8M in upfront costs and $1.5M in ongoing costs.

Our analysis assumes that the system is loaded by a web crawler that can provide metadata as well. We assume that the crawler can go through 60 GB of structured and unstructured data per hour. This is in line with the maximum data throughput of an Autonomy IDOL server. We assume Autonomy or a similar system is used to generate metadata to help catalog and identify the unstructured data to be stored, but we do not price this feeder into the solution.

For both systems, we conservatively assumed the most data-intensive scenario, where 100% of the data analyzed is relevant to the company and stored for analysis, including the source document. With this conservative assumption, our storage estimate of 0.5 PB should be able to hold one year's worth of data. However, given that not all data analyzed will be relevant, the system will likely hold data for a longer time period.

We also use list pricing to ascertain each system's cost. While not necessarily representative of the final price tag of such systems (due to frequent discounting in enterprise system deals), we believe that the price tags above are roughly representative of the relative cost of one system compared to the other.

We exclude middleware from this analysis as we assume that the database will be accessed directly for query/analysis purposes.

Analysis on this system could include sentiment relative to product introductions and ad campaigns that the company or its competitors run; analysis of the demographics of people reviewing or purchasing a company's products and competing products, to supplement data the company gets from its product registrations; analysis of feedback (ideas, complaints) associated with products; and identification of key purchasing influencers, whether they be campaigns, individual customers, trade publications, etc.

We also assume the software stack is as follows.

Software Stack Assumed in our Cost Model
Tier               | Oracle       | Open Source
Database           | Oracle       | MySQL
Optimized Storage  | Exadata      | Hadoop
Operating System   | Oracle Linux | RHEL
Source: Cowen and Company.
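As a quick check on the 0.5 PB storage figure, the arithmetic behind the sizing assumptions above (a crawler ingesting 60 GB per hour, with everything retained for one year) can be reproduced in a few lines. This is a sketch of our stated assumptions only, not a vendor specification.

    # Storage sizing for the marketing analysis system (assumptions stated in the text above).
    CRAWLER_THROUGHPUT_GB_PER_HOUR = 60    # in line with a single Autonomy IDOL server
    RETENTION_DAYS = 365                   # one year of data kept; 100% assumed relevant

    daily_ingest_tb = CRAWLER_THROUGHPUT_GB_PER_HOUR * 24 / 1000   # ~1.4 TB/day
    annual_storage_tb = daily_ingest_tb * RETENTION_DAYS           # ~526 TB, i.e. roughly 0.5 PB

    print(f"Daily ingest: {daily_ingest_tb:.1f} TB/day")
    print(f"One year of data: {annual_storage_tb:.0f} TB (~0.5 PB)")

These are the "Required Capacity" figures (roughly 1.4 TB/day of I/O and roughly 527 TB of storage) that drive the hardware tables that follow.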
Costing Out the Exadata System

Hardware Costs
To manage and analyze half a petabyte of data, we calculate that a system with two Exadata machines and over 100 Exadata storage cells will be required. This adds up to a $10M system for the core hardware setup alone at current list prices.

Server Hardware Requirements for Marketing Analysis System Using Oracle Exadata
Item               | I/O (TB/day) | Storage (TB) | Machines Required | Scaled for Peak | Price Each ($M) | Price Total ($M)
Required Capacity  | 1.4          | 527.0        |                   |                 |                 |
Exadata X2-8       | 120.0        | 93.3         | 1                 | 2               | $1.5            | $3.0
Exadata Cells      |              | 6.7          | 65                | 130             | $0.1            | $7.2
InfiniBand Switch  | 648          |              | 1                 | 2               | $0.1            | $0.1
Total              |              |              |                   |                 |                 | $10.3
Source: Cowen and Company, Company Reports and Whitepapers.

We believe the system spec'd above has sufficient processing power for analytics purposes, given that recommendation engine implementations for the Netflix Prize competition could generate purchase recommendations for over ten movies per second when projected onto the number of CPU cores in the system above.

Software Costs
As with many Oracle systems, software cost is often a large component of the investment. Below, we price out the costs of the Oracle software needed for the above system.
Software License and Support Costs for Oracle-Based Marketing Analysis System

Database Software (per machine)
  CPUs per Exadata server                               16
  x Cores per CPU                                       8
  x Core factor (Intel 7560)                            0.5
  x Price of Database Enterprise Edition w/ options     $107,000
  = Cost per Exadata machine ($M)                       $6.8
  x Number of Exadata machines                          2
  = Cost of database licenses ($M)                      $13.7

Operating System Software
  Total servers                                         2
  x Price of Linux Ultimate Edition                     $2,299
  = Total Linux support costs ($M)                      $0.0

Total license cost ($M, ex storage)                     $13.7

Exadata Storage Software
  Total Exadata cells                                   79
  x Disks per cell                                      12
  = Total disks                                         948
  x Price of Exadata software per disk                  $10,000
  = Cost of Exadata storage licenses ($M)               $9.5

Total license cost ($M)                                 $23.2

Maintenance Cost
  Annual maintenance cost ($M)                          $5.1

Source: Cowen and Company, Company Reports and Whitepapers.

We note that total license costs of $23.2M and ongoing maintenance costs of $5.1M more than triple the up-front cost of this system relative to the hardware alone.

Other Ongoing Costs
In addition to the aforementioned software maintenance costs of $5.1M, we estimate that, given the number of hardware boxes, the system will require a full-time system administrator and a full-time DBA to keep the system up and running. Based on going rates for this expertise, we estimate a total annual cost of $300K for this personnel.

Costing Out the Commodity HW/Open Source SW System

Hardware Costs
To manage and analyze half a petabyte of data, we estimate that a system with roughly 160 hardware boxes will be required, with most of these boxes being storage servers. This adds up to a $3.4M system for the core hardware setup alone at current list prices.
Server HW Requirements for Marketing Analysis System Using Generic HW and Open Source SW
Item                                                             | I/O (TB/day) | Storage (TB) | Machines Required | Scaled for Peak | Price Each ($M) | Price Total ($M)
Required Capacity                                                | 1.4          | 527.0        |                   |                 |                 |
Full Rack HP BL620c - File Server (8 x 1-CPU servers/rack)       | 1,728.0      |              | 1                 | 2               | $0.1            | $0.2
Full Rack HP BL620c - Analytics Server (8 x 2-CPU servers/rack)  | 1,728.0      |              | 1                 | 2               | $0.7            | $1.4
HP StorageWorks D2200sb (Direct Attach, 2 per BL620c)            | 1,493.0      | 7.2          | 74                | 148             | $0.0            | $1.7
4X QDR InfiniBand Switch Module for c-Class BladeSystem          |              |              | 3                 | 6               | $0.0            | $0.1
Total                                                            |              |              |                   |                 |                 | $3.4
Source: Cowen and Company, Company Reports and Whitepapers.

We believe the system spec'd above has sufficient processing power for analytics purposes, given that recommendation engine implementations for the Netflix Prize competition could generate purchase recommendations for over 20 movies per second when projected onto the number of CPU cores in the system above.

Software Costs
The following is our estimate for the ongoing annual subscription costs for our open source platform marketing analysis system.

Subscription Costs for Open Source-Based Marketing Analysis System

Data Management Software
  Number of data management server racks                4
  x Servers per rack                                    8
  x Annual subscription per server, Datameer Hadoop     $12,000
  = Cost of data management support ($M)                $0.4

Operating System Software
  Number of racks                                       4
  x Servers per rack                                    8
  x Price of Red Hat Enterprise Linux support           $2,598
  = Cost of operating system support ($M)               $0.1

Total support cost ($M)                                 $0.5

Source: Cowen and Company, Company Reports and Whitepapers.

Unlike the previously described Oracle system, the open source-based system does not carry an up-front software license cost.

Ongoing Costs
In addition to the annual software subscription cost above of $500K, we estimate that, given the number of hardware boxes, the system will require a full-time system administrator and a full-time DBA/Hadoop manager to keep the system up and running. Based on going rates for this expertise, and assuming that the Hadoop manager makes a salary roughly the same as that of a DBA, we estimate a total annual cost of $300K for this personnel.
An Application for Analyzing Web Log Machine Data

To illustrate a use case for machine data, we posit a system that can manage web server log data generated from a web application that processes up to 500M transactions per day. This volume is based on Salesforce.com's reported transaction volumes. Assuming each transaction generates roughly 1 KB of log data, this transaction volume translates to a daily upload rate of 500 GB per day. According to Splunk documentation, four or more servers can manage 100 GB per day; hence, the Splunk system will require 20 or more servers. We assume the system will store one year's worth of data, or roughly 200 TB of data. Based on these parameters, one half-rack Exadata machine should be able to handle this data volume.

Cost Comparison of Web Log Analysis System Using Oracle Exadata vs. Splunk on Commodity HW
                                              | Capital Expenditures ($M)   | Annual Expenses ($M)
System                                        | Hardware | Software | Total | Staff | Support/Subs | Total
Oracle Exadata                                | $2.0     | $6.6     | $8.5  | $0.2  | $1.4         | $1.6
Alternative: Open Source, Commodity Hardware  | $0.6     | $0.0     | $0.6  | $0.2  | $0.1         | $0.2
Source: Cowen and Company, Company Websites.

In this particular use case, we see that the initial capital investment for an Oracle system is roughly 14x that of a commodity system, and ongoing costs are roughly 8x those of the Splunk/commodity hardware system as well.

We also note that, in addition to its platform, Splunk comes pre-packaged with its own BI reports and dashboard. The Oracle system has no out-of-the-box equivalent, so to make for an apples-to-apples comparison we include the cost of Oracle BI Enterprise Edition. There is some incremental implementation cost we did not include in our estimate, as proper costing of such a project requires additional details that are beyond the scope of this note. Regardless, this additional up-front implementation expense would only serve to increase the cost discrepancy between the systems.

Given the fairly predictable nature of this use case, we do not double the machine requirements to handle volume spikes as we do with the other use cases illustrated in this note; if logs from the source web servers build up faster during peak periods than the system can handle, the system can work through this backlog at off-peak times.

We assume the software stack is as follows.

Software Stack Assumed in our Cost Model
Tier                      | Oracle                                 | Splunk
BI Application/Dashboard  | Oracle BI Suite EE/Custom Development  | Splunk
Data Management           | Oracle                                 | Splunk
Optimized Storage         | Exadata                                | Hadoop
Operating System          | Oracle Linux                           | RHEL
Source: Cowen and Company, Company Websites.
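A short sketch of the sizing logic above: 500M transactions per day at roughly 1 KB of log data each, Splunk's documented guideline of four or more servers per 100 GB of daily volume, and one year of retention. The figures are the assumptions stated in the text, not measured values.

    # Web log analysis system sizing (assumptions stated in the text above).
    TRANSACTIONS_PER_DAY = 500_000_000
    LOG_BYTES_PER_TRANSACTION = 1_000      # ~1 KB of log data per transaction
    SERVERS_PER_100GB_PER_DAY = 4          # per Splunk sizing documentation
    RETENTION_DAYS = 365

    daily_gb = TRANSACTIONS_PER_DAY * LOG_BYTES_PER_TRANSACTION / 1e9   # 500 GB/day
    splunk_servers = SERVERS_PER_100GB_PER_DAY * daily_gb / 100         # ~20 servers
    storage_tb = daily_gb * RETENTION_DAYS / 1000                       # ~180 TB, roughly 200 TB

    print(f"{daily_gb:.0f} GB/day, ~{splunk_servers:.0f} indexing servers, ~{storage_tb:.0f} TB per year")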
Costing Out the Exadata System

Hardware Costs
Based on its specifications, we infer that a single half-rack Exadata box can manage and analyze 200 terabytes of web log data. Along with additional storage, this adds up to a $2M system for the core hardware setup at current list prices.

Server Hardware Requirements for Web Log Analysis System Using Oracle Exadata
Item                      | I/O (GB/day) | Storage (TB) | Machines Required | Price Each ($M) | Price Total ($M)
Required Capacity         | 500.0        | 200.0        |                   |                 |
Exadata X2-2 (Half Rack)  | 60,000.0     | 46.7         | 1                 | $0.7            | $0.7
Additional Exadata Cells  |              | 6.7          | 24                | $0.1            | $1.3
Total                     |              |              |                   |                 | $2.0
Source: Cowen and Company, Company Reports and Whitepapers.

Software Costs
As with the other Oracle systems, software cost is often a large component of the investment. Below, we price out the Oracle software for the machine data system.

Software Costs for Web Log Analysis Using Oracle

Database Software (per machine)
  CPUs per Exadata server                               8
  x Cores per CPU                                       4
  x Core factor (Intel X5440)                           0.5
  x Price of Database Enterprise Edition w/ options     $107,000
  = Cost per Exadata machine ($M)                       $1.7
  x Number of Exadata machines                          1
  = Cost of database licenses ($M)                      $1.7

Operating System Software
  Total servers                                         1
  x Price of Linux Ultimate Edition                     $2,299
  = Total Linux support costs ($M)                      $0.0

Exadata Storage Software
  Total Exadata cells                                   38
  x Disks per cell                                      12
  = Total disks                                         456
  x Price of Exadata software per disk                  $10,000
  = Cost of Exadata storage licenses ($M)               $4.6

BI Software
  Oracle Business Intelligence Suite Enterprise Edition Plus ($M)   $0.3

Total license cost ($M)                                 $6.6

Maintenance Cost
  Annual maintenance cost ($M)                          $1.4

Source: Cowen and Company, Company Reports and Whitepapers.
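The Oracle database license math in this table (and in the earlier marketing analysis table) follows the same core-factor formula: CPUs per server x cores per CPU x core factor x per-core list price, multiplied by the number of machines. A minimal sketch of that arithmetic, using only the list prices and core factors shown in the tables:

    # Oracle database license cost per the core-factor formula used in the tables above.
    def oracle_db_license_cost_musd(cpus_per_server, cores_per_cpu, core_factor,
                                    price_per_core, machines):
        """Return the total database license cost in $M."""
        licensable_cores = cpus_per_server * cores_per_cpu * core_factor
        return licensable_cores * price_per_core * machines / 1e6

    # Marketing analysis system: 2 Exadata X2-8 machines (Intel 7560, core factor 0.5).
    print(oracle_db_license_cost_musd(16, 8, 0.5, 107_000, 2))   # ~13.7 ($M)

    # Web log system: 1 half-rack Exadata X2-2 (Intel X5440-class, core factor 0.5).
    print(oracle_db_license_cost_musd(8, 4, 0.5, 107_000, 1))    # ~1.7 ($M)

The Exadata storage software line items work the same way on a per-disk basis: cells x 12 disks per cell x $10,000 per disk.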
Our analysis suggests that the Oracle software licenses for the system will cost over $6M. We believe this is conservative given that the Oracle system needs to have a custom application built on top of it, at a cost of several hundred thousand dollars, to make it functionally equivalent to Splunk's out-of-the-box offering.

Other Ongoing Costs
In addition to the aforementioned software maintenance costs of $1.4M, we estimate that, given the number of hardware boxes, the system will require a full-time system/database administrator to keep the system up and running. Based on going rates for this expertise, we estimate an annual cost of $150K for this staff member.

Costing Out the Splunk System

Hardware Costs
Given that four or more Splunk servers can manage 100 GB of web log data per day, we estimate that our system will require roughly 20 hardware boxes. Based on the base server specs for Splunk, we have priced out a system with three racks of eight servers using HP hardware (24 servers in total). Each server costs roughly $15,000, including requisite networking, racks and power supplies, with each full rack costing roughly $120K. In addition, the system requires 26 of HP's storage servers to store one year's worth of web log data. This adds up to a $600K system at current list prices.

Server Hardware Requirements for Web Log Analysis System Using Splunk
Item                                                             | I/O (GB/day) | Storage (TB) | Machines Required | Price Each ($M) | Price Total ($M)
Required Capacity                                                | 500.0        | 200.0        |                   |                 |
Full Rack HP BL620c - Analytics Server (8 x 2-CPU servers/rack)  | 100.0        | 0.5          | 3                 | $0.1            | $0.3
HP StorageWorks D2200sb (Direct Attach, up to 2 per BL620c)      |              | 7.2          | 26                | $0.0            | $0.3
4X QDR InfiniBand Switch Module for c-Class BladeSystem          |              |              | 1                 | $0.0            | $0.0
Total                                                            |              |              |                   |                 | $0.6
Source: Cowen and Company, Company Websites.

Software Costs
The following is our cost estimate for the subscription software required in this system, including costs for Splunk and the underlying operating system.
Software Costs for Web Log Analysis Using Splunk

Data Management Software (Splunk)
  Number of Splunk server racks                              3
  x Servers per rack                                         8
  x Annual Splunk subscription per server                    $2,000
  = Cost of data management support ($M)                     $0.0

Operating System Software
  Number of racks                                            3
  x Servers per rack                                         8
  x Price of Red Hat Enterprise Linux HPC Head Node support  $2,598
  = Cost of operating system support ($M)                    $0.1

Total support cost ($M)                                      $0.1

Source: Cowen and Company, Company Reports and Whitepapers.

Annual software subscription costs are only roughly $100K for the Splunk system.

Ongoing Costs
In addition to the aforementioned software subscription costs of $100K, we estimate that, given the number of hardware boxes, the system will require a full-time system/database administrator to keep the system up and running. Based on going rates for this expertise, we estimate an annual cost of $150K for this staff member.

A Database for Search and Provisioning of Unstructured Data

To illustrate the relative unsuitability of conventional relational databases for provisioning the massive amounts of data which we believe corporate data sets will achieve, we examine implementing a system that can store and deliver documents in the multi-petabyte range. YouTube serves as our storage and throughput benchmark, as it delivers this amount of unstructured data today. We model the system using Oracle's Exadata server, each of which can take in up to 120 terabytes (TB) per day and output up to 4 petabytes per day, and we also cost out a system with similar performance using commodity hardware and open source software. Our analyses that follow in this section indicate that an Exadata system would need a capital investment of nearly $600M at list prices, with an annual operating cost of roughly $95M. Roughly three-quarters of the up-front investment would be for software licenses. In contrast, a comparable commodity hardware system would cost roughly $104M up front, with annual operating costs of roughly $15M.
Big Data Provisioning System: Oracle Exadata vs. Open Source, Commodity HW
                                              | Capital Expenditures ($M)     | Annual Expenses, ex HW Support ($M)
System                                        | Hardware | Software | Total   | Staff | Support/Subs | Total
Oracle Exadata                                | $147.4   | $442.0   | $589.4  | $1.6  | $97.4        | $99.0
Alternative: Open Source, Commodity Hardware  | $104.2   | $0.0     | $104.2  | $2.2  | $12.9        | $15.1
Source: Cowen and Company, Company Webstores.

Both of our estimates assume YouTube's current throughput, which has climbed to the point where users are uploading 48 hours of video every minute and viewing 3B videos per day. Very conservatively assuming that the average video is two minutes in length (roughly 15 MB for relatively low-quality video), these equate to roughly 31 terabytes (TB) of video uploaded per day (about 3 x 10^13 bytes) and 45 PB viewed/downloaded per day (about 4.5 x 10^16 bytes); the arithmetic is sketched after the hardware table below.

We assume the software stack is as follows.

Software Stack Assumed in our Cost Model
Tier               | Oracle    | Open Source
Middleware         | WebLogic  | JBoss
Database           | Oracle    | MySQL
Optimized Storage  | Exadata   | Hadoop
Operating System   | Oracle    | RHEL
Source: Cowen and Company.

Costing Out the Exadata System

Hardware Costs
We estimate that a system that can handle the average consumption outlined above would require at least 13 full-rack Exadata database machines at $1.5M apiece. However, a system typically needs at least twice as many servers to handle peaks in demand; therefore, this system would likely need at least 26 Exadata machines just to handle its volume. In addition, the system would need roughly 44 Exalogic devices (22 application servers and 22 web servers, after the same doubling) to serve this data to users, at roughly $1.1M apiece. This adds up to a conservative $86M for the core hardware setup alone, excluding networking and additional storage.

Server Hardware Requirements for Reproducing YouTube Using Oracle Exadata
Item                           | Bandwidth In (TB/day) | Bandwidth Out (TB/day) | Machines by Input | by Output | by Storage | Min Required | x2 for Peak (excl Storage) | Price Each ($M) | Price Total ($M)
Target Average Bandwidth       | 31                    | 45,000                 |                   |           |            |              |                            |                 |
Exadata Storage                | 120                   | 3,500                  | 1                 | 13        | 1,009      | 1,009        | 1,009                      | $0.1            | $55.5
Exadata - Maximum Bandwidth    | 120                   | 3,500                  | 1                 | 13        |            | 13           | 26                         | $1.5            | $39.0
Exalogic - App Server          | 4,300                 | 4,300                  | 1                 | 11        |            | 11           | 22                         | $1.1            | $23.7
Exalogic - Web Server          | 4,300                 | 4,300                  | 1                 | 11        |            | 11           | 22                         | $1.1            | $23.7
InfiniBand Switch              | 648                   |                        |                   |           |            | 3            | 6                          | $0.1            | $0.4
QDR InfiniBand Gateway Switch  | 864                   |                        | 1                 | 52        |            | 52           | 104                        | $0.1            | $5.2
Total                          |                       |                        |                   |           |            | 1,099        | 1,189                      |                 | $147.4
Source: Cowen and Company, Company Reports and Whitepapers.
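The bandwidth targets in the table above follow directly from the consumption assumptions stated earlier (48 hours of video uploaded every minute, 3B views per day, an average two-minute video at a 1 Mbps bit rate). A sketch of that arithmetic, using only the assumptions from the text:

    # Upload/download bandwidth for the provisioning use case (assumptions stated in the text above).
    UPLOAD_HOURS_PER_MINUTE = 48
    VIEWS_PER_DAY = 3_000_000_000
    AVG_VIDEO_MINUTES = 2
    BIT_RATE_MBPS = 1                                # => ~7.5 MB per minute of video

    mb_per_video_minute = BIT_RATE_MBPS * 60 / 8     # 7.5 MB
    minutes_uploaded_per_day = UPLOAD_HOURS_PER_MINUTE * 60 * 60 * 24
    upload_tb_per_day = minutes_uploaded_per_day * mb_per_video_minute / 1e6              # ~31 TB/day
    download_pb_per_day = VIEWS_PER_DAY * AVG_VIDEO_MINUTES * mb_per_video_minute / 1e9   # ~45 PB/day

    # Incremental storage: new Exadata cells needed per day at ~6.7 TB per cell (per the tables above).
    cells_per_day = upload_tb_per_day / 6.7          # ~4-5 cells per day

    print(f"Upload ~{upload_tb_per_day:.0f} TB/day, download ~{download_pb_per_day:.0f} PB/day, "
          f"~{cells_per_day:.1f} new storage cells per day")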
We note that storage is a major expense. Assuming that the system needs the high performance option disks in order to manage the I/O load required above, and assuming zero compression, storing data for the system in an Exadata-like setup would require over 1,000 Exadata storage cells (above and beyond the cells included in the Exadata boxes), costing over $55M to store our estimate of roughly 9 PB of data. In addition to annual support costs, one would need to add 4-5 Exadata cells per day to keep up with the current upload rate of 48 hours per minute, to the tune of $260K-$325K per day, including software.

We do note that compression can be applied to the video data, although we have conservatively not assumed any compression for this example. Although Oracle claims up to 10x compression, this is for text storage, and the degree to which video files can be compressed is less certain.

Furthermore, we believe this analysis understates the cost given very conservative assumptions on average video size and bit rate, and the fact that we assume that the throughput requirements of YouTube can be handled by the lower-cost storage options for Exadata. In this particular example, companies with unstructured data storage and throughput volumes like YouTube's are likely spending more than our estimate to accommodate the higher bandwidth and storage requirements of high definition video. Our analysis also does not include build-out costs for the data center and various additional sundry requirements like power supplies and networking equipment.

We note that there could be significant variance in our estimates, as bit rates can vary from about 250 Kbps for teleconferencing-quality video to 10 Mbps for high definition. We assume a bit rate of 1 Mbps in our estimates. We conservatively assume an average video length of 2 minutes, which is lower than our rough calculation of a 4-minute average for the 100 most popular videos on YouTube.

Software Costs
At first glance, total core hardware costs of roughly $150M, just roughly 5% of Google's current CapEx, seem reasonable. This line of thinking lasts until Oracle presents the bill for its software: a not-insignificant $442M for database, middleware and Exadata storage licenses, bringing the total up-front investment to nearly $590M.
Software License Costs for Reproducing YouTube Using Oracle

Database Software (per machine)
  CPUs per Exadata server                               16
  x Cores per CPU                                       8
  x Core factor (Intel 7560)                            0.5
  x Price of Database Enterprise Edition w/ options     $107,000
  = Cost per Exadata machine ($M)                       $6.8
  x Number of Exadata machines                          26
  = Cost of database licenses ($M)                      $178

Middleware Software (per machine)
  CPUs per Exalogic server                              30
  x Cores per CPU                                       12
  x Core factor (Intel 7560)                            0.5
  x Price of WebLogic Enterprise Edition                $25,000
  = Cost per Exalogic machine ($M)                      $4.5
  x Number of Exalogic machines                         22
  = Cost of middleware licenses ($M)                    $99

Operating System Software
  Total servers                                         70
  x Price of Linux Ultimate Edition                     $2,299
  = Total Linux support costs ($M)                      $0

Exadata Storage Software
  Total Exadata cells                                   1,373
  x Disks per cell                                      12
  = Total disks                                         16,476
  x Price of Exadata software per disk                  $10,000
  = Cost of Exadata storage licenses ($M)               $165

Total license cost ($M)                                 $442

Maintenance Cost
  Annual maintenance cost ($M)                          $97

Source: Cowen and Company, Company Reports and Whitepapers.

We note that at Exadata's pricing of $10K per disk, undiscounted Exadata license costs can exceed $160M, while adding over $35M in annual support costs. While one can argue that Oracle would discount heavily, we note that our analysis excludes the cost of packages (such as Enterprise Manager) that will likely be needed, although we are uncertain of the quantities to be purchased.

Ongoing Costs
We estimate that annual maintenance costs for the entire system will be roughly $95M per year. In addition, we project personnel costs for nine engineers of roughly $1.6M per year, based on the staffing of the original YouTube team. We assume total compensation, including benefits, of $180K per engineer.
We assume that maintenance on the applications in the Oracle database is 25% of the maintenance cost of the open source software, based on productivity estimates we put forth earlier.

Costing Out the Commodity HW/Open Source SW System

Hardware Costs
We estimate that a system that can handle the average consumption outlined above would require at least 27 higher-spec full blade racks to function as app servers, and a similar number of lower-spec (roughly half price) full blade racks to function as web servers. As with the Exadata system example, we price out a system with twice as many servers as needed to handle peaks in demand. Including nearly 1,800 7.2 TB blade storage servers, this adds up to a roughly $104M system.

Server Hardware Requirements for Reproducing YouTube Using Generic HW and Open Source SW
Item                                                            | Bandwidth In (TB/day) | Bandwidth Out (TB/day) | Machines by Input | by Output | by Storage/Sys | Min Required | x2 for Peak (excl Storage) | Price Each ($M) | Price Total ($M)
Target Average Bandwidth                                        | 31                    | 45,000                 |                   |           |                |              |                            |                 |
HP StorageWorks D2200sb (Direct Attach, 2 per BL620c)           | 1,493                 | 1,493                  | 1                 | 31        | 1,792          | 1,792        | 1,792                      | $0.0            | $20.6
Full Rack HP BL620c - File Server (8 x 1-CPU servers/rack)      | 1,728                 | 1,728                  |                   |           |                | 112          | 112                        | $0.1            | $10.0
Full Rack HP BL620c - DB Server (8 x 2-CPU servers/rack)        | 1,728                 | 1,728                  |                   |           |                | 1            | 2                          | $0.7            | $1.4
Full Rack HP BL620c - App Server (8 x 2-CPU servers/rack)       | 1,728                 | 1,728                  | 1                 | 27        |                | 27           | 54                         | $0.7            | $38.6
Full Rack HP BL620c - Web Server (8 x 1-CPU servers/rack)       | 1,728                 | 1,728                  | 1                 | 27        |                | 27           | 54                         | $0.6            | $29.9
4X QDR InfiniBand Switch Module for c-Class BladeSystem         | 13,824                | 13,824                 | 3                 | 42        |                | 42           | 84                         | $0.0            | $1.1
Cisco Nexus 5548P Switch                                        | 2,592                 |                        | 1                 | 17        |                | 17           | 34                         | $0.1            | $2.5
Total                                                           |                       |                        |                   |           |                | 2,018        | 2,132                      |                 | $104.2
Source: Cowen and Company, Company Reports and Whitepapers.

The configuration above assumes that data is stored in files in the file system, as opposed to in the database; the database simply stores the location of each file. This means that the locations of the unstructured data files can be stored in a single MySQL database (a brief sketch of this lookup pattern follows the software support table below). One may ask why we even bother with MySQL rather than using an in-memory solution like memcached or Membase. While 1 TB of memory could be cost prohibitive, we use known MySQL support pricing as a proxy for the unpublished support rates of open source in-memory databases such as Membase. We estimate that the size of the index database will be under 1 TB; hence, we believe it can be served, on average, by a single rack of servers.

As with the Oracle system, we double the number of servers in this open source-based provisioning system to handle peak capacity.

Software Costs
We assume that our system will use the following open source software: Linux for its operating system, JBoss for its application server, MySQL for its database and Hadoop (including its file system) as a form of middleware. Customers generally obtain open source software for free (i.e., no license fee) and opt for support programs that are typically subscription based. Our software support estimates are in the following table.
Software Support Costs for Reproducing YouTube Using Open Source Software

Data Management Software
  Number of data management server racks                2
  x Servers per rack                                    8
  x Annual subscription per server, Datameer Hadoop     $12,000
  = Cost of data management support ($M)                $0.2

Middleware Software
  Number of app server racks                            54
  x Servers per rack                                    8
  x Number of CPUs                                      2
  x Price of JBoss per CPU                              $1,406
  = Cost of middleware support ($M)                     $1.2

Operating System Software
  Number of racks                                       222
  x Servers per rack                                    8
  x Price of Red Hat Enterprise Linux support           $6,498
  = Cost of operating system support ($M)               $11.5

Total support cost ($M)                                 $12.9

Source: Cowen and Company, Company Reports and Whitepapers.

Ongoing Costs
In addition to the support costs outlined above, we project personnel costs for twelve engineers of roughly $2.2M per year, based on the staffing of the original YouTube team. We assume total compensation, including benefits, of $180K per engineer. We assume that it is 4x as difficult to develop the code to retrieve files in Hadoop as in Oracle; hence, the Oracle support contingent of nine includes one app developer, while the open source team includes four app developers.
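Returning to the storage architecture assumed for the open source system (file content on the distributed file system, with a single MySQL index database holding only each file's location), a minimal sketch of that lookup pattern is below. The table layout, column names and helper functions are illustrative assumptions, not a description of any vendor's implementation.

    # Illustrative "locations in MySQL, content in the file system" lookup pattern.
    # Hypothetical schema:
    #   CREATE TABLE video_locations (
    #       video_id  VARCHAR(64)   PRIMARY KEY,
    #       fs_path   VARCHAR(1024) NOT NULL
    #   );

    def lookup_path(cursor, video_id):
        """Fetch the file-system location of a video; the video bytes never pass through MySQL."""
        cursor.execute("SELECT fs_path FROM video_locations WHERE video_id = %s", (video_id,))
        row = cursor.fetchone()
        return row[0] if row else None

    def serve_video(cursor, filesystem, video_id):
        """Resolve the path in the index database, then stream the content from the file store."""
        path = lookup_path(cursor, video_id)
        if path is None:
            raise KeyError(video_id)
        return filesystem.open(path)   # e.g. an HDFS client or local file system wrapper

Because the index rows are tiny relative to the video files themselves, the index database stays under 1 TB even as the file store grows into the petabytes, which is why a single MySQL rack suffices in the configuration above.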
The Gartner Report(s) described herein (the "Gartner Report(s)") represent data, research opinion or viewpoints published, as part of a syndicated subscription service available only to clients, by Gartner, Inc., a corporation organized under the laws of the State of Delaware, USA, and its subsidiaries ("Gartner"), and are not representations of fact. The Gartner Report(s) do not constitute a specific guide to action and the reader of this [Prospectus/Company Report] assumes sole responsibility for his or her selection of, or reliance on, the Gartner Report(s), or any excerpts thereof, in making any decision, including any investment decision. Each Gartner Report speaks as of its original publication date (and not as of the date of this [Prospectus/Company Report]) and the opinions expressed in the Gartner Report(s) are subject to change without notice. Gartner is not responsible, nor shall it have any liability, to the Company or to any reader of this [Prospectus/Company Report] for errors, omissions or inadequacies in, or for any interpretations of, or for any calculations based upon data contained in, the Gartner Report(s) or any excerpts thereof.
STOCKS MENTIONED IN THIS REPORT
Ticker | Company Name | Price (06/30/2011) | Rating | Analyst
AMZN   | Amazon       | $204.49            | 1      | Jim Friedland
DELL   | Dell         | $16.67             | 2      | Matt Hoffman
GOOG   | Google       | $506.38            | 1      | Jim Friedland
IBM    | IBM          | $171.55            | 1      | Moshe Katri
MSFT   | Microsoft    | $26.00             | 1      | Gregg Moskowitz
ORCL   | Oracle       | $32.91             | 2      | Peter Goldmacher
SAP    | SAP (ADR)    | $60.65             | 3      | Peter Goldmacher
VRSN   | VeriSign     | $33.46             | 1      | Gregg Moskowitz
1 = Outperform; 2 = Neutral; 3 = Underperform
Addendum

STOCKS MENTIONED IN IMPORTANT DISCLOSURES
Ticker | Company Name
AMZN   | Amazon.com
DELL   | Dell
GOOG   | Google
IBM    | IBM
MSFT   | Microsoft
ORCL   | Oracle
SAP    | SAP AG (ADR)
VRSN   | VeriSign

ANALYST CERTIFICATION
Each author of this research report hereby certifies that (i) the views expressed in the research report accurately reflect his or her personal views about any and all of the subject securities or issuers, and (ii) no part of his or her compensation was, is, or will be related, directly or indirectly, to the specific recommendations or views expressed in this report.

IMPORTANT DISCLOSURES
Cowen and Company, LLC and/or its affiliates make a market in the stock of AMZN, DELL, GOOG, IBM, MSFT, ORCL, SAP, VRSN securities.
Cowen and Company, LLC compensates research analysts for activities and services intended to benefit the firm's investor clients. Individual compensation determinations for research analysts, including the author(s) of this report, are based on a variety of factors, including the overall profitability of the firm and the total revenue derived from all sources, including revenues from investment banking. Cowen and Company, LLC does not compensate research analysts based on specific investment banking transactions.

DISCLAIMER
This research is for our clients only. Our research is disseminated primarily electronically and, in some cases, in printed form. Research distributed electronically is available simultaneously to all Cowen and Company, LLC clients. All published research, including required disclosures, can be obtained on the Firm's client website, www.cowenresearch.com.
Further information on any of the above securities may be obtained from our offices. This report is published solely for information purposes, and is not to be construed as an offer to sell or the solicitation of an offer to buy any security in any state where such an offer or solicitation would be illegal. Other than disclosures relating to Cowen and Company, LLC, the information herein is based on sources we believe to be reliable but is not guaranteed by us and does not purport to be a complete statement or summary of the available data. Any opinions expressed herein are statements of our judgment on this date and are subject to change without notice.
Notice to UK Investors: This publication is produced by Cowen and Company, LLC, which is regulated in the United States by FINRA and is disseminated in the United Kingdom by Cowen International Limited ("CIL"). In the United Kingdom, 'Cowen and Company' is a Trading Name of CIL. It is communicated only to persons of a kind described in Articles 19 and 49 of the Financial Services and Markets Act 2000 (Financial Promotion) Order 2005. It must not be further transmitted to any other person without the consent of CIL.
Copyright, User Agreement and other general information related to this report
© 2011 Cowen and Company, LLC. Member NYSE, FINRA and SIPC. All rights reserved. This research report is prepared for the exclusive use of Cowen clients and may not be reproduced, displayed, modified, distributed, transmitted or disclosed, in whole or in part, or in any form or manner, to others outside your organization without the express prior written consent of Cowen. Cowen research reports are distributed simultaneously to all clients eligible to receive such research prior to any public dissemination by Cowen of the research report or information or opinion contained therein. Any unauthorized use or disclosure is prohibited.
Receipt and/or review of this research constitutes your agreement not to reproduce, display, modify, distribute, transmit, or disclose to others outside your organization the contents, opinions,
conclusion, or information contained in this report (including any investment recommendations, estimates or price targets). All Cowen trademarks displayed in this report are owned by Cowen and may not be used without its prior written consent.

Cowen and Company, LLC. New York (646) 562-1000 | Boston (617) 946-3700 | San Francisco (415) 646-7200 | Chicago (312) 577-2240 | Cleveland (440) 331-3531 | Atlanta (866) 544-7009 | Dallas (214) 978-0107 | London (affiliate) 44-207-071-7500 | Geneva (affiliate) 41-22-707-6900

COWEN AND COMPANY RATING DEFINITIONS (a)
Rating            | Definition
Outperform (1)    | Stock expected to outperform the S&P 500
Neutral (2)       | Stock expected to perform in line with the S&P 500
Underperform (3)  | Stock expected to underperform the S&P 500
(a) Assumptions: Time horizon is 12 months; S&P 500 is flat over forecast period.

COWEN AND COMPANY RATING ALLOCATION (a)
Rating    | Pct of companies under coverage with this rating | Pct for which Investment Banking services have been provided within the past 12 months
Buy (b)   | 51.1%                                             | 7.5%
Hold (c)  | 45.9%                                             | 2.1%
Sell (d)  | 3.0%                                              | 0.0%
(a) As of 06/30/2011. (b) Corresponds to "Outperform" rated stocks as defined in Cowen and Company, LLC's rating definitions (see above). (c) Corresponds to "Neutral" as defined in Cowen and Company, LLC's ratings definitions (see above). (d) Corresponds to "Underperform" as defined in Cowen and Company, LLC's ratings definitions (see above). Note: "Buy," "Hold" and "Sell" are not terms that Cowen and Company, LLC uses in its ratings system and should not be construed as investment options. Rather, these ratings terms are used illustratively to comply with NASD and NYSE regulations.

To view price charts, please see http://pricecharts.cowen.com/pricechart.asp or call 1-800-221-5616
