This is a selection of slides from Cloudera's 2008 pitch deck to raise a $5 million Series A. Accel wound up winning the deal and became the initial investor in the company.
2. Data Growing Much Faster than Moore’s Law
04/21/17
Cloudera Confidential
Source: Richard Winter, Why Are Data Warehouses Growing so Fast?, April 2008
4. Founding Team
• Mike Olson, CEO
– CEO Sleepycat
– Britton Lee, Illustra, Informix, Oracle
– BA, MS CS, Berkeley
• Amr Awadallah, CTO, VP Engineering
– Founder Aptivia/VivaSmart
– 8 years at Yahoo! running BI infrastructure, including Hadoop
– PhD EE, Stanford
• Christophe Bisciglia, VP Technology
– Created Google/NSF Hadoop cluster and program
– BA CS, U Washington
• Jeff Hammerbacher, VP Product
– Ran world’s largest operational BI support system on Hadoop, at Facebook
– BA Mathematics, Harvard
5. What Is Hadoop?
• Core engine:
– Open source implementation of Google’s MapReduce and GFS
– Hundreds or thousands of servers parallelize a data analysis task
• Interfaces built on top of MapReduce
• Storage layer beneath (HDFS)
• Doug Cutting, Mike Cafarella are advisors
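The map, shuffle, and reduce steps behind the core engine can be sketched in a few lines. This is a toy, single-process illustration and not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are assumptions made for the example. In a real cluster each input split would be mapped on a different server, and the shuffle would move data over the network.

```python
from collections import defaultdict

def map_phase(record):
    """Emit a (word, 1) pair for each word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, the step that sits between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)

def run_job(splits):
    # Each split stands in for a block of input that one server would process.
    pairs = [pair
             for split in splits
             for record in split
             for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input: two splits, as if stored on two different machines.
splits = [["the quick brown fox"], ["the lazy dog", "the end"]]
counts = run_job(splits)
```

Because `map_phase` looks at one record at a time and `reduce_phase` at one key at a time, both steps can be spread across many machines without changing the job's logic.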
6. Hadoop is Open Source
• Hadoop is distributed under the Apache License:
– Reduces concern about lock-in
– Low-cost, effective distribution strategy
– Allows innovation by partners, customers
– Third-party inspection of source code provides assurances on security, product quality
• Business-friendly license encourages commercial development
– “Open core” licensing
– Closed-source components, applications
8. Momentum: Google Trends
Netezza: $127M in FY08, $79M in FY07
Teradata: $830M in 1H08, $1.7B in FY07
10. Why is Hadoop Successful?
• Brings computation closer to data, allowing both IO and compute scalability.
• Map-Reduce forces developers to think in a parallel way
• Operates on unstructured data, and structured data (HBASE, HIVE)
• Prescriptive development, grows with you without needing to re-architect
• Procedural language offers power
11. Current Systems Isolate Users from the Event Level Raw Data
[Diagram. Labels: File Server Farm for Warehouse (non-queryable); Warehouse Pre-Processing; Instrumentation; Log Collection; Datamart Database; BI Reporting; MySQL; MemCached; Live Web Site; Data Mining (R, Weka, SAS, SPSS); ETL steps between components. Callouts: Non-Consumption; Expensive ETL Grids]
12. Solution: “Smart” Storage Service
[Diagram. Labels: Smart Storage: Grid For File Storage & Data Processing; Warehouse Pre-Processing; Instrumentation; Log Collection; Datamart Database; BI Reporting; MySQL; MemCached; Live Web Site; Data Mining (R, Weka, SAS, SPSS). Callouts: Enable Consumption; Eliminate Expensive ETL Grids]
13. BDP versus OLAP/OLTP
[Chart comparing OLAP/OLTP with Batch Data Processing. Axes: Schema Complexity; Processing Freedom; Table Join Complexity; Concurrent Jobs; Responsiveness; Per Job Data Volume; Data Update Pattern; Total Data Volume. Scale labels that survive extraction: Structured, Unstructured; SQL, Generic Data Processing; Batch, Interactive; Append Only, Read/Write; 1000; 100 Tables; 100TB, 1PB, 10PB, 100PB]
15. Cloudera Differentiators
• Enabling Hadoop as an elastic platform with statistical multiplexing over many customers
• Multi-Tenant Support: Concurrency, Priority, Namespace Isolation, Performance Isolation.
• Monitoring, Reliability, and Availability
• Resilience and Fast Recovery: A non-sexy problem that is critical to enterprises, no time to restart ETL job from scratch, otherwise misses SLA.
• IDE to easily debug, deploy, and tune.
• Integration with data mining and analysis functionality (R, Weka, SAS, SPSS)
• Connector certification: another non-sexy problem that is ignored by community, make sure system is compatible with other enterprise systems.
Editor's Notes
Moore’s law is failing; the only way to speed up going forward is massive parallelism on grids and multicore processors.
Furthermore, these expensive ETL grids are only needed for a couple of hours in the morning to meet the loading SLA.
Another pain point is resilience to failure: currently, when a Hadoop job fails, you have to restart it all the way from the beginning. The community is not spending much time on this problem since it is not "sexy", but it is critical for enterprises with strict SLAs to meet. You don't want to restart your ETL job from scratch when a failure occurs; there is no time for that. Jobs need to be snapshotted at intermediate checkpoints so that a failure does not force a restart from the beginning.
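The checkpointing idea in this note can be illustrated with a minimal sketch. It is not how Hadoop or Cloudera implements recovery: the function `run_with_checkpoints`, the JSON checkpoint file, and the three toy stages are all assumptions made for the example. The job's state is saved after each stage, so a rerun after a failure resumes at the stage that failed instead of the first one.

```python
import json
import os
import tempfile

def run_with_checkpoints(stages, data, checkpoint_path):
    """Run stages in order, saving state after each so a rerun can resume."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            saved = json.load(f)
        start, data = saved["next_stage"], saved["data"]
    for i in range(start, len(stages)):
        data = stages[i](data)
        # Write to a temporary file and rename it, so a crash in the middle
        # of a write cannot leave a corrupt checkpoint behind.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_stage": i + 1, "data": data}, f)
        os.replace(tmp, checkpoint_path)
    return data

calls = []            # records which stages actually ran
attempts = {"n": 0}   # lets the transform stage fail exactly once

def extract(rows):
    calls.append("extract")
    return [r.strip() for r in rows]

def transform(rows):
    calls.append("transform")
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("simulated node failure")
    return [r.upper() for r in rows]

def load(rows):
    calls.append("load")
    return len(rows)

stages = [extract, transform, load]
path = os.path.join(tempfile.mkdtemp(), "job.ckpt")

try:
    run_with_checkpoints(stages, [" a ", " b "], path)   # fails in transform
except RuntimeError:
    pass
result = run_with_checkpoints(stages, [" a ", " b "], path)  # resumes there
```

On the second run the extract stage is skipped because its output was already checkpointed; only the failed stage and the stages after it execute.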