Cloudera: Hadoop for the Enterprise
September 2008
Data Growing Much Faster than Moore’s Law
04/21/17
Cloudera Confidential 2
Source: Richard Winter, Why Are Data Warehouses Growing so Fast?, April 2008
Uniprocessor Performance
Founding Team
• Mike Olson, CEO
– CEO Sleepycat
– Britton Lee, Illustra, Informix, Oracle
– BA, MS CS, Berkeley
• Amr Awadallah, CTO, VP Engineering
– Founder Aptivia/VivaSmart
– 8 years at Yahoo! running BI infrastructure, including Hadoop
– PhD EE, Stanford
• Christophe Bisciglia, VP Technology
– Created Google/NSF Hadoop cluster and program
– BA CS, U Washington
• Jeff Hammerbacher, VP Product
– Ran world’s largest operational BI support system on Hadoop, at Facebook
– BA Mathematics, Harvard
What Is Hadoop?
• Core engine:
– Open source implementation of Google’s MapReduce and GFS
– Hundreds or thousands of servers parallelize a data analysis task
• Interfaces built on top of MapReduce
• Storage layer beneath (HDFS)
• Doug Cutting, Mike Cafarella are advisors
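To make the MapReduce model on this slide concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and uses no Hadoop API: the function names (map_phase, shuffle, reduce_phase, run_job) and the sample log lines are assumptions made for illustration. On a real cluster the map and reduce calls would be spread across many servers; here they run in one loop.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (token, 1) pair for each whitespace-separated token in one line."""
    return [(token.lower(), 1) for token in record.split()]


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in one process (a cluster would parallelize these)."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Hypothetical log lines, used only to exercise the sketch.
    logs = ["GET /home", "GET /about", "POST /home"]
    print(run_job(logs))
```

Because each map call sees only one record and each reduce call sees only one key, the work can be split across machines without the calls coordinating with each other, which is the property the slide refers to.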
Hadoop is Open Source
• Hadoop is distributed under the Apache License:
– Reduces concern about lock-in
– Low-cost, effective distribution strategy
– Allows innovation by partners, customers
– Third-party inspection of source code provides assurances on security, product quality
• Business-friendly license encourages commercial development
– “Open core” licensing
– Closed-source components, applications
Hadoop Users
Momentum: Google Trends
Netezza: $127M in FY08, $79M in FY07
Teradata: $830M in 1H08, $1.7B in FY07
Worldwide Phenomenon
Source: Google Insights world map for searches on “hadoop”, Sept 2008.
Why is Hadoop Successful?
• Brings computation closer to data, allowing both I/O and compute scalability.
• MapReduce forces developers to think in a parallel way
• Operates on unstructured data, and structured data (HBase, Hive)
• Prescriptive development, grows with you without needing to re-architect
• Procedural language offers power
Current Systems Isolate Users from the Event Level Raw Data
[Architecture diagram. Labels: File Server Farm for Warehouse (non-queryable); Warehouse Pre-Processing; Instrumentation; Log Collection; Datamart Database; BI Reporting; MySQL; MemCached; Live Web Site; Data Mining; R, Weka, SAS, SPSS; ETL (label repeated six times). Callouts: Non-Consumption; Expensive ETL Grids.]
Solution: “Smart” Storage Service
[Architecture diagram. Labels: Smart Storage: Grid For File Storage & Data Processing; Warehouse Pre-Processing; Instrumentation; Log Collection; Datamart Database; BI Reporting; MySQL; MemCached; Live Web Site; Data Mining; R, Weka, SAS, SPSS. Callouts: Enable Consumption; Eliminate Expensive ETL Grids.]
BDP versus OLAP/OLTP
[Chart comparing two profiles, OLAP/OLTP and Batch Data Processing. Axes: Schema Complexity; Processing Freedom; Table Join Complexity; Concurrent Jobs; Responsiveness; Per Job Data Volume; Data Update Pattern; Total Data Volume. Scale labels: Structured, Unstructured; SQL, Generic Data Processing; 100 Tables; 1000; Batch, Interactive; Append Only, Read/Write; 100TB, 1PB, 10PB, 100PB.]
Source: Merrill Lynch Industry Overview, May 7, 2008
Cloudera Differentiators
• Enabling Hadoop as an elastic platform with statistical multiplexing over many customers
• Multi-Tenant Support: Concurrency, Priority, Namespace Isolation, Performance Isolation.
• Monitoring, Reliability, and Availability
• Resilience and Fast Recovery: A non-sexy problem that is critical to enterprises, no time to restart ETL job from scratch, otherwise misses SLA.
• IDE to easily debug, deploy, and tune.
• Integration with data mining and analysis functionality (R, Weka, SAS, SPSS)
• Connector certification: another non-sexy problem that is ignored by community, make sure system is compatible with other enterprise systems.

Cloudera's Original Pitch Deck from 2008


Editor's Notes

  1. Moore’s law is failing; the only way to speed up going forward is massive parallelism on grids and multicore processors.
  2. Furthermore, these expensive ETL grids are only needed for a couple of hours in the morning to meet the loading SLA.
  3. Another pain point is resilience to failure: currently, when a Hadoop job fails, you have to restart it from the beginning. The community is not spending much time on this problem because it is not "sexy", but it is critical for enterprises with strict SLAs to meet. You don't want to restart your ETL job from scratch when a failure occurs; there is no time for that. Jobs need to be snapshotted at intermediate checkpoints so that a failure does not force a restart from the beginning.
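
The checkpointing idea in note 3 can be sketched in a few lines of Python. This is an illustration only, not how Hadoop or any Cloudera product implements recovery: run_with_checkpoints, the JSON checkpoint file, and the notion of a job as an ordered list of stage functions are all assumptions introduced for this example.

```python
import json
import os


def run_with_checkpoints(stages, data, checkpoint_path):
    """Run stages in order, saving state after each one so a rerun can resume.

    Assumes the data passed between stages is JSON-serializable.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        # A previous run got part-way through: pick up where it stopped.
        with open(checkpoint_path) as f:
            saved = json.load(f)
        start, data = saved["next_stage"], saved["data"]

    for index in range(start, len(stages)):
        data = stages[index](data)
        # Write to a temporary file and rename it, so a crash during the
        # write cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_stage": index + 1, "data": data}, f)
        os.replace(tmp_path, checkpoint_path)
    return data
```

If the second of two stages raises an exception, calling the function again with the same checkpoint path skips the first stage and retries only the second, which is the behaviour the note asks for when an ETL job has no time to start over.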