Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Collins & Charles Zedlewski, Cloudera

Many people refer to Apache Hadoop as their system of choice for big data management, but few actually use just Apache Hadoop. Hadoop has become a proxy for a much larger system that has HDFS storage at its core. The Apache Hadoop-based "big data stack" has changed dramatically over the past 24 months and will change even more over the next 24 months. This talk covers trends in the evolution of the Hadoop stack, changes in architecture, and changes in the kinds of use cases that are supported. It also covers the role of interoperability and cohesion in the Apache Hadoop stack and the role of Apache Bigtop in this regard.

Published in: Technology, Business

  1. Hadoop World 2011 - Hadoop Stack: Then, Now and Future
  2. In the beginning
     Hadoop was a platform for data storage and processing that is…
     • Scalable
     • Fault tolerant
     • Open source
     CORE HADOOP COMPONENTS
     • Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers
     • MapReduce: Distributed Computing Across Physical Servers
     Flexibility
     • A single repository for storing, processing & analyzing any type of data
     • Not bound by a single schema
     Scalability
     • Scale-out architecture divides workloads across multiple nodes
     • Flexible file system eliminates ETL bottlenecks
     Low Cost
     • Can be deployed on commodity hardware
     • Open source platform guards against vendor lock
     ©2011 Cloudera, Inc. All Rights Reserved. Confidential. Reproduction or redistribution without written permission is prohibited.
  3. A good start
     Stack diagram (Apache Hadoop): Shell / CLI; Data Processing; Resource Management; File storage; Formats; RPC; Compression
  4. Core use cases
     • Data processing
       – Search index building
       – Click sessionization
  5. We were here
     Chart: Core Hadoop as % of new patches, with the relevant projects for each year
     • 2006: 100% (Core Hadoop)
     • 2007: 100% (Core Hadoop)
     • 2008: 58% (Core Hadoop, HBase, Zookeeper, Mahout)
     • 2009: 37% (Core Hadoop, HBase, Pig, Zookeeper, Mahout, Hive)
     • 2010: 37% (Core Hadoop, HBase, Pig, Zookeeper, Mahout, Hive, Avro, Whirr, Sqoop)
     • 2011: 31% (Core Hadoop, HBase, Pig, Zookeeper, Mahout, Hive, Avro, Whirr, Sqoop, HCatalog, MRUnit, Bigtop, Oozie)
  6. First cut at the system
     Stack diagram: Shell / CLI; Languages; Libraries; Workflow; Data Processing; Resource Management; Metadata storage; Record storage; File storage; Coordination; Formats; RPC; Compression
  7. Underlying projects & communities
     Stack diagram: Shell / CLI; Languages; Libraries; Workflow; Data Processing; Resource Management; Metadata storage; Record storage; File storage; Coordination; Formats; RPC; Compression
     Project labels on the diagram: Apache Hadoop; Apache Pig, Hive, Mahout; Apache Hive; Apache HBase; Apache Zookeeper; Apache Avro
  8. Core use cases
     • Data processing
       – Search index building
       – Click sessionization
       – Data processing pipelines
     • Analytics
       – Machine learning
       – Batch reporting
     • Live content serving (for the braver folks)
  9. We were here
     Chart repeated from slide 5: Core Hadoop as % of new patches, 2006 to 2011 (100%, 100%, 58%, 37%, 37%, 31%), with the relevant projects for each year.
  10. Where we are today
      Stack diagram: Web; Shell / CLI; Drivers; Files; Languages; Libraries; Workflow; Scheduling; Data Processing; Resource Management; Integration; Metadata storage; RDBMS; Record storage; File storage; Logs & events; Coordination; Formats; RPC; Authentication; Compression
  11. Where we are today
      Stack diagram: Web; Shell / CLI; Drivers; Files; Languages; Libraries; Workflow; Scheduling; Data Processing; Resource Management; Integration; Metadata storage; RDBMS; Record storage; File storage; Logs & events; Coordination; Formats; RPC; Authentication; Compression
      Project labels on the diagram: Hue; Apache Hadoop; Apache Pig, Hive, Mahout; Apache Oozie; JDBC / ODBC; Apache Sqoop; Apache Hive, HCatalog; Apache HBase; Apache Flume; Apache Zookeeper; Apache Avro
  12. Core use cases
      • Data processing
        – Search index building
        – Click sessionization
        – Data processing pipelines
      • Analytics
        – Machine learning
        – Batch reporting
      • Real time applications
        – Content serving
        – System management
        – Real-time aggregates & counters
      • Storage
        – EDW archive
  13. Current state
      Chart repeated from slide 5: Core Hadoop as % of new patches, 2006 to 2011 (100%, 100%, 58%, 37%, 37%, 31%), with the relevant projects for each year.
  14. Limitations
      • Redundancy - DAG, RPC, serialization, integration, etc.
      • Uniformity - different components require different DBs, management interfaces, etc.
      • Ease of use - improving but still an obstacle, e.g. non-native file formats require integration.
      • Multi-datacenter - cross-DC replication for HBase but not HDFS.
      • Interoperability - requires conversions, end-user integration.
  15. Ongoing work
      • Metadata repos - shared schema and data types, table abstraction via Apache HCat (incubating) and Apache Hive.
      • Self-describing data via Apache Avro.
  16. Ongoing work: Apache Bigtop
      Dedicated to Hadoop stack integration and testing.
      • Integration - between projects, dependencies, hosts.
      • Testing - interoperability, multi-component use cases.
      • 100% Apache projects, using upstream releases.
      • Participants across the ecosystem - join us! http://incubator.apache.org/bigtop
  17. Technical trends - software
      • Moving more forms of computation to Hadoop storage
      • Frameworks to make HBase more application and developer friendly
      • Taking advantage of pluggability to provide more optimized formats, schedulers, codecs, etc.
      • More granular security models
  18. Technical trends - hardware
      • Increasingly powerful hosts
        – # cores and memory
        – Network - 10/40 GigE
        – Storage - 48/60 TB hosts. Flash.
      • Cloud - multi-tenancy and virtualization
      • Low power CPUs
  19. Enable future use cases pt 1
      More valuable data
      • Cost = gravity. Data flows downhill to the cheapest store.
      • High-value data is not just generated but also consumed by the platform, i.e. more processing is done within the system before the data leaves.
      Richer end user applications
      • Apps built directly on the platform (eBay’s Cassini, Facebook messages, etc.)
      • Web 3.0 – data centric apps. Apps move over common data sources vs. being tightly coupled to their data.
  20. Enable future use cases pt 2
      Lower latency / higher interactivity
      • Low latency response times for applications
      • Interactive - human-driven, correlated access, e.g. analytics
      • Low latency query execution and in-memory datasets
      • Resource management - batch and interactive workloads
  21. Enable future use cases pt 3
      Hadoop meets ILM
      • Policy - access control, std mgt interfaces, SLAs, MDM, etc.
      • Operation - disaster recovery, archive, etc.
      • Traditional features - availability, snapshots, mirroring, ACLs, integration via standard protocols.
  22. Things to look forward to
      Stack diagram: Web; Shell / CLI; Drivers; Files; Languages; Libraries; Workflow; Scheduling; MapReduce; Stream; Graph; MPI; Other; Resource Management; Integration; Metadata storage; RDBMS; Time Series; ORM; OLAP; OLTP; Record storage; File storage; Logs & events; Coordination; Formats; RPC; Authentication; Compression
  23. Getting crowded…
      Stack diagram: Web; Shell / CLI; Drivers; Files; Languages; Libraries; Workflow; Scheduling; MapReduce; Stream; Graph; MPI; Other; Resource Management; Integration; Metadata storage; RDBMS; Time Series; ORM; OLAP; OLTP; Record storage; File storage; Logs & events; Coordination; Formats; RPC; Authentication; Compression
      Project labels on the diagram: Hue; Apache Hadoop; Apache Pig, Hive, Mahout; Apache S4, Storm; X-Rime, Giraph; Apache Oozie; JDBC / ODBC; Apache Sqoop; Apache Hive, HCatalog; OpenTSDB; Apache HBase; Apache Flume; Apache Zookeeper; Apache Avro; Apache Gora; Omid
  24. We appreciate your time and interest in [logo image]
      For Additional Information: +1 (888) 789-1488; sales@cloudera.com; cloudera.com; twitter.com/cloudera; facebook.com/cloudera
