Eric Baldeschwieler Keynote from Storage Developers Conference


Eric Baldeschwieler's keynote presentation from Storage Developer's Conference 2011. Eric is co-founder and CEO of Hortonworks.

Published in: Technology
  • Introduce self, background. Thanks SDC, attendees, etc.
  • Tell inception story, plan to differentiate Yahoo, recruit talent, ensure that Y! was not built on a legacy private system. From YST
  • Look for Analyst/Gartner data on growing cost…
  • The data nodes do not have RAID, just JBOD
  • Arun to develop
  • Our commitment to Apache has already changed the market! Ultimately, contributing the code that matters and making it work is the currency in open source.
  • Transcript of "Eric Baldeschwieler Keynote from Storage Developers Conference"

    1. 1. Apache Hadoop Today & Tomorrow <ul><li>Eric Baldeschwieler, CEO </li></ul><ul><li>Hortonworks, Inc. </li></ul><ul><ul><li>twitter: @jeric14 (@hortonworks) </li></ul></ul><ul><ul><li>www.hortonworks.com </li></ul></ul>
    2. 2. Agenda <ul><li>Brief Overview of Apache Hadoop </li></ul><ul><li>Where Apache Hadoop is Used </li></ul><ul><li>Apache Hadoop Core </li></ul><ul><ul><li>Hadoop Distributed File System (HDFS) </li></ul></ul><ul><ul><li>Map/Reduce </li></ul></ul><ul><li>Where Apache Hadoop Is Going </li></ul><ul><li>Q&A </li></ul>
    3. 3. What is Apache Hadoop? <ul><li>A set of open source projects owned by the Apache Software Foundation that </li></ul><ul><li>transforms commodity computers and networks into a distributed service </li></ul><ul><li>HDFS – Stores petabytes of data reliably </li></ul><ul><li>Map-Reduce – Allows huge distributed computations </li></ul><ul><li>Key Attributes </li></ul><ul><li>Reliable and Redundant – Doesn’t slow down or lose data even as hardware fails </li></ul><ul><li>Simple and Flexible APIs – Our rocket scientists use it directly! </li></ul><ul><li>Very powerful – Harnesses huge clusters, supports best of breed analytics </li></ul><ul><li>Batch processing centric – Hence its great simplicity and speed, not a fit for all use cases </li></ul>
    4. 4. What is it used for? <ul><li>Internet scale data </li></ul><ul><ul><li>Web logs – Years of logs at many TB/day </li></ul></ul><ul><ul><li>Web Search – All the web pages on earth </li></ul></ul><ul><ul><li>Social data – All message traffic on Facebook </li></ul></ul><ul><li>Cutting edge analytics </li></ul><ul><ul><li>Machine learning, data mining… </li></ul></ul><ul><li>Enterprise apps </li></ul><ul><ul><li>Network instrumentation, Mobile logs </li></ul></ul><ul><ul><li>Video and Audio processing </li></ul></ul><ul><ul><li>Text mining </li></ul></ul><ul><li>And lots more! </li></ul>
    5. 5. Apache Hadoop Projects Programming Languages Computation Object Storage Zookeeper (Coordination) Core Apache Hadoop Related Apache Projects HDFS (Hadoop Distributed File System) MapReduce (Distributed Programming Framework) HMS (Management) Table Storage Hive (SQL) Pig (Data Flow) HBase (Columnar Storage) HCatalog (Meta Data)
    6. 6. Where Hadoop is Used
    7. 7. Everywhere! 2006 2008 2009 2010 The Datagraph Blog 2007
    8. 8. HADOOP @ YAHOO! 40K+ Servers 170 PB Storage 5M+ Monthly Jobs 1000+ Active users © Yahoo 2011
    9. 9. CASE STUDY YAHOO! HOMEPAGE <ul><ul><li>Personalized for each visitor </li></ul></ul><ul><ul><li>Result: twice the engagement </li></ul></ul>© Yahoo 2011 +160% clicks vs. one size fits all +79% clicks vs. randomly selected +43% clicks vs. editor selected Recommended links News Interests Top Searches
    10. 10. CASE STUDY YAHOO! HOMEPAGE <ul><ul><li>Serving Maps </li></ul></ul><ul><ul><ul><li>Users - Interests </li></ul></ul></ul><ul><ul><li>Five Minute Production </li></ul></ul><ul><ul><li>Weekly Categorization models </li></ul></ul>USER BEHAVIOR CATEGORIZATION MODELS (weekly) SERVING MAPS (every 5 minutes) USER BEHAVIOR <ul><ul><li>» Identify user interests using Categorization models </li></ul></ul><ul><ul><li>» Machine learning to build ever better categorization models </li></ul></ul><ul><ul><li>Build customized home pages with latest data (thousands / second) </li></ul></ul>© Yahoo 2011 SCIENCE HADOOP CLUSTER SERVING SYSTEMS PRODUCTION HADOOP CLUSTER ENGAGED USERS
    11. 11. CASE STUDY YAHOO! MAIL <ul><ul><li>Enabling quick response in the spam arms race </li></ul></ul><ul><ul><li>450M mailboxes </li></ul></ul><ul><ul><li>5B+ deliveries/day </li></ul></ul><ul><ul><li>Antispam models retrained every few hours on Hadoop </li></ul></ul><ul><ul><li>40% less spam than Hotmail and 55% less spam than Gmail </li></ul></ul>© Yahoo 2011 SCIENCE PRODUCTION
    12. 12. A Brief History 2006 – present <ul><ul><li>, early adopters </li></ul></ul><ul><ul><li>Scale and productize Hadoop </li></ul></ul>Apache Hadoop <ul><ul><li>Wide Enterprise Adoption </li></ul></ul><ul><ul><li>Funds further development, enhancements </li></ul></ul>Nascent / 2011 <ul><ul><li>Other Internet Companies </li></ul></ul><ul><ul><li>Add tools / frameworks, enhance Hadoop </li></ul></ul>2008 – present … <ul><ul><li>Service Providers </li></ul></ul><ul><ul><li>Provide training, support, hosting </li></ul></ul>2010 – present …
    13. 13. Traditional Enterprise Architecture Data Silos + ETL Serving Applications Unstructured Systems Traditional ETL & Message buses Traditional ETL & Message buses EDW Data Marts BI / Analytics Traditional Data Warehouses, BI & Analytics Serving Logs Social Media Sensor Data Text Systems …
    14. 14. Serving Applications Unstructured Systems Traditional ETL & Message buses Traditional ETL & Message buses Hadoop Enterprise Architecture Connecting All of Your Big Data EDW Data Marts BI / Analytics Traditional Data Warehouses, BI & Analytics Apache Hadoop EsTsL (s = Store) Custom Analytics Serving Logs Social Media Sensor Data Text Systems …
    15. 15. Serving Applications Unstructured Systems 80-90% of data produced today is unstructured Gartner predicts 800% data growth over next 5 years Traditional ETL & Message buses Traditional ETL & Message buses Hadoop Enterprise Architecture Connecting All of Your Big Data EDW Data Marts BI / Analytics Traditional Data Warehouses, BI & Analytics Serving Logs Social Media Sensor Data Text Systems … Apache Hadoop EsTsL (s = Store) Custom Analytics
    16. 16. What is Driving Adoption? <ul><li>Business drivers </li></ul><ul><ul><li>Identified high value projects that require use of more data </li></ul></ul><ul><ul><li>Belief that there is great ROI in mastering big data </li></ul></ul><ul><li>Financial drivers </li></ul><ul><ul><li>Growing cost of data systems as proportion of IT spend </li></ul></ul><ul><ul><li>Cost advantage of commodity hardware + open source </li></ul></ul><ul><ul><ul><li>Enables departmental-level big data strategies </li></ul></ul></ul><ul><li>Technical drivers </li></ul><ul><ul><li>Existing solutions failing under growing requirements </li></ul></ul><ul><ul><ul><li>3Vs - Volume, velocity, variety </li></ul></ul></ul><ul><ul><li>Proliferation of unstructured data </li></ul></ul>
    17. 17. Big Data Platforms Cost per TB, Adoption Size of bubble = cost effectiveness of solution Source:
    18. 18. Apache Hadoop Core
    19. 19. Overview <ul><li>Frameworks share commodity hardware </li></ul><ul><ul><li>Storage - HDFS </li></ul></ul><ul><ul><li>Processing - MapReduce </li></ul></ul>2 * 10GigE 2 * 10GigE 2 * 10GigE 2 * 10GigE <ul><li>20-40 nodes / rack </li></ul><ul><li>16 Cores </li></ul><ul><li>48G RAM </li></ul><ul><li>6-12 * 2TB disk </li></ul><ul><li>1-2 GigE to node </li></ul>… Rack Switch 1-2U server … Rack Switch 1-2U server … Rack Switch 1-2U server … Rack Switch 1-2U server …
    20. 20. Map/Reduce <ul><li>Map/Reduce is a distributed computing programming model </li></ul><ul><li>It works like a Unix pipeline: </li></ul><ul><ul><li>cat input | grep | sort | uniq -c > output </li></ul></ul><ul><ul><li>Input | Map | Shuffle & Sort | Reduce | Output </li></ul></ul><ul><li>Strengths: </li></ul><ul><ul><li>Easy to use! Developer just writes a couple of functions </li></ul></ul><ul><ul><li>Moves compute to data </li></ul></ul><ul><ul><ul><li>Schedules work on HDFS node with data if possible </li></ul></ul></ul><ul><ul><li>Scans through data, reducing seeks </li></ul></ul><ul><ul><li>Automatic reliability and re-execution on failure </li></ul></ul>
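The pipeline analogy on the Map/Reduce slide (Input | Map | Shuffle & Sort | Reduce | Output) can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not Hadoop's API: the function names `map_phase`, `shuffle_and_sort`, `reduce_phase`, and `word_count` are hypothetical, and word counting is chosen because it mirrors the `sort | uniq -c` part of the Unix pipeline.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle & Sort: group all values by key and return keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse every value seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the whole pipeline in one process (hypothetical driver, no cluster)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    for word, count in word_count(["to be or not", "to be"]):
        print(count, word)
```

As the slide says, the developer writes only the map and reduce functions; in a real cluster the framework would run the map calls on the nodes that hold the data and handle the shuffle, sorting, and re-execution on failure.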
    21. 21. HDFS: Scalable, Reliable, Manageable <ul><li>Scale IO, Storage, CPU </li></ul><ul><li>Add commodity servers & JBODs </li></ul><ul><li>4K nodes in cluster, 80 </li></ul><ul><li>Fault Tolerant & Easy management </li></ul><ul><ul><li>Built in redundancy </li></ul></ul><ul><ul><li>Tolerate disk and node failures </li></ul></ul><ul><ul><li>Automatically manage addition/removal of nodes </li></ul></ul><ul><ul><li>One operator per 8K nodes!! </li></ul></ul><ul><li>Storage server used for computation </li></ul><ul><ul><li>Move computation to data </li></ul></ul><ul><li>Not a SAN </li></ul><ul><ul><li>But high-bandwidth network access to data via Ethernet </li></ul></ul><ul><li>Immutable file system </li></ul><ul><ul><li>Read, Write, sync/flush </li></ul></ul><ul><ul><ul><li>No random writes </li></ul></ul></ul>Switch … Switch … Switch … Core Switch Core Switch …
    22. 22. HDFS Use Cases <ul><li>Petabytes of unstructured data for parallel, distributed analytics processing using commodity hardware </li></ul><ul><li>Solve problems that cannot be solved using traditional systems at a cheaper price </li></ul><ul><ul><li>Large storage capacity ( >100PB raw) </li></ul></ul><ul><ul><li>Large IO/Computation bandwidth (>4K servers) </li></ul></ul><ul><ul><ul><li>> 4 Terabit bandwidth to disk! (conservatively) </li></ul></ul></ul><ul><ul><li>Scale by adding commodity hardware </li></ul></ul><ul><ul><li>Cost per GB ~= $1.5, includes MapReduce cluster </li></ul></ul>
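A quick back-of-the-envelope check suggests why the "> 4 Terabit bandwidth to disk" figure is labeled conservative. The sketch below divides that aggregate across 4K servers and then across the 6 to 12 disks per node quoted on the Overview slide. The helper names are hypothetical, and the round numbers (exactly 4,000 servers, exactly 4 Tbit/s, decimal megabytes) are assumptions made for the arithmetic.

```python
def per_server_mb_per_sec(aggregate_bits_per_sec, servers):
    """Aggregate disk bandwidth split evenly across servers, in decimal MB/s."""
    return aggregate_bits_per_sec / servers / 8 / 1e6


def per_disk_range_mb_per_sec(per_server_mb, min_disks, max_disks):
    """(low, high) MB/s each disk must sustain for the given disk counts."""
    return per_server_mb / max_disks, per_server_mb / min_disks


if __name__ == "__main__":
    # Assumed round inputs taken from the slides: 4 Tbit/s over 4K servers.
    per_server = per_server_mb_per_sec(4e12, 4_000)
    low, high = per_disk_range_mb_per_sec(per_server, 6, 12)
    print(f"{per_server:.0f} MB/s per server, {low:.1f} to {high:.1f} MB/s per disk")
```

Under these assumptions each server only needs to sustain about 125 MB/s from its disks, or roughly 10 to 21 MB/s per disk.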
    23. 23. HDFS Architecture Persistent Namespace Metadata & Journal Namespace State Datanodes <ul><ul><li>Block ID → Data </li></ul></ul><ul><ul><li>Hierarchical Namespace </li></ul></ul><ul><ul><li>File Name → BlockIDs </li></ul></ul>Horizontally Scale IO and Storage Block Storage Namespace Namenode Block Map Heartbeats & Block Reports <ul><ul><li>Block ID → Block Locations </li></ul></ul>NFS b1 b5 b3 JBOD b2 b3 b1 JBOD b3 b5 b2 JBOD b1 b5 b2 JBOD
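The three mappings on the HDFS Architecture slide (file name to block IDs and block ID to block locations on the namenode, block ID to data on each datanode) can be modeled with plain dictionaries. This is a toy, in-memory sketch under that reading of the slide: the `Namenode` and `Datanode` classes and their methods are hypothetical and are not the real HDFS classes or RPCs.

```python
from collections import defaultdict


class Datanode:
    """Toy datanode: stores block data keyed by block ID."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> data

    def block_report(self):
        """Return the set of block IDs this node currently holds."""
        return set(self.blocks)


class Namenode:
    """Toy namenode: holds the namespace and the block map in memory."""

    def __init__(self):
        self.namespace = {}                # file name -> list of block IDs
        self.block_map = defaultdict(set)  # block ID -> set of datanode names

    def process_block_report(self, datanode_name, block_ids):
        """Replace everything previously known about one datanode."""
        for locations in self.block_map.values():
            locations.discard(datanode_name)
        for block_id in block_ids:
            self.block_map[block_id].add(datanode_name)

    def locate(self, file_name):
        """Return [(block ID, sorted datanode names)] for one file."""
        return [(b, sorted(self.block_map[b])) for b in self.namespace[file_name]]
```

Feeding in the replica layout drawn on the slide (b1 b5 b3, b2 b3 b1, b3 b5 b2, b1 b5 b2) as four block reports gives every block three locations. Note that in this sketch the block locations are rebuilt purely from block reports, while only the namespace would need to be persisted.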
    24. 24. Client Read & Write Directly from Closest Server Namespace State Horizontally Scale IO and Storage Client 1 open 2 read Client 2 write 1 create End-to-end checksum Namenode Block Map b1 b5 b3 JBOD b2 b3 b1 JBOD b3 b5 b2 JBOD b1 b5 b2 JBOD
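The read path on this slide (open at the namenode, then read directly from the closest server, protected by an end-to-end checksum) can be sketched as follows. The sketch assumes the client already has the replica list for a block. `read_block`, the `distance` table, and the use of `zlib.crc32` as the checksum are illustrative assumptions, not the actual HDFS client logic.

```python
import zlib


def checksum(data: bytes) -> int:
    """Stand-in checksum; CRC-32 from the standard library is used for illustration."""
    return zlib.crc32(data)


def read_block(block_id, replicas, distance):
    """Read one block from the closest replica whose checksum verifies.

    replicas: {datanode name: (data, checksum recorded at write time)}
    distance: {datanode name: hypothetical network distance from the client}
    """
    for node in sorted(replicas, key=lambda n: distance[n]):
        data, stored = replicas[node]
        if checksum(data) == stored:
            return node, data
        # Checksum mismatch: skip this replica and try the next closest one.
    raise IOError(f"no valid replica found for block {block_id}")
```

The design point is that data never flows through the namenode: it only answers the open call with locations, and the client verifies what it reads from the datanode itself.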
    25. 25. Actively maintain data reliability Namespace State Bad/lost block replica Periodically check block checksums Namenode Block Map b1 b5 b3 JBOD b2 b3 b4 JBOD b3 b5 b2 JBOD b1 b5 b2 JBOD 2. copy 3. blockReceived 1. replicate
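The three numbered steps on this slide (1. replicate, 2. copy, 3. blockReceived) can be walked through against a toy block map. This is a minimal sketch: the function names are hypothetical, the replication target of 3 is an assumption taken from the three replicas per block drawn on the slides, and the copy step is represented only by a comment.

```python
REPLICATION_TARGET = 3  # assumption: matches the three replicas per block drawn on the slides


def find_under_replicated(block_map, target=REPLICATION_TARGET):
    """Return the sorted IDs of blocks with fewer live replicas than the target."""
    return sorted(b for b, nodes in block_map.items() if len(nodes) < target)


def re_replicate(block_id, block_map, all_nodes):
    """Walk the slide's three steps for one block; return (source, destination)."""
    holders = block_map[block_id]
    if not holders:
        raise RuntimeError(f"all replicas of {block_id} are lost")
    candidates = sorted(set(all_nodes) - holders)
    if not candidates:
        raise RuntimeError(f"no datanode available to hold a new copy of {block_id}")
    # 1. replicate: the namenode picks a healthy source and a destination.
    source, destination = min(holders), candidates[0]
    # 2. copy: the source datanode streams the block to the destination (not modeled).
    # 3. blockReceived: the destination reports in and the block map is updated.
    holders.add(destination)
    return source, destination
```

For example, if the replica of b1 on one node fails its periodic checksum and is dropped from the block map, `find_under_replicated` reports b1 and `re_replicate` restores the third copy on a node that does not already hold it.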
    26. 26. HBase <ul><li>Hadoop ecosystem Database, based on Google BigTable </li></ul><ul><li>Goal: Hosting of very large tables (billions of rows X millions of columns) on commodity hardware. </li></ul><ul><ul><li>Multidimensional sorted Map </li></ul></ul><ul><ul><ul><li>Table => Row => Column => Version => Value </li></ul></ul></ul><ul><ul><li>Distributed column-oriented store </li></ul></ul><ul><ul><li>Scale – Sharding etc. done automatically </li></ul></ul><ul><ul><ul><li>No SQL, CRUD etc. </li></ul></ul></ul>
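The "multidimensional sorted map" on the HBase slide (Table => Row => Column => Version => Value) can be pictured as nested dictionaries with rows visited in key order. The class below is a toy, in-memory illustration of that data model only: `TinyTable` and its methods are hypothetical names, not the HBase client API, and integer versions stand in for timestamps.

```python
class TinyTable:
    """Toy model of one table: row -> column -> version -> value."""

    def __init__(self):
        self.rows = {}

    def put(self, row, column, value, version):
        """Store one cell value under an explicit version number."""
        self.rows.setdefault(row, {}).setdefault(column, {})[version] = value

    def get(self, row, column, version=None):
        """Return the requested version of a cell, or the newest one by default."""
        versions = self.rows[row][column]
        if version is None:
            version = max(versions)
        return versions[version]

    def scan(self, start_row, stop_row):
        """Yield (row key, columns) in sorted key order for start_row <= key < stop_row."""
        for row in sorted(self.rows):
            if start_row <= row < stop_row:
                yield row, self.rows[row]
```

Keeping rows sorted by key is what makes range scans cheap and lets a table be split into contiguous key ranges, which is the automatic sharding the slide refers to.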
    27. 27. What’s Next
    28. 28. About Hortonworks – Basics <ul><li>Founded – July 1st, 2011 </li></ul><ul><ul><li>22 Architects & committers from Yahoo! </li></ul></ul><ul><li>Mission – Architect the future of Big Data </li></ul><ul><ul><li>Revolutionize and commoditize the storage and processing of Big Data via open source </li></ul></ul><ul><li>Vision – Half of the world's data will be stored in Hadoop within five years </li></ul>
    29. 29. Game Plan <ul><li>Support the growth of a huge Apache Hadoop ecosystem </li></ul><ul><ul><li>Invest in ease of use, management, and other enterprise features </li></ul></ul><ul><ul><li>Define APIs for ISVs, OEMs and others to integrate with Apache Hadoop </li></ul></ul><ul><ul><li>Continue to invest in advancing the Hadoop core, remain the experts </li></ul></ul><ul><ul><li>Contribute all of our work to Apache </li></ul></ul><ul><li>Profit by providing training & support to the Hadoop community </li></ul>
    30. 30. Lines of Code Contributed to Apache Hadoop
    31. 31. Apache Hadoop Roadmap <ul><li>Phase 1 – Making Apache Hadoop Accessible </li></ul><ul><li>Release the most stable version of Hadoop ever (Hadoop 0.20.205) </li></ul><ul><ul><li>Frequent sustaining releases </li></ul></ul><ul><li>Release directly usable code via Apache (RPMs, .debs…) </li></ul><ul><li>Improve project integration (HBase support) </li></ul>2011 <ul><li>Phase 2 – Next-Generation Apache Hadoop </li></ul><ul><li>Address key product gaps (HA, Management…) </li></ul><ul><li>Enable partner innovation via open APIs </li></ul><ul><li>Enable community innovation via modular architecture </li></ul>2012 (Alphas in Q4 2011)
    32. 32. Next-Generation Hadoop <ul><li>Core </li></ul><ul><ul><li>HDFS Federation – Scale out and innovation via new APIs </li></ul></ul><ul><ul><ul><li>Will run on 6000 node clusters with 24TB disk / node = 144PB in next release </li></ul></ul></ul><ul><ul><li>Next Gen MapReduce – Support for MPI and many other programming models </li></ul></ul><ul><ul><li>HA (no SPOF) and Wire compatibility </li></ul></ul><ul><li>Data - HCatalog 0.3 </li></ul><ul><ul><li>Pig, Hive, MapReduce and Streaming as clients </li></ul></ul><ul><ul><li>HDFS and HBase as storage systems </li></ul></ul><ul><ul><li>Performance and storage improvements </li></ul></ul><ul><li>Management & Ease of use </li></ul><ul><ul><li>Ambari – An Apache Hadoop Management & Monitoring System </li></ul></ul><ul><ul><li>Stack installation and centralized config management </li></ul></ul><ul><ul><li>REST and GUI for user & administrator tasks </li></ul></ul>
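The 144PB figure on the Next-Generation Hadoop slide is raw capacity: 6000 nodes times 24TB per node. The short sketch below reproduces that arithmetic in decimal units. The helper names are hypothetical, and the replication factor of 3 used for the usable-capacity estimate is an assumption that is not stated on the slide.

```python
def raw_capacity_pb(nodes, tb_per_node):
    """Raw cluster capacity in decimal petabytes (1 PB = 1000 TB)."""
    return nodes * tb_per_node / 1000


def usable_capacity_pb(raw_pb, replication=3):
    """Capacity left for unique data once every block is stored `replication` times.

    The default of 3 is an assumption, not a number taken from the slide.
    """
    return raw_pb / replication


if __name__ == "__main__":
    raw = raw_capacity_pb(6000, 24)
    print(f"raw: {raw:.0f} PB, usable at 3x replication: {usable_capacity_pb(raw):.0f} PB")
```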
    33. 33. Thank You! <ul><li>Questions </li></ul><ul><ul><li>Twitter: @jeric14 (@hortonworks) </li></ul></ul><ul><ul><li>www.hortonworks.com </li></ul></ul>