Apache Hadoop Today & Tomorrow Eric Baldeschwieler, CEO Hortonworks, Inc. twitter: @jeric14 (@hortonworks) www.hortonworks.com
Agenda Brief Overview of Apache Hadoop Where Apache Hadoop is Used Apache Hadoop Core Hadoop Distributed File System (HDFS) Map/Reduce Where Apache Hadoop Is Going Q&A
What is Apache Hadoop? A set of open source projects owned by the Apache Foundation that transforms commodity computers and networks into a distributed service HDFS – Stores petabytes of data reliably Map-Reduce – Allows huge distributed computations Key Attributes Reliable and Redundant – Doesn’t slow down or lose data even as hardware fails Simple and Flexible APIs – Our rocket scientists use it directly! Very powerful – Harnesses huge clusters, supports best-of-breed analytics Batch-processing centric – Hence its great simplicity and speed; not a fit for all use cases
What is it used for? Internet scale data Web logs – Years of logs at many TB/day Web Search – All the web pages on earth Social data – All message traffic on Facebook Cutting edge analytics Machine learning, data mining… Enterprise apps Network instrumentation, Mobile logs Video and Audio processing Text mining And lots more!
Apache Hadoop Projects Programming Languages Computation Object Storage Zookeeper (Coordination) Core Apache Hadoop Related Apache Projects HDFS (Hadoop Distributed File System) MapReduce (Distributed Programming Framework) HMS (Management) Table Storage Hive (SQL) Pig (Data Flow) HBase (Columnar Storage) HCatalog (Meta Data)
Where Hadoop is Used
Everywhere! (adoption timeline chart, 2006–2010; source: The Datagraph Blog)
HADOOP @ YAHOO! 40K+ Servers 170 PB Storage 5M+ Monthly Jobs 1000+ Active users © Yahoo 2011
twice the engagement CASE STUDY YAHOO! HOMEPAGE Personalized   for each visitor Result:  twice the engagement © Yahoo 2011 +160% clicks vs. one size fits all +79%  clicks vs. randomly selected +43% clicks vs. editor selected Recommended links News Interests Top Searches
CASE STUDY YAHOO! HOMEPAGE Serving   Maps Users - Interests Five Minute Production Weekly Categorization models USER BEHAVIOR CATEGORIZATION MODELS (weekly) SERVING MAPS (every 5 minutes) USER BEHAVIOR »   Identify user interests using Categorization models »   Machine learning to build ever better categorization models Build customized home pages with latest data (thousands / second) © Yahoo 2011 SCIENCE   HADOOP  CLUSTER SERVING SYSTEMS PRODUCTION   HADOOP  CLUSTER ENGAGED USERS
Enabling quick response in the spam arms race CASE STUDY YAHOO! MAIL 450M mail boxes  5B+ deliveries/day Antispam models retrained every few hours on Hadoop © Yahoo 2011 40% less spam than Hotmail and 55% less spam than Gmail “ “ SCIENCE   PRODUCTION
A Brief History Apache Hadoop, 2006 – present: early adopters scale and productize Hadoop Other Internet Companies, 2008 – present: add tools / frameworks, enhance Hadoop Service Providers, 2010 – present: provide training, support, hosting Wide Enterprise Adoption, nascent / 2011: funds further development, enhancements
Traditional Enterprise Architecture Data Silos + ETL Serving Applications Unstructured Systems Traditional ETL & Message buses Traditional ETL & Message buses EDW Data Marts BI /  Analytics Traditional Data Warehouses,  BI & Analytics Serving Logs Social Media Sensor Data Text Systems …
Serving Applications Unstructured Systems Traditional ETL & Message buses Traditional ETL & Message buses Hadoop Enterprise Architecture Connecting All of Your Big Data  EDW Data Marts BI /  Analytics Traditional Data Warehouses,  BI & Analytics Apache Hadoop EsTsL (s = Store)  Custom Analytics Serving Logs Social Media Sensor Data Text Systems …
Serving Applications Unstructured Systems 80-90% of data  produced today  is unstructured Gartner predicts  800% data growth  over next 5 years Traditional ETL & Message buses Traditional ETL & Message buses Hadoop Enterprise Architecture Connecting All of Your Big Data  EDW Data Marts BI /  Analytics Traditional Data Warehouses,  BI & Analytics Serving Logs Social Media Sensor Data Text Systems … Apache Hadoop EsTsL (s = Store)  Custom Analytics
What is Driving Adoption? Business drivers Identified high value projects that require use of more data Belief that there is great ROI in mastering big data Financial drivers Growing cost of data systems as proportion of IT spend Cost advantage of commodity hardware + open source  Enables departmental-level big data strategies  Technical drivers Existing solutions failing under growing requirements 3Vs - Volume, velocity, variety Proliferation of unstructured data
Big Data Platforms (bubble chart: cost per TB vs. adoption; size of bubble = cost effectiveness of solution) Source:
Apache Hadoop Core
Overview Frameworks share  commodity  hardware Storage - HDFS Processing - MapReduce 2 * 10GigE 2 * 10GigE 2 * 10GigE 2 * 10GigE 20-40 nodes / rack 16 Cores 48G RAM 6-12 * 2TB disk 1-2 GigE to node … Rack Switch 1-2U server … Rack Switch 1-2U server … Rack Switch 1-2U server … Rack Switch 1-2U server …
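To give a feel for what these per-rack figures add up to, here is an illustrative calculation (a sketch: it takes the upper end of the ranges on the slide, and the 3x replication factor is HDFS's default rather than anything stated here):

```python
# Back-of-the-envelope rack capacity from the slide's hardware figures.
nodes_per_rack = 40      # slide gives 20-40 nodes / rack; upper end chosen
disks_per_node = 12      # slide gives 6-12 disks per node; upper end chosen
tb_per_disk = 2          # 2TB disks, per the slide

raw_tb_per_rack = nodes_per_rack * disks_per_node * tb_per_disk
usable_tb_per_rack = raw_tb_per_rack // 3   # assuming HDFS's default 3x replication

print(raw_tb_per_rack, usable_tb_per_rack)  # 960 320
```

So a fully populated rack holds roughly a petabyte raw, a third of that usable after replication.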
Map/Reduce Map/Reduce is a distributed computing programming model It works like a Unix pipeline: cat input | grep  |  sort  | uniq -c  > output Input   |  Map  | Shuffle & Sort |  Reduce   | Output Strengths: Easy to use!  Developer just writes a couple of functions Moves compute to data  Schedules work on HDFS node with data if possible Scans through data, reducing seeks Automatic reliability and re-execution on failure
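The Unix-pipeline analogy can be sketched in plain Python with the canonical word-count example (the function names and structure here are ours for illustration, not Hadoop's actual API):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit an intermediate (key, value) pair per word
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: combine all values for one key, like `uniq -c`
    return (word, sum(counts))

def run_job(lines):
    # Shuffle & sort: bring all pairs with the same key together, like `sort`
    intermediate = sorted(kv for line in lines for kv in mapper(line))
    return [reducer(key, (count for _, count in group))
            for key, group in groupby(intermediate, key=itemgetter(0))]

print(run_job(["the quick brown fox", "the lazy dog"]))
```

In real Hadoop the map and reduce functions run in parallel across the cluster, with the framework handling the shuffle, scheduling near the data, and re-executing failed tasks.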
HDFS: Scalable, Reliable, Manageable Scale IO, Storage, CPU Add commodity servers & JBODs 4K nodes in cluster, 80 Fault Tolerant & Easy management Built-in redundancy Tolerate disk and node failures Automatically manage addition/removal of nodes One operator per 8K nodes!! Storage server used for computation Move computation to data Not a SAN But high-bandwidth network access to data via Ethernet Immutable file system Read, Write, sync/flush No random writes (diagram: rack switches uplinked to redundant core switches)
HDFS Use Cases Petabytes of unstructured data for parallel, distributed analytics processing using commodity hardware Solve problems, at lower cost, that traditional systems cannot Large storage capacity (>100PB raw) Large IO/Computation bandwidth (>4K servers) >4 Terabit bandwidth to disk! (conservatively) Scale by adding commodity hardware Cost per GB ~= $1.5, includes MapReduce cluster
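As a back-of-the-envelope check on that cost figure (the $1.5/GB and >100PB numbers come from the slide; the decimal GB-per-PB conversion is our simplification):

```python
cost_per_gb = 1.5                 # dollars per raw GB, including the MapReduce cluster
raw_gb = 100 * 1000 * 1000        # 100 PB raw, in decimal GB

total_cost = cost_per_gb * raw_gb
print(f"${total_cost / 1e6:.0f}M")  # roughly $150M for a 100 PB deployment
```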
HDFS Architecture Persistent Namespace Metadata & Journal Namespace State Datanodes Block ID    Data Hierarchical Namespace File Name     BlockIDs Horizontally Scale IO and Storage Block Storage Namespace Namenode Block Map Heartbeats & Block Reports Block ID    Block Locations NFS b1 b5 b3 JBOD b2 b3 b1 JBOD b3 b5 b2 JBOD b1 b5 b2 JBOD
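The two structures the slide names can be sketched as nested maps (a toy model with hypothetical file and datanode names; the real namenode keeps these in memory and rebuilds the block map from datanode block reports):

```python
# Hierarchical namespace: file name -> ordered block IDs (persisted + journaled)
namespace = {"/logs/day1.log": ["b1", "b2"]}

# Block map: block ID -> replica locations (soft state, rebuilt from block reports)
block_map = {
    "b1": {"datanode1", "datanode3", "datanode4"},
    "b2": {"datanode1", "datanode2", "datanode3"},
}

def locate(path):
    # What a client open() effectively returns: each block with its replica
    # locations, so reads go directly to datanodes rather than through the namenode
    return [(block, sorted(block_map[block])) for block in namespace[path]]

print(locate("/logs/day1.log"))
```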
Client Read & Write  Directly   from  Closest  Server Namespace State Horizontally Scale IO and Storage Client 1 open 2 read Client 2 write 1 create End-to-end checksum Namenode Block Map b1 b5 b3 JBOD b2 b3 b1 JBOD b3 b5 b2 JBOD b1 b5 b2 JBOD
Actively maintain data reliability Namespace State Bad/lost block replica Periodically check block checksums Namenode Block Map b1 b5 b3 JBOD b2 b3 b4 JBOD b3 b5 b2 JBOD b1 b5 b2 JBOD 2. copy 3. blockReceived 1. replicate
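The replicate/copy/blockReceived cycle on this slide can be sketched as follows (a toy model: the replication target and the source/destination selection policy are our assumptions; real HDFS chooses targets rack-aware):

```python
TARGET_REPLICATION = 3   # HDFS default, assumed here

def repair(block_map, datanodes):
    # Step 1: find blocks whose live replica count fell below target
    # (e.g. a checksum failure or a lost datanode removed a replica).
    copies = []
    for block, locations in block_map.items():
        if len(locations) < TARGET_REPLICATION:
            # Step 2: copy from a surviving replica to a node without one.
            source = sorted(locations)[0]
            target = next(d for d in datanodes if d not in locations)
            locations.add(target)
            # Step 3: in real HDFS, the target confirms via a blockReceived report.
            copies.append((block, source, target))
    return copies

block_map = {"b1": {"dn1", "dn4"}, "b2": {"dn2", "dn3", "dn4"}}
print(repair(block_map, ["dn1", "dn2", "dn3", "dn4"]))
# b1 is under-replicated; a copy is scheduled from dn1 to dn2
```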
HBase Hadoop ecosystem Database, based on Google BigTable Goal: Hosting of very large tables (billions of rows X millions of columns) on commodity hardware. Multidimensional sorted Map  Table => Row => Column => Version => Value Distributed column-oriented store Scale – Sharding etc. done automatically No SQL, CRUD etc.
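The "multidimensional sorted map" model above can be illustrated with nested dicts (a toy sketch, not the HBase API; the table level is omitted, and the `family:qualifier` column names follow HBase convention):

```python
# row -> column -> version(timestamp) -> value
table = {}

def put(row, column, value, ts):
    table.setdefault(row, {}).setdefault(column, {})[ts] = value

def get(row, column, ts=None):
    # Reads return the newest version by default; older versions stay addressable
    versions = table[row][column]
    return versions[ts] if ts is not None else versions[max(versions)]

put("row1", "info:city", "Paris", ts=1)
put("row1", "info:city", "Tokyo", ts=2)
print(get("row1", "info:city"))        # latest version
print(get("row1", "info:city", ts=1))  # older version still readable
```

Real HBase additionally keeps rows sorted by key and shards (splits) ranges of rows across region servers automatically, which is what gives it horizontal scale.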
What’s Next
About Hortonworks – Basics Founded – July 1st, 2011 22 Architects & committers from Yahoo! Mission – Architect the future of Big Data Revolutionize and commoditize the storage and processing of Big Data via open source Vision – Half of the world’s data will be stored in Hadoop within five years
Game Plan Support the growth of a huge Apache Hadoop ecosystem Invest in ease of use, management, and other enterprise features Define APIs for ISVs, OEMs and others to integrate with Apache Hadoop Continue to invest in advancing the Hadoop core, remain the experts Contribute all of our work to Apache Profit by providing training & support to the Hadoop community
Lines of Code Contributed to  Apache Hadoop
Apache Hadoop Roadmap Phase 1 – Making Apache Hadoop Accessible Release the most stable version of Hadoop ever (Hadoop 0.20.205) Frequent sustaining releases Release directly usable code via Apache (RPMs, .debs…) Improve project integration (HBase support) 2011 Phase 2 – Next-Generation Apache Hadoop Address key product gaps (HA, Management…) Enable partner innovation via open APIs Enable community innovation via modular architecture 2012 (Alphas in Q4 2011)
Next-Generation Hadoop Core HDFS Federation – Scale out and innovation via new APIs Will run on 6000 node clusters with 24TB disk / node = 144PB in next release Next Gen MapReduce – Support for MPI and many other programming models HA (no SPOF) and Wire compatibility Data - HCatalog 0.3 Pig, Hive, MapReduce and Streaming as clients HDFS and HBase as storage systems Performance and storage improvements Management & Ease of use Ambari – An Apache Hadoop Management & Monitoring System Stack installation and centralized config management REST and GUI for user & administrator tasks
Thank You! Questions Twitter: @jeric14 (@hortonworks) www.hortonworks.com

Eric Baldeschwieler Keynote from Storage Developers Conference


Editor's Notes

  • #2 Introduce self, background. Thanks SDC, attendees, etc.
  • #13 Tell inception story: plan to differentiate Yahoo, recruit talent, ensure that Y! was not built on a legacy private system. From YST
  • #17 Look for analyst/Gartner data on growing cost…
  • #24–26 The data nodes do not have RAID, just JBOD
  • #27 Arun to develop
  • #30 Our commitment to Apache has already changed the market! Ultimately, contributing the code that matters and making it work is the currency in open source