Hadoop past, present and future



Ever wonder what Hadoop might look like in 12 months, 24 months, or longer? Apache Hadoop MapReduce has undergone a complete overhaul to emerge as Apache Hadoop YARN, a generic compute fabric that supports MapReduce and other application paradigms. As a result, Hadoop looks very different from what it was 12 months ago. This talk will take you through some ideas for YARN itself and the myriad ways it is moving the needle for MapReduce, Pig, Hive, Cascading and other data-processing tools in the Hadoop ecosystem.


Hadoop past, present and future

  1. 1. Hadoop : Past, Present and Future Chris Harris Email : charris@hortonworks.com Twitter : cj_harris5 © Hortonworks Inc. 2013
  2. 2. Past © Hortonworks Inc. 2013 Page 2
  3. 3. A little history… it’s 2005 © Hortonworks Inc. 2013
  4. 4. A Brief History of Apache Hadoop (timeline 2004-2013): Apache project established; Yahoo! begins to operate at scale; Hortonworks Data Platform; Enterprise Hadoop. 2005: Yahoo! creates a team under E14 to work on Hadoop. © Hortonworks Inc. 2013 Page 4
  5. 5. Key Hadoop Data Types 1. Sentiment: understand how your customers feel about your brand and products, right now. 2. Clickstream: capture and analyze website visitors' data trails and optimize your website. 3. Sensor/Machine: discover patterns in data streaming automatically from remote sensors and machines. 4. Geographic: analyze location-based data to manage operations where they occur. 5. Server Logs: research logs to diagnose process failures and prevent security breaches. 6. Unstructured (text, video, pictures, etc.): understand patterns in files across millions of web pages, emails, and documents. © Hortonworks Inc. 2013
  6. 6. Hadoop is NOT: an ESB, NoSQL, HPC, relational, real-time, or the jack of all trades. © Hortonworks Inc. 2013
  7. 7. Hadoop 1 •  Limited to roughly 4,000 nodes per cluster •  O(# of tasks in a cluster) •  JobTracker bottleneck: resource management, job scheduling and monitoring all in one daemon •  Only one namespace for managing HDFS •  Map and Reduce slots are static •  The only job type that runs is MapReduce © Hortonworks Inc. 2013
  8. 8. Hadoop 1 - Basics: MapReduce (computation framework) layered on top of HDFS (storage framework), with file blocks (A, B, C) replicated across the cluster. © Hortonworks Inc. 2013
  9. 9. Hadoop 1 - Reading Files: the Hadoop client asks the NameNode to read a file and gets back the DataNodes, block IDs, etc.; it then reads the blocks directly from the DataNode/TaskTracker nodes spread across the racks. The Secondary NameNode checkpoints the fsimage/edit log; DataNodes send heartbeats and block reports to the NameNode. © Hortonworks Inc. 2013
  10. 10. Hadoop 1 - Writing Files: the Hadoop client requests a write from the NameNode and gets back the target DataNodes; it writes blocks to the first DataNode, which forwards them through a replication pipeline. DataNodes send block reports; the Secondary NameNode checkpoints the fsimage/edit log. © Hortonworks Inc. 2013
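From the client's side, both of these paths go through the standard org.apache.hadoop.fs.FileSystem API. A minimal sketch (the path and file contents are purely illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS (fs.default.name in Hadoop 1) points the client at the NameNode
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write: the client asks the NameNode for target DataNodes,
    // then streams blocks through the replication pipeline.
    Path file = new Path("/user/demo/example.txt");   // illustrative path
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }

    // Read: the NameNode returns block locations and the client
    // reads the blocks directly from the DataNodes.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
  }
}
```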
  11. 11. Hadoop 1 - Running Jobs: the Hadoop client submits a job to the JobTracker, which deploys it to TaskTrackers co-located with DataNodes across the racks; map tasks run, their output is shuffled to the reduce tasks, and the reducers write the final part files. © Hortonworks Inc. 2013
  12. 12. Hadoop 1 - Security: users behind the firewall authenticate (authN/authZ) against LDAP/AD and the Kerberos KDC; the client node/spoke server makes service requests to the Hadoop cluster carrying a block token (for accessing data) and a delegation token (for running jobs); encryption is available via plugin. © Hortonworks Inc. 2013
  13. 13. Hadoop 1 - APIs •  org.apache.hadoop.mapreduce.Partitioner •  org.apache.hadoop.mapreduce.Mapper •  org.apache.hadoop.mapreduce.Reducer •  org.apache.hadoop.mapreduce.Job © Hortonworks Inc. 2013
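For illustration, a minimal word-count job wired together from exactly these classes (the default HashPartitioner fills the Partitioner role); a sketch, assuming the Job.getInstance factory available in later Hadoop releases:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);          // emit (word, 1)
        }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));   // emit (word, total)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```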
  14. 14. Present © Hortonworks Inc. 2013 Page 14
  15. 15. Hadoop 2 •  Potentially up to 10,000 nodes per cluster •  O(cluster size) •  Supports multiple namespaces for managing HDFS •  Efficient cluster utilization (YARN) •  MRv1 backward and forward compatible •  Any application can integrate with Hadoop •  Beyond Java © Hortonworks Inc. 2013
  16. 16. Hadoop 2 - Basics © Hortonworks Inc. 2013
  17. 17. Hadoop 2 - Reading Files (with NameNode Federation): several NameNodes (NN1/ns1 through NN4/ns4) each own a namespace and its block pool; the Hadoop client asks the owning NameNode to read a file, gets back the DataNodes, block IDs, etc., and reads the blocks directly from the DataNode/NodeManager nodes. Each NameNode has its own Secondary NameNode (fsimage/edit copy, checkpoint) or Backup NameNode (fs sync, checkpoint); DataNodes register and send heartbeats/block reports, and each can serve blocks for more than one block pool. © Hortonworks Inc. 2013
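For reference, a federated setup is declared in hdfs-site.xml by listing the nameservices and each NameNode's addresses; a rough sketch (hostnames and nameservice names are illustrative):

```xml
<!-- hdfs-site.xml (sketch): two federated namespaces, ns1 and ns2 -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.http-address.ns1</name>
  <value>nn1.example.com:50070</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.http-address.ns2</name>
  <value>nn2.example.com:50070</value>
</property>
```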
  18. 18. Hadoop 2 - Writing Files: the Hadoop client requests a write from the owning NameNode (NN1/ns1 through NN4/ns4), gets back the target DataNodes, and writes blocks through a replication pipeline across the DataNode/NodeManager nodes; DataNodes send block reports, and each NameNode has its own Secondary or Backup NameNode for checkpointing (fsimage/edit copy or fs sync). © Hortonworks Inc. 2013
  19. 19. Hadoop 2 - Running Jobs: Hadoop clients create and submit applications (app1, app2) to the ResourceManager, whose ApplicationsManager (ASM) and Scheduler partition cluster resources into queues; a per-application ApplicationMaster (AM1, AM2) is launched on a NodeManager, negotiates containers (C1.x, C2.x) with the Scheduler, and reports back; NodeManagers across the racks launch the containers and send status reports. © Hortonworks Inc. 2013
  20. 20. Hadoop 2 - Security: a Knox Gateway cluster sits in a DMZ between two firewalls, in front of the Hadoop cluster; JDBC clients, REST clients, native Hive/HBase clients and browsers (HUE) connect through it, while the gateway integrates with the KDC, LDAP/AD and an enterprise/cloud SSO provider; encryption protects the wire. © Hortonworks Inc. 2013
  21. 21. Hadoop 2 - APIs •  org.apache.hadoop.yarn.api.ApplicationClientProtocol •  org.apache.hadoop.yarn.api.ApplicationMasterProtocol •  org.apache.hadoop.yarn.api.ContainerManagementProtocol © Hortonworks Inc. 2013
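In practice these protocols are usually driven through the org.apache.hadoop.yarn.client.api.YarnClient wrapper. A hedged sketch of submitting a trivial application (the launch command, name and container sizes are placeholders; a real application would also set up local resources, environment and an ApplicationMaster):

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
  public static void main(String[] args) throws Exception {
    // YarnClient talks to the ResourceManager over ApplicationClientProtocol
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app");                      // illustrative name

    // Container that will run the ApplicationMaster
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList("/bin/sleep 60"));  // placeholder AM command
    ctx.setAMContainerSpec(amContainer);
    ctx.setResource(Resource.newInstance(512, 1));           // 512 MB, 1 vcore for the AM

    ApplicationId appId = yarnClient.submitApplication(ctx);
    System.out.println("Submitted " + appId);
    yarnClient.stop();
  }
}
```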
  22. 22. Future © Hortonworks Inc. 2013 Page 22
  23. 23. Apache Tez A New Hadoop Data Processing Framework © Hortonworks Inc. 2013 Page 23
  24. 24. HDP: Enterprise Hadoop Distribution. Hortonworks Data Platform (HDP): Operational Services (Ambari, Falcon*, Oozie), Data Services (Flume, Sqoop, Pig, Hive & HCatalog, HBase), Load & Extract (NFS, WebHDFS, Knox*), Hadoop Core (MapReduce*, Tez*, YARN*, HDFS), Platform Services / Enterprise Readiness (High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots). •  The ONLY 100% open source and complete distribution •  Enterprise grade, proven and tested at scale •  Ecosystem endorsed to ensure interoperability © Hortonworks Inc. 2013 Page 24
  25. 25. Tez (“Speed”) • What is it? – A data processing framework as an alternative to MapReduce – A new incubation project in the ASF • Who else is involved? – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft • Why does it matter? – Widens the platform for Hadoop use cases – Crucial to improving the performance of low-latency applications – Core to the Stinger initiative – Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop © Hortonworks Inc. 2013
  26. 26. Moving Hadoop Beyond MapReduce • Low level data-processing execution engine • Built on YARN • Enables pipelining of jobs • Removes task and job launch times • Does not write intermediate output to HDFS – Much lighter disk and network usage • New base of MapReduce, Hive, Pig, Cascading etc. • Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline © Hortonworks Inc. 2013
  27. 27. Tez - Core Idea: a Task with pluggable Input, Processor and Output; a Tez Task = <Input, Processor, Output>; a YARN ApplicationMaster runs a DAG of Tez Tasks. © Hortonworks Inc. 2013
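A rough sketch of what a DAG of such tasks looks like in code, using the DAG/Vertex/Edge classes from the later (post-incubation) Tez API; the processor class names, parallelism and edge configuration below are illustrative assumptions, not the deck's own example:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;
import org.apache.tez.runtime.library.partitioner.HashPartitioner;

public class TezDagSketch {
  public static DAG buildDag() {
    // Two Tez tasks: each is just <Input, Processor, Output>.
    // The processor class names are hypothetical placeholders.
    Vertex tokenizer = Vertex.create("tokenizer",
        ProcessorDescriptor.create("com.example.TokenProcessor"), 4);
    Vertex summer = Vertex.create("summer",
        ProcessorDescriptor.create("com.example.SumProcessor"), 2);

    // The edge describes how data moves between the two tasks:
    // sorted, partitioned key-value pairs, as in an MR shuffle.
    OrderedPartitionedKVEdgeConfig shuffle = OrderedPartitionedKVEdgeConfig
        .newBuilder(Text.class.getName(), IntWritable.class.getName(),
            HashPartitioner.class.getName())
        .build();

    // A single YARN ApplicationMaster executes this whole DAG of Tez tasks.
    return DAG.create("token-count")
        .addVertex(tokenizer)
        .addVertex(summer)
        .addEdge(Edge.create(tokenizer, summer, shuffle.createDefaultEdgeProperty()));
  }
}
```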
  28. 28. Building Blocks for Tasks: a MapReduce 'Map' task = HDFS Input + Map Processor + Sorted Output; a MapReduce 'Reduce' task = Shuffle Input + Reduce Processor + HDFS Output; a special Pig/Hive 'Map' Tez task = HDFS Input + Map Processor + Pipeline Sorter Output; a special Pig/Hive 'Reduce' Tez task = Shuffle Skip-merge Input + Reduce Processor + Sorted Output; an intermediate 'Reduce' for Map-Reduce-Reduce = Shuffle Input + Reduce Processor + Sorted Output; an in-memory Map Tez task = HDFS Input + Map Processor + In-memory Sorted Output. © Hortonworks Inc. 2013
  29. 29. Pig/Hive-MR versus Pig/Hive-Tez: SELECT a.state, COUNT(*), AVERAGE(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state. With Pig/Hive on MapReduce this runs as three chained jobs (Job 1, Job 2, Job 3) separated by I/O synchronization barriers; with Pig/Hive on Tez it runs as a single job. © Hortonworks Inc. 2013
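In Hive releases that ship Tez support, switching the same query between the two paths is just a session setting; a sketch (written with HiveQL's AVG in place of the slide's AVERAGE):

```sql
-- Run the query on Tez instead of classic MapReduce
SET hive.execution.engine=tez;   -- default engine is "mr"

SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state;
```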
  30. 30. Tez on YARN: Going Beyond Batch. Tez optimizes execution: a new runtime engine for more efficient data processing. Always-on Tez service: low-latency processing for all Hadoop data processing. © Hortonworks Inc. 2013
  31. 31. Apache Knox Secure Access to Hadoop © Hortonworks Inc. 2013
  32. 32. Knox Initiative: make Hadoop security simple. Pillars: Simplify Security, Aggregate Access, Client Agility. Simplify security for both users and operators. Deliver unified and centralized access to the Hadoop cluster. Provide seamless access for users while securing the cluster at the perimeter, shielding the intricacies of the security implementation. Make Hadoop feel like a single application to users. Ensure service users are abstracted from where services are located and how services are configured and scaled. © Hortonworks Inc. 2013
  33. 33. Knox: Make Hadoop Security Simple: clients make {REST} calls to the Knox Gateway, which performs authentication and verification against a user store (KDC, AD, LDAP) before passing requests on to the Hadoop cluster. © Hortonworks Inc. 2013
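From a client's point of view the whole cluster collapses to one HTTPS endpoint. A hedged sketch of listing an HDFS directory over WebHDFS through the gateway (the gateway host, topology name and credentials are illustrative, and TLS certificate handling is omitted):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class KnoxWebHdfsExample {
  public static void main(String[] args) throws Exception {
    // Knox URL layout: https://<gateway>:8443/gateway/<topology>/webhdfs/v1/<path>?op=...
    URL url = new URL(
        "https://knox.example.com:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS");

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    // The gateway authenticates the caller (e.g. HTTP Basic against LDAP/AD)
    // and handles security to the cluster on the user's behalf.
    String credentials = Base64.getEncoder()
        .encodeToString("guest:guest-password".getBytes(StandardCharsets.UTF_8));
    conn.setRequestProperty("Authorization", "Basic " + credentials);

    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);   // JSON FileStatuses listing from WebHDFS
      }
    }
  }
}
```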
  34. 34. Knox: Next Generation of Hadoop Security •  All users (end users, online apps + analytics tools) see one end-point website •  All online systems see one end-point RESTful service gateway •  Consistency across all interfaces and capabilities •  Firewalled cluster that no end users need to access •  More IT-friendly; enables systems admins, DB admins, security admins and network admins © Hortonworks Inc. 2013
  35. 35. Apache Falcon Data Lifecycle Management for Hadoop © Hortonworks Inc. 2013
  36. 36. Data Lifecycle on Hadoop is Challenging. Data management needs: data processing, replication, retention, scheduling, reprocessing, multi-cluster management. Tools: Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs. Problem: a patchwork of tools complicates data lifecycle management. Result: long development cycles and quality challenges. © Hortonworks Inc. 2013
  37. 37. Falcon: One-stop Shop for Data Lifecycle. Apache Falcon provides the data management needs (data processing, replication, retention, scheduling, reprocessing, multi-cluster management) and orchestrates the underlying tools (Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs). Falcon provides a single interface to orchestrate the data lifecycle; sophisticated DLM is easily added to Hadoop applications. © Hortonworks Inc. 2013
  38. 38. Falcon At A Glance: data processing applications drive the Falcon Data Lifecycle Management Service through spec files or REST APIs; the service covers data import and replication, scheduling and coordination, data lifecycle policies, multi-cluster management and SLA management. >  Falcon provides the key services data processing applications need. >  Complex data processing logic is handled by Falcon instead of hard-coded in apps. >  Faster development and higher quality for ETL, reporting and other data processing apps on Hadoop. © Hortonworks Inc. 2013
  39. 39. Falcon Core Capabilities • Core Functionality – Pipeline processing – Replication – Retention – Late data handling • Automates – Scheduling and retry – Recording audit, lineage and metrics • Operations and Management – Monitoring, management, metering – Alerts and notifications – Multi Cluster Federation • CLI and REST API © Hortonworks Inc. 2013
  40. 40. Falcon Example: Multi-Cluster Failover: on the primary Hadoop cluster, staged data flows through cleansed, conformed and presented data sets that feed BI and analytics; Falcon replicates the staged and presented data to a failover Hadoop cluster. >  Falcon manages workflow, replication or both. >  Enables business continuity without requiring full data reprocessing. >  Failover clusters require less storage and CPU. © Hortonworks Inc. 2013
  41. 41. Falcon Example: Retention Policies: Staged Data, retain 5 years; Cleansed Data, retain 3 years; Conformed Data, retain 3 years; Presented Data, retain last copy only. >  Sophisticated retention policies expressed in one place. >  Simplify data retention for audit, compliance, or for data re-processing. © Hortonworks Inc. 2013
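As a rough, hypothetical sketch, a policy like "retain cleansed data for 3 years" lives on the Falcon feed entity that is submitted via spec file or REST API; the entity below is simplified and illustrative, not a complete definition:

```xml
<!-- Illustrative Falcon feed entity: keep cleansed data for 3 years -->
<feed name="cleansed-data" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primary-cluster" type="source">
      <validity start="2013-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <!-- Falcon evicts feed instances older than the retention limit -->
      <retention limit="years(3)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/cleansed/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
  </locations>
  <ACL owner="etl-user" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```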
  42. 42. Falcon Example: Late Data Handling: online transaction data (pulled via Sqoop) and web log data (pushed via FTP) land in a staging area and are joined into a combined dataset; the pipeline waits up to 4 hours for the FTP data to arrive. >  Processing waits until all data is available. >  Developers don't write complex data handling rules within applications. © Hortonworks Inc. 2013
  43. 43. Multi Cluster Management with Prism >  Prism is the part of Falcon that handles multi-cluster deployments. >  Key use cases: replication and data processing that span clusters. © Hortonworks Inc. 2013 Page 43
  44. 44. Hortonworks Sandbox Go from Zero to Big Data in 15 minutes © Hortonworks Inc. 2013 Page 44
  45. 45. Sandbox: A Guided Tour of HDP •  Tutorials and videos give a guided tour of HDP and Hadoop •  Perfect for beginners or anyone learning more about Hadoop •  Installs easily on your laptop or desktop •  Easy-to-use editors for Apache Pig and Hive •  Easily import data and create tables •  Browse and manage HDFS files •  Latest tutorials pushed directly to your Sandbox © Hortonworks Inc. 2013 Page 45
  46. 46. THANK YOU! Chris Harris charris@hortonworks.com Download Sandbox hortonworks.com/sandbox © Hortonworks Inc. 2013 Page 46