Your SlideShare is downloading. ×

Thanks for flagging this SlideShare!

Oops! An error has occurred.


Introducing the official SlideShare app

Stunning, full-screen experience for iPhone and Android

Text the download link to your phone

Standard text messaging rates apply

Hadoop World 2011: Apache Hadoop 0.23 - Arun Murthy, Horton Works


Published on

The Apache Hadoop community is gearing up for the upcoming release of Apache Hadoop 0.23.  This release has major enhancements to Hadoop such as HDFS Federation for hyper-scale and a Next Generation …

The Apache Hadoop community is gearing up for the upcoming release of Apache Hadoop 0.23.  This release has major enhancements to Hadoop such as HDFS Federation for hyper-scale and a Next Generation MapReduce framework. Arun, the Apache Hadoop Release Master for 0.23, will briefly cover the highlights of the release and pay particular attention to the plans and efforts undertaken to test, stabilize and release The talk covers some of the timelines for the release, our plans for compatibility and upgrade paths for existing users of Hadoop.

Published in: Technology

  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Apache Hadoop 0.23What it takes and what it means…Arun C. MurthyFounder/Architect, Hortonworks@acmurthy (@hortonworks) Page 1
  • 2. Hello! I’m Arun• Founder/Architect at Hortonworks Inc. – Formerly, Architect Hadoop MapReduce, Yahoo – Responsible for running Hadoop MR as a service for all of Yahoo (50k nodes footprint) – Yes, I took the 3am calls! • Apache Hadoop, ASF – VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC) – Long-term Committer/PMC member (full time ~6 years) – Release Manager - hadoop-0.23 Page 2
  • 3. Releases so far…• Started for Nutch… Yahoo picked it up in early 2006, hired Doug Cutting• Initially, we did monthly releases (0.1, 0.2 …)• Quarterly after hadoop-0.15 until hadoop-0.20 in 04/2009…• hadoop-0.20 is still the basis of all current, stable, Hadoop distributions – Apache Hadoop 0.20.2xx – CDH3.* – HDP1.*• hadoop-0.20.203 (security) – 05/2011• hadoop-0.20.205 (security + append -> hbase) – 10/2011 hadoop-0.1.0 hadoop-0.10.0 hadoop-0.20.0 hadoop-0.20.205 hadoop-0.23.02006 2009 2012 Page 3
  • 4. hadoop-0.23• First stable release off Apache Hadoop trunk in over 30 months…• Currently alpha (hadoop-0.23.0) is under voting by the Hadoop PMC• Significant major features• Several, several enhancements Page 4
  • 5. HDFS - Federation• Significant scaling…• Separation of Namespace mgmt and Block mgmt• Suresh Srinivas (Hortonworks) – Wed 11am Page 5
  • 6. MapReduce - YARN• NextGen Hadoop Data Processing Framework• Support MR and other paradigms• Mahadev Konar (Hortonworks) – Tue 4.30pm Node Manager Container App Mstr Client Resource Node Manager Manager Client App Mstr Container MapReduce Status Node Manager Job Submission Node Status Resource Request Container Container Page 6
  • 7. Performance• 2x+ across the board• HDFS read/write – CRC32 – fadvise – Shortcut for local reads• MapReduce – Unlock lots of improvements from Terasort record (Owen/Arun, 2009) – Shuffle 30%+ – Small Jobs – Uber AM• Todd Lipcon (Cloudera) – Wed 10am Page 7
  • 8. HDFS NameNode HA• The famous SPOF•• Well on the way to fix in hadoop-0.23.½• Suresh Srinivas (Hortonworks), Aaron Myers (Cloudera) – Tue 2.15pm Page 8
  • 9. More…• HDFS Write pipeline improvements for Hbase – Append/flush etc.• Build - Full Mavenization• EditLogs re-write –• Tonnes more … Page 9
  • 10. Deployment goals• Clusters of 6,000 machines – Each machine with 16+ cores, 48G/96G RAM, 24TB/36TB disks – 200+ PB (raw) per cluster – 100,000+ concurrent tasks – 10,000 concurrent jobs• Yahoo: 50,000+ machines Page 10
  • 11. What does it take to get there?• Testing, *lots* of it• Benchmarks – At least as good as the last one• Integration testing – HBase – Pig – Hive – Oozie• Deployment discipline Page 11
  • 12. Testing• Why is it hard? – MapReduce is, effectively, very wide api – Add Streaming – Add Pipes – Oh, Pig/Hive etc. etc.• Functional tests – Nightly – Nearly 1000 functional tests for MapReduce alone – Several hundred for Pig/Hive etc.• Scale tests – Simulation• Longevity tests• Stress tests Page 12
  • 13. Benchmarks• Benchmark every part of the HDFS & MR pipeline – HDFS read/write throughput – NN operations – Scan, Shuffle, Sort• GridMixv3 – Run production traces in test clusters – Thousands of jobs – Stress mode v/s Replay mode Page 13
  • 14. Integration Testing• Several projects in the ecosystem – HBase – Pig – Hive – Oozie• Cycle – Functional – Scale – Rinse, repeat Page 14
  • 15. Deployment• Alpha/Test (early UAT) – Starting Nov, 2011 – Small scale (500-800 nodes)• Alpha – Jan, 2012 – Majority of users – 2000 nodes per cluster, > 10,000 nodes in all• Beta – Misnomer: 100s of PB, Millions of user applications – Significantly wide variety of applications and load – 4000+ nodes per cluster, > 20000 nodes in all – Late Q1, 2012• Production – Well, it’s production – Mid-to-late Q2 2012 Page 15
  • 16. Questions? Release Candidate: Release Documentation: You.@acmurthy Page 16