Apache Hadoop 0.23 at Hadoop World 2011

Presentation by Arun C Murthy (Founder/Architect, Hortonworks) on Apache Hadoop 0.23 (What it is and what it takes) at Hadoop World 2011 NYC.

Arun is the Founder/Architect at Hortonworks and is the VP, Apache Hadoop, ASF.

Transcript

  • 1. Apache Hadoop 0.23
    What it takes and what it means…
    Arun C. Murthy, Founder/Architect, Hortonworks
    @acmurthy (@hortonworks)
  • 2. Hello! I’m Arun
    • Founder/Architect at Hortonworks Inc.
      – Formerly, Architect Hadoop MapReduce, Yahoo
      – Responsible for running Hadoop MR as a service for all of Yahoo (50k nodes footprint)
      – Yes, I took the 3am calls!
    • Apache Hadoop, ASF
      – VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC)
      – Long-term Committer/PMC member (full time ~6 years)
      – Release Manager - hadoop-0.23
  • 3. Releases so far…
    • Started for Nutch… Yahoo picked it up in early 2006, hired Doug Cutting
    • Initially, we did monthly releases (0.1, 0.2 …)
    • Quarterly after hadoop-0.15 until hadoop-0.20 in 04/2009…
    • hadoop-0.20 is still the basis of all current, stable, Hadoop distributions
      – Apache Hadoop 0.20.2xx
      – CDH3.*
      – HDP1.*
    • hadoop-0.20.203 (security) – 05/2011
    • hadoop-0.20.205 (security + append -> hbase) – 10/2011
    [Timeline figure: hadoop-0.1.0, hadoop-0.10.0, hadoop-0.20.0, hadoop-0.20.205 and hadoop-0.23.0 plotted on an axis marked 2006, 2009, 2012]
  • 4. hadoop-0.23
    • First stable release off Apache Hadoop trunk in over 30 months…
    • Currently alpha: hadoop-0.23.0 is under vote by the Hadoop PMC
    • Significant major features
    • Several, several enhancements
  • 5. HDFS - Federation
    • Significant scaling…
    • Separation of Namespace mgmt and Block mgmt
    • Suresh Srinivas (Hortonworks) – Wed 11am
  • 6. MapReduce - YARN
    • NextGen Hadoop Data Processing Framework
    • Support MR and other paradigms
    • Mahadev Konar (Hortonworks) – Tue 4.30pm
    [Architecture diagram: Clients, a Resource Manager, and Node Managers hosting App Mstr and Container processes; arrows labelled MapReduce Status, Job Submission, Node Status, Resource Request]
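
The component names on slide 6 (Client, Resource Manager, Node Manager, App Mstr, Container) suggest the following flow: a client submits a job, the Resource Manager starts an application master in a container, and the application master then requests further containers for its tasks. The toy model below sketches that flow only. It is not the YARN API: the class names are borrowed from the diagram labels, and the method names, the slot-based capacity model, and the "most free slots first" placement policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodeManager:
    """Toy stand-in for one worker node; capacity is a simple slot count."""
    name: str
    free_slots: int
    containers: List[str] = field(default_factory=list)

    def launch(self, label: str) -> None:
        if self.free_slots <= 0:
            raise RuntimeError(f"{self.name} has no free capacity")
        self.free_slots -= 1
        self.containers.append(label)


class ResourceManager:
    """Toy scheduler: places each container on the node with the most free slots."""

    def __init__(self, nodes: List[NodeManager]) -> None:
        self.nodes = list(nodes)

    def allocate(self, label: str) -> NodeManager:
        node = max(self.nodes, key=lambda n: n.free_slots)
        node.launch(label)
        return node

    def submit(self, app_id: str, num_tasks: int) -> Dict[str, str]:
        """Job submission: start the app master, then serve its resource requests."""
        placements = {}
        master = f"{app_id}/app-master"
        placements[master] = self.allocate(master).name
        # In this sketch the app master's resource requests are simply a loop.
        for i in range(num_tasks):
            label = f"{app_id}/task-{i}"
            placements[label] = self.allocate(label).name
        return placements


if __name__ == "__main__":
    rm = ResourceManager([NodeManager(f"nm{i}", free_slots=2) for i in (1, 2, 3)])
    for label, node in rm.submit("job1", num_tasks=3).items():
        print(f"{label} -> {node}")
```

The point of the sketch is the division of labour: the Resource Manager only hands out containers, while per-job logic lives in the application master, which is what lets the framework "support MR and other paradigms".
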
  • 7. Performance
    • 2x+ across the board
    • HDFS read/write
      – CRC32
      – fadvise
      – Shortcut for local reads
    • MapReduce
      – Unlock lots of improvements from Terasort record (Owen/Arun, 2009)
      – Shuffle 30%+
      – Small Jobs – Uber AM
    • Todd Lipcon (Cloudera) – Wed 10am
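
Slide 7 lists CRC32 among the HDFS read/write improvements. HDFS checksums data in fixed-size chunks so that a reader can verify each chunk independently, which is why checksum cost shows up on every read and write. The sketch below illustrates per-chunk checksumming with Python's `zlib.crc32`. It is not HDFS code and does not reproduce the 0.23 optimisation itself; the function names are hypothetical and the 512-byte chunk size is an assumption chosen for the example.

```python
import zlib
from typing import List, Optional

CHUNK_SIZE = 512  # assumed chunk size for this sketch


def chunk_checksums(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[int]:
    """Return one CRC32 value per fixed-size chunk of data."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def first_corrupt_chunk(
    data: bytes, expected: List[int], chunk_size: int = CHUNK_SIZE
) -> Optional[int]:
    """Return the index of the first chunk whose CRC32 differs, or None."""
    actual = chunk_checksums(data, chunk_size)
    if len(actual) != len(expected):
        raise ValueError("chunk count mismatch")
    for index, (got, want) in enumerate(zip(actual, expected)):
        if got != want:
            return index
    return None


if __name__ == "__main__":
    block = bytes(range(256)) * 8      # 2,048 bytes, so 4 chunks
    sums = chunk_checksums(block)

    damaged = bytearray(block)
    damaged[1500] ^= 0x01              # flip one bit inside chunk index 2

    print(first_corrupt_chunk(block, sums))           # None
    print(first_corrupt_chunk(bytes(damaged), sums))  # 2
```
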
  • 8. HDFS NameNode HA
    • The famous SPOF
    • https://issues.apache.org/jira/browse/HDFS-1623
    • Well on the way to fix in hadoop-0.23.1/2
    • Suresh Srinivas (Hortonworks), Aaron Myers (Cloudera) – Tue 2.15pm
  • 9. More…
    • HDFS write pipeline improvements for HBase
      – Append/flush etc.
    • Build - Full Mavenization
    • EditLogs re-write
      – https://issues.apache.org/jira/browse/HDFS-1073
    • Tonnes more …
  • 10. Deployment goals
    • Clusters of 6,000 machines
      – Each machine with 16+ cores, 48G/96G RAM, 24TB/36TB disks
      – 200+ PB (raw) per cluster
      – 100,000+ concurrent tasks
      – 10,000 concurrent jobs
    • Yahoo: 50,000+ machines
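
The per-machine and per-cluster figures on slide 10 can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes decimal units (1 PB = 1,000 TB) and identical machines throughout the cluster, neither of which the slide states.

```python
# Back-of-the-envelope check of the slide 10 targets.
# Assumptions (not from the slides): decimal units, identical machines.
MACHINES = 6_000
DISK_TB_OPTIONS = (24, 36)   # per-machine raw disk, from the slide
CONCURRENT_TASKS = 100_000   # slide says "100,000+", so a lower bound


def raw_capacity_pb(machines: int, disk_tb: int) -> float:
    """Raw cluster capacity in PB, assuming 1 PB = 1,000 TB."""
    return machines * disk_tb / 1_000


def tasks_per_machine(tasks: int, machines: int) -> float:
    """Average number of concurrent tasks per machine."""
    return tasks / machines


if __name__ == "__main__":
    for disk_tb in DISK_TB_OPTIONS:
        capacity = raw_capacity_pb(MACHINES, disk_tb)
        print(f"{disk_tb} TB/machine -> {capacity:.0f} PB raw")
    per_machine = tasks_per_machine(CONCURRENT_TASKS, MACHINES)
    print(f"{per_machine:.1f} concurrent tasks per machine")
```

Under these assumptions the 24 TB configuration gives 144 PB and the 36 TB configuration gives 216 PB, so only the larger disks reach the "200+ PB" figure. 100,000 tasks over 6,000 machines is about 16.7 per machine, roughly one task per core on a 16-core box.
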
  • 11. What does it take to get there?
    • Testing, *lots* of it
    • Benchmarks
      – At least as good as the last one
    • Integration testing
      – HBase
      – Pig
      – Hive
      – Oozie
    • Deployment discipline
  • 12. Testing
    • Why is it hard?
      – MapReduce is, effectively, a very wide API
      – Add Streaming
      – Add Pipes
      – Oh, Pig/Hive etc. etc.
    • Functional tests
      – Nightly
      – Nearly 1000 functional tests for MapReduce alone
      – Several hundred for Pig/Hive etc.
    • Scale tests
      – Simulation
    • Longevity tests
    • Stress tests
  • 13. Benchmarks
    • Benchmark every part of the HDFS & MR pipeline
      – HDFS read/write throughput
      – NN operations
      – Scan, Shuffle, Sort
    • GridMix v3
      – Run production traces in test clusters
      – Thousands of jobs
      – Stress mode vs. Replay mode
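
Slide 11 sets the bar for benchmarks at "at least as good as the last one". One way to turn that rule into an automated gate over the metrics listed on slide 13 is sketched below. Everything specific here is hypothetical: the metric names, the sample numbers, and the 5% noise tolerance are invented for illustration and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Baseline:
    """Result from the previous release for one benchmark metric."""
    value: float
    higher_is_better: bool


def find_regressions(
    baseline: Dict[str, Baseline],
    candidate: Dict[str, float],
    tolerance: float = 0.05,
) -> List[str]:
    """Return metrics where the candidate is worse than baseline beyond tolerance.

    A metric missing from the candidate run counts as a regression.
    """
    failed = []
    for name, base in baseline.items():
        if name not in candidate:
            failed.append(name)
            continue
        new = candidate[name]
        if base.higher_is_better:
            worse = new < base.value * (1 - tolerance)
        else:
            worse = new > base.value * (1 + tolerance)
        if worse:
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    last_release = {
        "hdfs_read_mb_per_s": Baseline(100.0, higher_is_better=True),
        "sort_seconds": Baseline(600.0, higher_is_better=False),
        "shuffle_seconds": Baseline(300.0, higher_is_better=False),
    }
    release_candidate = {
        "hdfs_read_mb_per_s": 210.0,
        "sort_seconds": 650.0,
        "shuffle_seconds": 200.0,
    }
    print(find_regressions(last_release, release_candidate))
```

With these sample numbers the sort time is more than 5% slower than the baseline, so the gate reports `sort_seconds` and passes the other two metrics.
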
  • 14. Integration Testing
    • Several projects in the ecosystem
      – HBase
      – Pig
      – Hive
      – Oozie
    • Cycle
      – Functional
      – Scale
      – Rinse, repeat
  • 15. Deployment
    • Alpha/Test (early UAT)
      – Starting Nov, 2011
      – Small scale (500-800 nodes)
    • Alpha
      – Jan, 2012
      – Majority of users
      – 2,000 nodes per cluster, > 10,000 nodes in all
    • Beta
      – Misnomer: 100s of PB, Millions of user applications
      – Significantly wide variety of applications and load
      – 4,000+ nodes per cluster, > 20,000 nodes in all
      – Late Q1, 2012
    • Production
      – Well, it’s production
      – Mid-to-late Q2 2012
  • 16. Questions?
    Release Candidate: http://people.apache.org/~acmurthy/hadoop-0.23.0-rc2
    Release Documentation: http://people.apache.org/~acmurthy/hadoop-0.23
    Thank You.
    @acmurthy