Apache Hadoop 0.23
What it takes and what it means…
Page 1
Arun C. Murthy
Founder/Architect, Hortonworks
@acmurthy (@hortonworks)
Hello! I’m Arun
Page 2
• Founder/Architect at Hortonworks Inc.
– Formerly, Architect Hadoop MapReduce, Yahoo
– Responsible for running Hadoop MR as a service for all of Yahoo (50k-node footprint)
– Yes, I took the 3am calls!
• Apache Hadoop, ASF
– VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC)
– Long-term Committer/PMC member (full time ~6 years)
– Release Manager - hadoop-0.23
Releases so far…
Page 3
• Started for Nutch… Yahoo picked it up in early 2006, hired Doug Cutting
• Initially, we did monthly releases (0.1, 0.2 …)
• Quarterly after hadoop-0.15 until hadoop-0.20 in 04/2009…
• hadoop-0.20 is still the basis of all current, stable, Hadoop distributions
– Apache Hadoop 0.20.2xx
– CDH3.*
– HDP1.*
• hadoop-0.20.203 (security) – 05/2011
• hadoop-0.20.205 (security + append -> HBase) – 10/2011
[Timeline figure, 2006 to 2012: hadoop-0.1.0, hadoop-0.10.0, hadoop-0.20.0, hadoop-0.20.205, hadoop-0.23.0]
hadoop-0.23
Page 4
• First stable release off Apache Hadoop trunk in over 30 months…
• The alpha release (hadoop-0.23.0) is currently under vote by the Hadoop PMC
• Significant major features
• Several, several enhancements
HDFS - Federation
Page 5
• Significant scaling…
• Separation of Namespace mgmt and Block mgmt
• Suresh Srinivas (Hortonworks) – Wed 11am
MapReduce - YARN
Page 6
• NextGen Hadoop Data Processing Framework
• Support MR and other paradigms
• Mahadev Konar (Hortonworks) – Tue 4.30pm
[Figure: YARN architecture. Boxes: Client (x2), Resource Manager, Node Manager (x3) hosting Containers and App Mstr processes. Arrow labels: MapReduce Status, Job Submission, Node Status, Resource Request.]
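The request flow in the diagram can be illustrated with a toy model. This is a minimal sketch, not YARN's API: the class names `ToyResourceManager` and `ToyNodeManager`, the slot counts, and the first-fit placement policy are all assumptions made for illustration. It only shows the shape of the idea: a job submission first starts a per-application master, which then obtains one container per task.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNodeManager:
    """Hypothetical stand-in for a Node Manager: tracks free container slots."""
    name: str
    free_slots: int
    containers: list = field(default_factory=list)

    def launch(self, label: str) -> str:
        # Launch a container on this node and consume one slot.
        self.free_slots -= 1
        container_id = f"{self.name}/{label}"
        self.containers.append(container_id)
        return container_id


class ToyResourceManager:
    """Hypothetical stand-in for the Resource Manager (first-fit placement)."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, label: str):
        # Resource Request: pick the first node with a free slot (assumed policy).
        for node in self.nodes:
            if node.free_slots > 0:
                return node.launch(label)
        return None  # no capacity anywhere in the cluster

    def submit_job(self, job: str, num_tasks: int):
        # Job Submission: start the per-application master (App Mstr) first,
        # then request one container per task on its behalf.
        app_master = self.allocate(f"{job}-appmaster")
        if app_master is None:
            return None, []
        tasks = [self.allocate(f"{job}-task{i}") for i in range(num_tasks)]
        return app_master, [t for t in tasks if t is not None]


nodes = [ToyNodeManager("nm1", 2), ToyNodeManager("nm2", 2)]
rm = ToyResourceManager(nodes)
app_master, tasks = rm.submit_job("job1", 3)
print(app_master)  # nm1/job1-appmaster
print(tasks)       # ['nm1/job1-task0', 'nm2/job1-task1', 'nm2/job1-task2']
```

Because the application master is just another container, the same allocation path could in principle serve frameworks other than MapReduce, which is the point of the "Support MR and other paradigms" bullet.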
Performance
Page 7
• 2x+ across the board
• HDFS read/write
– CRC32
– fadvise
– Shortcut for local reads
• MapReduce
– Unlock lots of improvements from Terasort record (Owen/Arun, 2009)
– Shuffle 30%+
– Small Jobs – Uber AM
• Todd Lipcon (Cloudera) – Wed 10am
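To see why CRC32 sits on the HDFS read/write path, here is a minimal sketch of per-chunk checksumming using Python's standard `zlib.crc32`. It is not HDFS code: the 512-byte chunk size and the helper names `chunk_checksums` and `find_corrupt_chunks` are assumptions chosen for illustration. Every byte written or read passes through the checksum, so a faster CRC32 speeds up both directions.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # assumed chunk size, for illustration only


def chunk_checksums(data: bytes, chunk_size: int = BYTES_PER_CHECKSUM):
    """Compute one CRC32 per fixed-size chunk of the data."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def find_corrupt_chunks(data: bytes, expected, chunk_size: int = BYTES_PER_CHECKSUM):
    """Return the indices of chunks whose CRC32 differs from the stored value."""
    actual = chunk_checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# "Write" path: checksum the data as it is stored.
original = bytes(range(256)) * 5          # 1280 bytes -> 3 chunks
stored_sums = chunk_checksums(original)

# "Read" path: verify, here with one byte flipped in the second chunk.
damaged = bytearray(original)
damaged[600] ^= 0xFF
print(find_corrupt_chunks(bytes(damaged), stored_sums))  # [1]
```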
HDFS NameNode HA
Page 8
• The famous SPOF
• https://issues.apache.org/jira/browse/HDFS-1623
• Well on the way to fix in hadoop-0.23.1/2
• Suresh Srinivas (Hortonworks), Aaron Myers (Cloudera) – Tue 2.15pm
More…
Page 9
• HDFS write pipeline improvements for HBase
– Append/flush etc.
• Build - Full Mavenization
• EditLogs re-write
– https://issues.apache.org/jira/browse/HDFS-1073
• Tonnes more …
Deployment goals
Page 10
• Clusters of 6,000 machines
– Each machine with 16+ cores, 48G/96G RAM, 24TB/36TB disks
– 200+ PB (raw) per cluster
– 100,000+ concurrent tasks
– 10,000 concurrent jobs
• Yahoo: 50,000+ machines
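The per-cluster figures above can be cross-checked with simple arithmetic. The sketch below uses only the numbers on the slide; decimal units (1 PB = 1000 TB) and the use of 16 cores as a lower bound are assumptions.

```python
MACHINES = 6_000
DISK_TB_PER_MACHINE = (24, 36)  # from the slide: 24TB/36TB disks
CORES_PER_MACHINE = 16          # slide says 16+, so this is a lower bound
CONCURRENT_TASKS = 100_000


def raw_capacity_pb(machines: int, disk_tb: int) -> float:
    # Decimal units assumed: 1 PB = 1000 TB.
    return machines * disk_tb / 1000


for disk_tb in DISK_TB_PER_MACHINE:
    print(disk_tb, "TB/machine ->", raw_capacity_pb(MACHINES, disk_tb), "PB raw")

tasks_per_machine = CONCURRENT_TASKS / MACHINES
print("tasks per machine:", round(tasks_per_machine, 1))
```

With 36TB per machine the cluster holds 216 PB raw, which matches the "200+ PB" goal; the 24TB configuration gives 144 PB. The task target works out to roughly 16.7 concurrent tasks per machine, or about one per core on a 16-core node.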
What does it take to get there?
Page 11
• Testing, *lots* of it
• Benchmarks – At least as good as the last one
• Integration testing
– HBase
– Pig
– Hive
– Oozie
• Deployment discipline
Testing
Page 12
• Why is it hard?
– MapReduce is, effectively, a very wide API
– Add Streaming
– Add Pipes
– Oh, Pig/Hive etc. etc.
• Functional tests
– Nightly
– Nearly 1000 functional tests for MapReduce alone
– Several hundred for Pig/Hive etc.
• Scale tests
– Simulation
• Longevity tests
• Stress tests
Benchmarks
Page 13
• Benchmark every part of the HDFS & MR pipeline
– HDFS read/write throughput
– NN operations
– Scan, Shuffle, Sort
• GridMixv3
– Run production traces in test clusters
– Thousands of jobs
– Stress mode vs. Replay mode
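The difference between the two modes can be sketched as two ways of turning a production trace into a submission schedule. This is an illustrative simplification, not GridMix's actual logic: the function names, the trace format, and the `max_in_flight` load rule are all assumptions. Replay preserves the recorded gaps between jobs; stress ignores them and submits as soon as the (assumed) load cap allows.

```python
import heapq


def replay_schedule(trace_submit_times):
    """Replay: keep the gaps between jobs exactly as recorded in the trace."""
    start = trace_submit_times[0]
    return [t - start for t in trace_submit_times]


def stress_schedule(job_durations, max_in_flight):
    """Stress: ignore recorded gaps and submit whenever fewer than
    max_in_flight jobs are running (an assumed, simplified load rule)."""
    running = []       # min-heap of finish times of running jobs
    submit_times = []
    now = 0
    for duration in job_durations:
        if len(running) >= max_in_flight:
            # Cluster is "full": wait for the earliest job to finish.
            now = max(now, heapq.heappop(running))
        submit_times.append(now)
        heapq.heappush(running, now + duration)
    return submit_times


# Hypothetical trace: submit times (seconds) and job durations (seconds).
print(replay_schedule([100, 130, 190]))           # [0, 30, 90]
print(stress_schedule([50, 20, 40, 10], 2))       # [0, 0, 20, 50]
```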
Integration Testing
Page 14
• Several projects in the ecosystem
– HBase
– Pig
– Hive
– Oozie
• Cycle
– Functional
– Scale
– Rinse, repeat
Deployment
Page 15
• Alpha/Test (early UAT)
– Starting Nov, 2011
– Small scale (500-800 nodes)
• Alpha
– Jan, 2012
– Majority of users
– 2000 nodes per cluster, > 10,000 nodes in all
• Beta
– Misnomer: 100s of PB, Millions of user applications
– Significantly wide variety of applications and load
– 4000+ nodes per cluster, > 20,000 nodes in all
– Late Q1, 2012
• Production
– Well, it’s production
– Mid-to-late Q2 2012
Questions?
Page 16
Thank You.
@acmurthy
Release Candidate:
http://people.apache.org/~acmurthy/hadoop-0.23.0-rc2
Release Documentation:
http://people.apache.org/~acmurthy/hadoop-0.23
