Apache Hadoop 0.23 at Hadoop World 2011

Hortonworks
HortonworksHortonworks
Apache Hadoop 0.23
What it takes and what it means…


Arun C. Murthy
Founder/Architect, Hortonworks
@acmurthy (@hortonworks)




                                   Page 1
Hello! I’m Arun
• Founder/Architect at Hortonworks Inc.
   – Formerly, Architect Hadoop MapReduce, Yahoo
   – Responsible for running Hadoop MR as a service for all of Yahoo (50k nodes
     footprint)
        – Yes, I took the 3am calls! 


• Apache Hadoop, ASF
   – VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC)
   – Long-term Committer/PMC member (full time ~6 years)
   – Release Manager - hadoop-0.23




                                                                                  Page 2
Releases so far…
•   Started for Nutch… Yahoo picked it up in early 2006, hired Doug Cutting
•   Initially, we did monthly releases (0.1, 0.2 …)
•   Quarterly after hadoop-0.15 until hadoop-0.20 in 04/2009…
•   hadoop-0.20 is still the basis of all current, stable, Hadoop distributions
       – Apache Hadoop 0.20.2xx
       – CDH3.*
       – HDP1.*
• hadoop-0.20.203 (security) – 05/2011
• hadoop-0.20.205 (security + append -> hbase) – 10/2011




        hadoop-0.1.0   hadoop-0.10.0          hadoop-0.20.0   hadoop-0.20.205      hadoop-0.23.0

2006                                   2009                                     2012



                                                                                       Page 3
hadoop-0.23
•   First stable release off Apache Hadoop trunk in over 30 months…
•   Currently alpha (hadoop-0.23.0) is under voting by the Hadoop PMC
•   Significant major features
•   Several, several enhancements




                                                                        Page 4
HDFS - Federation
• Significant scaling…
• Separation of Namespace mgmt and Block mgmt
• Suresh Srinivas (Hortonworks) – Wed 11am




                                                Page 5
MapReduce - YARN
• NextGen Hadoop Data Processing Framework
• Support MR and other paradigms
• Mahadev Konar (Hortonworks) – Tue 4.30pm
                                                           Node
                                                          Manager


                                                   Container   App Mstr


                    Client

                                        Resource           Node
                                        Manager           Manager
                    Client

                                                   App Mstr    Container




                     MapReduce Status                      Node
                                                          Manager
                       Job Submission
                       Node Status
                     Resource Request              Container   Container




                                                                           Page 6
Performance
• 2x+ across the board
• HDFS read/write
   – CRC32
   – fadvise
   – Shortcut for local reads
• MapReduce
   – Unlock lots of improvements from Terasort record (Owen/Arun, 2009)
   – Shuffle 30%+
   – Small Jobs – Uber AM
• Todd Lipcon (Cloudera) – Wed 10am




                                                                          Page 7
HDFS NameNode HA
•   The famous SPOF
•   https://issues.apache.org/jira/browse/HDFS-1623
•   Well on the way to fix in hadoop-0.23.½
•   Suresh Srinivas (Hortonworks), Aaron Myers (Cloudera) – Tue 2.15pm




                                                                         Page 8
More…
• HDFS Write pipeline improvements for Hbase
    – Append/flush etc.
• Build - Full Mavenization
• EditLogs re-write
    – https://issues.apache.org/jira/browse/HDFS-1073
• Tonnes more …




                                                        Page 9
Deployment goals
• Clusters of 6,000 machines
   – Each machine with 16+ cores, 48G/96G RAM, 24TB/36TB disks
   – 200+ PB (raw) per cluster
   – 100,000+ concurrent tasks
   – 10,000 concurrent jobs
• Yahoo: 50,000+ machines




                                                                 Page 10
What does it take to get there?
• Testing, *lots* of it
• Benchmarks – At least as good as the last one
• Integration testing
   – HBase
   – Pig
   – Hive
   – Oozie
• Deployment discipline




                                                  Page 11
Testing
• Why is it hard?
    – MapReduce is, effectively, very wide api
    – Add Streaming
    – Add Pipes
    – Oh, Pig/Hive etc. etc.
• Functional tests
    – Nightly
    – Nearly 1000 functional tests for MapReduce alone
    – Several hundred for Pig/Hive etc.
• Scale tests
    – Simulation
• Longevity tests
• Stress tests




                                                         Page 12
Benchmarks
• Benchmark every part of the HDFS & MR pipeline
   – HDFS read/write throughput
   – NN operations
   – Scan, Shuffle, Sort
• GridMixv3
   – Run production traces in test clusters
   – Thousands of jobs
   – Stress mode v/s Replay mode




                                                   Page 13
Integration Testing
• Several projects in the ecosystem
   – HBase
   – Pig
   – Hive
   – Oozie
• Cycle
   – Functional
   – Scale
   – Rinse, repeat




                                      Page 14
Deployment
• Alpha/Test (early UAT)
   – Starting Nov, 2011
   – Small scale (500-800 nodes)
• Alpha
   – Jan, 2012
   – Majority of users
   – 2000 nodes per cluster, > 10,000 nodes in all
• Beta
   – Misnomer: 100s of PB, Millions of user applications
   – Significantly wide variety of applications and load
   – 4000+ nodes per cluster, > 20000 nodes in all
   – Late Q1, 2012
• Production
   – Well, it’s production
   – Mid-to-late Q2 2012




                                                           Page 15
Questions?


                              Release Candidate:
            http://people.apache.org/~acmurthy/hadoop-0.23.0-rc2

                           Release Documentation:
              http://people.apache.org/~acmurthy/hadoop-0.23



Thank You.
@acmurthy




                                                                   Page 16
1 of 16

Recommended

Hadoop World 2011: Apache Hadoop 0.23 - Arun Murthy, Horton Works by
Hadoop World 2011: Apache Hadoop 0.23 - Arun Murthy, Horton WorksHadoop World 2011: Apache Hadoop 0.23 - Arun Murthy, Horton Works
Hadoop World 2011: Apache Hadoop 0.23 - Arun Murthy, Horton WorksCloudera, Inc.
3.9K views16 slides
HBaseCon 2015: Elastic HBase on Mesos by
HBaseCon 2015: Elastic HBase on MesosHBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon
3.1K views47 slides
Taming YARN @ Hadoop conference Japan 2014 by
Taming YARN @ Hadoop conference Japan 2014Taming YARN @ Hadoop conference Japan 2014
Taming YARN @ Hadoop conference Japan 2014Tsuyoshi OZAWA
12.6K views43 slides
Treasure Data on The YARN - Hadoop Conference Japan 2014 by
Treasure Data on The YARN - Hadoop Conference Japan 2014Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014Ryu Kobayashi
14.2K views47 slides
Hadoop Performance at LinkedIn by
Hadoop Performance at LinkedInHadoop Performance at LinkedIn
Hadoop Performance at LinkedInAllen Wittenauer
8.9K views26 slides
Real-Time Data Loading from MySQL to Hadoop by
Real-Time Data Loading from MySQL to HadoopReal-Time Data Loading from MySQL to Hadoop
Real-Time Data Loading from MySQL to HadoopContinuent
1.5K views38 slides

More Related Content

What's hot

HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera by
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaCloudera, Inc.
5.5K views30 slides
Apache Tez – Present and Future by
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureJianfeng Zhang
618 views38 slides
Savanna: Hadoop on OpenStack by
Savanna: Hadoop on OpenStackSavanna: Hadoop on OpenStack
Savanna: Hadoop on OpenStackMirantis
14.2K views31 slides
MHUG - YARN by
MHUG - YARNMHUG - YARN
MHUG - YARNJoseph Niemiec
431 views25 slides
Rigorous and Multi-tenant HBase Performance Measurement by
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementDataWorks Summit
3.6K views40 slides
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase by
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseCloudera, Inc.
3.2K views21 slides

What's hot(18)

HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera by Cloudera, Inc.
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
Cloudera, Inc.5.5K views
Apache Tez – Present and Future by Jianfeng Zhang
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
Jianfeng Zhang618 views
Savanna: Hadoop on OpenStack by Mirantis
Savanna: Hadoop on OpenStackSavanna: Hadoop on OpenStack
Savanna: Hadoop on OpenStack
Mirantis14.2K views
Rigorous and Multi-tenant HBase Performance Measurement by DataWorks Summit
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance Measurement
DataWorks Summit3.6K views
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase by Cloudera, Inc.
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
Cloudera, Inc.3.2K views
La big datacamp2014_vikram_dixit by Data Con LA
La big datacamp2014_vikram_dixitLa big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixit
Data Con LA1K views
HBaseCon 2015: HBase and Spark by HBaseCon
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and Spark
HBaseCon8.7K views
Foss evolution cos-boudnik by Data Con LA
Foss evolution cos-boudnikFoss evolution cos-boudnik
Foss evolution cos-boudnik
Data Con LA522 views
HBaseCon 2012 | HBase, the Use Case in eBay Cassini by Cloudera, Inc.
HBaseCon 2012 | HBase, the Use Case in eBay Cassini HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
Cloudera, Inc.6.1K views
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ... by Cloudera, Inc.
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
Cloudera, Inc.7.3K views
Hortonworks HBase Meetup Presentation by Hortonworks
Hortonworks HBase Meetup PresentationHortonworks HBase Meetup Presentation
Hortonworks HBase Meetup Presentation
Hortonworks1.4K views
Whirr dev-up-puppetconf2011 by Puppet
Whirr dev-up-puppetconf2011Whirr dev-up-puppetconf2011
Whirr dev-up-puppetconf2011
Puppet895 views
Introduction to Hadoop Ecosystem by GetInData
Introduction to Hadoop Ecosystem Introduction to Hadoop Ecosystem
Introduction to Hadoop Ecosystem
GetInData1.7K views
Elastic HBase on Mesos - HBaseCon 2015 by Cosmin Lehene
Elastic HBase on Mesos - HBaseCon 2015Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015
Cosmin Lehene12.8K views
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran by MapR Technologies
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
MapR Technologies2.9K views
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment by Yahoo Developer Network
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and DeploymentOct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment
Oct 2012 HUG: Hadoop .Next (0.23) - Customer Impact and Deployment

Similar to Apache Hadoop 0.23 at Hadoop World 2011

YARN: Future of Data Processing with Apache Hadoop by
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopHortonworks
4.7K views22 slides
4apachehadoop-0-23hadoopworld2011-111110151810-phpapp02 by
4apachehadoop-0-23hadoopworld2011-111110151810-phpapp024apachehadoop-0-23hadoopworld2011-111110151810-phpapp02
4apachehadoop-0-23hadoopworld2011-111110151810-phpapp02gr0Jmmd7Q_qarobo gr0Jmmd7Q_qarobo
246 views16 slides
4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02 by
4apachehadoop 0-23hadoopworld2011-111110151810-phpapp024apachehadoop 0-23hadoopworld2011-111110151810-phpapp02
4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02Nitish Bhardwaj
138 views16 slides
4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02 by
4apachehadoop 0-23hadoopworld2011-111110151810-phpapp024apachehadoop 0-23hadoopworld2011-111110151810-phpapp02
4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02gr0Jmmd7Q_qarobo gr0Jmmd7Q_qarobo
133 views16 slides
Apache Hadoop MapReduce: What's Next by
Apache Hadoop MapReduce: What's NextApache Hadoop MapReduce: What's Next
Apache Hadoop MapReduce: What's NextDataWorks Summit
4.1K views27 slides
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0 by
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0Adam Muise
3.1K views59 slides

Similar to Apache Hadoop 0.23 at Hadoop World 2011(20)

YARN: Future of Data Processing with Apache Hadoop by Hortonworks
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache Hadoop
Hortonworks4.7K views
4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02 by Nitish Bhardwaj
4apachehadoop 0-23hadoopworld2011-111110151810-phpapp024apachehadoop 0-23hadoopworld2011-111110151810-phpapp02
4apachehadoop 0-23hadoopworld2011-111110151810-phpapp02
Nitish Bhardwaj138 views
Apache Hadoop MapReduce: What's Next by DataWorks Summit
Apache Hadoop MapReduce: What's NextApache Hadoop MapReduce: What's Next
Apache Hadoop MapReduce: What's Next
DataWorks Summit4.1K views
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0 by Adam Muise
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
Adam Muise3.1K views
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future by Vinod Kumar Vavilapalli
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and FutureHadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Hadoop Summit Europe Talk 2014: Apache Hadoop YARN: Present and Future
Apache Hadoop YARN - The Future of Data Processing with Hadoop by Hortonworks
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Hortonworks5K views
Apache Hadoop YARN: Present and Future by DataWorks Summit
Apache Hadoop YARN: Present and FutureApache Hadoop YARN: Present and Future
Apache Hadoop YARN: Present and Future
DataWorks Summit3.7K views
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3 by tcloudcomputing-tw
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
tcloudcomputing-tw3.2K views
Apache Hadoop YARN 2015: Present and Future by DataWorks Summit
Apache Hadoop YARN 2015: Present and FutureApache Hadoop YARN 2015: Present and Future
Apache Hadoop YARN 2015: Present and Future
DataWorks Summit2K views
Hadoop ecosystem framework n hadoop in live environment by Delhi/NCR HUG
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
Delhi/NCR HUG2.5K views
Apache Tez - A unifying Framework for Hadoop Data Processing by DataWorks Summit
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
DataWorks Summit2.6K views
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor... by Data Con LA
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Data Con LA775 views
Introduction to Hadoop and Big Data by Joe Alex
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
Joe Alex306 views
Bikas saha:the next generation of hadoop– hadoop 2 and yarn by hdhappy001
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
hdhappy001806 views
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop by Hortonworks
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Hortonworks53.3K views
YARN - Hadoop Next Generation Compute Platform by Bikas Saha
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute Platform
Bikas Saha945 views

More from Hortonworks

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level by
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks
6.1K views30 slides
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy by
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyHortonworks
3.1K views40 slides
Getting the Most Out of Your Data in the Cloud with Cloudbreak by
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakHortonworks
1.1K views13 slides
Johns Hopkins - Using Hadoop to Secure Access Log Events by
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsHortonworks
1.1K views15 slides
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys by
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysHortonworks
748 views14 slides
HDF 3.2 - What's New by
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's NewHortonworks
1.1K views22 slides

More from Hortonworks(20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level by Hortonworks
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks6.1K views
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy by Hortonworks
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
Hortonworks3.1K views
Getting the Most Out of Your Data in the Cloud with Cloudbreak by Hortonworks
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Hortonworks1.1K views
Johns Hopkins - Using Hadoop to Secure Access Log Events by Hortonworks
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
Hortonworks1.1K views
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys by Hortonworks
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Hortonworks748 views
HDF 3.2 - What's New by Hortonworks
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
Hortonworks1.1K views
Curing Kafka Blindness with Hortonworks Streams Messaging Manager by Hortonworks
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Hortonworks798 views
Interpretation Tool for Genomic Sequencing Data in Clinical Environments by Hortonworks
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Hortonworks1.6K views
IBM+Hortonworks = Transformation of the Big Data Landscape by Hortonworks
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
Hortonworks2.1K views
Premier Inside-Out: Apache Druid by Hortonworks
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
Hortonworks3.4K views
Accelerating Data Science and Real Time Analytics at Scale by Hortonworks
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
Hortonworks948 views
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA by Hortonworks
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
Hortonworks1.1K views
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ... by Hortonworks
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Hortonworks2.1K views
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense by Hortonworks
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Hortonworks1K views
Making Enterprise Big Data Small with Ease by Hortonworks
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
Hortonworks732 views
Webinewbie to Webinerd in 30 Days - Webinar World Presentation by Hortonworks
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Hortonworks492 views
Driving Digital Transformation Through Global Data Management by Hortonworks
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
Hortonworks4.3K views
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features by Hortonworks
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
Hortonworks908 views
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A... by Hortonworks
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks3.2K views
Unlock Value from Big Data with Apache NiFi and Streaming CDC by Hortonworks
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Hortonworks4.5K views

Recently uploaded

The Research Portal of Catalonia: Growing more (information) & more (services) by
The Research Portal of Catalonia: Growing more (information) & more (services)The Research Portal of Catalonia: Growing more (information) & more (services)
The Research Portal of Catalonia: Growing more (information) & more (services)CSUC - Consorci de Serveis Universitaris de Catalunya
66 views25 slides
Top 10 Strategic Technologies in 2024: AI and Automation by
Top 10 Strategic Technologies in 2024: AI and AutomationTop 10 Strategic Technologies in 2024: AI and Automation
Top 10 Strategic Technologies in 2024: AI and AutomationAutomationEdge Technologies
13 views14 slides
DALI Basics Course 2023 by
DALI Basics Course  2023DALI Basics Course  2023
DALI Basics Course 2023Ivory Egg
14 views12 slides
[2023] Putting the R! in R&D.pdf by
[2023] Putting the R! in R&D.pdf[2023] Putting the R! in R&D.pdf
[2023] Putting the R! in R&D.pdfEleanor McHugh
38 views127 slides
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors by
TouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective SensorsTouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective Sensors
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensorssugiuralab
11 views15 slides
Future of Learning - Khoong Chan Meng by
Future of Learning - Khoong Chan MengFuture of Learning - Khoong Chan Meng
Future of Learning - Khoong Chan MengNUS-ISS
31 views7 slides

Recently uploaded(20)

DALI Basics Course 2023 by Ivory Egg
DALI Basics Course  2023DALI Basics Course  2023
DALI Basics Course 2023
Ivory Egg14 views
[2023] Putting the R! in R&D.pdf by Eleanor McHugh
[2023] Putting the R! in R&D.pdf[2023] Putting the R! in R&D.pdf
[2023] Putting the R! in R&D.pdf
Eleanor McHugh38 views
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors by sugiuralab
TouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective SensorsTouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective Sensors
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors
sugiuralab11 views
Future of Learning - Khoong Chan Meng by NUS-ISS
Future of Learning - Khoong Chan MengFuture of Learning - Khoong Chan Meng
Future of Learning - Khoong Chan Meng
NUS-ISS31 views
Special_edition_innovator_2023.pdf by WillDavies22
Special_edition_innovator_2023.pdfSpecial_edition_innovator_2023.pdf
Special_edition_innovator_2023.pdf
WillDavies2214 views
PharoJS - Zürich Smalltalk Group Meetup November 2023 by Noury Bouraqadi
PharoJS - Zürich Smalltalk Group Meetup November 2023PharoJS - Zürich Smalltalk Group Meetup November 2023
PharoJS - Zürich Smalltalk Group Meetup November 2023
Noury Bouraqadi113 views
AMAZON PRODUCT RESEARCH.pdf by JerikkLaureta
AMAZON PRODUCT RESEARCH.pdfAMAZON PRODUCT RESEARCH.pdf
AMAZON PRODUCT RESEARCH.pdf
JerikkLaureta14 views
Empathic Computing: Delivering the Potential of the Metaverse by Mark Billinghurst
Empathic Computing: Delivering  the Potential of the MetaverseEmpathic Computing: Delivering  the Potential of the Metaverse
Empathic Computing: Delivering the Potential of the Metaverse
Mark Billinghurst449 views
Attacking IoT Devices from a Web Perspective - Linux Day by Simone Onofri
Attacking IoT Devices from a Web Perspective - Linux Day Attacking IoT Devices from a Web Perspective - Linux Day
Attacking IoT Devices from a Web Perspective - Linux Day
Simone Onofri15 views
.conf Go 2023 - Data analysis as a routine by Splunk
.conf Go 2023 - Data analysis as a routine.conf Go 2023 - Data analysis as a routine
.conf Go 2023 - Data analysis as a routine
Splunk90 views
Voice Logger - Telephony Integration Solution at Aegis by Nirmal Sharma
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at Aegis
Nirmal Sharma17 views
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen... by NUS-ISS
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...
NUS-ISS23 views
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica... by NUS-ISS
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...
NUS-ISS15 views

Apache Hadoop 0.23 at Hadoop World 2011

  • 1. Apache Hadoop 0.23 What it takes and what it means… Arun C. Murthy Founder/Architect, Hortonworks @acmurthy (@hortonworks) Page 1
  • 2. Hello! I’m Arun • Founder/Architect at Hortonworks Inc. – Formerly, Architect Hadoop MapReduce, Yahoo – Responsible for running Hadoop MR as a service for all of Yahoo (50k nodes footprint) – Yes, I took the 3am calls!  • Apache Hadoop, ASF – VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC) – Long-term Committer/PMC member (full time ~6 years) – Release Manager - hadoop-0.23 Page 2
  • 3. Releases so far… • Started for Nutch… Yahoo picked it up in early 2006, hired Doug Cutting • Initially, we did monthly releases (0.1, 0.2 …) • Quarterly after hadoop-0.15 until hadoop-0.20 in 04/2009… • hadoop-0.20 is still the basis of all current, stable, Hadoop distributions – Apache Hadoop 0.20.2xx – CDH3.* – HDP1.* • hadoop-0.20.203 (security) – 05/2011 • hadoop-0.20.205 (security + append -> hbase) – 10/2011 hadoop-0.1.0 hadoop-0.10.0 hadoop-0.20.0 hadoop-0.20.205 hadoop-0.23.0 2006 2009 2012 Page 3
  • 4. hadoop-0.23 • First stable release off Apache Hadoop trunk in over 30 months… • Currently alpha (hadoop-0.23.0) is under voting by the Hadoop PMC • Significant major features • Several, several enhancements Page 4
  • 5. HDFS - Federation • Significant scaling… • Separation of Namespace mgmt and Block mgmt • Suresh Srinivas (Hortonworks) – Wed 11am Page 5
  • 6. MapReduce - YARN • NextGen Hadoop Data Processing Framework • Support MR and other paradigms • Mahadev Konar (Hortonworks) – Tue 4.30pm Node Manager Container App Mstr Client Resource Node Manager Manager Client App Mstr Container MapReduce Status Node Manager Job Submission Node Status Resource Request Container Container Page 6
  • 7. Performance • 2x+ across the board • HDFS read/write – CRC32 – fadvise – Shortcut for local reads • MapReduce – Unlock lots of improvements from Terasort record (Owen/Arun, 2009) – Shuffle 30%+ – Small Jobs – Uber AM • Todd Lipcon (Cloudera) – Wed 10am Page 7
  • 8. HDFS NameNode HA • The famous SPOF • https://issues.apache.org/jira/browse/HDFS-1623 • Well on the way to fix in hadoop-0.23.½ • Suresh Srinivas (Hortonworks), Aaron Myers (Cloudera) – Tue 2.15pm Page 8
  • 9. More… • HDFS Write pipeline improvements for Hbase – Append/flush etc. • Build - Full Mavenization • EditLogs re-write – https://issues.apache.org/jira/browse/HDFS-1073 • Tonnes more … Page 9
  • 10. Deployment goals • Clusters of 6,000 machines – Each machine with 16+ cores, 48G/96G RAM, 24TB/36TB disks – 200+ PB (raw) per cluster – 100,000+ concurrent tasks – 10,000 concurrent jobs • Yahoo: 50,000+ machines Page 10
  • 11. What does it take to get there? • Testing, *lots* of it • Benchmarks – At least as good as the last one • Integration testing – HBase – Pig – Hive – Oozie • Deployment discipline Page 11
  • 12. Testing • Why is it hard? – MapReduce is, effectively, very wide api – Add Streaming – Add Pipes – Oh, Pig/Hive etc. etc. • Functional tests – Nightly – Nearly 1000 functional tests for MapReduce alone – Several hundred for Pig/Hive etc. • Scale tests – Simulation • Longevity tests • Stress tests Page 12
  • 13. Benchmarks • Benchmark every part of the HDFS & MR pipeline – HDFS read/write throughput – NN operations – Scan, Shuffle, Sort • GridMixv3 – Run production traces in test clusters – Thousands of jobs – Stress mode v/s Replay mode Page 13
  • 14. Integration Testing • Several projects in the ecosystem – HBase – Pig – Hive – Oozie • Cycle – Functional – Scale – Rinse, repeat Page 14
  • 15. Deployment • Alpha/Test (early UAT) – Starting Nov, 2011 – Small scale (500-800 nodes) • Alpha – Jan, 2012 – Majority of users – 2000 nodes per cluster, > 10,000 nodes in all • Beta – Misnomer: 100s of PB, Millions of user applications – Significantly wide variety of applications and load – 4000+ nodes per cluster, > 20000 nodes in all – Late Q1, 2012 • Production – Well, it’s production – Mid-to-late Q2 2012 Page 15
  • 16. Questions? Release Candidate: http://people.apache.org/~acmurthy/hadoop-0.23.0-rc2 Release Documentation: http://people.apache.org/~acmurthy/hadoop-0.23 Thank You. @acmurthy Page 16