Grid Operations



          Hadoop Operations at LinkedIn
          Allen Wittenauer
          Grid Computing Architect


         ©2013 LinkedIn Corporation. All Rights Reserved.


Wednesday, March 20, 13
“Hadoop is not a developer problem; it’s an operations problem.”
          -- Hadoop vendor ex-employee




§ August 2009
               – 20 Nodes in 1 grid
               – Apache Hadoop 0.20.0
               – No configuration management
               – No monitoring
               – No security
               – Free for all, including random mafia hits on running jobs
               – FIFO Scheduling
               – ~20 users
               – 20 tasks per node
               – Solaris

               – No operational support




How We Fixed This (In Chronological Order)




Year One




§ Dropped task count
   – 10 mappers => 7 mappers
   – 10 reducers => 5 reducers

§ Reworked ETL
   – hourlies => dailies
   – Re-ordered to take advantage of compression
      § 10x storage improvement
   – Sample impact on one job (not workflow!):
      § 80,000 map tasks => 2,000 map tasks
      § Run time cut in half

§ Optimize workflows/culture shift
   – More time per task, fewer tasks
   – Production review to reinforce good behavio(u)r
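
Why rolling hourlies into dailies cuts the map task count can be sketched with a simplified model: each non-empty input file yields at least one map task, plus one more per additional split. The file sizes and the 64 MB split size below are illustrative assumptions, not figures from the talk, and the real input format logic has more knobs than this.

```python
import math

def estimate_map_tasks(file_sizes_mb, split_mb=64):
    """Rough map task estimate: one task per split, at least one per non-empty file.

    Simplified model only; split_mb=64 is an assumed split size.
    """
    return sum(
        max(1, math.ceil(size / split_mb))
        for size in file_sizes_mb
        if size > 0
    )

# Hypothetical data set: 24 hourly files of 10 MB each vs. one 240 MB daily file.
hourly_tasks = estimate_map_tasks([10] * 24)
daily_tasks = estimate_map_tasks([240])

print(hourly_tasks, daily_tasks)
```

Under these assumptions the same data needs 24 tasks as hourlies and 4 as a daily, which is the "more time per task, fewer tasks" direction the slide describes.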



§ Switched to Capacity Scheduler
   – FIFO is terrible
   – Fair Share only viable for small tasks
   – Enforced SLAs via custom patch

§ Submitted Jar Size Limit
   – Encourage distributed cache usage
   – Enforced limit via custom patch

§ Queue allocation
   – 5% ETL Tasks
   – 15% Fast Queue:
      - Task Time < 15 Minutes
      - Job Time < 1 Hour
      - Slot stealing from "Slow" Queue
   – 80% Slow Queue:
      - Job Time < 24 Hours
      - Up to 80% of slots
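
The queue split and time limits above can be read as a small set of rules. The sketch below is only an illustration of those rules: the function names and the 1,000-slot grid are hypothetical, the ETL queue is left out of the classifier because the slide lists no limits for it, and the actual enforcement was a custom scheduler patch whose logic is not shown in the talk.

```python
# Queue shares in whole percent, as listed on the slide.
QUEUE_SHARE_PCT = {"etl": 5, "fast": 15, "slow": 80}

def slots_per_queue(total_slots):
    """Split a grid's slots across queues by percentage (integer math)."""
    return {name: total_slots * pct // 100 for name, pct in QUEUE_SHARE_PCT.items()}

def pick_queue(task_minutes, job_hours):
    """Pick the queue whose limits an expected job fits, or None if it fits neither.

    Fast queue: task time < 15 minutes and job time < 1 hour.
    Slow queue: job time < 24 hours.
    """
    if task_minutes < 15 and job_hours < 1:
        return "fast"
    if job_hours < 24:
        return "slow"
    return None

print(slots_per_queue(1000))   # hypothetical 1,000-slot grid
print(pick_queue(5, 0.5))      # short tasks, short job
print(pick_queue(40, 6))       # long tasks, multi-hour job
```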




§ Benchmarking
   – Use production code, not TeraSort!

                     Old Node          New Node
   Form factor       2 Rack Units      1 Rack Unit
   CPUs              2                 2
   RAM               16 GB             24 or 32 GB
   Disks             8 x 1 TB SATA     6 x 2 TB SATA
   NIC               1 x 2 Gb          1 x 1 Gb

§ Cut cost per unit in half
§ 2x nodes per rack
§ Extra RAM
   – buffering
   – bus speed
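
The density gain from the new node type follows from simple arithmetic on the specifications above. A minimal sketch, using only the slide's numbers (and the lower 24 GB RAM option for the new node); the class and field names are made up for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeSpec:
    rack_units: int
    disks: int
    disk_tb: int
    ram_gb: int

    @property
    def raw_tb(self):
        """Raw disk capacity of one node, in TB."""
        return self.disks * self.disk_tb

    @property
    def raw_tb_per_rack_unit(self):
        """Raw disk capacity per rack unit occupied."""
        return self.raw_tb / self.rack_units

    @property
    def ram_gb_per_rack_unit(self):
        return self.ram_gb / self.rack_units

old_node = NodeSpec(rack_units=2, disks=8, disk_tb=1, ram_gb=16)
new_node = NodeSpec(rack_units=1, disks=6, disk_tb=2, ram_gb=24)

for name, node in (("old", old_node), ("new", new_node)):
    print(name, node.raw_tb, node.raw_tb_per_rack_unit, node.ram_gb_per_rack_unit)
```

Per rack unit, the new node carries three times the raw disk (12 TB vs. 4 TB) and three times the RAM (24 GB vs. 8 GB), even though it has fewer spindles per node.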


Year Two




§ DataNode disk partitioning
   – Separate file systems for different purposes

        [ 20 GB: /, ... ]  [ 200 GB: MR ]  [ HDFS ]
                           ...
        [ 5 GB: Swap    ]  [ 200 GB: MR ]  [ HDFS ]

   – Mount options: noatime, commit=30, data=writeback

§ NN, JT, etc
   – No “special hardware” == use SW RAID
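
As an illustration of how the mount options listed above might be applied, the sketch below builds fstab-style entries for the MR and HDFS partitions. The device names, mount points, and the choice of ext4 are assumptions made for the example (the slide names only the options); `commit=` and `data=writeback` are options of the ext3/ext4 family.

```python
# Mount options from the slide.
MOUNT_OPTIONS = ("noatime", "commit=30", "data=writeback")

def fstab_line(device, mount_point, fs_type="ext4", options=MOUNT_OPTIONS):
    """Build one tab-separated fstab entry (dump and fsck pass fields set to 0)."""
    return "\t".join([device, mount_point, fs_type, ",".join(options), "0", "0"])

# Hypothetical layout for one data disk: an MR scratch partition and an HDFS partition.
entries = [
    fstab_line("/dev/sdb2", "/grid/1/mr"),
    fstab_line("/dev/sdb3", "/grid/1/hdfs"),
]

for entry in entries:
    print(entry)
```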




LDAP Master + KDC Master   <== Multi-Master Replication ==>   LDAP Master + KDC
           |                                                         |
    LDAP/KDC Slaves                                           LDAP/KDC Slaves
           |                                                         |
    username, uid                                             username, uid
    group name, gid                                           group name, gid
    netgroup, sudoers                                         netgroup, sudoers
           |                                                         |
         nscd                                                      nscd
     Client Node                                               Client Node



Host (bcfg2 client)  <-->  bcfg2 Server
   – Host's groups: Group1, Group2, ...
   – Server's group-to-service map:
        Group1 -> Svc1, Svc2, ...
        Group2 -> Svc1, Svc3, ...
        Group3 -> Svc4, Svc5, ...
   – Content for the host: Svc1 + Svc2 + Svc3

§ Service Bundle
   – RPMs, config files, etc
   – Conflict resolution
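
The group-to-service mapping in the diagram can be sketched as a lookup that merges the bundles of every group a host belongs to, keeping each service once. This is a plain-Python illustration of the idea, not bcfg2's API or its actual conflict-resolution logic; the function name is hypothetical and the group and service names are the placeholders from the slide.

```python
# Group -> service bundles, as drawn on the slide.
GROUP_SERVICES = {
    "Group1": ["Svc1", "Svc2"],
    "Group2": ["Svc1", "Svc3"],
    "Group3": ["Svc4", "Svc5"],
}

def services_for(groups, group_services=GROUP_SERVICES):
    """Return the ordered, de-duplicated services for a host's groups."""
    resolved = []
    for group in groups:
        for service in group_services[group]:
            if service not in resolved:  # a service shared by two groups appears once
                resolved.append(service)
    return resolved

print(services_for(["Group1", "Group2"]))
```

A host in Group1 and Group2 ends up with Svc1, Svc2, and Svc3, matching the content shown in the diagram.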




§ Different RPM names + different install locations = pre-deploy-ability:

   Object                      RPM Name                               File Path
   Hadoop 1.0.4-p3 Binaries    hadoop-1043-bin-1.0.4-3                /dir/hadoop-1.0.4-p3
   Grid Config for 1.0.4-p3    gridname-1043-hadoopconf-1.0.4.3-1     /dir/grid-conf-1.0.4-p3
   Hadoop 1.1.2-p1 Binaries    hadoop-1121-bin-1.1.2.1-1              /dir/hadoop-1.1.2-p1
   Grid Config for 1.1.2-p1    gridname-1043-hadoopconf-1.0.4.3-1     /dir/grid-conf-1.1.2-p1
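
The point of the table is that a new release can sit on disk next to the running one because nothing collides. A minimal sketch of the path side of that scheme, following the File Path column; the helper names are hypothetical and `/dir` is the slide's own placeholder:

```python
def install_path(kind, version, patch, base="/dir"):
    """Build a versioned install path such as /dir/hadoop-1.0.4-p3.

    kind is "bin" for Hadoop binaries or "conf" for grid configuration.
    """
    prefix = {"bin": "hadoop", "conf": "grid-conf"}[kind]
    return f"{base}/{prefix}-{version}-p{patch}"

def can_predeploy(installed_paths, new_paths):
    """A release can be pre-deployed if none of its paths are already in use."""
    return not set(installed_paths) & set(new_paths)

running = [install_path("bin", "1.0.4", 3), install_path("conf", "1.0.4", 3)]
upcoming = [install_path("bin", "1.1.2", 1), install_path("conf", "1.1.2", 1)]

print(running)
print(upcoming)
print(can_predeploy(running, upcoming))
```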




Year Three+




Corp IT                                                     Grid Realm
Active Directory @CORP   <--- krbtgt/GRID@CORP --->         @GRID
   Password                                                 krbtgt/host@GRID
                                                            krbtgt/service@GRID

krbtgt/user@CORP, krbtgt/GRID@CORP   --->   Hadoop Services
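
In standard Kerberos cross-realm authentication, a client in one realm reaches services in another through a cross-realm ticket-granting principal named `krbtgt/<service realm>@<client realm>`, which is the `krbtgt/GRID@CORP` principal in the diagram. The sketch below only illustrates that naming convention; the function names are hypothetical and it does not talk to a KDC.

```python
def parse_principal(principal):
    """Split a fully qualified 'primary/instance@REALM' principal into its parts.

    Assumes the realm is present; the instance is '' when there is none.
    """
    name, _, realm = principal.rpartition("@")
    primary, _, instance = name.partition("/")
    return primary, instance, realm

def cross_realm_tgt(client_realm, service_realm):
    """Name the cross-realm TGT principal needed, or None within a single realm."""
    if client_realm == service_realm:
        return None
    return f"krbtgt/{service_realm}@{client_realm}"

print(parse_principal("krbtgt/GRID@CORP"))
print(cross_realm_tgt("CORP", "GRID"))
```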




Many months moving to secure Apache Hadoop...




§ March 2013
               – 5000 Nodes in ~10 grids
               – Apache Hadoop 1.0.4 + custom patches
               – Full configuration management
               – Full monitoring
               – Security
               – Capacity scheduler with SLA
               – ~700 users
               – 12 tasks per node
               – Linux

               – Five dedicated operations staff members




Future Work




Is ‘pure Hadoop’ the right tool for all of our workloads?




YARN   PBS

HDFS   CEPH




§ More on LinkedIn Hadoop Performance:
               – http://www.slideshare.net/allenwittenauer/2012-lihadoopperf


           § LinkedIn Data Analytics:
               – http://data.linkedin.com/




Epistemic Interaction - tuning interfaces to provide information for AI support
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 

Hadoop Operations at LinkedIn

  • 1. Grid Operations Hadoop Operations at LinkedIn Allen Wittenauer Grid Computing Architect ©2013 LinkedIn Corporation. All Rights Reserved. Wednesday, March 20, 13
  • 2. “Hadoop is not a developer problem; it’s an operations problem.” -- Hadoop vendor ex-employee
  • 4. § August 2009
         – 20 Nodes in 1 grid
         – Apache Hadoop 0.20.0
         – No configuration management
         – No monitoring
         – No security
         – Free for all, including random mafia hits on running jobs
         – FIFO Scheduling
         – ~20 users
         – 20 tasks per node
         – Solaris
         – No operational support
  • 6. How We Fixed This (In Chronological Order)
  • 7. Year One
  • 8. § Dropped task count
         – 10 mappers => 7 mappers
         – 10 reducers => 5 reducers
       § Reworked ETL
         – hourlies => dailies
         – Re-ordered to take advantage of compression
            § 10x storage improvement
         – Sample impact on one job (not workflow!):
            § 80,000 map tasks => 2,000 map tasks
            § Run time cut in half
       § Optimize work flows/culture shift
            § More task time, fewer tasks
            § Production review to reinforce good behavio(u)r
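
Slide 8's drop in map tasks can be illustrated with a rough estimate. The sketch below assumes one map task per HDFS block with at least one task per file, a 64 MB block size, and made-up file sizes; it ignores whether the compression codec is splittable. None of these figures come from the deck except the 10x storage improvement.

```python
import math

def estimate_map_tasks(file_sizes_mb, block_mb=64):
    """Rough map task count: one per block, and never fewer than one per file."""
    return sum(max(1, math.ceil(size / block_mb)) for size in file_sizes_mb)

DAYS = 30
hourly_files = [20] * (24 * DAYS)          # hypothetical: 20 MB per hourly file
daily_files = [20 * 24] * DAYS             # same data merged into one file per day
daily_compressed = [20 * 24 / 10] * DAYS   # applying the slide's 10x storage improvement

print(estimate_map_tasks(hourly_files))      # 720
print(estimate_map_tasks(daily_files))       # 240
print(estimate_map_tasks(daily_compressed))  # 30
```

Many small files cost one task each no matter how little data they hold, which is why merging hourlies into dailies cuts the count before compression is even applied.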
  • 9. § Switched to Capacity Scheduler
         – FIFO is terrible
         – Fair Share only viable for small tasks
         – Enforced SLAs via custom patch
       § Submitted Jar Size Limit
         – Encourage distributed cache usage
         – Enforced limit via custom patch
       [Diagram: queue allocation]
         5% ETL Tasks
         15% Fast Queue:
           - Task Time < 15 Minutes
           - Job Time < 1 Hour
           - Slot stealing from "Slow" Queue
         80% Slow Queue:
           - Job Time < 24 Hours
           - Up to 80% of slots
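
The queue split on slide 9 can be turned into slot counts with simple arithmetic. This is a minimal sketch: the percentages and time limits are the slide's, while the slot total, the function names, and the rounding rule are assumptions. It does not model the custom SLA patch or slot stealing.

```python
# Queue shares as read off slide 9; the slot total used below is a made-up example.
QUEUE_SHARES = {"etl": 5, "fast": 15, "slow": 80}

def slot_split(total_slots, shares=QUEUE_SHARES):
    """Split total_slots by percentage, giving rounding leftovers to the largest queue."""
    if sum(shares.values()) != 100:
        raise ValueError("queue shares must add up to 100%")
    slots = {name: total_slots * pct // 100 for name, pct in shares.items()}
    largest = max(shares, key=shares.get)
    slots[largest] += total_slots - sum(slots.values())
    return slots

def fits_fast_queue(task_minutes, job_minutes):
    """Fast-queue limits from the slide: tasks under 15 minutes, jobs under 1 hour."""
    return task_minutes < 15 and job_minutes < 60

print(slot_split(240))          # {'etl': 12, 'fast': 36, 'slow': 192}
print(fits_fast_queue(10, 45))  # True
```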
  • 10. § Benchmarking
          – Use production code not TeraSort!

          Old Node:          New Node:
          - 2 Rack Units     - 1 Rack Unit
          - 2 CPUs           - 2 CPUs
          - 16 GB            - 24 or 32 GB
          - 8 x 1 TB SATA    - 6 x 2 TB SATA
          - 1 x 2 gb NIC     - 1 x 1 gb NIC

        § Cut cost per unit in half
        § 2x nodes per rack
        § Extra RAM
          – buffering
          – bus speed
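
The storage density behind slide 10's node comparison can be worked out directly from its figures. The sketch below uses only the rack-unit, disk, and RAM numbers from the slide; the class and property names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Node:
    rack_units: int
    disks: int
    disk_tb: int
    ram_gb: int

    @property
    def raw_tb(self):
        """Raw disk capacity of the node in TB."""
        return self.disks * self.disk_tb

    @property
    def tb_per_rack_unit(self):
        """Raw capacity per rack unit occupied."""
        return self.raw_tb / self.rack_units

old = Node(rack_units=2, disks=8, disk_tb=1, ram_gb=16)
new = Node(rack_units=1, disks=6, disk_tb=2, ram_gb=24)  # slide lists 24 or 32 GB

print(old.tb_per_rack_unit)                         # 4.0
print(new.tb_per_rack_unit)                         # 12.0
print(new.tb_per_rack_unit / old.tb_per_rack_unit)  # 3.0
```

On these numbers the new node holds three times the raw disk per rack unit, in addition to fitting twice as many nodes per rack.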
  • 12. Year Two
  • 14. § DataNode disk partitioning
          – Separate file systems for different purposes
              | 20 GB: /, ... | 200 GB: MR | HDFS |
              ...
              | 5 GB: Swap    | 200 GB: MR | HDFS |
          – Mount options: noatime, commit=30, data=writeback
        § NN, JT, etc
          – No “special hardware” == use SW RAID
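
Slide 14's mount options could be applied through /etc/fstab entries like the ones generated below. This is a sketch under assumptions: the slide does not name the file system (ext3 is assumed because `commit=` and `data=writeback` are ext3/ext4 options), and the device names and mount points are hypothetical.

```python
MOUNT_OPTIONS = "noatime,commit=30,data=writeback"  # options listed on the slide

def fstab_line(device, mount_point, fs_type="ext3"):
    """Format one /etc/fstab entry; fs_type is an assumption, the slide does not name it."""
    # Fields: device, mount point, type, options, dump flag, fsck pass
    return f"{device}\t{mount_point}\t{fs_type}\t{MOUNT_OPTIONS}\t0\t2"

# Hypothetical devices and mount points for one data disk
for device, mount_point in [("/dev/sdb2", "/grid/0/mapred"), ("/dev/sdb3", "/grid/0/hdfs")]:
    print(fstab_line(device, mount_point))
```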
  • 15. [Diagram: directory and authentication services]
          LDAP Master + KDC Master <- Multi Master Replication -> LDAP Master + KDC
          LDAP/KDC Slaves (under each master)
          Served data: username, uid; group name, gid; netgroup, sudoers
          nscd on each Client Node
  • 17. [Diagram: Host (bcfg2 client) <-> bcfg2 Server]
          Host sends: Group1, Group2, ...
          Server maps:
            Group1 -> Svc1, Svc2, ...
            Group2 -> Svc1, Svc3, ...
            Group3 -> Svc4, Svc5, ...
          Server returns: Svc1 + Svc2 + Svc3 Content
        § Service Bundle
          – RPMs, config files, etc
          – Conflict resolution
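
The group-to-service lookup drawn on slide 17 can be sketched as a plain dictionary merge. This is not bcfg2's API; the function name is hypothetical, and the only conflict handling shown is de-duplication of a service reached through more than one group.

```python
# Group-to-service mapping as drawn on the slide (the "..." entries are left out).
GROUP_SERVICES = {
    "Group1": ["Svc1", "Svc2"],
    "Group2": ["Svc1", "Svc3"],
    "Group3": ["Svc4", "Svc5"],
}

def services_for(host_groups, group_services=GROUP_SERVICES):
    """Return the de-duplicated services for a host, in first-seen order."""
    resolved = []
    for group in host_groups:
        for service in group_services[group]:
            if service not in resolved:
                resolved.append(service)
    return resolved

print(services_for(["Group1", "Group2"]))  # ['Svc1', 'Svc2', 'Svc3']
```

A host in Group1 and Group2 receives Svc1 once, which matches the "Svc1 + Svc2 + Svc3" result on the slide.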
  • 18. § Different RPM names + different install locations = pre-deploy-ability:

          Object                     RPM Name                             File Path
          Hadoop 1.0.4-p3 Binaries   hadoop-1043-bin-1.0.4-3              /dir/hadoop-1.0.4-p3
          Grid Config for 1.0.4-p3   gridname-1043-hadoopconf-1.0.4.3-1   /dir/grid-conf-1.0.4-p3
          Hadoop 1.1.2-p1 Binaries   hadoop-1121-bin-1.1.2.1-1            /dir/hadoop-1.1.2-p1
          Grid Config for 1.1.2-p1   gridname-1043-hadoopconf-1.0.4.3-1   /dir/grid-conf-1.1.2-p1
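
Slide 18's point is that versioned install locations never collide, so a new release can sit on disk next to the running one. The sketch below derives only the File Path column; the function name is hypothetical, and the RPM naming scheme is left out because the slide's examples are not consistent enough to infer a rule.

```python
def install_paths(version, patch, base="/dir"):
    """Derive the versioned directories shown in the slide's File Path column."""
    tag = f"{version}-p{patch}"
    return {
        "binaries": f"{base}/hadoop-{tag}",
        "config": f"{base}/grid-conf-{tag}",
    }

current = install_paths("1.0.4", 3)
staged = install_paths("1.1.2", 1)

# No path is shared, so the staged release can be installed ahead of time.
overlap = set(current.values()) & set(staged.values())
print(current["binaries"])  # /dir/hadoop-1.0.4-p3
print(staged["config"])     # /dir/grid-conf-1.1.2-p1
print(overlap)              # set()
```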
  • 19. Year Three+
  • 20. [Diagram: Kerberos realms]
          Corp IT: Active Directory, @CORP
          Grid Realm: @GRID, Hadoop Services
          Labels shown: Password, krbtgt/user@CORP, krbtgt/GRID@CORP, krbtgt/host@GRID, krbtgt/service@GRID
  • 21. Many months moving to secure Apache Hadoop...
  • 24. § March 2013
          – 5000 Nodes in ~10 grids
          – Apache Hadoop 1.0.4 + custom patches
          – Full configuration management
          – Full monitoring
          – Security
          – Capacity scheduler with SLA
          – ~700 users
          – 12 tasks per node
          – Linux
          – Five dedicated operations staff members
  • 26. Future Work
  • 27. Is ‘pure Hadoop’ the right tool for all of our workloads?
  • 28. [Diagram] YARN, PBS, HDFS, CEPH
  • 30. § More on LinkedIn Hadoop Performance:
          – http://www.slideshare.net/allenwittenauer/2012-lihadoopperf
        § LinkedIn Data Analytics:
          – http://data.linkedin.com/

Editor's Notes

  1. Goals:
     - Fix the performance
     - Make the system operationally sound
  2. Goals:
     - Corporate decision to switch to Linux
     - Start prep for security
  3. We use Cobbler to control our kickstart installs. Key features:
     * template engine
     * snippet system
     * RPM repo sync
     * both command-line and programmable APIs
     * and, most importantly, great support for a “netboot always” environment
     This means that our hosts always boot from the network and, if that fails, from local disk. We generally re-install a machine after a disk failure so that it starts from a clean slate, clearing out any excess cruft and restoring host-specific parts such as Kerberos keytabs. What may be surprising is that our kickstart environment serves primarily to do three things:
     * partition disks
     * get enough of the OS installed to troubleshoot a broken kickstart
     * bootstrap our configuration management tool
  4. - Born out of the HPC community in 2004
     - Python
     - BSD License
     - Love the community
     - Works with everything, not just the Hadoop ecosystem
     - Services-based methodology with conflict resolution
     - Awesome reporting engine
  5. Goals:
     - Deploy secure Hadoop
     - Reduce user friction
  6. A talk in and of itself. Highlights:
     - another cultural shift
     - finding many bugs in what was considered stable code
     - forking the Kerberos web filter due to poor code quality
  7. Goals:
     - What do we need for the next 4 years?