Hadoop Operations at LinkedIn

Take a peek behind the curtain at how the operations team at LinkedIn deploys and configures Hadoop and its surrounding infrastructure. This talk will feature information for new and expert users alike. Topics will include user and machine provisioning, software deployment, configuration management, and a walk through some of the custom patches for one of the leading Hadoop installations in the world.

Published in: Technology

Transcript of "Hadoop Operations at LinkedIn"

  1. Grid Operations
     Hadoop Operations at LinkedIn
     Allen Wittenauer, Grid Computing Architect
     ©2013 LinkedIn Corporation. All Rights Reserved. Thursday, March 28, 2013
  2. “Hadoop is not a developer problem; it’s an operations problem.”
     -- Hadoop vendor ex-employee
  4. § August 2009
       – 20 Nodes in 1 grid
       – Apache Hadoop 0.20.0
       – No configuration management
       – No monitoring
       – No security
       – Free for all, including random mafia hits on running jobs
       – FIFO Scheduling
       – ~20 users
       – 20 tasks per node
       – Solaris
       – No operational support
  6. How We Fixed This (In Chronological Order)
  7. Year One
  8. § Dropped task count
       – 10 mappers => 7 mappers
       – 10 reducers => 5 reducers
     § Reworked ETL
       – hourlies => dailies
       – Re-ordered to take advantage of compression
         § 10x storage improvement
       – Sample impact on one job (not workflow!):
         § 80,000 map tasks => 2,000 map tasks
         § Run time cut in half
     § Optimize work flows/culture shift
         § More task time, less tasks
         § Production review to reinforce good behavio(u)r
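The drop in map tasks on slide 8 is consistent with how MapReduce typically sizes a job: roughly one map task per input split, with every file costing at least one task however small it is. The following is a minimal sketch of that arithmetic only. The 64 MB block size, the file counts, and the file sizes are illustrative assumptions, not figures from the talk, and real split counts depend on the InputFormat in use.

```python
import math

BLOCK_SIZE_MB = 64  # assumed HDFS block size; not stated in the talk


def map_tasks(file_sizes_mb, block_size_mb=BLOCK_SIZE_MB):
    """Approximate map task count: one task per block-sized split,
    and at least one task per file."""
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)


# Hypothetical layout: 24 hourly partitions, each holding 100 small 4 MB files.
hourly_files = [4] * (24 * 100)

# The same data rewritten as one daily partition of 100 files of 96 MB each.
daily_files = [4 * 24] * 100

hourly_tasks = map_tasks(hourly_files)
daily_tasks = map_tasks(daily_files)
print(hourly_tasks, daily_tasks)
```

With these made-up numbers the same volume of data needs far fewer map tasks once the small hourly files are consolidated, which is the direction of the change reported on the slide.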
  9. § Switched to Capacity Scheduler
       – FIFO is terrible
       – Fair Share only viable for small tasks
       – Enforced SLAs via custom patch
     § Submitted Jar Size Limit
       – Encourage distributed cache usage
       – Enforced limit via custom patch
     Queue layout:
       5% ETL Tasks
       15% Fast Queue:
         - Task Time < 15 Minutes
         - Job Time < 1 Hour
         - Slot stealing from "Slow" Queue
       80% Slow Queue:
         - Job Time < 24 Hours
         - Up to 80% of slots
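The queue split on slide 9 can be modelled in a few lines. This is a sketch under stated assumptions, not LinkedIn's scheduler configuration or its custom SLA patch: the queue identifiers `etl`, `fast`, and `slow`, the `Queue` class, and the 1,000-slot cluster are hypothetical. Only the percentages and time limits come from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Queue:
    name: str                                # hypothetical identifier
    capacity_pct: int                        # share of slots, from the slide
    max_task_minutes: Optional[int] = None   # None means no limit on the slide
    max_job_minutes: Optional[int] = None


QUEUES = [
    Queue("etl", 5),
    Queue("fast", 15, max_task_minutes=15, max_job_minutes=60),
    Queue("slow", 80, max_job_minutes=24 * 60),
]


def guaranteed_slots(total_slots, queues=QUEUES):
    """Split a slot count across queues by their capacity percentage."""
    if sum(q.capacity_pct for q in queues) != 100:
        raise ValueError("queue capacities must add up to 100%")
    return {q.name: total_slots * q.capacity_pct // 100 for q in queues}


def within_limits(queue, task_minutes, job_minutes):
    """True if a job's task time and total run time fit the queue's limits."""
    if queue.max_task_minutes is not None and task_minutes >= queue.max_task_minutes:
        return False
    if queue.max_job_minutes is not None and job_minutes >= queue.max_job_minutes:
        return False
    return True


slots = guaranteed_slots(1000)  # 1000 is an illustrative cluster size
print(slots)
```

The sketch covers only the static shares and limits. How the custom patch enforces SLAs or steals slots from the slow queue is not described in the slides and is not modelled here.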
  10. § Benchmarking
        – Use production code not TeraSort!
      Old Node:              New Node:
        - 2 Rack Units         - 1 Rack Unit
        - 2 CPUs               - 2 CPUs
        - 16 GB                - 24 or 32 GB
        - 8 x 1 TB SATA        - 6 x 2 TB SATA
        - 1 x 2 gb NIC         - 1 x 1 gb NIC
      § Cut cost per unit in half
      § 2x nodes per rack
      § Extra RAM
        – buffering
        – bus speed
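The density gain implied by the old and new node specs on slide 10 can be checked with simple arithmetic. The sketch below uses only the rack-unit and disk figures from the slide. The `NodeSpec` class is a hypothetical helper, and raw capacity ignores replication and file system overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    rack_units: int
    disks: int
    disk_tb: int

    @property
    def raw_tb(self):
        """Raw disk capacity of one node in TB."""
        return self.disks * self.disk_tb

    @property
    def raw_tb_per_rack_unit(self):
        """Raw disk capacity per rack unit of space in TB."""
        return self.raw_tb / self.rack_units


OLD_NODE = NodeSpec(rack_units=2, disks=8, disk_tb=1)
NEW_NODE = NodeSpec(rack_units=1, disks=6, disk_tb=2)

for label, node in (("old", OLD_NODE), ("new", NEW_NODE)):
    print(label, node.raw_tb, node.raw_tb_per_rack_unit)
```

On these figures the new node holds more raw storage in half the rack space, which fits the "2x nodes per rack" point on the slide.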
  12. Year Two
  14. § DataNode disk partitioning
        – Separate file systems for different purposes
            Disk layout (diagram):
              20 GB (/, ...) | 200 GB (MR) | HDFS ...
              5 GB (Swap)    | 200 GB (MR) | HDFS
        – Mount options: noatime, commit=30, data=writeback
      § NN, JT, etc
        – No “special hardware” == use SW RAID
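As an illustration of the mount options on slide 14, the sketch below generates `/etc/fstab` entries for the MapReduce and HDFS partitions of each data disk. Only the option string comes from the slide. The device names, the `/grid/N/...` mount points, the partition numbers, and the choice of ext4 are assumptions made for the example (`commit=` and `data=writeback` are ext3/ext4 options, but the talk does not name the file system).

```python
# Mount options listed on the slide.
MOUNT_OPTIONS = "noatime,commit=30,data=writeback"


def fstab_lines(disks, fs_type="ext4"):
    """Build fstab entries for the MR and HDFS partitions of each disk.

    Assumes partition 2 holds MapReduce scratch space and partition 3 holds
    HDFS blocks; device names and mount points are hypothetical.
    """
    lines = []
    for index, disk in enumerate(disks):
        for purpose, partition in (("mapred", 2), ("hdfs", 3)):
            device = f"/dev/{disk}{partition}"
            mount_point = f"/grid/{index}/{purpose}"
            lines.append(f"{device} {mount_point} {fs_type} {MOUNT_OPTIONS} 0 0")
    return lines


for line in fstab_lines(["sda", "sdb"]):
    print(line)
```

Keeping MapReduce scratch space and HDFS on separate file systems means a runaway job filling its scratch partition cannot consume space reserved for HDFS blocks.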
  15. [Diagram: LDAP and Kerberos infrastructure]
        LDAP Master + KDC Master <-- Multi Master Replication --> LDAP Master + KDC Master
        LDAP/KDC Slaves (under each master)
        Served to clients: username, uid; group name, gid; netgroup, sudoers
        nscd on each Client Node
  17. [Diagram: bcfg2 configuration flow]
        Host (bcfg2 client): Group1, Group2, ...
        bcfg2 Server:
          Group1 -> Svc1, Svc2, ...
          Group2 -> Svc1, Svc3, ...
          Group3 -> Svc4, Svc5, ...
        Content: Svc1 + Svc2 + Svc3
      § Service Bundle
        – RPMs, config files, etc
        – Conflict resolution
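The group-to-service resolution in the slide 17 diagram amounts to taking the union of the service bundles for every group a host belongs to. The sketch below illustrates that idea only. It is not bcfg2's API or file format, and the function name is hypothetical. The mapping lists just the services named on the slide, whose lists are elided with "...".

```python
# Group-to-bundle mapping as shown on the slide (the elided entries are omitted).
GROUP_BUNDLES = {
    "Group1": ["Svc1", "Svc2"],
    "Group2": ["Svc1", "Svc3"],
    "Group3": ["Svc4", "Svc5"],
}


def bundles_for_host(host_groups, group_bundles=GROUP_BUNDLES):
    """Return the de-duplicated service bundles for a host, in first-seen order."""
    resolved = []
    for group in host_groups:
        for service in group_bundles[group]:
            if service not in resolved:
                resolved.append(service)
    return resolved


print(bundles_for_host(["Group1", "Group2"]))
```

A host in Group1 and Group2 resolves to Svc1, Svc2, and Svc3, matching the content shown in the diagram. Svc1 appears in both groups but is delivered once.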
  18. § Different RPM names + different install locations = pre-deploy-ability:

      Object                     RPM Name                             File Path
      Hadoop 1.0.4-p3 Binaries   hadoop-1043-bin-1.0.4-3              /dir/hadoop-1.0.4-p3
      Grid Config for 1.0.4-p3   gridname-1043-hadoopconf-1.0.4.3-1   /dir/grid-conf-1.0.4-p3
      Hadoop 1.1.2-p1 Binaries   hadoop-1121-bin-1.1.2.1-1            /dir/hadoop-1.1.2-p1
      Grid Config for 1.1.2-p1   gridname-1043-hadoopconf-1.0.4.3-1   /dir/grid-conf-1.1.2-p1
  19. Year Three+
  20. [Diagram: Kerberos realms for Corp IT and the Grid Realm]
        Corp IT: Active Directory, @CORP, Password
        Grid Realm: @GRID, Hadoop Services
        Principals shown: krbtgt/GRID@CORP, krbtgt/host@GRID, krbtgt/service@GRID, krbtgt/user@CORP
  21. Many months moving to secure Apache Hadoop...
  24. § March 2013
        – 5000 Nodes in ~10 grids
        – Apache Hadoop 1.0.4 + custom patches
        – Full configuration management
        – Full monitoring
        – Security
        – Capacity scheduler with SLA
        – ~700 users
        – 12 tasks per node
        – Linux
        – Five dedicated operations staff members
  26. Future Work
  27. Is ‘pure Hadoop’ the right tool for all of our workloads?
  28. [Diagram: YARN and PBS; HDFS and CEPH]
  30. § More on LinkedIn Hadoop Performance:
        – http://www.slideshare.net/allenwittenauer/2012-lihadoopperf
      § LinkedIn Data Analytics:
        – http://data.linkedin.com/
