Grid Operations



          Hadoop Operations at LinkedIn
          Allen Wittenauer
          Grid Computing Architect


         ©2013 LinkedIn Corporation. All Rights Reserved.


Wednesday, March 20, 13
“Hadoop is not a developer problem; it’s an operations problem.”
          -- Hadoop vendor ex-employee




§ August 2009
               – 20 Nodes in 1 grid
               – Apache Hadoop 0.20.0
               – No configuration management
               – No monitoring
               – No security
               – Free for all, including random mafia hits on running jobs
               – FIFO Scheduling
               – ~20 users
               – 20 tasks per node
               – Solaris

               – No operational support




How We Fixed This (In Chronological Order)




Year One




§ Dropped task count
               – 10 mappers => 7 mappers
               – 10 reducers => 5 reducers


           § Reworked ETL
               – hourlies => dailies
               – Re-ordered to take advantage of compression
                  § 10x storage improvement
               – Sample impact on one job (not workflow!):
                  § 80,000 map tasks => 2,000 map tasks
                  § Run time cut in half


           § Optimize workflows/culture shift
               – More task time, fewer tasks
               – Production review to reinforce good behavio(u)r
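
The effect of "more task time, fewer tasks" can be sketched with simple arithmetic. The example below assumes one map task per input split and no combining of small files across splits. The file counts and sizes are hypothetical, chosen only to show the shape of the effect; they are not LinkedIn's actual data.

```python
import math

def map_task_count(file_sizes_mb, split_size_mb):
    """Estimate map tasks, assuming one task per split and no cross-file combining."""
    return sum(max(1, math.ceil(size / split_size_mb)) for size in file_sizes_mb)

# Hypothetical layout: 24 hourly drops, each made of 100 small 8 MB files.
hourly_files = [8] * (24 * 100)

# The same data volume rewritten as one daily drop of 128 MB files.
total_mb = sum(hourly_files)
daily_files = [128] * (total_mb // 128)

hourly_tasks = map_task_count(hourly_files, split_size_mb=128)
daily_tasks = map_task_count(daily_files, split_size_mb=128)
print(hourly_tasks, daily_tasks)
```

Under these assumptions every small file costs a full task, so consolidating the same bytes into split-sized files cuts the task count sharply without changing the amount of data read.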



§ Switched to Capacity Scheduler
               – FIFO is terrible
               – Fair Share only viable for small tasks
               – Enforced SLAs via custom patch

           § Submitted Jar Size Limit
               – Encourage distributed cache usage
               – Enforced limit via custom patch

           § Queue capacity split (slide diagram)
               – 5% ETL Tasks
               – 15% Fast Queue:
                  - Task Time < 15 Minutes
                  - Job Time < 1 Hour
                  - Slot stealing from "Slow" Queue
               – 80% Slow Queue:
                  - Job Time < 24 Hours
                  - Up to 80% of slots
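
The queue limits on this slide can be read as a simple routing rule. The sketch below only illustrates those limits. The slide does not describe how the custom patch enforces them, so the `JobEstimate` type, the `pick_queue` function, and the queue names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

FAST_MAX_TASK_MIN = 15       # Fast queue: task time < 15 minutes
FAST_MAX_JOB_MIN = 60        # Fast queue: job time < 1 hour
SLOW_MAX_JOB_MIN = 24 * 60   # Slow queue: job time < 24 hours

@dataclass
class JobEstimate:
    """Hypothetical per-job estimate supplied by the job owner."""
    name: str
    longest_task_min: float
    total_job_min: float

def pick_queue(job: JobEstimate) -> Optional[str]:
    """Return the queue whose limits the estimate fits, or None if it fits neither."""
    if job.longest_task_min < FAST_MAX_TASK_MIN and job.total_job_min < FAST_MAX_JOB_MIN:
        return "fast"
    if job.total_job_min < SLOW_MAX_JOB_MIN:
        return "slow"
    return None

print(pick_queue(JobEstimate("ad-hoc-query", longest_task_min=5, total_job_min=30)))
print(pick_queue(JobEstimate("nightly-rollup", longest_task_min=40, total_job_min=600)))
```

A job that fits neither limit returns `None` here; what happens to such a job in production (rejection, kill, or review) is not stated on the slide.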




§ Benchmarking
              – Use production code not TeraSort!

                                             Old Node          New Node
                             Rack units      2                 1
                             CPUs            2                 2
                             RAM             16 GB             24 or 32 GB
                             Disks           8 x 1 TB SATA     6 x 2 TB SATA
                             NIC             1 x 2 Gb          1 x 1 Gb



           § Cut cost per unit in half
           § 2x nodes per rack
           § Extra RAM
              – buffering
              – bus speed
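
The density gain from the node change can be worked out directly from the figures on the slide. The sketch below computes raw disk capacity per rack unit. It ignores HDFS replication and non-HDFS partitions, so the numbers are raw capacity, not usable capacity.

```python
# Figures taken from the Old Node / New Node comparison above.
old_node = {"rack_units": 2, "disks": 8, "tb_per_disk": 1}
new_node = {"rack_units": 1, "disks": 6, "tb_per_disk": 2}

def raw_tb_per_rack_unit(node):
    """Raw disk capacity (TB) divided by the rack space the node occupies."""
    return node["disks"] * node["tb_per_disk"] / node["rack_units"]

old_density = raw_tb_per_rack_unit(old_node)
new_density = raw_tb_per_rack_unit(new_node)
print(old_density, new_density, new_density / old_density)
```

Fewer but larger disks in half the rack space give three times the raw storage per rack unit, alongside the "2x nodes per rack" noted on the slide.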


Year Two




§ DataNode disk partitioning
               – Separate file systems for different purposes

               Per-disk partition layout (slide diagram):

                   20 GB  /, ...   |  200 GB  MR  |  HDFS
                   ...
                   5 GB   Swap     |  200 GB  MR  |  HDFS


               – Mount options: noatime, commit=30, data=writeback
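
As a rough illustration, the mount options above could be rendered into `/etc/fstab` entries as follows. The slide does not name the file system. `commit=` and `data=writeback` are ext3/ext4 options, so `ext4` is assumed here, and the device names and mount points are hypothetical.

```python
# Mount options exactly as listed on the slide.
MOUNT_OPTIONS = ("noatime", "commit=30", "data=writeback")

def fstab_line(device, mount_point, fstype="ext4", options=MOUNT_OPTIONS):
    """Format one /etc/fstab entry; dump and fsck-pass fields are set to 0 for brevity."""
    return "\t".join([device, mount_point, fstype, ",".join(options), "0", "0"])

# Hypothetical devices and mount points for one data disk.
for device, mount_point in [("/dev/sdb2", "/grid/1/mapred"), ("/dev/sdb3", "/grid/1/hdfs")]:
    print(fstab_line(device, mount_point))
```

Keeping the option list in one place makes it easy for a kickstart or configuration management template to apply the same options to every data partition.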


           § NameNode (NN), JobTracker (JT), etc.
               – No “special hardware”: use software RAID




LDAP/KDC layout (slide diagram):

   LDAP Master + KDC Master   <-- Multi-Master Replication -->   LDAP Master + KDC
              |                                                         |
       LDAP/KDC Slaves                                           LDAP/KDC Slaves
              |                                                         |
       username, uid                                             username, uid
       group name, gid                                           group name, gid
       netgroup, sudoers                                         netgroup, sudoers
              |                                                         |
            nscd                                                      nscd
         Client Node                                               Client Node



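bcfg2 client/server layout (slide diagram):

   Host (bcfg2 client)   <->   bcfg2 Server
      Host groups:   Group1, Group2, ...
      Content:       Svc1 + Svc2 + Svc3

   Group-to-service mapping on the server:
      Group1 -> Svc1, Svc2, ...
      Group2 -> Svc1, Svc3, ...
      Group3 -> Svc4, Svc5, ...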




           § Service Bundle
               – RPMs, config files, etc
               – Conflict resolution
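
The group-to-service mapping in the diagram can be sketched as a small lookup. This is not bcfg2's actual API or its conflict resolution logic; the `services_for` function is hypothetical, and the mapping contains only the entries named on the slide.

```python
# Mapping as named in the slide diagram (elided "..." entries are left out).
GROUP_SERVICES = {
    "Group1": ["Svc1", "Svc2"],
    "Group2": ["Svc1", "Svc3"],
    "Group3": ["Svc4", "Svc5"],
}

def services_for(groups):
    """Union of the services for a host's groups, de-duplicated in first-seen order."""
    resolved = []
    for group in groups:
        for service in GROUP_SERVICES.get(group, []):
            if service not in resolved:
                resolved.append(service)
    return resolved

print(services_for(["Group1", "Group2"]))
```

A host in Group1 and Group2 ends up with Svc1, Svc2, and Svc3, with the shared Svc1 applied once, which matches the content shown for the host in the diagram.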




§ Different RPM names + different install locations = pre-deploy-ability:



                   Object                      RPM Name                              File Path

                   Hadoop 1.0.4-p3 Binaries    hadoop-1043-bin-1.0.4-3               /dir/hadoop-1.0.4-p3
                   Grid Config for 1.0.4-p3    gridname-1043-hadoopconf-1.0.4.3-1    /dir/grid-conf-1.0.4-p3
                   Hadoop 1.1.2-p1 Binaries    hadoop-1121-bin-1.1.2.1-1             /dir/hadoop-1.1.2-p1
                   Grid Config for 1.1.2-p1    gridname-1043-hadoopconf-1.0.4.3-1    /dir/grid-conf-1.1.2-p1
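
RPM identifies a package by its name, so packages with different names install side by side instead of upgrading one another. The sketch below shows one way the binary package names and paths in the table could be derived from a version and patch level. The naming rule is inferred from the two binary rows only, the `package_identity` function is hypothetical, and `/dir` is the slide's own placeholder.

```python
def package_identity(version, patch, prefix="/dir"):
    """Return (rpm_name, install_path) for a Hadoop binary build.

    The version digits plus the patch level form a token embedded in the
    package name, so each build is a distinct package with its own directory.
    """
    token = version.replace(".", "") + str(patch)
    return f"hadoop-{token}-bin", f"{prefix}/hadoop-{version}-p{patch}"

current = package_identity("1.0.4", 3)
upcoming = package_identity("1.1.2", 1)
print(current)
print(upcoming)
```

Because neither the names nor the paths collide, the upcoming build can be staged on every node ahead of time while the current one keeps running.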




Year Three+




Cross-realm Kerberos layout (slide diagram):

   Corp IT                              Grid Realm
   Active Directory @CORP               @GRID
      Password                             krbtgt/host@GRID
                                           krbtgt/service@GRID

   Between the two realms:              krbtgt/GRID@CORP
   Shown alongside Hadoop Services:     krbtgt/user@CORP, krbtgt/GRID@CORP




Many months moving to secure Apache Hadoop...




§ March 2013
               – 5000 Nodes in ~10 grids
               – Apache Hadoop 1.0.4 + custom patches
               – Full configuration management
               – Full monitoring
               – Security
               – Capacity scheduler with SLA
               – ~700 users
               – 12 tasks per node
               – Linux

               – Five dedicated operations staff members
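
The scale change between the August 2009 and March 2013 slides can be put into numbers. The sketch below treats "tasks per node" as concurrent task slots per node and assumes every node is configured identically across grids; both are simplifying assumptions, not statements from the slides.

```python
def total_task_slots(nodes, tasks_per_node):
    """Cluster-wide concurrent task capacity under a uniform per-node setting."""
    return nodes * tasks_per_node

slots_2009 = total_task_slots(nodes=20, tasks_per_node=20)
slots_2013 = total_task_slots(nodes=5000, tasks_per_node=12)

print(slots_2009, slots_2013, slots_2013 / slots_2009)
```

Per-node concurrency went down while total capacity grew by two orders of magnitude, which is consistent with the earlier emphasis on fewer, longer tasks.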




Future Work




Is ‘pure Hadoop’ the right tool for all of our workloads?




YARN   PBS

HDFS   CEPH




§ More on LinkedIn Hadoop Performance:
               – http://www.slideshare.net/allenwittenauer/2012-lihadoopperf


           § LinkedIn Data Analytics:
               – http://data.linkedin.com/






Editor's Notes

  • #8 Goals:
      - Fix the performance
      - Make the system operationally sound
  • #13 Goals:
      - Corporate decision to switch to Linux
      - Start prep for security
  • #14 We use cobbler to control our kickstart installs. Key features:
      * template engine
      * snippet system
      * RPM repo sync
      * both command line and programmable APIs
      * and, most importantly, great support for a “netboot always” environment
    “Netboot always” means that our hosts always boot from the network and, if that fails, from local disk. We generally re-install a machine after a disk failure so that it starts from a clean slate, which clears any excess cruft and restores host-specific parts such as Kerberos keytabs.
    What may be surprising is that our kickstart environment serves primarily to do three things:
      * partition disks
      * get enough of the OS installed to troubleshoot a broken kickstart
      * bootstrap our configuration management tool
  • #17
      - Born out of the HPC community in 2004
      - Python
      - BSD License
      - Love the community
      - Works with everything, not just the Hadoop ecosystem
      - Services-based methodology with conflict resolution
      - Awesome reporting engine
  • #20 Goals:
      - Deploy secure Hadoop
      - Reduce user friction
  • #22 A talk in and of itself. Highlights:
      - another cultural shift
      - finding many bugs in what was considered stable code
      - forking the Kerberos web filter due to poor code quality
  • #27 Goals:
      - What do we need for the next 4 years?