Productionizing Hadoop - New Lessons Learned

Learn how some of the largest Hadoop clusters in the world were successfully “productionized” and best practices for running Hadoop.

Published in: Technology

  1. PRODUCTIONIZING HADOOP
     New Lessons Learned
     Eric Sammer
  2. General Announcements
     • All lines are muted
     • Ask questions any time using the “questions” pane on your GoToWebinar panel
     • Recording of this webinar will be available on demand at www.cloudera.com
  3. The Universe of Operations
     System Operations
     • Server and network
     • Operating system
     • Identity and access
     • Resource management
     • Maintenance
     • Cluster monitoring
     • Backup and DR
     Architecture and App Ops
     • Data architecture
     • Data integration
     • Data quality monitoring
     • Resource management
     • Pipeline maintenance
     • Governance
  4. Scope for Today
     • A focus on common stumbling blocks
        • Workload-oriented planning and identification
        • Network architecture
        • Host management
        • Configuration management
        • Identity, Access, and Authorization
        • Cluster and resource sharing
     • Time for questions
  5. Proper Planning
     • Develop an understanding of your use cases
        • What you (will) do defines what you need
        • Analog: OLTP RDBMS versus OLAP
     • Prototype if necessary
  6. Understanding Cluster Usage
  7. …by use case
     • Data Mining / IR
     • ETL
     • Report Generation
     • Analytics
  8. …by use case
     • Data Mining / IR
     • ETL
     • Report Generation
     • Analytics
     Network utilization is a function of job size, its profile, and the number of concurrent jobs
  9. Network Architecture
     • Your current architecture is probably fine
        • Typical: traditional L2 tree (fine for North/South)
        • Emerging: L3 spine/leaf (optimized for East/West)
     • Minimize oversubscription (normal: 1:1.2)
     • Deep port buffers (with fair allocation for shared memory)
     • Do not collocate low-latency apps with MR
     • Monitor, monitor, monitor
        • Bandwidth, buffer, packet count, and size deciles
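The oversubscription bullet on the slide above can be made concrete with a little arithmetic. The sketch below assumes one common way of expressing the ratio: total host-facing bandwidth on a switch divided by its total uplink bandwidth. The function name and the port counts are hypothetical and do not come from the talk.

```python
def oversubscription(host_ports, host_gbps, uplink_ports, uplink_gbps):
    """Return the downlink-to-uplink bandwidth ratio for a single switch.

    A result of 1.0 means the uplinks can carry everything the hosts can
    send; larger values mean more contention for East/West traffic.
    """
    downlink = host_ports * host_gbps
    uplink = uplink_ports * uplink_gbps
    if uplink <= 0:
        raise ValueError("a switch needs at least one uplink")
    return downlink / uplink


if __name__ == "__main__":
    # Hypothetical top-of-rack switch: 48 x 1GbE to hosts, 4 x 10GbE uplinks.
    ratio = oversubscription(48, 1, 4, 10)
    print(f"oversubscription: {ratio:.2f}:1")
```

Running the same calculation for each tier of the network shows where shuffle-heavy MapReduce jobs are most likely to hit contention.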
  10. Host Configuration
     • OS version and patches
     • Java 6 (HotSpot VM)
     • PAM limits (nofile, nproc)
     • Naming (nsswitch.conf, resolv.conf, hosts, gethostname())
     • OS filesystem selection and tuning
     • Time service
     • Users, groups, and identity management
     • Machines should not be unique snowflakes
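Two of the host configuration items above, PAM limits and naming, are easy to check from a script. The following is a minimal sketch, assuming a Unix host; the thresholds `MIN_NOFILE` and `MIN_NPROC` are hypothetical placeholders, not values recommended in the talk, and the loopback check is one illustrative naming test rather than a complete one.

```python
import resource
import socket

# Hypothetical thresholds; choose values appropriate for your own cluster.
MIN_NOFILE = 32768
MIN_NPROC = 32768


def limit_ok(soft, minimum):
    """True if a soft rlimit is unlimited or at least `minimum`."""
    return soft == resource.RLIM_INFINITY or soft >= minimum


def check_limits():
    """Compare this process's nofile/nproc soft limits to the thresholds."""
    problems = []
    for name, rlimit, minimum in (
        ("nofile", resource.RLIMIT_NOFILE, MIN_NOFILE),
        ("nproc", resource.RLIMIT_NPROC, MIN_NPROC),
    ):
        soft, _hard = resource.getrlimit(rlimit)
        if not limit_ok(soft, minimum):
            problems.append(f"{name} soft limit {soft} is below {minimum}")
    return problems


def check_naming():
    """Check that the hostname resolves, and not to a loopback address."""
    problems = []
    hostname = socket.gethostname()
    try:
        address = socket.gethostbyname(hostname)
    except OSError as err:
        return [f"cannot resolve {hostname!r}: {err}"]
    if address.startswith("127."):
        problems.append(f"{hostname!r} resolves to loopback address {address}")
    return problems


if __name__ == "__main__":
    for problem in check_limits() + check_naming():
        print("WARNING:", problem)
```

The limits reported are those of the user running the script, so it would need to run as the same account that runs the Hadoop daemons to be meaningful.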
  11. Configuration Management
     • Puppet/Chef/<your favorite> for OS config
        • Package installation
        • Identity and authorization wiring
     • Cloudera Manager for platform management
        • Deployment and configuration
        • Service lifecycle
        • Platform-specific service monitoring and diagnostics
        • Activity monitoring
     • Complementary systems
        • Differentiating factors: centralized coordination, service awareness, orchestration
  12. Identity, Access, and Authorization
     • MapReduce is a code execution engine
     • Identity management and access control are hard (in distributed systems like Hadoop)
     • Hadoop uses the OS (or Kerberos) for identity
        • Lots of entry points
        • Comparatively low level
     • Access control is a function of each service
        • HDFS: Unix-style octal permissions on objects
        • MapReduce: ACLs on job queues
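To illustrate what "Unix-style octal permissions on objects" means in practice, here is a small sketch of how an octal mode maps to owner/group/other access. It is a simplified model written for this example, not HDFS code: the function names and the sample users are hypothetical, and superuser bypass is deliberately left out.

```python
def mode_string(mode):
    """Render the low nine permission bits of a mode as 'rwxrwxrwx' style."""
    flags = "rwxrwxrwx"
    return "".join(
        flag if mode & (1 << (8 - i)) else "-"
        for i, flag in enumerate(flags)
    )


def is_allowed(mode, owner, group, user, user_groups, want):
    """Decide whether `user` may perform `want` ('r', 'w', or 'x').

    Exactly one permission class applies: owner if the user owns the
    object, else group if the user is in the object's group, else other.
    """
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:
        triad = (mode >> 6) & 7
    elif group in user_groups:
        triad = (mode >> 3) & 7
    else:
        triad = mode & 7
    return bool(triad & bit)


if __name__ == "__main__":
    # Hypothetical directory owned by user "etl", group "analysts", mode 750.
    print(mode_string(0o750))
    print(is_allowed(0o750, "etl", "analysts", "bob", {"analysts"}, "r"))
    print(is_allowed(0o750, "etl", "analysts", "bob", {"analysts"}, "w"))
```

Because only one class is consulted, an owner with fewer permissions than the group does not inherit the group's access, which matches the usual Unix convention.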
  13. Resource Sharing
     • One cluster, many groups
     • Pros
        • Benefit from aggregate resources
        • Greater utilization
        • Reduced capex/opex
  14. Resource Sharing
     • Three dimensions of sharing a cluster
        • Collocation of services (e.g. MapReduce and HBase)
        • Collocation of groups of users
        • Collocation of workload profiles (ETL, analytics)
     • In an ideal world, collocate all and enforce policy
        • Not currently possible
     • Problems
        • System utilization varies wildly
        • Fair distribution of shared resources
        • Increased access control complexity
        • SLA of most sensitive group applies to all
        • …but nothing new
  15. Resource Sharing
     • Reasons to collocate groups / applications:
        • Similar system utilization profiles
        • Time-based utilization (e.g. daily ETL and office-hour analytics)
        • Maintain similar SLAs
        • Extensive data sharing
        • When it’s trivially easy with current control mechanisms
     • Reasons to segregate groups / applications:
        • Compliance, regulation, or where security is paramount
        • Wildly dissimilar utilization profiles (notably HBase and MapReduce)
     • A significant area of interest for Cloudera
  16. Now What?
     • There’s a lot (more) to think about
     • We can help
        • Education
        • Services
        • Software
        • Support
     • Strata + Hadoop World 2012
     • Look for upcoming webinars
  17. Questions?
     Type them in the “Questions” panel.
     Congratulations to the winners of the book drawing!
     • Vani Mahobia
     • Ken Gayler
     • Richard Zhang
     • Anand Rajan
     • Erica Muxlow
  18. Questions?
     Type them in the “Questions” panel.
     To learn more about Hadoop Operations, A Guide for Developers and Administrators, or about the spotted cavy, go to www.oreilly.com
  19. THANK YOU!
     Eric Sammer, Principal Solutions Architect
     @esammer
     For more information: www.cloudera.com
     Sales: (888) 789-1488
     @cloudera
  20. Hardware Planning
     • CPU
     • Disk capacity and configuration
     • Spindle count
     • Memory (amount and configuration)
     • NIC configuration
     • Hadoop’s hardware preferences tend to be controversial until the architecture is understood
  21. Baseline Hardware
     • Disk
        • SATA II 7200RPM (SAS controller)
        • JBOD (OS on RAID 1)
        • Option 1: 12x 3.5” LFF 3TB
        • Option 2: 24x 2.5” SFF 1TB
        • Option: MDL/NL SAS drives
     • 2x 2.2GHz 6-core CPUs, 20MB cache
     • 48GB+ DDR3-1600 ECC
     • 1GbE vs. 10GbE
        • Is there new info here?
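The two disk options on the slide above can be compared with a back-of-the-envelope capacity estimate. This sketch assumes HDFS's default replication factor of 3 and a hypothetical 25% of raw disk held back for intermediate MapReduce output and other non-HDFS use; the reserve fraction and function names are assumptions made for this example, not figures from the talk.

```python
def raw_tb(drives, tb_per_drive):
    """Raw data-disk capacity of one node in TB."""
    return drives * tb_per_drive


def usable_tb(raw, replication=3, reserve_fraction=0.25):
    """Estimate unique HDFS data per node in TB.

    `reserve_fraction` is an assumed share of raw disk kept free for
    intermediate job output; `replication` defaults to the HDFS default.
    """
    if replication < 1 or not 0 <= reserve_fraction < 1:
        raise ValueError("invalid replication or reserve fraction")
    return raw * (1 - reserve_fraction) / replication


if __name__ == "__main__":
    for label, drives, size in (
        ("Option 1 (12 x 3TB LFF)", 12, 3),
        ("Option 2 (24 x 1TB SFF)", 24, 1),
    ):
        raw = raw_tb(drives, size)
        print(f"{label}: raw {raw} TB, ~{usable_tb(raw):.1f} TB unique data")
```

Option 1 gives more capacity per node while Option 2 gives twice the spindle count, which is the trade-off the Hardware Planning slide points at.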
