Productionizing Hadoop - New Lessons Learned


Learn how some of the largest Hadoop clusters in the world were successfully “productionized” and best practices for running Hadoop.

Published in: Technology

  • INTERNAL NOTES – DELETE BEFORE POSTING! Set expectation that this is targeted to a relatively beginner audience? What’s new? What are the NEW lessons learned? An example war story to start it off would help the audience get into it. Scope? Core Hadoop (MR & HDFS) vs. the entire CDH stack (Hive, ZK, HBase, etc.) and how do they co-locate deployment-wise, i.e. do I need separate HW to run other components? (MapR depositioning) Mention: HA, performance, DR, data integrity, federation, MR2,
  • SCRIPT for Zoo/Moderator (go through this as quickly as you can): Before we get started I’d like to let you know that all lines are muted. Ask questions any time by typing them into the QUESTIONS pane on your GoToWebinar panel. This webinar is being recorded and will be available later at cloudera.com. Let me pass you to Eric Sammer, who is a Principal Solutions Architect at Cloudera and author of the recently published book “Hadoop Operations” by O’Reilly Media.
  • Do I need a dedicated rack/network for Hadoop? Or can I run other apps and services on the same rack/network?
  • Why not use Puppet/Chef for Hadoop config as well? Why is CM better? If I use Puppet/Chef for ALL my config mgmt (systems & apps), why use a point solution like CM for Hadoop?
  • SCRIPT Zoo/moderator (speak fast): Thank you Eric. Let’s now move quickly into the Q&A portion of this webinar. Please type your questions into the QUESTIONS PANEL and we’ll get to as many questions as we have time for. While Eric is reviewing the questions I’d like to congratulate the winners of the book drawing. If you see your name listed here your book will be mailed to you by the last week of October. It’s being printed now, so when you receive it it’ll be “hot off the press”. MOVE TO NEXT SLIDE – get winners’ names off the screen
  • SCRIPT Zoo/moderator (speak fast): Eric, are you ready to answer some questions? MOVE TO THANK YOU SLIDE WHILE CLOSING

    1. PRODUCTIONIZING HADOOP
       New Lessons Learned
       Eric Sammer
    2. General Announcements
       • All lines are muted
       • Ask questions any time using the “questions” pane on your GoToWebinar panel
       • Recording of this webinar will be available on demand at
    3. The Universe of Operations
       System Operations
       • Server and network
       • Operating system
       • Identity and access
       • Resource management
       • Maintenance
       • Cluster monitoring
       • Backup and DR
       Architecture and App Ops
       • Data architecture
       • Data integration
       • Data quality monitoring
       • Resource management
       • Pipeline maintenance
       • Governance
    4. Scope for Today
       • A focus on common stumbling blocks
         • Workload-oriented planning and identification
         • Network architecture
         • Host management
         • Configuration management
         • Identity, Access, and Authorization
         • Cluster and resource sharing
       • Time for questions
    5. Proper Planning
       • Develop an understanding of your use cases
         • What you (will) do defines what you need
         • Analog: OLTP RDBMS versus OLAP
       • Prototype if necessary
    6. Understanding Cluster Usage
    7. …by use case
       • Data Mining / IR
       • ETL
       • Report Generation
       • Analytics
    8. …by use case
       • Data Mining / IR
       • ETL
       • Report Generation
       • Analytics
       Network utilization is a function of job size, its profile, and the number of concurrent jobs
    9. Network Architecture
       • Your current architecture is probably fine
         • Typical: traditional L2 tree (fine for North/South)
         • Emerging: L3 spine/leaf (optimized for East/West)
       • Minimize oversubscription (normal: 1:1.2)
       • Deep port buffers (with fair allocation for shared memory)
       • Do not collocate low-latency apps with MR
       • Monitor, monitor, monitor
         • Bandwidth, buffer, packet count, and size deciles
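To make the oversubscription bullet concrete, here is a minimal sketch that computes a rack's oversubscription as host-facing bandwidth divided by uplink bandwidth. That definition is the common convention and an assumption here; the slide does not spell out how its "1:1.2" figure is expressed. The function name and the rack numbers are hypothetical.

```python
def oversubscription_ratio(hosts: int, host_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Return total host-facing bandwidth divided by total uplink bandwidth."""
    if uplinks <= 0 or uplink_gbps <= 0:
        raise ValueError("a rack needs at least one uplink with positive bandwidth")
    return (hosts * host_gbps) / (uplinks * uplink_gbps)


# Hypothetical rack: 40 hosts at 1 GbE behind a switch with 2 x 10 GbE uplinks.
ratio = oversubscription_ratio(hosts=40, host_gbps=1.0, uplinks=2, uplink_gbps=10.0)
print(f"oversubscription {ratio:.1f}:1")
```

A result of 1.0 means the uplinks can carry everything the hosts can send out of the rack; larger values mean East/West traffic will contend for the uplinks.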
    10. Host Configuration
       • OS version and patches
       • Java 6 (HotSpot VM)
       • PAM limits (nofile, nproc)
       • Naming (nsswitch.conf, resolv.conf, hosts, gethostname())
       • OS filesystem selection and tuning
       • Time service
       • Users, groups, and identity management
       • Machines should not be unique snowflakes
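Two of the items above (PAM limits and naming) are easy to verify from a script run on each host. The following is a rough sketch for a Unix host, not a complete pre-flight check: the function names are hypothetical, and the thresholds in the usage section are placeholder assumptions rather than values from the talk.

```python
import resource
import socket


def check_limits(min_nofile: int, min_nproc: int) -> list:
    """Compare the current soft nofile/nproc limits against minimums."""
    problems = []
    checks = (
        ("nofile", resource.RLIMIT_NOFILE, min_nofile),
        ("nproc", resource.RLIMIT_NPROC, min_nproc),
    )
    for name, which, minimum in checks:
        soft, _hard = resource.getrlimit(which)
        # RLIM_INFINITY means "unlimited", which always satisfies the minimum.
        if soft != resource.RLIM_INFINITY and soft < minimum:
            problems.append(f"{name} soft limit {soft} is below {minimum}")
    return problems


def check_naming() -> list:
    """Check that gethostname() resolves forward and back again."""
    problems = []
    hostname = socket.gethostname()
    try:
        address = socket.gethostbyname(hostname)
        reverse_name = socket.gethostbyaddr(address)[0]
    except OSError as exc:
        problems.append(f"hostname {hostname!r} does not resolve cleanly: {exc}")
    else:
        if address.startswith("127."):
            problems.append(f"{hostname!r} resolves to loopback address {address}")
        print(f"{hostname} -> {address} -> {reverse_name}")
    return problems


if __name__ == "__main__":
    # Placeholder thresholds (assumptions); use the values your distribution recommends.
    for problem in check_limits(min_nofile=32768, min_nproc=32768) + check_naming():
        print("WARNING:", problem)
```

Running the same script on every machine is one cheap way to spot a "unique snowflake" before it causes a hard-to-diagnose job failure.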
    11. Configuration Management
       • Puppet/Chef/<your favorite> for OS config
         • Package installation
         • Identity and authorization wiring
       • Cloudera Manager for platform management
         • Deployment and configuration
         • Service lifecycle
         • Platform-specific service monitoring and diagnostics
         • Activity monitoring
       • Complementary systems
         • Differentiating factors: centralized coordination, service awareness, orchestration
    12. Identity, Access, and Authorization
       • MapReduce is a code execution engine
       • Identity management and access control is hard (in distributed systems like Hadoop)
       • Hadoop uses the OS (or Kerberos) for identity
         • Lots of entry points
         • Comparatively low level
       • Access control is a function of each service
         • HDFS: Unix-style octal permissions on objects
         • MapReduce: ACLs on job queues
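For readers less familiar with Unix-style octal permissions, the sketch below decodes a nine-bit mode into the usual owner/group/other string. It illustrates the notation only and does not call any Hadoop API; the function name is hypothetical, and special bits such as the sticky bit are deliberately left out.

```python
def describe_mode(mode: int) -> str:
    """Render a nine-bit permission mode such as 0o750 as 'rwxr-x---'."""
    if not 0 <= mode <= 0o777:
        raise ValueError("expected a mode between 0o000 and 0o777")
    out = []
    for shift in (6, 3, 0):  # owner, group, other
        triplet = (mode >> shift) & 0b111
        for bit, flag in zip((4, 2, 1), "rwx"):
            out.append(flag if triplet & bit else "-")
    return "".join(out)


print(describe_mode(0o750))  # owner: full access, group: read/execute, other: none
print(describe_mode(0o644))  # owner: read/write, group and other: read only
```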
    13. Resource Sharing
       • One cluster, many groups
       • Pros
         • Benefit from aggregate resources
         • Greater utilization
         • Reduced cap/op-ex
    14. Resource Sharing
       • Three dimensions of sharing a cluster
         • Collocation of services (e.g. MapReduce and HBase)
         • Collocation of groups of users
         • Collocation of workload profiles (ETL, analytics)
       • In an ideal world, collocate all and enforce policy
         • Not currently possible
       • Problems
         • System utilization varies wildly
         • Fair distribution of shared resources
         • Increased access control complexity
         • SLA of most sensitive group applies to all
         • …but nothing new
    15. Resource Sharing
       • Reasons to collocate groups / applications:
         • Similar system utilization profiles
         • Time-based utilization (e.g. daily ETL and office hour analytics)
         • Maintain similar SLAs
         • Extensive data sharing
         • When it’s trivially easy with current control mechanisms
       • Reasons to segregate groups / applications:
         • Compliance, regulation, or where security is paramount
         • Wildly dissimilar utilization profiles (notably HBase and MapReduce)
       • A significant area of interest for Cloudera
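The time-based utilization argument can be checked with simple arithmetic: if two groups' hourly load profiles never add up to more than the cluster can deliver, they are candidates for collocation. The sketch below is illustrative only. The profiles are invented numbers expressed as fractions of cluster capacity, and the function name is hypothetical.

```python
def combined_peak(profile_a, profile_b):
    """Return the highest summed load across matching time slots."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same time slots")
    return max(a + b for a, b in zip(profile_a, profile_b))


# Invented hourly profiles (fraction of cluster capacity), for illustration only:
# nightly ETL is busy from 00:00 to 06:00, analytics during office hours.
etl = [0.8 if hour < 6 else 0.1 for hour in range(24)]
analytics = [0.7 if 9 <= hour < 18 else 0.05 for hour in range(24)]

peak = combined_peak(etl, analytics)
print(f"combined peak load: {peak:.2f} of capacity")
print("fits on one cluster" if peak <= 1.0 else "peaks collide")
```

Two workloads that are both busy around the clock would sum to well over 1.0, which is the "wildly dissimilar" or competing case where segregation starts to look attractive.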
    16. Now What?
       • There’s a lot (more) to think about
       • We can help
         • Education
         • Services
         • Software
         • Support
       • Strata + Hadoop World 2012
       • Look for upcoming webinars
    17. Questions?
       Type them in the “Questions” panel.
       Congratulations to the winners of the book drawing!
       • Vani Mahobia
       • Ken Gayler
       • Richard Zhang
       • Anand Rajan
       • Erica Muxlow
    18. Questions?
       Type them in the “Questions” panel.
       To learn more about Hadoop Operations, A Guide for Developers and Administrators, or about the spotted cavy, go
    19. THANK YOU!
       Eric Sammer, Principal Solutions Architect
       @esammer
       For more information: www.cloudera.com
       Sales: (888)789-1488
       @cloudera
    20. Hardware Planning
       • CPU
       • Disk capacity and configuration
       • Spindle count
       • Memory (amount and configuration)
       • NIC configuration
       • Hadoop’s hardware preferences tend to be controversial until the architecture is understood
    21. Baseline Hardware
       • Disk
         • SATA II 7200RPM (SAS controller)
         • JBOD (OS on R1)
         • Option 1: 12x3.5” LFF 3TB
         • Option 2: 24x2.5” SFF 1TB
         • Option: MDL/NL SAS drives
       • 2x2.2GHz 6C 20MB cache
       • 48GB+ DDR3-1600 ECC
       • 1GbE vs. 10GbE
         • Is there new info here?
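The two disk options trade capacity against spindle count, which a quick calculation makes visible. This is a back-of-the-envelope sketch: it assumes HDFS's default replication factor of 3 and ignores space reserved for the OS, logs, and MapReduce intermediate data, so real usable capacity will be lower. The function name is hypothetical.

```python
def node_capacity_tb(drives: int, tb_per_drive: float, replication: int = 3):
    """Return (raw TB, TB of unique data after replication) for one node."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    raw = drives * tb_per_drive
    return raw, raw / replication


options = {
    "Option 1 (12 x 3TB LFF)": (12, 3.0),
    "Option 2 (24 x 1TB SFF)": (24, 1.0),
}
for label, (drives, size_tb) in options.items():
    raw, unique = node_capacity_tb(drives, size_tb)
    print(f"{label}: {drives} spindles, {raw:.0f} TB raw, {unique:.0f} TB at 3x replication")
```

Option 1 holds more data per node, while Option 2 offers twice as many spindles for concurrent I/O, which is why the choice depends on the workload rather than on capacity alone.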