Hadoop World 2010: Productionizing Hadoop: Lessons Learned



  • Many small and midsize companies – especially those for whom technology is not their primary product or concern – start out the same way. Someone, probably you, decides they need to build a Hadoop cluster.
  • …and so you do. And it looks like this. Not a bad thing, but not what you can reasonably go to production with. So how do you get from this…
  • …to this?


  • 1. Productionizing Hadoop: Lessons Learned
    Eric Sammer - Solution Architect - Cloudera
    twitter: @esammer, @cloudera
  • 2. Starting Out
    Copyright 2010 Cloudera Inc. All rights reserved
    “Let’s build a Hadoop cluster!”
  • 3. Starting Out
  • 4. Where you want to be
    Yahoo! Hadoop Cluster (2007)
  • 5. What is Hadoop?
    A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
    Core Hadoop has two main components
    Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
    MapReduce: fault-tolerant distributed processing
    Key value:
    Flexible -> store data without a schema and add one later as needed
    Affordable -> cost / TB at a fraction of traditional options
    Broadly adopted -> a large and active ecosystem
    Proven at scale -> dozens of petabyte+ implementations in production today
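The two core components above are easiest to see with a tiny local sketch of the MapReduce programming model. This is pure Python standing in for Hadoop's Java API, and it runs in a single process; it illustrates the map / shuffle / reduce contract, not Hadoop's distributed, fault-tolerant execution:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to each input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the canonical MapReduce example.
def wc_mapper(line):
    return [(word, 1) for word in line.split()]

def wc_reducer(word, counts):
    return sum(counts)

lines = ["hadoop is a distributed system", "hadoop is scalable"]
counts = reduce_phase(shuffle(map_phase(lines, wc_mapper)), wc_reducer)
```

In real Hadoop the mapper and reducer run as tasks spread across the cluster, and the shuffle moves data over the network, but the division of labor is exactly this.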
  • 6. Cloudera’s Distribution for Hadoop, Version 3
    The Industry’s Leading Hadoop Distribution
    (Stack diagram: Hue SDK; Flume, Sqoop)
    • Open source – 100% Apache licensed
    • Simplified – Component versions & dependencies managed for you
    • Integrated – All components & functions interoperate through standard APIs
    • Reliable – Patched with fixes from future releases to improve stability
    • Supported – Employs project founders and committers for >70% of components
  • 11. Overview
    Proper planning
    Data Ingestion
    ETL and Data Processing Infrastructure
    Authentication, Authorization, and Sharing
  • 12. The production data platform
    Data storage
    ETL / data processing / analysis infrastructure
    Data ingestion infrastructure
    Integration with tools
    Data security and access control
    Health and performance monitoring
  • 13. Proper planning
    Know your use cases!
    Log transformation, aggregation
    Text mining, IR
    Machine learning
    Critical to proper configuration
    Resource utilization and deep job insight will tell you more
  • 14. HDFS Concerns
    Name node availability
    HA is tricky
    Consider where Hadoop lives in the system
    Manual recovery can be simple, fast, effective
    Backup Strategy
    Name node metadata – hourly, ~2 day retention
    User data
    Log shipping style strategies
    “Fan out” to multiple clusters on ingestion
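The hourly-backup, ~2-day-retention policy above is, mechanically, a pruning rule. A minimal sketch of that rule (the helper name and the in-memory timestamps are illustrative; a real script would list and delete actual backup copies of the name node's fsimage and edits files):

```python
from datetime import datetime, timedelta

def snapshots_to_delete(snapshot_times, now, retention=timedelta(days=2)):
    """Return the timestamps of metadata backups older than the retention window.

    snapshot_times: datetimes of existing hourly name node metadata backups.
    """
    cutoff = now - retention
    return [t for t in snapshot_times if t < cutoff]

# Three days of hourly snapshots; everything past the 48-hour window is stale.
now = datetime(2010, 10, 12, 12, 0)
snaps = [now - timedelta(hours=h) for h in range(72)]
stale = snapshots_to_delete(snaps, now)
```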
  • 15. Data Ingestion
    Many data sources
    Streaming data sources (log files, mostly)
    Files (usually exports from 3rd party)
    Common place we see DIY
    You probably shouldn’t
    Sqoop, Flume, Oozie (but I’m biased)
    No matter what: make it fault tolerant, performant, and monitored
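"Fault tolerant, performant, monitored" is worth making concrete. A hedged sketch of bounded retry with exponential backoff around a hypothetical `deliver` callable; as the slide says, real pipelines should lean on Flume or Sqoop rather than hand-rolling loops like this, but whatever tool you use should behave roughly this way on transient failure:

```python
import time

def ingest_with_retry(deliver, record, max_attempts=3, backoff_seconds=1.0):
    """Attempt delivery with bounded retries and exponential backoff.

    `deliver` is a stand-in callable that writes one record to the cluster
    and raises IOError on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return deliver(record)
        except IOError:
            if attempt == max_attempts:
                raise  # surface the failure so monitoring can alert on it
            time.sleep(backoff_seconds * 2 ** (attempt - 1))
```

The key production properties are that retries are bounded (a dead destination fails loudly instead of spinning forever) and that the final failure propagates somewhere a monitoring system can see it.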
  • 16. ETL and Data Processing
    Non-interactive jobs
    Establish a common directory structure for processes
    Need tools to handle complex chains of jobs
    Workflow tools support
    Job dependencies, error handling
    Invocation based on time or events
    Most common mistake: depending on jobs always completing successfully or within a window of time.
    Monitor for SLA rather than pray
    Defensive coding practices apply just as they do everywhere else!
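Two of the points above, a common directory structure and monitoring for SLA, can be sketched together. The path scheme here is illustrative (any fixed convention works); the point is that every process derives paths from one convention, and that SLA monitoring checks for the presence of expected output rather than assuming jobs always finish:

```python
from datetime import date

def dataset_path(source, dataset, day):
    """One possible dated layout: /data/<source>/<dataset>/YYYY/MM/DD."""
    return "/data/%s/%s/%04d/%02d/%02d" % (
        source, dataset, day.year, day.month, day.day)

def sla_breaches(expected_paths, existing_paths, deadline_passed):
    """Report expected outputs that are missing once the SLA deadline passes.

    Checking for output presence catches jobs that failed outright as well
    as jobs that are merely late -- monitor for SLA rather than pray.
    """
    if not deadline_passed:
        return []
    return sorted(set(expected_paths) - set(existing_paths))
```

A cron-driven checker that calls something like `sla_breaches` and pages on a non-empty result is usually far simpler, and far more reliable, than trusting the job chain itself to report its own failures.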
  • 17. Metadata Management
    Tool independent metadata about…
    Data sets we know about and their location (on HDFS)
    Authorization (currently HDFS permissions only)
    Format and compression
    Guarantees (consistency, timeliness, whether duplicates are permitted)
    Currently still DIY in many ways, tool-dependent
    Most people rely on prayer and hard coding
    (H)OWL is interesting
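What "tool independent metadata" might look like in the DIY world the slide describes: a registry keyed by data set name, so paths and formats live in one place instead of being hard coded in every job. The field names here are illustrative, chosen to mirror the bullets above:

```python
from dataclasses import dataclass

@dataclass
class DatasetMetadata:
    """A tool-independent description of one data set (fields illustrative)."""
    name: str
    hdfs_location: str
    file_format: str          # e.g. "sequencefile", "text"
    compression: str          # e.g. "gzip", "none"
    owner_group: str          # maps to HDFS group permissions
    permits_duplicates: bool
    freshness_sla_hours: int

registry = {}

def register(meta):
    registry[meta.name] = meta

def locate(name):
    """Resolve a data set name to its HDFS path instead of hard coding it."""
    return registry[name].hdfs_location
```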
  • 18. Authentication and authorization
    Don’t talk to strangers
    Should integrate with existing IT infrastructure
    Yahoo! security (Kerberos) patches now part of CDH3b3
    Not everyone can access everything
    E.g. production data sets are read-only to quants / analysts; analysts have home or group directories for derived data sets.
    Mostly enforced via HDFS permissions; directory structure and organization is critical
    Not as fine grained as column level access in EDW, RDBMS
    HUE as a gateway to the cluster
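HDFS permissions follow the Unix owner/group/other model, which is why the slide stresses directory organization and group membership. A simplified read-check sketch (no superuser, no finer-grained ACLs, matching what HDFS offered at the time) showing how a production data set stays read-only to analysts:

```python
def can_read(user, user_groups, owner, group, mode):
    """Unix-style read check, the model HDFS permissions follow (simplified)."""
    if user == owner:
        return bool(mode & 0o400)   # owner read bit
    if group in user_groups:
        return bool(mode & 0o040)   # group read bit
    return bool(mode & 0o004)       # other read bit

# A production data set owned by the ETL user, group "analysts", mode 0o750
# (rwxr-x---): analysts can read it but cannot write; everyone else is locked out.
```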
  • 19. Resource Sharing
    Prefer one large cluster to many small clusters (unless maybe you’re Facebook)
    “Stop hogging the cluster!”
    Cluster resources
    Disk space (HDFS size quotas)
    Number of files (HDFS file count quotas)
    Simultaneous jobs
    Tasks – guaranteed capacity, full utilization, SLA enforcement
    Monitor and track resource utilization across all groups
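The quota bullets translate into a simple per-group check. In a live cluster the quotas themselves are set with `hadoop dfsadmin -setQuota` / `-setSpaceQuota` and usage is read back with `hadoop fs -count -q`; this sketch only shows the cross-group tracking the slide recommends, with made-up numbers:

```python
def over_quota(usage_bytes, quota_bytes):
    """Flag groups whose HDFS space usage exceeds their quota.

    Groups with no quota configured are treated as unlimited.
    """
    return sorted(group for group, used in usage_bytes.items()
                  if used > quota_bytes.get(group, float("inf")))

usage = {"analytics": 9 * 2**40, "etl": 3 * 2**40}   # bytes used per group
quotas = {"analytics": 8 * 2**40, "etl": 4 * 2**40}  # configured space quotas
hogs = over_quota(usage, quotas)
```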
  • 20. Monitoring
    Critical for keeping things running
    Cluster health
    Traditional monitoring tools: Nagios, Hyperic, Zenoss
    Host checks, service checks
    When to alert? It’s tricky.
    Cluster performance
    Overall utilization in aggregate
    30,000ft view of utilization and performance; macro level
  • 21. Monitoring
    Hadoop aware cluster monitoring
    Traditional tools don’t cut it; Hadoop monitoring is inherently Hadoop specific
    Analogous to RDBMS monitoring tools
    Job level “monitoring”
    More like analysis
    “What resources does this job use?”
    “How does this run compare to last run?”
    “How can I make this run faster, more resource efficient?”
    Two views we care about
    Job perspective
    Resource perspective (task slots, scheduler pool)
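The "how does this run compare to last run?" question is just metric deltas between two runs of the same job. A sketch, with illustrative metric names; in practice the values would come from the JobTracker's per-job counters:

```python
def compare_runs(current, previous):
    """Percent change per metric between two runs of the same job."""
    deltas = {}
    for metric, value in current.items():
        baseline = previous.get(metric)
        if baseline:  # skip metrics missing from, or zero in, the last run
            deltas[metric] = 100.0 * (value - baseline) / baseline
    return deltas
```

Tracking these deltas over time is what turns job "monitoring" into the analysis the slide describes: a 20% runtime regression with unchanged input size is a signal a threshold-based host check will never produce.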
  • 22. Wrapping it up
    Hadoop proper is awesome, but is only part of the picture
    Much of Professional Services time is filling in the blanks
    There’s still a way to go
    Metadata management
    Operational tools and support
    Improvements to Hadoop core to improve stability, security, manageability
    Adoption and feedback drive progress
    CDH provides the infrastructure for a complete system
  • 23. Cloudera Makes Hadoop Safe For the Enterprise
  • 24. Cloudera Enterprise
    Enterprise Support and Management Tools
    • Increases reliability and consistency of the Hadoop platform
    • Improves Hadoop’s conformance to important IT policies and procedures
    • Lowers the cost of management and administration
  • 27. References / Resources
    Cloudera documentation -
    Cloudera Groups –
    Cloudera JIRA –
    Hadoop: The Definitive Guide; #cloudera, #hadoop