Hadoop World 2010: Productionizing Hadoop: Lessons Learned
 

  • Many small and midsize companies – especially those for whom technology is not their primary product or concern – start out the same. Someone, probably you, decides they need to build a Hadoop cluster.
  • …and so you do. And it looks like this. Not a bad thing, but not what you can reasonably go to production with. So how do you get from this…
  • …to this?

Presentation Transcript

  • 1. Productionizing Hadoop: Lessons Learned
    Eric Sammer - Solution Architect - Cloudera
    email: esammer@cloudera.com
    twitter: @esammer, @cloudera
  • 2. Starting Out
    2
    Copyright 2010 Cloudera Inc. All rights reserved
    (You)
    “Let’s build a Hadoop cluster!”
    http://www.iccs.inf.ed.ac.uk/~miles/code.html
  • 3. Starting Out
    3
    Copyright 2010 Cloudera Inc. All rights reserved
    (You)
    http://www.iccs.inf.ed.ac.uk/~miles/code.html
  • 4. Where you want to be
    4
    Copyright 2010 Cloudera Inc. All rights reserved
    (You)
    Yahoo! Hadoop Cluster (2007)
  • 5. What is Hadoop?
    A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
    Core Hadoop has two main components
    Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
    MapReduce: fault-tolerant distributed processing
    Key value
    Flexible -> store data without a schema and add it later as needed
    Affordable -> cost / TB at a fraction of traditional options
    Broadly adopted -> a large and active ecosystem
    Proven at scale -> dozens of petabyte+ implementations in production today
    Copyright 2010 Cloudera Inc. All Rights Reserved.
    5
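
To make the MapReduce half of that description concrete: with Hadoop Streaming, the map and reduce steps can be ordinary scripts that read stdin and emit tab-separated key/value pairs. Below is a minimal word-count sketch in Python; the invocation in the comment, including the streaming jar path and the input/output directories, uses placeholder values rather than anything from the slides.

    #!/usr/bin/env python
    # wordcount_streaming.py - minimal Hadoop Streaming word count (sketch).
    # One possible invocation (all paths are placeholders):
    #   hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
    #     -input /data/raw_text -output /data/wordcounts \
    #     -mapper "python wordcount_streaming.py map" \
    #     -reducer "python wordcount_streaming.py reduce" \
    #     -file wordcount_streaming.py
    import sys

    def do_map():
        # Emit one tab-separated (word, 1) record per token.
        for line in sys.stdin:
            for word in line.strip().split():
                print("%s\t1" % word)

    def do_reduce():
        # Streaming sorts by key, so all counts for a word arrive contiguously.
        current, count = None, 0
        for line in sys.stdin:
            word, value = line.rstrip("\n").split("\t", 1)
            if word != current and current is not None:
                print("%s\t%d" % (current, count))
                count = 0
            current = word
            count += int(value)
        if current is not None:
            print("%s\t%d" % (current, count))

    if __name__ == "__main__":
        {"map": do_map, "reduce": do_reduce}[sys.argv[1]]()
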
  • 6. Cloudera’s Distribution for Hadoop, Version 3
    The Industry’s Leading Hadoop Distribution
    (Component stack diagram: Hue, Hue SDK, Oozie, Pig, Hive, Flume, Sqoop, HBase, ZooKeeper)
    • Open source – 100% Apache licensed
    • Simplified – Component versions & dependencies managed for you
    • Integrated – All components & functions interoperate through standard APIs
    • Reliable – Patched with fixes from future releases to improve stability
    • Supported – Employs project founders and committers for >70% of components
    6
    Copyright 2010 Cloudera Inc. All Rights Reserved.
  • 7. Overview
    Proper planning
    Data Ingestion
    ETL and Data Processing Infrastructure
    Authentication, Authorization, and Sharing
    Monitoring
    7
    Copyright 2010 Cloudera Inc. All rights reserved
  • 8. The production data platform
    Data storage
    ETL / data processing / analysis infrastructure
    Data ingestion infrastructure
    Integration with tools
    Data security and access control
    Health and performance monitoring
    8
    Copyright 2010 Cloudera Inc. All rights reserved
  • 9. Proper planning
    Know your use cases!
    Log transformation, aggregation
    Text mining, IR
    Analytics
    Machine learning
    Critical to proper configuration
    Hadoop
    Network
    OS
    Resource utilization, deep job insight will tell you more
    9
    Copyright 2010 Cloudera Inc. All rights reserved
  • 10. HDFS Concerns
    NameNode availability
    HA is tricky
    Consider where Hadoop lives in the system
    Manual recovery can be simple, fast, effective
    Backup Strategy
    NameNode metadata – hourly, ~2-day retention
    User data
    Log shipping style strategies
    DistCp
    “Fan out” to multiple clusters on ingestion
    10
    Copyright 2010 Cloudera Inc. All rights reserved
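
As a sketch of the DistCp style of user-data backup listed above, the script below copies a fixed set of HDFS directories to a second cluster. The NameNode addresses and directory list are placeholders, and backing up the NameNode metadata itself (the hourly copy mentioned on the slide) is a separate task not shown here.

    #!/usr/bin/env python
    # hdfs_backup.py - copy selected HDFS directories to a backup cluster with
    # DistCp (sketch; NameNode addresses and directory list are placeholders).
    import subprocess
    import sys

    SOURCE_NN = "hdfs://prod-nn:8020"     # placeholder
    BACKUP_NN = "hdfs://backup-nn:8020"   # placeholder
    DIRECTORIES = ["/data/events", "/data/reference"]   # placeholder list

    def backup(path):
        # -update copies only files that are missing or changed on the target.
        return subprocess.call(["hadoop", "distcp", "-update",
                                SOURCE_NN + path, BACKUP_NN + path])

    if __name__ == "__main__":
        failed = [p for p in DIRECTORIES if backup(p) != 0]
        if failed:
            sys.stderr.write("distcp failed for: %s\n" % ", ".join(failed))
            sys.exit(1)
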
  • 11. Data Ingestion
    Many data sources
    Streaming data sources (log files, mostly)
    RDBMS
    EDW
    Files (usually exports from 3rd party)
    Common place we see DIY
    You probably shouldn’t
    Sqoop, Flume, Oozie (but I’m biased)
    No matter what - fault tolerant, performant, monitored
    11
    Copyright 2010 Cloudera Inc. All rights reserved
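
For the RDBMS sources above, a Sqoop import does most of the heavy lifting; the sketch below wraps a single table import so a scheduler can retry it and monitoring can alert on a non-zero exit. The JDBC URL, username, table, and target directory are placeholders, and credential handling is omitted.

    #!/usr/bin/env python
    # ingest_orders.py - nightly Sqoop import of one RDBMS table (sketch).
    # The JDBC URL, username, table, and target directory are placeholders.
    import subprocess
    import sys

    SQOOP_IMPORT = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/shop",        # placeholder
        "--username", "etl",                            # placeholder
        "--table", "orders",                            # placeholder
        "--target-dir", "/incoming/orders/2010-10-12",  # placeholder
        "--num-mappers", "4",
    ]

    if __name__ == "__main__":
        rc = subprocess.call(SQOOP_IMPORT)
        if rc != 0:
            # Surface the failure so the scheduler and monitoring can react.
            sys.stderr.write("sqoop import failed with exit code %d\n" % rc)
            sys.exit(rc)
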
  • 12. ETL and Data Processing
    Non-interactive jobs
    Establish a common directory structure for processes
    Need tools to handle complex chains of jobs
    Workflow tools support
    Job dependencies, error handling
    Tracking
    Invocation based on time or events
    Most common mistake: depending on jobs always completing successfully or within a window of time.
    Monitor for SLA rather than pray
    Defensive coding practices apply just as they do everywhere else!
    12
    Copyright 2010 Cloudera Inc. All rights reserved
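
The "monitor for SLA rather than pray" and defensive-coding points can be made concrete with a small wrapper around each step of a chain: fail loudly, stop the chain, and alert when a step exceeds its time budget. This is only an illustration; the commands, SLA value, and alert hook are placeholders, and in practice a workflow engine such as Oozie would own the chaining.

    #!/usr/bin/env python
    # run_step.py - defensive wrapper for one step of an ETL chain (sketch).
    # The commands, SLA value, and alert hook are placeholders.
    import subprocess
    import sys
    import time

    SLA_SECONDS = 2 * 60 * 60   # placeholder: this step should finish within 2 hours

    def alert(message):
        # Placeholder: wire this into Nagios, email, or a pager instead of stderr.
        sys.stderr.write("ALERT: %s\n" % message)

    def run_step(cmd):
        start = time.time()
        rc = subprocess.call(cmd)
        elapsed = time.time() - start
        if rc != 0:
            alert("step %r failed with exit code %d" % (cmd, rc))
            sys.exit(rc)   # stop the chain; don't let later steps read partial data
        if elapsed > SLA_SECONDS:
            alert("step %r finished but missed its SLA (%.0f s)" % (cmd, elapsed))

    if __name__ == "__main__":
        # Example chain (jar, job names, and paths are placeholders).
        run_step(["hadoop", "jar", "etl.jar", "transform", "/incoming", "/staged"])
        run_step(["hadoop", "jar", "etl.jar", "aggregate", "/staged", "/reports"])
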
  • 13. Metadata Management
    Tool independent metadata about…
    Data sets we know about and their location (on HDFS)
    Schemata
    Authorization (currently HDFS permissions only)
    Partitioning
    Format and compression
    Guarantees (consistency, timeliness, permits duplicates)
    Currently still DIY in many ways, tool-dependent
    Most people rely on prayer and hard coding
    (H)OWL is interesting
    13
    Copyright 2010 Cloudera Inc. All rights reserved
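
There is no standard tool for this yet (hence the interest in (H)OWL), but even a small, tool-independent registry beats hard-coded paths scattered across jobs. The sketch below shows the kind of record such a registry might hold; the field names and the example entry are illustrative, not an established schema.

    # dataset_registry.py - minimal, tool-independent data set metadata (sketch).
    # Field names, values, and the example entry are illustrative only.

    DATASETS = {
        "web_logs": {
            "location":     "/data/web_logs",   # root directory on HDFS
            "partitioning": "dt=YYYY-MM-DD",    # one subdirectory per day
            "format":       "text",
            "compression":  "gzip",
            "schema":       ["ts", "host", "url", "status", "bytes"],
            "guarantees":   {"duplicates": "possible", "latency": "hourly"},
        },
    }

    def location(name, partition=None):
        """Resolve a data set (and optional partition) to an HDFS path."""
        base = DATASETS[name]["location"]
        return "%s/%s" % (base, partition) if partition else base

    if __name__ == "__main__":
        print(location("web_logs", "dt=2010-10-12"))   # /data/web_logs/dt=2010-10-12
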
  • 14. Authentication and authorization
    Authentication
    Don’t talk to strangers
    Should integrate with existing IT infrastructure
    Yahoo! security (Kerberos) patches now part of CDH3b3
    Authorization
    Not everyone can access everything
    Ex. Production data sets are read-only to quants / analysts. Analysts have home or group directories for derived data sets.
    Mostly enforced via HDFS permissions; directory structure and organization is critical
    Not as fine grained as column level access in EDW, RDBMS
    HUE as a gateway to the cluster
    14
    Copyright 2010 Cloudera Inc. All rights reserved
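
The authorization model above comes down to directory layout plus ownership and modes; the sketch below sets up the production/analyst split from the example bullet. The users, groups, and paths are placeholders, and it assumes it is run as the HDFS superuser.

    #!/usr/bin/env python
    # setup_permissions.py - sketch of the directory layout and HDFS permissions
    # described above. Users, groups, and paths are placeholders.
    import subprocess

    def hdfs(*args):
        subprocess.check_call(["hadoop", "fs"] + list(args))

    if __name__ == "__main__":
        # Production data sets: writable by the ETL user, read-only for analysts.
        hdfs("-mkdir", "/data/production")
        hdfs("-chown", "-R", "etl:analysts", "/data/production")
        hdfs("-chmod", "-R", "750", "/data/production")

        # Per-analyst home directory for derived data sets.
        hdfs("-mkdir", "/user/alice")
        hdfs("-chown", "alice:analysts", "/user/alice")
        hdfs("-chmod", "755", "/user/alice")
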
  • 15. Resource Sharing
    Prefer one large cluster to many small clusters (unless maybe you’re Facebook)
    “Stop hogging the cluster!”
    Cluster resources
    Disk space (HDFS size quotas)
    Number of files (HDFS file count quotas)
    Simultaneous jobs
    Tasks – guaranteed capacity, full utilization, SLA enforcement
    Monitor and track resource utilization across all groups
    15
    Copyright 2010 Cloudera Inc. All rights reserved
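
The disk-space and file-count quotas above are applied per directory with dfsadmin; a sketch follows, with the group directories and limits as placeholder values.

    #!/usr/bin/env python
    # set_quotas.py - apply per-directory HDFS quotas (sketch). Paths and limits
    # are placeholders; assumes it is run as the HDFS superuser.
    import subprocess

    QUOTAS = {
        # path                (max files/dirs, max raw bytes incl. replication)
        "/user/analytics":    (100000, 10 * 1024 ** 4),   # ~10 TB raw
        "/user/recommender":  (500000, 25 * 1024 ** 4),   # ~25 TB raw
    }

    if __name__ == "__main__":
        for path, (max_files, max_bytes) in QUOTAS.items():
            # -setQuota caps the number of names (files + directories) under path;
            # -setSpaceQuota caps raw disk consumption, which includes replication.
            subprocess.check_call(["hadoop", "dfsadmin", "-setQuota",
                                   str(max_files), path])
            subprocess.check_call(["hadoop", "dfsadmin", "-setSpaceQuota",
                                   str(max_bytes), path])
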
  • 16. Monitoring
    Critical for keeping things running
    Cluster health
    Duh.
    Traditional monitoring tools: Nagios, Hyperic, Zenoss
    Host checks, service checks
    When to alert? It’s tricky.
    Cluster performance
    Overall utilization in aggregate
    30,000ft view of utilization and performance; macro level
    16
    Copyright 2010 Cloudera Inc. All rights reserved
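
A Nagios- or Zenoss-style service check is just a script that prints one status line and exits 0/1/2/3. The sketch below flags dead DataNodes by parsing the output of "hadoop dfsadmin -report"; the thresholds are placeholders, and since the report wording varies between Hadoop versions, the regular expression is an assumption to verify against your cluster.

    #!/usr/bin/env python
    # check_hdfs_datanodes.py - Nagios-style check for dead DataNodes (sketch).
    # Parses "hadoop dfsadmin -report"; the report wording differs between Hadoop
    # versions, so the regular expression below is an assumption to verify.
    import re
    import subprocess
    import sys

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
    WARN_DEAD, CRIT_DEAD = 1, 2    # placeholder thresholds

    def main():
        try:
            proc = subprocess.Popen(["hadoop", "dfsadmin", "-report"],
                                    stdout=subprocess.PIPE)
            report = proc.communicate()[0].decode("utf-8", "replace")
        except OSError:
            print("UNKNOWN: could not run hadoop dfsadmin")
            return UNKNOWN
        # Expected line (0.20-era): "Datanodes available: 40 (42 total, 2 dead)"
        match = re.search(r"\((\d+) total, (\d+) dead\)", report)
        if not match:
            print("UNKNOWN: could not parse dfsadmin report")
            return UNKNOWN
        total, dead = int(match.group(1)), int(match.group(2))
        status = OK
        if dead >= WARN_DEAD:
            status = WARNING
        if dead >= CRIT_DEAD:
            status = CRITICAL
        print("%d of %d DataNodes dead" % (dead, total))
        return status

    if __name__ == "__main__":
        sys.exit(main())
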
  • 17. Monitoring
    Hadoop aware cluster monitoring
    Traditional tools don’t cut it; Hadoop monitoring is inherently Hadoop specific
    Analogous to RDBMS monitoring tools
    Job level “monitoring”
    More like analysis
    “What resources does this job use?”
    “How does this run compare to last run?”
    “How can I make this run faster, more resource efficient?”
    Two views we care about
    Job perspective
    Resource perspective (task slots, scheduler pool)
    17
    Copyright 2010 Cloudera Inc. All rights reserved
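
None of the stock tools answer "how does this run compare to last run?" out of the box, so a common starting point is simply recording per-run numbers somewhere queryable. The sketch below appends run metrics to a CSV file and pulls back the last two runs of a job for comparison; where the numbers come from (JobTracker counters, job history) is deliberately left out, and the field names and log path are illustrative.

    # job_run_log.py - record per-run job metrics for later comparison (sketch).
    # Field names and the log path are illustrative only; fetching the metrics
    # themselves (from counters or job history) is not shown.
    import csv
    import os

    LOG_PATH = "/var/log/hadoop-jobs/runs.csv"   # placeholder

    FIELDS = ["job", "started", "duration_s", "maps", "reduces", "hdfs_bytes_read"]

    def record_run(row):
        """Append one run's metrics; row is a dict keyed by FIELDS."""
        new_file = not os.path.exists(LOG_PATH)
        with open(LOG_PATH, "a") as f:
            writer = csv.writer(f)
            if new_file:
                writer.writerow(FIELDS)
            writer.writerow([row[k] for k in FIELDS])

    def last_two_runs(job_name):
        """Return the two most recent rows for a job, oldest first."""
        with open(LOG_PATH) as f:
            rows = [r for r in csv.DictReader(f) if r["job"] == job_name]
        return rows[-2:]
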
  • 18. Wrapping it up
    Hadoop proper is awesome, but is only part of the picture
    Much of Professional Services time is filling in the blanks
    There’s still a way to go
    Metadata management
    Operational tools and support
    Improvements to Hadoop core to improve stability, security, manageability
    Adoption and feedback drive progress
    CDH provides the infrastructure for a complete system
    18
    Copyright 2010 Cloudera Inc. All rights reserved
  • 19. Cloudera Makes Hadoop Safe For the Enterprise
    Copyright 2010 Cloudera Inc. All Rights Reserved.
    19
    Software
    Services
    Training
  • 20. Cloudera Enterprise
    Enterprise Support and Management Tools
    • Increases reliability and consistency of the Hadoop platform
    • Improves Hadoop’s conformance to important IT policies and procedures
    • Lowers the cost of management and administration
    Copyright 2010 Cloudera Inc. All Rights Reserved.
    20
  • 21. References / Resources
    Cloudera documentation - http://docs.cloudera.com
    Cloudera Groups – http://groups.cloudera.org
    Cloudera JIRA – http://issues.cloudera.org
    Hadoop: The Definitive Guide
    esammer@cloudera.com
    irc.freenode.net #cloudera, #hadoop
    @esammer
    21
    Copyright 2010 Cloudera Inc. All rights reserved
  • 22.
    Copyright 2010 Cloudera Inc. All rights reserved