
Hadoop World 2010: Productionizing Hadoop: Lessons Learned

Published in: Technology
  • Many small and midsize companies – especially those for whom technology is not their primary product or concern – start out the same. Someone, probably you, decides they need to build a Hadoop cluster.
  • …and so you do. And it looks like this. Not a bad thing, but not what you can reasonably go to production with. So how do you get from this…
  • …to this?
  • Transcript

    • 1. Productionizing Hadoop: Lessons Learned
      Eric Sammer - Solution Architect - Cloudera
      email: esammer@cloudera.com
      twitter: @esammer, @cloudera
    • 2. Starting Out
      Copyright 2010 Cloudera Inc. All rights reserved
      (You)
      “Let’s build a Hadoop cluster!”
      http://www.iccs.inf.ed.ac.uk/~miles/code.html
    • 3. Starting Out
      (You)
      http://www.iccs.inf.ed.ac.uk/~miles/code.html
    • 4. Where you want to be
      (You)
      Yahoo! Hadoop Cluster (2007)
    • 5. What is Hadoop?
      A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
      Core Hadoop has two main components
        Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
        MapReduce: fault-tolerant distributed processing
      Key value
        Flexible -> store data without a schema and add it later as needed
        Affordable -> cost / TB at a fraction of traditional options
        Broadly adopted -> a large and active ecosystem
        Proven at scale -> dozens of petabyte+ implementations in production today
    • 6. Cloudera’s Distribution for Hadoop, Version 3
      The Industry’s Leading Hadoop Distribution
      Components shown in the diagram: Hue, Hue SDK, Oozie, Pig, Hive, Flume, Sqoop, HBase, ZooKeeper
      • Open source – 100% Apache licensed
      • Simplified – Component versions & dependencies managed for you
      • Integrated – All components & functions interoperate through standard APIs
      • Reliable – Patched with fixes from future releases to improve stability
      • Supported – Employs project founders and committers for >70% of components
    • 11. Overview
      Proper planning
      Data Ingestion
      ETL and Data Processing Infrastructure
      Authentication, Authorization, and Sharing
      Monitoring
    • 12. The production data platform
      Data storage
      ETL / data processing / analysis infrastructure
      Data ingestion infrastructure
      Integration with tools
      Data security and access control
      Health and performance monitoring
    • 13. Proper planning
      Know your use cases!
        Log transformation, aggregation
        Text mining, IR
        Analytics
        Machine learning
      Critical to proper configuration
        Hadoop
        Network
        OS
      Resource utilization, deep job insight will tell you more
    • 14. HDFS Concerns
      Name node availability
        HA is tricky
        Consider where Hadoop lives in the system
        Manual recovery can be simple, fast, effective
      Backup Strategy
        Name node metadata – hourly, ~2 day retention
        User data
          Log shipping style strategies
          DistCp
          “Fan out” to multiple clusters on ingestion
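
      The "hourly, ~2 day retention" policy for name node metadata backups can be illustrated with a small pruning routine. This is only a sketch: it assumes each backup is identified by the time it was taken, and the function name `expired_backups` and the sample timestamps are hypothetical, not part of any Hadoop tool.

```python
from datetime import datetime, timedelta


def expired_backups(backup_times, now, retention=timedelta(days=2)):
    """Return the backup timestamps that fall outside the retention window."""
    cutoff = now - retention
    return sorted(t for t in backup_times if t < cutoff)


# Hypothetical example: one backup per hour over the last three days.
now = datetime(2010, 10, 12, 12, 0)
backups = [now - timedelta(hours=h) for h in range(72)]
to_delete = expired_backups(backups, now)
print(len(to_delete), "backups are older than the retention window")
```

      A real job would map each timestamp to a copy of the metadata directory and delete only after the newest backup has been verified.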
    • 15. Data Ingestion
      Many data sources
        Streaming data sources (log files, mostly)
        RDBMS
        EDW
        Files (usually exports from 3rd party)
      Common place we see DIY
        You probably shouldn’t
        Sqoop, Flume, Oozie (but I’m biased)
      No matter what - fault tolerant, performant, monitored
    • 16. ETL and Data Processing
      Non-interactive jobs
        Establish a common directory structure for processes
        Need tools to handle complex chains of jobs
      Workflow tools support
        Job dependencies, error handling
        Tracking
        Invocation based on time or events
      Most common mistake: depending on jobs always completing successfully or within a window of time.
        Monitor for SLA rather than pray
        Defensive coding practices apply just as they do everywhere else!
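
      "Monitor for SLA rather than pray" could look roughly like the check below. It is a minimal sketch, assuming job run records are available from whatever workflow tool is in use; the `JobRun` record, its fields, and the job names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class JobRun:
    name: str
    deadline: datetime                      # time by which the job must have succeeded
    finished_at: Optional[datetime] = None  # None while running or never started
    succeeded: bool = False


def sla_violations(runs, now):
    """Return names of jobs that failed, finished late, or are overdue and unfinished."""
    late = []
    for run in runs:
        if run.finished_at is None:
            if now > run.deadline:
                late.append(run.name)
        elif not run.succeeded or run.finished_at > run.deadline:
            late.append(run.name)
    return late


now = datetime(2010, 10, 12, 6, 0)
deadline = datetime(2010, 10, 12, 5, 0)
runs = [
    JobRun("load_logs", deadline, datetime(2010, 10, 12, 4, 30), True),   # on time
    JobRun("sessionize", deadline, datetime(2010, 10, 12, 5, 45), True),  # finished late
    JobRun("aggregate", deadline, datetime(2010, 10, 12, 4, 50), False),  # failed
    JobRun("export", deadline),                                           # never finished
]
violations = sla_violations(runs, now)
print(violations)
```

      The point of the sketch is that a job that never ran is treated as a violation too, instead of being silently assumed to have completed.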
    • 17. Metadata Management
      Tool independent metadata about…
        Data sets we know about and their location (on HDFS)
        Schemata
        Authorization (currently HDFS permissions only)
        Partitioning
        Format and compression
        Guarantees (consistency, timeliness, permits duplicates)
      Currently still DIY in many ways, tool-dependent
        Most people rely on prayer and hard coding
        (H)OWL is interesting
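
      As an illustration of what a tool-independent metadata record might hold, the sketch below captures the items listed on this slide in a plain data class and serializes it to JSON. Every field name, the data set name, and the path are assumptions made for the example; this is not the format used by (H)OWL or any other tool.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetMetadata:
    name: str
    location: str                 # directory on HDFS
    schema: dict                  # column name -> type
    partition_keys: list = field(default_factory=list)
    file_format: str = "text"
    compression: str = "none"
    permits_duplicates: bool = False
    max_staleness_hours: int = 24  # timeliness guarantee


record = DatasetMetadata(
    name="web_logs",
    location="/data/prod/web_logs",
    schema={"ts": "string", "url": "string", "status": "int"},
    partition_keys=["date"],
    file_format="sequencefile",
    compression="gzip",
    permits_duplicates=True,
)

# JSON keeps the record readable by any tool, not just the one that wrote it.
as_json = json.dumps(asdict(record), indent=2, sort_keys=True)
print(as_json)
```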
    • 18. Authentication and authorization
      Authentication
        Don’t talk to strangers
        Should integrate with existing IT infrastructure
        Yahoo! security (Kerberos) patches now part of CDH3b3
      Authorization
        Not everyone can access everything
        Ex. Production data sets are read-only to quants / analysts. Analysts have home or group directories for derived data sets.
        Mostly enforced via HDFS permissions; directory structure and organization are critical
        Not as fine grained as column level access in EDW, RDBMS
      HUE as a gateway to the cluster
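
      HDFS permissions follow the familiar owner / group / other model with read, write, and execute bits. The sketch below mimics that check in plain Python to show how the "production is read-only to analysts" example falls out of directory ownership and mode. The users, groups, and modes are hypothetical, and the superuser case is deliberately ignored.

```python
from dataclasses import dataclass

READ, WRITE, EXECUTE = 4, 2, 1


@dataclass
class PathStatus:
    owner: str
    group: str
    mode: int  # octal permission bits, e.g. 0o750


def is_allowed(status, user, user_groups, wanted):
    """Check the owner, then group, then other bits, as in a POSIX-style model."""
    if user == status.owner:
        bits = (status.mode >> 6) & 7
    elif status.group in user_groups:
        bits = (status.mode >> 3) & 7
    else:
        bits = status.mode & 7
    return (bits & wanted) == wanted


# Hypothetical layout: ETL owns production data, analysts may only read it.
prod = PathStatus(owner="etl", group="analysts", mode=0o750)
home = PathStatus(owner="alice", group="analysts", mode=0o700)

print(is_allowed(prod, "alice", {"analysts"}, READ))   # analyst reading production
print(is_allowed(prod, "alice", {"analysts"}, WRITE))  # analyst writing production
print(is_allowed(home, "alice", {"analysts"}, WRITE))  # analyst writing own home
```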
    • 19. Resource Sharing
      Prefer one large cluster to many small clusters (unless maybe you’re Facebook)
      “Stop hogging the cluster!”
      Cluster resources
        Disk space (HDFS size quotas)
        Number of files (HDFS file count quotas)
        Simultaneous jobs
        Tasks – guaranteed capacity, full utilization, SLA enforcement
      Monitor and track resource utilization across all groups
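
      Tracking quota utilization per group can be as simple as comparing usage against the configured limit. The sketch below assumes the numbers have already been collected elsewhere (for example from the output of `hadoop fs -count -q`); the directory names, counts, and the 80% warning threshold are hypothetical.

```python
def quota_report(usage, warn_at=0.8):
    """Map each directory at or above warn_at of its quota to its utilization.

    usage maps a directory to a (used, quota) pair; a quota of 0 means "no quota".
    """
    report = {}
    for path, (used, quota) in usage.items():
        if quota and used / quota >= warn_at:
            report[path] = round(used / quota, 2)
    return report


# Hypothetical file-count usage per group directory.
file_counts = {
    "/user/analytics": (900_000, 1_000_000),
    "/user/search": (200_000, 1_000_000),
    "/user/ads": (1_000_000, 1_000_000),
    "/user/scratch": (50_000, 0),
}
report = quota_report(file_counts)
print(report)
```

      The same function works for size quotas if the pairs hold bytes instead of file counts.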
    • 20. Monitoring
      Critical for keeping things running
      Cluster health
        Duh.
        Traditional monitoring tools: Nagios, Hyperic, Zenoss
        Host checks, service checks
        When to alert? It’s tricky.
      Cluster performance
        Overall utilization in aggregate
        30,000 ft view of utilization and performance; macro level
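
      One reason alerting is tricky is that a single dead worker in a fault-tolerant cluster is routine, while many dead workers is not. The sketch below shows one possible policy, which is an assumption for illustration rather than something the slides prescribe: page someone only when the fraction of live nodes drops below a threshold. The function name and the 90% default are hypothetical.

```python
def should_page(live_nodes, total_nodes, min_live_fraction=0.9):
    """Return True when too small a share of the cluster's workers is alive."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    return live_nodes / total_nodes < min_live_fraction


# Three dead nodes out of 100 is business as usual; fifteen is not.
print(should_page(97, 100))
print(should_page(85, 100))
```

      Master services such as the name node would still warrant an immediate alert on their own, since the cluster cannot route around their loss in the same way.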
    • 21. Monitoring
      Hadoop aware cluster monitoring
        Traditional tools don’t cut it; Hadoop monitoring is inherently Hadoop specific
        Analogous to RDBMS monitoring tools
      Job level “monitoring”
        More like analysis
        “What resources does this job use?”
        “How does this run compare to last run?”
        “How can I make this run faster, more resource efficient?”
      Two views we care about
        Job perspective
        Resource perspective (task slots, scheduler pool)
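
      The question "How does this run compare to last run?" can be answered mechanically once per-job metrics are recorded. The sketch below assumes each run is summarized as a dictionary of numeric metrics; the metric names, values, and the 20% tolerance are hypothetical.

```python
def compare_runs(previous, current, tolerance=0.2):
    """Return metrics that grew by more than tolerance relative to the previous run."""
    regressions = {}
    for metric, old in previous.items():
        new = current.get(metric)
        if new is None or old == 0:
            continue  # nothing to compare against
        change = (new - old) / old
        if change > tolerance:
            regressions[metric] = round(change, 2)
    return regressions


last_run = {"wall_clock_s": 600, "map_tasks": 200, "hdfs_bytes_read": 1000}
this_run = {"wall_clock_s": 900, "map_tasks": 210, "hdfs_bytes_read": 1000}
regressions = compare_runs(last_run, this_run)
print(regressions)
```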
    • 22. Wrapping it up
      Hadoop proper is awesome, but is only part of the picture
      Much of Professional Services time is filling in the blanks
      There’s still a way to go
        Metadata management
        Operational tools and support
        Improvements to Hadoop core to improve stability, security, manageability
      Adoption and feedback drive progress
      CDH provides the infrastructure for a complete system
    • 23. Cloudera Makes Hadoop Safe For the Enterprise
      Software
      Services
      Training
    • 24. Cloudera Enterprise
      Enterprise Support and Management Tools
      • Increases reliability and consistency of the Hadoop platform
      • Improves Hadoop’s conformance to important IT policies and procedures
      • Lowers the cost of management and administration
    • 27. References / Resources
      Cloudera documentation - http://docs.cloudera.com
      Cloudera Groups – http://groups.cloudera.org
      Cloudera JIRA – http://issues.cloudera.org
      Hadoop: The Definitive Guide
      esammer@cloudera.com
      irc.freenode.net #cloudera, #hadoop
      @esammer