Hadoop World 2010: Productionizing Hadoop: Lessons Learned
Published in: Technology
  • Many small and midsize companies – especially those for whom technology is not their primary product or concern – start out the same. Someone, probably you, decides they need to build a Hadoop cluster.
  • …and so you do. And it looks like this. Not a bad thing, but not what you can reasonably go to production with. So how do you get from this…
  • …to this?
  • Transcript

    • 1. Productionizing Hadoop: Lessons Learned
        Eric Sammer - Solution Architect - Cloudera
        email: esammer@cloudera.com
        twitter: @esammer, @cloudera
    • 2. Starting Out
        (You) “Let’s build a Hadoop cluster!”
        http://www.iccs.inf.ed.ac.uk/~miles/code.html
        Copyright 2010 Cloudera Inc. All rights reserved
    • 3. Starting Out
        (You)
        http://www.iccs.inf.ed.ac.uk/~miles/code.html
    • 4. Where you want to be
        (You)
        Yahoo! Hadoop Cluster (2007)
    • 5. What is Hadoop?
        A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
        • Core Hadoop has two main components
            • Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
            • MapReduce: fault-tolerant distributed processing
        • Key value
            • Flexible -> store data without a schema and add it later as needed
            • Affordable -> cost / TB at a fraction of traditional options
            • Broadly adopted -> a large and active ecosystem
            • Proven at scale -> dozens of petabyte+ implementations in production today
    • 6. Cloudera’s Distribution for Hadoop, Version 3
        The Industry’s Leading Hadoop Distribution
        Components shown: Hue, Hue SDK, Oozie, Hive, Pig, Flume, Sqoop, HBase, Zookeeper
        • Open source – 100% Apache licensed
        • Simplified – Component versions & dependencies managed for you
        • Integrated – All components & functions interoperate through standard APIs
        • Reliable – Patched with fixes from future releases to improve stability
        • Supported – Employs project founders and committers for >70% of components
    • 11. Overview
        • Proper planning
        • Data Ingestion
        • ETL and Data Processing Infrastructure
        • Authentication, Authorization, and Sharing
        • Monitoring
    • 12. The production data platform
        • Data storage
        • ETL / data processing / analysis infrastructure
        • Data ingestion infrastructure
        • Integration with tools
        • Data security and access control
        • Health and performance monitoring
    • 13. Proper planning
        • Know your use cases!
            • Log transformation, aggregation
            • Text mining, IR
            • Analytics
            • Machine learning
        • Critical to proper configuration
            • Hadoop
            • Network
            • OS
        • Resource utilization, deep job insight will tell you more
    • 14. HDFS Concerns
        • Name node availability
            • HA is tricky
            • Consider where Hadoop lives in the system
            • Manual recovery can be simple, fast, effective
        • Backup Strategy
            • Name node metadata – hourly, ~2 day retention
            • User data
                • Log shipping style strategies
                • DistCp
                • “Fan out” to multiple clusters on ingestion
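The "hourly, ~2 day retention" policy for name node metadata backups can be illustrated with a small pruning helper. This is a minimal sketch, not anything from the talk: the function name `backups_to_prune`, the timestamps, and the exact 48-hour cutoff are assumptions chosen for illustration, and the code only decides which snapshots are stale (it does not copy or delete any files).

```python
from datetime import datetime, timedelta


def backups_to_prune(backup_times, now, retention=timedelta(days=2)):
    """Return the backup timestamps that fall outside the retention window.

    backup_times: iterable of datetimes, one per metadata snapshot.
    The result is sorted oldest first.
    """
    cutoff = now - retention
    return sorted(t for t in backup_times if t < cutoff)


# Hypothetical example: one snapshot per hour for the last three days.
now = datetime(2010, 10, 12, 12, 0)
snapshots = [now - timedelta(hours=h) for h in range(72)]
stale = backups_to_prune(snapshots, now)
```

Keeping the retention decision separate from the copy and delete steps makes it easy to test the policy without touching the backup storage.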
    • 15. Data Ingestion
        • Many data sources
            • Streaming data sources (log files, mostly)
            • RDBMS
            • EDW
            • Files (usually exports from 3rd party)
        • Common place we see DIY
            • You probably shouldn’t
            • Sqoop, Flume, Oozie (but I’m biased)
        • No matter what - fault tolerant, performant, monitored
    • 16. ETL and Data Processing
        • Non-interactive jobs
            • Establish a common directory structure for processes
            • Need tools to handle complex chains of jobs
        • Workflow tools support
            • Job dependencies, error handling
            • Tracking
            • Invocation based on time or events
        • Most common mistake: depending on jobs always completing successfully or within a window of time.
            • Monitor for SLA rather than pray
            • Defensive coding practices apply just as they do everywhere else!
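Two of the points on this slide, job dependencies and monitoring for SLA, can be sketched in a few lines. The example below is only an illustration under stated assumptions: the job names, the `DEPENDENCIES` map, and the functions `run_order` and `sla_breaches` are hypothetical, and a real deployment would more likely rely on a workflow tool such as Oozie than on hand-written code. It uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from datetime import datetime
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical jobs; each job maps to the set of jobs it depends on.
DEPENDENCIES = {
    "ingest_logs": set(),
    "clean_logs": {"ingest_logs"},
    "sessionize": {"clean_logs"},
    "daily_report": {"sessionize", "clean_logs"},
}


def run_order(dependencies):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def sla_breaches(finish_times, deadlines, now):
    """Return jobs that finished late, or are unfinished past their deadline."""
    breaches = []
    for job, deadline in deadlines.items():
        finished = finish_times.get(job)
        if finished is None:
            if now > deadline:
                breaches.append(job)
        elif finished > deadline:
            breaches.append(job)
    return sorted(breaches)


order = run_order(DEPENDENCIES)

# Hypothetical SLA check at 07:00: the report was due at 06:00 and never ran.
deadlines = {
    "clean_logs": datetime(2010, 10, 12, 4, 0),
    "daily_report": datetime(2010, 10, 12, 6, 0),
}
finish_times = {"clean_logs": datetime(2010, 10, 12, 3, 30)}
late = sla_breaches(finish_times, deadlines, now=datetime(2010, 10, 12, 7, 0))
```

The SLA check treats a missing finish time as a breach once the deadline has passed, which is the case that "depending on jobs always completing" tends to miss.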
    • 17. Metadata Management
        • Tool independent metadata about…
            • Data sets we know about and their location (on HDFS)
            • Schemata
            • Authorization (currently HDFS permissions only)
            • Partitioning
            • Format and compression
            • Guarantees (consistency, timeliness, permits duplicates)
        • Currently still DIY in many ways, tool-dependent
        • Most people rely on prayer and hard coding
        • (H)OWL is interesting
    • 18. Authentication and authorization
        • Authentication
            • Don’t talk to strangers
            • Should integrate with existing IT infrastructure
            • Yahoo! security (Kerberos) patches now part of CDH3b3
        • Authorization
            • Not everyone can access everything
            • Ex. Production data sets are read-only to quants / analysts. Analysts have home or group directories for derived data sets.
            • Mostly enforced via HDFS permissions; directory structure and organization is critical
            • Not as fine grained as column level access in EDW, RDBMS
        • HUE as a gateway to the cluster
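The authorization example on this slide (production data read-only for analysts, writable home directories) maps onto the POSIX-style owner/group/other permission bits that HDFS uses. The sketch below models that check in plain Python so the layout can be reasoned about; it is a simplified illustration, not HDFS code. The `Entry` class, the `allowed` function, the paths, and the user and group names are all hypothetical, and the model ignores the superuser and the permissions of parent directories.

```python
from dataclasses import dataclass

READ, WRITE, EXECUTE = 4, 2, 1


@dataclass(frozen=True)
class Entry:
    """Owner, group, and octal mode of a directory (a simplified model)."""
    owner: str
    group: str
    mode: int


def allowed(entry, user, groups, want):
    """POSIX-style check: pick owner, group, or other bits, then test them."""
    if user == entry.owner:
        bits = (entry.mode >> 6) & 7
    elif entry.group in groups:
        bits = (entry.mode >> 3) & 7
    else:
        bits = entry.mode & 7
    return (bits & want) == want


# Hypothetical layout: production data is read-only to the analysts group,
# while each analyst has a private, writable home directory.
layout = {
    "/data/prod": Entry(owner="etl", group="analysts", mode=0o750),
    "/user/alice": Entry(owner="alice", group="alice", mode=0o700),
}
alice_groups = {"analysts"}
can_read_prod = allowed(layout["/data/prod"], "alice", alice_groups, READ)
can_write_prod = allowed(layout["/data/prod"], "alice", alice_groups, WRITE)
can_write_home = allowed(layout["/user/alice"], "alice", alice_groups, WRITE)
```

Because access is decided per directory, the directory structure itself carries the policy, which is why the slide calls its organization critical.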
    • 19. Resource Sharing
        • Prefer one large cluster to many small clusters (unless maybe you’re Facebook)
        • “Stop hogging the cluster!”
        • Cluster resources
            • Disk space (HDFS size quotas)
            • Number of files (HDFS file count quotas)
            • Simultaneous jobs
            • Tasks – guaranteed capacity, full utilization, SLA enforcement
        • Monitor and track resource utilization across all groups
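The two HDFS quota types named on this slide are set with the `dfsadmin` subcommands `-setQuota` (file and directory count) and `-setSpaceQuota` (bytes). The sketch below only builds the command lines so they can be reviewed before anyone runs them. The helper name `quota_commands`, the directory, and the limits are assumptions for illustration; note also that the space quota counts replicated bytes, so it should be sized with the replication factor in mind.

```python
def quota_commands(directory, max_names, max_bytes):
    """Build (but do not run) the commands that set both quotas on a directory.

    max_names: limit on the number of files and directories under the tree.
    max_bytes: limit on bytes consumed under the tree, replicas included.
    """
    return [
        ["hadoop", "dfsadmin", "-setQuota", str(max_names), directory],
        ["hadoop", "dfsadmin", "-setSpaceQuota", str(max_bytes), directory],
    ]


# Hypothetical limits for a group directory: 1 million names, 10 TiB raw.
commands = quota_commands("/user/analysts", 1_000_000, 10 * 1024**4)
for command in commands:
    print(" ".join(command))
```

Passing each list to `subprocess.run` from an administrative account would apply the quotas; printing them first keeps the step auditable.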
    • 20. Monitoring
        • Critical for keeping things running
        • Cluster health
            • Duh.
            • Traditional monitoring tools: Nagios, Hyperic, Zenoss
            • Host checks, service checks
            • When to alert? It’s tricky.
        • Cluster performance
            • Overall utilization in aggregate
            • 30,000ft view of utilization and performance; macro level
    • 21. Monitoring
        • Hadoop aware cluster monitoring
            • Traditional tools don’t cut it; Hadoop monitoring is inherently Hadoop specific
            • Analogous to RDBMS monitoring tools
        • Job level “monitoring”
            • More like analysis
            • “What resources does this job use?”
            • “How does this run compare to last run?”
            • “How can I make this run faster, more resource efficient?”
        • Two views we care about
            • Job perspective
            • Resource perspective (task slots, scheduler pool)
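The question “How does this run compare to last run?” can be approached by diffing per-job metrics between two runs. The following is a minimal sketch under stated assumptions: the metric names and values are invented for illustration, `compare_runs` is a hypothetical helper, and collecting the numbers from the cluster is left out entirely.

```python
def compare_runs(previous, current):
    """Return the percent change for each metric present in both runs.

    Metrics whose previous value is zero are skipped, since a percent
    change is undefined for them.
    """
    changes = {}
    for name in sorted(previous.keys() & current.keys()):
        before = previous[name]
        if before == 0:
            continue
        changes[name] = round(100.0 * (current[name] - before) / before, 1)
    return changes


# Hypothetical metrics for two runs of the same job.
last_run = {"elapsed_seconds": 600, "map_tasks": 200, "hdfs_bytes_read": 50_000}
this_run = {"elapsed_seconds": 750, "map_tasks": 200, "hdfs_bytes_read": 60_000}
changes = compare_runs(last_run, this_run)
```

Here the job took 25% longer while reading 20% more input, which suggests input growth rather than a regression in the job itself.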
    • 22. Wrapping it up
        • Hadoop proper is awesome, but is only part of the picture
        • Much of Professional Services time is filling in the blanks
        • There’s still a way to go
            • Metadata management
            • Operational tools and support
            • Improvements to Hadoop core to improve stability, security, manageability
        • Adoption and feedback drive progress
        • CDH provides the infrastructure for a complete system
    • 23. Cloudera Makes Hadoop Safe For the Enterprise
        • Software
        • Services
        • Training
    • 24. Cloudera Enterprise
        Enterprise Support and Management Tools
        • Increases reliability and consistency of the Hadoop platform
        • Improves Hadoop’s conformance to important IT policies and procedures
        • Lowers the cost of management and administration
    • 27. References / Resources
        • Cloudera documentation – http://docs.cloudera.com
        • Cloudera Groups – http://groups.cloudera.org
        • Cloudera JIRA – http://issues.cloudera.org
        • Hadoop: The Definitive Guide
        • esammer@cloudera.com
        • irc.freenode.net #cloudera, #hadoop
        • @esammer