Webinar: Productionizing Hadoop: Lessons Learned - 20101208
 


Key insights into installing, configuring, and running Hadoop and Cloudera's Distribution for Hadoop in production. These are lessons learned from Cloudera helping organizations move to a production state with Hadoop.

Presentation Transcript

  • Welcome to Production-izing Hadoop: Lessons Learned
    Audio/Telephone: +1 916 233 3087
    Access Code: 616-465-108
    Audio PIN: Shown after joining the Webinar
  • Housekeeping
      • Ask questions at any time using the questions panel
      • Problems? Use the chat panel
      • Book drawing - winner announced at the end
      • Slides and recording will be available
    Copyright 2010 Cloudera Inc. All rights reserved
  • Poll: What is your interest in Hadoop?
      • Just learning about it
      • I have a problem I think Hadoop can solve
      • Using Hadoop in our labs
      • Using Hadoop in production
  • Speaker: Eric Sammer
    Eric is a Solution Architect and Training Instructor for Cloudera. He has worked with dozens of customers in a variety of industries including Cloudera's largest Hadoop deployments. His experience ranges from clusters of a few nodes to clusters with hundreds of nodes with complex multi-tenant user environments.
    Prior to joining Cloudera, he held roles including System Architect, Director of Technical Operations, and Tech Lead at various New York City startups focusing on distributed data collection, processing, and reporting systems.
    Eric has over 12 years in development and technical operations and has contributed to various open source projects such as Gentoo Linux.
    twitter: @esammer, @cloudera
  • Starting Out (You) “Let’s build a Hadoop cluster!” http://www.iccs.inf.ed.ac.uk/~miles/code.html
  • Starting Out (You) http://www.iccs.inf.ed.ac.uk/~miles/code.html
  • Where you want to be (You): Yahoo! Hadoop Cluster (2007)
  • What is Hadoop?
      • A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
      • Core Hadoop has two main components
          • Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
          • MapReduce: fault-tolerant distributed processing
      • Key value
          • Flexible -> store data without a schema and add it later as needed
          • Affordable -> cost / TB at a fraction of traditional options
          • Broadly adopted -> a large and active ecosystem
          • Proven at scale -> dozens of petabyte+ implementations in production today
  • Cloudera’s Distribution for Hadoop, Version 3: The Industry’s Leading Hadoop Distribution
    Components shown in the diagram: Hue, Hue SDK, Oozie, Hive, Pig, Flume, Sqoop, HBase, ZooKeeper
      • Open source – 100% Apache licensed
      • Simplified – Component versions & dependencies managed for you
      • Integrated – All components & functions interoperate through standard APIs
      • Reliable – Patched with fixes from future releases to improve stability
      • Supported – Employs project founders and committers for >70% of components
  • Overview
      • Proper planning
      • Data Ingestion
      • ETL and Data Processing Infrastructure
      • Authentication, Authorization, and Sharing
      • Monitoring
  • The production data platform
      • Data storage
      • ETL / data processing / analysis infrastructure
      • Data ingestion infrastructure
      • Integration with tools
      • Data security and access control
      • Health and performance monitoring
  • Proper planning
      • Know your use cases!
          • Log transformation, aggregation
          • Text mining, IR
          • Analytics
          • Machine learning
      • Critical to proper configuration
          • Hadoop
          • Network
          • OS
      • Resource utilization, deep job insight will tell you more
  • HDFS Concerns
      • Name node availability
          • HA is tricky
          • Consider where Hadoop lives in the system
          • Manual recovery can be simple, fast, effective
      • Backup Strategy
          • Name node metadata – hourly, ~2 day retention
          • User data
              • Log shipping style strategies
              • DistCp
              • “Fan out” to multiple clusters on ingestion
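The "hourly, ~2 day retention" policy for name node metadata backups can be illustrated with a short pruning sketch. This is not from the talk: the function name `expired_backups` and the idea of identifying each backup by its timestamp are assumptions made for illustration, and how the backups are actually taken and stored is left out.

```python
from datetime import datetime, timedelta

# Assumption: each metadata backup is identified by the time it was taken.
RETENTION = timedelta(days=2)  # the "~2 day retention" mentioned on the slide


def expired_backups(backup_times, now, retention=RETENTION):
    """Return the backup timestamps outside the retention window, oldest first."""
    cutoff = now - retention
    return sorted(t for t in backup_times if t < cutoff)


# Three days of hourly backups, the newest taken at `now`.
now = datetime(2010, 12, 8, 12, 0)
hourly = [now - timedelta(hours=h) for h in range(72)]
expired = expired_backups(hourly, now)
print(f"{len(expired)} of {len(hourly)} backups are older than {RETENTION}")
```

A real job would delete (or archive) the returned backups only after confirming that a newer backup exists and is readable.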
  • Data Ingestion
      • Many data sources
          • Streaming data sources (log files, mostly)
          • RDBMS
          • EDW
          • Files (usually exports from 3rd party)
      • Common place we see DIY
          • You probably shouldn’t
          • Sqoop, Flume, Oozie (but I’m biased)
      • No matter what - fault tolerant, performant, monitored
  • ETL and Data Processing
      • Non-interactive jobs
      • Establish a common directory structure for processes
      • Need tools to handle complex chains of jobs
      • Workflow tools support
          • Job dependencies, error handling
          • Tracking
          • Invocation based on time or events
      • Most common mistake: depending on jobs always completing successfully or within a window of time.
          • Monitor for SLA rather than pray
          • Defensive coding practices apply just as they do everywhere else!
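"Monitor for SLA rather than pray" can be made concrete with a small status check. The sketch below is an illustration only: `sla_status` and its status labels are hypothetical names, and in practice the finish time and success flag would come from whatever workflow tool runs the jobs.

```python
from datetime import datetime
from typing import Optional


def sla_status(deadline: datetime, now: datetime,
               finished_at: Optional[datetime] = None,
               succeeded: bool = False) -> str:
    """Classify one job run against its SLA deadline."""
    if finished_at is None:
        # Not finished yet: only a problem once the deadline has passed.
        return "pending" if now <= deadline else "missed"
    if not succeeded:
        return "failed"
    return "ok" if finished_at <= deadline else "late"


deadline = datetime(2010, 12, 8, 6, 0)
# A job that has not finished an hour after its deadline.
print(sla_status(deadline, now=datetime(2010, 12, 8, 7, 0)))
```

Downstream jobs can then refuse to start unless every upstream run reports "ok", instead of assuming the inputs are there.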
  • Metadata Management
      • Tool independent metadata about…
          • Data sets we know about and their location (on HDFS)
          • Schemata
          • Authorization (currently HDFS permissions only)
          • Partitioning
          • Format and compression
          • Guarantees (consistency, timeliness, permits duplicates)
      • Currently still DIY in many ways, tool-dependent
      • Most people rely on prayer and hard coding
      • (H)OWL is interesting
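As a rough picture of what a tool-independent metadata record might hold, here is a minimal sketch built from the fields listed on the slide. Everything in it is hypothetical: the `DatasetMetadata` class, the `register` helper, the dataset name, and the HDFS path are invented for illustration and do not describe any existing tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DatasetMetadata:
    name: str
    location: str              # HDFS path (hypothetical example below)
    schema: Dict[str, str]     # column name -> type
    partitioned_by: List[str]
    file_format: str
    compression: str
    permits_duplicates: bool = False


registry: Dict[str, DatasetMetadata] = {}


def register(dataset: DatasetMetadata) -> None:
    """Add a data set to the registry, refusing silent overwrites."""
    if dataset.name in registry:
        raise ValueError(f"data set already registered: {dataset.name}")
    registry[dataset.name] = dataset


web_logs = DatasetMetadata(
    name="web_logs",
    location="/data/prod/web_logs",
    schema={"ts": "timestamp", "url": "string", "status": "int"},
    partitioned_by=["date"],
    file_format="sequencefile",
    compression="lzo",
    permits_duplicates=True,
)
register(web_logs)
print(registry["web_logs"].location)
```

Jobs that look a data set up by name, rather than hard coding its path and format, keep working when the location or layout changes.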
  • Authentication and authorization
      • Authentication
          • Don’t talk to strangers
          • Should integrate with existing IT infrastructure
          • Yahoo! security (Kerberos) patches now part of CDH3b3
      • Authorization
          • Not everyone can access everything
          • Ex. Production data sets are read-only to quants / analysts. Analysts have home or group directories for derived data sets.
          • Mostly enforced via HDFS permissions; directory structure and organization is critical
          • Not as fine grained as column level access in EDW, RDBMS
      • HUE as a gateway to the cluster
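The read-only production data example can be modeled with the POSIX-style owner/group/mode check that HDFS permissions follow. This is a simplified sketch, not HDFS code: the `allowed` function, the user and group names, and the modes are assumptions, and it ignores the superuser and the permissions of parent directories.

```python
def allowed(user, user_groups, owner, group, mode, want):
    """Check one POSIX-style permission bit; `want` is 'r', 'w', or 'x'."""
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:
        shift = 6          # owner bits
    elif group in user_groups:
        shift = 3          # group bits
    else:
        shift = 0          # everyone else
    return bool((mode >> shift) & bit)


# Hypothetical layout: production data owned by the ETL user, readable by analysts.
prod = {"owner": "etl", "group": "analysts", "mode": 0o750}
# Hypothetical home directory for derived data sets.
home = {"owner": "alice", "group": "analysts", "mode": 0o700}

print(allowed("alice", {"analysts"}, want="r", **prod))   # analyst reading production data
print(allowed("alice", {"analysts"}, want="w", **prod))   # analyst writing production data
print(allowed("alice", {"analysts"}, want="w", **home))   # analyst writing to own home
```

Because the check depends only on owner, group, and mode, a consistent directory layout is what keeps this kind of policy manageable.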
  • Resource Sharing
      • Prefer one large cluster to many small clusters (unless maybe you’re Facebook)
      • “Stop hogging the cluster!”
      • Cluster resources
          • Disk space (HDFS size quotas)
          • Number of files (HDFS file count quotas)
          • Simultaneous jobs
          • Tasks – guaranteed capacity, full utilization, SLA enforcement
      • Monitor and track resource utilization across all groups
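Tracking utilization across groups might look like the sketch below, which flags any group approaching either its size quota or its file count quota. The `GroupUsage` record, the 80% threshold, and the sample numbers are assumptions for illustration; collecting the real figures from the cluster is out of scope here.

```python
from dataclasses import dataclass


@dataclass
class GroupUsage:
    group: str
    bytes_used: int
    bytes_quota: int
    files_used: int
    files_quota: int


def over_threshold(usages, threshold=0.8):
    """Return (group, utilization) pairs for groups at or above the threshold."""
    flagged = []
    for u in usages:
        space = u.bytes_used / u.bytes_quota
        files = u.files_used / u.files_quota
        worst = max(space, files)
        if worst >= threshold:
            flagged.append((u.group, round(worst, 2)))
    return flagged


usages = [
    GroupUsage("analysts", bytes_used=900, bytes_quota=1000, files_used=10, files_quota=100),
    GroupUsage("etl", bytes_used=100, bytes_quota=1000, files_used=50, files_quota=100),
]
print(over_threshold(usages))
```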
  • Monitoring
      • Critical for keeping things running
      • Cluster health
          • Duh.
          • Traditional monitoring tools: Nagios, Hyperic, Zenoss
          • Host checks, service checks
          • When to alert? It’s tricky.
      • Cluster performance
          • Overall utilization in aggregate
          • 30,000ft view of utilization and performance; macro level
  • Monitoring
      • Hadoop aware cluster monitoring
          • Traditional tools don’t cut it; Hadoop monitoring is inherently Hadoop specific
          • Analogous to RDBMS monitoring tools
      • Job level “monitoring”
          • More like analysis
          • “What resources does this job use?”
          • “How does this run compare to last run?”
          • “How can I make this run faster, more resource efficient?”
          • Two views we care about
              • Job perspective
              • Resource perspective (task slots, scheduler pool)
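The question "How does this run compare to last run?" reduces to comparing per-job metrics between two runs. A minimal sketch follows; the `compare_runs` function and the metric names are hypothetical, and the numbers are made up rather than taken from a real job.

```python
def compare_runs(previous, current):
    """Return the relative change for each metric present in both runs."""
    changes = {}
    for metric in sorted(previous.keys() & current.keys()):
        before = previous[metric]
        if before == 0:
            continue  # relative change is undefined for a zero baseline
        changes[metric] = (current[metric] - before) / before
    return changes


# Hypothetical metrics for two runs of the same job.
last_run = {"runtime_s": 1200, "map_tasks": 400, "hdfs_bytes_read": 50_000_000_000}
this_run = {"runtime_s": 1500, "map_tasks": 400, "hdfs_bytes_read": 55_000_000_000}

for metric, change in compare_runs(last_run, this_run).items():
    print(f"{metric}: {change:+.0%}")
```

A runtime that grows much faster than the input size is the kind of signal that prompts a closer look from the job perspective or the resource perspective.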
  • Wrapping it up
      • Hadoop proper is awesome, but is only part of the picture
      • Much of Professional Services time is filling in the blanks
      • There’s still a way to go
          • Metadata management
          • Operational tools and support
          • Improvements to Hadoop core to improve stability, security, manageability
      • Adoption and feedback drive progress
      • CDH provides the infrastructure for a complete system
  • Cloudera Makes Hadoop Safe For the Enterprise
      • Software
      • Services
      • Training
  • Cloudera Enterprise: Enterprise Support and Management Tools
      • Increases reliability and consistency of the Hadoop platform
      • Improves Hadoop’s conformance to important IT policies and procedures
      • Lowers the cost of management and administration
  • References / Resources
      • Cloudera documentation - http://docs.cloudera.com
      • Cloudera Groups – http://groups.cloudera.org
      • Cloudera JIRA – http://issues.cloudera.org
      • Hadoop: The Definitive Guide
      • esammer@cloudera.com
      • irc.freenode.net #cloudera, #hadoop
      • @esammer
  • Poll: What other topics would you be most interested in hearing about?
      • More case studies of enterprises using Hadoop
      • Technical "How to" sessions
      • Industry specific applications of Hadoop
      • Technical overviews of Hadoop and related components
  • Winner of the drawing is…
  • Q&A
    Learn about upcoming events: www.cloudera.com/events
    DBTA Webinar: Thursday, December 9th, 11am PT / 1pm ET
    New Solutions for the Data Intensive Enterprise
    Register at www.cloudera.com/events
    Thank you for attending.