Productionizing Hadoop: Lessons Learned
Eric Sammer, Solution Architect, Cloudera
email: esammer@cloudera.com
twitter: @esammer, @cloudera
Starting Out
Copyright 2010 Cloudera Inc. All rights reserved
[Image: a lone developer, labeled '(You)', announcing 'Let's build a Hadoop cluster!' (photo: http://www.iccs.inf.ed.ac.uk/~miles/code.html)]

Where you want to be
[Image: Yahoo! Hadoop Cluster (2007)]
What is Hadoop?
A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license).
- Core Hadoop has two main components
  - Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
  - MapReduce: fault-tolerant distributed processing
- Key value
  - Flexible -> store data without a schema and add it later as needed
  - Affordable -> cost/TB at a fraction of traditional options
  - Broadly adopted -> a large and active ecosystem
  - Proven at scale -> dozens of petabyte+ implementations in production today
Cloudera's Distribution for Hadoop, Version 3
The industry's leading Hadoop distribution: Hue, Hue SDK, Oozie, Hive, Pig, Flume, Sqoop, HBase, ZooKeeper.
- Open source – 100% Apache licensed
- Simplified – component versions and dependencies managed for you
- Integrated – all components and functions interoperate through standard APIs
- Reliable – patched with fixes from future releases to improve stability
- Supported – employs project founders and committers for >70% of components
Overview
- Proper planning
- Data ingestion
- ETL and data processing infrastructure
- Authentication, authorization, and sharing
- Monitoring
The production data platform
- Data storage
- ETL / data processing / analysis infrastructure
- Data ingestion infrastructure
- Integration with tools
- Data security and access control
- Health and performance monitoring
Proper planning
- Know your use cases!
  - Log transformation, aggregation
  - Text mining, IR
  - Analytics
  - Machine learning
- Critical to proper configuration
  - Hadoop
  - Network
  - OS
- Resource utilization and deep job insight will tell you more
HDFS Concerns
- NameNode availability
  - HA is tricky
  - Consider where Hadoop lives in the system
  - Manual recovery can be simple, fast, effective
- Backup strategy
  - NameNode metadata – hourly, ~2 day retention
  - User data
    - Log-shipping-style strategies
    - DistCp
    - "Fan out" to multiple clusters on ingestion
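The DistCp option above can be sketched as a nightly cluster-to-cluster copy. The NameNode hostnames (`nn-prod`, `nn-backup`) and paths here are hypothetical; the script only echoes the command so it runs without a cluster.

```shell
#!/bin/sh
# Sketch of a nightly HDFS backup via DistCp between two clusters.
# Hostnames and paths are illustrative, not from the talk.

SRC="hdfs://nn-prod:8020/data/events"
DST="hdfs://nn-backup:8020/backups/events/$(date +%Y-%m-%d)"

# -update copies only files that differ; drop it for a full copy.
CMD="hadoop distcp -update $SRC $DST"

# Echoed here so the sketch is runnable anywhere; on a real cluster,
# run the command itself (typically from cron or a workflow engine).
echo "$CMD"
```

DistCp runs as a MapReduce job, so the copy itself is parallel and restartable, which is why it beats ad hoc `hadoop fs -cp` loops for bulk backup.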
Data Ingestion
- Many data sources
  - Streaming data sources (log files, mostly)
  - RDBMS
  - EDW
  - Files (usually exports from 3rd parties)
- A common place we see DIY
  - You probably shouldn't
  - Sqoop, Flume, Oozie (but I'm biased)
- No matter what: fault tolerant, performant, monitored
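For the RDBMS case, a Sqoop import replaces a DIY JDBC exporter. A minimal sketch, with a hypothetical database host, table, and user (echoed rather than executed so it runs without Sqoop installed):

```shell
#!/bin/sh
# Sketch of pulling an RDBMS table into HDFS with Sqoop 1.
# Connection string, credentials, and table name are hypothetical.

CMD="sqoop import \
  --connect jdbc:mysql://db1.example.com/sales \
  --username etl \
  --table orders \
  --target-dir /data/raw/orders/$(date +%Y-%m-%d) \
  --num-mappers 4"

echo "$CMD"
```

Sqoop parallelizes the extract across mappers (`--num-mappers`) and writes straight to HDFS, and the date-stamped target directory keeps each day's load separate and idempotent.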
ETL and Data ProcessingNon-interactive jobsEstablish a common directory structure for processesNeed tools to handle complex chains of jobsWorkflow tools supportJob dependencies, error handlingTrackingInvocation based on time or eventsMost common mistake: depending on jobs always completing successfully or within a window of time.Monitor for SLA rather than prayDefensive coding practices apply just as they do everywhere else!12Copyright 2010 Cloudera Inc. All rights reserved
Metadata Management
- Tool-independent metadata about…
  - Data sets we know about and their location (on HDFS)
  - Schemata
  - Authorization (currently HDFS permissions only)
  - Partitioning
  - Format and compression
  - Guarantees (consistency, timeliness, permits duplicates)
- Currently still DIY in many ways, tool-dependent
- Most people rely on prayer and hard coding
- (H)OWL is interesting
Authentication and authorization
- Authentication
  - Don't talk to strangers
  - Should integrate with existing IT infrastructure
  - Yahoo! security (Kerberos) patches now part of CDH3b3
- Authorization
  - Not everyone can access everything
  - Example: production data sets are read-only to quants / analysts; analysts have home or group directories for derived data sets
  - Mostly enforced via HDFS permissions; directory structure and organization is critical
  - Not as fine-grained as column-level access in an EDW or RDBMS
  - HUE as a gateway to the cluster
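HDFS permissions follow the familiar POSIX owner/group/other model, so the read-only-production, private-home layout above can be sketched with ordinary mode bits on local directories (runnable anywhere); on a real cluster the same modes would be applied with `hadoop fs -chmod` and `hadoop fs -chown`. Directory names are hypothetical.

```shell
#!/bin/sh
# Sketch of the permission layout from the slide, on the local
# filesystem as a stand-in for HDFS paths.

BASE="${TMPDIR:-/tmp}/hdfs-demo"
mkdir -p "$BASE/data/production" "$BASE/user/analyst1"

# Production data: writable only by the ETL owner, readable by everyone.
chmod 755 "$BASE/data/production"
# Analyst home directory: private to its owner.
chmod 700 "$BASE/user/analyst1"

ls -ld "$BASE/data/production" "$BASE/user/analyst1"
```

Because enforcement is directory-based, getting the tree layout right up front matters more than any individual chmod later.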
Resource Sharing
- Prefer one large cluster to many small clusters (unless maybe you're Facebook)
- "Stop hogging the cluster!"
- Cluster resources
  - Disk space (HDFS size quotas)
  - Number of files (HDFS file count quotas)
  - Simultaneous jobs
  - Tasks – guaranteed capacity, full utilization, SLA enforcement
- Monitor and track resource utilization across all groups
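The two HDFS quota knobs above are set with `dfsadmin`. A sketch with a hypothetical directory and limits, echoed rather than executed so it runs without a cluster (on a real cluster, run the commands as the HDFS superuser; older releases may require the space quota in bytes rather than with a size suffix):

```shell
#!/bin/sh
# Sketch of HDFS space and file-count quotas for a shared group area.

DIR="/user/analytics"

# Cap total raw disk usage (this counts replication) at 10 TB.
SPACE_CMD="hadoop dfsadmin -setSpaceQuota 10t $DIR"
# Cap the number of files and directories at 1,000,000.
COUNT_CMD="hadoop dfsadmin -setQuota 1000000 $DIR"
# Inspect current usage against both quotas.
CHECK_CMD="hadoop fs -count -q $DIR"

printf '%s\n' "$SPACE_CMD" "$COUNT_CMD" "$CHECK_CMD"
```

The file-count quota matters as much as the space quota: the NameNode holds all metadata in memory, so a group producing millions of tiny files can hurt the whole cluster without using much disk.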
Monitoring
- Critical for keeping things running
- Cluster health
  - Duh.
  - Traditional monitoring tools: Nagios, Hyperic, Zenoss
  - Host checks, service checks
  - When to alert? It's tricky.
- Cluster performance
  - Overall utilization in aggregate
  - 30,000 ft view of utilization and performance; macro level
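A Nagios-style service check for HDFS might parse `hadoop dfsadmin -report` for dead DataNodes. This is a sketch under assumptions: the report format shown in the here-doc matches Hadoop 0.20-era output, and sample text is used so the script runs without a cluster (point `REPORT` at the real command in production).

```shell
#!/bin/sh
# Sketch of a Nagios-style check: CRITICAL if any DataNodes are dead.
# REPORT is fed from sample text here; in production, use:
#   REPORT=$(hadoop dfsadmin -report)

REPORT=$(cat <<'EOF'
Configured Capacity: 1099511627776 (1 TB)
Datanodes available: 19 (20 total, 1 dead)
EOF
)

# Extract the dead-node count from the "(N total, M dead)" summary line.
DEAD=$(printf '%s\n' "$REPORT" | sed -n 's/.*(\([0-9]*\) total, \([0-9]*\) dead).*/\2/p')

if [ "${DEAD:-0}" -gt 0 ]; then
  STATUS="CRITICAL: $DEAD dead datanode(s)"
else
  STATUS="OK: all datanodes live"
fi

echo "$STATUS"
```

A real plugin would also exit with the Nagios status code (0 for OK, 2 for CRITICAL) so the scheduler can alert on it.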
Monitoring (continued)
- Hadoop-aware cluster monitoring
  - Traditional tools don't cut it; Hadoop monitoring is inherently Hadoop-specific
  - Analogous to RDBMS monitoring tools
- Job-level "monitoring"
  - More like analysis
  - "What resources does this job use?"
  - "How does this run compare to the last run?"
  - "How can I make this run faster, more resource efficient?"
- Two views we care about
  - Job perspective
  - Resource perspective (task slots, scheduler pool)
Wrapping it up
- Hadoop proper is awesome, but it's only part of the picture
- Much of Professional Services' time is spent filling in the blanks
- There's still a way to go
  - Metadata management
  - Operational tools and support
  - Improvements to Hadoop core for stability, security, manageability
- Adoption and feedback drive progress
- CDH provides the infrastructure for a complete system

Hadoop World 2010: Productionizing Hadoop: Lessons Learned


Editor's Notes

  • #3 Many small and midsize companies – especially those for whom technology is not their primary product or concern – start out the same. Someone, probably you, decides they need to build a Hadoop cluster.
  • #4 …and so you do. And it looks like this. Not a bad thing, but not what you can reasonably go to production with. So how do you get from this…
  • #5 …to this?