Productionizing Hadoop: Lessons Learned
Eric Sammer, Solution Architect, Cloudera
email: esammer@cloudera.com
twitter: @esammer, @cloudera
Starting Out
Copyright 2010 Cloudera Inc. All rights reserved
[Image: a lone developer, labeled '(You)', announcing 'Let's build a Hadoop cluster!' (photo: http://www.iccs.inf.ed.ac.uk/~miles/code.html)]

Where you want to be
[Image: Yahoo! Hadoop Cluster (2007)]
What is Hadoop?
A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license).
- Core Hadoop has two main components
  - Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
  - MapReduce: fault-tolerant distributed processing
- Key value
  - Flexible -> store data without a schema and add it later as needed
  - Affordable -> cost/TB at a fraction of traditional options
  - Broadly adopted -> a large and active ecosystem
  - Proven at scale -> dozens of petabyte+ implementations in production today
Cloudera's Distribution for Hadoop, Version 3
The industry's leading Hadoop distribution: Hue, Hue SDK, Oozie, Hive, Pig, Flume, Sqoop, HBase, ZooKeeper.
- Open source – 100% Apache licensed
- Simplified – component versions and dependencies managed for you
- Integrated – all components and functions interoperate through standard APIs
- Reliable – patched with fixes from future releases to improve stability
- Supported – employs project founders and committers for >70% of components
Overview
- Proper planning
- Data ingestion
- ETL and data processing infrastructure
- Authentication, authorization, and sharing
- Monitoring
The production data platform
- Data storage
- ETL / data processing / analysis infrastructure
- Data ingestion infrastructure
- Integration with tools
- Data security and access control
- Health and performance monitoring
Proper planning
- Know your use cases!
  - Log transformation, aggregation
  - Text mining, IR
  - Analytics
  - Machine learning
- Critical to proper configuration
  - Hadoop
  - Network
  - OS
- Resource utilization and deep job insight will tell you more
HDFS Concerns
- NameNode availability
  - HA is tricky
  - Consider where Hadoop lives in the system
  - Manual recovery can be simple, fast, effective
- Backup strategy
  - NameNode metadata – hourly, ~2 day retention
  - User data
    - Log-shipping-style strategies
    - DistCp
    - "Fan out" to multiple clusters on ingestion
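The DistCp option above can be sketched as a nightly cluster-to-cluster copy. The NameNode hostnames (`nn-prod`, `nn-backup`) and paths here are hypothetical; the script only echoes the command so it runs without a cluster.

```shell
#!/bin/sh
# Sketch of a nightly HDFS backup via DistCp between two clusters.
# Hostnames and paths are illustrative, not from the talk.

SRC="hdfs://nn-prod:8020/data/events"
DST="hdfs://nn-backup:8020/backups/events/$(date +%Y-%m-%d)"

# -update copies only files that differ; drop it for a full copy.
CMD="hadoop distcp -update $SRC $DST"

# Echoed here so the sketch is runnable anywhere; on a real cluster,
# run the command itself (typically from cron or a workflow engine).
echo "$CMD"
```

DistCp runs as a MapReduce job, so the copy itself is parallel and restartable, which is why it beats ad hoc `hadoop fs -cp` loops for bulk backup.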
Data Ingestion
- Many data sources
  - Streaming data sources (log files, mostly)
  - RDBMS
  - EDW
  - Files (usually exports from 3rd parties)
- A common place we see DIY
  - You probably shouldn't
  - Sqoop, Flume, Oozie (but I'm biased)
- No matter what: fault tolerant, performant, monitored
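For the RDBMS case, a Sqoop import replaces a DIY JDBC exporter. A minimal sketch, with a hypothetical database host, table, and user (echoed rather than executed so it runs without Sqoop installed):

```shell
#!/bin/sh
# Sketch of pulling an RDBMS table into HDFS with Sqoop 1.
# Connection string, credentials, and table name are hypothetical.

CMD="sqoop import \
  --connect jdbc:mysql://db1.example.com/sales \
  --username etl \
  --table orders \
  --target-dir /data/raw/orders/$(date +%Y-%m-%d) \
  --num-mappers 4"

echo "$CMD"
```

Sqoop parallelizes the extract across mappers (`--num-mappers`) and writes straight to HDFS, and the date-stamped target directory keeps each day's load separate and idempotent.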
ETL and Data ProcessingNon-interactive jobsEstablish a common directory structure for processesNeed tools to handle complex chains of jobsWorkflow tools supportJob dependencies, error handlingTrackingInvocation based on time or eventsMost common mistake: depending on jobs always completing successfully or within a window of time.Monitor for SLA rather than prayDefensive coding practices apply just as they do everywhere else!12Copyright 2010 Cloudera Inc. All rights reserved
Metadata Management
- Tool-independent metadata about…
  - Data sets we know about and their location (on HDFS)
  - Schemata
  - Authorization (currently HDFS permissions only)
  - Partitioning
  - Format and compression
  - Guarantees (consistency, timeliness, permits duplicates)
- Currently still DIY in many ways, tool-dependent
- Most people rely on prayer and hard coding
- (H)OWL is interesting
Authentication and authorization
- Authentication
  - Don't talk to strangers
  - Should integrate with existing IT infrastructure
  - Yahoo! security (Kerberos) patches now part of CDH3b3
- Authorization
  - Not everyone can access everything
  - Example: production data sets are read-only to quants / analysts; analysts have home or group directories for derived data sets
  - Mostly enforced via HDFS permissions; directory structure and organization is critical
  - Not as fine-grained as column-level access in an EDW or RDBMS
  - HUE as a gateway to the cluster
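HDFS permissions follow the familiar POSIX owner/group/other model, so the read-only-production, private-home layout above can be sketched with ordinary mode bits on local directories (runnable anywhere); on a real cluster the same modes would be applied with `hadoop fs -chmod` and `hadoop fs -chown`. Directory names are hypothetical.

```shell
#!/bin/sh
# Sketch of the permission layout from the slide, on the local
# filesystem as a stand-in for HDFS paths.

BASE="${TMPDIR:-/tmp}/hdfs-demo"
mkdir -p "$BASE/data/production" "$BASE/user/analyst1"

# Production data: writable only by the ETL owner, readable by everyone.
chmod 755 "$BASE/data/production"
# Analyst home directory: private to its owner.
chmod 700 "$BASE/user/analyst1"

ls -ld "$BASE/data/production" "$BASE/user/analyst1"
```

Because enforcement is directory-based, getting the tree layout right up front matters more than any individual chmod later.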
Resource Sharing
- Prefer one large cluster to many small clusters (unless maybe you're Facebook)
- "Stop hogging the cluster!"
- Cluster resources
  - Disk space (HDFS size quotas)
  - Number of files (HDFS file count quotas)
  - Simultaneous jobs
  - Tasks – guaranteed capacity, full utilization, SLA enforcement
- Monitor and track resource utilization across all groups
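The two HDFS quota knobs above are set with `dfsadmin`. A sketch with a hypothetical directory and limits, echoed rather than executed so it runs without a cluster (on a real cluster, run the commands as the HDFS superuser; older releases may require the space quota in bytes rather than with a size suffix):

```shell
#!/bin/sh
# Sketch of HDFS space and file-count quotas for a shared group area.

DIR="/user/analytics"

# Cap total raw disk usage (this counts replication) at 10 TB.
SPACE_CMD="hadoop dfsadmin -setSpaceQuota 10t $DIR"
# Cap the number of files and directories at 1,000,000.
COUNT_CMD="hadoop dfsadmin -setQuota 1000000 $DIR"
# Inspect current usage against both quotas.
CHECK_CMD="hadoop fs -count -q $DIR"

printf '%s\n' "$SPACE_CMD" "$COUNT_CMD" "$CHECK_CMD"
```

The file-count quota matters as much as the space quota: the NameNode holds all metadata in memory, so a group producing millions of tiny files can hurt the whole cluster without using much disk.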
Monitoring
- Critical for keeping things running
- Cluster health
  - Duh.
  - Traditional monitoring tools: Nagios, Hyperic, Zenoss
  - Host checks, service checks
  - When to alert? It's tricky.
- Cluster performance
  - Overall utilization in aggregate
  - 30,000 ft view of utilization and performance; macro level
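A Nagios-style service check for HDFS might parse `hadoop dfsadmin -report` for dead DataNodes. This is a sketch under assumptions: the report format shown in the here-doc matches Hadoop 0.20-era output, and sample text is used so the script runs without a cluster (point `REPORT` at the real command in production).

```shell
#!/bin/sh
# Sketch of a Nagios-style check: CRITICAL if any DataNodes are dead.
# REPORT is fed from sample text here; in production, use:
#   REPORT=$(hadoop dfsadmin -report)

REPORT=$(cat <<'EOF'
Configured Capacity: 1099511627776 (1 TB)
Datanodes available: 19 (20 total, 1 dead)
EOF
)

# Extract the dead-node count from the "(N total, M dead)" summary line.
DEAD=$(printf '%s\n' "$REPORT" | sed -n 's/.*(\([0-9]*\) total, \([0-9]*\) dead).*/\2/p')

if [ "${DEAD:-0}" -gt 0 ]; then
  STATUS="CRITICAL: $DEAD dead datanode(s)"
else
  STATUS="OK: all datanodes live"
fi

echo "$STATUS"
```

A real plugin would also exit with the Nagios status code (0 for OK, 2 for CRITICAL) so the scheduler can alert on it.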
Monitoring (continued)
- Hadoop-aware cluster monitoring
  - Traditional tools don't cut it; Hadoop monitoring is inherently Hadoop-specific
  - Analogous to RDBMS monitoring tools
- Job-level "monitoring"
  - More like analysis
  - "What resources does this job use?"
  - "How does this run compare to the last run?"
  - "How can I make this run faster, more resource efficient?"
- Two views we care about
  - Job perspective
  - Resource perspective (task slots, scheduler pool)
Wrapping it up
- Hadoop proper is awesome, but it's only part of the picture
- Much of Professional Services' time is spent filling in the blanks
- There's still a way to go
  - Metadata management
  - Operational tools and support
  - Improvements to Hadoop core for stability, security, manageability
- Adoption and feedback drive progress
- CDH provides the infrastructure for a complete system

Hadoop World 2010: Productionizing Hadoop: Lessons Learned


Editor's Notes

  • #3 Many small and midsize companies – especially those for whom technology is not their primary product or concern – start out the same. Someone, probably you, decides they need to build a Hadoop cluster.
  • #4 …and so you do. And it looks like this. Not a bad thing, but not what you can reasonably go to production with. So how do you get from this…
  • #5 …to this?