Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010

Todd gives his perspective on Apache Hadoop

Presentation Transcript

  • Apache Hadoop: an introduction. Todd Lipcon, todd@cloudera.com, @tlipcon, @cloudera. Thursday, May 27, 2010
  • Hi there! Software Engineer at Cloudera. Hadoop contributor, HBase committer. Previously: systems programming, operations, large scale data analysis. I love data and data systems.
  • Outline: Why should you care? (Intro); What is Hadoop?; The MapReduce Model; HDFS, Hadoop Map/Reduce; The Hadoop Ecosystem; Questions
  • Data is everywhere. Data is important.
  • “I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.” Hal Varian (Google’s chief economist)
  • Are you throwing away data? Data comes in many shapes and sizes: relational tuples, log files, semistructured textual data (e.g., e-mail), and so on. Are you throwing it away because it doesn’t ‘fit’?
  • So, what’s Hadoop? (Image: The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry)
  • Apache Hadoop is an open-source system to reliably store and process gobs of information across many commodity computers. (Image: The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry)
  • Two Core Components. HDFS: self-healing, high-bandwidth clustered storage. Map/Reduce: fault-tolerant distributed computing.
  • What makes Hadoop special?
  • Assumption 1: Machines can be reliable... (Image: MadMan the Mighty, CC BY-NC-SA)
  • Hadoop separates distributed system fault-tolerance code from application logic. (Diagram labels: Unicorns, Systems Programmers, Statisticians)
  • Assumption 2: Machines have identities... (Image: Laughing Squid, CC BY-NC-SA)
  • Hadoop lets you interact with a cluster, not a bunch of machines. (Image: Yahoo! Hadoop cluster, OSCON ’07)
  • Assumption 3: Your analysis fits on one machine (Image: Matthew J. Stinson, CC BY-NC)
  • Hadoop scales linearly with data size or analysis complexity. Data-parallel or compute-parallel. For example: extensive machine learning on <100GB of image data; simple SQL-style queries on >100TB of clickstream data. One Hadoop works for both applications!
  • A Typical Look... 5-4000 commodity servers (8-core, 8-24GB RAM, 4-12 TB, gig-E); 2-level network architecture; 20-40 nodes per rack.
  • STOP! REAL METAL? Isn’t this some kind of “Cloud Computing” conference? Hadoop runs as a cloud (a cluster) and maybe in a cloud (e.g., EC2). (Image: Josh Hough, CC BY-NC-SA)
  • Hadoop sounds like magic. How is it possible?
  • dramatis personae. Starring... NameNode (metadata server and database), SecondaryNameNode (assistant to NameNode), JobTracker (scheduler). The Chorus… DataNodes (block storage), TaskTrackers (task execution). Thanks to Zak Stone for earmuff image!
  • HDFS (diagram): Namenode (fs metadata); Datanodes spread over One Rack and A Different Rack, holding the blocks of a 3x64MB file (3 rep), a 4x64MB file (3 rep), and a small file (7 rep).
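The labels in the HDFS diagram come down to simple arithmetic: a file is cut into fixed-size blocks, and every block is stored "rep" times on different datanodes. Below is a minimal sketch of that arithmetic, assuming the 64MB block size shown on the slide. `BlockMath` and its methods are hypothetical names for illustration, not part of the HDFS API.

```java
// Hypothetical helper (not HDFS API): the arithmetic behind "3x64MB file, 3 rep".
class BlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks a file occupies: ceiling of fileBytes / blockBytes.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Number of block replicas the datanodes hold for the file.
    static long replicaCount(long fileBytes, long blockBytes, int replication) {
        return blockCount(fileBytes, blockBytes) * replication;
    }

    // Raw disk consumed across the cluster: every byte is stored `replication` times.
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long block = 64 * MB; // block size taken from the slide
        System.out.println(replicaCount(3 * block, block, 3)); // 3 blocks x 3 replicas = 9
        System.out.println(replicaCount(4 * block, block, 3)); // 4 blocks x 3 replicas = 12
        System.out.println(replicaCount(1 * MB, block, 7));    // small file: 1 block x 7 replicas = 7
        System.out.println(rawBytes(3 * block, 3) / MB);       // 576 (MB of raw disk)
    }
}
```

The small file illustrates that replication is a per-file setting: one block, but seven copies of it.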
  • HDFS Write Path
  • HDFS Failures? Datanode crash? Clients read another copy; background rebalance/rereplicate. Namenode crash? Uh-oh, but it is not responsible for the majority of downtime!
  • The M/R Programming Model
  • You specify map() and reduce() functions. The framework does the rest.
  • fault-tolerance (that’s what’s important) (and that’s why Hadoop)
  • map()
    map: K₁,V₁ → list(K₂,V₂)

        public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
          /**
           * Called once for each key/value pair in the input split. Most applications
           * should override this, but the default is the identity function.
           */
          protected void map(KEYIN key, VALUEIN value, Context context)
              throws IOException, InterruptedException {
            // context.write() can be called many times
            // this is default "identity mapper" implementation
            context.write((KEYOUT) key, (VALUEOUT) value);
          }
        }
  • (the shuffle) Map output is assigned to a “reducer”. Map output is sorted by key.
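The two shuffle steps, assigning each map output record to a reducer and sorting by key, can be sketched in a single process. The hash-and-modulo rule below follows the same idea as Hadoop's default hash partitioner. Everything else (`ShuffleSketch`, the in-memory lists and maps) is a hypothetical stand-in for illustration, not Hadoop code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the shuffle; not Hadoop code.
class ShuffleSketch {
    // Assign a key to one of numReducers partitions.
    // The bit mask clears the sign bit so the result is never negative.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Route each (key, value) pair to its reducer, grouping values per key.
    // TreeMap keeps each reducer's keys in sorted order.
    static List<TreeMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> kv : mapOutput) {
            partitions.get(partition(kv.getKey(), numReducers))
                      .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("dunster", 1), Map.entry("adams", 1), Map.entry("dunster", 1));
        // One TreeMap per reducer; which reducer gets which key depends on the hash.
        System.out.println(shuffle(mapOutput, 2));
    }
}
```

Because the partition depends only on the key, every value for a given key lands at the same reducer, which is what lets reduce() see all of them together.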
  • reduce()
    reduce: K₂, iter(V₂) → list(K₃,V₃)

        public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
          /**
           * This method is called once for each key. Most applications will define
           * their reduce class by overriding this method. The default implementation
           * is an identity function.
           */
          @SuppressWarnings("unchecked")
          protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
              throws IOException, InterruptedException {
            for (VALUEIN value : values) {
              context.write((KEYOUT) key, (VALUEOUT) value);
            }
          }
        }
  • Putting it together... (Diagram panels: Logical, Physical Flow, Physical)
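The logical flow (map, then shuffle, then reduce) can be walked through in one process with a word count. This is only a sketch of the model under simplifying assumptions: `MiniMapReduce` is a hypothetical class, everything runs in memory on one machine, and no Hadoop API is involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process stand-in for map -> shuffle -> reduce; not Hadoop code.
class MiniMapReduce {
    // map: (offset, line) -> list of (word, 1)
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // reduce: (word, iter(counts)) -> sum of the counts
    static int reduce(String word, Iterable<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    static TreeMap<String, Integer> run(List<String> lines) {
        // Shuffle: group map output by key; TreeMap keeps the keys sorted.
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1; // position of the next line in the input
        }
        // Reduce: called once per key with all of that key's values.
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("data is everywhere", "data is important")));
        // prints {data=2, everywhere=1, important=1, is=2}
    }
}
```

In a real cluster the same three steps run across many machines; only map() and reduce() are written by the user.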
  • Some samples... Build an inverted index. Summarize data grouped by a key. Build map tiles from geographic data. OCR many images. Learn ML models (e.g., Naive Bayes for text classification). Augment traditional BI/DWH technologies (by archiving raw data).
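As an illustration of the first sample, an inverted index maps each term to the documents that contain it: the map step emits (term, document id) pairs and the reduce step collects the ids per term. A minimal in-memory sketch follows. The class name, the document ids, and the tokenization rule (lowercase, split on non-word characters) are assumptions made for the example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory inverted index; illustrates the map/reduce roles only.
class InvertedIndexSketch {
    // docs: document id -> text. Result: term -> sorted set of ids of documents containing it.
    static TreeMap<String, TreeSet<String>> build(Map<String, String> docs) {
        TreeMap<String, TreeSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // "map" role: emit (term, docId) for every term in the document
            for (String term : doc.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (term.isEmpty()) {
                    continue;
                }
                // "shuffle + reduce" role: gather the document ids seen for each term
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
            "d1", "Hadoop stores data",
            "d2", "Hadoop processes data");
        System.out.println(build(docs));
        // prints {data=[d1, d2], hadoop=[d1, d2], processes=[d2], stores=[d1]}
    }
}
```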
  • M/R (diagram): Tasktrackers on the same machines as datanodes. Legend: Job on stars, Different job, Idle. Shown across One Rack and A Different Rack.
  • M/R
  • M/R Failures. Task fails: Try again? Try again somewhere else? Report failure. Retries are possible because of idempotence.
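The failure handling on this slide (try again, try again somewhere else, report failure) can be sketched as a loop over attempts. This is a simplified illustration, not the JobTracker's actual scheduling logic: `RetrySketch`, the node names, and the attempt list are all hypothetical.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical retry loop; not the JobTracker's real scheduling code.
class RetrySketch {
    // Run `task` once per entry in `attempts` until one attempt succeeds.
    // Listing a node twice models "try again" before "try again somewhere else".
    static <R> R runWithRetries(List<String> attempts, Function<String, R> task) {
        RuntimeException last = null;
        for (String node : attempts) {
            try {
                // Re-running is only safe because the task is idempotent:
                // a repeated attempt produces the same output as the first would have.
                return task.apply(node);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next attempt
            }
        }
        // All attempts failed: report failure to the caller.
        throw new IllegalStateException("task failed on all attempts", last);
    }

    public static void main(String[] args) {
        String result = runWithRetries(List.of("node-a", "node-a", "node-b"), node -> {
            if (node.equals("node-a")) {
                throw new RuntimeException("simulated crash on " + node);
            }
            return "done on " + node;
        });
        System.out.println(result); // prints done on node-b
    }
}
```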
  • There’s more than the Java API. Streaming: perl, python, ruby, whatever; stdin/stdout/stderr. Pig: higher-level dataflow language for easy ad-hoc analysis; developed at Yahoo! Hive: SQL interface; great for analysts; developed at Facebook. Many tasks actually require a series of M/R jobs; that’s ok!
  • The Hadoop Ecosystem (diagram): ETL Tools, BI Reporting, RDBMS; Pig (Data Flow), Hive (SQL), Sqoop; ZooKeeper (Coordination); Avro (Serialization); MapReduce (Job Scheduling/Execution System); HBase (Column DB, Key-Value store); HDFS (Hadoop Distributed File System)
  • Hadoop in the Wild (yes, it’s used in production). Yahoo! Hadoop clusters: >82PB, >25k machines (Eric14, HadoopWorld NYC ’09). Facebook: 15TB new data per day; 10000+ cores, 12+ PB. Twitter: ~1TB per day, ~80 nodes. Lots of 5-40 node clusters at companies without petabytes of data (web, retail, finance, telecom, research).
  • Ok, fine, what next? Get Hadoop! Cloudera’s Distribution for Hadoop; http://hadoop.apache.org/. Try it out! (Locally, or on EC2.) Door Prize. Watch free training videos on http://cloudera.com/.
  • Questions? todd@cloudera.com (feedback? yes!) (hiring? yes!)
  • Backup Slides
  • Important APIs (→ is 1:many)
    M/R Flow:
      Input Format: data → K₁,V₁
      Mapper: K₁,V₁ → K₂,V₂
      Combiner: K₂,iter(V₂) → K₂,V₂
      Partitioner: K₂,V₂ → int
      Reducer: K₂,iter(V₂) → K₃,V₃
      Out. Format: K₃,V₃ → data
    Other: Writable, JobClient, *Context, Filesystem
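The Combiner signature, K₂,iter(V₂) → K₂,V₂, says that a combiner pre-aggregates the values of one key on the map side and emits the same key and value types it received. The sketch below shows why that is safe for a sum: combining each map task's output first and then reducing gives the same totals as reducing everything directly. `CombinerSketch` and its in-memory lists are hypothetical; only the idea carries over to Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of a combiner; not Hadoop code.
class CombinerSketch {
    // Sum the values per key. Usable both as combiner and as reducer,
    // because addition is associative and commutative.
    static TreeMap<String, Long> sumByKey(List<Map.Entry<String, Long>> pairs) {
        TreeMap<String, Long> sums = new TreeMap<>();
        for (Map.Entry<String, Long> kv : pairs) {
            sums.merge(kv.getKey(), kv.getValue(), Long::sum);
        }
        return sums;
    }

    // Combine each map task's output locally, then reduce the smaller combined output.
    static TreeMap<String, Long> reduceWithCombiner(
            List<List<Map.Entry<String, Long>>> mapTaskOutputs) {
        List<Map.Entry<String, Long>> shuffled = new ArrayList<>();
        for (List<Map.Entry<String, Long>> taskOutput : mapTaskOutputs) {
            for (Map.Entry<String, Long> e : sumByKey(taskOutput).entrySet()) {
                shuffled.add(Map.entry(e.getKey(), e.getValue()));
            }
        }
        return sumByKey(shuffled);
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Long>>> tasks = List.of(
            List.of(Map.entry("dunster", 1L), Map.entry("adams", 1L), Map.entry("dunster", 1L)),
            List.of(Map.entry("dunster", 1L), Map.entry("dunster", 1L), Map.entry("adams", 1L)));
        System.out.println(reduceWithCombiner(tasks)); // prints {adams=2, dunster=4}
    }
}
```

With the combiner, each map task here sends two records to the reducer instead of three; on real data with many repeated keys the saving in shuffle traffic is much larger.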
  • The “grep” example

        public int run(String[] args) throws Exception {
          if (args.length < 3) {
            System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
            ToolRunner.printGenericCommandUsage(System.out);
            return -1;
          }

          Path tempDir = new Path("grep-temp-" +
              Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

          JobConf grepJob = new JobConf(getConf(), Grep.class);

          try {
            grepJob.setJobName("grep-search");

            FileInputFormat.setInputPaths(grepJob, args[0]);

            grepJob.setMapperClass(RegexMapper.class);
            grepJob.set("mapred.mapper.regex", args[2]);
            if (args.length == 4)
              grepJob.set("mapred.mapper.regex.group", args[3]);

            grepJob.setCombinerClass(LongSumReducer.class);
            grepJob.setReducerClass(LongSumReducer.class);

            FileOutputFormat.setOutputPath(grepJob, tempDir);
            grepJob.setOutputFormat(SequenceFileOutputFormat.class);
            grepJob.setOutputKeyClass(Text.class);
            grepJob.setOutputValueClass(LongWritable.class);

            JobClient.runJob(grepJob);

            JobConf sortJob = new JobConf(Grep.class);
            sortJob.setJobName("grep-sort");

            FileInputFormat.setInputPaths(sortJob, tempDir);
            sortJob.setInputFormat(SequenceFileInputFormat.class);

            sortJob.setMapperClass(InverseMapper.class);

            // write a single file
            sortJob.setNumReduceTasks(1);
            FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
            // sort by decreasing freq
            sortJob.setOutputKeyComparatorClass(
                LongWritable.DecreasingComparator.class);

            JobClient.runJob(sortJob);
          }
          finally {
            FileSystem.get(grepJob).delete(tempDir, true);
          }
          return 0;
        }
  • Running the example:

        $ cat input.txt
        adams dunster kirkland dunster kirland dudley dunster adams dunster winthrop
        $ bin/hadoop jar hadoop-0.18.3-examples.jar grep input.txt output1 'dunster|adams'
        $ cat output1/part-00000
        4 dunster
        2 adams
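The result of this run can be reproduced without a cluster. The sketch below imitates the two chained jobs in memory: the first counts every regex match, the second orders the matches by decreasing count. `GrepSketch` is a hypothetical class for illustration, and the tab between count and match is an assumption about the output separator; the slide shows only whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical in-memory imitation of the two-job grep example; not Hadoop code.
class GrepSketch {
    // Job 1 ("grep-search"): emit (match, 1) for every regex match, then sum per match.
    static Map<String, Long> countMatches(String input, String regex) {
        Map<String, Long> counts = new TreeMap<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            counts.merge(m.group(), 1L, Long::sum);
        }
        return counts;
    }

    // Job 2 ("grep-sort"): swap to (count, match) and sort by decreasing count.
    static List<String> sortByDecreasingCount(Map<String, Long> counts) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries) {
            lines.add(e.getValue() + "\t" + e.getKey()); // separator is an assumption
        }
        return lines;
    }

    public static void main(String[] args) {
        String input = "adams dunster kirkland dunster kirland dudley dunster adams dunster winthrop";
        for (String line : sortByDecreasingCount(countMatches(input, "dunster|adams"))) {
            System.out.println(line);
        }
        // prints "4	dunster" then "2	adams" (tab-separated)
    }
}
```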
  • Job 1 of 2

        JobConf grepJob = new JobConf(getConf(), Grep.class);

        try {
          grepJob.setJobName("grep-search");

          FileInputFormat.setInputPaths(grepJob, args[0]);

          grepJob.setMapperClass(RegexMapper.class);
          grepJob.set("mapred.mapper.regex", args[2]);
          if (args.length == 4)
            grepJob.set("mapred.mapper.regex.group", args[3]);

          grepJob.setCombinerClass(LongSumReducer.class);
          grepJob.setReducerClass(LongSumReducer.class);

          FileOutputFormat.setOutputPath(grepJob, tempDir);
          grepJob.setOutputFormat(SequenceFileOutputFormat.class);
          grepJob.setOutputKeyClass(Text.class);
          grepJob.setOutputValueClass(LongWritable.class);

          JobClient.runJob(grepJob);
        } ...
  • Job 2 of 2

          JobConf sortJob = new JobConf(Grep.class);
          sortJob.setJobName("grep-sort");

          FileInputFormat.setInputPaths(sortJob, tempDir);
          sortJob.setInputFormat(SequenceFileInputFormat.class);

          sortJob.setMapperClass(InverseMapper.class);
          // (implicit identity reducer)

          // write a single file
          sortJob.setNumReduceTasks(1);
          FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
          // sort by decreasing freq
          sortJob.setOutputKeyComparatorClass(
              LongWritable.DecreasingComparator.class);

          JobClient.runJob(sortJob);
        } finally {
          FileSystem.get(grepJob).delete(tempDir, true);
        }
        return 0;
        }
  • The types there... ?, Text; Text, Long; Text, list(Long); Text, Long; Long, Text
  • Facebook Data Infrastructure: Facebook’s DWH, 2008 (diagram). Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers.