Njug presentation

Transcript of "Njug presentation"

  1. Hadoop 101: Writing a Java MapReduce Program
     Ian Wrigley, Sr. Curriculum Manager, Cloudera
     ian@cloudera.com | @iwrigley
     © Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
  2. Why the World Needs Hadoop
     And, by the way, what is Hadoop?
  3. Volume
     • Every day…
       – More than 1.5 billion shares are traded on the NYSE
       – Facebook stores 2.7 billion comments and Likes
     • Every minute…
       – Foursquare handles more than 2,000 check-ins
       – TransUnion makes nearly 70,000 updates to credit files
     • And every second…
       – Banks process more than 10,000 credit card transactions
  4. Velocity
     • We are generating data faster than ever
       – Processes are increasingly automated
       – People are increasingly interacting online
       – Systems are increasingly interconnected
  5. Variety
     • We're producing a variety of data, including
       – Audio
       – Video
       – Images
       – Log files
       – Web pages
       – Product rating comments
       – Social network connections
     • Not all of this maps cleanly to the relational model
  6. Big Data Can Mean Big Opportunity
     • One tweet is an anecdote
       – But a million tweets may signal important trends
     • One person's product review is an opinion
       – But a million reviews might uncover a design flaw
     • One person's diagnosis is an isolated case
       – But a million medical records could lead to a cure
  7. MapReduce
     A Scalable Data Processing Framework
  8. What is MapReduce?
     • MapReduce is a programming model
       – It's a way of processing data
     • In Hadoop, you supply two functions to process data: Map and Reduce
       – Map: typically used to transform, parse, or filter data
       – Reduce: typically used to summarize results
     • The Map function always runs first
       – The Reduce function runs afterwards
       – The Hadoop framework performs a shuffle and sort to transfer data from the Map function to the Reduce function
     • Each piece is simple, but can be powerful when combined
  9. MapReduce: An Example
     • … in which Ian waves his hands around and attempts to explain the MapReduce flow
  10. MapReduce Code for Hadoop
      • MapReduce processing in Hadoop is batch-oriented
      • Usually written in Java
        – This uses Hadoop's API directly
        – You can do basic MapReduce in other languages
          – Using the Hadoop Streaming wrapper program
          – Some advanced features require Java code
  11. Basic Java API Concepts
      • Some (very) basic concepts:
        – Input and output data is typed
        – The framework passes each input record to the Mapper in turn
          – A record is a (key, value) pair
          – For text files:
            – The key is the byte offset of the start of the line
            – The value is the line itself
        – Output data from the Mapper is transferred to the Reducer via a process known as the shuffle and sort
        – Reducers receive (key, Iterable of values) sets, in sorted key order
        – Job is configured and executed using a driver class
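The (byte offset, line) records described above can be imitated in plain Java, without Hadoop. This is only a sketch: the class name `LineRecords` and the method `toRecords` are hypothetical, and the code assumes UTF-8 text with `\n` line endings. In a real job the framework's input format produces these records for you.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class LineRecords {

    // Returns (byte offset of line start -> line text), imitating the records
    // a Mapper receives for a text file. Assumes UTF-8 and "\n" line endings.
    static Map<Long, String> toRecords(String fileContents) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            records.put(offset, line);
            // +1 accounts for the newline byte that ended this line
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "Nashville\tJ. Jones\t12.95\t2013-07-21\n"
                    + "Memphis\tS. Smith\t66.57\t2013-07-21\n";
        toRecords(data).forEach((key, value) -> System.out.println(key + " -> " + value));
    }
}
```

The key is rarely useful in itself, which is why the Mapper in this deck ignores it and works only with the value.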
  12. Data Flow
      Map input:
          Nashville   J. Jones      12.95   2013-07-21
          Memphis     S. Smith      66.57   2013-07-21
          Nashville   T. Harding    55.35   2013-07-22
          Knoxville   S. Warne      10.99   2013-07-22
          Kingsport   M. Thompson   99.95   2013-07-22
      Map output:
          Nashville   12.95
          Memphis     66.57
          Nashville   55.35
          Knoxville   10.99
          Kingsport   99.95
      Reduce input (after the shuffle and sort):
          Kingsport   [99.95]
          Knoxville   [10.99]
          Memphis     [66.57]
          Nashville   [12.95, 55.35]
      Reduce output:
          Kingsport   99.95
          Knoxville   10.99
          Memphis     66.57
          Nashville   68.30
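The data flow on this slide can be traced in plain Java, with no Hadoop involved. This is a single-process sketch under stated assumptions: the class `DataFlowSketch` and its methods are hypothetical names, the input fields are assumed to be tab-separated, and a `TreeMap` stands in for the framework's shuffle and sort.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DataFlowSketch {

    // Map step: one tab-separated sales record in, one (store, amount) pair out.
    static Map.Entry<String, Double> map(String line) {
        String[] fields = line.split("\t");
        return new SimpleEntry<>(fields[0], Double.parseDouble(fields[2]));
    }

    // Shuffle and sort: group the values by key. TreeMap keeps keys in sorted order.
    static TreeMap<String, List<Double>> shuffleAndSort(List<Map.Entry<String, Double>> pairs) {
        TreeMap<String, List<Double>> grouped = new TreeMap<>();
        for (Map.Entry<String, Double> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum all the values that share one key.
    static double reduce(Iterable<Double> values) {
        double sum = 0;
        for (double value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs map, then shuffle and sort, then reduce over the given lines.
    static TreeMap<String, Double> run(List<String> lines) {
        List<Map.Entry<String, Double>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.add(map(line));
        }
        TreeMap<String, Double> output = new TreeMap<>();
        shuffleAndSort(mapped).forEach((store, amounts) -> output.put(store, reduce(amounts)));
        return output;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "Nashville\tJ. Jones\t12.95\t2013-07-21",
                "Memphis\tS. Smith\t66.57\t2013-07-21",
                "Nashville\tT. Harding\t55.35\t2013-07-22",
                "Knoxville\tS. Warne\t10.99\t2013-07-22",
                "Kingsport\tM. Thompson\t99.95\t2013-07-22");
        run(input).forEach((store, total) -> System.out.printf("%s\t%.2f%n", store, total));
    }
}
```

On a cluster the three steps run on different machines and the grouping happens across the network, but the shape of the data at each stage is the same.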
  13. Java MR Job Example: Mapper

          package com.cloudera.example;

          import java.io.IOException;

          import org.apache.hadoop.io.DoubleWritable;
          import org.apache.hadoop.io.LongWritable;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Mapper;

          // Type parameters: input key and value types (LongWritable, Text),
          // then output key and value types (Text, DoubleWritable)
          public class StoreSalesMapper
                  extends Mapper<LongWritable, Text, Text, DoubleWritable> {
  14. Java MR Job Example: Mapper

              /*
               * The map method is invoked once for each line of text in the
               * input data. The method receives a key of type LongWritable
               * (which corresponds to the byte offset in the current input
               * file), a value of type Text (representing the line of input
               * data), and a Context object (which allows us to print status
               * messages, among other things).
               */
              @Override
              public void map(LongWritable key, Text value, Context context)
                      throws IOException, InterruptedException {
  15. Java MR Job Example: Mapper

                  // Convert value to a Java String
                  String line = value.toString();

                  // Defensive programming: ignore empty lines
                  if (line.trim().isEmpty()) {
                      return;
                  }

                  // Split record into fields (the input is tab-separated)
                  String[] fields = line.split("\t");

                  // Even more defensive programming: ensure this line is not malformed
                  if (fields.length != 4) {
                      return;
                  }
  16. Java MR Job Example: Mapper

                  // Extract based on position: field 0 is the store, field 2 the sale amount
                  String storeName = fields[0];
                  double saleValue = Double.parseDouble(fields[2]);

                  // Output key and value
                  context.write(new Text(storeName), new DoubleWritable(saleValue));
              }
          }
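The Mapper's parsing logic can be pulled out and exercised without a cluster. The sketch below is one way to do that, not the presenter's code: `SaleRecordParser` and `parse` are hypothetical names, and it adds one check that the slides do not have, namely catching the `NumberFormatException` that `Double.parseDouble` throws when the amount field is not numeric.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Optional;

public class SaleRecordParser {

    // Store, customer, amount, date (as in the sample data on the Data Flow slide)
    static final int EXPECTED_FIELDS = 4;

    // Returns the (store, amount) pair for a well-formed line, or empty otherwise.
    static Optional<Map.Entry<String, Double>> parse(String line) {
        if (line == null || line.trim().isEmpty()) {
            return Optional.empty();               // ignore empty lines
        }
        String[] fields = line.split("\t");
        if (fields.length != EXPECTED_FIELDS) {
            return Optional.empty();               // wrong number of fields
        }
        try {
            double amount = Double.parseDouble(fields[2].trim());
            Map.Entry<String, Double> pair = new SimpleEntry<>(fields[0], amount);
            return Optional.of(pair);
        } catch (NumberFormatException e) {
            return Optional.empty();               // amount is not a number
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("Nashville\tJ. Jones\t12.95\t2013-07-21"));
        System.out.println(parse("Nashville\tJ. Jones\tN/A\t2013-07-21"));
    }
}
```

Keeping the parsing in a plain method like this makes it easy to unit test; the `map` method would then only call `parse` and write the result to the `Context` when one is present.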
  17. Java MR Job Example: Reducer

          package com.cloudera.example;

          import java.io.IOException;

          import org.apache.hadoop.io.DoubleWritable;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Reducer;

          // Type parameters: input key and value types (Text, DoubleWritable),
          // then output key and value types (Text, DoubleWritable)
          public class SumReducer
                  extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
  18. Java MR Job Example: Reducer

              /*
               * The reduce method is invoked once for each key received from
               * the shuffle and sort phase of the MapReduce framework.
               * The method receives a key of type Text (representing the key),
               * a set of values of type DoubleWritable, and a Context object.
               */
              @Override
              public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                      throws IOException, InterruptedException {
  19. Java MR Job Example: Reducer

                  // used to sum up the store sales
                  double sum = 0;

                  // add to it for each new value received
                  for (DoubleWritable value : values) {
                      sum += value.get();
                  }

                  // Output key and value: the store name (key) and the sum (value)
                  context.write(key, new DoubleWritable(sum));
              }
          }
  20. Java MR Job Example: Driver

          package com.cloudera.example;

          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.DoubleWritable;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
          import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
          import org.apache.hadoop.mapreduce.Job;

          // The driver is just a regular Java class with a "main" method
          public class StoreSales {

              public static void main(String[] args) throws Exception {
  21. Java MR Job Example: Driver

                  // validate command line arguments (we require the user
                  // to specify the HDFS paths to use for the job; see below)
                  if (args.length != 2) {
                      System.out.printf("Usage: StoreSales <input dir> <output dir>\n");
                      System.exit(-1);
                  }

                  // Instantiate a Job object for our job's configuration.
                  Job job = new Job();

                  // configure input and output paths based on supplied arguments
                  FileInputFormat.setInputPaths(job, new Path(args[0]));
                  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  22. Java MR Job Example: Driver

                  // tells Hadoop to copy the JAR containing this class
                  // to cluster nodes, as required to run this job
                  job.setJarByClass(StoreSales.class);

                  // give the job a descriptive name. This is optional, but
                  // helps us identify this job on a busy cluster
                  job.setJobName("Store Sale Aggregator");

                  // Specify which classes to use for the Mapper and Reducer
                  job.setMapperClass(StoreSalesMapper.class);
                  job.setReducerClass(SumReducer.class);
  23. Java MR Job Example: Driver

                  // specify the Mapper's output key and value classes
                  job.setMapOutputKeyClass(Text.class);
                  job.setMapOutputValueClass(DoubleWritable.class);

                  // specify the job's output key and value classes
                  job.setOutputKeyClass(Text.class);
                  job.setOutputValueClass(DoubleWritable.class);

                  // start the MapReduce job and wait for it to finish.
                  // if it finishes successfully, return 0; otherwise 1.
                  boolean success = job.waitForCompletion(true);
                  System.exit(success ? 0 : 1);
              }
          }
  24. Demo
      • And now… the program actually running on a pseudo-distributed cluster
  25. Conclusion
      • Obviously there's much more to the Hadoop API than this
        – Partitioners
        – Combiners
        – Custom Writables, custom WritableComparables
        – DistributedCache
        – Counters
        – Etc., etc., etc.
      • …but even with just this amount of knowledge, you could write real-world Hadoop applications
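As a taste of the first item on that list: a Partitioner decides which Reducer receives each key. The sketch below applies the same arithmetic as Hadoop's default `HashPartitioner` (clear the sign bit of the hash, then take it modulo the number of reduce tasks) in plain Java. The class `HashPartitionSketch` is a hypothetical name, and it hashes a `String` rather than a `Text`, so the partition numbers it prints are illustrative and need not match those of a real job.

```java
public class HashPartitionSketch {

    // Clear the sign bit so the hash is non-negative, then map it onto
    // one of numReduceTasks partitions (numbered 0 .. numReduceTasks - 1).
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3; // assumed reducer count, for illustration only
        String[] stores = {"Kingsport", "Knoxville", "Memphis", "Nashville"};
        for (String store : stores) {
            System.out.println(store + " -> reducer " + partitionFor(store, numReduceTasks));
        }
    }
}
```

Because the partition depends only on the key, every record for a given store reaches the same Reducer, which is what allows `SumReducer` to see all of that store's sales together.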
  26. About Cloudera
      • Helps companies profit from all their data
        – Founded by experts from Facebook, Google, Oracle, and Yahoo
      • We offer products and services for large-scale data analysis
        – Software (CDH distribution and Cloudera Manager)
        – Consulting and support services
        – Training and certification
      • Want to attend a training course? Use the code Nashville_15 for 15% off any Cloudera-delivered class