Dachis group pigout_101

Introduction to Pig. Word Count and TF-IDF generation on Shakespeare's corpus.

Dachis group pigout_101 Presentation Transcript

  • 1. dachisgroup.com
       Dachis Group Las Vegas 2012
       Introduction to Apache Pig
       Kevin Safford
       Pigout Hackday, Austin TX, May 11, 2012
       ® 2011 Dachis Group.
  • 2. What’s Pig?
      • Data flow engine
      • Generates MapReduce behind the scenes
      • No requirement to write any Java
      • Pig Latin language equipped with SQL-ish operators
          • join, group by, sort, filter...
  • 3. What Pig Isn’t
      • Not really a query language
      • Not a data visualization tool
      • Not always friendly
      • Not hard to learn
  • 4. Pig Data Model
      • Standard scalar types
      • Maps
      • Tuples
          • conceptually like a row
          • ordered, fixed length
      • Bag
          • unordered collection of tuples
          • not required to fit in memory
  • 5. Word Count

        package org.myorg;

        import java.io.IOException;
        import java.util.*;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.conf.*;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapreduce.*;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

        public class WordCount {

            // Mapper: emits (token, 1) for every whitespace-separated token in a line.
            public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
                private final static IntWritable one = new IntWritable(1);
                private Text word = new Text();

                @Override
                public void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    String line = value.toString();
                    StringTokenizer tokenizer = new StringTokenizer(line);
                    while (tokenizer.hasMoreTokens()) {
                        word.set(tokenizer.nextToken());
                        context.write(word, one);
                    }
                }
            }

            // Reducer: sums the 1s emitted for each token.
            public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

                @Override
                public void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable val : values) {
                        sum += val.get();
                    }
                    context.write(key, new IntWritable(sum));
                }
            }

            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();

                // Job.getInstance replaces the deprecated Job constructor.
                Job job = Job.getInstance(conf, "wordcount");
                // Lets Hadoop locate the jar containing these classes on a cluster.
                job.setJarByClass(WordCount.class);

                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                job.setMapperClass(Map.class);
                job.setReducerClass(Reduce.class);

                job.setInputFormatClass(TextInputFormat.class);
                job.setOutputFormatClass(TextOutputFormat.class);

                // args[0] is the input path, args[1] the output path.
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));

                // Exit non-zero if the job fails.
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }
  • 6. Complete Works of Shakespeare
       http://sydney.edu.au/engineering/it/~matty/Shakespeare/
  • 7. words: {word: {tuple_of_tokens: (token: chararray)}}
       ({(Clown),(|)})
       ({(Steward),(|)})
       ({(DRAMATIS),(PERSONAE)})
       ({(LAFEU),(an),(old),(lord.)})
       ({(KING),(OF),(FRANCE),(KING:)})
       ({(DUKE),(OF),(FLORENCE),(DUKE:)})
       ({(ALLS),(WELL),(THAT),(ENDS),(WELL)})
       ({(BERTRAM),(Count),(of),(Rousillon.)})
       ({(PAROLLES),(a),(follower),(of),(Bertram.)})
       ({(|),(servants),(to),(the),(Countess),(of),(Rousillon.)})
  • 8. (OF)
       (ENDS)
       (KING)
       (THAT)
       (WELL)
       (WELL)
       (ALLS)
       (FRANCE)
       (DRAMATIS)
       (PERSONAE)
  • 9. (1,{(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)})
       (2,{(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2)})
       (3,{(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3)})
       (A,{(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),...(A)})
  • 10. (29724,the)
       (27474,and)
       (20770,i)
       (19980,to)
       (18380,of)
       (15131,a)
       (12923,my)
       (12413,you)
       (11487,in)
       (11202,that)
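The data flow shown on slides 7 through 10 (tokenize each line, flatten the bags of tokens, group by token, count each group, order by count) can be sketched in plain Java for comparison with the MapReduce version. This is only an illustrative sketch, not the Pig script from the talk: the class and method names are hypothetical, and lower-casing the tokens is an assumption based on the lower-case words in the final counts.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical name; not part of Pig or Hadoop.
final class WordCountSketch {

    /** Returns the n most frequent tokens as (word, count) entries, highest count first. */
    static List<Map.Entry<String, Long>> topWords(List<String> lines, int n) {
        Map<String, Long> counts = lines.stream()
                // Tokenize: one bag of tokens per input line.
                .map(line -> Arrays.asList(line.trim().split("\\s+")))
                // Flatten: one token per record.
                .flatMap(List::stream)
                .filter(token -> !token.isEmpty())
                // Assumption: tokens are lower-cased before counting.
                .map(String::toLowerCase)
                // Group by token and count the size of each group.
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        return counts.entrySet().stream()
                // Order by count descending; ties are broken alphabetically.
                .sorted((a, b) -> {
                    int byCount = Long.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Toy input borrowed from the tokens on slide 7, not the full corpus.
        List<String> lines = Arrays.asList(
                "ALLS WELL THAT ENDS WELL",
                "KING OF FRANCE",
                "DUKE OF FLORENCE");
        for (Map.Entry<String, Long> entry : topWords(lines, 3)) {
            System.out.println("(" + entry.getValue() + "," + entry.getKey() + ")");
        }
    }
}
```

Each stream stage corresponds to one relation in the data flow, which is roughly how a Pig script reads: one named step per transformation.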
  • 11. (no text transcribed for this slide)
  • 12. TF-IDF
       term frequency = # of times a term appears in a document
       document frequency = # of documents the term appears in
       TF-IDF = tf * log(1/df)
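One way to write these definitions out, under two assumptions the slide does not state: term frequency is normalized by the number of words in the document, and document frequency is expressed as a fraction of all documents, so the logarithm is never negative. Here $n_{t,d}$ is the number of times term $t$ appears in document $d$ and $D$ is the set of documents; both symbols are introduced for illustration.

```latex
\[
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{df}(t) = \frac{\lvert \{ d \in D : n_{t,d} > 0 \} \rvert}{\lvert D \rvert}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \log\frac{1}{\mathrm{df}(t)}
\]
```

With this reading, a term that appears in every document has $\mathrm{df} = 1$ and scores 0, while a term confined to one document keeps most of its term frequency.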
  • 13. Imagine the MapReduce Problem
       MapReduce to get the number of words per document
       MapReduce to get term frequencies
       MapReduce to get document frequencies
       MapReduce to get the products
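The four passes listed above can be sketched in a single in-memory Java method to make the intermediate results concrete. This is a sketch under stated assumptions, not the talk's Pig script or a MapReduce job: the class and method names are hypothetical, term frequency is normalized by document length, and document frequency is taken as a fraction of all documents.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical name; not part of Pig or Hadoop.
final class TfIdfSketch {

    /** Maps document name -> term -> tf * log(1/df) for the given tokenized documents. */
    static Map<String, Map<String, Double>> tfIdf(Map<String, List<String>> docs) {
        Map<String, Integer> wordsPerDoc = new HashMap<>();
        Map<String, Map<String, Integer>> termCounts = new HashMap<>();
        Map<String, Integer> docsWithTerm = new HashMap<>();

        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            // Pass 1: number of words per document.
            wordsPerDoc.put(doc.getKey(), doc.getValue().size());
            // Pass 2: raw count of each term within the document.
            Map<String, Integer> counts = new HashMap<>();
            for (String token : doc.getValue()) {
                counts.merge(token, 1, Integer::sum);
            }
            termCounts.put(doc.getKey(), counts);
            // Pass 3: number of documents each term appears in.
            for (String term : counts.keySet()) {
                docsWithTerm.merge(term, 1, Integer::sum);
            }
        }

        // Pass 4: the products.
        int totalDocs = docs.size();
        Map<String, Map<String, Double>> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> doc : termCounts.entrySet()) {
            int words = wordsPerDoc.get(doc.getKey());
            Map<String, Double> docScores = new HashMap<>();
            for (Map.Entry<String, Integer> term : doc.getValue().entrySet()) {
                // Assumption: tf is the term count divided by the document length.
                double tf = (double) term.getValue() / words;
                // Assumption: df is the fraction of documents containing the term.
                double df = (double) docsWithTerm.get(term.getKey()) / totalDocs;
                docScores.put(term.getKey(), tf * Math.log(1.0 / df));
            }
            scores.put(doc.getKey(), docScores);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Toy documents for illustration only, not the actual corpus.
        Map<String, List<String>> docs = new HashMap<>();
        docs.put("cymbeline", Arrays.asList("imogen", "the", "queen", "the"));
        docs.put("asyoulikeit", Arrays.asList("rosalind", "the", "duke"));
        System.out.println(tfIdf(docs));
    }
}
```

In MapReduce each pass would be its own job with its own mapper, reducer, and driver, which is the overhead the slide asks the audience to imagine.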
  • 14. (no text transcribed for this slide)
  • 15. (no text transcribed for this slide)
  • 16. (no text transcribed for this slide)
  • 17. (cymbeline,all,1,cymbeline,1138)
       (cymbeline,iii,12,cymbeline,1138)
       (cymbeline,vii,1,cymbeline,1138)
       (cymbeline,lady,10,cymbeline,1138)
       (cymbeline,lord,41,cymbeline,1138)
       (cymbeline,caius,26,cymbeline,1138)
       (cymbeline,first,46,cymbeline,1138)
       (cymbeline,helen,1,cymbeline,1138)
       (cymbeline,lords,1,cymbeline,1138)
       (cymbeline,queen,28,cymbeline,1138)
  • 18. (cymbeline,i,0.028319954362087934)
       (cymbeline,o,0.0028116213683223993)
       (cymbeline,s,4.0748135772788395E-5)
       (cymbeline,v,3.667332219550956E-4)
       (cymbeline,ah,8.149627154557679E-5)
       (cymbeline,am,0.0035450878122325904)
       (cymbeline,an,0.0016299254309115358)
       (cymbeline,as,0.009535063770832485)
       (cymbeline,at,0.002974613911413553)
       (cymbeline,ay,6.519701723646143E-4)
  • 19. (no text transcribed for this slide)
  • 20. (comedyoferrors,syracuse,0.021138772)
       (comedyoferrors,antipholus,0.020943945)
       (comedyoferrors,dromio,0.020067222)
       (asyoulikeit,rosalind,0.016347487)
       (comedyoferrors,ephesus,0.014806883)
       (allswellthatendswell,parolles,0.010223216)
       (asyoulikeit,orlando,0.010070603)
       (comedyoferrors,adriana,0.008572405)
       (asyoulikeit,celia,0.0081392545)
       (cymbeline,imogen,0.008021425)
       (allswellthatendswell,bertram,0.007929546)
       (allswellthatendswell,helena,0.0077329455)
       (cymbeline,cymbeline,0.0074565364)
       (allswellthatendswell,lafeu,0.0072742114)
       (cymbeline,posthumus,0.006496225)
       (allswellthatendswell,countess,0.0063567436)
       (cymbeline,leonatus,0.006157291)
       (asyoulikeit,touchstone,0.0055181384)
       (cymbeline,cloten,0.0053099575)
       (cymbeline,iachimo,0.005084002)
  • 21. Some debugging tips:
       Use describe
       Cast explicitly
       Use explicit schemas
       Sample, Limit, and Dump
       Cryptic error messages: “Scalar has more than one row in the output”
  • 22. Other tips
       Filter early
       Project out unused columns
       Don’t expect Pig to know what you mean
       UDFs and unit tests are your friends
       Tim and Clint will tell you more
  • 23. QUESTIONS?
       Kevin Safford
       Pigout Hackday, Austin TX, May 11, 2012