Dachis group pigout_101

Introduction to Pig. Word count and TF-IDF generation on Shakespeare's corpus.


Transcript

  • 1. Introduction to Apache Pig
      Kevin Safford
      Pigout Hackday, Austin TX, May 11, 2012
      dachisgroup.com · Dachis Group · Las Vegas 2012 · ® 2011 Dachis Group.
  • 2. What’s Pig?
      • Data flow engine
      • Generates MapReduce behind the scenes
      • No requirement to write any Java
      • Pig Latin language equipped with SQL-ish operators
          • join, group by, sort, filter...
  • 3. What Pig Isn’t
      • Not really a query language
      • Not a data visualization tool
      • Not always friendly
      • Not hard to learn
  • 4. Pig Data Model
      • Standard scalar types
      • Maps
      • Tuples
          • conceptually like a row
          • ordered, fixed length
      • Bag
          • unordered collection of tuples
          • not required to fit in memory
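The data model on slide 4 can be pictured with ordinary Java collections. The following is only an in-memory analogy, not Pig's own classes: the class name `PigDataModelSketch` and the helpers `tuple` and `bag` are hypothetical, and the sample values are borrowed from the slide output later in the deck. Unlike a real Pig bag, the list used here must fit in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical analogy for Pig's data model using plain Java collections. */
final class PigDataModelSketch {

    /** A tuple: an ordered, fixed-length sequence of fields, conceptually a row. */
    static List<Object> tuple(Object... fields) {
        // Unmodifiable, so the length stays fixed after construction.
        return Collections.unmodifiableList(new ArrayList<>(Arrays.asList(fields)));
    }

    /** A bag: a collection of tuples. A List is used for convenience; its order carries no meaning. */
    @SafeVarargs
    static List<List<Object>> bag(List<Object>... tuples) {
        return new ArrayList<>(Arrays.asList(tuples));
    }

    public static void main(String[] args) {
        // A tuple with three scalar fields.
        List<Object> row = tuple("cymbeline", "queen", 28);

        // A bag holding two single-field tuples.
        List<List<Object>> tokens = bag(tuple("DRAMATIS"), tuple("PERSONAE"));

        // A map from string keys to values.
        Map<String, Object> properties = new HashMap<>();
        properties.put("play", "cymbeline");

        System.out.println(row);
        System.out.println(tokens);
        System.out.println(properties);
    }
}
```

The nesting is the point of the analogy: a bag contains tuples, and a tuple field may itself hold a bag or a map.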
  • 5. Word Count

      package org.myorg;

      import java.io.IOException;
      import java.util.*;

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.conf.*;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapreduce.*;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

      public class WordCount {

          // Mapper: emits (word, 1) for every whitespace-separated token in a line.
          public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
              private final static IntWritable one = new IntWritable(1);
              private Text word = new Text();

              @Override
              public void map(LongWritable key, Text value, Context context)
                      throws IOException, InterruptedException {
                  String line = value.toString();
                  StringTokenizer tokenizer = new StringTokenizer(line);
                  while (tokenizer.hasMoreTokens()) {
                      word.set(tokenizer.nextToken());
                      context.write(word, one);
                  }
              }
          }

          // Reducer: sums the 1s emitted for each word.
          public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

              @Override
              public void reduce(Text key, Iterable<IntWritable> values, Context context)
                      throws IOException, InterruptedException {
                  int sum = 0;
                  for (IntWritable val : values) {
                      sum += val.get();
                  }
                  context.write(key, new IntWritable(sum));
              }
          }

          // Driver: args[0] is the input path, args[1] is the output path.
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();

              Job job = new Job(conf, "wordcount");
              // Not on the original slide: lets Hadoop locate the job jar on a cluster.
              job.setJarByClass(WordCount.class);

              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);

              job.setMapperClass(Map.class);
              job.setReducerClass(Reduce.class);

              job.setInputFormatClass(TextInputFormat.class);
              job.setOutputFormatClass(TextOutputFormat.class);

              FileInputFormat.addInputPath(job, new Path(args[0]));
              FileOutputFormat.setOutputPath(job, new Path(args[1]));

              job.waitForCompletion(true);
          }

      }
  • 6. Complete Works of Shakespeare
      http://sydney.edu.au/engineering/it/~matty/Shakespeare/
  • 7. words: {word: {tuple_of_tokens: (token: chararray)}}
      ({(Clown),(|)})
      ({(Steward),(|)})
      ({(DRAMATIS),(PERSONAE)})
      ({(LAFEU),(an),(old),(lord.)})
      ({(KING),(OF),(FRANCE),(KING:)})
      ({(DUKE),(OF),(FLORENCE),(DUKE:)})
      ({(ALLS),(WELL),(THAT),(ENDS),(WELL)})
      ({(BERTRAM),(Count),(of),(Rousillon.)})
      ({(PAROLLES),(a),(follower),(of),(Bertram.)})
      ({(|),(servants),(to),(the),(Countess),(of),(Rousillon.)})
  • 8. (OF)
      (ENDS)
      (KING)
      (THAT)
      (WELL)
      (WELL)
      (ALLS)
      (FRANCE)
      (DRAMATIS)
      (PERSONAE)
  • 9. (1,{(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)})
      (2,{(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2)})
      (3,{(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3)})
      (A,{(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),...(A)})
  • 10. (29724,the)
      (27474,and)
      (20770,i)
      (19980,to)
      (18380,of)
      (15131,a)
      (12923,my)
      (12413,you)
      (11487,in)
      (11202,that)
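Slides 7 through 10 show the intermediate results of a word count: tokenized lines, a flattened token list, tokens grouped by value, and counts in descending order. The Pig script itself is not in the transcript, so the following is only a plain-Java, in-memory sketch of the same dataflow. The class `WordCountSketch`, its method names, and the normalization rule (lower-casing and stripping non-letters) are assumptions made for illustration; the sample lines come from slide 7.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory sketch of the word-count dataflow; not the deck's Pig script. */
final class WordCountSketch {

    /** Tokenize each line on whitespace and flatten the results into one list. */
    static List<String> tokenize(List<String> lines) {
        List<String> tokens = new ArrayList<>();
        for (String line : lines) {
            for (String raw : line.trim().split("\\s+")) {
                // Assumed normalization: lower-case, keep letters only.
                String token = raw.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
        }
        return tokens;
    }

    /** Group identical tokens and count the size of each group. */
    static Map<String, Integer> count(List<String> tokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Order the (word, count) rows by count, largest first; ties stay alphabetical. */
    static List<Map.Entry<String, Integer>> orderByCountDesc(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "ALLS WELL THAT ENDS WELL",
                "KING OF FRANCE KING:",
                "DUKE OF FLORENCE DUKE:");
        for (Map.Entry<String, Integer> row : orderByCountDesc(count(tokenize(lines)))) {
            System.out.println("(" + row.getValue() + "," + row.getKey() + ")");
        }
    }
}
```

Each method corresponds to one step of the flow, which is roughly the granularity at which a Pig script names its intermediate relations.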
  • 11. (no text on slide)
  • 12. TFIDF
      term frequency = # of times a term appears in a document
      document frequency = # of documents the term appears in
      TFIDF = tf * log(1/df)
  • 13. Imagine the MapReduce Problem
      • MapReduce to get the number of words per document
      • MapReduce to get term frequencies
      • MapReduce to get document frequencies
      • MapReduce to get the products
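The four stages listed on slide 13 can be sketched in plain Java on a small in-memory corpus. This is an illustration under stated assumptions, not the deck's Pig script and not a MapReduce job: `TfIdfSketch` is a hypothetical name, tf is assumed to be the term count divided by the number of words in the document, and df is assumed to be the fraction of documents that contain the term, so that `tf * log(1/df)` from slide 12 is non-negative. It is not meant to reproduce the scores shown on the slides.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory sketch of the four TF-IDF stages; normalizations are assumptions. */
final class TfIdfSketch {

    /** Takes document name -> list of terms; returns document name -> term -> score. */
    static Map<String, Map<String, Double>> tfidf(Map<String, List<String>> docs) {
        Map<String, Integer> wordsPerDoc = new HashMap<>();              // stage 1
        Map<String, Map<String, Integer>> termCounts = new HashMap<>();  // stage 2
        Map<String, Integer> docsWithTerm = new HashMap<>();             // stage 3

        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            // Stage 1: number of words per document.
            wordsPerDoc.put(doc.getKey(), doc.getValue().size());

            // Stage 2: raw count of each term within the document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc.getValue()) {
                counts.merge(term, 1, Integer::sum);
            }
            termCounts.put(doc.getKey(), counts);

            // Stage 3: number of documents in which each term appears.
            for (String term : counts.keySet()) {
                docsWithTerm.merge(term, 1, Integer::sum);
            }
        }

        // Stage 4: the products tf * log(1/df).
        Map<String, Map<String, Double>> scores = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> doc : termCounts.entrySet()) {
            double totalWords = wordsPerDoc.get(doc.getKey());
            Map<String, Double> row = new HashMap<>();
            for (Map.Entry<String, Integer> term : doc.getValue().entrySet()) {
                double tf = term.getValue() / totalWords;                      // assumed normalization
                double df = docsWithTerm.get(term.getKey()) / (double) docs.size(); // assumed fraction
                row.put(term.getKey(), tf * Math.log(1.0 / df));
            }
            scores.put(doc.getKey(), row);
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new HashMap<>();
        docs.put("a", Arrays.asList("king", "king", "queen"));
        docs.put("b", Arrays.asList("king", "duke"));
        System.out.println(tfidf(docs));
    }
}
```

Under these assumptions a term that occurs in every document scores zero, while a term confined to one document scores highest there. On a cluster, each of the intermediate maps above would be the output of its own MapReduce pass.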
  • 14. (no text on slide)
  • 15. (no text on slide)
  • 16. (no text on slide)
  • 17. (cymbeline,all,1,cymbeline,1138)
      (cymbeline,iii,12,cymbeline,1138)
      (cymbeline,vii,1,cymbeline,1138)
      (cymbeline,lady,10,cymbeline,1138)
      (cymbeline,lord,41,cymbeline,1138)
      (cymbeline,caius,26,cymbeline,1138)
      (cymbeline,first,46,cymbeline,1138)
      (cymbeline,helen,1,cymbeline,1138)
      (cymbeline,lords,1,cymbeline,1138)
      (cymbeline,queen,28,cymbeline,1138)
  • 18. (cymbeline,i,0.028319954362087934)
      (cymbeline,o,0.0028116213683223993)
      (cymbeline,s,4.0748135772788395E-5)
      (cymbeline,v,3.667332219550956E-4)
      (cymbeline,ah,8.149627154557679E-5)
      (cymbeline,am,0.0035450878122325904)
      (cymbeline,an,0.0016299254309115358)
      (cymbeline,as,0.009535063770832485)
      (cymbeline,at,0.002974613911413553)
      (cymbeline,ay,6.519701723646143E-4)
  • 19. (no text on slide)
  • 20. (comedyoferrors,syracuse,0.021138772)
      (comedyoferrors,antipholus,0.020943945)
      (comedyoferrors,dromio,0.020067222)
      (asyoulikeit,rosalind,0.016347487)
      (comedyoferrors,ephesus,0.014806883)
      (allswellthatendswell,parolles,0.010223216)
      (asyoulikeit,orlando,0.010070603)
      (comedyoferrors,adriana,0.008572405)
      (asyoulikeit,celia,0.0081392545)
      (cymbeline,imogen,0.008021425)
      (allswellthatendswell,bertram,0.007929546)
      (allswellthatendswell,helena,0.0077329455)
      (cymbeline,cymbeline,0.0074565364)
      (allswellthatendswell,lafeu,0.0072742114)
      (cymbeline,posthumus,0.006496225)
      (allswellthatendswell,countess,0.0063567436)
      (cymbeline,leonatus,0.006157291)
      (asyoulikeit,touchstone,0.0055181384)
      (cymbeline,cloten,0.0053099575)
      (cymbeline,iachimo,0.005084002)
  • 21. Some debugging tips:
      • Use describe
      • Cast explicitly
      • Use explicit schemas
      • Sample, Limit, and Dump
      • Cryptic error messages: “Scalar has more than one row in the output”
  • 22. Other tips
      • Filter early
      • Project out unused columns
      • Don’t expect Pig to know what you mean
      • UDFs and unit tests are your friends
      • Tim and Clint will tell you more
  • 23. Questions?
      Kevin Safford
      Pigout Hackday, Austin TX, May 11, 2012