Dachis group pigout_101


Introduction to Pig. Word count and TFIDF generation on Shakespeare's corpus.

Published in: Technology, Education

  • Transcript

    • 1. Introduction to Apache Pig
      Kevin Safford
      Pigout Hackday, Austin TX, May 11, 2012
      dachisgroup.com, Dachis Group, Las Vegas 2012. ® 2011 Dachis Group.
    • 2. What’s Pig?
      • Data flow engine
      • Generates MapReduce behind the scenes
      • No requirement to write any Java
      • Pig Latin language equipped with SQL-ish operators
        • join, group by, sort, filter...
    • 3. What Pig Isn’t
      • Not really a query language
      • Not a data visualization tool
      • Not always friendly
      • Not hard to learn
    • 4. Pig Data Model
      • Standard scalar types
      • Maps
      • Tuples
        • conceptually like a row
        • ordered, fixed length
      • Bag
        • unordered collection of tuples
        • not required to fit in memory
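As a rough analogy only (Pig's types are not Python's), the data model on slide 4 can be pictured with ordinary Python structures. The sample values reuse tokens that appear in the deck's output; the variable names are illustrative and not from the talk.

```python
from collections import Counter

# Tuple: ordered, fixed length, conceptually like a row.
row = ("cymbeline", "queen", 28)

# Map: keys mapped to values (a dict is the closest Python analogue).
attrs = {"document": "cymbeline", "term": "queen"}

# Bag: an unordered collection of tuples in which duplicates are allowed.
# A Counter models "unordered with duplicates"; unlike a Pig bag,
# it has to fit in memory.
bag = Counter([("WELL",), ("ALLS",), ("WELL",)])

# Insertion order does not affect equality, but multiplicity does.
same_bag = Counter([("ALLS",), ("WELL",), ("WELL",)])
print(bag == same_bag)
print(bag[("WELL",)])
```

The two bags compare equal because only the contents and their multiplicities matter, which is the property the slide calls "unordered".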
    • 5. Word Count

        package org.myorg;

        import java.io.IOException;
        import java.util.*;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.conf.*;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapreduce.*;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

        public class WordCount {

            // Mapper: emits (word, 1) for every whitespace-delimited token in a line.
            public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
                private final static IntWritable one = new IntWritable(1);
                private Text word = new Text();

                @Override
                public void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    String line = value.toString();
                    StringTokenizer tokenizer = new StringTokenizer(line);
                    while (tokenizer.hasMoreTokens()) {
                        word.set(tokenizer.nextToken());
                        context.write(word, one);
                    }
                }
            }

            // Reducer: sums the 1s emitted for each word.
            public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

                @Override
                public void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable val : values) {
                        sum += val.get();
                    }
                    context.write(key, new IntWritable(sum));
                }
            }

            // args[0] is the input path, args[1] the output path.
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();

                Job job = new Job(conf, "wordcount");
                // Lets Hadoop locate the jar containing these classes on a cluster.
                job.setJarByClass(WordCount.class);

                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                job.setMapperClass(Map.class);
                job.setReducerClass(Reduce.class);

                job.setInputFormatClass(TextInputFormat.class);
                job.setOutputFormatClass(TextOutputFormat.class);

                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));

                // Exit non-zero if the job fails.
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }
    • 6. Complete Works of Shakespeare
      http://sydney.edu.au/engineering/it/~matty/Shakespeare/
    • 7. words: {word: {tuple_of_tokens: (token: chararray)}}
      ({(Clown),(|)})
      ({(Steward),(|)})
      ({(DRAMATIS),(PERSONAE)})
      ({(LAFEU),(an),(old),(lord.)})
      ({(KING),(OF),(FRANCE),(KING:)})
      ({(DUKE),(OF),(FLORENCE),(DUKE:)})
      ({(ALLS),(WELL),(THAT),(ENDS),(WELL)})
      ({(BERTRAM),(Count),(of),(Rousillon.)})
      ({(PAROLLES),(a),(follower),(of),(Bertram.)})
      ({(|),(servants),(to),(the),(Countess),(of),(Rousillon.)})
    • 8. (OF)
      (ENDS)
      (KING)
      (THAT)
      (WELL)
      (WELL)
      (ALLS)
      (FRANCE)
      (DRAMATIS)
      (PERSONAE)
    • 9. (1,{(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)})
      (2,{(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2)})
      (3,{(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3)})
      (A,{(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),...(A)})
    • 10. (29724,the)
      (27474,and)
      (20770,i)
      (19980,to)
      (18380,of)
      (15131,a)
      (12923,my)
      (12413,you)
      (11487,in)
      (11202,that)
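The outputs on slides 7 to 10 suggest a flow of tokenize, flatten, group, count, and order. The Pig script itself is not in the transcript, so the following is only a Python sketch of that flow, not the author's code. Lowercasing is an assumption inferred from the lowercase words on slide 10, and the function names are hypothetical.

```python
from collections import Counter

def tokenize(line):
    """Split a line on whitespace, giving one 'bag' of tokens per line (slide 7)."""
    return line.split()

def word_count(lines):
    # Flatten: one token per record instead of one bag per line (slide 8).
    # Lowercasing is assumed; the deck does not show where it happens.
    tokens = [tok.lower() for line in lines for tok in tokenize(line)]
    # Group identical tokens and count each group (slides 9 and 10).
    counts = Counter(tokens)
    # Order by count, descending; ties are broken alphabetically for determinism.
    return sorted(((n, w) for w, n in counts.items()), key=lambda p: (-p[0], p[1]))

sample = ["ALLS WELL THAT ENDS WELL", "KING OF FRANCE KING:"]
for count, word in word_count(sample)[:3]:
    print((count, word))
```

Note that "KING" and "KING:" stay distinct here, as they would with plain whitespace tokenizing; stripping punctuation would be a separate step.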
    • 12. TFIDF
      • term frequency = # of times a term appears in a document
      • document frequency = # of documents the term appears in
      • TFIDF = tf * log(1/df)
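If df on slide 12 is read as a raw count, log(1/df) is zero or negative for every term, while the scores on slide 20 are positive. A minimal sketch, assuming df is instead the fraction of documents containing the term (so log(1/df) equals log(N / count)), tf is the term's share of the document's words, and the logarithm is natural. None of these choices is stated in the deck, and the function and parameter names are hypothetical.

```python
import math

def tfidf(term_count, doc_length, docs_with_term, total_docs):
    """tf * log(1/df), with tf and df both taken as fractions (an assumption)."""
    tf = term_count / doc_length        # share of the document's words
    df = docs_with_term / total_docs    # share of documents containing the term
    return tf * math.log(1 / df)

# A term found in every document scores 0;
# a term confined to one document scores higher.
print(tfidf(10, 1000, 5, 5))
print(tfidf(10, 1000, 1, 5))
```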
    • 13. Imagine the MapReduce Problem
      • MapReduce to get the number of words per document
      • MapReduce to get term frequencies
      • MapReduce to get document frequencies
      • MapReduce to get the products
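The four passes on slide 13 can be pictured as four small steps over in-memory data. This is a Python sketch in which each step stands in for one MapReduce job; it is not the Pig script from the talk. The document names, the tokens, and the use of log(N / df) are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_pipeline(docs):
    """docs maps a document name to its list of tokens."""
    # Pass 1: number of words per document.
    doc_sizes = {doc: len(tokens) for doc, tokens in docs.items()}

    # Pass 2: term frequencies, here count / document size (an assumption).
    tf = {}
    for doc, tokens in docs.items():
        for term, count in Counter(tokens).items():
            tf[(doc, term)] = count / doc_sizes[doc]

    # Pass 3: document frequencies, the number of documents containing each term.
    df = Counter(term for (_doc, term) in tf)

    # Pass 4: the products; log(N / df) is assumed so that scores are non-negative.
    n_docs = len(docs)
    return {(doc, term): f * math.log(n_docs / df[term])
            for (doc, term), f in tf.items()}

# Hypothetical two-document corpus.
docs = {
    "play_a": ["king", "queen", "king", "lord"],
    "play_b": ["lord", "clown", "lord", "steward"],
}
scores = tfidf_pipeline(docs)
for (doc, term), score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(doc, term, round(score, 4))
```

Each pass depends on the output of an earlier one, which is why hand-written MapReduce needs several chained jobs for this computation.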
    • 17. (cymbeline,all,1,cymbeline,1138)
      (cymbeline,iii,12,cymbeline,1138)
      (cymbeline,vii,1,cymbeline,1138)
      (cymbeline,lady,10,cymbeline,1138)
      (cymbeline,lord,41,cymbeline,1138)
      (cymbeline,caius,26,cymbeline,1138)
      (cymbeline,first,46,cymbeline,1138)
      (cymbeline,helen,1,cymbeline,1138)
      (cymbeline,lords,1,cymbeline,1138)
      (cymbeline,queen,28,cymbeline,1138)
    • 18. (cymbeline,i,0.028319954362087934)
      (cymbeline,o,0.0028116213683223993)
      (cymbeline,s,4.0748135772788395E-5)
      (cymbeline,v,3.667332219550956E-4)
      (cymbeline,ah,8.149627154557679E-5)
      (cymbeline,am,0.0035450878122325904)
      (cymbeline,an,0.0016299254309115358)
      (cymbeline,as,0.009535063770832485)
      (cymbeline,at,0.002974613911413553)
      (cymbeline,ay,6.519701723646143E-4)
    • 20. (comedyoferrors,syracuse,0.021138772)
      (comedyoferrors,antipholus,0.020943945)
      (comedyoferrors,dromio,0.020067222)
      (asyoulikeit,rosalind,0.016347487)
      (comedyoferrors,ephesus,0.014806883)
      (allswellthatendswell,parolles,0.010223216)
      (asyoulikeit,orlando,0.010070603)
      (comedyoferrors,adriana,0.008572405)
      (asyoulikeit,celia,0.0081392545)
      (cymbeline,imogen,0.008021425)
      (allswellthatendswell,bertram,0.007929546)
      (allswellthatendswell,helena,0.0077329455)
      (cymbeline,cymbeline,0.0074565364)
      (allswellthatendswell,lafeu,0.0072742114)
      (cymbeline,posthumus,0.006496225)
      (allswellthatendswell,countess,0.0063567436)
      (cymbeline,leonatus,0.006157291)
      (asyoulikeit,touchstone,0.0055181384)
      (cymbeline,cloten,0.0053099575)
      (cymbeline,iachimo,0.005084002)
    • 21. Some debugging tips:
      • Use describe
      • Cast explicitly
      • Use explicit schemas
      • Sample, Limit, and Dump
      • Cryptic error messages: “Scalar has more than one row in the output”
    • 22. Other tips
      • Filter early
      • Project out unused columns
      • Don’t expect Pig to know what you mean
      • UDFs and unit tests are your friends
      • Tim and Clint will tell you more
    • 23. QUESTIONS?
      Kevin Safford
      Pigout Hackday, Austin TX, May 11, 2012