
Dachis group pigout_101

Introduction to Pig. Word count and TF-IDF generation on Shakespeare's corpus.

Published in: Technology, Education

    1. dachisgroup.com. Dachis Group, Las Vegas 2012. Introduction to Apache Pig. Kevin Safford, Pigout Hackday, Austin TX, May 11, 2012. ® 2011 Dachis Group.
    2. What’s Pig?
        • Data flow engine
        • Generates MapReduce behind the scenes
        • No requirement to write any Java
        • Pig Latin language equipped with SQL-ish operators: join, group by, sort, filter...
    3. What Pig Isn’t
        • Not really a query language
        • Not a data visualization tool
        • Not always friendly
        • Not hard to learn
    4. Pig Data Model
        • Standard scalar types
        • Maps
        • Tuples
            • conceptually like a row
            • ordered, fixed length
        • Bag
            • unordered collection of tuples
            • not required to fit in memory
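A minimal Pig Latin sketch of the four kinds of type listed above. The file name `plays.tsv` and every field name are hypothetical, chosen only for illustration; they do not come from the slides.

```pig
-- Hypothetical input path and field names, for illustration only.
--   title, year : scalars
--   meta        : map with chararray keys
--   opening     : tuple (ordered, fixed length, conceptually like a row)
--   tokens      : bag (unordered collection of tuples)
plays = LOAD 'plays.tsv' AS (
    title:chararray,
    year:int,
    meta:map[],
    opening:(act:int, scene:int),
    tokens:{t:(token:chararray)}
);

-- Print the declared schema of the relation.
DESCRIBE plays;
```

Declaring the schema in the `LOAD` statement is optional in Pig, but it makes the nesting of tuples inside bags visible before any data is processed.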
    5. Word Count

        package org.myorg;

        import java.io.IOException;
        import java.util.*;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.conf.*;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapreduce.*;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

        public class WordCount {

            // Mapper: emits (token, 1) for every whitespace-separated token in a line.
            public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
                private final static IntWritable one = new IntWritable(1);
                private Text word = new Text();

                @Override
                public void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    String line = value.toString();
                    StringTokenizer tokenizer = new StringTokenizer(line);
                    while (tokenizer.hasMoreTokens()) {
                        word.set(tokenizer.nextToken());
                        context.write(word, one);
                    }
                }
            }

            // Reducer: sums the 1s emitted for each token.
            public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

                @Override
                public void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable val : values) {
                        sum += val.get();
                    }
                    context.write(key, new IntWritable(sum));
                }
            }

            // Driver: args[0] is the input path, args[1] the output path.
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();

                Job job = new Job(conf, "wordcount");
                // Lets Hadoop locate the jar containing these classes on the cluster.
                job.setJarByClass(WordCount.class);

                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                job.setMapperClass(Map.class);
                job.setReducerClass(Reduce.class);

                job.setInputFormatClass(TextInputFormat.class);
                job.setOutputFormatClass(TextOutputFormat.class);

                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));

                // Exit non-zero if the job fails.
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }
    6. Complete Works of Shakespeare: http://sydney.edu.au/engineering/it/~matty/Shakespeare/
    7. words: {word: {tuple_of_tokens: (token: chararray)}}
        ({(Clown),(|)})
        ({(Steward),(|)})
        ({(DRAMATIS),(PERSONAE)})
        ({(LAFEU),(an),(old),(lord.)})
        ({(KING),(OF),(FRANCE),(KING:)})
        ({(DUKE),(OF),(FLORENCE),(DUKE:)})
        ({(ALLS),(WELL),(THAT),(ENDS),(WELL)})
        ({(BERTRAM),(Count),(of),(Rousillon.)})
        ({(PAROLLES),(a),(follower),(of),(Bertram.)})
        ({(|),(servants),(to),(the),(Countess),(of),(Rousillon.)})
    8. (OF)
        (ENDS)
        (KING)
        (THAT)
        (WELL)
        (WELL)
        (ALLS)
        (FRANCE)
        (DRAMATIS)
        (PERSONAE)
    9. (1,{(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)})
        (2,{(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2)})
        (3,{(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3)})
        (A,{(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),...(A)})
    10. (29724,the)
        (27474,and)
        (20770,i)
        (19980,to)
        (18380,of)
        (15131,a)
        (12923,my)
        (12413,you)
        (11487,in)
        (11202,that)
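The outputs on slides 7 to 10 are consistent with a tokenize, flatten, group, count, sort pipeline. The following is a minimal Pig Latin sketch of such a word count; it is not the presenter's script. The input path `shakespeare` and all relation names other than `words` and `word` (which appear in the schema on slide 7) are assumptions made for illustration.

```pig
-- Hypothetical input: a directory of plain-text plays, one line of text per record.
lines   = LOAD 'shakespeare' USING TextLoader() AS (line:chararray);

-- TOKENIZE returns a bag of single-field tuples, one tuple per token.
words   = FOREACH lines GENERATE TOKENIZE(line) AS word;

-- FLATTEN un-nests the bag so that each token becomes its own record.
tokens  = FOREACH words GENERATE FLATTEN(word) AS token;

-- Fold case so that "The" and "the" are counted together.
lowered = FOREACH tokens GENERATE LOWER(token) AS token;

-- Group identical tokens and count the size of each group.
grouped = GROUP lowered BY token;
counts  = FOREACH grouped GENERATE COUNT(lowered) AS n, group AS token;

-- Most frequent tokens first; keep the top ten.
sorted  = ORDER counts BY n DESC;
top10   = LIMIT sorted 10;
DUMP top10;
```

Each statement names an intermediate relation, so any step can be inspected with `DESCRIBE` or `DUMP` while the script is being developed. Compare the length with the Java MapReduce listing on slide 5.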
    12. TF-IDF
        • term frequency (tf) = # of times a term appears in a document
        • document frequency (df) = # of documents the term appears in
        • TF-IDF = tf * log(1/df), with df taken as a fraction of all documents (equivalently tf * log(N/df) for N documents)
    13. Imagine the MapReduce Problem
        • MapReduce to get the number of words per document
        • MapReduce to get term frequencies
        • MapReduce to get document frequencies
        • MapReduce to get the products
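The four steps above can be sketched in Pig Latin as follows. This is an illustration under stated assumptions, not the script from the talk: the input is assumed to be tab-separated `(doc, line)` records at the hypothetical path `shakespeare_lines.tsv`, term frequency is assumed to be normalised by document length, and the inverse document frequency is taken as `log(N/df)` with `N` the number of documents. All relation and field names are invented for the example.

```pig
-- Hypothetical input: tab-separated (doc, line) records, one per line of text.
lines  = LOAD 'shakespeare_lines.tsv' AS (doc:chararray, line:chararray);
tokens = FOREACH lines GENERATE doc, FLATTEN(TOKENIZE(LOWER(line))) AS term;

-- Step 1: number of words per document.
by_doc    = GROUP tokens BY doc;
doc_sizes = FOREACH by_doc GENERATE group AS doc, COUNT(tokens) AS doc_size;

-- Step 2: term counts per document, joined to the document size
-- (assumption: tf = count / document size).
by_doc_term = GROUP tokens BY (doc, term);
term_counts = FOREACH by_doc_term GENERATE FLATTEN(group) AS (doc, term),
                                           COUNT(tokens) AS n;
with_sizes  = JOIN term_counts BY doc, doc_sizes BY doc;
term_freqs  = FOREACH with_sizes GENERATE
                  term_counts::doc  AS doc,
                  term_counts::term AS term,
                  (double)term_counts::n / (double)doc_sizes::doc_size AS tf;

-- Step 3: document frequency per term, plus the total number of documents N.
by_term   = GROUP term_counts BY term;
doc_freqs = FOREACH by_term GENERATE group AS term, COUNT(term_counts) AS df;
all_docs  = GROUP doc_sizes ALL;
n_docs    = FOREACH all_docs GENERATE COUNT(doc_sizes) AS n;

-- Step 4: the products. n_docs has exactly one row, so it can be used as a scalar.
with_df = JOIN term_freqs BY term, doc_freqs BY term;
tfidf   = FOREACH with_df GENERATE
              term_freqs::doc  AS doc,
              term_freqs::term AS term,
              term_freqs::tf * LOG((double)n_docs.n / (double)doc_freqs::df) AS score;

ranked = ORDER tfidf BY score DESC;
STORE ranked INTO 'tfidf_out';
```

The `with_sizes` relation has the shape `(doc, term, n, doc, doc_size)`, the same five-field shape as the tuples on slide 17. The sketch makes no claim to reproduce the scores shown on later slides.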
    17. (cymbeline,all,1,cymbeline,1138)
        (cymbeline,iii,12,cymbeline,1138)
        (cymbeline,vii,1,cymbeline,1138)
        (cymbeline,lady,10,cymbeline,1138)
        (cymbeline,lord,41,cymbeline,1138)
        (cymbeline,caius,26,cymbeline,1138)
        (cymbeline,first,46,cymbeline,1138)
        (cymbeline,helen,1,cymbeline,1138)
        (cymbeline,lords,1,cymbeline,1138)
        (cymbeline,queen,28,cymbeline,1138)
    18. (cymbeline,i,0.028319954362087934)
        (cymbeline,o,0.0028116213683223993)
        (cymbeline,s,4.0748135772788395E-5)
        (cymbeline,v,3.667332219550956E-4)
        (cymbeline,ah,8.149627154557679E-5)
        (cymbeline,am,0.0035450878122325904)
        (cymbeline,an,0.0016299254309115358)
        (cymbeline,as,0.009535063770832485)
        (cymbeline,at,0.002974613911413553)
        (cymbeline,ay,6.519701723646143E-4)
    20. (comedyoferrors,syracuse,0.021138772)
        (comedyoferrors,antipholus,0.020943945)
        (comedyoferrors,dromio,0.020067222)
        (asyoulikeit,rosalind,0.016347487)
        (comedyoferrors,ephesus,0.014806883)
        (allswellthatendswell,parolles,0.010223216)
        (asyoulikeit,orlando,0.010070603)
        (comedyoferrors,adriana,0.008572405)
        (asyoulikeit,celia,0.0081392545)
        (cymbeline,imogen,0.008021425)
        (allswellthatendswell,bertram,0.007929546)
        (allswellthatendswell,helena,0.0077329455)
        (cymbeline,cymbeline,0.0074565364)
        (allswellthatendswell,lafeu,0.0072742114)
        (cymbeline,posthumus,0.006496225)
        (allswellthatendswell,countess,0.0063567436)
        (cymbeline,leonatus,0.006157291)
        (asyoulikeit,touchstone,0.0055181384)
        (cymbeline,cloten,0.0053099575)
        (cymbeline,iachimo,0.005084002)
    21. Some debugging tips:
        • Use DESCRIBE
        • Cast explicitly
        • Use explicit schemas
        • SAMPLE, LIMIT, and DUMP
        • Cryptic error messages: “Scalar has more than one row in the output”
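A short Pig Latin sketch of the debugging tips above, assuming a hypothetical tab-separated file `word_counts.tsv` of `(word, n)` pairs. The path and all relation and field names are invented for illustration.

```pig
-- Explicit schema at load time (hypothetical path and field names).
counts = LOAD 'word_counts.tsv' AS (word:chararray, n:long);

-- DESCRIBE prints the schema of a relation.
DESCRIBE counts;

-- SAMPLE and LIMIT keep the data small while developing; DUMP prints it.
some = SAMPLE counts 0.01;
few  = LIMIT some 10;
DUMP few;

-- A relation used as a scalar must contain exactly one row.
-- GROUP ... ALL collapses everything into a single group, so total has one row.
all_counts = GROUP counts ALL;
total      = FOREACH all_counts GENERATE SUM(counts.n) AS n;

-- Explicit casts avoid integer division when computing each word's share.
shares = FOREACH counts GENERATE word, (double)n / (double)total.n AS share;
DUMP shares;
```

The error "Scalar has more than one row in the output" is raised when the relation on the left of a scalar projection such as `total.n` has more than one row, for example if the `GROUP ... ALL` step were replaced by a grouping on `word`.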
    22. Other tips
        • Filter early
        • Project out unused columns
        • Don’t expect Pig to know what you mean
        • UDFs and unit tests are your friends (Tim and Clint will tell you more)
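A minimal sketch of "filter early" and "project out unused columns", assuming a hypothetical file `term_counts.tsv` with a `note` column that the rest of the script never reads. Path, names, and the threshold are illustrative only.

```pig
-- Hypothetical input with one column (note) that later steps do not need.
raw = LOAD 'term_counts.tsv'
      AS (doc:chararray, term:chararray, n:long, note:chararray);

-- Project out the unused column before any grouping or joining.
slim = FOREACH raw GENERATE doc, term, n;

-- Filter early, so that fewer records reach the GROUP below.
frequent = FILTER slim BY n > 1;

by_doc = GROUP frequent BY doc;
totals = FOREACH by_doc GENERATE group AS doc, SUM(frequent.n) AS total;
DUMP totals;
```

Placing the projection and the filter ahead of the `GROUP` reduces the amount of data carried into the grouping step.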
    23. QUESTIONS? Kevin Safford, Pigout Hackday, Austin TX, May 11, 2012
