dachisgroup.com




Dachis Group
Las Vegas 2012




  Introduction to Apache Pig


    Kevin Safford
    Pigout Hackday, Austin TX
    May 11, 2012
® 2011 Dachis Group.




What’s Pig?


      •          Data flow engine

      •          Generates MapReduce jobs behind the scenes

                       •   No requirement to write any Java

      •          Pig Latin language equipped with SQL-ish operators

                       •   join, group by, sort, filter...




What Pig Isn’t




•          Not really a query language

•          Not a data visualization tool

•          Not always friendly

•          Not hard to learn





Pig Data Model


      •          Standard scalar types
      •          Maps
      •          Tuples
        •          conceptually like a row
        •          ordered, fixed length
      •          Bags
        •          unordered collection of tuples
        •          not required to fit in memory
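
A rough in-memory analogy for the types above, written in plain Java because that is the only language shown in this deck. The class and helper names (PigDataModelSketch, tuple, bag) are made up for illustration and are not part of Pig's API. Unlike a real Pig bag, the bag here has to fit in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain Java collections standing in for Pig's complex types.
class PigDataModelSketch {

    // Tuple: an ordered, fixed-length sequence of fields, conceptually a row.
    static List<Object> tuple(Object... fields) {
        return Collections.unmodifiableList(Arrays.asList(fields));
    }

    // Bag: a collection of tuples. Iteration order carries no meaning.
    @SafeVarargs
    static List<List<Object>> bag(List<Object>... tuples) {
        return new ArrayList<>(Arrays.asList(tuples));
    }

    public static void main(String[] args) {
        // A row shaped like the (count, word) output later in the deck.
        List<Object> row = tuple(29724, "the");

        // A bag shaped like one group from the grouped-words output.
        List<List<Object>> wells = bag(tuple("WELL"), tuple("WELL"));

        // Map: string keys to values of any type.
        Map<String, Object> info = new HashMap<>();
        info.put("play", "cymbeline");
        info.put("words", wells.size());

        System.out.println(row + " " + wells + " " + info);
    }
}
```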




   Word Count
   package org.myorg;

   import java.io.IOException;
   import java.util.StringTokenizer;

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hadoop.mapreduce.Mapper;
   import org.apache.hadoop.mapreduce.Reducer;
   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
   import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
   import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

   public class WordCount {

       // Mapper: emits (word, 1) for every whitespace-separated token in a line.
       public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
           private static final IntWritable ONE = new IntWritable(1);
           private final Text word = new Text();

           @Override
           public void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
               StringTokenizer tokenizer = new StringTokenizer(value.toString());
               while (tokenizer.hasMoreTokens()) {
                   word.set(tokenizer.nextToken());
                   context.write(word, ONE);
               }
           }
       }

       // Reducer: sums the 1s emitted for each word.
       public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

           @Override
           public void reduce(Text key, Iterable<IntWritable> values, Context context)
                   throws IOException, InterruptedException {
               int sum = 0;
               for (IntWritable val : values) {
                   sum += val.get();
               }
               context.write(key, new IntWritable(sum));
           }
       }

       // Driver: args[0] is the input path, args[1] is the output path.
       public static void main(String[] args) throws Exception {
           Configuration conf = new Configuration();

           // Job.getInstance replaces the deprecated Job constructor (Hadoop 2+).
           Job job = Job.getInstance(conf, "wordcount");
           // Tells Hadoop which jar to ship to the cluster.
           job.setJarByClass(WordCount.class);

           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(IntWritable.class);

           job.setMapperClass(Map.class);
           job.setReducerClass(Reduce.class);

           job.setInputFormatClass(TextInputFormat.class);
           job.setOutputFormatClass(TextOutputFormat.class);

           FileInputFormat.addInputPath(job, new Path(args[0]));
           FileOutputFormat.setOutputPath(job, new Path(args[1]));

           // Exit non-zero if the job fails.
           System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
   }







Complete Works of Shakespeare




                http://sydney.edu.au/engineering/it/~matty/Shakespeare/








words: {word: {tuple_of_tokens: (token: chararray)}}

({(Clown),(|)})
({(Steward),(|)})
({(DRAMATIS),(PERSONAE)})
({(LAFEU),(an),(old),(lord.)})
({(KING),(OF),(FRANCE),(KING:)})
({(DUKE),(OF),(FLORENCE),(DUKE:)})
({(ALL'S),(WELL),(THAT),(ENDS),(WELL)})
({(BERTRAM),(Count),(of),(Rousillon.)})
({(PAROLLES),(a),(follower),(of),(Bertram.)})
({(|),(servants),(to),(the),(Countess),(of),(Rousillon.)})




(OF)
(ENDS)
(KING)
(THAT)
(WELL)
(WELL)
(ALL'S)
(FRANCE)
(DRAMATIS)
(PERSONAE)




(1,{(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),
(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)})

(2,{(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),
(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2),(2)})

(3,{(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),(3),
(3)})

(A,{(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),
(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),(A),...
(A)})





(29724,the)
(27474,and)
(20770,i)
(19980,to)
(18380,of)
(15131,a)
(12923,my)
(12413,you)
(11487,in)
(11202,that)
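
The dumps above trace one data flow: tokenize each line, flatten the tokens, group identical words, count each group, and order by count. The sketch below mimics those stages in memory with Java streams. It is not the Pig script used in the talk. The class name WordCountDataflowSketch and the normalization rule (lower-casing, dropping everything except letters and apostrophes) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical in-memory imitation of the tokenize / group / count / order flow.
class WordCountDataflowSketch {

    // Returns (word, count) entries, highest count first, ties broken alphabetically.
    static List<Map.Entry<String, Long>> countWords(List<String> lines) {
        Map<String, Long> counts = lines.stream()
            // Tokenize each line and flatten into a single stream of tokens.
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            // Assumed normalization: lower-case, keep only letters and apostrophes.
            .map(token -> token.toLowerCase(Locale.ROOT).replaceAll("[^a-z']", ""))
            .filter(token -> !token.isEmpty())
            // Group identical words and count the members of each group.
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        return counts.entrySet().stream()
            .sorted((a, b) -> {
                int byCount = Long.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            })
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "ALL'S WELL THAT ENDS WELL",
            "KING OF FRANCE KING:");
        for (Map.Entry<String, Long> entry : countWords(lines)) {
            // Printed in the same (count,word) shape as the final dump.
            System.out.println("(" + entry.getValue() + "," + entry.getKey() + ")");
        }
    }
}
```

Everything here runs in a single JVM, whereas Pig compiles the equivalent flow to MapReduce jobs.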




TF-IDF


term frequency (tf) = # of times a term appears in a document

document frequency (df) = # of documents the term appears in

TF-IDF = tf * log(1/df)







Imagine the MapReduce Problem

MapReduce to get the number of words per document

MapReduce to get term frequencies

MapReduce to get document frequencies

MapReduce to get the products
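
The four passes listed above can be sketched in memory in a few lines of Java. This is only an illustration under stated assumptions, not the MapReduce or Pig code from the talk: the class name TfIdfSketch is hypothetical, tf is taken as the term count divided by the number of words in the document, df as the fraction of documents containing the term, and the logarithm as the natural log. The slides do not spell out any of these choices.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sketch of the four TF-IDF passes.
class TfIdfSketch {

    // Input: document name -> tokens. Output: document name -> (term -> score).
    static Map<String, Map<String, Double>> tfidf(Map<String, List<String>> docs) {
        int numDocs = docs.size();

        // Document frequencies: in how many documents each term occurs.
        Map<String, Integer> docsWithTerm = new HashMap<>();
        for (List<String> tokens : docs.values()) {
            for (String term : new HashSet<>(tokens)) {
                docsWithTerm.merge(term, 1, Integer::sum);
            }
        }

        Map<String, Map<String, Double>> result = new HashMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            List<String> tokens = doc.getValue();

            // Number of words in this document.
            int wordsInDoc = tokens.size();

            // Raw count of each term in this document.
            Map<String, Integer> termCounts = new HashMap<>();
            for (String term : tokens) {
                termCounts.merge(term, 1, Integer::sum);
            }

            // Products: tf * log(1/df), with df normalized (an assumption).
            Map<String, Double> scores = new HashMap<>();
            for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
                double tf = (double) e.getValue() / wordsInDoc;
                double df = (double) docsWithTerm.get(e.getKey()) / numDocs;
                scores.put(e.getKey(), tf * Math.log(1.0 / df));
            }
            result.put(doc.getKey(), scores);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new HashMap<>();
        docs.put("playA", Arrays.asList("king", "queen", "king"));
        docs.put("playB", Arrays.asList("king", "clown"));
        System.out.println(tfidf(docs));
    }
}
```

With df normalized this way, a term that appears in every document scores zero, and terms concentrated in one document score highest.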




(cymbeline,all,1,cymbeline,1138)
(cymbeline,iii,12,cymbeline,1138)
(cymbeline,vii,1,cymbeline,1138)
(cymbeline,lady,10,cymbeline,1138)
(cymbeline,lord,41,cymbeline,1138)
(cymbeline,caius,26,cymbeline,1138)
(cymbeline,first,46,cymbeline,1138)
(cymbeline,helen,1,cymbeline,1138)
(cymbeline,lords,1,cymbeline,1138)
(cymbeline,queen,28,cymbeline,1138)




(cymbeline,i,0.028319954362087934)
(cymbeline,o,0.0028116213683223993)
(cymbeline,s,4.0748135772788395E-5)
(cymbeline,v,3.667332219550956E-4)
(cymbeline,ah,8.149627154557679E-5)
(cymbeline,am,0.0035450878122325904)
(cymbeline,an,0.0016299254309115358)
(cymbeline,as,0.009535063770832485)
(cymbeline,at,0.002974613911413553)
(cymbeline,ay,6.519701723646143E-4)




(comedyoferrors,syracuse,0.021138772)
(comedyoferrors,antipholus,0.020943945)
(comedyoferrors,dromio,0.020067222)
(asyoulikeit,rosalind,0.016347487)
(comedyoferrors,ephesus,0.014806883)
(allswellthatendswell,parolles,0.010223216)
(asyoulikeit,orlando,0.010070603)
(comedyoferrors,adriana,0.008572405)
(asyoulikeit,celia,0.0081392545)
(cymbeline,imogen,0.008021425)
(allswellthatendswell,bertram,0.007929546)
(allswellthatendswell,helena,0.0077329455)
(cymbeline,cymbeline,0.0074565364)
(allswellthatendswell,lafeu,0.0072742114)
(cymbeline,posthumus,0.006496225)
(allswellthatendswell,countess,0.0063567436)
(cymbeline,leonatus,0.006157291)
(asyoulikeit,touchstone,0.0055181384)
(cymbeline,cloten,0.0053099575)
(cymbeline,iachimo,0.005084002)






Some debugging tips:


Use describe

Cast explicitly

Use explicit schemas

Sample, Limit, and Dump

Cryptic error messages:
         “Scalar has more than one row in the output”




Other tips


Filter early

Project out unused columns

Don’t expect Pig to know what you mean

UDFs and Unit Tests are your friends
        Tim and Clint will tell you more










                  QUESTIONS?

