Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Essentials of Pig
Mastering Hadoop MapReduce for Data Analysis


Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com




Session Agenda

• What is Pig and why should you use it?


• Installing & Setting up Pig


• Pig’s Components


• Using Pig with Hadoop MapReduce


• Summary & Conclusion




What is Pig?

• Higher-level abstraction for Hadoop MapReduce


• An infrastructure for data analysis using a scripting language


   • named Pig Latin




Why should you use Pig?

• Hadoop MapReduce:


  • Requires you to be a programmer


  • Forces you to design all your algorithms in terms of the map and reduce
    primitives




Installing & Setting Up Pig -- Required Software

• Required Software:


  • Java 1.6.x


  • Hadoop 0.20.x


  • Ant 1.7+ (for builds)


  • JUnit 4.5 (for tests)


  • Cygwin (on Windows)




Download

• Source: http://pig.apache.org/


• Version:


   • 0.8.1 -- current stable (at the time of this talk)




Install & Configure

• Extract: tar zxvf pig-0.8.1.tar.gz


• Move & Create Symbolic Link:


   • ln -s pig-0.8.1 pig


• Edit: bin/pig


   • export PIG_CLASSPATH=$HADOOP_HOME/conf




Verify Installation

• Verify:


   (remember to start Hadoop first.)


   • bin/pig -help (command options)


   • bin/pig (run the grunt shell)




Running Pig

• Run Mode


   • Local Mode -- single machine


   • MapReduce Mode -- needs a Hadoop cluster (with HDFS)


• Run via:


   • grunt shell


   • pig scripts


   • embedded programs




Pig IDE

• PigPen, an Eclipse-based IDE


  • graphical data flow definition


  • can show example data flow




Pig Components

• Pig Latin


• Pig Engine


   • execution engine on top of Hadoop


   • includes default optimal configurations




A client for your cluster

• Pig does not need to be installed on the Hadoop cluster


• It runs on a client machine, connects to the cluster, and submits MapReduce jobs to it




Pig Latin

• Data flow language (not declarative like SQL)


• Increases productivity (fewer lines do more)


• Includes standard operations like join, filter, group, sort


• User code and existing binaries can be included


• Supports nested data types


• Does not require metadata (schemas are optional)




Pig Latin Example

• Will leverage the tutorial that comes with the distribution


• Check the tutorial folder in the distribution




Start Grunt Shell

• cd $PIG_HOME


• bin/pig -x local




Aggregate Data

• grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, timestamp,
  query);


   • alternate delimiters can be used (for example, USING PigStorage(',')) and
     de-serializers like PigJsonLoader can be leveraged


• grunt> grouped = GROUP log BY user;


• grunt> counted = FOREACH grouped GENERATE group, COUNT(log);


• grunt> STORE counted INTO 'output';
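
As a rough illustration of what the GROUP and FOREACH ... COUNT steps compute, here is a minimal Python sketch. It only models the semantics on a single machine; it is not how Pig executes the script. The sample rows and the helper names (group_by_user, count_per_group) are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows shaped like (user, timestamp, query).
log = [
    ("u1", "970916105432", "yahoo chat"),
    ("u2", "970916001949", "pig latin"),
    ("u1", "970916105501", "hadoop"),
]

def group_by_user(rows):
    """Model of: grouped = GROUP log BY user;"""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[0]].append(row)  # the bag keeps the whole row
    return grouped

def count_per_group(grouped):
    """Model of: counted = FOREACH grouped GENERATE group, COUNT(log);"""
    return [(user, len(bag)) for user, bag in grouped.items()]

counted = count_per_group(group_by_user(log))
print(sorted(counted))
```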




Group Data

• grunt> grouped = GROUP log BY user;


• In Pig, the GROUP operation generates (key, collection) pairs, where each
  collection is a bag of tuples.


   • Every tuple in a collection has the same grouping-key value as the key of
     its (key, collection) pair
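
The shape of the grouped relation can be pictured with a small Python model. This is a sketch under the assumption that a bag behaves like a list of the original tuples; pig_group is a hypothetical helper, not a Pig API.

```python
from itertools import groupby
from operator import itemgetter

log = [("u1", "t1", "q1"), ("u2", "t2", "q2"), ("u1", "t3", "q3")]

def pig_group(rows, key_index):
    """Return (key, bag) pairs; each bag holds the full original tuples."""
    by_key = itemgetter(key_index)
    ordered = sorted(rows, key=by_key)  # groupby needs equal keys to be adjacent
    return [(key, list(bag)) for key, bag in groupby(ordered, key=by_key)]

grouped = pig_group(log, 0)
for key, bag in grouped:
    # every tuple inside the bag carries the same key as the pair itself
    assert all(row[0] == key for row in bag)
    print(key, bag)
```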




Filter Data

• grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);


• grunt> grouped = GROUP log BY user;


• grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;


• grunt> filtered = FILTER counted BY cnt > 75;


• grunt> STORE filtered INTO 'output1';
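
The FILTER step simply drops (user, cnt) rows that fail the predicate. A minimal Python sketch of the count-then-filter flow follows; the function name and sample data are illustrative assumptions, and a small threshold is used so the toy input produces output.

```python
from collections import Counter

def users_over_threshold(users, threshold):
    """Count rows per user, then keep only users with cnt > threshold."""
    counted = Counter(users)  # GROUP ... BY user + COUNT(log) AS cnt
    # FILTER counted BY cnt > threshold
    return {user: cnt for user, cnt in counted.items() if cnt > threshold}

users = ["a"] * 3 + ["b"] * 1 + ["c"] * 2
filtered = users_over_threshold(users, 1)
print(filtered)
```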




Order Data

• grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);


• grunt> grouped = GROUP log BY user;


• grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;


• grunt> filtered = FILTER counted BY cnt > 50;


• grunt> sorted = ORDER filtered BY cnt;


• grunt> STORE sorted INTO 'output2';
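
ORDER ... BY sorts ascending unless DESC is specified. The following Python sketch models that on already-filtered (user, cnt) rows; order_by_cnt and the sample values are hypothetical.

```python
def order_by_cnt(counted, descending=False):
    """Model of: sorted = ORDER filtered BY cnt; (ascending by default)."""
    return sorted(counted, key=lambda pair: pair[1], reverse=descending)

filtered = [("a", 90), ("b", 55), ("c", 120)]
sorted_rows = order_by_cnt(filtered)
print(sorted_rows)
```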




Join Data Example

• Words appearing in Adventures of Huckleberry Finn by Mark Twain


  • http://www.gutenberg.org/ebooks/76


• Words appearing in The Adventures of Sherlock Holmes by Sir Arthur Conan
  Doyle


  • http://www.gutenberg.org/ebooks/1661




Loading & Counting Huckleberry Finn Data

• grunt> A = load 'pg76.txt';


• grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;


• grunt> C = filter B by word matches '\\w+';


• grunt> D = group C by word;


• grunt> E = foreach D generate COUNT(C), group;


• grunt> store E into 'huckleberry_finn_freq';
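
To make the tokenize, filter, and count steps concrete, here is a minimal Python model of the same flow. It assumes TOKENIZE splits on space, double quote, comma, parentheses, and star, and that matches must cover the whole token (Java regex semantics); the sample lines and the word_freq helper are made up for illustration.

```python
import re
from collections import Counter

TOKEN_DELIMS = re.compile(r'[ ",()*]+')  # assumed TOKENIZE separators
WORD = re.compile(r"\w+", re.ASCII)       # Java's \w is ASCII-only by default

def word_freq(lines):
    """Model of steps B through E: tokenize, keep pure-word tokens, count."""
    counts = Counter()
    for line in lines:
        for token in TOKEN_DELIMS.split(line):
            # fullmatch mirrors 'matches': "end." is rejected, as is ""
            if WORD.fullmatch(token):
                counts[token] += 1
    return counts

freq = word_freq(["the river, the raft", "the end."])
print(freq.most_common())
```

Note that tokens with trailing punctuation such as "end." are discarded rather than cleaned, which is also what the filter in the script does.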




Loading & Counting Sherlock Holmes Data

• grunt> A = load 'pg1661.txt';


• grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;


• grunt> C = filter B by word matches '\\w+';


• grunt> D = group C by word;


• grunt> E = foreach D generate COUNT(C), group;


• grunt> store E into 'sherlock_holmes_freq';




Join Data

• grunt> hf = LOAD 'huckleberry_finn_freq' AS (freq, word);


• grunt> sh = LOAD 'sherlock_holmes_freq' AS (freq, word);


• grunt> inboth = JOIN hf BY word, sh BY word;


• grunt> STORE inboth INTO 'output3';
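
The JOIN above is an inner join on word: only words present in both inputs survive, and each output row carries the fields of both sides. A minimal Python sketch of that behaviour, with hypothetical frequencies and a made-up inner_join helper:

```python
from collections import defaultdict

def inner_join(left, right):
    """Model of: inboth = JOIN hf BY word, sh BY word; on (freq, word) rows."""
    by_word = defaultdict(list)
    for row in right:
        by_word[row[1]].append(row)
    # each output row concatenates the matching left and right tuples
    return [l + r for l in left for r in by_word.get(l[1], [])]

hf = [(10, "river"), (4, "raft"), (7, "the")]
sh = [(3, "the"), (2, "violin"), (1, "river")]
inboth = inner_join(hf, sh)
print(inboth)
```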




Set Difference (A - B, in A but not in B; here A = sh, B = hf)

• hf = LOAD 'huckleberry_finn_freq' AS (freq, word);


• sh = LOAD 'sherlock_holmes_freq' AS (freq, word);


• grouped = COGROUP hf BY word, sh BY word;


• not_in_hf = FILTER grouped BY COUNT(hf) == 0;


• out = FOREACH not_in_hf GENERATE FLATTEN(sh);


• STORE out INTO 'output4';
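
The COGROUP, FILTER, FLATTEN sequence keeps the Sherlock Holmes rows whose word never occurs in Huckleberry Finn. The Python sketch below models that logic; the helper names and sample rows are assumptions for illustration only.

```python
def cogroup(hf, sh):
    """Model of COGROUP: (word, hf_bag, sh_bag) for every word in either input."""
    keys = {word for _, word in hf} | {word for _, word in sh}
    return [
        (k, [r for r in hf if r[1] == k], [r for r in sh if r[1] == k])
        for k in sorted(keys)
    ]

def only_in_sh(hf, sh):
    # FILTER grouped BY COUNT(hf) == 0  -> the hf bag is empty
    not_in_hf = [g for g in cogroup(hf, sh) if len(g[1]) == 0]
    # FOREACH not_in_hf GENERATE FLATTEN(sh) -> unnest the sh bag into rows
    return [row for _, _, sh_bag in not_in_hf for row in sh_bag]

hf = [(10, "river"), (7, "the")]
sh = [(3, "the"), (2, "violin")]
out = only_in_sh(hf, sh)
print(out)
```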




Cogroup Data

• Extends the idea of grouping to multiple collections


• Instead of a (key, collection) pair, it emits a key and one collection of
  tuples for each of the input collections


   • With two sources of input it would be (key, collection1, collection2), where
     tuples from the first source will be in collection1 and tuples from the
     second source will be in collection2.
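
The (key, collection1, collection2) shape can be sketched in Python as follows. This is a single-machine model with a hypothetical cogroup helper; it also shows that a key missing from one source gets an empty collection on that side.

```python
from collections import defaultdict

def cogroup(first, second, key):
    """Return (key, collection1, collection2) triples for two inputs."""
    bags = defaultdict(lambda: ([], []))
    for row in first:
        bags[key(row)][0].append(row)   # tuples from the first source
    for row in second:
        bags[key(row)][1].append(row)   # tuples from the second source
    return [(k, c1, c2) for k, (c1, c2) in sorted(bags.items())]

hf = [(10, "river"), (7, "the")]
sh = [(3, "the"), (2, "violin")]
grouped = cogroup(hf, sh, key=lambda row: row[1])
for triple in grouped:
    print(triple)
```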




Data types Supported

• int, long, float, double, chararray, bytearray


• map, tuple (ordered), bag (unordered)




Data type Declaration

• hf = LOAD 'huckleberry_finn_freq' AS (freq:int, word:chararray);


   • explicit data type declaration


• hf = LOAD 'huckleberry_finn_freq' AS (freq, word);


• weighted = FOREACH hf GENERATE freq * 100;


   • type inference, freq cast to int
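
A field with no declared type is a bytearray in Pig, and using it in arithmetic with an int constant causes it to be cast to int. The Python sketch below mimics that idea with raw bytes; load_untyped and weighted are hypothetical helpers, and the tab-delimited input line is an assumption about the stored file.

```python
def load_untyped(line):
    """Fields loaded without a declared type stay as raw bytes (like bytearray)."""
    return tuple(line.encode("utf-8").split(b"\t"))

def weighted(row):
    """Model of: weighted = FOREACH hf GENERATE freq * 100;"""
    freq_raw, _word = row
    # multiplying by the int constant 100 forces freq to be read as an int
    return int(freq_raw) * 100

row = load_untyped("12\tthe")
print(row, weighted(row))
```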








Custom Extensions

• User defined functions can be called from Pig scripts


• Nested operations can be carried out


   • FOREACH grouped { sorted = ORDER hf BY counted;


   • GENERATE group, CustomFunction(sorted); }


• Flow can be split: SPLIT A INTO Negative IF $0 < 0, Positive IF $0 > 0;
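
SPLIT routes each tuple to every output whose condition it satisfies, so a tuple can end up in no output at all. With the two conditions above, rows whose first field is exactly 0 are dropped. A minimal Python model, with made-up sample rows and a hypothetical split_by_sign helper:

```python
def split_by_sign(rows):
    """Model of: SPLIT A INTO Negative IF $0 < 0, Positive IF $0 > 0;"""
    negative = [r for r in rows if r[0] < 0]
    positive = [r for r in rows if r[0] > 0]
    return negative, positive

rows = [(-2, "a"), (0, "b"), (5, "c")]
negative, positive = split_by_sign(rows)
print(negative, positive)  # the (0, "b") row lands in neither output
```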




Questions?




• blog: shanky.org | twitter: @tshanky


• st@treasuryofideas.com

More Related Content

What's hot

Elasticsearch - Devoxx France 2012 - English version
Elasticsearch - Devoxx France 2012 - English versionElasticsearch - Devoxx France 2012 - English version
Elasticsearch - Devoxx France 2012 - English version
David Pilato
 
An Introduction to Apache Pig
An Introduction to Apache PigAn Introduction to Apache Pig
An Introduction to Apache Pig
Sachin Vakkund
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
Victor Sanchez Anguix
 
Beginning hive and_apache_pig
Beginning hive and_apache_pigBeginning hive and_apache_pig
Beginning hive and_apache_pig
Mohamed Ali Mahmoud khouder
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latin
knowbigdata
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
tfmailru
 
06 pig-01-intro
06 pig-01-intro06 pig-01-intro
06 pig-01-intro
Aasim Naveed
 
Asset Pipeline
Asset PipelineAsset Pipeline
Asset Pipeline
Eric Berry
 
Oozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY WayOozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY Way
DataWorks Summit
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013
Gera Shegalov
 
API Design Antipatterns - APICon SF
API Design Antipatterns - APICon SFAPI Design Antipatterns - APICon SF
API Design Antipatterns - APICon SF
Manish Pandit
 
Apache Hive micro guide - ConfusedCoders
Apache Hive micro guide - ConfusedCodersApache Hive micro guide - ConfusedCoders
Apache Hive micro guide - ConfusedCoders
Yash Sharma
 
Apache Drill
Apache DrillApache Drill
API Design
API DesignAPI Design
API Design
James Gray
 
Workshop: Learning Elasticsearch
Workshop: Learning ElasticsearchWorkshop: Learning Elasticsearch
Workshop: Learning Elasticsearch
Anurag Patel
 
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Sematext Group, Inc.
 
Puppet for Everybody: Federated and Hierarchical Puppet Enterprise
Puppet for Everybody: Federated and Hierarchical Puppet EnterprisePuppet for Everybody: Federated and Hierarchical Puppet Enterprise
Puppet for Everybody: Federated and Hierarchical Puppet Enterprise
cbowlesUT
 
Battle of the Giants round 2
Battle of the Giants round 2Battle of the Giants round 2
Battle of the Giants round 2
Rafał Kuć
 

What's hot (18)

Elasticsearch - Devoxx France 2012 - English version
Elasticsearch - Devoxx France 2012 - English versionElasticsearch - Devoxx France 2012 - English version
Elasticsearch - Devoxx France 2012 - English version
 
An Introduction to Apache Pig
An Introduction to Apache PigAn Introduction to Apache Pig
An Introduction to Apache Pig
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
 
Beginning hive and_apache_pig
Beginning hive and_apache_pigBeginning hive and_apache_pig
Beginning hive and_apache_pig
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latin
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
06 pig-01-intro
06 pig-01-intro06 pig-01-intro
06 pig-01-intro
 
Asset Pipeline
Asset PipelineAsset Pipeline
Asset Pipeline
 
Oozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY WayOozie or Easy: Managing Hadoop Workloads the EASY Way
Oozie or Easy: Managing Hadoop Workloads the EASY Way
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013
 
API Design Antipatterns - APICon SF
API Design Antipatterns - APICon SFAPI Design Antipatterns - APICon SF
API Design Antipatterns - APICon SF
 
Apache Hive micro guide - ConfusedCoders
Apache Hive micro guide - ConfusedCodersApache Hive micro guide - ConfusedCoders
Apache Hive micro guide - ConfusedCoders
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
API Design
API DesignAPI Design
API Design
 
Workshop: Learning Elasticsearch
Workshop: Learning ElasticsearchWorkshop: Learning Elasticsearch
Workshop: Learning Elasticsearch
 
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
Battle of the Giants - Apache Solr vs. Elasticsearch (ApacheCon)
 
Puppet for Everybody: Federated and Hierarchical Puppet Enterprise
Puppet for Everybody: Federated and Hierarchical Puppet EnterprisePuppet for Everybody: Federated and Hierarchical Puppet Enterprise
Puppet for Everybody: Federated and Hierarchical Puppet Enterprise
 
Battle of the Giants round 2
Battle of the Giants round 2Battle of the Giants round 2
Battle of the Giants round 2
 

Viewers also liked

Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
Paco Nathan
 
Benchmark Mail Tutorial
Benchmark Mail TutorialBenchmark Mail Tutorial
Benchmark Mail Tutorial
chairmanarnold
 
HbaseHivePigbyRohitDubey
HbaseHivePigbyRohitDubeyHbaseHivePigbyRohitDubey
HbaseHivePigbyRohitDubey
Rohit Dubey
 
Spark introduction - In Chinese
Spark introduction - In ChineseSpark introduction - In Chinese
Spark introduction - In Chinese
colorant
 
Hive
HiveHive
Hive
Min Zhou
 
SDEC2011 Big engineer vs small entreprenuer
SDEC2011 Big engineer vs small entreprenuerSDEC2011 Big engineer vs small entreprenuer
SDEC2011 Big engineer vs small entreprenuer
Korea Sdec
 
Adobe Spark Step by Step Guide
Adobe Spark Step by Step GuideAdobe Spark Step by Step Guide
Adobe Spark Step by Step Guide
chairmanarnold
 
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and HiveSDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
Korea Sdec
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical Introduction
Adam Muise
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
Paco Nathan
 
#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos
Paco Nathan
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
Paco Nathan
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
Paco Nathan
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
Carol McDonald
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Paco Nathan
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
Paco Nathan
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
Paco Nathan
 
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for HadoopLearning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
Someshwar Kale
 
Internal Hive
Internal HiveInternal Hive
Internal Hive
Recruit Technologies
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
Fabio Fumarola
 

Viewers also liked (20)

Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
 
Benchmark Mail Tutorial
Benchmark Mail TutorialBenchmark Mail Tutorial
Benchmark Mail Tutorial
 
HbaseHivePigbyRohitDubey
HbaseHivePigbyRohitDubeyHbaseHivePigbyRohitDubey
HbaseHivePigbyRohitDubey
 
Spark introduction - In Chinese
Spark introduction - In ChineseSpark introduction - In Chinese
Spark introduction - In Chinese
 
Hive
HiveHive
Hive
 
SDEC2011 Big engineer vs small entreprenuer
SDEC2011 Big engineer vs small entreprenuerSDEC2011 Big engineer vs small entreprenuer
SDEC2011 Big engineer vs small entreprenuer
 
Adobe Spark Step by Step Guide
Adobe Spark Step by Step GuideAdobe Spark Step by Step Guide
Adobe Spark Step by Step Guide
 
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and HiveSDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical Introduction
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscape
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for HadoopLearning Apache HIVE - Data Warehouse and Query Language for Hadoop
Learning Apache HIVE - Data Warehouse and Query Language for Hadoop
 
Internal Hive
Internal HiveInternal Hive
Internal Hive
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 

Similar to SDEC2011 Essentials of Pig

SDEC2011 Introducing Hadoop
SDEC2011 Introducing HadoopSDEC2011 Introducing Hadoop
SDEC2011 Introducing Hadoop
Korea Sdec
 
Sdec2011 Introducing Hadoop
Sdec2011 Introducing HadoopSdec2011 Introducing Hadoop
Sdec2011 Introducing Hadoop
Korea Sdec
 
SDEC2011 Essentials of Mahout
SDEC2011 Essentials of MahoutSDEC2011 Essentials of Mahout
SDEC2011 Essentials of Mahout
Korea Sdec
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
Yahoo Developer Network
 
hotdog a TD tool for DD
hotdog a TD tool for DDhotdog a TD tool for DD
hotdog a TD tool for DD
Treasure Data, Inc.
 
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
Puppet
 
M4,C5 APACHE PIG.pptx
M4,C5 APACHE PIG.pptxM4,C5 APACHE PIG.pptx
M4,C5 APACHE PIG.pptx
Shrinivasa6
 
Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / E...
   Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / E...   Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / E...
Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / E...
Yahoo!デベロッパーネットワーク
 
HDP2 and YARN operations point
HDP2 and YARN operations pointHDP2 and YARN operations point
HDP2 and YARN operations point
Treasure Data, Inc.
 
IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...
IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...
IMCSummite 2016 Breakout - Nikita Ivanov - Apache Ignite 2.0 Towards a Conver...
In-Memory Computing Summit
 
Habitat Workshop at Velocity London 2017
Habitat Workshop at Velocity London 2017Habitat Workshop at Velocity London 2017
Habitat Workshop at Velocity London 2017
Mandi Walls
 
Giving back with GitHub - Putting the Open Source back in iOS
Giving back with GitHub - Putting the Open Source back in iOSGiving back with GitHub - Putting the Open Source back in iOS
Giving back with GitHub - Putting the Open Source back in iOS
Madhava Jay
 
Habitat Overview
Habitat OverviewHabitat Overview
Habitat Overview
Mandi Walls
 
Recon like a pro
Recon like a proRecon like a pro
Recon like a pro
Nirmalthapa24
 
ki
kiki
ki
martin
 
Google Hacking Basic
Google Hacking BasicGoogle Hacking Basic
Google Hacking Basic
Ocim Nationalism
 
13 practical tips for writing secure golang applications
13 practical tips for writing secure golang applications13 practical tips for writing secure golang applications
13 practical tips for writing secure golang applications
Karthik Gaekwad
 
Introduction to PHP - SDPHP
Introduction to PHP - SDPHPIntroduction to PHP - SDPHP
Introduction to PHP - SDPHP
Eric Johnson
 
Introduction to pig
Introduction to pigIntroduction to pig
Introduction to pig
Ravi Mutyala
 
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your big...
rhatr
 

Similar to SDEC2011 Essentials of Pig (20)

SDEC2011 Introducing Hadoop
SDEC2011 Introducing HadoopSDEC2011 Introducing Hadoop
SDEC2011 Introducing Hadoop
 
Sdec2011 Introducing Hadoop
Sdec2011 Introducing HadoopSdec2011 Introducing Hadoop
Sdec2011 Introducing Hadoop
 
SDEC2011 Essentials of Mahout
SDEC2011 Essentials of MahoutSDEC2011 Essentials of Mahout
SDEC2011 Essentials of Mahout
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 
hotdog a TD tool for DD
hotdog a TD tool for DDhotdog a TD tool for DD
hotdog a TD tool for DD
 
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
 
M4,C5 APACHE PIG.pptx
M4,C5 APACHE PIG.pptxM4,C5 APACHE PIG.pptx
M4,C5 APACHE PIG.pptx
 
Dragon: A Distributed Object Storage at Yahoo! JAPAN (WebDB Forum 2017 / E...





SDEC2011 Essentials of Pig

  • 1. Essentials of Pig: Mastering Hadoop Map-reduce for Data Analysis
      • Shashank Tiwari
      • blog: shanky.org | twitter: @tshanky
      • st@treasuryofideas.com
      • Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners.
  • 2. Session Agenda
      • What is Pig and why should you use it?
      • Installing & Setting up Pig
      • Pig’s Components
      • Using Pig with Hadoop MapReduce
      • Summary & Conclusion
  • 3. What is Pig?
      • Higher-level abstraction for Hadoop MapReduce
      • An infrastructure for data analysis using a scripting language
          • named Pig Latin
  • 4. Why should you use Pig?
      • Hadoop MapReduce:
          • Requires you to be a programmer
          • Forces you to design all your algorithms in terms of the map and reduce primitives
  • 5. Installing & Setting Up Pig -- Required Software
      • Required Software:
          • Java 1.6.x
          • Hadoop 0.20.x
          • Ant 1.7+ (for builds)
          • JUnit 4.5 (for tests)
          • Cygwin (on Windows)
  • 6. Download
      • Source: http://pig.apache.org/
      • Version:
          • 0.8.1 -- current stable
  • 7. Install & Configure
      • Extract: tar zxvf pig-0.8.1.tar.gz
      • Move & Create Symbolic Link:
          • ln -s pig-0.8.1 pig
      • Edit: bin/pig
          • export PIG_CLASSPATH=$HADOOP_HOME/conf
  • 8. Verify Installation
      • Verify: (remember to start Hadoop first.)
          • bin/pig -help (command options)
          • bin/pig (run the grunt shell)
  • 9. Running Pig
      • Run Mode
          • Local Mode -- single machine
          • MapReduce Mode -- needs a Hadoop cluster (with HDFS)
      • Run via:
          • grunt shell
          • pig scripts
          • embedded programs
  • 10. Pig IDE
      • PigPen, an Eclipse-based IDE
          • graphical data flow definition
          • can show example data flow
  • 11. Pig Components
      • Pig Latin
      • Pig Engine
          • execution engine on top of Hadoop
          • includes default optimal configurations
  • 12. A client for your cluster
      • Pig itself does not run on the Hadoop cluster
      • It runs on a client machine and connects to the cluster to submit jobs
  • 13. Pig Latin
      • Data flow language (not declarative like SQL)
      • Increases productivity (fewer lines do more)
      • Includes standard operations like join, filter, group, sort
      • User code and existing binaries can be included
      • Supports nested data types
      • Does not require metadata
  • 14. Pig Latin Example
      • Will leverage the tutorial that comes with the distribution
      • Check the tutorial folder in the distribution
  • 15. Start Grunt Shell
      • cd $PIG_HOME
      • bin/pig -x local
  • 16. Aggregate Data
      • grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, timestamp, query);
          • alternate delimiters can be used and de-serializers like PigJsonLoader can be leveraged
      • grunt> grouped = GROUP log BY user;
      • grunt> counted = FOREACH grouped GENERATE group, COUNT(log);
      • grunt> STORE counted INTO 'output';
  • 17. Group Data
      • grunt> grouped = GROUP log BY user;
      • In Pig, a group operation generates (key, collection) pairs, where each collection is itself a collection of tuples.
      • Every tuple in a collection has the same key value as the key of its (key, collection) pair.
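
The (key, collection) shape described on the Group Data slide can be modelled outside Pig. The following is a minimal plain-Python sketch, not Pig code: `group_by` is a hypothetical helper, the three log rows are made up rather than taken from excite-small.log, and the keys are sorted only to make the output deterministic (Pig itself does not promise an order).

```python
from collections import defaultdict

def group_by(records, key_index):
    """Model of Pig's GROUP: return (key, bag) pairs; each bag holds whole tuples."""
    bags = defaultdict(list)
    for record in records:
        bags[record[key_index]].append(record)
    # Sorted by key purely for reproducible output in this sketch.
    return sorted(bags.items())

# Hypothetical (user, timestamp, query) rows, invented for illustration.
log = [
    ("u1", "970916072134", "pig latin"),
    ("u2", "970916072201", "hadoop"),
    ("u1", "970916072250", "mapreduce"),
]

# grouped = GROUP log BY user;
grouped = group_by(log, 0)

# counted = FOREACH grouped GENERATE group, COUNT(log);
counted = [(key, len(bag)) for key, bag in grouped]
```

Each bag keeps the complete original tuples, including the key field, which is why every tuple inside a bag repeats the key of its pair.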
  • 18. Filter Data
      • grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);
      • grunt> grouped = GROUP log BY user;
      • grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;
      • grunt> filtered = FILTER counted BY cnt > 75;
      • grunt> STORE filtered INTO 'output1';
  • 19. Order Data
      • grunt> log = LOAD 'tutorial/data/excite-small.log' AS (user, time, query);
      • grunt> grouped = GROUP log BY user;
      • grunt> counted = FOREACH grouped GENERATE group, COUNT(log) AS cnt;
      • grunt> filtered = FILTER counted BY cnt > 50;
      • grunt> sorted = ORDER filtered BY cnt;
      • grunt> STORE sorted INTO 'output2';
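
The count, filter, and order steps from the two slides above form one pipeline. As a rough plain-Python model (not Pig code), assuming a hypothetical helper `count_filter_order` and made-up user ids, it could look like this:

```python
from collections import Counter

def count_filter_order(users, threshold):
    """Model of GROUP + COUNT, then FILTER BY cnt > threshold, then ORDER BY cnt."""
    counted = Counter(users)                      # GROUP log BY user; COUNT(log) AS cnt
    filtered = [(user, cnt) for user, cnt in counted.items()
                if cnt > threshold]               # FILTER counted BY cnt > threshold
    return sorted(filtered, key=lambda pair: pair[1])  # ORDER filtered BY cnt (ascending)

# Hypothetical user column: "a" appears 3 times, "b" once, "c" 5 times.
users = ["a"] * 3 + ["b"] + ["c"] * 5
result = count_filter_order(users, threshold=2)
```

The sort is ascending, which matches Pig's default for ORDER ... BY when no DESC keyword is given.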
  • 20. Join Data Example
      • Words appearing in Adventures of Huckleberry Finn by Mark Twain
          • http://www.gutenberg.org/ebooks/76
      • Words appearing in The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle
          • http://www.gutenberg.org/ebooks/1661
  • 21. Loading & Counting Huckleberry Finn Data
      • grunt> A = load 'pg76.txt';
      • grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
      • grunt> C = filter B by word matches '\\w+';
      • grunt> D = group C by word;
      • grunt> E = foreach D generate COUNT(C), group;
      • grunt> store E into 'huckleberry_finn_freq';
  • 22. Loading & Counting Sherlock Holmes Data
      • grunt> A = load 'pg1661.txt';
      • grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
      • grunt> C = filter B by word matches '\\w+';
      • grunt> D = group C by word;
      • grunt> E = foreach D generate COUNT(C), group;
      • grunt> store E into 'sherlock_holmes_freq';
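
The word-frequency scripts above tokenize each line, keep only tokens made entirely of word characters, then group and count. A plain-Python sketch of the same flow follows; it is a simplification, since it splits on whitespace only (Pig's TOKENIZE also splits on a few punctuation characters), and the two input lines are invented rather than taken from the Gutenberg texts.

```python
import re
from collections import Counter

# A whole-token match, mirroring how Pig's `matches` must match the entire string.
WORD = re.compile(r"\w+")

def word_freq(lines):
    """Model of the TOKENIZE -> filter by regex -> group -> COUNT pipeline."""
    tokens = (tok for line in lines for tok in line.split())   # flatten(TOKENIZE(...))
    words = (tok for tok in tokens if WORD.fullmatch(tok))     # filter B by word matches ...
    return Counter(words)                                      # group C by word; COUNT(C)

# Hypothetical input lines for illustration.
lines = ["the river was wide", "the raft -- the river"]
freq = word_freq(lines)
```

The token `--` is dropped because it contains no word characters, which is the job the regular-expression filter does in the Pig script.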
  • 23. Join Data
      • grunt> hf = LOAD 'huckleberry_finn_freq' AS (freq, word);
      • grunt> sh = LOAD 'sherlock_holmes_freq' AS (freq, word);
      • grunt> inboth = JOIN hf BY word, sh BY word;
      • grunt> STORE inboth INTO 'output3';
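
An inner join such as the one on the Join Data slide keeps only keys present in both inputs and emits the fields of both matching tuples side by side. A minimal plain-Python model, with a hypothetical `inner_join` helper and invented (freq, word) rows, might be:

```python
def inner_join(left, right, left_key, right_key):
    """Model of an inner JOIN: concatenate each pair of tuples whose keys are equal."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [l_row + r_row
            for l_row in left
            for r_row in index.get(l_row[left_key], [])]

# Hypothetical (freq, word) rows standing in for the two frequency files.
hf = [(12, "river"), (3, "raft")]
sh = [(2, "river"), (7, "violin")]

# inboth = JOIN hf BY word, sh BY word;
inboth = inner_join(hf, sh, 1, 1)
```

Only "river" occurs in both inputs, so the result has a single four-field tuple carrying both frequencies.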
  • 24. Set Difference (A - B, in A but not in B; here A = sh, B = hf)
      • hf = LOAD 'huckleberry_finn_freq' AS (freq, word);
      • sh = LOAD 'sherlock_holmes_freq' AS (freq, word);
      • grouped = COGROUP hf BY word, sh BY word;
      • not_in_hf = FILTER grouped BY COUNT(hf) == 0;
      • out = FOREACH not_in_hf GENERATE FLATTEN(sh);
      • STORE out INTO 'output4';
  • 25. Cogroup Data
      • Extends the idea of grouping to multiple collections
      • Instead of a (key, collection) pair, it now emits a key and a set of tuples from each of the multiple collections
      • With two sources of input it would be (key, collection1, collection2), where tuples from the first source will be in collection1 and tuples from the second source will be in collection2.
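
The (key, collection1, collection2) shape, and the way the set-difference slide uses it, can be sketched in plain Python. This is a model under stated assumptions, not Pig code: `cogroup` is a hypothetical helper, the rows are invented, and keys are sorted only so the output is reproducible.

```python
def cogroup(left, right, key_index):
    """Model of COGROUP on two inputs: one (key, left_bag, right_bag) triple per key."""
    keys = {row[key_index] for row in left} | {row[key_index] for row in right}
    return [
        (key,
         [row for row in left if row[key_index] == key],
         [row for row in right if row[key_index] == key])
        for key in sorted(keys)
    ]

# Hypothetical (freq, word) rows standing in for the two frequency files.
hf = [(12, "river"), (3, "raft")]
sh = [(2, "river"), (7, "violin")]

# grouped = COGROUP hf BY word, sh BY word;
grouped = cogroup(hf, sh, 1)

# not_in_hf = FILTER grouped BY COUNT(hf) == 0;  out = FOREACH not_in_hf GENERATE FLATTEN(sh);
only_in_sh = [row for _, hf_bag, sh_bag in grouped if not hf_bag for row in sh_bag]
```

A key that appears in only one input still produces a triple, with an empty bag on the other side. Filtering on the empty bag is what turns the cogroup into a set difference.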
  • 26. Data types Supported
      • int, long, float, double, chararray, bytearray
      • map, tuple (ordered), bag (unordered)
  • 27. Data type Declaration
      • hf = LOAD 'huckleberry_finn_freq' AS (freq:int, word:chararray);
          • explicit data type declaration
      • hf = LOAD 'huckleberry_finn_freq' AS (freq, word);
      • weighted = FOREACH hf GENERATE freq * 100;
          • type inference, freq cast to int
  • 29. Custom Extensions
      • User-defined functions can be called from Pig scripts
      • Nested operations can be carried out
          • FOREACH grouped { sorted = ORDER hf BY counted;
          • GENERATE group, CustomFunction(sorted); }
      • Flow can be split: SPLIT A INTO Negative IF $0 < 0, Positive IF $0 > 0;
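
The SPLIT statement on the Custom Extensions slide routes each tuple to every output whose condition it satisfies. A small plain-Python model, using a hypothetical `split` helper and made-up single-field rows, shows one consequence of the two conditions given there:

```python
def split(rows, **conditions):
    """Model of SPLIT: a row goes to every output whose predicate holds, possibly none."""
    outputs = {name: [] for name in conditions}
    for row in rows:
        for name, predicate in conditions.items():
            if predicate(row):
                outputs[name].append(row)
    return outputs

# Hypothetical one-field tuples; the field plays the role of $0.
rows = [(-2,), (0,), (5,)]

# SPLIT A INTO Negative IF $0 < 0, Positive IF $0 > 0;
parts = split(rows,
              negative=lambda row: row[0] < 0,
              positive=lambda row: row[0] > 0)
```

With these two conditions a row whose first field is exactly 0 lands in neither output, so a third branch would be needed if zeros must be kept.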
  • 30. Questions?
      • blog: shanky.org | twitter: @tshanky
      • st@treasuryofideas.com