ETL with Apache Pig
By
Arjun Shah
Under the guidance of
Dr Duc Thanh Tran
Agenda
• What is Pig?
• Introduction to Pig Latin
• Installation of Pig
• Getting Started with Pig
• Examples
What is Pig?
• Pig is a dataflow language
• Language is called PigLatin
• Pretty simple syntax
• Under the covers, Pig Latin scripts are compiled into MapReduce jobs
and executed on the cluster
• Built for Hadoop
• Originally developed at Yahoo!
• Huge contributions from Hortonworks, Twitter
What Pig Does
• Pig was designed for performing a long series of
data operations, making it ideal for three
categories of Big Data jobs:
• Extract-transform-load (ETL) data pipelines,
• Research on raw data, and
• Iterative data processing.
Features of Pig
• Joining datasets
• Grouping data
• Referring to elements by position rather than name ($0, $1, etc)
• Loading non-delimited data with a custom load/store function (Pig's equivalent of a SerDe: writing a custom reader and writer)
• Creation of user-defined functions (UDF), written in Java
• And more..
Pig: Install
• Prerequisites for installing Pig:
• JAVA_HOME should be set up
• Hadoop should be installed (Single node
cluster)
• Useful link :
http://codesfusion.blogspot.com/2013/10/setup-
hadoop-2x-220-on-ubuntu.html
Pig: Install(2)
pig.apache.org/docs/r0.12.0/start.html
Pig: Install(3)
Pig: Install(4)
Pig: Install(5)
Move the tar file to any location, e.g.:
• $ cd /usr/local
• $ sudo cp ~/Downloads/pig-0.12.0.tar.gz .
• $ sudo tar xzf pig-0.12.0.tar.gz
• $ sudo mv pig-0.12.0 pig
Change .bashrc
• Edit the .bashrc file:
• $ gedit ~/.bashrc
• Add to .bashrc
• export PIG_HOME=/usr/local/pig
• export PATH=$PATH:$PIG_HOME/bin
• Close and reopen the terminal, then try pig -h
pig -h : Output
Pig: Configure
• The user can run Pig in two modes:
• Local mode (pig -x local) - runs everything on a single
machine, using the local host and local file system.
• Hadoop mode - This is the default mode, which
requires access to a Hadoop cluster
• The user can run Pig in either mode using the “pig”
command or the “java” command.
Pig: Run
• Script: Pig can run a script file that contains Pig commands.
• For example,
% pig script.pig
• Runs the commands in the local file ”script.pig”.
• Alternatively, for very short scripts, you can use the -e option to run a script specified as a string on a
command line.
• Grunt: Grunt is an interactive shell for running Pig commands.
• Grunt is started when no file is specified for Pig to run, and the -e option is not used.
• Note: It is also possible to run Pig scripts from within Grunt using run and exec.
• Embedded: You can run Pig programs from Java, much like you can use JDBC to run SQL programs
from Java.
• There are more details on the Pig wiki at http://wiki.apache.org/pig/EmbeddedPig
Pig Latin: Loading Data
• LOAD
- Reads data from the file system
• Syntax
- LOAD ‘input’ [USING function] [AS schema];
- Eg, A = LOAD ‘input’ USING PigStorage(‘\t’) AS
(name:chararray, age:int, gpa:float);
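The LOAD line above can be pictured as split-then-cast: PigStorage splits each line on the delimiter, and the AS schema applies the types. A minimal Python sketch of that behavior (the field names and sample line are illustrative, not from the deck):

```python
# Rough sketch of what LOAD ... USING PigStorage('\t') AS (...) does
# per input line: split on the delimiter, then cast fields per the schema.
def load_line(line, schema, delim="\t"):
    casters = {"chararray": str, "int": int, "float": float}
    raw = line.rstrip("\n").split(delim)
    return tuple(casters[t](v) for v, (_, t) in zip(raw, schema))

schema = [("name", "chararray"), ("age", "int"), ("gpa", "float")]
row = load_line("alice\t23\t3.9", schema)
# row == ("alice", 23, 3.9)
```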
Schema
• Use schemas to assign types to fields
• A = LOAD 'data' AS (name, age, gpa);
-name, age, gpa default to bytearrays
• A = LOAD 'data' AS (name:chararray, age:int,
gpa:float);
-name is now a String (chararray), age is integer
and gpa is float
Describing Schema
• Describe
• Provides the schema of a relation
• Syntax
• DESCRIBE [alias];
• If schema is not provided, describe will say “Schema for alias unknown”
• grunt> A = load 'data' as (a:int, b: long, c: float);
• grunt> describe A;
• A: {a: int, b: long, c: float}
• grunt> B = load 'somemoredata';
• grunt> describe B;
• Schema for B unknown.
Dump and Store
• Dump writes the output to console
• grunt> A = load ‘data’;
• grunt> DUMP A; //This will print contents of A on Console
• Store writes output to a HDFS location
• grunt> A = load ‘data’;
• grunt> STORE A INTO ‘/user/username/output’; //This will
write contents of A to HDFS
• Pig starts a job only when a DUMP or STORE is encountered
Referencing Fields
• Fields are referred to by positional notation OR by name (alias)
• Positional notation is generated by the system
• Starts with $0
• Names are assigned by you using schemas. Eg, A = load
‘data’ as (name:chararray, age:int);
• With positional notation, fields can be accessed as
• A = load ‘data’;
• B = foreach A generate $0, $1; //1st & 2nd column
Limit
• Limits the number of output tuples
• Syntax
• alias = LIMIT alias n;
• grunt> A = load 'data';
• grunt> B = LIMIT A 10;
• grunt> DUMP B; --Prints only 10 rows
Foreach.. Generate
• Used for data transformations and projections
• Syntax
• alias = FOREACH { block | nested_block };
• nested_block usage later in the deck
• grunt> A = load ‘data’ as (a1,a2,a3);
• grunt> B = FOREACH A GENERATE *;
• grunt> DUMP B;
• (1,2,3)
• (4,2,1)
• grunt> C = FOREACH A GENERATE a1, a3;
• grunt> DUMP C;
• (1,3)
• (4,1)
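The projections above amount to per-tuple field selection; a small Python sketch over the two sample tuples shown on the slide:

```python
# FOREACH A GENERATE ... is a per-tuple projection over the relation.
A = [(1, 2, 3), (4, 2, 1)]                 # tuples (a1, a2, a3)

B = [t for t in A]                         # GENERATE * keeps every field
C = [(a1, a3) for (a1, _, a3) in A]        # GENERATE a1, a3
# C == [(1, 3), (4, 1)]
```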
Filter
• Selects tuples from a relation based on some condition
• Syntax
• alias = FILTER alias BY expression;
• Example, to filter for ‘marcbenioff’
• A = LOAD ‘sfdcemployees’ USING PigStorage(‘,’) as
(name:chararray,employeesince:int,age:int);
• B = FILTER A BY name == ‘marcbenioff’;
• You can use boolean operators (AND, OR, NOT)
• B = FILTER A BY (employeesince < 2005) AND (NOT(name ==
‘marcbenioff’));
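The compound FILTER above keeps only tuples for which the predicate holds. A Python sketch of the same condition (all rows except the ‘marcbenioff’ name are invented for illustration):

```python
# FILTER A BY (employeesince < 2005) AND (NOT (name == 'marcbenioff'))
A = [("marcbenioff", 1999, 55),   # dropped by the NOT clause
     ("earlyhire", 2001, 48),     # kept: joined before 2005
     ("newhire", 2015, 30)]       # dropped: joined in/after 2005

B = [(n, y, a) for (n, y, a) in A if y < 2005 and n != "marcbenioff"]
# B == [("earlyhire", 2001, 48)]
```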
Group By
• Groups data in one or more relations (similar to SQL GROUP BY)
• Syntax:
• alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [PARALLEL
n];
• Eg, to group by (employee start year at Salesforce)
• A = LOAD ‘sfdcemployees’ USING PigStorage(‘,’) as (name:chararray,
employeesince:int, age:int);
• B = GROUP A BY (employeesince);
• You can also group by all fields together
• B = GROUP A ALL;
• Or Group by multiple fields
• B = GROUP A BY (age, employeesince);
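GROUP produces one output tuple per key: the key itself (named group) plus a bag of every input tuple sharing that key. A Python sketch with invented rows:

```python
from collections import defaultdict

# GROUP A BY employeesince emits (group, {bag of matching tuples}).
A = [("ann", 2004, 40), ("bob", 2004, 35), ("cat", 2010, 28)]

bags = defaultdict(list)
for t in A:
    bags[t[1]].append(t)          # key = employeesince (2nd field)

B = sorted(bags.items())
# B == [(2004, [("ann", 2004, 40), ("bob", 2004, 35)]),
#       (2010, [("cat", 2010, 28)])]
```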
Demo: Sample Data (employee.txt)
• Example contents of ‘employee.txt’ a tab delimited text
• 1 Peter 234000000 none
• 2 Peter_01 234000000 none
• 124163 Jacob 10000 cloud
• 124164 Arthur 1000000 setlabs
• 124165 Robert 1000000 setlabs
• 124166 Ram 450000 es
• 124167 Madhusudhan 450000 e&r
• 124168 Alex 6500000 e&r
• 124169 Bob 50000 cloud
Demo: Employees with salary > 1 lakh (100,000)
• Loading data from employee.txt into emps bag and with a schema
empls = LOAD ‘employee.txt’ AS (id:int, name:chararray, salary:double,
dept:chararray);
• Filtering the data as required
rich = FILTER empls BY $2 > 100000;
• Sorting
sortd = ORDER rich BY salary DESC;
• Storing the final results
STORE sortd INTO ‘rich_employees.txt’;
• Or alternatively we can dump the record on the screen
DUMP sortd;
------------------------------------------------------------------
• Group by salary
grp = GROUP empls BY salary;
• Get count of employees in each salary group
cnt = FOREACH grp GENERATE group, COUNT(empls.id) as emp_cnt;
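The whole demo pipeline can be sketched in Python over a subset of the sample rows from the previous slide (fields: id, name, salary, dept):

```python
from collections import Counter

# Sketch of the demo: FILTER salary > 100000, ORDER BY salary DESC,
# then GROUP by salary and COUNT each group.
empls = [
    (124163, "Jacob", 10000.0, "cloud"),
    (124164, "Arthur", 1000000.0, "setlabs"),
    (124165, "Robert", 1000000.0, "setlabs"),
    (124166, "Ram", 450000.0, "es"),
    (124169, "Bob", 50000.0, "cloud"),
]

rich = [e for e in empls if e[2] > 100000]              # FILTER ... BY $2 > 100000
sortd = sorted(rich, key=lambda e: e[2], reverse=True)  # ORDER ... BY salary DESC

cnt = sorted(Counter(e[2] for e in empls).items())      # GROUP BY salary + COUNT
# cnt == [(10000.0, 1), (50000.0, 1), (450000.0, 1), (1000000.0, 2)]
```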
Output
More PigLatin (1/2)
• Load using PigStorage
• empls = LOAD ‘employee.txt’ USING
PigStorage(‘\t’) AS (id:int, name:chararray,
salary:double, dept:chararray);
• Store using PigStorage
• STORE sortd INTO ‘rich_employees.txt’ USING
PigStorage(‘\t’);
More PigLatin (2/2)
• To view the schema of a relation
• DESCRIBE empls;
• To view step-by-step execution of a series of
statements
• ILLUSTRATE empls;
• To view the execution plan of a relation
• EXPLAIN empls;
Exploring Pig with Project
Data Set
Pig: Local Mode using
Project Example
Pig: Hadoop Mode (GUI)
using Project Example
Output
Crimes having category as
VANDALISM
Output
Crimes occurring on
Saturday & Sunday
Output
Grouping crimes by category
Output
PigLatin: UDF
• Pig provides extensive support for user-defined
functions (UDFs) as a way to specify custom
processing. Functions can be a part of almost
every operator in Pig
• UDF names are case-sensitive
UDF: Types
• Eval Functions (EvalFunc)
• Ex: StringConcat (built-in) : Generates the concatenation of the first two fields
of a tuple.
• Aggregate Functions (EvalFunc & Algebraic)
• Ex: COUNT, AVG ( both built-in)
• Filter Functions (FilterFunc)
• Ex: IsEmpty (built-in)
• Load/Store Functions (LoadFunc/ StoreFunc)
• Ex: PigStorage (built-in)
• Note: URL for built in functions:
http://pig.apache.org/docs/r0.7.0/api/org/apache/pig/builtin/package-
summary.html
Summary
• Pig can be used to run ETL jobs on Hadoop. It
saves you from writing MapReduce code in Java
while its syntax may look familiar to SQL users.
Nonetheless, it is important to take some time to
learn Pig and to understand its advantages and
limitations. Who knows, maybe pigs can fly after
all.