PIG in Big Data
9/18/2016 1
Data keeps growing…
BIG DATA
• ‘Big Data’ is similar to ‘small data’, but bigger in size
• It requires different approaches:
Techniques, tools and architecture
• To solve new problems or old problems in a better way
• Storage and processing of very large quantities of digital
information that cannot be analyzed with traditional computing
techniques
INTRODUCTION TO BIG DATA
• ERP (megabytes): Purchase Details, Purchase Records, Payment Records
• CRM (gigabytes): Segmentation, Offer Details, Customer Touches, Support Contacts
• Web (terabytes): Weblogs, Offer History, A/B Testing, Dynamic Pricing, Affiliate Network, Search Marketing, Behavioral Targeting, Dynamic Funnels
• … and far, far beyond (petabytes): User Generated Content, Mobile Web, User Click Stream, Sentiment, Social Network, External Demographics, Business Data Feeds, HD Video, Speech to Text, Product / Service Logs, SMS / MMS
Source: http://datameer.com
CONT.,
• Walmart handles more than 1 million customer transactions every
hour
• Facebook handles 40 billion photos from its user base
• Decoding the human genome originally took 10 years to process;
now it can be achieved in one week
NECESSITY OF HADOOP
HADOOP
• As data is growing, we need to be able to scale-out computation
• Uses cheap(er) hardware to grow horizontally
• Tolerates a few machines going down, which happens all the time
• Stores all your data from all systems
• No need to throw data away
EXAMPLES
HDFS
• Hadoop Distributed File System
• A distributed, scalable, and portable file system written in Java
for the Hadoop framework
• Provides high-throughput access to the application data
• Runs on large clusters of commodity machines
• Used to store large datasets
CONT.,
• A file we want to store on HDFS: a 600 MB text file
CONT.,
• HDFS splits the file into blocks: 256 MB, 256 MB, and 88 MB
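The block sizes in this example follow from simple arithmetic, assuming the 256 MB block size used on the slide. The block size is configurable per cluster, so 256 MB is an assumption taken from the example rather than a fixed property of HDFS.

```latex
\[
\left\lceil \frac{600\ \text{MB}}{256\ \text{MB}} \right\rceil = 3 \ \text{blocks},
\qquad
600\ \text{MB} - 2 \times 256\ \text{MB} = 88\ \text{MB in the last block}
\]
```

Only the last block is smaller than the configured block size; it occupies just the space it needs.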
MAP REDUCE
• Distributed data processing model and execution environment
that runs on large clusters of commodity machines
• Also called MR
• Programs are inherently parallel
PIG-INTRODUCTION
• High level data flow language for exploring very large datasets
• Provides an engine for executing data flows in parallel on Hadoop
• Compiler that produces sequences of MapReduce programs
• Structure is amenable to substantial parallelization
• Operates on files in HDFS
• Metadata is not required, but used when available
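As a rough illustration of what a Pig data flow looks like, here is a minimal word-count sketch. The input file name 'input.txt' and the output directory 'wordcount_out' are placeholders, not names from the original deck.

```pig
-- Hypothetical input: a plain-text file with one line of text per record.
lines = LOAD 'input.txt' AS (line:chararray);

-- Break each line into words; FLATTEN turns the bag of tokens into one record per word.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Collect identical words together, then count the records in each group.
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;

-- Writing the result triggers execution on the cluster.
STORE counts INTO 'wordcount_out';
```

Each statement names an intermediate relation, which is what makes the script read as a data flow rather than as a single query.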
KEY PROPERTIES OF PIG
• Ease of programming: It is trivial to achieve parallel execution of
simple, "embarrassingly parallel" data analysis tasks
• Optimization opportunities: The system can optimize execution
automatically, so the user can focus on semantics rather than efficiency
• Extensibility: Users can create their own functions to do
special-purpose processing
PIG EXECUTION STAGE
[Diagram: Client machine (Pig script → Pig execution engine) → MapReduce Hadoop job → Hadoop cluster]
NECESSITY OF PIG
EQUIVALENT MAP REDUCE CODE
PIG VS HADOOP
• 5% of the MR code.
• 5% of the MR development
time.
• Within 25% of the MR
execution time.
• Readable and reusable.
• Easy to learn DSL.
• Increases programmer
productivity.
• No Java expertise required.
• Anyone (e.g., BI folks) can trigger the jobs.
• Insulates against Hadoop complexity:
• Version upgrades
• Changes in Hadoop interfaces
• JobConf configuration tuning
• Job chains
PIG COMMANDS
Statement         Description
Load              Read data from the file system
Store             Write data to the file system
Dump              Write output to stdout
Foreach           Apply expression to each record and generate one or more records
Filter            Apply predicate to each record and remove records where false
Group / Cogroup   Collect records with the same key from one or more inputs
Join              Join two or more inputs based on a key
Order             Sort records based on a key
Distinct          Remove duplicate records
Union             Merge two datasets
Limit             Limit the number of records
Split             Split data into 2 or more sets, based on filter conditions
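Several of these statements can be sketched together in one short script. The file names 'students' and 'clubs' and their fields are hypothetical, chosen only to illustrate the operators.

```pig
-- Hypothetical comma-separated inputs, for illustration only.
students = LOAD 'students' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
clubs    = LOAD 'clubs'    USING PigStorage(',') AS (name:chararray, club:chararray);

-- FOREACH: project a subset of fields from each record.
names_gpa = FOREACH students GENERATE name, gpa;

-- JOIN: match records from both inputs on the name field.
joined = JOIN students BY name, clubs BY name;

-- ORDER: sort records by a key.
by_gpa = ORDER students BY gpa DESC;

-- DISTINCT: remove duplicate records.
club_names   = FOREACH clubs GENERATE club;
unique_clubs = DISTINCT club_names;

-- SPLIT: route records into several relations by condition.
-- Records with a null age satisfy neither condition and are dropped.
SPLIT students INTO minors IF age < 18, adults IF age >= 18;

-- UNION: merge two relations that share a schema.
everyone = UNION minors, adults;

DUMP by_gpa;
```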
LOADING DATA
• LOAD
• Reads data from the file system
• Syntax
• LOAD 'input' [USING function] [AS schema];
• E.g., A = LOAD 'input' USING PigStorage('\t') AS
(name:chararray, age:int, gpa:float);
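A few variations on LOAD, as a sketch. The paths 'input' and 'input.csv' are placeholders.

```pig
-- With no USING clause, Pig uses PigStorage with tab-delimited fields.
A = LOAD 'input' AS (name:chararray, age:int, gpa:float);

-- The same load with the loader and delimiter spelled out.
B = LOAD 'input' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);

-- Comma-separated input (hypothetical file name).
C = LOAD 'input.csv' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
```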
SCHEMA
• Use schemas to assign types to fields
• A = LOAD 'data' AS (name, age, gpa);
• name, age, gpa default to bytearrays
• A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
• name is now a chararray (string), age is an int, and gpa is a float
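When fields are loaded without types, they can still be converted later with explicit casts. A minimal sketch, reusing the 'data' file from the slide:

```pig
-- Fields declared without types default to bytearray.
A = LOAD 'data' AS (name, age, gpa);

-- Cast each field to the type the later steps need.
B = FOREACH A GENERATE (chararray)name AS name,
                       (int)age        AS age,
                       (float)gpa      AS gpa;
```

Declaring the types in the LOAD statement is usually simpler; casting is useful when the schema is only known partway through the script.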
DESCRIBING SCHEMA
• Describe
• Provides the schema of a relation
• Syntax
• DESCRIBE alias;
• If no schema was provided, DESCRIBE reports “Schema for alias
unknown”
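Both cases can be tried side by side. This sketch reuses the 'data' file and field names from the schema slide.

```pig
-- Schema declared at load time.
A = LOAD 'data' AS (name:chararray, age:int, gpa:float);
DESCRIBE A;   -- prints the declared field names and types

-- No schema declared.
B = LOAD 'data';
DESCRIBE B;   -- Pig reports that the schema for B is unknown
```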
DUMP AND STORE
• Dump writes the output to console
• grunt> A = LOAD 'data';
• grunt> DUMP A; -- prints the contents of A on the console
• Store writes output to an HDFS location
• grunt> A = LOAD 'data';
• grunt> STORE A INTO '/user/username/output'; -- writes the contents of A to HDFS
• Pig starts a job only when a DUMP or STORE is encountered
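The lazy behaviour can be seen in a short sketch. The output path '/user/username/honors' and the 3.0 threshold are illustrative assumptions.

```pig
A = LOAD 'data' AS (name:chararray, age:int, gpa:float);

-- These statements only build up the plan; no job runs yet.
B = FILTER A BY gpa >= 3.0;
C = FOREACH B GENERATE name, gpa;

-- STORE triggers execution. The output directory must not already exist.
STORE C INTO '/user/username/honors' USING PigStorage(',');
```

DUMP is convenient for small results while exploring; STORE is the usual choice for anything large.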
REFERENCING FIELDS
• Fields are referred to by positional notation or by name (alias)
• Positional notation is generated by the system
• Starts with $0
• Names are assigned by you using schemas
• E.g.: A = LOAD 'data' AS (name:chararray, age:int);
• With positional notation, fields can be accessed as
• A = LOAD 'data';
• B = FOREACH A GENERATE $0, $1; -- 1st and 2nd columns
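Names and positions can be mixed in the same statement, and AS assigns names to generated fields. A small sketch; the output field names who and next_age are made up for illustration:

```pig
A = LOAD 'data' AS (name:chararray, age:int);

-- name is referenced by alias, age by position ($1).
B = FOREACH A GENERATE name, $1;

-- Generated fields can be given new names with AS.
C = FOREACH A GENERATE $0 AS who, age + 1 AS next_age;
```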
LIMIT
• Limits the number of output tuples
• Syntax
• alias = LIMIT alias n;
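LIMIT on its own does not guarantee which tuples are returned, so it is often combined with ORDER. A sketch, reusing the 'data' schema from the earlier slides:

```pig
A = LOAD 'data' AS (name:chararray, age:int, gpa:float);

-- Sort first so that the limit picks the highest GPAs.
sorted = ORDER A BY gpa DESC;
top3   = LIMIT sorted 3;

DUMP top3;
```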
FILTER
• Selects tuples from a relation based on some condition
• Syntax
• alias = FILTER alias BY expression;
• Example, to filter for 'marcbenioff'
• A = LOAD 'sfdcemployees' USING PigStorage(',') AS
(name:chararray, employeesince:int, age:int);
• B = FILTER A BY name == 'marcbenioff';
• You can use boolean operators (AND, OR, NOT)
• B = FILTER A BY (employeesince < 2005) AND
(NOT(name == 'marcbenioff'));
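FILTER expressions can also test for nulls and match regular expressions. A sketch using the same 'sfdcemployees' schema; the thresholds and the pattern are illustrative only:

```pig
A = LOAD 'sfdcemployees' USING PigStorage(',') AS
    (name:chararray, employeesince:int, age:int);

-- Keep records with a known age of at least 30.
B = FILTER A BY (age IS NOT NULL) AND (age >= 30);

-- MATCHES takes a Java regular expression that must match the whole value.
C = FILTER A BY name MATCHES 'marc.*';

-- Range check built from two comparisons.
D = FILTER A BY (employeesince >= 2000) AND (employeesince < 2005);
```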
GROUP BY
• Syntax:
• alias = GROUP alias { ALL | BY expression} [, alias ALL | BY
expression …] [PARALLEL n];
• E.g., to group by employee start year at Salesforce
• A = LOAD 'sfdcemployees' USING PigStorage(',') AS
(name:chararray, employeesince:int, age:int);
• B = GROUP A BY employeesince;
• You can also put all records into a single group
• C = GROUP A ALL;
• Or group by multiple fields
• D = GROUP A BY (age, employeesince);
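The result of GROUP has two fields: group, which holds the key, and a bag named after the input relation that holds the matching records. A sketch of how the key is read back when grouping on multiple fields:

```pig
A = LOAD 'sfdcemployees' USING PigStorage(',') AS
    (name:chararray, employeesince:int, age:int);

-- With several grouping fields, group is a tuple of those fields.
B = GROUP A BY (age, employeesince);
DESCRIBE B;

-- Pull the key fields back out of the group tuple and count each bag.
C = FOREACH B GENERATE group.age           AS age,
                       group.employeesince AS employeesince,
                       COUNT(A)            AS n;
```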
AGGREGATION
• Pig provides several built-in aggregation functions
• AVG
• COUNT
• COUNT_STAR
• SUM
• MAX
• MIN
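These functions operate on bags, so they are normally applied after a GROUP. A sketch using the 'sfdcemployees' schema from the earlier slides; the output field names are illustrative:

```pig
A = LOAD 'sfdcemployees' USING PigStorage(',') AS
    (name:chararray, employeesince:int, age:int);

-- Per-year statistics: one output record per start year.
by_year = GROUP A BY employeesince;
stats   = FOREACH by_year GENERATE group      AS employeesince,
                                   COUNT(A)   AS employees,
                                   AVG(A.age) AS avg_age,
                                   MIN(A.age) AS youngest,
                                   MAX(A.age) AS oldest;

-- Whole-relation totals: GROUP ... ALL puts every record in one group.
all_rows = GROUP A ALL;
totals   = FOREACH all_rows GENERATE COUNT_STAR(A) AS total_rows,
                                     SUM(A.age)    AS age_sum;
```

COUNT skips tuples whose first field is null, while COUNT_STAR counts every tuple.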
DEFINE
• Assigns an alias to a UDF
• Syntax
• DEFINE alias {function}
• Use DEFINE to alias a UDF when:
• The UDF has a long package name
• The UDF constructor takes string parameters
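Both cases can be sketched as follows. The jar name myudfs.jar and the classes com.example.pig.udf.NormalizeName and com.example.pig.udf.MaskString are hypothetical; substitute the UDFs you actually have.

```pig
-- Hypothetical jar and class names, for illustration only.
REGISTER myudfs.jar;

-- Case 1: shorten a long package name.
DEFINE Normalize com.example.pig.udf.NormalizeName();

-- Case 2: pass a string argument to the UDF constructor.
DEFINE Mask com.example.pig.udf.MaskString('*');

A = LOAD 'sfdcemployees' USING PigStorage(',') AS
    (name:chararray, employeesince:int, age:int);

-- The aliases are then used like built-in functions.
B = FOREACH A GENERATE Normalize(name) AS name, Mask(name) AS masked;
```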
Thank You

Introduction to PIG

  • 1. PIG in Big Data 9/18/2016 1 Data keeps growing…
  • 2. BIG DATA • ‘Big Data’ is similar to ‘small data’, but bigger in size • It requires different approaches: Techniques, tools and architecture • To solve new problems or old problems in a better way • Storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques 9/18/2016 2
  • 3. INTRODUCTION TO BIG DATA …. AND FAR FAR BEYOND User generated content Mobile Web User Click Stream Sentiment Social Network External Demographics Business Data Feeds HD Video Speech to Text Product / Service Logs SMS / MMS Petabytes WEB Weblogs Offer history A / B Testing Dynamic Pricing Affiliate Network Search Marketing Behavioral Targeting Dynamic Funnels Terabytes CRM Segmentation Offer Details Customer Touches Support Contacts Gigabytes ERP Purchase Details Purchase Records Payment Records Megabytes Source:http://datameer.com 9/18/2016 3
  • 4. CONT., • Walmart handles more than 1 million customer transactions every hour • Facebook handles 40 billion photos from its user base • Decoding the human genome originally took 10years to process; now it can be achieved in one week 4 9/18/2016
  • 6. HADOOP • As data is growing, we need to be able to scale-out computation • Uses cheap(er) hardware to grow horizontally • Tolerates a few machines going down • Happens all the time • Stores all your data from all system • No need to throw your data 9/18/2016 6
  • 8. HDFS • Hadoop Distributed File System • A distributed, scalable, and portable file system written in Java for the Hadoop framework • Provides high-throughput access to the application data • Runs on large clusters of commodity machines • Used to store large datasets 9/18/2016 8
  • 9. CONT., 9/18/2016 9 • A file we want to store on HDFS … We’re raising the question because no one else wants to, because no one else wants to say what needs to be said. And let’s be real, it’s the two-ton elephant in the room with nearly every other star’s name on the trade rumor radar these days. We’ve read over and over again about Nash refusing to ask for a trade, refusing to play the game that so many others have late in their careers. 600 MB
  • 10. CONT., 9/18/2016 10 • HDFS Splits file into blocks … We’re raising the question because no one else wants to, because no one else wants to say what needs to be said. And let’s be real, it’s the two-ton elephant in the room with nearly every other star’s name on the trade rumor radar these days. We’ve read over and over again about Nash refusing to play the game that so many others have late in their careers. 256 MB 256 MB 88 MB
  • 11. MAP REDUCE • Distributed data processing model and execution environment that runs on large clusters of commodity machines • Also called MR • Programs are inherently parallel 9/18/2016 11
  • 13. PIG-INTRODUCTION • High level data flow language for exploring very large datasets • Provides an engine for executing data flows in parallel on Hadoop • Compiler that produces sequences of MapReduce programs • Structure is amenable to substantial parallelization • Operates on files in HDFS • Metadata is not required, but used when available 9/18/2016 13
  • 14. KEY PROPERTIES OF PIG • Ease of programming: Trivial to achieve parallel execution of simple and parallel data analysis tasks • Optimization opportunities: Allows the user to focus on semantics rather than efficiency • Extensibility: Users can create their own functions to do special-purpose processing 9/18/2016 14
  • 15. PIG EXECUTION STAGE • [Diagram labels: Client machine; Pig Script; Pig Execution Engine; MapReduce; Hadoop Job; Hadoop Cluster]
  • 17. EQUIVALENT MAP REDUCE CODE
  • 18. PIG VS HADOOP • 5% of the MR code • 5% of the MR development time • Within 25% of the MR execution time • Readable and reusable • Easy-to-learn DSL • Increases programmer productivity • No Java expertise required • Anyone (e.g., BI folks) can trigger the jobs • Insulates against Hadoop complexity: version upgrades, changes in Hadoop interfaces, JobConf configuration tuning, job chains
  • 19. PIG COMMANDS (statement: description) • Load: Read data from the file system • Store: Write data to the file system • Dump: Write output to stdout • Foreach: Apply an expression to each record and generate one or more records • Filter: Apply a predicate to each record and remove records where it is false • Group / Cogroup: Collect records with the same key from one or more inputs • Join: Join two or more inputs based on a key • Order: Sort records based on a key • Distinct: Remove duplicate records • Union: Merge two datasets • Limit: Limit the number of records • Split: Split data into two or more sets, based on filter conditions
  • 20. LOADING DATA • LOAD • Reads data from the file system • Syntax • LOAD 'input' [USING function] [AS schema]; • E.g., A = LOAD 'input' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);
  • 21. SCHEMA • Use schemas to assign types to fields • A = LOAD 'data' AS (name, age, gpa); • name, age, and gpa default to bytearray • A = LOAD 'data' AS (name:chararray, age:int, gpa:float); • name is now a chararray (string), age is an int, and gpa is a float
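The effect of a schema can be pictured in plain Python: without one, every field stays as raw bytes; with one, each field is cast to its declared type. This is only an analogy for the two LOAD statements on the slide, not how Pig is implemented. `load_untyped` and `load_typed` are hypothetical helpers, the sample line is made up, and error handling for values that fail to cast is left out.

```python
def load_untyped(line, delimiter="\t"):
    """No schema: every field is kept as raw bytes (like Pig's bytearray)."""
    return tuple(value.encode() for value in line.rstrip("\n").split(delimiter))


def load_typed(line, schema, delimiter="\t"):
    """With a schema: cast each field to the type declared for it."""
    fields = line.rstrip("\n").split(delimiter)
    return tuple(cast(value) for (_, cast), value in zip(schema, fields))


# Python stand-ins for (name:chararray, age:int, gpa:float).
schema = [("name", str), ("age", int), ("gpa", float)]

line = "alice\t21\t3.5"  # made-up sample record
raw = load_untyped(line)
record = load_typed(line, schema)
print(raw)
print(record)
```

With the typed record, `age` can be compared numerically; with the untyped one it is still a byte string.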
  • 22. DESCRIBING A SCHEMA • DESCRIBE • Provides the schema of a relation • Syntax • DESCRIBE alias; • If no schema was provided, DESCRIBE reports “Schema for alias unknown”
  • 23. DUMP AND STORE • DUMP writes the output to the console • grunt> A = LOAD 'data'; • grunt> DUMP A; -- prints the contents of A on the console • STORE writes the output to an HDFS location • grunt> A = LOAD 'data'; • grunt> STORE A INTO '/user/username/output'; -- writes the contents of A to HDFS • Pig starts a job only when a DUMP or STORE is encountered
  • 24. REFERENCING FIELDS • Fields are referred to by positional notation or by name (alias) • Positional notation is generated by the system • Starts with $0 • Names are assigned by you using schemas • E.g., A = LOAD 'data' AS (name:chararray, age:int); • With positional notation, fields can be accessed as • A = LOAD 'data'; • B = FOREACH A GENERATE $0, $1; -- 1st and 2nd columns
  • 25. LIMIT • Limits the number of output tuples • Syntax • alias = LIMIT alias n;
  • 26. FILTER • Selects tuples from a relation based on some condition • Syntax • alias = FILTER alias BY expression; • Example, to filter for 'marcbenioff' • A = LOAD 'sfdcemployees' USING PigStorage(',') AS (name:chararray, employeesince:int, age:int); • B = FILTER A BY name == 'marcbenioff'; • You can use boolean operators (AND, OR, NOT) • B = FILTER A BY (employeesince < 2005) AND (NOT (name == 'marcbenioff'));
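The second FILTER on this slide keeps a tuple only when the whole boolean expression is true. The same predicate can be written in plain Python to see which rows survive. This is a sketch of the semantics, not Pig code: the `Employee` type is a hypothetical stand-in for the slide's schema, and the rows are made-up sample data.

```python
from typing import NamedTuple


class Employee(NamedTuple):
    """Python stand-in for (name:chararray, employeesince:int, age:int)."""
    name: str
    employeesince: int
    age: int


# Made-up sample rows, for illustration only.
employees = [
    Employee("marcbenioff", 1999, 50),
    Employee("alice", 2003, 40),
    Employee("bob", 2010, 30),
]

# Same predicate as: FILTER A BY (employeesince < 2005) AND (NOT (name == 'marcbenioff'))
b = [
    e for e in employees
    if e.employeesince < 2005 and not (e.name == "marcbenioff")
]
print(b)
```

Only the "alice" row passes: "marcbenioff" is excluded by the NOT clause and "bob" by the year test.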
  • 27. GROUP BY • Syntax • alias = GROUP alias { ALL | BY expression } [, alias ALL | BY expression …] [PARALLEL n]; • E.g., to group by employee start year at Salesforce • A = LOAD 'sfdcemployees' USING PigStorage(',') AS (name:chararray, employeesince:int, age:int); • B = GROUP A BY (employeesince); • You can also put all tuples into a single group • B = GROUP A ALL; • Or group by multiple fields • B = GROUP A BY (age, employeesince);
  • 28. AGGREGATION • Pig provides several built-in aggregation functions • AVG • COUNT • COUNT_STAR • SUM • MAX • MIN
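Aggregation functions are normally applied to the groups produced by GROUP. The idea can be sketched in plain Python as grouping rows by a key and then computing a count and an average per group. This is an illustration of the semantics only, not Pig code: the rows are made-up sample data, and the handling of null values is not modeled.

```python
from collections import defaultdict

# Made-up rows shaped like (name, employeesince, age).
rows = [
    ("ann", 2003, 40),
    ("bob", 2003, 30),
    ("cy", 2010, 25),
]

# Grouping step: collect all rows that share the same employeesince value.
groups = defaultdict(list)
for row in rows:
    groups[row[1]].append(row)

# Aggregation step: per group, a count and an average age (like COUNT and AVG).
summary = {
    year: (len(bag), sum(r[2] for r in bag) / len(bag))
    for year, bag in groups.items()
}
print(summary)  # {2003: (2, 35.0), 2010: (1, 25.0)}
```

Each key maps to a collection of the original rows, which is why an aggregate can be computed over any field of the grouped rows.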
  • 29. DEFINE • Assigns an alias to a UDF • Syntax • DEFINE alias {function}; • Use DEFINE to specify a UDF when: • The UDF has a long package name • The UDF constructor takes string parameters