Data Manipulation with Pig
Page 1
Wes Floyd - @weswfloyd
Page 2
Pig History
• Born at Yahoo! Research, then incubated at Apache
• Built so developers can avoid low-level Map/Reduce programming without resorting to Hive/SQL queries
• Committers from: Yahoo, Hortonworks, LinkedIn,
SalesForce, IBM, Twitter, Netflix, and others
• Alan Gates on Pig
Page 3
Pig
• An engine for executing programs on top of
Hadoop
• It provides a language, Pig Latin, to specify these
programs
Page 4
HDP: Enterprise Hadoop Platform
Page 5
Hortonworks
Data Platform (HDP)
•  The ONLY 100% open source
and complete platform
•  Integrates full range of
enterprise-ready services
•  Certified and tested at scale
•  Engineered for deep
ecosystem interoperability
[Platform diagram: Hortonworks Data Platform (HDP) = platform services on top of the Hadoop core; operational services (Oozie, Ambari, Falcon*), data services (Hive & HCatalog, Pig, HBase, plus load & extract via Sqoop, Flume, NFS, WebHDFS, Knox*), and core (HDFS, YARN, MapReduce, Tez); deployable on OS/VM, cloud, or appliance]
Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
Why use Pig?
• Suppose you have user data in one file, website data
in another, and you need to find the top 5 most visited
sites by users aged 18 - 25
Page 6
In Map-Reduce
Page 7
170 lines of code, 4 hours to write
In Pig Latin
Users = load 'input/users' using PigStorage(',') as (name:chararray, age:int);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'input/pages' using PigStorage(',') as (user:chararray, url:chararray);
Jnd   = join Fltrd by name, Pages by user;
Grpd  = group Jnd by url;
Smmd  = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd  = order Smmd by clicks desc;
Top5  = limit Srtd 5;
store Top5 into 'output/top5sites' using PigStorage(',');
Page 8
9 lines of code, 15 minutes to write
170 lines to 9 lines of code
Essence of Pig
• Map-Reduce is too low a level, SQL too high
• Pig Latin, a language intended to sit between the two
– Provides standard relational transforms (join, sort, etc.)
– Schemas are optional, used when available, can be defined at
runtime
– User Defined Functions are first class citizens
Page 9
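To make the "schemas are optional" point concrete, here is a minimal sketch (the file path and field names are illustrative, not from the deck): the same file can be loaded without a schema and addressed positionally, or with a schema declared at load time.

raw    = load 'input/events' using PigStorage('\t');
urls   = foreach raw generate $1;               -- no schema: positional fields $0, $1, ...
typed  = load 'input/events' using PigStorage('\t')
         as (user:chararray, url:chararray, ts:long);
byUser = group typed by user;                   -- schema declared: fields referenced by name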
Pig Architecture
Page 10
Hadoop
Pig Client:
Parses, validates, optimizes, plans, coordinates execution
Data stored in HDFS
Processing done via MapReduce
Pig Elements
Page 11
Pig Latin
•  High-level scripting language
•  Requires no metadata or schema
•  Statements translated into a series of MapReduce jobs
Grunt
•  Interactive shell
Piggybank
•  Shared repository for User Defined Functions (UDFs)
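As a rough sketch of how Grunt and Piggybank fit together (the jar path and the Reverse UDF class are assumptions, not taken from the deck): Grunt is the shell started by the pig command, and Piggybank UDFs are brought in with REGISTER/DEFINE.

grunt> REGISTER /usr/lib/pig/piggybank.jar;                                   -- assumed install path
grunt> DEFINE Reverse org.apache.pig.piggybank.evaluation.string.Reverse();   -- sample Piggybank UDF
grunt> users = load 'input/users' using PigStorage(',') as (name:chararray, age:int);
grunt> rev   = foreach users generate Reverse(name);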
Pig Latin Data Flow
Page 12
LOAD
(HDFS/HCat)
TRANSFORM
(Pig)
DUMP or
STORE
(HDFS/HCAT)
Read data to be
manipulated from the
file system
Manipulate the
data
Output data to the
screen or store for
processing
In code:
•  VARIABLE1 = LOAD [somedata]
•  VARIABLE2 = [TRANSFORM operation]
•  STORE VARIABLE2 INTO '[some location]'
Pig Relations
1.  A bag is an unordered collection of
tuples (the tuples can have
different sizes).
2.  A tuple is an ordered
set of fields.
3.  A field is a piece of data.
Pig Latin statements work with relations
[Diagram: a bag contains tuples; each tuple contains fields (Field 1, Field 2, Field 3)]
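To make the nesting concrete: grouping a relation produces one tuple per key whose second field is a bag of the matching tuples. A minimal sketch with made-up data:

emp   = LOAD 'input/emp' USING PigStorage(',') AS (name:chararray, age:int);
byAge = GROUP emp BY age;
-- DUMP byAge prints tuples shaped like:
-- (25,{(alice,25),(bob,25)})
-- (31,{(carol,31)})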
FILTER, GROUP, FOREACH, ORDER
Page 14
logevents = LOAD 'input/my.log' AS (date:chararray,
            level:chararray, code:int, message:chararray);
severe    = FILTER logevents BY (level == 'severe' AND code >= 500);
grouped   = GROUP severe BY code;

e1 = LOAD 'pig/input/File1' USING PigStorage(',')
     AS (name:chararray, age:int, zip:int, salary:double);
f  = FOREACH e1 GENERATE age, salary;
g  = ORDER f BY age;
JOIN, GROUP, LIMIT
Page 15
employees = LOAD '[somefile]'
            AS (name:chararray, age:int, zip:int, salary:double);
agegroup  = GROUP employees BY age;
h         = LIMIT agegroup 100;

e1 = LOAD '[somefile]' USING PigStorage(',')
     AS (name:chararray, age:int, zip:int, salary:double);
e2 = LOAD '[somefile]' USING PigStorage(',')
     AS (name:chararray, phone:chararray);
e3 = JOIN e1 BY name, e2 BY name;
Pig Basics Demo
Page 16
Grunt Command Line Demo
Page 17
Hive vs Pig
Page 18
Pig and Hive work well together
and many businesses use both.
Hive is a good choice:
•  when you want to query the data
•  when you need an answer to specific
questions
•  if you are familiar with SQL
Pig is a good choice:
•  for ETL (Extract -> Transform -> Load)
•  for preparing data for easier analysis
•  when you have a long series of steps to
perform
Tool Comparison
Page 19 © Hortonworks 2012

Feature       | MapReduce       | Pig                                           | Hive
Record format | Key-value pairs | Tuple                                         | Record
Data model    | User defined    | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists, char, varchar, decimal, …
Schema        | Encoded in app  | Declared in script or read by loader          | Read from metadata
Data location | Encoded in app  | Declared in script                            | Read from metadata
Data format   | Encoded in app  | Declared in script                            | Read from metadata
T-SQL vs Hadoop Ecosystem
Page 20
Feature                | T-SQL | Pig           | Hive
Query Data             | Yes   | Yes (in bulk) | Yes
Local Variables        | Yes   | Yes           | No
Conditional Logic      | Yes   | Limited       | Limited
Procedural Programming | Yes   | No            | No
UDFs                   | No    | Yes           | Yes
HCatalog: Data Sharing is Hard
Page 21
Photo Credit: totalAldo via Flickr
This is programmer Bob; he uses Pig to crunch data.
This is analyst Joe; he uses Hive to build reports and answer ad-hoc queries.
Hmm, is it done yet? Where is it? What format did you use to store it today? Is it compressed?
And can you help me load it into Hive? I can never remember all the parameters I have to
pass to that alter table command.
Ok
Bob, I need
today’s data
Dude, we need
HCatalog
© Hortonworks Inc. 2012
Pig Example
Page 22
Assume you want to count how many times each of your users went to each of
your URLs
raw = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by myudfs.NotABot(user);
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group) as (url, user), COUNT(botless);
store cntd into '/data/counted/20120530';
© Hortonworks 2013
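For the script above to resolve myudfs.NotABot, the UDF jar has to be registered first; a one-line sketch (the jar name is hypothetical):

register myudfs.jar;   -- makes the myudfs.NotABot UDF visible to the script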
Pig Example
Page 23
Assume you want to count how many times each of your users went to each of
your URLs
raw = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by myudfs.NotABot(user);
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group) as (url, user), COUNT(botless);
store cntd into '/data/counted/20120530';
Using HCatalog:
raw = load 'rawevents' using HCatLoader();
botless = filter raw by myudfs.NotABot(user) and ds == '20120530';
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group) as (url, user), COUNT(botless);
store cntd into 'counted' using HCatStorer();
•  No need to know file location
•  No need to declare schema
•  Partition filter
© Hortonworks 2013
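On the write side, HCatStorer can also be given the target partition explicitly instead of relying on a ds field in the output data; a sketch of that variant (not shown in the deck):

store cntd into 'counted' using HCatStorer('ds=20120530');   -- write into the ds=20120530 partition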
Tools With HCatalog
Page 24
Feature       | MapReduce + HCatalog                     | Pig + HCatalog                                | Hive
Record format | Record                                   | Tuple                                         | Record
Data model    | int, float, string, maps, structs, lists | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists
Schema        | Read from metadata                       | Read from metadata                            | Read from metadata
Data location | Read from metadata                       | Read from metadata                            | Read from metadata
Data format   | Read from metadata                       | Read from metadata                            | Read from metadata
•  Pig/MR users can read schema from metadata
•  Pig/MR users are insulated from schema, location, and format changes
•  All users have access to other users’ data as soon as it is committed
Pig with HCat Demo
Page 25
Data & Metadata REST Services APIs
Page 26
[Diagram: existing and new applications call MapReduce, Pig, and Hive, which share HCatalog for metadata; WebHCat exposes RESTful web services on top, and storage spans HDFS, HBase, and external stores]
WebHDFS & WebHCat
provide RESTful API as
“front door” for Hadoop
•  Opens the door to
languages other than Java
•  Thin clients via web
services vs. fat-clients in
gateway
•  Insulation from interface
changes release to release
Opens Hadoop to integration with existing and new applications
RESTful API Access for Pig
• Code example:
  curl -s -d user.name=hue \
       -d execute="<pig script>" \
       'http://localhost:50111/templeton/v1/pig'
•  RestSharp (restsharp.org/)
– Simple REST and HTTP API Client
for .NET
Page 27
WebHCat REST API
Page 28 © Hortonworks 2012
Hadoop/
HCatalog
Get a list of all tables in the default database:

  GET http://…/v1/ddl/database/default/table

  {
    "tables": ["counted","processed",],
    "database": "default"
  }

•  REST endpoints: databases, tables, partitions, columns, table properties
•  PUT to create/update, GET to list or describe, DELETE to drop

Create the rawevents table:

  PUT http://…/v1/ddl/database/default/table/rawevents

  {"columns": [{ "name": "url", "type": "string" },
               { "name": "user", "type": "string" }],
   "partitionedBy": [{ "name": "ds", "type": "string" }]}

  Response:
  {
    "table": "rawevents",
    "database": "default"
  }

Describing the same table returns:

  {
    "columns": [{"name": "url", "type": "string"},
                {"name": "user", "type": "string"}],
    "database": "default",
    "table": "rawevents"
  }
Pig with WebHCat Demo
Page 29
Hive-on-MR vs. Hive-on-Tez
Page 30
SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[Diagram: on MapReduce the query runs as a chain of MR jobs, one per join plus one for the GROUP BY aggregation (COUNT(*), AVERAGE(c.price)), and each job writes its intermediate result to HDFS; on Tez the same operators run as a single DAG with no intermediate HDFS writes]
Tez avoids unneeded writes to HDFS
Pig on Tez - Design
Page 31
[Diagram: the Logical Plan is translated by LogToPhyTranslationVisitor into a Physical Plan, which is compiled either by TezCompiler into a Tez Plan for the Tez Execution Engine or by MRCompiler into an MR Plan for the MR Execution Engine]
Performance numbers
Page 32
[Bar chart, time in seconds, MR vs. Tez: Replicated Join (2.8x speedup), Join + Groupby (1.5x), Join + Groupby + Orderby (1.5x), 3-way Split + Join + Groupby + Orderby (2.6x)]
User Defined Functions
• Ultimate in extensibility and portability
• Custom processing
– Java
– Python
– JavaScript
– Ruby
• Integration with MapReduce phases
– Map
– Combine
– Reduce
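A Java UDF, for example, is wired into a script with register and define; a minimal sketch (jar, package, and function names are hypothetical):

register myudfs.jar;
define NotABot myudfs.NotABot();                -- optional alias for the fully qualified name
raw     = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by NotABot(user);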
User Defined Functions
public class MyUDF extends EvalFunc<DataBag>
        implements Algebraic {
    …
}
• Algebraic functions
• 3-phase execution
– Map – called once for each tuple
– Combiner – called zero or more times for each map result
– Reduce
User Defined Functions
public class MyUDF extends EvalFunc<DataBag>
        implements Accumulator {
    …
}
• Accumulator functions
• Incremental processing of data
• Called in both map and reduce phase
User Defined Functions
public class MyUDF extends FilterFunc {
    …
}
• Filter functions
• Returns boolean based on processing of the tuple
• Called in both map and reduce phase
Questions & Answers
Page 37
