Data Manipulation with Pig
Page 1
Wes Floyd - @weswfloyd
Page 2
Pig History
• Born at Yahoo! Research, then incubated at Apache
• Built so developers can avoid low-level Map/Reduce programming without resorting to Hive/SQL queries
• Committers from: Yahoo, Hortonworks, LinkedIn,
SalesForce, IBM, Twitter, Netflix, and others
• Alan Gates on Pig
Page 3
Pig
• An engine for executing programs on top of
Hadoop
• It provides a language, Pig Latin, to specify these
programs
Page 4
HDP: Enterprise Hadoop Platform
Page 5
Hortonworks
Data Platform (HDP)
•  The ONLY 100% open source
and complete platform
•  Integrates full range of
enterprise-ready services
•  Certified and tested at scale
•  Engineered for deep
ecosystem interoperability
[Platform diagram: Hortonworks Data Platform (HDP) = platform services on top of the Hadoop core; operational services (Oozie, Ambari, Falcon*), data services (Hive & HCatalog, Pig, HBase, plus load & extract via Sqoop, Flume, NFS, WebHDFS, Knox*), and core (HDFS, YARN, MapReduce, Tez); deployable on OS/VM, cloud, or appliance]
Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
Why use Pig?
• Suppose you have user data in one file, website data
in another, and you need to find the top 5 most visited
sites by users aged 18 - 25
Page 6
In Map-Reduce
Page 7
170 lines of code, 4 hours to write
In Pig Latin
Users = load 'input/users' using PigStorage(',') as (name:chararray, age:int);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'input/pages' using PigStorage(',') as (user:chararray, url:chararray);
Jnd   = join Fltrd by name, Pages by user;
Grpd  = group Jnd by url;
Smmd  = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd  = order Smmd by clicks desc;
Top5  = limit Srtd 5;
store Top5 into 'output/top5sites' using PigStorage(',');
Page 8
9 lines of code, 15 minutes to write
170 lines to 9 lines of code
Essence of Pig
• Map-Reduce is too low a level, SQL too high
• Pig Latin, a language intended to sit between the two
– Provides standard relational transforms (join, sort, etc.)
– Schemas are optional, used when available, can be defined at
runtime
– User Defined Functions are first class citizens
Page 9
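To make the "schemas are optional" point concrete, here is a minimal sketch (the file path and field names are illustrative, not from the deck): the same file can be loaded without a schema and addressed positionally, or with a schema declared at load time.

raw    = load 'input/events' using PigStorage('\t');
urls   = foreach raw generate $1;               -- no schema: positional fields $0, $1, ...
typed  = load 'input/events' using PigStorage('\t')
         as (user:chararray, url:chararray, ts:long);
byUser = group typed by user;                   -- schema declared: fields referenced by name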
Pig Architecture
Page 10
Hadoop
Pig Client:
Parses, validates, optimizes, plans, coordinates execution
Data stored in HDFS
Processing done via MapReduce
Pig Elements
Page 11
Pig Latin
•  High-level scripting language
•  Requires no metadata or schema
•  Statements translated into a series of MapReduce jobs
Grunt
•  Interactive shell
Piggybank
•  Shared repository for User Defined Functions (UDFs)
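As a rough sketch of how Grunt and Piggybank fit together (the jar path and the Reverse UDF class are assumptions, not taken from the deck): Grunt is the shell started by the pig command, and Piggybank UDFs are brought in with REGISTER/DEFINE.

grunt> REGISTER /usr/lib/pig/piggybank.jar;                                   -- assumed install path
grunt> DEFINE Reverse org.apache.pig.piggybank.evaluation.string.Reverse();   -- sample Piggybank UDF
grunt> users = load 'input/users' using PigStorage(',') as (name:chararray, age:int);
grunt> rev   = foreach users generate Reverse(name);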
Pig Latin Data Flow
Page 12
LOAD
(HDFS/HCat)
TRANSFORM
(Pig)
DUMP or
STORE
(HDFS/HCAT)
Read data to be
manipulated from the
file system
Manipulate the
data
Output data to the
screen or store for
processing
In code:
•  VARIABLE1 = LOAD [somedata]
•  VARIABLE2 = [TRANSFORM operation]
•  STORE VARIABLE2 INTO '[some location]'
Pig Relations
1.  A bag is an unordered collection of
tuples (the tuples can have
different sizes).
2.  A tuple is an ordered
set of fields.
3.  A field is a piece of data.
Pig Latin statements work with relations
[Diagram: a bag contains tuples; each tuple contains fields (Field 1, Field 2, Field 3)]
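To make the nesting concrete: grouping a relation produces one tuple per key whose second field is a bag of the matching tuples. A minimal sketch with made-up data:

emp   = LOAD 'input/emp' USING PigStorage(',') AS (name:chararray, age:int);
byAge = GROUP emp BY age;
-- DUMP byAge prints tuples shaped like:
-- (25,{(alice,25),(bob,25)})
-- (31,{(carol,31)})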
FILTER, GROUP, FOREACH, ORDER
Page 14
logevents = LOAD 'input/my.log' AS (date:chararray,
            level:chararray, code:int, message:chararray);
severe    = FILTER logevents BY (level == 'severe' AND code >= 500);
grouped   = GROUP severe BY code;

e1 = LOAD 'pig/input/File1' USING PigStorage(',')
     AS (name:chararray, age:int, zip:int, salary:double);
f  = FOREACH e1 GENERATE age, salary;
g  = ORDER f BY age;
JOIN, GROUP, LIMIT
Page 15
employees = LOAD '[somefile]'
            AS (name:chararray, age:int, zip:int, salary:double);
agegroup  = GROUP employees BY age;
h         = LIMIT agegroup 100;

e1 = LOAD '[somefile]' USING PigStorage(',')
     AS (name:chararray, age:int, zip:int, salary:double);
e2 = LOAD '[somefile]' USING PigStorage(',')
     AS (name:chararray, phone:chararray);
e3 = JOIN e1 BY name, e2 BY name;
Pig Basics Demo
Page 16
Grunt Command Line Demo
Page 17
Hive vs Pig
Page 18
Pig and Hive work well together
and many businesses use both.
Hive is a good choice:
•  when you want to query the data
•  when you need an answer to specific
questions
•  if you are familiar with SQL
Pig is a good choice:
•  for ETL (Extract -> Transform -> Load)
•  for preparing data for easier analysis
•  when you have a long series of steps to
perform
Tool Comparison
Page 19 © Hortonworks 2012

Feature       | MapReduce       | Pig                                           | Hive
Record format | Key-value pairs | Tuple                                         | Record
Data model    | User defined    | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists, char, varchar, decimal, …
Schema        | Encoded in app  | Declared in script or read by loader          | Read from metadata
Data location | Encoded in app  | Declared in script                            | Read from metadata
Data format   | Encoded in app  | Declared in script                            | Read from metadata
T-SQL vs Hadoop Ecosystem
Page 20
Feature                | T-SQL | Pig           | Hive
Query Data             | Yes   | Yes (in bulk) | Yes
Local Variables        | Yes   | Yes           | No
Conditional Logic      | Yes   | Limited       | Limited
Procedural Programming | Yes   | No            | No
UDFs                   | No    | Yes           | Yes
HCatalog: Data Sharing is Hard
Page 21
Photo Credit: totalAldo via Flickr
This is programmer Bob; he uses Pig to crunch data.
This is analyst Joe; he uses Hive to build reports and answer ad-hoc queries.
Hmm, is it done yet? Where is it? What format did you use to store it today? Is it compressed?
And can you help me load it into Hive? I can never remember all the parameters I have to
pass to that alter table command.
Ok
Bob, I need
today’s data
Dude, we need
HCatalog
© Hortonworks Inc. 2012
Pig Example
Page 22
Assume you want to count how many times each of your users went to each of
your URLs
raw = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by myudfs.NotABot(user);
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group) as (url, user), COUNT(botless);
store cntd into '/data/counted/20120530';
© Hortonworks 2013
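For the script above to resolve myudfs.NotABot, the UDF jar has to be registered first; a one-line sketch (the jar name is hypothetical):

register myudfs.jar;   -- makes the myudfs.NotABot UDF visible to the script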
Pig Example
Page 23
Assume you want to count how many times each of your users went to each of
your URLs
raw = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by myudfs.NotABot(user);
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group) as (url, user), COUNT(botless);
store cntd into '/data/counted/20120530';
Using HCatalog:
raw = load 'rawevents' using HCatLoader();
botless = filter raw by myudfs.NotABot(user) and ds == '20120530';
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group) as (url, user), COUNT(botless);
store cntd into 'counted' using HCatStorer();
•  No need to know file location
•  No need to declare schema
•  Partition filter
© Hortonworks 2013
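On the write side, HCatStorer can also be given the target partition explicitly instead of relying on a ds field in the output data; a sketch of that variant (not shown in the deck):

store cntd into 'counted' using HCatStorer('ds=20120530');   -- write into the ds=20120530 partition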
Tools With HCatalog
Page 24
Feature       | MapReduce + HCatalog                     | Pig + HCatalog                                | Hive
Record format | Record                                   | Tuple                                         | Record
Data model    | int, float, string, maps, structs, lists | int, float, string, bytes, maps, tuples, bags | int, float, string, maps, structs, lists
Schema        | Read from metadata                       | Read from metadata                            | Read from metadata
Data location | Read from metadata                       | Read from metadata                            | Read from metadata
Data format   | Read from metadata                       | Read from metadata                            | Read from metadata
•  Pig/MR users can read schema from metadata
•  Pig/MR users are insulated from schema, location, and format changes
•  All users have access to other users’ data as soon as it is committed
Pig with HCat Demo
Page 25
Data & Metadata REST Services APIs
Page 26
[Diagram: existing and new applications call MapReduce, Pig, and Hive, which share HCatalog for metadata; WebHCat exposes RESTful web services on top, and storage spans HDFS, HBase, and external stores]
WebHDFS & WebHCat
provide RESTful API as
“front door” for Hadoop
•  Opens the door to
languages other than Java
•  Thin clients via web
services vs. fat-clients in
gateway
•  Insulation from interface
changes release to release
Opens Hadoop to integration with existing and new applications
RESTful API Access for Pig
• Code example:
  curl -s -d user.name=hue \
       -d execute="<pig script>" \
       'http://localhost:50111/templeton/v1/pig'
•  RestSharp (restsharp.org/)
– Simple REST and HTTP API Client
for .NET
Page 27
WebHCat REST API
Page 28 © Hortonworks 2012
Hadoop/
HCatalog
Get a list of all tables in the default database:

  GET http://…/v1/ddl/database/default/table

  {
    "tables": ["counted","processed",],
    "database": "default"
  }

•  REST endpoints: databases, tables, partitions, columns, table properties
•  PUT to create/update, GET to list or describe, DELETE to drop

Create the rawevents table:

  PUT http://…/v1/ddl/database/default/table/rawevents

  {"columns": [{ "name": "url", "type": "string" },
               { "name": "user", "type": "string" }],
   "partitionedBy": [{ "name": "ds", "type": "string" }]}

  Response:
  {
    "table": "rawevents",
    "database": "default"
  }

Describing the same table returns:

  {
    "columns": [{"name": "url", "type": "string"},
                {"name": "user", "type": "string"}],
    "database": "default",
    "table": "rawevents"
  }
Pig with WebHCat Demo
Page 29
Hive-on-MR vs. Hive-on-Tez
Page 30
SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[Diagram: on MapReduce the query runs as a chain of MR jobs, one per join plus one for the GROUP BY aggregation (COUNT(*), AVERAGE(c.price)), and each job writes its intermediate result to HDFS; on Tez the same operators run as a single DAG with no intermediate HDFS writes]
Tez avoids unneeded writes to HDFS
Pig on Tez - Design
Page 31
[Diagram: the Logical Plan is translated by LogToPhyTranslationVisitor into a Physical Plan, which is compiled either by TezCompiler into a Tez Plan for the Tez Execution Engine or by MRCompiler into an MR Plan for the MR Execution Engine]
Performance numbers
Page 32
[Bar chart, time in seconds, MR vs. Tez: Replicated Join (2.8x speedup), Join + Groupby (1.5x), Join + Groupby + Orderby (1.5x), 3-way Split + Join + Groupby + Orderby (2.6x)]
User Defined Functions
• Ultimate in extensibility and portability
• Custom processing
– Java
– Python
– JavaScript
– Ruby
• Integration with MapReduce phases
– Map
– Combine
– Reduce
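A Java UDF, for example, is wired into a script with register and define; a minimal sketch (jar, package, and function names are hypothetical):

register myudfs.jar;
define NotABot myudfs.NotABot();                -- optional alias for the fully qualified name
raw     = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by NotABot(user);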
User Defined Functions
public class MyUDF extends EvalFunc<DataBag>
        implements Algebraic {
    …
}
• Algebraic functions
• 3-phase execution
– Map – called once for each tuple
– Combiner – called zero or more times for each map result
– Reduce
User Defined Functions
public class MyUDF extends EvalFunc<DataBag>
        implements Accumulator {
    …
}
• Accumulator functions
• Incremental processing of data
• Called in both map and reduce phase
User Defined Functions
public class MyUDF extends FilterFunc {
    …
}
• Filter functions
• Returns boolean based on processing of the tuple
• Called in both map and reduce phase
Questions & Answers
Page 37
