Pig programming is more fun: New features in Pig



Daniel Dai (@daijy)
Thejas Nair (@thejasn)




© Hortonworks Inc. 2011                        Page 1
What is Apache Pig?
• Pig Latin, a high-level data processing language.
• An engine that executes Pig Latin locally or on a Hadoop cluster.




Pig-latin-cup pic from http://www.flickr.com/photos/frippy/2507970530/

                  Architecting the Future of Big Data
                                                                                               Page 2
                  © Hortonworks Inc. 2011
Pig-latin example
• Query : Get the list of web pages visited by users whose
  age is between 20 and 29 years.

USERS = load 'users' as (uid, age);

USERS_20s = filter USERS by age >= 20 and age <= 29;

PVs = load 'pages' as (url, uid, timestamp);

PVs_u20s = join USERS_20s by uid, PVs by uid;
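The semantics of this four-statement pipeline can be sketched in plain Python; the sample records below are made up for illustration:

```python
# Plain-Python sketch of the Pig pipeline above; sample data is made up.
users = [("u1", 24), ("u2", 35), ("u3", 28)]            # (uid, age)
pages = [("a.com", "u1", 100), ("b.com", "u2", 101),    # (url, uid, timestamp)
         ("c.com", "u3", 102)]

# filter USERS by age >= 20 and age <= 29
users_20s = [(uid, age) for uid, age in users if 20 <= age <= 29]

# join USERS_20s by uid, PVs by uid
pvs_u20s = [(uid, age, url, ts)
            for uid, age in users_20s
            for url, puid, ts in pages if puid == uid]

print(pvs_u20s)  # [('u1', 24, 'a.com', 100), ('u3', 28, 'c.com', 102)]
```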



Why Pig?
• Faster development
  – Fewer lines of code
  – Don't re-invent the wheel

• Flexible
  – Metadata is optional
  – Extensible
  – Procedural programming



         Pic courtesy http://www.flickr.com/photos/shutterbc/471935204/

Before pig 0.9
[Diagram: three separate scripts: p1.pig, p2.pig, p3.pig]




With pig macros
[Diagram: p1.pig, p2.pig, and p3.pig sharing common logic via macro1.pig and macro2.pig]




With pig macros
[Diagram: p1.pig refactored to reuse the macros rm_bots.pig and get_top.pig]




Pig macro example
• Page_views data: (url, timestamp, uname, …)
• Find
  1. top 5 users (uname) by page views
  2. top 10 most visited urls




Pig Macro example
Without macros:

  page_views = LOAD ..
  /* get top 5 users by page view */
  u_grp = GROUP .. by uname;
  u_count = FOREACH .. COUNT ..
  ord_u_count = ORDER u_count ..
  top_5_users = LIMIT ordered.. 5;
  DUMP top_5_users;

  /* get top 10 urls by page view */
  url_grp = GROUP .. by url;
  url_count = FOREACH .. COUNT ..
  ord_url_count = ORDER url_count..
  top_10_urls = LIMIT ord_url.. 10;
  DUMP top_10_urls;

With a macro:

  /* top x macro */
  DEFINE topCount (rel, col, topNum)
  RETURNS top_num_recs {
    grped = GROUP $rel by $col;
    cnt_grp = FOREACH ..COUNT($rel)..
    ord_cnt = ORDER .. by cnt;
    $top_num_recs = LIMIT.. $topNum;
  }

  page_views = LOAD ..
  /* get top 5 users by page view */
  top_5_users = topCount(page_views, uname, 5);
  …
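The slide elides the macro body with "..", but its group/count/order/limit shape has a direct plain-Python analogue. The helper name top_count and the sample records below are made up for illustration:

```python
from collections import Counter

def top_count(records, key_index, top_num):
    """Plain-Python analogue of the topCount macro above:
    GROUP by a column, COUNT per group, ORDER by count, LIMIT."""
    counts = Counter(rec[key_index] for rec in records)
    return counts.most_common(top_num)

# (url, timestamp, uname) page_views sample; made-up data
page_views = [("a.com", 1, "ann"), ("b.com", 2, "ann"),
              ("a.com", 3, "bob"), ("a.com", 4, "ann")]

print(top_count(page_views, 2, 1))   # top user by page views: [('ann', 3)]
print(top_count(page_views, 0, 1))   # top url by page views: [('a.com', 3)]
```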



Pig macro
• Coming soon – piggybank with pig macros




Writing data flow program
• Writing a complex data pipeline is an iterative process

[Diagram: two example pipelines: Load → Transform, and Load → Join → Group → Transform → Filter]
Writing data flow program


[Same diagram; this time the final Filter step produces no output]
Writing data flow program
• Debug!

[Same diagram, annotated with hypotheses: a bug in Transform? a join on the wrong attributes? a filter that dropped everything?]
Common approaches to debug
• Running on real (large) data
   – Inefficient; takes longer
• Running on (small) samples
   – Joins and selective filters often produce empty results




Pig illustrate command
• Objective: show input/output examples for each statement that are
  – Realistic
  – Complete
  – Concise
  – Generated fast
• Steps
  –Downstream – sample and process
  –Prune
  –Upstream – generate realistic missing classes of examples
  –Prune


Illustrate command demo




Pig relation-as-scalar
• In Pig, each statement alias is a relation
   –Relation is a set of records
• Task: Get list of pages whose load time was more
  than average.
• Steps
   1. Compute average load time
   2. Get list of pages whose load time is > average




Pig relation-as-scalar
• Step 1 looks like
 .. = load ..
 .. = group ..
 al_rel = foreach .. AVG(ltime) as avg_ltime;


• Step 2 looks like
   page_views = load 'pviews.txt' as
                   (url, ltime, ..);

   slow_views = filter page_views by
               ltime > avg_ltime;




Pig relation-as-scalar
• Getting the result of step 1 into step 2 previously required
   – Joining the result of step 1 with the page_views relation, or
   – Writing the result to a file, then using a UDF to read it back
• The Pig scalar feature now simplifies this:
   slow_views = filter page_views by
               ltime > al_rel.avg_ltime;


   –Runtime exception if al_rel has more than one record.
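The relation-as-scalar semantics, including the one-record restriction, can be sketched in plain Python (made-up sample data):

```python
# Plain-Python sketch of relation-as-scalar semantics; data is made up.
page_views = [("a.com", 120), ("b.com", 300), ("c.com", 90)]  # (url, ltime)

# Step 1: al_rel is a relation holding a single record, the average load time.
al_rel = [sum(lt for _, lt in page_views) / len(page_views)]

# Using al_rel as a scalar only works if it holds exactly one record;
# Pig raises a runtime exception otherwise.
if len(al_rel) != 1:
    raise RuntimeError("scalar relation has more than one record")
avg_ltime = al_rel[0]

# Step 2: filter page_views by ltime > avg_ltime
slow_views = [(url, lt) for url, lt in page_views if lt > avg_ltime]
print(slow_views)  # [('b.com', 300)]
```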




UDF in Scripting Language
• Benefits
   – Reuse legacy code
   – Use libraries in the scripting language
   – Leverage Hadoop for non-Java programmers
• Currently supported languages
   – Python (0.8)
   – JavaScript (0.8)
   – Ruby (0.10)
• Extensible interface
   – Minimal effort to support another language



Writing a Python UDF
Write a Python UDF (util.py):

  @outputSchema("word:chararray")
  def concat(word):
    return word + word

  @outputSchemaFunction("squareSchema")
  def square(num):
    if num == None:
        return None
    return ((num)*(num))

  def squareSchema(input):
    return input

Use it from Pig:

  register 'util.py' using jython as util;
  B = foreach A generate util.square(i);

• Invoke Python functions when needed
• Type conversion
  – Python simple type <-> Pig simple type
  – Python Array <-> Pig Bag
  – Python Dict <-> Pig Map
  – Python Tuple <-> Pig Tuple
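Under Jython, Pig supplies the @outputSchema and @outputSchemaFunction decorators; outside Pig you can stub them with no-ops to unit-test the UDFs as plain Python. The stubs below are a sketch, not Pig's actual decorator implementation:

```python
# No-op stand-ins for the decorators Pig injects when running under Jython;
# they just attach the schema metadata and return the function unchanged.
def outputSchema(schema):
    def wrap(f):
        f.outputSchema = schema
        return f
    return wrap

def outputSchemaFunction(name):
    def wrap(f):
        f.outputSchemaFunction = name
        return f
    return wrap

@outputSchema("word:chararray")
def concat(word):
    return word + word

@outputSchemaFunction("squareSchema")
def square(num):
    if num is None:          # Pig nulls arrive as None
        return None
    return num * num

def squareSchema(input):
    return input             # output type mirrors the input type

print(concat("pig"), square(4), square(None))  # pigpig 16 None
```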

Use NLTK in Pig
• Example

  register 'nltk_util.py' using jython as nltk;
  ……
  B = foreach A generate nltk.tokenize(sentence);

  nltk_util.py:

  import nltk
  porter = nltk.PorterStemmer()

  @outputSchema("words:{(word:chararray)}")
  def tokenize(sentence):
    tokens = nltk.word_tokenize(sentence)
    words = [porter.stem(t) for t in tokens]
    return words

[Diagram: "Pig eats everything" → Tokenize → Stemming → (Pig) (eat) (everything)]
Comparison with Pig Streaming

                   Pig Streaming                      Scripting UDF

  Syntax           B = stream A through `perl         B = foreach A generate
                   sample.pl`;                        myfunc.concat(a0, a1), a2;

  Input/Output     stdin/stdout;                      function parameters / return
                   entire relation                    value; particular fields

  Type Conversion  Need to parse input /              Type conversion is
                   convert types                      automatic

  Modularize       Every streaming operator           Organize the functions
                   needs a separate script            into a module




Writing a Script Engine
Writing a bridge UDF:

class JythonFunction extends EvalFunc<Object> {
   public Object exec(Tuple tuple) {
     // Convert Pig input into Python
     PyObject[] params = JythonUtils.pigTupleToPyTuple(tuple).getArray();
     // Invoke the Python UDF
     PyObject result = f.__call__(params);
     // Convert the result back to Pig
     return JythonUtils.pythonToPig(result);
   }
   public Schema outputSchema(Schema input) {
     PyObject outputSchemaDef = f.__findattr__("outputSchema".intern());
     return Utils.getSchemaFromString(outputSchemaDef.toString());
   }
}




Writing a Script Engine
Register scripting UDF

register 'util.py' using jython as util;

What happens in Pig
class JythonScriptEngine extends ScriptEngine {
   public void registerFunctions(String path, String namespace, PigContext pigContext) {
     ……
   }
}

Each function in the registered script (e.g. myudf.py) is wrapped in a bridge UDF:

  def square(num): ……   ->  JythonFunction("square")
  def concat(word): ……  ->  JythonFunction("concat")
  def count(bag): ……    ->  JythonFunction("count")



Algebraic UDF in JRuby
class SUM < AlgebraicPigUdf
  output_schema Schema.long

  # Initial function
  def initial num
    num
  end

  # Intermediate function
  def intermed num
    num.flatten.inject(:+)
  end

  # Final function
  def final num
    intermed(num)
  end
end
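The initial/intermed/final contract lets Pig compute partial sums in mappers and combine them later. A plain-Python sketch of that contract (the two "map tasks" below are simulated, not real Hadoop tasks):

```python
# Plain-Python sketch of the Algebraic contract: initial runs per input
# value, intermed combines partial results, final combines the intermeds.
def initial(num):
    return num

def intermed(nums):
    return sum(nums)

def final(nums):
    return intermed(nums)

data = [1, 2, 3, 4, 5, 6]

# Simulate two map tasks, each producing a partial sum:
partials = [intermed([initial(n) for n in data[:3]]),
            intermed([initial(n) for n in data[3:]])]

print(final(partials))   # 21, same answer as sum(data)
```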


Pig Embedding
• Embed Pig inside a scripting language
  – Python
  – JavaScript
• For algorithms that cannot be expressed in a single Pig script
  – Iterative algorithms
       – PageRank, K-means, Neural Network, Apriori, etc.
  – Parallel independent execution
       – Ensembles
  – Divide and conquer
  – Branching




Pig Embedding
from org.apache.pig.scripting import Pig

input = ":INPATH:/singlefile/studenttab10k"

# Compile the Pig script
P = Pig.compile("""A = load '$in' as (name, age, gpa);
                   store A into 'output';""")

# Bind variables
Q = P.bind({'in':input})

# Launch the Pig script
stats = Q.runSingle()

# Iterate over the result
result = stats.result('A')
for t in result.iterator():
   print t

Convergence Example
P = Pig.compile("""DEFINE myudf MyUDF('$param');
                   A = load 'input';
                   B = foreach A generate MyUDF(*);
                   store B into 'output';""")

while True:
  # Bind to the new parameter
  Q = P.bind({'param':new_parameter})
  results = Q.runSingle()
  iter = results.result("result").iterator()
  # Convergence check
  if converged:
      break
  # Change the parameter
  new_parameter = xxxxxx
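The driver-loop shape above can be run end to end with a stand-in run() in place of P.bind(...)/Q.runSingle(); the computation (halve the parameter until it drops below a threshold) is made up purely to exercise the loop:

```python
# Iterate-until-convergence driver sketch; run() stands in for one
# Pig script execution with the current parameter bound in.
def run(param):
    return param / 2.0        # made-up computation for illustration

new_parameter = 100.0
while True:
    result = run(new_parameter)
    if result < 1.0:          # convergence check
        break
    new_parameter = result    # change parameter and iterate

print(result)                 # 0.78125
```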




Pig Embedding
• Running an embedded Pig script:
   pig sample.py
• What happens within Pig?

[Diagram: Pig hands sample.py to Jython; the Python driver loop (Q = P.bind(); results = Q.runSingle(); converged?) launches a Pig script on each iteration, until the loop ends]

Nested Operator
• Nested Operator: Operator inside foreach
  B = group A by name;
  C = foreach B {
    C0 = limit A 10;
    generate flatten(C0);
  }


• Prior to Pig 0.10, the supported nested operators were
  –DISTINCT, FILTER, LIMIT, and ORDER BY
• New operators added in 0.10
  –CROSS, FOREACH



Nested Cross/ForEach
A = { (i0, a), (i0, b) }                B = { (i0, 0), (i0, 1) }

CoGroup A, B:
  C = { (i0, {a, b}, {0, 1}) }

Cross A, B (inside the foreach):
  { (i0, { (a, 0), (a, 1), (b, 0), (b, 1) }) }

ForEach … CONCAT:
  { (i0, { (a0), (a1), (b0), (b1) }) }

C = CoGroup A, B;
D = ForEach C {
  X = Cross A, B;
  Y = ForEach X generate CONCAT(f1, f2);
  Generate Y;
}
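The nested CROSS/FOREACH walkthrough can be reproduced in plain Python on the same tiny relations, to check each intermediate value:

```python
# Plain-Python walkthrough of the nested CROSS/FOREACH example above.
A = [("i0", "a"), ("i0", "b")]
B = [("i0", 0), ("i0", 1)]

# C = COGROUP A, B: one record per key, carrying both bags
keys = sorted({k for k, _ in A} | {k for k, _ in B})
C = [(k,
      [v for kk, v in A if kk == k],
      [v for kk, v in B if kk == k]) for k in keys]

# Inside the FOREACH: X = CROSS A, B per group, then CONCAT(f1, f2)
D = [(k, [str(x) + str(y) for x in bag_a for y in bag_b])
     for k, bag_a, bag_b in C]

print(D)   # [('i0', ['a0', 'a1', 'b0', 'b1'])]
```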
HCatalog Integration
• HCatalog

[Diagram: Pig, MapReduce, and Hive all access data through HCatalog]

• HCatLoader/HCatStorage
  – Load from / store to HCatalog from Pig
• HCatalog DDL Integration (Pig 0.11)
  – sql "create table student(name string, age int, gpa double);"

Misc Loaders
• HBaseStorage
  –Pig builtin
• AvroStorage
  –Piggybank
• CassandraStorage
  –In Cassandra code base
• MongoStorage
  –In the MongoDB code base
• JsonLoader/JsonStorage
  –Pig builtin



Talend
Enterprise Data Integration
• Talend Open Studio for Big Data
   – Feature-rich Job Designer
   – Rich palette of pre-built templates
   – Supports HDFS, Pig, Hive, HBase, HCatalog
   – Apache-licensed, bundled with HDP


• Key benefits
   – Graphical development
   – Robust and scalable execution
   – Broadest connectivity to support
     all systems:
     450+ components
   – Real-time debugging




Questions




