Apache Pig is a platform for analyzing large data sets using a high-level language called Pig Latin. Pig Latin scripts are compiled into MapReduce programs that process data in parallel across a cluster. Pig simplifies data analysis tasks that would otherwise require writing complex MapReduce programs by hand. Example Pig Latin scripts demonstrate how to load, filter, group, and store data.
3. PIG Introduction
• Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
• Pig generates and compiles Map/Reduce programs on the fly
(Diagram: Pig Latin scripts → Parse → Compile → Optimize → Plan → Map/Reduce)
9. After reshaping with Pig
Output: 北 A1 劉 12.5
Logical Plan (Pig Latin dataflow):
LOAD    → (nm, dp, id)
LOAD    → (id, dt, hr)
FILTER  → (id, dt, hr)
JOIN    → (nm, dp, id, id, dt, hr)
GROUP   → (group, {(nm, dp, id, id, dt, hr)})
FOREACH → (group, …., AVG(hr))
STORE   → (dp, group, nm, hr)
A = LOAD 'file1.txt' USING PigStorage(',') AS (nm, dp, id);
B = LOAD 'file2.txt' USING PigStorage(',') AS (id, dt, hr);
C = FILTER B BY hr > 8;
D = JOIN C BY id, A BY id;
E = GROUP D BY A::id;
F = FOREACH E GENERATE $1.dp, group, $1.nm, AVG($1.hr);
STORE F INTO '/tmp/pig_output/';
file1.txt (nm, dp, id)    file2.txt (id, dt, hr)
劉 北 A1                   A1 7/7 13
李 中 B1                   A1 7/8 12
王 中 B2                   A1 7/9  4
Tips: validate first with a small data set in pig -x local mode;
check each line with dump or illustrate to confirm it is correct
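To make the dataflow above concrete, here is a plain-Python sketch of the same filter → join → group → average pipeline. The in-memory lists are hypothetical stand-ins for file1.txt and file2.txt, chosen to reproduce the sample rows shown above; this is an illustration of the semantics, not how Pig executes.

```python
from collections import defaultdict

# Hypothetical stand-ins for file1.txt (nm, dp, id) and file2.txt (id, dt, hr)
file1 = [("劉", "北", "A1"), ("李", "中", "B1"), ("王", "中", "B2")]
file2 = [("A1", "7/7", 13), ("A1", "7/8", 12), ("A1", "7/9", 4)]

# C = FILTER B BY hr > 8;
filtered = [row for row in file2 if row[2] > 8]

# D = JOIN C BY id, A BY id;
joined = [(nm, dp, id1, id2, dt, hr)
          for (nm, dp, id1) in file1
          for (id2, dt, hr) in filtered
          if id1 == id2]

# E = GROUP D BY A::id;
groups = defaultdict(list)
for row in joined:
    groups[row[2]].append(row)

# F = FOREACH E GENERATE dp, group, nm, AVG(hr);
result = [(rows[0][1], key, rows[0][0],
           sum(r[5] for r in rows) / len(rows))
          for key, rows in groups.items()]
print(result)   # [('北', 'A1', '劉', 12.5)]
```

The single output row matches the "北 A1 劉 12.5" result on the slide: only 劉's id (A1) survives the join, and the two remaining hr values 13 and 12 average to 12.5.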
10. Exercise 1: Hands-on
cd ~
git clone https://github.com/waue0920/hadoop_example.git
cd ~/hadoop_example/pig/ex1
pig -x local -f exc1.pig
cat /tmp/pig_output/part-r-00000
Exercise: run pig -x mapreduce, execute each line of exc1.pig one at a
time, and use dump and illustrate to inspect the results, e.g.:
grunt> A = LOAD 'file1.txt' USING PigStorage(',') AS (nm, dp, id);
grunt> DUMP A;
grunt> ILLUSTRATE A;
Q: Is there room to improve the result?
Q: How would you improve it?
11. Advanced
Simple Types  Description                                        Example
int           Signed 32-bit integer                              10
long          Signed 64-bit integer                              Data: 10L or 10l; Display: 10L
float         32-bit floating point                              Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F
double        64-bit floating point                              Data: 10.5 or 10.5e2 or 10.5E2; Display: 10.5 or 1050.0
chararray     Character array (string) in Unicode UTF-8 format   hello world
bytearray     Byte array (blob)
boolean       boolean                                            true/false (case insensitive)
datetime      datetime                                           1970-01-01T00:00:00.000+00:00
biginteger    Java BigInteger                                    2E+11
bigdecimal    Java BigDecimal                                    33.45678332
Complex Types  Description                Example
field          A piece of data            John
tuple          An ordered set of fields.  (19,2)
bag            A collection of tuples.    {(19,2), (18,1)}
map            A set of key value pairs.  [open#apache]
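The complex types map naturally onto familiar Python structures, which can help when reasoning about what a Pig relation holds. This is only a loose analogy (Pig bags are unordered multisets and maps are chararray-keyed), using the example values from the table:

```python
# Rough Python analogues of Pig's complex types (illustration only)
field = "John"              # field: a piece of data
tup = (19, 2)               # tuple: an ordered set of fields, (19,2)
bag = {(19, 2), (18, 1)}    # bag: a collection of tuples, {(19,2), (18,1)}
mp = {"open": "apache"}     # map: key/value pairs, [open#apache]

# A relation is then a bag of tuples; fields are addressed by position
print(tup[0], mp["open"])   # 19 apache
```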
12. Advanced
• cat data;
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)
• A = LOAD 'data' AS
  ( t1:tuple(t1a:int,t1b:int,t1c:int),
    t2:tuple(t2a:int,t2b:int,t2c:int) );
• DUMP A;
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))
• X = FOREACH A GENERATE t1.t1a, t2.$0;
• DUMP X;
(3,4)
(1,3)
(2,9)
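The nested-tuple projection above can be mimicked in Python: each row holds two inner tuples, and the FOREACH picks the first field of each (t1.t1a by name, t2.$0 by position). A small sketch with the same data:

```python
# Rows mirroring the 'data' file above: each row is two 3-field tuples
A = [((3, 8, 9), (4, 5, 6)),
     ((1, 4, 7), (3, 7, 5)),
     ((2, 5, 8), (9, 5, 8))]

# X = FOREACH A GENERATE t1.t1a, t2.$0;
# i.e. take the first field of each inner tuple
X = [(t1[0], t2[0]) for (t1, t2) in A]
print(X)   # [(3, 4), (1, 3), (2, 9)]
```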
17. Exercise 2
cd ~/hadoop_example/pig/ex2
hadoop fs -put myfile.txt B.txt ./
pig -x mapreduce
> A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1,f2,f3);
> B = LOAD 'B.txt' ; dump A; dump B;
> Y = FILTER A BY f1 == '8'; dump Y;
> Y = FILTER A BY (f1 == '8') OR (NOT (f2+f3 > f1)); dump Y;
> X = GROUP A BY f1; dump X;
> X = FOREACH A GENERATE f1, f2; dump X;
> X = FOREACH A GENERATE f1+f2 as sumf1f2; dump X;
> Y = FILTER X by sumf1f2 > 5.0; dump Y;
> C = COGROUP A BY $0, B BY $0; dump C;
> C = COGROUP A BY $0 INNER, B BY $0 INNER; dump C;
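COGROUP is the step in this exercise that is easiest to misread: it produces one tuple per key, holding a separate bag of matches from each relation, and INNER drops keys absent from either side. A Python sketch, with hypothetical relations chosen to match the sample output later in the deck:

```python
from collections import defaultdict

# Hypothetical stand-ins for A (myfile.txt) and B (B.txt)
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
B = [(1, 3), (4, 9), (4, 6), (8, 9)]

def cogroup(a, b):
    # C = COGROUP A BY $0, B BY $0: one output tuple per key, with a
    # bag of matching tuples from each relation
    out = defaultdict(lambda: ([], []))
    for t in a:
        out[t[0]][0].append(t)
    for t in b:
        out[t[0]][1].append(t)
    return dict(out)

C = cogroup(A, B)

# INNER on both sides drops keys missing from either relation,
# e.g. key 7, which appears only in A
inner = {k: bags for k, bags in C.items() if bags[0] and bags[1]}
print(sorted(inner))   # [1, 4, 8]
```

Note the contrast with JOIN: COGROUP keeps the matches bagged per relation instead of producing one flat tuple per matching pair.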
18. Exercise 3
• Description: from a log of <userid, time, query_term> records,
analyze users' favorite search keywords
• Techniques used: UDF, DISTINCT, FLATTEN, ORDER
• Source pigtutorial.tar.gz
• Input / output
See : https://cwiki.apache.org/confluence/display/PIG/PigTutorial
19. Exercise 3
cd ~/hadoop_example/pig/ex3
pig -x local
> REGISTER ./tutorial.jar;
> raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
> clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
> clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
> houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
> ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
> ngramed2 = DISTINCT ngramed1;
> hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
> hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
> uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;
> uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));
> uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;
> filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;
> ordered_uniq_frequency = ORDER filtered_uniq_frequency BY hour, score;
> STORE ordered_uniq_frequency INTO 'result' USING PigStorage();
pig -x local -f script1-local.pig
cat result/part-r-00000
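The core counting step of the tutorial (DISTINCT, then GROUP BY (ngram, hour), then COUNT) can be sketched in Python. The records below are hypothetical, standing in for the (user, hour, ngram) rows that exist after the FLATTEN step; real excite-small.log data will differ:

```python
from collections import Counter

# Hypothetical (user, hour, ngram) records after FLATTEN
ngramed1 = [("u1", "07", "free music"), ("u1", "07", "free music"),
            ("u2", "07", "free music"), ("u2", "08", "lyrics")]

# ngramed2 = DISTINCT ngramed1; -- drop duplicate rows
ngramed2 = sorted(set(ngramed1))

# GROUP ngramed2 BY (ngram, hour), then COUNT($1):
# how many distinct users issued each ngram in each hour
hour_frequency = Counter((ngram, hour) for (_, hour, ngram) in ngramed2)
print(hour_frequency)
```

The DISTINCT matters: without it, a user repeating the same query in the same hour would inflate the count.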
20. UDF
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}
grunt> register myudfs.jar
grunt> A = load 'student_data' using PigStorage(',') as (name:chararray, age:int, gpa:double);
grunt> B = FOREACH A GENERATE myudfs.UPPER(name);
grunt> dump B;
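What the UPPER EvalFunc does per row can be sketched in Python: Pig calls exec once per input tuple, and FOREACH emits one output tuple per call. The student rows below are hypothetical:

```python
# Python sketch of the UPPER EvalFunc's per-row behavior (illustration only)
def upper(input_tuple):
    # Mirrors exec(Tuple input): empty input yields null
    if not input_tuple:
        return None
    return str(input_tuple[0]).upper()

# Roughly what B = FOREACH A GENERATE myudfs.UPPER(name) produces,
# over hypothetical student_data rows (name, age, gpa)
rows = [("alice", 20, 3.6), ("bob", 21, 3.1)]
B = [(upper(r),) for r in rows]
print(B)   # [('ALICE',), ('BOB',)]
```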
A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1,f2,f3);
B = LOAD 'B.txt' ;
Y = FILTER A BY f1 == '8';
Y = FILTER A BY (f1 == '8') OR (NOT (f2+f3 > f1));
X = GROUP A BY f1;
====
(1,{(1,2,3)})
(4,{(4,3,3),(4,2,1)})
(7,{(7,2,5)})
(8,{(8,4,3),(8,3,4)})
====
Projection
X = FOREACH A GENERATE f1, f2;
====
(1,2)
(4,2)
(8,3)
(4,3)
(7,2)
(8,4)
====
X = FOREACH A GENERATE f1+f2 as sumf1f2;
Y = FILTER X by sumf1f2 > 5.0;
===== (dump Y)
(6.0)
(11.0)
(7.0)
(9.0)
(12.0)
=====
C = COGROUP A BY $0 INNER, B BY $0 INNER;
====
(1,{(1,2,3)},{(1,3)})
(4,{(4,3,3),(4,2,1)},{(4,9),(4,6)})
(8,{(8,4,3),(8,3,4)},{(8,9)})
====