Apache Pig is a platform for analyzing large data sets using a high-level language called Pig Latin. Pig Latin scripts are compiled into MapReduce programs that process data in parallel across a cluster. Pig simplifies data analysis tasks that would otherwise require writing complex MapReduce programs by hand. Example Pig Latin scripts demonstrate how to load, filter, group, and store data.
3. PIG Introduction
• Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
• Pig generates and compiles Map/Reduce programs on the fly
(Diagram: Pig Latin scripts → Parse → Compile → Optimize → Plan → Map/Reduce)
9. After reshaping with Pig
Output: 北 A1 劉 12.5
Logical Plan (Pig Latin dataflow):
LOAD    → (nm, dp, id)
LOAD    → (id, dt, hr)
FILTER  → (id, dt, hr)
JOIN    → (nm, dp, id, id, dt, hr)
GROUP   → (group, {(nm, dp, id, id, dt, hr)})
FOREACH → (group, …., AVG(hr))
STORE   → (dp, group, nm, hr)
A = LOAD 'file1.txt' USING PigStorage(',') AS (nm, dp, id);
B = LOAD 'file2.txt' USING PigStorage(',') AS (id, dt, hr);
C = FILTER B BY hr > 8;
D = JOIN C BY id, A BY id;
E = GROUP D BY A::id;
F = FOREACH E GENERATE $1.dp, group, $1.nm, AVG($1.hr);
STORE F INTO '/tmp/pig_output/';
file1.txt (nm, dp, id)    file2.txt (id, dt, hr)
劉 北 A1                   A1 7/7 13
李 中 B1                   A1 7/8 12
王 中 B2                   A1 7/9  4
Tips: validate first with a small data set in pig -x local mode;
check each line with dump or illustrate to confirm it is correct
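To make the dataflow above concrete, here is a plain-Python sketch of the same filter → join → group → average pipeline. The in-memory lists are hypothetical stand-ins for file1.txt and file2.txt, chosen to reproduce the sample rows shown above; this is an illustration of the semantics, not how Pig executes.

```python
from collections import defaultdict

# Hypothetical stand-ins for file1.txt (nm, dp, id) and file2.txt (id, dt, hr)
file1 = [("劉", "北", "A1"), ("李", "中", "B1"), ("王", "中", "B2")]
file2 = [("A1", "7/7", 13), ("A1", "7/8", 12), ("A1", "7/9", 4)]

# C = FILTER B BY hr > 8;
filtered = [row for row in file2 if row[2] > 8]

# D = JOIN C BY id, A BY id;
joined = [(nm, dp, id1, id2, dt, hr)
          for (nm, dp, id1) in file1
          for (id2, dt, hr) in filtered
          if id1 == id2]

# E = GROUP D BY A::id;
groups = defaultdict(list)
for row in joined:
    groups[row[2]].append(row)

# F = FOREACH E GENERATE dp, group, nm, AVG(hr);
result = [(rows[0][1], key, rows[0][0],
           sum(r[5] for r in rows) / len(rows))
          for key, rows in groups.items()]
print(result)   # [('北', 'A1', '劉', 12.5)]
```

The single output row matches the "北 A1 劉 12.5" result on the slide: only 劉's id (A1) survives the join, and the two remaining hr values 13 and 12 average to 12.5.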
10. Exercise 1: Hands-on
cd ~
git clone https://github.com/waue0920/hadoop_example.git
cd ~/hadoop_example/pig/ex1
pig -x local -f exc1.pig
cat /tmp/pig_output/part-r-00000
Exercise: run pig -x mapreduce, execute each line of exc1.pig one at a
time, and use dump and illustrate to inspect the results, e.g.:
grunt> A = LOAD 'file1.txt' USING PigStorage(',') AS (nm, dp, id);
grunt> DUMP A;
grunt> ILLUSTRATE A;
Q: Is there room to improve the result?
Q: How would you improve it?
11. Advanced
Simple Types  Description                                        Example
int           Signed 32-bit integer                              10
long          Signed 64-bit integer                              Data: 10L or 10l; Display: 10L
float         32-bit floating point                              Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F
double        64-bit floating point                              Data: 10.5 or 10.5e2 or 10.5E2; Display: 10.5 or 1050.0
chararray     Character array (string) in Unicode UTF-8 format   hello world
bytearray     Byte array (blob)
boolean       boolean                                            true/false (case insensitive)
datetime      datetime                                           1970-01-01T00:00:00.000+00:00
biginteger    Java BigInteger                                    2E+11
bigdecimal    Java BigDecimal                                    33.45678332
Complex Types  Description                Example
field          A piece of data            John
tuple          An ordered set of fields.  (19,2)
bag            A collection of tuples.    {(19,2), (18,1)}
map            A set of key value pairs.  [open#apache]
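The complex types map naturally onto familiar Python structures, which can help when reasoning about what a Pig relation holds. This is only a loose analogy (Pig bags are unordered multisets and maps are chararray-keyed), using the example values from the table:

```python
# Rough Python analogues of Pig's complex types (illustration only)
field = "John"              # field: a piece of data
tup = (19, 2)               # tuple: an ordered set of fields, (19,2)
bag = {(19, 2), (18, 1)}    # bag: a collection of tuples, {(19,2), (18,1)}
mp = {"open": "apache"}     # map: key/value pairs, [open#apache]

# A relation is then a bag of tuples; fields are addressed by position
print(tup[0], mp["open"])   # 19 apache
```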
12. Advanced
• cat data;
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)
• A = LOAD 'data' AS
  ( t1:tuple(t1a:int,t1b:int,t1c:int),
    t2:tuple(t2a:int,t2b:int,t2c:int) );
• DUMP A;
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))
• X = FOREACH A GENERATE t1.t1a, t2.$0;
• DUMP X;
(3,4)
(1,3)
(2,9)
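The nested-tuple projection above can be mimicked in Python: each row holds two inner tuples, and the FOREACH picks the first field of each (t1.t1a by name, t2.$0 by position). A small sketch with the same data:

```python
# Rows mirroring the 'data' file above: each row is two 3-field tuples
A = [((3, 8, 9), (4, 5, 6)),
     ((1, 4, 7), (3, 7, 5)),
     ((2, 5, 8), (9, 5, 8))]

# X = FOREACH A GENERATE t1.t1a, t2.$0;
# i.e. take the first field of each inner tuple
X = [(t1[0], t2[0]) for (t1, t2) in A]
print(X)   # [(3, 4), (1, 3), (2, 9)]
```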
17. Exercise 2
cd ~/hadoop_example/pig/ex2
hadoop fs -put myfile.txt B.txt ./
pig -x mapreduce
> A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1,f2,f3);
> B = LOAD 'B.txt' ; dump A; dump B;
> Y = FILTER A BY f1 == '8'; dump Y;
> Y = FILTER A BY (f1 == '8') OR (NOT (f2+f3 > f1)); dump Y;
> X = GROUP A BY f1; dump X;
> X = FOREACH A GENERATE f1, f2; dump X;
> X = FOREACH A GENERATE f1+f2 as sumf1f2; dump X;
> Y = FILTER X by sumf1f2 > 5.0; dump Y;
> C = COGROUP A BY $0, B BY $0; dump C;
> C = COGROUP A BY $0 INNER, B BY $0 INNER; dump C;
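COGROUP is the step in this exercise that is easiest to misread: it produces one tuple per key, holding a separate bag of matches from each relation, and INNER drops keys absent from either side. A Python sketch, with hypothetical relations chosen to match the sample output later in the deck:

```python
from collections import defaultdict

# Hypothetical stand-ins for A (myfile.txt) and B (B.txt)
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
B = [(1, 3), (4, 9), (4, 6), (8, 9)]

def cogroup(a, b):
    # C = COGROUP A BY $0, B BY $0: one output tuple per key, with a
    # bag of matching tuples from each relation
    out = defaultdict(lambda: ([], []))
    for t in a:
        out[t[0]][0].append(t)
    for t in b:
        out[t[0]][1].append(t)
    return dict(out)

C = cogroup(A, B)

# INNER on both sides drops keys missing from either relation,
# e.g. key 7, which appears only in A
inner = {k: bags for k, bags in C.items() if bags[0] and bags[1]}
print(sorted(inner))   # [1, 4, 8]
```

Note the contrast with JOIN: COGROUP keeps the matches bagged per relation instead of producing one flat tuple per matching pair.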
18. Exercise 3
• Description: from a log of <userid, time, query_term> records,
analyze users' favorite search keywords
• Techniques used: UDF, DISTINCT, FLATTEN, ORDER
• Source pigtutorial.tar.gz
• Input / output
See : https://cwiki.apache.org/confluence/display/PIG/PigTutorial
19. Exercise 3
cd ~/hadoop_example/pig/ex3
pig -x local
> REGISTER ./tutorial.jar;
> raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
> clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
> clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
> houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
> ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
> ngramed2 = DISTINCT ngramed1;
> hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
> hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
> uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;
> uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));
> uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;
> filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;
> ordered_uniq_frequency = ORDER filtered_uniq_frequency BY hour, score;
> STORE ordered_uniq_frequency INTO 'result' USING PigStorage();
pig -x local -f script1-local.pig
cat result/part-r-00000
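The core counting step of the tutorial (DISTINCT, then GROUP BY (ngram, hour), then COUNT) can be sketched in Python. The records below are hypothetical, standing in for the (user, hour, ngram) rows that exist after the FLATTEN step; real excite-small.log data will differ:

```python
from collections import Counter

# Hypothetical (user, hour, ngram) records after FLATTEN
ngramed1 = [("u1", "07", "free music"), ("u1", "07", "free music"),
            ("u2", "07", "free music"), ("u2", "08", "lyrics")]

# ngramed2 = DISTINCT ngramed1; -- drop duplicate rows
ngramed2 = sorted(set(ngramed1))

# GROUP ngramed2 BY (ngram, hour), then COUNT($1):
# how many distinct users issued each ngram in each hour
hour_frequency = Counter((ngram, hour) for (_, hour, ngram) in ngramed2)
print(hour_frequency)
```

The DISTINCT matters: without it, a user repeating the same query in the same hour would inflate the count.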
20. UDF
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}
grunt> register myudfs.jar
grunt> A = load 'student_data' using PigStorage(',') as (name:chararray, age:int, gpa:double);
grunt> B = FOREACH A GENERATE myudfs.UPPER(name);
grunt> dump B;
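What the UPPER EvalFunc does per row can be sketched in Python: Pig calls exec once per input tuple, and FOREACH emits one output tuple per call. The student rows below are hypothetical:

```python
# Python sketch of the UPPER EvalFunc's per-row behavior (illustration only)
def upper(input_tuple):
    # Mirrors exec(Tuple input): empty input yields null
    if not input_tuple:
        return None
    return str(input_tuple[0]).upper()

# Roughly what B = FOREACH A GENERATE myudfs.UPPER(name) produces,
# over hypothetical student_data rows (name, age, gpa)
rows = [("alice", 20, 3.6), ("bob", 21, 3.1)]
B = [(upper(r),) for r in rows]
print(B)   # [('ALICE',), ('BOB',)]
```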
A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1,f2,f3);
B = LOAD 'B.txt' ;
Y = FILTER A BY f1 == '8';
Y = FILTER A BY (f1 == '8') OR (NOT (f2+f3 > f1));
X = GROUP A BY f1;
====
(1,{(1,2,3)})
(4,{(4,3,3),(4,2,1)})
(7,{(7,2,5)})
(8,{(8,4,3),(8,3,4)})
====
Projection
X = FOREACH A GENERATE f1, f2;
====
(1,2)
(4,2)
(8,3)
(4,3)
(7,2)
(8,4)
====
X = FOREACH A GENERATE f1+f2 as sumf1f2;
Y = FILTER X by sumf1f2 > 5.0;
===== (dump Y)
(6.0)
(11.0)
(7.0)
(9.0)
(12.0)
=====
C = COGROUP A BY $0 INNER, B BY $0 INNER;
====
(1,{(1,2,3)},{(1,3)})
(4,{(4,3,3),(4,2,1)},{(4,9),(4,6)})
(8,{(8,4,3),(8,3,4)},{(8,9)})
====