Reuse-based optimization for Pig Latin
Owner: Jesus
Presenter: Damian

Overview
Given a workload of queries (Pig Latin scripts):
• it identifies common subexpressions,
• selects the best ones to execute based on a cost function,
• and reuses their results as needed in order to compute exactly the same output as the original scripts.
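
The first step above, finding subexpressions shared across scripts, can be illustrated with a minimal Python sketch. It is not the plreuse implementation: the tuple encoding of plans, the function names, and the sample workload are all hypothetical, and it only detects syntactically identical sub-plans, whereas the slides speak of equivalent ones.

```python
from collections import defaultdict

# Hypothetical plan encoding (an assumption for this sketch):
# a plan is a tuple (operator, argument, *child_plans).

def subexpressions(plan):
    """Yield the plan itself and every sub-plan below it."""
    yield plan
    for child in plan[2:]:
        yield from subexpressions(child)

def common_subexpressions(workload):
    """Map each sub-plan used by at least two scripts to the scripts using it."""
    users = defaultdict(set)
    for name, plan in workload.items():
        for sub in subexpressions(plan):
            users[sub].add(name)
    return {sub: names for sub, names in users.items() if len(names) > 1}

# Two made-up scripts that share the same LOAD and FILTER steps.
load = ("LOAD", "visits.csv")
filt = ("FILTER", "year == 2014", load)
workload = {
    "q1": ("GROUP", "user", filt),
    "q2": ("DISTINCT", None, filt),
}
shared = common_subexpressions(workload)
```

Here `shared` holds the LOAD and FILTER sub-plans, each mapped to both scripts. Deciding which of them is worth materializing and reusing is the job of the cost-based selection step.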

Few details
• Online repository:
– https://scm.gforge.inria.fr/svn/plreuse
• Code size (Java):
– 8,000 lines of code (LoC); 50 classes
• Contributors:
– Jesus
• Current owner of the code (OAK member):
– Jesus
• Dependencies:
– Apache Pig 0.12.1
– Hadoop 1.1.2
– Gurobi BIP solver 5.6.2

System architecture

Code overview
• plreuse.common contains the configuration and exception classes shared by all the modules.
• plreuse.reuse contains the classes that model the reuse-based optimizer (those related to the AND-OR DAG and the ILP solver).
• plreuse.normalization contains the classes with the rules that push projections up and down through a plan.
• plreuse.decomposition contains the classes in charge of rewriting join operators into cogroup + foreach.
• Unit tests follow standard naming conventions and can be found in the test directory.
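
The join decomposition mentioned above relies on the fact that an inner join gives the same tuples as a cogroup followed by a foreach that flattens both bags. The Python sketch below checks this on toy data. The function names and the sample relations are hypothetical and stand in for Pig operators; this is not code from plreuse.decomposition.

```python
from collections import defaultdict
from itertools import product

def cogroup(left, right, key_l, key_r):
    """Group both relations by key: key -> (bag of left tuples, bag of right tuples)."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[t[key_l]][0].append(t)
    for t in right:
        groups[t[key_r]][1].append(t)
    return dict(groups)

def foreach_flatten(groups):
    """Emit the cross product of the two bags of every group (empty bags emit nothing)."""
    return [l + r
            for bag_l, bag_r in groups.values()
            for l, r in product(bag_l, bag_r)]

def join(left, right, key_l, key_r):
    """Reference inner join, written as a plain nested loop."""
    return [l + r for l in left for r in right if l[key_l] == r[key_r]]

# Made-up relations: (user_id, name) and (user_id, page).
users = [(1, "ann"), (2, "bob")]
visits = [(1, "a.html"), (1, "b.html"), (3, "c.html")]

decomposed = foreach_flatten(cogroup(users, visits, 0, 0))
direct = join(users, visits, 0, 0)
```

Both `decomposed` and `direct` contain the same two tuples for user 1. Users 2 and 3 drop out because one of their bags is empty.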

What does the code do?
Given a workload of queries (Pig Latin scripts):
• Analyze the NRAB DAGs of the queries, searching for merging opportunities.
– Identify and merge all the equivalent subexpressions in the AND-OR DAG.
– Find the optimal plan from the AND-OR DAG using a cost model.
• A Binary Integer Linear Programming solver is used to find the minimum-cost result equivalence graph.
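
The selection step can be read as a 0/1 optimization problem: one binary variable per operator of the AND-OR DAG, constraints that force every query output and every consumed input to be produced, and a total cost to minimize. The Python sketch below illustrates that formulation on a tiny DAG. It replaces the BIP solver with brute-force enumeration, and the operator names, equivalence classes, and costs are invented for illustration.

```python
from itertools import product

# Hypothetical AND-OR DAG. Each operator (AND node) produces one equivalence
# class (OR node), consumes zero or more classes, and has an assumed cost.
OPERATORS = {
    # name: (produced class, consumed classes, cost)
    "scan_A": ("A", (), 10),
    "scan_B": ("B", (), 10),
    "filter_A": ("FA", ("A",), 4),
    "join_FA_B": ("Q1", ("FA", "B"), 6),
    "join_A_B_then_filter": ("Q1", ("A", "B"), 15),
    "group_FA": ("Q2", ("FA",), 5),
}

def feasible(chosen, outputs):
    """Every output and every input of a chosen operator must be produced."""
    produced = {OPERATORS[op][0] for op in chosen}
    if not set(outputs) <= produced:
        return False
    return all(set(OPERATORS[op][1]) <= produced for op in chosen)

def cheapest_plan(outputs):
    """Try every 0/1 assignment and keep the cheapest feasible one."""
    names = list(OPERATORS)
    best, best_cost = None, float("inf")
    for bits in product((0, 1), repeat=len(names)):
        chosen = {n for n, b in zip(names, bits) if b}
        if feasible(chosen, outputs):
            cost = sum(OPERATORS[n][2] for n in chosen)
            if cost < best_cost:
                best, best_cost = chosen, cost
    return best, best_cost

plan, cost = cheapest_plan(["Q1", "Q2"])
```

With these made-up costs the cheapest plan has cost 35: it computes the filtered relation FA once and reuses it for both Q1 and Q2 instead of picking the alternative join. Enumeration is exponential in the number of operators, which is why a dedicated solver is needed on a real workload.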

Overview (I)
Overview (II)
Overview (III)

Publications
Reuse-based Optimization for Pig Latin (submitted for publication)
By Jesús Camacho-Rodríguez, Dario Colazzo, Melanie Herschel, Ioana Manolescu, Soudip Roy Chowdhury.

Future work
• Develop more accurate cost functions.
• Support multi-user environments, where each workload can be seen as one user's workload (with QoS requirements).
• Support multi-cluster environments in addition to multi-user environments, in order to best fit the users' QoS requirements.
• Extend the work to other widely adopted languages such as HiveQL.
