Our final presentation for the Cloud Data Management course taught by Anastasia Ailamaki and Christoph Koch at EPFL in 2012.
We compared the performance of Pig and Jaql when executing translated TPC-DS queries.
Note that the results are very likely outdated by now, so the presentation is mostly of historical value.
Relevant papers:
1) Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters.
2) Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: A Not-So-Foreign Language for Data Processing.
3) Kevin S. Beyer, Vuk Ercegovac, Rainer Gemulla, Andrey Balmin, Mohamed Y. Eltabakh, Carl-Christian Kanne, Fatma Özcan, and Eugene J. Shekita. 2011. Jaql: A Scripting Language for Large Scale Semistructured Data Analysis.
TPC-DS performance evaluation for JAQL and PIG queries - Andrii Vozniuk, Sergii Vozniuk
1. TPC-DS performance evaluation for JAQL / Pig queries
} Andrii Vozniuk and Sergii Vozniuk
} Data Management in the Cloud
} EPFL
} June 1, 2012
2. Roadmap
} Familiarized with TPC-DS benchmark
} Selected and translated 15 queries into Pig Latin and Jaql
} Setup infrastructure
  } Hardware: DIAS cluster and 6 clusters on Amazon EC2
  } Software:
    } Hadoop-0.20.2
    } Pig-0.9.2
    } Jaql-0.5.1
    } Whirr, Ganglia
} Performed experiments
  } 15 queries in 2 languages for 3 scaling factors on 7 clusters
  } 315 measurements for Pig, 285 for Jaql
  } $370 spent on Amazon EC2
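The measurement counts on this slide follow from the size of the experiment grid. A minimal sketch of that arithmetic, assuming each query was meant to run once per scaling factor per cluster (the slides do not say which Jaql runs are absent, so the sketch only computes the size of the gap):

```python
# Experiment grid as stated on the roadmap slide.
QUERIES = 15
SCALING_FACTORS = 3
CLUSTERS = 7  # 6 Amazon EC2 clusters + 1 DIAS cluster

# Assumption: one measurement per (query, scaling factor, cluster) combination.
full_grid = QUERIES * SCALING_FACTORS * CLUSTERS

pig_measurements = 315   # reported on the slide
jaql_measurements = 285  # reported on the slide

# Number of grid cells without a Jaql measurement.
missing_jaql = full_grid - jaql_measurements

print(f"Full grid per language: {full_grid}")
print(f"Pig coverage: {pig_measurements}/{full_grid}")
print(f"Jaql combinations without a measurement: {missing_jaql}")
```

Under this assumption Pig covers the whole grid, while 30 Jaql combinations have no measurement.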
3. Clusters & Data
} Cluster: 6 Amazon EC2 + 1 DIAS
  } 1 EC2 Compute Unit = 1.0-1.2 GHz 2007 Xeon processor
} Clusters: 5 or 10 nodes on EC2, 4 nodes on DIAS
} Data: three scaling factors (SF)
  } SF 2 = 2.3 GB
  } SF 5 = 5.7 GB
  } SF 10 = 12.2 GB
5. Total Execution Time on Cluster: Pig
[Figure: total execution time (s) per cluster configuration (small5, small10, medium5, medium10, large5, large10, dias) for SF=2, SF=5 and SF=10]
Small datasets: job startup overhead dominates
Large datasets: startup overhead dominates on powerful clusters only
6. Total Execution Time on Cluster: Jaql
[Figure: total execution time (s) per cluster configuration (small5, small10, medium5, medium10, large5, large10, dias) for SF=2, SF=5 and SF=10]
Small instances are not suitable for Jaql due to poor I/O performance
Jaql launches many more jobs than Pig for the same query, so its overhead is larger
7. Pig Latin vs Jaql Performance
[Figure: scatter plot of Jaql execution time (s) vs Pig execution time (s) for SF=2 and SF=5, with the X=Y line; annotation: "Many points"]
Pig outperforms Jaql on clusters of 10 EC2 small instances
8. Pig Latin vs Jaql Performance
[Figure: scatter plot of Jaql execution time (s) vs Pig execution time (s) for SF=2, SF=5 and SF=10, with the X=Y line]
Jaql performance approaches Pig’s on 10 EC2 medium instances
9. Pig Latin vs Jaql Performance
[Figure: scatter plot of Jaql execution time (s) vs Pig execution time (s) for SF=2, SF=5 and SF=10, with the X=Y line]
Half of the queries are faster in Jaql on 10 EC2 large instances
10. Query Execution Time vs Monetary Cost
[Figure, two panels (Pig Latin and Jaql): total execution time (s) vs price ($) for SF=2, SF=5 and SF=10]
In total, Pig outperforms Jaql. Pig should be used in all cases to obtain the minimal
execution time, the minimal cost, or the maximal performance per dollar spent
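The slides do not state how the price of a run was derived from its execution time. A rough sketch of one plausible calculation, assuming the price is the number of nodes times an hourly instance rate times the billed hours; the rate, the rounding rule and the sample numbers are illustrative assumptions, not values from the experiments:

```python
import math


def cluster_cost(nodes: int, hourly_rate: float, runtime_s: float,
                 round_up_hours: bool = True) -> float:
    """Cost in dollars of running `nodes` instances for `runtime_s` seconds."""
    hours = runtime_s / 3600.0
    if round_up_hours:
        # Assumption: every started instance-hour is billed in full.
        hours = math.ceil(hours)
    return nodes * hourly_rate * hours


def performance_per_dollar(runtime_s: float, cost: float) -> float:
    """Hypothetical metric: workloads completed per hour, per dollar spent."""
    return (3600.0 / runtime_s) / cost


# Hypothetical example: 10 nodes at $0.10 per hour, workload takes 90 minutes.
cost = cluster_cost(nodes=10, hourly_rate=0.10, runtime_s=5400)
print(f"Billed cost: ${cost:.2f}")
print(f"Performance per dollar: {performance_per_dollar(5400, cost):.3f}")
```

With whole-hour billing, a faster cluster only becomes cheaper when it crosses an hour boundary, which is one reason execution time and price need not rank configurations the same way.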
11. What Language To Use? Where To Run?
[Figure: execution time (s) vs price ($) for Query 26 at SF=5, Pig and Jaql]
If we consider a single query, no single language is the best for all purposes
12. Choosing Optimal Tool
[Figure: execution time (s) vs price ($) for Query 26 at SF=5, Pig and Jaql, with the optimal options marked: Jaql on large10, Pig on small5, Pig on medium5]
Given a dataset, a query and a utility function, which language on which
cluster should be used to optimize the function?
Given a dataset and a query, what are the options for executing it in the cloud?
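The two questions on this slide can be sketched as a small selection problem. The following is an illustration only: the `Run` record, the function names and every number in `runs` are hypothetical and are not the measured values. It first keeps the options that no other option beats on both time and price, then picks one option for a given utility function (lower is better).

```python
from typing import Callable, List, NamedTuple


class Run(NamedTuple):
    language: str
    cluster: str
    time_s: float
    price_usd: float


def pareto_front(runs: List[Run]) -> List[Run]:
    """Keep runs that no other run matches or beats on both axes while being strictly better on one."""
    front = []
    for r in runs:
        dominated = any(
            o.time_s <= r.time_s and o.price_usd <= r.price_usd
            and (o.time_s < r.time_s or o.price_usd < r.price_usd)
            for o in runs
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda r: r.price_usd)


def best_run(runs: List[Run], utility: Callable[[Run], float]) -> Run:
    """Return the run that minimizes the utility function."""
    return min(runs, key=utility)


# Hypothetical measurements for a single query (NOT the real results).
runs = [
    Run("pig", "small5", 900.0, 0.05),
    Run("pig", "medium5", 400.0, 0.10),
    Run("pig", "large10", 250.0, 0.35),
    Run("jaql", "small5", 1500.0, 0.08),
    Run("jaql", "medium5", 600.0, 0.15),
    Run("jaql", "large10", 200.0, 0.30),
]

for r in pareto_front(runs):
    print(f"{r.language} on {r.cluster}: {r.time_s:.0f} s, ${r.price_usd:.2f}")

fastest = best_run(runs, lambda r: r.time_s)
cheapest = best_run(runs, lambda r: r.price_usd)
# Example utility: one dollar is treated as being worth 4000 seconds of waiting.
balanced = best_run(runs, lambda r: r.time_s + 4000 * r.price_usd)
```

Different utility functions select different points on the front, which is the sense in which no single language or cluster is best for all purposes.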
13. Summary: Opinion
Pig Latin
} Cumbersome scripts
} Procedural
} Long to write, easy to debug
} Good documentation
} Convenient interpreter
Jaql
} Concise scripts
} Declarative, more SQL-like
} Quick to write, long to debug
} Poorly documented
} Tools are in rudimentary state
Jaql is much better as a language but the development infrastructure is much worse (documentation, user base, tools)
14. Summary: Facts
Pig Latin
} Development in progress
} Faster in most of our experiments
} Scales better with the dataset size
} Checks the schema before evaluation
Jaql
} Open-source version abandoned one year ago
} Slower in most of our experiments
} Scales worse with the dataset size
} Doesn’t check the schema even while evaluating
Thank you for your attention!
Feedback & Questions?
17. Directions for Future Work
} Reach communities for bigger scale and more realistic comparison
} Add Hive queries to the comparison
Code & Data on GitHub: github.com/voz