TPC-DS performance evaluation for JAQL / Pig queries
• Andrii Vozniuk and Sergii Vozniuk
• Data Management in the Cloud
• EPFL
• June 1, 2012

Roadmap
• Familiarized ourselves with the TPC-DS benchmark
• Selected 15 queries and translated them into Pig Latin and Jaql
• Set up the infrastructure
  • Hardware: DIAS cluster and 6 clusters on Amazon EC2
  • Software:
    • Hadoop-0.20.2
    • Pig-0.9.2
    • Jaql-0.5.1
    • Whirr, Ganglia
• Performed experiments
  • 15 queries in 2 languages for 3 scaling factors on 7 clusters
  • 315 measurements for Pig, 285 for Jaql
  • $370 spent on Amazon EC2
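
The Pig measurement count matches the full experiment grid: 15 queries × 3 scaling factors × 7 clusters = 315. The following is a minimal Python sketch of that enumeration. The query numbers and cluster labels are taken from the chart axes in this deck, and the function name `experiment_grid` is only illustrative. The Jaql count of 285 is 30 short of the full grid; the slides do not say which combinations are missing.

```python
from itertools import product

# Query numbers and cluster labels as they appear on the deck's chart axes.
QUERIES = [1, 3, 6, 10, 26, 33, 48, 52, 64, 71, 82, 90, 94, 96, 99]
SCALING_FACTORS = [2, 5, 10]
CLUSTERS = ["small5", "small10", "medium5", "medium10", "large5", "large10", "dias"]


def experiment_grid():
    """Return every (query, scaling factor, cluster) combination for one language."""
    return list(product(QUERIES, SCALING_FACTORS, CLUSTERS))


grid = experiment_grid()
print(len(grid))  # 15 * 3 * 7 = 315
```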
  
	
  
Clusters & Data
• Clusters: 6 on Amazon EC2 + 1 DIAS
  • 1 EC2 Compute Unit = 1.0-1.2 GHz 2007 Xeon processor
  • Cluster sizes: 5 or 10 nodes on EC2, 4 nodes on DIAS
• Data: three scaling factors (SF)
  • SF 2 = 2.3 GB
  • SF 5 = 5.7 GB
  • SF 10 = 12.2 GB

Query Execution Times: Pig & Jaql
[Figure: two bar charts, one for Pig and one for Jaql, showing execution time (s) per TPC-DS query (q1, q3, q6, q10, q26, q33, q48, q52, q64, q71, q82, q90, q94, q96, q99) for SF=2, SF=5, and SF=10]
Cluster: 10 m1.medium instances
Pig is faster in general: by 1.7x for SF=2, 2.2x for SF=5, and 3.2x for SF=10
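
The slide does not say how the per-query times were aggregated into a single speedup factor. The sketch below shows two plausible aggregations, the ratio of total times and the mean of per-query ratios, which can give different answers on the same data. The function names and the timings are invented for illustration and are not measurements from this study.

```python
def total_time_ratio(pig_times, jaql_times):
    """Ratio of total Jaql time to total Pig time (above 1 means Pig is faster)."""
    if len(pig_times) != len(jaql_times):
        raise ValueError("need one Pig and one Jaql timing per query")
    return sum(jaql_times) / sum(pig_times)


def mean_per_query_ratio(pig_times, jaql_times):
    """Average of the per-query Jaql/Pig ratios (above 1 means Pig is faster)."""
    if len(pig_times) != len(jaql_times):
        raise ValueError("need one Pig and one Jaql timing per query")
    ratios = [j / p for p, j in zip(pig_times, jaql_times)]
    return sum(ratios) / len(ratios)


# Hypothetical timings in seconds for three queries (not data from the slides).
pig = [120.0, 80.0, 200.0]
jaql = [300.0, 100.0, 400.0]

print(round(total_time_ratio(pig, jaql), 2))      # 2.0
print(round(mean_per_query_ratio(pig, jaql), 2))  # 1.92
```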

Total Execution Time on Cluster: Pig
[Figure: bar chart of total execution time (s) by cluster configuration (small5, small10, medium5, medium10, large5, large10, dias) for SF=2, SF=5, and SF=10]
Small datasets: job startup overhead dominates
Large datasets: startup overhead dominates on powerful clusters only

Total Execution Time on Cluster: Jaql
[Figure: bar chart of total execution time (s) by cluster configuration (small5, small10, medium5, medium10, large5, large10, dias) for SF=2, SF=5, and SF=10]
Small instances are not suitable for Jaql due to poor I/O performance
Jaql launches many more jobs than Pig for the same query, so its job startup overhead is higher

Pig Latin vs Jaql Performance
[Figure: scatter plot with axes Pig execution time (s) and Jaql execution time (s), series SF=2 and SF=5, and the X=Y line for reference; one region is annotated "Many points"]
Pig outperforms Jaql on clusters of 10 EC2 small instances

Pig Latin vs Jaql Performance
[Figure: scatter plot with axes Pig execution time (s) and Jaql execution time (s), series SF=2, SF=5, and SF=10, and the X=Y line for reference]
Jaql performance approaches Pig’s on 10 EC2 medium instances

Pig Latin vs Jaql Performance
[Figure: scatter plot with axes Pig execution time (s) and Jaql execution time (s), series SF=2, SF=5, and SF=10, and the X=Y line for reference]
Half of the queries are faster in Jaql on 10 EC2 large instances
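
In these scatter plots each point pairs one query's Pig time with its Jaql time, and the X=Y line marks equal times, so the side of the line a point falls on shows which language was faster. The sketch below counts that share. The function name and the timings are invented for illustration and are not measurements from this study.

```python
def share_faster_in_jaql(pig_times, jaql_times):
    """Fraction of queries whose Jaql time is strictly lower than their Pig time."""
    if not pig_times or len(pig_times) != len(jaql_times):
        raise ValueError("need one Pig and one Jaql timing per query")
    faster = sum(1 for p, j in zip(pig_times, jaql_times) if j < p)
    return faster / len(pig_times)


# Hypothetical timings in seconds for four queries (not data from the slides).
pig = [100.0, 200.0, 300.0, 400.0]
jaql = [90.0, 250.0, 280.0, 500.0]

print(share_faster_in_jaql(pig, jaql))  # 0.5
```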

Query Execution Time vs Monetary Cost
[Figure: two scatter plots, one for Pig Latin and one for Jaql, of total execution time (s) against price ($) for SF=2, SF=5, and SF=10]
In total, Pig outperforms Jaql. Pig is the better choice in all cases, whether the goal is
minimal execution time, minimal cost, or maximal performance per dollar spent
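
The slides do not state how the price of a run was derived from its execution time. A common model for hourly billed instances is nodes × hourly price × billed hours, sketched below. The function name, the hourly price, and the choice to round up to full hours are assumptions for illustration, not figures from this study.

```python
import math


def run_cost_usd(exec_time_s, nodes, hourly_price_usd, bill_full_hours=True):
    """Estimate the cost of one run on a cluster of identical, hourly billed nodes."""
    hours = exec_time_s / 3600.0
    if bill_full_hours:
        # Assumption: every started hour is charged in full.
        hours = math.ceil(hours)
    return nodes * hourly_price_usd * hours


# Hypothetical example: a 5000 s run on 10 nodes at $0.10 per node-hour.
print(run_cost_usd(5000, 10, 0.10))                                    # 2.0
print(round(run_cost_usd(5000, 10, 0.10, bill_full_hours=False), 2))  # 1.39
```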

What Language To Use? Where To Run?
[Figure: scatter plot for Query 26 of execution time (s) against price ($), series SF=5 Pig and SF=5 Jaql]
If we consider a single query, no single language is the best for all purposes

Choosing Optimal Tool
[Figure: scatter plot for Query 26 of execution time (s) against price ($), series SF=5 Pig, SF=5 Jaql, and Optimal; the labeled points are "Pig on small5", "Pig on medium5", and "Jaql on large10"]
Given a dataset, a query and a utility function, which language on which
cluster should be used to optimize the function?
Given a dataset and a query, what are the options for executing it in the cloud?
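
One way to frame these two questions is to keep only the options that no other option beats on both time and price (the Pareto front), and then pick among them with a utility function. The sketch below assumes that framing. The `Option` type, the function names, and all numbers are invented for illustration, with values chosen only so that the three configurations labeled on the chart land on the front; they are not measurements from this study.

```python
from typing import Callable, List, NamedTuple


class Option(NamedTuple):
    language: str
    cluster: str
    time_s: float
    price_usd: float


def pareto_front(options: List[Option]) -> List[Option]:
    """Keep options that no other option matches or beats on both time and price."""
    front = []
    for o in options:
        dominated = any(
            p.time_s <= o.time_s
            and p.price_usd <= o.price_usd
            and (p.time_s < o.time_s or p.price_usd < o.price_usd)
            for p in options
        )
        if not dominated:
            front.append(o)
    return sorted(front, key=lambda o: o.price_usd)


def best(options: List[Option], utility: Callable[[Option], float]) -> Option:
    """Pick the option with the lowest utility value (lower is better)."""
    return min(options, key=utility)


# Hypothetical (time, price) points for one query at one scaling factor.
options = [
    Option("pig", "small5", 900.0, 0.05),
    Option("pig", "medium5", 500.0, 0.10),
    Option("jaql", "medium10", 600.0, 0.25),
    Option("jaql", "large10", 200.0, 0.30),
    Option("pig", "large10", 300.0, 0.32),
]

for o in pareto_front(options):
    print(o.language, o.cluster)

fastest = best(options, lambda o: o.time_s)
cheapest = best(options, lambda o: o.price_usd)
# Example trade-off: one extra dollar is worth 2000 s of waiting.
balanced = best(options, lambda o: o.time_s + 2000.0 * o.price_usd)
print(fastest.cluster, cheapest.cluster, balanced.cluster)
```

With these made-up points, different utility functions select different language and cluster combinations, which is the situation the slide describes for a single query.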

Summary: Opinion
Pig Latin
• Cumbersome scripts
• Procedural
• Long to write, easy to debug
• Good documentation
• Convenient interpreter
Jaql
• Concise scripts
• Declarative, more SQL-like
• Quick to write, long to debug
• Poorly documented
• Tools are in a rudimentary state
Jaql is much better as a language, but its development infrastructure (documentation, user base, tools) is much worse
	
  
Summary: Facts
Pig Latin
• Development in progress
• Faster in most of our experiments
• Scales better with the dataset size
• Checks the schema before evaluation
Jaql
• Open-source version abandoned one year ago
• Slower in most of our experiments
• Scales worse with the dataset size
• Doesn’t check the schema even while evaluating
Thank you for your attention!
Feedback & Questions?
  
Query Execution Time vs Monetary Cost by Cluster Configuration
[Figure: two scatter plots, one for Pig Latin and one for Jaql, of total execution time against price ($), one series per cluster configuration (small5, small10, medium5, medium10, large5, large10)]

Query Execution Time vs Monetary Cost by Cluster
[Figure: scatter plot of execution time (s) against price ($) for SF=2, SF=5, and SF=10; point groups are labeled "small5", "small10, medium5", "medium10, large5", and "large10"]

Directions for Future Work
• Reach out to the communities for a larger-scale and more realistic comparison
• Add Hive queries to the comparison
Code & Data on GitHub: github.com/voz
  
Questions and Feedback
Andrii.Vozniuk@epfl.ch
  
