TPC-H analytics' scenarios and performances on Hadoop data clouds

  1. TPC-H ANALYTICS' SCENARIOS AND PERFORMANCES ON HADOOP DATA CLOUDS
     Rim Moussa (rim.moussa@esti.rnu.tn), LaTICE, Univ. of Tunis, Tunisia
     24th April 2012, 4th International Conference on Networked Digital Technologies (NDT'2012), Dubai, UAE
  2. OUTLINE
     1. Business Intelligence
     2. Motivation: data management issues, NoSQL, clouds, OLAP in the cloud
     3. Implementation of OLAP in the cloud: TPC-H benchmark, analytics scenarios, performance measurements
     4. Related Work
     5. Conclusion
     6. Future Work
  3. BUSINESS INTELLIGENCE
     Business intelligence aims to support better business decision-making.
     Common functions of business intelligence technologies are On-Line Analytical Processing, data mining, process mining, business performance management, text mining and predictive analytics, …
     Market share: Gartner Research reports that BI market revenue hit $12.2 billion in 2011.
  4. MOTIVATION
     Decision Support Systems
     • Incessant data & complex workload
     • Complex DB schema
     Scalability Issues
     • Ideally, linear speed-up & linear scale-up
     • DBMSs do not scale linearly
     • OLAP technologies do not scale
     Hardware I/O Bottleneck
     • I/O-bound data storage systems
     • Gilder's law: thrice bandwidth every 3 years
     • Moore's law: twice computing and storage capacities every 18 months. Obsolete by 2017
     • Vertical scaling cost >> horizontal scaling cost
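The "linear speed-up" and "linear scale-up" targets named on this slide can be made concrete with the usual textbook definitions. The sketch below is illustrative only: the function names, the tolerance and the sample timings are assumptions, not measurements from the talk.

```python
def speedup(time_small_system: float, time_large_system: float) -> float:
    """Same workload on a larger system: how many times faster it runs."""
    return time_small_system / time_large_system


def scaleup(time_small_job_small_system: float,
            time_large_job_large_system: float) -> float:
    """Workload and system grown by the same factor: 1.0 means linear scale-up."""
    return time_small_job_small_system / time_large_job_large_system


def is_linear_speedup(time_one_unit: float, time_n_units: float, n: int,
                      tolerance: float = 0.1) -> bool:
    """True when the speed-up is within `tolerance` of the ideal factor n."""
    return speedup(time_one_unit, time_n_units) >= n * (1.0 - tolerance)


# Hypothetical elapsed times in seconds, for illustration only.
ideal = is_linear_speedup(120.0, 60.0, n=2)       # doubling halves the time
not_ideal = is_linear_speedup(120.0, 100.0, n=2)  # doubling barely helps
```

A system that "does not scale linearly" in the sense of the slide is one where `speedup` stays well below the growth factor `n`, or `scaleup` drops below 1.0 as data and nodes grow together.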
  5. NOSQL
     Big challenges related to velocity
     • How fast can huge volumes of data be processed?
     NoSQL solutions
     • Adopted by Google, Facebook, Amazon, …
     • Dynamic horizontal scale-up: nodes are added without bringing the cluster down
     • Shared-nothing architecture: independent computing & storage nodes interconnected via a high-speed network
     • Distributed programming framework: MapReduce (Google)
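To illustrate the MapReduce programming model mentioned on this slide, here is a minimal single-process sketch of the map, shuffle and reduce phases. It is not Hadoop code: `map_reduce`, the mapper, the reducer and the sample orders are all hypothetical names and data chosen for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    # reduce phase: one reducer call per distinct key
    return {key: reducer(key, values) for key, values in groups.items()}


def priority_mapper(order):
    """Emit (priority, 1) for each order."""
    yield order["priority"], 1


def sum_reducer(key, values):
    """Add up the counts emitted for one key."""
    return sum(values)


# Hypothetical sample data.
orders = [
    {"orderkey": 1, "priority": "1-URGENT"},
    {"orderkey": 2, "priority": "2-HIGH"},
    {"orderkey": 3, "priority": "1-URGENT"},
]
counts = map_reduce(orders, priority_mapper, sum_reducer)
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; the sketch only shows the data flow.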
  6. CLOUD COMPUTING
     Cloud computing is a style of computing where scalable and elastic IT-enabled capabilities are provided "as a service" to external customers using Internet technologies.
     • Broad network access
     • Resource pooling (virtualization)
     • Self-provisioning
     • Rapid elasticity
     • Scalable analytics
     • Measured service
     Market share
     • Forrester Research expects the global cloud computing market to reach $241 billion in 2020. In particular, the SaaS market is expected to grow to $92.8 billion by 2016.
     • Gartner group expects the cloud computing market will reach US$150.1 billion, with a compound annual rate of 26.5%, in 2013.
  7. OLAP IN THE CLOUD
     OLAP constraints
     • Big data analytics' obstacles
     • Current systems & technologies do not scale
     Key benefits of cloud computing (cloud computing, big data analytics, NoSQL)
     Performance
     • Much faster data analysis
     • Dynamic and up-to-date hardware infrastructure
     More economical
     • Organizations no longer need to expend capital upfront for hardware and software purchases
     • Services are provided on a pay-per-use basis
  8. TPC-H DECISION-SUPPORT SYSTEM BENCHMARK
     Data
     • Complex DB schema
     • Scale factors 1, 10, …, 100,000 correspond respectively to 1 GB, 10 GB, …, 100 TB
     • 8 data files {lineitem, customer, …, region}.tbl
     • Broad industry-wide relevance
     Workload
     • 22 real-world business questions
     • High degree of complexity
     • Star queries (complex joins)
     • Grouping
     • Nested queries
  9. TPC-H BENCHMARK: E/R schema (figure not reproduced)
  10. HADOOP/PIG LATIN
      Apache Hadoop
      • Framework for running applications on large clusters of commodity hardware
      • Implements the MapReduce computational framework
      • HDFS (Hadoop Distributed File System) stores data on the compute nodes
      • Replication & job resubmissions for failure handling
      Apache Pig Latin
      • High-level language for expressing data analysis programs (filter, projection, join, group, sort, union, …)
  11. PIG LATIN BENCHMARK: 5 TRANSLATION HINTS
      1. Load data for immediate processing
         • Better memory management
         • Conjunction/disjunction of predicates applied once
      2. Minimum relation scan
      3. Unary operations prior to binary operations
         • Unary operations (projection, restriction) reduce data volume
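Hint 3 (unary operations before binary operations) can be illustrated outside Pig with a small in-memory sketch. It compares two plans that return the same rows but feed different volumes into the join. The `join` helper, the relation contents and the segment value are assumptions made up for this example.

```python
def join(left, right, key):
    """Equi-join two lists of dicts on `key` (index the right side, probe with the left)."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical sample relations.
customers = [
    {"custkey": 1, "segment": "BUILDING"},
    {"custkey": 2, "segment": "AUTOMOBILE"},
]
orders = [
    {"custkey": 1, "orderkey": 10},
    {"custkey": 1, "orderkey": 11},
    {"custkey": 2, "orderkey": 12},
    {"custkey": 2, "orderkey": 13},
]

# Plan 1: binary operation first, then restriction (every order is joined).
joined_all = join(orders, customers, "custkey")
late_filter = [row for row in joined_all if row["segment"] == "BUILDING"]

# Plan 2: restriction first, then join (only matching customers reach the join).
building_only = [c for c in customers if c["segment"] == "BUILDING"]
early_filter = join(orders, building_only, "custkey")
```

Both plans produce the same answer; the second one simply produces fewer intermediate rows, which is the point of the hint.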
  12. PIG LATIN BENCHMARK: 5 TRANSLATION HINTS (CONTINUED)
      4. Intra-operation parallelism
         • Partitioned join
      5. Join algorithm
         • Algorithm: hash join, merge join
         • Star queries: joins ordering
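The partitioned join and hash join named in hints 4 and 5 can be sketched as follows. This is a single-process illustration under assumed names (`hash_join`, `partition`, `partitioned_join`) and made-up data; in a cluster each partition pair would be joined by a different worker rather than in a sequential loop.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Build a hash table on `right`, then probe it with each row of `left`."""
    table = defaultdict(list)
    for row in right:                      # build phase
        table[row[key]].append(row)
    return [{**l, **r} for l in left for r in table.get(l[key], [])]  # probe phase


def partition(rows, key, n):
    """Split rows into n buckets so that equal keys always share a bucket."""
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[hash(row[key]) % n].append(row)
    return buckets


def partitioned_join(left, right, key, n):
    """Join matching bucket pairs independently; each pair could run in parallel."""
    result = []
    for left_part, right_part in zip(partition(left, key, n), partition(right, key, n)):
        result.extend(hash_join(left_part, right_part, key))
    return result


# Hypothetical sample relations.
suppliers = [{"nationkey": k % 3, "suppkey": k} for k in range(6)]
nations = [{"nationkey": k, "name": f"N{k}"} for k in range(3)]
whole = hash_join(suppliers, nations, "nationkey")
split = partitioned_join(suppliers, nations, "nationkey", n=2)
```

Because both inputs are partitioned with the same function, no matching pair of rows is ever separated, so the union of the per-partition joins equals the full join.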
  13. NOMINAL ANALYTICAL SCENARIO
      Diagram labels: Pig Latin script (business question); response.
  14. NOMINAL ANALYTICAL SCENARIO
      High cost
      • Measured service: pay as you consume cloud resources (bandwidth, CPU, RAM)
      Performance issues
      • The same query (with the same or different parameters) is executed several times with no optimization
      Discontinuity of service
      • Network failure/congestion
  15. COMPLEX!!
      How to reduce service cost? How to improve performances? How to prevent discontinuity of service?
      Workload study for tuning
      • Materialized views?
      • Aggregated data replication
      • OLAP or not?
      Cost management
      • Exploit organization hardware resources?
      • Cloud right size?
  16. TPC-H WORKLOAD NUMERICAL STUDY: TYPE A
      Order Priority Checking: order date dim × order priority dim × count orders measure
      • Size: always 135
      • OLAP! + export to OLAP server + MV
  17. TPC-H WORKLOAD NUMERICAL STUDY: TYPE B
      Large Volume Orders: order dim × sum line qty measure
      • Size: SF × 1,500,000
      • Almost 3.8 ppm of orders have ∑ line qty > 300, for SF = 1
      • Not OLAP! + MV
  18. TPC-H WORKLOAD NUMERICAL STUDY: TYPE C
      Minimum Cost Supplier: supplier dim × part dim × min supply cost measure
      • Size: SF² × 2,000,000,000
      • OLAP!
      • MV storage cost: best supplier in each region for each part!
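The scale-factor-dependent sizes quoted on the Type B and Type C slides follow from the TPC-H table cardinalities at SF = 1 (1,500,000 orders, 10,000 suppliers, 200,000 parts). The short sketch below redoes that arithmetic; the function names are assumptions introduced for illustration.

```python
# Row counts at scale factor 1, as defined by the TPC-H specification.
BASE_ROWS = {"supplier": 10_000, "part": 200_000, "orders": 1_500_000}


def rows(table: str, scale_factor: int) -> int:
    """Row count of a scaled TPC-H table."""
    return BASE_ROWS[table] * scale_factor


def type_b_cells(scale_factor: int) -> int:
    """One cell per order: grows linearly with SF."""
    return rows("orders", scale_factor)


def type_c_cells(scale_factor: int) -> int:
    """One cell per (supplier, part) pair: grows with the square of SF."""
    return rows("supplier", scale_factor) * rows("part", scale_factor)


cells_sf1 = type_c_cells(1)
cells_sf10 = type_c_cells(10)
```

Going from SF = 1 to SF = 10 multiplies the Type C space by 100 rather than 10, which is why the slide flags the materialized-view storage cost.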
  19. TPC-H WORKLOAD NUMERICAL STUDY
      Type A
      • Features: medium dimensionality; result is TPC-H scale factor independent
      • TPC-H business questions (OLAP cube): Q1, Q3, Q4, Q5, Q6, Q7, Q8, Q12, Q13, Q14, Q16, Q19, Q22 (13 business questions)
      Type B
      • Features: high dimensionality; few results, lots of empty cells
      • TPC-H business questions: Q15, Q18 (2 business questions)
      Type C
      • Features: high dimensionality; result % of scale factor
      • TPC-H business questions: Q2, Q9, Q10, Q11, Q17, Q20, Q21 (7 business questions)
  20. CLOUD COST MANAGEMENT
      Measured service: pay as you go (CPU + memory + bandwidth)
      "When users understand the relationship between cost and consumption, everybody wins" (Ron Miller)
      Emerging need to understand, manage and proactively control costs across the cloud
      • Resource utilization monitoring
      • Right size w.r.t. both performances & cost (client and provider)
      • Green cloud through energy saving
  21. BETTER SCENARIO
      Diagram labels: Pig Latin script (generalized business question); pre-aggregated data; import data into an on-site OLAP server; interaction; OLAP client.
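The idea of this scenario, running one generalized question in the cloud and then answering each parametrized question locally from the pre-aggregated result, can be sketched as below. This is only an assumed illustration in the spirit of the Order Priority Checking question: `build_cube`, `order_priority_checking`, the quarter labels and the sample orders are hypothetical.

```python
from collections import Counter


def build_cube(orders):
    """Generalized question, computed once: count orders per (quarter, priority)."""
    return Counter((order["quarter"], order["priority"]) for order in orders)


def order_priority_checking(cube, quarter):
    """Parametrized question answered from the pre-aggregated cube, no rescan."""
    return {priority: count
            for (q, priority), count in sorted(cube.items())
            if q == quarter}


# Hypothetical sample data.
orders = [
    {"quarter": "1993Q3", "priority": "1-URGENT"},
    {"quarter": "1993Q3", "priority": "1-URGENT"},
    {"quarter": "1993Q3", "priority": "5-LOW"},
    {"quarter": "1994Q1", "priority": "2-HIGH"},
]
cube = build_cube(orders)                       # done once
answer = order_priority_checking(cube, "1993Q3")  # done per user request
```

The expensive scan happens once when the cube is built; every later request with a different parameter only reads the small aggregated structure.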
  22. PERFORMANCE MEASUREMENTS
      TPC-H benchmark
      • SF=1: 1.1GB source files, 4.5GB single big file
      • SF=10: 11GB source files, 45GB single big file
      G5K
      • French GRID platform: a large-scale nation-wide infrastructure for Grid research
      • Bordeaux site
      • Borderel: 24GB RAM, 4 AMD CPUs, 2.27 GHz, and 4 cores/CPU
      • Borderline: 32GB RAM, 4 Intel Xeon CPUs, 2.6 GHz, and 2 cores/CPU
      • Ethernet 10Gbps
      Pig/HDFS
      • Apache Hadoop 0.20
      • N=3, 5 or 8: one Hadoop master and (2, 4 or 7) workers
      • Apache Pig 0.8.1
  23. PERFORMANCE MEASUREMENTS
      Chart tabs: Original TPC-H 1GB; Original TPC-H 11GB; Original 1.1 vs 11GB; Big File; Big File 4.5GB; Big File 45GB.
      Except for business questions which do not perform join operations, there is no improvement when the cluster size doubles.
  24. PERFORMANCE MEASUREMENTS
      • Improvement when the cluster size doubles
      • More data, so it's nice to have more storage & computing nodes
  25. PERFORMANCE MEASUREMENTS
      Elapsed times for SF=10 (11GB) are
      • at most 5 times the elapsed times obtained for SF=1 (1.1GB)
      • on average twice the elapsed times obtained for SF=1 (1.1GB)
  26. PERFORMANCE MEASUREMENTS
      Joining partitioned files is complex!
      • Combine all files into one file: 4.5GB for SF=1, 45GB for SF=10
      • Evaluation of Pig/MR without joins
      Denormalization
      • saves join cost
      • increases required storage space (≈ ×4 for TPC-H)
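The denormalization trade-off described on this slide (no join at query time, more storage) can be illustrated with a toy example. The relation contents, field names and the `denormalize` helper are assumptions for illustration; the ≈ ×4 figure on the slide comes from the talk, not from this sketch.

```python
# Hypothetical normalized relations.
customers = {1: {"c_name": "Customer#1", "c_segment": "BUILDING"}}
orders = [
    {"o_orderkey": 10, "o_custkey": 1},
    {"o_orderkey": 11, "o_custkey": 1},
]


def denormalize(orders, customers):
    """Pre-join once: copy the customer fields into every order row."""
    return [{**order, **customers[order["o_custkey"]]} for order in orders]


big_file = denormalize(orders, customers)

# Query over the single wide file: a plain filter, no join needed.
building_orders = [row["o_orderkey"] for row in big_file
                   if row["c_segment"] == "BUILDING"]

# Storage proxy: number of stored field values before and after denormalization.
normalized_fields = (sum(len(order) for order in orders)
                     + sum(len(customer) for customer in customers.values()))
denormalized_fields = sum(len(row) for row in big_file)
```

The customer fields are stored once in the normalized layout and once per order in the wide file, so storage grows while the join disappears from the query.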
  27. PERFORMANCE MEASUREMENTS
      • Compared to (SF=1, 1.1GB), improvements range from 10% to 80%
      • Performance degradation with more than 4 workers (N=5): this is due to the MR framework (before the reduce phase, data is grouped and sorted, which has a cost when involving more storage and computing nodes)
  28. PERFORMANCE MEASUREMENTS
      • Compared to (SF=10, 11GB), after affording more nodes we obtain similar results
      • Compared to (SF=1, 4.5GB), elapsed times are less than 10× for the same cluster size
      • Performance degradation for queries which do not perform joins: with (SF=10, 11GB), Q1 executes over the 7GB lineitem file; now it executes over the 45GB file
  29. PERFORMANCE MEASUREMENTS
      Chart tabs: Big File; Big File 4.5GB; Big File 45GB; OLAP; OLAP 4.5GB; OLAP 45GB.
      Aggregated data (15 business questions out of 22)
      • TPC-H business questions type A (SF independent & small result set)
      • TPC-H business questions type B (very, very small result set)
      Tradeoff between space & computation (7 business questions out of 22)
      • TPC-H business questions type C
      • Add derived fields
      • Q2: check (true) minimum supplycost by supplier for a part in PartSupp
      • Q17: average_line_quantity field for each part
      • Q20: sum_lines_quantities_per_year for each supplier
      • Q21: number of waiting orders for each supplier
  30. PERFORMANCE MEASUREMENTS
      Average degradation is 5%, done once, + time to retrieve data
  31. PERFORMANCE MEASUREMENTS
      Average degradation is 60%, done once, + time to retrieve data
  32. RELATED WORK
      Implementation of relational operations using the MR framework
      • Kim et al., MRBench, 2008
      • Nominal analytics scenario
      • TPC-H benchmarking for SF=1, 3
      Translation from SQL to Pig Latin
      • Iu et al., Hadoop to SQL, 2010
      • Lee et al., Ysmart, 2011
      Other Pig Latin use cases
      • Shatzle et al., RDF data, 2011
      • Loebman et al., astrophysical data, 2009
  33. CONCLUSION
      • TPC-H in-depth numerical study
      • OLAP in the cloud scenarios
      • Implementation: Pig / Hadoop Distributed File System
      • Performances: various cluster sizes, various data volumes, various schemas (with and without joins)
  34. FUTURE WORK: AQP
      Approximate Query Processing in clouds
      • Most distributed file systems implement replication for high availability
      • MDS erasure codes outperform replication from two perspectives: (i) storage cost and (ii) minimal operation cost of redundant data management
      • New Hadoop release (Facebook)
      • Generalized framework for approximate data analytics in the cloud coping with nodes' failure
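The storage-cost argument for MDS erasure codes over replication can be checked with simple arithmetic: with k data blocks and m parity blocks, an MDS code stores (k + m) / k times the raw data and survives the loss of any m blocks. The sketch below compares this with plain replication; the function names and the parameter values (3 copies, k = 4, m = 2) are illustrative assumptions, not settings taken from the talk.

```python
def replication_overhead(copies: int) -> float:
    """Bytes stored per byte of data when every block is kept `copies` times."""
    return float(copies)


def replication_tolerated_failures(copies: int) -> int:
    """A block survives as long as one copy is left."""
    return copies - 1


def mds_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Bytes stored per byte of data for a (k + m, k) MDS code."""
    return (data_blocks + parity_blocks) / data_blocks


def mds_tolerated_failures(data_blocks: int, parity_blocks: int) -> int:
    """Any k of the k + m blocks rebuild the data, so m losses are tolerated."""
    return parity_blocks


# Hypothetical configurations that tolerate the same number of failures.
three_copies = (replication_overhead(3), replication_tolerated_failures(3))
mds_4_2 = (mds_overhead(4, 2), mds_tolerated_failures(4, 2))
```

For the same tolerance of two lost blocks, the assumed MDS configuration stores half as much as triple replication; the price is the decoding work needed when a block must be rebuilt.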
  35. FUTURE WORK: PIG LATIN++
      Most TPC-H business question scripts are composed of jobs which execute sequentially; some branches of the DAG are unnecessarily blocked! Namely Q1, Q3, Q4, Q9, Q10, Q11, Q12, Q13, Q14, Q16, Q17 and Q18.
      Pig Latin enhancements
      • Investigate intra-operation parallelism for better performances of Pig scripts
      • Investigate better job definition strategies, in order to increase inter-job parallelism
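The inter-job parallelism mentioned on this slide amounts to running, at each step, every job whose predecessors have finished, instead of running jobs one after another. The sketch below groups the jobs of a dependency DAG into such steps. The DAG shown is hypothetical and is not the actual job graph of any TPC-H script; `schedule_levels` is an assumed helper name.

```python
def schedule_levels(dependencies):
    """Group jobs into levels; all jobs in one level can run concurrently.

    `dependencies` maps each job to the list of jobs it waits for.
    Every job, including those without parents, must appear as a key.
    """
    remaining = {job: set(parents) for job, parents in dependencies.items()}
    levels = []
    while remaining:
        ready = sorted(job for job, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError("dependency cycle detected")
        levels.append(ready)
        for job in ready:
            del remaining[job]
        for parents in remaining.values():
            parents.difference_update(ready)
    return levels


# Hypothetical five-job DAG with two independent branches at each end.
dag = {
    "job_a": [],
    "job_b": [],
    "job_c": ["job_a", "job_b"],
    "job_d": ["job_c"],
    "job_e": ["job_c"],
}
levels = schedule_levels(dag)
sequential_steps = len(dag)     # one job at a time
parallel_steps = len(levels)    # independent jobs run together
```

With this assumed DAG, sequential execution takes five steps while level-by-level execution takes three, which is the kind of gain the slide attributes to unblocking independent branches.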
  36. FUTURE WORK: PIG LATIN++, Q16 EXAMPLE
      Q16's DAG and Q16's Pig script (figures not reproduced); jobs J_0001, J_0002, J_0003, J_0004, J_0005.
  37. THANK YOU FOR YOUR ATTENTION
      Q&A
