Low Latency SQL on Hadoop
What’s best for your
cluster?
Prepared by Alan Gardner
June 2014
Low Latency SQL on Hadoop - What's best for your cluster

Published in: Technology

Transcript of "Low Latency SQL on Hadoop - What's best for your cluster"

  1. Low Latency SQL on Hadoop
     What’s best for your cluster?
     Prepared by Alan Gardner
     June 2014
  2. Alan Gardner
     @alanctgardner
     gardner@pythian.com
     © 2013 Pythian
  5. Overview
     • Performance
     • Architecture
     • Features
     • Vendor Support
     • Conclusions
  6. Performance
  7. Berkeley Big Data Benchmark
     • Hive, Hive-on-Tez, RedShift, Shark, Impala
     • Tested on five m2.4xlarge EC2 instances
     • Uses Intel’s Hadoop Benchmark, not TPC
     • ~150GB of ...
  8. Berkeley Big Data Benchmark
     • Finds Shark fastest at straight scans, and tied with Impala for aggregation and joining
     • Hive-on-Tez is a distant third
     • Not using the optimized, columnar formats
  9. Cloudera SQL Benchmark
     • Impala, Hive-on-Tez, Shark and Presto
     • Uses high-end hardware with relatively large memory, fastest data types for each engine
     • 15TB scale factor for a TPC-DS based test
  10. Cloudera SQL Benchmark
      • Finds Impala to be significantly faster across all data sizes
      • Shark and Tez outperform Presto 0.60, with Tez performing better for larger result sets
      • It’s unclear if table ...
  11. Our Configuration
      • 9-node cluster of m2.2xlarge instances
      • 4 cores, 34GB RAM
      • 850GB of instance storage
      • 100GB scale factor – only from disk, no RDDs
      • Impala 1.3.1 on CDH 5.0.1
      • Hive 0.13 from the ...
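To put the configuration on slide 11 in perspective, the per-node figures can be scaled up to cluster totals. The sketch below assumes the core, RAM and storage numbers on the slide are per instance; the `NodeSpec` class and `cluster_totals` helper are illustrative names, not part of any tool mentioned in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Resources of one node (assumed to be per-instance figures)."""
    cores: int
    ram_gb: int
    storage_gb: int


def cluster_totals(node: NodeSpec, node_count: int) -> NodeSpec:
    """Multiply one node's resources by the number of nodes."""
    return NodeSpec(
        cores=node.cores * node_count,
        ram_gb=node.ram_gb * node_count,
        storage_gb=node.storage_gb * node_count,
    )


# Figures taken from the slide: 9 nodes, 4 cores, 34GB RAM, 850GB storage.
m2_2xlarge = NodeSpec(cores=4, ram_gb=34, storage_gb=850)
totals = cluster_totals(m2_2xlarge, node_count=9)
print(totals)
```

Under that assumption the cluster has 36 cores, 306GB of RAM and 7,650GB of instance storage in aggregate, against a 100GB scale factor.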
  12. File Formats
      • Hive, Shark - ORC (ZLIB)
      • Presto - ORC (ZLIB)
        – RCFile (LazyBinarySerDe) was slower
        – RCFile (ColumnarSerDe) may be better
      • Impala – Parquet (no compression)
  14. TPC-H Queries
      • Query 1 – filtering and aggregation on a single table
      • Query 8 – select two columns from joins across many-to-many relationships
      • Query 10 – select and aggregate on eight ...
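The shape of Query 1 on slide 14 (filter, then aggregate, on a single table) can be illustrated with a toy example. This is a minimal sketch and not the official TPC-H Query 1 text: the column names borrow TPC-H `lineitem` naming, while the rows, the cutoff date and the use of SQLite are assumptions made only so the example runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lineitem ("
    " l_returnflag TEXT, l_linestatus TEXT,"
    " l_quantity REAL, l_extendedprice REAL, l_shipdate TEXT)"
)

# Invented sample rows; the last one falls after the cutoff date.
rows = [
    ("A", "F", 10.0, 100.0, "1998-01-15"),
    ("A", "F", 20.0, 300.0, "1998-06-01"),
    ("N", "O", 5.0, 50.0, "1998-08-30"),
    ("N", "O", 7.0, 70.0, "1998-12-01"),
]
conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?, ?, ?)", rows)

# Filter on a date, then group and aggregate: the Query 1 pattern.
query = """
SELECT l_returnflag,
       l_linestatus,
       SUM(l_quantity)      AS sum_qty,
       AVG(l_extendedprice) AS avg_price,
       COUNT(*)             AS count_order
FROM lineitem
WHERE l_shipdate <= '1998-09-02'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

A query like this touches one table and a handful of columns, which is why columnar formats such as ORC and Parquet matter for it: only the referenced columns need to be read.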
  16. Architecture
  17. • Hive 0.13 runs on Tez, which executes queries as DAGs
      • DAGs are more efficient than MRv1 query plans
      • Runs on YARN, resources are shared between all jobs
      • Individual node failures are tolerated and retried automatically
  18. • HiveServer creates a DAG from HQL submitted over JDBC
      • HiveServer requests or reuses a Tez AM to run the query
      • Tez handles placement of query fragments based on locality and resources
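Slides 17 and 18 describe a query being turned into a DAG of stages. The sketch below is not Tez's API; it only illustrates the idea using Python's standard `graphlib` module, with hypothetical stage names. Stages whose dependencies are all finished become ready together, so independent scans can run in the same wave and each later stage starts as soon as its inputs are done.

```python
from graphlib import TopologicalSorter

# Hypothetical query plan: each stage maps to the stages it depends on.
plan = {
    "scan_orders": set(),
    "scan_lineitem": set(),
    "join": {"scan_orders", "scan_lineitem"},
    "aggregate": {"join"},
    "sort": {"aggregate"},
}


def execution_waves(dag):
    """Group stages into waves; stages in one wave have no mutual dependencies."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all stages whose inputs are complete
        waves.append(ready)
        sorter.done(*ready)  # mark them finished so dependents become ready
    return waves


waves = execution_waves(plan)
for number, stages in enumerate(waves, start=1):
    print(number, stages)
```

For this plan the two scans form the first wave, followed by the join, the aggregation and the sort, one wave each.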
  19. • Shark uses the same core as Hive: the HQL parser and the file and UDF interfaces are compatible
      • DAGs produced by Shark are optimized for Spark, rather than Tez
      • Spark can be run on YARN for resource sharing, as well as Mesos or stand-alone
  20. • Spark is more mature and offers a wider range of optimizations right now
      • Shark also supports storing results as an RDD within Spark
  21. • Impala runs as an engine ‘next to’ YARN, not on top of it
      • To reduce resource contention and allow scheduling to be centralized in YARN, Llama was created
      • Llama creates “fake” applications on YARN as placeholders for Impala
  22. • Impalad receives queries, plans and executes them
      • Statestore broadcasts metadata updates and node status
      • Catalog caches block metadata and Hive table metadata
  23. • Presto doesn’t interact with YARN at all
      • cgroups are the only way to share resources between YARN jobs and Presto
      • Presto also handles all scheduling and job placement by itself
  24. • Presto has a single coordinator which plans and distributes query fragments
      • Workers are still co-located with DataNodes for locality
      • Discovery service manages worker status
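Several of the engines above place query fragments on workers that sit next to the data. The following is a minimal greedy sketch of locality-aware placement, not the scheduler of Tez, Impala or Presto: the host names, fragment names and the `place_fragments` helper are all hypothetical. Each fragment goes to the least-loaded worker that holds a replica of its data, falling back to the least-loaded worker overall when no local worker exists.

```python
def place_fragments(fragments, workers):
    """Assign each fragment to a worker, preferring hosts that hold its data.

    fragments: dict of fragment id -> list of hosts holding a replica
    workers:   list of worker host names
    """
    load = {worker: 0 for worker in workers}
    placement = {}
    for fragment_id, replica_hosts in fragments.items():
        local = [w for w in workers if w in replica_hosts]
        candidates = local or workers  # fall back to any worker if none is local
        # Pick the least-loaded candidate; break ties by host name.
        chosen = min(candidates, key=lambda w: (load[w], w))
        placement[fragment_id] = chosen
        load[chosen] += 1
    return placement


workers = ["node1", "node2", "node3"]
fragments = {
    "frag0": ["node1", "node2"],
    "frag1": ["node1", "node3"],
    "frag2": ["node1"],
    "frag3": ["node9"],  # data lives on a host with no worker
}
placement = place_fragments(fragments, workers)
print(placement)
```

A real scheduler also weighs memory, CPU and rack topology; the sketch only shows why co-locating workers with DataNodes gives the planner local candidates to choose from.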
  25. Functionality
  28. File Formats

                  Text  RCFile  Parquet  ORCFile  Avro  SequenceFile
      Presto      R     R       R        R        R     R
      Impala      R/W   R       R/W      -        R     R
      Hive/Shark  R/W   R/W     R/W      R/W      R/W   R/W

      Flexibility

                  SerDes  Complex Data   UDFs  Spill to Disk  JOIN Reordering
      Presto      Yes     Yes, but slow  No    No             None
      Impala      No      No             Yes   No             Cost-based
      Hive/Shark  Yes     Yes            Yes   Yes            Cardinality
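The file format matrix on slide 28 can be encoded as a small lookup table to answer questions such as "which engines can write Parquet?". The values below are copied from the slide and reflect the state of the engines at the time of the talk; the `engines_supporting` helper is an illustrative name.

```python
# Access per engine and format, as listed on the slide:
# "R" = read, "R/W" = read and write, "-" = not supported.
FILE_FORMATS = {
    "Presto": {"Text": "R", "RCFile": "R", "Parquet": "R",
               "ORCFile": "R", "Avro": "R", "SequenceFile": "R"},
    "Impala": {"Text": "R/W", "RCFile": "R", "Parquet": "R/W",
               "ORCFile": "-", "Avro": "R", "SequenceFile": "R"},
    "Hive/Shark": {"Text": "R/W", "RCFile": "R/W", "Parquet": "R/W",
                   "ORCFile": "R/W", "Avro": "R/W", "SequenceFile": "R/W"},
}


def engines_supporting(file_format, need_write=False):
    """Return engines that can read (and optionally write) the given format."""
    matches = []
    for engine, formats in FILE_FORMATS.items():
        access = formats[file_format]
        if access == "-":
            continue  # format not supported at all
        if need_write and "W" not in access:
            continue  # read-only support is not enough
        matches.append(engine)
    return matches


print(engines_supporting("ORCFile"))
print(engines_supporting("Parquet", need_write=True))
```

The same structure would extend to the Flexibility table by adding one dictionary per feature column.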
  31. Vendor Support
  32. Official Support

               Cloudera  MapR    HortonWorks
      Presto   No        No      No
      Impala   Yes       Yes     No
      Hive     No Tez    No Tez  Yes
      Shark    Spark     Yes     Spark

      Note: based on vendor documentation as of 31/05/2014
  35. Conclusions
  36. A giant, indecipherable flowchart
  37. Conclusions
      • Shark provides a faster alternative to Hive 0.13 for ETL and analytics, but support is lacking and tuning is difficult
      • Presto is still nascent – deployment is easy, but querying is not so simple
  38. Thank you – Q&A
      To contact us
      gardner@pythian.com
      1-877-PYTHIAN
      @pythian @alanctgardner
