A perfect Hive query for a perfect meeting (Hadoop Summit 2014)

During one of our epic parties, Martin Lorentzon (chairman of Spotify) agreed to help me arrange a dinner for me and Timbuktu (my favourite Swedish rap and reggae artist) if I somehow prove that I am the biggest fan of Timbuktu in my home country. Because at Spotify we attack all problems using data-driven approaches, I decided to implement a Hive query that processes real datasets to figure out who streams Timbuktu the most frequently in my country. Although this problem seems well-defined, implementing the query efficiently raises many challenges, related to sampling, testing, debugging, troubleshooting, optimizing and executing it over terabytes of data on a Hadoop-YARN cluster of hundreds of nodes. In this talk, I describe all of them and share how to increase your (and the cluster's) productivity by following tips and best practices for analyzing large datasets with Hive on YARN. I also explain how newly added Hive features (e.g. join optimizations, the ORC file format and the upcoming Tez integration) can be used to make the query extremely fast.

  1. A Perfect Hive Query For A Perfect Meeting. Adam Kawa, Data Engineer @ Spotify
  2. A deal was made! ~5:00 AM, June 16th, 2013
  3. ~5:00 AM, June 16th, 2013: ■ End of our Summer Jam party ■ Organized by Martin Lorentzon
  4. Martin Lorentzon: ■ Co-founder and Chairman of Spotify
  5. Adam Kawa: ■ Data Engineer at Spotify
  6. The Deal: Martin will invite Adam and Timbuktu, my favourite Swedish artist, for a beer or coke or whatever to drink* (by Martin). * If Adam somehow proves that he is the biggest fan of Timbuktu in his home country
  7. Timbuktu: ■ Actually, Jason Michael Bosak Diakité ■ A top Swedish rap and reggae artist - My favourite song: http://bit.ly/1htAKhM ■ Does not know anything about the deal!
  8. The Rules. Question: Who is the biggest fan of Timbuktu in Poland?
  9. The Rules. Question: Who is the biggest fan of Timbuktu in Poland? Answer: the person who has streamed Timbuktu's songs on Spotify the most frequently!
  10. Data will tell the truth!
  11. Hive Query: ■ Follow tips and best practices related to - Testing - Sampling - Troubleshooting - Optimizing - Executing
  12. Additional Requirement (by Adam): the Hive query should be faster than my potential meeting with Timbuktu. Why? To re-run the query during our meeting
  13. Friends: ■ YARN on 690 nodes ■ Hive - MRv2 - Join optimizations - ORC - Tez - Vectorization - Table statistics
  14. Introduction: Datasets
  15. USER
      user_id  country  …  date
      u1       PL       …  20140415
      u2       PL       …  20140415
      u3       SE       …  20140415
      TRACK
      track_id  json_data                            …  date
      t1        {"album":{"artistname":"Coldplay"}}  …  20140415
      t2        {"album":{"artistname":"Timbuktu"}}  …  20140415
      t3        {"album":{"artistname":"Timbuktu"}}  …  20140415
      STREAM
      track_id  …  user_id  …  date
      t1        …  u1       …  20130311
      t3        …  u1       …  20130622
      t3        …  u2       …  20131209
      t2        …  u2       …  20140319
      t1        …  u3       …  20140415
  16. STREAM: ■ Information about songs streamed by users ■ 54 columns ■ 25.10 TB of compressed data since Feb 12th, 2013
  17. USER: ■ Information about users ■ 40 columns ■ 10.5 GB of compressed data (2014-04-15)
  18. TRACK: ■ Metadata information about tracks ■ 5 columns - But json_data contains a dozen or so fields ■ 15.4 GB of data (2014-04-15)
  19. Tables' Properties: ✓ Avro and a custom serde ✓ DEFLATE as the compression codec ✓ Partitioned by day and/or hour ✗ Bucketed ✗ Indexed
  20. HiveQL: Query
  21. Query: FROM stream s
  22. Query: FROM stream s JOIN track t ON t.track_id = s.track_id
  23. Query: FROM stream s JOIN track t ON t.track_id = s.track_id JOIN user u ON u.user_id = s.user_id
  24. Query: FROM stream s JOIN track t ON t.track_id = s.track_id JOIN user u ON u.user_id = s.user_id WHERE lower(get_json_object(t.json_data, '$.album.artistname')) = 'timbuktu'
  25. Query: … AND u.country = 'PL'
  26. Query: … AND s.date BETWEEN 20130212 AND 20140415
  27. Query: … AND u.date = 20140415 AND t.date = 20140415
  28. Query: SELECT s.user_id, COUNT(*) AS count … GROUP BY s.user_id
  29. Query:
      SELECT s.user_id, COUNT(*) AS count
      FROM stream s
      JOIN track t ON t.track_id = s.track_id
      JOIN user u ON u.user_id = s.user_id
      WHERE lower(get_json_object(t.json_data, '$.album.artistname')) = 'timbuktu'
        AND u.country = 'PL'
        AND s.date BETWEEN 20130212 AND 20140415
        AND u.date = 20140415 AND t.date = 20140415
      GROUP BY s.user_id
      ORDER BY count DESC
      LIMIT 100;
  30. Query, with an on-slide annotation ("A line where I may have a bug?!") marking one line of the query above
  31. HiveQL: Unit Testing
  32. hive_test: verbose and complex Java code
  33. Testing Hive Queries: ■ A Hive analyst usually is not a Java developer ■ Hive queries are often ad-hoc - No big incentive to unit test
  34. Beetest: ■ Easy unit testing of Hive queries locally - Implemented by me - Available at github.com/kawaa/Beetest
  35. Beetest: ■ A unit test consists of - select.hql - the query to test
  36. Beetest: ■ A unit test consists of - select.hql - the query to test - table.ddl - schemas of the input tables
  37. Beetest: ■ A unit test consists of - select.hql - the query to test - table.ddl - schemas of the input tables - text files with input data
  38. Beetest: ■ A unit test consists of - select.hql - the query to test - table.ddl - schemas of the input tables - text files with input data - expected.txt - the expected output
  39. Beetest: ■ A unit test consists of - select.hql - the query to test - table.ddl - schemas of the input tables - text files with input data - expected.txt - the expected output - (optional) setup.hql - any initialization query (see the sketch below)
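To make this concrete, here is a sketch of what select.hql for this test could look like. It assumes the simplified schemas from table.ddl (dates as plain BIGINT columns, no partitions), so the production partition filters are dropped; the actual file in the Beetest repo may differ:

      -- Hypothetical select.hql: the production query minus the date filters
      SELECT s.user_id, COUNT(*) AS count
      FROM stream s
      JOIN track t ON t.track_id = s.track_id
      JOIN user u ON u.user_id = s.user_id
      WHERE lower(get_json_object(t.json_data, '$.album.artistname')) = 'timbuktu'
        AND u.country = 'PL'
      GROUP BY s.user_id
      ORDER BY count DESC
      LIMIT 100;

On the sample input shown below, this yields u1 2 and u2 1, matching expected.txt.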
  40. table.ddl:
      stream(track_id STRING, user_id STRING, date BIGINT)
      user(user_id STRING, country STRING, date BIGINT)
      track(track_id STRING, json_data STRING, date BIGINT)
  41. For each line, Beetest 1. Creates a table with the given schema - In a local Hive database called beetest
  42. For each line, Beetest 1. Creates a table with the given schema - In a local Hive database called beetest 2. Loads the corresponding <table>.txt file into the table - A list of files can also be specified explicitly
  43. Input Files. track.txt:
      t1 {"album":{"artistname":"Coldplay"}} 20140415
      t2 {"album":{"artistname":"Timbuktu"}} 20140415
      t3 {"album":{"artistname":"Timbuktu"}} 20140415
  44. Input Files. user.txt:
      u1 PL 20140415
      u2 PL 20140415
      u3 SE 20140415
  45. Input Files. stream.txt:
      t1 u1 20140415
      t3 u1 20140415
      t2 u1 20140415
      t3 u2 20140415
      t2 u3 20140415
  46. Expected Output File. expected.txt:
      u1 2
      u2 1
  47. Running Unit Test:
      $ ./run-test.sh timbuktu
  48. Running Unit Test:
      $ ./run-test.sh timbuktu
      INFO: Generated query filename: /tmp/beetest-test-1934426026-query.hql
      INFO: Running: hive --config local-config -f /tmp/beetest-test-1934426026-query.hql
  49. Running Unit Test (continued):
      INFO: Loading data to table beetest.stream …
      INFO: Total MapReduce jobs = 2
      INFO: Job running in-process (local Hadoop) …
  50. Running Unit Test (continued):
      INFO: Execution completed successfully
      INFO: Table beetest.output_1934426026 stats: …
      INFO: Time taken: 34.605 seconds
  51. Running Unit Test (continued):
      INFO: Asserting: timbuktu/expected.txt and /tmp/beetest-test-1934426026-output_1934426026/000000_0
      INFO: Test passed!
  52. Bee test. Be happy!
  53. HiveQL: Sampling
  54. Sampling In Hive: ■ TABLESAMPLE can generate samples of - buckets - HDFS blocks - the first N records from each input split (syntax sketched below)
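For reference, the three variants look roughly like this; the bucket count and percentage are illustrative, not values from the talk:

      -- Bucket sampling: requires a table bucketed on the sampling column
      SELECT * FROM user TABLESAMPLE(BUCKET 1 OUT OF 32 ON user_id) s;
      -- HDFS block sampling: a percentage of the input data
      SELECT * FROM user TABLESAMPLE(0.1 PERCENT) s;
      -- Row sampling: the first N rows from each input split
      SELECT * FROM user TABLESAMPLE(100 ROWS) s;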
  55. TABLESAMPLE Limitations: ✗ Bucket sampling - The table must be bucketed by the given column ✗ HDFS block sampling - Only CombineHiveInputFormat is supported
  56. Rows TABLESAMPLE: SELECT * FROM user TABLESAMPLE(K ROWS); SELECT * FROM track TABLESAMPLE(L ROWS); SELECT * FROM stream TABLESAMPLE(M ROWS); SELECT … stream JOIN track ON … JOIN user ON … ✗ No guarantee that the JOIN returns anything
  57. DataFu And Pig: ■ DataFu + Pig offer interesting sampling algorithms - Weighted Random Sampling - Reservoir Sampling - Consistent Sampling By Key ■ HCatalog can provide integration
  58. Consistent Sampling By Key ('0.01' is the threshold):
      DEFINE SBK datafu.pig.sampling.SampleByKey('0.01');
      stream_s = FILTER stream BY SBK(user_id);
      user_s = FILTER user BY SBK(user_id);
  59. Consistent Sampling By Key (continued):
      stream_user_s = JOIN user_s BY user_id, stream_s BY user_id;
      ✓ Guarantees that the JOIN returns rows after sampling - Assuming that the threshold is high enough
  60. Consistent Sampling By Key (continued):
      user_s_pl = FILTER user_s BY country == 'PL';
      stream_user_s_pl = JOIN user_s_pl BY user_id, stream_s BY user_id;
      ✗ Does not guarantee that Polish users will be included
  61. What's Next? ✗ None of the sampling methods meets all of my requirements. Try and see: ■ Use SampleByKey and 1% of a day's worth of data - Maybe it will be good enough?
  62. Running Query: 1. Finished successfully! 2. Returned 3 Polish users with 4 Timbuktu tracks - My username was not included :(
  63. HiveQL: Understanding The Plan
  64. EXPLAIN: ■ Shows how Hive translates queries into MR jobs - Useful, but tricky to understand
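Getting the plan is just a matter of prefixing the query; a shortened sketch (the WHERE clause is abbreviated here for readability):

      EXPLAIN
      SELECT s.user_id, COUNT(*) AS count
      FROM stream s
      JOIN track t ON t.track_id = s.track_id
      JOIN user u ON u.user_id = s.user_id
      WHERE u.country = 'PL'
      GROUP BY s.user_id;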
  65. EXPLAIN The Query: ■ 330 lines ■ 11 stages - Conditional operators and backup stages - MapReduce jobs, Map-only jobs - MapReduce local work ■ 4 MapReduce jobs - Two of them needed for the JOINs
  66. Optimized Query: 2 MapReduce jobs in total
  67. No Conditional Map Join [HIVE-3784]: runs many map joins in a single Map-only job when hive.auto.convert.join.noconditionaltask.size > filesize(user) + filesize(track)
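A sketch of the settings involved; the threshold value is illustrative, not the one used in the talk:

      SET hive.auto.convert.join = true;                              -- convert joins on small tables into map joins
      SET hive.auto.convert.join.noconditionaltask = true;            -- no conditional task [HIVE-3784]
      SET hive.auto.convert.join.noconditionaltask.size = 100000000;  -- must exceed filesize(user) + filesize(track)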
  68. Map Join Procedure: 1. Local work - Fetch records of the small table(s) from HDFS - Build a hash table - If there is not enough memory, abort
  69. Map Join Procedure (continued): 2. MapReduce driver - Add the hash table to the distributed cache
  70. Map Join Procedure (continued): 3. Map task - Read the hash table from the distributed cache - Stream through, supposedly, the largest table - Match records and write to output
  71. Map Join Procedure (continued): 4. No reduce task
  72. Map Join Cons: ■ The table has to be downloaded from HDFS - A problem if this step takes long ■ The hash table has to be uploaded to HDFS - Replicated via the distributed cache - Fetched by hundreds of nodes
  73. Grouping After Joining: runs as a single MR job [HIVE-3952]
  74. Optimized Query: 2 MapReduce jobs in total
  75. HiveQL: Running At Scale
  76. Day's Worth Of Data:
      2014-05-05 21:21:13 Starting to launch local task to process map join; maximum memory = 2GB
      ^C
      kawaa@sol:~/timbuktu$ date
      Mon May 5 21:50:26 UTC 2014
  77. Too Big Tables: ✗ Dimension tables too big to read quickly from HDFS ✓ Add a preprocessing step - Apply projection and filtering in a Map-only job - Makes the script ugly with intermediate tables (sketched below)
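A minimal sketch of such a preprocessing step, with hypothetical intermediate table names; each CTAS runs as a Map-only job because it only projects and filters:

      CREATE TABLE track_timbuktu AS
      SELECT track_id
      FROM track
      WHERE date = 20140415
        AND lower(get_json_object(json_data, '$.album.artistname')) = 'timbuktu';

      CREATE TABLE user_pl AS
      SELECT user_id
      FROM user
      WHERE date = 20140415 AND country = 'PL';

The main query then map-joins these much smaller tables instead of the full user and track tables.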
  78. Day's Worth Of Data: ✓ 15s to build hash tables! ✗ 8m 23s to run 2 Map-only jobs to prepare the tables - An "investment"
  79. Day's Worth Of Data: ■ Map tasks fail with
      FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
  80. Possible Solution: ■ Increase the JVM heap size ■ Decrease the size of the in-memory map output buffer - mapreduce.task.io.sort.mb
  81. (Temporary) Solution: ■ Increase the JVM heap size ■ Decrease the size of the in-memory map output buffer (mapreduce.task.io.sort.mb) - safe because my query generates a small amount of intermediate data - see the MAP_OUTPUT_* counters
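A sketch of the corresponding settings; the values are illustrative, not the ones used in the talk:

      SET mapreduce.map.memory.mb = 2048;        -- bigger container
      SET mapreduce.map.java.opts = -Xmx1792m;   -- bigger JVM heap inside it
      SET mapreduce.task.io.sort.mb = 64;        -- smaller map output buffer; fine here, little intermediate data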
  82. Second Challenge: ✓ Works! ✗ 2-3x slower than two regular joins
  83. Heavy Garbage Collection! Metrics for the slowest job:
      Metric          Regular Join   Map Join
      Shuffled Bytes  12.5 GB        0.56 MB
      CPU             120 M          331 M
      GC              13 M           258 M
      GC / CPU        0.11           0.78
      Wall-clock      11m 47s        31m 30s
  84. Fixing Heavy GC: ■ Increase the container and heap sizes - 4096 and -Xmx3584m
      Metric                      Regular Join   Map Join
      GC / CPU (the slowest job)  0.11           0.03
      Query CPU                   1d 17m         23h 8m
      Wall-clock                  20m 15s        14m 42s
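In MRv2 property terms those two numbers translate to something like the following; the property names are the standard Hadoop 2 ones, assumed rather than quoted from the slides:

      SET mapreduce.map.memory.mb = 4096;        -- container size
      SET mapreduce.map.java.opts = -Xmx3584m;   -- JVM heap within the container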
  85. Where To Optimize? ■ 1st MapReduce job - Heavy map phase - No GC issues any more - Tweak the input split size - Lightweight shuffle - Lightweight reduce - Use 1-2 reduce tasks
  86. Where To Optimize? ■ 1st MapReduce job (as above) ■ 2nd MapReduce job - Very small - Enable uberization [HIVE-5857] (settings sketched below)
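A sketch of the knobs mentioned above; the split size is illustrative:

      SET mapreduce.input.fileinputformat.split.minsize = 536870912;  -- fatter splits for the heavy map phase
      SET mapreduce.job.reduces = 2;                                  -- a lightweight reduce stage
      SET mapreduce.job.ubertask.enable = true;                       -- run the tiny 2nd job inside the AM (uberization)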
  87. Current Best Time: 2 months of data in 50 min 2 sec. 10th place?
  88. Changes are needed!
  89. File Format: ORC
  90. Observation: ■ The biggest table has 52 "real" columns - Only 2 of them are needed - 2 columns are "partition" columns
  91. My Benefits Of ORC: ✓ Improves query speed - Only the required columns are read - Especially important when reading remotely
  92. ORC Storage Consumption: ✗ Need to convert Avro to ORC ✓ ORC consumes less space - Efficient encoding and compression
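The conversion can be a one-off CTAS; a sketch with a hypothetical target table name (the slides do not show the actual statement):

      CREATE TABLE stream_orc
      STORED AS ORC
      TBLPROPERTIES ('orc.compress' = 'ZLIB')  -- ZLIB favours storage; see slide 96 for the Snappy trade-off
      AS SELECT * FROM stream;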
  93. Storage And Access: 16x
  94. Wall Clock Time: 3.5x
  95. CPU Time: 32x
  96. Possible Improvements: ■ Use Snappy compression - ZLIB was used, which optimizes more for storage than for processing time
  97. Computation: Apache Tez
  98. Thoughts? ■ Tez will probably not help much, because… - Only two MapReduce jobs - Heavy work done in map tasks - Little intermediate data written to HDFS
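For what it is worth, switching engines is a single setting per session, assuming a cluster with Tez installed:

      SET hive.execution.engine = tez;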
  99. Wall Clock Time: 1.4x 2.4x
  100. Wall Clock Time: 8x
  101. Benefits Of Tez: ✓ Natural DAG - No "empty" map task - No need to write intermediate data to HDFS - No job scheduling overhead
  102. Benefits Of Tez: ✓ Dynamic graph reconfiguration - Decrease reduce task parallelism at runtime - Pre-launch reduce tasks at runtime
  103. Benefits Of Tez: ✓ A smart number of map tasks, based on e.g. - Min/max limits on the input split size - Capacity utilization
  104. Container Reuse (timeline diagram)
  105. Container Reuse: the more congested the queue/cluster, the bigger the benefits of reusing
  106. Container Reuse: no scheduling overhead to run a new reduce task
  107. Container Reuse: thinner tasks help avoid stragglers
  108. Container + Objects Reuse: finished within 1.5 sec. Warm!
  109. EXPLAIN: ■ Much easier to follow!
      Map 1 <- Map 4 (BROADCAST_EDGE), Map 5 (BROADCAST_EDGE)
      Reducer 2 <- Map 1 (SIMPLE_EDGE)
      Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
      ■ A dot file for graphviz is also generated
  110. Broadcast Join: ■ An improved version of Map Join ■ The hash table is built in the cluster - Parallelism - Data locality - No local and HDFS write barriers ■ The hash table is broadcast to downstream tasks - Tez containers can cache and reuse it
  111. Broadcast Join: ■ Try to run the query that was interrupted by ^C - On 1 day of data, to introduce overhead
      Metric      Avro MR                        Tez ORC
      Wall-clock  UNKNOWN (^C after 30 minutes)  11m 16s
  112. Memory-Related Benefits: ✓ Automatically rescales the in-memory buffer for intermediate key-value pairs ✓ Reduced in-memory size of map-join hash tables
  113. Lessons Learned: ■ At scale, the Tez Application Master can be busy - DEBUG log level slows everything down - Sometimes it needs more RAM ■ A couple of tools are missing - e.g. counters, Web UI - CLI, log parser and jstat are your friends
  114. Feature: Vectorization
  115. Vectorization: ✓ Processes data in batches of 1024 rows - Instead of one row at a time ✓ Reduces CPU usage ✓ Scales very well, especially with large datasets - Kicks in after a number of function calls ✓ Supported by ORC
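Enabling it is again one setting (it requires an ORC-backed table):

      SET hive.vectorized.execution.enabled = true;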
  116. Powered By Vectorization: 1.4x
  117. Vectorized Map Join: ✗ Not yet fully supported by vectorization ✓ Still less GC ✓ Still better use of memory-to-CPU bandwidth ✗ Cost of switching from vectorized to row mode ✓ It will be revised soon by the community
  118. Lessons Learned: ■ Ensure that vectorization is fully or partially supported for your query - Datatypes - Operators ■ Run your own benchmarks and profilers
  119. Feature: Table Statistics
  120. Table Statistics: ✓ Table, partition and column statistics ✓ Possible to - collect them while loading data - generate them for existing tables - use them to produce an optimal query plan ✓ Supported by ORC (sketch below)
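A sketch of how statistics are collected and used; the column list is illustrative, and the exact ANALYZE syntax varies a little between Hive versions:

      SET hive.stats.autogather = true;                                    -- collect statistics while loading data
      ANALYZE TABLE stream PARTITION(date) COMPUTE STATISTICS;             -- generate for existing partitions
      ANALYZE TABLE user COMPUTE STATISTICS FOR COLUMNS user_id, country;  -- column statistics
      SET hive.compute.query.using.stats = true;                           -- let the planner use them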
  121. Powered By Statistics:
      Period      ORC/Tez               Vectorized ORC/Tez    Vectorized ORC/Tez + Stats
      2 weeks     173 sec, 1.22K maps   225 sec, 1.22K maps   129 sec, 0.9K maps
      6.5 months  618 sec, 13.3K maps   653 sec, 13.3K maps   468 sec, 9.4K maps
      14 months   931 sec, 24.8K maps   676 sec, 24.8K maps   611 sec, 17.9K maps
  122. Current Best Time: 14 months of data in 10 min 11 sec
  123. Results: Challenge
  124. Results! ■ Who is the biggest fan of Timbuktu in Poland?
  125. Results! ■ Who is the biggest fan of Timbuktu in Poland?
  126. That's all!
  127. Summary: ■ Deals at 5 AM might be the best idea ever ■ Hive can be a single solution for all SQL queries - Both interactive and batch - Regardless of input dataset size ■ Tez is really MapReduce v2 - Very flexible and smart - Improves what was inefficient in MapReduce - Easy to deploy
  128. Special Thanks: ■ Gopal Vijayaraghavan - Technical answers to Tez-related questions - Installation scripts: github.com/t3rmin4t0r/tez-autobuild ■ My colleagues for technical review and feedback - Piotr Krewski, Rafal Wojdyla, Uldis Barbans, Magnus Runesson
  129. Questions?
  130. Thank you!