FASTER, FASTER, FASTER:
THE TRUE STORY OF A
MOBILE ANALYTICS DATA
MART ON HIVE
Mithun Radhakrishnan
Josh Walters
3
• Mithun Radhakrishnan
• Hive Engineer at Yahoo
• Hive Committer
• Has an irrational fear of spider monkeys
• mithun@apache.org
• @mithunrk
About myself
4
RECAP
5
2015 Hadoop Summit, San Jose, California
6
From: The [REDACTED] ETL team
To: The Yahoo Hive Team
Subject: A small matter of size...
Dear YHive team,
We have partitioned our table using the following 6 partition keys:
{hourly-timestamp, name, property, geo-location, shoe-size, and so on…}.
For a given timestamp, the combined cardinality of the remaining
partition-keys is about 10000/hr.
If queries on partitioned tables are supposed to be faster, how come
queries on our table take forever just to get off the ground?
Yours gigantically,
Project [REDACTED]
7
ABOUT ME
• Josh Walters
• Data Engineer at Yahoo
• I build lots of data pipelines
• Can eat a whole plate of deep fried cookie dough
• http://joshwalters.com
• @joshwalters
8
WHAT IS THE CUSTOMER NEED?
• Faster ETL
• Faster queries
• Faster ramp up
9
CASE STUDY: MOBILE DATA MART
• Mobile app usage data
• Optimize performance
• Interactive analytics
10
LOW HANGING FRUIT
• Tez Tez Tez!
• Vectorized query execution
• Map-side aggregations
• Auto-convert map join
11
DATA PARTITIONING
• Want thousands of partitions
• Deep data partitioning
• Difficult to do at scale
12
DEEP PARTITIONING
• Greatly helps with compression
• 2015 XLDB talk on methods used
• https://youtu.be/P-vrzYYdfL8
• http://www-conf.slac.stanford.edu/xldb2015/lightning_abstracts.asp
13
SOLID STATE DRIVES
• Didn’t really help
• Ended up CPU bound
• Regular drives are fine
14
ORC!
• Used in largest data systems
• 90% boost on sorted columns
• 30x compression versus raw text
• Fits well with our tech stack
15
SKETCH ALL THE THINGS
• Very accurate
• Can store sketches in Hive
• Union, intersection, difference
• 75% boost on relevant queries
16
SKETCH ALL THE THINGS
SELECT COUNT(DISTINCT id)
FROM DB.TABLE
WHERE ...; -- ~100 seconds
SELECT estimate(sketch(id))
FROM DB.TABLE
WHERE ...; -- ~25 seconds
17
SKETCH ALL THE THINGS
Standard deviations       1      2      3
Confidence interval     68%    95%    99%
K = 16                  25%    51%    77%
K = 512                  4%     8%    13%
K = 4096                 1%     3%     4%
K = 16384              < 1%     1%     2%
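The error table above can be made concrete with a toy sketch. The following is a minimal K-Minimum-Values (KMV) distinct-count sketch in Python, an illustration of the idea rather than the DataSketches library used at Yahoo: the estimator keeps only the k smallest normalized hash values, and its relative error shrinks roughly as 1/√k, which is why larger K tightens the intervals above.

```python
import bisect
import hashlib

class KMVSketch:
    """Toy K-Minimum-Values distinct-count sketch (illustrative only)."""

    def __init__(self, k):
        self.k = k
        self.mins = []  # sorted: the k smallest normalized hashes seen

    def _hash(self, item):
        # Map each item to a pseudo-uniform value in [0, 1).
        digest = hashlib.sha1(str(item).encode("utf-8")).hexdigest()
        return int(digest[:15], 16) / float(16 ** 15)

    def update(self, item):
        v = self._hash(item)
        if v in self.mins:
            return  # same hash already kept; distinct count unchanged
        if len(self.mins) < self.k:
            bisect.insort(self.mins, v)
        elif v < self.mins[-1]:
            bisect.insort(self.mins, v)
            self.mins.pop()

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # saw fewer than k values: exact
        # kth smallest of n uniform hashes ~ k/(n+1)  =>  n ~ (k-1)/kth_value
        return (self.k - 1) / self.mins[-1]

    def union(self, other):
        """Set union: keep the k smallest hash values across both sketches."""
        merged = KMVSketch(min(self.k, other.k))
        merged.mins = sorted(set(self.mins + other.mins))[: merged.k]
        return merged
```

Against 100,000 distinct ids with k = 1024 the estimate should land within a few percent of the truth, in line with the K = 512 and K = 4096 rows of the table; the `union` method is what makes sketches composable across partitions.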
18
MORE SKETCH INFO
• Summarization, Approx. and Sampling: Tradeoffs for Improving Query, Hadoop Summit, 2015
• http://datasketches.github.io
19
ADVANCED QUERIES
• Desire for complex queries
• Retention, funnels, etc.
• A lot can be done with UDFs
20
FUNNEL ANALYSIS
• Complex to write, difficult to reuse
• Slow, requires multiple joins
• Using UDFs, now runs in seconds, not hours
• https://github.com/yahoo/hive-funnel-udf
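As a sketch of what such a funnel UDF computes (an illustration in Python, not the actual yahoo/hive-funnel-udf code): for each user, walk their time-ordered events and record the deepest funnel stage reached, so the whole funnel resolves in a single pass instead of a chain of self-joins.

```python
def funnel_counts(user_events, steps):
    """Count, per funnel step, how many users completed steps[0..i] in order.

    user_events: dict mapping user id -> list of events, time-ordered.
    steps: the funnel stages, e.g. ["visit", "signup", "submit"].
    """
    counts = [0] * len(steps)
    for events in user_events.values():
        stage = 0
        for event in events:  # events are assumed already time-ordered
            if stage < len(steps) and event == steps[stage]:
                stage += 1  # user advanced one stage deeper
        for i in range(stage):
            counts[i] += 1
    return counts
```

Each stage's count is necessarily no larger than the previous one, which is the shape a funnel report expects.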
21
REALLY FAST OLAP
• OLAP type queries are the most common
• Aggregate only queries: group, count, sum, …
• Can we optimize for such queries?
22
OLAP WITH DRUID
• Interactive, sub-second latency
• Ingest raw records, then aggregate
• Open source, actively developed
• http://druid.io
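The ingest-then-aggregate step can be pictured with a small Python sketch (a toy model of Druid-style rollup, not Druid's API): raw events collapse into one row per dimension combination, which is why aggregate-only queries come back with sub-second latency.

```python
def rollup(events, dims, metric):
    """Collapse raw events into (count, sum) per dimension combination."""
    agg = {}
    for event in events:
        key = tuple(event[d] for d in dims)
        count, total = agg.get(key, (0, 0))
        agg[key] = (count + 1, total + event[metric])
    return agg
```

Queries over the rolled-up table touch one row per dimension combination rather than one row per raw event.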
23
BI TOOL
• Many options
• Don’t cover all needs
• Need graphs and dashboards
24
CARAVEL
• Hive, Druid, Redshift, MySQL, …
• Simple query construction
• Open source, actively developed
• https://github.com/airbnb/caravel
25
WHAT WE LEARNED
• Product teams need custom data marts
• Complex to build and run
• Just want to focus on business logic
26
DATA MART IN A BOX!
• Generalized ETL pipeline
• Easy to spin-up
• Automatic continuous delivery
• Just give us a query!
27
DATA MART ARCHITECTURE
28
INFRASTRUCTURE WORK
• We didn’t do this alone
• Partners in grid team fixed many pain points
Y!HIVE
30
[Charts: Dedicated Queue Metrics; Shared Cluster Metrics]
Hive on Tez - Interactive Queries in Shared Clusters
31
32
[Chart: lines of code (LOC) in HiveConf.java across Hive releases, from Hive 0.2 through Hive master. Caption: "Increased Configurability or Increased Complexity?"]
33
• Out of the box:
• Tez container reuse
• set tez.am.container.reuse.enabled=true;
• Tez speculative execution
• set tez.am.speculation.enabled=true;
• Reduce-side vectorization
• set hive.vectorized.execution.reduce.enabled=true;
• set hive.vectorized.execution.reduce.groupby.enabled=true;
Performance Tuning
34
• Understand your data:
• Use ORC’s index-based filtering:
• set hive.optimize.index.filter=true;
• Bloom filters
• ALTER TABLE my_orc SET TBLPROPERTIES('orc.bloom.filter.columns'='foo,bar');
• Cardinality?
• Sort on filter-column
• Trade-offs: Parallelism vs. filtering
Performance Tuning
35
• Understand your queries:
• Prefer LIKE and INSTR over REGEXP*
• Compile-time date/time functions:
• current_date()
• current_timestamp()
• Queries generated from UI tools
Performance Tuning
36
• Index-based filtering available to Pig / MR users
• HCatLoader, HCatInputFormat
• Split-calculation improvements
• Block-based BI
• Parallel ETL
• Disabled dictionaries for Complex data types
• OOMs
Performance Improvements - ORC
37
• Skew Joins
• Already solved for Pig
• Hive for ETL
• Current Hive solution: Explicit values. (Wishful thinking)
• Poisson sampling
• Faster sorted-merge joins
• Wide-tables
• SpillableRowContainers
Performance Improvements - Joins
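One way a skew-join planner can spot heavy hitters is sketched below in Python (an illustrative sample-based detector, not the actual Hive patch, which uses Poisson sampling): sample the join keys, flag any key whose estimated share of rows crosses a threshold, and route those keys through a separate plan while the rest take the normal shuffle join.

```python
import random

def find_skewed_keys(keys, sample_rate, threshold, seed=42):
    """Estimate heavy-hitter join keys from a random sample of the input."""
    rng = random.Random(seed)
    sample = [k for k in keys if rng.random() < sample_rate]
    if not sample:
        return set()
    counts = {}
    for k in sample:
        counts[k] = counts.get(k, 0) + 1
    # Flag keys whose estimated share of all rows exceeds the threshold.
    return {k for k, c in counts.items() if c / len(sample) > threshold}
```

Sampling keeps the detection pass cheap: the full key stream is never counted, yet a key holding 90% of the rows is flagged with near certainty.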
38
• Improvements for data-discovery
• HCatClient-based users
• Oozie, GDM
• 10x improvement
• Fetch Operator improvements:
• SELECT * FROM partitioned_table LIMIT 100;
• Lazy-load partitions
Performance Improvements – Various Sundries
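The fetch-operator improvement can be pictured with a short Python sketch (illustrative, not Hive's actual FetchOperator code): load partitions lazily and stop as soon as the LIMIT is satisfied, instead of materializing every partition up front.

```python
def fetch_with_limit(partitions, load, limit):
    """Pull rows partition by partition, stopping once `limit` rows are out.

    `load` is a hypothetical callback that reads one partition's rows;
    partitions past the point where LIMIT is satisfied are never touched.
    """
    rows = []
    for part in partitions:
        for row in load(part):
            rows.append(row)
            if len(rows) >= limit:
                return rows
    return rows
```

For `SELECT * FROM partitioned_table LIMIT 100`, a table with thousands of partitions only pays for the first few.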
39
• Avro Format is popular
• Self describing
• Flexible
• Generic
• Quirky
• Intermediate stages in pipelines
• Development
Performance Improvements: Hive’s AvroSerDe
40
“There is no mature, no stable. The only constant is change…
... [Our] work on feeds often involves new columns, several times a day.”
41
42
• AvroSerDe needs read-schema at job-runtime (i.e. map-side)
• Stored on HDFS
• ETL Jobs need 10-20K maps
• Replication factor
• Data-node outage
• It gets steadily worse
• Block-replication on node-loss
• Task attempt retry
• More nodes lost
• Rinse and repeat
The Problem
43
44
• Reconcile metastore-schema against read-schema?
• toAvroSchema( fromAvroSchema( avroSchema )) != avroSchema
• Store schema in TBLPROPERTIES?
• Cache read-schema during SerDe::initialize()
• Once per map-task
• Prefetch read-schema at query-planning phase
• Once per job
• Separate optimizer
The Solution
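The caching step can be sketched in Python (a hypothetical stand-in: `fetch` represents the expensive HDFS read, and the real fix lives in Hive's AvroSerDe): the read-schema is fetched at most once per task during initialization, so 10-20K map tasks stop hammering the few datanodes holding the schema file's replicas.

```python
_SCHEMA_CACHE = {}

def get_read_schema(path, fetch):
    """Return the Avro read-schema for `path`, fetching it at most once
    per task and serving every later call from the in-memory cache."""
    if path not in _SCHEMA_CACHE:
        _SCHEMA_CACHE[path] = fetch(path)
    return _SCHEMA_CACHE[path]
```

Prefetching at query-planning time takes the same idea one step further: one read per job rather than one per task.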
45
• Row-oriented format
• Skew-join
• Stats storage
We’re not done yet
46
• Team effort
• Chris Drome
• Selina Zhang
• Michael Natkovich
• Olga Natkovich
• Sameer Raheja
• Ravi Sankurati
Thanks
Q&A
Editor's Notes

  • #5 The story so far…
  • #6 At the last Summit talk, we presented the unique challenge that Yahoo’s large metadata poses to Hive’s Metadata storage.
  • #7 This was the email sent by the user producing the largest, most popular dataset in YGrid. Several TB/hr, 1000s of partitions per hour. This table is our largest. We use this to test and break our system.
  • #9 Customers always want data faster Everyone wants data ETL’ed faster Analysts and product owners want faster queries Users need to be able to ramp up quickly and be able to use the data
  • #10 Mobile app data: swipes, clicks, usage time, etc Query at the speed of thought Analysts needs results now, not in 3 hours
  • #11 Tez provided huge benefits to our jobs, with massive performance improvements, though it is not yet used for all jobs at Yahoo. Vectorized execution is easy to enable: it performs transformations on batches of records, greatly increasing performance. Map-side aggregations can limit the work done in the reduce stage by performing some of the transformations in the map stage. With auto-convert join, you don't have to provide hints in the query (which few users do), and it can speed up lots of join queries. Makes things faster, but still not good enough.
  • #12 More partitions, more control over reads, smaller data read, faster queries! Really deep partitioning means multiple nested levels of partitions. But too many partitions bring other problems: too many part files cluttering up your namespace, and reducers having to hold handles open to many part files, causing a slowdown. It can also cause a lot of problems for HCatalog, which needed optimization. We would like thousands of partitions.
  • #13 Deep partitions group similar data together, helping with compression We gave a talk on this at Stanford’s XLDB conference, if you want more info watch the video!
  • #14 Next we wanted to see if we swapped our cluster to use SSDs, would that help with Hive performance? Not much improvement, our jobs were mostly CPU bound. Does it make sense for intermediate task attempt output to be stored on solid state drives?
  • #15 Our audience data pipeline processes about 200 billion events a day. This comes out to roughly 400 Terabytes of uncompressed data a day. This compresses down to 15 Terabytes of data a day with ORC. We have to store this data for 18 months, so you can see where compression can be really important to us. Our users may also want to run queries over that whole time period, so our file format must be efficient enough to handle that
  • #16 Sketches, or streaming algorithms, provide some useful features for very large datasets Queries like distinct count are very common for analysts, and can be quite slow Sketches can perform these queries in a single pass, with minimal memory usage These sketches can be used to do distinct counts, but they can also be used in unions, intersections, and differences We observed a 75% speed boost on relevant queries
  • #19 Information about Sketches has been presented before The code is open source, and there are UDFs for Hive and Pig
  • #20 Users occasionally want to run very complex queries that would be too difficult to write in Pig or Hive One of the most common for our users was funnel analysis. In these instances, UDFs can provide a lot of help to our data users
  • #21 Funnel analysis is used to measure how users are flowing through a series of actions For example: How many people go to the signup page? How many of those people complete the information? How many of them then submit the information? Each stage should have the same or fewer users Usually you would have to do multiple selects and joins to get this to work The query can become very large and unwieldy We came up with a simple UDF to perform this whole process in a single map reduce job, greatly simplifying and speeding up the process This UDF is open source, feel free to contribute!
  • #22 Analyst queries can commonly be answered by an OLAP system Can these queries run with sub-second latencies? Aggregate only, no single record results
  • #23 Really fast, useable, interactive queries Don’t have to do anything special to the data, Druid ingests the records raw and then aggregates Open source system, lots of contributors, very actively developed
  • #24 We began a search for a user interface to sit on top of these data marts There are many options, but they don’t cover all our needs: support for many database systems, open source, actively developed, and so on Dashboards was one of the most important features we were looking for
  • #25 Caravel, out of AirBnb, was what we decided to go with Has support for Druid and any system that has a SQLAlchemy connector (which is just about everything) The project is very active, and we are contributing to it
  • #26 This mobile data mart wasn't the only one of its kind at Yahoo. We had many different teams trying to build similar systems, so we decided it would be a good idea to build a data mart framework for other teams to use. Data marts are a slice of a data warehouse: a small projection and transformation for a specific business unit, covering that unit's use cases. Analysts, marketing, and sales teams may not know Oozie, how to set up continuous delivery for data pipelines, or other data-pipeline best practices. Could they just provide some ETL logic, and magically build a data mart pipeline?
  • #27 A data pipeline framework Fast to spin up (less than an hour) Only need a Hive ETL query Comes with continuous delivery, windowed aggregates, low latency OLAP processing, and a business intelligence UI Low latency OLAP? How?
  • #28 Simple architecture, by keeping it general it is able to cover many different use cases Features such as windowed aggregates and Druid can be easily removed if not needed Can be made even more real-time by using a lower time granularity in the initial ETL step We have successfully used 10 minute granularity, resulting in an almost real-time data system
  • #31 The example project runs on a busy 4,000-node cluster. A dedicated queue is configured to control query concurrency. For the example use case, 50% of queries run in under 0.5 minutes, and 75% run in under 5 minutes. Y! does not have the luxury of dedicated, underutilized clusters purely for interactive use. HDFS bandwidth, disk bandwidth, and network bandwidth are all shared, even if the YARN queue is different. To get performance in a busy, multi-tenant system, you have to tune the cr@p out of Hive, the container sizes, and the queries.
  • #32 Tuning Hive can be daunting. There are several knobs, switches and turny-things.
  • #33 Graph between HiveConf.java’s LOC in different versions. HiveConf.java is huge now. Configuration can be tricky. It’s almost as if every Hive JIRA introduces a new config parameter, just for jolly. ;]
  • #34 Here are some settings that you should be enabling out of the box. Container reuse: Useful not just to amortize the cost of container spin-up, but also to place task output closer to the next stage. Speculative execution: Same as in MR. Slow task-attempts can be worked around. Reduce-side vectorization: “Only” 10-30% improvements.
  • #35 Index-based filtering, aka PPD. ORC files are split into stripes, with several row-groups per stripe. Each stripe has rows stored in columnar fashion, plus column statistics, including max/min values per column. Index-based filtering skips a row-group if your query predicate's value doesn't fall within the min/max limits for that row-group. Simple, right? 1.2 now has Bloom filters, and you can choose your columns. Greater likelihood of false positives if cardinality is large? Yep. That's why Bloom-filter info is stored per row-group (i.e. 10K rows). Sorting on a column has trade-offs: similar column values being contiguous helps compression/encoding and skips more rows together, but you could end up with a few tasks holding all the data to be processed.
  • #36 REGEXP is generic, and will perform worse than LIKE and INSTR. Prefer the latter, if you don’t absolutely require REGEXP. 1.2 has compile-time date/time functions. At query-build! As opposed to once per row. Using BI/UI tools? Look closely at the generated queries. Might be using REGEXP, unix_timestamp(), etc. Tableau used to use “SELECT * FROM your_table LIMIT 0; “ to discover metadata.
  • #37 Column-projection pushdown was available in Pig through Hcat for some time. Now, PPD as well. We’ve improved split-calculation: Block-based BI: 1 split per block! (Checked in independently in Apache.) (Thanks, Prashanth!) ETL: Not usable at large scale, in current form. Better memory-usage when writing complex types in ORC, by disabling dictionaries (just for complex types).
  • #38 Skew-joins are available in Pig. The need for it is apparently specific to Y!. Current approach in Hive is a little clunky. We have a fix coming. Better memory usage with SpillableRowContainers, especially for wide-tables.
  • #39 Data-discovery APIs are now faster. Good news for Oozie-based pipelines. Fetch Operator Optimizations have been… optimized. Batched partition-loading.
  • #40 Self describing (Inline write-schema) Flexible INT-> STRING -> struct{ INT, STRING, … } Generic Custom read-schema to span the data-evolution interval Unions Quirky Self referential
  • #41 The loyalty to a data-format can approach fundamentalist proportions, as illustrated by this Y!Hive user, who was asked to consider ORC format for when his column-schema matures.
  • #42 Whatchusay?
  • #43 At scale, reading from a single schema-file on HDFS can be detrimental.
  • #44 This has gotten entirely too silly.
  • #45 Eugene O’Neill: “There is no present or future… Only the past happening over and over again, now. “ Schema stored on disk. Statistics/histograms stored alongside data.
  • #46 (Bucky Lasek at the X-Games in 2001. Notice where he’s looking… Not at the camera, but setting up his next trick.) We’re still looking for a good row-major format. Avro might still work, if we fix AvroSerDe. Skew-join is being hardened. Will release on JIRA shortly. We’re looking at stats-storage. Column-stats are required for CBO. Can be a problem at 100Ks of partitions.
  • #48 Come at me, bro!