
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

  1. FASTER, FASTER, FASTER: THE TRUE STORY OF A MOBILE ANALYTICS DATA MART ON HIVE
     Mithun Radhakrishnan, Josh Walters
  2. ABOUT MYSELF
     • Mithun Radhakrishnan
     • Hive Engineer at Yahoo
     • Hive Committer
     • Has an irrational fear of spider monkeys
     • mithun@apache.org
     • @mithunrk
  3. RECAP
  4. 2015 Hadoop Summit, San Jose, California
  5. From: The [REDACTED] ETL team
     To: The Yahoo Hive Team
     Subject: A small matter of size...
     Dear YHive team,
     We have partitioned our table using the following 6 partition keys: {hourly-timestamp, name, property, geo-location, shoe-size, and so on…}. For a given timestamp, the combined cardinality of the remaining partition keys is about 10,000/hr. If queries on partitioned tables are supposed to be faster, how come queries on our table take forever just to get off the ground?
     Yours gigantically,
     Project [REDACTED]
  6. ABOUT ME
     • Josh Walters
     • Data Engineer at Yahoo
     • I build lots of data pipelines
     • Can eat a whole plate of deep-fried cookie dough
     • http://joshwalters.com
     • @joshwalters
  7. WHAT IS THE CUSTOMER NEED?
     • Faster ETL
     • Faster queries
     • Faster ramp-up
  8. CASE STUDY: MOBILE DATA MART
     • Mobile app usage data
     • Optimize performance
     • Interactive analytics
  9. LOW-HANGING FRUIT
     • Tez, Tez, Tez!
     • Vectorized query execution
     • Map-side aggregations
     • Auto-convert map join
  10. DATA PARTITIONING
     • Want thousands of partitions
     • Deep data partitioning
     • Difficult to do at scale
  11. DEEP PARTITIONING
     • Greatly helps with compression
     • 2015 XLDB talk on the methods used
     • http://www.youtube.com/watch?v=P-vrzYYdfL8
     • http://www-conf.slac.stanford.edu/xldb2015/lightning_abstracts.asp
  12. SOLID-STATE DRIVES
     • Didn't really help
     • Ended up CPU-bound
     • Regular drives are fine
  13. ORC!
     • Used in the largest data systems
     • 90% boost on sorted columns
     • 30x compression versus raw text
     • Fits well with our tech stack
  14. SKETCH ALL THE THINGS
     • Very accurate
     • Can store sketches in Hive
     • Union, intersection, difference
     • 75% boost on relevant queries
  15. SKETCH ALL THE THINGS
     SELECT COUNT(DISTINCT id) FROM DB.TABLE WHERE ...;   -- ~100 seconds
     SELECT estimate(sketch(id)) FROM DB.TABLE WHERE ...; -- ~25 seconds
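The `sketch`/`estimate` UDFs on the slide are Yahoo's DataSketches; the speedup comes from replacing an exact distinct count with a small fixed-size summary. As an illustration only, here is a minimal K-minimum-values (KMV) sketch in Python — the function names and default `k` are mine, not the DataSketches API:

```python
import hashlib

def h(value: str) -> float:
    """Hash a value to a float roughly uniform in (0, 1)."""
    digest = hashlib.sha1(value.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv_estimate(values, k=1024):
    """K-minimum-values cardinality estimate.

    Keep only the k smallest distinct hash values; if n distinct items
    were hashed, the k-th smallest hash is close to k / n, so
    (k - 1) / h_(k) estimates n. Error shrinks as k grows.
    """
    hashes = sorted({h(v) for v in values})
    if len(hashes) < k:
        return len(hashes)  # fewer than k distinct values seen: exact count
    return int((k - 1) / hashes[k - 1])

ids = [f"user-{i % 50_000}" for i in range(200_000)]  # 50,000 distinct ids
print(kmv_estimate(ids))  # approximately 50000
```

Hive's exact `COUNT(DISTINCT ...)` has to shuffle every distinct id; the sketch is a few kilobytes regardless of cardinality, which is also why sketches can be stored in a table and merged (union/intersection/difference) later.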
  16. SKETCH ALL THE THINGS
      Standard Deviation    1      2      3
      Confidence Interval   68%    95%    99%
      K = 16                25%    51%    77%
      K = 512               4%     8%     13%
      K = 4096              1%     3%     4%
      K = 16384             < 1%   1%     2%
  17. MORE SKETCH INFO
     • "Summarization, Approx. and Sampling: Tradeoffs for Improving Query", Hadoop Summit, 2015
     • http://datasketches.github.io
  18. ADVANCED QUERIES
     • Desire for complex queries
     • Retention, funnels, etc.
     • A lot can be done with UDFs
  19. FUNNEL ANALYSIS
     • Complex to write, difficult to reuse
     • Slow, requires multiple joins
     • Using UDFs, now runs in seconds, not hours
     • https://github.com/yahoo/hive-funnel-udf
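The reason a funnel UDF like yahoo/hive-funnel-udf beats plain SQL is that it replaces one self-join per funnel stage with a single ordered pass over each user's events. A toy Python version of that pass (names are illustrative, not the UDF's actual API):

```python
def funnel(events, steps):
    """Return how many funnel steps a user completed, in order.

    events: the user's event names, already sorted by timestamp
    steps:  the ordered funnel stages to match
    A later stage only counts once all earlier stages have occurred.
    """
    completed = 0
    for event in events:
        if completed < len(steps) and event == steps[completed]:
            completed += 1
    return completed

steps = ["home", "search", "cart", "checkout"]
print(funnel(["home", "search", "checkout"], steps))          # 2: "cart" was skipped
print(funnel(["home", "search", "cart", "checkout"], steps))  # 4: full funnel
```

Grouping events per user (one reduce) and running this pass is linear in the number of events, versus the multi-join plan whose cost grows with every stage added — hence seconds instead of hours.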
  20. REALLY FAST OLAP
     • OLAP-type queries are the most common
     • Aggregate-only queries: group, count, sum, …
     • Can we optimize for such queries?
  21. OLAP WITH DRUID
     • Interactive, sub-second latency
     • Ingest raw records, then aggregate
     • Open source, actively developed
     • http://druid.io
  22. BI TOOL
     • Many options
     • None covers all of our needs
     • Need graphs and dashboards
  23. CARAVEL
     • Hive, Druid, Redshift, MySQL, …
     • Simple query construction
     • Open source, actively developed
     • https://github.com/airbnb/caravel
  24. WHAT WE LEARNED
     • Product teams need custom data marts
     • Complex to build and run
     • Teams just want to focus on business logic
  25. DATA MART IN A BOX!
     • Generalized ETL pipeline
     • Easy to spin up
     • Automatic continuous delivery
     • Just give us a query!
  26. DATA MART ARCHITECTURE
  27. INFRASTRUCTURE WORK
     • We didn't do this alone
     • Partners on the grid team fixed many pain points
  28. Y!HIVE
  29. HIVE ON TEZ: INTERACTIVE QUERIES IN SHARED CLUSTERS
     (charts: dedicated-queue metrics vs. shared-cluster metrics)
  31. INCREASED CONFIGURABILITY OR INCREASED COMPLEXITY?
     (chart: configuration lines of code per release, Hive 0.2 through Hive master)
  32. PERFORMANCE TUNING
     • Out of the box:
       • Tez container reuse: set tez.am.container.reuse.enabled=true;
       • Tez speculative execution: set tez.am.speculation.enabled=true;
       • Reduce-side vectorization:
         set hive.vectorized.execution.reduce.enabled=true;
         set hive.vectorized.execution.reduce.groupby.enabled=true;
  33. PERFORMANCE TUNING
     • Understand your data:
       • Use ORC's index-based filtering: set hive.optimize.index.filter=true;
       • Bloom filters:
         ALTER TABLE my_orc SET TBLPROPERTIES("orc.bloom.filter.columns"="foo,bar");
       • Cardinality?
       • Sort on the filter column
       • Trade-offs: parallelism vs. filtering
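A Bloom filter answers "might this value be present?" with no false negatives, which is why an ORC reader can safely skip any stripe whose filter says no for the predicate value. A minimal sketch of the idea in Python — sizes and the hashing scheme are arbitrary choices for illustration, not ORC's actual implementation:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: m-bit field, k hash positions per value."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0  # the m-bit field, kept as a Python int

    def _positions(self, value):
        # Derive k positions by salting the hash with the probe index.
        for i in range(self.k):
            d = hashlib.sha1(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, value):
        for p in self._positions(value):
            self.bits |= 1 << p

    def might_contain(self, value):
        # False means definitely absent; True means "maybe present".
        return all((self.bits >> p) & 1 for p in self._positions(value))

bf = BloomFilter()
bf.add("foo")
bf.add("bar")
print(bf.might_contain("foo"))   # True
print(bf.might_contain("quux"))  # almost certainly False; false positives are rare, false negatives impossible
```

This also explains the slide's trade-off note: the sorted-column and filtering tricks only pay off when the filter column is selective enough that whole stripes actually get skipped.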
  34. PERFORMANCE TUNING
     • Understand your queries:
       • Prefer LIKE and INSTR over REGEXP*
       • Compile-time date/time functions:
         • current_date()
         • current_timestamp()
       • Queries generated from UI tools
  35. PERFORMANCE IMPROVEMENTS - ORC
     • Index-based filtering available to Pig/MR users
       • HCatLoader, HCatInputFormat
     • Split-calculation improvements
       • Block-based for BI, parallel for ETL
     • Disabled dictionaries for complex data types
       • OOMs
  36. PERFORMANCE IMPROVEMENTS - JOINS
     • Skew joins
       • Already solved for Pig
       • Hive for ETL
       • Current Hive solution: explicit values (wishful thinking)
       • Poisson sampling
     • Faster sorted-merge joins
     • Wide tables
       • SpillableRowContainers
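One way to read the sampling bullet above: instead of asking users to list skewed keys explicitly, sample the join input and flag any key that dominates the sample for special handling (e.g. broadcasting or splitting its rows). A simplified Bernoulli-sampling sketch in Python, standing in for the Poisson sampling the slide mentions — the threshold, rate, and names are invented for illustration:

```python
import random
from collections import Counter

def skewed_keys(rows, key, sample_rate=0.01, threshold=0.1, seed=42):
    """Flag join keys that account for more than `threshold` of a sample.

    rows:        the join input (any iterable)
    key:         function extracting the join key from a row
    sample_rate: probability of keeping each row in the sample
    Returns the set of keys that look skewed and deserve a
    special-cased plan instead of the regular shuffle join.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    sample = Counter(key(r) for r in rows if rng.random() < sample_rate)
    total = sum(sample.values())
    return {k for k, c in sample.items() if total and c / total > threshold}

rows = ["hot_key"] * 50_000 + [f"key-{i}" for i in range(50_000)]
print(skewed_keys(rows, key=lambda r: r))  # {'hot_key'}
```

The point of sampling at plan time is that the skewed keys are discovered from the data itself, so the "explicit values (wishful thinking)" step disappears.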
  37. PERFORMANCE IMPROVEMENTS - VARIOUS SUNDRIES
     • Improvements for data discovery
       • HCatClient-based users: Oozie, GDM
       • 10x improvement
     • Fetch Operator improvements:
       • SELECT * FROM partitioned_table LIMIT 100;
       • Lazy-load partitions
  38. PERFORMANCE IMPROVEMENTS: HIVE'S AVROSERDE
     • The Avro format is popular
       • Self-describing, flexible, generic, quirky
     • Intermediate stages in pipelines
     • Development
  39. "There is no mature, no stable. The only constant is change...
      ... [Our] work on feeds often involves new columns, several times a day."
  41. THE PROBLEM
     • AvroSerDe needs the read-schema at job runtime (i.e., map-side)
       • Stored on HDFS
     • ETL jobs need 10-20K maps
       • Replication factor
       • Data-node outage
     • It gets steadily worse:
       • Block replication on node loss
       • Task-attempt retry
       • More nodes lost
       • Rinse and repeat
  43. THE SOLUTION
     • Reconcile the metastore schema against the read-schema?
       • toAvroSchema( fromAvroSchema( avroSchema )) != avroSchema
     • Store the schema in TBLPROPERTIES?
     • Cache the read-schema during SerDe::initialize()
       • Once per map task
     • Prefetch the read-schema at the query-planning phase
       • Once per job
       • Separate optimizer
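The "once per map task" caching can be pictured as plain memoization wrapped around the HDFS schema fetch; the task no longer re-reads the schema file for every record or retry, so the schema file's data nodes stop being a hotspot. A toy Python sketch (the path and return value are made up, and `lru_cache` stands in for the SerDe's cache):

```python
from functools import lru_cache

FETCHES = 0  # counts actual trips to storage

@lru_cache(maxsize=None)
def read_schema(hdfs_path: str) -> str:
    """Stand-in for reading an Avro read-schema file off HDFS."""
    global FETCHES
    FETCHES += 1
    return f"schema-for:{hdfs_path}"

# Without the cache, every deserialization asks storage for the schema;
# with it, each process fetches once per distinct path.
for _ in range(10_000):
    read_schema("/schemas/events.avsc")
print(FETCHES)  # 1
```

Prefetching at query-planning time goes one step further: the fetch happens once per job on the client, and the schema ships to the tasks with the job configuration.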
  44. WE'RE NOT DONE YET
     • Row-oriented format
     • Skew join
     • Stats storage
  45. THANKS
     • Team effort
     • Chris Drome, Selina Zhang, Michael Natkovich, Olga Natkovich, Sameer Raheja, Ravi Sankurati
  46. Q&A
