Hive and Apache Tez: Benchmarked at Yahoo! Scale

DataWorks Summit
Jun. 17, 2014
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
Hive and Apache Tez: Benchmarked at Yahoo! Scale
1 of 33

More Related Content

What's hot

Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
Spark SQLSpark SQL
Spark SQLJoud Khattab
ETL big data with apache hadoopETL big data with apache hadoop
ETL big data with apache hadoopMaulik Thaker
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi

Similar to Hive and Apache Tez: Benchmarked at Yahoo! Scale

June 2014 HUG : Hive On Tez - Benchmarked at Yahoo ScaleJune 2014 HUG : Hive On Tez - Benchmarked at Yahoo Scale
June 2014 HUG : Hive On Tez - Benchmarked at Yahoo ScaleYahoo Developer Network
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesDataWorks Summit
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesHadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Hadoop Summit 2015: Hive at Yahoo: Letters from the TrenchesMithun Radhakrishnan
Hive et Hadoop Usage chez SquareHive et Hadoop Usage chez Square
Hive et Hadoop Usage chez SquareModern Data Stack France
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopTony Ng
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on HadoopDataWorks Summit

Similar to Hive and Apache Tez: Benchmarked at Yahoo! Scale(20)

More from DataWorks Summit

Data Science Crash CourseData Science Crash Course
Data Science Crash CourseDataWorks Summit
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit

More from DataWorks Summit(20)

Recently uploaded

Advancing Equity and Inclusion for Deaf Students in Higher EducationAdvancing Equity and Inclusion for Deaf Students in Higher Education
Advancing Equity and Inclusion for Deaf Students in Higher Education3Play Media
Product Research Presentation-Maidy Veloso.pptxProduct Research Presentation-Maidy Veloso.pptx
Product Research Presentation-Maidy Veloso.pptxMaidyVeloso
Future of SkillsFuture of Skills
Future of SkillsAlison B. Lowndes
Common WordPress APIs - Options APICommon WordPress APIs - Options API
Common WordPress APIs - Options APIJonathan Bossenger
Data Formats: Reading and writing JSON – XML - YAMLData Formats: Reading and writing JSON – XML - YAML
Data Formats: Reading and writing JSON – XML - YAMLCSUC - Consorci de Serveis Universitaris de Catalunya
Google cloud Study Jam 2023.pptxGoogle cloud Study Jam 2023.pptx
Google cloud Study Jam 2023.pptxGDSCNiT

Hive and Apache Tez: Benchmarked at Yahoo! Scale

Editor's Notes

  1. Gopal was supposed to be presenting this with me, to talk about Tez. Point to Gopal/Jitendra’s talk on Hive/Tez for details on things I’ll have to skim over. Also, acknowledge Thomas Graves, who’s talking today about the excellent work he’s doing on driving Spark on Yarn.
  2. There are several sides to query latency: Query execution time : Addressed in the physical query-execution layer. Query optimizations: The first step while optimizing the query plan seems to be to query for all partition instances. Very expensive for “Project Benzene”. Bad queries : Tableau, I’m looking at you.
  3. The Transaction Processing Performance Council (inexplicably abbreviated to TPC) suggests a set of benchmarks for query processing. Many have adopted TPC-DS to showcase performance. We chose TPC-h to complement. (Also, 22 much smaller number to deal with than… 90?) Transliteration: Evita and Kylie Minogue
  4. Lineitem and Orders are extremely large Fact tables. Nation and Region are the smallest dimension tables.
  5. Tangent: Funny story: 1. About hard-drives: Can set up MR intermediate directories and HDFS data-node directories to be on different disks. Traffic from one doesn’t affect the other. But on the other hand, total read bandwidth might be reduced.
  6. Line-item: Partitioned on Ship-date. Orders: Order-date Customers: By market-segment Suppliers: On their region-key.
  7. Q5 and q21 are anomalous. Q21: Hit a trailing reducer across all versions of Hive tested. Perhaps this can be improved with a better plan. Q5: Slow reducer that hit only Hive 13. Could be a bad plan. Could be a difference in data distribution when data was regenerated for Hadoop 2 cluster.
  8. Tez : Scheduling. Playing the gaps, like Beethoven’s Fifth.
  9. Vectorization: On average: 1.2x.
  10. Except for a few outliers, ZLIB compression actually reduced performance for a 1TB dataset. Uncompressed was 1.3x faster than Compressed.
  11. The situation reverses at the 10 TB level. The gains from decompression are actually offset by the disk-read time. The long-tail in 10TB/q21 threw the scale of the graph off, so I’ve excluded it in the results.
  12. Talk about file-coalesce, small-file generation, Namenode pressure and parallelism. You don’t want to read an ORC stripe from a different node. Talk about distcp –pgrub, for ORC files. Mention that SNAPPY’s license is not Apache. Also, Yoda.
  13. At 100 nodes, it performs at 0.9x the 350 node performance.
  14. We’ve seen Hive and Tez scale down for latency, scale up for data-size, and scale out across larger clusters.
  15. Familiarity : We have an existing ecosystem with Hive, HCatalog, Pig and Oozie that delivers revenue to Yahoo today. It’s hard to rock the boat. Community: The Apache Hive community is large, active and thriving. They’ve been solving issues with query latency for ages now. The switch to using the Tez execution engine was a solution within the Apache Hive project. This wasn’t a fork of Hive. This is Hive, proper. Scale: We’ve seen Hive and Tez perform at scale. Heck, we’ve seen Pig perform on Tez. Multitenant: Yahoo’s use-case is unique, and not just because of data-scale. There’s hundreds of active users and genuine multitenancy and security concerns. Design: We think the Hive community has tackled the right problems first, rather than throw RAM at the problem.
  16. Bucky Lasek at the X-Games in 2001. Notice where he’s looking… Not at the camera, but setting up his next trick.
  17. Security: Kerberos support was patched in, after the benchmarks were run. Multi-tenancy: Data needs to be explicitly pinned into memory as RDDs. In a multi-tenant system, how would pinning work? Eviction policy for data. Compatibility: Needs to work with Metastore versions 12 and 13. Shark’s gone to 0.11 just recently. Integration with the rest of the stack: Oozie and Pig. Overall, we wanted a solution that works with high-dynamic range. i.e. works well with small datasets (100s of GBs), as well as scale to multi-terabyte datasets. We have a familiar system that seems to fit that bill. It doesn’t quite rock the boat. It’s not perfect yet. There are bugs that we’re working on. And we still haven’t solved the problem of data-volume/BI. By the way, I really like the idea of BlinkDB. I saw the JIRA.