
Hoodie - DataEngConf 2017


Published on: http://www.dataengconf.com/hoodie-an-open-source-incremental-processing-framework-from-uber

Published in: Data & Analytics

Hoodie - DataEngConf 2017

  1. 1. Hoodie: An Open Source Incremental Processing Framework
  2. 2. Who Am I Vinoth Chandar - Founding engineer/architect of the data team at Uber. - Previously: - Lead on LinkedIn’s Voldemort key-value store. - Oracle Database replication, Stream Processing. - HPC & Grid Computing
  3. 3. Agenda • Data @ Uber • Motivation • Concepts • Deep Dive • Use-Cases • Comparisons • Open Source
  4. 4. Data @ Uber Quick Recap of how Uber’s data ecosystem has evolved!
  5. 5. Circa 2014 Reliability - JSON data, breaking pipelines - Word-of-mouth schema. Scalability - Kafka 0.7, no Hadoop - Growing data volumes - Multi-datacenter data merges. Inefficiencies - Several hours of data delay - Bulk data copies stressing OLTP systems - Single choice of query engine
  6. 6. Re-Architecture Schemafication - Avro as the data lingua franca - Schema enforcement at producers. Horizontally Scalable - Kafka 0.8 - Hadoop (many PBs & 1000s of servers) - Scalable data pipelines - Multi-DC aware data flow. Performant - 1-3 hr data latency - Columnar queries via Parquet - Multiple query engines
  7. 7. Data Users Analytics - Dashboards - Federated Querying - Interactive Analysis Data Apps - Machine Learning - Fraud Detection - Incentive Spends Data Warehousing - Traditional ETL - Curated data feeds - Data Lake => Data Mart
  8. 8. Query Engines Presto > 100K queries/day Spark 100s of Apps Hive 20K Pipelines & Queries
  9. 9. Hoodie : Motivations Use-cases & business needs that led to the birth of the project
  10. 10. Query Engines Presto > 100K queries/day Spark 100s of Apps Hive 20K Pipelines & Queries
  11. 11. Motivating Use-Case: Late Arriving Updates (diagram) - Table partitioned by trip start date at day level (2010-2014, 2015/XX/XX, 2016/XX/XX, 2017/(01-03)/XX, 2017/04/16) - New/updated trips applied as incremental updates every 30 min, touching new data and updated data while most partitions stay unaffected
  12. 12. DB Ingestion: Status Quo (diagram) - Changelog -> snapshot job -> trips (Parquet) -> derived tables, 12-18+ hr end to end - Snapshot job growth: Jan: 6 hr (500 executors), Apr: 8 hr (800 executors), Aug: 10 hr (1000 executors)
  13. 13. How can we fix this? Query HBase? - Bad Fit for scans - Lack of support for nested data - Significant operational overhead Specialized Analytical DBs? - Joins with other datasets in HDFS - Not all data will fit into memory - Lambda architecture & data copies Don’t support Snapshots, Only Logs - Logs ultimately need to be compacted anyway - Merging done inconsistently & inefficiently by users Data Modelling Tricks? - Does not change fundamental nature of problem
  14. 14. Pivotal Question What do we need to solve this directly on top of a petabyte scale Hadoop Data Lake?
  15. 15. Let’s Go A Decade Back How did RDBMSes solve this? • Update existing row with new value (Transactions) • Consume a log of changes downstream (Redo log) • Update again downstream (diagram: MySQL Server A -> pull redo log -> transformation -> MySQL Server B) Important Differences • Columnar file formats • Read-heavy analytical workloads • Petabytes & 1000s of servers
  16. 16. Pivotal Question What do we need to solve this directly on top of a petabyte scale Hadoop Data Lake? Answer: upserts & incrementals
  17. 17. Challenging Status Quo: upserts & incr pull (diagram) - Snapshot-based ingestion: 6 hr (500 executors) -> 8 hr (800) -> 10 hr (1000), 12-18+ hr to derived tables - With upserts on the changelog of new/updated trip rows: replicated trip rows in ~1 hr, derived tables in ~8 hr
  18. 18. Hoodie : Concepts Incremental Processing Foundations & why it’s important
  19. 19. Anatomy Of Data Pipelines (Source -> Data Pipeline -> Sink) Core Operations • Projections (Easy) • Filtering (Easy) • Aggregations (Tricky) • Windows (Tricky) • Joins (Hard) Operational Levers (Google DataFlow) • Latency • Completeness • Cost - typically pick 2 of 3
  20. 20. An Artificial Dichotomy
  21. 21. It’s A Spectrum - Very common use-cases tolerate a few minutes of latency - 100x more batch pipelines than streaming pipelines
  22. 22. Incremental Processing : What? Run mini-batch pipelines - Provide higher completeness than streaming pipelines - By supporting things like multi-table joins seamlessly. In streaming fashion - Provide lower latency than typical batch pipelines - By consuming only new input & being able to update old results
  23. 23. Incremental Processing : Increased Efficiency Less IO, On-Demand Resource Allocation
  24. 24. Incremental Processing : Leverage Hadoop SQL - Good support for joins - Columnar file formats - Covers a wide range of use cases: exploratory, interactive
  25. 25. Incremental Processing : Simplify Architecture - Efficient pipelines on same batch infrastructure - Consolidation of storage & compute
  26. 26. Incremental Processing : Primitives Upsert (Primitive #1) - Modify processed results - Like state stores in stream processing. Incremental Pull (Primitive #2) - Log stream of changes, avoid costly scans - Enable chaining processing into a DAG
  27. 27. Introducing: Hoodie (Hadoop Upserts anD Incrementals) Storage Abstraction to - Apply mutations to dataset - Pull changelog incrementally Spark Library - Scales horizontally like any job - Stores dataset directly on HDFS Open Source - https://github.com/uber/hoodie - https://eng.uber.com/hoodie Upsert (Spark) Changelog Changelog Incr Pull (Hive/Spark/Presto) Normal Table (Hive/Spark/Presto)
  28. 28. Hoodie: Overview Hoodie WriteClient (Spark) Index Data Files Timeline Metadata Hive Queries Dataset On HDFS Presto Queries Spark DAGs Store & Index Data Read data Storage Type Views
  29. 29. Hoodie: Storage Types & Views - Storage type (how is data stored?) -> views (how is data read?): Copy On Write -> Read Optimized, LogView; Merge On Read -> Read Optimized, RealTime, LogView
  30. 30. Hoodie : Deep Dive Design & Implementation of incremental processing primitives
  31. 31. Storage: Basic Idea (diagram: input changelog -> index -> Hoodie dataset with day partitions 2017/02/15-2017/02/17, e.g. File1_v1.parquet, File1_v2.parquet, File1.avro.log; 10 GB / 5 min or 200 GB / 30 min batches) ● 1825 partitions (365 days * 5 yrs) ● 100 GB partition size ● 128 MB file size ● ~800 files per partition ● Skew spread - 0.5 % of files touched per batch vs. 0.005 % new files ● 7300 files rewritten vs. ~8 new files ● 20 seconds to re-write 1 file (shuffle) ● 100 executors vs. 10 executors ● 24 minutes to write vs. ~2 minutes to write
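Tracing the slide's numbers as a rough back-of-the-envelope check (assuming the updates are spread evenly): 100 GB per partition / 128 MB per file gives ~800 files per partition; 1825 partitions * 800 files is ~1.46M files, and 0.5 % of those is the ~7300 files rewritten per batch; at ~20 seconds per file over 100 executors that is 7300 * 20 / 100 ≈ 1460 s, i.e. the ~24 minutes quoted for a single write.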
  32. 32. Index and Storage Index - Tag ingested record as update or insert - Index is immutable (record key to File mapping never changes) - Pluggable - Bloom Filter - HBase Storage - HDFS or Compatible Filesystem or Cloud Storage - Block aligned files - ROFormat (Apache Parquet) & WOFormat (Apache Avro)
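A rough illustration of the bloom-filter indexing idea listed above (this is not Hoodie's index implementation; Guava's BloomFilter stands in, and the class, maps and method names are made up): each data file keeps a filter over the record keys it contains, and an incoming key is tagged as an update only if some file's filter, and then the file's actual key set to rule out false positives, contains it.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy index: one bloom filter per data file, built over that file's record keys.
public class BloomIndexSketch {
  private final Map<String, BloomFilter<CharSequence>> fileToFilter = new HashMap<>();
  private final Map<String, Set<String>> fileToKeys = new HashMap<>(); // stand-in for reading keys back from the file

  // Register the record keys written to a file (done at write time).
  public void addFile(String fileId, Set<String> recordKeys) {
    BloomFilter<CharSequence> filter =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), recordKeys.size(), 0.01);
    recordKeys.forEach(filter::put);
    fileToFilter.put(fileId, filter);
    fileToKeys.put(fileId, recordKeys);
  }

  // Tag an incoming record key: returns the owning fileId for an update, or null for an insert.
  public String tag(String recordKey) {
    for (Map.Entry<String, BloomFilter<CharSequence>> entry : fileToFilter.entrySet()) {
      // Bloom filters can return false positives, so confirm against the file's actual keys.
      if (entry.getValue().mightContain(recordKey) && fileToKeys.get(entry.getKey()).contains(recordKey)) {
        return entry.getKey(); // update: route the record to this file's next version
      }
    }
    return null; // insert: goes to a new (or under-filled) file
  }
}
```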
  33. 33. Concurrency ● Multi-row atomicity ● Strong consistency (Same as HDFS guarantees) ● Single Writer - Multiple Consumer pattern ● MVCC for isolation ○ Queries run concurrently with ingestion
  34. 34. Data Skew Why is skew a problem? - Spark 2 GB remote shuffle block limit - Straggler problem. Hoodie handles data skew automatically - Index lookup skew - Data write skew handled by auto sub-partitioning based on history
  35. 35. Compaction Essential for Query performance - Merge Write Optimized row format with Scan Optimized column format Scheduled asynchronously to Ingestion - Ingestion already groups updates per File Id - Locks down versions of log files to compact - Pluggable strategy to prioritize compactions - Base File to Log file size ratio - Recent partitions compacted first
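The deck does not show the pluggable compaction-strategy interface, so the following is only a toy sketch of ranking pending compactions by the two criteria listed above (newer partitions first, then the largest log-to-base size ratio); all type and field names are hypothetical.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical view of one pending compaction: a columnar base file plus its accumulated log.
class CompactionCandidate {
  final String partitionPath; // e.g. "2017/04/16"
  final long baseFileBytes;   // size of the columnar base file
  final long logFileBytes;    // delta bytes logged since the last compaction
  CompactionCandidate(String partitionPath, long baseFileBytes, long logFileBytes) {
    this.partitionPath = partitionPath;
    this.baseFileBytes = baseFileBytes;
    this.logFileBytes = logFileBytes;
  }
  double logToBaseRatio() {
    return baseFileBytes == 0 ? Double.MAX_VALUE : (double) logFileBytes / baseFileBytes;
  }
}

class CompactionPriority {
  // Order candidates so that recent partitions come first and, within a partition,
  // files with the most un-compacted log relative to their base come first.
  static void prioritize(List<CompactionCandidate> candidates) {
    candidates.sort(
        Comparator.comparing((CompactionCandidate c) -> c.partitionPath).reversed()
            .thenComparing(Comparator.comparingDouble(CompactionCandidate::logToBaseRatio).reversed()));
  }
}
```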
  36. 36. Failure recovery Automatic recovery via Spark RDD - Resilient Distributed Datasets!! No Partial writes - Commit is atomic - Auto rollback last failed commit Rollback specific commits Savepoints/Snapshots
  37. 37. Hoodie Write API
    // WriteConfig contains basePath of hoodie dataset (among other configs)
    HoodieWriteClient(JavaSparkContext jsc, HoodieWriteConfig clientConfig)
    // Start a commit and get a commit time to atomically upsert a batch of records
    String startCommit()
    // Upsert the RDD<Records> into the hoodie dataset
    JavaRDD<WriteStatus> upsert(JavaRDD<HoodieRecord<T>> records, final String commitTime)
    // Choose to commit
    boolean commit(String commitTime, JavaRDD<WriteStatus> writeStatuses)
    // Rollback
    boolean rollback(final String commitTime) throws HoodieRollbackException
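A minimal sketch of driving the client above through one ingestion batch. Only the constructor and the startCommit/upsert/commit/rollback calls are taken from the slide; the HoodieWriteConfig builder, the payload-typed records and the surrounding class are assumptions, and Hoodie imports are omitted because the deck does not show package names.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
// (Hoodie imports omitted: the slide does not show package names.)

public class UpsertBatchSketch {
  // Pushes one changelog batch through a single atomic commit.
  public static <T extends HoodieRecordPayload> void upsertBatch(
      JavaSparkContext jsc, String basePath, JavaRDD<HoodieRecord<T>> changelogBatch) throws Exception {
    HoodieWriteConfig config = HoodieWriteConfig.newBuilder().withPath(basePath).build(); // assumed builder API
    HoodieWriteClient client = new HoodieWriteClient(jsc, config);

    String commitTime = client.startCommit();                                  // open an atomic commit
    JavaRDD<WriteStatus> statuses = client.upsert(changelogBatch, commitTime); // tag inserts/updates & write
    if (!client.commit(commitTime, statuses)) {                                // publish new file versions atomically
      client.rollback(commitTime);                                             // or undo the partial write
    }
  }
}
```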
  38. 38. Hoodie Record
    HoodieRecordPayload
    // Get the Avro IndexedRecord for the dataset schema
    IndexedRecord getInsertValue(Schema schema);
    // Combine existing value with new incoming value and return the combined value
    IndexedRecord combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema);
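As a hedged illustration of the payload contract above, here is one way a "latest timestamp wins" merge could be expressed against those two methods. The method signatures come from the slide; the interface's exact name, generics and any additional methods (e.g. payload serialization), plus the last_updated_ts field, are assumptions.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;

// Keeps whichever version of a row carries the larger "last_updated_ts" value.
public class LatestTimestampPayload implements HoodieRecordPayload {

  private final GenericRecord incoming; // new value arriving from the changelog

  public LatestTimestampPayload(GenericRecord incoming) {
    this.incoming = incoming;
  }

  @Override
  public IndexedRecord getInsertValue(Schema schema) {
    // No existing value on storage for this record key: write the incoming record as-is.
    return incoming;
  }

  @Override
  public IndexedRecord combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) {
    // Existing value found for the same record key: keep the newer of the two versions.
    long currentTs = (Long) ((GenericRecord) currentValue).get("last_updated_ts");
    long incomingTs = (Long) incoming.get("last_updated_ts");
    return incomingTs >= currentTs ? incoming : currentValue;
  }
}
```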
  39. 39. Hoodie: Overview Hoodie WriteClient (Spark) Index Data Files Timeline Metadata Hive Queries Hoodie Dataset On HDFS Presto Queries Spark DAGs Store & Index Data Read data Storage Type Views
  40. 40. Hoodie Views (chart: query execution time vs. data latency for READ OPTIMIZED and REALTIME) 3 logical views of a dataset: Read Optimized View - Raw Parquet query performance - Targets existing Hive tables. Real Time View - Hybrid of row & columnar data - Brings near-real-time tables. Log View - Stream of changes to the dataset - Enables incremental pull
  41. 41. Hoodie Views (diagram: input changelog -> index -> day partitions 2017/02/15-2017/02/17 with File1.parquet, File1_v1.parquet, File1_v2.parquet and File1.avro.log, 10 GB / 5 min batches, exposed as Read Optimized Table, Real Time Table and Incremental Log table in Hive)
  42. 42. Read Optimized View InputFormat picks only compacted columnar files Optimized for faster query runtime over data latency - Plugs into query plan generation to filter out older versions - All optimizations for reading Parquet apply (vectorized reads etc.) Works out of the box with Presto and Apache Spark
  43. 43. Presto Read Optimized Performance
  44. 44. Real Time View InputFormat merges Columnar with Row Log at query execution - Data Latency can approach speed of HDFS appends Custom RecordReader - Logs are grouped per FileID - Single split is usually a single FileID in Hoodie (Block Aligned files) Works out of the box with Presto and Apache Spark - Specialized parquet read path optimizations not supported
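To make the merge step above concrete, here is a toy sketch of overlaying the latest log (row) records on top of the compacted columnar records for one file ID at read time; the map-based records and the _hoodie_record_key field name are illustrative, not Hoodie's internal reader types.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeOnReadSketch {
  // Returns one merged row per record key: base rows first, then log rows override them.
  public static Map<String, Map<String, Object>> merge(
      List<Map<String, Object>> baseRows,  // rows read from the compacted parquet base file
      List<Map<String, Object>> logRows) { // rows appended to the avro log since compaction
    Map<String, Map<String, Object>> merged = new HashMap<>();
    for (Map<String, Object> row : baseRows) {
      merged.put((String) row.get("_hoodie_record_key"), row);
    }
    for (Map<String, Object> row : logRows) {
      // Log entries are newer than the base file, so they win for the same record key.
      merged.put((String) row.get("_hoodie_record_key"), row);
    }
    return merged;
  }
}
```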
  45. 45. Incremental Log View (diagram) - Table partitioned by trip start date (2010-2014, 2015/XX/XX, 2016/XX/XX, 2017/(01-03)/XX, 2017/04/16) - New/updated trips applied as incremental updates every 5 min, touching new data and updated data while most partitions stay unaffected - Log view exposes these changes for incremental pull
  46. 46. Incremental Log View Pull ONLY changed records in a time range using SQL - _hoodie_commit_time > ‘startTs’ AND _hoodie_commit_time < ‘endTs’ Avoid full table/partition scans Do not rely on a custom sequence ID to tail
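A small Spark SQL sketch of the incremental pull above; the trips table name, the session setup and the commit-time bounds are illustrative, while the _hoodie_commit_time column is the one named on the slide (the scan-avoidance itself comes from Hoodie's input format, not from the SQL text alone).

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalPullSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("incremental-pull-sketch")
        .enableHiveSupport()
        .getOrCreate();

    String startTs = "20170416080000"; // illustrative commit-time bounds
    String endTs = "20170416090000";

    // Pull only rows whose Hoodie commit time falls between startTs and endTs.
    Dataset<Row> changed = spark.sql(
        "SELECT * FROM trips"
            + " WHERE _hoodie_commit_time > '" + startTs + "'"
            + " AND _hoodie_commit_time < '" + endTs + "'");
    changed.show();
  }
}
```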
  47. 47. Hoodie : Use Cases How is it being used in real production environments?
  48. 48. Use Cases Near Real-Time ingestion / stream into HDFS - Replicate online state in HDFS within few minutes - Offload analytics to HDFS
  49. 49. Near Real-Time Ingestion
  50. 50. Use Cases Near Real-Time ingestion / stream into HDFS - Replicate online state in HDFS within few minutes - Offload analytics to HDFS Incremental Data Pipelines - Don't tradeoff correctness to do incremental processing - Hoodie integration with Scheduler
  51. 51. Incremental ETL
  52. 52. Use Cases Near Real-Time ingestion / streaming into HDFS - Replicate online state in HDFS within few minutes - Offload analytics to HDFS Incremental Data Pipelines - Don't tradeoff correctness to do incremental processing - Hoodie integration with Scheduler Unified Analytical Serving Layer - Eliminate your specialized serving layer , if latency tolerated is > 5 min - Simplify serving with HDFS for the entire dataset
  53. 53. Unified Analytics Serving
  54. 54. Adoption @ Uber Powering ~1000 data ingestion feeds - Every 30 mins today, several TBs per hour - Towards < 10 min in the next few months Incremental ETL for dimension tables - Data warehouse at large Reduced resource usage by 10x - In production for last 6 months - Hardened across rolling restarts, data node reboots
  55. 55. Hoodie : Comparisons What trade-offs does hoodie offer compared to other systems?
  56. 56. Comparison: Analytical Storage - Scans (chart) Source: CERN blog, “Performance comparison of different file formats and storage engines in the Hadoop ecosystem”
  57. 57. Comparison: Analytical Storage - Write Rate (chart) Source: CERN blog, “Performance comparison of different file formats and storage engines in the Hadoop ecosystem”
  58. 58. Comparison (Apache HBase | Apache Kudu | Hoodie)
    - Write Latency: Milliseconds | Seconds (streaming) | ~5 min update, ~1 min insert**
    - Scan Performance: Not optimal | Optimized via columnar | State-of-art Hadoop formats
    - Query Engines: Hive* | Impala/Spark* | Hive, Presto, Spark at scale
    - Deployment: Extra region servers | Specialized storage servers | Spark jobs on HDFS
    - Multi Row Commit/Rollback: No | No | Yes
    - Incremental Pull: No | No | Yes
    - Automatic Hotspot Handling: No | No | Yes
  59. 59. Hoodie : Open Source How to get involved, roadmap..
  60. 60. Community Shopify evaluating for use - Incremental DB ingestion onto GCS - Early interest from multiple companies Engage with us on GitHub (uber/hoodie) - Look for “beginner-task” tagged issues - Try out tools & utilities Uber is hiring for “Hoodie” - “Software Engineer - Data Processing Platform (Hoodie)”
  61. 61. Future Plans Merge On Read (Project #1) - Active development, productionizing, shipping! Global Index (Project #2) - Fast, lightweight index to map key to fileID, globally (not just within a partition) Spark Datasource (Issue #7) & Presto Plugins (Issue #81) - Native support for incremental SQL (e.g: where _hoodie_commit_time > ... ) Beam Runner (Issue #8) - Build incremental pipelines that also port across batch or streaming modes
  62. 62. Takeaways Fills a big void in Hadoop land - Upserts & Faster data Play well with Hadoop ecosystem & deployments - Leverage Spark vs re-inventing yet-another storage silo Designed for Incremental Processing - Incremental Pull is a ‘Hoodie’ special
  63. 63. Questions? Office Hours after the talk: 5:00pm–5:45pm
  64. 64. Extra Slides
  65. 65. Hoodie: Storage Types & Views
  66. 66. Hoodie Views
  68. 68. (Diagram: 200 GB change log -> index -> day partitions 2017/02/15-2017/02/17 with File1.parquet, File1_v1.parquet, File1_v2.parquet and File1.avro.log, served as Read Optimized View and Realtime View via Hive)
  69. 69. Hoodie Write Path (diagram) - Change log -> index lookup -> updates & inserts routed to file IDs across day partitions (2017-03-10 through 2017-03-14) - File Id1 (2017-03-11): compacted at 10:05, log file, then commits at 10:06, 10:08 (failed), 10:08 and 10:09 yielding versions 1 and 2 - File Id2 (2017-03-14): empty - Current commit time: 10:10
  70. 70. Hoodie Write Path Spark Application
  71. 71. Read Optimized View
  72. 72. Spark SQL Performance Comparison
  73. 73. Realtime View
  74. 74. Incremental Log View
  75. 75. Hoodie: Storage Types & Views
  76. 76. Incremental Log View
  77. 77. Comparison
  78. 78. Comparison
  79. 79. Petabytes to Exabytes Greater need for Incremental Processing
  80. 80. Exponential Growth is fun .. Also extremely hard to keep up with … - Long waits in the queue - Disks running out of space. Common Pitfalls - Massive re-computations - Batch jobs too big to fail
