
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsar Summit NA 2021


Apache Hudi is an open data lake platform, designed around the streaming data model. At its core, Hudi provides transactions, upserts, and deletes on data lake storage, while also enabling CDC capabilities. Hudi also provides a coherent set of table services, which can clean, compact, cluster, and optimize storage layout for better query performance. Finally, Hudi's data services provide out-of-the-box support for streaming data from event systems into lake storage in near real-time.

In this talk, we will walk through an end-to-end use case for change data capture from a relational database, starting with capturing changes using the Pulsar CDC connector, and then demonstrate how you can use the Hudi DeltaStreamer tool to apply these changes to a table on the data lake. We will discuss various tips for operationalizing and monitoring such pipelines. We will conclude with some guidance on future integrations between the two projects, including a native Hudi/Pulsar connector and Hudi tiered storage.


  1. Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
  2. Speaker Bio: PMC Chair/Creator of Hudi; Sr. Staff Eng @ Uber (Data Infra/Platforms, Networking); Principal Eng @ Confluent (ksqlDB, Kafka/Streams); Staff Eng @ LinkedIn (Voldemort, DDS); Sr. Eng @ Oracle (CDC/GoldenGate/XStream)
  3. Agenda: 1) Background on CDC 2) Make a Lake 3) Hudi Deep Dive 4) Onwards
  4. Background: CDC, Data Lakes - What, Why
  5. Change Data Capture: A design pattern for data integration - not tied to any particular technology - a system for tracking and fetching new data with low latency - not concerned with how such data is used - ideally updating downstream incrementally, minimizing the number of bits read/written per change. Change is the ONLY constant - even in computer science - "data is immutable" is a myth (well, kinda).
  6. Examples of CDC: Polling an external API for new events - timestamps, status indicators, versions - simple, works for small-scale data changes - e.g., polling the GitHub events API. Emitting events directly from the application - a data model to encode deltas - scales for high-volume data changes - e.g., emitting sensor state changes to Pulsar. Scanning a database's redo log - SCN and other watermarks to extract data/metadata changes - operationally heavy, very high fidelity - e.g., using Debezium to obtain changelogs from MySQL.
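A minimal sketch of the second pattern - an application emitting its own change events to Pulsar. The Pulsar client calls below are the real Java client API used from Scala; the service URL, topic name, and event payload are illustrative assumptions:

```scala
import org.apache.pulsar.client.api.{PulsarClient, Schema}

object SensorStateEmitter {
  def main(args: Array[String]): Unit = {
    // Connect to a Pulsar broker (service URL is an assumption)
    val client = PulsarClient.builder()
      .serviceUrl("pulsar://localhost:6650")
      .build()

    // "sensor-state-changes" is a hypothetical topic; each message encodes
    // only the delta, not the full sensor state
    val producer = client.newProducer(Schema.STRING)
      .topic("sensor-state-changes")
      .create()

    producer.send("""{"sensorId": "s-42", "field": "temperature", "old": 20.1, "new": 20.7}""")

    producer.close()
    client.close()
  }
}
```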
  7. CDC vs ETL? CDC is merely incremental extraction - not really competing concepts - ETL needs a one-time full bootstrap - <> CDC changes T and L significantly - T on change streams, not just table state - L incrementally, not just bulk reloads
  8. CDC vs Stream Processing: CDC enables streaming ETL - why bulk T & L anymore? - process change streams - mutable sinks. Reliable stream processing needs distributed logs - rewind/replay CDC logs - absorb spikes/batch writes to sinks
  9. Ideal CDC Source: Support reliable incremental consumption - <> Support rewinding/replay - <> Support ordering of changes - <>
  10. Ideal CDC Sink: Mutable, transactional - <> Quickly absorb changes - <> Bonus: also act as a CDC source - <>
  11. Data Lakes: Architectural pattern for analytical data - Data Lake != Spark, Flink - Data Lake != files on S3 - <> Raw data - <> Derived data - <>
  12. CDC to Data Lakes <architecture diagram: change streams carry events from the operational data infrastructure (database, apps/services, external sources) to tables on DFS/cloud storage in the analytics data infrastructure, which serve queries>
  13. Make a Lake: Putting Pulsar and Hudi to work
  14. Data Flow Design <show diagram showing e2e data flow> - <..>
  15. Prerequisites: Running MySQL instance (RDS) - <..> Running Pulsar cluster (??) - <..> Running Spark cluster (e.g., EMR) - <..>
  16. Test Data: Explain ‘users’ table - <..> Explain ‘github_events’ data emitted into Pulsar - <..>
  17. #1: Setup CDC Connector <Show configurations> - <..> <Sample data out of Pulsar> - <..>
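As a sketch of what sampling the change stream out of Pulsar might look like - the topic follows Debezium's server.database.table naming convention, but the exact topic name, broker URL, and subscription name are assumptions:

```scala
import org.apache.pulsar.client.api.{PulsarClient, Schema, SubscriptionInitialPosition}

object SampleCdcEvents {
  def main(args: Array[String]): Unit = {
    val client = PulsarClient.builder()
      .serviceUrl("pulsar://localhost:6650") // assumption: local broker
      .build()

    // Debezium emits JSON envelopes carrying before/after row images
    val consumer = client.newConsumer(Schema.BYTES)
      .topic("mysql-server.demo.users") // hypothetical Debezium topic name
      .subscriptionName("cdc-sampler")
      .subscriptionInitialPosition(SubscriptionInitialPosition.Earliest)
      .subscribe()

    // Print a handful of raw change events
    for (_ <- 1 to 5) {
      val msg = consumer.receive()
      println(new String(msg.getData, "UTF-8"))
      consumer.acknowledge(msg)
    }

    consumer.close()
    client.close()
  }
}
```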
  18. #2: Kick Off Hudi DeltaStreamer <Show configurations, Command to submit> - <..> <Query data out of Hudi tables> - <..>
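Once DeltaStreamer has applied the changes, the table reads like any other Spark source. A minimal snapshot-query sketch - the base path and column names are assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("query-hudi-users")
  // Hudi recommends Kryo serialization for Spark jobs
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// Snapshot query: the latest committed state of the table
val users = spark.read.format("hudi")
  .load("s3://my-bucket/lake/users") // assumption: table base path

users.createOrReplaceTempView("users")
spark.sql("SELECT _hoodie_commit_time, id, email FROM users LIMIT 10").show()
```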
  19. #3: Streaming ETL using Hudi <Show how to CDC from Hudi itself> - <..> <Sample pipeline that does some enrichment of events> - <..>
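One way such a pipeline could look in Spark, reusing the session above - incrementally pull changes from the upstream Hudi table, enrich them, and upsert into a derived table. The paths, field names, join key, and checkpoint instant are all illustrative assumptions:

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Pull only the records that changed after the last processed commit
val changes = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20210616000000") // last checkpoint
  .load("s3://my-bucket/lake/users")

// Enrich the change stream (dimension table and columns are made up);
// select plain columns to avoid clashing Hudi metadata fields on the join
val segments = spark.read.format("hudi")
  .load("s3://my-bucket/lake/user_segments")
  .select("id", "segment")

val enriched = changes
  .withColumn("source_commit_time", col("_hoodie_commit_time")) // keep upstream commit time
  .drop(changes.columns.filter(_.startsWith("_hoodie_")): _*)
  .join(segments, Seq("id"))

// Upsert the enriched records into a derived Hudi table downstream
enriched.write.format("hudi")
  .option("hoodie.table.name", "users_enriched")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "source_commit_time")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/lake/users_enriched")
```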
  20. Hudi Deep Dive: Intro, Components, APIs, Design Choices
  21. Hudi Data Lake: The original pioneer of the transactional data lake movement. An embeddable, serverless, distributed database abstraction layer over DFS - we invented this! Hadoop Upserts, Deletes & Incrementals. Provides transactional updates/deletes, with first-class support for record-level CDC streams.
  22. Stream Processing is Fast & Efficient. Streaming stack: + intelligent, incremental + fast, efficient - row-oriented - not scan-optimized. Batch stack: + scans, columnar formats + scalable compute - naive, inefficient.
  23. What If: Streaming Model on Batch Data? The Incremental Stack: + intelligent, incremental + fast, efficient + scans, columnar formats + scalable compute. https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/ (2016)
  24. Hudi: Open Sourcing & Evolution. 2015: Published core ideas/principles for incremental processing (O'Reilly article). 2016: Project created at Uber & powers all database/business-critical feeds @ Uber. 2017: Project open sourced by Uber & work began on Merge-On-Read, cloud support. 2018: Picked up adopters, hardening, async compaction. 2019: Incubated into ASF, community growth, added more platform components. 2020: Top-level Apache project, over 10x growth in community, downloads, adoption. 2021: SQL DMLs, Flink Continuous Queries, more indexing schemes, Metaserver, Caching.
  25. Apache Hudi - Adoption. Committers/Contributors: Uber, AWS, Alibaba, Tencent, Robinhood, Moveworks, Confluent, Snowflake, Bytedance, Zendesk, Yotpo, and more. https://hudi.apache.org/docs/powered_by.html
  26. The Hudi Stack: A complete "data" lake platform. Tightly integrated, self-managing. Write using Spark, Flink. Query using Spark, Flink, Hive, Presto, Trino, Impala, AWS Athena/Redshift, Aliyun DLA, etc. Out-of-the-box tools/services for painless dataops.
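For instance, writing with Spark is a single DataFrame write against the Hudi format. A hedged sketch - the option keys are real Hudi write configs, while the table name, fields, and path are assumptions:

```scala
import org.apache.spark.sql.SaveMode

// df is any DataFrame with a stable record key and an ordering field
df.write.format("hudi")
  .option("hoodie.table.name", "github_events")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") // or COPY_ON_WRITE
  .option("hoodie.datasource.write.recordkey.field", "event_id")
  .option("hoodie.datasource.write.precombine.field", "created_at")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/lake/github_events")
```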
  27. Design of a Hudi Table
  28. File Layout
  29. File Groups & Slices
  30. Query Types: Read Optimized Query at 10:10; Snapshot Query at 10:10; Incremental Query (10:08, 10:10)
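In Spark, these three map onto a single read option; the option keys and values below are Hudi's actual query types, while basePath is an assumed table path (using the session from earlier):

```scala
// Snapshot: latest merged view of base + log files
val snapshot = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .load(basePath)

// Read optimized: base files only - cheaper, but can lag on MERGE_ON_READ tables
val readOptimized = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load(basePath)

// Incremental: only records that changed between two commit instants
val incremental = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20210616100800")
  .option("hoodie.datasource.read.end.instanttime", "20210616101000")
  .load(basePath)
```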
  31. Our Design Goals: Streaming/Incremental - upsert/delete optimized - key-based operations. Faster - frequent commits - design around logs - minimize overhead.
  32. Delta Logs at File Level over Global: Each file group is its own self-contained log - constant metadata size, controlled by "retention" parameters - leverage append() when available; lower metadata overhead. Merges are local to each file group - UUID keys throw off any range pruning.
  33. Record Indexes over Just File/Column Stats: Index maps key to a file group - during upserts/deletes - much like a streaming state store. Workloads have different shapes - late-arriving updates; totally random - trickle down to derived tables. Many pluggable options - bloom filters + key ranges - HBase, join-based - global vs local.
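The pluggable options surface as a single writer config. A sketch where the index type value is a real Hudi index type (BLOOM, GLOBAL_BLOOM, SIMPLE, HBASE, ...) and the rest is illustrative, as in the write example under slide 26:

```scala
// BLOOM (bloom filters + key ranges) is the usual default; GLOBAL_BLOOM enforces
// key uniqueness across partitions; HBASE delegates to an external HBase index
df.write.format("hudi")
  .option("hoodie.table.name", "github_events")
  .option("hoodie.datasource.write.recordkey.field", "event_id")
  .option("hoodie.datasource.write.precombine.field", "created_at")
  .option("hoodie.index.type", "GLOBAL_BLOOM")
  .mode(org.apache.spark.sql.SaveMode.Append)
  .save("s3://my-bucket/lake/github_events")
```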
  34. MVCC Concurrency Control over Only OCC: Frequent commits => more frequent clustering/compaction => more contention. Differentiate writers vs table services - much like what databases do - table services don't contend with writers - async compaction/clustering. Don't be so "optimistic" - OCC between writers works, until it doesn't - retries, split txns, wasted resources - MVCC/log-based between writers and table services.
  35. Record Level Merge API over Only Overwrites: More generalized approach - default: overwrite, latest writer wins - support business-specific resolution. Log partial updates - log just the changed columns - drastic reduction in write amplification. Log-based reconciliation - delete, undelete based on business logic - CRDT, Operational Transform-like delayed conflict resolution.
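Business-specific resolution plugs in through the payload class write config. A sketch - hoodie.datasource.write.payload.class is a real Hudi option (the stock behavior is OverwriteWithLatestAvroPayload), but com.example.LatestNonNullPayload is a hypothetical class implementing Hudi's HoodieRecordPayload contract:

```scala
// Route record merges through a custom, business-specific payload
df.write.format("hudi")
  .option("hoodie.table.name", "users")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "updated_at")
  .option("hoodie.datasource.write.payload.class", "com.example.LatestNonNullPayload") // hypothetical
  .mode(org.apache.spark.sql.SaveMode.Append)
  .save("s3://my-bucket/lake/users")
```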
  36. Specialized Database over Generalized Format: Approach it more like a shared-nothing database - daemons aware of each other - e.g., compaction and cleaning in RocksDB. E.g., clustering & compaction know about each other - reconcile metadata based on time order - compactions avoid redundant scheduling. Self-managing - sorting, time-order preservation, file-sizing.
  37. Record Level CDC over File/Snapshot Diffing: Per-record metadata - _hoodie_commit_time: Kafka-style compacted change streams in commit order - _hoodie_commit_seqno: consume large commits in chunks, à la Kafka offsets. File group design => CDC-friendly - efficient retrieval of old/new values - efficient retrieval of all values for a key. Infinite retention/lookback coming later in 2021.
  38. Onwards: Ideas, Ongoing Work, Future Plans
  39. Scalable, Multi-Modal Indexes: Partitions are very coarse file-level indexes. Finer-grained indexes as new partitions in the metadata table - bloom filters, bitmaps - column ranges (RFC-27) - HFile/hash indexes - search? External indexes - DynamoDB, Spanner + other cloud stores - C*, Mongo, and others.
  40. Caching: LRU cache à la DB buffer pool. Frequent commits => small objects/blocks - today: aggressive table services - tomorrow: file group/Hudi file model aware caching - mutable data => filesystem/block-level caches are not that effective. Benefits - great performance for CDC tables - avoid open/close costs for small objects.
  41. Timeline Metaserver: Interesting fact - Hudi has a metaserver already - runs on the Spark driver; serves FileSystem RPCs + queries on the timeline - backed by RocksDB, updated incrementally on every timeline action - very useful in streaming jobs - but still standalone. Data lakes need a new metaserver - flat-file metastores are cool? (really?) - sometimes I miss HMS (sometimes..) - let's learn from cloud warehouses.
  42. Beyond Just Lake Engines
  43. Pulsar Sink <Outline strawman design, Hudi facing work, Call for collab>
  44. Pulsar Tiered Storage <Research sharing current challenges, call for collaboration>
  45. Engage With Our Community: User Docs: https://hudi.apache.org | Technical Wiki: https://cwiki.apache.org/confluence/display/HUDI | GitHub: https://github.com/apache/hudi/ | Twitter: https://twitter.com/apachehudi | Mailing lists: dev-subscribe@hudi.apache.org (send an empty email to subscribe), dev@hudi.apache.org (actual mailing list) | Slack: https://join.slack.com/t/apache-hudi/signup
  46. Thanks! Questions?
  47. Hudi @ Uber: Hudi powers one of the largest transactional data lakes on the planet @ Uber. Operated a 150PB+ data lake platform for 4+ years. Multi-engine environment with Presto, Spark, Hive, Vertica & more. Architected several data services for deletion/GDPR across 15K+ data users. Mission-critical to all of Uber, with data monitoring/schemas/quality enforcement. ~8000 tables; 150+ PB; 3-30 min freshness; ~1.5 PB/day; ~850 million vcore-secs; ~4 engines.
