Deep dive of Deduplication using Apache Apex and RTS

Abstract: This webinar introduces the de-duplication functionality in Apache Apex Malhar. De-duplication is a very important part of the processing pipeline in ETL workflows. We will introduce the use cases and walk through the implementation details. Next, we'll look at how to configure the Dedup operator for various use cases (time-based expiry as well as batch de-duplication). We will also demonstrate an application that uses the De-duplication (Dedup) operator.

Presenter: Bhupesh Chawda is a Software Engineer at DataTorrent and a Committer for Apache Apex.

Learn more about Apex and DataTorrent: https://www.datatorrent.com/apex/

  1. Deep Dive on Deduplication: Apache Apex + RTS. Bhupesh Chawda, bhupesh@apache.org, Software Engineer @ DataTorrent, Committer @ Apache Apex
  2. Agenda: Brief introduction to Apache Apex; De-duplication and related use cases; Demo!; Conclusion with Q&A
  3. Apache Apex - Stream Processing. YARN-native - uses the Hadoop YARN framework for resource negotiation. Highly scalable - scales statically as well as dynamically. Highly performant - can reach single-digit millisecond end-to-end latency. Fault tolerant - automatically recovers from failures without manual intervention. Stateful - guarantees that no state will be lost. Easily operable - exposes an easy API for developing Operators (the parts of an application) and Applications.
  4. Project History: Project development started in 2012 at DataTorrent. Open-sourced in July 2015. Apache Apex started incubation in August 2015. 50+ committers from Apple, GE, Capital One, DirecTV, Silver Spring Networks, Barclays, Ampool and DataTorrent. A top-level Apache project since April 2016.
  5. Apex Platform Overview
  6. An Apex Application is a DAG (Directed Acyclic Graph). A DAG is composed of vertices (Operators) and edges (Streams). A Stream is a sequence of data tuples which connects operators at end-points called Ports. An Operator takes one or more input streams, performs computations, and emits one or more output streams. Each operator is either the user's business logic or a built-in operator from the open-source library, and an operator may have multiple instances that run in parallel.
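For readers new to the Apex API, the following is a minimal sketch of how such a DAG is assembled in Java. The operator classes (ReaderOperator, DedupOperator, WriterOperator) and their port names are illustrative placeholders, not classes from the webinar or from Malhar; only the StreamingApplication/populateDAG wiring shown is the actual Apex API.

```java
import org.apache.hadoop.conf.Configuration;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;

@ApplicationAnnotation(name = "DedupExample")
public class DedupExampleApp implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Vertices (Operators) of the DAG; the operator classes are placeholders.
    ReaderOperator reader = dag.addOperator("Reader", new ReaderOperator());
    DedupOperator dedup = dag.addOperator("Dedup", new DedupOperator());
    WriterOperator writer = dag.addOperator("Writer", new WriterOperator());

    // Edges (Streams) connect output ports to input ports.
    dag.addStream("rawRecords", reader.output, dedup.input);
    dag.addStream("uniqueRecords", dedup.unique, writer.input);
  }
}
```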
  7. Apex as a YARN Application: YARN (Hadoop 2.0) replaces MapReduce with a more generic resource management framework. Apex uses YARN for resource management and HDFS for storing any persistent state.
  8. De-duplication (Dedup): Duplicates are very common in most data sets and appear in almost all data ingestion and cleansing use cases. The basic functionality is to separate out the duplicates in the data set.
  9. Dedup - Considerations: De-duplication seems simple and does not look like a complicated operation - just store the set of keys seen so far, i.e. ALL the unique keys. The situation becomes complicated when the source data is huge or when the incoming data is never-ending! We need to store ALL the incoming unique keys so that new duplicates can be detected, and any new key needs to be searched for in the existing set of keys. Any in-memory technique would fail in this scenario, as it would need ever more memory.
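As a point of reference, here is a minimal sketch (not from the slides) of the naive in-memory approach described above; it works for small, bounded data, but its key set, and therefore its memory footprint, grows without bound on a never-ending stream.

```java
import java.util.HashSet;
import java.util.Set;

// Naive in-memory de-duplication: keeps every unique key ever seen.
public class NaiveDedup
{
  private final Set<String> seenKeys = new HashSet<>();

  /** Returns true if the key has not been seen before, false for a duplicate. */
  public boolean isUnique(String key)
  {
    // Set.add() returns false when the key is already present.
    return seenKeys.add(key);
  }
}
```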
  10. De-duplication
  11. Managed State and Spillable Data Structures: Managed State is a fault-tolerant, large-scale, persistent bucketing mechanism that uses HDFS by default. It uses the concept of Buckets to store / hash the keys onto the storage. A Bucket is an abstraction for a collection of tuples, all of which share a common hash value - for example, a bucket of data for 5 contiguous minutes. A Bucket has a span property called the Bucket Span, and Num Buckets = Expiry Period / Bucket Span. IO from HDFS is slow, hence asynchronous calls are also supported, along with an in-memory cache so that repeated accesses are faster.
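The bucket arithmetic on this slide can be made concrete with a small sketch. This illustrates the idea only, assuming times in seconds; it is not the Managed State API itself.

```java
// Illustrative time-based bucketing: keys are assigned to a bucket by event time,
// and the bucket ring is sized so that it covers exactly one expiry period.
public class TimeBucketer
{
  private final long expiryPeriodSeconds; // how long keys must be retained
  private final long bucketSpanSeconds;   // time span covered by one bucket

  public TimeBucketer(long expiryPeriodSeconds, long bucketSpanSeconds)
  {
    this.expiryPeriodSeconds = expiryPeriodSeconds;
    this.bucketSpanSeconds = bucketSpanSeconds;
  }

  /** Num Buckets = Expiry Period / Bucket Span */
  public long numBuckets()
  {
    return expiryPeriodSeconds / bucketSpanSeconds;
  }

  /** Tuples whose event times fall in the same span share a bucket. */
  public long bucketFor(long eventTimeSeconds)
  {
    return (eventTimeSeconds / bucketSpanSeconds) % numBuckets();
  }
}
```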
  12. Dedup in Streaming Scenarios: As discussed, in-memory techniques would fail to process a never-ending stream of incoming data. To address the problems mentioned before: memory can fill up too fast, so the solution is to store the data on some scalable, fault-tolerant, persistent storage (HDFS is the default in Apex); and search becomes slow, so use some kind of hashing mechanism for storing the keys on HDFS - a plain storage layout would not work.
  13. Expiry in Dedup - Streaming Scenarios: In a streaming scenario, the search is bound to become slow eventually as we keep storing all the incoming keys. But this is not needed in most practical scenarios: in most cases, the duplicates for a tuple (record) are encountered within a small time span of the generation of the original tuple. This allows us to use the concept of expiry in streaming scenarios to purge old state, limiting the amount of state we need to persist and thereby reducing our search space. Currently, expiry based on time is supported, as time is a natural expiry key.
  14. Example schema: {Name, Phone, Email, Date, State, Zip, Country}. Tuple 1: { Austin U. Saunders, +91-319-340-59385, ausaunders@semperegestasurna.com, 2015-11-09 13:38:38, Texas, 73301, United States }. On the slide, one field is annotated as the Dedup Key and the Date field as the Expiry Key.
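For illustration, the example schema could be represented by a POJO like the one below. The class and the choice of Phone as the dedup key are assumptions made for the sake of the example; only the use of the Date field as the expiry (time) key follows from the slides.

```java
import java.util.Date;

// Illustrative POJO for the example schema {Name, Phone, Email, Date, State, Zip, Country}.
public class CustomerRecord
{
  private String name;
  private String phone;   // assumed dedup key for this example
  private String email;
  private Date date;      // expiry (time) key, e.g. 2015-11-09 13:38:38
  private String state;
  private String zip;
  private String country;

  // Getters and setters omitted for brevity.
}
```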
  15. Details on Expiry: We maintain the following points for expiry: (1) the Latest Point and (2) the Expiry Point, where Expiry Point = Latest Point - Expiry Period.
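A small sketch of this bookkeeping, assuming event times in seconds, may help; the class and method names are illustrative and not part of the operator's API.

```java
// Tracks the Latest Point and derives the Expiry Point from it.
public class ExpiryTracker
{
  private final long expiryPeriod;           // expiry period, e.g. in seconds
  private long latestPoint = Long.MIN_VALUE; // latest event time seen so far

  public ExpiryTracker(long expiryPeriod)
  {
    this.expiryPeriod = expiryPeriod;
  }

  /** Advance the Latest Point as newer tuples arrive. */
  public void observe(long eventTime)
  {
    latestPoint = Math.max(latestPoint, eventTime);
  }

  /** Expiry Point = Latest Point - Expiry Period */
  public long expiryPoint()
  {
    return latestPoint - expiryPeriod;
  }

  /** Tuples older than the Expiry Point need no dedup state or lookup. */
  public boolean isExpired(long eventTime)
  {
    return eventTime < expiryPoint();
  }
}
```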
  16. Architecture of De-duplication
  17. More on Bucketing: The amount of state kept is proportional to the Expiry Period. Bucketing is based on the Time field. Size of a bucket = the bucket span.
  18. Dedup in Streaming - Use cases: Deduplication for bounded data (batch) - parameters required: key field for de-duplication. Deduplication for unbounded data (streaming) - parameters required: key field for de-duplication, time field for expiry, expiry duration.
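For the unbounded (streaming) case, configuring Malhar's time-based dedup operator with these parameters might look roughly like the sketch below. The class and property names (TimeBasedDedupOperator, keyExpression, timeExpression, expireBefore, bucketSpan) follow the Apex Malhar dedup documentation, but treat them as assumptions and verify them against the Malhar release you use; the schema/port attributes the operator also needs are omitted here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.apex.malhar.lib.dedup.TimeBasedDedupOperator;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;

public class StreamingDedupApp implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    TimeBasedDedupOperator dedup = dag.addOperator("Dedup", new TimeBasedDedupOperator());
    dedup.setKeyExpression("phone");   // key field for de-duplication (assumed field name)
    dedup.setTimeExpression("date");   // time field used for expiry
    dedup.setExpireBefore(3600L);      // expiry duration, e.g. one hour in seconds
    dedup.setBucketSpan(60L);          // span of a single bucket, e.g. one minute
    // Input and output streams would be wired up as in the earlier DAG sketch.
  }
}
```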
  19. Demo time!
  20. Conclusion: De-duplication is an important and complex functionality provided out of the box in DataTorrent RTS and Apex Malhar. It uses Managed State for state management and asynchronous processing to maintain low latency, and uses expiry-based semantics for streaming de-duplication scenarios.
  21. Resources • Apache Apex - http://apex.apache.org/ • Subscribe to forums ᵒ Apex - http://apex.apache.org/community.html ᵒ DataTorrent - https://groups.google.com/forum/#!forum/dt-users • Download - https://datatorrent.com/download/ • Twitter ᵒ @ApacheApex; Follow - https://twitter.com/apacheapex ᵒ @DataTorrent; Follow - https://twitter.com/datatorrent • Meetups - http://meetup.com/topics/apache-apex • Webinars - https://datatorrent.com/webinars/ • Videos - https://youtube.com/user/DataTorrent • Slides - http://slideshare.net/DataTorrent/presentations • Startup Accelerator Program - Full featured enterprise product - https://datatorrent.com/product/startup-accelerator/ • Big Data Application Templates Hub - https://datatorrent.com/apphub
  22. We’re hiring! jobs@datatorrent.com - Developers/Architects, QA Automation Developers, Information Developers, Build and Release, Community Leaders
  23. Thank you!
