SlideShare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.
SlideShare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.
Successfully reported this slideshow.
Activate your 14 day free trial to unlock unlimited reading.
1.
Enriching the data vs Filtering in Spark
Gokul Prabagaren
Master Software Engineer
CapitalOne
2.
About Me
● Master Software Engineer @ Capitalone
● Building Spark Applications since Spark 1.2
● Contributor of @CapitalOneTech Medium
blogs on Big Data processing
● @gocoolp on Twitter
● @gocool_p on LinkedIn
3.
Agenda
● Rewards use case in CapitalOne
● Filtering Approach
● Issues with Filtering Approach
● How Enriching approach solves the issue
● Conclusion & Questions
4.
Rewards Use case in CapitalOne
▪ CapitalOne develops its software as Open
source first in cloud.
▪ We will be operating fully on Cloud soon!
▪ We use Apache Spark extensively for variety
of batch,streaming and machine-learning
workloads.
5.
Rewards Use case in CapitalOne
▪ Use case
▪ One of Core Credit Card Rewards Spark
Application.
▪ Consumes daily credit card transactions
and computes the Rewards
6.
Filtering the data Approach
● This approach uses Spark inner-join
at each stage
8.
Issues with Filtering Approach
▪ Hard to debug the application post deployment
▪ Back tracing of data is not possible as computation happens in-memory
▪ Counts at each stage can only provide how many got processed.But not why the remaining got
dropped in that stage.
How did we overcome these issues ?
9.
Enriching the data approach
● This approach uses Spark left-outer join
● Instead of filtering the data from dataset at
each stage.Enriching approach keeps
enriching the data from right side dataset
12.
Advantage of Enriching over filtering
● Data from each stage is enriched into original dataset.It captures the state information,makes it
easy to debug/analyse later
● Same data columns/flags captured at each stage gives more granular details to know why
particular data got dropped at that stage
● No need of additional costly counts action at each stage.
13.
Conclusion
● We made the switch to use Enriching approach in our Spark job in production.
● It is successfully processing millions of credit card transaction daily.
● Awarding millions of miles,cash and points as Rewards to Capital One customers.
0 likes
Be the first to like this
Views
Total views
347
On SlideShare
0
From Embeds
0
Number of Embeds
0
You have now unlocked unlimited access to 20M+ documents!
Unlimited Reading
Learn faster and smarter from top experts
Unlimited Downloading
Download to take your learnings offline and on the go
You also get free access to Scribd!
Instant access to millions of ebooks, audiobooks, magazines, podcasts and more.
Read and listen offline with any device.
Free access to premium services like Tuneln, Mubi and more.