The document discusses the evolution of Vungle's ETL pipeline from an old architecture to a new streaming-based one. The old architecture was a legacy Java process run hourly from cron; its heap grew from 12 GB to 24 GB in under a year, a single run could take over an hour, and a failure meant rerunning the job from the beginning. The new architecture uses Spark streaming to provide horizontal scalability and real-time processing, and it decouples ingestion of raw data from processing. Events are ingested into MongoDB as they arrive and are then processed to calculate metrics and output results to various destinations.
Agenda
● Introduction
● Old Architecture
● New Architecture
● Decoupling
● Streaming
● Conclusion
● Legacy Java Process
○ “Crunches” data
○ Sends data downstream to our own datastores and to 3rd-party analytics
○ Runs every hour
● Growth
○ Process can run over an hour
○ 12 GB -> 24 GB heap in less than 1 year
○ Cron is a horrible job management system
○ A failure requires rerunning a job from the beginning (see the checkpointing sketch after this list)
● 2.0
○ Horizontally scalable
○ Real-Time ETL
○ Reusable
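The streaming design in the slides that follow addresses the rerun problem directly. As a minimal sketch (not Vungle's actual code), a Spark Streaming job can checkpoint its state and resume from the checkpoint after a crash instead of reprocessing a whole window; the checkpoint path, socket source, host, and port below are placeholder assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedJob {
  // Hypothetical checkpoint location; any HDFS/S3 path reachable by the cluster works.
  val checkpointDir = "hdfs:///etl/checkpoints"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointedJob")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // Placeholder source; count events per batch so the job has an output.
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart after a failure, resume from the checkpoint instead of
    // rerunning from the beginning, as the hourly cron job had to.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}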
ETL @ Vungle
● ~1 Billion Events / Day
● Deduplication
● Calculating $$$
● Outputting data to various destinations
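The deck lists deduplication as a pipeline responsibility without showing how it is done. One common pattern when MongoDB is the ingest store (an assumption here, not a detail from the talk) is a unique index on the event id, so a replayed event fails the insert and can be dropped; the event_id field and the etl/events names below are placeholders.

import com.mongodb.{ErrorCategory, MongoWriteException}
import com.mongodb.client.MongoClients
import com.mongodb.client.model.{IndexOptions, Indexes}
import org.bson.Document

object Dedup {
  def main(args: Array[String]): Unit = {
    val coll = MongoClients.create("mongodb://localhost:27017")
      .getDatabase("etl").getCollection("events")

    // A unique index makes Mongo itself enforce at-most-once ingestion.
    coll.createIndex(Indexes.ascending("event_id"), new IndexOptions().unique(true))

    def insertOnce(event: Document): Unit =
      try coll.insertOne(event)
      catch {
        // Duplicate key: the event was already ingested, so drop it silently.
        case e: MongoWriteException
            if e.getError.getCategory == ErrorCategory.DUPLICATE_KEY => ()
      }

    insertOnce(new Document("event_id", "evt-123").append("amount", 0.25))
    insertOnce(new Document("event_id", "evt-123").append("amount", 0.25)) // dropped
  }
}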
Set up the connection and the Spark streams, then map each line of the log into a Mongo object and insert it into Mongo (sketched below).
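Only the captions of the code slide survive in the transcript; the sketch below reconstructs the same two steps under stated assumptions: events arrive as JSON lines on a socket, and the synchronous MongoDB Java driver writes them. The host, port, batch interval, and etl/raw_events names are placeholders, not Vungle's actual values.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.mongodb.client.MongoClients
import org.bson.Document

object EventIngest {
  def main(args: Array[String]): Unit = {
    // Set up the connection and the Spark streams.
    val conf = new SparkConf().setAppName("EventIngest")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Map each line of the log into a Mongo object and insert it into Mongo.
    lines.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // One client per partition, created on the worker, so nothing
        // non-serializable is captured by the closure.
        val client = MongoClients.create("mongodb://localhost:27017")
        val coll = client.getDatabase("etl").getCollection("raw_events")
        partition.foreach(line => coll.insertOne(Document.parse(line))) // assumes JSON lines
        client.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}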
Next Steps
● Switching from JSON to Protobuf
● Using YARN to run multiple jobs on one cluster
● Data Science
● Who knows?