
Processing 19 billion messages in real time and NOT dying in the process



This is an introduction to Jampp's data processing architecture. We walk through our journey of migrating to systems that allow us to process more data in real time.

Published in: Technology


  1. 1. Processing 19 billion messages in real time and not dying in the process. Juan Martín Pampliega, Fernando Otero
  2. 2. About these 2 guys ;-) ● Juan (@juanpampliega): Software Engineer. Sr. Data Engineer, Jampp. Assistant Professor @ITBA. Fav Beer: Paulaner Hefe-Weißbier. ● Fernando (@fotero): Computer Scientist. Principal Engineer, Jampp. Spark-Jobserver committer. User & contributor to Spark since 1.0. Fav Beer: Guinness.
  3. 3. About Jampp ● We are a tech company that helps companies grow their mobile business by driving engaged users to their apps. ● We are a team of 75 people, 30% in the engineering team. ● Located in 6 cities across the US, Latin America, Europe and Africa. ● Machine learning ● Post-install event optimisation ● Dynamic Product Ads and Segments ● Data Science ● Programmatic Buying
  4. 4. We have to be fast! Our platform combines machine learning with big data for programmatic ad buying that optimizes towards in-app activity. We process 220,000+ RTB ad bid requests per second (19+ billion per day), which amounts to about 40 TB of data per day.
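
The figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes decimal terabytes and, purely for illustration, that the 40 TB is spread evenly over the bid requests; the slide does not say which data the 40 TB covers, so the per-request size is only a rough order of magnitude.

```python
# Back-of-the-envelope check of the throughput figures quoted on the slide.
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400 seconds
requests_per_second = 220_000         # from the slide
data_per_day_bytes = 40 * 10**12      # 40 TB, assuming decimal terabytes

requests_per_day = requests_per_second * SECONDS_PER_DAY
# Assumption: all 40 TB is attributed to bid requests (the slide does not say so).
avg_bytes_per_request = data_per_day_bytes / requests_per_day

print(f"{requests_per_day:,} requests/day")                # 19,008,000,000 requests/day
print(f"{avg_bytes_per_request:.0f} bytes per request")    # 2104 bytes per request
```

So 220,000 requests per second is indeed a little over 19 billion per day, and the daily volume works out to roughly 2 KB per request under the stated assumption.
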
  5. 5. How do programmatic ads work? [Diagram labels: AppStore / Google Play; Postback; Exchange; Jampp Tracking; Tracking Platform]
  6. 6. RTB: Real Time Bidding [Diagram labels: Ad Request; OpenRTB (100 ms); OpenRTB (100 ms); OpenRTB bid $3; OpenRTB bid $2; Win notification $2.01; Ad (auction won)]
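
The numbers in the diagram (bids of $3 and $2, a win notification of $2.01) are consistent with a second-price auction, where the highest bidder wins and pays the runner-up's bid plus a small increment. The following is a minimal sketch of that rule only; the function name, the bidder names and the one-cent increment are assumptions for illustration, not part of OpenRTB or of Jampp's systems.

```python
from decimal import Decimal

def run_second_price_auction(bids, increment=Decimal("0.01")):
    """Return (winner, clearing_price) for a dict of bidder -> bid amount."""
    if not bids:
        raise ValueError("at least one bid is required")
    # Rank bidders from highest to lowest bid.
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner, top_bid = ranked[0]
    if len(ranked) == 1:
        # Simplification: with a single bidder no floor price is modeled.
        return winner, top_bid
    runner_up_bid = ranked[1][1]
    # The winner pays just above the runner-up, but never more than its own bid.
    return winner, min(top_bid, runner_up_bid + increment)

# Hypothetical bidders reproducing the amounts shown on the slide.
winner, price = run_second_price_auction(
    {"bidder_a": Decimal("3.00"), "bidder_b": Decimal("2.00")}
)
print(winner, price)  # bidder_a 2.01
```

`Decimal` is used instead of floats so that monetary amounts compare exactly.
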
  7. 7. Data @ Jampp ● Started using RDBMSs on Amazon Web Services. ● Data grew exponentially. ● Growth over the last year: 1500%+ in in-app events, 500%+ in RTB bids. ● We needed to evolve our architecture.
  8. 8. Initial Data Architecture
  9. 9. Initial Jampp Infrastructure [Diagram labels: C1, C2, Cn; Cupper; Load Balancer; MySQL; Click; Install; Event; PostgreSQL; B1, B2, Bn; Replicator; API (Pivots); Auctions; Bids; Impressions; User Segments]
  10. 10. Emerging Needs ● Increase log forensics capabilities. ● More historical and granular data. ● Make data readily available to other systems. ● MySQL could not be scaled easily.
  11. 11. Evolution of our Data Architecture
  12. 12. Initial Evolution [Diagram labels: C1, C2, Cn; Cupper; Load Balancer; MySQL; Click; Install; Event; Click Redirect; ELB Logs; C1, C2, Cn; EMR - Hadoop Cluster; AirPal; User Segments]
  13. 13. New System Characteristics ● The new system was based on Amazon Elastic MapReduce (EMR), using Facebook PrestoDB and Apache Spark. ● Data was imported hourly from the RDBMSs with Sqoop. ● Logs were imported every 10 minutes from different sources into S3 tables. ● Scalable storage and processing capabilities using HDFS, S3, YARN and Hive for ETLs and data storage.
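
One common way to land logs in S3-backed tables on a fixed schedule is to map each event timestamp to the prefix of the batch it belongs to. The sketch below illustrates that idea for the 10-minute interval mentioned on the slide; the `logs/<source>/dt=.../hour=.../minute=...` layout and the function name are hypothetical and are not taken from the deck.

```python
from datetime import datetime, timezone

BATCH_MINUTES = 10  # import interval quoted on the slide

def batch_prefix(event_time: datetime, source: str) -> str:
    """Map an event timestamp to the (hypothetical) S3 prefix of its 10-minute batch."""
    utc = event_time.astimezone(timezone.utc)
    # Round the minute down to the start of its 10-minute bucket.
    bucket_minute = (utc.minute // BATCH_MINUTES) * BATCH_MINUTES
    return f"logs/{source}/dt={utc:%Y-%m-%d}/hour={utc:%H}/minute={bucket_minute:02d}/"

ts = datetime(2016, 5, 1, 13, 27, 45, tzinfo=timezone.utc)
print(batch_prefix(ts, "elb"))  # logs/elb/dt=2016-05-01/hour=13/minute=20/
```
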
  14. 14. Aspects that needed improvement ● Data was still imported in batch mode. The delay for MySQL data was larger than with the Python Replicator. ● EMR is not great for long-running clusters. ● Data was still being accumulated in the RDBMSs (clicks, installs, events). ● Lots of joins were still needed for day-to-day data analysis work.
  15. 15. Current Architecture
  16. 16. Stream processing ● Continuous event processing of unbounded datasets. ● Makes data available in real time and evens out the load of data processing. ● Requires: ○ Persistent log for fault tolerance. ○ Scalable processing layer to deal with spikes in throughput. ○ Support for data enrichment and for handling out-of-order data.
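
To illustrate why a persistent log gives fault tolerance, here is a minimal sketch in which a consumer keeps a checkpoint (the next offset to read) and a replacement worker resumes from it after a crash. `PersistentLog`, `consume` and the in-memory `state` dictionary are illustrative stand-ins; in a real system both the log and the checkpoint would live in durable storage.

```python
class PersistentLog:
    """In-memory stand-in for a durable, append-only log."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        return self._records[offset:]


def consume(log, state, crash_after=None):
    """Sum records starting at state['checkpoint']; optionally simulate a crash."""
    processed = 0
    for record in log.read_from(state["checkpoint"]):
        if crash_after is not None and processed == crash_after:
            raise RuntimeError("simulated worker crash")
        state["total"] += record
        state["checkpoint"] += 1  # checkpoint advances together with the result
        processed += 1


log = PersistentLog()
for value in [5, 7, 11, 13]:
    log.append(value)

state = {"checkpoint": 0, "total": 0}  # assumed to be stored durably
try:
    consume(log, state, crash_after=2)  # worker dies after two records
except RuntimeError:
    pass
consume(log, state)  # a new worker resumes from the checkpoint
print(state)  # {'checkpoint': 4, 'total': 36}
```

Because the records stay in the log, nothing is lost or counted twice as long as the checkpoint and the result are updated together.
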
  17. 17. A Real-Time Architecture
  18. 18. Current state of the evolution ● Real-time event processing architecture based on best practices for stream processing in AWS. ● Uses Amazon Kinesis for durable streaming data and AWS Lambda for data processing. ● DynamoDB and Redis are used as temporary data storage for enrichment and analytics. ● S3 is our source of truth for batch data applications, and Kinesis for stream processing.
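
As a rough sketch of the Kinesis plus Lambda pattern described above: Lambda delivers Kinesis records in batches, with each payload base64-encoded under `record["kinesis"]["data"]`. The handler below decodes a batch and enriches each message. The JSON message shape, the `app_id` field and the in-memory `APP_NAMES` table (standing in for a DynamoDB or Redis lookup) are assumptions made for illustration.

```python
import base64
import json

# Hypothetical enrichment table; a real handler would query DynamoDB or Redis.
APP_NAMES = {"app-1": "Example App"}

def handler(event, context):
    """Decode the Kinesis records in a Lambda event and enrich each message."""
    enriched = []
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded inside the Lambda event.
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)  # assumes JSON-encoded messages
        message["app_name"] = APP_NAMES.get(message.get("app_id"), "unknown")
        enriched.append(message)
    return enriched

# Local invocation with a hand-built event, for illustration only.
raw = json.dumps({"app_id": "app-1", "type": "click"}).encode()
sample_event = {"Records": [{"kinesis": {"data": base64.b64encode(raw).decode()}}]}
result = handler(sample_event, None)
print(result)  # [{'app_id': 'app-1', 'type': 'click', 'app_name': 'Example App'}]
```
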
  19. 19. Our Real-Time Architecture
  20. 20. Benefits of the Evolution ● Enables the use of stream processing frameworks to keep data as fresh as economically possible. ● Decouples data from processing to enable multiple Big Data engines running on different clusters/infrastructure. ● Easy on-demand scaling provided by AWS managed tools like AWS Lambda, Amazon DynamoDB and Amazon EMR. ● Monitoring, logs and alerts managed by Amazon CloudWatch.
  21. 21. Still, it isn’t perfect... ● There is no easy way to manage windows and out-of-order data with AWS Lambda. ● Consistency of DynamoDB and S3. ● Price of AWS managed services at high event volumes compared to self-maintained solutions. ● The ACID guarantees of RDBMSs are not an easy thing to part with. ● SQL and indexes in RDBMSs make forensics easier.
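
To make the windowing problem concrete, here is a minimal sketch of what a stream processor has to track to count events in tumbling windows when events arrive out of order: open windows, a watermark, and a policy for events that arrive too late. The class, the 60-second window and the 30-second allowed lateness are illustrative assumptions; this is the kind of state that has to be managed by hand, in external storage, when each function invocation is stateless.

```python
from collections import defaultdict

WINDOW_SECONDS = 60     # tumbling window size (assumed)
ALLOWED_LATENESS = 30   # how far behind the newest event we still accept data

class TumblingWindowCounter:
    def __init__(self):
        self.open_windows = defaultdict(int)  # window start -> event count
        self.closed = {}                      # finalized window start -> count
        self.watermark = 0                    # newest event time minus lateness
        self.dropped = 0                      # events that arrived too late

    def add(self, event_time):
        window_start = event_time - event_time % WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= self.watermark:
            self.dropped += 1  # its window was already finalized
            return
        self.open_windows[window_start] += 1
        self.watermark = max(self.watermark, event_time - ALLOWED_LATENESS)
        # Finalize every window that ends at or before the watermark.
        for start in sorted(self.open_windows):
            if start + WINDOW_SECONDS <= self.watermark:
                self.closed[start] = self.open_windows.pop(start)

counter = TumblingWindowCounter()
for t in [10, 50, 70, 40, 130, 20]:  # event times in seconds, out of order
    counter.add(t)

print(counter.closed)              # {0: 3}
print(dict(counter.open_windows))  # {60: 1, 120: 1}
print(counter.dropped)             # 1
```

The event at `t=40` arrives after `t=70` but is still counted because its window is open; the event at `t=20` arrives after the watermark has passed its window and is dropped.
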
  22. 22. Lessons Learned ● Traditional data warehouses are still relevant. To cope with data volumes, give up some detail. ● RDBMSs can fit a lot of use cases initially: unified log, OLAP, near real-time processing. ● Take full advantage of your public cloud provider’s offerings (rent first, build later). ● Development and staging for Big Data projects should involve production traffic, or be prepared for trouble.
  23. 23. Lessons Learned ● PrestoDB is really amazing with regard to performance, maturity and feature set. ● Kinesis, DynamoDB and Firehose use HTTPS as their transport protocol, which is slow and requires aggressive batching. ● All Amazon services require automatic retries with exponential back-off + jitter.
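
The last point can be sketched as follows: retry a failing call, doubling the delay cap on each attempt and sleeping for a random time below the cap ("full jitter"), so that many clients do not retry in lockstep. The function name, default delays and the `flaky_put` operation are assumptions for illustration; production code would catch only the throttling or transient errors of the client library in use.

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying with exponential back-off and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # illustration only: narrow this to transient errors
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: uniform delay in [0, cap)

# Hypothetical operation that is throttled twice before succeeding.
attempts = {"count": 0}

def flaky_put():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("throttled")
    return "ok"

delays = []  # record the delays instead of sleeping, so the example runs instantly
result = call_with_retries(flaky_put, sleep=delays.append)
print(result, len(delays))  # ok 2
```

Passing `sleep` and `rng` as parameters keeps the retry policy easy to test without real waiting.
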
  24. 24. Questions?