
Building a real-time, scalable and intelligent programmatic ad buying platform

After a brief introduction to programmatic ads and RTB, we go through the evolution of Jampp's data platform to handle the enormous amount of data we need to process.


  1. Building a real-time, scalable and intelligent programmatic ad buying platform. Martín Bonamico, Juan Martín Pampliega
  2. Agenda 1. Jampp 2. Adtech, RTB, clicks, installs, events 3. Initial Architecture 4. Initial Architecture Characteristics 5. Evolution of Data Needs 6. New Data Infrastructure - Stream Processing 7. Key Takeaways
  3. Jampp and AdTech
  4. Jampp is a leading mobile app marketing and retargeting platform. Founded in 2013, Jampp has offices in San Francisco, London, Berlin, Buenos Aires, São Paulo and Cape Town. We help companies grow their business by seamlessly acquiring, engaging & retaining mobile app users.
  5. Jampp’s platform combines machine learning with big data for programmatic ad buying that optimizes towards in-app activity. Our platform processes 200,000+ RTB ad bid requests per second (17+ billion per day), which amounts to about 300 MB/s, or 25 TB of data per day.
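The daily totals follow directly from the per-second rates; a quick back-of-envelope check:

```python
# Back-of-envelope check of the throughput figures quoted above.
SECONDS_PER_DAY = 24 * 60 * 60                    # 86,400

requests_per_day = 200_000 * SECONDS_PER_DAY      # 17,280,000,000 -> "17+ billion"
terabytes_per_day = 300 * SECONDS_PER_DAY / 1e6   # 300 MB/s -> 25.92 TB/day
print(requests_per_day, terabytes_per_day)
```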
  6. How do programmatic ads work? [Flow diagram: the user taps an ad served via a Source / Exchange; the Jampp Tracking Platform redirects to the App Store / Google Play; the user downloads and installs the app; install postbacks flow back to the tracking platform and the source.]
  7. RTB: Real-Time Bidding
  8. Jampp Events 1. RTB: a. Auction: the exchange asks if we want to bid for the impression. b. Bid/Non-Bid: bid with a price, or non-bid (in less than 80ms). c. Impression: the ad is displayed to the user. 2. Non-RTB: a. Click: marks when the user clicks on the ad. b. Install: registered on the first app open after the install. c. Event: in-app events like purchase, view, favorite.
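To make the auction step concrete, here is a minimal, hypothetical bid/no-bid decision: estimate how likely the impression is to produce an install, turn that into a CPM price, and bid only if it clears the floor. The function and field names are illustrative, not Jampp's actual bidder API.

```python
# A minimal, illustrative bid/no-bid sketch (not Jampp's production code).
# `estimate_install_probability` and the request fields are assumptions.
def handle_auction(request, estimate_install_probability, value_per_install_usd=2.0):
    p_install = estimate_install_probability(request)
    bid_cpm = p_install * value_per_install_usd * 1000  # expected value per 1,000 impressions
    if bid_cpm < request.get("bidfloor", 0.0):           # OpenRTB floor price (CPM)
        return None                                      # non-bid
    return {"price": bid_cpm, "imp_id": request["imp_id"]}
```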
  9. Data @ Jampp ● Our platform started using RDBMSs and a traditional Data Warehouse architecture on Amazon Web Services. ● Data grew exponentially and data needs became more complex. ● In the last year alone, in-app events grew 2,500%+ and RTB bids 500%+. ● This made us evolve our architecture to be able to effectively handle Big Data.
  10. Initial Data Architecture
  11. Initial Jampp Infrastructure [Architecture diagram: Cupper instances (C1…Cn) behind a load balancer record clicks, installs, events and click redirects in MySQL; bidder instances (B1…Bn) handle auctions, bids and impressions; a replicator feeds MySQL data to the bidders; PostgreSQL backs the API (pivots).]
  12. Jampp Initial Systems: Bidder ● OpenRTB bidding system implementation that runs on 200+ virtual machines with 70GB RAM each. ● Strong latency requirements: less than 80ms to answer a request. ● Written in Cython and uses ZMQ for communication. ● Heavy use of coherent caching to comply with latency requirements. ● Data is continually replicated and enriched from MySQL by the replicator process.
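A rough Python sketch of the bidder's request/reply shape: a ZMQ worker answering each bid request from an in-process cache. The production system is Cython; the message fields and names here are assumptions.

```python
# Illustrative ZMQ request/reply worker in the spirit of the bidder;
# names and message shapes are assumptions, not the production protocol.
import zmq

campaign_cache = {}  # kept coherent from MySQL by the replicator process

def serve(endpoint="tcp://*:5555"):
    sock = zmq.Context().socket(zmq.REP)
    sock.bind(endpoint)
    while True:
        request = sock.recv_json()                         # one parsed bid request
        campaign = campaign_cache.get(request.get("app_id"))
        if campaign is None:
            sock.send_json({"bid": None})                  # non-bid
        else:
            sock.send_json({"bid": campaign["cpm"], "creative": campaign["creative_id"]})
```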
  13. Jampp Initial Systems: Cupper ● Event tracking system written in Node.js. ● Tracks clicks, installs and in-app events (200+ million per day). ● Can be scaled horizontally (10 instances) and is located behind a load balancer (ELB). ● Uses a MySQL database to store attributed events and Kinesis to store organics.
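Cupper itself is Node.js; the Python sketch below only illustrates the routing rule from the last bullet, with assumed stream, table and field names.

```python
# Route attributed events to MySQL and organic events to Kinesis.
# Stream, table and field names are illustrative assumptions.
import json
import boto3

kinesis = boto3.client("kinesis")

def track_event(event, db):
    if event.get("click_id"):              # attributed: we tracked the originating click
        db.execute(
            "INSERT INTO events (click_id, type, payload) VALUES (%s, %s, %s)",
            (event["click_id"], event["type"], json.dumps(event)),
        )
    else:                                  # organic: no click to attribute to
        kinesis.put_record(
            StreamName="organic-events",
            Data=json.dumps(event),
            PartitionKey=event["device_id"],
        )
```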
  14. Jampp Initial Systems: API ● PostgreSQL is used as a Data Warehouse database, separate from the databases the bidder uses. ● An API exposes the data for querying, with a caching layer. ● Fact tables are maintained at hourly, daily and monthly granularity, and high-cardinality dimensions are removed from large fact tables for data older than 15 days. ● Data is continually aggregated through an aggregation process written in Python.
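The shape of such a roll-up, sketched against an assumed schema (the real table and column names are not in the slides):

```python
# Hourly roll-up into a PostgreSQL fact table; schema names are assumptions.
ROLLUP_HOURLY = """
    INSERT INTO fact_events_hourly (hour, campaign_id, country, event_type, events)
    SELECT date_trunc('hour', created_at), campaign_id, country, event_type, count(*)
    FROM events
    WHERE created_at >= %(from_ts)s AND created_at < %(to_ts)s
    GROUP BY 1, 2, 3, 4
"""

def aggregate_hour(conn, from_ts, to_ts):
    """Aggregate one hour of raw events into the hourly fact table."""
    with conn.cursor() as cur:
        cur.execute(ROLLUP_HOURLY, {"from_ts": from_ts, "to_ts": to_ts})
    conn.commit()
```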
  15. Evolution of the Data Architecture
  16. Emerging Needs ● Log forensics capabilities, as our systems and company scale and we integrate with more outside systems. ● More historical and granular data for advanced analytics and model training. ● The need arose to make the data readily available to other systems beyond the traditional RDBMSs. Some of these systems are too demanding for an RDBMS to handle easily.
  17. Initial Evolution [Architecture diagram: Cupper instances (C1…Cn) behind a load balancer still record clicks, installs, events and click redirects in MySQL; ELB logs and MySQL data are loaded into an EMR Hadoop cluster, queried through AirPal.]
  18. New System Characteristics ● The new system was based on Amazon Elastic MapReduce. ● Data is imported hourly from RDBMSs with Sqoop. ● Logs are imported every 10 minutes from different sources into S3 tables. ● Facebook PrestoDB and Apache Spark are used for interactive log analysis and analytics.
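As an example of the interactive analysis this enables, a minimal PySpark query over the S3 logs; the bucket, path and column names are placeholders.

```python
# Minimal PySpark sketch of an interactive log query on EMR;
# bucket, path and column names are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("log-forensics").getOrCreate()

clicks = spark.read.json("s3://example-logs/clicks/2016/01/01/*")
(clicks
 .filter(F.col("country") == "BR")    # e.g. one country's traffic
 .groupBy("campaign_id")
 .count()
 .orderBy(F.desc("count"))
 .show(20))
```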
  19. New System Characteristics ● Scalable storage and processing capabilities using HDFS, YARN and Hive for ETLs and data storage. ● Connectors for different languages like Python, Julia and Java/Scala. ● Data archiving in S3 for long-term storage, enabling other data processing technologies.
  20. Aspects that needed improvement ● Data still imported in batch mode; the delay for MySQL data was larger than with the Python replicator. ● EMR is not great for long-running clusters. ● The EMR cluster is not designed with strong multi-user capabilities. It is better to have multiple clusters with few users than a big one with many. ● Data still being accumulated in RDBMSs (clicks, installs, events).
  21. Final stage of the evolution ● Real-time event processing architecture based on best practices for stream processing on AWS. ● Uses Amazon Kinesis for streaming data storage and AWS Lambda for data processing. ● DynamoDB and Redis are used for temporary data storage for enrichment and analytics. ● S3 gives us a source of truth for batch data applications, and Kinesis for stream processing.
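A hedged sketch of the Kinesis-to-Lambda-to-DynamoDB step; the table name and record fields are assumptions for illustration.

```python
# AWS Lambda handler for a Kinesis event source, writing to DynamoDB.
# The table name and record fields are illustrative assumptions.
import base64
import json
import boto3

table = boto3.resource("dynamodb").Table("enriched-events")

def handler(event, context):
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        table.put_item(Item={
            "device_id": payload["device_id"],   # assumed partition key
            "event_type": payload["event_type"],
            "raw": json.dumps(payload),          # keep the original for enrichment
        })
```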
  22. Our Real-Time Architecture
  23. Still, it isn’t perfect... ● There is no easy way to manage windows and out-of-order data with AWS Lambda. ● Consistency of DynamoDB and S3. ● Price of AWS managed services at large event volumes, compared to custom-maintained solutions. ● The ACID guarantees of RDBMSs are not an easy thing to part with. ● SQL and indexes in RDBMSs make forensics easier.
  24. Benefits of the Evolution ● Enables the use of stream processing frameworks to keep data as fresh as economically possible. ● Decouples data from processing, enabling multiple Big Data engines running on different clusters/infrastructure. ● Easy on-demand scaling provided by AWS managed tools like AWS Lambda, DynamoDB and EMR. ● Monitoring, logs and alerts managed by AWS CloudWatch.
  25. Big Data Technologies at Jampp: S3, HDFS, Hadoop/YARN, Lambda, DynamoDB
  26. Key Takeaways ● Ad tech is a technologically intensive market that exhibits the three Vs of Big Data. ● As the business’s data needs grow in complexity, specialized data systems need to be put in place. ● Using technologies that are meant to scale easily and are managed by a third party can bring you peace of mind. ● Stream processing is fundamental in new Big Data projects. ● There is currently no one tool that clearly fulfills all the needs for scalable and correct stream processing.
  27. References http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-101.html http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-102.html https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying http://blog.confluent.io/2015/01/29/making-sense-of-stream-processing/ JAMPP - AGRANDA 2015: http://44jaiio.sadio.org.ar/sites/default/files/agranda14-30.pdf
  28. Questions? geeks.jampp.com We Are Hiring! - jampp.com/jobs.php martin.bonamico@jampp.com juan@jampp.com
