Processing 19 billion messages in real time and NOT dying in the process
About these 2 guys ;-)
Juan Martín Pampliega
● Sr. Data Engineer, Jampp
● Assistant Professor @ITBA
● Fav Beer: Paulaner Hefe-Weißbier

● Principal Engineer, Jampp
● User & contributor to Spark
● Fav Beer: Guinness
We are a tech company that helps companies grow
their mobile business by driving engaged users to
their apps.
We are a team of 75 people, 30% of whom are
in the engineering team.
Located in 6 cities across the
US, Latin America, Europe and beyond.
Dynamic Product Ads
We have to be fast!
Our platform combines machine learning with big data
for programmatic ad buying that optimizes towards in-app events.
We process 220,000+ RTB ad bid requests per second (19+
billion per day), which amounts to about 40 TB of data per day.
How do programmatic ads work?
RTB: Real Time Bidding
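The RTB flow can be sketched as a toy bid responder. Everything here is illustrative, not Jampp's actual bidder: field names, the scoring rule, and the deadline value are assumptions; the key idea is that exchanges expect an answer within a strict time budget, so a slow decision must degrade to a no-bid.

```python
import json
import time

BID_TIMEOUT_SECONDS = 0.1  # typical exchange deadline budget (~100 ms)

def decide_bid(request: dict, started_at: float) -> dict:
    """Return a bid response, or a no-bid if over the time budget."""
    if time.monotonic() - started_at > BID_TIMEOUT_SECONDS:
        return {"bid": False}  # too late: exchanges drop slow answers
    # Toy scoring model: bid only on known apps, price at floor + margin.
    floor = request.get("bidfloor", 0.0)
    if request.get("app_id") in {"app-1", "app-2"}:
        return {"bid": True, "price": round(floor * 1.2, 4)}
    return {"bid": False}

request = json.loads('{"app_id": "app-1", "bidfloor": 0.5}')
response = decide_bid(request, started_at=time.monotonic())
```

At 220,000+ requests per second, even this trivial per-request logic has to be multiplied across many concurrent bidder instances.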
Data @ Jampp
● Started using RDBMSs on Amazon Web Services.
● Data grew exponentially.
● Last year: 1,500%+ growth in in-app events, 500%+ in RTB bids.
● We needed to evolve our architecture.
New System Characteristics
● The new system was based on Amazon Elastic MapReduce
using Facebook PrestoDB and Apache Hive.
● Data imported hourly from RDBMSs with Sqoop.
● Logs are imported every 10 minutes from different
sources to S3 tables.
● Scalable storage and processing capabilities using
HDFS, S3, YARN and Hive for ETLs and data storage.
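The hourly Sqoop import could be scripted along these lines; the JDBC connection string, table name, and S3 path are made-up placeholders, and real jobs would be driven by a scheduler rather than an inline call.

```python
import subprocess
from datetime import datetime, timedelta

def build_sqoop_import(table: str, hour: datetime) -> list:
    """Build a sqoop command importing one hour of rows from MySQL."""
    start = hour.strftime("%Y-%m-%d %H:00:00")
    end = (hour + timedelta(hours=1)).strftime("%Y-%m-%d %H:00:00")
    return [
        "sqoop", "import",
        "--connect", "jdbc:mysql://db.example.com/ads",  # hypothetical host
        "--table", table,
        "--where", f"created_at >= '{start}' AND created_at < '{end}'",
        "--target-dir", f"s3://bucket/{table}/{hour:%Y/%m/%d/%H}/",
        "--num-mappers", "4",
    ]

cmd = build_sqoop_import("clicks", datetime(2016, 5, 1, 13))
# subprocess.run(cmd, check=True)  # invoked by the hourly scheduler
```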
Aspects that needed improvement
● Data was still imported in batch mode. Delay was larger
for MySQL data than with the Python replicator.
● EMR is not great for long-running clusters.
● Data was still being accumulated in RDBMSs (clicks, …).
● Lots of joins were still needed for day-to-day data analysis.
● Continuous event processing of unbounded datasets.
● Makes data available in real time and evens out the load of event spikes:
○ Persistent log for fault tolerance.
○ Scalable processing layer to deal with
spikes in throughput.
○ Allows data enrichment and handling of out-of-order events.
Current state of the evolution
● Real-time event processing architecture based on
best practices for stream processing in AWS.
● Uses Amazon Kinesis for durable streaming data
and AWS Lambda for data processing.
● DynamoDB and Redis are used as temporary data
stores for enrichment and analytics.
● S3 provides us with a Source of Truth for batch data
applications and Kinesis for stream processing.
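A minimal sketch of the Kinesis-to-Lambda path described above, assuming JSON payloads. The `lookup` callable is a stand-in for a real DynamoDB or Redis client (e.g. a boto3 `get_item` call), and the field names are illustrative.

```python
import base64
import json

def decode_records(event: dict) -> list:
    """Kinesis delivers data base64-encoded inside the Lambda event."""
    return [
        json.loads(base64.b64decode(r["kinesis"]["data"]))
        for r in event["Records"]
    ]

def handler(event, context=None, lookup=lambda app_id: {"name": "unknown"}):
    """Decode a batch of Kinesis records and enrich each message."""
    enriched = []
    for msg in decode_records(event):
        msg["app"] = lookup(msg.get("app_id"))  # enrichment step
        enriched.append(msg)
    # A real handler would then write to Firehose/S3 for the batch layer.
    return enriched
```

Because Lambda scales out per shard, this per-batch enrichment absorbs throughput spikes without any cluster management.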
Benefits of the Evolution
● Enables the use of stream processing frameworks to
keep data as fresh as economically possible.
● Decouples data from processing to enable multiple Big
Data engines running on different clusters.
● Easy on-demand scaling provided by AWS managed tools
like AWS Lambda, DynamoDB and EMR.
● Monitoring, logs and alerts managed by AWS.
Still, it isn’t perfect...
● There is no easy way to manage windows and out-of-order data
with AWS Lambda.
● Eventual consistency of DynamoDB and S3.
● Price of AWS managed services
for large event volumes
compared to custom-maintained alternatives.
● ACID guarantees of RDBMSs are
not an easy thing to part with.
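Since Lambda offers no built-in windowing, the event-time bookkeeping has to be built by hand. A sketch of tumbling windows with a fixed allowed lateness; in a real deployment the state would live in DynamoDB or Redis between invocations, and the numbers here are illustrative.

```python
from collections import defaultdict

WINDOW_SECONDS = 60
ALLOWED_LATENESS_SECONDS = 30

class TumblingWindows:
    def __init__(self):
        self.counts = defaultdict(int)  # window start -> event count
        self.watermark = 0              # highest event time seen so far

    def add(self, event_time: int) -> bool:
        """Count the event; return False if it arrived too late."""
        self.watermark = max(self.watermark, event_time)
        window = event_time - event_time % WINDOW_SECONDS
        if window + WINDOW_SECONDS + ALLOWED_LATENESS_SECONDS <= self.watermark:
            return False  # window already closed, event is dropped
        self.counts[window] += 1
        return True

w = TumblingWindows()
w.add(10)        # window [0, 60)
w.add(65)        # window [60, 120)
w.add(200)       # watermark jumps ahead to 200
late = w.add(5)  # [0, 60) closed once watermark passed 90 -> dropped
```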
● SQL and indexes in RDBMSs make forensics easier.
● Traditional data warehouses are still relevant. To
cope with data volumes: lose detail.
● RDBMSs can fit a lot of use cases initially: unified
log, OLAP, near real-time processing.
● Take full advantage of your public cloud provider's
offerings (rent first, build later).
● Development and staging for Big Data projects
should involve production traffic or be prepared for
surprises at production scale.
● PrestoDB is really amazing with regard to
performance, maturity and feature set.
● Kinesis, DynamoDB and Firehose use HTTPS as the
transport protocol, which is slow and adds per-request overhead.
● All Amazon services require automatic retries with
exponential back-off + jitter.
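The back-off rule in the last bullet can be sketched as "full jitter" retries: each failed attempt sleeps a random time drawn from an exponentially growing range. Parameter values here are illustrative, not AWS defaults.

```python
import random
import time

def call_with_retries(fn, max_attempts=5, base=0.05, cap=2.0):
    """Call fn(), retrying with exponential back-off plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            # Full jitter: sleep a random time in [0, min(cap, base * 2^n)).
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Example: a flaky call that succeeds on the third attempt.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

result = call_with_retries(flaky)  # retries until flaky() returns "ok"
```

The jitter matters: without it, many throttled clients retry in lockstep and hammer the service again at the same instant.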