Presented at the Architecture Conference (ArqConf) in Buenos Aires, Argentina. Here is a 10,000-foot view of our Real-Time Bidding and Stream Processing architecture.
Jampp's Impressive Real-Time Bidding and Streaming Architecture
1. Patricio Rocca - August 2016
@patriciorocca
Real-time, scalable architectures
2. About Jampp
We are a tech company that helps companies
grow their mobile business by driving engaged
users to their apps
We are a team of 75 people, 30%
of whom are in engineering.
Located in 6 cities across the
US, Latin America, Europe and
Africa
Machine learning
Post-install event optimisation
Dynamic Product Ads and Segments
Data Science
Programmatic Buying
3. We process 220,000 bid requests per second
We process each bid request in less than 100ms
We manage 40 TB of data every day
We do real time machine learning
Jampp Architecture Impressive Facts
And… we are just a team of 22 nerds :) or :(
8. Bid Price = CPI * eCTR * eCVR * (1 - margin) * 1000 (see the sketch after this slide)
Python + Tornado + Cython + nginx (+ antigravity)
Caching, layers upon layers upon layers
Leaky bucket-ish feedback loop for pacing
With predictive local projections to account for imperfect and laggy
inter-server communication
Selective, aggregate logging
~25 TB of data generated per day makes naïve logging… unwise
Real Time Bidding Architecture (details)
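As a hedged illustration of the bid price formula above, here is a minimal Python sketch; the CPI, eCTR, eCVR and margin values in the example are made-up placeholders, not real campaign numbers.

def bid_price_cpm(cpi, ectr, ecvr, margin):
    """Bid Price = CPI * eCTR * eCVR * (1 - margin) * 1000.

    CPI is the price paid per install, eCTR and eCVR are the predicted
    click-through and conversion rates, and the final * 1000 turns the
    expected value per impression into a CPM (cost per mille) bid.
    """
    return cpi * ectr * ecvr * (1.0 - margin) * 1000.0

# Illustrative numbers only: a $2.00 CPI, 0.5% eCTR, 10% eCVR and a
# 30% margin yield a $0.70 CPM bid.
print(bid_price_cpm(cpi=2.0, ectr=0.005, ecvr=0.10, margin=0.30))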
9. In-process L1 (an LRU) serves all requests
µs-latency access is a lifesaver for real-time,
latency-constrained workloads
Local L2 in each server
Buffers responses from the L3
Saves bandwidth to/from the L3
(3 MB/s × 230 servers × 8 procs = death)
Decreases promotion latency to L1
Remote L3 provides the main distributed cache storage and avoids duplicating work across servers
Caching
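A minimal sketch of the L1/L2/L3 lookup path described above, assuming an in-process LRU for L1 and L2 and any object with a get() method (e.g. a memcached/Redis client wrapper) standing in for the remote L3; all names here are illustrative, not Jampp's actual code.

from collections import OrderedDict

class LRUCache:
    """Tiny in-process LRU cache (the L1 tier)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict least recently used

def cached_get(key, l1, l2, l3):
    """L1 -> L2 -> L3 lookup with promotion on the way back up.

    The local L2 buffers L3 responses so the 8 bidder processes on a
    box don't each hammer the remote cache, and promotion into L1
    keeps the hot path at in-process (microsecond) latency.
    """
    value = l1.get(key)
    if value is not None:
        return value
    value = l2.get(key)
    if value is None:
        value = l3.get(key)                  # remote distributed cache
        l2.put(key, value)                   # buffer the L3 response locally
    l1.put(key, value)                       # promote into the in-process tier
    return value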
10. Uses logistic regression to predict P(click | impression) or P(install | click) using context features
An online solution that incrementally learns from the Real Time Bidding events just in time
Uses regularization and the hashing trick to explore a huge feature space and keep only the statistically most informative features
Machine Learning
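A minimal sketch of what the slide describes: online logistic regression trained by SGD over a hashed feature space. The feature names, learning rate and regularization strength are illustrative assumptions; a production bidder would also hash cross-features and persist/share the weights.

import math

class OnlineLogistic:
    """Online logistic regression with the hashing trick.

    Each "key=value" feature string is hashed into one of 2**bits
    weight slots, so a huge feature space fits in a fixed-size array;
    L2 regularization shrinks uninformative weights toward zero.
    """
    def __init__(self, bits=20, lr=0.05, l2=1e-6):
        self.n = 1 << bits
        self.w = [0.0] * self.n
        self.lr = lr
        self.l2 = l2

    def _indices(self, features):
        # NOTE: hash() is per-process in Python 3; a stable hash (e.g.
        # hashlib) would be needed to share a model across servers.
        return [hash("%s=%s" % (k, v)) % self.n for k, v in features.items()]

    def predict(self, features):
        z = sum(self.w[i] for i in self._indices(features))
        z = max(min(z, 35.0), -35.0)         # clamp to avoid overflow
        return 1.0 / (1.0 + math.exp(-z))    # e.g. P(click | impression)

    def update(self, features, y):
        """One SGD step on a single labeled bid event (y in {0, 1})."""
        g = self.predict(features) - y       # gradient of the log loss
        for i in self._indices(features):
            self.w[i] -= self.lr * (g + self.l2 * self.w[i])

model = OnlineLogistic()
model.update({"os": "android", "hour": "14", "app": "abc"}, y=1)
print(model.predict({"os": "android", "hour": "14", "app": "abc"}))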
12. Stream Processing Architecture (details)
Uses Amazon Kinesis for durable streaming data and AWS Lambda for data processing
DynamoDB as temporary data storage for enrichment and analytics
S3 provides a Single Source of Truth for batch data applications
Decouples data from processing to enable multiple Big Data engines running on different clusters/infrastructure
Easy on-demand scaling via AWS
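A minimal sketch of the Kinesis -> Lambda -> DynamoDB leg of this pipeline, assuming MessagePack-encoded payloads (per the next slide) and a hypothetical events-enrichment table; the schema and names are not Jampp's actual ones.

import base64

import boto3
import msgpack

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("events-enrichment")  # hypothetical table name

def handler(event, context):
    """AWS Lambda entry point for a Kinesis event source.

    Kinesis hands Lambda its records base64-encoded; each decoded
    payload is written to DynamoDB for later enrichment/analytics.
    """
    with table.batch_writer() as batch:      # batches the PutItem calls
        for record in event["Records"]:
            payload = base64.b64decode(record["kinesis"]["data"])
            item = msgpack.unpackb(payload, raw=False)
            # NOTE: DynamoDB rejects Python floats; real code converts
            # numeric fields to Decimal first.
            batch.put_item(Item=item)
    return {"processed": len(event["Records"])}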
13. Data Push
Pick your partition key to distribute data evenly across shards
Encoding protocol matters! MessagePack offered the best trade-off between compression ratio and serialization speed
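A minimal producer-side sketch under those two constraints: MessagePack-encoded records batched into a single put_records call, with a high-cardinality partition key so no shard becomes a hot spot. The stream name and event shape are illustrative.

import uuid

import boto3
import msgpack

kinesis = boto3.client("kinesis")

def push_events(events, stream_name="bid-events"):  # hypothetical stream
    """Batch-push MessagePack-encoded events to Kinesis.

    A random UUID per record spreads data evenly across shards;
    keying on something skewed (e.g. app_id) would hot-spot a shard.
    put_records accepts at most 500 records / 5 MB per call.
    """
    records = [
        {
            "Data": msgpack.packb(event, use_bin_type=True),
            "PartitionKey": str(uuid.uuid4()),
        }
        for event in events
    ]
    response = kinesis.put_records(StreamName=stream_name, Records=records)
    return response["FailedRecordCount"]     # > 0 means retry those records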
14. Data Processing and Enrichment
Write/Read batching to reduce HTTPS protocol overhead and costs
Exponential backoff + jitter to reduce the impact of in-app event bursts sent by the tracking platforms
Increased the data retention period from 1 day (default) to 3 days on the raw data streams
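A minimal sketch of the retry policy the slide describes, using the "full jitter" variant of exponential backoff that AWS recommends; the retryable-error handling is deliberately simplified.

import random
import time

def with_backoff(call, retries=5, base=0.1, cap=10.0):
    """Retry `call` with exponential backoff and full jitter.

    Sleeping a random amount in [0, min(cap, base * 2**attempt)]
    instead of the full backoff value de-synchronizes the retry
    storms that bursts of in-app events would otherwise produce.
    """
    for attempt in range(retries):
        try:
            return call()
        except Exception:                    # real code: only throttling errors
            if attempt == retries - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# e.g. with_backoff(lambda: kinesis.put_records(StreamName=name, Records=batch))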
15. Spark + Hadoop + PrestoDB = <3
Firehose provides real-time data ingestion to S3 and auto-scaling capabilities
EMR Cluster simplifies our data processing
Spark ETLs, orchestrated by Airflow, enrich data, de-normalize it and convert JSON to Parquet
Spark Streaming for real-time anomaly
detection and fraud prevention
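A minimal sketch of the JSON-to-Parquet ETL step an Airflow task might submit to the EMR cluster; the S3 paths, the enrichment join and the partition column are placeholders, not Jampp's actual job.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("events-json-to-parquet").getOrCreate()

# Raw JSON events as delivered to S3 by Firehose (placeholder paths).
events = spark.read.json("s3://raw-bucket/events/dt=2016-08-01/")
apps = spark.read.parquet("s3://dim-bucket/apps/")

# Illustrative enrichment/de-normalization: derive columns, join dimensions.
enriched = (
    events
    .withColumn("event_hour", F.hour(F.col("timestamp").cast("timestamp")))
    .join(apps, on="app_id", how="left")
)

# Columnar Parquet is what makes the PrestoDB queries over S3 fast.
enriched.write.mode("overwrite").partitionBy("event_hour").parquet(
    "s3://clean-bucket/events/dt=2016-08-01/"
)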
16. Don't fuck with real time! (caching and Cython to the rescue)
Rent first, build later
Development and staging for Big Data projects should involve production traffic, or be prepared for trouble
PrestoDB is really amazing with regard to performance, maturity and feature set
Kinesis, DynamoDB and Firehose use HTTPS as their transport protocol, which is slow and requires aggressive batching plus exponential backoff + jitter
Monitoring, logs and alerts managed by AWS CloudWatch greatly simplify production support
Lessons Learned
Jampp is an advertising technology company founded in 2013. We do both user acquisition and user engagement through real-time bidding (a.k.a. programmatic media buying).
We built our own Demand Side Platform in Python, which processes 19B auctions per day.
The bidder calculates the bid price based on 1) a machine learning model (stochastic gradient descent) that predicts the CTR and CVR, and 2) user groups generated from user activity in the app and the probability of generating revenue within the app (user engagement).
After the user clicks on the ad and we redirect to the App Store/Google Play/a deeplink into our client's app, we lose context and go completely blind.
All our clients use a tracking platform integrated with their app to track all in-app events (user activity).
Precomputed slow-changing bundles in S3
Speeds up load of massive near-static data
Inter-process shared memory with mmap
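A minimal sketch of that last point, assuming a precomputed bundle already downloaded from S3 to local disk; mapping it read-only with mmap lets the 8 bidder processes on a box share one physical copy via the OS page cache. The path is illustrative.

import mmap

BUNDLE = "/var/cache/bidder/segments.bin"    # illustrative local path

def map_bundle(path=BUNDLE):
    """Map the precomputed bundle read-only into this process.

    Every process that maps the same file shares the same physical
    pages, so 8 bidder processes pay for one copy (and no per-process
    deserialization) instead of eight.
    """
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

buf = map_bundle()
header = buf[:16]                            # slices read straight from shared pages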
The Uber Engineering Team published a great analysis comparing JSON, ujson, Protocol Buffers, Thrift and MessagePack; MessagePack came out the winner
Exponential backoff beats no backoff at all, and jitter (adding randomness) prevents retries from synchronizing
Increasing the data retention period costs $0.020 per shard-hour, which is almost nothing compared to losing data
RDBMSs can fit a lot of use cases initially: unified log, OLAP, near real-time processing (but they don't scale)