Beeswax, which provides a real-time Bidder-as-a-Service for programmatic digital advertising, will talk about how they built a feature-rich, real-time streaming data solution on AWS using Amazon Kinesis, Amazon Redshift, Amazon S3, and AWS Data Pipeline. Beeswax will discuss key components of their solution, including scalable data capture, a streaming message hub for archival, data warehousing, near-real-time analytics, and real-time alerting.
Real-Time Streaming Data Solution on AWS with Beeswax
1. POWERING THE NEXT GENERATION OF REAL-TIME BIDDING
Confidential please do not distribute info@beeswax.com
2. Outline of the talk
- What is Beeswax and real-time bidding (RTB)?
- Our System Architecture
- Why we chose Amazon Kinesis
- Problem 1: Collecting very high volume streams
- Problem 2: Stream data transformation and fan out
- Problem 3: Joining streams and aggregation
3. Who are we?
An ad tech startup based in NYC,
founded by ex-Googlers
4. We do RTB (real-time bidding)
Publisher → Ad Exchange → Beeswax Bidder (auction)
- Step 1: Publisher sends ad request & user ID to the exchange
- Step 2: Exchange broadcasts the bid request to bidders
- Step 3: Bidders submit bid & ad markup to the auction
- Step 4: Winning ad is shown to the user (< 200 ms end to end)
Bidder scale: O(M) QPS; p99 latency: 20 ms
For each bid request, the bidder can:
- Target campaigns
- Target user profiles
- Optimize for ROI
- Customize
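As a toy illustration of the four steps above, one broadcast-and-auction round can be sketched as follows. All names, prices and markup strings are made up for illustration; real exchanges run second-price or first-price auctions with far more machinery:

```python
# Toy model of the RTB flow: the exchange broadcasts a bid request to
# bidders, collects (price, ad_markup) bids, and runs a second-price
# auction: the highest bidder wins but pays the second-highest price.

def run_auction(bid_request, bidders):
    """Broadcast `bid_request`, collect bids, return (clearing_price, markup)."""
    bids = []
    for bidder in bidders:
        bid = bidder(bid_request)            # Steps 2-3: broadcast, collect bids
        if bid is not None:
            bids.append(bid)                 # (price_cpm, ad_markup)
    if not bids:
        return None                          # no bidder wanted this impression
    bids.sort(key=lambda b: b[0], reverse=True)
    winner_markup = bids[0][1]
    clearing_price = bids[1][0] if len(bids) > 1 else bids[0][0]
    return clearing_price, winner_markup     # Step 4: serve winner's ad

# Two toy bidders: one targets mobile traffic only, one bids on everything.
mobile_bidder = lambda req: (2.50, "<mobile-ad/>") if req.get("device") == "mobile" else None
open_bidder = lambda req: (1.75, "<generic-ad/>")

result = run_auction({"device": "mobile", "user_id": "u123"},
                     [mobile_bidder, open_bidder])
```

Here the mobile bidder wins at the open bidder's price of 1.75, which is the defining property of a second-price auction.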
5. Building a bidder is very hard
Need scale to deliver campaigns
- To reach the desired audience, bidder needs to process at least 1M QPS
- Deployment has to be in multiple regions to guarantee reach
Performance
- The timeout from ad exchanges is 100 ms, including the RTT over the internet
- 99th-percentile tail latency for processing a bid request is 20 ms
Complex ecosystem
- Manage integrations with ad exchanges, third-party data providers and vendors
- Requires a lot of domain expertise to optimize the bidder and maximize performance
6. A difficult trade-off
Use a DSP: limited to no customization; platform lock-in
Build your own bidder: risky investment of time and money with no guarantee of success
7. Our First Product: The Bidder-as-a-Service™
A full-stack solution deployed for each customer in a sandbox
Pre-built ecosystem and supply relationships:
- Cookies, mobile IDs, 3rd-party data
- Bidding and targeting engine
- Campaign management UI/API
- Reporting UI/API
Services you control:
- Custom bidding algos
- Log-level streaming
- RESTful APIs
- Direct connections to customer-hosted services
A fully managed ad tech platform on AWS
8. Our System Architecture
Bid Data Producer and Impression & Click Data Producer → Event Stream (Amazon Kinesis Streams) → Streaming Message Hub, which fans out to:
- Customer HTTPS endpoint
- Customer Stream (Amazon Kinesis Streams)
- Amazon S3 bucket
- Amazon Redshift cluster
- Customer API
9. What is Amazon Kinesis Streams?
Fully managed service from Amazon to collect and process large streams of
data records in real time
Key Concepts for Amazon Kinesis Streams
- Each Stream is made of shards
- Each shard is a unit of parallelism and throughput (write at 1 MB/s and
read at 2 MB/s)
- Data producers call PutRecord(s) to send data, along with a partition key, to an
Amazon Kinesis stream
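As a sketch of the producer side, the request body a PutRecords call expects can be shaped as below. The stream name, event fields and the `build_put_records_request` helper are illustrative; in production the resulting dict would be passed to an AWS SDK client such as boto3's `kinesis.put_records`:

```python
import json

def build_put_records_request(stream_name, events, key_field="auction_id"):
    """Shape a batch of events into the Records list the Kinesis
    PutRecords API expects. Each record carries a partition key,
    which determines the shard it is written to."""
    return {
        "StreamName": stream_name,
        "Records": [
            {
                "Data": json.dumps(event).encode("utf-8"),
                "PartitionKey": str(event[key_field]),
            }
            for event in events
        ],
    }

req = build_put_records_request("bid-stream",
                                [{"auction_id": "a1", "price": 1.2}])
# In production (tooling assumption, not from the talk):
#   boto3.client("kinesis").put_records(**req)
```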
10. Why we chose Amazon Kinesis
Infrastructure requirements motivated by RTB use cases
- Ingestion at very large scale (> 1M QPS)
- Low latency delivery
- Reliable store of data
- Sequenced retrieval of events
Options available for consideration
1. Apache Kafka on EC2
2. Amazon Kinesis Streams
Reasons for choosing Kinesis
- Fully managed by AWS: a really important factor for small engineering teams
- Supports the scale necessary for RTB
- Pricing model provided opportunities to optimize cost
11. Problem 1: Collecting very high volume streams
Bids: O(M) QPS → Filtering and Sampling → Filtered bids (Amazon Kinesis Streams)
Challenges
- Collection at very high scale (QPS > 1M)
- Minimize infrastructure cost
- Minimize delivery latency for stream output
Listening Bidders
- Filter the very high QPS bid stream using boolean targeting expressions
- Sample the filtered stream and deliver it
12. Solution 1: Optimized Data Producers
Cost vs. Reliability Tradeoff
- Uploads are priced in PUT payload units of 25 KB
- Buffer incoming records and pack them into a single PUT payload
- Possible data loss if the application crashes before the buffer is flushed
- Be creative! We use AWS Elastic Load Balancer (ELB) access logs to replay requests
Throughput vs. Latency
- Buffering increases throughput, as more data is uploaded per API call
- It increases average latency; not a concern for very high QPS collectors
- Flush buffers periodically even if not full, to cap latency
Consider overall system cost
- Compression can reduce payload size but increases data producer CPU usage
- Evaluate the compression vs. CPU cost tradeoff; e.g. we chose Snappy over gzip
Choose uniformly distributed partition keys
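The buffering and flush-cap ideas above can be sketched as a small record packer. The 25 KB payload unit and the crash-before-flush caveat come from the slide; the class, thresholds and the use of zlib (standing in for Snappy, which needs a third-party library) are illustrative:

```python
import time
import zlib

MAX_PAYLOAD = 25 * 1024        # Kinesis PUT payload pricing unit: 25 KB
MAX_BUFFER_AGE_SEC = 1.0       # flush even a partial buffer to cap latency

class BufferingProducer:
    """Pack many small records into one PUT payload to cut per-unit cost.
    Trade-off: records buffered in memory are lost if the process crashes
    before flush (hence the ELB-access-log replay mentioned above)."""

    def __init__(self, flush_fn, compress=True):
        self.flush_fn = flush_fn       # callback receiving one packed payload
        self.compress = compress
        self.buffer = []
        self.size = 0
        self.last_flush = time.monotonic()

    def put(self, record: bytes):
        if self.size + len(record) > MAX_PAYLOAD:
            self.flush()               # payload full: ship it
        self.buffer.append(record)
        self.size += len(record)
        if time.monotonic() - self.last_flush > MAX_BUFFER_AGE_SEC:
            self.flush()               # periodic flush caps added latency

    def flush(self):
        if self.buffer:
            payload = b"\n".join(self.buffer)
            if self.compress:
                payload = zlib.compress(payload)   # stand-in for Snappy
            self.flush_fn(payload)
        self.buffer, self.size = [], 0
        self.last_flush = time.monotonic()

payloads = []
producer = BufferingProducer(payloads.append, compress=False)
for i in range(3):
    producer.put(b"bid-record-%d" % i)
producer.flush()   # three records packed into a single payload
```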
13. Problem 2: Data transformation and fan out
Event Stream → Transform and Fan Out → Amazon Kinesis streams, Amazon S3
Challenges
- Config-driven system to determine the format, schema and destination of each record
- Maximize resource utilization by scaling elastically with incoming stream volume
- Monitoring and operating the service
API driven, transparent and flexible platform
- Provide very detailed log-level data to all our customers
- Support multiple delivery destinations and data formats
14. What is Kinesis Client Library?
- Open source library from AWS
https://github.com/awslabs/amazon-kinesis-client
- Makes it easy to consume and process data from Amazon Kinesis
Streams
- Helps with elastic scale-out and fault-tolerant processing
- Abstracts away complex tasks associated with distributed computing like
- load balancing of shards
- shard mapping between EC2 hosts
- checkpointing
- shard-level monitoring using CloudWatch
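A minimal sketch of the processing contract the KCL gives your code: it assigns shards to workers and calls your processor with batches of records, and the processor checkpoints once its output is durable. The class names below only mimic the shape of a KCL record processor and are not the real library API:

```python
# Simplified model of the KCL contract: process a batch, then checkpoint.
# A worker crash replays records from the last checkpoint, giving
# at-least-once processing (duplicates are removed downstream).

class Checkpointer:
    """Stand-in for the KCL checkpointer backed by DynamoDB."""
    def __init__(self):
        self.sequence = None
    def checkpoint(self, sequence):
        self.sequence = sequence

class RecordProcessor:
    def __init__(self, emit_fn):
        self.emit_fn = emit_fn

    def process_records(self, records, checkpointer):
        for seq, data in records:          # (sequence_number, payload)
            self.emit_fn(data)
        # Checkpoint only after records are flushed downstream, so
        # nothing acknowledged is ever lost on worker failure.
        checkpointer.checkpoint(records[-1][0])

out = []
cp = Checkpointer()
proc = RecordProcessor(out.append)
proc.process_records([(1, "bid"), (2, "click")], cp)
```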
15. Solution 2: API driven Streaming Message Hub
- Kinesis Client Library (KCL) application deployed to an EC2 Auto Scaling group
- Adapters perform schema and data format transformations
- Emitters buffer data in memory and flush periodically to the destination
- The stream is checkpointed after records are flushed by emitters
- CloudWatch alarms on CPU utilization elastically resize the fleet
Kinesis Record → Adapters (BidAdapters, WinAdapters, ClickAdapters, ...) → Emitters (S3Emitter, KinesisEmitter, HTTPEmitter, ...)
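The adapter/emitter pipeline above can be sketched as config-driven routing: each event type maps to a transformation and a destination. The function, class and config names are illustrative, not Beeswax's actual code:

```python
import json

def bid_adapter(record):            # schema/format transformation per type
    return json.dumps({"type": "bid", **record})

def click_adapter(record):
    return json.dumps({"type": "click", **record})

class ListEmitter:
    """Stand-in for S3Emitter/KinesisEmitter/HTTPEmitter: buffers records
    in memory and flushes them to a destination (here, a plain list)."""
    def __init__(self, destination):
        self.destination = destination
        self.buffer = []
    def emit(self, record):
        self.buffer.append(record)
    def flush(self):
        self.destination.extend(self.buffer)
        self.buffer = []

# Config-driven routing: event type -> (adapter, emitter)
s3_out, customer_out = [], []
routes = {
    "bid":   (bid_adapter, ListEmitter(s3_out)),
    "click": (click_adapter, ListEmitter(customer_out)),
}

def fan_out(event_type, record):
    adapter, emitter = routes[event_type]
    emitter.emit(adapter(record))

fan_out("bid", {"auction_id": "a1"})
fan_out("click", {"auction_id": "a1"})
for _, emitter in routes.values():
    emitter.flush()
```

Adding a new destination or format is then a config change (a new routes entry) rather than a code change, which is the point of the API-driven design.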
16. Streaming Message Hub design tradeoffs
Single Reader vs Multiple Readers
- Separate reader for every format & destination instead of a single reader
- Having separate readers improves fault tolerance
- However, CPU cost of parsing records is minimized with single reader
Amazon EC2 vs AWS Lambda
- AWS Lambda could be used instead of self-managed Auto Scaling/EC2
- Spot instances deeply cut the costs of the self-managed solution
- The rich set of Kinesis stream metrics simplified monitoring and management of the service
Amazon Kinesis Streams vs Amazon Kinesis Firehose
- Firehose does not support record-level fan out or arbitrary data transformations
- With those enhancements, it would be preferred over self-managed Auto Scaling/EC2
17. Operating the Streaming Message Hub
Scale: ~400 shards, 300 MB/sec
Use Amazon CloudWatch metrics published by Kinesis Streams
Kinesis capacity alert
- Alert upon approaching 80% of capacity
- Manually reshard Kinesis using KinesisScalingUtils
Reader falling behind alert
- Alert if the average Kinesis iterator age is greater than 20 sec
- Also alert at the shard level using KCL's per-shard metrics (shard starvation)
- Ensure the reader application is up, examine its custom metrics and triage
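The two alerts above reduce to simple threshold checks over CloudWatch metrics (in production, Kinesis publishes the inputs as stream metrics such as IncomingBytes and iterator age). The 1 MB/s per-shard write limit is the documented Kinesis value; the 80% and 20 s thresholds come from the slide, and the function names are illustrative:

```python
SHARD_WRITE_LIMIT_BPS = 1_000_000   # Kinesis write limit: 1 MB/s per shard

def capacity_alert(num_shards, incoming_bytes_per_sec, threshold=0.80):
    """Alert when the stream uses more than 80% of provisioned write
    capacity, signaling it is time to reshard (e.g. KinesisScalingUtils)."""
    utilization = incoming_bytes_per_sec / (num_shards * SHARD_WRITE_LIMIT_BPS)
    return utilization > threshold

def iterator_age_alert(avg_iterator_age_sec, threshold_sec=20):
    """Alert when the reader falls behind the tip of the stream."""
    return avg_iterator_age_sec > threshold_sec

# ~400 shards at 300 MB/s write is ~75% utilization: just under the alarm,
# which matches the scale quoted on this slide.
at_quoted_scale = capacity_alert(400, 300_000_000)
```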
18. Problem 3: Joining and aggregation
Bids, Impressions, and Clicks/Conversions → Joining and Aggregation
Challenges
- Supporting exactly-once semantics, i.e. eliminating all duplicates
- Minimize end-to-end latency from capture to joining & aggregation
- Be robust to delays between arrival times of correlated events
High level value added services
- Joined data directly feeds into model building pipelines, e.g. for clicks
- The Reporting API, powered by an ETL pipeline, provides aggregated metrics
19. Solution 3: Stream joins using Redshift
- The Message Hub emits separate log files into S3 for each event type
- AWS Data Pipeline periodically loads the log files into Redshift
- Redshift tables of different event types are joined via a primary key
- FastPath: joined events in 15 min, but can miss delayed events
- SlowPath: fully joined events after 24 hours
Streaming Message Hub → S3 Buckets → Redshift (loads scheduled by AWS Data Pipeline)
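The join stage's logic can be sketched in miniature: events of different types arrive separately, duplicates of the same primary key are dropped (the exactly-once requirement), and rows are joined on that key. In the real pipeline this is a Redshift SQL join over tables loaded from S3; the pure-Python model below, with made-up field names, only shows the semantics:

```python
def join_events(bids, impressions, clicks):
    """Join three event streams on auction_id; later duplicates of the
    same key within a stream are ignored (dedup before join)."""
    def dedupe(events):
        by_id = {}
        for e in events:
            by_id.setdefault(e["auction_id"], e)   # keep first occurrence
        return by_id

    imps_by_id = dedupe(impressions)
    clicks_by_id = dedupe(clicks)
    joined = []
    for auction_id, bid in dedupe(bids).items():
        row = dict(bid)                            # bid fields as the base row
        row["impression"] = auction_id in imps_by_id
        row["click"] = auction_id in clicks_by_id
        joined.append(row)
    return joined

# A duplicate bid record for "a1" is collapsed to one output row.
rows = join_events(
    bids=[{"auction_id": "a1", "price": 1.2},
          {"auction_id": "a1", "price": 1.2}],
    impressions=[{"auction_id": "a1"}],
    clicks=[],
)
```

The FastPath/SlowPath split then corresponds to running this join over a 15-minute window (fast, may miss late events) versus a 24-hour window (complete).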
20. Stream join design trade-offs
Joins are not truly streaming in the current design
- The 15 min batch size is dictated by the lowest scheduling interval of Data Pipeline
- Lambda could be used instead of Data Pipeline to lower the interval
- Data loaded into Redshift cannot easily be fed back into Kinesis Streams
- However, it scales well, is fully AWS managed and supports many of our use cases
What are the alternatives?
1. Spark Streaming via Amazon EMR
2. Amazon Kinesis Analytics
21. Summary
Building real-time bidding (RTB) applications is very challenging
Beeswax provides a managed platform to build RTB apps on AWS
Beeswax uses Amazon Kinesis as its platform for streaming data
The Beeswax platform solves key streaming data challenges:
- Supports event collection at very large scale
- Config-driven platform for data transformation and fan out
- Supports joining of streams and aggregation of metrics
Trade-offs are unique to each application; Beeswax is optimized for RTB