AWS re:Invent 2016: Beeswax: Building a Real-Time Streaming Data Platform on AWS (BDM403)


Amazon Kinesis is a platform of services for building real-time, streaming data applications in the cloud. Customers can use Amazon Kinesis to collect, stream, and process real-time data such as website clickstreams, financial transactions, social media feeds, application logs, location-tracking events, and more. In this session, we first cover best practices for building end-to-end streaming data applications using Amazon Kinesis. Next, Beeswax, which provides real-time Bidder as a Service for programmatic digital advertising, will talk about how they built a feature-rich, real-time streaming data solution on AWS using Amazon Kinesis, Amazon Redshift, Amazon S3, Amazon EMR, and Apache Spark. Beeswax will discuss key components of their solution including scalable data capture, messaging hub for archival, data warehousing, near real-time analytics, and real-time alerting.

Published in: Technology

  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Ryan Nienhuis, Sr. Technical Product Manager, Amazon Kinesis Ram Kumar Rengaswamy, co-founder and CTO, Beeswax November 29, 2016 BDM403 Beeswax Building a Real Time Streaming Data Platform on AWS
  2. 2. What to Expect from the Session • Introduction to Amazon Kinesis as a platform for real time streaming data on AWS • Key considerations for building an end to end streaming platform using Amazon Kinesis Streams • Introduction to Beeswax real time bidding platform built on AWS using Amazon Kinesis, Amazon Redshift, Amazon S3, and AWS Data Pipeline • Deep dive into best practices for streaming data using these services
  3. 3. An unbounded sequence of events that is continuously captured and processed with low latency. What is streaming data?
  4. 4. Amazon Kinesis: Streaming Data Made Easy Services make it easy to capture, deliver, process streams on AWS Amazon Kinesis Streams Amazon Kinesis Analytics Amazon Kinesis Firehose
  5. 5. Amazon Kinesis Streams • Easy administration • Build real time applications with framework of choice • Low cost
  6. 6. Amazon Kinesis Firehose • Zero administration • Direct-to-data store integration • Seamless elasticity
  7. 7. Amazon Kinesis Analytics • Apply SQL on streams • Build real-time, stream processing applications • Easy scalability
  8. 8. Key Concepts for Amazon Kinesis Streams
  9. 9. Amazon Kinesis Streams Key Concepts Data Sources App.4 [Machine Learning] AWSEndpoint App.1 [Aggregate & De-Duplicate] Data Sources Data Sources Data Sources App.2 [Metric Extraction] App.3 [Sliding Window Analysis] Availability Zone Shard 1 Shard 2 Shard N Availability Zone Availability Zone Data Producers Amazon Kinesis stream Data Consumers Downstream systems Amazon S3 Amazon Redshift AWS Lambda Amazon Kinesis Analytics
  10. 10. An Amazon Kinesis stream • Streams are made of shards • Each shard is a unit of parallelism and throughput • Serves as a durable temporal buffer with data stored 1 - 7 days • Scale by splitting and merging shards
  11. 11. Putting Data into an Amazon Kinesis stream • Data producers call PutRecord(s) to send data to an Amazon Kinesis stream • Partition key determines which shard the data is stored in • Each shard supports 1 MB/sec in / 2 MB/sec out • Each record gets a unique sequence number • Options for writing: AWS SDKs, Amazon Kinesis Producer Library (KPL), Amazon Kinesis agent, FluentD, Flume, and more… Producer Producer Producer Producer Producer Producer Producer Kinesis stream Shard 1 Shard 2 Shard 3 Shard 4 Shard n
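
The partition-key bullet above can be illustrated with a small sketch. Kinesis Streams maps each partition key to a 128-bit integer with an MD5 hash, and each shard owns a range of that hash space. The helper names `hash_key` and `shard_for` are hypothetical, and the even split of the hash space across shards is an assumption that only holds for a stream that has not been resharded unevenly.

```python
import hashlib

HASH_SPACE = 2 ** 128  # an MD5 digest is 128 bits wide


def hash_key(partition_key: str) -> int:
    """Interpret the MD5 digest of the partition key as a 128-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key: str, shard_count: int) -> int:
    """Index of the shard whose hash range contains the key.

    Assumes the hash space is split evenly across shard_count shards.
    """
    return hash_key(partition_key) * shard_count // HASH_SPACE


if __name__ == "__main__":
    for key in ("event-1", "event-2", "event-3"):
        print(key, "-> shard", shard_for(key, 4))
```

Because the shard is a pure function of the key, records that share a key always land on the same shard, while random or uniformly distributed keys spread load across all shards.
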
  12. 12. Key considerations for data producers • Connectivity - Lost connectivity and latency fluctuations • Durability – Capture most or all records in event of failure • Efficiency – Producer’s primary job is often not collection • Distributed – Record ordering and retry strategies Most customers choose to do some buffering and use a random partition key; many strategies for failover
  13. 13. Getting Data from an Amazon Kinesis stream • Consumer applications read each shard continuously using GetRecords and determine where to start using GetShardIterator • Read model is per shard • Increasing the number of shards increases scalability but reduces processing locality • Options: Amazon Kinesis Client Library (KCL) on Amazon EC2, Amazon Kinesis Analytics, AWS Lambda, Spark Streaming (Amazon EMR), Storm on EC2, and more… Kinesis stream Shard 1 Shard 2 Shard 3 Shard 4 Shard n Consumer Consumer Consumer
  14. 14. Amazon Kinesis Client Library (KCL) • Open source and available for Java, Ruby, Python, and Node.js developers • Deploy on your EC2 instances; scales easily with Elastic Beanstalk • Two important components: 1. Record Processor – Processing unit that processes data from a shard in Amazon Kinesis Streams 2. Worker – Processing unit that maps to each application instance • Key features include load balancing, shard mapping, checkpointing, and CloudWatch monitoring
  15. 15. Key considerations for data consumer apps • Scale - Have ready mechanisms for increasing parallelism and adding compute • Availability - Always be reading latest data and monitor stream position • Accuracy - Implement at least once processing logic, exactly once at destination (if you need it) • Speed - Scale test your logic to ensure linear scalability • Replay - Have retry strategy
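
The accuracy bullet above (at least once processing, exactly once at the destination) can be sketched as follows. This is an illustrative model only, not the KCL API: `IdempotentSink`, `process_batch`, and the dictionary used as a checkpoint are hypothetical names. The idea is that the consumer may replay records after a failure, and the destination absorbs the replay by keying writes on the sequence number.

```python
class IdempotentSink:
    """Destination that ignores records it has already stored."""

    def __init__(self):
        self.rows = {}

    def write(self, sequence_number, payload):
        # Keyed write: replaying the same record does not create a duplicate.
        self.rows.setdefault(sequence_number, payload)


def process_batch(records, sink, checkpoint):
    """Write every record first, then advance the checkpoint.

    If the process dies between the writes and the checkpoint update,
    the batch is read again on restart (at least once delivery).
    """
    for sequence_number, payload in records:
        sink.write(sequence_number, payload)
    if records:
        checkpoint["last"] = records[-1][0]


if __name__ == "__main__":
    sink, checkpoint = IdempotentSink(), {}
    batch = [(1, "bid"), (2, "click")]
    process_batch(batch, sink, checkpoint)
    process_batch(batch, sink, checkpoint)  # simulated replay after a crash
    print(len(sink.rows), checkpoint)
```
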
  16. 16. Key considerations for the end-to-end solution • Use cases - Start with a simple one, progress to more advanced • Data variety – Must support different data formats and schema; centrally or decentralized management • Integrations – Determine guarantees and where to apply back pressure • Fanning out or in – Determine whether to use multiple consumers, multiple streams, or both
  17. 17. Beeswax Powering the next generation of real-time bidding
  18. 18. Who are we? Startup based out of NYC, founded by ex-Googlers We are hiring!
  19. 19. We do RTB (Real-time bidding) Publisher Ad Exchange Beeswax Bidder Scale: O(M) QPS Latency_99 : 20 ms - Target campaigns - Target user profiles - Optimize for ROI - Customize < 200 ms Step 1: Send ad request & userid Step 2: Broadcast bid request Step 3: Submit bid & ad markup Step 4: Show ad to user Auction
  20. 20. Building a bidder is very hard Need scale to deliver campaigns • To reach the desired audience, bidder needs to process at least 1M QPS • Deployment has to be in multiple regions to guarantee reach Performance • The timeout from ad exchanges is 100ms including the RTT over internet • 99%ile tail latency for processing a bid request is 20ms Complex ecosystem • Manage integrations with ad exchanges, third-party data providers and vendors • Requires a lot of domain expertise to optimize the bidder for maximizing performance
  21. 21. A difficult trade-off Build your own Bidder: Risky investment of time and $ with no success guarantee Use a DSP: Limited to no customization; Platform lock in
  22. 22. Our First Product: The Bidder-as-a-Service™ A full-stack solution deployed for each customer in a sandbox Services you control Pre-built ecosystem and supply relationships Cookies, Mobile ID’s, 3rd Party Data Bidding and Targeting Engine Campaign Management UI/API Reporting UI/API Custom bidding algos Log-level streaming RESTful APIs Direct connections to customer-hosted services Fully managed ad tech platform on
  23. 23. Outline of the talk • System architecture • Why we chose Amazon Kinesis • Challenge 1: Collecting very high volume streams • Challenge 2: Stream data transformation and fan out • Challenge 3: Joining streams and aggregation
  24. 24. Beeswax System Architecture Event Stream Impression & Click Data Producer Bid Data Producer Streaming Message Hub Customer Stream HTTP POST S3 Bucket Amazon Redshift Customer API
  25. 25. Why we chose Amazon Kinesis? Infrastructure requirements motivated by RTB use cases • Ingestion at very large scale (> 1M QPS) • Low latency delivery • Reliable store of data • Sequenced retrieval of events Options available for consideration 1. Amazon Kinesis 2. Apache Kafka on EC2 Reasons to choose Amazon Kinesis • Fully managed by AWS; Really important factor for small engineering teams • Supports the scale necessary for RTB • Pricing model provided opportunities to optimize cost
  26. 26. Problem 1: Collecting high volume streams Listening Bidders • Filter very high QPS bid stream using Boolean targeting expressions • Sample filtered stream and deliver Challenges • Collection at very high scale (QPS > 1M) • Minimize infrastructure cost • Minimize delivery latency for stream output ( < 10s) Filtering and Sampling Bids: O(M) QPS Filtered bid stream
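
A minimal sketch of the filter-then-sample step described above, under stated assumptions: the slide does not show how Beeswax represents targeting expressions, so the nested-tuple format, the field names, and the functions `matches` and `sampled` are all hypothetical. Sampling is done with a hash of the bid ID so that the decision is deterministic and repeatable.

```python
import zlib


def matches(bid, expression):
    """Evaluate a nested ("and" | "or" | "not" | "eq", ...) tuple against a bid dict."""
    op = expression[0]
    if op == "eq":
        _, field, value = expression
        return bid.get(field) == value
    if op == "not":
        return not matches(bid, expression[1])
    if op == "and":
        return all(matches(bid, sub) for sub in expression[1:])
    if op == "or":
        return any(matches(bid, sub) for sub in expression[1:])
    raise ValueError(f"unknown operator: {op}")


def sampled(bid_id, rate):
    """Keep a bid when its CRC32 hash, scaled to [0, 1), falls below the rate."""
    return zlib.crc32(bid_id.encode("utf-8")) / 2 ** 32 < rate


if __name__ == "__main__":
    expression = ("and", ("eq", "country", "US"), ("not", ("eq", "device", "desktop")))
    bid = {"id": "bid-1", "country": "US", "device": "mobile"}
    if matches(bid, expression) and sampled(bid["id"], 0.1):
        print("deliver", bid["id"])
```
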
  27. 27. Solution 1: Optimized Data Producers Cost vs Reliability Tradeoff • Uploads are priced per PUT payload unit of 25 KB • Buffer incoming records and pack them into a single PUT payload • Possible data loss if application crashes before buffer is flushed • Be creative! We use ELB logs to replay requests to our collector Consider overall system cost • Compression can reduce data payload size but increases data producer CPU usage • Evaluate compression vs cost tradeoff. For example, we chose Snappy over gzip
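
The cost argument for packing records can be worked through with a little arithmetic. The sketch below assumes that "25K" means 25 KB (taken here as 25 × 1024 bytes) and that each record is billed in whole payload units, rounded up; the record sizes in the example are made up for illustration.

```python
import math

PUT_UNIT_BYTES = 25 * 1024  # assumption: one PUT payload unit is 25 KB


def unbuffered_units(record_sizes):
    """Units billed when every record is sent on its own (each rounds up)."""
    return sum(math.ceil(size / PUT_UNIT_BYTES) for size in record_sizes)


def packed_units(record_sizes):
    """Units billed when records are packed into as few payloads as possible."""
    return math.ceil(sum(record_sizes) / PUT_UNIT_BYTES)


if __name__ == "__main__":
    sizes = [500] * 100  # one hundred hypothetical 500-byte bid records
    print("unbuffered:", unbuffered_units(sizes))
    print("packed:", packed_units(sizes))
```

With these example sizes, sending each record alone costs one unit per record, while packing fits the same 50,000 bytes into two units.
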
  28. 28. Solution 1: Optimized Data Producers Throughput vs Latency • Buffering increases throughput as more data is uploaded per API call • Increases average latency; Not a concern for very high QPS collectors • Flush buffers periodically even if not full, to cap latency Choose uniformly distributed partition keys
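
The throughput versus latency bullets above describe a buffer that flushes either when it is full or when its oldest record has waited too long. A minimal sketch follows; the class name `BufferedProducer`, its parameters, and the `send` callback are hypothetical, and the payload is a plain concatenation, so a real producer would also need record framing.

```python
import time


class BufferedProducer:
    """Packs records into one payload; flushes on size or on age."""

    def __init__(self, send, max_bytes=25 * 1024, max_age_sec=1.0, clock=time.monotonic):
        self._send = send              # callable that uploads one payload
        self._max_bytes = max_bytes
        self._max_age_sec = max_age_sec
        self._clock = clock            # injectable for testing
        self._buffer = []
        self._size = 0
        self._oldest = None

    def add(self, record: bytes):
        # Flush first if this record would overflow the payload.
        if self._buffer and self._size + len(record) > self._max_bytes:
            self.flush()
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(record)
        self._size += len(record)
        self.tick()

    def tick(self):
        """Call periodically: flushes a partial buffer to cap latency."""
        if self._buffer and self._clock() - self._oldest >= self._max_age_sec:
            self.flush()

    def flush(self):
        if self._buffer:
            self._send(b"".join(self._buffer))
            self._buffer, self._size, self._oldest = [], 0, None


if __name__ == "__main__":
    producer = BufferedProducer(send=lambda payload: print(len(payload), "bytes sent"))
    for _ in range(3):
        producer.add(b"x" * 10_000)
    producer.flush()
```

Anything still in the buffer when the process crashes is lost, which is the reliability cost the previous slide mentions.
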
  29. 29. Problem 2: Data transformation and fan out API driven, transparent and flexible platform • Provide very detailed log level data to all our customers • Support multiple delivery destinations and data formats Challenges • Config driven system to determine format, schema and destination of each record • Maximize resource utilization by scaling elastically to stream volume • Monitoring and operating the service Transform and Fan Out Event Stream
  30. 30. Solution 2: API-driven Streaming Message Hub • KCL application deployed to Auto Scaling group • CloudWatch alarms on CPU utilization elastically resize fleet • Adapters perform schema and data format transformations • Emitters buffer data in-memory and flush periodically to destination • Stream is checkpointed after records are flushed by emitters Kinesis Record BidAdapters WinAdapters ClickAdapters ... S3Emitter HTTPEmitter KinesisEmitter ...
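
The adapter and emitter flow above can be sketched in a few lines. This is not Beeswax's code: `ListEmitter`, `to_json`, `to_csv`, `run_hub`, and the `routes` table are hypothetical stand-ins for the config-driven adapters and the S3 or HTTP emitters named on the slide. The point it shows is the ordering: transform, buffer, flush, and only then checkpoint.

```python
import json


class ListEmitter:
    """Stand-in for an S3 or HTTP emitter: buffers in memory, flushes to a list."""

    def __init__(self):
        self.buffer, self.delivered = [], []

    def emit(self, item):
        self.buffer.append(item)

    def flush(self):
        self.delivered.extend(self.buffer)
        self.buffer.clear()


def to_json(record):
    return json.dumps(record, sort_keys=True)


def to_csv(record):
    return ",".join(str(record[key]) for key in sorted(record))


def run_hub(records, routes, checkpoint):
    """routes maps an event type to a list of (adapter, emitter) pairs."""
    used, last_seq = set(), None
    for seq, record in records:
        for adapter, emitter in routes.get(record["type"], []):
            emitter.emit(adapter(record))   # transform, then buffer
            used.add(emitter)
        last_seq = seq
    for emitter in used:
        emitter.flush()
    # Checkpoint only after every emitter has flushed, so a crash before
    # this line causes a replay rather than data loss.
    if last_seq is not None:
        checkpoint(last_seq)


if __name__ == "__main__":
    s3, http = ListEmitter(), ListEmitter()
    routes = {"bid": [(to_csv, s3)], "click": [(to_csv, s3), (to_json, http)]}
    records = [(1, {"type": "bid", "id": "a"}), (2, {"type": "click", "id": "b"})]
    run_hub(records, routes, checkpoint=lambda seq: print("checkpoint", seq))
    print(s3.delivered, http.delivered)
```
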
  31. 31. Streaming message hub design tradeoffs Single reader vs multiple readers • Separate reader for every format & destination instead of a single reader • Having separate readers improves fault tolerance • However, CPU cost of parsing records is minimized with single reader EC2 vs Lambda • Use AWS Lambda instead of self-managed Auto Scaling • Spot Instances deeply cut down the costs of self-managed solution • Rich set of Amazon Kinesis stream metrics simplified monitoring and management of service
  32. 32. Streaming message hub design tradeoffs Amazon Kinesis Streams versus Amazon Kinesis Firehose • Firehose does not support record level fan out or arbitrary data transformations • With above enhancements, it would be preferred over self-managed Auto Scaling in EC2
  33. 33. Operating streaming message hub Scale: ~300 shards, 250 MB/sec Use CloudWatch metrics published by Amazon Kinesis Streams Amazon Kinesis capacity alert • Alert upon approaching 80% capacity • Manually reshard Amazon Kinesis using KinesisScalingUtils (or new scaling API) Reader falling behind alert • Alert if the average iterator age is greater than 20 sec. • Ensure reader application is up, examine its custom metrics and triage Management overhead - We have roughly 2 “incidents” per month
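
The two alerts above reduce to simple threshold checks. The sketch below assumes the per-shard ingest limit of 1 MB/sec stated earlier in the deck, and the input numbers in the example are made up; in practice the inputs would come from the CloudWatch metrics that Amazon Kinesis Streams publishes. The function names are hypothetical.

```python
SHARD_INGEST_MB_PER_SEC = 1.0  # per-shard write limit, per the earlier slide


def capacity_utilization(ingest_mb_per_sec, shard_count):
    """Fraction of the stream's total ingest capacity currently in use."""
    return ingest_mb_per_sec / (shard_count * SHARD_INGEST_MB_PER_SEC)


def alerts(ingest_mb_per_sec, shard_count, avg_iterator_age_sec,
           capacity_threshold=0.80, max_iterator_age_sec=20.0):
    """Return the names of the alerts that should fire."""
    fired = []
    if capacity_utilization(ingest_mb_per_sec, shard_count) >= capacity_threshold:
        fired.append("capacity")               # time to reshard
    if avg_iterator_age_sec > max_iterator_age_sec:
        fired.append("reader_falling_behind")  # check the reader application
    return fired


if __name__ == "__main__":
    print(alerts(ingest_mb_per_sec=85, shard_count=100, avg_iterator_age_sec=5))
    print(alerts(ingest_mb_per_sec=50, shard_count=100, avg_iterator_age_sec=30))
```
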
  34. 34. Problem 3: Joining and aggregation High level value added services • Joined data directly feeds into model building pipelines for clicks, etc. • Reporting API, powered by ETL pipeline, provides aggregated metrics. Challenges • Supporting exactly once semantics, i.e., eliminate all duplicates • Minimize end-to-end latency from capture to joining & aggregation • Be robust to delays between arrival times of correlated events Bids Impressions Clicks, Conversions Joining and Aggregation
  35. 35. Solution 3: Stream joins using Amazon Redshift • Message hub emits separate log files into S3 for each event type • AWS Data Pipeline periodically loads log files into Amazon Redshift • Amazon Redshift tables of different event types are joined via primary key • FastPath: Joined events in 15 min but can miss delayed events • SlowPath: Fully joined events after 24 hours Streaming Message Hub ... S3 Buckets Amazon Redshift Data Pipeline
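
The difference between the fast path and the slow path can be shown with an in-memory model of the join. In the talk the join runs as SQL in Amazon Redshift; the Python function below is only an illustration, and the field name `event_id` and the output columns are assumptions. Keying the result on the event ID also removes duplicate rows, which is how a batch join can provide exactly once results on top of at least once delivery.

```python
def join_events(bids, impressions, clicks):
    """Join impressions and clicks onto bids by event_id, dropping duplicates."""
    impression_ids = {event["event_id"] for event in impressions}
    click_ids = {event["event_id"] for event in clicks}
    joined = {}
    for bid in bids:
        event_id = bid["event_id"]
        # Dict keyed by event_id: a duplicated bid row overwrites itself.
        joined[event_id] = {
            "event_id": event_id,
            "won": event_id in impression_ids,
            "clicked": event_id in click_ids,
        }
    return joined


if __name__ == "__main__":
    bids = [{"event_id": "a"}, {"event_id": "a"}, {"event_id": "b"}]
    impressions = [{"event_id": "a"}]

    # Fast path: the click for "a" has not arrived yet, so it is missed.
    fast = join_events(bids, impressions, clicks=[])
    # Slow path: rerun after the delayed click has been loaded.
    slow = join_events(bids, impressions, clicks=[{"event_id": "a"}])
    print(fast["a"], slow["a"])
```
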
  36. 36. Stream join design trade offs Joins are not truly streaming in current design • Batch size of 15 min dictated by lowest interval for scheduling data pipeline • Lambda can be used instead of AWS Data Pipeline to lower schedule intervals • Data loaded into Amazon Redshift cannot be easily fed into Amazon Kinesis streams • However, it scales well, is fully AWS managed, and supports many of our use cases
  37. 37. What are the alternatives? • Spark streaming via EMR • Amazon Kinesis Analytics Early thoughts on comparing the alternatives • Amazon Kinesis Analytics is fully managed; Spark Streaming is not • Amazon Kinesis Analytics has usage-based pricing; Spark requires careful capacity planning • Need to evaluate Amazon Kinesis Analytics on scale and support for arbitrary data formats
  38. 38. Summary Building real time bidding (RTB) applications is very challenging Beeswax provides a managed platform to build RTB apps on AWS Beeswax uses Amazon Kinesis as infrastructure for streaming data Beeswax platform solves key streaming data challenges • Supports event collection at very large scale • API driven platform for data transformation and fan out • Supports joining of streams and aggregation of metrics Tradeoffs are unique to application; Beeswax is optimized for RTB
  39. 39. Thank you!
  40. 40. Remember to complete your evaluations!
  41. 41. Reference We have many AWS Big Data Blog posts which cover more examples. Full list here. Some good ones: 1. Amazon Kinesis Streams 1. Implement Efficient and Reliable Producers with the Amazon Kinesis Producer Library 2. Presto and Amazon Kinesis 3. Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming 4. Optimize Spark-Streaming to Efficiently Process Amazon Kinesis Streams 2. Amazon Kinesis Firehose 1. Persist Streaming Data to Amazon S3 using Amazon Kinesis Firehose and AWS Lambda 2. Building a Near Real-Time Discovery Platform with AWS 3. Amazon Kinesis Analytics 1. Writing SQL on Streaming Data With Amazon Kinesis Analytics Part 1 | Part 2 2. Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics
  42. 42. Reference • Technical documentation • Amazon Kinesis Agent • Amazon Kinesis Streams and Spark Streaming • Amazon Kinesis Producer Library Best Practice • Amazon Kinesis Firehose and AWS Lambda • Building Near Real-Time Discovery Platform with Amazon Kinesis • Public case studies • Glu Mobile – Real-Time Analytics • Hearst Publishing – Clickstream Analytics • How Sonos Leverages Amazon Kinesis • Nordstrom Online Stylist
  43. 43. Detailed system architecture Event Stream PartitionKey = F(EventId) Config Store Event Producer - Reliable - Record level retries Bid Producer - High throughput - Stream compression - Batch records w/ flush timeout Stream Msg Hub - KCL Application - Autoscales - At-least once processing - Record format transforms - Route to custom sinks - Stream window analytics Customer Log Stream Partition key = EventId Customer Http Post Protobuf/Json payload S3 Storage - CSV data - Customer bucket Amazon Redshift - Join by EventId - Exactly once - Fast path 30m Data Pipeline
  44. 44. Streaming data in real-time bidding application Filtering and Sampling Joining and Aggregation Analytics and Reporting Data Sources Bids: O(1M) TPS Wins: O(10K) TPS Clicks: O(1K) TPS Consumers Formats