Getting Started with Amazon Kinesis

Presented at the AWS Summit in London: a deep dive on getting started with Amazon Kinesis, plus a use case from Jampp, the world's leading mobile app marketing platform.

1. Getting Started with Amazon Kinesis
    Warren Paull, Solution Architect, AWS
    Patricio Rocca, Chief Technology Officer, Jampp
    July 2016
    © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
2. What to expect from this session
    • Streaming scenarios
    • Amazon Kinesis Streams overview
    • Amazon Kinesis Firehose overview
    • Firehose experience for Amazon S3 and Amazon Redshift
    • Jampp – our journey with Amazon Kinesis
3. Streaming Data Use Cases
    1. Accelerated Ingest-Transform-Load
    2. Continual Metric Generation
    3. Responsive Data Analysis
4. Amazon Kinesis: Streaming data made easy
    Services make it easy to capture, deliver, and process streams on AWS.
    • Amazon Kinesis Streams – build your own custom applications that process or analyze streaming data
    • Amazon Kinesis Firehose – easily load massive volumes of streaming data into Amazon S3 and Amazon Redshift
    • Amazon Kinesis Analytics – easily analyze data streams using standard SQL queries (in preview)
5. Amazon Kinesis: Streaming data done the AWS way
    Makes it easy to capture, deliver, and process real-time data streams.
    • Pay as you go, no upfront costs
    • Elastically scalable
    • Right services for your specific use cases
    • Real-time latencies
    • Easy to provision, deploy, and manage
6. Amazon Kinesis Streams
    Build your own data streaming applications.
    • Easy administration: simply create a new stream and set the desired level of capacity with shards. Scale to match your data throughput rate and volume.
    • Build real-time applications: perform continual processing on streaming big data using the Amazon Kinesis Client Library (KCL), Apache Spark/Storm, AWS Lambda, and more.
    • Low cost: cost-efficient for workloads of any scale.
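    A minimal boto3 sketch of the "create a new stream and set capacity with shards" step; the stream name, region, and shard count are illustrative assumptions:

        import boto3

        kinesis = boto3.client("kinesis", region_name="eu-west-1")

        # Capacity is set by the shard count: each shard ingests up to
        # 1 MB/s or 1,000 records/s and supports 2 MB/s of reads.
        kinesis.create_stream(StreamName="clickstream-events", ShardCount=4)

        # Wait for the stream to become ACTIVE before writing to it.
        kinesis.get_waiter("stream_exists").wait(StreamName="clickstream-events")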
7. Reading and Writing with Streams
    Sending: AWS SDK, AWS Mobile SDK, Kinesis Producer Library, LOG4J, Flume, Fluentd
    Consuming: Get* APIs, Kinesis Client Library + Connector Library, Apache Storm, Apache Spark, AWS Lambda, Amazon Elastic MapReduce
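    A sketch of both sides using the plain AWS SDK (boto3); the stream name and payload are made up, and a production consumer would more likely use the KCL than the raw Get* APIs:

        import json
        import boto3

        kinesis = boto3.client("kinesis", region_name="eu-west-1")
        stream = "clickstream-events"

        # Sending: the partition key determines which shard the record lands on.
        kinesis.put_record(
            StreamName=stream,
            Data=json.dumps({"user_id": "u-123", "event": "click"}).encode(),
            PartitionKey="u-123",
        )

        # Consuming with the Get* APIs: read one shard from its oldest record.
        shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
        iterator = kinesis.get_shard_iterator(
            StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
        )["ShardIterator"]
        for record in kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]:
            print(record["PartitionKey"], record["Data"])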
8. Amazon Kinesis Streams
    Managed service for real-time processing.
    • Real-time streaming data ingestion
    • Custom-built streaming applications
    • Inexpensive: $0.014 per 1,000,000 PUT payload units
9. We listened to our customers…
10. Amazon Kinesis Firehose
    Load massive volumes of streaming data into Amazon S3 and Amazon Redshift.
    • Zero administration: capture and deliver streaming data into Amazon S3, Amazon Redshift, and other destinations without writing an application or managing infrastructure.
    • Direct-to-data-store integration: batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations.
    • Seamless elasticity: seamlessly scales to match data throughput without intervention.
    Capture and submit streaming data to Firehose; Firehose loads it continuously into S3 and Amazon Redshift; analyze the streaming data using your favorite BI tools.
11. Amazon Kinesis Firehose
    Capture IT and app logs, device and sensor data, and more; enable near-real-time analytics using existing tools.
    • Send data from IT infrastructure, mobile devices, and sensors
    • Integrated with the AWS SDKs, agents, and AWS IoT
    • Batch, compress, and encrypt data before loads
    • Loads data into Amazon Redshift tables by using the COPY command
    • Pay-as-you-go: 3.5 cents / GB transferred
    (Producers: AWS Platform SDKs, Mobile SDKs, Kinesis Agent, AWS IoT. Destinations: Amazon S3, Amazon Redshift, Amazon Elasticsearch.)
12. Amazon Kinesis Firehose
    Three simple concepts:
    1. Delivery stream: the underlying entity of Firehose. Use Firehose by creating a delivery stream to a specified destination and sending data to it. You do not have to create a stream or provision shards, and you do not have to specify partition keys.
    2. Records: the data producer sends data blobs as large as 1,000 KB to a delivery stream. That data blob is called a record.
    3. Data producers: producers send records to a delivery stream. For example, a web server that sends log data to a delivery stream is a data producer.
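    A minimal producer sketch for these concepts, assuming a delivery stream named "web-logs-to-s3" already exists and boto3 is configured:

        import json
        import boto3

        firehose = boto3.client("firehose", region_name="eu-west-1")

        # Send one record (a data blob of up to 1,000 KB); Firehose handles
        # batching, compression, encryption, and delivery to the destination.
        firehose.put_record(
            DeliveryStreamName="web-logs-to-s3",
            Record={"Data": (json.dumps({"path": "/index.html", "status": 200}) + "\n").encode()},
        )

        # Or batch up to 500 records per call to reduce request overhead.
        firehose.put_record_batch(
            DeliveryStreamName="web-logs-to-s3",
            Records=[{"Data": b'{"path": "/about", "status": 200}\n'} for _ in range(10)],
        )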
13. Amazon Kinesis Firehose console experience
    Unified console experience for Firehose and Streams.
14. Amazon Kinesis Firehose console (S3)
    Create fully managed resources for delivery without building an app.
15. Amazon Kinesis Firehose console (S3)
    Configure data delivery options simply using the console.
16. Amazon Kinesis Firehose console (Amazon Redshift)
    Configure data delivery to Amazon Redshift simply using the console.
17. Amazon Kinesis agent
    Software agent makes submitting data easy.
    • Monitors files and sends new data records to your delivery stream
    • Handles file rotation, checkpointing, and retry upon failures
    • Preprocessing capabilities such as format conversion and log parsing
    • Delivers all data in a reliable, timely, and simple manner
    • Emits Amazon CloudWatch metrics to help you better monitor and troubleshoot the streaming process
    • Supported on Amazon Linux AMI version 2015.09 or later, or Red Hat Enterprise Linux version 7 or later; install on Linux-based server environments such as web servers, front ends, log servers, and more
    • Also enabled for Streams
18. Amazon Kinesis Firehose or Amazon Kinesis Streams?
19. Amazon Kinesis Streams is a service for workloads that require custom processing of each incoming record, sub-1-second processing latency, and a choice of stream processing frameworks.
    Amazon Kinesis Firehose is a service for workloads that require zero administration and the ability to use existing analytics tools based on S3, Amazon Redshift, and Amazon Elasticsearch, with a data latency of 60 seconds or higher.
20. Amazon Kinesis Analytics
21. Amazon Kinesis Analytics (in preview)
    Analyze data streams continuously with standard SQL.
    • Apply SQL on streams: easily connect to data streams and apply existing SQL skills.
    • Build real-time applications: perform continual processing on streaming big data with sub-second processing latencies.
    • Scale elastically: elastically scales to match data throughput without any operator intervention.
    Connect to Amazon Kinesis streams or Firehose delivery streams, run standard SQL queries against the data, and have Kinesis Analytics send the processed data to analytics tools so you can create alerts and respond in real time.
22. Jampp: a Padawan's journey with Kinesis
    Patricio Rocca | July 2016
23. About Jampp
    We are a tech company that helps companies grow their mobile business by driving engaged users to their apps.
    • Machine learning
    • Post-install event optimisation
    • Dynamic Product Ads and Segments
    • Data Science
    • Programmatic Buying
    We are a team of 70 people, 30% in the engineering team, located in 6 cities across the US, Latin America, Europe and Africa.
24. About Real-Time Bidding
    • Ad impressions are made available through an exchange
    • Demand platforms have to bid in less than 100 ms
    • The highest bid wins the impression and shows the ad!
    We do this 220,000 times per second.
25. Real-Time Bidding Workflow
    (Diagram: Publisher and Exchange on one side; Jampp Bidder, Jampp Machine Learning, and Jampp Engagement Segments Builder on the other; flow from ad placeholder to bid, auction win, and impression.)
26. Real-Time Tracking Workflow
    (Diagram: in-app events flow from the Jampp Client Application to the Jampp NodeJS Tracking Platform.)
27. Business Challenges
    • Build a retargeting platform that generates groups of users based on their in-app activity and a look-alike machine learning model
    • Process and enrich in-app events in less than 5 minutes to target users when they become dormant
    • Build a scale-on-demand platform that lets our business grow without pain
    • Increase the platform's monitoring, logging, and alerting capabilities
    Also… non-technical people should be able to query granular data and aggregate it over large periods.
28. Data Scale
    • 700M events / 300 GB per day
    • 1500% in-app events growth YoY
    • Growth peaks are outside the tech team's control, since they depend on the sales team's pacing ;-)
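    A back-of-envelope sketch of what that volume means for a Streams-based pipeline, using the published per-shard write limits (1 MB/s or 1,000 records/s) and only the figures above:

        import math

        events_per_day = 700_000_000
        bytes_per_day = 300 * 1024**3

        events_per_sec = events_per_day / 86_400        # ~8,100 records/s average
        mb_per_sec = bytes_per_day / 86_400 / 1024**2   # ~3.6 MB/s average

        # Per-shard write limits: 1,000 records/s and 1 MB/s.
        shards = max(math.ceil(events_per_sec / 1_000), math.ceil(mb_per_sec / 1.0))
        print(f"~{events_per_sec:,.0f} rec/s, ~{mb_per_sec:.1f} MB/s -> at least {shards} shards, before peak headroom")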
29. Start your engines!
30. The Phantom Menace (initial architecture)
31. Tempted by the Dark Side (Cost Savings)
    • Kafka supports higher throughput and lower latency, and there are tons of successful implementation cases, but few manage a volume similar to Jampp's
    • EBS allocation per topic was hard to size correctly, tune, and scale "on demand"
    • Kafka maintainability required dedicated manpower
    • Kafka's security configuration was not flexible enough to add secured producers outside of the VPC
    (Cost comparison from the slide: $2,848 vs. $936.)
32. A New Hope (final architecture)
33. Jedi Trial I
    • Invested several days picking the partition key for evenly distributing data across shards
    • Encoding protocol matters! Performed several benchmarks; MessagePack offered the best trade-off between compression and serialization speed
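    A minimal sketch of what those two choices look like in a producer (the stream name and key field are illustrative; msgpack is the third-party MessagePack package):

        import uuid
        import boto3
        import msgpack  # pip install msgpack

        kinesis = boto3.client("kinesis", region_name="us-east-1")

        def put_event(event: dict) -> None:
            # A high-cardinality key (e.g., a device or user ID) spreads records
            # evenly across shards; a low-cardinality key produces hot shards.
            partition_key = event.get("device_id") or str(uuid.uuid4())
            kinesis.put_record(
                StreamName="in-app-events-raw",
                Data=msgpack.packb(event),  # compact binary encoding of the record
                PartitionKey=partition_key,
            )

        put_event({"device_id": "d-42", "event": "level_complete", "ts": 1467331200})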
34. Jedi Trial II
    • Write/read batching to reduce HTTPS protocol overhead and costs
    • Exponential backoff + jitter to reduce the impact of in-app event bursts sent by the tracking platforms
    • Increased the data retention period from 1 day (default) to 3 days on the raw data streams
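    A sketch of batched writes with exponential backoff and full jitter (stream name is illustrative; PutRecords accepts up to 500 records per call and reports per-record failures):

        import random
        import time
        import boto3
        import msgpack

        kinesis = boto3.client("kinesis", region_name="us-east-1")

        def put_batch(events, stream="in-app-events-raw", max_retries=5):
            records = [{"Data": msgpack.packb(e), "PartitionKey": e["device_id"]} for e in events]
            for attempt in range(max_retries):
                resp = kinesis.put_records(StreamName=stream, Records=records)
                if resp["FailedRecordCount"] == 0:
                    return
                # Retry only the records that were throttled or otherwise failed.
                records = [r for r, res in zip(records, resp["Records"]) if "ErrorCode" in res]
                # Exponential backoff with full jitter smooths out event bursts.
                time.sleep(random.uniform(0, min(10, 0.1 * 2 ** attempt)))
            raise RuntimeError(f"{len(records)} records still failing after {max_retries} retries")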
35. Jedi Trial III
    • Firehose's real-time data ingestion to S3 and its auto-scaling capabilities flush the data to the S3/EMR cluster faster than ever, letting our machine learning platform recalculate user retargeting segments with higher frequency
    • Encryption is a key success factor, since we manage sensitive data contained in the in-app events
36. Jedi Trial IV
    • An EMR cluster simplifies our data processing
    • Spark ETLs are executed by Airflow to enrich data, de-normalize it, and convert JSON to Parquet
    • ML predicts user conversion and segments users based on it; this process is implemented as a Python app that queries event data stored in Parquet files through PrestoDB
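    A minimal PySpark sketch of the JSON-to-Parquet step of such an ETL (the S3 paths and the derived partition column are illustrative):

        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.appName("events-json-to-parquet").getOrCreate()

        # Read the raw JSON events that Firehose delivered to S3.
        events = spark.read.json("s3://example-bucket/raw/in_app_events/2016/07/")

        # Minimal enrichment/denormalization: derive a date column for partitioning.
        enriched = events.withColumn("event_date", F.to_date(F.from_unixtime("ts")))

        # Write columnar Parquet, partitioned by date, for PrestoDB to query.
        (enriched.write
            .mode("overwrite")
            .partitionBy("event_date")
            .parquet("s3://example-bucket/parquet/in_app_events/"))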
37. Jedi Trial V
    • Airpal queries PrestoDB and simplifies access to data for non-technical people
    • Jupyter notebooks are used as templates to build frequently used queries and automate common analysis tasks
    • Spark Streaming for real-time anomaly detection and fraud prevention
    • Multiple clusters (according to SLAs)
38. Jedi Knighting (from Padawan to Jedi Knight)
    • Time is money
    • Shard read/write limits… test your data volume first!
    • Shard-based provisioned throughput lets you scale on demand
    • Exponential backoff + jitter
    • Batching and compression will save you tons of headaches and money
    • Extended data retention pays off
    • Kinesis helps you make the data pipeline much more reliable
    • Kinesis + Lambda + Dynamo + EMR = <3
39. Thanks! geeks.jampp.com
40. Please remember to rate this session under My Agenda on awssummit.london
41. http://blogs.aws.amazon.com/bigdata/ — Thank You
42. Appendix
43. Streaming data scenarios across segments
    Scenarios: (1) Accelerated Ingest-Transform-Load, (2) Continual Metrics Generation, (3) Responsive Data Analysis
    • Data types: IT logs, application logs, social media / clickstreams, sensor or device data, market data
    • Ad/Marketing Tech: publisher and bidder data aggregation (1); advertising metrics like coverage, yield, conversion (2); analytics on user engagement with ads, optimized bid/buy engines (3)
    • IoT: sensor and device telemetry data ingestion (1); IT operational metrics dashboards (2); sensor operational intelligence, alerts, and notifications (3)
    • Gaming: online customer engagement data aggregation (1); consumer engagement metrics for level success, transition rates, CTR (2); clickstream analytics, leaderboard generation, player-skill match engines (3)
    • Consumer Engagement: online customer engagement data aggregation (1); consumer engagement metrics like page views, CTR (2); clickstream analytics, recommendation engines (3)
