Amazon Kinesis provides services for working with streaming data on AWS. Learn how to load streaming data continuously and cost-effectively into Amazon S3 and Amazon Redshift using Amazon Kinesis Firehose, without writing custom stream processing code, and get an introduction to building custom stream processing applications with Amazon Kinesis Streams for specialised needs.
2. What to expect from this session
• Streaming scenarios
• Amazon Kinesis Streams overview
• Amazon Kinesis Firehose overview
• Firehose experience for Amazon S3 and Amazon Redshift
• Jampp – Our Journey with Amazon Kinesis
3. Streaming Data Use Cases
1. Accelerated Ingest-Transform-Load
2. Continual Metric Generation
3. Responsive Data Analysis
4. Amazon Kinesis: Streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS
• Amazon Kinesis Streams: Build your own custom applications that process or analyze streaming data
• Amazon Kinesis Firehose: Easily load massive volumes of streaming data into Amazon S3 and Amazon Redshift
• Amazon Kinesis Analytics (In Preview): Easily analyze data streams using standard SQL queries
5. Amazon Kinesis: Streaming data done the AWS way
Makes it easy to capture, deliver, and process real-time data streams
• Pay as you go, no upfront costs
• Elastically scalable
• Right services for your specific use cases
• Real-time latencies
• Easy to provision, deploy, and manage
6. Amazon Kinesis Streams
Build your own data streaming applications
• Easy administration: Simply create a new stream and set the desired level of capacity with shards. Scale to match your data throughput rate and volume.
• Build real-time applications: Perform continual processing on streaming big data using the Amazon Kinesis Client Library (KCL), Apache Spark/Storm, AWS Lambda, and more.
• Low cost: Cost-efficient for workloads of any scale.
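Creating a stream and setting its capacity comes down to one API call. A minimal sketch with boto3; the stream name, region, and shard count are hypothetical:

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Create a stream with 2 shards; each shard supports up to
# 1 MB/s (or 1,000 records/s) of writes and 2 MB/s of reads.
kinesis.create_stream(StreamName="my-stream", ShardCount=2)

# Wait until the stream is ACTIVE before writing to it.
kinesis.get_waiter("stream_exists").wait(StreamName="my-stream")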
8. Amazon Kinesis Streams
Real-time streaming data ingestion: a managed service for real-time processing that feeds custom-built streaming applications
• Inexpensive: $0.014 per 1,000,000 PUT payload units
10. Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3 and Amazon Redshift
• Zero administration: Capture and deliver streaming data into Amazon S3, Amazon Redshift, and other destinations without writing an application or managing infrastructure.
• Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations.
• Seamless elasticity: Seamlessly scales to match data throughput without intervention.
How it works:
1. Capture and submit streaming data to Firehose.
2. Firehose loads streaming data continuously into Amazon S3 and Amazon Redshift.
3. Analyze streaming data using your favorite BI tools.
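Submitting data is a single API call per record (or batch). A minimal producer sketch with boto3; the delivery stream name and payload fields are hypothetical:

import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

event = {"user_id": 42, "action": "click"}

# Firehose buffers records and delivers them to the configured
# destination (S3, Redshift, ...) in as little as 60 seconds.
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={"Data": json.dumps(event).encode("utf-8") + b"\n"},
)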
11. Amazon Kinesis Firehose
Capture IT and app logs, device and sensor data, and more; enable near-real-time analytics using existing tools
Inputs: AWS Platform SDKs, Mobile SDKs, Kinesis Agent, AWS IoT. Destinations: Amazon S3, Amazon Redshift, Amazon Elasticsearch Service.
• Send data from IT infrastructure, mobile devices, and sensors
• Integrated with the AWS SDKs, agents, and AWS IoT
• Batch, compress, and encrypt data before loads
• Loads data into Amazon Redshift tables by using the COPY command
• Pay-as-you-go: 3.5 cents per GB transferred
12. Amazon Kinesis Firehose: Three simple concepts
1. Delivery stream: The underlying entity of Firehose. Use Firehose by creating a delivery stream to a specified destination and sending data to it.
• You do not have to create a stream or provision shards.
• You do not have to specify partition keys.
2. Records: The data producer sends data blobs as large as 1,000 KB to a delivery stream. Each such data blob is called a record.
3. Data producers: Producers send records to a delivery stream. For example, a web server that sends log data to a delivery stream is a data producer.
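Creating a delivery stream is likewise a single call, and Firehose provisions everything behind it. A minimal sketch with boto3, assuming a pre-existing IAM role and S3 bucket (both ARNs below are placeholders):

import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# One call creates the delivery stream; no shards or partition keys
# to manage. Buffering hints control how often data lands in S3.
firehose.create_delivery_stream(
    DeliveryStreamName="my-delivery-stream",
    S3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        "BucketARN": "arn:aws:s3:::my-bucket",
        "Prefix": "events/",
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
        "CompressionFormat": "GZIP",
    },
)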
13. Amazon Kinesis Firehose console experience
Unified console experience for Firehose and Streams
14. Amazon Kinesis Firehose console (S3)
Create fully managed resources for delivery without building an app
15. Amazon Kinesis Firehose console (S3)
Configure data delivery options simply using the console
17. Amazon Kinesis agent
Software agent makes submitting data easy
• Monitors files and sends new data records to your delivery stream
• Handles file rotation, checkpointing, and retry upon failure
• Preprocessing capabilities such as format conversion and log parsing
• Delivers all data in a reliable, timely, and simple manner
• Emits Amazon CloudWatch metrics to help you better monitor and
troubleshoot the streaming process
• Supported on Amazon Linux AMI with version 2015.09 or later, or Red Hat
Enterprise Linux version 7 or later; install on Linux-based server
environments such as web servers, front ends, log servers, and more
• Also enabled for Streams
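The agent is driven by a small JSON configuration file (/etc/aws-kinesis/agent.json). A minimal sketch, with a hypothetical log path and delivery stream name:

{
  "cloudwatch.emitMetrics": true,
  "flows": [
    {
      "filePattern": "/var/log/webserver/access.log*",
      "deliveryStream": "my-delivery-stream"
    }
  ]
}

For a Streams destination, a flow specifies "kinesisStream" instead of "deliveryStream".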
19. Amazon Kinesis Streams is a service for workloads that require custom processing, per incoming record, with sub-1-second processing latency, and a choice of stream processing frameworks.
Amazon Kinesis Firehose is a service for workloads that require zero administration, the ability to use existing analytics tools based on S3, Amazon Redshift, and Amazon Elasticsearch, and data latency of 60 seconds or higher.
21. Amazon Kinesis Analytics (In Preview)
Analyze data streams continuously with standard SQL
• Apply SQL on streams: Easily connect to data streams and apply existing SQL skills.
• Build real-time applications: Perform continual processing on streaming big data with sub-second processing latencies.
• Scale elastically: Elastically scales to match data throughput without any operator intervention.
How it works:
1. Connect to Amazon Kinesis streams or Firehose delivery streams.
2. Run standard SQL queries against data streams.
3. Amazon Kinesis Analytics can send processed data to analytics tools so you can create alerts and respond in real time.
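For flavor, a continuous query in Kinesis Analytics reads like ordinary SQL over a stream. A sketch of a tumbling-window aggregate (the stream and column names are hypothetical):

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker_symbol VARCHAR(4), avg_price DOUBLE);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM ticker_symbol, AVG(price) AS avg_price
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY ticker_symbol,
           STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '10' SECOND);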
23. About Jampp
We are a tech company that helps companies grow their mobile business by driving engaged users to their apps.
• Machine learning
• Post-install event optimisation
• Dynamic Product Ads and Segments
• Data Science
• Programmatic Buying
We are a team of 70 people, 30% in the engineering team, located in 6 cities across the US, Latin America, Europe, and Africa.
24. About Real-Time Bidding
• Ad impressions are made available through an exchange
• Demand-side platforms have to bid in less than 100 ms
• The highest bid wins the impression and shows the ad!
We do this 220,000 times per second.
27. Business Challenges
• Build a retargeting platform that generates groups of users based on their in-app activity and a look-alike machine learning model
• Process and enrich in-app events in less than 5 minutes, to target users when they become dormant
• Build a scale-on-demand platform that lets our business grow without pain
• Increase the platform's monitoring, logging, and alerting capabilities
Also: non-technical people should be able to query granular data and aggregate it over long periods.
28. Data Scale
• 700M events / 300 GB per day
• 1,500% in-app events growth YoY
• Growth peaks are out of the tech team's control, since they depend on the sales team's pacing ;-)
31. Tempted by the Dark Side: Cost Savings
• Kafka supports higher throughput and lower latency, and there are tons of successful implementation cases, but few manage volume similar to Jampp's
• EBS allocation per topic was hard to size correctly, tune, and scale on demand
• Kafka maintainability required dedicated manpower
• Kafka's security configuration was not flexible enough to add secured producers outside of the VPC
Cost comparison: $2,848 vs. $936
33. Jedi Trial I
• Invested several days picking the partition key for evenly distributing data across shards
• Encoding protocol matters! Performed several benchmarks, and MessagePack offered the best trade-off between compression ratio and serialization speed (see the sketch below)
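A minimal producer sketch of both ideas, assuming the msgpack library and a hypothetical device_id field as the partition key:

import boto3
import msgpack  # pip install msgpack

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"device_id": "a1b2c3", "event": "purchase", "price": 9.99}

# MessagePack gives a compact binary payload with fast serialization.
payload = msgpack.packb(event)

# A high-cardinality key (here the device id) spreads records
# evenly across shards; a skewed key creates hot shards.
kinesis.put_record(
    StreamName="raw-events",
    Data=payload,
    PartitionKey=event["device_id"],
)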
34. Jedi Trial II
• Write/read batching to reduce HTTPS protocol overhead and costs (see the sketch below)
• Exponential backoff + jitter to reduce the impact of in-app event bursts sent by the tracking platforms
• Increased the data retention period from 1 day (default) to 3 days on the raw data streams
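A minimal sketch of batching with retry, assuming boto3 and a hypothetical stream name; PutRecords accepts up to 500 records per call:

import random
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def put_batch(records, stream, max_attempts=5):
    """Send a batch of {'Data': bytes, 'PartitionKey': str} records in one
    PutRecords call, retrying only the failed ones with backoff + jitter."""
    for attempt in range(max_attempts):
        resp = kinesis.put_records(StreamName=stream, Records=records)
        if resp["FailedRecordCount"] == 0:
            return
        # Keep only the records that were throttled or errored.
        records = [r for r, out in zip(records, resp["Records"])
                   if "ErrorCode" in out]
        # Exponential backoff with full jitter smooths out bursts.
        time.sleep(random.uniform(0, min(10, 2 ** attempt)))
    raise RuntimeError("records still failing after retries")

# Extending retention to 3 days is one call (72 hours):
kinesis.increase_stream_retention_period(
    StreamName="raw-events", RetentionPeriodHours=72)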
35. Jedi Trial III
• Firehose's real-time data ingestion to S3 and auto-scaling capabilities flush the data to the S3/EMR cluster faster than ever, letting our machine learning platform recalculate user retargeting segments with higher frequency
• Encryption is a key success factor, since we manage sensitive data contained in the in-app events
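Server-side encryption of delivered objects is one extra key in the S3 destination settings. A hypothetical fragment extending the create_delivery_stream sketch from the slide-12 concepts above (the KMS key ARN is a placeholder):

# Added under S3DestinationConfiguration: objects Firehose writes
# to S3 are encrypted with a customer-managed KMS key.
encryption_configuration = {
    "KMSEncryptionConfig": {
        "AWSKMSKeyARN": "arn:aws:kms:us-east-1:123456789012:key/placeholder",
    },
}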
36. Jedi Trial IV
• EMR cluster simplifies our data processing
• Spark ETLs are executed by Airflow to enrich data, de-normalize it, and convert JSON to Parquet (see the sketch below)
• ML predicts user conversion and segments users based on it. This process is implemented as a Python app that queries event data stored in Parquet files through PrestoDB
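The JSON-to-Parquet step of such an ETL is a few lines of PySpark. A minimal sketch with hypothetical S3 paths and a hypothetical "day" column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("events-to-parquet").getOrCreate()

# Read the raw JSON events Firehose delivered to S3 ...
events = spark.read.json("s3://my-bucket/events/2016/03/*")

# ... and write them back as columnar Parquet, partitioned by day,
# so PrestoDB queries scan only the columns and days they need.
events.write.mode("overwrite").partitionBy("day").parquet(
    "s3://my-bucket/events-parquet/")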
37. Jedi Trial V
• Airpal queries PrestoDB and simplifies access to data for non-technical people
• Jupyter notebooks are used as templates to build frequently used queries and automate common analysis tasks
• Spark Streaming for real-time anomaly detection and fraud prevention
• Multiple clusters (according to SLAs)
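Querying PrestoDB from a notebook can be this small. A sketch assuming the PyHive library and hypothetical host and table names:

from pyhive import presto  # pip install 'pyhive[presto]'

# Connect to the Presto coordinator and run an ad-hoc aggregate.
conn = presto.connect(host="presto-coordinator.internal", port=8080)
cursor = conn.cursor()
cursor.execute("""
    SELECT event, count(*) AS n
    FROM events_parquet
    WHERE day = date '2016-03-01'
    GROUP BY event
    ORDER BY n DESC
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)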
38. Jedi Knighting (from Padawan to Jedi Knight)
• Time is money
• Shard read/write limits... test your data volume first!
• Shard-based provisioned throughput lets you scale on demand
• Exponential backoff + jitter
• Batching and compressing will save you tons of headaches and money
• Extended data retention pays off
• Kinesis helps you make the data pipeline much more reliable
• Kinesis + Lambda + DynamoDB + EMR = <3