Introduction to Real-time, Streaming Data and Amazon Kinesis: Streaming Data Ingestion with Firehose
1. San Francisco Loft - 2017
Introduction to Real-time, Streaming Data and Amazon Kinesis: Streaming Data Ingestion with Firehose
Adrian Hornsby (@adhorn)
Technical Evangelist with AWS
2. • Technical Evangelist, Developer Advocate, … Software Engineer
• My @home is in Finland
• Previously:
• Solutions Architect @AWS
• Lead Cloud Architect @Dreambroker
• Director of Engineering, Software Engineer, DevOps, Manager, ... @Hdm
• Researcher @Nokia Research Center
• and a bunch of other stuff.
• Love climbing and ginger shots.
3. What to Expect from the Session
• Streaming data overview
• Firehose patterns overview
• Firehose usage patterns
• Streaming data end-to-end example and walk-through
6. Most data is produced continuously
Mobile Apps Web Clickstream Application Logs
Metering Records IoT Sensors Smart Buildings
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
7. The diminishing value of data
• Recent data is highly valuable
• Old + Recent data is more valuable
8. Processing real-time, streaming data
• Durable
• Continuous
• Fast
• Correct
• Reactive
• Reliable
What are the key requirements?
Ingest Transform Analyze React Persist
10. Real-time streaming data made easy
Amazon Kinesis Streams
• For technical developers
• Collect and stream data for ordered, replayable, real-time processing
Amazon Kinesis Firehose
• For all developers and data scientists
• Easily load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch
Amazon Kinesis Analytics
• For all developers and data scientists
• Easily analyze data streams using standard SQL queries
11. Amazon Kinesis Streams
• Reliably ingest and durably store streaming data at low cost
• Build custom real-time applications to process streaming data
12. Amazon Kinesis Analytics
• Interact with streaming data in real-time using SQL
• Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms
13. Amazon Kinesis Firehose
• Reliably ingest and deliver batched, compressed, and encrypted data to S3, Redshift, and Elasticsearch
• Point-and-click setup with zero administration and seamless elasticity
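To make the ingestion path concrete, here is a minimal sketch of how records could be packaged for a Firehose `PutRecordBatch` call. The helper function, the sample events, and the stream name `my-stream` are illustrative assumptions, not part of the deck; the actual AWS call is shown commented out since it needs credentials and a live delivery stream.

```python
import json

def build_firehose_batch(events):
    """Package events as records for a Firehose PutRecordBatch call.
    Firehose expects each record's Data as bytes; a trailing newline
    keeps records separable after Firehose batches them into an S3 object."""
    return [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]

events = [{"user": "alice", "action": "click"},
          {"user": "bob", "action": "view"}]
records = build_firehose_batch(events)

# With boto3 (assumed setup), the delivery call would look like:
# import boto3
# firehose = boto3.client("firehose")
# firehose.put_record_batch(DeliveryStreamName="my-stream", Records=records)
```

The newline delimiter matters in practice: Firehose concatenates records when writing an S3 object, so without it, consecutive JSON documents run together.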
14. Amazon Kinesis makes it easy to work with real-time streaming data
Amazon Kinesis Firehose
• For all developers and data scientists
• Easily load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch
20. Amazon Kinesis Firehose vs. Amazon Kinesis Streams
Amazon Kinesis Streams is for use cases that require custom processing of each incoming record, with sub-second processing latency, and a choice of stream processing frameworks.
Amazon Kinesis Firehose is for use cases that require zero administration, the ability to use existing analytics tools based on Amazon S3, Amazon Redshift, and Amazon Elasticsearch, and a data latency of 60 seconds or higher.
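The API-level difference mirrors this split: a Streams record carries a partition key, which determines the shard and so preserves per-key ordering for custom consumers, while a Firehose record does not, because Firehose handles batching and delivery itself. A rough sketch (the function names and the modulo-based shard mapping are simplifications for illustration; real shards own contiguous ranges of the MD5 hash-key space):

```python
import hashlib

def streams_record(data: bytes, partition_key: str):
    """A Kinesis Streams record: the partition key selects the shard,
    which preserves per-key ordering for custom consumers."""
    return {"Data": data, "PartitionKey": partition_key}

def firehose_record(data: bytes):
    """A Firehose record: no partition key -- Firehose batches and
    delivers to the destination (S3/Redshift/Elasticsearch) for you."""
    return {"Data": data}

def shard_for(partition_key: str, num_shards: int) -> int:
    """Simplified view of how Streams maps a partition key to a shard:
    an MD5 hash of the key, reduced onto the shard count."""
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return h % num_shards
```

Because the mapping is deterministic, all records with the same partition key land on the same shard and are read back in order.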
Narrative: The reality is that most data is produced continuously and is coming at us at lightning speed due to the explosive growth of real-time data sources.
TP: Machine data will make up 40% of our digital universe by 2020
Narrative: Whether it is log data coming from mobile and web applications, purchase data from ecommerce sites, or sensor data from IoT devices, it all delivers information that can help companies learn about what their customers, organization, and business are doing right now.
TP: Customer Benefits
Improve operational efficiencies, improve customer experiences, new business models
Smart building: reduce energy costs, cut maintenance, increase safety and security
Smart textiles: monitor skin temperature, monitor stress
Narrative: So how much is this data worth? Well, it depends…
Recent data is highly valuable
If you act on it in time
Perishable Insights (M. Gualtieri, Forrester)
Old + Recent data is more valuable
If you have the means to combine them
Narrative: Processing real-time data as it arrives can let you make decisions much faster and get the most value from your data. But building your own custom applications to process streaming data is complicated and resource intensive. You need to train or hire developers with the right skill sets, then wait months for the applications to be built and fine-tuned, and then operate and scale the application as the business grows.
All of this takes lots of time and money, and, at the end of the day, lots of companies just never get there, settle for the status-quo, and live with information that is hours or days old.
Narrative: You need a different set of analytical tools to collect and analyze real-time streaming data than what you have traditionally used for data at rest. With traditional analytics, you gather the information, store it in a database, and analyze it hours, days, or weeks later. Analyzing real-time data requires a different approach. Instead of running database queries on stored data, streaming analytics platforms have to process the data continuously and before the data lands in a database. And streaming data comes in at an incredible rate that can vary up and down all the time. Streaming analytics platforms have to be able to process this data when it arrives, often at speeds of millions and even tens of millions of events per hour.
Key requirements of stream processing
Durable: durable ingest so that processing can be repeated
Continuous: always processing the latest data
Fast: frequency (micro-batches, batch size, true streaming) and speed (sub-second, minute, hour)
Correct: at-most-once, at-least-once, and exactly-once processing; event time, ingest time, processing time
Reactive: ability to process and respond in near real-time; feedback mechanisms to send processed data to live applications
Reliable: highly available, with fast failover
Since the launch of Amazon Kinesis in 2013, the ecosystem has evolved, and we have introduced Kinesis Firehose and Kinesis Analytics.
Streams was launched in GA at re:Invent 2014, Firehose at re:Invent 2015, and Analytics in August 2016.
We have continuously iterated to make it easier for customers to use streaming data, as well as to expand the functionality of real-time processing.
Together, these three products make up the Amazon Kinesis streaming data platform.
Easy administration: Simply create a new stream, and set the desired level of capacity with shards. Scale to match your data throughput rate and volume.
Build real-time applications: Perform continual processing on streaming data using Kinesis Client Library (KCL), Apache Spark/Storm, AWS Lambda, and more.
Low cost: Cost-efficient for workloads of any scale.
Apply SQL on streams: Easily connect to a Kinesis Stream or Firehose Delivery Stream and apply SQL skills.
Build real-time applications: Perform continual processing on streaming big data with sub-second processing latencies
Easy scalability: elastically scales to match data throughput for most workloads
Easy and interactive experience: Complete most stream processing use cases in minutes, and easily progress toward sophisticated scenarios
Zero admin: Capture and deliver streaming data into S3, Redshift, Elasticsearch, and other AWS destinations without writing an application or managing infrastructure
Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery into S3 and other destinations in as little as 60 seconds, set up in minutes
Seamless elasticity: Seamlessly scales to match data throughput
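One common way to "build real-time applications" as described above is an AWS Lambda function subscribed to a stream; Kinesis delivers record payloads base64-encoded inside the invocation event. A minimal sketch (the handler logic and the synthetic event are assumptions for illustration; only the event shape follows the real Kinesis-to-Lambda integration):

```python
import base64
import json

def handler(event, context=None):
    """Minimal Lambda-style consumer for Kinesis records (sketch).
    Each record's payload arrives base64-encoded under kinesis.data."""
    out = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        out.append(json.loads(payload))
    return out

# A synthetic event shaped like a Kinesis -> Lambda invocation:
fake_event = {"Records": [
    {"kinesis": {"data": base64.b64encode(
        json.dumps({"temp": 21.5}).encode()).decode()}}
]}
```

Feeding `fake_event` to `handler` decodes the single record back into the original dictionary, which is a handy way to unit-test consumer logic without a live stream.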
A shard is a group of data records in a stream. When you create a stream, you specify the number of shards for the stream.
Each shard supports up to 5 read transactions per second, up to a maximum total data read rate of 2 MB per second, and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). The total capacity of a stream is the sum of the capacities of its shards. You can increase or decrease the number of shards in a stream as needed; however, note that you are charged on a per-shard basis.
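The per-shard limits above translate directly into a sizing calculation: the shard count must cover the write bandwidth, the write record rate, and the read bandwidth. A small sketch (the function is an illustrative helper, not an AWS API; real sizing should also leave headroom for traffic spikes):

```python
import math

def shards_needed(write_bytes_per_sec, write_records_per_sec,
                  read_bytes_per_sec):
    """Estimate shard count from the per-shard limits quoted above:
    1 MB/s and 1,000 records/s for writes, 2 MB/s for reads."""
    return max(
        math.ceil(write_bytes_per_sec / 1_000_000),
        math.ceil(write_records_per_sec / 1_000),
        math.ceil(read_bytes_per_sec / 2_000_000),
    )

# e.g. 5 MB/s ingest, 3,000 records/s, 8 MB/s aggregate read -> 5 shards
n = shards_needed(5_000_000, 3_000, 8_000_000)
```

In this example the write bandwidth (5 MB/s against a 1 MB/s per-shard limit) is the binding constraint, so five shards are needed even though the record rate alone would need only three.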
Sonos runs near real-time streaming analytics on device data logs from their connected hi-fi audio equipment.
Hearst: analyzing 30 TB+ of clickstream data, enabling real-time insights for publishers.
Nordstrom's recommendation team built an online stylist using Amazon Kinesis Streams and AWS Lambda.