Real-time processing of streaming data is a common architectural pattern used in many applications. Amazon Kinesis Analytics is the easiest way to process streaming data in real time with standard SQL, without having to learn new programming languages or processing frameworks. We will show how to use Amazon Kinesis Analytics on streaming data to gain actionable insights from your data.
Services: Amazon Kinesis Analytics, Amazon Redshift, Amazon Elasticsearch Service, and Amazon S3.
Presenters: Kobi Biton & Ran Tessler
3. Mobile Apps Web Clickstream Application Logs
Metering Records IoT Sensors Smart Buildings
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
Most data is produced continuously
4. Recent data is highly valuable
• If you act on it in time
• Perishable Insights (M. Gualtieri, Forrester)
Old + Recent data is more valuable
• If you have the means to combine them
The diminishing value of data
5. • Durable
• Continuous
• Fast
• Correct
• Reactive
• Reliable
What are the key requirements?
Ingest → Transform → Analyze → React → Persist
Processing real-time, streaming data
6. Amazon Kinesis Streams
Easy administration: Create a stream, set capacity level with shards. Scale to
match your data throughput rate & volume.
Build real-time applications: Process streaming data w/ Kinesis Client Library
(KCL), Apache Spark/Storm, AWS Lambda,...
Low cost: Cost-efficient for workloads of any scale.
7. Amazon Kinesis Firehose
Zero administration: Capture and deliver streaming data to Amazon S3,
Redshift, Elasticsearch w/o writing an app or managing infrastructure.
Direct-to-data store integration: Batch, compress, and encrypt streaming
data for delivery in as little as 60 seconds.
Seamless elasticity: Seamlessly scales to match data throughput w/o
intervention.
Capture and submit streaming data to Firehose
Firehose loads streaming data continuously into Amazon S3, Amazon Redshift, and Amazon Elasticsearch
Analyze streaming data using your favorite BI tools
8. Amazon Kinesis Analytics
Apply SQL on streams: Easily connect to a Kinesis Stream or Firehose
Delivery Stream and apply SQL skills.
Build real-time applications: Perform continual processing on streaming big
data with sub-second processing latencies.
Easy scalability: Elastically scales to match data throughput.
Connect to Kinesis streams, Firehose delivery streams
Run standard SQL queries against data streams
Kinesis Analytics can send processed data to analytics tools so you can create alerts and respond in real time
9. Use SQL to build real-time applications
Easily write SQL code to process
streaming data
Connect to streaming source
Continuously deliver SQL results
10. A streaming table is a STREAM
• In relational databases, you work with SQL tables
• With Kinesis Analytics, you work with STREAMs
• SELECT, INSERT, and CREATE can be used with STREAMs
CREATE STREAM Tweets
(author VARCHAR(20),
text VARCHAR(140));
INSERT INTO Tweets
SELECT …
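Note: the INSERT above is shorthand. In a deployed Kinesis Analytics application, a continuous INSERT is typically wrapped in a pump that reads from the schematized input stream. A minimal sketch, assuming the default input stream name SOURCE_SQL_STREAM_001 (the stream, pump, and column names here are illustrative):

-- Define the in-application stream to insert into
CREATE OR REPLACE STREAM "Tweets"
  (author VARCHAR(20),
   text VARCHAR(140));

-- A pump continuously selects from the input stream and inserts into "Tweets"
CREATE OR REPLACE PUMP "TWEETS_PUMP" AS
  INSERT INTO "Tweets"
  SELECT STREAM author, text
  FROM "SOURCE_SQL_STREAM_001";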
11. A simple streaming query
• Tweets about the DLD Festival Summit
• Selecting from a STREAM of tweets, an in-application
stream
• Each row has a corresponding ROWTIME
SELECT STREAM ROWTIME, author, text
FROM Tweets
WHERE text LIKE '%#DLDTelAviv%'
12. Writing queries on unbounded datasets
• Streams are unbounded data sets
• Need continuous queries, row by row or across rows
• WINDOWS define a start and end to the query
SELECT STREAM author,
count(author) OVER ONE_MINUTE
FROM Tweets
WINDOW ONE_MINUTE AS
(PARTITION BY author
RANGE INTERVAL '1' MINUTE PRECEDING);
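The OVER clause above defines a sliding window, which emits an updated count for every arriving row. A tumbling window, which emits one result per author per minute, can be sketched with GROUP BY instead (assuming the dialect's FLOOR-on-ROWTIME form; names match the Tweets example above):

-- One result per author per one-minute tumbling window
SELECT STREAM author,
       COUNT(*) AS tweets_per_minute
FROM Tweets
GROUP BY author,
         FLOOR(Tweets.ROWTIME TO MINUTE);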
14. Amazon Kinesis: Streaming Data Made Easy
Services make it easy to capture, deliver, and process streams on AWS
Amazon Kinesis Firehose
For all developers, data scientists
Easily load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service
Amazon Kinesis Streams
For technical developers
Collect and stream data for ordered, replayable, real-time processing
Amazon Kinesis Analytics
For all developers, data scientists
Easily analyze data streams using standard SQL queries
Narrative: The reality is that most data is produced continuously and is coming at us at lightning speeds due to an explosive growth of real-time data sources.
TP: Machine data will make up 40% of our digital universe by 2020
TP: Over 60MM units will be deployed by retailers and bank branches by 2019
Narrative: Whether it is log data coming from mobile and web applications, purchase data from ecommerce sites, or sensor data from IoT devices, it all delivers information that can help companies learn about what their customers, organization, and business are doing right now.
TP: Customer Benefits
Improve operational efficiencies, improve customer experiences, enable new business models
Smart building: reduce energy costs, cut maintenance, increase safety and security
Smart textiles: monitor skin temperature, monitor stress
Narrative: So how much is this data worth? Well, it depends…
Recent data is highly valuable
If you act on it in time
Perishable Insights (M. Gualtieri, Forrester)
Old + Recent data is more valuable
If you have the means to combine them
Narrative: Processing real-time data as it arrives can let you make decisions much faster and get the most value from your data. But building your own custom applications to process streaming data is complicated and resource intensive. You need to train or hire developers with the right skill sets, wait months for the applications to be built and fine-tuned, and then operate and scale the application as the business grows.
All of this takes lots of time and money, and, at the end of the day, many companies never get there, settle for the status quo, and live with information that is hours or days old.
Narrative: You need a different set of analytical tools to collect and analyze real-time streaming data than what you have traditionally used for data at rest. With traditional analytics, you gather the information, store it in a database, and analyze it hours, days, or weeks later. Analyzing real-time data requires a different approach. Instead of running database queries on stored data, streaming analytics platforms have to process the data continuously and before the data lands in a database. And streaming data comes in at an incredible rate that can vary up and down all the time. Streaming analytics platforms have to be able to process this data when it arrives, often at speeds of millions and even tens of millions of events per hour.
Key requirements of stream processing
Durable: Durable ingest so that processing can be repeated
Continuous: Needs to always be processing the latest data
Fast: Frequency (micro-batches, size of batches, true streaming) and speed (sub-second, minute, hour)
Correct: At-most-once, at-least-once, and exactly-once processing; event time, ingest time, processing time
Reactive: Ability to process and respond in near real time; feedback mechanisms to send processed data to live applications
Reliable: Highly available, fast failovers
Connect to streaming source - Select Kinesis Streams or Firehose delivery streams as input data for your SQL code
Easily write SQL code to process streaming data - Author applications with a single SQL statement, or build sophisticated applications as multi-step pipelines using advanced analytical functions
Continuously deliver SQL results - Configure one to many outputs for your processed data for real-time alerts, visualizations, and batch analytics
There are three major components of a streaming application: connecting to an input stream, the SQL code, and one or more output destinations.
Inputs include Kinesis Streams and Firehose.
Standard SQL code consists of one or more SQL queries that make up an application. Most applications will have one to three SQL statements that perform ETL after initial schematization, and then several SQL queries that generate real-time analytics on the transformed data. This is the point of the "streaming application" concept: you chain SQL statements together, in serial and in parallel, through filters, joins, and merges to build a streaming graph within your application. We believe most customers will have between 5 and 10 SQL statements, but some will build sophisticated applications with 50-100 statements.
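As a sketch of such a multi-step pipeline, reusing the Tweets example from the slides (stream and pump names are illustrative, and DESTINATION_SQL_STREAM stands for the in-application stream mapped to the configured output): an ETL/filter step feeds an intermediate in-application stream, and an analytics step writes its results to the output stream.

-- Step 1 (ETL): filter the raw tweets into an intermediate in-application stream
CREATE OR REPLACE STREAM "FESTIVAL_TWEETS" (author VARCHAR(20), text VARCHAR(140));
CREATE OR REPLACE PUMP "ETL_PUMP" AS
  INSERT INTO "FESTIVAL_TWEETS"
  SELECT STREAM author, text
  FROM "Tweets"
  WHERE text LIKE '%#DLDTelAviv%';

-- Step 2 (analytics): sliding one-minute count per author, delivered to the output stream
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (author VARCHAR(20), tweet_count INTEGER);
CREATE OR REPLACE PUMP "ANALYTICS_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM author,
         COUNT(author) OVER ONE_MINUTE AS tweet_count
  FROM "FESTIVAL_TWEETS"
  WINDOW ONE_MINUTE AS (PARTITION BY author RANGE INTERVAL '1' MINUTE PRECEDING);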
Outputs include Kinesis Streams and Firehose (S3, Redshift, and Elasticsearch (coming at the Chicago Summit)). The next outputs will be CloudWatch Metrics and QuickSight, but we have yet to confirm whether these will be in place for GA.