SlideShare a Scribd company logo
1 of 34
Download to read offline
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Adi Krishnan, Principal Product Manager, AWS
August 11, 2016
Real-Time Streaming Data on AWS
Deep Dive & Best Practices Using Amazon Kinesis
Learning objectives
• Introduction to streaming data
• Use cases and design patterns
• Amazon Kinesis Streams data ingestion: top 6 things to
know
• Stream processing overview:
- Understanding your options
- Amazon Kinesis Client Library (KCL) overview
- Spark Streaming on Amazon EMR: overview best practices
- Amazon Kinesis Analytics (new)
• Q & A
Batch Processing
Hourly server logs
Weekly or monthly bills
Daily website clickstream
Daily fraud reports
Stream Processing
Real-time metrics
Real-time spending alerts/caps
Real-time clickstream analysis
Real-time detection
It’s all about the pace
Metering Record Common Log Entry
MQTT RecordSyslog Entry
{
"payerId": "Joe",
"productCode": "AmazonS3",
"clientProductCode": "AmazonS3",
"usageType": "Bandwidth",
"operation": "PUT",
"value": "22490",
"timestamp": "1216674828"
}
{
127.0.0.1 user-
identifier frank
[10/Oct/2000:13:5
5:36 -0700] "GET
/apache_pb.gif
HTTP/1.0" 200
2326
}
{
"SeattlePublicWa
ter/Kinesis/123/
Realtime" –
412309129140
}
{
<165>1 2003-10-11T22:14:15.003Z
mymachine.example.com evntslog -
ID47 [exampleSDID@32473 iut="3"
eventSource="Application"
eventID="1011"][examplePriority@
32473 class="high"]
}
Streaming data challenges: variety & velocity
• Streaming data comes in
different types and
formats
− Metering records,
logs, and sensor
data
− JSON, CSV, TSV
• Can vary in size from a
few bytes to kilobytes or
megabytes
• High velocity and
continuous
Streaming data scenarios across verticals
Scenarios/
Verticals
Accelerated Ingest-
Transform-Load
Continuous Metrics
Generation
Responsive Data
Analysis
Digital Ad
Tech/Marketing
Publisher, bidder data
aggregation
Advertising metrics like
coverage, yield, and conversion
User engagement with ads,
optimized bid/buy engines
IoT
Sensor, device
telemetry,
data ingestion
Operational metrics and
dashboards
Device operational
intelligence and alerts
Gaming
Online data aggregation,
e.g., top 10 players
Massively multiplayer online
game (MMOG) live dashboard
Leader board generation,
player-skill match
Consumer
Online
Clickstream analytics
Metrics like impressions and
page views
Recommendation engines,
proactive care
Customer use cases
Sonos runs near real-time streaming
analytics on device data logs from
their connected hi-fi audio equipment
Analyzing 30 TB+ clickstream
data, enabling real-time insights for
publishers
Glu Mobile collects billions of gaming
events data points from millions of user
devices in real-time every single day
Nordstorm recommendation team built
online stylist using Amazon Kinesis
Streams and AWS Lambda
Amazon Kinesis platform overview
Amazon Kinesis: streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS
Amazon Kinesis Firehose
For all developers, data
scientists, IT professionals
Transform and load streaming
data into Amazon S3, Amazon
Redshift, Amazon Elasticsearch
Service, and more
Amazon Kinesis
Analytics
For all developers, analysts
and data scientists
Easily analyze streaming data
using standard SQL
Amazon Kinesis
Streams
For technical developers
Build custom application to
process or analyze streaming
data
Amazon Kinesis Firehose
Load streaming data into Amazon S3, Amazon Redshift, and Amazon ES
Zero administration: Deliver streaming data into Amazon S3, Amazon Redshift, Amazon ES,
and other destinations without writing an application or managing infrastructure.
Direct-to-data store integration: Batch, compress, and encrypt streaming data for
delivery into data destinations in as little as 60 seconds using simple configurations.
Seamless elasticity: Seamlessly scale to match data throughput without intervention.
Capture and submit
streaming data
Analyze streaming data using
your favorite BI tools
Firehose loads streaming data
continuously into Amazon S3, Amazon
Redshift, and Amazon ES
Data
Sources
App.4
[Machine
Learning]
AWSEndpoint
App.1
[Aggregate &
De-Duplicate]
Data
Sources
Data
Sources
Data
Sources
App.2
[Metric
Extraction]
Amazon S3
Amazon Redshift
App.3
[Sliding
Window
Analysis]
Availability
Zone
Shard 1
Shard 2
Shard N
Availability
Zone
Availability
Zone
Amazon Kinesis Streams
Managed service for real-time streaming
AWS Lambda
Amazon EMR
Streaming data ingestion
What stream storage should I use?
Amazon
Kinesis
Streams
Amazon
Kinesis
Firehose
Apache
Kafka
Amazon
DynamoDB
Streams
AWS Managed
Service
Yes Yes No Yes
Ordering Yes Yes Yes Yes
Data Retention
Period
7 days N/A Manual/
Configurable
24 hours
Availability 3 AZ 3 AZ Manual/
Configurable
3 AZ
Scale /
Throughput
No Limit /
~ Shards
No limit /
Automatic
No limit /
~ Nodes
No limit /
~ Table IOPS
Parallel Clients Yes No Yes Yes
Stream MapReduce Yes N/A Yes Yes
Record/object size 1MB 1 MB Configurable 400 KB
Cost Low Low Low (+admin) Higher (table cost)
• Streams are made of shards
• Each shard ingests up to 1 MB/sec,
and 1000 records/sec
• Each shard emits up to 2 MB/sec
• All data is stored for 24 hours by
default; storage can be extended for
up to 7 days
• Scale Amazon Kinesis Streams using
scaling util
• Replay data inside of 24-hour window
Amazon Kinesis Streams
Managed ability to capture and store data
Managed Buffer
• Care about a reliable, scalable
way to capture data
• Defer all other aggregation to
consumer
• Generate random partition
keys
• Ensure a high cardinality for
partition keys with respect to
shards, to spray evenly across
available shards
Topic #1: framing your data ingestion pattern
Workload determines partition key strategy
Streaming MapReduce
• Streaming MapReduce: leverage
partition keys as a natural way to
aggregate data
• For e.g., partition keys per billing
customer, per DeviceId, per stock
symbol
• Design partition keys to scale
• Be aware of “hot partition keys or
shards”
Topic #2: putting data into Amazon Kinesis
Options range from AWS SDKs to agents
• Amazon Kinesis Agent – (supports preprocessing)
http://docs.aws.amazon.com/firehose/latest/dev/writing-with-
agents.html
• Amazon Kinesis Producer Library
http://docs.aws.amazon.com/firehose/latest/dev/writing-with-
agents.html
• Use open source log data aggregators
Consider Flume, Fluentd as collectors/agents
See https://github.com/awslabs/aws-fluent-plugin-kinesis
Topic #3: sizing the Amazon Kinesis stream
Provision adequate shards, you can always change them
• For ingress needs – capture all incoming data
• Count likely data producers – log servers, sensor/ things,
smartphone app installs
• Individual payload size and (desired) frequency of Puts
• For egress needs – feed all consuming applications
• Each shard can do 2 MB/sec on egress
• Add more shards for more applications
• Include head-room for “catch-up” with data in stream in
the event of application failures
java -cp
KinesisScalingUtils.jar-complete.jar
-Dstream-name=MyStream
-Dscaling-action=scaleUp
-Dcount=10
-Dregion=eu-west-1 ScalingClient
• stream-name - the name of the stream to scale
• scaling-action - the action to take to scale; must be one of “scaleUp”,
“scaleDown”, or “resize”
• count - number of shards by which to absolutely scale up or down, or
resize
See https://github.com/awslabs/amazon-kinesis-scaling-utils
Topic #4: scaling streams is an online operation
Leverage Amazon Kinesis Auto Scaling utils
Topic #5: enabling enhanced monitoring
Get per-shard metrics for Amazon Kinesis Streams
• IncomingBytes
• IncomingRecords
• OutgoingBytes
• OutgoingRecords
• WriteProvisionedThroughputExceeded
• ReadProvisionedThroughputExceeded
• IteratorAgeMilliseconds
• ALL
Topic #6: when to use Firehose or
Amazon Kinesis Streams
Depends on your use case and nature of data
Amazon Kinesis Streams is for use cases that require custom
processing, per incoming record, with sub-1 second processing
latency, and a choice of stream processing frameworks.
Amazon Kinesis Firehose is for use cases that require zero
administration, ability to use existing analytics tools based on
Amazon S3, Amazon Redshift and Amazon ES, and a data
latency of 60 seconds or higher.
Amazon Kinesis Streams processing
What stream processing technology should I use?
KCL
Application
Spark
Streaming
AWS Lambda Amazon
Kinesis
Analytics
Apache
Storm
Scale ~ Nodes ~ Nodes Automatic Automatic ~ Nodes
Micro-Batch or
Real-Time
Near real-time Micro-batch Near real-time Near real-time Real-time
AWS Managed
Service
No (KCL +
EC2 + Auto
Scaling)
Yes (via EMR) Yes Yes No (EC2)
Scalability No limits
~ Nodes
No limits
~ Nodes
No limits No Limits No limits
~ Nodes
Availability Multi-AZ Single AZ Multi-AZ Multi-AZ Configurable
Programming
Languages
Java, via
MultiLang
Daemon
(.NET, Python,
Ruby, Node.js)
Java, Python,
Scala
Node.js, Java,
Python
SQL Any language
via Thrift
Amazon Kinesis Client Library (KCL)
Amazon KCL
Open source library to build your own apps
• Build Amazon Kinesis applications with KCL
• Available for Java, Ruby, Python, Node.JS dev
• Deploy on your EC2 instances
• KCL application includes three components:
1. Worker – processing unit that maps to each application EC2
instance
2. Record Processor Factory – creates the record processor
3. Record Processor – processor unit that processes data from a
shard in Amazon Kinesis Streams
State management with KCL
Enables scalable, at-least once-processing
• One record processor maps to one shard and processes data
records from that shard
• One worker maps to one or more record processors
• Balances shard-worker associations when worker/instance counts
change
• Balances shard-worker associations when shards split or merge
Apache Spark Streaming on Amazon EMR
Apache Spark and Amazon Kinesis Streams
• Apache Spark is an in-memory analytics cluster using RDD for fast
processing
• Spark Streaming can read directly from an Amazon Kinesis stream
• Amazon software license linking – add ASL dependency to
SBT/MAVEN project, artifactId = spark-streaming-
kinesis-asl_2.10
KinesisUtils.createStream(‘twitter-stream’)
.filter(_.getText.contains(”Open-Source"))
.countByWindow(Seconds(5))
Example: Counting tweets on a sliding window
Using Spark Streaming with Amazon Kinesis Streams
Key configurations to bear in mind for Spark 1.6+ on Amazon EMR
• Leverage EMRFS consistent view option – if you use Amazon S3
as storage for Spark checkpoint
• DynamoDB table name – make sure there is only one instance of
the application running with Spark Streaming
• Enable Spark-based checkpoints
• Number of Amazon Kinesis receivers is multiple of executors so
they are load-balanced
• Total processing time is less than the batch interval
• Number of executors is the same as number of cores per
executor
• Spark Streaming uses default of 1 sec with KCL
Demo
Sensors
Amazon
Kinesis
Streams
Streaming
ProcessData Source
Collect /
Store
Consume
- Simulated using EC2
- 20K events/sec
- Scalable
- Managed
- Millisecond latency
- Highly available
- Scalable
- Managed
- Micro-batch
- Checkpoint in DynamoDB
Amazon
DynamoDB
Device_Id, temperature, timestamp
1,65,2016-03-25 01:01:20
2,68,2016-03-25 01:01:20
3,67,2016-03-25 01:01:20
4,85,2016-03-25 01:01:20
Amazon Kinesis Analytics
Amazon Kinesis Analytics
Analyze data streams continuously with standard SQL
Apply SQL on streams: easily connect to a stream or Firehose and apply SQL
Build real-time applications: continuous processing on streaming big data
Fully managed: no servers to manage - elastically scalable
Connect to
Amazon Kinesis streams,
Firehose delivery streams
Run standard SQL queries
against data streams
Analytics can send processed data to
analytics tools so you can create alerts
and respond in real time
Conclusion
• Amazon Kinesis offers: managed service to build applications, streaming
data ingestion, and continuous processing
• Stream ingest: Set up partition key model upfront, determine producer-side
software, leverage detailed metrics
• Stream process: Pick right tool for right job
• Try out the Amazon Kinesis platform: http://aws.amazon.com/kinesis/
• Technical documentations
• Amazon Kinesis Agent
• Amazon Kinesis Streams and Spark Streaming
• Amazon Kinesis Producer Library Best Practice
• Amazon Kinesis Firehose and AWS Lambda
• Building Near Real-Time Discovery Platform with Amazon Kinesis
• Public case studies
• Glu mobile – Real-Time Analytics
• Hearst Publishing – Clickstream Analytics
• How Sonos Leverages Amazon Kinesis
• Nordstorm Online Stylist
Reference
Thank you!

More Related Content

Viewers also liked

Barracuda WAF: Scalable Security for Applications on AWS
Barracuda WAF: Scalable Security for Applications on AWSBarracuda WAF: Scalable Security for Applications on AWS
Barracuda WAF: Scalable Security for Applications on AWS
Amazon Web Services
 
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Spark Summit
 
Rackspace: Best Practices for Security Compliance on AWS
Rackspace: Best Practices for Security Compliance on AWSRackspace: Best Practices for Security Compliance on AWS
Rackspace: Best Practices for Security Compliance on AWS
Amazon Web Services
 

Viewers also liked (19)

Maximizing Business Value as You Migrate to AWS
Maximizing Business Value as You Migrate to AWSMaximizing Business Value as You Migrate to AWS
Maximizing Business Value as You Migrate to AWS
 
AWSome Day Leeds
AWSome Day Leeds AWSome Day Leeds
AWSome Day Leeds
 
another day, another billion packets
another day, another billion packetsanother day, another billion packets
another day, another billion packets
 
Barracuda WAF: Scalable Security for Applications on AWS
Barracuda WAF: Scalable Security for Applications on AWSBarracuda WAF: Scalable Security for Applications on AWS
Barracuda WAF: Scalable Security for Applications on AWS
 
AWS re:Invent 2016: AWS Customers Saving Lives with Mobile and IoT Technology...
AWS re:Invent 2016: AWS Customers Saving Lives with Mobile and IoT Technology...AWS re:Invent 2016: AWS Customers Saving Lives with Mobile and IoT Technology...
AWS re:Invent 2016: AWS Customers Saving Lives with Mobile and IoT Technology...
 
Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...
Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...
Building HPC Clusters as Code in the (Almost) Infinite Cloud | AWS Public Sec...
 
Workshop: Building Your First Big Data Application on AWS
Workshop: Building Your First Big Data Application on AWSWorkshop: Building Your First Big Data Application on AWS
Workshop: Building Your First Big Data Application on AWS
 
AWS Mobile Hub - Building Mobile Apps with AWS
AWS Mobile Hub - Building Mobile Apps with AWSAWS Mobile Hub - Building Mobile Apps with AWS
AWS Mobile Hub - Building Mobile Apps with AWS
 
Amazon Aurora for the Enterprise - August 2016 Monthly Webinar Series
Amazon Aurora for the Enterprise - August 2016 Monthly Webinar SeriesAmazon Aurora for the Enterprise - August 2016 Monthly Webinar Series
Amazon Aurora for the Enterprise - August 2016 Monthly Webinar Series
 
Customer Case Study: Achieving PCI Compliance in AWS
Customer Case Study: Achieving PCI Compliance in AWSCustomer Case Study: Achieving PCI Compliance in AWS
Customer Case Study: Achieving PCI Compliance in AWS
 
Secure Content Delivery Using Amazon CloudFront and AWS WAF
Secure Content Delivery Using Amazon CloudFront and AWS WAFSecure Content Delivery Using Amazon CloudFront and AWS WAF
Secure Content Delivery Using Amazon CloudFront and AWS WAF
 
AWS Welcome to re:Invent recap - 20161214
AWS Welcome to re:Invent recap - 20161214AWS Welcome to re:Invent recap - 20161214
AWS Welcome to re:Invent recap - 20161214
 
AWS Directory Service and Hybrid Strategy | AWS Public Sector Summit 2016
AWS Directory Service and Hybrid Strategy | AWS Public Sector Summit 2016AWS Directory Service and Hybrid Strategy | AWS Public Sector Summit 2016
AWS Directory Service and Hybrid Strategy | AWS Public Sector Summit 2016
 
Innovating with AWS: How Microservices on AWS Can Transform Your Business
Innovating with AWS: How Microservices on AWS Can Transform Your BusinessInnovating with AWS: How Microservices on AWS Can Transform Your Business
Innovating with AWS: How Microservices on AWS Can Transform Your Business
 
Deep Dive on Microservices and Amazon ECS
Deep Dive on Microservices and Amazon ECSDeep Dive on Microservices and Amazon ECS
Deep Dive on Microservices and Amazon ECS
 
Getting Started with the Hybrid Cloud: Enterprise Backup and Recovery
Getting Started with the Hybrid Cloud: Enterprise Backup and RecoveryGetting Started with the Hybrid Cloud: Enterprise Backup and Recovery
Getting Started with the Hybrid Cloud: Enterprise Backup and Recovery
 
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
 
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
Clickstream Analysis with Spark—Understanding Visitors in Realtime by Josef A...
 
Rackspace: Best Practices for Security Compliance on AWS
Rackspace: Best Practices for Security Compliance on AWSRackspace: Best Practices for Security Compliance on AWS
Rackspace: Best Practices for Security Compliance on AWS
 

More from Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
UXDXConf
 

Recently uploaded (20)

Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG Evaluation
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
Buy Epson EcoTank L3210 Colour Printer Online.pdf
Buy Epson EcoTank L3210 Colour Printer Online.pdfBuy Epson EcoTank L3210 Colour Printer Online.pdf
Buy Epson EcoTank L3210 Colour Printer Online.pdf
 
Top 10 Symfony Development Companies 2024
Top 10 Symfony Development Companies 2024Top 10 Symfony Development Companies 2024
Top 10 Symfony Development Companies 2024
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 

Deep Dive and Best Practices for Real-Time Streaming Applications

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Adi Krishnan, Principal Product Manager, AWS August 11, 2016 Real-Time Streaming Data on AWS Deep Dive & Best Practices Using Amazon Kinesis
  • 2. Learning objectives • Introduction to streaming data • Use cases and design patterns • Amazon Kinesis Streams data ingestion: top 6 things to know • Stream processing overview: - Understanding your options - Amazon Kinesis Client Library (KCL) overview - Spark Streaming on Amazon EMR: overview best practices - Amazon Kinesis Analytics (new) • Q & A
  • 3. Batch Processing Hourly server logs Weekly or monthly bills Daily website clickstream Daily fraud reports Stream Processing Real-time metrics Real-time spending alerts/caps Real-time clickstream analysis Real-time detection It’s all about the pace
  • 4. Metering Record Common Log Entry MQTT RecordSyslog Entry { "payerId": "Joe", "productCode": "AmazonS3", "clientProductCode": "AmazonS3", "usageType": "Bandwidth", "operation": "PUT", "value": "22490", "timestamp": "1216674828" } { 127.0.0.1 user- identifier frank [10/Oct/2000:13:5 5:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 } { "SeattlePublicWa ter/Kinesis/123/ Realtime" – 412309129140 } { <165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"][examplePriority@ 32473 class="high"] } Streaming data challenges: variety & velocity • Streaming data comes in different types and formats − Metering records, logs, and sensor data − JSON, CSV, TSV • Can vary in size from a few bytes to kilobytes or megabytes • High velocity and continuous
  • 5. Streaming data scenarios across verticals Scenarios/ Verticals Accelerated Ingest- Transform-Load Continuous Metrics Generation Responsive Data Analysis Digital Ad Tech/Marketing Publisher, bidder data aggregation Advertising metrics like coverage, yield, and conversion User engagement with ads, optimized bid/buy engines IoT Sensor, device telemetry, data ingestion Operational metrics and dashboards Device operational intelligence and alerts Gaming Online data aggregation, e.g., top 10 players Massively multiplayer online game (MMOG) live dashboard Leader board generation, player-skill match Consumer Online Clickstream analytics Metrics like impressions and page views Recommendation engines, proactive care
  • 6. Customer use cases Sonos runs near real-time streaming analytics on device data logs from their connected hi-fi audio equipment Analyzing 30 TB+ clickstream data, enabling real-time insights for publishers Glu Mobile collects billions of gaming events data points from millions of user devices in real-time every single day Nordstorm recommendation team built online stylist using Amazon Kinesis Streams and AWS Lambda
  • 8. Amazon Kinesis: streaming data made easy Services make it easy to capture, deliver, and process streams on AWS Amazon Kinesis Firehose For all developers, data scientists, IT professionals Transform and load streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and more Amazon Kinesis Analytics For all developers, analysts and data scientists Easily analyze streaming data using standard SQL Amazon Kinesis Streams For technical developers Build custom application to process or analyze streaming data
  • 9. Amazon Kinesis Firehose Load streaming data into Amazon S3, Amazon Redshift, and Amazon ES Zero administration: Deliver streaming data into Amazon S3, Amazon Redshift, Amazon ES, and other destinations without writing an application or managing infrastructure. Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations. Seamless elasticity: Seamlessly scale to match data throughput without intervention. Capture and submit streaming data Analyze streaming data using your favorite BI tools Firehose loads streaming data continuously into Amazon S3, Amazon Redshift, and Amazon ES
  • 10. Data Sources App.4 [Machine Learning] AWSEndpoint App.1 [Aggregate & De-Duplicate] Data Sources Data Sources Data Sources App.2 [Metric Extraction] Amazon S3 Amazon Redshift App.3 [Sliding Window Analysis] Availability Zone Shard 1 Shard 2 Shard N Availability Zone Availability Zone Amazon Kinesis Streams Managed service for real-time streaming AWS Lambda Amazon EMR
  • 12. What stream storage should I use? Amazon Kinesis Streams Amazon Kinesis Firehose Apache Kafka Amazon DynamoDB Streams AWS Managed Service Yes Yes No Yes Ordering Yes Yes Yes Yes Data Retention Period 7 days N/A Manual/ Configurable 24 hours Availability 3 AZ 3 AZ Manual/ Configurable 3 AZ Scale / Throughput No Limit / ~ Shards No limit / Automatic No limit / ~ Nodes No limit / ~ Table IOPS Parallel Clients Yes No Yes Yes Stream MapReduce Yes N/A Yes Yes Record/object size 1MB 1 MB Configurable 400 KB Cost Low Low Low (+admin) Higher (table cost)
  • 13. • Streams are made of shards • Each shard ingests up to 1 MB/sec, and 1000 records/sec • Each shard emits up to 2 MB/sec • All data is stored for 24 hours by default; storage can be extended for up to 7 days • Scale Amazon Kinesis Streams using scaling util • Replay data inside of 24-hour window Amazon Kinesis Streams Managed ability to capture and store data
  • 14. Managed Buffer • Care about a reliable, scalable way to capture data • Defer all other aggregation to consumer • Generate random partition keys • Ensure a high cardinality for partition keys with respect to shards, to spray evenly across available shards Topic #1: framing your data ingestion pattern Workload determines partition key strategy Streaming MapReduce • Streaming MapReduce: leverage partition keys as a natural way to aggregate data • For e.g., partition keys per billing customer, per DeviceId, per stock symbol • Design partition keys to scale • Be aware of “hot partition keys or shards”
  • 15. Topic #2: putting data into Amazon Kinesis Options range from AWS SDKs to agents • Amazon Kinesis Agent – (supports preprocessing) http://docs.aws.amazon.com/firehose/latest/dev/writing-with- agents.html • Amazon Kinesis Producer Library http://docs.aws.amazon.com/firehose/latest/dev/writing-with- agents.html • Use open source log data aggregators Consider Flume, Fluentd as collectors/agents See https://github.com/awslabs/aws-fluent-plugin-kinesis
  • 16. Topic #3: sizing the Amazon Kinesis stream Provision adequate shards, you can always change them • For ingress needs – capture all incoming data • Count likely data producers – log servers, sensor/ things, smartphone app installs • Individual payload size and (desired) frequency of Puts • For egress needs – feed all consuming applications • Each shard can do 2 MB/sec on egress • Add more shards for more applications • Include head-room for “catch-up” with data in stream in the event of application failures
  • 17. java -cp KinesisScalingUtils.jar-complete.jar -Dstream-name=MyStream -Dscaling-action=scaleUp -Dcount=10 -Dregion=eu-west-1 ScalingClient • stream-name - the name of the stream to scale • scaling-action - the action to take to scale; must be one of “scaleUp”, “scaleDown”, or “resize” • count - number of shards by which to absolutely scale up or down, or resize See https://github.com/awslabs/amazon-kinesis-scaling-utils Topic #4: scaling streams is an online operation Leverage Amazon Kinesis Auto Scaling utils
  • 18. Topic #5: enabling enhanced monitoring Get per-shard metrics for Amazon Kinesis Streams • IncomingBytes • IncomingRecords • OutgoingBytes • OutgoingRecords • WriteProvisionedThroughputExceeded • ReadProvisionedThroughputExceeded • IteratorAgeMilliseconds • ALL
  • 19. Topic #6: when to use Firehose or Amazon Kinesis Streams Depends on your use case and nature of data Amazon Kinesis Streams is for use cases that require custom processing, per incoming record, with sub-1 second processing latency, and a choice of stream processing frameworks. Amazon Kinesis Firehose is for use cases that require zero administration, ability to use existing analytics tools based on Amazon S3, Amazon Redshift and Amazon ES, and a data latency of 60 seconds or higher.
  • 21. What stream processing technology should I use? KCL Application Spark Streaming AWS Lambda Amazon Kinesis Analytics Apache Storm Scale ~ Nodes ~ Nodes Automatic Automatic ~ Nodes Micro-Batch or Real-Time Near real-time Micro-batch Near real-time Near real-time Real-time AWS Managed Service No (KCL + EC2 + Auto Scaling) Yes (via EMR) Yes Yes No (EC2) Scalability No limits ~ Nodes No limits ~ Nodes No limits No Limits No limits ~ Nodes Availability Multi-AZ Single AZ Multi-AZ Multi-AZ Configurable Programming Languages Java, via MultiLang Daemon (.NET, Python, Ruby, Node.js) Java, Python, Scala Node.js, Java, Python SQL Any language via Thrift
  • 22. Amazon Kinesis Client Library (KCL)
  • 23. Amazon KCL Open source library to build your own apps • Build Amazon Kinesis applications with KCL • Available for Java, Ruby, Python, Node.JS dev • Deploy on your EC2 instances • KCL application includes three components: 1. Worker – processing unit that maps to each application EC2 instance 2. Record Processor Factory – creates the record processor 3. Record Processor – processor unit that processes data from a shard in Amazon Kinesis Streams
  • 24. State management with KCL Enables scalable, at-least once-processing • One record processor maps to one shard and processes data records from that shard • One worker maps to one or more record processors • Balances shard-worker associations when worker/instance counts change • Balances shard-worker associations when shards split or merge
  • 25. Apache Spark Streaming on Amazon EMR
  • 26. Apache Spark and Amazon Kinesis Streams • Apache Spark is an in-memory analytics cluster using RDD for fast processing • Spark Streaming can read directly from an Amazon Kinesis stream • Amazon software license linking – add ASL dependency to SBT/MAVEN project, artifactId = spark-streaming- kinesis-asl_2.10 KinesisUtils.createStream(‘twitter-stream’) .filter(_.getText.contains(”Open-Source")) .countByWindow(Seconds(5)) Example: Counting tweets on a sliding window
  • 27. Using Spark Streaming with Amazon Kinesis Streams Key configurations to bear in mind for Spark 1.6+ on Amazon EMR • Leverage EMRFS consistent view option – if you use Amazon S3 as storage for Spark checkpoint • DynamoDB table name – make sure there is only one instance of the application running with Spark Streaming • Enable Spark-based checkpoints • Number of Amazon Kinesis receivers is multiple of executors so they are load-balanced • Total processing time is less than the batch interval • Number of executors is the same as number of cores per executor • Spark Streaming uses default of 1 sec with KCL
  • 28. Demo Sensors Amazon Kinesis Streams Streaming ProcessData Source Collect / Store Consume - Simulated using EC2 - 20K events/sec - Scalable - Managed - Millisecond latency - Highly available - Scalable - Managed - Micro-batch - Checkpoint in DynamoDB Amazon DynamoDB Device_Id, temperature, timestamp 1,65,2016-03-25 01:01:20 2,68,2016-03-25 01:01:20 3,67,2016-03-25 01:01:20 4,85,2016-03-25 01:01:20
  • 30. Amazon Kinesis Analytics Analyze data streams continuously with standard SQL Apply SQL on streams: easily connect to a stream or Firehose and apply SQL Build real-time applications: continuous processing on streaming big data Fully managed: no servers to manage - elastically scalable Connect to Amazon Kinesis streams, Firehose delivery streams Run standard SQL queries against data streams Analytics can send processed data to analytics tools so you can create alerts and respond in real time
  • 31.
  • 32. Conclusion • Amazon Kinesis offers: managed service to build applications, streaming data ingestion, and continuous processing • Stream ingest: Set up partition key model upfront, determine producer-side software, leverage detailed metrics • Stream process: Pick right tool for right job • Try out the Amazon Kinesis platform: http://aws.amazon.com/kinesis/
  • 33. • Technical documentations • Amazon Kinesis Agent • Amazon Kinesis Streams and Spark Streaming • Amazon Kinesis Producer Library Best Practice • Amazon Kinesis Firehose and AWS Lambda • Building Near Real-Time Discovery Platform with Amazon Kinesis • Public case studies • Glu mobile – Real-Time Analytics • Hearst Publishing – Clickstream Analytics • How Sonos Leverages Amazon Kinesis • Nordstorm Online Stylist Reference