How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale Networks in Real-time
The Connection Game
How We Use Kinesis Streams to
Analyze Billions of Network Traffic
Flows in Real-Time
John Bennett, Cloud Network Engineering
Senior Software Engineer
Netflix is big
● 93 million customers
● Over 190 countries
● 37% of Internet traffic
● 125 million hours of video
And complex
● 100s of microservices
● 1,000s of deployments
● More than 100,000 instances
How do we optimize the design and
use of the network at scale in a
dynamic environment?
How is the network being used?
IPs in a Dynamic Environment
● Immutable infrastructure
● Scaling events
● Internal (instances and containers)
● External (AWS S3, ELB, Internet, etc.)
Metadata Changes Over Time
Metadata in a Dynamic Environment
● Slowly changing dimension
● Unpredictable
● Valid during a specific time interval
Source IP → Destination IP at time t
Source Metadata → Destination Metadata at time t
Dredge
Transforms traffic logs into
enriched and aggregated
multi-dimensional data
Metadata Dimensions
● Account
● Region
● Availability Zone
● VPC, Subnet
● Protocol (TCP, UDP)
● Accept or Reject
● Application
● Cluster
● Type
○ instance
○ container
○ AWS service
Aggregated Metrics
● Bytes transferred
● Packets sent
● Number of flows
● Latency
Queries
● OLAP-style (Online Analytical Processing)
● Rollup
○ e.g., all apps deployed to the same region roll up to that region
● Drill down
○ e.g., which apps deployed to a region generate the most traffic?
● Slicing and dicing
○ e.g., which apps generate the most traffic in a region, by day?
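To make rollup, drill-down, and slicing concrete, here is a minimal Python sketch over a handful of made-up aggregated rows. The row fields (`day`, `region`, `app`, `bytes`) and the helper `group_sum` are hypothetical illustrations, not Dredge's schema or API; an OLAP store performs these group-bys natively.

```python
from collections import defaultdict
from typing import Iterable


def group_sum(rows: Iterable[dict], keys: tuple, metric: str = "bytes") -> dict:
    """Sum `metric` over rows, grouped by the named dimensions."""
    totals: dict = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[metric]
    return dict(totals)


# Made-up pre-aggregated rows: one per (day, region, app).
rows = [
    {"day": "2017-03-28", "region": "us-east-1", "app": "foo", "bytes": 300},
    {"day": "2017-03-28", "region": "us-east-1", "app": "bar", "bytes": 400},
    {"day": "2017-03-29", "region": "us-east-1", "app": "foo", "bytes": 200},
    {"day": "2017-03-29", "region": "eu-west-1", "app": "baz", "bytes": 700},
]

# Rollup: all apps deployed to a region roll up to that region.
by_region = group_sum(rows, ("region",))

# Drill down: which apps in us-east-1 generate the most traffic?
in_region = [r for r in rows if r["region"] == "us-east-1"]
top_apps = sorted(group_sum(in_region, ("app",)).items(),
                  key=lambda kv: kv[1], reverse=True)

# Slice and dice: traffic per app in us-east-1, split by day.
by_app_day = group_sum(in_region, ("app", "day"))
```

Each query is the same grouped sum with a different set of dimensions and filters, which is why pre-aggregating on many dimensions pays off.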
New source for network analytics
● Large dataset (billions of events per day)
● Multiple dimensions and metrics
● Ad-hoc OLAP queries
● Fast aggregations
● Real-time
Dredge
Ingest
Network data from the entire system
Enrich
Traffic logs with application metadata
Aggregate
Multi-dimensional metrics
Flow Logs
AWS API for
network traffic flows
Flow Logs Overview
● Good: Wide coverage
● Good: Consolidated
● Good: Core info (src and dst IP, timestamp)
● Bad: 10-minute capture window
● Ugly: Stateless
Example
{
accountID: 123456789010,
eniID: eni-abc123de,
srcIP: 172.31.16.139,
srcPort: 12345,
dstIP: 10.13.67.49,
dstPort: 80,
protocol: 6,
packets: 123,
bytes: 42,
start: 1490746336,
end: 1490746369,
action: ACCEPT,
...
}
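The record above is a schematic view. In the default (version 2) format, VPC Flow Logs records are space-separated text lines with the fields version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, and log-status. A minimal parser sketch, assuming that default format (custom formats change the field list), might look like this; records whose log-status is NODATA or SKIPDATA carry `-` in the traffic fields and are skipped. The function name `parse_flow_log` is hypothetical.

```python
from typing import Optional

# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = ["version", "account_id", "interface_id", "srcaddr", "dstaddr",
          "srcport", "dstport", "protocol", "packets", "bytes",
          "start", "end", "action", "log_status"]
# Numeric fields; account_id stays a string so leading zeros survive.
INT_FIELDS = {"version", "srcport", "dstport", "protocol",
              "packets", "bytes", "start", "end"}


def parse_flow_log(line: str) -> Optional[dict]:
    """Parse one default-format record, or return None for NODATA/SKIPDATA."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    if record["log_status"] != "OK":
        return None  # traffic fields are "-" for these records
    for name in INT_FIELDS:
        record[name] = int(record[name])
    return record
```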
Given a VpcFlowLogEvent
{srcIP: 172.31.16.139, dstIP: 10.13.67.49, …}
Enriched with application metadata
{srcIP: 172.31.16.139, dstIP: 10.13.67.49, srcMetadata:
{app: foo}, dstMetadata: {app: bar},…}
Aggregated and indexed
App foo sent 426718 bytes to app bar today
App bar received 8278392 bytes from apps foo and baz in
the last week
App baz has outbound network dependencies on apps foo,
bar, etc.
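The enrich-then-aggregate steps above can be sketched in a few lines of Python. This assumes a static, hypothetical IP-to-app table and made-up IPs and byte counts; in the real system the mapping changes over time, which this sketch ignores. It sums bytes per (source app, destination app) edge and derives each app's outbound dependencies, the two kinds of answers shown above.

```python
from collections import defaultdict

# Hypothetical, static IP -> app mapping (real address metadata is temporal).
IP_TO_APP = {"172.31.16.139": "foo", "10.13.67.49": "bar", "10.13.70.2": "baz"}


def enrich(event: dict) -> dict:
    """Attach source and destination application metadata to a flow event."""
    return {**event,
            "srcMetadata": {"app": IP_TO_APP.get(event["srcIP"], "unknown")},
            "dstMetadata": {"app": IP_TO_APP.get(event["dstIP"], "unknown")}}


def aggregate(events):
    """Sum bytes per (src app, dst app) edge and list each app's outbound deps."""
    bytes_by_edge = defaultdict(int)
    deps = defaultdict(set)
    for e in map(enrich, events):
        src, dst = e["srcMetadata"]["app"], e["dstMetadata"]["app"]
        bytes_by_edge[(src, dst)] += e["bytes"]
        deps[src].add(dst)
    return dict(bytes_by_edge), {app: sorted(d) for app, d in deps.items()}


# Made-up flow events.
events = [
    {"srcIP": "172.31.16.139", "dstIP": "10.13.67.49", "bytes": 100},
    {"srcIP": "172.31.16.139", "dstIP": "10.13.67.49", "bytes": 50},
    {"srcIP": "10.13.70.2", "dstIP": "172.31.16.139", "bytes": 30},
    {"srcIP": "10.13.70.2", "dstIP": "10.13.67.49", "bytes": 20},
]
edges, deps = aggregate(events)
```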
Patterns
Read This Book
The following diagrams are adapted from Kleppmann’s talks
on patterns for real-time stream processing.
Streams of immutable events
Derived Data, Read-Optimized
● Database indexes and secondary indexes
● Materialized views
● Caching
Unbundle the database
● Separation of concerns for reading and writing
● Changelog stream is a first-class citizen
● Consume and join streams instead of querying the DB
● Maintain materialized views
● Pre-computed cache
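A minimal sketch of the materialized-view idea: a consumer applies each changelog entry (an upsert, or a `None` tombstone for a delete) to a local read-optimized table, so readers query the view instead of the source database. The class name, the entry shape, and the sample IPs are hypothetical, chosen only for illustration.

```python
from typing import Optional

# A changelog entry is (key, value); value None marks a delete (tombstone).
Change = tuple[str, Optional[dict]]


class MaterializedView:
    """Read-optimized table derived by consuming a changelog stream."""

    def __init__(self) -> None:
        self._rows: dict[str, dict] = {}
        self.offset = 0  # changes applied so far; a real consumer would checkpoint this

    def apply(self, change: Change) -> None:
        key, value = change
        if value is None:
            self._rows.pop(key, None)
        else:
            self._rows[key] = value
        self.offset += 1

    def get(self, key: str) -> Optional[dict]:
        return self._rows.get(key)


view = MaterializedView()
for change in [("10.0.0.5", {"app": "foo"}), ("10.0.0.6", {"app": "bar"}),
               ("10.0.0.5", {"app": "baz"}), ("10.0.0.6", None)]:
    view.apply(change)
```

Because the view is derived purely from the stream, it can be rebuilt at any time by replaying the changelog from the beginning.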
Ingest
Kinesis Over Kafka
● Integration with AWS services
● Kinesis Client Library (KCL)
● Auto-scaling for elastic throughput
● Total Cost of Ownership (TCO)
Cross-Account Log Sharing
Kinesis Client Library
● Worker per EC2 instance
○ Multiple record processors per worker
○ Record processor per shard
● Load balancing between workers
● Checkpointing (with DynamoDB)
● Stream- and shard-level metrics
Elastic throughput
[Figure: VPC Flow Logs IncomingBytes per hour, example account and region over 1 week]
[Figure: VPC Flow Logs IncomingBytes per minute, example account and region over 3 hours]
TCO
● Very little operational overhead
○ Monitor stream metrics and DynamoDB table
○ Run and manage the auto-scaling utility
● No need to consult the internal Kafka team for
○ Capacity planning
○ Monitoring, failover, and replication
Limitations
● Per-shard limits
○ Increase shard count or fan out to other streams
● No log compaction
○ 7-day maximum retention
○ Manual snapshots, increased complexity
○ Not ideal for changelog joins
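Per-shard limits translate into simple capacity arithmetic. The sketch below assumes the published limits for provisioned Kinesis Data Streams: 1 MB/s or 1,000 records/s of writes per shard, and 2 MB/s of reads per shard shared by all standard consumers (those not using enhanced fan-out). `shards_needed` is a hypothetical helper, not the auto-scaling utility described earlier.

```python
import math

# Per-shard limits for provisioned Kinesis Data Streams (standard consumers).
WRITE_MB_PER_SEC = 1.0
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0  # shared by all standard (non-enhanced-fan-out) consumers


def shards_needed(in_mb_per_sec: float, in_records_per_sec: float,
                  consumers: int = 1) -> int:
    """Smallest shard count that respects both write limits and the shared read limit."""
    by_write_bytes = in_mb_per_sec / WRITE_MB_PER_SEC
    by_write_records = in_records_per_sec / WRITE_RECORDS_PER_SEC
    by_read = in_mb_per_sec * consumers / READ_MB_PER_SEC
    return max(1, math.ceil(max(by_write_bytes, by_write_records, by_read)))
```

For example, 10 MB/s of incoming flow records at 5,000 records/s needs 10 shards with one consumer, but 15 shards if three standard consumers each read the full stream; this is where fanning out to other streams becomes attractive.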
Ingest: Lessons
● Kinesis enables us to focus
● Cross-account log sharing simplifies the system
● KCL does the boring stuff
● Auto-scaling improves efficiency
● Lower TCO
Enrich
Address metadata is temporal
Address Metadata Changelog
● Hash table of sorted lists
● Key is IP, value is metadata sorted by timestamp
● Use updates within the flow's capture window, or else the last known update
● Join with flow log events stream
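The hash table of sorted lists can be sketched with Python's `bisect` module. This sketch interprets the lookup rule as: use the latest update at or before the end of the flow's capture window, which picks an update inside the window if one exists and otherwise falls back to the last update before it. The class and method names are hypothetical.

```python
import bisect
from collections import defaultdict
from typing import Optional


class AddressMetadata:
    """Hash table of sorted lists: IP -> updates ordered by timestamp."""

    def __init__(self) -> None:
        self._times: dict[str, list[int]] = defaultdict(list)
        self._values: dict[str, list[dict]] = defaultdict(list)

    def update(self, ip: str, timestamp: int, metadata: dict) -> None:
        # Changelog entries may arrive out of order, so insert in sorted position.
        i = bisect.bisect_right(self._times[ip], timestamp)
        self._times[ip].insert(i, timestamp)
        self._values[ip].insert(i, metadata)

    def lookup(self, ip: str, window_end: int) -> Optional[dict]:
        """Latest metadata at or before the end of the capture window, if any."""
        times = self._times.get(ip)
        if not times:
            return None
        i = bisect.bisect_right(times, window_end)
        return self._values[ip][i - 1] if i else None


meta = AddressMetadata()
meta.update("10.0.0.5", 100, {"app": "foo"})
meta.update("10.0.0.5", 700, {"app": "bar"})
meta.update("10.0.0.5", 400, {"app": "baz"})  # arrives out of order
```

Joining a flow event then amounts to calling `lookup` for its source and destination IPs with the record's end time.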
Kafka Log Compaction
Derive TCP State
Direction   Src Port        Dst Port
Inbound     Ephemeral       Non-ephemeral
Outbound    Ephemeral       Non-ephemeral
Return      Non-ephemeral   Ephemeral
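A minimal sketch of deriving direction from port roles, following the table above. It assumes the Linux default ephemeral port range (32768 to 60999, set by `net.ipv4.ip_local_port_range`); other operating systems use different ranges, so the range is a parameter. It also assumes the caller knows which IP is local to the interface that produced the record. The `Flow` type and `classify` function are hypothetical.

```python
from dataclasses import dataclass

# Assumption: Linux default ephemeral port range (32768-60999 inclusive).
EPHEMERAL = range(32768, 61000)


@dataclass(frozen=True)
class Flow:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def classify(flow: Flow, local_ip: str, ephemeral: range = EPHEMERAL) -> str:
    src_eph = flow.src_port in ephemeral
    dst_eph = flow.dst_port in ephemeral
    if src_eph and not dst_eph:
        # Client-to-server leg: direction depends on which end is local.
        if flow.dst_ip == local_ip:
            return "inbound"
        if flow.src_ip == local_ip:
            return "outbound"
        return "unknown"
    if dst_eph and not src_eph:
        return "return"  # server-to-client response traffic
    return "unknown"  # both or neither port ephemeral: ambiguous
```

Note that under this assumed range the sample record earlier in the talk (srcPort 12345, dstPort 80) classifies as "unknown", which shows how sensitive the heuristic is to the ephemeral range chosen.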
Enrich: Lessons
● Stream-table join with changelog
● Log compaction for cold starts, bootstrapping
● Derive state from stateless
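Log compaction is what makes cold starts cheap: the changelog topic keeps at least the latest value for every key, so a new consumer can rebuild the current table by reading it from the beginning. The sketch below models those semantics in plain Python (it is not Kafka's implementation; Kafka also retains tombstones for a configurable period before removing them, whereas this sketch drops them immediately) and checks that replaying the compacted log yields the same table as replaying the full log. Entry shapes and keys are hypothetical.

```python
from typing import Optional

Entry = tuple[str, Optional[dict]]  # (key, value); None is a tombstone


def compact(log: list[Entry]) -> list[Entry]:
    """Keep only each key's latest entry, in log order, dropping tombstoned keys."""
    latest = {}
    for offset, (key, _) in enumerate(log):
        latest[key] = offset
    return [entry for offset, entry in enumerate(log)
            if latest[entry[0]] == offset and entry[1] is not None]


def replay(log: list[Entry]) -> dict:
    """Rebuild the table a consumer would hold after reading the whole log."""
    table = {}
    for key, value in log:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


log = [("a", {"app": "foo"}), ("b", {"app": "bar"}), ("a", {"app": "baz"}),
       ("b", None), ("c", {"app": "qux"})]
```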
Aggregate
Bucket deadline reached
…
dataSchema: {
  dataSource: flowlogs,
  parser: {
    parseSpec: {
      …,
      dimensionsSpec: {
        dimensions: [
          srcApp,
          srcAccount,
          srcRegion,
          …,
          dstApp,
          dstAccount,
          dstRegion,
          …
        ]
      }
    }
  },
  metricsSpec: [
    {
      type: longSum,
      name: packets,
      fieldName: packets
    },
    {
      type: longSum,
      name: bytes,
      fieldName: bytes
    }
  ]
}
● Column-oriented
● Inspired by Google BigQuery and PowerDrill
● Ad-hoc OLAP queries
● Fast aggregations
● Multi-dimensional metrics
● Scales to trillions of events
Aggregate: Lessons
● Pre-aggregate into timestamp buckets
● Druid is a great fit for exploratory analytics
● Fast ad-hoc queries, < 1 second
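Pre-aggregation into timestamp buckets, together with the "Bucket deadline reached" step from the Aggregate diagram, can be sketched as follows. The bucket width, the allowed lateness, and the policy of dropping events that arrive after their bucket was emitted are all assumptions for illustration; the class name is hypothetical. Summing bytes and packets per (bucket, dimensions) mirrors the `longSum` metrics in the schema above.

```python
from collections import defaultdict


class BucketAggregator:
    """Sum metrics per (bucket start, dimensions); emit buckets past their deadline."""

    def __init__(self, width: int, lateness: int) -> None:
        self.width = width          # bucket width in seconds
        self.lateness = lateness    # wait this long past a bucket's end before emitting
        self.flushed_before = 0     # buckets starting before this were already emitted
        self.dropped = 0            # events that arrived after their bucket was emitted
        # bucket start -> dimension tuple -> [bytes, packets]
        self.buckets = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def add(self, ts: int, dims: tuple, nbytes: int, packets: int) -> None:
        start = ts - ts % self.width
        if start < self.flushed_before:
            self.dropped += 1  # assumed policy: drop late events
            return
        totals = self.buckets[start][dims]
        totals[0] += nbytes
        totals[1] += packets

    def flush(self, now: int) -> list:
        """Return (bucket_start, dims, bytes, packets) rows whose deadline has passed."""
        rows = []
        for start in sorted(self.buckets):
            if start + self.width + self.lateness > now:
                break  # later buckets have later deadlines
            for dims, (nbytes, packets) in self.buckets.pop(start).items():
                rows.append((start, dims, nbytes, packets))
            self.flushed_before = start + self.width
        return rows
```

Emitting only pre-summed rows keeps the volume sent to the OLAP store proportional to the number of distinct dimension combinations per bucket rather than to the raw event rate.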
Results
Pivot / Swiv Demo
● Drag-and-drop UI
● Contextual exploration
● Comparison
Exploratory Analysis with Pivot / Swiv Demo
● Bytes sent per application, table
● Bytes sent per application, split by hour, line chart
● Bytes sent by example application, split by hour, line chart
● Comparison of bytes, flows, and packets, split by day, line chart
Other Use Cases
● Auditing AWS security groups (virtual firewalls)
● Anomaly and threat detection
● Deployment best practices
● Cost analysis
[Figure: Example application as a network graph, with a "You are here" marker]
Enriched and aggregated traffic data
is a powerful source of information
for network design and optimization.
@yo_bennett
