Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix using Flink - Snehal Nagmote & Pallavi Phadnis

Massive Scale
Data Processing
Pallavi Phadnis, Snehal Nagmote
Flink Forward SF 2019

● Consolidated Logging (CL) Overview
● High Level Architecture of CL platform
● Log Processing at Scale
● Event Extractor Use Case
● Monitoring and Alerting
● Impact of Flink based Platform
Agenda

Build an integrated solution to provide insights into user
behavior and application performance metrics through
client-side logging.
Consolidated Logging

Use Cases Powered By CL
● Personalization
● Recommendations
● A/B Experimentation
● Application Performance

Consolidated Logging
Event Types (300+) × Device Platforms / App Versions (10+) = Log Events
● Event Types: Log, ProfileIdentify, Presented, NavigationLevel, Focus, ..., Play
● Device Platforms / App Versions: TVUI, Android, iOS, Web, ...
● Log Events
○ 100s of billions of events / day
○ 1+ petabyte of user behavior data per day

CL Platform
● Sources: Netflix apps; other internal/external apps
● Consolidated Logging
○ Data collection
○ Data transformation
○ Event extraction
○ Data enrichment: user info, geo/device info
● Derived data: site usage, app navigations, shows watched, user sessionization
● Schemas: CL Schema, App Schema
● Platforms: iOS, Android, TVUI, WWW
● Platform features
● Real-time & batch data sinks

Legacy Pipeline
● Components: Landing Service, SQS, Log Processing Server, Kafka, CL Streaming App, CL ETL, CL DW (Hive), Kafka, CL Router App, 13 Keystone routes, S3

Flink Based Platform
● Components: Landing Service, Kafka, Event Extractor, CL App, Kafka Streams, Elasticsearch, Hive tables

● Generic log processing application - supports different logging specifications
● Real-time processing
○ Data transformations
○ Data enrichment - Membership information, Geo, Device type
■ Joins
● Single source of truth with unified output schema
● Supports different data sinks: Kafka/Hive
● SLA
○ RPS: 3.5 million events per sec at peak, Latency: < 3ms
CL App Features

CL App Design
● Stateless Flink application (Flink 1.4, Kafka 1.1)
○ At-least-once processing
● Isolation of concerns through separate Flink jobs for different use cases / sink types
● Different job DAGs built on a common framework library: Fan In / Fan Out

Common Log Processing Framework
● Source: Kafka (raw events, segregated sources)
● Log Consumer
● Config Reader (FP)
● Spec Parser: Request Type & Version
● Data Transformations: CL Schema / App Schema
● Data Enrichment
● Data Sink: processed events to Kafka, Hive / Iceberg (multiple sinks)
● Events Backup: raw events to Hive
● Data Partitioning

● Embarrassingly parallel job (parallelism over 2000)
○ Uniform CPU utilization with a high number of partitions on the source Kafka topic
● High memory pressure and GC pauses on the JobManager (JM), causing recovery failures and restart loops
○ Memory leak in archiving execution history (FLINK-10066)
○ Scaling bottleneck of the Kafka source's union state (FLINK-10122)
● Overwhelmed coordinator due to the thundering herd problem with high parallelism (KIP-266)
Learnings & Best Practices
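As a rough sense of scale, the two figures quoted on the slides (3.5 million events per second at peak, parallelism over 2000) can be combined into an average per-subtask rate. This is only a back-of-the-envelope sketch: it assumes a perfectly even split across subtasks and treats 2000 as a lower bound on parallelism; the function name is illustrative.

```python
# Figures quoted on the slides.
PEAK_EVENTS_PER_SEC = 3_500_000  # "3.5 million events per sec at peak"
PARALLELISM = 2000               # "parallelism over 2000", used here as a lower bound


def events_per_subtask(total_rate: int, parallelism: int) -> float:
    """Average events/sec per subtask, assuming a perfectly even split."""
    if parallelism <= 0:
        raise ValueError("parallelism must be positive")
    return total_rate / parallelism


per_subtask = events_per_subtask(PEAK_EVENTS_PER_SEC, PARALLELISM)
print(f"at most {per_subtask:.0f} events/sec per subtask on average")
```

Since the real parallelism is above 2000, the actual average per subtask would be somewhat lower than this bound.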

Data compression - a factor to consider

● Data compression ratio was worse (~4x) for Parquet and Kafka
○ Upstream Kafka producer batching difference increased data entropy
● Backlog in Kafka can lead to sudden load on external microservices
● Kafka backpressure leads to task failures
○ Duplicate events
● Guice dependency injection conflicts with Flink
○ classloader.resolve-order=parent-first
Learnings & Best Practices

Event Extractor Use Case
● Inputs: user clicks, user searches, app perf metrics, impressions
● CL Stream (transformed and enriched)
● Streams: Personalization stream, Search stream, Impressions stream, Experimentation stream
● CL Consumers
○ Personalization Pipeline
○ Search Pipeline
○ Impressions Pipeline
○ A/B Experimentation Pipeline
○ Consumer Insights Pipeline
○ Exploratory Analysis
○ Customer Service Tool

● Growth/Scale
○ 3.5 million events/sec
○ Reading same data multiple times
■ Compute redundancy
■ Scale Kafka infrastructure for outgoing bytes
■ Operational Overhead
● High Compute and Operational cost
Problems with CL Legacy Pipeline

Keystone Routes For CL Event Extractor

● Stateless Single Flink Application
● Read data once, apply processing and route it to multiple streams
● Configuration-driven processing, without code changes
● SQL support on streams
● Filter, transformation and projection support on streams
● Out-of-the-box metrics for users
What is Event Extractor?
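The "read data once, apply processing and route it to multiple streams" idea can be sketched in plain Python, outside Flink. This is an illustrative model, not the Netflix implementation: the `Route` type, the route names, and the event fields `type`, `title` and `device` are all hypothetical. Only the event type names (Presented, Play, Focus) come from the slides.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, NamedTuple

Event = Dict[str, object]


class Route(NamedTuple):
    name: str                            # hypothetical route identifier
    predicate: Callable[[Event], bool]   # filter step
    fields: List[str]                    # projection step


def route_events(events: Iterable[Event], routes: List[Route]) -> Dict[str, List[Event]]:
    """Read each event once and emit a projected copy to every matching route."""
    out: Dict[str, List[Event]] = defaultdict(list)
    for event in events:                 # single pass over the input
        for route in routes:
            if route.predicate(event):
                out[route.name].append({f: event.get(f) for f in route.fields})
    return dict(out)


routes = [
    Route("impressions", lambda e: e.get("type") == "Presented", ["type", "title"]),
    Route("playback", lambda e: e.get("type") == "Play", ["type", "device"]),
    Route("tv", lambda e: e.get("device") == "TVUI", ["type"]),
]
events = [
    {"type": "Presented", "title": "t1", "device": "iOS"},
    {"type": "Play", "title": "t2", "device": "TVUI"},
    {"type": "Focus", "title": "t3", "device": "Web"},
]
result = route_events(events, routes)
```

One event can land in several routes (the Play event matches both `playback` and `tv`), while the input is still scanned only once. That is the property that removes the redundant reads described for the legacy pipeline.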

● User configuration in YAML
● Configs are managed in version control and updated in S3
● Example config
Event Extractor User Interface
filterExpression: field1 = 'Presented' and field2 like '%impressionToken%' and field3 not like '%storyArt%'
projectionExpression: field_name1, field_name2, field_name3, field_name5
transformations: { OutputFieldName: inner_field, fieldName: top_level_field, nestedFieldName: inner_field, type: type }
sinkDetails: { sinkType: kafka, name: topic_name }
ownerName: email-address
routeName: unique_name
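To make the `filterExpression` semantics concrete, here is a minimal sketch of how such an expression could be turned into a predicate. It is not the Event Extractor's SQL parser: it assumes only `=`, `like` and `not like` conditions joined by `and`, treats missing fields as empty strings, and would mis-split a literal that itself contains " and ". The function names are hypothetical.

```python
import re
from typing import Callable, Dict, List

Event = Dict[str, str]

# One condition: <field> (= | like | not like) '<literal>'
_CONDITION = re.compile(r"^\s*(\w+)\s*(=|not\s+like|like)\s*'([^']*)'\s*$", re.IGNORECASE)


def _like_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a SQL LIKE pattern (% and _ wildcards) into an anchored regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$", re.DOTALL)


def compile_filter(expression: str) -> Callable[[Event], bool]:
    """Build a predicate that is true when every 'and'-joined condition holds."""
    checks: List[Callable[[Event], bool]] = []
    for clause in re.split(r"\s+and\s+", expression, flags=re.IGNORECASE):
        match = _CONDITION.match(clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        field, op, literal = match.group(1), match.group(2).lower(), match.group(3)
        if op == "=":
            checks.append(lambda e, f=field, v=literal: e.get(f) == v)
        else:
            regex = _like_to_regex(literal)
            negate = op.startswith("not")
            # Missing fields are treated as empty strings (a simplification).
            checks.append(
                lambda e, f=field, r=regex, n=negate: (r.match(e.get(f, "")) is not None) != n
            )
    return lambda event: all(check(event) for check in checks)


keep = compile_filter(
    "field1= 'Presented' and field2 like '%impressionToken%' and field3 not like '%storyArt%'"
)
matching = {"field1": "Presented", "field2": "x impressionToken y", "field3": "boxArt"}
rejected = {"field1": "Presented", "field2": "x impressionToken y", "field3": "a storyArt b"}
```

With the example expression from the slide, `keep(matching)` is true and `keep(rejected)` is false because `field3` contains `storyArt`.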

Event Extractor Design
● Inputs: CL Enriched Stream; user configs via S3 (User Config Management Pipeline)
● Event Extractor
○ Config Reader
○ Config Parser
○ SQL Parser
○ Schema Builder
○ Filter Function
○ Transformation
○ Projection
● Sinks: Kafka Sink (multiple Kafka sinks), Hive Sink (Hive), Elasticsearch Sink

● Scaling a single Flink application
● Lack of isolation
○ Isolated by the type of sink the application writes to
○ Deployment per sink type (Kafka, Hive, Elasticsearch)
● Back pressure is shared between multiple consumers
○ Consumer Kafka topics are created in the same cluster
○ Canaries and testing before onboarding a new config
Challenges with Event Extractor

● Network pressure buildup caused S3 checkpoint failures due to socket timeouts
○ Job goes into a restart loop due to the high frequency of checkpoint failures
○ Mitigation: better G1GC tuning and increased S3 timeouts
● Tuning parallelism to avoid unbalanced CPU utilization
○ Extensive use of CPU flame graphs and system metrics to identify bottlenecks
○ Setting parallelism in multiples of Kafka partitions and task slots to achieve better CPU utilization
Learnings and Best Practices
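The link between parallelism and Kafka partition count can be illustrated with a small calculation. The sketch below assumes a simple round-robin assignment of partitions to source subtasks, which is a simplification of what the Flink Kafka consumer actually does. The partition and parallelism numbers are hypothetical, chosen only to show an even and an uneven split.

```python
from typing import Tuple


def load_range(num_partitions: int, parallelism: int) -> Tuple[int, int]:
    """Return (min, max) partitions per subtask under round-robin assignment."""
    if num_partitions <= 0 or parallelism <= 0:
        raise ValueError("num_partitions and parallelism must be positive")
    loads = [0] * parallelism
    for partition in range(num_partitions):
        loads[partition % parallelism] += 1  # assumed round-robin placement
    return min(loads), max(loads)


# Hypothetical topic with 1000 partitions.
uneven = load_range(1000, 300)  # 1000 is not a multiple of 300
even = load_range(1000, 250)    # 1000 = 4 * 250
print("parallelism 300:", uneven)
print("parallelism 250:", even)
```

When the partition count is not a multiple of the parallelism, some subtasks read one more partition than others, which is one plausible source of the unbalanced CPU utilization mentioned above.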

● Flink Kafka Consumer needs a continuous stream of data to advance the high watermark (FLINK-5479)
○ StickyPartitioner producer skips producing data to out-of-sync partitions
○ Setting stickyPartitioner.minQualifiedIsrRatio=1.0 helps to produce data to out-of-sync partitions
● Outlier container/broker (due to bad hardware)
○ Consumer gets a non-linear traffic pattern (stuck consumer alert)
○ Producer throws BatchExpiredTimeout exceptions, and checkpoint failures increase
Learnings and Best Practices

● Keystone (self-serve UI) for deployment of streaming apps
○ Out-of-the-box ELK stack support for application logs
○ Automated alerts integration with Atlas
● Deployment strategy
○ Minimize duplicates; checkpoints are stored in S3
● Restart strategy
○ Fine-grained recovery
Deployment

CL Platform Benefits
● Improved Data Processing
○ Can handle large payloads compared to the legacy pipeline
○ Improved error handling
● Reduced Data Loss
○ Reduced points of failure
○ Ability to backfill or reprocess historic raw events
● Reduced Cost & Operational Overhead
○ Legacy tables decommissioned and storage redundancy reduced
○ Read once and route to different sinks through Event Extractor
● Single Source of Truth
○ Single source of truth (SSOT) for CL data in the data warehouse
○ Schema consistency across CL components and tools
