Schema-aware Data Streams
at Netflix Scale
Jagannathrao Mudda
Ramayan Tiwari
The 3Vs of Big Data
Handling DATA VARIETY is critical for DATA QUALITY
167+ million members spanning 190+ countries
Netflix Scale
Millions of client devices with different versions of desktop, mobile, and TV OS
Netflix Scale
Data Variety:
● 450+ event types from millions of devices
● Structured and Semi-structured event data
* These data points are limited to user behaviour data coming from client devices
The 3Vs at Netflix*
Data Velocity:
● 350K+ requests per second in real time ✅
● 7+ million events processed per second ✅
Data Volume:
● 400+ billion events collected every day ✅
● Petabyte of data per day at rest ✅
● Netflix consumer app
○ Events capturing user interaction, intent, and behavior
○ Events capturing app and system performance
● Netflix production studio apps
● Netflix partner apps (resellers bundling Netflix with their services)
● Sales, marketing, advertising, and promotions events
Data Variety at Netflix
● Misinterpretation of data leads to:
○ Inconsistent metrics and data insights
○ Poor recommendations and personalization
○ Inconclusive A/B testing results
○ Decrease in Member Joy, leading to churn
● Data producer changes could break data consumer apps
● Hard to deprecate any event type
Data Variety Impact on Data Quality
● Limit unstructured data unless absolutely required
● Curate or transform unstructured data during processing
● Schematize structured/semi-structured data
● Build Schema-aware Data Streams
How to Handle Data Variety?
Use Case:
Event Processing Pipeline
● Schematization (Defining/Updating Schema)
● Schemafication (Generating schema compliant events)
● Schema Validation
● Integration with Streaming Application
● Schema Definition of Data at Rest
Phases for Schema-aware Data Stream Design
● Schematization
○ New event types are added frequently
○ Existing event types are updated
○ How to define schemas for event types?
○ How to seamlessly notify client apps and server-side apps?
○ How to handle schema evolution of event types?
Design Challenges
● Schemafication
■ Client side
■ Server side
● Schema Validation
○ Compile time or runtime?
○ How to handle schema non-compliant events?
Design Challenges
● Schema-aware data streams
○ How to define schemas for data streams generated by stateless/stateful applications?
○ How to handle schema evolution of data streams?
○ How do consumers get access to the schema of a data stream?
● Data at Rest
○ How to make it cost-effective and still highly performant?
Schematization Design
● Client-side schemafication
○ Send schema update notifications to every client/device
○ Access the schema registry from the client/device (outside vip)
○ Package the updated schema with the image and deploy a new version on each device
● Server-side schemafication
○ Generate schema-compliant records in a Flink streaming app
○ Use the latest Avro schema from the schema registry to generate Avro records
○ Schema client in the app receives schema update notifications
Design Approaches - Schemafication
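The server-side approach above can be sketched as follows. This is only an illustration under stated assumptions: `InMemorySchemaRegistry`, `Schemafier`, the `version` key, and the `play_start` event are hypothetical names, not the registry or client used in the talk, and plain dictionaries stand in for Avro records. The sketch shows one idea: the streaming app holds a schema client that is notified of updates, so new records follow the latest schema without redeploying the app.

```python
import json


class InMemorySchemaRegistry:
    """Hypothetical stand-in for a schema registry (not a real client API)."""

    def __init__(self):
        self._schemas = {}
        self._listeners = {}

    def subscribe(self, event_type, callback):
        # Register a callback that fires whenever a new schema is published.
        self._listeners.setdefault(event_type, []).append(callback)

    def publish(self, event_type, schema):
        self._schemas[event_type] = schema
        for callback in self._listeners.get(event_type, []):
            callback(schema)

    def latest(self, event_type):
        return self._schemas[event_type]


class Schemafier:
    """Turns raw JSON events into records shaped by the latest schema."""

    def __init__(self, registry, event_type):
        self._schema = registry.latest(event_type)
        registry.subscribe(event_type, self._on_update)

    def _on_update(self, schema):
        # Schema update notification: swap in the new schema at runtime.
        self._schema = schema

    def to_record(self, raw_json):
        event = json.loads(raw_json)
        names = [field["name"] for field in self._schema["fields"]]
        return {
            "schema_version": self._schema["version"],
            # Keep only the attributes that the current schema declares.
            "fields": {name: event[name] for name in names if name in event},
        }
```

In this sketch, publishing a second schema version that adds a field changes the output of `to_record` for the next event, with no restart of the `Schemafier`.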
Schemafication Design
● Compile-time validation
○ Data type and mandatory field validation while creating an instance of an Avro specific record
○ Build and push a new image for every schema change
● Runtime validation
○ Data type validation while creating an instance of an Avro generic record
○ Mandatory field validation when Avro generic records are serialized
○ Send schema non-compliant records to a different channel along with their schema errors
○ Schema non-compliant records can remain in JSON format
Design Approaches - Schema Validation
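A minimal sketch of the runtime validation path, assuming a hypothetical `PlayStart` event with made-up fields. It does not use the Avro library: a small hand-written check over an Avro-style schema dictionary stands in for the type and mandatory-field checks that Avro generic records perform. Compliant events go to one list, and non-compliant events stay in their original JSON form and go to a second list together with their schema errors.

```python
import json

# Hypothetical schema in Avro's JSON style; names are illustrative only.
PLAY_START_SCHEMA = {
    "type": "record",
    "name": "PlayStart",
    "fields": [
        {"name": "member_id", "type": "long"},
        {"name": "device_type", "type": "string"},
        {"name": "position_ms", "type": "long", "default": 0},
    ],
}

# Only the two primitive types used above are handled in this sketch.
PYTHON_TYPES = {"long": int, "string": str}


def validate(event, schema):
    """Return a list of schema errors; an empty list means compliant."""
    errors = []
    for field in schema["fields"]:
        name = field["name"]
        if name not in event:
            # A field without a default is treated as mandatory.
            if "default" not in field:
                errors.append(f"missing mandatory field: {name}")
            continue
        value = event[name]
        expected = PYTHON_TYPES[field["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            errors.append(
                f"field {name}: expected {field['type']}, got {type(value).__name__}"
            )
    return errors


def route(raw_events, schema):
    """Split raw JSON events into compliant and non-compliant channels."""
    compliant, non_compliant = [], []
    for raw in raw_events:
        errors = validate(json.loads(raw), schema)
        if errors:
            # Keep the original JSON and attach the schema errors.
            non_compliant.append({"event": raw, "schema_errors": errors})
        else:
            compliant.append(json.loads(raw))
    return compliant, non_compliant
```

Keeping the rejected event as raw JSON means nothing is lost when a producer sends a malformed event; the error list tells the producer what to fix.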
Schema Validation Sequence Diagram
● Data streams can contain event, context, and other enriched attributes
● Data streams can be enriched and transformed by streaming apps
● Data stream schemas can evolve
● Stateless and stateful applications can perform generic transformations and aggregations
Schema-aware Data Streams Design Requirements
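The schema evolution requirement can be illustrated with a small sketch of how a consumer might read a record written under an older schema. The rules follow Avro's documented schema resolution for records: a field the reader does not declare is ignored, a field the writer did not have takes the reader's default, and a missing field with no default is an error. The function `resolve` and both schemas are hypothetical, and plain dictionaries stand in for decoded Avro records.

```python
# Hypothetical writer (old) and reader (new) schemas for the same stream.
WRITER_SCHEMA = {
    "fields": [
        {"name": "member_id", "type": "long"},
        {"name": "legacy_flag", "type": "string"},
    ]
}
READER_SCHEMA = {
    "fields": [
        {"name": "member_id", "type": "long"},
        {"name": "country", "type": "string", "default": "unknown"},
    ]
}


def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    writer_fields = {field["name"] for field in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field exists in both schemas: carry the written value over.
            resolved[name] = record[name]
        elif "default" in field:
            # Field added after the record was written: use the default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for reader field: {name}")
    # Writer-only fields (here legacy_flag) are dropped.
    return resolved


old_record = {"member_id": 7, "legacy_flag": "x"}
new_view = resolve(old_record, WRITER_SCHEMA, READER_SCHEMA)
```

Under these rules, adding a field with a default or removing a field keeps old and new consumers working, which is why a change of that kind is usually safe to roll out on a live stream.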
Schema Aware Data Streams Design
● Data at rest in Avro format
○ Full schema evolution support
○ Row-oriented, so not a good fit for wide, high-volume tables
● Embedded Avro binary column in Parquet format
○ Serialize the large column using the Avro binary format
○ Table is columnar in Parquet format with an embedded Avro binary column
○ Highly performant
○ A UDF to deserialize the Avro binary column
Design Approaches - Data At Rest
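The embedded binary column idea can be sketched without any Avro or Parquet library. The hand-rolled encoder below follows the Avro specification's binary encoding for `long` (zig-zag, variable-length) and `string` (length prefix plus UTF-8 bytes), with record fields concatenated in schema order. The payload fields, the `row` layout, and the function names are assumptions made for illustration; `decode_payload` plays the role of the UDF that readers would call on the binary column.

```python
import json

# Hypothetical wide "payload" column; field names are illustrative only.
PAYLOAD_FIELDS = [("title_id", "long"), ("device_type", "string"), ("position_ms", "long")]


def encode_long(n):
    """Zig-zag then variable-length encode a 64-bit signed integer."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # 7 data bits, high bit means "more"
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf, pos):
    """Decode one long starting at pos; return (value, next position)."""
    shift, z = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode_payload(payload):
    """Serialize the payload fields in schema order, with no field names."""
    out = bytearray()
    for name, kind in PAYLOAD_FIELDS:
        if kind == "long":
            out += encode_long(payload[name])
        else:
            data = payload[name].encode("utf-8")
            out += encode_long(len(data)) + data
    return bytes(out)


def decode_payload(blob):
    """Stand-in for the UDF that deserializes the binary column."""
    pos, payload = 0, {}
    for name, kind in PAYLOAD_FIELDS:
        if kind == "long":
            payload[name], pos = decode_long(blob, pos)
        else:
            length, pos = decode_long(blob, pos)
            payload[name] = blob[pos:pos + length].decode("utf-8")
            pos += length
    return payload


payload = {"title_id": 80100172, "device_type": "tv", "position_ms": -5}
# Scalar columns stay as normal columns; the wide part becomes one binary column.
row = {"member_id": 42, "payload": encode_payload(payload)}
print(len(row["payload"]), "bytes as binary vs", len(json.dumps(payload)), "bytes as JSON")
```

Because the encoding carries no field names, the binary column is much smaller than the same data as JSON, but readers need the schema to decode it, which is why the deserializing UDF is part of the design.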
Data At Rest Design
● Schema for Data In Motion
○ No misinterpretation of data
○ High Data Quality
■ Real-time data quality checks
■ Segregation of schema-compliant and non-compliant data
● Compute Efficiency
○ Binary-encoded data in motion
○ Up to 30% more efficient data processing
● Storage Efficiency
○ Binary-encoded column in the data store
○ Up to 40% less storage
● Cost Efficiency
○ Up to 40% cost savings
● Enables delivery of high-quality, performant, and cost-efficient schema-aware data streams
Schema-aware Data Streams Benefits
● JSON processing versus Avro generic record processing
● Enabled us to do more compute/processing at the ingestion layer
● Moved decompaction to an app that does Avro processing
An example of Compute Benefit
● Consumers are in sync with the schema of data streams
● Consistent metrics, data insights
● Great recommendations and personalization
● Conclusive A/B testing results
● Decrease in turnaround time for feature/app performance improvement
● …
● Increase in Member Joy
Greater Data Quality Translates to
Questions

Virtual Flink Forward 2020: High quality performant and cost efficient schema-aware data streams on Flink at Netflix scale - Jagannathrao Mudda, Ramayan Tiwari
