Big Data @ Netflix

         Using Big Data to Grow our Business
               & Retain our Customers.

                         Jerome Boulon
           Lead Architect, Hadoop Big Data Infrastructure

                         February 15, 2012
jboulon@netflix.com
Big Data @ Netflix
Offline analysis:
•  Honu: Scalable log analysis system to gain business
   insights:
   –  Error logs (unstructured logs)
   –  Statistical logs & Performance logs
   –  Etc

Online analysis:
•  Cassandra for all online activities and user facing
   data
   –  A/B testing (test allocation, metadata)
   –  Service level Configuration
   –  etc
Overview
[Diagram: data collection pipeline (Application, Collectors) and data processing pipeline (M/R, Hive)]
Honu - Structured Log API
Using Annotations
•  Convert Java Class to Hive Table dynamically
•  Add/Remove column
•  Supported java types:
   –  All primitives
   –  Map
   –  Object using the toString method
Using the Key/Value API
•  Produce the same result as Annotation
•  Avoid unnecessary object creation
•  Fully dynamic
•  Thread Safe
Honu, What you get:

log.logEvent(myObject)  →  Hive table (columns: movieId, customerId, timestamp, hostname)

Select customerId, count(1) from MyTable group by customerId;
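
A rough Python analogue of the two paths described above (Honu itself is a Java library, and nothing here is Honu's API). It assumes a hypothetical `PlayEvent` class, a made-up type mapping, and the table name from the query; it only shows how declared fields can become Hive columns and how a key/value path can yield the same row without building an event object.

```python
from dataclasses import dataclass, fields, asdict

# Assumed mapping from Python types to Hive column types (illustrative only).
HIVE_TYPES = {int: "BIGINT", float: "DOUBLE", bool: "BOOLEAN", str: "STRING"}


@dataclass
class PlayEvent:
    """Hypothetical event; field names follow the Hive table on the slide."""
    movieId: int
    customerId: int
    timestamp: int
    hostname: str


def hive_ddl(cls, table):
    """Derive a CREATE TABLE statement from the class's declared fields.

    Unknown types fall back to STRING, similar in spirit to logging an
    object through its string representation.
    """
    cols = ", ".join(
        f"{f.name} {HIVE_TYPES.get(f.type, 'STRING')}" for f in fields(cls)
    )
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols})"


def row_from_object(event):
    """Object path: one row built from the object's fields."""
    return asdict(event)


def row_from_pairs(*pairs):
    """Key/value path: the same row, with no event object created."""
    return dict(pairs)


event = PlayEvent(movieId=42, customerId=7, timestamp=1329264000, hostname="host-1")
print(hive_ddl(PlayEvent, "MyTable"))
print(row_from_object(event))
```

Adding or removing a field on the class changes the derived column list, which is the "Add/Remove column" behavior in miniature.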
December 2009
–    POC for Streaming analysis
–    Single AWS zone
–    1 application
–    60 million events/day
–    50 clients
–    Small Hadoop cluster
–    1 Map/Reduce
–    1 Table
[Diagram: Application, Collectors, M/R, Oracle]
Feb 2012
- Multi Regions deployments
- Transparent to our engineers
- Streaming based solution
- Zero configuration
- 7000+ clients
- Built-in:
  - Fail-Over
  - Load balancing

40+ Billion events/Day
8+ tables with 1+TB/Day
100+ smaller tables
Self-serve:
→ No DBA
→ No Pre-provisioning
→ Fully integrated with Hive

Netflix Hive warehouse
→ One central Data warehouse
→ Hourly/Daily reports
→ Data retention/expiration
Traceability & Performance analysis
•  Track service level call
   –  Instrument low level HTTP client
   –  Call graph
   –  Request processing vs perceived latency
   –  Payload marshalling/unmarshalling
      - duration, size, etc
   –  Service Result
      - Status, Error code, Exception, etc
Diagnostic Information
•  Collect latency information for all external
   operations
•  If Latency > threshold log to Honu:
    –  AWS Region & Zone
    –  Instance
    –  Service details
•  Open Jira/Ticket & Attach diagnostic info
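
A minimal Python sketch of the threshold rule above, not the actual Honu client. The threshold value, the `timed_call` helper, the field names, and the in-memory `diagnostics` list (standing in for the Honu logging call) are all assumptions made for illustration.

```python
import time
from contextlib import contextmanager

LATENCY_THRESHOLD_MS = 200.0  # assumed threshold; the slides give no value
diagnostics = []              # stand-in for the Honu logging call


@contextmanager
def timed_call(service, region, zone, instance, clock=time.monotonic):
    """Time an external operation; record diagnostics only if it is slow."""
    start = clock()
    try:
        yield
    finally:
        elapsed_ms = (clock() - start) * 1000.0
        if elapsed_ms > LATENCY_THRESHOLD_MS:
            diagnostics.append({
                "service": service,
                "region": region,
                "zone": zone,
                "instance": instance,
                "latency_ms": elapsed_ms,
            })


# Fake clocks make the demo deterministic: the first call "takes" 500 ms
# and is recorded, the second "takes" 50 ms and is not.
with timed_call("ratings", "us-east-1", "us-east-1a", "i-0001",
                clock=iter([0.0, 0.5]).__next__):
    pass
with timed_call("ratings", "us-east-1", "us-east-1a", "i-0001",
                clock=iter([0.0, 0.05]).__next__):
    pass

print(diagnostics)
```

The recorded dictionary is the kind of payload that could then be attached to a ticket.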
Mix Offline and Online Data
Offline data
- Fire & forget
- Scale to very large volumes
- Cost effective

Specific conditions
- Online data availability is not mandatory
- If it exists, the data could be useful online
- Only a subset is useful online
- Ready to pay a little bit more

Special collectors
- All data goes to Hive
- A subset goes to a real-time system
- Still cost effective

Customer support
- Browsing history
- Historical & non-critical actions

Debug
- Push validation
- Root cause analysis
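
The special-collector idea can be sketched as follows. This is an illustrative Python outline under assumptions, not Honu's collector: the `ListSink` and `SpecialCollector` classes, the event shape, and the choice of which event types count as the real-time subset are all hypothetical.

```python
class ListSink:
    """Stand-in for a real sink (Hive loader or real-time store)."""

    def __init__(self):
        self.events = []

    def write(self, event):
        self.events.append(event)


class SpecialCollector:
    """Send every event to the offline sink and a subset to the online sink."""

    def __init__(self, hive_sink, realtime_sink, realtime_types):
        self.hive_sink = hive_sink
        self.realtime_sink = realtime_sink
        self.realtime_types = set(realtime_types)

    def collect(self, event):
        self.hive_sink.write(event)  # all data goes to Hive
        if event["type"] in self.realtime_types:
            try:
                self.realtime_sink.write(event)  # only a subset goes online
            except Exception:
                # Online availability is not mandatory: a real-time failure
                # must not block the offline path.
                pass


hive, realtime = ListSink(), ListSink()
collector = SpecialCollector(hive, realtime, realtime_types={"playback_error"})
for kind in ("page_view", "playback_error", "search"):
    collector.collect({"type": kind, "customerId": 7})

print(len(hive.events), len(realtime.events))
```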
Honu Realtime usages
•  Movie playback experience
   –  Video quality
   –  Network issue
•  Customer Support
   –  Historical usage
   –  Last activity
•  Errors Summary
   –  Error tracking per service
   –  Error tracking per device
•  Launch Reports
   –  Push validation
   –  Root cause analysis
Honu Realtime - Architecture
[Diagram: realtime data collection pipeline (Application, Collectors, Realtime System, Realtime Access, M/R)]
A/B Testing
 Test: An experiment where several
 competing behaviors are
 implemented and compared.

 Cell: different experiences within a
 test that are being compared against
 each other.

 Allocation: a customer-specific
 assignment to a cell within a test
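
To make the three terms concrete, here is a small Python sketch that keeps one allocation entry per test per customer. The `AllocationStore` class, the in-memory dictionary, and the random choice of cell are assumptions for illustration; the slides do not say how customers are assigned to cells.

```python
import random


class AllocationStore:
    """Keeps one allocation entry per (test, customer)."""

    def __init__(self, rng=None):
        self._cells = {}  # (test_id, customer_id) -> cell
        self._rng = rng or random.Random()

    def allocate(self, test_id, customer_id, cells):
        """Return the customer's cell, assigning one on first sight.

        Random assignment is an assumption of this sketch.
        """
        key = (test_id, customer_id)
        if key not in self._cells:
            self._cells[key] = self._rng.choice(cells)
        return self._cells[key]


store = AllocationStore(rng=random.Random(0))
cells = ["control", "variant-a", "variant-b"]
first = store.allocate("movie-presentation", 7, cells)
again = store.allocate("movie-presentation", 7, cells)
print(first, again)
```

Once stored, the allocation is stable, so a customer keeps seeing the same experience for the life of the test.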

Online data:
- Cell Allocation > 1 Billion records
- Test config: 1 entry/test/customer

Tracking information (example):
1 M customers per Test
8 tracking events per Day
------------------------------------
100 Tests = 800 M events/Day
3 Months = 72 B events
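
The tracking example works out as follows. The only assumption is that "3 months" means 90 days, which is what reproduces the 72 B figure.

```python
customers_per_test = 1_000_000
events_per_customer_per_day = 8
tests = 100
days = 90  # "3 months" taken as 90 days (assumption)

events_per_day = customers_per_test * events_per_customer_per_day * tests
events_3_months = events_per_day * days

print(f"{events_per_day:,} events/day")
print(f"{events_3_months:,} events in 3 months")
```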
Movie Presentation A/B Test
A/B Testing - Architecture
Online Data
- Customer test allocation
- Metadata about the test
Ex:
- Start/End date
- UI directives
- Logging directives

Offline Data
- Test tracking
Ex:
- Retention
- Engagement metrics
Beacon Server

User behavior
- Client side interactions
- Search/Play/Stop/Pause

Device monitoring
- Heartbeat
- Status & Key metrics

[Diagram: Ajax calls, Beacon servers]
BI Integration
Three main technologies

•  Teradata (Data center)
•  Hive (Cloud)
•  Cassandra (Cloud)
Hive ß à BI
–  Dimension tables (daily export from Teradata)
–  Hourly/Daily Hive summary queries
–  Hourly/Daily export from Hive to BI
  •  Queries runs in the cloud
  •  Aggregated result goes back to our BI solution
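
A Python sketch of the summarize-then-export pattern, under assumptions: the event shape, the `deviceId` key, and the small `DEVICE_DIM` table (standing in for a dimension table exported from Teradata) are hypothetical. In the deck this work is done by Hive queries; the sketch only illustrates why the exported result is small.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical dimension table, standing in for the daily export from Teradata.
DEVICE_DIM = {1: "TV", 2: "Phone"}


def hourly_summary(events):
    """Aggregate raw events into counts per (hour, deviceId)."""
    counts = Counter()
    for e in events:
        ts = datetime.fromtimestamp(e["timestamp"], tz=timezone.utc)
        counts[(ts.strftime("%Y-%m-%d %H:00"), e["deviceId"])] += 1
    return counts


def export_rows(summary, dimension):
    """Join the summary with the dimension table; only these rows are exported."""
    return sorted(
        (hour, dimension.get(device_id, "unknown"), n)
        for (hour, device_id), n in summary.items()
    )


events = [
    {"timestamp": 1329264000, "deviceId": 1},
    {"timestamp": 1329264060, "deviceId": 1},
    {"timestamp": 1329267600, "deviceId": 2},
]
rows = export_rows(hourly_summary(events), DEVICE_DIM)
for row in rows:
    print(row)
```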
Hive Reports
Cassandra à BI

•  Use Cassandra backups to run analytics
•  Export SSTable to Hadoop
•  Pig to:
  –  Parse SSTable
  –  Extract/Group required information
•  Load the result back to Teradata
jboulon@gmail.com
www.linkedin.com/in/jboulon

More Related Content

What's hot

How Disney+ uses fast data ubiquity to improve the customer experience
 How Disney+ uses fast data ubiquity to improve the customer experience  How Disney+ uses fast data ubiquity to improve the customer experience
How Disney+ uses fast data ubiquity to improve the customer experience Martin Zapletal
 
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...Spark Summit
 
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...HostedbyConfluent
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Real Time Data Infrastructure team overview
Real Time Data Infrastructure team overviewReal Time Data Infrastructure team overview
Real Time Data Infrastructure team overviewMonal Daxini
 
Druid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiDruid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiBrian Olsen
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...Amazon Web Services
 
The Netflix data platform: Now and in the future by Kurt Brown
The Netflix data platform: Now and in the future by Kurt BrownThe Netflix data platform: Now and in the future by Kurt Brown
The Netflix data platform: Now and in the future by Kurt BrownData Con LA
 
Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020Timothy Spann
 
Netflix incloudsmarch8 2011forwiki
Netflix incloudsmarch8 2011forwikiNetflix incloudsmarch8 2011forwiki
Netflix incloudsmarch8 2011forwikiKevin McEntee
 
Getting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming ArchitecturesGetting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming ArchitecturesSingleStore
 
Processing Real-Time Data at Scale: A streaming platform as a central nervous...
Processing Real-Time Data at Scale: A streaming platform as a central nervous...Processing Real-Time Data at Scale: A streaming platform as a central nervous...
Processing Real-Time Data at Scale: A streaming platform as a central nervous...confluent
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at AirbnbHao Wang
 
A unified analytics platform with Kafka and Flink | Stephan Ewen, Ververica
A unified analytics platform with Kafka and Flink | Stephan Ewen, VervericaA unified analytics platform with Kafka and Flink | Stephan Ewen, Ververica
A unified analytics platform with Kafka and Flink | Stephan Ewen, VervericaHostedbyConfluent
 
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...HostedbyConfluent
 
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...HostedbyConfluent
 
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data TechBig Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data TechHostedbyConfluent
 

What's hot (20)

ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
How Disney+ uses fast data ubiquity to improve the customer experience
 How Disney+ uses fast data ubiquity to improve the customer experience  How Disney+ uses fast data ubiquity to improve the customer experience
How Disney+ uses fast data ubiquity to improve the customer experience
 
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
Visualizing AutoTrader Traffic in Near Real-Time with Spark Streaming-(Jon Gr...
 
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
Druid + Kafka: transform your data-in-motion to analytics-in-motion | Gian Me...
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Real Time Data Infrastructure team overview
Real Time Data Infrastructure team overviewReal Time Data Infrastructure team overview
Real Time Data Infrastructure team overview
 
Instrumenting your Instruments
Instrumenting your Instruments Instrumenting your Instruments
Instrumenting your Instruments
 
Druid Overview by Rachel Pedreschi
Druid Overview by Rachel PedreschiDruid Overview by Rachel Pedreschi
Druid Overview by Rachel Pedreschi
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
AWS re:Invent 2016| GAM301 | How EA Leveraged Amazon Redshift and AWS Partner...
 
The Netflix data platform: Now and in the future by Kurt Brown
The Netflix data platform: Now and in the future by Kurt BrownThe Netflix data platform: Now and in the future by Kurt Brown
The Netflix data platform: Now and in the future by Kurt Brown
 
Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020Fast data for fitness 10 nov 2020
Fast data for fitness 10 nov 2020
 
Netflix incloudsmarch8 2011forwiki
Netflix incloudsmarch8 2011forwikiNetflix incloudsmarch8 2011forwiki
Netflix incloudsmarch8 2011forwiki
 
Getting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming ArchitecturesGetting It Right Exactly Once: Principles for Streaming Architectures
Getting It Right Exactly Once: Principles for Streaming Architectures
 
Processing Real-Time Data at Scale: A streaming platform as a central nervous...
Processing Real-Time Data at Scale: A streaming platform as a central nervous...Processing Real-Time Data at Scale: A streaming platform as a central nervous...
Processing Real-Time Data at Scale: A streaming platform as a central nervous...
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at Airbnb
 
A unified analytics platform with Kafka and Flink | Stephan Ewen, Ververica
A unified analytics platform with Kafka and Flink | Stephan Ewen, VervericaA unified analytics platform with Kafka and Flink | Stephan Ewen, Ververica
A unified analytics platform with Kafka and Flink | Stephan Ewen, Ververica
 
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka...
 
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
Data in Motion: Building Stream-Based Architectures with Qlik Replicate & Kaf...
 
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data TechBig Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
 

Viewers also liked

The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsMonal Daxini
 
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original ProgrammingThe Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programminghye-jin-lee
 
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILES
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILESACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILES
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILESAdrija Chowdhury
 
Presto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScalePresto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScaleDataWorks Summit
 
Mobile Phone Based Drunk driving detection
Mobile Phone Based Drunk driving detectionMobile Phone Based Drunk driving detection
Mobile Phone Based Drunk driving detectionnagarajc007
 
Netflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsNetflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsBlake Irvine
 

Viewers also liked (6)

The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original ProgrammingThe Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
The Big Data TV: Data Analytics, Algorithm, and Netflix’s Original Programming
 
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILES
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILESACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILES
ACCIDENT PREVENTION AND SECURITY SYSTEM FOR AUTOMOBILES
 
Presto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScalePresto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte Scale
 
Mobile Phone Based Drunk driving detection
Mobile Phone Based Drunk driving detectionMobile Phone Based Drunk driving detection
Mobile Phone Based Drunk driving detection
 
Netflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of AnalyticsNetflix - Enabling a Culture of Analytics
Netflix - Enabling a Culture of Analytics
 

Similar to Cloud Connect 2012, Big Data @ Netflix

Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...ssuserd3a367
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Spark Summit
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidTony Ng
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantageAmazon Web Services
 
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...SL Corporation
 
Machine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupMachine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupSri Ambati
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Datacwensel
 
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaThe Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaKai Wähner
 
Resistance is futile, resilience is crucial
Resistance is futile, resilience is crucialResistance is futile, resilience is crucial
Resistance is futile, resilience is crucialHristo Iliev
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...Amazon Web Services
 
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...SL Corporation
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time AnalyticsAmazon Web Services
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networkspbelko82
 
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suroDevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suroGaurav "GP" Pal
 
Combining Hadoop RDBMS for Large-Scale Big Data Analytics
Combining Hadoop RDBMS for Large-Scale Big Data AnalyticsCombining Hadoop RDBMS for Large-Scale Big Data Analytics
Combining Hadoop RDBMS for Large-Scale Big Data AnalyticsDataWorks Summit
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven productsLars Albertsson
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio
 

Similar to Cloud Connect 2012, Big Data @ Netflix (20)

Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
 
Using real time big data analytics for competitive advantage
 Using real time big data analytics for competitive advantage Using real time big data analytics for competitive advantage
Using real time big data analytics for competitive advantage
 
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
 
Machine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville MeetupMachine Learning for Smarter Apps - Jacksonville Meetup
Machine Learning for Smarter Apps - Jacksonville Meetup
 
Big Data Introduction
Big Data IntroductionBig Data Introduction
Big Data Introduction
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache KafkaThe Heart of the Data Mesh Beats in Real-Time with Apache Kafka
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka
 
Resistance is futile, resilience is crucial
Resistance is futile, resilience is crucialResistance is futile, resilience is crucial
Resistance is futile, resilience is crucial
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
 
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time Analytics
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
 
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suroDevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
DevOps in the Amazon Cloud – Learn from the pioneersNetflix suro
 
Combining Hadoop RDBMS for Large-Scale Big Data Analytics
Combining Hadoop RDBMS for Large-Scale Big Data AnalyticsCombining Hadoop RDBMS for Large-Scale Big Data Analytics
Combining Hadoop RDBMS for Large-Scale Big Data Analytics
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
Kognitio overview jan 2013
Kognitio overview jan 2013Kognitio overview jan 2013
Kognitio overview jan 2013
 

Recently uploaded

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Cloud Connect 2012, Big Data @ Netflix

  • 1. Big Data @ Netflix: Using Big Data to Grow our Business & Retain our Customers. Jerome Boulon, Lead Architect, Hadoop Big Data Infrastructure. February 15, 2012. jboulon@netflix.com
  • 2. Big Data @ Netflix. Offline analysis: Honu, a scalable log analysis system to gain business insights: error logs (unstructured logs), statistical logs and performance logs, etc. Online analysis: Cassandra for all online activities and user-facing data: A/B testing (test allocation, metadata), service-level configuration, etc.
  • 3. Overview. Data collection pipeline: Application, Collectors. Data processing pipeline: Hive, M/R.
  • 4. Honu - Structured Log API. Using Annotations: convert a Java class to a Hive table dynamically; add/remove columns; supported Java types: all primitives, Map, and any Object via its toString method. Using the Key/Value API: produces the same result as annotations; avoids unnecessary object creation; fully dynamic; thread safe.
  • 5. Honu, what you get: log.logEvent(myObject) feeds a Hive table with columns movieId, customerId, timestamp, hostname. Example query: SELECT customerId, count(1) FROM MyTable GROUP BY customerId;
  • 6. December 2009: POC for streaming analysis; single AWS zone; 1 application; 60 million events/day; 50 clients; small Hadoop cluster; 1 Map/Reduce; 1 table. Diagram labels: Application, Collectors, M/R, Oracle.
  • 7. Feb 2012: 40+ billion events/day; 8+ tables with 1+ TB/day; 100+ smaller tables. Self-serve: no DBA, no pre-provisioning, fully integrated with Hive. Multi-region deployments; transparent to our engineers; streaming-based solution; zero configuration; 7000+ clients. Built-in: fail-over, load balancing. Netflix Hive warehouse: one central data warehouse, hourly/daily reports, data retention/expiration.
  • 8. Traceability & Performance analysis. Track service-level calls: instrument the low-level HTTP client; call graph; request processing vs. perceived latency; payload marshalling/unmarshalling (duration, size, etc.); service result (status, error code, exception, etc.).
  • 9. Diagnostic Information. Collect latency information for all external operations. If latency > threshold, log to Honu: AWS region & zone, instance, service details. Open a Jira ticket and attach the diagnostic info.
  • 10. Mix Offline and Online Data. Offline data: fire & forget; scales to very large volumes; cost effective. Specific conditions: online data availability is not mandatory; if it exists, the data could be useful online; only a subset is useful online; ready to pay a little bit more. Special collectors: all data goes to Hive; a subset goes to a real-time system; still cost effective. Customer support: browsing history; historical & non-critical actions. Debug: push validation; root cause analysis.
  • 11. Honu Realtime usages. Movie playback experience: video quality, network issues. Customer support: historical usage, last activity. Errors summary: error tracking per service, error tracking per device. Launch reports: push validation, root cause analysis.
  • 12. Honu Realtime - Architecture. Realtime data collection pipeline. Diagram labels: Application, Collectors, Realtime Access, Realtime System, M/R.
  • 13. A/B Testing. Test: an experiment where several competing behaviors are implemented and compared. Cell: one of the different experiences within a test that are being compared against each other. Allocation: a customer-specific assignment to a cell within a test. Online data: cell allocation information (> 1 billion records); test config: 1 entry/test/customer. Tracking (example): 1 M customers per test, 8 tracking events per day; 100 tests = 800 M events/day; 3 months = 72 B events.
  • 15. A/B Testing - Architecture. Online data: customer test allocation; metadata about the test (e.g. start/end date, UI directives, logging directives). Offline data: test tracking (e.g. retention, engagement metrics).
  • 16. Beacon Server. User behavior: client-side interactions; Search/Play/Stop/Pause Ajax calls. Device monitoring: heartbeat; status & key metrics. Diagram labels: Beacon (x3).
  • 17. BI Integration. Three main technologies: Teradata (data center), Hive (cloud), Cassandra (cloud).
  • 18. Hive ↔ BI. Dimension tables (daily export from Teradata); hourly/daily Hive summary queries; hourly/daily export from Hive to BI. Queries run in the cloud; the aggregated result goes back to our BI solution.
  • 20. Cassandra → BI. Use Cassandra backups to run analytics: export SSTables to Hadoop; use Pig to parse the SSTables and extract/group the required information; load the result back to Teradata.
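
Slides 4 and 5 describe turning a Java object into Hive table columns through annotations. The sketch below illustrates that general idea with plain Java reflection. It is not Honu's actual API: the `@Column` annotation, the `PlaybackEvent` class, and the `toColumns` helper are hypothetical names invented for illustration. Only the column names (movieId, customerId, timestamp, hostname) come from slide 5.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

public class AnnotatedEventSketch {

    // Hypothetical marker annotation (assumption, not Honu's real annotation).
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Column {}

    // Hypothetical event class; the annotated names mirror the columns on slide 5.
    static class PlaybackEvent {
        @Column long movieId;
        @Column long customerId;
        @Column long timestamp;
        @Column String hostname;
        String notLogged = "internal"; // not annotated, so it is skipped

        PlaybackEvent(long movieId, long customerId, long timestamp, String hostname) {
            this.movieId = movieId;
            this.customerId = customerId;
            this.timestamp = timestamp;
            this.hostname = hostname;
        }
    }

    // Collects every annotated field as a column name -> string value pair.
    // Values are rendered with String.valueOf, echoing the toString rule on slide 4.
    static Map<String, String> toColumns(Object event) {
        Map<String, String> columns = new TreeMap<>(); // sorted for stable output
        for (Field field : event.getClass().getDeclaredFields()) {
            if (!field.isAnnotationPresent(Column.class)) {
                continue;
            }
            field.setAccessible(true);
            try {
                columns.put(field.getName(), String.valueOf(field.get(event)));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read field " + field.getName(), e);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        PlaybackEvent event = new PlaybackEvent(42L, 7L, 1329264000000L, "host-1");
        System.out.println(toColumns(event));
    }
}
```

Because the columns are discovered at runtime, adding or removing an annotated field changes the set of columns without any separate schema definition, which is the property slide 4 highlights.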
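
Slide 9 states a simple rule: when the latency of an external operation exceeds a threshold, log the region, zone, instance, and service details. A minimal sketch of that rule follows. The class name, method name, field keys, and the threshold value are assumptions for illustration; the slides do not specify them, and the actual logging call to Honu is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class LatencyDiagnosticsSketch {

    // Returns a diagnostic event only when the latency exceeds the threshold.
    // All key names below are hypothetical; slide 9 lists only the kinds of detail.
    static Optional<Map<String, String>> diagnosticFor(
            String region, String zone, String instance, String service,
            long latencyMs, long thresholdMs) {
        if (latencyMs <= thresholdMs) {
            return Optional.empty(); // fast enough, nothing to log
        }
        Map<String, String> event = new LinkedHashMap<>();
        event.put("region", region);
        event.put("zone", zone);
        event.put("instance", instance);
        event.put("service", service);
        event.put("latencyMs", Long.toString(latencyMs));
        return Optional.of(event);
    }

    public static void main(String[] args) {
        long thresholdMs = 500; // assumed value for the example
        // In a real system the event would be handed to the log pipeline
        // and attached to a ticket; here it is only printed.
        diagnosticFor("us-east-1", "us-east-1a", "i-example", "example-service", 900, thresholdMs)
                .ifPresent(System.out::println);
    }
}
```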
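
The tracking volumes on slide 13 can be checked with a few lines of arithmetic. The sketch below reproduces the slide's example figures, assuming that "3 months" means 90 days; the class and method names are made up for illustration.

```java
public class AbTestVolumeSketch {

    // Events per day = customers per test * tracking events per customer per day * tests.
    static long eventsPerDay(long customersPerTest, long eventsPerCustomerPerDay, long tests) {
        return Math.multiplyExact(
                Math.multiplyExact(customersPerTest, eventsPerCustomerPerDay), tests);
    }

    // Total events accumulated over a number of days.
    static long eventsOverDays(long eventsPerDay, long days) {
        return Math.multiplyExact(eventsPerDay, days);
    }

    public static void main(String[] args) {
        long perDay = eventsPerDay(1_000_000L, 8L, 100L); // slide 13 example inputs
        long threeMonths = eventsOverDays(perDay, 90L);   // assumes 3 months = 90 days
        System.out.println(perDay);       // 800000000 (800 M events/day)
        System.out.println(threeMonths);  // 72000000000 (72 B events)
    }
}
```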