Honu: A large scale data collection and processing pipeline
Jerome Boulon
Netflix
Session Agenda


§  Honu
§  Goals
§  Architecture – Overview
§  Data Collection pipeline
§  Data Processing pipeline
§  Hive data warehouse
§  Honu Roadmap
§  Questions


Honu


§  Honu is a streaming data & log collection and processing
    pipeline built using:
 ›    Hadoop
 ›    Hive
 ›    Thrift


§  Honu has been running in production for 6 months and
    processes over a billion log events/day on EC2/EMR.


What are we trying to achieve?


§  Scalable log analysis to gain business insights:
 ›    Error logs (unstructured logs)
 ›    Statistical logs (structured logs - App specific)
 ›    Performance logs (structured logs – Standard + App specific)
§  Output required:
 ›    Engineers access:
      •  Ad-hoc query and reporting
 ›    BI access:
      •  Flat files to be loaded into BI system for cross-functional reporting.
      •  Ad-hoc query for data examinations, etc.

Architecture - Overview

[Diagram labels: Data collection pipeline (Application, Collector); Data processing pipeline (M/R, Hive)]
Current Netflix deployment EC2/EMR

[Diagram labels: Applications (Amazon EC2); Honu Collectors; S3; Hive & Hadoop EMR clusters; Hive & Hadoop EMR (for ad-hoc query); Hive MetaStore]
Honu – Client SDK

§  Structured Log API, Log4j Appenders, Hadoop Metric Plugin, Tomcat Access Log
§  Convert individual messages to batches
§  In-memory buffering system
§  Communication layer:
 ›    Discovery Service
 ›    Transparent fail-over & load-balancing
 ›    Thrift as a transport protocol & RPC (NIO)
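The "transparent fail-over & load-balancing" behavior of the communication layer can be pictured as a round-robin choice over collector endpoints that skips any endpoint currently marked down. The following is a minimal sketch under that assumption. `CollectorPicker`, `markDown`, `markUp` and `pick` are hypothetical names, not Honu's client code, and the discovery service and Thrift transport are left out.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only: round-robin selection with fail-over.
class CollectorPicker {
    private final List<String> endpoints;
    private final Set<String> down = ConcurrentHashMap.newKeySet();
    private final AtomicInteger next = new AtomicInteger();

    CollectorPicker(List<String> endpoints) {
        this.endpoints = List.copyOf(endpoints);
    }

    // A sender would call this after a failed or timed-out request.
    void markDown(String endpoint) { down.add(endpoint); }

    // A health check would call this once the endpoint answers again.
    void markUp(String endpoint) { down.remove(endpoint); }

    // Returns the next healthy endpoint in round-robin order,
    // or empty when every endpoint is marked down.
    Optional<String> pick() {
        for (int i = 0; i < endpoints.size(); i++) {
            int idx = Math.floorMod(next.getAndIncrement(), endpoints.size());
            String candidate = endpoints.get(idx);
            if (!down.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

A caller would invoke `pick()` before each batch send and `markDown(...)` on failure, so later batches flow to the remaining collectors without the application noticing.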
Honu - Unstructured/Structured logs
Log4J Appender

§  Configuration using standard Log4j properties file
§  Control:
  ›    In memory size
  ›    Batch size
  ›    Number of senders + VIP address
  ›    Timeout
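The controls listed above (in-memory size, batch size) map naturally onto a bounded buffer that sender threads drain in batches. Here is a minimal sketch assuming hypothetical names (`BatchingBuffer`, `offer`, `nextBatch`). Dropping messages when the buffer is full is an assumption made for illustration, not a documented Honu behavior, and the timeout setting is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration only: a bounded in-memory buffer drained in batches.
class BatchingBuffer {
    private final BlockingQueue<String> queue;
    private final int batchSize;

    BatchingBuffer(int maxInMemory, int batchSize) {
        this.queue = new ArrayBlockingQueue<>(maxInMemory);
        this.batchSize = batchSize;
    }

    // Non-blocking: returns false when the buffer is full, so a logging
    // call never stalls the application thread (assumed policy).
    boolean offer(String message) {
        return queue.offer(message);
    }

    // Removes up to batchSize messages; a sender thread would ship the
    // returned list to a collector in a single call.
    List<String> nextBatch() {
        List<String> batch = new ArrayList<>(batchSize);
        queue.drainTo(batch, batchSize);
        return batch;
    }
}
```

With several sender threads, each would loop on `nextBatch()` independently; `ArrayBlockingQueue` is thread-safe, so no extra locking is needed in this sketch.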




Honu - Structured Log API

[Diagram labels: App, Hive]

Using Annotations
§  Convert Java Class to Hive Table dynamically
§  Add/Remove column
§  Supported Java types:
   •  All primitives
   •  Map
   •  Object using the toString method

Using the Key/Value API
§  Produce the same result as Annotation
§  Avoid unnecessary object creation
§  Fully dynamic
§  Thread Safe
Structured Log API – Using Annotation

@Resource(table="MyTable")
public class MyClass implements Annotatable {
    @Column("movieId")
    public String getMovieId() {
      […]
    }

    @Column("clickIndex")
    public int getClickIndex() {
      […]
    }

    @Column("requestInfo")
    public Map getRequestInfo() {
      […]
    }
}

log.info(myAnnotatableObj);

DB=MyTable
movieId=XXXX
clickIndex=3
requestInfo.returnCode=200
requestInfo.duration_ms=300
requestInfo.yyy=zzz
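One way to picture how annotated getters could become the key/value lines shown on this slide is a small reflection pass over the object. The sketch below is an assumption about the general technique, not Honu's implementation. `SketchResource`, `SketchColumn`, `AnnotationFlattener` and `ClickEvent` are hypothetical stand-ins defined here so the example is self-contained.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-ins for the annotations shown on the slide.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface SketchResource { String table(); }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SketchColumn { String value(); }

class AnnotationFlattener {
    // Reads every @SketchColumn getter; Map values become "column.key" entries.
    static Map<String, String> flatten(Object obj) {
        Map<String, String> out = new TreeMap<>(); // sorted for stable output
        SketchResource res = obj.getClass().getAnnotation(SketchResource.class);
        if (res != null) {
            out.put("DB", res.table());
        }
        for (Method m : obj.getClass().getMethods()) {
            SketchColumn col = m.getAnnotation(SketchColumn.class);
            if (col == null) {
                continue; // not a logged column
            }
            Object value;
            try {
                m.setAccessible(true);
                value = m.invoke(obj);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot read " + m.getName(), e);
            }
            if (value instanceof Map) {
                for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                    out.put(col.value() + "." + e.getKey(), String.valueOf(e.getValue()));
                }
            } else {
                out.put(col.value(), String.valueOf(value)); // toString for other types
            }
        }
        return out;
    }
}

// Example event class mirroring the slide's MyClass.
@SketchResource(table = "MyTable")
class ClickEvent {
    @SketchColumn("movieId")
    public String getMovieId() { return "XXXX"; }

    @SketchColumn("clickIndex")
    public int getClickIndex() { return 3; }

    @SketchColumn("requestInfo")
    public Map<String, Object> getRequestInfo() {
        return Map.of("returnCode", 200, "duration_ms", 300);
    }
}
```

Calling `AnnotationFlattener.flatten(new ClickEvent())` yields a map whose keys follow the dotted naming on the slide. Adding or removing an annotated getter changes the set of columns without touching the flattening code.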
Structured Log API - Key/Value API

KeyValueSerialization kv;
kv = new KeyValueSerialization();
[…]
kv.startMessage("MyTable");
kv.addKeyValue("movieId", "XXXX");
kv.addKeyValue("clickIndex", 3);
kv.addKeyValue("requestInfo", requestInfoMap);

log.info(kv.generateMessage());

DB=MyTable
movieId=XXXX
clickIndex=3
requestInfo.returnCode=200
requestInfo.duration_ms=300
requestInfo.yyy=zzz
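To illustrate how a key/value builder can produce the same dotted output without an intermediate annotated object, here is a minimal sketch. `KeyValueMessage`, `add`, `addMap` and `generate` are hypothetical names and not the Honu `KeyValueSerialization` class. The newline-separated `key=value` layout is assumed from the slide and may differ from the real wire format.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration only; not the Honu KeyValueSerialization class.
class KeyValueMessage {
    // LinkedHashMap keeps fields in insertion order.
    private final Map<String, String> fields = new LinkedHashMap<>();

    KeyValueMessage(String table) {
        fields.put("DB", table);
    }

    // Adds a scalar field; the value is rendered with String.valueOf.
    KeyValueMessage add(String key, Object value) {
        fields.put(key, String.valueOf(value));
        return this;
    }

    // Nested maps become "prefix.key" entries, matching the dotted names on the slide.
    KeyValueMessage addMap(String prefix, Map<String, ?> nested) {
        nested.forEach((k, v) -> fields.put(prefix + "." + k, String.valueOf(v)));
        return this;
    }

    // Renders one key=value pair per line.
    String generate() {
        StringJoiner joiner = new StringJoiner("\n");
        fields.forEach((k, v) -> joiner.add(k + "=" + v));
        return joiner.toString();
    }
}
```

Unlike this sketch, a thread-safe version would keep one builder per thread or per message; the instance above is meant to be used by a single thread.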
Honu Collector


§  Honu collector:
 ›    Save logs to FS using local storage & Hadoop FS API
 ›    FS could be localFS, HDFS, S3n, NFS…
      •  FS fail-over coming (similar to scribe)
 ›    Thrift NIO
 ›    Multiple writers (Data grouping)


§  Output: DataSink (Binary compatible with Chukwa)
 ›    Compression (LZO/GZIP via Hadoop codecs)
 ›    S3 optimizations
Data Processing Pipeline
§  Proprietary Data Warehouse Workflows
   ›  Ability to test new build with production data
   ›  Ability to replay some data processing



§  CheckPoint System
   ›    Keeps track of all current states for recovery


§  Demuxer: Map-Reduce to parse/dispatch all logs to the
    right Hive table
   ›  Multiple parsers
   ›  Dynamic Output Format for Hive (Tables & columns
      management)
       •  Default schema (Map, hostname & timestamp)
       •  Table-specific schema
       •  All tables partitioned by Date, hour & batchID
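Partitioning every table by date, hour and batch ID means the demuxer can derive an output location from the table name, the event timestamp and the current batch. The sketch below shows one way such a path might be built. The class name `PartitionPath`, the key names `dt`, `hour` and `batchid`, and the use of UTC are all assumptions made for illustration, not Honu's actual layout.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical illustration only: derive a Hive-style partition directory.
class PartitionPath {
    private static final DateTimeFormatter DATE =
            DateTimeFormatter.ofPattern("yyyyMMdd", Locale.ROOT);

    // Builds "<table>/dt=<yyyyMMdd>/hour=<HH>/batchid=<id>" from an epoch-millis timestamp.
    static String of(String table, long eventTimeMillis, String batchId) {
        ZonedDateTime t = Instant.ofEpochMilli(eventTimeMillis).atZone(ZoneOffset.UTC);
        return String.format(Locale.ROOT, "%s/dt=%s/hour=%02d/batchid=%s",
                table, DATE.format(t), t.getHour(), batchId);
    }
}
```

Including the batch ID in the path keeps each demux run's output separate, which fits the deck's stated ability to replay some data processing.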
Data Processing Pipeline - Details
§  Demux output:
 ›    1 directory per Hive table
 ›    1 file per partition * reducerCount

[Diagram labels: Map/Reduce; Table 1, Table 2, … Table n; Load; Hive; S3; Hourly Merge]
Hive Data warehouse


§  All data on S3 when final (SLA 2 hours)
 ›    All data (CheckPoint, DataSink & Hive final Output) are
      saved on S3, everything else is transient
 ›    Don't need to maintain live instances to store n years of
      data
 ›    Start/Stop EMR query cluster(s) on demand
      •  Get the best cluster size for you
      •  Hive Reload partitions (EMR specific)




Roadmap for Honu


§  Open source: GitHub (end of July)
  ›    Client SDK
  ›    Collector
  ›    Demuxer
§  Multiple writers
§  Persistent queue (client & server)
§  Real Time integration with external monitoring system
§  HBase/Cassandra investigation
§  Map/Reduce based data aggregator

Questions?

   Jerome Boulon
   jboulon@gmail.com

More Related Content

What's hot

Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysisAmazon Web Services
 
Bridging the gap of Relational to Hadoop using Sqoop @ Expedia
Bridging the gap of Relational to Hadoop using Sqoop @ ExpediaBridging the gap of Relational to Hadoop using Sqoop @ Expedia
Bridging the gap of Relational to Hadoop using Sqoop @ ExpediaDataWorks Summit/Hadoop Summit
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream DataDataWorks Summit
 
AWS Webcast - Amazon Kinesis and Apache Storm
AWS Webcast - Amazon Kinesis and Apache StormAWS Webcast - Amazon Kinesis and Apache Storm
AWS Webcast - Amazon Kinesis and Apache StormAmazon Web Services
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit
 
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Chris Fregly
 
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for HadoopHBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for HadoopHBaseCon
 
Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)Sangjin Lee
 
Pulsar - Real-time Analytics at Scale
Pulsar - Real-time Analytics at ScalePulsar - Real-time Analytics at Scale
Pulsar - Real-time Analytics at ScaleTony Ng
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven productsLars Albertsson
 
Data Ingest Self Service and Management using Nifi and Kafka
Data Ingest Self Service and Management using Nifi and KafkaData Ingest Self Service and Management using Nifi and Kafka
Data Ingest Self Service and Management using Nifi and KafkaDataWorks Summit
 
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...Amazon Web Services
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksData Con LA
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseMichael Stack
 

What's hot (20)

Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysis
 
Bridging the gap of Relational to Hadoop using Sqoop @ Expedia
Bridging the gap of Relational to Hadoop using Sqoop @ ExpediaBridging the gap of Relational to Hadoop using Sqoop @ Expedia
Bridging the gap of Relational to Hadoop using Sqoop @ Expedia
 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
AWS Webcast - Amazon Kinesis and Apache Storm
AWS Webcast - Amazon Kinesis and Apache StormAWS Webcast - Amazon Kinesis and Apache Storm
AWS Webcast - Amazon Kinesis and Apache Storm
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
 
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
 
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for HadoopHBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
 
Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)Timeline Service v.2 (Hadoop Summit 2016)
Timeline Service v.2 (Hadoop Summit 2016)
 
Cost-based Query Optimization
Cost-based Query Optimization Cost-based Query Optimization
Cost-based Query Optimization
 
Pulsar - Real-time Analytics at Scale
Pulsar - Real-time Analytics at ScalePulsar - Real-time Analytics at Scale
Pulsar - Real-time Analytics at Scale
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
Data Ingest Self Service and Management using Nifi and Kafka
Data Ingest Self Service and Management using Nifi and KafkaData Ingest Self Service and Management using Nifi and Kafka
Data Ingest Self Service and Management using Nifi and Kafka
 
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
ebay
ebayebay
ebay
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
 

Similar to Hadoop summit 2010, HONU

Honu - A Large Scale Streaming Data Collection and Processing Pipeline__Hadoo...
Honu - A Large Scale Streaming Data Collection and Processing Pipeline__Hadoo...Honu - A Large Scale Streaming Data Collection and Processing Pipeline__Hadoo...
Honu - A Large Scale Streaming Data Collection and Processing Pipeline__Hadoo...Yahoo Developer Network
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at FacebookS S
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebookelliando dias
 
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015Rajit Saha
 
HBaseConAsia2018 Track3-2: HBase at China Telecom
HBaseConAsia2018 Track3-2:  HBase at China TelecomHBaseConAsia2018 Track3-2:  HBase at China Telecom
HBaseConAsia2018 Track3-2: HBase at China TelecomMichael Stack
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizonArtem Ervits
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephantsOvidiu Dimulescu
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Vinoth Chandar
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageSATOSHI TAGOMORI
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Marcel Krcah
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Datacwensel
 
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, QlikKeeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, QlikHostedbyConfluent
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
 

Similar to Hadoop summit 2010, HONU (20)

Honu - A Large Scale Streaming Data Collection and Processing Pipeline__Hadoo...
Honu - A Large Scale Streaming Data Collection and Processing Pipeline__Hadoo...Honu - A Large Scale Streaming Data Collection and Processing Pipeline__Hadoo...
Honu - A Large Scale Streaming Data Collection and Processing Pipeline__Hadoo...
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at  FacebookHadoop and Hive Development at  Facebook
Hadoop and Hive Development at Facebook
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015
 
HBaseConAsia2018 Track3-2: HBase at China Telecom
HBaseConAsia2018 Track3-2:  HBase at China TelecomHBaseConAsia2018 Track3-2:  HBase at China Telecom
HBaseConAsia2018 Track3-2: HBase at China Telecom
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
 
Hadoop, Taming Elephants
Hadoop, Taming ElephantsHadoop, Taming Elephants
Hadoop, Taming Elephants
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
Apache drill
Apache drillApache drill
Apache drill
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
מיכאל
מיכאלמיכאל
מיכאל
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, QlikKeeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 

Recently uploaded

FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 

Recently uploaded (20)

FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx

Hadoop summit 2010, HONU

  • 1. Honu: A large scale data collection and processing pipeline Jerome Boulon Netflix
  • 2. Session Agenda: Honu; Goals; Architecture overview; Data collection pipeline; Data processing pipeline; Hive data warehouse; Honu roadmap; Questions
  • 3. Honu: a streaming data and log collection and processing pipeline built using Hadoop, Hive, and Thrift. Honu has been running in production for 6 months and processes over a billion log events per day on EC2/EMR.
  • 4. What are we trying to achieve? Scalable log analysis to gain business insights from: error logs (unstructured logs); statistical logs (structured logs, app-specific); performance logs (structured logs, standard + app-specific). Output required: engineers' access: ad-hoc query and reporting; BI access: flat files to be loaded into the BI system for cross-functional reporting, and ad-hoc queries for data examination, etc.
  • 5-9. Architecture - Overview (the same diagram on each of slides 5 to 9). Data collection pipeline: Application, Collector. Data processing pipeline: M/R, Hive.
  • 10. Current Netflix deployment EC2/EMR (diagram labels): Amazon EC2: Applications, Honu Collectors; S3; Hive MetaStore; Hive & Hadoop EMR clusters; Hive & Hadoop EMR (for ad-hoc query)
  • 11. Honu - Client SDK: Structured Log API; Log4j Appenders; Hadoop Metric Plugin; Tomcat Access Log. Converts individual messages to batches; in-memory buffering system. Communication layer: Discovery Service; transparent fail-over & load-balancing; Thrift as a transport protocol & RPC (NIO).
  • 12. Honu - Unstructured/Structured logs: Log4j Appender. Configuration using a standard Log4j properties file. Controls: in-memory size; batch size; number of senders + VIP address; timeout.
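The in-memory size and batch size controls from slides 11 and 12 can be pictured with a small sketch. Everything here is an assumption made for illustration: `BatchingBuffer` is a hypothetical class rather than part of the Honu SDK, and dropping messages when the buffer is full is just one possible policy that the slides do not specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration only: not Honu's actual client classes.
class BatchingBuffer {
    private final BlockingQueue<String> queue; // bounded in-memory buffer
    private final int batchSize;               // max messages per batch

    BatchingBuffer(int capacity, int batchSize) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.batchSize = batchSize;
    }

    // Non-blocking: returns false when the buffer is full (assumed drop policy),
    // so a slow collector never stalls the application thread.
    boolean offer(String message) {
        return queue.offer(message);
    }

    // Drains up to batchSize buffered messages into one batch for a sender thread.
    List<String> nextBatch() {
        List<String> batch = new ArrayList<>(batchSize);
        queue.drainTo(batch, batchSize);
        return batch;
    }

    public static void main(String[] args) {
        BatchingBuffer buffer = new BatchingBuffer(4, 3);
        for (int i = 0; i < 5; i++) {
            System.out.println("msg-" + i + " accepted: " + buffer.offer("msg-" + i));
        }
        System.out.println(buffer.nextBatch());
        System.out.println(buffer.nextBatch());
    }
}
```

A sender thread would call `nextBatch()` in a loop and ship each non-empty batch to a collector; the transport and fail-over are left out of this sketch.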
  • 13. Honu - Structured Log API (diagram labels: App, Hive). Using Annotations: convert a Java class to a Hive table dynamically; add/remove column; supported Java types: all primitives, Map, Object using the toString method. Using the Key/Value API: produces the same result as the annotation; avoids unnecessary object creation; fully dynamic; thread safe.
  • 14. Structured Log API - Using Annotation
      Code:
        @Resource(table="MyTable")
        public class MyClass implements Annotatable {
          @Column("movieId")
          public String getMovieId() { […] }
          @Column("clickIndex")
          public int getClickIndex() { […] }
          @Column("requestInfo")
          public Map getRequestInfo() { […] }
        }
      Logging call: log.info(myAnnotatableObj);
      Resulting message: DB=MyTable, movieId=XXXX, clickIndex=3, requestInfo.returnCode=200, requestInfo.duration_ms=300, requestInfo.yyy=zzz
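One way to picture how annotated getters could become the flat message on slide 14 is with reflection. This is a minimal sketch under stated assumptions: `@Table`, `@Column`, `ClickEvent`, and `AnnotationFlattener` are hypothetical stand-ins defined here, not Honu's own `@Resource`/`Annotatable` types, and sorting the keys is only a convenience to make the output deterministic.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-ins for the annotations shown on the slide.
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.TYPE)
@interface Table { String value(); }

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
@interface Column { String value(); }

@Table("MyTable")
class ClickEvent {
    @Column("movieId") public String getMovieId() { return "XXXX"; }
    @Column("clickIndex") public int getClickIndex() { return 3; }
    @Column("requestInfo") public Map<String, Object> getRequestInfo() {
        Map<String, Object> info = new LinkedHashMap<>();
        info.put("returnCode", 200);
        info.put("duration_ms", 300);
        return info;
    }
}

class AnnotationFlattener {
    // Reads @Table and @Column via reflection; Map values become "column.key" entries.
    static Map<String, String> flatten(Object event) {
        Table table = event.getClass().getAnnotation(Table.class);
        if (table == null) throw new IllegalArgumentException("missing @Table");
        Map<String, String> out = new TreeMap<>(); // sorted keys for stable output
        out.put("DB", table.value());
        for (Method m : event.getClass().getMethods()) {
            Column col = m.getAnnotation(Column.class);
            if (col == null) continue; // skip getters without @Column
            Object value;
            try {
                value = m.invoke(event);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot read column " + col.value(), e);
            }
            if (value instanceof Map) {
                for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                    out.put(col.value() + "." + e.getKey(), String.valueOf(e.getValue()));
                }
            } else {
                out.put(col.value(), String.valueOf(value)); // toString for other objects
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(flatten(new ClickEvent()));
    }
}
```

Because the columns are discovered at run time, adding or removing an annotated getter changes the emitted keys without any other code change, which is the property the slide describes as adding or removing columns dynamically.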
  • 15. Structured Log API - Key/Value API
      Code:
        KeyValueSerialization kv;
        kv = new KeyValueSerialization();
        […]
        kv.startMessage("MyTable");
        kv.addKeyValue("movieid", "XXX");
        kv.addKeyValue("clickIndex", 3);
        kv.addKeyValue("requestInfo", requestInfoMap);
        log.info(kv.generateMessage());
      Resulting message: DB=MyTable, movieid=XXXX, clickIndex=3, requestInfo.returnCode=200, requestInfo.duration_ms=300, requestInfo.yyy=zzz
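The key/value style on slide 15 can be sketched as a small builder that reuses one buffer across messages. This is an illustration under assumptions, not the real `KeyValueSerialization` class: the name `KeyValueMessage`, the tab separator, and the chaining return values are all invented here, and unlike the API described on slide 13 this sketch is not thread safe (use one instance per thread).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not Honu's KeyValueSerialization implementation.
class KeyValueMessage {
    // One reusable buffer, so building a message allocates little per call.
    private final StringBuilder sb = new StringBuilder();

    // Resets the buffer and records the target table as the DB field.
    KeyValueMessage startMessage(String table) {
        sb.setLength(0);
        sb.append("DB=").append(table);
        return this;
    }

    // Appends one key=value pair; the tab separator is an assumption.
    KeyValueMessage addKeyValue(String key, Object value) {
        sb.append('\t').append(key).append('=').append(value);
        return this;
    }

    // Flattens a map into "key.subKey=value" pairs, as in the slide's output.
    KeyValueMessage addKeyValue(String key, Map<String, ?> map) {
        for (Map.Entry<String, ?> e : map.entrySet()) {
            addKeyValue(key + "." + e.getKey(), e.getValue());
        }
        return this;
    }

    String generateMessage() {
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> requestInfo = new LinkedHashMap<>();
        requestInfo.put("returnCode", 200);
        requestInfo.put("duration_ms", 300);

        KeyValueMessage kv = new KeyValueMessage();
        kv.startMessage("MyTable");
        kv.addKeyValue("movieid", "XXX");
        kv.addKeyValue("clickIndex", 3);
        kv.addKeyValue("requestInfo", requestInfo);
        System.out.println(kv.generateMessage());
    }
}
```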
  • 16. Honu Collector. Honu collector: saves logs to FS using local storage & the Hadoop FS API; FS could be localFS, HDFS, S3n, NFS… (FS fail-over coming, similar to Scribe); Thrift NIO; multiple writers (data grouping). Output: DataSink (binary compatible with Chukwa); compression (LZO/GZIP via Hadoop codecs); S3 optimizations.
  • 17. Data Processing pipeline (architecture diagram). Data collection pipeline: Application, Collector. Data processing pipeline: M/R, Hive.
  • 18. Data Processing Pipeline. Proprietary data warehouse workflows: ability to test a new build with production data; ability to replay some data processing. CheckPoint system: keeps track of all current states for recovery. Demuxer: a Map-Reduce job that parses and dispatches all logs to the right Hive table; multiple parsers; dynamic output format for Hive (table & column management): default schema (Map, hostname & timestamp); table-specific schema; all tables partitioned by date, hour & batchID.
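The date, hour and batchID partitioning from slide 18 can be sketched as a function from an event timestamp to an output location. The directory layout, the key names (`date`, `hour`, `batchid`), the UTC time zone, and the class name `PartitionPath` are all assumptions for illustration; the slides do not show the actual paths the Demuxer writes.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of mapping an event to a Hive-style partition directory.
class PartitionPath {
    private static final DateTimeFormatter DATE =
            DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);
    private static final DateTimeFormatter HOUR =
            DateTimeFormatter.ofPattern("HH").withZone(ZoneOffset.UTC);

    // One directory per table, then one level each for date, hour and batch id.
    static String forEvent(String table, long epochMillis, String batchId) {
        Instant t = Instant.ofEpochMilli(epochMillis);
        return table + "/date=" + DATE.format(t)
                + "/hour=" + HOUR.format(t)
                + "/batchid=" + batchId;
    }

    public static void main(String[] args) {
        // 1277820000000L is 2010-06-29 14:00:00 UTC.
        System.out.println(forEvent("MyTable", 1277820000000L, "b1"));
    }
}
```

Keeping the batch id as the innermost level means a replayed batch would rewrite only its own directory, which fits the replay ability mentioned on the slide, though the real mechanism may differ.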
  • 19. Data Processing Pipeline - Details (diagram labels): Map/Reduce; Table 1, Table 2, Table n; Hive Load; S3; Hive Hourly Merge. Demux output: 1 directory per Hive table; 1 file per partition * reducerCount.
  • 20. Hive Data warehouse. All data on S3 when final (SLA 2 hours): all data (CheckPoint, DataSink & Hive final output) are saved on S3, everything else is transient; no need to maintain live instances to store n years of data; start/stop EMR query cluster(s) on demand: pick the cluster size that best fits your needs; Hive reload partitions (EMR specific).
  • 21. Roadmap for Honu. Open source on GitHub (end of July): Client SDK, Collector, Demuxer. Multiple writers. Persistent queue (client & server). Real-time integration with external monitoring system. HBase/Cassandra investigation. Map/Reduce based data aggregator.
  • 22. Questions? Jerome Boulon jboulon@gmail.com