Honu:
 A large scale data collection
and processing pipeline
Jerome Boulon
Netflix
Hadoop Summit 2010
Session Agenda


§  Honu
§  Goals
§  Architecture – Overview
§  Data Collection pipeline
§  Data Processing pipeline
§  Hive data warehouse
§  Honu Roadmap
§  Questions


Honu


§  Honu is a streaming data & log collection and processing
    pipeline built using:
 ›    Hadoop
 ›    Hive
 ›    Thrift


§  Honu has been running in production for 6 months and
    processes over a billion log events/day on EC2/EMR.
What are we trying to achieve?


§  Scalable log analysis to gain business insights:
 ›    Error logs (unstructured logs)
 ›    Statistical logs (structured logs – App specific)
 ›    Performance logs (structured logs – Standard + App specific)
§  Output required:
 ›    Engineers access:
      •  Ad-hoc query and reporting
 ›    BI access:
      •  Flat files to be loaded into the BI system for cross-functional reporting.
      •  Ad-hoc queries for data examination, etc.
Architecture - Overview

Data collection pipeline:    Application → Collector
Data processing pipeline:    Collector → M/R → Hive
Current Netflix deployment EC2/EMR

§  Applications and Honu Collectors run on Amazon EC2
§  Collectors write their output to S3
§  Hive & Hadoop EMR clusters process the data on S3
§  A dedicated Hive & Hadoop EMR cluster is used for ad-hoc queries
§  All clusters share a common Hive MetaStore
Honu – Client SDK

§  Inputs:
 ›    Structured Log API
 ›    Log4j Appenders
 ›    Hadoop Metric Plugin
 ›    Tomcat Access Log
§  Client side:
 ›    Converts individual messages to batches
 ›    In-memory buffering system
§  Communication layer:
 ›    Discovery Service
 ›    Transparent fail-over & load-balancing
 ›    Thrift as the transport protocol & RPC (NIO)
Honu - Unstructured/Structured logs
Log4j Appender

§  Configuration using a standard Log4j properties file
§  Control:
  ›    In-memory size
  ›    Batch size
  ›    Number of senders + VIP address
  ›    Timeout
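A configuration along these lines could look as follows. This is a hypothetical sketch: the appender class name and every property key below are illustrative assumptions, not Honu's actual ones.

```properties
# Hypothetical Log4j configuration for a Honu appender
# (class name and property keys are illustrative, not Honu's real API)
log4j.rootLogger=INFO, HONU

log4j.appender.HONU=com.netflix.honu.log4j.HonuAppender
# In-memory buffer size (bytes) before flushing
log4j.appender.HONU.maxMemoryUsage=8388608
# Number of messages batched into a single Thrift call
log4j.appender.HONU.batchSize=200
# Sender threads and the collectors' VIP address
log4j.appender.HONU.senderCount=2
log4j.appender.HONU.vipAddress=honu-collectors.example.com:7101
# Give up on a collector after this many milliseconds
log4j.appender.HONU.timeout=5000
```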
Honu - Structured Log API

§  Using Annotations:
 ›    Convert a Java class to a Hive table dynamically
 ›    Add/Remove columns
 ›    Supported Java types:
      •  All primitives
      •  Map
      •  Objects, using the toString method
§  Using the Key/Value API:
 ›    Produces the same result as annotations
 ›    Avoids unnecessary object creation
 ›    Fully dynamic
 ›    Thread safe
Structured Log API – Using Annotation

@Resource(table="MyTable")
public class MyClass implements Annotatable {

  @Column("movieId")
  public String getMovieId() {
    […]
  }

  @Column("clickIndex")
  public int getClickIndex() {
    […]
  }

  @Column("requestInfo")
  public Map getRequestInfo() {
    […]
  }
}

log.info(myAnnotatableObj);

Resulting message:
DB=MyTable
movieId=XXXX
clickIndex=3
requestInfo.returnCode=200
requestInfo.duration_ms=300
requestInfo.yyy=zzz
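The annotation flow above can be sketched with plain Java reflection. This is a minimal illustration, not Honu's actual Annotatable machinery: the @Column annotation and the serialize helper below are re-implemented here just for the example.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Sketch of annotation-driven serialization in the spirit of the slide
// (names are assumptions, not Honu's real API).
public class AnnotationSketch {

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface Column { String value(); }

    public static class MyClass {
        @Column("movieId")    public String getMovieId()  { return "XXXX"; }
        @Column("clickIndex") public int getClickIndex()  { return 3; }
    }

    // Walks the annotated getters and emits column=value pairs, one per line.
    public static String serialize(Object obj) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (Method m : obj.getClass().getMethods()) {
            Column c = m.getAnnotation(Column.class);
            if (c != null) {
                if (sb.length() > 0) sb.append('\n');
                sb.append(c.value()).append('=').append(m.invoke(obj));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(new MyClass()));
    }
}
```

Note that reflection does not guarantee method order, which is one reason a real implementation would also carry an explicit schema.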
Structured Log API - Key/Value API

KeyValueSerialization kv;
kv = new KeyValueSerialization();

[…]
kv.startMessage("MyTable");
kv.addKeyValue("movieId", "XXX");
kv.addKeyValue("clickIndex", 3);
kv.addKeyValue("requestInfo", requestInfoMap);

log.info(kv.generateMessage());

Resulting message:
DB=MyTable
movieId=XXXX
clickIndex=3
requestInfo.returnCode=200
requestInfo.duration_ms=300
requestInfo.yyy=zzz
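The key/value message format above, including the dotted flattening of nested maps, can be sketched in a few lines. This is an assumption-laden stand-in for KeyValueSerialization, written only to show the message shape.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of a key/value message builder in the spirit of
// KeyValueSerialization (names and behavior are assumptions, not Honu's API).
public class KeyValueSketch {
    private final StringBuilder sb = new StringBuilder();

    public void startMessage(String table) {
        sb.setLength(0);
        sb.append("DB=").append(table);
    }

    public void addKeyValue(String key, Object value) {
        if (value instanceof Map) {
            // Nested maps flatten with a dotted prefix, as in the slide output.
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                sb.append('\n').append(key).append('.')
                  .append(e.getKey()).append('=').append(e.getValue());
            }
        } else {
            sb.append('\n').append(key).append('=').append(value);
        }
    }

    public String generateMessage() {
        return sb.toString();
    }

    public static void main(String[] args) {
        KeyValueSketch kv = new KeyValueSketch();
        kv.startMessage("MyTable");
        kv.addKeyValue("clickIndex", 3);
        Map<String, Object> req = new LinkedHashMap<>();
        req.put("returnCode", 200);
        req.put("duration_ms", 300);
        kv.addKeyValue("requestInfo", req);
        System.out.println(kv.generateMessage());
    }
}
```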
Honu Collector


§  Honu collector:
 ›    Saves logs to the FS using local storage & the Hadoop FS API
 ›    FS could be localFS, HDFS, S3n, NFS…
      •  FS fail-over coming (similar to Scribe)
 ›    Thrift NIO
 ›    Multiple writers (data grouping)


§  Output: DataSink (binary compatible with Chukwa)
 ›    Compression (LZO/GZIP via Hadoop codecs)
 ›    S3 optimizations
Data Processing pipeline

Data collection pipeline:    Application → Collector
Data processing pipeline:    Collector → M/R → Hive
Data Processing Pipeline
§  Proprietary Data Warehouse Workflows
   ›  Ability to test a new build with production data
   ›  Ability to replay some data processing


§  CheckPoint System
   ›  Keeps track of all current states for recovery


§  Demuxer: Map-Reduce job to parse/dispatch all logs to the
    right Hive table
   ›  Multiple parsers
   ›  Dynamic Output Format for Hive (tables & columns
      management)
       •  Default schema (Map, hostname & timestamp)
       •  Table-specific schema
       •  All tables partitioned by date, hour & batchID
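The date/hour/batchID partitioning above can be illustrated with a small path helper. The path layout and partition key names (dateint, hour, batchid) are assumptions made for this example, not Honu's actual conventions.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of the date/hour/batch partitioning described on the slide.
public class PartitionPath {
    static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);
    static final DateTimeFormatter HOUR =
        DateTimeFormatter.ofPattern("HH").withZone(ZoneOffset.UTC);

    // Builds a Hive-style partition path: table/dateint=.../hour=.../batchid=...
    public static String partitionFor(String table, long epochMillis, int batchId) {
        Instant t = Instant.ofEpochMilli(epochMillis);
        return table + "/dateint=" + DAY.format(t)
                     + "/hour=" + HOUR.format(t)
                     + "/batchid=" + batchId;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("MyTable", 0L, 1));
    }
}
```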
Data Processing Pipeline - Details

Map/Reduce (Demux) → Table 1, Table 2, … Table n → Load into Hive

§  Demux output:
 ›    1 directory per Hive table
 ›    1 file per partition × reducerCount
§  Each table directory is loaded into Hive
§  An hourly merge writes the Hive output to S3
Hive Data warehouse


§  All data on S3 when final (SLA: 2 hours)
 ›    All data (CheckPoint, DataSink & Hive final output) is
      saved on S3; everything else is transient
 ›    No need to maintain live instances to store n years of
      data
 ›    Start/Stop EMR query cluster(s) on demand
      •  Get the best cluster size for you
      •  Hive reload partitions (EMR specific)
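Assuming the "reload partitions" step refers to EMR's Hive extension for discovering partitions already present on S3, it would look like this (a sketch; the table name is illustrative):

```sql
-- EMR-specific Hive statement that scans the table's S3 location and
-- registers any partitions found there with the MetaStore.
ALTER TABLE MyTable RECOVER PARTITIONS;
```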
Roadmap for Honu


§  Open source: GitHub (end of July)
  ›    Client SDK
  ›    Collector
  ›    Demuxer
§  Multiple writers
§  Persistent queue (client & server)
§  Real-time integration with external monitoring systems
§  HBase/Cassandra investigation
§  Map/Reduce based data aggregator
Questions?

   Jerome Boulon
   jboulon@gmail.com
