Honu: A large-scale data collection and processing pipeline
Jerome Boulon, Netflix
Session Agenda
- Honu Goals
- Architecture Overview
- Data Collection Pipeline
- Data Processing Pipeline
- Hive Data Warehouse
- Honu Roadmap
- Questions
Honu
Honu is a streaming data and log collection and processing pipeline built with Hadoop, Hive, and Thrift. It has been running in production on EC2/EMR for six months and processes over a billion log events per day.
What are we trying to achieve?
Scalable log analysis to gain business insights:
- Error logs (unstructured logs)
- Statistical logs (structured, app-specific)
- Performance logs (structured, standard + app-specific)
Output required:
- Engineering access: ad-hoc query and reporting
- BI access: flat files loaded into the BI system for cross-functional reporting; ad-hoc query for data examination, etc.
Architecture - Overview
Application → Collector (data collection pipeline) → M/R → Hive (data processing pipeline)
Current Netflix Deployment (EC2/EMR)
- Applications
- Honu Collectors
- S3
- Hive MetaStore
- Hive & Hadoop EMR clusters
- Hive & Hadoop EMR (for ad-hoc query)
- All on Amazon EC2
Honu - Client SDK
Communication layer:
- Discovery Service
- Transparent fail-over & load-balancing
- Thrift as transport protocol & RPC (NIO)
Log4j appenders:
- Convert individual messages to batches
- In-memory buffering system
Structured Log API
Hadoop Metrics plugin, Tomcat access log
Honu - Unstructured/Structured Logs
Log4j appender configured via a standard Log4j properties file. Controls:
- In-memory size
- Batch size
- Number of senders + VIP address
- Timeout
(A minimal configuration sketch follows.)
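Purely as an illustration of what such a properties file could look like, here is a minimal sketch. The appender class and property names below are assumptions added for readability, not the actual Honu API; see the Honu wiki for the real names.

    # log4j.properties sketch - class and property names are hypothetical
    log4j.rootLogger=INFO, HONU

    # Placeholder appender class name
    log4j.appender.HONU=org.honu.log4j.HonuAppender

    # Collector endpoint (VIP address) and number of sender threads
    log4j.appender.HONU.vipAddress=honu-collector.example.com:7101
    log4j.appender.HONU.senders=2

    # Batching and in-memory buffering limits
    log4j.appender.HONU.batchSize=200
    log4j.appender.HONU.maxMemoryKB=10240

    # Give up on a send attempt after this many milliseconds
    log4j.appender.HONU.timeout=5000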
Honu - Structured Log API
- Converts a Java class into a Hive table dynamically (App → Hive)
- Add/remove "columns" on the fly
- Supported Java types: all primitives, Map, and any Object via its toString method
- Two ways to use it: Annotations, or the Key/Value API
- The Key/Value API produces the same result as the annotations, avoids unnecessary object creation, and is fully dynamic and thread-safe
Structured Log API - Using Annotations

@Resource(table="MyTable")
public class MyClass implements Annotatable {

  @Column("movieId")
  public String getMovieId() { […] }

  @Column("clickIndex")
  public int getClickIndex() { […] }

  @Column("requestInfo")
  public Map getRequestInfo() { […] }
}

log.info(myAnnotatableObj);

Resulting message:
DB=MyTable
movieId=XXXX
clickIndex=3
requestInfo.returnCode=200
requestInfo.duration_ms=300
requestInfo.yyy=zzz
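To make the slide example concrete, here is a minimal, self-contained Java sketch. The @Resource, @Column and Annotatable names come from the slide; the package imports, class name, getter bodies and main method are assumptions added for illustration only.

    // Hypothetical imports: the actual Honu package layout may differ.
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.log4j.Logger;
    import org.honu.annotation.Annotatable; // assumed marker interface
    import org.honu.annotation.Column;      // assumed location of @Column
    import org.honu.annotation.Resource;    // assumed location of @Resource

    @Resource(table = "MyTable")
    public class MovieClickEvent implements Annotatable {

      private static final Logger log = Logger.getLogger(MovieClickEvent.class);

      @Column("movieId")
      public String getMovieId() { return "XXXX"; }

      @Column("clickIndex")
      public int getClickIndex() { return 3; }

      @Column("requestInfo")
      public Map<String, String> getRequestInfo() {
        Map<String, String> info = new HashMap<String, String>();
        info.put("returnCode", "200");
        info.put("duration_ms", "300");
        return info;
      }

      public static void main(String[] args) {
        // The Honu Log4j appender serializes the annotated object into the
        // key/value message shown above (DB=MyTable, movieId=..., etc.).
        log.info(new MovieClickEvent());
      }
    }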
Structured Log API - Key/Value API

KeyValueSerialization kv = new KeyValueSerialization();
[…]
kv.startMessage("MyTable");
kv.addKeyValue("movieId", "XXX");
kv.addKeyValue("clickIndex", 3);
kv.addKeyValue("requestInfo", requestInfoMap);
log.info(kv.generateMessage());

Resulting message:
DB=MyTable
movieId=XXXX
clickIndex=3
requestInfo.returnCode=200
requestInfo.duration_ms=300
requestInfo.yyy=zzz
Honu Collector
- Saves logs to the file system using local storage & the Hadoop FS API; the FS can be local FS, HDFS, S3n, NFS, ...
- FS fail-over coming (similar to Scribe)
- Thrift NIO
- Multiple writers (data grouping)
Output:
- DataSink files (binary-compatible with Chukwa)
- Compression (LZO/GZIP via Hadoop codecs)
- S3 optimizations
Data Processing Pipeline
Application → Collector (data collection pipeline) → M/R → Hive (data processing pipeline)
Data Processing Pipeline
Data Processing Pipeline - Details
The Map/Reduce demux reads DataSinks from S3 and writes one directory per Hive table (Table 1 … Table n), with one file per partition × reducerCount. An hourly merge then compacts these files before the Hive load. (An illustrative layout follows.)
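Purely as an illustration (bucket names and the partition scheme are not given in the deck), the demux output on S3 could look like this, assuming a hypothetical date-hour partition column named dt:

    s3n://honu-example-bucket/demux/Table1/dt=2010-06-29-14/part-00000
    s3n://honu-example-bucket/demux/Table1/dt=2010-06-29-14/part-00001
    s3n://honu-example-bucket/demux/Table2/dt=2010-06-29-14/part-00000
    ...

One directory per Hive table, one file per partition × reducerCount; the hourly merge later compacts the small files within each partition.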
Hive Data Warehouse
- All data is on S3 once final (SLA: 2 hours)
- Everything durable (checkpoints, DataSinks & final Hive output) is saved on S3; everything else is transient
- No need to keep "live" instances running just to store n years of data
- Start/stop EMR query cluster(s) on demand: "get the best cluster size for you"
- Hive reload partitions (EMR-specific); see the sketch below
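The "reload partitions" step refers to the EMR-specific Hive extension for registering partitions that already exist on S3. A minimal sketch, assuming a hypothetical table named access_log:

    -- EMR-specific Hive statement: scans the table's S3 location and adds
    -- any partitions found on storage but missing from the metastore.
    ALTER TABLE access_log RECOVER PARTITIONS;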
Roadmap for Honu
- Open source on GitHub (end of July): Client SDK, Collector, Demuxer
- Multiple writers
- Persistent queue (client & server)
- Real-time integration with external monitoring systems
- HBase/Cassandra investigation
- Map/Reduce-based data aggregator
Questions?
Jerome Boulon
[email_address]
http://wiki.github.com/jboulon/Honu/
