Honu - A Large Scale Streaming Data Collection and Processing Pipeline (Hadoop Summit 2010)

Hadoop Summit 2010 - Developers Track
Jerome Boulon, Netflix


1. Honu: A large scale data collection and processing pipeline
   Jerome Boulon, Netflix
2. Session Agenda
   - Honu
   - Goals
   - Architecture – Overview
   - Data Collection pipeline
   - Data Processing pipeline
   - Hive data warehouse
   - Honu Roadmap
   - Questions
3. Honu
   Honu is a streaming data & log collection and processing pipeline built using:
   - Hadoop
   - Hive
   - Thrift
   Honu has been running in production for 6 months and processes over a billion log events/day on EC2/EMR.
4. What are we trying to achieve?
   Scalable log analysis to gain business insights:
   - Error logs (unstructured logs)
   - Statistical logs (structured logs - app specific)
   - Performance logs (structured logs - standard + app specific)
   Output required:
   - Engineers: ad-hoc query and reporting
   - BI: flat files to be loaded into BI systems for cross-functional reporting; ad-hoc query for data examination, etc.
5. Architecture - Overview
   Application → Collector (data collection pipeline) → M/R → Hive (data processing pipeline)
10. Current Netflix deployment (Amazon EC2)
    Applications → Honu Collectors → S3
    Hive MetaStore, Hive & Hadoop EMR clusters, Hive & Hadoop EMR (for ad-hoc query)
11. Honu – Client SDK
    Communication layer:
    - Discovery Service
    - Transparent fail-over & load-balancing
    - Thrift as transport protocol & RPC (NIO)
    Log4j Appenders:
    - Convert individual messages to batches
    - In-memory buffering system
    Structured Log API, Hadoop Metric Plugin, Tomcat Access Log
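The "transparent fail-over & load-balancing" behavior can be sketched as a sender that walks a list of collectors and falls through to the next on failure. This is not Honu's actual sender (which uses Thrift NIO and a discovery service); the interface and class names below are made up for illustration.

```java
import java.util.List;

// Illustrative sketch of client-side fail-over and round-robin load-balancing
// across collectors. Only the fail-over shape is shown; transport is abstracted.
public class FailoverSender {
  public interface Collector { void send(String batch) throws Exception; }

  private final List<Collector> collectors;
  private int next = 0; // round-robin cursor for load-balancing

  public FailoverSender(List<Collector> collectors) {
    this.collectors = collectors;
  }

  // Returns true if any collector accepted the batch.
  public boolean send(String batch) {
    for (int i = 0; i < collectors.size(); i++) {
      Collector c = collectors.get((next + i) % collectors.size());
      try {
        c.send(batch);
        next = (next + i + 1) % collectors.size(); // start after the one that worked
        return true;
      } catch (Exception e) {
        // collector unavailable: fail over to the next one
      }
    }
    return false; // all collectors failed; a real SDK would keep buffering in memory
  }
}
```

A real client would pair this with the in-memory buffering the slide mentions, so batches are retried rather than dropped when every collector is down.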
12. Honu - Unstructured/Structured logs: Log4j Appender
    Configuration using a standard Log4j properties file
    Control:
    - In-memory size
    - Batch size
    - Number of senders + VIP address
    - Timeout
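As a sketch of what such a Log4j properties configuration might look like: the appender class name and every property key below are assumptions for illustration, since the deck does not show the actual configuration.

```properties
# Hypothetical Honu appender configuration (class and keys are illustrative,
# not Honu's actual names); it covers the knobs the slide lists.
log4j.rootLogger=INFO, honu
log4j.appender.honu=org.honu.log4j.HonuAppender
log4j.appender.honu.vipAddress=collector.example.com:7101
log4j.appender.honu.senders=2
log4j.appender.honu.batchSize=500
log4j.appender.honu.maxMemoryBytes=16777216
log4j.appender.honu.timeoutMs=5000
```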
13. Honu - Structured Log API
    Using Annotations:
    - Convert a Java class to a Hive table dynamically
    - Add/remove "columns"
    - Supported Java types: all primitives, Map, and any Object via its toString method
    Using the Key/Value API:
    - Produces the same result as Annotations
    - Avoids unnecessary object creation
    - Fully dynamic
    - Thread-safe
14. Structured Log API – Using Annotations

    @Resource(table="MyTable")
    public class MyClass implements Annotatable {
      @Column("movieId")
      public String getMovieId() { [...] }

      @Column("clickIndex")
      public int getClickIndex() { [...] }

      @Column("requestInfo")
      public Map getRequestInfo() { [...] }
    }

    log.info(myAnnotatableObj);

    Resulting message:
    DB=MyTable
    movieId=XXXX
    clickIndex=3
    requestInfo.returnCode=200
    requestInfo.duration_ms=300
    requestInfo.yyy=zzz
15. Structured Log API - Key/Value API

    KeyValueSerialization kv;
    kv = new KeyValueSerialization();
    [...]
    kv.startMessage("MyTable");
    kv.addKeyValue("movieId", "XXX");
    kv.addKeyValue("clickIndex", 3);
    kv.addKeyValue("requestInfo", requestInfoMap);

    log.info(kv.generateMessage());

    Resulting message:
    DB=MyTable
    movieId=XXX
    clickIndex=3
    requestInfo.returnCode=200
    requestInfo.duration_ms=300
    requestInfo.yyy=zzz
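The message layout shown on this slide (table name followed by flat key/value pairs, with Maps expanded to dotted keys) can be illustrated with a small self-contained sketch. This is not Honu's actual KeyValueSerialization, only a minimal re-implementation of the format the slide shows:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of building the flat "DB=<table> key=value ..." message,
// flattening Map values into dotted keys as shown on the slide.
public class KeyValueSketch {
  private final StringBuilder sb = new StringBuilder();

  public void startMessage(String table) {
    sb.setLength(0);
    sb.append("DB=").append(table);
  }

  public void addKeyValue(String key, Object value) {
    if (value instanceof Map) {
      // Maps are flattened using dotted keys: requestInfo.returnCode=200 ...
      for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
        sb.append(' ').append(key).append('.').append(e.getKey())
          .append('=').append(e.getValue());
      }
    } else {
      // Primitives and other objects fall back to toString()
      sb.append(' ').append(key).append('=').append(value);
    }
  }

  public String generateMessage() {
    return sb.toString();
  }

  public static void main(String[] args) {
    KeyValueSketch kv = new KeyValueSketch();
    kv.startMessage("MyTable");
    kv.addKeyValue("movieId", "XXX");
    kv.addKeyValue("clickIndex", 3);
    Map<String, Object> requestInfo = new LinkedHashMap<>();
    requestInfo.put("returnCode", 200);
    kv.addKeyValue("requestInfo", requestInfo);
    System.out.println(kv.generateMessage());
    // prints: DB=MyTable movieId=XXX clickIndex=3 requestInfo.returnCode=200
  }
}
```

Reusing one StringBuilder per instance is what makes the API cheap compared to building annotated objects, which matches the "avoid unnecessary object creation" point on the previous slide.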
16. Honu Collector
    - Saves logs to an FS using local storage & the Hadoop FS API
      - The FS can be localFS, HDFS, S3n, NFS, ...
      - FS fail-over coming (similar to Scribe)
    - Thrift NIO
    - Multiple writers (data grouping)
    Output: DataSink (binary compatible with Chukwa)
    - Compression (LZO/GZIP via Hadoop codecs)
    - S3 optimizations
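The collector batches events and compresses its DataSink output. Honu does this through Hadoop's compression codecs (LZO/GZIP); the sketch below substitutes java.util.zip's GZIP so it stays self-contained, and only illustrates the batch-then-compress shape, not the Chukwa-compatible DataSink format.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Illustrative sketch: compress a batch of log events into one GZIP block,
// as a stand-in for writing a compressed DataSink file via a Hadoop codec.
public class BatchCompressor {
  public static byte[] compressBatch(List<String> events) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
      for (String event : events) {
        gz.write(event.getBytes(StandardCharsets.UTF_8));
        gz.write('\n'); // one event per line inside the compressed block
      }
    }
    return buf.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] batch = compressBatch(java.util.Arrays.asList("event-1", "event-2"));
    System.out.println("compressed batch: " + batch.length + " bytes");
  }
}
```

Compressing whole batches rather than individual events is also what makes the output friendlier to S3, where fewer, larger objects are cheaper to write and read.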
17. Data Processing pipeline
    Application → Collector (data collection pipeline) → M/R → Hive (data processing pipeline)
18. Data Processing Pipeline
19. Data Processing Pipeline - Details
    Demux output:
    - 1 directory per Hive table
    - 1 file per partition * reducerCount
    Flow: Map/Reduce → Table 1..n → Hourly Merge → Hive Load → Hive (data on S3)
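The demux layout (one directory per Hive table, one file per partition times reducer) amounts to a simple routing rule. The directory layout and file naming below are assumptions for illustration, not Honu's actual paths:

```java
// Sketch of the demux output routing: each record lands under its Hive
// table's directory, in a per-partition file keyed by the reducer that
// wrote it. Naming convention is made up for illustration.
public class DemuxRouter {
  // e.g. table="MyTable", partition="2010-06-29-14", reducer=3
  public static String outputPath(String table, String partition, int reducer) {
    return table + "/" + partition + "/part-" + String.format("%05d", reducer);
  }

  public static void main(String[] args) {
    System.out.println(outputPath("MyTable", "2010-06-29-14", 3));
    // prints: MyTable/2010-06-29-14/part-00003
  }
}
```

This layout is what makes the hourly merge and Hive load straightforward: each table directory can be loaded as-is, one partition at a time.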
20. Hive Data warehouse
    All data on S3 when final (SLA: 2 hours)
    - All data (CheckPoint, DataSink & final Hive output) is saved on S3; everything else is transient
    - No need to maintain "live" instances to store n years of data
    - Start/stop EMR query cluster(s) on demand
      - "Get the best cluster size for you"
      - Hive reload partitions (EMR specific)
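The "Hive reload partitions (EMR specific)" bullet refers to EMR Hive's ability to discover partitions already laid out on S3 when a fresh cluster starts. A hedged sketch of that step follows; the table, columns, and bucket names are made up, and RECOVER PARTITIONS assumes an EMR Hive build that supports it:

```sql
-- Point an external table at data the pipeline has already written to S3,
-- then ask EMR's Hive to discover the partitions present under that prefix.
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
  movie_id STRING,
  click_index INT
)
PARTITIONED BY (dateint INT, hour INT)
LOCATION 's3n://my-bucket/hive/my_table/';

ALTER TABLE my_table RECOVER PARTITIONS;
```

Because the table is external and partition metadata can be rebuilt on demand, a query cluster can be started, used, and thrown away without losing any state.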
21. Roadmap for Honu
    - Open source: GitHub (end of July)
      - Client SDK
      - Collector
      - Demuxer
    - Multiple writers
    - Persistent queue (client & server)
    - Real-time integration with external monitoring systems
    - HBase/Cassandra investigation
    - Map/Reduce-based data aggregator
22. Questions?
    Jerome Boulon
    [email_address]
    http://wiki.github.com/jboulon/Honu/
