Honu: A Large Scale Streaming Data Collection and Processing Pipeline

Hadoop Summit 2010 - Developers Track
Jerome Boulon, Netflix

Transcript

    • 1. Honu: A large scale data collection and processing pipeline
      Jerome Boulon, Netflix
    • 2. Session Agenda
      - Honu
      - Goals
      - Architecture – Overview
      - Data collection pipeline
      - Data processing pipeline
      - Hive data warehouse
      - Honu roadmap
      - Questions
    • 3. Honu
      Honu is a streaming data and log collection and processing pipeline built using:
      - Hadoop
      - Hive
      - Thrift
      Honu has been running in production for six months and processes over a billion log events per day on EC2/EMR.
    • 4. What are we trying to achieve?
      Scalable log analysis to gain business insights:
      - Error logs (unstructured logs)
      - Statistical logs (structured logs, app specific)
      - Performance logs (structured logs, standard + app specific)
      Output required:
      - Engineers: ad-hoc query and reporting
      - BI: flat files to be loaded into the BI system for cross-functional reporting, plus ad-hoc query for data examination
    • 5. Architecture – Overview
      (Diagram: Application → Collector → M/R → Hive. The Application-to-Collector half is the data collection pipeline; the M/R-to-Hive half is the data processing pipeline.)
    • 10. Current Netflix deployment
      (Diagram: applications and Honu collectors run on Amazon EC2; data lands in S3; Hive & Hadoop EMR clusters, plus a separate Hive & Hadoop EMR cluster for ad-hoc queries, share the Hive MetaStore.)
    • 11. Honu – Client SDK
      Communication layer:
      - Discovery service
      - Transparent fail-over & load-balancing
      - Thrift as transport protocol & RPC (NIO)
      Log4j appenders:
      - Convert individual messages to batches
      - In-memory buffering system
      Also in the SDK: Structured Log API, Hadoop metric plugin, Tomcat access log.
    • 12. Honu – Unstructured/Structured Logs: Log4j Appender
      - Configuration using a standard log4j properties file
      - Control over:
        - In-memory size
        - Batch size
        - Number of senders + VIP address
        - Timeout
      (A configuration sketch follows.)
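      A minimal log4j.properties sketch of such a configuration. The slide names the tunables but not their exact keys, so the appender class name and every property name below are assumptions for illustration; only the standard log4j.logger/log4j.appender syntax is real:

        # Hypothetical Honu appender wiring; class and property names are illustrative
        log4j.logger.honu=INFO, honu
        log4j.appender.honu=org.honu.log4j.HonuAppender
        # cap on the in-memory buffer before messages are flushed or dropped
        log4j.appender.honu.maxMemory=64000000
        # number of messages grouped into one Thrift batch
        log4j.appender.honu.batchSize=200
        # parallel senders and the collector VIP they reach collectors through
        log4j.appender.honu.senders=2
        log4j.appender.honu.vipAddress=honu-collector.example.com:7101
        # fail over to another collector after this many milliseconds
        log4j.appender.honu.timeoutMs=5000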
    • 13. Honu – Structured Log API
      Using annotations (App → Hive):
      - Converts a Java class to a Hive table dynamically
      - Add/remove "columns"
      - Supported Java types: all primitives, Map, and any object via its toString method
      Using the key/value API:
      - Produces the same result as annotations
      - Avoids unnecessary object creation
      - Fully dynamic
      - Thread safe
    • 14. Structured Log API – Using Annotations

      @Resource(table="MyTable")
      public class MyClass implements Annotatable {

        @Column("movieId")
        public String getMovieId() { […] }

        @Column("clickIndex")
        public int getClickIndex() { […] }

        @Column("requestInfo")
        public Map getRequestInfo() { […] }
      }

      log.info(myAnnotatableObj);

      Resulting record:
        DB=MyTable
        movieId=XXXX
        clickIndex=3
        requestInfo.returnCode=200
        requestInfo.duration_ms=300
        requestInfo.yyy=zzz
    • 15. Structured Log API – Key/Value API

      KeyValueSerialization kv;
      kv = new KeyValueSerialization();
      […]
      kv.startMessage("MyTable");
      kv.addKeyValue("movieId", "XXX");
      kv.addKeyValue("clickIndex", 3);
      kv.addKeyValue("requestInfo", requestInfoMap);

      log.info(kv.generateMessage());

      Resulting record:
        DB=MyTable
        movieId=XXX
        clickIndex=3
        requestInfo.returnCode=200
        requestInfo.duration_ms=300
        requestInfo.yyy=zzz
    • 16. Honu Collector
      - Saves logs to the FS using local storage and the Hadoop FS API
        - FS can be localFS, HDFS, S3n, NFS…
        - FS fail-over coming (similar to Scribe)
      - Thrift NIO
      - Multiple writers (data grouping)
      - Output: DataSink (binary compatible with Chukwa)
        - Compression (LZO/GZIP via Hadoop codecs)
        - S3 optimizations
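      The collector's writer code is not shown in the deck; the following is a minimal sketch assuming only what the slide states (Hadoop FS API, Hadoop compression codecs, SequenceFile-based Chukwa-compatible DataSink files). The Text key/value types and file names are simplifications, not Honu's actual record classes:

        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.compress.GzipCodec;

        public class DataSinkWriterSketch {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The same FS API serves file://, hdfs:// or s3n:// URIs, which is
            // what lets a collector target localFS, HDFS, S3n or NFS mounts.
            FileSystem fs = FileSystem.get(URI.create("file:///tmp/datasink"), conf);
            GzipCodec codec = new GzipCodec();
            codec.setConf(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/tmp/datasink/chunk.seq"),
                Text.class, Text.class,
                SequenceFile.CompressionType.BLOCK, codec);
            // Each appended record stands in for one batch of log events
            // received from a client over Thrift.
            writer.append(new Text("MyTable"), new Text("movieId=XXX\tclickIndex=3"));
            writer.close();
          }
        }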
    • 17. Data Processing Pipeline
      (Same overview diagram: Application → Collector → M/R → Hive; this section covers the M/R-to-Hive half.)
    • 18. Data Processing Pipeline
    • 19. Data Processing Pipeline – Details
      Demux output:
      - 1 directory per Hive table
      - 1 file per partition * reducerCount
      (Diagram: S3 → Map/Reduce demux → Table 1, Table 2 … Table n → hourly merge → Hive load → Hive.)
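      The demuxer itself is not shown in the deck. As an illustration of how the "1 directory per Hive table, 1 file per reducer" layout can fall out of the classic mapred API (this class is a sketch, not Honu's code, and assumes the demux reducers emit the table name as the key):

        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

        // Routes each (table, record) pair into <output>/<table>/<part-NNNNN>:
        // one directory per Hive table, one file per reducer, as on the slide.
        public class PerTableOutputFormat extends MultipleTextOutputFormat<Text, Text> {
          @Override
          protected String generateFileNameForKeyValue(Text table, Text record,
                                                       String leafName) {
            return table.toString() + "/" + leafName;  // e.g. MyTable/part-00000
          }
        }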
    • 20. Hive Data Warehouse
      - All data on S3 once final (SLA: 2 hours)
        - All data (checkpoints, DataSinks & final Hive output) is saved on S3; everything else is transient
        - No need to maintain "live" instances to store n years of data
        - Start/stop EMR query cluster(s) on demand:
          - "Get the best cluster size for you"
          - Hive reload partitions (EMR specific)
    • 21. Roadmap for Honu
      - Open source on GitHub (end of July):
        - Client SDK
        - Collector
        - Demuxer
      - Multiple writers
      - Persistent queue (client & server)
      - Real-time integration with external monitoring systems
      - HBase/Cassandra investigation
      - Map/Reduce-based data aggregator
    • 22. Questions?
      Jerome Boulon
      [email_address]
      http://wiki.github.com/jboulon/Honu/
