Honu - A Large Scale Streaming Data Collection and Processing Pipeline__HadoopSummit2010

Hadoop Summit 2010 - Developers Track
Honu - A Large Scale Streaming Data Collection and Processing Pipeline
Jerome Boulon, Netflix

Transcript

    • 1. Honu: A large scale data collection and processing pipeline
      • Jerome Boulon
      Netflix
    • 2. Session Agenda
      • Honu
      • Goals
      • Architecture – Overview
      • Data Collection pipeline
      • Data Processing pipeline
      • Hive data warehouse
      • Honu Roadmap
      • Questions
    • 3. Honu
      • Honu is a streaming data & log collection and processing pipeline built using:
        • Hadoop
        • Hive
        • Thrift
      • Honu has been running in production for 6 months and processes over a billion log events per day on EC2/EMR.
    • 4. What are we trying to achieve?
      • Scalable log analysis to gain business insights:
        • Error logs (unstructured logs)
        • Statistical logs (structured logs - App specific)
        • Performance logs (structured logs – Standard + App specific)
      • Output required:
        • Engineers access:
          • Ad-hoc query and reporting
        • BI access:
          • Flat files to be loaded into BI system for cross-functional reporting.
          • Ad-hoc query for data examination, etc.
    • 5.–9. Architecture – Overview (diagram, built up across five slides): Application → Collector → M/R → Hive; the data collection pipeline feeds the data processing pipeline.
    • 10. Current Netflix deployment (diagram): applications running on Amazon EC2 send logs to the Honu Collectors; data lands in S3; Hive & Hadoop EMR clusters, plus a separate EMR cluster for ad-hoc query, share a common Hive MetaStore.
    • 11. Honu – Client SDK
      • Communication layer:
        • Discovery Service
        • Transparent fail-over & load-balancing
        • Thrift as a transport protocol & RPC (NIO)
      • Log4j Appenders:
        • Convert individual messages to batches
        • In-memory buffering system
      • Structured Log API
      • Hadoop Metric Plugin
      • Tomcat Access Log
    • 12. Honu - Unstructured/Structured logs Log4J Appender
      • Configuration using a standard Log4j properties file (see the sketch below)
      • Control:
        • In-memory buffer size
        • Batch size
        • Number of senders + VIP address
        • Timeout
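      As a rough illustration only, a Log4j properties fragment for such an appender might look like the following; the appender class name and every property key are hypothetical, since the slides do not show Honu's actual configuration keys:

      # Hypothetical Honu appender configuration (illustrative names only)
      log4j.logger.com.example.myapp=INFO, honu
      log4j.appender.honu=org.honu.log4j.HonuAppender                    # assumed class name
      log4j.appender.honu.maxMemoryMb=64                                 # in-memory buffer size
      log4j.appender.honu.batchSize=500                                  # messages per batch
      log4j.appender.honu.senders=2                                      # number of sender threads
      log4j.appender.honu.collectorVip=honu-collector.example.com:7101  # VIP address for the collectors
      log4j.appender.honu.timeoutMs=5000                                 # send timeout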
    • 13. Honu - Structured Log API
      • Converts a Java class to a Hive table dynamically
      • Add/remove “columns”
      • Supported Java types:
          • All primitives
          • Map
          • Any object, via its toString method
      • Key/Value API:
        • Produces the same result as the Annotation API
        • Avoids unnecessary object creation
        • Fully dynamic
        • Thread safe
      (Diagram: App → Hive, using Annotations or using the Key/Value API)
    • 14. Structured Log API – Using Annotations
      @Resource(table="MyTable")
      public class MyClass implements Annotatable {
        @Column("movieId")
        public String getMovieId() { […] }
        @Column("clickIndex")
        public int getClickIndex() { […] }
        @Column("requestInfo")
        public Map getRequestInfo() { […] }
      }
      log.info(myAnnotatableObj);
      Resulting message:
      DB=MyTable
      movieId=XXXX
      clickIndex=3
      requestInfo.returnCode=200
      requestInfo.duration_ms=300
      requestInfo.yyy=zzz
    • 15. Structured Log API - Key/Value API
      KeyValueSerialization kv;
      kv = new KeyValueSerialization();
      […]
      kv.startMessage("MyTable");
      kv.addKeyValue("movieid", "XXX");
      kv.addKeyValue("clickIndex", 3);
      kv.addKeyValue("requestInfo", requestInfoMap);
      log.info(kv.generateMessage());
      Resulting message:
      DB=MyTable
      movieid=XXX
      clickIndex=3
      requestInfo.returnCode=200
      requestInfo.duration_ms=300
      requestInfo.yyy=zzz
    • 16. Honu Collector
      • Honu collector:
        • Saves logs to a filesystem using local storage & the Hadoop FS API (see the sketch below)
        • FS could be localFS, HDFS, S3n, NFS…
          • FS fail-over coming (similar to scribe)
        • Thrift NIO
        • Multiple writers (Data grouping)
      • Output: DataSink (Binary compatible with Chukwa)
        • Compression (LZO/GZIP via Hadoop codecs)
        • S3 optimizations
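      To illustrate the “Hadoop FS API + compression codec” idea (this is not Honu's collector code, only a minimal sketch; the class and path names are made up), a writer like the one below works unchanged against local disk, HDFS or S3n, because the URI scheme selects the FileSystem implementation:

      import java.io.OutputStream;
      import java.net.URI;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.compress.CompressionCodec;
      import org.apache.hadoop.io.compress.GzipCodec;
      import org.apache.hadoop.util.ReflectionUtils;

      // Illustrative only: the Hadoop FS API hides whether the sink is local
      // disk, HDFS or S3n, and Hadoop codecs provide the compression layer.
      public class SimpleDataSinkWriter {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // The URI scheme picks the filesystem: file://, hdfs:// or s3n://
          FileSystem fs = FileSystem.get(URI.create("file:///tmp/honu-sink"), conf);
          CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
          Path sink = new Path("/tmp/honu-sink/datasink-0001.gz");
          OutputStream out = codec.createOutputStream(fs.create(sink));
          try {
            // One serialized log record, in the key=value format shown on slides 14-15
            out.write("DB=MyTable\tmovieId=XXXX\tclickIndex=3\n".getBytes("UTF-8"));
          } finally {
            out.close();
            fs.close();
          }
        }
      }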
    • 17. Data Processing pipeline (diagram): Application → Collector → M/R → Hive, with the focus now on the data processing pipeline.
    • 18. Data Processing Pipeline
    • 19. Data Processing Pipeline - Details
      • Demux output (see the mapper sketch below):
        • 1 directory per Hive table
        • 1 file per partition × reducerCount
      (Diagram: S3 → Map/Reduce demux → Table 1, Table 2 … Table n → Hourly Merge → Hive Load → Hive)
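      As a sketch of the demux idea only (Honu's actual demuxer is part of the upcoming open-source release; the record layout below is an assumption based on the "DB=MyTable …" messages on slides 14 and 15), a Hadoop mapper can route every record into a directory named after its Hive table using MultipleOutputs. In the real pipeline the grouping would normally happen in the reducers, which is where "1 file per partition × reducerCount" comes from:

      import java.io.IOException;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.NullWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

      // Illustrative demux mapper: every record names its Hive table ("DB=MyTable ..."),
      // so each record can be written under a directory named after that table,
      // giving "1 directory per Hive table".
      public class DemuxMapper
          extends Mapper<LongWritable, Text, NullWritable, Text> {

        private MultipleOutputs<NullWritable, Text> out;

        @Override
        protected void setup(Context context) {
          out = new MultipleOutputs<NullWritable, Text>(context);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String record = value.toString();
          // Assumed record layout: "DB=<table>\t<key>=<value>\t..."
          if (!record.startsWith("DB=")) {
            return; // skip malformed records in this sketch
          }
          int end = record.indexOf('\t');
          String table = (end > 0) ? record.substring(3, end) : record.substring(3);
          // The baseOutputPath "<table>/part" creates one output directory per table
          out.write(NullWritable.get(), value, table + "/part");
        }

        @Override
        protected void cleanup(Context context)
            throws IOException, InterruptedException {
          out.close();
        }
      }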
    • 20. Hive Data warehouse
      • All data is on S3 once final (SLA: 2 hours)
        • All data (CheckPoint, DataSink & final Hive output) is saved on S3; everything else is transient
        • No need to maintain “live” instances to store n years of data
        • Start/stop EMR query cluster(s) on demand
          • “Get the best cluster size for you”
          • Hive reload partitions (EMR specific)
    • 21. Roadmap for Honu
      • Open source: GitHub (end of July)
        • Client SDK
        • Collector
        • Demuxer
      • Multiple writers
      • Persistent queue (client & server)
      • Real-time integration with external monitoring system
      • HBase/Cassandra investigation
      • Map/Reduce based data aggregator
    • 22. Questions?
      • Jerome Boulon
      • [email_address]
      http://wiki.github.com/jboulon/Honu/
