Hadoop Summit 2010 - Developers Track

Honu - A Large Scale Streaming Data Collection and Processing Pipeline
Jerome Boulon, Netflix


Presentation Transcript

  • Honu: A large scale data collection and processing pipeline
    • Jerome Boulon, Netflix
  • Session Agenda
    • Honu
    • Goals
    • Architecture – Overview
    • Data Collection pipeline
    • Data Processing pipeline
    • Hive data warehouse
    • Honu Roadmap
    • Questions
  • Honu
    • Honu is a streaming data & log collection and processing pipeline built using:
      • Hadoop
      • Hive
      • Thrift
    • Honu has been running in production for 6 months and processes over a billion log events per day on EC2/EMR.
  • What are we trying to achieve?
    • Scalable log analysis to gain business insights:
      • Error logs (unstructured logs)
      • Statistical logs (structured logs - App specific)
      • Performance logs (structured logs – Standard + App specific)
    • Output required:
      • Engineers access:
        • Ad-hoc query and reporting
      • BI access:
        • Flat files to be loaded into the BI system for cross-functional reporting.
        • Ad-hoc query for data examinations, etc.
  • Architecture – Overview: Application → Collector (data collection pipeline) → M/R (data processing pipeline) → Hive
  • Current Netflix deployment (Amazon EC2/EMR): Applications → Honu Collectors → S3; Hive MetaStore; Hive & Hadoop EMR clusters, plus Hive & Hadoop EMR clusters for ad-hoc query
  • Honu – Client SDK
    • Communication layer:
      • Discovery Service
      • Transparent fail-over & load-balancing
      • Thrift as a transport protocol & RPC (NIO)
    • Log4j Appenders:
      • Convert individual messages to batches
      • In-memory buffering system
    • Structured Log API
    • Hadoop Metric Plugin
    • Tomcat Access Log
  • Honu - Unstructured/Structured Logs: Log4j Appender
    • Configuration using a standard Log4j properties file (a hedged configuration sketch follows this list)
    • Control:
      • In-memory buffer size
      • Batch size
      • Number of senders + VIP address
      • Timeout
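
    A minimal sketch of what that properties-driven setup could look like, written here as programmatic Log4j 1.x configuration so it is self-contained. The appender class name (org.honu.log4j.HonuAppender) and every Honu-specific property key below are illustrative assumptions, not Honu's documented names:

    import java.util.Properties;
    import org.apache.log4j.Logger;
    import org.apache.log4j.PropertyConfigurator;

    public class HonuAppenderConfigSketch {
      public static void main(String[] args) {
        // Standard Log4j 1.x configuration via a Properties object; the same keys
        // could live in a log4j.properties file. All Honu-specific names are assumed.
        Properties p = new Properties();
        p.setProperty("log4j.rootLogger", "INFO, honu");
        p.setProperty("log4j.appender.honu", "org.honu.log4j.HonuAppender"); // assumed appender class
        p.setProperty("log4j.appender.honu.maxMemoryMB", "64");              // in-memory buffer size
        p.setProperty("log4j.appender.honu.batchSize", "500");               // messages per batch
        p.setProperty("log4j.appender.honu.senders", "2");                   // number of senders
        p.setProperty("log4j.appender.honu.vipAddress", "honu-collectors.example.internal:7101"); // VIP (assumed)
        p.setProperty("log4j.appender.honu.timeoutMs", "5000");              // send timeout
        PropertyConfigurator.configure(p);

        Logger.getLogger(HonuAppenderConfigSketch.class).info("application log line");
      }
    }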
    Honu - Structured Log API
    • Produces the same result as the Annotation API
    • Avoids unnecessary object creation
    • Fully dynamic
    • Thread safe
    • Converts a Java class to a Hive table dynamically
    • Add/remove “columns”
    • Supported Java types:
        • All primitives
        • Map
        • Object (via its toString method)
    Using Annotations / Using the Key/Value API: App → Hive
  • Structured Log API – Using Annotations

    @Resource(table="MyTable")
    public class MyClass implements Annotatable {

      @Column("movieId")
      public String getMovieId() { […] }

      @Column("clickIndex")
      public int getClickIndex() { […] }

      @Column("requestInfo")
      public Map getRequestInfo() { […] }
    }

    log.info(myAnnotatableObj);

    Resulting message:
    DB=MyTable movieId=XXXX clickIndex=3 requestInfo.returnCode=200 requestInfo.duration_ms=300 requestInfo.yyy=zzz
  • Structured Log API - Key/Value API

    KeyValueSerialization kv;
    kv = new KeyValueSerialization();
    […]
    kv.startMessage("MyTable");
    kv.addKeyValue("movieId", "XXX");
    kv.addKeyValue("clickIndex", 3);
    kv.addKeyValue("requestInfo", requestInfoMap);
    log.info(kv.generateMessage());

    Resulting message:
    DB=MyTable movieId=XXX clickIndex=3 requestInfo.returnCode=200 requestInfo.duration_ms=300 requestInfo.yyy=zzz
  • Honu Collector
    • Honu collector:
      • Saves logs to the FS using local storage & the Hadoop FS API
      • The FS can be the local FS, HDFS, S3n, NFS…
        • FS fail-over coming (similar to Scribe)
      • Thrift NIO
      • Multiple writers (data grouping)
    • Output: DataSink files (binary compatible with Chukwa); an illustrative sketch of this write path follows this list
      • Compression (LZO/GZIP via Hadoop codecs)
      • S3 optimizations
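
    To make that write path concrete, here is an illustrative sketch of appending records to a compressed Hadoop SequenceFile through the FileSystem API. It is not Honu's actual DataSink writer: the Chukwa-compatible record types are not reproduced, and the Text key/value pair, path, and codec choice are stand-ins.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;

    public class SinkWriterSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The destination can be any Hadoop FileSystem URI: file://, hdfs://, s3n://, ...
        Path sink = new Path("file:///tmp/honu-sketch/demo.done");
        FileSystem fs = FileSystem.get(sink.toUri(), conf);

        // One compressed SequenceFile per sink; GZIP through the standard Hadoop codec.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, sink, Text.class, Text.class,
            SequenceFile.CompressionType.BLOCK, new GzipCodec());
        try {
          // In Honu each record would be a batch of log events (Chukwa-compatible chunk).
          writer.append(new Text("MyTable"), new Text("movieId=XXXX\tclickIndex=3"));
        } finally {
          writer.close();
        }
      }
    }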
  • Data Processing Pipeline: Application → Collector (data collection pipeline) → M/R (data processing pipeline) → Hive
  • Data Processing Pipeline - Details
    • Demux output (an illustrative reducer sketch follows this list):
      • 1 directory per Hive table
      • 1 file per partition * reducerCount
    Flow: S3 → Map/Reduce (demux) → per-table directories (Table 1 … Table n) → hourly merge → Hive load → Hive
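
    One way to picture that layout is a reducer that routes every record to a per-table, per-partition path. The sketch below uses Hadoop's MultipleOutputs and is purely illustrative; it is not Honu's demuxer, and the "tableName/partitionSpec" key format is an assumption.

    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Assumes the map output key is "tableName/partitionSpec" and values are ready-to-load rows.
    public class DemuxLayoutSketch extends Reducer<Text, Text, NullWritable, Text> {
      private MultipleOutputs<NullWritable, Text> mos;

      @Override
      protected void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
      }

      @Override
      protected void reduce(Text tableAndPartition, Iterable<Text> rows, Context context)
          throws IOException, InterruptedException {
        // e.g. "MyTable/dateint=20100629" -> files under MyTable/dateint=20100629/part-r-NNNNN,
        // i.e. one directory per Hive table and one file per partition per reducer.
        String baseOutputPath = tableAndPartition.toString() + "/part";
        for (Text row : rows) {
          mos.write(NullWritable.get(), row, baseOutputPath);
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
      }
    }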
  • Hive Data warehouse
    • All data on S3 when final (SLA: 2 hours)
      • All data (checkpoints, DataSinks & final Hive output) is saved on S3; everything else is transient
      • No need to maintain “live” instances to store n years of data
      • Start/stop EMR query cluster(s) on demand
        • “Get the best cluster size for you”
        • Hive Reload partitions (EMR specific)
  • Roadmap for Honu
    • Open source: GitHub (end of July)
      • Client SDK
      • Collector
      • Demuxer
    • Multiple writers
    • Persistent queue (client & server)
    • Real-time integration with external monitoring systems
    • HBase/Cassandra investigation
    • Map/Reduce based data aggregator
  • Questions?
    • Jerome Boulon
    • [email_address]
    http://wiki.github.com/jboulon/Honu/