Hadoop Summit 2010, HONU
    Presentation Transcript

    • Honu: A large-scale data collection and processing pipeline
      Jerome Boulon, Netflix
    • Session Agenda
      §  Honu
      §  Goals
      §  Architecture – Overview
      §  Data Collection pipeline
      §  Data Processing pipeline
      §  Hive data warehouse
      §  Honu Roadmap
      §  Questions
    • Honu
      §  Honu is a streaming data & log collection and processing pipeline built using:
         ›  Hadoop
         ›  Hive
         ›  Thrift
      §  Honu has been running in production for 6 months and processes over a billion log events/day on EC2/EMR.
    • What are we trying to achieve?
      §  Scalable log analysis to gain business insights:
         ›  Error logs (unstructured logs)
         ›  Statistical logs (structured logs – app specific)
         ›  Performance logs (structured logs – standard + app specific)
      §  Output required:
         ›  Engineering access:
            •  Ad-hoc query and reporting
         ›  BI access:
            •  Flat files to be loaded into the BI system for cross-functional reporting
            •  Ad-hoc query for data examination, etc.
    • Architecture – Overview
      (diagram, repeated across several slides to highlight each stage: Application → Collector → M/R → Hive; the data collection pipeline covers Application → Collector, the data processing pipeline covers the M/R and Hive stages)
    • Current Netflix deployment on EC2/EMR
      (diagram: Applications and Honu Collectors run on Amazon EC2; data lands in S3; Hive & Hadoop run on EMR clusters, including clusters for ad-hoc query, sharing a Hive MetaStore)
    • Honu – Client SDK
      Structured Log API:
         -  Log4j Appenders
         -  Hadoop Metric Plugin
         -  Tomcat Access Log
         -  Converts individual messages to batches
         -  In-memory buffering system
      Communication layer:
         -  Discovery Service
         -  Transparent fail-over & load-balancing
         -  Thrift as the transport protocol & RPC (NIO)
    • Honu – Unstructured/Structured logs: Log4j Appender
      §  Configuration using a standard Log4j properties file (see the configuration sketch below)
      §  Control:
         ›  In-memory size
         ›  Batch size
         ›  Number of senders + VIP address
         ›  Timeout
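
    A minimal log4j.properties sketch of what such a configuration could look like. The appender class and property names (org.honu.log4j.HonuAppender, maxMemoryMb, batchSize, numSenders, vipAddress, timeoutMs) are assumptions for illustration only; the real names come with the Honu client SDK.

      # Hypothetical wiring of a Honu Log4j appender; all names below are illustrative.
      log4j.rootLogger=INFO, HONU
      log4j.appender.HONU=org.honu.log4j.HonuAppender

      # The knobs called out on the slide: in-memory size, batch size,
      # number of senders + VIP address, and timeout.
      log4j.appender.HONU.maxMemoryMb=64
      log4j.appender.HONU.batchSize=200
      log4j.appender.HONU.numSenders=2
      log4j.appender.HONU.vipAddress=honu-collectors.example.com:7101
      log4j.appender.HONU.timeoutMs=5000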
    • Honu – Structured Log API
      Using Annotations:
      §  Converts a Java class to a Hive table dynamically
      §  Add/remove columns
      §  Supported Java types:
         •  All primitives
         •  Map
         •  Objects, via their toString method
      Using the Key/Value API:
      §  Produces the same result as the Annotation API
      §  Avoids unnecessary object creation
      §  Fully dynamic
      §  Thread safe
    • Structured Log API – Using Annotations

      @Resource(table = "MyTable")
      public class MyClass implements Annotatable {

        @Column("movieId")
        public String getMovieId() { […] }

        @Column("clickIndex")
        public int getClickIndex() { […] }

        @Column("requestInfo")
        public Map getRequestInfo() { […] }
      }

      log.info(myAnnotatableObj);

      Resulting message:
        DB=MyTable
        movieId=XXXX
        clickIndex=3
        requestInfo.returnCode=200
        requestInfo.duration_ms=300
        requestInfo.yyy=zzz
    • Structured Log API – Key/Value API

      KeyValueSerialization kv;
      kv = new KeyValueSerialization();
      […]
      kv.startMessage("MyTable");
      kv.addKeyValue("movieid", "XXX");
      kv.addKeyValue("clickIndex", 3);
      kv.addKeyValue("requestInfo", requestInfoMap);

      log.info(kv.generateMessage());

      Resulting message:
        DB=MyTable
        movieid=XXX
        clickIndex=3
        requestInfo.returnCode=200
        requestInfo.duration_ms=300
        requestInfo.yyy=zzz
    • Honu Collector
      §  Honu collector:
         ›  Saves logs to the FS using local storage & the Hadoop FS API
         ›  The FS can be localFS, HDFS, S3n, NFS…
            •  FS fail-over coming (similar to Scribe)
         ›  Thrift NIO
         ›  Multiple writers (data grouping)
      §  Output: DataSink (binary compatible with Chukwa)
         ›  Compression (LZO/GZIP via Hadoop codecs)
         ›  S3 optimizations
      A write-path sketch follows.
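
    Because the collector writes through the Hadoop FileSystem API, the same code path can target local disk, HDFS, S3n, or NFS just by changing the output URI. A minimal Java sketch of that layering, with a plain compressed SequenceFile standing in for Honu's actual Chukwa-compatible DataSink format (the class name SinkWriterSketch and the sample record are illustrative):

      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.io.compress.GzipCodec;

      // Hypothetical collector-side write path: the FileSystem is resolved from the
      // output URI, so file:///, hdfs:// and s3n:// all work with the same code.
      public class SinkWriterSketch {
        public static void main(String[] args) throws IOException {
          Configuration conf = new Configuration();
          Path sink = new Path(args[0]);   // e.g. hdfs://..., s3n://..., file:///...
          FileSystem fs = FileSystem.get(sink.toUri(), conf);

          // Block-compressed SequenceFile using a Hadoop codec (GZIP here, LZO is analogous).
          SequenceFile.Writer writer = SequenceFile.createWriter(
              fs, conf, sink, LongWritable.class, Text.class,
              SequenceFile.CompressionType.BLOCK, new GzipCodec());

          // Append one structured log record keyed by its timestamp.
          writer.append(new LongWritable(System.currentTimeMillis()),
                        new Text("DB=MyTable movieid=XXX clickIndex=3"));
          writer.close();
        }
      }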
    • Data Processing pipeline
      (same architecture overview diagram, now highlighting the data processing pipeline: Collector output → M/R → Hive)
    • Data Processing Pipeline
      §  Proprietary Data Warehouse Workflows
         ›  Ability to test a new build with production data
         ›  Ability to replay some data processing
      §  CheckPoint System
         ›  Keeps track of all current states for recovery
      §  Demuxer: Map-Reduce job that parses and dispatches all logs to the right Hive table
         ›  Multiple parsers
         ›  Dynamic Output Format for Hive (table & column management)
            •  Default schema (Map, hostname & timestamp)
            •  Table-specific schema
            •  All tables partitioned by date, hour & batchID
    • Data Processing Pipeline – Details
      (diagram: the Map/Reduce demux fans out to Table 1 … Table n, which are loaded into Hive and hourly-merged to S3)
      §  Demux output:
         ›  1 directory per Hive table
         ›  1 file per partition * reducerCount
      A hedged sketch of such a demux step follows.
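
    A minimal Java sketch of what a demux step with one output directory per table could look like. The table is recovered from the "DB=..." field shown on the Structured Log API slides; the class names and the MultipleOutputs-based layout are assumptions for illustration, not Honu's actual demuxer (which also partitions by date, hour and batchID).

      import java.io.IOException;
      import org.apache.hadoop.io.NullWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

      public class DemuxSketch {

        // Map each raw log line to its target Hive table name.
        public static class DemuxMapper extends Mapper<Object, Text, Text, Text> {
          @Override
          protected void map(Object key, Text line, Context ctx)
              throws IOException, InterruptedException {
            // Messages look like: DB=MyTable movieid=XXX clickIndex=3 ...
            String table = "unknown";
            for (String field : line.toString().split("\\s+")) {
              if (field.startsWith("DB=")) {
                table = field.substring(3);
                break;
              }
            }
            ctx.write(new Text(table), line);
          }
        }

        // Write one output directory per table via MultipleOutputs.
        public static class DemuxReducer extends Reducer<Text, Text, NullWritable, Text> {
          private MultipleOutputs<NullWritable, Text> out;

          @Override
          protected void setup(Context ctx) {
            out = new MultipleOutputs<NullWritable, Text>(ctx);
          }

          @Override
          protected void reduce(Text table, Iterable<Text> records, Context ctx)
              throws IOException, InterruptedException {
            for (Text record : records) {
              // baseOutputPath = "<table>/part" gives one directory per Hive table.
              out.write(NullWritable.get(), record, table.toString() + "/part");
            }
          }

          @Override
          protected void cleanup(Context ctx) throws IOException, InterruptedException {
            out.close();
          }
        }
      }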
    • Hive Data warehouse
      §  All data is on S3 when final (SLA: 2 hours)
         ›  All data (CheckPoint, DataSink & final Hive output) is saved on S3; everything else is transient
         ›  No need to maintain live instances to store n years of data
         ›  Start/stop EMR query cluster(s) on demand
            •  Get the best cluster size for your workload
            •  Hive reload partitions (EMR specific)
    • Roadmap for Honu
      §  Open source: GitHub (end of July)
         ›  Client SDK
         ›  Collector
         ›  Demuxer
      §  Multiple writers
      §  Persistent queue (client & server)
      §  Real-time integration with external monitoring systems
      §  HBase/Cassandra investigation
      §  Map/Reduce based data aggregator
    • Questions?
      Jerome Boulon, jboulon@gmail.com