Hadoop Summit 2010: Honu
1. Honu: A large-scale data collection and processing pipeline
   Jerome Boulon, Netflix
2. Session Agenda
   - Honu
   - Goals
   - Architecture - Overview
   - Data Collection pipeline
   - Data Processing pipeline
   - Hive data warehouse
   - Honu Roadmap
   - Questions
3. Honu
   - Honu is a streaming data & log collection and processing pipeline built using:
     - Hadoop
     - Hive
     - Thrift
   - Honu has been running in production for six months and processes over a billion log events per day on EC2/EMR.
4. What are we trying to achieve?
   - Scalable log analysis to gain business insights:
     - Error logs (unstructured logs)
     - Statistical logs (structured logs, app specific)
     - Performance logs (structured logs, standard + app specific)
   - Output required:
     - Engineers' access:
       - Ad-hoc query and reporting
     - BI access:
       - Flat files to be loaded into BI systems for cross-functional reporting
       - Ad-hoc query for data examination, etc.
5-9. Architecture - Overview
   [Diagram, repeated across five build slides: Application → Collector (data collection pipeline) → M/R → Hive (data processing pipeline).]
10. Current Netflix deployment EC2/EMR
    [Diagram: Applications on Amazon EC2 send logs to Honu Collectors; data lands on S3; a shared Hive MetaStore serves Hive & Hadoop EMR clusters, plus a Hive & Hadoop EMR cluster for ad-hoc query.]
11. Honu – Client SDK
    - Structured Log API:
      - Log4j Appenders
      - Hadoop Metric Plugin
      - Tomcat Access Log
      - Converts individual messages to batches (sketch below)
      - In-memory buffering system
    - Communication layer:
      - Discovery Service
      - Transparent fail-over & load-balancing
      - Thrift as a transport protocol & RPC (NIO)
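The deck lists what the client SDK does but not how; as a rough illustration of the batching and in-memory buffering described on slide 11, here is a minimal Java sketch. Everything in it (class name, drop policy, sizes) is hypothetical, not the actual Honu SDK, and discovery, fail-over, and the Thrift transport are deliberately left out.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical sketch: log events are buffered in memory and handed to
    // sender threads in batches, so messages travel over the wire as groups
    // rather than one RPC per event.
    public class BatchingBuffer {
        private final BlockingQueue<String> buffer;
        private final int batchSize;

        public BatchingBuffer(int maxBufferedMessages, int batchSize) {
            this.buffer = new LinkedBlockingQueue<String>(maxBufferedMessages);
            this.batchSize = batchSize;
        }

        // Called by the appender for each event; drops when the buffer is
        // full (one possible policy; the real SDK's policy isn't shown here).
        public boolean append(String message) {
            return buffer.offer(message);
        }

        // Called by a sender thread: block for one message, then drain up to
        // batchSize - 1 more so they can be sent as a single batch.
        public List<String> nextBatch() throws InterruptedException {
            List<String> batch = new ArrayList<String>(batchSize);
            batch.add(buffer.take());
            buffer.drainTo(batch, batchSize - 1);
            return batch;
        }
    }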
12. Honu - Unstructured/Structured logs: Log4J Appender
    - Configuration using a standard Log4j properties file (hypothetical example below)
    - Control:
      - In-memory size
      - Batch size
      - Number of senders + VIP address
      - Timeout
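The slide names the knobs but not the file itself; a hypothetical log4j.properties sketch of what such a configuration could look like. The appender class and every property key below are invented for illustration, not Honu's actual names.

    # Hypothetical configuration; class and property names are made up.
    log4j.rootLogger=INFO, honu
    log4j.appender.honu=com.example.honu.HonuAppender
    # in-memory size
    log4j.appender.honu.maxMemoryKB=2048
    # batch size
    log4j.appender.honu.batchSize=100
    # number of senders + VIP address
    log4j.appender.honu.senders=2
    log4j.appender.honu.vipAddress=honu-collector.example.com:7101
    # timeout
    log4j.appender.honu.timeoutMs=5000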
13. Honu - Structured Log API
    Using Annotations:
    - Converts a Java class to a Hive table dynamically
    - Add/remove columns
    - Supported Java types:
      - All primitives
      - Map
      - Object, using the toString method
    Using the Key/Value API:
    - Produces the same result as annotations
    - Avoids unnecessary object creation
    - Fully dynamic
    - Thread safe
14. Structured Log API - Using Annotations

    @Resource(table="MyTable")
    public class MyClass implements Annotatable {

      @Column("movieId")
      public String getMovieId() { […] }

      @Column("clickIndex")
      public int getClickIndex() { […] }

      @Column("requestInfo")
      public Map getRequestInfo() { […] }
    }

    Logging the object:

    log.info(myAnnotatableObj);

    produces:

    DB=MyTable
    movieId=XXXX
    clickIndex=3
    requestInfo.returnCode=200
    requestInfo.duration_ms=300
    requestInfo.yyy=zzz
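The deck shows the annotations and the resulting output but not the step in between; a rough sketch of the reflection such a dynamic class-to-table mapping implies. The annotation declarations and the serializer below are hypothetical, not Honu's implementation.

    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.reflect.Method;
    import java.util.Map;

    @Retention(RetentionPolicy.RUNTIME) @interface Resource { String table(); }
    @Retention(RetentionPolicy.RUNTIME) @interface Column { String value(); }

    // Hypothetical serializer: walk the @Column getters of an annotated
    // object and render them as the key/value lines shown on slide 14.
    class AnnotationSerializer {
        static String serialize(Object obj) throws Exception {
            StringBuilder sb = new StringBuilder();
            sb.append("DB=").append(obj.getClass().getAnnotation(Resource.class).table());
            for (Method m : obj.getClass().getMethods()) {
                Column col = m.getAnnotation(Column.class);
                if (col == null) continue;
                Object value = m.invoke(obj);
                if (value instanceof Map) {
                    // Map columns flatten into column.key=value pairs
                    for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet())
                        sb.append('\n').append(col.value()).append('.')
                          .append(e.getKey()).append('=').append(e.getValue());
                } else {
                    sb.append('\n').append(col.value()).append('=').append(value);
                }
            }
            return sb.toString();
        }
    }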
15. Structured Log API - Key/Value API

    KeyValueSerialization kv;
    kv = new KeyValueSerialization();
    […]
    kv.startMessage("MyTable");
    kv.addKeyValue("movieId", "XXX");
    kv.addKeyValue("clickIndex", 3);
    kv.addKeyValue("requestInfo", requestInfoMap);
    log.info(kv.generateMessage());

    produces:

    DB=MyTable
    movieId=XXXX
    clickIndex=3
    requestInfo.returnCode=200
    requestInfo.duration_ms=300
    requestInfo.yyy=zzz
16. Honu Collector
    - Honu collector:
      - Saves logs to the FS using local storage & the Hadoop FS API (sketch below)
      - FS could be localFS, HDFS, S3n, NFS…
        - FS fail-over coming (similar to Scribe)
      - Thrift NIO
      - Multiple writers (data grouping)
    - Output: DataSink (binary compatible with Chukwa)
      - Compression (LZO/GZIP via Hadoop codecs)
      - S3 optimizations
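As a concrete illustration of this write path, a minimal sketch against the public Hadoop FS and SequenceFile APIs of the era: the filesystem is chosen by URI (file://, hdfs://, s3n://…) and the sink is block-compressed with a Hadoop codec. The real DataSink format is binary compatible with Chukwa, which this sketch does not reproduce; the key/value types and paths are simplified placeholders.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;

    // Simplified collector writer: the target FS is pluggable via a URI,
    // and each batch of events is appended to a compressed SequenceFile.
    public class SinkWriter {
        public static void write(String fsUri, String sinkPath,
                                 Iterable<String> batch) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(fsUri), conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(sinkPath), LongWritable.class, Text.class,
                SequenceFile.CompressionType.BLOCK, new GzipCodec());
            try {
                long seq = 0;
                for (String record : batch)
                    writer.append(new LongWritable(seq++), new Text(record));
            } finally {
                writer.close();
            }
        }
    }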
17. Data Processing pipeline
    [Same architecture diagram as slides 5-9, now highlighting the data processing pipeline: M/R → Hive.]
18. Data Processing Pipeline
    - Proprietary Data Warehouse Workflows
      - Ability to test new builds with production data
      - Ability to replay some data processing
    - CheckPoint System
      - Keeps track of all current states for recovery
    - Demuxer: a Map/Reduce job to parse and dispatch all logs to the right Hive table (sketched below)
      - Multiple parsers
      - Dynamic Output Format for Hive (tables & columns management)
        - Default schema (Map, hostname & timestamp)
        - Table-specific schema
        - All tables partitioned by date, hour & batchID
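A sketch of the demux step named above: parse each raw event, resolve the target Hive table, and key the record by table/date/hour/batchID so an output format can route each reduce group to the right partition directory. The real demuxer's parsers and dynamic output format are not reproduced here; the field names and the hard-coded partition string are placeholders.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical demux mapper over raw "DB=MyTable movieId=XXXX ..." events.
    public class DemuxMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text rawEvent, Context ctx)
                throws IOException, InterruptedException {
            String event = rawEvent.toString();
            String table = extractField(event, "DB");  // target Hive table
            String dateHourBatch = "20100629/14/0";    // placeholder partition
            ctx.write(new Text(table + "/" + dateHourBatch), rawEvent);
        }

        private static String extractField(String event, String key) {
            for (String token : event.split("\\s+")) {
                if (token.startsWith(key + "=")) return token.substring(key.length() + 1);
            }
            return "unknown";
        }
    }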
19. Data Processing Pipeline - Details
    [Diagram: the Map/Reduce demuxer fans out to Table 1, Table 2, … Table n; each table is loaded into Hive (a hypothetical DDL example follows), merged hourly, and stored on S3.]
    - Demux output:
      - 1 directory per Hive table
      - 1 file per partition * reducerCount
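The load step maps naturally onto Hive partition DDL; a hypothetical example, with the table, bucket, and partition column names made up. The "Hive reload partitions (EMR specific)" item on the next slide corresponds to Amazon EMR Hive's RECOVER PARTITIONS extension, shown second.

    -- Hypothetical: register one demuxed hour as a Hive partition.
    ALTER TABLE my_table ADD PARTITION (dateint=20100629, hr=14, batchid=0)
      LOCATION 's3n://my-bucket/demux/my_table/20100629/14/0';

    -- EMR-specific bulk variant (slide 20's "Hive reload partitions"):
    ALTER TABLE my_table RECOVER PARTITIONS;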
20. Hive Data warehouse
    - All data on S3 when final (SLA: 2 hours)
      - All data (CheckPoint, DataSink & Hive final output) is saved on S3; everything else is transient
      - No need to maintain live instances to store n years of data
      - Start/stop EMR query cluster(s) on demand
        - Get the best cluster size for you
        - Hive reload partitions (EMR specific)
21. Roadmap for Honu
    - Open source: GitHub (end of July)
      - Client SDK
      - Collector
      - Demuxer
    - Multiple writers
    - Persistent queue (client & server)
    - Real-time integration with external monitoring system
    - HBase/Cassandra investigation
    - Map/Reduce based data aggregator
22. Questions?
    Jerome Boulon
    jboulon@gmail.com