
Hadoop Summit 2010, Honu


  1. Honu: A large scale data collection and processing pipeline
     Jerome Boulon, Netflix
  2. Session Agenda
     - Honu
     - Goals
     - Architecture - Overview
     - Data collection pipeline
     - Data processing pipeline
     - Hive data warehouse
     - Honu roadmap
     - Questions
  3. Honu
     - Honu is a streaming data & log collection and processing pipeline built using:
       - Hadoop
       - Hive
       - Thrift
     - Honu has been running in production for 6 months and processes over a billion log events/day on EC2/EMR.
  4. What are we trying to achieve?
     - Scalable log analysis to gain business insights:
       - Error logs (unstructured logs)
       - Statistical logs (structured logs - app specific)
       - Performance logs (structured logs - standard + app specific)
     - Output required:
       - Engineering access:
         - Ad-hoc query and reporting
       - BI access:
         - Flat files to be loaded into the BI system for cross-functional reporting
         - Ad-hoc query for data examination, etc.
  5. Architecture - Overview
     [Diagram: Application -> Collector (data collection pipeline); M/R -> Hive (data processing pipeline)]
  10. Current Netflix deployment (EC2/EMR)
      [Diagram: Applications and Honu collectors run on Amazon EC2; data is stored on S3; Hive & Hadoop EMR clusters (including clusters for ad-hoc query) share a Hive MetaStore]
  11. Honu - Client SDK
      - Structured Log API:
        - Log4j appenders
        - Hadoop metric plugin
        - Tomcat access log
        - Converts individual messages to batches
        - In-memory buffering
      - Communication layer:
        - Discovery service
        - Transparent fail-over & load-balancing
        - Thrift as transport protocol & RPC (NIO)
  12. Honu - Unstructured/Structured logs: Log4j Appender
      - Configuration using standard Log4j properties file
      - Control:
        - In-memory size
        - Batch size
        - Number of senders + VIP address
        - Timeout
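      A minimal sketch of how such a batching appender could be wired up from a standard Log4j properties file. The appender class name and property keys shown in the comments are assumptions for illustration only, not the actual Honu configuration keys.

        // Sketch only: the appender class name and property keys below are
        // hypothetical placeholders, not the real Honu configuration.
        //
        //   log4j.rootLogger=INFO, honu
        //   log4j.appender.honu=com.example.honu.HonuAppender
        //   log4j.appender.honu.batchSize=500
        //   log4j.appender.honu.maxMemoryKB=2048
        //   log4j.appender.honu.senders=2
        //   log4j.appender.honu.vipAddress=collectors.example.com
        //   log4j.appender.honu.timeoutMs=5000
        import org.apache.log4j.Logger;
        import org.apache.log4j.PropertyConfigurator;

        public class AppenderConfigExample {
            private static final Logger LOG = Logger.getLogger(AppenderConfigExample.class);

            public static void main(String[] args) {
                // Log4j reads the properties file and instantiates/configures the appender.
                PropertyConfigurator.configure("log4j.properties");

                // Application code logs as usual; batching, in-memory buffering and
                // sending to the collector VIP happen inside the appender.
                LOG.info("user=123 action=play movieId=456");
            }
        }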
  13. Honu - Structured Log API
      - Using Annotations:
        - Convert a Java class to a Hive table dynamically
        - Add/remove columns
        - Supported Java types:
          - All primitives
          - Map
          - Object, using the toString method
      - Using the Key/Value API:
        - Produces the same result as Annotations
        - Avoids unnecessary object creation
        - Fully dynamic
        - Thread safe
  14. Structured Log API - Using Annotations

      @Resource(table="MyTable")
      public class MyClass implements Annotatable {

          @Column("movieId")
          public String getMovieId() { […] }

          @Column("clickIndex")
          public int getClickIndex() { […] }

          @Column("requestInfo")
          public Map getRequestInfo() { […] }
      }

      log.info(myAnnotatableObj); produces:

          DB=MyTable
          movieId=XXXX
          clickIndex=3
          requestInfo.returnCode=200
          requestInfo.duration_ms=300
          requestInfo.yyy=zzz
  15. Structured Log API - Key/Value API

      KeyValueSerialization kv;
      kv = new KeyValueSerialization();
      […]
      kv.startMessage("MyTable");
      kv.addKeyValue("movieId", "XXXX");
      kv.addKeyValue("clickIndex", 3);
      kv.addKeyValue("requestInfo", requestInfoMap);

      log.info(kv.generateMessage()); produces:

          DB=MyTable
          movieId=XXXX
          clickIndex=3
          requestInfo.returnCode=200
          requestInfo.duration_ms=300
          requestInfo.yyy=zzz
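      To make the wire format above concrete, here is a minimal, hypothetical serializer that produces the same kind of DB=... / key=value message shown on the slide. It is a sketch of the format only, not the actual Honu KeyValueSerialization class.

        import java.util.LinkedHashMap;
        import java.util.Map;

        // Hypothetical illustration of the key/value message format shown above;
        // not the real Honu KeyValueSerialization implementation.
        public class SimpleKeyValueMessage {
            private final StringBuilder sb = new StringBuilder();

            public void startMessage(String table) {
                sb.append("DB=").append(table);
            }

            public void addKeyValue(String key, Object value) {
                sb.append('\n').append(key).append('=').append(value);
            }

            // Flatten a map into prefixed keys, e.g. requestInfo.returnCode=200
            public void addKeyValue(String prefix, Map<String, ?> values) {
                for (Map.Entry<String, ?> e : values.entrySet()) {
                    addKeyValue(prefix + "." + e.getKey(), e.getValue());
                }
            }

            public String generateMessage() {
                return sb.toString();
            }

            public static void main(String[] args) {
                SimpleKeyValueMessage kv = new SimpleKeyValueMessage();
                kv.startMessage("MyTable");
                kv.addKeyValue("movieId", "XXXX");
                kv.addKeyValue("clickIndex", 3);
                Map<String, Object> requestInfo = new LinkedHashMap<>();
                requestInfo.put("returnCode", 200);
                requestInfo.put("duration_ms", 300);
                kv.addKeyValue("requestInfo", requestInfo);
                System.out.println(kv.generateMessage());
                // DB=MyTable
                // movieId=XXXX
                // clickIndex=3
                // requestInfo.returnCode=200
                // requestInfo.duration_ms=300
            }
        }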
  16. Honu Collector
      - Honu collector:
        - Saves logs to the FS using local storage & the Hadoop FS API
        - FS could be localFS, HDFS, S3n, NFS…
          - FS fail-over coming (similar to Scribe)
        - Thrift NIO
        - Multiple writers (data grouping)
      - Output: DataSink (binary compatible with Chukwa)
        - Compression (LZO/GZIP via Hadoop codecs)
        - S3 optimizations
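      A minimal sketch of the kind of FS-abstracted write path described above, assuming a SequenceFile of Text records purely for illustration; the real Honu DataSink is a Chukwa-compatible binary format and is not reproduced here.

        import java.net.URI;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.compress.GzipCodec;

        // Sketch: a collector-style writer appending log records to a compressed
        // file on any Hadoop-supported FileSystem (local, HDFS, S3n, ...).
        public class SinkWriterSketch {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // The target FS is selected by the URI scheme: file://, hdfs://, s3n://, ...
                FileSystem fs = FileSystem.get(URI.create("file:///tmp/honu-sink"), conf);
                Path sink = new Path("/tmp/honu-sink/datasink-0001.seq");

                GzipCodec codec = new GzipCodec();
                codec.setConf(conf);

                SequenceFile.Writer writer = SequenceFile.createWriter(
                        fs, conf, sink, LongWritable.class, Text.class,
                        SequenceFile.CompressionType.BLOCK, codec);
                try {
                    // One record per log event; a real collector would append whole
                    // batches received over Thrift and roll files periodically.
                    writer.append(new LongWritable(System.currentTimeMillis()),
                                  new Text("DB=MyTable\nmovieId=XXXX\nclickIndex=3"));
                } finally {
                    writer.close();
                }
            }
        }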
  17. Data Processing Pipeline
      [Diagram: Application -> Collector (data collection pipeline); M/R -> Hive (data processing pipeline)]
  18. Data Processing Pipeline
      - Proprietary Data Warehouse Workflows
        - Ability to test a new build with production data
        - Ability to replay some data processing
      - CheckPoint System
        - Keeps track of all current states for recovery
      - Demuxer: Map-Reduce to parse/dispatch all logs to the right Hive table
        - Multiple parsers
        - Dynamic Output Format for Hive (tables & columns management)
          - Default schema (Map, hostname & timestamp)
          - Table-specific schema
          - All tables partitioned by date, hour & batchID
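      A heavily simplified sketch of a demux-style mapper, assuming the key/value message format from the client SDK slides and Hadoop's MultipleOutputs to get one output directory per table; the real demuxer's parsers, checkpointing and dynamic Hive output format are omitted.

        import java.io.IOException;

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

        // Sketch: parse "DB=<table>" from each key/value message and route the
        // record to a per-table output directory ("1 directory per Hive table").
        public class DemuxMapperSketch
                extends Mapper<LongWritable, Text, NullWritable, Text> {

            private MultipleOutputs<NullWritable, Text> out;

            @Override
            protected void setup(Context context) {
                out = new MultipleOutputs<NullWritable, Text>(context);
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String table = null;
                for (String line : value.toString().split("\n")) {
                    if (line.startsWith("DB=")) {
                        table = line.substring("DB=".length());
                        break;
                    }
                }
                if (table == null) {
                    return; // unparseable record; a real demuxer would count/route errors
                }
                // baseOutputPath => one subdirectory per table, e.g. MyTable/part-m-00000
                out.write(NullWritable.get(), value, table + "/part");
            }

            @Override
            protected void cleanup(Context context) throws IOException, InterruptedException {
                out.close();
            }
        }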
  19. Data Processing Pipeline - Details
      [Diagram: Map/Reduce demux writes Table 1 ... Table n; hourly merge, then load into Hive / S3]
      - Demux output:
        - 1 directory per Hive table
        - 1 file per partition * reducerCount
  20. Hive Data Warehouse
      - All data on S3 when final (SLA: 2 hours)
        - All data (CheckPoint, DataSink & Hive final output) is saved on S3; everything else is transient
        - No need to maintain live instances to store n years of data
        - Start/stop EMR query cluster(s) on demand
          - Get the best cluster size for you
          - Hive reload partitions (EMR specific)
  21. Roadmap for Honu
      - Open source: GitHub (end of July)
        - Client SDK
        - Collector
        - Demuxer
      - Multiple writers
      - Persistent queue (client & server)
      - Real-time integration with external monitoring systems
      - HBase/Cassandra investigation
      - Map/Reduce based data aggregator
  22. Questions?
      Jerome Boulon
      jboulon@gmail.com
