1. Honu:
A large scale data collection
and processing pipeline
Jerome Boulon
Netflix
2. Session Agenda
§ Honu
§ Goals
§ Architecture – Overview
§ Data Collection pipeline
§ Data Processing pipeline
§ Hive data warehouse
§ Honu Roadmap
§ Questions
3. Honu
§ Honu is a streaming data & log collection and processing
pipeline built using:
› Hadoop
› Hive
› Thrift
§ Honu has been running in production for 6 months and
processes over a billion log events/day on EC2/EMR.
4. What are we trying to achieve?
§ Scalable log analysis to gain business insights:
› Error logs (unstructured logs)
› Statistical logs (structured logs - App specific)
› Performance logs (structured logs – Standard + App specific)
§ Output required:
› Engineers access:
• Ad-hoc query and reporting
› BI access:
• Flat files to be loaded into BI system for cross-functional reporting.
• Ad-hoc query for data examinations, etc.
5. Architecture - Overview
[Diagram: Application → Collector (data collection pipeline) → M/R → Hive (data processing pipeline)]
11. Honu – Client SDK
§ Structured Log API:
› Log4j Appenders
› Hadoop Metric Plugin
› Tomcat Access Log
› Converts individual messages to batches
› In-memory buffering system
§ Communication layer:
› Discovery Service
› Transparent fail-over & load-balancing
› Thrift as a transport protocol & RPC (NIO)
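The batching and in-memory buffering described above can be sketched in plain Java. This is a hypothetical illustration of the idea, not Honu's SDK: the `MessageBatcher` class and its method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: buffer individual log messages in memory and
// drain them into fixed-size batches for a single RPC to the collector.
public class MessageBatcher {
    private final BlockingQueue<String> buffer;
    private final int batchSize;

    public MessageBatcher(int bufferCapacity, int batchSize) {
        this.buffer = new ArrayBlockingQueue<>(bufferCapacity);
        this.batchSize = batchSize;
    }

    // Non-blocking append; returns false (message dropped) when the
    // buffer is full, so a slow collector cannot stall application threads.
    public boolean append(String message) {
        return buffer.offer(message);
    }

    // Drain up to batchSize messages into one batch for a sender thread.
    public List<String> nextBatch() {
        List<String> batch = new ArrayList<>(batchSize);
        buffer.drainTo(batch, batchSize);
        return batch;
    }

    public static void main(String[] args) {
        MessageBatcher b = new MessageBatcher(1000, 3);
        for (int i = 0; i < 5; i++) b.append("event-" + i);
        System.out.println(b.nextBatch().size()); // prints 3
        System.out.println(b.nextBatch().size()); // prints 2
    }
}
```

A real sender would serialize each batch over Thrift and fail over to another collector on error; the buffering/batching split above is the part the slide describes.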
12. Honu – Unstructured/Structured Logs: Log4J Appender
§ Configuration uses a standard Log4j properties file
§ Control over:
› In-memory buffer size
› Batch size
› Number of senders + VIP address
› Timeout
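A sketch of what such a Log4j properties configuration might look like. The appender class name and every property key below are illustrative assumptions, not Honu's documented settings:

```properties
# Hypothetical configuration - class and property names are assumptions
log4j.rootLogger=INFO, honu
log4j.appender.honu=org.honu.log4j.HonuAppender
# In-memory buffer size (messages held before dropping)
log4j.appender.honu.bufferSize=10000
# Messages sent per batch
log4j.appender.honu.batchSize=200
# Sender threads and the collector VIP to connect to
log4j.appender.honu.senders=2
log4j.appender.honu.vipAddress=collector.example.com:7101
# Send timeout in milliseconds
log4j.appender.honu.timeout=5000
```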
13. Honu – Structured Log API
§ Using Annotations:
› Converts a Java class to a Hive table dynamically
› Add/remove columns
› Supported Java types:
• All primitives
• Map
• Object, using the toString method
§ Using the Key/Value API:
› Produces the same result as annotations
› Avoids unnecessary object creation
› Fully dynamic
› Thread safe
14. Structured Log API – Using Annotation

@Resource(table="MyTable")
public class MyClass implements Annotatable {

  @Column("movieId")
  public String getMovieId() {
    […]
  }

  @Column("clickIndex")
  public int getClickIndex() {
    […]
  }

  @Column("requestInfo")
  public Map getRequestInfo() {
    […]
  }
}

Calling log.info(myAnnotatableObj) on an instance produces:

DB=MyTable
movieId=XXXX
clickIndex=3
requestInfo.returnCode=200
requestInfo.duration_ms=300
requestInfo.yyy=zzz
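As a rough illustration of what the Key/Value API achieves (the same flat key=value output as the annotation path, without building an intermediate annotated object), a hypothetical sketch; the class and method names are assumptions, not Honu's API:

```java
// Hypothetical sketch of a key/value-style structured log builder.
// Produces the same flat representation shown in the annotation example.
public class KeyValueLog {
    private final StringBuilder sb = new StringBuilder();

    public KeyValueLog(String table) {
        sb.append("DB=").append(table);
    }

    // Append one key=value pair; chaining avoids temporary objects.
    public KeyValueLog add(String key, Object value) {
        sb.append('\t').append(key).append('=').append(value);
        return this;
    }

    public String render() {
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = new KeyValueLog("MyTable")
                .add("movieId", "XXXX")
                .add("clickIndex", 3)
                .render();
        System.out.println(line);
    }
}
```

The point of the real API is the same as this sketch's: fully dynamic fields, no per-event class definition, and thread safety on the sending side.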
16. Honu Collector
§ Honu collector:
› Save logs to FS using local storage & Hadoop FS API
› FS could be localFS, HDFS, S3n, NFS…
• FS fail-over coming (similar to scribe)
› Thrift NIO
› Multiple writers (Data grouping)
§ Output: DataSink (Binary compatible with Chukwa)
› Compression (LZO/GZIP via Hadoop codecs)
› S3 optimizations
17. Data Processing pipeline
[Diagram: architecture overview again, highlighting the data processing pipeline (M/R → Hive)]
18. Data Processing Pipeline
§ Proprietary Data Warehouse Workflows
› Ability to test new build with production data
› Ability to replay some data processing
§ CheckPoint System
› Keeps track of all current states for recovery
§ Demuxer: Map-Reduce to parse/dispatch all logs to the
right Hive table
› Multiple parsers
› Dynamic Output Format for Hive (table & column management)
• Default schema (Map, hostname & timestamp)
• Table’s specific schema
• All tables partitioned by Date, hour & batchID
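The partitioning scheme above (date, hour & batchID, with the default schema of a field map, hostname and timestamp) could be expressed in Hive DDL roughly as follows; the table and column names are illustrative, not Honu's actual definitions:

```sql
-- Illustrative sketch: a demuxed log table with the default schema,
-- partitioned by date, hour and batchID as described above.
CREATE TABLE my_log_table (
  hostname STRING,
  ts BIGINT,
  fields MAP<STRING, STRING>
)
PARTITIONED BY (dt STRING, hr INT, batchid STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```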
19. Data Processing Pipeline - Details
[Diagram: Map/Reduce demux → Table 1 … Table n → Load into Hive → Hourly Merge → S3]
§ Demux output:
› 1 directory per Hive table
› 1 file per partition * reducerCount
20. Hive Data Warehouse
§ All data is on S3 once final (SLA: 2 hours)
› All data (CheckPoint, DataSink & final Hive output) is saved on S3; everything else is transient
› No need to maintain live instances to store n years of data
› Start/stop EMR query cluster(s) on demand
• Pick the cluster size that works best for you
• Hive reloads partitions (EMR specific)
21. Roadmap for Honu
§ Open source: GitHub (end of July)
› Client SDK
› Collector
› Demuxer
§ Multiple writers
§ Persistent queue (client & server)
§ Real Time integration with external monitoring system
§ HBase/Cassandra investigation
§ Map/Reduce based data aggregator