1. Honu:
A large scale data collection
and processing pipeline
Jerome Boulon
Netflix
2. Session Agenda
§ Honu
§ Goals
§ Architecture – Overview
§ Data Collection pipeline
§ Data Processing pipeline
§ Hive data warehouse
§ Honu Roadmap
§ Questions
3. Honu
§ Honu is a streaming data & log collection and processing
pipeline built using:
› Hadoop
› Hive
› Thrift
§ Honu has been running in production for 6 months and
processes over a billion log events/day on EC2/EMR.
4. What are we trying to achieve?
§ Scalable log analysis to gain business insights:
› Error logs (unstructured logs)
› Statistical logs (structured logs - App specific)
› Performance logs (structured logs – Standard + App specific)
§ Output required:
› Engineers access:
• Ad-hoc query and reporting
› BI access:
• Flat files to be loaded into BI system for cross-functional reporting.
• Ad-hoc query for data examinations, etc.
5. Architecture - Overview
[Diagram: Application → Collector (data collection pipeline) → M/R → Hive (data processing pipeline)]
11. Honu – Client SDK
§ Structured Log API:
› Log4j Appenders
› Hadoop Metric Plugin
› Tomcat Access Log
› Converts individual messages to batches
› In-memory buffering system
§ Communication layer:
› Discovery Service
› Transparent fail-over & load-balancing
› Thrift as a transport protocol & RPC (NIO)
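The batching and in-memory buffering described above can be sketched in plain Java. This is a hypothetical illustration of the idea, not Honu's SDK: the `MessageBatcher` class and its method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: buffer individual log messages in memory and
// drain them into fixed-size batches for a single RPC to the collector.
public class MessageBatcher {
    private final BlockingQueue<String> buffer;
    private final int batchSize;

    public MessageBatcher(int bufferCapacity, int batchSize) {
        this.buffer = new ArrayBlockingQueue<>(bufferCapacity);
        this.batchSize = batchSize;
    }

    // Non-blocking append; returns false (message dropped) when the
    // buffer is full, so a slow collector cannot stall application threads.
    public boolean append(String message) {
        return buffer.offer(message);
    }

    // Drain up to batchSize messages into one batch for a sender thread.
    public List<String> nextBatch() {
        List<String> batch = new ArrayList<>(batchSize);
        buffer.drainTo(batch, batchSize);
        return batch;
    }

    public static void main(String[] args) {
        MessageBatcher b = new MessageBatcher(1000, 3);
        for (int i = 0; i < 5; i++) b.append("event-" + i);
        System.out.println(b.nextBatch().size()); // prints 3
        System.out.println(b.nextBatch().size()); // prints 2
    }
}
```

A real sender would serialize each batch over Thrift and fail over to another collector on error; the buffering/batching split above is the part the slide describes.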
12. Honu – Unstructured/Structured Logs: Log4J Appender
§ Configuration uses a standard Log4j properties file
§ Control over:
› In-memory buffer size
› Batch size
› Number of senders + VIP address
› Timeout
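A sketch of what such a Log4j properties configuration might look like. The appender class name and every property key below are illustrative assumptions, not Honu's documented settings:

```properties
# Hypothetical configuration - class and property names are assumptions
log4j.rootLogger=INFO, honu
log4j.appender.honu=org.honu.log4j.HonuAppender
# In-memory buffer size (messages held before dropping)
log4j.appender.honu.bufferSize=10000
# Messages sent per batch
log4j.appender.honu.batchSize=200
# Sender threads and the collector VIP to connect to
log4j.appender.honu.senders=2
log4j.appender.honu.vipAddress=collector.example.com:7101
# Send timeout in milliseconds
log4j.appender.honu.timeout=5000
```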
13. Honu – Structured Log API
§ Using Annotations:
› Converts a Java class to a Hive table dynamically
› Add/remove columns
› Supported Java types:
• All primitives
• Map
• Object, using the toString method
§ Using the Key/Value API:
› Produces the same result as annotations
› Avoids unnecessary object creation
› Fully dynamic
› Thread safe
14. Structured Log API – Using Annotation

@Resource(table="MyTable")
public class MyClass implements Annotatable {

  @Column("movieId")
  public String getMovieId() {
    […]
  }

  @Column("clickIndex")
  public int getClickIndex() {
    […]
  }

  @Column("requestInfo")
  public Map getRequestInfo() {
    […]
  }
}

Calling log.info(myAnnotatableObj) on an instance produces:

DB=MyTable
movieId=XXXX
clickIndex=3
requestInfo.returnCode=200
requestInfo.duration_ms=300
requestInfo.yyy=zzz
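As a rough illustration of what the Key/Value API achieves (the same flat key=value output as the annotation path, without building an intermediate annotated object), a hypothetical sketch; the class and method names are assumptions, not Honu's API:

```java
// Hypothetical sketch of a key/value-style structured log builder.
// Produces the same flat representation shown in the annotation example.
public class KeyValueLog {
    private final StringBuilder sb = new StringBuilder();

    public KeyValueLog(String table) {
        sb.append("DB=").append(table);
    }

    // Append one key=value pair; chaining avoids temporary objects.
    public KeyValueLog add(String key, Object value) {
        sb.append('\t').append(key).append('=').append(value);
        return this;
    }

    public String render() {
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = new KeyValueLog("MyTable")
                .add("movieId", "XXXX")
                .add("clickIndex", 3)
                .render();
        System.out.println(line);
    }
}
```

The point of the real API is the same as this sketch's: fully dynamic fields, no per-event class definition, and thread safety on the sending side.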
16. Honu Collector
§ Honu collector:
› Save logs to FS using local storage & Hadoop FS API
› FS could be localFS, HDFS, S3n, NFS…
• FS fail-over coming (similar to scribe)
› Thrift NIO
› Multiple writers (Data grouping)
§ Output: DataSink (Binary compatible with Chukwa)
› Compression (LZO/GZIP via Hadoop codecs)
› S3 optimizations
17. Data Processing pipeline
[Diagram: architecture overview again, highlighting the data processing pipeline (M/R → Hive)]
18. Data Processing Pipeline
§ Proprietary Data Warehouse Workflows
› Ability to test new build with production data
› Ability to replay some data processing
§ CheckPoint System
› Keeps track of all current states for recovery
§ Demuxer: Map-Reduce to parse/dispatch all logs to the
right Hive table
› Multiple parsers
› Dynamic Output Format for Hive (table & column management)
• Default schema (Map, hostname & timestamp)
• Table’s specific schema
• All tables partitioned by Date, hour & batchID
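The partitioning scheme above (date, hour & batchID, with the default schema of a field map, hostname and timestamp) could be expressed in Hive DDL roughly as follows; the table and column names are illustrative, not Honu's actual definitions:

```sql
-- Illustrative sketch: a demuxed log table with the default schema,
-- partitioned by date, hour and batchID as described above.
CREATE TABLE my_log_table (
  hostname STRING,
  ts BIGINT,
  fields MAP<STRING, STRING>
)
PARTITIONED BY (dt STRING, hr INT, batchid STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```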
19. Data Processing Pipeline - Details
[Diagram: Map/Reduce demux → Table 1 … Table n → Load into Hive → Hourly Merge → S3]
§ Demux output:
› 1 directory per Hive table
› 1 file per partition * reducerCount
20. Hive Data Warehouse
§ All data is on S3 once final (SLA: 2 hours)
› All data (CheckPoint, DataSink & final Hive output) is saved on S3; everything else is transient
› No need to maintain live instances to store n years of data
› Start/stop EMR query cluster(s) on demand
• Pick the cluster size that works best for you
• Hive reloads partitions (EMR specific)
21. Roadmap for Honu
§ Open source: GitHub (end of July)
› Client SDK
› Collector
› Demuxer
§ Multiple writers
§ Persistent queue (client & server)
§ Real Time integration with external monitoring system
§ HBase/Cassandra investigation
§ Map/Reduce based data aggregator