Hive user group presentation from Netflix (3/18/2010)



  Use case study of Hive/Hadoop
Eva Tse, Jerome Boulon
  What are we trying to achieve?
Scalable log analysis to gain business insights:
Logs for website streaming (phase 1)
All logs from web (phase 2)
Output required:
Engineers access:
Ad-hoc query and reporting
BI access:
Flat files to be loaded into BI system for cross-functional reporting.
  Some Metrics
Parsing 0.6 TB of logs per day
Running 50+ persistent nodes
  Architecture Overview
Web App
Phase 1
Phase 2
Chukwa Collector
Log copy daemon
Hive & Hadoop (for query)
Hive MetaStore
S3 HDFS / S3
Hive & Hadoop running on the cloud
  Chukwa Streaming
MyApp
Collector
No data written to disk on the application side
  Data sent to a remote collector using Thrift
  Collector writes compressed data to localFS/S3n/HDFS
(stay tuned)
  Workflow to Hive (phase 1)
Streaming Session reconstruction
Each hour:
Run a Hadoop job to parse the last hour's logs and reconstruct sessions
Merge small files (one from each reducer)
Load into Hive
Sessions expire after 24 hours.
The warehouse holds sessions for each of the past 24 hours.
After 24 hours, the data needs to be merged again with: insert overwrite …. select <column list> from table
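The hourly session reconstruction above can be sketched in plain Python. This is a minimal illustration, not the Hadoop job from the talk: the event layout `(session_id, epoch_seconds, payload)`, the function names, and the choice to measure the 24-hour expiration from a session's first event are all assumptions.

```python
from collections import defaultdict

SESSION_TTL_SECONDS = 24 * 3600  # sessions expire after 24 hours (from the slide)


def reconstruct_sessions(events, open_sessions=None):
    """Group parsed log events by session id, roughly as a reducer would.

    events: iterable of (session_id, epoch_seconds, payload) tuples
            (hypothetical layout, not the real log schema).
    open_sessions: sessions carried over from earlier hourly runs.
    """
    sessions = defaultdict(list)
    if open_sessions:
        for sid, evs in open_sessions.items():
            sessions[sid].extend(evs)
    for sid, ts, payload in events:
        sessions[sid].append((ts, payload))
    # Keep each session's events in time order.
    for evs in sessions.values():
        evs.sort(key=lambda e: e[0])
    return dict(sessions)


def split_expired(sessions, now):
    """Separate sessions whose first event is at least 24 hours old (assumed rule)."""
    live, expired = {}, {}
    for sid, evs in sessions.items():
        target = expired if now - evs[0][0] >= SESSION_TTL_SECONDS else live
        target[sid] = evs
    return live, expired
```

Each hourly run would call `reconstruct_sessions` with the new hour's events plus the still-open sessions, then use `split_expired` to decide which sessions are final.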
  Workflow to Hive (phase 2)
Continuous log collection via Chukwa
Generic and continuous parse/merge/load to 'real-time' Hive warehouse
Merge at hourly boundaries and load into the public Hive warehouse. SLA is 2 hours on merged data.
Daily/Hourly job:
For summary.
For publishing data to BI for reporting.
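One way to picture the hourly-boundary merge and its 2-hour SLA is the sketch below. The partition key format, the function names, and the choice to measure the SLA from the moment the hour closes are assumptions made for illustration; the slides do not specify them.

```python
from datetime import datetime, timedelta, timezone

SLA = timedelta(hours=2)  # "SLA is 2 Hr on merged data"
KEY_FORMAT = "%Y-%m-%d-%H"  # hypothetical partition key layout


def hour_partition(ts):
    """Truncate a timestamp to its hourly boundary and format a partition key."""
    return ts.replace(minute=0, second=0, microsecond=0).strftime(KEY_FORMAT)


def mergeable_hours(timestamps, now):
    """Return partition keys seen in the data, excluding the hour still being written."""
    current = hour_partition(now)
    return sorted({hour_partition(t) for t in timestamps} - {current})


def within_sla(partition_key, merged_at):
    """Check that the merge finished no later than SLA after the hour closed."""
    hour_start = datetime.strptime(partition_key, KEY_FORMAT).replace(tzinfo=timezone.utc)
    hour_close = hour_start + timedelta(hours=1)
    return merged_at - hour_close <= SLA
```

All timestamps are assumed to be timezone-aware UTC values.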
  Today's Hive usage at Netflix
Streaming summary data:
CDN performance
# of streams/day
# of errors/session
Test cell analysis
Ad-hoc query for further analysis like:
Raw log inspection
Detailed inspection of one stream session
Simple summary (e.g., percentile, count, max, min, bucketing) for operational metrics
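The simple summaries listed above (percentile, count, max, min, bucketing) can be illustrated with a short Python sketch. It uses the nearest-rank percentile definition, which is one of several common definitions and not necessarily the one used at Netflix; the function names and the idea of bucketing by a fixed width are assumptions.

```python
import math
from collections import Counter


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * p / 100))
    return ordered[rank - 1]


def summarize(values, bucket_width):
    """Count, min, max, two percentiles, and fixed-width bucket counts."""
    buckets = Counter((v // bucket_width) * bucket_width for v in values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
        "buckets": dict(buckets),  # bucket lower bound -> number of values
    }
```

For example, `summarize(latencies_ms, 200)` on a list of per-stream latencies (a made-up metric) would report how many values fall in 0-199, 200-399, and so on.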
  Challenges
Hive UI (for query building)
Multi-DB support (HIVE-675) and user access management
Hive queries on a subset of partition files, to handle late-arriving files (HIVE-837 or HIVE-951)
Merging small files (can't use hive.merge.mapfiles)
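Since the slide notes that hive.merge.mapfiles could not be used, small-file merging has to happen outside Hive. The sketch below shows one possible approach: batch files in order until a target size is reached, then concatenate each batch. It is an illustration under assumptions, not the method from the talk; the function names and the target size are hypothetical, and plain concatenation is only valid for formats that tolerate it (for example, newline-delimited text).

```python
import shutil


def plan_merges(file_sizes, target_bytes):
    """Group (path, size) pairs, in order, into batches of at most about target_bytes.

    A single file larger than target_bytes ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for path, size in file_sizes:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def concatenate(paths, dest):
    """Append each input file to dest, byte for byte."""
    with open(dest, "wb") as out:
        for p in paths:
            with open(p, "rb") as src:
                shutil.copyfileobj(src, out)
```

Each batch returned by `plan_merges` would be passed to `concatenate` with a new destination path before loading the result into Hive.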