Use case study of Hive/Hadoop
Eva Tse, Jerome Boulon
What are we trying to achieve?
Scalable log analysis to gain business insights:
- Logs for website streaming (phase 1)
- All logs from the web (phase 2)
Output required:
- Engineer access: ad-hoc query and reporting
- BI access: flat files to be loaded into the BI system for cross-functional reporting
Some Metrics
- Parsing 0.6 TB of logs per day
- Running 50+ persistent nodes
Workflow to Hive (phase 1)
Streaming session reconstruction, each hour:
- Run a Hadoop job to parse the last hour's logs and reconstruct sessions
- Merge the small files (one from each reducer)
- Load into Hive
Sessions expire after 24 hours, so there are sessions for each of the past 24 hours. After 24 hours, they need to be merged again by:
insert overwrite … select <column list> from table
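A minimal sketch of that merge step, assuming a hypothetical sessions_hourly table partitioned by dt/hr and a sessions table partitioned by dt; all table, column, and partition names are illustrative, not from the slides:

    -- Collapse the 24 hourly partitions for one day into a single
    -- daily partition (hypothetical schema, names are assumptions).
    INSERT OVERWRITE TABLE sessions PARTITION (dt='2009-04-01')
    SELECT session_id, account_id, start_time, end_time, bytes_streamed
    FROM sessions_hourly
    WHERE dt = '2009-04-01';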
Workflow to Hive (phase 2)
- Continuous log collection via Chukwa
- Generic and continuous parse/merge/load into a 'real-time' Hive warehouse
- Merge at the hourly boundary and load into the public Hive warehouse; the SLA on merged data is 2 hours
- Daily/hourly jobs for summaries and for publishing data to BI for reporting
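As a hedged illustration of the hourly promote step, the statement below copies one completed hour from the 'real-time' warehouse into the public warehouse; realtime_log, warehouse_log, and the column names are assumptions for the sketch, not taken from the slides:

    -- Load one completed hour of merged data into the public warehouse
    -- (table and column names are hypothetical).
    INSERT OVERWRITE TABLE warehouse_log PARTITION (dt='2009-04-01', hr='13')
    SELECT client_ip, request_url, status, bytes_sent
    FROM realtime_log
    WHERE dt = '2009-04-01' AND hr = '13';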
Today’s Hive usage at Netflix
- Streaming summary data:
  - CDN performance
  - # of streams/day
  - # of errors/session
- Test cell analysis
- Ad-hoc query for further analysis, like:
  - Raw log inspection
  - Detailed inspection of one stream session
  - Simple summary (e.g., percentile, count, max, min, bucketing) for operational metrics
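An illustrative ad-hoc summary of the kind listed above (count, max, min per CDN for one day); the streaming_session table and its columns are assumed names, not from the slides:

    -- Simple operational summary per CDN for one day
    -- (hypothetical table and columns).
    SELECT cdn,
           COUNT(1)         AS num_sessions,
           SUM(error_count) AS total_errors,
           MAX(bitrate)     AS max_bitrate,
           MIN(bitrate)     AS min_bitrate
    FROM streaming_session
    WHERE dt = '2009-04-01'
    GROUP BY cdn;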
Challenges
- Hive UI (for query building)
- Multi-DB support (HIVE-675) and user access management
- Hive query on a subset of partition files, for handling late files (HIVE-837 or HIVE-951)
- Merging small files (can't use hive.merge.mapfiles); see the workaround sketch below
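Since hive.merge.mapfiles cannot be used here, one common workaround is to rewrite a partition with a small, fixed number of reducers so the output collapses into a few files; this sketch reuses the hypothetical sessions table from above and is not necessarily the approach used at Netflix:

    -- Force a reduce stage with few reducers so the rewritten
    -- partition lands in only a handful of files
    -- (table and column names are illustrative).
    set mapred.reduce.tasks=4;
    INSERT OVERWRITE TABLE sessions PARTITION (dt='2009-04-01')
    SELECT session_id, account_id, start_time, end_time, bytes_streamed
    FROM sessions
    WHERE dt = '2009-04-01'
    DISTRIBUTE BY session_id;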