
Hive user group presentation from Netflix (3/18/2010)



  1. Use case study of Hive/Hadoop
     Eva Tse, Jerome Boulon
  2. What are we trying to achieve?
     Scalable log analysis to gain business insights:
     - Logs for website streaming (phase 1)
     - All logs from the web (phase 2)
     Output required:
     - Engineers' access: ad-hoc query and reporting
     - BI access: flat files to be loaded into the BI system for cross-functional reporting
  3. Some Metrics
     - Parsing 0.6 TB of logs per day
     - Running 50+ persistent nodes
  4. Architecture Overview
     [Diagram: the web app's logs reach storage via a log-copy daemon (phase 1)
     or the Chukwa collector (phase 2) into S3/HDFS; Hive & Hadoop run in the
     cloud for query, backed by the Hive MetaStore.]
  5. Chukwa Streaming
     MyApp → Collector
     - No data is written to disk on the application side
     - Data is sent to a remote collector using Thrift
     - The collector writes to local FS / S3N / HDFS, compressed
     - (stay tuned)
  6. Workflow to Hive (phase 1)
     Streaming session reconstruction. Each hour:
     - Run a Hadoop job to parse the last hour's logs and reconstruct sessions
     - Merge the small files (one from each reducer)
     - Load into Hive
     Sessions expire after 24 hours, so there will be sessions for each of the
     past 24 hours. After 24 hours, they will need to be merged again by:
     insert overwrite …. select <column list> from table
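The merge step above can be sketched in HiveQL. This is an illustrative sketch, not the deck's actual query: the table and column names are hypothetical. Rewriting a partition with INSERT OVERWRITE … SELECT forces the data back through a MapReduce job, which collapses the many small per-reducer files into fewer, larger ones.

```sql
-- Hypothetical sketch of the 24-hour merge; table/column names are illustrative.
-- Overwriting the day's partition from the hourly table compacts its files.
INSERT OVERWRITE TABLE streaming_sessions PARTITION (dt = '2010-03-18')
SELECT session_id, account_id, start_time, end_time, cdn, error_count
FROM   streaming_sessions_hourly
WHERE  dt = '2010-03-18';
```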
  7. Workflow to Hive (phase 2)
     Continuous log collection via Chukwa
     Generic, continuous parse/merge/load into a 'real-time' Hive warehouse
     Merge at the hourly boundary and load into the public Hive warehouse;
     the SLA on merged data is 2 hours
     Daily/hourly jobs:
     - For summaries
     - For publishing data to BI for reporting
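One plausible shape for the hourly publish step is to expose each hour's merged files in the public warehouse by attaching a partition at their final location. This is a hedged sketch; the table name and S3 path are invented for illustration.

```sql
-- Hypothetical hourly publish: register the merged hour as a new partition.
-- Table name and location are illustrative, not from the presentation.
ALTER TABLE weblog_public
ADD PARTITION (dt = '2010-03-18', hr = '14')
LOCATION 's3n://example-bucket/weblog/dt=2010-03-18/hr=14';
```

Adding a partition is a metadata-only operation, so the public table sees the merged data as soon as the files land, without rewriting them.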
  8. Today’s Hive usage at Netflix
     Streaming summary data:
     - CDN performance
     - Number of streams per day
     - Number of errors per session
     - Test cell analysis
     Ad-hoc queries for further analysis, such as:
     - Raw log inspection
     - Detailed inspection of one stream session
     - Simple summaries (e.g., percentile, count, max, min, bucketing) for operational metrics
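A summary query of the kind listed above might look as follows; the table and column names are hypothetical, and the bucketing is done with a simple CASE expression.

```sql
-- Illustrative operational summary (hypothetical schema): per-day, per-CDN
-- session counts, error aggregates, and a bucket of error-free sessions.
SELECT dt,
       cdn,
       COUNT(1)                                         AS sessions,
       SUM(error_count)                                 AS total_errors,
       MAX(error_count)                                 AS max_errors,
       MIN(error_count)                                 AS min_errors,
       SUM(CASE WHEN error_count = 0 THEN 1 ELSE 0 END) AS clean_sessions
FROM   streaming_sessions
GROUP BY dt, cdn;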
  9. Challenges
     - Hive UI (for query building)
     - Multi-DB support (HIVE-675) and user access management
     - Hive queries over a subset of a partition’s files, to handle late-arriving files (HIVE-837 or HIVE-951)
     - Merging small files (can’t use hive.merge.mapfiles)