Hive user group presentation from Netflix (3/18/2010)

Transcript

  • 1. Use case study of Hive/Hadoop
       Eva Tse, Jerome Boulon
  • 2. What are we trying to achieve?
       Scalable log analysis to gain business insights:
       - Logs for website streaming (phase 1)
       - All logs from the web (phase 2)
       Output required:
       - Access for engineers: ad-hoc query and reporting
       - Access for BI: flat files loaded into the BI system for cross-functional reporting
  • 3. Some Metrics
       - Parsing 0.6 TB of logs per day
       - Running 50+ persistent nodes
  • 4. Architecture Overview
       [Architecture diagram: the web app feeds logs through a log copy daemon (phase 1) and a Chukwa collector (phase 2) into S3/HDFS; Hive and Hadoop run in the cloud for query, backed by the Hive MetaStore]
  • 5. Chukwa Streaming (MyApp to Collector)
       - No data written to disk on the application side
       - Data sent to a remote collector using Thrift
       - Collector writes to local FS / S3N / HDFS, compressed
       - http://wiki.github.com/jboulon/Honu/ (stay tuned)
  • 6. Workflow to Hive (phase 1)
       Streaming session reconstruction. Each hour:
       - Run a Hadoop job to parse the last hour's logs and reconstruct sessions
       - Merge the small files (one per reducer)
       - Load into Hive
       Sessions expire after 24 hours, so there are sessions for each of the past 24 hours.
       After 24 hours, merge again with: insert overwrite ... select <column list> from table (sketched below)
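A minimal sketch of that re-merge, assuming a hypothetical sessions table partitioned by day; the table and column names are illustrative, not from the deck:

      -- Hypothetical: rewrite a finished day's partition onto itself so the
      -- many small per-reducer files come back as fewer, larger files.
      -- Table and column names are assumptions for illustration.
      INSERT OVERWRITE TABLE sessions PARTITION (dateint = '20100317')
      SELECT session_id, account_id, cdn, start_ts, end_ts
      FROM sessions
      WHERE dateint = '20100317';

Reading a partition and overwriting it in the same statement is the usual Hive idiom for compacting small files when the built-in merge settings cannot be used.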
  • 7. Workflow to Hive (phase 2)
       - Continuous log collection via Chukwa
       - Generic, continuous parse/merge/load into a 'real-time' Hive warehouse
       - Merge at the hourly boundary and load into the public Hive warehouse; the SLA is 2 hours on merged data (a hedged sketch follows)
       - Daily/hourly jobs for summaries and for publishing data to BI for reporting
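A hedged sketch of the hourly publish step, again with invented names, assuming a 'real-time' table and a public warehouse table both partitioned by day and hour:

      -- Hypothetical: at the hourly boundary, merge one hour of the
      -- 'real-time' warehouse into the public warehouse (2 hr SLA).
      -- Table and column names are assumptions for illustration.
      INSERT OVERWRITE TABLE public_logs PARTITION (dateint = '20100318', hour = '14')
      SELECT event_ts, account_id, msg
      FROM realtime_logs
      WHERE dateint = '20100318' AND hour = '14';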
  • 8. Today's Hive usage at Netflix
       Streaming summary data:
       - CDN performance
       - Number of streams per day
       - Number of errors per session
       - Test cell analysis
       Ad-hoc queries for further analysis, such as:
       - Raw log inspection
       - Detailed inspection of a single stream session
       - Simple summaries (e.g., percentile, count, max, min, bucketing) for operational metrics (example below)
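For example, a simple operational summary of the kind listed above might look like the following, under an illustrative schema and assuming Hive's percentile() UDAF (which takes an integer column) is available:

      -- Hypothetical: per-CDN daily summary with count, min, max and a
      -- 95th-percentile startup time; the GROUP BY provides the bucketing.
      SELECT cdn,
             COUNT(1)                     AS streams,
             MIN(startup_ms)              AS min_startup_ms,
             MAX(startup_ms)              AS max_startup_ms,
             PERCENTILE(startup_ms, 0.95) AS p95_startup_ms
      FROM sessions
      WHERE dateint = '20100318'
      GROUP BY cdn;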
  • 9. Challenges
       - Hive UI (for query building)
       - Multi-DB support (HIVE-675) and user access management
       - Hive queries over a subset of partition files, to handle late-arriving files (HIVE-837 or HIVE-951)
       - Merging small files (can't use hive.merge.mapfiles; see the note below)
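For reference, Hive's built-in small-file merging is toggled with the setting below; per this slide it was not usable in this workflow, which is why the insert-overwrite re-merge sketched earlier was needed instead:

      -- Hive's own merging of small map outputs; the deck notes it
      -- could not be used here.
      SET hive.merge.mapfiles=true;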
