Hive user group presentation from Netflix (3/18/2010)

Presentation Transcript

  • Use case study of Hive/Hadoop
    Eva Tse, Jerome Boulon
  • What are we trying to achieve?
    Scalable log analysis to gain business insights:
    Logs for website streaming (phase 1)
    All logs from the website (phase 2)
    Output required:
    Engineers' access:
    Ad-hoc query and reporting
    BI access:
    Flat files loaded into the BI system for cross-functional reporting
  • Some Metrics
    Parsing 0.6 TB of logs per day
    Running 50+ persistent nodes
  • Architecture Overview
    [Architecture diagram] The web app ships logs either via a log copy daemon (phase 1) or through the Chukwa collector (phase 2) into S3 / HDFS. Hive & Hadoop run on the cloud for query, backed by the Hive MetaStore. (A sketch of an S3-backed Hive table follows below.)
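    A minimal sketch of the kind of table definition this architecture implies: a Hive external table whose data lives on S3, so the cloud cluster can come and go without losing data. The bucket, paths, and columns below are hypothetical stand-ins, not the actual Netflix schema.

    -- Hypothetical external log table backed by S3 (via the s3n filesystem),
    -- partitioned by date so each load maps to a partition directory.
    CREATE EXTERNAL TABLE raw_weblog (
      ts STRING,
      client_ip STRING,
      url STRING,
      status INT
    )
    PARTITIONED BY (dateint INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3n://example-bucket/logs/raw_weblog/';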
  • Chukwa Streaming
    [Diagram: MyApp sends log events to a remote Chukwa Collector]
    • No data is written to disk on the application side
    • Data is sent to a remote collector using Thrift
    • The collector writes compressed output to localFS/S3n/HDFS
    • http://wiki.github.com/jboulon/Honu/ (stay tuned)
  • Workflow to Hive (phase 1)
    Streaming session reconstruction
    Each hour:
    Run a Hadoop job to parse the last hour's logs and reconstruct sessions
    Merge the small files (one per reducer)
    Load into Hive
    Sessions expire after 24 hours.
    At any point there are sessions for each of the past 24 hours.
    After 24 hours, merge again by: insert overwrite …. select <column list> from table (see the sketch below)
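    A minimal sketch of what that 24-hour merge might look like in HiveQL. The table names (streaming_sessions, streaming_sessions_hourly), the dateint partition column, and the column list are hypothetical stand-ins for the elided query above:

    -- Hypothetical merge: rewrite a full day of hourly session data
    -- into a single partition, collapsing the small per-hour files.
    INSERT OVERWRITE TABLE streaming_sessions PARTITION (dateint = 20100317)
    SELECT session_id, account_id, start_time, end_time, cdn, error_count
    FROM streaming_sessions_hourly
    WHERE dateint = 20100317;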
  • Workflow to Hive (phase 2)
    Continuous log collection via Chukwa
    Generic, continuous parse/merge/load into a ‘real-time’ Hive warehouse
    Merge at hourly boundaries and load into the public Hive warehouse; the SLA is 2 hours on merged data (see the sketch below)
    Daily/hourly jobs:
    For summaries
    For publishing data to BI for reporting
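    A sketch of that hourly promotion step, assuming hypothetical table names (rt_weblog for the ‘real-time’ warehouse, weblog for the public one) and a dateint/hr partition scheme; none of these names come from the presentation:

    -- Hypothetical hourly merge: copy one fully merged hour from the
    -- 'real-time' table into the public warehouse table.
    INSERT OVERWRITE TABLE weblog PARTITION (dateint = 20100317, hr = 14)
    SELECT ts, client_ip, url, status
    FROM rt_weblog
    WHERE dateint = 20100317 AND hr = 14;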
  • Today’s Hive usage at Netflix
    Streaming summary data:
    CDN performance
    # of streams/day
    # of errors/session
    Test cell analysis
    Ad-hoc queries for further analysis, such as:
    Raw log inspection
    Detailed inspection of a single stream session
    Simple summaries (e.g., percentile, count, max, min, bucketing) for operational metrics (see the sketch below)
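    A minimal example of the kind of operational summary query described above, against a hypothetical streaming_sessions table; the percentile() UDAF is assumed to be available in the Hive build in use:

    -- Hypothetical daily operational summary per CDN: counts,
    -- extremes, and a median via the percentile() UDAF.
    SELECT cdn,
           COUNT(1)                      AS sessions,
           MIN(bitrate)                  AS min_bitrate,
           MAX(bitrate)                  AS max_bitrate,
           percentile(duration_sec, 0.5) AS median_duration
    FROM streaming_sessions
    WHERE dateint = 20100317
    GROUP BY cdn;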
  • Challenges
    Hive UI (for query building)
    Multi-DB support (HIVE-675) and user access management
    Hive queries over a subset of a partition's files, to handle late-arriving files (HIVE-837 or HIVE-951)
    Merging small files (can't use hive.merge.mapfiles; see the workaround sketch below)
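    One manual workaround for the small-files problem, sketched below under assumptions: since hive.merge.mapfiles is not usable here, force a reduce stage and cap the reducer count so a rewritten partition lands in a few large files. The table and partition names are the same hypothetical ones used above; SET mapred.reduce.tasks and DISTRIBUTE BY are standard Hive mechanisms:

    -- Rewrite a partition through 4 reducers so it ends up as ~4 files
    -- instead of many small map-only outputs.
    SET mapred.reduce.tasks=4;
    INSERT OVERWRITE TABLE weblog PARTITION (dateint = 20100317, hr = 14)
    SELECT ts, client_ip, url, status
    FROM weblog
    WHERE dateint = 20100317 AND hr = 14
    DISTRIBUTE BY ts;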