Hive user group presentation from Netflix (3/18/2010)
  • 1. Use case study of Hive/Hadoop
    Eva Tse, Jerome Boulon
  • 2. What are we trying to achieve?
    Scalable log analysis to gain business insights:
    Logs for website streaming (phase 1)
    All logs from web (phase 2)
    Output required:
    Engineers access:
    Ad-hoc query and reporting
    BI access:
    Flat files to be loaded into BI system for cross-functional reporting.
  • 3. Some Metrics
    Parsing 0.6 TB logs per day
    Running 50+ persistent nodes
  • 4. Architecture Overview
    Web App
    Phase 1
    Phase 2
    Phase 2
    Chukwa Collector
    Log copy daemon
    Hive & Hadoop (for query)
    Hive MetaStore
    S3 HDFS / S3
    Hive & Hadoop running on
    the cloud
  • 5. Chukwa Streaming
    No data written to disk on the application side
  • 6. Data sent to a remote collector using Thrift
  • 7. Collector writes to local FS/S3n/HDFS compressed
  • 8. http://wiki.github.com/jboulon/Honu/ (stay tuned)
  • Workflow to Hive (phase 1)
    Streaming Session reconstruction
    Each hour:
    Run hadoop job to parse last hour log and reconstruct sessions
    merge small files (from each reducer)
    load to hive
    Session expiration after 24 hours.
    Will have sessions for each of past 24 hours.
    After 24 hours, merge again by: insert overwrite … select <column list> from table
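    The 24-hour merge step above could be sketched in HiveQL roughly as follows; the table name streaming_sessions and its columns are hypothetical, not from the deck:

    ```sql
    -- Hypothetical sketch: compact the day's many small session files
    -- by rewriting the table's data in place. Hive stages the read
    -- before the overwrite, so selecting from the target table works.
    INSERT OVERWRITE TABLE streaming_sessions
    SELECT session_id, start_ts, end_ts, bytes_streamed
    FROM streaming_sessions;
    ```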
  • 9. Workflow to Hive (phase 2)
    Continuous log collection via Chukwa
    Generic and continuous parse/merge/load to ‘real-time’ Hive warehouse
    merge at hourly boundary and load to public Hive warehouse. SLA is 2 hours on merged data.
    Daily/Hourly job:
    For summary.
    For publishing data to BI for reporting.
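    One way the hourly load into the warehouse described above might be expressed; the table name, partition keys, and S3 path are invented for illustration:

    ```sql
    -- Hypothetical sketch: expose an hour of merged log output as a
    -- new partition of the public warehouse table, without copying data.
    ALTER TABLE weblogs
      ADD PARTITION (dateint = 20100318, hour = 14)
      LOCATION 's3n://example-bucket/weblogs/20100318/14/';
    ```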
  • 10. Today’s Hive usage at Netflix
    Streaming summary data:
    CDN performance
    # of streams/day
    # of errors/session
    Test cell analysis
    Ad-hoc query for further analysis like:
    Raw log inspection
    Detailed inspection of one stream session
    Simple summary (e.g., percentile, count, max, min, bucketing) for operational metrics
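    An ad-hoc summary of the kind listed above might look like the following; the cdn_events table and its columns are hypothetical:

    ```sql
    -- Hypothetical sketch: per-CDN event counts and latency range,
    -- bucketed by hour of day.
    SELECT cdn, hour,
           COUNT(*) AS events,
           MIN(latency_ms) AS min_latency,
           MAX(latency_ms) AS max_latency
    FROM cdn_events
    WHERE dateint = 20100318
    GROUP BY cdn, hour;
    ```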
  • 11. Challenges
    Hive UI (for query building)
    Multi-DB support (Hive-675) and user access management
    Hive query on subset of partition files for handling late files (Hive-837 or Hive-951)
    Merging small files (can’t use hive.merge.mapfiles)
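    Since hive.merge.mapfiles can't be used here, one workaround is a manual compaction pass that forces output through a bounded number of reducers; the table and column names below are hypothetical:

    ```sql
    -- Hypothetical sketch: rewrite a partition through a reduce stage so
    -- many small files collapse into at most mapred.reduce.tasks files.
    SET mapred.reduce.tasks = 4;
    INSERT OVERWRITE TABLE logs PARTITION (dateint = 20100318)
    SELECT host, level, msg
    FROM logs_staging
    WHERE dateint = 20100318
    DISTRIBUTE BY host;  -- forces a reduce stage; plain SELECT is map-only
    ```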