Hive user group presentation from Netflix (3/18/2010)

    Presentation Transcript

    • Use case study of Hive/Hadoop
      Eva Tse, Jerome Boulon
    • What are we trying to achieve?
      Scalable log analysis to gain business insights:
      Logs for website streaming (phase 1)
      All logs from web (phase 2)
      Output required:
      Engineers access:
      Ad-hoc query and reporting
      BI access:
      Flat files to be loaded into BI system for cross-functional reporting.
    • Some Metrics
      Parsing 0.6 TB logs per day
      Running 50+ persistent nodes
    • Architecture Overview
      Web App
      Phase 1
      Phase 2
      Chukwa Collector
      Log copy daemon
      Hive & Hadoop (for query)
      Hive MetaStore
      S3 HDFS / S3
      Hive & Hadoop running on the cloud
    • Chukwa Streaming
      • No data written to disk on the application side
      • Data sent to a remote collector using Thrift
      • Collector writes compressed data to localFS/S3n/HDFS
      • http://wiki.github.com/jboulon/Honu/ (stay tuned)
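
The streaming pattern on this slide (buffer in memory on the application side, hand batches to a remote collector, have the collector write them compressed) can be sketched roughly as below. This is only an illustration, not Chukwa or Honu code: `AppLogger`, `Collector`, and `batch_size` are hypothetical names, a plain method call stands in for the Thrift transport, and an in-memory list stands in for localFS/S3n/HDFS.

```python
import gzip


class Collector:
    """Stand-in for the remote collector; a real one would sit behind a Thrift service."""

    def __init__(self):
        # Compressed batches, standing in for files on localFS/S3n/HDFS.
        self.blobs = []

    def receive(self, records):
        # Join one batch of records and store it gzip-compressed.
        payload = "\n".join(records).encode("utf-8")
        self.blobs.append(gzip.compress(payload))


class AppLogger:
    """Application-side logger that buffers in memory only (no local disk writes)."""

    def __init__(self, collector, batch_size=3):
        self.collector = collector
        self.batch_size = batch_size  # hypothetical tuning knob
        self.buffer = []

    def log(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered, then start a fresh buffer.
        if self.buffer:
            self.collector.receive(self.buffer)
            self.buffer = []
```

Calling `flush()` at shutdown sends any partially filled batch, so records are not left behind in the buffer.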
    • Workflow to Hive (phase 1)
      Streaming Session reconstruction
      Each hour:
      Run a Hadoop job to parse the last hour's logs and reconstruct sessions
      Merge small files (from each reducer)
      Load into Hive
      Sessions expire after 24 hours.
      Will have sessions for each of the past 24 hours.
      After 24 hours, will need to merge again by: insert overwrite …. select <column list> from table
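
A minimal sketch of the hourly session reconstruction described above, assuming each parsed log event carries a session id and a timestamp. The event shape, the function names, and the reading of "expiration" as "24 hours after the session's first event" are all assumptions made for illustration; the slides do not specify them.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TTL = timedelta(hours=24)  # from the slide: sessions expire after 24 hours


def merge_hourly_events(open_sessions, hourly_events):
    """Fold one hour of parsed events into the open-session map.

    open_sessions: defaultdict(list) of session_id -> [(timestamp, payload), ...]
    hourly_events: iterable of (session_id, timestamp, payload) tuples (hypothetical shape)
    """
    for session_id, ts, payload in hourly_events:
        open_sessions[session_id].append((ts, payload))
    # Keep each session's events in time order.
    for events in open_sessions.values():
        events.sort(key=lambda e: e[0])
    return open_sessions


def split_expired(open_sessions, now):
    """Separate sessions whose first event is at least SESSION_TTL old."""
    expired, still_open = {}, defaultdict(list)
    for session_id, events in open_sessions.items():
        if now - events[0][0] >= SESSION_TTL:
            expired[session_id] = events
        else:
            still_open[session_id] = events
    return expired, still_open
```

Under these assumptions, each hourly run would call `merge_hourly_events` with the newly parsed hour and then `split_expired` to find sessions that are ready for the final merge.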
    • Workflow to Hive (phase 2)
      Continuous log collection via Chukwa
      Generic and continuous parse/merge/load into the ‘real-time’ Hive warehouse
      Merge at hourly boundaries and load into the public Hive warehouse. SLA is 2 hours on merged data.
      Daily/Hourly job:
      For summary.
      For publishing data to BI for reporting.
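
To make the hourly-boundary merge and the 2-hour SLA concrete, here is a small sketch of the timestamp arithmetic involved. The partition naming (`dateint=…/hour=…`) is hypothetical, and measuring the SLA from the end of the hour is one possible reading of the slide, not something the source states.

```python
from datetime import datetime, timedelta

MERGE_SLA = timedelta(hours=2)  # from the slide: SLA is 2 hours on merged data


def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def partition_spec(ts):
    """Hypothetical partition naming for the hour that contains ts."""
    bucket = hour_bucket(ts)
    return f"dateint={bucket:%Y%m%d}/hour={bucket:%H}"


def merged_data_deadline(ts):
    """Latest time the merged hour containing ts should be available,
    assuming the SLA clock starts when that hour ends."""
    return hour_bucket(ts) + timedelta(hours=1) + MERGE_SLA
```

For example, under these assumptions an event logged at 14:37 falls in the 14:00 bucket, and the merged data for that hour would be due by 17:00.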
    • Today’s Hive usage at Netflix
      Streaming summary data:
      CDN performance
      # of streams/day
      # of errors/session
      Test cell analysis
      Ad-hoc query for further analysis like:
      Raw log inspection
      Detailed inspection of one stream session
      Simple summary (e.g., percentile, count, max, min, bucketing) for operational metrics
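
The simple summaries listed above (percentile, count, max, min, bucketing) can be illustrated in plain Python as follows. This is a sketch of the arithmetic only, not of how the queries were written in Hive; the nearest-rank percentile definition and the fixed-width bucketing are choices made for this example.

```python
import math
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def bucketize(values, width):
    """Count values per fixed-width bucket, keyed by each bucket's lower bound."""
    return Counter((v // width) * width for v in values)


def summarize(values, bucket_width):
    """Collect the operational metrics named on the slide for one list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
        "buckets": dict(bucketize(values, bucket_width)),
    }
```

A single outlier dominates the maximum and high percentiles but barely moves the median, which is one reason to report percentiles and buckets alongside min and max.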
    • Challenges
      Hive UI (for query building)
      Multi-DB support (HIVE-675) and user access management
      Hive query on a subset of partition files for handling late files (HIVE-837 or HIVE-951)
      Merging small files (can’t use hive.merge.mapfiles)
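
One way to think about the small-files challenge is as a batching problem: group many small outputs into fewer, larger merge units. The sketch below only plans the grouping and is a hypothetical illustration, not the approach used in the talk and not a description of what `hive.merge.mapfiles` does. The file names, sizes, and `target_bytes` threshold are made up.

```python
def plan_merges(file_sizes, target_bytes):
    """Greedily group files (name -> size) into batches of at most target_bytes.

    Files are taken in name order; a file larger than the target gets its own batch.
    """
    batches, current, current_size = [], [], 0
    for name in sorted(file_sizes):
        size = file_sizes[name]
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch would then be concatenated or rewritten as one file by a separate job; that step is left out here.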