HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
 

Presented by: Suman Srinivasan, LongTail Video

Presentation Transcript

  • Hadoop and HBase for Real-Time Video Analytics (Suman Srinivasan)
  • About LongTail Video
    – Home of the JW Player; the player is embedded on more than 2 million sites
    – Founded in 2007
    – 32 employees
    – $5M in investment
    – Headquartered in New York
    (customer logos: disney.co.uk, chevrolet.com)
  • JW Player - Key Features
    – Works on all mobile devices and desktops: Chrome, IE, Firefox, iOS, Android, etc.
    – Easy to customize, extend and embed: scripting API, PNG skinning, management dashboard
    – HD-quality, secure, adaptive streaming, utilizing Apple HTTP Live Streaming
    – Cross-platform advertising & analytics: VAST/VPAID, SiteCatalyst, Google
  • JW Analytics: Numbers and Tech Stack
    JW Player numbers (version 6.0 and above), May 2013:
    – 156 million unique viewers (international)
    – 24 million unique viewers (USA)
    – 1.04 billion video streams (plays)
    – 29.94 million hours of video watched
    – 134,000 live domains
    – 16 billion analytics events
    – 20,000 simultaneous pings per second (peak)
    – 3 TB (gzip compressed) per month; 12-15 TB (uncompressed) per month
    Technology stack:
    – Runs completely in Amazon AWS
    – Master node & ping nodes in EC2; Hadoop and HBase clusters run in EMR
    – We upload data to and process from S3
    – Full-stack Python: boto (AWS S3, EMR), happybase (HBase). Look ma, no Java!
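    As a taste of that all-Python stack, here is a minimal sketch of shipping one gzipped log file to S3 with classic boto; the bucket and key names are hypothetical, not from the talk:

        import boto

        # Connect using AWS credentials from the environment or ~/.boto.
        conn = boto.connect_s3()

        # Hypothetical bucket and key layout; the talk does not name theirs.
        bucket = conn.get_bucket("jw-analytics-logs")
        key = bucket.new_key("pings/2013/05/01/pings-000.gz")

        # Upload one gzip-compressed log file for later Hadoop/EMR processing.
        key.set_contents_from_filename("pings-000.gz")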
  • JW Analytics: Demo
    – Available to the public at http://account.longtailvideo.com/
    – Must be a registered JW Player user (free accounts included!)
  • Real-Time Analytics: The Holy Grail
    [diagram: raw logs with player data → crunch data → insert into a database → real-time querying]
  • Why We Chose HBase
    – Goal: build “Google Analytics for video”!
    – Requirements:
        • Fast queries across data sets
        • Support for date-range queries
        • Store huge amounts of aggregate data
        • Flexibility in the dimensions used for rollup tables
    – HBase! But why?
        • Open source! And a good community!
        • Based on & closely integrated with Hadoop
        • Facebook uses it (as do other large companies)
        • Amazon AWS released a “hosted” HBase solution on EMR
  • JW Analytics Architecture
    [architecture diagram]
  • Schema: HBase Row-Key Design
    – Row-key format: QueryString_yyyymmdd (sketched below)
        • The row prefix is specific to a table, and we need to know it ahead of time; it acts like the “WHERE” clause in SQL
        • The date in yyyymmdd (ISO 8601) format makes date-range scans lexicographic (perfect for HBase)
    – Allows us to do date-range queries
    – If we need new metrics, we just create a new table; we specify this in a JSON config file used by our Hadoop mapper
    – We don’t use column filters, secondary indexes, etc.
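    A minimal sketch of building such a key (the helper name is mine, not from the talk; the talk's own scan example follows on the next slide):

        from datetime import date

        def row_key(prefix, day):
            # Zero-padded yyyymmdd keeps byte order identical to date order,
            # so an HBase range scan walks the dates in sequence.
            return "%s_%s" % (prefix, day.strftime("%Y%m%d"))

        print(row_key("User1", date(2013, 5, 1)))  # User1_20130501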
  • E.g.: A Tale of Two Tables (Domains, URLs)

        import happybase

        conn = happybase.Connection(SERVER)  # SERVER = your Thrift host

        # User1: "I want my list of domains from May 1 to May 31, 2013"
        t = conn.table("user_domains")
        # row_stop is exclusive, so scan up to June 1 to include May 31
        for key, data in t.scan(row_start="User1_20130501",
                                row_stop="User1_20130601"):
            print(key, data)
        # 'User1_20130501': {'cf:D1.com': '100', ...}

        # User1: "Oooh, D1.com looks interesting. I wonder which URLs
        # were popular for 2 months."
        t = conn.table("user_domain_urls")
        for key, data in t.scan(row_start="User1_D1.com_20130501",
                                row_stop="User1_D1.com_20130701"):
            print(key, data)
        # 'User1_D1.com_20130501': {'cf:D1.com/url': '80'}
  • HBase + Thrift Setup
    [diagram: the Master and Data nodes each run a Thrift server; the API reads via the Master, and Hadoop jobs write via the Data nodes]
    – Thrift is an RPC framework developed at Facebook, now in wide use (NOT the Macklemore & Ryan Lewis music video; that’s “Thrift Shop”!)
    – Used for HBase RPC with non-Java languages (e.g. Python!)
    – Thrift runs on all nodes in our HBase clusters
        • Thrift on the Master is read-only: used by the API
        • Thrift on the Data Nodes is write-only: data inserts from Hadoop
    – We use batch puts/inserts to improve write speed; our analytics workload is VERY write-intensive (see the sketch below)
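    A minimal sketch of what a batched insert through happybase looks like; the host, table, and row values here are illustrative, not from the talk:

        import happybase

        # Hypothetical Thrift host on a data node; the talk does not name one.
        conn = happybase.Connection("hbase-data-node-1")
        table = conn.table("user_domains")

        # Batch buffers puts client-side and flushes them in bulk, cutting
        # per-row Thrift round trips on a write-heavy workload.
        with table.batch(batch_size=1000) as batch:
            for day, domain, plays in [("20130501", "D1.com", 100),
                                       ("20130501", "D2.com", 42)]:
                batch.put("User1_%s" % day, {"cf:%s" % domain: str(plays)})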
  • What We Like About HBase
    – Giant, sorted key-value store
        • Hadoop output (also key-value!) can have a 1-to-1 correspondence to HBase
    – FAST lookups over large data sets
        • O(1) lookup time to find a key; lookups complete in milliseconds across billion-plus rows
    – Usually retrieval is fast as well
        • But slow if the result set is large: O(n), with no simple way to solve this
        • Most of the time you only need the top N, which can be solved through row-key optimization (sketched below)
    [diagram: finding the data we want inside all HBase data is an O(1) lookup (fast!) but an O(n) read (could be slow). Got good row-key design? HBase excels at finding needles in haystacks!]
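    A hedged sketch of that top-N idea in happybase terms (the ranked key layout and table name are my illustration, not from the talk): if an aggregation job writes rows in rank order under a common prefix, the top N becomes a bounded scan.

        import happybase

        conn = happybase.Connection("hbase-master")  # placeholder host
        t = conn.table("user_top_domains")           # hypothetical rollup table

        # Keys like 'User1_20130501_0001', 'User1_20130501_0002', ... sort by
        # rank, so the top 10 is a short prefix scan instead of an O(n) read.
        top10 = list(t.scan(row_prefix="User1_20130501_", limit=10))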
  • Challenges With HBase
    – Most programmers prefer SQL queries, not list iteration
        • “Why can’t I do a SELECT * FROM domains WHERE …???”
    – The Thrift server goes down under load
        • We wrote our own HBase Thrift watchdog script (a sketch follows below)
    – We deal with pretty exotic bugs at scale…
        • … sometimes with one blog post documenting a fix
        • When was the last time Google showed you one useful result?
    – Some things we have dealt with (we are on HBase 0.92):
        • org.apache.hadoop.hbase.NotServingRegionException
            – SSH into the master, clean out the ZooKeeper metadata, restart the master
            – Kinda scary the first time you actually do this!
        • java.util.concurrent.RejectedExecutionException (hbck)
            – HBASE-6018 (“hbck fails … when >50 regions present”); fixed in 0.94.1
        • org.apache.hadoop.hbase.MasterNotRunningException
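    For flavor, a minimal sketch of what such a Thrift watchdog can look like; LongTail's script is not published, so the port, health check, and restart command here are all assumptions:

        import socket
        import subprocess
        import time

        THRIFT_PORT = 9090  # default HBase Thrift port
        # Assumed install path and daemon command; adjust for your cluster.
        RESTART_CMD = ["/usr/lib/hbase/bin/hbase-daemon.sh", "restart", "thrift"]

        def thrift_is_up(host="localhost", port=THRIFT_PORT):
            # A plain TCP connect is enough to tell whether the Thrift
            # server process has died or stopped accepting connections.
            try:
                socket.create_connection((host, port), timeout=5).close()
                return True
            except socket.error:
                return False

        while True:
            if not thrift_is_up():
                subprocess.call(RESTART_CMD)
            time.sleep(30)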
  • Conclusion
    – Real-time analytics on Hadoop and HBase
        • Handling 16 billion events a month (~15 TB of data)
        • Inserting ~80 million data points into HBase daily
        • Running in production for 7 months!
        • Did I mention we built it on Python (& bash)?
    – Important lessons
        • Design your row key well (with room to iterate)
        • Give HBase as much memory/CPU as it needs; HBase is resource-hungry, so it is better to over-provision
        • Back up frequently!
    Questions?