Hadoop and HBase for Real-Time Video Analytics (HBaseCon 2013)
Suman Srinivasan
About LongTail Video
• Home of the JW Player
– The JW Player is embedded on over 2 million sites
• Founded in 2007
• 32 Employees
• $5M investment
• Headquartered in New York
Example sites: disney.co.uk, chevrolet.com
JW Player - Key Features
• Works on all mobile devices and desktops.
– Chrome, IE, Firefox, iOS, Android, etc.
• Easy to customize, extend and embed.
– Scripting API, PNG skinning, management dashboard
• HD-quality, secure, adaptive streaming.
– Utilizing Apple HTTP Live Streaming
• Cross-platform advertising & analytics.
– VAST/VPAID, SiteCatalyst, Google
JW Analytics: Numbers and Tech Stack
JW Player numbers (version 6.0 and above) – May 2013:
• 156 million unique viewers – international
• 24 million unique viewers – USA
• 1.04 billion video streams (plays)
• 29.94 million hours of video watched
• 134,000 live domains
• 16 billion analytics events
• 20,000 simultaneous pings per second (peak)
• 3 TB (gzip compressed) per month
• 12-15 TB (uncompressed) per month
Technology Stack
• Runs completely in Amazon AWS
• Master node & ping nodes in EC2; Hadoop and HBase clusters run in EMR
• We upload data to and process from S3
• Full-stack Python: boto (AWS S3, EMR), happybase (HBase)
– Look ma, no Java!
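As a rough sketch of the S3 leg of this stack, here is what a log upload with boto might look like (the bucket and key names are hypothetical examples, not our real layout):

import boto

# Minimal sketch, assuming AWS credentials are in the environment.
s3 = boto.connect_s3()
bucket = s3.get_bucket('jw-analytics-logs')  # hypothetical bucket

# Upload a gzipped ping log for later processing by the EMR cluster.
key = bucket.new_key('raw/2013-05-01/pings.log.gz')
key.set_contents_from_filename('pings.log.gz')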
JW Analytics: Demo
• Available to the public
• Must be a registered user of JW Player (free accounts included!)
http://account.longtailvideo.com/
Real-Time Analytics: The Holy Grail
[Pipeline diagram: raw logs with player data → crunch data → insert into a database → real-time querying]
Why We Chose HBase
• Goal: Build “Google Analytics for video”!
• Requirements:
– Fast queries across data sets
– Support date-range queries
– Store huge amounts of aggregate data
– Flexibility in dimensions used for rollup tables
• HBase! But why?
– Open source! And good community!
• Based on & closely integrated with Hadoop
– Facebook uses it (as do other large companies)
– Amazon AWS released a “hosted” HBase solution on EMR
JW Analytics Architecture
[Architecture diagram]
Schema: HBase Row-Key Design
• Allows us to do date range queries
• If we need new metrics, we just create a new table
– Specify this in a JSON config file used by our Hadoop mapper
• We don’t use column filters, secondary indexes, etc.
• We do need to know the “prefix” ahead of time
Row key format: QueryString_yyyymmdd
• QueryString: row prefix for a specific table
– We need to know this ahead of time
– Like the “WHERE” clause in SQL
• yyyymmdd: date in ISO 8601 format
– Makes date-range scans lexicographic (perfect for HBase)
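A minimal sketch of assembling such a row key (the prefix and date below are hypothetical; real prefixes come from the JSON config mentioned above):

import datetime

def make_row_key(prefix, date):
    # ISO 8601 (yyyymmdd) keeps row keys lexicographically sorted by date
    return '%s_%s' % (prefix, date.strftime('%Y%m%d'))

print(make_row_key('User1', datetime.date(2013, 5, 1)))  # User1_20130501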
E.g.: A Tale of Two Tables (Domains, URLs)
import happybase

conn = happybase.Connection(SERVER)

# User1: "I want my list of domains from May 1 to May 31, 2013"
t = conn.table('user_domains')
# happybase's row_stop is exclusive, so stop at June 1 to include May 31
for key, data in t.scan(row_start='User1_20130501',
                        row_stop='User1_20130601'):
    print(key, data)  # 'User1_20130501', {'cf:D1.com': '100', ...}

# User1: "Oooh, D1.com looks interesting. Wonder which
# URLs were popular for 2 months." <Click>
t = conn.table('user_domain_urls')
for key, data in t.scan(row_start='User1_D1.com_20130501',
                        row_stop='User1_D1.com_20130701'):
    print(key, data)  # 'User1_D1.com_20130501', {'cf:D1.com/url': '80'}
HBase + Thrift Setup
[Diagram: a Thrift server runs on the Master node (queried by the API) and on each Data node (written to by Hadoop jobs)]
• Used for HBase RPC with non-Java languages (e.g.: Python!)
• Thrift runs on all nodes in our HBase clusters
– Thrift on Master is read-only: used by API
– Thrift on Data Nodes is write-only: data inserts from Hadoop
• We use batch puts/inserts to improve write speed
– Our analytics is VERY write-intensive
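A minimal sketch of such a batched insert with happybase (the host, table, and values here are hypothetical):

import happybase

conn = happybase.Connection(DATA_NODE)  # hypothetical Thrift host
table = conn.table('user_domains')

# Batch puts: mutations are buffered and sent in bulk once batch_size
# is reached, and flushed when the block exits.
with table.batch(batch_size=1000) as b:
    b.put('User1_20130501', {'cf:D1.com': '100'})
    b.put('User1_20130502', {'cf:D1.com': '250'})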
Thrift is …?
• An RPC framework developed at Facebook, now in wide use
• NOT the Macklemore & Ryan Lewis music video (that’s “Thrift Shop”!)
What We Like About HBase
• Giant, sorted key-value store
– Hadoop output (also key-value!) can have a 1-to-1 correspondence to HBase
• FAST lookups over a large data set
– O(1) lookup time to find a key; lookups complete in ms across billion-plus rows
• Usually retrieval is fast as well
– But slow if data sets are large!
– O(n). No simple way to solve this.
– Most times you only need the top N => can be solved through optimization of the key (see the sketch below)
[Diagram: within all HBase data, the O(1) lookup jumps straight to the data we want (fast!); the O(n) read of that range could be slow]
Got good row-key design? HBase excels at finding needles in haystacks!
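As a hedged illustration of the top-N point above: if rows are written so the hottest entries sort first, a bounded scan avoids the full O(n) read. The key layout and the use of happybase's limit parameter here are assumptions, not our production schema:

import happybase

conn = happybase.Connection(SERVER)
t = conn.table('user_domains')

# Assumption: keys are laid out so the most-viewed domains sort first.
# limit bounds the read to just the top 10 rows under the prefix.
top10 = list(t.scan(row_prefix='User1_20130501', limit=10))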
Challenges With HBase
• Most programmers prefer SQL queries, not list iteration
– “Why can’t I do a SELECT * FROM domains WHERE …???”
• Thrift server goes down under load
– We wrote our own HBase Thrift watchdog script (a sketch follows this list)
• We deal with pretty exotic bugs at scale…
– … sometimes with only a single blog post documenting a fix.
– When was the last time Google showed you one useful result? 
• Some things we dealt with (we are on HBase 0.92)
– org.apache.hadoop.hbase.NotServingRegionException
• SSH into master, clean out ZooKeeper metadata, restart master.
• Kinda scary the first time you actually do this?
– java.util.concurrent.RejectedExecutionException (hbck)
• HBASE-6018 (“hbck fails … when >50 regions present”); fixed in 0.94.1
– org.apache.hadoop.hbase.MasterNotRunningException
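A minimal sketch of what such a watchdog might look like (the probe, host, and init-script path are illustrative assumptions, not our exact script):

import subprocess
import time

import happybase

THRIFT_HOST = 'localhost'  # assumption: the watchdog runs on each node

def thrift_alive():
    # Probe the Thrift server with a cheap RPC; any failure counts as down.
    try:
        conn = happybase.Connection(THRIFT_HOST, timeout=5000)
        conn.tables()
        conn.close()
        return True
    except Exception:
        return False

while True:
    if not thrift_alive():
        # Hypothetical init-script path; restart the Thrift server.
        subprocess.call(['/etc/init.d/hbase-thrift', 'restart'])
    time.sleep(30)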
Conclusion
• Real-time analytics on Hadoop and HBase
– Handling 16 billion events a month (~15 TB data)
– Inserting ~80 million data points into HBase daily
– Running in production for 7 months!
– Did I mention we built it on Python (& bash)?
• Important lessons
– Design your row key well (with room to iterate)
– Give HBase as much memory/CPU as it needs
• HBase is resource-hungry; better to over-provision
– Back up frequently!
Questions?
