Hadoop and HBase for Real-Time Video Analytics (HBaseCon 2013)
Suman Srinivasan
About LongTail Video
• Home of the JW Player
– The JW Player is embedded on over 2 million sites
• Founded in 2007
• 32 Employees
• $5M investment
• Headquartered in New York
Example sites: disney.co.uk, chevrolet.com
JW Player - Key Features
• Works on all mobile devices and desktops.
– Chrome, IE, Firefox, iOS, Android, etc.
• Easy to customize, extend and embed.
– Scripting API, PNG skinning, management dashboard
• HD-quality, secure, adaptive streaming.
– Utilizing Apple HTTP Live Streaming
• Cross-platform advertising & analytics.
– VAST/VPAID, SiteCatalyst, Google
JW Analytics: Numbers and Tech Stack
JW Player numbers (version 6.0 and above) – May 2013:
• 156 million unique viewers – international
• 24 million unique viewers – USA
• 1.04 billion video streams (plays)
• 29.94 million hours of video watched
• 134,000 live domains
• 16 billion analytics events
• 20,000 simultaneous pings per second (peak)
• 3 TB (gzip compressed) per month
• 12-15 TB (uncompressed) per month
Technology Stack
• Runs completely in Amazon AWS
• Master node & ping nodes in EC2; Hadoop and HBase clusters run in EMR
• We upload data to and process from S3
• Full-stack Python: boto (AWS S3, EMR), happybase (HBase)
– Look ma, no Java!
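As a rough sketch of the S3 leg of this stack, here is what a log upload with boto might look like (the bucket and key names are hypothetical examples, not our real layout):

import boto

# Minimal sketch, assuming AWS credentials are in the environment.
s3 = boto.connect_s3()
bucket = s3.get_bucket('jw-analytics-logs')  # hypothetical bucket

# Upload a gzipped ping log for later processing by the EMR cluster.
key = bucket.new_key('raw/2013-05-01/pings.log.gz')
key.set_contents_from_filename('pings.log.gz')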
JW Analytics: Demo
• Available to the public
• Must be a registered user of JW Player (free accounts included!)
http://account.longtailvideo.com/
Real-Time Analytics: The Holy Grail
[Pipeline diagram: raw logs with player data → crunch data → insert into a database → real-time querying]
Why We Chose HBase
• Goal: Build “Google Analytics for video”!
• Requirements:
– Fast queries across data sets
– Support date-range queries
– Store huge amounts of aggregate data
– Flexibility in dimensions used for rollup tables
• HBase! But why?
– Open source! And good community!
• Based on & closely integrated with Hadoop
– Facebook uses it (as do other large companies)
– Amazon AWS released a “hosted” HBase solution on EMR
JW Analytics Architecture
[Architecture diagram]
Schema: HBase Row-Key Design
• Allows us to do date range queries
• If we need new metrics, we just create a new table
– Specify this in a JSON config file used by our Hadoop mapper
• We don’t use column filters, secondary indexes, etc.
• We do need to know the “prefix” ahead of time
Row key format: QueryString_yyyymmdd
• QueryString: row prefix for a specific table
– We need to know this ahead of time
– Like the “WHERE” clause in SQL
• yyyymmdd: date in ISO 8601 format
– Makes date-range scans lexicographic (perfect for HBase)
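A minimal sketch of assembling such a row key (the prefix and date below are hypothetical; real prefixes come from the JSON config mentioned above):

import datetime

def make_row_key(prefix, date):
    # ISO 8601 (yyyymmdd) keeps row keys lexicographically sorted by date
    return '%s_%s' % (prefix, date.strftime('%Y%m%d'))

print(make_row_key('User1', datetime.date(2013, 5, 1)))  # User1_20130501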
E.g.: A Tale of Two Tables (Domains, URLs)
import happybase

conn = happybase.Connection(SERVER)

# User1: "I want my list of domains from May 1 to May 31, 2013"
t = conn.table('user_domains')
# happybase's row_stop is exclusive, so stop at June 1 to include May 31
for key, data in t.scan(row_start='User1_20130501',
                        row_stop='User1_20130601'):
    print(key, data)  # 'User1_20130501', {'cf:D1.com': '100', ...}

# User1: "Oooh, D1.com looks interesting. Wonder which
# URLs were popular for 2 months." <Click>
t = conn.table('user_domain_urls')
for key, data in t.scan(row_start='User1_D1.com_20130501',
                        row_stop='User1_D1.com_20130701'):
    print(key, data)  # 'User1_D1.com_20130501', {'cf:D1.com/url': '80'}
HBase + Thrift Setup
[Diagram: a Thrift server runs on the Master node (queried by the API) and on each Data node (written to by Hadoop jobs)]
• Used for HBase RPC with non-Java languages (e.g.: Python!)
• Thrift runs on all nodes in our HBase clusters
– Thrift on Master is read-only: used by API
– Thrift on Data Nodes is write-only: data inserts from Hadoop
• We use batch puts/inserts to improve write speed
– Our analytics is VERY write-intensive
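A minimal sketch of such a batched insert with happybase (the host, table, and values here are hypothetical):

import happybase

conn = happybase.Connection(DATA_NODE)  # hypothetical Thrift host
table = conn.table('user_domains')

# Batch puts: mutations are buffered and sent in bulk once batch_size
# is reached, and flushed when the block exits.
with table.batch(batch_size=1000) as b:
    b.put('User1_20130501', {'cf:D1.com': '100'})
    b.put('User1_20130502', {'cf:D1.com': '250'})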
Thrift is …?
• An RPC framework developed at Facebook, now in wide use
• NOT the Macklemore & Ryan Lewis music video (that’s “Thrift Shop”!)
What We Like About HBase
• Giant, sorted key-value store
– Hadoop output (also key-value!) can have a 1-to-1 correspondence to HBase
• FAST lookups over a large data set
– O(1) lookup time to find a key; lookups complete in ms across billion-plus rows
• Usually retrieval is fast as well
– But slow if data sets are large!
– O(n). No simple way to solve this.
– Most times you only need the top N => can be solved through optimization of the key (see the sketch below)
[Diagram: within all HBase data, the O(1) lookup jumps straight to the data we want (fast!); the O(n) read of that range could be slow]
Got good row-key design? HBase excels at finding needles in haystacks!
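As a hedged illustration of the top-N point above: if rows are written so the hottest entries sort first, a bounded scan avoids the full O(n) read. The key layout and the use of happybase's limit parameter here are assumptions, not our production schema:

import happybase

conn = happybase.Connection(SERVER)
t = conn.table('user_domains')

# Assumption: keys are laid out so the most-viewed domains sort first.
# limit bounds the read to just the top 10 rows under the prefix.
top10 = list(t.scan(row_prefix='User1_20130501', limit=10))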
Challenges With HBase
• Most programmers prefer SQL queries, not list iteration
– “Why can’t I do a SELECT * FROM domains WHERE …???”
• Thrift server goes down under load
– We wrote our own HBase Thrift watchdog script (a sketch follows this list)
• We deal with pretty exotic bugs at scale…
– … sometimes with only a single blog post documenting a fix.
– When was the last time Google showed you one useful result? 
• Some things we dealt with (we are on HBase 0.92)
– org.apache.hadoop.hbase.NotServingRegionException
• SSH into master, clean out ZooKeeper metadata, restart master.
• Kinda scary the first time you actually do this?
– java.util.concurrent.RejectedExecutionException (hbck)
• HBASE-6018 (“hbck fails … when >50 regions present”); fixed in 0.94.1
– org.apache.hadoop.hbase.MasterNotRunningException
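A minimal sketch of what such a watchdog might look like (the probe, host, and init-script path are illustrative assumptions, not our exact script):

import subprocess
import time

import happybase

THRIFT_HOST = 'localhost'  # assumption: the watchdog runs on each node

def thrift_alive():
    # Probe the Thrift server with a cheap RPC; any failure counts as down.
    try:
        conn = happybase.Connection(THRIFT_HOST, timeout=5000)
        conn.tables()
        conn.close()
        return True
    except Exception:
        return False

while True:
    if not thrift_alive():
        # Hypothetical init-script path; restart the Thrift server.
        subprocess.call(['/etc/init.d/hbase-thrift', 'restart'])
    time.sleep(30)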
Conclusion
• Real-time analytics on Hadoop and HBase
– Handling 16 billion events a month (~15 TB data)
– Inserting ~80 million data points into HBase daily
– Running in production for 7 months!
– Did I mention we built it on Python (& bash)?
• Important lessons
– Design your row key well (with room to iterate)
– Give HBase as much memory/CPU as it needs
• HBase is resource-hungry; better to over-provision
– Back up frequently!
Questions?
