2. About LongTail Video
• Home of JW Player
– JW Player is embedded on over 2 million sites
• Founded in 2007
• 32 Employees
• $5M investment
• Headquartered in New York
disney.co.uk
chevrolet.com
3. JW Player - Key Features
Works on all mobile devices and desktops.
Chrome, IE, Firefox, iOS, Android, etc
Easy to customize, extend and embed.
Scripting API, PNG Skinning, Mgmt dashboard
HD-quality, secure, adaptive streaming.
Utilizing Apple HTTP Live Streaming
Cross-platform advertising & analytics.
VAST/VPAID, SiteCatalyst, Google
4. JW Analytics: Numbers and Tech Stack
• 156 million unique viewers - intl
• 24 million unique viewers – USA
• 1.04 billion video streams (plays)
• 29.94 million hours of video watched
• 134,000 live domains
• 16 billion analytics events
• 20,000 simultaneous pings per second (peak)
• 3 TB (gzip compressed) per month
• 12-15 TB (uncompressed) per month
Technology Stack
• Runs completely in Amazon AWS
• Master node & ping nodes in EC2; Hadoop and HBase clusters run in EMR
• We upload data to and process from S3
• Full-stack Python: boto (AWS S3, EMR), happybase (HBase)
• Look ma, no Java!
JW Player Numbers (Version 6.0 and above) – May 2013
5. JW Analytics: Demo
• Available to the public
• Must be a registered user of JW Player (free accounts included!)
http://account.longtailvideo.com/
6. Real-Time Analytics: The Holy Grail
[Diagram: raw logs with player data -> crunch data -> insert into a database -> real-time querying]
7. Why We Chose HBase
• Goal: Build “Google Analytics for video”!
• Requirements:
– Fast queries across data sets
– Support date-range queries
– Store huge amounts of aggregate data
– Flexibility in dimensions used for rollup tables
• HBase! But why?
– Open source! And good community!
• Based on & closely integrated with Hadoop
– Facebook uses it (as do other large companies)
– Amazon AWS released a “hosted” HBase solution on EMR
9. Schema: HBase Row-Key Design
• Allows us to do date range queries
• If we need new metrics, we just create a new table
– Specify this in a JSON config file used by our Hadoop mapper
• We don’t use column filters, secondary indexes, etc
• We do need to know the “prefix” ahead of time
Row key: QueryString_yyyymmdd
QueryString: row prefix for a specific table
• We need to know this ahead of time
• Like the “WHERE” clause in SQL
yyyymmdd: date in yyyymmdd format
• ISO 8601 ordering makes date-range scans lexicographic (perfect for HBase)
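A minimal sketch of why this key layout works for date ranges: because yyyymmdd is fixed-width with the most significant field first, sorting keys as plain strings also sorts them by date. The helper name `build_row_key` is hypothetical and not taken from the original code.

```python
from datetime import date


def build_row_key(prefix: str, day: date) -> str:
    """Join a table-specific prefix and a yyyymmdd date with an underscore."""
    return f"{prefix}_{day.strftime('%Y%m%d')}"


# Deliberately out of chronological order.
days = [date(2013, 6, 1), date(2013, 5, 31), date(2013, 5, 1)]
keys = [build_row_key("User1", d) for d in days]

# Plain string sorting puts the keys in chronological order, which is
# the order a sorted key-value store such as HBase keeps them in.
sorted_keys = sorted(keys)
print(sorted_keys)
```

A contiguous date range therefore maps to a contiguous block of rows, which is what makes a start/stop scan sufficient.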
10. E.g.: A Tale of Two Tables (Domains, URLs)
import happybase

SERVER = "localhost"  # placeholder: hostname of an HBase Thrift server
conn = happybase.Connection(SERVER)

# User1: "I want my list of domains from May 1 to
# May 31, 2013"
t = conn.table("user_domains")
# row_stop is exclusive, so stop at June 1 to include May 31
rows = t.scan(row_start="User1_20130501",
              row_stop="User1_20130601")
# 'User1_20130501': {'cf:D1.com': '100', …}

# User1: "Oooh, D1.com looks interesting. Wonder
# which URLs were popular over 2 months." <Click>
t = conn.table("user_domain_urls")
rows = t.scan(row_start="User1_D1.com_20130501",
              row_stop="User1_D1.com_20130701")
# 'User1_D1.com_20130501': {'cf:D1.com/url': '80'}
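HBase scans treat the stop row as exclusive, so an inclusive date range needs a stop key one day past the last day wanted. The sketch below computes both bounds; `scan_bounds` is a hypothetical helper, not part of happybase or the original code.

```python
from datetime import date, timedelta
from typing import Tuple


def scan_bounds(prefix: str, first_day: date, last_day: date) -> Tuple[str, str]:
    """Return (row_start, row_stop) covering first_day..last_day inclusive."""
    stop_day = last_day + timedelta(days=1)  # the stop row itself is excluded
    fmt = "%Y%m%d"
    return (f"{prefix}_{first_day.strftime(fmt)}",
            f"{prefix}_{stop_day.strftime(fmt)}")


row_start, row_stop = scan_bounds("User1", date(2013, 5, 1), date(2013, 5, 31))
print(row_start, row_stop)
```

The two strings can be passed as the `row_start` and `row_stop` arguments of happybase's `Table.scan`. Using real date arithmetic also avoids hand-written keys for days that do not exist.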
11. HBase + Thrift Setup
[Diagram: Master node and two Data nodes, each running Thrift (T); the API connects to the Master, Hadoop connects to the Data nodes]
• Used for HBase RPC with non-Java languages (e.g.: Python!)
• Thrift runs on all nodes in our HBase clusters
– Thrift on Master is read-only: used by API
– Thrift on Data Nodes is write-only: data inserts from Hadoop
• We use batch puts/inserts to improve write speed
– Our analytics is VERY write-intensive
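A minimal sketch of batched writes, assuming `table` behaves like a `happybase.Table`: `table.batch(batch_size=...)` returns a context manager that buffers `put()` calls and sends them to Thrift in groups. The function name `write_rows` and the default batch size are illustrative assumptions, not values from the original system.

```python
def write_rows(table, rows, batch_size=1000):
    """Write (row_key, columns) pairs through a happybase-style batch.

    Mutations are buffered and flushed every `batch_size` puts, with a
    final flush when the `with` block exits.
    Returns the number of rows submitted.
    """
    count = 0
    with table.batch(batch_size=batch_size) as batch:
        for row_key, columns in rows:
            batch.put(row_key, columns)
            count += 1
    return count


# Usage against a real cluster (hostname is a placeholder):
#   import happybase
#   conn = happybase.Connection("thrift-host")
#   write_rows(conn.table("user_domains"),
#              [(b"User1_20130501", {b"cf:D1.com": b"100"})])
```

Batching cuts the number of Thrift round trips, which matters most for write-heavy workloads like the one described above.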
Thrift is …?
RPC framework developed at Facebook, now in wide use
NOT the Macklemore & Ryan Lewis music video (that’s Thrift Shop!)
12. What We Like About HBase
• Giant, sorted key-value store
– Hadoop output (also key-value!) can have a 1-to-1 correspondence to HBase
• FAST lookups over large data sets
– O(1) lookup time to find key; lookups complete in ms across billion-plus rows
• Usually retrieval is fast as well
– But slow if data sets are large!
– O(n). No simple way to solve this.
– Most times you only need top N => can be solved through optimization of key
[Diagram: within all HBase data, locating the data we want is an O(1) lookup (fast!); reading it is O(n) (could be slow)]
Got good row-key design? HBase excels at finding needles in haystacks!
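The slides do not say which key optimization was used for top-N queries. One common approach, sketched here as an assumption, is to embed an inverted, zero-padded count in the row key so that the highest counts sort first. `MAX_COUNT`, `top_n_key`, and the sample play counts are all hypothetical.

```python
MAX_COUNT = 10**12  # assumed upper bound on any single count


def top_n_key(prefix: str, day: str, count: int, item: str) -> str:
    """Row key that sorts the highest counts first within prefix_day."""
    inverted = MAX_COUNT - count
    return f"{prefix}_{day}_{inverted:013d}_{item}"


plays = {"D1.com": 100, "D2.com": 2500, "D3.com": 40}
keys = sorted(top_n_key("User1", "20130501", c, d) for d, c in plays.items())

# A sorted store returns keys in this order, so reading only the first
# N rows under the prefix yields the top N items without an O(n) read.
top_two = [k.rsplit("_", 1)[1] for k in keys[:2]]
print(top_two)
```

This only works when counts are final at write time (for example, output of a completed Hadoop job), because changing a count changes the row key.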
13. Challenges With HBase
• Most programmers prefer SQL queries, not list iteration
– “Why can’t I do a SELECT * FROM domains WHERE …???”
• Thrift server goes down under load
– We wrote our own HBase Thrift watchdog script
• We deal with pretty exotic bugs at scale…
– … with sometimes one blog post documenting a fix.
– When was the last time Google showed you one useful result?
• Some things we dealt with (we are on HBase 0.92)
– org.apache.hadoop.hbase.NotServingRegionException
• SSH into master, clean out ZooKeeper metadata, restart master.
• Kinda scary the first time you actually do this?
– java.util.concurrent.RejectedExecutionException (hbck)
• Ticket #6018 (“hbck fails … when >50 regions present”); fixed in 0.94.1
– org.apache.hadoop.hbase.MasterNotRunningException
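The watchdog script itself is not shown in the slides. A minimal sketch of the idea, assuming the check is a plain TCP connect to the Thrift port (9090 is the HBase Thrift default) and that the restart action is supplied by the caller, could look like this. Both function names are hypothetical.

```python
import socket


def thrift_port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_once(host: str, port: int, restart) -> bool:
    """Run one watchdog pass; call restart() if the port is unreachable.

    Returns True if a restart was triggered.
    """
    if thrift_port_is_open(host, port):
        return False
    restart()
    return True
```

A real deployment would run `check_once` on a schedule (cron or a loop with a sleep) and pass a deployment-specific restart command as `restart`. A connect check only detects a dead listener; a server that accepts connections but hangs would need an actual Thrift call to detect.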
14. Conclusion
• Real-time analytics on Hadoop and HBase
– Handling 16 billion events a month (~15 TB data)
– Inserting ~80 million data points into HBase daily
– Running in production for 7 months!
– Did I mention we built it on Python (& bash)?
• Important lessons
– Design your row key well (with room to iterate)
– Give HBase as much memory/CPU as it needs
• HBase is resource-hungry; better to over-provision
– Back up frequently!
Questions?