HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics


Presented by: Suman Srinivasan, LongTail Video

Published in: Technology

  1. Hadoop and HBase for Real-Time Video Analytics
     Suman Srinivasan
  2. About LongTail Video
     • Home of JW Player
       – JW Player is embedded on over 2 million sites
     • Founded in 2007
     • 32 employees
     • $5M investment
     • Headquartered in New York
  3. JW Player - Key Features
     • Works on all mobile devices and desktops: Chrome, IE, Firefox, iOS, Android, etc.
     • Easy to customize, extend and embed: scripting API, PNG skinning, management dashboard
     • HD-quality, secure, adaptive streaming, utilizing Apple HTTP Live Streaming
     • Cross-platform advertising & analytics: VAST/VPAID, SiteCatalyst, Google
  4. JW Analytics: Numbers and Tech Stack
     JW Player numbers (version 6.0 and above), May 2013:
     • 156 million unique viewers (international)
     • 24 million unique viewers (USA)
     • 1.04 billion video streams (plays)
     • 29.94 million hours of video watched
     • 134,000 live domains
     • 16 billion analytics events
     • 20,000 simultaneous pings per second (peak)
     • 3 TB (gzip compressed) per month
     • 12-15 TB (uncompressed) per month
     Technology stack:
     • Runs completely in Amazon AWS
     • Master node & ping nodes in EC2; Hadoop and HBase clusters run in EMR
     • We upload data to S3 and process it from there
     • Full-stack Python: boto (AWS S3, EMR), happybase (HBase)
       – Look ma, no Java!
  5. JW Analytics: Demo
     • Available to the public
     • Must be a registered user of JW Player (free users included!)
  6. Real-Time Analytics: The Holy Grail
     [Diagram: raw logs with player data → crunch data → insert into a DB → database → real-time querying]
  7. Why We Chose HBase
     • Goal: Build “Google Analytics for video”!
     • Requirements:
       – Fast queries across data sets
       – Support date-range queries
       – Store huge amounts of aggregate data
       – Flexibility in dimensions used for rollup tables
     • HBase! But why?
       – Open source! And good community!
         • Based on & closely integrated with Hadoop
       – Facebook uses it (as do other large companies)
       – Amazon AWS released a “hosted” HBase solution on EMR
  8. JW Analytics Architecture
  9. Schema: HBase Row-Key Design
     • Allows us to do date-range queries
     • If we need new metrics, we just create a new table
       – Specify this in a JSON config file used by our Hadoop mapper
     • We don’t use column filters, secondary indexes, etc.
     • We do need to know the “prefix” ahead of time
     Row-key layout: QueryString_yyyymmdd
     • QueryString: row prefix for a specific table
       – We need to know this ahead of time
       – Like the “WHERE” clause in SQL
     • yyyymmdd: date in yyyymmdd format
       – The ISO 8601 ordering (year, month, day) makes date-range scans lexicographic (perfect for HBase)
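The row-key layout on this slide can be sketched in a few lines of Python. This is only an illustration: the helper name `make_row_key` and the sample prefix are assumptions, not code from the talk. It shows why a yyyymmdd suffix keeps the keys for one prefix in date order when they are compared as plain strings, which is how HBase orders rows.

```python
from datetime import date, timedelta


def make_row_key(prefix: str, day: date) -> str:
    """Build a '<prefix>_<yyyymmdd>' row key (hypothetical helper)."""
    return f"{prefix}_{day.strftime('%Y%m%d')}"


# Four consecutive days that cross a month boundary.
days = [date(2013, 4, 30) + timedelta(days=i) for i in range(4)]
keys = [make_row_key("User1", d) for d in days]

# Chronological order and string order agree, so a contiguous
# key range corresponds to a contiguous date range.
assert keys == sorted(keys)
```

A format such as ddmmyyyy would not have this property, because the day digits would dominate the comparison.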
  10. E.g.: A Tale of Two Tables (Domains, URLs)
        import happybase

        conn = happybase.Connection(SERVER)  # SERVER: host of the HBase Thrift server

        # User1: "I want my list of domains from May 1 to May 31, 2013"
        t = conn.table("user_domains")
        # scan() takes row_start/row_stop; row_stop is exclusive, so stop at June 1
        t.scan(row_start="User1_20130501", row_stop="User1_20130601")
        # 'User1_20130501': { '': '100'; … }

        # User1: "Oooh, looks interesting. Wonder which URLs were popular
        # for 2 months." <Click>
        t = conn.table("user_domain_urls")
        t.scan(row_start="User1_D1.com_20130501", row_stop="User1_D1.com_20130701")
        # 'User1_D1.com_20130501': {'': '80' }
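One detail of the scans on this slide is easy to get wrong: happybase's `Table.scan` takes `row_start` and `row_stop`, and the stop row is exclusive. A minimal sketch of a helper that turns an inclusive date range into scan bounds follows; the name `scan_bounds` is hypothetical and the key layout is the prefix_yyyymmdd scheme from the previous slide.

```python
from datetime import date, timedelta
from typing import Tuple


def scan_bounds(prefix: str, first_day: date, last_day: date) -> Tuple[str, str]:
    """Return (row_start, row_stop) covering first_day..last_day inclusive.

    The stop key uses the day after last_day because the stop row
    of an HBase scan is not included in the results.
    """
    if last_day < first_day:
        raise ValueError("last_day must not be before first_day")
    stop_day = last_day + timedelta(days=1)
    return (f"{prefix}_{first_day:%Y%m%d}", f"{prefix}_{stop_day:%Y%m%d}")


start, stop = scan_bounds("User1", date(2013, 5, 1), date(2013, 5, 31))
# With a live connection this would be used as:
#   table.scan(row_start=start, row_stop=stop)
```

Computing the stop key from a real date also avoids impossible dates such as June 31.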
  11. HBase + Thrift Setup
      [Diagram: a master node and data nodes, each running Thrift (T); the API talks to the master, Hadoop to the data nodes]
      • Used for HBase RPC with non-Java languages (e.g.: Python!)
      • Thrift runs on all nodes in our HBase clusters
        – Thrift on Master is read-only: used by API
        – Thrift on Data Nodes is write-only: data inserts from Hadoop
      • We use batch puts/inserts to improve write speed
        – Our analytics is VERY write-intensive
      Thrift is …?
      • RPC framework developed at Facebook, now in wide use
      • NOT the Macklemore & Ryan Lewis music video (that’s Thrift Shop!)
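The batch puts mentioned on this slide might look roughly like the sketch below. It relies on happybase's `Table.batch(batch_size=...)` context manager and `Batch.put(row, data)`. The function name `write_counts`, the column name `cf:count`, and the hostname in the usage comment are assumptions made for illustration; the talk does not show its actual write path.

```python
def write_counts(table, counts, batch_size=1000):
    """Write {row_key: count} pairs through a happybase batch.

    `table` is expected to be a happybase.Table. The batch buffers
    puts and sends them to the Thrift server every `batch_size`
    mutations, plus once more when the `with` block exits.
    """
    with table.batch(batch_size=batch_size) as batch:
        for row_key, count in counts.items():
            # 'cf:count' is a hypothetical column family and qualifier.
            batch.put(row_key.encode(), {b"cf:count": str(count).encode()})


# Usage sketch (needs a reachable Thrift server; hostname is a placeholder):
#   import happybase
#   conn = happybase.Connection("data-node-hostname")
#   write_counts(conn.table("user_domains"), {"User1_20130501": 100})
```

Sending mutations in groups cuts the number of Thrift round trips, which matters for a write-heavy workload like the one described here.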
  12. What We Like About HBase
      • Giant, sorted key-value store
        – Hadoop output (also key-value!) can have a 1-to-1 correspondence to HBase
      • FAST lookups over large data sets
        – O(1) lookup time to find a key; lookups complete in ms across billion-plus rows
      • Usually retrieval is fast as well
        – But slow if data sets are large!
        – O(n). No simple way to solve this.
        – Most times you only need the top N => can be solved through optimization of the key
      [Diagram: within all HBase data, finding the data we want is an O(1) lookup (fast!); reading it is an O(n) read (could be slow)]
      Got good row-key design? HBase excels at finding needles in haystacks!
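The slide does not say how the key was optimized for top-N reads, so the following is one possible interpretation rather than the speaker's method. It contrasts picking the top N on the client after reading every row with embedding an inverted, zero-padded count in the row key, so that the highest counts sort first and a scan can stop after N rows. `MAX_COUNT`, `ranked_key`, and the sample data are all assumptions.

```python
import heapq

MAX_COUNT = 10**10 - 1  # assumed upper bound on any single count


def top_n_client_side(rows, n):
    """Pick the n largest counts after reading every row (an O(n) read)."""
    return heapq.nlargest(n, rows, key=lambda kv: kv[1])


def ranked_key(prefix, yyyymmdd, count, item):
    """Embed an inverted, zero-padded count so larger counts sort first."""
    return f"{prefix}_{yyyymmdd}_{MAX_COUNT - count:010d}_{item}"


rows = [("a.com", 80), ("b.com", 100), ("c.com", 5)]

# Sorting the keys as strings mimics the order HBase would store them in.
keys = sorted(ranked_key("User1", "20130501", count, item) for item, count in rows)
```

With keys like these, a scan over the `User1_20130501_` prefix limited to N rows would return the most popular items first. The trade-off is that the key depends on the count, so this suits aggregates written once by a batch job rather than counters updated in place.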
  13. Challenges With HBase
      • Most programmers prefer SQL queries, not list iteration
        – “Why can’t I do a SELECT * FROM domains WHERE …???”
      • Thrift server goes down under load
        – We wrote our own HBase Thrift watchdog script
      • We deal with pretty exotic bugs at scale…
        – … with sometimes one blog post documenting a fix.
        – When was the last time Google showed you one useful result? :)
      • Some things we dealt with (we are on HBase 0.92)
        – org.apache.hadoop.hbase.NotServingRegionException
          • SSH into master, clean out ZooKeeper metadata, restart master.
          • Kinda scary the first time you actually do this?
        – java.util.concurrent.RejectedExecutionException (hbck)
          • Ticket #6018 (“hbck fails … when >50 regions present”); fixed in 0.94.1
        – org.apache.hadoop.hbase.MasterNotRunningException
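The talk does not show the watchdog script, so the sketch below is only a guess at its general shape: probe the Thrift port and invoke a restart action when the probe fails. The function names are hypothetical, and the restart action is passed in as a callable because the actual restart command depends on the deployment. Port 9090 in the usage comment is the HBase Thrift server's default.

```python
import socket
import time


def thrift_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watchdog(host, port, restart, checks=1, interval=0.0):
    """Probe the port `checks` times; call restart() on each failed probe.

    Returns the number of times restart() was called.
    """
    restarts = 0
    for i in range(checks):
        if not thrift_port_open(host, port):
            restart()
            restarts += 1
        if i + 1 < checks:
            time.sleep(interval)
    return restarts


# Usage sketch (restart_thrift is a placeholder for a site-specific action):
#   watchdog("localhost", 9090, restart=restart_thrift, checks=10, interval=30)
```

A TCP connect is a shallow check: a server that is overloaded may still accept connections while failing requests, so a production watchdog would likely issue a real Thrift call as well.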
  14. Conclusion
      • Real-time analytics on Hadoop and HBase
        – Handling 16 billion events a month (~15 TB data)
        – Inserting ~80 million data points into HBase daily
        – Running in production for 7 months!
        – Did I mention we built it on Python (& bash)?
      • Important lessons
        – Design your row key well (with room to iterate)
        – Give HBase as much memory/CPU as it needs
          • HBase is resource-hungry; better to over-provision
        – Back up frequently!
      Questions?