Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)


At LongTail Video, we use Hadoop and HBase for our real-time analytics engine. This was presented at HBaseCon 2013 in San Francisco.

  1. Hadoop and HBase for Real-Time Video Analytics
     Suman Srinivasan
  2. About LongTail Video
     • Home of JW Player
       – JW Player is embedded on over 2 million sites
     • Founded in 2007
     • 32 employees
     • $5M investment
     • Headquartered in New
  3. JW Player - Key Features
     • Works on all mobile devices and desktops.
       Chrome, IE, Firefox, iOS, Android, etc.
     • Easy to customize, extend and embed.
       Scripting API, PNG skinning, management dashboard
     • HD-quality, secure, adaptive streaming.
       Utilizing Apple HTTP Live Streaming
     • Cross-platform advertising & analytics.
       VAST/VPAID, SiteCatalyst, Google
  4. JW Analytics: Numbers and Tech Stack
     JW Player numbers (version 6.0 and above), May 2013:
     • 156 million unique viewers (international)
     • 24 million unique viewers (USA)
     • 1.04 billion video streams (plays)
     • 29.94 million hours of video watched
     • 134,000 live domains
     • 16 billion analytics events
     • 20,000 simultaneous pings per second (peak)
     • 3 TB (gzip compressed) per month
     • 12-15 TB (uncompressed) per month
     Technology stack:
     • Runs completely in Amazon AWS
     • Master node & ping nodes in EC2; Hadoop and HBase clusters run in EMR
     • We upload data to and process from S3
     • Full-stack Python: boto (AWS S3, EMR), happybase (HBase)
     • Look ma, no Java!
  5. JW Analytics: Demo
     • Available to the public
     • Must be a registered user of JW Player (free included!)
  6. Real-Time Analytics: The Holy Grail
     [Diagram: Raw logs with player data -> Crunch data -> Insert into a DB -> Database -> Real-time querying]
  7. Why We Chose HBase
     • Goal: Build "Google Analytics for video"!
     • Requirements:
       – Fast queries across data sets
       – Support date-range queries
       – Store huge amounts of aggregate data
       – Flexibility in dimensions used for rollup tables
     • HBase! But why?
       – Open source! And good community!
       – Based on & closely integrated with Hadoop
       – Facebook uses it (as do other large companies)
       – Amazon AWS released a "hosted" HBase solution on EMR
  8. JW Analytics Architecture
  9. Schema: HBase Row-Key Design
     Row key layout: QueryString _ yyyymmdd
     • QueryString: row prefix for a specific table
       – We need to know this ahead of time
       – Like the "WHERE" clause in SQL
     • yyyymmdd: date in yyyymmdd format
       – ISO 8601 ordering makes date-range scans lexicographic (perfect for HBase)
     • Allows us to do date-range queries
     • If we need new metrics, we just create a new table
       – Specify this in a JSON config file used by our Hadoop mapper
     • We don't use column filters, secondary indexes, etc.
     • We do need to know the "prefix" ahead of time
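The row-key layout above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' mapper: the config structure, the `row_key` helper, and the event fields are assumptions made for the example. Only the table names and the `prefix_yyyymmdd` key shape come from the slides.

```python
import json
from datetime import date

# Hypothetical JSON config: table name -> dimensions that form the row prefix.
# The real config file used by the Hadoop mapper is not shown in the slides.
CONFIG = json.loads("""
{
  "user_domains": ["user"],
  "user_domain_urls": ["user", "domain"]
}
""")


def row_key(table, event, day):
    """Build '<dim1>_<dim2>_..._yyyymmdd' for one rollup table."""
    prefix = "_".join(event[dim] for dim in CONFIG[table])
    return f"{prefix}_{day:%Y%m%d}"


# Hypothetical event, using the user and domain names from the slides.
event = {"user": "User1", "domain": "D1.com"}

# Keys generated out of order still sort chronologically as plain strings,
# which is what makes a date-range scan a simple lexicographic range.
keys = [row_key("user_domains", event, date(2013, 5, d)) for d in (31, 1, 15)]
print(sorted(keys))
```

Adding a metric in this sketch means adding one entry to the config, which mirrors the "just create a new table" point above.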
  10. E.g.: A Tale of Two Tables (Domains, URLs)

      import happybase

      conn = happybase.Connection(SERVER)  # SERVER: host running the Thrift server

      # User1: "I want my list of domains from May 1 to May 31, 2013"
      t = conn.table("user_domains")
      # happybase names the upper bound row_stop, and it is exclusive,
      # so stop at June 1 to include May 31.
      rows = t.scan(row_start="User1_20130501", row_stop="User1_20130601")
      # 'User1_20130501': { '': '100'; … }

      # User1: "Oooh, looks interesting. Wonder which URLs were
      # popular over 2 months." <Click>
      t = conn.table("user_domain_urls")
      rows = t.scan(row_start="User1_D1.com_20130501",
                    row_stop="User1_D1.com_20130701")
      # 'User1_D1.com_20130501': { '': '80' }
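One detail worth spelling out for date-range scans: HBase treats the stop row as exclusive, so the stop key should name the day after the last day wanted. The sketch below computes such bounds and checks them against an in-memory stand-in for a sorted table. `scan_bounds` and `fake_scan` are hypothetical helpers written for this example, not part of happybase.

```python
from datetime import date, timedelta


def day_key(prefix, day):
    """Row key in the '<prefix>_yyyymmdd' layout used in the slides."""
    return f"{prefix}_{day:%Y%m%d}"


def scan_bounds(prefix, first_day, last_day):
    """Return (row_start, row_stop) covering first_day..last_day inclusive."""
    return day_key(prefix, first_day), day_key(prefix, last_day + timedelta(days=1))


def fake_scan(sorted_keys, row_start, row_stop):
    """Stand-in for a range scan: start inclusive, stop exclusive."""
    return [k for k in sorted_keys if row_start <= k < row_stop]


# One key per day for May and June 2013 (61 days).
keys = sorted(day_key("User1", date(2013, 5, 1) + timedelta(days=i)) for i in range(61))

start, stop = scan_bounds("User1", date(2013, 5, 1), date(2013, 5, 31))
may = fake_scan(keys, start, stop)
print(start, stop, len(may))
```

Computing the stop key from a real date also avoids hand-typed bounds that name a day that does not exist.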
  11. HBase + Thrift Setup
      [Diagram: a Master node and two Data nodes, each running Thrift; the API talks to the Master, Hadoop to the Data nodes]
      • Used for HBase RPC with non-Java languages (e.g.: Python!)
      • Thrift runs on all nodes in our HBase clusters
        – Thrift on Master is read-only: used by API
        – Thrift on Data Nodes is write-only: data inserts from Hadoop
      • We use batch puts/inserts to improve write speed
        – Our analytics is VERY write-intensive
      Thrift is …?
      • RPC framework developed at Facebook, now in wide use
      • NOT the Macklemore & Ryan Lewis music video (that's Thrift Shop!)
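Batched puts through happybase might look like the sketch below. It is an illustration under assumptions, not the authors' insert code: the `write_counts` helper, the `cf:count` column name, and the host name in the usage comment are hypothetical. The `table.batch(batch_size=...)` context manager and `batch.put(row, data)` are happybase calls.

```python
def write_counts(table, counts, column=b"cf:count", batch_size=1000):
    """Write {row_key: count} using happybase's batching.

    `table` is expected to behave like a happybase.Table. Mutations are
    buffered and sent to the Thrift server every `batch_size` puts, and
    once more when the `with` block exits.
    """
    with table.batch(batch_size=batch_size) as batch:
        for key, count in counts.items():
            # Row keys, column names, and values are sent as bytes.
            batch.put(key.encode("utf-8"), {column: str(count).encode("utf-8")})
    return len(counts)


# Usage against a real cluster (hypothetical host name):
#   import happybase
#   conn = happybase.Connection("datanode-1")
#   write_counts(conn.table("user_domains"), {"User1_20130501": 100})
```

Sending many puts per Thrift round trip is the point of batching; the right `batch_size` depends on row size and cluster capacity.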
  12. What We Like About HBase
      • Giant, sorted key-value store
        – Hadoop output (also key-value!) can have a 1-to-1 correspondence to HBase
      • FAST lookups over large data set
        – O(1) lookup time to find key; lookups complete in ms across billion-plus rows
      • Usually retrieval is fast as well
        – But slow if data sets are large!
        – O(n). No simple way to solve this.
        – Most times you only need top N => can be solved through optimization of key
      [Diagram: "All HBase data" with a small "Data we want" region; O(1) lookup = fast! O(n) read = could be slow]
      Got good row-key design? HBase excels at finding needles in haystacks!
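The slides do not say how the key was optimized for top-N reads. One common approach, shown here purely as an assumption, is to embed an inverted, zero-padded count in the row key so that the largest counts sort first and a scan can stop after N rows. `MAX_COUNT`, `ranked_key`, and `top_n` are hypothetical names, and a sorted Python list stands in for the HBase table.

```python
import bisect

MAX_COUNT = 10**12  # assumed upper bound on any single count (13 digits)


def ranked_key(prefix, day, count, item):
    """'<prefix>_<yyyymmdd>_<inverted count>_<item>': bigger counts sort first."""
    inverted = MAX_COUNT - count
    return f"{prefix}_{day}_{inverted:013d}_{item}"


def top_n(sorted_keys, prefix, day, n):
    """Read at most n rows from the start of one prefix/day range."""
    start = f"{prefix}_{day}_"
    i = bisect.bisect_left(sorted_keys, start)  # seek to the range start
    items = []
    for key in sorted_keys[i:i + n]:            # read n rows, not the whole range
        if not key.startswith(start):
            break
        _inverted, item = key[len(start):].split("_", 1)
        items.append(item)
    return items


table = sorted([
    ranked_key("User1", "20130501", 80, "D1.com"),
    ranked_key("User1", "20130501", 300, "D2.com"),
    ranked_key("User1", "20130501", 5, "D3.com"),
    ranked_key("User1", "20130502", 999, "D4.com"),
])
print(top_n(table, "User1", "20130501", 2))
```

The trade-off is that the count must be known when the row is written, which fits precomputed rollups but not counters updated in place.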
  13. Challenges With HBase
      • Most programmers prefer SQL queries, not list iteration
        – "Why can't I do a SELECT * FROM domains WHERE …???"
      • Thrift server goes down under load
        – We wrote our own HBase Thrift watchdog script
      • We deal with pretty exotic bugs at scale…
        – … with sometimes one blog post documenting a fix.
        – When was the last time Google showed you one useful result?
      • Some things we dealt with (we are on HBase 0.92)
        – org.apache.hadoop.hbase.NotServingRegionException
          • SSH into master, clean out ZooKeeper metadata, restart master.
          • Kinda scary the first time you actually do this?
        – java.util.concurrent.RejectedExecutionException (hbck)
          • Ticket #6018 ("hbck fails … when >50 regions present"); fixed in 0.94.1
        – org.apache.hadoop.hbase.MasterNotRunningException
  14. Conclusion
      • Real-time analytics on Hadoop and HBase
        – Handling 16 billion events a month (~15 TB data)
        – Inserting ~80 million data points into HBase daily
        – Running in production for 7 months!
        – Did I mention we built it on Python (& bash)?
      • Important lessons
        – Design your row key well (with room to iterate)
        – Give HBase as much memory/CPU as it needs
          • HBase is resource-hungry; better to over-provision
        – Back up frequently!
      Questions?