Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
 

At LongTail Video, we use Hadoop and HBase for our real-time analytics engine. This was presented at HBaseCon 2013 in San Francisco.



Presentation Transcript

    • Hadoop and HBase for Real-Time Video Analytics
      Suman Srinivasan
    • About LongTail Video
      – Home of JW Player: the player is embedded on over 2 million sites
      – Founded in 2007
      – 32 employees
      – $5M investment
      – Headquartered in New York
      (Customer logos shown: disney.co.uk, chevrolet.com)
    • JW Player - Key Features
      – Works on all mobile devices and desktops: Chrome, IE, Firefox, iOS, Android, etc.
      – Easy to customize, extend, and embed: scripting API, PNG skinning, management dashboard
      – HD-quality, secure, adaptive streaming, utilizing Apple HTTP Live Streaming
      – Cross-platform advertising & analytics: VAST/VPAID, SiteCatalyst, Google
    • JW Analytics: Numbers and Tech Stack
      JW Player numbers (version 6.0 and above), May 2013:
      – 156 million unique viewers (international)
      – 24 million unique viewers (USA)
      – 1.04 billion video streams (plays)
      – 29.94 million hours of video watched
      – 134,000 live domains
      – 16 billion analytics events
      – 20,000 simultaneous pings per second (peak)
      – 3 TB (gzip compressed) per month
      – 12-15 TB (uncompressed) per month
      Technology stack:
      – Runs completely in Amazon AWS
      – Master node & ping nodes in EC2; Hadoop and HBase clusters run in EMR
      – We upload data to and process from S3
      – Full-stack Python: boto (AWS S3, EMR), happybase (HBase). Look ma, no Java!
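      The slide names the pieces but not the glue between them. A minimal sketch of the "upload data to S3" step using the classic boto API (the bucket name and key layout here are hypothetical, not from the deck):

        import boto
        from boto.s3.key import Key

        conn = boto.connect_s3()                       # credentials from the boto config / environment
        bucket = conn.get_bucket("jw-analytics-logs")  # hypothetical bucket name
        k = Key(bucket)
        k.key = "pings/2013/05/01/pings-000.gz"        # hypothetical key layout
        k.set_contents_from_filename("pings-000.gz")   # upload one gzipped log batch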
    • JW Analytics: Demo
      – Available to the public
      – Must be a registered user of JW Player (free included!)
      – http://account.longtailvideo.com/
    • Real-Time Analytics: The Holy Grail
      (Diagram: raw logs with player data → crunch data → insert into a database → real-time querying)
    • Why We Chose HBase
      – Goal: build “Google Analytics for video”!
      – Requirements:
        • Fast queries across data sets
        • Support date-range queries
        • Store huge amounts of aggregate data
        • Flexibility in dimensions used for rollup tables
      – HBase! But why?
        • Open source! And a good community!
          – Based on & closely integrated with Hadoop
        • Facebook uses it (as do other large companies)
        • Amazon AWS released a “hosted” HBase solution on EMR
    • JW Analytics Architecture
      (Architecture diagram)
    • Schema: HBase Row-Key Design
      – Row key format: QueryString_yyyymmdd
        • The row prefix (QueryString) identifies the dimension for a specific table; we need to know it ahead of time, like the “WHERE” clause in SQL
        • The date suffix is in yyyymmdd format; ISO 8601 ordering makes date-range scans lexicographic (perfect for HBase)
      – Allows us to do date-range queries
      – If we need new metrics, we just create a new table
        • Specify this in a JSON config file used by our Hadoop mapper
      – We don’t use column filters, secondary indexes, etc.
      – We do need to know the “prefix” ahead of time
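      A minimal sketch of how a key in this format might be assembled (the helper is illustrative, not from the deck):

        import datetime

        def make_row_key(prefix, day):
            """Build a row key of the form <prefix>_<yyyymmdd>.

            The prefix plays the role of the WHERE clause (e.g. a user ID),
            and the yyyymmdd suffix keeps rows sorted by date, so date-range
            queries become simple lexicographic scans.
            """
            return "%s_%s" % (prefix, day.strftime("%Y%m%d"))

        make_row_key("User1", datetime.date(2013, 5, 1))  # -> 'User1_20130501'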
    • E.g.: A Tale of Two Tables (Domains, URLs)

        import happybase

        conn = happybase.Connection(SERVER)

        # User1: "I want my list of domains from May 1 to May 31, 2013"
        t = conn.table("user_domains")
        t.scan(row_start="User1_20130501", row_stop="User1_20130531")
        # 'User1_20130501': {'cf:D1.com': '100', ...}

        # User1: "Oooh, D1.com looks interesting. Wonder what
        # the URLs were popular for 2 months." <Click>
        t = conn.table("user_domain_urls")
        t.scan(row_start="User1_D1.com_20130501",
               row_stop="User1_D1.com_20130631")
        # 'User1_D1.com_20130501': {'cf:D1.com/url': '80'}
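      For context: happybase’s scan() returns a generator of (row_key, columns) pairs, and its row_stop bound is exclusive, so callers iterate over the results. A minimal usage sketch for the domains query above:

        # Iterate the domains scan; 'cf' is the column family from the slide.
        table = conn.table("user_domains")
        for row_key, columns in table.scan(row_start="User1_20130501",
                                           row_stop="User1_20130601"):
            for column, value in columns.items():
                domain = column.split(":", 1)[1]  # strip the 'cf:' family prefix
                print(row_key, domain, value)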
    • HBase + Thrift Setup
      (Diagram: the API reads through Thrift on the Master node; Hadoop writes through Thrift on the Data nodes)
      – Thrift is …? An RPC framework developed at Facebook, now in wide use (NOT the Macklemore & Ryan Lewis music video; that’s “Thrift Shop”!)
      – Used for HBase RPC with non-Java languages (e.g.: Python!)
      – Thrift runs on all nodes in our HBase clusters
        • Thrift on the Master is read-only: used by the API
        • Thrift on the Data nodes is write-only: data inserts from Hadoop
      – We use batch puts/inserts to improve write speed (see the sketch below)
        • Our analytics is VERY write-intensive
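      The batch puts mentioned above map directly onto happybase’s Batch API. A minimal sketch (the table name, keys, and counts are illustrative):

        # Batched inserts: puts are buffered client-side and sent in bulk.
        table = conn.table("user_domains")
        with table.batch(batch_size=1000) as b:  # auto-flush every 1000 puts
            b.put("User1_20130501", {"cf:D1.com": "100"})
            b.put("User1_20130502", {"cf:D1.com": "97"})
        # leaving the `with` block flushes any remaining puts in one round trip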
    • What We Like About HBase
      – Giant, sorted key-value store
        • Hadoop output (also key-value!) can have a 1-to-1 correspondence to HBase
      – FAST lookups over a large data set
        • O(1) lookup time to find a key; lookups complete in milliseconds across billion-plus rows
      – Usually retrieval is fast as well
        • But slow if the returned data set is large! O(n); no simple way to solve this
        • Most times you only need the top N, which can be solved through optimization of the key
      (Diagram: an O(1) lookup quickly finds the “data we want” within all HBase data; the O(n) read of that data could be slow)
      Got good row-key design? HBase excels at finding needles in haystacks!
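      One way the “top N through key optimization” point plays out in practice: if the row key already sorts rows the way a query wants them, “top N” becomes a bounded prefix scan rather than an O(n) read. A hedged sketch (happybase’s scan() supports row_prefix and limit; the table and prefix are illustrative):

        # Read only the first 10 rows under a prefix instead of the whole range.
        table = conn.table("user_domains")
        for row_key, columns in table.scan(row_prefix="User1_", limit=10):
            print(row_key, columns)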
    • Challenges With HBase
      – Most programmers prefer SQL queries, not list iteration
        • “Why can’t I do a SELECT * FROM domains WHERE …???”
      – The Thrift server goes down under load
        • We wrote our own HBase Thrift watchdog script
      – We deal with pretty exotic bugs at scale…
        • … sometimes with only one blog post documenting a fix
        • When was the last time Google showed you just one useful result?
      – Some things we dealt with (we are on HBase 0.92):
        • org.apache.hadoop.hbase.NotServingRegionException
          – SSH into the master, clean out the ZooKeeper meta-data, restart the master
          – Kinda scary the first time you actually do this!
        • java.util.concurrent.RejectedExecutionException (hbck)
          – Ticket #6018 (“hbck fails … when >50 regions present”); fixed in 0.94.1
        • org.apache.hadoop.hbase.MasterNotRunningException
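      The deck doesn’t show the watchdog itself; the idea can be as small as probing the Thrift server and restarting it on failure. A toy sketch (the port and the restart command are assumptions, not from the deck):

        # Toy Thrift watchdog: probe the local Thrift server, restart on failure.
        import subprocess
        import time

        import happybase

        while True:
            try:
                happybase.Connection("localhost", port=9090).tables()  # probe
            except Exception:
                # Restart command is an assumption; use whatever manages
                # the Thrift daemon on your nodes.
                subprocess.call(["/etc/init.d/hbase-thrift", "restart"])
            time.sleep(30)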
    • Conclusion
      – Real-time analytics on Hadoop and HBase
        • Handling 16 billion events a month (~15 TB of data)
        • Inserting ~80 million data points into HBase daily
        • Running in production for 7 months!
        • Did I mention we built it on Python (& bash)?
      – Important lessons
        • Design your row key well (with room to iterate)
        • Give HBase as much memory/CPU as it needs
          – HBase is resource-hungry; better to over-provision
        • Back up frequently!
      Questions?