Hadoop and HBase for Real-Time
Video Analytics
Suman Srinivasan
About LongTail Video
• Home of JW player
– JW player is embedded on
over 2 million sites
• Founded in 2007
• 32 Employees
• $5M investment
• Headquartered in New York
disney.co.uk
chevrolet.com
JW Player - Key Features
Works on all mobile devices and desktops.
Chrome, IE, Firefox, iOS, Android, etc
Easy to customize, extend and embed.
Scripting API, PNG Skinning, Mgmt dashboard
HD-quality, secure, adaptive streaming.
Utilizing Apple HTTP Live Streaming
Cross-platform advertising & analytics.
VAST/VPAID, SiteCatalyst, Google
JW Analytics: Numbers and Tech Stack
• 156 million unique viewers - intl
• 24 million unique viewers – USA
• 1.04 billion video streams (plays)
• 29.94 million hours of video watched
• 134,000 live domains
• 16 billion analytics events
• 20,000 simultaneous pings per
second (peak)
• 3 TB (gzip compressed) per month
• 12-15 TB (uncompressed) per month
Technology Stack
•Runs completely in Amazon AWS
•Master node & ping nodes in EC2; Hadoop and HBase clusters run in EMR
•We upload data to and process from S3
•Full-stack Python: boto (AWS S3, EMR), happybase (HBase)
• Look ma, no Java!
JW Player Numbers (Version 6.0 and above) – May 2013
JW Analytics: Demo
• Available to the public
• Must be a registered user of JWPlayer (free included!)
http://account.longtailvideo.com/
Real-Time Analytics: The Holy Grail
[Diagram: Raw logs with player data → Crunch data → Insert into a DB → Database → Real-time querying]
Why We Chose HBase
• Goal: Build “Google Analytics for video”!
• Requirements:
– Fast queries across data sets
– Support date-range queries
– Store huge amounts of aggregate data
– Flexibility in dimensions used for rollup tables
• HBase! But why?
– Open source! And good community!
• Based on & closely integrated with Hadoop
– Facebook uses it (as do other large companies)
– Amazon AWS released a “hosted” HBase solution on EMR
JW Analytics Architecture
Schema: HBase Row-Key Design
• Allows us to do date range queries
• If we need new metrics, we just create a new table
– Specify this in a JSON config file used by our Hadoop mapper
• We don’t use column filters, secondary indexes, etc
• We do need to know the “prefix” ahead of time
QueryString _ yyyy mm dd
Row prefix for a specific table
•We need to know this ahead of time
•Like the “WHERE” clause in SQL
Date in yyyymmdd format
•ISO 8601 dates sort lexicographically, so date-range scans are contiguous (perfect for HBase)
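As a small illustration of the row-key layout above, the sketch below builds keys of the form prefix_yyyymmdd and checks that plain byte ordering matches date ordering. The helper name row_key and the sample dates are assumptions made for this example, not code from the talk.

```python
from datetime import date


def row_key(prefix, day):
    """Build '<prefix>_<yyyymmdd>' as bytes, e.g. b'User1_20130501'."""
    return "{}_{}".format(prefix, day.strftime("%Y%m%d")).encode("utf-8")


# Days deliberately listed out of order.
days = [date(2013, 6, 1), date(2013, 5, 31), date(2013, 5, 1), date(2012, 12, 31)]
keys = [row_key("User1", d) for d in days]

# HBase stores rows sorted by the raw bytes of the key. With yyyymmdd dates,
# that byte order is also chronological order, so a date range for one
# prefix is a single contiguous block of rows.
sorted_keys = sorted(keys)
chronological = [row_key("User1", d) for d in sorted(days)]
```

Zero-padded, fixed-width date fields are what make this work; a format such as d/m/yyyy would not sort chronologically.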
E.g.: A Tale of Two Tables (Domains, URLs)
import happybase
conn = happybase.Connection(SERVER)  # SERVER: hostname of the HBase Thrift server
# User1: "I want my list of domains from May 1 to
# May 31, 2013"
t = conn.table("user_domains")
# row_stop is exclusive, so stop at June 1 to include May 31
for key, data in t.scan(row_start=b"User1_20130501",
                        row_stop=b"User1_20130601"):
    print(key, data)
# b'User1_20130501': {b'cf:D1.com': b'100', …}
# User1: "Oooh, D1.com looks interesting. Wonder
# which URLs were popular over 2 months." <Click>
t = conn.table("user_domain_urls")
for key, data in t.scan(row_start=b"User1_D1.com_20130501",
                        row_stop=b"User1_D1.com_20130701"):
    print(key, data)
# b'User1_D1.com_20130501': {b'cf:D1.com/url': b'80'}
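To make the range-scan semantics concrete without a running cluster, here is a minimal in-memory sketch. It assumes only that rows are kept in sorted byte order and that the stop row is exclusive, which is how HBase scans behave. The rows dictionary and its values are made-up sample data, and scan is a stand-in written for this example, not the happybase method.

```python
from bisect import bisect_left

# Stand-in for an HBase table: row keys kept in sorted byte order.
rows = {
    b"User1_20130430": {b"cf:D1.com": b"90"},
    b"User1_20130501": {b"cf:D1.com": b"100"},
    b"User1_20130531": {b"cf:D1.com": b"120"},
    b"User1_20130601": {b"cf:D1.com": b"70"},
    b"User2_20130515": {b"cf:D9.com": b"5"},
}
sorted_keys = sorted(rows)


def scan(row_start, row_stop):
    """Yield (key, data) for row_start <= key < row_stop, in key order."""
    lo = bisect_left(sorted_keys, row_start)
    hi = bisect_left(sorted_keys, row_stop)
    for key in sorted_keys[lo:hi]:
        yield key, rows[key]


# All of May 2013 for User1: stop at June 1 because the stop row is excluded.
may = [key for key, _ in scan(b"User1_20130501", b"User1_20130601")]
```

Because the stop row is exclusive, a scan that stops at User1_20130531 would silently drop May 31; stopping at the first day after the range avoids that.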
HBase + Thrift Setup
[Diagram: one Master node and two Data nodes, each running Thrift (T); the API connects to the Master, Hadoop connects to the Data nodes]
• Used for HBase RPC with non-Java languages (e.g.: Python!)
• Thrift runs on all nodes in our HBase clusters
– Thrift on Master is read-only: used by API
– Thrift on Data Nodes is write-only: data inserts from Hadoop
• We use batch puts/inserts to improve write speed
– Our analytics is VERY write-intensive
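The batched writes mentioned above could look roughly like the following sketch. It relies on happybase's Table.batch(batch_size=...) context manager and Batch.put(row, data); the function name write_counts, the shape of the counts mapping, and the batch size of 1000 are assumptions for illustration, not the authors' code.

```python
def write_counts(table, counts, batch_size=1000):
    """Write {row_key: {column: value}} to an HBase table in batches.

    `table` is expected to behave like a happybase.Table.
    Returns the number of rows submitted.
    """
    written = 0
    # The batch buffers mutations and sends them every `batch_size` puts;
    # leaving the `with` block flushes whatever is still buffered.
    with table.batch(batch_size=batch_size) as batch:
        for row, data in counts.items():
            batch.put(row, data)
            written += 1
    return written
```

With a live cluster, table would come from something like happybase.Connection(host).table("user_domains"), where host is the address of a Thrift server. Batching trades a little latency for far fewer Thrift round trips, which matters for a write-heavy workload.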
Thrift is …?
RPC framework
developed at Facebook,
now in wide use
NOT the Macklemore &
Ryan Lewis music video
(that’s Thrift Shop!)
What We Like About HBase
• Giant, sorted key-value store
– Hadoop output (also key-value!) can have
a 1-to-1 correspondence to HBase
• FAST lookups over large data set
– O(1) lookup time to find key; lookups
complete in ms across billion-plus rows
• Usually retrieval is fast as well
– But slow if data sets are large!
– O(n). No simple way to solve this.
– Most times you only need top N => can be
solved through optimization of key
[Diagram: the data we want is a small slice of all HBase data. O(1) lookup = fast! O(n) read = could be slow]
Got good row-key design? HBase excels at finding needles in haystacks!
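One possible way to get "top N" out of key design, sketched here as an assumption rather than the scheme the talk used, is to embed an inverted, zero-padded count in the row key so the largest counts sort first. MAX_COUNT, top_n_key and the sample play counts are all hypothetical.

```python
MAX_COUNT = 10**12 - 1  # assumed upper bound on any single count
WIDTH = len(str(MAX_COUNT))


def top_n_key(prefix, day, count, item):
    """Key that sorts by descending count within '<prefix>_<yyyymmdd>_'."""
    inverted = str(MAX_COUNT - count).zfill(WIDTH)
    return "{}_{}_{}_{}".format(prefix, day, inverted, item).encode("utf-8")


plays = {"D1.com": 100, "D2.com": 2500, "D3.com": 7, "D4.com": 900}
keys = sorted(top_n_key("User1", "20130501", n, d) for d, n in plays.items())

# Reading only the first N keys under the prefix gives the top N items,
# so the read cost depends on N rather than on the size of the data set.
top2 = [k.decode("utf-8").rsplit("_", 1)[1] for k in keys[:2]]
```

With happybase this would correspond to a scan using row_prefix together with limit=N. The trade-off is that the count is part of the key, so such rows can only be written once the count is final, for example at the end of a Hadoop job.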
Challenges With HBase
• Most programmers prefer SQL queries, not list iteration
– “Why can’t I do a SELECT * FROM domains WHERE …???”
• Thrift server goes down under load
– We wrote our own HBase Thrift watchdog script
• We deal with pretty exotic bugs at scale…
– … with sometimes one blog post documenting a fix.
– When was the last time Google showed you one useful result? 
• Some things we dealt with (we are on HBase 0.92)
– org.apache.hadoop.hbase.NotServingRegionException
• SSH into master, clean out Zookeeper meta-data, restart master.
• Kinda scary the first time you actually do this?
– java.util.concurrent.RejectedExecutionException (hbck)
• Ticket #6018 (“hbck fails … when >50 regions present”); fixed in 0.94.1
– org.apache.hadoop.hbase.MasterNotRunningException
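The watchdog script itself is not shown in the slides. A minimal sketch of the idea, assuming the check is simply "does the Thrift port accept TCP connections", might look like this. The function names and the restart command are hypothetical placeholders.

```python
import socket
import subprocess


def thrift_is_up(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return False
    sock.close()
    return True


def check_once(host, port, restart_cmd, run=subprocess.call):
    """Restart the Thrift server if its port does not accept connections.

    `restart_cmd` is whatever command restarts Thrift on this machine
    (a placeholder here). Returns True when a restart was triggered.
    """
    if thrift_is_up(host, port):
        return False
    run(restart_cmd)
    return True
```

A call such as check_once("localhost", 9090, ["/path/to/restart-thrift.sh"]) could be run from cron every minute; 9090 is the default HBase Thrift port, and the script path is made up. A port check only detects a dead process, so a server that is hung but still listening would need a deeper probe, such as a real Thrift request.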
Conclusion
• Real-time analytics on Hadoop and HBase
– Handling 16 billion events a month (~15 TB data)
– Inserting ~80 million data points into HBase daily
– Running in production for 7 months!
– Did I mention we built it on Python (& bash)?
• Important lessons
– Design your row key well (with room to iterate)
– Give HBase as much memory/CPU as it needs
• HBase is resource-hungry; better to over-provision
– Backup frequently!
Questions?

SlideShare listing title: Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)

Austin Scales- Clickstream Analytics at Bazaarvoice
Austin Scales- Clickstream Analytics at BazaarvoiceAustin Scales- Clickstream Analytics at Bazaarvoice
Austin Scales- Clickstream Analytics at Bazaarvoicebazaarvoice_engineering
 
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander ZaitsevWebinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander ZaitsevAltinity Ltd
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNATomas Cervenka
 
Intro to HBase - Lars George
Intro to HBase - Lars GeorgeIntro to HBase - Lars George
Intro to HBase - Lars GeorgeJAX London
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Data & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon RedshiftData & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon RedshiftAmazon Web Services
 
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyGuillaume Lefranc
 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airshipdave_revell
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Michael Rys
 
SQL Server and SharePoint - Best Practices presented by Steffen Krause, Micro...
SQL Server and SharePoint - Best Practices presented by Steffen Krause, Micro...SQL Server and SharePoint - Best Practices presented by Steffen Krause, Micro...
SQL Server and SharePoint - Best Practices presented by Steffen Krause, Micro...European SharePoint Conference
 
A Presentation on MongoDB Introduction - Habilelabs
A Presentation on MongoDB Introduction - HabilelabsA Presentation on MongoDB Introduction - Habilelabs
A Presentation on MongoDB Introduction - HabilelabsHabilelabs
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summitOpen Analytics
 
H base introduction & development
H base introduction & developmentH base introduction & development
H base introduction & developmentShashwat Shriparv
 
Mongo DB at Community Engine
Mongo DB at Community EngineMongo DB at Community Engine
Mongo DB at Community EngineCommunity Engine
 
MongoDB at community engine
MongoDB at community engineMongoDB at community engine
MongoDB at community enginemathraq
 
node-crate: node.js and big data
 node-crate: node.js and big data node-crate: node.js and big data
node-crate: node.js and big dataStefan Thies
 

Similar to Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013) (20)

Horizon for Big Data
Horizon for Big DataHorizon for Big Data
Horizon for Big Data
 
Austin Scales- Clickstream Analytics at Bazaarvoice
Austin Scales- Clickstream Analytics at BazaarvoiceAustin Scales- Clickstream Analytics at Bazaarvoice
Austin Scales- Clickstream Analytics at Bazaarvoice
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
 
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander ZaitsevWebinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
 
Apache drill
Apache drillApache drill
Apache drill
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
 
Intro to HBase - Lars George
Intro to HBase - Lars GeorgeIntro to HBase - Lars George
Intro to HBase - Lars George
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Data & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon RedshiftData & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon Redshift
 
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative study
 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airship
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
 
SQL Server and SharePoint - Best Practices presented by Steffen Krause, Micro...
SQL Server and SharePoint - Best Practices presented by Steffen Krause, Micro...SQL Server and SharePoint - Best Practices presented by Steffen Krause, Micro...
SQL Server and SharePoint - Best Practices presented by Steffen Krause, Micro...
 
A Presentation on MongoDB Introduction - Habilelabs
A Presentation on MongoDB Introduction - HabilelabsA Presentation on MongoDB Introduction - Habilelabs
A Presentation on MongoDB Introduction - Habilelabs
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
 
H base introduction & development
H base introduction & developmentH base introduction & development
H base introduction & development
 
Mongo DB at Community Engine
Mongo DB at Community EngineMongo DB at Community Engine
Mongo DB at Community Engine
 
MongoDB at community engine
MongoDB at community engineMongoDB at community engine
MongoDB at community engine
 
Hive
HiveHive
Hive
 
node-crate: node.js and big data
 node-crate: node.js and big data node-crate: node.js and big data
node-crate: node.js and big data
 

More from Suman Srinivasan

More from Suman Srinivasan (9)

Data science and Artificial Intelligence
Data science and Artificial IntelligenceData science and Artificial Intelligence
Data science and Artificial Intelligence
 
PHP, LAMP Stack & WordPress
PHP, LAMP Stack & WordPressPHP, LAMP Stack & WordPress
PHP, LAMP Stack & WordPress
 
My PhD Thesis
My PhD Thesis My PhD Thesis
My PhD Thesis
 
OSGi summary
OSGi summaryOSGi summary
OSGi summary
 
ActiveCDN on NetServ
ActiveCDN on NetServActiveCDN on NetServ
ActiveCDN on NetServ
 
Suman's PhD Candidacy Talk
Suman's PhD Candidacy TalkSuman's PhD Candidacy Talk
Suman's PhD Candidacy Talk
 
7DS Version 1
7DS Version 17DS Version 1
7DS Version 1
 
BonAHA framework - Lab presentation
BonAHA framework - Lab presentationBonAHA framework - Lab presentation
BonAHA framework - Lab presentation
 
BonAHA framework - IEEE CCNC 2009
BonAHA framework - IEEE CCNC 2009BonAHA framework - IEEE CCNC 2009
BonAHA framework - IEEE CCNC 2009
 



Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)

  • 1. Hadoop and HBase for Real-Time Video Analytics Suman Srinivasan
  • 2. About LongTail Video • Home of JW Player – JW Player is embedded on 2 million+ sites (e.g. disney.co.uk, chevrolet.com) • Founded in 2007 • 32 Employees • $5M investment • Headquartered in New York
  • 3. JW Player - Key Features Works on all mobile devices and desktops. Chrome, IE, Firefox, iOS, Android, etc Easy to customize, extend and embed. Scripting API, PNG Skinning, Mgmt dashboard HD-quality, secure, adaptive streaming. Utilizing Apple HTTP Live Streaming Cross-platform advertising & analytics. VAST/VPAID, SiteCatalyst, Google
  • 4. JW Analytics: Numbers and Tech Stack • 156 million unique viewers - intl • 24 million unique viewers – USA • 1.04 billion video streams (plays) • 29.94 million hours of video watched • 134,000 live domains • 16 billion analytics events • 20,000 simultaneous pings per second (peak) • 3 TB (gzip compressed) per month • 12-15 TB (uncompressed) per month Technology Stack •Runs completely in Amazon AWS •Master node & ping nodes in EC2; Hadoop and HBase clusters run in EMR •We upload data to and process from S3 •Full-stack Python: boto (AWS S3, EMR), happybase (HBase) • Look ma, no Java! JW Player Numbers (Version 6.0 and above) – May 2013
  • 5. JW Analytics: Demo • Available to the public • Must be a registered user of JWPlayer (free included!) http://account.longtailvideo.com/
  • 6. Real-Time Analytics: The Holy Grail [Diagram: raw logs with player data → crunch data → insert into a DB → real-time querying]
  • 7. Why We Chose HBase • Goal: Build “Google Analytics for video”! • Requirements: – Fast queries across data sets – Support date-range queries – Store huge amounts of aggregate data – Flexibility in dimensions used for rollup tables • HBase! But why? – Open source! And good community! • Based on & closely integrated with Hadoop – Facebook uses it (as do other large companies) – Amazon AWS released a “hosted” HBase solution on EMR
  • 9. Schema: HBase Row-Key Design • Allows us to do date range queries • If we need new metrics, we just create a new table – Specify this in a JSON config file used by our Hadoop mapper • We don’t use column filters, secondary indexes, etc • We do need to know the “prefix” ahead of time Row-key layout: QueryString_yyyymmdd QueryString: row prefix for a specific table • We need to know this ahead of time • Like the “WHERE” clause in SQL Date in yyyymmdd format • ISO 8601 ordering makes date range scans lexicographic (perfect for HBase)
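The row-key layout on this slide can be sketched in a few lines of plain Python. This is only an illustration, not the code from the talk: `make_row_key` and `scan_bounds` are hypothetical helper names, and the stop key is set to the day after the last wanted date on the assumption that the scanner treats its stop row as exclusive.

```python
from datetime import date, timedelta


def make_row_key(prefix, day):
    """Build a '<prefix>_<yyyymmdd>' row key (hypothetical helper)."""
    return f"{prefix}_{day:%Y%m%d}"


def scan_bounds(prefix, first, last):
    """Return (start, stop) keys covering first..last inclusive.

    The stop key is the day after `last`, which suits scanners whose
    stop row is exclusive.
    """
    return make_row_key(prefix, first), make_row_key(prefix, last + timedelta(days=1))


# 40 consecutive daily keys starting on 30 April 2013.
keys = [make_row_key("User1", date(2013, 4, 30) + timedelta(days=i)) for i in range(40)]

# Because the date is zero-padded yyyymmdd, string order equals calendar order.
assert keys == sorted(keys)

start, stop = scan_bounds("User1", date(2013, 5, 1), date(2013, 5, 31))
in_range = [k for k in keys if start <= k < stop]
```

Here `in_range` holds exactly the keys for May 2013, selected by string comparison alone, which is the property the slide relies on for date-range scans.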
  • 10. E.g.: A Tale of Two Tables (Domains, URLs)
      import happybase
      conn = happybase.Connection(SERVER)  # SERVER: Thrift host name, defined elsewhere
      # User1: "I want my list of domains from May 1 to May 31, 2013"
      t = conn.table("user_domains")
      # happybase names the upper bound row_stop, and it is exclusive
      t.scan(row_start="User1_20130501", row_stop="User1_20130601")
      # 'User1_20130501': {'cf:D1.com': '100', ...}
      # User1: "Oooh, D1.com looks interesting. Wonder which URLs were popular over 2 months." <Click>
      t = conn.table("user_domain_urls")
      t.scan(row_start="User1_D1.com_20130501", row_stop="User1_D1.com_20130701")
      # 'User1_D1.com_20130501': {'cf:D1.com/url': '80'}
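A scan like the ones on this slide yields one `(row_key, columns)` pair per day, so an API layer would typically still have to fold those rows into per-domain totals. The sketch below shows one way to do that; `total_by_column` and the sample rows are illustrative assumptions, not data or code from the talk.

```python
from collections import Counter


def total_by_column(rows):
    """Sum the per-day counts of every column across scanned rows.

    `rows` is any iterable of (row_key, {column: value}) pairs, which is
    the shape a happybase scan yields. Values are stored as text (or
    bytes), so they are converted with int().
    """
    totals = Counter()
    for _row_key, columns in rows:
        for column, value in columns.items():
            totals[column] += int(value)
    return totals


# Made-up stand-in for the output of a scan over "user_domains".
sample_rows = [
    ("User1_20130501", {"cf:D1.com": "100", "cf:D2.com": "7"}),
    ("User1_20130502", {"cf:D1.com": "50"}),
]

totals = total_by_column(sample_rows)
top = totals.most_common(1)  # the single most-played domain in the range
```

Because `Counter.most_common` sorts by count, the same helper also covers a "top N domains" view without a second pass over HBase.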
  • 11. HBase + Thrift Setup [Diagram: API → Thrift (T) on the Master node; Hadoop → Thrift (T) on each Data node] • Used for HBase RPC with non-Java languages (e.g.: Python!) • Thrift runs on all nodes in our HBase clusters – Thrift on Master is read-only: used by API – Thrift on Data Nodes is write-only: data inserts from Hadoop • We use batch puts/inserts to improve write speed – Our analytics is VERY write-intensive Thrift is …? RPC framework developed at Facebook, now in wide use NOT the Macklemore & Ryan Lewis music video (that’s Thrift Shop!)
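The batch puts mentioned on this slide might look roughly like the following with happybase, whose `Table.batch(batch_size=...)` context manager buffers `put` calls and sends them to Thrift in groups. The function name `write_counts`, the batch size of 1000, and the host and table names in the usage comment are assumptions for illustration; the talk does not show its insert code.

```python
def write_counts(table, rows, batch_size=1000):
    """Write (row_key, {column: value}) pairs through a happybase-style batch.

    `table` is expected to behave like happybase.Table: its batch()
    method returns a context manager with a put(row, data) method.
    Mutations are flushed every `batch_size` puts and once more when
    the `with` block exits.
    """
    written = 0
    with table.batch(batch_size=batch_size) as batch:
        for row_key, columns in rows:
            batch.put(row_key, columns)
            written += 1
    return written


# Usage against a real cluster (host and table names are placeholders):
#   import happybase
#   conn = happybase.Connection("data-node-1")
#   write_counts(conn.table("user_domains"),
#                [("User1_20130501", {"cf:D1.com": "100"})])
```

Grouping puts this way trades a little latency per row for far fewer Thrift round trips, which matters for a write-heavy workload.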
  • 12. What We Like About HBase • Giant, sorted key-value store – Hadoop output (also key-value!) can have a 1-to-1 correspondence to HBase • FAST lookups over large data set – O(1) lookup time to find key; lookups complete in ms across billion-plus rows • Usually retrieval is fast as well – But slow if data sets are large! – O(n). No simple way to solve this. – Most times you only need top N => can be solved through optimization of key [Diagram: all HBase data, with the data we want as a small slice; O(1) lookup = fast! O(n) read = could be slow] Got good row-key design? HBase excels at finding needles in haystacks!
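The split between a cheap lookup and a read whose cost grows with the number of rows returned can be modelled with a toy sorted store. This is a rough mental model only, not how HBase locates rows internally: `SortedStore` is a hypothetical class, and it finds the start of a range by binary search over an in-memory list.

```python
import bisect


class SortedStore:
    """Toy in-memory stand-in for a sorted key-value table."""

    def __init__(self, items):
        pairs = sorted(items)
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]

    def scan(self, row_start, row_stop, limit=None):
        """Return rows with row_start <= key < row_stop, at most `limit` of them."""
        # Locating both ends is cheap: binary search over the sorted keys.
        lo = bisect.bisect_left(self._keys, row_start)
        hi = bisect.bisect_left(self._keys, row_stop)
        if limit is not None:
            hi = min(hi, lo + limit)
        # Reading is the part that grows with the number of rows returned.
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))


# Three users, one row per day of May 2013, with made-up values.
store = SortedStore(
    (f"User{u}_201305{d:02d}", u * 100 + d) for u in range(1, 4) for d in range(1, 32)
)

may_first_week = store.scan("User2_20130501", "User2_20130508")
first_three = store.scan("User2_20130501", "User2_20130601", limit=3)
```

The `limit` argument is the "top N" idea in miniature: if the key is designed so that the rows you care about sort first, the read can stop early instead of walking the whole range.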
  • 13. Challenges With HBase • Most programmers prefer SQL queries, not list iteration – “Why can’t I do a SELECT * FROM domains WHERE …???” • Thrift server goes down under load – We wrote our own HBase Thrift watchdog script • We deal with pretty exotic bugs at scale… – … with sometimes one blog post documenting a fix. – When was the last time Google showed you one useful result? • Some things we dealt with (we are on HBase 0.92) – org.apache.hadoop.hbase.NotServingRegionException • SSH into master, clean out ZooKeeper metadata, restart master. • Kinda scary the first time you actually do this? – java.util.concurrent.RejectedExecutionException (hbck) • Ticket #6018 (“hbck fails … when >50 regions present”); fixed in 0.94.1 – org.apache.hadoop.hbase.MasterNotRunningException
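The talk does not show its Thrift watchdog, so the following is only a guess at the general shape such a script could take: probe the Thrift port over TCP and run a restart command when nothing answers. The function names are hypothetical, and the port (9090 is a common HBase Thrift default) and the restart command in the usage comment are assumptions to adapt to your own cluster.

```python
import socket
import subprocess


def port_is_open(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_and_restart(host, port, restart_cmd):
    """Run restart_cmd if host:port is not accepting connections.

    Returns True when a restart was attempted, False when the port
    answered and nothing was done.
    """
    if port_is_open(host, port):
        return False
    # restart_cmd is an argument list supplied by the caller; this sketch
    # does not assume any particular service manager.
    subprocess.run(restart_cmd, check=False)
    return True


# Example call, e.g. from a cron job (command is a placeholder):
#   check_and_restart("localhost", 9090, ["hbase-daemon.sh", "restart", "thrift"])
```

A TCP probe only shows that the port is listening; a stricter check would issue a small Thrift request, at the cost of more moving parts in the watchdog itself.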
  • 14. Conclusion • Real-time analytics on Hadoop and HBase – Handling 16 billion events a month (~15 TB data) – Inserting ~80 million data points into HBase daily – Running in production for 7 months! – Did I mention we built it on Python (& bash)? • Important lessons – Design your row key well (with room to iterate) – Give HBase as much memory/CPU as it needs • HBase is resource-hungry; better to over-provision – Backup frequently! Questions?