CONTENT IDENTIFICATION
USING HBASE
Daniel Nelson
3-10-2014
Copyright © 2013 The Nielsen Company. Confidential and proprietary.

Content Identification using HBase

Speaker: Daniel Nelson (Nielsen)

The motivation behind content identification is to determine the media people are consuming (via TV shows, movies, or streaming). Nielsen collects that data via its Fingerprints system, which generates significant amounts of structured data that is stored in HBase. This presentation will review the options a developer has for HBase querying and retrieval of hash data. Also covered is the use of wire protocols (Protocol Buffers), and how they can improve network efficiency and throughput, especially when combined with an HBase coprocessor.
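The network-efficiency point in the abstract can be illustrated with a small in-memory model. This is a plain-Python sketch, not HBase client or coprocessor code: the `REGIONS` data, the hash keys, and every function name are hypothetical, and the only thing it shows is that counting matches where the data lives returns one small summary per region instead of every matching row.

```python
from collections import Counter

# Hypothetical stand-in for region servers: each region maps a fingerprint
# hash key to the reference IDs stored under it. Keys and layout are invented.
REGIONS = [
    {0x1A2B: ["ESPN"], 0x3C4D: ["QVC", "SyFy"]},
    {0x5E6F: ["ESPN"], 0x7A8B: ["BNN"]},
]


def client_side_match(unknown_keys):
    """Pull every matching row back to the client, then count there."""
    rows = []
    for region in REGIONS:
        for key in unknown_keys:
            if key in region:
                rows.append((key, region[key]))  # each row crosses the network
    counts = Counter(ref for _, refs in rows for ref in refs)
    return counts, len(rows)


def region_side_match(region, unknown_keys):
    """Stand-in for endpoint logic: count matches next to the data."""
    counts = Counter()
    for key in unknown_keys:
        counts.update(region.get(key, []))
    return counts  # one compact summary per region


def coprocessor_style_match(unknown_keys):
    """Ask each region for its summary and merge the summaries on the client."""
    total = Counter()
    responses = 0
    for region in REGIONS:
        total += region_side_match(region, unknown_keys)
        responses += 1
    return total, responses


unknown = [0x1A2B, 0x3C4D, 0x5E6F]
client_counts, rows_moved = client_side_match(unknown)
endpoint_counts, summaries_moved = coprocessor_style_match(unknown)
```

Both paths produce the same match counts; the difference is what travels back to the client. In a real deployment the per-region summary would be serialized with a wire format such as Protocol Buffers, which this model leaves out.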

Published in: Software, Technology

  1. CONTENT IDENTIFICATION USING HBASE
     Daniel Nelson, 3-10-2014
  2. TV MEASUREMENT HISTORY
     • Traditional TV
     • Video On Demand
     • Time Shifted Viewing
     • Internet-Based Content
     What are people watching?
  3. WHY IDENTIFY CONTENT?
     • Broadcasters: What are people watching?
     • Advertisers: How can I focus my spend?
     [Diagram labels: Movies, Commercials, Programs, Streaming]
  4. WHAT ARE AUDIO FINGERPRINTS?
     [Diagram labels: Program Audio, Audio Fingerprints]
  5. BUILDING A LIBRARY OF CONTENT
     [Diagram labels: Nielsen Remote Media Monitoring Sites; Audio; Movies, Commercials, Programs, Streaming; Fingerprint Generator; Fingerprints; HBase Region Servers]
  6. NEED FOR HBASE
     • Rapidly Growing Content
     • Monolithic Limitations
     • Storage Scalability
     • Distributed Computations
  7. COLLECTING VIEWING DATA
  8. CONTENT IDENTIFICATION
     The matching process identifies content by comparing unknown fingerprints (left) against reference fingerprints (right).
     [Diagram labels: Match, Unknown Fingerprints, Reference Fingerprints, ESPN, QVC, SyFy, BNN]
  9. FINGERPRINTS AND HBASE
     • Think of HBase as One Big Hash Table
     • Fingerprints – Fit Nicely into HBase as the Key
     • Keys Are Not 100% Unique – Collisions Without Loss?
     • Near Constant Time Lookups
     • Hash Table Load Factor (n/k, entries per bucket) Will Impact This
  10. REFERENCE HOUSEKEEPING
     • Maintaining a Moving Window of Relevant Data
     • Broadcast Reference Fingerprints Expire After 8 Days
     • Managing 20+ Billion Hash Keys Is No Small Task
     • HBase TTL (Time To Live)
     • Places an Expiration Date on All Table Data
     • Hides Expired Data From Queries
     • Purged on Next Compaction Cycle
  11. OPTIMIZING USE OF HBASE
     • Network
     • Fastest Network
     • Bonded 1Gig Ethernet
     • Reduce Data Volume in Network Transfers
     • Protobufs – Google Protocol Buffers
     • HBase Co-Processors
     • Your Code Running on Region Servers
     • Computations, Advanced Filtering, Transformations…
  12. HBASE ENDPOINT CO-PROCESSORS
     • Push Your Business Logic into the Co-Processor
     • Keeping Co-Processor Code Simple
     • Loading Your Co-Processor
     [Diagram labels: HBase Client Application, Query/Response, Co-Processor (×4), HBase Region Servers]
  13. QUERYING HBASE
     • Table.scan(…)
     • Table.get(…)
     • Table.get(List<Get> keyList)
     • Query Using a Co-Processor
     • coprocessorExec( yourProtocolClass, null, null, … )
     • coprocessorExec( yourProtocolClass, startKey, endKey, … )
  14. STORING MILLIONS OF FILES
     • Have a Lot of Files to Store? Use SAN/NAS/HDFS, Right?
     • SAN/NAS
     • Costly
     • More Hardware to Buy/Maintain
     • HDFS
     • Limited File Count
     • Sequence Files
     • Immutable
     • Inefficient to Retrieve, Delete, Modify or Add Files...
     • There Is Another Way…
  15. LEVERAGING WHAT ALREADY WORKS
     Example File To Store: /foo/bar/myFile.bin
     • HBase Key = File Path: /foo/bar/
     • Qualifier = File Name: myFile.bin
     • Value = Your File Data (Serialized Object, Text, etc…)
     • “List” Files: Scan Using the Built-In KeyOnlyFilter
     • Wrap This with an API and Your Application Can Use “HBase FS”
     • File Retrieval, Delete, Modification, Updates, Versions
     • Apply TTL to Purge Files
  16. THANK YOU
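
The file-store layout on slide 15 (row key = directory path, qualifier = file name, value = file data, TTL for purging) can be sketched with a toy in-memory model. This is an assumption-laden illustration, not the speaker's "HBase FS" API and not HBase client code: the class `HBaseFsSketch`, its method names, and the explicit `now` parameter are all hypothetical, and a nested dict stands in for the table.

```python
import time


class HBaseFsSketch:
    """Toy model: row key = directory path, qualifier = file name."""

    def __init__(self, ttl_seconds=None):
        self.rows = {}  # row key -> {qualifier: (value, write_time)}
        self.ttl_seconds = ttl_seconds

    @staticmethod
    def split(path):
        """Split '/foo/bar/myFile.bin' into ('/foo/bar/', 'myFile.bin')."""
        directory, _, name = path.rpartition("/")
        return directory + "/", name

    def _live(self, written_at, now):
        # Mimics TTL behaviour: expired cells are hidden from reads.
        return self.ttl_seconds is None or now - written_at < self.ttl_seconds

    def put(self, path, data, now=None):
        row, qualifier = self.split(path)
        now = time.time() if now is None else now
        self.rows.setdefault(row, {})[qualifier] = (data, now)

    def get(self, path, now=None):
        row, qualifier = self.split(path)
        now = time.time() if now is None else now
        cell = self.rows.get(row, {}).get(qualifier)
        if cell is None or not self._live(cell[1], now):
            return None
        return cell[0]

    def list_files(self, directory, now=None):
        """Return names only, in the spirit of a key-only scan."""
        now = time.time() if now is None else now
        cells = self.rows.get(directory, {})
        return sorted(
            name for name, (_, written_at) in cells.items()
            if self._live(written_at, now)
        )

    def delete(self, path):
        row, qualifier = self.split(path)
        self.rows.get(row, {}).pop(qualifier, None)


fs = HBaseFsSketch(ttl_seconds=60)
fs.put("/foo/bar/myFile.bin", b"payload", now=0)
fs.put("/foo/bar/other.bin", b"more", now=30)
listing_early = fs.list_files("/foo/bar/", now=45)
listing_late = fs.list_files("/foo/bar/", now=75)
expired_read = fs.get("/foo/bar/myFile.bin", now=75)
```

Keeping the directory as the row key means listing a directory touches a single row, and listing needs only the qualifiers, never the file contents. The model only hides expired cells on read; physical removal, which the slides attribute to the next compaction cycle, is not modelled.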