Content Identification using HBase

Speaker: Daniel Nelson (Nielsen)

The motivation behind content identification is to determine the media people are consuming (via TV shows, movies, or streaming). Nielsen collects that data via its Fingerprints system, which generates significant amounts of structured data that is stored in HBase. This presentation will review the options a developer has for HBase querying and retrieval of hash data. Also covered is the use of wire protocols (Protocol Buffers), and how they can improve network efficiency and throughput, especially when combined with an HBase coprocessor.

Published in: Software, Technology
Transcript

  • 1. CONTENT IDENTIFICATION USING HBASE Daniel Nelson 3-10-2014
  • 2. TV MEASUREMENT HISTORY (Copyright © 2013 The Nielsen Company. Confidential and proprietary.)
      • Traditional TV
      • Video On Demand
      • Time Shifted Viewing
      • Internet-Based Content
      • What are people watching?
  • 3. WHY IDENTIFY CONTENT?
      • Broadcasters: What are people watching?
      • Advertisers: How can I focus my spend?
      • Content types: Movies, Commercials, Programs, Streaming
  • 4. WHAT ARE AUDIO FINGERPRINTS?
      • Diagram labels: Program Audio; Audio Fingerprints
  • 5. BUILDING A LIBRARY OF CONTENT
      • Diagram labels: Nielsen Remote Media Monitoring Sites; Audio (Movies, Commercials, Programs, Streaming); Fingerprint Generator; Fingerprints; HBase Region Servers
  • 6. NEED FOR HBASE
      • Rapidly Growing Content
      • Monolithic Limitations
      • Storage Scalability
      • Distributed Computations
  • 7. COLLECTING VIEWING DATA
  • 8. CONTENT IDENTIFICATION
      • The matching process identifies content by comparing Unknown Fingerprints (left) against Reference Fingerprints (right).
      • Diagram labels: Match; Unknown Fingerprints; Reference Fingerprints; ESPN, QVC, SyFy, BNN
  • 9. FINGERPRINTS AND HBASE
      • Think of HBase as One Big Hash Table
      • Fingerprints – Fit Nicely into HBase as the Key
      • Keys Are Not 100% Unique – Collisions Without Loss?
      • Near Constant Time Lookups
          • Hash Table Load Factor will impact this (n/k)
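The "collisions without loss" idea on slide 9 can be sketched as a hash table whose values are lists, so two references that produce the same fingerprint hash are both kept under one key. The following is a minimal in-memory sketch, not HBase client code and not Nielsen's implementation; the class name `FingerprintIndex`, its methods, and the sample hashes are hypothetical. The `loadFactor` helper follows the usual reading of n/k: n stored entries over k buckets.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for a fingerprint table (assumption: not HBase client code). */
final class FingerprintIndex {
    // One hash key maps to every reference that produced it, so a collision
    // appends to the list instead of overwriting the earlier entry.
    private final Map<Long, List<String>> table = new HashMap<>();
    private long entries = 0;

    void addReference(long fingerprintHash, String reference) {
        table.computeIfAbsent(fingerprintHash, k -> new ArrayList<>()).add(reference);
        entries++;
    }

    /** Returns all references stored under the hash, or an empty list. */
    List<String> lookup(long fingerprintHash) {
        return table.getOrDefault(fingerprintHash, Collections.emptyList());
    }

    long entryCount() {
        return entries;
    }

    /** Load factor n/k: n stored entries spread over k buckets. */
    static double loadFactor(long n, long k) {
        return (double) n / k;
    }

    public static void main(String[] args) {
        FingerprintIndex index = new FingerprintIndex();
        index.addReference(0x1A2BL, "ESPN@12:00:01");
        index.addReference(0x1A2BL, "QVC@12:00:07"); // same hash, different content
        index.addReference(0x3C4DL, "SyFy@12:00:02");
        System.out.println(index.lookup(0x1A2BL));
        System.out.println(loadFactor(index.entryCount(), 2));
    }
}
```

As the load factor grows, each lookup has more candidates to inspect, which is why lookups are only "near" constant time.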
  • 10. REFERENCE HOUSEKEEPING
      • Maintaining a Moving Window of Relevant Data
          • Broadcast Reference Fingerprints Expire after 8 Days
      • Managing 20+ Billion Hash Keys Is No Small Task
      • HBase TTL (Time To Live)
          • Places an Expiration Date on All Table Data
          • Hides Expired Data From Queries
          • Purged on Next Compaction Cycle
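The TTL behaviour listed on slide 10 (expired data is hidden from queries at once, but only physically removed at the next compaction) can be modelled with a small toy store. This is an illustrative sketch under stated assumptions, not HBase code: `TtlStore` and its methods are hypothetical, and time is passed in explicitly as seconds. In HBase itself the TTL is a column-family setting expressed in seconds, so the 8-day window works out to 8 × 24 × 60 × 60 = 691,200 seconds.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Toy model of TTL semantics (assumption: not HBase client code). */
final class TtlStore {
    static final long EIGHT_DAYS_SECONDS = 8L * 24 * 60 * 60; // 691,200 s

    private static final class Cell {
        final String value;
        final long writtenAtSeconds;

        Cell(String value, long writtenAtSeconds) {
            this.value = value;
            this.writtenAtSeconds = writtenAtSeconds;
        }
    }

    private final Map<String, Cell> cells = new HashMap<>();
    private final long ttlSeconds;

    TtlStore(long ttlSeconds) {
        this.ttlSeconds = ttlSeconds;
    }

    void put(String key, String value, long nowSeconds) {
        cells.put(key, new Cell(value, nowSeconds));
    }

    private boolean expired(Cell cell, long nowSeconds) {
        return nowSeconds - cell.writtenAtSeconds > ttlSeconds;
    }

    /** Expired data is hidden from reads even before it is purged. */
    String get(String key, long nowSeconds) {
        Cell cell = cells.get(key);
        return (cell == null || expired(cell, nowSeconds)) ? null : cell.value;
    }

    /** Compaction physically drops expired cells and returns how many were purged. */
    int compact(long nowSeconds) {
        int purged = 0;
        for (Iterator<Cell> it = cells.values().iterator(); it.hasNext();) {
            if (expired(it.next(), nowSeconds)) {
                it.remove();
                purged++;
            }
        }
        return purged;
    }

    /** Number of cells still physically stored, expired or not. */
    int physicalSize() {
        return cells.size();
    }

    public static void main(String[] args) {
        TtlStore store = new TtlStore(EIGHT_DAYS_SECONDS);
        store.put("hash-1", "ESPN", 0);
        long later = EIGHT_DAYS_SECONDS + 1;
        System.out.println(store.get("hash-1", later)); // hidden from the query
        System.out.println(store.physicalSize());       // still on "disk"
        System.out.println(store.compact(later));       // purged at compaction
    }
}
```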
  • 11. OPTIMIZING USE OF HBASE
      • Network
          • Fastest Network
          • Bonded 1Gig Ethernet
      • Reduce Data Volume in Network Transfers
          • Protobufs – Google Protocol Buffers
      • HBase Co-Processors
          • Your Code Running on Region Servers
          • Computations, Advanced Filtering, Transformations…
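One reason Protocol Buffers reduce data volume is that integers are written as base-128 varints: seven payload bits per byte, least significant group first, with the high bit marking that more bytes follow. The sketch below is a hand-rolled illustration of that encoding, not the protobuf library and not something shown in the slides; `VarintSketch` and its method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

/** Hand-rolled base-128 varint, for illustration only (not the protobuf library). */
final class VarintSketch {
    /** Encodes a 64-bit value using 7 payload bits per byte, low-order group first. */
    static byte[] encodeVarint(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write((int) value); // final byte has the high bit clear
        return out.toByteArray();
    }

    /** Decodes a single varint from the start of the array. */
    static long decodeVarint(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long value = 300;
        byte[] encoded = encodeVarint(value);
        System.out.println("fixed-width long: " + Long.BYTES + " bytes");
        System.out.println("varint: " + encoded.length + " bytes");
        System.out.println("round trip: " + decodeVarint(encoded));
    }
}
```

Small values take one or two bytes instead of a fixed eight, which adds up when a response carries many numeric fields.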
  • 12. HBASE ENDPOINT CO-PROCESSORS
      • Push your Business Logic into the Co-Processor
      • Keeping Co-Processor Code Simple
      • Loading your Co-Processor
      • Diagram labels: HBase Client Application; Query/Response; HBase Region Servers, each running a Co-Processor
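The endpoint pattern on slide 12 (logic runs next to the data on each region server, and only a small result travels back to the client) can be sketched without HBase at all. The following is an in-memory model under stated assumptions: `EndpointSketch`, `Region`, `countMatches`, and `query` are hypothetical names, and counting hits per reference is just one plausible piece of "business logic", not necessarily what the presenter's coprocessor computes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** In-memory model of the endpoint pattern (assumption: not the HBase coprocessor API). */
final class EndpointSketch {
    /** Stand-in for one region's slice of the table: hash key -> reference name. */
    static final class Region {
        final Map<Long, String> rows = new HashMap<>();

        /** "Endpoint" logic running beside the data: returns only per-reference hit counts. */
        Map<String, Integer> countMatches(Set<Long> unknownHashes) {
            Map<String, Integer> hits = new HashMap<>();
            for (Long hash : unknownHashes) {
                String reference = rows.get(hash);
                if (reference != null) {
                    hits.merge(reference, 1, Integer::sum);
                }
            }
            return hits;
        }
    }

    /** Client side: fan the query out to every region and merge the small partial results. */
    static Map<String, Integer> query(List<Region> regions, Set<Long> unknownHashes) {
        Map<String, Integer> total = new HashMap<>();
        for (Region region : regions) {
            region.countMatches(unknownHashes)
                  .forEach((reference, n) -> total.merge(reference, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Region first = new Region();
        first.rows.put(1L, "ESPN");
        first.rows.put(2L, "QVC");
        Region second = new Region();
        second.rows.put(3L, "ESPN");

        List<Region> regions = new ArrayList<>(Arrays.asList(first, second));
        Set<Long> unknown = new HashSet<>(Arrays.asList(1L, 3L, 4L));
        System.out.println(query(regions, unknown));
    }
}
```

The client never sees the rows themselves, only the merged counts, which is the network saving the slides attribute to coprocessors.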
  • 13. QUERYING HBASE
      • Table.scan(…)
      • Table.get(…)
      • Table.get(List<Get> keyList)
      • Query using a Co-Processor
          • coprocessorExec( yourProtocolClass, null, null, … )
          • coprocessorExec( yourProtocolClass, startKey, endKey, … )
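The difference between the retrieval styles on slide 13 (single get, batched get, and key-range scan) can be illustrated over a sorted in-memory map. This is a sketch only: `QuerySketch` and its methods are hypothetical stand-ins, not the HBase client API, and string keys replace HBase's byte-array row keys. The scan treats the start key as inclusive and the end key as exclusive, which matches the default behaviour of an HBase scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Sorted in-memory table modelling get, batched get, and scan (not the HBase API). */
final class QuerySketch {
    private final TreeMap<String, String> rows = new TreeMap<>();

    void put(String key, String value) {
        rows.put(key, value);
    }

    /** Point lookup of a single row key; null when the key is absent. */
    String get(String key) {
        return rows.get(key);
    }

    /** Batched lookup: one call for many keys; missing keys yield null entries. */
    List<String> get(List<String> keys) {
        List<String> values = new ArrayList<>();
        for (String key : keys) {
            values.add(rows.get(key));
        }
        return values;
    }

    /** Range scan over sorted keys: startKey inclusive, endKey exclusive. */
    SortedMap<String, String> scan(String startKey, String endKey) {
        return rows.subMap(startKey, true, endKey, false);
    }

    public static void main(String[] args) {
        QuerySketch table = new QuerySketch();
        table.put("a", "ESPN");
        table.put("b", "QVC");
        table.put("c", "SyFy");
        System.out.println(table.get("b"));
        System.out.println(table.get(Arrays.asList("a", "z")));
        System.out.println(table.scan("a", "c"));
    }
}
```

When the exact keys are known, as with fingerprint hashes, a batched get avoids walking rows the query does not need, while a scan suits contiguous key ranges.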
  • 14. STORING MILLIONS OF FILES
      • Have a lot of Files to Store, use SAN/NAS/HDFS Right?
      • SAN/NAS
          • Costly
          • More Hardware to Buy/Maintain
      • HDFS
          • Limited File Count
          • Sequence Files
              • Immutable
              • Inefficient to Retrieve, Delete, Modify or Add Files...
      • There is another way…
  • 15. LEVERAGING WHAT ALREADY WORKS
      • Example File To Store: /foo/bar/myFile.bin
          • HBase Key = File Path: /foo/bar/
          • Qualifier = File Name: myFile.bin
          • Value = Your File Data (Serialized Object, Text, etc…)
      • “List” Files - Scan using built in KeyOnlyFilter
      • Wrap this with an API and your Application can use “HBase FS”
          • File Retrieval, Delete, Modification, Updates, Versions
          • Apply TTL to Purge Files
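The file-to-cell mapping on slide 15 (row key = directory path, qualifier = file name, value = file bytes) can be sketched as a nested sorted map. This is an in-memory illustration under stated assumptions, not the presenter's "HBase FS" API and not HBase client code; `HBaseFsSketch` and its methods are hypothetical. The `list` method returns names without touching the stored bytes, which is the role the slide gives to a scan with KeyOnlyFilter.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory model of the path-to-cell mapping (assumption: not HBase client code). */
final class HBaseFsSketch {
    // row key (directory path) -> qualifier (file name) -> value (file bytes)
    private final TreeMap<String, TreeMap<String, byte[]>> rows = new TreeMap<>();

    /** Splits "/foo/bar/myFile.bin" into row key "/foo/bar/" and qualifier "myFile.bin". */
    static String[] split(String path) {
        int slash = path.lastIndexOf('/');
        return new String[] { path.substring(0, slash + 1), path.substring(slash + 1) };
    }

    void write(String path, byte[] data) {
        String[] parts = split(path);
        rows.computeIfAbsent(parts[0], k -> new TreeMap<>()).put(parts[1], data);
    }

    /** Returns the file bytes, or null when the file does not exist. */
    byte[] read(String path) {
        String[] parts = split(path);
        Map<String, byte[]> dir = rows.get(parts[0]);
        return dir == null ? null : dir.get(parts[1]);
    }

    /** Returns true when a file was actually removed. */
    boolean delete(String path) {
        String[] parts = split(path);
        Map<String, byte[]> dir = rows.get(parts[0]);
        return dir != null && dir.remove(parts[1]) != null;
    }

    /** Lists file names in a directory without returning any file contents. */
    List<String> list(String dirPath) {
        Map<String, byte[]> dir = rows.get(dirPath);
        return dir == null ? new ArrayList<>() : new ArrayList<>(dir.keySet());
    }

    public static void main(String[] args) {
        HBaseFsSketch fs = new HBaseFsSketch();
        fs.write("/foo/bar/myFile.bin", "payload".getBytes(StandardCharsets.UTF_8));
        fs.write("/foo/bar/other.txt", "text".getBytes(StandardCharsets.UTF_8));
        System.out.println(fs.list("/foo/bar/"));
        System.out.println(new String(fs.read("/foo/bar/myFile.bin"), StandardCharsets.UTF_8));
        System.out.println(fs.delete("/foo/bar/other.txt"));
    }
}
```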
  • 16. THANK YOU
