Content Identification using HBase


Speaker: Daniel Nelson (Nielsen)

The motivation behind content identification is to determine the media people are consuming (via TV shows, movies, or streaming). Nielsen collects that data via its Fingerprints system, which generates significant amounts of structured data that is stored in HBase. This presentation will review the options a developer has for HBase querying and retrieval of hash data. Also covered is the use of wire protocols (Protocol Buffers), and how they can improve network efficiency and throughput, especially when combined with an HBase coprocessor.

Published in: Software, Technology

  1. CONTENT IDENTIFICATION USING HBASE. Daniel Nelson, 3-10-2014
  2. TV MEASUREMENT HISTORY: What are people watching? Traditional TV, Video On Demand, Time Shifted Viewing, Internet-Based Content. (Slide footer: Copyright © 2013 The Nielsen Company. Confidential and proprietary.)
  3. WHY IDENTIFY CONTENT?
     • Broadcasters: What are people watching?
     • Advertisers: How can I focus my spend?
     • (Diagram labels: Movies, Commercials, Programs, Streaming)
  4. WHAT ARE AUDIO FINGERPRINTS? (Diagram labels: Program Audio, Audio Fingerprints)
  5. BUILDING A LIBRARY OF CONTENT (Diagram labels: Nielsen Remote Media Monitoring Sites; Audio; Movies, Commercials, Programs, Streaming; Fingerprint Generator; Fingerprints; HBase Region Servers)
  6. NEED FOR HBASE
     • Rapidly Growing Content
     • Monolithic Limitations
     • Storage Scalability
     • Distributed Computations
  7. COLLECTING VIEWING DATA
  8. CONTENT IDENTIFICATION: The matching process identifies content by comparing unknown fingerprints (left) against reference fingerprints (right). (Diagram labels: Match; Unknown Fingerprints; Reference Fingerprints; ESPN, QVC, SyFy, BNN)
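
The slide does not say how a match is scored. As a minimal sketch only, assuming each reference fingerprint hash maps to the stations that broadcast it and that the station with the most matching hashes wins, the comparison could look like this. The `REFERENCE` data, the `identify` function, and the voting rule are all hypothetical illustrations, not Nielsen's method.

```python
from collections import Counter

# Hypothetical reference index: fingerprint hash -> list of (station, offset_seconds).
REFERENCE = {
    0xA1: [("ESPN", 0)],
    0xB2: [("ESPN", 1), ("QVC", 7)],   # the same hash can occur on two stations
    0xC3: [("ESPN", 2)],
    0xD4: [("SyFy", 3)],
}


def identify(unknown_hashes, reference=REFERENCE):
    """Return the station with the most matching hashes, or None if nothing matches."""
    votes = Counter()
    for fingerprint in unknown_hashes:
        # Each lookup is a single key fetch; unknown hashes simply contribute no votes.
        for station, _offset in reference.get(fingerprint, []):
            votes[station] += 1
    if not votes:
        return None
    station, _count = votes.most_common(1)[0]
    return station


best_match = identify([0xA1, 0xB2, 0xC3])
```

A production matcher would presumably also check that the offsets of matching hashes line up in time; that step is omitted here.
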
  9. FINGERPRINTS AND HBASE
     • Think of HBase as One Big Hash Table
     • Fingerprints Fit Nicely into HBase as the Key
     • Keys Are Not 100% Unique: Collisions Without Loss?
     • Near Constant Time Lookups
        • Hash Table Load Factor Will Impact This (n/k)
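
The hash-table analogy on this slide can be sketched in a few lines. This is an in-memory illustration, not HBase code: the class `FingerprintTable` and its `bucket_count` parameter are assumptions made for the example. It keeps every value written under a colliding key ("collisions without loss") and reports the load factor n/k, where n is the number of stored entries and k the number of buckets.

```python
from collections import defaultdict


class FingerprintTable:
    """Toy hash table keyed by fingerprint; colliding keys keep all their values."""

    def __init__(self, bucket_count):
        self.bucket_count = bucket_count      # k in the load factor n/k
        self.rows = defaultdict(list)         # fingerprint -> list of content ids

    def put(self, fingerprint, content_id):
        # Appending instead of overwriting is what avoids losing data on a collision.
        self.rows[fingerprint].append(content_id)

    def get(self, fingerprint):
        # .get() avoids creating an empty entry for a missing key.
        return list(self.rows.get(fingerprint, []))

    def load_factor(self):
        entry_count = sum(len(values) for values in self.rows.values())  # n
        return entry_count / self.bucket_count


table = FingerprintTable(bucket_count=4)
table.put(0x1F, "ESPN")
table.put(0x1F, "QVC")    # same key, second value is kept
table.put(0x2A, "SyFy")
```

In HBase itself, one way to get the same effect is to store each colliding value in its own column qualifier under the shared row key; whether the presenters did this is not stated in the slides.
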
  10. REFERENCE HOUSEKEEPING
     • Maintaining a Moving Window of Relevant Data
        • Broadcast Reference Fingerprints Expire after 8 Days
        • Managing 20+ Billion Hash Keys Is No Small Task
     • HBase TTL (Time To Live)
        • Places an Expiration Date on All Table Data
        • Hides Expired Data From Queries
        • Expired Data Is Purged on the Next Compaction Cycle
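
The two TTL behaviours listed on this slide, hiding expired data from reads and physically removing it only at compaction, can be modelled with a small in-memory sketch. `TtlTable` and its methods are hypothetical names for illustration and are not part of the HBase API; time is passed in explicitly so the example is deterministic.

```python
EIGHT_DAYS = 8 * 24 * 3600  # the 8-day reference window from the slide, in seconds


class TtlTable:
    """Toy model of TTL: expired cells are invisible to reads, removed on compact()."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.cells = {}  # key -> (value, write_time)

    def put(self, key, value, now):
        self.cells[key] = (value, now)

    def _expired(self, write_time, now):
        return now - write_time > self.ttl

    def get(self, key, now):
        entry = self.cells.get(key)
        if entry is None or self._expired(entry[1], now):
            return None  # expired data is hidden even though it is still stored
        return entry[0]

    def compact(self, now):
        """Drop expired cells and return how many were purged."""
        before = len(self.cells)
        self.cells = {
            key: entry
            for key, entry in self.cells.items()
            if not self._expired(entry[1], now)
        }
        return before - len(self.cells)


store = TtlTable(EIGHT_DAYS)
store.put("old", "fingerprint-a", now=0)
store.put("new", "fingerprint-b", now=EIGHT_DAYS)
later = EIGHT_DAYS + 1
hidden = store.get("old", now=later)      # hidden before any compaction runs
stored_before = len(store.cells)          # the expired cell still takes up space
purged = store.compact(now=later)
```
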
  11. OPTIMIZING USE OF HBASE
     • Network
        • Fastest Network
        • Bonded 1Gig Ethernet
     • Reduce Data Volume in Network Transfers
        • Protobufs: Google Protocol Buffers
     • HBase Coprocessors
        • Your Code Running on Region Servers
        • Computations, Advanced Filtering, Transformations…
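
One reason Protocol Buffers reduce transfer volume is that integers are written as base-128 varints: seven payload bits per byte, least-significant group first, with the high bit set on every byte except the last. The sketch below implements only that integer encoding, not Protocol Buffers message framing or the `protobuf` library, and compares the result with a JSON text encoding of the same numbers. The sample `offsets` list is made up for illustration.

```python
import json


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)          # last byte: high bit clear
            return bytes(out)


def decode_varints(data):
    """Decode a concatenation of varints back into a list of integers."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


offsets = [3, 150, 300, 70000]        # hypothetical sample values
packed = b"".join(encode_varint(v) for v in offsets)
as_json = json.dumps(offsets).encode("utf-8")
```

For these four values the varint form is less than half the size of the JSON text; actual savings depend on the data.
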
  12. HBASE ENDPOINT COPROCESSORS
     • Push Your Business Logic into the Coprocessor
     • Keeping Coprocessor Code Simple
     • Loading Your Coprocessor
     • (Diagram labels: HBase Client Application; Query/Response; HBase Region Servers, each with a Coprocessor)
  13. QUERYING HBASE
     • Table.scan(…)
     • Table.get(…)
     • Table.get(List<Get> keyList)
     • Query Using a Coprocessor
        • coprocessorExec(yourProtocolClass, null, null, …)
        • coprocessorExec(yourProtocolClass, startKey, endKey, …)
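
The difference between the two `coprocessorExec` forms can be illustrated with an in-memory model. This is not the HBase client API: `Region`, `overlaps`, and `coprocessor_exec` are hypothetical Python stand-ins. The sketch assumes that passing no start and end key targets every region, that a key range targets only the regions overlapping it, and that each region returns one small result instead of shipping its rows to the client.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Region:
    name: str
    start_key: Optional[str]   # inclusive lower bound; None means unbounded
    end_key: Optional[str]     # exclusive upper bound; None means unbounded
    rows: dict = field(default_factory=dict)


def overlaps(region, start_key, end_key):
    """True if the requested key range [start_key, end_key) touches the region."""
    below_end = region.end_key is None or start_key is None or start_key < region.end_key
    above_start = end_key is None or region.start_key is None or end_key > region.start_key
    return below_end and above_start


def coprocessor_exec(regions, start_key, end_key, call):
    """Run `call` next to the data on each selected region; collect one result per region."""
    return {r.name: call(r) for r in regions if overlaps(r, start_key, end_key)}


regions = [
    Region("r1", None, "m", {"a": 1, "c": 2, "k": 3}),
    Region("r2", "m", None, {"m": 4, "z": 5}),
]


def count_rows(region):
    # Stand-in for business logic running on the region server.
    return len(region.rows)


all_regions = coprocessor_exec(regions, None, None, count_rows)
one_region = coprocessor_exec(regions, "n", "p", count_rows)
```

Here the client receives two integers rather than five rows, which is the network saving the earlier slides point to.
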
  14. STORING MILLIONS OF FILES
     • Have a Lot of Files to Store? Use SAN/NAS/HDFS, Right?
     • SAN/NAS
        • Costly
        • More Hardware to Buy/Maintain
     • HDFS
        • Limited File Count
        • Sequence Files
           • Immutable
           • Inefficient to Retrieve, Delete, Modify or Add Files...
     • There Is Another Way…
  15. LEVERAGING WHAT ALREADY WORKS
     Example File To Store: /foo/bar/myFile.bin
     • HBase Key = File Path: /foo/bar/
     • Qualifier = File Name: myFile.bin
     • Value = Your File Data (Serialized Object, Text, etc.)
     • "List" Files: Scan Using the Built-in KeyOnlyFilter
     • Wrap This with an API and Your Application Can Use "HBase FS"
        • File Retrieval, Delete, Modification, Updates, Versions
        • Apply TTL to Purge Files
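
The path-to-cell mapping on this slide can be sketched with a nested dictionary standing in for the table. `HBaseFsSketch` and its method names are hypothetical; a real wrapper would issue HBase puts, gets, scans, and deletes instead. Listing returns qualifier names only, which mirrors the idea of scanning with `KeyOnlyFilter` so that file contents are not transferred.

```python
import posixpath


class HBaseFsSketch:
    """Toy 'HBase FS': row key = directory path, qualifier = file name, value = bytes."""

    def __init__(self):
        self.table = {}  # row key -> {qualifier: value}

    @staticmethod
    def _split(path):
        directory, name = posixpath.split(path)
        if not directory.endswith("/"):
            directory += "/"          # "/foo/bar" -> "/foo/bar/" as on the slide
        return directory, name

    def write(self, path, data):
        row, qualifier = self._split(path)
        self.table.setdefault(row, {})[qualifier] = data

    def read(self, path):
        row, qualifier = self._split(path)
        return self.table.get(row, {}).get(qualifier)

    def list_files(self, directory):
        # Names only, no values: the analogue of a key-only scan.
        return sorted(self.table.get(directory, {}))

    def delete(self, path):
        row, qualifier = self._split(path)
        columns = self.table.get(row, {})
        removed = columns.pop(qualifier, None) is not None
        if row in self.table and not columns:
            del self.table[row]       # drop the row once its last file is gone
        return removed


fs = HBaseFsSketch()
fs.write("/foo/bar/myFile.bin", b"\x00\x01")
fs.write("/foo/bar/notes.txt", b"hi")
listing = fs.list_files("/foo/bar/")
```
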
  16. THANK YOU