Content Identification using HBase

  2. TV MEASUREMENT HISTORY: Traditional TV, Video On Demand, Time-Shifted Viewing, Internet-Based Content. What are people watching? (Copyright © 2013 The Nielsen Company. Confidential and proprietary.)
  3. WHY IDENTIFY CONTENT? Broadcasters: What are people watching? Advertisers: How can I focus my spend? (diagram labels: Movies, Commercials, Programs, Streaming)
  4. WHAT ARE AUDIO FINGERPRINTS? (diagram labels: Program Audio; Audio Fingerprints)
  5. BUILDING A LIBRARY OF CONTENT (diagram labels: Nielsen Remote Media Monitoring Sites; Audio; Movies, Commercials, Programs, Streaming; Fingerprint Generator; Fingerprints; HBase Region Servers)
  6. NEED FOR HBASE
     • Rapidly Growing Content
     • Monolithic Limitations
     • Storage Scalability
     • Distributed Computations
  7. COLLECTING VIEWING DATA
  8. CONTENT IDENTIFICATION: The matching process identifies content by comparing unknown fingerprints (left) against reference fingerprints (right). (diagram labels: Match; Unknown Fingerprints; Reference Fingerprints; ESPN, QVC, SyFy, BNN)
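The comparison described on this slide can be pictured with a small voting model. This is only a sketch under assumptions: the slides do not say how matches are scored, so exact-match voting, the class name `FingerprintMatcher`, and the in-memory `Map` standing in for the reference table are all hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy matcher (hypothetical, not the deck's algorithm): exact-match voting. */
class FingerprintMatcher {

    /**
     * Each unknown fingerprint votes for every station that holds the same
     * reference fingerprint. Returns the station with the most votes, or
     * null when nothing matched. Ties are broken arbitrarily.
     */
    static String bestMatch(long[] unknown, Map<Long, List<String>> reference) {
        Map<String, Integer> votes = new HashMap<>();
        for (long fp : unknown) {
            List<String> stations = reference.getOrDefault(fp, Collections.emptyList());
            for (String station : stations) {
                votes.merge(station, 1, Integer::sum);
            }
        }
        String best = null;
        int bestVotes = 0;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestVotes) {
                best = e.getKey();
                bestVotes = e.getValue();
            }
        }
        return best;
    }
}
```

A production matcher would likely also check that matching fingerprints line up in time; that step is left out here.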
  9. FINGERPRINTS AND HBASE
     • Think of HBase as One Big Hash Table
     • Fingerprints – Fit Nicely into HBase as the Key
     • Keys Are Not 100% Unique – Collisions Without Loss?
     • Near Constant Time Lookups
       • Hash Table Load Factor will impact this (n/k)
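The hash-table analogy above can be sketched in plain Java. This is an in-memory model, not HBase client code: `FingerprintTable` and its methods are hypothetical names, and keeping a list per key is one assumed way to keep colliding fingerprints without loss. The load factor is read here in the usual sense of n stored entries over k buckets.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for a fingerprint-keyed table (hypothetical model). */
class FingerprintTable {
    // fingerprint -> every source that produced it; collisions append, never overwrite
    private final Map<Long, List<String>> rows = new HashMap<>();

    void put(long fingerprint, String source) {
        rows.computeIfAbsent(fingerprint, k -> new ArrayList<>()).add(source);
    }

    /** Returns all sources stored under the fingerprint, or an empty list. */
    List<String> get(long fingerprint) {
        return rows.getOrDefault(fingerprint, Collections.emptyList());
    }

    /** Load factor n/k: n stored entries spread over k buckets. */
    static double loadFactor(long n, long k) {
        return (double) n / k;
    }
}
```

As the load factor grows, each lookup has more colliding entries to walk through, which is why the slide qualifies lookups as only "near" constant time.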
  10. REFERENCE HOUSEKEEPING
     • Maintaining a Moving Window of Relevant Data
       • Broadcast Reference Fingerprints Expire after 8 Days
       • Managing 20+ Billion Hash Keys is no Small Task
     • HBase TTL (Time To Live)
       • Places an Expiration Date on All Table Data
       • Hides Expired Data From Queries
       • Purged on Next Compaction Cycle
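To make the 8-day window concrete: HBase column-family TTLs are configured in seconds, so 8 days comes to 691,200 seconds. The sketch below models only the "hidden from queries" rule; the class `TtlWindow` is hypothetical, and the exact boundary behaviour at the moment of expiry is an assumption rather than something the slides specify.

```java
import java.util.concurrent.TimeUnit;

/** Model of a TTL-based moving window (hypothetical helper, not HBase code). */
class TtlWindow {
    /** 8 days expressed in seconds: 8 * 24 * 60 * 60 = 691200. */
    static final long TTL_SECONDS = TimeUnit.DAYS.toSeconds(8);

    /**
     * A cell stays visible to reads while its age is below the TTL.
     * Physical removal happens later, during compaction; this method
     * models only the read-time visibility check.
     */
    static boolean isVisible(long cellTimestampMs, long nowMs) {
        long ageMs = nowMs - cellTimestampMs;
        return ageMs < TimeUnit.SECONDS.toMillis(TTL_SECONDS);
    }
}
```

The point of the design is that no delete job has to walk the 20+ billion keys; expiry follows from each cell's own timestamp.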
  11. OPTIMIZING USE OF HBASE
     • Network
       • Fastest Network
       • Bonded 1Gig Ethernet
     • Reduce Data Volume in Network Transfers
       • Protobufs – Google Protocol Buffers
     • HBase Co-Processors
       • Your Code Running on Region Servers
       • Computations, Advanced Filtering, Transformations…
  12. HBASE ENDPOINT CO-PROCESSORS
     • Push your Business Logic into the Coprocessor
     • Keeping Co-Processor Code Simple
     • Loading your Co-Processor
     (diagram labels: HBase Client Application; Query/Response; HBase Region Servers; Co-Processor ×4)
  13. QUERYING HBASE
     • Table.scan(…)
     • Table.get(…)
     • Table.get(List<Get> keyList)
     • Query using a Co-Processor
       • coprocessorExec(yourProtocolClass, null, null, …)
       • coprocessorExec(yourProtocolClass, startKey, endKey, …)
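The two coprocessorExec forms differ only in which regions take part: null bounds reach every region, while a start and end key restrict the call to regions holding rows in that range. The sketch below models that fan-out with plain Java collections. It is not the HBase API; `RegionFanOut`, `addRegion`, and `exec` are hypothetical names, and handing each region a pre-cut slice of its rows is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.function.Function;

/** In-memory model of a per-region fan-out call (hypothetical, not HBase code). */
class RegionFanOut {
    // Each sorted map stands in for one region's slice of the key space.
    private final List<NavigableMap<String, Integer>> regions = new ArrayList<>();

    void addRegion(NavigableMap<String, Integer> region) {
        regions.add(region);
    }

    /**
     * Runs `call` once per region that has rows in [startKey, endKey) and
     * collects one result per region. A null bound means "unbounded".
     * Assumes startKey <= endKey when both are given.
     */
    <R> List<R> exec(String startKey, String endKey,
                     Function<NavigableMap<String, Integer>, R> call) {
        List<R> results = new ArrayList<>();
        for (NavigableMap<String, Integer> region : regions) {
            NavigableMap<String, Integer> slice = region;
            if (startKey != null) slice = slice.tailMap(startKey, true);
            if (endKey != null) slice = slice.headMap(endKey, false);
            if (!slice.isEmpty()) results.add(call.apply(slice));
        }
        return results;
    }
}
```

Because the function runs where the rows live, only the small per-region results travel back to the client, which is the data-volume saving the deck attributes to co-processors.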
  14. STORING MILLIONS OF FILES
     • Have a lot of Files to Store? Use SAN/NAS/HDFS, Right?
     • SAN/NAS
       • Costly
       • More Hardware to Buy/Maintain
     • HDFS
       • Limited File Count
       • Sequence Files
         • Immutable
         • Inefficient to Retrieve, Delete, Modify or Add Files...
     • There is another way…
  15. LEVERAGING WHAT ALREADY WORKS
     Example File To Store: /foo/bar/myFile.bin
     • HBase Key = File Path: /foo/bar/
     • Qualifier = File Name: myFile.bin
     • Value = Your File Data (Serialized Object, Text, etc…)
     • “List” Files - Scan using built-in KeyOnlyFilter
     • Wrap this with an API and your Application can use “HBase FS”
       • File Retrieval, Delete, Modification, Updates, Versions
       • Apply TTL to Purge Files
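The key layout above can be sketched as a small in-memory model. It assumes the split shown on the slide (directory path with trailing slash as row key, file name as qualifier); the class `HBaseFsModel` and its methods are hypothetical, and nested sorted maps stand in for the table rather than real HBase client calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** In-memory model of the "HBase FS" layout (hypothetical, not HBase code). */
class HBaseFsModel {
    // row key (directory path) -> qualifier (file name) -> value (file bytes)
    private final TreeMap<String, TreeMap<String, byte[]>> rows = new TreeMap<>();

    /** "/foo/bar/myFile.bin" -> "/foo/bar/" */
    static String rowKey(String path) {
        return path.substring(0, path.lastIndexOf('/') + 1);
    }

    /** "/foo/bar/myFile.bin" -> "myFile.bin" */
    static String qualifier(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    void write(String path, byte[] data) {
        rows.computeIfAbsent(rowKey(path), k -> new TreeMap<>()).put(qualifier(path), data);
    }

    /** Returns the file bytes, or null when the file does not exist. */
    byte[] read(String path) {
        TreeMap<String, byte[]> row = rows.get(rowKey(path));
        return row == null ? null : row.get(qualifier(path));
    }

    boolean delete(String path) {
        TreeMap<String, byte[]> row = rows.get(rowKey(path));
        return row != null && row.remove(qualifier(path)) != null;
    }

    /** Lists file names in a directory without reading contents, like a key-only scan. */
    List<String> list(String dir) {
        List<String> names = new ArrayList<>();
        TreeMap<String, byte[]> row = rows.get(dir);
        if (row != null) names.addAll(row.keySet());
        return names;
    }
}
```

For example, `write("/foo/bar/myFile.bin", data)` followed by `list("/foo/bar/")` yields a list containing `myFile.bin`. Unlike a sequence file, a single entry can be replaced or deleted without rewriting its neighbours.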