Information excellence 2012feb_komli_srinivasan s h_making data repitions work
Information Excellence 2012 Spring Summit "DATA DYNAMICS"
Making data repetitions work for you
Srinivasan Sengamedu, VP, Komli

  1. Information Excellence Summit, February 25, 2012, Bangalore. http://Informationexcellence.wordpress.com
     Making Data Repetitions Work for You
     Srinivasan H Sengamedu, Komli Labs, shs@komli.com
     Confidential
  2. Srinivasan H Sengamedu
     Bio: Srinivasan H Sengamedu (SHS) is the Head of Komli Labs, where he works on real-time bidding, user modeling, and other areas related to computational advertising. He was previously Director of Audience and Search Sciences at Yahoo Labs, Bangalore, where he worked on information extraction, machine-learned ranking, pornography detection in images, comment spam detection, etc. Most of these technologies power various Yahoo products. He got his PhD from the Indian Institute of Science, Bangalore, and has held visiting positions at UCSD and NUS. He has published over 100 papers and has more than 30 approved or filed patent applications. He's generally excited about creating and productizing advanced technologies.
  3. A lot of information on the web: web pages, images, videos, blogs, mails.
  4. Not all information is new! Web pages about the same product, business, etc.; near-duplicate images; similar comments, tweets, etc.
  5. Leveraging redundant information
     Classical use
       – Compression
           • Lossless compression (LZW)
           • Perceptually lossless compression (JPEG, MP3)
       – Co-occurrence
           • Pointwise Mutual Information
     Redundancy ≈ Confidence
     Leveraging redundancy requires care.
  6. Pointwise Mutual Information
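The slide carries only the title, so as a reminder: the pointwise mutual information of two events x and y is commonly defined as PMI(x, y) = log( p(x, y) / (p(x) p(y)) ). The sketch below estimates it from document-level co-occurrence counts. The function name `pmi`, the toy documents, and the base-2 logarithm are assumptions made for illustration and are not taken from the talk.

```python
import math

def pmi(x, y, documents):
    """Estimate PMI(x, y) in bits from document-level co-occurrence."""
    n = len(documents)
    doc_sets = [set(doc) for doc in documents]
    p_x = sum(x in d for d in doc_sets) / n
    p_y = sum(y in d for d in doc_sets) / n
    p_xy = sum(x in d and y in d for d in doc_sets) / n
    if p_xy == 0:
        return float("-inf")  # the two words never co-occur
    return math.log2(p_xy / (p_x * p_y))

# Toy corpus (hypothetical): each document is a list of tokens.
docs = [["new", "york"], ["new", "york"], ["new", "delhi"], ["old", "delhi"]]
print(pmi("new", "york", docs))   # positive: the pair co-occurs more than chance
print(pmi("new", "delhi", docs))  # negative: the pair co-occurs less than chance
```

A positive score says the pair repeats together more often than independence would predict, which is the sense in which redundancy can be read as confidence.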
  7. Any other uses of redundancy?
     Commercial Spam Detection – Min-closed sequences
     Information Extraction – Strong Similarity
     Near-duplicate images – Image Signatures
     Face recognition – Consistency Learning
  8. Identifying Near-Duplicate Subsequences
  9. Two Spam Comments
     Comment 1: Happy to see she is progressing well. Happy 2011 to everyone....My friend Vanessa, a 25 yrs lady, has announced her wedding with a millionaire young man Ronald who is the CEO of a MNC. Its amazing, she said she just posted her profile on a millionaire dating site called ----------Celeb Mingle.C○M-------- - and received his chat invitations a few days later. Then, everything went so well that I cant believe its true! Every love story will unfold on its own. Also happy to see that most Americans reject the blame-the-conservatives crap that some (not all) liberals from all social strata were trying to promote for political gain.
     Comment 2: Texas and Israel forever. Happy 2011 everyone....This has got to be a better year!...My friend Vanessa, a 25 yrs lady, has announced her wedding with a millionaire young man Ronald who is the CEO of a MNC. Its amazing, she said she just posted her profile on a millionaire dating site called ----------RichFriends.Org----- - and received his chat invitations a few days later. I cant believe its true! Every love story will unfold on its own. you can start your own wealthy love story for real at there too !many famous and wealthy people had a profile there ,why not me ? Taking out the worlds trash. Oooraaaah!-----
  10. Sequence-based Spam Detection
      Motivation: Commercial spammers repeat variations of the spam content and embed it in good content. Such comments usually avoid detection by spam filters.
      Technical challenge: Mine frequent subsequences efficiently. The general problem is NP-hard, the algorithms in the literature do not scale to web-scale data, and the spam patterns change every few hours.
      Basic ideas:
        • A new sequence mining algorithm that scales to internet-scale data and is faster than those in the literature, even on other public data sets such as Gazelle.
        • A new framework for spam detection using frequent subsequences.
        • Experimental studies to measure the efficacy of the subsequence mining approach in detecting spam. We also study the life cycle of a typical spam pattern and use it to tune our mining parameters.
      Results: Experiments on news comment data show coverage >70% and editorial savings of a factor of ~30.
  11. mcPrism
      The main ingredients in the algorithm:
        • A modified DFS on the lexicographically ordered sequence tree. The tree is pruned whenever we encounter a prefix-l-closed node. The set of prefix-l-closed nodes is pruned by an inclusion check.
        • Prime Block Encodings for fast computation of joins. We enhance the encoding scheme to handle gap and closure constraints.
        • On-the-fly closure checking. We use the bidirectional closure checking and the backscan pruning schemes in BIDE. This is done using an enhancement of the block encoding scheme.
        • This enhancement also solves an open problem: how to use block encodings to speed up closed sequence mining.
  12. Commercial Spam Detection – Results
      Subsequence: happy 2011 friend yrs lady announced wedding it amazing posted received chat invitations days believe it true love story unfold it own
      Match 1: Happy 2011 to everyone....My friend Vanessa, a 25 yrs lady, has announced her wedding with a millionaire young man Ronald who is the CEO of a MNC. Its amazing, she said she just posted her profile on a millionaire dating site called ----------Celeb Mingle.C○M--------- and received his chat invitations a few days later. Then, everything went so well that I cant believe its true! Every love story will unfold on its own...=====Happy to see she is progressing well. Also happy to see that most Americans reject the blame-the-conservatives crap that some (not all) liberals from all social strata were trying to promote for political gain.
      Match 2: Happy 2011 everyone....This has got to be a better year!...My friend Vanessa, a 25 yrs lady, has announced her wedding with a millionaire young man Ronald who is the CEO of a MNC. Its amazing, she said she just posted her profile on a millionaire dating site called ----------RichFriends.Org----- - and received his chat invitations a few days later. Then, everything went so well that I cant believe its true! Every love story will unfold on its own. you can start your own wealthy love story for real at there too !many famous and wealthy people had a profile there ,why not me ?Texas and Israel forever. Taking out the worlds trash. Oooraaaah!-----
      Total matches: 35; only 15 marked as spam by existing classifiers/editors.
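To make the matching step concrete, here is a minimal sketch of how a mined token subsequence could be checked against comments and its support counted. It is not mcPrism and does no mining. The helper names (`tokenize`, `contains_subsequence`, `support`) and the simple lowercase tokenizer are assumptions. The pattern is a prefix of the subsequence shown on the slide, and the comments are shortened excerpts of the two matches.

```python
import re

def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_subsequence(tokens, pattern):
    """True if pattern occurs in tokens in order, with arbitrary gaps."""
    remaining = iter(tokens)
    # "tok in remaining" advances the iterator, so order is enforced.
    return all(tok in remaining for tok in pattern)

def support(pattern, comments):
    """Number of comments that contain the pattern as a subsequence."""
    return sum(contains_subsequence(tokenize(c), pattern) for c in comments)

pattern = ["happy", "2011", "friend", "yrs", "lady", "announced", "wedding"]
comments = [
    "Happy 2011 to everyone....My friend Vanessa, a 25 yrs lady, "
    "has announced her wedding",
    "Happy 2011 everyone....This has got to be a better year!...My friend "
    "Vanessa, a 25 yrs lady, has announced her wedding",
    "Happy to see she is progressing well.",
]
print(support(pattern, comments))  # prints 2
```

Because the match tolerates gaps, the extra text the spammer inserts between the repeated words ("This has got to be a better year!") does not break it, while the legitimate third comment is left alone.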
  13. The sequences are discriminative
  14. Identifying Near-Duplicate Strings
  15. Content Matching Approach
      Key idea: Leverage redundant content across template-based sites for automatic information extraction.
      [Figure: a web page matched against the seed database]
      Seed Database
      Name             Address
      Chinese Mirrch   120 Lexington Ave, New York, NY 10016
      Tiffin Wallah    127 E 28th St New York, NY 10079
  16. Baseline Similarity Measure
      Use q-grams to handle spelling errors
      String           3-grams
      chinese mirch    { chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irc, rch }
      chinese mirrch   { chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irr, rrc, rch }
      • Weight of a q-gram (attribute specific) = Sum of the IDFs of the words it appears in.
      • Weak Similarity = Cosine similarity between IDF-weighted q-grams.
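The weak similarity described above can be sketched as follows. This is one possible reading of the slide, not the authors' implementation: a q-gram that spans a word boundary is assumed to "appear in" both words, the IDF is a smoothed variant, and the three-string corpus is hypothetical. All function names are illustrative.

```python
import math
from collections import defaultdict

def idf_table(corpus):
    """Return a smoothed IDF function built from a list of strings."""
    n = len(corpus)
    df = defaultdict(int)
    for text in corpus:
        for word in set(text.split()):
            df[word] += 1
    return lambda word: math.log((n + 1) / (df.get(word, 0) + 1)) + 1.0

def qgrams(text, q=3):
    """Sliding q-grams, with '#' standing in for the space between words."""
    s = "#".join(text.split())
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def weighted_qgrams(text, idf, q=3):
    """Map each q-gram to the sum of the IDFs of the words it overlaps."""
    spans, pos = [], 0
    for word in text.split():
        spans.append((pos, pos + len(word), word))
        pos += len(word) + 1  # +1 for the '#' separator
    gram_words = defaultdict(set)
    for i, gram in enumerate(qgrams(text, q)):
        for start, end, word in spans:
            if i < end and i + q > start:  # window overlaps the word
                gram_words[gram].add(word)
    return {g: sum(idf(w) for w in ws) for g, ws in gram_words.items()}

def weak_similarity(a, b, idf, q=3):
    """Cosine similarity between the IDF-weighted q-gram vectors."""
    wa, wb = weighted_qgrams(a, idf, q), weighted_qgrams(b, idf, q)
    dot = sum(w * wb[g] for g, w in wa.items() if g in wb)
    norm = (math.sqrt(sum(w * w for w in wa.values()))
            * math.sqrt(sum(w * w for w in wb.values())))
    return dot / norm if norm else 0.0

# Hypothetical corpus used only to derive IDF weights.
idf = idf_table(["chinese mirch", "tiffin wallah", "chinese garden"])
print(weak_similarity("chinese mirch", "chinese mirrch", idf))
print(weak_similarity("chinese mirch", "tiffin wallah", idf))
```

The misspelled "mirrch" still shares most of its 3-grams with "mirch", so the first score stays high, while the unrelated name shares none and scores zero.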
  17. Strong Similarity
      Address (Seed)         Address (Site)               WS
      120 Lexington Avenue   120 Lexington Ave            0.53
      New York, NY 10016     (between 28th and 29th St)
                             New York, NY 10016
      312 W 34th Street      312 W 34th St                0.49
      New York, NY 10001     (between 8th and 9th Ave)
                             New York, NY 10001
      1. Variations are systematic and site-dependent.
      2. Cannot be handled by term weighting.
      Strong similarity is defined between two sets of strings.
      1. Calculate the matching pattern between weakly similar pairs in the two sets.
      2. Pick matching patterns with sufficient “support”.
      3. Use only parts of a string selected by the matching pattern in the final similarity calculation.
  18. Support & Strong Similarity
      Address (Seed)         Address (Site)               Matching Pattern   Matching Segments
      120 Lexington Avenue   120 Lexington Ave            103 103            120 Lexington
      New York, NY 10016     (between 28th and 29th St)                      New York, NY 10016
                             New York, NY 10016
      312 W 34th Street      312 W 34th St                103 103            312 W 34th
      New York, NY 10001     (between 8th and 9th Ave)                       New York, NY 10001
                             New York, NY 10001
      Matching Pattern: 103 103
      Support(103 103) = |{“120 Lexington New York, NY 10016”, “312 W 34th New York, NY 10001”}| = 2 (100% support)
      Address’ (Seed)        Address’ (Site)        SS
      120 Lexington          120 Lexington          1
      New York, NY 10016     New York, NY 10016
      312 W 34th             312 W 34th             1
      New York, NY 10001     New York, NY 10001
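The support computation can be illustrated with a small sketch. The slide does not explain its numeric pattern encoding ("103 103"), so the code substitutes its own simplified encoding: a run-collapsed mask over the seed tokens, where M marks a matched run and X an unmatched one. Token alignment via Python's `difflib.SequenceMatcher` is likewise a stand-in, and the names `match` and `support` are hypothetical. Support is read as the number of distinct matching segments that share a pattern, divided by the number of pairs.

```python
from difflib import SequenceMatcher
from itertools import groupby

def match(seed, site):
    """Return (pattern, segment) for one seed/site string pair.

    pattern: run-collapsed mask over seed tokens ('M' matched, 'X' unmatched)
    segment: the matched seed tokens joined back into one string
    """
    a, b = seed.split(), site.split()
    matched = [False] * len(a)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            matched[i] = True
    pattern = "".join("M" if key else "X" for key, _ in groupby(matched))
    segment = " ".join(tok for tok, hit in zip(a, matched) if hit)
    return pattern, segment

def support(pairs):
    """Map each pattern to (distinct matching segments) / (number of pairs)."""
    segments = {}
    for seed, site in pairs:
        pattern, segment = match(seed, site)
        segments.setdefault(pattern, set()).add(segment)
    return {p: len(s) / len(pairs) for p, s in segments.items()}

# Same businesses on both sides: the variation is systematic.
same_place = [
    ("120 Lexington Avenue New York, NY 10016",
     "120 Lexington Ave (between 28th and 29th St) New York, NY 10016"),
    ("312 W 34th Street New York, NY 10001",
     "312 W 34th St (between 8th and 9th Ave) New York, NY 10001"),
]
# Different businesses: only the generic "New York, NY" part agrees.
other_place = [
    ("120 Lexington Avenue New York, NY 10016",
     "1075 Fifth Ave New York, NY 10128"),
    ("312 W 34th Street New York, NY 10001",
     "1167 Madison Ave New York, NY 10128"),
]
print(support(same_place))   # {'MXM': 1.0}
print(support(other_place))  # {'XMX': 0.5}
```

In the first set the shared pattern selects a different segment for every pair, so it has full support. In the second set every pair yields the same generic segment, so the pattern is only half supported. That is the situation in which the talk falls back to weak similarity.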
  19. Need for Support of a Matching Pattern
      Address (Seed)         Address (Site)        Matching Pattern   Matching Segments
      120 Lexington Avenue   1075 Fifth Ave        010 010            New York, NY
      New York, NY 10016     New York, NY 10128
      312 W 34th Street      1167 Madison Ave      010 010            New York, NY
      New York, NY 10001     New York, NY 10128
      Matching Pattern: 010 010
      Support(010 010): |{“New York, NY”}| = 1 (50% support)
      Hence Strong Similarity = Weak Similarity
  20. Strong Similarity Scores
      String 1                        String 2                 WS     SS
      980 n michigan ave 14th floor   980 n michigan ave       0.57   1
      chicago il                      chicago il 60611
      1100 e north ave west           300 w north ave west     0.74   0.74
      chicago il 60185                chicago il 60185
      • SS boosts the similarity scores of true positives (TPs) over a wide range of WS scores without boosting those of false positives (FPs).
      • SS is not always 1, even for true positives.
      • SS scores are very high for most true positives.
  21. Identifying Near-Duplicate Images
  22. Near-Duplicates on the Web
  23. Approach
      Feature
        – DCT/FMT transform
        – Choose low-frequency coefficients
      Signature
        – Median-based quantization
        – Signature size depends on number of coefficients
      Performance
        – Large signature → Near-dup detection
        – Small size → Image similarity
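A minimal sketch of the signature idea, under several assumptions: only the DCT branch is shown (no FMT), the image is a small square grayscale grid, the DC coefficient is dropped, and signatures are compared by Hamming distance. The function names, the block size `k`, and the synthetic test image are illustrative and not taken from the talk.

```python
import math
from statistics import median

def dct2_low(image, k):
    """Top-left k x k block (low frequencies) of an unnormalized 2-D DCT-II."""
    n = len(image)
    coeffs = []
    for u in range(k):
        for v in range(k):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (image[x][y]
                              * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                              * math.cos(math.pi * (2 * y + 1) * v / (2 * n)))
            coeffs.append(total)
    return coeffs

def signature(image, k=4):
    """One bit per low-frequency coefficient: 1 if above the median, else 0."""
    coeffs = dct2_low(image, k)[1:]  # assumption: skip the DC term
    m = median(coeffs)
    return [1 if c > m else 0 for c in coeffs]

def hamming(sig_a, sig_b):
    """Number of positions where two equal-length signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))

# Synthetic 8x8 "image" and a copy with doubled intensity (hypothetical data).
img = [[float((7 * x + 13 * y) % 32) for y in range(8)] for x in range(8)]
scaled = [[2.0 * p for p in row] for row in img]
print(len(signature(img)))                          # 15 bits for k = 4
print(hamming(signature(img), signature(scaled)))   # 0
```

Quantizing against the median makes the bits depend only on the relative order of the coefficients, so a uniform intensity scaling leaves the signature unchanged. A larger `k` gives a longer, more selective signature, in line with the trade-off listed under Performance.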
  24. FMT Detections
  25. Signature-based Image Retrieval
  26. Large-scale Face Recognition
  27. Face Recognition
      • Face recognition was an important open problem in computer vision.
      • The availability of text and image/video data has provided new directions in web-scale face recognition.
      • If an image occurs in a news article, the named entities in the article can be associated with the faces in the image. This provides weak labels.
      • With a large amount of data, such weak signals can be boosted.
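As a rough illustration of how weak labels can be boosted by repetition, the sketch below uses plain majority voting. This is a stand-in and not the consistency learning method named earlier in the deck. It assumes faces have already been grouped into clusters by some other step; the cluster ids, person names, and thresholds are all hypothetical.

```python
from collections import Counter, defaultdict

def boost_weak_labels(observations, min_count=2, min_share=0.6):
    """Assign a name to a face cluster when one name co-occurs consistently.

    observations: iterable of (face_cluster_id, names_in_article) pairs.
    """
    votes = defaultdict(Counter)
    seen = Counter()
    for cluster, names in observations:
        seen[cluster] += 1
        votes[cluster].update(set(names))  # one vote per name per article
    labels = {}
    for cluster, counter in votes.items():
        if not counter:
            continue  # no named entities were ever seen with this cluster
        name, count = counter.most_common(1)[0]
        if count >= min_count and count / seen[cluster] >= min_share:
            labels[cluster] = name
    return labels

# Hypothetical data: each entry is one article containing one face cluster.
observations = [
    ("face_A", ["Person X", "Person Y"]),
    ("face_A", ["Person X", "Person Z"]),
    ("face_A", ["Person X"]),
    ("face_B", ["Person Y", "Person X"]),
    ("face_B", ["Person Y"]),
    ("face_C", ["Person X", "Person Z"]),
]
print(boost_weak_labels(observations))
# {'face_A': 'Person X', 'face_B': 'Person Y'}
```

A single article is ambiguous about which name belongs to which face, but the name that keeps recurring with the same face wins once enough articles are seen. `face_C` stays unlabeled because one observation is not enough evidence.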
  28. Conclusions
      • There is an information explosion, but the information has lots of near-duplicates.
      • Spotting near-duplicates has many advantages but is challenging.
      • Large datasets present an equally large opportunity (“Unreasonable effectiveness of data …”).
  29. References
      • Ravi Kant, Srinivasan H. Sengamedu, Krishnan S. Kumar: Comment spam detection by sequence mining. WSDM 2012.
      • Pankaj Gulhane, Rajeev Rastogi, Srinivasan H. Sengamedu, Ashwin Tengli: Exploiting Content Redundancy for Web Information Extraction. PVLDB, 2010.
      • Srinivasan H. Sengamedu, Neela Sawant: Finding near-duplicate images on the web using fingerprints. ACM Multimedia 2008.
      • Ming Zhao, Jay Yagnik, Hartwig Adam, David Bau: Large Scale Learning and Recognition of Faces in Web Videos. FG 2008.
  30. Questions/Comments? shs@komli.com
  31. Making Data Repetitions Work for You