Information excellence 2012feb_komli_srinivasan s h_making data repitions work

Information Excellence 2012 Spring Summit "DATA DYNAMICS"
Making data repetitions work for you
Srinivasan Sengamedu, VP, Komli

Published in Technology, News & Politics

Transcript

  • 1. Information Excellence Summit, February 25, 2012, Bangalore. http://Informationexcellence.wordpress.com. Making Data Repetitions Work for You. Srinivasan H Sengamedu, Komli Labs, shs@komli.com. Confidential
  • 2. Srinivasan H Sengamedu
    Bio: Srinivasan H Sengamedu (SHS) is the Head of Komli Labs, where he works on real-time bidding, user modeling, and other areas related to computational advertising. He was earlier Director of Audience and Search Sciences at Yahoo Labs, Bangalore, where he worked on information extraction, machine-learned ranking, pornography detection in images, comment spam detection, etc. Most of these technologies power various Yahoo products. He got his PhD from the Indian Institute of Science, Bangalore, and has held visiting positions at UCSD and NUS. He has published over 100 papers and has more than 30 approved or filed patent applications. He's generally excited about creating and productizing advanced technologies.
  • 3. A lot of information on the web: web pages, images, videos, blogs, mails.
  • 4. Not all information is new!
      – Web pages about the same product, business, etc.
      – Near-duplicate images
      – Similar comments, tweets, etc.
  • 5. Leveraging redundant information
    Classical use
      – Compression
        • Lossless compression (LZW)
        • Perceptually lossless compression (JPEG, MP3)
      – Co-occurrence
        • Pointwise Mutual Information
    Redundancy ≈ Confidence
    Leveraging redundancy requires care.
  • 6. Pointwise Mutual Information
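The slide content itself did not survive in the transcript. For reference, the standard definition is PMI(x, y) = log( p(x, y) / (p(x) p(y)) ). A minimal Python sketch, assuming probabilities are estimated as document frequencies, could look like the following; the function name `pmi_table` and the toy corpus are illustrative assumptions, not taken from the talk.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Estimate PMI(x, y) = log2(p(x, y) / (p(x) * p(y))) for word pairs.

    Assumption: p(x) is the fraction of documents containing x and
    p(x, y) is the fraction of documents containing both x and y.
    """
    n_docs = len(documents)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        # Each word counts once per document; sorting fixes the pair order.
        words = sorted(set(doc.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))

    table = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / n_docs
        p_x = word_counts[x] / n_docs
        p_y = word_counts[y] / n_docs
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table


# Hypothetical toy corpus: "new" and "york" always occur together.
docs = ["new york pizza", "new york taxi", "fresh pizza", "yellow taxi"]
pmi = pmi_table(docs)
print(pmi[("new", "york")])   # positive: co-occur more often than chance
print(pmi[("new", "pizza")])  # zero: co-occur exactly as often as chance
```

Pairs that repeat together more often than their individual frequencies predict get a positive score, which is the sense in which redundancy acts as confidence.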
  • 7. Any other uses of redundancy?
      – Commercial Spam Detection – Min-closed sequences
      – Information Extraction – Strong Similarity
      – Near-duplicate images – Image Signatures
      – Face recognition – Consistency Learning
  • 8. Identifying Near-Duplicate Subsequences
  • 9. Two Spam Comments
    Comment 1: Happy to see she is progressing well. Happy 2011 to everyone....My friend Vanessa, a 25 yrs lady, has announced her wedding with a millionaire young man Ronald who is the CEO of a MNC. Its amazing, she said she just posted her profile on a millionaire dating site called ----------Celeb Mingle.C○M-------- - and received his chat invitations a few days later. Then, everything went so well that I cant believe its true! Every love story will unfold on its own. Also happy to see that most Americans reject the blame-the-conservatives crap that some (not all) liberals from all social strata were trying to promote for political gain.
    Comment 2: Texas and Israel forever. Happy 2011 everyone....This has got to be a better year!...My friend Vanessa, a 25 yrs lady, has announced her wedding with a millionaire young man Ronald who is the CEO of a MNC. Its amazing, she said she just posted her profile on a millionaire dating site called ----------RichFriends.Org----- - and received his chat invitations a few days later. I cant believe its true! Every love story will unfold on its own. you can start your own wealthy love story for real at there too !many famous and wealthy people had a profile there ,why not me ? Taking out the worlds trash. Oooraaaah!-----
  • 10. Sequence-based Spam Detection
    Motivation: Commercial spammers repeat variations of the spam content and embed it in good content. These usually avoid detection by spam filters.
    Technical Challenge: Mine frequent subsequences efficiently. The general problem is NP-hard. The algorithms in the literature do not scale to web-scale data. The spam patterns change every few hours.
    Basic Ideas:
      – A new sequence mining algorithm that scales to internet scale and is faster than those in the literature, even on other public data sets like Gazelle.
      – A new framework for spam detection using frequent subsequences.
      – Experimental studies to measure the efficacy of the subsequence mining approach in detecting spam. We also study the life cycle of a typical spam pattern and use it to tune our mining parameters.
    Results: Experiments on news comment data show coverage >70% and editorial savings of a factor of ~30.
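To make the notion of a frequent subsequence concrete, here is a minimal Python sketch that counts the support of one given candidate pattern under a gap constraint. It is not the mining algorithm from the talk (it does not search for patterns, only checks one); the tokenizer, the function names, and the `max_gap` values are illustrative assumptions.

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def occurs_with_gap(pattern, tokens, max_gap):
    """True if `pattern` is a subsequence of `tokens` with at most
    `max_gap` skipped tokens between consecutive matched items."""
    if not pattern:
        return True
    # Positions where a match of the pattern prefix seen so far can end.
    ends = {i for i, tok in enumerate(tokens) if tok == pattern[0]}
    for item in pattern[1:]:
        ends = {
            j
            for j, tok in enumerate(tokens)
            if tok == item and any(0 < j - i <= max_gap + 1 for i in ends)
        }
        if not ends:
            return False
    return True


def support(pattern, comments, max_gap):
    """Number of comments that contain the pattern."""
    return sum(occurs_with_gap(pattern, tokenize(c), max_gap) for c in comments)


# Excerpts from the two spam comments plus one unrelated comment.
comments = [
    "Happy 2011 to everyone....My friend Vanessa, a 25 yrs lady, "
    "has announced her wedding",
    "Happy 2011 everyone....This has got to be a better year!...My friend "
    "Vanessa, a 25 yrs lady, has announced her wedding",
    "Happy to see she is progressing well.",
]
pattern = ["happy", "2011", "friend", "yrs", "lady", "announced", "wedding"]
print(support(pattern, comments, max_gap=10))
print(support(pattern, comments, max_gap=3))
```

A loose gap lets the pattern survive the filler sentence inserted into the second comment, while a tight gap does not; this is the kind of trade-off that makes the gap constraint a tuning parameter.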
  • 11. mcPrism
    The main ingredients in the algorithm:
      – A modified DFS on the lexicographically ordered sequence tree. The tree is pruned whenever we encounter a prefix-l-closed node. The set of prefix-l-closed nodes is pruned by inclusion check.
      – Prime Block Encodings for fast computation of joins. We enhance the encoding scheme to handle gap and closure constraints.
      – On-the-fly closure checking. We use the bidirectional closure checking and the backscan pruning schemes in BIDE. This is done using an enhancement of the block encoding scheme. This enhancement also solves an open problem: how to use block encodings to speed up closed sequence mining.
  • 12. Commercial Spam Detection – Results
    Subsequence: happy 2011 friend yrs lady announced wedding it amazing posted received chat invitations days believe it true love story unfold it own
    Match 1: Happy 2011 to everyone....My friend Vanessa, a 25 yrs lady, has announced her wedding with a millionaire young man Ronald who is the CEO of a MNC. Its amazing, she said she just posted her profile on a millionaire dating site called ----------Celeb Mingle.C○M--------- and received his chat invitations a few days later. Then, everything went so well that I cant believe its true! Every love story will unfold on its own...=====Happy to see she is progressing well. Also happy to see that most Americans reject the blame-the-conservatives crap that some (not all) liberals from all social strata were trying to promote for political gain.
    Match 2: Happy 2011 everyone....This has got to be a better year!...My friend Vanessa, a 25 yrs lady, has announced her wedding with a millionaire young man Ronald who is the CEO of a MNC. Its amazing, she said she just posted her profile on a millionaire dating site called ----------RichFriends.Org----- - and received his chat invitations a few days later. Then, everything went so well that I cant believe its true! Every love story will unfold on its own. you can start your own wealthy love story for real at there too !many famous and wealthy people had a profile there ,why not me ?Texas and Israel forever. Taking out the worlds trash. Oooraaaah!-----
    Total matches: 35; only 15 marked spam by existing classifiers/editors.
  • 13. The sequences are discriminative
  • 14. Identifying Near-Duplicate Strings
  • 15. Content Matching Approach
    Key idea: Leverage redundant content across template-based sites for automatic information extraction.
    Web page, Seed Database:
    Name | Address
    Chinese Mirrch | 120 Lexington Ave, New York, NY 10016
    Tiffin Wallah | 127 E 28th St New York, NY 10079
  • 16. Baseline Similarity Measure
    Use q-grams to handle spelling errors.
    String | 3-grams
    chinese mirch | { chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irc, rch }
    chinese mirrch | { chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irr, rrc, rch }
    Weight of a q-gram (attribute specific) = sum of the IDFs of the words it appears in.
    Weak Similarity = cosine similarity between IDF-weighted q-grams.
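The weak similarity described on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: a q-gram that spans a word boundary is taken to "appear in" both words it overlaps, words missing from the IDF table get a default IDF of 1.0, and the IDF values in the usage example are made up. None of these choices is confirmed by the talk.

```python
import math
from collections import defaultdict


def weighted_qgrams(text, idf, q=3, default_idf=1.0):
    """Map each q-gram of `text` to its weight.

    Words are joined with '#', as in the slide. A q-gram's weight is the
    sum of the IDFs of the words it overlaps (assumption for q-grams that
    span a word boundary).
    """
    words = text.lower().split()
    joined = "#".join(words)
    # owner[i] is the index of the word covering character i, None for '#'.
    owner = []
    for w_idx, word in enumerate(words):
        if w_idx > 0:
            owner.append(None)
        owner.extend([w_idx] * len(word))

    vector = defaultdict(float)
    for start in range(len(joined) - q + 1):
        gram = joined[start:start + q]
        touched = {o for o in owner[start:start + q] if o is not None}
        vector[gram] += sum(idf.get(words[o], default_idf) for o in touched)
    return dict(vector)


def weak_similarity(a, b, idf, q=3):
    """Cosine similarity between the IDF-weighted q-gram vectors."""
    va = weighted_qgrams(a, idf, q)
    vb = weighted_qgrams(b, idf, q)
    dot = sum(w * vb[g] for g, w in va.items() if g in vb)
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical IDF values: "chinese" is common, the restaurant name is rare.
example_idf = {"chinese": 0.5, "mirch": 2.0, "mirrch": 2.0}
print(sorted(weighted_qgrams("chinese mirch", example_idf)))
print(weak_similarity("chinese mirch", "chinese mirrch", example_idf))
```

Because the misspelling changes only a few 3-grams, most of the weighted vector is shared and the cosine stays high, which is why q-grams tolerate spelling errors better than whole-word matching.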
  • 17. Strong Similarity
    Address (Seed) | Address (Site) | WS
    120 Lexington Avenue / New York, NY 10016 | 120 Lexington Ave (between 28th and 29th St) / New York, NY 10016 | 0.53
    312 W 34th Street / New York, NY 10001 | 312 W 34th St (between 8th and 9th Ave) / New York, NY 10001 | 0.49
    Observations:
      1. Variations are systematic and site-dependent.
      2. They cannot be handled by term weighting.
    Strong similarity is defined between two sets of strings:
      1. Calculate the matching pattern between weakly similar pairs in the two sets.
      2. Pick matching patterns with sufficient “support”.
      3. Use only the parts of a string selected by the matching pattern in the final similarity calculation.
  • 18. Support & Strong Similarity
    Address (Seed) | Address (Site) | Matching Pattern | Matching Segments
    120 Lexington Avenue / New York, NY 10016 | 120 Lexington Ave (between 28th and 29th St) / New York, NY 10016 | 103 103 | 120 Lexington / New York, NY 10016
    312 W 34th Street / New York, NY 10001 | 312 W 34th St (between 8th and 9th Ave) / New York, NY 10001 | 103 103 | 312 W 34th / New York, NY 10001
    Matching Pattern: 103 103
    Support(103 103) = |{“120 Lexington New York, NY 10016”, “312 W 34th New York, NY 10001”}| = 2 (100% support)
    Address’ (Seed) | Address’ (Site) | SS
    120 Lexington / New York, NY 10016 | 120 Lexington / New York, NY 10016 | 1
    312 W 34th / New York, NY 10001 | 312 W 34th / New York, NY 10001 | 1
  • 19. Need for Support of a Matching Pattern
    Address (Seed) | Address (Site) | Matching Pattern | Matching Segments
    120 Lexington Avenue / New York, NY 10016 | 1075 Fifth Ave / New York, NY 10128 | 010 010 | New York, NY
    312 W 34th Street / New York, NY 10001 | 1167 Madison Ave / New York, NY 10128 | 010 010 | New York, NY
    Matching Pattern: 010 010
    Support(010 010) = |{“New York, NY”}| = 1 (50% support)
    Hence Strong Similarity = Weak Similarity.
  • 20. Strong Similarity Scores
    String 1 | String 2 | WS | SS
    980 n michigan ave 14th floor / chicago il | 980 n michigan ave / chicago il 60611 | 0.57 | 1
    1100 e north ave west / chicago il 60185 | 300 w north ave west / chicago il 60185 | 0.74 | 0.74
      – SS boosts the similarity scores of true positives (TPs) over a wide range of WS scores without boosting those of false positives (FPs).
      – SS is not always 1, even for true positives.
      – SS scores are very high for most true positives.
  • 21. Identifying Near-Duplicate Images
  • 22. Near-Duplicates on the Web
  • 23. Approach
      – Feature
        • DCT/FMT transform
        • Choose low-frequency coefficients
      – Signature
        • Median-based quantization
        • Signature size depends on number of coefficients
      – Performance
        • Large signature → near-duplicate detection
        • Small size → image similarity
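The pipeline on this slide (transform, keep low-frequency coefficients, quantize against the median) can be sketched as follows. This is a minimal illustration, not the method from the talk: it uses a naive, unnormalized 2-D DCT only (no FMT), drops the DC term, keeps a k × k corner of coefficients, and compares signatures by Hamming distance. All of those choices, the function names, and the synthetic images are assumptions.

```python
import math
from statistics import median


def dct_2d(block):
    """Naive, unnormalized 2-D DCT-II of a square grid (O(n^4), tiny inputs only)."""
    n = len(block)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (
                        block[x][y]
                        * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                        * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                    )
            out[u][v] = total
    return out


def signature(image, k=4):
    """Bit signature from the k x k lowest-frequency DCT coefficients.

    The DC term (0, 0) is dropped (assumption), and each remaining
    coefficient becomes 1 if it is above the median of the kept
    coefficients, else 0. A larger k gives a longer signature.
    """
    coeffs = dct_2d(image)
    low = [coeffs[u][v] for u in range(k) for v in range(k) if (u, v) != (0, 0)]
    med = median(low)
    return [1 if c > med else 0 for c in low]


def hamming(a, b):
    """Number of positions where two equal-length signatures differ."""
    return sum(x != y for x, y in zip(a, b))


# Synthetic 8 x 8 grayscale images (hypothetical data).
image = [[(3 * x * x + 7 * y + 5 * x * y) % 128 for y in range(8)] for x in range(8)]
rescaled = [[2 * p for p in row] for row in image]       # intensity-scaled copy
other = [[(x * y * 31 + 17) % 128 for y in range(8)] for x in range(8)]

sig = signature(image)
print(hamming(sig, signature(rescaled)))  # the scaled copy keeps the same bits
print(hamming(sig, signature(other)))
```

Median thresholding makes the bits depend only on the relative ordering of the coefficients, so a uniform intensity scaling leaves the signature unchanged.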
  • 24. FMT Detections
  • 25. Signature-based Image Retrieval
  • 26. Large-scale Face Recognition
  • 27. Face Recognition
    Face recognition was an important open problem in computer vision. Availability of text and image/video data has provided new directions in web-scale face recognition. If an image occurs in a news article, the named entities in the article can be associated with the faces in the image. This provides weak labels. With a large amount of data, such weak signals can be boosted.
  • 28. Conclusions
      – There is an information explosion, but the information has lots of near-duplicates.
      – Spotting near-duplicates has lots of advantages but is a challenge.
      – Large datasets present an equally large opportunity (“Unreasonable effectiveness of data …”).
  • 29. References
      – Ravi Kant, Srinivasan H. Sengamedu, Krishnan S. Kumar: Comment spam detection by sequence mining, WSDM 2012.
      – Pankaj Gulhane, Rajeev Rastogi, Srinivasan H. Sengamedu, Ashwin Tengli: Exploiting Content Redundancy for Web Information Extraction, PVLDB, 2010.
      – Srinivasan H. Sengamedu, Neela Sawant: Finding near-duplicate images on the web using fingerprints, ACM Multimedia 2008.
      – Ming Zhao, Jay Yagnik, Hartwig Adam, David Bau: Large Scale Learning and Recognition of Faces in Web Videos, FG 2008.
  • 30. Questions/Comments? shs@komli.com
  • 31. Making Data Repetitions Work for You