MLconf NYC Edo Liberty

  • 148 views
Uploaded on

 

More in: Technology , Education
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads

Views

Total Views
148
On Slideshare
0
From Embeds
0
Number of Embeds
2

Actions

Shares
Downloads
2
Comments
0
Likes
0

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Streaming Data Mining PRESENTED BY Edo Liberty⎪ April 11, 2014 Copyright © 2014 Yahoo! All rights reserved. No reproduction or distribution allowed without express written permission. Parts of this presentation were given with Jelani Nelson (Harvard) as a KDD tutorial on streaming data mining.
  • 2. 2 Yahoo Confidential & Proprietary Data Computation Result The World Single machine data mining
  • 3. 3 Yahoo Confidential & Proprietary Data Data Data Data Computation Result The World Distributed storage
  • 4. 4 Yahoo Confidential & Proprietary Data + Compute Data + Compute Data + Compute Data + Compute Computation Result The World Data + Compute Data + Compute Data + Compute Data + Compute Distributed model (map/reduce, message passing, …)
  • 5. 5 Yahoo Confidential & Proprietary Data + Compute Data + Compute Data + Compute Data + Compute Computation Result The World Data + Compute Data + Compute Data + Compute Data + Compute ComputationQuery Distributed model (indexes, tables, databases, …)
  • 6. 207 big-data infographics (meta infographic) 6 Yahoo Confidential & Proprietary
  • 7. 7 Yahoo Confidential & Proprietary
  • 8. 8 Yahoo Confidential & Proprietary Sketch The World Query Algorithm ResultQuery Result Computation The streaming model
  • 9. 9 Yahoo Confidential & Proprietary Aggregate+ Sketch The World Query Algorithm ResultQuery Result Compute + Sketch Compute + Sketch Compute + Sketch Compute + Sketch The parallel streaming model
  • 10. 10 Yahoo Confidential & Proprietary 1 7 8 1 0 1 7 7 Sketch Result Iterator Computation The streaming model (more accurately) O(n)Items O(polylog(n)) Space O(polylog(n)) Computation per item
  • 11. 11 Yahoo Confidential & Proprietary Sketch Result Iterator Iterator Communication complexity 1 7 8 1 0 1 7 7
  • 12. Frequent items Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002 Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003 The name ``Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002 Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, 2006
  • 13. 13 Yahoo Confidential & Proprietary d n f( ) = 5
  • 14. 14 Yahoo Confidential & Proprietary f( ) = 5 d
  • 15. 15 Yahoo Confidential & Proprietary `
  • 16. 16 Yahoo Confidential & Proprietary `
  • 17. 17 Yahoo Confidential & Proprietary `
  • 18. 18 Yahoo Confidential & Proprietary `
  • 19. 19 Yahoo Confidential & Proprietary `
  • 20. 20 Yahoo Confidential & Proprietary `
  • 21. 21 Yahoo Confidential & Proprietary `
  • 22. 22 Yahoo Confidential & Proprietary f0 ( ) = 0 ` f0 ( ) = 2
  • 23. 23 Yahoo Confidential & Proprietary Assume we do this timest Second fact: f0 (x) f(x) t f0 (x)  f(x)First fact: The proof (very short)
  • 24. 24 Yahoo Confidential & Proprietary Third (not so obvious) fact: Which gives . In words: We can only delete items times! t  n/` 0 P f0 (x) = P f(x) t · ` = n t · ` ⌅ The proof (very short) ` n/` |f0 (x) f(x)|  n/`
  • 25. Useful form… 25 Yahoo Confidential & Proprietary Define And We get that This is very useful for keeping approx’ distributions! p(x) = f(x)/n p0 (x) = f0 (x)/n |p0 (x) p(x)|  1/`
  • 26. Threading Machine Generated Email
  • 27. 27 Yahoo Confidential & Proprietary Email threads A simple email thread (that’s not very hard to do…)
  • 28. Threading Machine Generated Email 28 Yahoo Confidential & Proprietary Ailon, Karnin, Maarek, Liberty, Threading Machine Generated Email, WSDM 2013
  • 29. 29 Yahoo Confidential & Proprietary Threading Machine Generated Email
  • 30. 30 Yahoo Confidential & Proprietary Threading Machine Generated Email
  • 31. What else can we do in the streaming model… 31 Yahoo Confidential & Proprietary Items (words, IP-adresses, events, clicks,...): §  Item frequencies §  Counting distinct elements §  Moment and entropy estimation §  Approximate set operations Vectors (text documents, images, example features,...) §  Dimensionality reduction §  Clustering (k-means, k-median,…) §  Linear Regression §  Machine learning (some of it at least) Matrices (text corpora, user preferences, graphs...) §  Covariance estimation matrix §  Low rank approximation §  Sparsification
  • 32. Thanks! 32 Yahoo Confidential & Proprietary Yahoo does big data algorithms, software and systems! Speak to our Talent Team or visit Careers.Yahoo.com and explore our career opportunities in NYC or Sunnyvale, CA Seth Tropper satropper@yahoo-inc.com Doug DeSimone desimone@yahoo-inc.com Keith Daniels kdnl@yahoo-inc.com Yahoo is an equal opportunity employer.