Streaming Data Mining
PRESENTED BY Edo Liberty⎪ April 11, 2014

MLconf NYC Edo Liberty


  1. Streaming Data Mining. Presented by Edo Liberty, April 11, 2014. Copyright © 2014 Yahoo! All rights reserved. No reproduction or distribution allowed without express written permission. Parts of this presentation were given with Jelani Nelson (Harvard) as a KDD tutorial on streaming data mining.
  2. Single machine data mining. Diagram labels: The World, Data, Computation, Result. (Yahoo Confidential & Proprietary)
  3. Distributed storage. Diagram labels: The World, Data (repeated across nodes), Computation, Result.
  4. Distributed model (map/reduce, message passing, …). Diagram labels: The World, Data + Compute (repeated across nodes), Computation, Result.
  5. Distributed model (indexes, tables, databases, …). Diagram labels: The World, Data + Compute (repeated across nodes), Computation, Query, Result.
  6. 207 big-data infographics (meta infographic)
  7. (no text on slide)
  8. The streaming model. Diagram labels: The World, Computation, Sketch, Query, Query Algorithm, Result.
  9. The parallel streaming model. Diagram labels: The World, Compute + Sketch (repeated across nodes), Aggregate + Sketch, Query, Query Algorithm, Result.
  10. The streaming model (more accurately). Stream: 1 7 8 1 0 1 7 7. Diagram labels: Iterator, Computation, Sketch, Result. $O(n)$ items, $O(\mathrm{polylog}(n))$ space, $O(\mathrm{polylog}(n))$ computation per item.
  11. Communication complexity. Stream: 1 7 8 1 0 1 7 7. Diagram labels: Iterator, Iterator, Sketch, Result.
  12. Frequent items. References: Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002. Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003. (The name "Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002.) Metwally, Agrawal, Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams, 2006.
  13. Figure labels: $d$, $n$, $f(\cdot) = 5$ (the argument of $f$ was an item icon).
  14. Figure labels: $f(\cdot) = 5$, $d$.
  15. Figure label: $\ell$.
  16. Figure label: $\ell$.
  17. Figure label: $\ell$.
  18. Figure label: $\ell$.
  19. Figure label: $\ell$.
  20. Figure label: $\ell$.
  21. Figure label: $\ell$.
  22. Figure labels: $f'(\cdot) = 0$, $\ell$, $f'(\cdot) = 2$ (the arguments of $f'$ were item icons).
  23. The proof (very short). Assume we do this $t$ times. First fact: $f'(x) \le f(x)$. Second fact: $f'(x) \ge f(x) - t$.
  24. The proof (very short). Third (not so obvious) fact: $0 \le \sum_x f'(x) = \sum_x f(x) - t \cdot \ell = n - t \cdot \ell$, which gives $t \le n/\ell$. In words: we can only delete $\ell$ items $n/\ell$ times! $\blacksquare$ Therefore $|f'(x) - f(x)| \le n/\ell$.
  25. Useful form… Define $p(x) = f(x)/n$ and $p'(x) = f'(x)/n$. We get that $|p'(x) - p(x)| \le 1/\ell$. This is very useful for keeping approximate distributions!
  26. Threading Machine Generated Email
  27. Email threads. A simple email thread (that's not very hard to do…)
  28. Threading Machine Generated Email. Ailon, Karnin, Maarek, Liberty. Threading Machine Generated Email, WSDM 2013.
  29. Threading Machine Generated Email
  30. Threading Machine Generated Email
  31. What else can we do in the streaming model… Items (words, IP addresses, events, clicks, ...): item frequencies; counting distinct elements; moment and entropy estimation; approximate set operations. Vectors (text documents, images, example features, ...): dimensionality reduction; clustering (k-means, k-median, …); linear regression; machine learning (some of it at least). Matrices (text corpora, user preferences, graphs, ...): covariance matrix estimation; low rank approximation; sparsification.
  32. Thanks! Yahoo does big data algorithms, software and systems! Speak to our Talent Team or visit Careers.Yahoo.com and explore our career opportunities in NYC or Sunnyvale, CA. Seth Tropper (satropper@yahoo-inc.com), Doug DeSimone (desimone@yahoo-inc.com), Keith Daniels (kdnl@yahoo-inc.com). Yahoo is an equal opportunity employer.
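
The frequent-items slides (12 to 25) give the guarantee but the transcript does not preserve the algorithm itself, so the following is only a minimal Python sketch of a Misra-Gries style counter, written under the assumption that a deletion step fires whenever the sketch holds $\ell$ distinct items and removes one occurrence of each (which matches the $t \cdot \ell$ accounting in the proof). The names `misra_gries`, `ell`, and `deletions` are illustrative and do not come from the talk.

```python
from collections import Counter


def misra_gries(stream, ell):
    """Approximate item counts with fewer than `ell` counters between steps.

    Returns (counters, deletions): counters maps item -> f'(item), and
    deletions is t, the number of times one occurrence of each of `ell`
    distinct items was removed. (Assumed variant, see lead-in.)
    """
    if ell < 1:
        raise ValueError("ell must be at least 1")
    counters = {}
    deletions = 0
    for x in stream:
        # Count the new occurrence.
        counters[x] = counters.get(x, 0) + 1
        # Once `ell` distinct items are held, delete one of each.
        if len(counters) == ell:
            deletions += 1
            counters = {k: c - 1 for k, c in counters.items() if c > 1}
    return counters, deletions


# The example stream shown on slides 10 and 11.
stream = [1, 7, 8, 1, 0, 1, 7, 7]
ell = 3
approx, t = misra_gries(stream, ell)
exact = Counter(stream)
n = len(stream)

# Check the three facts from the proof on this stream.
for x in exact:
    assert approx.get(x, 0) <= exact[x]            # f'(x) <= f(x)
    assert approx.get(x, 0) >= exact[x] - t        # f'(x) >= f(x) - t
assert t <= n / ell                                # t <= n / ell
print(approx, t)
```

Each deletion step costs time proportional to $\ell$, but every decrement is paid for by an earlier increment, so the amortized work per item stays constant in this sketch.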
