Indexing and Data Mining in Multimedia Databases Christos Faloutsos CMU  www.cs.cmu.edu/~christos
Outline <ul><li>Goal: ‘Find  similar / interesting  things’ </li></ul><ul><li>Problem - Applications </li></ul><ul><li>Ind...
Problem <ul><li>Given a large collection of (multimedia) records, find similar/interesting things, ie: </li></ul><ul><li>A...
Sample queries <ul><li>Similarity search </li></ul><ul><ul><li>Find pairs of branches with similar sales patterns </li></u...
Sample queries –cont’d <ul><li>Rule discovery </li></ul><ul><ul><li>Clusters (of patients; of customers; ...) </li></ul></...
Outline <ul><li>Goal: ‘Find  similar / interesting  things’ </li></ul><ul><li>Problem - Applications </li></ul><ul><li>Ind...
Indexing - Multimedia <ul><li>Problem: </li></ul><ul><li>given a set of (multimedia) objects, </li></ul><ul><li>find the o...
distance function: by  expert day $price 1 365 day $price 1 365 day $price 1 365
‘GEMINI’ - Pictorially day 1 365 day 1 365 S1 Sn F(S1) F(Sn) eg, avg eg,. std off-the-shelf S.A.Ms  (spatial Access Methods)
‘ GEMINI’ <ul><li>fast; ‘correct’ (=no false dismissals) </li></ul><ul><li>used for  </li></ul><ul><ul><li>images (eg., QB...
Remaining issues <ul><li>how to extract features automatically? </li></ul><ul><li>how to merge similarity scores from diff...
Outline <ul><li>Goal: ‘Find similar / interesting things’ </li></ul><ul><li>Problem - Applications </li></ul><ul><li>Index...
FastMap ?? 0 1 100 100 100 O5 1 0 100 100 100 O4 100 100 0 1 1 O3 100 100 1 0 1 O2 100 100 1 1 0 O1 O5 O4 O3 O2 O1 ~100 ~1
FastMap <ul><li>Multi-dimensional scaling (MDS) can do that, but in  O(N**2)   time </li></ul><ul><li>We want a  linear  a...
Applications: time sequences <ul><li>given  n  co-evolving time sequences </li></ul><ul><li>visualize them + find rules [I...
Applications - financial <ul><li>currency exchange rates [ICDE00] </li></ul>USD(t) USD(t-5) FRF GBP JPY HKD
Applications - financial <ul><li>currency exchange rates [ICDE00] </li></ul>USD(t) USD(t-5) USD HKD JPY FRF DEM GBP
Application: VideoTrails <ul><li>[ACM MM97] </li></ul>HIDE
VideoTrails - usage <ul><li>scene-cut detection (about 10% errors) </li></ul><ul><li>scene classification (eg., dialogue v...
Outline <ul><li>Goal: ‘Find similar / interesting things’ </li></ul><ul><li>Problem - Applications </li></ul><ul><li>Index...
Merging similarity scores <ul><li>eg., video: text, color, motion, audio </li></ul><ul><ul><li>weights change with the que...
DEMO server demo
‘FALCON’ Inverted Vs Vs Trader wants only ‘unstable’ stocks
‘FALCON’ Inverted Vs Vs average: is flat!
“Single query point” methods Rocchio + + + + + + x avg std
“Single query point” methods Rocchio MindReader + + + + + + MARS The averaging affect in action... x x x + + + + + + + + +...
+ + + + + Main idea: FALCON Contours feature1 (eg., avg) feature2 eg., std [Wu+, vldb2000]
A: Aggregate Dissimilarity <ul><li> : parameter (~ -5 ~ ‘soft OR’) </li></ul>g1 g2 x + + + + +
<ul><li>converges quickly (~5 iterations) </li></ul><ul><li>good precision/recall </li></ul><ul><li>is fast (can use off-t...
Conclusions for indexing + visualization <ul><li>GEMINI: fast indexing, exploiting off-the-shelf SAMs </li></ul><ul><li>Fa...
Outline <ul><li>Goal: ‘Find  similar / interesting  things’ </li></ul><ul><li>Problem - Applications </li></ul><ul><li>Ind...
Data mining & fractals –  Road map <ul><li>Motivation – problems / case study </li></ul><ul><li>Definition of fractals and...
Problem #1 - spatial d.m. <ul><li>Galaxies (Sloan Digital Sky Survey w/ B. Nichol) </li></ul><ul><li>- ‘spiral’ and ‘ elli...
Problem#2: dim. reduction <ul><li>given attributes x 1 , ... x n </li></ul><ul><ul><li>possibly, non-linearly correlated <...
Answer: <ul><li>Fractals / self-similarities / power laws </li></ul>
What is a fractal? <ul><li>= self-similar point set, e.g., Sierpinski triangle: </li></ul>... zero area; infinite length!
Definitions (cont’d) <ul><li>Paradox: Infinite perimeter ; Zero area! </li></ul><ul><li>‘dimensionality’: between 1 and 2 ...
Intrinsic (‘fractal’) dimension <ul><li>Q: fractal dimension of a line? </li></ul>Eg: #cylinders;  miles / gallon 4 2 3 3 ...
Intrinsic (‘fractal’) dimension <ul><li>Q: fractal dimension of a line? </li></ul><ul><li>A: nn ( <= r ) ~ r^ 1 </li></ul>
Intrinsic (‘fractal’) dimension <ul><li>Q: fractal dimension of a line? </li></ul><ul><li>A: nn ( <= r ) ~ r^ 1 </li></ul>...
Sierpinsky triangle == ‘correlation integral’ log( r ) log(#pairs  within <=r ) 1.58
Observations <ul><li>self-similarity -> </li></ul><ul><li><=> fractals  </li></ul><ul><li><=> scale-free </li></ul><ul><li...
Road map <ul><li>Motivation – problems / case studies </li></ul><ul><li>Definition of fractals and power laws </li></ul><u...
Solution#1: spatial d.m. <ul><li>Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000]) </li></ul><...
Solution#1: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell - 1.8 slope - plateau! - repulsion!
Solution#1: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell - 1.8 slope - plateau! - repulsion! [w/ Se...
spatial d.m. Heuristic on choosing # of clusters r1 r2 r1 r2
Solution#1: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell - 1.8 slope - plateau! - repulsion!
Solution#1: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell <ul><li>- 1.8 slope </li></ul><ul><li>- pl...
Problem #2: Dim. reduction
Solution: <ul><li>drop the attributes that don’t increase the ‘partial f.d.’ PFD </li></ul><ul><li>dfn: PFD of attribute s...
Problem #2: dim. reduction PFD~1 PFD~1 global FD=1 PFD=1 PFD=0 PFD=1
Problem #2: dim. reduction PFD~1 PFD=1 global FD=1 PFD=1 PFD=0 PFD=1 Notice: ‘max variance’ would fail here
Problem #2: dim. reduction PFD~1 PFD~1 global FD=1 PFD=1 PFD=0 PFD=1 Notice: SVD would fail here
Currency dataset HIDE
self-similar? fd=1.98 fd=4.25 currency eigenfaces HIDE
FDR on the ‘currency’ dataset HIDE if unif + indep.
FDR on the ‘currency’ dataset <ul><li>HKD: “useless” </li></ul><ul><li>>1.98 axis are needed </li></ul>HIDE if unif + indep.
Road map <ul><li>Motivation – problems / case studies </li></ul><ul><li>Definition of fractals and power laws </li></ul><u...
App. : traffic <ul><li>disk traces: self-similar (also: web traffic; comm. errors; etc) </li></ul>time #bytes
More apps: Brain scans <ul><li>Oct-trees; brain-scans </li></ul>octree levels Log(#octants) 2.63 = fd
More fractals: <ul><li>stock prices (LYCOS) - random walks: 1.5 </li></ul>1 year 2 years
More fractals: <ul><li>coast-lines: 1.1-1.2 (up to 1.58) </li></ul>
 
Examples:MG county <ul><li>Montgomery County of MD (road end-points) </li></ul>
Examples:LB county <ul><li>Long Beach county of CA (road end-points) </li></ul>
More power laws: Zipf’s law <ul><li>Bible -  rank  vs  frequency  (log-log) </li></ul>log(rank) log(freq) “ a” “ the”
More power laws <ul><li>Freq. distr. of  first names; last names (Mandelbrot) </li></ul>
Internet <ul><li>Internet routers: how many neighbors within  h  hops? </li></ul>U of Alberta
Internet topology <ul><li>Internet routers: how many neighbors within  h  hops? [SIGCOMM 99] </li></ul>Reachability functi...
More power laws: areas – Korcak’s law Scandinavian lakes ([icde99], w/ Proietti)
More power laws: areas – Korcak’s law Scandinavian lakes  area vs complementary cumulative count (log-log axes) log(count(...
Olympic medals: log rank log(# medals)
More power laws <ul><li>Energy of earthquakes (Gutenberg-Richter law) [simscience.org] </li></ul>log(count) magnitude day ...
Even more power laws: <ul><li>Income distribution (Pareto’s law); </li></ul><ul><li>sales distributions; </li></ul><ul><li...
Even more power laws: <ul><li>web hit frequencies ([Huberman]) </li></ul><ul><li>hyper-link distribution [Barabasi], ++ </...
Overall Conclusions: <ul><li>‘ Find similar/interesting things’ in multimedia databases </li></ul><ul><li>Indexing: featur...
Conclusions - cont’d <ul><li>New tools for Data Mining: Fractals/power laws: </li></ul><ul><ul><li>appear everywhere </li>...
Conclusions - cont’d <ul><ul><li>can model bursty time sequences (buffering/prefetching) </li></ul></ul><ul><ul><li>select...
Resources: <ul><li>Software and papers: </li></ul><ul><ul><li>http://www.cs.cmu.edu/~christos </li></ul></ul><ul><ul><li>F...
Upcoming SlideShare
Loading in …5
×

Indexing and Data Mining in Multimedia Databases

555 views

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
555
On SlideShare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
13
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Indexing and Data Mining in Multimedia Databases

  1. 1. Indexing and Data Mining in Multimedia Databases Christos Faloutsos CMU www.cs.cmu.edu/~christos
  2. 2. Outline <ul><li>Goal: ‘Find similar / interesting things’ </li></ul><ul><li>Problem - Applications </li></ul><ul><li>Indexing - similarity search </li></ul><ul><li>New tools for Data Mining: Fractals </li></ul><ul><li>Conclusions </li></ul><ul><li>Resources </li></ul>
  3. 3. Problem <ul><li>Given a large collection of (multimedia) records, find similar/interesting things, ie: </li></ul><ul><li>Allow fast, approximate queries, and </li></ul><ul><li>Find rules/patterns </li></ul>
  4. 4. Sample queries <ul><li>Similarity search </li></ul><ul><ul><li>Find pairs of branches with similar sales patterns </li></ul></ul><ul><ul><li>find medical cases similar to Smith's </li></ul></ul><ul><ul><li>Find pairs of sensor series that move in sync </li></ul></ul>
  5. 5. Sample queries –cont’d <ul><li>Rule discovery </li></ul><ul><ul><li>Clusters (of patients; of customers; ...) </li></ul></ul><ul><ul><li>Forecasting (total sales for next year?) </li></ul></ul><ul><ul><li>Outliers (eg., fraud detection) </li></ul></ul>
  6. 6. Outline <ul><li>Goal: ‘Find similar / interesting things’ </li></ul><ul><li>Problem - Applications </li></ul><ul><li>Indexing - similarity search </li></ul><ul><li>New tools for Data Mining: Fractals </li></ul><ul><li>Conclusions </li></ul><ul><li>Resourses </li></ul>
  7. 7. Indexing - Multimedia <ul><li>Problem: </li></ul><ul><li>given a set of (multimedia) objects, </li></ul><ul><li>find the ones similar to a desirable query object (quickly!) </li></ul>
  8. 8. distance function: by expert day $price 1 365 day $price 1 365 day $price 1 365
  9. 9. ‘GEMINI’ - Pictorially day 1 365 day 1 365 S1 Sn F(S1) F(Sn) eg, avg eg,. std off-the-shelf S.A.Ms (spatial Access Methods)
  10. 10. ‘ GEMINI’ <ul><li>fast; ‘correct’ (=no false dismissals) </li></ul><ul><li>used for </li></ul><ul><ul><li>images (eg., QBIC) (2x, 10x faster) </li></ul></ul><ul><ul><li>shapes (27x faster) </li></ul></ul><ul><ul><li>video (eg., InforMedia) </li></ul></ul><ul><ul><li>time sequences ([Rafiei+Mendelzon], ++) </li></ul></ul>
  11. 11. Remaining issues <ul><li>how to extract features automatically? </li></ul><ul><li>how to merge similarity scores from different media </li></ul>
  12. 12. Outline <ul><li>Goal: ‘Find similar / interesting things’ </li></ul><ul><li>Problem - Applications </li></ul><ul><li>Indexing - similarity search </li></ul><ul><ul><li>Visualization: Fastmap </li></ul></ul><ul><ul><li>Relevance feedback: FALCON </li></ul></ul><ul><li>Data Mining / Fractals </li></ul><ul><li>Conclusions </li></ul>
  13. 13. FastMap ?? 0 1 100 100 100 O5 1 0 100 100 100 O4 100 100 0 1 1 O3 100 100 1 0 1 O2 100 100 1 1 0 O1 O5 O4 O3 O2 O1 ~100 ~1
  14. 14. FastMap <ul><li>Multi-dimensional scaling (MDS) can do that, but in O(N**2) time </li></ul><ul><li>We want a linear algorithm: FastMap [SIGMOD95] </li></ul>
  15. 15. Applications: time sequences <ul><li>given n co-evolving time sequences </li></ul><ul><li>visualize them + find rules [ICDE00] </li></ul>time rate HKD JPY DEM
  16. 16. Applications - financial <ul><li>currency exchange rates [ICDE00] </li></ul>USD(t) USD(t-5) FRF GBP JPY HKD
  17. 17. Applications - financial <ul><li>currency exchange rates [ICDE00] </li></ul>USD(t) USD(t-5) USD HKD JPY FRF DEM GBP
  18. 18. Application: VideoTrails <ul><li>[ACM MM97] </li></ul>HIDE
  19. 19. VideoTrails - usage <ul><li>scene-cut detection (about 10% errors) </li></ul><ul><li>scene classification (eg., dialogue vs action) </li></ul>HIDE
  20. 20. Outline <ul><li>Goal: ‘Find similar / interesting things’ </li></ul><ul><li>Problem - Applications </li></ul><ul><li>Indexing - similarity search </li></ul><ul><ul><li>Visualization: Fastmap </li></ul></ul><ul><ul><li>Relevance feedback: FALCON </li></ul></ul><ul><li>Data Mining / Fractals </li></ul><ul><li>Conclusions </li></ul>
  21. 21. Merging similarity scores <ul><li>eg., video: text, color, motion, audio </li></ul><ul><ul><li>weights change with the query! </li></ul></ul><ul><li>solution 1: user specifies weights </li></ul><ul><li>solution 2: user gives examples  </li></ul><ul><ul><li>and we ‘learn’ what he/she wants: rel. feedback (Rocchio, MARS, MindReader) </li></ul></ul><ul><ul><li>but: how about disjunctive queries? </li></ul></ul>
  22. 22. DEMO server demo
  23. 23. ‘FALCON’ Inverted Vs Vs Trader wants only ‘unstable’ stocks
  24. 24. ‘FALCON’ Inverted Vs Vs average: is flat!
  25. 25. “Single query point” methods Rocchio + + + + + + x avg std
  26. 26. “Single query point” methods Rocchio MindReader + + + + + + MARS The averaging affect in action... x x x + + + + + + + + + + + +
  27. 27. + + + + + Main idea: FALCON Contours feature1 (eg., avg) feature2 eg., std [Wu+, vldb2000]
  28. 28. A: Aggregate Dissimilarity <ul><li> : parameter (~ -5 ~ ‘soft OR’) </li></ul>g1 g2 x + + + + +
  29. 29. <ul><li>converges quickly (~5 iterations) </li></ul><ul><li>good precision/recall </li></ul><ul><li>is fast (can use off-the-shelf ‘spatial/metric access methods’) </li></ul>FALCON
  30. 30. Conclusions for indexing + visualization <ul><li>GEMINI: fast indexing, exploiting off-the-shelf SAMs </li></ul><ul><li>FastMap: automatic feature extraction in O( N ) time </li></ul><ul><li>FALCON: relevance feedback for disjunctive queries </li></ul>
  31. 31. Outline <ul><li>Goal: ‘Find similar / interesting things’ </li></ul><ul><li>Problem - Applications </li></ul><ul><li>Indexing - similarity search </li></ul><ul><li>New tools for Data Mining: Fractals </li></ul><ul><li>Conclusions </li></ul><ul><li>Resourses </li></ul>
  32. 32. Data mining & fractals – Road map <ul><li>Motivation – problems / case study </li></ul><ul><li>Definition of fractals and power laws </li></ul><ul><li>Solutions to posed problems </li></ul><ul><li>More examples </li></ul>
  33. 33. Problem #1 - spatial d.m. <ul><li>Galaxies (Sloan Digital Sky Survey w/ B. Nichol) </li></ul><ul><li>- ‘spiral’ and ‘ elliptical ’ galaxies </li></ul><ul><li>(stores & households; healthy & ill subjects) </li></ul><ul><li>- patterns? ( not Gaussian; not uniform) </li></ul><ul><li>attraction/repulsion? </li></ul><ul><li>separability?? </li></ul>
  34. 34. Problem#2: dim. reduction <ul><li>given attributes x 1 , ... x n </li></ul><ul><ul><li>possibly, non-linearly correlated </li></ul></ul><ul><li>drop the useless ones </li></ul><ul><li>(Q: why? </li></ul><ul><li>A: to avoid the ‘dimensionality curse’) </li></ul>engine size mpg
  35. 35. Answer: <ul><li>Fractals / self-similarities / power laws </li></ul>
  36. 36. What is a fractal? <ul><li>= self-similar point set, e.g., Sierpinski triangle: </li></ul>... zero area; infinite length!
  37. 37. Definitions (cont’d) <ul><li>Paradox: Infinite perimeter ; Zero area! </li></ul><ul><li>‘dimensionality’: between 1 and 2 </li></ul><ul><li>actually: Log(3)/Log(2) = 1.58… (long story) </li></ul>
  38. 38. Intrinsic (‘fractal’) dimension <ul><li>Q: fractal dimension of a line? </li></ul>Eg: #cylinders; miles / gallon 4 2 3 3 2 4 1 5 y x
  39. 39. Intrinsic (‘fractal’) dimension <ul><li>Q: fractal dimension of a line? </li></ul><ul><li>A: nn ( <= r ) ~ r^ 1 </li></ul>
  40. 40. Intrinsic (‘fractal’) dimension <ul><li>Q: fractal dimension of a line? </li></ul><ul><li>A: nn ( <= r ) ~ r^ 1 </li></ul><ul><li>Q: fd of a plane? </li></ul><ul><li>A: nn ( <= r ) ~ r^ 2 </li></ul><ul><li>fd== slope of (log(nn) vs log(r) ) </li></ul>
  41. 41. Sierpinsky triangle == ‘correlation integral’ log( r ) log(#pairs within <=r ) 1.58
  42. 42. Observations <ul><li>self-similarity -> </li></ul><ul><li><=> fractals </li></ul><ul><li><=> scale-free </li></ul><ul><li><=> power-laws (y=x^ a , F=C*r^(-2)) </li></ul>log( r ) log(#pairs within <=r ) 1.58
  43. 43. Road map <ul><li>Motivation – problems / case studies </li></ul><ul><li>Definition of fractals and power laws </li></ul><ul><li>Solutions to posed problems </li></ul><ul><li>More examples </li></ul><ul><li>Conclusions </li></ul>
  44. 44. Solution#1: spatial d.m. <ul><li>Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000]) </li></ul><ul><li>clusters? </li></ul><ul><li>separable? </li></ul><ul><li>attraction/repulsion? </li></ul><ul><li>data ‘scrubbing’ – duplicates? </li></ul>
  45. 45. Solution#1: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell - 1.8 slope - plateau! - repulsion!
  46. 46. Solution#1: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell - 1.8 slope - plateau! - repulsion! [w/ Seeger, Traina, Traina, SIGMOD00]
  47. 47. spatial d.m. Heuristic on choosing # of clusters r1 r2 r1 r2
  48. 48. Solution#1: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell - 1.8 slope - plateau! - repulsion!
  49. 49. Solution#1: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell <ul><li>- 1.8 slope </li></ul><ul><li>- plateau! </li></ul><ul><li>repulsion!! </li></ul>- duplicates
  50. 50. Problem #2: Dim. reduction
  51. 51. Solution: <ul><li>drop the attributes that don’t increase the ‘partial f.d.’ PFD </li></ul><ul><li>dfn: PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00] </li></ul>
  52. 52. Problem #2: dim. reduction PFD~1 PFD~1 global FD=1 PFD=1 PFD=0 PFD=1
  53. 53. Problem #2: dim. reduction PFD~1 PFD=1 global FD=1 PFD=1 PFD=0 PFD=1 Notice: ‘max variance’ would fail here
  54. 54. Problem #2: dim. reduction PFD~1 PFD~1 global FD=1 PFD=1 PFD=0 PFD=1 Notice: SVD would fail here
  55. 55. Currency dataset HIDE
  56. 56. self-similar? fd=1.98 fd=4.25 currency eigenfaces HIDE
  57. 57. FDR on the ‘currency’ dataset HIDE if unif + indep.
  58. 58. FDR on the ‘currency’ dataset <ul><li>HKD: “useless” </li></ul><ul><li>>1.98 axis are needed </li></ul>HIDE if unif + indep.
  59. 59. Road map <ul><li>Motivation – problems / case studies </li></ul><ul><li>Definition of fractals and power laws </li></ul><ul><li>Solutions to posed problems </li></ul><ul><li>More examples </li></ul><ul><li>Conclusions </li></ul>
  60. 60. App. : traffic <ul><li>disk traces: self-similar (also: web traffic; comm. errors; etc) </li></ul>time #bytes
  61. 61. More apps: Brain scans <ul><li>Oct-trees; brain-scans </li></ul>octree levels Log(#octants) 2.63 = fd
  62. 62. More fractals: <ul><li>stock prices (LYCOS) - random walks: 1.5 </li></ul>1 year 2 years
  63. 63. More fractals: <ul><li>coast-lines: 1.1-1.2 (up to 1.58) </li></ul>
  64. 65. Examples:MG county <ul><li>Montgomery County of MD (road end-points) </li></ul>
  65. 66. Examples:LB county <ul><li>Long Beach county of CA (road end-points) </li></ul>
  66. 67. More power laws: Zipf’s law <ul><li>Bible - rank vs frequency (log-log) </li></ul>log(rank) log(freq) “ a” “ the”
  67. 68. More power laws <ul><li>Freq. distr. of first names; last names (Mandelbrot) </li></ul>
  68. 69. Internet <ul><li>Internet routers: how many neighbors within h hops? </li></ul>U of Alberta
  69. 70. Internet topology <ul><li>Internet routers: how many neighbors within h hops? [SIGCOMM 99] </li></ul>Reachability function: number of neighbors within r hops, vs r (log-log). Mbone routers, 1995 log(hops) log(#pairs) 2.8
  70. 71. More power laws: areas – Korcak’s law Scandinavian lakes ([icde99], w/ Proietti)
  71. 72. More power laws: areas – Korcak’s law Scandinavian lakes area vs complementary cumulative count (log-log axes) log(count( >= area)) log(area)
  72. 73. Olympic medals: log rank log(# medals)
  73. 74. More power laws <ul><li>Energy of earthquakes (Gutenberg-Richter law) [simscience.org] </li></ul>log(count) magnitude day amplitude
  74. 75. Even more power laws: <ul><li>Income distribution (Pareto’s law); </li></ul><ul><li>sales distributions; </li></ul><ul><li>duration of UNIX jobs </li></ul><ul><li>Distribution of UNIX file sizes </li></ul><ul><li>publication counts (Lotka’s law) </li></ul>
  75. 76. Even more power laws: <ul><li>web hit frequencies ([Huberman]) </li></ul><ul><li>hyper-link distribution [Barabasi], ++ </li></ul>
  76. 77. Overall Conclusions: <ul><li>‘ Find similar/interesting things’ in multimedia databases </li></ul><ul><li>Indexing: feature extraction (‘GEMINI’) </li></ul><ul><ul><li>automatic feature extraction: FastMap </li></ul></ul><ul><ul><li>Relevance feedback: FALCON </li></ul></ul>
  77. 78. Conclusions - cont’d <ul><li>New tools for Data Mining: Fractals/power laws: </li></ul><ul><ul><li>appear everywhere </li></ul></ul><ul><ul><li>lead to skewed distributions (Gaussian, Poisson, uniformity, independence) </li></ul></ul><ul><ul><li>‘ correlation integral’ for separability/cluster detection </li></ul></ul><ul><ul><li>PFD for dimensionality reduction </li></ul></ul>
  78. 79. Conclusions - cont’d <ul><ul><li>can model bursty time sequences (buffering/prefetching) </li></ul></ul><ul><ul><li>selectivity estimation (‘ how many neighbors within x km ?) </li></ul></ul><ul><ul><li>dim. curse diagnosis (it’s the fractal dim. that matters! [ICDE2000]) </li></ul></ul>
  79. 80. Resources: <ul><li>Software and papers: </li></ul><ul><ul><li>http://www.cs.cmu.edu/~christos </li></ul></ul><ul><ul><li>Fractal dimension (FracDim) </li></ul></ul><ul><ul><li>Separability (sigmod 2000) </li></ul></ul><ul><ul><li>Relevance feedback for query by content (FALCON – vldb 2000) </li></ul></ul>

×