Indexing and Data Mining in Multimedia Databases

1,768 views

Published on

0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,768
On SlideShare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
38
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Indexing and Data Mining in Multimedia Databases

  1. 1. Indexing and Data Mining in Multimedia Databases Christos Faloutsos CMU www.cs.cmu.edu/~christos
  2. 2. Outline <ul><li>Goal: ‘Find similar / interesting things’ </li></ul><ul><li>Problem - Applications </li></ul><ul><li>Indexing - similarity search </li></ul><ul><li>New tools for Data Mining: Fractals </li></ul><ul><li>Conclusions </li></ul><ul><li>Resources </li></ul>
  3. 3. Problem <ul><li>Given a large collection of (multimedia) records, find similar/interesting things, ie: </li></ul><ul><li>Allow fast, approximate queries, and </li></ul><ul><li>Find rules/patterns </li></ul>
  4. 4. Sample queries <ul><li>Similarity search </li></ul><ul><ul><li>Find pairs of branches with similar sales patterns </li></ul></ul><ul><ul><li>find medical cases similar to Smith's </li></ul></ul><ul><ul><li>Find pairs of sensor series that move in sync </li></ul></ul><ul><ul><li>Find shapes like a spark-plug </li></ul></ul>
  5. 5. Sample queries –cont’d <ul><li>Rule discovery </li></ul><ul><ul><li>Clusters (of branches; of sensor data; ...) </li></ul></ul><ul><ul><li>Forecasting (total sales for next year?) </li></ul></ul><ul><ul><li>Outliers (eg., unexpected part failures; fraud detection) </li></ul></ul>
  6. 6. Outline <ul><li>Goal: ‘Find similar / interesting things’ </li></ul><ul><li>Problem - Applications </li></ul><ul><li>Indexing - similarity search </li></ul><ul><li>New tools for Data Mining: Fractals </li></ul><ul><li>Conclusions </li></ul><ul><li>related projects @ CMU and resourses </li></ul>
  7. 7. Indexing - Multimedia <ul><li>Problem: </li></ul><ul><li>given a set of (multimedia) objects, </li></ul><ul><li>find the ones similar to a desirable query object </li></ul>
  8. 8. distance function: by expert day $price 1 365 day $price 1 365 day $price 1 365
  9. 9. ‘GEMINI’ - Pictorially day 1 365 day 1 365 S1 Sn F(S1) F(Sn) eg, avg eg,. std
  10. 10. Remaining issues <ul><li>how to extract features automatically? </li></ul><ul><li>how to merge similarity scores from different media </li></ul>
  11. 11. Outline <ul><li>Goal: ‘Find similar / interesting things’ </li></ul><ul><li>Problem - Applications </li></ul><ul><li>Indexing - similarity search </li></ul><ul><ul><li>Visualization: Fastmap </li></ul></ul><ul><ul><li>Relevance feedback: FALCON </li></ul></ul><ul><li>Data Mining / Fractals </li></ul><ul><li>Conclusions </li></ul>
  12. 12. FastMap ?? 0 1 100 100 100 O5 1 0 100 100 100 O4 100 100 0 1 1 O3 100 100 1 0 1 O2 100 100 1 1 0 O1 O5 O4 O3 O2 O1 ~100 ~1
  13. 13. FastMap <ul><li>Multi-dimensional scaling (MDS) can do that, but in O(N**2) time </li></ul><ul><li>We want a linear algorithm: FastMap [SIGMOD95] </li></ul>
  14. 14. Applications: time sequences <ul><li>given n co-evolving time sequences </li></ul><ul><li>visualize them + find rules [ICDE00] </li></ul>time rate HKD JPY DEM
  15. 15. Applications - financial <ul><li>currency exchange rates [ICDE00] </li></ul>USD(t) USD(t-5) FRF GBP JPY HKD
  16. 16. Applications - financial <ul><li>currency exchange rates [ICDE00] </li></ul>USD(t) USD(t-5) USD HKD JPY FRF DEM GBP
  17. 17. Application: VideoTrails <ul><li>[ACM MM97] </li></ul>
  18. 18. VideoTrails - usage <ul><li>scene-cut detection (about 10% errors) </li></ul><ul><li>scene classification (eg., dialogue vs action) </li></ul>
  19. 19. Outline <ul><li>Goal: ‘Find similar / interesting things’ </li></ul><ul><li>Problem - Applications </li></ul><ul><li>Indexing - similarity search </li></ul><ul><ul><li>Visualization: Fastmap </li></ul></ul><ul><ul><li>Relevance feedback: FALCON </li></ul></ul><ul><li>Data Mining / Fractals </li></ul><ul><li>Conclusions </li></ul>
  20. 20. Merging similarity scores <ul><li>eg., video: text, color, motion, audio </li></ul><ul><ul><li>weights change with the query! </li></ul></ul><ul><li>solution 1: user specifies weights </li></ul><ul><li>solution 2: user gives examples  </li></ul><ul><ul><li>and we ‘learn’ what he/she wants: rel. feedback (Rocchio, MARS, MindReader) </li></ul></ul><ul><ul><li>but: how about disjunctive queries? </li></ul></ul>
  21. 21. ‘FALCON’ Inverted Vs Vs Trader wants only ‘unstable’ stocks
  22. 22. “Single query point” methods Rocchio + + + + + + x
  23. 23. “Single query point” methods Rocchio MindReader + + + + + + MARS The averaging affect in action... x x x + + + + + + + + + + + +
  24. 24. + + + + + Main idea: FALCON Contours feature1 (eg., temperature) feature2 eg., frequency [Wu+, vldb2000]
  25. 25. Conclusions for indexing + visualization <ul><li>GEMINI: fast indexing, exploiting off-the-shelf SAMs </li></ul><ul><li>FastMap: automatic feature extraction in O( N ) time </li></ul><ul><li>FALCON: relevance feedback for disjunctive queries </li></ul>
  26. 26. Outline <ul><li>Goal: ‘Find similar / interesting things’ </li></ul><ul><li>Problem - Applications </li></ul><ul><li>Indexing - similarity search </li></ul><ul><li>New tools for Data Mining: Fractals </li></ul><ul><li>Conclusions </li></ul><ul><li>Resourses </li></ul>
  27. 27. Data mining & fractals – Road map <ul><li>Motivation – problems / case study </li></ul><ul><li>Definition of fractals and power laws </li></ul><ul><li>Solutions to posed problems </li></ul><ul><li>More examples </li></ul>
  28. 28. Problem #1 - spatial d.m. <ul><li>Galaxies (Sloan Digital Sky Survey w/ B. Nichol) </li></ul><ul><li>- ‘spiral’ and ‘ elliptical ’ galaxies </li></ul><ul><li>(stores & households ; mpg & MTBF...) </li></ul><ul><li>- patterns? ( not Gaussian; not uniform) </li></ul><ul><li>attraction/repulsion? </li></ul><ul><li>separability?? </li></ul>
  29. 29. Problem#2: dim. reduction <ul><li>given attributes x 1 , ... x n </li></ul><ul><ul><li>possibly, non-linearly correlated </li></ul></ul><ul><li>drop the useless ones </li></ul><ul><li>(Q: why? </li></ul><ul><li>A: to avoid the ‘dimensionality curse’) </li></ul>
  30. 30. Answer: <ul><li>Fractals / self-similarities / power laws </li></ul>
  31. 31. What is a fractal? <ul><li>= self-similar point set, e.g., Sierpinski triangle: </li></ul>... zero area; infinite length!
  32. 32. Definitions (cont’d) <ul><li>Paradox: Infinite perimeter ; Zero area! </li></ul><ul><li>‘dimensionality’: between 1 and 2 </li></ul><ul><li>actually: Log(3)/Log(2) = 1.58… (long story) </li></ul>
  33. 33. Intrinsic (‘fractal’) dimension <ul><li>Q: fractal dimension of a line? </li></ul>Eg: #cylinders; miles / gallon 4 2 3 3 2 4 1 5 y x
  34. 34. Intrinsic (‘fractal’) dimension <ul><li>Q: fractal dimension of a line? </li></ul><ul><li>A: nn ( <= r ) ~ r^ 1 </li></ul><ul><li>(‘power law’: y=x^a) </li></ul>
  35. 35. Intrinsic (‘fractal’) dimension <ul><li>Q: fractal dimension of a line? </li></ul><ul><li>A: nn ( <= r ) ~ r^ 1 </li></ul><ul><li>(‘power law’: y=x^a) </li></ul><ul><li>Q: fd of a plane? </li></ul><ul><li>A: nn ( <= r ) ~ r^ 2 </li></ul><ul><li>fd== slope of (log(nn) vs log(r) ) </li></ul>
  36. 36. Sierpinsky triangle == ‘correlation integral’ log( r ) log(#pairs within <=r ) 1.58
  37. 37. Road map <ul><li>Motivation – problems / case studies </li></ul><ul><li>Definition of fractals and power laws </li></ul><ul><li>Solutions to posed problems </li></ul><ul><li>More examples </li></ul><ul><li>Conclusions </li></ul>
  38. 38. Solution#1: spatial d.m. <ul><li>Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000]) </li></ul><ul><li>clusters? </li></ul><ul><li>separable? </li></ul><ul><li>attraction/repulsion? </li></ul><ul><li>data ‘scrubbing’ – duplicates? </li></ul>
  39. 39. Solution#1: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell - 1.8 slope - plateau! - repulsion!
  40. 40. Solution#1: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell - 1.8 slope - plateau! - repulsion! [w/ Seeger, Traina, Traina, SIGMOD00]
  41. 41. spatial d.m. Heuristic on choosing # of clusters r1 r2 r1 r2
  42. 42. Solution#1: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell - 1.8 slope - plateau! - repulsion!
  43. 43. Solution#1: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell <ul><li>- 1.8 slope </li></ul><ul><li>- plateau! </li></ul><ul><li>repulsion!! </li></ul>- duplicates
  44. 44. Problem #2: Dim. reduction
  45. 45. Solution: <ul><li>drop the attributes that don’t increase the ‘partial f.d.’ PFD </li></ul><ul><li>dfn: PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00] </li></ul>
  46. 46. Problem #2: dim. reduction PFD~1 PFD~1 global FD=1 PFD=1 PFD=0 PFD=1
  47. 47. Problem #2: dim. reduction PFD~1 PFD=1 global FD=1 PFD=1 PFD=0 PFD=1 Notice: ‘max variance’ would fail here
  48. 48. Problem #2: dim. reduction PFD~1 PFD~1 global FD=1 PFD=1 PFD=0 PFD=1 Notice: SVD would fail here
  49. 49. Road map <ul><li>Motivation – problems / case studies </li></ul><ul><li>Definition of fractals and power laws </li></ul><ul><li>Solutions to posed problems </li></ul><ul><li>More examples </li></ul><ul><ul><li>fractals </li></ul></ul><ul><ul><li>power laws </li></ul></ul><ul><li>Conclusions </li></ul>
  50. 50. disk traffic <ul><li>Not Poisson, not(?) iid - BUT: self-similar </li></ul><ul><li>How to model it? </li></ul>time #bytes
  51. 51. traffic <ul><li>disk traces (80-20 ‘law’ = ‘multifractal’ [ICDE’02]) </li></ul>time #bytes 20% 80%
  52. 52. Traffic <ul><li>Many other time-sequences are bursty/clustered: (such as?) </li></ul>
  53. 53. Tape accesses # tapes needed, to retrieve n records? (# days down, due to failures / hurricanes / communication noise...) time Tape#1 Tape# N
  54. 54. Tape accesses # tapes retrieved # qual. records 50-50 = Poisson real time Tape#1 Tape# N
  55. 55. More apps: Brain scans <ul><li>Oct-trees; brain-scans </li></ul>octree levels Log(#octants) 2.63 = fd
  56. 56. GIS points <ul><li>Cross-roads of Montgomery county: </li></ul><ul><li>any rules? </li></ul>
  57. 57. GIS <ul><li>A: self-similarity: </li></ul><ul><li>intrinsic dim. = 1.51 </li></ul><ul><li>avg#neighbors(<= r ) = r^D </li></ul>log( r ) log(#pairs(within <= r)) 1.51
  58. 58. Examples:LB county <ul><li>Long Beach county of CA (road end-points) </li></ul>
  59. 59. More fractals: <ul><li>cardiovascular system: 3 (!) </li></ul><ul><li>stock prices (LYCOS) - random walks: 1.5 </li></ul><ul><li>Coastlines: 1.2-1.58 (?) </li></ul>1 year 2 years
  60. 61. Road map <ul><li>Motivation – problems / case studies </li></ul><ul><li>Definition of fractals and power laws </li></ul><ul><li>Solutions to posed problems </li></ul><ul><li>More examples </li></ul><ul><ul><li>fractals </li></ul></ul><ul><ul><li>power laws </li></ul></ul><ul><li>Conclusions </li></ul>
  61. 62. Fractals <-> Power laws <ul><li>self-similarity -> </li></ul><ul><li><=> fractals </li></ul><ul><li><=> scale-free </li></ul><ul><li><=> power-laws (y=x^ a , F=C*r^(-2)) </li></ul>log( r ) log(#pairs within <=r ) 1.58
  62. 63. Zipf’s law Bible RANK-FREQUENCY plot: (in log-log scales) Zipf’s (first) Law: log(rank) log(freq) “ the” “ and” 
  63. 64. Zipf’s law <ul><li>similarly for first names (slope ~-1) </li></ul><ul><li>last names (~ -0.7) </li></ul><ul><li>etc </li></ul>
  64. 65. More power laws <ul><li>Energy of earthquakes (Gutenberg-Richter law) [simscience.org] </li></ul>log(count) magnitude day amplitude
  65. 66. Clickstream data <url, u-id, ....> Web Site Traffic log(freq) log(count) Zipf
  66. 67. Lotka’s law <ul><li>library science (Lotka’s law of publication count); and citation counts: ( citeseer.nj.nec.com 6/2001) </li></ul>log(#citations) log(count) J. Ullman
  67. 68. Korcak’s law Scandinavian lakes area vs complementary cumulative count (log-log axes) log(count( >= area)) log(area)
  68. 69. More power laws: Korcak Japan islands; area vs cumulative count (log-log axes) log(area) log(count( >= area))
  69. 70. (Korcak’s law: Aegean islands)
  70. 71. Olympic medals: log rank log(# medals) USA China Russia
  71. 72. SALES data – store#96 # units sold count of products
  72. 73. TELCO data # of service units count of customers
  73. 74. More power laws on the Internet degree vs rank, for Internet domains (log-log) [sigcomm99] log(rank) log(degree) -0.82
  74. 75. Even more power laws: <ul><li>Income distribution (Pareto’s law); </li></ul><ul><li>duration of UNIX jobs [Harchol-Balter] </li></ul><ul><li>Distribution of UNIX file sizes </li></ul><ul><li>Web graph [CLEVER-IBM; Barabasi] </li></ul>
  75. 76. Overall Conclusions: <ul><li>‘ Find similar/interesting things’ in multimedia databases </li></ul><ul><li>Indexing: feature extraction (‘GEMINI’) </li></ul><ul><ul><li>automatic feature extraction: FastMap </li></ul></ul><ul><ul><li>Relevance feedback: FALCON </li></ul></ul>
  76. 77. Conclusions - cont’d <ul><li>New tools for Data Mining: Fractals/power laws: </li></ul><ul><ul><li>appear everywhere </li></ul></ul><ul><ul><li>lead to skewed distributions (Gaussian, Poisson, uniformity, independence) </li></ul></ul><ul><ul><li>‘ correlation integral’ for separability/cluster detection </li></ul></ul><ul><ul><li>PFD for dimensionality reduction </li></ul></ul>
  77. 78. Resources: <ul><li>Software and papers: </li></ul><ul><ul><li>www . cs . cmu .edu/~christos </li></ul></ul><ul><ul><li>Fractal dimension (FracDim) </li></ul></ul><ul><ul><li>Separability (sigmod 2000, kdd2001) </li></ul></ul><ul><ul><li>Relevance feedback for query by content (FALCON – vldb 2000) </li></ul></ul>
  78. 79. Resources <ul><li>Manfred Schroeder “ Chaos, Fractals and Power Laws ” </li></ul>

×