Indexing and Data Mining in Multimedia Databases
Indexing and Data Mining in Multimedia Databases Presentation Transcript

  • 1. Indexing and Data Mining in Multimedia Databases Christos Faloutsos CMU www.cs.cmu.edu/~christos
  • 2. Outline
    • Goal: ‘Find similar / interesting things’
    • Problem - Applications
    • Indexing - similarity search
    • New tools for Data Mining: Fractals
    • Conclusions
    • Resources
  • 3. Problem
    • Given a large collection of (multimedia) records, find similar/interesting things, i.e.:
    • Allow fast, approximate queries, and
    • Find rules/patterns
  • 4. Sample queries
    • Similarity search
      • Find pairs of branches with similar sales patterns
      • Find medical cases similar to Smith's
      • Find pairs of sensor series that move in sync
  • 5. Sample queries –cont’d
    • Rule discovery
      • Clusters (of patients; of customers; ...)
      • Forecasting (total sales for next year?)
      • Outliers (e.g., fraud detection)
  • 6. Outline
    • Goal: ‘Find similar / interesting things’
    • Problem - Applications
    • Indexing - similarity search
    • New tools for Data Mining: Fractals
    • Conclusions
    • Resources
  • 7. Indexing - Multimedia
    • Problem:
    • given a set of (multimedia) objects,
    • find the ones similar to a desirable query object (quickly!)
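  • 8. distance function: by expert [figure: three time series, $price vs. day (1 to 365)]
  • 9. ‘GEMINI’ - Pictorially [figure: sequences S1 ... Sn, each plotted over days 1 to 365, are mapped to feature points F(S1) ... F(Sn) (features: e.g., avg; e.g., std), which go into off-the-shelf S.A.M.s (Spatial Access Methods)]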
  • 10. ‘GEMINI’
    • fast; ‘correct’ (=no false dismissals)
    • used for
      • images (e.g., QBIC) (2x, 10x faster)
      • shapes (27x faster)
      • video (e.g., InforMedia)
      • time sequences ([Rafiei+Mendelzon], ++)
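The filter-and-refine idea behind GEMINI can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance between equal-length sequences, uses (avg, std) scaled by sqrt(n) as the two features (with that scaling the feature distance never exceeds the true distance, which is what rules out false dismissals), and replaces the spatial access method with a plain linear scan. The names `features` and `range_query` and the synthetic data are hypothetical.

```python
import math
import random

def features(seq):
    """Map a sequence to (avg, population std), both scaled by sqrt(n).

    With this scaling, the Euclidean distance between two feature vectors
    is a lower bound on the Euclidean distance between the sequences.
    """
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in seq) / n)
    return (math.sqrt(n) * mean, math.sqrt(n) * std)

def range_query(database, feats, query, eps):
    """Return indices of all sequences within distance eps of the query."""
    fq = features(query)
    # Filter step: cheap test in 2-d feature space (a linear scan here;
    # an off-the-shelf spatial access method would take its place).
    candidates = [i for i, f in enumerate(feats) if math.dist(f, fq) <= eps]
    # Refine step: discard false alarms using the full sequences.
    return [i for i in candidates if math.dist(database[i], query) <= eps]

# Synthetic data (an assumption): 200 noisy sequences of 365 daily values.
random.seed(0)
database = []
for _ in range(200):
    level = random.uniform(0.0, 10.0)
    database.append([random.gauss(level, 1.0) for _ in range(365)])

feats = [features(s) for s in database]
query = database[0]
hits = range_query(database, feats, query, eps=30.0)
brute = [i for i, s in enumerate(database) if math.dist(s, query) <= 30.0]
```

Because the features lower-bound the true distance, the filter may let false alarms through but should not drop a qualifying sequence, so `hits` is expected to match the brute-force answer.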
  • 11. Remaining issues
    • how to extract features automatically?
    • how to merge similarity scores from different media
  • 12. Outline
    • Goal: ‘Find similar / interesting things’
    • Problem - Applications
    • Indexing - similarity search
      • Visualization: Fastmap
      • Relevance feedback: FALCON
    • Data Mining / Fractals
    • Conclusions
  • 13. FastMap ??
    Distance matrix for objects O1 ... O5:
          O1   O2   O3   O4   O5
    O1     0    1    1  100  100
    O2     1    0    1  100  100
    O3     1    1    0  100  100
    O4   100  100  100    0    1
    O5   100  100  100    1    0
    [figure annotations: ~100, ~1]
  • 14. FastMap
    • Multi-dimensional scaling (MDS) can do that, but in O(N^2) time
    • We want a linear algorithm: FastMap [SIGMOD95]
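A compact sketch of the FastMap projection step may help here. It follows the published idea (pick two far-apart pivot objects, project every object onto the line through them with the cosine law, then repeat on the residual distances), but the function name `fastmap`, the fixed starting object for the pivot heuristic, and the clamping of negative residuals are choices made for this illustration only. The input is the five-object distance matrix from the FastMap slide.

```python
import math

def fastmap(dist, k):
    """Map n objects to k-d points, given only a symmetric distance matrix."""
    n = len(dist)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, col):
        # Squared residual distance after removing the first `col` coordinates.
        r = dist[i][j] ** 2
        for c in range(col):
            r -= (coords[i][c] - coords[j][c]) ** 2
        return max(r, 0.0)  # clamp: non-Euclidean input can dip below zero

    for col in range(k):
        # Pivot heuristic: start at object 0, jump to the farthest object,
        # then to the object farthest from that one.
        b = 0
        a = max(range(n), key=lambda i: d2(b, i, col))
        b = max(range(n), key=lambda i: d2(a, i, col))
        dab2 = d2(a, b, col)
        if dab2 == 0.0:
            break  # nothing left to separate; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine law: position of object i along the pivot line a -> b.
            coords[i][col] = (d2(a, i, col) + dab2 - d2(b, i, col)) / (2 * dab)
    return coords

# Objects O1..O5: O1, O2, O3 are mutually close, O4 and O5 are close,
# and the two groups are about 100 apart.
D = [
    [0, 1, 1, 100, 100],
    [1, 0, 1, 100, 100],
    [1, 1, 0, 100, 100],
    [100, 100, 100, 0, 1],
    [100, 100, 100, 1, 0],
]
points = fastmap(D, 2)
first_axis = [p[0] for p in points]
```

Each output dimension needs a constant number of passes over the objects, which is where the linear cost in N comes from. On this matrix the first coordinate alone already separates the two groups.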
  • 15. Applications: time sequences
    • given n co-evolving time sequences
    • visualize them + find rules [ICDE00]
    [figure: exchange rate vs. time for HKD, JPY, DEM]
  • 16. Applications - financial
    • currency exchange rates [ICDE00]
    [figure labels: USD(t), USD(t-5), FRF, GBP, JPY, HKD]
  • 17. Applications - financial
    • currency exchange rates [ICDE00]
    [figure labels: USD(t), USD(t-5), USD, HKD, JPY, FRF, DEM, GBP]
  • 18. Application: VideoTrails
    • [ACM MM97]
  • 19. VideoTrails - usage
    • scene-cut detection (about 10% errors)
    • scene classification (e.g., dialogue vs action)
  • 20. Outline
    • Goal: ‘Find similar / interesting things’
    • Problem - Applications
    • Indexing - similarity search
      • Visualization: Fastmap
      • Relevance feedback: FALCON
    • Data Mining / Fractals
    • Conclusions
  • 21. Merging similarity scores
    • e.g., video: text, color, motion, audio
      • weights change with the query!
    • solution 1: user specifies weights
    • solution 2: user gives examples 
      • and we ‘learn’ what he/she wants: relevance feedback (Rocchio, MARS, MindReader)
      • but: how about disjunctive queries?
  • 22. DEMO server demo
  • 23. ‘FALCON’: trader wants only ‘unstable’ stocks [figure labels: Inverted Vs, Vs]
  • 24. ‘FALCON’: average is flat! [figure labels: Inverted Vs, Vs]
  • 25. “Single query point” methods: Rocchio [figure: ‘+’ examples and query point x; axes: avg, std]
  • 26. “Single query point” methods: Rocchio, MindReader, MARS. The averaging effect in action... [figures: ‘+’ examples with query point x, one panel per method]
  • 27. Main idea: FALCON contours [figure: ‘+’ examples; axes: feature1 (e.g., avg), feature2 (e.g., std)] [Wu+, vldb2000]
  • 28. A: Aggregate Dissimilarity
    • α: parameter (α ≈ -5 gives a ‘soft OR’)
    [figure labels: g1, g2, x; ‘+’ points]
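The slide does not show the formula itself, so the following is only a sketch under an assumption: that the aggregate dissimilarity of a point x to the set of "good" examples is a generalized mean of the individual distances, raised to a negative exponent (called `alpha` here, with the value -5 taken from the slide). With a negative exponent the smallest distance dominates, which is what makes the measure behave like a soft OR. The function name and the sample points are hypothetical.

```python
import math

def aggregate_dissimilarity(x, good_points, alpha=-5.0):
    """Generalized-mean distance from x to a set of good examples.

    Assumes alpha != 0. A negative alpha approximates 'distance to the
    nearest good example' (a soft OR over the examples).
    """
    dists = [math.dist(x, g) for g in good_points]
    if alpha < 0 and min(dists) == 0.0:
        return 0.0  # x coincides with a good example
    mean_power = sum(d ** alpha for d in dists) / len(dists)
    return mean_power ** (1.0 / alpha)

# Two good examples far apart: a disjunctive query ("like g1 OR like g2").
good = [(0.0, 0.0), (10.0, 0.0)]
near = aggregate_dissimilarity((0.1, 0.0), good)  # right next to g1
mid = aggregate_dissimilarity((5.0, 0.0), good)   # at the average of g1, g2
```

A point next to either example scores as very similar, while the midpoint between the two examples does not. A single-query-point method that averages the examples would rank that midpoint first, which is the failure the preceding slides illustrate.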
  • 29. FALCON
    • converges quickly (~5 iterations)
    • good precision/recall
    • is fast (can use off-the-shelf ‘spatial/metric access methods’)
  • 30. Conclusions for indexing + visualization
    • GEMINI: fast indexing, exploiting off-the-shelf SAMs
    • FastMap: automatic feature extraction in O(N) time
    • FALCON: relevance feedback for disjunctive queries
  • 31. Outline
    • Goal: ‘Find similar / interesting things’
    • Problem - Applications
    • Indexing - similarity search
    • New tools for Data Mining: Fractals
    • Conclusions
    • Resources
  • 32. Data mining & fractals – Road map
    • Motivation – problems / case study
    • Definition of fractals and power laws
    • Solutions to posed problems
    • More examples
  • 33. Problem #1 - spatial d.m.
    • Galaxies (Sloan Digital Sky Survey w/ B. Nichol)
    • - ‘spiral’ and ‘elliptical’ galaxies
    • (stores & households; healthy & ill subjects)
    • - patterns? (not Gaussian; not uniform)
    • attraction/repulsion?
    • separability??
  • 34. Problem#2: dim. reduction
    • given attributes x_1, ..., x_n
      • possibly, non-linearly correlated
    • drop the useless ones
    • (Q: why?
    • A: to avoid the ‘dimensionality curse’)
    [figure axes: engine size, mpg]
  • 35. Answer:
    • Fractals / self-similarities / power laws
  • 36. What is a fractal?
    • = self-similar point set, e.g., Sierpinski triangle:
    ... zero area; infinite length!
  • 37. Definitions (cont’d)
    • Paradox: infinite perimeter; zero area!
    • ‘dimensionality’: between 1 and 2
    • actually: log(3)/log(2) = 1.58… (long story)
  • 38. Intrinsic (‘fractal’) dimension
    • Q: fractal dimension of a line?
    E.g.: #cylinders; miles / gallon [table: four sample (x, y) points lying on a straight line]
  • 39. Intrinsic (‘fractal’) dimension
    • Q: fractal dimension of a line?
    • A: nn(<= r) ~ r^1
  • 40. Intrinsic (‘fractal’) dimension
    • Q: fractal dimension of a line?
    • A: nn(<= r) ~ r^1
    • Q: fd of a plane?
    • A: nn(<= r) ~ r^2
    • fd == slope of (log(nn) vs log(r))
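The slope definition above can be tried out numerically. The sketch below counts the pairs of points within distance r for a few radii and fits a least-squares line to log(count) versus log(r). It is a quadratic-time illustration rather than a production algorithm; the function names, the choice of radii, the sample sizes, and the use of the "chaos game" to sample a Sierpinski triangle are all assumptions made for this example.

```python
import bisect
import math
import random

def pair_distances(points):
    """Sorted list of all pairwise Euclidean distances (O(N^2) sketch)."""
    out = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            out.append(math.dist(points[i], points[j]))
    out.sort()
    return out

def correlation_dimension(points, radii):
    """Least-squares slope of log(#pairs within r) versus log(r)."""
    dists = pair_distances(points)
    xs, ys = [], []
    for r in radii:
        count = bisect.bisect_right(dists, r)  # number of pairs with d <= r
        if count > 0:
            xs.append(math.log(r))
            ys.append(math.log(count))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def sierpinski(n, seed=0):
    """Sample n points of the Sierpinski triangle with the chaos game."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.0, 0.0  # start on a corner, so every point lies on the set
    pts = []
    for _ in range(n):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2
        pts.append((x, y))
    return pts

radii = [2.0 ** -k for k in range(6, 1, -1)]  # 1/64, 1/32, ..., 1/4
line = [(i / 999, 0.0) for i in range(1000)]
fd_line = correlation_dimension(line, radii)
fd_triangle = correlation_dimension(sierpinski(1000), radii)
```

With finite samples the estimates are only approximate: the line should come out near 1 and the triangle somewhere in the neighborhood of log(3)/log(2), with the exact values depending on the radii and the sample.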
  • 41. Sierpinski triangle [figure: ‘correlation integral’ plot, log(#pairs within <= r) vs. log(r); slope 1.58]
  • 42. Observations
    • self-similarity ->
    • <=> fractals
    • <=> scale-free
    • <=> power-laws (y = x^a, F = C*r^(-2))
    [figure: log(#pairs within <= r) vs. log(r); slope 1.58]
  • 43. Road map
    • Motivation – problems / case studies
    • Definition of fractals and power laws
    • Solutions to posed problems
    • More examples
    • Conclusions
  • 44. Solution#1: spatial d.m.
    • Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000])
    • clusters?
    • separable?
    • attraction/repulsion?
    • data ‘scrubbing’ – duplicates?
  • 45. Solution#1: spatial d.m. [figure: log(#pairs within <= r) vs. log(r), for spi-spi, spi-ell, ell-ell pairs]
    • - 1.8 slope
    • - plateau!
    • - repulsion!
  • 46. Solution#1: spatial d.m. [figure: log(#pairs within <= r) vs. log(r), for spi-spi, spi-ell, ell-ell pairs] [w/ Seeger, Traina, Traina, SIGMOD00]
    • - 1.8 slope
    • - plateau!
    • - repulsion!
  • 47. spatial d.m. Heuristic on choosing # of clusters [figure labels: r1, r2]
  • 48. Solution#1: spatial d.m. [figure: log(#pairs within <= r) vs. log(r), for spi-spi, spi-ell, ell-ell pairs]
    • - 1.8 slope
    • - plateau!
    • - repulsion!
  • 49. Solution#1: spatial d.m. [figure: log(#pairs within <= r) vs. log(r), for spi-spi, spi-ell, ell-ell pairs]
    • - 1.8 slope
    • - plateau!
    • - repulsion!!
    • - duplicates
  • 50. Problem #2: Dim. reduction
  • 51. Solution:
    • drop the attributes that don’t increase the ‘partial f.d.’ PFD
    • dfn: PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00]
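To make the "partial fractal dimension" idea concrete, here is a rough sketch. It is not the algorithm from the cited paper: it uses a simple greedy forward pass (keep an attribute only if adding it raises the PFD of the kept set by more than a threshold), a quadratic-time pair count for the fractal dimension, and a made-up dataset in which the second attribute is a non-linear function of the first and the third is constant. The names `fractal_dim`, `project`, `select_attributes` and the `min_gain` threshold are hypothetical.

```python
import math

def fractal_dim(points, radii):
    """Slope of log(#pairs within r) vs log(r), by least squares."""
    n = len(points)
    dists = [math.dist(points[i], points[j])
             for i in range(n) for j in range(i + 1, n)]
    xs, ys = [], []
    for r in radii:
        count = sum(1 for d in dists if d <= r)
        if count > 0:
            xs.append(math.log(r))
            ys.append(math.log(count))
    if len(xs) < 2:
        return 0.0  # not enough usable radii to fit a slope
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx

def project(rows, attrs):
    """Keep only the listed attribute positions of every row."""
    return [tuple(row[a] for a in attrs) for row in rows]

def select_attributes(rows, radii, min_gain=0.2):
    """Greedy forward pass: keep an attribute only if it raises the PFD."""
    kept, current = [], 0.0
    for a in range(len(rows[0])):
        pfd = fractal_dim(project(rows, kept + [a]), radii)
        if pfd - current > min_gain:
            kept.append(a)
            current = pfd
    return kept

# Hypothetical data: y = x^2 (non-linearly correlated with x), z constant.
rows = [(i / 499, (i / 499) ** 2, 7.0) for i in range(500)]
kept = select_attributes(rows, radii=[0.01, 0.02, 0.04, 0.08])
```

In this toy dataset the points lie on a curve, so the PFD of x alone is already about 1. Adding the correlated attribute or the constant one barely changes it, so only the first attribute is kept. The result of a greedy pass like this depends on the order in which attributes are tried.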
  • 52. Problem #2: dim. reduction [figure labels: PFD~1, PFD~1, global FD=1; PFD=1, PFD=0, PFD=1]
  • 53. Problem #2: dim. reduction [figure labels: PFD~1, PFD=1, global FD=1; PFD=1, PFD=0, PFD=1] Notice: ‘max variance’ would fail here
  • 54. Problem #2: dim. reduction [figure labels: PFD~1, PFD~1, global FD=1; PFD=1, PFD=0, PFD=1] Notice: SVD would fail here
  • 55. Currency dataset
  • 56. self-similar? [figures: currency, fd=1.98; eigenfaces, fd=4.25]
  • 57. FDR on the ‘currency’ dataset [figure annotation: if unif + indep.]
  • 58. FDR on the ‘currency’ dataset [figure annotation: if unif + indep.]
    • HKD: “useless”
    • > 1.98 axes are needed
  • 59. Road map
    • Motivation – problems / case studies
    • Definition of fractals and power laws
    • Solutions to posed problems
    • More examples
    • Conclusions
  • 60. App. : traffic
    • disk traces: self-similar (also: web traffic; comm. errors; etc)
    [figure: #bytes vs. time]
  • 61. More apps: Brain scans
    • Oct-trees; brain-scans
    [figure: log(#octants) vs. octree levels; slope 2.63 = fd]
  • 62. More fractals:
    • stock prices (LYCOS) - random walks: 1.5
    [figure panels: 1 year; 2 years]
  • 63. More fractals:
    • coast-lines: 1.1-1.2 (up to 1.58)
  • 65. Examples: MG county
    • Montgomery County of MD (road end-points)
  • 66. Examples: LB county
    • Long Beach county of CA (road end-points)
  • 67. More power laws: Zipf’s law
    • Bible - rank vs frequency (log-log)
    [figure: log(freq) vs. log(rank); labeled points: “a”, “the”]
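A rank-frequency plot of this kind is easy to reproduce. The sketch below counts word occurrences, sorts the counts in decreasing order, and fits a line to log(frequency) versus log(rank). The corpus is synthetic (word k occurs roughly 1000/k times), standing in for a real text such as the one on the slide; the function name is hypothetical.

```python
import math
from collections import Counter

def rank_frequency_slope(words):
    """Least-squares slope of log(frequency) versus log(rank)."""
    freqs = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic corpus (an assumption, not real text): 50 distinct words,
# where the k-th most frequent word occurs about 1000/k times.
corpus = []
for k in range(1, 51):
    corpus.extend(["w%d" % k] * round(1000 / k))

slope = rank_frequency_slope(corpus)
```

For data built this way the fitted slope is close to -1, the classic Zipf exponent. Real text usually follows the line only approximately, with deviations at the highest and lowest ranks.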
  • 68. More power laws
    • Freq. distr. of first names; last names (Mandelbrot)
  • 69. Internet
    • Internet routers: how many neighbors within h hops?
    U of Alberta
  • 70. Internet topology
    • Internet routers: how many neighbors within h hops? [SIGCOMM 99]
    Reachability function: number of neighbors within r hops, vs r (log-log). Mbone routers, 1995 [figure: log(#pairs) vs. log(hops); slope 2.8]
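The quantity plotted here, the number of node pairs within h hops, can be computed with one breadth-first search per node. The following is a small sketch on a toy graph, not the measurement code behind the slide. It counts ordered pairs and includes each node paired with itself at 0 hops, which is a convention chosen for this example. The function names and the adjacency-list format are hypothetical.

```python
from collections import deque

def neighbors_within(adj, source, max_hops):
    """Cumulative count of nodes within h hops of source, for h = 0..max_hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_hops:
            continue  # do not expand beyond the hop limit
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    counts = [0] * (max_hops + 1)
    for d in dist.values():
        counts[d] += 1
    for h in range(1, max_hops + 1):
        counts[h] += counts[h - 1]  # turn per-hop counts into cumulative ones
    return counts

def hop_plot(adj, max_hops):
    """Total number of ordered node pairs within h hops, for each h."""
    totals = [0] * (max_hops + 1)
    for s in adj:
        for h, c in enumerate(neighbors_within(adj, s, max_hops)):
            totals[h] += c
    return totals

# Toy graph: a path 0 - 1 - 2 - 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
pairs_by_hops = hop_plot(adj, 3)
```

Plotting log(pairs_by_hops[h]) against log(h) for h >= 1 gives the kind of curve the slide describes. On a real router graph the slope of its linear part would be the exponent of interest.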
  • 71. More power laws: areas – Korcak’s law: Scandinavian lakes ([icde99], w/ Proietti)
  • 72. More power laws: areas – Korcak’s law: Scandinavian lakes, area vs complementary cumulative count (log-log axes) [figure: log(count(>= area)) vs. log(area)]
  • 73. Olympic medals [figure: log(# medals) vs. log rank]
  • 74. More power laws
    • Energy of earthquakes (Gutenberg-Richter law) [simscience.org]
    [figures: log(count) vs. magnitude; amplitude vs. day]
  • 75. Even more power laws:
    • Income distribution (Pareto’s law);
    • sales distributions;
    • duration of UNIX jobs
    • Distribution of UNIX file sizes
    • publication counts (Lotka’s law)
  • 76. Even more power laws:
    • web hit frequencies ([Huberman])
    • hyper-link distribution [Barabasi], ++
  • 77. Overall Conclusions:
    • ‘Find similar/interesting things’ in multimedia databases
    • Indexing: feature extraction (‘GEMINI’)
      • automatic feature extraction: FastMap
      • Relevance feedback: FALCON
  • 78. Conclusions - cont’d
    • New tools for Data Mining: Fractals/power laws:
      • appear everywhere
      • lead to skewed distributions (unlike Gaussian, Poisson, uniformity, independence)
      • ‘correlation integral’ for separability/cluster detection
      • PFD for dimensionality reduction
  • 79. Conclusions - cont’d
      • can model bursty time sequences (buffering/prefetching)
      • selectivity estimation (‘how many neighbors within x km?’)
      • dim. curse diagnosis (it’s the fractal dim. that matters! [ICDE2000])
  • 80. Resources:
    • Software and papers:
      • http://www.cs.cmu.edu/~christos
      • Fractal dimension (FracDim)
      • Separability (sigmod 2000)
      • Relevance feedback for query by content (FALCON – vldb 2000)