Hubness-Based Fuzzy Measures for High-Dimensional k-Nearest Neighbor Classification

Presented by Nenad Tomašev at the 7th International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM 2011), New York, NY, USA, August 30th, 2011.

Publication: http://bit.ly/yQQsOq

Abstract:
High-dimensional data are by their very nature often difficult to handle by conventional machine-learning algorithms, which is usually characterized as an aspect of the curse of dimensionality. However, it was shown that some of the arising high-dimensional phenomena can be exploited to increase algorithm accuracy. One such phenomenon is hubness, which refers to the emergence of hubs in high-dimensional spaces, where hubs are influential points included in many k-neighbor sets of other points in the data. This phenomenon was previously used to devise a crisp weighted voting scheme for the k-nearest neighbor classifier. In this paper we go a step further by embracing the soft approach, and propose several fuzzy measures for k-nearest neighbor classification, all based on hubness, which express fuzziness of elements appearing in k-neighborhoods of other points. Experimental evaluation on real data from the UCI repository and the image domain suggests that the fuzzy approach provides a useful measure of confidence in the predicted labels, resulting in improvement over the crisp weighted method, as well as over the standard kNN classifier.

Presentation Transcript

  • Hubness-Based Fuzzy Measures for High-Dimensional k-Nearest Neighbor Classification. Nenad Tomašev, Miloš Radovanović, Dunja Mladenić, Mirjana Ivanović
  • Presentation outline
    • The phenomenon of hubness
    • Why it matters: a motivating example
    • Types of hubness
    • Exploiting hubness information in kNN: hubness-based fuzzy measures
    • Anti-hubs: a problem?
    • Approximative approaches
    • Experimental evaluation
    • Conclusions and future work
  • Hubness One consequence of the well-known dimensionality curse Influential points emerge in nearest- neighbor methods: HUBS These hubs appear in many k-neighbor sets
  • What was that song again?
    • Hubness was first noticed in music collection mining
    • Some songs were being retrieved (as nearest neighbors) much more often than other songs
    • This did not, however, reflect the perceived similarity between the songs
  • Hubness: definitions
    • Hubs: points which appear often as neighbors
      • Influential points
      • Rare among the data
    • Anti-hubs: points which almost never appear as neighbors
      • Possible outliers
      • Common among the data
    • k-occurrence: an appearance of a point in some other point's k-neighbor set
    • Nk(x): the number of k-occurrences of point x
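For concreteness, here is a minimal sketch of computing the k-occurrence counts Nk(x); the function name and the use of scikit-learn are illustrative choices, not from the slides. Hubs are then the points with unusually large counts, anti-hubs those with counts near zero.

```python
# A minimal sketch (assumed helper, not from the slides) of computing N_k(x).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_occurrences(X, k=5):
    """Return N_k(x) for every row of X: the number of k-neighbor sets
    of other points in which that point appears."""
    # Query k+1 neighbors because each point is returned as its own
    # nearest neighbor and has to be excluded.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbor_idx = idx[:, 1:]  # drop the self-neighbor column
    return np.bincount(neighbor_idx.ravel(), minlength=len(X))
```

Points whose count greatly exceeds the expected value k are the hubs discussed above.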
  • k-occurrence distribution (graphs taken from: Radovanović, Nanopoulos, Ivanović, "Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data")
  • What causes hubness?
    • At first it was thought that some data distributions or some metrics might be the underlying cause
    • The truth is much simpler than that: hubness is present in almost any inherently high-dimensional data
  • High-dimensional data
    • Image data, video, audio, measurement streams, medical records, text, …
    • Modern machine learning challenges are all of an inherently high-dimensional nature
  • The curse of dimensionality
    • Everything is sparse
      • The requirements for proper density estimates rise exponentially with dimensionality
      • The notions of structure and 'shape' of clusters are much less meaningful, since there is not enough data to capture these higher-order dependencies
    • Concentration of distances (see the sketch after this list)
      • The relative contrast decreases
      • Distance expectation increases, but the variance remains constant
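The distance-concentration point lends itself to a quick numerical illustration (not from the slides): for i.i.d. uniform data, the relative contrast between the farthest and nearest point from a query shrinks as dimensionality grows.

```python
# Illustrative experiment: relative contrast (d_max - d_min) / d_min
# between a random query and 1000 random points, for growing dimension d.
import numpy as np

rng = np.random.default_rng(0)
for d in (3, 30, 300, 3000):
    X = rng.random((1000, d))
    q = rng.random(d)
    dist = np.linalg.norm(X - q, axis=1)
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")
```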
  • Related work Work by Miloš Radovanović et al.  The general properties of the hubness phenomenon  Hubness-weighted kNN  Hubness-based outlier detection  Hubs and anti-hubs in SVM  Time series classification  … Work by Krisztian Buza et al.  Instance selection / data reduction based on hubness  Time series classification by using hubness
  • Hubness-weighted kNN: the idea
    • Good hubness GNk(x): k-occurrences in neighborhoods of points that share x's label
    • Bad hubness BNk(x): k-occurrences in neighborhoods of points with a different label
    • Total hubness Nk(x) = GNk(x) + BNk(x)
  • Hubness-weighted kNN: the weights
    • A simple yet effective instance-specific weighting scheme (sketched below)
    • This was the second baseline in the experiments (the first was the standard kNN)
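As a hedged sketch, the instance-specific weighting in the hubness-weighted kNN baseline of Radovanović et al. can be read as down-weighting the votes of points with high standardized bad hubness; the exact normalization in the original paper may differ, and the epsilon guard is an illustrative detail.

```python
# Hedged sketch of hubness-based vote weights, w(x) = exp(-h_b(x)),
# where h_b(x) is the standardized bad hubness of x.
import numpy as np

def hubness_weights(bad_counts):
    """bad_counts[i] = BN_k(x_i), the bad hubness of training point x_i."""
    h_b = (bad_counts - bad_counts.mean()) / (bad_counts.std() + 1e-12)
    return np.exp(-h_b)  # frequent "bad" neighbors get small votes
```

At classification time, each of the query's k neighbors then votes for its own label, scaled by its weight.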
  • How bad can bad hubness become?
  • With devastating results
  • Hubness-based fuzzy measures
    • The idea: explore the structure of bad hubness
    • Introducing the concept of class hubness
    • In other words: there is nothing inherently good or bad about a k-occurrence; as a random event, it carries some information about the label of the point of interest
  • The fuzzy k-nearest neighbor framework
    • Each neighbor distributes its vote across all the categories
    • Votes may additionally be weighted by distance (see the sketch below)
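The framework referred to here is the classic fuzzy kNN of Keller et al. (1985); a minimal sketch of its vote follows, with the fuzzifier m as a free parameter and the zero-distance guard as an illustrative detail.

```python
# Minimal sketch of the classic fuzzy kNN vote (Keller et al., 1985).
import numpy as np

def fuzzy_knn_vote(memberships, distances, m=2.0):
    """memberships: (k, n_classes) class memberships u_c(x_i) of the neighbors;
    distances: (k,) distances from the query to those neighbors."""
    if np.any(distances == 0):
        # Exact duplicate of a training point: copy its membership vector.
        return memberships[np.argmin(distances)]
    w = distances ** (-2.0 / (m - 1.0))  # inverse-distance weighting
    return (w[:, None] * memberships).sum(axis=0) / w.sum()
```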
  • The proposed hubness-based fuzziness
    • Use class hubness to define the fuzziness of the vote, if a data point exhibits enough hubness
    • If not, fall back to plan B (a simplified sketch follows)
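A hedged sketch of the class-hubness idea: estimate, from training-set co-occurrences, how often each point appears in k-neighborhoods of each class, and fall back for low-hubness points. The threshold theta, the smoothing toward the class prior, and the prior fallback are illustrative simplifications, not the paper's exact choices.

```python
# Hedged sketch of class-hubness fuzzy memberships; the anti-hub
# threshold and smoothing below are assumptions, not the paper's.
import numpy as np

def class_hubness_memberships(neighbor_idx, labels, n_classes, theta=5, lam=1.0):
    """neighbor_idx: (n, k) neighbor indices on the training set;
    labels: (n,) integer class labels."""
    n = len(labels)
    counts = np.zeros((n, n_classes))  # N_{k,c}(x): class-specific hubness
    for i, nbrs in enumerate(neighbor_idx):
        for j in nbrs:
            counts[j, labels[i]] += 1  # x_j occurred in a class-labels[i] neighborhood
    total = counts.sum(axis=1, keepdims=True)  # N_k(x)
    prior = np.bincount(labels, minlength=n_classes) / n
    u = (counts + lam * prior) / (total + lam)  # smoothed class hubness
    u[total.ravel() < theta] = prior  # "plan B" for anti-hubs, simplified here
    return u
```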
  • Anti-hubs: a problem?
    • There exist points which never appear in k-neighbor sets on the training data
    • However, there are even more points which simply appear rarely
    • So, we have to be careful
    • On the other hand, these points will most likely occur rarely on the test data as well
  • The low-dimensional case…
  • The high-dimensional case
  • Approximative approaches (for anti-hubs)
    • Use the point label
    • Use a global class-to-class hubness estimate (one possible reading is sketched below)
      • Captures the average class hubness among data points from the same category
    • Use a local estimate
      • We tested two different ways of fuzzifying the local estimates
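One possible reading of the global class-to-class hubness estimate (an assumption on our part, not the paper's exact formula): average the per-point occurrence profiles within each class, then use the row of an anti-hub's own class as its fuzzy vote.

```python
# Hedged sketch of a global class-to-class hubness table. H[c1, c2] is the
# average fraction of k-occurrences of class-c1 points that happen in
# neighborhoods of class-c2 points.
import numpy as np

def class_to_class_hubness(counts, labels, n_classes):
    """counts: the (n, n_classes) class-hubness matrix from the sketch above."""
    frac = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    H = np.zeros((n_classes, n_classes))
    for c in range(n_classes):
        H[c] = frac[labels == c].mean(axis=0)  # average profile of class c
    return H
```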
  • Experimental evaluation
    • UCI data
      • Low-to-medium hubness data
      • Many binary classification problems
    • ImageNet data
      • 5 multiclass classification problems
      • SIFT codebook representation
      • Color histograms
  • The distance weighting
    • An optional part of the algorithm, so we decided to see if it makes a difference
    • It turns out that it does lead to slightly better results
    • Notation: h-FNN is the non-weighted version; dwh-FNN is the distance-weighted version
  • Comparison between the estimates
  • Neighborhood sizes
  • Results on ImageNet data
  • kNN vs. h-FNN (comparison figure)
  • Conclusions Class hubness can be successfully exploited in a fuzzy voting scheme for k- nearest neighbor classification Anti-hubscan be treated as a separate case, in any of the proposed ways, without compromising the accuracy
  • Conclusions Thephenomenon of hubness, even though inherently detrimental, can be turned to our advantage by building hubness-aware classification algorithms There is certainly a lot of space for follow- ups and potential improvement
  • Acknowledgements This work was supported by the bilateral project between Slovenia and Serbia “Correlating Images and Words: Enhancing Image Analysis Through Machine Learning and Semantic Technologies,” the Slovenian Research Agency, the Serbian Ministry of Education and Science through project no. OI174023, “Intelligent techniques and their integration into wide-spectrum decision support,” and the ICT Programme of the EC under PASCAL2 (ICT-NoE-216886) and PlanetData (ICT-NoE-257641)
  • Thank you for your attention. Questions?