
- 1. Curse of Dimensionality. By Nikhil Sharma.
- 2. What is it? In applied mathematics, the curse of dimensionality (a term coined by Richard E. Bellman), also known as the Hughes effect (named after Gordon F. Hughes), refers to the problems caused by the exponential increase in volume associated with adding extra dimensions to a mathematical space.
- 3. Problems. High-dimensional data is difficult to work with because adding more features can increase the noise (and hence the error), and there usually aren’t enough observations to get good estimates. This causes an increase in running time, overfitting, and growth in the number of samples required.
- 4. An Example. (Figure: representation of a 10% sample of the probability space in (i) 2-D and (ii) 3-D.) The number of points would need to increase exponentially to maintain a given accuracy: 10^n samples would be required for an n-dimensional problem.
- 5. Curse of Dimensionality: Complexity. Complexity (running time) increases with the dimension d. Many methods have at least O(nd²) complexity, where n is the number of samples, so as d becomes large, O(nd²) complexity may be too costly.
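The O(nd²) growth can be sketched with a simple operation counter; `cov_op_count` below is a hypothetical helper (not from the slides), counting the multiply-adds needed to estimate a d-by-d covariance matrix from n samples:

```python
# Rough multiply-add count for estimating a d x d covariance matrix
# from n samples: each of the d*d entries sums over all n samples.
def cov_op_count(n: int, d: int) -> int:
    return n * d * d

# For a fixed sample size, doubling the dimension quadruples the work.
print(cov_op_count(1000, 10))  # 100000
print(cov_op_count(1000, 20))  # 400000
```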
- 6. Curse of Dimensionality: Overfitting. Paradox: if n < d², we are better off assuming that the features are uncorrelated, even if we know this assumption is wrong. We are likely to avoid overfitting because we fit a model with fewer parameters.
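The "fewer parameters" claim can be made concrete by counting the free parameters of a Gaussian covariance model; the two helpers below are illustrative names I am introducing, assuming a full symmetric covariance versus a diagonal (uncorrelated-features) one:

```python
def full_cov_params(d: int) -> int:
    # A symmetric d x d covariance matrix has d*(d+1)/2 free parameters.
    return d * (d + 1) // 2

def diag_cov_params(d: int) -> int:
    # Assuming uncorrelated features keeps only the d variances.
    return d

# With d = 20 features the full model needs 210 parameters,
# the uncorrelated model only 20.
print(full_cov_params(20), diag_cov_params(20))
```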
- 7. Curse of Dimensionality: Number of Samples. Suppose we want to use the nearest neighbor approach with k = 1 (1NN), and we start with only one feature. Suppose this feature is not discriminative, i.e. it does not separate the classes well, so we decide to use 2 features. For the 1NN method to work well, we need a lot of samples, i.e. the samples have to be dense. To maintain the same density as in 1-D (9 samples per unit length), how many samples do we need?
- 8. Curse of Dimensionality: Number of Samples. We need 9² = 81 samples to maintain the same density as in 1-D.
- 9. Curse of Dimensionality: Number of Samples. Of course, when we go from 1 feature to 2, no one gives us more samples; we still have 9. This is far too sparse for 1NN to work well.
- 10. Curse of Dimensionality: Number of Samples. Things go from bad to worse if we decide to use 3 features: if 9 samples were dense enough in 1-D, in 3-D we need 9³ = 729 samples!
- 11. Curse of Dimensionality: Number of Samples. In general, if n samples are dense enough in 1-D, then in d dimensions we need n^d samples, and n^d grows very fast as a function of d. Common pitfall: if we can’t solve a problem with a few features, adding more features seems like a good idea. However, the number of samples usually stays the same, so the method with more features is likely to perform worse instead of better.
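The n^d growth in slides 8–11 can be checked directly; `samples_needed` is a one-line illustration of the density argument, not code from the deck:

```python
def samples_needed(n_per_axis: int, d: int) -> int:
    """Samples needed to keep a 1-D density of n_per_axis points
    per unit length throughout a d-dimensional unit cube."""
    return n_per_axis ** d

print(samples_needed(9, 1))  # 9
print(samples_needed(9, 2))  # 81
print(samples_needed(9, 3))  # 729
```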
- 12. Curse of Dimensionality: Number of Samples. (Figure: classification error as a function of the number of features, for a fixed number of samples.) Thus for each fixed sample size n, there is an optimal number of features to use.
- 13. The Curse of Dimensionality. We should try to avoid creating a lot of features, but often there is no choice: the problem starts with many features. Example: face detection. One sample point is a k-by-m array of pixels. Feature extraction is not trivial, so usually every pixel is taken as a feature. A typical dimension is 20 by 20 = 400. Suppose 10 samples are dense enough for 1 dimension: we would need 10^400 samples!
- 14. The Curse of Dimensionality. In face detection, the dimension of one sample point is km. The fact that we set up the problem with km dimensions (features) does not mean it is really a km-dimensional problem; most likely we are not setting the problem up with the right features. If we used better features, we would likely need far fewer than km dimensions. The space of all k-by-m images has km dimensions, but the space of all k-by-m faces must be much smaller, since faces form a tiny fraction of all possible images.
- 15. Dimensionality Reduction. We want to reduce the number of dimensions for: efficiency, measurement costs, storage costs, computation costs, classification performance, and ease of interpretation/modeling.
- 16. Principal Components. The idea is to project onto the subspace which accounts for most of the variance. This is accomplished by projecting onto the eigenvectors of the covariance matrix associated with the largest eigenvalues. This is generally not the projection best suited for classification, but it can be a useful method for providing a first-cut reduction in dimension from a high-dimensional space.
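The projection described above can be sketched in a few lines of NumPy; `pca_project` is a name I am introducing for illustration, assuming the standard recipe of centering the data and keeping the eigenvectors with the largest eigenvalues:

```python
import numpy as np

def pca_project(X: np.ndarray, k: int) -> np.ndarray:
    """Center X and project it onto the k eigenvectors of its
    covariance matrix with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # largest-k columns
    return Xc @ top

# Data whose variance is concentrated along the first axis:
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([10.0, 1.0, 0.1])
Y = pca_project(X, 1)  # one component captures nearly all the variance
```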
- 17. Feature Combination as a Method to Reduce Dimensionality. High-dimensional data is challenging and often redundant, so it is natural to try to reduce dimensionality. Reduce dimensionality by feature combination: combine the old features x to create new features y. Ideally, the new vector y should retain from x all information important for classification.
- 18. Principal Component Analysis (PCA). Main idea: seek the most accurate data representation in a lower-dimensional space. Example in 2-D: project the data onto the 1-D subspace (a line) which minimizes the projection error. Notice that the good line to use for projection lies in the direction of largest variance.
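The 2-D example can be verified numerically: the largest-variance direction gives a smaller projection error than an arbitrary axis. The `line_fit_error` helper and the data below are a sketch of mine, assuming a correlated cloud with most variance along the diagonal:

```python
import numpy as np

def line_fit_error(Xc: np.ndarray, direction: np.ndarray) -> float:
    """Sum of squared distances from centered points Xc to the line
    through the origin along the given direction."""
    u = direction / np.linalg.norm(direction)
    proj = np.outer(Xc @ u, u)       # each point projected onto the line
    return float(((Xc - proj) ** 2).sum())

rng = np.random.default_rng(1)
t = rng.normal(size=400)
# Correlated 2-D cloud: most variance lies along the diagonal.
X = np.column_stack([t + 0.1 * rng.normal(size=400),
                     t + 0.1 * rng.normal(size=400)])
Xc = X - X.mean(axis=0)

# Direction of largest variance = top eigenvector of the covariance.
_, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
best = V[:, -1]  # eigh returns eigenvalues in ascending order
```

Projecting onto `best` should leave a smaller residual than projecting onto either coordinate axis, matching the slide's observation.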
- 19. PCA: Approximation of Elliptical Cloud in 3D
- 20. Thank You! The End.
