# Kaggle Projects Presentation, by Sawinder Pal Kaur

Kaggle Projects - Digit Recognizer and Titanic Disaster

Published in: Technology, Education

### Transcript

• 1. Sawinder Pal Kaur, PhD Kaggle Projects
• 2. Outline
  - Problem
  - Statement
  - Methods used
  - Results
• 3. Problem: Digit Recognizer
  - Identify handwritten single digits, 0 through 9, from greyscale images. (Sample images shown on the slide.)
• 4. Statement
  - Each image is 28 pixels high and 28 pixels wide, for a total of 784 pixels. Each pixel has a single pixel value indicating its lightness or darkness, with higher numbers meaning darker. The pixel value is an integer between 0 and 255, inclusive.
  - The pixels are laid out row by row:

        pixel0    pixel1    pixel2    ...  pixel27
        pixel28   pixel29   pixel30   ...  pixel55
        |         |         |         ...  |
        pixel756  pixel757  pixel758  ...  pixel783
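The row-major layout described above means pixel index = 28 × row + column, so a flat 784-value record can be recovered as an image with a single reshape. A minimal sketch (the random row here is a stand-in for one record from the Kaggle file):

```python
import numpy as np

# A flattened 784-value row, standing in for one training image
rng = np.random.default_rng(0)
row = rng.integers(0, 256, size=784)

# Reshape into the 28x28 grid: pixel index = 28*row + col
image = row.reshape(28, 28)

print(image.shape)        # (28, 28)
print(image[1, 0] == row[28])  # first pixel of the second row
```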
• 5. Statement
  - The training data set has 785 columns. The first column, called "label", is the digit that was drawn by the user. The remaining columns contain the pixel values of the associated image.
  - The test data set is the same as the training set, except that it does not contain the "label" column.
  - The goal is to predict the label of each image in the test data set.
• 6. Methods used to solve the problem
  - Random Forest
  - Support Vector Machine (SVM)
  - K-Nearest Neighbors (KNN)
• 7. Random Forest
  - An ensemble of decision trees.
  - Each tree is trained on a bootstrapped sample of the original data set.
  - Each time a node is split, only a randomly chosen subset of the dimensions is considered for splitting.
  - Each tree is fully grown and not pruned.
  - When a new input enters the system, it is run down all of the trees. The result may be an average or weighted average of all the terminal nodes reached, or, for categorical variables, a majority vote.
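The steps above map directly onto scikit-learn's `RandomForestClassifier`. A minimal sketch, using scikit-learn's built-in 8×8 digits set as a stand-in for the 28×28 Kaggle images (not the author's original code; 100 trees here for speed, where the slides used 500):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 8x8 digits as a small stand-in for the 28x28 Kaggle images
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample; each split considers a random
# feature subset; prediction is a majority vote over the trees.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(round(clf.score(X_test, y_test), 2))
```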
• 8. Random Forest (diagram slide)
• 9. Support Vector Machine
  - In an SVM model, the original objects (training data) are treated as points in a space (the input space).
  - These are mapped (rearranged) into a new space (the feature space) using mathematical functions called kernels.
  - After mapping, objects of separate categories are divided by a gap that is as wide as possible.
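The kernel trick described above is what `kernel='rbf'` selects in scikit-learn's `SVC`. A minimal sketch with the same small digits stand-in (the slides' setting C=1; `gamma` is left at its `'scale'` default, an assumption not stated in the slides):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel implicitly maps points into a feature space where the
# classes can be separated by a maximum-margin boundary.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)
print(round(clf.score(X_test, y_test), 2))
```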
• 10. K-Nearest Neighbors
  - Basic idea: if it walks like a duck and quacks like a duck, then it is probably a duck.
  - There are three key elements:
    - a set of labeled objects (e.g., a set of stored records),
    - a distance or similarity metric to compute the distance between objects, and
    - the value of k, the number of nearest neighbors.
  - To classify an unlabeled object:
    - the distance of this object to the labeled objects is computed,
    - its k nearest neighbors are identified, and
    - the class labels of these nearest neighbors are used to determine the class label of the object.
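The three classification steps above are simple enough to sketch directly in NumPy (an illustrative implementation, not the code used for the Kaggle submission):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=10):
    """Classify x by majority vote among its k nearest labeled points."""
    # 1. Distance of the object to every labeled object (Euclidean)
    dists = np.linalg.norm(X_train - x, axis=1)
    # 2. Identify the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # 3. Majority vote over their class labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: two well-separated clusters
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.2]), k=3))  # 0
```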
• 11. Results
  - Random Forest with 500 trees gave 97% accuracy on the test data.
  - SVM with an RBF kernel and C=1 gave 97.71% accuracy on the test data.
  - KNN with k=10 gave 96% accuracy.
• 12. Titanic: Machine Learning from Disaster
• 13. Problem
  - The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.
  - One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although some element of luck was involved in surviving the sinking, some groups of people, such as women, children, and the upper class, were more likely to survive than others.
  - This project analyzes what sorts of people were likely to survive. In particular, the tools of machine learning are applied to predict which passengers survived the tragedy.
• 14. Statement
  - The historical data has been split into two groups, a "training set" and a "test set". For the training set, the outcome of whether or not the passenger survived the sinking (0 for deceased, 1 for survived) is provided.
  - The goal is to predict the outcome for each passenger in the test set.
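Unlike the pixel data, the Titanic columns mix text and missing values, so some preparation is needed before either model can be fit. A minimal pandas sketch on a tiny inline stand-in for the training file (column names follow the Kaggle set; the exact preprocessing the slides' models used is not stated):

```python
import pandas as pd

# Tiny stand-in for the Kaggle training set
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass":   [3, 1, 3, 2],
    "Sex":      ["male", "female", "female", "male"],
    "Age":      [22.0, 38.0, None, 35.0],
})

# Typical preparation: encode Sex numerically, fill missing ages
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Age"] = train["Age"].fillna(train["Age"].median())

print(train["Age"].tolist())  # [22.0, 38.0, 35.0, 35.0]
```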
• 15. Methods used to solve the problem
  - Random Forest
  - Support Vector Machine (SVM)
• 16. Results
  - Random Forest with 300 trees gave 77.9% accuracy on the test data.
  - SVM with an RBF kernel and C=1 gave 77.7% accuracy on the test data.