Full code can be downloaded here: https://github.com/thiakx/RUGS-Meetup

Train / test data from Kaggle: http://www.kaggle.com/c/see-click-predict-fix/data

Interactive map demo: http://www.thiakx.com/misc/playground/scfMap/scfMap.html

License: CC Attribution License


- 1. Musing of a Kaggler By Kai Xin
- 2. I was not a good student. I skipped school, played games all day, and almost got kicked out of school.
- 3. I play a different game now. But at the core it is the same: understand the game, devise a strategy, keep playing.
- 4. My Overall Strategy
- 5. Every piece of data is unique but some data is more important than others
- 6. It is not about the tools or the model or the stats. It is about the steps to put everything together.
- 7. The Kaggle Competition
- 8. https://github.com/thiakx/RUGS-Meetup Remember to download the data from the Kaggle competition and put it here
- 9. First look at the data 223,129 rows
- 10. First look at the data (annotations on the screenshot: Plot on map? Not really free text? Some repeats. Need to predict these. Related to summary / description?)
- 11. Understand the data via visualization (graph by Ryn Locar)
- 12. Visualize the data: interactive maps. LeafletR demo covering Oakland, Chicago, New Haven, and Richmond: http://www.thiakx.com/misc/playground/scfMap/scfMap.html
- 13. Step 1: Draw Boundary Polygon. Step 2: Create Base (each hex 1 km wide). Step 3: Point-in-Polygon Analysis. Step 4: Local Moran's I.
- 14. Obtain Boundary Polygon (lat/long). App can be found at: leafletMaps/latlong.html; points file: leafletMaps/regionPoints.csv
- 15. Generating Hex Code can be found at: baseFunctions_map.R
- 16. Point in Polygon Analysis Code can be found at: 1. dataExplore_map.R
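The deck's point-in-polygon step uses spatial packages from the repo; as an illustration of the underlying idea only, here is a minimal base-R ray-casting sketch (function name and test polygon are my own, not from the repo):

```r
# Ray casting: count how many polygon edges a horizontal ray from the
# point crosses; an odd count means the point is inside.
point_in_polygon <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    crosses <- (vy[i] > py) != (vy[j] > py)  # edge spans the ray's y?
    if (crosses &&
        px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i]) {
      inside <- !inside
    }
    j <- i
  }
  inside
}

# Unit-square polygon: (0,0) (1,0) (1,1) (0,1)
vx <- c(0, 1, 1, 0)
vy <- c(0, 0, 1, 1)
point_in_polygon(0.5, 0.5, vx, vy)  # TRUE
point_in_polygon(2.0, 0.5, vx, vy)  # FALSE
```

In the deck this step assigns each incident report (a lat/long point) to the hex cell that contains it.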
- 17. Local Moran’s I Code can be found at: 1. dataExplore_map.R
- 18. LeafletR Code can be found at: 1. dataExplore_map.R Kx's layered demo map: leafletMaps/scfMap_kxDemoVer
- 19. In Search of the 20% data
- 20. [Diagram: training data split into regions labeled Model / Ignore, with MAD separating out outliers]
- 21. In Search of the 20% Data: detection of "anomalies". Can we justify this using statistics?
- 22. Two-sample Kolmogorov–Smirnov test: ksTest<-ks.test(trainData$num_views[trainData$month==4&trainData$year==2013], trainData$num_views[trainData$month==9&trainData$year==2012]) # D is the distance between the two empirical distributions; a smaller D means the two samples probably come from the same distribution. Jan '12 to Oct '12 and Mar '13 training data were ignored.
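The actual `trainData` is not included here, so a minimal sketch of the same KS-test idea on synthetic counts (Poisson draws stand in for view counts):

```r
# Two-sample KS test: D is the maximum distance between the two
# empirical CDFs; small D suggests the samples share a distribution.
set.seed(42)
views_sep12 <- rpois(500, lambda = 2)   # stand-in for Sep 2012 views
views_apr13 <- rpois(500, lambda = 2)   # drawn from the same distribution
views_shift <- rpois(500, lambda = 10)  # clearly different distribution

# suppressWarnings: ks.test warns about ties on discrete data
same <- suppressWarnings(ks.test(views_apr13, views_sep12))
diff <- suppressWarnings(ks.test(views_shift, views_sep12))

same$statistic < diff$statistic  # TRUE: matched months have a smaller D
```

This is how months whose D against the test period was large could be dropped from the training set.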
- 23. What happened here? No need to model? Just assume all Chicago data to be 0? The Chicago data was generated by remote_API and is mostly 0s, so there is no need to model it.
- 24. Separate Outliers using Median Absolute Deviation (MAD) MAD is robust and can handle skewed data, which helps identify outliers. We separated out data points that are more than 3 median absolute deviations from the median. Code can be found at: baseFunctions_cleanData.R
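The deck's version lives in baseFunctions_cleanData.R; a self-contained sketch of the 3-MAD rule using base R's `mad()` on toy counts:

```r
# Flag points more than 3 median absolute deviations from the median.
# mad() is scaled by 1.4826 by default to be consistent with the sd
# under normality.
set.seed(1)
views <- c(rpois(200, lambda = 3), 250, 400)  # two extreme values appended

med <- median(views)
dev <- mad(views)
is_outlier <- abs(views - med) > 3 * dev

# Both appended extremes are flagged (a few natural highs may be too).
which(is_outlier)
```

Because the median and MAD barely move when extremes are added, the threshold stays sensible even on skewed data, unlike a mean/sd rule.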
- 25. [Diagram: training data split into Model / Ignore regions, with MAD separating out outliers]
- 26. [Diagram: Model / Ignore split, annotated] 10% of the training data is used for modeling. KS test: 27% of the training data is from a different distribution. 59% of the data is Chicago data generated by remote_API, mostly 0s; no need to model, just estimate using the median. 4% of the data is identified as outliers by MAD. Key advantage: rapid prototyping!
- 27. When you can focus on a small but representative subset of the data, you can run many, many experiments very quickly (I ran several hundred)
- 28. Now we have the raw ingredients prepared, it is time to make the dishes
- 29. Experiment with Different Models ❖ Random Forest ❖ Generalized Boosted Regression Models (GBM) ❖ Support Vector Machines (SVM) ❖ Bootstrap aggregated (bagged) linear models How to use? Ask Google & RTFM
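Random Forest, GBM, and SVM come from their respective packages (randomForest, gbm, e1071); the fourth model on the list, bagged linear models, can be sketched in base R alone. A minimal bagging loop on toy data (function names and data are my own illustration, not the repo's code):

```r
# Bagging: fit many lm()s on bootstrap resamples, average predictions.
set.seed(7)
n  <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 2 * df$x1 - df$x2 + rnorm(n, sd = 0.5)

bag_lm <- function(data, n_models = 25) {
  lapply(seq_len(n_models), function(i) {
    boot <- data[sample(nrow(data), replace = TRUE), ]  # bootstrap resample
    lm(y ~ x1 + x2, data = boot)
  })
}

predict_bag <- function(models, newdata) {
  preds <- sapply(models, predict, newdata = newdata)  # n x n_models matrix
  rowMeans(preds)                                      # average the ensemble
}

models <- bag_lm(df)
head(predict_bag(models, df))
```

Averaging over bootstrap fits mainly reduces variance; for a plain linear model the gain is modest, but the same loop works with any base learner.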
- 30. Or just download my code
- 31. I don't spend time optimizing/tuning model settings (learning rate, etc.) with cross-validation. I find it really boring and really slow
- 32. Obsessing with tuning model variables is like being obsessed with tuning the oven
- 33. Instead, the magic happens when we combine data and when we create new data - aka feature creation
- 34. Creating Simple Features: City trainData$city[trainData$longitude=="-77"]<- "richmond" trainData$city[trainData$longitude=="-72"]<- "new_haven" trainData$city[trainData$longitude=="-87"]<- "chicago" trainData$city[trainData$longitude=="-122"]<- "oakland" Code can be found at: 1. dataExplore_map.R
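The string comparisons on the slide assume the longitudes were already truncated to whole degrees; a self-contained version of the same recode (toy longitudes and a lookup vector of my own, city spelling normalized to "chicago"):

```r
# Recode city from truncated longitude, as on the slide.
# trunc(), not round(): New Haven (~ -72.93) must map to -72, not -73.
trainData <- data.frame(longitude = c(-77.03, -72.93, -87.65, -122.27))

city_lookup <- c("-77"  = "richmond",
                 "-72"  = "new_haven",
                 "-87"  = "chicago",
                 "-122" = "oakland")

trainData$city <- unname(city_lookup[as.character(trunc(trainData$longitude))])

trainData$city  # "richmond" "new_haven" "chicago" "oakland"
```

A named-vector lookup keeps the mapping in one place instead of four separate assignments.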
- 35. Creating Complex Features: Local Moran’s I Code can be found at: 1. dataExplore_map.R
- 36. Creating Complex Features: Predicted Views The task is to predict views, votes, and comments, but logically, won't the number of votes and comments be correlated with the number of views? Code can be found at: baseFunctions_model.R
- 37. Creating Complex Features: Predicted Views Storing the predicted number of views as a new column and using it as a feature to predict votes & comments is very risky business, but powerful if you know what you are doing
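A minimal sketch of this two-stage idea with base-R `lm()` on toy data (the deck's real version is in baseFunctions_model.R; variable names here are mine):

```r
# Stage 1 predicts views; stage 2 uses the *predicted* views as a
# feature for votes. Predicted, not actual, views are used because the
# actual counts are unknown at test time.
set.seed(3)
n  <- 400
df <- data.frame(x = rnorm(n))            # a base feature
df$views <- 5 + 2 * df$x + rnorm(n)       # views depend on x
df$votes <- 1 + 0.5 * df$views + rnorm(n, sd = 0.3)  # votes track views

# Stage 1: model views from the base features.
view_mod <- lm(views ~ x, data = df)
df$pred_views <- predict(view_mod, df)

# Stage 2: feed predicted views into the votes model.
vote_mod <- lm(votes ~ x + pred_views, data = df)
summary(vote_mod)$r.squared
```

The risk the slide mentions is real: errors from stage 1 leak into stage 2, so a badly calibrated views model can drag the votes and comments predictions down with it.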
- 38. Creating Complex Features: SplitTag, wordMine
- 39. Creating Complex Features: SplitTag, wordMine Code can be found at: baseFunctions_cleanData.R
- 40. Adjusting Features: Simplify Tags Code can be found at: baseFunctions_cleanData.R
- 41. Adjusting Features: Recode Unknown Tags Code can be found at: baseFunctions_cleanData.R
- 42. Adjusting Features: Combine Low Count Tags Code can be found at: baseFunctions_cleanData.R
- 43. Full List of Features Used Code can be found at: baseFunctions_model.R +Num View as Y variable +Num Comments as Y variable +Num Votes as Y variable Fed into models to predict view, votes, comments respectively
- 44. Only 1 original feature was used; I created the other 13 features. Original features: 1; created features: 13. Fed into models to predict views, votes, and comments respectively. Code can be found at: baseFunctions_model.R
- 45. An ensemble of good enough models can be surprisingly strong
- 47. An ensemble of the 4 base models has lower error
- 48. Each model is good for different scenarios. GBM is rock solid, good for all scenarios. SVM is a counterweight; don't trust anything it says. GLM is amazing for predicting comments, not so much for the others. Random Forest is moderate and provides a balanced view.
- 49. Ensemble (Stacking using regression)
  testDataAns  rfAns  gbmAns  svmAns  glmBagAns
  2.3          2      2.5     2.4     1.8
  2            1.8    2.2     1.7     1.6
  1.3          1.3    1.7     1.2     1.0
  1.5          1.4    1.9     1.6     1.2
  …            …      …       …       …
  glm(testDataAns~rfAns+gbmAns+svmAns+glmBagAns); we are interested in the coefficients
- 50. Ensemble (Stacking using regression): sort and column-bind the predictions from the 4 models, run a regression (logistic or linear) to obtain coefficients, then scale the ensemble ratio back to 1 (100%)
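The steps above can be sketched end-to-end in base R. Variable names follow the slide; the base-model outputs here are simulated rather than taken from the real models:

```r
# Stacking: regress the true answers on the four base-model predictions,
# then rescale the coefficients so the ensemble weights sum to 1.
set.seed(11)
n <- 200
testDataAns <- runif(n, 1, 3)                    # true values
rfAns       <- testDataAns + rnorm(n, sd = 0.30) # simulated model outputs,
gbmAns      <- testDataAns + rnorm(n, sd = 0.20) # each with different noise
svmAns      <- testDataAns + rnorm(n, sd = 0.60)
glmBagAns   <- testDataAns + rnorm(n, sd = 0.40)

stack <- glm(testDataAns ~ rfAns + gbmAns + svmAns + glmBagAns)

w <- coef(stack)[-1]   # drop the intercept; keep the model coefficients
w <- w / sum(w)        # scale the ensemble ratio back to 1 (100%)
round(w, 2)
```

The regression naturally gives more weight to the less noisy models, which matches the slide's observation that GBM earns trust while SVM mostly acts as a counterweight.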
- 51. Obtaining the ensemble ratio for each model Inside 3. testMod_generateEnsembleRatio folder - getEnsembleRatio.r
- 52. Ensemble is not perfect… ❖ Simple to implement? Kind of. But very tedious to update. Will need to rerun every single model every time you make any changes to the data (as the ensemble ratio may change). ❖ Easy to overfit test data (will require another set of validation data or cross validation). ❖ Very hard to explain to business users what is going on.
- 53. All this should get you to a top rank: 49/532
- 54. [Diagram recap: Model / Ignore split] 10% of the training data is used for modeling. 4% of the data is identified as outliers by MAD. KS test: too different from the rest of the data. 59% of the data is Chicago data generated by remote_API, mostly 0s; no need to model, just estimate using the median. Key advantage: rapid prototyping!
- 55. Thank you! thiakx@gmail.com Data Science SG Facebook Data Science SG Meetup
