Using R for Predictive Analytics / Data Mining / Data Science - Oct 2009

This talk argued as early as 2009 that R is a viable tool for the entire lifecycle of data science / data mining / predictive analytics. A few years later, R had indeed become the #1 tool used by data scientists [http://www.rexeranalytics.com/].

Published in: Technology, Education

  1. Using R for Predictive Analytics
     Szilárd Pafka
     Predictive Analytics World / DC useR Group, October 20, 2009
  2. Introduction and agenda
     Introduction:
       • physics, finance, statistical modeling (PhD)
       • data mining at a credit card processor (Chief Scientist)
       • co-organizing the Los Angeles area R users group
     Agenda:
       • using R for predictive analytics
       • benefits from R users groups
  3. R and predictive analytics
     R:
       • data manipulation
       • statistical analysis, exploratory data analysis
       • computations, simulations
       • modeling
       • visualization
     Predictive analytics (data mining for prediction problems):
       • understanding the domain problem and related questions
       • data exploration and cleaning
       • building predictive models (supervised learning)
       • model diagnostics, selection and evaluation
       • deployment
  4. R vs alternatives
     R:
       • open source
       • packages
       • latest cutting-edge methods
       • command line interaction (vs GUI)
       • flexible
       • results easier to reproduce
       • learning curve
       • books
       • R-help
       • cost, multiple platforms
     Alternatives:
       • spreadsheets
       • programming languages
       • data mining software
       • similar software environments
  5. Data exploration
       • data frames
       • import from various sources (csv files, database connection)
       • flexible data manipulation functionalities
       • subscripting
       • numerous transformations
       • text manipulation (e.g. regexp)
       • data aggregation
       • specialized packages (reshape, plyr etc.)
       • graphics
       • sample R code
  6. Visualization
       • powerful tool for data exploration, diagnostics etc.
       • R graphics: base, lattice, ggplot2
       • visual perception, principles of visual design
       • ggplot2
       • based on the grammar of graphics
       • nice defaults
       • elements: mappings, geoms, stats, scales, facets
       • very expressive (sample code)
  7. Building predictive models
       • supervised learning: y = f(x_1, x_2, ..., x_p)
       • regression (numeric y), classification (categorical y)
       • spam detection example: classification (2 classes, Y/N)
       • numerous methods: logistic regression, LDA, nearest neighbors, decision trees, bagging, boosting, random forests, neural networks, support vector machines etc.
       • example: neural networks with nnet (sample code)
       • weight decay as regularization
       • helps convergence and against overfitting
  8. Model evaluation
     99% accuracy?
       • package caret: unified interface/wrapper for
       • training models, tuning (complexity parameters)
       • diagnosing, selecting, evaluating models
       • model deployment
       • Conclusion: R is a viable tool for predictive analytics
  9. R users groups
       • R users groups around the world, in the US: SF, NY, LA, DC...
       • Los Angeles: 150+ members, 4 meetings so far
       • talks, Q&A, discussions, exchange of knowledge
       • pointing people to where to look (packages, docs etc.)
       • also more generally for statistical methodology issues
       • networking opportunities
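
Slides 7 and 8 mention a neural network fitted with nnet and weight decay, and question what "99% accuracy" means. The talk's own sample code is not part of this transcript, so the following is only a minimal sketch under stated assumptions: the data is simulated (it is not the spam data from the talk), and the variable names, size = 3 and decay = 0.1 are illustrative choices. It compares held-out accuracy with the accuracy of always predicting the majority class, which is one reason an accuracy figure on its own can mislead.

```r
# Minimal sketch: two-class classification with nnet and weight decay.
# Assumption: simulated data stands in for the talk's spam example.
library(nnet)  # recommended package, shipped with standard R distributions

set.seed(42)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
# The class label depends on x1 and x2 plus some noise
y  <- factor(ifelse(x1 + x2 + rnorm(n, sd = 0.5) > 0, "Y", "N"))
d  <- data.frame(x1 = x1, x2 = x2, y = y)

# Hold out 100 rows for evaluation
train_idx <- sample(n, 300)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# One hidden layer with 3 units; decay is the weight decay penalty
# (both values are illustrative, not tuned)
fit <- nnet(y ~ x1 + x2, data = train, size = 3, decay = 0.1,
            maxit = 200, trace = FALSE)

# Predicted class labels on the held-out rows
pred      <- predict(fit, newdata = test, type = "class")
confusion <- table(predicted = pred, actual = test$y)
accuracy  <- mean(pred == test$y)

# Baseline: always predict the most frequent class seen in training
majority <- names(which.max(table(train$y)))
baseline <- mean(test$y == majority)

print(confusion)
cat("accuracy:", accuracy, " majority-class baseline:", baseline, "\n")
```

In practice size and decay would be tuned rather than fixed, for example by cross-validation; slide 8 names the caret package as a wrapper for that kind of tuning and evaluation.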
