Using R for Predictive Analytics / Data Mining / Data Science - Oct 2009
Upcoming SlideShare
Loading in...5
×

Like this? Share it with your network

Share

Using R for Predictive Analytics / Data Mining / Data Science - Oct 2009

  • 362 views
Uploaded on

This talk argues already in 2009 that R is a viable tool for the entire lifecycle of data science / data mining / predictive analytics. A few years later R has indeed become the #1 tool used by......

This talk argues already in 2009 that R is a viable tool for the entire lifecycle of data science / data mining / predictive analytics. A few years later R has indeed become the #1 tool used by data scientists [http://www.rexeranalytics.com/].

More in: Technology , Education
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
362
On Slideshare
362
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
4
Comments
0
Likes
2

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Using R for Predictive Analytics Szil´rd Pafka a Predictive Analytics World / DC useR Group October 20, 2009
  • 2. Introduction and agenda Introduction: • physics, finance, statistical modeling (PhD) • data mining at a credit card processor (Chief Scientist) • co-organizing the Los Angeles area R users group Agenda: • using R for predictive analytics • benefits from R users groups
  • 3. R and predictive analytics R: • data manipulation • statistical analysis, exploratory data analysis • computations, simulations • modeling • visualization Predictive analytics (data mining for prediction problems): • understanding the domain problem and related questions • data exploration and cleaning • building predictive models (supervised learning) • model diagnostics, selection and evaluation • deployment
  • 4. R vs alternatives R: • open source • packages • latest cutting-edge methods • command line interaction (vs GUI) • flexible • results easier to reproduce • learning curve • books • R-help • cost, multiple platforms alternatives: • spreadsheets • programming languages • data mining software • similar software environments
  • 5. Data exploration • data frames • import from various sources (csv files, database connection) • flexible data manipulation functionalities • subscripting • numerous transformations • text manipulation (e.g. regexp) • data aggregation • specialized packages (reshape, plyr etc.) • graphics • sample R code
  • 6. Visualization • powerful tool for data exploration, diagnostics etc. • R graphics: base, lattice, ggplot2 • visual perception, principles of visual design • ggplot2 • based on the grammar of graphics • nice defaults • elements: mappings, geoms, stats, scales, facets • very expressive (sample code)
  • 7. Building predictive models • supervised learning: y = f (x1 , x2 , . . . , xp ) • regression (numeric y ), classification (categorical y ) • spam detection example: classification (2 classes, Y/N) • numerous methods: logistic regression, LDA, nearest neighbors, decision trees, bagging, boosting, random forests, neural networks, support vector machines etc. • example: neural networks with nnet (sample code) • weight decay as regularization • helps convergence and against overfitting
  • 8. Model evaluation 99% accuracy? • package caret: unified interface/wrapper for • training models, tuning (complexity parameters) • diagnosing, selecting, evaluating models • model deployment • Conclusion: R is a viable tool for predictive analytics
  • 9. R users groups • R users groups around the world, in the US: SF, NY, LA, DC... • Los Angeles: 150+ members, 4 meetings so far • talks, Q&A, discussions, exchange of knowledge • pointing people where to look for (packages, docs etc.) • also more generally for statistical methodology issues • networking opportunities