Data analysis in R for beginners

A fast-paced beginners' course for those new to R who want a quick, broad demonstration of R under the hood. We cover data ingestion, manipulating data frames, data summary and exploration, interactive visualization, creating dashboards, predictive modeling, and big data integrations.

Published in: Data & Analytics

  1. Data Analysis in R for Beginners. Alton Alexander, Data Science Consultant
  2. Why R? • R is open source – like Python, not like SAS • Out of the box, R is a single-machine, in-memory statistical computing engine – Download from https://www.r-project.org/ • Use an IDE – RStudio https://www.rstudio.com/ – Revolution Analytics (MSFT) – Jupyter (IPython)
  3. RStudio: Download, Overview
  4. Essential Learning Resources: A new book for learning R. Q: What have you tried and what works?
  5. Topics • Data ingestion • Manipulation • Summary and exploration • Writing reports • Interactive visualization and dashboarding • Predictive modeling & forecasting • Big data integrations
  6. Demo Options: data, RStudio
  7. Data Ingestion • Load data – read.csv() – library(RJDBC) – library(RODBC)
  8. Data Structures and Manipulation • Another major reason for using R – Ability to work with data in data frames – Like pandas in Python and data sets in SAS • Reasons for doing data manipulation (munging) – Feature extraction – ETL – Data cleansing – Pivots, stack/unstack, aggregate, group by, reshape
  9. Set Theory: SQL joins and their results • merge, sqldf in R • http://www.r-bloggers.com/manipulating-data-frames-using-sqldf-a-brief-overview/
  10. Summary and Exploration • Powerful summary functions for programmatically quantifying datasets • Functions include: – summary(), hist(), levels(), aggregate()
  11. Interactive Visualization and Dashboarding • Shiny from RStudio • Like Tableau – Local and server options • Much more customizable, but more coding: no GUI or click-to-edit • You can bring in powerful libraries to build web apps comparatively fast
  12. Predictive Modeling & Forecasting • Examples – Customer segmentation • Unsupervised classification – Marketing mix models • Explain the coefficients – Attribution modeling • Supervised time series of events – Multivariate testing • A/B tests with statistical significance, ANOVA – Lead scoring • P2B models, topic of interest, propensity to buy, expected spend
  13. 5 Libraries for Machine Learning • Allowing the machine to capture complexity: 1. gbm [Gradient Boosting Machine] 2. randomForest [Random Forest] 3. e1071 [Support Vector Machines] • Taking advantage of high-cardinality categorical or text data: 4. glmnet [Lasso and Elastic-Net Regularized Generalized Linear Models] 5. tau [Text Analysis Utilities]
  14. Big Data Integration • A single laptop is often sufficient – Millions of rows on a 32 GB i7 laptop • Scale using a larger server – Often sufficient but has limitations (100s of GB) • Clustered compute engine – Algorithm choice affects performance
  15. RServer • For datasets that don’t fit in memory, or for convenience, there is a server option – A shared compute engine – Shares resources – Think 100+ GB of RAM
  16. Big Data Integration: Frameworks • H2O.ai • SparkR • Revolution Analytics • In-database processing – Applying a lead score or segmentation model in real time – Spark, Teradata, Vertica
  17. Why R? In High Demand Nationally
  18. Get Alton’s FREE Reports! Go to http://frontanalysis.com/bigdatameetup/ and complete the survey, including your email. I’ll email you the two reports: 1. Anonymized Summary of the Survey 2. LinkedIn Job Suggestions for a Utah Data Scientist
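
The ingestion, join, and summary steps named on slides 7 to 10 can be sketched in base R. This is a minimal illustration, not the code shown in the talk: the `customers` and `orders` tables, their column names, and all values are hypothetical, and the data is inlined as text so the sketch runs without external files or database drivers.

```r
# Hypothetical sample data, inlined so no external file is needed.
customers_csv <- "customer_id,region
1,West
2,East
3,West
4,South"

orders_csv <- "order_id,customer_id,amount
101,1,25.0
102,1,40.0
103,2,15.5
104,3,60.0
105,5,12.0"

# Data ingestion: read.csv() takes a file path or, as here, inline text.
customers <- read.csv(text = customers_csv, stringsAsFactors = TRUE)
orders <- read.csv(text = orders_csv)

# SQL-style joins with merge(): an inner join by default,
# a left join when all.x = TRUE keeps unmatched customers.
inner <- merge(customers, orders, by = "customer_id")
left <- merge(customers, orders, by = "customer_id", all.x = TRUE)

# Summary and exploration.
print(summary(inner$amount))      # five-number summary plus mean
print(levels(customers$region))   # distinct categories of a factor column

# Group-by aggregation: total order amount per region.
spend_by_region <- aggregate(amount ~ region, data = inner, FUN = sum)
print(spend_by_region)
```

For a real file, the `text =` argument would be replaced by a path, for example `read.csv("orders.csv")` with a file name of your own. The same joins could be written as SQL through the sqldf package mentioned on slide 9, assuming it is installed.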
