Exploratory data analysis in R - Data Science Club

How do you analyse a new dataset in R? Which libraries and commands should you use? How can you understand your dataset in a few minutes? Read my presentation for Data Science Club by Exponea and find out!

Published in: Data & Analytics


  1. DATA EXPLORATION METHODS & PRACTICES
     Martin Bago | Instarea
     8.10.2018, 2nd Data Science Club, 18/19 Winter
  2. TABLE OF CONTENTS
     INTRO
     FIRST DIVE INTO THE DATASET
     GOING DEEPER
     CORRELATIONS
     BONUS
     DATA SCIENCE CLUB
  3. Martin Bago
     Data Scientist | Instarea
     Ing. @ Process Automation and Informatization in Industry (2016, MTF STU BA)
     Bc. @ Applied Informatics (2014, FEI STU BA)
     2017-now: Data Scientist, Instarea s.r.o., Market Locator
     2015-2016: Head of Analyst, News and Media Holding a.s.
     2014-2015: SEO Analyst, Centrum Holdings a.s.
     2011-2014: Automix.sk, Centrum Holdings a.s.
     2010-2013: Editor-in-chief, OKO Casopis (FEI STU BA)
     Passionate driver; beer, coffee and football lover
  4. Something for you: download this presentation + source code here: http://bit.ly/2QybvNV
  5. The data journey… always the same
  6. Dataset: how to find and use
     For studying, there is a unique library consisting of many real-life example datasets (from Monthly Airline Passenger Numbers, through Weight versus Age of Chicks on Different Diets, to Monthly Deaths from Lung Diseases in the UK). For this presentation we will use the mtcars dataset.
     >> library(datasets) # ships with base R and is attached by default; install.packages("datasets") is not needed
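A minimal sketch of how one might browse the datasets package before settling on mtcars. The names `catalogue` and `cars` are illustrative and not part of the original slides.

```r
# List the example datasets shipped with the 'datasets' package.
# The result is a character matrix; "Item" is the dataset name, "Title" a short description.
catalogue <- data(package = "datasets")$results
head(catalogue[, c("Item", "Title")])

# mtcars can be used without any explicit loading step.
cars <- datasets::mtcars
class(cars)

# The help page documents every column and its unit.
# ?mtcars
```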
  7. Baby steps: head(), tail(), nrow() and ncol()
     To understand what you are working with, it is important to see the dimensions of the dataset and the number of values.
     >> head(mtcars)      # first 6 rows
     >> tail(mtcars)      # last 6 rows
     >> head(mtcars, 25)  # first 25 rows
     >> nrow(mtcars)      # number of rows
     >> ncol(mtcars)      # number of columns
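As a small complement to nrow() and ncol(), the sketch below uses dim() to get both numbers at once and then peeks at the column and row names. The variable name `shape` is only illustrative.

```r
# dim() returns the number of rows and columns in one call.
shape <- dim(mtcars)
shape  # 32 11

# Column names, and the row names (here the car models), are usually the next thing to check.
names(mtcars)
head(rownames(mtcars), 3)
```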
  8. Deeper insight: str(), summary()
     For a deeper understanding of the dataset, use detailed views of its metrics and dimensions.
     >> str(mtcars)      # type and first values of every column
     >> summary(mtcars)  # minimum, quartiles, mean and maximum of every column
     Always check data types!
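One possible way to act on the "check data types" advice is to collect the class of every column in a single vector. This is a sketch, and `col_types` is an illustrative name.

```r
# Class of every column, as a named character vector.
col_types <- sapply(mtcars, class)
col_types

# How many columns of each type there are.
# In mtcars everything is stored as numeric, even columns that are
# really categories (cyl, vs, am, gear, carb).
table(col_types)
```

Seeing categorical variables stored as plain numbers is a hint that they may need converting to factors before summaries or plots are meaningful.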
  9. Unique and missing values: unique(), is.na()
     It is crucial to find out how many values are missing from the dataset. If two thirds are missing, you have the wrong dataset.
     >> unique(mtcars$cyl)  # distinct values of one column
     >> is.na(mtcars)       # TRUE wherever a value is missing
     If something is missing, you can use the good old method to treat it: filling with the mean. (smt is not a column of mtcars; read it as a placeholder for a column that has missing values.)
     >> mtcars$smt[is.na(mtcars$smt)] <- mean(mtcars$smt, na.rm = TRUE)
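Because mtcars itself has no missing values, the sketch below first removes a few hp values from a copy on purpose, then counts the gaps and fills them with the column mean. The copy `cars_na` and the choice of rows 2, 5 and 9 are assumptions made purely for illustration.

```r
# Work on a copy and knock out a few values on purpose.
cars_na <- mtcars
cars_na$hp[c(2, 5, 9)] <- NA

# Number of missing values per column.
na_counts <- colSums(is.na(cars_na))
na_counts["hp"]

# Share of missing cells in the whole table.
na_share <- mean(is.na(cars_na))
na_share

# Mean imputation, applied to the hp column.
hp_mean <- mean(cars_na$hp, na.rm = TRUE)
cars_na$hp[is.na(cars_na$hp)] <- hp_mean
anyNA(cars_na)
```

Mean imputation keeps the column mean unchanged but reduces its variance, so it is best treated as a quick fix for exploration.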
  10. Histograms: hist()
      The best way to learn and understand is visually.
      >> hist(mtcars$mpg)
      >> hist(mtcars$hp)
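To see what a histogram is actually counting, hist() can be asked to compute the bins without drawing them. A minimal sketch, with `h` as an illustrative name:

```r
# Compute the histogram without drawing it, to inspect the bins.
h <- hist(mtcars$mpg, plot = FALSE)

h$breaks  # bin edges chosen by the default rule
h$counts  # number of cars falling into each bin
h$mids    # bin midpoints
```

Dropping `plot = FALSE` draws the same bins; arguments such as `breaks`, `main` and `xlab` then control the bin count and the labels.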
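  11. Transforming and recalculating
      Often you need to calculate your own metrics. In R, it’s really easy.
      >> mtcars2 <- mtcars
      >> mtcars2$disp_l <- mtcars$disp/61.024  # displacement: cubic inches to litres (computed from disp, not mpg)
      >> mtcars2$kml <- 235/mtcars$mpg         # consumption in litres per 100 km (the exact factor for US gallons is about 235.215)
      >> hist(mtcars2$disp_l)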
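The two unit conversions can be written out with named constants, which makes the arithmetic easier to check. This is a sketch under stated assumptions: the constant names, the data frame `conv` and the column name `l100km` are illustrative, and the factors assume that disp is in cubic inches and mpg in miles per US gallon, as documented in ?mtcars.

```r
CUIN_PER_LITRE <- 61.0237  # cubic inches in one litre (1000 / 2.54^3)
L100KM_FACTOR  <- 235.215  # 100 * 3.785412 litres / 1.609344 km, for US gallons

conv <- data.frame(
  disp_l = mtcars$disp / CUIN_PER_LITRE,  # engine displacement in litres
  l100km = L100KM_FACTOR / mtcars$mpg,    # fuel consumption in litres per 100 km
  row.names = rownames(mtcars)
)

round(head(conv, 3), 2)
```

Note that consumption in litres per 100 km runs opposite to mpg: a higher value means a thirstier car.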
  12. Understand the scope of variables: boxplot()
      >> boxplot(mtcars)
      >> boxplot(mtcars2$disp_l, mtcars2$kml)
      >> boxplot(mtcars2$kml, main = "mtcars dataset", xlab = "Consumption per 100km", ylab = "Liters")
  13. How to read a boxplot? boxplot()
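The numbers behind a boxplot can be inspected directly with boxplot.stats(), which is what boxplot() uses by default. The sketch below does this for the hp column; `s`, `box_len` and `upper_fence` are illustrative names.

```r
# The five numbers a default boxplot draws for one variable.
s <- boxplot.stats(mtcars$hp)
s$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out    # points beyond the whiskers, drawn individually as outliers

# With the default coef = 1.5, a whisker ends at the most extreme data point
# that lies within 1.5 box lengths of the box.
box_len     <- s$stats[4] - s$stats[2]
upper_fence <- s$stats[4] + 1.5 * box_len
upper_fence
```

The box spans the middle half of the data, the line inside it is the median, and anything past the fences is plotted as a separate point.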
  14. Does it correlate? library(corrplot), cor()
      >> install.packages("corrplot")
      >> library(corrplot)
      >> #cor(x, method = "pearson", use = "complete.obs")
      >> cor(mtcars)
      Not very intuitive…
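The `method` and `use` arguments in the commented-out general form can be tried on a single pair of columns. This is a sketch; the copy `x` with one artificially removed value is an assumption made only to show what `use = "complete.obs"` does.

```r
# Pearson measures linear association, Spearman works on ranks.
pearson  <- cor(mtcars$mpg, mtcars$wt, method = "pearson")
spearman <- cor(mtcars$mpg, mtcars$wt, method = "spearman")
c(pearson = pearson, spearman = spearman)

# With a missing value, the default returns NA for the affected pairs ...
x <- mtcars[, c("mpg", "wt")]
x$wt[1] <- NA
with_na <- cor(x)

# ... while use = "complete.obs" drops incomplete rows first.
complete <- cor(x, use = "complete.obs")
complete
```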
  15. Does it correlate? library(corrplot), cor()
      >> res <- cor(mtcars)
      >> round(res, 2)
      >> corrplot(res, type = "upper", order = "hclust", tl.col = "black", tl.srt = 25)
      Be careful! Correlation is not causation.
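When the full matrix is hard to read, one option is to flatten it into a table of variable pairs sorted by strength. This is a base-R sketch, not something shown in the slides; `res_upper`, `pairs_tbl` and the column names `var1`, `var2`, `r` are illustrative.

```r
res <- cor(mtcars)

# Keep each pair once: blank out the diagonal and the lower triangle.
res_upper <- res
res_upper[lower.tri(res_upper, diag = TRUE)] <- NA

# Flatten to a long table and sort by absolute correlation.
pairs_tbl <- as.data.frame(as.table(res_upper), stringsAsFactors = FALSE)
names(pairs_tbl) <- c("var1", "var2", "r")
pairs_tbl <- pairs_tbl[!is.na(pairs_tbl$r), ]
pairs_tbl <- pairs_tbl[order(-abs(pairs_tbl$r)), ]

# The five most strongly related pairs, positive or negative.
head(pairs_tbl, 5)
```

Sorting by absolute value keeps strong negative relationships, such as the one between mpg and wt, near the top alongside the positive ones.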
  16. Does it correlate? Heatmap via corrplot library
      >> library(corrplot)
      >> col <- colorRampPalette(c("blue", "white", "red"))(20)
      >> heatmap(x = res, col = col, symm = TRUE)  # heatmap() and colorRampPalette() are base R functions; res is the correlation matrix from the previous slide
  17. Or even deeper insight…
      >> require(graphics)
      >> pairs(mtcars2, main = "mtcars2 data", gap = 1/4)
      >> coplot(kml ~ disp_l | as.factor(cyl), data = mtcars2, panel = panel.smooth, rows = 1)
      >> ## possibly more meaningful, e.g., for summary() or bivariate plots:
      >> mtcars2 <- within(mtcars2, {
             vs <- factor(vs, labels = c("V", "S"))
             am <- factor(am, labels = c("automatic", "manual"))
             cyl <- ordered(cyl)
             gear <- ordered(gear)
             carb <- ordered(carb)
         })
      >> summary(mtcars2)
  18. Or even deeper insight… library(PerformanceAnalytics)
      >> install.packages("PerformanceAnalytics")
      >> library(PerformanceAnalytics)
      >> chart.Correlation(mtcars, histogram=TRUE, pch=19)
      >> mtcars_small <- mtcars[,1:4]
      >> chart.Correlation(mtcars_small, histogram=TRUE, pch=19)
  19. Bonus: anomaly detection, AnomalyDetectionTs()
      The input is expected to be a time series or a vector spanning at least two periods. Made by Twitter.
  20. What next?
      To create customizable dashboards, try Shiny.
      For Tableau-like drag-and-drop GUI visualization in R, use esquisse.
  21. Something for you: download this presentation + source code here: http://bit.ly/2QybvNV
  22. Stay in touch
      Instarea s.r.o., 29. Augusta 36/A, 811 09 Bratislava
      www.instarea.com
      Martin Bago, Data Scientist, Instarea
      martin.bago@instarea.com
      +421 905 255 852
      https://www.linkedin.com/in/martinbago/
      Thank you!
