Data Profiling with R

R is a free, open source, flexible, and powerful tool that isn't scary! Even if you have no background in stats, you can use it to learn more about your datasets. By profiling your data at the start of a project, you can find problems in it before you embark on a data warehouse project. This will cut down development time and increase user confidence in your data.

Published in: Data & Analytics

  1. Want to follow along with this session using R? Download the script and data from the session scheduler. Also download R and RStudio. It’s easy to follow along!
  2. © 2016 RED PILL Analytics
  3. Using R for Data Profiling
     Michelle Kolbe, michelle.kolbe@redpillanalytics.com, @mekolbe, medium.com/@datacheesehead, linkedin.com/in/michellekolbe
     © 2016 RED PILL Analytics, www.RedPillAnalytics.com, info@RedPillAnalytics.com, @RedPillA
  4. Do you have a data quality problem?
  5. What to Check for?
     • Accuracy
     • Consistency
     • Completeness
     • Uniqueness
     • Distribution
     • Range
  6. Why Profile Your Data?
  7. Benefits
     • Trust in data
     • Find problems in advance
     • Shorten development time on projects
     • Improve understanding of data & business knowledge
  8. Why R?
  9. Why R?
     • Free!
     • Easy to use
     • Flexible
     • Powerful analytics
     • Great community!
  10. Getting Started in R
  11. What is R?
     • A programming environment
     • Fairly simple to use & understand
     • Allows a user to manipulate & analyze data
     • Open source
     • Real power comes from the packages you can install from a LARGE community
     • Easy to learn with a programming background
     • Con: Memory management & speed vs C++ or Python
  12. Tools for R
     • First download R from r-project.org
     • Then download RStudio, the best R IDE
  13. R Basics
     • Case sensitive
     • <- assigns to a variable
     • # begins a comment
     • ??<keyword> will search R documentation for help
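The basics on this slide can be tried in a few lines. This is a minimal sketch; the variable names and values are made up for illustration and are not from the session script.

```r
# R is case sensitive: `yards` and `Yards` are two different objects.
yards <- 218    # `<-` assigns a value to a variable
Yards <- 5177

# The two names hold different values, so this prints FALSE.
print(yards == Yards)

# `??` searches the installed documentation by keyword.
# It opens the help viewer, so it is left commented out here:
# ??"standard deviation"
```
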
  14. Using Packages
     • First install
         install.packages("<package name>")
     • Once installed, load the package
         library("<package name>")
     • Note that every time you open R you’ll need to load the packages you’ll be using
     • You’ll see your packages that are installed and loaded in RStudio
  15. Connecting to Data in R
     • Data should be read into R and stored in an object
     • Easiest with CSV
     • Can read datasets from a URL or from a file on a drive
         d <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
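To try read.csv without a network connection or a file on disk, the same call can parse CSV text held in a string. This is a sketch under that assumption; the player names, columns, and values are hypothetical and only stand in for the session's dataset.

```r
# Hypothetical sample standing in for a CSV file on disk or at a URL.
csv_text <- "Name,Position,Age,Yards
Aaron,QB,32,3821
Jordy,WR,30,1257
Eddie,RB,25,
Randall,WR,25,812"

# read.csv() returns a data frame. `text =` replaces the file path or URL here.
d <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(d)    # column names and types
head(d)   # first few rows
```

The empty Yards field on Eddie's row is read as NA because the rest of the column is numeric.
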
  16. Connecting to Oracle
     • RODBC
     • Load package in R
         library(RODBC)
     • View available data sources
         odbcDataSources()
     • Can read tables and send SQL queries
         # "Oracle Sample" is the ODBC connection name
         con <- odbcConnect("Oracle Sample", uid="system", pwd="oracle")
         d <- sqlQuery(con, "select sysdate from dual")
  17. Connecting to Oracle
     • RJDBC
     • Load package
         library(RJDBC)
     • Create connection driver
         jdbcDriver <- JDBC(driverClass="oracle.jdbc.OracleDriver", classPath="lib/ojdbc6.jar")
     • Open connection
         jdbcConnection <- dbConnect(jdbcDriver, "jdbc:oracle:thin:@//database.hostname.com:port/service_name_or_sid", "username", "password")
     • Query
         dbGetQuery(jdbcConnection, "select sysdate from dual")
     • Close connection
         dbDisconnect(jdbcConnection)
  18. ROracle
     • Open source but maintained by Oracle
     • Faster: 79 times faster than RJDBC and 2.5 times faster than RODBC
     • Provides scalability and stability
  19. Variables
     • Can store data in variables using <- or =
     • Do not need to define a variable first
     • RStudio shows your variables on the right
  20. Using RStudio
  21. Our Data Set to Profile
  22. First, Load the Data into R
  23. Summarize the Data
     • summary is an R function that shows you basic details about each column in your dataset
  24. Summarize the Data
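The slide images with the actual output are not part of this transcript, so here is a minimal sketch of summary on a small, hypothetical data frame. The column names echo the ones mentioned in later slides (Position, Age, Yards), but the rows are invented.

```r
# Hypothetical stand-in for the session's dataset.
players <- data.frame(
  Name     = c("Aaron", "Jordy", "Eddie", "Randall", "Jeff"),
  Position = c("QB", "WR", "RB", "WR", "TE"),
  Age      = c(32, 30, 25, 25, 27),
  Yards    = c(3821, 1257, 703, 812, NA)
)

# One call profiles every column: quartiles and mean for numeric columns,
# length and type for character columns, plus a count of NA values.
print(summary(players))

# summary() also works on a single column.
yards_summary <- summary(players$Yards)
print(yards_summary)
```
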
  25. Filter the dataset
     • Use function nesting to get a subset of data in the summary
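Function nesting here means passing the result of a filter straight into summary. The sketch below assumes the filter is done with subset; the slide's exact call is not in the transcript, and the data frame is hypothetical.

```r
# Hypothetical stand-in for the session's dataset.
players <- data.frame(
  Name     = c("Aaron", "Jordy", "Eddie", "Randall"),
  Position = c("QB", "WR", "RB", "WR"),
  Yards    = c(3821, 1257, 703, 812)
)

# subset() keeps only the wide receivers, and summary() profiles the result.
wr_summary <- summary(subset(players, Position == "WR")$Yards)
print(wr_summary)

# The same filter written with bracket indexing instead of subset().
print(summary(players[players$Position == "WR", "Yards"]))
```
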
  26. Bad Data?
     • If the mean is 218 for Yards, is it possible to have a max of 5177, or is this bad data?
  27. Group Data by Position
     • Here we are grouping with the by function and getting the mean of 4 columns
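A sketch of grouping with by follows. The four columns averaged here (Age, Games, Yards, TD) are assumptions, since the transcript does not say which four columns the slide used, and the rows are invented.

```r
# Hypothetical stand-in for the session's dataset.
players <- data.frame(
  Position = c("QB", "WR", "RB", "WR", "RB"),
  Age      = c(32, 30, 25, 25, 27),
  Games    = c(16, 16, 12, 14, 10),
  Yards    = c(3821, 1257, 703, 812, 388),
  TD       = c(31, 8, 3, 6, 2)
)

# by() splits the rows by Position and applies colMeans to each piece.
means_by_position <- by(
  players[, c("Age", "Games", "Yards", "TD")],
  players$Position,
  colMeans
)
print(means_by_position)

# aggregate() computes the same means and returns a regular data frame.
agg <- aggregate(cbind(Age, Games, Yards, TD) ~ Position, data = players, FUN = mean)
print(agg)
```

by prints one block per group, which is convenient for reading; aggregate is easier to reuse because its result is a data frame.
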
  28. Visualizing Data
  29. Grammar of Graphics Package
     • ggplot2 provides many graphing and charting capabilities with R
     • Based on The Grammar of Graphics by Leland Wilkinson
  30. Bar Chart
     • Let’s view our distribution by Age. Since this is basically discrete data, we’ll use a bar chart.
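A bar chart of counts per Age value might look like the sketch below. It assumes ggplot2 is installed; the ages are hypothetical and the labels are my own.

```r
library(ggplot2)

# Hypothetical ages standing in for the session's dataset.
players <- data.frame(Age = c(22, 23, 23, 24, 24, 24, 25, 25, 27, 30, 32))

# geom_bar() counts the rows at each Age value and draws one bar per value.
age_chart <- ggplot(players, aes(x = Age)) +
  geom_bar() +
  labs(title = "Players by Age", x = "Age", y = "Number of players")

print(age_chart)
```
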
  31. Histogram
     • Our data imported into R with factors for some metrics
     • Change them to integers by converting to a matrix and then back to a data frame
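The sketch below shows a different, commonly used conversion than the matrix round trip named on the slide: converting the factor to character first and then to integer. It also shows why calling as.integer directly on a factor gives misleading numbers. The values are hypothetical.

```r
# A numeric column that was imported as a factor, for example because
# one stray text value ("n/a") forced the whole column to be non-numeric.
yards <- factor(c("218", "5177", "0", "n/a"))

# as.integer() on a factor returns the internal level codes, not the values.
codes <- as.integer(yards)

# Converting to character first recovers the real numbers.
# Text that cannot be parsed becomes NA (the warning is suppressed here).
values <- suppressWarnings(as.integer(as.character(yards)))

print(codes)
print(values)
```
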
  32. Histogram
  33. Histogram
  34. Histogram with Some Data Cleanup
     • Removed low values
  35. Distribution
     • Density charts are thought to be superior to histograms because you do not need to be concerned with bins
  36. Distribution with 0 value data back in
  37. Quick Clean Up
     • rm removes a variable or dataset
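A short sketch of rm in use. The object name is hypothetical.

```r
# A temporary object created during cleanup.
low_values <- c(0, 1, 2)

# rm() deletes the object from the workspace.
rm(low_values)

# exists() confirms it is gone.
print(exists("low_values"))
```
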
  38. Group the Chart by a Dimension
     • We can add a “facet wrap” to group our charts by a dimension
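In ggplot2 a facet wrap is added with facet_wrap. The sketch below assumes a histogram of Yards split by Position; the chart type and columns on the actual slide are not in the transcript, and the data is invented.

```r
library(ggplot2)

# Hypothetical stand-in for the session's dataset.
players <- data.frame(
  Position = rep(c("QB", "RB", "WR"), each = 4),
  Yards    = c(3821, 4100, 2950, 3300,
               703, 388, 1100, 950,
               1257, 812, 640, 1020)
)

# facet_wrap() draws one small histogram per Position value.
yards_by_position <- ggplot(players, aes(x = Yards)) +
  geom_histogram(bins = 5) +
  facet_wrap(~ Position)

print(yards_by_position)
```
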
  39. Distributions for Categorical Data
     • Can get a count of how many records exist for each value in a table format
  40. Distribution for 2 data points
     • Can change this to a 2-way cross tab distribution
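Both distributions can be produced with the base R table function, which is assumed here to be what the slides used. The Position and Team columns and their values are hypothetical.

```r
# Hypothetical stand-in for the session's dataset.
players <- data.frame(
  Position = c("QB", "WR", "WR", "RB", "WR", "RB"),
  Team     = c("GB", "GB", "GB", "GB", "CHI", "CHI")
)

# One argument: count of records for each value.
position_counts <- table(players$Position)
print(position_counts)

# Two arguments: a 2-way cross tab of counts.
position_by_team <- table(players$Position, players$Team)
print(position_by_team)
```
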
  41. Boxplot
  42. Scatterplot
  44. Scatterplot with Regression
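A scatterplot with a fitted regression line can be sketched in ggplot2 with geom_point and geom_smooth. The choice of a linear model and the Games and Yards columns are assumptions; the slide's chart is not in the transcript.

```r
library(ggplot2)

# Hypothetical stand-in for the session's dataset.
players <- data.frame(
  Games = c(4, 8, 10, 12, 14, 16),
  Yards = c(210, 388, 560, 703, 812, 1257)
)

# geom_point() draws the scatterplot.
# geom_smooth(method = "lm") overlays a linear regression line.
yards_vs_games <- ggplot(players, aes(x = Games, y = Yards)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(yards_vs_games)

# The same linear model fitted directly, to inspect the coefficients.
fit <- lm(Yards ~ Games, data = players)
print(coef(fit))
```
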
  45. Line Chart
  46. Add a Bar Chart to the Line
  47. Stacked Bars are Rarely Helpful
  48. What about Text fields?
  49. Word cloud
  50. Missing Data
  51. Null vs NA in R
     R treats NA the way other languages treat NULL.
     • Definition
         NULL: the null object, a reserved word
         NA: a logical constant of length 1 containing a missing value indicator
     • Behavior in a vector
         NULL: not allowed; it won’t be saved within a vector
         NA: exists and represents a missing value
     • Behavior in a list (such as a data frame)
         NULL: can exist if the list was created with it, but assigning NULL to an existing element removes that element
         NA: exists and represents a missing value
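The difference between NULL and NA is easy to check at the console. A minimal sketch with made-up values:

```r
# In a vector, NULL is dropped and NA is kept as a missing value.
v <- c(1, NULL, NA, 4)
print(length(v))   # 3
print(is.na(v))

# A list created with a NULL element keeps that element.
l <- list(a = 1, b = NULL, c = NA)
print(length(l))   # 3

# Assigning NULL to an existing list element removes it.
l$a <- NULL
print(length(l))   # 2
print(names(l))
```
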
  52. Nulls on Import
     • Our dataset had nulls in it when we pulled it into R. How were they assigned?
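How nulls arrive depends on how the source file spells them and on the na.strings argument of read.csv. The sketch below uses a hypothetical CSV in which missing values appear both as empty fields and as the text NULL.

```r
# Hypothetical CSV with two spellings of "missing": an empty field and the text NULL.
csv_text <- "Name,Position,Yards
Aaron,QB,3821
Jeff,,NULL
Eddie,RB,"

# By default only the string "NA" is treated as missing, so the text NULL
# forces Yards to be read as a character column.
default_import <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(default_import)

# Listing every spelling in na.strings turns them all into NA on import.
clean_import <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                         na.strings = c("", "NA", "NULL"))
str(clean_import)
```
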
  53. Finding Missing Data
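One way to find missing data, sketched with base R functions on a hypothetical data frame: is.na flags each missing cell, colSums counts them per column, and complete.cases picks out the affected rows.

```r
# Hypothetical stand-in for the session's dataset.
players <- data.frame(
  Name     = c("Aaron", "Jeff", "Eddie", "Randall"),
  Position = c("QB", NA, "RB", "WR"),
  Yards    = c(3821, NA, 703, NA),
  stringsAsFactors = FALSE
)

# Count of missing values in each column.
missing_per_column <- colSums(is.na(players))
print(missing_per_column)

# Rows that have at least one missing value.
incomplete_rows <- players[!complete.cases(players), ]
print(incomplete_rows)
```
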
  54. But look what else we found in Jeff’s records!
  55. Make Missing Data Consistent in R
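A sketch of making missing data consistent after import, assuming the inconsistent markers are empty strings, the text NULL, and the text N/A. The markers actually found in the session's data are not in the transcript, and the rows below are invented.

```r
# Hypothetical data frame with several spellings of "missing".
players <- data.frame(
  Name     = c("Aaron", "Jeff", "Eddie", "Randall"),
  Position = c("QB", "", "RB", "NULL"),
  Yards    = c("3821", "N/A", "703", ""),
  stringsAsFactors = FALSE
)

# Every spelling that should count as missing (an assumption for this sketch).
missing_markers <- c("", "NULL", "N/A")

# Replace each marker with NA in every column, keeping the data frame structure.
players[] <- lapply(players, function(column) {
  column[column %in% missing_markers] <- NA
  column
})

# With the markers gone, Yards can be converted to integers.
players$Yards <- as.integer(players$Yards)

print(players)
print(colSums(is.na(players)))
```
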
  56. Check the whole dataset now
  57. What to do about missing & bad data?
  58. Handling Bad Data in ETL
     • Reject
     • Clean & Fill In
     • Load As Is
  59. Using Data Quality Package
  60. DataQualityR Package
  61. Numerical Results
  62. Categorical Results
  63. In Summary
     • R gives you a quick and easy way to learn about your data before investing time into ETL
     • Open source means no investment into tools
     • R isn’t scary or all statistical and stuff!
  66. Using R for Data Profiling, Session #1805
     Michelle Kolbe
     Fill out a session survey in the mobile app!
