
Data Profiling with R

R is a free, open source, flexible, and powerful tool that isn't scary! Even if you have no background in stats, you can use it to learn more about your datasets. By profiling your data at the start of a project, you can find problems in the data before you embark on a data warehouse project. This cuts development time and increases user confidence in your data.

Published in: Data & Analytics

Data Profiling with R

  1. Want to follow along with this session using R? Download the script and data from the session scheduler. Also download R and RStudio. It’s easy to follow along!
  2. © 2016 RED PILL Analytics Text Here
  3. Using R for Data Profiling
     Michelle Kolbe
     medium.com/@datacheesehead @mekolbe linkedin.com/in/michellekolbe michelle.kolbe@redpillanalytics.com
     www.RedPillAnalytics.com info@RedPillAnalytics.com @RedPillA © 2016 RED PILL Analytics
  4. Do you have a data quality problem?
  5. What to Check for?
     • Accuracy
     • Consistency
     • Completeness
     • Uniqueness
     • Distribution
     • Range
  6. Why Profile Your Data?
  7. Benefits
     • Trust in data
     • Find problems in advance
     • Shorten development time on projects
     • Improve understanding of data & business knowledge
  8. Why R?
  9. Why R?
     • Free!
     • Easy to use
     • Flexible
     • Powerful analytics
     • Great community!
  10. Getting Started in R
  11. What is R?
      • A programming environment
      • Fairly simple to use & understand
      • Allows a user to manipulate & analyze data
      • Open source
      • Real power comes from available packages you can install from LARGE community
      • Easy to learn with programming background
      • Con: Memory management & speed vs C++ or Python
  12. Tools for R
      • First download R from r-project.org
      • Then download RStudio, the best R IDE
  13. R Basics
      • Case sensitive
      • <- assigns to a variable
      • # begins a comment
      • ??<keyword> will search R documentation for help
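A minimal sketch of the basics on the slide above. The variable names and values are made up for illustration and are not taken from the session script.

```r
# R is case sensitive: `yards` and `Yards` are two different names.
yards <- 218     # `<-` assigns the value on the right to the name on the left
Yards <- 5177    # a separate variable, because the capital Y makes a new name

# Everything after `#` on a line is a comment and is ignored by R.
total <- yards + Yards
print(total)

# In an interactive session, ??histogram would search the installed
# documentation for the keyword "histogram". It is left as a comment here
# so the script does not open a help browser.
```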
  14. Using Packages
      • First install
          install.packages("<package name>")
      • Once installed, load the package
          library("<package name>")
      • Note that every time you open R you’ll need to load the packages you’ll be using
      • You’ll see which packages are installed and loaded in RStudio
  15. Connecting to Data in R
      • Data should be read into R and stored in an object
      • Easiest with CSV
      • Can read datasets from a URL or from a local drive
          d <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
  16. Connecting to Oracle
      • RODBC
        • Load package in R
            library(RODBC)
        • View available data sources
            odbcDataSources()
        • Can read tables and send SQL queries
            # "Oracle Sample" is the ODBC connection name
            con <- odbcConnect("Oracle Sample", uid="system", pwd="oracle")
            d <- sqlQuery(con, "select sysdate from dual")
  17. Connecting to Oracle
      • RJDBC
        • Load Package
            library(RJDBC)
        • Create connection driver
            jdbcDriver <- JDBC(driverClass="oracle.jdbc.OracleDriver", classPath="lib/ojdbc6.jar")
        • Open Connection
            jdbcConnection <- dbConnect(jdbcDriver, "jdbc:oracle:thin:@//database.hostname.com:port/service_name_or_sid", "username", "password")
        • Query
            dbGetQuery(jdbcConnection, "select sysdate from dual")
        • Close Connection
            dbDisconnect(jdbcConnection)
  18. ROracle
      • Open Source but maintained by Oracle
      • Faster: 79 times faster than RJDBC and 2.5 times faster than RODBC
      • Provides scalability and stability
  19. Variables
      • Can store data in variables using <- or =
      • Do not need to define variable first
      • RStudio shows your variables on the right
  20. Using R Studio
  21. Our Data Set to Profile
  22. First, Load the Data into R
  23. Summarize the Data
      • Summary is an R function to show you basic details about each column in your dataset
  24. Summarize the Data
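The slide's screenshot did not survive in this transcript, so here is a minimal sketch of what a summary call looks like. The `players` data frame, its column names and its values are hypothetical stand-ins for the session's data set.

```r
# Hypothetical stand-in for the session's data set; names and values are made up.
players <- data.frame(
  Name     = c("Jeff", "Aaron", "Eddie", "Randall", "Mason"),
  Position = c("WR", "QB", "RB", "WR", "K"),
  Age      = c(30, 32, 25, 23, 31),
  Yards    = c(0, 5177, 1139, 746, 0),
  stringsAsFactors = FALSE
)

# summary() reports min, quartiles, median, mean and max for numeric columns,
# and length/class/mode for character columns.
profile <- summary(players)
print(profile)

# str() complements summary() by listing each column's type and first values.
str(players)
```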
  25. Filter the dataset
      • Use Function Nesting to get a subset of data in the summary
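One way function nesting might look, assuming a hypothetical `players` data frame with `Position` and `Yards` columns. The session's actual filter is not shown in the transcript.

```r
# Made-up data for illustration only.
players <- data.frame(
  Position = c("WR", "QB", "RB", "WR", "K"),
  Yards    = c(0, 5177, 1139, 746, 0)
)

# Function nesting: subset() runs first and its result is passed to summary(),
# so the summary covers only the wide receivers.
wr_summary <- summary(subset(players, Position == "WR")$Yards)
print(wr_summary)
```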
  26. Bad Data?
      • If the Mean is 218 for Yards, is it possible to have a max of 5177 or is this bad data?
  27. Group Data by Position
      • Here we are grouping with the by function and getting the mean of 4 columns
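A sketch of grouping with `by()` and averaging four columns. The four column names chosen here (`Age`, `Games`, `Yards`, `Touchdowns`) are assumptions, since the slide does not name them.

```r
# Made-up data; the four numeric columns are hypothetical.
players <- data.frame(
  Position   = c("WR", "WR", "RB", "RB"),
  Age        = c(30, 24, 25, 27),
  Games      = c(16, 12, 16, 10),
  Yards      = c(1200, 600, 1100, 500),
  Touchdowns = c(8, 4, 10, 2)
)

# by() splits the selected columns by Position and applies colMeans to each piece.
means_by_position <- by(
  players[, c("Age", "Games", "Yards", "Touchdowns")],
  players$Position,
  colMeans
)
print(means_by_position)
```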
  28. Visualizing Data
  29. Grammar of Graphics Package
      • ggplot2 provides many graphing and charting capabilities with R
      • Based on Grammar of Graphics by Leland Wilkinson
  30. Bar Chart
      • Let’s view our distribution by Age. Since this is basically discrete data, we’ll use a Bar Chart.
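A minimal ggplot2 sketch of a bar chart by age. It assumes ggplot2 is installed, and the `players` data is invented for illustration.

```r
library(ggplot2)  # assumes ggplot2 is installed: install.packages("ggplot2")

# Made-up ages for illustration.
players <- data.frame(Age = c(22, 23, 23, 25, 25, 25, 30, 31, 31))

# geom_bar() counts the rows for each distinct Age and draws one bar per value.
# factor(Age) treats age as discrete so only ages that occur get a bar.
age_chart <- ggplot(players, aes(x = factor(Age))) +
  geom_bar() +
  labs(x = "Age", y = "Number of players")

print(age_chart)
```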
  31. Histogram
      • Our data imported into R with Factors for some metrics
      • Change to Int by converting to a matrix then back to data frame
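The slide's own conversion code is not in the transcript. As an alternative to the matrix round trip it describes, this sketch converts a single factor column through character, which also shows why a direct `as.integer()` call is a trap. The values are made up.

```r
# A metric that was imported as a factor (hypothetical values).
yards <- factor(c("746", "0", "1139", "5177"))

# Pitfall: as.integer() on a factor returns the internal level codes,
# not the numbers the labels spell out.
codes <- as.integer(yards)

# Converting through character first recovers the real values.
yards_int <- as.integer(as.character(yards))

print(codes)
print(yards_int)
```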
  32. Histogram
  33. Histogram
  34. Histogram with Some Data Cleanup
      • Removed low values
  35. Distribution
      • Density charts are thought to be superior to histograms because you do not need to be concerned with bins
  36. Distribution with 0 value data back in
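A sketch comparing the two chart types on the same column, assuming ggplot2 is installed. The `Yards` values are randomly generated, not the session's data. Note that a density curve has no bins, although its smoothing bandwidth still shapes the picture.

```r
library(ggplot2)  # assumes ggplot2 is installed

set.seed(42)  # reproducible made-up values
players <- data.frame(Yards = round(rexp(200, rate = 1 / 400)))

# Histogram: the picture depends on the bin width you pick.
hist_chart <- ggplot(players, aes(x = Yards)) +
  geom_histogram(binwidth = 100)

# Density: no bins to choose; the curve is a smoothed estimate of the distribution.
density_chart <- ggplot(players, aes(x = Yards)) +
  geom_density()

print(hist_chart)
print(density_chart)
```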
  37. Quick Clean Up
      • rm removes a variable or dataset
  38. Group the Chart by a Dimension
      • We can add a “facet wrap” to group our charts by a dimension
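A sketch of a facet wrap, assuming ggplot2 is installed and using `Position` as the grouping dimension. The data and the choice of dimension are illustrative assumptions.

```r
library(ggplot2)  # assumes ggplot2 is installed

# Made-up data: four players at each of three positions.
players <- data.frame(
  Position = rep(c("QB", "RB", "WR"), each = 4),
  Yards    = c(3200, 4100, 2800, 5177, 900, 1139, 400, 650, 746, 1200, 300, 980)
)

# facet_wrap(~ Position) draws one small histogram panel per Position value.
faceted <- ggplot(players, aes(x = Yards)) +
  geom_histogram(bins = 5) +
  facet_wrap(~ Position)

print(faceted)
```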
  39. Distributions for Categorical Data
      • Can get a count of how many records exist for each value in a table format
  40. Distribution for 2 data points
      • Can change this to a 2 way cross tab distribution
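Both distributions can be sketched with base R's `table()`. The `Position` and `Team` columns and their values are hypothetical.

```r
# Made-up categorical data for illustration.
players <- data.frame(
  Position = c("WR", "QB", "RB", "WR", "WR", "RB"),
  Team     = c("GB", "GB", "GB", "MIN", "MIN", "MIN")
)

# One-way distribution: how many records exist for each Position.
position_counts <- table(players$Position)
print(position_counts)

# Two-way cross tab: Position counts broken out by Team.
position_by_team <- table(players$Position, players$Team)
print(position_by_team)
```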
  41. Boxplot
  42. Scatterplot
  43.
  44. Scatterplot with Regression
  45. Line Chart
  46. Add a Bar Chart to the Line
  47. Stacked Bars are Rarely Helpful
  48. What about Text fields?
  49. Word cloud
  50. Missing Data
  51. Null vs NA in R
      R treats NA the way other languages treat NULL
      • Definition
          NULL: Null object, a reserved word
          NA: Logical constant of length 1 containing a missing value indicator
      • Behavior in Vector
          NULL: Not allowed. Won’t save within vector.
          NA: Exists and represents missing value.
      • Behavior in List (such as Data Frame)
          NULL: Can exist if not assigned but created with it.
          NA: Exists and represents missing value.
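The behaviors compared above can be checked directly in base R. This is a small sketch with made-up values.

```r
# In a vector, NULL is dropped when the vector is built; NA keeps its slot.
with_null <- c(1, NULL, 3)
with_na   <- c(1, NA, 3)
print(length(with_null))
print(length(with_na))

# NA propagates through calculations unless it is explicitly removed.
mean_with_na    <- mean(with_na)
mean_without_na <- mean(with_na, na.rm = TRUE)

# In a list, NULL survives when the list is created with it ...
from_list <- list(a = 1, b = NULL)
length_created <- length(from_list)

# ... but assigning NULL to an existing element removes that element.
from_list$a <- NULL
length_after_assign <- length(from_list)
```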
  52. Nulls on Import
      Our dataset had nulls in it when we pulled it into R. How were they assigned?
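The slide's answer is not in the transcript. As a sketch of how `read.csv()` generally treats empty fields, the example below imports a small made-up CSV twice: once with defaults and once with `""` listed in `na.strings`.

```r
# Made-up CSV with an empty numeric field, an empty text field and a literal NA.
csv_text <- "Name,Position,Yards
Jeff,WR,
Aaron,,5177
Eddie,RB,NA"

# Default import: an empty numeric field becomes NA, an empty text field stays "".
default_import <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(default_import)

# Listing "" in na.strings makes empty text fields NA as well.
strict_import <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                          na.strings = c("", "NA"))
print(strict_import)
```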
  53. Finding Missing Data
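The slide's screenshot is missing from the transcript. One common way to locate missing values, sketched on a made-up `players` data frame, is to combine `is.na()` with `colSums()` and `complete.cases()`.

```r
# Made-up data with missing values in two columns.
players <- data.frame(
  Name  = c("Jeff", "Aaron", "Eddie"),
  Age   = c(30, NA, 25),
  Yards = c(NA, 5177, NA)
)

# is.na() returns TRUE for each missing cell; colSums() counts them per column.
missing_per_column <- colSums(is.na(players))
print(missing_per_column)

# complete.cases() is TRUE for rows with no NA, so negating it keeps
# only the rows that have at least one missing value.
rows_with_missing <- players[!complete.cases(players), ]
print(rows_with_missing)
```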
  54. But look what else we found in Jeff’s records!
  55. Make Missing Data Consistent in R
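What the session found in the records is not shown in the transcript. This sketch assumes the inconsistency was a mix of placeholder strings and maps them all to `NA`. The `College` column and the placeholder list are hypothetical.

```r
# Made-up data where "missing" is spelled three different ways.
players <- data.frame(
  Name    = c("Jeff", "Aaron", "Eddie", "Mason"),
  College = c("", "California", "N/A", "null"),
  stringsAsFactors = FALSE
)

# Hypothetical list of placeholder strings that should all count as missing.
placeholders <- c("", "N/A", "null")

# Replace every placeholder with NA so is.na() finds all of them.
players$College[players$College %in% placeholders] <- NA
print(players)
```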
  56. Check the whole dataset now
  57. What to do about missing & bad data?
  58. Handling Bad Data in ETL
      • Reject
      • Clean & Fill In
      • Load As Is
  59. Using Data Quality Package
  60. DataQualityR Package
  61. Numerical Results
  62. Categorical Results
  63. In Summary
      • R gives you a quick and easy way to learn about your data before investing time into ETL
      • Open source means no investment into tools
      • R isn’t scary or all statistical and stuff!
  64. Text Here
  65.
  66. Using R for Data Profiling
      Session #1805
      Michelle Kolbe
      medium.com/@datacheesehead @mekolbe linkedin.com/in/michellekolbe michelle.kolbe@redpillanalytics.com
      Fill out a session survey in the mobile app!!