Successfully reported this slideshow.
Your SlideShare is downloading. ×

Data Exploration in R.pptx

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Upcoming SlideShare
Decision Tree.pptx
Decision Tree.pptx
Loading in …3
×

Check these out next

1 of 17 Ad

More Related Content

More from Ramakrishna Reddy Bijjam (16)

Recently uploaded (20)

Advertisement

Data Exploration in R.pptx

  1. 1. Working with directory • Before writing a program in R important to find directory to load all list of file in the system • This can be done by using getwd() without pass any arguments • If you want to change directory then setwd(path). • It help to you to reset the current working directory to another location • List.files() helps to you to give information about your files • Dir() is equalent to list.files()
  2. 2. Data Exploration in R • Data Exploration is a statistical approach or technique for analyzing data sets in order to summarize their important and main characteristics generally by using some visual aids. The EDA approach can be used to gather knowledge about the following aspects of data: • Main characteristics or features of the data. • The variables and their relationships. • Finding out the important variables that can be used in our problem.
  3. 3. • EDA is an iterative approach that includes: • Generating questions about our data • Searching for the answers by using visualization, transformation, and modeling of our data. • Using the lessons that we learn in order to refine our set of questions or to generate a new set of questions. • Exploratory Data Analysis in R • In R Language, we are going to perform EDA under two broad classifications: • Descriptive Statistics, which includes mean, median, mode, inter-quartile range, and so on. • Graphical Methods, which includes histogram, density estimation, box plots, and so on.
  4. 4. • Summary() • It includes functions like min,Max,median,mean… • Str() • Displays the internal structure of dataset • View() • Displays the given dataset in separate spread sheet • Head() • Displays first 6 rows of data • Tail() • Displays last 6 rows of data • Ncol() • It returns the number of columns in the data set
  5. 5. • Nrows() • It returns the number of rows in the data set • Edit() • It is used to dynamic editing or data manipulation of dataset • Fix() • It is used to saves the changes in the dataset itself • Data() • List out the available data sets • Image() • Save.image() writes the external representation of R objects to the specific file
  6. 6. • dim(iris)// Dimentions • names(iris)// The attributes • str(iris) // Structure is revealed • attributes(iris)//The names, class etc • iris[1:5] // the first 5 • Head(iris)//first six . tail(iris)// Last Six entries • idx<-sample(1:nrow(iris),5) 5 random values from the dataset • Iris[1:10,”Sepal.Length”]//10 values • Iris(idx) • Summary(iris) • Quantile(iris$Sepal.Length)//% disrtibution • Quantile(iris$Sepal.Length,c(0.1,0.3,0.65)) • Var(iris$Sepal.Length • Plot(iris)
  7. 7. Commands for Data Exploration 1) Loading Example Data 2) Example 1: Print First Six Rows of Data Frame Using head() Function 3) Example 2: Return Column Names of Data Frame Using names() Function 4) Example 3: Get Number of Rows & Columns of Data Frame Using dim() Function 5) Example 4: Explore Structure of Data Frame Columns Using str() Function 6) Example 5: Calculate Descriptive Statistics Using summary() Function 7) Example 6: Count NA Values by Column Using colSums() & is.na() Functions 8) Example 7: Draw Pairs Plot of Data Frame Columns Using ggpairs() Function of GGally Package 9) Example 8: Draw Boxplots of Multiple Columns Using ggplot2 Package 10) Example 9: Draw facet_wrap Histograms of Multiple Columns Using ggplot2 Package
  8. 8. Loading Example Data • we’ll need to load some example data. In this tutorial, we’ll use the mtcars data set, which contains information about motor trend car road tests. • We can import the mtcars data set to the current R session using the data() function as shown below: • data(mtcars) # Import example data frame
  9. 9. Count NA Values by Column Using colSums() & is.na() Functions • The following R programming syntax demonstrates how to count the number of NA values in each column of a data frame. • To do this, we can apply the colSums and is.na functions: • colSums(is.na(mtcars)) # Count missing values
  10. 10. Draw Pairs Plot of Data Frame Columns Using ggpairs() Function of GGally Package • Until now, we have performed an analytical exploratory data analysis based on numbers and certain RStudio console outputs. • However, when it comes to data exploration, it is also important to have a visual look at your data. • The following R code demonstrates how to create a pairs plot using the . • For this, we need the functions of the ggplot2 and GGally packages. • By installing and loading GGally, the ggplot2 package is also imported. So it’s enough to install and load GGally: • install.packages("GGally") # Install GGally package library("GGally") # Load GGally package • Next, we can apply the ggpairs function of the GGally package to our data frame: • ggpairs(mtcars) # Draw pairs plot
  11. 11. Draw Boxplots of Multiple Columns Using ggplot2 Package • Boxplots are another popular way to visualize the columns of data sets. • To draw such a graph, we first have to manipulate our data using the tidyr package. In order to use the functions of the tidyr package, we first need to install and load tidyr to RStudio: • install.packages("tidyr") # Install & load tidyr library("tidyr") • Next, we can apply the pivot_longer function to reshape some of the columns of our data from wide to long format: • mtcars_long <- pivot_longer(mtcars, # Reshape data frame c("mpg", "disp", "hp", "qsec")) • Finally, we can apply the ggplot and geom_boxplot functions to our data to visualize each of the selected columns in a side-by-side boxplot graphic: • gplot(mtcars_long, # Draw boxplots • aes(x = value, fill = name)) + geom_boxplot()
  12. 12. Draw facet_wrap Histograms of Multiple Columns Using ggplot2 Package • Typically, we would also have a look at our numerical columns in a histogram plot. • In the following R syntax, I’m creating a histogram for each of our columns. Furthermore, I’m using the facet_wrap function to separate each column in its own plotting panel: • ggplot(mtcars_long, # Draw histograms aes(x = value)) + geom_histogram() + facet_wrap(name ~ ., scales = "free")
  13. 13. Importing Data in R Script • Importing Data in R • First, let’s consider a data-set which we can use for the demonstration. For this demonstration, we will use two examples of a single dataset, one in .csv form and another .txt • Reading a Comma-Separated Value(CSV) File • Method 1: Using read.csv() Function Read CSV Files into R • The function has two parameters:
  14. 14. • file.choose(): It opens a menu to choose a csv file from the desktop. • header: It is to indicate whether the first row of the dataset is a variable name or not. Apply T/True if the variable name is present else put F/False. • # import and store the dataset in data1 • data1 <- read.csv(file.choose(), header=T) • • # display the data • data1
  15. 15. • Using read.table() Function • This function specifies how the dataset is separated, in this case we take sep=”, “ as an argument. • Example: • R • # import and store the dataset in data2 • data2 <- read.table(file.choose(), header=T, sep=", ") • • # display data • data2
  16. 16. • Understanding datasets • A dataset is usually a rectangular array of data with rows representing observations and columns representing variables.IT provides an example of a hypothetical patient dataset. • A patient dataset • PatientID AdmDate Age Diabetes Status • 1 10/15/2009 25 type1 poor • 2. 15/12/2007 32 type2 improved

×