Introduction:
Hi everyone, this session deals with Data Analysis using the R language. Many people find it
difficult to get started with Data Analysis, and with R as well. I can assure you this will be very
helpful for beginners who are looking for a place to start.
So what is Data Analysis? By definition, it is the process of evaluating data using analytical
and logical reasoning to examine each component of the data provided. This form of analysis is
just one of the many steps that must be completed when conducting a research experiment. Data
from various sources is gathered, reviewed, and then analyzed to form some finding or
conclusion. There are a variety of specific data analysis methods, some of which include data
mining, text analytics, business intelligence, and data visualization. But in a very simple way,
we can say that data analysis is FINDING PATTERNS OR INSIGHTS IN DATA that help drive
concrete business decisions and improve the customer experience.
And R is a statistical tool used for data analysis and data science as well. R has built-in
functions and provides a wide variety of statistical (linear and nonlinear modelling, classical
statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and
is highly extensible.
Basics:
To become a Data Analyst you should be strong in the following areas:
 Statistics
 Data Mining
 Python/R
 Distributed Computing
Let's start with Statistics, which is further classified into descriptive statistics (measures of central
tendency, measures of dispersion, shape of the data), inferential statistics (inferring from the sample data what
the population might look like), and exploratory statistics (analysing datasets to summarize their main characteristics).
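As a quick illustration of the descriptive and inferential ideas above, here is a base-R sketch on a small made-up sample (the values and the hypothesised mean of 10 are arbitrary):

```r
# A small made-up numeric sample
x <- c(4, 8, 15, 16, 23, 42)

# Measures of central tendency
mean(x)    # 18
median(x)  # 15.5

# Measures of dispersion
sd(x)      # standard deviation
range(x)   # minimum and maximum

# Shape of the data: five-number summary plus the mean
summary(x)

# Inferential sketch: one-sample t-test against a hypothesised mean of 10
t.test(x, mu = 10)
```

Functions like mean(), sd(), and t.test() ship with base R, so no extra packages are needed at this stage.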
Next comes Data Mining, which includes data pre-processing (data cleaning, data transformation),
modelling, etc.
When it comes to R, I have already given an introduction above. Python is again a wonderful
programming language for Data Analysis, with many packages such as pandas, scikit-learn, and
matplotlib for visualization. R and Python are the two stars preferred by data analysts. Both have
their own strengths and weaknesses.
By Distributed Computing, I mean Hadoop technology, which is used mainly for the storage and
processing of big data. As data volume and variety keep increasing, distributed computing has
been in the limelight through the Hadoop ecosystem, commonly referred to simply as big data technology.
Nowadays Hadoop has almost become a synonym for big data.
Steps involved:
The actual session starts here. Make sure that the environment is ready. I've explained the
steps to be followed in detail.
Step-1: PROBLEM STATEMENT
You should be very clear about the problem statement you are given and what you are expected to do.
Ask yourself: what problem do you have, and is the data given sufficient to solve the given problem
statement?
Step-2: DATA PREPROCESSING
This is a very important process that a Data Analyst goes through. Initially you should
collect the required data. First set the working directory where the file is present using setwd().
You can use any of the following functions to read the file, depending on the file format:
 read.csv()
 read.table()
 read.xlsx()
 for XML do the following:
library(XML)
doc <- xmlTreeParse(fileUrl, useInternalNodes = TRUE)
Then convert the loaded data to a data frame to make manipulation easy, using
data.frame(). Next comes data cleaning: to detect missing values you can make use of
is.na(), and to remove missing values you can use na.omit() or na.exclude().
Next is data transformation. Here we have type transformation, which can be done with
as.numeric()/as.double()/as.factor(), etc. Normalization and standardization also come under
data transformation.
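Putting the cleaning and transformation functions above together, a minimal sketch on a hypothetical toy data frame (the columns and values are invented for illustration) might look like this:

```r
# Hypothetical toy data frame with one missing value and a character column
df <- data.frame(
  age    = c(25, NA, 31, 40),
  income = c("1000", "2000", "1500", "3000"),
  stringsAsFactors = FALSE
)

# Data cleaning: locate and drop missing values
is.na(df$age)          # FALSE TRUE FALSE FALSE
clean <- na.omit(df)   # drops the row containing NA

# Type transformation: character -> numeric
clean$income <- as.numeric(clean$income)

# Standardization (z-score) as one form of data transformation
clean$income_z <- (clean$income - mean(clean$income)) / sd(clean$income)
```

na.omit() removes every row with at least one NA, so inspect how much data you lose before relying on it.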
Once the preprocessing is over, 60% of the work is done.
Step-3: POPULATION AND SAMPLE
Before getting into this, load the necessary packages using library("package
name"), e.g. library("caret"), library("class"). And don't forget to initialize the seed value with
set.seed(), so your results are reproducible. Coming to the point, it is very important to split the given
dataset into training and testing data, since the training data is a sample that represents the
population. Testing data should only be used to test the model; until then you should not touch it.
The model is built only using the training data.
This can be done in many ways; here I have used createDataPartition(), a
function available in the caret package.
index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
where, y - the outcome variable
times - number of partitions to create
p - percentage of data that goes into training
list - logical - should the results be in a list (TRUE) or a matrix with the number of rows
equal to floor(p * length(y)) and times columns (FALSE). Use list = FALSE so the result can
be used directly for row indexing, as below.
training_data <- dataframe[index,]
testing_data <- dataframe[-index,]
Now the training and testing data are partitioned and we are ready to train the model.
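As a minimal sketch of the same idea in base R (caret's createDataPartition() additionally stratifies the split by the outcome), a reproducible 70/30 split of the built-in iris data might look like:

```r
set.seed(42)  # fix the seed so the split is reproducible

# Sample 70% of the row indices for training
n     <- nrow(iris)
index <- sample(seq_len(n), size = floor(0.7 * n))

training_data <- iris[index, ]   # rows selected for training
testing_data  <- iris[-index, ]  # everything else is held out

nrow(training_data)  # 105
nrow(testing_data)   # 45
```

With 150 rows in iris, floor(0.7 * 150) gives 105 training rows and 45 testing rows.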
Step-4 : DATA MODELING
To train the model we can use the train() function available in the caret package.
model_trained <- train(x, y, method = "rf", preProcess = NULL, ...)
Here y is the outcome variable and x (x1 to xn) holds the control/independent variables; note
that train() takes the predictors first and the outcome second (there is also a formula
interface, train(y ~ ., data = ...)). Besides rf (random forest) there are many other methods,
such as glm (generalized linear model). Refer to http://caret.r-forge.r-project.org/bytag.html
to learn more about the models. Each model has its own restrictions.
Step-5 : PREDICTION
Once the model has been trained we can make predictions using the function available
for this, predict().
predicted_model <- predict(model_trained, testing_data)
You can also check whether your model classifies well or not; using
confusionMatrix() we can achieve this.
check <- confusionMatrix(predicted_model, testing_data$y)
where the first argument is the vector of predictions and the second is the observed outcome
column (here called y) from the testing data. In other words, you can use this confusion matrix
to check the trained model and see how it performs on the testing data.
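As a self-contained sketch of Steps 4 and 5 using only base R (glm() is the same routine that caret's train(..., method = "glm") wraps, and table() gives a bare-bones version of what confusionMatrix() reports), here is a binary classifier on a two-class subset of the built-in iris data:

```r
set.seed(1)

# Two-class subset of iris so that logistic regression applies
binary <- iris[iris$Species != "setosa", ]
binary$Species <- droplevels(binary$Species)

# 70/30 train/test split
idx   <- sample(seq_len(nrow(binary)), size = 70)
train <- binary[idx, ]
test  <- binary[-idx, ]

# Step 4: fit a logistic regression (generalized linear model)
model <- glm(Species ~ Petal.Length + Petal.Width,
             data = train, family = binomial)

# Step 5: predict probabilities on the held-out data,
# then threshold at 0.5 to get class labels
prob <- predict(model, newdata = test, type = "response")
pred <- ifelse(prob > 0.5,
               levels(binary$Species)[2],   # "virginica"
               levels(binary$Species)[1])   # "versicolor"

# Base-R confusion matrix; caret::confusionMatrix() adds accuracy, kappa, etc.
table(predicted = pred, actual = test$Species)
```

glm() may warn about fitted probabilities near 0 or 1 here, since petal measurements separate these two species almost perfectly; that is expected on this toy data.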
Step-6 : PLOTS
The last step involves plotting. You can make use of plot(), which can produce a box plot, scatter
plot, histogram, or whatever the requirement is. As the saying goes, "a picture speaks more than a
thousand words", so make use of plots to describe your results.
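For instance, a histogram and a box plot of the built-in iris data can be written to a PDF file with base graphics (the file name here is just an example):

```r
# Open a PDF device so the plots are saved to a file
pdf("sepal_plots.pdf", width = 8, height = 4)
par(mfrow = c(1, 2))  # two panels side by side

# Histogram of one variable
hist(iris$Sepal.Length,
     main = "Histogram of sepal length",
     xlab = "Sepal length (cm)")

# Box plot of the same variable grouped by species
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal length by species",
        ylab = "Sepal length (cm)")

dev.off()  # close the device to flush the file
```

In an interactive session you can skip the pdf()/dev.off() pair and the plots will appear in the plot window instead.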
Step-7 : REPORT
Finally, for report submission you can use R Markdown, where the file should be saved
with the extension .Rmd. To use R Markdown, check for the packages that need to be
installed (e.g. rmarkdown and knitr).
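A minimal .Rmd skeleton (the title and the code chunk below are just placeholders) might look like:

````markdown
---
title: "Data Analysis Report"
output: html_document
---

## Results

```{r}
summary(iris$Sepal.Length)
```
````

Knitting this file (for example via the Knit button in RStudio, or rmarkdown::render()) runs the chunk and produces an HTML report with the code and its output.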
--Thank You--
