This hands-on R course will guide users through a variety of programming functions in the open-source statistical software program, R. Topics covered include indexing, loops, conditional branching, S3 classes, and debugging. Full workshop materials available from http://projects.iq.harvard.edu/rtc/r-prog
The goal of this workshop is to introduce fundamental capabilities of R as a tool for performing data analysis. Here, we learn about the most comprehensive statistical analysis language R, to get a basic idea how to analyze real-word data, extract patterns from data and find causality.
The R language is a project designed to create a free, open source language which can be used as a replacement for the S-PLUS language, originally developed as the S language at AT&T Bell Labs, and currently marketed by Insightful Corporation of Seattle, Washington. R is an open source implementation of S, and differs from S-plus largely in its command-line only format.
Topics Covered:
1.Introduction to R
2.Installing R
3.Why Learn R
4.The R Console
5.Basic Arithmetic and Objects
6.Program Example
7.Programming with Big Data in R
8.Big Data Strategies in R
9.Applications of R Programming
10.Companies Using R
11.What R is not so good at
12.Conclusion
Learn the basics of data visualization in R. In this module, we explore the Graphics package and learn to build basic plots in R. In addition, learn to add title, axis labels and range. Modify the color, font and font size. Add text annotations and combine multiple plots. Finally, learn how to save the plots in different formats.
This introduction to the popular ggplot2 R graphics package will show you how to create a wide variety of graphical displays in R. Data sets and additional workshop materials available at http://projects.iq.harvard.edu/rtc/event/r-graphics
This hands-on R course will guide users through a variety of programming functions in the open-source statistical software program, R. Topics covered include indexing, loops, conditional branching, S3 classes, and debugging. Full workshop materials available from http://projects.iq.harvard.edu/rtc/r-prog
The goal of this workshop is to introduce fundamental capabilities of R as a tool for performing data analysis. Here, we learn about the most comprehensive statistical analysis language R, to get a basic idea how to analyze real-word data, extract patterns from data and find causality.
The R language is a project designed to create a free, open source language which can be used as a replacement for the S-PLUS language, originally developed as the S language at AT&T Bell Labs, and currently marketed by Insightful Corporation of Seattle, Washington. R is an open source implementation of S, and differs from S-plus largely in its command-line only format.
Topics Covered:
1.Introduction to R
2.Installing R
3.Why Learn R
4.The R Console
5.Basic Arithmetic and Objects
6.Program Example
7.Programming with Big Data in R
8.Big Data Strategies in R
9.Applications of R Programming
10.Companies Using R
11.What R is not so good at
12.Conclusion
Learn the basics of data visualization in R. In this module, we explore the Graphics package and learn to build basic plots in R. In addition, learn to add title, axis labels and range. Modify the color, font and font size. Add text annotations and combine multiple plots. Finally, learn how to save the plots in different formats.
This introduction to the popular ggplot2 R graphics package will show you how to create a wide variety of graphical displays in R. Data sets and additional workshop materials available at http://projects.iq.harvard.edu/rtc/event/r-graphics
Linear Regression Algorithm | Linear Regression in R | Data Science Training ...Edureka!
This Edureka Linear Regression tutorial will help you understand all the basics of linear regression machine learning algorithm along with examples. This tutorial is ideal for both beginners as well as professionals who want to learn or brush up their Data Science concepts. Below are the topics covered in this tutorial:
1) Introduction to Machine Learning
2) What is Regression?
3) Types of Regression
4) Linear Regression Examples
5) Linear Regression Use Cases
6) Demo in R: Real Estate Use Case
You can also take a complete structured training, check out the details here: https://goo.gl/AfxwBc
Exploratory data analysis data visualization:
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
Maximize insight into a data set.
Uncover underlying structure.
Extract important variables.
Detect outliers and anomalies.
Test underlying assumptions.
Develop parsimonious models.
Determine optimal factor settings
Statistics For Data Science | Statistics Using R Programming Language | Hypot...Edureka!
( ** Data Science Certification Using R: https://www.edureka.co/data-science ** )
This Edureka tutorial on "Statistics for Data Science" talks about the basic concepts of Statistics, which is primarily an applied branch of mathematics, that attempts to make sense of observations in the real world. Statistics is generally regarded as one of the most crucial aspects of data science.
Introduction to statistics
Basic Terminology
Categories in Statistics
Descriptive Statistics
Reasons for moving to R
Descriptive Statistics in R Studio
Inferential Statistics
Inferential Statistics using R Studio
Check out our Data Science Tutorial blog series: http://bit.ly/data-science-blogs
Check out our complete Youtube playlist here: http://bit.ly/data-science-playlist
3 pillars of big data : structured data, semi structured data and unstructure...PROWEBSCRAPER
There are 3 pillars of Big Data
1.Structured data
2.Unstructured data
3.Semi structured data
Businesses worldwide construct their empire on these three pillars and capitalize on their limitless potential.
Attached here is a presentation that I made covering some bits and pieces of what I got to discover about Data Science and Machine Learning using R Programming Language.
Linear Regression Algorithm | Linear Regression in R | Data Science Training ...Edureka!
This Edureka Linear Regression tutorial will help you understand all the basics of linear regression machine learning algorithm along with examples. This tutorial is ideal for both beginners as well as professionals who want to learn or brush up their Data Science concepts. Below are the topics covered in this tutorial:
1) Introduction to Machine Learning
2) What is Regression?
3) Types of Regression
4) Linear Regression Examples
5) Linear Regression Use Cases
6) Demo in R: Real Estate Use Case
You can also take a complete structured training, check out the details here: https://goo.gl/AfxwBc
Exploratory data analysis data visualization:
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
Maximize insight into a data set.
Uncover underlying structure.
Extract important variables.
Detect outliers and anomalies.
Test underlying assumptions.
Develop parsimonious models.
Determine optimal factor settings
Statistics For Data Science | Statistics Using R Programming Language | Hypot...Edureka!
( ** Data Science Certification Using R: https://www.edureka.co/data-science ** )
This Edureka tutorial on "Statistics for Data Science" talks about the basic concepts of Statistics, which is primarily an applied branch of mathematics, that attempts to make sense of observations in the real world. Statistics is generally regarded as one of the most crucial aspects of data science.
Introduction to statistics
Basic Terminology
Categories in Statistics
Descriptive Statistics
Reasons for moving to R
Descriptive Statistics in R Studio
Inferential Statistics
Inferential Statistics using R Studio
Check out our Data Science Tutorial blog series: http://bit.ly/data-science-blogs
Check out our complete Youtube playlist here: http://bit.ly/data-science-playlist
3 pillars of big data : structured data, semi structured data and unstructure...PROWEBSCRAPER
There are 3 pillars of Big Data
1.Structured data
2.Unstructured data
3.Semi structured data
Businesses worldwide construct their empire on these three pillars and capitalize on their limitless potential.
Attached here is a presentation that I made covering some bits and pieces of what I got to discover about Data Science and Machine Learning using R Programming Language.
CuRious about R in Power BI? End to end R in Power BI for beginners Jen Stirrup
In this session, we will start R right from the beginning, from installing R through to datatransformation and integration, through to visualizing data by using R in PowerBI. Then, we will move towards powerful but simple to use datatypes in R such as data frames. We will also upgrade our data analysis skills by looking at Rdata transformation using a powerful set of tools to make things simple: the tidyverse. Then, we will look at integrating our R work into Power BI, and visualizing our data using beautiful visualizations with R and Power BI. Finally, we will share our work by publishing our Power BI project, with our R code, to the Power BI service. We will also look at refreshing our dataset so that our new dashboard has refreshed data.
This session is aimed at getting beginners up to speed as gently and quickly as possible. Join this session if you are curious about R and want to know more. If you are already a Power BI expert, join this session to open up a whole new world of Power BI to add toyour skill set. If you are new to Power BI, you will still get value from this session since you'll be able to see a Power BI dashboard being built in an end-to-end solution.
The presentation was given by Mr. Bas Kempen, ISRIC, during the GSOC Mapping Global Training hosted by ISRIC - World Soil Information, 6 - 23 June 2017, Wageningen (The Netherlands).
Rattle is Free (as in Libre) Open Source Software and the source code is available from the Bitbucket repository. We give you the freedom to review the code, use it for whatever purpose you like, and to extend it however you like, without restriction, except that if you then distribute your changes you also need to distribute your source code too.
Rattle - the R Analytical Tool To Learn Easily - is a popular GUI for data mining using R. It presents statistical and visual summaries of data, transforms data that can be readily modelled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets. One of the most important features (according to me) is that all of your interactions through the graphical user interface are captured as an R script that can be readily executed in R independently of the Rattle interface.
Rattle clocks between 10,000 and 20,000 installations per month from the RStudio CRAN node (one of over 100 nodes). Rattle has been downloaded several million times overall.
Data Science, Statistical Analysis and R... Learn what those mean, how they can help you find answers to your questions and complement the existing toolsets and processes you are currently using to make sense of data. We will explore R and the RStudio development environment, installing and using R packages, basic and essential data structures and data types, plotting graphics, manipulating data frames and how to connect R and SQL Server.
Week-3 – System RSupplemental material1Recap •.docxhelzerpatrina
Week-3 – System R
Supplemental material
1
Recap
• R - workhorse data structures
• Data frame
• List
• Matrix / Array
• Vector
• System-R – Input and output
• read() function
• read.table and read.csv
• scan() function
• typeof() function
• Setwd() function
• print()
• Factor variables
• Used in category analysis and statistical modelling
• Contains predefined set value called levels
• Descriptive statistics
• ls() – list of named objects
• str() – structure of the data and not the data itself
• summary() – provides a summary of data
• Plot() – Simple plot
2
Descriptive statistics - continued
• Summary of commands with single-value result. These commands will work on variables
containing numeric value.
• max() ---- It shows the maximum value in the vector
• min() ----- It shows the minimum value in the vector
• sum() ----- It shows the sum of all the vector elements.
• mean() ---- It shows the arithmetic mean for the entire vector
• median() – It shows the median value of the vector
• sd() – It shows the standard deviation
• var() – It show the variance
3
Descriptive statistics - single value results -
example
temp is the name of the vector
containing all numeric values
4
• log(dataset) – Shows log value for each
element.
• summary(dataset) –shows the summary
of values
• quantile() - Shows the quantiles by
default—the 0%, 25%, 50%, 75%, and
100% quantiles. It is possible to select
other quantiles also.
Descriptive statistics - multiple value results -
example
5
Descriptive Statistics in R for Data Frames
• Max(frame) – Returns the largest value in the entire data frame.
• Min(frame) – Returns the smallest value in the entire data frame.
• Sum(frame) – Returns the sum of the entire data frame.
• Fivenum(frame) – Returns the Tukey summary values for the entire
data frame.
• Length(frame)- Returns the number of columns in the data frame.
• Summary(frame) – Returns the summary for each column.
6
Descriptive Statistics in R for Data Frames -
Example
7
Descriptive Statistics in R for Data Frames –
RowMeans example
8
Descriptive Statistics in R for Data Frames –
ColMeans example
9
Graphical analysis - simple linear regression model
in R
• Logistic regression is implemented to understand if the dependent
variable is a linear function of the independent variable.
• Logistic regression is used for fitting the regression curve.
• Pre-requisite for implementing linear regression:
• Dependent variable should conform to normal distribution
• Cars dataset that is part of the R-Studio will be used as an example to
explain linear regression model.
10
Creating a simple linear model
• cars is a dataset preloaded into
System-R studio.
• head() function prints the first
few rows of the list/df
• cars dataset contains two major
columns
• X = speed (cars$speed)
• Y = dist (cars$dist)
• data() function is used to list all
the active datasets in the
environment.
• ...
The Tidyverse and the Future of the Monitoring ToolchainJohn Rauser
As delivered at Monitorama PDX 2017.
Thesis: The ideas in the tidyverse, not the tools themselves but the fundamental ideas they are built upon, those ideas are going to have profound impact on everything having to do with data manipulation and visualization or data analysis.
Introduction to Data Science, Prerequisites (tidyverse), Import Data (readr), Data Tyding (tidyr),
pivot_longer(), pivot_wider(), separate(), unite(), Data Transformation (dplyr - Grammar of Manipulation): arrange(), filter(),
select(), mutate(), summarise()m
Data Visualization (ggplot - Grammar of Graphics): Column Chart, Stacked Column Graph, Bar Graph, Line Graph, Dual Axis Chart, Area Chart, Pie Chart, Heat Map, Scatter Chart, Bubble Chart
R is a programming language and environment commonly used in statistical computing, data analytics and scientific research.
It is one of the most popular languages used by statisticians, data analysts, researchers and marketers to retrieve, clean, analyze, visualize and present data.
Due to its expressive syntax and easy-to-use interface, it has grown in popularity in recent years.
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Show drafts
volume_up
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
2. R for Data Science | Long Nguyen | Sep 20172
Why R?
• R is the most preferred programming tool for statisticians, data scientists, data analysts and data architects
• Easy to develop your own model.
• R is freely available under GNU General Public License
• R has over 10,000 packages (a lot of available algorithms) from multiple repositories.
http://www.burtchworks.com/2017/06/19/2017-sas-r-python-flash-survey-results/
3. R for Data Science | Long Nguyen | Sep 20173
R & Rstudio IDE
Go to:
• https://www.r-project.org/
• https://www.rstudio.com/products/rstudio/download/
4. R for Data Science | Long Nguyen | Sep 20174
Essentials of R Programming
• Basic computations
• Five basic classes of objects
– Character
– Numeric (Real Numbers)
– Integer (Whole Numbers)
– Complex
– Logical (True / False)
• Data types in R
– Vector: a vector contains object of same class
– List: a special type of vector which contain
elements of different data types
– Matrix: A matrix is represented by set of rows
and columns.
– Data frame: Every column of a data frame acts
like a list
2+3
sqrt(121)
myvector<- c("Time", 24, "October", TRUE, 3.33)
my_list <- list(22, "ab", TRUE, 1 + 2i)
my_list[[1]]
my_matrix <- matrix(1:6, nrow=3, ncol=2)
df <- data.frame(name = c("ash","jane","paul","mark"), score =
c(67,56,87,91))
df
name score
1 ash NA
2 jane NA
3 paul 87
4 mark 91
5. R for Data Science | Long Nguyen | Sep 20175
Essentials of R Programming
• Control structures
– If (condition){
Do something
}else{
Do something else
}
• Loop
– For loop
– While loop
• Function
– function.name <- function(arguments) {
computations on the arguments some
other code }
x <- runif(1, 0, 10)
if(x > 3) {
y <- 10
} else {
y <- 0
}
for(i in 1:10) {
print(i)
}
mySquaredFunc<-function(n){
# Compute the square of integer `n`
n*n
}
mySquaredVal(5)
6. R for Data Science | Long Nguyen | Sep 20176
Useful R Packages
• Install packages: install.packages('readr‘, ‘ggplot2’, ‘dplyr’, ‘caret’)
• Load packages: library(package_name)
Importing Data
•readr
•data.table
•Sqldf
Data Manipulation
•dplyr
•tidyr
•lubridate
•stringr
Data Visualization
•ggplot2
•plotly
Modeling
•caret
•lm,
•randomForest, rpart
•gbm, xgb
Reporting
•RMarkdown
•Shiny
7. R for Data Science | Long Nguyen | Sep 20177
Importing Data
• CSV file
mydata <- read.csv("mydata.csv") # read csv file
library(readr)
mydata <- read_csv("mydata.csv") # 10x faster
• Tab-delimited text file
mydata <- read.table("mydata.txt") # read text file
mydata <- read_table("mydata.txt")
• Excel file:
library(XLConnect)
wk <- loadWorkbook("mydata.xls")
df <- readWorksheet(wk, sheet="Sheet1")
• SAS file
library(sas7bdat)
mySASData <- read.sas7bdat("example.sas7bdat")
• Other files:
– Minitab, SPSS(foreign),
– MySQL (RMySQL)
Col1,Col2,Col3
100,a1,b1
200,a2,b2
300,a3,b3
100 a1 b1
200 a2 b2
300 a3 b3
400 a4 b4
8. R for Data Science | Long Nguyen | Sep 20178
Data Manipulation with ‘dplyr’
• Some of the key “verbs”:
– select: return a subset of the columns of
a data frame, using a flexible notation
– filter: extract a subset of rows from a
data frame based on logical conditions
– arrange: reorder rows of a data frame
– rename: rename variables in a data
frame
library(nycflights13)
flights
select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))
jan1 <- filter(flights, month == 1, day == 1)
nov_dec <- filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
arrange(flights, year, month, day)
arrange(flights, desc(arr_delay))
rename(flights, tail_num = tailnum)
9. R for Data Science | Long Nguyen | Sep 20179
Data Manipulation with ‘dplyr’
• Some of the key “verbs”:
– mutate: add new variables/columns or
transform existing variables
– summarize: generate summary statistics
of different variables in the data frame
– %>%: the “pipe” operator is used to
connect multiple verb actions together
into a pipeline
flights_sml <- select(flights, year:day, ends_with("delay"),
distance, air_time )
mutate(flights_sml, gain = arr_delay - dep_delay,
speed = distance / air_time * 60 )
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delays, count > 20, dest != "HNL")
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
10. R for Data Science | Long Nguyen | Sep 201710
library(tidyr)
tidy4a <- table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
spread(table2, key = type, value = count)
separate(table3, year, into = c("century", "year"), sep = 2)
separate(table3, rate, into = c("cases", "population"))
unite(table5, "new", century, year, sep = "")
Data Manipulation with ‘tidyr’
• Some of the key “verbs”:
– gather: takes multiple columns, and gathers
them into key-value pairs
– spread: takes two columns (key & value) and
spreads in to multiple columns
– separate: splits a single column into multiple
columns
– unite: combines multiple columns into a
single column
11. R for Data Science | Long Nguyen | Sep 201711
Data Visualization with ‘ggplot2’
• Scatter plot library(ggplot2)
ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() +
geom_smooth(method="lm")
ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state, size=popdensity)) +
geom_smooth(method="loess", se=F) +
xlim(c(0, 0.1)) +
ylim(c(0, 500000)) +
labs(subtitle="Area Vs Population",
y="Population",
x="Area",
title="Scatterplot",
caption = "Source: midwest")
12. R for Data Science | Long Nguyen | Sep 201712
Data Visualization with ‘ggplot2’
• Correlogram library(ggplot2)
library(ggcorrplot)
# Correlation matrix
data(mtcars)
corr <- round(cor(mtcars), 1)
# Plot
ggcorrplot(corr, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("tomato2", "white", "springgreen3"),
title="Correlogram of mtcars",
ggtheme=theme_bw)
13. R for Data Science | Long Nguyen | Sep 201713
Data Visualization with ‘ggplot2’
• Histogram on Categorical Variables
ggplot(mpg, aes(manufacturer)) +
geom_bar(aes(fill=class), width = 0.5) +
theme(axis.text.x = element_text(angle=65,
vjust=0.6)) +
labs(title="Histogram on Categorical Variable",
subtitle= "Manufacturer across Vehicle
Classes")
14. R for Data Science | Long Nguyen | Sep 201714
Data Visualization with ‘ggplot2’
• Density plot ggplot(mpg, aes(cty)) +
geom_density(aes(fill=factor(cyl)), alpha=0.8) +
labs(title="Density plot", subtitle="City Mileage
Grouped by Number of cylinders",
caption="Source: mpg", x="City Mileage",
fill="# Cylinders")
Other plots:
• Box plot
• Pie chart
• Time-series plot
15. R for Data Science | Long Nguyen | Sep 201715
Interactive Visualization with ‘plotly’
library(plotly)
d <- diamonds[sample(nrow(diamonds), 1000), ]
plot_ly(d, x = ~carat, y = ~price, color = ~carat,
size = ~carat, text = ~paste("Clarity: ", clarity))
Plotly library makes interactive, publication-quality graphs online. It supports line plots, scatter plots, area
charts, bar charts, error bars, box plots, histograms, heat maps, subplots, multiple-axes, and 3D charts.
16. R for Data Science | Long Nguyen | Sep 201716
Data Modeling - Linear Regression
data(mtcars)
mtcars$am = as.factor(mtcars$am)
mtcars$cyl = as.factor(mtcars$cyl)
mtcars$vs = as.factor(mtcars$vs)
mtcars$gear = as.factor(mtcars$gear)
#Dropping dependent variable
mtcars_a = subset(mtcars, select = -c(mpg))
#Identifying numeric variables
numericData <- mtcars_a[sapply(mtcars_a, is.numeric)]
#Calculating Correlation
descrCor <- cor(numericData)
# Checking Variables that are highly correlated
highlyCorrelated = findCorrelation(descrCor, cutoff=0.7)
highlyCorCol = colnames(numericData)[highlyCorrelated]
#Remove highly correlated variables and create a new dataset
dat3 = mtcars[, -which(colnames(mtcars) %in% highlyCorCol)]
#Build Linear Regression Model
fit = lm(mpg ~ ., data=dat3)
#Extracting R-squared value
summary(fit)$r.squared
library(MASS) #Stepwise Selection based on AIC
step <- stepAIC(fit, direction="both")
summary(step)
17. R for Data Science | Long Nguyen | Sep 201717
Data modeling with ‘caret’
• Loan prediction problem
• Data standardization and imputing missing values using kNN
preProcValues <- preProcess(train, method = c("knnImpute","center","scale"))
library('RANN')
train_processed <- predict(preProcValues, train)
• One-hot encoding for categorical variables
dmy <- dummyVars(" ~ .", data = train_processed,fullRank = T)
train_transformed <- data.frame(predict(dmy, newdata = train_processed))
• Prepare training and testing set
index <- createDataPartition(train_transformed$Loan_Status, p=0.75, list=FALSE)
trainSet <- train_transformed[ index,]
testSet <- train_transformed[-index,]
• Feature selection using rfe
predictors<-names(trainSet)[!names(trainSet) %in% outcomeName]
Loan_Pred_Profile <- rfe(trainSet[,predictors], trainSet[,outcomeName], rfeControl = control)
18. R for Data Science | Long Nguyen | Sep 201718
• Take top 5 variables
predictors<-c("Credit_History", "LoanAmount", "Loan_Amount_Term", "ApplicantIncome", "CoapplicantIncome")
• Train different models
model_gbm<-train(trainSet[,predictors],trainSet[,outcomeName],method='gbm')
model_rf<-train(trainSet[,predictors],trainSet[,outcomeName],method='rf')
model_nnet<-train(trainSet[,predictors],trainSet[,outcomeName],method='nnet')
model_glm<-train(trainSet[,predictors],trainSet[,outcomeName],method='glm')
• Variable important
plot(varImp(object=model_gbm),main="GBM - Variable Importance")
plot(varImp(object=model_rf),main="RF - Variable Importance")
plot(varImp(object=model_nnet),main="NNET - Variable Importance")
plot(varImp(object=model_glm),main="GLM - Variable Importance")
• Prediction
predictions<-predict.train(object=model_gbm,testSet[,predictors],type="raw")
confusionMatrix(predictions,testSet[,outcomeName])
#Confusion Matrix and Statistics
#Prediction 0 1
# 0 25 3
# 1 23 102
#Accuracy : 0.8301
Data modeling with ‘caret’
19. R for Data Science | Long Nguyen | Sep 201719
Reporting
R Markdown files are designed: (i) for communicating to decision
makers, (ii) collaborating with other data scientists, and (iii) an
environment in which to do data science, where you can capture
what you were thinking.
Text formatting
*italic* or _italic_
**bold** __bold__
`code`
superscript^2^ and subscript~2~
Headings
# 1st Level Header
## 2nd Level Header
### 3rd Level Header
Lists
* Bulleted list item 1
* Item 2
* Item 2a
* Item 2b
1. Numbered list item 1
2. Item 2. The numbers are incremented automatically in the output.
Links and images
<http://example.com>[linked phrase](http://example.com)![optional
caption text](path/to/img.png)
20. R for Data Science | Long Nguyen | Sep 201720
Thank you!