INTRODUCTION TO R AND RATTLE
IAUSHIRAZ, 1/14/2017
What is R
Statistical programming language
Widely used by statisticians and data miners for developing statistical software and for data analysis.
Free and Open Source
Written in C, Fortran and R
Statistical features
Linear and nonlinear modeling
Statistical tests
Classification, Clustering
Can manipulate R Objects with C, C++, Java, .NET or Python code.
Source Example
> x <- c(1,2,3,4,5,6) # Create ordered collection (vector)
> y <- x^2 # Square the elements of x
> print(y) # print (vector) y
[1] 1 4 9 16 25 36
> mean(y) # Calculate average (arithmetic mean) of (vector) y; result is scalar
[1] 15.16667
> var(y) # Calculate sample variance
[1] 178.9667
> lm_1 <- lm(y ~ x) # Fit a linear regression model "y = f(x)" or "y = B0 + (B1 * x)"
# store the results as lm_1
> print(lm_1) # Print the model from the (linear model object) lm_1
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
     -9.333        7.000
> summary(lm_1) # Compute and print statistics for the fit
# of the (linear model object) lm_1
Call:
lm(formula = y ~ x)
Residuals:
      1       2       3       4       5       6
 3.3333 -0.6667 -2.6667 -2.6667 -0.6667  3.3333
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -9.3333     2.8441  -3.282 0.030453 *
x             7.0000     0.7303   9.585 0.000662 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.055 on 4 degrees of freedom
Multiple R-squared: 0.9583, Adjusted R-squared: 0.9478
F-statistic: 91.88 on 1 and 4 DF, p-value: 0.000662
> par(mfrow=c(2, 2)) # Request 2x2 plot layout
> plot(lm_1) # Diagnostic plot of regression model
Graphical front-ends
Architect – cross-platform open source IDE based on Eclipse and StatET
DataJoy – Online R Editor focused on beginners to data science and collaboration.
Deducer – GUI for menu-driven data analysis (similar to SPSS/JMP/Minitab).
Java GUI for R – cross-platform stand-alone R terminal and editor based on Java (also known as JGR).
Number Analytics – GUI for R-based business analytics (similar to SPSS), running in the cloud.
Rattle GUI – cross-platform GUI based on RGtk2 and specifically designed for data mining.
R Commander – cross-platform menu-driven GUI based on tcltk (several plug-ins to Rcmdr are also available).
Revolution R Productivity Environment (RPE) – Visual Studio-based IDE provided by Revolution Analytics, with plans for a web-based point-and-click interface.
RGUI – comes with the pre-compiled version of R for Microsoft Windows.
RKWard – extensible GUI and IDE for R.
RStudio – cross-platform open source IDE (which can also be run on a remote Linux server).
What is Rattle
Graphical User Interface Package for R
Developed by Graham Williams at Togaware Pty Ltd.
Free and Open Source
Presents Statistical and Visual Summaries of data
Tabs :
Load Data
Data Exploration
Model
Evaluation
Test
…
Rattle Installation Process
Download and Install R
https://r-project.org
About 60MB
Download the Rattle Package
About 300MB
Follow Instructions :
> install.packages("rattle", dependencies=c("Depends", "Suggests"))
> library(rattle)
> rattle()
Load Data
Dataset Types :
CSV File (CSV, TXT, Excel)
ARFF (CSV-like File which adds type information)
ODBC (MySQL, SQLite, SQL Server, …)
  • Set Connections in : /etc/odbcinst.ini and /etc/odbc.ini
R Dataset (Existing Datasets in the Current R Session)
R Data File
Library (Pre-Existing Datasets)
Corpus (Collection of Documents)
Script (Scripts for Generating Datasets)
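For the CSV case, loading amounts to roughly the following in plain R. This is a minimal sketch: the file contents and column names are made up for illustration, and the code Rattle itself runs may differ.

```r
# Hypothetical CSV content, written to a temporary file for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,income,target",
             "1,34,52000,yes",
             "2,45,,no",
             "3,29,41000,yes"), csv_path)

# Empty fields and the string "NA" are read as missing values.
ds <- read.csv(csv_path, na.strings = c("", "NA"), stringsAsFactors = TRUE)

str(ds)          # column types inferred by read.csv
nrow(ds)         # number of observations
sum(is.na(ds))   # number of missing cells
```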
Load Data
Variable Types :
Input (Most Variables are Inputs)
  • Used to Predict the Target Variables
Target (Influenced by the Input Variables)
  • Known as the Output
  • Prefix : TARGET_
Risk (Measure of the size of the Targets)
  • Prefix : RISK_
Identifier (any Numeric Variable that has a Unique Value for each observation – Not Normally used in modeling)
  • Such as : ID, Date
  • Prefix : ID_
Ignore (Excluded from Modeling)
  • Prefix : IGNORE_
Weight (Weighted by R Formula)
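The prefix convention can be sketched as a small lookup. The function name `guess_role` and the example column names are hypothetical; only the prefixes come from the slide, and Rattle's own role detection may apply further rules.

```r
# Hypothetical helper: guess a variable's role from its name prefix.
# Prefixes are taken from the slide; everything else is illustrative.
guess_role <- function(var_names) {
  roles <- rep("Input", length(var_names))            # default role
  roles[startsWith(var_names, "TARGET_")] <- "Target"
  roles[startsWith(var_names, "RISK_")]   <- "Risk"
  roles[startsWith(var_names, "ID_")]     <- "Identifier"
  roles[startsWith(var_names, "IGNORE_")] <- "Ignore"
  setNames(roles, var_names)
}

roles <- guess_role(c("ID_customer", "age", "IGNORE_notes",
                      "RISK_amount", "TARGET_default"))
print(roles)
```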
Transform
Rescale
Normalize
  • Recenter
  • Scale [0-1]
  • Median/MAD
  • Natural Log / Log 10
  • Matrix
Order
  • Rank
  • Interval
  • Number of Groups
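Several of the Rescale options can be written in base R as below. This is a sketch on made-up data; it assumes that "Recenter" means subtracting the mean and dividing by the standard deviation, which may not match Rattle's exact definition.

```r
x <- c(2, 4, 4, 10, 50)   # made-up numeric variable

recentered <- (x - mean(x)) / sd(x)              # Recenter: mean 0, sd 1 (assumed meaning)
scaled01   <- (x - min(x)) / (max(x) - min(x))   # Scale [0-1]
robust     <- (x - median(x)) / mad(x)           # Median/MAD: robust to outliers
logged     <- log(x)                             # Natural Log; use log10(x) for Log 10
ranked     <- rank(x)                            # Rank: ties receive the average rank
```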
Transform
Impute (missing values)
Zero
Mean
Median
Mode
Constant
Recode
Quantiles
K-Means
Equal Width
Indicator variable / Join Categories
As Categorical / As Numeric
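The Impute and Recode options correspond to simple operations in base R. The data and the `impute` helper below are made up for illustration; this is a sketch of the ideas, not Rattle's generated code.

```r
age <- c(23, NA, 31, 45, NA, 52, 38, 31)   # made-up data with missing values

# Hypothetical helper: replace missing entries with a given value.
impute <- function(x, value) { x[is.na(x)] <- value; x }

age_zero   <- impute(age, 0)
age_mean   <- impute(age, mean(age, na.rm = TRUE))
age_median <- impute(age, median(age, na.rm = TRUE))

# Mode: the most frequent observed value (table() drops NA by default).
mode_value <- as.numeric(names(which.max(table(age))))
age_mode   <- impute(age, mode_value)

# Recode into 3 bins: by quantiles, and by equal width.
by_quantile <- cut(age_median, quantile(age_median, probs = 0:3 / 3),
                   include.lowest = TRUE)
by_width    <- cut(age_median, breaks = 3)
```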
Transform
Cleanup
Delete Ignored
Delete Selected
Delete Missing
Delete Observations with Missing
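The Cleanup options can be sketched in base R as follows. The data frame is made up, and the reading of "Delete Missing" as "drop every variable that contains a missing value" is an assumption.

```r
ds <- data.frame(id     = 1:4,
                 age    = c(30, NA, 41, 25),
                 notes  = c(NA, NA, NA, "x"),
                 income = c(50, 60, NA, 40))

# Delete Ignored / Delete Selected: drop chosen columns by name.
ds_no_notes <- ds[, setdiff(names(ds), "notes")]

# Delete Missing (assumed meaning): drop variables with any missing value.
ds_complete_vars <- ds[, colSums(is.na(ds)) == 0, drop = FALSE]

# Delete Observations with Missing: drop rows with any missing value.
ds_complete_obs <- ds_no_notes[complete.cases(ds_no_notes), ]
```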
Exploration
Summary
Summary
  • Min, Max, Mean and Quartile Values.
Describe
  • Missing, Unique, Sum, Mean, Lowest and Highest Values.
Basics (For Numeric Values)
  • Measures of Numeric Data (Missing, Min, Max, Quartiles, Mean, Sum, Skewness, Kurtosis)
Kurtosis (For Numeric Values)
  • A larger value indicates a sharper peak.
  • A lower value indicates a flatter peak.
Skewness (For Numeric Values)
  • A positive skew indicates that the tail to the right is longer.
  • A negative skew indicates that the tail to the left is longer.
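Skewness and kurtosis can be computed from central moments. The functions below are a sketch using the plain moment formulas (kurtosis reported as excess kurtosis, so a normal distribution gives about 0); the package Rattle relies on may use a slightly different estimator.

```r
# Moment-based skewness: third central moment over variance^1.5.
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over variance^2, minus 3.
kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

right_tailed <- c(1, 1, 2, 2, 3, 10)   # made-up data with a long right tail
skewness(right_tailed)                 # positive: tail to the right is longer
skewness(-right_tailed)                # negative: tail to the left is longer
```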
Exploration
Summary
Show Missing
  • Each row corresponds to a pattern of missing values.
  • The patterns can help in understanding why the data is missing.
  • Rows and Columns are sorted in ascending order of missing data.
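A missing-value pattern table of this kind can be approximated in base R. The data frame is made up and the layout is only a sketch; Rattle's own display is produced differently.

```r
ds <- data.frame(age    = c(30, NA, 41, NA, 25),
                 income = c(50, NA, NA, NA, 40),
                 city   = c("a", "b", "c", "d", "e"))

# One string per observation: 1 = observed, 0 = missing, in column order.
pattern <- apply(!is.na(ds), 1, function(r) paste(as.integer(r), collapse = ""))

# How many observations share each pattern.
pattern_counts <- table(pattern)

# Variables in ascending order of their number of missing values.
missing_per_var <- sort(colSums(is.na(ds)))

print(pattern_counts)
print(missing_per_var)
```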
Exploration
Distributions (review the distribution of each variable in the dataset)
Annotate (include numeric values in plots)
Group by
Numeric Outputs :
  • Box Plot
  • Histogram
  • Cumulative
  • Benford
      • For any number of continuous variables
  • Pairs
Categorical Outputs :
  • Bar Plot
  • Dot Plot
  • Mosaic
  • Pairs
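The Benford plot compares the leading digits of a variable with Benford's law, under which digit d is expected with probability log10(1 + 1/d). A minimal sketch of that comparison, with a hypothetical `first_digit` helper and powers of two as sample data:

```r
# Expected share of leading digits 1..9 under Benford's law.
benford_expected <- log10(1 + 1 / (1:9))

# Hypothetical helper: leading (most significant) digit of each nonzero value.
first_digit <- function(x) {
  x <- abs(x[!is.na(x) & x != 0])
  floor(x / 10^floor(log10(x)))
}

values   <- 2^(1:60)   # powers of two follow Benford's law closely
observed <- as.numeric(table(factor(first_digit(values), levels = 1:9))) /
  length(values)

round(rbind(expected = benford_expected, observed = observed), 3)
```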
Exploration
Correlations (Rattle only computes correlations between numeric variables at this time)
Ordered
  • Order by strength of correlations
Explore Missing
  • Correlation between missing values
Hierarchical
  • Pearson
  • Kendall
  • Spearman
Principal Components
SVD
  • For Numeric Variables only
Eigen
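The correlation methods and the two principal-components variants map onto base R functions. This sketch uses the built-in mtcars data; the distance used for the hierarchical view (1 minus absolute correlation) is an assumption chosen for illustration, not necessarily what Rattle uses.

```r
num <- mtcars[, c("mpg", "disp", "hp", "wt")]   # numeric columns only

cor_pearson  <- cor(num, method = "pearson")
cor_kendall  <- cor(num, method = "kendall")
cor_spearman <- cor(num, method = "spearman")

# Hierarchical view: cluster the variables by how strongly they correlate.
var_tree <- hclust(as.dist(1 - abs(cor_pearson)))

# Principal components: prcomp() works via SVD, princomp() via eigen decomposition.
pc_svd   <- prcomp(num, scale. = TRUE)
pc_eigen <- princomp(num, cor = TRUE)

summary(pc_svd)
```

On standardised data both variants should report the same component standard deviations, so the choice mainly matters for numerical stability and for datasets with more variables than observations.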
Model
Tree
Traditional
  • Trade-off between performance and simplicity of explanation
Conditional
Forest (many decision trees using random subsets of data and variables)
Number of Trees
Number of Variables
Impute (set median numeric value for missing values)
Sample Size (for balancing classes)
Importance (variable importance)
Rules (collection of random forest rules)
ROC (ROC Curve)
Errors
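A traditional decision tree can be fitted from the R console with the rpart package. This is a sketch on the built-in iris data, guarded in case rpart is not installed; the accuracy is measured on the training data and is therefore optimistic.

```r
if (requireNamespace("rpart", quietly = TRUE)) {
  # Classification tree predicting Species from all other columns.
  fit <- rpart::rpart(Species ~ ., data = iris, method = "class")

  pred     <- predict(fit, iris, type = "class")
  accuracy <- mean(pred == iris$Species)   # training accuracy only

  print(fit)                               # the splits, readable as rules
  print(fit$variable.importance)           # variable importance
}
```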
Model
SVM
Starts with two parallel vectors
Linear (linear regression)
For continuous values
All
Cluster
K-Means
K (the number of clusters) must be set first
EwKm
K-Means with entropy weighting
Hierarchical
No need to set the number of clusters first
BiCluster
Finds suitable subsets of both the variables and the observations
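The difference between K-Means and Hierarchical clustering shows up directly in base R. This sketch uses the built-in iris data; the choice of 3 clusters and of Ward linkage is arbitrary and only for illustration.

```r
num <- scale(iris[, 1:4])        # numeric columns, standardised

set.seed(42)                     # k-means starts from random centres
km <- kmeans(num, centers = 3, nstart = 10)   # K must be chosen up front
table(km$cluster)

hc <- hclust(dist(num), method = "ward.D2")   # no K needed to build the tree
hc_groups <- cutree(hc, k = 3)                # K is chosen afterwards, when cutting
table(hc_groups)
```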

Rattle Graphical Interface for R Language


Editor's Notes

  1. The intensity of the color is maximal for a perfect correlation, and minimal (white) if there is no correlation. Shades of red are used for negative correlations and blue for positive correlations.