Rattle Graphical Interface for R Language

INTRODUCTION TO R AND RATTLE
1IAUSHIRAZ1/14/2017

What is the R
Statistical Programming Language
used among statisticians and data miners for developing statistical software and data analysis.
Free and Open Source
Written in C, Fortran and R
Statistical features
Linear and nonlinear modeling
Statistical tests
Classification, Clustering
Can manipulate R Objects with C, C++, Java, .NET or Python code.
2IAUSHIRAZ1/14/2017

Source Example
> x <- c(1,2,3,4,5,6) # Create ordered collection (vector)
> y <- x^2 # Square the elements of x
> print(y) # print (vector) y
[1] 1 4 9 16 25 36
> mean(y) # Calculate average (arithmetic mean) of (vector) y; result is scalar
[1] 15.16667
> var(y) # Calculate sample variance
[1] 178.9667
> lm_1 <- lm(y ~ x) # Fit a linear regression model "y = f(x)" or "y = B0 + (B1 * x)"
# store the results as lm_1
> print(lm_1) # Print the model from the (linear model object) lm_1
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-9.333 7.000
> summary(lm_1) # Compute and print statistics for the fit
# of the (linear model object) lm_1
Call:
lm(formula = y ~ x)
Residuals:
1 2 3 4 5 6
3.3333 -0.6667 -2.6667 -2.6667 -0.6667 3.3333
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.3333 2.8441 -3.282 0.030453 *
x 7.0000 0.7303 9.585 0.000662 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.055 on 4 degrees of freedom
Multiple R-squared: 0.9583, Adjusted R-squared: 0.9478
F-statistic: 91.88 on 1 and 4 DF, p-value: 0.000662
> par(mfrow=c(2, 2)) # Request 2x2 plot layout
> plot(lm_1) # Diagnostic plot of regression model
3IAUSHIRAZ1/14/2017

Graphical front-ends
Architect – cross-platform open source IDE based on Eclipse and StatET
DataJoy – Online R Editor focused on beginners to data science and collaboration.
Deducer – GUI for menu-driven data analysis (similar to SPSS/JMP/Minitab).
Java GUI for R – cross-platform stand-alone R terminal and editor based on Java (also known as JGR).
Number Analytics - GUI for R based business analytics (similar to SPSS) working on the cloud.
Rattle GUI – cross-platform GUI based on RGtk2 and specifically designed for data mining.
R Commander – cross-platform menu-driven GUI based on tcltk (several plug-ins to Rcmdr are also
available).
Revolution R Productivity Environment (RPE) – Revolution Analytics-provided Visual Studio-based IDE,
and has plans for web based point and click interface.
RGUI – comes with the pre-compiled version of R for Microsoft Windows.
RKWard – extensible GUI and IDE for R.
RStudio – cross-platform open source IDE (which can also be run on a remote Linux server).
4IAUSHIRAZ1/14/2017

What is the Rattle
R Graphical User Interface Package
Offered by Graham Williams in Togaware Pty Ltd.
Free and Open Source
Represents Statistical and Visual Summaries of data
Tabs :
Load Data
Data Exploration
Model
Evaluation
Test
…
5IAUSHIRAZ1/14/2017

Rattle Installation Process
Download and Installing R
https://r-project.org
About 60MB
Download the Rattle Package
About 300MB
Follow Instructions :
 install.packages("rattle", dependencies=c("Depends", "Suggests"))
 Library(rattle)
 Rattle()
6IAUSHIRAZ1/14/2017

Load Data
Dataset Types :
CSV File (CSV, TXT, EXCELL)
ARFF (CSV File which adds type information)
ODBC (MySQL, SqlLITE, SQL Server, …)
 Set Connections in : /etc/odbcinst.ini & /etc/odbc.ini
R Dataset (Existing Datasets in Current Solution)
R Data File
Library (Pre Existing Datasets)
Corpus ( Collection of Documents)
Script (Scripts for Generating Datasets)
1/14/2017 IAUSHIRAZ 7

Load Data
Variable Types :
Input (Most Variables as Input)
 Predict the Target Variables
Target (Influenced by the Input Variables)
 Known as the Output
 Prefix : TARGET_
Risk (Measure of the size of the Targets)
 Prefix : RISK_
Identifier (any Numeric Variable that has a Unique Value – Not Normally used in modeling)
 Such as : ID, Date
 Prefix : ID_
Ignore (Ignore from Modeling)
 Prefix : IGNORE_
Weight (Weighted by R Formula)

Transform
Rescale
Normalize
 Re Center
 Scale [0-1]
 Median/Mad
 Natural Log / Log 10
 Matrix
Order
 Rank
 Interval
 Number of Group

Transform
Impute (missing values)
Zero
Mean
Median
Mode
Constant
Recode
Quantiles
K-Means
Equal with
Indicator variable / Join Categories
As Categorical / As Numeric

Transform
Cleanup
Delete Ignored
Delete Selected
Delete Missing
Delete Observations with Missing

Exploration
Summary
Summary
 Min, Max, Mean, Quartiles Values.
Describe
 Missing, Unique, Sum, Mean, Lowest, Highest Values.
Basics (For Numeric Value)
 Measures of Numeric Data (Missing, Min, Max, Quartiles, Mean, Sum, Skewness, Kurtosis)
Kurtosis (For Numeric Value)
 A larger value indicates a sharper peak.
 A lower value indicates a smoother peak.
Skewness (For Numeric Value)
 A positive skew indicates that the tail to the right is longer.
 A negative skew that the tail to the left is longer.

Exploration
Summary
Show Missing
 Each row corresponds to a pattern of missing values.
 Perhaps coming to an understanding of why the data is missing.
 Rows and Columns are sorted in ascending order of missing data.

Exploration
Distributions (review the distributions of each variable in dataset)
Annotate (include numeric values in plots)
Group by
Numeric Outputs :
 Box Plot
 Histogram
 Cumulative
 Benford
 For any number of continuous variables
 Pairs
Categorical Outputs :
 Bar Plot
 Dot Plot
 Mosaic
 Pairs

Exploration
Correlations (Rattle only computes correlations between numeric variables at this time)
Ordered
 Order by strength of correlations
Explore Missing
 Correlation between missing values
Hierarchical
 Pearson
 Kendall
 Spearman
Principal Components
SVD
 For only Numeric Variables
Eigen

Model
Tree
Traditional
 Trade off between performance and simplicity of explanation
Conditional
Forest (many decision trees using random subsets of data and variables)
Number of Trees
Number of Variables
Impute (set median numeric value for missing values)
Sample Size (for balancing classes)
Importance (variable importance)
Rules (collection of random forest rules)
ROC (ROC Curve)
Errors

Model
SVM
Start with two parallel vector
Linear (linear regression)
For continues values
All

Cluster
K-Means
Set First K
EwKm
K-Means with entropy weighting
Hierarchical
Not needed to set first Cluster Number
BiCluster
Suitable subsets of both the variables and the observations

Rattle Graphical Interface for R Language

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Rattle Graphical Interface for R Language

Similar to Rattle Graphical Interface for R Language (20)

Recently uploaded

Recently uploaded (20)

Rattle Graphical Interface for R Language

Editor's Notes