Rattle is Free (as in Libre) Open Source Software, and the source code is available from the Bitbucket repository. We give you the freedom to review the code, use it for whatever purpose you like, and extend it however you like, without restriction, except that if you then distribute your changes you must also distribute your source code.
Rattle - the R Analytical Tool To Learn Easily - is a popular GUI for data mining using R. It presents statistical and visual summaries of data, transforms data so that it can be readily modelled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets. One of the most important features (according to me) is that all of your interactions through the graphical user interface are captured as an R script that can be readily executed in R independently of the Rattle interface.
Rattle clocks between 10,000 and 20,000 installations per month from the RStudio CRAN node (one of over 100 nodes). Rattle has been downloaded several million times overall.
2. What is R
A statistical programming language
Used among statisticians and data miners for developing statistical software and performing data analysis.
Free and Open Source
Written in C, Fortran and R
Statistical features
Linear and nonlinear modeling
Statistical tests
Classification, Clustering
Can manipulate R Objects with C, C++, Java, .NET or Python code.
2IAUSHIRAZ1/14/2017
3. Source Example
> x <- c(1,2,3,4,5,6) # Create ordered collection (vector)
> y <- x^2 # Square the elements of x
> print(y) # print (vector) y
[1] 1 4 9 16 25 36
> mean(y) # Calculate average (arithmetic mean) of (vector) y; result is scalar
[1] 15.16667
> var(y) # Calculate sample variance
[1] 178.9667
> lm_1 <- lm(y ~ x) # Fit a linear regression model "y = f(x)" or "y = B0 + (B1 * x)"
# store the results as lm_1
> print(lm_1) # Print the model from the (linear model object) lm_1
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
     -9.333        7.000
> summary(lm_1) # Compute and print statistics for the fit
# of the (linear model object) lm_1
Call:
lm(formula = y ~ x)
Residuals:
      1       2       3       4       5       6
 3.3333 -0.6667 -2.6667 -2.6667 -0.6667  3.3333
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -9.3333     2.8441  -3.282 0.030453 *
x             7.0000     0.7303   9.585 0.000662 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.055 on 4 degrees of freedom
Multiple R-squared: 0.9583, Adjusted R-squared: 0.9478
F-statistic: 91.88 on 1 and 4 DF, p-value: 0.000662
> par(mfrow=c(2, 2)) # Request 2x2 plot layout
> plot(lm_1) # Diagnostic plot of regression model
4. Graphical front-ends
Architect – cross-platform open source IDE based on Eclipse and StatET
DataJoy – Online R Editor focused on beginners to data science and collaboration.
Deducer – GUI for menu-driven data analysis (similar to SPSS/JMP/Minitab).
Java GUI for R – cross-platform stand-alone R terminal and editor based on Java (also known as JGR).
Number Analytics – GUI for R-based business analytics (similar to SPSS) working on the cloud.
Rattle GUI – cross-platform GUI based on RGtk2 and specifically designed for data mining.
R Commander – cross-platform menu-driven GUI based on tcltk (several plug-ins to Rcmdr are also available).
Revolution R Productivity Environment (RPE) – Visual Studio-based IDE provided by Revolution Analytics, with plans for a web-based point-and-click interface.
RGUI – comes with the pre-compiled version of R for Microsoft Windows.
RKWard – extensible GUI and IDE for R.
RStudio – cross-platform open source IDE (which can also be run on a remote Linux server).
5. What is Rattle
R Graphical User Interface Package
Developed by Graham Williams of Togaware Pty Ltd.
Free and Open Source
Presents Statistical and Visual Summaries of data
Tabs :
Load Data
Data Exploration
Model
Evaluation
Test
…
6. Rattle Installation Process
Download and Install R
https://r-project.org
About 60MB
Download the Rattle Package
About 300MB
Follow Instructions :
install.packages("rattle", dependencies=c("Depends", "Suggests")) # install Rattle and its dependencies
library(rattle) # load the package (R is case-sensitive)
rattle() # start the Rattle GUI
7. Load Data
Dataset Types :
CSV File (CSV, TXT, Excel)
ARFF (CSV File which adds type information)
ODBC (MySQL, SQLite, SQL Server, …)
Set Connections in : /etc/odbcinst.ini & /etc/odbc.ini
R Dataset (Existing Datasets in the Current Session)
R Data File
Library (Pre-Existing Datasets from Installed Packages)
Corpus (Collection of Documents)
Script (Scripts for Generating Datasets)
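The CSV File and Library options above have simple console equivalents. The following is a minimal sketch in base R, not Rattle's own loading code; the temporary file, its column names, and its values are made up for illustration.

```r
# Sketch only: Rattle's own loading code may differ.
# Write a tiny CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")   # hypothetical file location
writeLines(c("id,age,income",
             "1,34,52000",
             "2,41,NA",
             "3,29,48000"), csv_path)

# "CSV File": read a delimited text file into a data frame.
# Empty fields and the string "NA" are both treated as missing.
ds <- read.csv(csv_path, na.strings = c("NA", ""))
str(ds)

# "Library": load a dataset that ships with an installed package.
data(iris, package = "datasets")
head(iris)
```

Once a data frame such as `ds` exists in the R session, it should also be selectable through the R Dataset option.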
8. Load Data
Variable Types :
Input (Most Variables are Inputs)
Used to Predict the Target Variable
Target (Influenced by the Input Variables)
Known as the Output
Prefix : TARGET_
Risk (Measure of the Size of the Target)
Prefix : RISK_
Identifier (any Numeric Variable that has a Unique Value for each Observation – Not Normally used in Modeling)
Such as : ID, Date
Prefix : ID_
Ignore (Excluded from Modeling)
Prefix : IGNORE_
Weight (Observations Weighted by an R Formula)
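The prefix convention above can be sketched as a small lookup. This is an illustration of the idea, not Rattle's implementation: `guess_role` is a hypothetical helper and the variable names are invented.

```r
# Hypothetical helper: map a variable name to a role from its prefix.
guess_role <- function(name) {
  if (startsWith(name, "TARGET_")) "Target"
  else if (startsWith(name, "RISK_")) "Risk"
  else if (startsWith(name, "ID_")) "Identifier"
  else if (startsWith(name, "IGNORE_")) "Ignore"
  else "Input"                       # most variables default to Input
}

# Made-up variable names for illustration.
vars  <- c("ID_customer", "age", "income",
           "IGNORE_notes", "RISK_loss", "TARGET_default")
roles <- vapply(vars, guess_role, character(1))
print(roles)

# Only the target and the inputs enter the model formula.
target <- vars[roles == "Target"]
inputs <- vars[roles == "Input"]
fml <- reformulate(inputs, response = target)
print(fml)
```

Identifier, Ignore, and Risk variables are left out of the formula, which matches the note that they are not used as model inputs.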
12. Exploration
Summary
Summary
Min, Max, Mean, Quartiles Values.
Describe
Missing, Unique, Sum, Mean, Lowest, Highest Values.
Basics (For Numeric Value)
Measures of Numeric Data (Missing, Min, Max, Quartiles, Mean, Sum, Skewness, Kurtosis)
Kurtosis (For Numeric Value)
A larger value indicates a sharper peak.
A lower value indicates a smoother peak.
Skewness (For Numeric Value)
A positive skew indicates that the tail to the right is longer.
A negative skew indicates that the tail to the left is longer.
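The summary measures above can be reproduced at the console. The sketch below uses base R only and simple moment-based definitions of skewness and excess kurtosis; that choice is an assumption, since packages differ in the exact formula, and Rattle's own figures may not match to the last digit. The sample values are made up.

```r
# Moment-based definitions (an assumption; packages differ in the exact formula).
skewness_m <- function(v) {
  v <- v[!is.na(v)]
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}
kurtosis_m <- function(v) {
  v <- v[!is.na(v)]
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2 - 3   # excess kurtosis: 0 for a normal distribution
}

# Made-up sample with one large value, giving a long right tail.
x <- c(2, 3, 3, 4, 4, 4, 5, 5, 6, 20)

print(summary(x))      # min, quartiles, mean, max
print(skewness_m(x))   # positive: the right tail is longer
print(kurtosis_m(x))   # positive: sharper peak and heavier tails than a normal
```

Negating the sample (`-x`) flips the sign of the skewness, which is a quick way to check the left-tail case.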
13. Exploration
Summary
Show Missing
Each row corresponds to a pattern of missing values.
This can help in understanding why the data is missing.
Rows and Columns are sorted in ascending order of missing data.
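A table of missing-value patterns can be sketched in base R as follows. The data frame is invented, and the encoding (1 = missing, 0 = present) is an assumption made for this example; Rattle's own display may use a different encoding and layout.

```r
# Made-up data with some missing values.
ds <- data.frame(age    = c(34, NA, 29, 51, NA),
                 income = c(52000, 61000, NA, 58000, NA),
                 gender = c("F", "M", "M", "F", "M"))

miss <- is.na(ds)   # logical matrix: TRUE where a value is missing

# Order columns by ascending number of missing values.
miss <- miss[, order(colSums(miss)), drop = FALSE]

# Encode each row's pattern as a string such as "0/1/0" (1 = missing).
pattern <- apply(miss, 1, function(r) paste(as.integer(r), collapse = "/"))

# One entry per distinct pattern, with the number of rows showing it.
counts <- table(pattern)
print(colnames(miss))
print(counts)
```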
14. Exploration
Distributions (review the distributions of each variable in dataset)
Annotate (include numeric values in plots)
Group by
Numeric Outputs :
Box Plot
Histogram
Cumulative
Benford
For any number of continuous variables
Pairs
Categorical Outputs :
Bar Plot
Dot Plot
Mosaic
Pairs
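The quantities behind several of these plots can be computed directly. The sketch below uses base R with simulated data (the `income` and `gender` variables are invented) and is not Rattle's plotting code. The Benford part compares observed first-digit frequencies with the expected proportion log10(1 + 1/d) for digits d = 1 to 9; `first_digit` is a hypothetical helper.

```r
# Illustrative data only (simulated values).
set.seed(1)
income <- round(rlnorm(200, meanlog = 10, sdlog = 1))      # skewed numeric variable
gender <- factor(sample(c("F", "M"), 200, replace = TRUE)) # categorical variable

# Numeric summaries behind the plots.
box <- boxplot.stats(income)        # five-number summary and outliers
h   <- hist(income, plot = FALSE)   # bin counts without drawing
cdf <- ecdf(income)                 # empirical cumulative distribution

# Benford: first significant digit of each non-zero value.
first_digit <- function(v) {
  v <- abs(v[!is.na(v) & v != 0])
  as.integer(substr(formatC(v, format = "e", digits = 6), 1, 1))
}
digits <- first_digit(income)
benford_expected <- log10(1 + 1 / (1:9))
benford_observed <- as.numeric(table(factor(digits, levels = 1:9))) / length(digits)
print(round(rbind(expected = benford_expected,
                  observed = benford_observed), 3))

# Categorical summary behind a bar plot.
counts <- table(gender)

# Draw the plots only in an interactive session.
if (interactive()) {
  par(mfrow = c(2, 2))
  boxplot(income ~ gender, main = "Box Plot (grouped by gender)")
  hist(income, main = "Histogram")
  plot(cdf, main = "Cumulative")
  barplot(counts, main = "Bar Plot")
}
```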
15. Exploration
Correlations (Rattle only computes correlations between numeric variables at this time)
Ordered
Order by strength of correlations
Explore Missing
Correlation between missing values
Hierarchical
Pearson
Kendall
Spearman
Principal Components
SVD
For Numeric Variables only
Eigen
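The correlation methods and the two principal-component options can be sketched with the stats package that ships with R. The ordering rule and the distance used for the hierarchical view are assumptions chosen for this example, not a statement of what Rattle does internally. In base R, `prcomp` works through a singular value decomposition and `princomp` through an eigen decomposition of the covariance matrix.

```r
# Correlations need numeric variables, so drop the factor column.
num <- iris[, sapply(iris, is.numeric)]

cor_pearson  <- cor(num, method = "pearson")
cor_kendall  <- cor(num, method = "kendall")
cor_spearman <- cor(num, method = "spearman")

# "Ordered" (assumed rule): sort by absolute correlation with the first variable.
ord <- order(abs(cor_pearson[, 1]), decreasing = TRUE)
print(round(cor_pearson[ord, ord], 2))

# "Hierarchical" (one common choice): cluster variables on 1 - |r|.
hc <- hclust(as.dist(1 - abs(cor_pearson)))

# Principal components, computed two ways.
pc_svd   <- prcomp(num)     # singular value decomposition
pc_eigen <- princomp(num)   # eigen decomposition of the covariance matrix

# The two differ only in the variance divisor (n - 1 versus n).
n <- nrow(num)
print(pc_svd$sdev * sqrt((n - 1) / n))
print(unname(pc_eigen$sdev))
```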
16. Model
Tree
Traditional
Trade off between performance and simplicity of explanation
Conditional
Forest (many decision trees using random subsets of data and variables)
Number of Trees
Number of Variables
Impute (set median numeric value for missing values)
Sample Size (for balancing classes)
Importance (variable importance)
Rules (collection of random forest rules)
ROC (ROC Curve)
Errors
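A traditional decision tree can be sketched with the rpart package. This covers only the Traditional tree option, assumes rpart is installed (it is normally distributed with R as a recommended package), and uses the built-in iris data rather than anything from the slides. The complexity parameter `cp` is one way to express the trade-off between performance and simplicity: a larger value prunes more and gives a smaller tree.

```r
# Sketch only: assumes the 'rpart' package is available.
if (requireNamespace("rpart", quietly = TRUE)) {
  # cp and minsplit are shown at their default values for visibility.
  fit <- rpart::rpart(Species ~ ., data = iris, method = "class",
                      control = rpart::rpart.control(cp = 0.01, minsplit = 20))
  print(fit)

  # Accuracy on the training data; this is optimistic, so a held-out
  # test set should be used for a real evaluation.
  pred <- predict(fit, newdata = iris, type = "class")
  accuracy <- mean(pred == iris$Species)
  print(accuracy)
} else {
  message("Package 'rpart' is not installed; skipping the tree sketch.")
}
```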
17. Model
SVM
Starts with two parallel vectors
Linear (linear regression)
For continuous values
All
18. Cluster
K-Means
The number of clusters (K) must be set first
EwKm
K-Means with entropy weighting
Hierarchical
No need to set the number of clusters first
BiCluster
Suitable subsets of both the variables and the observations
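The difference between K-Means and Hierarchical clustering can be sketched with the stats package. The data (iris), the seed, the choice of three clusters, and the Ward linkage are assumptions for this example. K-Means needs the number of clusters before it runs, while the hierarchical tree is built once and can be cut into any number of groups afterwards.

```r
# Scale the numeric variables so no single variable dominates the distances.
num <- scale(iris[, 1:4])

# K-Means: the number of clusters must be chosen up front.
set.seed(42)                                   # results depend on random starts
km <- kmeans(num, centers = 3, nstart = 10)
print(table(km$cluster))

# Hierarchical: build the full tree first, choose the number of groups later.
hc <- hclust(dist(num), method = "ward.D2")
groups3 <- cutree(hc, k = 3)
groups5 <- cutree(hc, k = 5)                   # same tree, different cut
print(table(groups3))
```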
Editor's Notes
The intensity of the color is maximal for a perfect correlation, and minimal (white) if there is no correlation. Shades of red are used for negative correlations and blue for positive correlations.