www.openeering.com
DATA ANALYSIS AND STATISTICS
A Scilab data mining tutorial

The purpose of this tutorial is to show that Scilab can be considered as a powerful data mining tool, able to perform the widest possible range of important data mining tasks.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
Step 1: The purpose of this tutorial
Nowadays it is very common to have large amounts of data available that describe aspects of our world or our work in great detail. It can be difficult to get useful information from them without the support of data mining, the discipline that describes the process of extracting meaningful patterns from these complex data sets.
The purpose of this tutorial is to show that Scilab can be considered as a
powerful data mining tool, able to perform the widest possible range of
important data mining tasks.
Step 2: Roadmap
In this tutorial, after a description of the database we are going to use and
of the commands used to extract the data, we describe the charts we can
create in Scilab in order to analyze our data.

Database description: Step 3
Data extraction: Step 4
Data mining charts: Steps 5-13
Conclusions and remarks: Steps 14-15
Step 3: Database description
The database presented in this tutorial regards the most relevant
characteristics of the United Nations in terms of population, latitude,
longitude, ages, per capita GDP and so on. It has been created by
extracting data from the website
“http://www.un.org/”
The database collects all the United Nations with any row containing 23
specific elements, listed on the right.
Once the data are well organized in a table, Scilab helps the users in
getting the relations between the data that are not visible at a first glance
due to the quantity of data and/or the high dimensionality of the problem.
1. A unique identification code for the state
2. Name of the state
3. Average latitude of the state
4. Average longitude of the state
5. Total population (in thousands)
6. Number of women (in thousands)
7. Number of men (in thousands)
8. Number of women per 100 men
9. Annual population growth rate
10. Percentage of population under 15 years
11. Percentage of men over 60 years
12. Percentage of women over 60 years
13. Number of men per 100 women in the over-60 population
14. Number of annual maternal deaths per 100,000 live births
15. Number of annual infant deaths before the age of one year per 1,000 live births
16. Life expectancy at birth for women
17. Life expectancy at birth for men
18. Life expectancy at the age of 60 for women
19. Life expectancy at the age of 60 for men
20. Total school life expectancy (in years)
21. School life expectancy for men (in years)
22. School life expectancy for women (in years)
23. Per capita GDP (in US$)
Step 4: How to extract the data
The first step in data mining is to input raw data in an appropriate way. In
Scilab, loading and filtering the data is really easy. During the import
phase, the user can remove rows and columns containing useless data.
Our dataset is stored in a comma-separated values (CSV) file, which
stores tabular data (numbers and text) in plain-text form.
We load the data using the function csvRead, which returns the
corresponding Scilab matrix of strings or doubles. In particular, typing in
the Scilab Console
D = csvRead('data_UN.csv');
we create a matrix of doubles, where entries that cannot be read as a double are replaced with NaN, while typing
S = csvRead('data_UN.csv',',','.', 'string');
we get a matrix in which every entry is read as a string.
In the following examples we will always restrict ourselves to the numerical part of the dataset, i.e. the rows of the matrix D from the second to the last (the first row contains the column headers):
data = D(2:$,:);
(Loading data as matrix of doubles)
(Loading data as matrix of strings)
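As a quick check after the import, we can inspect the dimensions of the two matrices and use the header row of S to locate a column by name instead of hard-coding its index. The following is a minimal sketch; the header text used below ("Total population") is an assumption about data_UN.csv, not a value taken from the tutorial.
// Minimal sketch: inspect the imported data and look up a column by its header.
// NOTE: the header text "Total population" is an assumption about data_UN.csv.
disp(size(D));                               // rows and columns read as doubles
disp(size(S));                               // same table read as strings (includes the header row)
headers = S(1,:);                            // the first row of S holds the column names
col = find(headers == "Total population");   // index of the column of interest
if ~isempty(col) then
    pop = data(:, col);                      // numerical column, NaN where the CSV had no number
    disp(pop(1:5));                          // first few values
end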
Once our dataset has been loaded, it is possible to open the variable data from the Variable Browser with a double click; this shows the table with the stored data, as in the figure on the right.
In the Variable Editor it is possible to select a subset of the data in the table and to create a chart automatically, choosing among the available charts listed by an icon.
(Table of data selected from the Variable Browser)
Step 5: The history plot
We can start analyzing the database in Scilab with the help of the available charts, beginning with the simplest one, the history plot. The figure on the right reports the history plot of the maternal mortality ratio for the 30 poorest states; the red lines mark the gap between the highest (Somalia, with 1000 deaths per 100,000 live births) and the lowest (North Korea, with 81) maternal mortality ratio.
This chart highlights the peculiarity of North Korea: despite being the twelfth poorest state in the world, it has a rather small maternal mortality ratio.
// Getting rid of the NaN entries
[gdp,k] = thrownan(data(:,23));
[rationonan,kk] = thrownan(data(k,14));
gdpnonan = gdp(kk);
// History plot
[p,i] = gsort(gdpnonan,'g','i');
scf(1); clf(1);
plot([1:30],rationonan(i(1:30)),'bo-')
[m,im] = min(rationonan(i(1:30)));
plot([0,im],[m,m],'r')
[M,iM] = max(rationonan(i(1:30)));
plot([0,iM],[M,M],'r')
set(gca(),"grid",[1 1]*color('gray'));
set(gca(),"data_bounds",[0 30 0 1200]);
title('Maternal mortality ratio in the poor states');
(History plot of the maternal mortality ratio in the poor states)
Similarly, we can even plot a multi-history chart, putting in the same plot
the maternal mortality ratio and the infant mortality rate by adding the line
plot([1:30],data(k(kk(i(1:30))),15),'go-')
as shown in the figure on the top-right. However, this chart does not give a good comparison between the two lines, because the blue line plots the annual maternal deaths per 100,000 live births, while the green line represents the number of annual infant deaths before the age of one year per 1,000 live births.
A good solution for comparing these two datasets is to normalize them; this operation isolates the only information we are interested in here, the trends of the two lines, which turn out to be very similar.
// Data normalization
dnorm = [rationonan(i(1:30)),data(k(kk(i(1:30))),15)];
for i = 1:size(dnorm,2)
dmin(i) = min(dnorm(:,i));
dmax(i) = max(dnorm(:,i));
dnorm(:,i) = (dnorm(:,i)-dmin(i)) ./ (dmax(i)-dmin(i));
end
// Plot
scf(2); clf(2);
plot([1:30],dnorm(:,1),'bo-')
plot([1:30],dnorm(:,2),'go-')
set(gca(),"grid",[1 1]*color('gray'));
set(gca(),"data_bounds",[0 30 0 1]);
(History plot of the maternal and infant mortality ratio in the poor states)
(History plot of the normalized maternal and infant mortality ratio in the poor states)
Step 6: The pie chart
In the pie chart a value is associated with the area of a slice of pie,
possibly colored, as shown in the figure on the right.
This kind of chart is perhaps the most frequent chart in the business world and the one most used by the mass media. Unfortunately, with a pie chart it can be difficult to compare different sections of the pie or to compare data across different pie charts. These charts are effective for relating the size of a slice to the whole pie, rather than for comparing the slices with each other; for example, they are convenient for visualizing percentages.
In the figure on the right we have plotted the population of the 10 biggest states and the population of the rest of the world (41%, in light blue). It is interesting to see that China (19%) and India (18%) together account for 37% of the global population.
We get a pie chart using the Scilab command pie.
// Sorting the population size and plotting the Pie Chart
[p,ind] = gsort(data(:,5));
scf(3); clf(3);
pie([p(1:10);sum(p(11:$))])
legend([S(ind(1:10)+1,2);"Rest of the world"]);
(Pie chart of the size of the population of the states)
Step 7: The bar chart
A bar chart consists of rectangular bars with lengths proportional to the
values that they represent.
For the bars to be clearly visible, their number has to be limited. Each bar is characterized by a label and a length: bar charts can therefore be used for plotting data with a discrete set of labels, while the data assigned to the length can be continuous. They are a good choice when we want to associate nominal values along the X axis with numerical values along the Y axis.
In the figure on the right, using the command bar, we have plotted the percentage of the global per capita GDP accounted for by the 10 states with the highest per capita GDP, while the red line shows the cumulative sum of these values: it points out that these 10 states account for 30% of the global per capita GDP.
// Bar Chart and cumulative
scf(4); clf(4);
[gdpnonan,i] = thrownan(data(:,23));
[gdp,j] = gsort(gdpnonan);
tot_gdp = sum(gdp);
cum_gdp = cumsum(gdp(1:10));
bar(gdp(1:10)*100/tot_gdp)
plot([1:10],cum_gdp*100/tot_gdp,'ro-')
set(gca(),"grid",[1 1]*color('gray'));
h = get("current_entity");
h.parent.x_ticks.labels = S(i(j(1:10))+1,1);
h.parent.y_ticks.labels = h.parent.y_ticks.labels+'%';
(Bar chart and cumulative of the per capita GDP)
Step 8: Further bar charts
If we want to compare more than one bar chart, the barhomogenize function allows us to homogenize the width and style of all the bars.
In the figures on the right we have plotted, for 10 states, the percentages of the population made up of people under 15 (in yellow), men over 60 (in blue) and women over 60 (in red).
In the subplot on the left we have set the option 'stacked'.
(Homogenized bar charts)
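No code is shown for this step in the slides, so the following is a minimal sketch of how such charts could be produced, assuming the age-structure percentages are in columns 10 (under 15), 11 (men over 60) and 12 (women over 60) of data, as listed in Step 3; the choice of the first 10 states and the default colors are assumptions, not the tutorial's own script.
// Minimal sketch (not the original data_mining.sce code): stacked vs grouped bars.
// Columns assumed from the list in Step 3: 10 = under 15, 11 = men over 60, 12 = women over 60.
ages = data(1:10, 10:12);         // age-structure percentages of the first 10 states
scf(11); clf(11);
subplot(1,2,1)
bar(1:10, ages, 'stacked')        // one bar per state, with the three series stacked
xtitle("Stacked bars");
subplot(1,2,2)
bar(1:10, ages)                   // default 'grouped' style: three bars per state
barhomogenize()                   // homogenize the width and style of the bars in the current axes
xtitle("Grouped, homogenized bars");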
Step 9: The bubble chart
In a bubble chart each point is replaced by a "bubble", a circle whose size (diameter or area) is proportional to a selected parameter. By adding a color as well, we get a very powerful and effective way to show four distinct coordinates on a single plot.
In the figure on the right we have plotted the latitude and the longitude of the states along the X and the Y axes, while the bubble sizes are given by the population of each state. In this way we get a nice view of the world. The red bubbles point out the 10 states with the biggest populations.
The code is available in the file data_mining.sce.
(Bubble chart of the size of the population of every state, given the latitude and the longitude)
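Since the bubble-chart code itself is only in data_mining.sce, here is a minimal sketch of the idea using the scatter function available in recent Scilab versions; the columns (3 = latitude, 4 = longitude, 5 = population) follow the list in Step 3, while the axis assignment and the size scaling are assumptions made for this example.
// Minimal sketch of a bubble chart (not the original data_mining.sce code).
// Requires a Scilab version providing the scatter function.
lat = data(:,3); lon = data(:,4); pop = data(:,5);
ok = find(~isnan(lat) & ~isnan(lon) & ~isnan(pop));    // keep complete rows only
msize = 400 * pop(ok) / max(pop(ok)) + 5;              // marker area scaled to the largest population
scf(12); clf(12);
scatter(lon(ok), lat(ok), msize);
xtitle("Population of every state as a bubble chart", "Longitude", "Latitude");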
Step 10: Histograms
When dealing with numerical data corresponding to measurements, a
useful type of information is related to the data distribution. We might be
interested in knowing whether the values are concentrated around a central value, or in how many cases fall within a given interval.
In a histogram, a simple descriptive analysis can be done by partitioning
the interval between two specific values (usually the minimum and
maximum values) into a set of equally-sized segments, also called bins,
and by counting how many values fall in the different segments, as shown
in the figure on the right.
In the figure, using the command histplot, we have plotted the number of states whose life expectancy falls within each of the given age classes.
// Histograms
LEW = thrownan(data(:,16));
LEM = thrownan(data(:,17));
classes = 45:5:90; lW = length(LEW); lM = length(LEM);
scf(5); clf(5);
subplot(2,1,1)
histplot(classes, LEW, normalization=%f, style=5)
xtitle("Life expectancy at birth: Women");
subplot(2,1,2)
histplot(classes, LEM, normalization=%f, style=2)
xtitle("Life expectancy at birth: Men");
The command hist3d allows us to create histograms in three dimensions.
(Histograms of the life expectancy for women and men)
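As an illustration of the hist3d remark above, the sketch below builds a joint two-dimensional histogram of the same two life-expectancy columns and renders it with hist3d; the binning with dsearch and the reuse of the classes vector are choices made for this example, not part of the tutorial's own scripts.
// Minimal sketch: a joint 2-D histogram of life expectancy rendered with hist3d.
// Rows where either value is missing are dropped so that the (women, men) pairs stay aligned.
ok = find(~isnan(data(:,16)) & ~isnan(data(:,17)));
W = data(ok,16);               // life expectancy at birth, women
M = data(ok,17);               // life expectancy at birth, men
iw = dsearch(W, classes);      // bin index of each value (classes = 45:5:90, defined above)
im = dsearch(M, classes);
nb = length(classes) - 1;
counts = zeros(nb, nb);
for s = 1:length(ok)
    if iw(s) > 0 & im(s) > 0 then
        counts(iw(s), im(s)) = counts(iw(s), im(s)) + 1;
    end
end
scf(13); clf(13);
hist3d(counts);                // bar height = number of states in each pair of classes
xtitle("Joint distribution of life expectancy: women vs men");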
Step 11: The box-and-whisker plot
The box-and-whisker plot is also useful when we are interested in the data distribution.
The four horizontal lines of each box are the lower quartile Q1, the mean (in red), the mode (in green) and the upper quartile Q3. To read the chart we need the notion of the interquartile range, a measure of statistical dispersion equal to the difference between the upper and lower quartiles: IQR = Q3 − Q1. The lowest datum of each whisker below the boxes lies within 1.5 IQR of the lower quartile, and the highest datum of each whisker above the boxes lies within 1.5 IQR of the upper quartile. Any point not included between the whiskers is called an outlier, because it is an observation that is numerically distant from the rest of the data. In our chart, outliers are plotted as red points.
In the figure on the right we have plotted the same data used for plotting
the histograms above: the life expectancy of women and men respectively.
The function box_whiskers.sci is provided with the source code.
// Box-and-whisker
scf(6); clf(6);
subplot(1,2,1)
[outUp, outDown]=box_whiskers(LEW)
a=get("current_axes");
a.data_bounds=[0,45;2,90];
xtitle("Life expectancy at birth: Women");
subplot(1,2,2)
[outUp, outDown]=box_whiskers(LEM)
a=get("current_axes");
a.data_bounds=[0,45;2,90];
xtitle('Life expectancy at birth: Men')
(Box-and-whisker plot of the life expectancy for women and men)
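As a complement, the quartile arithmetic behind the whiskers can be reproduced directly in the console. The following is a minimal sketch that approximates the quartiles with simple order statistics; it is not the box_whiskers.sci implementation, so the exact values may differ slightly from those drawn in the figure.
// Minimal sketch of the whisker arithmetic (not the box_whiskers.sci code).
x  = gsort(LEW, 'g', 'i');            // women data sorted in increasing order
n  = length(x);
Q1 = x(max(1, round(0.25*n)));        // lower quartile (order-statistic approximation)
Q3 = x(min(n, round(0.75*n)));        // upper quartile
IQR = Q3 - Q1;                        // interquartile range
lowFence  = Q1 - 1.5*IQR;             // whisker fences
highFence = Q3 + 1.5*IQR;
outliers  = x(x < lowFence | x > highFence);
mprintf("Q1=%.1f  Q3=%.1f  IQR=%.1f  outliers: %d\n", Q1, Q3, IQR, length(outliers));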
Step 12: The parallel coordinates plot
In the parallel coordinates plot, the coordinates are represented by equally-spaced parallel vertical lines, and each data point is assigned a polyline (a continuous line composed of a sequence of segments) that intersects each vertical line at the value taken by that coordinate.
In the figures on the right we have plotted the data related to the population (considering its composition and evolution), hence the parallel coordinates correspond to the elements of our database from the fifth column up to the fifteenth one, and each polyline corresponds to a state.
In the first chart we have highlighted the polyline of the United States, while in the second one we have highlighted the United States, China and Afghanistan.
The function plot_parallel_chart.sci is provided with the source code.
// Parallel coordinates
[i1,j1] = find(S=='US'); // United States
[i2,j2] = find(S=='CN'); // China
[i3,j3] = find(S=='AF'); // Afghanistan
plot_parallel_chart(9,data(:,5:15),i1-1,S(1,5:15))
plot_parallel_chart(10,data(:,5:15),[i1-1,i2-1,i3-1],S(1,5:15))
(Parallel plot pointing out the United States)
(Parallel plot pointing out three nations)
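For readers who want to see the core mechanism without opening plot_parallel_chart.sci, the sketch below normalizes each selected column to [0, 1] and draws one polyline per state; it only illustrates the idea and is not the provided function, so details such as the handling of missing values and the highlighting of single states differ.
// Minimal sketch of the parallel-coordinates idea (not plot_parallel_chart.sci).
cols = 5:15;                               // population-related columns, as in the text
X = data(:, cols);
ok = find(sum(isnan(X), 'c') == 0);        // keep only states with no missing value
X = X(ok, :);
for j = 1:size(X, 2)                       // normalize every coordinate to [0, 1]
    X(:, j) = (X(:, j) - min(X(:, j))) / (max(X(:, j)) - min(X(:, j)));
end
scf(14); clf(14);
plot(1:size(X, 2), X');                    // one polyline per state across the 11 vertical axes
xtitle("Parallel coordinates, columns 5 to 15");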
Step 13: The correlation matrix chart and scatterplots
The correlation matrix of n random variables X_1, ..., X_n is the n x n matrix whose (i, j) entry is

corr(X_i, X_j) = E[(X_i − μ_i)(X_j − μ_j)] / (σ_i σ_j),

where μ_i is the mean of the variable X_i and σ_i is its standard deviation.
corr(X_i, X_j) is +1 in the case of a perfect positive linear relationship (correlation) between the two variables, -1 in the case of a perfect negative linear relationship (anticorrelation), and some value between -1 and +1 in all other cases, indicating the degree of linear dependence between the variables. As it approaches zero, the variables are closer to uncorrelated; the closer the coefficient is to either -1 or +1, the stronger the correlation. The correlation matrix is symmetric because the correlation between X_i and X_j is the same as the correlation between X_j and X_i.
In the figure on the top-right we have written our elements on the main diagonal (the correlation of an element with itself is 1) and the colors map the degree of correlation (dark blue stands for -1, dark red for +1).
In the figure on the bottom-right we have the scatterplot of the data corresponding to the matrix value -0.83 in the white circle, i.e. the number of annual infant deaths before the age of one year per 1,000 live births and the school life expectancy for women (in years). These two elements are strongly anticorrelated: where there are many infant deaths, school life expectancy is also short, which is typical of poor states. The red line is obtained using the command reglin, which performs a linear regression between two sets of data.
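As a concrete illustration of the quantities discussed above, the sketch below computes the correlation coefficient of the two columns involved (infant mortality, column 15, and school life expectancy for women, column 22, following the numbering in Step 3) directly from its definition and overlays the reglin regression line; it is a minimal sketch, not the linear_correlation.sci or linear_corr_matrix.sci code provided with the tutorial.
// Minimal sketch (not linear_correlation.sci): correlation of two columns and a reglin fit.
ok = find(~isnan(data(:,15)) & ~isnan(data(:,22)));    // complete pairs only
x = data(ok, 15);             // annual infant deaths per 1,000 live births
y = data(ok, 22);             // school life expectancy for women (years)
// sample correlation coefficient computed from its definition
r = sum((x - mean(x)) .* (y - mean(y))) / ((length(x) - 1) * stdev(x) * stdev(y));
disp(r);                      // the tutorial's correlation matrix reports -0.83 for this pair
// least-squares line y = a*x + b with reglin, as used for the red line in the figure
[a, b] = reglin(x', y');
scf(15); clf(15);
plot(x, y, 'b.');
plot(x, a*x + b, 'r-');
xtitle("Infant mortality vs school life expectancy for women");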
The files data_mining.sce, linear_correlation.sci and
linear_corr_matrix.sci are provided with the source code.
(Correlation matrix)
(Scatterplot with linear regression)
Step 14: Concluding remarks and References
In this tutorial we have shown that Scilab can be considered as a powerful
data mining tool, well-equipped to perform the widest possible range of
important data mining tasks.
In the right-hand column you can find a list of references for further study.
1. Scilab web page: www.scilab.org
2. Openeering web page: www.openeering.com
3. United Nations web site: http://www.un.org/
Step 15: Software content
To report a bug or suggest an improvement, please contact the Openeering team through the web site www.openeering.com.
Thank you for your attention,
Anna Bassi, Giovanni Borzi
-----------------------------
A SCILAB DATA MINING TUTORIAL
-----------------------------
--------------------
Directory: functions
--------------------
box_whiskers.sci : box and whiskers plot
linear_correlation.sci : computes the linear correlation
linear_corr_matrix.sci : linear correlation matrix plot
plot_parallel_chart.sci : parallel coordinates plot
--------------
Main directory
--------------
data_mining.sce : main scilab program
data_UN.csv : dataset
license.txt : the license file