This document presents an intelligent visualization framework for multi-dimensional data sets. The framework comprises pre-processing, feature selection, classification, rule refinement, and visualization phases. In the feature selection phase, principal component analysis and rough sets are used to select the most important features. Classification is performed via rough-set rule generation, and the resulting rules are then refined using entropy measures and genetic algorithms. Finally, the refined rules and reducts are visualized as nodes, edges, charts, and grids to help experts understand the data. Experimental results on breast cancer and prostate cancer data sets demonstrate the performance of the approach.
1. Intelligent Visualization of Multi-Dimensional Data Sets
By
Hanaa Ismail Elshazly
PhD Student
Faculty of Computers and Information, Cairo University
Department of Computer Sciences
Supervisors
Prof. Aboul Ella Hassanien & Prof. Abeer Mohamed El Korany
2. Big Picture
Multidimensional data → Reduction → Visualization

Dimensions: A dimension is a key descriptor, an index by which you can access facts according to the value (or values) you want.

Information visualization is the study of (interactive) visual representations of abstract data to reinforce human cognition. The abstract data include both numerical and non-numerical data, such as text and geographic information.
4. Introduction
General
• Massive and complex data are generated every day in many fields, due to advances in hardware and software technology.
• The curse of dimensionality is a major obstacle in machine learning and data mining.
• Clinical data describing patients' investigations contain irrelevant attributes that degrade classification performance.
• Visualization is important when analyzing multidimensional datasets, since it helps humans discover and understand complex relationships in the data.
5. Introduction
Data Problems
• Data quality
• Integrating redundant data from different sources
• Mining information from heterogeneous databases
• Difficulty in constructing a training set
• Dynamic databases
• Dimensionality
6. Introduction
Dimensionality Reduction
• In machine learning and statistics, dimensionality reduction (or dimension reduction) is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
• The most popular search methods, manageable in a low-dimensional space, can become totally unmanageable in a high-dimensional space.
• The curse of dimensionality is a major obstacle in machine learning and data mining.
• Reducing the dimensionality of the feature space leads to successful classification: selecting the optimal feature subset can substantially improve classification performance.
7. Introduction
Dimensionality Reduction: Feature Selection
FS techniques: Filter, Wrapper, Embedded
Pipeline: Massive Data → FS Techniques → Reduced Data
Massive data sources: microarray gene expression, medical images, huge databases, finance data, sensor arrays, web documents.
Benefits of feature selection:
• Improves the comprehensibility of the induced concepts
• Decreases dataset complexity
• Improves classification performance
• Saves resources
• Enables visualization
• Better understanding of extracted knowledge
• Reduces computation requirements
• Reduces the effect of the curse of dimensionality
8. Introduction
The Curse of Dimensionality
A damming factor that limits the applicability of ML techniques to real-world problems:
• Computational complexity and storage requirements
• Slow learning process
• Difficulty of inducing concepts
• Decreased predictive performance
• Extra difficulty in finding potentially useful knowledge
• Difficulty in adding visualization ability; with limited human capability, human inspection and interpretation of the data is not feasible
• Intractable behavior of search methods
• Conventional database management and data analysis tools are insufficient
11. Experimental Data Sets

Data Set                            | Source                                  | Features | Instances | Classes
Wisconsin Breast Cancer – Diagnosis | UCI (Machine Learning Repository)       | 32       | 569       | 2
Wisconsin Breast Cancer – Prognosis | UCI (Machine Learning Repository)       | 32       | 198       | 2
SPECTF Heart Dataset                | UCI (Machine Learning Repository)       | 45       | 267       | 2
Lymphography                        | University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia | 18 | 148 | 4
Indian Liver Patient Dataset        | UCI (Machine Learning Repository)       | 11       | 583       | 2
Prostate                            | UCI (Machine Learning Repository)       | 12600    | 102       | 2
12. Pre-processing Phase
Aim: Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals and replacing low-level concepts by higher-level concepts.
Techniques:
• Equal binning (discretization): transform numerical variables into categorical counterparts.
• Simplification: rescale data into the range [1, 3].
Pipeline: Multidimensional Data → Discretization → Discretized Data → Simplification → Simplified Data
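The simplification step above rescales each numeric feature into [1, 3]. A minimal sketch, assuming plain min-max rescaling (the function name `simplify` and the constant-feature handling are illustrative, not taken from the original work):

```python
import numpy as np

def simplify(values, lo=1.0, hi=3.0):
    """Min-max rescale a numeric feature into [lo, hi] (here, [1, 3])."""
    v = np.asarray(values, dtype=float)
    vmin, vmax = v.min(), v.max()
    if vmax == vmin:                          # constant feature: map to the midpoint
        return np.full_like(v, (lo + hi) / 2)
    return lo + (v - vmin) * (hi - lo) / (vmax - vmin)

scaled = simplify([10.0, 20.0, 30.0, 40.0])   # min maps to 1, max maps to 3
```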
13. Pre-processing Phase
Equal Binning Algorithm

For each feature V in data D:
    Divide the domain of V into k intervals of equal size.
    The width of the intervals is:
        w = (max(V) - min(V)) / k
    and the interval boundaries are:
        min(V) + w, min(V) + 2w, ..., min(V) + (k-1)w
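The equal-binning algorithm can be sketched in Python. The helper name `equal_width_bins` is illustrative, and the use of `np.digitize` to assign interval indices is an implementation choice, not part of the original pseudocode:

```python
import numpy as np

def equal_width_bins(values, k):
    """Assign each value of a continuous feature to one of k equal-width
    intervals, returning interval indices in {0, ..., k-1}."""
    v = np.asarray(values, dtype=float)
    w = (v.max() - v.min()) / k                  # interval width
    edges = v.min() + w * np.arange(1, k)        # min+w, min+2w, ..., min+(k-1)w
    return np.digitize(v, edges)                 # interval index for each value

bins = equal_width_bins([2.0, 4.0, 6.0, 8.0, 10.0], k=4)
# intervals: [2,4), [4,6), [6,8), [8,10] -> bins == [0, 1, 2, 3, 3]
```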
Hanaa Ismail Elshazly et al., “Rough Sets and Genetic Algorithms: A hybrid approach to breast cancer classification”, Proceedings of
the Information and Communication Technologies, (WICT), ISBN: 978-1-4673-4806-5, World Congress, IEEE, pp 260-265, 2012.
How discretization techniques influence the classification of breast cancer data (accuracy, %):

Classifier     | Entropy | Binning | Boolean Reasoning
Naïve Bayes    | 77.2    | 92.9    | 91
Decision Rules | 91.4    | 95.3    | 95.3
KNN            | 76.1    | 94.7    | 94
14. Feature Selection Phase
Aim: Determine a minimal feature subset that contributes most to accuracy and retains high efficiency in representing the original features, while neglecting the features with little contribution to the prediction process.
Techniques:
• PCA (Principal Component Analysis): a statistical technique useful in data compression and reduction.
• Rough Sets: the main goal of rough set analysis is the induction of (learning) approximations of concepts.
Pipeline: Simplified Data → PCA → Reduced Data → Rough Set (positive-region extraction → discernibility matrix) → Final Reducts
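As an illustration of the PCA step, a minimal sketch via eigendecomposition of the covariance matrix. The `pca` helper and the random example data are hypothetical; the work itself applies PCA to the real medical data sets:

```python
import numpy as np

def pca(X, n_components):
    """Illustrative PCA: project centered data onto the eigenvectors of the
    covariance matrix with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                    # center each feature
    cov = np.cov(Xc, rowvar=False)             # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order]              # reduced representation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))                 # e.g. 32 features, as in the breast data
Z = pca(X, n_components=5)                     # Z has shape (100, 5)
```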
16. Feature Selection Phase
PCA performance as a transformation method in Rotation Forest for chronic eye disease diagnosis

Many transformation methods have been applied in the literature, such as principal component analysis (PCA), nonparametric discriminant analysis (NDA), random projections (RP), and independent component analysis (ICA).
• PCA gave the best results due to the diversity it provides.
• PCA preserves the discriminatory features.
• PCA provides better results than those extracted through nonparametric discriminant analysis (NDA) or random projections.
• PCA was chosen as the transformation method in the following research papers:

Hanaa Ismail Elshazly, Abeer Mohamed El Korany, Aboul Ella Hassanien, Ahmad Taher Azar, "Ensemble classifiers for biomedical data: performance evaluation", 8th International Conference on Computer Engineering & Systems (ICCES), ISBN: 978-1-4799-0078-7, pp 184-189, 2013.
Hanaa Ismail Elshazly, Abeer Mohamed El Korany, Aboul Ella Hassanien, Mohamed Waly, "Chronic eye disease diagnosis using ensemble-based classifier", Second International Conference on Engineering and Technology (ICET), German University in Cairo, Egypt, 2014.
17. Feature Selection Phase
Rough Sets for Reduct Generation: the Discernibility Matrix

Hanaa Ismail Elshazly, Ahmad Taher Azar, Abeer Mohamed El Korany, Aboul Ella Hassanien, "Hybrid System based on Rough Sets and Genetic Algorithms for Medical Data Classifications", International Journal of Fuzzy System Applications (IJFSA), doi: 10.4018/ijfsa.2013100103, 3(4), 31-46, 2013.

Let T = (U, C, D) be a decision table, with U = {u_1, u_2, ..., u_n}. By M(T) we mean the n × n matrix (m_ij) defined as:

(1) m_ij = {c ∈ C : c(u_i) ≠ c(u_j)}   if [d(u_i)] ≠ [d(u_j)] with respect to D
    m_ij = λ                           if [d(u_i)] = [d(u_j)] with respect to D

For any u_i ∈ U, the discernibility function is

    f_T(u_i) = ∧ { ∨ m_ij : j ∈ {1, 2, ..., n} }

where ∨ m_ij is the disjunction of all variables a such that a ∈ m_ij, if m_ij ≠ ∅;
(2) t(m_ij) = false, if m_ij = ∅;
(3) t(m_ij) = true, if m_ij = λ.
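The discernibility matrix M(T) can be computed directly from its definition. A minimal sketch, assuming objects are given as tuples of condition-attribute values; the toy decision table is invented for illustration:

```python
def discernibility_matrix(objects, decisions):
    """Discernibility matrix M(T): m[i][j] is the set of condition-attribute
    indices on which objects u_i and u_j differ, when their decisions differ;
    the empty set (lambda) otherwise."""
    n = len(objects)
    m = [[set() for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if decisions[i] != decisions[j]:
                m[i][j] = {a for a, (x, y) in
                           enumerate(zip(objects[i], objects[j])) if x != y}
    return m

# toy decision table: three objects, two condition attributes, binary decision
U = [(1, 0), (1, 1), (0, 1)]
d = ["yes", "yes", "no"]
M = discernibility_matrix(U, d)
# M[0][1] is empty (same decision); M[1][2] == {0}: the two objects differ only
# on attribute 0, so that attribute alone discerns them
```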
18. Classification Phase
Aim: The learning algorithm, called a classifier, has the goal of returning a set of decision rules together with a procedure that makes it possible to classify objects not found in the original decision table.
Technique: Rough set rule generation using the discernibility matrix.
Pipeline: Final Reducts → Rule Generation → Generated Rules; Multidimensional Data → Testing → Tested Instances → Classification with Decision Rules → Classified Instances
19. Rough Set Rules Generation Algorithm

Let T = (U, C, D) be a decision table, with U = {u_1, u_2, ..., u_n}. By M(T) we mean the n × n matrix (m_ij) defined as:

(1) m_ij = {c ∈ C : c(u_i) ≠ c(u_j)}   if [d(u_i)] ≠ [d(u_j)] with respect to D
    m_ij = λ                           if [d(u_i)] = [d(u_j)] with respect to D

Here m_ij is the set of all the condition attributes that classify objects u_i and u_j into different classes.

For any u_i ∈ U,

    f_T(u_i) = ∧ { ∨ m_ij : j ∈ {1, 2, ..., n} }

where ∨ m_ij is the disjunction of all variables a such that a ∈ m_ij, if m_ij ≠ ∅;
(2) t(m_ij) = false, if m_ij = ∅;
(3) t(m_ij) = true, if m_ij = λ.
20. Comparison of Different Classifiers against Different Data Sets

Hanaa Ismail Elshazly et al., "Rough Sets and Genetic Algorithms: A hybrid approach to breast cancer classification", Proceedings of the Information and Communication Technologies, (WICT), ISBN: 978-1-4673-4806-5, World Congress, IEEE, pp 260-265, 2012.
Hanaa Ismail Elshazly et al., “Hybrid System based on Rough Sets and Genetic Algorithms for Medical Data
Classifications”, International Journal of Fuzzy System Applications (IJFSA), doi: 10.4018/ijfsa.2013100103,
3(4), 31-46, 2013.
21. Rules Refinement Phase
Aim: Reduce the number of rules so they can be easily visualized and presented to an expert, without decreasing the accuracy.
Techniques:
• Reduct evaluation using entropy
• GA using support and confidence as the fitness function
Pipeline: Generated Reducts → Reducts Evaluation (entropy) → Informative Reduct; All Rules Generated → Rules Allocation → Selected Rules → GA (testing criteria, termination) → Refined Decision Rules; Multidimensional Data → Test → Classified Instances
22. Reduct Evaluation
Decision tree algorithms depend on information gain to find the expected amount of information that would be needed to classify correctly.

Calculate the entropy of the target: Gain(T) = Entropy(T), where

    Entropy(T) = −∑_{i=1}^{c} p_i · log2(p_i)

and c ranges over the possible values of the target.

For each R_i in Reducts:
    For each X in R_i:
        Entropy(T, X) = ∑_{c ∈ X} P(c) · E(c)
    E_i = ∑_X Entropy(T, X)

Choose the R_i with the largest information gain.
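The entropy-based evaluation can be sketched as follows; the helper names and the toy attribute/target columns are illustrative:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(T) = -sum_i p_i * log2(p_i) over the target's values."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def conditional_entropy(column, labels):
    """Entropy(T, X): expected target entropy after splitting on attribute X."""
    n = len(labels)
    total = 0.0
    for v in set(column):
        sub = [t for x, t in zip(column, labels) if x == v]
        total += (len(sub) / n) * entropy(sub)
    return total

# toy column X and target T: X separates T perfectly, so the gain is maximal
X = ["a", "a", "b", "b"]
T = ["yes", "yes", "no", "no"]
gain = entropy(T) - conditional_entropy(X, T)   # 1.0 bit
```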
23. Genetic Algorithm Using Support and Confidence as Fitness Function

Body ==> Consequent [Support, Confidence]

Consequent: represents a discovered property of the examined data.
Support: the percentage of records satisfying both the body and the consequent.
Confidence: the percentage of records satisfying both the body and the consequent among those satisfying the body.
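The two fitness components can be computed by simple counting. A minimal sketch using the standard definitions (support = fraction of records satisfying both body and consequent; confidence = that count relative to records satisfying the body); the record layout and predicate representation are invented for illustration:

```python
def support_confidence(records, body, consequent):
    """Support and confidence for a rule body ==> consequent, where body and
    consequent are predicates over a record."""
    n = len(records)
    n_body = sum(1 for r in records if body(r))
    n_both = sum(1 for r in records if body(r) and consequent(r))
    support = n_both / n
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence

# toy records: (feature value, class label)
data = [(1, "malignant"), (1, "malignant"), (1, "benign"), (0, "benign")]
s, c = support_confidence(data,
                          body=lambda r: r[0] == 1,
                          consequent=lambda r: r[1] == "malignant")
# s == 0.5 (2 of 4 records), c == 2/3 (2 of the 3 records matching the body)
```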
24. Visualization Phase
The expert can manage the induced rules through trust levels that enable fast trust decisions.
Visual elements: graph nodes, edges, charts, and grids.
Pipeline: Refined Decision Rules → Measurement Calculation for Rule Support → Refined Rules with Trust Levels → Rendering of Rules & Reducts
25. Visualization of Breast Cancer Reducts
Experimental Results
Visualization of the features of the breast data set, ordered by their occurrence over all extracted reducts.
26. Visualization of Breast Cancer Rules
Visualization of global and detailed nodes representing refined classification rules of the breast data. (Figure panels: 86 R, 400 R, 87000 R.)

27. Visualization of Breast Cancer Rules
Visualization of refined breast cancer decision rules according to trust levels.

28. Visualization of Breast Cancer Rules
Navigation through refined breast cancer decision rules according to trust levels.

29. Visualization of Prostate Cancer Reducts
Visualization of all reducts of the prostate cancer data set and all features, ordered by their occurrence in all extracted reducts.

30. Visualization of Prostate Cancer Rules
Navigation through refined prostate cancer decision rules according to trust levels. (Figure panels: 26 R, 117 R, 22000 R.)

31. Visualization of Prostate Cancer Rules
Visualization of refined prostate cancer decision rules according to trust levels.

32. Visualization of Prostate Cancer Rules
Navigation through refined prostate cancer decision rules according to trust levels.
34. Conclusions
• We have presented an approach for knowledge-based classification and visualization of decision rules that enhances the classification process and improves insight into the rule knowledge.
• Physicians can identify a minimal number of rules, with trust levels, to reach an efficient diagnosis of diseases.
35. Future Work
• The promising results of the proposed approach encourage applying it to other multi-dimensional data sets.
• Other dynamic visualization techniques can be applied to meet the different requirements of physicians.