This document presents an intelligent visualization framework for multi-dimensional data sets. The framework comprises pre-processing, feature selection, classification, rule refinement, and visualization phases. In the feature selection phase, principal component analysis and rough sets are used to select the most important features. Classification is performed via rough-set rule generation, and the resulting rules are then refined using entropy measures and genetic algorithms. Finally, the refined rules and reducts are visualized as nodes, edges, charts, and grids to help experts understand the data. Experimental results on breast cancer and prostate cancer data sets demonstrate the performance of the approach.
1. Intelligent Visualization of Multi-Dimensional Data Sets
By
Hanaa Ismail Elshazly
PhD Student
Faculty of Computers and Information, Cairo University
Department of Computer Sciences
Supervisors
Prof. Aboul Ella Hassanien & Prof. Abeer Mohamed El Korany
2. Big Picture
Multidimensional data → Reduction → Visualization

Dimensions: A dimension is a key descriptor, an index by which you can access facts according to the value (or values) you want.

Information visualization is the study of (interactive) visual representations of abstract data to reinforce human cognition. The abstract data include both numerical and non-numerical data, such as text and geographic information.
4. Introduction
General
• Massive and complex data are generated every day in many fields, due to advances in hardware and software technology.
• The curse of dimensionality is a major obstacle in machine learning and data mining.
• Clinical data describing patients' investigations contain irrelevant attributes that degrade classification performance.
• Visualization is important when analyzing multidimensional datasets, since it helps humans discover and understand complex relationships in the data.
5. Introduction
Data Problems
• Data quality
• Integrating redundant data from different sources
• Mining information from heterogeneous databases
• Difficulty in constructing a training set
• Dynamic databases
• Dimensionality
6. Introduction
Dimensionality Reduction
• In machine learning and statistics, dimensionality reduction (or dimension reduction) is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
• The most popular search methods, manageable in a low-dimensional space, can become totally unmanageable in a high-dimensional space.
• The curse of dimensionality is a major obstacle in machine learning and data mining.
• Reducing the dimensionality of the feature space leads to successful classification: selecting the optimal feature subset can substantially improve classification performance.
7. Introduction
Dimensionality Reduction: Feature Selection
FS techniques: Filter, Wrapper, Embedded
Pipeline: Massive Data → FS Techniques → Reduced Data
Massive data sources: microarray gene expression, medical images, huge databases, finance data, sensor arrays, web documents.
Benefits of feature selection:
• Improves the comprehensibility of the induced concepts
• Decreases dataset complexity
• Improves classification performance
• Saves resources
• Enables visualization
• Better understanding of extracted knowledge
• Reduces computation requirements
• Reduces the effect of the curse of dimensionality
8. Introduction
The Curse of Dimensionality
A damming factor that limits the applicability of ML techniques to real-world problems:
• Computational complexity and storage requirements
• Slow learning process
• Difficulty of inducing concepts
• Decreased predictive performance
• Extra difficulty in finding potentially useful knowledge
• Difficulty in adding visualization ability; with limited human capability, human inspection and interpretation of the data is not feasible
• Intractable behavior of search methods
• Conventional database management and data analysis tools are insufficient
11. Experimental Data Sets

Data Set                            | Source                                  | Features | Instances | Classes
Wisconsin Breast Cancer – Diagnosis | UCI (Machine Learning Repository)       | 32       | 569       | 2
Wisconsin Breast Cancer – Prognosis | UCI (Machine Learning Repository)       | 32       | 198       | 2
SPECTF Heart Dataset                | UCI (Machine Learning Repository)       | 45       | 267       | 2
Lymphography                        | University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia | 18 | 148 | 4
Indian Liver Patient Dataset        | UCI (Machine Learning Repository)       | 11       | 583       | 2
Prostate                            | UCI (Machine Learning Repository)       | 12600    | 102       | 2
12. Pre-processing Phase
Aim: Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals and replacing low-level concepts by higher-level concepts.
Techniques:
• Equal binning (discretization): transform numerical variables into categorical counterparts.
• Simplification: rescale data into the range [1, 3].
Pipeline: Multidimensional Data → Discretization → Discretized Data → Simplification → Simplified Data
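The simplification step above rescales each numeric feature into [1, 3]. A minimal sketch, assuming plain min-max rescaling (the function name `simplify` and the constant-feature handling are illustrative, not taken from the original work):

```python
import numpy as np

def simplify(values, lo=1.0, hi=3.0):
    """Min-max rescale a numeric feature into [lo, hi] (here, [1, 3])."""
    v = np.asarray(values, dtype=float)
    vmin, vmax = v.min(), v.max()
    if vmax == vmin:                          # constant feature: map to the midpoint
        return np.full_like(v, (lo + hi) / 2)
    return lo + (v - vmin) * (hi - lo) / (vmax - vmin)

scaled = simplify([10.0, 20.0, 30.0, 40.0])   # min maps to 1, max maps to 3
```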
13. Pre-processing Phase
Equal Binning Algorithm

For each feature V in data D:
    Divide the domain of V into k intervals of equal size.
    The width of the intervals is:
        w = (max(V) - min(V)) / k
    and the interval boundaries are:
        min(V) + w, min(V) + 2w, ..., min(V) + (k-1)w
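The equal-binning algorithm can be sketched in Python. The helper name `equal_width_bins` is illustrative, and the use of `np.digitize` to assign interval indices is an implementation choice, not part of the original pseudocode:

```python
import numpy as np

def equal_width_bins(values, k):
    """Assign each value of a continuous feature to one of k equal-width
    intervals, returning interval indices in {0, ..., k-1}."""
    v = np.asarray(values, dtype=float)
    w = (v.max() - v.min()) / k                  # interval width
    edges = v.min() + w * np.arange(1, k)        # min+w, min+2w, ..., min+(k-1)w
    return np.digitize(v, edges)                 # interval index for each value

bins = equal_width_bins([2.0, 4.0, 6.0, 8.0, 10.0], k=4)
# intervals: [2,4), [4,6), [6,8), [8,10] -> bins == [0, 1, 2, 3, 3]
```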
Hanaa Ismail Elshazly et al., “Rough Sets and Genetic Algorithms: A hybrid approach to breast cancer classification”, Proceedings of
the Information and Communication Technologies, (WICT), ISBN: 978-1-4673-4806-5, World Congress, IEEE, pp 260-265, 2012.
How discretization techniques influence the classification of breast cancer data (accuracy, %):

Classifier     | Entropy | Binning | Boolean Reasoning
Naïve Bayes    | 77.2    | 92.9    | 91
Decision Rules | 91.4    | 95.3    | 95.3
KNN            | 76.1    | 94.7    | 94
14. Feature Selection Phase
Aim: Determine a minimal feature subset that contributes most to accuracy and retains high efficiency in representing the original features, while neglecting the features with little contribution to the prediction process.
Techniques:
• PCA (Principal Component Analysis): a statistical technique useful in data compression and reduction.
• Rough Sets: the main goal of rough set analysis is the induction of (learning) approximations of concepts.
Pipeline: Simplified Data → PCA → Reduced Data → Rough Set (positive-region extraction → discernibility matrix) → Final Reducts
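As an illustration of the PCA step, a minimal sketch via eigendecomposition of the covariance matrix. The `pca` helper and the random example data are hypothetical; the work itself applies PCA to the real medical data sets:

```python
import numpy as np

def pca(X, n_components):
    """Illustrative PCA: project centered data onto the eigenvectors of the
    covariance matrix with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                    # center each feature
    cov = np.cov(Xc, rowvar=False)             # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order]              # reduced representation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))                 # e.g. 32 features, as in the breast data
Z = pca(X, n_components=5)                     # Z has shape (100, 5)
```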
16. Feature Selection Phase
PCA performance as a transformation method in Rotation Forest for chronic eye disease diagnosis

Many transformation methods have been applied in the literature, such as principal component analysis (PCA), nonparametric discriminant analysis (NDA), random projections (RP), and independent component analysis (ICA).
• PCA gave the best results due to the diversity it provides.
• PCA preserves the discriminatory features.
• PCA provides better results than those extracted through nonparametric discriminant analysis (NDA) or random projections.
• PCA was chosen as the transformation method in the following research papers:

Hanaa Ismail Elshazly, Abeer Mohamed El Korany, Aboul Ella Hassanien, Ahmad Taher Azar, "Ensemble classifiers for biomedical data: performance evaluation", 8th International Conference on Computer Engineering & Systems (ICCES), ISBN: 978-1-4799-0078-7, pp 184-189, 2013.
Hanaa Ismail Elshazly, Abeer Mohamed El Korany, Aboul Ella Hassanien, Mohamed Waly, "Chronic eye disease diagnosis using ensemble-based classifier", Second International Conference on Engineering and Technology (ICET), German University in Cairo, Egypt, 2014.
17. Feature Selection Phase
Rough Sets for Reduct Generation: the Discernibility Matrix

Hanaa Ismail Elshazly, Ahmad Taher Azar, Abeer Mohamed El Korany, Aboul Ella Hassanien, "Hybrid System based on Rough Sets and Genetic Algorithms for Medical Data Classifications", International Journal of Fuzzy System Applications (IJFSA), doi: 10.4018/ijfsa.2013100103, 3(4), 31-46, 2013.

Let T = (U, C, D) be a decision table, with U = {u_1, u_2, ..., u_n}. By M(T) we mean the n × n matrix (m_ij) defined as:

(1) m_ij = {c ∈ C : c(u_i) ≠ c(u_j)}   if [d(u_i)] ≠ [d(u_j)] with respect to D
    m_ij = λ                           if [d(u_i)] = [d(u_j)] with respect to D

For any u_i ∈ U, the discernibility function is

    f_T(u_i) = ∧ { ∨ m_ij : j ∈ {1, 2, ..., n} }

where ∨ m_ij is the disjunction of all variables a such that a ∈ m_ij, if m_ij ≠ ∅;
(2) t(m_ij) = false, if m_ij = ∅;
(3) t(m_ij) = true, if m_ij = λ.
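The discernibility matrix M(T) can be computed directly from its definition. A minimal sketch, assuming objects are given as tuples of condition-attribute values; the toy decision table is invented for illustration:

```python
def discernibility_matrix(objects, decisions):
    """Discernibility matrix M(T): m[i][j] is the set of condition-attribute
    indices on which objects u_i and u_j differ, when their decisions differ;
    the empty set (lambda) otherwise."""
    n = len(objects)
    m = [[set() for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if decisions[i] != decisions[j]:
                m[i][j] = {a for a, (x, y) in
                           enumerate(zip(objects[i], objects[j])) if x != y}
    return m

# toy decision table: three objects, two condition attributes, binary decision
U = [(1, 0), (1, 1), (0, 1)]
d = ["yes", "yes", "no"]
M = discernibility_matrix(U, d)
# M[0][1] is empty (same decision); M[1][2] == {0}: the two objects differ only
# on attribute 0, so that attribute alone discerns them
```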
18. Classification Phase
Aim: The learning algorithm, called a classifier, has the goal of returning a set of decision rules together with a procedure that makes it possible to classify objects not found in the original decision table.
Technique: Rough set rule generation using the discernibility matrix.
Pipeline: Final Reducts → Rule Generation → Generated Rules; Multidimensional Data → Testing → Tested Instances → Classification with Decision Rules → Classified Instances
19. Rough Set Rules Generation Algorithm

Let T = (U, C, D) be a decision table, with U = {u_1, u_2, ..., u_n}. By M(T) we mean the n × n matrix (m_ij) defined as:

(1) m_ij = {c ∈ C : c(u_i) ≠ c(u_j)}   if [d(u_i)] ≠ [d(u_j)] with respect to D
    m_ij = λ                           if [d(u_i)] = [d(u_j)] with respect to D

Here m_ij is the set of all the condition attributes that classify objects u_i and u_j into different classes.

For any u_i ∈ U,

    f_T(u_i) = ∧ { ∨ m_ij : j ∈ {1, 2, ..., n} }

where ∨ m_ij is the disjunction of all variables a such that a ∈ m_ij, if m_ij ≠ ∅;
(2) t(m_ij) = false, if m_ij = ∅;
(3) t(m_ij) = true, if m_ij = λ.
20. Comparison of Different Classifiers against Different Data Sets

Hanaa Ismail Elshazly et al., "Rough Sets and Genetic Algorithms: A hybrid approach to breast cancer classification", Proceedings of the Information and Communication Technologies, (WICT), ISBN: 978-1-4673-4806-5, World Congress, IEEE, pp 260-265, 2012.
Hanaa Ismail Elshazly et al., “Hybrid System based on Rough Sets and Genetic Algorithms for Medical Data
Classifications”, International Journal of Fuzzy System Applications (IJFSA), doi: 10.4018/ijfsa.2013100103,
3(4), 31-46, 2013.
21. Rules Refinement Phase
Aim: Reduce the number of rules so they can be easily visualized and presented to an expert, without decreasing the accuracy.
Techniques:
• Reduct evaluation using entropy
• GA using support and confidence as the fitness function
Pipeline: Generated Reducts → Reducts Evaluation (entropy) → Informative Reduct; All Rules Generated → Rules Allocation → Selected Rules → GA (testing criteria, termination) → Refined Decision Rules; Multidimensional Data → Test → Classified Instances
22. Reduct Evaluation
Decision tree algorithms depend on information gain to find the expected amount of information that would be needed to classify correctly.

Calculate the entropy of the target: Gain(T) = Entropy(T), where

    Entropy(T) = −∑_{i=1}^{c} p_i · log2(p_i)

and c ranges over the possible values of the target.

For each R_i in Reducts:
    For each X in R_i:
        Entropy(T, X) = ∑_{c ∈ X} P(c) · E(c)
    E_i = ∑_X Entropy(T, X)

Choose the R_i with the largest information gain.
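The entropy-based evaluation can be sketched as follows; the helper names and the toy attribute/target columns are illustrative:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(T) = -sum_i p_i * log2(p_i) over the target's values."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def conditional_entropy(column, labels):
    """Entropy(T, X): expected target entropy after splitting on attribute X."""
    n = len(labels)
    total = 0.0
    for v in set(column):
        sub = [t for x, t in zip(column, labels) if x == v]
        total += (len(sub) / n) * entropy(sub)
    return total

# toy column X and target T: X separates T perfectly, so the gain is maximal
X = ["a", "a", "b", "b"]
T = ["yes", "yes", "no", "no"]
gain = entropy(T) - conditional_entropy(X, T)   # 1.0 bit
```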
23. Genetic Algorithm Using Support and Confidence as Fitness Function

Body ==> Consequent [Support, Confidence]

Consequent: represents a discovered property of the examined data.
Support: the percentage of records satisfying both the body and the consequent.
Confidence: the percentage of records satisfying both the body and the consequent among those satisfying the body.
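The two fitness components can be computed by simple counting. A minimal sketch using the standard definitions (support = fraction of records satisfying both body and consequent; confidence = that count relative to records satisfying the body); the record layout and predicate representation are invented for illustration:

```python
def support_confidence(records, body, consequent):
    """Support and confidence for a rule body ==> consequent, where body and
    consequent are predicates over a record."""
    n = len(records)
    n_body = sum(1 for r in records if body(r))
    n_both = sum(1 for r in records if body(r) and consequent(r))
    support = n_both / n
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence

# toy records: (feature value, class label)
data = [(1, "malignant"), (1, "malignant"), (1, "benign"), (0, "benign")]
s, c = support_confidence(data,
                          body=lambda r: r[0] == 1,
                          consequent=lambda r: r[1] == "malignant")
# s == 0.5 (2 of 4 records), c == 2/3 (2 of the 3 records matching the body)
```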
24. Visualization Phase
The expert can manage the induced rules through trust levels that enable fast trust decisions.
Visual elements: graph nodes, edges, charts, and grids.
Pipeline: Refined Decision Rules → Measurement Calculation for Rule Support → Refined Rules with Trust Levels → Rendering of Rules & Reducts
25. Visualization of Breast Cancer Reducts
Experimental Results
Visualization of the features of the breast data set, ordered by their occurrence over all extracted reducts.
26. Visualization of Breast Cancer Rules
Visualization of global and detailed nodes representing refined classification rules of the breast data. (Figure panels: 86 R, 400 R, 87000 R.)

27. Visualization of Breast Cancer Rules
Visualization of refined breast cancer decision rules according to trust levels.

28. Visualization of Breast Cancer Rules
Navigation through refined breast cancer decision rules according to trust levels.

29. Visualization of Prostate Cancer Reducts
Visualization of all reducts of the prostate cancer data set and all features, ordered by their occurrence in all extracted reducts.

30. Visualization of Prostate Cancer Rules
Navigation through refined prostate cancer decision rules according to trust levels. (Figure panels: 26 R, 117 R, 22000 R.)

31. Visualization of Prostate Cancer Rules
Visualization of refined prostate cancer decision rules according to trust levels.

32. Visualization of Prostate Cancer Rules
Navigation through refined prostate cancer decision rules according to trust levels.
34. Conclusions
• We have presented an approach for knowledge-based classification and visualization of decision rules that enhances the classification process and improves insight into the rule knowledge.
• Physicians can identify a minimal number of rules, with trust levels, to reach an efficient diagnosis of diseases.
35. Future Work
• The promising results of the proposed approach encourage applying it to other multi-dimensional data sets.
• Other dynamic visualization techniques can be applied to meet the different requirements of physicians.