Method for Missing Data
Imputation in Big Data
technologies
Nataliya Shakhovska,
Head of the Artificial Intelligence Department
Lviv Polytechnic National University
Data Cleaning
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• history or changes of the data were not registered
• Missing data may need to be inferred
Problem formulation
• The nature of gaps
• missing completely at random (MCAR),
• missing at random (MAR)
• and missing not at random (MNAR)
• The limitation of existing methods
• Methods based on the skip generation model
• Methods of imputation (restoration)
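The difference between the first two gap mechanisms can be illustrated with a small simulation. This is a hedged sketch on toy data: under MCAR every value is dropped with the same probability, while under MAR the drop probability depends on another, observed attribute (the age threshold and probabilities below are arbitrary assumptions).

```python
import random

random.seed(0)

# Toy records: (age, income); we delete income values to create gaps.
rows = [(a, 1000 + 10 * a) for a in range(20, 70)]

def mcar(rows, p=0.3):
    """Missing completely at random: every income is dropped with the same probability p."""
    return [(a, None if random.random() < p else inc) for a, inc in rows]

def mar(rows):
    """Missing at random: the drop probability depends on the OBSERVED age,
    not on the (possibly hidden) income value itself."""
    return [(a, None if random.random() < (0.6 if a < 40 else 0.1) else inc)
            for a, inc in rows]

mcar_rows = mcar(rows)
mar_rows = mar(rows)
print(sum(inc is None for _, inc in mcar_rows), "MCAR gaps")
print(sum(inc is None for a, inc in mar_rows if a < 40), "MAR gaps among age < 40")
```

MNAR is not simulated here, since there the drop probability would depend on the missing value itself and cannot be recovered from the observed data alone.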
Existing methods of data imputation
• Mean substitution
• Hot-deck imputation (randomly)
• Cold-deck imputation (constant)
• Model-based
• Regression
• EM
• RF
• KNN
• SOM
• SVM
[Figure: regression imputation]
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the
task is classification; not effective in certain cases)
• Fill in the missing value manually: tedious and often infeasible
• Use a global constant to fill in the missing value: e.g., “unknown”, a
new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples of the same class to fill
in the missing value: smarter
• Use the most probable value to fill in the missing value:
inference-based such as regression, Bayesian formula, decision tree
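The attribute-mean and the "smarter" class-conditional-mean strategies above can be sketched in pure Python (illustrative toy data; `None` marks a missing value):

```python
from statistics import mean

# Toy samples: (class_label, attribute_value); None marks a gap.
data = [("A", 10.0), ("A", 14.0), ("A", None), ("B", 100.0), ("B", None)]

def impute_global_mean(data):
    """Fill every gap with the overall attribute mean."""
    m = mean(v for _, v in data if v is not None)
    return [(c, m if v is None else v) for c, v in data]

def impute_class_mean(data):
    """Fill a gap with the mean of samples from the SAME class (the 'smarter' variant)."""
    by_class = {}
    for c, v in data:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    return [(c, mean(by_class[c]) if v is None else v) for c, v in data]

print(impute_global_mean(data))
print(impute_class_mean(data))  # A's gap -> 12.0, B's gap -> 100.0
```

The contrast shows why the class-conditional variant is preferred: the global mean (about 41.3 here) is far from both classes, while the per-class means stay inside each class's range.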
Mean imputation has an impact on attribute variability.
[Figure: regression imputation]
Use of noise filtering techniques in
classification
The three noise filters mentioned next, which are the best known, use a voting
scheme to determine which cases have to be removed from the training set:
• Ensemble Filter (EF)
• Cross-Validated Committees Filter
• Iterative-Partitioning Filter
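The shared voting idea behind these filters can be sketched in a few lines. This is an illustrative reduction only: the real EF/CVCF/IPF train several learners via cross-validation or partitioning, whereas here three trivial threshold "classifiers" stand in for them.

```python
# Toy dataset: (feature, label); the last case is likely label noise.
data = [(1.0, 0), (1.2, 0), (0.9, 0), (3.0, 1), (3.2, 1), (1.1, 1)]

# Stand-in learners: each predicts the class from a simple threshold on the feature.
classifiers = [
    lambda x: int(x > 2.0),
    lambda x: int(x > 1.8),
    lambda x: int(x > 2.2),
]

def consensus_filter(data, min_votes=2):
    """Remove a case when at least min_votes classifiers disagree with its label
    (min_votes = len(classifiers) gives consensus voting, a smaller value majority voting)."""
    kept = []
    for x, y in data:
        wrong_votes = sum(clf(x) != y for clf in classifiers)
        if wrong_votes < min_votes:
            kept.append((x, y))
    return kept

print(consensus_filter(data))  # the mislabeled (1.1, 1) case is filtered out
```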
Ensemble Filter (EF)
Cross-Validated Committees Filter (CVCF)
Iterative Partitioning Filter (IPF)
INFFC: An iterative class noise filter based on the fusion of classifiers with noise sensitivity control
By the way, what about the nature of Big data?
The Big Data Model in the Task of Missing Data
Recovery
The Big data schema Bd is the finite set of attributes with exact values $\{A_1, A_2, \ldots, A_n\}$, the set of attributes with indefinite, inexact or missing values $\{A\_unk_1, A\_unk_2, \ldots, A\_unk_p\}$, and the set of values of membership functions for inexact attributes $\{Unk_1, Unk_2, \ldots, Unk_m\}$. In addition, the catalog of the attributes (features) with schema Cg and the synonym dictionary with schema Dic should be used:

$$Bd = \left\langle \{A_1, A_2, \ldots, A_n\}, \{A\_unk_1, A\_unk_2, \ldots, A\_unk_p\}, \{Unk_1, Unk_2, \ldots, Unk_m\}, Dic, Cg \right\rangle \quad (1)$$

The attributes of the A_unk set are considered indeterminate, and the level of confidence in them is stored in the attributes of the set Unk.
A binary relation Rel is used to show the relationships between the attributes of the A_unk and Unk sets; its values are determined based on the sample view of the source and on the Cg data directory:

$$rel_{ij} = \begin{cases} 1, & \arg(Unk_j) = \arg(A\_unk_i) \in Dic \\ 0, & \text{otherwise} \end{cases}, \qquad Rel = \{rel_{ij}\} \subseteq Cg, \quad i = \overline{1,p},\ j = \overline{1,m} \quad (2)$$
Examples of consolidated data tuples for
different types of information resources

1. Relational database: an extended relational tuple $t_{rel}$ is used:

$$bd = \langle t_{rel}, Unk \rangle, \qquad t_{rel} = \{a_1, \ldots, a_n\} \cup \{a\_unk_1, \ldots, a\_unk_m\}, \quad (4)$$

where $\{a_1, \ldots, a_n\}$ are the values of exact attributes and $\{a\_unk_1, \ldots, a\_unk_m\}$ are the values of attributes with uncertainty.

2. Data Warehouse combines fact and dimension data. A set of measurement values and fact characteristics is presented as a tuple $t_{dw}$:

$$bd = \langle t_{dw}, Unk \rangle, \qquad t_{dw} = \{a_{11}, \ldots, a_{1s_1}\} \cup \{a\_unk_{11}, \ldots, a\_unk_{1m_1}\} \cup \ldots \cup \{a_{k1}, \ldots, a_{ks_k}\} \cup \{a\_unk_{k1}, \ldots, a\_unk_{km_k}\} \cup \{a_{rf1}, \ldots, a_{rfi}\} \cup \{a\_unk_{rf1}, \ldots, a\_unk_{rft}\}, \quad (5)$$

where $a_{ij}$ is the value of exact characteristic $j$ in dimension $i$, $a_{rfj}$ is the value of characteristic $j$ of the fact table, $a\_unk_{ij}$ is the value of the attribute with uncertainty $j$ from dimension $i$, and $a\_unk_{rfj}$ is the value of the attribute with uncertainty $j$ from the fact table.

3. Semi-structured text describes the values of the nodes of the semantic networks and the degree of affiliation of these values to the objects whose names are described in the synonym dictionary; the tuple $t_{text}$ is:

$$bd = \langle t_{text}, Unk \rangle, \qquad t_{text} = \{a_1, \ldots, a_n\} \cup \{a_{unk_1}, \ldots, a_{unk_m}\}. \quad (6)$$
Probabilistic Production Dependencies Mining

A Probabilistic Production Dependency (PPD) is a production rule in the basic relation selection that is valid for a significant number of entities in that selection. The significance threshold should be determined by experts or based on calculations of the probability of erroneous selection of this dependency. The main difference between association rules and PPD is that PPD are generated from the functional dependencies (FD) existing in the dataset:

$$F_I : \left\langle \{K = \{a_i\},\ a_i \in A\},\ \{D = \{a_j\},\ a_j \in A\},\ P(k \in K \rightarrow d \in D) = p \right\rangle, \quad (8)$$

where $k$ and $d$ are the tuples of the groups of attributes $K$ and $D$, respectively.
The main indicator of the reliability of such a dependency is the ratio of the number of objects that have such a PPD to the number of objects in the selection:

$$P(F_I) = \frac{\left| \{ r : k \in K \wedge d \in D \} \right|}{\left| \{ r : k \in K \} \right|}. \quad (9)$$

A classification rule is a probabilistic production relationship between the subsets of attributes X and Y in the consolidated Big data Bd, which occurs in the training set bd with trust level s, where:

$$(X = x) \rightarrow (Y = y). \quad (10)$$
Association rules measures

Trust level (confidence) is the ratio of the number of objects that have such a PPD to the number of objects in the selection:

$$Conf(S \rightarrow T) = P(T \mid S) = \frac{\left| (S \wedge T)(r) \right|}{\left| S(r) \right|}. \quad (15)$$

Support level is the characteristic of a selection predicate in a relation, calculated as the ratio of the number of objects that satisfy predicate P to the total number of objects in the relation:

$$Supp(P) = \frac{\left| P(r) \right|}{\left| r \right|}. \quad (16)$$

When calculating the level of support for a PPD, the conditional and the resulting dependency predicates are combined by a conjunction:

$$Supp(S \rightarrow T) = Supp(S \wedge T) = \frac{\left| (S \wedge T)(r) \right|}{\left| r \right|}. \quad (17)$$

Using this concept, the level of trust can be calculated as:

$$Conf(S \rightarrow T) = \frac{Supp(S \wedge T)}{Supp(S)}. \quad (18)$$

The level of improvement is calculated as the ratio of the levels of trust and support of the PPD:

$$Imp(S \rightarrow T) = \frac{Conf(S \rightarrow T)}{Supp(T)} = \frac{Supp(S \wedge T)}{Supp(S) \cdot Supp(T)}. \quad (19)$$

Total mutual information is generally defined as:

$$I_{XY} = \sum_{i=1}^{n} \sum_{j=1}^{m} P_{ij} \log_2 \frac{P_{ij}}{p_i r_j}, \quad (20)$$
Algorithm 1: PPD mining

Input: Big data dataset bd, Card(Bd) = n
Output: PPDlist
1: PPDlist = {}
2: X = {}
3: for (i = 1; i < n; i++)
4:   X = X ∪ A_i
5:   Group entities with the same values for the set of attributes X;
6:   Search for entities that have the same values for the set of synonyms of attributes X;
7:   Y = {}
8:   for (j = i + 1; j <= n; j++)
9:     Y = Y ∪ A_j
10:    Calculate Supp and Conf over the set of entities obtained in steps 5 and 6;
11:    Calculate the Imp of the tuple sources obtained in steps 5 and 6;
12:    Identify the entities with the highest levels of confidence and add X → Y to PPDlist.
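A simplified sketch of the mining loop follows. It is hedged in two ways: it mines only single-attribute X → Y rules over ordered attribute pairs, and it omits the synonym-dictionary grouping of step 6; the data and thresholds are invented for illustration.

```python
from itertools import combinations

# Toy dataset: each row is a dict of attribute values.
rows = [
    {"A1": "x", "A2": "p", "A3": 1},
    {"A1": "x", "A2": "p", "A3": 1},
    {"A1": "x", "A2": "q", "A3": 1},
    {"A1": "y", "A2": "q", "A3": 0},
]

def mine_ppd(rows, attrs, min_supp=0.25, min_conf=0.7):
    ppd_list = []
    n = len(rows)
    for x, y in combinations(attrs, 2):      # ordered pairs X before Y only
        for xv in {r[x] for r in rows}:
            # group entities sharing the value of X (step 5)
            group = [r for r in rows if r[x] == xv]
            supp = len(group) / n
            # the most frequent Y value inside the group is the rule's consequent
            best = max({r[y] for r in group},
                       key=lambda yv: sum(r[y] == yv for r in group))
            conf = sum(r[y] == best for r in group) / len(group)
            # keep only rules above both thresholds (steps 10-12)
            if supp >= min_supp and conf >= min_conf:
                ppd_list.append(((x, xv), (y, best), supp, conf))
    return ppd_list

for rule in mine_ppd(rows, ["A1", "A2", "A3"]):
    print(rule)
```

Here A1 = "x" determines A3 = 1 on three of four rows, so the rule survives with support 0.75 and confidence 1.0, while A1 = "x" → A2 = "p" is rejected at confidence 2/3.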
Algorithm 2: Recovery algorithm

Input: Big data dataset bd with schema Bd and missed data ∅, PPDlist
Output: Completeness level of bd
1: Completeness = 0
2: If {bd_1(X_1)↓, …, bd_1(X_n)↓} and {bd_2(X_1)↓, …, bd_2(X_n)↓} and
   {bd_1(X_1) = bd_2(X_1), …, bd_1(X_n) = bd_2(X_n)} and bd_1(Y)↓ and bd_2(Y) = ∅ and
   (X_1 ∉ Dic) and {X_1 → Y in PPDlist}
3: then
   change ∅ to bd_1(Y)
   P_i(bd_1) = P_i(bd_1) · m/n
   Completeness++
4: If {bd_1(X_1)↓, …, bd_1(X_n)↓} and
   {m from n values of attributes are ↓ in bd_2, n − m values of attributes are ∅, m < n} and
   P ≥ 1 − m/n and {for the defined values bd_1(X_m) = bd_2(X_m)} and
   {bd_1(Y)↓, bd_2(Y) = ∅} and {X_2 → Y in PPDlist}
5: then
   change ∅ to bd_2(Y)
   P_i(bd_2) = P_i(bd_2) · m/n
6: If {m_i from n values of attributes are ∅ in bd_i, m_i < n} and
   {m_j from n values of attributes are ∅ in bd_j, m_j < n} and
   {for exact values bd_i(X_m) = bd_2(X_m)} and {for exact values bd_j(X_m) = bd_2(X_m)} and
   P ≥ 1 − m_i/n − m_j/n and
   {bd_i(Y)↓} and {bd_j(Y)↓} and {bd_2(Y) = ∅} and {X_2, X_j → Y in PPDlist}
7: then
   change ∅ to bd_j(Y)
   Completeness++
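The core of the recovery step (rules 2-3: when another tuple agrees with the incomplete one on X and the rule X → Y is in PPDlist, copy its Y value into the gap) can be sketched as follows. This is an illustrative reduction: single-attribute rules only, no confidence bookkeeping, and the attribute names are invented.

```python
# Assumed PPDlist entries: (X attribute, X value, Y attribute, Y value to impute).
ppd_list = [("region", "Lviv", "lang", "uk")]  # rule: region = Lviv -> lang = uk

def recover(records, ppd_list):
    """Fill gaps (None) using matching PPD rules; return the count of imputed values."""
    completeness = 0
    for rec in records:
        for x_attr, x_val, y_attr, y_val in ppd_list:
            # the rule fires only when Y is missing and X matches the antecedent
            if rec.get(y_attr) is None and rec.get(x_attr) == x_val:
                rec[y_attr] = y_val      # "change the gap to bd_1(Y)"
                completeness += 1
    return completeness

records = [{"region": "Lviv", "lang": None}, {"region": "Kyiv", "lang": None}]
filled = recover(records, ppd_list)
print(filled, records[0]["lang"])  # 1 uk
```

The second record keeps its gap because no rule covers region = Kyiv, mirroring how the algorithm only imputes where a mined dependency applies.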
Sequential dependencies

We determine the parent moving operator

$$Up\_c(bd) = \sigma_{X = x_1,\ Dic(X)}\big(bd(X \rightarrow Y, Dic)\big), \quad (27)$$

where $x_1$ is the value of the primary key (child) and $x_2$ is the foreign key (parent), and
the child moving operator

$$Down\_c(bd) = \sigma_{X = x_2,\ Dic(X)}(bd) \cup \sigma_{Y = x_2,\ Dic(Y)}(bd). \quad (28)$$

Based on the above, we may build the operator for recovering the missing data in the sequence of the events (or entities):

$$Recovery(bd_{cons}) = Heir\_c(bd) = Up\_c(bd) \cup Down\_c(bd), \quad (29)$$

applied to the tuples of $\langle Bd, Unk, P, Dic \rangle$: the values are taken from the tuples selected by $\sigma_{X = x,\ Val = v,\ Dic(X)}$ and substituted into the tuples selected by $\sigma_{X = x,\ Val = Null,\ Dic(X)}$.
Complexity estimation

$$t = O(t_{stat} + t_{ma})$$

$$t_{stat} = O(n \cdot m) + O(n \cdot \log m) + O\!\left(\left(\frac{n}{minSupport}\right)^{avgD} \log m\right) = O\!\left(\left(\frac{n}{minSupport}\right)^{avgD} (1 + \log m)\right),$$

$$M_{stat} = O\!\left(\left(\frac{n}{minSupport}\right)^{avgD} (1 + \log m)\right).$$

At this stage, the input relation with the data passes through each tuple,
so the time of this stage is directly proportional to the relation size n.
If we use the functional dependencies as elementary dependencies for PPD
generation, then:

$$t_{ma} = O\!\left(Z_{aggr} \cdot sz_{aggr} \cdot \log_2(m \cdot |D| \cdot |A|)\right), \qquad M_{ma} = O(Z_{ma} \cdot sz_{ma}),$$

$$t_{el} = O\!\left(\left(\frac{n}{minSupport}\right)^{avgD} (1 + \log_2 m) + Z_{aggr} \cdot sz_{aggr} \cdot \log_2(m \cdot |D| \cdot |A|)\right).$$
Parallel mode – time complexity estimation

$$t_k = O\!\left(\left(\frac{n}{k \cdot minSupport}\right)^{avgD} (1 + \log_2 m) + \frac{Z_{aggr} \cdot sz_{aggr}}{k} \cdot \log_2(m \cdot |D| \cdot |A|)\right)$$

In the case of parallel computing on a large number of processors, we can neglect the second term of the function
$t_k$. For the same reason, $\log_2 n = O(minSupport)$. The asymptotic estimate of the algorithm execution time (on a system
of k processors) is:

$$t = O\!\left(\left(\frac{n}{k \cdot minSupport}\right)^{avgD} \log m\right)$$
Experimental Results and Discussion
To run the experiment, the dataset from the Big Cities Health Inventory Data Platform ("BCHC
Data Platform") is used. The platform contains over 18,000 data points across more than 50 health,
socio-economic, urban (information from smart sensors) and demographic indicators across 11
categories in the United States. This is an example of semi-structured data with hidden dependencies
inside entities and between entities.
The missing data are modeled by randomly deleting exact values.
The proposed and existing methods are tested on the same hardware: Intel Core 5 Quad E6600 2.4
GHz, 16 GB RAM, HDD WD 2 TB 7200 RPM. The criteria of PPD creation are Conf(F1) > 0.7 and
minSupport = 100. RStudio is used for data modelling and analysis.
Comparison with existing approach

[Figures: normalized root-mean-square error; the analysis of the percentage of correctly recovered data]

Time of analysis (min), depending on the amount of analyzed data:

Amount of records   SVM   AR     EM   PPD (proposed method)
2 000               31    190    9    8
4 000               49    362    23   19
6 000               95    437    37   31
8 000               144   639    49   42
10 000              210   827    62   51
18 000              286   1019   77   65

[Figure: acceleration graph of the developed algorithm vs. amount of servers (1-20), for m = 41, 20, 10, 5]
Problem formulation
• To find frequent patterns
• To find parameters affected by COVID-19
Dataset description
• Dataset is collected with the support of the Central European Initiative and
verified by the Lviv regional center of COVID-19 resistance
• The project Stop COVID-19 has use cases implemented in Ukraine and
Belarus. Partners from Germany shared the Google form as well. The data were
collected over the period from September 01 to October 29.
• The dataset provides data on COVID-19 unconfirmed and confirmed cases:
• Age (categorical): 1: <15, 2: 16-22, 3: 23-40, 4: 41-65, 5: >66,
• Sex (categorical): male, female,
• Region (string): Lviv (Ukraine), Chernivtsi (Ukraine), Belarus, Germany, Other,
• Do you smoke (Boolean): 2: yes, 0: no,
• Have you had COVID (categorical): 2: yes, 0: no, 1: maybe,
• IgM level (numerical): [0..0.9) (negative), [0.9..1.1) (indefinite), >=1.1 (positive),
• IgG level (numerical): [0..0.9) (negative), [0.9..1.1) (indefinite), >=1.1 (positive),
• Blood group (categorical): 1, 2, 3, 4,
• Are you vaccinated against influenza? (categorical): 2: yes, 0: no, 1: maybe,
• Are you vaccinated against tuberculosis? (categorical): 2: yes, 0: no, 1: maybe,
• Have you had influenza this year? (categorical): 2: yes, 0: no, 1: maybe,
• Have you had tuberculosis this year? (categorical): 2: yes, 0: no, 1: maybe.
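The categorical encodings above can be made concrete with a small sketch (an assumption for illustration: raw answers arrive as numbers from the form and are mapped to the codes listed):

```python
# Age bins from the slide: 1: <15, 2: 16-22, 3: 23-40, 4: 41-65, 5: >66.
AGE_BINS = [(15, 1), (22, 2), (40, 3), (65, 4)]

def encode_age(age):
    for upper, code in AGE_BINS:
        if age <= upper:
            return code
    return 5  # ages above the last bin (slide code 5: >66)

def encode_ig(level):
    """IgM/IgG level: [0..0.9) negative, [0.9..1.1) indefinite, >= 1.1 positive."""
    if level < 0.9:
        return "negative"
    if level < 1.1:
        return "indefinite"
    return "positive"

print(encode_age(35), encode_ig(1.05))  # 3 indefinite
```

Note that the slide's age codes leave a small gap (code 4 ends at 65, code 5 starts at >66); the sketch folds ages above 65 into code 5.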
[Figure: number of respondents by blood group (1-4)]
Preprocessing stage
Clustering

[Figure: clustering of respondents by Age (clusters 1-4)]
Classification
The accuracy is equal to 0.5135, but this model allows choosing the main
features: "Have you had influenza this year", Sex, Blood group, Region.
RF - Confusion matrix

       0     1     2    Class error
0      153   9     17   0.14525139
1      8     88    12   0.18518518
2      1     3     20   0.16666666
Models' accuracy for the whole feature set

Model                    Full dataset   Filtered by Ukraine   Filtered by Belarus   Filtered by Germany   Cluster 1   Cluster 2   Cluster 3   Cluster 4
Logistic regression      0.553          0.572                 0.534                 0.544                 0.601       0.592       0.610       0.589
Support vector machine   0.605          0.6327                0.570                 0.584                 0.621       0.694       0.635       0.637
Naive Bayes              0.670          0.693                 0.655                 0.655                 0.674       0.693       0.672       0.692
XGBoost                  0.898          0.932                 0.860                 0.942                 0.941       0.945       0.899       0.957
Random Forest            0.897          0.924                 0.859                 0.940                 0.932       0.944       0.961       0.925
Neural network           0.820          0.849                 0.828                 0.79                  0.830       0.849       0.8204      0.849
Decision tree            0.513          0.542                 0.517                 0.492                 0.553       0.631       0.612       0.642
Models’ accuracy for selected features
Model*
Age, IgG,
Blood_group,
had_influenz, IgM
Age, Sex,
Blood_group,
had_influenz
Logistic regression 0.633 0.671
Support vector
machine 0.671 0.722
Naive Bayes 0.674 0.732
XGBoost 0.935 0.945
RandomForest 0.945 0.934
Neural network 0.832 0.845
Decision tree 0.553 0.631
Hierarchical classifier is built as follows:

1. Using the gap statistic, the appropriate number of clusters is found; this number
is equal to four.
2. k-means divides objects into 4 groups; the density of the distribution is calculated.
3. XGBoost and Random Forest are used for each cluster separately.
4. Hard voting on the obtained results is provided. Based on it, the class with the
highest number of votes is selected. If the votes are the same, the result of the
classifier with the minimal depth value is selected.
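Step 4 (hard voting with the depth-based tie-break) can be sketched in a few lines of Python (illustrative only; `predictions` and `depths` are assumed to come from the per-cluster XGBoost and Random Forest models):

```python
from collections import Counter

def hard_vote(predictions, depths):
    """predictions: class labels from the classifiers; depths: matching model depths.
    Returns the majority class; on a tie, the answer of the shallowest classifier wins."""
    counts = Counter(predictions)
    top = max(counts.values())
    tied = {c for c, n in counts.items() if n == top}
    if len(tied) == 1:
        return tied.pop()
    # tie: pick the prediction of the classifier with the minimal depth among the tied classes
    idx = min((i for i, p in enumerate(predictions) if p in tied),
              key=lambda i: depths[i])
    return predictions[idx]

print(hard_vote([1, 1, 0], [5, 6, 3]))  # 1 (clear majority)
print(hard_vote([1, 0], [6, 3]))        # 0 (tie; the depth-3 classifier wins)
```

Preferring the shallower model on ties is the slide's stated rule; a plausible rationale is that a shallower tree is less likely to have overfitted its cluster.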
Results

The accuracy of the hierarchical classifier:

Model                     Cluster 1   Cluster 2   Cluster 3   Cluster 4
Hierarchical classifier   0.941       0.945       0.961       0.957
Conclusion
• The authors proposed a three-stage approach for missing data imputation, which
includes: (i) designing the Big data model in the task of missing data recovery,
which enables processing of semi-structured data; (ii) developing the method of
missing data recovery based on functional dependencies and association rules;
and (iii) estimating the algorithm complexity for missing data recovery.
• The proposed method of missing data recovery creates additional data values
using the base domain and functional dependencies and adds these values to the
available training data. The correctness of the imputed values is verified on the
classifier built on the original dataset. The proposed PPD method performs 12%
better than the EM and RF methods for 30% of missing data. For higher
percentages of missing data (in the range of about 40-50%) the EM method looks
the best, and the PPD method has results comparable with the SVM method.
Comparison of methods for analyzing and modeling statistical processes by
qualitative criteria confirmed that the developed method has the following
advantages: resistance to errors in the data, support for parallel execution in
distributed databases, and automated analysis of different types of data. The
acceleration for m = 41 attributes is 12.5 times for 20 servers (processors)
compared with the non-parallel mode.
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 

Missing Data imputation

  • 1. Method for Missing Data Imputation in Big Data technologies Nataliya Shakhovska, Head of artificial intelligence department Lviv Polytechnic National University 1
  • 2. Data Cleaning • Data cleaning tasks • Fill in missing values • Identify outliers and smooth out noisy data • Correct inconsistent data
  • 4. Missing Data • Data is not always available • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to • equipment malfunction • inconsistency with other recorded data, leading to deletion • data not entered due to misunderstanding • certain data may not have been considered important at the time of entry • history or changes of the data not being registered • Missing data may need to be inferred
  • 6. Problem formulation • The nature of gaps • missing completely at random (MCAR), • missing at random (MAR) • and missing not at random (MNAR) • The limitation of existing methods • Methods based on the skip generation model • Methods of imputation (restoration) 6
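For illustration, the MCAR mechanism above can be simulated by deleting values uniformly at random, independently of the data. A minimal sketch in plain Python; the `make_mcar` helper is hypothetical, not part of the presented method:

```python
import random

def make_mcar(rows, n_cols, rate, seed=0):
    """Return a copy of `rows` where each cell is replaced by None
    with probability `rate`, independently of any value (MCAR)."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        out.append([None if rng.random() < rate else row[c] for c in range(n_cols)])
    return out

data = [[1, 2.5], [3, 4.0], [5, 6.5], [7, 8.0]]
masked = make_mcar(data, n_cols=2, rate=0.3, seed=42)
n_missing = sum(v is None for row in masked for v in row)
```

Because the deletion probability ignores both the observed and the missing values, the mask is MCAR by construction; MAR and MNAR would condition `rate` on other columns or on the deleted value itself.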
  • 7. Existing methods of data imputation • Mean substitution • Hot-deck imputation (randomly) • Cold-deck imputation (constant) • Model-based • Regression • EM • RF • KNN • SOM • SVM 7
  • 9. How to Handle Missing Data? • Ignore the tuple: usually done when class label is missing (assuming the task is classification—not effective in certain cases) • Fill in the missing value manually: tedious + infeasible? • Use a global constant to fill in the missing value: e.g., “unknown”, a new class?! • Use the attribute mean to fill in the missing value • Use the attribute mean for all samples of the same class to fill in the missing value: smarter • Use the most probable value to fill in the missing value: inference-based such as regression, Bayesian formula, decision tree
  • 10. Existing methods of data imputation • Mean substitution • Hot-deck imputation (randomly) • Cold-deck imputation (constant) • Model-based • Regression • EM • RF • KNN • SOM • SVM 10 Note: mean imputation reduces the variability of the attribute.
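The note above about mean imputation distorting attribute variability can be checked numerically: filling gaps with the attribute mean preserves the mean but shrinks the variance, since every imputed cell sits exactly at the center. A minimal sketch in plain Python:

```python
def mean_impute(values):
    """Fill None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

col = [10.0, None, 14.0, None, 18.0]       # observed mean = 14
filled = mean_impute(col)                   # [10, 14, 14, 14, 18]
var_obs = variance([10.0, 14.0, 18.0])      # variance of observed values only
var_imp = variance(filled)                  # smaller: copies of the mean add no spread
```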
  • 13. Use of noise filtering techniques in classification The three noise filters mentioned next, which are the best-known, use a voting scheme to determine which cases have to be removed from the training set: • Ensemble Filter (EF) • Cross-Validated Committees Filter • Iterative-Partitioning Filter
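The voting scheme behind these filters can be illustrated with a deliberately simplified sketch: instead of an ensemble of trained classifiers, a single k-nearest-neighbour vote flags a case as noisy when the majority of its neighbours disagree with its label. This is an edited-nearest-neighbour approximation of the idea, not the actual EF/CVCF/IPF implementations:

```python
def knn_noise_votes(X, y, k=3):
    """Flag index i as noisy when the majority of its k nearest
    neighbours (squared Euclidean distance, excluding i) carry
    a label different from y[i]."""
    flagged = []
    for i, xi in enumerate(X):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(X) if j != i
        )
        votes = [y[j] for _, j in dists[:k]]
        if votes.count(y[i]) < (k // 2 + 1):
            flagged.append(i)
    return flagged

# two tight clusters; sample 3 carries a flipped class label
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.05, 0.05),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
y = [0, 0, 0, 1, 1, 1, 1]
noisy = knn_noise_votes(X, y, k=3)   # -> [3]
```

The real filters replace the neighbour vote with predictions from several classifiers (or cross-validation folds) and use either consensus or majority voting to decide on removal.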
  • 18. INFFC: An iterative class noise filter based on the fusion of classifiers with noise sensitivity control
  • 19. INFFC: An iterative class noise filter based on the fusion of classifiers with noise sensitivity control
  • 20. By the way, what about the nature of Big data? 20
  • 21. The Big Data Model in the Task of Missing Data Recovery 21

The Big data schema $Bd$ is the finite set of attributes with exact values $\{A_1, A_2, \dots, A_n\}$, the set of attributes with indefinite, inexact or missing values $\{A\_unk_1, A\_unk_2, \dots, A\_unk_p\}$, and the set of values of membership functions for inexact attributes $\{Unk_1, Unk_2, \dots, Unk_m\}$. In addition, the catalog of the attributes (features) with schema $Cg$ and a synonym dictionary with schema $Dic$ should be used:

$$Bd = \{A_1, A_2, \dots, A_n\} \cup \{A\_unk_1, A\_unk_2, \dots, A\_unk_p\} \cup \{Unk_1, Unk_2, \dots, Unk_m\} \cup Dic \cup Cg \quad (1)$$

The attributes of the $A\_unk$ set are considered indeterminate, and the level of confidence in them is stored in the attributes of the set $Unk$. A binary relation $Rel$ is used to show the relationships between the attributes of the $A\_unk$ and $Unk$ sets, the values of which are determined based on the sample view of the source and on the $Cg$ data directory:

$$Rel = \{rel_{ij}\}, \quad rel_{ij} = \begin{cases} 1, & \arg(A\_unk_i) = \arg(Unk_j) \in Dic \cup Cg \\ 0, & \text{otherwise} \end{cases}, \quad i = \overline{1, p},\ j = \overline{1, m} \quad (2)$$
  • 22. Examples of consolidated data tuples for different types of information resources 22

1. Relational database — an extended relational tuple $t_{rel}$ is used:
$$t_{rel} = \{a_1, \dots, a_n\} \cup \{a\_unk_1, \dots, a\_unk_m\}, \quad t_{rel} \models \langle Bd, Unk \rangle, \quad (4)$$
where $\{a_1, \dots, a_n\}$ are the values of exact attributes and $\{a\_unk_1, \dots, a\_unk_m\}$ are the values of attributes with uncertainty.

2. A Data Warehouse combines fact and dimension data. A set of measurement values and fact characteristics is presented as a tuple $t_{dw}$:
$$t_{dw} = \{a_{1,1}, \dots, a_{1,n}\} \cup \{a\_unk_{1,1}, \dots, a\_unk_{1,m}\} \cup \dots \cup \{a_{rf_1}, \dots, a_{rf_k}\} \cup \{a\_unk_{rf_1}, \dots, a\_unk_{rf_t}\}, \quad t_{dw} \models \langle Bd, Unk \rangle, \quad (5)$$
where $a_{i,j}$ is the value of exact characteristic $j$ in dimension $i$, $a_{rf_j}$ is the value of characteristic $j$ of the fact table, $a\_unk_{i,j}$ is the value of the attribute with uncertainty $j$ from dimension $i$, and $a\_unk_{rf_j}$ is the value of the attribute with uncertainty $j$ from the fact table.

3. Semi-structured text describes the values of the nodes of the semantic networks and the degree of affiliation of these values to the objects whose names are described in the synonym dictionary:
$$t_{text} = \{a_1, \dots, a_n\} \cup \{a\_unk_1, \dots, a\_unk_m\}, \quad t_{text} \models \langle Bd, Unk \rangle. \quad (6)$$
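The extended relational tuple of Eq. (4) can be sketched as a small data structure that keeps exact values apart from uncertain values and their membership degrees; the attribute names and the 0.8 degree below are hypothetical illustrations:

```python
from dataclasses import dataclass, field

@dataclass
class ExtendedTuple:
    """Extended relational tuple: exact attribute values plus
    uncertain values paired with their membership degrees (Unk)."""
    exact: dict = field(default_factory=dict)       # {attribute: value}
    uncertain: dict = field(default_factory=dict)   # {attribute: (value, degree)}

    def confidence(self, attr):
        """Membership degree for an uncertain attribute, 1.0 for exact ones."""
        if attr in self.exact:
            return 1.0
        return self.uncertain[attr][1]

t = ExtendedTuple(
    exact={"city": "Lviv", "year": 2020},
    uncertain={"income": (1500, 0.8)},   # hypothetical attribute and degree
)
```

The warehouse tuple of Eq. (5) would nest one such pair of dictionaries per dimension plus one for the fact table.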
  • 23. Probabilistic Production Dependencies Mining 23

A Probabilistic Production Dependency (PPD) is a production rule over the basic relation selection that is valid for a significant number of entities in that selection. The significance threshold should be determined expertly, or based on calculations of the probability of erroneous selection of this dependency. The main difference between associative rules and PPD is that PPDs are generated from the functional dependencies (FD) existing in the dataset:
$$F_I : \{K : a_i \in A\} \to \{D : a_j \in A\}, \quad P(k \in K \Rightarrow d \in D) \ge p, \quad (8)$$
where $k$ and $d$ are the tuples of the groups of attributes $K$ and $D$, respectively. The main indicator of the reliability of such a dependency is the ratio of the number of objects that satisfy the PPD to the number of objects in the selection:
$$P(F_I) = \frac{|\{r \in R : k \in K \wedge d \in D\}|}{|\{r \in R : k \in K\}|}. \quad (9)$$
A classification rule is the probabilistic productive relationship between the subsets of the $X$ and $Y$ attributes in the consolidated Big data $Bd$ that occurs in the training set $bd$ with trust level $s$:
$$(X = x) \to (Y = y). \quad (10)$$
  • 24. Association rules measures 24

Trust Level (confidence) is the ratio of the number of objects that satisfy the PPD to the number of objects that satisfy its condition:
$$Conf(S \to T) = P(T \mid S) = \frac{|\{r : S \wedge T\}|}{|\{r : S\}|}. \quad (15)$$
Support Level is the characteristic of a selection predicate in a relation, calculated as the ratio of the number of objects satisfying predicate $P$ to the total number of objects in the relation:
$$Supp(P) = \frac{|\{r : P\}|}{|r|}. \quad (16)$$
When calculating the level of support for a PPD, the conditional and the resulting dependency predicates are combined by a conjunction:
$$Supp(S \to T) = Supp(S \wedge T) = \frac{|\{r : S \wedge T\}|}{|r|}. \quad (17)$$
Using this concept, the level of trust can be calculated as:
$$Conf(S \to T) = \frac{Supp(S \wedge T)}{Supp(S)}. \quad (18)$$
The level of improvement is calculated as the ratio of the levels of trust and support of the PPD:
$$Imp(S \to T) = \frac{Conf(S \to T)}{Supp(T)} = \frac{Supp(S \wedge T)}{Supp(S) \cdot Supp(T)}. \quad (19)$$
Total mutual information is generally defined as:
$$I_{XY} = \sum_{i=1}^{n} \sum_{j=1}^{m} P_{ij} \log_2 \frac{P_{ij}}{p_i r_j}. \quad (20)$$
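The measures of Eqs. (15)–(19) are straightforward to compute over a toy relation; the predicates and tuple values below are illustrative only:

```python
def supp(rel, pred):
    """Support: share of tuples satisfying `pred` (Eq. 16)."""
    return sum(1 for r in rel if pred(r)) / len(rel)

def conf(rel, s, t):
    """Trust level of S -> T: Supp(S and T) / Supp(S) (Eq. 18)."""
    return supp(rel, lambda r: s(r) and t(r)) / supp(rel, s)

def imp(rel, s, t):
    """Improvement (lift): Conf(S -> T) / Supp(T) (Eq. 19)."""
    return conf(rel, s, t) / supp(rel, t)

# toy relation of (region, had_influenza) pairs -- illustrative values only
rel = [("Lviv", 1), ("Lviv", 1), ("Lviv", 0), ("Belarus", 1), ("Belarus", 0)]
S = lambda r: r[0] == "Lviv"
T = lambda r: r[1] == 1
# Supp(S) = 0.6, Supp(S and T) = 0.4, so Conf(S->T) = 2/3 and Imp = (2/3)/0.6
```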
  • 25. Algorithm 1: PPD mining 25

Input: Big data dataset bd, Card(Bd) = n
Output: PPDlist
1: PPDlist = {}
2: X = {}
3: for (i = 1; i < n; i++)
4:   X = X ∪ A_i
5:   Group entities with the same values for the set of attributes X;
6:   Search for entities that have the same values for the set of synonyms of attributes X;
7:   Y = {}
8:   for (j = i + 1; j <= n; j++)
9:     Y = Y ∪ A_j
10:    Calculate Supp and Conf over the entities obtained in steps 5 and 6;
11:    Calculate the Imp of the tuple sources obtained in steps 5 and 6;
12:    Identify the entities with the highest levels of confidence and add X → Y to PPDlist.
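A runnable sketch of Algorithm 1 for single-attribute X and Y (steps 5 and 10–12; the synonym lookup of step 6 is omitted). The thresholds and the rule-tuple layout are assumptions for illustration:

```python
from collections import defaultdict
from itertools import combinations

def mine_ppd(rows, header, min_supp=0.2, min_conf=0.7):
    """Sketch of Algorithm 1: for every attribute pair (X, Y), group tuples
    by the value of X and keep X -> Y when its support and trust level
    clear the thresholds."""
    n = len(rows)
    ppd = []
    for xi, yi in combinations(range(len(header)), 2):
        groups = defaultdict(list)
        for row in rows:                      # step 5: group by the value of X
            groups[row[xi]].append(row[yi])
        for xval, yvals in groups.items():
            best = max(set(yvals), key=yvals.count)   # most frequent Y value
            support = len(yvals) / n                  # step 10: Supp
            confidence = yvals.count(best) / len(yvals)  # step 10: Conf
            if support >= min_supp and confidence >= min_conf:
                ppd.append((header[xi], xval, header[yi], best, confidence))
    return ppd

header = ["region", "blood_group"]
rows = [("Lviv", 2), ("Lviv", 2), ("Lviv", 2), ("Belarus", 1), ("Belarus", 2)]
rules = mine_ppd(rows, header)
```

On this toy input only region = "Lviv" → blood_group = 2 survives both thresholds; the "Belarus" group fails the confidence cut.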
  • 26. Algorithm 2: Recovery algorithm 26

Input: Big data dataset bd with schema Bd and missed data ⊥, PPDlist
Output: Completeness level of bd (here ↓ marks a defined value and ⊥ a missing one)
1: Completeness = 0
2: If bd1(X)↓ = {x1, ..., xn} and bd2(X)↓ = {x1, ..., xn} and bd1(X1)↓ = bd2(X1)↓, ..., bd1(Xn)↓ = bd2(Xn)↓ and bd1(Y)↓ and bd2(Y) = ⊥ and X1 ∈ Dic and {X1 → Y in PPDlist}
3: then change ⊥ to bd1(Y); P = Σi Pi(bd1) · m/n; Completeness++
4: If bd1(X)↓ = {x1, ..., xn} and {m of the n attribute values are ↓ in bd2, n − m attribute values are ⊥, m ≤ n} and P ≥ m/n and {for the defined values bd1(Xm)↓ = bd2(Xm)↓} and bd1(Y)↓, bd2(Y) = ⊥ and {X2 → Y in PPDlist}
5: then change ⊥ to bd1(Y); P = Σi Pi(bd2) · m/n; Completeness++
6: If {mi of the n attribute values are ⊥ in bdi, mi ≤ n} and {mj of the n attribute values are ⊥ in bdj, mj ≤ n} and {for the exact values bdi(Xm)↓ = bd2(Xm)↓} and {for the exact values bdj(Xm)↓ = bd2(Xm)↓} and P ≥ (mi/n) · (mj/n) and bdi(Y)↓ and bdj(Y)↓ and bd2(Y) = ⊥ and {X2, Xj → Y in PPDlist}
7: then change ⊥ to bdj(Y); Completeness++
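The core substitution step of Algorithm 2 (match the rule condition, fill the ⊥ value, increment the completeness counter) can be sketched as follows; the rule-tuple format mirrors the previous sketch and is an assumption, not the paper's exact data layout:

```python
def recover(rows, header, ppd_list):
    """Simplified recovery step: replace None values using mined PPD rules
    of the form (X, x_value, Y, y_value, confidence); returns the filled
    rows and the number of recovered cells (completeness counter)."""
    idx = {name: i for i, name in enumerate(header)}
    recovered = 0
    out = [list(r) for r in rows]
    for row in out:
        for x, xval, y, yval, _conf in ppd_list:
            # rule condition matches and the consequent attribute is missing
            if row[idx[y]] is None and row[idx[x]] == xval:
                row[idx[y]] = yval
                recovered += 1
    return out, recovered

header = ["region", "blood_group"]
rows = [("Lviv", None), ("Belarus", 1)]
ppd = [("region", "Lviv", "blood_group", 2, 1.0)]
filled, completeness = recover(rows, header, ppd)
# filled[0] -> ["Lviv", 2], completeness -> 1
```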
  • 27. Sequential dependencies 27

We determine the parent moving operator
$$Up\_c(bd_X) = \sigma_{X = x_1,\ Dic(X) \to Dic(Y)}(bd_X), \quad (27)$$
where $x_1$ is the value of the primary key (child) and $x_2$ is the foreign key (parent), and the child moving operator
$$Down\_c(bd_X) = \sigma_{X = x_2,\ Dic(Y)}(bd_X). \quad (28)$$
Based on the above, we may build the operator for recovering the missing data in the sequence of the events (or entities):
$$Recovery(bd) = Heir\_c\big(Down\_c(bd_{X,\, Val = v,\, Dic}),\ bd_{X,\, Val = Null,\, Dic}\big), \quad \langle Bd, Unk, P, Dic \rangle. \quad (29)$$
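The parent/child moving idea can be imitated over a flat dictionary of records linked by a foreign key: when an attribute is missing on a child, inherit it from the nearest ancestor that defines it. This is a simplified illustration of the recovery operator, not the relational operators of Eqs. (27)–(29), and the record layout is hypothetical:

```python
def up(records, child_id, parent_key="parent"):
    """Parent-moving operator: follow the foreign key of `child_id` upward."""
    return records.get(records[child_id][parent_key])

def recover_from_parent(records, child_id, attr):
    """If `attr` is missing on the child, inherit it from the nearest
    ancestor that defines it (a simplified recovery step)."""
    rec = records[child_id]
    while rec is not None and rec.get(attr) is None:
        rec = up(records, rec["id"]) if rec.get("parent") else None
    return None if rec is None else rec[attr]

records = {
    1: {"id": 1, "parent": None, "region": "Lviv"},
    2: {"id": 2, "parent": 1, "region": None},   # missing -> inherit from parent
}
value = recover_from_parent(records, 2, "region")   # -> "Lviv"
```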
  • 28. Complexity estimation 28

At the statistics stage, the input relation passes through each tuple, so the time of this stage is directly proportional to the relation size $n$:
$$t = O(t_{stat} + t_{ma}), \qquad t_{stat} = O(n \cdot m).$$
The mining stage gives
$$t_{ma} = O\!\left(n \cdot m + \frac{n}{minSupport} \cdot m^{\log_{avgD}\frac{n}{minSupport}}\right), \qquad M_{stat} = O\!\left(m^{1 + \log_{avgD}\frac{n}{minSupport}}\right).$$
If we use the Functional Dependencies as elemental dependencies for PPD generation, then:
$$t_{aggr} = O\!\left(Z_{ma} \cdot sz \cdot \log_2(m \cdot |D| \cdot |A|)\right), \qquad M_{ma} = O(Z_{ma} \cdot sz),$$
$$t_{el} = O\!\left(Z_{aggr} \cdot sz \cdot \log_2(m \cdot |D| \cdot |A|) + \frac{n}{minSupport} \cdot m^{\log_{avgD}\frac{n}{minSupport}}\right).$$
  • 29. Parallel mode – time complexity estimation 29

$$t_k = O\!\left(\frac{n \cdot m}{k} + \frac{n}{k \cdot minSupport} \cdot m^{\log_{avgD}\frac{n}{k \cdot minSupport}} + Z_{aggr} \cdot sz \cdot \log_2(m \cdot |D| \cdot |A|)\right).$$
In the case of parallel computing on a large number of processors, we can neglect the second term of the function $t_n$. For the same reason, $\log_2 n = O(minSupport)$. The asymptotic estimate of the algorithm execution time (on a system of $k$ processors) is
$$t_n = O\!\left(m^{\log_{avgD}\frac{n}{minSupport}}\right).$$
  • 30. Experimental Results and Discussion 30 To run the experiment, the dataset from the Big Cities Health Inventory Data Platform (“BCHC Data Platform”) is used. The platform contains over 18,000 data points across more than 50 health, socio-economic, urban (information from smart sensors) and demographic indicators across 11 categories in the United States. This is an example of semi-structured data with hidden dependencies inside entities and between entities. The missing data are modeled by random deletion of exact data. The proposed and existing methods are tested on the same hardware: Intel Core 5 Quad E6600 2.4 GHz, 16 GB RAM, HDD WD 2 TB 7200 RPM. The criteria of PPD creation are Conf(F1) > 0.7 and minSupport = 100. RStudio is used for data modelling and analysis.
  • 31. Comparison with existing approach 31

The normalized root-mean-square error is used for the analysis of the percentage of correctly recovered data.

Time of analysis (min), depending on the amount of analyzed data:

Amount of records | SVM | AR | EM | PPD (proposed method)
2 000 | 31 | 190 | 9 | 8
4 000 | 49 | 362 | 23 | 19
6 000 | 95 | 437 | 37 | 31
8 000 | 144 | 639 | 49 | 42
10 000 | 210 | 827 | 62 | 51
18 000 | 286 | 1019 | 77 | 65

[Figure: the acceleration graph of the developed algorithm — acceleration vs. number of servers (1–20) for m = 41, 20, 10 and 5 attributes]
  • 32. Problem formulation • To find frequent patterns • To find parameters affected by COVID-19 32
  • 33. Dataset description 33 • The dataset was collected with the support of the Central European Initiative and verified by the Lviv regional center of COVID-19 resistance • The project Stop COVID-19 has use cases implemented in Ukraine and Belarus. Partners from Germany shared the Google form too. The data were collected over the period from September 01 to October 29. • The dataset provides data on COVID-19 unconfirmed and confirmed cases • Age (categorical): 1: <15, 2: 16-22, 3: 23-40, 4: 41-65, 5: >66, • Sex (categorical): male, female, • Region (string): Lviv (Ukraine), Chernivtsi (Ukraine), Belarus, Germany, Other, • Do you smoke (Boolean): 2: yes, 0: no, • Have you had COVID (categorical): 2: yes, 0: no, 1: maybe, • IgM level (numerical): [0..0.9) (negative), [0.9..1.1) (indefinite), >=1.1 (positive), • IgG level (numerical): [0..0.9) (negative), [0.9..1.1) (indefinite), >=1.1 (positive), • Blood group (categorical): 1, 2, 3, 4, • Have you been vaccinated against influenza? (categorical): 2: yes, 0: no, 1: maybe, • Have you been vaccinated against tuberculosis? (categorical): 2: yes, 0: no, 1: maybe, • Have you had influenza this year? (categorical): 2: yes, 0: no, 1: maybe, • Have you had tuberculosis this year? (categorical): 2: yes, 0: no, 1: maybe.
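The IgM/IgG discretization listed above maps directly to code; a minimal sketch with a hypothetical `encode_ig` helper:

```python
def encode_ig(level):
    """Map a numeric IgM/IgG titre onto the slide's categories:
    [0, 0.9) negative, [0.9, 1.1) indefinite, >= 1.1 positive."""
    if level < 0.9:
        return "negative"
    if level < 1.1:
        return "indefinite"
    return "positive"

labels = [encode_ig(v) for v in (0.2, 0.95, 1.3)]
# -> ["negative", "indefinite", "positive"]
```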
  • 34. Preprocessing stage — [Figure: bar chart of the number of records (0–125) per blood group (1–4)]
  • 35. Clustering 35 — [Figure: scatter plot of the four clusters (1–4) against Age]
  • 36. Classification 36

The accuracy equals 0.5135, but this model allows choosing the main features: “Have you had influenza this year”, Sex, blood group, region.

RF – Confusion matrix (true classes in rows):

  | 0 | 1 | 2 | Class error
0 | 153 | 9 | 17 | 0.14525139
1 | 8 | 88 | 12 | 0.18518518
2 | 1 | 3 | 20 | 0.16666666
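The "Class error" column of the confusion matrix can be reproduced as 1 minus the per-row hit rate:

```python
def class_errors(cm):
    """Per-class error from a confusion matrix with true classes in rows:
    1 - (diagonal hits / row total)."""
    return [1 - cm[i][i] / sum(row) for i, row in enumerate(cm)]

cm = [[153, 9, 17],
      [8, 88, 12],
      [1, 3, 20]]
errors = class_errors(cm)   # ~[0.1453, 0.1852, 0.1667], matching the slide
```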
  • 37. Models’ accuracy for the whole feature set 37

Model | Full dataset | Filtered by Ukraine | Filtered by Belarus | Filtered by Germany | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
Logistic regression | 0.553 | 0.572 | 0.534 | 0.544 | 0.601 | 0.592 | 0.610 | 0.589
Support vector machine | 0.605 | 0.6327 | 0.570 | 0.584 | 0.621 | 0.694 | 0.635 | 0.637
Naive Bayes | 0.670 | 0.693 | 0.655 | 0.655 | 0.674 | 0.693 | 0.672 | 0.692
XGBoost | 0.898 | 0.932 | 0.860 | 0.942 | 0.941 | 0.945 | 0.899 | 0.957
Random Forest | 0.897 | 0.924 | 0.859 | 0.940 | 0.932 | 0.944 | 0.961 | 0.925
Neural network | 0.820 | 0.849 | 0.828 | 0.79 | 0.830 | 0.849 | 0.8204 | 0.849
Decision tree | 0.513 | 0.542 | 0.517 | 0.492 | 0.553 | 0.631 | 0.612 | 0.642
  • 38. Models’ accuracy for selected features

Model | Age, IgG, Blood_group, had_influenz, IgM | Age, Sex, Blood_group, had_influenz
Logistic regression | 0.633 | 0.671
Support vector machine | 0.671 | 0.722
Naive Bayes | 0.674 | 0.732
XGBoost | 0.935 | 0.945
Random Forest | 0.945 | 0.934
Neural network | 0.832 | 0.845
Decision tree | 0.553 | 0.631
  • 39. The hierarchical classifier is built as follows 39 1. Using gap statistics, the appropriate number of clusters is found; this number equals four; 2. k-means divides objects into 4 groups; the density of the distribution is calculated; 3. XGBoost and Random Forest are applied to each cluster separately; 4. Hard voting on the obtained results is performed: the class with the highest number of votes is selected. If the votes are equal, the result of the classifier with the minimal depth value is selected.
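Step 4 (hard voting with the minimal-depth tie-break) can be sketched independently of the underlying models; the function name and the depth convention are assumptions for illustration:

```python
def hard_vote(predictions, depths):
    """Hard voting over per-model class predictions: the class with the
    most votes wins; on a tie, take the prediction of the model with
    the minimal depth value (step 4 of the slide)."""
    counts = {}
    for p in predictions:
        counts[p] = counts.get(p, 0) + 1
    best = max(counts.values())
    tied = [c for c, v in counts.items() if v == best]
    if len(tied) == 1:
        return tied[0]
    # tie: fall back to the shallowest model's prediction among tied classes
    for _, p in sorted(zip(depths, predictions)):
        if p in tied:
            return p

winner = hard_vote([1, 1, 0], depths=[5, 3, 2])   # clear majority -> class 1
tie = hard_vote([1, 0], depths=[4, 2])            # tie -> shallower model wins -> class 0
```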
  • 40. Results 40 — The accuracy of the hierarchical classifier:

Model | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4
Hierarchical classifier | 0.941 | 0.945 | 0.961 | 0.957
  • 41. Conclusion • The authors proposed a three-stage approach for missing data imputation, which includes: (i) designing the Big data model in the task of missing data recovery, which enables processing of semi-structured data; (ii) developing the method of missing data recovery based on functional dependencies and association rules; and (iii) estimating the algorithm complexity for missing data recovery. • The proposed method of missing data recovery creates additional data values using the domain and the functional dependencies, and adds these values to the available training data. The correctness of the imputed values is verified with a classifier built on the original dataset. The proposed PPD method performs 12% better than the EM and RF methods for 30% of missing data. For a larger percentage of missing data (about 40%–50%), the EM method performs best, and the PPD method has results comparable with the SVM method. Comparison of methods for analyzing and modeling statistical processes by qualitative criteria confirmed that the developed method has the following advantages: resistance to errors in the data, support for parallel execution in distributed databases, and automated analysis of different types of data. The acceleration for m = 41 attributes is 12.5 times on 20 servers (processors) compared with the non-parallel mode. 41