Biodata analysis

Smart Tech: Biosignals & Medical Electronics
Biodata analysis
Oresti Banos
June 2nd, 2017
o.banoslegran@utwente.nl
@orestibanos
http://orestibanos.com/
19-Apr-18 1

Biodata analysis in the overall “biopicture”
19-Apr-18 2
Lectures 1, 2, 3
Lecture 4 (last week)Lecture 5 (today)
Biosignal acquisition
Biosignal processingBiodata analysis

Learning Objectives
At the end of this course you should be able to:
 Explain the concept of biodata and
enumerate some examples of types
of biodata
 Identify the different stages of the
biodata analysis chain, as well as
the purpose of each step
 Apply some regular biodata
segmentation techniques
 Utilise common biodata feature
extraction and selection techniques
 Employ typical biodata classification
techniques and use metrics to
evaluate their performance
19-Apr-18 3

Outline
 Biodata definition
 Biodata analysis chain
 Segmentation
 Feature extraction
 Feature selection
 Classification
 Performance evaluation
19-Apr-18 4

Outline
 Segmentation
 Classification
19-Apr-18 5

Biodata: definition
 Data: is a collection or set of qualitative or
quantitative variables normally represented
through numbers, characters or symbols
 Biodata (also biomedical data, biological data):
collection of data specifically related to or
describing biological systems or processes
 Different “levels” of data
 Signal
 Features
 Categories
19-Apr-18 6
There are
more specific
definitions!

Biodata: examples
19-Apr-18 7
Genomic data
Kinematics data
Physiological data
Emotional data
Cognitive data
…

Motion data
19-Apr-18 8
 Body motion data:
 Different sensing technologies to track body movements
 Video: Kinect, Vicon
 EMG: MYO
 Inertial: Xsens, Shimmer, Smartphone/watch
 Acceleration:
 tridimensional signal (x, y, z)
 measures typically range from -2g to 2g (g=9.8m/s2) for
daily activities
 sampling frequency around 50Hz
 Multiple applications: Health
Abnormal
behavior
detection
Proactive
Assistance
Labour risk
prevention
Wellness
Sports
Gaming

Self-portrait portrait (Selfie), biodata?
19-Apr-18 9
Two weeks ago
you said…

Self-portrait portrait (Selfie), biodata?
19-Apr-18 10
And the answer
was…

(Medical) selfie data
19-Apr-18 11
Seborrheic
Dermatitis
Allergies,
fatigueConjunctivitis
Moles, skin diseases
Burns
Tongue
infections

(Medical) selfie data
19-Apr-18 12

Outline
 Segmentation
 Classification
19-Apr-18 13

Biodata interpretation: not an easy job…
 Example (EEG signals)
19-Apr-18 14
Just
0.2 seconds
of data!!

Biodata interpretation: not an easy job…
 Medical experts cannot “digest” the enormous
amount of biodata generated by people
 Examples
 Breathing ~100K events/day
 Heart beats ~ 1M samples/day
 Motion ~100M samples/day
 EEG ~100M samples/day
19-Apr-18 15
How to make sense of
these gobs of data?

Biodata analysis chain
 Multistage process combining computational
techniques to automatically extract information
and develop decisions on a given data set
19-Apr-18 16
S = data source (sensor)
si = segment of data
u = raw/unprocessed data
f(si) = feature vector
p = preprocessed data
ci = class/label

Data acquisition and preprocessing
19-Apr-18 17
 Data acquisition refers here to the process of
measurement and digitisation of the biological
phenomenon (Lectures 1, 2, 3)
 Measurement and transduction
 Sampling
 Amplification
 Analog to digital conversion
 Data preprocessing refers here to the preparation
of the biodata for its posterior processing and
analysis (Lectures 4 and 5)
 Removal of artifacts
 Denoising
 Domain transformation
 Downsampling (decimation) / upsampling (interpolation)

Outline
 Segmentation
 Classification
19-Apr-18 18

Segmentation
19-Apr-18 19
 Process to divide the biosignal or data into
smaller time segments
 The segmentation process is frequently called
“windowing” as each segment represents a data
window or frame
 In real-time applications, windows are defined
concurrently with data acquisition and processing,
so data streams can be effectively analysed “on-
the-fly”

Segmentation
19-Apr-18 20
 Sliding window
 Signals are split
into windows of a
fixed size and with
no inter-window
gaps
 An overlap
between adjacent
windows is
sometimes
tolerated
 Most widely used
approach
Window 1 Window 3 Window 5 Window 7
Window 1 Window 2 Window 3 Window 4
Fixed
window size
Window 2 Window 4 Window 6
Non-overlapingOverlaping

Segmentation
19-Apr-18 21
 Event-defined window
 The segment start and
end is defined by a
detected event
 Additional processing is
required to identify the
events of interest
 Example: toe offs and
heel strikes based on
the differentiation of the
acceleration signal
(derivative)
Data windows (normally)
of variable size

Segmentation
19-Apr-18 22
 Class-defined window
 The window start and
end is defined by a
change in the context
or class (also spotting)
 Example: activity
transition detected from
significant variations in
the energy or statistical
properties of the
acceleration signal
(e.g., variance)
Data windows (normally)
of variable size

Outline
 Segmentation
 Classification
19-Apr-18 23

Featuring or characterisation
19-Apr-18 24
 How do you differentiate between these two
persons? What do they have in common?

Featuring or characterisation
19-Apr-18 25
 Sometimes it becomes difficult to tell…

Feature extraction
19-Apr-18 26
 Process of (numerically) characterising or
transforming raw data into more descriptive or
informative data
 Intended to facilitate the subsequent learning and
generalization steps, and in some cases lead to
better human interpretations
Location=prefrontal,
Size=3cm,
Density=60g/cm3, …

Feature extraction
19-Apr-18 27
 Time-domain features: statistical values derived
directly from data window
 Examples:
 Max
 Min
 Mean
 Median
 Variance
 Skewness
 Kurtosis
MATLAB: max, min, mean, median, var, skewness, kurtosis
0 0.5 1 1.5 2 2.5 3 3.5 4
Time (s)
-25
-20
-15
-10
-5
0
5
10
15
Acceleration(m/s2)
X-axis acceleration signal (JUMPING)
0 0.5 1 1.5 2 2.5 3 3.5 4
Time (s)
-9
-8
-7
-6
-5
-4
-3
-2
-1
0
Acceleration(m/s2)
X-axis acceleration signal (WALKING)
0 0.5 1 1.5 2 2.5 3 3.5 4
Time (s)
-3.4
-3.3
-3.2
-3.1
-3
-2.9
-2.8
-2.7
-2.6
Acceleration(m/s2)
X-axis acceleration signal (STANDING)

Feature extraction
19-Apr-18 28
 Frequency-domain features: derived from a
transformed version of the data window in the
frequency domain
 Examples:
 Fundamental frequency
 N-order harmonics
 Mean/Median/Mode frequency
 Spectral power/energy
 Entropy
 Cepstrum coefficients
MATLAB: fft, pwelch, meanfreq, medfreq, rceps
0 5 10 15 20 25
Frequency (Hz)
0
200
400
600
800
1000
1200
1400
FFTMagnitude
X-axis acceleration signal (JUMPING)
0 5 10 15 20 25
Frequency (Hz)
0
20
40
60
80
100
120
140
160
FFTMagnitude
X-axis acceleration signal (STANDING)
0 5 10 15 20 25
Frequency (Hz)
0
20
40
60
80
100
120
140
160
180
200
FFTMagnitude
X-axis acceleration signal (WALKING)

Feature extraction
19-Apr-18 29
 Process of (numerically) characterising or
transforming raw data into more informative data
 The outcome of the feature extraction process is
normally a feature matrix
 Rows represent each data instance, chunk or segment
 Columns refer to the mathematical function (feature)
𝟎. 𝟏𝟖 𝟎. 𝟑𝟓
−𝟎. 𝟐𝟔 𝟎. 𝟏𝟓
−𝟎. 𝟎𝟓
−𝟎. 𝟏𝟗
𝟎. 𝟐𝟏
𝟎. 𝟏𝟖
Feature
matrix
F1: Mean F2: Variance

Feature extraction
19-Apr-18 30
 Feature space:
 Total number of features
extracted from the data
 Normally described as an array
(also known as feature matrix) in
which rows represent each
instance and columns the
feature type
 The dimensions (D) of the
feature space are given by the
number of features (N)
MATLAB: scatter
0 1 2 3
Mean
0
0.5
1
1.5
2
2.5
3
Variance
Class A
Class B
0.18 0.55
0.26 0.15
0.15
2.13
2.86
2.58
0.85
2.62
2.35
2.51
Feature
matrix
Mean Variance
0 0.5 1 1.5 2 2.5 3
Mean
0
0.5
1
1.5
2
2.5
3
Variance
Sitting
Climbing
Feature
space

Outline
 Segmentation
 Classification
19-Apr-18 31

Features of “relevance”
19-Apr-18 32
 Relevant features can be
individually irrelevant
 A helpful feature may be
irrelevant by itself
 E.g.: the characteristic “being a
human” when comparing two
people vs. when comparing
animals
 Two individually irrelevant
features may become relevant
when used in combination
 E.g.: “row” and “column” in a
chess game when comparing the
color of a given position vs. using
either row or column solely

Feature selection
19-Apr-18 33
 Process to select relevant and
informative features
 Different motivations
 General data reduction: limit storage
requirements and increase algorithm
speed
 Feature set reduction: save resources in
the next round of data collection or
during its utilisation
 Performance improvement: gain in
predictive accuracy
 Data understanding: acquire knowledge
about the process that generated the
data or simply visualise the data

Feature selection
19-Apr-18 34
 Visualising the feature space can help determining which
features (or combination thereof) are most discriminative
 Hyperdimensional features spaces (#features > 3) need
to be reduced for a proper visualisation (e.g., PCA, ICA)
MATLAB: scatter3, pca, biplot, ica
-1
-0.5
0
0
0.5
-2
1
-2
1.5
Mean(accelerationZ)
2
-4 -3
Feature space representation
Mean (accelerationY)
2.5
-4
3
Mean (accelerationX)
-6
3.5
-5
-8
-6
-10 -7
Standing
Walking
Jumping
Do not trust statistics alone,
visualise your data!

Feature selection
19-Apr-18 35
 There are several feature ranking and selection methods
MATLAB: rankfeatures
 Filter methods:
 select variables regardless of the
classification model model (analyses
intrinsic properties of data)
 particularly effective in computation time
 robust to overfitting (excessive model
complexity, i.e., too many parameters
relative to the number of samples)
 Ranking feature selection:
 selects a subset of features according to
a statistical separability criteria (e.g., t-
test, ANOVA)
Set of all
features
Selecting
the best
subset
Learning
algorithm
Performance
evaluation
0.4 1 3.3 1
2.3 1 3.2 3
0.4
2.2
2.6
2.2
1
1
1
1
3.1
9.8
9.4
9.7
2
3
1
2
Class “Sitting”
Class “Climbing”
F1 F2 F3 F4
a) F1>F3>F4>F2?
b) F1>F3>F2>F4?
c) F3>F1>F4>F2?
d) F4>F2>F3>F1?
Go to go.voxvote.com
Insert the code PIN: 99020

Feature selection
19-Apr-18 36
 There are several feature ranking and selection methods
MATLAB: sequentialfs
 Wrapper methods:
 allow to detect the possible
interactions between features
 significant computation time for large
sets of features and prone to
overfitting
 Sequential feature selection:
 selects a subset of features from the
feature matrix that predict best the
output classes by iteratively selecting
features until there is no improvement in
prediction
Set of all
features
Selecting the best subset
Learning
algorithm
Performance
evaluation
Generate
a subset
0.4 1 3.3 1
2.3 1 3.2 3
0.4
2.2
2.6
2.2
1
1
1
1
3.1
9.8
9.4
9.7
2
3
1
2
Class “Sitting”
Class “Climbing”
F1 F2 F3 F4
a) F1>F2>F3>F4?
b) F2>F3>F1>F4?
c) F3>F1>F4>F2?
d) F4>F2>F3>F1?
Go to go.voxvote.com
Insert the code PIN: 99020

Outline
 Segmentation
 Classification
19-Apr-18 37

Classification
19-Apr-18 38
 Problem of identifying to
which of a set of
categories or classes a
new observation belongs
 The classification model is
based on a training set of
data containing
observations (or
instances) whose
category membership is
(normally) known
0 1 2 3
Mean
0
0.5
1
1.5
2
2.5
3
Variance Class A
Class B
0.18 0.55
0.26 0.15
0.15
2.13
2.86
2.58
0.85
2.62
2.35
2.51
Feature
matrix
Mean Variance
Classification
boundary
0 0.5 1 1.5 2 2.5 3
Mean
0
0.5
1
1.5
2
2.5
3
Variance
Sitting
Climbing

Classification
19-Apr-18 39
 Types:
 Supervised
 E.g., decision tree
 Unsupervised
 E.g., clustering
 One size does not fit
all: the choice of
classifier is subject to
a trade-off between
complexity and
computational
resources
0 1 2 3
Mean
0
0.5
1
1.5
2
2.5
3
Variance
Class A
Class B
0.18 0.55
0.26 0.15
0.15
2.13
2.86
2.58
0.85
2.62
2.35
2.51
0 1 2 3
Mean
0
0.5
1
1.5
2
2.5
3
Variance
Class A
Class B
0.18 0.55
0.26 0.15
0.15
2.13
2.86
2.58
0.85
2.62
2.35
2.51
𝑨
𝑨
𝑨
𝑩
𝑩
𝑩
?
?
?
?
?
?
Supervised
Unsupervised
0 0.5 1 1.5 2 2.5 3
Mean
0
0.5
1
1.5
2
2.5
3
Variance
Sitting
Climbing

Classification
19-Apr-18 40
 Classification process:
 Training/learning
 Before operation the classification
model has to be trained (created)
 The model parameters are learned
from the training data and as to
minimise the classification error
 Example:
 In a decision tree, nodes
(conditions) and branches (decision
propagation) need to be defined
 The conditions are optimised as to
maximise the distance between
classes
0.18 0.55
0.26 0.15
0.15
2.13
2.86
2.58
0.85
2.62
2.35
2.51
𝑆𝑖𝑡𝑡𝑖𝑛𝑔
𝐶𝑙𝑖𝑚𝑏𝑖𝑛𝑔
+
Mean Variance
Mean < 1.2
Sitting Climbing
Class
MATLAB: fitctree

Classification
19-Apr-18 41
 Classification process:
 Classification/prediction
 Once the model is trained, it can
be used to categorise unseen
new instances into specific
classes
 The outputs of the classification
correspond to the inferred class
or categories
 Example:
 In a decision tree, the conditions
(nodes) are evaluated and the
applicable path followed up to
reach a conclusion (class)
0.52 −0.25
1.38 9.15
2.31
0.19
5.67
0.12
Mean < 1.2
Sitting Climbing
MATLAB: predict

Outline
 Segmentation
 Classification
19-Apr-18 42

Performance evaluation
19-Apr-18 43
 Evaluating the performance of the classifier
(generally, the complete analysis chain) is
crucial to estimate the categorisation
capabilities of the system
 The performance evaluation
is normally conducted
during the design phase
 Classification performance
depends greatly on the
characteristics of the data to
be classified
 There is no single classifier
that works best on all given
problems

19-Apr-18 44
 Performance metrics
 Decision table (confusion matrix)
 Table layout that allows visualization
of the performance of a given
algorithm or classification model
 Each column of the matrix represents
the instances in a predicted class
while each row represents the
instances in an actual class
MATLAB: confusionmat, plotconfusion
Actual
class
Classified
class
Sitting Climbing
Sitting 2 1
Climbing 0 1
Classified class
Actualclass

19-Apr-18 45
 Performance metrics
 Accuracy (acc)
 Proportion of correct classifications
with respect to the total number of
classified instances or observations
MATLAB: classperf
Actual
class
Classified
class
𝒂𝒄𝒄 =
𝟏 + 𝟏 + 𝟎 + 𝟏
𝟒
= 𝟎. 𝟕𝟓  75%
Sitting Climbing
Sitting 2 1
Climbing 0 1
Classified class
Actualclass

19-Apr-18 46
 Experimental data is normally scarce and gives
insight on a sole scenario/situation
 Cross-validation: technique for assessing how
the results of a statistical classifier will generalize
to an independent data set (observations)
MATLAB: crossvalind, cvpartition, crossval
 Leave-one-out cross-validation (LOOCV)
 One observation is left out for validation
and remaining ones are used for training
𝟎. 𝟓𝟐 𝟎. 𝟐𝟓
𝟏. 𝟑𝟖 𝟗. 𝟏𝟓
𝟐. 𝟑𝟏
𝟎. 𝟏𝟗
𝟓. 𝟔𝟕
𝟎. 𝟏𝟐
Round 1
0.52 0.25
1.38 9.15
2.31
0.19
5.67
0.12
Round 2
0.52 0.25
1.38 9.15
2.31
0.19
5.67
0.12
Round 3
0.52 0.25
1.38 9.15
2.31
0.19
5.67
0.12
Round 4
0.52 0.25
1.38 9.15
2.31
0.19
5.67
0.12
Experimental
data
Validation set
Training set
acc1
Validation
accuracy
acc2 acc3 acc4
Final accuracy = average(acc_i) ∀ i

19-Apr-18 47
 Experimental data is normally scarce and gives
insight on a sole scenario/situation
 Cross-validation: technique for assessing how
the results of a statistical classifier will generalize
to an independent data set (observations)
MATLAB: crossvalind, cvpartition, crossval
 K-fold cross-validation (K-fold CV)
 Experimental data set is split into K folds
 K-1 folds are used for training
 The remaining fold is used for testing
 The process is repeated K times for each split
𝟎. 𝟓𝟐 𝟎. 𝟐𝟓
𝟏. 𝟑𝟖 𝟗. 𝟏𝟓
𝟐. 𝟑𝟏
𝟎. 𝟏𝟗
𝟓. 𝟔𝟕
𝟎. 𝟏𝟐
Experimental
data
Validation set
Training set
Round 2
0.52 0.25
1.38 9.15
2.31
0.19
5.67
0.12
Round 1
0.52 0.25
1.38 9.15
2.31
0.19
5.67
0.12
2-fold CV
75% 100%Validation
accuracy
Final accuracy = (75%+100%)/2 = 87.5%

References
 Biomedical Signal Processing and Analysis:
 Principal references:
 Mitchell, T. M., Machine Learning. McGraw-Hill, 1997 (Chapter 3)
 Rangayyan, R. M. Biomedical Signal Analysis: A Case-Study Approach. New
York: IEEE Press, 2002 (Chapters 8-9)
 Preece, Stephen J., Goulermas, John Y., Kenney, Laurence P. J., Howard, Dave,
Meijer, K., Crompton, R., et al. (2009). Activity identification using body-mounted
sensors – a review of classification techniques. Physiological Measurement,
30(4), R1–R33
 Other references:
 Bishop, C. M., Pattern Recognition and Machine Learning. Springer, 2006
 Bulling, A.; Blanke, U.; Schiele, B. A Tutorial on Human Activity Recognition Using
Body-worn Inertial Sensors. ACM Comput. Surv. 2014, 46, 1–33.
19-Apr-18 48

References
 Examples of open-access biodata sets
(http://orestibanos.com/datasets.htm)
19-Apr-18 49

Biodata analysis

Recommended

Recommended

More Related Content

What's hot

What's hot (17)

Similar to Biodata analysis

Similar to Biodata analysis (20)

More from Oresti Banos

More from Oresti Banos (20)

Recently uploaded

Recently uploaded (20)

Biodata analysis

Editor's Notes