Human Behaviour Understanding using Top-View RGB-D Data
1. Human Behaviour Understanding
using Top-View RGB-D Data
Daniele Liciotti
d.liciotti@pm.univpm.it
Advisor: Prof. Emanuele Frontoni
Università Politecnica delle Marche
26th March 2018
Daniele Liciotti 26th March 2018 1 / 48
2. Table of Contents
Introduction
State of art
Human behaviour understanding (HBU)
RGB-D data from top-view
RGB-D data for top-view HBU: algorithms
Image processing approaches
Semantic segmentation with deep learning approaches
RGB-D data for top-view HBU: use cases and results
Video surveillance
Intelligent retail environment
Activities of daily living
Conclusions and future works
4. Research problem
In recent years, many researchers have focused their attention on the
automatic analysis of human behaviour because of its important
potential applications and its intrinsic scientific challenges.
Computer vision and deep learning techniques are currently the most
promising solutions for analysing human behaviour, particularly when
used in combination with RGB-D data, which offer high availability,
reliability and affordability.
Several studies adopt the top-view configuration because it eases the task
and makes it simple to extract different trajectory features. This setup also
introduces robustness, since occlusions among individuals are avoided.
6. Human behaviour understanding
The interest in HBU has quickly increased in recent years, motivated by
societal needs that include security, natural interfaces, gaming, affective
computing, and assisted living.
An initial approach can be detecting and tracking the subjects of
interest, which in this case are people. In this way it is possible to
generate motion descriptors which are used to
identify actions or interactions.
Recognising particular behaviours requires the
definition of a set of templates that represent
different classes of behaviours.
7. Taxonomy
The works of Moeslund et al. and Borges et al. have been used to create
a taxonomy of HBU. Human activities can be categorised into four main
groups, namely gesture, action, activity, and behaviour.
Gestures are movements of body parts that can be used to control
and to manipulate, or to communicate. These are the atomic
components describing the motion of a person.
Actions can be seen as temporal concatenations of gestures. Actions
represent voluntary body movements of an arbitrary complexity. An
action implies a detailed sequence of elementary movements.
Activities are a set of multiple actions that can be classified in order
to understand human behaviours.
Behaviours are the responses of subjects to internal, external,
conscious, or unconscious stimuli. A series of activities may be
related to a particular behaviour.
8. Taxonomy
This table summarises the different Degrees of Semantics (DoS)
considered by the taxonomy, along with some examples. Not only do the
time frame and the semantic degree grow at higher levels of this hierarchy;
complexity and computational cost also increase, leading to heavy and
slow recognition systems, as each level requires most of the tasks of the
previous level to be done as well.

DoS         Time lapse
Gesture     frames, seconds
Action      seconds, minutes
Activity    minutes, hours
Behaviour   hours, days
9. RGB-D data from top-view
One of the most commonly used sensor
categories for this type of task is the RGB-D
camera, because of its availability,
reliability and affordability. Reliable depth
maps can provide valuable additional
information to significantly improve tracking
and detection results.
Several research papers adopt the top-view
configuration because it eases the task and
makes it simple to extract different trajectory
features. This setup introduces robustness,
since occlusions among individuals are
avoided.
10. Datasets
The most relevant publicly available datasets with RGB-D data acquired
in a top-view configuration are listed below.
1. TST Fall detection dataset v1
2. TST Intake Monitoring dataset v1
3. UR Fall Detection Dataset
4. Depthvisdoor
5. TVHeads Dataset
6. CBSR
7. TVPR Dataset
12. People detection: algorithms
Different solutions are used for people detection with RGB-D data in a
top-view configuration.
Image processing approaches
Water filling
Multi-level segmentation
Semantic segmentation with DL approaches
U-Net
SegNet
ResNet
FractalNet
13. Water filling1
This algorithm finds the local minimum regions in a depth image by
simulating rain and the flooding of the ground. It scatters raindrops over
the image according to a uniform distribution, then moves each raindrop
towards a local minimum point; if a point is already wet, the drop wets
the point at the next higher level. Puddles are thus formed as the water
flows into the local minimum regions. The contour lines are computed
considering the distance from the local minimum as a function of the
total number of raindrops.
The Thesis contribution consists in directing the drops towards the
subjects rather than distributing them over the whole image. In this way,
the computational time is significantly improved.
1[Zhang et al., 2012][Liciotti et al., 2015]
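The rain/flooding idea can be sketched as follows. This is a minimal toy version on a small depth map, not the Thesis implementation: drops fall uniformly and slide to a 4-connected local minimum, where puddles (head candidates) accumulate; the Thesis optimisation of targeting drops at the subjects is only noted in a comment.

```python
import numpy as np

def water_filling(depth, n_drops=2000, seed=0):
    """Toy water-filling sketch: uniformly scattered raindrops slide to
    neighbouring lower cells of a depth map; puddles accumulate in local
    minimum regions (candidate heads in a top-view depth image).
    The Thesis variant would instead drop rain only over detected subjects
    to cut computation; here drops cover the whole image."""
    rng = np.random.default_rng(seed)
    h, w = depth.shape
    water = np.zeros((h, w), dtype=float)
    ys = rng.integers(0, h, n_drops)
    xs = rng.integers(0, w, n_drops)
    for y, x in zip(ys, xs):
        while True:
            # look for a strictly lower 4-neighbour
            best, by, bx = depth[y, x], y, x
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and depth[ny, nx] < best:
                    best, by, bx = depth[ny, nx], ny, nx
            if (by, bx) == (y, x):
                break  # local minimum reached: the drop settles here
            y, x = by, bx
        water[y, x] += 1.0
    return water
```

On a bowl-shaped depth map every drop settles in the unique minimum, so the puddle map directly marks the head location.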
14. Multi-level segmentation2
This algorithm aims to overcome the limitations of the binary
segmentation method in case of collisions among people. In fact, with a
single-level segmentation, when a collision occurs two people merge into a
single blob (person), with no distinction between the head and the
shoulders of each person. With multiple levels, the head of each person is
still detected, becoming the discriminant element.
2[Liciotti et al., 2014a, Liciotti et al., 2017a]
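The effect of thresholding the depth map at several heights can be illustrated with a small sketch (blob counting via a hand-rolled flood fill; the threshold values are illustrative, not those of the Thesis):

```python
import numpy as np

def count_blobs(mask):
    """Count 4-connected foreground blobs with an iterative flood fill."""
    seen = np.zeros(mask.shape, dtype=bool)
    h, w = mask.shape
    n = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                n += 1
                stack = [(sy, sx)]
                seen[sy, sx] = True
                while stack:
                    y, x = stack.pop()
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            stack.append((ny, nx))
    return n

def heads_by_level(depth, head_level, body_level):
    """Multi-level idea: thresholding at body_level may merge colliding
    people into one blob, but thresholding closer to the camera
    (head_level) keeps the heads apart."""
    return count_blobs(depth <= head_level), count_blobs(depth <= body_level)
```

Two people in contact form one blob at shoulder height but two separate blobs at head height, which is exactly the discriminant the slide describes.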
15. Semantic segmentation with DL approaches
The main ConvNet architectures used for semantic segmentation
problems are:
U-Net [Ronneberger et al., 2015]
SegNet [Badrinarayanan et al., 2015]
ResNet [He et al., 2016]
FractalNet [Larsson et al., 2016]
16. Semantic segmentation with DL approaches (Video)
https://youtu.be/MWjcW-3A5-I
17. Semantic segmentation with DL approaches: Metrics
Typically, different types of metrics are used to measure segmentation
accuracy and performance.
One of the first metrics is the Jaccard index, also known as IoU. It
measures the similarity between finite sample sets, and is defined as the
size of the intersection divided by the size of the union of the sample sets:

IoU = true_pos / (true_pos + false_pos + false_neg)   (1)

Another metric is the Sørensen–Dice index, also called the overlap index.
It is the most used metric in semantic segmentation, and is computed as:

Dice = 2 · true_pos / (2 · true_pos + false_pos + false_neg)   (2)

where the positive class is the heads and the negative class is everything else.
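The two metrics follow directly from Eqs. (1) and (2); a small sketch on binary masks (positive class = heads):

```python
import numpy as np

def confusion_counts(pred, target):
    """True/false positives and false negatives of two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    return tp, fp, fn

def iou(pred, target):
    """Jaccard index (Eq. 1)."""
    tp, fp, fn = confusion_counts(pred, target)
    return tp / (tp + fp + fn)

def dice(pred, target):
    """Sørensen–Dice index (Eq. 2)."""
    tp, fp, fn = confusion_counts(pred, target)
    return 2 * tp / (2 * tp + fp + fn)
```

Note the two indices are monotonically related (Dice = 2·IoU/(1+IoU)), which is why rankings of networks by Jaccard and by Dice agree in the results table.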
18. Semantic segmentation with DL approaches: U-Net3
A new U-Net architecture is proposed. It is composed of two main parts:
contracting path (left side);
expansive path (right side).
The structure remains the same as the original U-Net, but some changes
are made at the end of each layer. In particular, batch normalisation is
added after the first ReLU activation function and after each max pooling
and upsampling function.
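The added normalisation step can be sketched in plain NumPy. This is a toy stand-in (a dense layer in place of a convolution, illustrative sizes) that only shows what the appended batch normalisation computes:

```python
import numpy as np

def batchnorm(x, eps=1e-5):
    """Batch normalisation over the batch axis, as inserted in the
    modified U-Net3 blocks after ReLU / pooling / upsampling."""
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conv_block(x, w):
    """Toy contracting-path step: (dense stand-in for conv) -> ReLU -> BN.
    In the original U-Net the block ends at the ReLU; the proposed
    variant appends the normalisation."""
    y = np.maximum(x @ w, 0.0)   # "conv" + ReLU
    return batchnorm(y)
```

After the appended step each feature is zero-mean and unit-variance across the batch, which stabilises training of the deeper variants compared in the results.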
19. Semantic segmentation with DL approaches: Results
Table: Jaccard and Dice indices of different CNN architectures.

Net                                   Bit  Jaccard Train  Jaccard Val.  Dice Train  Dice Val.
Fractal [Larsson et al., 2016]          8  0.960464       0.948000      0.979833    0.973306
Fractal [Larsson et al., 2016]         16  0.961636       0.947762      0.980443    0.973180
U-Net [Ronneberger et al., 2015]        8  0.896804       0.869399      0.945595    0.930138
U-Net [Ronneberger et al., 2015]       16  0.894410       0.869487      0.944262    0.930188
U-Net2 [Ravishankar et al., 2017]       8  0.923823       0.939086      0.960403    0.968586
U-Net2 [Ravishankar et al., 2017]      16  0.923537       0.938208      0.960249    0.968119
U-Net3                                  8  0.962520       0.931355      0.980902    0.964458
U-Net3                                 16  0.961540       0.929924      0.980393    0.963690
SegNet [Badrinarayanan et al., 2015]    8  0.884182       0.823731      0.938531    0.903347
SegNet [Badrinarayanan et al., 2015]   16  0.884162       0.827745      0.938520    0.905756
ResNet [He et al., 2016]                8  0.932160       0.856337      0.964889    0.922609
ResNet [He et al., 2016]               16  0.933436       0.848240      0.965572    0.917889
20. Semantic segmentation with DL approaches: Results
The FractalNet and the ResNet reach high values after a few epochs.
The U-Net3, instead, increases its value more slowly. The classic U-Net
always remains below all the other networks.
Figure: Jaccard index trends during the fit process (Jaccard index vs.
epochs, 0-200) for the 8-bit and 16-bit variants of U-Net, U-Net2,
U-Net3, FractalNet, ResNet, and SegNet.
21. Semantic segmentation with DL approaches: Results
Figure: qualitative segmentation results on 8-bit and 16-bit inputs,
compared with the ground-truth labels, for U-Net [Ronneberger et al., 2015],
U-Net2 [Ravishankar et al., 2017], U-Net3, ResNet [He et al., 2016],
SegNet [Badrinarayanan et al., 2015], and FractalNet [Larsson et al., 2016].
23. RGB-D data for top-view HBU: use cases and results
In order to demonstrate the benefits of HBU using
RGB-D data from a top-view position, three use
cases have been analysed:
Video surveillance, described through an
application about person re-identification;
Intelligent retail environment, where a
novel shopper analytics system is presented;
Activities of daily living, through the case
study of an ad-hoc application for home
environmental monitoring.
Daniele Liciotti 26th March 2018 23 / 48
24. Video surveillance: Re-Identification
Re-identification is the process of determining whether different instances
or images of a person, recorded at different moments, belong to the same
subject.
Re-id represents a valuable task in video-surveillance scenarios, where
long-term activities have to be modelled within a large and structured
environment (e.g., an airport or a metro station).
In this context, a robust modelling of the entire body appearance of the
individual is essential.
Figure: top-view camera setup (4.43 m, 3.31 m; field of view 58° H × 45° V).
25. Video surveillance: Re-Identification3
For the Re-id evaluation, data of 100 people were collected, acquired
across intervals of days and at different times.
Each person walked with an average gait within the recording area in one
direction, stopping for a few seconds just below the camera; then he/she
turned around and repeated the same route in the opposite direction,
again stopping under the camera for a while.
3[Liciotti et al., 2017c]
26. Video surveillance: Re-Identification
Seven of the nine selected features are anthropometric features extracted
from the depth image. The remaining two colour-based features are
acquired from the colour image.
TVH  = {H^p_h, H^p_o}
TVD  = {d^p_1, d^p_2, d^p_3, d^p_4, d^p_5, d^p_6, d^p_7}
TVDH = {d^p_1, d^p_2, d^p_3, d^p_4, d^p_5, d^p_6, d^p_7, H^p_h, H^p_o}
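A minimal sketch of how the combined TVDH descriptor can be built and matched. The nearest-neighbour rule with Euclidean distance is an illustrative choice, not necessarily the matcher used in the Thesis, and the feature values below are synthetic:

```python
import numpy as np

def tvdh(depth_feats, colour_feats):
    """Concatenate the seven anthropometric depth features d1..d7 with
    the two colour-based features into a single TVDH descriptor."""
    return np.concatenate([np.asarray(depth_feats, float),
                           np.asarray(colour_feats, float)])

def rank_gallery(query, gallery):
    """Rank gallery subjects by ascending Euclidean distance to the
    query descriptor (illustrative matching rule)."""
    d = np.linalg.norm(np.asarray(gallery, float) - query, axis=1)
    return np.argsort(d)
```

A ranking produced this way is exactly what the CMC curve on the next slide evaluates.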
28. Video surveillance: Re-Identification
Cumulative Matching Characteristic
The CMC curve represents the expectation of finding the correct match
in the top n matches.
Figure: CMC curves (recognition rate vs. rank, 1-100) for the
Depth+Colour, Colour-only, and Depth-only descriptors.
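The CMC definition above translates directly into code; a short sketch on toy rankings (identities and rank lists are synthetic):

```python
import numpy as np

def cmc(rankings, true_ids):
    """Cumulative Matching Characteristic: cmc[n-1] is the fraction of
    queries whose correct identity appears within the top-n matches."""
    n_gallery = len(rankings[0])
    hits = np.zeros(n_gallery)
    for ranking, true_id in zip(rankings, true_ids):
        rank = list(ranking).index(true_id)  # 0-based rank of correct match
        hits[rank:] += 1                     # counts toward top-(rank+1) and beyond
    return hits / len(rankings)
```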
29. Intelligent retail environment4
In the IRE context, this Thesis presents a smart
and low-cost embedded sensor network able to
identify customers and to analyse their
behaviour and shelf interactions.
4[Liciotti et al., 2017a, Liciotti et al., 2014a, Liciotti et al., 2014c]
31. User-shelf interaction recognition with CNNs
Several colour images were collected during the customers' shopping
activity, in particular when a part of the body comes into contact with
the shelf.
Positive: the hand has already completed the interaction and holds a
product, or the customer is putting the product back on the shelf.
Neutral: the hand is approaching the shelf in order to grab a
product.
Negative: this class contains the accidental interactions of
customers with the shelf.
(a) Positive. (b) Neutral. (c) Negative.
https://youtu.be/jSkwYO2CMVo
32. User-shelf interaction recognition with CNNs
For this task, 4 CNNs were selected:
CNN
CNN 2
AlexNet
CaffeNet
Figure: the four network architectures, from the input image to the
output classes.
33. User-shelf interaction recognition with CNNs
Table: User-shelf interaction results on train and validation sets.

Net                                Accuracy Train  Accuracy Validation
CNN                                0.716238        0.809045
CNN2                               0.846936        0.909548
AlexNet [Krizhevsky et al., 2012]  0.737039        0.809045
CaffeNet [Jia et al., 2014]        0.885805        0.919598
Table: User-shelf interaction results on test set.
Net Precision Recall F1-Score
CNN 0.780691 0.622234 0.691118
CNN2 0.873640 0.816821 0.843801
AlexNet [Krizhevsky et al., 2012] 0.771158 0.687371 0.726130
CaffeNet [Jia et al., 2014] 0.899060 0.873149 0.885705
34. Intelligent retail environment
The main indicators adopted to evaluate shopper behaviour and
preferences are:
Key    Description
Nv     Number of visitors, i.e. people crossing the camera field of view
Vz     Number of visitors in each category
Vs     Number of visitors interacting with the shelf, where Vs ⊂ Nv
Ns     Number of stopped visitors, i.e. people who stop in front of the selected
       category's shelf (min 5 s), with Vs ⊂ Ns
CR     Conversion rate CR = Vs/Ns ∈ [0, 1], the ratio between the number of visitors
       interacting with the shelf products and the number of stopped visitors
Is     Number of interactions per person, Is = I/Vs, where I is the total number of interactions
T̄      Average visit time T̄ = (Σ_{i=1..Nv} Δt_i)/Nv, where the visit time Δt_i is the
       permanence of each person in the camera view
P      Number of products touched
Ppos   Number of positive interactions: the shopper touches the product and "buys" it
       (takes it from the shelf without returning it)
Pneu   Number of neutral interactions: the shopper just touches the product without holding it
Pneg   Number of negative interactions: the shopper touches the product, holds it for a
       while and returns it to the shelf
TI     Total duration of interactions TI = Σ_{i=1..I} δt_i, where δt_i = t_{i,end} − t_{i,init}
       is the difference between the final and initial instants of interaction i
T̄I     Average interaction time T̄I = TI/I
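Several of these indicators can be computed from simple detection logs; a sketch with illustrative inputs (the log format is an assumption, not the system's actual data model):

```python
def retail_kpis(visit_times, interaction_spans, n_stopped, n_interacting):
    """Compute a subset of the slide's indicators.
    visit_times: dwell time Δt_i of each detected visitor (seconds)
    interaction_spans: (t_init, t_end) of each shelf interaction
    n_stopped: Ns, n_interacting: Vs (visitor counts)"""
    Nv = len(visit_times)
    T_mean = sum(visit_times) / Nv                      # average visit time T̄
    CR = n_interacting / n_stopped                      # conversion rate Vs/Ns
    I = len(interaction_spans)
    TI = sum(t1 - t0 for t0, t1 in interaction_spans)   # total interaction time
    Is = I / n_interacting                              # interactions per person
    TI_mean = TI / I                                    # average interaction time
    return {"Nv": Nv, "T": T_mean, "CR": CR,
            "Is": Is, "TI": TI, "TI_mean": TI_mean}
```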
36. Activities of daily living5
ADLs
The ADLs are a series of basic activities, performed by individuals on a
daily basis, necessary for independent living at home or in the community.
In this work, the interest is focused on reliably detecting the daily
activities that a person performs in the kitchen. In this context, this
work proposes an automated RGB-D video analysis system that recognises
human ADLs related to classical actions such as making a coffee.
The main goal is to classify and predict the probability that a specific
action happens.
5[Liciotti et al., 2017b]
37. Activities of daily living
In this work, the HMM method is used to facilitate the detection of
anomalous sequences within a classical action sequence such as making a
coffee.
The information provided by the head and hands detection algorithms is
used as input for a set of HMMs. After training the models, an action
sequence s = {s1, s2, . . . , sn} is considered and, for each model λ, the
probability P(s|λ) of the observation sequence is calculated.
Figure: the 3D head and hand points provide the observations O for a
bank of HMMs (HMM1, HMM2, HMM3, ..., one per activity Activity1 ...
Activityn); classification selects the model with maximum likelihood.
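The maximum-likelihood selection over a bank of HMMs can be sketched with a discrete forward algorithm. This is an illustrative toy (discrete symbols, two states, made-up parameters); the Thesis uses continuous observations from the 3D head and hand points:

```python
import numpy as np

def sequence_likelihood(obs, pi, A, B):
    """Forward algorithm: P(s | λ) of a discrete observation sequence
    under HMM λ = (pi, A, B) with initial, transition and emission
    probabilities."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # predict, then weight by emission
    return alpha.sum()

def classify_activity(obs, models):
    """Select the activity whose HMM gives the maximum likelihood,
    as in the classification stage of the figure."""
    return int(np.argmax([sequence_likelihood(obs, *m) for m in models]))
```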
39. Activities of daily living6
In the AAL context, an automated RGB-D video system was developed
that recognises the activities of elderly users, with a particular focus on
the detection of falls. The study is based on people detection and
tracking algorithms, used to map the users and to detect important
events such as falling or sitting in a chair. The system allows extracting
and collecting many statistical data that, properly processed, provide
knowledge about the elderly users.
6[Liciotti et al., 2015, Liciotti et al., 2014b]
40. Activities of daily living (Video)
https://youtu.be/qQZk1zlssmY
42. Conclusions
The main contributions of this Thesis can be summarised as follows:
design and implementation of two novel algorithms for people
detection from a top-view configuration with RGB-D data using image
processing approaches. In particular, a performance improvement of
the water filling algorithm is proposed in terms of computational
complexity. Furthermore, a new algorithm, called multi-level
segmentation, has been developed. It carries out several
segmentations at different height levels in order to find all the
heads of the people.
development of semantic segmentation CNNs for head detection; in
particular, U-Net, SegNet, FractalNet, and ResNet are used in this
work. By introducing changes in different layers of these nets,
performance is significantly improved;
43. Conclusions
proposal and validation of new descriptors for the Re-id task in a
top-view configuration. The descriptors are composed of anthropometric
and colour-based features.
design and implementation of several CNNs for user-shelf interaction
recognition. Using a manually annotated dataset made up of
images representing interactions between user and shelf, four
different types of CNNs have been trained.
creation of four publicly available datasets:
TVPR Dataset
TVHeads Dataset
RADiAL Dataset
User-Shelf Interactions Dataset
45. Future works
Future works will include:
the use of other types of RGB-D sensors;
the study of more sophisticated features;
the integration of video and audio systems, so that abnormal events
can also be identified through audio.
46. References I
[Cenci et al., 2016] Cenci, A., Liciotti, D., Ercoli, I., Zingaretti, P., and Carnielli, V. P. (2016).
A cloud-based healthcare infrastructure for medical device integration: the bilirubinometer case
study.
In Mechatronic and Embedded Systems and Applications (MESA), 2016 IEEE/ASME 12th
International Conference, pages 1–6. IEEE.
[Cenci et al., 2015] Cenci, A., Liciotti, D., Frontoni, E., Mancini, A., and Zingaretti, P. (2015).
Non-contact monitoring of preterm infants using rgb-d camera.
In ASME 2015 International Design Engineering Technical Conferences and Computers and
Information in Engineering Conference. American Society of Mechanical Engineers.
[Cenci et al., 2017] Cenci, A., Liciotti, D., Frontoni, E., Zingaretti, P., and Carnielli, V. P. (2017).
Movements analysis of preterm infants by using depth sensor.
In Proceedings of the International Conference on Internet of Things and Machine Learning
(IML 2017). Liverpool, UK.
[Ciabattoni et al., 2017] Ciabattoni, L., Frontoni, E., Liciotti, D., Paolanti, M., and Romeo, L.
(2017).
A sensor fusion approach for measuring emotional customer experience in an intelligent retail
environment.
In 2017 IEEE 7th International Conference on Consumer Electronics - Berlin (ICCE-Berlin)
(ICCE-Berlin 2017), Berlin, Germany.
[Frontoni et al., 2017] Frontoni, E., Liciotti, D., Paolanti, M., Pollini, R., and Zingaretti, P.
(2017).
Design of an interoperable framework with domotic sensors network integration.
In 2017 IEEE 7th International Conference on Consumer Electronics - Berlin (ICCE-Berlin)
(ICCE-Berlin 2017), Berlin, Germany.
47. References II
[Liciotti et al., 2016] Liciotti, D., Cenci, A., Frontoni, E., Mancini, A., and Zingaretti, P. (2016).
An intelligent rgb-d video system for bus passenger counting.
In IAS-14. Springer.
[Liciotti et al., 2014a] Liciotti, D., Contigiani, M., Frontoni, E., Mancini, A., Zingaretti, P., and
Placidi, V. (2014a).
Shopper analytics: A customer activity recognition system using a distributed rgb-d camera
network.
In Video Analytics for Audience Measurement, pages 146–157. Springer.
[Liciotti et al., 2014b] Liciotti, D., Ferroni, G., Frontoni, E., Squartini, S., Principi, E., Bonfigli,
R., Zingaretti, P., and Piazza, F. (2014b).
Advanced integration of multimedia assistive technologies: A prospective outlook.
In Mechatronic and Embedded Systems and Applications (MESA), 2014 IEEE/ASME 10th
International Conference, pages 1–6. IEEE.
[Liciotti et al., 2017a] Liciotti, D., Frontoni, E., Mancini, A., and Zingaretti, P. (2017a).
Pervasive System for Consumer Behaviour Analysis in Retail Environments, pages 12–23.
Springer International Publishing.
[Liciotti et al., 2017b] Liciotti, D., Frontoni, E., Zingaretti, P., Bellotto, N., and Duckett, T.
(2017b).
Hmm-based activity recognition with a ceiling rgb-d camera.
In ICPRAM (International Conference on Pattern Recognition Applications and Methods).
[Liciotti et al., 2015] Liciotti, D., Massi, G., Frontoni, E., Mancini, A., and Zingaretti, P. (2015).
Human activity analysis for in-home fall risk assessment.
In Communication Workshop (ICCW), 2015 IEEE International Conference on, pages 284–289.
IEEE.
48. References III
[Liciotti et al., 2017c] Liciotti, D., Paolanti, M., Frontoni, E., Mancini, A., and Zingaretti, P.
(2017c).
Person Re-identification Dataset with RGB-D Camera in a Top-View Configuration, pages
1–11.
Springer International Publishing.
[Liciotti et al., 2014c] Liciotti, D., Zingaretti, P., and Placidi, V. (2014c).
An automatic analysis of shoppers behaviour using a distributed rgb-d cameras system.
In Mechatronic and Embedded Systems and Applications (MESA), 2014 IEEE/ASME 10th
International Conference, pages 1–6. IEEE.
[Paolanti et al., 2017] Paolanti, M., Liciotti, D., Pietrini, R., Frontoni, E., and Mancini, A.
(2017).
Modelling and forecasting customer navigation in intelligent retail environments.
Journal of Intelligent & Robotic Systems.
[Pierdicca et al., 2015] Pierdicca, R., Liciotti, D., Contigiani, M., Frontoni, E., Mancini, A., and
Zingaretti, P. (2015).
Low cost embedded system for increasing retail environment intelligence.
In VAAM 2015 Video Analytics for Audience Measurement & IEEE International Conference on
Multimedia and Expo. IEEE.
[Sturari et al., 2016] Sturari, M., Liciotti, D., Pierdicca, R., Frontoni, E., Mancini, A., Contigiani,
M., and Zingaretti, P. (2016).
Robust and affordable retail customer profiling by vision and radio beacon sensor fusion.
Pattern Recognition Letters.