Human Behaviour Understanding using Top-View RGB-D Data
1. Human Behaviour Understanding
using Top-View RGB-D Data
Daniele Liciotti
d.liciotti@pm.univpm.it
Advisor: Prof. Emanuele Frontoni
Università Politecnica delle Marche
26th March 2018
Daniele Liciotti 26th March 2018 1 / 48
2. Table of Contents
Introduction
State of art
Human behaviour understanding (HBU)
RGB-D data from top-view
RGB-D data for top-view HBU: algorithms
Image processing approaches
Semantic segmentation with deep learning approaches
RGB-D data for top-view HBU: use cases and results
Video surveillance
Intelligent retail environment
Activities of daily living
Conclusions and future works
4. Research problem
In recent years, many researchers have focused their attention on the
automatic analysis of human behaviour because of its important
potential applications and its intrinsic scientific challenges.
Computer vision and deep learning techniques are currently the most
promising solutions for analysing human behaviour, particularly when
used in combination with RGB-D data, which offer high availability,
reliability and affordability.
Several studies adopt the top-view configuration because it eases the task
and makes it simple to extract different trajectory features. This setup also
introduces robustness, since occlusions among individuals are avoided.
6. Human behaviour understanding
The interest in HBU has quickly increased in recent years, motivated by
societal needs that include security, natural interfaces, gaming, affective
computing, and assisted living.
An initial approach can be detecting and tracking the subjects of
interest, which in this case are people. In this way it is possible to
generate motion descriptors which are used to
identify actions or interactions.
Recognising particular behaviours requires the
definition of a set of templates that represent
different classes of behaviours.
7. Taxonomy
The works of Moeslund et al. and Borges et al. have been used to create
a taxonomy of HBU. Human activities can be categorised into four main
groups, namely gesture, action, activity, and behaviour.
Gestures are movements of body parts that can be used to control
and to manipulate, or to communicate. These are the atomic
components describing the motion of a person.
Actions can be seen as temporal concatenations of gestures. Actions
represent voluntary body movements of an arbitrary complexity. An
action implies a detailed sequence of elementary movements.
Activities are a set of multiple actions that can be classified in order
to understand human behaviours.
Behaviours are the responses of subjects to internal, external,
conscious, or unconscious stimuli. A series of activities may be
related to a particular behaviour.
8. Taxonomy
This table summarises the different Degrees of Semantics (DoS)
considered by the taxonomy, along with some examples. Not only do the
time frame and the semantic degree grow at higher levels of this hierarchy;
complexity and computational cost also increase, leading to heavy and
slow recognition systems, as each level requires most of the tasks of the
previous level to be done as well.

DoS         Time lapse
Gesture     frames, seconds
Action      seconds, minutes
Activity    minutes, hours
Behaviour   hours, days
9. RGB-D data from top-view
One of the most commonly used sensor
categories for this type of task is the RGB-D
camera, because of its availability,
reliability and affordability. Reliable depth
maps can provide valuable additional
information to significantly improve tracking
and detection results.
Several research papers adopt the top-view
configuration because it eases the task and
makes it simple to extract different trajectory
features. This setup introduces robustness,
since occlusions among individuals are
avoided.
10. Datasets
The most relevant publicly available datasets with RGB-D data acquired
in a top-view configuration are listed below.
1. TST Fall detection dataset v1
2. TST Intake Monitoring dataset v1
3. UR Fall Detection Dataset
4. Depthvisdoor
5. TVHeads Dataset
6. CBSR
7. TVPR Dataset
12. People detection: algorithms
Different solutions are used for people detection with RGB-D data in a
top-view configuration.
Image processing approaches
Water filling
Multi-level segmentation
Semantic segmentation with DL approaches
U-Net
SegNet
ResNet
FractalNet
13. Water filling1
This algorithm finds the local minimum regions in a depth image by
simulating rain and the flooding of the ground. It scatters raindrops over
the image according to a uniform distribution, then moves each raindrop
towards a local minimum point; if a point is already wet, the drop wets
the point at the next higher level. Puddles are thus formed as the water
flows into the local minimum regions. The contour lines are computed
considering the distance from the local minimum as a function of the
total number of raindrops.
The Thesis contribution consists in directing the drops towards the
subjects rather than distributing them over the whole image. In this way,
the computational time is significantly improved.
1[Zhang et al., 2012][Liciotti et al., 2015]
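The rain/flooding idea can be sketched as follows. This is a minimal toy version on a small depth map, not the Thesis implementation: drops fall uniformly and slide to a 4-connected local minimum, where puddles (head candidates) accumulate; the Thesis optimisation of targeting drops at the subjects is only noted in a comment.

```python
import numpy as np

def water_filling(depth, n_drops=2000, seed=0):
    """Toy water-filling sketch: uniformly scattered raindrops slide to
    neighbouring lower cells of a depth map; puddles accumulate in local
    minimum regions (candidate heads in a top-view depth image).
    The Thesis variant would instead drop rain only over detected subjects
    to cut computation; here drops cover the whole image."""
    rng = np.random.default_rng(seed)
    h, w = depth.shape
    water = np.zeros((h, w), dtype=float)
    ys = rng.integers(0, h, n_drops)
    xs = rng.integers(0, w, n_drops)
    for y, x in zip(ys, xs):
        while True:
            # look for a strictly lower 4-neighbour
            best, by, bx = depth[y, x], y, x
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and depth[ny, nx] < best:
                    best, by, bx = depth[ny, nx], ny, nx
            if (by, bx) == (y, x):
                break  # local minimum reached: the drop settles here
            y, x = by, bx
        water[y, x] += 1.0
    return water
```

On a bowl-shaped depth map every drop settles in the unique minimum, so the puddle map directly marks the head location.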
14. Multi-level segmentation2
This algorithm aims to overcome the limitations of the binary
segmentation method in case of collisions among people. In fact, with a
single-level segmentation, when a collision occurs two people merge into a
single blob (person), with no distinction between the head and the
shoulders of each person. With multiple levels, the head of each person is
still detected, becoming the discriminant element.
2[Liciotti et al., 2014a, Liciotti et al., 2017a]
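The effect of thresholding the depth map at several heights can be illustrated with a small sketch (blob counting via a hand-rolled flood fill; the threshold values are illustrative, not those of the Thesis):

```python
import numpy as np

def count_blobs(mask):
    """Count 4-connected foreground blobs with an iterative flood fill."""
    seen = np.zeros(mask.shape, dtype=bool)
    h, w = mask.shape
    n = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                n += 1
                stack = [(sy, sx)]
                seen[sy, sx] = True
                while stack:
                    y, x = stack.pop()
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            stack.append((ny, nx))
    return n

def heads_by_level(depth, head_level, body_level):
    """Multi-level idea: thresholding at body_level may merge colliding
    people into one blob, but thresholding closer to the camera
    (head_level) keeps the heads apart."""
    return count_blobs(depth <= head_level), count_blobs(depth <= body_level)
```

Two people in contact form one blob at shoulder height but two separate blobs at head height, which is exactly the discriminant the slide describes.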
15. Semantic segmentation with DL approaches
The main ConvNet architectures used for semantic segmentation
problems are:
U-Net [Ronneberger et al., 2015]
SegNet [Badrinarayanan et al., 2015]
ResNet [He et al., 2016]
FractalNet [Larsson et al., 2016]
16. Semantic segmentation with DL approaches (Video)
https://youtu.be/MWjcW-3A5-I
17. Semantic segmentation with DL approaches: Metrics
Typically, different types of metrics are used to measure segmentation
accuracy and performance.
One of the first metrics is the Jaccard index, also known as IoU. It
measures the similarity between finite sample sets, and is defined as the
size of the intersection divided by the size of the union of the sample sets:

IoU = true_pos / (true_pos + false_pos + false_neg)   (1)

Another metric is the Sørensen–Dice index, also called the overlap index.
It is the most used metric in semantic segmentation, and is computed as:

Dice = 2 · true_pos / (2 · true_pos + false_pos + false_neg)   (2)

where the positive class is the heads and the negative class is everything else.
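The two metrics follow directly from Eqs. (1) and (2); a small sketch on binary masks (positive class = heads):

```python
import numpy as np

def confusion_counts(pred, target):
    """True/false positives and false negatives of two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    return tp, fp, fn

def iou(pred, target):
    """Jaccard index (Eq. 1)."""
    tp, fp, fn = confusion_counts(pred, target)
    return tp / (tp + fp + fn)

def dice(pred, target):
    """Sørensen–Dice index (Eq. 2)."""
    tp, fp, fn = confusion_counts(pred, target)
    return 2 * tp / (2 * tp + fp + fn)
```

Note the two indices are monotonically related (Dice = 2·IoU/(1+IoU)), which is why rankings of networks by Jaccard and by Dice agree in the results table.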
18. Semantic segmentation with DL approaches: U-Net3
A new U-Net architecture is proposed. It is composed of two main parts:
contracting path (left side);
expansive path (right side).
The structure remains the same as the original U-Net, but some changes
are made at the end of each layer. In particular, batch normalisation is
added after the first ReLU activation function and after each max pooling
and upsampling function.
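The added normalisation step can be sketched in plain NumPy. This is a toy stand-in (a dense layer in place of a convolution, illustrative sizes) that only shows what the appended batch normalisation computes:

```python
import numpy as np

def batchnorm(x, eps=1e-5):
    """Batch normalisation over the batch axis, as inserted in the
    modified U-Net3 blocks after ReLU / pooling / upsampling."""
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conv_block(x, w):
    """Toy contracting-path step: (dense stand-in for conv) -> ReLU -> BN.
    In the original U-Net the block ends at the ReLU; the proposed
    variant appends the normalisation."""
    y = np.maximum(x @ w, 0.0)   # "conv" + ReLU
    return batchnorm(y)
```

After the appended step each feature is zero-mean and unit-variance across the batch, which stabilises training of the deeper variants compared in the results.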
19. Semantic segmentation with DL approaches: Results
Table: Jaccard and Dice indices of different CNN architectures.

Net                                   Bit  Jaccard Train  Jaccard Val.  Dice Train  Dice Val.
Fractal [Larsson et al., 2016]          8  0.960464       0.948000      0.979833    0.973306
Fractal [Larsson et al., 2016]         16  0.961636       0.947762      0.980443    0.973180
U-Net [Ronneberger et al., 2015]        8  0.896804       0.869399      0.945595    0.930138
U-Net [Ronneberger et al., 2015]       16  0.894410       0.869487      0.944262    0.930188
U-Net2 [Ravishankar et al., 2017]       8  0.923823       0.939086      0.960403    0.968586
U-Net2 [Ravishankar et al., 2017]      16  0.923537       0.938208      0.960249    0.968119
U-Net3                                  8  0.962520       0.931355      0.980902    0.964458
U-Net3                                 16  0.961540       0.929924      0.980393    0.963690
SegNet [Badrinarayanan et al., 2015]    8  0.884182       0.823731      0.938531    0.903347
SegNet [Badrinarayanan et al., 2015]   16  0.884162       0.827745      0.938520    0.905756
ResNet [He et al., 2016]                8  0.932160       0.856337      0.964889    0.922609
ResNet [He et al., 2016]               16  0.933436       0.848240      0.965572    0.917889
20. Semantic segmentation with DL approaches: Results
The FractalNet and the ResNet reach high values after a few epochs.
The U-Net3, instead, increases its value more slowly. The classic U-Net
always remains below all the other networks.
Figure: Jaccard index trends during the fit process (Jaccard index vs.
epochs, 0-200) for the 8-bit and 16-bit variants of U-Net, U-Net2,
U-Net3, FractalNet, ResNet, and SegNet.
21. Semantic segmentation with DL approaches: Results
Figure: qualitative segmentation results on 8-bit and 16-bit inputs,
compared with the ground-truth labels, for U-Net [Ronneberger et al., 2015],
U-Net2 [Ravishankar et al., 2017], U-Net3, ResNet [He et al., 2016],
SegNet [Badrinarayanan et al., 2015], and FractalNet [Larsson et al., 2016].
23. RGB-D data for top-view HBU: use cases and results
In order to demonstrate the benefits of HBU using
RGB-D data from a top-view position, three use
cases have been analysed:
Video surveillance, described through an
application about person re-identification;
Intelligent retail environment, where a
novel shopper analytics system is presented;
Activities of daily living, through the case
study of an ad-hoc application for home
environmental monitoring.
Daniele Liciotti 26th March 2018 23 / 48
24. Video surveillance: Re-Identification
Re-identification is the process of determining whether different instances
or images of a person, recorded at different moments, belong to the same
subject.
Re-id represents a valuable task in video-surveillance scenarios, where
long-term activities have to be modelled within a large and structured
environment (e.g., an airport or a metro station).
In this context, a robust modelling of the entire body appearance of the
individual is essential.
Figure: top-view camera setup (4.43 m, 3.31 m; field of view 58° H × 45° V).
25. Video surveillance: Re-Identification3
For the Re-id evaluation, data of 100 people were collected, acquired
across intervals of days and at different times.
Each person walked with an average gait within the recording area in one
direction, stopping for a few seconds just below the camera; then he/she
turned around and repeated the same route in the opposite direction,
again stopping under the camera for a while.
3[Liciotti et al., 2017c]
26. Video surveillance: Re-Identification
Seven of the nine selected features are anthropometric features extracted
from the depth image. The remaining two colour-based features are
acquired from the colour image.
TVH  = {H^p_h, H^p_o}
TVD  = {d^p_1, d^p_2, d^p_3, d^p_4, d^p_5, d^p_6, d^p_7}
TVDH = {d^p_1, d^p_2, d^p_3, d^p_4, d^p_5, d^p_6, d^p_7, H^p_h, H^p_o}
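A minimal sketch of how the combined TVDH descriptor can be built and matched. The nearest-neighbour rule with Euclidean distance is an illustrative choice, not necessarily the matcher used in the Thesis, and the feature values below are synthetic:

```python
import numpy as np

def tvdh(depth_feats, colour_feats):
    """Concatenate the seven anthropometric depth features d1..d7 with
    the two colour-based features into a single TVDH descriptor."""
    return np.concatenate([np.asarray(depth_feats, float),
                           np.asarray(colour_feats, float)])

def rank_gallery(query, gallery):
    """Rank gallery subjects by ascending Euclidean distance to the
    query descriptor (illustrative matching rule)."""
    d = np.linalg.norm(np.asarray(gallery, float) - query, axis=1)
    return np.argsort(d)
```

A ranking produced this way is exactly what the CMC curve on the next slide evaluates.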
28. Video surveillance: Re-Identification
Cumulative Matching Characteristic
The CMC curve represents the expectation of finding the correct match
in the top n matches.
Figure: CMC curves (recognition rate vs. rank, 1-100) for the
Depth+Colour, Colour-only, and Depth-only descriptors.
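The CMC definition above translates directly into code; a short sketch on toy rankings (identities and rank lists are synthetic):

```python
import numpy as np

def cmc(rankings, true_ids):
    """Cumulative Matching Characteristic: cmc[n-1] is the fraction of
    queries whose correct identity appears within the top-n matches."""
    n_gallery = len(rankings[0])
    hits = np.zeros(n_gallery)
    for ranking, true_id in zip(rankings, true_ids):
        rank = list(ranking).index(true_id)  # 0-based rank of correct match
        hits[rank:] += 1                     # counts toward top-(rank+1) and beyond
    return hits / len(rankings)
```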
29. Intelligent retail environment4
In the IRE context, this Thesis presents a smart
and low-cost embedded sensor network able to
identify customers and to analyse their
behaviour and shelf interactions.
4[Liciotti et al., 2017a, Liciotti et al., 2014a, Liciotti et al., 2014c]
31. User-shelf interaction recognition with CNNs
Several colour images were collected during the customers' shopping
activity, in particular when a part of the body comes into contact with
the shelf.
Positive: the hand has already completed the interaction and holds a
product, or the customer is putting the product back on the shelf.
Neutral: the hand is approaching the shelf in order to grab a
product.
Negative: this class contains the accidental interactions of
customers with the shelf.
(a) Positive. (b) Neutral. (c) Negative.
https://youtu.be/jSkwYO2CMVo
32. User-shelf interaction recognition with CNNs
For this task, 4 CNNs were selected:
CNN
CNN 2
AlexNet
CaffeNet
Figure: the four network architectures, from the input image to the
output classes.
33. User-shelf interaction recognition with CNNs
Table: User-shelf interaction results on train and validation sets.

Net                                Accuracy Train  Accuracy Validation
CNN                                0.716238        0.809045
CNN2                               0.846936        0.909548
AlexNet [Krizhevsky et al., 2012]  0.737039        0.809045
CaffeNet [Jia et al., 2014]        0.885805        0.919598
Table: User-shelf interaction results on test set.
Net Precision Recall F1-Score
CNN 0.780691 0.622234 0.691118
CNN2 0.873640 0.816821 0.843801
AlexNet [Krizhevsky et al., 2012] 0.771158 0.687371 0.726130
CaffeNet [Jia et al., 2014] 0.899060 0.873149 0.885705
34. Intelligent retail environment
The main indicators adopted to evaluate shopper behaviour and
preferences are:
Key    Description
Nv     Number of visitors, i.e. people crossing the camera field of view
Vz     Number of visitors in each category
Vs     Number of visitors interacting with the shelf, where Vs ⊂ Nv
Ns     Number of stopped visitors, i.e. people who stop in front of the selected
       category's shelf (min 5 s), with Vs ⊂ Ns
CR     Conversion rate CR = Vs/Ns ∈ [0, 1], the ratio between the number of visitors
       interacting with the shelf products and the number of stopped visitors
Is     Number of interactions per person, Is = I/Vs, where I is the total number of interactions
T̄      Average visit time T̄ = (Σ_{i=1..Nv} Δt_i)/Nv, where the visit time Δt_i is the
       permanence of each person in the camera view
P      Number of products touched
Ppos   Number of positive interactions: the shopper touches the product and "buys" it
       (takes it from the shelf without returning it)
Pneu   Number of neutral interactions: the shopper just touches the product without holding it
Pneg   Number of negative interactions: the shopper touches the product, holds it for a
       while and returns it to the shelf
TI     Total duration of interactions TI = Σ_{i=1..I} δt_i, where δt_i = t_{i,end} − t_{i,init}
       is the difference between the final and initial instants of interaction i
T̄I     Average interaction time T̄I = TI/I
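Several of these indicators can be computed from simple detection logs; a sketch with illustrative inputs (the log format is an assumption, not the system's actual data model):

```python
def retail_kpis(visit_times, interaction_spans, n_stopped, n_interacting):
    """Compute a subset of the slide's indicators.
    visit_times: dwell time Δt_i of each detected visitor (seconds)
    interaction_spans: (t_init, t_end) of each shelf interaction
    n_stopped: Ns, n_interacting: Vs (visitor counts)"""
    Nv = len(visit_times)
    T_mean = sum(visit_times) / Nv                      # average visit time T̄
    CR = n_interacting / n_stopped                      # conversion rate Vs/Ns
    I = len(interaction_spans)
    TI = sum(t1 - t0 for t0, t1 in interaction_spans)   # total interaction time
    Is = I / n_interacting                              # interactions per person
    TI_mean = TI / I                                    # average interaction time
    return {"Nv": Nv, "T": T_mean, "CR": CR,
            "Is": Is, "TI": TI, "TI_mean": TI_mean}
```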
36. Activities of daily living5
ADLs
The ADLs are a series of basic activities, performed by individuals on a
daily basis, necessary for independent living at home or in the community.
In this work, the interest is focused on reliably detecting the daily
activities that a person performs in the kitchen. In this context, this
work proposes an automated RGB-D video analysis system that recognises
human ADLs related to classical actions such as making a coffee.
The main goal is to classify and predict the probability that a specific
action happens.
5[Liciotti et al., 2017b]
37. Activities of daily living
In this work, the HMM method is used to facilitate the detection of
anomalous sequences within a classical action sequence such as making a
coffee.
The information provided by the head and hands detection algorithms is
used as input for a set of HMMs. After training the models, an action
sequence s = {s1, s2, . . . , sn} is considered and, for each model λ, the
probability P(s|λ) of the observation sequence is calculated.
Figure: the 3D head and hand points provide the observations O for a
bank of HMMs (HMM1, HMM2, HMM3, ..., one per activity Activity1 ...
Activityn); classification selects the model with maximum likelihood.
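The maximum-likelihood selection over a bank of HMMs can be sketched with a discrete forward algorithm. This is an illustrative toy (discrete symbols, two states, made-up parameters); the Thesis uses continuous observations from the 3D head and hand points:

```python
import numpy as np

def sequence_likelihood(obs, pi, A, B):
    """Forward algorithm: P(s | λ) of a discrete observation sequence
    under HMM λ = (pi, A, B) with initial, transition and emission
    probabilities."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # predict, then weight by emission
    return alpha.sum()

def classify_activity(obs, models):
    """Select the activity whose HMM gives the maximum likelihood,
    as in the classification stage of the figure."""
    return int(np.argmax([sequence_likelihood(obs, *m) for m in models]))
```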
39. Activities of daily living6
In the AAL context, an automated RGB-D video system was developed
that recognises the activities of elderly users, with a particular focus on
the detection of falls. The study is based on people detection and
tracking algorithms, used to map the users and to detect important
events such as falling or sitting in a chair. The system allows extracting
and collecting many statistical data that, properly processed, provide
knowledge about the elderly users.
6[Liciotti et al., 2015, Liciotti et al., 2014b]
40. Activities of daily living (Video)
https://youtu.be/qQZk1zlssmY
42. Conclusions
The main contributions of this Thesis can be summarised as follows:
design and implementation of two novel algorithms for people
detection from a top-view configuration with RGB-D data using image
processing approaches. In particular, a performance improvement of
the water filling algorithm is proposed in terms of computational
complexity. Furthermore, a new algorithm, called multi-level
segmentation, has been developed. It carries out several
segmentations at different height levels in order to find all the
heads of the people.
development of semantic segmentation CNNs for head detection; in
particular, U-Net, SegNet, FractalNet, and ResNet are used in this
work. By introducing changes in different layers of these nets,
performance is significantly improved;
43. Conclusions
proposal and validation of new descriptors for the Re-id task in a
top-view configuration. The descriptors are composed of anthropometric
and colour-based features.
design and implementation of several CNNs for user-shelf interaction
recognition. Using a manually annotated dataset made up of
images representing interactions between user and shelf, four
different types of CNNs have been trained.
creation of four publicly available datasets:
TVPR Dataset
TVHeads Dataset
RADiAL Dataset
User-Shelf Interactions Dataset
45. Future works
Future works will include:
the use of other types of RGB-D sensors;
the study of more sophisticated features;
the integration of video and audio systems, so that abnormal events
can also be identified through audio.
46. References I
[Cenci et al., 2016] Cenci, A., Liciotti, D., Ercoli, I., Zingaretti, P., and Carnielli, V. P. (2016).
A cloud-based healthcare infrastructure for medical device integration: the bilirubinometer case
study.
In Mechatronic and Embedded Systems and Applications (MESA), 2016 IEEE/ASME 12th
International Conference, pages 1–6. IEEE.
[Cenci et al., 2015] Cenci, A., Liciotti, D., Frontoni, E., Mancini, A., and Zingaretti, P. (2015).
Non-contact monitoring of preterm infants using rgb-d camera.
In ASME 2015 International Design Engineering Technical Conferences and Computers and
Information in Engineering Conference. American Society of Mechanical Engineers.
[Cenci et al., 2017] Cenci, A., Liciotti, D., Frontoni, E., Zingaretti, P., and Carnielli, V. P. (2017).
Movements analysis of preterm infants by using depth sensor.
In Proceedings of the International Conference on Internet of Things and Machine Learning
(IML 2017). Liverpool, UK.
[Ciabattoni et al., 2017] Ciabattoni, L., Frontoni, E., Liciotti, D., Paolanti, M., and Romeo, L.
(2017).
A sensor fusion approach for measuring emotional customer experience in an intelligent retail
environment.
In 2017 IEEE 7th International Conference on Consumer Electronics - Berlin (ICCE-Berlin)
(ICCE-Berlin 2017), Berlin, Germany.
[Frontoni et al., 2017] Frontoni, E., Liciotti, D., Paolanti, M., Pollini, R., and Zingaretti, P.
(2017).
Design of an interoperable framework with domotic sensors network integration.
In 2017 IEEE 7th International Conference on Consumer Electronics - Berlin (ICCE-Berlin)
(ICCE-Berlin 2017), Berlin, Germany.
47. References II
[Liciotti et al., 2016] Liciotti, D., Cenci, A., Frontoni, E., Mancini, A., and Zingaretti, P. (2016).
An intelligent rgb-d video system for bus passenger counting.
In IAS-14. Springer.
[Liciotti et al., 2014a] Liciotti, D., Contigiani, M., Frontoni, E., Mancini, A., Zingaretti, P., and
Placidi, V. (2014a).
Shopper analytics: A customer activity recognition system using a distributed rgb-d camera
network.
In Video Analytics for Audience Measurement, pages 146–157. Springer.
[Liciotti et al., 2014b] Liciotti, D., Ferroni, G., Frontoni, E., Squartini, S., Principi, E., Bonfigli,
R., Zingaretti, P., and Piazza, F. (2014b).
Advanced integration of multimedia assistive technologies: A prospective outlook.
In Mechatronic and Embedded Systems and Applications (MESA), 2014 IEEE/ASME 10th
International Conference, pages 1–6. IEEE.
[Liciotti et al., 2017a] Liciotti, D., Frontoni, E., Mancini, A., and Zingaretti, P. (2017a).
Pervasive System for Consumer Behaviour Analysis in Retail Environments, pages 12–23.
Springer International Publishing.
[Liciotti et al., 2017b] Liciotti, D., Frontoni, E., Zingaretti, P., Bellotto, N., and Duckett, T.
(2017b).
Hmm-based activity recognition with a ceiling rgb-d camera.
In ICPRAM (International Conference on Pattern Recognition Applications and Methods).
[Liciotti et al., 2015] Liciotti, D., Massi, G., Frontoni, E., Mancini, A., and Zingaretti, P. (2015).
Human activity analysis for in-home fall risk assessment.
In Communication Workshop (ICCW), 2015 IEEE International Conference on, pages 284–289.
IEEE.
48. References III
[Liciotti et al., 2017c] Liciotti, D., Paolanti, M., Frontoni, E., Mancini, A., and Zingaretti, P.
(2017c).
Person Re-identification Dataset with RGB-D Camera in a Top-View Configuration, pages
1–11.
Springer International Publishing.
[Liciotti et al., 2014c] Liciotti, D., Zingaretti, P., and Placidi, V. (2014c).
An automatic analysis of shoppers behaviour using a distributed rgb-d cameras system.
In Mechatronic and Embedded Systems and Applications (MESA), 2014 IEEE/ASME 10th
International Conference, pages 1–6. IEEE.
[Paolanti et al., 2017] Paolanti, M., Liciotti, D., Pietrini, R., Frontoni, E., and Mancini, A.
(2017).
Modelling and forecasting customer navigation in intelligent retail environments.
Journal of Intelligent & Robotic Systems.
[Pierdicca et al., 2015] Pierdicca, R., Liciotti, D., Contigiani, M., Frontoni, E., Mancini, A., and
Zingaretti, P. (2015).
Low cost embedded system for increasing retail environment intelligence.
In VAAM 2015 Video Analytics for Audience Measurement & IEEE International Conference on
Multimedia and Expo. IEEE.
[Sturari et al., 2016] Sturari, M., Liciotti, D., Pierdicca, R., Frontoni, E., Mancini, A., Contigiani,
M., and Zingaretti, P. (2016).
Robust and affordable retail customer profiling by vision and radio beacon sensor fusion.
Pattern Recognition Letters.