User generated data: a paradigm shift for research and data products
12 March 2021
Marco Altini, PhD
Twitter: @altini_marco
Marco Altini
• PhD cum laude in Machine Learning
• MSc cum laude in Computer Science
Engineering
• MSc cum laude in Human Movement
Sciences, High Performance
Coaching
• Founder of HRV4Training (2013)
• Data Science Advisor at Oura
• Guest Lecturer at VU Amsterdam
• 50+ publications at the intersection
between technology, health and
performance
IN THIS LECTURE
What’s user generated data?
• Typical study and product
development workflow
• A new paradigm
Challenges and opportunities
• Research and data products
All examples will be considering health and
sport science applications
DATA SCIENCE
As data scientists we can find new
clever ways to create value based
on the data collected:
• Research
• New features
• New products
• New insights
User generated data opens new
opportunities due to larger sample
size, realistic settings, unforeseen
outcomes
WHAT DO THESE
EXAMPLES HAVE IN
COMMON?
• None of these applications were
the original goal of the app or
wearable
• User generated data made it
possible
WHAT DO THESE
EXAMPLES HAVE IN
COMMON?
• How?
• Contextual data
• Context / confounders /
additional parameters
monitored longitudinally
• Reference points
• APIs
• Manually reported (e.g.
clinical outcomes)
Let’s take a step back first
TYPICAL STUDY
WORKFLOW
1. Design the study
1. What dependent variables
to track
2. What independent
variables to track
2. Recruit participants (small N)
3. Collect high quality data
4. Perform data analysis
5. Use the outcome
1. If academic research: write
a paper
2. If company research:
deploy to consumers
EXAMPLES
1. Paper: investigate the effect of
training intensity on heart rate
variability (HRV)
2. Product: estimate VO2max
based on physiological data
collected during workouts
EXAMPLE 1: PAPER ON THE EFFECT
OF TRAINING INTENSITY ON HEART
RATE VARIABILITY
EXAMPLE 1: HEART RATE
VARIABILITY IN
RESPONSE TO EXERCISE
INTENSITY
1. Design the study
1. What dependent variables
to track: HRV
2. What independent
variables to track: training
intensity, age, sex, etc.
2. Recruit participants (N = 10 male
students)
3. Collect high quality data
4. Perform data analysis
5. Use the outcome: write a paper
EXAMPLE 1: HEART RATE
VARIABILITY IN
RESPONSE TO EXERCISE
INTENSITY
How generalizable is this?
• What about women?
• What about different
phases of the menstrual
cycle?
• What about people of
different age groups?
• What about people with
different health conditions?
• What about different
sports?
Not much
EXAMPLE 2: VO2MAX
ESTIMATION USING
WEARABLES
1. Design the study
1. What dependent variables
to track: VO2max as
measured by indirect
calorimetry
2. What independent
variables to track:
• Age, sex, weight, height,
heart rate at a specific
intensity, etc.
EXAMPLE 2: VO2MAX
ESTIMATION USING
WEARABLES
1. Design the study
1. What dependent variables
to track
2. What independent
variables to track
2. Recruit participants (small N)
• We get N = 50
3. Collect high quality data
4. Perform data analysis
• Regression model to
estimate VO2max given
predictors
5. Use the outcome
• Deploy to consumers
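The analysis step in this example can be sketched as an ordinary least squares fit. This is a minimal illustration, not the model used in the lecture: the predictors (age, sex, heart rate at a fixed submaximal intensity), the coefficients used to simulate the data, and the helper names `fit_ols` and `predict` are all assumptions.

```python
import random


def fit_ols(X, y):
    """Return [intercept, b1, b2, ...] from the normal equations (X'X) b = X'y."""
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, row):
    """Apply the fitted coefficients to one row of predictors."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Simulated lab data (illustrative only, not real measurements).
random.seed(0)
X, y = [], []
for _ in range(50):  # N = 50, as in the example
    age = random.uniform(20, 60)
    sex = random.choice([0, 1])   # hypothetical coding: 0 = female, 1 = male
    hr = random.uniform(120, 170)  # bpm at a fixed submaximal intensity
    vo2max = 85 - 0.25 * age + 6 * sex - 0.2 * hr + random.gauss(0, 2)
    X.append([age, sex, hr])
    y.append(vo2max)

coefs = fit_ols(X, y)
print("coefficients:", [round(c, 3) for c in coefs])
print("estimate for a 30-year-old male at 140 bpm:",
      round(predict(coefs, [30, 1, 140]), 1))
```

A model fitted this way is only as general as the 50 people it was fitted on, which is the limitation discussed next.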
EXAMPLE 2: VO2MAX
ESTIMATION USING
WEARABLES
The real world is more complex:
- What about running on trails
where the relationship between
pace and heart rate changes?
- What about other sports, where
speed is less relevant, for
example cycling?
Also not really generalizable
TYPICAL LIMITATIONS
• N = 2-10 in many sport science
studies
• Results valid only for the specific
sample analyzed
• What if we want to extend the
analysis?
• We need to run another
study (costs, time, etc.)
• We collected high quality data,
but was it representative of what
happens in real life?
• Come to the lab, don’t eat
or drink coffee, then “relax”
when I tell you to…
OUTSOURCING DATA
COLLECTION
• In the past 10 years our ability to
run studies and monitor
physiology (and other variables)
outside of the lab has changed
dramatically
• Phones (+ sensors) make data
acquisition possible anywhere
and at a larger scale
• More realistic settings,
unforeseen outcomes
• Data science infrastructure
allows for cost-effective data
aggregation and analysis
THREE KEY STEPS
• Validate (or know the limitations
of) the technology to be
deployed
• Garbage in, garbage out
• Deploy. Confirm lab-based
insights (if possible)
• Data preparation becomes
the most important step
• Discover new relations, build
new products
EXAMPLES
1. Paper: investigate the effect of
training intensity on heart rate
variability (HRV)
2. Product: estimate VO2max
based on physiological data
collected during workouts
EXAMPLE 1: PAPER ON THE EFFECT
OF TRAINING INTENSITY ON HEART
RATE VARIABILITY
FIND NEW RELATIONS /
EXTEND ANALYSIS
• What else?
• Relationship with different
stressors (alcohol, getting
sick, menstrual cycle, etc.)
• Relationship with different
outcomes (a new
pandemic?)
WHAT IF WE TARGET
CYCLISTS NOW?
• We developed our initial model
thinking like a physiologist
• We can develop our new model
thinking like a data scientist
WHAT IF WE TARGET
CYCLISTS NOW?
• We have deployed our model to
thousands of users. Many are
runners, and are using the
feature
• The user provides as input:
• Anthropometrics
• Workouts from Strava
• The user gets as output the
VO2max estimate
CONFIRM LAB BASED
INSIGHTS
Or get clever about it
• Estimated VO2max is correlated
to running performance as
derived from Strava workouts:
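A check of this kind can be sketched with a plain Pearson correlation. The numbers below are invented for illustration (they are not the lecture's data), and using best 10 km time as the performance measure is an assumption.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Illustrative values only: estimated VO2max (ml/kg/min) and best 10 km time (min).
vo2max_estimates = [42, 48, 51, 55, 60]
best_10k_minutes = [58, 52, 49, 45, 41]

r = pearson_r(vo2max_estimates, best_10k_minutes)
print(f"r = {r:.2f}")  # negative: higher estimated VO2max, faster 10 km
```

If the estimate carries real information about fitness, it should track performance in the field even without lab reference values.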
WHAT IF WE TARGET
CYCLISTS NOW?
• For cyclists, we have:
• Heart rate during exercise
• Power during exercise
However, we do not have reference
VO2max data (from the lab) nor
estimated VO2max data (because
we can only estimate from heart
rate and speed)
The missing link: the triathlete
WHAT IF WE TARGET
CYCLISTS NOW?
Keep in the dataset only triathletes,
check again VO2max vs running
performance: still works
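Selecting triathletes could look like the sketch below. The record layout (`user_id`, `sport`) and the sport labels `"run"` and `"ride"` are hypothetical, and "triathlete" is approximated here as anyone who logged both running and cycling workouts.

```python
def triathletes(workouts):
    """Return ids of users who logged both running and cycling workouts."""
    sports_by_user = {}
    for w in workouts:
        sports_by_user.setdefault(w["user_id"], set()).add(w["sport"])
    return {u for u, sports in sports_by_user.items() if {"run", "ride"} <= sports}


# Illustrative records only.
workouts = [
    {"user_id": "a", "sport": "run"},
    {"user_id": "a", "sport": "ride"},
    {"user_id": "b", "sport": "run"},
    {"user_id": "c", "sport": "ride"},
    {"user_id": "c", "sport": "swim"},
]

print(triathletes(workouts))
```

These users are the link between the two sports: they have a running-based VO2max estimate and also heart rate and power data from cycling.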
WHAT IF WE TARGET
CYCLISTS NOW?
Build models to predict VO2max
for cycling, then validate with
leave-one-out cross-validation. R = 0.9
Deploy!
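Leave-one-out cross-validation can be sketched as follows. This is an illustration under stated assumptions, not the lecture's model: a single hypothetical predictor (a cycling power-to-heart-rate ratio), the running-based VO2max estimate of each triathlete as the target, and invented numbers.

```python
from math import sqrt


def fit_line(x, y):
    """Simple linear regression; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


def loocv_predictions(x, y):
    """Predict each point from a model fitted on all the other points."""
    preds = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(intercept + slope * x[i])
    return preds


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Illustrative values only.
power_hr_ratio = [1.2, 1.4, 1.5, 1.7, 1.9, 2.0, 2.2, 2.4]  # hypothetical predictor
running_vo2max = [41, 45, 44, 50, 53, 52, 58, 60]          # target from the running model

predicted = loocv_predictions(power_hr_ratio, running_vo2max)
print(f"R = {pearson_r(predicted, running_vo2max):.2f}")
```

Because every prediction comes from a model that never saw that person, the resulting R is a fairer estimate of how the model behaves on new users than a fit on the full dataset.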
IN THIS LECTURE
What’s user generated data?
• Typical study and product
development workflow
• A new paradigm
Challenges and opportunities
• Research and data products
CHALLENGES
• Data preparation:
• Quality control
• Noisy data
• Missing data
• Reference data
• What is available?
• Data engineering (not covered
today)
NOISY DATA
• Data collected from wearables
and apps is extremely noisy
• Inaccurate very often
• Typically no signal quality
metric is reported (think
about heart rate)
How do we deal with it?
NOISY DATA
Example: training intensity based
on heart rate. To determine a
relative intensity, we need users'
maximal heart rate
No lab tests. So we need to make
some assumptions:
• There will be some hard sessions
during the period we monitor
(hence it needs to be long
enough)
NOISY DATA
Here is data from 500 people,
including heart rates above 300
bpm (or below 100 bpm):
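A first pass over values like these can simply flag what is physiologically implausible. In the sketch below the input is assumed to be one peak heart rate per user, and the bounds (100 and 230 bpm) are assumptions chosen for illustration, not validated thresholds.

```python
def flag_peak_heart_rates(peaks, low=100, high=230):
    """Group user ids by whether their recorded peak heart rate (bpm) is plausible."""
    report = {"ok": [], "too_low": [], "too_high": []}
    for user_id, bpm in peaks.items():
        if bpm < low:
            report["too_low"].append(user_id)
        elif bpm > high:
            report["too_high"].append(user_id)
        else:
            report["ok"].append(user_id)
    return report


# Illustrative values only.
peaks = {"u1": 182, "u2": 305, "u3": 92, "u4": 171}
report = flag_peak_heart_rates(peaks)
print(report)
```

Keeping the rejected ids, rather than silently dropping them, makes it possible to report how much data was excluded and why.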
NOISY DATA
Data for one person
We can use simple statistical
methods to try to approximate this
person’s max heart rate
But did they ever go hard?
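One simple statistical approach is to take a high percentile of the plausible session peaks instead of the raw maximum, so that a single sensor glitch does not define the result. The sketch below is an assumption-laden illustration: the bounds, the 95th percentile, and the minimum of 10 sessions are all arbitrary choices, and the function names are hypothetical.

```python
def estimate_max_hr(session_peaks, low=100, high=230, min_sessions=10, pct=0.95):
    """Approximate maximal heart rate from per-session peak values (bpm).

    Returns None when there are too few plausible sessions to say anything.
    """
    plausible = sorted(p for p in session_peaks if low <= p <= high)
    if len(plausible) < min_sessions:
        return None
    index = min(len(plausible) - 1, int(pct * len(plausible)))
    return plausible[index]


def relative_intensity(heart_rate, max_hr):
    """Heart rate as a fraction of the estimated maximum."""
    return heart_rate / max_hr


# Illustrative session peaks for one person, including one sensor artifact.
session_peaks = list(range(150, 190, 2)) + [320]
max_hr = estimate_max_hr(session_peaks)
print(max_hr, round(relative_intensity(150, max_hr), 2))
```

This only yields a lower bound: if the person never trained hard during the monitored period, the estimate will sit below their true maximum, and no amount of filtering can reveal that.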
MISSING DATA
Same example as before. What if:
• We don’t have any hard effort
• Workouts are missing
We can sometimes ignore or remove
individuals with missing data (we
have a lot of data after all) but this
could introduce a bias (we do not
have the full picture)
No universal answer, think critically
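One way to think critically here is to compare the users who would be kept with those who would be dropped before removing anyone. The sketch below uses hypothetical field names (`max_hr_estimate`, `age`) and invented records; a large difference between the two groups suggests that dropping incomplete users would bias the sample.

```python
def compare_kept_and_dropped(users, required, covariate):
    """Summarize a covariate for users with and without complete required fields."""
    kept, dropped = [], []
    for u in users:
        complete = all(u.get(field) is not None for field in required)
        (kept if complete else dropped).append(u)

    def mean_of(group):
        values = [u[covariate] for u in group if u.get(covariate) is not None]
        return sum(values) / len(values) if values else None

    return {
        "n_kept": len(kept),
        "n_dropped": len(dropped),
        "kept_mean": mean_of(kept),
        "dropped_mean": mean_of(dropped),
    }


# Illustrative records only.
users = [
    {"age": 28, "max_hr_estimate": 188},
    {"age": 31, "max_hr_estimate": 185},
    {"age": 55, "max_hr_estimate": None},  # no hard effort recorded
    {"age": 61, "max_hr_estimate": None},
]

summary = compare_kept_and_dropped(users, required=["max_hr_estimate"], covariate="age")
print(summary)
```

In this invented example the dropped users are much older than the kept ones, so any conclusion drawn from the remaining data would mostly describe younger users.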
QUALITY CONTROL
• Only a fraction of the collected
data will be usable
• It is key to define methods
to keep track of what data
to trust, automatically, and
to clean the data
• Trade offs
• There is never enough data
anyway (you can always
do one more stratification)
REFERENCE DATA
One of the biggest challenges with
user generated data is lack of
reference data
Users don’t come to the lab for tests
or report outcomes that are key for
model development
What could help you in the future?
• Tags / annotations / APIs
REFERENCE DATA
COVID example:
• When was the test done?
• Was it even done?
• Does it even matter? Maybe they
were already infected earlier
with no / mild symptoms
REFERENCE DATA
• Not all collected data becomes
valuable research or enables
future data products. Much of it
has to do with reference data:
• What are the outcomes?
• Can we track them?
• Are we asking too much of
the user?
• Not a clinical study
• What can we do about it?
• Is it ethical to collect them?
ESTIMATING RUNNING
PERFORMANCE
One option could be to get a few
people on a treadmill in the lab,
and have them run a time trial
Or, we could grab workouts from
apps like Strava, analyze the training
patterns leading up to, e.g., each
runner's best 10 km performance over
a year or so, and build a model
USER GENERATED DATA
• Not everything is (or can be) a
data product
• Often data is collected but not
used in any meaningful way, no
value created (either for the
company or the user)
• Reference points are key, you
can have unlimited data and still
have no use for it
• More research is being carried
out using consumer products
• Think critically about reference
points, data preparation, and
other challenges (estimated vs
measured)