User generated data: a paradigm shift for research and data products

  1. 12 March 2021 Marco Altini, PhD Twitter: @altini_marco USER GENERATED DATA: A PARADIGM SHIFT FOR RESEARCH AND DATA PRODUCTS
  2. 2 Marco Altini • PhD cum laude in Machine Learning • MSc cum laude in Computer Science Engineering • MSc cum laude in Human Movement Sciences, High Performance Coaching • Founder of HRV4Training (2013) • Data Science Advisor at Oura • Guest Lecturer at VU Amsterdam • 50+ publications at the intersection between technology, health and performance
  3. 3 IN THIS LECTURE What’s user generated data? • Typical study and product development workflow • A new paradigm
  4. 4 IN THIS LECTURE What’s user generated data? • Typical study and product development workflow • A new paradigm Challenges and opportunities • Research and data products
  5. 5 IN THIS LECTURE What’s user generated data? • Typical study and product development workflow • A new paradigm Challenges and opportunities • Research and data products All examples will be considering health and sport science applications
  6. WHAT’S USER GENERATED DATA?
  7. 7 MANY TYPES OF DATA Content created by users of a product
  8. 8 MANY TYPES OF DATA Content created by users of a product Here we focus on sport and health: • Wearables • Phones
  9. WHY DOES USER GENERATED DATA MATTER?
  10. 10 DATA SCIENCE As data scientists we can find new clever ways to create value based on the data collected: • Research • New features • New products • New insights
  11. 11 DATA SCIENCE As data scientists we can find new clever ways to create value based on the data collected: • Research • New features • New products • New insights User generated data opens new opportunities due to larger sample size, realistic settings, unforeseen outcomes
  12. SOME EXAMPLES
  13. 13 APPS How can we create value for our customers using data? • HRV4Training • Cardiac activity (HR/HRV) • Context
  14. 14 APPS How can we create value for our customers using data? • HRV4Training • Identify / manage stressors
  15. 15 WEARABLES What can we learn? • Bloomlife • Uterine and cardiac activity
  16. 16 WEARABLES • Bloomlife • Can we detect (or predict) labour onset? What can we learn?
  17. 17 WEARABLES It’s not just the hardware anymore • Oura ring • Cardiac activity (HR/HRV) • Temperature • Movement • Sleep stages
  18. 18 WEARABLES It’s not just the hardware anymore • Oura ring • Can we detect (or predict) an infection?
  19. 19 WHAT DO THESE EXAMPLES HAVE IN COMMON?
  20. 20 WHAT DO THESE EXAMPLES HAVE IN COMMON? • None of these applications were the original goal of the app or wearable
  21. 21 WHAT DO THESE EXAMPLES HAVE IN COMMON? • None of these applications were the original goal of the app or wearable • User generated data made it possible
  22. 22 WHAT DO THESE EXAMPLES HAVE IN COMMON? • How? • Contextual data • Context / confounders / additional parameters monitored longitudinally
  23. 23 WHAT DO THESE EXAMPLES HAVE IN COMMON? • How? • Contextual data • Context / confounders / additional parameters monitored longitudinally • Reference points • APIs • Manually reported (e.g. clinical outcomes)
  24. 24 WHAT DO THESE EXAMPLES HAVE IN COMMON? • How? • Contextual data • Context / confounders / additional parameters monitored longitudinally • Reference points • APIs • Manually reported (e.g. clinical outcomes) Let’s take a step back first
  25. TYPICAL STUDY WORKFLOW
  26. 26 TYPICAL STUDY WORKFLOW 1. Design the study 1. What dependent variables to track 2. What independent variables to track
  27. 27 TYPICAL STUDY WORKFLOW 1. Design the study 1. What dependent variables to track 2. What independent variables to track 2. Recruit participants (small N)
  28. 28 TYPICAL STUDY WORKFLOW 1. Design the study 1. What dependent variables to track 2. What independent variables to track 2. Recruit participants (small N) 3. Collect high quality data
  29. 29 TYPICAL STUDY WORKFLOW 1. Design the study 1. What dependent variables to track 2. What independent variables to track 2. Recruit participants (small N) 3. Collect high quality data 4. Perform data analysis
  30. 30 TYPICAL STUDY WORKFLOW 1. Design the study 1. What dependent variables to track 2. What independent variables to track 2. Recruit participants (small N) 3. Collect high quality data 4. Perform data analysis 5. Use the outcome 1. If academic research: write a paper 2. If company research: deploy to consumers
  31. 31 EXAMPLES 1. Paper: investigate the effect of training intensity on heart rate variability (HRV) 2. Product: estimate VO2max based on physiological data collected during workouts
  32. EXAMPLE 1: PAPER ON THE EFFECT OF TRAINING INTENSITY ON HEART RATE VARIABILITY
  33. 33 EXAMPLE 1: HEART RATE VARIABILITY IN RESPONSE TO EXERCISE INTENSITY 1. Design the study 1. What dependent variables to track: HRV 2. What independent variables to track: training intensity, age, sex, etc.
  34. 34 EXAMPLE 1: HEART RATE VARIABILITY IN RESPONSE TO EXERCISE INTENSITY 1. Design the study 1. What dependent variables to track: HRV 2. What independent variables to track: training intensity, age, sex, etc. 2. Recruit participants (N = 10 male students)
  35. 35 EXAMPLE 1: HEART RATE VARIABILITY IN RESPONSE TO EXERCISE INTENSITY 1. Design the study 1. What dependent variables to track: HRV 2. What independent variables to track: training intensity, age, sex, etc. 2. Recruit participants (N = 10 male students) 3. Collect high quality data
  36. 36 EXAMPLE 1: HEART RATE VARIABILITY IN RESPONSE TO EXERCISE INTENSITY
  37. 37 EXAMPLE 1: HEART RATE VARIABILITY IN RESPONSE TO EXERCISE INTENSITY 1. Design the study 1. What dependent variables to track 2. What independent variables to track 2. Recruit participants 3. Collect high quality data 4. Perform data analysis 5. Use the outcome 1. Write a paper
  38. 38 EXAMPLE 1: HEART RATE VARIABILITY IN RESPONSE TO EXERCISE INTENSITY How generalizable is this?
  39. 39 EXAMPLE 1: HEART RATE VARIABILITY IN RESPONSE TO EXERCISE INTENSITY How generalizable is this? • What about women?
  40. 40 EXAMPLE 1: HEART RATE VARIABILITY IN RESPONSE TO EXERCISE INTENSITY How generalizable is this? • What about women? • What about different phases of the menstrual cycle?
  41. 41 EXAMPLE 1: HEART RATE VARIABILITY IN RESPONSE TO EXERCISE INTENSITY How generalizable is this? • What about women? • What about different phases of the menstrual cycle? • What about people of different age groups?
  42. 42 EXAMPLE 1: HEART RATE VARIABILITY IN RESPONSE TO EXERCISE INTENSITY How generalizable is this? • What about women? • What about different phases of the menstrual cycle? • What about people of different age groups? • What about people with different health conditions?
  43. 43 EXAMPLE 1: HEART RATE VARIABILITY IN RESPONSE TO EXERCISE INTENSITY How generalizable is this? • What about women? • What about different phases of the menstrual cycle? • What about people of different age groups? • What about people with different health conditions? • What about different sports?
  44. 44 EXAMPLE 1: HEART RATE VARIABILITY IN RESPONSE TO EXERCISE INTENSITY How generalizable is this? • What about women? • What about different phases of the menstrual cycle? • What about people of different age groups? • What about people with different health conditions? • What about different sports? Not much
  45. EXAMPLE 2: PRODUCT FOR VO2MAX ESTIMATION
  46. 46 EXAMPLE 2: VO2MAX ESTIMATION USING WEARABLES 1. Design the study 1. What dependent variables to track 2. What independent variables to track
  47. 47 EXAMPLE 2: VO2MAX ESTIMATION USING WEARABLES 1. Design the study 1. What dependent variables to track: VO2max as measured by indirect calorimetry 2. What independent variables to track: • Age, sex, weight, height, heart rate at a specific intensity, etc.
  48. 48 EXAMPLE 2: VO2MAX ESTIMATION USING WEARABLES 1. Design the study 1. What dependent variables to track 2. What independent variables to track 2. Recruit participants • We get N = 50
  49. 49 EXAMPLE 2: VO2MAX ESTIMATION USING WEARABLES 1. Design the study 1. What dependent variables to track 2. What independent variables to track 2. Recruit participants • We get N = 50 3. Collect high quality data
  50. 50 EXAMPLE 2: VO2MAX ESTIMATION USING WEARABLES
  51. 51 EXAMPLE 2: VO2MAX ESTIMATION USING WEARABLES 1. Design the study 1. What dependent variables to track 2. What independent variables to track 2. Recruit participants (small N) • We get N = 50 3. Collect high quality data 4. Perform data analysis • Regression model to estimate VO2max given predictors
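
The regression step on this slide can be sketched as an ordinary least squares fit. This is a minimal illustration only: the two predictors (age and heart rate at a fixed running pace), the data values, and the function names `fit_ols` and `predict` are assumptions made for the example, not the model used in the lecture.

```python
def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of predictor rows; an intercept column is added here.
    Returns [intercept, coef_1, coef_2, ...].
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, x):
    """Apply fitted coefficients to one row of predictors."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], x))


# Made-up illustration data: (age in years, heart rate in bpm at a fixed pace).
X = [(25, 140), (30, 150), (35, 145), (40, 160), (45, 155), (50, 170)]
# Made-up lab-measured VO2max values (ml/kg/min), one per participant.
y = [55.0, 51.0, 50.5, 46.0, 45.5, 41.0]

coefs = fit_ols(X, y)
print(predict(coefs, (33, 148)))
```

With N = 50 and only a handful of predictors a model like this is easy to fit, which is part of why it is tempting to deploy it beyond the population it was built on.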
  52. 52 EXAMPLE 2: VO2MAX ESTIMATION USING WEARABLES
  53. 53 EXAMPLE 2: VO2MAX ESTIMATION USING WEARABLES 1. Design the study 1. What dependent variables to track 2. What independent variables to track 2. Recruit participants (small N) • We get N = 50 3. Collect high quality data 4. Perform data analysis 5. Use the outcome • Deploy to consumers
  54. 54 EXAMPLE 2: VO2MAX ESTIMATION USING WEARABLES
  55. 55 EXAMPLE 2: VO2MAX ESTIMATION USING WEARABLES The real world is more complex: - What about running on trails where the relationship between pace and heart rate changes? - What about other sports, where speed is less relevant, for example cycling?
  56. 56 EXAMPLE 2: VO2MAX ESTIMATION USING WEARABLES The real world is more complex: - What about running on trails where the relationship between pace and heart rate changes? - What about other sports, where speed is less relevant, for example cycling? Also not really generalizable
  57. TYPICAL LIMITATIONS
  58. 58 TYPICAL LIMITATIONS • N = 2-10 in many sport science studies
  59. 59 TYPICAL LIMITATIONS • N = 2-10 in many sport science studies • Results valid only for the specific sample analyzed
  60. 60 TYPICAL LIMITATIONS • N = 2-10 in many sport science studies • Results valid only for the specific sample analyzed • What if we want to extend the analysis? • We need to run another study (costs, time, etc.)
  61. 61 TYPICAL LIMITATIONS • N = 2-10 in many sport science studies • Results valid only for the specific sample analyzed • What if we want to extend the analysis? • We need to run another study (costs, time, etc.) • We collected high quality data, but was it representative of what happens in real life? • Come to the lab, don’t eat or drink coffee, then “relax” when I tell you to…
  62. A NEW PARADIGM
  63. 63 OUTSOURCING DATA COLLECTION • In the past 10 years our ability to run studies and monitor physiology (and other variables) outside of the lab has changed dramatically
  64. 64 OUTSOURCING DATA COLLECTION • In the past 10 years our ability to run studies and monitor physiology (and other variables) outside of the lab has changed dramatically • Phones (+ sensors) make data acquisition possible anywhere and at a larger scale
  65. 65 OUTSOURCING DATA COLLECTION • In the past 10 years our ability to run studies and monitor physiology (and other variables) outside of the lab has changed dramatically • Phones (+ sensors) make data acquisition possible anywhere and at a larger scale • More realistic settings, unforeseen outcomes
  66. 66 OUTSOURCING DATA COLLECTION • In the past 10 years our ability to run studies and monitor physiology (and other variables) outside of the lab has changed dramatically • Phones (+ sensors) make data acquisition possible anywhere and at a larger scale • More realistic settings, unforeseen outcomes • Data science infrastructure allows for cost-effective data aggregation and analysis
  67. WHAT DO WE NEED TO GET THIS DONE?
  68. 68 THREE KEY STEPS • Validate (or know the limitations of) the technology to be deployed • Garbage in, garbage out
  69. 69 THREE KEY STEPS • Validate (or know the limitations of) the technology to be deployed • Garbage in, garbage out • Deploy. Confirm lab-based insights (if possible) • Data preparation becomes the most important step
  70. 70 THREE KEY STEPS • Validate (or know the limitations of) the technology to be deployed • Garbage in, garbage out • Deploy. Confirm lab-based insights (if possible) • Data preparation becomes the most important step • Discover new relations, build new products
  71. 71 EXAMPLES 1. Paper: investigate the effect of training intensity on heart rate variability (HRV) 2. Product: estimate VO2max based on physiological data collected during workouts
  72. EXAMPLE 1: PAPER ON THE EFFECT OF TRAINING INTENSITY ON HEART RATE VARIABILITY
  73. 73 VALIDATE THE TECHNOLOGY Or use a validated tool • Equivalency between phone PPG and external ECG:
  74. 74 DEPLOY • Collect data for months in thousands of people. More than 50 000 measurements included in the analysis
  75. 75 CONFIRM LAB BASED INSIGHTS • Reduction in HRV post higher intensity exercise:
  76. 76 FIND NEW RELATIONS / EXTEND ANALYSIS • Same relationship in men and women:
  77. 77 FIND NEW RELATIONS / EXTEND ANALYSIS • Same relationship in different age groups:
  78. 78 FIND NEW RELATIONS / EXTEND ANALYSIS • What else?
  79. 79 FIND NEW RELATIONS / EXTEND ANALYSIS • What else?
  80. 80 FIND NEW RELATIONS / EXTEND ANALYSIS • What else? • Relationship with different stressors (alcohol, getting sick, menstrual cycle, etc.)
  81. 81 FIND NEW RELATIONS / EXTEND ANALYSIS • What else? • Relationship with different stressors (alcohol, getting sick, menstrual cycle, etc.) • Relationship with different outcomes (a new pandemic?)
  82. 82 FIND NEW RELATIONS / EXTEND ANALYSIS • What else? • Relationship with different stressors (alcohol, getting sick, menstrual cycle, etc.) • Relationship with different outcomes (a new pandemic?)
  83. EXAMPLE 2: PRODUCT FOR VO2MAX ESTIMATION
  84. 84 WHAT IF WE TARGET CYCLISTS NOW? • We developed our initial model thinking like a physiologist
  85. 85 WHAT IF WE TARGET CYCLISTS NOW? • We developed our initial model thinking like a physiologist • We can develop our new model thinking like a data scientist
  86. 86 WHAT IF WE TARGET CYCLISTS NOW? • We have deployed our model to thousands of users. Many are runners, and are using the feature • The user provides as input: • Anthropometrics • Workouts from Strava • The user gets as output the VO2max estimate
  87. 87 VALIDATION • Only using running data:
  88. 88 CONFIRM LAB BASED INSIGHTS Or get clever about it • Estimated VO2max is correlated to running performance as derived from Strava workouts:
  89. 89 WHAT IF WE TARGET CYCLISTS NOW? • For cyclists, we have: • Heart rate during exercise • Power during exercise
  90. 90 WHAT IF WE TARGET CYCLISTS NOW? • For cyclists, we have: • Heart rate during exercise • Power during exercise However, we do not have reference VO2max data (from the lab) nor estimated VO2max data (because we can only estimate from heart rate and speed)
  91. 91 WHAT IF WE TARGET CYCLISTS NOW? • For cyclists, we have: • Heart rate during exercise • Power during exercise However, we do not have reference VO2max data (from the lab) nor estimated VO2max data (because we can only estimate from heart rate and speed) The missing link: the triathlete
  92. 92 WHAT IF WE TARGET CYCLISTS NOW? Keep in the dataset only triathletes, check again VO2max vs running performance: still works
  97. 97 WHAT IF WE TARGET CYCLISTS NOW? Build models, predict VO2max cycling, then validate (leave one out cross-validation). R = 0.9
  98. 98 WHAT IF WE TARGET CYCLISTS NOW? Build models, predict VO2max cycling, then validate (leave one out cross-validation). R = 0.9 Deploy!
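
The leave-one-out validation mentioned on this slide can be sketched as follows. Everything here is an assumption for illustration: a single hypothetical cycling predictor (power per heartbeat), made-up values, and a one-variable linear model. The lecture does not say which features or model family were used.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


def loocv_predictions(x, y):
    """Predict each athlete with a model fitted on all the other athletes."""
    preds = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        preds.append(intercept + slope * x[i])
    return preds


# Made-up triathlete data: cycling power per heartbeat (W/bpm) and the
# VO2max estimate obtained from the same person's running workouts.
power_per_beat = [1.2, 1.4, 1.5, 1.7, 1.9, 2.0, 2.2, 2.4]
vo2max_from_running = [42.0, 46.0, 45.0, 50.0, 54.0, 53.0, 58.0, 61.0]

held_out = loocv_predictions(power_per_beat, vo2max_from_running)
print(pearson_r(vo2max_from_running, held_out))
```

Each prediction comes from a model that never saw that athlete, so the correlation between held-out predictions and reference values is a fairer estimate of how the model would behave on new cyclists than the fit on the training data itself.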
  99. 99 IN THIS LECTURE What’s user generated data? • Typical study and product development workflow • A new paradigm Challenges and opportunities • Research and data products
  100. USER GENERATED DATA: CHALLENGES
  101. 101 CHALLENGES • Data preparation:
  102. 102 CHALLENGES • Data preparation: • Quality control • Noisy data • Missing data
  103. 103 CHALLENGES • Data preparation: • Quality control • Noisy data • Missing data • Reference data • What is available?
  104. 104 CHALLENGES • Data preparation: • Quality control • Noisy data • Missing data • Reference data • What is available? • Data engineering (not covered today)
  105. QUALITY CONTROL
  106. 106 NOISY DATA • Data collected from wearables and apps is extremely noisy • Inaccurate very often • Typically no signal quality metric is reported (think about heart rate)
  107. 107 NOISY DATA • Data collected from wearables and apps is extremely noisy • Inaccurate very often • Typically no signal quality metric is reported (think about heart rate) How do we deal with it?
  108. 108 NOISY DATA Example: training intensity based on heart rate
  109. 109 NOISY DATA Example: training intensity based on heart rate. To determine a relative intensity, we need users' maximal heart rate
  110. 110 NOISY DATA Example: training intensity based on heart rate. To determine a relative intensity, we need users' maximal heart rate No lab tests. So we need to make some assumptions:
  111. 111 NOISY DATA Example: training intensity based on heart rate. To determine a relative intensity, we need users' maximal heart rate No lab tests. So we need to make some assumptions: • There will be some hard sessions during the period we monitor (hence it needs to be long enough)
  112. 112 NOISY DATA Here is data from 500 people, including heart rates above 300 bpm (or below 100 bpm):
  113. 113 NOISY DATA Data for one person We can use simple statistical methods to try to approximate this person’s max heart rate
  114. 114 NOISY DATA Data for one person We can use simple statistical methods to try to approximate this person’s max heart rate But did they ever go hard?
  115. 115 NOISY DATA Estimated max heart rate:
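
One simple statistical approach to the max heart rate problem above can be sketched like this. The plausibility bounds, the 95th percentile, and the minimum number of workouts are illustrative assumptions, not the values used in the lecture, and the function names are hypothetical.

```python
def estimate_max_hr(workout_peaks, low=100, high=230, min_workouts=20):
    """Estimate max heart rate from per-workout peak heart rates (bpm).

    Returns (estimate, trusted). `low`, `high` and `min_workouts` are
    assumed thresholds chosen for this sketch.
    """
    # Drop physiologically implausible peaks (sensor artifacts).
    plausible = sorted(p for p in workout_peaks if low <= p <= high)
    if not plausible:
        return None, False
    # Nearest-rank 95th percentile: less sensitive to one spike than max().
    n = len(plausible)
    rank = (95 * n + 99) // 100 - 1
    # With few workouts we cannot tell whether the person ever went hard.
    return plausible[rank], n >= min_workouts


def relative_intensity(heart_rate, max_hr):
    """Heart rate expressed as a fraction of the estimated maximum."""
    return heart_rate / max_hr


# Made-up peaks for one person, including two obvious artifacts.
peaks = list(range(150, 190, 2)) + [320, 60]
max_hr, trusted = estimate_max_hr(peaks)
print(max_hr, trusted, relative_intensity(150, max_hr))
```

The `trusted` flag does not answer "did they ever go hard?", it only records that the monitoring period contained enough workouts for that to be likely.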
  116. 116 MISSING DATA Same example as before. What if: • We don’t have any hard effort • Workouts are missing
  117. 117 MISSING DATA Same example as before. What if: • We don’t have any hard effort • Workouts are missing We can sometimes ignore or remove individuals with missing data (we have a lot of data after all), but this could introduce a bias (we do not have the full picture)
  118. 118 MISSING DATA Same example as before. What if: • We don’t have any hard effort • Workouts are missing We can sometimes ignore or remove individuals with missing data (we have a lot of data after all), but this could introduce a bias (we do not have the full picture) No universal answer, think critically
  119. 119 QUALITY CONTROL • Only a fraction of the collected data will be usable • It is key to define methods to keep track of what data to trust, automatically, and to clean the data
  120. 120 QUALITY CONTROL • Only a fraction of the collected data will be usable • It is key to define methods to keep track of what data to trust, automatically, and to clean the data • Trade-offs • There is never enough data anyway (you can always do one more stratification)
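
Keeping track of what data to trust, rather than silently deleting it, can be sketched with a per-measurement flag. The record layout, the range check, and the bounds below are assumptions made for this example. Real quality control would likely combine several checks.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    user_id: str
    heart_rate: float  # bpm
    trusted: bool = True
    reason: str = ""


def flag_untrusted(measurements, low=30.0, high=230.0):
    """Mark out-of-range heart rates as untrusted instead of dropping them.

    `low` and `high` are assumed plausibility bounds for this sketch.
    """
    for m in measurements:
        if not (low <= m.heart_rate <= high):
            m.trusted = False
            m.reason = "heart rate outside plausible range"
    return measurements


def usable_fraction(measurements):
    """Share of measurements that passed quality control."""
    if not measurements:
        return 0.0
    return sum(m.trusted for m in measurements) / len(measurements)


# Made-up measurements from two hypothetical users.
data = flag_untrusted([
    Measurement("a", 62.0),
    Measurement("a", 310.0),
    Measurement("b", 12.0),
    Measurement("b", 155.0),
])
print(usable_fraction(data))
```

Storing the reason next to the flag makes the trade-off visible: tightening the bounds removes more noise but also shrinks the usable fraction, and the excluded data can still be inspected for bias.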
  121. REFERENCE DATA
  122. 122 REFERENCE DATA One of the biggest challenges with user generated data is lack of reference data
  123. 123 REFERENCE DATA One of the biggest challenges with user generated data is lack of reference data Users don’t come to the lab for tests or report outcomes that are key for model development What could help you in the future? • Tags / annotations / APIs
  124. 124 REFERENCE DATA COVID example
  125. 125 REFERENCE DATA COVID example: • When was the test done?
  126. 126 REFERENCE DATA COVID example: • When was the test done? • Was it even done?
  127. 127 REFERENCE DATA COVID example: • When was the test done? • Was it even done? • Does it even matter?
  128. 128 REFERENCE DATA COVID example: • When was the test done? • Was it even done? • Does it even matter? Maybe they were already infected earlier with no / mild symptoms
  129. 129 REFERENCE DATA • Not all collected data becomes valuable research or enables future data products. Much of it has to do with reference data:
  130. 130 REFERENCE DATA • Not all collected data becomes valuable research or enables future data products. Much of it has to do with reference data: • What are the outcomes? • Can we track them?
  131. 131 REFERENCE DATA • Not all collected data becomes valuable research or enables future data products. Much of it has to do with reference data: • What are the outcomes? • Can we track them? • Are we asking too much of the user? • Not a clinical study • What can we do about it? • Is it ethical to collect them?
  132. USER GENERATED DATA: OPPORTUNITIES
  133. 133 OPPORTUNITIES • Large scale • Insights that we cannot sometimes even aim at in the lab • New guidelines • New products
  134. 134 WHAT HAPPENED DURING THE PANDEMIC?
  135. 135 WHAT HAPPENED DURING THE PANDEMIC?
  136. 136 WHAT HAPPENED DURING THE PANDEMIC? 5500 people, 3 months of data per person, half a million measurements:
  137. 137 WHAT HAPPENED DURING THE PANDEMIC? Why? Travel, sleep, etc.
  138. 138 WHAT HAPPENED DURING THE PANDEMIC? Why? Travel, sleep, etc.
  139. 139 LIMITATIONS STILL APPLY Who are we talking about? Does this really generalize?
  140. 140 OPPORTUNITIES • Large scale • Insights that we cannot sometimes even aim at in the lab • New guidelines • New products
  141. 141 COVID INFECTION AND HRV, HR
  142. 142 COVID INFECTION AND HRV, HR Can we build a predictive model?
  143. 143 LIMITATIONS STILL APPLY What about the flu?
  144. 144 LIMITATIONS STILL APPLY What about the flu? Can we just distinguish healthy vs an infection or can we distinguish infection type?
  145. 145 LIMITATIONS STILL APPLY It’s easy to get fooled by the data Think critically
  146. 146 OPPORTUNITIES • Large scale • Insights that we cannot sometimes even aim at in the lab • New guidelines • New products
  147. 147 WHAT’S OPTIMAL BLOOD GLUCOSE?
  148. 148 WHAT’S OPTIMAL BLOOD GLUCOSE?
  149. 149 WHAT’S OPTIMAL BLOOD GLUCOSE?
  150. 150 WHAT’S OPTIMAL BLOOD GLUCOSE?
  151. 151 OPPORTUNITIES • Large scale • Insights that we cannot sometimes even aim at in the lab • New guidelines • New products
  152. 152 ESTIMATING RUNNING PERFORMANCE One option could be to get a few people on a treadmill in the lab, and have them run a time trial
  153. 153 ESTIMATING RUNNING PERFORMANCE One option could be to get a few people on a treadmill in the lab, and have them run a time trial Or, we could grab workouts from apps like Strava, analyze the training patterns preceding, for example, each runner’s best 10 km performance over a year or so, and build a model
  154. 154 ESTIMATING RUNNING PERFORMANCE N = 2100 RMSE = 2 minutes (4%)
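
The error metric on this slide can be computed as sketched below. The finish times are made up for illustration. Only the definitions of RMSE and of RMSE relative to the mean observed time are standard. Note that an RMSE of 2 minutes corresponding to about 4% would be consistent with an average 10 km time of roughly 50 minutes, which is an inference from the two numbers rather than something stated in the lecture.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual values."""
    n = len(actual)
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def relative_rmse(predicted, actual):
    """RMSE expressed as a fraction of the mean actual value."""
    return rmse(predicted, actual) / (sum(actual) / len(actual))


# Made-up 10 km finish times in minutes.
actual_times = [50.0, 50.0]
predicted_times = [48.0, 52.0]
print(rmse(predicted_times, actual_times))
print(relative_rmse(predicted_times, actual_times))
```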
  155. 155 ESTIMATING RUNNING PERFORMANCE
  156. 156 ESTIMATING RUNNING PERFORMANCE
  157. THAT’S A WRAP
  158. 158 USER GENERATED DATA • Not everything is (or can be) a data product
  159. 159 USER GENERATED DATA • Not everything is (or can be) a data product • Often data is collected but not used in any meaningful way, no value created (either for the company or the user)
  160. 160 USER GENERATED DATA • Not everything is (or can be) a data product • Often data is collected but not used in any meaningful way, no value created (either for the company or the user) • Reference points are key, you can have unlimited data and still have no use for it
  161. 161 USER GENERATED DATA • Not everything is (or can be) a data product • Often data is collected but not used in any meaningful way, no value created (either for the company or the user) • Reference points are key, you can have unlimited data and still have no use for it • More research is being carried out using consumer products
  162. 162 USER GENERATED DATA • Not everything is (or can be) a data product • Often data is collected but not used in any meaningful way, no value created (either for the company or the user) • Reference points are key, you can have unlimited data and still have no use for it • More research is being carried out using consumer products • Think critically about reference points, data preparation, and other challenges (estimated vs measured)
  163. 12 March 2021 Marco Altini, PhD Twitter: @altini_marco USER GENERATED DATA: A PARADIGM SHIFT FOR RESEARCH AND DATA PRODUCTS