poster_Baseline_20160518
1. BASELINE
Concept and approach
A diversified team
o More than 200 people came to Baseline meetings and asked to join the mailing list to help; more than 100 contributed concretely, and about 25 spent considerable time on the project.
o Edouard Debonneuil, Augustin Terlinden and Peter-Mikhaël Richard co-led the project, managed small groups of skilled participants and shared their technical expertise in
actuarial science, health economics and programming, respectively.
o Dahbia Agher, Laurence Samelson, Julien Pasquier, Frédéric Kozlowski, Bruno Scheffer, Camille Pouchol, Joseph Sébastien and Mayumi Iitsuka were major non-student
contributors. Among their numerous achievements, they collected data on cancer incidence and risk factors for Austria, Denmark, Algeria, Japan and Germany. They
provided the energy to run preliminary analyses on these countries and entered the raw data into the SQL database.
o Nine ISFA students, supervised by their professors Frédéric Planchet, Stéphane Loisel, Xavier Milhaud and Alexis Bienvenue, collected risk factors for England and Wales
(Rudy Mustafa and Thibaut Bideault), Germany (Do Ngoc Linh), Australia (Tài Anh Truong and Nga Nguyen), Japan (Ngo Hang and Huong Nguyen) and the US (Ha Do and
Terexa Tran). Joseph Lam and Benoit Choffin from ENSAE provided an extensive analysis of the French cancer risk factors. Professors Lionel Gabel from Ecole Centrale Paris,
Nicole El Karoui from Paris VI and Caroline Hillairet from ENSAE publicized the project to a large audience of actuaries and data scientists.
o Most known cancer risk factors were identified and confirmed by an oncologist (Dr. Nicolas de Chanaud). About ten people from public health were instrumental in
selecting the key explanatory variables to collect. Claude Touche and Delphine Bertram provided their expertise to validate the results in a real-life setting.
o Sixty data scientists from various fields joined forces to model the relationship between Y and X during two RAMP (Rapid Analytics and Model Prototyping) sessions on
February 13th 2016 and April 30th 2016 at La Paillasse.
o Aside from organizing the RAMP sessions, Djalel Benbouzid also prepared the BASELINE matrix and a first Y=f(X) function so that data scientists could start easily.
Seraya Maouche was a pillar in the decision to migrate the BASELINE database from Excel to SQL. She provided easy access to EpidemiumDB as well as an environment
to work with it. Last but not least, the Epidemium team provided continuous help and support in bringing the right tools and the right people to the right milestones of
the project. About 100 people (not mentioned here) came several times to the weekly meeting (Thursday evenings at La Paillasse) and concretely helped.
http://wiki.epidemium.cc/wiki/BASELINE http://www.epidemium.cc/project/23/view
cancerbaseline@googlegroups.com
Dataset and technology
Main Results
Contact us
EPIDEMIUM Challenge – May 2016 – Paris, France
A large [Y|X] dataset was created! Its compact, age-standardized version can be downloaded
at https://github.com/Epidemium/Baseline/tree/master/MATRIX to help research!
A large community of students, researchers and other volunteers was made aware of these important
public health topics, and of the caveats as well. Mainly: this is epidemiology
on populations, not individuals.
A first proof of concept of the population-level approach was obtained for prostate cancer risks
associated with African origins (similar aspects exist for various European, Asian and other origins) [1].
The first RAMP demonstrated that the collected age-standardized data can be very useful: some
models were able to accurately predict cancer mortality, indicating that the collected data
capture the essence of population-level cancer mortality risks!
A first method was developed to get insightful results, and it found known risk factors [2]: a
sign that the approach is sound (though it could be made more robust). Other methods, e.g. PLS regression, look promising.
1. Collecting data
We manually collected, curated and assembled 5 435 936 numbers, each representing, for example, the lung cancer rate of 60-64 year old males
in Alabama, USA in 2005, or the percentage of smokers in New Zealand in 2008. Together they are feedback from humanity, drawn from over 4 billion people.
o The dataset covers 498 areas of the world, 34 types of cancer (mortality and incidence rates), all-cause mortality, and 250 potential risk factors.
o The potential risk factors comprise 70 known cancer risk factors that were specifically sought, plus any other data we came across: serendipity!
o These numbers are assembled in a [Y|X] matrix in which each row is a combination of geographic location, year range, gender, age range and ethnic origin.
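The row structure described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual pipeline: the record layout, the column names (Y_lung_cancer_rate, X_smokers_pct) and the numeric values are made-up placeholders, and the helper build_matrix is hypothetical.

```python
from collections import defaultdict

# Long-format records: (area, years, sex, age range, ethnic origin, variable, value).
# The values below are placeholders for illustration, NOT real BASELINE data.
records = [
    ("Alabama, USA", "2005", "male", "60-64", "all", "Y_lung_cancer_rate", 0.0031),
    ("Alabama, USA", "2005", "male", "60-64", "all", "X_smokers_pct", 24.0),
    ("New Zealand", "2008", "all", "all", "all", "X_smokers_pct", 20.0),
]

def build_matrix(records):
    """Pivot long-format records into a [Y|X]-style table.

    Each row key is (area, years, sex, age range, ethnic origin);
    cells with no collected value stay None, which is why the matrix is sparse.
    """
    rows = defaultdict(dict)
    columns = []
    for *key, variable, value in records:
        rows[tuple(key)][variable] = value
        if variable not in columns:
            columns.append(variable)
    matrix = {key: [cells.get(col) for col in columns] for key, cells in rows.items()}
    return columns, matrix

columns, matrix = build_matrix(records)
for key, cells in matrix.items():
    print(key, dict(zip(columns, cells)))
```

Keeping missing cells as None rather than 0 matters here: a zero would be read as a measured value, whereas most cells of such a matrix were simply never collected.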
The tools used to collectively collect, clean, scale and assemble the data varied over the course of the project:
o we started successfully with Data Science Studio (60 people collaborated on it) on the Terralab cluster,
o moved to Excel because of a short deadline (RAMP1): it required no learning time, but was not adequate for intense collaboration,
o then moved to MySQL + PHP to interact with other Epidemium projects (via EpidemiumDB, led by Seraya).
o We used GitHub as an open repository, and communicated via Slack, email and the wiki.
2. Building the model
The resulting [Y|X] matrix is sparse, which makes modelling cancer risks challenging.
o A first “RAMP” modelled age-standardized cancer mortality risks
as a function of many risk factors
o A second “RAMP” modelled age-specific digestive cancer mortality
risks as a function of risks of other cancers
o Aggregate GLM: a method was specifically designed to do Y=f(X)
with such sparse matrices [2]
o 8 actuarial memorandums (ISFA and ENSAE)
deal with Cancer Baseline modelling
Next steps
melanoma, lung, etc.
o Cancer risks vary greatly across the world. If we also map living conditions across the globe,
perhaps we can identify major ways to prevent cancer?
o This seems feasible with Epidemium’s existing core dataset and country-specific
portals like http://data.gouv.fr. To estimate the effects of more than a few
risk factors, mathematics tells us that 180 countries are not enough: we need to collect and
assemble data at a finer level, and for that we need to be many!
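One way to see part of this argument is through the rank of the design matrix: with fewer rows than unknown coefficients, a linear or generalized linear model cannot identify every effect separately. The sketch below is an assumption-laden illustration, not the project's analysis. It uses random synthetic data, assumes a fully observed matrix (the real one is sparse) and takes the figures 180, 498 and 250 from the poster; the helper identifiable is hypothetical.

```python
import numpy as np

N_FACTORS = 250    # potential risk factors mentioned on the poster
N_COUNTRIES = 180  # country-level rows
N_AREAS = 498      # finer-grained areas

def identifiable(n_rows, n_factors, seed=0):
    """Return (rank, all_effects_identifiable) for a synthetic design matrix.

    The design has one intercept column plus n_factors random columns.
    All coefficients can be estimated separately only if the rank equals
    the number of columns.
    """
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n_rows), rng.normal(size=(n_rows, n_factors))])
    rank = int(np.linalg.matrix_rank(X))
    return rank, rank == X.shape[1]

rank_countries, ok_countries = identifiable(N_COUNTRIES, N_FACTORS)
rank_areas, ok_areas = identifiable(N_AREAS, N_FACTORS)
print("countries:", rank_countries, ok_countries)
print("areas:", rank_areas, ok_areas)
```

Full rank is only a necessary condition: reliable estimates need many more rows than coefficients, and missing values reduce the usable rows further, which is the case for collecting data at a finer level.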
Edouard: project leader
Augustin: co-leader
Peter: data architect
Dahbia: e-health
Claude: Health pps
Thibaut: delegate of ISFA students
Rudy: England & Wales
Thi Nga: Australia
Dongo: Germany
Hado: USA
…Many more!
40 data scientists competing for the best cancer mortality model
Anyone in the world can add data! A tool has been started at
http://baseline.epidemium.cc so that anyone in the world can add data.
Volunteers are working on it. Once it is ready, we will invite you to spread it
widely through social networks!
Secrets against cancer. Volunteers are finalizing a gigantic age-specific
[Y|X] matrix and applying the techniques seen so far to it.
Secrets against other major diseases as well! For long and healthy lives.
Replacing “Y” by mortality risks from, e.g., Alzheimer’s disease and applying
the same techniques should shed light on ways to prevent and mitigate
Alzheimer’s disease. We have started this approach with all-cause mortality.
Public health impact? While it is too early to conclude from current results, we hope
to raise public health awareness and action on the particular topics
that BASELINE will identify. At this stage, two main themes arise:
- Aging & cancer. Age is by far the strongest risk factor for most
cancers. While obvious, this suggests that research linking
biological aging and cancer might be key.
- Sensitivity of selected populations. We hope to see, for
example, whether males of African origin share some risk factors.
Data For Long and Healthy Lives. We hope that the BASELINE approach
will convince more stakeholders to produce useful data for better lives.
[1] Cancers & ethnic origins
African Americans have higher prostate cancer mortality than
Americans of other ethnic origins. Are the reasons social, biological
or other? Is this local to the USA?
We checked the finding with aggregated data from the USA.
We then looked at where prostate cancer mortality risks are high in
the world: in African regions, the Caribbean and, to a moderate
extent, South America (WHO Globocan data).
This is not true for all countries in those parts of the world,
however: we hope to soon run a Y=f(X) analysis that might
suggest risk factors to be particularly careful about for those
populations. Of course, such results would be shared
with epidemiologists and public health stakeholders
rather than put online.
[2] Aggregate GLM method
Dealing with sparsity: let the variable U be present in all rows
of the matrix and V in the first half only. How do we fit Y=f(X)?
A standard GLM approach would calibrate logit Y = a + b U + c V
on the first half of the rows, and logit Y = a’ + b’ U on the second half.
The aggregate GLM method assumes that b = b’ and fits a single model on all rows:
logit Y = a + d 1_V + b U + c V 1_V,
where 1_V equals 1 on rows where V is present and 0 elsewhere. It can be fitted with a standard GLM regression (with
twice the number of variables, minus 1). Applied to parts of the RAMP1
matrix, it always found the following risk factors significant:
• alcohol (drinking more on average than the average population)
• long-term unemployment
• higher blood pressure
• more blood cholesterol
• countries where women marry young (interpreted as an indirect effect)
• being male (and of certain ethnic origins for some cancers)
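The design matrix behind the aggregate model can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic and noiseless, the coefficient values are arbitrary, the function names are hypothetical, and ordinary least squares on logit(Y) is swapped in for a proper GLM fit to keep the example dependent on NumPy only.

```python
import numpy as np

def aggregate_design(U, V):
    """Build the columns [1, 1_V, U, V * 1_V]; V holds NaN where it is missing."""
    U = np.asarray(U, dtype=float)
    V = np.asarray(V, dtype=float)
    present = ~np.isnan(V)
    indicator = present.astype(float)      # 1_V: 1 where V is observed, else 0
    v_filled = np.where(present, V, 0.0)   # V * 1_V: missing values contribute 0
    return np.column_stack([np.ones_like(U), indicator, U, v_filled])

def logit(p):
    return np.log(p / (1.0 - p))

def fit_aggregate(Y, U, V):
    """Least-squares fit of logit(Y) on the aggregate design; returns (a, d, b, c)."""
    X = aggregate_design(U, V)
    coef, *_ = np.linalg.lstsq(X, logit(np.asarray(Y, dtype=float)), rcond=None)
    return coef

# Synthetic example: U is present on all rows, V on the first half only.
rng = np.random.default_rng(0)
n = 200
U = rng.normal(size=n)
V = rng.normal(size=n)
V[n // 2:] = np.nan
true_coef = np.array([-3.0, 0.5, 0.8, -0.4])   # a, d, b, c (arbitrary values)
eta = aggregate_design(U, V) @ true_coef
Y = 1.0 / (1.0 + np.exp(-eta))                 # noiseless rates in (0, 1)

estimated = fit_aggregate(Y, U, V)
print(np.round(estimated, 6))
```

With two variables the fit has three non-intercept columns (1_V, U and V 1_V), which matches the "twice the number of variables, minus 1" count. On rows where V is missing the model reduces to a + b U, and on rows where it is present the intercept becomes a + d.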
Project initiated within the Epidemium
challenge launched by Roche & La Paillasse
[Figure: Y = f(x1, x2, x3, ...), with Y = cancer and x1, x2, x3, ... = age, sun, smoking, other; example profiles: 20-year-old smoker, 40-year-old non-smoker, 40-year-old smoker]