BASELINE
Concept and approach
A diversified team
o More than 200 people came to Baseline meetings and asked to join the mailing list to help; more than 100 concretely helped, and about 25 spent considerable time on it.
o Edouard Debonneuil, Augustin Terlinden and Peter-Mikhaël Richard co-led the project, managed small groups of skilled participants and shared their technical expertise in actuarial science, health economics and programming, respectively.
o Dahbia Agher, Laurence Samelson, Julien Pasquier, Frédéric Kozlowski, Bruno Scheffer, Camille Pouchol, Joseph Sébastien and Mayumi Iitsuka were major non-student contributors. Among their numerous achievements, they collected data on cancer incidence and risk factors for Austria, Denmark, Algeria, Japan and Germany, ran preliminary analyses on these countries and entered raw data into the SQL database.
o Nine ISFA students, supervised by their professors Frédéric Planchet, Stéphane Loisel, Xavier Milhaud and Alexis Bienvenue, collected risk factors for England and Wales (Rudy Mustafa and Thibaut Bideault), Germany (Do Ngoc Linh), Australia (Tài Anh Truong and Nga Nguyen), Japan (Ngo Hang and Huong Nguyen) and the US (Ha Do and Terexa Tran). Joseph Lam and Benoit Choffin from ENSAE provided extensive analysis of the French cancer risk factors. Professors Lionel Gabel (Ecole Centrale Paris), Nicole El Karoui (Paris VI) and Caroline Hillairet (ENSAE) publicized the project to a large audience of actuaries and data scientists.
o Most known cancer risk factors were identified and confirmed by an oncologist (Dr. Nicolas de Chanaud). About ten public health professionals were instrumental in selecting the key explanatory variables to collect. Claude Touche and Delphine Bertram provided their expertise to validate the results in a real-life setting.
o Sixty data scientists from various fields joined forces to model the relationship between Y and X during two RAMP (Rapid Analytics and Model Prototyping) sessions on February 13th, 2016 and April 30th, 2016 at La Paillasse.
o Besides organizing the RAMP sessions, Djalel Benbouzid technically prepared the BASELINE matrix and a first Y=f(X) function so that data scientists could start easily. Seraya Maouche was pivotal in the decision to migrate the BASELINE database from Excel to SQL. She provided easy access to EpidemiumDB as well as an environment to work with it. Last but not least, the Epidemium team provided continuous help and support, making sure the right tools and the right people were available at each milestone of the project. About 100 people (not mentioned here) came several times to the weekly meeting (Thursday evenings at La Paillasse) and concretely helped.
http://wiki.epidemium.cc/wiki/BASELINE http://www.epidemium.cc/project/23/view
cancerbaseline@googlegroups.com
Dataset and technology
Main Results
Contact us
EPIDEMIUM Challenge – May 2016 – Paris, France
A large [Y|X] dataset was created! Its compact age-standardized version can be downloaded at https://github.com/Epidemium/Baseline/tree/master/MATRIX to help research.
A large community of students, researchers and volunteers was made aware of these important public health topics, and of the caveats as well. Mainly: this is epidemiology on populations, not individuals.
A first proof of concept of the population-level approach was obtained on prostate cancer risks in populations of African origin (similar aspects exist for various European, Asian and other origins) [1].
The first RAMP demonstrated that the collected age-standardized data can be very useful: some models were able to accurately predict cancer mortality, suggesting that the collected data capture the essence of population-level cancer mortality risks.
A first method was developed to get insightful results, and it recovered known risk factors [2]: a sign that the approach is sound (though it could be made more robust). Other methods, e.g. PLS regression, are also promising.
1. Collecting data
We manually collected, curated and assembled 5 435 936 numbers, each representing for example the lung cancer rate of 60-64 year old males in Alabama, USA in 2005, or the percentage of smokers in New Zealand in 2008. It is feedback from humanity, covering over 4 billion people.
o It covers 498 areas of the world, 34 types of cancer (mortality and incidence rates), all-cause mortality, and 250 potential risk factors.
o The potential risk factors comprise 70 known cancer risk factors that were specifically looked for, plus any other kind of data found along the way: serendipity!
o These numbers are assembled in a [Y|X] matrix where each row corresponds to a geographic location, year range, gender, age range and ethnic origin.
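The row structure described above can be sketched as follows. This is only an illustration, not the project's actual pipeline: the record layout, the `Y_`/`X_` column prefixes, the `build_matrix` helper and the numeric values are all hypothetical placeholders, not taken from the BASELINE data.

```python
from collections import defaultdict

# Hypothetical long-format records (made-up values, for illustration only):
# (location, year range, gender, age range, ethnic origin, variable, value)
records = [
    ("Alabama, USA", "2005", "male", "60-64", "all", "Y_lung_cancer_rate", 0.002),
    ("Alabama, USA", "2005", "male", "60-64", "all", "X_smokers_pct", 25.0),
    ("New Zealand", "2008", "all", "all", "all", "X_smokers_pct", 20.0),
]

def build_matrix(records):
    """Pivot long-format records into a [Y|X] matrix keyed by the row identity."""
    cells = defaultdict(dict)
    columns = []
    for *key, variable, value in records:
        cells[tuple(key)][variable] = value
        if variable not in columns:
            columns.append(variable)
    # Outcome columns (Y) first, then explanatory columns (X).
    y_cols = sorted(c for c in columns if c.startswith("Y_"))
    x_cols = sorted(c for c in columns if c.startswith("X_"))
    header = y_cols + x_cols
    # Cells that were never observed stay None: this is where sparsity comes from.
    matrix = {key: [vals.get(c) for c in header] for key, vals in cells.items()}
    return header, matrix

header, matrix = build_matrix(records)
n_cells = len(matrix) * len(header)
missing = sum(v is None for row in matrix.values() for v in row)
sparsity = missing / n_cells  # share of empty cells in the matrix
print(header, sparsity)
```

Keeping unobserved cells as explicit missing values, rather than dropping incomplete rows, is what lets a matrix of this kind grow as volunteers add sources of very different coverage.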
The tools used to collectively collect, clean, scale and assemble data varied over the course of the project:
o we successfully started with Data Science Studio (60 people collaborated on it) on the Terralab cluster,
o moved to Excel because of a short deadline (RAMP1): it required no learning time but was not adequate for intense collaboration,
o moved to MySQL+PHP to interact with other Epidemium projects (via EpidemiumDB, led by Seraya),
o used GitHub as an open repository, and communicated via Slack, email and the wiki.
2. Building the model
The resulting [Y|X] matrix is sparse, which makes modelling cancer risks challenging.
o A first “RAMP” modelled age-standardized cancer mortality risks as a function of many risk factors.
o A second “RAMP” modelled age-specific digestive cancer mortality risks as a function of the risks of other cancers.
o Aggregate GLM: a method was specifically designed to do Y=f(X) with such sparse matrices [2].
o 8 actuarial memorandums (ISFA and ENSAE) deal with the Cancer Baseline modelling.
Next steps
melanoma, lung, etc.
o Cancer risks vary widely across the world. If we also map living conditions across the globe, perhaps we can identify major ways to prevent cancer?
o This seems feasible with Epidemium’s existing core dataset and country-specific portals like http://data.gouv.fr. To estimate the effects of more than a few risk factors, however, the maths tells us that 180 countries are not enough: we need to collect and assemble data at a finer level, and we need to be many!
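One way to read the sample-size remark, as a rough sketch rather than the authors' actual calculation: an ordinary linear model with $p$ risk factors has $p + 1$ coefficients, so it needs at least that many independent rows to be identifiable at all. Assuming one row per country and the 250 potential risk factors of the dataset:

```latex
\underbrace{p + 1}_{\text{coefficients}} \;=\; 250 + 1 \;=\; 251 \;>\; 180 \;=\; \underbrace{n}_{\text{countries}}
```

With more unknowns than rows the least-squares fit is not unique, and in practice many more rows than coefficients are wanted for stable estimates, which is why finer-grained areas are needed.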
Edouard: Project leader
Augustin: Co-leader
Peter: Data architect
Dahbia: e-health
Claude: Health pps
Thibaut: delegate of ISFA students
Rudy: England & Wales
Thi Nga: Australia
Dongo: Germany
Hado: USA
…Many more!
40 data scientists competing for the best cancer mortality model
Anyone in the world can add data! A tool has been started at http://baseline.epidemium.cc so that anyone in the world can add data. Volunteers are working on it. We will then invite you to spread it widely through social networks!
Secrets against cancer. Volunteers are finalizing a gigantic age-specific [Y|X] matrix and applying the techniques seen so far to it.
Secrets against other major diseases as well, for long and healthy lives! Replacing “Y” with mortality risks from e.g. Alzheimer’s disease and applying the same techniques should shed light on ways to prevent and mitigate Alzheimer’s disease. We have started this approach with all-cause mortality.
Public health impact? While it is too early to conclude from current results, we hope to raise public health awareness and action on the particular topics that BASELINE will identify. At this stage, two main findings arise:
- Aging & cancer. Age is by far the strongest risk factor for most cancers. While obvious, it suggests that research linking biological aging and cancer might be key.
- Sensitivity of selected populations. We hope to see, for example, whether males of African origin share some risk factors.
Data For Long and Healthy Lives. We hope that the BASELINE approach will convince more stakeholders to produce useful data for better lives.
[1] Cancers & ethnic origins
African Americans have higher prostate cancer mortality than Americans of other ethnic origins. Are the reasons social, biological, or other? … Is this local to the USA?
We checked the finding with aggregated data from the USA.
We then looked at where prostate cancer mortality risks are high in the world: in African regions, the Caribbean and, to a moderate extent, South America (WHO Globocan data).
This is not true of all countries in those parts of the world, however: we hope to soon do a Y=f(X) analysis that might suggest risk factors these populations should be particularly careful about. Of course, such results would be shared with epidemiologists and public health stakeholders rather than put online.
[2] Aggregate GLM method
Dealing with sparsity: let the variable U be present across all rows of the matrix and V across the first half only. How to do Y=f(X)?
A standard GLM approach would calibrate $\operatorname{logit} Y = a + b\,U + c\,V$ on the first half of the rows, and $\operatorname{logit} Y = a' + b'\,U$ on the second half.
The aggregate GLM method assumes that $b = b'$: $\operatorname{logit} Y = a + d\,\mathbf{1}_V + b\,U + c\,V\,\mathbf{1}_V$, where $\mathbf{1}_V$ equals 1 on rows where V is present and 0 elsewhere. It can be fitted with a standard GLM regression (with twice the number of variables, minus 1, as regressors). Applied to parts of the RAMP1 matrix, it always found the following risk factors significant:
• alcohol (drinking more on average than the average population)
• long-term unemployment
• higher blood pressure
• more blood cholesterol
• living in a country where women marry young (interpreted as an indirect effect)
• being male (and of certain ethnic origins for some cancers)
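The aggregate design in [2] can be sketched on a toy example. This is a minimal sketch under stated assumptions, not the project's code: the helper names (`aggregate_design`, `fit_logit_glm`, `solve`), the toy values of U and V and the "true" coefficients are all hypothetical, and the fit uses a plain Newton (IRLS) iteration for a logit-link GLM on noise-free fractional responses, so the coefficients should be recovered up to numerical precision.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def aggregate_design(u, v):
    """One design row [1, 1_V, U, V*1_V]; v is None on rows where V is missing."""
    has_v = 0.0 if v is None else 1.0
    return [1.0, has_v, u, 0.0 if v is None else v]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def fit_logit_glm(rows, y, n_iter=50):
    """Newton/IRLS fit of logit E[Y] = X beta for responses y in (0, 1)."""
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, yi in zip(rows, y):
            mu = sigmoid(sum(bj * xj for bj, xj in zip(beta, x)))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += x[j] * (yi - mu)
                for k in range(p):
                    hess[j][k] += w * x[j] * x[k]
        step = solve(hess, grad)
        beta = [bj + sj for bj, sj in zip(beta, step)]
    return beta

# Toy matrix: U is present on all 12 rows, V on the first 6 only (made-up values).
u_vals = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.1]
v_vals = [0.5, 0.1, 0.9, 0.3, 0.7, 0.0] + [None] * 6
rows = [aggregate_design(u, v) for u, v in zip(u_vals, v_vals)]

# Hypothetical coefficients in the order (a, d, b, c) of the aggregate formula.
true_beta = [-3.0, 0.4, 0.8, -0.5]
y = [sigmoid(sum(t * x for t, x in zip(true_beta, row))) for row in rows]

beta_hat = fit_logit_glm(rows, y)
print([round(b, 4) for b in beta_hat])
```

With two variables the design has three regressors besides the intercept, matching the "twice the number of variables, minus 1" count. In practice the same design matrix could be handed to any standard GLM routine instead of the hand-written fit.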
Project initiated within the Epidemium
challenge launched by Roche & La Paillasse
[Figure: Y = f(x1, x2, x3, ...), where Y is cancer and x1, x2, x3, ... are age, sun, smoking and other factors; illustrated with a 20-year-old smoker, a 40-year-old non-smoker and a 40-year-old smoker.]
