Université Paris-Saclay / CNRS
BALÁZS KÉGL
Center for Data Science
Paris-Saclay
RAMP
DATA CHALLENGES WITH MODULARIZATION AND CODE SUBMISSION
WHO AM I?
Balázs Kégl
• Senior researcher, CNRS
  • machine learning (20 years)
  • interfacing with particle physics (10 years)
• Head of the Paris-Saclay Center for Data Science
  • interfacing with biology, economics, climatology, chemistry, etc. (4 years)
Center for Data Science
Paris-Saclay
B. Kégl (CNRS)
Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale
Chemistry: EA4041/UPSud
Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique
Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE
Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA
Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud
250 researchers in 35 laboratories
Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry
Visualization: INRIA, LIMSI
Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA
Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech
machine learning, information retrieval, signal processing, data visualization, databases
Domain science: human society, life, brain, earth, universe
Tool building: software engineering, clouds/grids, high-performance computing, optimization
Domain scientist, Software engineer
datascience-paris-saclay.fr
LIST/CEA
Center for Data Science Paris-Saclay
A multidisciplinary initiative: building interfaces, matching people, and helping them launch projects
345 affiliated researchers, 50 laboratories
THE DATA SCIENCE ECOSYSTEM
Data scientist, Data Value Architect, Domain expert, Software engineer, Data engineer
Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
Tool building: software engineering, clouds/grids, high-performance computing, optimization
Data domains: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain
https://medium.com/@balazskegl
WHAT DOES MACHINE LEARNING DO
• Classification problem y = f(x)
x → f → y = ‘Stomorhina’
x → f → y = ‘Scaeva’
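To make y = f(x) concrete, here is a minimal sketch of learning such an f from labeled examples. It uses a nearest-centroid rule, chosen only for brevity. The two-number descriptors standing in for the insect images are hypothetical, and `fit_nearest_centroid` is an illustrative name that does not appear in the slides.

```python
from math import dist


def fit_nearest_centroid(X, y):
    """Learn one centroid per class label and return the classifier f."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {
        label: [s / counts[label] for s in acc] for label, acc in sums.items()
    }

    def f(x):
        # Predict the label whose centroid is closest to x.
        return min(centroids, key=lambda label: dist(x, centroids[label]))

    return f


# Hypothetical descriptors; a real x would be an image.
X_train = [[1.0, 0.2], [0.9, 0.1], [0.2, 1.0], [0.1, 0.8]]
y_train = ['Stomorhina', 'Stomorhina', 'Scaeva', 'Scaeva']
f = fit_nearest_centroid(X_train, y_train)
```

Calling `f([1.0, 0.0])` then returns the label of the closer centroid, which is what the slide's "x → f → y" arrows depict.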
WHAT DOES MACHINE LEARNING DO
X y
INTERMEDIARIES: THE GROWING INTEREST IN « CROWDS » -> EXPLOSION OF TOOLS
Olga Kokshagina 2015
• Crowdsourcing is a model that leverages novel technologies (web 2.0, mobile apps, social networks) to build content and a structured set of information by gathering contributions from large groups of individuals
Most classical data challenges are HR and publicity events.
We decided to turn them into a tool for
1. Collaborative prototyping
2. Teaching
3. Data science process management
Funded by Université Paris-Saclay and CNRS
RAMP.STUDIO
DATA CHALLENGE WITH CODE SUBMISSION
Code submission
1. lets us deliver a working prototype
2. lets the participants collaborate
3. makes the backend challenging to run (cloud management)
RAPID ANALYTICS AND MODEL PROTOTYPING
functional data, time series, data augmentation, deep learning, learning on simulations, nonstandard and multi-objective losses
CURRENT RAMPS
DATA SCIENCE THEMES
DATA DOMAINS
RAMP.STUDIO
DATA CHALLENGE WITH CODE SUBMISSION
21 challenges
39 events
1205 users
5952 predictive models
12 hackathons
6 remote data challenges
11 course data camps
CLASSIFYING AND REGRESSING ON MOLECULAR SPECTRA
chemotherapy drug in elastic pocket → laser spectrometer → molecular spectra
molecular spectra → feature extractor 1, feature extractor 2
regressor → concentration
classifier → drug type
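The two-branch workflow on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the pairing of feature extractor 1 with the classifier and feature extractor 2 with the regressor, the toy three-bin spectra, and every function name below are hypothetical and are not taken from the actual RAMP kit.

```python
from math import dist
from statistics import mean


def extract_shape(spectrum):
    """Feature extractor 1 (assumed): spectrum normalized to unit sum."""
    total = sum(spectrum)
    return [v / total for v in spectrum]


def extract_intensity(spectrum):
    """Feature extractor 2 (assumed): total intensity of the spectrum."""
    return sum(spectrum)


def fit_classifier(features, labels):
    """Nearest-centroid classifier predicting the drug type."""
    groups = {}
    for f, label in zip(features, labels):
        groups.setdefault(label, []).append(f)
    centroids = {
        label: [mean(col) for col in zip(*rows)] for label, rows in groups.items()
    }
    return lambda f: min(centroids, key=lambda label: dist(f, centroids[label]))


def fit_regressor(xs, ys):
    """One-dimensional least-squares line predicting the concentration."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return lambda x: y_bar + slope * (x - x_bar)


# Toy training data (hypothetical): spectra, drug types, concentrations.
spectra = [[2, 1, 0], [4, 2, 0], [0, 1, 2], [0, 2, 4]]
drug_types = ['A', 'A', 'B', 'B']
concentrations = [1.0, 2.0, 1.0, 2.0]

classify = fit_classifier([extract_shape(s) for s in spectra], drug_types)
regress = fit_regressor([extract_intensity(s) for s in spectra], concentrations)


def predict(spectrum):
    """Run both branches on one spectrum: (drug type, concentration)."""
    return classify(extract_shape(spectrum)), regress(extract_intensity(spectrum))
```

Keeping the two feature extractors separate lets participants tune what each predictor sees without touching the other branch.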
FORECASTING EL NIÑO SIX MONTHS AHEAD
time series: … 300.14 299.83 298.76 299.87 299.82 300.15 300.10 299.50 …
time series → feature extractor → x (a fixed-length feature vector) → regressor
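As a rough sketch of the time series feature extractor, the code below turns a sliding window of past values into a fixed-length vector and pairs it with the value six steps later. The window size, the three chosen features, the function names, and the reading of the slide's numbers as monthly values are all assumptions made for illustration.

```python
def extract_features(series, t, window):
    """Fixed-length feature vector from the `window` values ending at index t."""
    recent = series[t - window + 1 : t + 1]
    # Three hypothetical features: window mean, change over the window, last value.
    return [sum(recent) / window, recent[-1] - recent[0], recent[-1]]


def make_training_set(series, lookahead=6, window=2):
    """Pair each feature vector with the value `lookahead` steps later."""
    X, y = [], []
    for t in range(window - 1, len(series) - lookahead):
        X.append(extract_features(series, t, window))
        y.append(series[t + lookahead])
    return X, y


# The values shown on the slide, treated here as consecutive time steps.
series = [300.14, 299.83, 298.76, 299.87, 299.82, 300.15, 300.10, 299.50]
X, y = make_training_set(series, lookahead=6, window=2)
```

Any regressor that accepts fixed-length vectors can then be trained on `(X, y)`; with only the eight values above, a single training pair is produced.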
THE POWER OF THE (COLLABORATING) CROWD
Plot annotations: what you achieved with a well-tuned deep net; the diversity gap; the human blender gap; competitive phase; collaborative phase
THE DYNAMICS OF COLLABORATION
starting kit
the crowd
early influencers
inventors
OPEN PHASE LETS PARTICIPANTS CATCH UP
THE GOAL OF TEACHING
COMMUNICATION AND REUSE
ONGOING CHALLENGES: CLASSIFY POLLINATING INSECTS
4.5K€ for the competitive phase
3K€ for the collaborative phase
50 GPU hours per participant
https://www.ramp.studio/events/pollenating_insects_3_JNI_2017
ONGOING CHALLENGES: DETECTING MARS CRATERS
ONGOING CHALLENGES: FAKE NEWS
PREDICT THE TRUTHFULNESS OF NEWS
ONGOING CHALLENGES: KAGGLE SEGURO
PREDICT INSURANCE CLAIMS
UPCOMING CHALLENGE: PREDICT AUTISM FROM BRAIN SCANS
Setting up the RAMP was long and hard.
Separate workflow building and workflow optimization
Before solving the problem, set it up (even put it into production)
FRUGAL DATA SCIENCE PROCESS MANAGEMENT
THE DATA FLOW
data flow: data connectors → X → predictive workflow (CAL, FE, CLF) → ypred
production: full automation, dashboard, decision support
THE IDEAL SEQUENCE
POC, report
expert labeler, Amazon Turk, simulator → y
RAMP: ypred, score type, score, cross-validation scheme
production: full automation, dashboard, decision support
data flow: data connectors → X → workflow (CAL, FE, CLF)
“works on”: data scientists, business unit, domain scientists, data value architects, data engineers
RAMP-WORKFLOW & RAMP-KITS
• toolkit: https://github.com/paris-saclay-cds/ramp-workflow
  • for designing workflows
  • set of ready-made metrics, workflows, CV schemes, data readers
  • a single command-line test script
• examples: https://github.com/ramp-kits
  • a zoo of problems, experiments, workflows
  • (at least) one initial solution
A SINGLE SCRIPT TO DEFINE THE BUNDLE
data connectors → X → workflow (FE, CLF) → ypred
score type, score, cross-validation scheme
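The idea of one script that bundles the data connector, the cross-validation scheme, the score type, and the workflow can be sketched in plain Python as follows. This deliberately does not use the ramp-workflow API. Every name here (`get_data`, `get_cv`, `accuracy`, `feature_extractor`, `fit_classifier`, `evaluate`) and the toy data are assumptions chosen for the example.

```python
from statistics import mean


def get_data():
    """Data connector (toy stand-in): returns inputs X and labels y."""
    X = [[i, j] for i in range(8) for j in range(8)]
    y = [int(i + j > 7) for i, j in X]
    return X, y


def get_cv(n_samples, n_folds=4):
    """Cross-validation scheme: yields (train_indices, test_indices) pairs."""
    indices = range(n_samples)
    for k in range(n_folds):
        test = [i for i in indices if i % n_folds == k]
        train = [i for i in indices if i % n_folds != k]
        yield train, test


def accuracy(y_true, y_pred):
    """Score type: fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def feature_extractor(x):
    """FE step of the workflow: a single hand-made feature."""
    return x[0] + x[1]


def fit_classifier(features, labels):
    """CLF step: threshold halfway between the two class means."""
    mean0 = mean(f for f, label in zip(features, labels) if label == 0)
    mean1 = mean(f for f, label in zip(features, labels) if label == 1)
    threshold = (mean0 + mean1) / 2
    return lambda f: int(f > threshold)


def evaluate():
    """Train and score the workflow on every fold; returns one score per fold."""
    X, y = get_data()
    scores = []
    for train, test in get_cv(len(X)):
        clf = fit_classifier(
            [feature_extractor(X[i]) for i in train], [y[i] for i in train]
        )
        y_pred = [clf(feature_extractor(X[i])) for i in test]
        scores.append(accuracy([y[i] for i in test], y_pred))
    return scores
```

Because the bundle fixes the data, the folds, and the score, a participant only has to replace `feature_extractor` and `fit_classifier` to submit a new solution.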
A SINGLE EXECUTABLE TO TEST THE SUBMISSIONS
• Keep your different submissions in a simple file structure
• Share them through git
• Execute them from the notebook as well
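One way to picture "submissions in a simple file structure" is a folder per submission, each holding a module that a small runner imports and scores. The sketch below is an illustration only, not the toolkit's test script: the `submissions/<name>/classifier.py` layout as used here, the `predict` function, and the `load_submission` and `run_submission` helpers are assumptions.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_submission(submissions_dir, name, module_file="classifier.py"):
    """Import <submissions_dir>/<name>/<module_file> as a Python module."""
    path = Path(submissions_dir) / name / module_file
    spec = importlib.util.spec_from_file_location(f"submission_{name}", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def run_submission(submissions_dir, name, X, y):
    """Run one submission on (X, y) and return its accuracy."""
    module = load_submission(submissions_dir, name)
    y_pred = [module.predict(x) for x in X]
    return sum(t == p for t, p in zip(y, y_pred)) / len(y)


# Demo: write a tiny hypothetical submission to a temporary folder and score it.
with tempfile.TemporaryDirectory() as root:
    kit = Path(root) / "submissions" / "starting_kit"
    kit.mkdir(parents=True)
    (kit / "classifier.py").write_text("def predict(x):\n    return int(x > 0)\n")
    score = run_submission(Path(root) / "submissions", "starting_kit", [-1, 2, 3], [0, 1, 1])
```

The same `run_submission` call works from a notebook cell, and because each submission is just a folder of source files, sharing one through git amounts to committing that folder.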
You can
1. Participate in upcoming RAMPs
2. Use RAMP in teaching or training
3. Use the toolkit for your own workflows
4. Submit your workflow to us if you want to run a data challenge
toolkit: github.com/paris-saclay-cds/ramp-workflow
examples: github.com/ramp-kits
blogs: medium.com/@balazskegl
slack: ramp-studio.slack.com
frontend: www.ramp.studio
mail: balazs.kegl@gmail.com
