What is wrong with data challenges

Open innovation & data science
What we learned from the HiggsML challenge

Published in: Science

  1. Center for Data Science Paris-Saclay, CNRS & University Paris Saclay
      BALÁZS KÉGL
      WHAT IS WRONG WITH DATA CHALLENGES
      THE HIGGSML STORY: THE GOOD, THE BAD AND THE UGLY
  2. Why am I so critical?
      Why do I play down our own success with the HiggsML?
  3. Because I believe that there is enormous potential in open innovation/crowdsourcing in science.
      The current data challenge format is a single point in the landscape.
  4. INTERMEDIARIES: THE GROWING INTEREST FOR « CROWDS » -> EXPLOSION OF TOOLS (Olga Kokshagina 2015)
      • Crowdsourcing is a model leveraging novel technologies (web 2.0, mobile apps, social networks)
      • to build content and a structured set of information by gathering contributions from large groups of individuals
  5. CROWDSOURCING ANNOTATION
  6. CROWDSOURCING COLLECTION AND ANNOTATION
  7. CROWDSOURCING MATH
  8. CROWDSOURCING ANALYTICS
  9. OPEN SOURCE
  10. NEW PUBLICATION MODELS
  11. THE BOOK TO READ
  12. OUTLINE
      • Summary of our conclusions after the HiggsML challenge
      • The good, the bad and the ugly
      • Elaborating on some of the points
      • Rapid Analytics and Model Prototyping
          • an experimental format we have been developing
  13. CIML WORKSHOP TOMORROW
  14. THE GOOD
      • Publicity, awareness
          • both in physics (about the technology) and in ML (about the problem)
      • Triggering open data
          • http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
      • Learning a lot from Gábor on how to win a challenge
      • Gábor getting hired by Google DeepMind
      • Benchmarking
      • Tool dissemination (xgboost, keras)
  15. THE BAD
      • No direct access to code
      • No direct access to data scientists
      • No fundamentally new ideas
      • No incentive to collaborate
  16. THE UGLY
      • 18 months to prepare
          • legal issues, access to data
          • problem formulation: intellectually way more interesting than the challenge itself, but difficult to “market” or to crowdsource
          • once a problem is formalized/formatted as a challenge, the problem is solved (“learning is easy” - Gaël Varoquaux)
  17. THE UGLY
      • We asked the wrong question, on purpose!
          • because the right questions are complex and don’t fit the challenge setup
          • they would have led to way less participation
          • they would have led to bitterness among the participants, bad (?) for marketing
  18. PUBLICITY, AWARENESS
      • The HiggsML challenge on Kaggle
          • https://www.kaggle.com/c/higgs-boson
  19. PUBLICITY, AWARENESS
      [Embedded slide: B. Kégl / AppStat@LAL, Learning to discover: CLASSIFICATION FOR DISCOVERY]
  20. AWARENESS DYNAMICS
      • HEPML workshop @NIPS14
          • JMLR WS proceedings: http://jmlr.csail.mit.edu/proceedings/papers/v42
      • CERN Open Data
          • http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
      • DataScience@LHC
          • http://indico.cern.ch/event/395374/
      • Flavors of physics challenge
          • https://www.kaggle.com/c/flavours-of-physics
  21. LEARNING FROM THE WINNER
      https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf
  22. LEARNING FROM THE WINNER
      • Sophisticated cross validation, CV bagging
      • Sophisticated calibration and model averaging
      • The first step: pro participants check whether the effort is worthwhile (risk assessment)
          • variance estimate of the score
      • Don’t use the public leaderboard score for model selection
      • None of Gábor’s 200 out-of-the-ordinary ideas worked
      https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf
  23. BENCHMARKING
      [Embedded slide: CLASSIFICATION FOR DISCOVERY]
  24. BENCHMARKING
      But what score did we optimize?
      And why?
  25. CLASSIFICATION FOR DISCOVERY
      Goal: optimize the expected discovery significance
      [Figure: background and signal distributions, shown as count (per year) and as probability, each with a selection threshold; flux × time, selection]
      • expected background: say, b = 100 events
      • total count: say, 150 events
      • excess is s = 50 events
      • AMS = $s/\sqrt{b} = 50/\sqrt{100}$ = 5 sigma
      Paper excerpt shown on the slide (left margin clipped):
      [...]ground expectation $\mu_b$. When optimizing the design of [...]gion $G = \{x : g(x) = s\}$, we do not know $n$ and $\mu_b$. As we estimate the expectation $\mu_b$ by its empirical counter- [...] $+\,b$ to obtain the approximate median significance
      $$\mathrm{AMS}_2 = \sqrt{2\left((s+b)\ln\left(1+\frac{s}{b}\right) - s\right)}. \quad (14)$$
      [...] $\ln(x+1) = x - x^2/2 + O(x^3)$, $\mathrm{AMS}_2$ can be rewritten as
      $$\mathrm{AMS}_2 = \mathrm{AMS}_3 \times \sqrt{1 + O\left(\left(\frac{s}{b}\right)^3\right)}, \qquad \mathrm{AMS}_3 = \frac{s}{\sqrt{b}}. \quad (15)$$
      [...]tically indistinguishable when $b \gg s$. This approxima- [...]nding on the chosen search region, be a valid surrogate [...]
  26. How to handle systematic (model) uncertainties?
      • OK, so let’s design an objective function that can take background systematics into consideration
      • Likelihood with unknown background $b \sim N(\mu_b, \sigma_b)$:
        $$L(\mu_s, \mu_b) = P(n, b \mid \mu_s, \mu_b, \sigma_b) = \frac{(\mu_s + \mu_b)^n}{n!}\, e^{-(\mu_s + \mu_b)} \, \frac{1}{\sqrt{2\pi}\,\sigma_b}\, e^{-(b - \mu_b)^2 / (2\sigma_b^2)}$$
      • Profile likelihood ratio:
        $$\lambda(0) = \frac{L(0, \hat{\hat{\mu}}_b)}{L(\hat{\mu}_s, \hat{\mu}_b)}$$
      • The new Approximate Median Significance (by Glen Cowan):
        $$\mathrm{AMS} = \sqrt{2\left((s+b)\ln\frac{s+b}{b_0} - s - b + b_0\right) + \frac{(b - b_0)^2}{\sigma_b^2}}$$
        where
        $$b_0 = \frac{1}{2}\left(b - \sigma_b^2 + \sqrt{(b - \sigma_b^2)^2 + 4(s+b)\,\sigma_b^2}\right)$$
  27. HOW TO HANDLE SYSTEMATIC UNCERTAINTIES
      Why didn’t we use it?
  28. How to handle systematic (model) uncertainties?
      • The new Approximate Median Significance:
        $$\mathrm{AMS} = \sqrt{2\left((s+b)\ln\frac{s+b}{b_0} - s - b + b_0\right) + \frac{(b - b_0)^2}{\sigma_b^2}}$$
        where
        $$b_0 = \frac{1}{2}\left(b - \sigma_b^2 + \sqrt{(b - \sigma_b^2)^2 + 4(s+b)\,\sigma_b^2}\right)$$
      [Figure legend: New AMS, ATLAS, Old AMS]
  29. LEARNING FROM THE WINNER
      • Sophisticated cross validation, CV bagging
      • Sophisticated calibration and model averaging
      • The first step: pro participants check whether the effort is worthwhile (risk assessment)
          • variance estimate of the score
      • Don’t use the public leaderboard score for model selection
      • None of Gábor’s 200 out-of-the-ordinary ideas worked
  30. THE TWO MOST COMMON DATA CHALLENGE KILLERS
      • Leakage
      • Variance of the test score
  31. VARIANCE OF THE TEST SCORE
  32. DATA CHALLENGES
      • Challenges are useful for
          • generating visibility in the data science community about novel application domains
          • benchmarking state-of-the-art techniques in a fair way on well-defined problems
          • finding talented data scientists
      • Limitations
          • not necessarily suited to solving complex and open-ended data science problems in realistic environments
          • no direct access to solutions and data scientists
          • no incentive to collaborate
  33. We decided to design something better
  34. RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)
      • Direct access to code, prototyping
      • Incentivizing diversity
      • Incentivizing collaboration
      • Training
      • Networking
  35. WHERE DOES IT COME FROM?
      • Our experience with the HiggsML challenge
      • Need to connect data scientists to domain scientists and problems at the Paris-Saclay Center for Data Science
      • Collaboration with management scientists specializing in managing innovation
      • Michael Nielsen’s book: Reinventing Discovery
      • 5+ iterations so far
  36. UNIVERSITÉ PARIS-SACLAY
      + horizontal multi-disciplinary and multi-partner initiatives to create cohesion
  37. Center for Data Science Paris-Saclay
      A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay
      http://www.datascience-paris-saclay.fr/
      250 researchers in 35 laboratories
      • Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale
      • Chemistry: EA4041/UPSud
      • Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique
      • Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE
      • Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA
      • Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud
      • Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry
      • Visualization: INRIA, LIMSI
      • Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA
      • Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech
      • Also listed: LIST/CEA
      [Diagram labels]
      • machine learning, information retrieval, signal processing, data visualization, databases
      • Domain science: human society, life, brain, earth, universe
      • Tool building: software engineering, clouds/grids, high-performance computing, optimization
      • Roles: Domain scientist, Software engineer
  38. THE DATA SCIENCE LANDSCAPE
      • Domain science: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain
      • Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
      • Tool building: software engineering, clouds/grids, high-performance computing, optimization
      • Roles: Data scientist, Data trainer, Applied scientist, Domain scientist, Software engineer, Data engineer
  39. https://medium.com/@balazskegl
  40. TOOLS: LANDSCAPE TO ECOSYSTEM
      • Data domains: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain
      • Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
      • Tool building: software engineering, clouds/grids, high-performance computing, optimization
      • Roles: Data scientist, Data trainer, Applied scientist, Domain expert, Software engineer, Data engineer
      • Tools
          • interdisciplinary projects
          • matchmaking tool
          • design and innovation strategy workshops
          • data challenges
          • coding sprints
          • Open Software Initiative
          • code consolidator and engineering projects
          • data science RAMPs and TSs
          • IT platform for linked data
          • annotation tools
          • SaaS data science platform
  41. NIELSEN’S CROWDSOURCING PRINCIPLES
      • Modularizing the collaboration
          • independent subtasks
          • reduces barriers
          • broadens the range of available expertise
      • Encouraging small contributions
      • Rich and well-structured information commons
          • so people can build on earlier work
  42. RAMPS
      • Single-day coding sessions
          • 20-40 participants
          • preparation is similar to challenges
      • Goals
          • focusing and motivating top talents
          • promoting collaboration, speed, and efficiency
          • solving (prototyping) real problems
  43. TRAINING SPRINTS
      • Single-day training sessions
          • 20-40 participants
          • focusing on a single subject (deep learning, model tuning, functional data, etc.)
          • preparing RAMPs
  44. ANALYTICS TOOLS TO PROMOTE COLLABORATION AND CODE REUSE
  45. ANALYTICS TOOL TO PROMOTE COLLABORATION AND CODE REUSE
  46. ANALYTICS TOOLS TO MONITOR PROGRESS
  47. RAPID ANALYTICS AND MODEL PROTOTYPING
      2015 Jan 15: The HiggsML challenge
  48. RAPID ANALYTICS AND MODEL PROTOTYPING
      2015 Apr 10: Classifying variable stars
  49. VARIABLE STARS
  50. VARIABLE STARS (B. Kégl / CNRS - Saclay, Learning to discover)
      accuracy improvement: 89% to 96%
  51. RAPID ANALYTICS AND MODEL PROTOTYPING
      2015 June 16 and Sept 26: Predicting El Niño
  52. RAPID ANALYTICS AND MODEL PROTOTYPING
      RMSE improvement: 0.9˚C to 0.4˚C
  53. RAPID ANALYTICS AND MODEL PROTOTYPING
      2015 October 8: Insect classification
  54. RAPID ANALYTICS AND MODEL PROTOTYPING
      accuracy improvement: 30% to 70%
  55. CONCLUSIONS
      • Explore the open innovation space
          • read Nielsen’s book
      • Drop me a mail (balazs.kegl@gmail.com) if you are interested in beta-testing the RAMP tool
      • Come to our CIML WS tomorrow
  56. THANK YOU!
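
The significance scores discussed on slides 25, 26 and 28 are easy to compare numerically. The following is a minimal Python sketch of those three formulas, not code from the challenge itself: the function names `ams2`, `ams3` and `ams_with_systematics` are hypothetical, and the example values (s = 50, b = 100) are the ones used on slide 25. It assumes b > 0 and sigma_b > 0.

```python
import math


def ams2(s, b):
    """Approximate median significance, eq. (14) on slide 25.

    s: expected signal count in the selection region
    b: expected background count in the selection region (b > 0)
    """
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))


def ams3(s, b):
    """Simplified significance s / sqrt(b), eq. (15) on slide 25."""
    return s / math.sqrt(b)


def ams_with_systematics(s, b, sigma_b):
    """AMS with a background uncertainty sigma_b (slides 26 and 28).

    Assumes sigma_b > 0; as sigma_b goes to 0 the value approaches ams2(s, b).
    """
    var_b = sigma_b ** 2
    # b0 as defined on the slide: the positive root of a quadratic in b0
    b0 = 0.5 * (b - var_b + math.sqrt((b - var_b) ** 2 + 4.0 * (s + b) * var_b))
    return math.sqrt(
        2.0 * ((s + b) * math.log((s + b) / b0) - s - b + b0)
        + (b - b0) ** 2 / var_b
    )


if __name__ == "__main__":
    s, b = 50.0, 100.0           # the example from slide 25
    print(ams3(s, b))            # 5.0
    print(ams2(s, b))            # slightly below ams3 for these counts
    print(ams_with_systematics(s, b, sigma_b=10.0))  # lower still
```

For the slide's example the simplified score gives exactly 5 sigma, the full `ams2` is somewhat lower, and adding a background uncertainty lowers the significance further, which is the effect the systematics-aware objective is meant to capture.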
