Data Science:
A New Frontier for Design
Theory and Methods
Akin Kazakci
MINES ParisTech
akin.kazakci@mines-paristech.fr
THANKS TO:
Question:
Do you disagree with the following statement?
« Design is an essential driver of innovation and economic growth. »
What is the role of design in the contemporary challenges society is facing,
as seen by the decision-makers?
Motto of H2020: economic growth, job creation, societal well-being, Europe’s competitiveness…
80 billion a year
yet…
Claim 1
•  Design research is falling behind in facing contemporary challenges (enough with the chairs)
•  Claim 1a: Too much inbreeding and repetition
•  Claim 1b: A huge amount of work is based on ideas from the ’80s
Data deluge – a tremendous challenge
Mattmann, C. A. « A vision for data science », Nature, 2013.
Roughly 30,000 modern laptops’ disk capacity
Roughly 1,500,000,000 times more per year
Some orders of magnitude…
Image courtesy of Vladimir Gligorov, CERN
Even harder for some…
Image courtesy of Vladimir Gligorov, CERN
A side remark… on what’s important
Is this a cat?
>
What is the role of the Higgs boson in the structure of the universe?
Huge boost for data science / research
National Big Data R&D Initiative of the White House in 2012
•  NSF, NIH, and DARPA; the Research Data Alliance (RDA)
•  NYU, University of Washington, UC Berkeley (with five-year, $37.8M funding from the Moore and Sloan foundations)
•  In Europe: University of Amsterdam, Edinburgh University, Imperial College (with Zhejiang University)
•  In France, Université Paris-Saclay has created the Centre for Data Science
Harvard Business Review, Davenport and Patil, 2012
Data deluge: tremendous opportunity
« Mastering the creation of value from big data … will be a cornerstone in future economic development and societal well-being:
-  30% of the global market for European suppliers;
-  100,000 new jobs in Europe by 2020;
-  10% lower energy consumption, better health-care outcomes and more productive industrial machinery »
Source: EU Commission, Digital Agenda for Europe, Fact Sheet Data cPPP
Data science: new phenomenon or déjà vu?
« techniques for processing large amounts of information »
« statistical and mathematical methods »
« techniques like mathematical programming »
« methodologies like operations research »
« no single established name yet. Let us call it IT »
« higher-order thinking through computer programs »
The death of OR? – before it delivers on its promise
« OR is dead, even though it has yet to be buried. »
« Little chance of resurrection, because there is little understanding of its demise. »
Salvation of OR – by design
prediction paradigm should be replaced… by a paradigm directed at designing a desirable future and inventing ways of bringing it about.
(suggest that) OR replace its problem-solving orientation by one that focuses on planning and design
Claim 2
•  Claim 2: To avoid facing the same difficulties as OR, data science should go beyond the predictive (analytics) paradigm and embrace a design paradigm
•  Claim 2a: Data science cannot expect to solve the challenges imposed by the data solely through technical breakthroughs:
   → A renewal of data science methodology is also needed
•  Hypothesis: More than 50 years of research in design have allowed the design research community to gather invaluable insights about the nature of creative activities
•  Corollary (Claim 2b): Design theory and methods can provide, at least to some extent, the much-needed insights.
Analysing a data challenge
Learning to discover: HiggsML Challenge
Winners
Record number of participants
Great improvements
Data science challenges: how effective for innovation?
•  1800+ teams developing methods for detecting the Higgs on CERN data
•  Important improvements (discovery significance rose from 3.2 to 3.8)
•  Big buzz, huge visibility
•  Bringing the ML and physics communities closer
•  Study of available data
-  Forums,
-  Documentation,
-  Participants’ blog entries and
-  GitHub code
   → 136 topics, 1400+ posts
•  Qualitative interpretation combined with C-K modelling of participants’ strategies
Analysis of design strategies
Achieve 5σ
Discovery condition: A discovery is claimed when we find a ‘region’ of the space where there is a significant excess of ‘signal’ events (rejecting the background-only hypothesis with a p-value less than 2.9 × 10⁻⁷, corresponding to 5 sigma).
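The link between 5 sigma and the quoted p-value can be checked numerically. The sketch below assumes the usual one-sided convention, in which the p-value is the upper-tail probability of a standard normal distribution; the function name is illustrative only.

```python
import math


def one_sided_p_value(z: float) -> float:
    """Upper-tail probability of a standard normal beyond z standard deviations."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Assumption: one-sided test, as is conventional for discovery claims.
p_5sigma = one_sided_p_value(5.0)
print(f"{p_5sigma:.2e}")  # about 2.87e-07, i.e. just under 2.9e-7
```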
Problem formulation: Traditional classification setting: « the task of the participants is to train a classifier g based on the training data D with the goal of maximizing the AMS (7) on a held-out (test) data set » (HiggsML documentation)
With 2 tweaks:
-  Training set events are « weighted »
-  Maximize « Approximate Median Significance » (AMS)
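The AMS formula itself did not survive on the slide. A minimal sketch of the objective follows, using the form given in the HiggsML challenge documentation as best recalled here: s and b are the summed weights of selected true-signal and background events, and a regularisation term is added to b. The value 10 for that term and the helper names are assumptions to be checked against the documentation.

```python
import math

B_REG = 10.0  # regularisation term; assumed value, check the challenge documentation


def ams(s: float, b: float, b_reg: float = B_REG) -> float:
    """Approximate Median Significance for weighted signal s and background b."""
    b_tot = b + b_reg
    radicand = 2.0 * ((s + b_tot) * math.log(1.0 + s / b_tot) - s)
    return math.sqrt(max(radicand, 0.0))  # guard against tiny negative rounding


def ams_from_selection(labels, weights, selected):
    """Compute AMS from per-event labels (1 = signal), weights and a selection mask."""
    s = sum(w for y, w, sel in zip(labels, weights, selected) if sel and y == 1)
    b = sum(w for y, w, sel in zip(labels, weights, selected) if sel and y == 0)
    return ams(s, b)


# Toy events (made-up numbers): two signal events and one background event selected.
toy_ams = ams_from_selection(
    labels=[1, 1, 0, 0],
    weights=[2.0, 3.0, 40.0, 50.0],
    selected=[True, True, True, False],
)
```

Only the selected events enter the score, which is why true positives (s) and false positives (b) matter and true negatives do not.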
Select a classification method: SVM, Decision Trees, NN, …
Pre-processing
Choose hyper-params
Train
Optimize for X
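Performance metrics: During the overall learning process, performance metrics are used to supervise the quality and convergence of a learned model. A traditional metric is accuracy: (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives. Note that for the HiggsML AMS, TP (s) and FP (b) are of particular importance.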
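As a small illustration of the accuracy metric, the sketch below counts the four confusion-matrix cells on made-up labels and predictions. The function names and data are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 means signal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of events classified correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up example: 2 signal events and 4 background events.
y_true = [1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```

Accuracy rewards every correctly rejected background event (TN), whereas a significance-style objective depends only on what ends up in the selected region, namely TP and FP.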
Ensemble Methods: Boosting, Bagging, others
(Extended) Dominant Design
Traditional workflow = Dominant design
C space / K space
A deviation from dominant design
Achieve 5σ
Select a classification method: SVM, Decision Trees, NN, …
Pre-processing
Choose hyper-params
Train
Optimize for accuracy
Integrate AMS directly in training, during Gradient Boosting (John)
Discovery condition: A discovery is claimed when we …
Problem formulation: Traditional classification setting…
Cross-Validation: Techniques for evaluating how a …
Ensemble Methods
Gradient boosting methods fit a classifier to the 'per data point loss' and since AMS is not a sum of per data point (event) losses, it's not obvious how to use AMS as a loss in gradient boosting (Andre Holzner)
During node split in random forest (John): AMS 3.3 → The node split works by looking for the split that maximises the AMS of one side of the split when predicting it as pure signal (John)
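The split criterion described above can be sketched for a single feature as follows. This is not the participant's code: the AMS form (with an assumed regularisation term of 10), the function names and the toy data are all assumptions made for illustration.

```python
import math


def ams(s, b, b_reg=10.0):
    """Approximate Median Significance; b_reg = 10 is an assumed regularisation."""
    b_tot = b + b_reg
    radicand = 2.0 * ((s + b_tot) * math.log(1.0 + s / b_tot) - s)
    return math.sqrt(max(radicand, 0.0))


def best_ams_split(values, labels, weights):
    """Return (threshold, side, score): the split whose better side has the
    highest AMS when that whole side is predicted as signal."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    total_s = sum(w for y, w in zip(labels, weights) if y == 1)
    total_b = sum(w for y, w in zip(labels, weights) if y == 0)
    left_s = left_b = 0.0
    best = (None, None, -1.0)
    for pos in range(len(order) - 1):
        i, j = order[pos], order[pos + 1]
        # Move event i to the left side of the candidate split.
        if labels[i] == 1:
            left_s += weights[i]
        else:
            left_b += weights[i]
        if values[i] == values[j]:
            continue  # no threshold separates equal feature values
        threshold = 0.5 * (values[i] + values[j])
        sides = (("left", left_s, left_b),
                 ("right", total_s - left_s, total_b - left_b))
        for side, s, b in sides:
            score = ams(s, b)
            if score > best[2]:
                best = (threshold, side, score)
    return best


# Toy node (made-up numbers): heavy background at low values, light signal at high values.
threshold, side, score = best_ams_split(
    values=[0.1, 0.2, 0.3, 0.8, 0.9],
    labels=[0, 0, 0, 1, 1],
    weights=[50.0, 50.0, 50.0, 5.0, 5.0],
)
```

A real random forest would repeat this search over a random subset of features at every node; the sketch only shows how the usual impurity criterion could be swapped for an AMS-based one.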
An alternative may be to « use AUC in gradient boosting till you get to the max cv result and then tried to move forward with an AMS loss function from that point »
« In principle, the AMS approximate function is derivable (http://tinyurl.com/ov5pedq) at a node level (s and b being the totals of other nodes, considered constant, and x, w being the probability prediction and weight for the node to be split) and one could rewrite the part of code where the objective function is evaluated, replacing the sums with a different calculation » (Giulio Casa)
Introduction of a new K pocket
Weighted Classification Cascades
Two participants observe that AMS can be refactored and its terms rewritten in their convex conjugate form, which allows the use of the Fenchel-Young inequality from the convex optimization literature.
Ref: http://arxiv.org/pdf/1409.2655v2.pdf, Mackey & Bryan
Optimization of AMS becomes possible by a procedure they name Weighted Classification Cascades. (Rank: 451st)
? ? ? ? ?
Winning strategy…
Optimization of AMS
Design for statistical efficiency
The biggest challenge is the instability of AMS. Competition results clearly show that only participants who dealt effectively with this issue achieved higher ranks.
1st, 2nd, 3rd: monitoring progress with CV + ensembles + selecting a cutoff threshold that optimises (or stabilises) AMS
Ensembles + CV monitoring + cutoff threshold seem to be a winning strategy
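The cutoff-selection step can be sketched as a simple sweep over candidate thresholds on held-out scores. This is an illustrative sketch, not any winner's code: the AMS form (with an assumed regularisation term of 10), the function names and the toy scores are assumptions.

```python
import math


def ams(s, b, b_reg=10.0):
    """Approximate Median Significance; b_reg = 10 is an assumed regularisation."""
    b_tot = b + b_reg
    radicand = 2.0 * ((s + b_tot) * math.log(1.0 + s / b_tot) - s)
    return math.sqrt(max(radicand, 0.0))


def best_cutoff(scores, labels, weights, candidates):
    """Return (cutoff, AMS) maximising AMS when events scoring >= cutoff are called signal."""
    best_cut, best_val = None, -1.0
    for cut in candidates:
        events = list(zip(scores, labels, weights))
        s = sum(w for sc, y, w in events if sc >= cut and y == 1)
        b = sum(w for sc, y, w in events if sc >= cut and y == 0)
        val = ams(s, b)
        if val > best_val:
            best_cut, best_val = cut, val
    return best_cut, best_val


# Toy held-out fold (made-up classifier scores, labels and event weights).
cut, val = best_cutoff(
    scores=[0.95, 0.9, 0.7, 0.6, 0.4, 0.2],
    labels=[1, 1, 0, 1, 0, 0],
    weights=[4.0, 4.0, 30.0, 4.0, 60.0, 90.0],
    candidates=[0.1, 0.3, 0.5, 0.65, 0.8],
)
```

Because AMS is unstable, one plausible way to stabilise the choice is to run the sweep on each cross-validation fold and keep a cutoff that does well on average rather than the single best value on one fold.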
Fixating others…
Public guide to AMS 3.6 « moves » many participants to the given path
Fixation vs. Creative Authority (Agogué et al., 2014)
Analysing is one thing…
What about generating alternatives using design theory?
Generating new design strategies
Data science as a new frontier for design
A. Kazakci, ICED’15 (submitted)
•  18 months of problem formulation (3 physicists, 3 data scientists)
•  No innovation in DS: only differences in individual performance in adapting a dominant design
•  No innovation in physics (even criticism: wrong problem?)
•  In their current form and organisation, data challenges are « problem-solving » approaches
« Extracting value from data » requires a rigorous design process
You reap what you sow: data challenges will not yield innovations unless the problem formulation bears originality and the exploration is ingeniously organised.
DKCP - Machine learning for HEP
•  A DKCP process has been launched for exploring innovation opportunities at the crossroads of HEP and ML
Bootcamps
-  how to ensure a controlled yet creative exploration?
•  Thank you
•  Akin Kazakci
•  akin.kazakci@mines-paristech.fr
•  Data science as a new frontier for design
DCC’14
Machine Learning and Innovative Design Workshop
