Crowd computing: All your base are belong to us

Results from the Boehringer Ingelheim Pharmaceuticals, Inc. 'Predicting a biological response' Kaggle competition. The central thesis is that many problems can be framed with gaming elements, which lowers the barrier to participation and increases engagement. Presented at the Bio-IT Cloud Summit, Data-focused Cloud Applications session, Sept. 12-13, Hotel Kabuki, San Francisco, CA.

Presentation Transcript

  • Crowd Computing: All Your Base Are Belong to Us
    David C. Thompson
  • What is about to happen
    • Some background on:
      – me
      – competition
    • Crowdsourced science through the ‘ages’
    • The data set
    • The Kaggle process
    • An overview of the competition
    • The models and implementation
    • What we have learnt
  • Behold! Let the science begin …
  • http://amzn.to/OyQMVf
  • about.me/dcthompson
    My favourite papers from each period:
    [1] J. Chem. Phys. 122, 124107 (2005)
    [2] J. Chem. Phys. 128, 224103 (2008)
    [3] J. Chem. Inf. Model. 49, 1889 (2009)
    [4] J. Chem. Inf. Model. 51, 93 (2011)
  • A funny thing happened at my 1st external communications conference …
    Or …
    A heart-wrenching tale of man versus coffee machine …
    Or …
    How an external networking opportunity brought some ‘gamification’ to research
  • http://www.taviscurry.com/
  • Do real science, at home.
  • What happens when you search for ‘blindfolded archery’
  • I never make predictions. And I never will*
    Lots of opportunity to translate problems, from all fields, into systems with gaming elements
    • Goal – What do you hope to achieve by playing the game?
    • Rules – The limitations on how you can achieve the goals
    • Feedback – How close are you to achieving your goal?
    • Voluntary participation – Everyone playing the game accepts the goals, the rules, and the feedback
    * Paul Gascoigne
    http://janemcgonigal.com/
  • http://fold.it/portal/
  • What you should know about this exercise
    • We wanted to investigate the utility of the process
    • We wanted to move with speed
    • We wanted to use a data set the scientific community had previously seen
    • We wanted to be inclusive – no domain expertise needed
  • Shameless slide reuse … *
    “All models are wrong, but some models are useful” – G. E. P. Box
    “…the validity of any given model is of limited scope, as is the case with any mental construct that we have about what our molecules are doing, whether we used a software package or waved our hands around in the air.” – D. Lowe
    Simulation and its discontents, Sherry Turkle, Cambridge, MA: MIT Press (2009)
    * D. C. Thompson et al. Schrödinger Regional User Meeting, New York, NY 2009
  • The data set
    • Version 2 of the Hansen Ames mutagenicity data was used
    • The following protocol was observed:
      What happened                          # of molecules (removed)
      Download SMILES                        6512
      Conversion with Corina                 6503 (9)
      Remove non-zero formal charge          6419 (84)
      Remove if more than 99 atoms           6414 (5)
      Remove if contain undesirable atoms*   6252 (162)
    http://doc.ml.tu-berlin.de/toxbenchmark/
    J. Chem. Inf. Model. 49, 2077 (2009)
    * D, B, Al, P, Ga, Si, Ge, Sn, As, Sb, Se, Te, At, He, Ne, Ar, Kr, Xe, Rn
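The three removal rules in the protocol above can be sketched as a simple filter. This is only an illustration: the molecule record (a dict with a `formal_charge` integer and an `atoms` list of element symbols) and the function name `keep_molecule` are hypothetical, and the slide does not say whether hydrogens count toward the 99-atom limit.

```python
# Element symbols listed in the slide's footnote as undesirable.
UNDESIRABLE = {
    "D", "B", "Al", "P", "Ga", "Si", "Ge", "Sn", "As", "Sb",
    "Se", "Te", "At", "He", "Ne", "Ar", "Kr", "Xe", "Rn",
}


def keep_molecule(mol, max_atoms=99):
    """Return True if a (hypothetical) molecule record passes all three rules."""
    if mol["formal_charge"] != 0:            # remove non-zero formal charge
        return False
    if len(mol["atoms"]) > max_atoms:        # remove if more than 99 atoms
        return False
    if UNDESIRABLE.intersection(mol["atoms"]):  # remove undesirable atoms
        return False
    return True


# Toy records, not taken from the Hansen data set.
molecules = [
    {"name": "neutral_organic", "formal_charge": 0, "atoms": ["C", "C", "O", "N"]},
    {"name": "charged", "formal_charge": 1, "atoms": ["C", "N"]},
    {"name": "has_silicon", "formal_charge": 0, "atoms": ["C", "Si", "O"]},
]
kept = [m["name"] for m in molecules if keep_molecule(m)]
```

Applying the rules in sequence, as the slide does, also makes it easy to count how many molecules each step removes.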
  • Descriptor calculation
    SD file, descriptor calculation – 6252 x 5030
    – Filter for low variance (≤ 0.01); removed 2537
    – Remove for high correlation (> 0.90); removed 716
    – Descriptor normalization resulted in 6252 x 1777 .csv file
      Descriptor Engine                # of descriptors
      MOE 2D                           76 (186)
      Atom Pair                        696 (1920)
      MolConn-Z                        174 (745)
      Pipeline Pilot Property Counts   5 (130)
      Daylight fingerprints            825 (2048)
      clogP                            0 (1)
    J. Chem. Inf. Model. 49, 2077 (2009)
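A minimal sketch of the two descriptor filters, using only the Python standard library. The slide gives the cutoffs (variance ≤ 0.01, correlation > 0.90) but not the details, so several choices here are assumptions: population variance is used, correlation is Pearson and compared by absolute value, and of two correlated descriptors the one seen first is kept. The names `pearson` and `filter_descriptors` are illustrative.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def filter_descriptors(columns, var_cutoff=0.01, corr_cutoff=0.90):
    """columns maps descriptor name -> list of values; returns names kept."""
    # Step 1: drop descriptors whose variance is at or below the cutoff.
    varying = [n for n, v in columns.items() if pvariance(v) > var_cutoff]
    # Step 2: greedily keep a descriptor only if it is not highly
    # correlated with any descriptor already kept.
    kept = []
    for name in varying:
        if all(abs(pearson(columns[name], columns[k])) <= corr_cutoff
               for k in kept):
            kept.append(name)
    return kept


# Toy descriptor matrix, one list per column.
columns = {
    "d1": [0.0, 1.0, 2.0, 3.0],
    "d2": [1.0, 1.0, 1.0, 1.0],   # constant: fails the variance filter
    "d3": [0.0, 2.0, 4.0, 6.1],   # nearly proportional to d1
    "d4": [3.0, 0.0, 2.0, 1.0],
}
kept_names = filter_descriptors(columns)
```

The pairwise loop is quadratic in the number of descriptors, which is fine for a sketch but would be slow at the scale of 5030 columns without a vectorized correlation matrix.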
  • Testing Framework
    • Public Leaderboard: The split of the test set that competition participants see real-time feedback on over the course of the competition.
    • Private Leaderboard: The split of the test set that is used to determine the competition winners and estimate the generalization error. Participants do not see feedback on this during the competition.
    “Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy
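The public/private arrangement can be sketched as one fixed split of the test set, scored separately on each part. The 25% public fraction, the random seed, and the function names are assumptions for illustration; the slides do not state how this competition's test set was divided.

```python
import random


def split_test_set(ids, public_fraction=0.25, seed=0):
    """Split test-set ids into (public, private) parts, fixed by the seed."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_public = int(len(shuffled) * public_fraction)
    return shuffled[:n_public], shuffled[n_public:]


def accuracy(ids, truth, predicted):
    """Fraction of the given ids whose predicted label matches the truth."""
    return sum(truth[i] == predicted[i] for i in ids) / len(ids)


public_ids, private_ids = split_test_set(range(100))

# Toy labels: a submission is scored on both parts, but only the public
# score would be shown to participants while the competition is running.
truth = {i: i % 2 for i in range(100)}
submission = dict(truth)
public_score = accuracy(public_ids, truth, submission)
private_score = accuracy(private_ids, truth, submission)
```

Keeping the private part hidden is what lets it serve as an estimate of generalization error: participants cannot tune against feedback they never see.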
  • Expectations
    “Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set”
    • 20 models generated with different algorithms and descriptors
    • Models have overall accuracies between 0.75 and 0.83 for the training set and 0.76 and 0.82 for the test set
    • Inter-laboratory accuracy for Ames test reported at 85%
    Expectation: Models should have similar accuracy to literature
    Goal: Models should be balanced; sensitivity and specificity should be high
    J. Chem. Inf. Model. 50, 2094 (2010)
  • http://www.kaggle.com/c/bioresponse
  • Performance as a function of time
    • 796 players
    • 703 teams
    • 8841 entries
    • 55 forum topics, 409 posts
    $$\text{log loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
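The log loss metric on this slide can be computed directly from its definition. The sketch below assumes binary labels (0 or 1) and predicted probabilities; the clipping constant `eps` is an added assumption, a common guard against taking log(0), and is not something the slide specifies.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Predicting 0.5 for every molecule scores ln(2) regardless of the labels.
uninformed = log_loss([1, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])

# Confident, correct predictions score much lower (better).
confident = log_loss([1, 0], [0.9, 0.1])
```

Because the metric penalizes confident wrong answers heavily, it rewards well-calibrated probabilities and not only correct class labels.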
  • Final    Team Name                                  Public    Δ (log loss)
    Ranking                                             Ranking
    1        Winter is Coming & Sergey                  11        0
    2        seelary                                    26        7E-05
    3        bluehat                                    1         0.00051
    4        jazz                                       15        0.0014
    5        Wayne Zhang & Gxav & woshialex             19        0.00146
    6        Indy Actuaries                             38        0.00184
    7        bluemaster & imran                         7         0.00231
    8        Efiimov & Bers & Cragin & vsu              4         0.00241
    9        y_tag                                      18        0.0026
    10       Killian O’Connor                           44        0.00285
    11       PlanetThanet & SirGuessalot                40        0.00298
    12       AussieTim                                  48        0.00335
    13       Jason Farmer                               31        0.00347
    14       GreenPeace                                 16        0.00356
    15       mars                                       32        0.00388
    16       Fuzzify                                    60        0.00392
    17       Emanuele                                   63        0.00395
    18       HappyHour                                  10        0.00431
    19       Baltic                                     30        0.00465
    20       dejavu                                     20        0.00482
    352      Random Forest Benchmark                    373       0.04184
    541      Support Vector Machine Benchmark           522       0.12147
    647      Optimized Constant Value Benchmark         638       0.31414
    650      Uniform Benchmark                          642       0.31959
    https://github.com/emanuele/kaggle_pbr
    https://github.com/benhamner/BioResponse
  • #FTW Strategies
    • Feature selection
      All three winning teams identified D27 as important.
      What is it?
      Organon toxicophore*
    • RF + complementary approaches
    • Blending
    * J. Med. Chem. 49, 312 (2005)
    “Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy
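The slide names blending without saying how the winning teams did it. One common form, shown here purely as an assumed illustration and not as the teams' method, is a weighted average of the predicted probabilities from several models. The function name `blend` and the toy numbers are hypothetical.

```python
def blend(predictions, weights=None):
    """Weighted average of per-model probability lists (equal weights by default)."""
    n_models = len(predictions)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    n_items = len(predictions[0])
    return [
        sum(w * model[i] for w, model in zip(weights, predictions)) / total
        for i in range(n_items)
    ]


# Toy probabilities from two hypothetical models for the same three molecules.
random_forest = [0.2, 0.8, 0.6]
other_model = [0.4, 0.6, 0.9]

equal_blend = blend([random_forest, other_model])
weighted_blend = blend([random_forest, other_model], weights=[3.0, 1.0])
```

Averaging tends to help most when the blended models make different kinds of errors, which fits the slide's pairing of RF with complementary approaches.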
  • Private Set Performance
    Confusion matrix layout: TP FN / FP TN
    Se: TP/(TP+FN)   Sp: TN/(FP+TN)   CCR: (Se + Sp)/2
      Group           Model     TP    FN    FP    TN    Se     Sp     CCR
      Benchmarks      RF        888   150   166   672   0.86   0.80   0.83
      Benchmarks      SVM       822   216   215   623   0.79   0.74   0.77
      Winning Teams   Team 1    873   165   151   687   0.84   0.82   0.83
      Winning Teams   Team 2    888   150   165   673   0.86   0.80   0.83
      Winning Teams   Team 3    893   145   162   676   0.86   0.80   0.83
      Other           Team 17   896   142   169   669   0.86   0.80   0.83
      Other           D27       781   257   215   623   0.75   0.74   0.75
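The three statistics defined on this slide follow directly from the four confusion-matrix counts. The sketch below applies the slide's own formulas to the Team 1 and RF benchmark counts; the function name `ccr_metrics` is illustrative.

```python
def ccr_metrics(tp, fn, fp, tn):
    """Return (sensitivity, specificity, correct classification rate)."""
    se = tp / (tp + fn)       # Se: fraction of positives correctly predicted
    sp = tn / (fp + tn)       # Sp: fraction of negatives correctly predicted
    ccr = (se + sp) / 2       # CCR: mean of the two, insensitive to class balance
    return se, sp, ccr


# Counts for Team 1 and the RF benchmark as given on the slide.
team1 = tuple(round(v, 2) for v in ccr_metrics(tp=873, fn=165, fp=151, tn=687))
rf = tuple(round(v, 2) for v in ccr_metrics(tp=888, fn=150, fp=166, tn=672))
```

Rounded to two decimals, both reproduce the slide's values: 0.84, 0.82, 0.83 for Team 1 and 0.86, 0.80, 0.83 for the RF benchmark.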
  • Okay, where’s this ‘second’ web service?
    BIpredict
    • Physicochemical properties are updated as molecule is built
    • Atomistic descriptor values are appended directly to the molecule
    * D. C. Thompson Chemical Computing Group, User Group Meeting, Montreal, 2011
  • So, what did we learn?
    • Was this useful?
      – Yes
    • Participation was high, contributors and contributions were diverse*
    • A large number of models were of a high quality
      – Differences in top models in log loss metric are small
      – Different statistical measures lead to different rankings
      – RandomForest benchmark has high correct classification rate (CCR)
    * Sort of
  • ‘Machine learning that matters’
    Machine learning skill
    Domain expertise
    Kiri L. Wagstaff. Machine Learning that Matters. Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML), June 2012. (CL #12-2026)
  • Know your meme
    http://roflcon.org/
    http://katemiltner.com/
  • Thanks to:
    Lilly Ackley
    Ben Hamner
    Amy Kunkel
    Mehul Patel
    Alex Renner, PhD
    All Kaggle participants – esp. Winter is Coming & Sergey