Competitive data science: A tale of two web services
 


Initial results from the Boehringer Ingelheim Pharmaceuticals, Inc. "Predicting a Biological Response" Kaggle competition. Presented at the Fall ACS 2012 #CINF session "When Chemists and Computers Collide: Putting Cheminformatics in the Hands of Medicinal Chemists".


Presentation Transcript

    • Competitive data science: A tale of two web services
      David C. Thompson, Jörg Bentzien, Ingo Mügge, Ben Hamner
    • What is about to happen
      – about.me
      – The Kaggle process
      – The data set
      – How the competition went
      – The models and implementation
      – What we have learnt
    • about.me/dcthompson
      My favourite papers from each period:
      [1] J. Chem. Phys. 122, 124107 (2005)
      [2] J. Chem. Phys. 128, 224103 (2008)
      [3] J. Chem. Inf. Model. 49, 1889 (2009)
      [4] J. Chem. Inf. Model. 51, 93 (2011)
    • What you should know about this exercise
      – We wanted to investigate the utility of the process
      – We wanted to move with speed
      – We wanted to use a data set the scientific community had previously seen
      – We wanted to be inclusive – no domain expertise needed
    • The data set
      – Version 2 of the Hansen AMES mutagenicity data was used
      – The following protocol was observed:

        What happened                           # of molecules (removed)
        Download smiles                         6512
        Conversion with Corina                  6503 (9)
        Remove non-zero formal charge           6419 (84)
        Remove if more than 99 atoms            6414 (5)
        Remove if contains undesirable atoms*   6252 (162)

      * D, B, Al, P, Ga, Si, Ge, Sn, As, Sb, Se, Te, At, He, Ne, Ar, Kr, Xe, Rn
      http://doc.ml.tu-berlin.de/toxbenchmark/
      J. Chem. Inf. Model. 49, 2077 (2009)
    • Descriptor calculation
      – SD file, descriptor calculation: 6252 x 5030
      – Filter for low variance (≤ 0.01); removed 2537
      – Remove for high correlation (> 0.90); removed 716
      – Descriptor normalization resulted in a 6252 x 1777 .csv file

        Descriptor Engine                # of descriptors
        MOE 2D                           76 (186)
        Atom Pair                        696 (1920)
        MolConn-Z                        174 (745)
        Pipeline Pilot Property Counts   5 (130)
        Daylight fingerprints            825 (2048)
        clogP                            0 (1)

      [Slide also shows a bar chart of descriptor counts per engine]
      J. Chem. Inf. Model. 49, 2077 (2009)
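The two pruning steps on this slide (drop variance ≤ 0.01, drop one of each pair with correlation > 0.90) can be sketched in a few lines of NumPy. This is an illustrative implementation only, not the authors' actual pipeline; the function name and the greedy left-to-right drop order are assumptions:

```python
import numpy as np

def prune_descriptors(X, var_thresh=0.01, corr_thresh=0.90):
    """Drop low-variance columns, then one column of each highly
    correlated pair (thresholds taken from the slide)."""
    # 1. Low-variance filter
    X = X[:, np.var(X, axis=0) > var_thresh]
    # 2. Greedy high-correlation filter: keep the earlier column,
    #    drop any later column correlated above the threshold
    corr = np.abs(np.corrcoef(X, rowvar=False))
    n = corr.shape[0]
    drop = set()
    for i in range(n):
        if i in drop:
            continue
        for j in range(i + 1, n):
            if j not in drop and corr[i, j] > corr_thresh:
                drop.add(j)
    keep = [j for j in range(n) if j not in drop]
    return X[:, keep]
```

On a matrix with a constant column and a duplicated column, both get removed, mirroring how the 5030 raw descriptors shrank toward 1777.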
    • Testing framework
      – Public Leaderboard: the split of the test set on which competition participants see real-time feedback over the course of the competition.
      – Private Leaderboard: the split of the test set used to determine the competition winners and estimate the generalization error. Participants do not see feedback on this during the competition.
      “Predictive Modeling from a Kaggler’s Perspective”, Jeremy Achin, Sergey Yergenson, Tom Degodoy
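The public/private mechanism described above amounts to a hidden random partition of the held-out test set. A minimal sketch, assuming a simple shuffled split; the function name, fraction, and seed are illustrative (Kaggle does not disclose its actual split):

```python
import random

def split_leaderboard(test_ids, public_frac=0.3, seed=0):
    """Partition test-set row ids into public and private leaderboard sets.

    public_frac and seed are assumptions for illustration; participants
    never learn which rows fall in which half.
    """
    rng = random.Random(seed)
    ids = list(test_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * public_frac)
    return set(ids[:cut]), set(ids[cut:])
```

Scoring the public half live while reserving the private half is what lets the final standings estimate generalization error rather than leaderboard overfitting.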
    • Expectations
      “Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set”
      – 20 models generated with different algorithms and descriptors
      – Models have overall accuracies between 0.75 and 0.83 for the training set and 0.76 and 0.82 for the test set
      – Inter-laboratory accuracy for the Ames test is reported at 85%
      Expectation: models should have accuracy similar to the literature
      Goal: models should be balanced; sensitivity and specificity should both be high
      J. Chem. Inf. Model. 50, 2094 (2010)
    • http://www.kaggle.com/c/bioresponse
    • Performance as a function of time

      log loss = −(1/N) Σ_{i=1}^{N} [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)]

      – 796 players
      – 703 teams
      – 8841 entries
      – 55 forum topics, 409 posts
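The log loss metric above can be computed directly from labels and predicted probabilities. A minimal Python sketch; the clipping epsilon is a common convention to avoid log(0), not something stated on the slide:

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss, as used to score the Kaggle leaderboard."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() is always defined
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```

A perfect prediction scores near 0, while a confident wrong prediction is penalized heavily, which is why the gaps between the top teams (thousandths of a unit) are so small relative to the constant-value benchmarks.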
    • Final standings

      Final Rank  Team Name                            Public Rank  Δ (log loss)
      1           Winter is Coming & Sergey            11           0
      2           seelary                              26           7E-05
      3           bluehat                              1            0.00051
      4           jazz                                 15           0.0014
      5           Wayne Zhang & Gxav & woshialex       19           0.00146
      6           Indy Actuaries                       38           0.00184
      7           bluemaster & imran                   7            0.00231
      8           Efiimov & Bers & Cragin & vsu        4            0.00241
      9           y_tag                                18           0.0026
      10          Killian O’Connor                     44           0.00285
      11          PlanetThanet & SirGuessalot          40           0.00298
      12          AussieTim                            48           0.00335
      13          Jason Farmer                         31           0.00347
      14          GreenPeace                           16           0.00356
      15          mars                                 32           0.00388
      16          Fuzzify                              60           0.00392
      17          Emanuele                             63           0.00395
      18          HappyHour                            10           0.00431
      19          Baltic                               30           0.00465
      20          dejavu                               20           0.00482
      352         Random Forest Benchmark              373          0.04184
      541         Support Vector Machine Benchmark     522          0.12147
      639         Optimized Constant Value Benchmark   647          0.31414
      642         Uniform Benchmark                    650          0.31959

      https://github.com/emanuele/kaggle_pbr
      https://github.com/benhamner/BioResponse
    • #FTW strategies
      – Feature selection: all three winning teams identified D27 as important. What is it? An Organon toxicophore*
      – RF + complementary approaches
      – Blending
      * J. Med. Chem. 49, 312 (2005)
      “Predictive Modeling from a Kaggler’s Perspective”, Jeremy Achin, Sergey Yergenson, Tom Degodoy
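At its simplest, the "blending" strategy named on this slide is a weighted average of several models' predicted probabilities. A minimal sketch under that reading; uniform default weights and the function name are assumptions (winning teams typically tune weights on held-out data):

```python
def blend(pred_lists, weights=None):
    """Weighted average of per-model probability lists.

    pred_lists: one list of predicted probabilities per model,
    all the same length. Uniform weights unless given.
    """
    n_models = len(pred_lists)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    # Transpose to iterate per-example, then combine across models
    return [sum(w * p for w, p in zip(weights, row))
            for row in zip(*pred_lists)]
```

Averaging a random forest with complementary models tends to cancel their uncorrelated errors, which is why blends so often edge out any single model under log loss.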
    • Private set performance
      Se = TP/(TP+FN)   Sp = TN/(FP+TN)   CCR = (Se + Sp)/2
      (RF, SVM, and D27 are benchmarks; Teams 1–3 are the winning teams; Team 17 is another team)

      Model/Team   TP    FN    FP    TN    Se     Sp     CCR
      RF           888   150   166   672   0.86   0.80   0.83
      SVM          822   216   215   623   0.79   0.74   0.77
      D27          781   257   215   623   0.75   0.74   0.75
      Team 1       873   165   151   687   0.84   0.82   0.83
      Team 2       888   150   165   673   0.86   0.80   0.83
      Team 3       893   145   162   676   0.86   0.80   0.83
      Team 17      896   142   169   669   0.86   0.80   0.83
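The Se/Sp/CCR definitions above map directly onto the confusion-matrix counts in the table. A small sketch, checked here against the Random Forest benchmark counts from the slide:

```python
def classification_rates(tp, fn, fp, tn):
    """Sensitivity, specificity, and correct classification rate (CCR)
    from confusion-matrix counts, per the definitions on the slide."""
    se = tp / (tp + fn)        # sensitivity: true-positive rate
    sp = tn / (fp + tn)        # specificity: true-negative rate
    return se, sp, (se + sp) / 2

# RF benchmark counts from the slide
se, sp, ccr = classification_rates(tp=888, fn=150, fp=166, tn=672)
print(round(se, 2), round(sp, 2), round(ccr, 2))  # prints: 0.86 0.8 0.83
```

CCR averages the two rates, so a model cannot score well by favouring one class; this is what the "balanced" goal on the Expectations slide measures.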
    • Okay, where’s this ‘second’ web service? BIpredict
      – Physicochemical properties are updated as the molecule is built
      – Atomistic descriptor values are appended directly to the molecule
      * D. C. Thompson, Chemical Computing Group User Group Meeting, Montreal, 2011
    • So, what did we learn? Was this useful? Yes
      – Participation was high; contributors and contributions were diverse*
      – A large number of models were of high quality:
        – differences among the top models in the log loss metric are small
        – different statistical measures lead to different rankings
        – the Random Forest benchmark has a high correct classification rate (CCR)
      * Sort of
    • ‘Machine learning that matters’
      Machine learning skill + domain expertise
      Kiri L. Wagstaff, “Machine Learning that Matters”, Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML), June 2012.
    • Thanks to:
      Lilly Ackley
      Amy Kunkel
      Mehul Patel
      Alex Renner, PhD
      All Kaggle participants – esp. Winter is Coming & Sergey