Open Data talk at the World Bank


  • How does data science differ from econometrics?
  • Companies post their problem, their data and a prize, and our 38,000 data scientists compete to produce the best solution.
  • Players’ algorithms are back-tested in real time, so we can show how people are performing on a live leaderboard. The live leaderboard accounts for a large part of Kaggle’s success, as people are motivated to outperform each other (which catalyzes better performance than individuals developing a model in isolation).
  • From many different (maths-related) disciplines
  • Users have the option to tell us their favourite techniques
  • Outbound        Within 1 min   Within 5 min
    Overall             62.1           87.9
    Peak hour           12.8           57.4
    PH next hour        14.8           63.0

    Inbound         Within 1 min   Within 5 min
    Overall             69.3           92.1
    Peak hour           37.8           77.8
    PH next hour        34.5           79.3

    Inbound predictions tend to be far more accurate. 34.5 per cent of inbound peak-hour predictions made one hour ahead are correct within one minute.
  • 6000 molecules (anonymized); 1700 structural descriptors. Objective of prediction: Biological Response (mutagenicity), indicated as 1/0. Many other biological responses can be modeled using the same approach. Exceeded expectations within 2 weeks. Extensible to other compound properties: mutagenicity, hepatotoxicity, solubility, PK/PD etc.
  • Could predict whether a used car would be a lemon with approximately 47% accuracy.
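The "within 1 min" and "within 5 min" figures in the travel-time note above read as the percentage of predictions that land within a given tolerance of the observed travel time. A minimal sketch of that calculation follows, assuming this is how the figures were computed. The function name `share_within` and the sample travel times are hypothetical and are not taken from the competition.

```python
def share_within(predicted, actual, tolerance_min):
    """Percentage of predictions within tolerance_min minutes of the actual value.

    Hypothetical helper for illustration; the competition's exact scoring
    rule is not given in the talk.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("need at least one prediction")
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance_min)
    return 100.0 * hits / len(predicted)


# Made-up travel times in minutes (illustrative only, not competition data).
predicted = [12.0, 15.5, 20.0, 9.0]
actual = [12.5, 18.0, 26.0, 9.5]

within_1 = share_within(predicted, actual, 1.0)  # errors: 0.5, 2.5, 6.0, 0.5
within_5 = share_within(predicted, actual, 5.0)
print(within_1, within_5)
```

With these made-up numbers, two of the four predictions fall within one minute and three fall within five minutes.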

    1. 1. Making data science a sport. Anthony Goldbloom, Kaggle
    2. 2. Competition Mechanics: Competitions are judged on objective criteria
    3. 3. Kaggle’s Dark Matter Competition on the White House blog: “The world’s brightest physicists have been working for decades on solving one of the great unifying problems of our universe” “In less than a week, Martin O’Leary, a PhD student in glaciology, outperformed the state-of-the-art algorithms”
    4. 4. User base: 60,000 data scientists
    5. 5. Our User Base
    6. 6. Users apply different techniques • neural networks • genetic algorithms • logistic regression • random forest • support vector machine • Monte Carlo methods • decision trees • principal component analysis • ensemble methods • Kalman filter • adaBoost • evolutionary fuzzy modeling • Bayesian networks
    7. 7. EXAMPLE ESSAY QUESTION: We all understand the benefits of laughter. For example, someone once said, “Laughter is the shortest distance between two people.” Many other people believe that laughter is an important part of any relationship. Tell a true story in which laughter was one element or part.
    8. 8. “Have you ever experienced a time with your friends or family where you laughed so hard your stomach hurt, and your eyes were filled with tears? Laughing is something every person needs. A great laugh can make a persons day and put a smile on their face. If no one laughed the world would be a terribly sad place. My friends and I are always laughing, to the point where were rolling on the ground, clutching our stomachs laughing.” Automated results by the winning algorithm are as reliable as manual assessment by teachers.
    9. 9. Diabetes & Obesity & Hypertension & High Cholesterol: Probability of going to hospital in the next six months
    10. 10. RTA Competition: Travel Time Prediction
    11. 11. Boehringer Ingelheim Competition: Data (+1700 fields). Mutates: Molecule True; Molecule2 False; Molecule3 True; Molecule4 True; Molecule 5 … True 0
    12. 12. Is it a lemon?
    13. 13. What could the world’s best analysts find in your data? e-mail a@kaggle.com, phone +1 650 283 9781