Drew Conway: A Social Scientist's Perspective on Data Science


At the NYC Data Science meetup on March 5, 2014, Drew Conway (head of data at Project Florida and co-author of the book Machine Learning for Hackers) spoke about his own research using the tools of data science to tackle problems in political science.

Published in: Technology

  1. A social scientist's perspectives on data science
     Drew Conway, NYC Data Science Meetup, March 5, 2013
     http://www.flickr.com/photos/uiowa/8047195100/
  2. Hacking Skills
     Obtain / Munge
     I hold the following truths to be self-evident...
     1. Data come from many sources
     2. Data come in many form(at)s
     [Chart labels: 10%, 10%, 80%]
     A .zip file of PDFs ≠ data
     ‣ Data scientists must know where to get data and how to obtain it
     ‣ Work with big text files
       $ head publicvotes-20101018_votes.dump
     ‣ Work with APIs
       $ curl http://search.twitter.com/search.json?q=@drewconway > drewconway.json
     Real data are messy
     ‣ Even curated data: duplicates, missing values, date formats
     ‣ Combine data from multiple sources/formats
     ‣ Tools
       • *NIX tools: sed, awk, grep
       • Scripting languages: Perl, Python and R
       $ cat ufo_awesome.tsv | grep probe | wc -l
       131
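The shell one-liner on slide 2 (`cat ufo_awesome.tsv | grep probe | wc -l`) counts the lines that contain "probe". The same munging step can be sketched in Python. This is a minimal illustration, not code from the talk: the helper name `count_matching_lines` and the demo rows are hypothetical, and the real `ufo_awesome.tsv` is not reproduced here.

```python
import os
import tempfile


def count_matching_lines(path, needle):
    """Count lines containing `needle`, like `grep needle path | wc -l`."""
    count = 0
    # errors="replace" keeps the scan going when a messy file has bad bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if needle in line:
                count += 1
    return count


if __name__ == "__main__":
    # Hypothetical demo rows standing in for a tab-separated data file.
    rows = [
        "19951009\tIowa City, IA\tman reported a probe\n",
        "19950101\tShelton, WA\tlights in the sky\n",
        "19950503\tColumbia, MO\tprobe-like object seen\n",
    ]
    with tempfile.NamedTemporaryFile(
        "w", suffix=".tsv", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.writelines(rows)
    try:
        print(count_matching_lines(tmp.name, "probe"))
    finally:
        os.remove(tmp.name)
```

Reading line by line keeps memory use flat, which matters for the "big text files" the slide mentions.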
  3. Hacking Skills
     While 80% of effort is spent here, perhaps most straightforward to teach
     Heavily tool focused, borrow from CS/EE curriculums
     ‣ Comfort working at the command line, with text editors
     ‣ A language for every season!
     Conveying findings in creative and compelling ways
  4. Math & Stats Knowledge
     If: Better data beats better math
     Then: What methods should be taught?
     How do you find structure in new data?
     ‣ Scatter plots
     ‣ Density plots
     Data exploration that scales
     ‣ Reduce dimensionality
     ‣ PCA, SVD, MDS
     Methods must match data
     ‣ Text
     ‣ Geospatial
     ‣ Web-scale
     What is the 'best' model?
     ‣ Most predictive
     ‣ Most parsimonious
     Explore / Model
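Slide 4 lists PCA among the ways to reduce dimensionality. As a rough illustration of the idea, the sketch below finds the first principal component (the direction of greatest variance) with power iteration on the sample covariance matrix. This is a dependency-free stand-in chosen for brevity, not the method used in the talk; in practice a library SVD routine would be the usual choice. The function name and the toy data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return a unit vector along the direction of greatest variance.

    Uses power iteration on the sample covariance matrix. The starting
    vector is all ones, so the result is unreliable in the unlikely case
    that the start is exactly orthogonal to the leading direction.
    """
    n = len(rows)
    dim = len(rows[0])
    # Center each column so variance is measured around the mean.
    means = [sum(row[j] for row in rows) / n for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in rows]
    # Sample covariance matrix (dim x dim).
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]
    vec = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # no variance along the current vector; keep it as is
        vec = [x / norm for x in nxt]
    return vec


if __name__ == "__main__":
    # Toy data lying exactly on the line y = 2x.
    data = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
    print(first_principal_component(data))
```

Projecting each centered row onto the returned vector gives a one-dimensional summary of the data, which is the "exploration that scales" step the slide points at.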
  5. Math & Stats Knowledge
     Universities good at methods training...
     ...but what methods fit into Data Science?
     Things data scientists like...
     ‣ Illustrating the current state of the world
     ‣ Predicting future observations
     ‣ Classifying/ranking observations
     Things social scientists like...
     ‣ Testable theoretical models
     ‣ Natural experiments
     ‣ Causality
     1. When applicable
     2. Right tool / right job
     3. Open black boxes
     4. Learn limitations
  6. Substantive Expertise
     Data Science, as a discipline, is fundamentally about human behavior
     Inquire / Interpret
     [Chart labels: 10%, 10%, 80%]
     Focus on questions / not tech
     ‣ What new questions can be asked from web-scale data?
     ‣ Tools are a means to an end
     Social science has questions
     ‣ Markets
     ‣ Organization
     How do we know when the results we get make sense, if ever?
  7. Case Study: Methods for Collecting Large-Scale Non-Expert Text Coding
     http://www.flickr.com/photos/cawley/3242403224/
  8. Median Voter Theorem
     Theorem: In a majority rules system, the preference of the median voter will succeed
     Assumption: The political/ideological preferences of voters can be projected onto a single numeric dimension
     http://thomasmoreinstitute.wordpress.com/2010/04/28/the-uk-election-and-the-curse-of-the-median-voter/
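The theorem on slide 8 can be checked on a toy electorate. The sketch below assumes each voter has an ideal point on a single numeric dimension (the slide's stated assumption) and votes for whichever of two proposals is closer to it. That distance-based voting rule, the function names, and the ideal points are illustrative assumptions, not material from the talk.

```python
import statistics


def majority_winner(ideal_points, a, b):
    """Return the proposal preferred by more voters, or None on a tie."""
    votes_a = sum(1 for x in ideal_points if abs(x - a) < abs(x - b))
    votes_b = sum(1 for x in ideal_points if abs(x - b) < abs(x - a))
    if votes_a == votes_b:
        return None
    return a if votes_a > votes_b else b


def median_beats_all(ideal_points, challengers):
    """Check that the median ideal point wins every pairwise vote."""
    median = statistics.median(ideal_points)
    return all(
        majority_winner(ideal_points, median, c) == median for c in challengers
    )


if __name__ == "__main__":
    voters = [-2.0, -0.5, 0.1, 0.8, 3.0]  # hypothetical ideal points
    print(median_beats_all(voters, [-1.0, 0.0, 0.5, 2.0]))
```

With an odd number of voters the median voter and everyone on one side of any challenger form a majority, which is why the median position cannot lose a pairwise vote in this setup.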
  9. Median Voter Theorem http://voteview.com/blog/?p=5 How do we calculate these numbers?
  10. We make it up... http://www.flickr.com/photos/estherlairlandesa/46495660 But, we have to!
  11. A tale of two disciplines
      Physics: Build instrument, Measure
      Political Science: Observe action, Infer
      http://en.wikipedia.org/wiki/File:Obama_Health_Care_Speech_to_Joint_Session_of_Congress.jpg
      http://www.flickr.com/photos/becca02/6727193557/
  12. One thing we have a lot of: text
      Politicians
      ‣ Speeches
      ‣ Constituent communication
      Parties
      ‣ Platform / manifestos
      ‣ Position statements
      Countries
      ‣ Diplomatic cables
      ‣ Military declarations
      Expert Coding!
  13. How expert coding (typically) works
      http://en.wikipedia.org/wiki/Official_Monster_Raving_Loony_Party
      Expert / Code Book
      1. Health & Safety: We propose to ban Self Responsibility on the grounds that it may be dangerous to your health.
      2. M.P's Expenses: We propose that instead of a second home allowance M.P's will have a caravan which will be parked outside the Houses of Parliament. This will make it easier as flipping a caravan is easier than flipping homes
      3. Eurofit: The European Constitution which will be sorted out by going for a long Walk. “As everyone knows that walking is good for the constitution”
      Manifesto
      Party: Monster Raving Loony; Year: 2010; Score: -2
      DATA!
  14. What's wrong with experts?
      ‣ They're slow
      ‣ They're biased
      ‣ They're expensive
      ‣ They're wrong
  15. Can we use non-experts to code political manifestos?
      How can we measure the quality/validity of non-expert codings?
      Use Mechanical Turk to code many manifesto fragments.
  16. Experimental approach
      Expert codings (baseline data)
      ‣ Texts: 18 “big 3” British party manifestos, 1987-2010
      ‣ Experts: 5 advanced poli. sci. graduate students + 2 tenured faculty
      ‣ Coding: deliberately simple schema
      MT codings (three experiments)
      ‣ No Qualification: anyone in
      ‣ Low-Threshold: 4/6 correct
      ‣ High-Threshold: 5/6 correct
      Experimental design
      ‣ Hypothesis: Stronger filter on Turkers leads to better coding
      ‣ Filter: Use MT qualification test as gatekeeper
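The three experimental conditions on slide 16 amount to a gate on qualification-test scores. A minimal sketch of that gate follows, assuming "4/6 correct" and "5/6 correct" mean at least that many correct answers out of six; the slide does not spell this out. The names `MIN_CORRECT` and `eligible_workers` and the worker IDs are hypothetical, and no Mechanical Turk API is involved here.

```python
# Minimum correct answers (out of 6) needed to enter each condition.
# Reading the thresholds as "at least" is an assumption, not from the slides.
MIN_CORRECT = {
    "No Qualification": 0,
    "Low-Threshold": 4,
    "High-Threshold": 5,
}


def eligible_workers(scores, condition):
    """Return the sorted worker IDs whose test score clears the condition."""
    minimum = MIN_CORRECT[condition]
    return sorted(w for w, correct in scores.items() if correct >= minimum)


if __name__ == "__main__":
    # Hypothetical qualification-test results: worker ID -> correct answers.
    test_scores = {"worker_a": 6, "worker_b": 4, "worker_c": 2}
    for name in MIN_CORRECT:
        print(name, eligible_workers(test_scores, name))
```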
  17. How do we think about coding a manifesto fragment?
  18. Example text coding HIT from the experiment
  19. How do we implement this (aka, the glue)?
      ‣ Expert codings
      ‣ Random sample, as JSON: [{ 'text_unit_id': ..., 'sentence_text': ..., .... }, ... ]
      ‣ EC2 / S3 / MT
      ‣ Dynamically generate HITs
      ‣ MT codings
      ‣ Push HITs + retrieve results
      ‣ Statistical analysis of results
      ‣ Scholarship, FTW!
      https://github.com/drewconway/mturk_coder_quality
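The "random sample, as JSON" step on slide 19 can be sketched with the standard library. The field names `text_unit_id` and `sentence_text` come from the slide; everything else (the function name, the seed parameter, the example sentences) is hypothetical, and this is not the code in the linked repository.

```python
import json
import random


def sample_units_as_json(expert_codings, k, seed=None):
    """Draw k coded text units without replacement and serialize them."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    sample = rng.sample(expert_codings, k)
    return json.dumps(sample, indent=2)


if __name__ == "__main__":
    # Hypothetical stand-ins for expert-coded manifesto sentences.
    units = [
        {"text_unit_id": i, "sentence_text": f"Example sentence {i}."}
        for i in range(1, 11)
    ]
    print(sample_units_as_json(units, 3, seed=42))
```

Each sampled record could then feed whatever generates the HITs; that part depends on the Mechanical Turk tooling and is left out.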
  20. What's good about MT non-experts?
      ‣ They're fast
      ‣ They're biased?
      ‣ They're cheap
      ‣ They're wrong?
      The last crowd-sourced coding job was for 600 sentences and got 4,300 sentences coded in about 20 hours (about 3.6 sentences per minute)
      • We pay about $0.02 / sentence
      • Typical manifesto (in British set) has 1,000 sentences
      • Whole manifesto coded for $20
      • By comparison, the CMP pays expert coders about €150 per manifesto, call it €0.15 or $0.20 per sentence: 10x more per sentence
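The figures on slide 20 follow from a few lines of arithmetic. The sketch below only re-derives the slide's own numbers; the variable names are made up, and the $0.20 expert rate is the slide's rounded euro-to-dollar conversion rather than a computed exchange rate.

```python
# Throughput: 4,300 sentences coded in about 20 hours.
sentences_coded = 4300
hours = 20
sentences_per_minute = sentences_coded / (hours * 60)

# Crowd cost: about $0.02 per sentence, 1,000 sentences per manifesto.
mt_usd_per_sentence = 0.02
sentences_per_manifesto = 1000
mt_usd_per_manifesto = mt_usd_per_sentence * sentences_per_manifesto

# Expert cost: about EUR 150 per manifesto, spread over the same sentences.
expert_eur_per_manifesto = 150
expert_eur_per_sentence = expert_eur_per_manifesto / sentences_per_manifesto

# The slide rounds EUR 0.15 to about $0.20 for the comparison.
expert_usd_per_sentence = 0.20
cost_ratio = expert_usd_per_sentence / mt_usd_per_sentence

print(round(sentences_per_minute, 1))  # sentences per minute
print(round(mt_usd_per_manifesto, 2))  # dollars per manifesto via MT
print(expert_eur_per_sentence)         # euros per sentence for experts
print(round(cost_ratio, 1))            # expert cost relative to MT
```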
  21. Results
      Results of initial MT experiments
      Agreement by experiment (Kappa Statistic)
      | Experiment | Sentences | # MT Coders | % Agreement | k* | Std. Error | z |
      |---|---|---|---|---|---|---|
      | No Qual. | 1,315 | 89 | 0.65 | 0.47 | 0.13 | 22.6 |
      | Low-Threshold | 1,393 | 56 | 0.7 | 0.54 | 0.12 | 26.7 |
      | High-Threshold | 1,250 | 23 | 0.62 | 0.41 | 0.13 | 18.3 |
      * A k value between 0.4-0.6 is considered “moderate” agreement
      Agreement by expert-coding
      | Experiment | Expert Coding | MT % Agreement |
      |---|---|---|
      | No Qual. | Economic | 0.77 |
      | No Qual. | Social | 0.92 |
      | No Qual. | Neither | 0.22 |
      | Low-Threshold | Economic | 0.87 |
      | Low-Threshold | Social | 0.98 |
      | Low-Threshold | Neither | 0.2 |
      | High-Threshold | Economic | 0.77 |
      | High-Threshold | Social | 0.91 |
      | High-Threshold | Neither | 0.09 |
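Slide 21 reports a kappa statistic, which corrects raw agreement for the agreement expected by chance. The slides do not say which kappa variant was used; with many coders per sentence a multi-rater statistic such as Fleiss' kappa would be a plausible choice. The sketch below shows the simpler two-coder Cohen's kappa purely to illustrate the idea. The function name and the labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's label frequencies, summed.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical codings: E = economic, S = social, N = neither.
    coder_1 = ["E", "E", "S", "N"]
    coder_2 = ["E", "S", "S", "N"]
    print(round(cohens_kappa(coder_1, coder_2), 3))
```

In this toy case the coders agree on 3 of 4 items (0.75), chance agreement is 5/16, and kappa comes out at 7/11, inside the "moderate to substantial" region the slide's footnote alludes to.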
  22. Results
      Separating Social and Economic Sentences
      Kappa Statistic
      | Experiment | Sentences | # MT Coders | % Agreement | k* | Std. Error | z |
      |---|---|---|---|---|---|---|
      | Econ-only | 942 | 15 | 0.62 | 0.23 | 0.1 | 4.28 |
      | Soc-only | 955 | 32 | 0.6 | 0.17 | 0.09 | 0.95 |
      * A k value between 0.4-0.6 is considered “moderate” agreement
      | Experiment | Expert Coding | MT % Agreement |
      |---|---|---|
      | Economic-only | Economic | 0.92 |
      | Economic-only | Neither | 0.28 |
      | Social-only | Social | 0.97 |
      | Social-only | Neither | 0.19 |
      Non-experts have a very hard time with a “null” coding!
  23. Joint work with...
      ‣ Michael Laver, NYU
      ‣ Kenneth Benoit, LSE
      ‣ Slava Mikhaylov, UCL
      Paper: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2260437
      Presentation: http://bit.ly/nonexperts
  24. Project Florida
  25. Coder performance stability [Figure panels: No Qualification, Low-threshold, High-threshold]
      Performance becomes very stable after approximately 20 HITs
  26. Party shifts: economic
  27. Party shifts: social