Drew Conway: A Social Scientist's Perspective on Data Science
At the NYC Data Science meetup on March 5, 2014, Drew Conway (head of data at Project Florida and co-author of the book Machine Learning for Hackers) spoke about his own research using the tools of data science to tackle problems in political science.

Published in: Technology
Transcript

  • 1. A social scientist's perspectives on data science. Drew Conway, NYC Data Science Meetup, March 5, 2013. http://www.flickr.com/photos/uiowa/8047195100/
  • 2. Hacking Skills: Obtain, Munge (slide graphic: 10% / 10% / 80%)
    I hold the following truths to be self-evident... 1. Data come from many sources 2. Data come in many form(at)s
    A .zip file of PDFs ≠ data
    ‣ Data scientists must know where to get data and how to obtain it
    ‣ Work with big text files
      $ head publicvotes-20101018_votes.dump
    ‣ Work with APIs
      $ curl http://search.twitter.com/search.json?q=@drewconway > drewconway.json
    Real data are messy
    ‣ Even curated data: duplicates, missing values, date formats
    ‣ Combine data from multiple sources/formats
    ‣ Tools
      • *NIX tools: sed, awk, grep
      • Scripting languages: Perl, Python and R
      $ cat ufo_awesome.tsv | grep probe | wc -l
      131
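The `grep probe | wc -l` one-liner on slide 2 has a direct equivalent in the scripting languages the slide lists. The following is a minimal Python sketch of the same count. The three sample rows are invented for illustration and are not taken from `ufo_awesome.tsv`, and `count_matching_lines` is a hypothetical helper name.

```python
import io

def count_matching_lines(lines, keyword):
    """Count lines that contain keyword anywhere, like `grep keyword | wc -l`."""
    return sum(1 for line in lines if keyword in line)

# Invented tab-separated rows standing in for a real data file.
sample = io.StringIO(
    "19951009\tIowa City, IA\tMan reports probe\n"
    "19960101\tAustin, TX\tBright light in sky\n"
    "19970704\tReno, NV\tStrange probe-like object\n"
)

matches = count_matching_lines(sample, "probe")
print(matches)
```

For a real file, passing an open file handle (for example `open("ufo_awesome.tsv", encoding="utf-8")`) in place of `sample` would stream it line by line without loading it into memory.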
  • 3. Hacking Skills
    While 80% of effort is spent here, these skills are perhaps the most straightforward to teach
    Heavily tool focused; borrow from CS/EE curricula
    ‣ Comfort working at the command line, with text editors
    ‣ A language for every season!
    Conveying findings in creative and compelling ways
  • 4. Math & Stats Knowledge: Explore, Model
    If: Better data beats better math. Then: What methods should be taught?
    How do you find structure in new data?
    ‣ Scatter plots
    ‣ Density plots
    Data exploration that scales
    ‣ Reduce dimensionality
    ‣ PCA, SVD, MDS
    Methods must match data
    ‣ Text
    ‣ Geospatial
    ‣ Web-scale
    What is the 'best' model?
    ‣ Most predictive
    ‣ Most parsimonious
  • 5. Math & Stats Knowledge
    Universities are good at methods training... but what methods fit into Data Science?
    Things data scientists like...
    ‣ Illustrating the current state of the world
    ‣ Predicting future observations
    ‣ Classifying/ranking observations
    Things social scientists like...
    ‣ Testable theoretical models
    ‣ Natural experiments
    ‣ Causality
    1. When applicable 2. Right tool / right job 3. Open black boxes 4. Learn limitations
  • 6. Substantive Expertise: Inquire, Interpret (slide graphic: 10% / 10% / 80%)
    Data Science, as a discipline, is fundamentally about human behavior
    Focus on questions / not tech
    ‣ What new questions can be asked from web-scale data?
    ‣ Tools are a means to an end
    Social science has questions
    ‣ Markets
    ‣ Organization
    How do we know when the results we get make sense, if ever?
  • 7. Case Study: Methods for Collecting Large-Scale Non-Expert Text Coding. http://www.flickr.com/photos/cawley/3242403224/
  • 8. Median Voter Theorem
    Theorem: In a majority-rule system, the preference of the median voter will succeed
    Assumption: The political/ideological preferences of voters can be projected onto a single numeric dimension
    http://thomasmoreinstitute.wordpress.com/2010/04/28/the-uk-election-and-the-curse-of-the-median-voter/
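The theorem on slide 8 can be checked on a toy electorate. The sketch below assumes an odd number of voters, each preferring whichever position is closer to their own ideal point on the single dimension. The voter positions and challenger positions are invented, and `majority_winner` is a hypothetical helper.

```python
from statistics import median_low

def majority_winner(voters, a, b):
    """Return the position (a or b) that more voters are strictly closer to, or None on a tie."""
    votes_a = sum(1 for v in voters if abs(v - a) < abs(v - b))
    votes_b = sum(1 for v in voters if abs(v - b) < abs(v - a))
    if votes_a == votes_b:
        return None
    return a if votes_a > votes_b else b

# Invented ideal points on a left (-1) to right (+1) scale.
voters = [-0.9, -0.4, -0.1, 0.3, 0.8]
median_position = median_low(voters)

# Pit the median voter's position against a handful of alternatives.
challengers = [-0.9, -0.5, -0.4, 0.0, 0.3, 0.8]
median_never_loses = all(
    majority_winner(voters, median_position, c) == median_position
    for c in challengers
)
print(median_position, median_never_loses)
```

The check only covers the listed challengers; it illustrates the claim rather than proving it.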
  • 9. Median Voter Theorem. How do we calculate these numbers? http://voteview.com/blog/?p=5
  • 10. We make it up... But, we have to! http://www.flickr.com/photos/estherlairlandesa/46495660
  • 11. A tale of two disciplines
    Physics: Build instrument, Measure
    Political Science: Observe action, Infer
    http://en.wikipedia.org/wiki/File:Obama_Health_Care_Speech_to_Joint_Session_of_Congress.jpg
    http://www.flickr.com/photos/becca02/6727193557/
  • 12. One thing we have a lot of: text
    Politicians
    ‣ Speeches
    ‣ Constituent communication
    Parties
    ‣ Platform / manifestos
    ‣ Position statements
    Countries
    ‣ Diplomatic cables
    ‣ Military declarations
    Expert Coding!
  • 13. How expert coding (typically) works: Manifesto, Expert, Code Book, DATA!
    http://en.wikipedia.org/wiki/Official_Monster_Raving_Loony_Party
    Manifesto:
    1. Health & Safety: We propose to ban Self Responsibilty on the grounds that it may be dangerous to your health.
    2. M.P's Expenses: We propose that instead of a second home allowance M.P's will have a caravan which will be parked outside the Houses of Parliament. This will make it easier as flipping a caravan is easier than flipping homes
    3. Eurofit: The European Constitution which will be sorted out by going for a long Walk. "As everyone knows that walking is good for the constitution"
    DATA!
    Party                 Year  Score
    Monster Raving Loony  2010  -2
  • 14. What's wrong with experts? They're slow. They're biased. They're expensive. They're wrong.
  • 15. Can we use non-experts to code political manifestos? How can we measure the quality/validity of non-expert codings? Use Mechanical Turk to code many manifesto fragments.
  • 16. Experimental approach
    Baseline data: expert codings
    ‣ Texts: 18 "big 3" British party manifestos, 1987-2010
    ‣ Experts: 5 advanced poli. sci. graduate students + 2 tenured faculty
    ‣ Coding: deliberately simple schema
    Experimental design: MT codings
    ‣ Hypothesis: Stronger filter on Turkers leads to better coding
    ‣ Filter: Use MT qualification test as gatekeeper
    Three experiments
    ‣ No Qualification: anyone in
    ‣ Low-Threshold: 4/6 correct
    ‣ High-Threshold: 5/6 correct
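The gatekeeping rule on slide 16 amounts to a score threshold on a six-item qualification test. The following is a minimal sketch of that filter, not the actual Mechanical Turk qualification mechanism. The answer key, the worker's answers, and the helper name `passes_qualification` are all invented for illustration; only the 4/6 and 5/6 thresholds come from the slide.

```python
# Minimum number of correct answers (out of 6) required in each experiment.
THRESHOLDS = {"No Qualification": 0, "Low-Threshold": 4, "High-Threshold": 5}

def passes_qualification(answers, answer_key, min_correct):
    """Return True if the worker matched the key on at least min_correct items."""
    correct = sum(1 for given, expected in zip(answers, answer_key) if given == expected)
    return correct >= min_correct

# Hypothetical six-item test and one worker's responses (4 of 6 correct).
answer_key = ["economic", "social", "neither", "economic", "social", "neither"]
worker_answers = ["economic", "social", "economic", "economic", "social", "social"]

eligible = {
    name: passes_qualification(worker_answers, answer_key, threshold)
    for name, threshold in THRESHOLDS.items()
}
print(eligible)
```

A worker with four correct answers would be admitted to the first two experiments and excluded from the third.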
  • 17. How do we think about coding a manifesto fragment?
  • 18. Example text coding HIT from the experiment
  • 19. How do we implement this (aka, the glue)?
    Expert codings: random sample, as JSON
      [{ 'text_unit_id': ..., 'sentence_text': ..., .... }, ... ]
    EC2 / S3 / MT: dynamically generate HITs; push HITs + retrieve results
    MT codings: statistical analysis of results
    Scholarship, FTW!
    https://github.com/drewconway/mturk_coder_quality
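The first step on slide 19, drawing a random sample of expert-coded sentences and serializing it as JSON, could look roughly like the sketch below. Only the `text_unit_id` and `sentence_text` keys come from the slide; the elided fields are left out, the example sentences are invented, and `sample_sentences_as_json` is a hypothetical helper that is not taken from the linked repository.

```python
import json
import random

def sample_sentences_as_json(coded_sentences, k, seed=None):
    """Draw k sentences without replacement and return them as a JSON array string."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sample = rng.sample(coded_sentences, k)
    payload = [
        {"text_unit_id": s["text_unit_id"], "sentence_text": s["sentence_text"]}
        for s in sample
    ]
    return json.dumps(payload, indent=2)

# Invented stand-ins for expert-coded manifesto sentences.
coded_sentences = [
    {"text_unit_id": 1, "sentence_text": "We will cut income tax.", "expert_code": "economic"},
    {"text_unit_id": 2, "sentence_text": "We will reform the courts.", "expert_code": "social"},
    {"text_unit_id": 3, "sentence_text": "Thank you for reading.", "expert_code": "neither"},
]

hit_payload = sample_sentences_as_json(coded_sentences, k=2, seed=42)
print(hit_payload)
```

Dropping the expert code from the payload keeps the reference answer away from the people being asked to code the sentence.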
  • 20. What's good about MT non-experts? They're fast. They're biased? They're cheap. They're wrong?
    The last crowd-sourced coding job was for 600 sentences and got 4,300 sentences coded in about 20 hours (about 3.6 sentences per minute)
    • We pay about $0.02 / sentence
    • Typical manifesto (in British set) has 1,000 sentences
    • Whole manifesto coded for $20
    • By comparison, the CMP pays expert coders about €150 per manifesto, call it €0.15 or $0.20 per sentence: 10x more per sentence
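The arithmetic behind slide 20 is easy to verify. The sketch below uses only the figures quoted on the slide. Dividing the €150 expert fee by 1,000 sentences gives €0.15 per sentence, and the $0.20 figure is the slide's own rounded dollar conversion rather than a computed exchange rate.

```python
# Figures quoted on slide 20.
sentences_coded = 4300
hours_elapsed = 20
mt_usd_per_sentence = 0.02
sentences_per_manifesto = 1000
expert_eur_per_manifesto = 150
expert_usd_per_sentence = 0.20  # the slide's rounded conversion of EUR 0.15

throughput_per_minute = sentences_coded / (hours_elapsed * 60)
mt_usd_per_manifesto = mt_usd_per_sentence * sentences_per_manifesto
expert_eur_per_sentence = expert_eur_per_manifesto / sentences_per_manifesto
cost_ratio = expert_usd_per_sentence / mt_usd_per_sentence

print(f"{throughput_per_minute:.1f} sentences coded per minute")
print(f"${mt_usd_per_manifesto:.2f} per manifesto on Mechanical Turk")
print(f"EUR {expert_eur_per_sentence:.2f} per sentence for expert coders")
print(f"experts cost about {cost_ratio:.0f}x more per sentence")
```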
  • 21. Results of initial MT experiments
    Agreement by experiment (Kappa statistic)
    Experiment      Sentences  # MT Coders  % Agreement  k*    Std. Error  z
    No Qual.        1,315      89           0.65         0.47  0.13        22.6
    Low-Threshold   1,393      56           0.7          0.54  0.12        26.7
    High-Threshold  1,250      23           0.62         0.41  0.13        18.3
    * A k value between 0.4-0.6 is considered "moderate" agreement
    Agreement by expert-coding
    Experiment      Expert Coding  MT % Agreement
    No Qual.        Economic       0.77
                    Social         0.92
                    Neither        0.22
    Low-Threshold   Economic       0.87
                    Social         0.98
                    Neither        0.2
    High-Threshold  Economic       0.77
                    Social         0.91
                    Neither        0.09
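Slide 21 reports percent agreement and a kappa statistic but does not say which kappa variant was used; with many coders it was presumably a multi-rater version. As an illustration of the idea only, the sketch below computes raw agreement and two-rater Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The label sequences are invented and are not the study's data.

```python
from collections import Counter

def percent_agreement(coder_a, coder_b):
    """Share of items on which the two coders chose the same label."""
    return sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = len(coder_a)
    observed = percent_agreement(coder_a, coder_b)
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    labels = set(coder_a) | set(coder_b)
    # Chance agreement from each coder's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    return (observed - expected) / (1 - expected)  # undefined when expected == 1

# Invented codings: E = economic, S = social, N = neither.
expert = ["E", "E", "S", "S", "N", "N", "E", "S"]
turker = ["E", "E", "S", "S", "E", "S", "E", "S"]

agreement = percent_agreement(expert, turker)
kappa = cohens_kappa(expert, turker)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

In this toy data every disagreement falls on sentences the expert coded as "neither", which mirrors the pattern in the agreement-by-expert-coding table.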
  • 22. Separating Social and Economic Sentences
    Results (Kappa statistic)
    Experiment  Sentences  # MT Coders  % Agreement  k*    Std. Error  z
    Econ-only   942        15           0.62         0.23  0.1         4.28
    Soc-only    955        32           0.6          0.17  0.09        0.95
    * A k value between 0.4-0.6 is considered "moderate" agreement
    Experiment     Expert Coding  MT % Agreement
    Economic-only  Economic       0.92
                   Neither        0.28
    Social-only    Social         0.97
                   Neither        0.19
    Non-experts have a very hard time with a "null" coding!
  • 23. Joint work with... Michael Laver (NYU), Kenneth Benoit (LSE), Slava Mikhaylov (UCL)
    Paper: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2260437
    Presentation: http://bit.ly/nonexperts
  • 24. Project Florida
  • 25. Coder performance stability (panels: No Qualification, Low-threshold, High-threshold). Performance becomes very stable after approximately 20 HITs
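One way to look at the kind of stability slide 25 describes is to track a coder's cumulative agreement with the expert code as HITs accumulate. The slide does not say how performance was measured, so the following is a sketch under that assumption. `running_accuracy` is a hypothetical helper and the flags are invented.

```python
from itertools import accumulate

def running_accuracy(correct_flags):
    """Cumulative share of correct codings after each completed HIT."""
    totals = accumulate(int(flag) for flag in correct_flags)
    return [total / n for n, total in enumerate(totals, start=1)]

# Invented record of whether each successive HIT matched the expert code.
flags = [True, False, True, True]
curve = running_accuracy(flags)
print(curve)
```

Early values swing widely because each HIT carries a large share of the total; the curve flattens as the denominator grows.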
  • 26. Party shifts: economic
  • 27. Party shifts: social
