A social scientist's perspectives on data science
Drew Conway
NYC Data Science Meetup
March 5, 2013
http://www.flickr.com/photos/uiowa/8047195100/
Hacking
Skills
Obtain Munge
I hold the following truths to be self-evident...
1. Data come from many sources
2. Data come in many form(at)s
[Figure: share of effort, 10% / 10% / 80%]
A .zip file of PDFs ≠ data
‣Data scientists must know where to get data and how to obtain it
‣Work with big text files
$ head publicvotes-20101018_votes.dump
‣Work with APIs
$ curl http://search.twitter.com/search.json?q=@drewconway > drewconway.json
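The same request can be sketched in Python. This is a minimal illustration, not code from the talk: `build_search_url` and `load_saved_results` are hypothetical helper names, and the endpoint is copied from the slide with no guarantee that it still responds.

```python
import json
from urllib.parse import urlencode


def build_search_url(base_url, query):
    """Return base_url with the query percent-encoded as the `q` parameter."""
    return f"{base_url}?{urlencode({'q': query})}"


def load_saved_results(path):
    """Parse a JSON response that was saved to disk (for example by curl)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Endpoint taken from the slide; it may no longer be live.
url = build_search_url("http://search.twitter.com/search.json", "@drewconway")
print(url)
```

Encoding the query explicitly avoids surprises with characters such as `@` that a shell or server may treat specially.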
Real data are messy
‣Even curated data: duplicates,
missing values, date formats
‣Combine data from multiple
sources/formats
‣Tools
• *NIX tools: sed, awk, grep
• Scripting languages: Perl, Python
and R
$ cat ufo_awesome.tsv | grep probe | wc -l
131
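For comparison, the `grep probe | wc -l` pipeline can be written in a scripting language. A minimal Python sketch follows; `count_matching_lines` is a hypothetical name, and the function does plain substring matching rather than regular expressions.

```python
def count_matching_lines(path, needle):
    """Count lines containing `needle`, like `grep needle path | wc -l`."""
    # errors="replace" keeps the scan going if the file has stray bytes,
    # which is common in messy real-world text dumps.
    with open(path, encoding="utf-8", errors="replace") as fh:
        return sum(1 for line in fh if needle in line)
```

Usage, assuming the file from the slide is in the working directory: `count_matching_lines("ufo_awesome.tsv", "probe")`.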
Hacking
Skills
While 80% of effort is spent here,
perhaps most straightforward to teach
Heavily tool-focused; borrow from CS/EE curricula
‣Comfort working at the command-line, with text editors
‣A language for every season!
Conveying findings in creative and compelling ways
Math &
Stats
Knowledge
If: Better data beats better math
Then: What methods should be
taught?
How do you find
structure in new data?
‣Scatter plots
‣Density plots
Data exploration that
scales
‣Reduce dimensionality
‣PCA, SVD, MDS
Methods must match
data
‣Text
‣Geospatial
‣Web-scale
What is the 'best' model?
‣Most predictive
‣Most parsimonious
Explore Model
Math &
Stats
Knowledge
Universities good at methods
training...
...but what methods fit into Data
Science?
Things data scientists like...
‣Illustrating the current state of the
world
‣Predicting future observations
‣Classifying/ranking observations
Things social scientists like...
‣Testable theoretical models
‣Natural experiments
‣Causality
1. When applicable
2. Right tool / right job
3. Open black boxes
4. Learn limitations
Substantive
Expertise
Data Science, as a discipline, is
fundamentally about human behavior
Inquire Interpret
[Figure: share of effort, 10% / 10% / 80%]
Focus on questions / not
tech
‣What new questions can be
asked from web-scale data?
‣Tools are a means to an end
Social science has
questions
‣Markets
‣Organization
How do we know when
the results we get make
sense, if ever?
http://www.flickr.com/photos/cawley/3242403224/
Case Study: Methods for Collecting Large-Scale Non-Expert Text Coding
Median Voter
Theorem
Theorem: In a majority-rule system, the preference of the median voter will prevail
http://thomasmoreinstitute.wordpress.com/2010/04/28/the-uk-election-and-the-curse-of-the-median-voter/
Assumption: The political/ideological preferences of voters can be projected onto a
single numeric dimension
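A toy illustration of the theorem under that assumption: each voter has an ideal point on one numeric dimension and backs whichever of two positions is closer. The voter values and the `pairwise_winner` helper are made up for this sketch.

```python
from statistics import median


def pairwise_winner(a, b, ideal_points):
    """Return the position preferred by a strict majority, or None on a tie.

    Each voter supports whichever position is closer to their ideal point.
    """
    votes_a = sum(abs(v - a) < abs(v - b) for v in ideal_points)
    votes_b = sum(abs(v - b) < abs(v - a) for v in ideal_points)
    if votes_a == votes_b:
        return None
    return a if votes_a > votes_b else b


# Made-up ideal points on a left-right scale (odd count, so the median is a voter).
voters = [-3.0, -1.0, 0.0, 2.0, 5.0]
median_position = median(voters)

# The median position wins every head-to-head contest against these alternatives.
for alternative in [-3.0, -1.0, 0.5, 2.0, 5.0]:
    assert pairwise_winner(median_position, alternative, voters) == median_position
```

The sketch only works because every voter can be placed on the scale, which is exactly the number the rest of the talk is trying to measure.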
Median Voter
Theorem
http://voteview.com/blog/?p=5
How do we calculate these numbers?
We make it
up...
http://www.flickr.com/photos/estherlairlandesa/46495660
But, we have
to!
http://en.wikipedia.org/wiki/File:Obama_Health_Care_Speech_to_Joint_Session_of_Congress.jpg
http://www.flickr.com/photos/becca02/6727193557/
A tale of two
disciplines
Physics: Build instrument → Measure
Political Science: Observe action → Infer
One thing we have a lot of:
text
Politicians
‣Speeches
‣Constituent communication
Parties
‣Platform / manifestos
‣Position statements
Countries
‣Diplomatic cables
‣Military declarations
Expert Coding!
How expert coding (typically)
works
http://en.wikipedia.org/wiki/Official_Monster_Raving_Loony_Party
Expert Code Book
Manifesto
1. Health & Safety: We propose to ban Self Responsibility on the grounds that it may be dangerous to your health.
2. M.P's Expenses: We propose that instead of a second home allowance M.P's will have a caravan which will be parked outside the Houses of Parliament. This will make it easier as flipping a caravan is easier than flipping homes
3. Eurofit: The European Constitution which will be sorted out by going for a long Walk. “As everyone knows that walking is good for the constitution”
Party                 Year  Score
Monster Raving Loony  2010  -2
DATA!
What's wrong with experts?
They're slow
They're biased
They're expensive
They're wrong
Can we use non-
experts to code
political
manifestos?
How can we
measure the
quality/validity of
non-expert
codings?
Use Mechanical
Turk to code
many manifesto
fragments.
Experimental
approach
Expert
codings
Texts: 18 “big 3” British party
manifestos 1987-2010
Experts: 5 advanced poli. sci.
graduate students + 2
tenured faculty
Coding: deliberately simple
schema
Baseline data
Three experiments
No Qualification: Anyone in
Low-Threshold: 4/6 Correct
High-Threshold: 5/6 Correct
MT codings
Experimental design
Hypothesis: Stronger filter on
Turkers leads to better coding
Filter: Use MT qualification
test as gatekeeper
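The gatekeeping rule can be sketched as a simple score check. The thresholds (anyone, 4 of 6, 5 of 6) come from the slides; the function names, the answer key, and the category labels used in the usage note are hypothetical, and this is not the Mechanical Turk API itself.

```python
# Minimum number of correct qualification answers (out of 6) per experiment.
THRESHOLDS = {
    "No Qualification": 0,  # anyone may take part
    "Low-Threshold": 4,
    "High-Threshold": 5,
}


def count_correct(answers, answer_key):
    """Number of qualification answers that match the key."""
    return sum(a == k for a, k in zip(answers, answer_key))


def eligible_experiments(answers, answer_key):
    """Names of the experiments a worker with these answers may join."""
    score = count_correct(answers, answer_key)
    return [name for name, needed in THRESHOLDS.items() if score >= needed]
```

For example, a worker who gets 4 of the 6 test items right would qualify for the No Qualification and Low-Threshold experiments but not the High-Threshold one.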
How do we think about coding a manifesto
fragment?
Example text coding HIT from the experiment
How do we implement this (aka, the glue)?
Expert
codings
[{ 'text_unit_id': ...,
   'sentence_text': ...,
   ....
 },
 ...
]
Random sample, as
JSON
EC2
S3
MT
Dynamically generate
HITs
MT
codings
Push HITs + retrieve
results
Statistical
analysis
of results
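The first step of this pipeline, drawing a random sample of expert-coded sentences and serializing it as JSON, might look like the sketch below. Only the field names `text_unit_id` and `sentence_text` come from the slide; the function name, the seed parameter, and the example records are assumptions, and the EC2/S3/MT steps are not shown.

```python
import json
import random


def sample_as_json(text_units, k, seed=0):
    """Draw k text units without replacement and return them as a JSON string."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return json.dumps(rng.sample(text_units, k), indent=2)


# Hypothetical records using the field names shown on the slide.
units = [
    {"text_unit_id": i, "sentence_text": f"Sentence number {i}."}
    for i in range(10)
]
payload = sample_as_json(units, k=3)
```

Seeding the generator is a design choice that lets the same sample be regenerated when HITs need to be re-posted.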
Scholarship,
FTW!
https://github.com/drewconway/mturk_coder_quality
What's good about MT non-experts?
They're fast
They're biased?
They're cheap
They're wrong?
The last crowd-sourced coding job was for 600 sentences and got 4,300 sentences coded in about 20 hours (about 3.6 sentences per minute)
• We pay about $0.02 / sentence
• Typical manifesto (in British set) has 1,000 sentences
• Whole manifesto coded for $20
• By comparison, the CMP pays expert coders about €150 per manifesto, call it €0.15 or $0.20 / sentence: 10x more per sentence
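The arithmetic behind these figures, as a short worked sketch. All inputs are the numbers quoted on the slides; the dollar figure for expert coding is the slide's own rounded conversion, not a computed exchange rate.

```python
# Mechanical Turk side (figures from the slides).
mt_cost_per_sentence_usd = 0.02
sentences_per_manifesto = 1000
mt_cost_per_manifesto_usd = mt_cost_per_sentence_usd * sentences_per_manifesto

# Expert side: EUR 150 per manifesto spread over the same 1,000 sentences.
expert_cost_per_manifesto_eur = 150.0
expert_cost_per_sentence_eur = expert_cost_per_manifesto_eur / sentences_per_manifesto
expert_cost_per_sentence_usd = 0.20  # the slide's rounded conversion

cost_ratio = expert_cost_per_sentence_usd / mt_cost_per_sentence_usd

# Throughput: 4,300 sentences coded in about 20 hours.
sentences_per_minute = 4300 / (20 * 60)

print(f"MT cost per manifesto: ${mt_cost_per_manifesto_usd:.2f}")
print(f"Expert cost per sentence: EUR {expert_cost_per_sentence_eur:.2f}")
print(f"Expert / MT cost ratio: {cost_ratio:.0f}x")
print(f"Throughput: {sentences_per_minute:.1f} sentences per minute")
```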
                Results                               Kappa Statistic
Experiment      Sentences  # MT Coders  % Agreement  k*    Std. Error  z
No Qual.        1,315      89           0.65         0.47  0.13        22.6
Low-Threshold   1,393      56           0.7          0.54  0.12        26.7
High-Threshold  1,250      23           0.62         0.41  0.13        18.3
* A k value between 0.4-0.6 is considered “moderate” agreement
Agreement by experiment
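To show what a kappa statistic corrects for, here is a minimal two-rater (Cohen's) kappa in Python. This is an illustration only: the slides do not say which kappa variant produced the table, and with many coders a multi-rater form is likely, so the function below is an assumption and will not reproduce the reported values.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled at random with their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Raw percent agreement can look high simply because one category dominates; kappa subtracts the agreement expected by chance, which is why the table reports both.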
Experiment      Expert Coding  MT % Agreement
No Qual.        Economic       0.77
                Social         0.92
                Neither        0.22
Low-Threshold   Economic       0.87
                Social         0.98
                Neither        0.2
High-Threshold  Economic       0.77
                Social         0.91
                Neither        0.09
Agreement by expert-coding
Results of initial MT experiments
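A breakdown like "agreement by expert coding" can be computed by grouping MT codings on the expert's category and taking the share that match. The sketch below assumes the data arrive as (expert code, MT code) pairs; the function name and the example pairs are made up.

```python
from collections import defaultdict


def agreement_by_expert_code(pairs):
    """Map each expert category to the share of MT codings that match it.

    `pairs` is an iterable of (expert_code, mt_code) tuples.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for expert_code, mt_code in pairs:
        totals[expert_code] += 1
        matches[expert_code] += int(expert_code == mt_code)
    return {code: matches[code] / totals[code] for code in totals}


# Made-up codings for illustration only.
example_pairs = [
    ("Economic", "Economic"),
    ("Economic", "Social"),
    ("Social", "Social"),
    ("Neither", "Economic"),
    ("Neither", "Neither"),
    ("Neither", "Social"),
    ("Neither", "Social"),
]
shares = agreement_by_expert_code(example_pairs)
```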
            Results                               Kappa Statistic
Experiment  Sentences  # MT Coders  % Agreement  k*    Std. Error  z
Econ-only   942        15           0.62         0.23  0.1         4.28
Soc-only    955        32           0.6          0.17  0.09        0.95
* A k value between 0.4-0.6 is considered “moderate” agreement
Experiment     Expert Coding  MT % Agreement
Economic-only  Economic       0.92
               Neither        0.28
Social-only    Social         0.97
               Neither        0.19
Non-experts have
a very hard time
with a “null” coding!
Separating Social and Economic Sentences
Joint work
with...
Michael Laver
NYU
Kenneth Benoit
LSE
Slava Mikhaylov
UCL
Paper: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2260437
Presentation: http://bit.ly/nonexperts
Project
Florida
No Qualification
Coder performance
stability
Low-threshold
High-threshold
Performance
becomes very stable
after approximately
20 HITs
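One way to look at coder stability is a running agreement rate over a coder's successive HITs. The slides do not state how stability was measured, so the cumulative share below is an assumption, and `running_agreement` is a hypothetical name.

```python
from itertools import accumulate


def running_agreement(match_flags):
    """Cumulative share of a coder's HITs that matched the expert coding.

    `match_flags` is a sequence of booleans in the order the HITs were done.
    """
    running_totals = accumulate(int(flag) for flag in match_flags)
    return [total / n for n, total in enumerate(running_totals, start=1)]
```

Plotting this series per coder would show early swings that flatten as more HITs accumulate, since each new HIT moves the cumulative rate less.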
Party shifts: economic
Party shifts: social

Drew Conway: A Social Scientist's Perspective on Data Science

  • 1. A social scientist‟s perspectives on data science Drew Conway NYC Data Science Meetup March 5, 2013http://www.flickr.com/photos/uiowa/804719510 0/
  • 2.
  • 3. Hacking Skills Obtain Munge I hold the following truths to be self- evident... 1. Data come from many sources 2. Data come in many form(at)s 10 % 10 % 80 % A .zip file of PDFs ≠ data ‣Data scientist must know where to get data and how to obtain it ‣Work with big text files $ head publicvotes-20101018_votes.dump ‣Work with APIs $ curl http://search.twitter.com/search.json?q=@dr ewconway > drewconway.json Real data are messy ‣Even curated data: duplicates, missing values, date formats ‣Combine data from multiple sources/formats ‣Tools • *NIX tools: sed, awk, grep • Scripting languages: Perl, Python and R $ cat ufo_awesome.tsv | grep probe | wc -l 131
  • 4. Hacking Skills While 80% of effort is spent here, perhaps most straightforward to teach Heavily tool focused, borrow from CS/EE curriculums ‣Comfort working at the command-line, with text editors ‣A language for every season! Conveying findings in creative and compelling ways
  • 5. Math & Stats Knowledge If: Better data beats better math Then: What methods should be taught? How do you find structure in new data? ‣Scatter plots ‣Density plots Data exploration that scales ‣Reduce dimensionality ‣PCA, SVD, MDS Methods must match data ‣Text ‣Geospatial ‣Web-scale What is the „best‟ model? ‣Most predictive ‣Most parsimonious Explore Model
  • 6. Math & Stats Knowledge Universities good at methods training... ...but what methods fit into Data Science? Things data scientists like... ‣Illustrating the current state of the world ‣Predicting future observations ‣Classifying/ranking observations Things social scientists like... ‣Testable theoretical models ‣Natural experiments ‣Causality 1. When applicable 2. Right tool / right job 3. Open black boxes 4. Learn limitations
  • 7. Substantive Expertise Data Science, as a discipline, is fundamentally about human behavior. Inquire, Interpret [chart: 10% / 10% / 80%] Focus on questions / not tech ‣What new questions can be asked from web-scale data? ‣Tools are a means to an end Social science has questions ‣Markets ‣Organization How do we know when the results we get make sense, if ever?
  • 8. http://www.flickr.com/photos/cawley/3242403224/ Case Study: Methods for Collecting Large-Scale Non-Expert Text Coding
  • 9. Median Voter Theorem Theorem: In a majority rules system, the preference of the median voter will succeed http://thomasmoreinstitute.wordpress.com/2010/04/28/the-uk-election-and-the-curse-of-the-median-voter/ Assumption: The political/ideological preferences of voters can be projected onto a single numeric dimension
  • 13. One thing we have a lot of: text Politicians ‣Speeches ‣Constituent communication Parties ‣Platform / manifestos ‣Position statements Countries ‣Diplomatic cables ‣Military declarations Expert Coding!
  • 14. How expert coding (typically) works http://en.wikipedia.org/wiki/Official_Monster_Raving_Loony_Party Expert, Code Book 1. Health & Safety: We propose to ban Self Responsibilty on the grounds that it may be dangerous to your health. 2. M.P's Expenses: We propose that instead of a second home allowance M.P's will have a caravan which will be parked outside the Houses of Parliament. This will make it easier as flipping a caravan is easier than flipping homes 3. Eurofit: The European Constitution which will be sorted out by going for a long Walk. “As everyone knows that walking is good for the constitution” Manifesto. Party / Year / Score: Monster Raving Loony / 2010 / -2. DATA!
  • 16. Can we use non-experts to code political manifestos? How can we measure the quality/validity of non-expert codings? Use Mechanical Turk to code many manifesto fragments.
  • 17. Experimental approach. Expert codings (baseline data): Texts: 18 “big 3” British party manifestos, 1987-2010. Experts: 5 advanced poli. sci. graduate students + 2 tenured faculty. Coding: deliberately simple schema. MT codings (experimental design): three experiments: No Qualification (anyone in), Low-Threshold (4/6 correct), High-Threshold (5/6 correct). Hypothesis: Stronger filter on Turkers leads to better coding. Filter: Use MT qualification test as gatekeeper
  • 18. How do we think about coding a manifesto fragment?
  • 19. Example text coding HIT from the experiment
  • 20. How do we implement this (aka, the glue)? Expert codings [{ 'text_unit_id': ..., 'sentence_text': ..., .... }, ... ] Random sample, as JSON EC2 S3 MT Dynamically generate HITs MT codings Push HITs + retrieve results Statistical analysis of results Scholarship, FTW! https://github.com/drewconway/mturk_coder_quality
  • 21. What's good about MT non-experts? They're fast. They're biased? They're cheap. They're wrong? The last crowd-sourced coding job, for 600 sentences, got 4,300 sentences coded in about 20 hours (about 3.6 sentences per minute) • We pay about $0.02 / sentence • Typical manifesto (in British set) has 1,000 sentences • Whole manifesto coded for $20 • By comparison, the CMP pays expert coders about €150 per manifesto, call it €0.15 or $0.20 / sentence: 10x more per sentence
  • 22. Results of initial MT experiments
      Agreement by experiment (Kappa Statistic):
        Experiment     | Sentences | # MT Coders | % Agreement | k*   | Std. Error | z
        No Qual.       | 1,315     | 89          | 0.65        | 0.47 | 0.13       | 22.6
        Low-Threshold  | 1,393     | 56          | 0.7         | 0.54 | 0.12       | 26.7
        High-Threshold | 1,250     | 23          | 0.62        | 0.41 | 0.13       | 18.3
      * A k value between 0.4-0.6 is considered “moderate” agreement
      Agreement by expert-coding:
        Experiment     | Expert Coding | MT % Agreement
        No Qual.       | Economic      | 0.77
        No Qual.       | Social        | 0.92
        No Qual.       | Neither       | 0.22
        Low-Threshold  | Economic      | 0.87
        Low-Threshold  | Social        | 0.98
        Low-Threshold  | Neither       | 0.2
        High-Threshold | Economic      | 0.77
        High-Threshold | Social        | 0.91
        High-Threshold | Neither       | 0.09
  • 23. Results: Separating Social and Economic Sentences
      Kappa Statistic:
        Experiment | Sentences | # MT Coders | % Agreement | k*   | Std. Error | z
        Econ-only  | 942       | 15          | 0.62        | 0.23 | 0.1        | 4.28
        Soc-only   | 955       | 32          | 0.6         | 0.17 | 0.09       | 0.95
      * A k value between 0.4-0.6 is considered “moderate” agreement
      Agreement by expert-coding:
        Experiment    | Expert Coding | MT % Agreement
        Economic-only | Economic      | 0.92
        Economic-only | Neither       | 0.28
        Social-only   | Social        | 0.97
        Social-only   | Neither       | 0.19
      Non-experts have a very hard time with a “null” coding!
  • 24. Joint work with... Michael Laver (NYU), Kenneth Benoit (LSE), Slava Mikhaylov (UCL). Paper: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2260437 Presentation: http://bit.ly/nonexperts
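
Slide 3 counts matching rows with the shell pipeline `cat ufo_awesome.tsv | grep probe | wc -l`. For readers who prefer one of the scripting languages the slide lists, the same count can be sketched in Python. The function names below are illustrative and not from the talk; the match is a plain case-sensitive substring test, which mirrors `grep probe` only for a literal pattern like this one.

```python
from pathlib import Path


def count_matching(lines, needle):
    """Count the lines that contain `needle` as a plain substring."""
    return sum(1 for line in lines if needle in line)


def count_matching_in_file(path, needle):
    """Stream a text file line by line so large files never sit in memory."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_matching(handle, needle)


if __name__ == "__main__":
    # Assumes a local copy of the file named on the slide; skipped if absent.
    tsv = Path("ufo_awesome.tsv")
    if tsv.exists():
        print(count_matching_in_file(tsv, "probe"))
```

Streaming the file keeps memory use flat, which matters for the "big text files" the slide mentions.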
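
Slide 20 describes drawing a random sample of expert-coded sentences and serializing it as JSON before HITs are generated. A minimal sketch of that one step follows, assuming records shaped like the slide's `text_unit_id` / `sentence_text` example. The function name, the seed parameter, and the placeholder sentences are hypothetical; the authors' actual code lives in the repository linked on the slide and may differ.

```python
import json
import random


def sample_units_as_json(units, k, seed=None):
    """Draw k distinct records at random and return them as a JSON string."""
    rng = random.Random(seed)      # a private RNG makes the draw reproducible
    chosen = rng.sample(units, k)  # sampling without replacement
    return json.dumps(chosen, indent=2)


# Placeholder records; only the two key names come from the slide.
units = [
    {"text_unit_id": i, "sentence_text": f"Placeholder sentence {i}."}
    for i in range(1, 11)
]
payload = sample_units_as_json(units, k=3, seed=42)
print(payload)
```

Fixing the seed makes the sample repeatable, which helps when the same sentences need to be sent to more than one group of coders.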
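
The cost and speed figures on slide 21 can be checked with a few lines of arithmetic. The sketch below uses only numbers stated on the slide; the variable and function names are illustrative, and the dollar figure for expert coding is the slide's own rounding ("call it") rather than a computed exchange rate.

```python
def crowd_cost(sentences, price_per_sentence):
    """Total cost of coding `sentences` at a flat per-sentence price."""
    return sentences * price_per_sentence


def throughput_per_minute(sentences_coded, hours):
    """Average number of sentences coded per minute over `hours`."""
    return sentences_coded / (hours * 60)


MT_PRICE_USD = 0.02             # crowd price per sentence (slide 21)
MANIFESTO_SENTENCES = 1_000     # typical British manifesto (slide 21)
EXPERT_FEE_EUR = 150            # CMP fee per manifesto (slide 21)
EXPERT_USD_PER_SENTENCE = 0.20  # the slide's rounded dollar equivalent

manifesto_cost = crowd_cost(MANIFESTO_SENTENCES, MT_PRICE_USD)
expert_eur_per_sentence = EXPERT_FEE_EUR / MANIFESTO_SENTENCES
ratio = EXPERT_USD_PER_SENTENCE / MT_PRICE_USD
rate = throughput_per_minute(4_300, 20)

print(f"Crowd cost per manifesto: ${manifesto_cost:.2f}")
print(f"Expert cost per sentence: EUR {expert_eur_per_sentence:.2f}")
print(f"Expert / crowd price ratio: {ratio:.0f}x")
print(f"Throughput: {rate:.1f} sentences per minute")
```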
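
Slides 22 and 23 report a kappa statistic, which corrects raw agreement for the agreement expected by chance. The slides do not say which variant was used, and with dozens of coders it was presumably a multi-rater form. As an illustration of the idea only, here is a sketch of the simplest case, Cohen's kappa for two coders, using the slide's three categories. The function name and the toy labels are assumptions, not data from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels for six sentences, invented for illustration.
expert = ["Economic", "Economic", "Social", "Social", "Neither", "Neither"]
turker = ["Economic", "Economic", "Social", "Social", "Neither", "Economic"]
print(cohens_kappa(expert, turker))
```

Raw agreement here is 5 of 6, but kappa is lower because some of that agreement would occur by chance; this is why the slides report both "% Agreement" and k.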