What Can Machine Learning & Crowdsourcing
Do for You?
Exploring New Tools for Scalable Data Processing
Matt Lease
School of Information @mattlease
University of Texas at Austin ml@utexas.edu
Slides:
slideshare.net/mattlease
What’s an Information School?
“The place where people & technology meet”
~ Wobbrock et al., 2009
“iSchools” now exist at 65 universities around the world
www.ischools.org
Roadmap
• Machine Learning (AI) lets us automate many
useful tasks, e.g., natural language processing (NLP)
• Crowdsourcing enables new levels of efficiency &
scalability in data collection & processing
• Human Computation lets us build next-generation
applications today, with capabilities beyond AI
Motivation: Applications
Automatic/Hybrid Fact Checking
• http://fcweb.pythonanywhere.com
– Nguyen et al., AAAI 2018
MemeBrowser
• http://odyssey.ischool.utexas.edu/mb/
– Ryu et al., HyperText 2012
Dating Biographies without Time Mentions
• Kumar et al., CIKM 2011
Plato (428-348 B.C.) Lincoln (1809-1865)
Transcription & Copy-Editing
• Spontaneous speech is often disfluent, with repetitions,
corrections, and vocalized space-fillers
• Lease, Charniak, and Johnson, 2005
• Zhou, Baskov, and Lease, 2013 (& Zhou’s Thesis)
S1: Uh first um i need to know uh how do you feel about uh about
sending uh an elderly uh family member to a nursing home
S2: Well of course it's you know it's one of the last few things in the
world you'd ever want to do you know unless it's just you know really
you know uh for their uh you know for their own good
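To make the copy-editing task concrete, here is a minimal rule-based sketch that strips vocalized fillers and collapses immediate word repetitions. It is only an illustration, not the method of the papers cited above; the filler list and the function name strip_fillers are assumptions chosen to fit the example utterances.

```python
import re

# Hypothetical filler list, chosen only to cover the example utterances.
FILLERS = ("you know", "uh", "um")


def strip_fillers(utterance: str) -> str:
    """Remove filler phrases, then collapse immediately repeated words."""
    pattern = r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b"
    text = re.sub(pattern, " ", utterance, flags=re.IGNORECASE)
    # Collapse immediate repetitions such as "about about" -> "about".
    text = re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    # Normalize the whitespace left behind by the removals.
    return " ".join(text.split())


s1 = ("Uh first um i need to know uh how do you feel about uh about "
      "sending uh an elderly uh family member to a nursing home")
print(strip_fillers(s1))
```

Fixed rules like these miss corrections and treat every "you know" as a filler, which is one reason the task is usually approached with learned models.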
Two Problems
Machine Learning - Supervised
Slide courtesy of Byron Wallace (Northeastern)
AI effectiveness is often limited by training data size
Problem: creating labeled data is expensive!
Banko and Brill (2001)
What do we do when state-of-the-art AI
still isn’t good enough?
Crowdsourcing
Crowdsourcing
• Jeff Howe. Wired, June 2006.
• Take a job traditionally
performed by a known agent
(often an employee)
• Outsource it to an undefined,
generally large group of
people via an open call
Volunteer Crowd Success Stories
Zooniverse
Amazon Mechanical Turk (MTurk)
• Marketplace for paid crowd work (“micro-tasks”)
– Created in 2005 (remains in “beta” today)
• On-demand, scalable, 24/7 global workforce
• API lets human labor be integrated into software
– “You’ve heard of software-as-a-service. Now this is human-as-a-service.”
Collecting Data from Crowds
2008: MTurk sparks “gold rush” for ML training data
• Information Retrieval: Alonso et al., SIGIR Forum
• Human-Computer Interaction: Kittur et al., CHI
• Computer Vision: Sorokin & Forsyth, CVPR
• NLP: Snow et al., EMNLP
– Annotating human language
– 22,000 labels for only US $26
– Crowd’s consensus labels can
replace traditional expert labels
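One simple way to form a consensus label is a per-item majority vote over the labels that different workers submit. The sketch below is only an illustration under assumed names (majority_vote, the item and worker IDs, and the toy labels are all hypothetical); it is not necessarily the aggregation used in the work cited above.

```python
from collections import Counter, defaultdict


def majority_vote(judgments):
    """Map each item to its most frequent label.

    judgments: iterable of (item_id, worker_id, label) tuples.
    """
    votes = defaultdict(Counter)
    for item_id, _worker_id, label in judgments:
        votes[item_id][label] += 1
    # On a tie, most_common returns the label that was counted first.
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


# Toy data: three workers label two items.
judgments = [
    ("doc1", "w1", "positive"),
    ("doc1", "w2", "positive"),
    ("doc1", "w3", "negative"),
    ("doc2", "w1", "negative"),
    ("doc2", "w2", "negative"),
    ("doc2", "w3", "positive"),
]
consensus = majority_vote(judgments)
print(consensus)
```

Majority voting treats every worker as equally reliable; weighting votes by estimated worker accuracy is a common refinement.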
Human Computation
ACM Queue, May 2006
“Software developers with innovative ideas for
businesses and technologies are constrained by the
limits of artificial intelligence… If software developers
could programmatically access and incorporate human
intelligence into their applications, a whole new class
of innovative businesses and applications would be
possible. This is the goal of Amazon Mechanical Turk…
people are freer to innovate because they can now
imbue software with real human intelligence.”
PlateMate: Counting Calories
Noronha et al., UIST’10
MonoTrans
Translation by Monolingual Speakers + AI
Bederson et al., 2010; Morita & Ishida, 2009
Zensors
Laput et al., CSCW 2015
But Who Protects the Moderators?
Dang et al., HCOMP’18 & CI’18
What about ethics?
• Silberman, Irani, and Ross (2010)
– “How should we… conceptualize the role of these people
who we ask to power our computing?”
• Irani and Silberman (2013)
– “…by hiding workers behind web forms and APIs…
employers see themselves as builders of innovative
technologies, rather than… unconcerned with working
conditions… redirecting focus to the innovation of human
computation as a field of technological achievement.”
• Fort, Adda, and Cohen (2011)
– “…opportunities for our community to deliberately
value ethics above cost savings.”
Summary
• Machine Learning (AI) lets us automate many
useful tasks, e.g., natural language processing (NLP)
• Crowdsourcing enables new levels of efficiency &
scalability in data collection & processing
• Human Computation lets us build next-generation
applications today, with capabilities beyond AI
The Future of Crowd Work
Paper @ CSCW 2013 by
Kittur, Nickerson, Bernstein, Gerber,
Shaw, Zimmerman, Lease, and Horton
Matt Lease - ml@utexas.edu - @mattlease
Thank You!
Slides: slideshare.net/mattlease
Lab: ir.ischool.utexas.edu
