Ushine Plug-In
Using machine learning and natural language processing
to improve the human review process of crisis reports
Topics
● Intro to project
● Project contents
● Data sets
● Evaluation
● Data ethics
● Future work
How to Follow Up...
● GitHub repository (open-source project code + wiki documentation):
http://github.com/dssg/ushine-learning
Collaborators welcome! (Both within and outside of Ushahidi.)
● DSSG team e-mail: dssg-ushahidi@googlegroups.com
● Main Ushahidi contacts: Emmanuel Kala + Heather Leson
● Data Science for Social Good fellowship: http://dssg.io
Thanks!
Thanks to our partners at Ushahidi and the many
individuals and organizations who generously gave us
their advice and feedback...
Alphabetically:
Chris Albon, Rob Baker, George Chamales, Jennifer Chan,
Crisis Mappers, Schuyler Erle, Sara-Jayne Farmer, Rayid
Ghani, Eric Goodwin, Catherine Graham, Neil Horning,
Humanity Road, Anahi Ayala Iacucci, Rob Mitchum,
Emmanuel Kala, David Kobia, Heather Leson, Rob Munro,
Chris Thompson, Syria Tracker, Juan-Pablo Velez.
Project Contents [August 20]
1) Detect language of report text
2) Identify private information in report text
3) Identify locations in report text
4) Identify URLs in report text
5) Suggest categories of report
6) Detect (near-)duplicate reports
Ushahidi Process
DSSG helps here
Report Review w/o Ushine
Report Review with Ushine
(Exact user interface still
under development)
Scope
● Ushine DOES:
○ Improve the human review process of reports
● Ushine DOESN’T:
○ Verify reports
○ “Really” understand the report
○ Achieve 100% accuracy in anything
1) Detect report language
Useful for:
● In multilingual situations, automatically routing reports to
speakers of the relevant language
● Flagging reports that need / don’t need translation
○ (if the deployment specifies a set of acceptable
languages)
Caveats:
● Not 100% accurate
● Performs less well on “imperfect” writing
○ e.g. SMS-speak, mixed languages
1) Detect report language
Technical details:
● Tested 4 plug-in language detectors on 850
reports for agreement with human language
identification (results chart not reproduced in this text).
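A minimal sketch of how agreement with human labels could be measured. The stopword-based detector and the sample reports are hypothetical stand-ins for illustration. They are not the four detectors or the 850 reports from the test.

```python
from typing import Callable, List, Tuple


def agreement_rate(detector: Callable[[str], str],
                   labeled_reports: List[Tuple[str, str]]) -> float:
    """Fraction of reports where the detector matches the human label."""
    if not labeled_reports:
        return 0.0
    hits = sum(1 for text, human_lang in labeled_reports
               if detector(text) == human_lang)
    return hits / len(labeled_reports)


# Hypothetical toy detector: guesses the language by counting stopwords.
STOPWORDS = {
    "en": {"the", "is", "and", "at", "of"},
    "es": {"el", "la", "es", "y", "en"},
}


def stopword_detector(text: str) -> str:
    tokens = text.lower().split()
    scores = {lang: sum(tok in words for tok in tokens)
              for lang, words in STOPWORDS.items()}
    # Ties (including "no stopwords found") fall back to the first language.
    return max(scores, key=scores.get)


# Made-up reports with human-assigned language labels.
sample = [
    ("The polling station is closed", "en"),
    ("La mesa de votación es cerrada y vacía", "es"),
    ("q tal, tdo bn x aki", "es"),  # SMS-speak: no stopwords to match
]

rate = agreement_rate(stopword_detector, sample)
```

The third report shows the caveat about “imperfect” writing: the toy detector finds no known words and falls back to a wrong guess.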
2) Identify Private Info
Identify people’s names, organizations’ names, locations, e-mail
addresses, URLs, phone/ID numbers, Twitter usernames
Useful for:
● Flagging private info in report that reviewer might want to remove, to
protect sensitive people/situations
● As an extra check before exporting reports to others.
Technical details:
● Use NLTK’s pre-trained Named Entity Recognizer (NER) to identify people’s
names, organizations’ names, and locations.
● Use regular expressions to identify e-mail addresses, URLs, phone/ID
numbers, and Twitter usernames.
● Better to be overly careful: false negatives are more dangerous than false
positives
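A rough sketch of the regular-expression side of this step. The patterns and the function name flag_private_info are assumptions for illustration, not the project’s actual expressions. They are deliberately loose, in keeping with the preference for false positives over false negatives.

```python
import re
from typing import Dict, List

# Illustrative patterns only (assumed, not taken from the project).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # "@name" not preceded by a word character, so e-mail domains are skipped.
    "twitter": re.compile(r"(?<![\w@])@\w{1,15}"),
    # Any run of 7+ digits, allowing spaces, dots, dashes and parentheses.
    "phone_or_id": re.compile(r"\+?\d[\d\s().-]{5,}\d"),
}


def flag_private_info(text: str) -> Dict[str, List[str]]:
    """Return every substring that matches one of the patterns, by type."""
    return {label: [m.group(0) for m in pattern.finditer(text)]
            for label, pattern in PATTERNS.items()}


flags = flag_private_info(
    "Call +254 700 123456 or mail jane.doe@example.org, via @janedoe"
)
```

The output only marks candidate strings for the reviewer; deciding whether to remove them stays with a human.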
2) Identify Private Info
Caveats:
● Not 100% accurate.
○ Use to support, not replace, humans. (Though humans are not 100%
accurate by themselves either!)
○ Always be aware of the responsibility to protect sensitive information.
○ Non-sensitive deployments (non-wars/disasters) may still have
sensitive information.
○ (More on data ethics at the end)
● Definition of “private” can be very subjective and nuanced.
● Does not re-word sentence; only identifies problematic words for editing.
● Currently only useful for English text (though extendable to other
languages given a suitable NER)
3) Identify Locations
Useful for:
● Identifying text within report that may refer to a location
Caveats:
● Imperfect accuracy, especially on imperfect English
● Currently only useful for English text (though extendable to other
languages given a suitable NER)
● Does not geo-locate the location for mapping; it only makes it easier to
decide which text to geo-locate.
Technical details:
● Use NLTK’s pre-trained Named Entity Recognizer (NER)
4) Identify URLs (links)
Useful for:
● Identifying text within report that refers to a URL (photo/video/article/etc.)
Technical details:
● Use regular expressions
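One possible shape for such an expression, as a sketch. The pattern and the helper name find_urls are assumptions; the project’s actual regular expression may differ.

```python
import re
from typing import List

# Assumed pattern: anything starting with http://, https:// or www.
# and running up to the next whitespace, quote or angle bracket.
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)


def find_urls(text: str) -> List[str]:
    """Return URL-like substrings, without trailing sentence punctuation."""
    return [m.group(0).rstrip(".,;:!?)") for m in URL_RE.finditer(text)]


urls = find_urls("Video at http://example.com/v?id=1, more on www.example.org.")
```

Stripping trailing punctuation is a judgment call: it avoids broken links at the end of a sentence, at the cost of clipping the rare URL that really ends in one of those characters.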
A Detour on Data Sets
● So far none of the tasks have required
“training data” from past Ushahidi deployments
○ (NLTK’s named entity recognizer uses its own
training data, not from Ushahidi)
● Next task, category rankings, DOES require
Ushahidi training data
● Data cleanliness: Often lacking
○ We wrote scripts to automate cleaning
○ Useful for other Ushahidi work too!
Data Sets - Examples
● Additional datasets were unusable for various reasons
(e.g. overly formulaic language)
● Many additional CrowdMap datasets (not used by Ushine
because of time constraints)
● Sensitive data was removed before being shared with us
Data Set Differences
● Afghanistan election (peaceful)
● Kenyan election (less peaceful)
5) Category Suggestions
For each category (e.g. “Bribery” or “Violence”),
give a 0-100% rating of how likely the report is to belong to it
Useful for:
● Increasing speed and accuracy of the category assignment process
Caveats:
● Not 100% accurate
● “Cold start” problem
5) Category Suggestions
● Global classifier:
○ Classifier trained on previous deployments (e.g.
previous Indian and Venezuelan election reports), then
used for a new deployment (e.g. a new Kenyan
election)
● Local classifier:
○ Train a classifier on-the-fly on reports annotated in a
new deployment. Cold-start problem.
● Adaptive classifier:
○ Retrain global classifier on the current deployment
5) Category Suggestions
● Learning Curve Plot from Mexico election
(Higher F1 score means better performance)
5) Category Suggestions
Technical details:
● Binary classifier for each category.
● Local classifier: Bag-of-words unigram
frequency features (with frequency cut-off = 5)
○ In general, bigrams & TF-IDF normalization did not
help.
● Global classifier for election deployment
○ Trained using 7 election deployments
○ For each category label, cross-deployment validation
was used to select the feature set (unigram, bigram,
TF-IDF) and the C parameter.
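A small sketch of bag-of-words unigram frequency features with a frequency cut-off. It assumes the cut-off means “keep words seen at least 5 times in the training reports”; the slide does not say whether the count is per word or per document, so treat that reading and all names here as illustrative.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(reports: List[str], min_count: int = 5) -> Dict[str, int]:
    """Map each word seen at least min_count times to a feature index."""
    counts = Counter(tok for report in reports for tok in tokenize(report))
    kept = sorted(word for word, n in counts.items() if n >= min_count)
    return {word: index for index, word in enumerate(kept)}


def vectorize(report: str, vocab: Dict[str, int]) -> List[int]:
    """Unigram frequency vector; words outside the vocabulary are ignored."""
    vector = [0] * len(vocab)
    for tok in tokenize(report):
        index = vocab.get(tok)
        if index is not None:
            vector[index] += 1
    return vector


# Made-up training reports.
reports = ["bribery at polling station"] * 5 + ["violence reported"] * 2
vocab = build_vocabulary(reports)
features = vectorize("Bribery bribery at the station", vocab)
```

A vector like this would then be fed to one binary classifier per category.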
5) Category Suggestions
Technical details:
● Adaptive Classifier
○ Interpolates between local classifier f and global
classifier g using
(1-α)*g(x) + α*f(x),
where x is a report.
○ α is tuned on-the-fly to maximize F1 score via
grid search.
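The interpolation and grid search can be sketched as follows. The 0.5 decision threshold, the 11-point grid, and the toy scores are assumptions for illustration; the slides do not specify them.

```python
from typing import List, Tuple


def f1_score(y_true: List[bool], y_pred: List[bool]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when undefined."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def interpolate(g_score: float, f_score: float, alpha: float) -> float:
    """(1 - alpha) * g(x) + alpha * f(x) for one report x."""
    return (1 - alpha) * g_score + alpha * f_score


def tune_alpha(global_scores: List[float], local_scores: List[float],
               labels: List[bool], steps: int = 10,
               threshold: float = 0.5) -> Tuple[float, float]:
    """Grid-search alpha in [0, 1], keeping the first value with the best F1."""
    best_alpha, best_f1 = 0.0, -1.0
    for i in range(steps + 1):
        alpha = i / steps
        preds = [interpolate(g, f, alpha) >= threshold
                 for g, f in zip(global_scores, local_scores)]
        score = f1_score(labels, preds)
        if score > best_f1:
            best_alpha, best_f1 = alpha, score
    return best_alpha, best_f1


# Toy case: the global classifier g is wrong for this deployment,
# the local classifier f is right, so a large alpha should win.
g_scores = [0.9, 0.8, 0.2, 0.1]
f_scores = [0.1, 0.2, 0.9, 0.8]
labels = [False, False, True, True]
best_alpha, best_f1 = tune_alpha(g_scores, f_scores, labels)
```

With α = 0 the output is purely the global classifier, which suits a cold start. As annotated reports accumulate, the search is free to shift weight toward the local classifier.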
6) Detect (near-) duplicates
Has the report already been submitted, or retweeted?
Useful for:
● Identifying (near-)duplicate reports to prevent
copies and redundant work
Caveats:
● Not 100% accurate
● Not looking at “similar/related content”, but rather (near-)duplicates
Technical details:
● SimHash efficiently hashes each report text to a 64-bit representation.
● (Near-)duplicates have a short Hamming distance between their hashes
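A compact sketch of the SimHash idea. It assumes word tokens with equal weights and uses MD5 as the per-token hash; those choices, and any distance threshold for calling two reports duplicates, are illustrative and may not match the project’s implementation.

```python
import hashlib
import re

BITS = 64


def simhash(text: str) -> int:
    """64-bit SimHash fingerprint over lowercase word tokens."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where the two fingerprints differ."""
    return bin(a ^ b).count("1")


original = simhash("Fire at the market!")
retweet = simhash("fire at the MARKET")
distance = hamming_distance(original, retweet)
```

Reports whose fingerprints differ in only a few bits would be shown to the reviewer as possible duplicates; the cut-off is a tuning choice.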
Evaluation
Currently analyzing the results of an evaluation experiment
that simulates an election crisis.
Assess the impact on users’ speed and accuracy of
● identifying private info, location, URLs
● choosing categories
3 comparison groups:
1) “Regular” process w/o computer suggestions
2) Our computer’s suggestions
3) “Perfect” suggestions
Evaluation
Ushahidi Plugin integration
● Configurable URL for the Ushine web
service
● Extract location names and other entities
from report text. These are displayed as
report metadata
● Detect and display the report language
● Suggest reports that are similar to the
current one
Data Ethics
This isn’t today’s focus, but very important as part of an on-going
Ushahidi discussion:
1) The private information tool especially should be used wisely: it is not 100%
accurate and supports, rather than replaces, thoughtful human
decision-making.
2) To improve category classification, need access to training data.
How to store data? Who has access?
Carelessness about sensitive data
can have real and bad consequences!
Non-sensitive deployments (non-wars/disasters)
may still have sensitive information.
Automated vs. Suggestions
● In theory, everything could be automated
○ Ex: Automatically select top-ranked categories
instead of giving humans the rankings
● Ushahidi reports need high quality data, so
we recommend using our package’s output
as suggestions to guide human decisions
● Especially important for sensitive tasks like
private information detection!
Future Ideas
1. Urgency assessment
2. Filter irrelevant reports (not strictly spam)
3. Automatically propose new [sub-]categories
4. Cluster similar (non-identical) reports
5. Hierarchical topic modelling / visualization
6. …?
How to Follow Up...
● GitHub repository (project code + wiki documentation):
http://github.com/dssg/ushine-learning
Collaborators welcome! (Both within and outside of Ushahidi.)
● DSSG team e-mail: dssg-ushahidi@googlegroups.com
● Main Ushahidi contacts: Emmanuel Kala + Heather Leson
● Data Science for Social Good fellowship: http://dssg.io

Data Science for Social Good and Ushahidi - Final Presentation
