Data Science for Social Good and Ushahidi

The Eric and Wendy Schmidt Data Science for Social Good - Summer Fellowship 2013
Preliminary Update July 2013
About the DSSG Rock stars:
http://dssg.io/
https://twitter.com/datascifellows/

Their project:
http://dssg.io/2013/07/15/ushahidi-machine-learning-for-human-rights.html

More @ ushahidi.com / wiki.ushahidi.com / blog.ushahidi.com

Published in: Technology

  1. Project Update - July 11, 2013. The Eric & Wendy Schmidt Data Science for Social Good Summer Fellowship 2013. www.dssg.io | dssg-ushahidi@googlegroups.com
  2. Ushahidi Workflow
  3. Ushahidi Workflow + DSSG
  4. Data Sets: 23,000 reports from 20 datasets
      • 22% English
      • 35% non-English
      • 43% mixed languages
     Each report includes text, category, location, and sometimes more data.
  5. Data Sets: Additional datasets are unusable for various reasons (e.g. overly formulaic language). What is the quality of the existing "gold standard" annotation? Working on translations of
  6. Data Set Differences: Afghanistan election (peaceful) vs. Kenyan election (less peaceful)
  7. Current Task Status [July 11]
      1) Suggest categories
      2) Extract named entities (especially locations)
      3) Detect language
     End of presentation has more extensive technical details.
  8. Toy Demo: http://ec2-54-218-196-140.us-west-2.compute.amazonaws.com/home Note this is ONLY a basic "toy" user interface to demonstrate the current prototype functionality. Our plan is to deliver an open-source code library, which Ushahidi will incorporate into the existing user interface. If the link doesn't work, just look at the screenshots in the next slides. :)
  9. Demo: Example #1
  10. Demo: Example #2
  11. Secondary Project Ideas
      1. Detect private info to strip
      2. Urgency assessment
      3. Filtering irrelevant reports (not strictly spam)
      4. Automatically proposing new [sub-]categories
      5. Cluster similar (non-identical) reports
      6. Hierarchical topic modelling / visualization
  12. Evaluation Plans
      • Tap into Ushahidi and crisis mapping communities for feedback
      • Simulate a past event with our system
      • Success metrics:
          o Increased annotator speed
          o Increased annotator categorization accuracy
          o Decreased annotator frustration/tedium
  13. Feedback welcome! Contact us at dssg-ushahidi@googlegroups.com. We would love your input! See next 4 slides for technical details on our 4 tasks... or skip if you're happy to stay unaware... :)
  14. 1) Suggest categories
      Currently:
      • Simple bag-of-words unigram features
      • 1-vs.-all classification (scikit-learn)
      • Little categories → fewer big categories
      • Performance uninspiring :(
      Future: Bigrams... word frequency filter...
  15. 2) Extract named entities
      Currently:
      • NLTK's Named Entity Recognizer
      • Eval: pretty good
      Future:
      • Train location-recognizer on datasets
      • Merge types for non-location NEs
  16. 3) Detect Language
      Currently:
      • Existing packages (Bing, Python, ...)
      Future:
      • Evaluate quality
      • Allow event-specific language bias
  17. 4) Near-Duplicate Detection
      Currently:
      • SimHash compares distances of message text hashes efficiently
      Future:
      • Evaluate quality more rigorously
      • Explore other methods
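
To make slide 14's "bag-of-words unigram features" plus "1-vs.-all classification" concrete, here is a minimal sketch. It is not the team's code: the slide says they use scikit-learn, while this stand-in trains one plain perceptron per category so that it runs with only the Python standard library. The toy reports, the category names, and all function names are invented for illustration.

```python
from collections import Counter, defaultdict

# Invented toy data: (report text, set of categories). Not from the DSSG datasets.
TOY_REPORTS = [
    ("ballots missing at polling station", {"logistics"}),
    ("crowd attacked voters", {"violence"}),
    ("voting calm peaceful", {"positive"}),
    ("polling station opened late", {"logistics"}),
    ("gunshots heard crowd attacked", {"violence"}),
    ("peaceful calm turnout", {"positive"}),
]

BIAS = ""  # bias weight key; never collides with a token because split() drops empties


def unigrams(text):
    """Bag-of-words unigram counts over lowercased, whitespace-split tokens."""
    return Counter(text.lower().split())


def score(w, bag):
    """Linear score of one binary classifier on one bag of words."""
    return w.get(BIAS, 0.0) + sum(w.get(tok, 0.0) * n for tok, n in bag.items())


def train_one_vs_all(reports, epochs=10):
    """Train one binary perceptron per category (category vs. all others)."""
    categories = sorted({c for _, cats in reports for c in cats})
    weights = {c: defaultdict(float) for c in categories}
    for _ in range(epochs):
        for text, cats in reports:
            bag = unigrams(text)
            for c in categories:
                target = 1 if c in cats else -1
                # Perceptron rule: update only when the report is misclassified.
                if target * score(weights[c], bag) <= 0:
                    for tok, n in bag.items():
                        weights[c][tok] += target * n
                    weights[c][BIAS] += target
    return weights


def suggest_categories(weights, text, top_k=2):
    """Rank categories by classifier score and return the top_k suggestions."""
    bag = unigrams(text)
    ranked = sorted(weights, key=lambda c: score(weights[c], bag), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    model = train_one_vs_all(TOY_REPORTS)
    print(suggest_categories(model, "crowd attacked voters", top_k=1))
```

Ranking the per-category scores, rather than thresholding each one, fits the "suggest categories" framing: a human annotator still makes the final choice from a short list.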

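
Slide 16 lists "allow event-specific language bias" as future work without saying how it would be done. One possible reading, sketched below, is to multiply a detector's per-language score by a prior that reflects the languages expected for a given event. Everything here is an assumption: the stopword-counting detector is a toy stand-in for the existing packages the slide mentions, and the word lists, the prior values, and the function name are hypothetical.

```python
# Tiny stopword lists: a toy stand-in for a real language detector (assumption).
STOPWORDS = {
    "en": {"the", "and", "is", "at", "of", "in", "to"},
    "fr": {"le", "la", "et", "les", "des", "est", "dans"},
    "sw": {"na", "ya", "wa", "kwa", "ni", "katika", "za"},
}


def detect_language(text, prior=None, smoothing=1.0):
    """Return (best_language, posterior) for text.

    prior: optional {language: probability} for the event at hand.
    Without a prior, every language is treated as equally likely.
    """
    tokens = text.lower().split()
    langs = sorted(STOPWORDS)
    if prior is None:
        prior = {lang: 1.0 / len(langs) for lang in langs}
    # Likelihood proxy (smoothed stopword hits) times the event prior.
    scores = {
        lang: (sum(t in STOPWORDS[lang] for t in tokens) + smoothing)
        * prior.get(lang, 0.0)
        for lang in langs
    }
    total = sum(scores.values())
    if total == 0:
        raise ValueError("prior assigns zero probability to every known language")
    posterior = {lang: s / total for lang, s in scores.items()}
    return max(posterior, key=posterior.get), posterior


if __name__ == "__main__":
    # Hypothetical prior for a deployment where Swahili and English dominate.
    kenya_prior = {"en": 0.4, "sw": 0.55, "fr": 0.05}
    best, posterior = detect_language("la gari ni", prior=kenya_prior)
    print(best, posterior)
```

The prior matters most for short or mixed-language reports, where the text alone gives the detector little evidence.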
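
The SimHash approach named on slide 17 can be sketched as follows. The slide does not say which hash function, tokenization, or distance threshold the team used, so those are assumptions here: tokens are whitespace-split words, each token is hashed with MD5 truncated to 64 bits, and two reports count as near-duplicates when their fingerprints differ in at most 3 bits.

```python
import hashlib

BITS = 64  # fingerprint width; matches the 8 digest bytes used below


def simhash(text):
    """64-bit SimHash fingerprint of lowercased, whitespace-split text."""
    counts = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, max_distance=3):
    """True when the fingerprints differ in at most max_distance bits (assumed threshold)."""
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance


if __name__ == "__main__":
    a = "Bridge collapsed near the river, many people stranded"
    b = "bridge collapsed near the river many people are stranded"
    print(hamming_distance(simhash(a), simhash(b)))
```

Because similar texts share most of their tokens, most bit votes agree and the fingerprints end up close in Hamming distance, so comparing two reports costs one XOR and a bit count instead of a full text comparison.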