Understanding Community Needs: Scalable SMS Processing for UNICEF Nigeria and Burundi

Scalable SMS processing for UNICEF Nigeria and Burundi.

Published in: Technology

  1. Understanding Community Needs: Scalable SMS Processing for UNICEF Nigeria and Burundi
     Jessica Long, Senior software engineer at Idibon
  2. Overview
     • Who’s involved in this project?
     • What is Natural Language Processing (NLP)?
     • What are the challenges of creating NLP for minority languages and multilingual societies?
     • How has the digital age changed how we curate language data?
     • UNICEF’s U-Report program, and Idibon’s collaboration
     • Lessons learned from automatic labeling in English and minority languages
     • Conclusions
  3. Acknowledgements
     • Robert Munro, CEO of Idibon
     • Caroline Barebwoha, U-Report Nigeria project lead
     • Aboubacar Kampo, U-Report Nigeria project lead
     • Sarah Atkinson, U-Report Burundi project lead
     • Kidus Fisaha Asfaw, Global head of U-Report
     • Evan Wheeler, CTO of UNICEF Innovation / RapidPro
     • Nicholas Gaylord, data scientist at Idibon
  4. My background
     • Symbolic Systems BS
     • Computer Science MS
     • Health systems manager in rural Burundi
     • Internationalization engineer
     • Second language acquisition research
     • NLP engineer
  5. Overview
     • Who’s involved in this project?
     • What is Natural Language Processing (NLP)?
     • What are the challenges of creating NLP for minority languages and multilingual societies?
     • How has the digital age changed how we curate language data?
     • UNICEF’s U-Report program, and Idibon’s collaboration
     • Lessons learned from automatic labeling in English and minority languages
     • Conclusions
  6. What is Natural Language Processing (NLP)?
     Natural language processing is a branch of artificial intelligence specifically concerned with making automatic judgments about free text.
  7. Flavors of NLP
     • Automatic categorization
     • Machine translation
     • Named entity recognition
     • Sentiment analysis
     • Semantic role labeling
     • Opinion mining
     • Parsing
     • Question answering
     • Search
       – 15% of Google’s daily search queries have never been issued before!
     • Part of speech tagging
     • Textual entailment
     • Discourse analysis
     • Natural language generation
     • Speech recognition
     • Word sense disambiguation
     • Text summarization
  8. Underlying algorithms
     • Semi-supervised machine learning
       – Start with labeled training data that’s similar to what you want to generate
       – Use this to “teach” the computer what features to look for when making a decision about the text
     [Figure: a training set of examples labeled “Cat” and “Dog”, and an unlabeled example (“???”) to be predicted]
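The idea on slide 8 of learning from labeled examples and then predicting a label for new text can be sketched as a tiny Naive Bayes classifier. This is a minimal illustration under stated assumptions, not the system described in the talk: the training sentences and the function names `train` and `predict` are invented for this example, and it covers only the fully labeled part of the process (no unlabeled data is used).

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count how often each token appears under each label."""
    token_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        token_counts[label].update(text.lower().split())
    return token_counts, label_counts

def predict(text, token_counts, label_counts):
    """Return the label with the highest Naive Bayes log-probability."""
    vocab = {tok for counts in token_counts.values() for tok in counts}
    total_examples = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Prior: how common the label is in the training set.
        score = math.log(label_counts[label] / total_examples)
        n_tokens = sum(token_counts[label].values())
        for tok in text.lower().split():
            # Add-one smoothing so unseen tokens do not zero out the score.
            score += math.log((token_counts[label][tok] + 1) / (n_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set, invented for illustration.
training_set = [
    ("meows and purrs on the sofa", "cat"),
    ("purrs when it sees milk", "cat"),
    ("chases mice and meows", "cat"),
    ("barks at the mail carrier", "dog"),
    ("fetches the ball and barks", "dog"),
    ("wags its tail and fetches sticks", "dog"),
]
model = train(training_set)
prediction = predict("it purrs and meows", *model)
```

The features here are just whitespace-separated tokens; the feature types listed on the tokenization slide would feed the same kind of counting step.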
  9. Semi-supervised machine learning example
     • “Using Wikipedia for Automatic Word Sense Disambiguation,” by Rada Mihalcea (2007)
     [Figure: examples labeled “Paris, France” or “Paris, Texas”]
  10. Tokenization and feature extraction (n-grams)
      [Figure: a source text, its source label (“Paris, France”), and the features extracted from it]
      Extracted features:
      • “tomb”, “of”, “the”, “unknown”, “soldier”, “beneath”, “arc”, “de”, “triomphe”
      • “tomb of”, “of the”, “the unknown”, “unknown soldier”, “soldier beneath”, “beneath the”, “the arc”, “arc de”, “de triomphe”
      • “tomb of the”, “of the unknown”, “the unknown soldier”, “unknown soldier beneath”, “soldier beneath the”, “beneath the arc”, “the arc de”, “arc de triomphe”
      Other features:
      - Punctuation
      - Stemming
      - Parsing
      - Capitalization
      - Dictionary matching
      - Stopwords
      - …
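The n-gram lists on slide 10 can be reproduced with a few lines of Python. This is a minimal sketch: the source sentence is an assumption reconstructed from the n-grams shown on the slide, and `tokenize` and `ngrams` are illustrative helper names rather than functions from any particular NLP library.

```python
def tokenize(text):
    """Lowercase the text, replace punctuation with spaces, split on whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return cleaned.split()

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumed source text, inferred from the features listed on the slide.
source_text = "Tomb of the Unknown Soldier beneath the Arc de Triomphe"
source_label = "Paris, France"

tokens = tokenize(source_text)
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# Each n-gram becomes one feature associated with the source label.
features = unigrams + bigrams + trigrams
```

Note that `unigrams` keeps both occurrences of “the”, whereas the slide lists each distinct word once; wrapping the list in `set()` would give the deduplicated version.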
  11. Who uses NLP?
      • Apple’s Siri does speech recognition on human voices, as well as question answering
      • IBM Watson answers Jeopardy questions
  12. Overview
      • Who’s involved in this project?
      • What is Natural Language Processing (NLP)?
      • What are the challenges of creating NLP for minority languages and multilingual societies?
      • How has the digital age changed how we curate language data?
      • UNICEF’s U-Report program, and Idibon’s collaboration
      • Lessons learned from automatic labeling in English and minority languages
      • Conclusions
  13. Language resources for UNICEF Uganda
      [Figure panels: “30+ Languages Spoken in Uganda”; “Google Translate Supported Languages”]
  14. Why is NLP difficult for minority languages?
      • Lots of code-switching breaks the usual paradigm of language-specific textual analysis
      • Lack of existing digital tools: spell check, autocomplete, access to internet
      • Minority language speakers lack purchasing power
      • Tokenization. Consider:
        – “ntibazoronka.”: “nta” “i” “ba” “zo” “ronka” “.” (Kirundi)
        – “they will not obtain.”: “they” “will” “not” “obtain” “.” (English)
      • Encoding issues
        – “I can text you a pile of poo, but I can’t write my name” by Aditya Mukerjee in Model View Culture
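The tokenization point on slide 14 is easy to see with a whitespace tokenizer, which works tolerably for English but returns the whole Kirundi verb as a single token. This is a small sketch: `whitespace_tokenize` is an illustrative helper, and the Kirundi morpheme split is copied from the slide rather than computed, since producing it would need a morphological analyzer that is assumed not to be available here.

```python
def whitespace_tokenize(text):
    """Separate sentence-final periods, then split on whitespace."""
    return text.replace(".", " .").split()

english_tokens = whitespace_tokenize("they will not obtain.")
kirundi_tokens = whitespace_tokenize("ntibazoronka.")

# Target segmentation from the slide; not something whitespace splitting can recover.
kirundi_morphemes = ["nta", "i", "ba", "zo", "ronka", "."]

# Number of meaningful units the naive tokenizer failed to separate.
missed_units = len(kirundi_morphemes) - len(kirundi_tokens)
```

The English sentence yields five tokens that line up with its words, while the Kirundi sentence yields only two, so word-level features carry much less information per token.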
  15. But most of all...
      • Minority languages lack appropriate training datasets.
        – They tend to be primarily spoken, and lack the digital and even written content necessary for statistical machine learning
      • Google Translate relies on parallel corpora from UN proceedings to help create machine translation products
        – The UN does not dual broadcast in Wolof.
      • Textual reviews matched to star ratings on Yelp help researchers calibrate sentiment analysis
        – Yelp is literally non-functional in most of Africa.
  16. “Raw data is an oxymoron.” - Lisa Gitelman
  17. Overview
      • Who’s involved in this project?
      • What is Natural Language Processing (NLP)?
      • What are the challenges of creating NLP for minority languages and multilingual societies?
      • How has the digital age changed how we curate language data?
      • UNICEF’s U-Report program, and Idibon’s collaboration
      • Lessons learned from automatic labeling in English and minority languages
      • Conclusions
  18. Curation of language data, old & new
      • Compiled by Webster
      • Collective wisdom, at scale
      • Compiled by experts, supplemented by OED Reading Programme
      * Shout out! Go see Martin Benjamin’s talk on The Kamusi Project tomorrow at 13:45, for more information on dictionary curation
  19. Creating new structured data with crowdsourcing
      • “Are two heads better than one? Crowdsourced translation via a two-step collaboration of non-professional editors and translators”, Yan et al.
        – Creating parallel corpora with crowd workers is much faster and cheaper than using professional translators
      • Now, more than ever, we have the ability to rapidly create new labeled language data
        – …as long as we can find proficient writers of minority languages with digital literacy, electricity, and internet access
  20. Cell phone access
      • Nearly 6 billion people in the world have access to a cell phone
      • In 2013, the UN famously reported that more people have access to a cell phone than to a toilet
  21. Overview
      • Who’s involved in this project?
      • What is Natural Language Processing (NLP)?
      • What are the challenges of creating NLP for minority languages and multilingual societies?
      • How has the digital age changed how we curate language data?
      • UNICEF’s U-Report program, and Idibon’s collaboration
      • Lessons learned from Idibon’s automatic labeling in English and minority languages
      • Conclusions
  22. UNICEF’s U-Report
      • Crowd wisdom, in real time, in developing countries
      • In 2012, UNICEF Innovation team started building a real-time SMS polling service for UNICEF Uganda. As of 2015, U-Report operates in over 15 countries
      • Polls are sent out once a week on topics like:
        – Has ur community addressed social inclusion issues affecting women, youth, and children?
        – If you get water from a well, borehole, or community tap, is it working today?
        – Go to your local health center and tell us: Do they give free HIV / AIDS tests? Report YES or NO and HEALTH CENTER NAME
  23. UNICEF’s U-Report
      • Eventually, UNICEF started receiving urgent, unsolicited messages
        – FLOOD.villages of X, Y sub.county suffering.
      • UNICEF Nigeria alone now receives 10,000+ unsolicited messages per day
      • UNICEF needs a way to:
        – Identify topically relevant messages to share with specific partners
        – Prioritize which messages to respond to first
      • Idibon labels messages with urgency, category label, and language, in real time
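The two needs on slide 23, routing messages by topic and deciding which to answer first, can be sketched once each message carries the three labels mentioned there. Everything concrete below is an assumption made for illustration: the `LabeledMessage` class, the 0.0 to 1.0 urgency scale, the category names, the label values, and the second and third sample messages are invented and do not describe Idibon’s or UNICEF’s actual data model.

```python
from dataclasses import dataclass

@dataclass
class LabeledMessage:
    text: str
    urgency: float   # hypothetical score, 0.0 (low) to 1.0 (high)
    category: str    # hypothetical category name
    language: str    # hypothetical language code

def for_partner(messages, category):
    """Select the messages relevant to a partner working on one category."""
    return [m for m in messages if m.category == category]

def by_priority(messages):
    """Order messages so the most urgent ones are handled first."""
    return sorted(messages, key=lambda m: m.urgency, reverse=True)

# First message is quoted from the slide; the others and all labels are made up.
inbox = [
    LabeledMessage("thank you for the poll", 0.1, "other", "en"),
    LabeledMessage("FLOOD.villages of X, Y sub.county suffering.", 0.9, "emergency", "en"),
    LabeledMessage("borehole not working since last week", 0.6, "water", "en"),
]

response_queue = by_priority(inbox)
water_messages = for_partner(inbox, "water")
```

At 10,000+ messages per day, the labeling step is the hard part; once labels exist, filtering and sorting like this are cheap.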
  24. Overview
      • Who’s involved in this project?
      • What is Natural Language Processing (NLP)?
      • What are the challenges of creating NLP for minority languages and multilingual societies?
      • How has the digital age changed how we curate language data?
      • UNICEF’s U-Report program
      • Lessons learned from Idibon’s automatic labeling in English and minority languages
      • Conclusions
  25. Lesson #1: It’s difficult to predict how many new people will use your product / service when you start supporting a new language
      [Chart: Non-English Languages of Nigeria]
  26. Lesson #1: It’s difficult to predict how many new people will use your product / service when you start supporting a new language
      [Chart: # unsolicited Hausa messages per day, annotated “Hausa polls begin”]
      * But we don’t see the same effect for Yoruba
  27. Lesson #2: Language mixing in an African context has different considerations for classification algorithms vs European language code-switching
      • Downside: complex tokenization
      • Upside: radically different word structure
  28. Lesson #3: Geopolitical context affects how we interpret short messages, and it’s constantly changing
  29. Lesson #4: Mutually exclusive categories are elusive. To automatically label messages is to discover the endless ambiguity in human discourse.
      - Is a washed out road more related to infrastructure or personal safety?
      - Is education scoped to a particular time in life? Does post-graduate education count? What about education outside of a scholastic context?
      - If a town’s full name is “Mbale Village,” is “Mbale” a valid place name?
      - How specific do messages need to be to constitute a security threat? Does “these days some of our young people are not safe” count?
  30. Overview
      • Who’s involved in this project?
      • What is Natural Language Processing (NLP)?
      • What are the challenges of creating NLP for minority languages and multilingual societies?
      • How has the digital age changed how we curate language data?
      • UNICEF’s U-Report program, and Idibon’s collaboration
      • Lessons learned from automatic labeling in English and minority languages
      • Conclusions
  31. Conclusions
      • Crowdsourcing, machine learning, and the proliferation of cell phones make amazing new communication tools and digital language data possible
      • Invest in translators and analysts
  32. Thank you!