Understanding Community Needs: Scalable
SMS Processing for UNICEF Nigeria and
Burundi
Jessica Long
Senior software engineer at Idibon
Overview
• Who’s involved in this project?
• What is Natural Language Processing (NLP)?
• What are the challenges of creating NLP for minority
languages and multilingual societies?
• How has the digital age changed how we curate language
data?
• UNICEF’s U-Report program, and Idibon’s collaboration
• Lessons learned from automatic labeling in English and
minority languages
• Conclusions
Acknowledgements
• Robert Munro, CEO of Idibon
• Caroline Barebwoha, U-Report Nigeria project lead
• Aboubacar Kampo, U-Report Nigeria project lead
• Sarah Atkinson, U-Report Burundi project lead
• Kidus Fisaha Asfaw, Global head of U-Report
• Evan Wheeler, CTO of UNICEF Innovation / RapidPro
• Nicholas Gaylord, data scientist at Idibon
My background
• Symbolic Systems BS
• Computer Science MS
• Health systems manager in rural Burundi
• Internationalization engineer
• Second language acquisition research
• NLP engineer
What is Natural Language Processing (NLP)?
Natural language processing is a branch of
artificial intelligence specifically concerned with
making automatic judgments about free text
Flavors of NLP
• Automatic categorization
• Machine translation
• Named entity recognition
• Sentiment Analysis
• Semantic Role Labeling
• Opinion Mining
• Parsing
• Question Answering
• Search
– 15% of Google’s daily search queries
have never been issued before!
• Part of Speech Tagging
• Textual Entailment
• Discourse Analysis
• Natural language generation
• Speech Recognition
• Word sense disambiguation
• Text summarization
Underlying algorithms
• Semi-supervised machine learning
– Start with labeled training data that’s similar to what you
want to generate
– Use this to “teach” the computer what features to look for
when making a decision about the text
[Figure: a training set of examples labeled “Cat” and “Dog”, and an unlabeled example (“???”) whose label is the prediction]
Semi-supervised machine learning example
• “Using Wikipedia for Automatic Word Sense Disambiguation,”
by Rada Mihalcea (2007)
[Figure: mentions of “Paris” labeled as either “Paris, France” or “Paris, Texas”]
Tokenization and feature extraction (n-grams)
Unigrams: “tomb”, “of”, “the”, “unknown”, “soldier”, “beneath”, “arc”, “de”, “triomphe”
Bigrams: “tomb of”, “of the”, “the unknown”, “unknown soldier”, “beneath the”, “the arc”, “arc de”, “de triomphe”
Trigrams: “tomb of the”, “of the unknown”, “the unknown soldier”, “unknown soldier beneath”, “beneath the arc”, “the arc de”, “arc de triomphe”
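The n-gram features listed on this slide can be generated with a few lines of code. The sketch below is only an illustration: the source phrase is an assumption reconstructed from the n-grams on the slide, and the `tokenize` and `ngrams` helpers are hypothetical names, not part of any tool used in the project. Unlike the slide, it emits every contiguous n-gram, including repeats such as the second “the”.

```python
def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumed source phrase, reconstructed from the n-grams shown on the slide.
PHRASE = "tomb of the unknown soldier beneath the arc de triomphe"

tokens = tokenize(PHRASE)
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

if __name__ == "__main__":
    print(unigrams)
    print(bigrams)
    print(trigrams)
```

Each n-gram then becomes one feature that a classifier can count or weight.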
Other features
- Punctuation
- Stemming
- Parsing
- Capitalization
- Dictionary matching
- Stopwords
- …
[Figure: source text, its extracted features, and its source label (“Paris, France”)]
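To make the path from source text to extracted features to label concrete, here is a minimal sketch of a bag-of-words Naive Bayes classifier with add-one smoothing. It is a swapped-in textbook technique, not the method from Mihalcea's paper or from Idibon. The four training sentences, the `train` and `predict` names, and the two labels are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training data: (source text, source label).
EXAMPLES = [
    ("the tomb of the unknown soldier lies beneath the arc de triomphe in paris",
     "Paris, France"),
    ("the eiffel tower overlooks the seine in paris", "Paris, France"),
    ("paris is the county seat of lamar county in texas", "Paris, Texas"),
    ("the paris rodeo draws crowds from across northeast texas", "Paris, Texas"),
]


def train(examples):
    """Count how often each unigram feature appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        n = sum(word_counts[label].values())
        for tok in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][tok] + 1) / (n + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(EXAMPLES)

if __name__ == "__main__":
    print(predict(model, "a walk along the seine in paris"))
    print(predict(model, "paris is a short drive from dallas texas"))
```

With so little training data the model only reacts to words it has seen, which is why the size and fit of the training set matter so much.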
Who uses NLP?
Apple’s Siri does
speech recognition on
human voices, as well
as question answering
IBM Watson answers
Jeopardy questions
Language resources for UNICEF Uganda
30+ Languages Spoken in Uganda
Google Translate Supported Languages
Why is NLP difficult for minority
languages?
• Lots of code-switching breaks usual paradigm of language-
specific textual analysis
• Lack of existing digital tools: spell check, autocomplete,
access to internet
• Minority language speakers lack purchasing power
• Tokenization
– Consider:
• “ntibazoronka.”: “nta” “i” “ba” “zo” “ronka” “.” (Kirundi)
• “they will not obtain.”: “they” “will” “not” “obtain” “.” (English)
• Encoding issues
– “I can text you a pile of poo, but I can’t write my name” by Aditya Mukerjee in Model View Culture
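The tokenization contrast above can be reproduced with a simple word-and-punctuation tokenizer of the kind that works tolerably for English. This is a sketch under assumptions: `simple_tokenize` is a hypothetical helper, and it makes no attempt at the morphological analysis that Kirundi would need.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


kirundi_tokens = simple_tokenize("ntibazoronka.")
english_tokens = simple_tokenize("they will not obtain.")

if __name__ == "__main__":
    print(kirundi_tokens)  # one long word plus the full stop
    print(english_tokens)  # four words plus the full stop
```

The English sentence yields four separate word features, while the Kirundi sentence yields a single word that packs in the same information. Recovering the pieces listed on the slide takes language-specific morphological knowledge that a regular expression does not have.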
But most of all. . .
• Minority languages lack appropriate training datasets.
– They tend to be primarily spoken, and lack the digital and
even written content necessary for statistical machine
learning
• Google Translate relies on parallel corpora from UN
proceedings to help create machine translation products
– The UN does not dual broadcast in Wolof.
• Textual reviews matched to star ratings on Yelp help
researchers calibrate sentiment analysis
– Yelp is literally non-functional in most of Africa.
“Raw data is an oxymoron.”
- Lisa Gitelman
Curation of language data, old & new
Compiled by Webster
Collective wisdom, at scale
Compiled by experts,
Supplemented by OED Reading Programme
* Shout out! Go see Martin
Benjamin’s talk on The
Kamusi Project tomorrow
at 13:45, for more
information on dictionary
curation
Creating new structured data with
crowdsourcing
• “Are two heads better than one? Crowdsourced
translation via a two-step collaboration of non-professional
editors and translators”, Yan et al.
– Creating parallel corpora with crowd workers is much
faster and cheaper than using professional translators
• Now, more than ever, we have the ability to rapidly
create new labeled language data
– …as long as we can find proficient writers of minority
languages with digital literacy, electricity, and internet
access
Cell phone access
• Nearly 6 billion people in the world have
access to a cell phone
• In 2013, the UN famously reported that more
people have access to a cell phone than to a
toilet
UNICEF’s U-Report
• Crowd wisdom, in real time, in developing countries
• In 2012, the UNICEF Innovation team started building a real-time
SMS polling service for UNICEF Uganda. As of 2015, U-Report
operates in over 15 countries
• Polls are sent out once a week on topics like:
– Has ur community addressed social inclusion issues affecting
women, youth, and children?
– If you get water from a well, borehole, or community tap, is it
working today?
– Go to your local health center and tell us: Do they give free HIV
/ AIDS tests? Report YES or NO and HEALTH CENTER NAME
UNICEF’s U-Report
• Eventually, UNICEF started receiving urgent, unsolicited
messages
– FLOOD.villages of X, Y sub.county suffering.
• UNICEF Nigeria alone now receives 10,000+ unsolicited
messages per day
• UNICEF needs a way to:
– Identify topically relevant messages to share with specific
partners
– Prioritize which messages to respond to first
• Idibon labels messages with urgency, category label,
and language, in real time
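Once each message carries an urgency, a category, and a language label, the two needs listed above reduce to filtering and sorting. The sketch below shows one possible shape for that step. The `LabeledMessage` fields, the 0-to-1 urgency scale, the category names, and the sample labels are all hypothetical, and nothing here reflects how Idibon or UNICEF actually store or score messages.

```python
from dataclasses import dataclass


@dataclass
class LabeledMessage:
    text: str
    urgency: float  # hypothetical score in [0, 1]; higher means more urgent
    category: str   # hypothetical topic label, e.g. "disaster" or "water"
    language: str   # hypothetical language code


def triage(messages, category):
    """Keep messages relevant to one partner's topic, most urgent first."""
    relevant = [m for m in messages if m.category == category]
    return sorted(relevant, key=lambda m: m.urgency, reverse=True)


# Illustrative inbox; the labels are invented, not real model output.
inbox = [
    LabeledMessage("Our borehole has been broken for a week", 0.6, "water", "en"),
    LabeledMessage("River rising near the market", 0.7, "disaster", "en"),
    LabeledMessage("FLOOD.villages of X, Y sub.county suffering.", 0.9, "disaster", "en"),
]

disaster_queue = triage(inbox, "disaster")

if __name__ == "__main__":
    for message in disaster_queue:
        print(message.urgency, message.text)
```

A real-time service would apply the same ordering to a stream rather than a list, but the labels do the same work either way.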
Lesson #1: It’s difficult to predict how many
new people will use your product / service
when you start supporting a new language
[Chart: Non-English Languages of Nigeria (axis scale 0 to 30)]
[Chart: number of unsolicited Hausa messages per day, with the point where Hausa polls begin marked]
* But we don’t see the same effect for Yoruba
Lesson #2: Language mixing in an African
context raises different considerations for
classification algorithms than European
language code-switching
• Downside: complex
tokenization
• Upside: radically different
word structure
Lesson #3: Geopolitical context affects how
we interpret short messages, and it’s
constantly changing
Lesson #4: Mutually exclusive categories
are elusive. To automatically label
messages is to discover the endless
ambiguity in human discourse.
- Is a washed out road more related to infrastructure or personal safety?
- Is education scoped to a particular time in life? Does post-graduate
education count? What about education outside of a scholastic context?
- If a town’s full name is “Mbale Village,” is “Mbale” a valid place name?
- How specific do messages need to be to constitute a security threat? Does
“these days some of our young people are not safe” count?
Conclusions
• Crowdsourcing, machine learning, and the
proliferation of cell phones make amazing new
communication tools and digital language
data possible
• Invest in translators and analysts
Thank you!

Understanding Community Needs: Scalable SMS Processing for UNICEF Nigeria and Burundi

  • 1. Understanding Community Needs: Scalable SMS Processing for UNICEF Nigeria and Burundi Jessica Long Senior software engineer at Idibon
  • 2. Overview • Who’s involved in this project? • What is Natural Language Processing (NLP)? • What are the challenges of creating NLP for minority languages and multilingual societies? • How has the digital age changed how we curate language data? • UNICEF’s U-Report program, and Idibon’s collaboration • Lessons learned from automatic labeling in English and minority languages • Conclusions
  • 3. Acknowledgements • Robert Munro, CEO of Idibon • Caroline Barebwoha, U-Report Nigeria project lead • Aboubacar Kampo, U-Report Nigeria project lead • Sarah Atkinson, U-Report Burundi project lead • Kidus Fisaha Asfaw, Global head of U-Report • Evan Wheeler, CTO of UNICEF Innovation / RapidPro • Nicholas Gaylord, data scientist at Idibon
  • 4. My background Symbolic Systems BS Computer Science MS Health systems manager in rural Burundi Internationalization engineer Second language acquisition research NLP engineer
  • 5. Overview • Who’s involved in this project? • What is Natural Language Processing (NLP)? • What are the challenges of creating NLP for minority languages and multilingual societies? • How has the digital age changed how we curate language data? • UNICEF’s U-Report program, and Idibon’s collaboration • Lessons learned from automatic labeling in English and minority languages • Conclusions
  • 6. What is Natural Language Processing (NLP)? Natural language processing is a branch of artificial intelligence specifically concerned with making automatic judgments about free text
  • 7. Flavors of NLP • Automatic categorization • Machine translation • Named entity recognition • Sentiment Analysis • Semantic Role Labeling • Opinion Mining • Parsing • Question Answering • Search – 15% of Google’s daily search queries have never been issued before! • Part of Speech Tagging • Textual Entailment • Discourse Analysis • Natural language Generation • Speech Recognition • Word sense disambiguation • Text summarization
  • 8. Underlying algorithms • Semi-supervised machine learning – Start with labeled training data that’s similar to what you want to generate – Use this to “teach” the computer what features to look for when making a decision about the text [Diagram: training set labeled Cat, Cat, Cat, Dog, Dog, Dog; prediction for an unlabeled item: ???]
  • 9. Semi-supervised machine learning example • “Using Wikipedia for Automatic Word Sense Disambiguation,” by Rada Mihalcea (2007) [Diagram labels: Paris, France; Paris, Texas; Paris, France; Paris, France; Paris, Texas]
  • 10. Tokenization and feature extraction (n-grams) • Unigrams: “tomb”, “of”, “the”, “unknown”, “soldier”, “beneath”, “arc”, “de”, “triomphe” • Bigrams: “tomb of”, “of the”, “the unknown”, “unknown soldier”, “soldier beneath”, “beneath the”, “the arc”, “arc de”, “de triomphe” • Trigrams: “tomb of the”, “of the unknown”, “the unknown soldier”, “unknown soldier beneath”, “soldier beneath the”, “beneath the arc”, “the arc de”, “arc de triomphe” • Other features - Punctuation - Stemming - Parsing - Capitalization - Dictionary matching - Stopwords - … [Diagram: source text, its source label (Paris, France), and the extracted features]
  • 11. Who uses NLP? Apple’s Siri does speech recognition on human voices, as well as question answering IBM Watson answers Jeopardy questions
  • 12. Overview • Who’s involved in this project? • What is Natural Language Processing (NLP)? • What are the challenges of creating NLP for minority languages and multilingual societies? • How has the digital age changed how we curate language data? • UNICEF’s U-Report program, and Idibon’s collaboration • Lessons learned from automatic labeling in English and minority languages • Conclusions
  • 13. Language resources for UNICEF Uganda [Figure panels: 30+ languages spoken in Uganda; Google Translate supported languages]
  • 14. Why is NLP difficult for minority languages? • Lots of code-switching breaks usual paradigm of language-specific textual analysis • Lack of existing digital tools: spell check, autocomplete, access to internet • Minority language speakers lack purchasing power • Tokenization – Consider: • “ntibazoronka.”: “nta” “i” “ba” “zo” “ronka” “.” (Kirundi) • “they will not obtain.”: “they” “will” “not” “obtain” “.” (English) • Encoding issues – “I can text you a pile of poo, but I can’t write my name” by Aditya Mukerjee in Model View Culture
  • 15. But most of all. . . • Minority languages lack appropriate training datasets. – They tend to be primarily spoken, and lack the digital and even written content necessary for statistical machine learning • Google Translate relies on parallel corpora from UN proceedings to help create machine translation products – The UN does not dual broadcast in Wolof. • Textual reviews matched to star ratings on Yelp help researchers calibrate sentiment analysis – Yelp is literally non-functional in most of Africa.
  • 16. “Raw data is an oxymoron.” - Lisa Gitelman
  • 17. Overview • Who’s involved in this project? • What is Natural Language Processing (NLP)? • What are the challenges of creating NLP for minority languages and multilingual societies? • How has the digital age changed how we curate language data? • UNICEF’s U-Report program, and Idibon’s collaboration • Lessons learned from automatic labeling in English and minority languages • Conclusions
  • 18. Curation of language data, old & new • Compiled by Webster • Collective wisdom, at scale • Compiled by experts, supplemented by OED Reading Programme * Shout out! Go see Martin Benjamin’s talk on The Kamusi Project tomorrow at 13:45, for more information on dictionary curation
  • 19.
  • 20. Creating new structured data with crowdsourcing • “Are two heads better than one? Crowdsourced translation via a two-step collaboration of non-professional translators and editors”, Yan et al. – Creating parallel corpora with crowd workers is much faster and cheaper than using professional translators • Now, more than ever, we have the ability to rapidly create new labeled language data – …as long as we can find proficient writers of minority languages with digital literacy, electricity, and internet access
  • 21. Cell phone access • Nearly 6 billion people in the world have access to a cell phone • In 2013, the UN famously reported that more people have access to a cell phone than to a toilet
  • 22. Overview • Who’s involved in this project? • What is Natural Language Processing (NLP)? • What are the challenges of creating NLP for minority languages and multilingual societies? • How has the digital age changed how we curate language data? • UNICEF’s U-Report program, and Idibon’s collaboration • Lessons learned from Idibon’s automatic labeling in English and minority languages • Conclusions
  • 23. UNICEF’s U-Report • Crowd wisdom, in real time, in developing countries • In 2012, UNICEF Innovation team started building a real-time SMS polling service for UNICEF Uganda. As of 2015, U-Report operates in over 15 countries • Polls are sent out once a week on topics like: – Has ur community addressed social inclusion issues affecting women, youth, and children? – If you get water from a well, borehole, or community tap, is it working today? – Go to your local health center and tell us: Do they give free HIV / AIDS tests? Report YES or NO and HEALTH CENTER NAME
  • 24. UNICEF’s U-Report • Eventually, UNICEF started receiving urgent, unsolicited messages – FLOOD.villages of X, Y sub.county suffering. • UNICEF Nigeria alone now receives 10,000+ unsolicited messages per day • UNICEF needs a way to: – Identify topically relevant messages to share with specific partners – Prioritize which messages to respond to first • Idibon labels messages with urgency, category label, and language, in real time
  • 25. Overview • Who’s involved in this project? • What is Natural Language Processing (NLP)? • What are the challenges of creating NLP for minority languages and multilingual societies? • How has the digital age changed how we curate language data? • UNICEF’s U-Report program • Lessons learned from Idibon’s automatic labeling in English and minority languages • Conclusions
  • 26. Lesson #1: It’s difficult to predict how many new people will use your product / service when you start supporting a new language [Chart: Non-English Languages of Nigeria]
  • 27. Lesson #1: It’s difficult to predict how many new people will use your product / service when you start supporting a new language [Chart: # unsolicited Hausa messages per day, with the point where Hausa polls begin marked] * But we don’t see the same effect for Yoruba
  • 28. Lesson #2: Language mixing in an African context has different considerations for classification algorithms vs European language code-switching • Downside: complex tokenization • Upside: radically different word structure
  • 29. Lesson #3: Geopolitical context affects how we interpret short messages, and it’s constantly changing
  • 30. Lesson #4: Mutually exclusive categories are elusive. To automatically label messages is to discover the endless ambiguity in human discourse. - Is a washed out road more related to infrastructure or personal safety? - Is education scoped to a particular time in life? Does post-graduate education count? What about education outside of a scholastic context? - If a town’s full name is “Mbale Village,” is “Mbale” a valid place name? - How specific do messages need to be to constitute a security threat? Does “these days some of our young people are not safe” count?
  • 31. Overview • Who’s involved in this project? • What is Natural Language Processing (NLP)? • What are the challenges of creating NLP for minority languages and multilingual societies? • How has the digital age changed how we curate language data? • UNICEF’s U-Report program, and Idibon’s collaboration • Lessons learned from automatic labeling in English and minority languages • Conclusions
  • 32. Conclusions • Crowdsourcing, machine learning, and the proliferation of cell phones make amazing new communication tools and digital language data possible • Invest in translators and analysts
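
The tokenization and n-gram feature extraction described on slide 10 can be sketched roughly as follows. This is a minimal illustration, not Idibon's actual pipeline: the function names `tokenize` and `ngrams` are hypothetical, the example sentence is reconstructed from the tokens listed on the slide, and the whitespace-and-letters tokenizer is an assumption that would not work for agglutinative languages such as Kirundi (see slide 14).

```python
import re


def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters.

    Assumption: a simple regex tokenizer, adequate for this English
    example only.
    """
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Example sentence reconstructed from the slide's token list (an assumption).
text = "Tomb of the Unknown Soldier beneath the Arc de Triomphe"
tokens = tokenize(text)

# Unigram, bigram, and trigram features for one training example.
features = {n: ngrams(tokens, n) for n in (1, 2, 3)}

for n, grams in features.items():
    print(n, grams)
```

In a classifier, features like these would be paired with the source label (here, "Paris, France") so the model can learn which word sequences signal which label.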