NLP: Challenges and Opportunities in Underserved Areas

Colleen M. Farrelly, Machine Learning Lead

Natural Language
Processing
• Many applications of text data
• Customer feedback
• Legal documents
• Job search/other search engines
• Image captions
• Product titles
• Need to wrangle text into matrix form in many
applications
• Embeddings
• Parts-of-speech counts
• Sentiment analysis results
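
As a rough illustration of "wrangling text into matrix form", the sketch below builds a document-term count matrix with only the Python standard library. The helper names (tokenize, document_term_matrix) and the sample feedback strings are hypothetical, and real projects would normally use a dedicated library and a language-appropriate tokenizer.

```python
from collections import Counter

PUNCT = ".,!?;:\"'()"

def tokenize(text):
    # Naive whitespace tokenizer: lowercase and strip edge punctuation.
    tokens = (t.strip(PUNCT).lower() for t in text.split())
    return [t for t in tokens if t]

def document_term_matrix(docs):
    # One row per document, one column per vocabulary term (sorted).
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

# Hypothetical customer-feedback snippets.
docs = ["Great product, great price", "Late delivery"]
vocab, matrix = document_term_matrix(docs)
for row in matrix:
    print(dict(zip(vocab, row)))
```

Each row can then be fed to a downstream model alongside other columns such as parts-of-speech counts or sentiment scores.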

Common Tools:
Sentiment Analysis
• Understand positive/negative/neutral tone of text
data
• Expansion to other emotions:
• Anger
• Sadness
• Surprise
• Some uses:
• Identifying customer churn
• Evaluating educational interventions
• Predicting clinical outcomes
• Packages exist for some languages and
applications.
• Other languages or emotions require custom code
and dictionaries.
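
The "custom code and dictionaries" route can be as simple as a lookup table of scored words. The sketch below is one minimal way to do it, assuming a small hand-built lexicon; the English entries in LEXICON are placeholders for illustration only, and a real dictionary for an unsupported language would come from native speakers.

```python
# Hypothetical lexicon: word -> polarity (+1 positive, -1 negative).
LEXICON = {
    "good": 1, "great": 1, "helpful": 1,
    "bad": -1, "slow": -1, "broken": -1,
}

def sentiment_score(text, lexicon):
    # Average polarity of the words found in the lexicon; 0.0 if none match.
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    hits = [lexicon[t] for t in tokens if t in lexicon]
    if not hits:
        return 0.0
    return sum(hits) / len(hits)

def label(score):
    # Map a numeric score to a positive/negative/neutral tone.
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

review = "The staff were helpful but the app is slow and broken."
print(label(sentiment_score(review, LEXICON)))
```

Extending this to other emotions (anger, sadness, surprise) would mean one lexicon per emotion rather than a single polarity score.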

Embeddings
Capture relative frequency of word use within a text
and across texts
• Can use down-weighting to ignore common words like “a” or
“the”
• Don’t capture context well in the simple versions
• She bolted the door shut.
• She bolted out the door.
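
A small sketch of the frequency-based approach, using a plain TF-IDF weighting written with the standard library (the function names and the third example sentence are assumptions added for illustration). It shows both points above: "the" is down-weighted to zero because it occurs in every text, and "bolted" gets the same weight in both sentences even though its meaning differs.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer: lowercase and strip edge punctuation.
    tokens = (t.strip(".,!?;:").lower() for t in text.split())
    return [t for t in tokens if t]

def tf_idf(docs):
    # Term frequency within a text times log inverse document frequency.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    weights = []
    for doc in tokenized:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "She bolted the door shut.",
    "She bolted out the door.",
    "The horse ate the hay.",  # extra text so that not every word is shared
]
weights = tf_idf(docs)
print(weights[0]["the"])                             # common word, weight 0.0
print(weights[0]["bolted"] == weights[1]["bolted"])  # same weight, different sense
```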
Pretrained encoder/decoder neural networks that
can capture context
• BERT
• GPT-3
Most pretrained models support only a limited
number of languages (though a similar model can be
trained on a corpus in a new language)…

Consider the
apps you use
every day…
Now imagine
they didn’t
exist in your
language…

NLP Needs in
Underserved
Areas
• Translation and speech-to-text for
unsupported languages (Hausa,
Lingala, Quechua…)
• Sentiment dictionaries for unsupported
languages/emotional nuances of the
language
• NLP-powered apps (search engines,
matching/recommenders, symptom
checkers, conversational agents…)
• Language preservation of endangered
languages
• 308 highly endangered ones just in
Africa

Market Size for NLP Applications
• Worldwide NLP market projected
to grow from $21B in 2021 to
$127B by 2028.
• South America and Africa are
largely ignored as markets for
NLP-backed technology in healthcare,
travel, retail, education, and other
sectors.
• Local companies and universities
are currently trying to meet market
needs.

Caveats…
• Collecting the data
• Finding existing sources; creating written sources for unwritten languages (3,074 of the 7,139 languages that exist)
• Capturing speech tone variety, storing large audio files for non-written languages
• Getting large enough sample sizes from endangered languages (Domari in Northern Africa/Middle East)
• Ownership of data
• Foreign corporations? Governments? Universities? Local speakers?
• Biases and misuses
• Unintentional translation errors when non-native speakers review the technology (e.g., diseases/symptoms)
• Lack of representation in languages targeted/training in NLP (wealthy world vs. developing world)
• Use of technologies to spread conflict (companies, world powers, neighboring countries… interfering)

Case Studies: Recent
Collaborations
Sub-Saharan Africa

Customized
Dictionaries and
Embeddings
• AfroLeadership
• Crowd-source local language
sentiment dictionaries, writing
samples for embeddings…
• Led by students and researchers
at local Cameroonian universities
• Hausa Hackathon
• Non-profit initiative to build
corpus/dictionaries and build
applications to support the Hausa
language
• Hackathons for Hausa speakers
and NLP professionals interested
in Hausa applications
• Masakhane
• Non-profit collaboration of NLP
researchers in Africa
• Broad set of target languages

Companies Powered by NLP
• Mpuza Inc
• Job matching app connecting companies and job
seekers
• Powered by NLP-based matching engine
• Caveat: needs filters for extremist recruiting
• Rwanda's history and violence in the neighboring DRC
• Need to identify extremist recruitment job posts
• Name changes of extremist groups
• Concealed recruitment/threats…
• False positives for human rights and security positions
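
To make the false-positive caveat concrete, here is a deliberately naive watchlist filter. It is not Mpuza's method; the alias table, group names, and job posts are all invented placeholders. The alias map stands in for tracking name changes, and the second post shows how a legitimate human rights job trips the same rule, which is why flagged posts would plausibly go to human review rather than automatic rejection.

```python
# Hypothetical alias table: surface name -> canonical watchlist ID.
# Several names can map to one ID to follow a group's name changes.
ALIASES = {
    "group alpha": "group-a",
    "alpha front": "group-a",
}

def flag_post(text, aliases):
    # Return the sorted watchlist IDs whose aliases appear in the post.
    lowered = text.lower()
    return sorted({canonical for alias, canonical in aliases.items()
                   if alias in lowered})

posts = [
    "Drivers wanted by Alpha Front, cash paid",
    "Human rights researcher documenting abuses by Group Alpha",
    "Accountant needed for retail chain",
]
flags = [flag_post(p, ALIASES) for p in posts]
for post, hit in zip(posts, flags):
    print(hit, "|", post)
```

Plain keyword matching also misses concealed recruitment that avoids known names entirely, so a filter like this would only be a first pass.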

Questions…
• How many are familiar with NLP?
• How many lived in another country
as a child?
• How many are interested in making
money or making a social-good
impact?

We’re positioned to accelerate NLP
development for underserved populations.
Starting companies
Volunteering time
Creating NLP hackathons
All from where we are in Miami…

Contact Information
cfarrelly@med.miami.edu
