NLP: Challenges and Opportunities in
Underserved Areas
Colleen M. Farrelly, Machine Learning Lead
Natural Language
Processing
• Many applications of text data
• Customer feedback
• Legal documents
• Job search/other search engines
• Image captions
• Product titles
• Need to wrangle text into matrix form in many
applications
• Embeddings
• Parts-of-speech counts
• Sentiment analysis results
Common Tools:
Sentiment Analysis
• Understand positive/negative/neutral tone of text
data
• Expansion to other emotions:
• Anger
• Sadness
• Surprise
• Some uses:
• Identifying customer churn
• Evaluating educational interventions
• Predicting clinical outcomes
• Some packages exist for some languages and
applications.
• Other languages or emotions require custom code
and dictionaries.
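A minimal sketch of the dictionary-based approach described above, assuming a tiny hand-made lexicon. The words, the labels, and the helper names `score_text` and `overall_tone` are hypothetical illustrations, not part of any existing package. A real project for an unsupported language would substitute a curated or crowd-sourced dictionary.

```python
import re
from collections import Counter

# Hypothetical toy lexicon mapping words to tone/emotion labels.
# A real dictionary would be curated by native speakers.
LEXICON = {
    "good": "positive",
    "great": "positive",
    "helpful": "positive",
    "slow": "negative",
    "broken": "negative",
    "furious": "anger",
}


def score_text(text, lexicon=LEXICON):
    """Count how many lexicon words of each label appear in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(lexicon[t] for t in tokens if t in lexicon)


def overall_tone(text, lexicon=LEXICON):
    """Return positive/negative/neutral by comparing lexicon hit counts."""
    counts = score_text(text, lexicon)
    pos, neg = counts["positive"], counts["negative"]
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


print(overall_tone("The staff were helpful and the app is great"))
print(score_text("I am furious that the app is broken"))
```

Adding another emotion (sadness, surprise) only requires new labeled entries in the dictionary, which is why the dictionary itself is the main cost for a new language.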
Common Tools:
Embeddings
Capture relative frequency of word use within a text
and across texts
• Can use down-weighting to ignore common words like “a” or
“the”
• Don’t capture context well in the simple versions
• She bolted the door shut.
• She bolted out the door.
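The frequency-based idea above can be sketched with a plain TF-IDF weighting, one common way to down-weight words that appear in every text. This is an illustrative assumption about what "down-weighting" means here, and the helper names `tokenize` and `tfidf` are hypothetical. The two example sentences from the slide show the limitation: "bolted" gets the same column and the same weight in both, even though its meaning differs.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf(docs):
    """Return (vocabulary, matrix) with one TF-IDF row per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each word appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vocab = sorted(df)

    matrix = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        row = []
        for word in vocab:
            if total == 0:
                row.append(0.0)
                continue
            # Words found in every document get log(1) = 0 weight.
            idf = math.log(n_docs / df[word])
            row.append((tf[word] / total) * idf)
        matrix.append(row)
    return vocab, matrix


docs = ["She bolted the door shut.", "She bolted out the door."]
vocab, matrix = tfidf(docs)
for word, w0, w1 in zip(vocab, matrix[0], matrix[1]):
    print(f"{word:>7}  {w0:.3f}  {w1:.3f}")
```

Only "shut" and "out" receive non-zero weight in this toy corpus. The shared words, including "the" and both senses of "bolted", are weighted identically, which is the gap that context-aware models aim to close.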
Pretrained encoder/decoder neural networks that
can capture context
• BERT
• GPT-3
Most pretrained models only support a limited
number of languages (though there are ways to train
a similar model on a new-language corpus)…
Consider the
apps you use
every day…
Now imagine
they didn’t
exist in your
language…
NLP Needs in
Underserved
Areas
• Translation and speech-to-text for
unsupported languages (Hausa,
Lingala, Quechua…)
• Sentiment dictionaries for unsupported
languages/emotional nuances of the
language
• NLP-powered apps (search engines,
matching/recommenders, symptom
checkers, conversational agents…)
• Language preservation of endangered
languages
• 308 highly endangered ones just in
Africa
Market Size for NLP Applications
• Worldwide NLP market projected
to grow from $21B in 2021 to
$127B by 2028.
• South America and Africa are
mostly ignored markets for NLP-
backed technology in healthcare,
travel, retail, education, and other
markets.
• Local companies and universities
are currently trying to meet market
needs.
Caveats…
• Collecting the data
• Existing sources, creating written sources for non-written languages (3074 of 7139 languages that exist)
• Capturing speech tone variety, storing large audio files for non-written languages
• Getting large enough sample sizes from endangered languages (Domari in Northern Africa/Middle East)
• Ownership of data
• Foreign corporations? Governments? Universities? Local speakers?
• Biases and misuses
• Unintentional translation issues from non-native speakers reviewing technology (e.g., diseases/symptoms)
• Lack of representation in languages targeted/training in NLP (wealthy world vs. developing world)
• Use of technologies to spread conflict (companies, world powers, neighboring countries… interfering)
Case Studies: Recent
Collaborations
Sub-Saharan Africa
Customized
Dictionaries and
Embeddings
• AfroLeadership
• Crowd-source local language
sentiment dictionaries, writing
samples for embeddings…
• Led by students and researchers
at local Cameroonian universities
• Hausa Hackathon
• Non-profit initiative to build
corpus/dictionaries and build
applications to support the Hausa
language
• Hackathons for Hausa speakers
and NLP professionals interested
in Hausa applications
• Masakhane
• Non-profit collaboration of NLP
researchers in Africa
• Broad set of target languages
Companies Powered by NLP
• Mpuza Inc
• Job matching app connecting companies and job
seekers
• Powered by NLP-based matching engine
• Caveat of needing filters for extremism recruiting:
• Rwanda history and neighboring DRC violence
• Need to identify extremist recruitment job posts
• Name changes of extremist groups
• Concealed recruitment/threats…
• False positives for human rights and security positions
Miami’s Unique Position
Questions…
• How many are familiar with NLP?
• How many lived in another country
as a child?
• How many are interested in making
money or making a social good
impact?
We’re positioned to accelerate NLP
development for underserved populations.
Starting companies
Volunteering time
Creating NLP hackathons
All from where we are in Miami…
Contact Information
cfarrelly@med.miami.edu
