Mārcis Pinnis discusses language technologies and how they are developed. He explains that language technologies are trained on language data, but this data becomes outdated because language constantly changes with societal and technological advances. As a result, language technologies can fail or produce errors when processing language that has changed since the data was collected. Both developers and users have roles to play in addressing this: developers should continuously collect new data and use adaptive methods, while users should share language data to help improve technologies over time.
2. A little about me
Developing language technologies since 2006
Overseeing AI research at Tilde since 2019
3. What will I talk about?
ꟷ What are language technologies?
ꟷ How are language technologies developed today?
ꟷ Examples of when language technologies fail
ꟷ What can we do about it?
4. What are language technologies?
Solutions that analyze, produce, modify or respond to human texts and speech.
ꟷ Spelling and grammar checking
ꟷ Machine translation
ꟷ Speech processing
ꟷ Virtual assistants, dialog systems, etc.
ꟷ Electronic dictionaries
ꟷ Anonymization
ꟷ Terminology management
… and many, many more!
10. Language is not constant – once you train a
model on some data, it becomes outdated!
Source: https://chat.openai.com
11. Language data is often the main reason why
language technologies generate errors
Typical challenges with language data are:
ꟷ There is never enough data
ꟷ Data is noisy
ꟷ Data is obsolete
ꟷ Data is not in the right domain
13. Language is not constant –
the focus of the society is a changing factor
Language use “follows”
[Figure: two charts of frequency in news (LV) by year, 2014 to 2022, one for “Ukraina (lv) / Ukraine (en)” and one for “Koronavīruss (lv) / Coronavirus (en)”]
Source: NewsCrawl corpus (https://data.statmt.org/news-crawl)
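The kind of trend shown on this slide (how often a word appears in news text per year) can be approximated with a few lines of code. The following is a minimal sketch and not the method used for the slide: the function name `yearly_relative_frequency`, the prefix-based matching (a crude stand-in for proper Latvian lemmatization, which will also match related words), and the toy sentences are all assumptions made for illustration. Real input would come from a corpus such as NewsCrawl, grouped by year.

```python
def yearly_relative_frequency(docs_by_year, stem):
    """Return {year: matches per million tokens} for tokens starting with `stem`.

    Prefix matching is a rough approximation for an inflected language:
    it counts "Ukraina", "Ukrainas", "Ukrainai", but also related words.
    """
    stem = stem.lower()
    result = {}
    for year, docs in sorted(docs_by_year.items()):
        # Whitespace tokenization with surrounding punctuation stripped.
        tokens = [tok.strip('.,;:!?"()').lower() for doc in docs for tok in doc.split()]
        tokens = [tok for tok in tokens if tok]  # drop tokens that were only punctuation
        hits = sum(1 for tok in tokens if tok.startswith(stem))
        result[year] = 1_000_000 * hits / len(tokens) if tokens else 0.0
    return result


# Made-up toy sentences, NOT taken from NewsCrawl.
toy_docs = {
    2021: ["Šodien līst lietus .", "Koronavīrusa izplatība turpinās"],
    2022: ["Ukraina saņem palīdzību", "Atbalsts Ukrainai turpinās"],
}

for year, freq in yearly_relative_frequency(toy_docs, "ukrain").items():
    print(year, round(freq))
```

Normalizing by the total token count per year matters because the amount of crawled news differs from year to year; raw counts alone would mix changes in language use with changes in corpus size.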
14. Language is not constant –
the society is constantly advancing
Language becomes richer
The Terminology Commission of the Latvian Academy of Sciences regularly introduces new terminology in Latvian, e.g.:
English term | Translation into Latvian (introduced in May 2023)
parasailing | izpletņbraukšana
backpacker | mugursomnieks
Source: https://termini.gov.lv/komisija/lza-tk-23052023-sedes-protokols-nr-51175
15. Language is not constant –
the society is constantly advancing
Language keeps changing
The Terminology Commission of the Latvian Academy of Sciences sometimes alters existing terminology in Latvian, e.g.:
English term | Before May 2023 | Since May 2023
cooling | aukstumapgāde | dzesēšana
engineering and communication systems | inženierkomunikācijas | inženiersistēmas
Source: https://termini.gov.lv/komisija/lza-tk-23052023-sedes-protokols-nr-51175
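One practical consequence of altered terminology is that existing texts and training data may still contain the older terms. A minimal sketch of how such terms could be flagged is shown below. The mapping contains only the two examples from this slide; the names `TERM_UPDATES` and `find_outdated_terms` are hypothetical, and the whole-word, base-form matching is a simplifying assumption (a real checker for Latvian would need to handle inflected forms).

```python
import re

# Older term -> current term, limited to the two examples from the slide.
TERM_UPDATES = {
    "aukstumapgāde": "dzesēšana",
    "inženierkomunikācijas": "inženiersistēmas",
}


def find_outdated_terms(text, updates=TERM_UPDATES):
    """Return (old_term, new_term, position) tuples, ordered by position in text.

    Matches whole words only, case-insensitively, and only in the exact
    form listed in `updates` (inflected forms are not detected).
    """
    findings = []
    for old, new in updates.items():
        pattern = re.compile(r"\b" + re.escape(old) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            findings.append((old, new, match.start()))
    return sorted(findings, key=lambda finding: finding[2])


sample = "Ēkas aukstumapgāde un inženierkomunikācijas"
for old, new, pos in find_outdated_terms(sample):
    print(f"position {pos}: '{old}' has been replaced by '{new}'")
```

Keeping the mapping as data rather than code means it can be extended whenever new decisions are published, without touching the matching logic.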
16. Societal efforts may introduce new concepts
or alter existing ones
Example – gender-neutral language
In Germany, the “gender star” is being introduced in public sector communication to express gender neutrality:
Referent*in (w/m/d) in der Social-Media-Analyse (w/m/d)
/Consultant (f/m/d) in social media analysis (f/m/d)/
Die Mitarbeiter:innen stehen im Zentrum
/Employees are at the center/
Source of examples: https://www.auswaertiges-amt.de
17. Societal efforts may alter existing concepts
Source: https://likumi.lv/ta/id/331352-par-ukrainas-pilsetu-nosaukumu-atveidi-latviesu-valoda
In 2022, the State Language Center of Latvia decided that the names of 31 Ukrainian towns and cities
will be rendered in Latvian following the original Ukrainian (and not Russian) spelling.
18. Even if you can keep up with the pace of change, your language data will never be complete
Language is naturally ambiguous and sparse
Source: https://twitter.com/krisjaniskarins/status/1705071215520481494
19. Language data is often English-centric
More data is available in English and about English-speaking regions.
In other words, the data has probably never seen some “random person” from a “random place” somewhere outside the US/UK.
If you are that “random person”, AI becomes personal!
20. I am such a “random person”!
Sometimes AI tends to f*%# up my name.
21. Language is changing!
What are our options?
For language technology developers
ꟷ Collect and don’t stop collecting data
ꟷ Source local data (collect or synthesize)
ꟷ Plan to deliver models iteratively
ꟷ Use adaptive methods to adjust to a changing language
22. Language is changing!
What are our options?
For language technology users
ꟷ Pay attention to data management processes in your organization
ꟷ (Language) data is gold – do not lose it!
ꟷ Share your (language) data openly if you want to benefit more from “free” AI services. No one except you has data in your narrow subject.
ꟷ Use public infrastructure to do that:
European Language Resource Coordination (ELRC-SHARE)
European Language Grid (ELG)
23. Takeaways
Language technologies are integral to our day-to-day activities with computers
ꟷ we become more productive
ꟷ we can access more information
ꟷ we can reach wider audiences
Language technologies are not 100% precise
ꟷ Languages are complex and constantly changing
ꟷ There will always be cases where they fail
However, if we develop our systems to expect such changes, we can effectively mitigate errors (and make our customers happier).