[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptx

Initiative for the development of open
NLP/NLU resources and tools
for the Serbian language
Vuk Batanović, PhD, ETF Belgrade Innovation Center
Tanja Samardžić, PhD, University of Zürich
Slobodan Marković, UNDP Serbia

In recent years, large language models
have proven successful in natural language
processing and understanding (NLP/NLU)

ChatGPT and GPT-4 have made considerable strides in natural
language processing and understanding.
The key to the success of these models is not only the vast volume
of text used for self-supervised training, but also the availability of
high-quality datasets for supervised fine-tuning for a wide range of
NLP/NLU tasks and linguistic domains.
The current "AI revolution" would not be possible without multi-year
public and corporate investments in high-quality datasets, which
are currently available predominantly for the English language.

How well is the Serbian language supported?

Getting better, but...
Even when large language models "speak Serbian" (which is not often the case),
they are not substantially fine-tuned for the Serbian market, limiting their
practical and business application.
• When working with Cyrillic and Latin, they perform worse and cost more
• They sometimes mix ekavian and iekavian pronunciation in their responses
or give answers in similar languages (Croatian, Slovenian, and Macedonian)
• They give worse answers in situations that go beyond the scope of everyday
conversational language (specific language domains)
• They give incorrect or undesirable answers in the context of Serbian culture
• They have limited capabilities for expression in Serbian (for example,
generating speech with varied emotions and support for local dialects)
• Their use is often not viable in business applications that handle large
amounts of text, require rapid responses, must guarantee data confidentiality, etc.
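One plausible contributor to the higher cost with Cyrillic is text encoding: many models tokenize at or near the byte level, and Cyrillic letters take two bytes each in UTF-8 while basic Latin letters take one. The sketch below only compares byte lengths of the same sentence in both scripts. It is a rough proxy under that assumption, not a measurement of any particular model's tokenizer, and the example sentence and function name are illustrative.

```python
# Illustrative sketch: UTF-8 byte length as a rough proxy for tokenizer cost.
# Assumption: the tokenizer works at or near the byte level. Real token
# counts depend on the specific tokenizer and are not measured here.

# The same Serbian sentence ("Good day, how are you?") in both scripts.
LATIN = "Dobar dan, kako ste?"
CYRILLIC = "Добар дан, како сте?"


def utf8_len(text: str) -> int:
    """Return the number of bytes needed to encode text as UTF-8."""
    return len(text.encode("utf-8"))


latin_bytes = utf8_len(LATIN)
cyrillic_bytes = utf8_len(CYRILLIC)

print(f"Latin:    {len(LATIN)} characters, {latin_bytes} bytes")
print(f"Cyrillic: {len(CYRILLIC)} characters, {cyrillic_bytes} bytes")
print(f"Byte ratio (Cyrillic / Latin): {cyrillic_bytes / latin_bytes:.2f}")
```

Both strings have the same number of characters, but the Cyrillic one needs more bytes, so a byte-oriented tokenizer has more raw material to split into tokens.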

What could be better?
A greater proportion of Serbian text in training corpora
More high-quality datasets for fine-tuning models for different
language domains and NLP/NLU tasks
More datasets for model evaluation
+
Ideally, as much as possible should be freely available to the public
under a permissive license, and should cover both ekavian and
iekavian pronunciations of Serbian

Small language communities, like ours,
need to invest in language technologies

Estonia
language community: 1.16 million

Israel

Denmark

Iceland

Slovenia

Present situation
Global IT giants
They have little interest in developing support for the Serbian language because we are a small market
and a low priority. When they do offer something, it is on commercial and restrictive terms.
Academic community
The volume and scope of academic research in the field of NLP/NLU for Serbian is insufficient.
Furthermore, this research is typically carried out as basic research rather than being applied
in industry.
Serbian companies and start-ups
In theory, they are interested in meeting local market demands. However, significant upfront
investments in high-quality datasets and model development are difficult to justify in the context of
the small and low-income Serbian/regional market, resulting in a slow return on investment.
Government
In recent years, AI has been one of the government's top priorities, and significant progress has been
made. However, support for the development of language technologies remains insufficient.

Impact
There are very few Serbian NLP/NLU software products
The endangered status of the Serbian language in the digital age
Instead of having a reliable and easily accessible foundation,
Serbian IT companies and start-ups waste time and money
integrating disparate solutions and “reinventing the wheel”,
i.e. re-creating basic tools and data sets

Time goes by... Our kids are already conversing in English
with digital assistants, and the gap will only grow wider
For example, in virtual/augmented reality, voice (converted to text)
will be the primary mode of user-computer interaction

Initiative for the development of open NLP resources for Serbian
September 1, 2021
Vuk Batanović, PhD
ETF Belgrade Innovation Center
Tanja Samardžić, PhD
University of Zürich
Slobodan Marković
UNDP Serbia

Initiative goals
1. Create a basic set of NLP/NLU resources for
the Serbian language that are publicly and easily
accessible, under a license that permits them to
be used for any purpose (including commercial)
2. Gather and coordinate the local community
(IT industry, academic community, government)
that will contribute to the project’s implementation
by donating material resources, expertise, and
intellectual property

What do we aim to produce?*
Priority resources and tools for:
1. Improved text search, including named entities
2. Improved text understanding (recognizing
semantic similarity and generating answers to
questions)
+ all the above for ekavian and iekavian
pronunciations of Serbian
3. Creation of educational materials for software
engineers to learn how to implement NLP/NLU
for Serbian
Labeled datasets
Fine-tuned models
* after consultations with more than 40 organizations of the local IT community

What do we get?
Greater flexibility and independence – the datasets produced can be used for
training, fine-tuning, and evaluation of both closed (commercially available)
and open-source models
Lower individual investments and higher quality – instead of everyone
starting from scratch, everyone gets a reliable and high-quality foundation
from which to build, while retaining their competitive edge (because the
basic model is insufficient, each solution requires additional adaptation to
the user's needs/data, integration into business processes, continuous
support, and so on)
Faster development of high-quality NLP/NLU solutions with Serbian support
– by internal corporate IT teams, Serbian IT companies, and start-ups (which
are currently virtually non-existent in this field)

The project is being implemented by a consortium of
Initial financial and other assistance agreements were signed with

What are we doing this year, and how far have we come?
1. Selection of texts to cover the language domain (January – March)
2. Automated processing using existing tools: tokenization, lemmatization, part-of-speech tagging (April – May)
3. Pronunciation conversion: ekavian and iekavian variants (May – June)
4. Manual check and correction of the initial automated processing (May – September)
5. Evaluation of the existing models (October – November)
6. Evaluation of models fine-tuned on the new dataset (October – November)
7. Transition to business applications (December – January)
8. Results publication and preparation for a new project (December – January)
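To make steps 2 and 3 more concrete, here is a minimal sketch of tokenization followed by ekavian-to-iekavian conversion. It is not the initiative's tooling: the regular expression, the function names, and the six-word lexicon are hypothetical and chosen only for illustration. Real conversion needs a full lexicon and context, since most words containing "e" are identical in both variants, which is why the plan includes a manual check in step 4.

```python
import re

# Hypothetical toy lexicon (ekavian form -> iekavian form).
# A real converter would need a far larger, manually checked lexicon.
EKAVIAN_TO_IEKAVIAN = {
    "mleko": "mlijeko",
    "lepo": "lijepo",
    "dete": "dijete",
    "pesma": "pjesma",
    "reka": "rijeka",
    "vreme": "vrijeme",
}

# Words are runs of word characters; each punctuation mark is its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def match_case(source: str, target: str) -> str:
    """Copy an initial capital letter from source onto target."""
    if source[0].isupper():
        return target[0].upper() + target[1:]
    return target


def to_iekavian(text: str) -> str:
    """Replace known ekavian words with their iekavian forms."""

    def convert(match: re.Match) -> str:
        word = match.group(0)
        replacement = EKAVIAN_TO_IEKAVIAN.get(word.lower())
        # Words missing from the lexicon are left unchanged.
        return match_case(word, replacement) if replacement else word

    return re.sub(r"\w+", convert, text)


sentence = "Dete pije mleko."  # "The child drinks milk."
print(tokenize(sentence))
print(to_iekavian(sentence))
```

A plain lookup like this keeps the conversion easy to audit, at the price of missing inflected forms and anything outside the lexicon.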

This is only the beginning.
We have broken new ground,
but there’s still much work ahead

Therefore…
If you have developed or plan to develop an NLP/NLU solution for the
Serbian language (and its variants spoken in Serbia, Montenegro,
Bosnia and Herzegovina)
If automated text processing can benefit your (e-)business (in terms
of better search, better recommendations, better customer support...)
If you want to position your organization as socially responsible and
willing to help the Serbian IT market and local tech community grow
If you want to contribute to the preservation of the Serbian language
in the digital age

Slobodan Marković
slobodan.markovic@undp.org
+381 63 387 260
