The document introduces KotlinNLP, an open-source natural language processing library written in Kotlin. It discusses why Kotlin was chosen, provides an overview of the KotlinNLP library and its components, and describes some key features. The library includes modules for tasks such as language detection, tokenization, and dependency parsing based on neural networks and transition-based systems. It aims to make state-of-the-art NLP techniques accessible without requiring deep learning expertise.
Intro to KotlinNLP
1. Intro to KotlinNLP: An Open-Source Library for Natural Language Processing
https://github.com/KotlinNLP
“Reinvent the wheel if it helps you sleep at night.”
- KotlinNLP Authors
2. OUTLINE
▪ Why Kotlin?
▪ What do we mean by NLP?
▪ KotlinNLP Library Overview (with a look at the code on GitHub!)
▪ Repositories / General Architecture
▪ Some more details on SimpleDNN (Machine Learning) and NeuralParser (NLP)
▪ Questions?
3. WHY?
1. Over the past 24 months, the adoption and use of the Kotlin programming language among developers have seen tremendous growth.
https://pusher.com/state-of-kotlin
2. We think there is space for a library written entirely in Kotlin and dedicated to Machine Learning, and to Natural Language Processing in particular. Other experiments (e.g. Komputation) are not widely established yet.
3. We like Kotlin, a lot!
4. WHAT DO WE MEAN BY NLP?
▪ Language Identification / Text Tokenization
▪ Structural Analysis
▪ Morphological Analysis
▪ Part of Speech Tagging
▪ Dependency Parsing
▪ Named Entity Recognition / Date/Time Recognition
▪ Semantic Parsing
▪ Anaphora Resolution
▪ Semantic Role Labeling
▪ Intent Detection
▪ Entity Linking (Geoparsing)
▪ Text Categorization / Profiling
▪ Topic Detection / Sentiment Analysis
5. We believe you don't need to be a Deep Learning expert to work with state-of-the-art Natural Language Processing techniques.
https://github.com/KotlinNLP
MOZILLA PUBLIC LICENSE
VERSION 2.0
6. SimpleDNN
SimpleDNN is a lightweight open-source machine learning library designed to support relevant neural network architectures in natural language processing tasks.
LinguisticDescription
LinguisticDescription is a Kotlin library designed to support linguistic annotations at the morphological, syntactic and semantic levels of complex languages.
LanguageDetector
LanguageDetector is an easy-to-use text language detector that uses the Hierarchical Attention Networks (HAN) from the SimpleDNN library.
NeuralTokenizer
NeuralTokenizer is an easy-to-use text tokenizer and sentence splitter that uses neural networks from the SimpleDNN library.
NeuralParser
NeuralParser is an easy-to-use dependency parser based on the SimpleDNN library and the SyntaxDecoder transition systems framework.
TokensEncoder
TokensEncoder is a neural processor that transforms a sentence into a dense encoded representation.
MorphologicalAnalyzer
MorphologicalAnalyzer is a Kotlin library designed to support the morphological analysis of a text, including enclitics, multi-words, numbers and time expressions.
GeoLocation
GeoLocation is a Kotlin library designed to support the identification of geo-locations in a text.
Others
CoNLLIO, DependencyTree, Utils, …
7. SimpleDNN
Mathematical operations within the SimpleDNN library are performed on the CPU with jblas. GPU support is still missing; we need your help!
More than 1,000 unit tests written with Spek.
8. LinguisticDescription
LinguisticDescription is shared by all the linguistic processors: it contains all the structures for describing linguistic phenomena, from tokens to morpho-syntactic structures.
Compatible with the morphological dictionaries for the Italian language available from the LINDAT repository:
https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2630
20. Transition-based Parsing ☺
• Built-in support for greedy, non-monotonic and multi-threaded beam decoding.
• Training can be done using static, dynamic or non-deterministic oracles.
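To make the idea of greedy decoding driven by a static oracle more concrete, here is a minimal sketch of a transition system in Kotlin. It assumes the classic arc-standard system (the slide does not say which transition systems are implemented), and every name in it (`Action`, `State`, `staticOracle`, `parseWithOracle`) is hypothetical and not taken from the KotlinNLP code. A static oracle derives the correct action from the gold tree at each step; at decoding time a trained model would score the actions instead.

```kotlin
// Illustrative only: these types are NOT part of KotlinNLP.
enum class Action { SHIFT, LEFT_ARC, RIGHT_ARC }

// Arc-standard parser state. Token 0 is an artificial root; real tokens are 1..tokenCount.
class State(tokenCount: Int) {
    val stack = ArrayDeque(listOf(0))
    val buffer = ArrayDeque((1..tokenCount).toList())
    val heads = IntArray(tokenCount + 1) { -1 }   // heads[i] = head of token i, -1 = not assigned yet

    val isFinal: Boolean get() = buffer.isEmpty() && stack.size == 1

    fun perform(action: Action) {
        when (action) {
            // Move the next token from the buffer onto the stack.
            Action.SHIFT -> stack.addLast(buffer.removeFirst())
            // The second-from-top token becomes a dependent of the top token.
            Action.LEFT_ARC -> {
                val top = stack.removeLast()
                heads[stack.removeLast()] = top
                stack.addLast(top)
            }
            // The top token becomes a dependent of the second-from-top token.
            Action.RIGHT_ARC -> {
                val top = stack.removeLast()
                heads[top] = stack.last()
            }
        }
    }
}

// Static oracle: picks the single action that leads to the gold tree (projective trees only).
// gold[i] is the gold head of token i (0 = root); gold[0] is ignored.
fun staticOracle(state: State, gold: IntArray): Action {
    if (state.stack.size >= 2) {
        val top = state.stack[state.stack.size - 1]
        val below = state.stack[state.stack.size - 2]
        if (below != 0 && gold[below] == top) return Action.LEFT_ARC
        // The top token may be attached only once all of its own dependents are attached.
        val topIsComplete = (1 until gold.size).none { gold[it] == top && state.heads[it] == -1 }
        if (gold[top] == below && topIsComplete) return Action.RIGHT_ARC
    }
    check(state.buffer.isNotEmpty()) { "gold tree is not projective" }
    return Action.SHIFT
}

// Greedy loop: apply one action per step until the final state is reached.
fun parseWithOracle(gold: IntArray): Pair<IntArray, List<Action>> {
    val state = State(gold.size - 1)
    val actions = mutableListOf<Action>()
    while (!state.isFinal) {
        val action = staticOracle(state, gold)
        state.perform(action)
        actions += action
    }
    return state.heads to actions
}
```

For the three-token sentence "She reads books" with gold heads `intArrayOf(-1, 2, 0, 2)`, `parseWithOracle` reproduces the gold heads through the sequence SHIFT, SHIFT, LEFT_ARC, SHIFT, RIGHT_ARC, RIGHT_ARC. Beam decoding would keep several candidate states per step rather than one, and a dynamic oracle would also define correct actions for states that have already left the gold path.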
21. LHRParser
Latent Heads Representation (Grella and Cangialosi, 2018)
A state-of-the-art neural dependency parser that implements a novel approach based on a bidirectional recurrent autoencoder to perform globally optimized non-projective parsing via semi-supervised learning.
https://arxiv.org/abs/1802.02116
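As a reminder of what "non-projective" means here: a dependency tree is projective when no two arcs cross if they are drawn above the sentence, and non-projective otherwise. The following Kotlin sketch checks this property. The function name and the head-array representation are assumptions made for illustration, not part of LHRParser or any KotlinNLP API.

```kotlin
// Illustrative helper, not part of KotlinNLP.
// heads[i] is the head of token i; index 0 is the artificial root and heads[0] is ignored.
fun isProjective(heads: IntArray): Boolean {
    for (a in 1 until heads.size) {
        val aLeft = minOf(a, heads[a])
        val aRight = maxOf(a, heads[a])
        for (b in 1 until heads.size) {
            val bLeft = minOf(b, heads[b])
            val bRight = maxOf(b, heads[b])
            // Two arcs cross when one starts strictly inside the other and ends strictly outside it.
            if (aLeft < bLeft && bLeft < aRight && aRight < bRight) return false
        }
    }
    return true
}
```

With this representation, `isProjective(intArrayOf(-1, 2, 0, 2))` is true, while `isProjective(intArrayOf(-1, 3, 4, 0, 3))` is false because the arc between tokens 1 and 3 crosses the arc between tokens 2 and 4.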