Arabic morphology encapsulates many valuable features, such as a word's root. Arabic roots are utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivational languages such as Arabic. However, stemming faces the problem of ambiguity, where two or more roots can be extracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence model that captures the meaning of a word based on its context. In this paper, a distributional semantics model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It achieved an accuracy of 81.5%, an improvement of at least 9.4% over other stemmers.
Corpus annotation for corpus linguistics (Nov 2009), Jorge Baptista
Lecture on corpus annotation for corpus linguistics. Contents: DIY corpus, e-texts, character set and text encoding issues, document structure, DTDs, documentation;
tools and issues in annotation procedures, good practices; examples from anaphora resolution and named entity recognition annotation campaigns; evaluation of corpus annotation
Near Duplicate Document Detection: Mathematical Modeling and Algorithms, Liwei Ren (任力偉)
Near-duplicate document detection is a well-known problem in information retrieval, and an important one to solve for many applications in the IT industry. It has been studied extensively in the research literature. This article provides a novel solution to this classic problem. We present the problem with abstract models, along with additional concepts such as text models, document fingerprints, and document similarity. With these concepts, the problem can be transformed into a keyword-search-like problem with results ranked by document similarity. There are two major techniques: the first is to extract robust and unique fingerprints from a document; the second is to calculate document similarity effectively. Algorithms for both fingerprint extraction and document similarity calculation are introduced as a complete solution.
Robust extended tokenization framework for Romanian by semantic parallel text... (ijnlc)
Tokenization is considered a solved problem when reduced to just word-border identification and the handling of punctuation and white space. Obtaining a high-quality outcome from this process is essential for subsequent NLP pipeline processes (POS-tagging, WSD). In this paper we claim that to obtain this quality we need to use all linguistic, morphosyntactic, and semantic-level word-related information, as necessary, in the tokenization disambiguation process. We also claim that semantic disambiguation performs much better in a bilingual context than in a monolingual one. We then show that, for disambiguation purposes, the bilingual text provided by high-profile on-line machine translation services performs at almost the same level as human-originated parallel texts (the gold standard). Finally, we claim that the tokenization algorithm incorporated in TORO can be used as a criterion for the comparative quality assessment of on-line machine translation services, and we provide a setup for this purpose.
A Word Stemming Algorithm for the Hausa Language (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA (ijnlc)
Morphological segmentation is a fundamental task in language processing. Some languages, such as Arabic and Tigrinya, have words packed with very rich morphological information. Therefore, unpacking this information becomes a necessary task for many downstream natural language processing tasks. This paper presents the first morphological segmentation research for Tigrinya. We constructed a new morphologically segmented corpus with 45,127 manually segmented tokens. Conditional random fields (CRF) and window-based long short-term memory (LSTM) neural networks were employed separately to develop our boundary detection models. We applied language-independent character and substring features for the CRF, and character embeddings for the LSTM networks. Experiments were performed with four variants of the Begin-Inside-Outside (BIO) chunk annotation scheme. We achieved a 94.67% F1 score using bidirectional LSTMs with a fixed-size window approach to morpheme boundary detection.
Author credits - Maaz Nomani
This paper presents our work on the annotation of intra-chunk dependencies on an English treebank that was previously annotated with inter-chunk dependencies. Given a natural language sentence, intra-chunk dependencies show relationships between words within a chunk, and inter-chunk dependencies show relationships among the chunks. For example, in the sentence “James cut an apple with a knife”, chunking results in (James) (cut) (an apple) (with a knife). Inter-chunk dependencies show dependency relations among these four chunks, and intra-chunk dependencies show dependency relations between the words within each chunk.
This exercise provides fully parsed dependency trees for the English treebank. We also report an analysis of the inter-annotator agreement for this chunk expansion task. Further, these fully expanded parallel Hindi and English treebanks were word-aligned, and an analysis of that task is given. Issues related to intra-chunk expansion and alignment for the Hindi-English language pair are discussed, and guidelines for these tasks have been prepared and released.
An expert system for automatic reading of a text written in Standard Arabic (ijnlc)
In this work we present our expert system for automatic reading, or speech synthesis, of text written in Standard Arabic. Our work is carried out in two main stages: the creation of the sound database, and the transformation of written text into speech (Text-to-Speech, TTS). This transformation is done first by a Phonetic Orthographical Transcription (POT) of any written Standard Arabic text, with the aim of transforming it into its corresponding phonetic sequence, and second by generating the voice signal that corresponds to the transcribed chain. We describe the different design stages of the system, as well as the results obtained, compared to other works studied in realizing TTS for Standard Arabic.
In this talk I intend to review some basic and high-level concepts such as formal languages, grammars, and ontologies: languages to transmit knowledge from a sender to a receiver; grammars to formally specify languages; ontologies as formal specifications of specific knowledge domains. After this introductory revision, emphasizing the role of each of those elements in the context of computer-based problem solving (programming), I will talk about a project aimed at automatically inferring and generating a grammar for a Domain Specific Language (DSL) from a given ontology that describes this specific domain. The transformation rules will be presented, and the system, Onto2Gra, that fully implements this "Ontological approach for DSL development" will be introduced.
This paper analyses the structure patterns of code-switching quantitatively and qualitatively based on EFL classroom discourse. Through detailed analysis, the paper finds that there are different structure patterns in which teachers often switch codes in the English classroom. These structure patterns are reflected at different language levels: the word and phrase level, and the clause and sentence level. The functions of code-switching are determined by the structure patterns that teachers choose for different purposes in the process of teaching.
Building of Database for English-Azerbaijani Machine Translation Expert System, Waqas Tariq
In this article the results of the development of a machine translation expert system are presented. An approach to defining translation correspondences is suggested as the basis for the creation of the database and knowledge base of the system. The methods of compiling transformation rules applied for the linguistic knowledge base of the expert system are based on defining translation correspondences between the Azerbaijani and English languages.
Lectura 3.5: Word normalization in Twitter using finite-state transducers
Word normalization in Twitter using finite-state transducers
Jordi Porta and José Luis Sancho
Centro de Estudios de la Real Academia Española
c/ Serrano 197-198, Madrid 28002
{porta,sancho}@rae.es
Abstract: This paper presents a linguistic approach based on weighted finite-state transducers for the lexical normalisation of Spanish Twitter messages. The system developed consists of transducers that are applied to out-of-vocabulary tokens. The transducers implement linguistic models of variation that generate sets of candidates according to a lexicon. A statistical language model is used to obtain the most probable sequence of words. The article includes a description of the components and an evaluation of the system and some of its parameters.
Keywords: Tweet Messages. Lexical Normalisation. Finite-state Transducers. Statistical Language Models.
1 Introduction

Text messaging (or texting) exhibits a considerable degree of departure from the writing norm, including spelling. There are many reasons for this deviation: the informality of the communication style, the characteristics of the input devices, etc. Although many people consider that these communication channels are "deteriorating" or even "destroying" languages, many scholars claim that even in this kind of channel communication obeys maxims and that spelling is also principled. Moreover, it seems that, in general, the processes underlying variation are not new to languages. It is under these considerations that the modelling of spelling variation, and also its normalisation, can be addressed. Normalisation of text messaging is seen as a necessary preprocessing task before applying other natural language processing tools designed for standard language varieties.
Few works dealing with Spanish text messaging can be found in the literature. To the best of our knowledge, the most relevant and recent published works are Mosquera and Moreda (2012), Pinto et al. (2012), Gómez Hidalgo, Caurcel Díaz, and Íñiguez del Río (2013), and Oliva et al. (2013).
2 Architecture and components of the system

The system has four main components that are applied sequentially: an analyser performing tokenisation and lexical analysis on standard word forms and on other expressions like numbers, dates, etc.; a component generating word candidates for out-of-vocabulary (OOV) tokens; a statistical language model used to obtain the most likely sequence of words; and finally, a truecaser giving proper capitalisation to common words assigned to OOV tokens.

Freeling (Atserias et al., 2006), with a special configuration designed for this task, is used to tokenise the message and identify, among other tokens, standard word forms.
The generation of candidates, i.e., the confusion set of an OOV token, is performed by components inspired by other modules used to analyse words found in historical texts, where other kinds of spelling variation can be found (Porta, Sancho, and Gómez, 2013). The approach to historical variation was based on weighted finite-state transducers over the tropical semiring implementing linguistically motivated models. Some experiments were conducted in order to assess the task of assigning to old word forms their corresponding modern lemmas. For each old word, lemmas were assigned via the possible modern forms predicted by the model. Results were comparable to the results obtained with the Levenshtein distance (Levenshtein, 1966) in terms of recall, but were better in terms of accuracy, precision and F. As for old words, the confusion set of an OOV token is generated by applying the shortest-paths algorithm to the following expression:

W ◦ E ◦ L

where W is the automaton representing the OOV token, E is an edit transducer generating possible variations on tokens, and L is the set of target words. The composition of these three modules is performed using an on-line implementation of the efficient three-way composition algorithm of Allauzen and Mohri (2008).
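As a rough, purely illustrative analogue of this expression, the composition can be imitated in plain Python: a one-edit generator stands in for the edit transducer E, and intersection with a toy lexicon stands in for composition with L. The lexicon and all names below are invented for this sketch; the actual system uses weighted transducers and shortest paths, not set intersection.

```python
import string

# Toy stand-in for the lexicon automaton L (invented for this sketch).
LEXICON = {"que", "queso", "casa", "cosa"}

def one_edit_variants(token, alphabet=string.ascii_lowercase):
    """All strings within Levenshtein distance 1 of token (stands in for E)."""
    splits = [(token[:i], token[i:]) for i in range(len(token) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return {token} | deletes | replaces | inserts

def confusion_set(oov_token):
    """Variants of the token that are actual words (stands in for ... ◦ L)."""
    return one_edit_variants(oov_token) & LEXICON

print(sorted(confusion_set("qeso")))  # ['queso']
```

In the real system the edit operations carry weights, so the shortest-paths algorithm ranks candidates rather than returning an unordered set.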
3 Resources employed

In this section, the resources employed by the components of the system are described: the edit transducers, the lexical resources and the language model.
3.1 Edit transducers

We follow the classification of Crystal (2008) for texting features present also in Twitter messages. In order to deal with these features, several transducers were developed. Transducers are expressed as regular expressions and context-dependent rewrite rules of the form α → β / γ _ δ (Chomsky and Halle, 1968) that are compiled into weighted finite-state transducers using the OpenGrm Thrax tools (Tai, Skut, and Sproat, 2011).
3.1.1 Logograms and Pictograms

Some letters are used as logograms, with a phonetic value. They are dealt with by optional rewrites altering the orthographic form of tokens:

ReplaceLogograms = (x (→) por) ◦ (2 (→) dos) ◦ (@ (→) a|o) ◦ . . .

Laughs, which are very frequent, are also considered logograms, since they represent sounds associated with actions. The multiple ways they are realised, including plurals, are easily described with regular expressions.

Pictograms like emoticons entered by means of ready-to-use icons on input devices are not treated by our system, since they are not textual representations. However, textual representations of emoticons like :DDD or xDDDDDD are recognised by regular expressions and mapped to their canonical form by means of simple transducers.
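The optional logogram rewrites can be sketched in plain Python by generating every combination of applied and unapplied replacements. The rule inventory below is only the fragment quoted above, and the function name is invented for the sketch:

```python
from itertools import product

# Fragment of the logogram rules quoted above; '@' has two readings.
LOGOGRAMS = {"x": ["por"], "2": ["dos"], "@": ["a", "o"]}

def logogram_expansions(token):
    """Candidates obtained by optionally rewriting each logogram character."""
    options = []
    for ch in token:
        options.append([ch] + LOGOGRAMS.get(ch, []))  # keep it, or rewrite it
    return {"".join(choice) for choice in product(*options)}

print(sorted(logogram_expansions("x2")))  # ['por2', 'pordos', 'x2', 'xdos']
```

Because every rewrite is optional, a token with n logogram characters yields up to 2^n candidates; the downstream lexicon and language model filter out the implausible ones.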
3.1.2 Initialisms, shortenings, and letter omissions

The string operations for initialisms (or acronymisation) and shortenings are difficult to model without incurring an overgeneration of candidates. For this reason, only common initialisms, e.g., sq (es que), tk (te quiero) or sa (se ha), and common shortenings, e.g., exam (examen) or nas (buenas), are listed.
For the omission of letters, several transducers are implemented. The simplest and most conservative one is a transducer introducing just one letter at any position of the token string. Consonantal writing is a special case of letter omission. This kind of writing relies on the assumption that consonants carry much more information than vowels do, which in fact is the norm in some languages, such as the Semitic languages. Some rewrite rules are applied to OOV tokens in order to restore vowels:

InsertVowels = invert(RemoveVowels)
RemoveVowels = Vowels (→) ǫ
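A plain-Python sketch of the InsertVowels = invert(RemoveVowels) idea: instead of inverting a transducer, the vowel-stripped token is matched against vowel-stripped lexicon entries. The toy lexicon and names are assumptions of the sketch.

```python
# Toy lexicon; the real system uses the full target-word list.
VOWELS = set("aeiouáéíóú")
LEXICON = {"buenas", "bueno", "besos", "gracias"}

def remove_vowels(word):
    return "".join(ch for ch in word if ch not in VOWELS)

# Inverting RemoveVowels amounts to indexing words by consonant skeleton.
SKELETONS = {}
for word in LEXICON:
    SKELETONS.setdefault(remove_vowels(word), set()).add(word)

def restore_vowels(token):
    """Lexicon words whose consonant skeleton matches the OOV token."""
    return SKELETONS.get(token, set())

print(sorted(restore_vowels("bns")))  # ['buenas']
```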
3.1.3 Standard non-standard spellings

We consider non-standard spellings standard when they are widely used. These include spellings for representing regional or informal speech, or choices sometimes conditioned by input devices, such as non-accented writing. In the case of accents and tildes, they are restored using a cascade of optional rewrite rules like the following:

RestoreAccents = (n|ni|ny|nh (→) ñ) ◦ (a (→) á) ◦ (e (→) é) ◦ . . .

Words containing k instead of c or qu, which appear frequently in protest writings, are also standardised with simple transducers. Some other changes are made to certain endings to recover the standard ending. There are complete paradigms like the following, which relates non-standard to standard endings:

-a → -ada
-as → -adas
-ao → -ado
-aos → -ados
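The ending paradigm can be sketched as a suffix-rewrite table filtered by a lexicon; only the paradigm quoted above is included, and the lexicon and names are invented for the sketch:

```python
# Non-standard -> standard endings, as in the paradigm above.
ENDINGS = {"ao": "ado", "aos": "ados", "a": "ada", "as": "adas"}
LEXICON = {"cansado", "cansados", "cansada", "cansadas"}  # toy lexicon

def standardise_ending(token):
    """Candidate standard forms obtained by rewriting a non-standard ending."""
    candidates = set()
    for bad, good in ENDINGS.items():
        if token.endswith(bad):
            candidate = token[: len(token) - len(bad)] + good
            if candidate in LEXICON:
                candidates.add(candidate)
    return candidates

print(sorted(standardise_ending("cansao")))  # ['cansado']
```

The lexicon check matters: the -a → -ada rule would otherwise fire on almost every word ending in -a.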
We also consider phonetic writing as a kind of non-standard writing in which a phonetic form of a word is alphabetically and syllabically approximated. The transducers used for generating standard words from their phonetic and graphical variants are:

DephonetiseWriting = invert(PhonographemicVariation)
PhonographemicVariation = GraphemeToPhoneme ◦ PhoneConflation ◦ PhonemeToGrapheme ◦ GraphemeVariation

In the previous definitions, PhoneConflation makes phonemes equivalent, for example the IPA phonemes /ʎ/ and /ʝ/. Linguistic phenomena such as seseo and ceceo, in which several phonemes had been conflated by the 16th century, still remain in spoken variants and are also reflected in texting. The GraphemeVariation transducer models, among others, the writing of ch as x, which could be due to the influence of other languages.
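The PhonographemicVariation cascade can be crudely approximated by normalising both tokens and lexicon entries to a shared "phonetic key". The mappings below cover only yeísmo, seseo/ceceo, and the ch/x and c/k variations mentioned above; the key function, mapping order, and toy lexicon are all assumptions of this sketch.

```python
import re

LEXICON = {"calle", "cereza", "chico"}  # toy lexicon

def phonetic_key(word):
    """Crude grapheme-level stand-in for GraphemeToPhoneme and PhoneConflation."""
    w = word.lower()
    w = w.replace("x", "ch")            # GraphemeVariation: x written for ch
    w = w.replace("k", "c")             # k written for c (assumption)
    w = w.replace("ll", "y")            # yeísmo: /ʎ/ conflated with /ʝ/
    w = re.sub(r"c([ei])", r"s\1", w)   # seseo/ceceo: c before e/i, as s
    w = w.replace("z", "s")
    return w

# Index lexicon entries by their phonetic key.
KEYS = {}
for word in LEXICON:
    KEYS.setdefault(phonetic_key(word), set()).add(word)

def dephonetise(token):
    """Standard words sharing the token's phonetic key."""
    return KEYS.get(phonetic_key(token), set())

print(sorted(dephonetise("caye")))  # ['calle']
```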
3.1.4 Juxtapositions
Spacing in texting is also non-standard. In the normalisation task, some OOV tokens are in fact juxtaposed words. The possible decompositions of a word into a sequence of possible words are given by shortest-paths(W ◦ SplitConjoinedWords ◦ L( L)+), where W is the word to be analysed, L( L)+ represents the valid sequences of words, and SplitConjoinedWords is a transducer introducing blanks ( ) between letters and optionally undoing possible fused vowels:

SplitConjoinedWords = invert(JoinWords)
JoinWords = (a a (→) a<1>) ◦ . . . ◦ (u u (→) u<1>) ◦ ( (→) ǫ)

Note that in the previous definition, some rules are weighted with a unit cost <1>. These costs are used by the shortest-paths algorithm as a preference mechanism to select non-fused over fused sequences when both cases are possible.
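Without the weights or the fused-vowel rule, SplitConjoinedWords ◦ L( L)+ amounts to segmenting the token into sequences of lexicon words, which can be sketched recursively (toy lexicon and names assumed):

```python
from functools import lru_cache

LEXICON = {"de", "nada", "a", "casa", "buenas", "noches"}  # toy lexicon

def split_conjoined(token):
    """All segmentations of token into two or more lexicon words."""
    @lru_cache(maxsize=None)
    def segment(s):
        results = []
        if s in LEXICON:
            results.append((s,))
        for i in range(1, len(s)):
            head, tail = s[:i], s[i:]
            if head in LEXICON:
                results.extend((head,) + rest for rest in segment(tail))
        return results
    return [" ".join(seg) for seg in segment(token) if len(seg) > 1]

print(split_conjoined("denada"))  # ['de nada']
```

The real system ranks competing segmentations by weight via shortest paths; this sketch simply enumerates them all.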
3.1.5 Other transducers
Expressive lengthening, which consists in repeating a letter in order to convey emphasis, is dealt with by means of rules removing a varying number of consecutive occurrences of the same letter. An example of a rule dealing with repetitions of the letter a is a (→) ε / a __. A transducer of this kind is generated for the whole alphabet.
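In candidate-generation terms, such rules amount to shrinking runs of a repeated letter. The sketch below keeps run lengths of one and two, since Spanish has legitimate doubles (ll, rr, cc); the choice of these two lengths is our simplification of the rules described above:

```python
from itertools import groupby, product

def delengthen(token):
    """Candidates obtained by shrinking each run of a repeated letter
    to length 1 or 2 (Spanish allows doubles such as ll and rr)."""
    runs = [(ch, len(list(g))) for ch, g in groupby(token)]
    choices = [[ch * n for n in range(1, min(count, 2) + 1)]
               for ch, count in runs]
    return {"".join(parts) for parts in product(*choices)}

print(sorted(delengthen("holaaaa")))   # -> ['hola', 'holaa']
```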
Because messages are keyboarded, some
errors found in words are due to letter transpositions and confusions between adjacent
letters in the same row of the keyboard. These changes are also implemented with a transducer.
Finally, a Levenshtein transducer with a maximum distance of one has also been implemented.
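The effect of such a transducer, when composed with the lexicon, is to retrieve all in-lexicon words within edit distance one of a token. The same neighbourhood can be enumerated directly (in the style of Norvig's well-known spelling corrector); this is a functional sketch, not the FST implementation:

```python
import string

# A simplified alphabet; accented vowels and ñ are included for Spanish.
ALPHABET = string.ascii_lowercase + "áéíóúüñ"

def edits1(token):
    """All strings within Levenshtein distance one of token
    (deletions, transpositions, substitutions, insertions)."""
    splits = [(token[:i], token[i:]) for i in range(len(token) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    replaces = {L + c + R[1:] for L, R in splits if R for c in ALPHABET}
    inserts = {L + c + R for L, R in splits for c in ALPHABET}
    return deletes | transposes | replaces | inserts

def candidates(token, lexicon):
    """Lexicon words reachable within one edit of the token."""
    return sorted(edits1(token) & lexicon)

print(candidates("csaa", {"casa", "cosa"}))   # -> ['casa'] (one transposition)
```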
3.2 The lexicon
The lexicon for OOV token normalisation
contains mainly Spanish standard words,
proper names and some frequent English
words. These constitute the set of target
words. We used the DRAE (RAE, 2001) as
the source for Spanish standard words in the
lexicon. Besides inflected forms, we have added verbal forms with clitics attached and
derivative forms not found as entries in the
DRAE: -mente adverbs, appreciatives, etc.
The list of proper names was compiled from many sources and contains first names, surnames, aliases, cities, country names, brands, organisations, etc. Special attention was paid to hypocorisms, i.e., shorter or diminutive forms of a given name, as well as nicknames or calling names, since communication in channels such as Twitter tends to be interpersonal (or between members of a group) and affective. A list of common hypocorisms is provided to the system. For English words, we have selected the 100,000 most frequent words of the BNC (BNC, 2001).
3.3 Language model
We use a language model to decode the word
graph and thus obtain the most probable
word sequence. The model is estimated from
a corpus of webpages compiled with Wacky
(Baroni et al., 2009). The corpus contains
about 11,200,000 tokens coming from about
21,000 URLs. We used as seeds the types
found in the development set (about 2,500).
Backed-off n-gram models, used as language
models, are implemented with the OpenGrm
NGram toolkit (Roark et al., 2012).
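The decoding step can be pictured as a Viterbi search over the word graph, scoring transitions with the language model. The toy bigram scorer below is a hand-set stand-in for the backed-off models estimated with OpenGrm, and the function names are our own:

```python
def viterbi_decode(confusion_sets, bigram_logp):
    """Most probable word sequence given per-token confusion sets and a
    bigram log-probability function (assumed given; in the system it
    comes from a backed-off n-gram model)."""
    beams = {"<s>": (0.0, [])}                  # previous word -> (score, path)
    for cset in confusion_sets:
        new = {}
        for word in cset:
            score, path = max(
                ((lp + bigram_logp(prev, word), seq)
                 for prev, (lp, seq) in beams.items()),
                key=lambda t: t[0])
            new[word] = (score, path + [word])
        beams = new
    return max(beams.values(), key=lambda t: t[0])[1]

# Toy illustration with hand-set scores favouring "de nada".
TABLE = {("<s>", "de"): -1.0, ("de", "nada"): -1.0}
logp = lambda prev, word: TABLE.get((prev, word), -10.0)
print(viterbi_decode([["d", "de"], ["nada", "nda"]], logp))   # -> ['de', 'nada']
```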
3.4 Truecasing
The restoring of case information in badly-cased text has been addressed in (Lita et al., 2003) and has been included as part of the normalisation task. Part of this process, for proper names, is performed by the application of the language model to the word graph. Words at message-initial position are not always uppercased, since doing so yielded contradictory results after some experimentation. A simple heuristic is implemented to uppercase a normalisation candidate when the OOV token is also uppercased.
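The heuristic is a one-liner; the example token and candidate below are invented for illustration:

```python
def truecase_candidate(oov_token, candidate):
    """Uppercase the normalisation candidate when the OOV token
    itself is uppercased (the simple heuristic described above)."""
    if oov_token[:1].isupper():
        return candidate[:1].upper() + candidate[1:]
    return candidate

print(truecase_candidate("Knda", "cuando"))   # -> "Cuando"
print(truecase_candidate("knda", "cuando"))   # -> "cuando"
```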
4 Settings and evaluation
In order to generate the confusion sets we
used two edit transducers applied in a cascade. If neither of the two is able to relate a
token with a word, the token is assigned to
itself.
The first transducer generates candidates according to the expansion of abbreviations, the identification of acronyms and pictograms, and words which result from the following composition of edit transducers combining some of the features of texting:
RemoveSymbols ◦
LowerCase ◦
Deaccent ◦
RemoveReduplicates ◦
ReplaceLogograms ◦
StandardiseEndings ◦
DephonetiseWriting ◦
Reaccent ◦
MixCase
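The cascade amounts to plain function composition over tokens. The steps below are simplified stand-ins for three of the transducers listed above (collapsing every repetition loses legitimate doubles like ll, which the weighted transducers avoid), so this is an illustration of the composition pattern, not the system itself:

```python
import unicodedata

def lower_case(token):
    return token.lower()

def deaccent(token):
    # Canonical decomposition, then drop the combining marks.
    return "".join(c for c in unicodedata.normalize("NFD", token)
                   if not unicodedata.combining(c))

def remove_reduplicates(token):
    out = []
    for ch in token:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

def cascade(*steps):
    """Compose edit steps left to right, mirroring ◦-composition."""
    def run(token):
        for step in steps:
            token = step(token)
        return token
    return run

normalise = cascade(lower_case, deaccent, remove_reduplicates)
print(normalise("QUÉEE"))   # -> "que"
```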
The second edit transducer analyses tokens that did not receive analyses from the first editor. This second editor implements consonantal writing, typing error recovery, approximate matching using a Levenshtein distance of one, and the splitting of juxtaposed words. In all cases, case, accents and reduplications are also considered. This second transducer makes use of an extended lexicon containing sequences of simple words.
Several experiments were conducted in order to evaluate some parameters of the system: in particular, the effect of the order of the n-grams in the language model, and the effect of generating confusion sets for OOV tokens only versus generating confusion sets for all tokens. For all the experiments we used the test set provided with the tokenization delivered by FreeLing.
For the first series of experiments, tokens identified as standard words by FreeLing receive the same token as analysis and OOV tokens are analysed with the system. Recall on OOV tokens is 89.40%. Confusion set size follows a power-law distribution, with an average size of 5.48 for OOV tokens that goes down to 1.38 when averaging over the rest of the tokens. Precision for 2- and 4-gram language models is 78.10%, but the best result is obtained with 3-grams, with a precision of 78.25%.
A number of non-standard forms were wrongly recognised as in-vocabulary words because they clash with other standard words. In the second series of experiments, a confusion set is generated for each word in order to correct potentially wrong assignments. The average size of the confusion sets increases to 5.56.¹ Precision for the 2-gram language model is 78.25%, but 3- and 4-grams both reach a precision of 78.55%.
From a quantitative point of view, it seems that slightly better results are obtained using a 3-gram language model and generating confusion sets not only for OOV tokens but for all the tokens in the message. In a qualitative evaluation of errors, several categories show up. The most populated categories are those having to do with case restoration and wrong decoding by the language model. Some errors are related to particularities of the DRAE, from which the lexicon was derived (dispertar or malaleche). Non-standard morphology is observed in tweets, as in derivatives (tranquileo or loquendera). Lack of abbreviation expansion is also observed (Hum). Faulty application of segmentation accounts for a few errors (mencantaba). Finally, some errors are not in our output but in the reference (Hojo).
5 Conclusions and future work
No attention has been paid to multilingualism, since the task explicitly excluded tweets from bilingual areas of Spain. However, given that many Spanish speakers (both in Europe and America) are bilingual or live in bilingual areas, mechanisms should be provided to deal with languages other than English to make the system more robust.
We plan to build a corpus of lexically standard tweets via the Twitter streaming API to determine whether n-grams observed in a Twitter-only corpus improve decoding, as a side effect of syntax also being non-standard.

¹ We removed from the calculation the token mes de abril, which receives 308,017 different analyses due to the combination of multiple editions and segmentations.
Qualitative analysis of the results showed that there is room for improvement by experimenting with selective deactivation of items in the lexicon and further development of the segmenting module.
However, initialisms and shortenings are features of texting that are difficult to model without causing overgeneration. Acronyms like FYQ, which corresponds to the school subject of Física y Química, are domain-specific and difficult to foresee, and therefore to have listed in the resources.
References
Allauzen, Cyril and Mehryar Mohri. 2008. 3-way composition of weighted finite-state transducers. In Proc. of the 13th Int. Conf. on Implementation and Application of Automata (CIAA-2008), pages 262–273, San Francisco, California, USA.
Atserias, Jordi, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluís Padró, and Muntsa Padró. 2006. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. In Proc. of the 5th Int. Conf. on Language Resources and Evaluation (LREC-2006), pages 48–55, Genoa, Italy, May.
Baroni, Marco, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.
BNC. 2001. The British National Corpus, version 2 (BNC World). Distributed by Oxford University Computing Services on behalf of the BNC Consortium.
http://www.natcorp.ox.ac.uk.
Chomsky, Noam and Morris Halle. 1968.
The sound pattern of English. Harper &
Row, New York.
Crystal, David. 2008. Txtng: The Gr8 Db8.
Oxford University Press.
Gómez Hidalgo, José María, Andrés Alfonso Caurcel Díaz, and Yovan Iñiguez del Rio. 2013. Un método de análisis de lenguaje tipo SMS para el castellano. Linguamática, 5(1):31–39, July.
Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710.
Lita, Lucian Vlad, Abe Ittycheriah, Salim
Roukos, and Nanda Kambhatla. 2003.
tRuEcasIng. In Proc. of the 41st Annual
Meeting on ACL - Volume 1, ACL ’03, pages 152–159, Stroudsburg, PA, USA.
Mosquera, Alejandro and Paloma Moreda. 2012. TENOR: A lexical normalisation tool for Spanish Web 2.0 texts. In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech and Dialogue, volume 7499 of LNCS, pages 535–542. Springer.
Oliva, J., J. I. Serrano, M. D. Del Castillo, and Á. Iglesias. 2013. A SMS normalization system integrating multiple grammatical resources. Natural Language Engineering, 19(1):121–141.
Pinto, David, Darnes Vilariño Ayala, Yuridiana Alemán, Helena Gómez, Nahun Loya, and Héctor Jiménez-Salazar. 2012. The Soundex phonetic algorithm revisited for SMS text representation. In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech and Dialogue, volume 7499 of LNCS, pages 47–55. Springer.
Porta, Jordi, José-Luis Sancho, and Javier Gómez. 2013. Edit transducers for spelling variation in Old Spanish. In Proc. of the workshop on computational historical linguistics at NODALIDA 2013, NEALT Proc. Series 18, pages 70–79, Oslo, Norway, May 22-24.
RAE. 2001. Diccionario de la lengua española. Espasa, Madrid, 22nd edition.
Roark, Brian, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai. 2012. The OpenGrm open-source finite-state grammar software libraries. In Proc. of the ACL 2012 System Demonstrations, pages 61–66, Jeju Island, Korea, July.
Tai, Terry, Wojciech Skut, and Richard
Sproat. 2011. Thrax: An Open Source
Grammar Compiler Built on OpenFst. In
ASRU 2011, Waikoloa Resort, Hawaii, December.