A Focused Crawler for Romanian Words Discovery

Authors
Ionuț-Gabriel Radu
Traian Rebedea (traian.rebedea@cs.pub.ro)
University Politehnica of Bucharest

Overview
• Introduction
• Objective
• RWScraper
• Related Work
• RWScraper: Implementation
• Results
• Conclusions
19.08.15 RoEduNet Conference 2014 – Chișinău, R. Moldova

Introduction
• All natural languages are subject to change
over time
• As the Web becomes more prevalent, it also
constitutes a major source for identifying
language evolution
• Due to large amounts of Romanian web
content, the rate of change has increased
significantly

Objective
• Provide a mechanism to identify new
words (e.g. neologisms) that have entered the
Romanian language
• Develop a specialized (focused) web crawler
for analyzing Romanian web pages and
identifying new words

Focused Web Crawling
• Crawling the web with a specific purpose:
– “Focus” the spiders to specific content (e.g.
people search, scientific publications, products,
etc.)
– Ignore other web pages
and domains

Solution: RWScraper
• RWScraper (Romanian Word Scraper) is able
to solve the following problems:
– Identify Romanian texts;
– Distinguish between proper names and common
nouns;
– Create a database with new words along with
context information and metadata. In order to
identify new
– Discover the most frequent spelling errors in
Romanian online texts.

RWScraper – Text Processing
• Each word discovered in a Romanian text is looked up in
the database provided by www.dexonline.ro, which
contains definitions from several Romanian
dictionaries (DEX, DOOM, etc.)
• Text Processing Pipeline
– Text Normalization
– Language Validation
– Sentence Segmentation
– Sentence-Level Language
Identification
– Word Tokenization
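The normalization, segmentation, tokenization, and dictionary lookup steps above can be sketched as follows. This is a minimal illustration, not the RWScraper code: the regular expressions, the cedilla-to-comma mapping, and the tiny in-memory `LEXICON` (a hypothetical stand-in for the dexonline.ro database) are all assumptions, and the two language-checking stages are left out.

```python
import re
import unicodedata

# Assumption: legacy cedilla letters are mapped to the comma-below forms.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u015e": "\u0218",
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u0162": "\u021a",
})

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def normalize(text):
    """Compose Unicode, unify s/t diacritic variants, collapse whitespace."""
    text = unicodedata.normalize("NFC", text).translate(CEDILLA_TO_COMMA)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in SENTENCE_END.split(text) if s]


def tokenize(sentence):
    """Return lowercase alphabetic tokens; hyphenated forms stay together."""
    return [w.lower() for w in WORD.findall(sentence)]


def unknown_words(text, lexicon):
    """Yield (word, sentence) pairs for tokens that are not in the lexicon."""
    for sentence in split_sentences(normalize(text)):
        for word in tokenize(sentence):
            if word not in lexicon:
                yield word, sentence


# Hypothetical stand-in for the dictionary database ("\u021bara" is "țara").
LEXICON = {"am", "citit", "un", "articol", "\u021bara", "are", "nou"}
```

Keeping the sentence next to each unknown word gives the context information that the new-words database is meant to store.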

Related Work: Neologism Identification
• A study for Japanese:
– Scanning existing Japanese corpora for possible “new” words,
typically by processing the texts through segmentation software
and dealing with the “out-of-lexicon” problem
– Simulating Japanese morphological processes to create
possible new words and then testing for their presence in large
corpora
• Identification of lexical discriminants (e.g. termed, called,
known as) and punctuation discriminants (e.g. single and
double quotes) for introducing new words
– This method is able to identify a significantly smaller number of
potential new words due to the limited number of lexical
discriminant patterns.
• Using data about the frequency of words usage over time
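The discriminant-based approach listed above can be illustrated with two regular expressions: one for a cue phrase followed by a word, one for a word in quotes. This is a rough sketch under stated assumptions: the cue list contains only the three examples from the slide, and the patterns and the `candidates` helper are hypothetical, not taken from the cited work.

```python
import re

# Cue phrases from the slide; a real system would use a longer list.
LEXICAL_CUES = ("termed", "called", "known as")

WORD = r"[^\W\d_]+(?:-[^\W\d_]+)*"
OPEN_QUOTE = "[\"'\u201c\u2018]"
CLOSE_QUOTE = "[\"'\u201d\u2019]"

# A cue phrase, whitespace, an optional opening quote, then the candidate.
LEXICAL = re.compile(
    r"\b(?:" + "|".join(map(re.escape, LEXICAL_CUES)) + r")\s+"
    + OPEN_QUOTE + "?(" + WORD + ")",
    re.IGNORECASE,
)

# A single word enclosed in single or double quotes.
QUOTED = re.compile(OPEN_QUOTE + "(" + WORD + ")" + CLOSE_QUOTE)


def candidates(text):
    """Return lowercase words introduced by a cue phrase or by quotes."""
    return {
        match.group(1).lower()
        for pattern in (LEXICAL, QUOTED)
        for match in pattern.finditer(text)
    }
```

Because only words that follow a known cue or sit inside quotes are returned, the output is small, which matches the limitation noted on the slide.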

Related Work: Language Identification
• Common Words Methods
– Store and use a list with the most frequent words for each language
• Unique Letter Combinations
– Database with the most frequent sequences of letters in a language,
not necessarily valid words
– The main disadvantage: the poor performance on short texts
– The main advantage: it does not require word tokenization
• Language Identification Using N-Grams
– Every language has several specific frequently used character n-grams
– For a particular language L, the frequency-ordered n-gram dictionary is called the
n-gram language profile
– For a new text, we compute its distance to every language
profile
• Markov Models for Language Identification
– A word can be represented as a Markov chain where letters are
states
– Compute a Markov model for each language
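To make the n-gram profile idea concrete, here is a small sketch. The slide does not say which distance is used, so the rank-based "out-of-place" measure below is an assumption, as are the function names, the bigram/trigram choice, and the `top_k` cutoff.

```python
from collections import Counter


def ngram_profile(text, n_values=(2, 3), top_k=300):
    """Return the top_k character n-grams of text, most frequent first."""
    cleaned = " ".join(text.lower().split())
    counts = Counter()
    for n in n_values:
        counts.update(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))
    # Sort by descending count, then alphabetically, so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place_distance(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    receive the maximum penalty (the profile length)."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def identify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place_distance(doc, profiles[lang]))
```

In practice each language profile would be built once from a large training corpus and `identify` would be called for every new text; very short texts yield sparse profiles, so results on them are less reliable.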

RWScraper: Implementation
• RWScraper is a focused crawler for Romanian
web pages
• Developed using Scrapy, an open-source scraping
framework written in Python
• It uses three main concepts:
– Spiders: responsible for defining rules to restrict the
crawled content to our area of interest
– Items: data we want to scrape from the web pages
– Pipelines: text processing tasks that act on the
crawled web resources
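As a rough illustration of the pipeline concept: a Scrapy item pipeline is an ordinary Python class with a `process_item(self, item, spider)` method, and items may be plain dictionaries. The two stages and the `run_pipelines` helper below are hypothetical and import nothing from Scrapy; in a real project the stages would be registered in the `ITEM_PIPELINES` setting and the framework would run them in order.

```python
import re


class WhitespacePipeline:
    """Collapse runs of whitespace in the scraped text."""

    def process_item(self, item, spider):
        item["text"] = " ".join(item["text"].split())
        return item


class TokenPipeline:
    """Attach a list of lowercase word tokens to the item."""

    WORD = re.compile(r"[^\W\d_]+")

    def process_item(self, item, spider):
        item["tokens"] = [w.lower() for w in self.WORD.findall(item["text"])]
        return item


def run_pipelines(item, stages, spider=None):
    """Stand-in for the framework loop: pass the item through each stage."""
    for stage in stages:
        item = stage.process_item(item, spider)
    return item


# Example item, shaped like what a spider callback might yield.
page = {"url": "http://example.ro/", "text": "Un  text\nscurt."}
processed = run_pipelines(page, [WhitespacePipeline(), TokenPipeline()])
```

Each stage returns the item so that the next stage can continue working on it, which is how the text processing tasks can be chained.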

RWScraper Language Validation
• Divide the texts into two categories:
– Diacritic-free texts – DIAFREE
– Genuine Romanian texts – GEN
• 6.40% of the characters in the Romanian texts of
the ro_eu_parliament corpus are diacritics
• One problem with this approach: 4.14% of texts
contained ș, â, and î, and other languages
also use these diacritics
• Romanian is the only language that uses both ț and ă
• Our assumption: if a text has over 600 characters and
contains no ț or ă
– then it is DIAFREE
– otherwise it is GEN
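The rule above translates almost directly into code. In this minimal sketch the function name is hypothetical, and counting uppercase Ț and Ă as markers is an assumption; cedilla variants of ț are not handled.

```python
# Comma-below t and breve a, lowercase and uppercase.
ROMANIAN_MARKERS = frozenset("\u021b\u021a\u0103\u0102")


def classify_diacritics(text, min_chars=600):
    """Return "DIAFREE" for a long text with no marker letters, else "GEN"."""
    has_marker = any(ch in ROMANIAN_MARKERS for ch in text)
    if len(text) > min_chars and not has_marker:
        return "DIAFREE"
    return "GEN"
```

Texts of 600 characters or fewer always fall into GEN here, because the rule only labels a text DIAFREE when it is long enough for the absence of ț and ă to be meaningful.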

RWScraper Language Validation
• Build language profiles, consisting of:
– Character bigrams and trigrams frequency
– Common words frequency
– Diacritics frequency
– Rare characters frequency
– Double consonant frequency
– Single quotes frequency
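Some of these profile features can be computed as simple relative frequencies. The sketch below is only an illustration: the character sets (in particular which letters count as rare) and the choice to divide by the total character count are assumptions, and the n-gram and common-word counts are left out to keep it short.

```python
DIACRITICS = set("\u0103\u00e2\u00ee\u0219\u021b")  # ă â î ș ț
CONSONANTS = set("bcdfghjklmnpqrstvwxz")
RARE = set("kqwy")  # assumption: letters rare in native Romanian words
SINGLE_QUOTES = set("'\u2019")


def profile_features(text):
    """Return relative frequencies of a few character-level features."""
    lowered = text.lower()
    total = len(lowered) or 1  # avoid division by zero on empty input
    pairs = list(zip(lowered, lowered[1:]))
    return {
        "diacritics": sum(ch in DIACRITICS for ch in lowered) / total,
        "rare_chars": sum(ch in RARE for ch in lowered) / total,
        "double_consonants": sum(
            a == b and a in CONSONANTS for a, b in pairs
        ) / total,
        "single_quotes": sum(ch in SINGLE_QUOTES for ch in lowered) / total,
    }
```

A language profile could then be the average of these values over a training corpus, with a new text compared against it feature by feature.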

Results: Language Validation
• 105 texts, divided into: 20 Romanian with diacritics (RO1-RO20),
20 Romanian without diacritics (RO21-RO40), 20 Italian, 15 English,
10 Spanish, 5 Latin, 5 French, 5 Turkish, 3 Catalan, and 2 Aromanian
• The size of the texts varied from 9 KB to 2.5 MB, the average
size being 253.4 KB
• Average scores for the discriminator function
– Lower score means higher probability for the text to be written in
Romanian
– Used to set the discriminant threshold to 0.77 to separate
Romanian from non-Romanian texts

Results
• Processed 264,328 online documents
– Only 12,555 documents contained new words
• From this set of texts, we extracted 698,341 phrases
– Only 47,363 phrases contained new words
• Discovered 53,724 new words
– 21,343 are proper names
• The remaining tokens are common words and they are
divided into the following main categories:
– Misspelled words (approximately 35%)
– Technical words (approximately 15%)
– Argotic words (approximately 10%)
– Clitics, regionalisms, archaisms, alternative forms for
existing words account for the rest (approximately 40%)
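The figures above can be cross-checked with a few lines of arithmetic. The counts are copied from the slide; the variable names and the per-category estimates are illustrative, and since the category percentages are approximate, the derived counts are only rough.

```python
documents_total, documents_with_new = 264_328, 12_555
phrases_total, phrases_with_new = 698_341, 47_363
new_words, proper_names = 53_724, 21_343

# Tokens left after removing proper names: 32,381.
common_words = new_words - proper_names


def share(part, whole):
    """Percentage of whole represented by part, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Approximate category shares of the common words, as given on the slide.
CATEGORY_SHARES = {"misspelled": 35, "technical": 15, "argotic": 10, "other": 40}

# Rough per-category counts derived from those percentages.
category_estimates = {
    name: round(common_words * pct / 100)
    for name, pct in CATEGORY_SHARES.items()
}
```

With these numbers, `share(documents_with_new, documents_total)` gives about 4.75% and `share(phrases_with_new, phrases_total)` about 6.78%, so new words turn up in only a small fraction of the crawled material.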

Results
• Most frequent new words

Conclusions
• RWScraper is a simple system for discovering new
Romanian words
• The project has also managed to create a large
database of Romanian words extracted from the
WWW
– Statistics about common proper names, frequent spelling
mistakes and newly-invented words
• There are several elements that could be further
improved
– The accuracy of the NLP components used by the system
– A more pertinent analysis of the words identified by the
system

Thank you!
Questions?
Discussion
This work has been funded by the
Sectorial Operational Programme
Human Resources Development
2007-2013 of the Romanian Ministry
of European Funds through the
Financial Agreement
POSDRU/159/1.5/S/132397
