A Focused Crawler for Romanian Words Discovery
Authors: Ionuț-Gabriel Radu, Traian Rebedea (traian.rebedea@cs.pub.ro)
University Politehnica of Bucharest
Overview
• Introduction
• Objective
• RWScraper
• Related Work
• RWScraper: Implementation
• Results
• Conclusions
19.08.15 RoEduNet Conference 2014 – Chișinău, R. Moldova 2
Introduction
• All natural languages are subject to change
over time
• As the Web becomes more prevalent, it also
constitutes a major source for identifying
language evolution
• Due to large amounts of Romanian web
content, the rate of change has increased
significantly
Objective
• To provide a mechanism to identify new
words (e.g. neologisms) that entered the
Romanian language
• Develop a specialized (focused) web crawler
for analyzing Romanian web pages and
identifying new words
Focused Web Crawling
• Crawling the web with a specific purpose:
– “Focus” the spiders on specific content (e.g. people search,
scientific publications, products, etc.)
– Ignore other web pages and domains
Solution: RWScraper
• RWScraper (Romanian Word Scraper) is able
to solve the following problems:
– Identify Romanian texts;
– Distinguish between proper names and common
nouns;
– Create a database with new words along with
context information and metadata. In order to
identify new
– Discover the most frequent spelling errors in
Romanian online texts.
RWScraper – Text Processing
• Each word discovered in a Romanian text is looked up in
the database provided by www.dexonline.ro, which
contains definitions from several Romanian
dictionaries (DEX, DOOM, etc.)
• Text Processing Pipeline
– Text Normalization
– Language Validation
– Sentence Segmentation
– Sentence-Level Language
Identification
– Word Tokenization
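The stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RWScraper's code: the function names, the regex-based segmentation and tokenization, and the small in-memory `KNOWN_WORDS` set standing in for the dexonline.ro database are all hypothetical, and the two language-identification stages are left out.

```python
import re
import unicodedata

# Hypothetical stand-in for the dexonline.ro lexicon (assumption).
KNOWN_WORDS = {"acesta", "este", "un", "text", "nou"}

# Legacy cedilla letters mapped to the comma-below forms used in Romanian.
CEDILLA_TO_COMMA = str.maketrans("\u015f\u0163\u015e\u0162",
                                 "\u0219\u021b\u0218\u021a")

def normalize(text):
    """Text normalization: NFC form, unified diacritics, collapsed whitespace."""
    text = unicodedata.normalize("NFC", text).translate(CEDILLA_TO_COMMA)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text):
    """Naive sentence segmentation on ., ! and ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def tokenize(sentence):
    """Word tokenization: lowercase runs of letters, hyphens allowed inside."""
    return re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", sentence.lower())

def unknown_words(text):
    """Return (word, sentence) pairs for words missing from the lexicon."""
    found = []
    for sentence in split_sentences(normalize(text)):
        for token in tokenize(sentence):
            if token not in KNOWN_WORDS:
                found.append((token, sentence))
    return found
```

Keeping the sentence next to each unknown word mirrors the idea of storing new words together with context information.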
Related Work:
Neologisms Identification
• A study for Japanese:
– Scanning existing Japanese corpora for possible “new” words,
typically by processing the texts through segmentation software
and dealing with the “out-of-lexicon” problem
– Simulating Japanese morphological processes to create possible
new words and then testing for their presence in large
corpora
• Identification of lexical discriminants (e.g. termed, called,
known as) and punctuation discriminants (e.g. single and
double quotes) for introducing new words
– This method is able to identify a significantly smaller number of
potential new words due to the limited number of lexical
discriminant patterns.
• Using data about word usage frequency over time
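The discriminant-based approach can be illustrated with two regular expressions. This is a rough sketch: the cue list reuses the English examples from the slide, while the patterns and the function name `candidate_words` are assumptions made for illustration only.

```python
import re

# Lexical discriminants taken from the slide's examples.
LEXICAL_CUES = ("termed", "called", "known as")
_cues = "|".join(re.escape(cue) for cue in LEXICAL_CUES)

# A cue followed by an optionally quoted word.
LEXICAL_RE = re.compile(
    r"\b(?:" + _cues + r")\s+[\"'\u201c\u201d]?(\w[\w-]*)", re.IGNORECASE
)
# Punctuation discriminant: a single word inside double quotes.
QUOTED_RE = re.compile(r"[\"\u201c\u201d](\w[\w-]*)[\"\u201c\u201d]")

def candidate_words(text):
    """Collect words introduced by a lexical cue or wrapped in double quotes."""
    candidates = {m.group(1).lower() for m in LEXICAL_RE.finditer(text)}
    candidates |= {m.group(1).lower() for m in QUOTED_RE.finditer(text)}
    return candidates
```

Any new word that is not announced by one of the listed cues or by quotes is missed, which is consistent with the smaller number of candidates this method yields.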
Related Work:
Language Identification
• Common Words Methods
– Store and use a list with the most frequent words for each language
• Unique Letter Combinations
– Database with the most frequent sequences of letters in a language,
not necessarily valid words
– The main disadvantage: the poor performance on short texts
– The main advantage: it does not require word tokenization
• Language Identification Using N-Grams
– Every language has several specific frequently used character n-grams
– For a particular language L, the n-gram ordered dictionary is called the
n-gram language profile
– For a new text, we compute the distance to all computed language
profiles
• Markov Models for Language Identification
– A word can be represented as a Markov chain in which letters are
states
– Compute a Markov model for each language
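The n-gram profile method above can be sketched as follows. The slide does not say which distance is used; this example assumes the rank-based "out-of-place" measure often paired with n-gram profiles, and the function names, the `top_k` cutoff, and the choice of bigrams plus trigrams are illustrative assumptions.

```python
from collections import Counter

def ngram_profile(text, n_values=(2, 3), top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank."""
    text = " ".join(text.lower().split())
    counts = Counter(
        text[i:i + n] for n in n_values for i in range(len(text) - n + 1)
    )
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    receive the maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def identify(text, profiles):
    """Return the language whose profile is closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

In practice each language profile would be built from a large corpus; `profiles` here is simply a dictionary from a language code to the result of `ngram_profile`.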
RWScraper: Implementation
• RWScraper is a focused crawler for Romanian
web pages
• Developed using Scrapy, an open-source scraping
framework written in Python
• It uses three main concepts:
– Spiders: responsible for defining rules to restrict the
crawled content to our area of interest
– Items: data we want to scrape from the web pages
– Pipelines: text processing tasks that act on the
crawled web resources
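How the three concepts fit together can be shown with a framework-free sketch. It deliberately does not use Scrapy's API; `PageItem`, `in_scope`, the `.ro` domain restriction, and the two stages are hypothetical stand-ins for a spider rule, an item, and pipeline steps.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

@dataclass
class PageItem:
    """Item: the data we want to keep from a crawled page."""
    url: str
    text: str
    words: list = field(default_factory=list)

def in_scope(url, allowed_suffixes=(".ro",)):
    """Spider rule: only follow URLs whose host ends with an allowed suffix
    (the .ro restriction is an assumption for illustration)."""
    host = urlparse(url).hostname or ""
    return host.endswith(allowed_suffixes)

def lowercase_stage(item):
    """Pipeline stage: lowercase the page text."""
    item.text = item.text.lower()
    return item

def tokenize_stage(item):
    """Pipeline stage: split the text into whitespace-separated words."""
    item.words = item.text.split()
    return item

def run_pipelines(item, stages):
    """Pass the item through each stage in order, as a pipeline chain does."""
    for stage in stages:
        item = stage(item)
    return item
```

In Scrapy itself these roles correspond to spider classes, item definitions, and item pipeline components configured in the project settings.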
RWScraper Language Validation
• Divide the texts into two categories:
– Diacritics-free texts: DIAFREE
– Genuine Romanian texts: GEN
• 6.40% of the characters in the Romanian texts part of
the ro_eu_parliament corpus are diacritics
• One of the problems with this approach is that 4.14%
of texts contained ș, â, and î. Unfortunately, there are
also other languages that possess these diacritics
• Romanian is the only language that uses ț and ă
• Our assumption: if a text has over 600 characters and
contains no ț or ă
– Then it is DIAFREE
– Otherwise it is GEN
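The rule above translates almost directly into code. This is a literal reading of the slide's assumption; the function name, the inclusion of uppercase letters, and the handling of the legacy cedilla form of ț are assumptions added for the sketch.

```python
import unicodedata

MIN_LENGTH = 600  # threshold stated on the slide

# ț and ă in both cases, plus the legacy cedilla variants of ț (assumption).
MARKERS = set("\u021b\u021a\u0103\u0102\u0163\u0162")

def classify_diacritics(text):
    """Return 'DIAFREE' for long texts without ț/ă, otherwise 'GEN'."""
    text = unicodedata.normalize("NFC", text)
    has_marker = any(ch in MARKERS for ch in text)
    if len(text) > MIN_LENGTH and not has_marker:
        return "DIAFREE"
    return "GEN"
```

Read literally, a short text without any diacritics is also labelled GEN, because the rule only marks a text as DIAFREE once it exceeds 600 characters.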
RWScraper Language Validation
• Build language profiles, consisting of:
– Character bigrams and trigrams frequency
– Common words frequency
– Diacritics frequency
– Rare characters frequency
– Double consonant frequency
– Single quotes frequency
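A few of these profile components can be computed with simple character counts. The sketch below covers only the diacritics, double-consonant, and single-quote frequencies; the per-character normalization and the function name are assumptions, and the slides do not say how the components are combined into the final discriminator score.

```python
# Romanian diacritics in both cases.
DIACRITICS = set("\u0103\u00e2\u00ee\u0219\u021b\u0102\u00c2\u00ce\u0218\u021a")
CONSONANTS = set("bcdfghjklmnpqrstvwxz")

def profile_features(text):
    """Relative frequencies (per character) of three profile components."""
    n = len(text) or 1  # avoid division by zero on empty input
    lowered = text.lower()
    # Count adjacent pairs made of the same consonant, e.g. "ll" or "tt".
    doubles = sum(
        1 for a, b in zip(lowered, lowered[1:]) if a == b and a in CONSONANTS
    )
    return {
        "diacritics": sum(ch in DIACRITICS for ch in text) / n,
        "double_consonants": doubles / n,
        "single_quotes": text.count("'") / n,
    }
```

Double consonants and apostrophes are much more common in languages such as Italian, English, or French than in Romanian, so features of this kind can help separate diacritics-free Romanian from other languages.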
Results: Language Validation
• 105 texts, divided into: 20 Romanian with diacritics (RO1 -
RO20), 20 Romanian without diacritics (RO21 - RO40), 20
Italian, 15 English, 10 Spanish, 5 Latin, 5 French, 5 Turkish,
3 Catalan, and 2 Aromanian
• The size of the texts varied from 9 KB to 2.5 MB, the average
size being 253.4 KB
• Average scores for the discriminator function
– Lower score means higher probability for the text to be written in
Romanian
– The average scores were used to set the discriminant threshold at 0.77,
separating Romanian from non-Romanian texts
Results
• Processed 264,328 online documents
– Only 12,555 documents contained new words
• From this set of texts, we extracted 698,341 phrases
– Only 47,363 phrases contained new words
• Discovered 53,724 new words
– 21,343 are proper names
• The remaining tokens are common words and they are
divided into the following main categories:
– Misspelled words (approximately 35%)
– Technical words (approximately 15%)
– Argotic words (approximately 10%)
– Clitics, regionalisms, archaisms, and alternative forms of
existing words account for the rest (approximately 40%)
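The figures above can be turned into approximate counts with a short calculation. It assumes the category percentages apply to the new words that remain after proper names are removed; since the percentages are approximate, so are the resulting counts.

```python
documents_total = 264_328
documents_with_new_words = 12_555
phrases_total = 698_341
phrases_with_new_words = 47_363
new_words = 53_724
proper_names = 21_343

# Share of documents and phrases that contained at least one new word.
document_share = documents_with_new_words / documents_total
phrase_share = phrases_with_new_words / phrases_total

# New words left once proper names are removed.
common_words = new_words - proper_names

# Approximate category shares from the slide.
shares = {"misspelled": 0.35, "technical": 0.15, "argotic": 0.10, "other": 0.40}
estimates = {name: round(common_words * share) for name, share in shares.items()}

print(f"documents with new words: {document_share:.2%}")
print(f"phrases with new words: {phrase_share:.2%}")
print(f"common words: {common_words}")
print(estimates)
```

Under that assumption, fewer than one document in twenty contributed a new word, and misspellings make up the largest single category among the common words.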
Results
• Most frequent new words
Conclusions
• RWScraper is a simple system for discovering new
Romanian words
• The project has also managed to create a large
database of Romanian words extracted from the
WWW
– Statistics about common proper names, frequent spelling
mistakes and newly-invented words
• There are several elements that could be further
improved
– The accuracy of the NLP components used by the system
– A more pertinent analysis of the words identified by the
system
Thank you!
Questions?
Discussion
This work has been funded by the
Sectorial Operational Programme
Human Resources Development
2007-2013 of the Romanian Ministry
of European Funds through the
Financial Agreement
POSDRU/159/1.5/S/132397
