Adrian Iftene¹, Diana Trandabăţ¹,²
{adiftene, dtrandabat}@info.uaic.ro
¹ Faculty of Computer Science, “Al. I. Cuza” University of Iasi
² Romanian Academy, Iasi Branch
2 July, KEPT 2009, Cluj-Napoca
• Motivation
• The system
• Steps performed
• Results
• Conclusions
• Ro-Wikipedia was used in CLEF 2007
◦ 1.43 GB
◦ 121,832 files
Iftene, Trandabăţ, KEPT 2009
Step 1 - The initial text is split into sentences, and the sentences are then split into words.
Step 2 - For every word without diacritics, we look up its possible forms in DBPF:
◦ If the word contains none of the letters “a”, “i”, “s”, “t”, we search for the word itself in DBFP or in Ro-Wikipedia.
◦ If the word contains one or more of the letters “a”, “i”, “s”, “t”, we search DBFP or Ro-Wikipedia using a pattern built from the initial word, in which every letter that may carry a diacritic is replaced by its alternatives: “a” by (ă|â|a), “i” by (î|i), “s” by (ş|s), “t” by (t|ţ).
◦ For example, for the word “fata” the pattern is “f(ă|â|a)(t|ţ)(ă|â|a)”.
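The pattern construction in Step 2 can be sketched in a few lines of Python. This is only an illustration: the function names `build_pattern` and `candidate_forms` and the in-memory `lexicon` list (standing in for DBFP or Ro-Wikipedia) are assumptions, not part of the original system, and the sketch handles lowercase input only.

```python
import re

# Alternatives for each letter that may carry a diacritic, as listed in Step 2.
ALTERNATIVES = {"a": "(ă|â|a)", "i": "(î|i)", "s": "(ş|s)", "t": "(t|ţ)"}


def build_pattern(word):
    """Replace every a/i/s/t with its group of alternatives; keep other letters."""
    return "".join(ALTERNATIVES.get(ch, re.escape(ch)) for ch in word)


def candidate_forms(word, lexicon):
    """Return the lexicon entries that match the pattern of `word` exactly.

    `lexicon` is a hypothetical list of known word forms with diacritics.
    """
    regex = re.compile(build_pattern(word))
    return [form for form in lexicon if regex.fullmatch(form)]
```

For instance, `build_pattern("fata")` yields the pattern shown on the slide, and `candidate_forms("fata", ["fata", "fată", "faţa", "faţă", "foto"])` keeps the first four entries and drops "foto".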
Step 3 - We build a query to search for web pages that contain similar sentences. (Only sentences containing words with multiple forms in DBFP reach this step.)
Step 4 - We retrieve from the web the first 10 relevant pages returned by Google.
Step 5 - From the downloaded sites we keep only pages with text and ignore files with images, fonts, and configuration settings. During selection we identify the “correct” files, those containing diacritics, and concatenate them into one file.
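A minimal sketch of the selection in Step 5 follows. The slides do not say how a “correct” file is recognized, so the criteria here are assumptions: a file is kept if its extension is on a hypothetical whitelist and its text contains at least one Romanian diacritic. The names `select_and_concatenate`, `has_diacritics`, and `TEXT_EXTENSIONS` are illustrative.

```python
import os

# Romanian diacritics in both the cedilla and the comma-below encodings.
DIACRITICS = frozenset("ăâîşţșțĂÂÎŞŢȘȚ")

# Hypothetical whitelist of text-bearing file types; images, fonts and
# configuration files fall outside it and are ignored.
TEXT_EXTENSIONS = {".html", ".htm", ".txt"}


def has_diacritics(text):
    """True if the text contains at least one Romanian diacritic."""
    return any(ch in DIACRITICS for ch in text)


def select_and_concatenate(files):
    """Concatenate the text files that contain diacritics.

    `files` maps a file name to its already decoded text content.
    """
    kept = []
    for name, text in files.items():
        extension = os.path.splitext(name)[1].lower()
        if extension in TEXT_EXTENSIONS and has_diacritics(text):
            kept.append(text)
    return "\n".join(kept)
```

Passing decoded text instead of file paths keeps the sketch free of I/O; a real pipeline would also have to detect each page's character encoding.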
Step 6 - Using the file built at Step 5, we identify the most appropriate form for each word with multiple forms. We build the same kind of patterns as in Step 2 and identify, for every word, its possible forms and their relative positions in the concatenated file.
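The lookup in Step 6 might be sketched as below. The slides do not specify how positions are measured; this sketch assumes a position is the index of a token in the lowercased concatenated text, and the function name `form_positions` is hypothetical.

```python
import re
from collections import defaultdict

# Same alternatives as in Step 2.
ALTERNATIVES = {"a": "(ă|â|a)", "i": "(î|i)", "s": "(ş|s)", "t": "(t|ţ)"}


def form_positions(word, corpus_text):
    """Map every form of `word` found in the text to its token positions.

    Assumption: positions are token indices in the lowercased text.
    """
    pattern = "".join(ALTERNATIVES.get(ch, re.escape(ch)) for ch in word.lower())
    regex = re.compile(pattern)
    tokens = re.findall(r"\w+", corpus_text.lower())
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        if regex.fullmatch(token):
            positions[token].append(index)
    return dict(positions)
```

Each key of the returned dictionary is one candidate form, and its list of positions is the material from which the layers used in the path search are built.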
• Let the sentence S consist of the words $w_1, w_2, \dots, w_n$.
• We denote by $f_i$ the current form of word $w_i$, and by $p_{i1}, p_{i2}, \dots, p_{it_i}$ the positions on the associated layer.
• With this notation, a full path from the first layer (corresponding to the first word of the sentence) to the last layer (corresponding to the last word of the sentence) can be written as
$FP = (p_{1i_1}, p_{2i_2}, \dots, p_{ni_n})$
• Our goal is now to find a full path through the layers with minimal length.
• For that we build, for every word of the sentence, layers holding the positions of its forms.
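The search for a minimal-length full path can be illustrated with a small dynamic program. The slides do not define the length of a path, so this sketch assumes it is the sum of absolute differences between consecutive chosen positions. It also simplifies the layout to one layer per word, each holding `(form, position)` pairs; the name `shortest_full_path` is hypothetical.

```python
def shortest_full_path(layers):
    """Pick one (form, position) per layer so that the path is as short as possible.

    `layers[k]` is a non-empty list of (form, position) candidates for word k.
    Assumed length: sum of |position difference| between consecutive layers.
    Returns (length, forms chosen along the best path).
    """
    # best[j] = (length, forms) of the best path ending at candidate j
    # of the layer processed so far.
    best = [(0, [form]) for form, _ in layers[0]]
    for prev_layer, layer in zip(layers, layers[1:]):
        new_best = []
        for form, pos in layer:
            # Extend the cheapest path from any candidate of the previous layer.
            cost, forms = min(
                (
                    (prev_cost + abs(pos - prev_pos), prev_forms)
                    for (prev_cost, prev_forms), (_, prev_pos) in zip(best, prev_layer)
                ),
                key=lambda item: item[0],
            )
            new_best.append((cost, forms + [form]))
        best = new_best
    return min(best, key=lambda item: item[0])
```

With candidates such as `[("şcoala", 0), ("şcoală", 40)]`, `[("începe", 1), ("începe", 41)]` and `[("sâmbătă", 2), ("sâmbăta", 90)]`, the three adjacent positions 0, 1, 2 give the shortest path, so the forms found together in one web context win.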
• An example is presented below for the sentence “Scoala incepe sambata”, which has two possible solutions:
◦ Şcoala începe sâmbătă. (School starts this Saturday.)
◦ Şcoala începe sâmbăta. (School usually starts on Saturdays.)
• Step 7 - Context improvement:
◦ The backward rule
◦ The forward rule
◦ The maximization rule
• To evaluate the system's performance, we used a large file containing the Calimera Guidelines (14,148 sentences).
• The paper presents a method for restoring diacritics using contexts found on the web.
• The system's accuracy is similar to that of existing systems, but its main advantage is that it uses resources and tools available for free.
• We also tested our algorithm on other languages, such as French and German, and the results are very promising.

Recovering Diacritics using Wikipedia and Google

Editor's Notes

  1. For every word of the initial sentence we build layers holding positions, as follows: each form found in DBPF is placed on a different layer, and every layer holds the positions of the corresponding form.
  2. For the initial sentence we consider an ordered set of layers associated with its words. A path between two layers is an ordered set of positions taken from every layer between the two layers considered. A full path from the first layer (corresponding to the first word of the sentence) to the last layer (corresponding to the last word of the sentence) has consecutive positions from every layer.
  3. The backward rule searches previously solved sentences to see which forms were already used for words with multiple forms. The forward rule puts the sentence on hold until the following sentences are solved; the forms identified there are then used in unclear situations. A further rule is the maximization rule: when we have a high level of confidence in the form identified for some words, we decide to use the same form for these words in other sentences within a specified “neighborhood”.
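As an illustration of the backward rule described in note 3, the sketch below reuses a form that already appeared in previously solved sentences. The notes do not say which earlier occurrence is preferred; choosing the most recent one is an assumption, and the function name `backward_rule` is hypothetical.

```python
def backward_rule(candidates, solved_sentences):
    """Return a candidate form already used in earlier solved sentences.

    `candidates` is the set of possible forms of the ambiguous word.
    `solved_sentences` is a list of sentences, each a list of restored forms,
    in the order they were solved.
    Assumption: the most recent occurrence wins; None means no earlier use.
    """
    for sentence in reversed(solved_sentences):
        for form in reversed(sentence):
            if form in candidates:
                return form
    return None
```

The forward rule could reuse the same function once the following sentences are solved, by passing those sentences instead of the preceding ones.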