Indexing the Albanian Language
by Andri Xhitoni
May 2015
Indexing the Albanian Language
The problem: Have search functionality
work on a website that's in Albanian.
Intricacies of search
Many think of search as
a straightforward process
“in go search terms, out come results”
it’s not that simple...
Words take on many forms.
Words may have different meanings
based on context
Some words have no real semantic value
and must be ignored (stop words)
How do the big guys do it?
No searching through raw content
Search through optimized versions
of the raw content (indexing)
Basic indexing process
Alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought Alice `without pictures or conversation?'
Basic indexing process
Normalize the characters (transliteration)
and remove punctuation
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought alice `without pictures or conversation?'
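The normalization step above could be sketched roughly as follows in Python. This is only an illustration: the function name `normalize` is made up here, and stripping combining marks via Unicode decomposition is a crude stand-in for whatever transliteration the real indexer uses.

```python
import string
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip diacritics (a crude transliteration), drop punctuation."""
    # Decompose accented letters ("ë" becomes "e" plus a combining mark),
    # then throw the combining marks away.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace ASCII punctuation with spaces and collapse runs of whitespace.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return " ".join(plain.translate(table).split())


print(normalize("thought Alice `without pictures or conversation?'"))
print(normalize("Çdo ditë"))
```

Note that `string.punctuation` only covers ASCII punctuation, so typographic quotes would need to be added to the table separately.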
Basic indexing process
Remove stop words
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought alice `without pictures or conversation?'
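A minimal sketch of stop-word removal, assuming the text has already been normalized. The `STOP_WORDS` set below is a tiny hypothetical list chosen for this English sample, not a list taken from the talk.

```python
# Hypothetical, deliberately tiny stop-word list; a real one is much longer.
STOP_WORDS = {
    "the", "of", "to", "and", "was", "her", "on", "by", "it",
    "in", "a", "is", "or", "had", "but", "no", "she", "what",
}


def remove_stop_words(text: str) -> list[str]:
    """Split normalized text on whitespace and drop every stop word."""
    return [word for word in text.split() if word not in STOP_WORDS]


sample = "alice was beginning to get very tired of sitting by her sister on the bank"
print(remove_stop_words(sample))
```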
Basic indexing process
Transform each remaining word to its "basic version"
(stemming)
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
think alice `without pictures or conversation?'
Basic indexing process
Store the indexed content alongside the original
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
think alice `without pictures or conversation?'
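Storing the indexed content alongside the original might look like the sketch below. The names `IndexedDocument`, `store` and `add_document` are hypothetical, and `simple_index` is only a placeholder for the full normalize, stop-word and stemming pipeline.

```python
from dataclasses import dataclass


@dataclass
class IndexedDocument:
    original: str  # what the user sees in the results
    indexed: str   # what the search actually runs against


def simple_index(text: str) -> str:
    """Placeholder for the real indexing pipeline: lowercase and tidy spaces."""
    return " ".join(text.lower().split())


store: dict[int, IndexedDocument] = {}


def add_document(doc_id: int, text: str) -> None:
    """Keep both versions so results can show the original text."""
    store[doc_id] = IndexedDocument(original=text, indexed=simple_index(text))


add_document(1, "Alice  was Reading")
print(store[1])
```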
Performing the search
the book Alice’s sister was reading
Performing the search
the book alice’s sister was reading
Perform the same indexing on the search terms
Performing the search
Search for the indexed search terms
in the indexed content
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
think alice `without pictures or conversation?'
the book alice’s sister was reading
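The key point above is that the query goes through the same indexing as the content. A rough sketch, with a hypothetical stop-word list and hypothetical function names (`index_terms`, `matching_terms`), and with stemming left out for brevity:

```python
import string

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "was", "of", "to", "and", "a", "her", "it", "but"}


def index_terms(text: str) -> list[str]:
    """Apply one indexing routine to both documents and queries."""
    table = str.maketrans({ch: " " for ch in string.punctuation + "’"})
    words = text.lower().translate(table).split()
    # Single letters (such as the "s" left over from a possessive) are dropped.
    return [w for w in words if len(w) > 1 and w not in STOP_WORDS]


def matching_terms(query: str, document: str) -> list[str]:
    """Return the indexed query terms that also occur in the indexed document."""
    doc_terms = set(index_terms(document))
    return [term for term in index_terms(query) if term in doc_terms]


document = "peeped into the book her sister was reading, but it had no pictures"
print(matching_terms("the book Alice’s sister was reading", document))
```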
Performing the search
Rank results according to number of occurrences,
closeness of terms, position in the indexed text
alice was beginning to get very tired of sitting by her sister on
the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
think alice `without pictures or conversation?'
the book alice’s sister was reading
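One possible way to combine the three ranking signals named above is sketched here. The scoring formula and its equal weighting are assumptions made for illustration; the talk does not specify how the signals are combined.

```python
def score(query_terms: list[str], doc_terms: list[str]) -> float:
    """Toy score: number of hits + how tightly they cluster + how early they start."""
    wanted = set(query_terms)
    hits = [i for i, term in enumerate(doc_terms) if term in wanted]
    if not hits:
        return 0.0
    occurrences = len(hits)
    # Closeness: hits divided by the span they cover (1.0 means adjacent hits).
    closeness = occurrences / (hits[-1] - hits[0] + 1)
    # Position: a first hit at the very start of the text scores 1.0.
    earliness = 1.0 - hits[0] / len(doc_terms)
    return occurrences + closeness + earliness


def rank(query_terms: list[str], docs: dict[str, list[str]]) -> list[str]:
    """Return ids of matching documents, best score first."""
    scored = [(score(query_terms, terms), doc_id) for doc_id, terms in docs.items()]
    return [doc_id for s, doc_id in sorted(scored, reverse=True) if s > 0]


docs = {
    "a": ["book", "sister", "reading", "x", "y"],
    "b": ["x", "book", "y", "y", "sister"],
    "c": ["x"],
}
print(rank(["book", "sister", "reading"], docs))
```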
Add the Albanian language
on top of the problem
No known "stop words" list
Non-trivial stemming process
High irregularity in word formation
Vast number of forms for each single word
Just a taste of the complexity
Nouns: 6 cases
x 2 numbers (singular, plural)
x 2 definiteness values (definite, indefinite)
~24 word forms
Verbs: 3 unique word-forming moods (of 6)
x 4 unique word-forming tenses (of 8)
x 2 voices (active, passive)
x 6 conjugative forms
~70 word forms
Looking for solutions
Ideally:
A list of stop words
A (huge) list of all possible word forms
for all words in Albanian,
linked to their stem form.
Looking for solutions
Sources:
The Dictionary
highly comprehensive
only base word forms
The Internet
not too comprehensive
many word forms
potential errors
Hybrid source
a probability-based model
picking (hopefully) the best
from both sources
Data mining: Stop words
Get as many texts in Albanian as possible
(the more diverse, the better)
Transliterate the texts
Keep a running count of the occurrence for each word
Sort the list by occurrence count (highest first).
Stop words will float to the top.
Manually white-list obvious false positives
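The frequency-counting steps above can be sketched as follows. The function name `stop_word_candidates`, the `top_n` cut-off and the sample phrases are assumptions for illustration; the texts are assumed to be transliterated already.

```python
from collections import Counter


def stop_word_candidates(texts, top_n, whitelist=frozenset()):
    """Count every word across all texts and return the most frequent ones.

    Words in `whitelist` are the manually reviewed false positives:
    frequent words that carry meaning and must stay searchable.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    ranked = [word for word, _ in counts.most_common() if word not in whitelist]
    return ranked[:top_n]


# Made-up, already transliterated sample phrases.
texts = ["ne shtepi dhe ne shkolle", "ne pune dhe ne rruge", "ne mal"]
print(stop_word_candidates(texts, 2))
```

With a real corpus the cut-off would be chosen by inspecting the sorted list rather than fixed in advance.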
Data mining: Stemming
Invert each word from the collected list
Sort the list alphabetically
(effectively sorting by suffixes)
Find highest occurring suffixes of 2, 3 and 4 letters
Manually look for false positives
and put them in a white list
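A sketch of the suffix-mining steps above. Reversing each word turns its suffix into a prefix, so an alphabetical sort groups words with the same ending, and counting the first 2, 3 and 4 letters of the reversed words counts suffixes. The function names and the small word list are illustrative assumptions.

```python
from collections import Counter


def reversed_sorted(words):
    """Invert each distinct word and sort, so shared endings sit together."""
    return sorted({word[::-1] for word in words})


def common_suffixes(words, lengths=(2, 3, 4), top_n=3):
    """For each length, count how often each suffix occurs among distinct words."""
    inverted = reversed_sorted(words)
    result = {}
    for n in lengths:
        # The first n letters of an inverted word are its last n letters, reversed.
        counts = Counter(w[:n][::-1] for w in inverted if len(w) > n)
        result[n] = counts.most_common(top_n)
    return result


# Small illustrative word list (already transliterated).
words = ["shkolla", "shkollat", "shkollave", "librave", "maleve", "librat", "malet"]
print(reversed_sorted(words))
print(common_suffixes(words))
```

The counts only suggest candidates; the manual white-list pass is still needed, because a frequent ending is not always a grammatical suffix.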
The (basic) indexing algorithm
https://github.com/andrixh/index-albanian
Transliterate the input text
Find and remove all stop words
Go through each word and remove
the found suffixes (largest to smallest)
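Put together, the algorithm above might look like this sketch. It is not the code from the linked repository: `STOP_WORDS`, `SUFFIXES` and `MIN_STEM` are hypothetical values, and stripping at most one suffix per word (longest match first) is one reading of "largest to smallest".

```python
import string
import unicodedata

STOP_WORDS = {"ne", "dhe", "e", "te"}         # hypothetical mined stop words
SUFFIXES = ["ave", "eve", "at", "et", "ve"]   # hypothetical mined suffixes
MIN_STEM = 3                                  # assumed guard against over-stripping


def transliterate(text: str) -> str:
    """Lowercase, strip diacritics and ASCII punctuation."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return " ".join(plain.translate(table).split())


def stem(word: str) -> str:
    """Strip the longest matching suffix, keeping at least MIN_STEM letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


def index_text(text: str) -> list[str]:
    """Transliterate, remove stop words, then stem what is left."""
    words = transliterate(text).split()
    return [stem(word) for word in words if word not in STOP_WORDS]


# Made-up sample phrase, used only to exercise the three steps.
print(index_text("Librat dhe shkollat në malet"))
print(stem("librave"), stem("librat"))
```

Different forms of the same word ("librat", "librave") end up with the same stem, which is what lets a query in one form match content in another.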
Indexing the Albanian Language
by Andri Xhitoni
Thank you!
https://github.com/andrixh/index-albanian

More Related Content

Similar to Andri Xhitoni - Indexing Albanian Language

Natural Language Processing and Search Intent Understanding C3 Conductor 2019...
Natural Language Processing and Search Intent Understanding C3 Conductor 2019...Natural Language Processing and Search Intent Understanding C3 Conductor 2019...
Natural Language Processing and Search Intent Understanding C3 Conductor 2019...
Dawn Anderson MSc DigM
 
Fpt Academic Writing Grammars
Fpt Academic Writing GrammarsFpt Academic Writing Grammars
Fpt Academic Writing GrammarsHung Nguyen
 
PLOTCON NYC: Text is data! Analysis and Visualization Methods
PLOTCON NYC: Text is data! Analysis and Visualization MethodsPLOTCON NYC: Text is data! Analysis and Visualization Methods
PLOTCON NYC: Text is data! Analysis and Visualization Methods
Plotly
 
копия How to teach vocabulary
копия How  to teach vocabularyкопия How  to teach vocabulary
копия How to teach vocabulary
Iryna Grusha
 
Luis canas
Luis canasLuis canas
Luis canas
ronaldo1416
 
present perfecto
present perfectopresent perfecto
present perfecto
Ericssón Muñoz Muñoz
 
Definite and Indefinite Articles
Definite and Indefinite ArticlesDefinite and Indefinite Articles
Definite and Indefinite Articles
Shelli Seehusen
 
Subject verb agreement exercise answers
Subject verb agreement exercise answersSubject verb agreement exercise answers
Subject verb agreement exercise answers
Patrick John Ibanez
 
RelativeClausesEnglishOnline(for CLB 5+).pptx
RelativeClausesEnglishOnline(for CLB 5+).pptxRelativeClausesEnglishOnline(for CLB 5+).pptx
RelativeClausesEnglishOnline(for CLB 5+).pptx
English Online Inc.
 
E10 Feb17 2010
E10 Feb17 2010E10 Feb17 2010
E10 Feb17 2010mlsteacher
 
activetopassivevoicebasicrules-140927001142-phpapp02.pdf
activetopassivevoicebasicrules-140927001142-phpapp02.pdfactivetopassivevoicebasicrules-140927001142-phpapp02.pdf
activetopassivevoicebasicrules-140927001142-phpapp02.pdf
MuhammadSajed1
 
Summary quote paraphrase workshop
Summary quote paraphrase workshopSummary quote paraphrase workshop
Summary quote paraphrase workshop
kb615
 
Active to passive voice basic rules
Active to passive voice  basic rulesActive to passive voice  basic rules
Active to passive voice basic rules
Tika Subedi
 
Ova Gleidy
Ova GleidyOva Gleidy
Ova Gleidy
guesta59e08d
 
Adso update
Adso updateAdso update
Adso update
Adlov Alexandr
 
Active passive
Active passiveActive passive
Active passive
Rabia Khan
 
Course intro
Course introCourse intro
Course intro
Nik Farhan
 

Similar to Andri Xhitoni - Indexing Albanian Language (20)

Natural Language Processing and Search Intent Understanding C3 Conductor 2019...
Natural Language Processing and Search Intent Understanding C3 Conductor 2019...Natural Language Processing and Search Intent Understanding C3 Conductor 2019...
Natural Language Processing and Search Intent Understanding C3 Conductor 2019...
 
Fpt Academic Writing Grammars
Fpt Academic Writing GrammarsFpt Academic Writing Grammars
Fpt Academic Writing Grammars
 
PLOTCON NYC: Text is data! Analysis and Visualization Methods
PLOTCON NYC: Text is data! Analysis and Visualization MethodsPLOTCON NYC: Text is data! Analysis and Visualization Methods
PLOTCON NYC: Text is data! Analysis and Visualization Methods
 
Chapter10
Chapter10Chapter10
Chapter10
 
копия How to teach vocabulary
копия How  to teach vocabularyкопия How  to teach vocabulary
копия How to teach vocabulary
 
Luis canas
Luis canasLuis canas
Luis canas
 
present perfecto
present perfectopresent perfecto
present perfecto
 
Definite and Indefinite Articles
Definite and Indefinite ArticlesDefinite and Indefinite Articles
Definite and Indefinite Articles
 
Subject verb agreement exercise answers
Subject verb agreement exercise answersSubject verb agreement exercise answers
Subject verb agreement exercise answers
 
RelativeClausesEnglishOnline(for CLB 5+).pptx
RelativeClausesEnglishOnline(for CLB 5+).pptxRelativeClausesEnglishOnline(for CLB 5+).pptx
RelativeClausesEnglishOnline(for CLB 5+).pptx
 
6th lecture
6th lecture6th lecture
6th lecture
 
Grammar
GrammarGrammar
Grammar
 
E10 Feb17 2010
E10 Feb17 2010E10 Feb17 2010
E10 Feb17 2010
 
activetopassivevoicebasicrules-140927001142-phpapp02.pdf
activetopassivevoicebasicrules-140927001142-phpapp02.pdfactivetopassivevoicebasicrules-140927001142-phpapp02.pdf
activetopassivevoicebasicrules-140927001142-phpapp02.pdf
 
Summary quote paraphrase workshop
Summary quote paraphrase workshopSummary quote paraphrase workshop
Summary quote paraphrase workshop
 
Active to passive voice basic rules
Active to passive voice  basic rulesActive to passive voice  basic rules
Active to passive voice basic rules
 
Ova Gleidy
Ova GleidyOva Gleidy
Ova Gleidy
 
Adso update
Adso updateAdso update
Adso update
 
Active passive
Active passiveActive passive
Active passive
 
Course intro
Course introCourse intro
Course intro
 

More from Open Labs Albania

Clair Tolan - Passwords for the clouds
Clair Tolan - Passwords for the cloudsClair Tolan - Passwords for the clouds
Clair Tolan - Passwords for the clouds
Open Labs Albania
 
Ismet Azizi - Shquarsia: Si mund të siguroheni që artikulli juaj nuk do të fs...
Ismet Azizi - Shquarsia: Si mund të siguroheni që artikulli juaj nuk do të fs...Ismet Azizi - Shquarsia: Si mund të siguroheni që artikulli juaj nuk do të fs...
Ismet Azizi - Shquarsia: Si mund të siguroheni që artikulli juaj nuk do të fs...
Open Labs Albania
 
Georges Labreche - Open Data Kosovo - Open data for good governance
Georges Labreche - Open Data Kosovo - Open data for good governanceGeorges Labreche - Open Data Kosovo - Open data for good governance
Georges Labreche - Open Data Kosovo - Open data for good governance
Open Labs Albania
 
Chris Ward - Taking Open Source beyond Software
Chris Ward - Taking Open Source beyond SoftwareChris Ward - Taking Open Source beyond Software
Chris Ward - Taking Open Source beyond Software
Open Labs Albania
 
Bruno Skvorc - Open sourcing content - peer review's effect on quality
Bruno Skvorc - Open sourcing content - peer review's effect on qualityBruno Skvorc - Open sourcing content - peer review's effect on quality
Bruno Skvorc - Open sourcing content - peer review's effect on quality
Open Labs Albania
 
Alex Corbi - Building 100 percent os open data platform
Alex Corbi - Building 100 percent os open data platformAlex Corbi - Building 100 percent os open data platform
Alex Corbi - Building 100 percent os open data platform
Open Labs Albania
 
Kiril Simeonovski - The value of open knowledge
Kiril Simeonovski - The value of open knowledgeKiril Simeonovski - The value of open knowledge
Kiril Simeonovski - The value of open knowledge
Open Labs Albania
 
Gjergj Sheldija - Healthcare and Open Technology
Gjergj Sheldija - Healthcare and Open TechnologyGjergj Sheldija - Healthcare and Open Technology
Gjergj Sheldija - Healthcare and Open Technology
Open Labs Albania
 
Giannis Konstantinidis - The fedora community
Giannis Konstantinidis - The fedora communityGiannis Konstantinidis - The fedora community
Giannis Konstantinidis - The fedora community
Open Labs Albania
 
Enkeleda Ibrahimi - Open source security
Enkeleda Ibrahimi - Open source securityEnkeleda Ibrahimi - Open source security
Enkeleda Ibrahimi - Open source security
Open Labs Albania
 
Chris Heilmann - The new challenge of open
Chris Heilmann - The new challenge of openChris Heilmann - The new challenge of open
Chris Heilmann - The new challenge of open
Open Labs Albania
 
Bruno Skvorc - The many ways to contribute to open source
Bruno Skvorc - The many ways to contribute to open sourceBruno Skvorc - The many ways to contribute to open source
Bruno Skvorc - The many ways to contribute to open source
Open Labs Albania
 
Blerta Thaçi & zana Idrizi - Empowering women in the community of coding
Blerta Thaçi & zana Idrizi - Empowering women in the community of coding Blerta Thaçi & zana Idrizi - Empowering women in the community of coding
Blerta Thaçi & zana Idrizi - Empowering women in the community of coding
Open Labs Albania
 
Bledar Gjocaj - Java open source
Bledar Gjocaj - Java open sourceBledar Gjocaj - Java open source
Bledar Gjocaj - Java open source
Open Labs Albania
 
Besfort Guri - OS Geo Live
Besfort Guri - OS Geo LiveBesfort Guri - OS Geo Live
Besfort Guri - OS Geo Live
Open Labs Albania
 
Besfort Guri - Floss Tools for Gis
Besfort Guri - Floss Tools for GisBesfort Guri - Floss Tools for Gis
Besfort Guri - Floss Tools for Gis
Open Labs Albania
 
Alex Corbi - Visualizing open data with carto_db
Alex Corbi - Visualizing open data with carto_dbAlex Corbi - Visualizing open data with carto_db
Alex Corbi - Visualizing open data with carto_db
Open Labs Albania
 
Inva Veliu & Florian Tani - Open Atrium
Inva Veliu & Florian Tani - Open AtriumInva Veliu & Florian Tani - Open Atrium
Inva Veliu & Florian Tani - Open Atrium
Open Labs Albania
 
Greta Doçi - WikiAcademy Albania
Greta Doçi - WikiAcademy AlbaniaGreta Doçi - WikiAcademy Albania
Greta Doçi - WikiAcademy Albania
Open Labs Albania
 
Altin Ukshini - WikiAcademy Kosovo
Altin Ukshini - WikiAcademy KosovoAltin Ukshini - WikiAcademy Kosovo
Altin Ukshini - WikiAcademy Kosovo
Open Labs Albania
 

More from Open Labs Albania (20)

Clair Tolan - Passwords for the clouds
Clair Tolan - Passwords for the cloudsClair Tolan - Passwords for the clouds
Clair Tolan - Passwords for the clouds
 
Ismet Azizi - Shquarsia: Si mund të siguroheni që artikulli juaj nuk do të fs...
Ismet Azizi - Shquarsia: Si mund të siguroheni që artikulli juaj nuk do të fs...Ismet Azizi - Shquarsia: Si mund të siguroheni që artikulli juaj nuk do të fs...
Ismet Azizi - Shquarsia: Si mund të siguroheni që artikulli juaj nuk do të fs...
 
Georges Labreche - Open Data Kosovo - Open data for good governance
Georges Labreche - Open Data Kosovo - Open data for good governanceGeorges Labreche - Open Data Kosovo - Open data for good governance
Georges Labreche - Open Data Kosovo - Open data for good governance
 
Chris Ward - Taking Open Source beyond Software
Chris Ward - Taking Open Source beyond SoftwareChris Ward - Taking Open Source beyond Software
Chris Ward - Taking Open Source beyond Software
 
Bruno Skvorc - Open sourcing content - peer review's effect on quality
Bruno Skvorc - Open sourcing content - peer review's effect on qualityBruno Skvorc - Open sourcing content - peer review's effect on quality
Bruno Skvorc - Open sourcing content - peer review's effect on quality
 
Alex Corbi - Building 100 percent os open data platform
Alex Corbi - Building 100 percent os open data platformAlex Corbi - Building 100 percent os open data platform
Alex Corbi - Building 100 percent os open data platform
 
Kiril Simeonovski - The value of open knowledge
Kiril Simeonovski - The value of open knowledgeKiril Simeonovski - The value of open knowledge
Kiril Simeonovski - The value of open knowledge
 
Gjergj Sheldija - Healthcare and Open Technology
Gjergj Sheldija - Healthcare and Open TechnologyGjergj Sheldija - Healthcare and Open Technology
Gjergj Sheldija - Healthcare and Open Technology
 
Giannis Konstantinidis - The fedora community
Giannis Konstantinidis - The fedora communityGiannis Konstantinidis - The fedora community
Giannis Konstantinidis - The fedora community
 
Enkeleda Ibrahimi - Open source security
Enkeleda Ibrahimi - Open source securityEnkeleda Ibrahimi - Open source security
Enkeleda Ibrahimi - Open source security
 
Chris Heilmann - The new challenge of open
Chris Heilmann - The new challenge of openChris Heilmann - The new challenge of open
Chris Heilmann - The new challenge of open
 
Bruno Skvorc - The many ways to contribute to open source
Bruno Skvorc - The many ways to contribute to open sourceBruno Skvorc - The many ways to contribute to open source
Bruno Skvorc - The many ways to contribute to open source
 
Blerta Thaçi & zana Idrizi - Empowering women in the community of coding
Blerta Thaçi & zana Idrizi - Empowering women in the community of coding Blerta Thaçi & zana Idrizi - Empowering women in the community of coding
Blerta Thaçi & zana Idrizi - Empowering women in the community of coding
 
Bledar Gjocaj - Java open source
Bledar Gjocaj - Java open sourceBledar Gjocaj - Java open source
Bledar Gjocaj - Java open source
 
Besfort Guri - OS Geo Live
Besfort Guri - OS Geo LiveBesfort Guri - OS Geo Live
Besfort Guri - OS Geo Live
 
Besfort Guri - Floss Tools for Gis
Besfort Guri - Floss Tools for GisBesfort Guri - Floss Tools for Gis
Besfort Guri - Floss Tools for Gis
 
Alex Corbi - Visualizing open data with carto_db
Alex Corbi - Visualizing open data with carto_dbAlex Corbi - Visualizing open data with carto_db
Alex Corbi - Visualizing open data with carto_db
 
Inva Veliu & Florian Tani - Open Atrium
Inva Veliu & Florian Tani - Open AtriumInva Veliu & Florian Tani - Open Atrium
Inva Veliu & Florian Tani - Open Atrium
 
Greta Doçi - WikiAcademy Albania
Greta Doçi - WikiAcademy AlbaniaGreta Doçi - WikiAcademy Albania
Greta Doçi - WikiAcademy Albania
 
Altin Ukshini - WikiAcademy Kosovo
Altin Ukshini - WikiAcademy KosovoAltin Ukshini - WikiAcademy Kosovo
Altin Ukshini - WikiAcademy Kosovo
 

Recently uploaded

Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 

Recently uploaded (20)

Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4

Andri Xhitoni - Indexing Albanian Language

  • 1. Indexing the Albanian Language, by Andri Xhitoni, May 2015
  • 2. Indexing the Albanian Language The problem: Have search functionality work on a website that's in Albanian.
  • 7. Intricacies of search: Many think of search as a straightforward process: “in go search terms, out come results”
  • 12. It’s not that simple... Words take on many forms. Words may have different meanings based on context. Some words have no real semantic value and must be ignored (stop words).
  • 15. How do the big guys do it? No searching through raw content. Search through optimized versions of the raw content (indexing).
  • 16. Basic indexing process. Sample text: Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'
  • 17. Basic indexing process. Normalize the characters (transliteration) and remove punctuation: alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought alice `without pictures or conversation?'
  • 18. Basic indexing process. Remove stop words: alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought alice `without pictures or conversation?'
  • 19. Basic indexing process. Transform each remaining word to its "basic version" (stemming): alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' think alice `without pictures or conversation?'
  • 20. Basic indexing process. Store the indexed content alongside the original: alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' think alice `without pictures or conversation?'
  • 22. Performing the search. Search terms: the book Alice’s sister was reading
  • 23. Performing the search. Perform the same indexing on the search terms: the book alice’s sister was reading
  • 24. Performing the search. Search for the indexed search terms in the indexed content. Indexed content: alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' think alice `without pictures or conversation?' Search terms: the book alice’s sister was reading
  • 25. Performing the search. Rank results according to number of occurrences, closeness of terms, and position in the indexed text. Indexed content: alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' think alice `without pictures or conversation?' Search terms: the book alice’s sister was reading
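The ranking step on slide 25 can be illustrated with a minimal sketch. It assumes every document and the query have already been reduced to lists of indexed terms by the same indexing routine. The names `score` and `search` and the ordering of the criteria are hypothetical, not taken from the talk, and closeness of terms is left out to keep the example short.

```python
def score(doc_terms, query_terms):
    """Return a sortable score for one indexed document, or None if nothing matches.

    Criteria (an assumed ordering): distinct query terms matched, then total
    occurrences, then how early the first match appears in the indexed text.
    """
    wanted = set(query_terms)
    positions = [i for i, term in enumerate(doc_terms) if term in wanted]
    if not positions:
        return None
    distinct = len({doc_terms[i] for i in positions})
    # Negate the first position so that earlier matches compare as larger.
    return (distinct, len(positions), -positions[0])


def search(index, query_terms):
    """Rank the documents of `index` (doc id -> list of indexed terms), best first."""
    scored = []
    for doc_id, doc_terms in index.items():
        result = score(doc_terms, query_terms)
        if result is not None:
            scored.append((result, doc_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]
```

Documents that contain none of the indexed search terms are dropped from the results rather than ranked last.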
  • 30. Add the Albanian language on top of the problem: no known "stop words" list; non-trivial stemming process; high irregularity in word formation; vast number of forms for each single word.
  • 31. Just a taste of the complexity. Nouns: 6 cases x 2 numbers (singular, plural) x 2 definiteness values (definite, indefinite); ~24 word forms. Verbs: 3 unique word-forming moods (of 6) x 4 unique word-forming tenses (of 8) x 2 voices (active, passive) x 6 conjugative forms; ~70 word forms.
  • 35. Looking for solutions. Ideally: a list of stop words, and a (huge) list of all possible word forms for all words in Albanian, linked to their stem form.
  • 39. Looking for solutions. Sources: the dictionary (highly comprehensive, only base word forms); the Internet (not too comprehensive, many word forms, potential errors); a hybrid source (a probability-based model picking, hopefully, the best from both sources).
  • 46. Data mining: stop words. (1) Get as many texts in Albanian as possible (the more diverse, the better). (2) Transliterate the texts. (3) Keep a running count of the occurrences of each word. (4) Sort the list by occurrence count (highest first); stop words will float to the top. (5) Manually white-list obvious false positives.
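The stop-word mining steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's code: `transliterate` is assumed to mean lower-casing and stripping diacritics (so that "ë" becomes "e" and "ç" becomes "c"), and the names `stop_word_candidates`, `top_n` and `whitelist` are hypothetical.

```python
import re
import unicodedata
from collections import Counter


def transliterate(text):
    """Lower-case the text and strip diacritics (assumed meaning of 'transliterate')."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stop_word_candidates(texts, top_n=50, whitelist=frozenset()):
    """Return the most frequent words across `texts`, skipping white-listed ones."""
    counts = Counter()
    for text in texts:
        # Keep a running count of every word in the transliterated text.
        counts.update(re.findall(r"[a-z]+", transliterate(text)))
    # most_common() sorts by occurrence count, highest first.
    ranked = [word for word, _ in counts.most_common() if word not in whitelist]
    return ranked[:top_n]
```

The white list holds words that are frequent in the corpus but still carry meaning, so they are kept out of the resulting stop-word list. Where to cut the ranked list (`top_n`) remains a manual judgment call.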
  • 51. Data mining: stemming. (1) Invert each word from the collected list (reverse its letters). (2) Sort the list alphabetically (effectively sorting by suffixes). (3) Find the highest-occurring suffixes of 2, 3 and 4 letters. (4) Manually look for false positives and put them in a white list.
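A rough sketch of the suffix-mining steps follows. The function name `suffix_candidates` and its parameters are hypothetical, and skipping words that are no longer than the suffix is an added assumption so that a whole word is never counted as its own suffix.

```python
from collections import Counter


def suffix_candidates(words, lengths=(2, 3, 4), top_n=20, whitelist=frozenset()):
    """Return the most common word endings for each suffix length."""
    # Reverse each distinct word and sort, so words sharing an ending become
    # neighbours. Counting does not need the sort; it mirrors the slides and
    # makes the list easy to inspect by eye.
    reversed_sorted = sorted(word[::-1] for word in set(words))
    result = {}
    for n in lengths:
        counts = Counter(
            rev[:n][::-1]  # first n letters of the reversed word = last n letters
            for rev in reversed_sorted
            if len(rev) > n  # assumption: leave at least one letter of stem
        )
        ranked = [suffix for suffix, _ in counts.most_common() if suffix not in whitelist]
        result[n] = ranked[:top_n]
    return result
```

Endings that turn out to be false positives after manual review go into `whitelist` and are left out of the result.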
  • 56. The (basic) indexing algorithm (https://github.com/andrixh/index-albanian). (1) Transliterate the input text. (2) Find and remove all stop words. (3) Go through each word and remove the found suffixes (largest to smallest).
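The three steps of the basic algorithm might look roughly like this in Python. This sketch is not taken from the linked repository. `STOP_WORDS` and `SUFFIXES` are tiny hypothetical samples standing in for the mined lists, `min_stem` is an added assumption that keeps very short words intact, and stripping at most one suffix per word is one possible reading of "largest to smallest".

```python
import re
import unicodedata

# Hypothetical sample data; real lists would come from the data-mining steps.
STOP_WORDS = {"dhe", "ne", "te", "e", "i", "per"}
SUFFIXES = ["ave", "it", "at", "et"]


def transliterate(text):
    """Lower-case the text and strip diacritics (assumed meaning of 'transliterate')."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_text(text, stop_words=STOP_WORDS, suffixes=SUFFIXES, min_stem=3):
    """Return the list of indexed terms for `text`."""
    by_length = sorted(suffixes, key=len, reverse=True)  # try the largest suffix first
    terms = []
    for word in re.findall(r"[a-z]+", transliterate(text)):
        if word in stop_words:
            continue  # stop words are dropped entirely
        for suffix in by_length:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                break  # assumption: strip at most one suffix per word
        terms.append(word)
    return terms
```

Running the same function over the search terms yields terms that can be matched directly against the stored indexed content.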
  • 57. Indexing the Albanian Language by Andri Xhitoni Thank you! https://github.com/andrixh/index-albanian