Arabic Content with Apache Solr 
Ramzi Alqrainy
Ramzi Alqrainy 
• MSc in Computer Science, University of 
Jordan, Amman, Jordan 
• Senior Enterprise Search / Data Engineer @ 
OpenSooq.com 
• Technical Reviewer for “Scaling Apache Solr” 
and “Apache Solr Search Patterns” (Books) 
• Co-founder of Solr.ar group 
• Built 8 search engines for different models in 
the last 2 years 
• Active blogger and presenter on 
information retrieval
Agenda 
• Why is Arabic Language Important? 
• Arabic Language is Complex 
• How we use Apache Solr @ OpenSooq? 
• Localization Concept with SolrCloud 
• Ranking and Relevancy 
• Apache Solr Implementations @ OpenSooq
Why is Arabic Language Important? 
[Figure: sample Arabic document without dots] 
[Figure: sample Arabic document with dots] 
• Arabic ranks as the fourth most-used language on the web. 
• The number of Arab Internet users grew from 
65 million in 2011 to 135 million in 2013.
Arabic Language is Complex 
• Arabic Orthography and Print 
§ Arabic has a right-to-left connected script that uses 28 basic letters, which change shape depending on their positions in words. 
• Arabic Diacritics 
§ Diacritics help disambiguate the meaning of words. 
§ For example, the two words عَلَم (Alam, meaning “flag”) and عِلم (Eilm, meaning “knowledge”) share the same letters علم (Elm) but differ in diacritics.
Arabic Language is Complex 
• Arabic Morphology 
§ Arabic words are divided into three main types: nouns, verbs, and particles. 
§ Arabic nouns (which include adjectives and adverbs) and verbs are derived from a closed set of around 10,000 roots.
Arabic Language is Complex 
• Arabic Dialects 
§ There are 6 dominant dialects, with many more variations of them, and dozens more less-spoken dialects. 
§ E.g., the concept corresponding to “I want” is expressed as عاوز (Eawz) in Egyptian, أبغى (Abgy) in Gulf, أبي (Aby) in Iraqi, and بدي (bdy) in Levantine. 
• Arabizi (Transliteration) 
§ Arabic is sometimes written using Latin characters in transliterated form. 
§ Arabizi uses numerals to represent Arabic letters. 
§ E.g., “2” and “3” represent the letters أ (which sounds like “a” as in apple) and ع (E) (a guttural “aa”), respectively.
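The numeral convention above can be sketched in a few lines of Python. This is only an illustration of the two mappings named on the slide ("2" and "3"); the function name is hypothetical, and a real Arabizi transliterator would need far more than a digit table.

```python
# Digit-to-letter mappings taken from the slide: "2" -> hamza on alef, "3" -> ain.
ARABIZI_DIGITS = {
    "2": "\u0623",  # أ
    "3": "\u0639",  # ع
}


def map_arabizi_digits(text: str) -> str:
    """Replace Arabizi numerals with the Arabic letters they stand for.

    Hypothetical helper: it leaves every other character untouched.
    """
    return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in text)
```

Note that this naive version also rewrites genuine numbers (prices, years), so a production pipeline would have to decide when a digit is really a letter.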
How we use Apache Solr @ OpenSooq ? 
• A leading classified-ads website in the Middle East and North Africa. 
• Right now: more than 7K concurrent users on average. 
• Activity per second: 240 APS. 
§ Adding/editing/deleting posts 
§ Adding comments 
§ Sending messages to buyers/sellers, etc. 
• More than 40K hits on Apache Solr per minute.
How we use Apache Solr @ OpenSooq ? 
• Arabic Search Engine
Arabic Normalization 
• There are common spelling mistakes that are widely accepted. For example, the verb ادرس (Adrs) in the imperative mood (meaning “study”, as a command) is commonly written as أدرس. 
• Arabic content is normalized according to the following steps: 
§ Remove punctuation. 
§ Remove diacritics (primarily weak vowels). 
§ Remove non-letters. 
§ Replace ا, إ, and أ with ا (A, alef) at the start of each word. 
§ Replace final ى with ي (Ya). 
§ Replace final ة with ه (Ha).
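The steps above can be sketched in Python as follows. This is a minimal illustration, not the code used at OpenSooq: the function name and the Unicode ranges are assumptions, and the initial-alef rule also folds آ (alef with madda), which is an assumption beyond the letters listed on the slide.

```python
import re

# Tatweel (U+0640) and the diacritics fathatan..sukun (U+064B-U+0652).
DIACRITICS = re.compile(r"[\u0640\u064B-\u0652]")
# Anything that is neither a basic Arabic letter nor whitespace.
NON_LETTERS = re.compile(r"[^\u0621-\u063A\u0641-\u064A\s]")

# Initial alef variants: hamza above, hamza below, madda (the last is an assumption).
ALEF_VARIANTS = "\u0623\u0625\u0622"


def normalize_arabic(text: str) -> str:
    """Apply the slide's normalization steps to a string of Arabic text."""
    text = DIACRITICS.sub("", text)      # remove diacritics
    text = NON_LETTERS.sub(" ", text)    # remove punctuation and non-letters
    words = []
    for word in text.split():
        if word[0] in ALEF_VARIANTS:     # word-initial alef variants -> bare alef
            word = "\u0627" + word[1:]
        if word.endswith("\u0649"):      # final alef maqsura -> ya
            word = word[:-1] + "\u064A"
        if word.endswith("\u0629"):      # final ta marbuta -> ha
            word = word[:-1] + "\u0647"
        words.append(word)
    return " ".join(words)
```

With this sketch, the two diacritized words from the earlier example (“flag” and “knowledge”) collapse to the same three letters, and أدرس becomes ادرس. Inside Solr itself, the built-in `solr.ArabicNormalizationFilterFactory` performs a comparable normalization at analysis time, although its exact rules differ from the list above.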
Arabic Light Stemmer 
• A light stemmer is not dictionary-driven. 
• The algorithm follows a rule-based affix-removal mechanism: it strips common prefixes and suffixes without trying to find the root.
Arabic Light Stemmer 
• The light stemmer, light10, outperformed the other approaches. It is becoming 
widely used in Arabic information retrieval.
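A light stemmer in the spirit of light10 can be sketched as below. The affix lists and length checks are assumptions based on the published description of light10 and on how Lucene's Arabic stemmer behaves; this is not a drop-in replacement for either, and it expects input that has already been normalized.

```python
# Assumed prefix list: definite-article variants and the conjunction "wa".
PREFIXES = [
    "\u0648\u0627\u0644",  # wal-
    "\u0628\u0627\u0644",  # bal-
    "\u0643\u0627\u0644",  # kal-
    "\u0641\u0627\u0644",  # fal-
    "\u0627\u0644",        # al-
    "\u0644\u0644",        # ll-
    "\u0648",              # wa-
]

# Assumed suffix list, checked in this order.
SUFFIXES = [
    "\u0647\u0627",  # -ha
    "\u0627\u0646",  # -an
    "\u0627\u062A",  # -at
    "\u0648\u0646",  # -un
    "\u064A\u0646",  # -in
    "\u064A\u0647",  # -yh
    "\u064A\u0629",  # -ya + ta marbuta
    "\u0647",        # -h
    "\u0629",        # ta marbuta
    "\u064A",        # -y
]


def light_stem(word: str) -> str:
    """Strip at most one prefix, then any matching suffixes, from a word."""
    for prefix in PREFIXES:
        # A bare "wa" must leave at least 3 letters; other prefixes at least 2.
        min_rest = 3 if len(prefix) == 1 else 2
        if word.startswith(prefix) and len(word) - len(prefix) >= min_rest:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        # Only strip a suffix if at least 2 letters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[: -len(suffix)]
    return word
```

For instance, the word for “and the cars” loses both its `wal-` prefix and its plural `-at` suffix, leaving a four-letter stem. No dictionary lookup is involved, which is what keeps the approach cheap at index and query time.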
Arabic Light Stemmer 
• Sometimes a stemmer might not do what you want out of the box. 
• A protected-words list (e.g. via Solr's KeywordMarkerFilterFactory) keeps listed words from being modified by stemmers. 
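The idea of protecting words can be illustrated with a small Python sketch. The stemmer here is a deliberately trivial stand-in that only strips a leading definite article, and both function names are hypothetical; the point is only that protected tokens bypass stemming entirely.

```python
ARTICLE = "\u0627\u0644"  # definite article "al-"


def toy_stem(token: str) -> str:
    """Stand-in stemmer: strip a leading "al-" if at least 2 letters remain."""
    if token.startswith(ARTICLE) and len(token) - len(ARTICLE) >= 2:
        return token[len(ARTICLE):]
    return token


def stem_with_protection(tokens, protected, stem=toy_stem):
    """Stem every token except those listed in the protected set."""
    return [tok if tok in protected else stem(tok) for tok in tokens]
```

A place name that happens to begin with the article is a typical candidate for the protected set, since stripping the article would turn it into a different or meaningless token.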
Stop words and Synonyms 
• Removing stop words is important to ensure high performance and improve recall. A stop-word list is available at 
https://github.com/Ramzi-Alqrainy/Arabic-IR/blob/master/stopwords-ar.txt 
• Synonym expansion, that is, matching strings of tokens and replacing them with other strings of tokens, can 
improve precision and recall.
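Both filters can be sketched together in Python. The stop-word set below is a tiny illustrative subset, not the linked list, and the synonym group reuses the dialect variants of “I want” from the earlier slide; mapping them all to the first variant is an arbitrary choice made for this example.

```python
# Tiny illustrative stop-word subset: "fi" (in) and "min" (from).
STOP_WORDS = {"\u0641\u064A", "\u0645\u0646"}

# Dialect variants of "I want" from the dialects slide.
WANT_VARIANTS = ["عاوز", "أبغى", "أبي", "بدي"]

# Map every variant to one canonical token (here, arbitrarily, the first).
SYNONYMS = {variant: WANT_VARIANTS[0] for variant in WANT_VARIANTS}


def filter_tokens(tokens):
    """Drop stop words, then replace each token by its canonical synonym."""
    result = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        result.append(SYNONYMS.get(token, token))
    return result
```

With this mapping, a Levantine query and an Egyptian listing that both express “I want” end up with the same token, which is how synonym replacement can raise recall across dialects.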
Apache Solr Schema.xml 
• A text field that is appropriate for Arabic
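The slide's field definition is not reproduced in the text, so the following is a hedged sketch modeled on the `text_ar` field type that ships with Solr's example configuration. The stop-word file path is an assumption. The definition is held as a string in Python only so that it can be checked for well-formedness and its filter order inspected.

```python
import xml.etree.ElementTree as ET

# Assumed field type, modeled on Solr's stock text_ar definition.
TEXT_AR_FIELD_TYPE = """
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>
"""


def filter_chain(xml_text: str):
    """Parse a fieldType definition and return its filter classes in order."""
    root = ET.fromstring(xml_text.strip())
    return [node.get("class") for node in root.iter("filter")]
```

The order matters: stop words are removed before normalization and stemming, so the stop-word file has to contain the forms as they appear before normalization.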
Localization Concept with SolrCloud
Ranking and Relevancy: Boost documents by age 
• Just sort by date, newest first, and done? 
• Boost more recent documents and penalize older documents just for being old 
• Recency Boosting 
bf=recip(ms(NOW,post_inserted_date),3.16e-11,0.08,0.05)^5
Tune Solr Recip Function
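Solr's `recip(x,m,a,b)` function computes `a / (m*x + b)`. A short Python sketch makes it easier to see how the constants from the recency boost shape the curve; the helper names are hypothetical, and the constants are the ones shown on the previous slide.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def recip(x: float, m: float, a: float, b: float) -> float:
    """Same formula as Solr's recip function: a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(age_days: float, m: float = 3.16e-11,
                  a: float = 0.08, b: float = 0.05) -> float:
    """Boost for a document of the given age, before the ^5 multiplier."""
    return recip(age_days * MS_PER_DAY, m, a, b)


if __name__ == "__main__":
    # Print the curve for a few ages to see how fast the boost decays.
    for days in (0, 7, 30, 90, 365):
        print(f"{days:>4} days -> {recency_boost(days):.4f}")
```

Since `m` is roughly one divided by the number of milliseconds in a year, `m*x` is the document's age in years. Raising `a` lifts the whole curve, while raising `b` flattens the advantage of brand-new documents over slightly older ones.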
Solr Implementations @ OpenSooq 
§ Anti-spam 
§ Relevancy checking 
§ Tag generation 
§ Recommendation system
Thank You 
@RamziAlqrainy 
https://github.com/Ramzi-Alqrainy 
http://solr-enterprise-search-server.blogspot.com/

Arabic Content with Apache Solr: Presented by Ramzi Alqrainy, OpenSooq
