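TOOLS FOR ARABIC PEOPLE NAMES PROCESSING AND RETRIEVAL: A STATISTICAL APPROACH. By Ali Salhi and Adnan Yahya. October 30, 2011. اللغة العربية بين الأتمتة والفلسفة في جامعة بيرزيت: دراسات منطقية وفلسفية وحاسوبية في اللغة العربية (The Arabic Language between Automation and Philosophy at Birzeit University: Logical, Philosophical and Computational Studies in the Arabic Language)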
OUTLINE Motivation and Background. What we are trying to build? Names Tools Resources and Construction. Names Tables and Filtration. Names Methods and Tools. Conclusions.
MOTIVATION AND BACKGROUND One of the many problems in Arabic content processing relates to people names. People names are used in user profiles, registrations, articles, forms, etc., and suffer from problems such as differing spelling and translation forms. Other problems: name expansion and correction in search queries. Names are frequent in searches/retrieval. Names are part of “named entities” (locations, etc.).
WHAT WE ARE TRYING TO BUILD? Different names tables with frequency, gender and translation attributes (the infrastructure). Arabic people names processing tools such as: Names Gender Detector. Names Translation tool. Names Correction tool. Names Auto Suggestion tool. Names Extraction tool.
NAMES TOOLS RESOURCES AND CONSTRUCTION For the names processing tools we employ a statistical/corpus-based approach. We built tables for names use, gender and translation. Data was obtained from two sources with different formats (with privacy precautions): Palestinian General High School Certificate Exam (Tawjihi) student lists for the years 2005 and 2007-2010. Birzeit University student and employee records (2003-2010).
NAMES TOOLS RESOURCES AND CONSTRUCTION: DIFFERENT FORMATS OF SOURCE DATA The Palestinian Tawjihi list was obtained from the Palestinian Ministry of Education in “xls” (Microsoft Excel) format, with student names (first, second, third, last/family), city, school and score attributes. The Birzeit list was obtained from Birzeit University and has the following features: The list contains a bag of student name tuples. Each tuple may be a first name, father name, grandfather name or a family name. Each tuple (repetition allowed) holds a translation to English as well as the “gender” (Male, Female or Family).
NAMES TABLES AND FILTRATION The following tables were obtained: Male Names Table. Female Names Table. Family Names Table. Names Translation Table. General Names Table.
NAMES TABLES AND FILTRATION MALE NAMES TABLE: This table holds male names only. It was built by selecting all names with gender equal to male from the Birzeit list, then adding males from the Tawjihi list. Since the Tawjihi list lacks gender, we first did the following: Parsed the 2nd and 3rd fields in student names (those are considered male by default, as father/grandfather names). The male names thus obtained were used to filter male names from the 1st names (a 1st name can be male or female): 1st names that also appear in the 2nd or 3rd fields are taken as male. Reminder: (first names / (2nd names + 3rd names)).
MALE NAMES TABLE (CONT …) Problem: Names such as نور , ضياء , جهاد may be female or male (multi-classification). To try to give a fair judgment about such names we assumed the following: Any name is considered to have the “male” classification if it appears as a 2nd or 3rd name in the Tawjihi list (regardless of its appearance in the 1st field of that list). If the name appears in the 1st field (a first name) only, then it is considered to be a female name. Multi-classification names: any 2nd or 3rd field names that are considered female in the Birzeit list [and found as a 1st name in the Tawjihi list] are considered multi-classification names. The number of distinct male names processed is 3570.
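The Tawjihi rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' code: the function name `classify_tawjihi`, the `(first, second, third, family)` tuple layout, and the choice to add a male first name's count to its male frequency are all hypothetical.

```python
from collections import Counter

def classify_tawjihi(records):
    """records: iterable of (first, second, third, family) tuples (assumed layout)."""
    male = Counter()
    first_names = Counter()
    for first, second, third, _family in records:
        # Father and grandfather names are treated as male by default.
        male[second] += 1
        male[third] += 1
        first_names[first] += 1

    female = Counter()
    for name, count in first_names.items():
        if name in male:
            # First name also seen in the 2nd/3rd field: counted as male.
            male[name] += count
        else:
            # Seen only in the 1st field: counted as female.
            female[name] = count
    return male, female
```

A name such as نور that occurs both as a first name and as a father name ends up in the male counter here; flagging it as multi-classification would additionally need the Birzeit gender labels.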
MALE NAMES TABLE (CONT …) Male names are processed to have a table of unique names with frequencies. Top 20 Male Names:
Item  Name  Frequency
1  محمد  41280
2  محمود  15662
3  أحمد  11752
4  ابرهيم  9287
5  حسن  8359
6  علي  8008
7  يوسف  7965
8  احمد  7714
9  خليل  5483
10  حسين  5341
11  مصطفى  5031
12  موسى  4649
13  خالد  4199
14  سليمان  4042
15  سعيد  3897
16  عبد الله  3893
17  جمال  3442
18  اسماعيل  3438
19  صالح  3431
20  عمر  3093
FEMALE NAMES TABLE How was the table built? Selecting female names from the Birzeit list with gender “female”. Adding all female names found in the Tawjihi lists. Adding all multi-classification names with female gender in the Birzeit list and in the 1st field in the Tawjihi list. The number of distinct female names processed is 2633.
FEMALE NAMES TABLE (CONT …) Female names are processed to only save unique names with a frequency counter. Top 20 Female Names:
Item  Name  Frequency
1  ايمان  2177
2  دعاء  2034
3  الاء  1998
4  ولاء  1673
5  حنين  1663
6  اسماء  1506
7  اسراء  1297
8  فداء  1268
9  ياسمين  1218
10  عبير  1190
11  هبة  1178
12  نداء  1065
13  سماح  1037
14  روان  1030
15  هديل  1015
16  مريم  946
17  حنان  943
18  فاطمة  912
19  صابرين  875
20  اماني  871
FAMILY NAMES TABLE How was the table built? Merge all names with gender equal to family in the Birzeit list with all 4th (family) field names in the Tawjihi lists, then subtract male/female names (from the respective tables). The total number of distinct family names is 11209.
FAMILY NAMES TABLE (CONT …) Family names are processed to only save unique names with a frequency counter. Top 20 Family Names:
Item  Name  Frequency
1  تكروري  952
2  حلواني  940
3  النجار  450
4  عاصي  438
5  دراغمه  356
6  بشارات  335
7  جرادات  319
8  دويكات  318
9  المصري  308
10  ابو الرب  280
11  مصري  268
12  جرار  208
13  حروب  208
14  الشاعر  203
15  ربايعة  198
16  رجوب  181
17  سويطي  177
18  صلاحات  175
19  شويكي  170
20  صوافطه  162
ENGLISH TRANSLATION TABLE This table holds Arabic names and their different translated forms, with a frequency counter for each translated form of a given name. Example: the name محمد and its top 20 English translated forms:
Item  Name  Freq
1  Mohammad  5513
2  Muhammad  783
3  Mohammed  181
4  Mohamad  168
5  Mohummad  157
6  Mohamed  44
7  Mohmmad  20
8  Mohammd  12
9  Muhamad  11
10  Muhammed  11
11  Mohmad  8
12  Moh'd  8
13  Mohamd  5
14  Mohmmed  5
15  Mouhamad  4
16  Mouhammad  4
17  Mhamad  4
18  Mhammed  3
19  Mhmmad  3
20  Mhmmed  3
GENERAL NAMES TABLE This is a large table holding  all  names appearing in Birzeit and Tawjihi lists and is a merge of the male, female and family tables. For each name  one gender  is assigned as well as a  frequency of appearance . If a name has more than one classification/gender then the frequencies of occurrences in all classifications are  summed  and assigned to the classification with the highest frequency.
GENERAL NAMES TABLE (CONT …) Example: The name نور can be used as a male name and as a female name. As a male name نور has a frequency of 32, and as a female name it has a frequency of 847. Both frequencies are summed (32 + 847 = 879) and the gender female is given to the name (is this always fair?). A multi-classification flag is added where needed to mark such names.
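A minimal sketch of this merge rule, assuming each gender table is a plain mapping from name to frequency. The function name `merge_general` and the `(gender, total, multi_flag)` result layout are illustrative choices, not taken from the slides.

```python
def merge_general(male, female, family):
    """Each argument maps name -> frequency.

    Returns name -> (gender, total_frequency, multi_classification_flag).
    """
    tables = {"male": male, "female": female, "family": family}
    general = {}
    for name in set(male) | set(female) | set(family):
        # Frequencies of this name in every table where it occurs.
        counts = {g: t[name] for g, t in tables.items() if name in t}
        # The classification with the highest frequency wins...
        gender = max(counts, key=counts.get)
        # ...and receives the summed frequency of all classifications.
        general[name] = (gender, sum(counts.values()), len(counts) > 1)
    return general
```

With the figures from the example, `merge_general({"نور": 32}, {"نور": 847}, {})` assigns نور the gender female, a total of 879, and a raised multi-classification flag.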
NAMES METHODS AND TOOLS  Names Correction. Name Gender Detector (NGD). Names Translation. Names Auto Suggestion. Names extractor.
ERROR CORRECTION IN NAMES Two types of errors are common : Names with multiple forms errors. Compound names errors .
ERROR CORRECTION (CONT…) NAMES WITH DIFFERENT FORMS ERRORS Fixing common misspellings/errors means looking for the best form to represent a name. For us, the best form is the one with the highest frequency (is this fair? Democracy!). Example: the name احمد has three different forms ( أحمد , احمد , إحمد ); أحمد is the one with the highest frequency, so أحمد is considered to be the correct form. The frequencies of occurrence of احمد and إحمد are summed and added to the frequency of أحمد.
ERROR CORRECTION (CONT…) NAMES WITH DIFFERENT FORMS ERRORS Levenshtein Distance (LD): a measure of how far names (or forms) are from each other (number of edits). We use LD to select “the correct form” from a group of possible names. The group is a list of names with an n-letter difference (built using LD and our general table). Based on studies of common errors, we use the common-error letters ( أ , ا , إ , آ , و , ي،ى , ه , ة ) to find which name must be selected (or given preference).
ERROR CORRECTION: SIMPLE EXAMPLE Assume a group of ( ديمه , ديما , ريمه , ديمة , سيمه ) with the group key (incorrect entry) ديمه. سيمه and ريمه can be dropped, since they do not differ from the key in the common-error letters ( أ , ا , إ , آ , و , ي،ى , ه , ة ). We end up with ( ديمه , ديما , ديمة ); then ديما is selected for having the highest frequency, and the frequencies of the other two forms are summed and added to the frequency of ديما.
ERROR CORRECTION: COMPLEX EXAMPLE Assume a name like اسامه, which differs by 2 from the name أسامة and so won’t be in the same group. To fix that, we rejoined groups that have common elements with potential common errors. اسامه has a group of four shapes after joining two smaller groups ( اسامه , اسامة , أسامه ) and ( أسامة , أسامه ). أسامة has the largest frequency of appearance, so it is selected and its final frequency is the sum of all frequencies.
ERROR CORRECTION: COMPLEX EXAMPLE Consider: عدير → غدير , عبير (two possible corrections). The choice depends on 3 components: Enhanced Names Table (our dictionary). Levenshtein distance. Ranking system: given the error sources, we used a ranking system based on a combination of 4 elements: Frequency of Appearance (in our lists). Shape Similarity (of letters). Location Measurement (keyboard). Soundex Function (sound similarity).
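For reference, the Levenshtein distance itself can be computed with the standard dynamic-programming recurrence. The sketch below is the textbook algorithm, not necessarily the implementation used in the tool.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))      # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                      # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]
```

Candidate groups can then be built by collecting every table entry whose distance to the input is at most n.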
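The filtering and group-joining steps from the two examples can be sketched as below. This is a simplified reading: the common-error letters are treated as one interchangeable set, variants are assumed to have equal length, and the helper names are hypothetical.

```python
# The common-error letters listed on the slide, treated as one set (assumption).
COMMON_ERROR_LETTERS = set("أاإآويىهة")

def same_name_variant(a, b):
    """True if a and b differ only at positions where both letters
    belong to the common-error set."""
    if len(a) != len(b):
        return False
    return all(x == y or (x in COMMON_ERROR_LETTERS and y in COMMON_ERROR_LETTERS)
               for x, y in zip(a, b))

def join_groups(groups):
    """Merge groups that share at least one element."""
    merged = []
    for group in groups:
        group = set(group)
        for other in [g for g in merged if g & group]:
            group |= other
            merged.remove(other)
        merged.append(group)
    return merged

def canonical_form(group, freq):
    """Pick the most frequent form and give it the summed frequency."""
    best = max(group, key=lambda name: freq.get(name, 0))
    return best, sum(freq.get(name, 0) for name in group)
```

Joining ( اسامه , اسامة , أسامه ) with ( أسامة , أسامه ) yields the single four-form group from the example, because the two groups share أسامه.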
NAMES CORRECTION TOOL (CONT…) FORMS RANKING General ranking equation: Rank(word) = A*Frequency + B*ShapeSimilarity + C*LetterLocation + D*Soundex. A, B, C and D are weights that sum to 100%. Consider A = 0.50, B = 0.20, C = 0.25 and D = 0.05. Frequency, ShapeSimilarity, LetterLocation and Soundex are parameters of the forms of a name, with the obvious interpretation. The chosen values for A, B, C and D are not necessarily the best. They are based on experimentation and need more testing to decide the best range (or values).
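The ranking equation translates directly into code. In this sketch the four component scores are assumed to be already normalized to the range 0 to 1; the slides do not say how each score is computed or scaled, so that part is left to the caller.

```python
def rank(frequency, shape_similarity, letter_location, soundex,
         a=0.50, b=0.20, c=0.25, d=0.05):
    """Weighted sum of the four component scores (weights from the slide)."""
    return (a * frequency + b * shape_similarity
            + c * letter_location + d * soundex)

def best_candidate(scores):
    """scores maps candidate -> (frequency, shape, location, soundex)."""
    return max(scores, key=lambda name: rank(*scores[name]))
```

Because the weights sum to 1, a candidate with all four scores equal to 1 receives the maximum rank of 1.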
NAMES CORRECTION TOOL (CONT…) Some test samples:
#  Input  Output(s)
1  ديم  ريم , ديما , كيم , نديم
2  شوشن  سوسن , شوكت , سوزان , روان
3  خاقلين  تالين , جاكلين , مارلين , كاثلين , مادلين
4  اقراجيم  إبراهيم
5  اية  راية , آية
6  نوزالدين  نور الدين
7  رمري  رمزي , رازي
8  غبير  عبير , غدير
NAMES CORRECTION TOOL (CONT…) TEST RESULTS General test (each test consists of 100 misspelled names):
#  Test type  Case  Pass percentage
1  Speed writing  test1  87%
1  Speed writing  test2  84%
1  Speed writing  test3  85%
2  Auto-generated errors  One error  91%
2  Auto-generated errors  Two errors  79%
2  Auto-generated errors  Three errors  70%
NAMES METHODS AND TOOLS NAME GENDER DETECTOR (NGD) NGD: a tool that classifies an input name as Male, Female or Family. How does it work? The NGD tool receives the name and issues a query to check whether the name exists in the enhanced names table. If found, NGD returns the gender and its percentage of the whole names lists. If not, it returns a null result, and the tool pushes the input string to the correction tool to check whether the “not found” result was caused by a spelling/common error. It can also work in reverse: given the gender, limit the correction/suggestion to names of that gender.
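The lookup-then-fallback flow might look like the sketch below. The table layout (`name -> (gender, frequency)`), the reading of "percentage" as the name's share of all recorded occurrences, and the `correct` callback standing in for the correction tool are all assumptions made for illustration.

```python
def detect_gender(name, general_table, correct=None):
    """Return ((gender, percentage), []) when the name is known,
    otherwise (None, suggestions) from the correction callback."""
    entry = general_table.get(name)
    if entry is None:
        # Not found: hand the input to the correction tool, if one is given.
        return None, (correct(name) if correct else [])
    gender, freq = entry
    total = sum(f for _, f in general_table.values())
    return (gender, 100.0 * freq / total), []
```

For the reverse mode, the same table can simply be filtered by gender before it is passed to the correction or suggestion step.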
NAMES METHODS AND TOOLS NAMES TRANSLATION TOOL The names translation tool finds the correct (or widely accepted) English translation of a given name. Many Arabic names have different equivalent English forms, as seen in the following table:
#  Arabic Name  English Translation  Freq
1  سمير  Samir  299
1  سمير  Sameer  85
2  نورا  Noura  19
2  نورا  Nora  7
2  نورا  Nura  5
2  نورا  Noora  3
3  رياض  Riyad  148
3  رياض  Riad  24
3  رياض  Reyad  8
4  أحمد  Ahmad  1875
4  أحمد  Ahmed  48
4  أحمد  Ahamad  6
5  مؤيد  Mo'ayad  10
5  مؤيد  Mu'ayad  9
5  مؤيد  Moayad  5
5  مؤيد  Mu'ayyad  5
5  مؤيد  Mo'ayyad  3
5  مؤيد  Muayad  3
NAMES TRANSLATION TOOL (CONT…) The translation tool searches in the English Translation table and builds a table that holds all possible translations sorted in a descending order  of frequency of use.  Usually we output the top 3 translations, to give the user a choice if needed, with the default being the most frequent form.
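A minimal sketch of this lookup, assuming the translation table is a nested mapping from Arabic name to `{English form: frequency}`. The function name `translate` and the table layout are illustrative.

```python
def translate(name, translation_table, top=3):
    """Return up to `top` English forms, most frequent first."""
    forms = translation_table.get(name, {})
    ranked = sorted(forms.items(), key=lambda item: item[1], reverse=True)
    return [form for form, _freq in ranked[:top]]
```

The first element of the returned list is the default translation; the remaining ones are offered as alternatives.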
AUTO SUGGESTION TOOL A general auto-suggestion tool for names, which can be used in applications where name entry is needed. It suggests names while typing (completion? Not quite). The challenge is to guess the intended name even when the user starts incorrectly. For example, a user wants to enter أحمد but starts with ا rather than أ, and thus will never reach أحمد by completion. Solution: the user input must be modified. The tool automatically considers the possibility of changing the first letter ( ا to أ or آ or إ ) and then waits for the next letter. The same applies to middle letters: for مؤيد, for example, the user might enter the name as مويد, and the tool will consider the possibility of ؤ while typing.
NAMES EXTRACTION TOOL Names extraction is a method to isolate people names (full and single) from an Arabic text. Since names may be misspelled, the reference table in use is the general table (which has all forms of all the names). How does it work? The extraction function first parses the text, comparing words with the general names table entries. If the table has the word, the function checks the three words ahead (word+1, word+2, word+3) to detect full names and single names. The series of words is compared with predefined name types (<male[0], male[1], … male[i] || family>, <female, male[1], … male[i] || family>).
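One way to realize this tolerant prefix matching is sketched below. Only the two substitutions mentioned on the slide ( ا for أ / آ / إ , and و for ؤ ) are encoded; the `VARIANTS` map, the function name and the frequency-based ordering are assumptions.

```python
# Typed letter -> letters it may stand for (only the cases named on the slide).
VARIANTS = {"ا": "اأإآ", "و": "وؤ"}

def suggest(prefix, names, limit=5):
    """names maps name -> frequency; returns the most frequent names whose
    beginning matches the typed prefix up to the allowed letter variants."""
    def matches(name):
        if len(name) < len(prefix):
            return False
        return all(n in VARIANTS.get(p, p) for p, n in zip(prefix, name))

    hits = [name for name in names if matches(name)]
    return sorted(hits, key=names.get, reverse=True)[:limit]
```

Typing `اح` therefore still reaches أحمد, and typing `مويد` reaches مؤيد.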
NAMES EXTRACTION TOOL (CONT…) Examples:
رامي محمد حمدان matches <male, male, family>.
هند محمد حمدان رامي doesn’t match anything → needs splitting.
هند محمد حمدان matches <female, male, family>.
هند محمد حمدان رامي matches <male>, <female, male, family>.
هند محمد حمدان رامي | matches: single name, full name.
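The pattern matching in these examples can be sketched as a greedy scan over tokens. This is one possible reading of the name types, under stated assumptions: a name starts with a male or female first name, continues with male names, and may close with a family name, spanning at most four words. The function name and the `gender_of` lookup are hypothetical.

```python
def extract_names(words, gender_of, max_len=4):
    """Return a list of (kind, words) pairs, kind being "full" or "single"."""
    found, i = [], 0
    while i < len(words):
        if gender_of.get(words[i]) not in ("male", "female"):
            i += 1
            continue
        j = i + 1
        # Following words must be male (father/grandfather) names...
        while (j < len(words) and j - i < max_len
               and gender_of.get(words[j]) == "male"):
            j += 1
        # ...optionally closed by a family name.
        if (j < len(words) and j - i < max_len
                and gender_of.get(words[j]) == "family"):
            j += 1
        found.append(("full" if j - i > 1 else "single", words[i:j]))
        i = j
    return found
```

For the token sequence رامي, هند, محمد, حمدان (an assumed reading order of the split example), the scan stops after رامي because هند is female, so it reports a single name followed by a full name.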
NAMES EXTRACTION TOOL (CONT…) Not every string matching a name is considered a name. For example, جميل might be an adjective rather than a person name. To accept such a word as a single name (when the word can double as a name), one of the following rules must apply: It appears more than N times in the text (currently N = 3). It appears in a full name in the text; for example, given جميلة سمير النتشة, then جميلة and its other form ( جميله ) and سمير will be detected. It appears in an “anded” series: علي و ذكي و سامي و انس ذهبوا إلى الجامعة . It appears after a defining term (such as السيد , الدكتور , الآنسة , etc.).
POSSIBLE USES OF DEVELOPED TOOLS Our tools should be useful in form filling and data entry, as well as for batch processing of existing name lists (say, correction or translation). They can be incorporated into search tools/engines to make sure that misspelled occurrences of a name and its multilingual forms are accounted for. They can support reporting on individuals by detecting name occurrences in documents. The statistical basis can be overridden by expert knowledge in the field of correct spelling.
CONCLUSIONS We presented some useful tools that can help in processing people names in digital documents and web content. Our work aimed to design and deploy query/form pre-processing name tools able to efficiently process and identify Arabic people names in queries and documents. We employed a statistical/corpus-based approach and constructed databases that contain names from different resources. The source data gives the results some regional accent, which may be rectified. Testing is promising, though more is needed.
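The four acceptance rules can be sketched as a single pass over the tokens. The details are assumptions: `candidates` is the set of ambiguous words found in the names table, `full_name_words` holds words already seen inside detected full names, and the "anded" rule requires another candidate on the other side of و.

```python
from collections import Counter

# Defining terms listed on the slide.
TITLES = {"السيد", "الدكتور", "الآنسة"}

def accept_single_names(words, candidates, full_name_words=(), n=3):
    """Return the candidate words that qualify as single names."""
    counts = Counter(w for w in words if w in candidates)
    accepted = set()
    for i, w in enumerate(words):
        if w not in candidates:
            continue
        anded = ((i >= 2 and words[i - 1] == "و" and words[i - 2] in candidates)
                 or (i + 2 < len(words) and words[i + 1] == "و"
                     and words[i + 2] in candidates))
        if (counts[w] > n                              # more than N occurrences
                or w in full_name_words                # part of a full name
                or anded                               # in an "anded" series
                or (i > 0 and words[i - 1] in TITLES)):  # after a defining term
            accepted.add(w)
    return accepted
```

A lone جميل at the end of a sentence is rejected by this sketch, while الدكتور جميل is accepted through the defining-term rule.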
Thank you
