Arabic key phrase extraction

The task of extracting keyphrases from free-text documents is becoming increasingly important as the uses for such technology expand
and as the amount of electronic textual content grows rapidly.
Keyphrases play an important role in digital libraries, web content, and content management systems, especially for cataloging and
information retrieval.

We propose an approach for a basic text mining tool for Arabic keyphrase extraction. The approach relies on AI techniques, applying heuristic knowledge (linguistic rules) combined with statistical machine learning.






Usage Rights

© All Rights Reserved

  • 350M Arabic speakers. 120K Arabic Wikipedia pages. 2000% growth in Arabic tweets from 2010 to 2011. Arabic on the web has increased 2500% since 2000. 66M users use the Arabic language on the Internet. Arabic Internet users grew by 2500% and reached 65M users. Egypt, KSA, and UAE will spend about $2.1 billion in the electronic retail sector.
  • Generating metadata that gives a high-level description of a document's contents. This provides tools for text-mining tasks such as document and Web page retrieval. Summarizing documents for prospective readers: keyphrases can represent a highly condensed summary of the document in question (Avanzo & Magnini, 2005). Highlighting important topics within the body of the text to facilitate speed reading (skimming), which lets the reader decide whether the document is relevant. Measuring the similarity between documents, making it possible to cluster and categorize documents (Karanikolas & Skourlas, 2006). Searching: results are more precise when keyphrases are used as the basis for search indexes or as a way of browsing a collection of documents.
  • Just points
  • Speak about correction and why we do it. Removing diacritics and why we do it. Segmentation: speak about the Segmenter module from Stanford. Segmenting sentences and its importance in feature calculation (NPL). Segmentation into words.
  • Remove all unused special characters. Remove non-Arabic characters. Replace question marks and exclamation marks with the Arabic ones. Leave only the significant special characters: ، ; :
  • Almost all clitics are separated off as separate words. This includes clitic pronouns, prepositions, and conjunctions. However, the clitic determiner (definite article) "Al" (ال) is not separated off. Inflectional and derivational morphology is not separated off. [GALE ROSETTA: These separated off clitics are not overtly marked as proclitics/enclitics, although we do have a facility to strip off the '+' and '#' characters that the IBM segmenter uses to mark enclitics and proclitics, respectively. See the example below using the option -escaper] Parentheses are rendered -LRB- and -RRB-. Quotes are rendered as (ASCII) straight single and double quotes (' and "), not as curly quotes or LaTeX-style quotes (unlike the Penn English Treebank). Dashes are represented with the ASCII hyphen character (U+002D). Non-break space is not used.
  • "Not words" means the POS tagger runs on the whole sentence to detect the right POS for every word.
  • Another example. Word: التوقعات المرئية. Stem: مرء. Lemma: مرئي.
  • Mention another test set that includes different domains (sport, psychology, science, and religion). Don't forget to mention that the documents have different authors.
  • Define both precision and recall for the audience.
  • Mention that Sakhr is not eligible for comparison because of its limitation of categorizing keyphrases into sections. Mention that KP-Miner is the only available website to compare with. We used the same test set as KP-Miner to make a fair comparison.
  • Don’t forget human judgment
  • Don’t forget human judgment
  • Speak in more detail. Linguistic features include adding special characters such as the semicolon and double colon to detect the importance of text. Linguistic features also include checking whether a candidate is a sub-phrase of another candidate. Statistical features include characterizing each writer's style: where he mentions the topic and when he gives details (تفصيل واجمال).
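
The cleaning steps listed in the notes (removing diacritics, dropping non-Arabic characters, swapping in the Arabic question mark, keeping only a few significant punctuation marks) could look roughly like the Python sketch below. This is not the presenters' code. The function name `normalize`, the exact Unicode ranges, and the choice to also keep the full stop and "!" (so that sentences can still be segmented) are assumptions made for illustration.

```python
import re

# Arabic diacritics (tashkeel): U+064B..U+0652, plus the superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

# Assumption: "Arabic characters" means the letters U+0621..U+064A and the
# Arabic-Indic digits U+0660..U+0669. Punctuation kept: Arabic comma (U+060C),
# Arabic semicolon (U+061B), Arabic question mark (U+061F), and ASCII ; : . !
NOT_ALLOWED = re.compile(r"[^\u0621-\u064A\u0660-\u0669\s\u060C\u061B\u061F;:.!]")


def normalize(text: str) -> str:
    """Hypothetical cleaning step: strip diacritics and non-Arabic characters."""
    text = DIACRITICS.sub("", text)       # remove diacritics
    text = text.replace("?", "\u061F")    # Latin question mark -> Arabic one
    text = NOT_ALLOWED.sub(" ", text)     # drop everything not on the keep list
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


if __name__ == "__main__":
    print(normalize("مَرْحَبًا hello?"))
```

Replacing disallowed characters with a space rather than deleting them keeps neighbouring Arabic words from being fused together before word segmentation.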

Arabic key phrase extraction Presentation Transcript

  • 1. 350M, 120K, 66M, 2000%, 2500%
  • 2. Precision: 0.25 (for 15 keyphrases), 0.171 (for 20 keyphrases). Recall: 0.443 (for 15 keyphrases), 0.447 (for 20 keyphrases).
  • 3. Precision: 0.25 and 0.214 (for 15 keyphrases), 0.171 and 0.178 (for 20 keyphrases). Recall: 0.443 and 0.399 (for 15 keyphrases), 0.447 and 0.414 (for 20 keyphrases).
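
Since the notes ask for precision and recall to be defined, here is a minimal Python sketch of how the two scores are commonly computed for keyphrase extraction. Precision is the share of extracted keyphrases that also appear in the reference (human-assigned) set, and recall is the share of reference keyphrases that were extracted. The function name `precision_recall` and the exact-match comparison are assumptions. The slides do not say how matches were counted in the reported evaluation.

```python
from typing import Iterable, Tuple


def precision_recall(extracted: Iterable[str], reference: Iterable[str]) -> Tuple[float, float]:
    """Exact-match precision and recall (a hypothetical helper, not the authors' scorer)."""
    extracted_set = set(extracted)
    reference_set = set(reference)
    matched = len(extracted_set & reference_set)  # keyphrases found in both sets
    precision = matched / len(extracted_set) if extracted_set else 0.0
    recall = matched / len(reference_set) if reference_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data: 4 extracted phrases, 5 reference phrases, 2 in common.
    system_output = ["a", "b", "c", "d"]
    human_keyphrases = ["a", "b", "x", "y", "z"]
    print(precision_recall(system_output, human_keyphrases))
```

Scoring "for 15 keyphrases" or "for 20 keyphrases" would correspond to passing only the top 15 or top 20 extracted phrases as `extracted`.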