Arabic key phrase extraction

The task of extracting keyphrases from free-text documents is becoming increasingly important as the uses for such technology expand and as the amount of electronic textual content grows rapidly. Keyphrases play an important role in digital libraries, web content, and content management systems, especially for cataloging and information retrieval.

We propose an approach for a basic text mining tool for Arabic keyphrase extraction. The approach relies on AI techniques: heuristic knowledge (linguistic rules) combined with statistical machine learning.



Published in: Technology
  • 350M people speak Arabic. There are 120K Arabic wiki pages. Arabic tweets grew 2000% from 2010 to 2011. Arabic on the web has increased 2500% since 2000. 66M users use the Arabic language on the Internet. Arabic internet users grew by 2500% and reached 65M users. Egypt, KSA, and UAE will spend about $2.1 billion in the electronic retail sector.
  • Generating metadata that gives a high-level description of a document's contents. This provides tools for text-mining tasks such as document and Web page retrieval. Summarizing documents for prospective readers: keyphrases can represent a highly condensed summary of the document in question (Avanzo & Magnini, 2005). Highlighting important topics within the body of the text to facilitate speed reading (skimming), which helps the reader decide whether the document is relevant. Measuring the similarity between documents, making it possible to cluster and categorize documents (Karanikolas & Skourlas, 2006). Searching: search becomes more precise when keyphrases are used as the basis for search indexes or as a way of browsing a collection of documents.
  • Just points
  • Speak about correction and why we do it. Removing diacritics and why we do it. Segmentation: speak about the Segmenter module from Stanford. Segmenting into sentences and its importance in feature calculation (NPL). Segmentation into words.
  • Remove all unused special characters. Remove non-Arabic characters. Replace the question mark and exclamation mark with the Arabic ones. Leave only significant special characters: ، ; :
  • Almost all clitics are separated off as separate words. This includes clitic pronouns, prepositions, and conjunctions. However, the clitic determiner (definite article) "Al" (ال) is not separated off. Inflectional and derivational morphology is not separated off. [GALE ROSETTA: These separated off clitics are not overtly marked as proclitics/enclitics, although we do have a facility to strip off the '+' and '#' characters that the IBM segmenter uses to mark enclitics and proclitics, respectively. See the example below using the option -escaper] Parentheses are rendered -LRB- and -RRB-. Quotes are rendered as (ASCII) straight single and double quotes (' and "), not as curly quotes or LaTeX-style quotes (unlike the Penn English Treebank). Dashes are represented with the ASCII hyphen character (U+002D). Non-break space is not used.
  • "Not words" means the POS tagger runs on the whole sentence to detect the right POS for every word.
  • Another example. Word: التوقعات المرئية. Stem: مرء. Lemma: مرئي.
  • Mention another test set that includes different domains (sport, psychology, science, and religion). Don't forget to mention that the documents have different authors.
  • Define both precision and recall to the audience.
  • Mention that Sakhr is not eligible for comparison because of its limitation of categorizing keyphrases into sections. Mention that KP-Miner is the only available website to compare with. We used the same test set as KP-Miner to make a fair comparison.
  • Don’t forget human judgment
  • Don’t forget human judgment
  • Speak in more detail. Linguistic features include using special characters like the semicolon and double colon to detect the importance of text. Linguistic features also include checking whether a candidate is a sub-phrase of another candidate. Statistical features include characterizing each writer's style: where the writer mentions the topic and where the writer gives details (تفصيل واجمال, detail and summary).
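
The cleaning steps mentioned in the notes above (removing diacritics, removing non-Arabic and unused special characters, switching to the Arabic question mark, keeping only a few significant punctuation marks) could be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function name `clean_arabic`, the exact Unicode ranges, and the set of punctuation kept are assumptions.

```python
import re

# Assumption: "diacritics" means the short-vowel marks U+064B..U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")

# Assumption: keep Arabic letters (U+0621..U+064A), ASCII digits, whitespace,
# the Arabic comma (U+060C), ';', ':', the Arabic question mark (U+061F),
# '!' and '.'. Everything else is treated as an unused character.
NOT_ALLOWED = re.compile(r"[^\u0621-\u064A0-9\s\u060C;:\u061F!.]")


def clean_arabic(text: str) -> str:
    """Hypothetical cleaning step for raw Arabic text."""
    text = text.replace("?", "\u061F")    # Latin '?' -> Arabic question mark
    text = DIACRITICS.sub("", text)       # strip diacritics
    text = NOT_ALLOWED.sub(" ", text)     # drop non-Arabic / unused characters
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
```

Replacing disallowed characters with a space rather than deleting them keeps neighboring words from being glued together before segmentation.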
  • Transcript

    • 1. 350M 120K 66M 2000% 2500%
    • 2. Precision: 0.25 (for 15 keyphrases), 0.171 (for 20 keyphrases). Recall: 0.443 (for 15 keyphrases), 0.447 (for 20 keyphrases).
    • 3. Precision: 0.25 vs. 0.214 (for 15 keyphrases), 0.171 vs. 0.178 (for 20 keyphrases). Recall: 0.443 vs. 0.399 (for 15 keyphrases), 0.447 vs. 0.414 (for 20 keyphrases).
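
For readers unfamiliar with the two measures reported above: precision is the fraction of extracted keyphrases that are correct, and recall is the fraction of reference keyphrases that were extracted. A minimal sketch follows, assuming exact string matching between extracted and reference keyphrases; the slides do not say how matches were counted, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) for one document.

    Assumption: a keyphrase counts as correct only if it matches a
    reference keyphrase exactly; duplicates are ignored.
    """
    extracted_set = set(extracted)
    reference_set = set(reference)
    matched = extracted_set & reference_set

    precision = len(matched) / len(extracted_set) if extracted_set else 0.0
    recall = len(matched) / len(reference_set) if reference_set else 0.0
    return precision, recall
```

Under these definitions, asking the extractor for more keyphrases (20 instead of 15) can raise recall slightly while lowering precision, which is the pattern in the figures above.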