Arabic tokenization and stemming

Arabic_NLP_ImamU2013

Imam Mohammad Ibn Saud Islamic University
College of Computing and Information Science
Computer Science Department
Arabic Tokenization and Stemming
Prepared by: Al-Moammar A., Al-Abdullah H., and Al-Ajlan N.
Supervised by: Dr. Amal Al-Saif

Outline
➢ Introduction
➢ Tokenization:
• Arabic Characteristics.
• Methodology.
• Results.
➢ Stemming:
• Arabic Characteristics.
• Methodology.
• Results.
➢ Conclusion.

Introduction
➢ Arabic language.
➢ Tokenization.
➢ Stemming.

Arabic Language Characteristics
• Writing a letter in an ambiguous form causes orthographic problems.
• Encliticization of a word ending with “ ” or “ ” :
• Ambiguity results from decliticization of “ ” “l” “ ” “A” and “ ” “Al” “the”.
word Encliticization of word
“their Friday”
“collect them”
“Your level”

My Approach
➢ Sample of Arabic tokenized text:
➢ The bigram equation used is:
P(w_i | s_j) is the probability of the i-th word given the j-th segmentation.
P(s_j | s_{j-1}) is the probability of the j-th segmentation given the previous segmentation.
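
The two probabilities defined above can be combined to rank candidate segmentations. The following is a minimal sketch, not the authors' implementation: it assumes the score of a segmentation sequence is the product of the word-given-segmentation and segmentation-given-previous-segmentation terms, computed in log space. The function names, the `FLOOR` smoothing value, the transliterated toy words, and all probabilities are hypothetical.

```python
import math
from itertools import product

FLOOR = 1e-6  # assumed probability for unseen events (hypothetical smoothing)

def sequence_log_prob(words, segs, emit, trans, start="<s>"):
    """Sum of log P(s_j | s_{j-1}) + log P(w_i | s_j) over the sequence."""
    total = 0.0
    prev = start
    for word, seg in zip(words, segs):
        total += math.log(trans.get((prev, seg), FLOOR))
        total += math.log(emit.get((word, seg), FLOOR))
        prev = seg
    return total

def best_segmentation(words, candidates, emit, trans):
    """Try every combination of candidate segmentations and keep the best."""
    options = product(*(candidates[w] for w in words))
    return max(options, key=lambda segs: sequence_log_prob(words, segs, emit, trans))

# Toy data in Latin transliteration; every number here is invented.
words = ["Alwld", "wktb"]
candidates = {"Alwld": ["Al+wld", "Alwld"], "wktb": ["w+ktb", "wktb"]}
emit = {
    ("Alwld", "Al+wld"): 0.9, ("Alwld", "Alwld"): 0.1,
    ("wktb", "w+ktb"): 0.6, ("wktb", "wktb"): 0.4,
}
trans = {
    ("<s>", "Al+wld"): 0.7, ("<s>", "Alwld"): 0.3,
    ("Al+wld", "w+ktb"): 0.6, ("Al+wld", "wktb"): 0.4,
}

best = best_segmentation(words, candidates, emit, trans)
print(best)
```

Exhaustive enumeration is only practical for short inputs; a dynamic-programming search would be the usual choice for full sentences.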

Results
The result of the "My Approach" algorithm:
• They used bigrams on 45 files with a total size of 29,092 tokens.
• The final accuracy was 98.83%.

                                    Recall     Accuracy   Precision  F-measure
Result without statistical support  0.9877462  0.9802977  0.8617793  0.920473
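
As a quick consistency check on the reported figures, the F-measure can be recomputed from the precision and recall values. This sketch assumes the standard balanced F-measure (the harmonic mean of precision and recall); the slides do not state which variant was used.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported for the run without statistical support.
precision = 0.8617793
recall = 0.9877462

f = f_measure(precision, recall)
print(f"{f:.4f}")
```

Under that assumption the recomputed value agrees with the reported 0.920473 to about four decimal places.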

Methodology
➢ Root-based.
➢ Light Stemmer.
➢ N-Gram.
➢ Hybrid Method.

Root-based
➢ Example of a root-based stemmer

Light Stemmer
➢ Morphemes removed by light stemmers

Light Stemmer
➢ Classification of the Light8 stemmer

N-gram
➢ A statistical stemmer based on calculating a measure of similarity between a pair of words.
➢ N-gram techniques:
• Digram.
• Trigram.

N-gram
N-gram techniques:
• Digram (N=2)
• Trigram (N=3)
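
A short sketch of what digram and trigram extraction looks like at the character level. The function name `char_ngrams` and the example word are chosen here for illustration and are not taken from the slides.

```python
def char_ngrams(word, n):
    """Return the overlapping character n-grams of a word, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

word = "مكتبة"  # "library" (m-k-t-b-p in Buckwalter transliteration), illustrative only

digrams = char_ngrams(word, 2)   # N = 2
trigrams = char_ngrams(word, 3)  # N = 3

print(digrams)
print(trigrams)
```

A word of length L yields L - N + 1 n-grams, so this five-letter word gives four digrams and three trigrams.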

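N-gram
➢ String similarity is calculated using Dice's coefficient:
S = 2 C_wq / (A_w + B_q)
where A_w and B_q are the numbers of unique n-grams in the words w and q, and C_wq is the number of unique n-grams the two words share.
Example: for a pair of words with 10 and 5 unique n-grams, 4 of them shared, the similarity would be:
(2 × 4) / (10 + 5) = 0.533.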
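
The Dice calculation can be sketched as follows, assuming that the quantities in the formula are counts of unique n-grams per word and of unique n-grams shared by both. The function names and the Arabic word pair are illustrative choices, not the slide's original example.

```python
def unique_ngrams(word, n=2):
    """Set of distinct character n-grams in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice_from_counts(shared, a, b):
    """Dice's coefficient from raw counts: S = 2C / (A + B)."""
    return 2 * shared / (a + b)

def dice(w, q, n=2):
    """Dice similarity between two words based on their unique n-grams."""
    grams_w, grams_q = unique_ngrams(w, n), unique_ngrams(q, n)
    if not grams_w and not grams_q:
        return 0.0  # both words shorter than n: define similarity as 0
    return dice_from_counts(len(grams_w & grams_q), len(grams_w), len(grams_q))

# The arithmetic from the slide: 4 shared n-grams, 10 and 5 unique n-grams.
slide_value = dice_from_counts(4, 10, 5)
print(round(slide_value, 3))

# An illustrative word pair: "library" and "book" share the single digram "كت".
pair_value = dice("مكتبة", "كتاب")
print(round(pair_value, 3))
```

In a statistical stemmer, word pairs whose similarity exceeds a chosen threshold would be grouped under the same stem; the threshold itself is a tuning decision that the slides do not specify.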

Hybrid Method
➢ Incorporates three different techniques for Arabic stemming.
➢ The hybrid algorithm starts by constructing a root file containing more than 9,000 valid Arabic roots.

Results
➢ The hybrid algorithm was found to outperform the other stemming algorithms.
➢ The results obtained show that using the hybrid stemmer improves the performance of some Arabic language processing tasks.
➢ In Arabic text categorization, the average accuracies are: 74.41% for the Khoja stemmer, 59.71% for light stemming, 48.17% for n-grams, and 82.33% for the hybrid stemmer.
