Building corpus from WWW for Arabic

  • 1. Building corpus from www for Arabic. Arabic NLP group at Imam University, 2013. Al-Fridi.A, Bhattab.R, Al-Rakaf.N
  • 2. Outline
      • Introduction
      • Data collection
      • Data processing
      • Architecture
      • Problems
      • Tools Methodology
      • Conclusion
  • 3. Introduction
      • Building a corpus requires considerable time and effort.
      • Texts may not be easily available for building a corpus.
      • Web data has given rise to a new strand of research.
      • The web is immense, free, and available.
      • The web serves as a source of language data because it is far larger than other sources.
      • The idea of building corpora dates back to 1897, with the German linguist Käding.
  • 4. Data collection
      • There are many ways to collect data from websites.
      • A locally developed spider program was used to get the data from each site.
      • The Arabic Optical Character Recognition (OCR) program Automatic Reader was used.
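The spider mentioned on this slide is not described in the deck. As a rough illustration only, the following Python sketch shows what a minimal single-site spider could look like. The names `LinkParser`, `fetch`, and `crawl` are hypothetical, the UTF-8 decoding is an assumption about the target site, and a real crawler would also need politeness delays and robots.txt handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values of <a> tags and the non-empty text chunks of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Note: this also keeps script/style text; a real spider would skip it.
        if data.strip():
            self.text.append(data.strip())


def fetch(url):
    """Download a page, decoding it as UTF-8 (an assumption about the site)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl restricted to the host of start_url."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch_page(url))
        pages[url] = " ".join(parser.text)
        for href in parser.links:
            target = urljoin(url, href).split("#")[0]  # drop fragments
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages
```

Passing the page loader in as `fetch_page` keeps the crawl logic separate from network access, so the same function can be tried on a small in-memory set of pages before being pointed at a live site.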
  • 5. Data processing
      The processing of the data to obtain the corpus consisted of the following steps:
      • Language classification.
      • Linguistic filtering.
      • Processing.
      • Corpus indexing.
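The slide names the steps without detailing them. The sketch below is one possible reading, not the authors' implementation: language classification is approximated by the share of Arabic letters, linguistic filtering by a minimum token count, and indexing by a simple inverted index. The thresholds and the function names are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Basic Arabic letter block (hamza through yeh); an assumption, not a full
# treatment of the Arabic script.
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")
TOKEN = re.compile(r"[\u0621-\u064A]+")


def arabic_ratio(text):
    """Share of alphabetic characters that are Arabic letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if ARABIC_LETTER.match(ch))
    return arabic / len(letters)


def is_arabic(text, threshold=0.8):
    """Language classification step (hypothetical threshold)."""
    return arabic_ratio(text) >= threshold


def linguistic_filter(text, min_tokens=3):
    """Linguistic filtering step: drop texts too short to be running prose."""
    return len(TOKEN.findall(text)) >= min_tokens


def build_index(documents):
    """Keep documents that pass both checks and index them by token."""
    index = defaultdict(set)
    kept = {}
    for doc_id, text in documents.items():
        if is_arabic(text) and linguistic_filter(text):
            kept[doc_id] = text
            for token in TOKEN.findall(text):
                index[token].add(doc_id)
    return kept, dict(index)
```

With this layout, looking up a word in the returned index gives the set of document identifiers that contain it, which is the basic operation a corpus query tool builds on.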
  • 6. Architecture
  • 7. Problems
      • Textual layout.
      • Spelling mistakes.
      • Duplicates.
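The deck does not say how these problems were handled. As a hedged illustration of two of them, the following sketch normalizes layout differences (whitespace, diacritics, tatweel) and then removes exact duplicates by hashing the normalized text. The `normalize` and `deduplicate` names are hypothetical, and spelling correction is out of scope here.

```python
import hashlib
import re

# Arabic short-vowel marks (U+064B to U+0652) plus tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
WHITESPACE = re.compile(r"\s+")


def normalize(text):
    """Strip diacritics and tatweel, then collapse runs of whitespace."""
    text = DIACRITICS.sub("", text)
    return WHITESPACE.sub(" ", text).strip()


def deduplicate(texts):
    """Keep the first occurrence of each text, comparing normalized forms."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Hashing the normalized form means two copies of a page that differ only in spacing or vocalization count as one document. Near-duplicates with small wording changes would need a fuzzier method such as shingling.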
  • 8. Tools Methodology
  • 9. Crawler System
  • 10. Cosmas Query
  • 11. BootCaT
      • BootCaT was the first to propose a full procedure for the automated extraction of specialized corpora and technical terms by web mining.
      • Let us try to build a corpus.
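The BootCaT procedure starts from a small list of seed terms, combines them at random into tuples, and sends each tuple to a search engine as a query to find pages for the corpus. The sketch below illustrates only the tuple-building step in plain Python. It does not use BootCaT's own tools, and `seed_tuples` and `to_queries` are hypothetical names.

```python
import itertools
import random


def seed_tuples(seeds, size=3, count=5, rng=None):
    """Pick up to `count` distinct combinations of `size` seed terms."""
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    combos = list(itertools.combinations(sorted(set(seeds)), size))
    return rng.sample(combos, min(count, len(combos)))


def to_queries(tuples):
    """Turn each tuple into a space-separated search query string."""
    return [" ".join(terms) for terms in tuples]
```

Each resulting query string would then be submitted to a search engine, and the returned pages downloaded and cleaned to form the corpus.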
  • 12. Sketch Engine: Introduction
      • The Sketch Engine is a corpus processing system developed in 2002.
      • The basic elements of the Sketch Engine are concordances, word sketches, grammatical relations, and a distributional thesaurus.
      • The Sketch Engine service makes a number of large web corpora available for online analysis, which is carried out through a web-based corpus query interface.
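A concordance lists every occurrence of a word together with its surrounding context. The following sketch shows the idea as a generic keyword-in-context routine. It is not the Sketch Engine's API, and the `concordance` function and its `width` parameter are assumptions made for illustration.

```python
def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each match."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines
```

Called on a tokenized text, the function returns one triple per hit, which can be printed in aligned columns to give the familiar concordance view.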
  • 13. Sketch Engine: Implementation and Design
      • The Sketch Engine has a different query system.
      • A word sketch covers grammatical relations such as subject, object, prepositional object, and modifier.
  • 14. Conclusion
      • Building a corpus from the WWW for Arabic.
      • Ways of collecting data from the web.
      • The problems we faced and the tools that supported us in building the corpus.
  • 15. Acknowledgments
      This work was supervised by Dr. Amal Al-Saif. We thank her for helping and supporting us.