Building a Corpus from the WWW for Arabic
Arabic NLP group at Imam University, 2013
Al-Fridi.A, Bhattab.R, Al-Rakaf.N
Outline
• Introduction
• Data collection
• Data processing
• Architecture
• Problems
• Tools Methodology
• Conclusion
Introduction
• Building a corpus requires major time and effort.
• Texts may not be easily available for building a corpus.
• Web data has given rise to a new strand of research.
• The web is immense, free and available.
• The Web is used as a source of language data because it is far larger than other sources.
• The idea of building corpora dates back to 1897, with the German linguist Kading.
Data collection
• There are many ways of collecting data from websites.
• A locally developed spider program was used to get the data from each site.
• The Arabic Optical Character Recognition (OCR) program Automatic Reader was used.
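The spider mentioned above is not described in detail, so the following is only a minimal sketch of how such a program might work, assuming a breadth-first crawl that stays on one host. The names `PageParser`, `fetch_html` and `crawl_site`, and the `max_pages` limit, are hypothetical and are not taken from the original tool.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside <script> and <style>.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def fetch_html(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_site(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl restricted to the host of start_url.

    Returns a dict mapping each visited URL to its extracted text.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that cannot be downloaded
        parser = PageParser()
        parser.feed(html)
        pages[url] = " ".join(parser.text_parts)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # drop fragments
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing the fetcher as a parameter keeps the crawl logic testable offline. A real spider would also need to respect robots.txt and rate limits, which this sketch leaves out.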
Data processing
The processing of the data to obtain the corpus consisted of the following steps:
• Language classification.
• Linguistic filtering.
• Processing.
• Corpus indexing.
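The four steps above could be chained roughly as follows. This is a sketch under stated assumptions, not the pipeline actually used: language classification is approximated by the share of Arabic letters, linguistic filtering by a minimum token count plus exact-duplicate removal, processing by diacritic stripping and tokenization, and indexing by a simple inverted index. All function names and thresholds are hypothetical.

```python
import re
from collections import defaultdict

ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # harakat and tatweel
TOKEN = re.compile(r"[\u0621-\u064A]+")


def is_arabic(text, threshold=0.5):
    """Language classification: share of Arabic letters among all letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for ch in letters if ARABIC_LETTER.match(ch))
    return arabic / len(letters) >= threshold


def tokenize(text):
    """Processing: strip diacritics and tatweel, then split into tokens."""
    return TOKEN.findall(DIACRITICS.sub("", text))


def passes_filter(text, min_tokens=3):
    """Linguistic filtering: drop very short fragments such as menu items."""
    return len(tokenize(text)) >= min_tokens


def build_index(documents):
    """Corpus indexing: map each token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def build_corpus(raw_texts):
    """Run the steps in order and return (documents, index)."""
    seen = set()
    documents = []
    for text in raw_texts:
        # Exact-duplicate removal is an assumption added for illustration.
        if is_arabic(text) and passes_filter(text) and text not in seen:
            seen.add(text)
            documents.append(text)
    return documents, build_index(documents)
```

Calling `build_corpus` on a list of page texts returns the retained documents and an index that can be queried with `index[token]` to find the documents containing a word.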
BootCaT
• This is the first tool to propose a full procedure for the automated extraction of specialized corpora and technical terms by web mining.
• Let us try to build a corpus.
Sketch Engine: Introduction
• The Sketch Engine is a corpus processing system developed in 2002.
• The basic elements of the Sketch Engine are concordances, word sketches, grammatical relations, and a distributional thesaurus.
• The Sketch Engine service makes a number of large web corpora available for online analysis through a web-based corpus query interface.
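To make the notion of a concordance concrete, here is a minimal keyword-in-context sketch. It is an illustration only and does not reproduce how the Sketch Engine computes concordances; the function name `concordance` and the `width` parameter are hypothetical.

```python
def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines as (left context, keyword, right context)."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines
```

Each returned tuple corresponds to one concordance line, with up to `width` tokens of context on either side of the keyword.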
Sketch Engine: Implementation and Design
• The Sketch Engine has a different query system.
• A Word Sketch includes: subject, object, prepositional object, and modifier.
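A word sketch groups a word's collocates by grammatical relation. The sketch below illustrates that grouping, assuming the corpus has already been parsed into (relation, headword, collocate) triples. The function name `word_sketch` and the triple format are hypothetical, and raw counts stand in for the association scores a real system would use.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword, top_n=3):
    """Group the collocates of headword by grammatical relation.

    triples: iterable of (relation, headword, collocate) tuples.
    Returns a dict mapping each relation to its most frequent collocates.
    """
    sketch = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            sketch[relation][collocate] += 1
    return {rel: counts.most_common(top_n) for rel, counts in sketch.items()}
```

For example, triples for the verb "read" might yield an "object" list headed by "book" and a "subject" list headed by "student".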
Conclusion
• Building a corpus from the WWW for Arabic.
• Ways of collecting data from the web.
• Problems we faced and the tools that supported us in building the corpus.
Acknowledgments
This work has been supervised by Dr. Amal Al-Saif; we thank her for helping and supporting us.