Sinmin Literature Review Presentation

This project is about building a corpus for the Sinhala language. This presentation covers its literature review, including the previous literature we referred to in the project.

Published in: Engineering

  1. 1. SINMIN CORPUS FOR SINHALA LANGUAGE Literature Review Authors: Upeksha W. D., Wijayarathna D. G. C. D., Siriwardena M. P., Lasandun K. H. L. Supervisors: Dr. Chinthana Wimalasuriya, Prof. Gihan Dias, Mr. N. H. N. D. De Silva
  2. 2. Sinmin is a corpus for the Sinhala language that ➢ Is continuously updated ➢ Is dynamic (scalable) ➢ Covers a wide range of language (structured and unstructured)
  3. 3. OUTLINE ● Literature Review ● Introduction to corpus linguistics and What is a Corpus ● Usages of a corpus ● Existing Corpus Implementations ● Identifying Sinhala Sources and Crawling ● Data Storage and Information Retrieval from Corpus ● Information Visualization ● Extracting Linguistic Features ● Current Progress
  4. 4. INTRODUCTION TO CORPUS LINGUISTICS AND WHAT IS A CORPUS Handford, M. and McCarthy, M. J. (2004) “Invisible to us”: A preliminary corpus-based study of spoken business English, Discourse in the Professions: Perspectives from Corpus Linguistics, 167-201
  5. 5. WHAT IS A CORPUS? “A corpus is a principled collection of authentic texts stored electronically that can be used to discover information about language that may not have been noticed through intuition alone.” - Bennett (2010) Bennett, G. R. (2010) Using Corpora in the Language Learning Classroom, Michigan ELT.
  6. 6. ● There are 8 main kinds of corpora. ● They are generalized, specialized, learner, pedagogic, historical, parallel, comparable, and monitor corpora. ● The broadest type of corpus is the generalized corpus.
  7. 7. “Sinmin” will be a generalized corpus. It will cover all types of Sinhala language.
  8. 8. USAGES OF A CORPUS ● Implementing translators, spell checkers and grammar checkers. ● Identifying lexical and grammatical features of a language. ● Identifying how language varies with context of usage and over time. ● Retrieving statistical details of a language. ● Providing backend support for tools like OCR, POS taggers, etc.
  9. 9. EXISTING CORPUS IMPLEMENTATIONS
  10. 10. CORPUS FOR SINHALA LANGUAGE? ● There is an existing corpus for the Sinhala language, the UCSC Text Corpus of Contemporary Sinhala. ● It consists of about 10 million words, but it covers only a narrow range of the language and is not being updated.
  11. 11. COMPOSITION OF THE CORPUS ● The language comprising the corpus cannot be random; it must be chosen according to specific characteristics. ● It must use authentic texts. The language it contains is not made up for the sole purpose of creating the corpus.
  12. 12. EXAMPLE - COMPOSITION OF COCA ● The COCA contains more than 385 million words from 1990–2008 (20 million words each year). ● Texts are evenly divided between 5 genres: spoken (20%), fiction (20%), popular magazines (20%), newspapers (20%) and academic journals (20%).
  13. 13. COMPOSITION OF UCSC TEXT CORPUS OF CONTEMPORARY SINHALA
  14. 14. DATA STORAGE AND INFORMATION RETRIEVAL FROM CORPUS Existing corpora use two main technologies for data storage ● Relational Databases ● Indexed File Systems
  15. 15. INDEXED FILE SYSTEMS AS STORAGE ● BNC uses this mechanism. ● Data is stored as XML-like files that follow a scheme known as the Corpus Document Interchange Format. ● This makes it possible to store a great deal of detail about the structure of each text, such as its division into sections or chapters, paragraphs, verse lines, etc.
  16. 16. RELATIONAL DATABASE AS STORAGE ● COCA and Corpus del Español use relational databases.
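As a rough illustration of the indexed-file idea, the sketch below parses a small XML-like text and builds a word index that records where each word occurs. The element names (`text`, `div`, `p`), the `id` attribute, and the sample content are assumptions made for this example; they are not taken from the BNC's actual scheme.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document; the markup is illustrative only.
SAMPLE = """<text id="t1">
  <div type="chapter">
    <p>the corpus is large</p>
    <p>the index is small</p>
  </div>
</text>"""

def build_index(xml_string):
    """Map each word to (text id, paragraph number, word position) tuples."""
    root = ET.fromstring(xml_string)
    index = defaultdict(list)
    for p_num, p in enumerate(root.iter("p")):
        # Naive whitespace tokenization (an assumption of this sketch).
        for pos, word in enumerate((p.text or "").split()):
            index[word].append((root.get("id"), p_num, pos))
    return index
```

Calling `build_index(SAMPLE)["index"]` returns the single location of that word, so a lookup does not need to rescan the text files.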
  17. 17. DATA MODEL IN COCA
  18. 18. CORPUS DEL ESPAÑOL USES SEPARATE TABLES FOR BIGRAMS AND TRIGRAMS
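The separate-table idea can be sketched with SQLite from the Python standard library. The table and column names (`bigram`, `trigram`, `w1`, `w2`, `w3`, `freq`) and the whitespace tokenization are assumptions for illustration; the actual Corpus del Español schema is not given in these slides.

```python
import sqlite3
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_ngram_tables(conn, sentences):
    """Create and fill separate bigram and trigram frequency tables."""
    conn.execute("CREATE TABLE bigram (w1 TEXT, w2 TEXT, freq INTEGER, "
                 "PRIMARY KEY (w1, w2))")
    conn.execute("CREATE TABLE trigram (w1 TEXT, w2 TEXT, w3 TEXT, freq INTEGER, "
                 "PRIMARY KEY (w1, w2, w3))")
    bigrams, trigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.split()  # naive tokenization (assumption)
        bigrams.update(ngrams(tokens, 2))
        trigrams.update(ngrams(tokens, 3))
    conn.executemany("INSERT INTO bigram VALUES (?, ?, ?)",
                     [(a, b, f) for (a, b), f in bigrams.items()])
    conn.executemany("INSERT INTO trigram VALUES (?, ?, ?, ?)",
                     [(a, b, c, f) for (a, b, c), f in trigrams.items()])
    conn.commit()

def bigram_freq(conn, w1, w2):
    """Look up the frequency of one bigram; 0 if it never occurred."""
    row = conn.execute("SELECT freq FROM bigram WHERE w1 = ? AND w2 = ?",
                       (w1, w2)).fetchone()
    return row[0] if row else 0
```

Precomputing the counts this way trades storage for speed: a frequency query becomes a primary-key lookup rather than a scan over the running text.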
  19. 19. RELATIONAL DB VS INDEXED FILE SYSTEMS ● Indexed file systems make extensive use of indexes. ● Relational database models are relatively fast. ● In indexed file systems, it is difficult to add additional layers of annotation.
  20. 20. We found no prior study on how NoSQL databases perform when used to implement corpora.
  21. 21. INFORMATION VISUALIZATION Most of the popular corpora, such as BNC, COCA, Corpus del Español and the Google Books corpus, use a similar kind of web interface.
  22. 22. USER INTERFACE OF COCA
  23. 23. GOOGLE BOOKS NGRAM VIEWER UI
  24. 24. EXTRACTING LINGUISTIC FEATURES ● A main usage of a language corpus is extracting linguistic features of a language. ● Linguistic features of many languages have been identified using corpora. ● Example - A corpus-based linguistic analysis of a written corpus: colligation of “TO” and “FOR.”
  25. 25. CURRENT PROGRESS
  26. 26. IDENTIFIED SINHALA RESOURCES ● Online Newspapers ● News Websites ● School Textbooks ● Sinhala Wikipedia ● Online Mahawansaya ● Subtitles ● Sinhala Fiction ● Sinhala Blogs ● Sinhala Magazines ● Gazette
  27. 27. DIVIDED INTO 5 MAIN GENRES ● News: Newspapers, News Items ● Academic: Textbooks, Religious, Wikipedia, Mahawansa ● Creative Writing: Fiction, Blogs, Magazines ● Spoken: Subtitles ● Gazette: Gazette
  28. 28. Implemented crawlers for the different sources, all adhering to the same output format. https://github.com/madurangasiriwardena/corpus.sinhala.crawler
  29. 29. FINISHED CRAWLERS
  30. 30. CRAWLED DATA SAVED TO XML FILES WITH THE FOLLOWING METADATA ● Post Name ● Author ● Link ● Published Date
  31. 31. CRAWLER CONTROLLER The crawler controller monitors and manages the status of the web crawlers. Crawler controller address - http://Sinhala-corpus.projects.uom.lk:8080/CrawlerControllerWeb
  32. 32. We tested the performance of several database systems to determine which one we should use to store the data.
  33. 33. WE CONSIDERED THE FOLLOWING DATA STORAGE SYSTEMS
  34. 34. We measured performance for inserting data and for answering 12 different information needs. Data set and source code - https://github.com/madurangasiriwardena/performance-test
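A crawled post with these four metadata fields might be serialized along the following lines. The element names (`post`, `name`, `author`, `link`, `date`, `content`) and the separate `content` element are assumptions for illustration; the format the Sinmin crawlers actually write is not shown in the slides.

```python
import xml.etree.ElementTree as ET

def post_to_xml(name, author, link, published_date, content):
    """Serialize one crawled post and its metadata to an XML string."""
    post = ET.Element("post")  # hypothetical root element name
    ET.SubElement(post, "name").text = name
    ET.SubElement(post, "author").text = author
    ET.SubElement(post, "link").text = link
    ET.SubElement(post, "date").text = published_date
    ET.SubElement(post, "content").text = content
    # encoding="unicode" returns a str, so Sinhala text is kept as-is.
    return ET.tostring(post, encoding="unicode")
```

Having every crawler emit the same structure means later stages, such as loading into the database, need only one parser regardless of the source.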
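A comparison of this kind can be sketched as a small timing harness. Everything below is hypothetical: `DictStore` is a toy in-memory stand-in for a real database backend, and word frequency stands in for one of the information needs; the project's actual test code is in the repository linked above.

```python
import time

def time_call(fn, *args, repeats=3):
    """Return (best wall-clock seconds over several runs, last result)."""
    best, result = float("inf"), None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return best, result

class DictStore:
    """Toy in-memory store standing in for a database backend."""
    def __init__(self):
        self.freq = {}

    def insert(self, words):
        for w in words:
            self.freq[w] = self.freq.get(w, 0) + 1

    def frequency(self, word):
        return self.freq.get(word, 0)

def benchmark(store, words, queries):
    """Time one bulk insertion, then each query against the filled store."""
    # Insertion changes the store's state, so it is run exactly once.
    insert_time, _ = time_call(store.insert, words, repeats=1)
    query_times = {q: time_call(store.frequency, q)[0] for q in queries}
    return insert_time, query_times
```

Running `benchmark` with the same data and queries against each candidate backend gives comparable insertion and retrieval timings.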
  35. 35. DATA INSERTION TIME COMPARISON
  36. 36. INFORMATION RETRIEVAL PERFORMANCE COMPARISON - PART 1
  37. 37. INFORMATION RETRIEVAL PERFORMANCE COMPARISON - PART 2
  38. 38. Cassandra performed better than others in most of the scenarios, and its insertion time increased linearly. So we chose it for implementing the corpus.
  39. 39. USER INTERFACE DESIGN AND IMPLEMENTATION ● The web interface of Sinmin has been designed for users who prefer a visualised and summarized view of Sinmin's statistical data. ● The visual design of the interface allows users with no prior experience of it to fulfill their information requirements with little effort. http://sinhala-corpus.projects.uom.lk/sinmin-web/
  40. 40. CORPUS API DESIGN AND IMPLEMENTATION • REST API to expose corpus services • More complex and customizable data retrieval and filtering • Interface for third-party applications to consume
  41. 41. PUBLICATIONS ● Comparison between performance of various database systems for implementing a language corpus - 11th Beyond Databases, Architectures and Structures conference (Pending) ● Implementing a Corpus for Sinhala Language - Symposium on Language Technology for South Asia (Pending)
  42. 42. REMAINING WORK FOR THE NEXT PHASE • Finish writing crawlers • Feed data to the Cassandra database • Connect the front end to the API
  43. 43. Questions?
  44. 44. Thank you!
