Sinmin - Corpus for Sinhala Language
Upeksha W. D.
Wijayarathna D. G. C. D.
Siriwardena M. P.
Lasandun K. H. L.
Supervisors :
Dr. Chinthana Wimalasuriya
Prof. Gihan Dias
Mr. N. H. N. D. de Silva
Outline
● Introduction
● Crawler Implementation and Design
● Data Cleaning and Tokenizing Mechanisms
● Selecting Data Storage Mechanism
● Data Storage Model of SinMin
● User Interface Design and Implementation
● API Design and Implementation
● Unit Testing
● Performance Testing of the API
● Implemented Sample Usages
What is a Corpus?
“A corpus is a principled collection of
authentic texts stored electronically that
can be used to discover information about
language that may not have been noticed
through intuition alone.” - Bennett (2010)
Uses of a Corpus
● Implementing translators, spell checkers and grammar
checkers.
● Identifying lexical and grammatical features of a language.
● Identifying how language varies with context of use and over
time.
● Retrieving statistical details of a language.
● Providing backend support for tools such as OCR and POS taggers.
Sinmin is a corpus for the Sinhala language that
➢ Is continuously updated
➢ Is dynamic (scalable)
➢ Covers a wide range of language (structured and unstructured)
Architecture of Sinmin
Identified Sinhala Resources
Categories: News, Academic, Creative Writing, Spoken, Gazette
Resources: News Paper, Text books, Fiction, Subtitle, Gazette, News Items, Religious, Blogs, Wikipedia, Magazine, Mahawansa
Crawler Implementation and
Design
Crawlers are responsible for finding web pages that contain Sinhala
content, then fetching, parsing and storing them in a manageable format.
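Finding pages "that contain Sinhala content" needs some test for Sinhala text. The following is a minimal sketch of one possible test, assuming the decision is based on the share of characters in the Sinhala Unicode block (U+0D80 to U+0DFF). The function names and the 0.5 threshold are illustrative assumptions, not taken from the Sinmin crawlers.

```python
# Sinhala Unicode block boundaries (inclusive).
SINHALA_START, SINHALA_END = 0x0D80, 0x0DFF


def sinhala_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Sinhala block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    sinhala = sum(1 for ch in chars if SINHALA_START <= ord(ch) <= SINHALA_END)
    return sinhala / len(chars)


def looks_sinhala(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical rule: treat the text as Sinhala if enough characters qualify."""
    return sinhala_ratio(text) >= threshold
```

A crawler could apply such a check to the extracted body text of a page before parsing and storing it.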
Crawler Architecture
Sample XML File With One Article Stored In It
Crawler Controller
The crawler controller monitors and manages
the status of the web crawlers.
Data Cleaning and Tokenizing
Mechanisms used
Identified Issues
● Erroneous characters in the texts
● Short forms
● Consecutive Sinhala vowel signs
Erroneous Characters Of The
Texts
● Invalid Unicode characters
E.g.: characters in a private use area, the replacement character (U+FFFD)
● Symbols
E.g.: “,”, “.”, “{”, “(”, “?”
● Unwanted non-Sinhala characters
E.g.: U+200C (zero width non-joiner), Á, À, ®, ¡, ª, º
● Non-symbol characters that were terminating
words
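The character classes listed above can be filtered with a simple keep-list. This is a minimal sketch, assuming that ASCII, the Sinhala block and the zero width joiner (U+200D, which Sinhala conjunct forms rely on) are kept and everything else is dropped. The policy and the function names are assumptions for illustration. ASCII symbols are left in place here because sentence splitting still needs the full stops.

```python
SINHALA_BLOCK = range(0x0D80, 0x0E00)
ZWJ = "\u200d"  # kept: used inside Sinhala conjunct forms


def keep(ch: str) -> bool:
    """Assumed policy: keep ASCII, Sinhala block characters and ZWJ only."""
    return ord(ch) < 128 or ord(ch) in SINHALA_BLOCK or ch == ZWJ


def clean(text: str) -> str:
    """Drop private-use characters, U+FFFD, U+200C and other non-Sinhala debris."""
    return "".join(ch for ch in text if keep(ch))
```

Under this policy U+200C, U+FFFD, private use characters and Latin-1 characters such as Á or ® are all removed, at the cost of also dropping any legitimate non-ASCII punctuation.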
Short Forms
● Short forms contain full stops.
● These full stops separate neither sentences nor words.
E.g.: පෙ. ව. (a.m.), රු. (Rupees)
Identified Common Short
Forms
"ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්."
"පෙ.", "ව.", "ෙ.", "රු."
"0.", "1.", "2.", "3.", "4.", "5.", "6.", "7.", "8.", "9."
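A sentence splitter can use the list of short forms to decide which full stops do not end a sentence. The sketch below assumes whitespace tokenization and treats any token that ends in a digit followed by a full stop as a short form, mirroring the "0." to "9." entries. The function names are hypothetical, and only the unambiguous entries of the list are included.

```python
# Short forms taken from the list above (digit forms are handled separately).
SHORT_FORMS = {"ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්.", "පෙ.", "ව.", "රු."}


def is_short_form(token: str) -> bool:
    """True if the trailing full stop belongs to a short form or a number."""
    return token in SHORT_FORMS or (len(token) >= 2 and token[-2] in "0123456789")


def split_sentences(text: str) -> list[str]:
    """Split on full stops, except those that belong to short forms."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and not is_short_form(token):
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a closing full stop
        sentences.append(" ".join(current))
    return sentences
```

With this rule a sentence that ends in a number is not split, which is the price of treating the digit forms as short forms.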
Consecutive Sinhala Vowel Sign
Problem
● Solution: map consecutive vowel signs to a single format
● Convention: only one vowel sign per Sinhala letter
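One way to apply the "one vowel sign per letter" convention is an ordered table of replacements. The mapping below is an assumption based on the Sinhala Unicode block (for example, kombuva followed by aela-pilla becomes the single sign kombuva haa aela-pilla); the slides do not list the exact rules Sinmin uses.

```python
# Ordered (sequence, single sign) pairs; later rules may build on earlier ones.
COMPOSITIONS = [
    ("\u0DD9\u0DD9", "\u0DDB"),  # kombuva + kombuva -> kombu deka
    ("\u0DD9\u0DCF", "\u0DDC"),  # kombuva + aela-pilla -> kombuva haa aela-pilla
    ("\u0DD9\u0DDF", "\u0DDE"),  # kombuva + gayanukitta -> kombuva haa gayanukitta
    ("\u0DD9\u0DCA", "\u0DDA"),  # kombuva + al-lakuna -> diga kombuva
    ("\u0DDC\u0DCA", "\u0DDD"),  # kombuva haa aela-pilla + al-lakuna -> diga form
]


def normalize_vowel_signs(text: str) -> str:
    """Replace each known sequence of vowel signs with its single-sign form."""
    for sequence, single in COMPOSITIONS:
        text = text.replace(sequence, single)
    return text
```

Because the rules run in order, a three-sign sequence such as kombuva, aela-pilla, al-lakuna is reduced in two steps to one sign.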
Selecting Data Storage
Mechanism for Sinmin
The performance of data insertion and
retrieval depends mainly on the data
storage mechanism used for the
corpus.
We tested the performance of several
database systems to determine which
one to use to store the data.
We Considered the Following Data
Storage Systems
We measured performance for
inserting data and for retrieving 12
different information needs.
Data set and source code:
https://github.com/madurangasiriwardena/performance-test
Data Insertion Time Comparison
Information Retrieval Performance Comparison - Part 1
Information Retrieval Performance Comparison - Part 2
Cassandra performed better than the other systems
in most scenarios, and its
insertion time increased linearly.
We therefore chose it to implement the
corpus.
Data Storage Model of Sinmin
● We used Cassandra as the main storage system of
Sinmin.
● Apache Cassandra version 2.1.2 was used.
● cqlsh version 5.0.1 was used.
Cassandra
● Most API queries are served from the Cassandra
database.
● The Cassandra database consists of more than 50 column
families, each of which serves a specific
information need.
Oracle
● Oracle is used as a backup storage server.
Oracle Schema
Wildcard Search Feature
The wildcard search feature enables users to run wildcard
queries on the corpus.
E.g.: පෙ?
හෙ*
● Implemented using Apache Solr
● More than 1.2 million distinct words
● Supports at most 10 asterisks and at most 10
question marks
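The meaning of the two wildcards and the limit of 10 can be illustrated without Solr. The sketch below uses Python's `re` module instead of Solr, so it shows only the query semantics (`?` for exactly one character, `*` for zero or more), not how Sinmin executes the query. The function names are hypothetical.

```python
import re

MAX_WILDCARDS = 10  # limit stated above, applied per wildcard type


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a ?/* wildcard pattern into a compiled regular expression."""
    if pattern.count("*") > MAX_WILDCARDS or pattern.count("?") > MAX_WILDCARDS:
        raise ValueError("at most 10 asterisks and 10 question marks are supported")
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")  # zero or more characters
        elif ch == "?":
            parts.append(".")  # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def search(pattern: str, words: list[str]) -> list[str]:
    """Return the words that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [word for word in words if regex.fullmatch(word)]
```

Note that `.` matches a single Unicode code point here, which is not the same thing as a single written Sinhala letter.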
Sinhala Vowel Sign Problem At Wildcard
Search
In Sinhala Unicode, vowel signs are separate
Unicode characters, so a single-character wildcard (?) can match a vowel sign on its own rather than a whole letter.
Solution: represent a Sinhala letter and its vowel sign as one
entity
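Treating a letter and its vowel sign as one entity can be sketched by grouping each base character with the signs that follow it and matching wildcards over those groups. The sign range used here (U+0DCA to U+0DDF, plus U+0DF2 and U+0DF3) and the function names are assumptions for illustration; conjuncts formed with the zero width joiner are not handled.

```python
def is_sign(ch: str) -> bool:
    """Assumed set of dependent signs: al-lakuna and the Sinhala vowel signs."""
    cp = ord(ch)
    return 0x0DCA <= cp <= 0x0DDF or cp in (0x0DF2, 0x0DF3)


def to_entities(text: str) -> list[str]:
    """Group every base character with the signs attached to it."""
    entities: list[str] = []
    for ch in text:
        if entities and is_sign(ch):
            entities[-1] += ch
        else:
            entities.append(ch)
    return entities


def wildcard_match(pattern: str, word: str) -> bool:
    """Match ?/* over entities, so ? stands for one letter with its signs."""
    p, w = to_entities(pattern), to_entities(word)

    def match(i: int, j: int) -> bool:
        if i == len(p):
            return j == len(w)
        if p[i] == "*":
            return any(match(i + 1, k) for k in range(j, len(w) + 1))
        if j < len(w) and (p[i] == "?" or p[i] == w[j]):
            return match(i + 1, j + 1)
        return False

    return match(0, 0)
```

With this grouping, a word made of a letter carrying a vowel sign followed by a plain letter counts as two entities, so a pattern needs two `?` marks to match it, not three.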
User Interface Design and
Implementation
● The Sinmin web interface is designed for users
who prefer a visualised, summarised view
of Sinmin's statistical data.
● The interface is designed so that users with no
prior experience of it can meet their information
needs with little effort.
The Sinmin user interface allows users to:
● Find the probability of an n-gram
● Find the most probable word that comes after an n-gram
● Compare the usage of n-grams
● Find statistics of words, bigrams and trigrams
● Run wildcard searches
● Find the latest articles for an n-gram
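The first two features can be illustrated with plain counts. This is a minimal sketch, assuming maximum-likelihood estimates (relative frequencies) over a token list; the slides do not state the formulas Sinmin uses, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-grams of the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_probability(tokens: list[str], gram: list[str]) -> float:
    """Relative frequency of `gram` among all n-grams of the same length."""
    grams = ngrams(tokens, len(gram))
    if not grams:
        return 0.0
    return Counter(grams)[tuple(gram)] / len(grams)


def most_probable_next(tokens: list[str], context: list[str]):
    """Most frequent word after `context`, with its conditional probability."""
    n = len(context)
    followers = Counter(
        g[-1] for g in ngrams(tokens, n + 1) if g[:-1] == tuple(context)
    )
    if not followers:
        return None
    word, count = followers.most_common(1)[0]
    return word, count / sum(followers.values())
```

For the token list `a b a c a b`, the bigram `a b` occurs in 2 of 5 bigrams, and `b` follows `a` in 2 of 3 cases.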
API Design and Implementation
REST API
● A REST API exposes the corpus services
● Supports more complex and customizable data retrieval and
filtering
● Provides an interface for third-party applications to consume
● Depends on backend databases (Cassandra,
Oracle, Solr)
● Cassandra acts as the main storage system
● Oracle is used as a backup database
● Solr is used for wildcard search functions
Architecture
API Functions
● wordFrequency
● bigramFrequency
● trigramFrequency
● frequentWords
● frequentBigrams
● frequentTrigrams
● latestArticlesForWord
● latestArticlesForBigram
● latestArticlesForTrigram
● frequentWordsAroundWord
● frequentWordsInPosition
● frequentWordsInPositionReverse
● frequentWordsAfterWordTimeRange
● frequentWordsAfterBigramTimeRange
● wordCount
● bigramCount
● trigramCount
Performance Testing of the API
Throughput Under Different Load Conditions
Time Taken To Process Requests Under Different Load Conditions
Full Stop Predictor For OCR
● One challenge in OCR development is identifying
full stops.
● This tool is a consumer application of Sinmin that
predicts the full stops in Sinhala texts.
Publications
● Implementing a Corpus for Sinhala Language -
Symposium on Language Technology for South Asia
(Presented)
● Comparison between performance of various
database systems for implementing a language
corpus – 11th International Beyond Databases,
Architectures and Structures conference (Accepted)
Future Work
● Annotate words with POS tags and lemmas.
● Implement tools and applications that make use of
the corpus.
Q & A
Thank You!
59

More Related Content

What's hot

RDF Seminar Presentation
RDF Seminar PresentationRDF Seminar Presentation
RDF Seminar Presentation
Muntazir Mehdi
 
Services semantic technology_terminology
Services semantic technology_terminologyServices semantic technology_terminology
Services semantic technology_terminologyTenforce
 
NoSQL
NoSQLNoSQL
NoSQL
Radu Potop
 
Semantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQLSemantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQL
Jerven Bolleman
 

What's hot (6)

Performance neo4j-versus (2)
Performance neo4j-versus (2)Performance neo4j-versus (2)
Performance neo4j-versus (2)
 
RDF Seminar Presentation
RDF Seminar PresentationRDF Seminar Presentation
RDF Seminar Presentation
 
NoSQL
NoSQLNoSQL
NoSQL
 
Services semantic technology_terminology
Services semantic technology_terminologyServices semantic technology_terminology
Services semantic technology_terminology
 
NoSQL
NoSQLNoSQL
NoSQL
 
Semantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQLSemantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQL
 

Viewers also liked

SinMin - Sinhala Corpus Project - Thesis
SinMin - Sinhala Corpus Project - ThesisSinMin - Sinhala Corpus Project - Thesis
SinMin - Sinhala Corpus Project - Thesis
Chamila Wijayarathna
 
G.C.E O/L ICT Short Notes Grade-11
G.C.E O/L ICT Short Notes Grade-11G.C.E O/L ICT Short Notes Grade-11
G.C.E O/L ICT Short Notes Grade-11
Mahesh Kodituwakku
 
G.C.E. O/L ICT Lessons Database sinhala
 G.C.E. O/L ICT Lessons Database sinhala G.C.E. O/L ICT Lessons Database sinhala
G.C.E. O/L ICT Lessons Database sinhala
Mahesh Kodituwakku
 
දත්ත සහ තොරතුරු
දත්ත සහ තොරතුරුදත්ත සහ තොරතුරු
දත්ත සහ තොරතුරු
Tennyson
 
Grade 10 ICT Short Notes in Sinhala(2015)
Grade 10 ICT Short Notes in Sinhala(2015)Grade 10 ICT Short Notes in Sinhala(2015)
Grade 10 ICT Short Notes in Sinhala(2015)
Mahesh Kodituwakku
 
Cellsppt presentation-100813001954-phpapp02
Cellsppt presentation-100813001954-phpapp02Cellsppt presentation-100813001954-phpapp02
Cellsppt presentation-100813001954-phpapp02Muhammad Fahad Saleh
 
Animal and Plant Cells
Animal and Plant CellsAnimal and Plant Cells
Animal and Plant CellsECSD
 
Power point presentation of animal cell and plant cell
Power point presentation of animal cell and plant cellPower point presentation of animal cell and plant cell
Power point presentation of animal cell and plant cell
jhoysantos12
 
Cell : Structure and Function Part 01
Cell : Structure and Function Part 01Cell : Structure and Function Part 01
Cell : Structure and Function Part 01
Abdul-arzeez Penai
 
Cells Powerpoint Presentation
Cells Powerpoint PresentationCells Powerpoint Presentation
Cells Powerpoint Presentationcprizel
 

Viewers also liked (11)

SinMin - Sinhala Corpus Project - Thesis
SinMin - Sinhala Corpus Project - ThesisSinMin - Sinhala Corpus Project - Thesis
SinMin - Sinhala Corpus Project - Thesis
 
G.C.E O/L ICT Short Notes Grade-11
G.C.E O/L ICT Short Notes Grade-11G.C.E O/L ICT Short Notes Grade-11
G.C.E O/L ICT Short Notes Grade-11
 
G.C.E. O/L ICT Lessons Database sinhala
 G.C.E. O/L ICT Lessons Database sinhala G.C.E. O/L ICT Lessons Database sinhala
G.C.E. O/L ICT Lessons Database sinhala
 
දත්ත සහ තොරතුරු
දත්ත සහ තොරතුරුදත්ත සහ තොරතුරු
දත්ත සහ තොරතුරු
 
Grade 10 ICT Short Notes in Sinhala(2015)
Grade 10 ICT Short Notes in Sinhala(2015)Grade 10 ICT Short Notes in Sinhala(2015)
Grade 10 ICT Short Notes in Sinhala(2015)
 
Cellsppt presentation-100813001954-phpapp02
Cellsppt presentation-100813001954-phpapp02Cellsppt presentation-100813001954-phpapp02
Cellsppt presentation-100813001954-phpapp02
 
Animal and Plant Cells
Animal and Plant CellsAnimal and Plant Cells
Animal and Plant Cells
 
Power point presentation of animal cell and plant cell
Power point presentation of animal cell and plant cellPower point presentation of animal cell and plant cell
Power point presentation of animal cell and plant cell
 
Cell : Structure and Function Part 01
Cell : Structure and Function Part 01Cell : Structure and Function Part 01
Cell : Structure and Function Part 01
 
Cell Structure And Function
Cell Structure And FunctionCell Structure And Function
Cell Structure And Function
 
Cells Powerpoint Presentation
Cells Powerpoint PresentationCells Powerpoint Presentation
Cells Powerpoint Presentation
 

Similar to Sinmin final presentation

Online handwritten script recognition
Online handwritten script recognitionOnline handwritten script recognition
Online handwritten script recognitionDhiraj Singh
 
Apache ManifoldCF @ Linux Day 2012
Apache ManifoldCF @ Linux Day 2012Apache ManifoldCF @ Linux Day 2012
Apache ManifoldCF @ Linux Day 2012
Piergiorgio Lucidi
 
Azure Cosmos DB: Features, Practical Use and Optimization "
Azure Cosmos DB: Features, Practical Use and Optimization "Azure Cosmos DB: Features, Practical Use and Optimization "
Azure Cosmos DB: Features, Practical Use and Optimization "
GlobalLogic Ukraine
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
DataWorks Summit/Hadoop Summit
 
computer-science_engineering_principles-of-programming-languages_introduction...
computer-science_engineering_principles-of-programming-languages_introduction...computer-science_engineering_principles-of-programming-languages_introduction...
computer-science_engineering_principles-of-programming-languages_introduction...
AshutoshSharma874829
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
 
BlackRay - The open Source Data Engine
BlackRay - The open Source Data EngineBlackRay - The open Source Data Engine
BlackRay - The open Source Data Engine
fschupp
 
Mca sem1syll
Mca sem1syllMca sem1syll
Mca sem1syll
praveen_5552392
 
Protocol buffers
Protocol buffersProtocol buffers
Protocol buffers
Manuel Correa
 
Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009
fschupp
 
Improving the Performance of Database-Centric Applications Through Program An...
Improving the Performance of Database-Centric Applications Through Program An...Improving the Performance of Database-Centric Applications Through Program An...
Improving the Performance of Database-Centric Applications Through Program An...
Concordia University
 
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
ScyllaDB
 
Lecture 1 introduction to language processors
Lecture 1  introduction to language processorsLecture 1  introduction to language processors
Lecture 1 introduction to language processors
Rebaz Najeeb
 
Elasticsearch Basics
Elasticsearch BasicsElasticsearch Basics
Elasticsearch Basics
Shifa Khan
 
8. Software Development Security
8. Software Development Security8. Software Development Security
8. Software Development Security
Sam Bowne
 
CISSP Prep: Ch 9. Software Development Security
CISSP Prep: Ch 9. Software Development SecurityCISSP Prep: Ch 9. Software Development Security
CISSP Prep: Ch 9. Software Development Security
Sam Bowne
 
OpenCms Days 2014 - Using the SOLR collector
OpenCms Days 2014 - Using the SOLR collectorOpenCms Days 2014 - Using the SOLR collector
OpenCms Days 2014 - Using the SOLR collector
Alkacon Software GmbH & Co. KG
 
8. Software Development Security
8. Software Development Security8. Software Development Security
8. Software Development Security
Sam Bowne
 
PostgreSQL - Object Relational Database
PostgreSQL - Object Relational DatabasePostgreSQL - Object Relational Database
PostgreSQL - Object Relational Database
Mubashar Iqbal
 
Accelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn CloudAccelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn Cloud
Sujit Pal
 

Similar to Sinmin final presentation (20)

Online handwritten script recognition
Online handwritten script recognitionOnline handwritten script recognition
Online handwritten script recognition
 
Apache ManifoldCF @ Linux Day 2012
Apache ManifoldCF @ Linux Day 2012Apache ManifoldCF @ Linux Day 2012
Apache ManifoldCF @ Linux Day 2012
 
Azure Cosmos DB: Features, Practical Use and Optimization "
Azure Cosmos DB: Features, Practical Use and Optimization "Azure Cosmos DB: Features, Practical Use and Optimization "
Azure Cosmos DB: Features, Practical Use and Optimization "
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
computer-science_engineering_principles-of-programming-languages_introduction...
computer-science_engineering_principles-of-programming-languages_introduction...computer-science_engineering_principles-of-programming-languages_introduction...
computer-science_engineering_principles-of-programming-languages_introduction...
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
BlackRay - The open Source Data Engine
BlackRay - The open Source Data EngineBlackRay - The open Source Data Engine
BlackRay - The open Source Data Engine
 
Mca sem1syll
Mca sem1syllMca sem1syll
Mca sem1syll
 
Protocol buffers
Protocol buffersProtocol buffers
Protocol buffers
 
Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009
 
Improving the Performance of Database-Centric Applications Through Program An...
Improving the Performance of Database-Centric Applications Through Program An...Improving the Performance of Database-Centric Applications Through Program An...
Improving the Performance of Database-Centric Applications Through Program An...
 
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
 
Lecture 1 introduction to language processors
Lecture 1  introduction to language processorsLecture 1  introduction to language processors
Lecture 1 introduction to language processors
 
Elasticsearch Basics
Elasticsearch BasicsElasticsearch Basics
Elasticsearch Basics
 
8. Software Development Security
8. Software Development Security8. Software Development Security
8. Software Development Security
 
CISSP Prep: Ch 9. Software Development Security
CISSP Prep: Ch 9. Software Development SecurityCISSP Prep: Ch 9. Software Development Security
CISSP Prep: Ch 9. Software Development Security
 
OpenCms Days 2014 - Using the SOLR collector
OpenCms Days 2014 - Using the SOLR collectorOpenCms Days 2014 - Using the SOLR collector
OpenCms Days 2014 - Using the SOLR collector
 
8. Software Development Security
8. Software Development Security8. Software Development Security
8. Software Development Security
 
PostgreSQL - Object Relational Database
PostgreSQL - Object Relational DatabasePostgreSQL - Object Relational Database
PostgreSQL - Object Relational Database
 
Accelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn CloudAccelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn Cloud
 

More from Chamila Wijayarathna

Why Johnny Can't Store Passwords Securely? A Usability Evaluation of Bouncyca...
Why Johnny Can't Store Passwords Securely? A Usability Evaluation of Bouncyca...Why Johnny Can't Store Passwords Securely? A Usability Evaluation of Bouncyca...
Why Johnny Can't Store Passwords Securely? A Usability Evaluation of Bouncyca...
Chamila Wijayarathna
 
Using Cognitive Dimensions Questionnaire to Evaluate the Usability of Securit...
Using Cognitive Dimensions Questionnaire to Evaluate the Usability of Securit...Using Cognitive Dimensions Questionnaire to Evaluate the Usability of Securit...
Using Cognitive Dimensions Questionnaire to Evaluate the Usability of Securit...
Chamila Wijayarathna
 
GS0C - "How to Start" Guide
GS0C - "How to Start" GuideGS0C - "How to Start" Guide
GS0C - "How to Start" Guide
Chamila Wijayarathna
 
Xbotix 2014 Rules undergraduate category
Xbotix 2014 Rules   undergraduate categoryXbotix 2014 Rules   undergraduate category
Xbotix 2014 Rules undergraduate category
Chamila Wijayarathna
 
Kaggle KDD Cup Report
Kaggle KDD Cup ReportKaggle KDD Cup Report
Kaggle KDD Cup Report
Chamila Wijayarathna
 
Higgs Boson Machine Learning Challenge Report
Higgs Boson Machine Learning Challenge ReportHiggs Boson Machine Learning Challenge Report
Higgs Boson Machine Learning Challenge Report
Chamila Wijayarathna
 
Programs With Common Sense
Programs With Common SensePrograms With Common Sense
Programs With Common Sense
Chamila Wijayarathna
 
Knock detecting door lock research paper
Knock detecting door lock research paperKnock detecting door lock research paper
Knock detecting door lock research paper
Chamila Wijayarathna
 
Helen Keller, The Story of My Life
Helen Keller, The Story of My LifeHelen Keller, The Story of My Life
Helen Keller, The Story of My LifeChamila Wijayarathna
 
Shirsha Yaathra - Head Movement controlled Wheelchair - Research Paper
Shirsha Yaathra - Head Movement controlled Wheelchair - Research PaperShirsha Yaathra - Head Movement controlled Wheelchair - Research Paper
Shirsha Yaathra - Head Movement controlled Wheelchair - Research Paper
Chamila Wijayarathna
 
Products, Process Development Firms in Sri Lanka and their focus on Sustaina...
Products, Process  Development Firms in Sri Lanka and their focus on Sustaina...Products, Process  Development Firms in Sri Lanka and their focus on Sustaina...
Products, Process Development Firms in Sri Lanka and their focus on Sustaina...Chamila Wijayarathna
 

More from Chamila Wijayarathna (17)

Why Johnny Can't Store Passwords Securely? A Usability Evaluation of Bouncyca...
Why Johnny Can't Store Passwords Securely? A Usability Evaluation of Bouncyca...Why Johnny Can't Store Passwords Securely? A Usability Evaluation of Bouncyca...
Why Johnny Can't Store Passwords Securely? A Usability Evaluation of Bouncyca...
 
Using Cognitive Dimensions Questionnaire to Evaluate the Usability of Securit...
Using Cognitive Dimensions Questionnaire to Evaluate the Usability of Securit...Using Cognitive Dimensions Questionnaire to Evaluate the Usability of Securit...
Using Cognitive Dimensions Questionnaire to Evaluate the Usability of Securit...
 
GS0C - "How to Start" Guide
GS0C - "How to Start" GuideGS0C - "How to Start" Guide
GS0C - "How to Start" Guide
 
Xbotix 2014 Rules undergraduate category
Xbotix 2014 Rules   undergraduate categoryXbotix 2014 Rules   undergraduate category
Xbotix 2014 Rules undergraduate category
 
Kaggle KDD Cup Report
Kaggle KDD Cup ReportKaggle KDD Cup Report
Kaggle KDD Cup Report
 
Higgs Boson Machine Learning Challenge Report
Higgs Boson Machine Learning Challenge ReportHiggs Boson Machine Learning Challenge Report
Higgs Boson Machine Learning Challenge Report
 
Programs With Common Sense
Programs With Common SensePrograms With Common Sense
Programs With Common Sense
 
Knock detecting door lock research paper
Knock detecting door lock research paperKnock detecting door lock research paper
Knock detecting door lock research paper
 
IEEE Xtreme Final results 2012
IEEE Xtreme Final results 2012IEEE Xtreme Final results 2012
IEEE Xtreme Final results 2012
 
Helen Keller, The Story of My Life
Helen Keller, The Story of My LifeHelen Keller, The Story of My Life
Helen Keller, The Story of My Life
 
Shirsha Yaathra - Head Movement controlled Wheelchair - Research Paper
Shirsha Yaathra - Head Movement controlled Wheelchair - Research PaperShirsha Yaathra - Head Movement controlled Wheelchair - Research Paper
Shirsha Yaathra - Head Movement controlled Wheelchair - Research Paper
 
Ieee xtreme 5.0 results
Ieee xtreme 5.0 resultsIeee xtreme 5.0 results
Ieee xtreme 5.0 results
 
Memory technologies
Memory technologiesMemory technologies
Memory technologies
 
History of Computer
History of ComputerHistory of Computer
History of Computer
 
Products, Process Development Firms in Sri Lanka and their focus on Sustaina...
Products, Process  Development Firms in Sri Lanka and their focus on Sustaina...Products, Process  Development Firms in Sri Lanka and their focus on Sustaina...
Products, Process Development Firms in Sri Lanka and their focus on Sustaina...
 
Path Following Robot
Path Following RobotPath Following Robot
Path Following Robot
 
Path following robot
Path following robotPath following robot
Path following robot
 

Recently uploaded

J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
WENKENLI1
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
Jayaprasanna4
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
manasideore6
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
Kerry Sado
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
ViniHema
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
Kamal Acharya
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
bakpo1
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
FluxPrime1
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
AhmedHussein950959
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
Pipe Restoration Solutions
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 

Recently uploaded (20)

J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
Sinmin final presentation

  • 1. Sinmin - Corpus for Sinhala Language. Upeksha W. D., Wijayarathna D. G. C. D., Siriwardena M. P., Lasandun K. H. L. Supervisors: Dr. Chinthana Wimalasuriya, Prof. Gihan Dias, Mr. N. H. N. D. de Silva
  • 2. Outline ● Introduction ● Crawler Implementation and Design ● Data Cleaning and Tokenizing Mechanisms ● Selecting Data Storage Mechanism ● Data Storage Model of SinMin ● User Interface Design and Implementation ● API Design and Implementation ● Unit Testing ● Performance Testing of the API ● Implemented Sample Usages
  • 3. What is a Corpus? “A corpus is a principled collection of authentic texts stored electronically that can be used to discover information about language that may not have been noticed through intuition alone.” - Bennett (2010)
  • 4. Usages of a Corpus ● Implementing translators, spell checkers and grammar checkers. ● Identifying lexical and grammatical features of a language. ● Identifying varieties of language by context of usage and time. ● Retrieving statistical details of a language. ● Providing backend support for tools like OCR, POS taggers, etc.
  • 5. Sinmin is a corpus for the Sinhala language that is ➢ Continuously updated ➢ Dynamic (scalable) ➢ Broad in coverage (structured and unstructured language)
  • 7. Identified Sinhala Resources. Categories: News, Academic, Creative Writing, Spoken, Gazette. Sources: News Paper, Text books, Fiction, Subtitle, Gazette, News Items, Religious, Blogs, Wikipedia, Magazine, Mahawansa
  • 10. Crawlers are responsible for finding web pages that contain Sinhala content, then fetching, parsing and storing them in a manageable format.
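
One way a crawler could decide whether a fetched page "contains Sinhala content" is to measure how much of its text falls in the Sinhala Unicode block (U+0D80 to U+0DFF). The sketch below is only an illustration of that idea: the function names and the 0.5 threshold are assumptions, not details taken from the Sinmin crawlers.

```python
def sinhala_ratio(text):
    """Return the fraction of non-whitespace characters in the Sinhala block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    # The Sinhala Unicode block spans U+0D80..U+0DFF.
    sinhala = sum(1 for ch in chars if "\u0d80" <= ch <= "\u0dff")
    return sinhala / len(chars)


def looks_sinhala(text, threshold=0.5):
    """Hypothetical filter: the threshold is an illustrative assumption."""
    return sinhala_ratio(text) >= threshold
```

A page whose ratio falls below the chosen threshold would simply be skipped before parsing and storage.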
  • 12. Sample XML File With One Article Stored In It
  • 13. Crawler Controller: The crawler controller monitors and handles the status of the web crawlers.
  • 15. Data Cleaning and Tokenizing Mechanisms Used
  • 16. Identified Issues ● Erroneous characters in the texts ● Short forms ● Consecutive Sinhala vowel sign problem
  • 17. Erroneous Characters Of The Texts ● Invalid Unicode characters, e.g. characters in the Private Use Area and the replacement character ● Symbols, e.g. “,”, “.”, “{”, “(”, “?”
  • 18. Erroneous Characters Of The Texts ● Unwanted non-Sinhala characters, e.g. U+200C (zero-width non-joiner), Á, À, ®, ¡, ª, º ● Non-symbol characters that were terminating words
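
The character-level cleaning described on these two slides can be sketched as a simple filter. This is a minimal illustration, assuming the unwanted characters are just the ones named on the slides; `UNWANTED` and `clean_text` are hypothetical names, and the actual Sinmin cleaning rules may differ.

```python
import unicodedata

# Characters named on the slides as unwanted: U+200C (zero-width non-joiner),
# U+FFFD (replacement character), and the Latin-1 symbols listed there.
UNWANTED = set("\u200c\ufffd\u00c1\u00c0\u00ae\u00a1\u00aa\u00ba")


def clean_text(text):
    """Drop unwanted characters and private-use code points from text."""
    kept = []
    for ch in text:
        if ch in UNWANTED:
            continue
        # "Co" is the Unicode general category for private-use code points.
        if unicodedata.category(ch) == "Co":
            continue
        kept.append(ch)
    return "".join(kept)
```

Punctuation such as commas and full stops is deliberately left in place here, since the short-form handling on the following slides still needs to see the full stops.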
  • 19. Short Forms ● Short forms contain full stops. ● Those full stops separate neither sentences nor words. E.g.: පෙ. ව. (a.m.), රු. (Rupees)
  • 20. Identified Common Short Forms "ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්.", "පෙ.", "ව.", "ෙ.", "රු.", "0.", "1.", "2.", "3.", "4.", "5.", "6.", "7.", "8.", "9."
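
A sentence splitter that respects these short forms might look like the sketch below. It assumes whitespace tokenization and exact matching of a token against the short-form list; both are simplifications chosen for illustration, and the function names are hypothetical rather than taken from the Sinmin code.

```python
# Subset of the short forms listed on the slide, plus single digits
# followed by a full stop ("0." to "9.").
SHORT_FORMS = {"ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්.", "පෙ.", "ව.", "රු."}
SHORT_FORMS |= {d + "." for d in "0123456789"}


def is_short_form(token):
    """True when the token is a known short form, not a sentence end."""
    return token in SHORT_FORMS


def split_sentences(text):
    """Split on tokens ending in a full stop, skipping known short forms."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and not is_short_form(token):
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a closing full stop
        sentences.append(" ".join(current))
    return sentences
```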
  • 21. Consecutive Sinhala Vowel Sign Problem
  • 22. Consecutive Sinhala Vowel Sign Problem ● Solution: map the variants to one format ● Convention: only one vowel sign per Sinhala letter
  • 23. Consecutive Sinhala Vowel Sign Problem
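
The "one vowel sign per letter" convention can be illustrated with a small replacement table that folds sequences of separately typed signs into the single precomposed sign. The pairs below are an assumption based on how the Sinhala Unicode block composes these signs; the slides do not list the exact mappings Sinmin uses.

```python
# (sequence typed as separate signs, single precomposed sign).
# Order matters: U+0DDC must be composed before it can combine with U+0DCA.
VOWEL_SIGN_MAP = [
    ("\u0dd9\u0dd9", "\u0ddb"),  # kombuva + kombuva -> kombu deka
    ("\u0dd9\u0dcf", "\u0ddc"),  # kombuva + aela-pilla -> o sign
    ("\u0dd9\u0dca", "\u0dda"),  # kombuva + al-lakuna -> long e sign
    ("\u0dd9\u0ddf", "\u0dde"),  # kombuva + gayanukitta -> au sign
    ("\u0ddc\u0dca", "\u0ddd"),  # o sign + al-lakuna -> long o sign
]


def normalize_vowel_signs(text):
    """Replace consecutive vowel signs with their single-sign equivalent."""
    for sequence, single in VOWEL_SIGN_MAP:
        text = text.replace(sequence, single)
    return text
```

After this step, two visually identical spellings of a word map to the same code point sequence, so they are counted as one word in the corpus.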
  • 25. The performance of data insertion and retrieval depends mainly on the data storage mechanism used for the corpus.
  • 26. We tested the performance of several database systems to determine which one to use for storing the data.
  • 27. We Considered the Following Data Storage Systems
  • 28. We measured performance for inserting data and for retrieving results for 12 different information needs. Data set and source code: https://github.com/madurangasiriwardena/performance-test
  • 29. Data Insertion Time Comparison
  • 32. Cassandra performed better than the others in most scenarios, and its insertion time increased linearly, so we chose it for implementing the corpus.
  • 33. Data Storage Model of Sinmin
  • 34. Cassandra ● We used Cassandra as the main storage system of Sinmin. ● Apache Cassandra version 2.1.2 was used. ● cqlsh version 5.0.1 was used.
  • 35. Cassandra ● Most API queries are served from the Cassandra database. ● The Cassandra database consists of more than 50 column families, each of which serves a specific information need.
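
The idea of one column family per information need can be illustrated without a database: each query gets its own pre-aggregated table, and every insert updates all of them. The sketch below uses in-memory dictionaries as stand-ins for column families; the three tables and their keys are hypothetical examples, not the actual Sinmin schema.

```python
from collections import defaultdict

# Each dictionary stands in for one column family keyed for a single query.
word_frequency = defaultdict(int)          # word -> count
word_frequency_by_year = defaultdict(int)  # (word, year) -> count
bigram_frequency = defaultdict(int)        # (word1, word2) -> count


def insert_sentence(tokens, year):
    """Update every query-specific table for one tokenized sentence."""
    for word in tokens:
        word_frequency[word] += 1
        word_frequency_by_year[(word, year)] += 1
    for pair in zip(tokens, tokens[1:]):
        bigram_frequency[pair] += 1
```

Writes become more expensive because the same data is stored several times, but each read is a single key lookup, which is the usual trade-off in query-driven data models of this kind.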
  • 36. Oracle ● Oracle is used as a backup storage server.
  • 38. Wildcard Search Feature: The wildcard search feature enables users to run wildcard queries on the corpus. E.g.: පෙ? ෙහ*
  • 39. Wildcard Search Feature ● Implemented using Apache Solr ● More than 1.2 million distinct words ● Supports at most 10 asterisks and at most 10 question marks
  • 40. Sinhala Vowel Sign Problem At Wildcard Search: In Sinhala Unicode, Sinhala vowel signs are separate Unicode characters.
  • 41. Sinhala Vowel Sign Problem At Wildcard Search. Solution: represent a Sinhala letter and its vowel sign as one entity.
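
To see why treating a letter and its vowel sign as one entity matters, consider a `?` wildcard that should match exactly one written letter. The sketch below groups each base character with the signs that follow it and then matches the pattern unit by unit. It is a plain-Python illustration of the idea only: Sinmin implements wildcard search with Apache Solr, and the function names and the exact set of code points treated as signs here are assumptions.

```python
def is_sign(ch):
    """True for Sinhala dependent signs that attach to the preceding letter."""
    return "\u0dca" <= ch <= "\u0ddf" or ch in "\u0d82\u0d83\u0df2\u0df3"


def split_units(word):
    """Group each base character with the signs that follow it."""
    units = []
    for ch in word:
        if units and is_sign(ch):
            units[-1] += ch
        else:
            units.append(ch)
    return units


def wildcard_match(pattern, word):
    """Match with '?' = exactly one unit and '*' = any number of units."""
    p, w = split_units(pattern), split_units(word)
    # reach[j] is True when the pattern consumed so far can match w[:j].
    reach = [True] + [False] * len(w)
    for unit in p:
        if unit == "*":
            # '*' may absorb further units: propagate reachability forward.
            for j in range(1, len(w) + 1):
                reach[j] = reach[j] or reach[j - 1]
        else:
            step = [False] * (len(w) + 1)
            for j in range(1, len(w) + 1):
                step[j] = reach[j - 1] and (unit == "?" or unit == w[j - 1])
            reach = step
    return reach[len(w)]
```

With this grouping, a two-unit pattern such as a letter followed by `?` matches a word whose second letter carries a vowel sign, even though that word is three code points long.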
  • 42. User Interface Design and Implementation
  • 43. ● The web interface of Sinmin is designed for users who prefer a visualized and summarized view of Sinmin's statistical data. ● The visual design lets a user with no prior experience of the interface meet their information requirements with little effort.
  • 44. The Sinmin user interface allows users to: ● Find the probability of an n-gram ● Find the most probable word that comes after an n-gram ● Compare the usage of n-grams ● Find statistics of words, bigrams and trigrams ● Run wildcard searches ● Find the latest articles for an n-gram
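
The first two features in this list can be illustrated with simple counting. The sketch below assumes that "probability" means relative frequency in the corpus and that the most probable next word is the most frequent follower; the slides do not state how Sinmin computes these values, and all function names are hypothetical.

```python
from collections import Counter


def build_counts(tokens):
    """Count unigrams and adjacent bigrams in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def word_probability(word, unigrams):
    """Relative frequency of a word among all counted tokens."""
    total = sum(unigrams.values())
    return unigrams[word] / total if total else 0.0


def most_probable_next(word, bigrams):
    """Most frequent follower of a word, or None if it never has one."""
    followers = {w2: c for (w1, w2), c in bigrams.items() if w1 == word}
    if not followers:
        return None
    return max(followers, key=followers.get)
```

The same counting extends to longer n-grams by keying on tuples of two or three preceding words instead of one.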
  • 47. API Design and Implementation
  • 48. REST API ● REST API to expose corpus services ● More complex and customizable data retrieval and filtering ● Interface for third-party applications to consume
  • 49. REST API ● Depends on backend databases (Cassandra, Oracle, Solr) ● Cassandra acts as the main storage system ● Oracle is used as a backup database ● Solr is used for wildcard search functions
  • 51. API Functions ● wordFrequency ● bigramFrequency ● trigramFrequency ● frequentWords ● frequentBigrams ● frequentTrigrams ● latestArticlesForWord ● latestArticlesForBigram ● latestArticlesForTrigram ● frequentWordsAroundWord ● frequentWordsInPosition ● frequentWordsInPositionReverse ● frequentWordsAfterWordTimeRange ● frequentWordsAfterBigramTimeRange ● wordCount ● bigramCount ● trigramCount
  • 53. Throughput Under Different Load Conditions
  • 54. Time Taken To Process Requests Under Different Load Conditions
  • 55. Full Stop Predictor For OCR ● One challenge in OCR development is identifying full stops. ● This tool is a consumer application of Sinmin that predicts the full stops in Sinhala texts.
  • 56. Publications ● Implementing a Corpus for Sinhala Language - Symposium on Language Technology for South Asia (Presented) ● Comparison between performance of various database systems for implementing a language corpus – 11th International Beyond Databases, Architectures and Structures conference (Accepted)
  • 57. Future Work ● Annotate words with POS tags and lemmas. ● Implement tools and applications that make use of the corpus.