Sinmin - Corpus for Sinhala
Language
1
Upeksha W. D.
Wijayarathna D. G. C. D.
Siriwardena M. P.
Lasandun K. H. L.
Supervisors :
Dr. Chinthana Wimalasuriya
Prof. Gihan Dias
Mr. N. H. N. D. de Silva
Outline
● Introduction
● Crawler Implementation and Design
● Data Cleaning and Tokenizing Mechanisms
● Selecting Data Storage Mechanism
● Data Storage Model of SinMin
● User Interface Design and Implementation
● API Design and Implementation
● Unit Testing
● Performance Testing of the API
● Implemented Sample Usages
2
What is a Corpus?
“A corpus is a principled collection of
authentic texts stored electronically that
can be used to discover information about
language that may not have been noticed
through intuition alone.” - Bennett (2010)
3
Usages of a Corpus
● Implementing translators, spell checkers and grammar
checkers.
● Identifying lexical and grammatical features of a language.
● Identifying how language varies with context of usage and
over time.
● Retrieving statistical details of a language.
● Providing backend support for tools like OCR, POS Tagger,
etc.
4
Sinmin is a corpus for the Sinhala
language which
➢ Is continuously updated
➢ Is dynamic (scalable)
➢ Covers a wide range of language (structured and unstructured)
5
Architecture of Sinmin
6
Identified Sinhala Resources
7
News        Academic    Creative Writing  Spoken    Gazette
News Paper  Text books  Fiction           Subtitle  Gazette
News Items  Religious   Blogs
            Wikipedia   Magazine
            Mahawansa
Identified Sinhala Resources
8
Crawler Implementation and
Design
9
Crawlers are responsible for finding
web pages that contain Sinhala
content, then fetching, parsing and storing
them in a manageable format.
10
Crawler Architecture
11
Sample XML File With One Article Stored In It
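The sample file itself is an image and is not reproduced in this text. As a rough sketch only, storing one crawled article as XML could look like the following. The element names (articles, article, url, title, date, content) and the function name are assumptions made for illustration, not the schema Sinmin actually uses.

```python
import xml.etree.ElementTree as ET


def article_to_xml(url, title, date, content):
    """Wrap one crawled article in a small XML document.

    The element names are hypothetical; the real Sinmin schema is not
    given in the slides.
    """
    root = ET.Element("articles")
    article = ET.SubElement(root, "article")
    ET.SubElement(article, "url").text = url
    ET.SubElement(article, "title").text = title
    ET.SubElement(article, "date").text = date
    ET.SubElement(article, "content").text = content
    # encoding="unicode" returns a str, so Sinhala text stays readable
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(article_to_xml("http://example.com/news/1", "Sample title",
                         "2015-01-01", "Sample body text"))
```

ElementTree escapes characters such as "&" and "<" in the article text, so the stored file stays well-formed whatever the crawler fetched.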
12
Crawler Controller
Crawler controller monitors and handles
the status of the web crawlers.
13
14
Data Cleaning and Tokenizing
Mechanisms used
15
Identified Issues
● Erroneous characters in the texts
● Short forms
● Consecutive Sinhala vowel signs
16
Erroneous Characters Of The
Texts
● Invalid Unicode characters
E.g.: characters in the Unicode Private Use Area, the replacement
character (U+FFFD)
● Symbols
Eg: “,”, “.”, “{“, “(“, “?”
17
Erroneous Characters Of The
Texts
● Unwanted non-Sinhala characters
E.g.: U+200C (zero-width non-joiner), Á, À, ®, ¡, ª, º
● Non-symbolic characters which were terminating
words
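A minimal sketch of this kind of character filtering, assuming the cleaner simply drops the offending characters. The function and constant names are hypothetical, the list of unwanted characters is only the set of examples shown on the slides, and the handling of ordinary symbols such as "," and "." is left out because it depends on the tokenizer.

```python
import unicodedata

REPLACEMENT_CHARACTER = "\ufffd"
ZERO_WIDTH_NON_JOINER = "\u200c"

# Only the examples named on the slides; the real list is probably longer.
UNWANTED_CHARACTERS = set("ÁÀ®¡ªº") | {ZERO_WIDTH_NON_JOINER}


def is_private_use(ch):
    """True for code points in a Unicode Private Use Area (category Co)."""
    return unicodedata.category(ch) == "Co"


def clean_text(text):
    """Drop invalid and unwanted characters, keep everything else."""
    kept = []
    for ch in text:
        if ch == REPLACEMENT_CHARACTER:
            continue
        if ch in UNWANTED_CHARACTERS or is_private_use(ch):
            continue
        kept.append(ch)
    return "".join(kept)


if __name__ == "__main__":
    dirty = "ක\ue000ම\ufffdල\u200c®"
    print(clean_text(dirty))
```

The zero-width joiner (U+200D) is deliberately not in the list, since Sinhala uses it to form conjunct letters.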
18
Short Forms
● Short forms contain full stops.
● But those full stops separate neither
sentences nor words.
E.g.: පෙ. ව. (a.m.), රු. (Rupees)
19
Identified Common Short
Forms
"ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්."
"පෙ.", "ව.", "ෙ.", "රු."
"0.", "1.", "2.", "3.", "4.", "5.", "6.", "7.", "8.", "9."
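One way such a list can be used, sketched under the assumption that the tokenizer works on whitespace-separated tokens: a full stop ends a sentence only when the token carrying it is not a known short form. The function name and the splitting strategy are illustrative, not Sinmin's actual tokenizer.

```python
# Short forms taken from the slide above, plus single digits followed by a full stop.
SHORT_FORMS = {"ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්.", "පෙ.", "ව.", "රු."}
SHORT_FORMS |= {str(digit) + "." for digit in range(10)}


def split_sentences(text):
    """Split text into sentences at full stops that are not part of a short form."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token not in SHORT_FORMS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Trailing text with no closing full stop is kept as a final sentence.
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("A රු. 50 B. C D."):
        print(sentence)
```

In the example, the full stop after රු does not end the first sentence, while the one after B does.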
20
Consecutive Sinhala Vowel Sign
Problem
21
Consecutive Sinhala Vowel Sign
Problem
● Solution: Mapping them into one format
● Convention: Only one vowel sign to a Sinhala
letter
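The mapping could be sketched as a small replacement table. The slides do not list the table Sinmin uses, so the pairs below are an assumption: they are the standard Unicode equivalences for two-part Sinhala vowel signs, plus the common case of two kombuva typed in place of kombu deka.

```python
# (sequence of vowel signs, single equivalent sign); longer sequences first.
VOWEL_SIGN_COMPOSITIONS = [
    ("\u0dd9\u0dcf\u0dca", "\u0ddd"),  # kombuva + aela-pilla + al-lakuna
    ("\u0ddc\u0dca", "\u0ddd"),        # kombuva haa aela-pilla + al-lakuna
    ("\u0dd9\u0dcf", "\u0ddc"),        # kombuva + aela-pilla
    ("\u0dd9\u0dca", "\u0dda"),        # kombuva + al-lakuna
    ("\u0dd9\u0ddf", "\u0dde"),        # kombuva + gayanukitta
    ("\u0dd9\u0dd9", "\u0ddb"),        # kombuva + kombuva (kombu deka)
]


def normalize_vowel_signs(text):
    """Rewrite consecutive vowel signs as the single equivalent sign."""
    for sequence, single in VOWEL_SIGN_COMPOSITIONS:
        text = text.replace(sequence, single)
    return text


if __name__ == "__main__":
    typed = "\u0d9a\u0dd9\u0dcf"  # ka + kombuva + aela-pilla
    print(len(typed), len(normalize_vowel_signs(typed)))
```

After normalization, two spellings that render identically also compare equal, so they are counted as the same word.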
22
Consecutive Sinhala Vowel Sign
Problem
23
Selecting Data Storage
Mechanism for Sinmin
24
The performance of data insertion and
retrieval depends mainly on the data
storage mechanism used for the
corpus.
25
We tested the performance of several
database systems to determine which
one we should use to store the data.
26
We Considered Following Data
Storage Systems
27
We measured performance for
inserting data and for retrieving data for 12
different information needs.
Data set and source code
https://github.com/madurangasiriwardena/performance-test
28
Data Insertion Time Comparison
29
Information Retrieval Performance
Comparison - Part 1
30
Information Retrieval Performance
Comparison - Part 2
31
Cassandra performed better than others
in most of the scenarios, and its
insertion time increased linearly.
So we chose it for implementing
the corpus.
32
Data Storage Model of Sinmin
33
● We used Cassandra as the main storage system of
Sinmin
● Apache Cassandra version 2.1.2 was used
● cqlsh version 5.0.1 was used
34
Cassandra
● Most API queries are answered from the Cassandra
database.
● The Cassandra database consists of more than 50 column
families, each of which serves a specific
information need
35
Cassandra
● Oracle is used as a backup storage server.
36
Oracle
Oracle
Schema
37
Wildcard Search Feature
Wildcard search feature enables users to run wildcard
queries on the corpus
E.g.: පෙ?
හෙ*
38
Wildcard Search Feature
● Implemented using Apache Solr
● More than 1.2 million distinct words
● Supports at most 10 asterisks and at most 10
question marks
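The matching rules can be illustrated without Solr. The sketch below is an in-memory stand-in, not the Solr-based implementation: it assumes "*" matches any run of characters and "?" matches exactly one, and it enforces the limit of 10 of each. All names are hypothetical.

```python
import re

MAX_ASTERISKS = 10
MAX_QUESTION_MARKS = 10


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regular expression."""
    if pattern.count("*") > MAX_ASTERISKS:
        raise ValueError("too many asterisks in pattern")
    if pattern.count("?") > MAX_QUESTION_MARKS:
        raise ValueError("too many question marks in pattern")
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any run of characters, possibly empty
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def wildcard_search(pattern, words):
    """Return the words that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [word for word in words if regex.fullmatch(word)]


if __name__ == "__main__":
    print(wildcard_search("c*t", ["cat", "cart", "car"]))
```

A linear scan like this would be far too slow for more than 1.2 million distinct words, which is one reason to use an index such as Solr.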
39
Sinhala Vowel Sign Problem At Wildcard
Search
In Sinhala Unicode, Sinhala vowel signs are separate
Unicode characters
40
Sinhala Vowel Sign Problem At Wildcard
Search
Solution: Represent Sinhala letter and vowel sign as one
entity
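A sketch of the idea, assuming an entity is a base character followed by any combining marks (Unicode categories starting with "M"), which covers Sinhala vowel signs and al-lakuna. The function name is hypothetical, and conjuncts formed with the zero-width joiner are not handled.

```python
import unicodedata


def split_into_entities(word):
    """Group each base letter with the combining marks that follow it."""
    entities = []
    for ch in word:
        if entities and unicodedata.category(ch).startswith("M"):
            # Vowel signs and other marks attach to the previous letter.
            entities[-1] += ch
        else:
            entities.append(ch)
    return entities


if __name__ == "__main__":
    word = "\u0d9a\u0ddc\u0da7"  # two letters, the first carrying a vowel sign
    print(len(word), len(split_into_entities(word)))
```

With this grouping, a single "?" can stand for one letter together with its vowel sign, rather than for one code point.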
41
User Interface Design and
Implementation
42
● Web interface of Sinmin has been designed for users
who would prefer a visualised and summarized view
of statistical data of Sinmin.
● Visual design of the interface has been made in a
way that any user without prior experience of the
interface is able to fulfill his information
requirements with little effort.
43
The Sinmin user interface allows users to:
● Find the probability of an n-gram
● Find the most probable word that comes after an n-gram
● Compare the usage of n-grams
● Find statistics of words, bigrams and trigrams
● Wildcard search
● Find latest articles for an n-gram
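The first two items can be sketched from raw counts. The slides do not define how the probability is computed, so this example assumes it is the relative frequency of the n-gram among all n-grams of the same length; the function names are hypothetical and the data is a toy token list, not the corpus.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_probability(tokens, ngram):
    """Relative frequency of ngram among all n-grams of the same length."""
    counts = count_ngrams(tokens, len(ngram))
    total = sum(counts.values())
    return counts[tuple(ngram)] / total if total else 0.0


def most_probable_next(tokens, context):
    """Most frequent token that directly follows the given n-gram, or None."""
    n = len(context)
    followers = Counter()
    for i in range(len(tokens) - n):
        if tuple(tokens[i:i + n]) == tuple(context):
            followers[tokens[i + n]] += 1
    return followers.most_common(1)[0][0] if followers else None


if __name__ == "__main__":
    tokens = "a b a b a c".split()
    print(ngram_probability(tokens, ["a", "b"]))
    print(most_probable_next(tokens, ["a"]))
```

A corpus of this size would read such counts from precomputed tables rather than rescanning the text for each query.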
44
45
46
API Design and Implementation
47
REST API
● REST API to expose Corpus services
● Supports more complex and customizable data retrieval and
filtering
● Interface for third party applications to consume
48
REST API
● Depends on backend databases (Cassandra,
Oracle, Solr)
● Cassandra acts as main storage system
● Oracle is used as a backup database
● Solr is used for wildcard search functions
49
Architecture
50
API Functions
● wordFrequency
● bigramFrequency
● trigramFrequency
● frequentWords
● frequentBigrams
● frequentTrigrams
● latestArticlesForWord
● latestArticlesForBigram
● latestArticlesForTrigram
51
● frequentWordsAroundWord
● frequentWordsInPosition
● frequentWordsInPositionReverse
● frequentWordsAfterWordTimeRange
● frequentWordsAfterBigramTimeRange
● wordCount
● bigramCount
● trigramCount
Performance Testing of the API
52
Throughput Under Different Load
Conditions
53
Time Taken To Process Requests Under
Different Load Conditions
54
Full Stop Predictor For OCR
● One challenge in OCR development is identifying
full stops.
● This tool is a consumer application of Sinmin that
predicts the full stop marks of Sinhala texts.
55
Publications
● Implementing a Corpus for Sinhala Language -
Symposium on Language Technology for South Asia
(Presented)
● Comparison between performance of various
database systems for implementing a language
corpus – 11th International Beyond Databases,
Architectures and Structures conference (Accepted)
56
Future Work
● Annotate words with POS tags and lemmas.
● Implement tools and applications that make use of
the corpus
57
Q & A
58
Thank You!
59

More Related Content

What's hot

RDF Seminar Presentation
RDF Seminar PresentationRDF Seminar Presentation
RDF Seminar PresentationMuntazir Mehdi
 
Services semantic technology_terminology
Services semantic technology_terminologyServices semantic technology_terminology
Services semantic technology_terminologyTenforce
 
Semantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQLSemantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQLJerven Bolleman
 

What's hot (6)

Performance neo4j-versus (2)
Performance neo4j-versus (2)Performance neo4j-versus (2)
Performance neo4j-versus (2)
 
RDF Seminar Presentation
RDF Seminar PresentationRDF Seminar Presentation
RDF Seminar Presentation
 
NoSQL
NoSQLNoSQL
NoSQL
 
Services semantic technology_terminology
Services semantic technology_terminologyServices semantic technology_terminology
Services semantic technology_terminology
 
NoSQL
NoSQLNoSQL
NoSQL
 
Semantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQLSemantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQL
 

Viewers also liked

SinMin - Sinhala Corpus Project - Thesis
SinMin - Sinhala Corpus Project - ThesisSinMin - Sinhala Corpus Project - Thesis
SinMin - Sinhala Corpus Project - ThesisChamila Wijayarathna
 
G.C.E O/L ICT Short Notes Grade-11
G.C.E O/L ICT Short Notes Grade-11G.C.E O/L ICT Short Notes Grade-11
G.C.E O/L ICT Short Notes Grade-11Mahesh Kodituwakku
 
G.C.E. O/L ICT Lessons Database sinhala
 G.C.E. O/L ICT Lessons Database sinhala G.C.E. O/L ICT Lessons Database sinhala
G.C.E. O/L ICT Lessons Database sinhalaMahesh Kodituwakku
 
දත්ත සහ තොරතුරු
දත්ත සහ තොරතුරුදත්ත සහ තොරතුරු
දත්ත සහ තොරතුරුTennyson
 
Grade 10 ICT Short Notes in Sinhala(2015)
Grade 10 ICT Short Notes in Sinhala(2015)Grade 10 ICT Short Notes in Sinhala(2015)
Grade 10 ICT Short Notes in Sinhala(2015)Mahesh Kodituwakku
 
Cellsppt presentation-100813001954-phpapp02
Cellsppt presentation-100813001954-phpapp02Cellsppt presentation-100813001954-phpapp02
Cellsppt presentation-100813001954-phpapp02Muhammad Fahad Saleh
 
Animal and Plant Cells
Animal and Plant CellsAnimal and Plant Cells
Animal and Plant CellsECSD
 
Power point presentation of animal cell and plant cell
Power point presentation of animal cell and plant cellPower point presentation of animal cell and plant cell
Power point presentation of animal cell and plant celljhoysantos12
 
Cell : Structure and Function Part 01
Cell : Structure and Function Part 01Cell : Structure and Function Part 01
Cell : Structure and Function Part 01Abdul-arzeez Penai
 
Cells Powerpoint Presentation
Cells Powerpoint PresentationCells Powerpoint Presentation
Cells Powerpoint Presentationcprizel
 

Viewers also liked (11)

SinMin - Sinhala Corpus Project - Thesis
SinMin - Sinhala Corpus Project - ThesisSinMin - Sinhala Corpus Project - Thesis
SinMin - Sinhala Corpus Project - Thesis
 
G.C.E O/L ICT Short Notes Grade-11
G.C.E O/L ICT Short Notes Grade-11G.C.E O/L ICT Short Notes Grade-11
G.C.E O/L ICT Short Notes Grade-11
 
G.C.E. O/L ICT Lessons Database sinhala
 G.C.E. O/L ICT Lessons Database sinhala G.C.E. O/L ICT Lessons Database sinhala
G.C.E. O/L ICT Lessons Database sinhala
 
දත්ත සහ තොරතුරු
දත්ත සහ තොරතුරුදත්ත සහ තොරතුරු
දත්ත සහ තොරතුරු
 
Grade 10 ICT Short Notes in Sinhala(2015)
Grade 10 ICT Short Notes in Sinhala(2015)Grade 10 ICT Short Notes in Sinhala(2015)
Grade 10 ICT Short Notes in Sinhala(2015)
 
Cellsppt presentation-100813001954-phpapp02
Cellsppt presentation-100813001954-phpapp02Cellsppt presentation-100813001954-phpapp02
Cellsppt presentation-100813001954-phpapp02
 
Animal and Plant Cells
Animal and Plant CellsAnimal and Plant Cells
Animal and Plant Cells
 
Power point presentation of animal cell and plant cell
Power point presentation of animal cell and plant cellPower point presentation of animal cell and plant cell
Power point presentation of animal cell and plant cell
 
Cell : Structure and Function Part 01
Cell : Structure and Function Part 01Cell : Structure and Function Part 01
Cell : Structure and Function Part 01
 
Cell Structure And Function
Cell Structure And FunctionCell Structure And Function
Cell Structure And Function
 
Cells Powerpoint Presentation
Cells Powerpoint PresentationCells Powerpoint Presentation
Cells Powerpoint Presentation
 

Similar to Sinmin final presentation

Online handwritten script recognition
Online handwritten script recognitionOnline handwritten script recognition
Online handwritten script recognitionDhiraj Singh
 
Apache ManifoldCF @ Linux Day 2012
Apache ManifoldCF @ Linux Day 2012Apache ManifoldCF @ Linux Day 2012
Apache ManifoldCF @ Linux Day 2012Piergiorgio Lucidi
 
Azure Cosmos DB: Features, Practical Use and Optimization "
Azure Cosmos DB: Features, Practical Use and Optimization "Azure Cosmos DB: Features, Practical Use and Optimization "
Azure Cosmos DB: Features, Practical Use and Optimization "GlobalLogic Ukraine
 
computer-science_engineering_principles-of-programming-languages_introduction...
computer-science_engineering_principles-of-programming-languages_introduction...computer-science_engineering_principles-of-programming-languages_introduction...
computer-science_engineering_principles-of-programming-languages_introduction...AshutoshSharma874829
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
BlackRay - The open Source Data Engine
BlackRay - The open Source Data EngineBlackRay - The open Source Data Engine
BlackRay - The open Source Data Enginefschupp
 
Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009fschupp
 
Improving the Performance of Database-Centric Applications Through Program An...
Improving the Performance of Database-Centric Applications Through Program An...Improving the Performance of Database-Centric Applications Through Program An...
Improving the Performance of Database-Centric Applications Through Program An...Concordia University
 
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...ScyllaDB
 
Lecture 1 introduction to language processors
Lecture 1  introduction to language processorsLecture 1  introduction to language processors
Lecture 1 introduction to language processorsRebaz Najeeb
 
Elasticsearch Basics
Elasticsearch BasicsElasticsearch Basics
Elasticsearch BasicsShifa Khan
 
8. Software Development Security
8. Software Development Security8. Software Development Security
8. Software Development SecuritySam Bowne
 
CISSP Prep: Ch 9. Software Development Security
CISSP Prep: Ch 9. Software Development SecurityCISSP Prep: Ch 9. Software Development Security
CISSP Prep: Ch 9. Software Development SecuritySam Bowne
 
8. Software Development Security
8. Software Development Security8. Software Development Security
8. Software Development SecuritySam Bowne
 
PostgreSQL - Object Relational Database
PostgreSQL - Object Relational DatabasePostgreSQL - Object Relational Database
PostgreSQL - Object Relational DatabaseMubashar Iqbal
 
Accelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn CloudAccelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn CloudSujit Pal
 

Similar to Sinmin final presentation (20)

Online handwritten script recognition
Online handwritten script recognitionOnline handwritten script recognition
Online handwritten script recognition
 
Apache ManifoldCF @ Linux Day 2012
Apache ManifoldCF @ Linux Day 2012Apache ManifoldCF @ Linux Day 2012
Apache ManifoldCF @ Linux Day 2012
 
Azure Cosmos DB: Features, Practical Use and Optimization "
Azure Cosmos DB: Features, Practical Use and Optimization "Azure Cosmos DB: Features, Practical Use and Optimization "
Azure Cosmos DB: Features, Practical Use and Optimization "
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
computer-science_engineering_principles-of-programming-languages_introduction...
computer-science_engineering_principles-of-programming-languages_introduction...computer-science_engineering_principles-of-programming-languages_introduction...
computer-science_engineering_principles-of-programming-languages_introduction...
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
BlackRay - The open Source Data Engine
BlackRay - The open Source Data EngineBlackRay - The open Source Data Engine
BlackRay - The open Source Data Engine
 
Mca sem1syll
Mca sem1syllMca sem1syll
Mca sem1syll
 
Protocol buffers
Protocol buffersProtocol buffers
Protocol buffers
 
Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009
 
Improving the Performance of Database-Centric Applications Through Program An...
Improving the Performance of Database-Centric Applications Through Program An...Improving the Performance of Database-Centric Applications Through Program An...
Improving the Performance of Database-Centric Applications Through Program An...
 
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
Scylla Summit 2022: Learning Rust the Hard Way for a Production Kafka+ScyllaD...
 
Lecture 1 introduction to language processors
Lecture 1  introduction to language processorsLecture 1  introduction to language processors
Lecture 1 introduction to language processors
 
Elasticsearch Basics
Elasticsearch BasicsElasticsearch Basics
Elasticsearch Basics
 
8. Software Development Security
8. Software Development Security8. Software Development Security
8. Software Development Security
 
CISSP Prep: Ch 9. Software Development Security
CISSP Prep: Ch 9. Software Development SecurityCISSP Prep: Ch 9. Software Development Security
CISSP Prep: Ch 9. Software Development Security
 
OpenCms Days 2014 - Using the SOLR collector
OpenCms Days 2014 - Using the SOLR collectorOpenCms Days 2014 - Using the SOLR collector
OpenCms Days 2014 - Using the SOLR collector
 
8. Software Development Security
8. Software Development Security8. Software Development Security
8. Software Development Security
 
PostgreSQL - Object Relational Database
PostgreSQL - Object Relational DatabasePostgreSQL - Object Relational Database
PostgreSQL - Object Relational Database
 
Accelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn CloudAccelerating NLP with Dask and Saturn Cloud
Accelerating NLP with Dask and Saturn Cloud
 

More from Chamila Wijayarathna

Why Johnny Can't Store Passwords Securely? A Usability Evaluation of Bouncyca...
Why Johnny Can't Store Passwords Securely? A Usability Evaluation of Bouncyca...Why Johnny Can't Store Passwords Securely? A Usability Evaluation of Bouncyca...
Why Johnny Can't Store Passwords Securely? A Usability Evaluation of Bouncyca...Chamila Wijayarathna
 
Using Cognitive Dimensions Questionnaire to Evaluate the Usability of Securit...
Using Cognitive Dimensions Questionnaire to Evaluate the Usability of Securit...Using Cognitive Dimensions Questionnaire to Evaluate the Usability of Securit...
Using Cognitive Dimensions Questionnaire to Evaluate the Usability of Securit...Chamila Wijayarathna
 
Xbotix 2014 Rules undergraduate category
Xbotix 2014 Rules   undergraduate categoryXbotix 2014 Rules   undergraduate category
Xbotix 2014 Rules undergraduate categoryChamila Wijayarathna
 
Higgs Boson Machine Learning Challenge Report
Higgs Boson Machine Learning Challenge ReportHiggs Boson Machine Learning Challenge Report
Higgs Boson Machine Learning Challenge ReportChamila Wijayarathna
 
Knock detecting door lock research paper
Knock detecting door lock research paperKnock detecting door lock research paper
Knock detecting door lock research paperChamila Wijayarathna
 
Helen Keller, The Story of My Life
Helen Keller, The Story of My LifeHelen Keller, The Story of My Life
Helen Keller, The Story of My LifeChamila Wijayarathna
 
Shirsha Yaathra - Head Movement controlled Wheelchair - Research Paper
Shirsha Yaathra - Head Movement controlled Wheelchair - Research PaperShirsha Yaathra - Head Movement controlled Wheelchair - Research Paper
Shirsha Yaathra - Head Movement controlled Wheelchair - Research PaperChamila Wijayarathna
 
Products, Process Development Firms in Sri Lanka and their focus on Sustaina...
Products, Process  Development Firms in Sri Lanka and their focus on Sustaina...Products, Process  Development Firms in Sri Lanka and their focus on Sustaina...
Products, Process Development Firms in Sri Lanka and their focus on Sustaina...Chamila Wijayarathna
 

More from Chamila Wijayarathna (17)

Why Johnny Can't Store Passwords Securely? A Usability Evaluation of Bouncyca...
Why Johnny Can't Store Passwords Securely? A Usability Evaluation of Bouncyca...Why Johnny Can't Store Passwords Securely? A Usability Evaluation of Bouncyca...
Why Johnny Can't Store Passwords Securely? A Usability Evaluation of Bouncyca...
 
Using Cognitive Dimensions Questionnaire to Evaluate the Usability of Securit...
Using Cognitive Dimensions Questionnaire to Evaluate the Usability of Securit...Using Cognitive Dimensions Questionnaire to Evaluate the Usability of Securit...
Using Cognitive Dimensions Questionnaire to Evaluate the Usability of Securit...
 
GS0C - "How to Start" Guide
GS0C - "How to Start" GuideGS0C - "How to Start" Guide
GS0C - "How to Start" Guide
 
Xbotix 2014 Rules undergraduate category
Xbotix 2014 Rules   undergraduate categoryXbotix 2014 Rules   undergraduate category
Xbotix 2014 Rules undergraduate category
 
Kaggle KDD Cup Report
Kaggle KDD Cup ReportKaggle KDD Cup Report
Kaggle KDD Cup Report
 
Higgs Boson Machine Learning Challenge Report
Higgs Boson Machine Learning Challenge ReportHiggs Boson Machine Learning Challenge Report
Higgs Boson Machine Learning Challenge Report
 
Programs With Common Sense
Programs With Common SensePrograms With Common Sense
Programs With Common Sense
 
Knock detecting door lock research paper
Knock detecting door lock research paperKnock detecting door lock research paper
Knock detecting door lock research paper
 
IEEE Xtreme Final results 2012
IEEE Xtreme Final results 2012IEEE Xtreme Final results 2012
IEEE Xtreme Final results 2012
 
Helen Keller, The Story of My Life
Helen Keller, The Story of My LifeHelen Keller, The Story of My Life
Helen Keller, The Story of My Life
 
Shirsha Yaathra - Head Movement controlled Wheelchair - Research Paper
Shirsha Yaathra - Head Movement controlled Wheelchair - Research PaperShirsha Yaathra - Head Movement controlled Wheelchair - Research Paper
Shirsha Yaathra - Head Movement controlled Wheelchair - Research Paper
 
Ieee xtreme 5.0 results
Ieee xtreme 5.0 resultsIeee xtreme 5.0 results
Ieee xtreme 5.0 results
 
Memory technologies
Memory technologiesMemory technologies
Memory technologies
 
History of Computer
History of ComputerHistory of Computer
History of Computer
 
Products, Process Development Firms in Sri Lanka and their focus on Sustaina...
Products, Process  Development Firms in Sri Lanka and their focus on Sustaina...Products, Process  Development Firms in Sri Lanka and their focus on Sustaina...
Products, Process Development Firms in Sri Lanka and their focus on Sustaina...
 
Path Following Robot
Path Following RobotPath Following Robot
Path Following Robot
 
Path following robot
Path following robotPath following robot
Path following robot
 

Recently uploaded

Theory of Machine Notes / Lecture Material .pdf
Theory of Machine Notes / Lecture Material .pdfTheory of Machine Notes / Lecture Material .pdf
Theory of Machine Notes / Lecture Material .pdfShreyas Pandit
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSneha Padhiar
 
Uk-NO1 kala jadu karne wale ka contact number kala jadu karne wale baba kala ...
Uk-NO1 kala jadu karne wale ka contact number kala jadu karne wale baba kala ...Uk-NO1 kala jadu karne wale ka contact number kala jadu karne wale baba kala ...
Uk-NO1 kala jadu karne wale ka contact number kala jadu karne wale baba kala ...Amil baba
 
AntColonyOptimizationManetNetworkAODV.pptx
AntColonyOptimizationManetNetworkAODV.pptxAntColonyOptimizationManetNetworkAODV.pptx
AntColonyOptimizationManetNetworkAODV.pptxLina Kadam
 
ADM100 Running Book for sap basis domain study
ADM100 Running Book for sap basis domain studyADM100 Running Book for sap basis domain study
ADM100 Running Book for sap basis domain studydhruvamdhruvil123
 
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHTEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHSneha Padhiar
 
Secure Key Crypto - Tech Paper JET Tech Labs
Secure Key Crypto - Tech Paper JET Tech LabsSecure Key Crypto - Tech Paper JET Tech Labs
Secure Key Crypto - Tech Paper JET Tech Labsamber724300
 
A brief look at visionOS - How to develop app on Apple's Vision Pro
A brief look at visionOS - How to develop app on Apple's Vision ProA brief look at visionOS - How to develop app on Apple's Vision Pro
A brief look at visionOS - How to develop app on Apple's Vision ProRay Yuan Liu
 
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork
 
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...arifengg7
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Sumanth A
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSsandhya757531
 
Cost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionCost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionSneha Padhiar
 
priority interrupt computer organization
priority interrupt computer organizationpriority interrupt computer organization
priority interrupt computer organizationchnrketan
 
Introduction to Artificial Intelligence: Intelligent Agents, State Space Sear...
Introduction to Artificial Intelligence: Intelligent Agents, State Space Sear...Introduction to Artificial Intelligence: Intelligent Agents, State Space Sear...
Introduction to Artificial Intelligence: Intelligent Agents, State Space Sear...shreenathji26
 
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfModule-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfManish Kumar
 
Livre Implementing_Six_Sigma_and_Lean_A_prac([Ron_Basu]_).pdf
Livre Implementing_Six_Sigma_and_Lean_A_prac([Ron_Basu]_).pdfLivre Implementing_Six_Sigma_and_Lean_A_prac([Ron_Basu]_).pdf
Livre Implementing_Six_Sigma_and_Lean_A_prac([Ron_Basu]_).pdfsaad175691
 
Comprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfComprehensive energy systems.pdf Comprehensive energy systems.pdf
Comprehensive energy systems.pdf Comprehensive energy systems.pdfalene1
 
70 POWER PLANT IAE V2500 technical training
70 POWER PLANT IAE V2500 technical training70 POWER PLANT IAE V2500 technical training
70 POWER PLANT IAE V2500 technical trainingGladiatorsKasper
 

Recently uploaded (20)

Theory of Machine Notes / Lecture Material .pdf
Theory of Machine Notes / Lecture Material .pdfTheory of Machine Notes / Lecture Material .pdf
Theory of Machine Notes / Lecture Material .pdf
 
Versatile Engineering Construction Firms
Versatile Engineering Construction FirmsVersatile Engineering Construction Firms
Versatile Engineering Construction Firms
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
 
Uk-NO1 kala jadu karne wale ka contact number kala jadu karne wale baba kala ...
Uk-NO1 kala jadu karne wale ka contact number kala jadu karne wale baba kala ...Uk-NO1 kala jadu karne wale ka contact number kala jadu karne wale baba kala ...
Uk-NO1 kala jadu karne wale ka contact number kala jadu karne wale baba kala ...
 
AntColonyOptimizationManetNetworkAODV.pptx
AntColonyOptimizationManetNetworkAODV.pptxAntColonyOptimizationManetNetworkAODV.pptx
AntColonyOptimizationManetNetworkAODV.pptx
 
ADM100 Running Book for sap basis domain study
ADM100 Running Book for sap basis domain studyADM100 Running Book for sap basis domain study
ADM100 Running Book for sap basis domain study
 
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHTEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
 
Secure Key Crypto - Tech Paper JET Tech Labs
Secure Key Crypto - Tech Paper JET Tech LabsSecure Key Crypto - Tech Paper JET Tech Labs
Secure Key Crypto - Tech Paper JET Tech Labs
 
A brief look at visionOS - How to develop app on Apple's Vision Pro
A brief look at visionOS - How to develop app on Apple's Vision ProA brief look at visionOS - How to develop app on Apple's Vision Pro
A brief look at visionOS - How to develop app on Apple's Vision Pro
 
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
 
Analysis and Evaluation of Dal Lake Biomass for Conversion to Fuel/Green fert...

Sinmin final presentation

  • 1. Sinmin - Corpus for Sinhala Language. Upeksha W. D., Wijayarathna D. G. C. D., Siriwardena M. P., Lasandun K. H. L. Supervisors: Dr. Chinthana Wimalasuriya, Prof. Gihan Dias, Mr. N. H. N. D. de Silva
  • 2. Outline ● Introduction ● Crawler Implementation and Design ● Data Cleaning and Tokenizing Mechanisms ● Selecting Data Storage Mechanism ● Data Storage Model of Sinmin ● User Interface Design and Implementation ● API Design and Implementation ● Unit Testing ● Performance Testing of the API ● Implemented Sample Usages
  • 3. What is a Corpus? “A corpus is a principled collection of authentic texts stored electronically that can be used to discover information about language that may not have been noticed through intuition alone.” - Bennett (2010)
  • 4. Usages of a Corpus ● Implementing translators, spell checkers and grammar checkers. ● Identifying lexical and grammatical features of a language. ● Identifying how language varies with context of usage and time. ● Retrieving statistical details of a language. ● Providing backend support for tools like OCR, POS taggers, etc.
  • 5. Sinmin is a corpus for the Sinhala language which ➢ Is continuously updated ➢ Is dynamic (scalable) ➢ Covers a wide range of language (structured and unstructured)
  • 7. Identified Sinhala Resources ● News: News Paper, News Items ● Academic: Text books, Religious, Wikipedia ● Creative Writing: Fiction, Blogs, Magazine ● Spoken: Subtitle ● Gazette: Gazette ● Mahawansa
  • 10. Crawlers are responsible for finding web pages that contain Sinhala content, then fetching, parsing and storing them in a manageable format.
  • 12. Sample XML File With One Article Stored In It
  • 13. Crawler Controller: The crawler controller monitors and handles the status of the web crawlers.
  • 15. Data Cleaning and Tokenizing Mechanisms Used
  • 16. Identified Issues ● Erroneous characters in the texts ● Short forms ● Consecutive Sinhala vowel signs
  • 17. Erroneous Characters In The Texts ● Invalid Unicode characters, e.g. characters in the Private Use Area and the replacement character ● Symbols, e.g. “,”, “.”, “{”, “(”, “?”
  • 18. Erroneous Characters In The Texts ● Unwanted non-Sinhala characters, e.g. U+200C (zero width non-joiner), Á, À, ®, ¡, ª, º ● Non-symbol characters that terminate words
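The character filtering described on slides 17 and 18 could be sketched roughly as follows. This is a minimal illustration, not the Sinmin implementation: the function names are hypothetical, and the `SYMBOLS` and `UNWANTED` sets contain only the examples named on the slides.

```python
import unicodedata

# Assumption: only the examples listed on the slides; the real lists are not given.
SYMBOLS = set(",.{(?")
UNWANTED = {"\u200c", "Á", "À", "®", "¡", "ª", "º"}


def is_invalid(ch):
    """True for Private Use Area code points (category "Co") and U+FFFD."""
    return unicodedata.category(ch) == "Co" or ch == "\ufffd"


def clean_text(text, strip_symbols=False):
    """Drop invalid and unwanted characters; optionally turn symbols into spaces."""
    out = []
    for ch in text:
        if is_invalid(ch) or ch in UNWANTED:
            continue  # remove the character entirely
        if strip_symbols and ch in SYMBOLS:
            out.append(" ")  # keep a word boundary where the symbol was
            continue
        out.append(ch)
    return "".join(out)
```

Symbol stripping is optional in this sketch because full stops are still needed when splitting sentences, so it would normally run after tokenizing.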
  • 19. Short Forms ● Short forms contain full stops. ● However, those full stops separate neither sentences nor words. E.g. පෙ. ව. (a.m.), රු. (Rupees)
  • 20. Identified Common Short Forms "ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්.", "පෙ.", "ව.", "ෙ.", "රු.", "0.", "1.", "2.", "3.", "4.", "5.", "6.", "7.", "8.", "9."
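One way to use such a list is to split sentences only at full stops that do not belong to a known short form. The sketch below assumes whitespace tokenization and uses a subset of the short forms from slide 20; `split_sentences` is a hypothetical name, not a Sinmin function.

```python
# Assumption: a subset of the short forms listed on slide 20.
SHORT_FORMS = {"ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්.", "පෙ.", "ව.", "රු."}
SHORT_FORMS |= {f"{digit}." for digit in range(10)}


def split_sentences(text):
    """Split on tokens ending in a full stop, unless the token is a short form."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token not in SHORT_FORMS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a closing full stop
        sentences.append(" ".join(current))
    return sentences
```

For example, `split_sentences("මිල රු. 50 යි. එය හොඳයි.")` keeps `රු.` inside the first sentence instead of ending the sentence there.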
  • 21. Consecutive Sinhala Vowel Sign Problem
  • 22. Consecutive Sinhala Vowel Sign Problem ● Solution: map all variants to one canonical form ● Convention: only one vowel sign per Sinhala letter
  • 23. Consecutive Sinhala Vowel Sign Problem
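The "one vowel sign per letter" convention can be illustrated with a small replacement table. The pairs below are the Unicode canonical decompositions of the Sinhala two-part vowel signs; whether Sinmin uses exactly this table is an assumption, and `normalize_vowel_signs` is a hypothetical name.

```python
# Sequences of vowel signs and the single sign each one maps to.
# The longest sequence comes first so that it is not partially rewritten.
COMPOSITIONS = [
    ("\u0dd9\u0dcf\u0dca", "\u0ddd"),  # ෙ + ා + ් -> ෝ
    ("\u0ddc\u0dca", "\u0ddd"),        # ො + ් -> ෝ
    ("\u0dd9\u0dca", "\u0dda"),        # ෙ + ් -> ේ
    ("\u0dd9\u0dcf", "\u0ddc"),        # ෙ + ා -> ො
    ("\u0dd9\u0ddf", "\u0dde"),        # ෙ + ෟ -> ෞ
]


def normalize_vowel_signs(text):
    """Replace consecutive vowel signs with the equivalent single sign."""
    for sequence, single in COMPOSITIONS:
        text = text.replace(sequence, single)
    return text
```

After this step, two spellings of the same word that differ only in how a vowel sign was typed compare as equal, so their frequencies are counted together.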
  • 25. The performance of data insertion and retrieval depends mainly on the data storage mechanism used for the corpus.
  • 26. We tested the performance of several database systems to determine which one to use for storing the data.
  • 27. We Considered the Following Data Storage Systems
  • 28. We measured performance for inserting data and for retrieving data for 12 different information needs. Data set and source code: https://github.com/madurangasiriwardena/performance-test
  • 29. Data Insertion Time Comparison
  • 32. Cassandra performed better than the others in most scenarios, and its insertion time increased linearly, so we chose it for implementing the corpus.
  • 33. Data Storage Model of Sinmin
  • 34. Cassandra ● We used Cassandra as the main storage system of Sinmin ● Apache Cassandra version 2.1.2 ● cqlsh version 5.0.1
  • 35. Cassandra ● Most API queries are answered from the Cassandra database. ● The Cassandra database consists of more than 50 column families, each of which serves a specific information need.
  • 36. Oracle ● Oracle is used as a backup storage server.
  • 38. Wildcard Search Feature: The wildcard search feature enables users to run wildcard queries on the corpus, e.g. පෙ? ෙහ*
  • 39. Wildcard Search Feature ● Implemented using Apache Solr ● More than 1.2 million distinct words ● Supports at most 10 asterisks and at most 10 question marks
  • 40. Sinhala Vowel Sign Problem In Wildcard Search: In Unicode, Sinhala vowel signs are separate characters
  • 41. Sinhala Vowel Sign Problem In Wildcard Search: Solution: represent a Sinhala letter and its vowel sign as one entity
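To see why treating a letter and its vowel sign as one entity matters, consider a plain Python matcher in which `?` stands for one such entity rather than one code point. This is only a sketch of the idea and does not reflect how the Solr index is configured; `to_units` and `wildcard_match` are hypothetical names.

```python
import unicodedata


def to_units(word):
    """Group each base character with the combining marks that follow it."""
    units = []
    for ch in word:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch  # vowel signs and the virama attach to the previous letter
        else:
            units.append(ch)
    return units


def wildcard_match(pattern, word):
    """Match where '?' is exactly one unit and '*' is any number of units."""
    p, w = to_units(pattern), to_units(word)
    # reach[j] is True when the pattern read so far can match the first j units of w
    reach = [True] + [False] * len(w)
    for token in p:
        nxt = [False] * (len(w) + 1)
        if token == "*":
            seen = False
            for j in range(len(w) + 1):
                seen = seen or reach[j]
                nxt[j] = seen
        else:
            for j in range(len(w)):
                if reach[j] and (token == "?" or token == w[j]):
                    nxt[j + 1] = True
        reach = nxt
    return reach[len(w)]
```

With this grouping, the query පෙ? matches පෙරේ, because රේ counts as a single entity even though it is two code points.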
  • 42. User Interface Design and Implementation
  • 43. ● The web interface of Sinmin is designed for users who prefer a visualised, summarised view of Sinmin's statistical data. ● The visual design lets a user with no prior experience of the interface fulfil their information requirements with little effort.
  • 44. The Sinmin user interface allows users to: ● Find the probability of an n-gram ● Find the most probable word that comes after an n-gram ● Compare the usage of n-grams ● Find statistics of words, bigrams and trigrams ● Run wildcard searches ● Find the latest articles for an n-gram
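The first two statistics can be illustrated on a small in-memory token list. The sketch assumes that the "probability" of an n-gram means its relative frequency among all n-grams of the same length; the slides do not define it, and the function names here are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All consecutive n-token windows, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_probability(tokens, gram):
    """Relative frequency of `gram` among all n-grams of the same length."""
    grams = ngrams(tokens, len(gram))
    if not grams:
        return 0.0
    return Counter(grams)[tuple(gram)] / len(grams)


def most_probable_next(tokens, context):
    """The word that most often follows `context`, or None if it never occurs."""
    context = tuple(context)
    followers = Counter(
        gram[-1]
        for gram in ngrams(tokens, len(context) + 1)
        if gram[:-1] == context
    )
    return followers.most_common(1)[0][0] if followers else None
```

A corpus of this size would answer such queries from precomputed counts rather than by scanning tokens, but the arithmetic is the same.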
  • 47. API Design and Implementation
  • 48. REST API ● Exposes the corpus services ● Supports more complex and customizable data retrieval and filtering ● Provides an interface for third-party applications to consume
  • 49. REST API ● Depends on backend databases (Cassandra, Oracle, Solr) ● Cassandra acts as the main storage system ● Oracle is used as a backup database ● Solr is used for wildcard search functions
  • 51. API Functions ● wordFrequency ● bigramFrequency ● trigramFrequency ● frequentWords ● frequentBigrams ● frequentTrigrams ● latestArticlesForWord ● latestArticlesForBigram ● latestArticlesForTrigram ● frequentWordsAroundWord ● frequentWordsInPosition ● frequentWordsInPositionReverse ● frequentWordsAfterWordTimeRange ● frequentWordsAfterBigramTimeRange ● wordCount ● bigramCount ● trigramCount
  • 53. Throughput Under Different Load Conditions
  • 54. Time Taken To Process Requests Under Different Load Conditions
  • 55. Full Stop Predictor For OCR ● One challenge in OCR development is identifying full stops. ● This tool is a consumer application of Sinmin that predicts where full stops occur in Sinhala texts.
  • 56. Publications ● Implementing a Corpus for Sinhala Language - Symposium on Language Technology for South Asia (Presented) ● Comparison between performance of various database systems for implementing a language corpus - 11th International Beyond Databases, Architectures and Structures conference (Accepted)
  • 57. Future Work ● Annotate words with POS tags and lemmas. ● Implement tools and applications that make use of the corpus.