Sinmin is a corpus for the Sinhala language that is continuously updating and scalable. It covers a wide range of structured and unstructured Sinhala language data from sources like news, academic writings, fiction, and more. The architecture includes crawlers to fetch web pages, data cleaning mechanisms to handle issues like erroneous characters and short forms, and Cassandra as the main storage system. The API and user interface allow users to perform queries on the corpus to obtain word and ngram frequencies, latest articles, and more. Performance testing was conducted on the storage systems and API.
Today, the corpus-based approach is regarded as the state-of-the-art methodology for studying languages, both prominent and lesser-known. It mines new knowledge about a language by answering two main questions:
● What particular patterns are associated with lexical or grammatical features of the language?
● How do these patterns differ within varieties and registers?
A language corpus is a collection of authentic texts stored electronically. It contains language patterns from different genres, time periods and social variants. Most major languages have their own corpora, but the corpora implemented so far for the Sinhala language have many limitations.
SinMin is a corpus for the Sinhala language which:
● Is continuously updated
● Is dynamic (scalable)
● Covers a wide range of language (structured and unstructured)
● Provides a better interface for users to interact with the corpus
This report contains the comprehensive literature review, research, design and implementation details of the SinMin corpus. The implementation details are organized according to the components of the platform. Testing and future work are discussed towards the end of the report.
Sinmin final presentation
1. Sinmin - Corpus for Sinhala Language
Upeksha W. D.
Wijayarathna D. G. C. D.
Siriwardena M. P.
Lasandun K. H. L.
Supervisors :
Dr. Chinthana Wimalasuriya
Prof. Gihan Dias
Mr. N. H. N. D. de Silva
2. Outline
● Introduction
● Crawler Implementation and Design
● Data Cleaning and Tokenizing Mechanisms
● Selecting Data Storage Mechanism
● Data Storage Model of SinMin
● User Interface Design and Implementation
● API Design and Implementation
● Unit Testing
● Performance Testing of the API
● Implemented Sample Usages
3. What is a Corpus?
“A corpus is a principled collection of authentic texts stored electronically that can be used to discover information about language that may not have been noticed through intuition alone.” - Bennett (2010)
4. Uses of a Corpus
● Implementing translators, spell checkers and grammar checkers.
● Identifying lexical and grammatical features of a language.
● Identifying how a language varies with context of usage and time.
● Retrieving statistical details of a language.
● Providing backend support for tools like OCR, POS taggers, etc.
5. Sinmin is a corpus for the Sinhala language which
➢ Is continuously updated
➢ Is dynamic (scalable)
➢ Covers a wide range of language (structured and unstructured)
16. Identified Issues
● Erroneous characters in the texts
● Short forms
● Consecutive Sinhala vowel signs
17. Erroneous Characters in the Texts
● Invalid Unicode characters
E.g.: characters in a private use area, the replacement character
● Symbols
E.g.: “,”, “.”, “{”, “(”, “?”
18. Erroneous Characters in the Texts
● Unwanted non-Sinhala characters
E.g.: U+200C (zero-width non-joiner), Á, À, ®, ¡, ª, º
● Non-symbolic characters that were terminating words
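A minimal sketch of how such characters could be filtered out, assuming they are simply dropped (the slides do not say whether the project removes or replaces them). The function name `strip_erroneous` and the `UNWANTED` set are hypothetical; the set holds only the examples named on the slides. Symbols such as full stops are left alone here, since they matter for the short-form handling.

```python
import unicodedata

# Unwanted characters named on the slides: U+200C, the replacement
# character U+FFFD, and a few stray Latin-1 characters.
UNWANTED = {"\u200c", "\ufffd", "Á", "À", "®", "¡", "ª", "º"}


def strip_erroneous(text: str) -> str:
    """Drop private-use code points (category Co) and the unwanted characters."""
    return "".join(
        ch
        for ch in text
        if ch not in UNWANTED and unicodedata.category(ch) != "Co"
    )
```

For example, `strip_erroneous("\u0d9a\ue000\u200c\u0dc5")` keeps only the two Sinhala letters.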
19. Short Forms
● Short forms contain full stops.
● These full stops separate neither sentences nor words.
E.g.: පෙ. ව. (a.m.), රු. (Rupees)
20. Identified Common Short Forms
"ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්."
"පෙ.", "ව.", "ෙ.", "රු."
"0.", "1.", "2.", "3.", "4.", "5.", "6.", "7.", "8.", "9."
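The short-form list can be used to stop a sentence splitter from breaking at these full stops. The sketch below is one possible approach, not the project's actual tokenizer: it assumes whitespace-separated tokens, treats a token as a short form when it exactly matches a list entry, and treats any token ending in a digit plus a full stop as a number rather than a sentence end. `SHORT_FORMS` and `split_sentences` are hypothetical names.

```python
# Letter-based short forms taken from the slide; digits are handled separately.
SHORT_FORMS = {"ඒ.", "බී.", "සී.", "ඩී.", "ඊ.", "එෆ්.", "පෙ.", "ව.", "රු."}


def split_sentences(text: str) -> list[str]:
    """Split text at full stops that do not belong to a known short form."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # "1.", "25." and similar: digit followed by a full stop.
        digit_stop = len(token) >= 2 and token.endswith(".") and token[-2].isdigit()
        is_short_form = token in SHORT_FORMS or digit_stop
        if token.endswith(".") and not is_short_form:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a closing full stop
        sentences.append(" ".join(current))
    return sentences
```

A known limitation of this sketch is that a sentence that genuinely ends in a number (for example "... 100.") is not split.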
28. We considered performance for inserting data and for retrieving 12 different information needs.
Data set and source code: https://github.com/madurangasiriwardena/performance-test
32. Cassandra performed better than the others in most of the scenarios, and its insertion time increased linearly, so we chose it to implement the corpus.
34. Cassandra
● We used Cassandra as the main storage system of Sinmin.
● Apache Cassandra version 2.1.2 was used.
● cqlsh version 5.0.1 was used.
35. Cassandra
● Most API queries are served from the Cassandra database.
● The Cassandra database consists of more than 50 column families, each of which serves a specific information need.
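The idea of one column family per information need can be illustrated with plain Python dictionaries, where each dictionary stands in for a column family keyed by exactly the lookup it serves. This is only an illustration of the query-first layout; the class `CorpusStore`, the three table names and the `year` and `category` fields are assumptions, not the actual SinMin schema.

```python
from collections import defaultdict


class CorpusStore:
    """In-memory stand-in for a query-per-table (denormalized) layout."""

    def __init__(self) -> None:
        # Each dict plays the role of one column family.
        self.word_frequency = defaultdict(int)           # word -> count
        self.word_year_frequency = defaultdict(int)      # (word, year) -> count
        self.word_category_frequency = defaultdict(int)  # (word, category) -> count

    def insert(self, word: str, year: int, category: str) -> None:
        # One write updates every table, so each read is a single key lookup.
        self.word_frequency[word] += 1
        self.word_year_frequency[(word, year)] += 1
        self.word_category_frequency[(word, category)] += 1
```

The trade-off is more work per insert in exchange for reads that never need joins or scans, which fits a corpus that is written once and queried often.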
39. Wildcard Search Feature
● Implemented using Apache Solr
● More than 1.2 million distinct words
● Supports at most 10 asterisks and at most 10 question marks
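The semantics of such a wildcard query can be sketched in plain Python. This is a stand-in for illustration only: SinMin delegates the search to Apache Solr, whereas the code below translates the pattern to a regular expression. It assumes `*` matches any run of characters and `?` matches exactly one, and enforces the limits from the slide; the function names are hypothetical.

```python
import re

MAX_ASTERISKS = 10
MAX_QUESTION_MARKS = 10


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard pattern into a compiled regular expression."""
    if pattern.count("*") > MAX_ASTERISKS or pattern.count("?") > MAX_QUESTION_MARKS:
        raise ValueError("too many wildcards in pattern")
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # any run of characters, possibly empty
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts))


def wildcard_search(pattern: str, words: list[str]) -> list[str]:
    """Return the words that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [w for w in words if regex.fullmatch(w)]
```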
40. Sinhala Vowel Sign Problem in Wildcard Search
In Sinhala Unicode, Sinhala vowel signs are separate Unicode characters.
41. Sinhala Vowel Sign Problem in Wildcard Search
Solution: represent a Sinhala letter and its vowel sign as one entity.
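Because a vowel sign is its own code point, a single-character wildcard would presumably match a bare letter but not the same letter carrying a vowel sign. One way to treat a letter and its sign as one entity is sketched below, assuming that every Unicode combining mark (general category starting with "M") attaches to the preceding letter. The function name `to_entities` is hypothetical, and conjuncts built with the zero-width joiner are not handled.

```python
import unicodedata


def to_entities(word: str) -> list[str]:
    """Group each base letter with the combining marks that follow it."""
    entities: list[str] = []
    for ch in word:
        if entities and unicodedata.category(ch).startswith("M"):
            entities[-1] += ch   # vowel sign or al-lakuna: attach to previous letter
        else:
            entities.append(ch)  # base letter starts a new entity
    return entities
```

With this grouping the four code points U+0D9A U+0DDC U+0DC5 U+0DB9 become three entities, so a single-entity wildcard can be matched against what a reader sees as one letter.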
43. ● The web interface of Sinmin is designed for users who prefer a visualised and summarized view of Sinmin's statistical data.
● The visual design lets a user with no prior experience of the interface fulfill their information requirements with little effort.
44. The Sinmin user interface allows users to:
● Find the probability of an n-gram
● Find the most probable word that comes after an n-gram
● Compare the usage of n-grams
● Find statistics of words, bigrams and trigrams
● Perform wildcard searches
● Find the latest articles for an n-gram
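The first two queries in this list can be sketched over a small in-memory token list. The sketch assumes that the probability of an n-gram is its relative frequency among all n-grams of the same length, which the slides do not define; SinMin itself answers these queries from its precomputed storage rather than by counting on the fly. All function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every n-gram of length n in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_probability(tokens: list[str], ngram: list[str]) -> float:
    """Relative frequency of the n-gram among all n-grams of the same length."""
    counts = ngram_counts(tokens, len(ngram))
    total = sum(counts.values())
    return counts[tuple(ngram)] / total if total else 0.0


def most_probable_next(tokens: list[str], context: list[str]):
    """Most frequent word following the context n-gram, or None if unseen."""
    n = len(context)
    followers = Counter(
        tokens[i + n]
        for i in range(len(tokens) - n)
        if tokens[i:i + n] == list(context)
    )
    return followers.most_common(1)[0][0] if followers else None
```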
48. REST API
● REST API to expose corpus services
● More complex and customizable data retrieval and filtering
● Interface for third-party applications to consume
49. REST API
● Depends on backend databases (Cassandra, Oracle, Solr)
● Cassandra acts as the main storage system
● Oracle is used as a backup database
● Solr is used for wildcard search functions
54. Time Taken To Process Requests Under Different Load Conditions
55. Full Stop Predictor For OCR
● One challenge in OCR development is identifying full stops.
● This tool is a consumer application of Sinmin that predicts the full stop marks of Sinhala texts.
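The slides do not describe how the predictor works. As a rough illustration of how corpus statistics could drive such a tool, the sketch below assumes a simple heuristic: a full stop is predicted after a word if, in the corpus, that word ends a sentence more often than a chosen threshold. The functions `train` and `predict_full_stops` and the `threshold` parameter are hypothetical.

```python
from collections import Counter


def train(sentences: list[str]) -> tuple[Counter, Counter]:
    """Count how often each word occurs and how often it ends a sentence."""
    total, final = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        total.update(words)
        if words:
            final[words[-1]] += 1
    return total, final


def predict_full_stops(words, total, final, threshold=0.5):
    """Insert "." after words that usually end a sentence in the corpus."""
    out = []
    for w in words:
        out.append(w)
        if total[w] and final[w] / total[w] > threshold:
            out.append(".")
    return out
```

A real predictor would likely use more context than a single word, for example the n-gram statistics that the corpus already provides.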
56. Publications
● Implementing a Corpus for Sinhala Language - Symposium on Language Technology for South Asia (Presented)
● Comparison between performance of various database systems for implementing a language corpus - 11th International Beyond Databases, Architectures and Structures conference (Accepted)
57. Future Work
● Annotate words with POS tags and lemmas.
● Implement tools and applications that make use of the corpus.