This document describes a student project to implement database systems on the SpiNNaker neuromorphic hardware architecture. The student developed a key-value store and relational database to run on SpiNNaker, evaluating its performance and limitations for general purpose computing. The conclusions from this work provide feedback that could help improve SpiNNaker's design for applications beyond neural network simulations. Challenges included dealing with unreliable communication, out-of-order execution, and API bugs in the neuromorphic system. Evaluation benchmarks analyzed reliability, throughput, and memory usage under the database workloads.
This document summarizes a project that implements function call parallelism within the LLVM compiler framework. The project analyzes serial programs at compile time and automatically adds parallelism by running certain function calls in separate threads while speculatively continuing the main thread. This speculation is made safe using software transactional memory to roll back threads if memory conflicts occur between threads. The implementation finds suitable functions and call sites, parallelizes the calls using pthreads and STM, and includes a merging procedure to enforce correct commit ordering. Evaluation shows the implementation provides performance gains of up to 3.5x on some benchmarks.
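The speculation-with-rollback mechanism summarized above can be illustrated with a toy sketch (in Python rather than the project's actual LLVM/pthreads/STM implementation; all names here are hypothetical): a function call runs in a worker thread while the caller speculates ahead, and at commit time the callee's write set is compared against the caller's read set, aborting on conflict.

```python
# Toy sketch of speculative function-call parallelism: the callee runs in a
# worker thread while the caller continues speculatively; conflicts are
# detected by intersecting the callee's write set with the caller's read set,
# mimicking (in very simplified form) STM-style conflict detection.
import threading

class SpeculativeCall:
    def __init__(self, fn, *args):
        self.writes = {}            # locations written by the callee
        self.result = None
        self._thread = threading.Thread(target=self._run, args=(fn,) + args)
        self._thread.start()        # callee runs concurrently with the caller

    def _run(self, fn, *args):
        self.result = fn(self.writes, *args)

    def commit(self, memory, caller_reads):
        """Join the callee; abort (return False) if the callee wrote a
        location the speculative caller already read, else commit in order."""
        self._thread.join()
        if self.writes.keys() & caller_reads:
            return False            # conflict: the caller must re-execute
        memory.update(self.writes)  # enforce correct commit ordering
        return True

def callee(writes, x):
    writes["a"] = x * 2
    return x * 2

memory = {"a": 0, "b": 1}
call = SpeculativeCall(callee, 21)
caller_reads = {"b"}                # the caller only read "b" while speculating
ok = call.commit(memory, caller_reads)
```

Here the commit succeeds because the caller's read set is disjoint from the callee's write set; had the caller read `"a"`, the call would be rolled back and re-executed serially.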
The document is Nathaniel Knapp's master's thesis titled "Parasite: Local Scalability Profiling for Parallelization" submitted to Technische Universität München. The thesis presents Parasite, a tool that measures the parallelism of function call sites in programs parallelized using Pthreads. Parasite calculates the parallelism ratio, which is an upper bound on potential speedup and useful for evaluating scalability. The thesis demonstrates Parasite on sorting algorithms, molecular dynamics simulations, and other programs to analyze parallelism and identify factors limiting scalability.
This document describes a thesis that proposes a multicore architecture allowing fault tolerant cores to distribute critical tasks to less reliable cores. It uses a fingerprinting system where each core monitors others by calculating fingerprints and comparing them in a centralized hardware comparator. The fingerprinting unit represents 15% of core resources while the comparator adds 6% cost. An FPGA prototype was developed to fingerprint parallel thread executions. A virtual debugging platform was also created using processor models and multicore simulation.
iPDC-v1.3.0 - A Complete Technical Report including iPDC, PMU Simulator, and ... — Nitesh Pandit
iPDC is a free Phasor Data Concentrator based on the IEEE C37.118 synchrophasor standard. It also includes a Database Server and a PMU Simulator module.
The objective of the iPDC project is to create an IEEE C37.118-compliant Phasor Data Concentrator and PMU Simulator on which research students and others can develop and test their algorithms and applications. iPDC is released as Free Software so that users can use and modify it without restriction, and so that users and developers around the world can contribute to it.
iPDC time-aligns and combines the received data into frames as per IEEE C37.118 and can forward them to other iPDCs and applications. iPDC can also archive received data in a MySQL database on a local or remote machine. The PMU Simulator is likewise IEEE C37.118 compliant. The software is built to run on Linux.
The use of synchrophasors for monitoring and improving the stability of power transmission networks is gaining significance all over the world. The aim is to monitor the system state, to raise awareness of system stability, and to make optimal use of existing lines. In this way, overall system stability can be improved and transmission performance can even be increased. The data from the many PMUs and PDCs needs to be collected and directed to the proper channels for efficient use. We therefore need an efficient, flexible, hybrid data concentrator that can serve this purpose. Besides accepting data from PMUs, a PDC should also be able to accept data from other PDCs. We have designed such a PDC (iPDC) that accepts data from PMUs and PDCs that are IEEE C37.118 standard compliant.
The WAMS architecture places iPDC and PMU units at different levels. This architecture enables an iPDC to receive data either from a PMU or from another iPDC. Both the PMU and the iPDC from which data is received must be IEEE C37.118 synchrophasor standard compliant. It is a hybrid architecture.
iPDC Design
The client-server architecture is common in networks where two peers communicate with each other. Of the two peers (PMU and iPDC) communicating in WAMS, one acts as a client and the other as a server. Since the PMU serves the requests coming from the iPDC by replying with data or configuration frames, it acts as a server. It listens for command frames from the iPDC. PMU-iPDC communication can run over either TCP or UDP. On receiving a command frame, the PMU replies to the iPDC with data or configuration frames according to the type of request.
iPDC functionality is bifurcated into server and client roles. iPDC as a Client - When iPDC receives data or configuration frames, it acts as a client. When acting as a client, it creates a new thread for each PMU or PDC from which it will receive data/configuration frames. This thread establishes the connection between the two communicating entities and handles both TCP and UDP connections. The first frame the server (PMU/PDC) receives is a command requesting the configuration frame. When the server replies with the configuration frame, the iPDC (client) sends another request to start the data frames. On receiving such a command frame, the server starts sending data frames. If the client (iPDC) notices a change in the status bits of a data frame, it takes action. For example, if it notices that bit 10 has been set, it internally sends a command to the server requesting the latest configuration frame.
iPDC as a Server - When iPDC receives command frames from another PDC, it acts as a server. Two ports are reserved, one for UDP and one for TCP, on which the PDC receives command frame requests. The PDC thus plays the role of a PMU waiting for command frames.
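As an illustration of the command/configuration/data exchange described above, here is a hedged Python sketch that builds IEEE C37.118 command frames of the kind an iPDC client would send to a PMU/PDC server. The IDCODE value is illustrative, and the sketch omits the actual TCP/UDP transport.

```python
# Sketch: building IEEE C37.118 command frames. Field layout follows the
# standard (SYNC, FRAMESIZE, IDCODE, SOC, FRACSEC, CMD, CHK); the IDCODE
# below is an illustrative placeholder, not a value from the iPDC report.
import struct
import time

def crc_ccitt(data: bytes) -> int:
    """CRC-CCITT (initial value 0xFFFF), used as the frame checksum."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

# CMD field values defined by the standard.
CMD_DATA_OFF, CMD_DATA_ON, CMD_SEND_HDR, CMD_SEND_CFG1, CMD_SEND_CFG2 = range(1, 6)

def command_frame(idcode: int, cmd: int, soc: int = None) -> bytes:
    soc = int(time.time()) if soc is None else soc
    body = struct.pack(">HHHIIH",
                       0xAA41,  # SYNC: 0xAA, then frame type 4 (command), version 1
                       18,      # FRAMESIZE: a command frame is a fixed 18 bytes
                       idcode,  # IDCODE of the destination PMU/PDC
                       soc,     # SOC timestamp (seconds since epoch)
                       0,       # FRACSEC
                       cmd)     # CMD field
    return body + struct.pack(">H", crc_ccitt(body))

# Per the exchange above, an iPDC client first requests the configuration
# frame (CFG-2), then commands the server to start sending data frames:
cfg_request = command_frame(idcode=60, cmd=CMD_SEND_CFG2, soc=0)
data_on     = command_frame(idcode=60, cmd=CMD_DATA_ON, soc=0)
```

In the real system these bytes would be written to the reserved TCP or UDP port on which the PMU/PDC listens for command frames.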
This document is a master's thesis submitted by Sascha Nawrot to Berlin University of Applied Sciences in partial fulfillment of the requirements for a Master of Science degree in Applied Computer Science. The thesis introduces novel, lightweight open source annotation tools for whole slide images that enable deep learning experts and pathology experts to cooperate in creating training samples by annotating regions of interest in whole slide images, regardless of platform or format, in a fast and easy manner. The tools consist of a conversion service to convert whole slide images to an open format, an annotation service for annotating regions of interest, and a tessellation service to extract the annotated regions from the images.
This doctoral thesis by Juan Luis Jerez focuses on developing more efficient computational methods and custom hardware architectures for real-time optimal decision making and control applications. The thesis proposes techniques to exploit synergies between digital hardware, numerical algorithms, and algorithm design. These include custom storage schemes, parallel optimization approaches, tailored linear algebra methods for fixed-point arithmetic, and finite-precision analysis of first-order optimization methods. The techniques are demonstrated on examples such as a hardware-in-the-loop setup for model predictive control of a large airliner.
Automatic Detection of Performance Design and Deployment Antipatterns in Comp... — Trevor Parsons
Enterprise applications are becoming increasingly complex. In recent times they have moved away from monolithic architectures to more distributed systems made up of a collection of heterogeneous servers. Such servers generally host numerous software components that interact to service client requests. Component-based enterprise frameworks (e.g. JEE or CCM) have been extensively adopted for building such applications. Enterprise technologies provide a range of reusable services that can assist developers building these systems. Consequently, developers no longer need to spend time developing the underlying infrastructure of such applications, and can instead concentrate their efforts on functional requirements.
Poor performance design choices, however, are common in enterprise applications and have been well documented in the form of software antipatterns. Design mistakes generally result from the fact that these multi-tier, distributed systems are extremely complex, and developers often do not have a complete understanding of the entire application. As a result, developers can be oblivious to the performance implications of their design decisions. Current performance testing tools fail to address this lack of system understanding. Most merely profile the running system and present large volumes of data to the tool user. Consequently, developers can find it extremely difficult to identify design issues in their applications. Fixing serious design-level performance problems late in development is expensive and cannot be achieved through "code optimizations". In fact, performance requirements can often only be met by modifying the design of the application, which can lead to major project delays and increased costs.
This thesis presents an approach for the automatic detection of performance design and deployment antipatterns in enterprise applications built using component-based frameworks. Our main aim is to take the onus away from developers having to sift through large volumes of data in search of performance bottlenecks in their applications. Instead, we automate this process. Our approach works by automatically reconstructing the run-time design of the system using advanced monitoring and analysis techniques. Well-known (predefined) performance design and deployment antipatterns that exist in the reconstructed design are automatically detected. Results of applying our technique to two enterprise applications are presented.
The main contributions of this thesis are (a) an approach for automatic detection of performance design and deployment antipatterns in component-based enterprise frameworks, (b) a non-intrusive, portable, end-to-end run-time path tracing approach for JEE, and (c) the advanced analysis of run-time paths using frequent sequence mining to automatically identify interesting communication patterns between components.
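The frequent-sequence-mining idea in contribution (c) can be sketched in miniature: count contiguous component subsequences across recorded run-time paths and keep those above a support threshold. The component names and paths below are invented, and the thesis's actual mining algorithm is more sophisticated.

```python
# Toy sketch of frequent sequence mining over run-time paths: count
# contiguous component subsequences and keep those meeting a minimum
# support, surfacing common communication patterns between components.
from collections import Counter

# Hypothetical run-time paths reconstructed by monitoring (invented names).
paths = [
    ["Servlet", "SessionBean", "EntityBean", "DB"],
    ["Servlet", "SessionBean", "EntityBean", "DB"],
    ["Servlet", "SessionBean", "DB"],
]

def frequent_subsequences(paths, length=2, min_support=2):
    """Count contiguous subsequences of the given length across all paths
    and return those occurring at least min_support times."""
    counts = Counter()
    for path in paths:
        for i in range(len(path) - length + 1):
            counts[tuple(path[i:i + length])] += 1
    return {seq: n for seq, n in counts.items() if n >= min_support}

patterns = frequent_subsequences(paths)
```

A detector could then match such recurring patterns against predefined antipattern signatures, e.g. a component pair that always communicates across tiers on every request.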
This document is a thesis submitted by Shruti Ranjan Satapathy for the degree of B.Tech - M.Tech at the Indian Institute of Technology Kanpur in June 2013. It examines word sense disambiguation through both supervised and knowledge-based approaches. The supervised approach uses support vector machines and syntactic, syntacto-semantic and semantic features for all-words sense disambiguation. The knowledge-based approaches construct graphs based on WordNet and use PageRank to score word senses, showing that the approach using subgraph projections from WordNet outperforms the pairwise similarity-based approach. The thesis highlights issues with sense granularity, lack of sense-annotated training data, and knowledge acquisition bottlenecks that still challenge word sense disambiguation.
This document is the thesis submitted by Bryan Omar Collazo Santiago to the Department of Electrical Engineering and Computer Science at MIT in partial fulfillment of the requirements for a Master of Engineering degree. The thesis presents MLBlocks, a machine learning system that allows data scientists to easily explore different modeling techniques. MLBlocks supports discriminative modeling, generative modeling, and using synthetic features to boost performance. It has a simple interface and is highly parameterizable and extensible. The thesis describes the architecture and implementation of MLBlocks and provides two examples of using it on real-world problems - predicting student dropout in MOOCs and predicting vehicle destinations from trajectory data.
This document describes a senior project submitted by Wongsarun Chatamornwong and Ronnakrit Kunaviriyasiri to Mahidol University International College in partial fulfillment of a Bachelor of Science degree in Computer Science. The project, called Meka Code, aims to develop an online integrated development environment (IDE) that allows instructors and students to have a shared coding environment and tools. Key features of Meka Code include providing Linux containers to users, a graphical user interface within containers, and functionality for instructors to create courses and assign work and for students to enroll in courses and submit assignments.
Big Data and the Web: Algorithms for Data Intensive Scalable Computing — Gabriela Agustini
This document is the dissertation of Gianmarco De Francisci Morales submitted for the PhD program in Computer Science and Engineering at IMT Institute for Advanced Studies in Lucca, Italy. The dissertation addresses challenges in managing and analyzing large datasets, or "big data", and presents algorithms for tasks like document filtering, graph computation and real-time news recommendation. It was approved by the program coordinator and supervisor, and reviewed by two external reviewers. The dissertation contains six chapters, including introductions to big data and related work, and presents three contributed algorithms for document filtering, graph computation and news recommendation that scale to large datasets through parallel and distributed techniques.
This document describes a project to design and implement an OFDM-based wireless transmitter compliant with the IEEE 802.11g standard on an FPGA. The transmitter was modeled using Simulink and the model was tested through cosimulation and using EDA tools. Testing showed the design met timing requirements and error measurements were satisfactory, demonstrating a successful OFDM transmitter design using a model-based approach.
Modelling Time in Computation (Dynamic Systems) — M Reza Rahmati
This document presents a dissertation on functional reactive programming (FRP) for real-time reactive systems. The dissertation introduces RT-FRP, a language for programming real-time reactive systems that can guarantee resource bounds while allowing a restricted form of recursion. It presents two variants of RT-FRP called H-FRP and E-FRP for modeling hybrid and event-driven systems respectively. A compiler is presented for compiling E-FRP to an imperative language for implementation. The dissertation contributes new programming languages and techniques for programming reactive and embedded systems with guarantees on resource usage.
The document provides an introduction to the technical basics of migrating data to SAP ERP systems. It defines common terminology used in data migration projects and describes the typical process steps from a technical perspective, including exporting data from the legacy system, converting the data, and importing it into SAP ERP. It also provides an overview of common technical procedures for data migration, such as batch input, the extended computer-aided test tool (eCATT), and the legacy system migration workbench.
This document presents a thesis that assesses the application of machine learning techniques to password list generation. The goal is to create human-like password dictionaries using character-based Recurrent Neural Networks (RNNs) and to classify passwords as human- or machine-generated using machine learning algorithms. The thesis shows that an attacker can leverage machine learning to generate tailored password lists for specific victims by training a model on the password creation schemes of other people in combination with user data of the victim. The results indicate that machine learning can accurately classify human passwords, and that RNNs can successfully recognize, learn, and recreate human password generation schemes to generate both general and individual human password lists. While these approaches could be abused as an attack tool, they can also be used defensively, for example to assess password strength.
This document provides an extensive literature review and overview of automatic text summarization from multiple documents. It discusses definitions of text summarization and the summarization process, which includes steps like domain definition, subject analysis, data analysis, feature generation, information aggregation, summary representation, generation and evaluation. It also describes using n-gram graphs as a text representation for summarization and the operators and algorithms involved. Finally it discusses evaluation of summarization systems and using background knowledge to improve the summarization process.
Dissertation_of_Pieter_van_Zyl_2_March_2010 — Pieter Van Zyl
This document provides a summary of a dissertation submitted in partial fulfillment of the requirements for a Magister Scientia (Computer Science) degree from the University of Pretoria. The dissertation investigates the performance of selected object persistence stores using the OO7 benchmark. It compares the open-source ORM tool Hibernate, the open-source object database db4o, and the proprietary object database Versant. The study found that with optimization techniques, Hibernate performed comparably to the object databases. Versant was the fastest of the three systems tested. The dissertation provides background on persistence technologies, describes implementations of the OO7 benchmark in Java for each system, analyzes performance results, and offers recommendations to improve performance.
This document is a feasibility study report submitted by Benjamin Kremer for the MSc Computer Science degree at University College London. The report examines the feasibility of constructing a system to verify and quantify collaborative work using blockchain architecture. The project aimed to address the problem of student disengagement by developing an API and mobile application to interact with a blockchain that records collaborative task and team data. While the project did not fully establish a way to verify and quantify collaboration, it demonstrated the concept is feasible with more time and blockchain expertise. The report describes the background, requirements, design, implementation, and testing of the prototype system developed as a proof of concept.
This thesis proposes and evaluates a compressive sensing (CS)-based indoor positioning and tracking system using received signal strength (RSS) from wireless local area network access points. The system is designed and implemented on mobile devices with limited resources.
In the offline phase, RSS fingerprints are collected and clustered using affinity propagation. In the online phase, coarse localization is done by matching RSS measurements to precomputed clusters, and fine localization refines the position using CS recovery on the sparse location signal.
An indoor tracking system is also presented, which integrates the CS-based positioning with a Kalman filter for sequential location estimates. Experimental results on two testbeds show the system achieves better accuracy than other fingerprinting methods and is suitable for implementation on resource-limited mobile devices.
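The coarse-localization step of the two-phase scheme described above can be sketched as a nearest-exemplar lookup. The fingerprints below are invented, and the real system would follow this step with CS recovery inside the matched cluster.

```python
# Minimal sketch of coarse localization: match a live RSS reading to the
# nearest cluster exemplar computed offline (e.g. via affinity propagation).
# All fingerprint values here are illustrative, not measured data.
import numpy as np

# Offline-phase output: one exemplar RSS fingerprint (dBm per AP) per cluster.
cluster_exemplars = np.array([
    [-40.0, -70.0, -85.0],   # cluster 0 (strongest signal from AP 1)
    [-75.0, -45.0, -80.0],   # cluster 1 (strongest signal from AP 2)
    [-85.0, -72.0, -42.0],   # cluster 2 (strongest signal from AP 3)
])

def coarse_localize(rss: np.ndarray) -> int:
    """Return the index of the cluster whose exemplar is closest
    (Euclidean distance) to the observed RSS vector."""
    return int(np.argmin(np.linalg.norm(cluster_exemplars - rss, axis=1)))

cluster = coarse_localize(np.array([-44.0, -68.0, -83.0]))
```

Restricting fine localization to the matched cluster is what keeps the online phase cheap enough for a resource-limited mobile device.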
This document is the thesis submitted by Jiří Danihelka to the Faculty of Electrical Engineering at Czech Technical University in Prague for the degree of Doctor of Philosophy. The thesis focuses on distributed mobile graphics, including rendering of facial models, collaborative distributed computer graphics, and generating virtual cities on mobile devices. It presents research conducted from 2010 to 2015 and supported by several grants and organizations. The thesis is divided into four parts covering introduction, rendering of facial models, collaborative graphics, and generating virtual cities on mobile devices.
This masters report describes the COAcHMAN project which aims to simplify user interactions with smart homes through context awareness. The report conducts background research on context awareness and home automation technologies. As a result, a software solution called COAcHMAN is proposed which enables homes to react based on the user's context rather than requiring direct user interaction. COAcHMAN integrates with the openHAB home automation platform and uses online user profiles to provide familiar interfaces for users. The implementation of COAcHMAN is described along with further development areas like authentication and using internal sensor data.
This master's thesis documents the Linux kernel version 2.6. It begins with an introduction to operating systems concepts and an overview of Linux kernel subsystems. The core chapters analyze important kernel mechanisms such as synchronization, scheduling, memory management and device drivers. Code examples are provided to illustrate kernel programming concepts. The thesis concludes with the documentation of a sample loadable kernel module.
This document is a doctoral thesis that examines bringing more intelligence to the web and beyond through semantic web technologies. It discusses the motivation for more intelligent web applications and provides an overview of semantic web technologies and languages. It then presents the H-DOSE semantic platform and its logical architecture for semantic resource retrieval. Several case studies that implemented the H-DOSE platform are also described. The thesis concludes with a discussion of related work and potential future directions.
Distributed Decision Tree Learning for Mining Big Data Streams - Arinto Murdopo
The document presents a master's thesis that proposes and develops Scalable Advanced Massive Online Analysis (SAMOA), a distributed streaming machine learning framework. SAMOA aims to address the big data challenges of volume, velocity, and variety by providing flexible APIs for developing machine learning algorithms, and integrating with Storm, a stream processing engine, to inherit its scalability. The thesis describes SAMOA's modular components, its integration with Storm, and evaluates a distributed online classification algorithm implemented on SAMOA and Storm to demonstrate its features.
A mobile and web application for time measurement, intended to give an accurate picture of productive time in a production environment, reveal the root causes behind ineffective/idle time, and eliminate non-value-added activities/tasks.
Technical keywords: Ionic 2, Angular 2, PouchDB, CouchDB, DB replication protocol, Django, Python, NVD3 charts.
The document is a thesis submitted by Maliththa S. S. Bulathwela for the degree of Master of Science in Computational Statistics and Machine Learning at University College London. The thesis explores building a self-adaptive topic engine to extract insights from customer feedback data. Initial work uses supervised support vector machines for topic classification and adapts trust modeling techniques to enhance the reliability of crowd-sourced labeled data. Latent Dirichlet allocation is then used to detect emerging topics from unlabeled data. The results were promising, suggesting further work could build self-adapting topic engines using techniques from the thesis.
Automatic Detection of Performance Design and Deployment Antipatterns in Comp... - Trevor Parsons
Enterprise applications are becoming increasingly complex. In recent times they have moved away from monolithic architectures to more distributed systems made up of a collection of heterogeneous servers. Such servers generally host numerous software components that interact to service client requests. Component-based enterprise frameworks (e.g. JEE or CCM) have been extensively adopted for building such applications. Enterprise technologies provide a range of reusable services that can assist developers building these systems. Consequently developers no longer need to spend time developing the underlying infrastructure of such applications, and can instead concentrate their efforts on functional requirements.
Poor performance design choices, however, are common in enterprise applications and have been well documented in the form of software antipatterns. Design mistakes generally result from the fact that these multi-tier, distributed systems are extremely complex and often developers do not have a complete understanding of the entire application. As a result developers can be oblivious to the performance implications of their design decisions. Current performance testing tools fail to address this lack of system understanding. Most merely profile the running system and present large volumes of data to the tool user. Consequently developers can find it extremely difficult to identify design issues in their applications. Fixing serious design-level performance problems late in development is expensive and cannot be achieved through "code optimizations". In fact, often performance requirements can only be met by modifying the design of the application, which can lead to major project delays and increased costs.
This thesis presents an approach for the automatic detection of performance design and deployment antipatterns in enterprise applications built using component-based frameworks. Our main aim is to take the onus away from developers having to sift through large volumes of data in search of performance bottlenecks in their applications. Instead we automate this process. Our approach works by automatically reconstructing the run-time design of the system using advanced monitoring and analysis techniques. Well-known (predefined) performance design and deployment antipatterns that exist in the reconstructed design are automatically detected. Results of applying our technique to two enterprise applications are presented.
The main contributions of this thesis are (a) an approach for automatic detection of performance design and deployment antipatterns in component-based enterprise frameworks, (b) a non-intrusive, portable, end-to-end run-time path tracing approach for JEE and (c) the advanced analysis of run-time paths using frequent sequence mining to automatically identify interesting communication patterns between components.
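The frequent-sequence-mining step described above, finding recurring communication patterns across run-time paths, can be sketched as follows. This is a minimal illustration, not the thesis's actual algorithm: the component names are invented and the mining is restricted to contiguous length-2 call patterns with a simple support threshold.

```python
from collections import Counter

def frequent_call_patterns(paths, length=2, min_support=2):
    """Count contiguous call subsequences of a given length across
    run-time paths and keep those meeting the support threshold."""
    counts = Counter()
    for path in paths:
        for i in range(len(path) - length + 1):
            counts[tuple(path[i:i + length])] += 1
    return {pat: c for pat, c in counts.items() if c >= min_support}

# Hypothetical run-time paths through JEE components.
paths = [
    ["Servlet", "SessionBean", "EntityBean", "Database"],
    ["Servlet", "SessionBean", "Database"],
    ["Servlet", "SessionBean", "EntityBean", "Database"],
]
patterns = frequent_call_patterns(paths)
# ("Servlet", "SessionBean") occurs in every path and survives the
# threshold; the rare ("SessionBean", "Database") hop is filtered out.
```

A real miner would also handle gapped subsequences and much longer patterns, but the support-counting idea is the same.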
This document is a thesis submitted by Shruti Ranjan Satapathy for the degree of B.Tech - M.Tech at the Indian Institute of Technology Kanpur in June 2013. It examines word sense disambiguation through both supervised and knowledge-based approaches. The supervised approach uses support vector machines and syntactic, syntacto-semantic and semantic features for all-words sense disambiguation. The knowledge-based approaches construct graphs based on WordNet and use PageRank to score word senses, showing that the approach using subgraph projections from WordNet outperforms the pairwise similarity-based approach. The thesis highlights issues with sense granularity, lack of sense-annotated training data and knowledge acquisition bottlenecks that still challenge word sense disambiguation.
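The PageRank-over-sense-graph idea mentioned above can be illustrated with a plain power-iteration PageRank. The sense graph below is a made-up toy (not WordNet data): senses of the ambiguous word "bank" are linked to senses of context words, and the better-connected sense accumulates the higher score.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iterative PageRank over an adjacency-list graph {node: [neighbors]}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, neighbors in graph.items():
            share = damping * rank[v] / len(neighbors) if neighbors else 0.0
            for u in neighbors:
                new_rank[u] += share
        rank = new_rank
    return rank

# Hypothetical sense graph: sense nodes connected by semantic relations.
sense_graph = {
    "bank#finance": ["money#1"],
    "bank#river":   ["water#1"],
    "money#1":      ["bank#finance"],
    "deposit#1":    ["bank#finance", "money#1"],
    "water#1":      ["bank#river"],
}
scores = pagerank(sense_graph)
# "bank#finance" is reinforced by two context senses, so it outranks
# "bank#river", which sits in an isolated two-node cycle.
```

The thesis's subgraph-projection variant restricts which WordNet relations enter the graph before ranking; the iteration itself is unchanged.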
This document is the thesis submitted by Bryan Omar Collazo Santiago to the Department of Electrical Engineering and Computer Science at MIT in partial fulfillment of the requirements for a Master of Engineering degree. The thesis presents MLBlocks, a machine learning system that allows data scientists to easily explore different modeling techniques. MLBlocks supports discriminative modeling, generative modeling, and using synthetic features to boost performance. It has a simple interface and is highly parameterizable and extensible. The thesis describes the architecture and implementation of MLBlocks and provides two examples of using it on real-world problems - predicting student dropout in MOOCs and predicting vehicle destinations from trajectory data.
The use of synchrophasors for monitoring and improving the stability of power transmission networks is gaining significance all over the world. The aim is to monitor the system state, to raise awareness of system stability and to make optimal use of existing lines. In this way, overall system stability can be improved and even transmission performance can be increased. The data from many PMUs and PDCs needs to be collected and directed to the proper channels for efficient use. We therefore need an efficient, flexible and hybrid data concentrator that can serve this purpose. Besides accepting data from PMUs, a PDC should also be able to accept data from other PDCs. We have designed such a PDC (iPDC) that accepts data from PMUs and PDCs that are IEEE C37.118 standard compliant.
The WAMS architecture places iPDC and PMU at different levels. This architecture enables an iPDC to receive data either from a PMU or from another iPDC. Both the PMU and the iPDC from which data is received should be IEEE C37.118 synchrophasor standard compliant. It is a hybrid architecture.
iPDC Design
The client-server architecture is common in networks where two peers communicate with each other. Of the two peers (PMU and iPDC) communicating in WAMS, one acts as a client and the other as a server. Since the PMU serves requests coming from the iPDC by sending data or configuration frames, it acts as a server. It listens for command frames from the iPDC. PMU-iPDC communication can run over either TCP or UDP. On receiving a command frame, the PMU replies to the iPDC with data or configuration frames according to the type of request.
iPDC functionality is bifurcated into server and client roles. iPDC as a client - When the iPDC receives data or configuration frames it acts as a client. When acting as a client, it creates a new thread for each PMU or PDC from which it is going to receive data/configuration frames. This thread establishes the connection between the two communicating entities and handles both TCP and UDP connections. The first frame the server (PMU/PDC) receives is the command requesting the configuration frame. When the server replies with the configuration frame, the iPDC (client) generates another request to start sending data frames. On receiving such a command frame, the server starts sending data frames. If the client (iPDC) notices a change in the status bits of a data frame, it takes action; for example, if it notices that bit 10 has been set, it internally sends a command to the server to send the latest configuration frame.
iPDC as a server - When the iPDC receives command frames from another PDC it acts as a server. There are two reserved ports, one for UDP and the other for TCP, on which the PDC receives command frame requests. Thus the PDC now plays the role of a PMU waiting for command frames.
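The command frames exchanged above follow the IEEE C37.118 command-frame layout: a SYNC word, frame size, source IDCODE, a second-of-century timestamp with fractional part, the command word, and a CRC-CCITT checksum. A minimal sketch of building such a frame, as an iPDC client would before requesting CFG-2 or data, might look like the following; treat the surrounding protocol sequencing as illustrative rather than iPDC's actual code.

```python
import struct
import time

def crc_ccitt(data: bytes) -> int:
    """CRC-CCITT (poly 0x1021, init 0xFFFF) as used by C37.118 frames."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

# Command codes from the standard: 1 = data off, 2 = data on, 5 = send CFG-2.
CMD_DATA_ON = 0x0002
CMD_SEND_CFG2 = 0x0005

def build_command_frame(idcode: int, cmd: int, soc: int = None) -> bytes:
    """Pack an 18-byte command frame: SYNC (0xAA41), FRAMESIZE, IDCODE,
    SOC, FRACSEC, CMD, then the CRC-CCITT checksum over the first 16 bytes."""
    soc = int(time.time()) if soc is None else soc
    body = struct.pack(">HHHIIH", 0xAA41, 18, idcode, soc, 0, cmd)
    return body + struct.pack(">H", crc_ccitt(body))

frame = build_command_frame(idcode=60, cmd=CMD_SEND_CFG2, soc=0)
```

An iPDC client thread would send this frame, parse the CFG-2 reply, then send `CMD_DATA_ON` to start the data stream, as described in the text above.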
This document describes a senior project submitted by Wongsarun Chatamornwong and Ronnakrit Kunaviriyasiri to Mahidol University International College in partial fulfillment of a Bachelor of Science degree in Computer Science. The project, called Meka Code, aims to develop an online integrated development environment (IDE) that allows instructors and students to have a shared coding environment and tools. Key features of Meka Code include providing Linux containers to users, a graphical user interface within containers, and functionality for instructors to create courses and assign work and for students to enroll in courses and submit assignments.
Big Data and the Web: Algorithms for Data Intensive Scalable Computing - Gabriela Agustini
This document is the dissertation of Gianmarco De Francisci Morales submitted for the PhD program in Computer Science and Engineering at IMT Institute for Advanced Studies in Lucca, Italy. The dissertation addresses challenges in managing and analyzing large datasets, or "big data", and presents algorithms for tasks like document filtering, graph computation and real-time news recommendation. It was approved by the program coordinator and supervisor, and reviewed by two external reviewers. The dissertation contains six chapters, including introductions to big data and related work, and presents three contributed algorithms for document filtering, graph computation and news recommendation that scale to large datasets through parallel and distributed techniques.
This document describes a project to design and implement an OFDM-based wireless transmitter compliant with the IEEE 802.11g standard on an FPGA. The transmitter was modeled using Simulink and the model was tested through cosimulation and using EDA tools. Testing showed the design met timing requirements and error measurements were satisfactory, demonstrating a successful OFDM transmitter design using a model-based approach.
Modelling Time in Computation (Dynamic Systems) - M Reza Rahmati
This document presents a dissertation on functional reactive programming (FRP) for real-time reactive systems. The dissertation introduces RT-FRP, a language for programming real-time reactive systems that can guarantee resource bounds while allowing a restricted form of recursion. It presents two variants of RT-FRP called H-FRP and E-FRP for modeling hybrid and event-driven systems respectively. A compiler is presented for compiling E-FRP to an imperative language for implementation. The dissertation contributes new programming languages and techniques for programming reactive and embedded systems with guarantees on resource usage.
The document provides an introduction to the technical basics of migrating data to SAP ERP systems. It defines common terminology used in data migration projects and describes the typical process steps from a technical perspective, including exporting data from the legacy system, converting the data, and importing it into SAP ERP. It also provides an overview of common technical procedures for data migration, such as batch input, the extended computer-aided test tool (eCATT), and the legacy system migration workbench.
This document presents a thesis that assesses the application of machine learning techniques to password list generation. The goal is to create human-like password dictionaries using character-based Recurrent Neural Networks (RNNs) and to classify passwords as human- or machine-generated using machine learning algorithms. The thesis shows that an attacker can facilitate machine learning to generate tailored password lists for specific victims by training a model on password creation schemes of other people in combination with user data of the victim. The results indicate that machine learning can accurately classify human passwords and successfully apply RNNs to recognize, learn, and recreate human password generation schemes to generate general and individual human password lists. While these approaches could be abused as an attack tool, they can also be used defensively to assess and strengthen password security.
This document provides an extensive literature review and overview of automatic text summarization from multiple documents. It discusses definitions of text summarization and the summarization process, which includes steps like domain definition, subject analysis, data analysis, feature generation, information aggregation, summary representation, generation and evaluation. It also describes using n-gram graphs as a text representation for summarization and the operators and algorithms involved. Finally it discusses evaluation of summarization systems and using background knowledge to improve the summarization process.
Dissertation_of_Pieter_van_Zyl_2_March_2010 - Pieter Van Zyl
This document provides a summary of a dissertation submitted in partial fulfillment of the requirements for a Magister Scientia (Computer Science) degree from the University of Pretoria. The dissertation investigates the performance of selected object persistence stores using the OO7 benchmark. It compares the open-source ORM tool Hibernate, the open-source object database db4o, and the proprietary object database Versant. The study found that with optimization techniques, Hibernate performed comparably to the object databases. Versant was the fastest of the three systems tested. The dissertation provides background on persistence technologies, describes implementations of the OO7 benchmark in Java for each system, analyzes performance results, and offers recommendations to improve performance.
This document is a feasibility study report submitted by Benjamin Kremer for the MSc Computer Science degree at University College London. The report examines the feasibility of constructing a system to verify and quantify collaborative work using blockchain architecture. The project aimed to address the problem of student disengagement by developing an API and mobile application to interact with a blockchain that records collaborative task and team data. While the project did not fully establish a way to verify and quantify collaboration, it demonstrated the concept is feasible with more time and blockchain expertise. The report describes the background, requirements, design, implementation, and testing of the prototype system developed as a proof of concept.
This thesis proposes and evaluates a compressive sensing (CS)-based indoor positioning and tracking system using received signal strength (RSS) from wireless local area network access points. The system is designed and implemented on mobile devices with limited resources.
In the offline phase, RSS fingerprints are collected and clustered using affinity propagation. In the online phase, coarse localization is done by matching RSS measurements to precomputed clusters, and fine localization refines the position using CS recovery on the sparse location signal.
An indoor tracking system is also presented, which integrates the CS-based positioning with a Kalman filter for sequential location estimates. Experimental results on two testbeds show the system achieves better accuracy than other fingerprinting methods and is suitable for implementation on resource-limited mobile devices.
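The coarse-localization step above, matching an online RSS reading against cluster exemplars chosen offline by affinity propagation, can be sketched minimally as nearest-exemplar matching. The fingerprint values below are invented, and the fine-localization CS recovery stage is not shown.

```python
def nearest_cluster(rss, exemplars):
    """Coarse localization: pick the cluster whose exemplar fingerprint
    is closest (squared Euclidean distance) to the online RSS reading."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(exemplars, key=lambda name: dist2(rss, exemplars[name]))

# Hypothetical exemplar fingerprints (dBm from three access points),
# standing in for the exemplars affinity propagation would select offline.
exemplars = {
    "corridor": [-45, -70, -80],
    "lab":      [-75, -50, -60],
    "lobby":    [-85, -80, -40],
}
room = nearest_cluster([-72, -52, -63], exemplars)  # closest exemplar: "lab"
```

In the full system, only the fingerprints inside the matched cluster are used for the sparse-recovery refinement, which keeps the online computation small enough for a phone.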
Eclipse is an open-source software system whose aim is to serve as a platform for integrating various Logic Programming extensions.
Report on e-Notice App (An Android Application) - Priyanka Kapoor
The document is a degree-related training report covering work at DigiMantra Labs, Ludhiana, from January 5, 2014 to May 30, 2014. It describes the development of an e-Notice Application for Android phones. The app allows users to access online notices on their phone and acts as an online notice board where people can communicate and post notices with text, images or videos. It aims to digitize the traditional notice board and allow staff/students to read and respond to notices from anywhere. The app also serves as a mailing list to notify all employees of new notices without needing to maintain a separate mailing list.
This document is a dissertation submitted by Spiros N. Agathos to the University of Ioannina in partial fulfillment of the requirements for a Doctor of Philosophy degree. The dissertation describes work on efficient OpenMP runtime support for general-purpose and embedded multi-core platforms. It presents contributions in the areas of OpenMP tasking, transforming nested workshares into tasks, runtime support for multicore embedded accelerators, OpenMP 4 support for multiple devices, and a compiler-assisted runtime. Evaluation results demonstrating performance improvements are also discussed.
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy... - Nóra Szepes
This document describes the design and implementation of a new educational support system portal and thin client. It discusses the specification phase where user requirements were gathered. The Mithril JavaScript framework was chosen for implementing the student client module. The design follows a Model-View-Controller pattern. Testing was done using Cucumber, Zombie and Istanbul to validate the design and implementation.
This document is the Software Guide for version 3.20 of the ORFEO Toolbox (OTB). OTB is a set of algorithms encapsulated in a software library developed by CNES to efficiently exploit results from methodological remote sensing research and development studies. It is implemented in C++ and based on the Insight Toolkit (ITK). The guide provides an introduction to OTB, instructions for downloading and installing it, and overviews of the system organization and essential concepts like the data processing pipeline and spatial objects.
The document introduces revision control systems (RCS) as essential tools for software development that allow developers to save different versions of source code over time. Key benefits of RCS mentioned include the ability to revert code, safeguard against loss through backups, track changes made, support concurrent editing, save notable versions as snapshots, and create isolated experimental branches. The document advocates for using Subversion as an RCS and provides examples of how RCS can also be useful for storing documents beyond just code.
This document is a thesis submitted by David Liebman to the State University of New York at New Paltz for the degree of Master of Science in Computer Science. The goal of the thesis is to create a chatbot using natural language processing and deep learning models. The thesis provides background on recurrent neural networks, transformers, and pre-trained language models like GPT-2. It then describes the experimental design and setup for installing chatbot models on devices like the Raspberry Pi. Several chatbot experiments are conducted using GRU, transformer, and GPT-2 models with discussion of the results.
This document summarizes a dissertation titled "Augmented Reality for Space Applications". The dissertation proposes introducing in-field-of-view head mounted display systems in spacesuits to give astronauts the ability to access digital information and operate robots during extravehicular activities. The proposed system would be capable of feeding task-specific information on request and recognizing objects in the real world to overlay augmented reality information for error checking and status purposes. This would increase situational awareness and task accuracy while reducing human error risk. The dissertation focuses on preliminary design and testing of an experimental head mounted display and its integration and testing in a spacesuit analogue.
This document is an industrial training report submitted by Deshapriya A.G.S. for their internship at Mobitel (Pvt) Ltd from January 4th to March 25th 2016. Mobitel is the largest telecommunications company in Sri Lanka that specializes in mobile services. The report describes Mobitel's background, services, organizational structure, technical details of projects worked on during the internship, software development processes, and a conclusion on the experience and knowledge gained.
This document is a textbook titled "Programming Fundamentals - A Modular Structured Approach using C++" by Kenneth Leroy Busbee. It covers topics related to programming fundamentals such as data types, operators, functions, input/output, and more using C++ as the programming language. The textbook is divided into chapters that each cover a programming concept and include examples and exercises. It is intended to teach structured programming techniques using a modular approach in C++.
The document discusses support architecture for high-level synthesis of algorithms that use pointers. It first introduces high-level synthesis and its typical steps of compilation, allocation, scheduling, binding and generation. It then presents a case study on OpenCV image processing algorithms that heavily use pointers. The proposed architecture aims to address the memory model problem for such algorithms. It consists of different memory structures like RAM, ring buffer and virtual buffer to support locality and efficient handling of pointers. Exception handling and integration methods are also discussed to map the algorithm to the architecture within the high-level synthesis flow.
This document is a minor project report submitted by Shahrukh Mohd Ayyaz Khan to the Department of Computer Engineering at SSBT's College of Engineering and Technology in partial fulfillment of the requirements for a Bachelor of Engineering degree. The report details the development of a Local Area Network Manager application. It includes sections on system analysis, requirements specification, system design, implementation, testing, results and analysis, and conclusions. Diagrams and screenshots are provided to illustrate various aspects of the system architecture, design, and functionality.
This thesis examines machine learning approaches using Hadoop in the cloud. It implements a distributed machine learning infrastructure in the cloud without dependence on distributed file systems or shared memory. This infrastructure learns and configures a distributed network of learners. The results are then filtered, fused and visualized. The thesis also develops a machine learning infrastructure using Python and compares the two approaches. It uses real-world immigration and GDP datasets from a government database to test the frameworks. The cloud-based approach is able to scale to petabytes of data with minimal configuration.
This document is a thesis that examines automated detection of short-lived websites. It presents the design and evaluation of discovery, identification, and classification engines to analyze websites and determine if they are short-lived or replicated across multiple domains. The tools crawl websites to gather content and metadata, calculate similarity metrics, and visualize relationships. Evaluation of the tools found they could successfully identify similar websites and classify pages as likely, unlikely, or partially replicated. The thesis also discusses non-functional requirements like architecture, anonymization techniques, and improving performance. Overall, the document outlines an approach for automatically detecting short-lived or replicated pharmaceutical websites.
The Ipsos - AI - Monitor 2024 Report.pdf - Social Samosa
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake - Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
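The idea of auto-generating compliance-enforcing views from declarative annotations can be sketched as below. The annotation names and column transformations are hypothetical, since the slides do not specify ViewShift's actual annotation language; the point is only that the view SQL is derived mechanically from the annotations rather than written by hand.

```python
def compliance_view_sql(table, columns, annotations):
    """Generate SQL for a compliance-enforcing view: annotated columns
    are transformed (hashed or redacted), the rest pass through as-is."""
    transforms = {
        "hash":   lambda c: f"SHA2(CAST({c} AS VARCHAR), 256) AS {c}",
        "redact": lambda c: f"NULL AS {c}",
    }
    select = ",\n  ".join(
        transforms[annotations[c]](c) if c in annotations else c
        for c in columns
    )
    return f"CREATE VIEW {table}_compliant AS\nSELECT\n  {select}\nFROM {table}"

# Hypothetical table and annotations: hash the identifier, redact the email.
sql = compliance_view_sql(
    "profiles",
    ["member_id", "email", "country"],
    {"member_id": "hash", "email": "redact"},
)
```

A catalog integration as described in the slides would then transparently resolve `profiles` to `profiles_compliant` for engines that must apply the policy.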
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You... - Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data - Kiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
The Building Blocks of QuestDB, a Time Series Database - javier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Natural Language Processing (NLP), RAG and its applications .pptxfkyes25
1. In the realm of Natural Language Processing (NLP), knowledge-intensive tasks such as question answering, fact verification, and open-domain dialogue generation require the integration of vast and up-to-date information. Traditional neural models, though powerful, struggle with encoding all necessary knowledge within their parameters, leading to limitations in generalization and scalability. The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" introduces RAG (Retrieval-Augmented Generation), a novel framework that synergizes retrieval mechanisms with generative models, enhancing performance by dynamically incorporating external knowledge during inference.
Machine Translation for Low Resource
Indian Languages
Trushita Prashant Redij
Supervisor: Prof. Amir Esmaeily
Dublin Business School
This dissertation is submitted in partial fulfilment of the requirements for the degree of
Master of Science in Data Analytics.
May 2020
Declaration
I hereby certify that this thesis, submitted for examination for the award of Master of
Science in Data Analytics, is solely my own work and contains references and
acknowledgements for research done by other researchers and technical scholars.
The thesis complies with the regulations for postgraduate study by research of Dublin
Business School and has not been submitted in whole or in part for another award at any
other university.
The thesis conforms to the ethics, principles and guidelines of applied research
stated by Dublin Business School.
Trushita Prashant Redij
May 2020
Acknowledgements
Motivation, guidance and determination have played a vital role in the completion of this project
report on Machine Translation for Low Resource Indian Languages.
Foremost, I am grateful to God almighty for giving me the strength and optimism to complete
this project in these difficult times.
I would like to express special gratitude and thanks to my project guide Prof. Amir
Sajad Esmaily for his expertise, feedback and guidance.
Lastly, my thanks and appreciation go to my family and friends who encouraged,
supported and helped me to the best of their abilities.
Abstract
Natural Language Processing comprises a variety of techniques and methods that help
computers process natural languages. NLP-based applications such as summarization,
recommender systems, classification, and machine translation systems reflect the significant
role of Artificial Intelligence in modern times. A tremendous amount of data is available on
the internet, but it predominantly represents the English language, which makes machine
translation challenging for other, low resource languages.
Indian languages are ancient, concise, and syntactically rich, and provide tremendous
scope to experiment with various methods of Machine Translation. The majority of the
work done on Indian languages has implemented rule-based and language-specific models,
leaving room for new experiments and development.
In this work, we present approaches to build an automatic translation system for the
Marathi language. We propose a statistical Machine Translation model built with the Moses
toolkit and a Deep Neural Network-based model built with OpenNMT. The training data
for this project comprises a parallel corpus of the Bible in Marathi and English.
The research also traces the evolution of machine translation systems and their various
applications, and describes the process of data preprocessing, implementation, testing,
and evaluation.
Furthermore, we evaluated the performance of both models using the BLEU metric. The
performance of the Deep Neural Network model was more accurate than that of the
Statistical Machine Translation model. The thesis concludes that neural networks have
emerged as a strong competitor challenging the dominance of the earlier, popular
SMT-based approaches.
Chapter 1
Introduction
One of the most prominent and challenging tasks for computers since their inception has
been the automatic translation of text between languages. Human languages are diverse,
with distinct syntax and semantics, which poses challenges for Artificial Intelligence in
automating translation. Machine translation is the process of automatically converting text
from one language to another using a software program [1].
Traditionally, machine translation was based on rule-based systems, which performed
translation by storing and manipulating knowledge and information [2].
In the 1990s, rule-based systems were replaced by statistical methods, wherein bilingual
or parallel text corpora are used to derive the parameters of the model [3].
Subsequently, deep neural network models ushered in a new era of automatic translation
called neural machine translation.
Fig. 1.1 Machine Translation for Languages
Machine translation takes as input a sequence of symbols in the source language, which is
processed by a computer program to derive an output sequence in the target language.
The fundamental drawbacks of classical machine translation are the framing of rules and
exceptions, its sequential nature, and learning long-range dependencies in the network.
In this research, we implement statistical machine translation using the Moses toolkit and
neural machine translation using the transformer model for a low resource Indian language,
Marathi.
1.1 What is Natural Language Processing?
Natural language processing is a subfield of artificial intelligence that focuses on the
interaction between human language and computers. It is a field that sits at the intersection
of computer science, artificial intelligence, and linguistics [4]. Languages spoken or written
by humans to communicate, such as English, Hindi, Marathi, French, Japanese, and Chinese,
are examples of natural languages.
Fundamentally, language is based on two aspects: symbols and rules. Symbols represent
the information that needs to be conveyed, and rules define the manipulation of symbols.
Fig. 1.2 Natural Language Processing
The primary aim of language processing is to interpret language by understanding its
semantics and syntax, and to apply this understanding to develop applications such as
chatbots, summarizers, auto-tagging, named entity recognition, sentiment analysis, online
shopping, and smart assistants like Cortana and Siri. There are various methods to translate
sentences from one language to another.
However, human languages are complex and based on unique syntax and semantics, which
poses a challenge for Artificial Intelligence in processing natural languages.
Natural Language Understanding
• The task of Natural Language Understanding is to understand, interpret and reason
about natural language on the input side. It deals with machine reading comprehension,
which is applied in automated reasoning, text categorization, machine translation,
question answering, voice activation and content analysis [5].
Natural Language Generation
• This is the process of transforming structured data into natural language. It is used
for automatic content generation, for example in chatbots and content for mobile or
web applications. In Natural Language Generation the system decides how to put a
concept into words, so the ideas the system wants to convey are known precisely.
Formal Language
A formal language is made of symbols, alphabets, and strings or words.
• A symbol is a character or an abstract entity that has no meaning by itself, e.g. letters,
digits and special characters.
• An alphabet is a finite set of symbols, denoted using sigma. E.g. B = {0, 1} is an
alphabet of two symbols, 0 and 1.
• A string is a finite sequence of symbols from an alphabet. E.g. 0110 and 111 are strings
from the alphabet B above.
• A language is a set of strings over an alphabet.
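The definitions above (symbol, alphabet, string, language) can be sketched with plain Python sets and strings; the helper names here are illustrative, not part of the thesis:

```python
# A minimal sketch of the formal-language definitions above.
from itertools import product

alphabet_B = {"0", "1"}  # the alphabet B = {0, 1}

def is_string_over(s, alphabet):
    """A string over an alphabet is a finite sequence whose every symbol is drawn from it."""
    return all(symbol in alphabet for symbol in s)

# A finite language: all strings of length exactly 3 over B
language_len3 = {"".join(p) for p in product(sorted(alphabet_B), repeat=3)}

print(is_string_over("0110", alphabet_B))  # True: 0110 is a string over B
print(is_string_over("012", alphabet_B))   # False: 2 is not a symbol of B
print(len(language_len3))                  # 8 strings: 000, 001, ..., 111
```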
Linguistics and Language processing
Linguistics is the science of language, which comprises the study of sounds, word formation,
sentence structure, meaning and understanding.
Fig. 1.3 Natural Language Processing Levels
Image Source: NLPhackers.io
There are six levels of natural language processing.
1. Morphological Analysis: Morphology concerns the identification, analysis, and
description of the structure of words in terms of morphemes. Morphemes are the smallest
meaningful units in the grammar of a language. E.g. the word 'unbreakable' has 3
morphemes: 'un', 'break', 'able' [6]. There are various types of morphemes, such as
free, bound, inflectional, derivational, root, and null morphemes. The syntax of a
language comprises the set of rules that define the structure of the language; it is
represented using a parse tree or a list.
2. Lexical Analysis divides the text into paragraphs, sentences and words, taking into
consideration the morphological and syntactic structure of the language.
3. Syntactic Analysis analyzes the words and transforms them to find their relations
with each other. It converts a flat input sentence into a hierarchical structure that
corresponds to the units of meaning in the sentence. It comprises two main components,
a grammar and a parser. The grammar declares the syntactic representation and legal
structure of the language. The parser compares the grammar against the input sentence
to produce a parsed structure called a parse tree.
4. Semantic Analysis determines the meaning of a sentence in context. The structures
derived from syntactic analysis are assigned meaning and mapped to objects in the
task domain. E.g. the phrase 'colourless red ideas' will be rejected, as 'colourless red'
does not have any meaning [6].
5. Discourse Processing: the meaning of an individual sentence may depend on the
sentences preceding it. E.g. the word 'it' in the sentence "you wanted it" depends on
the prior discourse context [6].
6. Pragmatic Analysis deals with knowledge that is beyond the context of the words.
Pragmatic analysis derives the aspects of language that require real-world knowledge
by focusing on the actual intent of the sentences. E.g. "Please, place my order?"
should be interpreted as a request [6].
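The morphological level above can be illustrated with a deliberately naive segmenter that strips known affixes; the prefix and suffix lists are illustrative assumptions, not a real morphological lexicon:

```python
# A naive sketch of morphological analysis: stripping known prefixes and
# suffixes to segment a word into candidate morphemes.
PREFIXES = ["un", "re", "dis"]   # illustrative, not a real lexicon
SUFFIXES = ["able", "ness", "ing"]

def segment(word):
    """Split a word into (prefix, root, suffix) morphemes where recognizable."""
    morphemes = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p):
            morphemes.append(p)
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s):
            suffix = s
            word = word[:-len(s)]
            break
    morphemes.append(word)  # the remaining root
    if suffix:
        morphemes.append(suffix)
    return morphemes

print(segment("unbreakable"))  # ['un', 'break', 'able'], as in the example above
```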
1.2 What is Machine Translation?
Machine translation, commonly known as MT, can be defined as "translation from one
natural language to another using computerized systems, with or without human
assistance" [7].
1.2.1 Applications of Machine Translation
• MT is extremely fast.
• It can translate into many languages at once, which greatly reduces the amount of
labor required.
• Integrating MT into a localization workflow can do the heavy lifting for translators
and save their time, allowing them to concentrate on the more nuanced parts of
translation.
• MT technology is developing rapidly and is continually progressing towards producing
higher-quality translations and reducing the need for post-editing.
1.2.2 Machine Translation System Architectures
From a linguistic standpoint, there are three basic approaches used for building MT
systems, which differ in their complexity and sophistication. These approaches are
represented in the diagram below:
Fig. 1.4 The Vauquois Triangle
Image Source: researchgate
In direct translation, translation proceeds directly from the source text to the target text.
The vocabulary of the source language is analyzed as needed to resolve source language
ambiguities, to correctly identify target language expressions, and to determine word order.
In the transfer approach, translation is carried out in three phases: the first phase consists
of converting the source text into an intermediate representation, usually parse trees; the
second phase converts these representations into equivalent ones in the target language;
and the third phase generates the target text.
The interlingua approach is the most suitable methodology for multilingual systems. It has
two phases: analysis and generation. In the analysis phase, a sentence in the source
language is analyzed and its semantic content is extracted and represented in the
interlingua form.
An interlingua is an entirely new language that is independent of any source or target
language and is designed to be used as an intermediate internal representation of the source
content. The analysis phase is followed by the generation of the target sentence [7].
1.3 Motivation
“The world is one big data problem.”
- Andrew McAfee
As we take stock of the technical advances of recent years, there is one factor common
among them all: information, or data. The exponential growth of information available to
understand and help individuals and organizations is guiding us to an era that attempts to
replace decisions based on human insight with data-driven and statistically supported
choices.
Natural Language Processing, a commonly used strategy for understanding and gaining
insights from data, has largely remained on the sidelines for a while. Now, with significant
advances in technical capability and the enormous amount of data available, this technology
appears promising.
Machine translation is one of the primary challenges in NLP. A genuine solution to the
problem implies that machines should be capable of interpreting the patterns of a language
and distinguishing its structure. The advent of various statistical and deep neural
approaches has contributed to addressing the syntactic and semantic issues in language
translation. However, these architectures predominantly focus on high resource
languages.
There are a few languages, for example English, Spanish, Chinese, French, and German,
that receive a great deal of attention from NLP researchers. Because of this, numerous
resources such as POS taggers, treebanks, and Senti-WordNets are available in those
languages. The NLP techniques created for these languages cannot be used for low
resourced languages, as they are fitted too closely to large datasets with many features.
Using them on small datasets would lead to very poor performance.
Consequently, there is a great need to work on low resource languages. Research into
language-independent NLP techniques that are appropriate in low-resource settings is
urgently required, as such methods can be applied to many low-resource languages at
once.
Marathi is one such low resourced language, and my native language. Beyond this,
there are numerous issues in natural language that arise while translating languages
using the different available approaches. The above reasons led us to set an objective
to introduce and implement methods that are suitable for low resource languages and
can be extended to any language.
1.4 Key Contributions
The thesis contributes towards progress in the task of Machine Translation in Indian
languages. The research centers primarily on the Marathi language.
As stated previously, research in these languages is constrained because of the
unavailability of annotated resources.
To investigate the parsing, semantic, and syntactic aspects of translating from Marathi
to English, we propose two methodologies:
• Statistical Machine Translation using Moses toolkit.
• Neural Machine Translation using OpenNMT.
The major contribution of the thesis is cross-lingual phrase-based translation learning and
a transformer model using the attention mechanism.
1.5 Thesis Overview
• Chapter 1 contains the introduction and motivation for the thesis. It briefly describes
the evolution of natural language processing, the levels of natural language processing,
machine translation architectures, and a computational point of view. This chapter
also highlights the key contributions of the thesis.
• Chapter 2 reviews earlier research in the field of machine translation. It briefly
explains the state-of-the-art models: rule-based systems, example-based translation,
statistical machine translation, and the more recent deep neural network based
translation.
• Chapter 3 gives a stepwise description of the processes and methods used in the
machine translation of Marathi to English. This chapter describes in detail the use of
the Moses toolkit for statistical machine translation and OpenNMT for deep neural
machine translation.
• Chapter 4 highlights the objectives of this research work.
• Chapter 5 showcases our work in developing the statistical model for machine
translation. We work with a parallel corpus of the Bible in Marathi and English, and
successfully build a statistical model using the Moses toolkit.
• Chapter 6 presents our work in building the deep neural network model for machine
translation. We successfully built a deep neural model using the OpenNMT toolkit.
• Chapter 7 describes the evaluation metric, BLEU. It presents the performance results
and the BLEU scores for the models built.
• Chapter 8 concludes the thesis and discusses the future scope of research on machine
translation for low resource languages.
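The BLEU metric named for Chapter 7 compares candidate n-grams against a reference using clipped counts and a brevity penalty. A minimal sketch under simplifying assumptions (single reference, unigrams and bigrams only; function names are illustrative, not the thesis's code):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty. Single reference only."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

ref = "the cat sat on the mat".split()
print(bleu(ref, ref))                      # 1.0 for a perfect match
print(bleu("the cat".split(), ref) < 1.0)  # shorter candidates are penalized
```

Real evaluations typically use up to 4-grams with smoothing, as implemented in standard toolkits.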
Chapter 2
Background and Related Work
Machine translation has evolved over the years and occupies a significant place in the
field of artificial intelligence. It describes a range of computer-based activities that
involve translation [8].
The earliest use of machine translation dates back to the period after the Second World
War, when early computers were used for code-breaking. In the 1980s there was a drastic
change and evolution of the field, which opened a new dimension for the application of
machine translation in artificial intelligence [9].
This chapter surveys sixty years of history, research, and development in machine
translation. It also highlights the obstacles and drawbacks of implementing the different
approaches to machine translation.
Fig. 2.1 History of Machine Translation, Image Source: medium.com
2.1 Rule Based Machine Translation
The early 1970s marked the start of this approach, wherein translation was performed
based on a set of predefined rules.
It comprises two important aspects:
• Bilingual dictionary for each language pair.
• Set of linguistic rules.
Translation quality can be improved by adding user-defined rules and dictionaries to the
translation process, overriding the default settings. The text is parsed by the software
and a transitional representation is created, from which text in the target language is
generated. A rule-based procedure has been proposed to simplify complex sentences based
on connectives such as relative pronouns and coordinating and subordinating
conjunctions [10].
The approach is based on a large set of lexicons, predefined rules, and syntactic and
semantic information about both the source and target language [11].
An RBMT system is efficient and reliable at generating translations, but it depends on a
huge set of rules that take a lot of time to write. Redefining and updating the system's
knowledge is also a tedious task.
Although RBMT systems are productive enough for a company to obtain quality
translations, maintaining and incrementally improving that quality requires a huge
initial investment.
2.2 Direct Machine Translation
This is the simplest approach to machine translation, wherein the words in the source are
replaced by corresponding words in the target language. Translation in this approach is
bilingual and unidirectional, with no intermediary representation [12]. It follows a
bottom-up approach wherein the transfer is made at word level.
Fig. 2.2 Direct Machine Translation
Image Source: medium.com
It is specific to a language pair and considers the word as the translation unit. It relies
little on syntactic or semantic analysis; grammatical adjustments are made to do
word-by-word translation.
This approach is an easy and feasible way to translate any language pair, but the results
obtained are poor, as it neither considers the grammar nor analyzes the meaning of the
sentence being translated, due to its linguistic and computational naivety [12].
2.3 Transfer Based Machine Translation
In this approach an intermediate representation is created after the text is parsed from the
source sentence. It comprises three steps:
• Analysis
• Transfer
• Generation
The first step analyzes the input text and converts it into an abstract form; the second
step converts the abstract text into an intermediate representation oriented to the target
language; and finally, the third step generates the target text using a morphological
analyzer.
The intermediate representations are specific to the source and target language
respectively. The results obtained with this approach were fairly satisfactory, in the
region of 90 percent accuracy [12]. Although this approach was based on simplified
grammar rules, these rules needed to be applied at every step: analysis of the source
language, transfer from source to target, and generation of the target language.
This resulted in verbatim translations and exhausted linguists, which in turn increased
the workload, making it complicated to reuse the modules and maintain their
simplicity [12].
2.4 Interlingual Based Machine Translation
This approach is also based on an intermediate representation, with the source language
being translated into an interlingual representation that is language independent. Finally,
the target language is generated from the interlingual representation. This approach is
very advantageous for generating multiple target languages from one source. KANT is the
only operational commercial interlingual machine translation system; it is designed to
translate technical English into other languages. This approach is beneficial for
multilingual translation systems.
However, it is a very complex task to create a universal interlingua that extracts the
original meaning of the source language and retains it in the generated target [12].
Dave et al. [13] study the linguistic dissimilarities between English and Hindi and their
implications for machine translation between these languages using the Universal
Networking Language (UNL). The representation works at the level of single sentences
and defines a semantic net-like structure in which nodes are word concepts and arcs are
semantic relations between these concepts.
2.5 Example Based Machine Translation
Example-based Machine Translation was primarily developed to overcome the drawbacks
of rule-based machine translation when translating between languages with different
structures, e.g. English and Japanese [14]. This approach retrieves similar examples, in
the form of pairs of source phrases, sentences, or texts and their translations, from a
database of examples in order to translate new input [15].
A bilingual corpus with parallel text constitutes the main knowledge base of an
Example-Based Machine Translation system. The system input comprises a set of
sentences from the source language and a corresponding mapping of translations for each
sentence in the target language. These examples are the basis for translating similar
sentences from the source language to the target language.
There are four steps in Example Based Machine Translation:
• Example acquisition
• Example base and management
• Example application
• Synthesis
Translation in example-based machine translation is predominantly based on analogy,
wherein example translations are used to train the models by encoding the principle of
analogical translation [14].
Example-Based Machine Translation is beneficial for machine translation as it does not
require manually derived rules. However, it requires pre-trained translation models to
analyze the sentences, and it requires high computational efficiency for large databases.
2.6 Statistical Based Machine Translation
In the late 1980s, the IBM research center introduced a machine translation system that
knew very little about rules and linguistics. The system analyzed texts in two languages
and tried to recognize patterns [16].
Fig. 2.3 Statistical Machine Translation
Statistical models derived by analyzing bilingual text corpora form the basis of
Statistical Machine Translation. Bayes' theorem is the basis for building the statistical
model, wherein the system selects the most probable sentence that matches the source
sentence to be translated [16].
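The Bayes' theorem formulation above is conventionally written as the noisy-channel model: for a source (foreign) sentence $f$, the system picks the target sentence $e$ maximizing (the symbols $e$ and $f$ are standard notation, not the thesis's own):

```latex
\hat{e} = \operatorname*{argmax}_{e} P(e \mid f)
        = \operatorname*{argmax}_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        = \operatorname*{argmax}_{e} P(f \mid e)\, P(e)
```

Here $P(f \mid e)$ is the translation model and $P(e)$ the language model; $P(f)$ can be dropped because it does not depend on $e$.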
The advantage of Statistical Machine Translation is that it was the most accurate method
introduced up to that point and overcame the drawbacks of the traditional rule-based
systems. There is no need for predefined rules, so supervision by linguists is not needed,
saving effort and time.
2.6.1 Word Based Statistical Machine Translation
The initial models used words as atomic units that may be translated, dropped, and
reordered. The preliminary step of machine translation is aligning the words in sentence
pairs. This approach uses both a translation model and a language model, thereby
ensuring good output [17].
The first word-based models split the sentence into words and gathered translation count
statistics for each word. The model memorizes the usual position a word takes in the
output sentence and reorders words to sound more natural.
Although word-based systems marked a new revolution in the field of machine
translation, they could not deal with exceptions such as gender and homonyms. This
approach became obsolete and was replaced by phrase-based systems.
2.6.2 Syntax Based Statistical Machine Translation
Syntax analysis deals with the subject, predicate and other parts of the sentence to build
a tree. Unlike phrase-based machine translation, which translates single words or strings
of words, this approach translates syntactic units.
For example, Data-Oriented Processing based machine translation and synchronous
context-free grammars are instances of Syntax Based Statistical Machine Translation [18].
This approach has demonstrated improved translation results, but its speed is considered
slow compared to other approaches.
2.6.3 Phrase Based Statistical Machine Translation
This approach is built on the principles of word-based translation, combining statistics,
reordering, and lexical heuristics. It splits the text into atomic phrases.
The advantages of phrase-based models are that non-compositional phrases can be
handled using many-to-many translations, local context can be used in translation, and
larger data sets can be translated. It was the standard model used by Google Translate.
Phrase-based models were based on N-grams, which are simply contiguous sequences of
words. As a result, the machine was able to process these sequences of words, thereby
improving accuracy.
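The N-grams described above are straightforward to extract and count over a corpus; a minimal sketch with illustrative data (not the thesis's corpus):

```python
# Contiguous word sequences ('phrases') extracted from sentences and
# counted over a tiny corpus, as phrase-based models do at scale.
from collections import Counter

def extract_ngrams(sentence, n):
    """Contiguous n-word sequences from a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = [
    "the house is small",
    "the house is old",
]
bigram_counts = Counter(bg for s in corpus for bg in extract_ngrams(s, 2))

print(extract_ngrams("the house is small", 2))
# [('the', 'house'), ('house', 'is'), ('is', 'small')]
print(bigram_counts[("the", "house")])  # 2: the phrase occurs in both sentences
```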
Fig. 2.4 Phrase Based Statistical Machine Translation
Image Source: wordpress
This approach provided the option to choose the bilingual texts used for learning.
Word-based translation ignored free translation, making it critical to exactly match the
sources.
Phrase-based translation overcame this by also learning from literary, or free,
translations. Phrase-based translation gained considerable importance from 2006 to 2016,
and was used in various online translators such as Google Translate, Bing and Yandex
[19].
2.7 Deep Neural Machine Translation
This approach has pioneered a new era of machine translation, wherein a large neural
network is used to predict the likelihood of a sequence of words, creating a single
integrated sentence model [20]. The early 1990s saw the appearance of speech recognition
applications based on deep learning.
In 2014, the first scientific papers on neural networks in machine translation were
published, followed by developments in subsequent years which included applications to
image captioning, subword-NMT, Zero-Shot NMT, Zero-Resource NMT, fully
character-level NMT, large vocabulary NMT, multi-source NMT, character-decoder NMT,
etc. [21].
Fig. 2.5 Neural Machine Translation
Image Source: altoross.com
The fundamental benefit of this approach is that it trains a single system directly on the
source and target text, so the pipeline of specialized systems used in statistical machine
translation is no longer required. Neural machine translation systems are also called
end-to-end systems, as they are based on only one model for the translation.
The learning occurs in two phases.
• The first phase consists of applying a nonlinear transformation to the input to create
a statistical model as output.
• The second phase improves the model using a mathematical method termed the
derivative.
The above two steps are repeated several times until the desired accuracy is obtained.
Each repetition of these two phases is termed an iteration. Various architectures such as
deep neural networks, recurrent neural networks, and deep belief networks have played
significant roles in fields such as computer vision, audio recognition, social network
filtering, speech recognition, machine translation, drug design and bioinformatics, where
outstanding results have been obtained [21].
A neural network receives a set of inputs, performs complex
calculations on them, and generates an output; trained appropriately, such networks address
real-world problems such as classification, in both supervised and reinforcement learning settings.
Gradient descent is used to optimize the network by minimizing the loss
function. The most important step in deep learning is training the model on the data set, and
backpropagation is the main algorithm used to compute the gradients for that training.
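The iterate-and-improve loop described above can be sketched in a few lines. This is an illustrative example, not code from the thesis: a single parameter is fitted by gradient descent on a squared-error loss, with made-up data and learning rate.

```python
# Minimal gradient-descent sketch: fit a single weight w to minimize
# a mean squared-error loss. Data and learning rate are toy choices.

def train(xs, ys, lr=0.1, steps=100):
    w = 0.0  # single model parameter
    for _ in range(steps):  # each pass is one "iteration"
        # Phase 1: forward computation of the loss gradient.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Phase 2: improve the model using the derivative.
        w -= lr * grad
    return w

# The data follows y = 3x, so w should converge to roughly 3.
w = train([1.0, 2.0, 3.0], [3.0, 6.0, 9.0])
```

In a real network, backpropagation plays the role of the gradient computation, applying the chain rule layer by layer instead of the closed-form derivative used here.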
In deep neural network architectures, compositional models are generated in which the
object is expressed as a layered composition of primitives. The extra layers allow
features from lower layers to be composed, so complex data can be modeled with
fewer units. Deep architectures are based on many variants of a few basic approaches,
each successful in specific domains. Deep neural networks are predominantly feed-forward
networks, in which data flows from the input layer to the output layer without
looping back. In recurrent neural networks, by contrast, data can flow in multiple
directions, which makes them applicable to language modeling. They have considerably
advanced the state of the art in neural machine translation, as they are able to model
complex functions and capture complex linguistic structures.
However, neural machine translation systems with deep architectures suffer from severe
gradient diffusion in their encoder or decoder due to the non-linear recurrent activations,
which makes them difficult to optimize [21]. The usual solution is an attention
mechanism, in which the model learns where to place attention on the input sequence as each
word of the output sequence is decoded.
The recurrent neural network encoder-decoder architecture with attention has played a
significant role in addressing machine translation problems. It is used, for example, by the
Google Neural Machine Translation system (GNMT) behind the Google Translate service.
However, despite their effectiveness, neural machine translation systems have drawbacks:
they scale poorly to large vocabularies and consume a lot of time for training.
They have proven computationally expensive for both training and translation, and
most systems have difficulty with exceptions and rare words.
These issues have hindered the deployment of this approach where accurate
results are required.
Going further, Google’s Neural Machine Translation system has attempted to address
many of these issues. Its models are based on deep Long Short-Term Memory (LSTM)
networks with 8 encoder and 8 decoder layers, using attention and residual connections. This
design improves parallelism and thereby decreases training time. The attention
mechanism of Google NMT connects the bottom layer of the decoder to the top layer of the
encoder.
To increase translation speed, low-precision arithmetic is used for computations.
Rare words are handled by dividing words into a limited set of common units
called wordpieces for both input and output, providing a good balance between the
flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
Finally, a beam search technique with a length-normalization procedure
and a coverage penalty generates output sentences that most likely cover all
the words in the source sentence [22].
2.7.1 Feed Forward Network
The feed-forward neural network is the simplest type of artificial neural network,
based on a simple design. It has an input layer, hidden layers,
and an output layer, and information always travels in one direction from input to output
layer without forming a loop or cycle [23]. Training is supervised: input
examples are fed to the network together with their labeled outputs.
In a feed-forward network, training proceeds on labeled images until the classification
error is reduced; the network then uses the trained model to categorize data it
has never seen.
Exposed to an arbitrary collection of photographs, such a trained network classifies
each image separately, treating every image as an individual input with no perception
of past inputs.
2.7.2 Recurrent Neural Network
Recurrent networks, on the other hand, take as input not just the current
example they see, but also what they have perceived previously in time. A recurrent neural
network is a multi-layered neural network in which information is stored in context nodes,
allowing it to learn sequences of input data and generate output sequences. In simple
terms, the connections between nodes form loops [24].
For example, consider the input sentence "Where is the ...?", where we must predict the
next word.
The RNN neurons first receive a signal marking the beginning of the sentence. The network
takes "Where" as input and produces a vector of numbers. This vector is fed back
to the neurons to give the network memory; this lets it remember the
word "Where" and that it occupies the first position. The network proceeds similarly through the
following words: it takes "is" and "the", and the state of the neurons is refreshed after
each word. The network then assigns a likelihood to every English word that could
complete the sentence; a well-trained recurrent neural network will most likely
assign high likelihoods to words such as "café", "drink", or "burger".
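The feedback step described above, where the output vector is fed back to give the network memory, can be sketched as a single recurrent update. The weights and toy word encodings below are illustrative values, not a trained model.

```python
import math

# Sketch of one recurrent step: the hidden state h carries memory of
# everything seen so far. Weights are fixed toy values, not trained ones.

def rnn_step(h, x, w_h=0.5, w_x=1.0):
    # The new state mixes the previous state (memory) with the current input.
    return math.tanh(w_h * h + w_x * x)

h = 0.0  # initial state at the start of the sentence
for x in [0.3, 0.7, 0.2]:  # stand-ins for the encodings of "where", "is", "the"
    h = rnn_step(h, x)
# h now summarizes the sentence so far; in a full model it would feed a
# softmax over the vocabulary to score candidate next words.
```

The key point is that `h` after the loop depends on all three inputs in order, unlike a feed-forward network that sees each input in isolation.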
Common uses of recurrent neural networks:
• Helping securities traders generate analytic reports.
• Detecting anomalies in financial statement data.
• Detecting fraudulent credit card transactions.
• Generating captions for images.
• Powering chatbots.
RNNs are the standard choice when practitioners work with
time-series data or sequences (e.g. audio recordings or text).
2.7.3 Convolutional Neural Network
A convolutional neural network is a multi-layered neural network with an architecture
designed to extract increasingly complex features of the data at each layer
in order to determine the output. This approach is generally used when there is an
unstructured data set (e.g. images) and information must be extracted from it.
For example, suppose the task is to predict an image caption. The network receives an image
of, say, a cat; in mathematical terms, this image is an array of pixels, generally
one channel for a grey-scale image and three channels for a color image.
During feature learning (i.e. in the hidden layers), the network identifies distinctive
features, for example the tail of the cat, the ears, and so on.
Once the network has fully learned to recognize an image, it can
assign a likelihood to each class it knows. The label with the highest likelihood
becomes the network's prediction.
2.7.4 Transformer Model
RNN-based models are difficult to parallelize and can have difficulty learning long-range
dependencies within the input and output sequences.
The Transformer models all of these dependencies using attention mechanisms.
Fig. 2.6 Recurrent Neural Network
Image Source: sdl.com
Rather than using a single pass of attention, the Transformer uses multiple "heads".
Moreover, the Transformer uses layer normalization and residual connections, which make
optimization easier. Attention by itself cannot use the input positions; to fix this, the Transformer
adds explicit positional encodings to the input embeddings [25].
The attention mechanism in the Transformer can be interpreted as computing the
relevance of a set of values (information) based on certain keys and queries. Essentially,
the attention mechanism is used as a way for the model to focus on relevant information
given what it is currently processing.
Classically, the attention weights represented the importance of the encoder hidden states (values)
in computing the decoder state, and were calculated from the encoder hidden states
(keys) and the decoder hidden state (query).
As can be seen, a single attention head has a very simple
structure: it applies a distinct linear transformation to its input queries, keys, and values,
computes the attention score between each query and key, and then uses it to weight the
values and sum them up. The multi-head attention block simply applies several such heads
in parallel, concatenates their outputs, and then applies a single final linear transformation [26].
Scaled Dot Product Attention
For its attention mechanism, the Transformer uses a specific form of attention
called "Scaled Dot-Product Attention", computed by the following equation:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
The essential attention operation is a dot product between the query and the key. The magnitude of
the dot product tends to grow with the dimensionality of the query and key vectors,
however, so the Transformer rescales the dot product by √d_k to keep it from exploding into huge
values [26].
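The formula above can be made concrete with a small sketch for a single query over a handful of key/value vectors. This is an illustrative toy example (plain lists, two-dimensional vectors, made-up numbers), not the thesis's implementation.

```python
import math

# Sketch of scaled dot-product attention for one query.

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, keys, values):
    d_k = len(q)
    # Dot product of the query with each key, rescaled by sqrt(d_k)
    # so the scores do not blow up as the dimension grows.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
              for k in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

out = attention(q=[1.0, 0.0],
                keys=[[1.0, 0.0], [0.0, 1.0]],
                values=[[10.0, 0.0], [0.0, 10.0]])
# The query matches the first key more strongly, so the output
# leans toward the first value vector.
```

In the full Transformer the same computation runs for every query position at once, as matrix products over Q, K, and V.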
2.7.5 Transformer Architecture
The Transformer still uses the basic encoder-decoder structure of
conventional neural machine translation systems: the left-hand side of the architecture is the encoder and
the right-hand side is the decoder. The initial inputs to the encoder are the
embeddings of the input sequence, and the initial inputs to the decoder are the embeddings
of the outputs produced up to that point [26].
Encoder
The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers.
The first is a multi-head self-attention mechanism, and the second is a simple,
position-wise fully connected feed-forward network.
A residual connection is used around each of the two sub-layers, followed by layer
normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where
Sublayer(x) is the function implemented by the sub-layer itself [26].
To facilitate these residual connections, all sub-layers in the model, as well as the embedding
layers, produce outputs of dimension dmodel = 512 [26].
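The sub-layer wrapper LayerNorm(x + Sublayer(x)) can be sketched directly. This is a toy illustration with plain lists and a stand-in sub-layer, not the real attention or feed-forward blocks, and it omits the learned gain/bias parameters of full layer normalization.

```python
import math

# Sketch of the Transformer sub-layer wrapper: residual add, then
# normalization across the feature dimension (no learned scale/shift).

def layer_norm(x, eps=1e-6):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer_wrapper(x, sublayer):
    # Residual connection: add the sub-layer's output to its own input,
    # then normalize the result.
    return layer_norm([xi + si for xi, si in zip(x, sublayer(x))])

# Toy sub-layer: halve every feature (stands in for attention or FFN).
out = sublayer_wrapper([1.0, 2.0, 3.0, 4.0], lambda x: [v * 0.5 for v in x])
# out now has zero mean and (approximately) unit variance.
```

Because the residual path adds x back in unchanged, gradients can flow around the sub-layer, which is what makes stacks of 6 such layers trainable.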
Decoder
The decoder is likewise composed of a stack of N = 6 identical layers. In addition
to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer,
which performs multi-head attention over the output of the encoder stack. As in the encoder,
residual connections are used around each of the sub-layers, followed by layer normalization.
The self-attention sub-layer in the decoder stack is also modified to prevent positions
from attending to subsequent positions. This masking, combined with the fact that the output
embeddings are offset by one position, ensures that the predictions for position
i can depend only on the known outputs at positions less than i [26].
Positional Encodings
Since the model contains no recurrence and no convolution, in order for it to make use of
the order of the sequence, some information about the relative position of the
tokens in the sequence must be injected.
To this end, "positional encodings" are added to the input embeddings at the bottoms
of the encoder and decoder stacks. The positional encodings have the same dimension
dmodel as the embeddings, so that the two can be summed. There are many possible
choices of positional encodings, both learned and fixed.
In this work, sine and cosine functions of different frequencies are used:
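The equation referenced here did not survive the document conversion; the standard formulation from the Transformer paper [26] is:

```latex
\[
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)
\]
```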
where pos is the position and i is the dimension. That is, each dimension of the positional
encoding corresponds to a sinusoid.
This function was chosen on the hypothesis that it would allow the model to easily learn
to attend by relative positions, since for any fixed offset k, PE(pos+k) can be
represented as a linear function of PE(pos) [26].
2.7.6 Open NMT
OpenNMT is a generic deep learning framework, mainly specialized in
sequence-to-sequence models covering a variety of tasks, for example
machine translation, image to text, summarization, and speech recognition. The framework
has also been extended to other non-sequence-to-sequence tasks such as
language modeling and sequence tagging.
The toolkit prioritizes efficiency, modularity, and extensibility, with the goal of supporting
neural machine translation research into model architectures, feature representations,
and source modalities, while maintaining competitive performance and reasonable training
requirements. The toolkit provides modeling and translation support, as well as detailed
academic documentation of the underlying techniques [27].
OpenNMT was designed to achieve the following three goals:
• Prioritize training and test efficiency.
• Maintain model modularity and readability.
• Support research extensibility.
Applications of OpenNMT
• Summarization
The models are trained exactly like NMT models; only the nature of the
training data differs: the source corpus consists of full-length documents or articles, and
the targets are summaries.
• Image to text
Im2Text, created by Yuntian Deng from the Harvard NLP group, implements a
generic image-to-text application on top of the OpenNMT libraries for visual markup
decompilation. The main modification to vanilla OpenNMT is an encoder
introducing CNN layers in combination with an RNN.
• Speech recognition
While OpenNMT is not primarily targeting speech recognition applications, its
ability to support input vectors and pyramidal RNNs makes possible end-to-end
experiments on speech-to-text applications, as described for example in Listen, Attend and Spell.
• Sequence tagging
A sequence tagger is available in OpenNMT. It has the same encoder architecture
as a sequence-to-sequence model but does not need a decoder, since each
input token is matched with an output; a sequence tagger only needs an encoder
and a generation layer. Sequence tagging can be used for any annotation task,
for example part-of-speech tagging.
– To train a sequence tagger, preprocess the parallel data with
source and target sequences of the same length (the -check_plength
option can be used).
– Train the model with -model_type seqtagger.
– Use the model with tag.lua.
• Language modelling
A language model is fundamentally similar to a sequence tagger. The main
difference is that the output "tag" for each token is the following word in the source
sentence.
– Preprocess the data with -data_type monotext.
– Train the model with -model_type lm.
– Use the model with lm.lua.
Chapter 3
Methodology
3.1 Building SMT Model using Moses
3.1.1 Moses - An open source SMT toolkit
The Moses toolkit was developed in 2005 by the Edinburgh MT group to train statistical
models of text translation from a source language to a target language. The
tool then decodes source-language text, producing automatic translations in the target language
[28].
Parallel corpora containing source- and target-language text are required to train the model,
which uses co-occurrences of words and segments to infer translation correspondences
between the two languages of interest.
Moses is described as an open-source toolkit for statistical machine translation whose
novel contributions are support for linguistically motivated factors, integration of confusion
network decoding, and efficient data formats for translation models, which allow
large data to be processed with limited hardware.
The toolkit also includes a wide variety of tools for training, tuning, and applying the
system to many translation tasks, and finally for evaluating the resulting translations using the BLEU
score [28].
The Training Pipeline
The training pipeline comprises a collection of tools that take raw data as input and generate a machine
translation model. The various stages involved are implemented as a pipeline
and controlled by the Moses experiment management system.
Moses is also compatible with different types of external tools in the training
pipeline. The initial step is data preparation: the data is cleaned using heuristics to remove
misaligned and overly long sentence pairs.
GIZA++ is then used to word-align the parallel sentences, from which
phrase-based translations or hierarchical rules are extracted. Moses uses external tools to build a
language model from the monolingual data in the target language, which is used by
the decoder to ensure fluent output. The penultimate step is tuning, in which the statistical
models are weighted against each other to generate the best translations [28].
Fig. 3.1 Statistical Machine Translation using Moses
Decoder
The decoder is a C++ application that takes a trained machine translation model
and a source sentence as input and translates the source sentence into the
target language. The decoder finds the highest-scoring sentence in the target language
corresponding to a given source sentence; it can also output a ranked list of
translation candidates and provide information about its decisions.
The decoder is written in a modular fashion and allows the user to vary the decoding
process in various ways, such as:
• Input: This is generally a plain sentence, but it can also be annotated with XML-like elements,
or given as a structure such as a lattice or confusion network.
• Translation model: This is based on phrase-phrase rules or hierarchical rules, and can
be compiled into a binarised form for swift loading. Additional features that improve
reliability by indicating the source of the phrase pairs can also be added.
• Decoding algorithm: Moses implements several different strategies for decoding,
such as stack-based search, cube pruning, and chart parsing, to ease the search.
• Language model: Language model toolkits such as SRILM, KenLM, IRSTLM, and RandLM
are supported by Moses.
3.2 Build Neural Machine Translation Model using OpenNMT
3.2.1 Transformer Architecture
The Transformer has a stack of 6 encoder and 6 decoder layers. Unlike Seq2Seq, the encoder
contains two sub-layers: a multi-head self-attention layer and a fully connected feed-forward
network.
The decoder contains three sub-layers: a multi-head self-attention layer, an extra layer
that performs multi-head attention over the encoder outputs, and a fully connected
feed-forward network.
3.2.2 Encoder and Decoder Input
All input and output tokens to the encoder/decoder are converted to vectors using learned
embeddings. These input embeddings are then passed to the positional encoding.
Positional Encoding
The Transformer architecture contains no recurrence or convolution and hence has
no notion of word order. All the words of the input are fed to the network with
no special order or position, as they all flow simultaneously through the encoder
and decoder stacks. Yet to comprehend the meaning of a sentence, it is essential to know
the position and order of its words.
Positional encoding is therefore added to the model to inject information about the absolute
positions of the words in the sentence. It has the same dimension as the input embedding, so
that the two can be summed.
3.2.3 Self Attention
A self-attention layer connects all positions with a constant number of sequentially executed
operations and is hence faster than recurrent layers.
An attention function in a Transformer is described as mapping a query and a set of
key-value pairs to an output, where the query, keys, and values are all vectors. Attention weights are
calculated using scaled dot-product attention for each word in the sentence; the final
score is the weighted sum of the values.
Fig. 3.2 Neural Machine Translation Process
1. Dot product
Take the dot product of the query and key for each word in the sentence. The dot product
determines how much focus to place on other words in the input sentence.
2. Scale
Scale the dot product by dividing by the square root of the dimension of the key
vector. The dimension is 64, so we divide the dot product by 8.
3. Apply softmax
Softmax normalizes the scaled values. After applying softmax, all the
values are positive and sum to 1.
4. Calculate the weighted sum of the values
The normalized scores are multiplied with the value vectors and then
summed. The above steps are repeated for all words in the sentence.
3.2.4 Multi Head Attention
Rather than using a single attention function, where the attention could be dominated by
the actual word itself, Transformers use multiple attention heads. Each attention head
applies its own linear transformation to the same input representation.
The Transformer uses eight different attention heads, which are computed in parallel.
With eight different attention heads, we have eight distinct sets of query, key,
and value projections for the encoder and decoder, and each of these
sets is initialized randomly.
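The per-head projections and the concatenation of head outputs can be sketched with toy dimensions. This is an illustrative example only: two-dimensional vectors, random (untrained) projection matrices, and attention computed for just the first position; a real Transformer uses learned projections, a final output projection, and runs over every position.

```python
import math
import random

# Sketch of multi-head attention: each head projects the shared input
# with its own matrices, runs scaled dot-product attention, and the
# head outputs are concatenated.

def attend(q, keys, values):
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    w = [e / total for e in exps]
    return [sum(wi * v[j] for wi, v in zip(w, values))
            for j in range(len(values[0]))]

def project(x, W):  # one linear transformation (no bias)
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]

def multi_head(x_seq, heads):
    out = []
    for Wq, Wk, Wv in heads:  # each head has its own projections
        qs = [project(x, Wq) for x in x_seq]
        ks = [project(x, Wk) for x in x_seq]
        vs = [project(x, Wv) for x in x_seq]
        # Self-attention for the first position only, as an example.
        out.extend(attend(qs[0], ks, vs))  # concatenate head outputs
    return out

random.seed(0)
def rand_mat(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

heads = [(rand_mat(2, 2), rand_mat(2, 2), rand_mat(2, 2)) for _ in range(8)]
out = multi_head([[1.0, 0.0], [0.0, 1.0]], heads)  # 8 heads x dim 2 = 16 values
```

With eight heads each producing a 2-dimensional output here, the concatenated result has 16 entries; the real model then mixes these with one final linear transformation.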
3.2.5 Masked Multi Head Attention
The decoder uses masked multi-head attention, which masks or blocks the decoder inputs
from future steps: during training, the multi-head attention of the decoder conceals the
future decoder inputs.
For the machine translation task of translating the sentence "I appreciate nature" from English
to Hindi using the Transformer, the decoder will consider all the input words
"I, appreciate, nature" to predict the first word.
Residual connections
These are "skip connections" that permit gradients to flow through the network without going
through the non-linear activation functions. Residual connections help avoid
vanishing or exploding gradient problems.
For residual connections to work, the outputs of all sub-layers in the model must have the
same dimension; all sub-layers in the Transformer produce outputs of dimension 512.
Layer Normalization
Layer normalization normalizes the inputs across the features and is independent of the other
examples in a batch; it reduces training time in feed-forward neural networks. In layer
normalization, the mean and variance are computed from all of the summed inputs to the neurons
in a layer on a single training case.
Chapter 4
Objectives and Requirements
4.1 Goals
The primary goals of this project are:
• To gain an understanding of statistical machine translation and deep neural networks,
and how they are used to carry out translation between languages.
• To build a machine translation framework by creating and training a Statistical
Machine Translation model and a deep-learning Transformer model, and to see how they
perform on a machine with standard processing power.
• To experiment with the hyper-parameters, training data size, and number of training
steps, and compare the accuracy of the results for various combinations.
4.2 Software Setup
The project relied heavily on the hardware and software support provided by
the computer, which had the following specifications:
• Ubuntu 16.04 with the latest drivers and packages to provide a development
environment for the implementation.
• An NVIDIA GTX 970 GPU and 6 GB RAM to provide hardware support for the
implementation.
• Internet access to download the freely available open-source Marathi Bible files for the
training and cross-validation process.
• Python version 3.6.
4.3 Dataset
The dataset for training and testing for the Marathi language was procured from:
http://opus.nlpl.eu/bible-uedin-v1.php
The dataset is a parallel corpus of the Marathi and English languages, with 60,876 sentence
pairs and 2.70M words.
Chapter 5
Building Statistical Machine Translation Model
5.1 Introduction
Moses is one of the most widely used statistical machine translation frameworks. It is a complete
framework with a built-in decoder that can be used with several alignment algorithms.
Moses is the SMT framework we used to train the Marathi-to-English translation
model. In the following sections, we describe the steps followed to create, train, and test the
model using Moses.
5.2 Baseline System
After successfully installing Moses and the other required software (GIZA++, Boost, and
so on), we used it to train a Marathi-to-English translation model, using the Marathi
Bible as the Marathi corpus and the King James Version (KJV) Bible as the English corpus.
Below is a passage of the first three verses of the New Testament in both the
Marathi and the English versions.
As can be seen from the above passage of a parallel corpus, we need two
files containing equivalent texts in the two languages: the target language and the source language.
The content of the two files must correspond line by line: line 100 in the target-language
file should be the translation of line 100 in the file containing the
source language.
For this project, as we set out to build a Marathi-to-English translation system,
we started with two separate files, one containing the Marathi Bible and the other
containing the King James Version English Bible. When using Moses, the first phase in
training a translation model is called "corpus preparation".
5.3 Corpus Preparation
Corpus preparation comprises three stages: tokenization, truecasing, and cleaning. During
tokenization, spaces are inserted between all words and punctuation so that
different forms of the same word are treated as one.
In the next stage, Moses uses a truecasing script, also known as the truecaser,
to compute the frequency ratios of how often a particular word is lower-cased
compared to capitalized.
This is significant because, without this step, it would be practically impossible for
the translation system to tell whether words at the beginning of a sentence are capitalized
because they are usually capitalized (proper names) or merely because they stand
at the beginning of a sentence.
The last but significant step is cleaning. In this step, a sentence pair is
removed from the training data if one of its sentences has a length greater
than a set limit, or if the ratio of the lengths of its sentences exceeds the
ratio set for the training data.
The limiting length is determined by the structure of the languages being
dealt with and the quality/size of the parallel corpora being used.
5.4 Language Model Training
Command to Build Language Model
In this step, we used Moses's built-in KenLM tool to build a 3-gram target language model
from the corpus. In our case of Marathi-to-English translation, English is the target
language, so we used the English Bible corpus file created by the truecaser.
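The screenshot of the command did not survive conversion; a typical invocation, following the Moses baseline tutorial (the file names and paths are assumptions, not taken from the thesis), would look like:

```shell
# Build a 3-gram language model with KenLM's lmplz from the truecased
# English corpus, then binarize it so it loads faster.
~/mosesdecoder/bin/lmplz -o 3 < corpus.true.en > corpus.arpa.en
~/mosesdecoder/bin/build_binary corpus.arpa.en corpus.blm.en
```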
At this point, there is no need to use the output of the cleaning phase of the corpus
preparation process: a language model depends only on the structure of the target language
being used, English in this case, and not on its equivalent translation in the source
language, Marathi.
Consequently, there is no compelling reason to consider the effects of the sentence
length limit and the limiting ratio used to filter data during the cleaning phase of the
corpus preparation process. After building the English language model, we used
the Moses binarizing script to transform the file containing the English language model
into a binary form that loads faster.
At this stage, we can use it to obtain the probability that any input sentence is
English, according to the language model that we built exclusively
from the Bible data.
5.5 Training the Translation System
Command to train the SMT model:
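The command screenshot is missing here; a representative invocation, based on the Moses baseline tutorial (corpus names, language codes, and paths are assumptions), would be:

```shell
# Train the phrase-based model with GIZA++ word alignment.
nohup ~/mosesdecoder/scripts/training/train-model.perl \
    -root-dir train -corpus corpus/bible.clean -f mr -e en \
    -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
    -lm 0:3:$HOME/lm/corpus.blm.en:8 \
    -external-bin-dir ~/mosesdecoder/tools >& training.out &
```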
Now that we have built our target language model, it is time to begin training the
translation system. For this step, we used Moses's default word-alignment tool,
GIZA++. After running the commands for this step, Moses produced a moses.ini
configuration file that can be used to translate any Marathi sentence to English.
There are two main issues that must be looked at. The first is that
translation takes a long time; to fix this, we binarise the phrase and reordering
tables. The second is that the weights in our model configuration file are not
balanced, i.e. they are dependent on the Bible data we used to train the model.
In the following subsections, we tune the model to make it better adjusted and less
dependent on the data used to train it.
5.6 Tuning
Command to Tune the SMT model:
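The tuning command screenshot is missing; a representative invocation, following the Moses baseline tutorial (development-set file names are assumptions), would be:

```shell
# Tune the model weights with MERT on a held-out development set.
~/mosesdecoder/scripts/training/mert-moses.pl \
    dev.true.mr dev.true.en \
    ~/mosesdecoder/bin/moses train/model/moses.ini \
    --mertdir ~/mosesdecoder/bin/
```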
5.7 Binarising Phrase and Reordering Tables
Command to Binarise and reorder tables:
When the tuning procedure is finished, it is advisable to binarise the phrase and
reordering tables in the translation model using Moses's tools.
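The corresponding command screenshot is missing; a representative invocation from the Moses baseline tutorial (table file names depend on the reordering configuration and are assumptions here) would be:

```shell
# Compact the phrase and reordering tables into a fast-loading binary format.
~/mosesdecoder/bin/processPhraseTableMin \
    -in train/model/phrase-table.gz -nscores 4 \
    -out binarised-model/phrase-table
~/mosesdecoder/bin/processLexicalTableMin \
    -in train/model/reordering-table.wbe-msd-bidirectional-fe.gz \
    -out binarised-model/reordering-table
```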
5.8 Testing
Now that we have completed the essential steps of building, training, and tuning a translation
model using Moses, we can use it to do some basic translations. To do this, we simply
run the decoder from the terminal to translate a file containing sentences
in Marathi into English. The sentences in the input file must be in the same format as
in the training and tuning stages. Below is the Marathi input file that we used to test our
translation model, followed by the generated English file.
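The decoding command itself was lost in conversion; a minimal sketch (file names and the tuned configuration path are assumptions) would be:

```shell
# Translate a Marathi input file with the tuned model.
~/mosesdecoder/bin/moses -f mert-work/moses.ini < input.mr > output.en
```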
5.9 Results and Analysis
The previous sections show the results produced by the translation model: given an input file
containing Marathi text, an output file containing English text is generated. We
trained the model on a large parallel corpus of the Marathi and English Bibles, with 60,876
sentence pairs and 2.7M words. The translation obtained is noticeably accurate and can be
termed successful.
The BLEU score obtained for the SMT model is 27.17 which is moderate and satisfactory.
BP      ratio   hyp-len   ref-len   BLEU
0.728   0.759   44678     58852     27.17
Table 5.1 BLEU Score for SMT Model
Chapter 6
Building Deep Neural Machine Translation Model
6.1 Introduction
OpenNMT is a complete library for training and deploying neural machine translation
models. The system is a successor to seq2seq-attn developed at Harvard and has been
rewritten for efficiency, readability, and generalizability. It includes vanilla
NMT models along with support for attention, gating, stacking, input feeding, regularization,
and beam search.
The original system is implemented in the Lua/Torch mathematical framework and can be
easily extended using Torch's standard neural network components.
6.2 Setup of Required Modules
The main package required for training a custom translation system is
PyTorch, in which the OpenNMT-py models are implemented [29].
The preliminary step is to clone the OpenNMT-py repository:
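The cloned screenshot is missing; the standard steps (the repository URL is the official one; installing the listed requirements is an assumption about the setup used) are:

```shell
git clone https://github.com/OpenNMT/OpenNMT-py.git
cd OpenNMT-py
pip install -r requirements.txt
```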
6.3 Corpus Preparation
The dataset consists of a parallel corpus of source- and target-language files containing one
sentence per line, with each token separated by a space. We used
parallel corpora of Marathi and English sentences stored in separate files.
6.4 Pre-Processing Text Data
To pre-process the training and validation data and extract features to generate the vocabulary
files, we used the following command. The data consists of parallel source and target files
which contain one sentence per line, with the tokens separated by spaces.
The following files are generated after running the preprocessing:
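A representative invocation of the legacy OpenNMT-py preprocessing script is sketched below; the file names (src-train.txt, etc.) and the data prefix data/demo are placeholders, not the exact names used in this work:

```shell
# Build the vocabularies and serialized train/validation data.
python preprocess.py \
    -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
    -valid_src data/src-val.txt   -valid_tgt data/tgt-val.txt \
    -save_data data/demo
# Typically produces data/demo.vocab.pt plus serialized
# train/valid .pt files under the chosen prefix.
```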
6.5 Training the Translator Model
The training command is straightforward to use: it takes as input a data file and a save file.
It runs the default model, which consists of a 2-layer LSTM with 500 hidden units
in both the encoder and the decoder.
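A representative training invocation under the same assumptions (the data prefix and model name are placeholders; without further flags this runs the default 2-layer, 500-unit LSTM encoder-decoder):

```shell
# Train the default LSTM encoder-decoder on the preprocessed data.
python train.py -data data/demo -save_model demo-model
# Add -world_size 1 -gpu_ranks 0 to train on a single GPU.
```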
6.6 Translate
The following command performs an inference step on unseen text in the source language
(Marathi) and produces the corresponding predicted translations.
The translated output is generated and the predictions are stored in the pred.txt file.
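A representative inference invocation is sketched below; the checkpoint name depends on the training step at which the model was saved and is a placeholder here:

```shell
# Translate unseen Marathi sentences with a saved checkpoint.
python translate.py -model demo-model_step_100000.pt \
    -src data/src-test.txt -output pred.txt -replace_unk -verbose
```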
6.7 Testing
Having completed the basic steps of pre-processing, training, and translating an
NMT-based model using the OpenNMT toolkit, we can use it to perform some basic
translations. To do this, we simply run the terminal command to translate a
document containing Marathi sentences into English. The sentences in the input file must be
in the same format as that used in the training and tuning stages.
Following is the Marathi input file that we used to test our translation model,
followed by the generated English translation.
6.8 Results and Analysis
The section above presents the translation results in the generated output file: the model
produces an English translation for each corresponding input sentence in the Marathi corpus.
The BLEU score obtained for NMT is 43.74, which leads us to conclude that the NMT model
has performed better than the SMT model.
BP ratio hyp-len ref-len BLEU
0.953 0.954 44481 46631 43.74
Table 6.1 BLEU Score for NMT Model
Chapter 7
Evaluation and Analysis
7.1 Evaluation
Human evaluations of machine translation are thorough but expensive. They also take a
long time to complete and involve human labor that cannot be reused. In [30], Papineni et
al. proposed a method for automatic machine translation evaluation that is quick,
inexpensive, and language-independent, that correlates highly with human evaluation,
and that has little marginal cost per run. This method serves as an automated
understudy to skilled human judges, substituting for them when there is a
need for quick or frequent evaluations.
7.1.1 Bilingual Evaluation Understudy Score
The Bilingual Evaluation Understudy Score, or BLEU score, is an evaluation
metric for machine translation systems that compares a generated sentence with a
reference sentence. A perfect match in this comparison results in a BLEU score of 1.0,
while a complete mismatch results in a BLEU score of 0.0. The BLEU score is a well-
balanced metric for evaluating translation models, as it is language-independent,
easy to interpret, and correlates highly with manual evaluation.
The BLEU score is computed by counting the n-grams in the candidate translation that
match n-grams in the reference text. The position of a matching n-gram within the
sentence is not considered in this comparison.
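The counting described above can be sketched as a minimal, self-contained BLEU computation: clipped (modified) n-gram precisions up to 4-grams, combined by a geometric mean and scaled by a brevity penalty. The function name `bleu` and the example sentences are illustrative; established implementations such as sacrebleu or NLTK should be preferred in practice.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Multiset of n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: each candidate n-gram counts at most as
        # often as it appears in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if matches == 0:
            return 0.0  # geometric mean collapses to zero
        log_precisions.append(math.log(matches / total))
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # → 1.0
```

Note that this plain sentence-level form is harsh on short sentences (any missing 4-gram yields a score of zero); the corpus-level BLEU reported in Tables 5.1 and 6.1 aggregates n-gram counts over the entire test set before computing the precisions.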
7.2 Analysis of SMT and NMT models
To compare the two Machine Translation (MT) models examined in this work, SMT
(Statistical Machine Translation) and NMT (Neural Machine Translation), it is important
to understand how the two models are implemented, what kind of raw data they require, and
what kind of results to expect when using them. In addition to that, it is
important to consider the amount of effort it would take to improve or scale each of
the two models.
7.2.1 Using Data and Implementing Model
The principal difference between SMT and NMT is the kind of data used in their
implementations. The Moses SMT model that we implemented uses parallel corpora (translated
sentence pairs) from the two languages as its primary input data. In contrast, the NMT
model that we implemented using OpenNMT can be trained directly on Marathi and
English text, without the pipeline of specialized subsystems used in SMT.
7.2.2 Efficiency
SMT is data-driven, requiring only a corpus of examples with both source and target
language text. In contrast, neural machine translation systems are said to be
end-to-end systems, as only one model is required for the translation.
7.2.3 Accuracy
The results obtained after implementing the NMT model showed higher accuracy than the
SMT model. Thus, given a large parallel corpus, the NMT transformer model
produces more reliable output. The BLEU score of 43.74 for the NMT model versus 27.17
for the SMT model supports this conclusion.
Chapter 8
Conclusion and Future Work
Our investigation reveals that an out-of-the-box NMT system, trained on a parallel
corpus of Marathi-to-English text, achieves much higher translation quality than a
custom-fitted SMT system. These results are quite surprising, given that Marathi
exhibits many of the known difficulties that NMT currently struggles with (data
scarcity, long sentences, and rich morphology).
In future experiments, we would like to explore strategies for adapting NMT to a specific
domain and language pair. A potential avenue of research is the incorporation of linguistic
features in NMT.
Finally, it will be important in the future to include human evaluation in our exper-
iments, to ensure that MT systems intended for public use are
optimized to support the work of a human translator, and are not merely tuned to
automatic metrics.
References
[1] Wikipedia contributors. Machine translation — Wikipedia, The Free Encyclopedia.
[Online; accessed 4-May-2020 ]. 2020. URL: https://en.wikipedia.org/w/index.php?
title=Machine_translation&oldid=953518509.
[2] Wikipedia contributors. Rule-based system — Wikipedia, The Free Encyclopedia.
[Online; accessed 4-May-2020 ]. 2020. URL: https://en.wikipedia.org/w/index.php?
title=Rule-based_system&oldid=948096750.
[3] Wikipedia contributors. Statistical machine translation — Wikipedia, The Free Ency-
clopedia. [Online; accessed 4-May-2020 ]. 2020. URL: https://en.wikipedia.org/w/
index.php?title=Statistical_machine_translation&oldid=950991925.
[4] Wikipedia contributors. Natural language processing — Wikipedia, The Free Encyclo-
pedia. [Online; accessed 4-May-2020]. 2020. URL: https://en.wikipedia.org/w/index.
php?title=Natural_language_processing&oldid=954334473.
[5] Wikipedia contributors. Natural-language understanding — Wikipedia, The Free
Encyclopedia. [Online; accessed 7-May-2020]. 2020. URL: https://en.wikipedia.org/
w/index.php?title=Natural-language_understanding&oldid=954266182.
[6] Elizabeth D Liddy. “Natural language processing”. In: (2001).
[7] Mohamed Amine Chéragui. “Theoretical overview of machine translation”. In: Pro-
ceedings ICWIT (2012), p. 160.
[8] John Hutchins. “Machine translation: A concise history”. In: Computer aided transla-
tion: Theory and practice 13.29-70 (2007), p. 11.
[9] Jonathan Slocum. “A survey of machine translation: its history, current status, and
future prospects”. In: Computational linguistics 11.1 (1985), pp. 1–17.
[10] C Poornima et al. “Rule based sentence simplification for english to tamil machine
translation system”. In: International Journal of Computer Applications 25.8 (2011),
pp. 38–42.
[11] W3Techs. Usage Statistics of Content Languages for Websites. Last accessed 16
September 2017. 2017. URL: https://www.freecodecamp.org/news/a-history-of-
machine-translation-from-the-cold-war-to-deep-learning-f1d335ce8b5/.
[12] MD Okpor. “Machine translation approaches: issues and challenges”. In: International
Journal of Computer Science Issues (IJCSI) 11.5 (2014), p. 159.
[13] Shachi Dave, Jignashu Parikh, and Pushpak Bhattacharyya. “Interlingua-based English–
Hindi machine translation and language divergence”. In: Machine Translation 16.4
(2001), pp. 251–304.
[14] John Hutchins. “Towards a definition of example-based machine translation”. In:
Machine Translation Summit X, Second Workshop on Example-Based Machine Trans-
lation. 2005, pp. 63–70.
[15] Eiichiro Sumita and Hitoshi Iida. “Experiments and prospects of example-based
machine translation”. In: Proceedings of the 29th annual meeting on Association for
Computational Linguistics. Association for Computational Linguistics. 1991, pp. 185–
192.
[16] Adam Lopez. “Statistical machine translation”. In: ACM Computing Surveys (CSUR)
40.3 (2008), pp. 1–49.
[17] Philipp Koehn. Statistical machine translation. Cambridge University Press, 2009.
[18] Eugene Charniak, Kevin Knight, and Kenji Yamada. “Syntax-based language models
for statistical machine translation”. In: Proceedings of MT Summit IX. Citeseer. 2003,
pp. 40–46.
[19] Philipp Koehn, Franz Josef Och, and Daniel Marcu. “Statistical phrase-based transla-
tion”. In: Proceedings of the 2003 Conference of the North American Chapter of the
Association for Computational Linguistics on Human Language Technology-Volume 1.
Association for Computational Linguistics. 2003, pp. 48–54.
[20] John Kelleher. “Fundamentals of machine learning for neural machine translation”. In:
(2016).
[21] Fahimeh Ghasemi et al. “Deep neural network in QSAR studies using deep belief
network”. In: Applied Soft Computing 62 (2018), pp. 251–258.
[22] Yonghui Wu et al. Google’s Neural Machine Translation System: Bridging the Gap
between Human and Machine Translation. 2016. arXiv: 1609.08144 [cs.CL].
[23] Terrence L Fine. Feedforward neural network methodology. Springer Science &
Business Media, 2006.
[24] Larry R Medsker and LC Jain. “Recurrent neural networks”. In: Design and Applica-
tions 5 (2001).
[25] Martin Popel and Ondřej Bojar. “Training tips for the transformer model”. In: The
Prague Bulletin of Mathematical Linguistics 110.1 (2018), pp. 43–70.
[26] Ashish Vaswani et al. “Attention is all you need”. In: Advances in neural information
processing systems. 2017, pp. 5998–6008.
[27] Guillaume Klein et al. “Opennmt: Open-source toolkit for neural machine translation”.
In: arXiv preprint arXiv:1701.02810 (2017).
[28] Philipp Koehn et al. “Moses: Open source toolkit for statistical machine translation”.
In: Proceedings of the 45th annual meeting of the association for computational
linguistics companion volume proceedings of the demo and poster sessions. 2007,
pp. 177–180.
[29] Guillaume Klein et al. “OpenNMT: Open-Source Toolkit for Neural Machine Trans-
lation”. In: Proc. ACL. 2017. DOI: 10.18653/v1/P17-4012. URL: https://doi.org/10.
18653/v1/P17-4012.
[30] Kishore Papineni et al. “BLEU: a method for automatic evaluation of machine transla-
tion”. In: Proceedings of the 40th annual meeting on association for computational
linguistics. Association for Computational Linguistics. 2002, pp. 311–318.