The document proposes a real-time architecture using Apache Storm and Apache Kafka to apply natural language processing (NLP) tasks to streams of text data. It allows developers to inject NLP modules from different programming languages in a distributed, scalable, and low-latency manner. An experiment was conducted using OpenNLP, Fasttext and SpaCy modules on Bahasa Malaysia and English text, and Apache Storm achieved the lowest latency compared to other frameworks.
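The spout-to-bolt flow behind such a Storm topology can be sketched in plain Python: messages are pulled from a queue (standing in for a Kafka topic) and passed through pluggable NLP stages. The function names here are illustrative, not the paper's or Storm's actual API.

```python
from collections import deque

def tokenize_bolt(message):
    """A trivial NLP stage: whitespace tokenization."""
    return message.lower().split()

def count_bolt(tokens, counts):
    """A stateful stage: running token counts."""
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return counts

stream = deque(["Storm processes text", "Kafka buffers text streams"])
counts = {}
while stream:
    msg = stream.popleft()          # spout: pull the next message
    counts = count_bolt(tokenize_bolt(msg), counts)

print(counts["text"])  # 2: the token "text" appears in both messages
```

In a real deployment each bolt would run on its own workers, which is what makes the injected NLP modules distributed and low-latency.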
Improvement from proof of concept into the production environment cater for... - Conference Papers
This document discusses improvements made to the Trust Engine component of an authentication platform to improve performance and scalability. The Proof of Concept system was found to not meet scalability requirements due to the database architecture requiring multiple connections to retrieve and update user data. The improvements included consolidating configuration data, combining user tables, updating the process to perform analysis in memory without database connections, limiting stored login records, and changing to a JSON data format. Performance testing showed the new system completed processes on average 99% faster.
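The described optimization pattern - keep a user's recent login records in one JSON document, analyze in memory, and cap stored history instead of issuing multiple database round-trips - can be sketched as follows. All field names are invented for illustration.

```python
import json

MAX_RECORDS = 10  # limit on stored login records, as described

def record_login(user_doc_json, new_record):
    doc = json.loads(user_doc_json)
    doc["logins"].append(new_record)
    doc["logins"] = doc["logins"][-MAX_RECORDS:]   # keep only recent records
    # analysis happens in memory on the same document, no extra DB connections
    doc["avg_risk"] = sum(r["risk"] for r in doc["logins"]) / len(doc["logins"])
    return json.dumps(doc)

doc = json.dumps({"user": "alice", "logins": []})
for risk in (0.1, 0.3, 0.2):
    doc = record_login(doc, {"risk": risk})

print(json.loads(doc)["avg_risk"])  # ~0.2
```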
This document describes a web-based application that converts heterogeneous power quality data from measurement equipment into the standard PQDIF (Power Quality Data Interchange Format) format. The application was developed using Java technologies and has a three-tier architecture. It allows users to easily upload data files and metadata via a graphical interface. The application then generates the corresponding PQDIF file, which can be used by various power quality analysis software programs. A case study demonstrates the conversion of data from a MEG40 power acquisition device into PQDIF format using the application. The application provides a flexible way to centrally manage large volumes of power quality data from different sources in a standardized format.
Combining efficiency, fidelity, and flexibility in resource information services - Pvrtechnologies Nellore
The document discusses a resource information service that aims to provide high efficiency and fidelity without restricting flexibility. It presents a system that offers scalable key-based lookup functions using distributed hash tables for resource discovery. Previous systems either achieved high fidelity at low efficiency or vice versa. The proposed service claims to outperform other services by dramatically reducing overhead while significantly enhancing efficiency and fidelity based on extensive simulation and experimental results.
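The key-based lookup a distributed hash table provides can be illustrated with a minimal consistent-hashing ring: each resource key maps to the first node clockwise from its hash. This is the generic DHT idea, not the paper's specific protocol.

```python
import hashlib
from bisect import bisect_right

def h(key: str) -> int:
    """Hash a key onto a small integer ring."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % 2**16

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)

    def lookup(self, resource_key):
        """Return the node responsible for the key (first node clockwise)."""
        keys = [k for k, _ in self.ring]
        i = bisect_right(keys, h(resource_key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.lookup("gpu-cluster-17")   # deterministic, O(log n) lookup
print(owner)
```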
Software testing automation a comparative study on productivity rate of ope... - Conference Papers
This document compares the productivity of two open source automated software testing tools, Robot Framework 3.0 and Katalon Studio 7.0, for testing smart manufacturing applications. Ten subject matter experts tested their productivity using each tool across various stages of the software development lifecycle. Katalon Studio 7.0 was found to be significantly more productive than Robot Framework 3.0 based on statistical analysis of the time taken using each tool. The study provides guidance for selecting automated testing tools to improve productivity for software test engineers working in smart manufacturing.
Improved learning through remote desktop mirroring control - Conference Papers
The document describes a Wireless Stream Management System (WSMS) that allows a moderator (teacher) to remotely manage and control wireless screen mirroring from student devices to support collaborative learning. Key features of WSMS include allowing the teacher to select any student's laptop screen to project, enabling the teacher to remotely control the student's laptop, and distributing presentation content as images to student devices. The system architecture uses components such as a Wireless Screen Sender, Receiver, Administrator, and Controller. Performance tests showed the system used under 2 Mbps of bandwidth with latency under 173 ms and no major CPU utilization issues.
A New Framework for Information System Development on Instant Messaging for L... - TELKOMNIKA JOURNAL
The increasingly inexpensive Internet has spurred the growth of online information system services in various companies. Almost all such services are delivered as web or mobile applications. For small companies, these systems are harder to implement because they require a substantial budget for hosting, domains, and server hardware. The proposed solution is a framework for building information system services on top of Instant Messaging (IM) platforms such as Telegram, Line, or XMPP/Jabber, developed using the Design Science Research Methodology. The framework can transform existing information system services into chat services with RBAC roles, sessions, validation, and natural interaction through Indonesian-language conversations. Consisting of an Initiate layer, a business process and communication layer, a memory group, and an OLTP DBMS, the framework yields a low-cost solution for developing integrated information system services.
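The RBAC gating described above - each chat command is checked against the sender's role and session before the business layer runs - can be sketched like this. The roles, commands, and user handles are illustrative assumptions, not the framework's actual names.

```python
ROLE_PERMISSIONS = {
    "admin":    {"report", "approve", "query"},
    "employee": {"query"},
}

SESSIONS = {}  # chat user id -> role, set at login

def handle_message(user_id, command):
    """Route an IM command through session and RBAC checks."""
    role = SESSIONS.get(user_id)
    if role is None:
        return "please log in first"
    if command not in ROLE_PERMISSIONS.get(role, set()):
        return "permission denied"
    return f"running '{command}' for {role}"

SESSIONS["@budi"] = "employee"
print(handle_message("@budi", "query"))    # running 'query' for employee
print(handle_message("@budi", "approve"))  # permission denied
```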
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
This document discusses the implementation challenges of autonomous things and proposes a high-level architecture for a cloud robotics infrastructure to address these challenges. It explores existing platforms for autonomous things and identifies three main areas of complexity: development, execution, and operation. A proposed architecture is presented using the TOGAF framework, with core services for integrated development/testing/simulation and operation/monitoring/maintenance, and application services and technologies to realize these, including cloud, edge and robotics computing with virtualization and ROS. The architecture aims to ease autonomous things implementation through a super-converged system.
IDENTIFICATION OF EFFICIENT PEERS IN P2P COMPUTING SYSTEM FOR REAL TIME APPLI... - ijp2p
The Peer-to-Peer computing paradigm is currently emerging as an economical solution for large-scale computation problems. However, due to the dynamic nature of peers, it is very difficult to use such systems for real-time computations, and the strict deadlines of scientific and real-time applications demand predictable performance. We propose an algorithm to identify a group of reliable peers, from the peers available on the Internet, for processing a real-time application's tasks. The algorithm jointly evaluates peer properties such as availability, credibility, computation time, and the turnaround time of the peer with respect to the task distributor peer. We also define a method to calculate turnaround time (distance) on task distributor peers at the application level.
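The joint-evaluation idea can be sketched as a weighted score per peer: availability and credibility count positively, while computation and turnaround times count inversely. The weights and normalization below are assumptions for illustration, not the paper's exact algorithm.

```python
def score(peer):
    """Combine peer properties into one reliability score."""
    return (0.4 * peer["availability"]
            + 0.3 * peer["credibility"]
            + 0.2 / (1 + peer["compute_time"])    # lower time -> higher score
            + 0.1 / (1 + peer["turnaround"]))

peers = [
    {"id": "p1", "availability": 0.99, "credibility": 0.9, "compute_time": 2, "turnaround": 1},
    {"id": "p2", "availability": 0.50, "credibility": 0.4, "compute_time": 5, "turnaround": 9},
    {"id": "p3", "availability": 0.90, "credibility": 0.8, "compute_time": 3, "turnaround": 2},
]

# keep the most reliable group for the real-time tasks
reliable = sorted(peers, key=score, reverse=True)[:2]
print([p["id"] for p in reliable])  # ['p1', 'p3']
```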
Abstract: Information increasingly resides across the corporate landscape, and IT organizations now need the right modules to store, manage, retrieve, and transfer it in a reliable and powerful manner. As part of an Information Lifecycle Management (ILM) best-practices strategy, organizations require solutions for migrating data between heterogeneous environments and storage systems. This paper sets out to design modules for high-performance data migration across storage areas with low time complexity, and describes a new data migration tool and its business advantages in a dynamic IT landscape. Keywords— Heterogeneous Environment, data migration, data mapping
A unified dashboard for collaborative robot management system - Conference Papers
This document proposes a unified dashboard for managing collaborative robot (COBOT) systems across multiple factories. The dashboard would provide centralized monitoring and control of COBOT assets and production data. It incorporates interactive 3D visualization of COBOT movement for troubleshooting. The dashboard has role-based access, with views tailored for super administrators, administrators and regular users. It utilizes a hierarchical interface and "batch actions" to efficiently manage large numbers of COBOTs.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Hashtag Recommendation System in a P2P Social Networking Application - csandit
This paper focuses on developing a hashtag recommendation system for an online social network application with a Peer-to-Peer infrastructure, motivated by the BestPeer++ architecture and the BATON overlay structure. A user may invoke a recommendation procedure while writing content. Once invoked, the procedure returns a list of candidate hashtags, from which the user may select one and embed it in the content. The proposed approach uses the Latent Dirichlet Allocation (LDA) topic model to derive the latent, or hidden, topics of different content. LDA is a well-developed data mining algorithm that is generally effective at analysing text documents of different lengths. The topic model identifies the candidate hashtags associated with the text of the published content through their association with the derived hidden topics.
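The ranking step above can be sketched once the topic model has produced (a) the post's topic mixture and (b) each hashtag's association with those topics: candidates are scored by their weight under the post's topics. The numbers and hashtags are invented, and LDA inference itself is assumed to have run already.

```python
# topic mixture of the post being written (from the topic model)
post_topics = {"sports": 0.7, "politics": 0.1, "tech": 0.2}

# how strongly each hashtag is associated with each hidden topic
hashtag_topic_assoc = {
    "#worldcup": {"sports": 0.9, "politics": 0.05, "tech": 0.05},
    "#election": {"sports": 0.05, "politics": 0.9, "tech": 0.05},
    "#ai":       {"sports": 0.05, "politics": 0.05, "tech": 0.9},
}

def rank_hashtags(post_topics, assoc, top_k=2):
    """Score each hashtag by its weight under the post's topic mixture."""
    scores = {
        tag: sum(post_topics[t] * w for t, w in topics.items())
        for tag, topics in assoc.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(rank_hashtags(post_topics, hashtag_topic_assoc))  # ['#worldcup', '#ai']
```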
The experiments evaluating the recommendation approach were fed with tweets published on Twitter. Hit-rate is used as the evaluation metric: the percentage of selected or relevant hashtags contained in the candidate hashtags. Our results show a hit-rate above 50% when each recommendation method is used independently. When similar users and user preferences are considered together, the hit-rate improves to 87% and 92% for top-5 and top-10 candidate recommendations, respectively.
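The hit-rate metric, as defined above, is the percentage of recommendation sessions in which the hashtag the user actually selected appears among the top-k candidates. A direct implementation on toy data:

```python
def hit_rate(sessions, k):
    """sessions: list of (selected_hashtag, ranked_candidate_list)."""
    hits = sum(1 for selected, candidates in sessions if selected in candidates[:k])
    return 100.0 * hits / len(sessions)

sessions = [
    ("#ai",   ["#ai", "#ml", "#data"]),      # hit at rank 1
    ("#food", ["#travel", "#food", "#photo"]),  # hit at rank 2
    ("#news", ["#sports", "#tv", "#music"]),    # miss
]
print(hit_rate(sessions, k=2))  # 66.66... : two of three sessions are hits
```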
IRJET- Plug-In based System for Data Visualization - IRJET Journal
This document describes a plug-in based system for data visualization. The system allows users to upload different file types like Excel, HTML, CSV and visualize the data through interactive visualizations. The system uses a plug-in architecture that allows new plug-ins to be added to support additional file formats. Each plug-in implements a reader interface to extract data from its file type and output it as JSON. The system then hosts the JSON and provides various visualization patterns for users to analyze and report on the data. The plug-in based design makes the system flexible and adaptable to future changes and additions of new plug-in types.
Finite State Machine Based Evaluation Model For Web Service Reliability Analysis - dannyijwest
Today's world economy demands that both market access and customer service be available anytime and anywhere. The Web is the only medium that can supply these global economic needs and, thanks to the expanding development of comprehensive web services, it does so relatively inexpensively: web services offer a relatively cheap way to deploy customer services. Over time, a system's business logic grows considerably, as it must react to several different competitors under different situations. Through a business logic system we can achieve faster communication of information amid rampant change and increasing business complexity.
Multi Similarity Measure based Result Merging Strategies in Meta Search Engine - IDES Editor
Result merging is the key component of a Meta Search Engine. Meta Search Engines provide a uniform query interface for Internet users searching for information: depending on users' needs, they select relevant sources, map user queries onto the target search engines, and subsequently merge the results. The effectiveness of a Meta Search Engine is therefore closely related to the result merging algorithm it employs. In this paper, we propose a Meta Search Engine with two distinct steps: (1) searching through surface and deep search engines, and (2) ranking the results with the designed ranking algorithm. Initially, the user's query is submitted to the deep and surface search engines. The proposed method uses two distinct ranking algorithms, a concept similarity based method and a cosine similarity based method. Once the results from the various search engines are ranked, the proposed Meta Search Engine merges them into a single ranked list. Finally, experiments demonstrate the efficiency of the proposed visible and invisible web-based Meta Search Engine in merging the relevant pages, with TSAP used as the evaluation criterion.
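The merging step can be sketched as follows: normalize each engine's scores to a common scale, then sum per URL, so pages returned by several engines rise in the merged list. This normalize-and-sum scheme is a common choice for illustration, not necessarily the paper's TSAP-evaluated algorithm.

```python
def merge(result_lists):
    """result_lists: one [(url, score), ...] list per search engine."""
    combined = {}
    for results in result_lists:
        top = max(s for _, s in results) or 1.0   # per-engine normalization
        for url, s in results:
            combined[url] = combined.get(url, 0.0) + s / top
    return sorted(combined, key=combined.get, reverse=True)

surface = [("a.com", 9.0), ("b.com", 6.0)]   # surface-web engine results
deep    = [("b.com", 4.0), ("c.com", 2.0)]   # deep-web engine results
print(merge([surface, deep]))  # ['b.com', 'a.com', 'c.com']
```

b.com wins because it appears in both lists, which is exactly the behaviour a merging strategy is meant to reward.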
A real-time big data sentiment analysis for Iraqi tweets using Spark Streaming - journalBEEI
The scale of data streaming on social networks, such as Twitter, is increasing exponentially. Twitter is one of the most important and suitable big data sources for machine learning research in terms of analysis, prediction, knowledge extraction, and opinion mining. People use the Twitter platform daily to express opinions, a fundamental fact that influences their behavior. In recent years the flow of the Iraqi dialect has increased, especially on Twitter, and sentiment analysis and opinion mining for different dialects have become hot topics in data science research. In this paper we develop a real-time analytic model for sentiment analysis and opinion mining of Iraqi tweets using Spark Streaming, and we also create a dataset for researchers in this field. The Twitter handle Bassam AlRawi is the case study. The new method is well suited to current machine learning applications and fast online prediction.
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-... - IRJET Journal
The document proposes a new framework for efficient semantic search in large datasets. It aims to improve understanding of short texts by enriching them with concepts and related terms from a probabilistic knowledge base. A deep learning model using stacked autoencoders is designed to learn features from the enriched short texts and encode them into binary codes, allowing similarity searches. Experiments show the new approach captures semantics better than existing methods and enables applications like short text retrieval and classification.
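The similarity-search step enabled by those binary codes can be sketched directly: once texts are encoded (here the codes are given, not learned), nearest neighbours are found by Hamming distance, a cheap popcount on XORed integers.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

# binary codes as an encoder would emit them (toy values)
codes = {"doc1": 0b10110010, "doc2": 0b10110011, "doc3": 0b01001100}

def nearest(query_code, codes):
    return min(codes, key=lambda d: hamming(query_code, codes[d]))

print(nearest(0b10110000, codes))  # doc1 (differs by a single bit)
```

This is why binary encodings make short-text retrieval fast: comparing codes needs no floating-point vector math at query time.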
The document discusses several MCA projects including:
1. A medical information integration model using cloud computing to provide data storage and analysis for medical departments and workers.
2. A customer relationship and warehouse management system to optimize revenue, customer satisfaction and understanding.
3. An issue tracking, managing, monitoring and reporting system for IT departments to access shared data and history.
A Web Extraction Using Soft Algorithm for Trinity Structure - iosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Fakebuster fake news detection system using logistic regression technique i... - Conference Papers
The document describes a fake news detection system called "FAKEBUSTER" that was developed using logistic regression in machine learning. It analyzed past research that found logistic regression achieved 79-89% accuracy in detecting fake news. The system was trained on a dataset of news articles labeled as real or fake. It uses TF-IDF to convert text to numerical features for the logistic regression model. The model was integrated into a web application called "FAKEBUSTER" that allows users to input a news article or URL to check if it is real or fake. Evaluation found the stance detection approach improved the model's accuracy for fake news classification.
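The TF-IDF step named above turns articles into numeric vectors a classifier such as logistic regression can consume. Below is a plain implementation of the standard formula on toy headlines, not the FAKEBUSTER code itself.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: tf-idf weight} vector per document."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc.split()))
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())
        total = sum(tf.values())
        vectors.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

docs = ["aliens landed in the city", "council approves the city budget"]
vecs = tfidf(docs)
# "the" occurs in every document, so its idf (and weight) is zero,
# while a distinctive word like "aliens" gets positive weight
print(vecs[0]["aliens"] > vecs[0]["the"])  # True
```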
IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S... - IRJET Journal
The document proposes a system to provide in-database analytic functionalities to MySQL by implementing machine learning algorithms like linear regression within the MySQL database server. This would eliminate the need to migrate data to external analytic tools for processing, reducing time and network load. Specifically, it aims to develop user-defined functions in MySQL using the linear regression algorithm to predict numeric values. This in-database processing approach could improve performance for large-scale analytics compared to conventional methods that require data movement.
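The numeric core such a UDF would compute - simple least-squares linear regression over (x, y) rows, with no data ever leaving the server - is shown below. The MySQL UDF wiring is not shown; this is only the arithmetic the proposal would run in-database.

```python
def linreg(points):
    """Closed-form least squares: return (slope, intercept)."""
    n = len(points)
    sx  = sum(x for x, _ in points)
    sy  = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

rows = [(1, 3), (2, 5), (3, 7)]       # exactly y = 2x + 1
slope, intercept = linreg(rows)
print(slope, intercept)  # 2.0 1.0
```

Because only the five running sums are needed, the computation streams over rows in one pass, which is what makes the in-database approach attractive at scale.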
A prototype framework for high performance push no... - DavidNereekshan
This document describes a prototype framework for sending high volumes of push notifications. It discusses the architectural design of the framework, which includes 6 main modules: a REST interface, service layer, database, message producers, message consumers, and queue manager. The document then outlines 13 performance test scenarios run on the framework and discusses the results and conclusions.
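The producer/consumer modules named above follow the generic queue pattern, sketched here with the standard library: producers enqueue notifications, worker threads drain the queue and "send" them. The module names in the paper's framework are not reproduced; this is only the pattern.

```python
import queue
import threading

q = queue.Queue()
sent = []
lock = threading.Lock()

def consumer():
    while True:
        msg = q.get()
        if msg is None:          # poison pill: stop this worker
            q.task_done()
            return
        with lock:
            sent.append(msg)     # stand-in for the push-gateway call
        q.task_done()

workers = [threading.Thread(target=consumer) for _ in range(2)]
for w in workers:
    w.start()
for i in range(5):
    q.put(f"notification-{i}")   # producer side
for _ in workers:
    q.put(None)
q.join()
for w in workers:
    w.join()
print(len(sent))  # 5
```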
Latent semantic analysis and cosine similarity for hadith search engine - TELKOMNIKA JOURNAL
Search engine technology is used to find needed information easily, quickly, and efficiently, including information about the hadith, the second guideline of life for Muslims after the Holy Qur'an. This study aimed to build a specialized search engine for finding complete hadith information in the Indonesian language. The search engine works by applying latent semantic analysis (LSA) and cosine similarity to the keywords entered: the two methods form structured representations of the text data and calculate the similarity between the entered keywords and the hadith texts, so that the hadith information returned matches what was searched for. Based on 50 test runs, LSA with cosine similarity achieved a high success rate in finding hadith information, with an average recall of 87.83%, although not many of the hadith found were semantically relevant, as indicated by an average precision of 36.25%.
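The cosine-similarity step is the standard normalized dot product between term vectors; a plain sketch on invented term-frequency vectors follows (the LSA projection that would precede it is omitted).

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = {"niat": 1, "puasa": 1}
doc1  = {"niat": 2, "puasa": 1, "shalat": 1}   # shares terms with the query
doc2  = {"zakat": 3}                           # no shared terms
print(cosine(query, doc1) > cosine(query, doc2))  # True
```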
Development of Effective Audit Service to Maintain Integrity of Migrated Data... - IRJET Journal
This document proposes an audit service to verify the integrity of data migrated to the cloud. It discusses existing proof of retrievability and provable data possession schemes that allow third-party auditing of cloud data without downloading it. The document then presents a new audit scheme based on an interactive proof system using bilinear pairing cryptography. The scheme uses key generation, tag generation, and an interactive proof protocol between the cloud service provider and the third-party auditor. The protocol exchanges commitments and challenges and verifies the responses, ensuring data integrity while preserving privacy and achieving high performance for cloud auditing.
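The shape of such a challenge-response audit can be shown with a heavily simplified sketch: the owner keeps per-block tags, the auditor challenges random block indices, and the provider's answers are checked against the tags. Real schemes, including the bilinear-pairing one described above, avoid storing full hash tags like this; plain hashes stand in here purely to show the protocol flow.

```python
import hashlib
import random

blocks = [b"block-0 data", b"block-1 data", b"block-2 data"]  # stored in cloud
tags = [hashlib.sha256(b).hexdigest() for b in blocks]        # owner-side tags

def prover_respond(stored_blocks, challenge):
    """Cloud side: answer a challenge over the requested block indices."""
    return [hashlib.sha256(stored_blocks[i]).hexdigest() for i in challenge]

def auditor_verify(tags, challenge, responses):
    """Auditor side: check responses against the owner's tags."""
    return all(tags[i] == r for i, r in zip(challenge, responses))

challenge = random.sample(range(len(blocks)), 2)
print(auditor_verify(tags, challenge, prover_respond(blocks, challenge)))  # True

blocks[1] = b"tampered"   # corruption of a challenged block is detected
print(auditor_verify(tags, [1], prover_respond(blocks, [1])))  # False
```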
This document presents an algorithm for converting text to graphs using natural language processing techniques. It discusses two applications: 1) an automatic text summarizer that takes newspaper articles as input and generates summaries based on word frequencies, and 2) a text to graph converter that takes stock articles as input, extracts terms related to points, percentages and time, and maps these tokens to a graph. The algorithm uses Python libraries like NLTK, regular expressions and Matplotlib to perform tasks like text segmentation, pattern matching and graph plotting.
ALGORITHM FOR TEXT TO GRAPH CONVERSION AND SUMMARIZING USING NLP: A NEW APPRO... - kevig
Text can be analysed by splitting it and extracting keywords, which may then be represented as summaries, tabular representations, graphical forms, or images. The need to handle the large amount of information present in textual form has led to research on extracting text and transforming it from an unstructured to a structured format. This paper presents the importance of Natural Language Processing (NLP) and two of its applications in the Python language: 1. Automatic text summarization [Domain: Newspaper Articles] and 2. Text to Graph Conversion [Domain: Stock news]. The main challenge in NLP is natural language understanding, i.e. deriving meaning from human or natural language input, which is done using regular expressions, artificial intelligence, and database concepts. The Automatic Summarization tool converts newspaper articles into summaries on the basis of word frequency in the text. The Text to Graph Converter takes a stock article as input, tokenizes it on various indices (points and percent) and time, and maps the tokens to a graph. This paper proposes a business solution for users for effective time management.
In recent times, research activities in the areas of Opinion and Sentiment analysis in natural language texts and other media are gaining ground under the umbrella of subjectivity analysis. The reason may be the huge amount of available text data in the Social Web in the forms of news, reviews, blogs, chats and even twitter. Though Sentiment analysis from natural language text is a multifaceted and multidisciplinary problem, in general, the term "sentiment" is used in reference to the automatic analysis of evaluative text.
IDENTIFICATION OF EFFICIENT PEERS IN P2P COMPUTING SYSTEM FOR REAL TIME APPLI...ijp2p
Currently the Peer-to-Peer computing paradigm rises as an economic solution for the large scale
computation problems. However due to the dynamic nature of peers it is very difficult to use this type of
systems for the computations of real time applications. Strict deadline of scientific and real time
applications require predictable performance in such applications. We propose an algorithm to identify the
group of reliable peers, from the available peers on the Internet, for the processing of real time
application’s tasks. The algorithm is based on joint evaluation of peer properties like peer availability,
credibility, computation time and the turnaround time of the peer with respect to the task distributor peer.
Here we also define a method to calculate turnaround time (distance) on task distributor peers at
application level.
Abstract In early days information contain in increasingly corporate area, now IT organization help to right module to store, manage ,retrieve and transfer information in the more reliable and powerful manner. As part of an Information Lifecycle Management (ILM) best-practices strategy, organizations require solutions for migrating data between in heterogeneous environments and system storage. In early days information contain in increasingly corporate area, today IT organization help to right module to store, manage ,retrieve and transfer information in the more reliable and powerful manner. This paper helps to planned to design powerful modules that high-performances data migration of storage area with less time complexity. This project contain unique information of data migration in dynamic IT nature and business advantage that design to provide new tool used for data migration. Keywords— Heterogeneous Environment, data migration, data mapping
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
A unified dashboard for collaborative robot management systemConference Papers
This document proposes a unified dashboard for managing collaborative robot (COBOT) systems across multiple factories. The dashboard would provide centralized monitoring and control of COBOT assets and production data. It incorporates interactive 3D visualization of COBOT movement for troubleshooting. The dashboard has role-based access, with views tailored for super administrators, administrators and regular users. It utilizes a hierarchical interface and "batch actions" to efficiently manage large numbers of COBOTs.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Hashtag Recommendation System in a P2P Social Networking Applicationcsandit
This paper focuses on developing a hashtag recommendation system for an online social network application with a Peer-to-Peer infrastructure, motivated by the BestPeer++ architecture and the BATON overlay structure. A user may invoke the recommendation procedure while writing content; the procedure returns a list of candidate hashtags, from which the user may select one and embed it into the content. The proposed approach uses the Latent Dirichlet Allocation (LDA) topic model to derive the latent, or hidden, topics of different content. LDA is a well-developed data mining algorithm that is generally effective at analysing text documents of different lengths. The topic model identifies the candidate hashtags associated with the text of the published content through their association with the derived hidden topics.
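The topic-association step can be sketched with scikit-learn’s LDA implementation. The toy corpus, the hashtag-to-description mapping, and the dot-product association score below are assumptions for illustration, not the paper’s pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for published content; a real system would
# train on a large tweet collection.
docs = [
    "stock market trading price investor",
    "market price economy trading stocks",
    "football match goal team player",
    "team player league football season",
]
# Hypothetical mapping from hashtags to descriptive text used to place
# each hashtag in topic space.
hashtags = {"#finance": "stock market price", "#sports": "football team goal"}

vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

def candidate_hashtags(text, top_n=1):
    # Topic distribution of the draft content.
    t = lda.transform(vec.transform([text]))[0]
    scored = []
    for tag, desc in hashtags.items():
        h = lda.transform(vec.transform([desc]))[0]
        # Score by dot product of topic distributions: a simple measure
        # of association through shared hidden topics.
        scored.append((tag, float((t * h).sum())))
    scored.sort(key=lambda p: p[1], reverse=True)
    return [tag for tag, _ in scored[:top_n]]
```

The user would then pick one of the returned candidates to embed in the content.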
The experiments evaluating the recommendation approach were fed with tweets published on Twitter. Hit-rate is used as the evaluation metric: the percentage of selected or relevant hashtags contained in the candidate hashtags. Our results show a hit-rate above 50% when each recommendation method is used independently. For the case where both similar users and user preferences are considered at the same time, the hit-rate improves to 87% and 92% for top-5 and top-10 candidate recommendations, respectively.
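Interpreting hit-rate as a per-session hit (a session counts as a hit if any relevant hashtag appears among the candidates — one common reading of the metric, assumed here), it can be computed as:

```python
def hit_rate(sessions):
    """Fraction of sessions where at least one relevant hashtag
    appears among the recommended candidates.

    sessions: list of (relevant_hashtags, candidate_hashtags) pairs.
    """
    hits = sum(1 for relevant, candidates in sessions
               if set(relevant) & set(candidates))
    return hits / len(sessions)
```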
IRJET- Plug-In based System for Data VisualizationIRJET Journal
This document describes a plug-in based system for data visualization. The system allows users to upload different file types like Excel, HTML, CSV and visualize the data through interactive visualizations. The system uses a plug-in architecture that allows new plug-ins to be added to support additional file formats. Each plug-in implements a reader interface to extract data from its file type and output it as JSON. The system then hosts the JSON and provides various visualization patterns for users to analyze and report on the data. The plug-in based design makes the system flexible and adaptable to future changes and additions of new plug-in types.
Finite State Machine Based Evaluation Model For Web Service Reliability Analysisdannyijwest
Today’s world economy demands that both market access and customer service be available anytime and anywhere. The Web is the only medium that can supply these global economic needs and, thanks to the expanding development of comprehensive web services, it does so relatively inexpensively: web services provide a comparatively cheap way to deploy customer services. As time goes on, the business logic of a system grows considerably, since it must react to several different competitors under different situations. Through a business logic system we can achieve faster communication of information amid rampant change and increasing business complexity.
Multi Similarity Measure based Result Merging Strategies in Meta Search EngineIDES Editor
In a Meta Search Engine, result merging is the key component. Meta Search Engines provide a uniform query interface for Internet users to search for information: depending on users’ needs, they select relevant sources, map user queries onto the target search engines, and subsequently merge the results. The effectiveness of a Meta Search Engine is closely related to the result-merging algorithm it employs. In this paper, we propose a Meta Search Engine with two distinct steps: (1) searching through surface and deep search engines, and (2) ranking the results with the designed ranking algorithm. Initially, the user’s query is submitted to the deep and surface search engines. The proposed method uses two distinct algorithms for ranking the search results: a concept-similarity-based method and a cosine-similarity-based method. Once the results from the various search engines are ranked, the proposed Meta Search Engine merges them into a single ranked list. Finally, experimentation is carried out to demonstrate the efficiency of the proposed visible- and invisible-web-based Meta Search Engine in merging relevant pages. TSAP is used as the evaluation criterion, and the algorithms are evaluated against it.
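A minimal sketch of the cosine-similarity-based merging step: scores each deduplicated result snippet against the query using term-frequency vectors, then produces a single ranked list. The snippet format and deduplication-by-URL are assumptions, not the paper’s exact design:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token lists (term-frequency vectors)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def merge_results(query, result_lists):
    """Merge ranked lists from several engines into one list ordered by
    cosine similarity between the query and each result snippet."""
    seen, merged = set(), []
    for results in result_lists:
        for url, snippet in results:
            if url not in seen:  # drop duplicates returned by several engines
                seen.add(url)
                merged.append((url, cosine(query.split(), snippet.split())))
    merged.sort(key=lambda p: p[1], reverse=True)
    return [url for url, _ in merged]
```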
A real-time big data sentiment analysis for iraqi tweets using spark streamingjournalBEEI
The scale of data streaming in social networks such as Twitter is increasing exponentially. Twitter is one of the most important and suitable big data sources for machine learning research in terms of analysis, prediction, knowledge extraction, and opinion mining. People use the Twitter platform daily to express opinions, which fundamentally influence their behaviour. In recent years the flow of Iraqi-dialect content has increased, especially on Twitter, and sentiment analysis and opinion mining for different dialects have become hot topics in data science research. In this paper, we develop a real-time analytic model for sentiment analysis and opinion mining of Iraqi tweets using Spark Streaming, and we create a dataset for researchers in this field. The Twitter handle Bassam AlRawi is the case study here. The new method is well suited to current machine learning applications and fast online prediction.
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...IRJET Journal
The document proposes a new framework for efficient semantic search in large datasets. It aims to improve understanding of short texts by enriching them with concepts and related terms from a probabilistic knowledge base. A deep learning model using stacked autoencoders is designed to learn features from the enriched short texts and encode them into binary codes, allowing similarity searches. Experiments show the new approach captures semantics better than existing methods and enables applications like short text retrieval and classification.
The document discusses several MCA projects including:
1. A medical information integration model using cloud computing to provide data storage and analysis for medical departments and workers.
2. A customer relationship and warehouse management system to optimize revenue, customer satisfaction and understanding.
3. An issue tracking, managing, monitoring and reporting system for IT departments to access shared data and history.
A Web Extraction Using Soft Algorithm for Trinity Structureiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Fakebuster fake news detection system using logistic regression technique i...Conference Papers
The document describes a fake news detection system called "FAKEBUSTER" that was developed using logistic regression in machine learning. It analyzed past research that found logistic regression achieved 79-89% accuracy in detecting fake news. The system was trained on a dataset of news articles labeled as real or fake. It uses TF-IDF to convert text to numerical features for the logistic regression model. The model was integrated into a web application called "FAKEBUSTER" that allows users to input a news article or URL to check if it is real or fake. Evaluation found the stance detection approach improved the model's accuracy for fake news classification.
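The TF-IDF plus logistic regression core of such a detector can be sketched with scikit-learn. The toy training sentences and labels below are invented for illustration; the actual system was trained on a labelled news-article corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical labelled set (1 = fake, 0 = real).
texts = [
    "shocking miracle cure doctors hate this trick",
    "you will not believe this one weird secret",
    "government announces new budget for public transport",
    "central bank raises interest rates by a quarter point",
]
labels = [1, 1, 0, 0]

# TF-IDF converts text to numerical features; logistic regression
# classifies them, as in the FAKEBUSTER pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

def is_fake(article):
    return bool(model.predict([article])[0])
```

A web front end like the one described would wrap `is_fake` behind a text or URL input form.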
IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...IRJET Journal
The document proposes a system to provide in-database analytic functionalities to MySQL by implementing machine learning algorithms like linear regression within the MySQL database server. This would eliminate the need to migrate data to external analytic tools for processing, reducing time and network load. Specifically, it aims to develop user-defined functions in MySQL using the linear regression algorithm to predict numeric values. This in-database processing approach could improve performance for large-scale analytics compared to conventional methods that require data movement.
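The computation such a user-defined function would push into the database is ordinary least squares. A minimal sketch of the closed-form fit for simple linear regression (the UDF would run the same arithmetic over a pair of columns):

```python
def linear_regression(xs, ys):
    """Ordinary least squares for y = a*x + b over two numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized --
    # the shared 1/n factor cancels in the ratio).
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b
```

Implemented as a MySQL UDF, this would let `SELECT` statements produce predictions without moving the data to an external tool.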
A prototype framework_for_high_performance_push_noDavidNereekshan
This document describes a prototype framework for sending high volumes of push notifications. It discusses the architectural design of the framework, which includes 6 main modules: a REST interface, service layer, database, message producers, message consumers, and queue manager. The document then outlines 13 performance test scenarios run on the framework and discusses the results and conclusions.
Latent semantic analysis and cosine similarity for hadith search engineTELKOMNIKA JOURNAL
Search engine technology is used to find needed information easily, quickly, and efficiently, including information about the hadith, the second guideline of life for Muslims after the Holy Qur’an. This study aimed to build a specialized search engine for finding hadith information in the Indonesian language. The search engine works by applying latent semantic analysis (LSA) and cosine similarity to the entered keywords: the two methods form structured representations of the text data and compute the similarity between the keyword text and the hadith text data, so that the hadith information returned matches what was searched for. Based on 50 test runs, LSA with cosine similarity achieved a high success rate in retrieving hadith information, with an average recall of 87.83%, although precision was lower, averaging 36.25%, because much of the retrieved information was not semantically relevant.
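The LSA-plus-cosine-similarity retrieval pipeline can be sketched with scikit-learn. The English toy corpus below is an assumption standing in for the Indonesian hadith collection:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; the actual system indexed hadith text in Indonesian.
corpus = [
    "charity purifies wealth and helps the poor",
    "prayer at night brings peace to the heart",
    "seeking knowledge is a duty for every believer",
    "kindness to neighbours is part of faith",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
# Truncated SVD of the term-document matrix is the LSA step: documents
# are re-expressed in a lower-dimensional latent semantic space.
svd = TruncatedSVD(n_components=3, random_state=0)
Z = svd.fit_transform(X)

def search(query, top_n=2):
    # Project the query into the same latent space, then rank documents
    # by cosine similarity.
    q = svd.transform(tfidf.transform([query]))
    sims = cosine_similarity(q, Z)[0]
    order = sims.argsort()[::-1][:top_n]
    return [corpus[i] for i in order]
```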
Development of Effective Audit Service to Maintain Integrity of Migrated Data...IRJET Journal
This document proposes an audit service to verify the integrity of data migrated to the cloud. It discusses existing proof of retrievability and provable data possession schemes that allow third-party auditing of cloud data without downloading. The document then presents a new interactive proof system-based audit scheme using bilinear pairing cryptography. The scheme uses key generation, tag generation, and an interactive proof protocol between the cloud service provider and third-party auditor. The protocol commitments, challenges, and verifies responses to ensure data integrity while preserving privacy and achieving high performance for cloud auditing.
This document presents an algorithm for converting text to graphs using natural language processing techniques. It discusses two applications: 1) an automatic text summarizer that takes newspaper articles as input and generates summaries based on word frequencies, and 2) a text to graph converter that takes stock articles as input, extracts terms related to points, percentages and time, and maps these tokens to a graph. The algorithm uses Python libraries like NLTK, regular expressions and Matplotlib to perform tasks like text segmentation, pattern matching and graph plotting.
ALGORITHM FOR TEXT TO GRAPH CONVERSION AND SUMMARIZING USING NLP: A NEW APPRO...kevig
Text can be analysed by splitting it and extracting keywords, which may then be represented as summaries, tabular representations, graphical forms, or images. The need to cope with the large amount of information present in textual form has led to research on extracting text and transforming it from an unstructured to a structured format. The paper presents the importance of Natural Language Processing (NLP) and two interesting applications in the Python language: 1. Automatic text summarization [Domain: Newspaper articles] 2. Text-to-graph conversion [Domain: Stock news]. The main challenge in NLP is natural language understanding, i.e., deriving meaning from human or natural language input, which is done using regular expressions, artificial intelligence, and database concepts. The automatic summarization tool converts newspaper articles into summaries based on word frequencies in the text. The text-to-graph converter takes a stock article as input, tokenizes it on various indices (points and percent) and time, and maps the tokens onto a graph. This paper proposes a business solution for users for effective time management.
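The token-extraction step of the text-to-graph converter can be sketched with regular expressions. The exact patterns are assumptions; the paper only states that articles are tokenized on points, percent, and time:

```python
import re

# Pull index "points" and "percent" figures out of stock-news text --
# the kind of tokens the converter maps onto a graph.
POINTS = re.compile(r'(\d+(?:\.\d+)?)\s*points', re.IGNORECASE)
PERCENT = re.compile(r'(\d+(?:\.\d+)?)\s*(?:percent|%)', re.IGNORECASE)

def extract_tokens(text):
    return {
        "points": [float(m) for m in POINTS.findall(text)],
        "percent": [float(m) for m in PERCENT.findall(text)],
    }
```

The extracted values would then be handed to a plotting library such as Matplotlib, as the paper describes.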
A powerful comparison of deep learning frameworks for Arabic sentiment analysis IJECEIAES
Deep learning (DL) is a machine learning (ML) subdomain involving algorithms inspired by brain function, known as artificial neural networks (ANNs). Recently, DL approaches have achieved major accomplishments across various Arabic natural language processing (ANLP) tasks, especially in the domain of Arabic sentiment analysis (ASA). When working on Arabic SA, researchers can use various DL libraries in their projects, but they often do so without justifying their choice, or they pick a group of libraries based on familiarity with a particular programming language. This work is based on the Java and Python programming languages because they offer large sets of deep learning libraries that are very useful in the ASA domain. The paper presents a comparative analysis of valuable Python and Java libraries to identify the most relevant and robust DL libraries for ASA. Through this comparative analysis, we find that the TensorFlow, Theano, and Keras Python frameworks are very popular and widely used in this research domain.
The document discusses Python and the Natural Language Toolkit (NLTK). It explains that Python was chosen as the implementation language for NLTK due to its shallow learning curve, transparent syntax and semantics, and good string handling functionality. NLTK provides basic classes for natural language processing tasks, standard interfaces and implementations for tasks like tokenization and tagging, and extensive documentation. NLTK is organized into packages that encapsulate data structures and algorithms for specific NLP tasks.
The document presents a software training on Python3. It covers the objectives of understanding Python as a scripting language and how to design programs. It then discusses various Python libraries and tools - NumPy for numeric computing, Pandas for data analysis, Matplotlib for visualization, Jupyter notebooks, Anaconda for package/environment management, and MySQL for databases. The training aims to help participants learn how to use these technologies for data science and development.
Python is a widely-used, high-level programming language known for its simplicity, readability, and extensive library support. It is favored by developers for its ease of use and ability to handle diverse tasks, making it suitable for various applications ranging from web development to data analysis and artificial intelligence.
A convolutional neural network model with five convolutional layers has been created using the TensorFlow platform to perform handwritten text recognition. The model was trained on samples from the large IAM database of handwritten text images and tested on a user-defined dataset. The highest recognition accuracy was achieved using this approach.
IRJET- Hosting NLP based Chatbot on AWS Cloud using DockerIRJET Journal
This document discusses hosting an NLP-based chatbot on AWS using Docker. It describes developing a chatbot that answers user questions by searching text data indexed in Elasticsearch. The chatbot is containerized using Docker and deployed on AWS Elastic Container Service (ECS) to improve availability and performance. Key components include natural language processing, Elasticsearch for searching, an API for querying data, and an Angular UI. Docker Compose is used to launch multiple containers for the Elasticsearch, API, UI and other services.
- The document profiles Alberto Paro and his experience including a Master's Degree in Computer Science Engineering from Politecnico di Milano, experience as a Big Data Practise Leader at NTTDATA Italia, authoring 4 books on ElasticSearch, and expertise in technologies like Apache Spark, Playframework, Apache Kafka, and MongoDB. He is also an evangelist for the Scala and Scala.JS languages.
The document then provides an overview of data streaming architectures, popular message brokers like Apache Kafka, RabbitMQ, and Apache Pulsar, streaming frameworks including Apache Spark, Apache Flink, and Apache NiFi, and streaming libraries such as Reactive Streams.
Apache frameworks provide solutions for processing big and fast data. Traditional APIs use a request/response model with pull-based interactions, while modern data streaming uses a publish/subscribe model. Key concepts for big data architectures include batch processing frameworks like Hadoop, stream processing tools like Storm, and hybrid options like Spark and Flink. Popular data ingestion tools include Kafka for messaging, Flume for log data, and Sqoop for structured data. The best solution depends on requirements like latency, data volume, and workload type.
Present and future of unified, portable, and efficient data processing with A...DataWorks Summit
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere."
This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We’ll focus on the present state of the community and the current status of the Beam ecosystem. We’ll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and the Streaming SQL. Finally, we’ll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem.
Speaker
Davor Bonaci, Apache Software Foundation; Simbly, V.P. of Apache Beam; Founder/CEO at Operiant
CSP as a Domain-Specific Language Embedded in Python and JythonM H
The document describes a new Python library called python-csp that implements synchronous message-passing concurrency based on Hoare's Communicating Sequential Processes (CSP). python-csp allows programmers to compose processes and guards using infix operators similar to the original CSP syntax, making it idiomatic Python. It has implementations that reify CSP processes as Python threads, operating system processes, or Java threads when used with Jython. The library aims to provide a higher-level abstraction for concurrent programming in Python that hides the underlying implementation and encourages program correctness through the semantics of CSP.
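The CSP model the library reifies can be illustrated with standard-library primitives. This is a sketch of the underlying idea only, not python-csp’s actual API; a bounded queue approximates CSP’s synchronous rendezvous:

```python
import threading
import queue

def producer(chan):
    # "Write" end of the channel: send three messages, then a poison
    # value signalling completion.
    for i in range(3):
        chan.put(i)
    chan.put(None)

def consumer(chan, out):
    # "Read" end of the channel: blocks until a message arrives.
    while True:
        msg = chan.get()
        if msg is None:
            break
        out.append(msg * 2)

chan = queue.Queue(maxsize=1)  # tiny buffer approximates rendezvous
out = []
t1 = threading.Thread(target=producer, args=(chan,))
t2 = threading.Thread(target=consumer, args=(chan, out))
t1.start(); t2.start()
t1.join(); t2.join()
```

python-csp wraps this pattern in CSP-style process and guard composition with infix operators, as described above.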
Deep learning has surpassed human capability on some tasks and has become highly popular in scientific computing, with its algorithms applied across industries to solve complex problems.
This introductory level talk is about Apache Flink: a multi-purpose Big Data analytics framework leading a movement towards the unification of batch and stream processing in the open source.
With the many technical innovations it brings along with its unique vision and philosophy, it is considered the 4 G (4th Generation) of Big Data Analytics frameworks providing the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine supporting many use cases: batch, streaming, relational queries, machine learning and graph processing.
In this talk, you will learn about:
1. What is Apache Flink stack and how it fits into the Big Data ecosystem?
2. How Apache Flink integrates with Hadoop and other open source tools for data input and output as well as deployment?
3. Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark.
4. Who is using Apache Flink?
5. Where to learn more about Apache Flink?
Session 2.1 ontological representation of the telecom domain for advanced a...semanticsconference
This document discusses the creation of an ontology for the telecom domain to support advanced AI applications. It describes extracting concepts, relations, and synonyms from various data sources through both manual and automated methods. Machine learning techniques like word embeddings are used to retrieve synonym suggestions. The ontology is stored in a semantic graph and can be queried through a natural language interface to power applications such as a semantic search and chatbot integration. The ontology provides a centralized knowledge base that strengthens independence and allows reuse of data across different AI systems.
This document provides an introduction and overview of the Stat project, which aims to create an open source machine learning framework in Java for text analysis. The Stat framework is designed to be simple, extensible, and performant. It aims to simplify common text analysis tasks for researchers and engineers by providing reusable tools and wrappers for existing NLP and machine learning packages. The document outlines the goals, scope, stakeholders and provides an initial requirements analysis for the Stat framework.
The document describes an AI-driven Occupational Skills Generator (AIOSG) that aims to automate the process of creating occupational skills reference documents. The AIOSG utilizes an intelligent web crawler, natural language processing, neural networks, and a blockchain to gather data on occupational skills from various sources, analyze the data, and generate standardized skills reference documents. It is meant to make the document creation process more efficient, data-driven, and able to incorporate rapidly changing skills demands compared to the traditional manual process. The system architecture and key components of data collection, analysis, skills ontology construction, and reference document generation are outlined.
Advanced resource allocation and service level monitoring for container orche...Conference Papers
This document proposes an architecture for advanced resource allocation and service level monitoring for container orchestration platforms. It begins with background on containerization and different container orchestration platforms like Docker Swarm, Kubernetes, and Mesos. It then discusses the need for resource-aware container placement and SLA-based monitoring to minimize container migration and ensure performance. The proposed architecture consists of different components like a request manager, information collector, policy manager, and resource manager to enable advanced scheduling and monitoring of containers on Kubernetes. The proposed solution aims to analyze future resource utilization to improve placement decisions and reduce issues after deployment.
Adaptive authentication to determine login attempt penalty from multiple inpu...Conference Papers
This document proposes an adaptive authentication solution that determines login penalties based on multiple input sources. It describes adding an IP address checker module to the existing Trust Engine component of the Mi-UAP authentication platform. The IP address checker would identify the source type of a user's IP address and apply the appropriate penalty, such as requiring additional authentication methods or blocking the user, depending on factors like whether the IP is on a blacklist database. The document outlines the process flow and provides examples of how penalties would be applied based on the identified source type.
Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...Conference Papers
- The document analyzes the absorption spectrum of dentine sialophosphoprotein (DSPP) in gingival crevicular fluid (GCF) samples from orthodontic patients to develop a model for detecting orthodontic-induced inflammatory root resorption (OIIRR).
- GCF samples were collected from orthodontic patients at different treatment periods (3, 6, 12 months) and from non-orthodontic patients. Absorption spectroscopy found DSPP absorbance spectra increased with longer treatment duration, indicating more DSPP released due to more OIIRR.
- A qualitative model using SIMCA analysis accurately classified GCF samples into orthodontic and non-orthodont
A deployment scenario a taxonomy mapping and keyword searching for the appl...Conference Papers
Real-time Text Stream Processing: A Dynamic and Distributed NLP Pipeline
Mohammad Arshi Saloot
MIMOS Berhad
Kuala Lumpur, Malaysia
+60187007981
arshi.saloot@yahoo.com
Duc Nghia Pham
MIMOS Berhad
Kuala Lumpur, Malaysia
+60389955000
nghia.pham@mimos.my
ABSTRACT
In recent years, the need for flexible and instant Natural Language Processing (NLP) pipelines has become more crucial. Real-time data sources, such as Twitter, necessitate real-time text analysis platforms. In addition, because a wide range of NLP toolkits and libraries exists across a variety of programming languages, a streaming platform is required to combine and integrate modules from different NLP toolkits. This study proposes a real-time architecture that uses Apache Storm and Apache Kafka to apply different NLP tasks to streams of textual data. The architecture allows developers to inject NLP modules into it from different programming languages. To evaluate the performance of the architecture, a series of experiments was conducted using OpenNLP, Fasttext, and SpaCy modules for the Bahasa Malaysia and English languages. The results show that Apache Storm achieved the lowest latency, compared with the Trident and baseline experiments.
CCS Concepts
Computer systems organization ~ Real-time systems ~ Real-time system architecture
Keywords
Real-time, Natural Language Processing, Streaming, Pipeline,
Kafka, Storm
1. INTRODUCTION
NLP toolkits often offer the following NLP components: tokenization, part-of-speech (PoS) tagging, chunking, named entity recognition (NER), and sentiment analysis. Currently, there is a wide range of NLP tools and libraries in different programming languages, and there is an ongoing competition between them in terms of accuracy and performance. For example, in 2017, an experiment compared four state-of-the-art NLP libraries, namely Google's SyntaxNet, the Stanford CoreNLP suite, the NLTK Python library, and spaCy, on publicly available software artifacts [1]. It showed that NLTK achieved the highest tokenization accuracy among the toolkits, while its PoS-tagging accuracy was the lowest [1]. Therefore, because of this diversity of software artifacts in the NLP field, linking and merging NLP modules built with different techniques into a single NLP pipeline is an important task for NLP engineers and researchers [2].
A sheer volume of textual data is generated daily in many domains, such as medicine, sports, law, and education. For example, a law institute generates a large amount of research notes, legal transaction documents, emails, reference books, etc. Thus, NLP becomes an essential factor in getting the best results out of descriptive or predictive analysis. As a result, AI and NLP are vital tools for legal practice, contributing to the growth of technologies that assist lawyers or "think like a lawyer". Therefore, traditional data processing techniques are being substituted with big data analytics approaches to solve real-life problems [3].
Big data (i.e. batch processing) focuses on a posteriori, batch-oriented processing. Recently, with the explosion of sensors and applications needing immediate actions, interest has shifted towards fast data (i.e. stream processing), which focuses on real-time processing. Batch big-data processing techniques encounter many challenges when it comes to analyzing real-time streams of data. Data streaming is useful for data sources that send data in small sizes (often in kilobytes) in a continuous flow as the data is generated. This may include a wide variety of data sources such as telemetry, log files, e-commerce transactions, social network data, or geospatial services. Thus, many real-time data sources, such as tweets, need a real-time data analysis pipeline. The output of real-time pipelines should be generated with low latency, and any incoming data must be processed within seconds or milliseconds [4]. Therefore, research efforts should be directed towards developing scalable frameworks and algorithms that accommodate the data stream computing mode, effective resource allocation strategies, and parallelization, to cope with the ever-growing size and complexity of data [4]. The objective of this work is to examine different frameworks in order to propose a platform that achieves the following aims:
• To encapsulate NLP modules: add and remove multilingual NLP modules in the pipeline without disturbing the architecture of the system.
• To compute distributedly: scale by adding or removing parallel processes as well as worker nodes.
• To process real-time streams: process and analyze incoming streams of textual data with different lengths and frequencies in real time.
• To have a configurable topology: manipulate the processing topology (i.e. the workflow or network of implemented NLP modules) at runtime without interrupting running services.
• To allow user interaction with the system: provide RESTful APIs to end users.
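The first and fourth aims can be sketched as a registry of named modules plus an ordered topology that can be changed at runtime. This is a minimal, framework-free sketch; the class and module names here are hypothetical, not part of the proposed system.

```python
# Minimal sketch of a pluggable NLP pipeline (illustrative only):
# each module is a named callable, and the topology is an ordered
# list that can be changed at runtime without touching the modules.

class NlpPipeline:
    def __init__(self):
        self.modules = {}    # name -> callable NLP module
        self.topology = []   # ordered module names

    def register(self, name, fn):
        """Encapsulate an NLP module behind a name."""
        self.modules[name] = fn

    def set_topology(self, names):
        """Reconfigure the processing order at runtime."""
        self.topology = list(names)

    def process(self, text):
        result = text
        for name in self.topology:
            result = self.modules[name](result)
        return result

pipeline = NlpPipeline()
pipeline.register("lowercase", str.lower)
pipeline.register("tokenize", str.split)
pipeline.set_topology(["lowercase", "tokenize"])
print(pipeline.process("Real-Time NLP"))   # ['real-time', 'nlp']
```

In the proposed architecture the modules would be Storm bolts rather than in-process functions, but the registration/topology separation is the same idea.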
Section 2 reflects on the state-of-the-art NLP systems and libraries
as well as distributed stream processing platforms. Section 3
describes the experiment of this work. Finally, Section 4
summarizes the paper and suggests future research directions.
2. LITERATURE REVIEW
2.1 NLP Toolkits
In 2014, a short survey of NLP toolkits [5] recognized NLTK [6] as the most well-known and comprehensive NLP toolkit. NLTK is written in Python and provides essential NLP modules, including tokenization, sentence splitting, statistical analysis of corpora, classification, and clustering. Although NLTK does not provide any neural network tools, it can be combined with Gensim [7] to provide word embeddings.
Apache OpenNLP [8] offers pre-trained models for the most common NLP modules, such as tokenization, sentence segmentation, PoS tagging, NER, chunking, parsing, and co-reference resolution, for a variety of languages. OpenNLP is an Apache-licensed cross-platform Java library that uses machine learning methods.
The Stanford NLP Toolkit [9], written in the Java programming language, provides tokenization, sentence splitting, PoS tagging, NER, parsing, sentiment analysis, temporal expression tagging, and word embeddings. OpenNLP and Stanford NLP use Maximum Entropy models for their PoS taggers. Stanford NLP uses different approaches for different tasks; for instance, Conditional Random Fields (CRF) are used for its NER module.
Fasttext [10], [11] and spaCy [12] are a new generation of NLP libraries that emphasize neural networks. SpaCy is one of the most advanced multi-language NLP libraries and is written in Python and Cython [12]. SpaCy's developers focus on the speed of their library in order to provide a suitable NLP solution for industrial and commercial applications. There are several studies comparing different NLP libraries on different domains and datasets [13]. For example, a study in 2017 [1] found that spaCy achieved the most promising combined accuracy of PoS tagging and tokenization, compared to Google's SyntaxNet, Stanford NLP, and the NLTK Python library, when tested with Stack Overflow data. In addition, spaCy supports neural network modeling, while OpenNLP lacks the advances of deep learning.
In 2016, Facebook Research released Fasttext as an open-source NLP library. Similar to spaCy, efficiency is vital in Fasttext. Although Fasttext is written in C++, wrappers are available for it in other languages such as Python and Java. Fasttext provides pre-trained word embedding models for 294 languages. Fasttext is considered a generic tool in the NLP field because it does not provide any specific NLP module, such as NER or sentiment analysis. Instead, it provides a text classification library that can be used as the engine for many NLP tasks.
Finally, UIMA is the most reliable and well-known framework for combining tasks from different libraries into a single text annotation pipeline [14]. Although UIMA itself does not provide any NLP modules, it offers flexible pipelines that can be configured by writing an XML description or using a GUI tool. Instead of using a specific annotation format, annotations in UIMA are made interoperable via XML Metadata Interchange (XMI), an interchange standard. To pass the right type of input to the next component, UIMA validates the output formats of components against predefined Types. As a result, many frameworks have been developed on top of UIMA, such as Text Imager [15].
2.2 Real-time Stream Processing Frameworks
This study compares the main features of the six most popular stream processing frameworks, namely Spark Streams [16], Flink [17], Akka Streams [18], Kafka Streams [19], Samza [20], and Apache Storm [21]. Spark Streams is a library in the Spark framework, powered by Spark RDDs, for processing continuously flowing stream data. Flink provides stream processing for large-volume data and also lets you handle batch analytics with one technology. Akka Streams is an implementation of the Reactive Streams specification, built on top of Akka Actors, for asynchronous, non-blocking stream processing. Apache Samza is another open-source, near-real-time, asynchronous computational framework for stream processing, developed in Scala and Java. Apache Storm accepts large volumes of data coming in extremely fast, possibly from various sources, analyzes it, and publishes real-time updates to other destinations, without storing any actual data.
Apache offers two different streaming frameworks with similar names: 1) Kafka Streams is a library for writing complex logic for stream processing jobs; 2) Apache Kafka (referred to here as Kafka Topics) is a distributed streaming platform [22]. The Kafka Streams API is used to develop stream applications, which may consume from Kafka topics and produce back into Kafka topics. Results from any of these tools are usually written back to new Kafka topics for downstream consumption, as shown in Figure 1.
Figure 1. Kafka Topics
2.2.1 Kafka Topics
Apache Kafka is a distributed log, which stores messages sequentially. In Kafka terminology, consumers consume/read data from topics, and producers produce/write data into topics. As shown in Figure 2, a Kafka cluster typically consists of multiple brokers to maintain load balance. Kafka broker election is handled by ZooKeeper [23]. Kafka
provides high scalability and resiliency, so it is an excellent
integration tool between data producers and consumers. As depicted in Figure 3, peer-to-peer "spaghetti" integration quickly becomes unmanageable as the number of services grows. Kafka Topics therefore provide a single backbone used by all services. Although Kafka is not fundamentally a queue, it can be utilized as a FIFO queue: producers always write to the end of the log, and consumers can read from whatever log offset they want, from the beginning or the end of the queue [22].
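The log-and-offset model described above can be illustrated with a toy in-memory topic. This is a sketch of the concept only, not the real Kafka client API.

```python
# Illustrative in-memory model of a Kafka topic: producers append to
# the end of an immutable log; each consumer chooses its own offset
# and may start reading from the beginning or from the end.

class Topic:
    def __init__(self):
        self.log = []              # append-only message log

    def produce(self, message):
        self.log.append(message)   # always written to the end

    def consume(self, offset):
        """Read all messages from a given log offset onward."""
        return self.log[offset:]

topic = Topic()
for msg in ["tweet-1", "tweet-2", "tweet-3"]:
    topic.produce(msg)

print(topic.consume(0))                   # from the beginning: all messages
print(topic.consume(len(topic.log) - 1))  # from the end: only the latest
```

Because the log is never mutated, two consumers with different offsets never interfere with each other, which is what makes Kafka a safe backbone between many services.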
Figure 2. Kafka Architecture
Figure 3. Peer-to-peer Architecture
2.2.2 Distributed Processing Comparison
There are three categories of the reliability of message delivery:
• At-most-once delivery: for each input message, that
message is delivered zero or one time; in other words, a message
may be lost.
• At-least-once delivery: for each input message,
potentially multiple attempts are made at delivering it; in other
words, a message may be duplicated but not lost.
• Exactly-once delivery: for each input message, one
delivery is made to the recipient; in other words, the message can
neither be lost nor duplicated.
Kafka Topics, Kafka Streams, Spark Streams, Apache Storm, and
Flink support exactly-once and at-least-once delivery semantics.
However, Akka Streams and Samza are unable to guarantee
exactly-once delivery, as shown in Table 1.
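The three delivery guarantees can be illustrated with a toy simulation. This is an assumption-laden sketch: real frameworks implement these semantics with acknowledgements, retries, and idempotent or transactional writes, not with the toy flags used here.

```python
# Toy simulation of the three delivery guarantees over an unreliable
# channel (illustrative only).

def at_most_once(messages, msg_lost):
    # fire-and-forget: a message whose delivery fails is simply gone
    return [m for m, lost in zip(messages, msg_lost) if not lost]

def at_least_once(messages, ack_lost):
    # the sender retries whenever the acknowledgement is lost, so the
    # receiver may see duplicates, but never loses a message
    out = []
    for m, lost in zip(messages, ack_lost):
        out.append(m)        # first delivery reaches the receiver
        if lost:             # ack lost -> sender re-sends the message
            out.append(m)
    return out

def exactly_once(messages, ack_lost):
    # at-least-once plus receiver-side de-duplication by message identity
    seen, out = set(), []
    for m in at_least_once(messages, ack_lost):
        if m not in seen:
            seen.add(m)
            out.append(m)
    return out

msgs, failures = ["a", "b", "c"], [False, True, False]
print(at_most_once(msgs, failures))    # ['a', 'c'] (message lost)
print(at_least_once(msgs, failures))   # ['a', 'b', 'b', 'c'] (duplicate)
print(exactly_once(msgs, failures))    # ['a', 'b', 'c']
```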
Spark Streams and Flink are similar and often compared with each other because both run as distributed services that execute submitted jobs. They provide similar, very rich analytics, and both can execute Apache Beam pipelines. Apache Beam [24] is an advanced unified programming model for implementing batch and streaming data processing jobs that run on any Beam runner.
Spark Streams and Flink manage all the issues of process scheduling: after jobs are submitted, they handle scalability, failover, load balancing, etc. Another advantage of Spark Streams and Flink is that they have large communities with ongoing updates and improvements, because they are widely adopted by big companies at scale. A drawback of Spark Streams and Flink is their restricted programming model: jobs must be written using APIs that conform to that model. Furthermore, integration with other services usually requires running the engines separately from the microservices and exchanging data through Kafka topics or other means, which adds some latency and more running applications at the system level. In addition, the overhead of these systems makes them less ideal for smaller data streams. In Spark Streams, data is captured in fixed time intervals and then processed as a "mini batch"; the drawback is that longer latencies are required (100 milliseconds or longer for the intervals).
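The mini-batch model above can be sketched as grouping timestamped events into fixed intervals. This is a simplified illustration, not Spark's actual API; timestamps are in milliseconds.

```python
# Sketch of Spark-style mini-batching: events are collected into fixed
# time intervals and each window is processed as one small batch, which
# is why end-to-end latency is bounded below by the interval length.

def mini_batches(events, interval_ms):
    """Group (timestamp_ms, value) events into fixed-size time windows."""
    batches = {}
    for ts, value in events:
        window = ts // interval_ms   # which interval the event falls in
        batches.setdefault(window, []).append(value)
    return [batches[w] for w in sorted(batches)]

# four tweets arriving over ~300 ms, batched into 100 ms windows
events = [(50, "t1"), (80, "t2"), (170, "t3"), (310, "t4")]
print(mini_batches(events, 100))   # [['t1', 't2'], ['t3'], ['t4']]
```

Note that "t1" cannot be emitted before its 100 ms window closes even though it arrived at 50 ms, which is the latency cost of mini-batching compared to per-record streaming.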
Akka Streams, Kafka Streams, Samza, and Storm are similar in that they run as libraries that can be embedded in microservices, providing greater flexibility in integrating analytics with other processes. Akka Streams is very flexible in terms of deployment and configuration options compared to Spark and Flink, and offers many flexibility and interoperation capabilities. When Akka Streams uses Kafka Topics to exchange data, consumer lag (i.e., queue depth) should be watched carefully, as it is a source of latency. Figure 4 shows the spectrum of microservices. Microservices are not always record oriented; it is a spectrum because we might take some events and also route them through a data pipeline. Compared to Kafka Streams, Akka Streams is more generic microservice oriented and less data-analytics oriented. Although both Akka and Kafka Streams can cover most of the spectrum, Akka emerged in the world of building Reactive microservices, while Kafka Streams is effectively a dataflow API.
Figure 4. Microservices spectrum
In Kafka Streams, there must always be a persistent buffer
between stream applications, as shown in Figure 5. Another
disadvantage of Kafka Streams is that all the nodes (processors) in
one topology must be written in one programming language.
Figure 5. Kafka Streams
As displayed in Table 1, Samza currently provides only an
at-least-once delivery guarantee; exactly-once semantics are
planned for future releases. On top of that, Samza currently
supports only Java and Scala for its high- and low-level APIs.
All in all, Apache Storm is one of the best distributed stream
processing platforms because: 1) it is not restricted to any specific
type of programming model or data structure, 2) it supports
at-least-once and exactly-once message delivery semantics, 3) it
provides real-time message processing with low latency, 4) it can
easily be integrated with other external platforms such as Apache
Kafka, and 5) it supports many programming languages, including
Java, Scala, Ruby, Python, JavaScript, and Perl.
Table 1. Platform Comparison

Spark Streams: programming model: Apache Beam | guarantees: at-least-once, exactly-once | latency: High | processing: mini batches | integration: requires extra engines | languages: JVM, Python, R, SparkSQL
Flink: programming model: Apache Beam | guarantees: at-least-once, exactly-once | latency: Medium | processing: real-time | integration: requires extra engines | languages: JVM
Akka Streams: programming model: high & low level APIs | guarantees: at-most-once, at-least-once | latency: Low | processing: real-time | integration: integratable | languages: JVM
Kafka Streams: programming model: high & low level APIs | guarantees: at-most-once, at-least-once, exactly-once | latency: Low | processing: real-time | integration: best integration platform | languages: JVM, Python, KSQL
Samza: programming model: high & low level APIs + Apache Beam | guarantees: at-least-once | latency: Low | processing: real-time | integration: integratable | languages: JVM, SamzaSQL
Apache Storm: programming model: model free | guarantees: at-least-once, exactly-once | latency: Low | processing: real-time | integration: integratable | languages: JVM, Ruby, Python, JavaScript, Perl
2.2.3 Apache Storm
An arrangement of Spouts and Bolts is called a topology. A
Spout is a source of data in a topology; it fetches data from an
external source and emits it to the Bolts. A Bolt performs the
actual data processing [21]. At the core of Apache Storm is a
Thrift definition [25] for defining and submitting topologies. As
shown in Figure 6, since Thrift can be utilized from any language,
topologies can be defined and submitted from any language.
Apache Storm is designed to be usable with any programming
language: Spouts and Bolts can be defined in any language, and
non-JVM Spouts and Bolts communicate with Apache Storm over
stdin/stdout.
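Since non-JVM components talk to Storm over stdin/stdout, a Bolt in another language essentially reads and writes JSON lines. The sketch below is a simplified illustration of that idea only; Storm's actual multilang protocol additionally includes a handshake, heartbeats, task ids, and an end-of-message delimiter, all omitted here.

```python
import json

# Simplified sketch of a non-JVM Bolt: one JSON-encoded tuple comes in on
# stdin, one JSON "emit" command goes back out on stdout. The real multilang
# protocol wraps this in a handshake and heartbeats, omitted for brevity.
def handle_tuple(line):
    """Decode one incoming tuple and return the JSON line to emit back."""
    tup = json.loads(line)
    text = tup["tuple"][0]
    # the actual processing step: here, a trivial uppercasing "NLP task"
    return json.dumps({"command": "emit", "tuple": [text.upper()]})

print(handle_tuple('{"tuple": ["hello storm"]}'))
# prints {"command": "emit", "tuple": ["HELLO STORM"]}
```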
Figure 6. Apache Storm
To parallelize the processing, each Spout or Bolt is executed as
many Tasks across a Storm cluster. Executors are the processing
threads in a Storm worker node that run one or more Tasks of the
same Spout or Bolt. Figure 7 displays a topology deployed on two
worker machines. It contains four threads (Executors), where each
thread consists of two Tasks.
The number of Executors is always less than or equal to the
number of Tasks. The number of Executors can be changed
without downtime, while the number of Tasks is fixed. When the
number of Tasks exceeds the number of Executors, the Tasks
inside an Executor run serially. For example, only four Tasks can
be active concurrently in Figure 7.
Stream grouping decides how a stream should be divided
among a Bolt's Tasks. Apache Storm supports eight types of
stream grouping, four of which are the most important and
practical. Shuffle grouping is the most popular: it distributes
tuples uniformly and arbitrarily across the Bolt's Tasks. Fields
grouping routes each message to a Task based on the content of a
chosen field. All grouping is a special grouping that sends a copy
of each message to every Task of a Bolt; it is often used to send
signals to Bolts. Global grouping is used only to combine results
from previous Bolts in the topology into a single Bolt: it sends all
messages to the single Task with the lowest ID.
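The difference between shuffle and fields grouping can be sketched in plain Python, modeling Task assignment as an index (no Storm APIs involved):

```python
import random

# Toy simulation of two Storm stream groupings distributing tuples over a
# Bolt's Tasks; task assignment is modeled simply as a task index.
def shuffle_grouping(tuples, num_tasks, rng=random.Random(0)):
    """Uniform, arbitrary assignment (seeded here so the sketch is reproducible)."""
    return [rng.randrange(num_tasks) for _ in tuples]

def fields_grouping(tuples, field, num_tasks):
    """Tuples with the same value of `field` always land on the same Task."""
    return [hash(t[field]) % num_tasks for t in tuples]

tweets = [{"lang": "en", "text": "hi"},
          {"lang": "ms", "text": "hai"},
          {"lang": "en", "text": "hello"}]
tasks = fields_grouping(tweets, "lang", num_tasks=4)
assert tasks[0] == tasks[2]   # both English tweets go to the same Task
```

Fields grouping is what makes the per-language streams of Section 3 possible: all messages of one language reach the same language-specific Tasks.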
Divide-and-conquer is one of the most important principles in
big data. All batch big-data processing platforms, such as
Hadoop and Spark, use the map-reduce technique, which is based on
divide-and-conquer logic [26]. Trident is an extension to Apache
Storm that provides divide-and-conquer logic for real-time
stream processing applications [27]. Using Trident, a message can
be divided into many pieces, distributed between many
Storm Tasks, and merged back into one message. To this end,
Trident offers join, aggregation, and grouping Bolts.
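The Trident split-and-merge idea can be sketched without Storm: a message is broken into per-sentence pieces keyed by message ID and position, each piece is processed independently, and the results are reassembled by ID. The naive period-based sentence splitter and the lower-casing "NLP task" below are placeholder stand-ins.

```python
# Toy sketch of Trident-style divide-and-conquer: split a message into
# sentence pieces, process each piece, then merge back by message ID.
def split(msg_id, text):
    """Naive sentence split; each piece carries (message ID, position, text)."""
    return [(msg_id, i, s.strip()) for i, s in enumerate(text.split(".")) if s.strip()]

def process(piece):                      # stand-in for an NLP Bolt
    msg_id, i, sentence = piece
    return (msg_id, i, sentence.lower())

def merge(pieces):
    """Reassemble processed sentences into one message per ID, in order."""
    out = {}
    for msg_id, i, s in sorted(pieces):
        out.setdefault(msg_id, []).append(s)
    return out

pieces = split(42, "First sentence. Second one.")
merged = merge(process(p) for p in pieces)
print(merged)   # {42: ['first sentence', 'second one']}
```

The sort-before-merge step mirrors why Trident carries some overhead: pieces must be tracked and re-ordered before the message can be reassembled, which is the cost observed in the experiments of Section 4.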
Figure 7. Storm Topology
3. PROPOSED ARCHITECTURE
The importance of Kafka Topics in real-time platforms is
explained in Section 2. As Kafka Topics provide persistent
data storage with the lowest latency, they are an essential part of our
architecture. The default behavior of a Kafka consumer is to send
an acknowledgement to the Kafka brokers after successfully
receiving a message. However, sending the acknowledgement can
be delayed until a later point in the processing. Figure 8
displays three different high-level designs for a real-time platform.
Figure 8-a shows the most common way of using Kafka Topics
in stream processing; it is used in this work as the baseline
experiment against which the proposed architecture is compared.
In Figure 8-a, the output of each processor is stored in a Kafka
Topic, which guarantees that no processor's output is lost in case
of a processor failure. In Figure 8-b, only the output of the last
processor is stored in a Topic; since the first processor sends the
acknowledgement to the input Topic, a message is lost if any later
processor is down. Figure 8-c is the best option for our platform
because no message is lost in case of a processor failure: although
only the output of the last processor is stored in a Topic, the
acknowledgement is sent by the last processor instead of the
first one.
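The effect of moving the acknowledgement can be modeled with a toy pipeline (no Kafka involved): a message is only removed from the input Topic once it is acked, so delaying the ack to the last processor, as in design (c), loses nothing when a middle stage fails.

```python
# Toy model of the acknowledgement strategies of Figure 8: a message is only
# removed from the input Topic once acked, so an early ack plus a later crash
# means the message is lost, while a delayed ack allows redelivery.
def run_pipeline(msg, processors, ack_after_first):
    acked = False
    for i, proc in enumerate(processors):
        if i == 0 and ack_after_first:
            acked = True              # design (b): ack as soon as stage 1 reads
        try:
            msg = proc(msg)
        except RuntimeError:
            return acked, None        # a processor failed mid-pipeline
    return True, msg                  # full success: ack in every design

def ok(m): return m + "+"
def crash(m): raise RuntimeError("processor down")

# design (b): the message is acked (and thus lost) despite the later crash
assert run_pipeline("x", [ok, crash, ok], ack_after_first=True) == (True, None)
# design (c): the message stays un-acked and will be redelivered
assert run_pipeline("x", [ok, crash, ok], ack_after_first=False) == (False, None)
```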
Figure 8. High Level Design
One of the main challenges of NLP tasks is handling different
languages. In the proposed architecture, a separate stream of data
is created for each language, so the dataflow can be routed to
language-specific Bolts. Figure 9 shows how a language
identifier Bolt can divide a stream of Tweets into two different
data streams. Most NLP pipelines are required to handle only a few
languages. For example, to analyze Tweets from Malaysia, the
English, Bahasa Malaysia, Chinese, and Tamil languages need to
be supported in the NLP pipeline. Since Apache Storm allows
different data streams inside one topology, each language is
treated as one stream. Figure 10 displays the proposed
architecture. Input data can come from any source, including
real-time streams and RDBMSs. A Kafka Producer connects to the
input source via a Kafka connector, fetches data from the source,
and pushes it to a Kafka Topic. Then a Spout reads data from the
Kafka Topic and sends it to the first Bolt (i.e., a set of Tasks).
Each Bolt sends data to the next Bolt based on the assigned
streams. The last Bolt writes data to a Kafka Topic. Finally, a
Kafka Consumer reads the results from the Kafka Topic and
either sends them to a real-time application or writes them to a
more persistent database.
Figure 11 displays an embodiment of the proposed
architecture to process English and Bahasa Malaysia texts.
OpenNLP is used in the language and sentence detection Bolts.
Since Bahasa Malaysia uses the English writing system, it is
assumed that the OpenNLP sentence detector and the fastText
tokenizer can handle both languages. There are two different
Bolts for PoS tagging: an English and a Bahasa Malaysia PoS
tagger. The language detector Bolt acts as a stream splitter that
creates different data streams based on the detected languages.
Finally, the Kafka writer Bolt converts the results into key-value
pairs and writes them into a Kafka Topic. In Figure 11, the Bolts
are implemented using different programming languages: SpaCy
is written in Python, and the other Bolts are in Java.
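The stream-splitting behavior of the language identifier Bolt can be illustrated in plain Python. The keyword-based detector below is a hypothetical stand-in for the OpenNLP/fastText models used in the actual implementation, and the two-language routing mirrors Figure 9.

```python
# Sketch of the language-identifier Bolt acting as a stream splitter. The
# detector here is a hypothetical keyword heuristic, standing in for the
# trained language identification models used in the paper.
MALAY_HINTS = {"saya", "dan", "tidak", "ini"}

def detect_language(text):
    """Return 'ms' if any Malay hint word appears, otherwise 'en'."""
    words = set(text.lower().split())
    return "ms" if words & MALAY_HINTS else "en"

def split_streams(tweets):
    """Route each (tweet_id, text) pair onto a per-language stream."""
    streams = {"en": [], "ms": []}
    for tweet_id, text in tweets:
        streams[detect_language(text)].append((tweet_id, text))
    return streams

streams = split_streams([(1, "saya suka ini"), (2, "good morning")])
print(streams["ms"])   # [(1, 'saya suka ini')]
```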
Figure 9. Stream Branching
Figure 10. Proposed Architecture
Figure 11. Sample Implementation
4. EXPERIMENT RESULTS
100,000 messages are pushed to Apache Kafka to be processed by
the proposed architecture as well as by the baseline experiment.
All messages are at least 450 characters long and contain a
minimum of two sentences. To compare the performance of
Storm with other platforms, three different experiments are
conducted:
• Baseline (Kafka): the baseline refers to running the NLP
tasks as separate Java applications. As shown in Figure 8-a, each
Text Processor (i.e., Java application) is responsible for reading
from a Kafka Topic, performing an NLP task, and writing to a
particular Kafka Topic.
• Apache Storm: a series of Bolts is arranged inside a
topology to perform the different NLP tasks. A Spout reads from
a Kafka Topic and sends each message to the sentence detection
Bolt, as shown in Figure 10 and Figure 11. As shown in Figure
8-c, an acknowledgement is sent to Apache Kafka only after a
message has been processed by all the Bolts.
• Storm Trident: messages are divided into multiple
messages based on the sentences detected inside each message.
After all NLP Bolts have processed the sentences, they are
merged back together based on the message ID.
Regarding the hardware resources, two sets of experiments are
conducted:
• Standalone: the experiments are conducted using a
single Virtual Machine (VM) with 8 GB RAM and 4 Intel cores
at 2299 MHz.
• Cluster: the experiments are conducted using a cluster
of three VMs with the same specification and configuration; each
VM has 8 GB RAM and 4 Intel cores at 2299 MHz.
Figure 12 displays the final results, where Apache Storm achieved
the best result: processing 100,000 messages in 8 minutes in
Cluster mode. Although Trident outperforms the baseline (i.e.
Kafka) experiment, it cannot reach the performance of Apache
Storm because there is some overhead to break a message into
separate messages and merge them back together. Figure 13
shows the processing trend for the Standalone experiment, and
Figure 14 displays the trend in Cluster mode. The fluctuation of
the lines in Figure 13 indicates that Trident and Storm cannot
reach their maximum efficiency because of the lack of resources:
each spike is followed by a dramatic drop. This problem is
resolved in Cluster mode, where at an optimal point Storm was
able to process about 22,000 messages in a minute.
Figure 12. Evaluation Result
Figure 13. Apache Kafka vs Apache Storm vs Trident (Standalone)
Figure 14. Apache Kafka vs Apache Storm vs Trident (Cluster)
5. CONCLUSION
The existence of many NLP modules from different sources, as
well as the increasing volume of real-time data, necessitates NLP
pipelines that are flexible and fast. There are several sources of
real-time generated data that require a data analysis pipeline, and
another challenge of NLP tasks is handling multiple languages in
real-time processing. Therefore, in recent years, there has been a
shift of focus from batch data processing towards stream processing.
Data streaming is a useful method for sending small pieces
of data in a continuous flow. Although Apache Kafka is one
of the most important platforms in real-time processing, it does
not provide distributed computation like other stream
processing platforms such as Akka, Flink, and Apache Storm.
Apache Kafka is nevertheless a vital part of stream processing
because it provides a rapid way to store, read, and write data from
persistent data sources (i.e., Kafka Topics). Among Spark
Streams, Flink, Akka Streams, Kafka Streams, Samza, and
Apache Storm, Apache Storm is selected for this study because it
is not restrained by any programming model, data structure, or
programming language. Moreover, Apache Storm supports
at-least-once and exactly-once message delivery semantics, and it
is easily integrable with other data sources, especially Apache
Kafka. This study examines the latency of Apache Storm while
handling NLP tasks.
A distributed architecture is proposed to handle OpenNLP,
fastText, and SpaCy modules for the Bahasa Malaysia and English
languages. The architecture is implemented using a mixture of
the Java and Python programming languages. The input and output
of the architecture are connected to Kafka Topics. A total of
100,000 messages, each at least 450 characters long with a
minimum of two sentences, were used to test the proposed
architecture. The results show that Apache Storm outperforms
Trident and the baseline experiment by processing the 100,000
messages in 8 minutes in cluster mode.
ACKNOWLEDGMENTS
This research was done at the Artificial Intelligence Lab, MIMOS
BERHAD.
REFERENCES
[1] F. N. A. Al Omran and C. Treude, “Choosing an NLP
Library for Analyzing Software Documentation: A
Systematic Literature Review and a Series of Experiments,”
in 2017 IEEE/ACM 14th International Conference on Mining
Software Repositories (MSR), 2017, pp. 187–197.
[2] R. de Castilho and I. Gurevych, “A broad-coverage
collection of portable NLP components for building
shareable analysis pipelines,” in Proceedings of the
Workshop on Open Infrastructures and Analysis Frameworks
for HLT, 2014, pp. 1–11, doi: 10.3115/v1/W14-5201.
[3] Z. Xiang, Z. Schwartz, J. H. Gerdes, and M. Uysal, “What
can big data and text analytics tell us about hotel guest
experience and satisfaction?,” Int. J. Hosp. Manag., vol. 44,
pp. 120–130, 2015, doi:
https://doi.org/10.1016/j.ijhm.2014.10.013.
[4] T. Kolajo, O. Daramola, and A. Adebiyi, “Big data stream
analysis: a systematic literature review,” J. Big Data, vol. 6,
no. 1, p. 47, 2019, doi: 10.1186/s40537-019-0210-7.
[5] L. B. Krithika and K. V. Akondi, “Survey on Various
Natural Language Processing Toolkits,” 2014.
[6] E. Loper and S. Bird, “NLTK: The Natural Language
Toolkit,” in Proceedings of the ACL-02 Workshop on
Effective Tools and Methodologies for Teaching Natural
Language Processing and Computational Linguistics -
Volume 1, 2002, pp. 63–70, doi: 10.3115/1118108.1118117.
[7] R. Rehurek and P. Sojka, “Software Framework for Topic
Modelling with Large Corpora,” in Proceedings of the LREC
2010 Workshop on New Challenges for NLP Frameworks,
2010, pp. 45–50.
[8] Apache Software Foundation, “openNLP Natural Language
Processing Library.” 2014.
[9] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J.
Bethard, and D. McClosky, “The Stanford CoreNLP Natural
Language Processing Toolkit,” in Association for
Computational Linguistics (ACL) System Demonstrations,
2014, pp. 55–60.
[10] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov,
“Enriching Word Vectors with Subword Information,” arXiv
Prepr. arXiv1607.04606, 2016.
[11] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of
Tricks for Efficient Text Classification,” arXiv Prepr.
arXiv1607.01759, 2016.
[12] M. Honnibal and I. Montani, “spaCy 2: Natural language
understanding with Bloom embeddings, convolutional neural
networks and incremental parsing,” 2017.
[13] J. D. Choi, J. Tetreault, and A. Stent, “It Depends:
Dependency Parser Comparison Using A Web-based
Evaluation Tool,” in Proceedings of the 53rd Annual
Meeting of the Association for Computational Linguistics
and the 7th International Joint Conference on Natural
Language Processing (Volume 1: Long Papers), 2015, pp.
387–396, doi: 10.3115/v1/P15-1038.
[14] G. Wilcock, “Text Annotation with OpenNLP and UIMA,”
in Proceedings of 17th Nordic Conference on Computational
Linguistics, NODALIDA, 2009, pp. 7–8.
[15] W. Hemati, T. Uslu, and A. Mehler, “Text Imager: a
Distributed UIMA-based System for NLP,” in Proceedings
of COLING 2016, the 26th International Conference on
Computational Linguistics: System Demonstrations, 2016,
pp. 59–63.
[16] M. Zaharia et al., “Apache Spark: A Unified Engine for Big
Data Processing,” Commun. ACM, vol. 59, no. 11, pp. 56–65,
Oct. 2016, doi: 10.1145/2934664.
[17] E. Friedman and K. Tzoumas, Introduction to Apache Flink:
Stream Processing for Real Time and Beyond, 1st ed.
O’Reilly Media, Inc., 2016.
[18] A. L. Davis, Reactive Streams in Java: Concurrency with
RxJava, Reactor, and Akka Streams, 1st ed. USA: Apress,
2018.
[19] S. Ehrenstein, “Scalability Benchmarking of Kafka Streams
Applications,” Institut für Informatik, 2020.
[20] S. A. Noghabi et al., “Samza: Stateful Scalable Stream
Processing at LinkedIn,” Proc. VLDB Endow., vol. 10, no.
12, pp. 1634–1645, Aug. 2017, doi:
10.14778/3137765.3137770.
[21] J. S. van der Veen, B. van der Waaij, E. Lazovik, W.
Wijbrandi, and R. J. Meijer, “Dynamically Scaling Apache
Storm for the Analysis of Streaming Data,” in Proceedings of
the 2015 IEEE First International Conference on Big Data
Computing Service and Applications, 2015, pp. 154–161,
doi: 10.1109/BigDataService.2015.56.
[22] N. Garg, Apache Kafka. Packt Publishing, 2013.
[23] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed,
“ZooKeeper: Wait-Free Coordination for Internet-Scale
Systems,” in Proceedings of the 2010 USENIX Conference
on USENIX Annual Technical Conference, 2010, p. 11.
[24] H. Karau, “Unifying the open big data world: The
possibilities of Apache BEAM,” in 2017 IEEE International
Conference on Big Data (Big Data), 2017, p. 3981, doi:
10.1109/BigData.2017.8258410.
[25] A. Agarwal, M. Slee, and M. Kwiatkowski, “Thrift: Scalable
Cross-Language Services Implementation,” 2007.
[26] A. B. Patel, M. Birla, and U. Nair, “Addressing big data
problem using Hadoop and Map Reduce,” in 2012 Nirma
University International Conference on Engineering
(NUiCONE), 2012, pp. 1–5, doi:
10.1109/NUICONE.2012.6493198.
[27] A. Jain, Mastering Apache Storm: Real-Time Big Data
Streaming Using Kafka, Hbase and Redis. Packt Publishing,
2017.