This document describes a deep learning model called the Predict Text Classification Network (PTCN) that was developed to predict the outcome (pass/fail) of congressional roll call votes based solely on the text of legislation. The PTCN uses a hybrid convolutional and long short-term memory neural network architecture to analyze legislative texts and predict whether a vote will pass or fail. The model was tested on legislative texts from 2000-2019 and achieved an average prediction accuracy of 67.32% using 10-fold cross-validation, suggesting it can recognize patterns in language that correlate with congressional voting behaviors.
New prediction method for data spreading in social networks based on machine ...TELKOMNIKA JOURNAL
Information diffusion prediction is the study of the path of dissemination of news, information, or topics in a structured data such as a graph. Research in this area is focused on two goals, tracing the information diffusion path and finding the members that determine future the next path. The major problem of traditional approaches in this area is the use of simple probabilistic methods rather than intelligent methods. Recent years have seen growing interest in the use of machine learning algorithms in this field. Recently, deep learning, which is a branch of machine learning, has been increasingly used in the field of information diffusion prediction. This paper presents a machine learning method based on the graph neural network algorithm, which involves the selection of inactive vertices for activation based on the neighboring vertices that are active in a given scientific topic. Basically, in this method, information diffusion paths are predicted through the activation of inactive vertices byactive vertices. The method is tested on three scientific bibliography datasets: The Digital Bibliography and Library Project (DBLP), Pubmed, and Cora. The method attempts to answer the question that who will be the publisher of thenext article in a specific field of science. The comparison of the proposed method with other methods shows 10% and 5% improved precision in DBL Pand Pubmed datasets, respectively.
Semantic similarity and semantic relatedness
measure in particular is very important in the current scenario
due to the huge demand for natural language processing based
applications such as chatbots and information retrieval systems
such as knowledge base based FAQ systems. Current approaches
generally use similarity measures which does not use the context
sensitive relationships between the words. This leads to erroneous
similarity predictions and is not of much use in real life
applications. This work proposes a novel approach that gives an
accurate relatedness measure of any two words in a sentence by
taking their context into consideration. This context correction
results in a more accurate similarity prediction which results in
higher accuracy of information retrieval systems.
A QUERY LEARNING ROUTING APPROACH BASED ON SEMANTIC CLUSTERSijait
Peer-to-peer systems have recently a remarkable success in the social, academic, and commercial communities. A fundamental problem in Peer-to-Peer systems is how to efficiently locate appropriate peers to answer a specific query (Query Routing Problem). A lot of approaches have been carried out to enhance search result quality as well as to reduce network overhead. Recently, researches focus on methods based on query-oriented routing indices. These methods utilize the historical information of past queries and query hits to build a local knowledge base per peer, which represents the user's interests or profile. When a peer forwards a given query, it evaluates the query against its local knowledge base in order to select a set of relevant peers to whom the query will be routed. Usually, an insufficient number of relevant peers is selected from the current peer's local knowledge base thus a broadcast search is investigated which badly affects the approach efficiency. To tackle this problem, we introduce a novel method that clusters peers having similar interests. It exploits not only the current peer's knowledge base but also that of the others in
the cluster to extract relevant peers. We implemented the proposed approach, and tested (i) its retrieval effectiveness in terms of recall and precision, (ii) its search cost in terms of messages traffic and visited peers number. Experimental results show that our approach improves the recall and precision metrics while reducing dramatically messages traffic.
Scale-Free Networks to Search in Unstructured Peer-To-Peer NetworksIOSR Journals
This document discusses using scale-free networks to improve search efficiency in unstructured peer-to-peer networks. It proposes the EQUATOR architecture, which creates an overlay network topology based on the scale-free Barabasi-Albert model. Simulation results show that EQUATOR achieves good lookup performance comparable to the ideal Barabasi-Albert network, with low message overhead even under node churn. The scale-free topology allows random walks to efficiently locate resources by directing searches to high-degree "hub" nodes with greater knowledge of the network.
Great model a model for the automatic generation of semantic relations betwee...ijcsity
The
large
a
v
ailable
am
ou
n
t
of
non
-
structured
texts
that
b
e
-
long
to
differe
n
t
domains
su
c
h
as
healthcare
(e.g.
medical
records),
justice
(e.g.
l
a
ws,
declarations),
insurance
(e.g.
declarations),
etc. increases
the
effort
required
for
the
analysis
of
information
in
a
decision making
pro
-
cess.
Differe
n
t
pr
o
jects
and t
o
ols
h
av
e
pro
p
osed
strategies
to
reduce
this
complexi
t
y
b
y
classifying,
summarizing
or
annotating
the
texts.
P
artic
-
ularl
y
,
text
summary
strategies
h
av
e
pr
ov
en
to
b
e
v
ery
useful
to
pr
o
vide
a
compact
view
of
an
original
text.
H
ow
e
v
er,
the
a
v
ailable
strategies
to
generate
these
summaries
do
not
fit
v
ery
w
ell
within
the
domains
that
require
ta
k
e
i
n
to
consideration
the
tem
p
oral
dimension
of
the
text
(e.g.
a
rece
n
t
piece
of
text
in
a
medical
record
is
more
im
p
orta
n
t
than
a
pre
-
vious
one)
and
the
profile
of
the
p
erson
who
requires
the
summary
(e.g
the
medical
s
p
ecialization).
T
o
co
p
e with
these
limitations
this
pa
p
er
prese
n
ts
”GRe
A
T”
a
m
o
del
for
automatic
summary
generation
that
re
-
lies
on
natural
language
pr
o
cessing
and
text
mining
te
c
hniques
to
extract
the
most
rele
v
a
n
t
information
from
narrati
v
e
texts
and
disc
o
v
er
new
in
-
formation
from
the
detection
of
related
information. GRe
A
T
M
o
del
w
as impleme
n
ted
on
sof
tw
are
to
b
e
v
alidated
in
a
health
institution
where
it
has
sh
o
wn
to
b
e
v
ery
useful
to displ
a
y
a
preview
of
the
information
a
b
ou
t
medical
health
records
and
disc
o
v
er
new
facts
and
h
y
p
otheses
within
the
information.
Se
v
eral
tests
w
ere
executed
su
c
h
as
F
unctional
-
i
t
y
,
Usabili
t
y
and
P
erformance
regarding
to
the
impleme
n
ted
sof
t
w
are.
In
addition,
precision
and
recall
measures
w
ere
applied
on
the
results
ob
-
tained
through
the
impleme
n
ted
t
o
ol,
as
w
ell
as
on
the
loss
of
information
obtained
b
y
pr
o
viding
a
text
more
shorter than
the
original
Data mining is the knowledge discovery in databases and the gaol is to extract patterns and knowledge from
large amounts of data. The important term in data mining is text mining. Text mining extracts the quality
information highly from text. Statistical pattern learning is used to high quality information. High –quality in
text mining defines the combinations of relevance, novelty and interestingness. Tasks in text mining are text
categorization, text clustering, entity extraction and sentiment analysis. Applications of natural language
processing and analytical methods are highly preferred to turn
The document discusses designing learning activities focused on computational thinking for children based on public transportation environments. It presents preliminary results from a survey of parents that found most explained step-by-step how to reach destinations to their children and many children understood differences between transportation methods. Examples are given of potential activities using the Paris metro system that could develop computational thinking skills like decomposition and pattern matching. The discussion concludes more research is needed but public transportation could enable developing these skills through daily experiences.
Delay Tolerant Networking routing as a Game Theory problem – An OverviewCSCJournals
This document discusses modeling delay tolerant networking (DTN) routing as a game theory problem. It provides background on DTN, which aims to enable communication in disrupted networks, and discusses how routing in DTN can be viewed as a strategic interaction between nodes. The document then gives an overview of game theory and different game forms. It proposes analyzing DTN routing as a game with Nash equilibrium, where nodes make forwarding decisions rationally based on beliefs about other nodes' actions.
New prediction method for data spreading in social networks based on machine ...TELKOMNIKA JOURNAL
Information diffusion prediction is the study of the path of dissemination of news, information, or topics in a structured data such as a graph. Research in this area is focused on two goals, tracing the information diffusion path and finding the members that determine future the next path. The major problem of traditional approaches in this area is the use of simple probabilistic methods rather than intelligent methods. Recent years have seen growing interest in the use of machine learning algorithms in this field. Recently, deep learning, which is a branch of machine learning, has been increasingly used in the field of information diffusion prediction. This paper presents a machine learning method based on the graph neural network algorithm, which involves the selection of inactive vertices for activation based on the neighboring vertices that are active in a given scientific topic. Basically, in this method, information diffusion paths are predicted through the activation of inactive vertices byactive vertices. The method is tested on three scientific bibliography datasets: The Digital Bibliography and Library Project (DBLP), Pubmed, and Cora. The method attempts to answer the question that who will be the publisher of thenext article in a specific field of science. The comparison of the proposed method with other methods shows 10% and 5% improved precision in DBL Pand Pubmed datasets, respectively.
Semantic similarity and semantic relatedness
measure in particular is very important in the current scenario
due to the huge demand for natural language processing based
applications such as chatbots and information retrieval systems
such as knowledge base based FAQ systems. Current approaches
generally use similarity measures which does not use the context
sensitive relationships between the words. This leads to erroneous
similarity predictions and is not of much use in real life
applications. This work proposes a novel approach that gives an
accurate relatedness measure of any two words in a sentence by
taking their context into consideration. This context correction
results in a more accurate similarity prediction which results in
higher accuracy of information retrieval systems.
A QUERY LEARNING ROUTING APPROACH BASED ON SEMANTIC CLUSTERSijait
Peer-to-peer systems have recently a remarkable success in the social, academic, and commercial communities. A fundamental problem in Peer-to-Peer systems is how to efficiently locate appropriate peers to answer a specific query (Query Routing Problem). A lot of approaches have been carried out to enhance search result quality as well as to reduce network overhead. Recently, researches focus on methods based on query-oriented routing indices. These methods utilize the historical information of past queries and query hits to build a local knowledge base per peer, which represents the user's interests or profile. When a peer forwards a given query, it evaluates the query against its local knowledge base in order to select a set of relevant peers to whom the query will be routed. Usually, an insufficient number of relevant peers is selected from the current peer's local knowledge base thus a broadcast search is investigated which badly affects the approach efficiency. To tackle this problem, we introduce a novel method that clusters peers having similar interests. It exploits not only the current peer's knowledge base but also that of the others in
the cluster to extract relevant peers. We implemented the proposed approach, and tested (i) its retrieval effectiveness in terms of recall and precision, (ii) its search cost in terms of messages traffic and visited peers number. Experimental results show that our approach improves the recall and precision metrics while reducing dramatically messages traffic.
Scale-Free Networks to Search in Unstructured Peer-To-Peer NetworksIOSR Journals
This document discusses using scale-free networks to improve search efficiency in unstructured peer-to-peer networks. It proposes the EQUATOR architecture, which creates an overlay network topology based on the scale-free Barabasi-Albert model. Simulation results show that EQUATOR achieves good lookup performance comparable to the ideal Barabasi-Albert network, with low message overhead even under node churn. The scale-free topology allows random walks to efficiently locate resources by directing searches to high-degree "hub" nodes with greater knowledge of the network.
Great model a model for the automatic generation of semantic relations betwee...ijcsity
The
large
a
v
ailable
am
ou
n
t
of
non
-
structured
texts
that
b
e
-
long
to
differe
n
t
domains
su
c
h
as
healthcare
(e.g.
medical
records),
justice
(e.g.
l
a
ws,
declarations),
insurance
(e.g.
declarations),
etc. increases
the
effort
required
for
the
analysis
of
information
in
a
decision making
pro
-
cess.
Differe
n
t
pr
o
jects
and t
o
ols
h
av
e
pro
p
osed
strategies
to
reduce
this
complexi
t
y
b
y
classifying,
summarizing
or
annotating
the
texts.
P
artic
-
ularl
y
,
text
summary
strategies
h
av
e
pr
ov
en
to
b
e
v
ery
useful
to
pr
o
vide
a
compact
view
of
an
original
text.
H
ow
e
v
er,
the
a
v
ailable
strategies
to
generate
these
summaries
do
not
fit
v
ery
w
ell
within
the
domains
that
require
ta
k
e
i
n
to
consideration
the
tem
p
oral
dimension
of
the
text
(e.g.
a
rece
n
t
piece
of
text
in
a
medical
record
is
more
im
p
orta
n
t
than
a
pre
-
vious
one)
and
the
profile
of
the
p
erson
who
requires
the
summary
(e.g
the
medical
s
p
ecialization).
T
o
co
p
e with
these
limitations
this
pa
p
er
prese
n
ts
”GRe
A
T”
a
m
o
del
for
automatic
summary
generation
that
re
-
lies
on
natural
language
pr
o
cessing
and
text
mining
te
c
hniques
to
extract
the
most
rele
v
a
n
t
information
from
narrati
v
e
texts
and
disc
o
v
er
new
in
-
formation
from
the
detection
of
related
information. GRe
A
T
M
o
del
w
as impleme
n
ted
on
sof
tw
are
to
b
e
v
alidated
in
a
health
institution
where
it
has
sh
o
wn
to
b
e
v
ery
useful
to displ
a
y
a
preview
of
the
information
a
b
ou
t
medical
health
records
and
disc
o
v
er
new
facts
and
h
y
p
otheses
within
the
information.
Se
v
eral
tests
w
ere
executed
su
c
h
as
F
unctional
-
i
t
y
,
Usabili
t
y
and
P
erformance
regarding
to
the
impleme
n
ted
sof
t
w
are.
In
addition,
precision
and
recall
measures
w
ere
applied
on
the
results
ob
-
tained
through
the
impleme
n
ted
t
o
ol,
as
w
ell
as
on
the
loss
of
information
obtained
b
y
pr
o
viding
a
text
more
shorter than
the
original
Data mining is the knowledge discovery in databases and the gaol is to extract patterns and knowledge from
large amounts of data. The important term in data mining is text mining. Text mining extracts the quality
information highly from text. Statistical pattern learning is used to high quality information. High –quality in
text mining defines the combinations of relevance, novelty and interestingness. Tasks in text mining are text
categorization, text clustering, entity extraction and sentiment analysis. Applications of natural language
processing and analytical methods are highly preferred to turn
The document discusses designing learning activities focused on computational thinking for children based on public transportation environments. It presents preliminary results from a survey of parents that found most explained step-by-step how to reach destinations to their children and many children understood differences between transportation methods. Examples are given of potential activities using the Paris metro system that could develop computational thinking skills like decomposition and pattern matching. The discussion concludes more research is needed but public transportation could enable developing these skills through daily experiences.
Delay Tolerant Networking routing as a Game Theory problem – An OverviewCSCJournals
This document discusses modeling delay tolerant networking (DTN) routing as a game theory problem. It provides background on DTN, which aims to enable communication in disrupted networks, and discusses how routing in DTN can be viewed as a strategic interaction between nodes. The document then gives an overview of game theory and different game forms. It proposes analyzing DTN routing as a game with Nash equilibrium, where nodes make forwarding decisions rationally based on beliefs about other nodes' actions.
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
This document compares the k-means data mining and outlier detection approaches for network-based intrusion detection. It analyzes four datasets capturing network traffic using both approaches. The k-means approach clusters traffic into normal and abnormal flows, while outlier detection calculates an outlier score for each flow. The document finds that k-means was more accurate and precise, with a better classification rate than outlier detection. It requires less computer resources than outlier detection. This comparison of the approaches can help network administrators choose the best intrusion detection method.
Data-to-text technologies present an enormous and exciting opportunity to help
audiences understand some of the insights present in today’s vasts and growing amounts of electronic
data. In this article we analyze the potential value and benefits of these solutions as well as their risks
and limitations for a wider penetration. These technologies already bring substantial advantages of
cost, time, accuracy and clarity versus other traditional approaches or format. On the other hand,
there are still important limitations that restrict the broad applicability of these solutions, most
importantly in the limited quality of their output. However we find that the current state of
development is sufficient for the application of these solution across many domains and use cases and
recommend businesses of all sectors to consider how to deploy them to enhance the value they are
currently getting from their data. As the availability of data keeps growing exponentially and natural
language generation technology keeps improving, we expect data-to-text solutions to take a much
more bigger role in the production of automated content across many different domains.
The document introduces the visual mapping sentence (MS) methodology for multifaceted research design and analysis. The MS is a visual representation that classifies research variables into facets. It guides hypothesis generation and systematic data collection/analysis. The MS links components of a research domain to suggest relationships between variables and facets. It maps all relevant variables and provides information about excluded variables. Two examples of MS are included in figures to illustrate their use. The MS methodology is presented as a valuable tool that can help address limitations of other research methods by emphasizing conceptual definition and structure of content areas.
Six Degrees of Separation to Improve Routing in Opportunistic Networksijujournal
This document discusses using small-world network concepts for routing in opportunistic networks. It analyzes three real-world datasets representing contact graphs and finds they exhibit small-world properties with high clustering and short path lengths. The document proposes a simple routing algorithm that applies these findings and concludes it outperforms other algorithms in simulations by taking temporal contact factors into account.
Predicting Forced Population Displacement Using News ArticlesJaresJournal
The world has witnessed mass forced population displacement across the globe. Population displacement has various indications, with different social and policy consequences. Mitigation of the humanitarian crisis requires tracking and predicting the population movements to
allocate the necessary resources and inform the policymakers. The set of events that triggers population movements can be traced in the news articles. In this paper, we propose the Population
Displacement-Signal Extraction Framework (PD-SEF) to explore a large news corpus and extract
the signals of forced population displacement. PD-SEF measures and evaluates violence signals,
which is a critical factor of forced displacement from it. Following signal extraction, we propose a
displacement prediction model based on extracted violence scores. Experimental results indicate
the effectiveness of our framework in extracting high quality violence scores and building accurate
prediction models.
The document describes an interactive topic modeling framework that allows non-expert users to provide feedback to iteratively refine topic models. It presents a mechanism for encoding user feedback as correlations between words to guide topic modeling. More efficient inference algorithms are developed for tree-based topic models to support interactivity. The framework is evaluated with both simulated and real users, and is shown to help users navigate datasets and understand political discourse.
This document is an electronic assignment coversheet for a student submitting an assignment in the unit ICT349 Forensic Data Analysis. It includes the student's details, assignment information including the topic "The consequences of Internet mobility protocols on network forensics", and a declaration acknowledging the university's academic integrity policy. An appendix provides additional context on the unit and assignment requirements.
The document discusses mining frequent items and item sets from data streams using fuzzy approaches. It describes objectives of mining frequent items from datasets in real-time using fuzzy sets and slices. This involves fetching relevant records, analyzing the data, searching for liked items using fuzzy slices, identifying frequently viewed item lists, making recommendations, and evaluating the results. Algorithms used for mining frequent items from data streams in a single or multiple pass are also reviewed.
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...University of Bari (Italy)
The current abundance of electronic documents requires automatic techniques that support the users in understanding their content and extracting useful information. To this aim, improving the retrieval performance must necessarily go beyond simple lexical interpretation of the user queries, and pass through an understanding of their semantic content and aims. It goes without saying that any digital library would take enormous advantage from the availability of effective Information Retrieval techniques to provide to their users. This paper proposes an approach to Information Retrieval based on a correspondence of the domain of discourse between the query and the documents in the repository. Such an association is based on standard general-purpose linguistic resources (WordNet and WordNet Domains) and on a novel similarity assessment technique. Although the work is at a preliminary stage, interesting initial results suggest to go on extending and improving the approach.
An efficient-classification-model-for-unstructured-text-documentSaleihGero
The document presents a classification model for unstructured text documents that aims to support both generality and efficiency. The model follows the logical sequence of text classification steps and proposes a combination of techniques for each step. Specifically, it uses multinomial naive bayes classification with term frequency-inverse document frequency (TF-IDF) representation. The model is tested on the 20-Newsgroups dataset and results show improved performance over precision, recall, and f-score compared to other models.
New Heuristic Model for Optimal CRC Polynomial IJECEIAES
Cyclic Redundancy Codes (CRCs) are important for maintaining integrity in data transmissions. CRC performance is mainly affected by the polynomial chosen. Recent increases in data throughput require a foray into determining optimal polynomials through software or hardware implementations. Most CRC implementations in use, offer less than optimal performance or are inferior to their newer published counterparts. Classical approaches to determining optimal polynomials involve brute force based searching a population set of all possible polynomials in that set. This paper evaluates performance of CRC-polynomials generated with Genetic Algorithms. It then compares the resultant polynomials, both with and without encryption headers against a benchmark polynomial.
This document discusses predicting new friendships in social networks using temporal information. It describes research on predicting new links in social networks over time using supervised learning models trained on temporal features from past network interactions. The researchers used anonymized Facebook data over 28 months to train decision tree and neural network classifiers to predict new relationships, finding models using temporal information performed better than those without it.
Prediction of Answer Keywords using Char-RNNIJECEIAES
Generating sequences of characters using a Recurrent Neural Network (RNN) is a tried and tested method for creating unique and context aware words, and is fundamental in Natural Language Processing tasks. These type of Neural Networks can also be used a question-answering system. The main drawback of most of these systems is that they work from a factoid database of information, and when queried about new and current information, the responses are usually bleak. In this paper, the author proposes a novel approach to finding answer keywords from a given body of news text or headline, based on the query that was asked, where the query would be of the nature of current affairs or recent news, with the use of Gated Recurrent Unit (GRU) variant of RNNs. Thus, this ensures that the answers provided are relevant to the content of query that was put forth.
P2P DOMAIN CLASSIFICATION USING DECISION TREE ijp2p
The increasing interest in Peer-to-Peer systems (such as Gnutella) has inspired many research activities
in this area. Although many demonstrations have been performed that show that the performance of a
Peer-to-Peer system is highly dependent on the underlying network characteristics, much of the
evaluation of Peer-to-Peer proposals has used simplified models that fail to include a detailed model of
the underlying network. This can be largely attributed to the complexity in experimenting with a scalable
Peer-to-Peer system simulator built on top of a scalable network simulator. A major problem of
unstructured P2P systems is their heavy network traffic. In Peer-to-Peer context, a challenging problem
is how to find the appropriate peer to deal with a given query without overly consuming bandwidth?
Different methods proposed routing strategies of queries taking into account the P2P network at hand.
This paper considers an unstructured P2P system based on an organization of peers around Super-Peers
that are connected to Super-Super-Peer according to their semantic domains; in addition to integrating
Decision Trees in P2P architectures to produce Query-Suitable Super-Peers, representing a community
of peers where one among them is able to answer the given query. By analyzing the queries log file, a
predictive model that avoids flooding queries in the P2P network is constructed after predicting the
appropriate Super-Peer, and hence the peer to answer the query. A challenging problem in a schemabased Peer-to-Peer (P2P) system is how to locate peers that are relevant to a given query. In this paper,
architecture, based on (Super-)Peers is proposed, focusing on query routing. The approach to be
implemented, groups together (Super-)Peers that have similar interests for an efficient query routing
method. In such groups, called Super-Super-Peers (SSP), Super-Peers submit queries that are often
processed by members of this group. A SSP is a specific Super-Peer which contains knowledge about: 1.
its Super-Peers and 2. The other SSP. Knowledge is extracted by using data mining techniques (e.g.
Decision Tree algorithms) starting from queries of peers that transit on the network. The advantage of
this distributed knowledge is that, it avoids making semantic mapping between heterogeneous data
sources owned by (Super-)Peers, each time the system decides to route query to other (Super-) Peers.
The set of SSP improves the robustness in queries routing mechanism, and the scalability in P2P
Network. Compared with a baseline approach,the proposal architecture shows the effect of the data
mining with better performance in respect to response time and precision.
Concurrent Inference of Topic Models and Distributed Vector RepresentationsParang Saraf
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
An optimal unsupervised text data segmentation 3prj_publication
This document summarizes a research paper that proposes an unsupervised text data segmentation model using genetic algorithms (OUTDSM) to improve clustering accuracy. The OUTDSM uses an encoding strategy, fitness function, and genetic operators to evolve optimal text clusters. Experimental results presented in the research paper demonstrate that OUTDSM can arrive at global optima and prevent stagnation at local optima due to its biologically diverse population. Key areas of related work discussed include text mining techniques, clustering methods, and prior uses of optimization algorithms like genetic algorithms for text mining tasks.
Resource management and search is very important yet challenging in large-scale distributed systems like
P2Pnetworks. Most existing P2P systems rely on indexing to efficiently route queries over the network.
However, searches based on such indices face two key issues. First, majority of existing search schemes
often rely on simply keyword based indices that can only support exact string based matches without taking
into account the meaning of words. Second it is difficult, if not impossible, to devise query based indexing
schemes that can represent all possible concept combinations without resulting in exponential index sizes.
To address these problems, we present BSI, a novel P2P indexing and query routing strategy to support
semantic based content searches. The BSI indexing structure captures the semantic content of documents
using a reference ontology. Our indexing scheme can efficiently handle multi-concept queries by
maintaining summary level information for each individual concept and concept combinations using a
novel space-efficient Two-level Semantic Bloom Filter(TSBF) data structure. By using TSBFs to represent
a large document and query base, BSI significantly reduces the communication cost and storage cost of
indices. Furthermore, We devise a low-overhead mechanism to allow peers to dynamically estimate the
relevance strength of a peer for multi-concept queries with high accuracy solely based on TSBFs. We also
propose a routing index compression mechanism to observe peers’ dynamic storage limitations with
minimal loss of information by exploiting a reference ontology structure. Based on the proposed index
structure, we design a novel query routing algorithm that exploits semantic based information to route
queries to semantically relevant peers. Performance evaluation demonstrates that our proposed approach
can improve the search recall of unstructured P2P systems up to 383.71% while keeping the
communication cost at a low level compared to state-of-art search mechanism OSQR [7].
This document summarizes research posters being presented at a computer science and electrical engineering department research review. It describes 8 posters presented by BS, MS, and PhD students. The posters cover topics such as identifying political affiliations in blogs, statistically weighted visualization hierarchies, voter verifiable optical-scan voting, predictive caching in mobile networks, generating statistical volume models, predicting appropriate semantic web terms, approximating online social network community structure, and utilizing semantic policies for managing BGP route dissemination.
ARABIC TEXT MINING AND ROUGH SET THEORY FOR DECISION SUPPORT SYSTEM.Hassanein Alwan
The present work based on an implement classification system for Arab Complaint. The
basis of this system is the construction of Decision Support Systems, Text Mining, and Rough Set, a
the specialized semantic approach of correlation which is referred to as the model of ASN for creating
semantic structure between all document’s words. A particular semantic weight represents true
significance of this term in the text after each feature is probable to be utilized the semantic selection
of a suggested feature based on 2 threshold values, the first being the maximum weight in a document
and the other one representing the characteristic having the maximum semantic scale with the first one
of the thresholds. The task of classification in the stage of testing depends on the dependency degree
concept in the raw group theory than before to treat all the characteristics of every one of the classes
that results from the stage of training as particular condition rules and therefore, considered in the
lower rounding group. The outputs of this classification system are sufficient, performance is quite
good persuasive, and assessment of this the system through measurement precision.
Activity Context Modeling in Context-AwareEditor IJCATR
The explosion of mobile devices has fuelled the advancement of pervasive computing to provide personal assistance in this
information-driven world. Pervasive computing takes advantage of context-aware computing to track, use and adapt to contextual
information. The context that has attracted the attention of many researchers is the activity context. There are six major techniques that
are used to model activity context. These techniques are key-value, logic-based, ontology-based, object-oriented, mark-up schemes and
graphical. This paper analyses these techniques in detail by describing how each technique is implemented while reviewing their pros
and cons. The paper ends with a hybrid modeling method that fits heterogeneous environment while considering the entire of modeling
through data acquisition and utilization stages. The modeling stages of activity context are data sensation, data abstraction and
reasoning and planning. The work revealed that mark-up schemes and object-oriented are best applicable at the data sensation stage.
Key-value and object-oriented techniques fairly support data abstraction stage whereas the logic-based and ontology-based techniques
are the ideal techniques for reasoning and planning stage. In a distributed system, mark-up schemes are very useful in data
communication over a network and graphical technique should be used when saving context data into database.
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
This document compares the k-means data mining and outlier detection approaches for network-based intrusion detection. It analyzes four datasets capturing network traffic using both approaches. The k-means approach clusters traffic into normal and abnormal flows, while outlier detection calculates an outlier score for each flow. The document finds that k-means was more accurate and precise, with a better classification rate than outlier detection. It requires less computer resources than outlier detection. This comparison of the approaches can help network administrators choose the best intrusion detection method.
Data-to-text technologies present an enormous and exciting opportunity to help
audiences understand some of the insights present in today’s vasts and growing amounts of electronic
data. In this article we analyze the potential value and benefits of these solutions as well as their risks
and limitations for a wider penetration. These technologies already bring substantial advantages of
cost, time, accuracy and clarity versus other traditional approaches or format. On the other hand,
there are still important limitations that restrict the broad applicability of these solutions, most
importantly in the limited quality of their output. However we find that the current state of
development is sufficient for the application of these solution across many domains and use cases and
recommend businesses of all sectors to consider how to deploy them to enhance the value they are
currently getting from their data. As the availability of data keeps growing exponentially and natural
language generation technology keeps improving, we expect data-to-text solutions to take a much
more bigger role in the production of automated content across many different domains.
The document introduces the visual mapping sentence (MS) methodology for multifaceted research design and analysis. The MS is a visual representation that classifies research variables into facets. It guides hypothesis generation and systematic data collection/analysis. The MS links components of a research domain to suggest relationships between variables and facets. It maps all relevant variables and provides information about excluded variables. Two examples of MS are included in figures to illustrate their use. The MS methodology is presented as a valuable tool that can help address limitations of other research methods by emphasizing conceptual definition and structure of content areas.
Six Degrees of Separation to Improve Routing in Opportunistic Networksijujournal
This document discusses using small-world network concepts for routing in opportunistic networks. It analyzes three real-world datasets representing contact graphs and finds they exhibit small-world properties with high clustering and short path lengths. The document proposes a simple routing algorithm that applies these findings and concludes it outperforms other algorithms in simulations by taking temporal contact factors into account.
Predicting Forced Population Displacement Using News ArticlesJaresJournal
The world has witnessed mass forced population displacement across the globe. Population displacement has various indications, with different social and policy consequences. Mitigation of the humanitarian crisis requires tracking and predicting the population movements to
allocate the necessary resources and inform the policymakers. The set of events that triggers population movements can be traced in the news articles. In this paper, we propose the Population
Displacement-Signal Extraction Framework (PD-SEF) to explore a large news corpus and extract
the signals of forced population displacement. PD-SEF measures and evaluates violence signals,
which is a critical factor of forced displacement from it. Following signal extraction, we propose a
displacement prediction model based on extracted violence scores. Experimental results indicate
the effectiveness of our framework in extracting high quality violence scores and building accurate
prediction models.
The document describes an interactive topic modeling framework that allows non-expert users to provide feedback to iteratively refine topic models. It presents a mechanism for encoding user feedback as correlations between words to guide topic modeling. More efficient inference algorithms are developed for tree-based topic models to support interactivity. The framework is evaluated with both simulated and real users, and is shown to help users navigate datasets and understand political discourse.
This document is an electronic assignment coversheet for a student submitting an assignment in the unit ICT349 Forensic Data Analysis. It includes the student's details, assignment information including the topic "The consequences of Internet mobility protocols on network forensics", and a declaration acknowledging the university's academic integrity policy. An appendix provides additional context on the unit and assignment requirements.
The document discusses mining frequent items and item sets from data streams using fuzzy approaches. It describes objectives of mining frequent items from datasets in real-time using fuzzy sets and slices. This involves fetching relevant records, analyzing the data, searching for liked items using fuzzy slices, identifying frequently viewed item lists, making recommendations, and evaluating the results. Algorithms used for mining frequent items from data streams in a single or multiple pass are also reviewed.
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...University of Bari (Italy)
The current abundance of electronic documents requires automatic techniques that support the users in understanding their content and extracting useful information. To this aim, improving the retrieval performance must necessarily go beyond simple lexical interpretation of the user queries, and pass through an understanding of their semantic content and aims. It goes without saying that any digital library would take enormous advantage from the availability of effective Information Retrieval techniques to provide to their users. This paper proposes an approach to Information Retrieval based on a correspondence of the domain of discourse between the query and the documents in the repository. Such an association is based on standard general-purpose linguistic resources (WordNet and WordNet Domains) and on a novel similarity assessment technique. Although the work is at a preliminary stage, interesting initial results suggest to go on extending and improving the approach.
An efficient-classification-model-for-unstructured-text-documentSaleihGero
The document presents a classification model for unstructured text documents that aims to support both generality and efficiency. The model follows the logical sequence of text classification steps and proposes a combination of techniques for each step. Specifically, it uses multinomial naive bayes classification with term frequency-inverse document frequency (TF-IDF) representation. The model is tested on the 20-Newsgroups dataset and results show improved performance over precision, recall, and f-score compared to other models.
New Heuristic Model for Optimal CRC Polynomial IJECEIAES
Cyclic Redundancy Codes (CRCs) are important for maintaining integrity in data transmissions. CRC performance is mainly affected by the polynomial chosen. Recent increases in data throughput require a foray into determining optimal polynomials through software or hardware implementations. Most CRC implementations in use, offer less than optimal performance or are inferior to their newer published counterparts. Classical approaches to determining optimal polynomials involve brute force based searching a population set of all possible polynomials in that set. This paper evaluates performance of CRC-polynomials generated with Genetic Algorithms. It then compares the resultant polynomials, both with and without encryption headers against a benchmark polynomial.
This document discusses predicting new friendships in social networks using temporal information. It describes research on predicting new links in social networks over time using supervised learning models trained on temporal features from past network interactions. The researchers used anonymized Facebook data over 28 months to train decision tree and neural network classifiers to predict new relationships, finding models using temporal information performed better than those without it.
Prediction of Answer Keywords using Char-RNNIJECEIAES
Generating sequences of characters using a Recurrent Neural Network (RNN) is a tried and tested method for creating unique and context aware words, and is fundamental in Natural Language Processing tasks. These type of Neural Networks can also be used a question-answering system. The main drawback of most of these systems is that they work from a factoid database of information, and when queried about new and current information, the responses are usually bleak. In this paper, the author proposes a novel approach to finding answer keywords from a given body of news text or headline, based on the query that was asked, where the query would be of the nature of current affairs or recent news, with the use of Gated Recurrent Unit (GRU) variant of RNNs. Thus, this ensures that the answers provided are relevant to the content of query that was put forth.
P2P DOMAIN CLASSIFICATION USING DECISION TREE ijp2p
The increasing interest in Peer-to-Peer systems (such as Gnutella) has inspired many research activities
in this area. Although many demonstrations have been performed that show that the performance of a
Peer-to-Peer system is highly dependent on the underlying network characteristics, much of the
evaluation of Peer-to-Peer proposals has used simplified models that fail to include a detailed model of
the underlying network. This can be largely attributed to the complexity in experimenting with a scalable
Peer-to-Peer system simulator built on top of a scalable network simulator. A major problem of
unstructured P2P systems is their heavy network traffic. In Peer-to-Peer context, a challenging problem
is how to find the appropriate peer to deal with a given query without overly consuming bandwidth?
Different methods proposed routing strategies of queries taking into account the P2P network at hand.
This paper considers an unstructured P2P system based on an organization of peers around Super-Peers
that are connected to Super-Super-Peer according to their semantic domains; in addition to integrating
Decision Trees in P2P architectures to produce Query-Suitable Super-Peers, representing a community
of peers where one among them is able to answer the given query. By analyzing the queries log file, a
predictive model that avoids flooding queries in the P2P network is constructed after predicting the
appropriate Super-Peer, and hence the peer to answer the query. A challenging problem in a schemabased Peer-to-Peer (P2P) system is how to locate peers that are relevant to a given query. In this paper,
architecture, based on (Super-)Peers is proposed, focusing on query routing. The approach to be
implemented, groups together (Super-)Peers that have similar interests for an efficient query routing
method. In such groups, called Super-Super-Peers (SSP), Super-Peers submit queries that are often
processed by members of this group. A SSP is a specific Super-Peer which contains knowledge about: 1.
its Super-Peers and 2. The other SSP. Knowledge is extracted by using data mining techniques (e.g.
Decision Tree algorithms) starting from queries of peers that transit on the network. The advantage of
this distributed knowledge is that, it avoids making semantic mapping between heterogeneous data
sources owned by (Super-)Peers, each time the system decides to route query to other (Super-) Peers.
The set of SSP improves the robustness in queries routing mechanism, and the scalability in P2P
Network. Compared with a baseline approach,the proposal architecture shows the effect of the data
mining with better performance in respect to response time and precision.
Concurrent Inference of Topic Models and Distributed Vector RepresentationsParang Saraf
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
An optimal unsupervised text data segmentation 3prj_publication
This document summarizes a research paper that proposes an unsupervised text data segmentation model using genetic algorithms (OUTDSM) to improve clustering accuracy. The OUTDSM uses an encoding strategy, fitness function, and genetic operators to evolve optimal text clusters. Experimental results presented in the research paper demonstrate that OUTDSM can arrive at global optima and prevent stagnation at local optima due to its biologically diverse population. Key areas of related work discussed include text mining techniques, clustering methods, and prior uses of optimization algorithms like genetic algorithms for text mining tasks.
Resource management and search is very important yet challenging in large-scale distributed systems like
P2Pnetworks. Most existing P2P systems rely on indexing to efficiently route queries over the network.
However, searches based on such indices face two key issues. First, majority of existing search schemes
often rely on simply keyword based indices that can only support exact string based matches without taking
into account the meaning of words. Second it is difficult, if not impossible, to devise query based indexing
schemes that can represent all possible concept combinations without resulting in exponential index sizes.
To address these problems, we present BSI, a novel P2P indexing and query routing strategy to support
semantic based content searches. The BSI indexing structure captures the semantic content of documents
using a reference ontology. Our indexing scheme can efficiently handle multi-concept queries by
maintaining summary level information for each individual concept and concept combinations using a
novel space-efficient Two-level Semantic Bloom Filter(TSBF) data structure. By using TSBFs to represent
a large document and query base, BSI significantly reduces the communication cost and storage cost of
indices. Furthermore, We devise a low-overhead mechanism to allow peers to dynamically estimate the
relevance strength of a peer for multi-concept queries with high accuracy solely based on TSBFs. We also
propose a routing index compression mechanism to observe peers’ dynamic storage limitations with
minimal loss of information by exploiting a reference ontology structure. Based on the proposed index
structure, we design a novel query routing algorithm that exploits semantic based information to route
queries to semantically relevant peers. Performance evaluation demonstrates that our proposed approach
can improve the search recall of unstructured P2P systems up to 383.71% while keeping the
communication cost at a low level compared to state-of-art search mechanism OSQR [7].
This document summarizes research posters being presented at a computer science and electrical engineering department research review. It describes 8 posters presented by BS, MS, and PhD students. The posters cover topics such as identifying political affiliations in blogs, statistically weighted visualization hierarchies, voter verifiable optical-scan voting, predictive caching in mobile networks, generating statistical volume models, predicting appropriate semantic web terms, approximating online social network community structure, and utilizing semantic policies for managing BGP route dissemination.
ARABIC TEXT MINING AND ROUGH SET THEORY FOR DECISION SUPPORT SYSTEM.Hassanein Alwan
The present work based on an implement classification system for Arab Complaint. The
basis of this system is the construction of Decision Support Systems, Text Mining, and Rough Set, a
the specialized semantic approach of correlation which is referred to as the model of ASN for creating
semantic structure between all document’s words. A particular semantic weight represents true
significance of this term in the text after each feature is probable to be utilized the semantic selection
of a suggested feature based on 2 threshold values, the first being the maximum weight in a document
and the other one representing the characteristic having the maximum semantic scale with the first one
of the thresholds. The task of classification in the stage of testing depends on the dependency degree
concept in the raw group theory than before to treat all the characteristics of every one of the classes
that results from the stage of training as particular condition rules and therefore, considered in the
lower rounding group. The outputs of this classification system are sufficient, performance is quite
good persuasive, and assessment of this the system through measurement precision.
Activity Context Modeling in Context-AwareEditor IJCATR
The explosion of mobile devices has fuelled the advancement of pervasive computing to provide personal assistance in this
information-driven world. Pervasive computing takes advantage of context-aware computing to track, use and adapt to contextual
information. The context that has attracted the attention of many researchers is the activity context. There are six major techniques that
are used to model activity context. These techniques are key-value, logic-based, ontology-based, object-oriented, mark-up schemes and
graphical. This paper analyses these techniques in detail by describing how each technique is implemented while reviewing their pros
and cons. The paper ends with a hybrid modeling method that fits heterogeneous environment while considering the entire of modeling
through data acquisition and utilization stages. The modeling stages of activity context are data sensation, data abstraction and
reasoning and planning. The work revealed that mark-up schemes and object-oriented are best applicable at the data sensation stage.
Key-value and object-oriented techniques fairly support data abstraction stage whereas the logic-based and ontology-based techniques
are the ideal techniques for reasoning and planning stage. In a distributed system, mark-up schemes are very useful in data
communication over a network and graphical technique should be used when saving context data into database.
An in-depth review on News Classification through NLPIRJET Journal
This document provides an in-depth literature review of news classification through natural language processing (NLP). It discusses several existing approaches to news classification, including models that use convolutional neural networks (CNNs), graph-based approaches, and attention mechanisms. The document also notes that current search engines often return too many irrelevant results, so classification could help layer search results. It concludes that while many techniques have been developed, inconsistencies remain in effectively classifying news, so further research on combining NLP, feature extraction, and fuzzy logic is needed.
Discover How Scientific Data is Used for the Public Good with Natural Languag...BaoTramDuong2
This document discusses using natural language processing techniques like n-grams, deep learning models, and named entity recognition to analyze scientific publications and identify references to datasets. It evaluates classifiers like recurrent neural networks and convolutional neural networks to perform sequence labeling and extract dataset citations. The goal is to help government agencies and researchers quickly find datasets, measures, and experts by automating the analysis of research articles.
The document proposes using conditional random fields (CRFs) to improve legal document summarization. CRFs are applied to segment legal documents into seven labeled rhetorical components. Feature sets are used to improve CRF performance. A term distribution model and structured domain knowledge are then used to extract key sentences for each rhetorical category. The resulting structured summary is found to be 80% accurate compared to ideal summaries generated by experts.
The document proposes using conditional random fields (CRFs) to improve legal document summarization. CRFs are applied to segment legal documents into seven labeled rhetorical components. Feature sets are used to improve CRF performance. A term distribution model and structured domain knowledge are then used to extract key sentences for each rhetorical category. The resulting structured summary is found to be 80% accurate compared to ideal summaries generated by experts.
The document proposes applying conditional random fields (CRFs) for legal document summarization. CRFs are used to segment legal documents into seven labeled rhetorical roles. Feature sets are employed to improve CRF performance. A term distribution model and structured domain knowledge are then used to extract key sentences related to the rhetorical categories. The final summaries generated are around 80% accurate compared to expert-generated summaries.
The document proposes using conditional random fields (CRFs) to improve legal document summarization. CRFs are applied to segment legal documents into seven labeled rhetorical components. Feature sets are used to improve CRF performance. A term distribution model and structured domain knowledge are then used to extract key sentences for each rhetorical category. The resulting structured summary is found to be 80% accurate compared to ideal summaries generated by experts.
The document summarizes text mining techniques in data mining. It discusses common text mining tasks like text categorization, clustering, and entity extraction. It also reviews several text mining algorithms and techniques, including information extraction, clustering, classification, and information visualization. Several literature papers applying these techniques to domains like movie reviews, research proposals, and e-commerce are also summarized. The document concludes that text mining can extract useful patterns from unstructured text through techniques like clustering, classification, and information extraction.
A large-scale sentiment analysis using political tweetsIJECEIAES
Twitter has become a key element of political discourse in candidates’ campaigns. The political polarization on Twitter is vital to politicians as it is a popular public medium to analyze and predict public opinion concerning political events. The analysis of the sentiment of political tweet contents mainly depends on the quality of sentiment lexicons. Therefore, it is crucial to create sentiment lexicons of the highest quality. In the proposed system, the domain-specific of the political lexicon is constructed by using the supervised approach to extract extreme political opinions words, and features in tweets. Political multi-class sentiment analysis (PMSA) system on the big data platform is developed to predict the inclination of tweets to infer the results of the elections by conducting the analysis on different political datasets: including the Trump election dataset and the BBC News politics. The comparative analysis is the experimental results which are better political text classification by using the three different models (multinomial naïve Bayes (MNB), decision tree (DT), linear support vector classification (SVC)). In the comparison of three different models, linear SVC has the better performance than the other two techniques. The analytical evaluation results show that the proposed system can be performed with 98% accuracy in linear SVC.
A simplified classification computational model of opinion mining using deep ...IJECEIAES
Opinion and attempts to develop an automated system to determine people's viewpoints towards various units such as events, topics, products, services, organizations, individuals, and issues. Opinion analysis from the natural text can be regarded as a text and sequence classification problem which poses high feature space due to the involvement of dynamic information that needs to be addressed precisely. This paper introduces effective modelling of human opinion analysis from social media data subjected to complex and dynamic content. Firstly, a customized preprocessing operation based on natural language processing mechanisms as an effective data treatment process towards building quality-aware input data. On the other hand, a suitable deep learning technique, bidirectional long short term-memory (Bi-LSTM), is implemented for the opinion classification, followed by a data modelling process where truncating and padding is performed manually to achieve better data generalization in the training phase. The design and development of the model are carried on the MATLAB tool. The performance analysis has shown that the proposed system offers a significant advantage in terms of classification accuracy and less training time due to a reduction in the feature space by the data treatment operation.
Supreme court dialogue classification using machine learning models IJECEIAES
This study aimed to classify sentences from supreme court dialogues as being said by a justice or non-justice using machine learning models. Two models were tested - naïve Bayes and logistic regression. The models were tested on datasets from individual court cases and a combined dataset. The naïve Bayes model performed better than logistic regression on individual cases, achieving AUC scores of 88.54% and 83.74%. However, on the combined dataset both models performed equally poorly with an AUC score of 67.72%. The study showed that models trained on individual cases yielded better performance than ones trained on multiple cases, demonstrating the importance of case specificity for legal classification tasks.
Knowledge Graph and Similarity Based Retrieval Method for Query Answering SystemIRJET Journal
This document proposes a knowledge graph and question answering system to extract and analyze information from large volumes of unstructured data like annual reports. It discusses using natural language processing techniques like named entity recognition with spaCy and dependency parsing to extract entity-relation pairs from text and construct a knowledge graph. For question answering, it analyzes user queries with similar NLP approaches and then matches query triplets to the knowledge graph to retrieve answers, combining information retrieval and trained classifiers. The proposed system aims to provide faster understanding and analysis of complex, unstructured data for professionals.
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANSijseajournal
ABSTRACT
In this paper we propose a novel method to cluster categorical data while retaining their context. Typically, clustering is performed on numerical data. However it is often useful to cluster categorical data as well, especially when dealing with data in real-world contexts. Several methods exist which can cluster categorical data, but our approach is unique in that we use recent text-processing and machine learning advancements like GloVe and t- SNE to develop a a context-aware clustering approach (using pre-trained
word embeddings). We encode words or categorical data into numerical, context-aware, vectors that we use to cluster the data points using common clustering algorithms like K-means.
Analyzing The Organization Of Collaborative Math Problem-Solving In Online Ch...Michele Thomas
The document describes how a statistical analysis of online math problem-solving chat logs led to an unexpected result. Conversation analysis was then used to better understand this result. Specifically:
- A statistical test on coding of chat logs showed no effect from whether the math problem was announced in advance or at the start of the session.
- Correlation analysis revealed some chat sessions were more similar to those from the other group than their own, requiring further investigation.
- Conversation analysis identified the phenomenon behind the unexpected statistical result and helped revise the coding scheme and quantitative approach.
- The combined use of quantitative and qualitative methods provided a more complete understanding of collaboration in the online math problem-solving sessions.
Here are the key points about using content-based filtering techniques:
- Content-based filtering relies on analyzing the content or description of items to recommend items similar to what the user has liked in the past. It looks for patterns and regularities in item attributes/descriptions to distinguish highly rated items.
- The item content/descriptions are analyzed automatically by extracting information from sources like web pages, or entered manually from product databases.
- It focuses on objective attributes about items that can be extracted algorithmically, like text analysis of documents.
- However, personal preferences and what makes an item appealing are often subjective qualities not easily extracted algorithmically, like writing style or taste.
- So while content-based filtering can
This document discusses techniques for analyzing unstructured text data from computer data inspection. It discusses using clustering algorithms like K-means and hierarchical clustering to automatically group related documents without supervision. The goal is to help computer examiners analyze large amounts of text data more efficiently. Prior work on clustering ensembles, evolving gene expression clusters, self-organizing maps, and thematically clustering search results is reviewed as relevant to this problem. The problem is how to identify and cluster documents stored across multiple remote locations during computer inspections when existing algorithms make this difficult.
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...IJDKP
As existing computer search engines struggle to understand the meaning of natural language, semantically
enriched metadata may improve interest-based search engine capabilities and user satisfaction.
This paper presents an enhanced version of the ecosystem focusing on semantic topic metadata detection
and enrichments. It is based on a previous paper, a semantic metadata enrichment software ecosystem
(SMESE). Through text analysis approaches for topic detection and metadata enrichments this paper
propose an algorithm to enhance search engines capabilities and consequently help users finding content
according to their interests. It presents the design, implementation and evaluation of SATD (Scalable
Annotation-based Topic Detection) model and algorithm using metadata from the web, linked open data,
concordance rules, and bibliographic record authorities. It includes a prototype of a semantic engine using
keyword extraction, classification and concept extraction that allows generating semantic topics by text,
and multimedia document analysis using the proposed SATD model and algorithm.
The performance of the proposed ecosystem is evaluated using a number of prototype simulations by
comparing them to existing enriched metadata techniques (e.g., AlchemyAPI, DBpedia, Wikimeta, Bitext,
AIDA, TextRazor). It was noted that SATD algorithm supports more attributes than other algorithms. The
results show that the enhanced platform and its algorithm enable greater understanding of documents
related to user interests.
Similar to A Deep Learning Model to Predict Congressional Roll Call Votes from Legislative Texts (20)
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
A Deep Learning Model to Predict Congressional Roll Call Votes from Legislative Texts
1. Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.3/4, December 2020
DOI:10.5121/mlaij.2020.7402 15
A DEEP LEARNING MODEL TO PREDICT
CONGRESSIONAL ROLL CALL VOTES FROM
LEGISLATIVE TEXTS
Jonathan Wayne Korn and Mark A. Newman
Department of Data Science, Harrisburg University, Harrisburg, Pennsylvania
ABSTRACT
Developments in natural language processing (NLP) techniques, convolutional neural networks (CNNs),
and long-short- term memory networks (LSTMs) allow for a state-of-the-art automated system capable of
predicting the status (pass/fail) of congressional roll call votes. The paper introduces a custom hybrid
model labeled "Predict Text Classification Network" (PTCN), which inputs legislation and outputs a
prediction of the document's classification (pass/fail). The convolutional layers and the LSTM layers
automatically recognize features from the input data's latent space. The PTCN's custom architecture
provides elements enabling adaptation to the input's variance from adjustment to the kernel weights over
time. On the document level, the model reported an average evaluation of 67.32% using 10-fold cross-
validation. The results suggest that the model can recognize congressional voting behaviors from the
associated legislation's language. Overall, the PTCN provides a solution with competitive performance to
related systems targeting congressional roll call votes.
KEYWORDS
Deep Learning (DL), Convolutional Neural Networks (CNNs), Long-Short-Term Memory Networks
(LSTMs), Natural Language Processing (NLP), Congressional Roll Call Votes
1. INTRODUCTION
Predicting the status (pass/fail) of congressional roll call votes has been political scientists' goal
for decades. There are patterns of congressional voting behavior captured in the legislative text,
which has shown significance when predicting congressional votes' status. Understanding the
future status of legislation provides vital insights into government and industry matters.
Analyzing roll-call data allows insight into information detailing the legislation's vote status and
can predict future votes [1].
Other approaches using quantitative roll call data and legislative text have shown success in the
past, however only under certain conditions. Their success expresses limitations due to the
dimensionality of the data and unforeseen conditions in Congress's complex social environment.
For instance, using two separate datasets diminishes a model's flexibility to predict and adapt to
the event space's changing conditions.
Political environments are complex social networks that often create noisy data. The scale of
topics that the government considers in legislation is a diverse subject matter, and the language is
sparse. Most past approaches rely on an attempt to use extra dimensionality from both text and
quantitative data. However, using extra dimensionality produces limitations, or predictive
bottleneck, in approaches using text and quantitative congressional data. Many situations occur
when congressional roll call data is not available or represents uncontrolled conditions creating
2. Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.3/4, December 2020
16
issues predicting the event. For instance, a pure expression of mixed but non-partisan support in
the voting data could limit existing models' overall accuracy.
A more robust model is required to address the current limitations expressed in prior models
addressing the problem discussed above. Statistical modeling seeks to learn the joint probability
function from words
contained in the texts [2]. However, it is difficult to obtain this goal because of the "... curse of
dimensionality" [2]. However, the rise of deep learning makes it possible for computer systems to
recognize patterns in complex text representations. Advancements in natural language processing
techniques allow text to convert into tokenized word vector spaces. The conversion provides the
proper dimensions to embed the texts into different deep neural networks. A custom hybrid
architecture provides the ability to input texts and provide accurate outputs from recognized
patterns. Convolutional neural networks (CNNs) and long- short term memory neural networks
(LSTMs) can recognize these intricate patterns within the data's dimensions. Each layer in the
network provides benefits in recognizing temporal and spatial features from abstract data
representations. Mainly, CNNs reflect successful results in identifying abstract data patterns
mainly because of development in max-pooling layers.
However, due to long lag periods from the data's complexity, the CNN architecture cannot alone
capture the text's patterns. A primary reason for implementing LSTM layers in the model's
architecture is to overcome the long lag time problem that occurs when processing high
dimensional data. Overcoming long lag periods allows for the neural network's depth to continue,
enabling minute features to be recognized.
Combining CNN and LSTM algorithms provide an architecture capable of recognizing features
in sparse word vector spaces over long periods, known as a Convolutional Long-Short Term
Memory Neural Network (C-LSTM). Adaptable layers during kernel initialization help filter out
the non-significant features from the inputs dimensional space. The adaptability in each layer of
the network provides stability in the predictions and robustness against the dynamic social
environment creating the data. The custom architecture ensures the network depth is suitable for
accurate prediction from highly sparse text samples. The paper presents a custom deep learning
solution to accomplish the binary classification of legislative texts.
1.1. Organizational Structure
The remainder of the paper is organized into the following sections, including Section 2.
Literature Review, Section 3. Methods and Materials, Section 4. Results, Section 5. Conclusion,
and Section 6. Future Work. Section 3. Methods and Materials includes two sections including
3.1. The Data, and 3.2. The Deep Learning Approach. Section 3.1. includes Section 3.1.1. Pre-
processing text and 3.1.2. Equal Distribution of Samples. Section 3.2. includes 3.2.1. The Custom
PTCN Architecture and 3.2.2. The PTCN Modelling Process.
2. LITERATURE REVIEW
A focus for quantitative political scientists is using roll-call data to understand legislative voting
behaviors better. Few models have attempted to use supplementary data such as the text of
legislation to understand congressional voting trends better.
In 1991, research expressed that spatial positions captured in roll call data is stable and contains
reliable features to recognize voting patterns [3]. The party discipline is present in the spatial
3. Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.3/4, December 2020
17
dimensions of roll-call votes of the legislator's ideal points captured in the data [4]. Spatial
dimensions in the context of the domain refer to the probabilistic word occurrence contained
across the text samples. Researchers utilized a method using Bayesian simulation models to
capture the ideal points of the legislators. The approach allows researchers to ensure the beliefs
incorporated into the inputs' dimensional space for roll call analysis [5]. The method enables
researchers to handle the increasing complexity in higher-dimensional contexts [5]. Policy ideas
are standard features in legislation that provide insight into legislators' behavior similar to
quantitative roll call data [6].
In 2011, Gerrish and Blei introduced a probabilistic model capable of inferring a person's
political position for specific topics [7]. The model focuses on capturing the individual
representative's view on specific political interests from the text alone. The authors have used 12
years of congressional legislative data in their experiment to capture significant patterns. The
patterns express the lawmakers' vote behavior and on which type of document [7]. Gerrish and
Blei integrated the analysis of text into quantitative models such as the ideal
point model with success inferring "... ideal points and bills' locations..." from roll call data
leading to the prediction of legislative vote status (pass/fail) [8]. The integration of the bill texts
into the ideal point model helped mitigate a limitation of only predicting on vote data alone,
which may, at times, be inconsistent in its availability. The authors developed a supervised ideal
point topic model capable of predicting pending bills using votes. It also is a method of exploring
the connection between language and political support [8]. Other models developed by Gerrish
and Blei to predict the "...inferred ideal points using different forms of regression on phrase
counts." [8]. However, lead existing models cannot predict when mixed but non-partisan support
is present in the data. Many existing models cannot expand beyond one-dimensional limitations,
such as the ideal point model [8]. The authors also based their model's performance on the
baseline that 85% of all votes are 'yea', limiting the model's performance results. The authors
reported their ideal topic model predicted 89% of the votes with 64 topics, and their L2 model
predicted 90% [8]. In sequential predictions, both their models predicted 87% and 88.1%
accurate at predicting future votes, respectively [8]. Their study only attempts to understand a
few topics with a one-dimensional political space, which creates a predictive bottleneck [8].
Interestingly, the ideal point model reflects representatives' preferences, constituency preferences,
or any other feature indicating a preference for particular legislation [1]. In 2013, Spatio-temporal
modeling expressed success in using text to predict congressional roll call votes with relative
success to the ideal points model [9]. In 2015, ideal points were redefined as two
characterizations, including word and vote choice [10]. These characterize the ideal points and
the dimensions of the policy. The researchers utilize Sparse Factor Analysis to combine both
votes and textual data to estimate the ideal points [10].
In 2016, Yang and others introduced hierarchical attention networks for document classification
[11], the results outperformed previous models by a large margin, which can be indicated in the
author's results using the Yelp 2013, Yelp 2014, Yelp 2015, IMBD review, yahoo Answer, and
Amazon review datasets to test their algorithm and compare it to other methods [11]. On average,
the HN-ATT performed with about a 70% accuracy rate across the tasks. They were generalizing
the network to address multiple types of tasks that limit their ability to identify the text's critical
features, such as the political ideology in the congressional legislative texts across spatial and
temporal dimensions. A similar approach that uses multi-dimensional bill texts to predict roll
calls' status is Kraft, Jain, and Rush's approach established in 2016. Using an embedding model
with prepared bill texts, they competed with Gerrish and Blei's approach. Mainly the approach
utilized ideal vectors rather than ideal points [12]. The approach relies on quantitative data that
leads to the same predictive bottlenecks as prior models due to the complex event space.
4. Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.3/4, December 2020
18
In 2016, the Gov2Vec method expressed success in capturing opinions from legal documents.
The text's transformation allows words to be embedded in a model to learn the representations of
individuals' opinions [13]. In 2017 it became possible to use quantitative data and text to predict
if a bill will become law. The researcher states that they always perform better than using text or
content individually [14]. Interestingly, the author conducted three experiments, including
"...text-only, text, and context, or context-only..." to test the predictive power of each type of
model [14]. Nay's approach uses a language model that provides a prediction on the text's
sentence level, providing a probability of the sentence contributing to the bill text's (pass/fail)
status [14]. The data included a few performance measures and two data conditions spanning 14
years.[14]. Nay concluded that a text alone approach is fundamental for better results. For newer
data, the use of bill text outperforms bill context-only. Context-only only outperforms bill text
alone for older data [14]. The most important finding is that text adds predictive power to the
model. However, the approach relies on two data sources, which creates a predictive bottleneck.
As the most successful approach to date using text alone, the model can predict a 65% accuracy
[14].
A hybrid C-LSTM Neural Networks architecture can help mitigate the predictive bottleneck in
existing models by capturing more features from the bill texts alone. In statistical language
modeling, the primary goal is learning. The main objective to learn is the "... joint probability
function ..." of each sequence of words contained in the language [2]. However, there is difficulty
in this type of task because of the curse of dimensionality. By learning distributed representations
of language, the curse mitigates. The main issue is addressing the variance from the training data
to the testing data. Past research discovered "... the similarities
between words to obtain generalization from training sequences to new sequences..." [15] [16]
[17] [18]. Much of this type of work is thanks to contributions by Schutze, 1993, where vector-
space representations for words can be learned based on the probability of the word co-occurring
in documents [19]. Most of the experimental workings can be summed up from learning
distributed feature vectors to represent their similarities between words, which is discussed in
[15] [17] [20]. Past research has shown that the hybridization of CNNs and LSTMs tasked to
solve text classification problems is successful [21]. The combination of the two different types
of neural networks builds a C-LSTM [21]. In 2015, Zhou et al. introduced a successful C- LSTM
in sentence and document modeling [21]. The CNN extracts a sequence of high-level phrase
representations, which are in-return fed through LSTM layers set to obtain the sentence-
representations [21]. Overall, the hybrid model performed better than existing systems in
classification tasks. The results indicated that the local features of phrases and the sentence's
global/temporal semantics are recognizable by their model [21]. In 2019, a team of researchers
took advantage of convolutional recurrent neural networks to tackle text classification tasks [22].
Their experiments showed that the method of using a C-LSTM achieves better success than other
networks.
3. METHODS AND MATERIALS
The following sections discuss key components to developing the PTCN and the associated
results referred to in Section IV. The methodology is summarized below in Figure 1.
5. Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.3/4, December 2020
19
Fig. 1: Visualization of the methodology.
3.1. The Data
The original data is extracted and organized from (https://www.Govtrack.us) consisting of
legislative texts and associated quantitative roll call data. The original data contained samples
from the year 2000 to 2019, including 3668 samples from the house and senate. Refer to Figure 2
for an example of original legislative texts. Not all the samples from the original population will
be included in the experiment due to limiting factors as discussed below.
Fig. 2: Legislative text samples.
6. Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.3/4, December 2020
20
3.1.1. Pre-Processing the Text
A series of NLP techniques are required to supervise the custom C-LSTM training to classify the
legislative text based on voting status. The text samples contain noise, which is in the form of
stop words, uppercasing, special characters, NA values, punctuation, numbers, and whitespace.
The removal of noise from the text samples helps mitigate the C-LSTM from recognizing
patterns within the texts that create bias predictions.
The C-LSTM is a deep learning algorithm that automatically extracts features from an input
vector. Each text undergoes augmentation using the following conditions, including converting
data types from character to a string, lower case conversion, stop-words removal using the
“SMART” function, punctuation removal, number removal, white-space removal, and document
stemming. The above augmentation resulted in Figure 3.
Fig. 3: Sample of pre-processed Legislative texts.
After pre-processing, the text goes through a tokenization and vectorization step resulting in each
word, symbol, or any other character represented as a unique number. For instance, the word
‘bill’ is represented by the value of 3109 across all the documents. Note the text is limited to
10000 max features during the tokenization process. Refer to Figure 4 for an example of the
vectorized vocabulary from the legislative documents.
Fig. 4: Sample of tokenized and vectorized Vocabulary.
The pre-processing of the text resulted in vectorized legislative documents, as seen in Figure 5. A
method of padding is implemented to transform all the texts to the same length. The dimensions
of the text are converted into a tokenized, vectorized, and padded format with a max length of
10,000.
7. Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.3/4, December 2020
21
Fig. 5: Sample of tokenized, vectorized, and padded text.
3.1.2. Equal Distribution of Samples
The labeled legislative text samples are balanced into an equal distribution of each voting status
to reduce bias. It is essential to mitigate bias in the data. Deep neural networks transform inputs
over extensive credit assignment pathways (CAPs). The more complicated the dimensions of an
input, then the more complex the CAPs.
Each vote’s status was captured by determining the number of “yea” or “Aye” responses reached
for each legislation. Each translates into a value of 1, representing a vote to pass the document.
All other responses are considered a vote against the document, labeled as a value of 0.
Interestingly, the number of documents labeled one is greater than 0 labeled documents. Note that
some votes require a special condition to pass, which is more than 2/3 of the votes. The study
ignores the special voting condition because it only focuses on an equal distribution of (pass/fail)
text representations.
An equal distribution of the voting statuses in the sample is necessary to mitigate bias in the
network for either classification. The text samples are randomly selected from each class to
represent an equal distribution, which reduces the original number of samples. Each class of vote
is represented equally by 98 randomly selected legislative texts. 98 is the maximum number of 0
labeled samples available in the data due to the behavior of congress. Each text is a max length of
10,000 features. It should be noted that most of the legislative texts are close or higher to the
maximum number of features. Custom features in the PTCN’s architecture are implemented to
deal with the small number of samples during training as discussed below.
3.2. The Deep Learning Approach
The following section discusses the custom deep learning architecture and modelling process
implemented in the study.
3.2.1. The Custom PTCN Architecture
The sequential deep learning model is a stack of different layers set with several parameters,
including dropout rate, hidden convolutional nodes, LSTM hidden nodes, L1 regularization rate,
L2 regularization rate, batch size, input max length, max features, embedding dimensions, leaky
Relu rate, kernel size, epochs, max-pooling size, learning rate, and validation split. In Figure 6,
an example of the PTCN models architecture:
8. Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.3/4, December 2020
22
Fig. 6: PTCN architecture
The first layer in the model is the embedding layer, which embeds the tokenized words. At the
next layer, the inputs pipe into a 1-dimensional convolutional layer with only a 4th of the set
convolutional hidden nodes. The inputs are representations of lengthy and highly sparse word
vector spaces, making the model weight's critically important. The model requires a method to
adjust to the high variance between samples due to the low number of samples. The control of the
kernels will help the model steadily learn significant features contained in the text.
Throughout the convolutions, the model can explore deeper to recognize more significant
features. A variance scaling kernel initializer provides an “initializer capable of adapting its scale
to the shape of weights. ” [23]. Variance scaling is a kernel initialization strategy that encodes an
object without knowing the shape of a variable [23]. The Convolutional layers’ initializer makes
the deep network adapt to the input weights. It is important to regularize the kernels when
utilizing an adaptive initializer. Setting a duel regularization technique that deploys L1 and L2
regularization helps mitigate high fluctuations while
batching samples through the layers. L1 is a Lasso Regression (LR). L2 is Ridge Regression. The
main difference between the two methods is the penalty term. The layer includes strides set at 1L
through the convolutions are an “...an integer or list of a single integer, specifying the stride
length of the convolution” [24]. The convolutional layer is activated using a Leaky Relu function.
The leaky Relu function allows for a “... small gradient when the unit is not active ...”, providing
“... slightly higher flexibility to the model than traditional Relu. ” [25].
9. Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.3/4, December 2020
23
The first convolutional layer extracts lower level features from the inputs due to the decrease in
the hidden nodes. The reduction of the number of transformations provides control to the
adaptable features initializing the weights. The second layer pipes the inputs through another 1-
dimensional convolutional layer with the same parameters set, except the number of hidden
nodes set to 32. Increasing the number of hidden nodes provides more transformations extracting
higher-level features from the inputs. The second convolutional layer is activated using the leaky
Relu function. The next layer in the stack is another convolutional layer set at half the set number
of hidden nodes. Reducing the number of nodes and following with a max-pooling layer helps
mitigate overfitting during training. In the study, the max-pooling layer is set to 4. The following
are two more layers of 1-dimensional convolutional layers set to half the set hidden nodes and a
4th of the set hidden nodes.
All parameters are set the same as the prior convolutional layers. A second max-pooling layer,
batch normalization, and dropout layer help mitigate overfitting further. The next layer is an
LSTM layer set at 32 hidden nodes. Variance Scaling kernel initializers and L1 and L2 kernel
regularizers control the model's exploration of the feature space. The LSTM layer is activated
using a Leaky Relu function. A dropout layer of 0.5 is in the stack before the output layer. The
output layer is activated using a sigmoid function. The model compiles using a loss function of
binary cross-entropy. The PTCN uses a stochastic gradient descent (SGD) optimizer. The hyper-
tuning sessions determine the learning rate.
3.2.2. The PTCN Modelling Process
To ensure the model is producing the best results, the parameters of the PTCN are hyper-tuned.
The data is split 80/20% for training and validation of the model during the hyper-tuning
sessions. The hyper-tuning model’s performance evaluates a random selection of 100 samples.
The best parameters are captured and set for the final model training. Once the final parameters
are identified, the PTCN is trained for a final training session. The final training session of the
PTCN uses the best parameters identified during the hyper-tuning sessions, and 10-fold cross-
validation is implemented to evaluate the model’s performance.
4. RESULTS
After implementing the model using 10-fold cross-validation, the PTCN Model averaged 67.32%
evaluation accuracy with a standard deviation of 9.11. The best model performed at 76.32%
evaluation on fold 6, as depicted below in Figure 7.
Fig. 7: Sample of PTCN training and validation performance including early stopping.
10. Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.3/4, December 2020
24
The minimum evaluation accuracy reports at 44.74% on the fold 2. However, the learning curve
in Figure 7 suggests that the parameters are well trained, mainly the lack of samples diminishes
the overall performance of the PTCN. The word vector spaces are reliable distributions of
language representations for determining roll call vote behavior in congress [26]. Note, the
evaluations 95% confidence interval is 59.76% to 88.56% on fold 6. The model’s generalization
is measured by further understanding its performance through the resulting confusion matrix.
Here, the model expresses an accuracy with mitigated bias. On fold 6, Table I shows that the
model did not significantly express bias towards either class when tested across a random
selection of samples. Even with the small sample size for training the model is able to effectively
distinguish between the classes as indicated in Table 1. The adaptability of the model’s network
to adjust to the inputs shape appears to extract significant features from even a small sample as
indicated by folds 3 through 10 in Table 2. Even with a small sample set, the texts were 1000s of
words in length. The sheer length of the documents may have been an ample space to extract
patterns from the texts due to the available variance. It is likely that the performance of the model
on fold 1 and 2 is an indication that the samples lacked ample feature space limiting the PTCN
capabilities to adapt to the input’s dimensions and recognize significant patterns.
Table 1. Sample of confusion matrix of PTCN evaluation dataset.
Class 0 Class 1
Class 0 15 5
Class 1 4 14
On fold 6, the best measures of accuracy, precision, recall, and F1 score report, 76.32, 75, 78.95,
76.69 respectively. Each fold’s performance measures are shown below in Table II showing that
the PTCN across all the folds performed relatively well compared to existing approaches.
Table 2. PTCN evaluation metrics across each fold.
Accuracy Precision Recall F1
Fold1 65.00 80.00 40.00 53.33
Fold2 44.74 45.45 52.63 48.78
Fold3 73.68 76.47 68.42 72.22
Fold4 72.50 71.43 75.00 73.17
Fold5 62.50 61.90 65.00 63.41
Fold6 76.32 75.00 78.95 76.69
Fold7 65.00 65.00 65.00 65.00
Fold8 68.42 65.22 78.95 71.43
Fold9 72.50 73.68 70.00 71.29
Fold10 72.50 71.43 75.00 73.17
The model's training/validation/evaluation indicates that each follows a similar pattern, which
suggests that overfitting could easily get handled. Even though the evaluation accuracy shows
significant variance across the folds, it provides evidence of the algorithm's robustness.
Increasing the number of samples should account for the limitation expressed in fold 1 and 2
from Table 1. However, the overall training, validation, and evaluation of the algorithm indicates
the PTCN competes with past models. In turn, the results express that high accuracies predicting
congressional roll call votes are possible; however, the accuracy may be bounded lower than
deterministic accuracy [27].
11. Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.3/4, December 2020
25
Fig. 8: Sample of the importance of features for each class.
Understanding the features that contribute or contradict each class help determine possible forms
of noise in the inputs. A sample of the network's embedded features is depicted in Figure 8. The
top-number of features in each class, as depicted in Figure 8, may help explore for noise. Any
noise identified should be removed later for further experimentation to help fine the model to the
problem.
5. CONCLUSION
The PTCN near performs current state-of-the-art systems in predicting congressional roll call
votes. Event circumstances with no available quantitative roll call data or topic complexity do not
affect PTCN's predictions.
The PTCN architecture is adaptable to language's sparse nature, allowing it to adjust to complex
language representations. The individual legislative texts' overall length helped provide the model
with enough information to recognize significant features still. The different types of legislative
documents structure and content should be further experimented with to gauge if the model's
accuracy scales. Including a larger sample of texts reflecting yea's and nay's may also improve
the PTCN's overall capabilities. Another method to improve the model's accuracy is testing a
larger range of regularization values on the kernel regularizers. Even though the inputs in these
tasks are sparse, and the language may be ambiguous, the PTCN design expresses strong
capabilities in recognizing sophisticated features from the data representations transformed into
word vector spaces. Mainly, the PTCN recognizes word probabilities of occurrence, similar to
other approaches discussed; however, it may recognize features related to the texts' structure or
other significant features.
The custom design of the network filters through lower and higher-level features from layer to
layer. The innovative stack expresses the ability to adapt to inputs and recognize common and
high-level features from complex input dimensions. Further research should shed light on the
stability of the model's accuracy and overall reliability.
6. FUTURE WORK
In the future, the PTCN should be tested with various types of problems, texts, and input shapes.
Specifically, the PTCN should be tested on other types of binary text classification problems.
Testing other types of texts and inputs is critical to understand the generalization of the model.
12. Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.3/4, December 2020
26
For the particular task discussed in this paper an example of future works should include shorter
length texts, or even samples of greater length, or even samples of larger distributions or lower
distributions. Testing the adaptability of the network to various input dimensions is critical to
further understand the capability of the PTCN. Also, as more legislative texts become available
the current state of the PTCN will be updated by training, validating, and evaluating the model
with the inclusion of new samples. The project may be transcribed from R to Python
programming language to take advantage of the faster runtimes during training of the PTCN. The
research is included publicly in Mind Mimic Labs GitHub repository [28].
REFERENCES
[1] J. D. Clinton, “Using roll call estimates to test models of politics,” Annual Review of Political
Science, vol. 15, pp. 79–99, 2012, doi: 10.1146/annurev-polisci-043010-095836.
[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A neural probabilistic language model,” Journal
of machine learning research, vol. 3, no. Feb, pp. 1137–1155, 2003.
[3] K. T. Poole and H. Rosenthal, “Patterns of congressional voting,” American journal of political
science, pp. 228–278, 1991, doi: 10.2307/2111445.
[4] N. McCarty, K. T. Poole, and H. Rosenthal, “The hunt for party discipline in congress,” American
Political Science Review, vol. 95, no. 3, pp. 673–687, 2001, doi: 10.2139/ssrn.1154137.
[5] S. Jackman, “Multidimensional analysis of roll call data via bayesian simulation: Identification,
estimation, inference, and model checking,” Political Analysis, vol. 9, no. 3, pp. 227–241, 2001, doi:
10.1093/polana/9.3.227.
[6] J. Wilkerson, D. Smith, and N. Stramp, “Tracing the flow of policy ideas in legislatures: A text reuse
approach,” American Journal of Political Science, vol. 59, no. 4, pp. 943–956, 2015, doi:
10.1111/ajps.12175.
[7] S. Gerrish and D. M. Blei, “How they vote: Issue-adjusted models of legislative behavior,” in
Advances in neural information processing systems, 2012, pp. 2753–2761, Accessed: Jul. 11, 2019.
[Online]. Available: https://semanticscholar.org/paper/
0c5b7dde400a8786f32d58ad704f8e1ffd135031.
[8] S. Gerrish and D. M. Blei, “Predicting legislative roll calls from text,” in Proceedings of the 28th
international conference on machine learning (icml-11), 2011, pp. 489–496, Accessed: Jul. 11, 2019.
[Online]. Available: https://semanticscholar.org/paper/
62e14dec73970514a5e3f81b059d63b34e9ad37c.
[9] E. Wang, E. Salazar, D. Dunson, L. Carin, and others, “Spatio-temporal modeling of legislation and
votes,” Bayesian Analysis, vol. 8, no. 1, pp. 233–268, 2013, doi: 10.1214/13-ba810.
[10] I. S. Kim, J. Londregan, and M. Ratkovic, “Voting, speechmaking, and the dimensions of conflict in
the us senate,” Unpublished paper, 2015.
[11] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical attention networks for
document classification,” Proceedings of NAACL-HLT 2016, vol. Proceedings of the 2016
Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, pp. 1480–1489, 2016, doi: 10.18653/v1/N16-1174.
[12] P. Kraft, H. Jain, and A. M. Rush, “An embedding model for predicting roll-call votes,” in
Proceedings of the 2016 conference on empirical methods in natural language processing, 2016, pp.
2066–2070, doi: 10.18653/v1/d16-1221.
[13] J. J. Nay, “Gov2Vec: Learning distributed representations of institutions and their legal text,” in
Proceedings of the first workshop on nlp and computational social science, 2016, pp. 49–54, doi:
10.18653/v1/w16-5607.
[14] J. J. Nay, “Predicting and understanding law-making with word vectors and an ensemble model,”
PloS one, vol. 12, no. 5, p. e0176999, 2017, doi: 10.1371/journal.pone.0176999.
[15] P. F. Brown, P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai, “Class-based n-gram models
of natural language,” Computational linguistics, vol. 18, no. 4, pp. 467–479, 1992.
[16] F. Pereira, N. Tishby, and L. Lee, “Distributional clustering of english words,” in Proceedings of the
31st annual meeting on association for computational linguistics, 1993, pp. 183–190, doi:
10.3115/981574.981598.
13. Machine Learning and Applications: An International Journal (MLAIJ) Vol.7, No.3/4, December 2020
27
[17] T. R. Niesler, E. W. Whittaker, and P. C. Woodland, “Comparison of part-of-speech and
automatically derived category-based language models for speech recognition,” in Proceedings of the
1998 ieee international conference on acoustics, speech and signal processing, icassp’98 (cat. No.
98CH36181), 1998, vol. 1, pp. 177–180, doi: 10.1109/icassp.1998.674396.
[18] L. D. Baker and A. K. McCallum, “Distributional clustering of words for text classification,” in
Proceedings of the 21st annual international acm sigir conference on research and development in
information retrieval, 1998, pp. 96–103, doi: 10.1145/290941.290970.
[19] H. Schütze, “Word space,” in Advances in neural information processing systems, 1993, pp. 895–
902.
[20] R. Kneser and H. Ney, “Improved clustering techniques for class-based statistical language
modelling,” 1993.
[21] C. Zhou, C. Sun, Z. Liu, and F. Lau, “A c-lstm neural network for text classification,” arXiv preprint
arXiv:1511.08630, 2015, [Online]. Available: https://arxiv.org/ abs/1511.08630.
[22] R. Wang, Z. Li, J. Cao, T. Chen, and L. Wang, “Convolutional recurrent neural networks for text
classification,” 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–6, 2019.
[23] “VarianceScaling.” Accessed: Jul. 23, 2020. [Online]. Available:
https://keras.rstudio.com/reference/initializer variance scaling.html.
[24] “Strides.” Accessed: Jul. 23, 2020. [Online]. Available: https://keras.io/api/layers/convolution
layers/convolution2d/.
[25] “LeakyRelu.” Accessed: Jul. 23, 2020. [Online]. Available: https://keras.io/api/layers/activation
layers/leakyrelu/.
[26] M. Taddy, “Document classification by inversion of distributed language representations,” in
Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th
international joint conference on natural language processing (volume 2: Short papers), 2015, pp. 45–
49, doi: 10.3115/v1/p15-2008.
[27] T. Martin, J. M. Hofman, A. Sharma, A. Anderson, and D. J. Watts, “Exploring limits to prediction in
complex social systems,” in Proceedings of the 25th international conference on world wide web,
2016, pp. 683–694, doi: 10.1145/2872427.2883001.
[28] “GitHub mind mimic labs.” Accessed: Jul. 27, 2020. [Online]. Available:
https://github.com/MindMimicLabs/classification-congressional-votes.
[29] “GitHub.” Accessed: Jun. 04, 2018. [Online]. Available: https://github.com.
[30] R Core Team, “R: A language and environment for statistical computing.” R Foundation for
Statistical Computing, Vienna, Austria, 2019, Accessed: Dec. 15, 2019. [Online]. Available:
https://www.R- project.org.
[31] RStudio Team, “RStudio: Integrated development environment for r.” RStudio, Inc., Boston, MA,
2015, Accessed: Jun. 30, 2018. [Online]. Available: https://www.rstudio.com.