M.Phil Computer Science Data Mining Projects
Web : www.kasanpro.com Email : sales@kasanpro.com
List Link : http://kasanpro.com/projects-list/m-phil-computer-science-data-mining-projects
Title :Bridging Socially Enhanced Virtual Communities
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/bridging-socially-enhanced-virtual-communities
Abstract : Interactions spanning multiple organizations have become an important aspect of today's collaboration
landscape. Organizations create alliances to fulfill strategic objectives. The dynamic nature of collaborations
increasingly demands automated techniques and algorithms to support the creation of such alliances. Our
approach is based on recommending potential alliances by discovering currently relevant competence sources
and supporting their semi-automatic formation. The environment is service-oriented, comprising humans and software
services with distinct capabilities. To mediate between previously separated groups and organizations, we introduce
the broker concept that bridges disconnected networks. We present a dynamic broker discovery approach based on
interaction mining techniques and trust metrics. We evaluate our approach by using simulations in real Web services
testbeds.
Title :Mood Recognition During Online Self Assessment Test
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/mood-recognition-during-online-self-assessment-test
Abstract : Individual emotions play a crucial role during any learning interaction. Identifying a student's emotional
state and providing personalized feedback, based on integrated pedagogical models, has been considered to be one
of the main limits of traditional e-learning tools. This paper presents an empirical study that illustrates how learner
mood may be predicted during online self-assessment tests. Here, a previous method of determining student mood
has been refined based on the assumption that the influence on learner mood of questions already answered declines
in relation to their distance from the current question. Moreover, this paper sets out to indicate that "exponential logic"
may help produce more efficient models if integrated adequately with affective modeling. The results show that these
assumptions may prove useful to future research.
Title :On The Path To A World Wide Web Census: A Large Scale Survey
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/world-wide-web-census-large-scale-survey
Abstract : How large is the World Wide Web? We present the results of the largest Web survey performed to
date. We use an interdisciplinary approach that draws on methods from ecology. In addition to Web server counts, we
also present other information collected, such as Web server market share, operating system type used by Web
servers and Web server distribution. The software system used to collect data is a prototype of a system that we
believe can be used for a complete Web census.
Title :Knowledge Sharing In Virtual Organizations: Barriers and Enablers
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/knowledge-sharing-in-virtual-organizations-barriers-enablers
Abstract : Modern organizations have to deal with many drastic external and internal constraints due notably to the
globalization of the economy, fast technological change, and shifts in customers' demands. Moreover,
organizations' functionally divided, hierarchical internal structures are too rigid, making it difficult for them to adjust
to the changing constraints resulting from the pressure of their external environment. Consequently, to survive and
maintain their competitive advantage in the market, modern organizations must alter their internal structure to become
organic and flexible systems able to adapt and progress in a high-velocity environment. Virtual organizations are
among the most popular solutions which provide organizations with more agility and improve their efficiency and
effectiveness. Despite many success stories materialized by economic and non-economic benefits, many virtual
organizations have failed to reach their goals due to the problems they have encountered while trying to manage
knowledge. In this work, we analyze the barriers and enablers of knowledge management in virtual organizations.
Title :Adaptive Provisioning of Human Expertise in Service-oriented Systems
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/adaptive-provisioning-human-expertise-service-oriented-systems
Abstract : Web-based collaborations have become essential in today's business environments. Due to the availability
of various SOA frameworks, Web services emerged as the de facto technology to realize flexible compositions of
services. While most existing work focuses on the discovery and composition of software based services, we highlight
concepts for a people-centric Web. Knowledge-intensive environments clearly demand the provisioning of human
expertise along with sharing of computing resources or business data through software-based services. To address
these challenges, we introduce an adaptive approach allowing humans to provide their expertise through services
using SOA standards, such as WSDL and SOAP. The seamless integration of humans in the SOA loop triggers
numerous social implications, such as evolving expertise and drifting interests of human service providers. Here we
propose a framework that is based on interaction monitoring techniques enabling adaptations in SOA-based
socio-technical systems.
Title :Cost-aware rank join with random and sorted access
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/cost-aware-rank-join-random-sorted-access
Abstract : In this project, we address the problem of joining ranked results produced by two or more services on the
Web. We consider services endowed with two kinds of access that are often available: i) sorted access, which returns
tuples sorted by score; ii) random access, which returns tuples matching a given join attribute value. Rank join
operators combine objects of two or more relations and output the k combinations with the highest aggregate score.
While the past literature has studied suitable bounding schemes for this setting, in this paper we focus on the
definition of a pulling strategy, which determines the order of invocation of the joined services. We propose the CARS
(Cost-Aware with Random and Sorted access) pulling strategy, which is derived at compile-time and is oblivious of
the query-dependent score distributions. We cast CARS as the solution of an optimization problem based on a small
set of parameters characterizing the joined services. We validate the proposed strategy with experiments on both real
and synthetic data sets. We show that CARS outperforms prior proposals and that its overall access cost is always
within a very short margin from that of an oracle-based optimal strategy. In addition, CARS is shown to be robust to
the uncertainty that may characterize the estimated parameters.
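The sorted/random access pattern above can be illustrated with a minimal top-k rank-join sketch (in Python rather than the project's C#; the function name, the sum-based aggregate score, and the stopping bound are illustrative assumptions, not the CARS strategy itself): sorted access pulls tuples in score order, random access probes the other service by join key, and a simple upper bound decides when further pulling cannot change the top k.

```python
def rank_join_topk(r_sorted, s_index, s_max_score, k):
    """Toy top-k rank join.

    r_sorted:     list of (join_key, score), descending score (sorted access)
    s_index:      dict join_key -> score (random access)
    s_max_score:  largest score the random-access service can return
    Aggregate score is the sum of the two scores.
    """
    results = []
    for i, (key, r_score) in enumerate(r_sorted):
        if key in s_index:
            results.append((r_score + s_index[key], key))
        results.sort(reverse=True)
        del results[k:]  # keep only the current top k
        if len(results) == k:
            # Bound: the best possible unseen combination pairs the next
            # sorted-access score with the maximum random-access score.
            next_r = r_sorted[i + 1][1] if i + 1 < len(r_sorted) else float("-inf")
            if next_r + s_max_score <= results[-1][0]:
                break  # no unseen tuple can enter the top k
    return results
```

A cost-aware strategy such as CARS would additionally choose *which* service to pull from at each step based on access costs; this sketch only shows the bounding side.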
Title :USHER Improving Data Quality with Dynamic Forms
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/usher-improving-data-quality-dynamic-forms
Abstract : Data quality is a critical problem in modern databases. Data entry forms present the first and arguably best
opportunity for detecting and mitigating errors, but there has been little research into automatic methods for improving
data quality at entry time. In this paper, we propose USHER, an end-to-end system for form design, entry, and data
quality assurance. Using previous form submissions, USHER learns a probabilistic model over the questions of the
form. USHER then applies this model at every step of the data entry process to improve data quality. Before entry, it
induces a form layout that captures the most important data values of a form instance as quickly as possible. During
entry, it dynamically adapts the form to the values being entered, and enables real-time feedback to guide the data
enterer toward their intended values. After entry, it re-asks questions that it deems likely to have been entered
incorrectly. We evaluate all three components of USHER using two real-world data sets. Our results demonstrate that
each component has the potential to improve data quality considerably, at a reduced cost when compared to current
practice.
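The re-asking step can be hinted at with a toy sketch (Python, hypothetical names; USHER learns a full probabilistic model over the form's questions, whereas this sketch assumes independent per-question value frequencies): answers whose empirical probability in past submissions falls below a threshold are flagged for re-asking.

```python
from collections import Counter

def learn_model(past_submissions):
    """Empirical value distribution per question from past form submissions."""
    model = {}
    for submission in past_submissions:
        for question, value in submission.items():
            model.setdefault(question, Counter())[value] += 1
    return model

def questions_to_reask(model, new_submission, threshold=0.1):
    """Flag answers whose empirical probability is below the threshold."""
    flagged = []
    for question, value in new_submission.items():
        counts = model.get(question, Counter())
        total = sum(counts.values())
        prob = counts[value] / total if total else 0.0  # unseen value -> 0.0
        if prob < threshold:
            flagged.append(question)
    return flagged
```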
Title :A Dual Framework and Algorithms for Targeted Data Delivery
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/algorithms-targeted-data-delivery
Abstract : In this project, we develop a framework for comparing pull-based solutions and present dual optimization
approaches. The first approach maximizes user utility while satisfying constraints on the usage of system resources.
The second approach satisfies the utility of user profiles while minimizing the usage of system resources. We present
an adaptive algorithm and show how it can incorporate feedback to improve user utility with only a moderate increase
in resource utilization.
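The utility-versus-resources trade-off can be sketched roughly as a budgeted allocation (a toy Python illustration with hypothetical names, not the paper's algorithms): a fixed polling budget is split across sources in proportion to utility-weighted change rates.

```python
def allocate_polls(change_rates, utilities, budget):
    """Split a total polling budget across data sources in proportion to
    utility-weighted change rates (a simple proxy for expected user utility
    per poll). Rounding means the allocations may not sum exactly to budget."""
    weights = [c * u for c, u in zip(change_rates, utilities)]
    total = sum(weights)
    if total == 0:
        return [0] * len(weights)
    return [round(budget * w / total) for w in weights]
```

An adaptive variant would re-estimate the change rates from feedback after each period and re-run the allocation.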
http://kasanpro.com/ieee/final-year-project-center-thanjavur-reviews
Title :Selecting Attributes for Sentiment Classification Using Feature Relation Networks
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/sentiment-classification-using-feature-relation-networks
Abstract : A major concern when incorporating large sets of diverse n-gram features for sentiment classification is
the presence of noisy, irrelevant, and redundant attributes. These concerns can often make it difficult to harness the
augmented discriminatory potential of extended feature sets. We propose a rule-based multivariate text feature
selection method called Feature Relation Network (FRN) that considers semantic information and also leverages the
syntactic relationships between n-gram features. FRN is intended to efficiently enable the inclusion of extended sets
of heterogeneous n-gram features for enhanced sentiment classification. Experiments were conducted on three online
review test beds in comparison with methods used in prior sentiment classification research. FRN outperformed the
univariate, multivariate, and hybrid feature selection methods used for comparison; it was able to select attributes resulting in
significantly better classification accuracy irrespective of the feature subset sizes. Furthermore, by incorporating
syntactic information about n-gram relations, FRN is able to select features in a more computationally efficient manner
than many multivariate and hybrid techniques.
Title :Improving Aggregate Recommendation Diversity Using Ranking-Based Techniques
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/aggregate-recommendation-diversity-using-ranking-based
Abstract : Recommender systems are becoming increasingly important to individual users and businesses for
providing personalized recommendations. However, while the majority of algorithms proposed in recommender
systems literature have focused on improving recommendation accuracy, other important aspects of recommendation
quality, such as the diversity of recommendations, have often been overlooked. In this paper, we introduce and
explore a number of item ranking techniques that can generate recommendations that have substantially higher
aggregate diversity across all users while maintaining comparable levels of recommendation accuracy.
Comprehensive empirical evaluation consistently shows the diversity gains of the proposed techniques using several
real-world rating datasets and different rating prediction algorithms.
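One common ranking-based diversification technique (a plausible Python sketch; the paper explores several ranking criteria, and the names and 3.5-star threshold here are illustrative assumptions) re-ranks the items predicted above an accuracy-preserving threshold by ascending popularity, so long-tail items surface more often across users:

```python
def diverse_topn(predictions, popularity, n, min_rating=3.5):
    """Rank-based diversification sketch: among items predicted at or above
    min_rating, recommend the least popular first (raising aggregate
    diversity) instead of ranking purely by predicted rating."""
    candidates = [item for item, r in predictions.items() if r >= min_rating]
    # Sort by popularity ascending; break ties by item id for determinism.
    candidates.sort(key=lambda item: (popularity.get(item, 0), item))
    return candidates[:n]
```

The min_rating threshold is what keeps accuracy "comparable": only items the predictor already likes are eligible for the diversified ranking.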
Title :Integration of Sound Signature in Graphical Password Authentication System
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/sound-signature-graphical-password-authentication-system
Abstract : In this project, a graphical password system with a supportive sound signature to increase the
remembrance of the password is discussed. In the proposed work, a click-based graphical password scheme called Cued
Click Points (CCP) is presented. In this system, a password consists of a sequence of images in which the user
selects one click-point per image. In addition, the user is asked to select a sound signature corresponding to each click
point; this sound signature is later used to help the user log in. The system showed very good performance in terms of speed,
accuracy, and ease of use. Users preferred CCP to PassPoints, saying that selecting and remembering only one
point per image was easier and that the sound signature helps considerably in recalling the click points.
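The click-point matching can be sketched minimally (Python, hypothetical names; the 9-pixel tolerance box is an illustrative assumption, and the selected sound signature would be played as a recall cue during login):

```python
def within_tolerance(click, stored, tol=9):
    """A click matches if it falls within a tol-pixel box of the stored point."""
    return abs(click[0] - stored[0]) <= tol and abs(click[1] - stored[1]) <= tol

def verify_ccp(clicks, stored_points, tol=9):
    """Cued Click Points sketch: one click per image, all must match in order."""
    return len(clicks) == len(stored_points) and all(
        within_tolerance(c, s, tol) for c, s in zip(clicks, stored_points))
```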
Title :Monitoring Service Systems from a Language-Action Perspective
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/monitoring-service-systems-language-action
Abstract : The exponential growth in the global economy is being supported by service systems, realized by
recasting mission-critical application services accessed across organizational boundaries. The Language-Action
Perspective (LAP) is based upon the notion that "expert behavior requires an exquisite sensitivity to
context and that such sensitivity is more in the realm of the human than in that of the artificial."
Business processes are increasingly distributed and open, making them prone to failure. Monitoring is, therefore, an
important concern not only for the processes themselves but also for the services that comprise these processes. We
present a framework for multilevel monitoring of these service systems. It formalizes interaction protocols, policies,
and commitments that account for standard and extended effects following the language-action perspective, and
allows specification of goals and monitors at varied abstraction levels. We demonstrate how the framework can be
implemented and evaluate it with multiple scenarios, such as merchant-customer transactions, that include
specifying and monitoring open-service policy commitments.
Title :A Personalized Ontology Model for Web Information Gathering
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/ontology-model-web-information-gathering
Abstract : As a model for knowledge description and formalization, ontologies are widely used to represent user
profiles in personalized web information gathering. However, when representing user profiles, many models have
utilized only knowledge from either a global knowledge base or user local information. In this paper, a personalized
ontology model is proposed for knowledge representation and reasoning over user profiles. This model learns
ontological user profiles from both a world knowledge base and user local instance repositories. The ontology model
is evaluated by comparing it against benchmark models in web information gathering. The results show that this
ontology model is successful.
Title :Publishing Search Logs-A Comparative Study of Privacy Guarantees
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/publishing-search-logs-privacy-guarantees
Abstract : Search engine companies collect the "database of intentions", the histories of their users' search queries.
These search logs are a gold mine for researchers. Search engine companies, however, are wary of publishing
search logs in order not to disclose sensitive information. In this paper we analyze algorithms for publishing frequent
keywords, queries and clicks of a search log. We first show how methods that achieve variants of k-anonymity are
vulnerable to active attacks. We then demonstrate that the stronger guarantee ensured by differential privacy
unfortunately does not provide any utility for this problem. Our paper concludes with a large experimental study using
real applications where we compare ZEALOUS and previous work that achieves k-anonymity in search log publishing.
Our results show that ZEALOUS yields comparable utility to k-anonymity while at the same time achieving much
stronger privacy guarantees.
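The flavor of such an algorithm can be shown with a rough sketch of a ZEALOUS-like two-threshold scheme (Python; the thresholds, the noise mechanism, and all names are illustrative assumptions, not the paper's exact algorithm): keep sufficiently frequent items, add Laplace noise, and publish only noisy counts that still clear a second threshold.

```python
import math
import random

def publish_frequent(counts, k1, k2, epsilon, rng=None):
    """ZEALOUS-style sketch: keep items whose raw count reaches k1, add
    Laplace noise of scale 1/epsilon, and publish only noisy counts that
    still reach the second threshold k2."""
    rng = rng or random.Random(0)  # seeded for reproducibility in this sketch
    published = {}
    for item, count in counts.items():
        if count < k1:
            continue  # infrequent items are dropped before any noise is drawn
        # Laplace sample via the inverse-CDF method.
        u = rng.random() - 0.5
        noise = -(1.0 / epsilon) * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))
        if count + noise >= k2:
            published[item] = count + noise
    return published
```

The first threshold limits sensitivity, the noise provides the (relaxed) differential-privacy-style guarantee, and the second threshold suppresses items whose noisy counts are unreliable.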
Title :Scalable Scheduling of Updates in Streaming Data Warehouse
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/scheduling-updates-streaming-data-warehouse
Abstract : The study of collective behavior aims to understand how individuals behave in a social networking
environment. Oceans of data generated by social media like Facebook, Twitter, Flickr, and YouTube present
opportunities and challenges to study collective behavior on a large scale. In this work, we aim to learn to predict
collective behavior in social media. In particular, given information about some individuals, how can we infer the
behavior of unobserved individuals in the same network? A social-dimension-based approach has been shown
effective in addressing the heterogeneity of connections presented in social media. However, the networks in social
media are normally of colossal size, involving hundreds of thousands of actors. The scale of these networks entails
scalable learning of models for collective behavior prediction. To address the scalability issue, we propose an
edge-centric clustering scheme to extract sparse social dimensions. With sparse social dimensions, the proposed
approach can efficiently handle networks of millions of actors while demonstrating a comparable prediction
performance to other non-scalable methods.
Title :The Awareness Network, To Whom Should I Display My Actions? And, Whose Actions Should I Monitor?
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/accessing-monitoring-inawareness-network
Abstract : The concept of awareness plays a pivotal role in research in Computer Supported Cooperative Work.
Recently, Software Engineering researchers interested in the collaborative nature of software development have
explored the implications of this concept in the design of software development tools. A critical aspect of awareness is
the associated coordinative work practices of displaying and monitoring actions. This aspect concerns how colleagues
monitor one another's actions to understand how these actions impact their own work and how they display their
actions in such a way that others can easily monitor them while doing their own work. In this paper, we focus on an
additional aspect of awareness: the identification of the social actors who should be monitored and the actors to
whom their actions should be displayed. We address this aspect by presenting software developers' work practices
based on ethnographic data from three different software development teams. In addition, we illustrate how these
work practices are influenced by different factors, including the organizational setting, the age of the project, and the
software architecture. We discuss how our results are relevant for both CSCW and Software Engineering
researchers.
Title :The World in a Nutshell Concise Range Queries
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/world-nutshell-concise-range-queries
Abstract : With the advance of wireless communication technology, it is quite common for people to view maps or get
related services from handheld devices, such as mobile phones and PDAs. Range queries, as one of the most
commonly used tools, are often posed by users to retrieve needed information from a spatial database. However,
due to the limits of communication bandwidth and hardware power of handheld devices, displaying all the results of a
range query on a handheld device is neither communication-efficient nor informative to the users. This is simply
because there are often too many results returned from a range query.
In view of this problem, we present a novel idea that a concise representation of a specified size for the range query
results, while incurring minimal information loss, shall be computed and returned to the user. Such a concise range
query not only reduces communication costs, but also offers better usability to the users, providing an opportunity for
interactive exploration.
The usefulness of the concise range queries is confirmed by comparing it with other possible alternatives, such as
sampling and clustering. Unfortunately, we prove that finding the optimal representation with minimum information
loss is an NP-hard problem. Therefore, we propose several effective and nontrivial algorithms to find a good
approximate result. Extensive experiments on real-world data have demonstrated the effectiveness and efficiency of
the proposed techniques.
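One simple way to build such a concise representation (an approximate grid-based Python sketch with hypothetical names; the paper's approximation algorithms are more refined than a fixed grid) is to bucket the result points into cells and return a bounding box and count per occupied cell: the coarser the cell, the smaller the answer and the larger the information loss.

```python
def concise_result(points, cell):
    """Summarize range-query results: group points into square grid cells of
    side `cell`, returning (bounding_box, count) per occupied cell, where
    bounding_box = (min_x, min_y, max_x, max_y)."""
    cells = {}
    for x, y in points:
        cells.setdefault((x // cell, y // cell), []).append((x, y))
    summary = []
    for pts in cells.values():
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        summary.append(((min(xs), min(ys), max(xs), max(ys)), len(pts)))
    return summary
```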
Title :A Query Formulation Language for the Data Web
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/query-formulation-language-data-web
Abstract : We present a query formulation language called MashQL in order to easily query and fuse structured data
on the web. The main novelty of MashQL is that it allows people with limited IT-skills to explore and query one or
multiple data sources without prior knowledge about the schema, structure, vocabulary, or any technical details of
these sources. More importantly, to be robust and cover most cases in practice, we do not assume that a data source
has an offline or inline schema. This poses several language-design and performance complexities that we
fundamentally tackle. To illustrate the query formulation power of MashQL, and without loss of generality, we chose
the Data Web scenario. We also chose querying RDF, as it is the most primitive data model; hence, MashQL can be
similarly used for querying relational databases and XML. We present two implementations of MashQL, an online
mashup editor and a Firefox add-on. The former illustrates how MashQL can be used to query and mash up the Data
Web as simply as filtering and piping web feeds; the Firefox add-on illustrates using the browser as a web
composer rather than only a navigator. To end, we evaluate MashQL on querying two datasets, DBLP and DBPedia,
and show that our indexing techniques allow instant user-interaction.
Title :Exploring Application-Level Semantics for Data Compression
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/exploring-application-level-semantics-data-compression
Abstract : Natural phenomena show that many creatures form large social groups and move in regular patterns.
However, previous work has focused on finding the movement patterns of each single object or all objects. In this paper, we
first propose an efficient distributed mining algorithm to jointly identify a group of moving objects and discover their
movement patterns in wireless sensor networks. Afterward, we propose a compression algorithm, called 2P2D, which
exploits the obtained group movement patterns to reduce the amount of delivered data.
The compression algorithm includes a sequence merge phase and an entropy reduction phase. In the sequence merge
phase, we propose a Merge algorithm to merge and compress the location data of a group of moving objects. In the
entropy reduction phase, we formulate a Hit Item Replacement (HIR) problem and propose a Replace algorithm that
obtains the optimal solution. Moreover, we devise three replacement rules and derive the maximum compression
ratio. The experimental results show that the proposed compression algorithm leverages the group movement
patterns to reduce the amount of delivered data effectively and efficiently.
Title :Data Leakage Detection
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/data-leakage-detection
Abstract : A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of
the data is leaked and found in an unauthorized place (e.g., on the web or somebody's laptop). The distributor must
assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently
gathered by other means. We propose data allocation strategies (across the agents) that improve the probability of
identifying leakages. These methods do not rely on alterations of the released data (e.g., watermarks). In some cases
we can also inject "realistic but fake" data records to further improve our chances of detecting leakage and identifying
the guilty party.
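The distributor's assessment can be sketched with a toy guilt score (Python, illustrative names; the actual work develops proper probability models of guilt rather than this heuristic): a leaked object held by fewer agents weighs more heavily against those agents, which is also why injecting fake records unique to one agent is so effective.

```python
def guilt_scores(allocations, leaked):
    """Heuristic guilt score: each leaked object contributes 1/(number of
    agents holding it) to every agent that holds it, so rarer objects are
    stronger evidence."""
    holders = {}
    for agent, objects in allocations.items():
        for obj in objects:
            holders.setdefault(obj, set()).add(agent)
    scores = {agent: 0.0 for agent in allocations}
    for obj in leaked:
        if obj in holders:
            for agent in holders[obj]:
                scores[agent] += 1.0 / len(holders[obj])
    return scores
```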
Title :Knowledge Based Interactive Postmining of Association Rules Using Ontologies
Language : C#
Project Link :
http://kasanpro.com/p/c-sharp/knowledge-based-interactive-postmining-association-rules-using-ontologies
Abstract : In Data Mining, the usefulness of association rules is strongly limited by the huge amount of delivered
rules. To overcome this drawback, several methods were proposed in the literature such as item set concise
representations, redundancy reduction, and post processing. However, being generally based on statistical
information, most of these methods do not guarantee that the extracted rules are interesting for the user. Thus, it is
crucial to help the decision-maker with an efficient post processing step in order to reduce the number of rules. This
paper proposes a new interactive approach to prune and filter discovered rules. First, we propose to use ontologies in
order to improve the integration of user knowledge in the post processing task. Second, we propose the Rule Schema
formalism extending the specification language proposed by Liu et al. for user expectations. Furthermore, an
interactive framework is designed to assist the user throughout the analyzing task. Applying our new approach over
voluminous sets of rules, we were able, by integrating domain expert knowledge in the post processing step, to
reduce the number of rules to several dozens or less. Moreover, the quality of the filtered rules was validated by the
domain expert at various points in the interactive process.
Title :A Link Analysis Extension of Correspondence Analysis for Mining Relational Databases
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/link-analysis-mining-relational-databases
Abstract : This work introduces a link-analysis procedure for discovering relationships in a relational database or a
graph, generalizing both simple and multiple correspondence analysis. It is based on a random-walk model through
the database defining a Markov chain having as many states as elements in the database. Suppose we are interested
in analyzing the relationships between some elements (or records) contained in two different tables of the relational
database. To this end, in a first step, a reduced, much smaller, Markov chain containing only the elements of interest
and preserving the main characteristics of the initial chain is extracted by stochastic complementation. This reduced
chain is then analyzed by projecting jointly the elements of interest in the diffusion-map subspace and visualizing the
results. This two-step procedure reduces to simple correspondence analysis when only two tables are defined and to
multiple correspondence analysis when the database takes the form of a simple star schema. On the other hand, a
kernel version of the diffusion-map distance, generalizing the basic diffusion-map distance to directed graphs, is also
introduced and the links with spectral clustering are discussed. Several datasets are analyzed by using the proposed
methodology, showing the usefulness of the technique for extracting relationships in relational databases or graphs.
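For the smallest non-trivial case, stochastic complementation of a 3-state chain down to the two states of interest reduces the matrix inverse in S = A11 + A12 (I - A22)^-1 A21 to a scalar (a minimal Python sketch of the extraction step only, not the paper's full two-step procedure):

```python
def stochastic_complement_2of3(P):
    """Reduce a 3-state Markov chain (row-stochastic 3x3 matrix P, as nested
    lists) to a 2-state chain over states {0, 1} by stochastic
    complementation. With a single eliminated state, (I - A22)^-1 is just
    the scalar 1 / (1 - P[2][2])."""
    inv = 1.0 / (1.0 - P[2][2])
    # S[i][j] = A11[i][j] + A12[i] * inv * A21[j]
    return [[P[i][j] + P[i][2] * inv * P[2][j] for j in range(2)]
            for i in range(2)]
```

The reduced matrix is again row-stochastic, preserving the main characteristics of the original chain on the retained states.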
Title :Query Planning for Continuous Aggregation Queries over a Network of Data Aggregators
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/query-planning-continuous-aggregation-queries
Abstract : Continuous queries are used to monitor changes to time varying data and to provide results useful for
online decision making. Typically, a user desires to obtain the value of some aggregation function over distributed data
items, for example, to know the value of a client's portfolio, or the average of temperatures sensed by a set of sensors. In
these queries, a client specifies a coherency requirement as part of the query. We present a low-cost, scalable
technique to answer continuous aggregation queries using a network of aggregators of dynamic data items. In such a
network of data aggregators, each data aggregator serves a set of data items at specific coherencies. Just as various
fragments of a dynamic web-page are served by one or more nodes of a content distribution network, our technique
involves decomposing a client query into sub-queries and executing sub-queries on judiciously chosen data
aggregators with their individual sub-query incoherency bounds. We provide a technique for obtaining the optimal set of
sub-queries with their incoherency bounds that satisfies the client query's coherency requirement with the least number of
refresh messages sent from aggregators to the client. For estimating the number of refresh messages, we build a
query cost model which can be used to estimate the number of messages required to satisfy the client specified
incoherency bound. Performance results using real-world traces show that our cost based query planning leads to
queries being executed using less than one third the number of messages required by existing schemes.
Title :Scalable learning of collective behavior
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/scalable-learning-collective-behavior
Abstract : The study of collective behavior aims to understand how individuals behave in a social networking
environment. Oceans of data generated by social media like Facebook, Twitter, Flickr, and YouTube present
opportunities and challenges to study collective behavior on a large scale. In this work, we aim to learn to predict
collective behavior in social media. In particular, given information about some individuals, how can we infer the
behavior of unobserved individuals in the same network? A social-dimension-based approach has been shown
effective in addressing the heterogeneity of connections presented in social media. However, the networks in social
media are normally of colossal size, involving hundreds of thousands of actors. The scale of these networks entails
scalable learning of models for collective behavior prediction. To address the scalability issue, we propose an
edge-centric clustering scheme to extract sparse social dimensions. With sparse social dimensions, the proposed
approach can efficiently handle networks of millions of actors while demonstrating a comparable prediction
performance to other non-scalable methods.
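Once edges have been clustered (by k-means over edges or any other edge-centric scheme, which is assumed to have been done already), the sparse social dimensions follow directly: each node's dimensions are just the clusters of its incident edges. A minimal Python sketch with hypothetical names:

```python
def sparse_social_dimensions(edges, edge_cluster):
    """Given a list of edges (u, v) and a parallel list assigning each edge
    to a cluster, return each node's sparse social dimensions: the set of
    clusters its incident edges belong to."""
    dims = {}
    for (u, v), cluster in zip(edges, edge_cluster):
        dims.setdefault(u, set()).add(cluster)
        dims.setdefault(v, set()).add(cluster)
    return dims
```

Because a node only picks up dimensions from its own edges, the resulting per-node representation stays sparse even for networks with millions of actors.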
Title :Horizontal Aggregations in SQL to prepare Data Sets for Data Mining Analysis
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/horizontal-aggregations-sql-data-mining-analysis
Abstract : Preparing a data set for analysis is generally the most time-consuming task in a data mining project,
requiring many complex SQL queries, joining tables and aggregating columns. Existing SQL aggregations have
limitations to prepare data sets because they return one column per aggregated group. In general, a significant
manual effort is required to build data sets, where a horizontal layout is required. We propose simple, yet powerful,
methods to generate SQL code to return aggregated columns in a horizontal tabular layout, returning a set of
numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal
aggregations build data sets with a horizontal denormalized layout (e.g. point-dimension, observation-variable,
instance-feature), which is the standard layout required by most data mining algorithms. We propose three
fundamental methods to evaluate horizontal aggregations: CASE: Exploiting the programming CASE construct; SPJ:
Based on standard relational algebra operators (SPJ queries); PIVOT: Using the PIVOT operator, which is offered by
some DBMSs. Experiments with large tables compare the proposed query evaluation methods. Our CASE method
has similar speed to the PIVOT operator and it is much faster than the SPJ method. In general, the CASE and PIVOT
methods exhibit linear scalability, whereas the SPJ method does not.
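The CASE method can be illustrated by generating such a query (a Python sketch with hypothetical table and column names; real code would also need to quote identifiers and escape values): one SUM(CASE ...) column per distinct pivot value, producing the horizontal layout directly.

```python
def horizontal_sum_sql(table, group_col, pivot_col, value_col, pivot_values):
    """Generate a CASE-based horizontal aggregation: one SUM column per
    distinct value of pivot_col, grouped by group_col."""
    cols = ",\n  ".join(
        "SUM(CASE WHEN {p} = '{v}' THEN {m} ELSE 0 END) AS {p}_{v}".format(
            p=pivot_col, v=v, m=value_col)
        for v in pivot_values)
    return "SELECT {g},\n  {cols}\nFROM {t}\nGROUP BY {g};".format(
        g=group_col, cols=cols, t=table)
```

For example, pivoting monthly sales per store yields one aggregated column per month on a single row per store, the denormalized layout most mining algorithms expect.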
Title :A Machine Learning Approach for Identifying Disease-Treatment Relations in Short Texts
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/machine-learning-identifying-disease-treatment-relations-short-texts
Abstract : The Machine Learning (ML) field has gained momentum in almost every domain of research and has
recently become a reliable tool in the medical domain. The empirical domain of automatic learning is used in
tasks such as medical decision support, medical imaging, protein-protein interaction, extraction of medical knowledge,
and for overall patient management care.
ML is envisioned as a tool by which computer-based systems can be integrated in the healthcare field in order to
provide better, more efficient medical care. This paper describes an ML-based methodology for building an application that is
capable of identifying and disseminating healthcare information.
It extracts sentences from published medical papers that mention diseases and treatments, and identifies semantic
relations that exist between diseases and treatments.
Our evaluation results for these tasks show that the proposed methodology obtains reliable outcomes that could be
integrated in an application to be used in the medical care domain. The potential value of this paper stands in the ML
settings that we propose and in the fact that we outperform previous results on the same data set.
Title :m-Privacy for Collaborative Data Publishing
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/privacy-collaborative-data-publishing
Abstract : In this paper, we consider the collaborative data publishing problem for anonymizing horizontally
partitioned data at multiple data providers. We consider a new type of "insider attack" by colluding data providers who
may use their own data records (a subset of the overall data) in addition to the external background knowledge to
infer the data records contributed by other data providers. The paper addresses this new threat and makes several
contributions. First, we introduce the notion of m-privacy, which guarantees that the anonymized data satisfies a given
privacy constraint against any group of up to m colluding data providers. Second, we present heuristic algorithms
exploiting the equivalence group monotonicity of privacy constraints and adaptive ordering techniques for efficiently
checking m-privacy given a set of records. Finally, we present a data provider-aware anonymization algorithm with
adaptive m-privacy checking strategies to ensure high utility and m-privacy of anonymized data with efficiency.
Experiments on real-life datasets suggest that our approach achieves better or comparable utility and efficiency
compared with existing and baseline algorithms while providing the m-privacy guarantee.
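The m-privacy notion can be sketched with a brute-force checker (the paper's heuristic algorithms avoid this exponential enumeration): for every coalition of up to m providers, verify that the remaining records still satisfy the privacy constraint, here k-anonymity over a quasi-identifier group. All data below is hypothetical.

```python
from itertools import combinations
from collections import Counter

def satisfies_k_anonymity(records, k):
    """records: list of (quasi_identifier_group, provider) pairs."""
    counts = Counter(group for group, _ in records)
    return all(c >= k for c in counts.values())

def is_m_private(records, k, m):
    """k-anonymity must survive every coalition of up to m providers
    subtracting their own records from the anonymized view."""
    providers = sorted({p for _, p in records})
    for size in range(m + 1):
        for coalition in combinations(providers, size):
            remaining = [r for r in records if r[1] not in coalition]
            if not satisfies_k_anonymity(remaining, k):
                return False
    return True

records = [("g1", "A"), ("g1", "B"), ("g1", "C")]  # one group, three providers
# one colluder still leaves two records in the group; two colluders do not
```

With k = 2, the toy data is 1-private but not 2-private: any two colluding providers can isolate the third provider's record.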
Title :Spatial Approximate String Search
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/spatial-approximate-string-search
Abstract : This work deals with the approximate string search in large spatial databases. Specifically, we investigate
range queries augmented with a string similarity search predicate in both Euclidean space and road networks. We
dub this query the spatial approximate string (SAS) query. In Euclidean space, we propose an approximate solution,
the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature for an index node u keeps
a concise representation of the union of q-grams from strings under the sub-tree of u. We analyze the pruning
functionality of such signatures based on the set resemblance between the query string and the q-grams from the
sub-trees of index nodes. We also discuss how to estimate the selectivity of a SAS query in Euclidean space, for
which we present a novel adaptive algorithm to find balanced partitions using both the spatial and string information
stored in the tree. For queries on road networks, we propose a novel exact method, RSASSOL, which significantly
outperforms the baseline algorithm in practice. The RSASSOL method combines the q-gram based inverted lists and the
reference nodes based pruning. Extensive experiments on large real data sets demonstrate the efficiency and
effectiveness of our approaches.
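The min-wise signatures over q-grams can be sketched as follows; Python's built-in `hash` stands in for the independent hash functions of real min-hashing (it is stable only within a single run), so this is illustrative rather than faithful to the MHR-tree construction.

```python
def qgrams(s, q=2):
    """Set of overlapping length-q substrings of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def minwise_signature(grams, num_hashes=16):
    """One min-hash per seed; a salted built-in hash stands in for
    independent hash functions."""
    return [min(hash((seed, g)) for g in grams) for seed in range(num_hashes)]

def resemblance_estimate(sig_a, sig_b):
    """Fraction of matching min-hashes estimates the Jaccard resemblance."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig_a = minwise_signature(qgrams("theatre"))
sig_b = minwise_signature(qgrams("theater"))
similarity = resemblance_estimate(sig_a, sig_b)  # rough q-gram overlap estimate
```

An index node stores one such signature for the union of q-grams below it; a low resemblance with the query string lets the whole sub-tree be pruned.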
Title :Predicting iPhone Sales from iPhone Tweets
Language : ASP.NET with C#
Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/predicting-iphone-sales-iphone-tweets
Abstract : Recent research in the field of computational social science has shown how data resulting from the
widespread adoption and use of social media channels such as Twitter can be used to predict outcomes such as
movie revenues, election winners, localized moods, and epidemic outbreaks. Underlying assumptions for this
research stream on predictive analytics are that social media actions such as tweeting, liking, commenting and rating
are proxies for users'/consumers' attention to a particular object/product and that the shared digital artefact that is
persistent can create social influence. In this paper, we demonstrate how social media data from Twitter can be used
to predict the sales of iPhones. Based on a conceptual model of social data consisting of social graph (actors, actions,
activities, and artefacts) and social text (topics, keywords, pronouns, and sentiments), we develop and evaluate a
linear regression model that transforms iPhone tweets into a prediction of the quarterly iPhone sales with an average
error close to the established prediction models from investment banks. This strong correlation between iPhone
tweets and iPhone sales becomes marginally stronger after incorporating sentiments of tweets. We discuss the
findings and conclude with implications for predictive analytics with big social data.
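A minimal version of such a regression (one predictor, closed-form least squares) might look like this; the tweet and sales figures are invented for illustration, not the paper's data.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a*x + b with a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    return slope, my - slope * mx

# hypothetical quarters: tweet volume (thousands) vs. iPhone units sold (millions)
tweets = [120, 180, 260, 300]
sales = [26, 35, 47, 51]
a, b = fit_line(tweets, sales)
predict = lambda t: a * t + b  # quarterly sales estimate from tweet volume
```

The paper's model adds sentiment features on top of volume; the sketch shows only the volume-to-sales regression step.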
Title :A Fast Clustering-Based Feature Subset Selection Algorithm for High Dimensional Data
Language : C#
Project Link :
http://kasanpro.com/p/c-sharp/clustering-based-feature-subset-selection-algorithm-high-dimensional-data
Abstract : Feature selection involves identifying a subset of the most useful features that produces results
comparable to those of the original entire set of features. A feature selection algorithm may be evaluated from both the efficiency and
effectiveness points of view. While the efficiency concerns the time required to find a subset of features, the
effectiveness is related to the quality of the subset of features. Based on these criteria, a fast clustering-based feature
selection algorithm, FAST, is proposed and experimentally evaluated in this paper. The FAST algorithm works in two
steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second
step, the most representative feature that is strongly related to target classes is selected from each cluster to form a
subset of features. Features in different clusters are relatively independent; thus, the clustering-based strategy of
FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we
adopt the efficient minimum-spanning tree clustering method. The efficiency and effectiveness of the FAST algorithm
are evaluated through an empirical study. Extensive experiments are carried out to compare FAST and several
representative feature selection algorithms, namely, FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to
four types of well-known classifiers, namely, the probability-based Naive Bayes, the tree-based C4.5, the
instance-based IB1, and the rule-based RIPPER before and after feature selection. The results, on 35 publicly
available real-world high dimensional image, microarray, and text data, demonstrate that FAST not only produces
smaller subsets of features but also improves the performances of the four types of classifiers.
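FAST's two steps can be sketched in miniature. The paper uses symmetric uncertainty and a minimum-spanning-tree partition; the sketch below substitutes absolute Pearson correlation and a simple correlation-threshold grouping, so it conveys only the shape of the algorithm, not its exact criteria.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def fast_select(features, target, threshold=0.5):
    """Step 1: group features whose mutual |correlation| exceeds `threshold`
    (a crude stand-in for the MST partition). Step 2: keep the feature most
    correlated with the target from each group."""
    names = list(features)
    parent = {n: n for n in names}
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(features[a], features[b])) > threshold:
                parent[find(a)] = find(b)
    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(max(g, key=lambda n: abs(pearson(features[n], target)))
                  for g in groups.values())

# f1 and f2 are redundant copies; f3 carries independent information
selected = fast_select({"f1": [1, 2, 3, 4], "f2": [2, 4, 6, 8],
                        "f3": [4, 1, 3, 2]}, target=[1, 2, 3, 4])
```

The redundant pair collapses into one cluster, so only one of its members survives alongside the independent feature.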
Title :Crowdsourcing Predictors of Behavioral Outcomes
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/crowdsourcing-predictors-behavioral-outcomes
Abstract : Generating models from large data sets--and determining which subsets of data to mine--is becoming
increasingly automated. However, choosing what data to collect in the first place requires human intuition or
experience, usually supplied by a domain expert. This paper describes a new approach to machine science which
demonstrates for the first time that non-domain experts can collectively formulate features, and provide values for
those features such that they are predictive of some behavioral outcome of interest. This was accomplished by
building a web platform in which human groups interact to both respond to questions likely to help predict a behavioral
outcome and pose new questions to their peers. This results in a dynamically growing online survey, and the result of
this cooperative behavior also leads to models that can predict users' outcomes based on their responses to the
user-generated survey questions. Here we describe two web-based experiments that instantiate this approach: the
first site led to models that can predict users' monthly electric energy consumption; the other led to models that can
predict users' body mass index. As exponential increases in content are often observed in successful online
collaborative communities, the proposed methodology may, in the future, lead to similar exponential rises in discovery
and insight into the causal factors of behavioral outcomes.
Title :Data Extraction for Deep Web Using WordNet
Language : ASP.NET with VB
Project Link : http://kasanpro.com/p/asp-net-with-vb/data-extraction-deep-web-using-wordnet
Abstract : Our survey shows that the techniques used in data extraction from deep webs need to be improved to
achieve the efficiency and accuracy of automatic wrappers. Further investigations indicate that the development of a
lightweight ontological technique using an existing lexical database for English (WordNet) is able to check the similarity
of data records and detect the correct data region with higher precision using the semantic properties of these data
records. The advantages of this method are that it can extract three types of data records, namely, single-section data
records, multiple-section data records, and loosely structured data records, and it also provides options for aligning
iterative and disjunctive data items. Experimental results show that our technique is robust and performs better than
the existing state-of-the-art wrappers. Tests also show that our wrapper is able to extract data records from
multilingual web pages and that it is domain independent.
Title :Data Extraction for Deep Web Using WordNet
Language : ASP.NET with C#
Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/data-extraction-deep-web-using-wordnet-code
Abstract : Our survey shows that the techniques used in data extraction from deep webs need to be improved to
achieve the efficiency and accuracy of automatic wrappers. Further investigations indicate that the development of a
lightweight ontological technique using an existing lexical database for English (WordNet) is able to check the similarity
of data records and detect the correct data region with higher precision using the semantic properties of these data
records. The advantages of this method are that it can extract three types of data records, namely, single-section data
records, multiple-section data records, and loosely structured data records, and it also provides options for aligning
iterative and disjunctive data items. Experimental results show that our technique is robust and performs better than
the existing state-of-the-art wrappers. Tests also show that our wrapper is able to extract data records from
multilingual web pages and that it is domain independent.
Title :Data Extraction for Deep Web Using WordNet
Language : PHP
Project Link : http://kasanpro.com/p/php/data-extraction-deep-web-using-wordnet-implement
Abstract : Our survey shows that the techniques used in data extraction from deep webs need to be improved to
achieve the efficiency and accuracy of automatic wrappers. Further investigations indicate that the development of a
lightweight ontological technique using an existing lexical database for English (WordNet) is able to check the similarity
of data records and detect the correct data region with higher precision using the semantic properties of these data
records. The advantages of this method are that it can extract three types of data records, namely, single-section data
records, multiple-section data records, and loosely structured data records, and it also provides options for aligning
iterative and disjunctive data items. Experimental results show that our technique is robust and performs better than
the existing state-of-the-art wrappers. Tests also show that our wrapper is able to extract data records from
multilingual web pages and that it is domain independent.
Title :An Effective Retrieval of Medical Records using Data Mining Techniques
Language : ASP.NET with VB
Project Link : http://kasanpro.com/p/asp-net-with-vb/retrieval-medical-records-data-mining
Abstract : Nowadays, the standard of the healthcare domain depends mainly on the delivery of modern healthcare
and the efficiency of healthcare systems. Due to time and cost constraints, most people rely on healthcare systems to
obtain healthcare services. It has therefore become very important to develop an automated tool that is capable of
identifying and disseminating relevant healthcare information. This work focuses on the retrieval of updated, accurate
and relevant information from Medline datasets using a Machine Learning approach. The proposed work uses a
keyword searching algorithm for extracting relevant information from Medline datasets and the K-Nearest Neighbor
(KNN) algorithm to determine the relation between diseases and treatments, thereby effectively improving patient care.
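The pipeline described above (keyword filtering followed by KNN over disease-treatment sentences) can be sketched as below; the toy "distance" is word overlap, and all sentences and labels are invented.

```python
from collections import Counter

def keyword_filter(sentences, keywords):
    """Keep sentences mentioning at least one query keyword."""
    return [s for s in sentences if any(k in s.lower() for k in keywords)]

def knn_relation(train, sentence, k=3):
    """Majority vote of the k training sentences with the largest word overlap.
    train: (sentence, relation_label) pairs."""
    words = set(sentence.lower().split())
    nearest = sorted(train,
                     key=lambda t: -len(words & set(t[0].lower().split())))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [("aspirin treats headache", "cure"),
         ("drug prevents migraine", "prevent"),
         ("vaccine prevents measles", "prevent")]
relation = knn_relation(train, "this drug prevents headache")
```

A real system would use proper feature vectors and a tuned distance metric, but the two stages (filter, then classify the relation) match the description.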
Title :An Effective Retrieval of Medical Records using Data Mining Techniques
Language : ASP.NET with C#
Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/retrieval-medical-records-data-mining-code
Abstract : Nowadays, the standard of the healthcare domain depends mainly on the delivery of modern healthcare
and the efficiency of healthcare systems. Due to time and cost constraints, most people rely on healthcare systems to
obtain healthcare services. It has therefore become very important to develop an automated tool that is capable of
identifying and disseminating relevant healthcare information. This work focuses on the retrieval of updated, accurate
and relevant information from Medline datasets using a Machine Learning approach. The proposed work uses a
keyword searching algorithm for extracting relevant information from Medline datasets and the K-Nearest Neighbor
(KNN) algorithm to determine the relation between diseases and treatments, thereby effectively improving patient care.
Title :An Effective Retrieval of Medical Records using Data Mining Techniques
Language : PHP
Project Link : http://kasanpro.com/p/php/retrieval-medical-records-data-mining-implement
Abstract : Nowadays, the standard of the healthcare domain depends mainly on the delivery of modern healthcare
and the efficiency of healthcare systems. Due to time and cost constraints, most people rely on healthcare systems to
obtain healthcare services. It has therefore become very important to develop an automated tool that is capable of
identifying and disseminating relevant healthcare information. This work focuses on the retrieval of updated, accurate
and relevant information from Medline datasets using a Machine Learning approach. The proposed work uses a
keyword searching algorithm for extracting relevant information from Medline datasets and the K-Nearest Neighbor
(KNN) algorithm to determine the relation between diseases and treatments, thereby effectively improving patient care.
Title :Design and analysis of concept adapting real time data stream Applications
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/concept-adapting-real-time-data-stream-applications
Abstract : Real-time signals are continuous in nature and change abruptly; hence there is a need to apply an
efficient, concept-adapting real-time data stream mining technique in order to make intelligent decisions online.
Concept drift in a real-time data stream refers to a change in the class (concept) definitions over time. It is also
called NON-STATIONARY LEARNING (NSL).
The most important criterion is to handle the real-time data stream mining problem with concept drift in a sound manner.
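One simple way to react to concept drift is to monitor the online model's recent error rate against its long-run error rate; the sketch below is a toy stand-in for established detectors such as DDM, with an invented error stream.

```python
def detect_drift(errors, window=30, factor=2.0):
    """Flag the first step where the recent error rate exceeds `factor` times
    the long-run error rate; a toy stand-in for detectors such as DDM.
    errors: 0/1 prediction-error stream of the online model."""
    for i in range(window, len(errors)):
        baseline = sum(errors[:i]) / i
        recent = sum(errors[i - window:i]) / window
        if baseline > 0 and recent > factor * baseline:
            return i
    return None

# stable stream (10% error), then the concept changes and errors jump
errors = [1 if i % 10 == 0 else 0 for i in range(100)] + [1] * 40
drift_at = detect_drift(errors)  # flagged shortly after step 100
```

Once drift is flagged, a stream miner would typically retrain or replace the model using only post-drift data.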
Title :Data Extraction for Deep Web Using WordNet
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/data-extraction-deep-web-using-wordnet-module
Abstract : Our survey shows that the techniques used in data extraction from deep webs need to be improved to
achieve the efficiency and accuracy of automatic wrappers. Further investigations indicate that the development of a
lightweight ontological technique using an existing lexical database for English (WordNet) is able to check the similarity
of data records and detect the correct data region with higher precision using the semantic properties of these data
records. The advantages of this method are that it can extract three types of data records, namely, single-section data
records, multiple-section data records, and loosely structured data records, and it also provides options for aligning
iterative and disjunctive data items. Experimental results show that our technique is robust and performs better than
the existing state-of-the-art wrappers. Tests also show that our wrapper is able to extract data records from
multilingual web pages and that it is domain independent.
Title :Answering General Time-Sensitive Queries
Language : ASP.NET with VB
Project Link : http://kasanpro.com/p/asp-net-with-vb/answering-general-time-sensitive-queries
Abstract : Time is an important dimension of relevance for a large number of searches, such as over blogs and news
archives. So far, research on searching over such collections has largely focused on locating topically similar
documents for a query. Unfortunately, topic similarity alone is not always sufficient for document ranking. In this
paper, we observe that, for an important class of queries that we call time-sensitive queries, the publication time of the
documents in a news archive is important and should be considered in conjunction with the topic similarity to derive
the final document ranking. Earlier work has focused on improving retrieval for "recency" queries that target recent
documents. We propose a more general framework for handling time-sensitive queries and we automatically identify
the important time intervals that are likely to be of interest for a query. Then, we build scoring techniques that
seamlessly integrate the temporal aspect into the overall ranking mechanism. We present an extensive experimental
evaluation using a variety of news article data sets, including TREC data as well as real web data analyzed using the
Amazon Mechanical Turk. We examine several techniques for detecting the important time intervals for a query over
a news archive and for incorporating this information in the retrieval process. We show that our techniques are robust
and significantly improve result quality for time-sensitive queries compared to state-of-the-art retrieval techniques.
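The final ranking step can be sketched as a blend of topic similarity and a temporal weight derived from the detected important intervals; the interval weights, dates, and blending parameter below are all hypothetical.

```python
from datetime import date

def temporal_weight(pub_date, intervals):
    """Weight of the best important interval containing the publication date."""
    return max((w for start, end, w in intervals if start <= pub_date <= end),
               default=0.1)

def rank(docs, intervals, alpha=0.5):
    """Blend topic similarity with the temporal weight (alpha balances them)."""
    scored = [(alpha * d["sim"] + (1 - alpha) * temporal_weight(d["date"], intervals),
               d["id"]) for d in docs]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# hypothetical query whose important interval is March 2004
intervals = [(date(2004, 3, 1), date(2004, 3, 31), 1.0)]
docs = [{"id": "d1", "sim": 0.9, "date": date(2010, 1, 1)},
        {"id": "d2", "sim": 0.7, "date": date(2004, 3, 15)}]
ranking = rank(docs, intervals)  # d2 wins despite lower topic similarity
```

For a recency query the intervals would simply cluster near the present, so this framework subsumes the earlier recency-only approaches.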
Title :Answering General Time-Sensitive Queries
Language : ASP.NET with C#
Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/answering-general-time-sensitive-queries-framwork
Abstract : Time is an important dimension of relevance for a large number of searches, such as over blogs and news
archives. So far, research on searching over such collections has largely focused on locating topically similar
documents for a query. Unfortunately, topic similarity alone is not always sufficient for document ranking. In this
paper, we observe that, for an important class of queries that we call time-sensitive queries, the publication time of the
documents in a news archive is important and should be considered in conjunction with the topic similarity to derive
the final document ranking. Earlier work has focused on improving retrieval for "recency" queries that target recent
documents. We propose a more general framework for handling time-sensitive queries and we automatically identify
the important time intervals that are likely to be of interest for a query. Then, we build scoring techniques that
seamlessly integrate the temporal aspect into the overall ranking mechanism. We present an extensive experimental
evaluation using a variety of news article data sets, including TREC data as well as real web data analyzed using the
Amazon Mechanical Turk. We examine several techniques for detecting the important time intervals for a query over
a news archive and for incorporating this information in the retrieval process. We show that our techniques are robust
and significantly improve result quality for time-sensitive queries compared to state-of-the-art retrieval techniques.
Title :A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/indexing-scalable-record-linkage-deduplication
Abstract : Record linkage is the process of matching records from several databases that refer to the same entities.
When applied on a single database, this process is known as deduplication. Increasingly, matched data are becoming
important in many application areas, because they can contain information that is not available otherwise, or that is
too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process,
because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the
increasing size of today's databases, the complexity of the matching process becomes one of the major challenges
for record linkage and deduplication. In recent years, various indexing techniques have been developed for record
linkage and deduplication. They are aimed at reducing the number of record pairs to be compared in the matching
process by removing obvious non-matching pairs, while at the same time maintaining high matching quality. This
paper presents a survey of twelve variations of six indexing techniques. Their complexity is analysed, and their
performance and scalability is evaluated within an experimental framework using both synthetic and real data sets. No
such detailed survey has so far been published.
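Standard blocking, the simplest of the surveyed indexing techniques, can be sketched as follows; the blocking key (first letter of the surname) and the records are invented.

```python
from collections import defaultdict

def block(records, key):
    """Standard blocking: group records by a blocking key value."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks

def candidate_pairs(blocks):
    """Only records inside the same block are ever compared."""
    return [(a, b) for recs in blocks.values()
            for i, a in enumerate(recs) for b in recs[i + 1:]]

people = [{"name": "Jon Smith"}, {"name": "John Smyth"}, {"name": "Mary Poe"}]
# hypothetical blocking key: first letter of the surname
pairs = candidate_pairs(block(people, key=lambda r: r["name"].split()[-1][0]))
```

Instead of all three pairwise comparisons, only the two "S" records are compared, which is exactly the reduction in candidate pairs that indexing techniques aim for.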
Title :Decentralized Probabilistic Text Clustering
Language : NS2
Project Link : http://kasanpro.com/p/ns2/decentralized-probabilistic-text-clustering
Abstract : Text clustering is an established technique for improving quality in information retrieval, for both
centralized and distributed environments. However, traditional text clustering algorithms fail to scale on highly
distributed environments, such as peer-to-peer networks. Our algorithm for peer-to-peer clustering achieves high
scalability by using a probabilistic approach for assigning documents to clusters. It enables a peer to compare each of
its documents only with very few selected clusters, without significant loss of clustering quality. The algorithm offers
probabilistic guarantees for the correctness of each document assignment to a cluster. Extensive experimental
evaluation with up to 1 million peers and 1 million documents demonstrates the scalability and effectiveness of the
algorithm.
Title :Decentralized Probabilistic Text Clustering
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/decentralized-probabilistic-text-clustering-code
Abstract : Text clustering is an established technique for improving quality in information retrieval, for both
centralized and distributed environments. However, traditional text clustering algorithms fail to scale on highly
distributed environments, such as peer-to-peer networks. Our algorithm for peer-to-peer clustering achieves high
scalability by using a probabilistic approach for assigning documents to clusters. It enables a peer to compare each of
its documents only with very few selected clusters, without significant loss of clustering quality. The algorithm offers
probabilistic guarantees for the correctness of each document assignment to a cluster. Extensive experimental
evaluation with up to 1 million peers and 1 million documents demonstrates the scalability and effectiveness of the
algorithm.
Title :Effective Pattern Discovery for Text Mining
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/effective-pattern-discovery-text-mining
Abstract : Many data mining techniques have been proposed for mining useful patterns in text documents. However,
how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text
mining. Since most existing text mining methods adopted term-based approaches, they all suffer from the problems of
polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase)-based
approaches should perform better than the term-based ones, but many experiments do not support this hypothesis.
This paper presents an innovative and effective pattern discovery technique which includes the processes of pattern
deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding
relevant and interesting information. Substantial experiments on RCV1 data collection and TREC topics demonstrate
that the proposed solution achieves encouraging performance.
Title :Ranking Model Adaptation for Domain-Specific Search
Language : ASP.NET with VB
Project Link : http://kasanpro.com/p/asp-net-with-vb/adaptation-domain-specific-search
Abstract : With the explosive emergence of vertical search domains, applying the broad-based ranking model directly
to different domains is no longer desirable due to domain differences, while building a unique ranking model for each
domain is both laborious for labeling data and time-consuming for training models. In this paper, we address these
difficulties by proposing a regularization based algorithm called ranking adaptation SVM (RA-SVM), through which we
can adapt an existing ranking model to a new domain, so that the amount of labeled data and the training cost is
reduced while the performance is still guaranteed. Our algorithm only requires the prediction from the existing ranking
models, rather than their internal representations or the data from auxiliary domains. In addition, we assume that
documents similar in the domain-specific feature space should have consistent rankings, and add some constraints to
control the margin and slack variables of RA-SVM adaptively. Finally, ranking adaptability measurement is proposed
to quantitatively estimate if an existing ranking model can be adapted to a new domain. Experiments performed over
Letor and two large scale datasets crawled from a commercial search engine demonstrate the applicabilities of the
proposed ranking adaptation algorithms and the ranking adaptability measurement.
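The core idea of RA-SVM, staying close to an auxiliary ranking model while fitting new-domain preference pairs, can be sketched with a toy pairwise hinge objective trained by gradient descent; this is a conceptual illustration with invented data, not the paper's solver.

```python
def adapt_ranking_model(w_old, pref_pairs, lam=0.5, lr=0.01, epochs=200):
    """Toy pairwise objective: hinge loss on new-domain preference pairs plus
    lam * ||w - w_old||^2, keeping the adapted weights near the auxiliary model.
    pref_pairs: (x_preferred, x_other); we want w.(x_preferred - x_other) >= 1."""
    w = list(w_old)
    for _ in range(epochs):
        for xp, xo in pref_pairs:
            diff = [p - o for p, o in zip(xp, xo)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            grad = [2 * lam * (wi - oi) for wi, oi in zip(w, w_old)]
            if margin < 1:  # hinge is active: push the margin up
                grad = [g - d for g, d in zip(grad, diff)]
            w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

w_old = [0.0, 1.0]                       # auxiliary (broad-based) ranking model
pref_pairs = [([1.0, 0.0], [0.0, 0.0])]  # the new domain also values feature 0
w_new = adapt_ranking_model(w_old, pref_pairs)
```

The regularizer lets a few labeled pairs reshape only the weights the new domain disagrees on, which is why adaptation needs far less labeled data than training from scratch.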
Title :Ranking Model Adaptation for Domain-Specific Search
Language : ASP.NET with C#
Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/ranking-adaptation-domain-specific-search
Abstract : With the explosive emergence of vertical search domains, applying the broad-based ranking model directly
to different domains is no longer desirable due to domain differences, while building a unique ranking model for each
domain is both laborious for labeling data and time-consuming for training models. In this paper, we address these
difficulties by proposing a regularization based algorithm called ranking adaptation SVM (RA-SVM), through which we
can adapt an existing ranking model to a new domain, so that the amount of labeled data and the training cost is
reduced while the performance is still guaranteed. Our algorithm only requires the prediction from the existing ranking
models, rather than their internal representations or the data from auxiliary domains. In addition, we assume that
documents similar in the domain-specific feature space should have consistent rankings, and add some constraints to
control the margin and slack variables of RA-SVM adaptively. Finally, ranking adaptability measurement is proposed
to quantitatively estimate if an existing ranking model can be adapted to a new domain. Experiments performed over
Letor and two large scale datasets crawled from a commercial search engine demonstrate the applicabilities of the
proposed ranking adaptation algorithms and the ranking adaptability measurement.
Title :Scalable Learning of Collective Behavior
Language : ASP.NET with VB
Project Link : http://kasanpro.com/p/asp-net-with-vb/scalable-learning-collective-behavior-code
Abstract : The study of collective behavior aims to understand how individuals behave in a social networking
environment. Oceans of data generated by social media like Facebook, Twitter, Flickr, and YouTube present
opportunities and challenges to study collective behavior on a large scale. In this work, we aim to learn to predict
collective behavior in social media. In particular, given information about some individuals, how can we infer the
behavior of unobserved individuals in the same network? A social-dimension-based approach has been shown
effective in addressing the heterogeneity of connections presented in social media. However, the networks in social
media are normally of colossal size, involving hundreds of thousands of actors. The scale of these networks entails
scalable learning of models for collective behavior prediction. To address the scalability issue, we propose an
edge-centric clustering scheme to extract sparse social dimensions. With sparse social dimensions, the proposed
approach can efficiently handle networks of millions of actors while demonstrating a comparable prediction
performance to other non-scalable methods.
Title :Scalable Learning of Collective Behavior
Language : ASP.NET with C#
Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/scalable-learning-collective-behavior-implement
Abstract : The study of collective behavior aims to understand how individuals behave in a social networking
environment. Oceans of data generated by social media like Facebook, Twitter, Flickr, and YouTube present
opportunities and challenges to study collective behavior on a large scale. In this work, we aim to learn to predict
collective behavior in social media. In particular, given information about some individuals, how can we infer the
behavior of unobserved individuals in the same network? A social-dimension-based approach has been shown
effective in addressing the heterogeneity of connections presented in social media. However, the networks in social
media are normally of colossal size, involving hundreds of thousands of actors. The scale of these networks entails
scalable learning of models for collective behavior prediction. To address the scalability issue, we propose an
edge-centric clustering scheme to extract sparse social dimensions. With sparse social dimensions, the proposed
approach can efficiently handle networks of millions of actors while demonstrating a comparable prediction
performance to other non-scalable methods.
Title :Resilient Identity Crime Detection
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/resilient-identity-crime-detection
Abstract : Identity crime is well known, prevalent, and costly, and credit application fraud is a specific case of identity
crime. Existing non-data-mining detection systems based on business rules, scorecards, and known fraud matching
have limitations. To address these limitations and combat identity crime in real time, this paper proposes a new
multilayered detection system complemented with two additional layers: communal detection (CD) and spike
detection (SD). CD finds real social relationships to reduce the suspicion score, and is tamper resistant to synthetic
social relationships. It is the whitelist-oriented approach on a fixed set of attributes. SD finds spikes in duplicates to
increase the suspicion score, and is probe-resistant for attributes. It is the attribute-oriented approach on a
variable-size set of attributes. Together, CD and SD can detect more types of attacks, better account for changing
legal behavior, and remove the redundant attributes. Experiments were carried out on CD and SD with several million
real credit applications. Results on the data support the hypothesis that successful credit application fraud patterns
are sudden and exhibit sharp spikes in duplicates. Although this research is specific to credit application fraud
detection, the concept of resilience, together with adaptivity and quality data discussed in the paper, are general to
the design, implementation, and evaluation of all detection systems.
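The two layers can be caricatured in a few lines of Python (an illustrative sketch with made-up attributes and weights, not the actual CD/SD algorithms): communal detection lowers an application's suspicion score when it has a whitelisted real relationship, while spike detection raises it when a duplicated attribute value spikes in the recent stream:

```python
from collections import Counter

def suspicion_score(app, whitelist, recent_phones, spike_threshold=3):
    """Toy layered scoring: start neutral, let CD lower and SD raise the score."""
    score = 1.0
    # Communal detection (CD): a whitelisted real relationship on a fixed
    # attribute (here, the applicant's name) reduces suspicion.
    if any(frozenset((app["name"], other)) in whitelist for other in app["links"]):
        score -= 0.5
    # Spike detection (SD): a burst of duplicates of an attribute value
    # (here, the phone number) in the recent stream increases suspicion.
    if Counter(recent_phones)[app["phone"]] >= spike_threshold:
        score += 0.5
    return score
```

The sketch mirrors the division of labour in the abstract: CD is whitelist-oriented over a fixed attribute set, SD is attribute-oriented over duplicates in a window of recent applications.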
Title :Resilient Identity Crime Detection
Language : ASP.NET with C#
Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/resilient-identity-crime-detection-code
Abstract : Identity crime is well known, prevalent, and costly; and credit application fraud is a specific case of identity
crime. The existing non-data-mining detection systems of business rules and scorecards, and known fraud matching,
have limitations. To address these limitations and combat identity crime in real time, this paper proposes a new
multilayered detection system complemented with two additional layers: communal detection (CD) and spike
detection (SD). CD finds real social relationships to reduce the suspicion score, and is tamper resistant to synthetic
social relationships. It is the whitelist-oriented approach on a fixed set of attributes. SD finds spikes in duplicates to
increase the suspicion score, and is probe-resistant for attributes. It is the attribute-oriented approach on a
variable-size set of attributes. Together, CD and SD can detect more types of attacks, better account for changing
legal behavior, and remove the redundant attributes. Experiments were carried out on CD and SD with several million
real credit applications. Results on the data support the hypothesis that successful credit application fraud patterns
are sudden and exhibit sharp spikes in duplicates. Although this research is specific to credit application fraud
detection, the concept of resilience, together with adaptivity and quality data discussed in the paper, are general to
the design, implementation, and evaluation of all detection systems.
Title :Resilient Identity Crime Detection
Language : ASP.NET with VB
Project Link : http://kasanpro.com/p/asp-net-with-vb/resilient-identity-crime-detection-implement
Abstract : Identity crime is well known, prevalent, and costly; and credit application fraud is a specific case of identity
crime. The existing non-data-mining detection systems of business rules and scorecards, and known fraud matching,
have limitations. To address these limitations and combat identity crime in real time, this paper proposes a new
multilayered detection system complemented with two additional layers: communal detection (CD) and spike
detection (SD). CD finds real social relationships to reduce the suspicion score, and is tamper resistant to synthetic
social relationships. It is the whitelist-oriented approach on a fixed set of attributes. SD finds spikes in duplicates to
increase the suspicion score, and is probe-resistant for attributes. It is the attribute-oriented approach on a
variable-size set of attributes. Together, CD and SD can detect more types of attacks, better account for changing
legal behavior, and remove the redundant attributes. Experiments were carried out on CD and SD with several million
real credit applications. Results on the data support the hypothesis that successful credit application fraud patterns
are sudden and exhibit sharp spikes in duplicates. Although this research is specific to credit application fraud
detection, the concept of resilience, together with adaptivity and quality data discussed in the paper, are general to
the design, implementation, and evaluation of all detection systems.
Title :Resilient Identity Crime Detection
Language : PHP
Project Link : http://kasanpro.com/p/php/resilient-identity-crime-detection-module
Abstract : Identity crime is well known, prevalent, and costly; and credit application fraud is a specific case of identity
crime. The existing non-data-mining detection systems of business rules and scorecards, and known fraud matching,
have limitations. To address these limitations and combat identity crime in real time, this paper proposes a new
multilayered detection system complemented with two additional layers: communal detection (CD) and spike
detection (SD). CD finds real social relationships to reduce the suspicion score, and is tamper resistant to synthetic
social relationships. It is the whitelist-oriented approach on a fixed set of attributes. SD finds spikes in duplicates to
increase the suspicion score, and is probe-resistant for attributes. It is the attribute-oriented approach on a
variable-size set of attributes. Together, CD and SD can detect more types of attacks, better account for changing
legal behavior, and remove the redundant attributes. Experiments were carried out on CD and SD with several million
real credit applications. Results on the data support the hypothesis that successful credit application fraud patterns
are sudden and exhibit sharp spikes in duplicates. Although this research is specific to credit application fraud
detection, the concept of resilience, together with adaptivity and quality data discussed in the paper, are general to
the design, implementation, and evaluation of all detection systems.
Title :Real-Time Analysis of Physiological Data to Support Medical Applications
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/real-time-analysis-physiological-data-support-medical-applications
Abstract : This paper presents a flexible framework that performs real-time analysis of physiological data to monitor
people's health conditions in any context (e.g., during daily activities, in hospital environments). Given historical
physiological data, different behavioral models tailored to specific conditions (e.g., a particular disease, a specific
patient) are automatically learnt. A suitable model for the currently monitored patient is exploited in the real-time
stream classification phase. The framework has been designed to perform both instantaneous evaluation and stream
analysis over a sliding time window. To allow ubiquitous monitoring, real-time analysis could also be executed on
mobile devices. As a case study, the framework has been validated in the intensive care scenario. Experimental
validation, performed on 64 patients affected by different critical illnesses, demonstrates the effectiveness and the
flexibility of the proposed framework in detecting different severity levels of monitored people's clinical situations.
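A minimal Python sketch of the two analysis modes (hypothetical thresholds on toy readings, not the framework's learned behavioral models): each reading gets an instantaneous evaluation against a normal range, and the stream is also classified over a sliding window of recent readings:

```python
from collections import deque

def instantaneous(value, low=60, high=100):
    """Instantaneous evaluation against a normal range (toy heart-rate bounds)."""
    return "normal" if low <= value <= high else "alarm"

def stream_severity(readings, window=5, alarm_fraction=0.6):
    """Stream analysis over a sliding window of the most recent readings."""
    buf = deque(maxlen=window)
    labels = []
    for v in readings:
        buf.append(v)
        frac = sum(instantaneous(x) == "alarm" for x in buf) / len(buf)
        labels.append("critical" if frac >= alarm_fraction else "stable")
    return labels
```

The window smooths out isolated outliers: a single abnormal reading does not escalate severity, while a sustained run of alarms does.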
Title :Contextual query classification in web search
Language : ASP.NET with C#
Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/contextual-query-classification-web-search
Abstract : There has been an increasing interest in exploiting multiple sources of evidence for improving the quality
of a search engine's results. User context elements like interests, preferences and intents are the main sources
exploited in information retrieval approaches to better fit the user information needs. Using the user intent to improve
the query specific retrieval search relies on classifying web queries into three types: informational, navigational and
transactional according to the user intent. However, query type classification strategies involved are based solely on
query features, where the query type decision is made independently of the user context represented by the search history. In this
paper, we present a contextual query classification method making use of both query features and the user context
defined by quality indicators of the previous query session type called the query profile. We define a query session as
a sequence of queries of the same type. Preliminary experimental results carried out using TREC data show that our
approach is promising.
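The combination of query features with the session's query profile can be sketched as follows (an illustrative Python toy with made-up keyword rules, not the paper's classifier): feature evidence decides clear-cut queries, and the type of the current query session breaks ties for ambiguous ones:

```python
def query_feature_type(query):
    """Type evidence from the query string alone (toy keyword rules)."""
    q = query.lower()
    if any(w in q for w in ("buy", "download", "order", "price")):
        return "transactional"
    if q.startswith("www.") or q.endswith(".com") or "homepage" in q:
        return "navigational"
    return "informational"

def classify_query(query, session_type=None):
    """Combine query features with the context: the type of the current
    query session (a sequence of queries of the same type)."""
    feature_type = query_feature_type(query)
    if feature_type == "informational" and session_type:
        return session_type   # ambiguous features: defer to the query profile
    return feature_type
```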
Title :Contextual query classification in web search
Language : ASP.NET with VB
Project Link : http://kasanpro.com/p/asp-net-with-vb/contextual-query-classification-web-search-results
Abstract : There has been an increasing interest in exploiting multiple sources of evidence for improving the quality
of a search engine's results. User context elements like interests, preferences and intents are the main sources
exploited in information retrieval approaches to better fit the user information needs. Using the user intent to improve
the query specific retrieval search relies on classifying web queries into three types: informational, navigational and
transactional according to the user intent. However, query type classification strategies involved are based solely on
query features, where the query type decision is made independently of the user context represented by the search history. In this
paper, we present a contextual query classification method making use of both query features and the user context
defined by quality indicators of the previous query session type called the query profile. We define a query session as
a sequence of queries of the same type. Preliminary experimental results carried out using TREC data show that our
approach is promising.
Title :Contextual query classification in web search
Language : PHP
Project Link : http://kasanpro.com/p/php/query-classification-web-search
Abstract : There has been an increasing interest in exploiting multiple sources of evidence for improving the quality
of a search engine's results. User context elements like interests, preferences and intents are the main sources
exploited in information retrieval approaches to better fit the user information needs. Using the user intent to improve
the query specific retrieval search relies on classifying web queries into three types: informational, navigational and
transactional according to the user intent. However, query type classification strategies involved are based solely on
query features, where the query type decision is made independently of the user context represented by the search history. In this
paper, we present a contextual query classification method making use of both query features and the user context
defined by quality indicators of the previous query session type called the query profile. We define a query session as
a sequence of queries of the same type. Preliminary experimental results carried out using TREC data show that our
approach is promising.
Title :Annotating Search Results from Web Databases
Language : ASP.NET with VB
Project Link : http://kasanpro.com/p/asp-net-with-vb/annotating-search-results-web-databases
Abstract : An increasing number of databases have become web accessible through HTML form-based search
interfaces. The data units returned from the underlying database are usually encoded into the result pages
dynamically for human browsing. For the encoded data units to be machine processable, which is essential for many
applications such as deep web data collection and Internet comparison shopping, they need to be extracted out and
assigned meaningful labels. In this paper, we present an automatic annotation approach that first aligns the data units
on a result page into different groups such that the data in the same group have the same semantics. Then, for each
group we annotate it from different aspects and aggregate the different annotations to predict a final annotation label
for it. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result
pages from the same web database. Our experiments indicate that the proposed approach is highly effective.
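The align-then-annotate-then-aggregate pipeline can be sketched in Python (hypothetical format-based annotators on toy records, not the paper's wrapper construction): data units in the same position are aligned into a group, simple annotators vote on each group, and the majority label becomes the group's annotation:

```python
from collections import Counter

def annotate_groups(records):
    """Align the i-th data unit of every record into one group, then let
    simple format-based annotators vote on a label for each group."""
    wrapper = []
    for group in zip(*records):          # alignment: same position = same group
        votes = []
        for unit in group:
            if unit.startswith("$"):
                votes.append("price")    # currency-format annotator
            elif unit.isdigit() and len(unit) == 4:
                votes.append("year")     # four-digit-number annotator
            else:
                votes.append("title")    # fallback annotator
        wrapper.append(Counter(votes).most_common(1)[0][0])  # aggregate votes
    return wrapper
```

The resulting label list plays the role of an annotation wrapper: applied to a new result page from the same site, it labels each aligned column without redoing the analysis.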
Title :Annotating Search Results from Web Databases
Language : ASP.NET with C#
Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/annotating-search-results-web-databas
Abstract : An increasing number of databases have become web accessible through HTML form-based search
interfaces. The data units returned from the underlying database are usually encoded into the result pages
dynamically for human browsing. For the encoded data units to be machine processable, which is essential for many
applications such as deep web data collection and Internet comparison shopping, they need to be extracted out and
assigned meaningful labels. In this paper, we present an automatic annotation approach that first aligns the data units
on a result page into different groups such that the data in the same group have the same semantics. Then, for each
group we annotate it from different aspects and aggregate the different annotations to predict a final annotation label
for it. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result
pages from the same web database. Our experiments indicate that the proposed approach is highly effective.
Title :Annotating Search Results from Web Databases
Language : PHP
Project Link : http://kasanpro.com/p/php/annotating-search-results-web-databases-efficient
Abstract : An increasing number of databases have become web accessible through HTML form-based search
interfaces. The data units returned from the underlying database are usually encoded into the result pages
dynamically for human browsing. For the encoded data units to be machine processable, which is essential for many
applications such as deep web data collection and Internet comparison shopping, they need to be extracted out and
assigned meaningful labels. In this paper, we present an automatic annotation approach that first aligns the data units
on a result page into different groups such that the data in the same group have the same semantics. Then, for each
group we annotate it from different aspects and aggregate the different annotations to predict a final annotation label
for it. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result
pages from the same web database. Our experiments indicate that the proposed approach is highly effective.
Title :A cost sensitive decision tree classification in credit card identity crime detection system
Language : ASP.NET with VB
Project Link : http://kasanpro.com/p/asp-net-with-vb/cost-sensitive-decision-tree-classification-credit-card-identity-crime-detec
Abstract :
Title :A cost sensitive decision tree classification in credit card identity crime detection system
Language : ASP.NET with C#
Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/cost-sensitive-decision-tree-classification-credit-card-identity-crime-d
Abstract :
Title :A cost sensitive decision tree classification in credit card identity crime detection system
Language : C#
Project Link :
http://kasanpro.com/p/c-sharp/cost-sensitive-decision-tree-classification-credit-card-identity-fraud-crime-detection
Abstract :
Title :A cost sensitive decision tree classification in credit card identity crime detection system
Language : PHP
Project Link : http://kasanpro.com/p/php/decision-tree-classification-credit-card-identity-crime-detection-system
Abstract :
Title :A cost-sensitive decision tree approach for fraud detection
Language : C#
Project Link :
http://kasanpro.com/p/c-sharp/credit-card-identity-crime-detection-system-cost-sensitive-decision-tree-classification
Abstract : With developments in information technology, fraud is spreading all over the world, resulting in
huge financial losses. Though fraud prevention mechanisms such as CHIP&PIN have been developed for credit card
systems, these mechanisms do not prevent the most common fraud types, such as fraudulent credit card usage over
virtual POS (Point Of Sale) terminals or mail orders, so-called online credit card fraud. As a result, fraud detection
becomes the essential tool and probably the best way to stop such fraud types. In this study, a new cost-sensitive
decision tree approach which minimizes the sum of misclassification costs while selecting the splitting attribute at
each non-terminal node is developed and the performance of this approach is compared with the well-known
traditional classification models on a real world credit card data set. In this approach, misclassification costs are taken
as varying. The results show that this cost-sensitive decision tree algorithm outperforms the existing well-known
methods on the given problem set not only with respect to well-known performance metrics such as accuracy and true
positive rate, but also with respect to a newly defined cost-sensitive metric specific to the credit card fraud detection domain. Accordingly,
financial losses due to fraudulent transactions can be decreased more by the implementation of this approach in fraud
detection systems.
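The splitting criterion can be sketched in Python (an illustrative toy with a made-up fixed investigation cost for false positives, and per-transaction amounts as the varying misclassification costs; not the paper's algorithm): each candidate attribute is scored by the total cost of labelling every branch with its cheaper decision, and the attribute with the minimum summed cost is chosen:

```python
from collections import defaultdict

def branch_cost(rows, fp_cost=10.0):
    """Cost of labelling a whole branch with its cheaper decision:
    predicting 'fraud' costs fp_cost per legitimate case (investigation),
    predicting 'legit' costs the transaction amount per fraud case."""
    cost_if_fraud = sum(fp_cost for _, label, _ in rows if label == "legit")
    cost_if_legit = sum(amount for _, label, amount in rows if label == "fraud")
    return min(cost_if_fraud, cost_if_legit)

def best_split(rows, attrs, fp_cost=10.0):
    """Pick the attribute whose split minimizes summed misclassification cost."""
    best, best_cost = None, float("inf")
    for a in attrs:
        branches = defaultdict(list)
        for row in rows:                 # row = (attribute dict, label, amount)
            branches[row[0][a]].append(row)
        total = sum(branch_cost(b, fp_cost) for b in branches.values())
        if total < best_cost:
            best, best_cost = a, total
    return best, best_cost
```

Applying this criterion recursively at each non-terminal node yields a tree whose splits are chosen by cost rather than by purity measures such as information gain.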
Title :A cost-sensitive decision tree approach for fraud detection
Language : VB.NET
Project Link :
http://kasanpro.com/p/vb-net/cost-sensitive-decision-tree-classify-credit-card-identity-crime-detection-system
Abstract : With developments in information technology, fraud is spreading all over the world, resulting in
huge financial losses. Though fraud prevention mechanisms such as CHIP&PIN have been developed for credit card
systems, these mechanisms do not prevent the most common fraud types, such as fraudulent credit card usage over
virtual POS (Point Of Sale) terminals or mail orders, so-called online credit card fraud. As a result, fraud detection
becomes the essential tool and probably the best way to stop such fraud types. In this study, a new cost-sensitive
decision tree approach which minimizes the sum of misclassification costs while selecting the splitting attribute at
each non-terminal node is developed and the performance of this approach is compared with the well-known
traditional classification models on a real world credit card data set. In this approach, misclassification costs are taken
as varying. The results show that this cost-sensitive decision tree algorithm outperforms the existing well-known
methods on the given problem set not only with respect to well-known performance metrics such as accuracy and true
positive rate, but also with respect to a newly defined cost-sensitive metric specific to the credit card fraud detection domain. Accordingly,
financial losses due to fraudulent transactions can be decreased more by the implementation of this approach in fraud
detection systems.
Title :Predicting Home Service Demands from Appliance Usage Data
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/predicting-home-service-demands-from-appliance-usage-data
Abstract : Power management in homes and offices requires appliance usage prediction when the future user
requests are not available. The randomness and uncertainties associated with an appliance usage make the
prediction of appliance usage from energy consumption data a non-trivial task. A general model for prediction at the
appliance level is still lacking. In this work, we propose to enrich learning algorithms with expert knowledge and
propose a general model using a knowledge driven approach to forecast if a particular appliance will start at a given
hour or not. The approach is both a knowledge driven and data driven one. The overall energy management for a
house requires that the prediction is done for the next 24 hours in the future. The proposed model is tested over the
Irise data and the results are compared with some trivial knowledge driven predictors.
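A minimal data-driven baseline for the hourly start prediction (a hypothetical Python sketch, without the paper's expert-knowledge enrichment): estimate, per hour of day, the historical probability that the appliance starts, then forecast the next 24 hours by thresholding those probabilities:

```python
from collections import Counter

def train_start_probability(history):
    """history: (hour_of_day, started) observations for one appliance."""
    seen, starts = Counter(), Counter()
    for hour, started in history:
        seen[hour] += 1
        starts[hour] += started
    return {h: starts[h] / seen[h] for h in seen}

def predict_next_24h(probs, threshold=0.5):
    """Hours (0-23) at which the appliance is predicted to start."""
    return [h for h in range(24) if probs.get(h, 0.0) >= threshold]
```

Expert knowledge would enter by adjusting these probabilities with contextual rules (occupancy, season, related appliances), which is the enrichment the paper proposes.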
Title :Data Mining and Wireless Sensor Network for Groundnut Pest/Disease Interaction and Predictions - A
Preliminary Study
Language : C#
Project Link :
http://kasanpro.com/p/c-sharp/data-mining-wireless-sensor-network-groundnut-pest-disease-predictions
Abstract : Data driven precision agriculture aspects, particularly the pest/disease management, require a dynamic
crop-weather data. An experiment was conducted in a semi-arid region of India to understand the
crop-weather-pest/disease relations using wireless sensor and field-level surveillance data on the closely related and
interdependent pest (Thrips) - disease (Bud Necrosis) dynamics of the groundnut (peanut) crop. Various data mining
techniques were used to turn the data into useful information/knowledge/relations/trends and correlations of
crop-weather-pest/disease continuum. These dynamics obtained from the data mining techniques and trained through
mathematical models were validated with corresponding ground level surveillance data. It was found that Bud
Necrosis viral disease infection is strongly influenced by humidity, maximum temperature, prolonged duration of leaf
wetness, and the age of the crop, and is propelled by the carrier pest Thrips. Results obtained from data for four
continuous agricultural seasons (monsoon & post-monsoon) have led to the development of cumulative and
non-cumulative prediction models, which can assist the user community in taking appropriate ameliorative measures.
Title :Mining Social Media Data for Understanding Students' Learning Experiences
Language : ASP.NET with VB
Project Link :
http://kasanpro.com/p/asp-net-with-vb/mining-social-media-data-understanding-students-learning-experiences
Abstract : Students' informal conversations on social media (e.g. Twitter, Facebook) shed light on their educational
experiences - opinions, feelings, and concerns about the learning process. Data from such uninstrumented
environments can provide valuable knowledge to inform student learning. Analyzing such data, however, can be
challenging. The complexity of students' experiences reflected in social media content requires human
interpretation. However, the growing scale of data demands automatic data analysis techniques. In this paper, we
developed a workflow to integrate both qualitative analysis and large-scale data mining techniques. We focus on
engineering students' Twitter posts to understand issues and problems in their educational experiences. We first
conducted a qualitative analysis on samples taken from about 25,000 tweets related to engagement and sleep
deprivation. Based on these results, we implemented a multi-label classification algorithm to classify tweets
reflecting students' problems. We then used the algorithm to train a detector of student problems from about 35,000
tweets streamed at the geo-location of Purdue University. This work, for the first time, presents a methodology and
results that show how informal social media data can provide insights into students' experiences.
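The multi-label step can be caricatured with a keyword-based Python sketch (made-up category keywords for illustration; the paper trains a classifier on qualitatively labelled tweets): a tweet receives every category whose keywords it mentions, so one tweet can carry several labels at once:

```python
# Hypothetical keyword lists per problem category (illustrative only).
CATEGORY_KEYWORDS = {
    "heavy_study_load": ("exam", "homework", "study"),
    "sleep_deprivation": ("sleep", "tired", "awake"),
    "lack_of_social_engagement": ("alone", "miss my friends"),
}

def label_tweet(text):
    """Return every category whose keywords the tweet mentions (multi-label)."""
    t = text.lower()
    return {cat for cat, kws in CATEGORY_KEYWORDS.items()
            if any(k in t for k in kws)}
```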
Title :Mining Social Media Data for Understanding Students' Learning Experiences
Language : ASP.NET with C#
Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/mining-social-media-data-understanding-students-learning-experien
Abstract : Students' informal conversations on social media (e.g. Twitter, Facebook) shed light on their educational
experiences - opinions, feelings, and concerns about the learning process. Data from such uninstrumented
environments can provide valuable knowledge to inform student learning. Analyzing such data, however, can be
challenging. The complexity of students' experiences reflected in social media content requires human
interpretation. However, the growing scale of data demands automatic data analysis techniques. In this paper, we
developed a workflow to integrate both qualitative analysis and large-scale data mining techniques. We focus on
engineering students' Twitter posts to understand issues and problems in their educational experiences. We first
conducted a qualitative analysis on samples taken from about 25,000 tweets related to engagement and sleep
deprivation. Based on these results, we implemented a multi-label classification algorithm to classify tweets
reflecting students' problems. We then used the algorithm to train a detector of student problems from about 35,000
tweets streamed at the geo-location of Purdue University. This work, for the first time, presents a methodology and
results that show how informal social media data can provide insights into students' experiences.
Title :Mining Social Media Data for Understanding Students' Learning Experiences
Language : C#
Project Link :
http://kasanpro.com/p/c-sharp/mining-social-media-data-understanding-students-learning-experiences-implement
Abstract : Students' informal conversations on social media (e.g. Twitter, Facebook) shed light on their educational
experiences - opinions, feelings, and concerns about the learning process. Data from such uninstrumented
environments can provide valuable knowledge to inform student learning. Analyzing such data, however, can be
challenging. The complexity of students' experiences reflected in social media content requires human
interpretation. However, the growing scale of data demands automatic data analysis techniques. In this paper, we
developed a workflow to integrate both qualitative analysis and large-scale data mining techniques. We focus on
engineering students' Twitter posts to understand issues and problems in their educational experiences. We first
conducted a qualitative analysis on samples taken from about 25,000 tweets related to engagement and sleep
deprivation. Based on these results, we implemented a multi-label classification algorithm to classify tweets
reflecting students' problems. We then used the algorithm to train a detector of student problems from about 35,000
tweets streamed at the geo-location of Purdue University. This work, for the first time, presents a methodology and
results that show how informal social media data can provide insights into students' experiences.
Title :Cost-effective Viral Marketing for Time-critical Campaigns in Large-scale Social Networks
Language : ASP.NET with VB
Project Link : http://kasanpro.com/p/asp-net-with-vb/viral-marketing-cost-effective-time-critical-campaigns-large-scale-social-n
Abstract : Online social networks (OSNs) have become one of the most effective channels for marketing and
advertising. Since users are often influenced by their friends, "word-of-mouth" exchanges, so-called viral marketing, in
social networks can be used to increase product adoption or widely spread content over the network. The common
perception of viral marketing as being cheap, easy, and massively effective makes it an ideal replacement for
traditional advertising. However, recent studies have revealed that the propagation often fades quickly within only a
few hops from the sources, counteracting the assumption of self-perpetuating influence considered in the literature.
With only limited influence propagation, is massively reaching customers via viral marketing still affordable? How can
we economically spend more resources to increase the spreading speed? We investigate the cost-effective massive viral
marketing problem, taking into consideration the limited influence propagation. Both analytical analysis based on
power-law network theory and numerical analysis demonstrate that viral marketing might involve costly seeding.
To minimize the seeding cost, we provide mathematical programming to find optimal seeding for medium-size
networks and propose VirAds, an efficient algorithm, to tackle the problem on large-scale networks. VirAds guarantees
a relative error bound of O(1) from the optimal solutions in power-law networks and outperforms greedy heuristics
that rely on degree centrality. Moreover, we also show that, in general, approximating the optimal seeding
within a ratio better than O(log n) is unlikely to be possible.
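The hop-limited seeding problem can be sketched with a simple greedy baseline in Python (an illustrative toy, not VirAds itself): influence is restricted to a fixed number of hops from the seeds, and seeds are added one at a time by largest marginal reach:

```python
def influenced(graph, seeds, hops):
    """Nodes reachable within `hops` hops of any seed (seeds included)."""
    frontier, seen = set(seeds), set(seeds)
    for _ in range(hops):
        frontier = {v for u in frontier for v in graph.get(u, ()) if v not in seen}
        seen |= frontier
    return seen

def greedy_seeds(graph, budget, hops=2):
    """Greedily add the seed with the largest marginal hop-limited reach."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    seeds = set()
    for _ in range(budget):
        best = max(nodes - seeds,
                   key=lambda n: len(influenced(graph, seeds | {n}, hops)))
        seeds.add(best)
    return seeds
```

Capping the propagation at a few hops is exactly what makes seeding costly: reaching the whole network may require many more seeds than classical influence models suggest.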
Title :Cost-effective Viral Marketing for Time-critical Campaigns in Large-scale Social Networks
Language : ASP.NET with C#
Project Link : http://kasanpro.com/p/asp-net-with-c-sharp/cost-effective-viral-marketing-time-critical-campaigns-large-scale-so
Abstract : Online social networks (OSNs) have become one of the most effective channels for marketing and
advertising. Since users are often influenced by their friends, "word-of-mouth" exchanges, so-called viral marketing, in
social networks can be used to increase product adoption or widely spread content over the network. The common
perception of viral marketing as being cheap, easy, and massively effective makes it an ideal replacement for
traditional advertising. However, recent studies have revealed that the propagation often fades quickly within only a
few hops from the sources, counteracting the assumption of self-perpetuating influence considered in the literature.
With only limited influence propagation, is massively reaching customers via viral marketing still affordable? How can
we economically spend more resources to increase the spreading speed? We investigate the cost-effective massive viral
marketing problem, taking into consideration the limited influence propagation. Both analytical analysis based on
power-law network theory and numerical analysis demonstrate that viral marketing might involve costly seeding.
To minimize the seeding cost, we provide mathematical programming to find optimal seeding for medium-size
networks and propose VirAds, an efficient algorithm, to tackle the problem on large-scale networks. VirAds guarantees
a relative error bound of O(1) from the optimal solutions in power-law networks and outperforms greedy heuristics
that rely on degree centrality. Moreover, we also show that, in general, approximating the optimal seeding
within a ratio better than O(log n) is unlikely to be possible.
Title :Cost-effective Viral Marketing for Time-critical Campaigns in Large-scale Social Networks
Language : C#
Project Link :
http://kasanpro.com/p/c-sharp/effective-viral-marketing-time-critical-campaigns-large-scale-social-networks
Abstract : Online social networks (OSNs) have become one of the most effective channels for marketing and
21. advertising. Since users are often influenced by their friends, "wordof- mouth" exchanges, so-called viral marketing, in
social networks can be used to increase product adoption or widely spread content over the network. The common
perception of viral marketing about being cheap, easy, and massively effective makes it an ideal replacement of
traditional advertising. However, recent studies have revealed that the propagation often fades quickly within only few
hops from the sources, counteracting the assumption on the self-perpetuating of influence considered in literature.
With only limited influence propagation, is massively reaching customers via viral marketing still affordable? How can
resources be spent more economically to increase the spreading speed? We investigate the cost-effective massive viral
marketing problem, taking into consideration the limited influence propagation. Both analytical results based on
power-law network theory and numerical analysis demonstrate that viral marketing might involve costly seeding.
To minimize the seeding cost, we provide a mathematical programming formulation to find optimal seedings for medium-size
networks and propose VirAds, an efficient algorithm, to tackle the problem on large-scale networks. VirAds guarantees
a relative error bound of O(1) from the optimal solutions in power-law networks and outperforms the greedy heuristics
that rely on degree centrality. Moreover, we also show that, in general, approximating the optimal seeding
within a ratio better than O(log n) is unlikely to be possible.
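To make the baseline concrete, the sketch below shows the degree-centrality seeding heuristic the abstract compares against: repeatedly pick the highest-degree node as a seed until every node lies within d hops of some seed. The function names, the hop-limited reachability model, and the toy graph are illustrative assumptions, not the VirAds algorithm itself.

```python
# Degree-centrality seeding baseline: seed high-degree nodes until the
# whole network is reachable within d hops of the seed set.
from collections import deque

def k_hop_reach(adj, seeds, d):
    """Return all nodes within d hops of any seed (breadth-first search)."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, dist = frontier.popleft()
        if dist == d:
            continue
        for nbr in adj.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen

def degree_seeding(adj, d):
    """Add nodes in decreasing-degree order until coverage is complete."""
    nodes = set(adj)
    by_degree = sorted(adj, key=lambda n: len(adj[n]), reverse=True)
    seeds = []
    for node in by_degree:
        if k_hop_reach(adj, seeds, d) >= nodes:
            break
        seeds.append(node)
    return seeds

# Toy star-plus-path network (undirected, as adjacency lists)
adj = {
    'a': ['b', 'c', 'd'],
    'b': ['a'], 'c': ['a'], 'd': ['a', 'e'],
    'e': ['d', 'f'], 'f': ['e'],
}
print(degree_seeding(adj, d=2))
```

This is exactly the heuristic the abstract says VirAds outperforms: degree alone ignores how much of the reachable set each new seed actually adds, so it can over-seed in power-law networks.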
Title :Green Mining: Investigating Power Consumption across Versions
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/green-mining-investigating-power-consumption-versions
Abstract : Power consumption is increasingly becoming a concern not only for electrical engineers, but for software
engineers as well, due to the increasing popularity of power-limited contexts such as mobile computing,
smartphones, and cloud computing. Software changes can alter software power consumption behaviour and can
cause power performance regressions. By tracking software power consumption we can build models to provide
suggestions to avoid power regressions. There is much research on software power consumption, but little focus on
the relationship between software changes and power consumption. Most work measures the power consumption of
a single software task; instead, we seek to extend this work across the history (revisions) of a project. We develop a
set of tests for a well-established product and then run those tests across all versions of the product while recording
the power usage of these tests. We provide and demonstrate a methodology that enables the analysis of power
consumption performance for over 500 nightly builds of Firefox 3.6; we show that software change does induce
changes in power consumption. This methodology and case study are a first step towards combining power
measurement and mining software repositories research, thus enabling developers to avoid power regressions via
power consumption awareness.
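A minimal sketch of the cross-version analysis this abstract describes: given repeated wattage samples per build, flag any build whose mean draw rises noticeably over the preceding build. The 5% threshold, the function names, and the sample readings are illustrative assumptions, not figures from the Firefox study.

```python
# Flag power regressions across an ordered series of builds by comparing
# each build's mean wattage to the previous build's.
from statistics import mean

def power_regressions(samples_by_build, threshold=0.05):
    """Return builds whose mean power rose by more than `threshold`
    (as a fraction) relative to the preceding build."""
    builds = sorted(samples_by_build)
    flagged = []
    for prev, curr in zip(builds, builds[1:]):
        p = mean(samples_by_build[prev])
        c = mean(samples_by_build[curr])
        if (c - p) / p > threshold:
            flagged.append(curr)
    return flagged

# Hypothetical watt readings from repeated test runs of nightly builds
watts = {
    "build-001": [20.1, 19.8, 20.0],
    "build-002": [20.2, 20.1, 19.9],
    "build-003": [22.5, 22.8, 22.4],  # a change induced extra draw here
}
print(power_regressions(watts))
```

Repeating the test suite per build and averaging, as sketched here, is what lets noise in individual measurements be separated from genuine change-induced regressions.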
Title :Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster
number
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/categorical-numerical-attribute-data-clustering-based
Abstract : Most existing clustering approaches are applicable to purely numerical or categorical data only, but
not to both. In general, it is a nontrivial task to perform clustering on mixed data composed of numerical and
categorical attributes because there exists an awkward gap between the similarity metrics for categorical and
numerical data. This paper therefore presents a general clustering framework based on the concept of object-cluster
similarity and gives a unified similarity metric which can be simply applied to the data with categorical, numerical, and
mixed attributes. Accordingly, an iterative clustering algorithm is developed, whose outstanding performance is
experimentally demonstrated on different benchmark data sets. Moreover, to circumvent the difficult selection problem
of cluster number, we further develop a penalized competitive learning algorithm within the proposed clustering
framework. The embedded competition and penalization mechanisms enable this improved algorithm to determine
the number of clusters automatically by gradually eliminating the redundant clusters. The experimental results show
the efficacy of the proposed approach.
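A minimal sketch of an object-cluster similarity for mixed attributes, in the spirit of the framework above: categorical similarity is the in-cluster frequency of the object's value, and numerical similarity decays with distance from the cluster mean. The equal weighting, the decay form, and the toy data are illustrative choices, not the paper's exact metric.

```python
# Unified object-cluster similarity over mixed categorical/numerical data.
from statistics import mean

def object_cluster_similarity(obj, cluster, cat_idx, num_idx):
    """Average attribute-wise similarity between an object and a cluster.

    cat_idx / num_idx list which attribute positions are categorical
    and numerical, respectively.
    """
    sims = []
    for i in cat_idx:
        values = [row[i] for row in cluster]
        # frequency of the object's category within the cluster
        sims.append(values.count(obj[i]) / len(values))
    for i in num_idx:
        values = [row[i] for row in cluster]
        spread = (max(values) - min(values)) or 1.0  # avoid divide-by-zero
        # similarity decays with normalized distance from the cluster mean
        sims.append(1.0 / (1.0 + abs(obj[i] - mean(values)) / spread))
    return sum(sims) / len(sims)

# Toy mixed-attribute records: (colour, size)
cluster = [("red", 1.0), ("red", 1.2), ("blue", 0.9)]
score = object_cluster_similarity(("red", 1.1), cluster,
                                  cat_idx=[0], num_idx=[1])
print(round(score, 4))
```

Because both attribute types map into [0, 1], one metric can drive assignment of any object to its most similar cluster, which is the gap the abstract says such a unified framework closes.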
Title :Categorical-and-numerical-attribute data clustering using K - Mode clustering and Fuzzy K - Mode clustering
Language : C#
Project Link : http://kasanpro.com/p/c-sharp/categorical-numerical-attribute-data-clustering-fuzzy
Abstract : Most existing clustering approaches are applicable to purely numerical or categorical data only, but
not to both. In general, it is a nontrivial task to perform clustering on mixed data composed of numerical and
categorical attributes because there exists an awkward gap between the similarity metrics for categorical and
numerical data. This paper therefore presents a general clustering framework based on the concept of object-cluster
similarity and gives a unified similarity metric which can be simply applied to the data with categorical, numerical, and