Given a point p and a set of points S, the kNN operation finds the k closest points to p in S. It is a computationally intensive task with a large range of applications such as knowledge discovery or data
mining. However, as the volume and the dimension of data increase, only distributed approaches can
perform such costly operation in a reasonable time. Recent works have focused on implementing
efficient solutions using the MapReduce programming model because it is suitable for distributed large
scale data processing. Although these works provide different solutions to the same problem, each one
has particular constraints and properties. In this paper, we compare the different existing approaches
for computing kNN on MapReduce, first theoretically, and then by performing an extensive
experimental evaluation. To be able to compare solutions, we identify three generic steps for kNN
computation on MapReduce: data pre-processing, data partitioning and computation. We then analyze
each step from load balancing, accuracy and complexity aspects. Experiments in this paper use a variety
of datasets, and analyze the impact of data volume, data dimension and the value of k from many
perspectives like time and space complexity, and accuracy. The experimental part brings new
advantages and shortcomings that are discussed for each algorithm. To the best of our knowledge, this
is the first paper that compares kNN computing methods on MapReduce both theoretically and
experimentally with the same setting. Overall, this paper can be used as a guide to tackle kNN-based
practical problems in the context of big data.
ETPL
DM - 001
K Nearest Neighbour Joins for Big Data on MapReduce: a Theoretical
and Experimental Analysis
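The three generic steps identified above (data partitioning, per-partition computation, then a merge) can be illustrated with a minimal single-process simulation of the map and reduce phases. This is a hedged sketch only: the round-robin partitioner, the function names, and the squared-Euclidean distance are illustrative assumptions, not any of the surveyed MapReduce algorithms.

```python
import heapq
from collections import defaultdict

def dist2(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_mapreduce(query_points, data_points, k, num_partitions=4):
    """Toy simulation of a kNN join expressed as map and reduce phases."""
    # Step 1 (pre-processing/partitioning): naive round-robin split of S.
    partitions = defaultdict(list)
    for i, s in enumerate(data_points):
        partitions[i % num_partitions].append(s)

    # Step 2 (map): each partition emits its local k nearest neighbours
    # for every query point, keyed by the query point.
    intermediate = defaultdict(list)
    for part in partitions.values():
        for q in query_points:
            local = heapq.nsmallest(k, part, key=lambda s: dist2(q, s))
            intermediate[q].extend(local)

    # Step 3 (reduce): merge the per-partition candidates into the global top k.
    return {q: heapq.nsmallest(k, cands, key=lambda s: dist2(q, s))
            for q, cands in intermediate.items()}
```

Because every partition contributes its local top k, the reduce step always sees a superset of the true k nearest neighbours, which is the correctness argument the exact methods in the survey rely on.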
High utility itemsets (HUIs) mining is an emerging topic in data mining, which refers to discovering
all itemsets having a utility meeting a user-specified minimum utility threshold min_util. However,
setting min_util appropriately is a difficult problem for users. Generally speaking, finding an
appropriate minimum utility threshold by trial and error is a tedious process for users. If min_util is set
too low, too many HUIs will be generated, which may cause the mining process to be very inefficient.
On the other hand, if min_util is set too high, it is likely that no HUIs will be found. In this paper, we
address the above issues by proposing a new framework for top-k high utility itemset mining, where k
is the desired number of HUIs to be mined. Two types of efficient algorithms named TKU (mining
Top-K Utility itemsets) and TKO (mining Top-K utility itemsets in One phase) are proposed for mining
such itemsets without the need to set min_util. We provide a structural comparison of the two
algorithms with discussions on their advantages and limitations. Empirical evaluations on both real and
synthetic datasets show that the performance of the proposed algorithms is close to that of the optimal
case of state-of-the-art utility mining algorithms.
ETPL
DM - 002
Efficient Algorithms for Mining Top-K High Utility Itemsets
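The core top-k idea described above can be sketched as follows: instead of asking the user for min_util, keep a size-k heap whose smallest entry acts as a rising utility border. The brute-force enumeration below is purely illustrative and performs none of the pruning that TKU/TKO use; the data layout (transactions as item-to-utility dicts) is an assumption.

```python
import heapq
from itertools import combinations

def top_k_hui(transactions, k):
    """Naive top-k high utility itemset miner: enumerate all itemsets and
    keep a size-k min-heap whose root is the rising min_util border."""
    items = sorted({i for t in transactions for i in t})
    heap = []  # entries are (utility, itemset)
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            # Utility of cand = summed item utilities over the transactions
            # that contain every item of cand.
            util = sum(sum(t[i] for i in cand)
                       for t in transactions if all(i in t for i in cand))
            if len(heap) < k:
                heapq.heappush(heap, (util, cand))
            elif util > heap[0][0]:        # beats the current border
                heapq.heapreplace(heap, (util, cand))
    return sorted(heap, reverse=True)
```

The efficient algorithms in the paper raise the same internal border but combine it with utility-list or two-phase pruning so that most candidates are never generated at all.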
Textual documents created and distributed on the Internet are ever changing in various forms. Most existing works are devoted to topic modelling and the evolution of individual topics, while sequential
relations of topics in successive documents published by a specific user are ignored. In this paper, in
order to characterize and detect personalized and abnormal behaviours of Internet users, we propose
Sequential Topic Patterns (STPs) and formulate the problem of mining User-aware Rare Sequential
Topic Patterns (URSTPs) in document streams on the Internet. They are rare on the whole but relatively frequent for specific users, so they can be applied in many real-life scenarios, such as real-time monitoring of abnormal user behaviours. We present a group of algorithms to solve this innovative mining
problem through three phases: pre-processing to extract probabilistic topics and identify sessions for
different users, generating all the STP candidates with (expected) support values for each user by
pattern-growth, and selecting URSTPs by making user-aware rarity analysis on derived STPs.
Experiments on both real (Twitter) and synthetic datasets show that our approach can indeed discover
special users and interpretable URSTPs effectively and efficiently, which significantly reflect users’
characteristics.
ETPL
DM - 003
Mining User-Aware Rare Sequential Topic Patterns in Document
Streams
Sequence classification is an important task in data mining. We address the problem of sequence
classification using rules composed of interesting patterns found in a dataset of labelled sequences and
accompanying class labels. We measure the interestingness of a pattern in a given class of sequences
by combining the cohesion and the support of the pattern. We use the discovered patterns to generate
confident classification rules, and present two different ways of building a classifier. The first classifier
is based on an improved version of the existing method of classification based on association rules,
while the second ranks the rules by first measuring their value specific to the new data object.
Experimental results show that our rule based classifiers outperform existing comparable classifiers in
terms of accuracy and stability. Additionally, we test a number of pattern feature based models that use
different kinds of patterns as features to represent each sequence as a feature vector. We then apply a
variety of machine learning algorithms for sequence classification, experimentally demonstrating that
the patterns we discover represent the sequences well, and prove effective for the classification task.
ETPL
DM - 004
Pattern Based Sequence Classification
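A minimal sketch of how cohesion and support might be combined into an interestingness score for a pattern in a class of sequences. The minimal-window definition of cohesion used here is one common choice and an assumption for illustration, not necessarily the paper's exact measure.

```python
def contains(seq, pattern):
    """True if pattern occurs in seq as a (non-contiguous) subsequence."""
    it = iter(seq)
    return all(x in it for x in pattern)

def support(pattern, sequences):
    """Fraction of sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

def min_window(seq, pattern):
    """Length of the shortest window of seq covering the pattern as a
    subsequence, or None if the pattern does not occur in seq."""
    best = None
    for start in range(len(seq)):
        if seq[start] != pattern[0]:
            continue
        j = 0
        for end in range(start, len(seq)):
            if seq[end] == pattern[j]:
                j += 1
                if j == len(pattern):
                    w = end - start + 1
                    best = w if best is None else min(best, w)
                    break
    return best

def cohesion(pattern, sequences):
    """Pattern length divided by the mean minimal window length over the
    sequences in which the pattern occurs: 1.0 when it always appears as
    a contiguous block, smaller when its items are spread out."""
    windows = [w for w in (min_window(s, pattern) for s in sequences) if w]
    if not windows:
        return 0.0
    return len(pattern) * len(windows) / sum(windows)

def interestingness(pattern, sequences):
    """Combine the two factors; a product is one simple choice."""
    return support(pattern, sequences) * cohesion(pattern, sequences)
```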
We propose an algorithm for detecting patterns exhibited by anomalous clusters in high dimensional
discrete data. Unlike most anomaly detection (AD) methods, which detect individual anomalies, our
proposed method detects groups (clusters) of anomalies; i.e. sets of points which collectively exhibit
abnormal patterns. In many applications this can lead to better understanding of the nature of the
atypical behavior and to identifying the sources of the anomalies. Moreover, we consider the case
where the atypical patterns are exhibited on only a small (salient) subset of the very high dimensional feature
space. Individual AD techniques and techniques that detect anomalies using all the features typically
fail to detect such anomalies, but our method can detect such instances collectively, discover the shared
anomalous patterns exhibited by them, and identify the subsets of salient features. In this paper, we
focus on detecting anomalous topics in a batch of text documents, developing our algorithm based on
topic models. Results of our experiments show that our method can accurately detect anomalous topics
and salient features (words) under each such topic in a synthetic data set and two real-world text corpora
and achieves better performance compared to both standard group AD and individual AD techniques.
All required code to reproduce our experiments is available from https://github.com/hsoleimani/ATD.
ETPL
DM - 005
ATD: Anomalous Topic Discovery in High Dimensional Discrete Data
Some important data management and analytics tasks cannot be completely addressed by automated
processes. These “computer-hard” tasks, such as entity resolution, sentiment analysis, and image recognition, can be enhanced through the use of human cognitive ability. Human Computation is an
effective way to address such tasks by harnessing the capabilities of crowd workers (i.e., the crowd).
Thus, crowdsourced data management has become an area of increasing interest in research and
industry. There are three important problems in crowdsourced data management. (1) Quality Control:
Workers may return noisy results and effective techniques are required to achieve high quality; (2)
Cost Control: The crowd is not free, and cost control aims to reduce the monetary cost; (3) Latency
Control: The human workers can be slow, particularly in contrast to computing time scales, so latency-
control techniques are required. There has been significant work addressing these three factors for
designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing
plans of multiple operators. In this paper, we survey and synthesize a wide spectrum of existing studies
on crowdsourced data management. Based on this analysis we then outline key factors that need to be
considered to improve crowdsourced data management.
ETPL
DM - 006
Crowdsourced Data Management: A Survey
Since Jeff Howe introduced the term Crowdsourcing in 2006, this human-powered problem-solving
paradigm has gained a lot of attention and has been a hot research topic in the field of Computer
Science. Even though a lot of work has been conducted on this topic, so far there is no comprehensive survey of the most relevant work done in the crowdsourcing field. In this paper, we aim to
offer an overall picture of the current state of the art techniques in general-purpose crowdsourcing.
According to their focus, we divide this work into three parts, which are: incentive design, task
assignment and quality control. For each part, we start with different problems faced in that area
followed by a brief description of existing work and a discussion of pros and cons. In addition, we also
present a real scenario on how the different techniques are used in implementing a location-based
crowdsourcing platform, gMission. Finally, we highlight the limitations of the current general-purpose
crowdsourcing techniques and present some open problems in this area.
ETPL
DM - 007
A Survey of General-Purpose Crowdsourcing Techniques
General health examination is an integral part of healthcare in many countries. Identifying the participants
at risk is important for early warning and preventive intervention. The fundamental challenge of learning a
classification model for risk prediction lies in the unlabeled data that constitutes the majority of the collected
dataset. Particularly, the unlabeled data describes the participants in health examinations whose health
conditions can vary greatly from healthy to very ill. There is no ground truth for differentiating their states
of health. In this paper, we propose a graph-based, semi-supervised learning algorithm called SHG-Health
(Semi-supervised Heterogeneous Graph on Health) for risk predictions to classify a progressively
developing situation with the majority of the data unlabeled. An efficient iterative algorithm is designed
and the proof of convergence is given. Extensive experiments based on both real health examination
datasets and synthetic datasets are performed to show the effectiveness and efficiency of our method.
ETPL
DM - 008
Mining Health Examination Records — A Graph-based Approach
Twitter has become one of the largest microblogging platforms for users around the world to share
anything happening around them with friends and beyond. A bursty topic in Twitter is one that triggers
a surge of relevant tweets within a short period of time, which often reflects important events of mass
interest. How to leverage Twitter for early detection of bursty topics has therefore become an important
research problem with immense practical value. Despite the wealth of research work on topic
modelling and analysis in Twitter, it remains a challenge to detect bursty topics in real-time. As existing
methods can hardly scale to handle the task with the tweet stream in real-time, we propose in this paper
TopicSketch, a sketch-based topic model together with a set of techniques to achieve real-time
detection. We evaluate our solution on a tweet stream with over 30 million tweets. Our experimental results show both the efficiency and effectiveness of our approach. In particular, we demonstrate that TopicSketch on a single machine can potentially handle hundreds of millions of tweets per day, which is on the same scale as the total number of daily tweets in Twitter, and present bursty events at a finer granularity.
ETPL
DM - 009
TopicSketch: Real-time Bursty Topic Detection from Twitter
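TopicSketch's real-time behaviour rests on sketch data structures that summarize a stream in constant space. As an illustration of that general primitive only (not TopicSketch's own acceleration sketches), a count-min sketch can be written as:

```python
import hashlib

class CountMinSketch:
    """Generic count-min sketch: constant-space frequency estimates over a
    stream, the kind of primitive that sketch-based topic models build on."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One salted hash per row gives depth independent bucket choices.
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Never underestimates: collisions can only inflate a counter,
        # so the minimum over the rows is the tightest available bound.
        return min(self.table[row][col] for row, col in self._buckets(item))
```

A burst detector can maintain such counts per word (or word pair) over sliding windows and flag terms whose estimated frequency accelerates sharply.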
The development of a topic in a set of topic documents is constituted by a series of person interactions
at a specific time and place. Knowing the interactions of the persons mentioned in these documents is
helpful for readers to better comprehend the documents. In this paper, we propose a topic person
interaction detection method called SPIRIT, which classifies the text segments in a set of topic
documents that convey person interactions. We design the rich interactive tree structure to represent
syntactic, context, and semantic information of text, and this structure is incorporated into a tree-based
convolution kernel to identify interactive segments. Experiment results based on real world topics
demonstrate that the proposed rich interactive tree structure effectively detects the topic person
interactions and that our method outperforms many well-known relation extraction and protein-protein
interaction methods.
ETPL
DM - 010
SPIRIT: A Tree Kernel-based Method for Topic Person Interaction
Detection
The ubiquity of smartphones has led to the emergence of mobile crowdsourcing tasks such as the
detection of spatial events when smartphone users move around in their daily lives. However, the
credibility of those detected events can be negatively impacted by unreliable participants with low-
quality data. Consequently, a major challenge in mobile crowdsourcing is truth discovery, i.e., to
discover true events from diverse and noisy participants' reports. This problem is uniquely distinct from
its online counterpart in that it involves uncertainties in both participants' mobility and reliability.
Decoupling these two types of uncertainties through location tracking will raise severe privacy and
energy issues, whereas simply ignoring missing reports or treating them as negative reports will
significantly degrade the accuracy of truth discovery. In this paper, we propose two new unsupervised
models, i.e., Truth finder for Spatial Events (TSE) and Personalized Truth finder for Spatial Events
(PTSE), to tackle this problem. In TSE, we model location popularity, location visit indicators, truths
of events, and three-way participant reliability in a unified framework. In PTSE, we further model
personal location visit tendencies. These proposed models are capable of effectively handling various
types of uncertainties and automatically discovering truths without any supervision or location
tracking. Experimental results on both real-world and synthetic datasets demonstrate that our proposed
models outperform existing state-of-the-art truth discovery approaches in the mobile crowdsourcing
environment.
ETPL
DM - 011
Truth Discovery in Crowdsourced Detection of Spatial Events
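The general shape of unsupervised truth discovery can be sketched with a generic iterative weighted-voting loop: estimate event truths from reliability-weighted reports, then re-estimate each participant's reliability from agreement with the current truths. This is a hedged illustration of the problem setting only; TSE and PTSE additionally model location popularity, visit indicators, and personal visit tendencies, which are omitted here.

```python
def truth_discovery(reports, n_iters=10):
    """Generic iterative truth discovery. reports[participant][event] is a
    claimed 0/1 observation; missing keys mean the participant filed no
    report for that event (and is simply excluded from its vote)."""
    participants = list(reports)
    events = {e for claims in reports.values() for e in claims}
    reliability = {p: 0.8 for p in participants}  # optimistic prior
    truths = {}
    for _ in range(n_iters):
        # Step 1: reliability-weighted vote per event.
        for e in events:
            num = sum(reliability[p] * reports[p][e]
                      for p in participants if e in reports[p])
            den = sum(reliability[p]
                      for p in participants if e in reports[p])
            truths[e] = num / den if den else 0.5
        # Step 2: reliability = mean agreement with the current truths.
        for p in participants:
            claims = reports[p]
            if claims:
                reliability[p] = sum(1 - abs(truths[e] - claims[e])
                                     for e in claims) / len(claims)
    return truths, reliability
```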
Feature selection is a challenging problem for high dimensional data processing, which arises in many
real applications such as data mining, information retrieval, and pattern recognition. In this paper, we
study the problem of unsupervised feature selection. The problem is challenging due to the lack of
label information to guide feature selection. We formulate the problem of unsupervised feature
selection from the viewpoint of graph regularized data reconstruction. The underlying idea is that the
selected features not only preserve the local structure of the original data space via graph regularization,
but also approximately reconstruct each data point via linear combination. Therefore, the graph
regularized data reconstruction error becomes a natural criterion for measuring the quality of the
selected features. By minimizing the reconstruction error, we are able to select the features that best
preserve both the similarity and discriminant information in the original data. We then develop an
efficient gradient algorithm to solve the corresponding optimization problem. We evaluate the
performance of our proposed algorithm on text clustering. The extensive experiments demonstrate the
effectiveness of our proposed approach.
ETPL
DM - 012
Graph Regularized Feature Selection with Data Reconstruction
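Objectives of the kind described, a reconstruction error plus a graph regularizer, typically take a form like the following. This is a sketch of the general family only, not necessarily the paper's exact formulation; the trade-off weights and the row-sparsity term are standard assumptions.

```latex
\min_{W}\;
\underbrace{\lVert X - XW \rVert_F^2}_{\text{reconstruction error}}
\;+\; \alpha\,
\underbrace{\operatorname{tr}\!\big((XW)^{\top} L\,(XW)\big)}_{\text{graph regularization}}
\;+\; \beta\,
\underbrace{\lVert W \rVert_{2,1}}_{\text{row sparsity}}
```

Here X is the data matrix, L = D - S is the Laplacian of the similarity graph over the data points, and the l2,1 norm drives whole rows of W to zero, so the surviving rows index the selected features; minimizing the first two terms selects features that reconstruct each point linearly while preserving local structure.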
The last few years have witnessed the emergence and evolution of a vibrant research stream on a large
variety of online social media network (SMN) platforms. Recognizing anonymous, yet identical users
among multiple SMNs is still an intractable problem. Clearly, cross-platform exploration may help
solve many problems in social computing in both theory and applications. Since public profiles can be
duplicated and easily impersonated by users with different purposes, most current user identification
resolutions, which mainly focus on text mining of users’ public profiles, are fragile. Some studies have
attempted to match users based on the location and timing of user content as well as writing style.
However, the locations are sparse in the majority of SMNs, and writing style is difficult to discern from
the short sentences of leading SMNs such as Sina Microblog and Twitter. Moreover, since online
SMNs are quite symmetric, existing user identification schemes based on network structure are not
effective. The real-world friend cycle is highly individual and virtually no two users share a congruent
friend cycle. Therefore, it is more accurate to use a friendship structure to analyze cross-platform
SMNs. Since identical users tend to set up partially similar friendship structures in different SMNs, we propose the Friend Relationship-Based User Identification (FRUI) algorithm. FRUI calculates a
match degree for all candidate User Matched Pairs (UMPs), and only UMPs with top ranks are
considered as identical users. We also developed two propositions to improve the efficiency of the
algorithm. Results of extensive experiments demonstrate that FRUI performs much better than current
network structure-based algorithms.
ETPL
DM - 013
Cross-Platform Identification of Anonymous Identical Users in Multiple
Social Media Networks
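The friendship-based matching described above can be sketched as a greedy loop: the match degree of a candidate pair is the number of already-identified friend pairs the two users share, and the highest-degree pair is matched first. This toy version assumes some seed matches are known and omits FRUI's candidate pruning and efficiency propositions.

```python
def frui(friends_a, friends_b, seeds):
    """Toy friend-relationship-based user identification across two
    networks. friends_a/friends_b map a user to their friend set;
    seeds maps known identical users from network A to network B."""
    matched = dict(seeds)  # user in network A -> user in network B
    while True:
        best, best_deg = None, 0
        for ua, fa in friends_a.items():
            if ua in matched:
                continue
            for ub, fb in friends_b.items():
                if ub in matched.values():
                    continue
                # Match degree: friends of ua whose known counterpart
                # is a friend of ub.
                deg = sum(1 for f in fa if matched.get(f) in fb)
                if deg > best_deg:
                    best, best_deg = (ua, ub), deg
        if best is None:          # no pair shares a matched friend
            return matched
        matched[best[0]] = best[1]
```

Each new match raises the degrees of the remaining candidates, which is why a greedy top-rank strategy propagates identities outward from the seeds.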
Taxonomy learning is an important task for knowledge acquisition, sharing, and classification as well
as application development and utilization in various domains. To reduce human effort to build a
taxonomy from scratch and improve the quality of the learned taxonomy, we propose a new taxonomy
learning approach, named TaxoFinder. TaxoFinder takes three steps to automatically build a taxonomy.
First, it identifies domain-specific concepts from a domain text corpus. Second, it builds a graph
representing how such concepts are associated together based on their co-occurrences. As the key
method in TaxoFinder, we propose a measure of the associative strength between concepts, which quantifies how strongly they are associated in the graph, using similarities and spatial distances between sentences. Lastly, TaxoFinder induces a taxonomy from the graph using a graph analytic algorithm, aiming to maximize the overall associative strengths among the concepts in the graph. We evaluate
TaxoFinder using gold-standard evaluation on three different domains: emergency management for
mass gatherings, autism research, and disease domains. In our evaluation, we compare TaxoFinder
with a state-of-the-art subsumption method and show that TaxoFinder is an effective approach
significantly outperforming the subsumption method.
ETPL
DM - 014
TaxoFinder: A Graph-Based Approach for Taxonomy Learning
As more and more applications produce streaming data, clustering data streams has become an
important technique for data and knowledge engineering. A typical approach is to summarize the data
stream in real-time with an online process into a large number of so-called micro-clusters. Micro-clusters represent local density estimates by aggregating the information of many data points in a
defined area. On demand, a (modified) conventional clustering algorithm is used in a second offline
step to recluster the micro-clusters into larger final clusters. For reclustering, the centers of the micro-
clusters are used as pseudo points with the density estimates used as their weights. However,
information about density in the area between micro-clusters is not preserved in the online process and
reclustering is based on possibly inaccurate assumptions about the distribution of data within and
between micro-clusters (e.g., uniform or Gaussian). This paper describes DBSTREAM, the first micro-
cluster-based online clustering component that explicitly captures the density between micro-clusters
via a shared density graph. The density information in this graph is then exploited for reclustering
based on actual density between adjacent micro-clusters. We discuss the space and time complexity of
maintaining the shared density graph. Experiments on a wide range of synthetic and real data sets
highlight that using shared density improves clustering quality over other popular data stream
clustering methods which require the creation of a larger number of smaller micro-clusters to achieve
comparable results.
ETPL
DM - 015
Clustering Data Streams Based on Shared Density between Micro-
Clusters
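The shared-density idea described above can be sketched in two steps: an online step in which every point updates the weight of each micro-cluster within radius r and, for every pair of clusters it falls into, a shared-density counter; and an offline step that reclusters by connecting micro-clusters whose shared density is high enough. This is a heavily simplified illustration (fixed centers, no decay), not DBSTREAM itself.

```python
from math import dist  # Python 3.8+

def assign(points, centers, r):
    """Online step: update micro-cluster weights and the shared-density
    graph from a batch of stream points."""
    weights = [0] * len(centers)
    shared = {}  # (i, j) with i < j -> density in the overlap of i and j
    for p in points:
        hits = [i for i, c in enumerate(centers) if dist(p, c) <= r]
        for i in hits:
            weights[i] += 1
        for i in hits:               # a point in two clusters' radii
            for j in hits:           # contributes to their shared density
                if i < j:
                    shared[(i, j)] = shared.get((i, j), 0) + 1
    return weights, shared

def recluster(n, shared, min_shared):
    """Offline step: connected components of the shared-density graph,
    via a small union-find with path compression."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (i, j), d in shared.items():
        if d >= min_shared:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```

Because the density between adjacent micro-clusters is measured directly, the offline step needs no distributional assumption about the space between them.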
Social media networks are dynamic. As such, the order in which network ties develop is an important
aspect of the network dynamics. This study proposes a novel dynamic network model, the Nodal
Attribute-based Temporal Exponential Random Graph Model (NATERGM) for dynamic network
analysis. The proposed model focuses on how the nodal attributes of a network affect the order in
which the network ties develop. Temporal patterns in social media networks are modeled based on the
nodal attributes of individuals and the time information of network ties. Using social media data
collected from a knowledge sharing community, empirical tests were conducted to evaluate the
performance of the NATERGM on identifying the temporal patterns and predicting the characteristics
of the future networks. Results showed that the NATERGM demonstrated an enhanced pattern testing
capability and an increased prediction accuracy of network characteristics compared to benchmark
models. The proposed NATERGM model helps explain the roles of nodal attributes in the formation
process of dynamic networks.
ETPL
DM - 016
NATERGM: A Model for Examining the Role of Nodal Attributes in
Dynamic Social Media Networks
Graph classification aims to learn models to classify structure data. To date, all existing graph
classification methods are designed to target one single learning task and require a large number of
labeled samples for learning good classification models. In reality, each real-world task may only have
a limited number of labeled samples, yet multiple similar learning tasks can provide useful knowledge
to benefit all tasks as a whole. In this paper, we formulate a new multi-task graph classification (MTG)
problem, where multiple graph classification tasks are jointly regularized to find discriminative
subgraphs shared by all tasks for learning. The niche of MTG stems from the fact that with a limited
number of training samples, subgraph features selected for one single graph classification task tend to
overfit the training data. By using additional tasks as evaluation sets, MTG can jointly regularize
multiple tasks to explore high quality subgraph features for graph classification. To achieve this goal,
we formulate an objective function which combines multiple graph classification tasks to evaluate the
informativeness score of a subgraph feature. An iterative subgraph feature exploration and multi-task
learning process is further proposed to incrementally select subgraph features for graph classification.
Experiments on real-world multi-task graph classification datasets demonstrate significant
performance gain.
ETPL
DM - 018
Joint Structure Feature Exploration and Regularization for Multi-Task
Graph Classification
Resource Description Framework (RDF) has been widely used in the Semantic Web to describe
resources and their relationships. The RDF graph is one of the most commonly used representations
for RDF data. However, in many real applications such as the data extraction/integration, RDF graphs
integrated from different data sources may often contain uncertain and inconsistent information (e.g.,
uncertain labels or that violate facts/rules), due to the unreliability of data sources. In this paper, we
formalize the RDF data by inconsistent probabilistic RDF graphs, which contain both inconsistencies
and uncertainty. With such a probabilistic graph model, we focus on an important problem, quality-
aware subgraph matching over inconsistent probabilistic RDF graphs (QA-gMatch), which retrieves
subgraphs from inconsistent probabilistic RDF graphs that are isomorphic to a given query graph and
with high quality scores (considering both consistency and uncertainty). In order to efficiently answer
QA-gMatch queries, we provide two effective pruning methods, namely adaptive label pruning and
quality score pruning, which can greatly filter out false alarms of subgraphs. We also design an
effective index to facilitate our proposed pruning methods, and propose an efficient approach for
processing QA-gMatch queries. Finally, we demonstrate the efficiency and effectiveness of our
proposed approaches through extensive experiments.
ETPL
DM - 017
Quality-Aware Subgraph Matching Over Inconsistent Probabilistic
Graph Databases
In this paper, we propose a semantic-aware blocking framework for entity resolution (ER). The
proposed framework is built using locality-sensitive hashing (LSH) techniques, which efficiently
unifies both textual and semantic features into an ER blocking process. In order to understand how
similarity metrics may affect the effectiveness of ER blocking, we study the robustness of similarity
metrics and their properties in terms of LSH families. Then, we present how the semantic similarity of
records can be captured, measured, and integrated with LSH techniques over multiple similarity spaces.
In doing so, the proposed framework can support efficient similarity searches on records in both textual
and semantic similarity spaces, yielding ER blocking with improved quality. We have evaluated the
proposed framework over two real-world data sets, and compared it with the state-of-the-art blocking
techniques. Our experimental study shows that the combination of semantic similarity and textual
similarity can considerably improve the quality of blocking. Furthermore, due to the probabilistic
nature of LSH, this semantic-aware blocking framework enables us to build fast and reliable blocking
for performing entity resolution tasks in a large-scale data environment.
ETPL
DM - 020
Semantic-Aware Blocking for Entity Resolution
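A common way to build an LSH-based blocking step of the kind described is MinHash with banding: records whose signatures agree on all rows of any band land in the same block and are the only pairs compared during entity resolution. The sketch below covers only the textual similarity space; the token sets, hash choice, and band sizes are illustrative assumptions, and the paper's semantic similarity spaces are not modeled here.

```python
import hashlib

def minhash(tokens, n_hashes=20):
    """MinHash signature of a token set: position h keeps the minimum of a
    salted hash over the tokens, so two sets agree at position h with
    probability equal to their Jaccard similarity."""
    return tuple(
        min(int.from_bytes(hashlib.blake2b(f"{h}:{t}".encode(),
                                           digest_size=8).digest(), "big")
            for t in tokens)
        for h in range(n_hashes))

def lsh_blocks(records, bands=5, rows=4):
    """Banding: records agreeing on all rows of any band share a block."""
    blocks = {}
    for rid, tokens in records.items():
        sig = minhash(tokens, bands * rows)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            blocks.setdefault(key, set()).add(rid)
    # Only blocks with at least two records produce candidate pairs.
    return [ids for ids in blocks.values() if len(ids) > 1]
```

Raising `rows` makes each band stricter (fewer false candidates); raising `bands` gives similar records more chances to collide (fewer missed matches).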
Introducing recent advances in the machine learning techniques to state-of-the-art discrete choice
models, we develop an approach to infer the unique and complex decision making process of a
decision-maker (DM), which is characterized by the DM’s priorities and attitudinal character, along
with the attributes interaction, to name a few. On the basis of exemplary preference information in the
form of pairwise comparisons of alternatives, our method seeks to induce a DM’s preference model in
terms of the parameters of recent discrete choice models. To this end, we reduce our learning function
to a constrained non-linear optimization problem. Our learning approach is a simple one that takes into
consideration the interaction among the attributes along with the priorities and the unique attitudinal
character of a DM. The experimental results on standard benchmark datasets suggest that our approach
is not only intuitively appealing and easily interpretable but also competitive to state-of-the-art
methods.
ETPL
DM - 021
On Learning of Choice Models with Interactive Attributes
In many applications, there is a need to identify to which of a group of sets an element x belongs, if
any. For example, in a router, this functionality can be used to determine the next hop of an incoming
packet. This problem is generally known as set separation and has been widely studied. Most existing
solutions make use of hash-based algorithms, particularly when a small percentage of false positives
is allowed. A known approach is to use a collection of Bloom filters in parallel. Such schemes can
require several memory accesses, a significant limitation for some implementations. We propose an
approach using Block Bloom Filters, where each element is first hashed to a single memory block that
stores a small Bloom filter that tracks the element and the set or sets the element belongs to. In a naïve
solution, when an element x in a set S is stored, it necessarily increases the false positive probability for finding that x is in another set T. In this paper, we introduce our One Memory
Access Set Separation (OMASS) scheme to avoid this problem. OMASS is designed so that for a given element x, the corresponding Bloom filter bits for each set map to different positions in the memory word. This ensures that the false positive rates for the Bloom filters for element x under other sets
are not affected. In addition, OMASS requires fewer hash functions compared to the naïve solution.
ETPL
DM - 022
OMASS: One Memory Access Set Separation
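The key idea above can be sketched as follows: each element hashes to a single memory word, and the word is split into disjoint bit fields, one per set, so inserting x for set S can never set bits in the field that answers queries about x and set T. The word size, field layout, and two hash bits per field below are illustrative assumptions, not OMASS's actual parameters.

```python
import hashlib

WORD_BITS = 64                 # one memory word per element
NUM_SETS = 4                   # sets an element may belong to
FIELD = WORD_BITS // NUM_SETS  # disjoint bit field per set (16 bits)
K = 2                          # hash positions per field

def _h(x, salt, mod):
    """Deterministic salted hash into range(mod)."""
    d = hashlib.blake2b(f"{salt}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") % mod

class OneWordSetSeparator:
    """Sketch of the one-memory-access set separation idea: a tiny Bloom
    filter per set, all packed into the single word an element maps to."""
    def __init__(self, num_words=1024):
        self.words = [0] * num_words
        self.n = num_words

    def add(self, x, set_id):
        w = _h(x, "addr", self.n)             # the one word x lives in
        base = set_id * FIELD                 # this set's private field
        for i in range(K):
            self.words[w] |= 1 << (base + _h(x, f"s{set_id}k{i}", FIELD))

    def query(self, x, set_id):
        w = _h(x, "addr", self.n)
        base = set_id * FIELD
        return all((self.words[w] >> (base + _h(x, f"s{set_id}k{i}", FIELD))) & 1
                   for i in range(K))
```

Both `add` and `query` touch exactly one word of the table, which is the single-memory-access property the scheme is named for.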
Items shared through Social Media may affect more than one user's privacy: for example, photos that depict multiple users, comments that mention multiple users, or events to which multiple users are invited.
The lack of multi-party privacy management support in current mainstream Social Media
infrastructures makes users unable to appropriately control to whom these items are actually shared or
not. Computational mechanisms that are able to merge the privacy preferences of multiple users into a
single policy for an item can help solve this problem. However, merging multiple users’ privacy
preferences is not an easy task, because privacy preferences may conflict, so methods to resolve
conflicts are needed. Moreover, these methods need to consider how users would actually reach an agreement about a solution to the conflict, in order to propose solutions that are acceptable to all of the users affected by the item to be shared. Current approaches are either too demanding or only
consider fixed ways of aggregating privacy preferences. In this paper, we propose the first
computational mechanism to resolve conflicts for multi-party privacy management in Social Media
that is able to adapt to different situations by modelling the concessions that users make to reach a
solution to the conflicts. We also present results of a user study in which our proposed mechanism
outperformed other existing approaches in terms of how many times each approach matched users’
behaviour.
ETPL
DM - 023
Resolving Multi-party Privacy Conflicts in Social Media
Data exchange is the process of generating an instance of a target schema from an instance of a source
schema such that source data is reflected in the target. Generally, data exchange is performed using a schema mapping representing high-level relations between source and target schemas. In this paper,
we argue that data exchange solely based on schema level information limits the ability to express
semantics in data exchange. We show that such schema-level mappings not only may result in entity fragmentation but are also unable to resolve some ambiguous data exchange scenarios. To address this
problem, we propose Scalable Entity Preserving Data Exchange (SEDEX), a hybrid method based on
data and schema mapping that employs similarities between relation trees of source and target relations
to find the best relations that can host source instances. Our experiments show SEDEX outperforms
other methods in terms of quality and scalability of data exchange.
ETPL
DM - 024
SEDEX: Scalable Entity Preserving Data Exchange
Despite recent advances in distributed RDF data management, processing large amounts of RDF data
in the cloud is still very challenging. In spite of its seemingly simple data model, RDF actually encodes
rich and complex graphs mixing both instance and schema-level data. Sharding such data using
classical techniques or partitioning the graph using traditional min-cut algorithms leads to very
inefficient distributed operations and to a high number of joins. In this paper, we describe DiploCloud,
an efficient and scalable distributed RDF data management system for the cloud. Contrary to previous
approaches, DiploCloud runs a physiological analysis of both instance and schema information prior
to partitioning the data. In this paper, we describe the architecture of DiploCloud, its main data
structures, as well as the new algorithms we use to partition and distribute data. We also present an
extensive evaluation of DiploCloud showing that our system is often two orders of magnitude faster
than state-of-the-art systems on standard workloads.
ETPL
DM - 025
DiploCloud: Efficient and Scalable Management of RDF Data in the
Cloud
Rapid advance of location acquisition technologies boosts the generation of trajectory data, which track
the traces of moving objects. A trajectory is typically represented by a sequence of time stamped
geographical locations. A wide spectrum of applications can benefit from trajectory data mining.
Bringing unprecedented opportunities, large-scale trajectory data also pose great challenges. In this
paper, we survey various applications of trajectory data mining, e.g., path discovery, location
prediction, movement behaviour analysis, and so on. Furthermore, this paper reviews an extensive
collection of existing trajectory data mining techniques and discusses them in a framework of trajectory
data mining. This framework and the survey can be used as a guideline for designing future trajectory
data mining solutions.
ETPL
DM - 026
A Survey on Trajectory Data Mining: Techniques and Applications
In this paper, we consider a new insider threat for the privacy preserving work of distributed kernel-
based data mining (DKBDM), such as distributed support vector machine. Among several known data
breaching problems, those associated with insider attacks have been rising significantly, making this
one of the fastest growing types of security breaches. Once considered a negligible concern, insider
attacks have risen to be one of the top three central data violations. Insider-related research involving
the distribution of kernel-based data mining is limited, resulting in substantial vulnerabilities in
designing protection against collaborative organizations. Prior works often fall short by addressing a
multifactorial model that is more limited in scope and implementation than addressing insiders within
an organization colluding with outsiders. A faulty system allows collusion to go unnoticed when an
insider shares data with an outsider, who can then recover the original data from message transmissions
(intermediary kernel values) among organizations. This attack requires only accessibility to a few data
entries within the organizations rather than requiring the encrypted administrative privileges typically found in distributed data mining scenarios. To the best of our knowledge, we are the first to
explore this new insider threat in DKBDM. We also analytically demonstrate the minimum amount of
insider data necessary to launch the insider attack. Finally, we follow up by introducing several
proposed privacy-preserving schemes to counter the described attack.
ETPL
DM - 027
Insider Collusion Attack on Privacy-Preserving Kernel-Based Data
Mining Systems
Frequent sequence mining is a well-known and well-studied problem in data mining. The output of the algorithm is used in many other areas like bioinformatics, chemistry, and market basket analysis. Unfortunately, frequent sequence mining is computationally quite expensive. In this paper, we present a novel parallel algorithm for mining of frequent sequences based on static load-balancing. The static load-balancing is done by measuring the computational time using a probabilistic algorithm. For reasonably sized instances, the algorithm achieves speedups of up to P, where P is the number of processors. In the experimental evaluation, we show that our method performs significantly better than
the current state-of-the-art methods. The presented approach is very universal: it can be used for static
load-balancing of other pattern mining algorithms such as itemset/tree/graph mining algorithms.
ETPL
DM - 028
Probabilistic Static Load-Balancing of Parallel Mining of Frequent
Sequences
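The scheduling step this entry describes, estimating each task's cost and then assigning tasks statically, can be sketched with a greedy longest-processing-time heuristic. The heuristic and the names below are illustrative stand-ins; the paper's estimator is a probabilistic sampling algorithm.

```python
import heapq

def static_assign(tasks, est_cost, p):
    """Greedy LPT: heaviest estimated task first, onto the currently
    least-loaded of p processors (a stand-in for the paper's scheme)."""
    heap = [(0.0, i, []) for i in range(p)]   # (load, proc id, tasks)
    heapq.heapify(heap)
    for t in sorted(tasks, key=est_cost, reverse=True):
        load, i, bucket = heapq.heappop(heap)
        bucket.append(t)
        heapq.heappush(heap, (load + est_cost(t), i, bucket))
    return [b for _, _, b in sorted(heap, key=lambda e: e[1])]
```

With six tasks of costs 6..1 on two processors, the heuristic yields loads of 11 and 10, close to the ideal split of 10.5.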
As more and more applications produce streaming data, clustering data streams has become an
important technique for data and knowledge engineering. A typical approach is to summarize the data
stream in real-time with an online process into a large number of so called micro-clusters. Micro-
clusters represent local density estimates by aggregating the information of many data points in a
defined area. On demand, a (modified) conventional clustering algorithm is used in a second offline
step to recluster the micro-clusters into larger final clusters. For reclustering, the centers of the micro-
clusters are used as pseudo points with the density estimates used as their weights. However,
information about density in the area between micro-clusters is not preserved in the online process and
reclustering is based on possibly inaccurate assumptions about the distribution of data within and
between micro-clusters (e.g., uniform or Gaussian). This paper describes DBSTREAM, the first micro-
cluster-based online clustering component that explicitly captures the density between micro-clusters
via a shared density graph. The density information in this graph is then exploited for reclustering
based on actual density between adjacent micro-clusters. We discuss the space and time complexity of
maintaining the shared density graph. Experiments on a wide range of synthetic and real data sets
highlight that using shared density improves clustering quality over other popular data stream
clustering methods which require the creation of a larger number of smaller micro-clusters to achieve
comparable results.
ETPL
DM - 029
Clustering Data Streams Based on Shared Density between Micro-
Clusters
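A minimal sketch of the shared-density idea: a point falling in the overlap of two micro-clusters increments an edge weight between them, and reclustering joins micro-clusters whose shared density is high relative to their weights. The fixed radius, the threshold, and the absence of decay and center updates are simplifications for illustration, not DBSTREAM itself.

```python
import math
from collections import defaultdict

RADIUS = 1.0   # micro-cluster radius (illustrative assumption)
ALPHA = 0.3    # shared-density merge threshold (illustrative assumption)

centers, weights = [], []
shared = defaultdict(float)          # edge (i, j) -> shared density

def insert_point(p):
    hits = [i for i, c in enumerate(centers) if math.dist(p, c) <= RADIUS]
    if not hits:                     # point starts a new micro-cluster
        centers.append(p)
        weights.append(1.0)
        return
    for i in hits:                   # split the weight among the hits
        weights[i] += 1.0 / len(hits)
    for a in hits:                   # a point in the overlap of two
        for b in hits:               # micro-clusters feeds their edge
            if a < b:
                shared[(a, b)] += 1.0

def merge_pairs():
    # recluster: join micro-clusters whose shared density is a large
    # fraction of the lighter endpoint's weight
    return [(a, b) for (a, b), s in shared.items()
            if s / min(weights[a], weights[b]) >= ALPHA]
```

A point at (0.8, 0) lies within radius of micro-clusters centered at (0, 0) and (1.5, 0), so it feeds their shared edge and they become merge candidates.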
Existing parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic
parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution to
this problem, we design a parallel frequent itemsets mining algorithm called FiDoop using the
MapReduce programming model. To achieve compressed storage and avoid building conditional
pattern bases, FiDoop incorporates the frequent items ultrametric tree, rather than conventional FP
trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial
third MapReduce job, the mappers independently decompose itemsets, and the reducers perform combination operations by constructing small ultrametric trees and then mining these trees separately. We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on the cluster
is sensitive to data distribution and dimensions, because itemsets with different lengths have different
decomposition and construction costs. To improve FiDoop's performance, we develop a workload
balance metric to measure load balance across the cluster's computing nodes. We develop FiDoop-HD,
an extension of FiDoop, to speed up the mining performance for high-dimensional data analysis.
Extensive experiments using real-world celestial spectral data demonstrate that our proposed solution
is efficient and scalable.
ETPL
DM - 030
FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce
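Setting FiDoop's ultrametric-tree machinery aside, the underlying map/reduce counting pattern for frequent itemsets can be sketched in plain Python: the mapper emits k-item subsets of each transaction, and the reducer sums counts and filters by support. This illustrates the programming model only, not FiDoop's actual jobs.

```python
from collections import defaultdict
from itertools import combinations

def mapper(transaction, k):
    # "map": emit every k-item subset of a transaction with count 1
    for itemset in combinations(sorted(transaction), k):
        yield itemset, 1

def reducer(pairs, min_support):
    # "reduce": sum counts per itemset, keep the frequent ones
    counts = defaultdict(int)
    for key, v in pairs:
        counts[key] += v
    return {k: c for k, c in counts.items() if c >= min_support}

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
pairs = [kv for t in transactions for kv in mapper(t, 2)]
frequent = reducer(pairs, min_support=2)
```

In a real Hadoop job the shuffle phase groups the emitted pairs by key between the two functions; here the flat `pairs` list plays that role.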
Mining communities or clusters in networks is valuable in analyzing, designing, and optimizing many
natural and engineering complex systems, e.g. protein networks, power grid, and transportation
systems. Most of the existing techniques view the community mining problem as an optimization
problem based on a given quality function (e.g., modularity), however none of them are grounded with
a systematic theory to identify the central nodes in the network. Moreover, how to reconcile the mining
efficiency and the community quality still remains an open problem. In this paper, we attempt to
address the above challenges by introducing a novel algorithm. First, a kernel function with a tunable
influence factor is proposed to measure the leadership of each node, those nodes with highest local
leadership can be viewed as the candidate central nodes. Then, we use a discrete-time dynamical system
to describe the dynamical assignment of community membership, and formulate several conditions
to guarantee the convergence of each node’s dynamic trajectory, by which the hierarchical community
structure of the network can be revealed. The proposed dynamical system is independent of the quality
function used, so could also be applied in other community mining models. Our algorithm is highly
efficient: the computational complexity analysis shows that the execution time is nearly linearly
dependent on the number of nodes in sparse networks. We finally give demonstrative applications of
the algorithm to a set of synthetic benchmark networks and also real-world networks to verify the
algorithmic performance.
ETPL
DM - 031
Fast and accurate mining the community structure: integrating center
locating and membership optimization
In mobile communication, spatial queries pose a serious threat to user location privacy because the
location of a query may reveal sensitive information about the mobile user. In this paper, we study
approximate k nearest neighbour (kNN) queries where the mobile user queries the location-based
service (LBS) provider about approximate k nearest points of interest (POIs) on the basis of his current
location. We propose a basic solution and a generic solution for the mobile user to preserve his location
and query privacy in approximate kNN queries. The proposed solutions are mainly built on the Paillier
public-key cryptosystem and can provide both location and query privacy. To preserve query privacy,
our basic solution allows the mobile user to retrieve one type of POIs, for example, approximate k
nearest car parks, without revealing to the LBS provider what type of points is retrieved. Our generic
solution can be applied to multiple discrete type attributes of private location-based queries. Compared
with existing solutions for kNN queries with location privacy, our solution is more efficient.
Experiments have shown that our solution is practical for kNN queries.
ETPL
DM - 032
Practical Approximate k Nearest Neighbour Queries with Location and
Query Privacy
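The property such Paillier-based protocols rely on is additive homomorphism: multiplying ciphertexts adds the underlying plaintexts, so a server can combine encrypted values without learning them. A toy sketch with insecure demo primes follows (real deployments use moduli of roughly 2048 bits).

```python
import math
import random

def paillier_keygen(p=61, q=53):
    # toy primes for illustration only; real keys use ~2048-bit primes
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1                           # standard simple generator choice
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)

def enc(pub, m):
    n, g = pub
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:          # r must be a unit mod n
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def dec(pub, priv, c):
    (n, _), (lam, mu) = pub, priv
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

pub, priv = paillier_keygen()
# additive homomorphism: E(a) * E(b) mod n^2 decrypts to a + b
c_sum = enc(pub, 20) * enc(pub, 22) % (pub[0] ** 2)
```

Decrypting `c_sum` recovers 42 even though the two summands were never revealed to whoever multiplied the ciphertexts.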
With advances in geo-positioning technologies and geo-location services, there are a rapidly growing
amount of spatio-textual objects collected in many applications such as location based services and
social networks, in which an object is described by its spatial location and a set of keywords (terms).
Consequently, the study of spatial keyword search which explores both location and textual description
of the objects has attracted great attention from the commercial organizations and research
communities. In this paper, we study two fundamental problems in spatial keyword querying: top k
spatial keyword search (TOPK-SK), and batch top k spatial keyword search (BTOPK-SK). Given a set
of spatio-textual objects, a query location and a set of query keywords, the TOPK-SK retrieves the
closest k objects each of which contains all keywords in the query. BTOPK-SK is the batch processing
of sets of TOPK-SK queries. Based on the inverted index and the linear quadtree, we propose a novel
index structure, called inverted linear quadtree (IL-Quadtree), which is carefully designed to exploit
both spatial and keyword based pruning techniques to effectively reduce the search space. An efficient
algorithm is then developed to tackle top k spatial keyword search. To further enhance the filtering
capability of the signature of linear quadtree, we propose a partition based method. In addition, to deal
with BTOPK-SK, we design a new computing paradigm which partitions the queries into groups based
on both spatial proximity and the textual relevance between queries. We show that the IL-Quadtree
technique can also efficiently support BTOPK-SK. Comprehensive experiments on real and synthetic
data clearly demonstrate the efficiency of our methods.
ETPL
DM - 033
Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search
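The "linear" in linear quadtree refers to serializing quadtree cells along a Z-order (Morton) curve, so the tree can be stored as a sorted list of codes. A minimal bit-interleaving sketch (the bit width is an assumed parameter; the IL-Quadtree additionally attaches inverted keyword lists, which are omitted here):

```python
def interleave(x, y, bits=16):
    """Morton (Z-order) code: interleave the bits of x and y.
    Spatially close cells tend to be close in the resulting key order,
    which is what makes range pruning on a sorted code list effective."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x supplies even bits
        code |= ((y >> i) & 1) << (2 * i + 1)  # y supplies odd bits
    return code
```

For example, the four cells of a 2x2 grid receive codes 0..3 in Z order.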
We propose TrustSVD, a trust-based matrix factorization technique for recommendations. TrustSVD
integrates multiple information sources into the recommendation model in order to reduce the data
sparsity and cold start problems and their degradation of recommendation performance. An analysis of
social trust data from four real-world data sets suggests that not only the explicit but also the implicit
influence of both ratings and trust should be taken into consideration in a recommendation model. TrustSVD therefore builds on top of a state-of-the-art recommendation algorithm, SVD++ (which uses the
explicit and implicit influence of rated items), by further incorporating both the explicit and implicit
influence of trusted and trusting users on the prediction of items for an active user. The proposed
technique is the first to extend SVD++ with social trust information. Experimental results on the four
data sets demonstrate that TrustSVD achieves better accuracy than ten other counterpart
recommendation techniques.
ETPL
DM - 034
A Novel Recommendation Model Regularized with User Trust and Item
Ratings
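A simplified sketch of the kind of prediction rule such a model uses: SVD++'s rating prediction extended with an implicit trust term. The parameters below are random placeholders rather than learned values, and the exact TrustSVD formulation includes weighting and regularization details omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 5, 7, 3

# parameters are random placeholders; in practice they are learned by SGD
mu = 3.5                                  # global rating mean
b_u = rng.normal(0, 0.1, n_users)         # user biases
b_i = rng.normal(0, 0.1, n_items)         # item biases
P = rng.normal(0, 0.1, (n_users, d))      # explicit user factors
Q = rng.normal(0, 0.1, (n_items, d))      # item factors
Y = rng.normal(0, 0.1, (n_items, d))      # implicit rated-item factors (SVD++)
W = rng.normal(0, 0.1, (n_users, d))      # implicit trusted-user factors

def predict(u, i, rated, trusted):
    # SVD++ prediction plus an implicit trust term (simplified):
    # the user's factor vector is augmented by normalized sums over
    # the items u rated and the users u trusts
    imp = Y[rated].sum(0) / np.sqrt(len(rated)) if rated else 0.0
    tru = W[trusted].sum(0) / np.sqrt(len(trusted)) if trusted else 0.0
    return mu + b_u[u] + b_i[i] + Q[i] @ (P[u] + imp + tru)
```

When a user has neither ratings nor trust links, both implicit terms vanish and the prediction falls back to the bias model.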
Although the matrix completion paradigm provides an appealing solution to the collaborative filtering
problem in recommendation systems, some major issues, such as data sparsity and cold-start problems,
still remain open. In particular, when the rating data for a subset of users or items is entirely missing,
commonly known as the cold-start problem, the standard matrix completion methods are inapplicable
due to the non-uniform sampling of available ratings. In recent years, there has been considerable interest
in dealing with cold-start users or items that are principally based on the idea of exploiting other sources
of information to compensate for this lack of rating data. In this paper, we propose a novel and general
algorithmic framework based on matrix completion that simultaneously exploits the similarity
information among users and items to alleviate the cold-start problem. In contrast to existing methods,
our proposed recommender algorithm, dubbed DecRec, decouples the following two aspects of the
cold-start problem to effectively exploit the side information: (i) the completion of a rating sub-matrix,
which is generated by excluding cold-start users/items from the original rating matrix; and (ii) the
transduction of knowledge from existing ratings to cold-start items/users using side information. This
crucial difference prevents the error propagation of completion and transduction, and also significantly
boosts the performance when appropriate side information is incorporated. The recovery error of the
proposed algorithm is analyzed theoretically and, to the best of our knowledge, this is the first algorithm
that addresses the cold-start problem with provable guarantees on performance. Additionally, we also
address the problem where both cold-start user and item challenges are present simultaneously. We
conduct thorough experiments on real datasets that complement our theoretical results. These
experiments demonstrate the effectiveness of the proposed algorithm in handling the cold-start users/items problem and mitigating the data sparsity issue.
ETPL
DM - 036
Cold-Start Recommendation with Provable Guarantees: A Decoupled
Approach
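The transduction step (ii) can be illustrated in miniature: once the warm sub-matrix is handled, a cold item's column is estimated from its side-information similarities to warm items. The data and the similarity-weighted average below are purely illustrative, not DecRec's actual procedure.

```python
import numpy as np

# ratings of 4 users on 3 warm items (illustrative data)
R_warm = np.array([[5, 3, 1],
                   [4, 1, 2],
                   [1, 5, 4],
                   [2, 4, 5]], dtype=float)

# side-information similarity of one cold item to each warm item
sim = np.array([0.8, 0.1, 0.1])

# transduction: the cold item's column is a similarity-weighted
# average of the warm-item columns
cold_col = R_warm @ sim / sim.sum()
```

Because the cold item resembles the first warm item most, its estimated ratings track that item's column closely.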
Updated TOP List of Data Mining IEEE Project DotNet and JAVA 2016-17 for ME/MTech,BE/BTech Final year CSE/IT students


  • 3. Given a point p and a set of points S, the kNN operation finds the k closest points to p in S. It is a computationally intensive task with a large range of applications, such as knowledge discovery and data mining. However, as the volume and the dimension of data increase, only distributed approaches can perform such a costly operation in a reasonable time. Recent works have focused on implementing efficient solutions using the MapReduce programming model, because it is suitable for distributed large-scale data processing. Although these works provide different solutions to the same problem, each one has particular constraints and properties. In this paper, we compare the existing approaches for computing kNN on MapReduce, first theoretically and then through an extensive experimental evaluation. To be able to compare solutions, we identify three generic steps for kNN computation on MapReduce: data pre-processing, data partitioning, and computation. We then analyze each step in terms of load balancing, accuracy, and complexity. Experiments in this paper use a variety of datasets and analyze the impact of data volume, data dimension, and the value of k from many perspectives, such as time and space complexity and accuracy. The experimental part reveals further advantages and shortcomings, which are discussed for each algorithm. To the best of our knowledge, this is the first paper that compares kNN computing methods on MapReduce both theoretically and experimentally under the same setting. Overall, this paper can be used as a guide to tackle kNN-based practical problems in the context of big data.

ETPL DM - 001 K Nearest Neighbour Joins for Big Data on MapReduce: a Theoretical and Experimental Analysis

High utility itemset (HUI) mining is an emerging topic in data mining, which refers to discovering all itemsets having a utility meeting a user-specified minimum utility threshold min_util.
However, setting min_util appropriately is a difficult problem for users. Generally speaking, finding an appropriate minimum utility threshold by trial and error is a tedious process. If min_util is set too low, too many HUIs will be generated, which may make the mining process very inefficient. On the other hand, if min_util is set too high, it is likely that no HUIs will be found. In this paper, we address these issues by proposing a new framework for top-k high utility itemset mining, where k is the desired number of HUIs to be mined. Two types of efficient algorithms, named TKU (mining Top-K Utility itemsets) and TKO (mining Top-K utility itemsets in One phase), are proposed for mining such itemsets without the need to set min_util. We provide a structural comparison of the two algorithms with discussions on their advantages and limitations. Empirical evaluations on both real and synthetic datasets show that the performance of the proposed algorithms is close to that of the optimal case of state-of-the-art utility mining algorithms.

ETPL DM - 002 Efficient Algorithms for Mining Top-K High Utility Itemsets
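The top-k formulation in the abstract above can be illustrated with a brute-force, single-machine sketch (nothing like the efficient TKU/TKO algorithms; the transaction database, quantities, and unit profits below are invented toy data). The utility of an itemset is its quantity-weighted profit summed over the transactions that fully contain it, and the miner simply keeps the k highest-utility itemsets, with no min_util threshold at all:

```python
from itertools import combinations

# toy database: each transaction maps item -> purchase quantity
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
]
profit = {"a": 5, "b": 2, "c": 1}  # external utility (unit profit) per item

def utility(itemset, tx):
    # utility of the itemset in one transaction (0 if not fully contained)
    if not all(i in tx for i in itemset):
        return 0
    return sum(tx[i] * profit[i] for i in itemset)

def top_k_hui(k):
    """Enumerate every itemset, total its utility, keep the k best."""
    items = sorted(profit)
    results = []
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            u = sum(utility(combo, tx) for tx in transactions)
            if u > 0:
                results.append((u, combo))
    return sorted(results, reverse=True)[:k]

print(top_k_hui(2))  # [(15, ('a',)), (11, ('a', 'c'))]
```

The exponential enumeration is exactly what TKU/TKO avoid; the sketch only shows what "top-k by utility" means.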
  • 4. Textual documents created and distributed on the Internet are constantly changing in various forms. Most existing works are devoted to topic modelling and the evolution of individual topics, while sequential relations of topics in successive documents published by a specific user are ignored. In this paper, in order to characterize and detect personalized and abnormal behaviours of Internet users, we propose Sequential Topic Patterns (STPs) and formulate the problem of mining User-aware Rare Sequential Topic Patterns (URSTPs) in document streams on the Internet. These patterns are rare on the whole but relatively frequent for specific users, so they can be applied in many real-life scenarios, such as real-time monitoring of abnormal user behaviours. We present a group of algorithms to solve this mining problem in three phases: pre-processing to extract probabilistic topics and identify sessions for different users, generating all the STP candidates with (expected) support values for each user by pattern-growth, and selecting URSTPs by performing user-aware rarity analysis on the derived STPs. Experiments on both real (Twitter) and synthetic datasets show that our approach can discover special users and interpretable URSTPs effectively and efficiently, which significantly reflect users' characteristics.

ETPL DM - 003 Mining User-Aware Rare Sequential Topic Patterns in Document Streams

Sequence classification is an important task in data mining. We address the problem of sequence classification using rules composed of interesting patterns found in a dataset of labelled sequences and accompanying class labels. We measure the interestingness of a pattern in a given class of sequences by combining the cohesion and the support of the pattern. We use the discovered patterns to generate confident classification rules, and present two different ways of building a classifier.
The first classifier is based on an improved version of an existing method of classification based on association rules, while the second ranks the rules by first measuring their value specific to the new data object. Experimental results show that our rule-based classifiers outperform existing comparable classifiers in terms of accuracy and stability. Additionally, we test a number of pattern-feature-based models that use different kinds of patterns as features to represent each sequence as a feature vector. We then apply a variety of machine learning algorithms to sequence classification, experimentally demonstrating that the patterns we discover represent the sequences well and prove effective for the classification task.

ETPL DM - 004 Pattern Based Sequence Classification
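A minimal version of the rule-generation idea from the sequence-classification abstract above (pattern support within labelled sequences, and a rule's confidence) can be sketched as follows. The sequences, labels, and patterns are invented toy data, and the cohesion component of the interestingness measure is omitted for brevity:

```python
def is_subsequence(pattern, seq):
    """True if pattern occurs in seq as a (non-contiguous) subsequence."""
    it = iter(seq)
    return all(sym in it for sym in pattern)  # 'in' consumes the iterator

# toy labelled sequences: (sequence, class label)
data = [("abcd", "x"), ("abd", "x"), ("cdb", "y"), ("dcb", "y")]

def rule_for(pattern):
    """Build the rule 'pattern -> majority class', with its confidence."""
    matches = [label for seq, label in data if is_subsequence(pattern, seq)]
    if not matches:
        return None
    best = max(set(matches), key=matches.count)
    return best, matches.count(best) / len(matches)

print(rule_for("ab"))  # ('x', 1.0): "ab" occurs only in class-x sequences
print(rule_for("db"))  # ('y', 1.0)
```

A full classifier would mine all sufficiently interesting patterns per class and then rank or chain the resulting rules, as the abstract describes.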
  • 5. We propose an algorithm for detecting patterns exhibited by anomalous clusters in high dimensional discrete data. Unlike most anomaly detection (AD) methods, which detect individual anomalies, our proposed method detects groups (clusters) of anomalies, i.e., sets of points which collectively exhibit abnormal patterns. In many applications this can lead to a better understanding of the nature of the atypical behavior and to identifying the sources of the anomalies. Moreover, we consider the case where the atypical patterns appear in only a small (salient) subset of the very high dimensional feature space. Individual AD techniques, and techniques that detect anomalies using all the features, typically fail to detect such anomalies, but our method can detect such instances collectively, discover the shared anomalous patterns they exhibit, and identify the subsets of salient features. In this paper, we focus on detecting anomalous topics in a batch of text documents, developing our algorithm based on topic models. Results of our experiments show that our method can accurately detect anomalous topics and salient features (words) under each such topic in a synthetic dataset and two real-world text corpora, and achieves better performance compared to both standard group AD and individual AD techniques. All required code to reproduce our experiments is available from https://github.com/hsoleimani/ATD.

ETPL DM - 005 ATD: Anomalous Topic Discovery in High Dimensional Discrete Data

Some important data management and analytics tasks cannot be completely addressed by automated processes. These “computer-hard” tasks, such as entity resolution, sentiment analysis, and image recognition, can be enhanced through the use of human cognitive ability. Human computation is an effective way to address such tasks by harnessing the capabilities of crowd workers (i.e., the crowd). Thus, crowdsourced data management has become an area of increasing interest in research and industry.
There are three important problems in crowdsourced data management. (1) Quality control: workers may return noisy results, and effective techniques are required to achieve high quality. (2) Cost control: the crowd is not free, and cost control aims to reduce the monetary cost. (3) Latency control: human workers can be slow, particularly in contrast to computing time scales, so latency-control techniques are required. There has been significant work addressing these three factors for designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing plans of multiple operators. In this paper, we survey and synthesize a wide spectrum of existing studies on crowdsourced data management. Based on this analysis, we then outline key factors that need to be considered to improve crowdsourced data management.

ETPL DM - 006 Crowdsourced Data Management: A Survey
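The quality-control problem described above is classically attacked by assigning each task to several workers and aggregating their answers. A minimal sketch (the worker answers and reliability weights below are invented; real systems estimate the weights from observed accuracy):

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate noisy worker answers for one task by plain majority."""
    return Counter(answers).most_common(1)[0][0]

def weighted_vote(answers, weights):
    """Same idea, but each vote counts proportionally to an estimated
    per-worker reliability weight."""
    scores = Counter()
    for ans, w in zip(answers, weights):
        scores[ans] += w
    return scores.most_common(1)[0][0]

print(majority_vote(["cat", "cat", "dog"]))       # cat
print(weighted_vote(["cat", "dog"], [0.3, 0.9]))  # dog: the reliable worker wins
```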
  • 6. Since Jeff Howe introduced the term crowdsourcing in 2006, this human-powered problem-solving paradigm has gained a lot of attention and has been a hot research topic in the field of computer science. Even though a lot of work has been conducted on this topic, so far there is no comprehensive survey of the most relevant work in the crowdsourcing field. In this paper, we aim to offer an overall picture of the current state-of-the-art techniques in general-purpose crowdsourcing. According to their focus, we divide this work into three parts: incentive design, task assignment, and quality control. For each part, we start with the different problems faced in that area, followed by a brief description of existing work and a discussion of pros and cons. In addition, we present a real scenario showing how the different techniques are used in implementing a location-based crowdsourcing platform, gMission. Finally, we highlight the limitations of current general-purpose crowdsourcing techniques and present some open problems in this area.

ETPL DM - 007 A Survey of General-Purpose Crowdsourcing Techniques

General health examination is an integral part of healthcare in many countries. Identifying the participants at risk is important for early warning and preventive intervention. The fundamental challenge of learning a classification model for risk prediction lies in the unlabeled data that constitutes the majority of the collected dataset. In particular, the unlabeled data describes the participants in health examinations whose health conditions can vary greatly from healthy to very ill, and there is no ground truth for differentiating their states of health. In this paper, we propose a graph-based, semi-supervised learning algorithm called SHG-Health (Semi-supervised Heterogeneous Graph on Health) for risk predictions to classify a progressively developing situation with the majority of the data unlabeled.
An efficient iterative algorithm is designed, and a proof of convergence is given. Extensive experiments based on both real health examination datasets and synthetic datasets are performed to show the effectiveness and efficiency of our method.

ETPL DM - 008 Mining Health Examination Records — A Graph-based Approach
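SHG-Health itself operates on a heterogeneous graph, but its core semi-supervised mechanism (iteratively spreading the few known labels along graph edges while keeping the labeled nodes clamped) can be sketched on a plain graph. The four-node chain and its labels below are invented toy data, not the paper's setting:

```python
def propagate_labels(adj, labels, iters=50):
    """Iterative label propagation: unlabeled nodes average their
    neighbours' class scores; labeled nodes stay clamped."""
    classes = sorted(set(labels.values()))
    score = {n: {c: float(labels.get(n) == c) for c in classes} for n in adj}
    for _ in range(iters):
        new = {}
        for n, nbrs in adj.items():
            if n in labels:                      # clamp known labels
                new[n] = score[n]
            else:
                new[n] = {c: sum(score[m][c] for m in nbrs) / len(nbrs)
                          for c in classes}
        score = new
    return {n: max(score[n], key=score[n].get) for n in adj}

# toy chain: a(ill) - b - c - d(healthy)
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
pred = propagate_labels(adj, {"a": "ill", "d": "healthy"})
print(pred["b"], pred["c"])  # ill healthy
```

Each unlabeled node ends up with the label of the nearest clamped seed, which is the intuition behind classifying the mostly unlabeled examination records.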
  • 7. Twitter has become one of the largest microblogging platforms for users around the world to share anything happening around them with friends and beyond. A bursty topic in Twitter is one that triggers a surge of relevant tweets within a short period of time, which often reflects important events of mass interest. How to leverage Twitter for early detection of bursty topics has therefore become an important research problem with immense practical value. Despite the wealth of research work on topic modelling and analysis in Twitter, it remains a challenge to detect bursty topics in real time. As existing methods can hardly scale to handle the tweet stream in real time, we propose in this paper TopicSketch, a sketch-based topic model together with a set of techniques to achieve real-time detection. We evaluate our solution on a tweet stream with over 30 million tweets. Our experimental results show both the efficiency and the effectiveness of our approach. In particular, we demonstrate that TopicSketch on a single machine can potentially handle hundreds of millions of tweets per day, which is on the same scale as the total number of daily tweets in Twitter, and present bursty events at finer granularity.

ETPL DM - 009 TopicSketch: Real-time Bursty Topic Detection from Twitter

The development of a topic in a set of topic documents is constituted by a series of person interactions at a specific time and place. Knowing the interactions of the persons mentioned in these documents helps readers better comprehend the documents. In this paper, we propose a topic person interaction detection method called SPIRIT, which classifies the text segments in a set of topic documents that convey person interactions. We design a rich interactive tree structure to represent the syntactic, context, and semantic information of text, and this structure is incorporated into a tree-based convolution kernel to identify interactive segments.
Experimental results based on real-world topics demonstrate that the proposed rich interactive tree structure effectively detects topic person interactions and that our method outperforms many well-known relation extraction and protein-protein interaction methods.

ETPL DM - 010 SPIRIT: A Tree Kernel-based Method for Topic Person Interaction Detection
  • 8. The ubiquity of smartphones has led to the emergence of mobile crowdsourcing tasks such as the detection of spatial events when smartphone users move around in their daily lives. However, the credibility of those detected events can be negatively impacted by unreliable participants with low-quality data. Consequently, a major challenge in mobile crowdsourcing is truth discovery, i.e., discovering true events from diverse and noisy participants' reports. This problem is uniquely distinct from its online counterpart in that it involves uncertainty in both participants' mobility and their reliability. Decoupling these two types of uncertainty through location tracking would raise severe privacy and energy issues, whereas simply ignoring missing reports or treating them as negative reports would significantly degrade the accuracy of truth discovery. In this paper, we propose two new unsupervised models, Truth finder for Spatial Events (TSE) and Personalized Truth finder for Spatial Events (PTSE), to tackle this problem. In TSE, we model location popularity, location visit indicators, truths of events, and three-way participant reliability in a unified framework. In PTSE, we further model personal location visit tendencies. The proposed models are capable of effectively handling various types of uncertainty and automatically discovering truths without any supervision or location tracking. Experimental results on both real-world and synthetic datasets demonstrate that our proposed models outperform existing state-of-the-art truth discovery approaches in the mobile crowdsourcing environment.

ETPL DM - 011 Truth Discovery in Crowdsourced Detection of Spatial Events

Feature selection is a challenging problem for high dimensional data processing, which arises in many real applications such as data mining, information retrieval, and pattern recognition. In this paper, we study the problem of unsupervised feature selection.
The problem is challenging due to the lack of label information to guide feature selection. We formulate the problem of unsupervised feature selection from the viewpoint of graph regularized data reconstruction. The underlying idea is that the selected features should not only preserve the local structure of the original data space via graph regularization, but also approximately reconstruct each data point via linear combination. Therefore, the graph regularized data reconstruction error becomes a natural criterion for measuring the quality of the selected features. By minimizing the reconstruction error, we are able to select the features that best preserve both the similarity and the discriminant information in the original data. We then develop an efficient gradient algorithm to solve the corresponding optimization problem. We evaluate the performance of our proposed algorithm on text clustering. Extensive experiments demonstrate the effectiveness of our proposed approach.

ETPL DM - 012 Graph Regularized Feature Selection with Data Reconstruction
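TSE/PTSE (two abstracts above) additionally model location popularity and visit behaviour, but the underlying truth-discovery loop can be sketched in a generic form: alternate between estimating event truths by reliability-weighted voting and re-estimating each participant's reliability from agreement with the current truths. The reports below are invented toy data:

```python
def truth_discovery(reports, iters=10):
    """reports: {event: {participant: answer}}.  Returns estimated
    truths and per-participant reliability weights."""
    workers = {w for r in reports.values() for w in r}
    weight = dict.fromkeys(workers, 1.0)
    truths = {}
    for _ in range(iters):
        # step 1: weighted vote per event
        for event, r in reports.items():
            scores = {}
            for w, a in r.items():
                scores[a] = scores.get(a, 0.0) + weight[w]
            truths[event] = max(scores, key=scores.get)
        # step 2: reliability = agreement rate with current truths
        for w in workers:
            votes = [(e, r[w]) for e, r in reports.items() if w in r]
            agree = sum(truths[e] == a for e, a in votes)
            weight[w] = agree / len(votes) if votes else 0.5
    return truths, weight

reports = {
    "e1": {"w1": "yes", "w2": "yes", "w3": "no"},
    "e2": {"w1": "no",  "w2": "no",  "w3": "yes"},
    "e3": {"w1": "yes", "w3": "yes"},
}
truths, weight = truth_discovery(reports)
print(truths)
print(weight["w3"] < weight["w1"])  # True: the habitual dissenter is down-weighted
```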
  • 9. The last few years have witnessed the emergence and evolution of a vibrant research stream on a large variety of online social media network (SMN) platforms. Recognizing anonymous, yet identical, users across multiple SMNs is still an intractable problem. Clearly, cross-platform exploration may help solve many problems in social computing, in both theory and applications. Since public profiles can be duplicated and easily impersonated by users with different purposes, most current user identification solutions, which mainly focus on text mining of users' public profiles, are fragile. Some studies have attempted to match users based on the location and timing of user content as well as writing style. However, locations are sparse in the majority of SMNs, and writing style is difficult to discern from the short sentences of leading SMNs such as Sina Microblog and Twitter. Moreover, since online SMNs are quite symmetric, existing user identification schemes based on network structure are not effective. The real-world friend cycle is highly individual, and virtually no two users share a congruent friend cycle. Therefore, it is more accurate to use a friendship structure to analyze cross-platform SMNs. Since identical users tend to set up partially similar friendship structures in different SMNs, we propose the Friend Relationship-Based User Identification (FRUI) algorithm. FRUI calculates a match degree for all candidate User Matched Pairs (UMPs), and only UMPs with top ranks are considered identical users. We also develop two propositions to improve the efficiency of the algorithm. Results of extensive experiments demonstrate that FRUI performs much better than current network structure-based algorithms.
ETPL DM - 013 Cross-Platform Identification of Anonymous Identical Users in Multiple Social Media Networks

Taxonomy learning is an important task for knowledge acquisition, sharing, and classification, as well as application development and utilization in various domains. To reduce the human effort needed to build a taxonomy from scratch and to improve the quality of the learned taxonomy, we propose a new taxonomy learning approach named TaxoFinder. TaxoFinder takes three steps to automatically build a taxonomy. First, it identifies domain-specific concepts from a domain text corpus. Second, it builds a graph representing how such concepts are associated, based on their co-occurrences. As the key method in TaxoFinder, we propose a method for measuring associative strengths among the concepts, which quantify how strongly they are associated in the graph, using similarities between sentences and spatial distances between sentences. Lastly, TaxoFinder induces a taxonomy from the graph using a graph analytic algorithm, aiming to maximize the overall associative strength among the concepts. We evaluate TaxoFinder using gold-standard evaluation on three different domains: emergency management for mass gatherings, autism research, and disease domains. In our evaluation, we compare TaxoFinder with a state-of-the-art subsumption method and show that TaxoFinder is an effective approach that significantly outperforms the subsumption method.

ETPL DM - 014 TaxoFinder: A Graph-Based Approach for Taxonomy Learning
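The second TaxoFinder step above (a concept graph weighted by associative strength) can be approximated with plain sentence co-occurrence counts. The real measure also weights by sentence similarity and spatial distance between sentences, and the corpus and concept list here are invented:

```python
from itertools import combinations
from collections import Counter

def concept_graph(sentences, concepts):
    """Edge weight = number of sentences in which two concepts co-occur."""
    edges = Counter()
    for sent in sentences:
        present = sorted(c for c in concepts if c in sent)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges

corpus = [
    "early diagnosis of autism in children",
    "autism diagnosis relies on behavioural screening",
]
g = concept_graph(corpus, {"autism", "diagnosis", "children", "screening"})
print(g[("autism", "diagnosis")])  # 2: the pair co-occurs in both sentences
```

A taxonomy-induction step would then extract a spanning tree or similar structure that keeps the heaviest edges of this graph.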
  • 10. As more and more applications produce streaming data, clustering data streams has become an important technique for data and knowledge engineering. A typical approach is to summarize the data stream in real time with an online process into a large number of so-called micro-clusters. Micro-clusters represent local density estimates by aggregating the information of many data points in a defined area. On demand, a (modified) conventional clustering algorithm is used in a second offline step to recluster the micro-clusters into larger final clusters. For reclustering, the centers of the micro-clusters are used as pseudo-points, with the density estimates used as their weights. However, information about density in the area between micro-clusters is not preserved in the online process, and reclustering is based on possibly inaccurate assumptions about the distribution of data within and between micro-clusters (e.g., uniform or Gaussian). This paper describes DBSTREAM, the first micro-cluster-based online clustering component that explicitly captures the density between micro-clusters via a shared density graph. The density information in this graph is then exploited for reclustering based on the actual density between adjacent micro-clusters. We discuss the space and time complexity of maintaining the shared density graph. Experiments on a wide range of synthetic and real datasets highlight that using shared density improves clustering quality over other popular data stream clustering methods, which require the creation of a larger number of smaller micro-clusters to achieve comparable results.

ETPL DM - 015 Clustering Data Streams Based on Shared Density between Micro-Clusters

Social media networks are dynamic. As such, the order in which network ties develop is an important aspect of the network dynamics. This study proposes a novel dynamic network model, the Nodal Attribute-based Temporal Exponential Random Graph Model (NATERGM), for dynamic network analysis.
The proposed model focuses on how the nodal attributes of a network affect the order in which network ties develop. Temporal patterns in social media networks are modeled based on the nodal attributes of individuals and the time information of network ties. Using social media data collected from a knowledge sharing community, empirical tests were conducted to evaluate the performance of the NATERGM in identifying temporal patterns and predicting the characteristics of future networks. Results showed that the NATERGM demonstrated an enhanced pattern testing capability and an increased prediction accuracy of network characteristics compared to benchmark models. The proposed NATERGM model helps explain the roles of nodal attributes in the formation process of dynamic networks.

ETPL DM - 016 NATERGM: A Model for Examining the Role of Nodal Attributes in Dynamic Social Media Networks
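The online phase described in the DBSTREAM abstract above maintains micro-clusters incrementally. A minimal sketch of that step (fixed radius, simple weighted-mean centers, with the decay and shared-density-graph machinery omitted) looks like this:

```python
import math

def insert_point(point, clusters, radius=1.0):
    """Absorb a stream point into the nearest micro-cluster within
    `radius`, updating its weighted-mean center; otherwise open a
    new micro-cluster.  clusters: list of (center, weight)."""
    for i, (center, weight) in enumerate(clusters):
        if math.dist(point, center) <= radius:
            w = weight + 1
            center = tuple((c * weight + p) / w for c, p in zip(center, point))
            clusters[i] = (center, w)
            return clusters
    clusters.append((tuple(point), 1))
    return clusters

mcs = []
for p in [(0.0, 0.0), (0.4, 0.0), (5.0, 5.0)]:
    insert_point(p, mcs)
print(mcs)  # two micro-clusters: one around (0.2, 0.0) with weight 2, one at (5.0, 5.0)
```

The offline step would then recluster these (center, weight) pairs; DBSTREAM's contribution is to also record how much density two adjacent micro-clusters share, so that the reclustering step need not guess.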
  • 11. Graph classification aims to learn models to classify structured data. To date, existing graph classification methods are designed to target one single learning task and require a large number of labeled samples for learning good classification models. In reality, each real-world task may only have a limited number of labeled samples, yet multiple similar learning tasks can provide useful knowledge that benefits all tasks as a whole. In this paper, we formulate a new multi-task graph classification (MTG) problem, where multiple graph classification tasks are jointly regularized to find discriminative subgraphs shared by all tasks for learning. The niche of MTG stems from the fact that, with a limited number of training samples, subgraph features selected for one single graph classification task tend to overfit the training data. By using additional tasks as evaluation sets, MTG can jointly regularize multiple tasks to explore high-quality subgraph features for graph classification. To achieve this goal, we formulate an objective function which combines multiple graph classification tasks to evaluate the informativeness score of a subgraph feature. An iterative subgraph feature exploration and multi-task learning process is further proposed to incrementally select subgraph features for graph classification. Experiments on real-world multi-task graph classification datasets demonstrate significant performance gains.

ETPL DM - 018 Joint Structure Feature Exploration and Regularization for Multi-Task Graph Classification

Resource Description Framework (RDF) has been widely used in the Semantic Web to describe resources and their relationships. The RDF graph is one of the most commonly used representations for RDF data.
However, in many real applications such as data extraction/integration, RDF graphs integrated from different data sources may often contain uncertain and inconsistent information (e.g., uncertain labels, or data that violate facts/rules), due to the unreliability of the data sources. In this paper, we formalize such RDF data as inconsistent probabilistic RDF graphs, which contain both inconsistencies and uncertainty. With such a probabilistic graph model, we focus on an important problem, quality-aware subgraph matching over inconsistent probabilistic RDF graphs (QA-gMatch), which retrieves subgraphs from inconsistent probabilistic RDF graphs that are isomorphic to a given query graph and have high quality scores (considering both consistency and uncertainty). In order to efficiently answer QA-gMatch queries, we provide two effective pruning methods, namely adaptive label pruning and quality score pruning, which can greatly filter out false alarms of subgraphs. We also design an effective index to facilitate our proposed pruning methods, and propose an efficient approach for processing QA-gMatch queries. Finally, we demonstrate the efficiency and effectiveness of our proposed approaches through extensive experiments.

ETPL DM - 017 Quality-Aware Subgraph Matching Over Inconsistent Probabilistic Graph Databases
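QA-gMatch's adaptive label pruning and quality-score pruning are specific to probabilistic RDF graphs, but the basic shape of candidate filtering in subgraph matching (discard data vertices whose label or degree cannot possibly host a query vertex) can be sketched as follows, on an invented toy graph:

```python
def prune_candidates(query, data):
    """query/data: {node: (label, set_of_neighbour_nodes)}.
    Keep only data nodes whose label matches the query node's label and
    whose degree is large enough to host it."""
    cand = {}
    for qn, (qlabel, qnbrs) in query.items():
        cand[qn] = {dn for dn, (dlabel, dnbrs) in data.items()
                    if dlabel == qlabel and len(dnbrs) >= len(qnbrs)}
    return cand

query = {"q1": ("A", {"q2"}), "q2": ("B", {"q1"})}
data = {"d1": ("A", {"d2", "d3"}), "d2": ("B", {"d1"}), "d3": ("C", {"d1"})}
print(prune_candidates(query, data))  # {'q1': {'d1'}, 'q2': {'d2'}}
```

An actual matcher would enumerate isomorphic embeddings only over these surviving candidate sets; QA-gMatch additionally prunes by the consistency/uncertainty quality score.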
  • 12. General health examination is an integral part of healthcare in many countries. Identifying the participants at risk is important for early warning and preventive intervention. The fundamental challenge of learning a classification model for risk prediction lies in the unlabeled data that constitutes the majority of the collected dataset. In particular, the unlabeled data describes the participants in health examinations whose health conditions can vary greatly from healthy to very ill, and there is no ground truth for differentiating their states of health. In this paper, we propose a graph-based, semi-supervised learning algorithm called SHG-Health (Semi-supervised Heterogeneous Graph on Health) for risk predictions to classify a progressively developing situation with the majority of the data unlabeled. An efficient iterative algorithm is designed, and a proof of convergence is given. Extensive experiments based on both real health examination datasets and synthetic datasets are performed to show the effectiveness and efficiency of our method.

ETPL DM - 019 Mining Health Examination Records — A Graph-based Approach

In this paper, we propose a semantic-aware blocking framework for entity resolution (ER). The proposed framework is built using locality-sensitive hashing (LSH) techniques, which efficiently unify both textual and semantic features in an ER blocking process. In order to understand how similarity metrics may affect the effectiveness of ER blocking, we study the robustness of similarity metrics and their properties in terms of LSH families. Then, we present how the semantic similarity of records can be captured, measured, and integrated with LSH techniques over multiple similarity spaces. In doing so, the proposed framework can support efficient similarity searches on records in both textual and semantic similarity spaces, yielding ER blocking with improved quality.
We have evaluated the proposed framework on two real-world datasets and compared it with state-of-the-art blocking techniques. Our experimental study shows that the combination of semantic similarity and textual similarity can considerably improve the quality of blocking. Furthermore, due to the probabilistic nature of LSH, this semantic-aware blocking framework enables us to build fast and reliable blocking for entity resolution tasks in large-scale data environments.

ETPL DM - 020 Semantic-Aware Blocking for Entity Resolution
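The LSH machinery this blocking framework builds on can be illustrated with a stdlib-only MinHash-plus-banding sketch over token sets. Real systems use proper hash families and tuned band/row parameters, the records here are invented, and the semantic similarity spaces of the framework are not modelled:

```python
import hashlib

def minhash_signature(tokens, num_hashes=20):
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value over the token set."""
    return [min(hashlib.sha1(f"{i}:{t}".encode()).hexdigest() for t in tokens)
            for i in range(num_hashes)]

def block_keys(tokens, bands=5, rows=4):
    """LSH banding: split the signature into bands; records sharing any
    band key land in the same block and are compared in detail."""
    sig = minhash_signature(tokens, bands * rows)
    return {(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)}

r1 = {"john", "smith", "london"}
r2 = {"john", "smith", "londen"}   # near-duplicate with a typo
r3 = {"mary", "jones", "paris"}
# near-duplicates share a block with high probability; unrelated records do not
print(bool(block_keys(r1) & block_keys(r2)), bool(block_keys(r1) & block_keys(r3)))
```

Because two records share a band key with probability roughly the Jaccard similarity raised to the number of rows, blocking compares only likely matches instead of all record pairs.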
  • 13. Introducing recent advances in machine learning techniques to state-of-the-art discrete choice models, we develop an approach to infer the unique and complex decision-making process of a decision-maker (DM), which is characterized by the DM's priorities and attitudinal character, along with the attribute interactions, among other factors. On the basis of exemplary preference information in the form of pairwise comparisons of alternatives, our method seeks to induce a DM's preference model in terms of the parameters of recent discrete choice models. To this end, we reduce our learning task to a constrained non-linear optimization problem. Our learning approach is a simple one that takes into consideration the interaction among the attributes along with the priorities and the unique attitudinal character of a DM. Experimental results on standard benchmark datasets suggest that our approach is not only intuitively appealing and easily interpretable but also competitive with state-of-the-art methods.

ETPL DM - 021 On Learning of Choice Models with Interactive Attributes

In many applications, there is a need to identify to which of a group of sets an element x belongs, if any. For example, in a router, this functionality can be used to determine the next hop of an incoming packet. This problem is generally known as set separation and has been widely studied. Most existing solutions make use of hash-based algorithms, particularly when a small percentage of false positives is allowed. A known approach is to use a collection of Bloom filters in parallel. Such schemes can require several memory accesses, a significant limitation for some implementations. We propose an approach using block Bloom filters, where each element is first hashed to a single memory block that stores a small Bloom filter tracking the element and the set or sets the element belongs to.
In a naïve solution, when an element $x$ in a set $S$ is stored, it necessarily increases the false positive probability for finding that $x$ is in another set $T$. In this paper, we introduce our One Memory Access Set Separation (OMASS) scheme to avoid this problem. OMASS is designed so that for a given element $x$, the corresponding Bloom filter bits for each set map to different positions in the memory word. This ensures that the false positive rates for the Bloom filters for element $x$ under other sets are not affected. In addition, OMASS requires fewer hash functions compared to the naïve solution. ETPL DM - 022 OMASS: One Memory Access Set Separation
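The OMASS design above can be illustrated with a much-simplified sketch: each element hashes to one block (one memory word), and the bit positions inside that block are derived from the (element, set) pair, so different sets are likely to read different bits for the same element. All parameters here are assumptions for illustration; the real scheme's hash organization differs.

```python
import hashlib

class BlockBloomSets:
    """Simplified block-Bloom-filter set separation: one hash picks the
    block, and per-set hashes pick bit positions within that block, so one
    memory access suffices for a membership test."""

    def __init__(self, num_blocks=64, block_bits=64, bits_per_entry=3):
        self.num_blocks = num_blocks
        self.block_bits = block_bits
        self.k = bits_per_entry
        self.blocks = [0] * num_blocks  # each block models one memory word

    def _block_of(self, x):
        h = int(hashlib.sha256(repr(x).encode()).hexdigest(), 16)
        return h % self.num_blocks

    def _positions(self, x, set_id):
        # derive k bit positions inside the block from the (element, set) pair
        return [
            int(hashlib.sha256(f"{set_id}:{i}:{x!r}".encode()).hexdigest(), 16)
            % self.block_bits
            for i in range(self.k)
        ]

    def add(self, x, set_id):
        b = self._block_of(x)
        for p in self._positions(x, set_id):
            self.blocks[b] |= 1 << p

    def query(self, x, set_id):
        word = self.blocks[self._block_of(x)]  # one memory access
        return all(word >> p & 1 for p in self._positions(x, set_id))

bbs = BlockBloomSets()
bbs.add("flow-42", set_id=1)
hit = bbs.query("flow-42", 1)               # True: the (x, set 1) bits are set
miss = BlockBloomSets().query("flow-42", 1)  # False on an empty filter
```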
  • 14. Items shared through Social Media may affect more than one user's privacy, e.g., photos that depict multiple users, comments that mention multiple users, events to which multiple users are invited, etc. The lack of multi-party privacy management support in current mainstream Social Media infrastructures makes users unable to appropriately control to whom these items are actually shared or not. Computational mechanisms that are able to merge the privacy preferences of multiple users into a single policy for an item can help solve this problem. However, merging multiple users’ privacy preferences is not an easy task, because privacy preferences may conflict, so methods to resolve conflicts are needed. Moreover, these methods need to consider how users would actually reach an agreement about a solution to the conflict in order to propose solutions that can be acceptable to all of the users affected by the item to be shared. Current approaches are either too demanding or only consider fixed ways of aggregating privacy preferences. In this paper, we propose the first computational mechanism to resolve conflicts for multi-party privacy management in Social Media that is able to adapt to different situations by modelling the concessions that users make to reach a solution to the conflicts. We also present results of a user study in which our proposed mechanism outperformed other existing approaches in terms of how many times each approach matched users’ behaviour. ETPL DM - 023 Resolving Multi-party Privacy Conflicts in Social Media Data exchange is the process of generating an instance of a target schema from an instance of a source schema such that source data is reflected in the target. Generally, data exchange is performed using a schema mapping, representing high-level relations between source and target schemas. In this paper, we argue that data exchange solely based on schema-level information limits the ability to express semantics in data exchange. 
We show such schema level mappings not only may result in entity fragmentation, they are unable to resolve some ambiguous data exchange scenarios. To address this problem, we propose Scalable Entity Preserving Data Exchange (SEDEX), a hybrid method based on data and schema mapping that employs similarities between relation trees of source and target relations to find the best relations that can host source instances. Our experiments show SEDEX outperforms other methods in terms of quality and scalability of data exchange. ETPL DM - 024 SEDEX: Scalable Entity Preserving Data Exchange
  • 15. Despite recent advances in distributed RDF data management, processing large amounts of RDF data in the cloud is still very challenging. In spite of its seemingly simple data model, RDF actually encodes rich and complex graphs mixing both instance and schema-level data. Sharding such data using classical techniques or partitioning the graph using traditional min-cut algorithms leads to very inefficient distributed operations and to a high number of joins. In this paper, we describe DiploCloud, an efficient and scalable distributed RDF data management system for the cloud. Contrary to previous approaches, DiploCloud runs a physiological analysis of both instance and schema information prior to partitioning the data. In this paper, we describe the architecture of DiploCloud, its main data structures, as well as the new algorithms we use to partition and distribute data. We also present an extensive evaluation of DiploCloud showing that our system is often two orders of magnitude faster than state-of-the-art systems on standard workloads. ETPL DM - 025 DiploCloud: Efficient and Scalable Management of RDF Data in the Cloud Rapid advance of location acquisition technologies boosts the generation of trajectory data, which track the traces of moving objects. A trajectory is typically represented by a sequence of time-stamped geographical locations. A wide spectrum of applications can benefit from trajectory data mining. Bringing unprecedented opportunities, large-scale trajectory data also pose great challenges. In this paper, we survey various applications of trajectory data mining, e.g., path discovery, location prediction, movement behaviour analysis, and so on. Furthermore, this paper reviews an extensive collection of existing trajectory data mining techniques and discusses them in a framework of trajectory data mining. This framework and the survey can be used as a guideline for designing future trajectory data mining solutions. 
ETPL DM - 026 A Survey on Trajectory Data Mining: Techniques and Applications
  • 16. In this paper, we consider a new insider threat for the privacy-preserving work of distributed kernel-based data mining (DKBDM), such as distributed support vector machines. Among several known data breaching problems, those associated with insider attacks have been rising significantly, making this one of the fastest growing types of security breaches. Once considered a negligible concern, insider attacks have risen to be one of the top three central data violations. Insider-related research on distributed kernel-based data mining is limited, resulting in substantial vulnerabilities in designing protection against collaborative organizations. Prior works often fall short, addressing models that are more limited in scope and implementation than the case of insiders within an organization colluding with outsiders. A faulty system allows collusion to go unnoticed when an insider shares data with an outsider, who can then recover the original data from message transmissions (intermediary kernel values) among organizations. This attack requires only accessibility to a few data entries within the organizations rather than the encrypted administrative privileges typically found in distributed data mining scenarios. To the best of our knowledge, we are the first to explore this new insider threat in DKBDM. We also analytically demonstrate the minimum amount of insider data necessary to launch the insider attack. Finally, we follow up by introducing several proposed privacy-preserving schemes to counter the described attack. ETPL DM - 027 Insider Collusion Attack on Privacy-Preserving Kernel-Based Data Mining Systems Frequent sequence mining is a well-known and well-studied problem in data mining. The output of the algorithm is used in many other areas like bioinformatics, chemistry, and market basket analysis. Unfortunately, frequent sequence mining is computationally quite expensive. 
In this paper, we present a novel parallel algorithm for mining of frequent sequences based on static load-balancing. The static load-balancing is done by measuring the computational time using a probabilistic algorithm. For reasonably sized instances, the algorithm achieves speedups that scale nearly linearly with the number of processors. In the experimental evaluation, we show that our method performs significantly better than the current state-of-the-art methods. The presented approach is quite general: it can be used for static load-balancing of other pattern mining algorithms such as itemset/tree/graph mining algorithms. ETPL DM - 028 Probabilistic Static Load-Balancing of Parallel Mining of Frequent Sequences
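The abstract's load-balancing idea, estimating task costs probabilistically and then partitioning the work statically, can be sketched as follows. This is a generic stand-in (noisy cost probes plus longest-processing-time-first assignment), not the paper's algorithm; the task structure and cost model are assumptions.

```python
import heapq
import random

random.seed(1)

def estimate_cost(task, trials=5):
    """Hypothetical probe: estimate a task's mining cost, e.g. by timing a
    sampled fraction of its search space. Here we just average noisy probes
    around a stand-in 'size' attribute."""
    return sum(task["size"] * random.uniform(0.8, 1.2)
               for _ in range(trials)) / trials

def static_balance(tasks, num_procs):
    """Longest-processing-time-first: sort tasks by estimated cost and always
    hand the next task to the least loaded processor (a min-heap of loads)."""
    heap = [(0.0, p, []) for p in range(num_procs)]
    heapq.heapify(heap)
    for task in sorted(tasks, key=lambda t: -t["est"]):
        load, p, assigned = heapq.heappop(heap)
        assigned.append(task["id"])
        heapq.heappush(heap, (load + task["est"], p, assigned))
    return {p: assigned for _, p, assigned in heap}

tasks = [{"id": t, "size": s} for t, s in enumerate([40, 30, 22, 18, 10, 9, 5])]
for t in tasks:
    t["est"] = estimate_cost(t)
assignment = static_balance(tasks, num_procs=3)
```

Because the assignment is fixed before mining starts, no runtime work stealing is needed; accuracy of the probes is what bounds the load imbalance.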
  • 17. As more and more applications produce streaming data, clustering data streams has become an important technique for data and knowledge engineering. A typical approach is to summarize the data stream in real-time with an online process into a large number of so-called micro-clusters. Micro-clusters represent local density estimates by aggregating the information of many data points in a defined area. On demand, a (modified) conventional clustering algorithm is used in a second offline step to recluster the micro-clusters into larger final clusters. For reclustering, the centers of the micro-clusters are used as pseudo points with the density estimates used as their weights. However, information about density in the area between micro-clusters is not preserved in the online process and reclustering is based on possibly inaccurate assumptions about the distribution of data within and between micro-clusters (e.g., uniform or Gaussian). This paper describes DBSTREAM, the first micro-cluster-based online clustering component that explicitly captures the density between micro-clusters via a shared density graph. The density information in this graph is then exploited for reclustering based on actual density between adjacent micro-clusters. We discuss the space and time complexity of maintaining the shared density graph. Experiments on a wide range of synthetic and real data sets highlight that using shared density improves clustering quality over other popular data stream clustering methods which require the creation of a larger number of smaller micro-clusters to achieve comparable results. ETPL DM - 029 Clustering Data Streams Based on Shared Density between Micro-Clusters Existing parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic parallelization, load balancing, data distribution, and fault tolerance on large clusters. 
As a solution to this problem, we design a parallel frequent itemsets mining algorithm called FiDoop using the MapReduce programming model. To achieve compressed storage and avoid building conditional pattern bases, FiDoop incorporates the frequent items ultrametric tree, rather than conventional FP trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial third MapReduce job, the mappers independently decompose itemsets, and the reducers perform combination operations by constructing small ultrametric trees and then mine these trees separately. We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on the cluster is sensitive to data distribution and dimensions, because itemsets with different lengths have different decomposition and construction costs. To improve FiDoop's performance, we develop a workload balance metric to measure load balance across the cluster's computing nodes. We develop FiDoop-HD, an extension of FiDoop, to speed up the mining performance for high-dimensional data analysis. Extensive experiments using real-world celestial spectral data demonstrate that our proposed solution is efficient and scalable. ETPL DM - 030 FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce
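As a rough picture of the MapReduce pattern FiDoop builds on, the sketch below simulates one job in memory and uses it to count item frequencies, roughly the role of a first job before any ultrametric-tree construction. The helper, data, and threshold are illustrative assumptions, not FiDoop's implementation.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job: map each record to
    (key, value) pairs, group by key, then reduce each group."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    return {key: reducer(key, vals) for key, vals in groups.items()}

# hypothetical transaction database
transactions = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "c"]]

# Job sketch: emit (item, 1) per occurrence, sum counts per item
item_counts = run_mapreduce(
    transactions,
    mapper=lambda txn: [(item, 1) for item in txn],
    reducer=lambda item, ones: sum(ones),
)

min_support = 2
frequent_items = {i for i, c in item_counts.items() if c >= min_support}
```

Subsequent jobs would then decompose itemsets and build the small trees per reducer, following the division of labor the abstract describes.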
  • 18. Mining communities or clusters in networks is valuable in analyzing, designing, and optimizing many natural and engineering complex systems, e.g., protein networks, power grids, and transportation systems. Most of the existing techniques view the community mining problem as an optimization problem based on a given quality function (e.g., modularity); however, none of them is grounded with a systematic theory to identify the central nodes in the network. Moreover, how to reconcile the mining efficiency and the community quality still remains an open problem. In this paper, we attempt to address the above challenges by introducing a novel algorithm. First, a kernel function with a tunable influence factor is proposed to measure the leadership of each node; the nodes with the highest local leadership can be viewed as the candidate central nodes. Then, we use a discrete-time dynamical system to describe the dynamical assignment of community membership, and formulate several conditions to guarantee the convergence of each node’s dynamic trajectory, by which the hierarchical community structure of the network can be revealed. The proposed dynamical system is independent of the quality function used, so it could also be applied in other community mining models. Our algorithm is highly efficient: the computational complexity analysis shows that the execution time is nearly linearly dependent on the number of nodes in sparse networks. We finally give demonstrative applications of the algorithm to a set of synthetic benchmark networks and also real-world networks to verify the algorithmic performance. ETPL DM - 031 Fast and accurate mining the community structure: integrating center locating and membership optimization In mobile communication, spatial queries pose a serious threat to user location privacy because the location of a query may reveal sensitive information about the mobile user. 
In this paper, we study approximate k nearest neighbour (kNN) queries where the mobile user queries the location-based service (LBS) provider about approximate k nearest points of interest (POIs) on the basis of his current location. We propose a basic solution and a generic solution for the mobile user to preserve his location and query privacy in approximate kNN queries. The proposed solutions are mainly built on the Paillier public-key cryptosystem and can provide both location and query privacy. To preserve query privacy, our basic solution allows the mobile user to retrieve one type of POIs, for example, approximate k nearest car parks, without revealing to the LBS provider what type of points is retrieved. Our generic solution can be applied to multiple discrete type attributes of private location-based queries. Compared with existing solutions for kNN queries with location privacy, our solution is more efficient. Experiments have shown that our solution is practical for kNN queries. ETPL DM - 032 Practical Approximate k Nearest Neighbour Queries with Location and Query Privacy
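The solutions above are built on the Paillier public-key cryptosystem, whose additive homomorphism is what lets an LBS provider combine encrypted values without learning them. The toy sketch below shows only that property, with deliberately tiny primes chosen for illustration; it is not the paper's protocol, and real deployments use moduli of 2048 bits or more.

```python
import math
import random

def paillier_keygen(p=20063, q=20359):
    """Toy key sizes for illustration only. With g = n + 1, the decryption
    constant simplifies to mu = lambda^{-1} mod n."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)
    return (n,), (n, lam, mu)

def paillier_encrypt(pub, m):
    (n,) = pub
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:  # r must be a unit mod n
        r = random.randrange(1, n)
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def paillier_decrypt(priv, c):
    n, lam, mu = priv
    n2 = n * n
    L = (pow(c, lam, n2) - 1) // n
    return L * mu % n

pub, priv = paillier_keygen()
c1, c2 = paillier_encrypt(pub, 123), paillier_encrypt(pub, 456)
# additive homomorphism: multiplying ciphertexts adds the plaintexts,
# which is what allows computing on encrypted query data
total = paillier_decrypt(priv, c1 * c2 % (pub[0] ** 2))
```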
  • 19. With advances in geo-positioning technologies and geo-location services, there is a rapidly growing amount of spatio-textual objects collected in many applications such as location based services and social networks, in which an object is described by its spatial location and a set of keywords (terms). Consequently, the study of spatial keyword search which explores both location and textual description of the objects has attracted great attention from the commercial organizations and research communities. In this paper, we study two fundamental problems in the spatial keyword queries: top k spatial keyword search (TOPK-SK), and batch top k spatial keyword search (BTOPK-SK). Given a set of spatio-textual objects, a query location and a set of query keywords, the TOPK-SK retrieves the closest k objects each of which contains all keywords in the query. BTOPK-SK is the batch processing of sets of TOPK-SK queries. Based on the inverted index and the linear quadtree, we propose a novel index structure, called inverted linear quadtree (IL-Quadtree), which is carefully designed to exploit both spatial and keyword based pruning techniques to effectively reduce the search space. An efficient algorithm is then developed to tackle top k spatial keyword search. To further enhance the filtering capability of the signature of the linear quadtree, we propose a partition based method. In addition, to deal with BTOPK-SK, we design a new computing paradigm which partitions the queries into groups based on both spatial proximity and the textual relevance between queries. We show that the IL-Quadtree technique can also efficiently support BTOPK-SK. Comprehensive experiments on real and synthetic data clearly demonstrate the efficiency of our methods. ETPL DM - 033 Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search We propose TrustSVD, a trust-based matrix factorization technique for recommendations. 
TrustSVD integrates multiple information sources into the recommendation model in order to reduce the data sparsity and cold start problems and their degradation of recommendation performance. An analysis of social trust data from four real-world data sets suggests that not only the explicit but also the implicit influence of both ratings and trust should be taken into consideration in a recommendation model. TrustSVD therefore builds on top of a state-of-the-art recommendation algorithm, SVD++ (which uses the explicit and implicit influence of rated items), by further incorporating both the explicit and implicit influence of trusted and trusting users on the prediction of items for an active user. The proposed technique is the first to extend SVD++ with social trust information. Experimental results on the four data sets demonstrate that TrustSVD achieves better accuracy than ten other counterpart recommendation techniques. ETPL DM - 034 A Novel Recommendation Model Regularized with User Trust and Item Ratings
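TrustSVD's full model extends SVD++; as a heavily simplified stand-in, the sketch below trains a plain biased matrix factorization by SGD and adds one extra term that pulls a user's latent factors toward those of the users they trust. The toy ratings, trust links, and hyperparameters are assumptions, not the paper's setup.

```python
import random

random.seed(7)

# toy data: (user, item, rating) triples and a user -> trusted-users map
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 2.0), (2, 2, 4.0)]
trust = {0: [1], 1: [0], 2: []}
n_users, n_items, k = 3, 3, 4

P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]  # users
Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]  # items
mu = sum(r for _, _, r in ratings) / len(ratings)  # global rating mean

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgd_epoch(lr=0.02, reg=0.05, reg_t=0.05):
    """One SGD pass on squared error, plus a trust term nudging each user's
    factors toward the mean factors of trusted users (a crude stand-in for
    TrustSVD's trust-influence terms)."""
    loss = 0.0
    for u, i, r in ratings:
        err = r - (mu + dot(P[u], Q[i]))
        loss += err * err
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)
        if trust[u]:
            for f in range(k):
                avg = sum(P[v][f] for v in trust[u]) / len(trust[u])
                P[u][f] += lr * reg_t * (avg - P[u][f])
    return loss

first = sgd_epoch()
for _ in range(300):
    last = sgd_epoch()
```

The trust term acts as a social regularizer: users with sparse rating histories inherit structure from whom they trust, which is the intuition behind using trust to fight cold start.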
  • 20. Although the matrix completion paradigm provides an appealing solution to the collaborative filtering problem in recommendation systems, some major issues, such as data sparsity and cold-start problems, still remain open. In particular, when the rating data for a subset of users or items is entirely missing, commonly known as the cold-start problem, the standard matrix completion methods are inapplicable due to the non-uniform sampling of available ratings. In recent years, there has been considerable interest in approaches for dealing with cold-start users or items, principally based on the idea of exploiting other sources of information to compensate for this lack of rating data. In this paper, we propose a novel and general algorithmic framework based on matrix completion that simultaneously exploits the similarity information among users and items to alleviate the cold-start problem. 
In contrast to existing methods, our proposed recommender algorithm, dubbed DecRec, decouples the following two aspects of the cold-start problem to effectively exploit the side information: (i) the completion of a rating sub-matrix, which is generated by excluding cold-start users/items from the original rating matrix; and (ii) the transduction of knowledge from existing ratings to cold-start items/users using side information. This crucial difference prevents the error propagation of completion and transduction, and also significantly boosts the performance when appropriate side information is incorporated. The recovery error of the proposed algorithm is analyzed theoretically and, to the best of our knowledge, this is the first algorithm that addresses the cold-start problem with provable guarantees on performance. Additionally, we also address the problem where both cold-start user and item challenges are present simultaneously. We conduct thorough experiments on real datasets that complement our theoretical results. These experiments demonstrate the effectiveness of the proposed algorithm in handling the cold-start users/items problem and mitigating the data sparsity issue. ETPL DM - 036 Cold-Start Recommendation with Provable Guarantees: A Decoupled Approach
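DecRec's decoupling can be illustrated with a deliberately simple stand-in: first "complete" the warm sub-matrix (here by column-mean imputation rather than true matrix completion), then transduce a cold-start user's row from side-information similarities. The data and similarity values below are invented for illustration only.

```python
def complete_submatrix(R):
    """Stand-in for step (i), completing the warm sub-matrix: fill each
    missing entry (None) with that item's column mean."""
    n, m = len(R), len(R[0])
    filled = [row[:] for row in R]
    for j in range(m):
        col = [R[i][j] for i in range(n) if R[i][j] is not None]
        mean = sum(col) / len(col)
        for i in range(n):
            if filled[i][j] is None:
                filled[i][j] = mean
    return filled

def transduce_cold_user(completed, side_sim):
    """Stand-in for step (ii): predict a cold-start user's row as a
    side-information-weighted average of the warm users' completed rows."""
    m = len(completed[0])
    total = sum(side_sim)
    return [
        sum(side_sim[i] * completed[i][j] for i in range(len(completed))) / total
        for j in range(m)
    ]

warm = [
    [5.0, None, 1.0],
    [4.0, 2.0, None],
    [None, 1.0, 2.0],
]
completed = complete_submatrix(warm)
# hypothetical similarities of the cold user to the three warm users
cold_row = transduce_cold_user(completed, side_sim=[0.7, 0.2, 0.1])
```

Because completion never sees the cold user and transduction never alters the completed entries, errors in one step do not feed back into the other, which is the decoupling the abstract emphasizes.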