IJSRD - International Journal for Scientific Research & Development | Vol. 2, Issue 07, 2014 | ISSN (online): 2321-0613
All rights reserved by www.ijsrd.com 511
Data Mining in Multi-Instance and Multi-Represented Objects
Ajay Aggarwal1 Mohammad Danish2
1Research Scholar 2Assistant Professor
1,2Department of Computer Science & Engineering
1,2Al-Falah School of Engineering & Technology, Dhoj, Faridabad, Haryana
Abstract: In multi-instance learning, the training set comprises labeled bags that are composed of unlabeled instances, and the task is to predict the labels of unseen bags. In this paper, a web mining problem, web index recommendation, is investigated from a multi-instance view. In detail, each web index page is regarded as a bag, while each of its linked pages is regarded as an instance. A user favoring an index page means that he or she is interested in at least one page linked by the index.
Keywords: Instances, Labeled Bags, Multi-Instance View, Web Index
I. INTRODUCTION
Most drugs are small molecules that work by binding to larger protein molecules such as enzymes and cell-surface receptors. For a molecule qualified to make a drug, at least one of its low-energy shapes can tightly bind to the target area, whereas for a molecule unqualified to make a drug, none of its low-energy shapes can tightly bind to the target area. The main difficulty of drug activity prediction is that each molecule may have many alternative low-energy shapes, but biochemists only know whether a molecule is qualified to make a drug, not which of its alternative low-energy shapes is responsible for the qualification.[3]
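This setting can be captured by a small data structure: a bag is a set of instances (here, a molecule's alternative shapes) carrying a single label, which under the standard multi-instance assumption is positive iff at least one instance is positive. The sketch below illustrates only that assumption; the `is_active` predicate and its threshold are hypothetical stand-ins for a real binding test.

```python
# Minimal sketch of the standard multi-instance assumption: a bag
# (e.g. a molecule with several low-energy shapes) is positive iff
# at least one of its instances satisfies an instance-level concept.
# `is_active` and its 0.8 threshold are hypothetical stand-ins.

def is_active(shape):
    # Hypothetical instance-level concept: a shape "binds" if its
    # (made-up) binding score exceeds a threshold.
    return shape["binding_score"] > 0.8

def bag_label(bag):
    """A bag is positive iff at least one instance is positive."""
    return any(is_active(instance) for instance in bag)

molecule = [  # one bag: alternative low-energy shapes of one molecule
    {"binding_score": 0.3},
    {"binding_score": 0.9},   # this shape binds, so the molecule qualifies
]
print(bag_label(molecule))  # True
```

This mirrors the drug-activity ambiguity: the bag label is observable, but which instance caused it is not.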
II. WEB INDEX RECOMMENDATION
There are diverse web pages on the Internet, among which some contain plentiful information yet themselves only provide titles or brief summaries, leaving the detailed presentation to their linked pages. Such pages are called web index pages. For example, the entrance of NBA at Yahoo! (sports.yahoo.com/nba/) is a web index page.
Fig. 1: The web index page is regarded as a bag, while its linked pages are regarded as the instances in the bag.

This problem can be viewed as a multi-instance problem. The goal is to label unseen web index pages as positive or negative. A positive web index page is one where the user is interested in at least one of its linked pages; a negative web index page is one where none of its linked pages interests the user. Thus, each index page can be regarded as a bag, while its linked pages can be regarded as the instances in the bag. For illustration, Fig. 1 shows a bag and two of its instances. [3]
To simplify the analysis, this work focuses on the hypertext information on the pages while neglecting other hypermedia such as images, audio, and video. Each instance can then be represented by a term vector T = [t1, t2, ..., tn], where ti (i = 1, 2, ..., n) is one of the n most frequent terms appearing in the corresponding linked page. T can be obtained by pre-accessing the linked page and then counting the occurrences of different terms. Note that trivial terms such as "a", "the", and "is" are neglected in this process. In this way, all the pages are described by the same number of frequent terms, i.e. all term vectors have the same length.
However, term vectors corresponding to different instances, even though their lengths are the same, may have quite different components. [10] Moreover, since different web index pages may contain different numbers of links, the number of instances may differ from bag to bag. Thus, a web index page linking to m pages, i.e. a bag containing m instances, can be represented as {[t11, t12, ..., t1n], [t21, t22, ..., t2n], ..., [tm1, tm2, ..., tmn]}. The label of the bag is positive if the web index page interests the user; otherwise the label is negative.
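The bag construction just described can be sketched in a few lines. The stop-word list and the choice of n below are illustrative assumptions; the text does not fix them.

```python
# Sketch of turning a web index page into a bag: each linked page's
# text becomes a term vector of its n most frequent non-trivial terms.
# The stop-word list and n = 3 are illustrative choices, not the
# paper's actual settings.
from collections import Counter

STOP_WORDS = {"a", "the", "is", "of", "and"}

def term_vector(page_text, n=3):
    """The n most frequent terms of a page, trivial terms removed."""
    terms = [t for t in page_text.lower().split() if t not in STOP_WORDS]
    return [term for term, _ in Counter(terms).most_common(n)]

def make_bag(linked_pages, n=3):
    """A web index page linking to m pages becomes a bag of m term vectors."""
    return [term_vector(text, n) for text in linked_pages]

bag = make_bag([
    "the lakers beat the celtics lakers lakers win",
    "draft news draft picks and trade news news",
])
print(bag)
```

All term vectors have the same length n, while the number of vectors per bag varies with the number of links, matching the representation above.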
Note that web index pages may contain many links to advertisements or other index pages, which can confound the analysis. Here it is required that, for a linked page to be considered as an instance in a bag, its corresponding link in the index page must contain at least four terms. Surprisingly, such a simple strategy removes most useless links.[2]
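The four-term constraint can be sketched as a simple filter. The threshold of four terms comes from the text above; the (url, anchor_text) pair representation of a link is an assumption for illustration.

```python
# Sketch of the link-filtering constraint: a linked page is kept as an
# instance only if its link text in the index page has at least 4 terms.
# The (url, anchor_text) pair representation is an assumption here.

def keep_link(anchor_text, min_terms=4):
    return len(anchor_text.split()) >= min_terms

links = [
    ("/ads/banner", "click here"),                      # 2 terms: dropped
    ("/nba/recap", "full recap of last nights game"),   # 6 terms: kept
    ("/index2", "more"),                                # 1 term: dropped
]
instances = [url for url, text in links if keep_link(text)]
print(instances)  # ['/nba/recap']
```

Short advertisement and navigation links rarely carry four or more descriptive terms, which is why this crude filter works as well as the text reports.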
III. COMPARING TXT-KNN AND CIT-KNN WITH FRETCIT-KNN
First, experiments are performed to evaluate the performance of Fretcit-kNN on the web index recommendation problem. Since Fretcit-kNN is an extended kNN algorithm that considers the characteristics of multi-instance problems, two extended kNN algorithms that do not consider these characteristics are also evaluated for comparison.[8,9]
The first compared algorithm is obtained by adapting the standard kNN algorithm to textual objects. Recall that the standard kNN algorithm uses Euclidean distance to measure the distance between examples, which prevents it from being applied to objects described by textual frequent terms. However, if the distance metric is replaced by fret-minH(.), the modified algorithm can easily be applied to textual objects. Here the modified algorithm is called Txt-kNN.[3,6]
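The adaptation behind Txt-kNN can be sketched as a kNN classifier whose distance function is a parameter. The fret-minH(.) metric itself is not defined in this excerpt, so the sketch below substitutes an assumed placeholder set distance (one minus the Jaccard overlap of the frequent-term sets) purely to show how swapping the metric adapts kNN to textual objects.

```python
# Sketch of a kNN classifier with a pluggable distance, as in Txt-kNN.
# fret_minh below is NOT the paper's metric: it is a placeholder set
# distance (1 - Jaccard overlap of frequent-term sets) standing in for
# the real fret-minH(.), which this excerpt does not define.

def fret_minh(terms_a, terms_b):
    a, b = set(terms_a), set(terms_b)
    return 1.0 - len(a & b) / len(a | b)

def knn_predict(train, query_terms, k=3, dist=fret_minh):
    """Majority vote among the k training examples nearest to the query."""
    nearest = sorted(train, key=lambda ex: dist(ex[0], query_terms))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

train = [
    (["nba", "score", "game"], "sport"),
    (["match", "goal", "score"], "sport"),
    (["stocks", "market", "trade"], "finance"),
]
print(knn_predict(train, ["game", "score", "team"], k=1))  # sport
```

Because only the distance function changes, the rest of the kNN machinery (neighbor search, majority voting) carries over unchanged, which is exactly what makes the Txt-kNN adaptation straightforward.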
This section presents a new data mining problem called multi-instance outlier identification. This problem arises in tasks where each sample consists of many alternative feature vectors (instances) that describe it. Multi-instance outliers are defined and their basic types are analyzed. Two general identification approaches are proposed based on the state-of-the-art (single-instance) outlier detector LOF (local outlier factor).[8] One approach utilizes the underlying mechanism of the kernel method and plugs the set distance into LOF to detect the multi-instance outliers. The other approach takes each instance's neighborhood into account. Based on the two approaches, four concrete multi-instance outlier detectors are then introduced.
In clustering schemes, data objects are usually represented as vectors of feature-value pairs. Features represent certain attributes of the objects that are known to be useful for the clustering task. Attributes that are not relevant to forming structures out of the data can lead to inaccurate results. Attributes can be numeric and non-numeric, thus forming a mixed-mode data representation.[2]
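The first approach, plugging a set distance into LOF, can be sketched as follows. The choice of set distance (the smallest pairwise Euclidean distance between the two bags' instances) and the value of k are illustrative assumptions; the paper's four concrete detectors are not specified in this excerpt.

```python
# Sketch of the "set distance + LOF" approach to multi-instance outlier
# detection: LOF is computed over bags, using a set distance in place of
# a vector distance. The minimal pairwise set distance and k = 2 are
# assumptions for illustration, not the paper's concrete detectors.
import math

def set_distance(bag_a, bag_b):
    """Smallest pairwise Euclidean distance between the bags' instances."""
    return min(math.dist(x, y) for x in bag_a for y in bag_b)

def lof_scores(bags, k=2):
    """Local Outlier Factor over bags, with the set distance plugged in."""
    n = len(bags)
    d = [[set_distance(bags[i], bags[j]) for j in range(n)] for i in range(n)]
    # k nearest neighbors of each bag (excluding itself)
    knn = [sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])[:k]
           for i in range(n)]
    kdist = [d[i][knn[i][-1]] for i in range(n)]           # k-distance
    def reach(i, j):                                        # reachability distance
        return max(kdist[j], d[i][j])
    # local reachability density, then LOF = neighbors' density / own density
    lrd = [k / sum(reach(i, j) for j in knn[i]) for i in range(n)]
    return [sum(lrd[j] for j in knn[i]) / (k * lrd[i]) for i in range(n)]

bags = [
    [(0.0, 0.0), (0.1, 0.1)],
    [(0.1, 0.0), (0.0, 0.2)],
    [(0.2, 0.1), (0.1, 0.2)],
    [(5.0, 5.0), (5.1, 4.9)],   # far from the others: a multi-instance outlier
]
scores = lof_scores(bags)
print(max(range(len(bags)), key=lambda i: scores[i]))  # 3
```

Bags in dense regions get LOF scores near 1, while the isolated bag's score is much larger, so ranking by LOF surfaces the multi-instance outlier without ever inspecting individual instances directly.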
Conceptual clustering is one of the algorithms that can deal with mixed-mode data. However, conceptual clustering has primarily focused on attributes described by nominal values. The best way to combine numeric, ordinal, and nominal-valued data is still an open question.[1] If a convention is adopted for the ordering of the attributes in a given problem context, instances of data can be represented as feature vectors consisting of the attribute values only, where the attribute names themselves are implicitly known by their order. Sison and Shimura proposed a relational description model for clustering data, as opposed to the usual propositional attribute-value pair representation. Usually attributes are single-valued, but sometimes they can be multi-valued, as in the document clustering problem at hand. In this case a convention has to be adopted to deal with multi-valued attributes depending on the problem context. [7]
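The attribute-ordering convention can be sketched as follows: given a fixed attribute order, a record of attribute-value pairs collapses to a bare value vector. The schema and the handling of multi-valued attributes (here, collecting the values into a sorted tuple) are assumed conventions for illustration, not ones prescribed by the text.

```python
# Sketch of the attribute-ordering convention: with a fixed attribute
# order, a record of attribute-value pairs becomes a bare value vector.
# The schema below is hypothetical, and representing a multi-valued
# attribute as a sorted tuple is an assumed convention.

ATTRIBUTE_ORDER = ["title", "year", "keywords"]  # hypothetical schema

def to_feature_vector(record):
    vector = []
    for attr in ATTRIBUTE_ORDER:
        value = record[attr]
        if isinstance(value, (list, set)):      # multi-valued attribute
            value = tuple(sorted(value))
        vector.append(value)
    return vector

doc = {"title": "clustering survey", "year": 2004,
       "keywords": {"clustering", "survey"}}
print(to_feature_vector(doc))
```

Because the attribute names are implicit in the agreed order, two such vectors are directly comparable position by position, which is what vector-based clustering algorithms require.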
IV. INTEGRATED DATABASE SYSTEM
A. Architectural design
The core architecture of our data integration and visualisation system, called megNet, is composed of three layers: back-end, middle tier and front-end (Fig. 2). The data, schema maps and ontology definitions constitute the back-end layer. Most of our local data are represented in XML or RDF formats. The data are stored using the XML data management system Tamino XML Server (Software AG) in a Red Hat Linux Advanced Server v3.0 environment. The databases are queried using Tamino X-Query, which is based on the XPath 1.0 specification. The queries are enabled through the Tamino Java API. For storing more voluminous data, such as gene expression data and in-house produced mass spectrometry data, we use an Oracle 10g database server (Oracle, Inc.). The Oracle queries are performed using Oracle JDBC Thin drivers. The results obtained from queries to Tamino and Oracle are combined at the Java programming level in the middle tier. The middle tier comprises the business logic of our system. Business logic events, such as graph constructions, distance data projections and topology calculations, are implemented as stateless session beans. They are processed as web services. The session beans are the end points of the web services. They receive their request messages from the client for performing a business logic event, and at the end of their life cycle they send the response to the client.
Fig. 2: Three-tier architecture of the system
The middle tier resides physically in a JBoss 4.04 Application Server (JBoss, Inc.). The business logic events are processed in the EJB container of JBoss. The client and server communicate through SOAP messages. The SOAP messages are converted to Java objects by the middle tier after it has received a request message from the front-end client, and Java objects are converted back to SOAP messages before they are sent as a response message. These conversions are implemented using Apache Axis 1.4 (Apache Software Foundation).[11] They are processed in an Apache Tomcat 5.5 servlet container. The front-end comprises the user interface for visualising and interacting with the end user. It is implemented in the Java environment.
REFERENCES
[1] D.W. Aha. Lazy learning: special issue editorial. Artificial Intelligence Review, vol.11, no.1-5, pp.7-10, 1997.
[2] R.A. Amar, D.R. Dooly, S.A. Goldman, and Q.
Zhang. Multiple-instance learning of real-valued data.
In Proceedings of the 18th International Conference on
Machine Learning, Williamstown, MA, pp.3-10,
2001.
[3] P. Auer. On learning from multi-instance
examples: empirical evaluation of a theoretical
approach. In Proceedings of the 14th International
Conference on Machine Learning, Nashville, TN, pp.21-
29, 1997.
[4] P. Auer, P.M. Long, and A. Srinivasan.
Approximating hyper-rectangles: learning and
pseudo-random sets. Journal of Computer and System
Sciences, vol.57, no.3,
pp.376-388, 1998.
[5] A. Blum and A. Kalai. A note on learning from
multiple-instance examples. Machine Learning, vol.30,
no.1, pp.23-29, 1998.
[6] Y. Chevaleyre and J.-D. Zucker. Solving multiple-
instance and multiple-part learning
problems with decision trees and decision rules.
Application to the mutagenesis
problem. In E. Stroulia and S. Matwin, Eds. Lecture
Notes in Artificial Intelligence 2056, Berlin: Springer,
pp.204-214, 2001.
[7] B.V. Dasarathy. Nearest Neighbor Norms: NN
Pattern Classification Techniques, Los
Alamitos, CA: IEEE Computer Society Press, 1991.
[8] T. Dietterich, R. Lathrop, and T. Lozano-Perez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, vol.89, pp.31-71, 1997.
[9] H.P. Kriegel and M. Schubert. Classification of websites as sets of feature vectors. In Proceedings of the IASTED International Conference on Databases and Applications (DBA 2004), Innsbruck, Austria, 2004.
[10] Z.H. Zhou. Multi-instance learning: a survey. Technical Report, AI Lab, Department of Computer Science and Technology, Nanjing University, Nanjing, China, 2004.
[11] G. Ruffo. Learning single and multiple instance decision trees for computer security applications. PhD thesis, Department of Computer Science, University of Turin, Torino, Italy, 2000.
[12] N. Weidmann, E. Frank, and B. Pfahringer. A two-level learning method for generalized multi-instance problems. In Proceedings of ECML 2003, Cavtat-Dubrovnik, Croatia, pp.468-479, 2003.