Patent data clustering a measuring unit for innovators

International Journal of Computer Engineering and Technology (IJCET)
ISSN 0976 – 6367 (Print), ISSN 0976 – 6375 (Online)
Volume 1, Number 1, May - June (2010), pp. 158-165
© IAEME, http://www.iaeme.com/ijcet.html

PATENT DATA CLUSTERING: A MEASURING UNIT FOR
INNOVATORS
M. Pratheeban
Research Scholar
Anna University of Technology Coimbatore
E-mail id: pratheeban_mca@yahoo.co.in

Dr. S. Balasubramanian
Former Director- IPR
Anna University of Technology Coimbatore
E-Mail id: s_balasubramanian@rediffmail.com

ABSTRACT
As software applications increase in volume, grouping an application into
smaller, more manageable components is often proposed as a means of assisting software
maintenance activities. One of the thrust areas in software development is patent data
clustering. The key challenge of patent data clustering is how to cluster patent data and
improve searching for it in repositories. In this paper, we propose a new
clustering algorithm that improves clustering facilities for patent data.
INTRODUCTION
Patent data clustering is a method for grouping patent-related data. Clustering of
patent documents (such as titles, abstracts and claims) has been used to bring out
the importance of patents for researchers. Clustering analysis is an unsupervised process
that divides a set of objects into homogeneous groups based on measured or perceived
intrinsic characteristics or similarities among patents. Patent clustering speeds up
sifting through large sets of patent data for analysis, which helps people
identify competitive and technology trends. The need for academic researchers to retrieve
patents is increasing because applying for patents is now considered an important
research activity [6].

PATENT INFORMATION
Patents are an important source of scientific and technical information. For
anyone planning to apply for a patent, a search is crucial to identify the existence of prior
art, which affects the patentability of an invention. For researchers, patents can be
important as they are often the only published information on specific topics, and can
provide insight into research directions. Patents are also used by marketing and
competitive intelligence professionals, to find out about work being done by others.
PATENT DATABASE
Information that may be provided in Patent Databases
Patent data may relate to unexamined and examined patent applications, and
includes:
• Titles and abstracts in English (if the patent is in another language)
• Inventor’s name
• Patent assignee
• Patent publication data
• Images
• Full text (sometimes this is available through a separate database, or must be
ordered)
• International Patent Classification (IPC) codes.
The IPC is used by over 70 patent authorities to classify and index the subject
matter of published patent specifications. It is presumably based on literary warrant, and
sections range from the very broad to the specific [2].
PATENT ASSESSMENT AND TECHNOLOGY AREA ASSESSMENT
Currently, high-quality valuation of patents and patent applications, and the
assessment of technology areas with respect to their potential to give rise to patent
applications, is done mainly manually, which is very costly and time consuming. We are
developing techniques that use statistical and semantic information from patents, as well
as user-based data on market aspects, to prognosticate patent value.

MINING PATENT
A clear and effective IP strategy critically incorporates a clear and effective
strategy for managing an organization’s patent portfolio [7]. This means analyzing all
patents that can directly revolutionize business and technology development practice.
Patent mining is a deliberate, core function for any IP-centric business seeking to secure
technology development, and it provides a foundation that helps administrators make
planning decisions regarding technology development.
Today, patent management applications and robust search engines allow internal
IP managers to quickly pull together organized sets of patents from within their own
portfolios, from those of specific competitors, and from patents
citing relevant technical or industry terms. Companies once interested only in
understanding the patents within their own portfolio are now interested in knowing about
the patents held by competitors [8].
BASICS OF CLUSTERING
Clustering is a division of data into groups of similar objects. Each group, called
a cluster, consists of objects that are similar to one another and dissimilar to objects
of other groups [1]. Clustering groups a set of data in a way that maximizes the similarity
within clusters and minimizes the similarity between different clusters. These discovered
clusters can help explain the characteristics of the underlying data distribution and serve
as the foundation for other data mining and analysis techniques [5]. The quality of a
clustering method is measured by its ability to discover some or all of the hidden
patterns. The quality of a clustering result also depends on both the similarity measure
used by the method and its implementation [3].
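As a rough illustration of the criterion described above, the following sketch compares the
average similarity inside clusters with the average similarity across clusters for two
candidate groupings. The similarity function (1 / (1 + Euclidean distance)), the function
names and the toy points are assumptions made for this example only; they are not taken
from the cited works.

```python
from itertools import combinations
from math import dist


def similarity(a, b):
    """Illustrative similarity: 1 for identical points, tending to 0 as distance grows."""
    return 1.0 / (1.0 + dist(a, b))


def mean_within_and_between(points, labels):
    """Return (mean similarity within clusters, mean similarity between clusters).

    Assumes at least one same-cluster pair and one cross-cluster pair exist.
    """
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        s = similarity(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(s)
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical 2-D feature vectors standing in for patent documents.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = mean_within_and_between(points, [0, 0, 1, 1])  # groups nearby points together
bad = mean_within_and_between(points, [0, 1, 0, 1])   # splits nearby points apart
```

For the first labelling the within-cluster mean is larger than the between-cluster mean,
while for the second labelling the relation is reversed, which is the sense in which the
first grouping is the better clustering.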
CLUSTERING ALGORITHMS
Most existing clustering algorithms find clusters that fit some static model.
Although effective in some cases, these algorithms can break down (that is, cluster the
data incorrectly) if the user does not select appropriate static-model parameters, or if
the model cannot adequately capture the clusters’ characteristics. Most of
these algorithms break down when the data contains clusters of diverse shapes, densities,
and sizes [5]. Cluster analysis is the organization of a collection of patterns into clusters
based on similarity [4].
LIMITATIONS OF TRADITIONAL CLUSTERING ALGORITHMS
Partition-based clustering techniques such as K-Means and Clarans attempt to
break a data set into K clusters such that the partition optimizes a given criterion. These
algorithms assume that clusters are hyper-ellipsoidal and of similar sizes. They cannot find
clusters that vary in size or that have concave shapes [9]. DBScan (Density-Based Spatial
Clustering of Applications with Noise), a well-known spatial clustering algorithm, can
find clusters of arbitrary shapes. DBScan defines a cluster to be a maximal set of
density-connected points, which means that every core point in a cluster must have at
least a minimum number of points (MinPts) within a given radius (Eps) [10].
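The core-point condition can be made concrete with a small sketch. It only identifies core
points and does not implement the full DBScan cluster expansion. The function names and the
toy data are hypothetical, Euclidean distance is an assumption, and the Eps-neighbourhood is
taken to include the point itself.

```python
from math import dist


def neighbours(points, index, eps):
    """Indices of all points within distance eps of points[index] (itself included)."""
    return [j for j, q in enumerate(points) if dist(points[index], q) <= eps]


def core_points(points, eps, min_pts):
    """Indices of points with at least min_pts points in their Eps-neighbourhood."""
    return [i for i in range(len(points))
            if len(neighbours(points, i, eps)) >= min_pts]


# Four mutually close points and one isolated point.
points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5), (10.0, 10.0)]
cores = core_points(points, eps=1.0, min_pts=3)
```

With these values the first four points are core points, while the isolated point has only
itself in its neighbourhood and is not.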
DBScan assumes that all points within genuine clusters can be reached from one
another by traversing a path of density-connected points, and that points across different
clusters cannot. DBScan can find arbitrarily shaped clusters if the cluster density can be
determined beforehand and the cluster density is uniform [10]. Hierarchical clustering
algorithms produce a nested sequence of clusters with a single, all-inclusive cluster at the
top and single-point clusters at the bottom.
Agglomerative hierarchical algorithms start with each data point as a separate
cluster. Each step of the algorithm involves merging the two clusters that are the most
similar. After each merger, the total number of clusters decreases by one. Users can
repeat these steps until they obtain the desired number of clusters or the distance between
the two closest clusters goes above a certain threshold. A drawback is that most hierarchical
algorithms do not revisit (intermediate) clusters once they are constructed in order to
improve them [1].
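The merge loop described above can be sketched as follows. This is a deliberately naive
version for illustration: it uses the closest pair of points (single link) as the cluster
distance, which is only one of several possible choices, and the function and parameter
names are hypothetical.

```python
from math import dist


def single_link(c1, c2):
    """Distance between two clusters as the distance of their closest pair of points."""
    return min(dist(p, q) for p in c1 for q in c2)


def agglomerate(points, n_clusters=1, max_distance=float("inf")):
    """Merge the two closest clusters until n_clusters remain or the gap is too large."""
    clusters = [[p] for p in points]              # every point starts as its own cluster
    while len(clusters) > n_clusters:
        pairs = [(single_link(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)                      # most similar (closest) pair
        if d > max_distance:                      # closest pair is already too far apart
            break
        clusters[i] = clusters[i] + clusters[j]   # merge j into i ...
        del clusters[j]                           # ... so the count drops by one
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0), (10.0, 0.0)]
two = agglomerate(points, n_clusters=2)           # stop at a desired number of clusters
by_threshold = agglomerate(points, max_distance=2.0)  # stop at a distance threshold
```

Note that a merge is never undone once it has been made, which is the behaviour criticized
in the paragraph above.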
In agglomerative hierarchical clustering, no provision can be made for a relocation
of objects that may have been 'incorrectly' grouped at an early stage. The result should be
examined closely to ensure it makes sense. Use of different distance metrics for
measuring distances between clusters may generate different results. Performing multiple
experiments and comparing the results is recommended to support the veracity of the
original results [11].

The many variations of agglomerative hierarchical algorithms primarily differ in
how they update the similarity between existing and merged clusters. In some
hierarchical methods, each cluster is represented by a centroid or medoid (a data point that
is the closest to the center of the cluster), and the similarity between two clusters is
measured by the similarity between the centroids/medoids. Both of these schemes fail
for data in which points in a given cluster are closer to the center of another cluster than
to the center of their own cluster.
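A minimal sketch of the two representatives follows. The medoid here follows the
description in the text (the data point closest to the cluster center); a common alternative
definition, the point with the smallest total distance to all other points, can give a
different result. The function names and the sample cluster are assumptions for
illustration.

```python
from math import dist


def centroid(cluster):
    """Coordinate-wise mean of the cluster; usually not an actual data point."""
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(len(cluster[0])))


def medoid(cluster):
    """The actual data point that lies closest to the centroid."""
    centre = centroid(cluster)
    return min(cluster, key=lambda p: dist(p, centre))


def centroid_distance(c1, c2):
    """Cluster-to-cluster distance measured between the two centroids."""
    return dist(centroid(c1), centroid(c2))


cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0), (9.0, 1.0)]
centre = centroid(cluster)      # (3.0, 1.0), pulled toward the outlying point
representative = medoid(cluster)  # (2.0, 0.0), a member of the cluster
```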
ROCK, a recently developed algorithm that operates on a derived similarity graph,
scales the aggregate interconnectivity with respect to a user-specified interconnectivity
model. However, the major limitation of all such schemes is that they assume a static,
user-supplied interconnectivity model. Such models are inflexible and can easily lead to
incorrect merging decisions when the model underestimates or overestimates the
interconnectivity of the data set. Although some schemes allow the connectivity to vary for
different problem domains, it is still the same for all clusters irrespective of their
densities and shapes [12].
CURE measures the similarity between two clusters by the similarity of the
closest pair of points belonging to different clusters. Unlike centroid/medoid-based
methods, CURE can find clusters of arbitrary shapes and sizes, as it represents each
cluster via multiple representative points. Shrinking the representative points toward the
centroid allows CURE to avoid some of the problems associated with noise and outliers.
However, these techniques fail to account for special characteristics of individual
clusters. They can make incorrect merging decisions when the underlying data does not
follow the assumed model or when noise is present. In some algorithms, the similarity
between two clusters is captured by the aggregate of the similarities among pairs of items
belonging to different clusters [13].
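The representative-point idea can be sketched as below. This is a simplified illustration
and not the full CURE algorithm: representatives are picked greedily as well-scattered
points, each is moved toward the centroid by a fraction alpha, and the cluster distance is
taken over the closest pair of representatives. The names `representatives`,
`cure_distance`, `count` and `alpha`, and the sample data, are assumptions for this example.

```python
from math import dist


def centroid(cluster):
    """Coordinate-wise mean of the cluster."""
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(len(cluster[0])))


def representatives(cluster, count, alpha):
    """Pick up to `count` scattered points, then shrink each toward the centroid."""
    centre = centroid(cluster)
    chosen = [max(cluster, key=lambda p: dist(p, centre))]  # start with the farthest point
    while len(chosen) < min(count, len(cluster)):
        # next representative: the point farthest from those already chosen
        chosen.append(max(cluster, key=lambda p: min(dist(p, r) for r in chosen)))
    # shrinking by alpha damps the influence of outliers
    return [tuple(x + alpha * (c - x) for x, c in zip(p, centre)) for p in chosen]


def cure_distance(reps_a, reps_b):
    """Cluster distance as the distance of the closest pair of representatives."""
    return min(dist(p, q) for p in reps_a for q in reps_b)


a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
b = [(10.0, 0.0), (11.0, 0.0), (12.0, 0.0), (15.0, 0.0)]
reps_a = representatives(a, count=2, alpha=0.5)
reps_b = representatives(b, count=2, alpha=0.5)
gap = cure_distance(reps_a, reps_b)
```

In this toy case the outlying point (5.0, 0.0) is pulled back to (3.5, 0.0) before distances
are measured, so it contributes less to the merging decision than it would unshrunk.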
Existing algorithms use a static model of the clusters and do not use information
about the nature of individual clusters as they are merged. Furthermore, one set of
schemes ignores the information about the aggregate interconnectivity of items in two
clusters. The other set of schemes ignores information about the closeness of two clusters
as defined by the similarity of the closest items across two clusters. By only considering
either interconnectivity or closeness, these algorithms can easily select and merge the
wrong pair of clusters.
USAGE OF ALGORITHMS:
The most standard approach to document classification in recent years is
applying machine learning, such as support vector machines or Naïve Bayes. However, this
approach is not easy to apply to the patent mining task, because the number of classes is
large and it incurs a high calculation cost [6]. So we propose a new algorithm rather
than machine learning algorithms.
OUR APPROACH
We propose a new dynamic algorithm that takes both interlinks and nearness into account
in identifying the most similar pair of clusters. Thus, it does not depend on a static,
user-supplied model and can automatically adapt to the internal characteristics of the merged
clusters. In the above algorithm, we replace Chameleon with a suitable k-medoids procedure,
which may give better results for interlinks than k-means. From various
comparisons, we came to know that the average time taken by the K-Means algorithm is greater
than the time taken by the K-Medoids algorithm for the same set of data, and that the K-Means
algorithm is efficient for smaller data sets while the K-Medoids algorithm seems to perform
better for large data sets [14].
For interlinks of patents:
1. Collect the patent data related to a particular field or to all fields, and randomly
choose k objects from the data set to be the cluster medoids at the initial state.

2. For each pair of non-selected object h and selected object i, calculate the total
swapping cost Tih.

3. For each pair of i and h, if Tih < 0, replace i by h. Then assign each non-
selected object to the most similar representative object.

4. Repeat steps 2 and 3 until no change happens.
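The steps above can be sketched as follows. This is a minimal illustration under stated
assumptions rather than the authors' implementation: objects are distinct numeric tuples
standing in for patent feature vectors, dissimilarity is Euclidean distance, the swapping
cost Tih is computed as the change in total distance to the nearest medoid, and each
improving swap is applied as soon as it is found. The function names and the `seed`
parameter are hypothetical.

```python
import random
from math import dist


def total_cost(points, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def k_medoids(points, k, seed=0):
    """Return a dict mapping each final medoid to the objects assigned to it."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)          # step 1: random initial medoids
    changed = True
    while changed:                           # step 4: repeat until no change
        changed = False
        for i in list(medoids):              # selected object i
            for h in points:                 # non-selected object h
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                # step 2: swapping cost of replacing i by h
                t_ih = total_cost(points, candidate) - total_cost(points, medoids)
                if t_ih < 0:                 # step 3: the swap lowers the total cost
                    medoids, changed = candidate, True
                    break
    clusters = {m: [] for m in medoids}
    for p in points:                         # assign each object to its nearest medoid
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
clusters = k_medoids(points, k=2)
groups = sorted(sorted(members) for members in clusters.values())
```

Because every accepted swap strictly lowers the total cost, the loop terminates. For the two
well-separated groups in this toy data set, `groups` holds the three points near the origin
and the three points near (10, 10).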

The absolute nearness of two clusters is normalized by the internal nearness of the
clusters. While calculating nearness, the algorithm finds the genuine clusters
by repeatedly combining these sub-clusters.
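One way to read this normalization is sketched below. It is a simplification for
illustration only: nearness is computed from pairwise point similarities rather than from a
similarity graph, internal nearness is the average similarity inside a sub-cluster, and the
two internal values are combined with a size-weighted mean. The similarity function and all
function names are assumptions, and each sub-cluster is assumed to hold at least two
objects.

```python
from itertools import combinations
from math import dist


def similarity(a, b):
    """Illustrative similarity: 1 for identical points, tending to 0 with distance."""
    return 1.0 / (1.0 + dist(a, b))


def internal_nearness(cluster):
    """Average similarity over all pairs inside one sub-cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def absolute_nearness(c1, c2):
    """Average similarity over all pairs that span the two sub-clusters."""
    return sum(similarity(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def relative_nearness(c1, c2):
    """Absolute nearness normalized by the size-weighted internal nearness."""
    n1, n2 = len(c1), len(c2)
    internal = (n1 * internal_nearness(c1) + n2 * internal_nearness(c2)) / (n1 + n2)
    return absolute_nearness(c1, c2) / internal


tight = [(0.0, 0.0), (0.0, 1.0)]
near = [(0.0, 2.0), (0.0, 3.0)]
far = [(0.0, 20.0), (0.0, 21.0)]
near_score = relative_nearness(tight, near)
far_score = relative_nearness(tight, far)
```

Under this reading, the pair of sub-clusters with the highest relative nearness is the
preferred candidate for the next merge, so `tight` would be combined with `near` rather than
with `far`.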
CONCLUSION
The methodology of dynamic modeling of clusters in agglomerative hierarchical
methods is applicable to all types of data as long as a similarity measure is available.
Even though we chose to model the data using k-medoids in this paper, it is entirely possible
to use other algorithms suitable for patent mining domains. Our future research work includes
the practical implementation of this algorithm for better results in patent mining.
REFERENCES
[1] Pavel Berkhin, “Survey of Clustering Data Mining Techniques”, Accrue
Software, Inc., http://www.ee.ucr.edu/~barth/EE242/clustering_survey.pdf
[2] http://www.wipo.int/classifications/ipc/en/
[3] Dr. Osmar R. Zaïane, “Principles of Knowledge Discovery in Databases”,
University of Alberta, CMPUT690.
[4] Cheng-Fa Tsai, Han-Chang Wu, Chun-Wei Tsai, “A New Data Clustering
Approach for Data Mining in Large Database”, International Symposium on
Parallel Architectures, Algorithms and Networks (ISPAN’02).
[5] George Karypis, Eui-Hong (Sam) Han, Vipin Kumar, “Chameleon:
Hierarchical Clustering Using Dynamic Modeling”,
http://www-leibniz.imag.fr/Apprentissage/Depot/Selection/karypis99.pdf
[6] Hidetsugu Nanba, “Hiroshima City University at NTCIR-7 Patent Mining
Task”, Proceedings of NTCIR-7 Workshop Meeting, December 16–19, 2008,
Tokyo, Japan.
[7] Bob Stembridge, Breda Corish, “Patent data mining and effective patent
portfolio management”, Intellectual Asset Management, October/November
2004.
[8] Edward Khan, “Patent mining in a changing world of technology and product
development”, Intellectual Asset Management, July/August 2003.
[9] Raymond T. Ng, Jiawei Han, “Efficient and Effective Clustering Methods for
Spatial Data Mining”, Proceedings of the 20th VLDB Conference, Santiago,
Chile, 1994.
[10] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu, “A Density-Based
Algorithm for Discovering Clusters in Large Spatial Databases with Noise”,
Proceedings of 2nd International Conference on Knowledge Discovery and
Data Mining (KDD-96).
[11] http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Agglomerative_Hierarchical_Clustering_Overview.htm
[12] S. Guha, R. Rastogi, and K. Shim, “ROCK: A Robust Clustering Algorithm
for Categorical Attributes”, Proc. 15th Int’l Conf. Data Eng., IEEE CS Press,
Los Alamitos, Calif., 1999, pp. 512-521.
[13] S. Guha, R. Rastogi, and K. Shim, “CURE: An Efficient Clustering Algorithm
for Large Databases”, Proc. ACM SIGMOD Int’l Conf. Management of Data,
ACM Press, New York, 1998, pp. 73-84.
[14] T. Velmurugan and T. Santhanam, “Computational Complexity between K-
Means and K-Medoids Clustering Algorithms for Normal and Uniform
Distributions of Data Points”, Journal of Computer Science 6 (3): 363-368,
2010, ISSN 1549-3636, Science Publications.
