Patent data clustering a measuring unit for innovators



International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367 (Print), ISSN 0976 – 6375 (Online), Volume 1, Number 1, May - June (2010), pp. 158-165, © IAEME

PATENT DATA CLUSTERING: A MEASURING UNIT FOR INNOVATORS

M. Pratheeban, Research Scholar, Anna University of Technology Coimbatore. E-mail id:
Dr. S. Balasubramanian, Former Director - IPR, Anna University of Technology Coimbatore. E-mail id: s_balasubramanian@rediffmail.com

ABSTRACT

As software applications increase in volume, grouping an application into smaller, more manageable components is often proposed as a means of assisting software maintenance activities. One of the thrust areas in software development is Patent Data Clustering. The key challenge of Patent Data Clustering is how to cluster patent data and how to improve searching of patent data in repositories. In this paper, we propose a new clustering algorithm that improves clustering facilities for patent data.

INTRODUCTION

Patent Data Clustering is a method for grouping patent-related data. Clustering of patent documents (such as titles, abstracts and claims) has been used to bring out the importance of patents for researchers. Clustering analysis is an unsupervised process that divides a set of objects into homogeneous groups. It aims to measure or perceive intrinsic characteristics or similarities among patents. Patent clustering speeds up sifting through large sets of patent data for analysis, which helps people identify competitive and technology trends. The need for academic researchers to retrieve patents is increasing, because applying for patents is now considered an important research activity [6].
PATENT INFORMATION

Patents are an important source of scientific and technical information. For anyone planning to apply for a patent, a search is crucial to identify the existence of prior art, which affects the patentability of an invention. For researchers, patents can be important as they are often the only published information on specific topics, and can provide insight into research directions. Patents are also used by marketing and competitive intelligence professionals to find out about work being done by others.

PATENT DATABASE

Information that may be provided in Patent Databases

Patent data may relate to unexamined and examined patent applications, and includes:

  • Titles and abstract in English (if the patent is in another language)
  • Inventor’s name
  • Patent assignee
  • Patent publication data
  • Images
  • Full text (sometimes this is available through a separate database, or must be ordered)
  • International Patent Classification (IPC) codes

The IPC is used by over 70 patent authorities to classify and index the subject matter of published patent specifications. It is presumably based on literary warrant, and sections range from the very broad to the specific [2].

PATENT ASSESSMENT AND TECHNOLOGY AREA ASSESSMENT

Currently, high-quality valuing of patents and patent applications, and the assessment of technology areas with respect to their potential to give rise to patent applications, is done mainly manually, which is very costly and time consuming. We are developing techniques that use statistical and semantic information from patents, as well as user-based data on market aspects, to prognosticate the value of a patent.
MINING PATENT

A clear and effective IP strategy critically incorporates a clear and effective strategy for managing an organization’s patent portfolio [7]. It means the analysis of all patents that can directly revolutionize business and technology development practice. Patent mining is a deliberate, core function for any IP-centric business to secure technology development, and it provides a foundation that helps administrators make planning decisions regarding technology development.

Today, patent management applications and robust search engines allow internal IP managers to quickly pull together organized sets of patents from within their own portfolios, from those of specific competitors, and from patents citing relevant technical or industry terms. Companies once only interested in understanding the patents within their own portfolio are now interested in knowing about the patents held by competitors [8].

BASICS OF CLUSTERING

Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups [1]. It groups a set of data in a way that maximizes the similarity within clusters and minimizes the similarity between two different clusters. These discovered clusters can help explain the characteristics of the underlying data distribution and serve as the foundation for other data mining and analysis techniques [5]. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
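The idea of high similarity within clusters and low similarity between clusters can be illustrated with a small sketch. The cosine measure, the helper names and the toy term-count vectors below are assumptions chosen for illustration; the paper does not prescribe a particular similarity measure.

```python
import math
from itertools import combinations, product


def cosine(u, v):
    """Cosine similarity between two non-zero term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_intra_similarity(cluster):
    """Average similarity over all pairs inside one cluster (needs >= 2 items)."""
    pairs = list(combinations(cluster, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def mean_inter_similarity(cluster_a, cluster_b):
    """Average similarity over all pairs that span the two clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical term counts for two groups of patent abstracts.
battery_patents = [[3, 1, 0, 0], [2, 2, 0, 0], [4, 1, 0, 1]]
antenna_patents = [[0, 0, 3, 2], [0, 1, 2, 3], [0, 0, 4, 1]]

intra = mean_intra_similarity(battery_patents)
inter = mean_inter_similarity(battery_patents, antenna_patents)
```

A grouping is reasonable under this measure when `intra` is clearly larger than `inter`, as it is for the toy vectors above.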
The quality of a clustering result also depends on both the similarity measure used by the method and its implementation [3].

CLUSTERING ALGORITHMS

Most existing clustering algorithms find clusters that fit some static model. Although effective in some cases, these algorithms can break down, that is, cluster the data incorrectly, if the user doesn’t select appropriate static-model parameters, or if the model cannot adequately capture the clusters’ characteristics. Most of these algorithms break down when the data contains clusters of diverse shapes, densities,
and sizes [5]. Cluster analysis is the organization of a collection of patterns into clusters based on similarity [4].

LIMITATIONS OF TRADITIONAL CLUSTERING ALGORITHMS

Partition-based clustering techniques such as K-Means and CLARANS attempt to break a data set into K clusters such that the partition optimizes a given criterion. These algorithms assume that clusters are hyper-ellipsoidal and of similar sizes. They cannot find clusters that vary in size or that have concave shapes [9]. DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a well-known spatial clustering algorithm, can find clusters of arbitrary shapes. DBSCAN defines a cluster to be a maximal set of density-connected points, which means that every core point in a cluster must have at least a minimum number of points (MinPts) within a given radius (Eps) [10].

DBSCAN assumes that all points within genuine clusters can be reached from one another by traversing a path of density-connected points, and that points across different clusters cannot. DBSCAN can find arbitrarily shaped clusters if the cluster density can be determined beforehand and the cluster density is uniform [10]. Hierarchical clustering algorithms produce a nested sequence of clusters with a single, all-inclusive cluster at the top and single-point clusters at the bottom.

Agglomerative hierarchical algorithms start with each data point as a separate cluster. Each step of the algorithm involves merging the two clusters that are the most similar. After each merger, the total number of clusters decreases by one. Users can repeat these steps until they obtain the desired number of clusters or the distance between the two closest clusters goes above a certain threshold.
A drawback is that most hierarchical algorithms do not revisit (intermediate) clusters once they are constructed in order to improve them [1].

In Agglomerative Hierarchical Clustering, provision can be made for a relocation of objects that may have been incorrectly grouped at an early stage. The result should be examined closely to ensure it makes sense. Use of different distance metrics for measuring distances between clusters may generate different results. Performing multiple experiments and comparing the results is recommended to support the veracity of the original results [11].
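The agglomerative procedure described above (start from single-point clusters, merge the closest pair, stop at a target number of clusters or a distance threshold) can be sketched as follows. This is a minimal, unoptimized illustration; the function names, the Euclidean distance and the two linkage rules are assumptions made for the example and are not taken from the paper.

```python
import math


def single_link(a, b):
    """Cluster distance = distance of the closest pair across the two clusters."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_link(a, b):
    """Cluster distance = distance of the farthest pair across the two clusters."""
    return max(math.dist(p, q) for p in a for q in b)


def agglomerate(points, linkage, n_clusters=1, max_distance=math.inf):
    """Merge the two closest clusters until a stopping rule is met."""
    clusters = [[p] for p in points]          # every point starts alone
    while len(clusters) > n_clusters:
        best = None                           # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_distance:                  # closest pair is already too far apart
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                       # cluster count drops by one
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
three_groups = agglomerate(points, single_link, n_clusters=3)
```

Passing `complete_link` instead of `single_link` changes only how the distance between clusters is measured, which is one way to run the repeated experiments with different distance definitions that the text recommends.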
The many variations of agglomerative hierarchical algorithms primarily differ in how they update the similarity between existing and merged clusters. In some hierarchical methods, each cluster is represented by a centroid or medoid (a data point that is the closest to the center of the cluster), and the similarity between two clusters is measured by the similarity between the centroids/medoids. Both of these schemes fail for data in which points in a given cluster are closer to the center of another cluster than to the center of their own cluster.

ROCK, a recently developed algorithm that operates on a derived similarity graph, scales the aggregate interconnectivity with respect to a user-specified interconnectivity model. However, the major limitation of all such schemes is that they assume a static, user-supplied interconnectivity model. Such models are inflexible and can easily lead to incorrect merging decisions when the model under- or overestimates the interconnectivity of the data set. Although some schemes allow the connectivity to vary for different problem domains, it is still the same for all clusters irrespective of their densities and shapes [12].

CURE measures the similarity between two clusters by the similarity of the closest pair of points belonging to different clusters. Unlike centroid/medoid-based methods, CURE can find clusters of arbitrary shapes and sizes, as it represents each cluster via multiple representative points. Shrinking the representative points toward the centroid allows CURE to avoid some of the problems associated with noise and outliers. However, these techniques fail to account for special characteristics of individual clusters. They can make incorrect merging decisions when the underlying data does not follow the assumed model or when noise is present.
In some algorithms, the similarity between two clusters is captured by the aggregate of the similarities among pairs of items belonging to different clusters [13].

Existing algorithms use a static model of the clusters and do not use information about the nature of individual clusters as they are merged. Furthermore, one set of schemes ignores the information about the aggregate interconnectivity of items in two clusters. The other set of schemes ignores information about the closeness of two clusters as defined by the similarity of the closest items across two clusters. By only considering
either interconnectivity or closeness, these algorithms can easily select and merge the wrong pair of clusters.

USAGE OF ALGORITHMS

The most standard approach for document classification in recent years is applying machine learning, such as support vector machines or Naïve Bayes. However, this approach is not easy to apply to the patent mining task, because the number of classes is large and this incurs a high calculation cost [6]. So we propose a new algorithm rather than machine learning algorithms.

OUR APPROACH

We propose a new dynamic algorithm that accounts for both interlink and nearness in identifying the most similar pair of clusters. Thus, it does not depend on a static, user-supplied model and can automatically adapt to the internal characteristics of the merged clusters. In the above algorithm we replaced Chameleon with a suitable k-medoids, which may give better results for interlink compared to interlink using k-means. From various comparisons we came to know that the average time taken by the K-Means algorithm is greater than the time taken by the K-Medoids algorithm for the same set of data, and also that the K-Means algorithm is efficient for smaller data sets while the K-Medoids algorithm seems to perform better for large data sets [14].

For interlinks of patents:

  1. Collect the patent data related to a particular field or to all fields, and randomly choose k objects from the data set to be the cluster medoids at the initial state.
  2. For each pair of non-selected object h and selected object i, calculate the total swapping cost Tih.
  3. For each pair of i and h, if Tih < 0, i is replaced by h. Then assign each non-selected object to the most similar representative object.
  4. Repeat steps 2 and 3 until no change happens.
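The four steps above can be sketched as a small k-medoids routine. This is an illustrative sketch under stated assumptions, not the authors' implementation: objects are assumed to be distinct numeric tuples, dissimilarity is assumed to be Euclidean distance, and the names `total_cost` and `k_medoids` are hypothetical. The swapping cost Tih is taken here to be the change in the total distance of all objects to their nearest medoid when medoid i is replaced by object h.

```python
import math
import random


def total_cost(points, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)


def k_medoids(points, k, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)               # step 1: random initial medoids
    changed = True
    while changed:                                # step 4: repeat until no change
        changed = False
        for i in list(medoids):                   # selected object i
            for h in points:                      # non-selected object h
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                # step 2: total swapping cost T_ih (assumed definition, see lead-in)
                t_ih = total_cost(points, candidate) - total_cost(points, medoids)
                if t_ih < 0:                      # step 3: the swap lowers the cost
                    medoids = candidate
                    changed = True
                    break                         # i is no longer a medoid
    # step 3 (continued): assign every object to its most similar medoid;
    # each medoid ends up in its own cluster because its distance to itself is 0
    clusters = {m: [] for m in medoids}
    for p in points:
        nearest = min(medoids, key=lambda m: math.dist(p, m))
        clusters[nearest].append(p)
    return medoids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, clusters = k_medoids(points, k=2)
```

Because a swap is accepted only when it strictly lowers the total cost, the loop always terminates. For patent documents, the tuples would be replaced by feature vectors and `math.dist` by whatever dissimilarity measure suits the data.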
Absolute nearness of two clusters is normalized by the internal nearness of the clusters. During the calculation of nearness, the algorithm finds the genuine clusters by repeatedly combining these sub-clusters.

CONCLUSION

The methodology of dynamic modeling of clusters in agglomerative hierarchical methods is applicable to all types of data as long as a similarity measure is available. Even though we chose to model the data using k-medoids in this paper, it is entirely possible to use other algorithms suitable for patent mining domains. Our future research work includes the practical implementation of this algorithm for better results in patent mining.

REFERENCE

[1] Pavel Berkhin, “Survey of Clustering Data Mining Techniques”, Accrue Software, Inc.
[2]
[3] Dr. Osmar R. Zaïane, “Principles of Knowledge Discovery in Databases”, University of Alberta, CMPUT690.
[4] Cheng-Fa Tsai, Han-Chang Wu, Chun-Wei Tsai, “A New Data Clustering Approach for Data Mining in Large Database”, International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN’02).
[5] George Karypis, Eui-Hong (Sam) Han, Vipin Kumar, “Chameleon: Hierarchical Clustering Using Dynamic Modeling”. http://www- karypis99.pdf
[6] Hidetsugu Nanba, “Hiroshima City University at NTCIR-7 Patent Mining Task”, Proceedings of NTCIR-7 Workshop Meeting, December 16–19, 2008, Tokyo, Japan.
[7] Bob Stembridge, Breda Corish, “Patent data mining and effective patent portfolio management”, Intellectual Asset Management, October/November 2004.
[8] Edward Khan, “Patent mining in a changing world of technology and product development”, Intellectual Asset Management, July/August 2003.
[9] Raymond T. Ng, Jiawei Han, “Efficient and Effective Clustering Methods for Spatial Data Mining”, Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
[10] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).
[11] http://www.improvedoutcomes.Com/docs/WebSiteDocs/Clustering/Agglomerative_Hierarchical_Clustering_Overview.htm
[12] S. Guha, R. Rastogi, and K. Shim, “ROCK: A Robust Clustering Algorithm for Categorical Attributes,” Proc. 15th Int’l Conf. Data Eng., IEEE CS Press, Los Alamitos, Calif., 1999, pp. 512-521.
[13] S. Guha, R. Rastogi, and K. Shim, “CURE: An Efficient Clustering Algorithm for Large Databases,” Proc. ACM SIGMOD Int’l Conf. Management of Data, ACM Press, New York, 1998, pp. 73-84.
[14] T. Velmurugan and T. Santhanam, “Computational Complexity between K-Means and K-Medoids Clustering Algorithms for Normal and Uniform Distributions of Data Points”, Journal of Computer Science 6 (3): 363-368, 2010, ISSN 1549-3636, 2010 Science Publications.