This document summarizes methods for clustering web documents based on their content and structure. It discusses using a web document's hyperlinks, textual information, and co-citations to calculate similarity and group documents with similar topics. It also examines clustering based on analyzing the HTML schema to find structural similarities between pages. The document concludes that using hyperlinks, text and co-citations provides better clustering for topic-related pages, while structural analysis works for pages with common layouts like those on the same website.
Dynamic extraction of key paper from the cluster using variance values of cit... (IJDKP)
In studies of recent research trends in the academic landscape, citation network analysis is common, and automated clustering of large numbers of academic papers has been achieved using various techniques. However, characterizing the features of each area identified by automated clustering, or dynamically extracting the key papers in each research area, has not yet been achieved. In this study, we therefore propose a method for dynamically identifying the key papers in each area identified by clustering. We investigate the variance of the publication years of the cited literature and calculate each cited paper's importance by applying these variance values to the PageRank algorithm.
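To make the idea concrete, the sketch below feeds publication-year variance into a personalized PageRank as the teleport distribution. The toy citation graph, the variance-to-weight mapping, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def variance_weighted_pagerank(adj, cited_years, d=0.85, iters=100):
    """PageRank in which each paper's teleport weight derives from the
    variance of the publication years of the literature it cites.
    adj[i][j] = 1 means paper i cites paper j; cited_years[i] lists the
    publication years cited by paper i. (Illustrative sketch only.)"""
    n = len(adj)
    # Teleport distribution biased toward papers whose cited years vary widely.
    var = np.array([np.var(y) if len(y) > 1 else 0.0 for y in cited_years])
    teleport = (var + 1e-9) / (var + 1e-9).sum()
    # Column-stochastic transition matrix over citation links;
    # dangling papers link uniformly to everyone.
    M = np.zeros((n, n))
    for i, row in enumerate(adj):
        out = sum(row)
        for j, e in enumerate(row):
            M[j, i] = e / out if out else 1.0 / n
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) * teleport + d * M @ r
    return r

# Toy example: papers 1 and 2 both cite paper 0.
adj = [[0, 0, 0], [1, 0, 0], [1, 0, 0]]
years = [[1998, 2005, 2012], [2010, 2011], [2009, 2009]]
print(variance_weighted_pagerank(adj, years))
```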
The document proposes using an A* algorithm within a relational framework to calculate shortest paths more efficiently over graph data stored in a relational database. The system initializes the source node, then iteratively selects the next frontier node and expands paths until the target node is found. Experimental results on road network data show that the proposed approach has faster execution time than bidirectional search, especially on larger datasets containing over 500,000 records. The approach requires more memory than bidirectional search but is more efficient than other shortest-path algorithms.
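A compact sketch of that frontier-expansion loop, using an in-memory adjacency dict and a straight-line heuristic in place of the paper's relational tables; the data layout here is an assumption for illustration.

```python
import heapq, math

def a_star(graph, coords, source, target):
    """A* over an adjacency dict {node: [(neighbor, cost), ...]}.
    coords gives an (x, y) position per node; straight-line distance
    serves as the admissible heuristic."""
    def h(n):
        (x1, y1), (x2, y2) = coords[n], coords[target]
        return math.hypot(x2 - x1, y2 - y1)

    frontier = [(h(source), 0.0, source, [source])]  # (f, g, node, path)
    best_g = {source: 0.0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)   # next frontier node
        if node == target:
            return g, path
        for nbr, cost in graph.get(node, []):        # expand paths
            ng = g + cost
            if ng < best_g.get(nbr, math.inf):
                best_g[nbr] = ng
                heapq.heappush(frontier, (ng + h(nbr), ng, nbr, path + [nbr]))
    return math.inf, []
```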
A Novel Multi-Viewpoint based Similarity Measure for Document Clustering (IJMER)
International Journal of Modern Engineering Research (IJMER) is a peer-reviewed, online journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, Assessment, and many more.
This document summarizes a research paper that proposes a new technique for web page clustering inspired by the cemetery organization behavior of ants. The technique involves three main steps (a minimal sketch follows the list):
1) Generating a term-document matrix to represent the web pages and reducing the dimensionality of the matrix using Latent Semantic Indexing.
2) Transforming the web pages into a two-dimensional grid space based on the cemetery organization behavior of ants, where web pages are more likely to be placed near similar web pages.
3) Clustering the web pages represented in the two-dimensional grid space using k-means clustering. The paper claims this technique can improve web page clustering compared to other approaches.
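The sketch below approximates steps 1 and 3 with standard scikit-learn components; the ant-based grid placement of step 2 is abstracted to the 2-D LSI coordinates themselves, and the sample pages are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

pages = ["ant colony optimization for clustering",
         "k-means clustering of web documents",
         "football match results and league tables"]

# Step 1: term-document matrix, then LSI-style dimensionality reduction.
X = TfidfVectorizer().fit_transform(pages)
X_lsi = TruncatedSVD(n_components=2).fit_transform(X)

# Step 2 would move pages around a 2-D grid ant-style; here the 2-D LSI
# coordinates stand in for that placement.

# Step 3: k-means on the low-dimensional representation.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_lsi)
print(labels)
```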
This document discusses GCUBE indexing, a method for indexing and aggregating spatial/continuous values in a data warehouse. The key challenges addressed are defining and aggregating spatial/continuous values, and efficiently representing, indexing, updating, and querying data that includes both categorical and continuous dimensions. The proposed GCUBE approach maps multi-dimensional data to a linear ordering using the Hilbert curve, then constructs an index structure on the ordered data to enable efficient query processing. Empirical results show that GCUBE indexing offers significant performance advantages over alternative approaches.
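For reference, this is the standard two-dimensional Hilbert-curve mapping that turns a grid cell into a position along the curve; GCUBE builds its index over such an ordering (textbook algorithm, not GCUBE's actual code).

```python
def xy2d(n, x, y):
    """Map cell (x, y) on an n-by-n grid (n a power of two) to its
    distance along the Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the curve stays continuous.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Sorting points by Hilbert position gives a linear order with good
# spatial locality; a one-dimensional index is then built over it.
points = [(3, 5), (0, 0), (7, 7), (2, 2)]
print(sorted(points, key=lambda p: xy2d(8, *p)))
```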
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT... (IJDKP)
Many applications of automatic document classification require learning accurately with little training data. Semi-supervised classification uses both labeled and unlabeled data for training. This technique has been shown to be effective in some cases; however, the use of unlabeled data is not always beneficial.
On the other hand, the emergence of web technologies has given rise to the collaborative development of ontologies. In this paper, we propose the use of ontologies to improve the accuracy and efficiency of semi-supervised document classification.
We used support vector machines, one of the most effective algorithms studied for text. Our algorithm enhances the performance of transductive support vector machines through the use of ontologies. We report experimental results applying our algorithm to three different datasets. Our experiments show an accuracy improvement of 4% on average, and up to 20%, compared with the traditional semi-supervised model.
This document discusses various techniques for document clustering and retrieval, including cosine similarity, k-means clustering, hierarchical clustering, and the EM algorithm. Cosine similarity measures the similarity between document vectors based on the angle between them. K-means clustering partitions documents into k clusters so as to maximize intra-cluster similarity, while hierarchical clustering successively merges clusters, recorded in a dendrogram, based on similarity. The EM algorithm computes maximum-likelihood estimates of document distributions. Clustering evaluation assesses quality based on intra-class and inter-class similarity.
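For concreteness, a minimal cosine-similarity helper over toy term-count vectors (illustrative only):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors:
    sim(a, b) = (a . b) / (|a| * |b|); 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

doc1 = [2, 0, 1, 1]   # toy term counts over a 4-word vocabulary
doc2 = [1, 1, 0, 1]
print(cosine_similarity(doc1, doc2))  # ~0.71
```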
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY (IJDKP)
This document summarizes an approach to improving source code retrieval using structural information from source code. A lexical parser is developed to extract control statements and method identifiers from Java programs, and a similarity measure is proposed that calculates the ratio of fully matching statements to partially matching statements in a sequence. Experiments show that the retrieval model using this measure improves retrieval performance over other models by up to 90.9%, measured relative to the number of retrieved methods.
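One plausible reading of that measure, sketched with invented statement sequences; the paper's exact definition and weighting of full versus partial matches may differ.

```python
def sequence_similarity(query_seq, method_seq):
    """Score aligned statement pairs as full matches (identical),
    partial matches (same leading keyword, e.g. both 'for'), or misses,
    then combine them into a single ratio. Illustrative sketch."""
    full = partial = 0
    for q, m in zip(query_seq, method_seq):
        if q == m:
            full += 1
        elif q.split()[0] == m.split()[0]:   # same statement type
            partial += 1
    total = max(len(query_seq), len(method_seq))
    return (full + 0.5 * partial) / total if total else 0.0

print(sequence_similarity(["for i", "if x", "return"],
                          ["for j", "if x", "return"]))  # (2 + 0.5) / 3
```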
Data conflation is a process that optimizes spatial data quality by automatically merging heterogeneous geospatial datasets. It detects corresponding real-world features represented differently across datasets and combines their geometric and semantic information. For example, conflating datasets can identify a traffic roundabout modeled inconsistently and insert it accurately. By integrating multiple sources, conflation improves data completeness, positional accuracy, and temporal accuracy to produce an optimized dataset for applications.
Improvement of Spatial Data Quality Using the Data Conflation (Beniamino Murgante)
Improvement of Spatial Data Quality Using the Data Conflation
Silvija Stankute, Hartmut Asche - Geoinformation Research Group, Department of Geography, University of Potsdam
Load balancing functionality is crucial for best Grid performance and utilization. Accordingly, this paper presents a new meta-scheduling method called TunSys, inspired by the natural phenomenon of heat propagation and thermal equilibrium. TunSys is based on a Grid polyhedron model with a sphere-like structure, used to ensure load balancing through a local neighborhood propagation strategy. Experimental results compared with FCFS, DGA, and HGA are encouraging in terms of system performance, scalability, and load-balancing efficiency.
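A toy rendering of the heat-propagation idea: each node repeatedly exchanges a fraction of its load difference with its neighbors until the system approaches "thermal equilibrium" (equal load). The ring topology and diffusion rate below are illustrative assumptions, not TunSys's polyhedron model.

```python
def diffuse_load(loads, neighbors, alpha=0.25, rounds=50):
    """Diffusion-style load balancing over a neighbor list;
    load flows from hotter (more loaded) nodes to cooler ones."""
    loads = list(loads)
    for _ in range(rounds):
        new = loads[:]
        for i, nbrs in neighbors.items():
            for j in nbrs:
                new[i] += alpha * (loads[j] - loads[i]) / len(nbrs)
        loads = new
    return loads

# Four nodes on a ring; the initial load is badly skewed.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(diffuse_load([100, 0, 0, 0], ring))  # -> roughly [25, 25, 25, 25]
```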
Efficient Record De-Duplication Identifying Using Febrl Framework (IOSR Journals)
This document describes using the Febrl (Freely Extensible Biomedical Record Linkage) framework to perform efficient record de-duplication. It discusses how Febrl allows for data cleaning, standardization, indexing, field comparison, and weight vector classification. Indexing techniques like blocking indexes, q-grams, and canopy clustering are used to reduce the number of record pair comparisons. Field comparison functions calculate matching weights, and classifiers like Fellegi-Sunter and support vector machines are used to determine matches. The method is evaluated on real-world health data, showing accuracy, precision, recall, and false positive rates for different partitioning methods.
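A minimal sketch of the q-gram indexing idea, not Febrl's API: records that share a q-gram of the blocking field land in the same block, so only those pairs proceed to detailed field comparison.

```python
from collections import defaultdict

def qgrams(value, q=2):
    """Overlapping q-grams (bigrams by default) of a normalized value."""
    v = value.lower().replace(" ", "")
    return {v[i:i + q] for i in range(len(v) - q + 1)}

def qgram_blocking(records, field):
    """Build an inverted index from q-grams to record ids; every pair
    sharing a q-gram becomes a candidate, avoiding the full cross
    product of all records."""
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        for g in qgrams(rec[field]):
            index[g].add(rid)
    pairs = set()
    for rids in index.values():
        for a in rids:
            for b in rids:
                if a < b:
                    pairs.add((a, b))
    return sorted(pairs)

recs = [{"name": "Jonathan Smith"},
        {"name": "Jonathon Smith"},
        {"name": "Mary Evans"}]
print(qgram_blocking(recs, "name"))  # the two Smiths share many bigrams
```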
With the rapid development of Geographic Information Systems (GISs) and their applications, more and more geographical databases have been developed by different vendors. However, data integration and access remain a major problem for the development of GIS applications, as no interoperability exists among different spatial databases. In this paper we propose a unified approach for spatial data query. The paper describes a framework for integrating information from repositories containing different vector data formats and repositories containing raster datasets. The presented approach converts different vector data formats into a single unified format (File Geo-Database, "GDB"). In addition, we employ metadata to support a wide range of user queries for retrieving relevant geographic information from heterogeneous and distributed repositories, which enhances both query processing and performance.
This document describes ChemEngine, a program that can extract 3D molecular data from PDF files. ChemEngine uses pattern recognition to identify molecular coordinates in the supplementary material of scientific articles. It generates standard molecular data, such as bond matrices and atomic coordinates, that can then be used for computational analysis. The methodology was demonstrated on three case studies involving different coordinate data formats. ChemEngine accurately extracted coordinates and produced computational results, such as energies, that agreed with original literature values. The tool aims to automate the conversion of molecular data from PDFs into a format suitable for computational workflows.
This document summarizes a research paper that introduces a novel multi-viewpoint similarity measure for clustering text documents. The paper begins with background on commonly used similarity measures like Euclidean distance and cosine similarity. It then presents the novel multi-viewpoint measure, which considers multiple viewpoints (objects not assumed to be in the same cluster) rather than a single viewpoint. The paper proposes two new clustering criterion functions based on this measure and compares them to other algorithms on benchmark datasets. The goal is to develop a similarity measure and clustering methods that provide high-quality, consistent performance like k-means but can better handle sparse, high-dimensional text data.
The program ChemEngine recognizes textual patterns in supplementary scientific research article data to generate standard molecular structure data. It has been demonstrated to selectively harvest atomic coordinates from different formats of coordinates data stored in supplementary PDF files with high accuracy, as shown by close agreement of computed single point energies to the original values. The program and source code are available online at the given URL.
Improvement of a method based on hidden Markov model for clustering web users (csandit)
Nowadays, determining the dynamics of sequential data in fields such as marketing, finance, the social sciences, or web research has received much attention from researchers and scholars. Clustering such data is by nature a challenging task. This paper investigates the applications of different Markov models in web mining and improves an existing method for clustering web users using hidden Markov models. In the first step, the categorical sequences are transformed into a probabilistic space by a hidden Markov model. In the second step, hierarchical clustering, the performance of the clustering process is evaluated with various distance criteria. Furthermore, the paper shows that implementing the proposed improvements with symmetric distance measures such as Total-Variance and Mahalanobis, compared with the distances previously used with this method (such as Kullback-Leibler), yields more clearly separated clusters on the well-known Microsoft dataset of website user search patterns.
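For concreteness, here is a symmetrized Kullback-Leibler distance between two users' categorical distributions, the kind of distance criterion the paper compares; this is the generic formula, not the paper's code.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetrized KL distance between two probability vectors
    (e.g. HMM-derived distributions for two web users):
    D(p, q) = KL(p||q) + KL(q||p)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Two users' distributions over four page categories.
print(symmetric_kl([0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]))
```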
Xiaolin Wang - Managing and Integrating Geography Models in Distributed Envir... (grssieee)
This document discusses managing and integrating geography models in a distributed environment. It proposes using a model contract to describe models, their inputs/outputs, and how they are combined. A metadata standard defines atomic models, while an integration standard specifies how to compose models into hierarchical composite structures and define their workflow. A model contract execution engine would parse contracts to interpretively execute models according to their defined workflows, allowing for distributed integration and reuse of geography models.
This document summarizes a student's research project on approximate matching on graph databases using the GeX approach. It introduces graph databases and the need for approximate matching. It describes testing the GeX Top-K query algorithm on biological interaction data from multiple organisms. While accurate, the algorithm's performance decreases with larger datasets. Future work could approximate edge labels as well to improve scalability.
Implementing Map Reduce Based Edmonds-Karp Algorithm to Determine Maximum Flo... (paperpublications3)
Abstract: Maximum-flow problems are used to find Google spam sites, discover Facebook communities, and more on graphs from the Internet. Such graphs are now so large that they have outgrown conventional memory-resident algorithms. In this paper, we show how to effectively parallelize a maximum-flow computation based on the Edmonds-Karp Algorithm (EKA) on a cluster using the MapReduce framework. Our algorithm exploits the property that such graphs are small-world networks with low diameter, and employs optimizations to improve the effectiveness of MapReduce and increase parallelism. We are able to compute maximum flow on a subset of a large network graph with a very large number of vertices and edges using a cluster of 4 or 5 machines in reasonable time. Keywords: Algorithm, MapReduce, Hadoop.
Title: Implementing Map Reduce Based Edmonds-Karp Algorithm to Determine Maximum Flow in Large Network Graph
Author: Dhananjaya Kumar K, Mr. Manjunatha A.S
International Journal of Recent Research in Mathematics Computer Science and Information Technology
ISSN 2350-1022
Paper Publications
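For reference, this is the single-machine Edmonds-Karp loop that the paper distributes: a BFS finds the shortest augmenting path, which is then saturated. This is a standard sketch, not the MapReduce version.

```python
from collections import deque

def edmonds_karp(cap, s, t):
    """Max flow over a dict-of-dicts residual capacity matrix:
    repeat BFS for the shortest augmenting path and push flow."""
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:           # no augmenting path left
            return flow
        path, v = [], t               # recover the path s -> t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:             # push flow, grow reverse residuals
            cap[u][v] -= bottleneck
            cap.setdefault(v, {})[u] = cap.get(v, {}).get(u, 0) + bottleneck
        flow += bottleneck

graph = {"s": {"a": 3, "b": 2}, "a": {"t": 2}, "b": {"t": 3}}
print(edmonds_karp(graph, "s", "t"))  # 4
```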
A spatial data model for moving object databases (ijdms)
Moving Object Databases will play a significant role in Geospatial Information Systems, as they allow users to model the continuous movements of entities in the database and perform spatio-temporal analysis. Representing and querying moving objects requires an algebra with a comprehensive framework of User Defined Types together with a set of functions on those types. Moreover, in real-world applications moving objects move through constrained environments such as transportation networks, so an extra algebra for modeling networks is needed as well. These algebras can be inserted into any data model if their designs are based on available standards, such as those of the Open Geospatial Consortium, which provides a common model for existing DBMSs. In this paper, we focus on extending a spatial data model for constrained moving objects. Static and moving geometries in our model are based on Open Geospatial Consortium standards. We also extend Structured Query Language into a simple and expressive query language for retrieving, querying, and manipulating spatio-temporal data related to moving objects. Finally, as a proof of concept, we implement a generator that produces data for moving objects constrained by a transportation network, aimed primarily at traffic-planning applications.
USING ONTOLOGY BASED SEMANTIC ASSOCIATION RULE MINING IN LOCATION BASED SERVICES (IJDKP)
Recently, GPS and mobile devices have allowed the collection of a huge amount of mobility data. Researchers from different communities have developed models and techniques for mobility analysis, but they have mainly focused on the geometric properties of trajectories and do not consider the semantic facet of moving objects. These techniques are good at extracting patterns, but the patterns are hard to interpret in a specific application domain. This paper proposes a methodology to understand mobility data and semantically interpret trajectory patterns. The process considers four different behavior types: semantic, semantic and space, semantic and time, and semantic and space-time. Finally, a system prototype was developed to evaluate the behavior models in different aspects using one of the location-based services. The results showed that applying the semantic association rules could significantly reduce the number of available services and customize the services based on the rules.
This document summarizes techniques for mapping application topologies to interconnect network topologies. It discusses how improving data locality through topology mapping can reduce communication costs, execution time, and energy consumption. Several common mapping techniques are described, including linear programming formulations, greedy approaches, partitioning approaches, transformative approaches, and those based on graph similarity. The document notes that finding an optimal mapping is NP-complete and different techniques may work better depending on the topology.
The document discusses using the OGC Web Coverage Service (WCS) protocol to deliver air quality data from various sources through a system called DataFed. The WCS allows querying distributed air quality monitoring data in various formats. It provides a common data model and can deliver gridded data, images, and point data like that from monitoring stations. For air quality analysis, extending WCS to better support point data from stations would be useful.
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL... (ijnlc)
The tremendous increase in the number of available research documents impels researchers to propose topic models to extract the latent semantic themes of a document collection. However, extracting the hidden topics of a document collection has become a crucial task for many topic-model applications. Moreover, conventional topic-modeling approaches suffer from a scalability problem as the size of the document collection increases. In this paper, the Correlated Topic Model with a variational Expectation-Maximization algorithm is implemented in the MapReduce framework to solve the scalability problem. The proposed approach utilizes a dataset crawled from a public digital library. In addition, the full texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. Experiments are conducted to demonstrate the performance of the proposed algorithm. The evaluation shows that the proposed approach has performance comparable, in terms of topic coherence, with LDA implemented in the MapReduce framework.
Slides presented at the TREC conference about our participation in the Contextual Suggestion track. We applied tourist domain knowledge inferred from the public Web to filter documents from the ClueWeb12 collection (a crawl from the Web).
A large volume of information on the Web is stored in XML format, and clustering is a management method for these documents. XML documents have both content and structure, yet most current methods for clustering XML documents consider only one of these two aspects. In this paper, we propose SCEM (Structure and Content Expectation Maximization), which effectively clusters XML documents by combining content and structural features. Another contribution of this paper is the use of probabilistic distributions whose probability parameters correspond to individual clusters. In this way, we obtain better effectiveness than other clustering methods due to generality. Experimental results on real datasets show the effectiveness of the proposed method, particularly when applied to large XML documents without a schema. It can also be used to improve the accuracy and effectiveness of XML information retrieval.
INVESTIGATING SIGNIFICANT CHANGES IN USERS' INTEREST ON WEB TRAVERSAL PATTERNS (IJCI JOURNAL)
This document summarizes an article that investigates significant changes in users' interest on web traversal patterns. It proposes algorithms to discover transitional patterns based on support and utility constraints in two phases. The first phase discovers frequent and high utility patterns. The second phase identifies transitional patterns among them and detects significant milestones when support or utility dramatically increases or decreases. Experimental results on a real dataset are discussed. The information obtained helps understand users' dynamic preferences on web traversal patterns over time.
This document summarizes a research paper that proposes a new technique for web page clustering inspired by the cemetery organization behavior of ants. The technique begins by reducing the dimensionality of the web page index using latent semantic indexing. It then transforms web pages into a two-dimensional grid space based on the cemetery organization behavior of ants. Finally, it clusters the web pages in this space using k-means clustering. The paper aims to address the challenges of web page clustering by applying this three-step technique and demonstrates the impact of dimensionality reduction and distance measures on clustering results.
Enhancement of Single Moving Average Time Series Model Using Rough k-Means fo... (IJERA Editor)
This document proposes combining rough k-means clustering with a single moving average time series model to improve network traffic prediction. The document first discusses related work on network traffic prediction using various time series models. It then describes using a single moving average model to initially predict network packet loads, and enhancing this prediction by incorporating clusters identified through rough k-means analysis of the network data. The proposed integrated model is evaluated on real network traffic data and shown to improve prediction accuracy over the conventional single moving average model alone.
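The baseline forecaster is simple enough to state in a few lines; the paper's contribution lies in adjusting such forecasts using the rough-k-means clusters. The window size and packet counts below are invented for illustration.

```python
def single_moving_average(series, window=3):
    """Forecast the next value as the mean of the last `window`
    observations of the traffic series."""
    if len(series) < window:
        return sum(series) / len(series)
    return sum(series[-window:]) / window

packets = [120, 132, 101, 134, 190, 230, 142]   # packets per interval
print(single_moving_average(packets))           # (190 + 230 + 142) / 3
```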
Performance Analysis and Parallelization of Cosine Similarity of Documents (IRJET Journal)
This document discusses performance analysis and parallelization of the cosine similarity algorithm for calculating document similarity. It proposes an optimized algorithm that utilizes parallel computing to calculate cosine similarity for large sets of retrieved documents more efficiently. The conventional cosine similarity algorithm becomes inefficient for large document sets. The parallelized approach aims to enhance efficiency and reduce latency by processing more documents in less time. The document reviews related work applying techniques like parallelization, cosine similarity, and dimensionality reduction to problems involving document clustering, text summarization, and information retrieval.
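One straightforward way to realize such parallelization is to split the document matrix into row chunks and score each chunk against the query in a separate process; the paper's exact scheme may differ.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def chunk_scores(args):
    """Cosine similarity of every row in one chunk against the query."""
    chunk, query = args
    dots = chunk @ query
    norms = np.linalg.norm(chunk, axis=1) * np.linalg.norm(query)
    return dots / np.where(norms == 0, 1, norms)

def parallel_cosine(docs, query, workers=4):
    """Fan the row chunks out to worker processes, then reassemble."""
    chunks = np.array_split(docs, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(chunk_scores, [(c, query) for c in chunks])
    return np.concatenate(list(parts))

if __name__ == "__main__":
    docs = np.random.rand(10_000, 300)   # 10k documents, 300-d vectors
    query = np.random.rand(300)
    print(parallel_cosine(docs, query)[:5])
```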
A Conceptual Model For The Logical Design Of Temporal Databases (Whitney Anderson)
This document proposes a Temporal Event-Entity-Relationship Model (TEERM) for conceptual modeling of temporal databases. The TEERM extends the traditional Entity-Relationship model by introducing events as a new construct. It also classifies relationships and attributes as static, quasi-static, or temporal based on whether their history is relevant to the database. The paper then demonstrates how the TEERM can be used for logical design of temporal relational databases by mapping each construct in the conceptual model to a temporal relational model. This allows designers to benefit from a well-defined design process for temporal databases similar to the established process used for conventional databases.
IRJET - Kinematic Analysis of Planar and Spatial Mechanisms using Matpack (IRJET Journal)
This document discusses kinematic analysis of planar and spatial mechanisms using computational methods in MATLAB. It develops a MATLAB package called MATPACK for numerical analysis of planar and spatial mechanisms. It uses vector notation to analyze planar mechanisms and Denavit-Hartenberg parameters to analyze spatial mechanisms. Results for velocities and accelerations are obtained from MATPACK and compared to theoretical results. The objective is to introduce existing notation and methods to analyze spatial mechanisms using computational tools like MATPACK, AutoCAD, and CATIA.
IRJET - Review on Scheduling using Dependency Structure Matrix (IRJET Journal)
This document provides a review of scheduling methods using Dependency Structure Matrices (DSM). It begins with an introduction to DSMs, including how they are represented as binary or numerical matrices. It then discusses the four main types of DSM models (product, organization, process, and parameter) and focuses on process DSM models. The remainder of the document summarizes several research papers that proposed innovations for project scheduling using process DSM methods, including addressing task dependencies, iterations, and information flows.
Query Optimization Techniques in Graph Databases (ijdms)
Graph databases (GDB) have recently arisen to overcome the limits of traditional databases for storing and managing data with graph-like structure. Today, they are a requirement for many applications that manage graph-like data, such as social networks. Most of the techniques applied to optimize queries in graph databases have been used in traditional databases or distributed systems, or are inspired by graph theory. However, their reuse in graph databases should take into account the main characteristics of graph databases, such as dynamic structure, highly interconnected data, and the ability to efficiently access data relationships. In this paper, we survey the query optimization techniques in graph databases. In particular, we focus on the features they have in common.
For non-grid data such as 3D point clouds and meshes, and for inherently graph-based data. Inherently graph-based data includes, for example, brain connectivity analysis, scientific article citation networks, and (social) network analysis.
Alternative download link:
https://www.dropbox.com/s/2o3cofcd6d6e2qt/geometricGraph_deepLearning.pdf?dl=0
Data dissemination and materials informatics at LBNL (Anubhav Jain)
The document summarizes data dissemination and materials informatics work done at LBNL. It discusses several key points:
1) The Materials Project shares simulation data on hundreds of thousands of materials through a science gateway and REST API, with millions of data points downloaded.
2) A new feature called MPContribs allows users to contribute their own data sets to be disseminated through the Materials Project.
3) A materials data mining platform called MIDAS is being built to retrieve, analyze, and visualize materials data from several sources using machine learning algorithms.
A Framework for Automated Association Mining Over Multiple Databases (Gurdal Ertek)
Literature on association mining, the data mining methodology that investigates associations between items, has primarily focused on efficiently mining larger databases. The motivation for association mining is to use the rules obtained from historical data to influence future transactions. However, associations in transactional processes change significantly over time, implying that rules extracted for a given time interval may not be applicable for a later time interval. Hence, an analysis framework is necessary to identify how associations change over time. This paper presents such a framework, reports the implementation of the framework as a tool, and demonstrates the applicability of and the necessity for the framework through a case study in the domain of finance.
http://research.sabanciuniv.edu.
Exploring optimizations for dynamic pagerank algorithm based on CUDA: V3 (Subhajit Sahu)
This is version 3 of my comprehensive viva report, written while doing research work under Prof. Dip Banerjee and Prof. Kishore Kothapalli.
A graph is a generic data structure and a superset of lists and trees. Binary search on a sorted list can be interpreted as search in a balanced binary tree. Database tables can be thought of as indexed lists, and table joins represent relations between columns; this can be modeled with graphs instead. Assigning registers to variables (in a compiler) and assigning available channels to radio transmitters are also graph problems. Finding the shortest path between two points and ranking web pages in order of importance are graph problems as well. Neural networks are graphs too, as are the interactions between messenger molecules in the body and between people on social media.
Introducing Novel Graph Database Cloud Computing For Efficient Data Management (IJERA Editor)
Graph theory stands as a natural mathematical model for cloud networks, and axiomatic cloud theory further defines the cloud with a formal mathematical model. Keeping axiomatic theory as a basis, the paper proposes a bipartite cloud and a graph database model as a suitable database for data management. It is highlighted that perfect matching in the bipartite cloud can enhance searching.
This document provides a survey of various metrics for measuring attributes of the World Wide Web. It discusses metrics for quantifying properties of the web graph structure, significance of web pages, usage characterization, similarity between pages, search and retrieval performance, and information theoretic properties. The document categorizes these metrics and discusses their origins, measurement functions, formulations, and potential applications for improving access to and use of web information.
An adaptive clustering and classification algorithm for Twitter data streamin... (TELKOMNIKA JOURNAL)
Ongoing big data from social network sites like Twitter or Facebook has been an attractive source for investigation by researchers in recent decades, owing to various aspects including timeliness, accessibility, and popularity, though there may be a trade-off in accuracy. Moreover, clustering of Twitter data has caught the attention of researchers; an algorithm that can cluster data within a short computational time, especially for data streaming, is needed. The presented adaptive clustering and classification algorithm for data streaming in Apache Spark overcomes the existing problems in two phases. In the first phase, the pre-processed Twitter input data is clustered using improved Fuzzy C-means clustering, further refined by an adaptive Particle Swarm Optimization (PSO) algorithm, and the clustered data stream is assessed in the Spark engine. In the second phase, the pre-processed Higgs input data is classified using a modified support vector machine (MSVM) classifier with grid-search optimization. Finally, the optimized information is assessed in the Spark engine and the assessed values are used to build the resulting confusion matrix. The proposed work uses the Twitter dataset and the Higgs dataset for data streaming in Apache Spark. Computational experiments demonstrate the superiority of the presented approach over existing methods in terms of precision, recall, F-score, convergence, ROC curve, and accuracy.
Sub-Graph Finding Information over Nebula Networks (ijceronline)
Social and information networks have been extensively studied over the years. This paper studies a new query for subgraph search on heterogeneous networks. Given an uncertain network of N objects, discovering the top-k subgraphs of entities with rare and surprising associations requires efficiently computing all matching subgraphs that satisfy "Nebula computing requests" and then ranking those results by the rarity and interestingness of the associations in the subgraphs. In evaluating top-k selection queries, we compute the information nebula using a global structural-context similarity, and our similarity measure is independent of connection subgraphs. Previous work on the matching problem can be harnessed for a naive rank-after-match approach, but on large graphs subgraphs may have an enormous number of matches. In this paper, we identify several important properties of top-k selection queries and propose novel top-k mechanisms that exploit indexes to answer interesting subgraph queries efficiently.
Converting UML Class Diagrams into Temporal Object Relational DataBase (IJECEIAES)
A number of active researchers and experts are engaged in developing and implementing new mechanisms and features in time-varying database management systems (TVDBMS) in response to the demands of the modern business environment. Time-varying data management has received much attention, using either attribute or tuple time-stamping schemas. Our main approach is to offer a better solution to the stated limitations of existing work, providing nonprocedural data definitions and queries of temporal data, together with as complete a technical conversion as possible that makes it easy to realize and share all conceptual details of the UML class specifications from a conception and design point of view. This paper contributes a representation of a logical design schema by UML class diagrams, which are annotated with stereotypes to express a temporal object-relational database with attribute timestamping.
Formal Models for Context Aware Computing
Context-aware computing refers to a general class of mobile systems that can sense their physical environment, and
adapt their behavior accordingly. In this paper we seek to develop a systematic understanding of context-aware computing by
constructing a formal model and notation for expressing context-aware computations. This discussion is followed by a
description and comparison of current context modeling and reasoning techniques.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and Long Short-Term Memory (LSTM) algorithms. We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared with natural aggregate (NA) pavement, RCA pavement has been the subject of fewer comprehensive studies and sustainability assessments.
ACEP Magazine 4th edition launched on 05.06.2024
This document provides information about the third edition of the magazine "Sthapatya" published by the Association of Civil Engineers (Practicing) Aurangabad. It includes messages from current and past presidents of ACEP, memories and photos from past ACEP events, information on lifetime achievement awards given by ACEP, and a technical article on concrete maintenance, repairs and strengthening. The document highlights the activities of ACEP and provides a technical educational article for members.
Comparative analysis between traditional aquaponics and reconstructed aquapon...
The aquaponic system of planting is a method that does not require soil. It needs only water, fish, lava rocks (a substitute for soil), and plants. Aquaponic systems are sustainable and environmentally friendly. Their use not only allows planting in small spaces but also helps reduce artificial chemical use and minimizes excess water use, as aquaponics consumes 90% less water than soil-based gardening. The study applied a descriptive and experimental design to assess and compare conventional and reconstructed aquaponic methods for reproducing tomatoes. The researchers created an observation checklist to determine the significant factors of the study. The study aims to determine the significant difference between traditional and reconstructed aquaponics systems in propagating tomatoes in terms of height, weight, girth, and number of fruits. The reconstructed aquaponics system's higher growth yield results in a much more nourished crop than the traditional aquaponics system. It is superior in its number of fruits, height, weight, and girth measurements. Moreover, the reconstructed aquaponics system is proven to eliminate all the hindrances present in the traditional aquaponics system: overcrowding of fish, algae growth, pest problems, contaminated water, and dead fish.
Batteries -Introduction – Types of Batteries – discharging and charging of battery - characteristics of battery –battery rating- various tests on battery- – Primary battery: silver button cell- Secondary battery :Ni-Cd battery-modern battery: lithium ion battery-maintenance of batteries-choices of batteries for electric vehicle applications.
Fuel Cells: Introduction- importance and classification of fuel cells - description, principle, components, applications of fuel cells: H2-O2 fuel cell, alkaline fuel cell, molten carbonate fuel cell and direct methanol fuel cells.
Introduction- e - waste – definition - sources of e-waste– hazardous substances in e-waste - effects of e-waste on environment and human health- need for e-waste management– e-waste handling rules - waste minimization techniques for managing e-waste – recycling of e-waste - disposal treatment methods of e- waste – mechanism of extraction of precious metal from leaching solution-global Scenario of E-waste – E-waste in India- case studies.
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024)
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) offers a premier global platform for exchanging insights and findings in the theory, methodology, and applications of NLP, Artificial Intelligence, Machine Learning, and their applications. The conference seeks substantial contributions across all key domains of NLP, Artificial Intelligence, Machine Learning, and their practical applications, aiming to foster both theoretical advancements and real-world implementations. With a focus on facilitating collaboration between researchers and practitioners from academia and industry, the conference serves as a nexus for sharing the latest developments in the field.
A review on techniques and modelling methodologies used for checking electrom...
The proper function of the integrated circuit (IC) in an inhibiting electromagnetic environment has been a serious concern throughout the decades of revolution in the world of electronics, from discrete devices to today's integrated circuit technology, where billions of transistors are combined on a single chip. The automotive industry, and smart vehicles in particular, is confronting design issues such as susceptibility to electromagnetic interference (EMI). Electronic control devices compute incorrect outputs because of EMI, and sensors give misleading values, which can prove fatal in automotives. In this paper, the authors present a non-exhaustive review of research work concerned with the investigation of EMI in ICs and the prediction of this EMI using various modelling methodologies and measurement setups.
Kharatti Lal, Int. Journal of Engineering Research and Applications, ISSN: 2248-9622, Vol. 6, Issue 2 (Part 5), February 2016, pp. 22-24. www.ijera.com
Baroclinic Channel Model in Fluid Dynamics
Kharatti Lal
Deptt. of Applied Science (Mathematics), Govt. Millennium Polytechnic College, Chamba, Himachal Pradesh - 176310
ABSTRACT
A complex flow structure is studied using a two-dimensional baroclinic channel model. The unsteady Navier-Stokes equations, coupled with equations for thermal energy and salinity and the equation of state, are implemented. System closure is achieved through a modified Prandtl's mixing-length formulation of turbulence dissipation. The model is applied in a region where the fluid flow is affected by various forcings; in this case the flow is in an estuarine region affected by the diurnal tide and the fresh-water inflow into the estuary, and a submerged structure is considered, giving possible insight into stress effects on submerged structures. The results show that the time evolution of the vertical velocity along the downstream edge changes sign from negative to positive. As the dike length increases, the primary cell splits and the flow becomes turbulent due to the non-linear effect caused by the dike. These findings are found to agree favourably with results published in the open literature.
Key Words: Submerged dike, estuarine dynamics, Baroclinic numerical model, Turbulent dike.
I. INTRODUCTION
The two-dimensional baroclinic model predicts flow in a vertical plane with horizontal and vertical momentum equations. From an economic point of view, estuarine dynamics has gained increasing importance. The exploration for oil in this region has brought about an interesting feature of interaction between water waves and estuarine structures. Physical models are not viable from the economic and man-hours point of view, and analytical models do not give an exact solution to these problems. The study of flow characteristics using a numerical model has therefore become an important tool for gaining insight into dynamical stability.
Accurate predictions of wave transformation due to a submerged dike can be achieved by non-linear shallow-water wave theories [1] using the boundary integral equation method. A similar approach is applied in Ref. [2] for the transformation of a periodic wave crossing over the dike. The main feature characterizing this interaction is the generation of vortices in the flow field around marine objects due to wave-body interaction and the subsequent scouring effects. Numerical modelling has been confined mainly to solitary-wave transformation across a submerged dike and the corresponding flow fields [3, 4], with some observational studies [5]. The present study extends the methodology to the case of a complex resultant forcing of fresh-water flow and the semi-diurnal tide existing in estuaries. Turbulence changes are accounted for as a function of vertical density stratification, as salinity and thermal-energy variations are substantial along the estuary. The model uses a free-surface boundary condition to bring in a realistic semi-diurnal tidal forcing; the downstream boundary conditions are based on the specification of conditions on an incoming characteristic, and disposable parameters are used to represent the forcing function, which consists of the strength of the fresh-water inflow and the amplitude of the semi-diurnal tide. A notable feature is the use of an upstream scheme for the horizontal advection term in the transport equations. The numerical investigations delve into the flow characteristics in the vertical, the flow evolution, and a possible insight into scouring effects at the bottom.
Web content mining and web structure mining, together with clustering, classification and association-rule finding, are the techniques used for almost all data mining and web mining tasks. Other techniques are sequential patterns, regression, deviation detection and other statistical and machine learning methods.
The process of identifying natural groupings: This paper focuses on clustering, the process of finding natural groupings of objects. Most of the existing methods for document clustering are based on either probabilistic methods or distance-based methods such as k-means analysis, hierarchical clustering [10] and nearest-neighbour clustering [6], which use a selected set of words appearing in different documents as features. The k-means algorithm is a straightforward method for clustering data [6]. The basic procedure of the k-means method typically begins with assigning each data item to a cluster; the number of clusters, say k, is provided a priori by the user. Next, the cluster centres are calculated as the centroids of the data items in each cluster. After that, a new assignment of the data items to clusters is computed by assigning them to their closest centre according to some distance measure. Different distance measures, or web metrics, are well explained
in [1]. Web metrics are divided into six categories based on their applications.
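The basic k-means procedure just described can be made concrete with a short sketch. This is a minimal illustration, assuming dense NumPy feature vectors and Euclidean distance; the function and variable names are ours, not the paper's.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means: X is an (n, d) array of item vectors, and k, the
    number of clusters, is supplied a priori by the user as the paper notes."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen items as the initial cluster centres.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each item to its closest centre under Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centre as the centroid of its assigned items.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # assignments have stabilized
        centers = new_centers
    return labels, centers
```

For the web-document setting discussed here, X would hold the documents' feature vectors, and the Euclidean distance could be swapped for any of the distance measures surveyed in [1].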
The second approach uses similarity based on the structure of the web page. There are several techniques that can be used to detect layout structure in HTML documents. Tables and frames are two components that are commonly used to organize the contents of web documents; thus a table or frame detection technique is required to give a view of the web page layout for the analysis process. Many developers prefer tables rather than frames for designing web page layouts. Even though frames can provide added context and consistency during navigation, they have several serious problems related to screen real estate, page model, display speed and the complexity of web design [8]. Therefore, several researchers have reported work in table mining due to the efficiency and popularity of tables for web page layout structure [11, 12]. Many other approaches, such as hybrid and partition-based clustering based on document structure [9, 10], have also been developed by researchers using one or another combination of the basic clustering techniques mentioned above. Other approaches, which use neural networks [7] and soft computing techniques [6], are also on their way to success. More about advances in web mining can be found in [5].
(A) CLUSTERING METHOD
Web documents clustering using hyperlink structures: The World Wide Web has a rich structure: it contains both textual web documents and the hyperlinks that connect them. The web documents and the hyperlinks between them form a directed graph in which the web documents can be viewed as vertices and the hyperlinks as directed edges. Algorithms have been developed utilizing this directed graph to extract the information contained in a collection of hyperlinked web documents. Kleinberg proposed the HITS algorithm, based purely on hyperlink information, to retrieve the most relevant authority and hub documents for a user query (Kleinberg (1998)). However, if the hypertext collection consists of several topics, the authority and hub documents may cover only the most popular topics and leave out the less popular ones. One way to remedy this situation is to first partition the hypertext collection into topical groups and present the search results to the user as a list of topics. This leads to the need to cluster web documents based on both textual and hyperlink information. In this model, the similarity metric used for clustering web pages is based on hyperlink structure, textual information and co-citation patterns. The link information is obtained directly from the link graph. Given a directed link graph G = (V, E), we define the matrix A to be:
A_ij = 1 if (i, j) ∈ E or (j, i) ∈ E
A_ij = 0 otherwise
A is the adjacency matrix of the link graph, where the directionality of the hyperlinks is ignored. Link structure alone provides rich information on the topics of the document collection.
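A direct reading of this definition is easy to implement: build A from the edge list and set both A[i, j] and A[j, i], since directionality is ignored. The sketch below is illustrative, assuming documents are indexed 0 to n-1; the names are ours.

```python
import numpy as np

def undirected_adjacency(n, edges):
    """Adjacency matrix A of the link graph G = (V, E) with direction ignored:
    A[i, j] = 1 if (i, j) or (j, i) is a hyperlink, 0 otherwise."""
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        A[i, j] = 1
        A[j, i] = 1  # directionality of the hyperlink is ignored
    return A

# Example: four documents, three hyperlinks.
A = undirected_adjacency(4, [(0, 1), (2, 1), (3, 0)])
```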
(B) The Factor of the Textual Information:
The next factor included is the textual information (S) of the web pages. Textual information can be included to better cluster the web documents. Moreover, compared to printed literature, web documents reference each other more randomly; this is another reason the text information is incorporated, in order to regulate the influence of the documents. Each web document is represented as a vector in the vector-space model of information retrieval (IR), and the similarity between documents is then computed. The higher the similarity, the more likely the two documents deal with the same topic. For each element of the vector we use the standard tf × idf weighting, tf(i, j) × idf(i), where tf(i, j) is the term frequency of word i in document j and idf(i) is the inverse document frequency corresponding to word i, defined as:
idf(i) = log(total number of documents / number of documents containing word i)
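The tf·idf weighting and the vector-space similarity just described can be sketched as follows. This is a minimal illustration under the definitions above, assuming documents arrive as token lists; the helper names are ours.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {word: tf * idf} vector per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing word i.
    df = Counter(word for doc in docs for word in set(doc))
    idf = {w: math.log(n_docs / df[w]) for w in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency tf(i, j) of word i in document j
        vectors.append({w: tf[w] * idf[w] for w in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {word: weight} vectors;
    the higher the value, the more likely the documents share a topic."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```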
Co-citation (C) is another metric to measure the relevance of two web documents: if there are many documents pointing to both of them, then the two documents are likely to address a similar issue. The co-citation pattern is used by He et al. [4]. The co-citation C_ij of documents i and j is the number of web documents pointing to both i and j.
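Counting co-citations reduces to a matrix product: if L is the directed link matrix with L[k, i] = 1 when document k points to i, then (LᵀL)[i, j] is the number of documents pointing to both i and j. A minimal sketch under that reading; the names are ours.

```python
import numpy as np

def cocitation(L):
    """C[i, j] = number of documents pointing to both i and j,
    given the directed link matrix L (L[k, i] = 1 iff k links to i)."""
    C = L.T @ L
    # The diagonal holds each document's in-degree, not a co-citation;
    # zeroing it is a design choice, not something the paper specifies.
    np.fill_diagonal(C, 0)
    return C
```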
Incorporating the above information into the similarity metric, we form the weight matrix of the graph:
W = α (A ⊗ S) / ||A ⊗ S||_2 + (1 − α) C / ||C||_2
where α is a weight ranging from 0 to 1.
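Putting the pieces together, the weight matrix W can be sketched as below. We read ⊗ as the element-wise product of A and the textual-similarity matrix S, which is an assumption about the paper's notation, and use the matrix 2-norm for the normalization as written.

```python
import numpy as np

def weight_matrix(A, S, C, alpha=0.5):
    """W = alpha * (A ⊗ S)/||A ⊗ S||_2 + (1 - alpha) * C/||C||_2.
    ⊗ is taken here as the element-wise (Hadamard) product -- an assumption."""
    AS = A * S                  # combine link structure with text similarity
    n1 = np.linalg.norm(AS, 2)  # spectral (2-)norm used for normalization
    n2 = np.linalg.norm(C, 2)
    term1 = AS / n1 if n1 else AS
    term2 = C / n2 if n2 else C
    return alpha * term1 + (1.0 - alpha) * term2
```

Setting alpha closer to 1 emphasizes the link-and-text signal; closer to 0, the co-citation signal.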
(C) Web documents clustering based on their structure
This approach is based on the assumption that links sharing layout and presentation properties usually point to pages that are structurally similar, so the set of layout and presentation properties associated with the links of a page can be used to characterize the structure of the page itself. In other words, two or more pages are considered structurally similar whenever they contain links that share the same layout and presentation properties. There are several techniques that can be used to detect layout structure in HTML documents.
Tables and frames are two components that are commonly used to organize the contents of web documents. Thus, a table or frame detection technique is required to give a view of the web page layout for the analysis process. Many developers prefer tables rather than frames for designing web page layouts. Even though frames can provide added context and consistency during navigation, they have several serious problems related to screen real estate, page model, display speed and the complexity of web design [4]. Therefore, several researchers have reported work in table mining due to the efficiency and popularity of tables for web page layout structure [15]. A model to abstract structure from a web site, based on the main idea that layout and presentation properties associated with links can characterize the structure of a page, can be found in [13]. [13] gives a site-model generation algorithm for finding the structure and clustering pages according to their similarity. The model tries to find the schema of the web site based on analysis of the HTML code of the web site. From this analysis, page schemas, page classes and class links are obtained, and the distances between the web pages are calculated. The distance between schemas is defined as the normalized cardinality of the symmetric set difference between the two schemas. Namely, let G_i and G_j be the schemas of groups i and j; then:
Dist(G_i, G_j) = |(G_i − G_j) ∪ (G_j − G_i)| / |G_i ∪ G_j|
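With each schema represented as a set of structural features, this distance is a short computation; the symmetric set difference in the numerator is exactly Python's ^ operator on sets. A minimal sketch under that representation; the names are ours.

```python
def schema_distance(g_i, g_j):
    """Dist(G_i, G_j) = |(G_i - G_j) ∪ (G_j - G_i)| / |G_i ∪ G_j|,
    with each schema given as a set of structural features."""
    union = g_i | g_j
    if not union:
        return 0.0
    return len(g_i ^ g_j) / len(union)  # ^ is the symmetric set difference

# Example: 3 differing features out of 4 total gives a distance of 0.75.
d = schema_distance({"table", "nav", "form"}, {"table", "list"})
```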
II. CONCLUSION
Both of these methods were implemented and initial results were obtained. In the initial phase, about 150 web pages were clustered. The method based on hyperlink structure, textual information and the co-citation matrix (method 1) gave better results than the second method. The first method clusters the web pages not only by content but also by related topic. The second method clusters only on the basis of the HTML schema, which is usually shared within an organization or among pages giving similar types of information; for example, web pages related to events such as sports or news have the same page structure. In the first method the calculation of the clusters is more complicated than in the second. So, when pages from a company or organization, or pages for particular events, which usually have a similar structure, need to be clustered, the second method is recommended, whereas if pages on a particular topic are needed, the first method gives the correct result.
REFERENCES
[1.] L. Rosenfeld and P. Morville, "Information Architecture for the World Wide Web", O'Reilly, Sebastopol, CA, USA, 1998.
[2.] Yue Xu, "Hybrid Clustering with Application to Web Mining", IEEE (2005).
[3.] M. Hurst, "Layout and Language: Challenges for Table Understanding on the Web", Proceedings of the First International Workshop on Web Document Analysis (WDA 2001), Seattle, Washington, USA, 2001.
[4.] Xiaofeng He, Hongyuan Zha, Chris H. Q. Ding and Horst D. Simon, "Web document clustering using hyperlink structures", Computational Statistics and Data Analysis 41 (2002) 19-45.
[5.] Yi-Hong Dong, "A Novel Competitive Neural Network for Web Mining", Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, August 2004.
[6.] H. H. Chen, C. T. Shih and H. T. Jin, "Mining Tables from Large Scale HTML Texts", Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), Saarbrücken, Germany, 2000.
[7.] Valter Crescenzi, Paolo Merialdo and Paolo Missier, "Clustering Web Pages Based on Their Structure", Data and Knowledge Engineering 54 (2005) 279-299.