DTFCA is an approach that uses decision tree clustering to reduce tuple reconstruction time in column-stores databases. It exploits decision tree algorithms to cluster frequently accessed attributes together based on an attribute usage matrix. This clusters attributes into projections so that tuples can be reconstructed more efficiently. Experiments on TPC-H data show that DTFCA lowers tuple reconstruction time compared to traditional methods, with execution time inversely proportional to the minimum support threshold used for clustering. DTFCA provides a way to organize attributes into projections that reflects actual query access patterns and correlations.
DECISION TREE CLUSTERING: A COLUMN-STORES TUPLE RECONSTRUCTION

Column-stores have gained market share as a promising physical storage alternative for analytical queries. However, for multi-attribute queries column-stores pay performance penalties due to on-the-fly tuple reconstruction. This paper presents an adaptive approach for reducing tuple reconstruction time. The proposed approach exploits a decision tree algorithm to cluster the attributes of each projection and also eliminates frequent database scanning. Experimentation with TPC-H data shows the effectiveness of the proposed approach.
Therefore, researching and designing a time-effective tuple reconstruction approach adapted to column-stores is of great significance. C-Store exploits projections to support tuple reconstruction: a relation is divided, according to the attributes required by queries, into sub-relations called projections. Each attribute may appear in more than one projection and may be stored in a different sort order in each. Since no single projection contains all the attributes, tuple reconstruction pays a penalty [4].
The decision tree algorithm is a popular approach for classifying data into various classes. It requires a purity function to partition the data space into the different classes. However, since datasets for clustering have no pre-assigned labels, the decision tree partitioning approach is not directly applicable to clustering. CLTree carries out the partitioning of the data space into cluster and empty regions, and efficiently finds clusters in large, high-dimensional spaces [15]. The proposed approach inherits CLTree to reduce tuple reconstruction time by clustering frequently used attributes with the available projection techniques.
It has been observed that attribute projection plays a vital role in tuple reconstruction. The proposed approach, the Decision Tree Frequent Clustering Algorithm (DTFCA), uses some existing projection techniques to reduce tuple reconstruction time. These techniques are discussed in Section 2. Notations and terminology needed to review the correlation among query attributes are given in Section 3. The methodology underlying DTFCA is presented in Section 4, and Section 5 gives a detailed description of DTFCA. To evaluate DTFCA, experimental data based on the TPC-H schema is used as input in a suitable experimental environment. The experimental results are analyzed and discussed in Section 6. Finally, Section 7 presents concluding remarks.
2. RELATED WORK
All column-stores databases require tuple reconstruction to process multi-column queries. The literature reveals that much work has been done on materialization strategies and their trade-offs [2]. The column-stores databases C-Store and MonetDB have gained popularity due to their good performance on analytical queries [4, 5]. Projections are exploited to logically organize the attributes of the base table, and multiple attributes are involved in each projection. The objective of projections is to reduce tuple reconstruction time. MonetDB uses late tuple materialization. Although partitioning strategies do not guarantee a better projection, a good solution is to cluster attributes into sub-relations based on usage patterns [5]. The CLTree clustering technique performs clustering by partitioning the data space into dense and sparse regions [15].
The Bond Energy Algorithm (BEA) is used to cluster attributes based on an Attribute Usage Matrix [1, 10, 11]. MonetDB proposed a self-organizing tuple reconstruction strategy in a main-memory structure called the cracker map [5, 7]. Dividing and combining the pieces of an existing cracker map optimizes query performance. But for large databases, the cracker map pays a performance penalty due to the high cost of maintaining it in memory [5]. Therefore, the cracker map is only suited to main-memory database systems such as MonetDB.
3. DEFINITIONS AND NOTATIONS
In relational Online Analytical Processing (OLAP) applications, the common format of a query over the fact table(s) and dimension tables is as follows:
Select <attribute list>
From <Table list>
Where <condition>
Group by <attribute list>
Order by <attribute list>
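For instance, a query of this form over the TPC-H schema might look as follows (a hypothetical illustration, not one of the fourteen workload queries used in Section 6):

Select O_orderdate, L_orderkey
From Orders, Lineitem
Where O_orderkey = L_orderkey
Group by O_orderdate, L_orderkey
Order by O_orderdate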
Definition 1 (Strong Correlation)
k attributes A1, A2, …, Ak of relation R are strongly correlated if and only if they appear in conjunction in the conditional clause and in the target list of an access query.

Definition 2 (Weak Correlation)
In a collection of accesses A = {a1, a2, …, am}, every ai (1 ≤ i ≤ m) is an access made by a query. If the correlation among the query target attributes is smaller than P, where P is a threshold, the attributes are considered weakly correlated.
In column-stores, two critical tasks are needed to process a query: first, determine the qualifying rows based on the conditions in the Where clause and generate a set of row identifiers; second, merge the qualifying columns on identical row identifiers and construct the target rows according to the target expression in the Select clause. Attributes that appear in conjunction in the conditional clause and in the query target are considered strongly correlated. The strong correlations in the access list drive the creation of the frequent attribute set.
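As an illustration of Definition 2, the following Python sketch measures the correlation of a pair of target attributes as its relative co-occurrence frequency over the collection of accesses A. The measure itself is an assumption made for illustration, since the exact correlation measure is left open here:

from collections import Counter
from itertools import combinations

def weakly_correlated_pairs(accesses, P):
    # accesses: one set of target-attribute names per access a1, ..., am
    pair_count = Counter()
    for attrs in accesses:
        for pair in combinations(sorted(attrs), 2):
            pair_count[pair] += 1
    m = len(accesses)
    # pairs whose correlation falls below the threshold P (Definition 2)
    return [pair for pair, n in pair_count.items() if n / m < P]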
4. METHODOLOGY
This section covers the search space for DTFCA.
4.1 Decision Tree Construction
A divide-and-conquer strategy is used to recursively partition the data and build the decision tree [15]. Each successive step chooses the cut that partitions the space into purer regions. A commonly used criterion for choosing the best cut is minimum support: every possible value (cut point) of all projected attributes is evaluated to find the cut point that gives the best gain (Figure 1).
for each attribute Ai ∈ {A1, A2, …, Ad} of the dataset D do
    for each value x of Ai in D do
        /* each value is considered as a possible cut */
        compute the information gain at x
    end
end
select the cut that gives the best information gain and satisfies the minimum support

Figure 1: The procedure to find the appropriate cluster
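A minimal Python sketch of this cut-selection procedure is given below. It assumes an entropy-based information gain computed over points labelled as real data versus assumed uniformly distributed empty points, which is how CLTree turns clustering into a decision tree problem [15]; it is an illustration, not the authors' implementation:

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a multiset of labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(points, labels, dim):
    # Evaluate every value of attribute `dim` as a candidate cut point and
    # return the cut giving the best information gain (Figure 1).
    base = entropy(labels)
    best_gain, best_x = -1.0, None
    for x in sorted({p[dim] for p in points}):
        left = [l for p, l in zip(points, labels) if p[dim] <= x]
        right = [l for p, l in zip(points, labels) if p[dim] > x]
        if not left or not right:
            continue  # a valid cut must split the space
        n = len(labels)
        gain = base - ((len(left) / n) * entropy(left)
                       + (len(right) / n) * entropy(right))
        if gain > best_gain:
            best_gain, best_x = gain, x
    return best_x, best_gain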
4.2 Determining Cluster Attributes
This sub-section focuses on determining the attributes for tuple reconstruction. A cluster is created for each data point in the original dataset, and attributes whose support is greater than the minimum support are introduced. The number of attributes to be added to a cluster E is determined by the following rule:

If the number of attributes inherited from the parent cluster projection is less than the number of projected attributes in E then
    increase the number of inherited attributes to the number of projected attributes in E
else
    use the inherited attributes for E as they are
The recursive partitioning method would divide the data space until each cluster contains attributes of only a single class, resulting in a very complex tree that partitions the space more than necessary. Hence the decision tree needs to be pruned to produce meaningful clusters. This problem is similar to pruning in classification; however, the pruning methods used for classification cannot be applied to clustering. Subjective measures are required for pruning, since clustering is, to a certain extent, a subjective task [12, 13]. The tree is pruned using the minimum support values. After pruning, the algorithm summarizes the clusters by retrieving only the retained attributes.
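The pruning step is not spelled out in code here; under the assumption that each tree node carries an attribute group and the access-frequency support of that grouping, a bottom-up prune by minimum support could be sketched as:

class Node:
    def __init__(self, attrs, support, children=None):
        self.attrs = attrs          # attributes grouped at this node
        self.support = support      # access-frequency support of the grouping
        self.children = children or []

def prune(node, min_sup):
    # discard subtrees whose support falls below min_sup
    node.children = [prune(c, min_sup) for c in node.children
                     if c.support >= min_sup]
    return node

def clusters(node):
    # after pruning, each remaining leaf yields one attribute cluster
    if not node.children:
        return [node.attrs]
    return [a for c in node.children for a in clusters(c)]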
4.3 Scaling Up the Decision Tree Algorithm

For the decision tree, the whole dataset must reside in memory. For large datasets, SPRINT, a scalable decision tree technique, eliminates the need to keep the entire dataset in memory [23]. A statistical technique for constructing a consistent tree from a small subset of the data is discussed in BOAT [22]. The proposed algorithm inherits these techniques to determine the clusters.
5. DECISION TREE FREQUENT CLUSTERING ALGORITHM (DTFCA)
This section describes the implementation of DTFCA on the MonetDB analytic platform [5]. DTFCA is used to determine the clustering of the frequent attribute set. The DTFCA algorithm combines multiple iterations of the original loop into a single loop body and rearranges the frequent attribute set.

Let each relation R be grouped into several projections based on empirical knowledge and users' requirements. After the system has been used for a period, a set of accesses A = {a1, a2, …, am} is collected by the system. DTFCA uses this set of accesses to cluster the frequently projected attribute sets that are strongly correlated. The input to DTFCA is a variant of the AUM (Table 1). Each element in the output forms the attributes of one projection. The DTFCA clustering results truly reflect the strong correlativity between attributes implied in the previously collected query accesses, and conform to the meaning of projection in column-stores.
function DTFCA(String S)
{
    /* Decision Tree Frequent Clustering Algorithm */
    Input:  a relation R(U = {A[1], A[2], …, A[n]});
            A = {a1, a2, …, am}, a collection of strongly correlated accesses;
            min_sup, the support threshold.
    Output: clusters of strongly correlated attributes Cl1, Cl2, …, Clk,
            each holding support no smaller than min_sup.

    for each access in A                // process one element per iteration
    {
        compute the frequent attribute set for support > min_sup;
                                        // limit for the traversal
        string ResF[200];
        string ResT[200];
        for i = 0 to N-1
        {
            ResT[i] = A[i].getName();   // item name of the tree node for the
                                        // frequent attribute set
            strcat(ResF, ResT);         // concatenate columns to build the tuples
        }
        visit the matching built tuple to compare keys and produce the output tuple;
    }
}
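Read as a whole, the loop amounts to mining the attribute sets whose co-occurrence across the collected accesses reaches min_sup and emitting each maximal such set as one projection cluster. A runnable Python sketch of that outer loop, assuming each access is represented simply by its set of target attributes, is:

from collections import Counter
from itertools import combinations

def dtfca(accesses, min_sup):
    # count the co-occurrence of every candidate attribute set across A
    support = Counter()
    for attrs in accesses:                  # one element per iteration
        for k in range(2, len(attrs) + 1):
            for combo in combinations(sorted(attrs), k):
                support[combo] += 1
    frequent = [set(c) for c, n in support.items() if n >= min_sup]
    # keep only maximal frequent sets: these form the projection clusters
    return [c for c in frequent
            if not any(c < other for other in frequent)]

For example, dtfca([{'A1', 'A2'}, {'A1', 'A2', 'A4'}, {'A3'}], min_sup=2) returns [{'A1', 'A2'}].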
6. EXPERIMENT DETAILS
The objective of the experiment is to compare the execution time of the existing tuple reconstruction method with that of DTFCA for TPC-H schema queries on a column-stores DBMS.
6.1. Experimental Environment
Experiments are conducted on a 2.20 GHz Intel® Core™2 Duo processor with 2 MB cache, 1 GB memory, and a 5400 RPM hard drive, running MonetDB, a column-oriented database, on the Windows® XP operating system.
6.2. Experimental Data
The TPC-H data set, produced by the TPC-H data generator, is used as the experimental data set. Given the TPC-H schema, fourteen different queries are accessed 140 times during a given window period.
6.3. Experimental Analysis
Let us consider a relation with four attributes, namely S_supplierkey (A1), P_partkey (A2), O_orderdate (A3), and L_orderkey (A4). These attributes are accessed by fourteen different queries of the TPC-H schema under varying minimum support. For each attribute, the AUM records the access frequency generated by the system. The access frequency is derived from three parameters: minimum access frequency (Min_fq), maximum access frequency (Max_fq), and average access frequency (Avg_fq). The minimum access frequency of query attributes is computed for the initial accesses of the attributes; the maximum access frequency is computed on the system with more processing of queries; and the average frequency is computed on the system with less processing of queries. DTFCA uses the access frequency to cluster the frequent attribute-set, i.e. the set whose frequency is no smaller than the minimum support.
Table 1: Attribute Usage Matrix

Attribute           Freq.    Q2      Q3      Q4      Q5      Q6      Q7      Q8      Q9      Q10     Q12     Q16     Q17     Q19     Q21
S_supplierkey (A1)  Min_fq   1       0       0       1       0       1       1       0       0       0       0       0       0       0
                    Max_fq   1       0       1       0       1       0       0       1       0       1       0       1       0       1
                    Avg_fq   1       0       0       0       0       0       0       0       0       0       0       0       0       0
                    Acc_fq   100123  0       331     331     331     331     331     331     0       331     0       331     0       331
P_partkey (A2)      Min_fq   1       0       1       0       1       0       0       1       0       0       1       0       1       0
                    Max_fq   1       1       0       0       1       1       1       1       0       1       1       1       1       1
                    Avg_fq   1       0       0       0       1       0       0       1       0       0       1       1       1       0
                    Acc_fq   100123  331     331     0       100123  331     331     100123  0       331     100123  6612    100123  331
O_orderdate (A3)    Min_fq   0       1       1       1       0       0       1       0       1       0       0       0       0       0
                    Max_fq   0       1       1       1       0       1       1       0       0       1       1       1       1       0
                    Avg_fq   0       1       1       1       0       0       1       0       0       0       0       0       0       0
                    Acc_fq   0       100123  100123  100123  0       331     100123  0       331     331     331     331     331     0
L_orderkey (A4)     Min_fq   1       0       1       0       0       1       1       0       0       1       0       0       1       0
                    Max_fq   0       1       1       1       0       1       1       0       1       1       1       0       1       1
                    Avg_fq   0       0       1       0       0       1       1       0       0       1       1       0       1       0
                    Acc_fq   331     331     100123  331     0       100123  100123  0       331     100123  6612    0       100123  331
We now explain the execution of the DTFCA algorithm for different cases of minimum support.
We denote an attribute frequency that is a candidate for the frequent attribute-set by (int_val)xyz,
where int_val is the integer value of the access frequency and x, y, and z are the case variable
numbers. For a given minimum support, the frequent attribute-set is a combination of the four
attributes (A1, A2, A3, and A4).
Case Study
Let Case 1, Case 2, and Case 3 denote three minimum-support case variables, with minimum
support values of 30, 50, and 60 respectively. Under Case 1, the user retains control over
determining the strong correlation among attributes, and hence over formulating the frequent
attribute-set. Referring to Table 1, Case 1 generates more frequent attribute-sets than the other
cases: of fourteen pairs or patterns, thirteen are frequent. This result may be considered moderate
and reduces tuple reconstruction time for column-stores. Under Case 2 and Case 3, however, the
generated frequent attribute-sets number four and one respectively, and the clustering result of
DTFCA is more in conformity with the idea of a projection in column-stores.
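The threshold effect behind these three cases can be sketched as follows; the pattern supports below are invented for illustration (the paper's actual counts are thirteen, four, and one).

    # Hedged sketch of the three cases: sweep min_sup over 30, 50 and 60
    # and count the surviving frequent patterns. Supports are invented.
    pattern_support = {
        ("A1", "A2"): 62, ("A2", "A3"): 55, ("A2", "A4"): 48,
        ("A3", "A4"): 41, ("A1", "A3"): 35, ("A1", "A4"): 28,
    }
    for min_sup in (30, 50, 60):
        frequent = [p for p, s in pattern_support.items() if s >= min_sup]
        print(min_sup, len(frequent))    # the count shrinks as min_sup grows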
The experiments are performed on TPC-H queries with the same selectivity and varying
minimum support. As shown in Figure 2 (the horizontal axis denotes the minimum support, the
vertical axis the execution time), the execution time of the TPC-H schema queries is inversely
proportional to the minimum support, and the system may perform well for query attributes that
are strongly correlated (refer to Section 3). As shown in Figure 3, the execution time under
DTFCA is much lower than under the traditional tuple reconstruction method; hence DTFCA can
minimize tuple reconstruction time.
Figure 2: Minimum Support Time Cost
Figure 3: Tuple Reconstruction Time Cost
7. CONCLUSION
The DTFCA approach exploits a decision tree to cluster the frequently accessed attributes of a
relation. All the attributes in each cluster form a projection. The experiments show that the
proposed approach is beneficial for clustering projections in column-stores and hence reduces
tuple reconstruction time. The output of DTFCA is not a partition but a group of attributes for
clustering into column-stores.
Authors
Tejaswini Apte received her M.C.A. (CSE) from Banasthali Vidyapith, Rajasthan. She is
pursuing research in the area of Machine Learning at DAU, Indore (M.P.), INDIA.
Presently she is working as an ASSISTANT PROFESSOR at the Symbiosis Institute of
Computer Studies and Research, Pune. She has 4 papers published in
International/National Journals and Conferences. Her areas of interest include Databases,
Data Warehouses, and Query Optimization.
Mrs. Tejaswini Apte has a professional membership of ACM.
Maya Ingle did her Ph.D. in Computer Science from DAU, Indore (M.P.), M.Tech. (CS)
from IIT Kharagpur, INDIA, a Post Graduate Diploma in Automatic Computation from
the University of Roorkee, INDIA, and M.Sc. (Statistics) from DAU, Indore (M.P.), INDIA.
She is presently working as PROFESSOR, School of Computer Science and Information
Technology, DAU, Indore (M.P.), INDIA. She has over 100 research papers published in
various International/National Journals and Conferences. Her areas of interest include
Software Engineering, Statistical Natural Language Processing, Usability Engineering,
Agile Computing, and Object Oriented Software Engineering.
Dr. Maya Ingle has a lifetime membership of ISTE, CSI, and IETE. She received the Best Teaching
Award from the All India Association of Information Technology in Nov 2007.
Dr. Arvind Goyal did his Ph.D. in Physics from Bhopal University, Bhopal (M.P.),
M.C.M. from DAU, Indore, and M.Sc. (Maths) from U.P. He is presently working as
SENIOR PROGRAMMER, School of Computer Science and Information Technology,
DAU, Indore (M.P.). He has over 10 research papers published in various International/
National Journals and Conferences. His areas of interest include databases and data
structures. Dr. Arvind Goyal has a lifetime membership of the All India Association of
Education Technology.