CHAPTER 1: INTRODUCTION
Modern storage and processor technology makes the accumulation of data
increasingly easy. Prediction and data analysis tools must be developed and adapted
rapidly to keep up with the dramatic growth in data volume. Data mining aims to extract
information from large databases while taking the storage structure into account.
Some data mining techniques have been developed recently with the exploration of large
databases in mind, such as Association Rule Mining, while others are adaptations of
traditional methods that have long been used in statistics and machine learning, such as
classification and clustering. Classification is the process of predicting membership of a
data point in one of a finite number of classes. The attribute that indicates the class
membership is called the class label attribute. Clustering, in contrast, has the goal of
identifying classes in data without a predefined class label. This thesis focuses on
classification and clustering techniques that can be formulated in terms of kernel-density
estimates, i.e., algorithms in which the density of all or a subset of data points is calculated
from the data set through convolution with a kernel function. Many statistics and machine
learning techniques can be viewed in this fashion. The algorithms in this thesis were
developed to suit the concept of P-trees [1], which can be seen as representing the limiting
case of a database model called vertical partitioning. Most database systems are organized
in record-oriented data structures, referred to as horizontal partitioning. Vertical
partitioning, the organization of tables by column, is however recognized by some
researchers as a solution to serious problems such as I/O cost [2]. P-trees take the vertical
partitioning model to the limit by not only breaking up relations into their attributes but
further decomposing attributes into individual bits. The resulting bit-vectors are
compressed in a hierarchical fashion that allows fast Boolean operations. Optimizing
Boolean operations on P-trees is essential for most data mining tasks that
require knowledge of the number of data points with attributes of a particular value or
within a particular range.
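The bit-level decomposition and counting described above can be illustrated with a minimal sketch. It assumes unsigned integer attributes of a fixed bit width and stores each bit-vector uncompressed in a Python integer; the hierarchical compression that defines an actual P-tree is omitted, and the names `bit_slices` and `count_equal` are hypothetical and not part of the thesis code.

```python
from typing import List


def bit_slices(values: List[int], n_bits: int) -> List[int]:
    """Split one attribute column into n_bits bit-vectors, most significant first.

    Each bit-vector is packed into a Python int: bit `row` of the vector
    holds the selected bit of values[row]. (Uncompressed; assumption for
    illustration only.)
    """
    slices = []
    for pos in range(n_bits - 1, -1, -1):
        vec = 0
        for row, v in enumerate(values):
            if (v >> pos) & 1:
                vec |= 1 << row
        slices.append(vec)
    return slices


def count_equal(slices: List[int], n_rows: int, n_bits: int, target: int) -> int:
    """Count rows whose attribute equals `target` using only AND and NOT."""
    all_rows = (1 << n_rows) - 1
    result = all_rows
    for i, vec in enumerate(slices):
        pos = n_bits - 1 - i
        if (target >> pos) & 1:
            result &= vec                # rows where this bit is 1
        else:
            result &= ~vec & all_rows    # rows where this bit is 0
    return bin(result).count("1")


values = [5, 3, 5, 7, 0, 5]
slices = bit_slices(values, 3)
print(count_equal(slices, len(values), 3, 5))   # rows equal to 5
print(bin(slices[0]).count("1"))                # rows in the range 4..7
```

A range that corresponds to a bit prefix, such as 4..7 for three-bit values, needs only the high-order bit-vector, which is why counts over values and ranges reduce to Boolean operations on the bit columns.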
1.1. Organization of the Thesis
Chapter 2 of this thesis motivates the bit-column-based data organization that is the
basis for the P-tree data structure, and analyzes the theoretical implications for the
representation of attributes of various domains. Sections 2.3 - 2.5 draw on a
multi-authored paper that is not yet completed but is available in draft form [3].
Chapter 3 describes the P-tree data structure in detail and introduces a new sorting
scheme, generalized Peano-order sorting, that significantly improves storage efficiency and
execution speed on data that shows no natural continuity. The logical structure of P-trees is
described and representation choices are highlighted with focus on an efficient array-
converted tree-based representation that was developed specifically for this thesis.
Appendix A formalizes the development of this representation. A full implementation of
P-tree construction and ANDing in Java was completed as part of the thesis and was used
for the data mining applications of the remaining chapters. Chapter 3 furthermore
describes an Application Programming Interface (API) that was developed in conjunction
with other group members to allow reuse of components developed for this thesis and other
research work. All new P-tree code development is being based on a revised C++ version
of the API.
Chapter 4 introduces kernel functions and develops the mathematical framework for
the following two chapters. Generalizing previous work on kernel-based classifiers
using P-trees [4], this chapter presents an implementation of the kernel concept that can be
seen as the basis for modified approaches in the following chapters. Chapter 4 furthermore
includes some remarks to assist in the transition between later chapters.
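As a rough illustration of the kernel concept, the following sketch computes a one-dimensional kernel-density estimate by summing a kernel function centered on each data point. The Gaussian and step-shaped kernels, the bandwidth parameter, and all function names are assumptions made for this example; they are not the formulation developed in Chapter 4.

```python
import math
from typing import Callable, Sequence


def gaussian_kernel(u: float) -> float:
    """Standard normal density, a common smooth kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def step_kernel(u: float) -> float:
    """Step-shaped (uniform) kernel: constant on [-1, 1], zero elsewhere."""
    return 0.5 if abs(u) <= 1.0 else 0.0


def kernel_density(
    x: float,
    data: Sequence[float],
    bandwidth: float,
    kernel: Callable[[float], float] = gaussian_kernel,
) -> float:
    """Density estimate at x: the average of kernels centered on the data points."""
    n = len(data)
    return sum(kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)


data = [1.0, 1.2, 0.9, 5.0]
print(kernel_density(1.0, data, bandwidth=0.5))                      # dense region
print(kernel_density(3.0, data, bandwidth=0.5))                      # sparse region
print(kernel_density(1.0, data, bandwidth=0.5, kernel=step_kernel))  # step kernel
```

With a step-shaped kernel the estimate reduces to counting the data points within a fixed range of `x`, which is the kind of count that bit-column representations are designed to deliver.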
Chapter 5 represents a paper that can be seen within the kernel framework, but that
also has roots in decision tree induction and rule-based techniques. It makes use of
information gain as well as statistical significance to determine relevant attributes
sequentially using a step-shaped kernel function. The fundamental relationship between
information gain and significance is demonstrated, where the standard formulation of
information gain [5] emerges as an approximation of an exact quantity that is derived in
Appendix B.
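For reference, the standard formulation of information gain can be sketched as the reduction in Shannon entropy of the class label after splitting on an attribute. This shows only the textbook quantity, not the exact quantity derived in Appendix B; the function names and the toy data are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(attribute: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Entropy of the labels minus the weighted entropy within each attribute value."""
    n = len(labels)
    groups = {}
    for a, y in zip(attribute, labels):
        groups.setdefault(a, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


labels = ["a", "a", "b", "b"]
print(information_gain([0, 0, 1, 1], labels))  # attribute determines the class
print(information_gain([0, 1, 0, 1], labels))  # attribute carries no information
```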
Chapter 6 represents a paper that introduces a Semi-Naive Bayes classifier using a
kernel-based representation. Naive Bayes classifiers in which single-attribute probabilities
are evaluated based on one-dimensional kernel estimates are well known in statistics [6].
This technique has not, however, previously been recognized by researchers who
generalize the Naive Bayesian approach to a Semi-Naive Bayes classifier by combining
attributes [7,8]. We show that the resulting lazy algorithm achieves significant improvements
in accuracy compared with the naive Bayes classifier, and can be efficiently implemented
using P-trees and the HOBbit distance function.
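This chapter does not define the HOBbit distance. The sketch below assumes the definition commonly used in the P-tree literature: the distance between two non-negative integers is the number of low-order bits that must be dropped before the remaining high-order bits agree. The function name is hypothetical, and the definition used in Chapter 6 may differ in detail.

```python
def hobbit_distance(a: int, b: int) -> int:
    """Number of right shifts needed until a and b become equal.

    Assumed definition (High-Order-Bit distance): equivalent to the
    position of the highest bit in which a and b differ, counted from 1,
    or 0 if the values are identical.
    """
    if a < 0 or b < 0:
        raise ValueError("HOBbit distance is sketched here for non-negative integers only")
    return (a ^ b).bit_length()


print(hobbit_distance(5, 7))  # 0b101 vs 0b111
print(hobbit_distance(8, 7))  # 0b1000 vs 0b0111
print(hobbit_distance(6, 6))  # identical values
```

Under this assumption, all values within a given HOBbit distance of a target share a common high-order bit prefix, so a neighborhood can be selected by ANDing only the high-order bit-vectors.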
Chapter 7 represents a multi-author paper that has been published [9]. This paper
demonstrates that kernel-density-based clustering [10] can be motivated as an
approximation to two of the best-known clustering algorithms, namely k-means and
k-medoids [11]. The paper furthermore integrates hierarchical clustering ideas. Dr. William
Perrizo guided the work; the performance was compared with results that Qiang Ding
achieved for his implementation of a k-means algorithm; and Qin Ding helped with writing
and formatting. Chapter 8 concludes the thesis.
References
[1] Qin Ding, Maleq Khan, Amalendu Roy, and William Perrizo, "P-tree Algebra", ACM
Symposium on Applied Computing (SAC'02), Madrid, Spain, 2002.
[2] Marianne Winslett, "David DeWitt Speaks Out", SIGMOD Record, Vol. 31, No. 2,
2002.
[3] Anne Denton, Qin Ding, William Jockheck, Qiang Ding, William Perrizo, "Partitioning
- A uniform model for data mining",
http://www.cs.ndsu.nodak.edu/~datasurg/papers/foundation_correct_version.pdf
[4] William Perrizo, Qin Ding, Anne Denton, Kirk Scott, Qiang Ding, Maleq Khan, "PINE
- Podium Incremental Neighbor Evaluator for Spatial Data using P-trees", Symposium on
Applied Computing (SAC'03), Melbourne, Florida, USA, 2003.
[5] Claude Shannon, "A mathematical theory of communication", Bell System Technical
Journal, 1948.
[6] Trevor Hastie, Robert Tibshirani, Jerome Friedman, "The Elements of Statistical
Learning", Springer, 2001.
[7] Igor Kononenko, "Semi-Naive Bayesian Classifier", In Proceedings of the Sixth
European Working Session on Learning, 206-219, 1991.
[8] Michael Pazzani, "Constructive Induction of Cartesian Product Attributes",
Information, Statistics, and Induction in Science, Melbourne, Australia, 1996.
[9] Anne Denton, Qiang Ding, William Perrizo, Qin Ding, "Efficient Hierarchical
Clustering of Large Data Sets Using P-trees", 15th International Conference on Computer
Applications in Industry and Engineering (CAINE'02), San Diego, California, USA, 2002.
[10] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large
multimedia databases with noise", In Proc. 1998 Int. Conf. Knowledge Discovery and Data
Mining (KDD'98), 58-65, New York, Aug. 1998.
[11] L. Kaufman and P.J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster
Analysis", New York: John Wiley & Sons, 1990.
5
[8] Michael Pazzani, "Constructive Induction of Cartesian Product Attributes",
Information , Statistics, and Induction in Science, Melbourne, Australia, 1996.
[9] Anne Denton, Qiang Ding, William Perrizo, Qin Ding, "Efficient Hierarchical
Clustering of Large Data Sets Using P-trees", 15th International Conference on Computer
Applications in Industry and Engineering (CAINE'02), San Diego, California, USA, 2002.
[10] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large
multimedia databases with noise", In Proc. 1998 Int. Conf. Knowledge Discovery and Data
Mining (KDD'98), 58-65, New York, Aug. 1998.
[11] L. Kaufman and P.J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster
Analysis", New York: John Wiley & Sons, 1990.
5

More Related Content

What's hot

84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1bPRAWEEN KUMAR
 
Effective data mining for proper
Effective data mining for properEffective data mining for proper
Effective data mining for properIJDKP
 
Ba2419551957
Ba2419551957Ba2419551957
Ba2419551957IJMER
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...acijjournal
 
A Survey on Graph Database Management Techniques for Huge Unstructured Data
A Survey on Graph Database Management Techniques for Huge Unstructured Data A Survey on Graph Database Management Techniques for Huge Unstructured Data
A Survey on Graph Database Management Techniques for Huge Unstructured Data IJECEIAES
 
Iaetsd a survey on one class clustering
Iaetsd a survey on one class clusteringIaetsd a survey on one class clustering
Iaetsd a survey on one class clusteringIaetsd Iaetsd
 
Shortest path estimation for graph
Shortest path estimation for graphShortest path estimation for graph
Shortest path estimation for graphijdms
 
Query Optimization Techniques in Graph Databases
Query Optimization Techniques in Graph DatabasesQuery Optimization Techniques in Graph Databases
Query Optimization Techniques in Graph Databasesijdms
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingIRJET Journal
 
Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisEditor IJMTER
 
A h k clustering algorithm for high dimensional data using ensemble learning
A h k clustering algorithm for high dimensional data using ensemble learningA h k clustering algorithm for high dimensional data using ensemble learning
A h k clustering algorithm for high dimensional data using ensemble learningijitcs
 
Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932Editor IJARCET
 
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKSMULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKSijcses
 
Evaluating Aggregate Functions of Iceberg Query Using Priority Based Bitmap I...
Evaluating Aggregate Functions of Iceberg Query Using Priority Based Bitmap I...Evaluating Aggregate Functions of Iceberg Query Using Priority Based Bitmap I...
Evaluating Aggregate Functions of Iceberg Query Using Priority Based Bitmap I...IJECEIAES
 
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...IJDKP
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureIOSR Journals
 

What's hot (20)

84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b
 
Effective data mining for proper
Effective data mining for properEffective data mining for proper
Effective data mining for proper
 
Ba2419551957
Ba2419551957Ba2419551957
Ba2419551957
 
Big Data Clustering Model based on Fuzzy Gaussian
Big Data Clustering Model based on Fuzzy GaussianBig Data Clustering Model based on Fuzzy Gaussian
Big Data Clustering Model based on Fuzzy Gaussian
 
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERINGAN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
 
A Survey on Graph Database Management Techniques for Huge Unstructured Data
A Survey on Graph Database Management Techniques for Huge Unstructured Data A Survey on Graph Database Management Techniques for Huge Unstructured Data
A Survey on Graph Database Management Techniques for Huge Unstructured Data
 
50120130406022
5012013040602250120130406022
50120130406022
 
Iaetsd a survey on one class clustering
Iaetsd a survey on one class clusteringIaetsd a survey on one class clustering
Iaetsd a survey on one class clustering
 
Shortest path estimation for graph
Shortest path estimation for graphShortest path estimation for graph
Shortest path estimation for graph
 
Query Optimization Techniques in Graph Databases
Query Optimization Techniques in Graph DatabasesQuery Optimization Techniques in Graph Databases
Query Optimization Techniques in Graph Databases
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive Indexing
 
Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative Analysis
 
A h k clustering algorithm for high dimensional data using ensemble learning
A h k clustering algorithm for high dimensional data using ensemble learningA h k clustering algorithm for high dimensional data using ensemble learning
A h k clustering algorithm for high dimensional data using ensemble learning
 
Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932Volume 2-issue-6-1930-1932
Volume 2-issue-6-1930-1932
 
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKSMULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS
MULTIDIMENSIONAL ANALYSIS FOR QOS IN WIRELESS SENSOR NETWORKS
 
Evaluating Aggregate Functions of Iceberg Query Using Priority Based Bitmap I...
Evaluating Aggregate Functions of Iceberg Query Using Priority Based Bitmap I...Evaluating Aggregate Functions of Iceberg Query Using Priority Based Bitmap I...
Evaluating Aggregate Functions of Iceberg Query Using Priority Based Bitmap I...
 
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity Measure
 
A0360109
A0360109A0360109
A0360109
 

Viewers also liked

WorldCafeVancouverVAMC21Sep2016
WorldCafeVancouverVAMC21Sep2016WorldCafeVancouverVAMC21Sep2016
WorldCafeVancouverVAMC21Sep2016John Fuller
 
Impact Spring 2015 (2)
Impact Spring 2015 (2)Impact Spring 2015 (2)
Impact Spring 2015 (2)Andre Benjamin
 
北投社大造景計畫
北投社大造景計畫北投社大造景計畫
北投社大造景計畫平台 青
 
Reference letter 1
Reference letter 1Reference letter 1
Reference letter 1Ryan Cronce
 
CROSCO_certificate_NemcicMihael
CROSCO_certificate_NemcicMihaelCROSCO_certificate_NemcicMihael
CROSCO_certificate_NemcicMihaelMihael Nemčić
 
新聞に未来はあるのか?
新聞に未来はあるのか?新聞に未来はあるのか?
新聞に未来はあるのか?渚 佐藤
 
Self evaluation form co created rubric
Self evaluation form co created rubricSelf evaluation form co created rubric
Self evaluation form co created rubricgrantthomasonline
 
Update Officieel Remi Ramon
Update Officieel Remi RamonUpdate Officieel Remi Ramon
Update Officieel Remi RamonRemi Ramon
 
北投地質調查和導覽計畫
北投地質調查和導覽計畫北投地質調查和導覽計畫
北投地質調查和導覽計畫平台 青
 

Viewers also liked (13)

Alaina C Martin
Alaina C MartinAlaina C Martin
Alaina C Martin
 
ฉันเหมือนใคร
ฉันเหมือนใครฉันเหมือนใคร
ฉันเหมือนใคร
 
WorldCafeVancouverVAMC21Sep2016
WorldCafeVancouverVAMC21Sep2016WorldCafeVancouverVAMC21Sep2016
WorldCafeVancouverVAMC21Sep2016
 
Impact Spring 2015 (2)
Impact Spring 2015 (2)Impact Spring 2015 (2)
Impact Spring 2015 (2)
 
北投社大造景計畫
北投社大造景計畫北投社大造景計畫
北投社大造景計畫
 
Reference letter 1
Reference letter 1Reference letter 1
Reference letter 1
 
c.v
c.vc.v
c.v
 
4 FDTSSA
4 FDTSSA4 FDTSSA
4 FDTSSA
 
CROSCO_certificate_NemcicMihael
CROSCO_certificate_NemcicMihaelCROSCO_certificate_NemcicMihael
CROSCO_certificate_NemcicMihael
 
新聞に未来はあるのか?
新聞に未来はあるのか?新聞に未来はあるのか?
新聞に未来はあるのか?
 
Self evaluation form co created rubric
Self evaluation form co created rubricSelf evaluation form co created rubric
Self evaluation form co created rubric
 
Update Officieel Remi Ramon
Update Officieel Remi RamonUpdate Officieel Remi Ramon
Update Officieel Remi Ramon
 
北投地質調查和導覽計畫
北投地質調查和導覽計畫北投地質調查和導覽計畫
北投地質調查和導覽計畫
 

Similar to Chapter1_C.doc

IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET Journal
 
Towards a new hybrid approach for building documentoriented data wareh
Towards a new hybrid approach for building documentoriented data warehTowards a new hybrid approach for building documentoriented data wareh
Towards a new hybrid approach for building documentoriented data warehIJECEIAES
 
An Effective Storage Management for University Library using Weighted K-Neare...
An Effective Storage Management for University Library using Weighted K-Neare...An Effective Storage Management for University Library using Weighted K-Neare...
An Effective Storage Management for University Library using Weighted K-Neare...C Sai Kiran
 
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...IJORCS
 
Novel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data StreamsNovel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data StreamsIJERA Editor
 
Document Classification Using Hierarchies Clusters Technique
Document Classification Using Hierarchies Clusters TechniqueDocument Classification Using Hierarchies Clusters Technique
Document Classification Using Hierarchies Clusters Techniqueupendra singh
 
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION cscpconf
 
Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach IJCSIS Research Publications
 
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...IOSR Journals
 
Optimised Kd-Tree Approach with Dimension Reduction for Efficient Indexing an...
Optimised Kd-Tree Approach with Dimension Reduction for Efficient Indexing an...Optimised Kd-Tree Approach with Dimension Reduction for Efficient Indexing an...
Optimised Kd-Tree Approach with Dimension Reduction for Efficient Indexing an...IJCSIS Research Publications
 
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using HadoopImplementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using HadoopBRNSSPublicationHubI
 
A Case Study of Innovation of an Information Communication System and Upgrade...
A Case Study of Innovation of an Information Communication System and Upgrade...A Case Study of Innovation of an Information Communication System and Upgrade...
A Case Study of Innovation of an Information Communication System and Upgrade...gerogepatton
 
New proximity estimate for incremental update of non uniformly distributed cl...
New proximity estimate for incremental update of non uniformly distributed cl...New proximity estimate for incremental update of non uniformly distributed cl...
New proximity estimate for incremental update of non uniformly distributed cl...IJDKP
 
Indexing techniques for advanced database systems
Indexing techniques for advanced database systemsIndexing techniques for advanced database systems
Indexing techniques for advanced database systemsMohammed Muqeet
 
E132833
E132833E132833
E132833irjes
 
IRJET- Semantics based Document Clustering
IRJET- Semantics based Document ClusteringIRJET- Semantics based Document Clustering
IRJET- Semantics based Document ClusteringIRJET Journal
 
Information Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis ApproachInformation Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis ApproachAIRCC Publishing Corporation
 
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACH
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACHINFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACH
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACHijcsit
 

Similar to Chapter1_C.doc (20)

IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
 
Bl24409420
Bl24409420Bl24409420
Bl24409420
 
Towards a new hybrid approach for building documentoriented data wareh
Towards a new hybrid approach for building documentoriented data warehTowards a new hybrid approach for building documentoriented data wareh
Towards a new hybrid approach for building documentoriented data wareh
 
An Effective Storage Management for University Library using Weighted K-Neare...
An Effective Storage Management for University Library using Weighted K-Neare...An Effective Storage Management for University Library using Weighted K-Neare...
An Effective Storage Management for University Library using Weighted K-Neare...
 
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346
 
Novel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data StreamsNovel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data Streams
 
Document Classification Using Hierarchies Clusters Technique
Document Classification Using Hierarchies Clusters TechniqueDocument Classification Using Hierarchies Clusters Technique
Document Classification Using Hierarchies Clusters Technique
 
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
 
Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach
 
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
 
Optimised Kd-Tree Approach with Dimension Reduction for Efficient Indexing an...
Optimised Kd-Tree Approach with Dimension Reduction for Efficient Indexing an...Optimised Kd-Tree Approach with Dimension Reduction for Efficient Indexing an...
Optimised Kd-Tree Approach with Dimension Reduction for Efficient Indexing an...
 
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using HadoopImplementation of Improved Apriori Algorithm on Large Dataset using Hadoop
Implementation of Improved Apriori Algorithm on Large Dataset using Hadoop
 
A Case Study of Innovation of an Information Communication System and Upgrade...
A Case Study of Innovation of an Information Communication System and Upgrade...A Case Study of Innovation of an Information Communication System and Upgrade...
A Case Study of Innovation of an Information Communication System and Upgrade...
 
New proximity estimate for incremental update of non uniformly distributed cl...
New proximity estimate for incremental update of non uniformly distributed cl...New proximity estimate for incremental update of non uniformly distributed cl...
New proximity estimate for incremental update of non uniformly distributed cl...
 
Indexing techniques for advanced database systems
Indexing techniques for advanced database systemsIndexing techniques for advanced database systems
Indexing techniques for advanced database systems
 
E132833
E132833E132833
E132833
 
IRJET- Semantics based Document Clustering
IRJET- Semantics based Document ClusteringIRJET- Semantics based Document Clustering
IRJET- Semantics based Document Clustering
 
Information Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis ApproachInformation Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis Approach
 
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACH
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACHINFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACH
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACH
 

More from butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
Chapter1_C.doc

CHAPTER 1: INTRODUCTION

Modern storage and processor technology makes the accumulation of data increasingly easy. Prediction and data analysis tools must be developed and adapted rapidly to keep up with the dramatic growth in data volume. Data mining pursues the goal of extracting information from large databases, taking the storage structure into consideration. Some data mining techniques, such as Association Rule Mining, have been developed recently with the exploration of large databases in mind, while others, such as classification and clustering, are adaptations of traditional methods that have long been used in statistics and machine learning. Classification is the process of predicting membership of a data point in one of a finite number of classes. The attribute that indicates the class membership is called the class label attribute. Clustering, in contrast, has the goal of identifying classes in data without a predefined class label. This thesis focuses on classification and clustering techniques that can be formulated in terms of kernel-density estimates, i.e., algorithms in which the density of all or a subset of data points is calculated from the data set through convolution with a kernel function. Many statistics and machine learning techniques can be viewed in this fashion.

The algorithms in this thesis were developed to suit the concept of P-trees [1], which can be seen as representing the limiting case of a database model called vertical partitioning. Most database systems are organized in record-oriented data structures, referred to as horizontal partitioning. Vertical partitioning, or the organization of tables by column, however, is recognized by some researchers as a solution to serious problems such as I/O [2]. P-trees take the vertical partitioning model to the limit by not only breaking up relations into their attributes but further decomposing attributes into individual bits. The resulting bit-vectors are compressed in a hierarchical fashion that allows fast Boolean operations. The optimization of Boolean operations on P-trees is essential for most data mining tasks that require knowledge of the number of data points with attributes of a particular value or within a particular range.

1.1. Organization of the Thesis

Chapter 2 of this thesis motivates the bit-column-based data organization that is the basis for the P-tree data structure, and analyses the theoretical implications for the representation of attributes of various domains. Sections 2.3 - 2.5 draw on a multi-authored paper that is not yet completed, but is available in draft form [3].

Chapter 3 describes the P-tree data structure in detail and introduces a new sorting scheme, generalized Peano-order sorting, that significantly improves storage efficiency and execution speed on data that shows no natural continuity. The logical structure of P-trees is described, and representation choices are highlighted with a focus on an efficient array-converted tree-based representation that was developed specifically for this thesis. Appendix A formalizes the development of this representation. A full implementation of P-tree construction and ANDing in Java was completed as part of the thesis and was used for the data mining applications of the remaining chapters. Chapter 3 furthermore describes an Application Programming Interface (API) that was developed in conjunction with other group members to allow reuse of components developed for this thesis and other research work. All new P-tree code development is being based on a revised C++ version of the API.
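The bit-column decomposition and the counting role of Boolean operations described in the introduction can be illustrated with a minimal sketch. It is not the P-tree implementation of this thesis: the hierarchical compression is omitted, each bit column is held as a plain Python integer used as a bit set, and the function names `bit_slices` and `count_matching` are hypothetical, chosen only for this illustration.

```python
from typing import List


def bit_slices(values: List[int], width: int) -> List[int]:
    """Decompose one integer attribute into `width` bit columns.

    Column 0 holds the most significant bit. Bit r of a column is set
    when row r has the corresponding bit set in its attribute value.
    (Uncompressed stand-in for a P-tree; assumption for illustration.)
    """
    slices = []
    for j in range(width - 1, -1, -1):
        column = 0
        for row, value in enumerate(values):
            if (value >> j) & 1:
                column |= 1 << row
        slices.append(column)
    return slices


def count_matching(slices: List[int], n_rows: int, target: int, n_bits: int) -> int:
    """Count rows whose `n_bits` highest-order bits agree with `target`.

    With n_bits equal to the attribute width this counts exact matches;
    with fewer bits it counts the rows in an interval of values.
    Only AND and complement on whole bit columns are used.
    """
    width = len(slices)
    all_rows = (1 << n_rows) - 1
    result = all_rows
    for i in range(n_bits):
        target_bit = (target >> (width - 1 - i)) & 1
        column = slices[i]
        # AND with the column itself, or with its complement, per target bit.
        result &= column if target_bit else (all_rows & ~column)
    return bin(result).count("1")


values = [5, 3, 4, 7, 0, 5]      # one 3-bit attribute, six rows
columns = bit_slices(values, 3)
exact = count_matching(columns, len(values), 5, 3)      # rows with value 5
interval = count_matching(columns, len(values), 5, 2)   # rows with value 4 or 5
```

In this sketch, a count for a particular value needs one Boolean operation per bit column, and a count for an interval aligned to a power of two needs fewer, which is the kind of query the text identifies as central to most data mining tasks.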
Chapter 4 introduces kernel functions and develops the mathematical framework for the following two chapters. Generalizing on previous work on kernel-based classifiers using P-trees [4], this chapter presents an implementation of the kernel concept that can be seen as the basis for modified approaches in the following chapters. Chapter 4 furthermore includes some remarks to assist in the transition between later chapters.

Chapter 5 represents a paper that can be seen within the kernel framework, but that also has roots in decision tree induction and rule-based techniques. It makes use of information gain as well as statistical significance to determine relevant attributes sequentially using a step-shaped kernel function. The fundamental relationship between information gain and significance is demonstrated, where the standard formulation of information gain [5] emerges as an approximation of an exact quantity that is derived in Appendix B.

Chapter 6 represents a paper that introduces a Semi-Naive Bayes classifier using a kernel-based representation. Naive Bayes classifiers in which single-attribute probabilities are evaluated based on one-dimensional kernel estimates are well known in statistics [6]. This technique has not, however, previously been recognized by researchers who generalize the Naive Bayesian approach to a Semi-Naive Bayes classifier by combining attributes [7,8]. We show that the resulting lazy algorithm shows significant improvements in accuracy compared with the Naive Bayes classifier, and can be efficiently implemented using P-trees and the HOBbit distance function.

Chapter 7 represents a multi-author paper that has been published [9]. This paper demonstrates that kernel-density-based clustering [10] can be motivated as an approximation to two of the best-known clustering algorithms, namely k-means and k-medoids [11]. The paper furthermore integrates hierarchical clustering ideas. Dr. William Perrizo guided the work; the performance was compared with results that Qiang Ding achieved for his implementation of a k-means algorithm; and Qin Ding helped with writing and formatting.

Chapter 8 concludes the thesis.

References

[1] Qin Ding, Maleq Khan, Amalendu Roy, and William Perrizo, "P-tree Algebra", ACM Symposium on Applied Computing (SAC'02), Madrid, Spain, 2002.
[2] Marianne Winslett, "David DeWitt Speaks Out", SIGMOD Record, Vol. 31, No. 2, 2002.
[3] Anne Denton, Qin Ding, William Jockheck, Qiang Ding, William Perrizo, "Partitioning - A uniform model for data mining", http://www.cs.ndsu.nodak.edu/~datasurg/papers/foundation_correct_version.pdf
[4] William Perrizo, Qin Ding, Anne Denton, Kirk Scott, Qiang Ding, Maleq Khan, "PINE - Podium Incremental Neighbor Evaluator for Spatial Data using P-trees", Symposium on Applied Computing (SAC'03), Melbourne, Florida, USA, 2003.
[5] Claude Shannon, "A mathematical theory of communication", Bell System Technical Journal, 1948.
[6] Trevor Hastie, Robert Tibshirani, Jerome Friedman, "The Elements of Statistical Learning", Springer, 2001.
[7] Igor Kononenko, "Semi-Naive Bayesian Classifier", In Proceedings of the Sixth European Working Session on Learning, 206-219, 1991.
[8] Michael Pazzani, "Constructive Induction of Cartesian Product Attributes", Information, Statistics, and Induction in Science, Melbourne, Australia, 1996.
[9] Anne Denton, Qiang Ding, William Perrizo, Qin Ding, "Efficient Hierarchical Clustering of Large Data Sets Using P-trees", 15th International Conference on Computer Applications in Industry and Engineering (CAINE'02), San Diego, California, USA, 2002.
[10] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise", In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), 58-65, New York, Aug. 1998.
[11] L. Kaufman and P.J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis", New York: John Wiley & Sons, 1990.