The document introduces a new distance measure between probability density functions called the Laplacian PDF distance. This distance measure has a connection to kernel-based learning theory via the Parzen window technique for density estimation. In a kernel feature space defined by the eigenspectrum of the Laplacian data matrix, the Laplacian PDF distance is shown to measure the cosine of the angle between cluster mean vectors. The Laplacian data matrix and its eigenspectrum can be obtained automatically based on the data, allowing the feature space mapping to be determined in an unsupervised manner.
The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space

Robert Jenssen¹∗, Deniz Erdogmus², Jose Principe², Torbjørn Eltoft¹

¹ Department of Physics, University of Tromsø, Norway
² Computational NeuroEngineering Laboratory, University of Florida, USA
Abstract
A new distance measure between probability density functions (pdfs) is introduced, which we refer to as the Laplacian pdf distance. The Laplacian pdf distance exhibits a remarkable connection to Mercer kernel based learning theory via the Parzen window technique for density estimation. In a kernel feature space defined by the eigenspectrum of the Laplacian data matrix, this pdf distance is shown to measure the cosine of the angle between cluster mean vectors. The Laplacian data matrix, and hence its eigenspectrum, can be obtained automatically based on the data at hand, by optimal Parzen window selection. We show that the Laplacian pdf distance has an interesting interpretation as a risk function connected to the probability of error.
1 Introduction
In recent years, spectral clustering methods, i.e. data partitioning based on the eigenspectrum of kernel matrices, have received a lot of attention [1, 2]. Some unresolved questions associated with these methods are, for example, that it is not always clear which cost function is being optimized, and that it is not clear how to construct a proper kernel matrix.
In this paper, we introduce a well-defined cost function for spectral clustering. This cost function is derived from a new information theoretic distance measure between cluster pdfs, named the Laplacian pdf distance. The information theoretic/spectral duality is established via the Parzen window methodology for density estimation. The resulting spectral clustering cost function measures the cosine of the angle between cluster mean vectors in a Mercer kernel feature space, where the feature space is determined by the eigenspectrum of the Laplacian matrix. A principled approach to spectral clustering would be to optimize this cost function in the feature space by assigning cluster memberships. Because of space limitations, we leave it to a future paper to present an actual clustering algorithm optimizing this cost function, and focus in this paper on the theoretical properties of the new measure.
∗ Corresponding author. Phone: (+47) 776 46493. Email: robertj@phys.uit.no
An important by-product of the theory presented is that a method for learning the Mercer kernel matrix via optimal Parzen windowing is provided. This means that the Laplacian matrix, its eigenspectrum and hence the feature space mapping can be determined automatically. We illustrate this property by an example.
We also show that the Laplacian pdf distance has an interesting relationship to the
probability of error.
In section 2, we briefly review kernel feature space theory. In section 3, we utilize the Parzen window technique for function approximation; this allows us to introduce the new Laplacian pdf distance and discuss some of its properties in sections 4 and 5. Section 6 concludes the paper.
2 Kernel Feature Spaces
Mercer kernel-based learning algorithms [3] make use of the following idea: via a nonlinear mapping
$$\Phi : \mathbb{R}^d \to F, \quad x \mapsto \Phi(x) \qquad (1)$$
the data $x_1, \dots, x_N \in \mathbb{R}^d$ is mapped into a potentially much higher dimensional feature space $F$. For a given learning problem one now considers the same algorithm in $F$ instead of in $\mathbb{R}^d$, that is, one works with $\Phi(x_1), \dots, \Phi(x_N) \in F$.

Consider a symmetric kernel function $k(x, y)$. If $k : C \times C \to \mathbb{R}$ is a continuous kernel of a positive integral operator in a Hilbert space $L_2(C)$ on a compact set $C \subset \mathbb{R}^d$, i.e.
$$\forall \psi \in L_2(C) : \int_C k(x, y)\,\psi(x)\psi(y)\,dx\,dy \ge 0, \qquad (2)$$
then there exists a space $F$ and a mapping $\Phi : \mathbb{R}^d \to F$, such that by Mercer's theorem [4]
$$k(x, y) = \langle \Phi(x), \Phi(y) \rangle = \sum_{i=1}^{N_F} \lambda_i \phi_i(x)\phi_i(y), \qquad (3)$$
where $\langle \cdot, \cdot \rangle$ denotes an inner product, the $\phi_i$'s are the orthonormal eigenfunctions of the kernel and $N_F \le \infty$ [3]. In this case
$$\Phi(x) = [\sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \dots]^T \qquad (4)$$
can potentially be realized.
In some cases, it may be desirable to realize this mapping. This issue has been addressed in [5]. Define the $(N \times N)$ Gram matrix, $K$, also called the affinity, or kernel, matrix, with elements $K_{ij} = k(x_i, x_j)$, $i, j = 1, \dots, N$. This matrix can be diagonalized as $E^T K E = \Lambda$, where the columns of $E$ contain the eigenvectors of $K$ and $\Lambda$ is a diagonal matrix containing the non-negative eigenvalues $\tilde\lambda_1, \dots, \tilde\lambda_N$, $\tilde\lambda_1 \ge \dots \ge \tilde\lambda_N$. In [5], it was shown that the eigenfunctions and eigenvalues of (4) can be approximated as $\phi_j(x_i) \approx \sqrt{N}\,e_{ji}$, $\lambda_j \approx \tilde\lambda_j / N$, where $e_{ji}$ denotes the $i$th element of the $j$th eigenvector. Hence, the mapping (4) can be approximated as
$$\Phi(x_i) \approx [\sqrt{\tilde\lambda_1}\,e_{1i}, \dots, \sqrt{\tilde\lambda_N}\,e_{Ni}]^T. \qquad (5)$$
Thus, the mapping is based on the eigenspectrum of $K$. The feature space data set may be represented in matrix form as $\Phi_{N \times N} = [\Phi(x_1), \dots, \Phi(x_N)]$. Hence, $\Phi = \Lambda^{\frac12} E^T$. It may be desirable to truncate the mapping (5) to $C$ dimensions. Thus, only the first $C$ rows of $\Phi$ are kept, yielding $\hat\Phi$. It is well known that $\hat K = \hat\Phi^T \hat\Phi$ is the best rank-$C$ approximation to $K$ wrt. the Frobenius norm [6].
The most widely used Mercer kernel is the radial-basis-function (RBF)
$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right). \qquad (6)$$
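To make the mapping concrete, here is a minimal numpy sketch (ours, not from the paper; the function names are illustrative) that computes the RBF Gram matrix of (6) and the approximate feature space mapping of (5) from its eigenspectrum.

```python
import numpy as np

def rbf_gram_matrix(X, sigma):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), cf. (6)."""
    sq = np.sum(X ** 2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def feature_space_mapping(K, n_dims=None):
    """Approximate mapping Phi(x_i) ~ [sqrt(lambda_1) e_1i, ..., sqrt(lambda_C) e_Ci]^T, cf. (5).

    Returns a matrix whose i-th column is the (possibly truncated) image of x_i.
    """
    eigvals, eigvecs = np.linalg.eigh(K)          # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1]               # largest eigenvalues first
    eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]
    if n_dims is not None:                        # keep only the first C rows of Phi
        eigvals, eigvecs = eigvals[:n_dims], eigvecs[:, :n_dims]
    eigvals = np.clip(eigvals, 0.0, None)         # guard against round-off negatives
    return np.sqrt(eigvals)[:, None] * eigvecs.T  # Phi = Lambda^{1/2} E^T
```

For a data set `X` of shape `(N, d)`, `feature_space_mapping(rbf_gram_matrix(X, sigma), n_dims=2)` returns a 2 × N matrix whose columns approximate the mapped points.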
3 Function Approximation using Parzen Windowing
Parzen windowing is a kernel-based density estimation method, where the resulting density estimate is continuous and differentiable provided that the selected kernel is continuous and differentiable [7]. Given a set of iid samples $\{x_1, \dots, x_N\}$ drawn from the true density $f(x)$, the Parzen window estimate for this distribution is [7]
$$\hat f(x) = \frac{1}{N}\sum_{i=1}^{N} W_{\sigma^2}(x, x_i), \qquad (7)$$
where $W_{\sigma^2}$ is the Parzen window, or kernel, and $\sigma^2$ controls the width of the kernel. The Parzen window must integrate to one, and is typically chosen to be a pdf itself with mean $x_i$, such as the Gaussian kernel
$$W_{\sigma^2}(x, x_i) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right), \qquad (8)$$
which we will assume in the rest of this paper. In the conclusion, we briefly discuss the use of other kernels.
Consider a function $h(x) = v(x)f(x)$, for some function $v(x)$. We propose to estimate $h(x)$ by the following generalized Parzen estimator
$$\hat h(x) = \frac{1}{N}\sum_{i=1}^{N} v(x_i)\, W_{\sigma^2}(x, x_i). \qquad (9)$$
This estimator is asymptotically unbiased, which can be shown as follows
$$E_f\left[\frac{1}{N}\sum_{i=1}^{N} v(x_i)\, W_{\sigma^2}(x, x_i)\right] = \int v(z)f(z)\, W_{\sigma^2}(x, z)\,dz = [v(x)f(x)] * W_{\sigma^2}(x), \qquad (10)$$
where $E_f(\cdot)$ denotes expectation with respect to the density $f(x)$. In the limit as $N \to \infty$ and $\sigma(N) \to 0$, we have
$$\lim_{\substack{N \to \infty \\ \sigma(N) \to 0}} [v(x)f(x)] * W_{\sigma^2}(x) = v(x)f(x). \qquad (11)$$
Of course, if $v(x) = 1\ \forall x$, then (9) is nothing but the traditional Parzen estimator of $h(x) = f(x)$. The estimator (9) is also asymptotically consistent provided that the kernel width $\sigma(N)$ is annealed at a sufficiently slow rate. The proof will be presented in another paper.
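As a concrete illustration (ours, not from the paper), the estimators (7)–(9) can be written in a few lines of numpy; when the weights $v(x_i)$ are all one, the generalized estimator reduces to the ordinary Parzen estimate.

```python
import numpy as np

def gaussian_window(x, samples, sigma):
    """W_{sigma^2}(x, x_i) of (8), evaluated for all samples x_i at the query point x."""
    N, d = samples.shape
    sq_dists = np.sum((samples - x) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2) ** (d / 2.0)

def generalized_parzen(x, samples, sigma, v=None):
    """Generalized Parzen estimate h_hat(x) of (9); with v=None it is f_hat(x) of (7)."""
    w = gaussian_window(x, samples, sigma)
    if v is None:                 # v(x) = 1 for all x: traditional Parzen estimator
        return w.mean()
    return np.mean(v * w)         # (1/N) sum_i v(x_i) W_{sigma^2}(x, x_i)
```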
Many approaches have been proposed in order to optimally determine the size of the Parzen window, given a finite sample data set. A simple selection rule was proposed by Silverman [8], using the mean integrated square error (MISE) between the estimated and the actual pdf as the optimality metric:
$$\sigma_{\mathrm{opt}} = \sigma_X \left[4 N^{-1} (2d + 1)^{-1}\right]^{\frac{1}{d+4}}, \qquad (12)$$
where $d$ is the dimensionality of the data and $\sigma_X^2 = d^{-1} \sum_i \Sigma_{X_{ii}}$, where $\Sigma_{X_{ii}}$ are the diagonal elements of the sample covariance matrix. More advanced approximations to the MISE solution also exist.
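A direct transcription of (12) into numpy (ours; variable names are illustrative) could look as follows.

```python
import numpy as np

def silverman_kernel_size(X):
    """Rule-of-thumb Parzen kernel size of (12) for an (N, d) data matrix X."""
    N, d = X.shape
    Sigma = np.atleast_2d(np.cov(X, rowvar=False))   # sample covariance matrix
    sigma_x = np.sqrt(np.mean(np.diag(Sigma)))       # sigma_X^2 = d^{-1} sum_i Sigma_Xii
    return sigma_x * (4.0 / (N * (2.0 * d + 1.0))) ** (1.0 / (d + 4.0))
```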
4 The Laplacian PDF Distance

Cost functions for clustering are often based on distance measures between pdfs. The goal is to assign memberships to the data patterns with respect to a set of clusters, such that the cost function is optimized.

Assume that a data set consists of two clusters. Associate the probability density function $p(x)$ with one of the clusters, and the density $q(x)$ with the other cluster. Let $f(x)$ be the overall probability density function of the data set. Now define the $f^{-1}$-weighted inner product between $p(x)$ and $q(x)$ as $\langle p, q \rangle_f \equiv \int p(x)q(x)f^{-1}(x)\,dx$. In such an inner product space, the Cauchy–Schwarz inequality holds, that is, $\langle p, q \rangle_f^2 \le \langle p, p \rangle_f \langle q, q \rangle_f$. Based on this discussion, an information theoretic distance measure between the two pdfs can be expressed as
$$D_L = -\log \frac{\langle p, q \rangle_f}{\sqrt{\langle p, p \rangle_f \langle q, q \rangle_f}} \ge 0. \qquad (13)$$
We refer to this measure as the Laplacian pdf distance, for reasons that we discuss next. It can be seen that the distance $D_L$ is zero if and only if the two densities are equal. It is non-negative, and increases as the overlap between the two pdfs decreases. However, it does not obey the triangle inequality, and is thus not a distance measure in the strict mathematical sense.
We will now show that the Laplacian pdf distance is also a cost function for clustering in a kernel feature space, using the generalized Parzen estimators discussed in the previous section. Since the logarithm is a monotonic function, we will derive the expression for the argument of the log in (13). This quantity will for simplicity be denoted by the letter "L" in equations.

Assume that we have available the iid data points $\{x_i\}$, $i = 1, \dots, N_1$, drawn from $p(x)$, which is the density of cluster $C_1$, and the iid $\{x_j\}$, $j = 1, \dots, N_2$, drawn from $q(x)$, the density of $C_2$. Let $h(x) = f^{-\frac12}(x)p(x)$ and $g(x) = f^{-\frac12}(x)q(x)$. Hence, we may write
$$L = \frac{\int h(x)g(x)\,dx}{\sqrt{\int h^2(x)\,dx \int g^2(x)\,dx}}. \qquad (14)$$
We estimate $h(x)$ and $g(x)$ by the generalized Parzen kernel estimators, as follows
$$\hat h(x) = \frac{1}{N_1}\sum_{i=1}^{N_1} f^{-\frac12}(x_i)\, W_{\sigma^2}(x, x_i), \quad \hat g(x) = \frac{1}{N_2}\sum_{j=1}^{N_2} f^{-\frac12}(x_j)\, W_{\sigma^2}(x, x_j). \qquad (15)$$
The approach taken is to substitute these estimators into (14), to obtain
$$\int h(x)g(x)\,dx \approx \int \frac{1}{N_1}\sum_{i=1}^{N_1} f^{-\frac12}(x_i)\, W_{\sigma^2}(x, x_i)\; \frac{1}{N_2}\sum_{j=1}^{N_2} f^{-\frac12}(x_j)\, W_{\sigma^2}(x, x_j)\,dx$$
$$= \frac{1}{N_1 N_2}\sum_{i,j=1}^{N_1, N_2} f^{-\frac12}(x_i) f^{-\frac12}(x_j) \int W_{\sigma^2}(x, x_i)\, W_{\sigma^2}(x, x_j)\,dx$$
$$= \frac{1}{N_1 N_2}\sum_{i,j=1}^{N_1, N_2} f^{-\frac12}(x_i) f^{-\frac12}(x_j)\, W_{2\sigma^2}(x_i, x_j), \qquad (16)$$
where in the last step, the convolution theorem for Gaussians has been employed. Similarly, we have
$$\int h^2(x)\,dx \approx \frac{1}{N_1^2}\sum_{i,i'=1}^{N_1, N_1} f^{-\frac12}(x_i) f^{-\frac12}(x_{i'})\, W_{2\sigma^2}(x_i, x_{i'}), \qquad (17)$$
$$\int g^2(x)\,dx \approx \frac{1}{N_2^2}\sum_{j,j'=1}^{N_2, N_2} f^{-\frac12}(x_j) f^{-\frac12}(x_{j'})\, W_{2\sigma^2}(x_j, x_{j'}). \qquad (18)$$
Now we define the matrix $K_f$, such that
$$K_{f_{ij}} = K_f(x_i, x_j) = f^{-\frac12}(x_i) f^{-\frac12}(x_j)\, K(x_i, x_j), \qquad (19)$$
where $K(x_i, x_j) = W_{2\sigma^2}(x_i, x_j)$ for $i, j = 1, \dots, N$ and $N = N_1 + N_2$. As a consequence, (14) can be re-written as follows
$$L = \frac{\sum_{i,j=1}^{N_1, N_2} K_f(x_i, x_j)}{\sqrt{\sum_{i,i'=1}^{N_1, N_1} K_f(x_i, x_{i'}) \sum_{j,j'=1}^{N_2, N_2} K_f(x_j, x_{j'})}}. \qquad (20)$$
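To see how (16)–(20) translate into an actual computation, the following is a hedged numpy sketch (ours, not an algorithm given in the paper): it builds the affinity matrix with the convolved kernel $W_{2\sigma^2}$, weights it by a Parzen estimate of $f$ at the sample points, and evaluates the ratio in (20). The Laplacian pdf distance itself is then $D_L = -\log L$.

```python
import numpy as np

def gaussian_kernel_matrix(X, var):
    """Matrix of Gaussian windows W_var(x_i, x_j), cf. (8), for all sample pairs."""
    d = X.shape[1]
    sq = np.sum(X ** 2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-sq_dists / (2.0 * var)) / (2.0 * np.pi * var) ** (d / 2.0)

def laplacian_pdf_distance_argument(X1, X2, sigma):
    """Estimate the quantity L of (20) from two sample sets and a Parzen size sigma."""
    X = np.vstack([X1, X2])
    N1 = len(X1)
    # Affinity matrix K(x_i, x_j) = W_{2 sigma^2}(x_i, x_j), cf. (19).
    K = gaussian_kernel_matrix(X, 2.0 * sigma ** 2)
    # Parzen estimate of the overall density f at each sample point, here with the
    # same kernel width 2 sigma^2 (our choice; cf. (24)-(25) below).
    f = K.mean(axis=1)
    Kf = K / np.sqrt(np.outer(f, f))            # K_f(x_i,x_j) = f^{-1/2}(x_i) f^{-1/2}(x_j) K(x_i,x_j)
    cross = Kf[:N1, N1:].mean()                 # block means: the 1/N factors cancel in the ratio
    within1 = Kf[:N1, :N1].mean()
    within2 = Kf[N1:, N1:].mean()
    return cross / np.sqrt(within1 * within2)   # eq. (20); also equals the cosine in (23)
```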
The key point of this paper is to note that the matrix $K$, $K_{ij} = K(x_i, x_j)$, $i, j = 1, \dots, N$, is the data affinity matrix, and that $K(x_i, x_j)$ is a Gaussian RBF kernel function. Hence, it is also a kernel function that satisfies Mercer's theorem. Since $K(x_i, x_j)$ satisfies Mercer's theorem, the following by definition holds [4]. For any set of examples $\{x_1, \dots, x_N\}$ and any set of real numbers $\psi_1, \dots, \psi_N$
$$\sum_{i=1}^{N}\sum_{j=1}^{N} \psi_i \psi_j K(x_i, x_j) \ge 0, \qquad (21)$$
in analogy to (3). Moreover, this means that
$$\sum_{i=1}^{N}\sum_{j=1}^{N} \psi_i \psi_j f^{-\frac12}(x_i) f^{-\frac12}(x_j) K(x_i, x_j) = \sum_{i=1}^{N}\sum_{j=1}^{N} \psi_i \psi_j K_f(x_i, x_j) \ge 0, \qquad (22)$$
hence $K_f(x_i, x_j)$ is also a Mercer kernel.
Now, it is readily observed that the Laplacian pdf distance can be analyzed in terms of inner products in a Mercer kernel-based Hilbert feature space, since $K_f(x_i, x_j) = \langle \Phi_f(x_i), \Phi_f(x_j) \rangle$. Consequently, (20) can be written as follows
$$L = \frac{\sum_{i,j=1}^{N_1, N_2} \langle \Phi_f(x_i), \Phi_f(x_j) \rangle}{\sqrt{\sum_{i,i'=1}^{N_1, N_1} \langle \Phi_f(x_i), \Phi_f(x_{i'}) \rangle \sum_{j,j'=1}^{N_2, N_2} \langle \Phi_f(x_j), \Phi_f(x_{j'}) \rangle}}$$
$$= \frac{\left\langle \frac{1}{N_1}\sum_{i=1}^{N_1} \Phi_f(x_i),\; \frac{1}{N_2}\sum_{j=1}^{N_2} \Phi_f(x_j) \right\rangle}{\sqrt{\left\langle \frac{1}{N_1}\sum_{i=1}^{N_1} \Phi_f(x_i),\; \frac{1}{N_1}\sum_{i'=1}^{N_1} \Phi_f(x_{i'}) \right\rangle \left\langle \frac{1}{N_2}\sum_{j=1}^{N_2} \Phi_f(x_j),\; \frac{1}{N_2}\sum_{j'=1}^{N_2} \Phi_f(x_{j'}) \right\rangle}}$$
$$= \frac{\langle m_{1f}, m_{2f} \rangle}{\|m_{1f}\|\,\|m_{2f}\|} = \cos \angle(m_{1f}, m_{2f}), \qquad (23)$$
where $m_{if} = \frac{1}{N_i}\sum_{l=1}^{N_i} \Phi_f(x_l)$, $i = 1, 2$, that is, the sample mean of the $i$th cluster in feature space.
This is a very interesting result. We started out with a distance measure between densities in the input space. By utilizing the Parzen window method, this distance measure turned out to have an equivalent expression as a measure of the distance between two clusters of data points in a Mercer kernel feature space. In the feature space, the distance that is measured is the cosine of the angle between the cluster mean vectors.
The actual mapping of a data point to the kernel feature space is given by the eigendecomposition of $K_f$, via (5). Let us examine this mapping in more detail. Note that $f(x_i)$ can be estimated from the data by the traditional Parzen pdf estimator as follows
$$\hat f(x_i) = \frac{1}{N}\sum_{l=1}^{N} W_{\sigma_f^2}(x_i, x_l) = d_i. \qquad (24)$$
Define the matrix $D = \mathrm{diag}(d_1, \dots, d_N)$. Then $K_f$ can be expressed as
$$K_f = D^{-\frac12} K D^{-\frac12}. \qquad (25)$$
Quite interestingly, for $\sigma_f^2 = 2\sigma^2$, this is in fact the Laplacian data matrix.¹
The above discussion explicitly connects the Parzen kernel and the Mercer kernel. Moreover, automatic procedures exist in the density estimation literature to optimally determine the Parzen kernel given a data set. Thus, the Mercer kernel is also determined by the same procedure. Therefore, the mapping by the Laplacian matrix to the kernel feature space can also be determined automatically. We regard this as a significant result in kernel-based learning theory.
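As a sketch of how this automatic pipeline could look in practice (ours, not from the paper; it reuses the `silverman_kernel_size`, `gaussian_kernel_matrix` and `feature_space_mapping` helpers from the earlier sketches):

```python
import numpy as np

def laplacian_feature_mapping(X, n_dims=2):
    """Map data to the feature space induced by the Laplacian data matrix, cf. (5) and (25)."""
    sigma = silverman_kernel_size(X)                   # automatic Parzen kernel size, cf. (12)
    K = gaussian_kernel_matrix(X, 2.0 * sigma ** 2)    # affinity matrix with W_{2 sigma^2}
    d_vec = K.mean(axis=1)                             # d_i, Parzen estimate of f(x_i), cf. (24)
    Kf = K / np.sqrt(np.outer(d_vec, d_vec))           # K_f = D^{-1/2} K D^{-1/2}, cf. (25)
    return feature_space_mapping(Kf, n_dims=n_dims)    # columns approximate Phi_f(x_i)
```

Plotting the two rows of the returned matrix against each other gives the kind of picture shown in Fig. 1 (c), where the clusters spread out along lines through the origin.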
As an example, consider Fig. 1 (a), which shows a data set consisting of a ring with a dense cluster in the middle. The MISE kernel size is $\sigma_{\mathrm{opt}} = 0.16$, and the Parzen pdf estimate is shown in Fig. 1 (b). The data mapping given by the corresponding Laplacian matrix is shown in Fig. 1 (c) (truncated to two dimensions for visualization purposes). It can be seen that the data is distributed along two lines radially from the origin, indicating that clustering based on the angular measure we have derived makes sense.
The above analysis can easily be extended to any number of pdfs/clusters. In the C-cluster case, we define the Laplacian pdf distance as
$$L = \sum_{i=1}^{C-1} \sum_{j=i+1}^{C} \frac{\langle p_i, p_j \rangle_f}{\sqrt{\langle p_i, p_i \rangle_f \langle p_j, p_j \rangle_f}}. \qquad (26)$$
In the kernel feature space, (26) corresponds to all cluster mean vectors being pairwise as orthogonal to each other as possible, for all possible unique pairs.
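For completeness, here is a hedged sketch (ours) of evaluating the C-cluster cost (26) directly from $K_f$ and an integer label vector, using the fact that each normalized inner product in (26) reduces to block means of $K_f$, exactly as in (20) and (23):

```python
import numpy as np

def multi_cluster_cost(Kf, labels):
    """Evaluate the C-cluster Laplacian cost of (26) from K_f and integer cluster labels.

    Each term is the cosine between a pair of cluster mean vectors in feature space,
    computed from block means of K_f: the mean of the (a, b) block equals <m_af, m_bf>.
    """
    clusters = np.unique(labels)
    blocks = {(a, b): Kf[np.ix_(labels == a, labels == b)].mean()
              for a in clusters for b in clusters}
    cost = 0.0
    for i, a in enumerate(clusters):
        for b in clusters[i + 1:]:
            cost += blocks[(a, b)] / np.sqrt(blocks[(a, a)] * blocks[(b, b)])
    return cost
```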
4.1 Connection to the Ng et al. [2] algorithm
Recently, Ng et al. [2] proposed to map the input data to a feature space determined by the eigenvectors corresponding to the C largest eigenvalues of the Laplacian matrix. In that space, the data was normalized to unit norm and clustered by the C-means algorithm. We have shown that the Laplacian pdf distance provides a clustering cost function, measuring the cosine of the angle between cluster means, in a related kernel feature space, which in our case can be determined automatically. A more principled approach to clustering than that taken by Ng et al. is to optimize (23) in the feature space, instead of using C-means. However, because of the normalization of the data in the feature space, C-means can be interpreted as clustering the data based on an angular measure. This may explain some of the success of the Ng et al. algorithm; it achieves more or less the same goal as clustering based on the Laplacian distance would be expected to do. We will investigate this claim in our future work. Note that we in our framework may choose to use only the C largest eigenvalues/eigenvectors in the mapping, as discussed in section 2. Since we incorporate the eigenvalues in the mapping, in contrast to Ng et al., the actual mapping will in general be different in the two cases.

Figure 1: (a) Data set, (b) Parzen pdf estimate, (c) feature space data. The kernel size is automatically determined (MISE), yielding the Parzen estimate (b) with the corresponding feature space mapping (c).

¹ It is a bit imprecise to refer to $K_f$ as the Laplacian matrix, as readers familiar with spectral graph theory may recognize, since the definition of the Laplacian matrix is $L = I - K_f$. However, replacing $K_f$ by $L$ does not change the eigenvectors; it only changes the eigenvalues from $\lambda_i$ to $1 - \lambda_i$.
5 The Laplacian PDF distance as a risk function
We now give an analysis of the Laplacian pdf distance that may further motivate its use as a clustering cost function. Consider again the two cluster case. The overall data distribution can be expressed as $f(x) = P_1 p(x) + P_2 q(x)$, where $P_i$, $i = 1, 2$, are the priors. Assume that the two clusters are well separated, such that for $x_i \in C_1$, $f(x_i) \approx P_1 p(x_i)$, while for $x_i \in C_2$, $f(x_i) \approx P_2 q(x_i)$. Let us examine the numerator of (14), $\int \frac{p(x)q(x)}{f(x)}\,dx$, in this case. It can be approximated as
$$\int \frac{p(x)q(x)}{f(x)}\,dx \approx \int_{C_1} \frac{p(x)q(x)}{f(x)}\,dx + \int_{C_2} \frac{p(x)q(x)}{f(x)}\,dx \approx \frac{1}{P_1}\int_{C_1} q(x)\,dx + \frac{1}{P_2}\int_{C_2} p(x)\,dx. \qquad (27)$$
By performing a similar calculation for the denominator of (14), it can be shown to be approximately equal to $\frac{1}{\sqrt{P_1 P_2}}$. Hence, the Laplacian pdf distance can be written as a risk function, given by
$$L \approx \sqrt{P_1 P_2}\left[\frac{1}{P_1}\int_{C_1} q(x)\,dx + \frac{1}{P_2}\int_{C_2} p(x)\,dx\right]. \qquad (28)$$
Note that if $P_1 = P_2 = \frac12$, then $L = 2 P_e$, where $P_e$ is the probability of error when assigning data points to the two clusters, that is
$$P_e = P_1 \int_{C_1} q(x)\,dx + P_2 \int_{C_2} p(x)\,dx. \qquad (29)$$
Thus, in this case, minimizing $L$ is equivalent to minimizing $P_e$. However, in the case that $P_1 \neq P_2$, (28) has an even more interesting interpretation. In that situation, it can be seen that the two integrals in the expressions (28) and (29) are weighted exactly oppositely. For example, if $P_1$ is close to one, $L \approx \int_{C_2} p(x)\,dx$, while $P_e \approx \int_{C_1} q(x)\,dx$. Thus, the Laplacian pdf distance emphasizes clustering the most unlikely data points correctly. In many real world applications, this property may be crucial. For example, in medical applications, the most important points to classify correctly are often the least probable, such as detecting some rare disease in a group of patients.
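As a purely illustrative example (our numbers, not from the paper): take $P_1 = 0.9$, $P_2 = 0.1$ and suppose both integrals equal $0.05$. Then (29) gives $P_e = 0.9 \cdot 0.05 + 0.1 \cdot 0.05 = 0.05$, dominated by the $\int_{C_1} q(x)\,dx$ term, while (28) gives $L \approx \sqrt{0.09}\,(0.05/0.9 + 0.05/0.1) \approx 0.17$, dominated by the $\int_{C_2} p(x)\,dx$ term: the relative weights of the two integrals ($9{:}1$ in $P_e$, $1{:}9$ in $L$) are indeed exactly opposite.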
6 Conclusions
We have introduced a new pdf distance measure that we refer to as the Laplacian pdf distance, and we have shown that it is in fact a clustering cost function in a kernel feature space determined by the eigenspectrum of the Laplacian data matrix. In our exposition, the Mercer kernel and the Parzen kernel are equivalent, making it possible to determine the Mercer kernel based on automatic selection procedures for the Parzen kernel. Hence, the Laplacian data matrix and its eigenspectrum can be determined automatically too. We have shown that the new pdf distance has an interesting property as a risk function.

The results we have derived can only be obtained analytically using Gaussian kernels. The same results may be obtained using other Mercer kernels, but this requires an additional approximation wrt. the expectation operator. This discussion is left for future work.
Acknowledgments. This work was partially supported by NSF grant ECS-0300340.
References
[1] Y. Weiss, "Segmentation Using Eigenvectors: A Unifying View," in International Conference on Computer Vision, 1999, pp. 975–982.
[2] A. Y. Ng, M. Jordan, and Y. Weiss, "On Spectral Clustering: Analysis and an Algorithm," in Advances in Neural Information Processing Systems, 14, 2001, vol. 2, pp. 849–856.
[3] K. R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An Introduction to Kernel-Based Learning Algorithms," IEEE Transactions on Neural Networks, vol. 12, no. 2, pp. 181–201, 2001.
[4] J. Mercer, "Functions of Positive and Negative Type and their Connection with the Theory of Integral Equations," Philos. Trans. Roy. Soc. London, vol. A, pp. 415–446, 1909.
[5] C. Williams and M. Seeger, "Using the Nyström Method to Speed Up Kernel Machines," in Advances in Neural Information Processing Systems 13, Vancouver, Canada, 2001, pp. 682–688.
[6] M. Brand and K. Huang, "A Unifying Theorem for Spectral Embedding and Clustering," in Ninth Int'l Workshop on Artificial Intelligence and Statistics, Key West, Florida, USA, 2003.
[7] E. Parzen, "On the Estimation of a Probability Density Function and the Mode," Ann. Math. Stat., vol. 32, pp. 1065–1076, 1962.
[8] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall, London, 1986.