The materials discovery process can be significantly expedited and simplified if we can learn effectively from available knowledge and data. In the present contribution, we show that efficient and accurate prediction of a diverse set of properties of material systems is possible by employing machine (or statistical) learning
methods trained on quantum mechanical computations in combination with the notions of chemical similarity. Using a family of one-dimensional chain systems, we present a general formalism that allows us to discover decision rules that establish a mapping between easily accessible attributes of a system and its properties. It is shown that fingerprints based on either chemo-structural (compositional and configurational information) or the electronic charge density distribution can be used to make ultra-fast, yet accurate, property predictions. Harnessing such learning paradigms extends recent efforts to systematically explore and mine vast chemical spaces, and can significantly accelerate the discovery of new application-specific materials.
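As a rough, hypothetical sketch of the learning step described above, the snippet below maps "fingerprint" vectors to a property with kernel ridge regression, a common choice in materials informatics; the toy data, kernel width, and regularization are illustrative choices of ours, not the paper's model.

```python
import numpy as np

# Learn a toy "property" from toy chemo-structural fingerprint vectors
# with kernel ridge regression (illustrative stand-in for the paper's
# learning scheme).
rng = np.random.default_rng(0)
X = rng.random((40, 3))                      # toy fingerprints
y = X @ np.array([1.0, -2.0, 0.5])           # toy property to learn

def gaussian_kernel(A, B, sigma=0.3):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

lam = 1e-6                                   # ridge regularization
alpha = np.linalg.solve(gaussian_kernel(X, X) + lam * np.eye(len(X)), y)

def predict(Xnew):
    # ultra-fast prediction: one kernel evaluation against the training set
    return gaussian_kernel(Xnew, X) @ alpha

print(np.abs(predict(X) - y).max())          # small training residual
```

Once the dual weights are solved for, each new prediction costs only a kernel evaluation against the training set, which is what makes such surrogate models fast relative to a quantum mechanical computation.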
A COMPREHENSIVE ANALYSIS OF QUANTUM CLUSTERING: FINDING ALL THE POTENTIAL MI...IJDKP
Quantum clustering (QC) is a data clustering algorithm based on quantum mechanics, accomplished by substituting each point in a given dataset with a Gaussian. The width of each Gaussian is σ, a hyper-parameter that can be manually defined and tuned to suit the application. Numerical methods are used to find all the minima of the quantum potential, as they correspond to cluster centers. Herein, we investigate the mathematical task of expressing and finding all the roots of the exponential polynomial corresponding to the minima of a two-dimensional quantum potential. This is a demanding task because such expressions are normally impossible to solve analytically. However, we prove that if the points are all included in a square region of size σ, there is only one minimum. This bound is useful not only for knowing how many solutions to look for by numerical means; it also allows us to propose a new "per block" numerical approach. This technique decreases the number of particles by approximating some groups of particles with weighted particles. These findings are useful not only for the quantum clustering problem but also for the exponential polynomials encountered in quantum chemistry, solid-state physics and other applications.
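A minimal 1-D sketch of the construction (our own toy illustration, much simpler than the paper's 2-D analysis): build the Parzen wavefunction psi as a sum of Gaussians on the data, recover the potential from the Schrodinger equation, V = E + (sigma^2/2) psi''/psi up to the constant E, and read cluster centers off the minima of V.

```python
import numpy as np

# Two obvious 1-D groups; sigma controls the Gaussian width per point.
data = np.array([-2.1, -1.9, -2.0, 2.0, 2.1, 1.9])
sigma = 0.5
x = np.linspace(-4, 4, 801)
h = x[1] - x[0]

# Parzen wavefunction: one Gaussian per data point.
psi = np.exp(-(x[:, None] - data[None, :]) ** 2 / (2 * sigma**2)).sum(1)
lap = np.gradient(np.gradient(psi, h), h)     # numerical psi''
V = 0.5 * sigma**2 * lap / psi                # potential, up to the constant E
V -= V.min()

# Interior local minima of V serve as cluster centers.
mins = x[1:-1][(V[1:-1] < V[:-2]) & (V[1:-1] < V[2:])]
print(mins)    # roughly one minimum near each group center
```

With all six points inside one σ-sized window per group, each group contributes a single minimum, consistent with the single-minimum bound the abstract proves for a σ-sized square.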
This document discusses a hybridization of the Magnetic Charge System Search (MCSS) method for efficient data clustering. MCSS is a meta-heuristic algorithm inspired by electromagnetic theory that has shown potential but also has issues with convergence rate and getting stuck in local optima. The authors propose a Hybrid MCSS (HMCSS) that incorporates a local search strategy and differential evolution inspired updating to improve convergence. An experiment on benchmark functions and real clustering problems shows HMCSS provides better results than existing algorithms and enhances MCSS convergence.
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...ijsrd.com
A cluster is a group of objects which are similar to each other within the cluster and dissimilar to the objects of other clusters. The similarity is typically calculated on the basis of the distance between two objects or clusters: two or more objects belong inside a cluster only if they are close to each other in terms of that distance. The major objective of clustering is to discover collections of comparable objects based on a similarity metric. Fuzzy Possibilistic C-Means (FPCM) is an effective clustering algorithm for unlabeled data that produces both membership and typicality values during the clustering process. In this approach, the efficiency of the Fuzzy Possibilistic C-Means clustering approach is enhanced by using penalized and compensated constraints based FPCM (PCFPCM). The proposed PCFPCM approach differs from conventional clustering techniques by imposing the possibilistic reasoning strategy on fuzzy clustering, with penalized and compensated constraints for updating the grades of membership and typicality. The performance of the proposed approaches is evaluated on University of California, Irvine (UCI) machine learning repository datasets such as Iris, Wine, Lung Cancer and Lymphography. The parameters used for the evaluation are clustering accuracy, Mean Squared Error (MSE), execution time and convergence behavior.
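For orientation, a bare-bones FPCM iteration (Pal et al.'s base formulation, which the penalized/compensated variant above extends) looks roughly like this sketch, where memberships U are normalized across clusters and typicalities T across points; the data and parameters are illustrative.

```python
import numpy as np

# Toy 1-D data: two groups near 0 and 2.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.2, (12, 1)), rng.normal(2, 0.2, (12, 1))])
k, m, eta = 2, 2.0, 2.0
V = X[[0, -1]].copy()                            # initial prototypes

for _ in range(25):
    d2 = (X - V.T) ** 2 + 1e-12                  # n x k squared distances
    inv = 1 / d2
    U = inv / inv.sum(1, keepdims=True)          # memberships (sum to 1 per point)
    T = inv / inv.sum(0, keepdims=True)          # typicalities (sum to 1 per cluster)
    W = U**m + T**eta                            # both drive the prototype update
    V = (W.T @ X) / W.sum(0)[:, None]

print(V.ravel())   # prototypes settle near the two group centres
```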
This document discusses various techniques for characterizing metamaterials. It categorizes the techniques into analytical methods, field averaging methods, and experimental methods. Analytical methods include modeling the unit cell and periodic structure to determine effective parameters. Field averaging methods involve obtaining local field averages to derive constitutive parameters from transmission and reflection data. Experimental techniques discussed are the Nicolson-Ross-Weir method for extracting parameters from scattering data and resonator/free space methods for laboratory measurements.
Clustering of high-dimensional data, which is now seen in almost all fields, is becoming a very tedious process. The key disadvantage of high-dimensional data is the curse of dimensionality: as the magnitude of a dataset grows, the data points become sparse and the density of the space drops, making it difficult to cluster the data and further reducing the performance of traditional clustering algorithms. Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors [2]. In this paper, we unify vector-based and graph-based approaches. We first show that a recently proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the global kernel k-means objective [3]. A recent theoretical connection between global kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data. In particular, some methods have been proposed for semi-supervised clustering based on pairwise similarity or dissimilarity information. In this paper, we propose a kernel approach for semi-supervised clustering and present in detail two special cases of this kernel approach.
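A common way to realize this unification, sketched below under our own illustrative choices of data, kernel, and penalty weight, is to fold the pairwise supervision into the kernel matrix (reward must-links, penalize cannot-links) and then run plain kernel k-means on the modified matrix.

```python
import numpy as np

# Two toy 2-D blobs; linear kernel stands in for any Mercer kernel.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])
K = X @ X.T

w = 5.0
for i, j in [(0, 1), (10, 11)]:               # must-link pairs
    K[i, j] += w; K[j, i] += w
for i, j in [(0, 10)]:                        # cannot-link pair
    K[i, j] -= w; K[j, i] -= w

def kernel_kmeans(K, labels, k, iters=20):
    n = len(K)
    for _ in range(iters):
        D = np.full((n, k), np.inf)           # squared distance to each center
        for c in range(k):
            idx = labels == c
            m = idx.sum()
            if m == 0:
                continue
            D[:, c] = (np.diag(K) - 2 * K[:, idx].sum(1) / m
                       + K[np.ix_(idx, idx)].sum() / m**2)
        labels = D.argmin(1)
    return labels

# Seed the two clusters from the cannot-linked endpoints 0 and 10.
d0 = np.diag(K) - 2 * K[:, 0] + K[0, 0]
d1 = np.diag(K) - 2 * K[:, 10] + K[10, 10]
labels = kernel_kmeans(K, (d1 < d0).astype(int), 2)
print(labels)
```

The distance computation never touches the vectors themselves, only kernel entries, which is why the same code serves graph affinities as well as vector data.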
Overview combining ab initio with continuum theory Dierk Raabe
Multi-methodological approaches combining quantum-mechanical and/or atomistic simulations
with continuum methods have become increasingly important when addressing multi-scale phenomena in
computational materials science. A crucial aspect when applying these strategies is to carefully check,
and if possible to control, a variety of intrinsic errors and their propagation through a particular multi-methodological
scheme. The first part of our paper critically reviews a few selected sources of errors
frequently occurring in quantum-mechanical approaches to materials science and their multi-scale propagation
when describing properties of multi-component and multi-phase polycrystalline metallic alloys.
Our analysis is illustrated in particular on the determination of i) thermodynamic materials properties at
finite temperatures and ii) integral elastic responses. The second part addresses methodological challenges
emerging at interfaces between electronic structure and/or atomistic modeling on the one side and selected
continuum methods, such as crystal elasticity and crystal plasticity finite element method (CEFEM and
CPFEM), the fast Fourier transform (FFT) approach, and phase-field modeling, on the other side.
This document discusses a method for predicting the dynamic response and flutter characteristics of structures using experimental modal parameters when the exact system properties like mass and stiffness are unknown. The method uses modal parameters obtained from ground vibration tests in finite element and computational fluid dynamics software to analyze transient response and flutter speeds. It was validated on a tapered aluminum plate structure by comparing results obtained using experimental modal data to those from a finite element model using the actual material properties. Close agreement was observed between the two methods, showing this approach can accurately analyze structures without prior knowledge of system configurations.
Novel text categorization by amalgamation of augmented k nearest neighbourhoo...ijcsity
Machine learning for text classification is the underpinning of document cataloging, news filtering, document steering and exemplification. In the text mining realm, effective feature selection is significant to make the learning task more accurate and competent. One of the traditional lazy text classifiers, k-Nearest Neighborhood (kNN), has a major pitfall in calculating the similarity between all the objects in the training and testing sets, which leads to an exaggeration of both the computational complexity of the algorithm and massive consumption of main memory. To diminish these shortcomings from the viewpoint of a data-mining practitioner, an amalgamative technique is proposed in this paper using a novel restructured version of kNN called Augmented kNN (AkNN) and k-Medoids (kMdd) clustering. The proposed work preprocesses the initial training set by imposing attribute feature selection for reduction of high dimensionality; it also detects and excludes the high-flier samples in the initial training set and restructures a constricted training set. The kMdd clustering algorithm generates the cluster centers (as interior objects) for each category and restructures the constricted training set with centroids. This technique is amalgamated with the AkNN classifier, which was prearranged with text mining similarity measures. Eventually, significant weights and ranks were assigned to each object in the new training set based upon their accessory towards the objects in the testing set. Experiments conducted on Reuters-21578, a UCI benchmark text mining dataset, and comparisons with the traditional kNN classifier indicate that the referred method yields preeminent performance in both clustering and classification.
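The condensation idea at the heart of this pipeline can be sketched in a few lines: replace each category's training points by a few k-medoids "interior objects" and classify test points by nearest neighbour against that constricted set. Plain 2-D vectors stand in for text feature vectors here, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def k_medoids(X, k, iters=10):
    # Simple alternating k-medoids on Euclidean distances.
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    medoids = list(range(k))
    for _ in range(iters):
        labels = D[:, medoids].argmin(1)
        medoids = [np.flatnonzero(labels == c)[
                       D[np.ix_(labels == c, labels == c)].sum(0).argmin()]
                   for c in range(k)]
    return X[medoids]

classA = rng.normal(0, 0.4, (30, 2))
classB = rng.normal(3, 0.4, (30, 2))
# Constricted training set: 3 medoids per category instead of 30 points.
train = {0: k_medoids(classA, 3), 1: k_medoids(classB, 3)}

def classify(x):
    # 1-NN against the medoids only.
    best = min((np.linalg.norm(x - m), label)
               for label, meds in train.items() for m in meds)
    return best[1]

print(classify(np.array([0.2, 0.1])), classify(np.array([2.8, 3.1])))
```

The similarity computation that kNN normally pays per test object now runs against 6 medoids rather than 60 training points, which is the complexity/memory saving the abstract describes.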
Textual Data Partitioning with Relationship and Discriminative Analysis Editor IJMTER
Data partitioning methods are used to partition data values by similarity, and similarity measures are used to estimate transaction relationships. The hierarchical clustering model produces tree-structured results, while partitional clustering produces results in a grid format. Text documents are unstructured data values with high-dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) for the document grouping process, and clustering accuracy degrades drastically with an unsuitable cluster count.
Textual data elements are divided into two types: discriminative words and non-discriminative words. Only discriminative words are useful for grouping documents; the involvement of non-discriminative words confuses the clustering process and leads to a poor clustering solution in return. A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. The Dirichlet Process Mixture (DPM) model is used to partition documents; the DPM clustering model uses both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model. DPMFP clustering is performed without requiring the number of clusters as input.
Document labels are used to estimate the discriminative word identification process. Concept relationships are analyzed with ontology support, and a semantic weight model is used for the document similarity analysis. The system improves scalability with the support of labels and concept relations for the dimensionality reduction process.
The document proposes a novel Spatial Fuzzy C-Means (PET-SFCM) clustering algorithm to segment PET scan images of patients with neurodegenerative disorders like Alzheimer's disease. The algorithm incorporates spatial neighborhood information into the traditional Fuzzy C-Means algorithm. It was tested on real patient data sets and showed satisfactory results compared to conventional FCM and K-Means clustering algorithms. The PET-SFCM algorithm provides an effective way to segment PET images and analyze brain changes related to neurological conditions.
An Efficient Clustering Method for Aggregation on Data Fragments IJMER
Clustering is an important step in the process of data analysis, with applications in numerous fields. Clustering ensembles have emerged as a powerful technique for combining different clustering results to obtain a quality cluster. Existing clustering aggregation algorithms are applied directly to a large number of data points and become inefficient as the number of data points grows. This project defines an efficient approach for clustering aggregation based on data fragments, where a data fragment is any subset of the data. To increase efficiency, the clustering aggregation is performed directly on data fragments under a comparison measure and normalized mutual information measures, and enhanced versions of the clustering aggregation algorithms (Agglomerative, Furthest, and Local Search) are described that keep computational complexity minimal while increasing accuracy.
A PSO-Based Subtractive Data Clustering Algorithm IJORCS
There is a tremendous proliferation in the amount of information available on the largest shared information source, the World Wide Web. Fast, high-quality clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize the information. Recent studies have shown that partitional clustering algorithms such as the k-means algorithm are the most popular algorithms for clustering large datasets. The major problem with partitional clustering algorithms is that they are sensitive to the selection of the initial partitions and are prone to premature convergence to local optima. Subtractive clustering is a fast, one-pass algorithm for estimating the number of clusters and the cluster centers for any given set of data; the cluster estimates can be used to initialize iterative optimization-based clustering methods and model identification methods. In this paper, we present a hybrid Subtractive + Particle Swarm Optimization (PSO) clustering algorithm that performs fast clustering. For comparison purposes, we applied the Subtractive + PSO, PSO, and Subtractive clustering algorithms on three different datasets. The results illustrate that the Subtractive + PSO clustering algorithm generates the most compact clustering results compared to the other algorithms.
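The one-pass subtractive step can be sketched as follows (Chiu-style potential formulation; the radii and stopping ratio below are illustrative choices of ours): every point gets a potential from its neighbours, the highest-potential point becomes a center, and nearby potential is subtracted before the next pick.

```python
import numpy as np

# Two toy blobs near (0,0) and (2,2).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(2, 0.1, (20, 2))])

ra, rb = 1.0, 1.5                               # neighbourhood / revision radii
alpha, beta = 4 / ra**2, 4 / rb**2
D2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
P = np.exp(-alpha * D2).sum(1)                  # potential of each point

centers, P1 = [], P.max()
while P.max() > 0.5 * P1:                       # simple acceptance threshold
    c = P.argmax()
    centers.append(X[c])
    # Subtract the chosen center's influence so its neighbourhood
    # cannot be picked again.
    P = P - P[c] * np.exp(-beta * ((X - X[c]) ** 2).sum(1))

print(len(centers))
```

The centers found this way (here, one per blob) are exactly the kind of estimate the abstract describes feeding into an iterative optimizer such as PSO.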
In this video from the 2015 HPC User Forum in Broomfield, Barry Bolding from Cray presents: HPC + D + A = HPDA?
"The flexible, multi-use Cray Urika-XA extreme analytics platform addresses perhaps the most critical obstacle in data analytics today — limitation. Analytics problems are getting more varied and complex but the available solution technologies have significant constraints. Traditional analytics appliances lock you into a single approach and building a custom solution in-house is so difficult and time consuming that the business value derived from analytics fails to materialize. In contrast, the Urika-XA platform is open, high performing and cost effective, serving a wide range of analytics tools with varying computing demands in a single environment. Pre-integrated with the Hadoop and Spark frameworks, the Urika-XA system combines the benefits of a turnkey analytics appliance with a flexible, open platform that you can modify for future analytics workloads. This single-platform consolidation of workloads reduces your analytics footprint and total cost of ownership."
Learn more: http://www.cray.com/products/analytics/urika-xa
Watch the video presentation: http://wp.me/p3RLEV-3yR
MK-Prototypes: A Novel Algorithm for Clustering Mixed Type Data IJMER
Clustering mixed type data is one of the major research topics in the area of data mining. In
this paper, a new algorithm for clustering mixed type data is proposed where the concept of distribution
centroid is used to represent the prototype of categorical variables in a cluster which is then combined
with the mean to represent the prototype of clusters with mixed type variables. In the method, data is
observed from different views and the variables are grouped into different views. Those instances that
can be viewed differently from different viewpoints can be defined as multiview data. During the clustering
process, the differences among views are usually ignored. Here, both view and variable weights
are computed simultaneously: the view weight is used to determine the closeness or density of a view, and
the variable weight is used to identify the significance of each variable. Both weights are used in the
distance function to determine the cluster of each object. In the proposed method, the k-prototypes
algorithm is enhanced so that it automatically computes both view and variable weights. The proposed
MK-Prototypes algorithm is compared with two other clustering algorithms.
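The "distribution centroid" for the categorical part of a mixed-type prototype can be illustrated with a minimal sketch (our own, omitting the paper's view and variable weighting): represent the prototype of a categorical variable by the value frequencies within the cluster, and score a candidate value by one minus its frequency.

```python
import numpy as np

# One categorical variable observed for the objects of one cluster.
cluster_colors = ["red", "red", "blue", "red"]
values, counts = np.unique(cluster_colors, return_counts=True)
freq = dict(zip(values, counts / counts.sum()))      # distribution centroid

def cat_dist(value):
    # Distance contribution of a categorical value to this prototype:
    # frequent values are "close", unseen values maximally far.
    return 1.0 - freq.get(value, 0.0)

print(cat_dist("red"), cat_dist("blue"), cat_dist("green"))
```

In a full mixed-type distance this categorical term would be combined with the squared deviation from the numeric mean, each scaled by the learned view and variable weights.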
Found this paper really interesting. It delves into the learning behavior of deep-learning ensembles and compares them to Bayesian Neural Networks, which in theory do the same thing. It answers why deep ensembles outperform.
This document proposes a Bayesian framework for detecting changepoints in multiple categorical sequences that are composed of piecewise homogeneous Markov segments. The framework models each sequence as having an ordered set of segments separated by unobserved changepoints, with each segment following its own Markov transition matrix. The model places priors on the changepoint locations and transition matrices and uses MCMC to obtain posterior estimates. Experimental results on simulated data show the methodology can infer changepoint positions across varying sequences and handle different numbers of changepoints per sequence. The approach is also applied to real data on monsoonal rainfall patterns and tree branching patterns.
In the recent machine learning community, there is a trend of constructing a nonlinear version of a linear algorithm through the 'kernel method', for example kernel principal component analysis, kernel Fisher discriminant analysis, Support Vector Machines (SVMs), and the recent kernel clustering algorithms. Typically, in unsupervised kernel clustering algorithms, a nonlinear mapping is first applied to map the data into a much higher-dimensional feature space, and clustering is then executed there. A drawback of these kernel clustering algorithms is that the clustering prototypes reside in the high-dimensional feature space and therefore lack clear, intuitive descriptions unless an additional approximate projection from the feature space back to the data space is used, as done in the existing literature. In this paper, utilizing the 'kernel method', a novel clustering algorithm founded on the conventional fuzzy c-means algorithm (FCM) is proposed, called the kernel fuzzy c-means algorithm (KFCM). This method adopts a new kernel-induced metric in the data space to replace the original Euclidean norm, and the cluster prototypes still reside in the data space, so the clustering results can be interpreted directly in the original space. This property is used for clustering incomplete data. Experiments on test data illustrate that KFCM has better clustering performance and is more robust than other variants of FCM for clustering incomplete data.
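A compact sketch of the kernel-induced metric idea (our own minimal illustration with a Gaussian kernel and toy 1-D data; parameters are not from the paper): the Euclidean distance of FCM is replaced by d^2 = 2(1 - K(x, v)), while the prototypes v stay in the original data space.

```python
import numpy as np

# Toy 1-D data: two groups near 0 and 3.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.2, (15, 1)), rng.normal(3, 0.2, (15, 1))])
k, m, sigma = 2, 2.0, 1.0
V = X[[0, -1]].copy()                                  # initial prototypes

for _ in range(30):
    Kxv = np.exp(-((X - V.T) ** 2) / (2 * sigma**2))   # n x k kernel values
    d2 = 2 * (1 - Kxv) + 1e-12                         # kernel-induced metric
    U = (1 / d2) / (1 / d2).sum(1, keepdims=True)      # memberships (m = 2)
    W = (U**m) * Kxv                                   # kernel-weighted update
    V = (W.T @ X) / W.sum(0)[:, None]

print(V.ravel())   # prototypes settle near the two group centres
```

Because V lives in the data space rather than the implicit feature space, the fitted prototypes are directly interpretable, which is the property the abstract exploits for incomplete data.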
The document discusses implementing an integrated approach of the K-means clustering algorithm for prediction analysis. It begins with motivating the need to improve the accuracy and dependability of existing overlapping K-means clustering by removing its dependency on random initialization parameters. The proposed methodology determines the optimal number of clusters K based on the dataset, calculates initial centroid positions using a harmonic means method, and applies overlapping K-means clustering. The implementation and results on two large datasets show the integrated approach outperforms original overlapping K-means in terms of accuracy, F-measure, Rand index, and number of iterations.
This document discusses using hidden Markov models (HMMs) for unsupervised learning in hyperspectral image classification. It proposes an HMM-based probability density function classifier that models hyperspectral data using a reduced feature space. The approach uses an unsupervised learning scheme for maximum likelihood parameter estimation, combining both model selection and estimation. This HMM method can accurately model and synthesize approximate observations of true hyperspectral data in a reduced feature space without relying on supervised learning.
A survey on Efficient Enhanced K-Means Clustering Algorithm ijsrd.com
Data mining is the process of using technology to identify patterns and prospects from large amounts of information. In data mining, clustering is an important research topic with a wide range of unsupervised classification applications. Clustering is a technique which divides data into meaningful groups. K-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. In this paper, we present a comparison of different K-means clustering algorithms.
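For reference, the Lloyd-style baseline that the surveyed variants build on can be written in a few lines: alternate nearest-mean assignment and centroid recomputation (toy data and deterministic seeds below are illustrative).

```python
import numpy as np

# Two toy blobs near (0,0) and (4,4).
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (25, 2)), rng.normal(4, 0.3, (25, 2))])

k = 2
C = X[[0, -1]].astype(float)         # simple deterministic seeds, one per blob
for _ in range(10):
    # Assign each observation to the cluster with the nearest mean.
    labels = np.linalg.norm(X[:, None] - C[None], axis=-1).argmin(1)
    # Recompute each centroid as the mean of its assigned points.
    C = np.array([X[labels == c].mean(0) for c in range(k)])

print(np.round(C, 1))
```

The enhanced variants in the survey differ mainly in how the seeds C are chosen and how the assignment step is accelerated, not in this core loop.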
A Counterexample to the Forward Recursion in Fuzzy Critical Path Analysis Und...ijfls
This document presents a counterexample demonstrating that the fuzzy forward recursion method for determining critical paths does not always produce results consistent with the extension principle when discrete fuzzy sets are used to represent activity durations.
The document first provides background on fuzzy sets and critical path analysis. It then presents a proposition stating that the membership function for fuzzy critical path lengths can be determined by taking the maximum of the minimum membership values across all activity durations in each configuration.
The document goes on to present a counterexample using a simple series-parallel network with 18 configurations. It shows that applying the fuzzy forward recursion produces a different membership value for one critical path length than directly applying the extension principle, demonstrating that the fuzzy forward recursion is not, in general, equivalent to the extension principle.
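The max-min composition at the heart of the proposition can be illustrated on a toy two-activity series path; the fuzzy durations below are invented for illustration:

```python
from itertools import product

# Toy fuzzy activity durations: duration -> membership degree (assumed data).
A = {2: 1.0, 3: 0.5}
B = {1: 0.7, 2: 1.0}

# Extension principle for the length of a two-activity series path:
# mu_L(l) = max over all (a, b) with a + b = l of min(mu_A(a), mu_B(b)).
length_membership = {}
for (a, ma), (b, mb) in product(A.items(), B.items()):
    l = a + b
    length_membership[l] = max(length_membership.get(l, 0.0), min(ma, mb))

print(length_membership)   # {3: 0.7, 4: 1.0, 5: 0.5}
```

The counterexample in the document shows that computing this max-min directly over whole configurations can disagree with propagating fuzzy values forward through the network node by node.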
CHN and Swap Heuristic to Solve the Maximum Independent Set ProblemIJECEIAES
We describe a new approach to the problem of finding the maximum independent set in a given graph, also known as the Max-Stable Set Problem (MSSP). In this paper, we show how the Max-Stable problem can be reformulated as a linear problem under quadratic constraints, and we then solve the resulting QP with a hybrid approach based on a Continuous Hopfield Neural Network (CHN) and local search, in which the solution given by the CHN is the starting point of the local search. The new approach shows better performance than the original one, which executes a suite of CHN runs, adding a new linear constraint to the resolved model at each execution. To demonstrate the efficiency of our approach, we present computational experiments on randomly generated problems and on typical MSSP instances from real-life problems.
EMI 2021 - A comparative review of peridynamics and phase-field models for en...Patrick Diehl
This document provides a comparative review of peridynamics and phase-field models for engineering fracture mechanics. It discusses the computational aspects of both models, their advantages in modeling crack initiation and propagation, and common challenges. Some specific challenges for each model are also outlined, such as applying boundary conditions for peridynamics and modeling fast crack propagation under dynamic loading for phase-field. The conclusion states that both models can capture microscale fracture physics but comparative validation studies against experimental data are still lacking.
The document discusses modifications to the PC algorithm for constraint-based causal structure learning that remove its order-dependence, which can lead to highly variable results in high-dimensional settings; the modified algorithms are order-independent while maintaining consistency under the same conditions, and simulations and analysis of yeast gene expression data show they improve performance over the original PC algorithm in high-dimensional settings.
This document discusses using convolutional neural networks and self-organizing maps to visualize knowledge graphs. It presents a compositional model for embedding the nodes and relationships of a knowledge graph into a vector space. A self-organizing map is used to cluster the embeddings and extract semantic fingerprints, which are useful for knowledge discovery and classification tasks. The technique is applied to a subset of the CTD knowledge graph containing compound-gene/protein interactions, and the results are comparable to structural models.
Ensemble based Distributed K-Modes ClusteringIJERD Editor
Clustering has been recognized as the unsupervised classification of data items into groups. Due to the explosion in the number of autonomous data sources, there is an emergent need for effective approaches to distributed clustering. A distributed clustering algorithm clusters distributed datasets without gathering all the data at a single site. K-Means is a popular clustering method owing to its simplicity and speed in clustering large datasets, but it cannot directly handle datasets with categorical attributes, which commonly occur in real-life data. Huang proposed the K-Modes clustering algorithm by introducing a new dissimilarity measure to cluster categorical data. This algorithm replaces the means of clusters with a frequency-based method that updates modes during the clustering process to minimize the cost function. Most distributed clustering algorithms in the literature seek to cluster numerical data. In this paper, a novel Ensemble-based Distributed K-Modes clustering algorithm is proposed, which is well suited both to handling categorical data sets and to performing the distributed clustering process in an asynchronous manner. The performance of the proposed algorithm is compared with existing distributed K-Means clustering algorithms and a K-Modes-based centralized clustering algorithm. The experiments are carried out on various datasets from the UCI machine learning repository.
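The two ingredients K-Modes adds to K-Means, a simple matching dissimilarity and a frequency-based mode update, can be sketched in a few lines; the categorical records here are invented:

```python
from collections import Counter

def matching_dissimilarity(x, mode):
    """Huang-style measure: count of attributes where the object and mode differ."""
    return sum(a != b for a, b in zip(x, mode))

def update_mode(cluster):
    """Frequency-based update: per attribute, take the most frequent category."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
mode = update_mode(cluster)
print(mode)                                             # ('red', 'small')
print(matching_dissimilarity(("blue", "large"), mode))  # 2
```

Plugging these into the usual assign-then-update loop in place of Euclidean distance and the mean gives the K-Modes iteration the entry describes.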
This document discusses hierarchical clustering and similarity measures for document clustering. It summarizes that hierarchical clustering creates a hierarchical decomposition of data objects through either agglomerative or divisive approaches. The success of clustering depends on the similarity measure used, with traditional measures using a single viewpoint, while multiviewpoint measures use different viewpoints to increase accuracy. The paper then focuses on applying a multiviewpoint similarity measure to hierarchical clustering of documents.
This document summarizes Gunnar Rätsch's presentation on machine learning in science and engineering. It introduces several applications of machine learning, including spam classification, drug design, and face detection. It also provides overviews of boosting algorithms like AdaBoost and support vector machines. Key algorithms like boosting, SVMs, and kernels are explained at a high level. Finally, applications of machine learning across various domains are briefly mentioned.
Textual Data Partitioning with Relationship and Discriminative AnalysisEditor IJMTER
Data partitioning methods are used to partition data values by similarity, and similarity measures are used to estimate transaction relationships. Hierarchical clustering models produce tree-structured results, while partitional clustering produces results in a grid format. Text documents are unstructured data with high-dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) for the document grouping process, and clustering accuracy degrades drastically under an unsuitable cluster count.
Textual data elements are divided into two types: discriminative words and nondiscriminative words. Only discriminative words are useful for grouping documents; the involvement of nondiscriminative words confuses the clustering process and leads to poor clustering solutions. A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. The Dirichlet Process Mixture (DPM) model is used to partition documents; it uses both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model, and DPMFP clustering is performed without requiring the number of clusters as input.
Document labels are used to assist the discriminative word identification process. Concept relationships are analyzed with ontology support, and a semantic weight model is used for document similarity analysis. The system improves scalability with the support of labels and concept relations for the dimensionality reduction process.
The document proposes a novel Spatial Fuzzy C-Means (PET-SFCM) clustering algorithm to segment PET scan images of patients with neurodegenerative disorders like Alzheimer's disease. The algorithm incorporates spatial neighborhood information into the traditional Fuzzy C-Means algorithm. It was tested on real patient data sets and showed satisfactory results compared to conventional FCM and K-Means clustering algorithms. The PET-SFCM algorithm provides an effective way to segment PET images and analyze brain changes related to neurological conditions.
An Efficient Clustering Method for Aggregation on Data FragmentsIJMER
Clustering is an important step in the process of data analysis, with applications in numerous fields. Clustering ensembles have emerged as a powerful technique for combining different clustering results to obtain a quality clustering. Existing clustering aggregation algorithms are applied directly to the full set of data points and become inefficient when the number of data points is large. This project defines an efficient approach to clustering aggregation based on data fragments, where a data fragment is any subset of the data. To increase efficiency, clustering aggregation is performed directly on data fragments; enhanced clustering aggregation algorithms (Agglomerative, Furthest, and Local Search) are described under a comparison measure and normalized mutual information measures, reducing computational complexity while increasing accuracy.
A PSO-Based Subtractive Data Clustering AlgorithmIJORCS
There is a tremendous proliferation in the amount of information available on the largest shared information source, the World Wide Web. Fast and high-quality clustering algorithms play an important role in helping users effectively navigate, summarize, and organize the information. Recent studies have shown that partitional clustering algorithms such as the k-means algorithm are the most popular algorithms for clustering large datasets. The major problem with partitional clustering algorithms is that they are sensitive to the selection of the initial partitions and are prone to premature convergence to local optima. Subtractive clustering is a fast, one-pass algorithm for estimating the number of clusters and the cluster centers for any given set of data. The cluster estimates can be used to initialize iterative optimization-based clustering methods and model identification methods. In this paper, we present a hybrid Subtractive + PSO (Particle Swarm Optimization) clustering algorithm that performs fast clustering. For comparison purposes, we applied the Subtractive + PSO clustering algorithm, PSO, and subtractive clustering to three different datasets. The results illustrate that the Subtractive + PSO clustering algorithm generates the most compact clustering results compared to the other algorithms.
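A minimal sketch of one-pass subtractive clustering in the style the abstract refers to; the neighborhood radii, stopping ratio, and toy data are illustrative assumptions, not values from the paper:

```python
import math

def subtractive_centers(points, ra=1.0, rb=1.5, stop_ratio=0.15):
    """One-pass subtractive clustering: repeatedly pick the point of highest
    'potential' as a center, then damp potential near that center."""
    alpha, beta = 4.0 / ra ** 2, 4.0 / rb ** 2
    dist2 = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q))
    # A point's potential grows with the number of nearby points.
    pot = [sum(math.exp(-alpha * dist2(p, q)) for q in points) for p in points]
    first_peak, centers = max(pot), []
    while True:
        i = max(range(len(points)), key=pot.__getitem__)
        if pot[i] < stop_ratio * first_peak:   # remaining potential too low: stop
            break
        center = points[i]
        centers.append(center)
        # Subtract the chosen center's influence so nearby points lose potential.
        pot = [pv - pot[i] * math.exp(-beta * dist2(p, center))
               for p, pv in zip(points, pot)]
    return centers

data = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (5.0, 5.0), (5.1, 4.9)]
print(subtractive_centers(data))   # one center inside each of the two groups
```

Because the centers are actual data points and the number of clusters falls out of the stopping rule, the output is a natural initializer for an iterative method such as k-means or PSO-based clustering.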
In this video from the 2015 HPC User Forum in Broomfield, Barry Bolding from Cray presents: HPC + D + A = HPDA?
"The flexible, multi-use Cray Urika-XA extreme analytics platform addresses perhaps the most critical obstacle in data analytics today — limitation. Analytics problems are getting more varied and complex but the available solution technologies have significant constraints. Traditional analytics appliances lock you into a single approach and building a custom solution in-house is so difficult and time consuming that the business value derived from analytics fails to materialize. In contrast, the Urika-XA platform is open, high performing and cost effective, serving a wide range of analytics tools with varying computing demands in a single environment. Pre-integrated with the Hadoop and Spark frameworks, the Urika-XA system combines the benefits of a turnkey analytics appliance with a flexible, open platform that you can modify for future analytics workloads. This single-platform consolidation of workloads reduces your analytics footprint and total cost of ownership."
Learn more: http://www.cray.com/products/analytics/urika-xa
Watch the video presentation: http://wp.me/p3RLEV-3yR
Sign up for our insideBIGDATA Newsletter: http://insidebigdata.com/newsletter
MK-Prototypes: A Novel Algorithm for Clustering Mixed Type Data IJMER
Clustering mixed-type data is one of the major research topics in the area of data mining. In this paper, a new algorithm for clustering mixed-type data is proposed in which the concept of a distribution centroid is used to represent the prototype of the categorical variables in a cluster; this is then combined with the mean to represent the prototype of clusters with mixed-type variables. In the method, data is observed from different views and the variables are grouped into different views. Instances that can be viewed differently from different viewpoints can be defined as multiview data. The differences among views are usually ignored during the clustering process; here, both view weights and variable weights are computed simultaneously. The view weight is used to determine the closeness or density of a view, and the variable weight is used to identify the significance of each variable. Both weights are used in the distance function to determine the cluster of each object. The proposed method enhances k-prototypes so that it automatically computes both view and variable weights. The proposed MK-Prototypes algorithm is compared with two other clustering algorithms.
Found this paper really interesting. It delves into the learning behaviors of deep learning ensembles and compares them to Bayesian neural networks, which in theory do the same thing. This helps answer why deep ensembles outperform Bayesian neural networks in practice.
This document proposes a Bayesian framework for detecting changepoints in multiple categorical sequences that are composed of piecewise homogeneous Markov segments. The framework models each sequence as having an ordered set of segments separated by unobserved changepoints, with each segment following its own Markov transition matrix. The model places priors on the changepoint locations and transition matrices and uses MCMC to obtain posterior estimates. Experimental results on simulated data show the methodology can infer changepoint positions across varying sequences and handle different numbers of changepoints per sequence. The approach is also applied to real data on monsoonal rainfall patterns and tree branching patterns.
In the recent machine learning community, there is a trend of constructing nonlinear versions of linear algorithms through the 'kernel method', for example kernel principal component analysis, kernel Fisher discriminant analysis, support vector machines (SVMs), and recent kernel clustering algorithms. Typically, in unsupervised kernel-based clustering, a nonlinear mapping is first applied to map the data into a much higher-dimensional feature space, and clustering is then executed there. A drawback of these kernel clustering algorithms is that the clustering prototypes reside in the high-dimensional feature space and therefore lack clear, intuitive descriptions unless an additional approximate projection from the feature space back to the data space is performed, as done in the literature. In this paper, using the 'kernel method', a novel clustering algorithm founded on the conventional fuzzy c-means algorithm (FCM) is proposed, called the kernel fuzzy c-means algorithm (KFCM). This method adopts a new kernel-induced metric in the data space to replace the original Euclidean norm, so the cluster prototypes still reside in the data space and the clustering results can be interpreted in the original space. This property is exploited for clustering incomplete data. Experiments illustrate that KFCM achieves better clustering performance and is more robust than other variants of FCM for clustering incomplete data.
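A compact sketch of a KFCM-style iteration with a Gaussian kernel, assuming the commonly used kernel-induced distance d^2(x, v) = 2(1 - K(x, v)); the initialization, hyperparameters, and toy data are illustrative, not taken from the paper:

```python
import numpy as np

def kfcm(X, V0, m=2.0, sigma=1.0, iters=100, tol=1e-6):
    """Kernel fuzzy c-means with a Gaussian kernel. The kernel-induced distance
    replaces the Euclidean norm, while prototypes V stay in the data space."""
    V = V0.astype(float).copy()
    kern = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
                               / (2.0 * sigma ** 2))
    for _ in range(iters):
        Kxv = kern(X, V)                              # n x c kernel values
        d2 = np.maximum(2.0 * (1.0 - Kxv), 1e-12)     # kernel-induced distances
        U = d2 ** (-1.0 / (m - 1.0))                  # unnormalized memberships
        U /= U.sum(axis=1, keepdims=True)
        W = (U ** m) * Kxv                            # kernel-weighted update weights
        V_new = (W.T @ X) / W.sum(axis=0)[:, None]    # prototypes stay in data space
        done = np.abs(V_new - V).max() < tol
        V = V_new
        if done:
            break
    return V, U

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
V, U = kfcm(X, X[[0, -1]])          # initialize prototypes from two data points
print(np.sort(V.ravel()))           # one prototype near 0.1, one near 5.05
```

Because V lives in the same space as X, the fitted prototypes can be read off directly, which is exactly the interpretability property the abstract emphasizes.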
Zach and Aren talk on Materials Informatics at UW WIMSEA 2016-02-12ddm314
Presentation at the Wisconsin Materials Science and Engineering Industrial Affiliates (WIMSEA) meeting at the University of Wisconsin-Madison on 2016-02-12. Undergraduates Zach and Aren presented on Informatics Skunkworks activities in materials informatics.
Data Science and Machine Learning Using Python and Scikit-learnAsim Jalis
Workshop at DataEngConf 2016, on April 7-8 2016, at Galvanize, 44 Tehama Street, San Francisco, CA.
Demo and labs for workshop are at https://github.com/asimjalis/data-science-workshop
Data Science, Machine Learning and Neural NetworksBICA Labs
Lecture briefly overviewing the state of the art in data science, machine learning and neural networks. It covers the main artificial intelligence technologies, data science algorithms, neural network architectures and the cloud computing facilities enabling the whole stack.
Reveal.js is an HTML presentation framework that allows users to create beautiful presentations using HTML. It has features like vertical slides, nested slides, Markdown support, different transition styles, themes, slide backgrounds, images, video, tables, quotes, and linking between slides. Presentations can be exported to PDF and custom states and events can be triggered on each slide. The framework is touch optimized and works on devices like mobile phones and tablets.
Afterwork Big Data - Data Science & Machine Learning : explorer, comprendre e...OCTO Technology Suisse
For our third afterwork on the theme of "Big Data", we offer an introduction to the practices and benefits of data science. While the previous sessions showed how to store and process large volumes of data at low cost, we will address a new aspect: how to discover the treasures of information hidden in your data.
We will present the main principles of machine learning and the power of visualization. Drawing on OCTO's project experience, we will survey the available methods and tools.
By the end of this presentation, you will have discovered pragmatic approaches for exploring and understanding your data, and perhaps even predicting your future.
During the process of molecular structure elucidation, the selection of the most probable structural hypothesis may be based on chemical shift prediction. The prediction is carried out using either empirical or quantum-mechanical (QM) methods. When QM methods are used, NMR prediction commonly utilizes the GIAO option of the DFT approximation. In this approach the structural hypotheses are expected to be devised by the scientist. In this article we hope to show that the most rational way to create structural hypotheses is by applying an expert system capable of deducing all potential structures consistent with the experimental spectral data, specifically using 2D NMR data. When an expert system is used, the best structure(s) can be distinguished using chemical shift prediction, which is best performed by either an incremental or a neural-network algorithm. The time-consuming QM calculations can then be applied, if necessary, to one or more of the 'best' structures to confirm the suggested solution.
The efficacy of neural network (NN) and partial least squares (PLS) methods is compared for the prediction of NMR chemical shifts for both 1H and 13C nuclei, using very large databases containing millions of chemical shifts. The chemical structure description scheme used in this work is based on individual atoms rather than functional groups. The performance of each method was optimized in the systematic manner described in this work. Both methods produce results of very similar quality, but the least squares algorithm is approximately 2-3 times faster.
Poster presentat a les jornades doctorals de la UABElisabeth Ortega
This document discusses computational techniques for studying biological systems containing transition metals. It provides examples of using quantum mechanics (QM) to study cisplatin binding to DNA and molecular mechanics (MM) and docking to study ruthenium complexes binding to kinases. For cisplatin-DNA, normal mode analysis found the lowest energy collective movements and QM minimization determined preferred geometries. For ruthenium-kinase, docking predicted binding energies that correlated with experimental values. Computational chemistry methods are versatile and useful for designing metal-based drugs by integrating different techniques to overcome individual limitations.
1) The document discusses mining data streams using an improved version of McDiarmid's bound. It aims to enhance the bounds obtained by McDiarmid's tree algorithm and improve processing efficiency.
2) Traditional data mining techniques cannot be directly applied to data streams due to their continuous, rapid arrival. The document proposes using Gaussian approximations to McDiarmid's bounds to reduce the size of training samples needed for split criteria selection.
3) It describes Hoeffding's inequality, which is commonly used but not sufficient for data streams. The document argues that McDiarmid's inequality, used appropriately, provides a more efficient technique for high-speed, time-changing data streams.
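The Hoeffding bound the document contrasts with McDiarmid's inequality is easy to state concretely; in Hoeffding-tree-style stream miners it gates the split decision. The gain values, delta, and sample count below are illustrative:

```python
import math

def hoeffding_epsilon(r, delta, n):
    """Hoeffding bound: after n i.i.d. observations of a statistic with range r,
    the sample mean lies within epsilon of the true mean with probability 1 - delta."""
    return math.sqrt(r * r * math.log(1.0 / delta) / (2.0 * n))

# Stream-mining split decision (illustrative numbers): split on the best
# attribute once the observed gap between the top two gains exceeds epsilon.
gain_best, gain_second, delta, n = 0.30, 0.18, 1e-7, 2000
eps = hoeffding_epsilon(r=1.0, delta=delta, n=n)   # information gain lies in [0, 1]
print(round(eps, 4), gain_best - gain_second > eps)   # 0.0635 True
```

The document's argument is that a tighter, McDiarmid-style bound shrinks epsilon faster, so fewer stream examples are needed before a split can be declared with the same confidence.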
Statistical global modeling of β^- decay halflives systematics ...butest
This document discusses using machine learning techniques like neural networks and support vector machines to model beta decay half-lives of nuclei. It compares these statistical, data-driven models to existing theory-based global models. The machine learning procedures treat beta decay half-lives as a non-linear optimization problem solved through statistical learning. Neural networks and support vector regression machines were constructed and compared to experimental data, previous neural network results, and traditional nuclear model estimates, showing similar performance to the best global calculations.
Max stable set problem to found the initial centroids in clustering problemnooriasukmaningtyas
In this paper, we propose a new approach to document clustering using the K-Means algorithm, which is sensitive to the random selection of the k cluster centroids in the initialization phase. To improve the quality of K-Means clustering, we propose to model the text document clustering problem as the max stable set problem (MSSP) and to use a continuous Hopfield network to solve the MSSP and obtain the initial centroids. The idea is inspired by the fact that MSSP and clustering share the same principle: MSSP consists of finding the largest set of completely disconnected nodes in a graph, while in clustering all objects are divided into disjoint clusters. Simulation results demonstrate that the proposed K-Means improved by MSSP (KM_MSSP) is efficient on large data sets, is much better optimized in terms of time, and provides better clustering quality than other methods.
The document discusses using machine learning to develop density functional approximations for orbital-free density functional theory calculations. Specifically, kernel ridge regression is used to approximate the kinetic energy of non-interacting fermions confined to a 1D box as a functional of electron density. This machine-learned density functional approximation achieves highly accurate energies and self-consistent densities, outperforming traditional approximations. Various kernels, cross-validation methods, and representations of the electron density are explored to optimize the machine-learned approximation.
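The core of kernel ridge regression as used in that work is a single regularized linear solve. The sketch below fits a toy 1D function rather than the paper's density-to-kinetic-energy mapping; the Gaussian kernel and hyperparameter values are assumptions for illustration:

```python
import numpy as np

def krr_fit(X, y, gamma=1.0, lam=1e-3):
    """Kernel ridge regression: solve (K + lam*I) alpha = y with a Gaussian kernel."""
    K = np.exp(-gamma * (X[:, None] - X[None, :]) ** 2)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, gamma=1.0):
    """Prediction is a kernel-weighted sum over the training points."""
    K = np.exp(-gamma * (X_new[:, None] - X_train[None, :]) ** 2)
    return K @ alpha

# Toy stand-in for the density-to-energy mapping: learn f(x) = x^2 from samples.
X = np.linspace(-2.0, 2.0, 41)
y = X ** 2
alpha = krr_fit(X, y)
pred = krr_predict(X, alpha, np.array([0.5]))[0]
print(pred)   # close to the true value 0.25
```

In the paper's setting, X would be a vector representation of the electron density and gamma and lam would be chosen by cross-validation, which is exactly the model-selection step the summary highlights.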
This document discusses using machine learning for scientific research, specifically for molecular dynamics simulations, quantum mechanics calculations, and computational chemistry. It provides examples of using machine learning to predict molecular properties through models trained on large datasets of molecules and their computed properties. The examples shown include predicting energies, geometries, reaction energies, and more with good accuracy compared to traditional methods. The document argues that machine learning is a promising approach for accelerating scientific computations and aiding in molecular design.
Graphs, Environments, and Machine Learning for Materials Scienceaimsnist
This document summarizes key points from a workshop on machine learning for materials science held on August 1, 2019. It discusses how graphs are a natural representation for materials as they can capture local atomic environments and periodicity in crystals. A graph network framework called MEGNet is presented that achieves state-of-the-art performance for molecular and crystal property prediction. MEGNet models outperform previous methods on standard benchmarks and allow for transfer learning across properties. The workshop also covered practical considerations for training deep learning models and applications beyond bulk crystals.
EFFICIENTLY PROCESSING OF TOP-K TYPICALITY QUERY FOR STRUCTURED DATAcsandit
This work presents a novel ranking scheme for structured data. We show how to apply the
notion of typicality analysis from cognitive science and how to use this notion to formulate the
problem of ranking data with categorical attributes. First, we formalize the typicality query
model for relational databases. We adopt Pearson correlation coefficient to quantify the extent
of the typicality of an object. The correlation coefficient estimates the extent of statistical
relationships between two variables based on the patterns of occurrences and absences of their
values. Second, we develop a top-k query processing method for efficient computation. TPFilter
prunes unpromising objects based on tight upper bounds and selectively joins tuples of highest
typicality score. Our methods efficiently prune unpromising objects based on upper bounds.
Experimental results show our approach is promising for real data.
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...IAEME Publication
This paper presents an approach based on applying an aggregated predictor formed by multiple versions of a multilayer neural network with a back-propagation optimization algorithm for helping the engineer to get a list of the most appropriate well-test interpretation models for a given set of pressure/ production data. The proposed method consists of three stages: (1) data decorrelation through principal component analysis to reduce the covariance between the variables and the dimension of the input layer in the artificial neural network, (2) bootstrap replicates of the learning set where the data is repeatedly sampled with a random split of the data into train sets and using these as new learning sets, and (3) automatic reservoir model identification through aggregated predictor formed by a plurality vote when predicting a new class. This method is described in detail to ensure successful replication of results. The required training and test dataset were generated by using analytical solution models. In our case, there were used 600 samples: 300 for training, 100 for cross-validation, and 200 for testing. Different network structures were tested during this study to arrive at optimum network design. We notice that the single net methodology always brings about confusion in selecting the correct model even though the training results for the constructed networks are close to 1. We notice also that the principal component analysis is an effective strategy in reducing the number of input features, simplifying the network structure, and lowering the training time of the ANN. The results obtained show that the proposed model provides better performance when predicting new data with a coefficient of correlation approximately equal to 95% Compared to a previous approach 80%, the combination of the PCA and ANN is more stable and determine the more accurate results with lesser computational complexity than was feasible previously. 
Clearly, the aggregated predictor is more stable and shows less bad classes compared to the previous approach.
This document discusses a new machine learning method for differentiating between quark and gluon jets using data from the ALICE experiment at CERN. Key points:
- The method uses features of jet substructure to construct discriminant variables to classify jets as initiated by quarks or gluons. Hundreds of features are explored.
- Data preprocessing steps are described, including removing unusable features, addressing class noise in jet labeling, and ranking features by information gain.
- Feature ranking identified both previously proposed discriminating features as well as new intriguing variables for better quark/gluon jet discrimination.
Predicting electricity consumption using hidden parametersIJLT EMAS
data mining technique to forecast power demand of a
biological region based on the metrological conditions. The value
forecast analytical data mining technique is implement with the
Hidden Marko Model. The morals of the factor such as heat,
clamminess and municipal celebration on which influence
operation depends and the everyday utilization morals compose
the data. Data mining operation are perform on this
chronological data to form a forecast model which is able of
predict every day utilization provide the meteorological
parameter. The steps of information detection of data process are
implemented. The data is preprocessed and fed to HMM for
guidance it. The educated HMM network is used to predict the
electricity demand for the given meteorological conditions.
This document discusses using density functional theory with different basis sets (Gaussian, plane waves, numerical) to calculate exchange coupling constants in transition metal complexes. It compares the accuracy and reliability of these approaches by calculating exchange coupling constants and spin distributions for three test complexes. The plane wave and numerical basis set approaches are found to be accurate alternatives to the more established Gaussian basis functions for calculating these properties, while also allowing for larger system sizes to be studied. Pseudopotentials are also found to not significantly affect the calculation of exchange coupling constants.
2015 New trans-stilbene derivatives with large TPA valuesvarun Kundi
This document discusses a theoretical study of the linear and non-linear optical properties of 13 new trans-stilbene derivatives designed to have large two-photon absorption cross-sections. The study uses density functional theory and time-dependent density functional theory calculations with the CAM-B3LYP functional to evaluate properties like hyperpolarizability and one- and two-photon absorption. It finds that derivatives TSBD-10, TSBD-11, TSBD-12, and TSBD-13 have particularly large non-linear optical susceptibilities and two-photon absorption cross-sections, with the largest being 5560 GM for TSBD-13.
Development of an artificial neural network constitutive model for aluminum 7...Thabatta Araújo
This document discusses the development of an artificial neural network constitutive model for aluminum 7075 alloy. It begins with an introduction to constitutive equations and the typical parametric approach to constitutive modeling. It then discusses the benefits of a non-parametric artificial neural network approach. The document outlines experimental data collected on aluminum 7075 alloy under different strain rates and temperatures. It aims to design and train a neural network model to accurately model the stress-strain behavior of aluminum 7075 alloy across a range of loading conditions.
Clustering using kernel entropy principal component analysis and variable ker...IJECEIAES
Clustering as unsupervised learning method is the mission of dividing data objects into clusters with common characteristics. In the present paper, we introduce an enhanced technique of the existing EPCA data transformation method. Incorporating the kernel function into the EPCA, the input space can be mapped implicitly into a high-dimensional of feature space. Then, the Shannon’s entropy estimated via the inertia provided by the contribution of every mapped object in data is the key measure to determine the optimal extracted features space. Our proposed method performs very well the clustering algorithm of the fast search of clusters’ centers based on the local densities’ computing. Experimental results disclose that the approach is feasible and efficient on the performance query.
Deep Learning Meets Biology: How Does a Protein Helix Know Where to Start and...Melissa Moody
This summarizes a document describing research using machine learning to classify protein helix capping motifs. The researchers:
1) Used structural data from protein databases and helix cap classifications to train machine learning models, including bidirectional LSTM and SVC models, to predict helix cap positions in proteins.
2) Engineered features for the models including backbone torsion angles, residue properties, and additional physicochemical descriptors.
3) Evaluated the models using accuracy, balanced accuracy, and F1 score since the dataset was imbalanced between cap and non-cap residues.
4) Achieved 85% balanced accuracy classifying helix caps using a deep bidirectional LSTM model, offering an objective way to classify this important
Similar to Accelerating materials property predictions using machine learning (20)
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Trusted Execution Environment for Decentralized Process MiningLucaBarbaro3
Presentation of the paper "Trusted Execution Environment for Decentralized Process Mining" given during the CAiSE 2024 Conference in Cyprus on June 7, 2024.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyScyllaDB
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
Digital Marketing Trends in 2024 | Guide for Staying AheadWask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
5th LF Energy Power Grid Model Meet-up SlidesDanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Mircosoft Teams session or in person at TU/e located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid -Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
A Comprehensive Guide to DeFi Development Services in 2024Intelisync
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3Data Hops
Free A4 downloadable and printable Cyber Security, Social Engineering Safety and security Training Posters . Promote security awareness in the home or workplace. Lock them Out From training providers datahops.com
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Accelerating materials property predictions using machine learning
Ghanshyam Pilania1, Chenchen Wang1, Xun Jiang2, Sanguthevar Rajasekaran3 & Ramamurthy Ramprasad1
1 Department of Materials Science and Engineering, University of Connecticut, 97 North Eagleville Road, Storrs, Connecticut 06269; 2 Department of Statistics, University of Connecticut, 215 Glenbrook Road, Storrs, Connecticut 06269; 3 Department of Computer Science and Engineering, University of Connecticut, 371 Fairfield Road, Storrs, Connecticut 06269.
The materials discovery process can be significantly expedited and simplified if we can learn effectively from
available knowledge and data. In the present contribution, we show that efficient and accurate prediction of
a diverse set of properties of material systems is possible by employing machine (or statistical) learning
methods trained on quantum mechanical computations in combination with the notions of chemical
similarity. Using a family of one-dimensional chain systems, we present a general formalism that allows us
to discover decision rules that establish a mapping between easily accessible attributes of a system and its
properties. It is shown that fingerprints based on either chemo-structural (compositional and
configurational information) or the electronic charge density distribution can be used to make ultra-fast, yet
accurate, property predictions. Harnessing such learning paradigms extends recent efforts to systematically
explore and mine vast chemical spaces, and can significantly accelerate the discovery of new
application-specific materials.
Owing to the staggering compositional and configurational degrees of freedom possible in materials, it is
fair to assume that the chemical space of even a restricted subclass of materials (say, involving just two
elements) is far from being exhausted, and an enormous number of new materials with useful properties
are yet to be discovered. Given this formidable chemical landscape, a fundamental bottleneck to an efficient
materials discovery process is the lack of suitable methods to rapidly and accurately predict the properties of a vast
array (within a subclass) of new, yet-to-be-synthesized materials. The standard approaches adopted thus far involve either expensive and lengthy Edisonian synthesis-testing experimental cycles, or laborious and time-intensive computations performed on a case-by-case basis. Moreover, neither of these approaches is able to readily unearth Hume-Rothery-like "hidden" semi-empirical rules that govern materials behavior.
The present contribution, aimed at materials property predictions, falls under a radically different paradigm1,2, namely, machine (or statistical) learning—a topic central to network theory3, cognitive game theory4,5, pattern recognition6–8, artificial intelligence9,10, and event forecasting11. We show that such learning methods may be used to establish a mapping between a suitable representation of a material (i.e., its 'fingerprint' or its 'profile') and any or all of its properties using known historic, or intentionally generated, data. The material fingerprint or profile can be coarse-level chemo-structural descriptors, or something as fundamental as the electronic charge density, both of which are explored here. Subsequently, once the profile → property mapping has been established, the properties of a vast number of new materials within the same subclass may then be directly predicted (and correlations between properties may be unearthed) at negligible computational cost, thereby completely bypassing the conventional laborious approaches towards material property determination alluded to above. In its most simplified form, this scheme is inspired by the intuition that (dis)similar materials will have (dis)similar properties. Needless to say, training of this intuition requires a critical amount of prior diverse information/results12–16 and robust learning devices12,17–22.
The central problem in learning approaches is to come up with decision rules that will allow us to establish a mapping between measurable (and easily accessible) attributes of a system and its properties. Quantum mechanics (here employed within the framework of density functional theory, DFT)23,24 provides us with such a decision rule that connects the wave function (or charge density) with properties via the Schrödinger (or the Kohn-Sham) equation. Here, we hope to replace the rather cumbersome rule based on the Schrödinger or Kohn-Sham equation with a module based on similarity-based machine learning. The essential ingredients of the proposed scheme are captured schematically in Figure 1.
OPEN
SUBJECT AREAS: POLYMERS, ELECTRONIC STRUCTURE, COMPUTATIONAL METHODS, ELECTRONIC PROPERTIES AND MATERIALS
Received 25 June 2013; Accepted 9 September 2013; Published 30 September 2013
Correspondence and requests for materials should be addressed to R.R. (rampi@ims.uconn.edu)
SCIENTIFIC REPORTS | 3 : 2810 | DOI: 10.1038/srep02810 1
Results
The ideal testing ground for such a paradigm is a case where a parent material is made to undergo systematic chemical and/or configurational variations, for which controlled initial training and test data can be generated. In the present investigation, we consider infinite polymer chains—quasi 1-d material motifs (Figure 1)—with their building blocks drawn from a pool of the following seven possibilities: CH2, SiF2, SiCl2, GeF2, GeCl2, SnF2, and SnCl2. Setting all the building blocks of a chain to be CH2 leads to polyethylene (PE), a common, inexpensive polymeric insulator. The rationale for introducing the other Group IV halides is to interrogate the beneficial effects (if any) these blocks may have on various properties when introduced in a base polymer such as PE. The properties that we will focus on include: the atomization energy, the formation energy, the lattice constant, the spring constant, the band gap, the electron affinity, and the optical and static components of the dielectric constant. The initial dataset for 175 such material motifs containing 4 building blocks per repeat unit was generated using DFT.
The first step in the mapping process prescribed in the panels of Figure 1 is to reduce each material system under inquiry to a string of numbers—we refer to this string as the fingerprint vector. For the specific case under consideration here, namely, polymeric chains composed of seven possible building blocks, the following coarse-level chemo-structural fingerprint vector was considered first: ⟨f1, …, f6, g1, …, g7, h1, …, h7⟩, where fi, gi and hi are, respectively, the number of building blocks of type i, the number of i-i pairs, and the number of i-i-i triplets, normalized to the total number of units (note that f7 is missing in the above vector as it is not an independent quantity, owing to the relation f7 = 1 − Σ_{i=1}^{6} fi). One may generalize the above vector to include all possible i-j pairs, i-j-k triplets, i-j-k-l quadruplets, etc., but such extensions were found to be unnecessary as the chosen 20-component vector was able to satisfactorily codify the information content of the polymeric chains.
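As a concrete illustration, the 20-component fingerprint above can be assembled directly from a repeat unit. The sketch below is ours, not the authors' code; it assumes the repeat unit is given as a list of block labels and that pairs and triplets are counted with periodic wrap-around (the chains are infinite):

```python
# Sketch of the 20-component chemo-structural fingerprint described above.
# Assumption: the repeat unit is a list of block labels, and adjacent pairs
# and triplets are counted with periodic wrap-around (the chains are infinite).
BLOCKS = ["CH2", "SiF2", "SiCl2", "GeF2", "GeCl2", "SnF2", "SnCl2"]

def fingerprint(chain):
    n = len(chain)
    # f_i: fraction of blocks of type i (f7 is dropped: f7 = 1 - sum(f1..f6))
    f = [chain.count(b) / n for b in BLOCKS]
    # g_i: fraction of adjacent i-i pairs
    g = [sum(chain[k] == b == chain[(k + 1) % n] for k in range(n)) / n
         for b in BLOCKS]
    # h_i: fraction of consecutive i-i-i triplets
    h = [sum(chain[k] == b == chain[(k + 1) % n] == chain[(k + 2) % n]
             for k in range(n)) / n
         for b in BLOCKS]
    return f[:6] + g + h

fp = fingerprint(["CH2", "CH2", "SiF2", "SnCl2"])  # a 4-block repeat unit
assert len(fp) == 20
```

For pure polyethylene (all blocks CH2), this yields f1 = g1 = h1 = 1 with all other components zero.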
Next, a suitable measure of chemical distance is defined to allow for a quantification of the degree of (dis)similarity between any two fingerprint vectors. Consider two systems a and b with fingerprint vectors Fa and Fb. The similarity of the two vectors may be measured in many ways, e.g., using the Euclidean norm of the difference between the two vectors, ||Fa − Fb||, or the dot product of the two vectors, Fa · Fb. In the present work, we use the former, which we refer to as Fab (Figure 1). Clearly, if Fab = 0, materials a and b are equivalent (insofar as we can conclude based on the fingerprint vectors), and their property values Pa and Pb are the same. When Fab ≠ 0, materials a and b are not equivalent, and Pa − Pb is not necessarily zero, and depends on Fab. This observation may be formally quantified when we have a prior materials-property dataset, in which case we can determine the parametric dependence of the property values on Fab.
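The chemical distance used here is simply the Euclidean norm of the fingerprint difference; a minimal sketch:

```python
import math

# Chemical distance between two fingerprint vectors: the Euclidean norm of
# their difference (the dot product mentioned above is an alternative measure).
def chemical_distance(Fa, Fb):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(Fa, Fb)))

# Identical fingerprints give distance 0, i.e. the materials are equivalent
# insofar as the fingerprint can resolve.
assert chemical_distance([0.5, 0.25, 0.0], [0.5, 0.25, 0.0]) == 0.0
```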
In the present work, we apply the machine learning algorithm referred to as kernel ridge regression (KRR)25,26 to our family of polymeric chains. Technical details on the KRR methodology are provided in the Methods section of the manuscript. As mentioned
Figure 1 | The machine (or statistical) learning methodology. First, material motifs within a class are reduced to numerical fingerprint vectors. Next, a
suitable measure of chemical (dis)similarity, or chemical distance, is used within a learning scheme—in this case, kernel ridge regression—to map the
distances to properties.
above, the initial dataset was generated using DFT for systems with
repeat units containing 4 distinct building blocks. Of the total 175
such systems, 130 were classified to be in the ‘training’ set (used in the
training of the KRR model, Equation (1)), and the remainder in the
‘test’ set. Figure 2 shows the agreement between the predictions of the
learning model and the DFT results for the training and the test sets,
for each of the 8 properties examined. Furthermore, we considered
several chains composed of 8-block repeat units (in addition to the
175 4-block systems), performed DFT computations on these, and
compared the DFT predictions of the 8-block systems with those
predicted using our learning scheme. As can be seen, the level of
agreement between the DFT and the learning schemes is uniformly
good for all properties across the 4-block training and test set, as well
as the somewhat out-of-sample 8-block test set (regardless of the
variance in the property values). Moreover, properties controlled
by the local environment (e.g., the lattice parameter), as well as those
controlled by nonlocal global effects (e.g., the electronic part of the
dielectric constant) are well-captured. We do note that the agreement
is more spectacular for the energies than for the other properties (as the former are the most well-converged, and the latter are derived or extrapolated properties; see Methods). Overall, the high-fidelity nature of the learning predictions is particularly impressive, given that these calculations take a minuscule fraction of the time necessitated by a typical DFT computation.
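The KRR scheme described above can be sketched compactly. The code below is our illustration, not the authors'; the Gaussian kernel of the fingerprint distance, the width sigma, and the regularization lam are illustrative choices, not the cross-validated settings used in the paper:

```python
import numpy as np

# Minimal kernel ridge regression sketch. sigma (kernel width) and lam
# (ridge regularization) are illustrative, not the paper's tuned values.
def krr_train(X, y, sigma=1.0, lam=1e-3):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    K = np.exp(-d2 / (2.0 * sigma**2))                   # Gaussian kernel
    return np.linalg.solve(K + lam * np.eye(len(X)), y)  # dual weights alpha

def krr_predict(X_train, alpha, X_new, sigma=1.0):
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma**2)) @ alpha

# Toy usage: fit a smooth scalar "property" of a 2-component fingerprint
rng = np.random.default_rng(0)
X = rng.random((50, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2
alpha = krr_train(X, y)
pred = krr_predict(X, alpha, X[:5])  # predictions on five training points
```

Prediction cost is a single kernel evaluation per training point, which is why a trained model can screen tens of thousands of candidates at negligible expense.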
While the favorable agreement between the machine learning and the DFT results for a variety of properties is exciting in and of itself, the real power of this prediction paradigm lies in the possibility of exploring a much larger chemical-configurational space than is practically possible using DFT computations (or laborious experimentation). For instance, merely expanding into a family of 1-d systems with 8-block repeat units leads to 29,365 symmetry-unique cases (an extremely small fraction of this class was scrutinized above for validation purposes). Not only can the learning approach make the study of this staggeringly large number of cases possible, it also allows for a search for correlations between properties in a systematic manner. In order to unearth such correlations, we first determined the properties of the 29,365 systems using our machine learning methodology, followed by the estimation of Pearson's correlation coefficient for
each pair of properties. The Pearson correlation coefficient (r) used
to quantify a correlation between two given property datasets {Xi}
and {Yi} for a class of n material systems is defined as follows:
$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}. \qquad (1) $$
Here, \bar{X} and \bar{Y} represent the average values of the properties over the
respective datasets. Figure 3a shows a matrix of the correlation coef-
ficients, color-coded to allow for immediate identification of pairs of
properties that are most correlated.
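A correlation matrix of the kind shown in Figure 3a can be assembled with a few lines of NumPy. The sketch below is illustrative: the random array stands in for the eight machine-learning-predicted property datasets, and `pearson_r` implements Eq. (1) directly.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of Eq. (1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Placeholder data: rows are material systems, columns are properties.
rng = np.random.default_rng(0)
props = rng.normal(size=(100, 8))  # e.g., 100 systems, 8 properties

# Pairwise correlation matrix, analogous to the lower triangle of Figure 3a.
n_props = props.shape[1]
corr = np.array([[pearson_r(props[:, i], props[:, j])
                  for j in range(n_props)] for i in range(n_props)])

# Sanity check against NumPy's built-in implementation.
assert np.allclose(corr, np.corrcoef(props, rowvar=False))
```

Color-coding this matrix (e.g., with a heat map) then gives an at-a-glance view of the most strongly correlated property pairs.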
It can be seen from Figure 3a that the band gap is most strongly
correlated with many of the properties. Panels p1–p6 of Figure 3b
explicitly show the correlation between the band gap and six of the
remaining seven properties. Most notably, the band gap is inversely
correlated with the atomization energy (p1), size (p2), electron affin-
ity (p4), and the dielectric constants (p5 and p6), and directly corre-
lated with the spring constant (p3). The relationships captured in
panels p1–p3 follow from stability and bond strength arguments.
The interesting inverse relationship between the band gap and the
electron affinity is a consequence of the uniform shift of the conduc-
tion band minimum (due to changes in the band gap) with respect to
the vacuum level. The inverse correlation of the band gap with the
electronic part of the dielectric constant follows from the quantum
mechanical picture of electronic polarization being due to electronic
excitations. As no such requirement is expected for the ionic part of
the dielectric constant, it is rather surprising that a rough inverse
correlation is seen between the total dielectric constant and the band
gap, although clear deviations from this inverse behavior can be seen.
Finally, we note that the formation energy is uncorrelated with all the
other seven properties, including the band gap. This is particularly
notable as it is a common tendency to assume that the formation
energy (indicative of thermodynamic stability) is inversely correlated
with the band gap (indicative of electronic stability).
Discussion
Correlation diagrams such as the ones in Figure 3b offer a pathway to
‘design’ systems that meet a given set of property requirements. For
Figure 2 | Learning performance of chemo-structural fingerprint vectors. Parity plots comparing property values computed using DFT against
predictions made using learning algorithms trained using chemo-structural fingerprint vectors. Pearson’s correlation index is indicated in each of the
panels to quantify the agreement between the two schemes.
www.nature.com/scientificreports
SCIENTIFIC REPORTS | 3 : 2810 | DOI: 10.1038/srep02810 3
instance, a search for insulators with high dielectric constant and
large band gap would lead to those systems that are at the top part
of panel p6 of Figure 3b (corresponding to the ‘deviations’ from the
inverse correlation alluded to above, and indicated by a circle in panel
p6). These are systems that contain 2 or more contiguous SnF2 units,
but with an overall CH2 mole fraction of at least 25%. Such organo-
tin systems may be particularly appropriate for applications requir-
ing high-dielectric constant polymers. Furthermore, such diagrams
can aid in the extraction of knowledge from data eventually leading
to Hume-Rothery-like semi-empirical rules that dictate materials
behavior. For instance, panel p3 reveals a well-known correspondence
between mechanical strength and chemical stability27, while panels p5
and p6 capture an inverse relationship between the dielectric constant
and the band gap, also quite familiar to the semiconductor physics
community28.
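The screening criterion quoted above (two or more contiguous SnF2 units together with a CH2 mole fraction of at least 25%) can be expressed as a simple filter over candidate 8-block repeat units. The function below is our own illustration of such a filter, not the authors' code; the unit names come from the paper's building-block pool.

```python
def is_candidate(blocks):
    """Screen an 8-block repeat unit for the high-dielectric,
    large-band-gap region of panel p6: at least two contiguous
    SnF2 units and a CH2 mole fraction of at least 25%."""
    # Double the sequence so that adjacency across the periodic
    # boundary of the repeat unit is also checked.
    doubled = list(blocks) + list(blocks)
    has_contiguous_snf2 = any(
        doubled[i] == doubled[i + 1] == "SnF2" for i in range(len(blocks)))
    return has_contiguous_snf2 and blocks.count("CH2") / len(blocks) >= 0.25

print(is_candidate(("SnF2", "SnF2", "CH2", "CH2", "SiF2", "CH2", "GeF2", "CH2")))  # True
print(is_candidate(("SnF2", "CH2", "SnF2", "CH2", "SiF2", "CH2", "GeF2", "CH2")))  # False
```

Applied to machine-learning-predicted properties of the full 8-block family, a filter of this kind is what turns a correlation diagram into a targeted materials search.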
The entire discussion thus far has focused on fingerprint vectors
defined in terms of coarse-level chemo-structural descriptors. This
brings up the question of whether other, more fundamental quantities
may be used as a fingerprint to profile a material. The first
Hohenberg-Kohn theorem of DFT29 proves that the electronic
charge density of a system is a universal descriptor containing the
sum total of the information about the system, including all its
properties. The shape30 and holographic31 electron density theorems
constitute further extensions of the original Hohenberg-Kohn
theorem. Inspired by these theorems, we propose that machine learning
methods may be used to establish a mapping between the electronic
charge density and various properties.
A fundamental issue related to this perspective deals with defining
a (dis)similarity criterion that can enable a fair comparison between
the charge density of two different systems. Note that any such mea-
sure has to be invariant with respect to relative translations and/or
rotations of the systems. In the present work, we have employed
Fourier coefficients of the 1-d charge density of our systems (aver-
aged along the plane normal to the chain axis). The Fourier coeffi-
cients are invariant to translations of the systems along the chain axis,
and consideration of the 1-d planar averaged charge density makes
the rotational degrees of freedom irrelevant. Figure 4 shows a com-
parison of the predictions of the learning model based on charge
density with the corresponding DFT results. While the agreement
between the learning scheme and DFT is not as remarkable as with
the chemo-structural fingerprint approach adopted earlier, this can
most likely be addressed by the utilization of the actual 3-d charge
density. Nevertheless, we believe that the performance of the learning
scheme is satisfactory, and heralds the possibility of arriving at a
‘universal’ approach for property predictions solely using the elec-
tronic charge density.
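The translation invariance of this fingerprint is easy to verify numerically: the magnitudes of the Fourier coefficients of a periodic 1-d density are unchanged when the density is rigidly shifted along the chain axis. The density profile below is synthetic, used only to illustrate the construction; in practice it would be the planar-averaged Kohn-Sham charge density.

```python
import numpy as np

def density_fingerprint(rho, n_coeff=8):
    """First n_coeff Fourier-coefficient magnitudes of a periodic
    1-d (planar-averaged) charge density. Taking magnitudes discards
    the phase, which is what makes the fingerprint invariant to
    translations along the chain axis."""
    c = np.fft.rfft(rho) / len(rho)
    return np.abs(c[:n_coeff])

# Synthetic periodic 1-d density profile (placeholder for DFT data).
z = np.linspace(0.0, 1.0, 400, endpoint=False)
rho = 1.0 + 0.5 * np.cos(2 * np.pi * z) + 0.2 * np.sin(6 * np.pi * z)

fp = density_fingerprint(rho)
fp_shifted = density_fingerprint(np.roll(rho, 137))  # rigid translation
assert np.allclose(fp, fp_shifted)
```

Planar averaging plays the complementary role for rotations: once the 3-d density is reduced to a 1-d profile along the chain axis, the rotational degrees of freedom no longer enter the fingerprint at all.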
A second issue with the charge density based materials profiling
relates to determining the charge density in the first place. If indeed a
mapping between charge density and the properties can be made for
the training set, how do we obtain the charge density of a new system
without explicitly performing a DFT computation? We suggest that
the ‘atoms in molecules’ concept may be exploited to create a
patched-up charge density distribution32. Needless to say, barring
some studies in the area of atoms and molecules33, these concepts
are in a state of infancy, and there is much room available for both
fundamental developments and innovative applications.
To conclude, we have shown that the efficient and accurate pre-
diction of a diverse set of unrelated properties of material systems is
possible by combining the notions of chemical (dis)similarity and
machine (or statistical) learning methods. Using a family of 1-d chain
systems, we have presented a general formalism that allows us to
discover decision rules that establish a mapping between easily
accessible attributes of a system and its various properties. We have
unambiguously shown that simple fingerprint vectors based on
either compositional and configurational information, or the elec-
tronic charge density distribution, can be used to profile a material
and make property predictions at an enormously small cost com-
pared either with quantum mechanical calculations or laborious
experimentation. The methodology presented here is of direct rel-
evance in identifying (or screening) undiscovered materials in a tar-
geted class with desired combination of properties in an efficient
manner with high fidelity.
Methods
First principles computations. The quantum mechanical computations were
performed using density functional theory (DFT)23,24 as implemented in the
Vienna ab initio software package34,35. The generalized gradient
approximation (GGA) functional parametrized by Perdew, Burke and
Ernzerhof (PBE)36 to treat the electronic exchange-correlation
interaction, the projector augmented wave (PAW)37
potentials, and plane-wave basis functions up to a kinetic energy cutoff of 500 eV
were employed.
Figure 3 | High throughput predictions and correlations from machine learning. (a) The upper triangle presents a schematic of the atomistic
model composed of repeat units with 8 building blocks. Populating each of the 8 blocks with one of the seven units leads to 29,365 systems. The matrix in
the lower triangle depicts the Pearson's correlation index for each pair of the eight properties of the 8-block systems predicted using machine learning.
(b) Panels p1 to p6 show the correlations between the band gap and six properties. The panel labels are also appropriately indexed in (a). The circle in
panel p6 indicates systems with a simultaneously large dielectric constant and band gap.
Our 1-d systems were composed of all-trans infinitely long isolated chains
containing 4 independent building units in a supercell geometry (with periodic boundary
conditions along the axial direction). One CH2 unit was always retained in the
backbone (to break the extent of σ-conjugation along the backbone), and the three
other units were drawn from a ''pool'' of seven possibilities: CH2, SiF2, SiCl2, GeF2,
GeCl2, SnF2 and SnCl2, in a combinatorial and exhaustive manner. This scheme
resulted in 175 symmetry unique systems after accounting for translational
periodicity and inversion symmetry. A Monkhorst-Pack k-point mesh of 1 × 1 × k
(with kc > 50) was used to produce converged results for a supercell of length c Å
along the chain direction (i.e., the z direction). The supercells were relaxed using
a conjugate gradient algorithm until the forces on all atoms were <0.02 eV/Å and
the stress component along the z direction was <1.0 × 10⁻² GPa. Sufficiently large
grids were
used to avoid numerical errors in fast Fourier transforms. A small number of
computations on systems with 8 building units were also performed for validation purposes.
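The count of 175 symmetry-unique systems can be reproduced by brute-force enumeration: generate all 7^4 block sequences containing at least one CH2 unit, and keep one canonical representative per equivalence class under cyclic translations (periodicity of the repeat unit) and reversal (inversion symmetry). This sketch is our own illustration of that counting, not the authors' enumeration code.

```python
from itertools import product

UNITS = ["CH2", "SiF2", "SiCl2", "GeF2", "GeCl2", "SnF2", "SnCl2"]

def canonical(seq):
    """Lexicographically smallest sequence equivalent to seq under
    cyclic translations and reversal (the chain's inversion symmetry)."""
    n = len(seq)
    variants = []
    for s in (seq, seq[::-1]):
        for i in range(n):
            variants.append(s[i:] + s[:i])
    return min(variants)

# All 4-block repeat units with at least one CH2, reduced by symmetry.
unique = {canonical(seq)
          for seq in product(UNITS, repeat=4)
          if "CH2" in seq}
print(len(unique))  # 175 symmetry-unique 4-block systems
```

The same Burnside-style reduction over the dihedral symmetry group is what shrinks the combinatorial 8-block space to the 29,365 systems considered in the high-throughput predictions.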
The calculated atomization energies and formation energies are referenced to the
isolated atoms and homo-polymer chains of the constituents, respectively. While the
lattice parameters, spring constants, band gaps and electron affinities of the systems
are readily accessible through DFT computations, the calculations of the optical and
static components of the dielectric constant require particular care. The dielectric
permittivity of the isolated polymer chains placed in a large supercell was first
computed within the density functional perturbation theory (DFPT)38,39 formalism,
which includes contributions from the polymer as well as from the surrounding
vacuum region of the supercell. Next, treating the supercell as a vacuum-polymer
composite, effective medium theory40 was used to estimate the dielectric constant of
just the polymer chains using methods described recently13,41. Table 1 of the
Supporting Information contains the DFT computed atomization energies, formation
energies, c lattice parameters, spring constants, electron affinities, band gaps, and
dielectric permittivities for the 175 symmetry unique polymeric systems.
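The extraction step can be pictured with a simple parallel-capacitor mixing rule: if the polymer occupies a volume fraction f of the supercell cross-section, the supercell permittivity for the field component along the chain is f·ε_polymer + (1 − f)·1. This is only an illustrative one-line model, not the actual effective medium scheme of references 13, 40 and 41; the numbers below are hypothetical.

```python
def polymer_dielectric_parallel(eps_supercell, fill_fraction):
    """Back out the polymer dielectric constant from the supercell
    (vacuum-polymer composite) value, assuming a parallel-capacitor
    mixing rule: eps_sc = f * eps_p + (1 - f) * 1.
    Illustrative model only, not the scheme used in the paper."""
    if not 0.0 < fill_fraction <= 1.0:
        raise ValueError("fill fraction must be in (0, 1]")
    return (eps_supercell - (1.0 - fill_fraction)) / fill_fraction

# Hypothetical example: a DFPT supercell permittivity of 1.6 with the
# chain occupying 20% of the cross-section.
eps_polymer = polymer_dielectric_parallel(1.6, 0.20)
print(round(eps_polymer, 2))  # 4.0
```

The key point the model captures is that the raw DFPT value is heavily diluted by the vacuum region, so the intrinsic chain permittivity is substantially larger than the supercell average.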
Machine learning details. Within the present similarity-based learning model, a
property of a system in the test set is given by a sum of weighted Gaussians over the
entire training set, as
$$ P_b = \sum_{a=1}^{N} \alpha_a \exp\!\left(-\frac{F_{ab}^2}{2\sigma^2}\right), \qquad (2) $$
where a runs over the systems in the previously known dataset. The coefficients
\alpha_a and the parameter \sigma are obtained by 'training' the above form on the systems a in the
previously known dataset. The training (or learning) process is built on minimizing
the expression
$$ \sum_{a=1}^{N} \left( P_a^{\mathrm{Est}} - P_a^{\mathrm{DFT}} \right)^2 + \lambda \sum_{a=1}^{N} \alpha_a^2, $$

with P_a^{\mathrm{Est}} being the estimated property value, P_a^{\mathrm{DFT}} the DFT value, and
\lambda a regularization parameter25,26. The explicit solution to this minimization
problem is \alpha = (K + \lambda I)^{-1} P^{\mathrm{DFT}}, where I is the identity matrix and
K_{ab} = \exp\!\left(-F_{ab}^2 / 2\sigma^2\right) are the kernel matrix elements over all
polymers in the training set. The parameters \lambda, \sigma and \alpha_a are determined in
an inner loop of fivefold cross validation using a logarithmically scaling fine grid.
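The training procedure just described is kernel ridge regression with a Gaussian kernel over fingerprint distances. A minimal sketch is below; the one-dimensional "fingerprints" and the sinusoidal target are synthetic placeholders for the real fingerprint vectors and DFT property values, and fixed \sigma and \lambda stand in for the fivefold cross-validation grid search.

```python
import numpy as np

def train_krr(F, P, sigma, lam):
    """Closed-form solution alpha = (K + lam*I)^(-1) P with the
    Gaussian kernel K_ab = exp(-F_ab^2 / (2 sigma^2)), where F_ab is
    the fingerprint distance between training systems a and b."""
    K = np.exp(-F ** 2 / (2.0 * sigma ** 2))
    return np.linalg.solve(K + lam * np.eye(len(P)), P)

def predict(alpha, F_test, sigma):
    """Sum of weighted Gaussians over the training set, Eq. (2)."""
    return np.exp(-F_test ** 2 / (2.0 * sigma ** 2)) @ alpha

# Synthetic example: scalar fingerprints, smooth target property.
rng = np.random.default_rng(1)
x_train = rng.uniform(0.0, 1.0, 40)
P_train = np.sin(2 * np.pi * x_train)
F = np.abs(x_train[:, None] - x_train[None, :])  # pairwise distances

alpha = train_krr(F, P_train, sigma=0.1, lam=1e-6)

x_test = np.array([0.25, 0.5])
F_test = np.abs(x_test[:, None] - x_train[None, :])
pred = predict(alpha, F_test, sigma=0.1)
# pred should be close to sin(2*pi*x_test) = [1.0, 0.0]
```

In the actual workflow, \sigma and \lambda would be selected by fivefold cross validation over a logarithmic grid rather than fixed by hand, and F_ab would be the distance between chemo-structural (or charge-density) fingerprint vectors.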
1. Poggio, T., Rifkin, R., Mukherjee, S. & Niyogi, P. General conditions for
predictivity in learning theory. Nature 428, 419–422 (2004).
2. Tomasi, C. Past performance and future results. Nature 428, 378 (2004).
3. Rehmeyer, J. Influential few predict behavior of the many. Nature News, http://dx.
doi.org/10.1038/nature.2013.12447.
4. Holland, J. H. Emergence: from Chaos to order (Cambridge, Perseus, 1998).
5. Jones, N. Quiz-playing computer system could revolutionize research. Nature
News, http://dx.doi.org/10.1038/news.2011.95.
6. MacLeod, N., Benfield, M. & Culverhouse, P. Time to automate identification.
Nature 467, 154–155 (2010).
7. Crutchfield, J. P. Between order and chaos. Nature Physics 8, 17–24 (2012).
8. Chittka, L. & Dyer, A. Cognition: Your face looks familiar. Nature 481, 154–155
(2012).
9. Abu-Mostafa, Y. S. Machines that Learn from Hints. Sci. Am. 272, 64–69 (1995).
10. Abu-Mostafa, Y. S. Machines that Think for Themselves. Sci. Am. 307, 78–81
(2012).
11. Silver, N. The Signal and the Noise: Why So Many Predictions Fail but Some Don’t
(Penguin Press, New York, 2012).
12. Curtarolo, S. et al. The high-throughput highway to computational materials
design. Nature Mater. 12, 191–201 (2013).
13. Pilania, G. et al. New group IV chemical motifs for improved dielectric
permittivity of polyethylene. J. Chem. Inf. Modeling 53, 879–886 (2013).
14. Levy, O., Hart, G. L. W. & Curtarolo, S. Uncovering compounds by synergy of
cluster expansion and high-throughput methods. J. Am. Chem. Soc. 132,
4830–4833 (2010).
15. Jain, A. et al. A high-throughput infrastructure for density functional theory
calculations. Comp. Mater. Sci. 50, 2295–2310 (2011).
16. Hart, G. L. W., Blum, V., Walorski, M. J. & Zunger, A. Evolutionary approach for
determining first-principles hamiltonians. Nature Mater. 4, 391–394 (2005).
17. Fischer, C. C., Tibbetts, K. J., Morgan, D. & Ceder, G. Predicting crystal structure
by merging data mining with quantum mechanics. Nature Mater. 5, 641–646
(2006).
18. Saad, Y. et al. Data mining for materials: Computational experiments with AB
compounds. Phys. Rev. B 85, 104104 (2012).
19. Rupp, M., Tkatchenko, A., Müller, K.-R. & von Lilienfeld, O. A. Fast and accurate
modeling of molecular atomization energies with machine learning. Phys. Rev.
Lett. 108, 058301 (2012).
20. Snyder, J. C., Rupp, M., Hansen, K., Müller, K.-R. & Burke, K. Finding Density
Functionals with Machine Learning. Phys. Rev. Lett. 108, 253002 (2012).
Figure 4 | Learning performance of electron charge density-based fingerprint vectors. Parity plots comparing property values computed using DFT
against predictions made using learning algorithms trained using electron density-based fingerprint vectors. The Fourier coefficients of the planar-
averaged Kohn-Sham charge density are used to construct the fingerprint vector. Pearson’s correlation index is indicated in each of the panels to quantify
the agreement between the two schemes.
21. Montavon, G. et al. Machine Learning of Molecular Electronic Properties in
Chemical Compound Space. Accepted to New J. Phys.
22. Hautier, G., Fisher, C. C., Jain, A., Mueller, T. & Ceder, G. Finding nature's missing
ternary oxide compounds using machine learning and density functional theory.
Chem. Mater. 22, 3762 (2010).
23. Kohn, W. Electronic structure of matter–wave functions and density functionals.
Rev. Mod. Phys. 71, 1253 (1999).
24. Martin, R. Electronic Structure: Basic Theory and Practical Methods (Cambridge
University Press, New York, 2004).
25. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction (Springer, New York, 2009).
26. Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K. & Schölkopf, B. An introduction to
kernel-based learning algorithms. IEEE Transactions on Neural Networks 12, 181
(2001).
27. Gilman, J. J. Physical chemistry of intrinsic hardness. Mater. Sci. and Eng. A209,
74–81 (1996).
28. Zhu, H., Tang, C., Fonseca, L. R. C. & Ramprasad, R. Recent progress in ab initio
simulations of hafnia-based gate stacks. J. Mater. Sci. 47, 7399–7416 (2012).
29. Hohenberg, P. & Kohn, W. Inhomogeneous electron gas. Phys. Rev. 136,
B864–B871 (1964).
30. Geerlings, P., Boon, G., Van Alsenoy, C. & De Proft, F. Density functional theory
and quantum similarity. Int. J. Quantum. Chem. 101, 722 (2005).
31. Mezey, P. G. Holographic electron density shape theorem and its role in drug
design and toxicological risk assessment. J. Chem. Inf. Comput. Sci. 39, 224 (1999).
32. Bader, R. F. W. Atoms in molecules: a quantum theory (Oxford University Press,
Oxford, 1990).
33. Bultinck, P., Gironés, X. & Carbó-Dorca, R. Molecular quantum similarity: theory
and applications. Reviews in Computational Chemistry, Volume 21 (2005).
34. Kresse, G. & Furthmüller, J. Efficient iterative schemes for ab initio total-energy
calculations using a plane-wave basis set. Phys. Rev. B 54, 11169–11186 (1996).
35. Kresse, G. & Furthmüller, J. Efficiency of ab-initio total energy calculations for
metals and semiconductors using a plane-wave basis set. J. Comput. Mater. Sci. 6,
15–50 (1996).
36. Perdew, J., Burke, K. & Ernzerhof, M. Generalized Gradient Approximation Made
Simple. Phys. Rev. Lett. 77, 3865–3868 (1996).
37. Blöchl, P. Projector augmented-wave method. Phys. Rev. B 50, 17953–17979
(1994).
38. Baroni, S., de Gironcoli, S. & Dal Corso, A. Phonons and related crystal properties
from density-functional perturbation theory. Rev. Mod. Phys. 73, 515–562 (2001).
39. Gonze, X. Dynamical matrices, Born effective charges, dielectric permittivity
tensors, and interatomic force constants from density-functional perturbation
theory. Phys. Rev. B 55, 10355–10368 (1997).
40. Choy, T. C. Effective medium theory: principles and applications (Oxford
University Press Inc., Oxford, 1999).
41. Wang, C. C., Pilania, G. & Ramprasad, R. Dielectric properties of carbon-, silicon-,
and germanium-based polymers: A first-principles study. Phys. Rev. B 87, 035103
(2013).
Acknowledgements
This paper is based upon work supported by a Multidisciplinary University Research
Initiative (MURI) grant from the Office of Naval Research. Computational support was
provided by the Extreme Science and Engineering Discovery Environment (XSEDE), which
is supported by National Science Foundation. Discussions with Kenny Lipkowitz, Ganpati
Ramanath and Gerbrand Ceder are gratefully acknowledged.
Author contributions
R.R., C.W. and G.P. conceived the statistical learning model, with input from S.R. and X.J.
The DFT computations were performed by G.P. The initial implementation of the statistical
learning framework was performed by C.W. and extended by G.P. The manuscript was
written by G.P., S.R. and R.R.
Additional information
Supplementary information accompanies this paper at http://www.nature.com/
scientificreports
Competing financial interests: The authors declare no competing financial interests.
How to cite this article: Pilania, G., Wang, C., Jiang, X., Rajasekaran, S. & Ramprasad, R.
Accelerating materials property predictions using machine learning. Sci. Rep. 3, 2810;
DOI:10.1038/srep02810 (2013).
This work is licensed under a Creative Commons Attribution 3.0 Unported license.
To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0