A course on machine learning in science and industry.
- notions and applications
- nearest neighbours: search and machine learning algorithms
- ROC curve
- optimal classification and regression
- density estimation
- Gaussian mixtures and EM algorithm
- clustering, an example of clustering in the OPERA experiment
Machine learning in science and industry — day 2
- decision trees
- random forest
- Boosting: adaboost
- reweighting with boosting
- gradient boosting
- learning to rank with gradient boosting
- multiclass classification
- trigger in LHCb
- boosting to uniformity and flatness loss
- particle identification
Machine learning in science and industry — day 4
- tabular data approach to machine learning and when it didn't work
- convolutional neural networks and their application
- deep learning: history and today
- generative adversarial networks
- finding optimal hyperparameters
- joint embeddings
Introduction to machine learning terminology.
Applications within High Energy Physics and outside HEP.
* Basic problems: classification and regression.
* Nearest neighbours approach and spatial indices
* Overfitting (intro)
* Curse of dimensionality
* ROC curve, ROC AUC
* Bayes optimal classifier
* Density estimation: KDE and histograms
* Parametric density estimation
* Mixtures for density estimation and EM algorithm
* Generative approach vs discriminative approach
* Linear decision rule, intro to logistic regression
* Linear regression
1. Machine Learning in Science and Industry
Day 1, lectures 1 & 2
Alex Rogozhnikov, Tatiana Likhomanenko
Heidelberg, GradDays 2017
2. About authors
Alex Rogozhnikov Tatiana Likhomanenko
Yandex researchers (Yandex is an internet company in Russia, mostly known for its search engine)
applying machine learning in CERN projects for 3 (Alex) and 4 (Tatiana) years
graduated from the Yandex School of Data Analysis
3. Intro notes
4-day course
but we'll cover the most significant algorithms of machine learning
2–2.5 hours of lecture, then an optional practical session
we may fall a bit behind schedule, but we'll try to keep to it
all materials are published in the repository:
https://github.com/yandexdataschool/MLatGradDays
we have a challenge!
know the material? Spend more time on the challenge!
you can participate in teams of two
use the chat for questions on the challenge
4. What is Machine Learning about?
a method of teaching computers to make and improve predictions or behaviors based
on some data?
a field of computer science, probability theory, and optimization theory which allows
complex tasks to be solved for which a logical/procedural approach would not be
possible or feasible?
a type of AI that provides computers with the ability to learn without being explicitly
programmed?
somewhere in between statistics, AI, optimization theory, signal processing and pattern
matching?
5. What is Machine Learning about
Inference of statistical dependencies which gives us the ability to predict
6. What is Machine Learning about
Inference of statistical dependencies which gives us the ability to predict
Data is cheap, knowledge is precious
7. Machine Learning is used in
search engines
spam detection
security: virus detection, DDoS defense
computer vision and speech recognition
market basket analysis, customer relationship management (CRM), churn prediction
credit scoring / insurance scoring, fraud detection
health monitoring
traffic jam prediction, self-driving cars
advertisement systems / recommendation systems / news clustering
handwriting recognition
opinion mining
8. Machine Learning is used in
search engines
spam detection
security: virus detection, DDoS defense
computer vision and speech recognition
market basket analysis, customer relationship management (CRM), churn prediction
credit scoring / insurance scoring, fraud detection
health monitoring
traffic jam prediction, self-driving cars
advertisement systems / recommendation systems / news clustering
handwriting recognition
opinion mining
and hundreds more
9. Machine Learning in science
In particle physics
High-level triggers
Particle identification
Tracking
Tagging
High-level analysis
star / galaxy / quasar classification in astronomy
analysis of gene expression
brain-computer interfaces
weather forecasting
10. Machine Learning in science
In particle physics
High-level triggers
Particle identification
Tracking
Tagging
High-level analysis
star / galaxy / quasar classification in astronomy
analysis of gene expression
brain-computer interfaces
weather forecasting
In the applications mentioned, different data is used and different information is inferred,
but the underlying ideas are quite similar.
11. General notion
In supervised learning the training data is represented as a set of pairs (x_i, y_i):
i is an index of an observation
x_i is a vector of features available for an observation
y_i is a target — the value we need to predict
features = observables = variables
observations = samples = events
12. Classification problem
y_i ∈ Y, where Y is a finite set of labels.
Examples
particle identification based on information about a track:
x_i = (p, η, E, charge, χ², ...)
Y = {electron, muon, pion, ...}
binary classification:
Y = {0, 1} — 1 is signal, 0 is background
13. Regression problem
y_i ∈ ℝ
Examples:
predicting the price of a house from its position
predicting the income / number of customers of a shop
reconstructing the real momentum of a particle
14. Regression problem
y_i ∈ ℝ
Examples:
predicting the price of a house from its position
predicting the income / number of customers of a shop
reconstructing the real momentum of a particle
Why do we need automatic classification/regression?
applications have up to thousands of features
higher quality
much faster adaptation to new problems
15. Classification based on nearest neighbours
Given a training set of objects and their labels {x_i, y_i}, we predict the label for a new
observation x:
ŷ = y_j,   j = argmin_i ρ(x, x_i)
Here and below, ρ(x, x̃) is the distance in the space of features.
16. Visualization of decision rule
Consider a classification problem with 2 features:
x_i = (x_i^1, x_i^2),   y_i ∈ Y = {0, 1}
17. k Nearest Neighbours (kNN)
We can use k neighbours to compute the probability of belonging to each class:
p_ỹ(x) = (# of the k nearest neighbours of x in class ỹ) / k
Question: why may this be a good idea?
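As a sketch of the rule above: a minimal plain-Python kNN that returns class probabilities as the fraction of the k nearest neighbours in each class (the toy dataset and function names are illustrative, not from the course repository):

```python
import math
from collections import Counter

def knn_predict_proba(train, x, k=3):
    """Class probabilities from the k nearest neighbours of x.

    train: list of (features, label) pairs; rho is Euclidean distance.
    """
    rho = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbours = sorted(train, key=lambda pair: rho(pair[0], x))[:k]
    counts = Counter(label for _, label in neighbours)
    return {label: n / k for label, n in counts.items()}

train = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((1.0, 1.0), 1), ((0.9, 1.1), 1)]
print(knn_predict_proba(train, (0.2, 0.1), k=3))  # {0: 2/3, 1: 1/3}
```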
21. Overfitting
Question: how accurate is classification on the training dataset when k = 1?
answer: it is ideal (the closest neighbour is the observation itself)
quality is lower when k > 1
22. Overfitting
Question: how accurate is classification on the training dataset when k = 1?
answer: it is ideal (the closest neighbour is the observation itself)
quality is lower when k > 1
this doesn't mean k = 1 is the best,
it means we cannot use training observations to
estimate quality
when a classifier's decision rule is too complex and
captures details of the training data that are not relevant
to the underlying distribution, we call this overfitting
23. Model selection
Given two models, which one should we select?
ML is about inference of statistical dependencies, which gives us the ability to predict.
24. Model selection
Given two models, which one should we select?
ML is about inference of statistical dependencies, which gives us the ability to predict.
The best model is the model which gives better predictions for new observations.
The simplest way to control this is to check quality on a holdout — a sample not used during
training (cross-validation). This gives an unbiased estimate of quality for new data.
estimates have variance
multiple testing introduces bias (solution: train + validation + test, like Kaggle)
a very important factor if the amount of data is small
there are more approaches to cross-validation
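A small sketch tying the last two slides together (illustrative 1-D data, all names made up): a 1-nearest-neighbour rule scores perfectly on its own training set, so quality must be estimated on a holdout:

```python
import random

random.seed(0)
# toy 1-D dataset: true class is 1 for x > 0.5; flip 10% of labels as noise
xs = [(random.random(),) for _ in range(200)]
ys = [1 if x[0] > 0.5 else 0 for x in xs]
for i in random.sample(range(200), 20):
    ys[i] = 1 - ys[i]

def predict_1nn(train_x, train_y, x):
    # label of the single closest training point
    j = min(range(len(train_x)), key=lambda i: abs(train_x[i][0] - x[0]))
    return train_y[j]

tr_x, tr_y = xs[:150], ys[:150]   # training set
ho_x, ho_y = xs[150:], ys[150:]   # holdout, never used for "training"

train_acc = sum(predict_1nn(tr_x, tr_y, x) == y for x, y in zip(tr_x, tr_y)) / len(tr_x)
holdout_acc = sum(predict_1nn(tr_x, tr_y, x) == y for x, y in zip(ho_x, ho_y)) / len(ho_x)
print(train_acc, holdout_acc)  # training accuracy is exactly 1.0; the holdout is honest
```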
25. Regression using kNN
Regression with nearest neighbours is done by averaging the neighbours' outputs.
Model prediction:
ŷ(x) = (1/k) Σ_{j ∈ knn(x)} y_j
26. kNN with weights
Average the neighbours' outputs with weights:
ŷ = ( Σ_{j ∈ knn(x)} w_j y_j ) / ( Σ_{j ∈ knn(x)} w_j )
the closer the neighbour, the higher the weight
of its contribution, e.g.:
w_j = 1/ρ(x, x_j)
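The two regression formulas above in code (a sketch with toy 1-D data; the epsilon guard for exact matches is our addition):

```python
import math

def knn_regress(train, x, k=3, weighted=True):
    """kNN regression: plain or 1/distance-weighted average of the k nearest targets."""
    rho = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbours = sorted(train, key=lambda p: rho(p[0], x))[:k]
    if not weighted:
        return sum(y for _, y in neighbours) / k
    # w_j = 1 / rho(x, x_j); the epsilon guards against an exact match
    weights = [1.0 / max(rho(xj, x), 1e-12) for xj, _ in neighbours]
    return sum(w * y for w, (_, y) in zip(weights, neighbours)) / sum(weights)

train = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 2.0), ((3.0,), 3.0)]
print(knn_regress(train, (0.9,), k=2, weighted=False))  # (1.0 + 0.0) / 2 = 0.5
print(knn_regress(train, (0.9,), k=2))                  # 0.9: the nearer point dominates
```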
27. Computational complexity
Given that dimensionality of space is d and there are n training samples:
training time: ~ O(save a link to the data)
prediction time: n × d for each observation
Prediction is extremely slow.
To resolve this, one can build an index-like structure to select neighbours faster.
28. Spatial index: ball tree
training time
~ O(d × n log(n))
prediction time
~ O(log(n) × d)
for each observation
29. Spatial index: ball tree
training time
~ O(d × n log(n))
prediction time
~ O(log(n) × d)
for each observation
Other options exist:
KD-tree
FLANN
FALCONN
IMI, NO-IMI
These are very useful for searching over large databases with high-dimensional items.
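A minimal KD-tree, one of the index structures listed above, to show where the log(n) prediction time comes from (a sketch with illustrative points, not a production index):

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a KD-tree: split on the axes in turn at the median point."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, target, best=None):
    """Branch-and-bound nearest-neighbour search; returns (distance, point)."""
    if node is None:
        return best
    dist = math.dist(node["point"], target)
    if best is None or dist < best[0]:
        best = (dist, node["point"])
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    if abs(diff) < best[0]:  # the far branch may still hold a closer point
        best = nearest(far, target, best)
    return best

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2)))  # (sqrt(2), (8, 1))
```

Whole subtrees are pruned whenever the splitting plane is further away than the best candidate so far, which is what makes the search logarithmic on average.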
30. Overview of kNN
An awesomely simple classifier and regressor
Provides too-optimistic quality on training data
Quite slow, though optimizations exist
Too sensitive to the scale of features
Doesn't provide good results on high-dimensional data
31. Sensitivity to scale of features
Euclidean distance:
ρ²(x, x̃) = (x_1 − x̃_1)² + (x_2 − x̃_2)² + ... + (x_d − x̃_d)²
32. Sensitivity to scale of features
Euclidean distance:
ρ²(x, x̃) = (x_1 − x̃_1)² + (x_2 − x̃_2)² + ... + (x_d − x̃_d)²
Change the scale of the first feature by a factor of 10
(assuming that x_1, x_2, ..., x_d are of the same order):
ρ²(x, x̃) = (10x_1 − 10x̃_1)² + (x_2 − x̃_2)² + ... + (x_d − x̃_d)² ∼ 100(x_1 − x̃_1)²
33. Sensitivity to scale of features
Euclidean distance:
ρ²(x, x̃) = (x_1 − x̃_1)² + (x_2 − x̃_2)² + ... + (x_d − x̃_d)²
Change the scale of the first feature by a factor of 10
(assuming that x_1, x_2, ..., x_d are of the same order):
ρ²(x, x̃) = (10x_1 − 10x̃_1)² + (x_2 − x̃_2)² + ... + (x_d − x̃_d)² ∼ 100(x_1 − x̃_1)²
Scaling of features frequently increases quality.
Otherwise the contribution from features with a large scale dominates.
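The effect is easy to demonstrate: merely changing the units of one feature flips which point is the "nearest" neighbour (toy points, plain Python):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

a, b, c = (0.0, 0.0), (1.0, 0.0), (0.2, 5.0)
# in the original units, b is the nearest neighbour of a
print(euclidean(a, b), euclidean(a, c))  # 1.0 vs ~5.0

scale = lambda p: (10 * p[0],) + p[1:]   # express feature 1 in 10x smaller units
a2, b2, c2 = scale(a), scale(b), scale(c)
print(euclidean(a2, b2), euclidean(a2, c2))  # 10.0 vs ~5.4: now c is "nearest"
```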
34. Distance function matters
Minkowski distance
ρ^p(x, x̃) = Σ_l |x_l − x̃_l|^p
Canberra distance
ρ(x, x̃) = Σ_l |x_l − x̃_l| / (|x_l| + |x̃_l|)
Cosine metric
ρ(x, x̃) = ⟨x, x̃⟩ / (|x| |x̃|)
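The three distances above as plain-Python sketches (function names are ours; the Minkowski form is written with the usual 1/p root):

```python
import math

def minkowski(a, b, p=2):
    """Minkowski distance; p=2 gives Euclidean, p=1 Manhattan."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)

def canberra(a, b):
    """Canberra distance: a sum of per-coordinate relative differences."""
    return sum(abs(ai - bi) / (abs(ai) + abs(bi)) for ai, bi in zip(a, b) if ai or bi)

def cosine_similarity(a, b):
    """<a, b> / (|a| |b|) — the quantity on the slide; 1 means same direction."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

u, v = (1.0, 0.0), (0.0, 1.0)
print(minkowski(u, v))          # sqrt(2)
print(minkowski(u, v, p=1))     # 2.0
print(canberra(u, v))           # 2.0
print(cosine_similarity(u, v))  # 0.0 (orthogonal vectors)
```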
35. Problems with high dimensions
With higher dimensions d ≫ 1 the neighbouring points are further away.
Example: consider n training data points distributed uniformly in the unit cube:
the expected number of points in a ball of radius r is proportional to r^d,
so to collect the same number of neighbours, we need to take r = const^(1/d) → 1.
kNN suffers from the curse of dimensionality.
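Numerically (a short sketch): the radius needed to capture a fixed fraction f of uniform data grows toward 1 as f^(1/d):

```python
f = 0.01  # fraction of the data we want as neighbours
for d in (1, 2, 10, 100):
    r = f ** (1 / d)   # radius enclosing volume f of the unit cube
    print(d, round(r, 3))
# d=1 -> 0.01, d=2 -> 0.1, d=10 -> ~0.631, d=100 -> ~0.955
```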
36. Nearest neighbours applications
Classification and regression based on kNN are used quite rarely, but nearest neighbour search is very widespread: it is used whenever you need to find something similar in a database.
finding documents and images
finding similar proteins by using 3d-structure
search in chemical databases
person identification by photo or by speech
user identification by behavior on the web
Similarity (distance) is very important and isn't defined trivially in most cases.
37. Measuring quality of binary classification
The classifier's output in binary classification is a real-valued variable (say, signal is blue and background is red).
Which classifier provides better discrimination?
Discrimination is identical in all three cases.
41. ROC curve
These distributions have the same ROC curve:
(the ROC curve is the dependency of passed signal vs passed background)
42. ROC curve
Defined only for binary classification
Contains important information:
all possible combinations of signal and background efficiencies you may achieve by setting a threshold
Particular values of thresholds (and the initial pdfs) don't matter; the ROC curve doesn't contain this information
ROC curve = information about the order of observations' predictions:
b b s b s b ... s s b s s
Comparison of algorithms should be based on the information from the ROC curve.
43. ROC AUC (area under the ROC curve)
ROC AUC = P(r_b < r_s)
where r_b, r_s are predictions of random background and signal observations.
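This probabilistic definition can be evaluated directly over all (signal, background) pairs (a NumPy sketch; ties are counted with weight 1/2):

```python
import numpy as np

def roc_auc(pred_signal, pred_background):
    """ROC AUC = P(r_b < r_s), estimated over all signal/background pairs."""
    s = np.asarray(pred_signal)[:, None]
    b = np.asarray(pred_background)[None, :]
    return (s > b).mean() + 0.5 * (s == b).mean()

perfect = roc_auc([0.8, 0.9], [0.1, 0.2])   # fully separated classes
random_like = roc_auc([0.5], [0.5])         # indistinguishable predictions
```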
44. Classifiers have the same ROC AUC, but which is better...
if the problem is spam detection?
Spam is class 0, non-spam is class 1
(putting a normal letter in spam is very undesirable)
for a background rejection system at the LHC?
Background is class 0.
(we need to pass very little background)
Applications frequently demand a different metric.
46. Statistical Machine Learning
Machine learning we use in practice is based on statistics
Main assumption: the data is generated from a probabilistic distribution:
p(x, y)
Does there really exist a distribution of people / pages / texts?
Same question for stars / proteins / viruses?
In HEP these distributions do exist
The statistical framework turned out to be quite helpful for building models.
48. Optimal regression
Assuming that we know the real distribution p(x, y), and the goal is to minimize squared error:
L = E(y − ŷ(x))² → min
the optimal prediction is the marginalized average:
ŷ(x₀) = E(y | x = x₀)
49. Optimal classification. Bayes optimal classifier
Assuming that we know the real distribution p(x, y), we reconstruct p(y | x) using Bayes' rule:
p(y | x) = p(x, y) / p(x) = p(y) p(x | y) / p(x)
Lemma (Neyman–Pearson):
The best classification quality is provided by (Bayes optimal classifier)
p(y = 1 | x) / p(y = 0 | x)
51. Optimal binary classification
The Bayes optimal classifier has the highest possible ROC curve.
Since the classification quality depends only on the order, p(y = 1 | x) gives optimal classification quality too!
p(y = 1 | x) / p(y = 0 | x) = [p(y = 1) / p(y = 0)] × [p(x | y = 1) / p(x | y = 0)]
How can we estimate the terms of this expression from data?
p(y = 1), p(y = 0) ?
p(x | y = 1), p(x | y = 0) ?
54. Histogram density estimation
Counting the number of samples in each bin and normalizing.
fast
choice of binning is crucial
number of bins grows exponentially → curse of dimensionality
55. Kernel density estimation
f(x) = (1 / nh) Σᵢ K((x − xᵢ) / h)
K(x) is the kernel, h is the bandwidth
Typically a Gaussian kernel is used, but there are many others.
The approach is very close to weighted kNN.
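A one-dimensional Gaussian-kernel KDE is a few lines of NumPy (a sketch; the function name is ours):

```python
import numpy as np

def kde(x_eval, x_train, h):
    """f(x) = 1/(n h) * sum_i K((x - x_i) / h), with a Gaussian kernel K."""
    u = (x_eval[:, None] - x_train[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (len(x_train) * h)

x_train = np.array([-1.0, 0.0, 0.0, 1.0])
grid = np.linspace(-5, 5, 1001)
f = kde(grid, x_train, h=0.5)
# the estimate is a proper density: it integrates to ~1
integral = f.sum() * (grid[1] - grid[0])
```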
57. Kernel density estimation: bandwidth selection
Silverman's rule of thumb:
h = σ̂ (4 / 3n)^(1/5)
σ̂ is the empirical standard deviation
may be irrelevant if the data is far from Gaussian
59. Recapitulation
1. Statistical ML: applications and problems
2. k nearest neighbours classifier and regressor
distance function
overfitting and model selection
speeding up neighbours search
3. Binary classification: ROC curve, ROC AUC
4. Optimal regression and optimal classification
5. Density estimation
histograms
kernel density estimation
60. Parametric density estimation
Family of density functions: f(x; θ).
Problem: estimate the parameters of a Gaussian distribution.
f(x; μ, Σ) = 1 / ((2π)^(d/2) |Σ|^(1/2)) exp(−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ))
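For the Gaussian family the maximum-likelihood estimates are just the sample mean and covariance (a NumPy sketch; function names are ours):

```python
import numpy as np

def fit_gaussian(X):
    """MLE of a multivariate Gaussian: sample mean and covariance (1/n)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / len(X)   # MLE uses 1/n, not 1/(n-1)
    return mu, Sigma

def gaussian_log_density(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
mu, Sigma = fit_gaussian(X)   # mu = (1, 1), Sigma = identity
```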
63. QDA computational complexity
n samples, d dimensions
training consists of fitting p(x | y = 0) and p(x | y = 1) and takes O(nd² + d³):
computing the covariance matrix is O(nd²)
inverting the covariance matrix is O(d³)
prediction takes O(d²) for each sample, spent on computing the dot product
64. QDA overview
simple analytical formula for prediction
fast prediction
many parameters to reconstruct in high dimensions
thus, the estimate may become unreliable
data almost never has a Gaussian distribution
65. Gaussian mixtures for density estimation
Mixture of distributions (sums run over components c):
f(x) = Σ_c π_c f_c(x; θ_c),   Σ_c π_c = 1
Mixture of Gaussian distributions:
f(x) = Σ_c π_c f(x; μ_c, Σ_c)
Parameters to be found: π₁, …, π_C, μ₁, …, μ_C, Σ₁, …, Σ_C
67. Gaussian mixtures: finding parameters
Criterion: maximize the likelihood (using MLE to find optimal parameters)
Σᵢ log f(xᵢ; θ) → max over θ
no analytic solution
we can use general-purpose optimization methods
In mixtures the parameters split into two groups:
θ₁, …, θ_C: parameters of the components
π₁, …, π_C: contributions of the components
69. Expectation-Maximization algorithm [Dempster et al., 1977]
Idea: introduce a set of hidden variables π_c(x)
Expectation step:
π_c(x) = p(x ∈ c) = π_c f_c(x; θ_c) / Σ_c̃ π_c̃ f_c̃(x; θ_c̃)
Maximization step:
π_c = (1/n) Σᵢ π_c(xᵢ)
θ_c = argmax over θ of Σᵢ π_c(xᵢ) log f_c(xᵢ; θ)
The maximization step is trivial for Gaussian distributions.
The EM algorithm is more stable and has good convergence properties.
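The two steps above, for a two-component 1D Gaussian mixture, as a NumPy sketch (the initialization and names are ours):

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1D Gaussian mixture (a minimal sketch)."""
    pi = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])      # crude initialization
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: responsibilities pi_c(x_i)
        dens = (pi / np.sqrt(2 * np.pi * var)
                * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
pi, mu, var = em_gmm_1d(x)   # mu should end up near (-3, 3)
```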
72. Density estimation applications
parametric fits in HEP
specifically, mixtures of signal + background
outlier detection
if a new observation has low density w.r.t. the current distribution, it is considered an outlier
photometric selection of quasars
Gaussian mixtures are used as a smooth version of clustering (discussed later)
73. Clustering (unsupervised learning)
Gaussian mixtures can be considered a smooth form of clustering.
However, in some cases clusters have to be defined strictly (that is, each observation can be assigned to only one cluster).
74. Demo of k-means and DBSCAN clusterings
(shown interactively)
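Since the demo was interactive, here is a minimal k-means sketch of what it showed (NumPy; names are ours):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate nearest-center assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# two well-separated blobs
blob = np.arange(20, dtype=float).reshape(-1, 2) * 0.01
X = np.vstack([blob, blob + 10.0])
labels, centers = kmeans(X, k=2)
```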
75. OPERA (operated in 2008-2012)
Oscillation Project with Emulsion-tRacking Apparatus
neutrinos were sent from CERN to Gran Sasso (INFN, Italy)
looking for neutrino oscillations (Nobel prize 2015)
hence, neutrinos aren't massless
the OPERA detector is placed underground and consists of 150,000 bricks of 8.3 kg each
a brick consists of interleaved photo emulsion and lead layers
76. OPERA: brick and scanning
bricks are taken out and scanned only if there are signs of something interesting
one brick = billions of base-tracks
78. OPERA clustering demo
(shown interactively)
different metrics can be used in clustering
the appropriate choice of metric is crucial
79. Clustering
news clustering in news aggregators
each topic is covered by dozens of media outlets; those are joined into one group to avoid showing lots of similar results
clustering of surfaces based on spectra
(in the picture, 5 spectral channels for each point were taken from satellite photos)
clustering in calorimeters in HEP
clustering of gamma-ray bursts in astronomy
80. Generative approach
A classification model based on mixture density estimation is called MDA (mixture discriminant analysis).
Generative approach: try to reconstruct p(x, y), then use the Bayes classification formula to predict.
QDA and MDA are generative classifiers.
Problems of the generative approach
Real-life distributions can hardly be reconstructed
Especially in high-dimensional spaces
So, we switch to the discriminative approach: guessing p(y | x) directly
83. Density estimation requires reconstructing the dependencies between all known parameters, which requires a lot of data, yet most of these dependencies aren't needed in the application.
If we can avoid multidimensional density estimation, we'd better do it.
84. Naive Bayes classification rule
Assuming that features are independent within each class, one can avoid multidimensional density fitting:
p(y = 1 | x) / p(y = 0 | x) = [p(y = 1) / p(y = 0)] × Πᵢ [p(xᵢ | y = 1) / p(xᵢ | y = 0)]
The contribution of each feature is independent, and one-dimensional fitting is much easier.
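With a 1D Gaussian fit per feature and per class, the rule looks like this (a sketch; the names and the Gaussian choice are ours):

```python
import numpy as np

def gauss_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - 0.5 * (x - mu) ** 2 / sigma ** 2

def nb_log_ratio(x, stats0, stats1, prior0=0.5, prior1=0.5):
    """log[p(y=1|x) / p(y=0|x)] with per-feature 1D Gaussian densities.
    stats0, stats1: per-feature (mu, sigma) for class 0 and class 1."""
    r = np.log(prior1 / prior0)
    for xi, (m0, s0), (m1, s1) in zip(x, stats0, stats1):
        r += gauss_logpdf(xi, m1, s1) - gauss_logpdf(xi, m0, s0)
    return r

# single feature: class 0 ~ N(0,1), class 1 ~ N(2,1); x sits on the class-1 mean
r = nb_log_ratio([2.0], [(0.0, 1.0)], [(2.0, 1.0)])
```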
85. Application: B-tagging (picture from LHCb)
flavour tagging is important for estimating CKM-matrix elements
to predict which meson was produced (B⁰ or B̄⁰), the non-signal part of the collision is analyzed (tagging)
the signal part (green) provides information about the decayed state
there are dozens of tracks that aren't part of this scheme
86. Inclusive tagging idea
use all (non-signal) tracks from the collision
let the machine guess which tracks are tagging, how their charge is connected to the meson flavour, and estimate the probability
We can make a naive assumption that the information from tracks is independent:
p(B̄⁰ | tracks) / p(B⁰ | tracks) = [p(B̄⁰) / p(B⁰)] × Π_track [p(track | B̄⁰) / p(track | B⁰)] = [p(B⁰) / p(B̄⁰)]^(n_tracks − 1) × Π_track [p(B̄⁰ | track) / p(B⁰ | track)]
The number of terms in the product varies (compared to typical Naive Bayes).
The ratio for each track on the right side is estimated with ML techniques discussed later.
This simple technique provides very good results.
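Combining per-track posteriors according to this formula (a sketch; the equal priors and the probabilities are made up):

```python
import numpy as np

def combined_ratio(p_bbar_given_track, prior_bbar=0.5):
    """p(B0bar|tracks) / p(B0|tracks) from per-track posteriors,
    assuming tracks are independent given the flavour."""
    p = np.asarray(p_bbar_given_track)
    n = len(p)
    prior_b = 1.0 - prior_bbar
    return (prior_b / prior_bbar) ** (n - 1) * np.prod(p / (1.0 - p))

# two tracks, each slightly favouring B0bar
ratio = combined_ratio([0.6, 0.6])   # (0.6/0.4)^2 = 2.25 with equal priors
```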
87. Linear decision rule
The decision function is linear:
d(x) = ⟨w, x⟩ + w₀
ŷ = sgn(d(x)), that is:
d(x) > 0 → ŷ = +1
d(x) < 0 → ŷ = −1
This is a parametric model (finding parameters w, w₀).
QDA & MDA are parametric as well.
88. Finding optimal parameters
A good initial guess: find w, w₀ such that the classification error is minimal:
L = Σ_{i ∈ samples} 1_{yᵢ ≠ ŷᵢ},   ŷᵢ = sgn(d(xᵢ))
Notation: 1_true = 1, 1_false = 0.
Discontinuous optimization (arrrrgh!)
Solution: let's make the decision rule smooth:
p(y = +1 | x) = p₊₁(x) = f(d(x))
p(y = −1 | x) = p₋₁(x) = 1 − p₊₁(x)
where f(0) = 0.5, f(x) > 0.5 for x > 0, f(x) < 0.5 for x < 0
91. Logistic regression
Define probabilities obtained with the logistic function
p₊₁(x) = σ(d(x))
p₋₁(x) = σ(−d(x))
with d(x) = ⟨w, x⟩ + w₀
and optimize the log-likelihood:
L = − Σ_{i ∈ observations} ln(p_{yᵢ}(xᵢ)) = Σᵢ L(xᵢ, yᵢ) → min
92. Logistic loss (LogLoss)
The term loss refers to what we are minimizing. Losses typically estimate our risks and are denoted as L.
L = − Σ_{i ∈ observations} ln(p_{yᵢ}(xᵢ)) = Σᵢ L(xᵢ, yᵢ) → min
LogLoss penalty for a single observation:
L(xᵢ, yᵢ) = − ln(p_{yᵢ}(xᵢ)) = { ln(1 + e^(−d(xᵢ))), yᵢ = +1;  ln(1 + e^(d(xᵢ))), yᵢ = −1 } = ln(1 + e^(−yᵢ d(xᵢ)))
The margin yᵢ d(xᵢ) is expected to be high for all observations.
93. Logistic loss
L(xᵢ, yᵢ) is a convex function.
Simple analysis shows that L is a sum of convex functions w.r.t. w, so the optimization problem has at most one optimum.
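Minimizing this LogLoss with plain gradient descent (a NumPy sketch; labels in {−1, +1}, the learning rate and iteration count are ad hoc):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, n_iter=2000):
    """Minimize sum_i ln(1 + exp(-y_i d(x_i))), d(x) = <w, x> + w0."""
    Xb = np.hstack([np.ones((len(X), 1)), X])  # prepend 1 for the bias w0
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        margins = y * (Xb @ w)
        # gradient of the loss: -sum_i y_i x_i sigma(-margin_i)
        grad = -(Xb * (y * sigmoid(-margins))[:, None]).sum(axis=0)
        w -= lr * grad / len(X)
    return w

X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1, -1, 1, 1])
w = fit_logistic(X, y)   # separates the two groups by sign of d(x)
```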
95. Linear model for regression
How can we use a linear function for regression?
d(x) = ⟨w, x⟩ + w₀
Simplification of notation: x₀ = 1, x = (1, x₁, ..., x_d), so
d(x) = ⟨w, x⟩
96. Linear regression (ordinary least squares)
We can use a linear function for regression:
d(xᵢ) = yᵢ,   d(x) = ⟨w, x⟩
This is a linear system with d + 1 variables and n equations.
Minimize OLS aka MSE (mean squared error):
L = Σᵢ (d(xᵢ) − yᵢ)² → min
Explicit solution: (Σᵢ xᵢ xᵢᵀ) w = Σᵢ yᵢ xᵢ
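The explicit solution in NumPy (a sketch; `np.linalg.lstsq` would be the numerically safer choice in practice):

```python
import numpy as np

def fit_ols(X, y):
    """Solve (sum_i x_i x_i^T) w = sum_i y_i x_i, with a bias column."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])     # exactly y = 1 + 2x
w = fit_ols(X, y)
```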
97. Linear regression
can use some other loss L different from MSE
but there is no explicit solution in other cases
demonstrates the properties of linear models:
reliable estimates when n ≫ d
able to fit the data exactly when n = d
underdetermined when n < d
98. Regularization: motivation
When the number of parameters is high (compared to the number of observations):
hard to estimate all parameters reliably
linear regression with MSE:
in d-dimensional space you can find a hyperplane through any d points
non-unique solution if n < d
the matrix Σᵢ xᵢ xᵢᵀ degenerates
Solution 1: manually decrease the dimensionality of the problem (select more appropriate features)
Solution 2: use regularization
99. Regularization
When the number of parameters in a model is high, overfitting is very probable
Solution: add a regularization term to the loss function:
L = (1/N) Σᵢ L(xᵢ, yᵢ) + L_reg → min
L₂ regularization: L_reg = α Σⱼ |wⱼ|²
L₁ regularization: L_reg = β Σⱼ |wⱼ|
L₁ + L₂ regularization: L_reg = α Σⱼ |wⱼ|² + β Σⱼ |wⱼ|
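For MSE with L₂ regularization the explicit solution survives (a sketch; here every coefficient is penalized, including the bias, which real implementations usually exempt):

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """L2-regularized least squares: solve (X^T X + alpha I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, 2.0])
w_free = fit_ridge(X, y, alpha=0.0)      # plain OLS: w = (1, 1)
w_shrunk = fit_ridge(X, y, alpha=100.0)  # strong penalty pulls w toward 0
```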
100. L₁, L₂ regularizations
Dependence of the parameters (components of w) on the regularization strength (stronger regularization to the left):
left: L₂ regularization; right: L₁ (solid), L₁ + L₂ (dashed)
102. L_p regularizations
We can consider a more general case:
L_p = Σᵢ |wᵢ|^p
Expression for L₀: L₀ = Σᵢ 1_{wᵢ ≠ 0}
exactly penalizes the number of non-zero coefficients
but nobody uses L₀. Why?
even L_p with 0 < p < 1 is not used. Why?
104. Regularization summary
an important tool to fight overfitting (= poor generalization on new data)
different modifications exist for other models
makes it possible to handle really many features
machine learning should detect the important features itself
from the mathematical point of view:
turns a convex problem into a strongly convex one (NB: only for linear models)
from the practical point of view:
softly limits the space of parameters
breaks the scale-invariance of linear models
106. Data Scientist Tools
Experiment in an appropriate high-level language or environment
After the experiments are over, implement the final algorithm in a low-level language (C++, CUDA, FPGA)
The second point is typically unnecessary
108. Scientific Python
SciPy
libraries for science and engineering
root_numpy
a convenient way to work with ROOT files
Astropy
a community Python library for astronomy