K-nearest neighbors (KNN) is a non-parametric classification technique where an unlabeled sample is classified based on the labels of its k nearest neighbors in the training set, as determined by a distance function. The basic steps are to 1) compute the distance between the test sample and all training samples, 2) select the k nearest neighbors based on distance, and 3) assign the test sample the most common label of its k nearest neighbors. Key aspects include choosing an appropriate value of k, distance metrics, handling high-dimensional data, and reducing computational complexity through techniques like condensing.
K-nearest neighbor is one of the most commonly used classifiers based on lazy learning. It is widely used in recommendation systems and document-similarity measures, and it typically uses Euclidean distance as the similarity measure between two data points.
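To make the three steps listed above concrete, here is a minimal NumPy sketch of the basic (unweighted) k-NN vote; the function and data names are illustrative, not taken from the slides.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by a majority vote among its k nearest training samples."""
    # 1) Compute the Euclidean distance from the query to every training sample.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # 2) Select the indices of the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # 3) Assign the most common label among those neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny usage example with made-up data
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["duck", "duck", "goose", "goose"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "duck"
```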
The slides were prepared by referring to the text Machine Learning by Tom M. Mitchell (McGraw Hill, Indian Edition) and to video tutorials on NPTEL.
2. Bayes Classifier: Recap
[Figure: class posteriors P( HILSA | L), P( TUNA | L), and P( SHARK | L) plotted against the measured length L]
Maximum A Posteriori (MAP) Rule
Distributions are assumed to belong to a particular family (e.g., Gaussian), and their parameters are estimated from the training data.
3. Bayes Classifier: Recap
[Figure: the same posteriors P( HILSA | L), P( TUNA | L), and P( SHARK | L), with a small window drawn around the query value of L]
Approximate Maximum A Posteriori (MAP) Rule
Non-parametric (data-driven) approach: consider a small window around L and find which class is most populous in that window.
4. Nearest Neighbor Classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
[Figure: given a set of training records and a test record, compute the distance from the test record to the training records and choose the k "nearest" records]
5. Basic Idea
The k-NN classification rule assigns to a test sample the majority category label of its k nearest training samples.
In practice, k is usually chosen to be odd, so as to avoid ties.
The k = 1 rule is generally called the nearest-neighbor classification rule.
6. Definition of Nearest Neighbor
[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
7. Voronoi Diagram
Properties:
1) Every point within a sample's Voronoi cell has that sample as its nearest neighbor.
2) For any sample, the nearest sample is determined by the closest Voronoi cell edge.
8. Distance-weighted k-NN
Replace the unweighted vote
\hat{f}(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))
by the distance-weighted vote
\hat{f}(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} \frac{1}{d(x_i, x_q)^2} \, \delta(v, f(x_i))
General kernel functions, such as Parzen windows, may be considered instead of the inverse distance.
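A short sketch of the distance-weighted vote above, using the inverse squared distance as the weight; the names are illustrative, and the small epsilon is an added assumption to guard against a zero distance when the query coincides with a training point.

```python
import numpy as np
from collections import defaultdict

def weighted_knn_classify(X_train, y_train, x_query, k=5, eps=1e-12):
    """Distance-weighted k-NN vote: each neighbor votes with weight 1 / d^2."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (dists[i] ** 2 + eps)  # w_i = 1 / d(x_q, x_i)^2
    return max(votes, key=votes.get)  # argmax over the class labels v in V
```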
9. Predicting Continuous Values
Replace the discrete vote
\hat{f}(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i))
by the weighted average
\hat{f}(x_q) = \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}
Note: the unweighted case corresponds to w_i = 1 for all i.
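For continuous targets, the same weights give a locally weighted average. The sketch below follows the formula above, again with illustrative names and an epsilon guard that is an added assumption.

```python
import numpy as np

def weighted_knn_regress(X_train, y_train, x_query, k=5, eps=1e-12):
    """Predict a continuous value as the distance-weighted mean of the k nearest targets."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] ** 2 + eps)  # w_i = 1 / d^2; setting w_i = 1 gives the unweighted mean
    return np.sum(w * y_train[nearest]) / np.sum(w)
```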
10. Nearest-Neighbor Classifiers: Issues
– The value of k, the number of nearest neighbors to retrieve
– Choice of distance metric to compute the distance between records
– Computational complexity
– Size of the training set
– Dimension of the data
11. Value of K
Choosing the value of k:
If k is too small, the classifier is sensitive to noise points.
If k is too large, the neighborhood may include points from other classes.
Rule of thumb: k = sqrt(N), where N is the number of training points.
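Beyond the sqrt(N) rule of thumb, k is often chosen by cross-validation. A possible sketch using scikit-learn (an assumption, since the slides do not name a library), restricted to odd values of k to avoid ties:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def choose_k(X, y, k_values=(1, 3, 5, 7, 9, 11), cv=5):
    """Return the k with the best mean cross-validated accuracy, plus all scores."""
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=cv).mean()
              for k in k_values}
    return max(scores, key=scores.get), scores
```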
13. Distance Measure: Scale Effects
Different features may have different measurement scales, e.g., patient weight in kg (range [50, 200]) vs. blood protein values in ng/dL (range [-3, 3]).
Consequences: patient weight will have a much greater influence on the distance between samples, which may bias the performance of the classifier.
14. Standardization
Transform raw feature values into z-scores:
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}
where x_{ij} is the value of the jth feature for the ith sample, \mu_j is the average of x_{ij} over all samples for feature j, and \sigma_j is the standard deviation of x_{ij} over all input samples.
The range and scale of the z-scores should then be similar (provided the distributions of the raw feature values are alike).
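A minimal sketch of the z-score transform: fit mu_j and sigma_j on the training data and reuse them for query points (the helper names are illustrative):

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation on the training data."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features (added assumption)
    return mu, sigma

def standardize(X, mu, sigma):
    """z_ij = (x_ij - mu_j) / sigma_j"""
    return (X - mu) / sigma
```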
15. Nearest Neighbor: Dimensionality
Problem with the Euclidean measure:
For high-dimensional data the curse of dimensionality sets in: density shrinks (the sparsification effect) and distances can produce counter-intuitive results.
Example: the pair of 12-bit vectors that agree on almost all of their active features,
1 1 1 1 1 1 1 1 1 1 1 0
0 1 1 1 1 1 1 1 1 1 1 1
and the pair that have no active features in common,
1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1
are separated by the same Euclidean distance, d = 1.4142.
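The two pairs above really do come out at the same distance; a few lines to verify it:

```python
import numpy as np

a = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
b = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
c = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
d = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# Each pair differs in exactly two coordinates, so both distances are sqrt(2).
print(np.linalg.norm(a - b), np.linalg.norm(c - d))  # 1.4142... 1.4142...
```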
17. Distance for Heterogeneous Data
Wilson, D. R. and Martinez, T. R., "Improved Heterogeneous Distance Functions," Journal of Artificial Intelligence Research, vol. 6, no. 1, pp. 1-34, 1997.
18. Nearest Neighbour: Computational Complexity
Expensive:
To determine the nearest neighbour of a query point q, we must compute the distance to all N training examples.
+ Pre-sort the training examples into fast data structures (kd-trees)
+ Compute only an approximate distance (LSH)
+ Remove redundant data (condensing)
Storage requirements:
Must store all of the training data.
+ Remove redundant data (condensing)
- Pre-sorting often increases the storage requirements
High-dimensional data:
"Curse of dimensionality" - the required amount of training data increases exponentially with the dimension, and the computational cost also increases dramatically.
Partitioning techniques degrade to linear search in high dimensions.
19. Reduction in Computational Complexity
Reduce the size of the training set: condensation, editing.
Use geometric data structures for high-dimensional search.
20. Condensation: Decision Regions
Each cell contains one sample, and every location within the cell is closer to that sample than to any other sample. A Voronoi diagram divides the space into such cells.
Every query point will be assigned the classification of the sample within its cell. The decision boundary separates the class regions based on the 1-NN decision rule.
Knowledge of this boundary is sufficient to classify new points, but the boundary itself is rarely computed; many algorithms instead seek to retain only those points necessary to generate an identical boundary.
21. Condensing
The aim is to reduce the number of training samples, retaining only the samples that are needed to define the decision boundary.
Decision-boundary consistent: a subset whose nearest-neighbour decision boundary is identical to the boundary of the entire training set.
Minimum consistent set: the smallest subset of the training data that correctly classifies all of the original training data.
[Figure: original data, condensed data, minimum consistent set]
22. Condensing
Condensed Nearest Neighbour (CNN)
1. Initialize the subset with a single (or k) training example(s).
2. Classify all remaining samples using the subset, and transfer any incorrectly classified samples to the subset.
3. Return to step 2 until no transfers occur or the subset is full.
Properties:
• Incremental
• Order dependent
• Neither minimal nor decision-boundary consistent
• O(n^3) for the brute-force method
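A rough sketch of the CNN loop described above, classifying against the growing subset with the 1-NN rule; the helper names are illustrative, and the result is order dependent, as the slide notes:

```python
import numpy as np

def nn_label(X_sub, y_sub, x):
    """Label of the single nearest neighbour of x within the current subset."""
    return y_sub[np.argmin(np.linalg.norm(X_sub - x, axis=1))]

def condense(X, y):
    """Condensed Nearest Neighbour: keep only samples misclassified by the current subset."""
    keep = [0]                                   # 1. start with a single training example
    changed = True
    while changed:                               # 3. repeat until no transfers occur
        changed = False
        for i in range(len(X)):                  # 2. classify remaining samples with the subset
            if i in keep:
                continue
            if nn_label(X[keep], y[keep], X[i]) != y[i]:
                keep.append(i)                   # transfer the misclassified sample to the subset
                changed = True
    return X[keep], y[keep]
```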
23-28. Condensing
[Slides 23-28 repeat the CNN procedure above, illustrating it on an example dataset in figures.]
29. High dimensional search
Given a point set and a nearest-neighbour query point:
Find the points enclosed in a rectangle (range) around the query, then perform a linear search for the nearest neighbour only within that rectangle.
[Figure: query point surrounded by a range rectangle]
30. kd-tree: data structure for range search
Index the data into a tree, then search on the tree.
Tree construction: at each level we split on a different dimension.
[Figure: example kd-tree over points A, B, C, D, E, splitting first on x = 5 (x < 5 vs x >= 5) and then on y = 3, y = 6, and x = 6]
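Rather than building the tree by hand as in the figure, a practical sketch can lean on SciPy's KDTree (an assumption, since the slides do not name a library; the example points are made up) for both the range query and the nearest-neighbour query:

```python
import numpy as np
from scipy.spatial import KDTree

# Illustrative 2-D points and query (not the exact points from the figure)
points = np.array([[2.0, 3.0], [5.0, 4.0], [9.0, 6.0], [4.0, 7.0], [8.0, 1.0]])
tree = KDTree(points)

query = np.array([6.0, 5.0])
in_range = tree.query_ball_point(query, r=3.0)  # indices of points within radius 3 of the query
dist, idx = tree.query(query, k=1)              # distance and index of the nearest neighbour
print(in_range, dist, idx)
```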