SlideShare a Scribd company logo
1 of 30
Download to read offline
Mostofa Patwary1, Nadathur Satish1, Narayanan Sundaram1,
Jilalin Liu2, Peter Sadowski2, Evan Racah2, Suren Byna2, Craig Tull2, Wahid Bhimji2
Prabhat2, Pradeep Dubey1
1Intel Corporation, 2Lawrence Berkeley National Lab
Massively Parallel K-Nearest Neighbor
Computation on Distributed Architectures
KNN: K-Nearest Neighbors
3
v  Problem: Given a set of multi-dimensional data points, find the k closest neighbors
k = 3
v  Classification => take the majority vote from the neighbors
v  Regression => take the average value of the neighbors
KNN: Well known applications
4
v  Object classification in images
v  Text classification and information retrieval
v  Prediction of economic events
v  Medical diagnosis
v  3D structure prediction of protein-protein interactions and
v  Science applications
Scientific Motivation
5
Simulation of the Universe
by the Nyx code
3D simulation of magnetic reconnection in
electron-positron plasma by VPIC code
Image Courtesy: Prabhat [LBNL], Oliver Rübel [LBNL], Roy Kalschmidt (LBNL)
The interior of one of the cylindrical
Daya Bay Antineutrino detector
KNN: K-Nearest Neighbors
6
v  Problem: Given a set of multi-dimensional data points, find the k closest neighbors
k = 3
v  Naïve approach: compute distance from the query point to all points and get the k-closest neighbors
v  Pros: highly parallel
v  Cons: Lot of additional distance computation, not suitable for large scale dataset
(communication heavy)
Can we take advantage of the partitioning?
KNN: kd-tree
7
kd-tree construction
v  Use kd-tree, a space-partitioning data structure for organizing points in a k-dimensional space.
function kd-tree(Points* P)
select an axis (dimension)
compute median by axis from the P
create a tree node
node->median = median
node->left = kd-tree (points in P before median)
node->right = kd-tree (points in P after median)
KNN: kd-tree
8
kd-tree query
v  Use kd-tree, a space-partitioning data structure for organizing points in a k-dimensional space.
function find-knn(Node node, Query q, Results R)
if node is a leaf node
if node has a closer point than the point in R
add it to R
else
closer_node = q is in node->left ? node->left : node->right
far_node = q is in node->left ? node->right : node->left
find-knn(closer_node, q, R)
if(farthest point in R is far from median)
find-knn(far_node, q, R)
KNN
9
KNN has two steps:
1. kd-tree construction => build the kd-tree
2. query => to compute the k-nearest neighbors
What are the challenges in parallelizing KNN!
KNN: distributed kd-tree construction
10
v  At each node:
compute median of a selected dimension
move points to the left node and right node
v  Each internal node contains
o  Selected dimension
o  Median
o  Left pointer and right pointer
v  Each leaf node
o  contains a set of points
p0 p1 p2 p3
P0 p1
P2 p3
P0 P1 P2 P3
KNN: local kd-tree construction
11
When we have 4 threads and 4 nodes?
Data parallel
Thread parallel
Global kd-tree
- compute median of a selected dimension
- move points to the left node and right node
t0 t1 t2 t3
Stop when a leaf has
points <= threshold
KNN: distributed kd-tree querying
12
v  Each processor has a set of queries
o  Queries could be local and non-local
P0 P1 P2 P3
P0 P1 P2 P3
Xo
o
o
o
o
o
o
o o
o
o o
oo
o
o
P2
P3
P0
P1
X
v  Non-local queries:
o  Ask every node (more computation and communication)
o  Transfer ownership
ü  Get the local closest k points
ü  Ask neighbor node with the kth point distance
Query
Find query
owner from
global kd-
tree and
send
Find local
KNN from
local kd-
tree
Find others
that may
contain
better
neighbors
and send
For received
queries,
find better
local
neighbors
and send
For received
responses,
update
nearest
neighbor
list
Find top k
neighbors
KNN: distributed kd-tree querying
KNN: Algorithmic Choices [kd-tree construction]
14
v  Choice of split point
v  Choice of split dimension
v  Choice of bucket size
Median, average, or
min + (max-min)/2
Sampling based median computation
KNN: Sampling based median (local)
15
y x a b
(1)
a x y b
(2)
(3)
x
a b
y
KNN: Sampling based median (distributed)
16
a
x
b
y x a b
(1) p0
p1
a x y b
(2)
(4)x
a b
y
y
b
(3)
ay
x
p0
p1
KNN: Algorithmic Choices [kd-tree construction]
17
v  Choice of split point
v  Choice of split dimension
v  Choice of bucket size
Maximum range
Imbalanced kd-tree
More communication and computation
KNN: Algorithmic Choices [kd-tree construction]
18
v  Choice of split point
v  Choice of split dimension
v  Choice of bucket size
Data parallel
Global kd-tree
Stop when a leaf has
points <= threshold
Thread parallel
Construction time vs
query time
KNN: Experiments
19
Datasets
-  Cosmology: Three cosmological N-body simulations datasets using Gadget code [12 GB, 96 GB, 0.8 TB]
-  Plasma Physics: 3D simulation of magnetic reconnection in electron position using VPIC code [2.5 TB]
-  Particle Physics: Signals from cylindrical antineutrino detectors at Daya Bay experiment [30 GB]
-  One representative small dataset from each to experiment on single node.
KNN: Experiments
20
Hardware Platform
-  Edison, a Cray XC30 supercomputing system @NERSC
-  5576 compute nodes, each with two 12-cores Intel® Xeon ® E5-2695 v2
processors at 2.4 GHz and 64 GB of 1866-DDR3 memory.
-  Cray Aries interconnect (10 GB/s) bi-directional bandwidth per node
-  Codes developed in C/C++
-  Compiled using Intel® compiler v.15.0.1 and Intel ® MPI library v.5.0.2
-  Parallelized using OpenMP (within node) and MPI (between nodes)
KNN: Results (Multinode)
21
Scalability: Strong scaling (Construction and Querying)
cosmo_large [69B particles], plasma_large [189B particles], and dayabay_large [3B particles]
4.3x using 8x cores
5.2x using 8x cores
2.7x using 4x cores
4.4x using 4x cores
6.5x using 8x cores
6.4x using 8x cores
KNN: Results (Multinode)
22
The first three cosmology datasets
2.2x (1.5x) increase when scaled to 64x more cores and data
Scalability: Weak scaling (Construction and Querying)
KNN: Results (Multinode)
23
Runtime breakdown (Construction and Querying)
(a) (b)
cosmo_large [69B particles], plasma_large [189B particles], and dayabay_large [3B particles]
KNN: Results (single node)
24
Scalability (Construction and Querying)
12x using 24 cores (16.2x)20x using 24 cores (22.4x)
KNN: Results (single node)
25
Comparison to previous implementations
2.6x on single core 48x on single core 22x on 24 cores
59x on 24 cores
KNN: Intel Xeon Phi (KNL) processor
26
Datasets used for experiments on KNL
Comparing KNL to
Titan Z [1] performance
[1] Fabian Gieseke, Cosmin Eugen Oancea, Ashish Mahabal, Christian Igel, and Tom Heskes. Bigger Buffer k-d Trees on Multi-Many-Core
Systems. http://arxiv.org/abs/1512.02831, Dec 2015
Up to 3.5X performance improvement
KNN: Intel Xeon Phi (KNL) processor
27
Multinode KNL
[1] Fabian Gieseke, Cosmin Eugen Oancea, Ashish Mahabal, Christian Igel, and Tom Heskes. Bigger Buffer k-d Trees on Multi-Many-Core
Systems. http://arxiv.org/abs/1512.02831, Dec 2015
KNL Scaling (6.5X speedup using 8X more node)
v  Each node has partial view of the entire
kd-tree (keeps global kd-tree and only its
own local kd-tree)
v  127X larger construction dataset, 25X
larger query dataset
v  Titan Z [1] based implementation does
not use distributed kd-tree, hence
incapable to deal massive dataset
KNL Scaling
Each node keeps the entire kd-tree
similar to [1]
Conclusions
28
q  This is the first distributed kd-tree based KNN code that is demonstrated to scale up to ∼50,000 cores.
q  This is the first KNN algorithm that has been run on massive datasets (100B+ points) from diverse
scientific disciplines.
q We show that our implementation can construct kd-tree of 189 billion particles in 48 seconds on utilizing
∼50,000 cores. We also demonstrate computation of KNN of 19 billion queries in 12 seconds.
q  We successfully demonstrate both strong and weak scalability of KNN implementation.
q  We showcase almost linear scalability un to 128 KNL nodes
q Our implementation is more than an order of magnitude faster than state-of-the-art KNN implementation.
Acknowledgements
•  Daya Bay Collaboration for providing access to datasets
•  Zarija Lukic for providing access to Gadget datasets
•  Vadim Roytershteyn for providing access to VPIC datasets
•  Tina Declerck and Lisa Gerhardt for facilitating large scale runs on NERSC
29
Thank you for your time
Prabhat
prabhat@lbl.gov
www.intel.com/hpcdevcon

More Related Content

What's hot

High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC Clusters
Saliya Ekanayake
 
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel
Edge AI and Vision Alliance
 

What's hot (20)

A Platform for Accelerating Machine Learning Applications
 A Platform for Accelerating Machine Learning Applications A Platform for Accelerating Machine Learning Applications
A Platform for Accelerating Machine Learning Applications
 
Serving BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServeServing BERT Models in Production with TorchServe
Serving BERT Models in Production with TorchServe
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
 
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre..."Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
"Fast Deployment of Low-power Deep Learning on CEVA Vision Processors," a Pre...
 
[SNU Computer Vision Course Project] Image Style Recognition
[SNU Computer Vision Course Project] Image Style Recognition[SNU Computer Vision Course Project] Image Style Recognition
[SNU Computer Vision Course Project] Image Style Recognition
 
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
 
Java Thread and Process Performance for Parallel Machine Learning on Multicor...
Java Thread and Process Performance for Parallel Machine Learning on Multicor...Java Thread and Process Performance for Parallel Machine Learning on Multicor...
Java Thread and Process Performance for Parallel Machine Learning on Multicor...
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
 
Recent developments in Deep Learning
Recent developments in Deep LearningRecent developments in Deep Learning
Recent developments in Deep Learning
 
Convolutional Neural Networks at scale in Spark MLlib
Convolutional Neural Networks at scale in Spark MLlibConvolutional Neural Networks at scale in Spark MLlib
Convolutional Neural Networks at scale in Spark MLlib
 
Advanced spark deep learning
Advanced spark deep learningAdvanced spark deep learning
Advanced spark deep learning
 
Beyond data and model parallelism for deep neural networks
Beyond data and model parallelism for deep neural networksBeyond data and model parallelism for deep neural networks
Beyond data and model parallelism for deep neural networks
 
NVIDIA 深度學習教育機構 (DLI): Approaches to object detection
NVIDIA 深度學習教育機構 (DLI): Approaches to object detectionNVIDIA 深度學習教育機構 (DLI): Approaches to object detection
NVIDIA 深度學習教育機構 (DLI): Approaches to object detection
 
High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC Clusters
 
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15
 
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel
 
IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」
 
Per domain power analysis
Per domain power analysisPer domain power analysis
Per domain power analysis
 
Towards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and BenchmarkingTowards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and Benchmarking
 

Similar to Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures

2012 mdsp pr08 nonparametric approach
2012 mdsp pr08 nonparametric approach2012 mdsp pr08 nonparametric approach
2012 mdsp pr08 nonparametric approach
nozomuhamada
 
"An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ..."An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ...
butest
 
Virus, Vaccines, Genes and Quantum - 2020-06-18
Virus, Vaccines, Genes and Quantum - 2020-06-18Virus, Vaccines, Genes and Quantum - 2020-06-18
Virus, Vaccines, Genes and Quantum - 2020-06-18
Aritra Sarkar
 

Similar to Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures (20)

Large Scale Data Clustering: an overview
Large Scale Data Clustering: an overviewLarge Scale Data Clustering: an overview
Large Scale Data Clustering: an overview
 
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
 
Parallel kmeans clustering in Erlang
Parallel kmeans clustering in ErlangParallel kmeans clustering in Erlang
Parallel kmeans clustering in Erlang
 
Knn 160904075605-converted
Knn 160904075605-convertedKnn 160904075605-converted
Knn 160904075605-converted
 
2012 mdsp pr08 nonparametric approach
2012 mdsp pr08 nonparametric approach2012 mdsp pr08 nonparametric approach
2012 mdsp pr08 nonparametric approach
 
convolutional_neural_networks in deep learning
convolutional_neural_networks in deep learningconvolutional_neural_networks in deep learning
convolutional_neural_networks in deep learning
 
Distributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasetsDistributed approximate spectral clustering for large scale datasets
Distributed approximate spectral clustering for large scale datasets
 
Learning Convolutional Neural Networks for Graphs
Learning Convolutional Neural Networks for GraphsLearning Convolutional Neural Networks for Graphs
Learning Convolutional Neural Networks for Graphs
 
Classification of Iris Data using Kernel Radial Basis Probabilistic Neural N...
Classification of Iris Data using Kernel Radial Basis Probabilistic  Neural N...Classification of Iris Data using Kernel Radial Basis Probabilistic  Neural N...
Classification of Iris Data using Kernel Radial Basis Probabilistic Neural N...
 
Classification of Iris Data using Kernel Radial Basis Probabilistic Neural Ne...
Classification of Iris Data using Kernel Radial Basis Probabilistic Neural Ne...Classification of Iris Data using Kernel Radial Basis Probabilistic Neural Ne...
Classification of Iris Data using Kernel Radial Basis Probabilistic Neural Ne...
 
Cluster Analysis for Dummies
Cluster Analysis for DummiesCluster Analysis for Dummies
Cluster Analysis for Dummies
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
"An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ..."An adaptive modular approach to the mining of sensor network ...
"An adaptive modular approach to the mining of sensor network ...
 
PAM.ppt
PAM.pptPAM.ppt
PAM.ppt
 
dm_clustering2.ppt
dm_clustering2.pptdm_clustering2.ppt
dm_clustering2.ppt
 
Knn solution
Knn solutionKnn solution
Knn solution
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Virus, Vaccines, Genes and Quantum - 2020-06-18
Virus, Vaccines, Genes and Quantum - 2020-06-18Virus, Vaccines, Genes and Quantum - 2020-06-18
Virus, Vaccines, Genes and Quantum - 2020-06-18
 
KNN
KNNKNN
KNN
 
PR-183: MixNet: Mixed Depthwise Convolutional Kernels
PR-183: MixNet: Mixed Depthwise Convolutional KernelsPR-183: MixNet: Mixed Depthwise Convolutional Kernels
PR-183: MixNet: Mixed Depthwise Convolutional Kernels
 

More from Intel® Software

More from Intel® Software (20)

AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology AI for All: Biology is eating the world & AI is eating Biology
AI for All: Biology is eating the world & AI is eating Biology
 
Python Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaPython Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and Anaconda
 
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciStreamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
 
AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.AI for good: Scaling AI in science, healthcare, and more.
AI for good: Scaling AI in science, healthcare, and more.
 
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
 
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
Advanced Techniques to Accelerate Model Tuning | Software for AI Optimization...
 
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
Reducing Deep Learning Integration Costs and Maximizing Compute Efficiency| S...
 
AWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI ResearchAWS & Intel Webinar Series - Accelerating AI Research
AWS & Intel Webinar Series - Accelerating AI Research
 
Intel Developer Program
Intel Developer ProgramIntel Developer Program
Intel Developer Program
 
Intel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview SlidesIntel AIDC Houston Summit - Overview Slides
Intel AIDC Houston Summit - Overview Slides
 
AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019
 
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
AIDC NY: Applications of Intel AI by QuEST Global - 09.19.2019
 
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
Advanced Single Instruction Multiple Data (SIMD) Programming with Intel® Impl...
 
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses...
 
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
 
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
RenderMan*: The Role of Open Shading Language (OSL) with Intel® Advanced Vect...
 
AIDC India - AI on IA
AIDC India  - AI on IAAIDC India  - AI on IA
AIDC India - AI on IA
 
AIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino SlidesAIDC India - Intel Movidius / Open Vino Slides
AIDC India - Intel Movidius / Open Vino Slides
 
AIDC India - AI Vision Slides
AIDC India - AI Vision SlidesAIDC India - AI Vision Slides
AIDC India - AI Vision Slides
 
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
Enhance and Accelerate Your AI and Machine Learning Solution | SIGGRAPH 2019 ...
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures

  • 1.
  • 2. Mostofa Patwary1, Nadathur Satish1, Narayanan Sundaram1, Jilalin Liu2, Peter Sadowski2, Evan Racah2, Suren Byna2, Craig Tull2, Wahid Bhimji2 Prabhat2, Pradeep Dubey1 1Intel Corporation, 2Lawrence Berkeley National Lab Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
  • 3. KNN: K-Nearest Neighbors 3 v  Problem: Given a set of multi-dimensional data points, find the k closest neighbors k = 3 v  Classification => take the majority vote from the neighbors v  Regression => take the average value of the neighbors
  • 4. KNN: Well known applications 4 v  Object classification in images v  Text classification and information retrieval v  Prediction of economic events v  Medical diagnosis v  3D structure prediction of protein-protein interactions and v  Science applications
  • 5. Scientific Motivation 5 Simulation of the Universe by the Nyx code 3D simulation of magnetic reconnection in electron-positron plasma by VPIC code Image Courtesy: Prabhat [LBNL], Oliver Rübel [LBNL], Roy Kalschmidt (LBNL) The interior of one of the cylindrical Daya Bay Antineutrino detector
  • 6. KNN: K-Nearest Neighbors 6 v  Problem: Given a set of multi-dimensional data points, find the k closest neighbors k = 3 v  Naïve approach: compute distance from the query point to all points and get the k-closest neighbors v  Pros: highly parallel v  Cons: Lot of additional distance computation, not suitable for large scale dataset (communication heavy) Can we take advantage of the partitioning?
  • 7. KNN: kd-tree 7 kd-tree construction v  Use kd-tree, a space-partitioning data structure for organizing points in a k-dimensional space. function kd-tree(Points* P) select an axis (dimension) compute median by axis from the P create a tree node node->median = median node->left = kd-tree (points in P before median) node->right = kd-tree (points in P after median)
  • 8. KNN: kd-tree 8 kd-tree query v  Use kd-tree, a space-partitioning data structure for organizing points in a k-dimensional space. function find-knn(Node node, Query q, Results R) if node is a leaf node if node has a closer point than the point in R add it to R else closer_node = q is in node->left ? node->left : node->right far_node = q is in node->left ? node->right : node->left find-knn(closer_node, q, R) if(farthest point in R is far from median) find-knn(far_node, q, R)
  • 9. KNN 9 KNN has two steps: 1. kd-tree construction => build the kd-tree 2. query => to compute the k-nearest neighbors What are the challenges in parallelizing KNN!
  • 10. KNN: distributed kd-tree construction 10 v  At each node: compute median of a selected dimension move points to the left node and right node v  Each internal node contains o  Selected dimension o  Median o  Left pointer and right pointer v  Each leaf node o  contains a set of points p0 p1 p2 p3 P0 p1 P2 p3 P0 P1 P2 P3
  • 11. KNN: local kd-tree construction 11 When we have 4 threads and 4 nodes? Data parallel Thread parallel Global kd-tree - compute median of a selected dimension - move points to the left node and right node t0 t1 t2 t3 Stop when a leaf has points <= threshold
  • 12. KNN: distributed kd-tree querying 12 v  Each processor has a set of queries o  Queries could be local and non-local P0 P1 P2 P3 P0 P1 P2 P3 Xo o o o o o o o o o o o oo o o P2 P3 P0 P1 X v  Non-local queries: o  Ask every node (more computation and communication) o  Transfer ownership ü  Get the local closest k points ü  Ask neighbor node with the kth point distance
  • 13. Query Find query owner from global kd- tree and send Find local KNN from local kd- tree Find others that may contain better neighbors and send For received queries, find better local neighbors and send For received responses, update nearest neighbor list Find top k neighbors KNN: distributed kd-tree querying
  • 14. KNN: Algorithmic Choices [kd-tree construction] 14 v  Choice of split point v  Choice of split dimension v  Choice of bucket size Median, average, or min + (max-min)/2 Sampling based median computation
  • 15. KNN: Sampling based median (local) 15 y x a b (1) a x y b (2) (3) x a b y
  • 16. KNN: Sampling based median (distributed) 16 a x b y x a b (1) p0 p1 a x y b (2) (4)x a b y y b (3) ay x p0 p1
  • 17. KNN: Algorithmic Choices [kd-tree construction] 17 v  Choice of split point v  Choice of split dimension v  Choice of bucket size Maximum range Imbalanced kd-tree More communication and computation
  • 18. KNN: Algorithmic Choices [kd-tree construction] 18 v  Choice of split point v  Choice of split dimension v  Choice of bucket size Data parallel Global kd-tree Stop when a leaf has points <= threshold Thread parallel Construction time vs query time
  • 19. KNN: Experiments 19 Datasets -  Cosmology: Three cosmological N-body simulations datasets using Gadget code [12 GB, 96 GB, 0.8 TB] -  Plasma Physics: 3D simulation of magnetic reconnection in electron position using VPIC code [2.5 TB] -  Particle Physics: Signals from cylindrical antineutrino detectors at Daya Bay experiment [30 GB] -  One representative small dataset from each to experiment on single node.
  • 20. KNN: Experiments 20 Hardware Platform -  Edison, a Cray XC30 supercomputing system @NERSC -  5576 compute nodes, each with two 12-cores Intel® Xeon ® E5-2695 v2 processors at 2.4 GHz and 64 GB of 1866-DDR3 memory. -  Cray Aries interconnect (10 GB/s) bi-directional bandwidth per node -  Codes developed in C/C++ -  Compiled using Intel® compiler v.15.0.1 and Intel ® MPI library v.5.0.2 -  Parallelized using OpenMP (within node) and MPI (between nodes)
  • 21. KNN: Results (Multinode) 21 Scalability: Strong scaling (Construction and Querying) cosmo_large [69B particles], plasma_large [189B particles], and dayabay_large [3B particles] 4.3x using 8x cores 5.2x using 8x cores 2.7x using 4x cores 4.4x using 4x cores 6.5x using 8x cores 6.4x using 8x cores
  • 22. KNN: Results (Multinode) 22 The first three cosmology datasets 2.2x (1.5x) increase when scaled to 64x more cores and data Scalability: Weak scaling (Construction and Querying)
  • 23. KNN: Results (Multinode) 23 Runtime breakdown (Construction and Querying) (a) (b) cosmo_large [69B particles], plasma_large [189B particles], and dayabay_large [3B particles]
  • 24. KNN: Results (single node) 24 Scalability (Construction and Querying) 12x using 24 cores (16.2x)20x using 24 cores (22.4x)
  • 25. KNN: Results (single node) 25 Comparison to previous implementations 2.6x on single core 48x on single core 22x on 24 cores 59x on 24 cores
  • 26. KNN: Intel Xeon Phi (KNL) processor 26 Datasets used for experiments on KNL Comparing KNL to Titan Z [1] performance [1] Fabian Gieseke, Cosmin Eugen Oancea, Ashish Mahabal, Christian Igel, and Tom Heskes. Bigger Buffer k-d Trees on Multi-Many-Core Systems. http://arxiv.org/abs/1512.02831, Dec 2015 Up to 3.5X performance improvement
  • 27. KNN: Intel Xeon Phi (KNL) processor 27 Multinode KNL [1] Fabian Gieseke, Cosmin Eugen Oancea, Ashish Mahabal, Christian Igel, and Tom Heskes. Bigger Buffer k-d Trees on Multi-Many-Core Systems. http://arxiv.org/abs/1512.02831, Dec 2015 KNL Scaling (6.5X speedup using 8X more node) v  Each node has partial view of the entire kd-tree (keeps global kd-tree and only its own local kd-tree) v  127X larger construction dataset, 25X larger query dataset v  Titan Z [1] based implementation does not use distributed kd-tree, hence incapable to deal massive dataset KNL Scaling Each node keeps the entire kd-tree similar to [1]
  • 28. Conclusions 28 q  This is the first distributed kd-tree based KNN code that is demonstrated to scale up to ∼50,000 cores. q  This is the first KNN algorithm that has been run on massive datasets (100B+ points) from diverse scientific disciplines. q We show that our implementation can construct kd-tree of 189 billion particles in 48 seconds on utilizing ∼50,000 cores. We also demonstrate computation of KNN of 19 billion queries in 12 seconds. q  We successfully demonstrate both strong and weak scalability of KNN implementation. q  We showcase almost linear scalability un to 128 KNL nodes q Our implementation is more than an order of magnitude faster than state-of-the-art KNN implementation.
  • 29. Acknowledgements •  Daya Bay Collaboration for providing access to datasets •  Zarija Lukic for providing access to Gadget datasets •  Vadim Roytershteyn for providing access to VPIC datasets •  Tina Declerck and Lisa Gerhardt for facilitating large scale runs on NERSC 29
  • 30. Thank you for your time Prabhat prabhat@lbl.gov www.intel.com/hpcdevcon