
Parikshit Ram – Senior Machine Learning Scientist, Skytree at MLconf ATL

Max-kernel search: How to search for just about anything?

Nearest neighbor search is a well-studied and widely used task in computer science and is quite pervasive in everyday applications. While search is not synonymous with learning, it is a crucial tool for the most nonparametric forms of learning. Nearest neighbor search can be used directly for all kinds of learning tasks: classification, regression, density estimation, and outlier detection. Search is also the computational bottleneck in various other learning tasks such as clustering and dimensionality reduction. Key to nearest neighbor search is the notion of "near"-ness, or similarity. Mercer kernels form a class of general nonlinear similarity functions and are widely used in machine learning. They can define a notion of similarity between pairs of objects of any arbitrary type, and have been successfully applied to a wide variety of object types: fixed-length data, images, text, time series, and graphs. I will present a technique to do nearest neighbor search with this class of similarity functions provably efficiently, hence facilitating faster learning on larger data.

  1. Max-kernel search: How to search for just about anything? Parikshit Ram
  2. Similarity search ● Set of objects R ● Query q ● Similarity function
  3. Finding similar images
  4. Drug discovery (image: http://fineartamerica.com)
  5. Movie recommendations
  6. Similarity search is ubiquitous ● Machine learning ● Computer vision ● Theory ● Databases ● Information retrieval ● Web applications ● Collaborative filtering ● Scientific computing
  7. Search-based classification
  8. Search-based classification (figure: the query appears as "?")
  9. Search-based classification: k-nearest-neighbor classification/regression (a minimal sketch follows slide 11 below)
  10. Search-based classification: "RomCom fan"
  11. Search-based classification: "Kids movie fanatic"
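
Slides 7-11 illustrate the k-nearest-neighbor rule pictorially. As a minimal sketch of that rule with a kernel as the similarity (all names here are illustrative, not from the talk):

    from collections import Counter

    def knn_classify(q, data, labels, kernel, k=3):
        """Classify q by a majority vote among the k reference objects
        most similar to it under the kernel. Nothing is trained; the
        data itself does the work."""
        ranked = sorted(range(len(data)),
                        key=lambda i: kernel(q, data[i]), reverse=True)
        votes = Counter(labels[i] for i in ranked[:k])
        return votes.most_common(1)[0][0]

With kernel(q, x) = numpy.dot(q, x) this is plain k-NN on inner products; swapping in any Mercer kernel from slide 25 changes the notion of "near" without changing the rule.
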
  12. Search-based outlier detection (sketch below)
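
A common search-based outlier score, sketched under the assumption that the slide uses the usual k-NN flavor of the idea: an object is flagged when even its k-th most similar neighbor is not very similar.

    def knn_outlier_scores(data, kernel, k=3):
        """Score each object by the (negated) similarity of its k-th
        most similar neighbor; higher scores mean more outlying.
        Brute force: O(n^2) kernel evaluations."""
        scores = []
        for i, x in enumerate(data):
            sims = sorted((kernel(x, y) for j, y in enumerate(data) if j != i),
                          reverse=True)
            scores.append(-sims[k - 1])
        return scores
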
  13. (figure only)
  14. Search-based ML. Advantages: ● nonparametric, lets the data speak ● no need to train complex models. Key ingredient: ● a notion of similarity (domain/data-specific). Main challenge is efficiency: ● sheer size of the data ● varied data types
  15. Properties of similarity functions ● symmetry (formula shown as an image: similarity OR dissimilarity form)
  16. (figure) The dissimilarity is the size of the set-theoretic difference
  17. Properties of similarity functions ● symmetry ● self-similarity
  18. (figure) We do not really care about this.
  19. Properties of similarity functions ● symmetry ● self-similarity (standard forms reconstructed below)
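
The formulas on slides 15-19 are images and did not survive extraction. The standard forms of the two properties, written for a similarity s and, after each OR, the equivalent dissimilarity d version (a reconstruction, not the slides' exact notation):

    \text{symmetry:} \quad s(x, y) = s(y, x) \quad \text{OR} \quad d(x, y) = d(y, x)

    \text{self-similarity:} \quad s(x, x) \ge s(x, y) \quad \text{OR} \quad d(x, x) = 0 \le d(x, y)
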
  20.–23. (figure, built up over four slides: the landscape of similarity functions, captioned "Metrics used everywhere")
  24. Bregman divergences: widely used for distributions. Mercer kernels: widely used in ML for a variety of objects and problems. ???: not quite explored in search or ML. Metrics: used everywhere.
  25. Breadth of kernel functions:
      Objects        Kernel functions
      Images         linear, polynomial, Gaussian, Pyramid match
      Documents      cosine
      Sequences      p-spectrum kernel, alignment score
      Trees          subtree, syntactic, partial tree
      Graphs         random walk
      Time series    cross-correlation, dynamic time-warping
      Natural lang.  convolution, decomposition, lexical semantic
  26. What is a Kernel Function? In words: a pairwise symmetric function ● correlation in a richer but hidden feature space ● cannot access the hidden space (figure: object space → hidden space, via a hidden mapping)
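
To make the "hidden feature space" concrete, a small sketch (illustrative, not from the talk): the degree-2 polynomial kernel on 2-D inputs equals an ordinary inner product after an explicit mapping phi. For kernels like the Gaussian, the corresponding phi is infinite-dimensional, which is why the hidden space cannot be accessed directly.

    import numpy as np

    def poly2_kernel(x, y):
        """Degree-2 polynomial kernel: K(x, y) = (x . y)^2."""
        return np.dot(x, y) ** 2

    def phi(x):
        """Explicit hidden mapping for 2-D inputs, so that
        K(x, y) = <phi(x), phi(y)>."""
        x1, x2 = x
        return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

    x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
    assert np.isclose(poly2_kernel(x, y), np.dot(phi(x), phi(y)))
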
  27. Max-kernel search: find the object in R most similar to q with respect to a kernel
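
The problem statement as a one-line baseline (the function name is illustrative): one kernel evaluation per object in R, so O(|R|) work per query; the rest of the talk is about beating this.

    def max_kernel_search(q, R, kernel):
        """Brute-force max-kernel search: the object p in R
        maximizing K(q, p)."""
        return max(R, key=lambda p: kernel(q, p))
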
  28. Existing methods ● Brute-force (parallel/distributed) ○ domain-specific optimizations ● Coerce data to use metrics ○ only approximate. No standard search tools!
  29. Understanding kernels: if two objects are very similar to each other, then they are almost equally similar to the query q
  30. Understanding kernels: IF ... THEN ... (formulas shown as images; reconstruction below)
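
The IF/THEN formulas on these slides are images; a plausible reconstruction of the fact being used (the Cauchy-Schwarz inequality applied in the kernel-induced feature space with hidden mapping φ):

    |K(q, p) - K(q, p')| = |\langle \varphi(q), \varphi(p) - \varphi(p') \rangle|
                         \le \sqrt{K(q, q)} \sqrt{K(p, p) - 2 K(p, p') + K(p', p')}

So if p and p' are close to each other in the hidden space (the right-hand side is small), their similarities to any query q are nearly equal, which is what makes pruning possible.
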
  31. Indexing our collection
  32. Indexing our collection
  33. Indexing our collection: a multi-resolution index, the Cover Tree (BKL 2006), built in O(n log n) time
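
Slides 31-33 build the index pictorially. As a stand-in for the cover tree, which is more involved, here is a toy kernel-space ball tree exposing the two ingredients the search needs: a representative point per node and a radius bounding the hidden-space distance to every descendant. All names are illustrative; this is not the BKL 2006 construction.

    import math

    class Node:
        """A node of a toy kernel-space ball tree: a representative
        point, child nodes, and a radius covering all descendants."""
        def __init__(self, point, children, radius):
            self.point, self.children, self.radius = point, children, radius

    def induced_distance(p1, p2, kernel):
        """Hidden-space distance from kernel evaluations alone:
        ||phi(p1) - phi(p2)||^2 = K(p1,p1) - 2 K(p1,p2) + K(p2,p2)."""
        d2 = kernel(p1, p1) - 2 * kernel(p1, p2) + kernel(p2, p2)
        return math.sqrt(max(d2, 0.0))

    def build(points, kernel, leaf_size=4):
        """Index `points` recursively; the first point becomes the
        node's representative."""
        pivot, rest = points[0], points[1:]
        radius = max((induced_distance(pivot, p, kernel) for p in rest),
                     default=0.0)
        if len(rest) <= leaf_size:
            children = [Node(p, [], 0.0) for p in rest]
        else:
            mid = len(rest) // 2  # arbitrary split; a cover tree splits by scale
            children = [build(rest[:mid], kernel, leaf_size),
                        build(rest[mid:], kernel, leaf_size)]
        return Node(pivot, children, radius)
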
  34. How to Search with this Index? (figure: query q, node p)
  35. How to Search with this Index? (figure: q, p, and descendants p', p'')
  36. How to Search with this Index?
  37. How to Search with this Index?
  38. How to Search with this Index? Safely ignore a large chunk (potentially millions)
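
Using the toy index above, a schematic version of the pruning on slides 34-38: by the Cauchy-Schwarz bound from slide 30, no descendant of a node can score above K(q, p) + sqrt(K(q, q)) * radius, so once a better candidate is in hand the whole subtree (potentially millions of objects) is safely ignored. This is a sketch of the idea, not mlpack's implementation.

    import math

    def tree_search(q, node, kernel, best=(-math.inf, None)):
        """Branch-and-bound max-kernel search over the toy index."""
        best_val, best_arg = best
        k_qp = kernel(q, node.point)
        if k_qp > best_val:
            best_val, best_arg = k_qp, node.point
        # Upper bound on K(q, p') for every descendant p' of this node:
        cap = k_qp + math.sqrt(kernel(q, q)) * node.radius
        if cap > best_val:
            # most promising children first, to tighten best_val early
            for child in sorted(node.children,
                                key=lambda c: -kernel(q, c.point)):
                best_val, best_arg = tree_search(q, child, kernel,
                                                 (best_val, best_arg))
        return best_val, best_arg
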
  39. Results: Efficiency (figure: improvement)
  40. Results: Efficiency ● widely applicable algorithm ● performance is data/kernel-dependent (figure: improvements ranging from 10x to 10000x)
  41. Results: Sublinear Query Time (figure: improvement vs. object set size) Bigger data implies bigger efficiency gains
  42. Can We Prove it? What Makes Search Hard? Thm. For a set R of n objects, the query time is bounded (the exact bound appears as a formula on the slide) in terms of: ● the expansion constant ○ captures the distribution of the data ● the directional concentration constant ○ captures the distribution of a kernel-induced transformation of the data
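
For reference, the expansion constant has a standard definition in the cover-tree literature (not spelled out on the slide): it is the smallest c such that doubling the radius of any ball around a data point grows the number of points it contains by at most a factor of c,

    |B_S(p, 2r)| \le c \, |B_S(p, r)| \quad \text{for all } p \in S,\ r > 0,

where B_S(p, r) denotes the points of S within distance r of p. The directional concentration constant plays the analogous role for the kernel-induced transformation of the data; small values of both make the search provably fast.
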
  43. Endnote ● Search is an essential tool for ML ● Exploring different types of similarity functions increases the applicability and quality of search ● Kernels are widely applicable similarity functions ○ now we have provably fast max-kernel search. Code/tutorial for Fast Exact Max-Kernel Search in mlpack version 1.0.5: http://www.mlpack.org (with Ryan R. Curtin). Email: pari@skytree.net
