Locality sensitive hashing (LSH) is a technique to improve the efficiency of near neighbor searches in high-dimensional spaces. LSH works by using hash functions that map similar items to the same buckets with high probability. The document discusses applications of near neighbor searching, defines the near neighbor reporting problem, and introduces LSH. It also covers techniques like gap amplification to improve LSH performance and parameter optimization to minimize query time.
Recurrent Neural Networks have been shown to be very powerful models, as they can propagate context over several time steps. This makes them effective for several problems in Natural Language Processing, such as language modelling, tagging problems, and speech recognition. In this presentation we introduce the basic RNN model and discuss the vanishing gradient problem. We describe LSTM (Long Short-Term Memory) and Gated Recurrent Units (GRU). We also discuss Bidirectional RNNs with an example. RNN architectures can be considered deep learning systems, where the number of time steps corresponds to the depth of the network. It is also possible to build an RNN with multiple hidden layers, each having recurrent connections from the previous time steps, representing abstraction both in time and space.
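The basic RNN model described above can be sketched in a few lines: a single recurrent step mixes the current input with the previous hidden state, which is how context propagates over time. This is a minimal illustration, not the presentation's code; the dimensions (3 input features, 4 hidden units) and tanh activation are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(4, 3))  # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(4, 4))  # hidden-to-hidden (recurrent) weights
b_h = np.zeros(4)

def rnn_step(x_t, h_prev):
    """One time step: the new hidden state combines the current input
    with the previous hidden state, propagating context forward."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Unroll over a short sequence; the final state summarizes the whole input.
h = np.zeros(4)
for x_t in [np.ones(3), np.zeros(3), -np.ones(3)]:
    h = rnn_step(x_t, h)
```

Because the same weights are reused at every step, gradients flow back through many applications of `W_hh`, which is exactly where the vanishing gradient problem arises.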
Deep Learning Tutorial | Deep Learning TensorFlow | Deep Learning With Neural... – Simplilearn
This Deep Learning presentation will help you understand what Deep Learning is, why we need it, what a neural network is, applications of Deep Learning, what a perceptron is, implementing logic gates using perceptrons, and the types of neural networks. At the end of the video, you will be introduced to TensorFlow, along with a use case implementation on recognizing hand-written digits. Deep Learning is inspired by the workings of the human brain, specifically artificial neural networks. These networks, which model the decision-making process of the brain, use complex algorithms that process data in a non-linear way, learning in an unsupervised manner to make choices based on the input. Deep Learning applies advanced computing power and special types of neural networks to large amounts of data to learn, understand, and identify complicated patterns. We will also cover neural networks and how they work in this Deep Learning tutorial video. This tutorial is ideal for professionals with a beginner to intermediate level of experience. Now, let us dive deep into this topic and understand what Deep Learning actually is.
Below topics are explained in this Deep Learning presentation:
1. What is Deep Learning?
2. Why do we need Deep Learning?
3. What is Neural network?
4. What is Perceptron?
5. Implementing logic gates using Perceptron
6. Types of Neural networks
7. Applications of Deep Learning
8. Working of Neural network
9. Introduction to TensorFlow
10. Use case implementation using TensorFlow
Simplilearn’s Deep Learning course will transform you into an expert in deep learning techniques using TensorFlow, the open-source software library designed for machine learning and deep neural network research. With our deep learning course, you’ll master deep learning and TensorFlow concepts, learn to implement algorithms, build artificial neural networks, and traverse layers of data abstraction to understand the power of data, preparing you for your new role as a deep learning scientist.
Why Deep Learning?
TensorFlow is one of the most popular software platforms used for deep learning and contains powerful tools to help you build and implement artificial neural networks.
Advancements in deep learning are being seen in smartphone applications, creating efficiencies in the power grid, driving advancements in healthcare, improving agricultural yields, and helping us find solutions to climate change.
There is booming demand for skilled deep learning engineers across a wide range of industries, making this deep learning course with TensorFlow training well-suited for professionals at the intermediate to advanced level of experience. We recommend this deep learning online course particularly for the following professionals:
1. Software engineers
2. Data scientists
3. Data analysts
4. Statisticians with an interest in deep learning
https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Monthly AI Tech Talks in Toronto 2019-08-28
https://www.meetup.com/aittg-toronto
The talk will cover the end-to-end details, including contextual and linguistic feature extraction, vectorization, n-grams, topic modeling, and named entity resolution, which are based on concepts from mathematics, information retrieval and natural language processing. We will also dive into more advanced feature engineering strategies such as word2vec, GloVe and fastText that leverage deep learning models.
In addition, attendees will learn how to combine NLP features with numeric and categorical features and analyze the feature importance from the resulting models.
The following libraries will be used to demonstrate the aforementioned feature engineering techniques: spaCy, Gensim, fastText and Keras in Python.
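One of the simplest feature engineering steps mentioned above, word n-gram extraction, can be sketched in plain Python. The whitespace tokenizer here is a simplifying assumption; in practice a library such as spaCy would handle tokenization.

```python
def ngrams(text, n):
    """Return the list of word n-grams in a text, lowercased.
    Tokenization is a plain whitespace split (an assumption)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("feature engineering for natural language processing", 2)
```

Each bigram is a candidate feature; in a real pipeline these would be counted or weighted (e.g. TF-IDF) before being fed to a model.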
https://www.meetup.com/aittg-toronto/events/261940480/
Approximate nearest neighbor methods and vector models – NYC ML meetup – Erik Bernhardsson
Nearest neighbors refers to something that is conceptually very simple. For a set of points in some space (possibly many dimensions), we want to find the closest k neighbors quickly.
This presentation covers a library called Annoy, built by me, that helps you do (approximate) nearest neighbor queries in high-dimensional spaces. We go through vector models, how to measure similarity, and why nearest neighbor queries are useful.
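For contrast with approximate methods like Annoy, the exact problem can be sketched as a brute-force scan: correct, but linear in the number of points, which is what approximate indexes exist to avoid. The points and query below are illustrative.

```python
import math

def k_nearest(points, query, k):
    """Exact k-nearest neighbors by scanning all points (Euclidean distance)."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return sorted(points, key=lambda p: dist(p, query))[:k]

pts = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
nearest = k_nearest(pts, (0.0, 0.1), 2)
```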
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
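The attention mechanism at the core of the Transformer can be sketched as scaled dot-product attention, softmax(QK^T / sqrt(d_k))·V; this is a minimal NumPy illustration of that formula, with toy 2-dimensional queries, keys and values.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ V                                   # weighted sum of values

Q = np.array([[1.0, 0.0]])                 # one query, aligned with the first key
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0]])
out = attention(Q, K, V)
```

The query attends more strongly to the key it aligns with, so the output is pulled toward that key's value row.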
In real-world scenarios, decision making can be very challenging even for modern computers. Generalized reinforcement learning (GRL) was developed to facilitate complex decision making in highly dynamic systems through flexible policy generalization mechanisms using kernel-based methods. GRL combines sampling, kernel functions, stochastic processes, non-parametric regression and functional clustering.
This presentation describes in simple terms how the PageRank algorithm by the Google founders works. It displays the actual algorithm and explains how the calculations are done and how ranks are assigned to any webpage.
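The rank calculation described above can be sketched as power iteration: each page repeatedly distributes its rank to the pages it links to, damped by a factor (0.85 is the conventional value). The three-page link graph below is hypothetical.

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank over a dict {page: [outgoing links]}.
    Assumes every page has at least one outgoing link."""
    n = len(links)
    ranks = {p: 1.0 / n for p in links}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in links}      # teleportation term
        for page, outs in links.items():
            share = ranks[page] / len(outs)        # split rank among out-links
            for target in outs:
                new[target] += d * share
        ranks = new
    return ranks

# A links to B and C; B and C each link back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["A"], "C": ["A"]})
```

A receives rank from two pages while B and C each receive only half of A's rank, so A ends up ranked highest.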
Machine learning (ML) and natural language processing (NLP) – Nikola Milosevic
A short introduction to natural language processing (NLP) and machine learning (ML). It covers the sub-areas of artificial intelligence and then focuses mainly on machine learning and natural language processing. It also explains the process of data mining from a high-level perspective.
DBMS Architecture
[Figure: DBMS architecture. Queries and transactions arrive from user/web forms, applications, and the DBA. Queries pass through the Query Parser, Query Rewriter, Query Optimizer, and Query Executor; transactions are handled by the Transaction Manager, Lock Manager (with lock tables), and Logging & Recovery. Below these, Files & Access Methods, the Buffer Manager (with buffers in main memory), and the Storage Manager mediate access to storage. Past lectures covered the other components; today's lecture focuses on query processing.]
Many query plans to execute a SQL query
• Compute the join of R(A,B) ⋈ S(B,C) ⋈ T(C,D) ⋈ U(D,E)
[Figure: several alternative join trees over R, S, T, and U, with algorithm choices such as sort-merge join, hash join, table-scan, and index-scan annotated on the nodes]
• Even more plans: multiple algorithms to execute each operation
Explain command in PostgreSQL
SELECT C.STATE, SUM(O.NETAMOUNT), SUM(O.TOTALAMOUNT)
FROM CUSTOMERS C
JOIN CUST_HIST CH ON C.CUSTOMERID = CH.CUSTOMERID
JOIN ORDERS O ON CH.ORDERID = O.ORDERID
GROUP BY C.STATE
Communication between operators: the iterator model
• Each physical operator implements three functions:
– Open: initializes the data structures.
– GetNext: returns the next tuple in the result.
– Close: ends the operation and frees the resources.
• It enables pipelining
• Other option: compute the result of the operator in full
and store it in disk or memory:
– inefficient.
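The Open/GetNext/Close interface above can be sketched with two toy operators, a scan and a selection, pulling tuples one at a time so the selection never needs the full scan result in memory. The class and predicate names are illustrative, not from the lecture.

```python
class Scan:
    """Leaf operator: iterates over an in-memory list of tuples."""
    def __init__(self, rows): self.rows = rows
    def open(self): self.i = 0                       # initialize the cursor
    def get_next(self):                              # next tuple, or None at end
        if self.i >= len(self.rows): return None
        row = self.rows[self.i]; self.i += 1
        return row
    def close(self): self.rows = None                # free resources

class Select:
    """Pipelined selection: pulls from its child, passes matches upward."""
    def __init__(self, child, pred): self.child, self.pred = child, pred
    def open(self): self.child.open()
    def get_next(self):
        while (row := self.child.get_next()) is not None:
            if self.pred(row): return row
        return None
    def close(self): self.child.close()

plan = Select(Scan([1, 2, 3, 4]), lambda r: r % 2 == 0)
plan.open()
result = []
while (r := plan.get_next()) is not None:
    result.append(r)
plan.close()
```

Because each `get_next` call produces at most one tuple, operators can be stacked arbitrarily deep without materializing intermediate results.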
Query optimization: picking the fastest plan
• Optimal approach
– enumerate each possible plan
– measure its performance by running it
– pick the fastest one
– What’s wrong?
• Rule-based optimization
– Use a set of pre-defined rules to generate a fast plan
• e.g., If there is an index over a table, use it for scan and join.
Review Definitions
• Statistics on table R:
– T(R): Number of tuples in R
– B(R): Number of blocks in R
• B(R) ≈ T(R) / (tuples per block)
– V(R,A): Number of distinct values of attribute A in R
Plans to select tuples from R: σA=a(R)
• We have an unclustered index on R
• Plans:
– (Unclustered) indexed-based scan
– Table-scan (sequential access)
• Statistics on R
– B(R)=5000, T(R)=200,000
– V(R,A) = 2, one value appears in 95% of tuples.
• Unclustered indexed scan vs. table-scan ?
Unclustered indexed scan vs. table-scan ?
table-scan is the winner!
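The conclusion follows from a back-of-the-envelope I/O count using the slide's statistics; counting one random I/O per matching tuple for an unclustered index is a common simplifying assumption.

```python
# Statistics from the slide: 5,000 blocks, 200,000 tuples,
# and the selected value of A appears in 95% of the tuples.
B_R, T_R = 5000, 200_000
selectivity = 0.95

table_scan_cost = B_R                      # read every block once
index_scan_cost = int(selectivity * T_R)   # ~one random I/O per matching tuple
```

The table-scan costs 5,000 I/Os versus roughly 190,000 for the unclustered index scan, so for a value this frequent the index is far worse, which is exactly why static rules like "always use the index" fail.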
Query optimization methods
• Rule-based optimizer fails
– It uses static rules
– The rules do not consider the distribution of the data.
• Cost-based optimization
– predict the cost of each plan
– search the plan space to find the fastest one
– do it efficiently
• Optimization itself should be fast!
Cost-based optimization
• Plan space
– which plans to consider?
– it is time consuming to explore all alternatives.
• Cost estimator
– how to estimate the cost of each plan without executing it?
– we would like to have accurate estimation
• Search algorithm
– how to search the plan space fast?
– we would like to avoid checking inefficient plans
Space of query plans: basic operators
• Selection
– algorithms: sequential, index-based
– Ordering
• Projection
– Ordering
• Join
– algorithms: nested loop, sort-merge, hash
– Ordering
High-dimensional polytopes defined by oracles: algorithms, computations and a... – Vissarion Fisikopoulos
The processing and analysis of high-dimensional geometric data plays a fundamental role in disciplines of science and engineering. A systematic framework to study these problems has been developing in the research area of discrete and computational geometry. This PhD thesis studies problems in this area. The fundamental geometric objects of our study are high-dimensional convex polytopes defined by an oracle. The contribution of the thesis is threefold. First, the design and analysis of geometric algorithms for problems concerning high-dimensional convex polytopes, such as convex hull and volume computation and their applications to computational algebraic geometry and optimization. Second, the establishment of combinatorial characterization results for essential polytope families. Third, the implementation and experimental analysis of the proposed algorithms and methods.
Learning to Project and Binarise for Hashing-based Approximate Nearest Neighb... – Sean Moran
In this paper we focus on improving the effectiveness of hashing-based approximate nearest neighbour search. Generating similarity preserving hashcodes for images has been shown to be an effective and efficient method for searching through large datasets. Hashcode generation generally involves two steps: bucketing the input feature space with a set of hyperplanes, followed by quantising the projection of the data-points onto the normal vectors to those hyperplanes. This procedure results in the makeup of the hashcodes depending on the positions of the data-points with respect to the hyperplanes in the feature space, allowing a degree of locality to be encoded into the hashcodes. In this paper we study the effect of learning both the hyperplanes and the thresholds as part of the same model. Most previous research either learn the hyperplanes assuming a fixed set of thresholds, or vice-versa. In our experiments over two standard image datasets we find statistically significant increases in retrieval effectiveness versus a host of state-of-the-art data-dependent and independent hashing models.
Hadoop classes in Mumbai
Best Android classes in Mumbai with job assistance.
Our features are:
Expert guidance by IT industry professionals
Lowest fees of 5000
Practical exposure to handle projects
Well-equipped lab
After-course resume writing guidance
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality – Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Neuro-symbolic is not enough, we need neuro-*semantic* – Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Connector Corner: Automate dynamic content and events by pushing a button – DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Generating a custom Ruby SDK for your web service or Rails API using Smithy – g2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
DevOps and Testing slides at DASA Connect – Kari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, what agile testing is, and finally what testing in DevOps looks like. We ended with a lovely workshop where participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Securing your Kubernetes cluster: a step-by-step guide to success! – KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... – Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
UiPath Test Automation using UiPath Test Suite series, part 3 – DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf – 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Key Trends Shaping the Future of Infrastructure.pdf – Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
2. Motivation
• Real-world applications
– Recommendation systems
• Searching for similar items and users
– Malicious website detection
• Searching for websites similar to some known malicious websites
• The underlying core problem
Given:
• A large set P of high-dimensional data points in a metric space M
• A large set Q of high-dimensional query points in a metric space M
Goal:
• Find near neighbors in P for each query point in Q
• Avoid linearly scanning P for each query
3. Related Work
• Nearest Neighbor Searching
– Given: a set P of n points in a metric space M
– Goal: for any query q return a point p ∈ P minimizing dist(p,q)
• Classic Result
– Point location in arrangements of hyperplanes
• Meiser, IC’93
• In a d-dimensional Euclidean space under some Lp norm
• d^O(1) · log n query time and n^O(d) space
4. Related Work
• Approximate Nearest Neighbor
– Given: a set P of n points in a metric space M and ε > 0
– Goal: for any query q return a point p ∈ P s.t. dist(p,q) ≤ (1+ε) · dist(p*,q),
where p* is the nearest neighbor to q
• Classic Result
– Approximate Nearest Neighbors: Towards Removing the Curse of
Dimensionality
• Har-Peled, Indyk and Motwani, STOC’98, FOCS’01, ToC’12
• In d-dimensional Euclidean space under Lp norm
• O(dn^{1/(1+ε)}) query time and O(dn + n^{1 + 1/(1+ε)}) space
• Technique 1: Approximate Nearest Neighbor reduces to
Approximate Near Neighbor with little overhead
• Technique 2: Locality Sensitive Hashing for Approximate Near Neighbor
5. Overview
1. Near Neighbor Reporting
– Formal problem formulation
2. Locality Sensitive Hashing
– Definition and example
– Algorithms for NNR based on LSH
– Query time decomposition
3. Performance tuning
– Gap amplification
– Parameter optimization
7. Near Neighbor Reporting
• Input:
– A set P of points in a metric space M
– radius r > 0
– coverage rate c
– A set S of points sampled from an unknown distribution f
• Goal:
– A deterministic algorithm for building a data structure T s.t.
• T.nbrs(q) ⊆ nbrs(q) = {p ∈ P : dist(p,q) ≤ r}
• E_q[cvg(q|T)] ≥ c, where q ~ f and cvg(q|T) = |T.nbrs(q)| / |nbrs(q)|
(hard to achieve deterministically)
2015/3/4
8. (Relaxed) Near Neighbor Reporting
• Input:
– A set P of points in a metric space M
– radius r > 0
– coverage rate c
– A set S of points sampled from an unknown distribution f
• Goal:
– A randomized algorithm for building a data structure T s.t.
• T.nbrs(q) ⊆ nbrs(q) = {p ∈ P : dist(p,q) ≤ r}
• E_T[E_q[cvg(q|T)]] = Σ_T Pr(T is built)·E_q[cvg(q|T)] ≥ c, where q ~ f
• Fact:
– If E_T[cvg(q|T)] ≥ c for any q,
then E_T[E_q[cvg(q|T)]] ≥ c, where q ~ f
10. Locality Sensitive Hashing
• Informally, an LSH H is a set of hash functions over a metric
space satisfying the following condition.
• Let h be chosen uniformly at random from H
– p and q are closer => Pr(h(p) = h(q)) is higher
• Formally, H is (r, r+, c, c’)-sensitive if
– r < r+ and c > c’
– Let h be chosen uniformly at random from H
– If dist(p, q) ≤ r, then Pr(h(p) = h(q)) ≥ c
– If dist(p, q) ≥ r+, then Pr(h(p) = h(q)) ≤ c’
11. LSH for angular distance
• Distance function: d(u, v) = arccos(u·v) = θ(u, v)
• Random Projection:
– Choose a random unit vector w
– h(u) = sgn(u·w)
– Pr(h(u) = h(v)) = 1 − θ(u, v)/π
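As a sketch of the random-projection family above, the collision probability can be checked empirically; the sample size and the two orthogonal test vectors (θ = π/2, so the probability should come out near 0.5) are illustrative assumptions, not part of the slides.

```python
import numpy as np

def make_hyperplane_hash(dim, rng):
    """One hash from the family: the sign of the projection onto a random unit vector w."""
    w = rng.standard_normal(dim)
    w /= np.linalg.norm(w)                     # random unit vector w
    return lambda u: int(np.dot(u, w) >= 0)    # sgn(u . w), encoded as 0/1

# Empirically estimate Pr(h(u) = h(v)) = 1 - theta(u, v)/pi for orthogonal u, v.
rng = np.random.default_rng(0)
u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
hashes = [make_hyperplane_hash(2, rng) for _ in range(10_000)]
estimate = sum(h(u) == h(v) for h in hashes) / len(hashes)
print(estimate)                                # close to 1 - (pi/2)/pi = 0.5
```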
12. LSH for Jaccard distance
• Distance function: d(A, B) = 1 − |A ∩ B| / |A ∪ B|
• MinHash:
– Pick a random permutation π on the universe U
– h(A) = argmin_{a∈A} π(a)
– Pr(h(A) = h(B)) = |A ∩ B| / |A ∪ B| = 1 – d(A, B)
• Note:
– Finding the Jaccard median
• Very easy to understand, very hard to compute
• Studied since 1981
• Chierichetti, Kumar, Pandey, Vassilvitskii, SODA’10
– NP-hard, and no FPTAS exists
– but there is a PTAS
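A minimal MinHash sketch matching the definition above; the toy universe and the two example sets (Jaccard similarity 2/6) are assumptions made for the demo.

```python
import random

def minhash(universe, seed):
    """One MinHash function: draw a random permutation pi of the universe;
    h(A) is the element of A that pi ranks first."""
    pi = list(universe)
    random.Random(seed).shuffle(pi)
    rank = {x: i for i, x in enumerate(pi)}    # pi(a) for every a in U
    return lambda A: min(A, key=rank.__getitem__)

# Pr(h(A) = h(B)) should approach |A ∩ B| / |A ∪ B| = 2/6.
A, B = {1, 2, 3, 4}, {3, 4, 5, 6}
hs = [minhash(range(10), seed) for seed in range(6000)]
estimate = sum(h(A) == h(B) for h in hs) / len(hs)
print(estimate)                                # close to 1/3
```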
13. Algorithm for NNR Based on LSH
• Let H be a (r, r+, c, c’)-sensitive LSH over a metric space M
• Consider the following randomized algorithm for NNR
– Uniformly at random choose a hash function h from H
– Build a hash table Th s.t. Th(q) = {p ∈ P : h(p) = h(q)}
– Define Th.nbrs(q) to return the points p in Th(q) with dist(p,q) ≤ r
• For any q,
E_Th[cvg(q|Th)] = Σ_Th Pr(Th is built)·cvg(q|Th)
= Σ_h Pr(h is chosen)·(Σ_{p∈nbrs(q)} δ(h(p) = h(q))) / |nbrs(q)|
= (Σ_{p∈nbrs(q)} Σ_h Pr(h is chosen)·δ(h(p) = h(q))) / |nbrs(q)|
= (Σ_{p∈nbrs(q)} Pr(h(p) = h(q))) / |nbrs(q)|
≥ c
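The table-based algorithm on this slide can be sketched as follows; the 1-D points and the floor-based bucket function are toy assumptions for the demo (a real deployment would draw h from an LSH family such as MinHash or random projections).

```python
import math
from collections import defaultdict

class LSHTable:
    """Hash table T_h for one hash function h: T_h(q) is q's bucket, and
    T_h.nbrs(q) filters that bucket down to points within distance r."""

    def __init__(self, points, h, dist, r):
        self.h, self.dist, self.r = h, dist, r
        self.buckets = defaultdict(list)
        for p in points:                       # build: hash every data point once
            self.buckets[h(p)].append(p)

    def nbrs(self, q):
        # scan only q's bucket instead of all of P
        return [p for p in self.buckets[self.h(q)] if self.dist(p, q) <= self.r]

r = 1.0
points = [0.1, 0.4, 0.5, 2.3, 2.4, 7.9]
table = LSHTable(points, h=lambda x: math.floor(x / r),
                 dist=lambda a, b: abs(a - b), r=r)
print(table.nbrs(0.3))                         # near neighbors found in 0.3's bucket
```

Note that near points just across a bucket boundary can be missed; that coverage loss is exactly what gap amplification (OR-ing several tables) addresses.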
14. Query Time
• Query time = time for computing h(q)
+ |{p ∈ P : h(p) = h(q) & dist(p,q) > r}|·d
+ |{p ∈ P : h(p) = h(q) & dist(p,q) ≤ r}|·d
= timec + timeFP + |output|
16. Why Gap Amplification
• To lift the coverage rate c
• To reduce the false positive rate and thereby improve timeFP
[Figure: collision probability vs. Jaccard distance. A (0.2, 0.4, c = 0.8, c’ = 0.6)-sensitive
LSH becomes, after gap amplification, (0.2, 0.4, c = 0.9, c’ = 0.1)-sensitive.]
17. How Gap Amplification
• Construct LSH G from the original LSH H
– LSH B = {b(x; h1, h2,…, hk): hi H }
• b(p; h1, h2,…, hk) = b(q; h1, h2,…, hk) iff ANDi=1,…,k hi(p) = hi(q)
– LSH G = {g(x; b1, b2,… , bL): bi B}
• g(p; b1, b2,… , bL ) = g(q; b1, b2,… , bL) iff ORi=1,…,L bi(p) = bi(q)
• Intuition:
– AND increases the gap
• Collision probabilities of distant points decrease
exponentially faster than near points
– OR increases the collision probabilities approx. linearly
• Let P = Pr_{h∈H}(h(p) = h(q))
=> Pr_{b∈B}(b(p) = b(q)) = P^k
=> Pr_{g∈G}(g(p) ≠ g(q)) = (1 – P^k)^L
=> Pr_{g∈G}(g(p) = g(q)) = 1 − (1 – P^k)^L ≈ L·P^k (when P^k is small)
[Figure: L buckets, each bucket an AND of k hashes, combined by OR]
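A sketch of the AND-OR construction, using hyperplane sign-hashes as the base family H; the dimension, the choice k = L = 8, and the test vectors are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, k, L = 16, 8, 8
planes = rng.standard_normal((L, k, dim))   # L bands, each AND-ing k sign-hashes

def g(u):
    """Signature of u: one k-bit tuple per band b_i (the AND of its k hashes)."""
    return [tuple(int(s) for s in (band @ u >= 0)) for band in planes]

def collide(u, v):
    # g(u) = g(v) iff some band matches in full: OR over L of (AND over k)
    return any(bu == bv for bu, bv in zip(g(u), g(v)))

u = rng.standard_normal(dim)
near = u + 0.05 * rng.standard_normal(dim)  # tiny perturbation: near point
far = rng.standard_normal(dim)              # independent draw: distant point
print(collide(u, near), collide(u, far))
```

With these parameters a near-duplicate almost surely agrees on some full band, while an unrelated vector rarely does, matching the intuition that AND widens the gap and OR restores coverage.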
18. Parameter Optimization
• Situation
– A (r, r+, c, c’)-sensitive LSH H is given
– After gap amplification we want c to be lifted to a target c̃
• Let c̃ = 1 – (1 − c^k)^L
⇒ L = log(1 – c̃) / log(1 − c^k)
⇒ L is a strictly increasing function of k (for fixed c̃)
• So we only need to select a good k
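The formula can be checked numerically; c = 0.8 and target c̃ = 0.95 are assumed example values, and taking the ceiling of L gives the smallest integer number of bands that reaches the target.

```python
import math

def bands_needed(c, c_target, k):
    """Smallest integer L with 1 - (1 - c^k)^L >= c_target."""
    return math.ceil(math.log(1 - c_target) / math.log(1 - c**k))

c, c_target = 0.8, 0.95
for k in (1, 2, 4, 8):
    L = bands_needed(c, c_target, k)
    achieved = 1 - (1 - c**k) ** L             # coverage actually reached
    print(k, L, round(achieved, 3))
```

Here L comes out as 2, 3, 6, 17 for k = 1, 2, 4, 8, illustrating that L grows (strictly) with k.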
19. How to select a good k
• How to measure the “goodness”?
– Minimize timec + E[timeFP] under the space constraint
• Let’s investigate how the query time and space usage will
react when we increase k
20. k => space
• Space = O(dn + nL) = O(n(d + log(1 – c̃) / log(1 − c^k)))
21. k => timec
• timec = O(dkL) = O(dk·log(1 – c̃) / log(1 − c^k))
– Interested readers can refer to E2LSH for how to reduce timec to
O(dkL^{1/2})
22. k => timeFP
• Consider s-curves Y = 1 – (1 − P^k)^L passing through (c, c̃)
• Larger k => steeper s-curve
=> collision prob. drops faster for distant points
=> fewer false positives
=> lower timeFP
[Figure: s-curves through (c, c̃) for k = 2 and k = 3; the steeper k = 3 curve
maps the original collision prob. of distant points to a lower value]
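The steepening effect can be checked numerically; c = 0.8, target c̃ = 0.95, and a distant-pair collision probability c’ = 0.4 are assumed example values. For each k we pick the smallest L whose s-curve reaches c̃ at c, then evaluate the amplified curve at c’.

```python
import math

c, c_tilde, c_prime = 0.8, 0.95, 0.4
for k in (1, 2, 4, 8):
    # smallest L with 1 - (1 - c^k)^L >= c_tilde
    L = math.ceil(math.log(1 - c_tilde) / math.log(1 - c**k))
    fp = 1 - (1 - c_prime**k) ** L   # amplified collision prob. of distant points
    print(k, L, round(fp, 3))
```

The distant-pair collision probability falls from 0.64 at k = 1 to about 0.011 at k = 8, at the price of a larger L (and thus more space and more hash computations).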
23. Procedure for optimizing k
1. Determine the largest possible value kmax for k without
violating the space constraint
2. Find k* in [1..kmax] minimizing timec(k) + E[timeFP(k)]
– In practice, timec and timeFP are measured experimentally by
• constructing a data structure T
• running several queries sampled from S on T
24. Observation & Question
• Let Δk timec = timec(k) – timec(k−1)
• Let Δk timeFP = E[timeFP(k−1)] – E[timeFP(k)]
• Observation:
– If Δk timec is increasing and Δk timeFP is decreasing, then k*
is the largest k such that Δk timeFP > Δk timec, and it can be
found using binary search.
• Question:
– In which situation will Δk timeFP = E[timeFP(k−1)] – E[timeFP(k)] be increasing?
[Figure: Δk timec and Δk timeFP plotted against k]
25. Summary
• Near Neighbor Reporting
– Finds many applications in practice
• Locality Sensitive Hashing
– Hash near points to the same value
– One of the most useful techniques for NNR
• Performance tuning
– Gap amplification for higher coverage and lower FP
– Parameter optimization for query time
26. Further Reading
• Dimensionality Reduction
– Variance preserving
• Principal Component Analysis
• Singular Value Decomposition
– Distance preserving
• Random Projection and the Johnson–Lindenstrauss lemma
– Locality preserving
• Locally Linear Embedding
• Multi-dimensional Scaling
• ISOMAP
27. References
1. Approximate Nearest Neighbors: Towards Removing the Curse of
Dimensionality (Indyk and Motwani)
2. Similarity Search in High Dimensions via Hashing (Gionis, Indyk and Motwani)
3. http://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf
4. E2LSH
28. Appendix
• Suppose that we have a (r, r+, c, err)-sensitive LSH H and want to
amplify H to get a (r, r+, c̃, ẽrr)-sensitive LSH G.
• How do the bucket number L and the collision error ẽrr change
with k?