Locality Sensitive Hashing (LSH) is a technique for finding similar items in large datasets. It works in 3 steps:
1. Shingling converts documents to sets of n-grams (sequences of tokens). This represents documents as high-dimensional vectors.
2. MinHashing maps these high-dimensional sets to short signatures or sketches in a way that preserves similarity according to the Jaccard coefficient: each signature entry is the minimum value of the set under a random permutation.
3. LSH partitions the signature matrix into bands and hashes each band separately, so that similar signatures are likely to hash to the same buckets. Candidate pairs are those that share a bucket in one or more bands, reducing the number of pairs that must be compared directly.
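To make the pipeline concrete, here is a minimal Python sketch of all three steps. Every name and parameter is illustrative: real implementations replace Python's built-in hash (which is salted per interpreter run) with fixed hash functions, and tune the band/row split to the target similarity threshold.

```python
import random
from collections import defaultdict

def shingles(text, n=3):
    # Step 1: represent a document as its set of character n-grams.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingle_set, seeds, prime=2**61 - 1):
    # Step 2: one signature entry per hash function; the minimum hashed
    # value over the set preserves Jaccard similarity in expectation.
    return [min((a * hash(s) + b) % prime for s in shingle_set)
            for a, b in seeds]

def lsh_candidates(signatures, bands=20, rows=5):
    # Step 3: split each signature into bands and hash each band;
    # documents colliding in any band become candidate pairs.
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ids = sorted(ids)
        pairs.update((a, b) for i, a in enumerate(ids) for b in ids[i + 1:])
    return pairs

random.seed(42)
seeds = [(random.randrange(1, 2**32), random.randrange(2**32))
         for _ in range(100)]  # 100 = bands * rows
docs = {"d1": "locality sensitive hashing finds similar documents",
        "d2": "locality sensitive hashing finds similar documents fast",
        "d3": "completely unrelated text about something else"}
sigs = {d: minhash_signature(shingles(t), seeds) for d, t in docs.items()}
print(lsh_candidates(sigs))  # expect ('d1', 'd2') among the candidates
```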
Graph mining analyzes structured data like social networks and the web through graph search algorithms. It aims to find frequent subgraphs using Apriori-based or pattern growth approaches. Social networks exhibit characteristics like densification and heavy-tailed degree distributions. Link mining analyzes heterogeneous, multi-relational social network data through tasks like link prediction and group detection, facing challenges of logical vs statistical dependencies and collective classification. Multi-relational data mining searches for patterns across multiple database tables, including multi-relational clustering that utilizes information across relations.
The document discusses parallel databases and their architectures. It introduces parallel databases as systems that seek to improve performance through parallelizing operations like loading data, building indexes, and evaluating queries using multiple CPUs and disks. It describes three main architectures for parallel databases: shared memory, shared disk, and shared nothing. The shared nothing architecture provides linear scale-up and speed-up but is more difficult to program. The document also discusses measuring performance improvements from parallelization through speed-up and scale-up.
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3: Preprocessing - Salah Amean
The chapter contains:
Data Preprocessing: An Overview,
Data Quality,
Major Tasks in Data Preprocessing,
Data Cleaning,
Data Integration,
Data Reduction,
Data Transformation and Data Discretization,
Summary.
The document discusses multithreaded and distributed algorithms. It describes multithreaded algorithms as executing parts of a program concurrently to maximize CPU utilization. Key aspects include communication models, types of threading, and performance measures. Distributed algorithms do not assume a central coordinator and run across distributed systems without shared memory. Examples of distributed algorithms provided are breadth-first search, minimum spanning tree, naive string matching, and Rabin-Karp string matching.
The document discusses ambiguous and unambiguous context-free grammars (CFGs). CFGs are classified as ambiguous if a string has two or more possible parse trees, and unambiguous otherwise. Several examples are provided to illustrate ambiguous grammars that have multiple derivations for strings, such as grammars generating expressions. Unambiguous grammars are also exemplified, including one for the language of strings with a's. The document concludes with references for further reading on ambiguity in grammars and automata theory.
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks... - Rodney Joyce
Number 2 in the Data Science for Dummies series - We'll predict Titanic survival with Databricks, Python and Spark ML.
These are the slides only (excuse the PowerPoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/
If you have not used Databricks before check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
Text clustering involves grouping text documents into clusters such that documents within a cluster are similar to each other and dissimilar to documents in other clusters. Common text clustering methods include bisecting k-means clustering, which recursively partitions clusters, and agglomerative hierarchical clustering, which iteratively merges clusters. Text clustering is used to automatically organize large document collections and improve search by returning related groups of documents.
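As a rough illustration of the bisecting strategy, here is a sketch on numeric feature vectors (in practice documents would first be vectorized, e.g. with tf-idf; all names below are made up for the example):

```python
import numpy as np

def two_means(points, iters=20, seed=0):
    # Plain 2-means: split one cluster of points into two halves.
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), 2, replace=False)].copy()
    for _ in range(iters):
        labels = ((points[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = points[labels == k].mean(axis=0)
    return labels

def bisecting_kmeans(points, k):
    # Start with everything in one cluster, then repeatedly bisect
    # the largest remaining cluster until k clusters exist.
    points = np.asarray(points, dtype=float)
    clusters = [np.arange(len(points))]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        labels = two_means(points[largest])
        clusters += [largest[labels == 0], largest[labels == 1]]
    return clusters

data = [[0, 0], [0, 1], [9, 9], [9, 8], [5, 0], [5, 1]]
print(bisecting_kmeans(data, k=3))
```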
This document provides an overview of data modeling, including definitions of key concepts like data models and data modeling. It describes the evolution of popular data models from hierarchical to network to relational to entity-relationship to object-oriented models. For each model, it outlines the basic concepts, advantages, and disadvantages. The document emphasizes that newer data models aimed to address shortcomings of previous approaches and capture real-world data and relationships.
The document discusses challenges with data annotation at scale and potential solutions. It notes that while data is important for AI, obtaining large datasets is difficult due to privacy laws, terms of use, and outsourcing challenges. Annotation quality and workflow optimization are also discussed, including using tight bounding boxes, automatic annotation, and open-source tools like CVAT that support tasks like object detection, classification, and semantic segmentation. The conclusion emphasizes that data requires management as a product and investing in infrastructure to develop high quality datasets.
Clustering for Stream and Parallelism (DATA ANALYTICS) - DheerajPachauri
The document summarizes information about a group project involving data stream clustering. It lists the group members and then discusses key concepts related to data stream clustering like requirements for algorithms, common algorithm types and steps, prototypes and windows. It also touches on outliers and applications of clustering.
Companies are finding that data can be a powerful differentiator and are investing heavily in infrastructure, tools and personnel to ingest and curate raw data to be "analyzable". This process of data curation is called "Data Wrangling".
This task can be very cumbersome and requires trained personnel. However with the advances in open source and commercial tooling, this process has gotten a lot easier and the technical expertise required to do this effectively has dropped several notches.
In this tutorial, we will get a feel for what data wranglers do and use R, RStudio, Trifacta Wrangler, Open Refine tools with some hands-on exercises available at http://akuntamukkala.blogspot.com/2016/05/data-wrangling-examples.html
This document discusses spatial data mining and its applications. Spatial data mining involves extracting knowledge and relationships from large spatial databases. It can be used for applications like GIS, remote sensing, medical imaging, and more. Some challenges include the complexity of spatial data types and large data volumes. The document also covers topics like spatial data warehouses, dimensions and measures in spatial analysis, spatial association rule mining, and applications in fields such as earth science, crime mapping, and commerce.
Graph clustering involves representing data as a graph where each element is a node and connections between elements are modeled by edge weights. The goal is to cluster nodes such that elements within a cluster are well-connected but have few connections to elements outside the cluster. There are various algorithms for graph clustering, including k-spanning tree, shared nearest neighbor, betweenness centrality, highly connected components, and maximal clique enumeration. These algorithms group the nodes of a graph into clusters based on properties like minimum spanning trees, shared neighbors, centrality, edge connectivity, and maximal cliques. Graph clustering has applications in fields like image segmentation, network analysis, biology, medicine, and astronomy.
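For instance, the k-spanning-tree approach can be sketched in a few lines with the networkx library (the graph and its weights below are toy assumptions): build the minimum spanning tree, delete its k-1 heaviest edges, and read off the connected components as clusters.

```python
import networkx as nx

def k_spanning_tree_clusters(G, k):
    # Minimum spanning tree, minus its k-1 heaviest edges,
    # leaves k connected components: the clusters.
    mst = nx.minimum_spanning_tree(G, weight="weight")
    heaviest = sorted(mst.edges(data=True),
                      key=lambda e: e[2]["weight"], reverse=True)[:k - 1]
    mst.remove_edges_from([(u, v) for u, v, _ in heaviest])
    return list(nx.connected_components(mst))

# Two tightly connected groups joined by one expensive edge.
G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 1), ("b", "c", 1),
                           ("d", "e", 1), ("c", "d", 10)])
print(k_spanning_tree_clusters(G, 2))  # [{'a', 'b', 'c'}, {'d', 'e'}]
```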
This document provides a syllabus for a course on big data. The course introduces students to big data concepts like characteristics of data, structured and unstructured data sources, and big data platforms and tools. Students will learn data analysis using R software, big data technologies like Hadoop and MapReduce, mining techniques for frequent patterns and clustering, and analytical frameworks and visualization tools. The goal is for students to be able to identify domains suitable for big data analytics, perform data analysis in R, use Hadoop and MapReduce, apply big data to problems, and suggest ways to use big data to increase business outcomes.
Cloud Programming Models: eScience, Big Data, etc. - Alexandru Iosup
This document discusses cloud programming models. It begins by defining programming models and noting that they provide an abstraction of a computer system through a language, libraries and runtime system. It then lists some key characteristics of a cloud programming model including efficiency, scalability, fault tolerance and data models. The document outlines an agenda to cover programming models for compute-intensive and big data workloads. It provides examples of bags of tasks and workflow programming models and their applications in fields like bioinformatics.
Topic modeling is a technique for discovering hidden semantic patterns in large document collections. It represents documents as probability distributions over latent topics, where each topic is characterized by a distribution over words. Two common probabilistic topic models are latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (pLSA). LDA assumes each document exhibits multiple topics in different proportions, with topics modeled as distributions over words. Topic modeling provides dimensionality reduction and can be applied to problems like text classification, collaborative filtering, and computer vision tasks like image classification.
The document discusses hashing techniques for implementing dictionaries. It begins by introducing the direct addressing method, which stores key-value pairs directly in an array indexed by keys. However, this wastes space when there are fewer unique keys than array slots. Hashing addresses this by using a hash function to map keys to array slots, reducing storage needs. However, collisions can occur when different keys hash to the same slot. The document then covers various techniques for handling collisions, including chaining, linear probing, quadratic probing, and double hashing. It also discusses properties of good hash functions such as minimizing collisions between related keys and producing uniformly random mappings.
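As a minimal sketch of the chaining strategy described above (illustrative only; the probing variants instead store entries in the array itself and step to another slot on collision):

```python
class ChainedHashTable:
    # Dictionary via hashing with chaining: each array slot holds a
    # list of (key, value) pairs that hashed to that slot.
    def __init__(self, slots=8):
        self.slots = [[] for _ in range(slots)]

    def _bucket(self, key):
        return self.slots[hash(key) % len(self.slots)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key exists: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key (or collision): chain it

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.put("apple", 3); t.put("pear", 5)
print(t.get("apple"))  # 3
```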
This document discusses various methodologies for processing and analyzing stream data, time series data, and sequence data. It covers topics such as random sampling and sketches/synopses for stream data, data stream management systems, the Hoeffding tree and VFDT algorithms for stream data classification, concept-adapting algorithms, ensemble approaches, clustering of evolving data streams, time series databases, Markov chains for sequence analysis, and algorithms like the forward algorithm, Viterbi algorithm, and Baum-Welch algorithm for hidden Markov models.
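The Hoeffding tree's split test reduces to one standard formula; here is a small sketch with illustrative numbers:

```python
import math

def hoeffding_bound(value_range, delta, n):
    # epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability 1 - delta,
    # the mean observed over n examples is within epsilon of the true mean.
    # VFDT splits once the gain gap between attributes exceeds epsilon.
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

# After 1,000 streamed examples, a gain difference above ~0.083 is decisive.
print(hoeffding_bound(value_range=1.0, delta=1e-6, n=1000))
```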
Hierarchical clustering is a method of partitioning a set of data into meaningful sub-classes or clusters. It involves two approaches - agglomerative, which successively links pairs of items or clusters, and divisive, which starts with the whole set as a cluster and divides it into smaller partitions. Agglomerative Nesting (AGNES) is an agglomerative technique that merges clusters with the least dissimilarity at each step, eventually combining all clusters. Divisive Analysis (DIANA) is the inverse, starting with all data in one cluster and splitting it until each data point is its own cluster. Both approaches can be visualized using dendrograms to show the hierarchical merging or splitting of clusters.
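A compact single-linkage sketch of the AGNES merging loop (illustrative; AGNES supports other dissimilarity criteria, and the quadratic scan below is far from an optimized implementation):

```python
import numpy as np

def agnes(points, k):
    # Start with every point as its own cluster; repeatedly merge
    # the pair of clusters with the least dissimilarity.
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best, pair = np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link dissimilarity: closest pair across clusters.
                d = min(np.linalg.norm(points[a] - points[b])
                        for a in clusters[i] for b in clusters[j])
                if d < best:
                    best, pair = d, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)
    return clusters

print(agnes([[0, 0], [0, 1], [5, 5], [5, 6]], k=2))  # [[0, 1], [2, 3]]
```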
The document outlines a presentation on multimedia data mining. It discusses three articles: 1) a tool for visually mining multimedia data for social studies, 2) a framework for mining traffic video sequences, and 3) using voice mining to understand customer feedback. It also provides an introduction to multimedia data mining and recommendations.
Exploratory data analysis in R - Data Science Club - Martin Bago
How do you analyse a new dataset in R? Which libraries and commands should you use? How can you understand your dataset in a few minutes? Read my presentation for the Data Science Club by Exponea and find out!
This document discusses techniques for mining data streams. It begins by defining different types of streaming data like time-series data and sequence data. It then discusses the characteristics of data streams like their huge volume, fast changing nature, and requirement for real-time processing. The key challenges in stream query processing are the unbounded memory requirements and need for approximate query answering. The document outlines several synopsis data structures and techniques used for mining data streams, including random sampling, histograms, sketches, and randomized algorithms. It also discusses architectures for stream query processing and classification of dynamic data streams.
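Of the synopsis techniques mentioned, random sampling is the simplest to show. The classic reservoir algorithm keeps a uniform sample of a stream of unknown length in O(k) memory:

```python
import random

def reservoir_sample(stream, k, seed=None):
    # Maintain a uniform random sample of k items from a stream of
    # unknown length, touching each item exactly once.
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)
        else:
            j = rng.randrange(n)      # item n survives with probability k/n
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5, seed=1))
```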
Big data is high-volume, high-velocity, and high-variety data that is difficult to process using traditional data management tools. It is characterized by 3Vs: volume of data is growing exponentially, velocity as data streams in real-time, and variety as data comes from many different sources and formats. The document discusses big data analytics techniques to gain insights from large and complex datasets and provides examples of big data sources and applications.
This slide deck first introduces the sequential pattern mining problem and presents some definitions required to understand the GSP algorithm. At the end there is a brief introduction to the GSP algorithm and some practical constraints it supports.
This document discusses active database management systems. It defines active databases as database systems that can automatically respond to events inside or outside the system through the use of event-condition-action rules. These rules allow the database to monitor and react to specific events. The document outlines the key components of an active database architecture, including a knowledge model and execution model. It also discusses features, applications, strengths and weaknesses of active databases.
Data preprocessing techniques are applied before mining. These can improve the overall quality of the patterns mined and the time required for the actual mining.
Some important data preprocessing steps that are needed before applying data mining algorithms to any dataset are fully described in these slides.
The document discusses techniques for detecting duplicate and near-duplicate documents. It describes how near duplicates can be identified by computing syntactic similarity using measures like edit distance. Shingling transforms documents into sets of n-grams that can be used for similarity comparisons. Sketches provide a compact representation of a document's shingles using a subset chosen by permutations, allowing efficient estimation of resemblance between documents. MinHash signatures exploit the relationship between resemblance of sets and the probability of matching minhash values to detect near duplicates in one pass over the data.
This document discusses techniques for finding similar sets of items, specifically documents. It introduces the concepts of shingling, minhashing, and locality-sensitive hashing. Shingling converts documents to sets of n-grams. Minhashing converts large sets to short signatures while preserving similarity using multiple hash functions. Locality-sensitive hashing focuses on signature pairs likely to be similar by hashing signatures to multiple bands to generate candidate pairs for further comparison. These techniques allow efficiently finding similar documents when directly comparing all pairs would be prohibitively expensive.
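The key MinHash property (the probability that two sets agree on a minhash value equals their Jaccard resemblance) can be checked empirically with a short sketch; Python's built-in hash stands in for random permutations here, so exact numbers vary between runs:

```python
import random

def jaccard(a, b):
    return len(a & b) / len(a | b)

def minhash_sig(items, perms):
    # One entry per hash function: the minimum hashed value over the set.
    return [min((a * hash(x) + b) % (2**61 - 1) for x in items)
            for a, b in perms]

random.seed(0)
perms = [(random.randrange(1, 2**32), random.randrange(2**32))
         for _ in range(500)]
A = set("the quick brown fox jumps over the lazy dog".split())
B = set("the quick brown fox sleeps near the lazy dog".split())
sa, sb = minhash_sig(A, perms), minhash_sig(B, perms)
est = sum(x == y for x, y in zip(sa, sb)) / len(perms)
print(round(jaccard(A, B), 3), round(est, 3))  # estimate tracks true Jaccard
```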
This document provides an overview of clustering techniques for probabilistic language processing. It discusses how clustering can be used to group similar examples or documents together based on a distance or similarity metric. Several clustering algorithms are described, including k-means clustering which assigns documents to clusters based on proximity to centroid points, and conceptual clustering which aims to represent clusters with general rules rather than just enumerating members. The document also discusses challenges like the curse of dimensionality for high-dimensional data and evaluating the optimal number of clusters k.
Binary Similarity: Theory, Algorithms and Tool Evaluation - Liwei Ren 任力偉
This document summarizes Liwei Ren's presentation on binary similarity algorithms. It discusses three existing algorithms - ssdeep, sdhash, and TLSH - and proposes a new algorithm called TSFP. TSFP represents files as bags of blocks and measures similarity based on the overlap of blocks between two files. It is suggested as a way to solve similarity search and clustering problems by creating an index of files represented as TSFPs. The presentation concludes by inviting questions from the audience.
This document describes a technique called MinHashing that can be used to efficiently find near-duplicate documents among a large collection. MinHashing works in three steps: 1) it converts documents to sets of shingles, 2) it computes signatures for the sets using MinHashing to preserve similarity, 3) it uses Locality-Sensitive Hashing to focus on signature pairs likely to be from similar documents, finding candidates efficiently. This avoids comparing all possible document pairs.
This document discusses techniques for efficiently finding similar items in large datasets, including near-neighbor search, entity resolution, and document similarity. It introduces the concept of minhashing as a method to represent datasets in a way that allows efficient comparison of similarities. Minhashing works by using hash functions to map items to signatures in a way that preserves estimates of Jaccard similarity between items.
The document describes different types of editors used in HDL design including schematic and layout editors.
Schematic editors allow designers to capture circuit connectivity and hierarchy using libraries of component symbols. Layout editors allow designers to specify the physical structure of a design using geometric shapes.
Both editor types provide file and display commands, drawing tools to create and edit circuit elements, and data structures like linked lists and quad trees to store and query design information.
LSI latent (by HATOUM Saria and DONGO ESCALANTE Irvin Franco) - rchbeir
This document introduces latent semantic indexing (LSI), a technique for information retrieval that overcomes some limitations of the vector space model. LSI represents documents and queries in a semantic space of concepts derived from word co-occurrence patterns in the original text. It uses singular value decomposition to project documents and queries into a concept space of lower dimensionality than the original word space. This addresses problems with synonymy and polysemy. An example shows how LSI can retrieve a document based on conceptual similarity rather than direct word matching. Advantages of LSI include capturing synonymy and polysemy, while disadvantages include increased storage and computational requirements.
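A tiny numpy sketch of the LSI projection (the term-document counts and query are fabricated for illustration): documents and a query are folded into a k-dimensional concept space via truncated SVD, so a query for "car" can match a document that only says "auto".

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
terms = ["car", "auto", "engine", "flower", "petal"]
X = np.array([[2, 0, 1, 0],
              [0, 2, 1, 0],
              [1, 1, 2, 0],
              [0, 0, 0, 2],
              [0, 0, 0, 1]], dtype=float)

# Truncated SVD: keep k latent "concepts".
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_concepts = (np.diag(s[:k]) @ Vt[:k]).T   # documents in concept space

# Fold a query into the same space: q_hat = q^T U_k S_k^{-1}
q = np.array([1, 0, 0, 0, 0], dtype=float)   # query: "car"
q_hat = q @ U[:, :k] @ np.linalg.inv(np.diag(s[:k]))

# Cosine similarity in concept space: the "auto" document now matches "car".
cos = doc_concepts @ q_hat / (np.linalg.norm(doc_concepts, axis=1)
                              * np.linalg.norm(q_hat))
print(np.round(cos, 3))
```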
Near-duplicate detection method based on random projection.
This presentation gives an overview of the existing categories of NDD methods and introduces WSH (Weighted SimHash). It also presents some results comparing the original SimHash with WSH and a cosine-similarity-based method.
Finding similar items in high-dimensional spaces: locality sensitive hashing - Dmitriy Selivanov
This document discusses using locality sensitive hashing (LSH) to efficiently find similar items in high-dimensional spaces. It describes how LSH works by first representing items as sets of shingles/n-grams, then using minhashing to map these sets to compact signatures while preserving similarity. It explains that LSH further hashes the signatures into "bands" to generate candidate pairs that are likely similar and need direct comparison. The number of bands can be tuned to trade off finding most of the truly similar pairs against admitting few dissimilar ones.
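That tradeoff has a closed form: with b bands of r rows each, a pair whose signatures agree in a fraction s of positions becomes a candidate with probability 1 - (1 - s^r)^b. A few lines make the S-curve visible (b = 20, r = 5 chosen arbitrarily for the example):

```python
def candidate_prob(s, bands, rows):
    # Probability that a pair with signature similarity s shares
    # at least one band bucket: 1 - (1 - s^rows)^bands.
    return 1 - (1 - s ** rows) ** bands

for s in (0.2, 0.5, 0.8, 0.9):
    print(s, round(candidate_prob(s, bands=20, rows=5), 3))
# low-similarity pairs are rarely candidates; high-similarity almost always
```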
Dmitriy Selivanov, OK.RU. Finding Similar Items in high-dimensional spaces: L... - Mail.ru Group
Dmitriy described Locality Sensitive Hashing, a method for reducing the dimensionality of high-dimensional data. Using the search for similar text documents as a worked example, the MinHash algorithm was examined in detail.
This document provides an overview of the Introduction to Algorithms course, including the course modules and motivating problems. It introduces the Document Distance problem, which aims to define metrics to measure the similarity between documents based on word frequencies. It discusses an initial Python program ("docdist1.py") to calculate document distance that runs inefficiently due to quadratic time list concatenation. Profiling identifies this as the bottleneck. The solution is to use list extension, resulting in "docdist3.py". Further optimizations include using a dictionary to count word frequencies in constant time, creating "docdist4.py". The document outlines remaining opportunities like improving the word extraction and sorting algorithms.
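In the same spirit as the lecture's final version, here is a minimal sketch of the document distance computation (not the course's actual docdist code; the names are illustrative): count word frequencies with a dictionary in linear time, then take the angle between the frequency vectors.

```python
import math, re
from collections import Counter

def word_counts(text):
    # Linear-time word frequency counting with a dictionary.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def doc_distance(t1, t2):
    # Angle between the two word-frequency vectors.
    c1, c2 = word_counts(t1), word_counts(t2)
    dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
    norm = math.sqrt(sum(v * v for v in c1.values())) * \
           math.sqrt(sum(v * v for v in c2.values()))
    return math.acos(min(1.0, dot / norm))  # clamp guards float round-off

print(doc_distance("the cat sat on the mat", "the cat sat on a mat"))
```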
In this talk I will discuss how to deduplicate large amounts of source code using the source{d} stack, and more specifically the Apollo project. The three steps of the process used in Apollo will be detailed, i.e.: the feature extraction step; the hashing step; the connected component and community detection step. I'll then go on to describe some of the results found from applying Apollo to Public Git Archive, as well as the issues I faced and how these issues could have been somewhat avoided. The talk will be concluded by discussing Gemini, the production-ready sibling project to Apollo, and imagining applications that could extract value from Apollo.
After a quick introduction on the motivation behind Apollo, as said in the abstract, I'll describe each step of Apollo's process. As a rule of thumb I'll first describe each step formally, then go into how we did it in practice.
Feature extraction: I'll describe code representation, specifically as UASTs, then from there detail the features used. This will allow me to differentiate Apollo from its inspiration, DejaVu, and talk about code clone taxonomy a bit. TF-IDF will also be touched upon. Hashing: I'll describe the basic MinHashing algorithm, then the improvements Sergey Ioffe's variant brought, justifying its use in our case along the way. Connected components/community detection: I'll describe the notions of connected components and communities first (as in graphs), then talk about the different ways we can extract them from the similarity graph.
After this I'll talk about the issues I had applying Apollo to PGA due to the amount of data, and how I got around the major issues faced. Then I'll go on to talk about the results, show some of the communities, and explain in light of these results how the issues could have been avoided and the whole process improved. Finally I'll talk about Gemini, and outline some of the applications that could be imagined for source code deduplication.
Bytewise Approximate Match: Theory, Algorithms and Applications - Liwei Ren 任力偉
Byte-wise approximate matching has become an important field in computer science, with not only practical value but also theoretical significance. This talk will use six cases to rigorously define and describe the concept of approximate matching: identicalness, containment, cross-sharing, similarity, approximate containment and approximate cross-sharing. Based on the concept of approximate matching, one can propose a theoretical framework that consists of many problems of approximate matching, searching and clustering. Algorithmic solutions to the matching problems and their challenges will be outlined, along with theoretical analysis. This framework also includes some elements of our previous work on both the document fingerprinting problem and the mathematical evaluation of similarity digest schemes { TLSH, ssdeep, sdhash }. In the end, we will discuss applications in various security disciplines.
This document discusses IBM Research's work on knowledge graph creation and analytics for cognitive systems. Key points include:
1. IBM Research is developing novel graph analytics tools like algorithms for computing node centrality in O(N) time instead of O(N^3), allowing analysis of much larger graphs.
2. These tools are being applied to strategic projects on materials analytics and knowledge graphs to accelerate discovery.
3. One example is creating a knowledge graph for metallurgy that links alloys, processes, and documents to enable new types of queries.
The document summarizes a lecture on decision trees for classification. It introduces decision trees as a non-linear classification approach that learns directly from data representations. It then describes the basic ID3 algorithm for learning decision trees in a greedy top-down manner by choosing attributes that best split the data based on information gain at each step, recursively building the tree until reaching leaf nodes of single target classifications. The goal is to learn a compact tree representation of the training data.
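The quantity ID3 greedily maximizes at each node is easy to compute directly; here is a small sketch with a made-up toy table:

```python
import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum p_i * log2(p_i) over the class proportions.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Entropy reduction from splitting on one attribute; ID3 picks
    # the attribute with the largest gain at each node.
    total = entropy(labels)
    split = Counter(row[attr] for row in rows)
    for value, count in split.items():
        subset = [l for row, l in zip(rows, labels) if row[attr] == value]
        total -= (count / len(rows)) * entropy(subset)
    return total

# Hypothetical toy data: does "outlook" predict the label better than "windy"?
rows = [{"outlook": "sun", "windy": "y"}, {"outlook": "sun", "windy": "n"},
        {"outlook": "rain", "windy": "y"}, {"outlook": "rain", "windy": "n"}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, "outlook"))  # 1.0 (perfect split)
print(information_gain(rows, labels, "windy"))    # 0.0 (no information)
```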
This lecture covers several topics in geometric algorithms:
- An optimal parallel algorithm for solving the 2D convex hull problem in O(n log n) time using O(n) work.
- Applications of the 2D convex hull algorithm to problems like range searching and geometric intersection finding.
- The use of techniques like sweep lines and space partitioning trees to solve higher dimensional geometric problems efficiently.
How to start learning about big data - 2017 - Viet-Trung TRAN
This document provides an introduction to big data, including definitions of big data, where big data comes from, and the value of big data. It describes the Hadoop ecosystem as a framework for distributed storage and processing of large data sets. Key components of Hadoop include HDFS for storage, MapReduce for processing, and related tools like Pig, Hive, HBase, Sqoop, Kafka and Oozie. Machine learning platforms like H2O and Spark MLlib are also discussed. The document provides an overview of starting research in big data and common big data job titles.
giasan.vn real-estate analytics: a Vietnam case study - Viet-Trung TRAN
- There is a need for a national real estate database in Vietnam to increase transparency in the market. Big data analytics can help by indexing millions of property listings and providing insights like price timelines.
- The author proposes using geographically weighted regression on Spark to perform large-scale automatic property appraisals. Experiments show their distributed approach scales to large training and regression datasets and outperforms other methods on a cluster.
- A prototype applies their methods to predict land values and create heat maps, demonstrating the potential for real estate analytics to benefit investors and buyers in Vietnam. However, more research is still needed to uncover additional hidden values from real estate data.
A Vietnamese Language Model Based on Recurrent Neural Network - Viet-Trung TRAN
Language modeling plays a critical role in many natural language processing (NLP) tasks such as text prediction, machine translation and speech recognition. Traditional statistical language models (e.g. n-gram models) can only offer words that have been seen before and cannot capture long word context. Neural language models provide a promising way to surpass this shortcoming of statistical language models. This paper investigates Recurrent Neural Network (RNN) language models for Vietnamese, at the character and syllable levels. Experiments were conducted on a large dataset of 24M syllables, constructed from 1,500 movie subtitles. The experimental results show that our RNN-based language models yield reasonable performance on the movie subtitle dataset. Concretely, our models outperform n-gram language models in terms of perplexity.
Large-Scale Geographically Weighted Regression on Spark - Viet-Trung TRAN
Geographically Weighted Regression (GWR) is a local version of spatial regression that captures spatial dependency in regression analysis. GWR has many applications in practice as a visualization and prediction tool for spatial exploration (e.g. in climate, economics, and medicine). However, this local regression model becomes slow as the volume of calculations and the spatial data grow. Improving the performance of GWR is a critical issue, but distributed implementations have not been studied. Recently, with the advent of Spark as well as the MapReduce framework, developing machine learning applications and parallel programs has become easier. In this article, we propose several large-scale implementations of distributed GWR, leveraging the Spark framework. We implemented and evaluated these approaches with large datasets. To the best of our knowledge, this is the first work addressing GWR at large scale.
Recent progress on distributing deep learning - Viet-Trung TRAN
This document summarizes recent progress in distributed deep learning. It discusses the state of the art in neural networks and deep learning, as well as factors driving advances in deep learning like big data and increased computing power. It then covers approaches for scaling deep learning through model parallelism, data parallelism, and distributed training frameworks. Several deep learning applications developed in Vietnam are presented as examples, including optical character recognition and predictive text. The document concludes with principles for machine learning system design in distributed settings.
This document provides guidance on developing successful project proposals. It outlines key steps like finding the right funding call, developing a clear idea, forming a strategic partnership, and completing the application form while addressing the evaluation criteria. When completing the form, it advises having a clear rationale and objectives, a logical framework, good writing in the required language, relevant references and experience, and a realistic budget. The conclusion emphasizes aligning with the call objectives, a clear and logical proposal, understandable writing, a strong partnership, good management and technical references, and not sharing proposals with others.
GPSInsights is a scalable framework for efficiently storing and mining massive volumes of real-time GPS data reported every second from millions of vehicles. The framework was developed to address challenges with analyzing this type of location data including its size, imprecise measurements, and need to integrate the data with maps during processing.
OCR processing with deep learning: applied to Vietnamese documents - Viet-Trung TRAN
This document discusses using deep learning techniques like LSTM and CTC for optical character recognition (OCR), specifically for Vietnamese documents. It provides an overview of OCR, the history including Tesseract, and challenges with traditional approaches. Connectionist temporal classification (CTC) is introduced as a way to directly train RNNs on unsegmented sequence data. CTC combined with LSTM networks allows for end-to-end training of OCR without needing pre-segmented text. The document demonstrates how this approach can be applied to perform OCR on Vietnamese documents.
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive... - Viet-Trung TRAN
This document presents GPSInsights, a framework for efficiently storing and mining massive real-time vehicle location data. The framework uses an architecture based on scalable open-source components like Apache Spark, Storm, MongoDB and GeoMesa to store GPS data in real-time and perform mining tasks like scalable map-matching. The document describes experiments on 12 million GPS records which showed the framework can complete map-matching within seconds and scale to support millions of vehicles.
This document provides an overview of deep learning techniques for natural language processing (NLP). It discusses some of the challenges in language understanding like ambiguity and productivity. It then covers traditional ML approaches to NLP problems and how deep learning improves on these approaches. Some key deep learning techniques discussed include word embeddings, recursive neural networks, and language models. Word embeddings allow words with similar meanings to have similar vector representations, improving tasks like sentiment analysis. Recursive neural networks can model hierarchical structures like sentences. Language models assign probabilities to word sequences.
The document discusses the history and development of artificial neural networks and deep learning. It describes early neural network models like perceptrons from the 1950s and their use of weighted sums and activation functions. It then explains how additional developments led to modern deep learning architectures like convolutional neural networks and recurrent neural networks, which use techniques such as hidden layers, backpropagation, and word embeddings to learn from large datasets.
This document discusses decision trees and random forests for classification problems. It explains that decision trees use a top-down approach to split a training dataset based on attribute values to build a model for classification. Random forests improve upon decision trees by growing many de-correlated trees on randomly sampled subsets of data and features, then aggregating their predictions, which helps avoid overfitting. The document provides examples of using decision trees to classify wine preferences, sports preferences, and weather conditions for sport activities based on attribute values.
Recommender systems: Content-based and collaborative filtering - Viet-Trung TRAN
This document provides an overview of recommender systems, including content-based and collaborative filtering approaches. It discusses how content-based systems make recommendations based on item profiles and calculating similarity between user and item profiles. Collaborative filtering is described as finding similar users and making predictions based on their ratings. The document also covers evaluation metrics, complexity issues, and tips for building recommender systems.
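A bare-bones content-based sketch (the item profiles and names are invented for the example): build a user profile by averaging the profiles of liked items, then rank candidates by cosine similarity to that profile.

```python
import numpy as np

# Hypothetical item profiles over feature dimensions (e.g. genres).
item_profiles = {
    "movie_a": np.array([1.0, 0.0, 1.0]),   # action, _, sci-fi
    "movie_b": np.array([0.0, 1.0, 0.0]),   # romance
    "movie_c": np.array([1.0, 0.0, 0.0]),   # action
}

def user_profile(liked):
    # Aggregate the profiles of liked items into a user profile.
    return np.mean([item_profiles[i] for i in liked], axis=0)

def recommend(user, candidates):
    # Rank unseen items by cosine similarity to the user profile.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return sorted(candidates, key=lambda i: cos(user, item_profiles[i]),
                  reverse=True)

u = user_profile(["movie_a"])                # user liked one action/sci-fi film
print(recommend(u, ["movie_b", "movie_c"]))  # movie_c ranks first
```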
The Building Blocks of QuestDB, a Time Series Database - javier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake - Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Global Situational Awareness of A.I. and where it's headed - vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravaganza - sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Open Source Contributions to Postgres: The Basics, POSETTE 2024 - ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... - Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
The Ipsos - AI - Monitor 2024 Report.pdf - Social Samosa
According to the Ipsos AI Monitor 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
4. Scene Completion Problem
10 nearest neighbors from a collection of 20,000 images
[Hays and Efros, SIGGRAPH 2007]
5. Scene Completion Problem
10 nearest neighbors from a collection of 2 million images
[Hays and Efros, SIGGRAPH 2007]
6. A Common Metaphor
• Many problems can be expressed as finding “similar” sets:
– Find near-neighbors in high-dimensional space
• Examples:
– Pages with similar words
• For duplicate detection, classification by topic
– Customers who purchased similar products
• Products with similar customer sets
– Images with similar features
– Users who visited similar websites
7. Problem
• Given high-dimensional data points: x1, x2, ...
– E.g., an image is a long vector of pixel colors
• And some distance function d(x1, x2)
• Goal: find all pairs of data points (xi, xj) that are within some distance threshold: d(xi, xj) ≤ s
• Naive solution: O(n²), comparing all pairs
• Can it be done in O(n)?
8. Relation to Previous Lecture
• Last time: Finding frequent pairs
• Naïve solution: a single pass, but it requires space quadratic in the number of items (a count for every pair {i,j} over items 1…N)
• A-Priori:
– First pass: find frequent singletons
– For a pair to be a frequent-pair candidate, its singletons have to be frequent!
– Second pass: count only candidate pairs (a count for every pair {i,j} over the K frequent items)
• N … number of distinct items; K … number of items with support ≥ s
9. Relation to Previous Lecture
• Last time: Finding frequent pairs
• Further improvement: PCY
– Pass 1:
• Count the exact frequency of each item (items 1…N)
• Take pairs of items {i,j}, hash them into B buckets, and count the number of pairs that hashed to each bucket
– Example: basket {1,2,3} yields pairs {1,2}, {1,3}, {2,3}, each incrementing the count of the bucket it hashes to
10. Relation to Previous Lecture
• Further improvement: PCY (continued)
– Pass 2:
• For a pair {i,j} to be a candidate for a frequent pair, its singletons {i}, {j} have to be frequent and the pair has to hash to a frequent bucket!
– Example: baskets {1,2,3} and {1,2,4} yield pairs {1,2}, {1,3}, {2,3}, {1,4}, {2,4}, and the resulting bucket counts determine which buckets are frequent
(a code sketch of both passes follows below)
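To make the two PCY passes concrete, here is a minimal Python sketch; the basket representation, bucket count B, and support threshold are illustrative assumptions, not values from the slides:

from collections import defaultdict
from itertools import combinations

def pcy(baskets, B=1000, support=3):
    # Pass 1: exact singleton counts + counts of pairs hashed into B buckets
    item_counts = defaultdict(int)
    bucket_counts = [0] * B
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % B] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    frequent_buckets = {b for b, c in enumerate(bucket_counts) if c >= support}
    # Pass 2: count a pair only if both singletons are frequent
    # AND the pair hashes to a frequent bucket
    pair_counts = defaultdict(int)
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            if (pair[0] in frequent_items and pair[1] in frequent_items
                    and hash(pair) % B in frequent_buckets):
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= support}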
11. Approaches
• A-Priori
– Main idea: candidates
– Instead of keeping a count of each pair, only keep a count of candidate pairs!
• Finding pairs of similar docs
– Main idea: candidates
– Pass 1: Take documents and hash them to buckets such that documents that are similar hash to the same bucket
– Pass 2: Only compare documents that are candidates (i.e., they hashed to the same bucket)
– Benefit: instead of O(N²) comparisons, we need O(N) comparisons to find similar documents
13. Distance Measures
Goal: Find near-neighbors in high-dimensional space
– We formally define “near neighbors” as points that are a “small distance” apart
• For each application, we first need to define what “distance” means
• Today: Jaccard distance/similarity
– The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
– Jaccard distance: d(C1, C2) = 1 − |C1 ∩ C2| / |C1 ∪ C2|
• Example: 3 elements in the intersection, 8 in the union:
Jaccard similarity = 3/8; Jaccard distance = 5/8
(the definitions translate directly to code; see the sketch below)
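A minimal sketch using Python sets; the two example sets are hypothetical, chosen only to reproduce the 3/8 figure above:

def jaccard_similarity(c1, c2):
    # |C1 ∩ C2| / |C1 ∪ C2|
    return len(c1 & c2) / len(c1 | c2)

def jaccard_distance(c1, c2):
    return 1 - jaccard_similarity(c1, c2)

# 3 elements in the intersection, 8 in the union
assert jaccard_similarity({1, 2, 3, 4, 5}, {3, 4, 5, 6, 7, 8}) == 3 / 8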
14. Task: Finding Similar Documents
• Goal: Given a large number (in the millions or billions) of documents, find “near duplicate” pairs
• Applications:
– Mirror websites, or approximate mirrors
• Don’t want to show both in search results
– Similar news articles at many news sites
• Cluster articles by “same story”
• Problems:
– Many small pieces of one document can appear out of order in another
– Too many documents to compare all pairs
– Documents are so large or so many that they cannot fit in main memory
15. 3 Essential Steps for Similar Docs
1. Shingling: Convert documents to sets
2. Min-Hashing: Convert large sets to short signatures, while preserving similarity
3. Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
– Candidate pairs!
16. The Big Picture
Document → the set of strings of length k that appear in the document → signatures: short integer vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity
17. Shingling
Step 1: Shingling: convert documents to sets
Document → the set of strings of length k that appear in the document
18. Documents as High-Dim Data
• Step 1: Shingling: Convert documents to sets
• Simple approaches:
– Document = set of words appearing in the document
– Document = set of “important” words
– These don’t work well for this application. Why?
• We need to account for the ordering of words!
• A different way: shingles!
19. Define: Shingles
• A k-shingle (or k-gram) for a document is a sequence of k tokens that appears in the doc
– Tokens can be characters, words, or something else, depending on the application
– Assume tokens = characters for the examples
• Example: k = 2; document D1 = abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
– Option: shingles as a bag (multiset), counting ab twice:
S’(D1) = {ab, bc, ca, ab}
(a one-line code sketch follows below)
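A minimal shingling sketch with character tokens, reproducing the example:

def shingles(doc, k=2):
    # All k-grams of characters that appear in the document, as a set
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

assert shingles("abcab") == {"ab", "bc", "ca"}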
20. Compressing Shingles
• To compress long shingles, we can hash them to 32 bits
• Represent a document by the set of hash values of its k-shingles
– Caveat: two documents could (rarely) appear to have shingles in common, when in fact only the hash values were shared
• Example: k = 2; document D1 = abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
Hash the shingles: h(D1) = {1, 5, 7}
(a code sketch follows below)
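One way to realize the compression is CRC32; this is an assumption on my part (any well-mixed 32-bit hash works), and the values {1, 5, 7} on the slide are purely illustrative:

import zlib

def hashed_shingles(doc, k=9):
    # Represent the document by 32-bit hashes of its k-shingles
    return {zlib.crc32(doc[i:i + k].encode()) for i in range(len(doc) - k + 1)}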
21. Similarity Metric for Shingles
• Document D1 is a set of its k-shingles C1 = S(D1)
• Equivalently, each document is a 0/1 vector in the space of k-shingles
– Each unique shingle is a dimension
– Vectors are very sparse
• A natural similarity measure is the Jaccard similarity:
sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
22. Working Assumption
• Documents that have lots of shingles in common have similar text, even if the text appears in different order
• Caveat: You must pick k large enough, or most documents will have most shingles in common
– k = 5 is OK for short documents
– k = 10 is better for long documents
23. Motivation for Minhash/LSH
• Suppose we need to find near-duplicate documents among N = 1 million documents
• Naïvely, we would have to compute pairwise Jaccard similarities for every pair of docs
– N(N − 1)/2 ≈ 5·10¹¹ comparisons
– At 10⁵ secs/day and 10⁶ comparisons/sec, it would take 5 days
• For N = 10 million, it takes more than a year…
(the arithmetic checks out; see below)
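A three-line check of the estimate:

n = 1_000_000
pairs = n * (n - 1) // 2   # ≈ 5 * 10**11 comparisons
print(pairs / 10**6 / 10**5)   # at 10**6 comparisons/sec and 10**5 secs/day: ≈ 5.0 days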
24. MinHashing
Step 2: Min-Hashing: convert large sets to short signatures, while preserving similarity
Document → the set of strings of length k that appear in the document → signatures: short integer vectors that represent the sets and reflect their similarity
25. Encoding Sets as Bit Vectors
• Encode sets using 0/1 (bit, boolean) vectors
– One dimension per element in the universal set
• Interpret set intersection as bitwise AND, and set union as bitwise OR
• Example: C1 = 10111; C2 = 10011
– Size of intersection = 3; size of union = 4
– Jaccard similarity (not distance) = 3/4
– Distance: d(C1, C2) = 1 − (Jaccard similarity) = 1/4
(a bitwise sketch follows below)
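A minimal sketch of the bitwise view, using Python integers as bit vectors:

def jaccard_from_bits(c1, c2):
    # Intersection = bitwise AND, union = bitwise OR; count the 1 bits
    return bin(c1 & c2).count("1") / bin(c1 | c2).count("1")

assert jaccard_from_bits(0b10111, 0b10011) == 3 / 4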
26. From Sets to Boolean Matrices
• Rows = elements (shingles)
• Columns = sets (documents)
– 1 in row e and column s if and only if e is a member of s
– Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
– The typical matrix is sparse!
• Each document is a column. Example matrix (shingles × documents):
0101
0111
1001
1000
1010
1011
0111
– Example: sim(C1, C2) = ? Size of intersection = 3; size of union = 6; Jaccard similarity (not distance) = 3/6
– d(C1, C2) = 1 − (Jaccard similarity) = 3/6
27. Outline: Finding Similar Columns
• So far:
– Documents → sets of shingles
– Represent sets as boolean vectors in a matrix
• Next goal: Find similar columns while computing small signatures
– Similarity of columns == similarity of signatures
28. Outline: Finding Similar Columns
• Next goal: Find similar columns, small signatures
• Naïve approach:
– 1) Signatures of columns: small summaries of columns
– 2) Examine pairs of signatures to find similar columns
– 3) Optional: Check that columns with similar signatures are really similar
• Warnings:
– Comparing all pairs may take too much time: a job for LSH
– These methods can produce false negatives, and even false positives (if the optional check is not made)
29. Hashing Columns (Signatures)
• Key idea: “hash” each column C to a small signature h(C), such that:
– (1) h(C) is small enough that the signature fits in RAM
– (2) sim(C1, C2) is the same as the “similarity” of signatures h(C1) and h(C2)
• Goal: Find a hash function h(·) such that:
– If sim(C1, C2) is high, then with high prob. h(C1) = h(C2)
– If sim(C1, C2) is low, then with high prob. h(C1) ≠ h(C2)
• Hash docs into buckets. Expect that “most” pairs of near-duplicate docs hash into the same bucket!
30. Min-Hashing
• Goal: Find a hash function h(·) such that:
– if sim(C1, C2) is high, then with high prob. h(C1) = h(C2)
– if sim(C1, C2) is low, then with high prob. h(C1) ≠ h(C2)
• Clearly, the hash function depends on the similarity metric:
– Not all similarity metrics have a suitable hash function
• There is a suitable hash function for the Jaccard similarity: it is called Min-Hashing
31. Min-Hashing
• Imagine the rows of the boolean matrix permuted under a random permutation π
• Define a “hash” function h_π(C) = the index of the first (in the permuted order) row in which column C has value 1:
h_π(C) = min_π π(C)
• Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature of a column
(a sketch of this permutation-based construction follows below)
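A minimal permutation-based sketch; representing each column as the set of row indices where it has a 1 is my own encoding choice:

import random

def minhash_signatures(columns, num_rows, num_hashes=100, seed=0):
    # columns: list of sets of row indices with value 1
    rng = random.Random(seed)
    signature = []
    for _ in range(num_hashes):
        rank = list(range(num_rows))
        rng.shuffle(rank)  # rank[r] = position of row r under this permutation
        # h(C) = smallest permuted position among the rows where C has a 1
        signature.append([min(rank[r] for r in col) for col in columns])
    return signature

In practice, explicitly permuting millions of rows is too expensive, which is exactly what the row-hashing trick on slide 35 addresses.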
32. Min-Hashing Example
Input matrix (shingles × documents):
0101
0101
1010
1010
1010
1001
0101
Three random row permutations, listed as the permuted position of each row:
(3 4 7 2 6 1 5), (5 7 6 3 1 2 4), (4 5 1 6 7 3 2)
Each permutation contributes one row of the signature matrix M:
1 2 1 2
1 4 1 2
2 1 2 1
For each column, h(C) is the position of the first row (in permuted order) holding a 1; e.g., for one column the 2nd element of the permutation is the first to map to a 1, for another the 4th.
33. Similarity for Signatures
• We know: Pr[h_π(C1) = h_π(C2)] = sim(C1, C2)
• Now generalize to multiple hash functions
• The similarity of two signatures is the fraction of the hash functions in which they agree
• Note: Because of the Min-Hash property, the similarity of columns is the same as the expected similarity of their signatures
35. Implementation Trick
• Row hashing!
– Pick K = 100 hash functions k_i
– Ordering under k_i gives a random row permutation!
• One-pass implementation
– For each column C and hash function k_i, keep a “slot” for the min-hash value
– Initialize all sig(C)[i] = ∞
– Scan rows looking for 1s
• Suppose row j has 1 in column C
• Then for each k_i:
– If k_i(j) < sig(C)[i], then sig(C)[i] ← k_i(j)
How to pick a random hash function h(x)? Universal hashing:
h_{a,b}(x) = ((a·x + b) mod p) mod N
where a, b are random integers and p is a prime number (p > N)
(a runnable sketch follows below)
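A runnable one-pass sketch of this scheme; the choice of prime p and the row/column encoding are assumptions:

import random

def minhash_onepass(rows, num_cols, K=100, p=2_147_483_647, seed=42):
    # rows: iterable of (row index j, columns that have a 1 in row j)
    rng = random.Random(seed)
    hash_funcs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(K)]
    sig = [[float("inf")] * num_cols for _ in range(K)]  # sig[i][c] = slot
    for j, cols_with_one in rows:
        for i, (a, b) in enumerate(hash_funcs):
            h = (a * j + b) % p  # universal hash h_{a,b}(j); p is prime
            for c in cols_with_one:
                if h < sig[i][c]:
                    sig[i][c] = h  # keep the minimum seen so far
    return sig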
36. Locality Sensitive Hashing
Step 3: Locality-Sensitive Hashing: focus on pairs of signatures likely to be from similar documents
Document → the set of strings of length k that appear in the document → signatures: short integer vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity
37. LSH: First Cut
• Goal: Find documents with Jaccard similarity at least s (for some similarity threshold, e.g., s = 0.8)
• LSH – general idea: use a function f(x, y) that tells whether x and y are a candidate pair: a pair of elements whose similarity must be evaluated
• For Min-Hash matrices:
– Hash columns of the signature matrix M to many buckets
– Each pair of documents that hashes into the same bucket is a candidate pair
38. Partition M into b Bands
The signature matrix M is divided into b bands of r rows each; one signature (column) spans all bands.
39. Partition M into Bands
• Divide matrix M into b bands of r rows
• For each band, hash its portion of each column to a hash table with k buckets
– Make k as large as possible
• Candidate column pairs are those that hash to the same bucket for ≥ 1 band
• Tune b and r to catch most similar pairs, but few non-similar pairs
(a banding sketch follows below)
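A minimal banding sketch; the signature layout and doc-id keys are my own choices:

from collections import defaultdict

def lsh_candidate_pairs(signatures, b, r):
    # signatures: dict mapping doc id -> list of b*r min-hash values
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            portion = tuple(sig[band * r:(band + 1) * r])
            buckets[portion].append(doc)  # hash the band's r values together
        for docs in buckets.values():
            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    candidates.add((docs[i], docs[j]))
    return candidates

Each band gets its own bucket table, so two columns become candidates as soon as any single band agrees.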
40. Hashing Bands
Each band of r rows in matrix M is hashed to its own set of buckets. In the pictured example, columns 2 and 6 hash to the same bucket and are probably identical (a candidate pair), while columns 6 and 7 hash to different buckets and are surely different in this band.
41. Simplifying Assumption
• There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band
• Hereafter, we assume that “same bucket” means “identical in that band”
• This assumption is needed only to simplify the analysis, not for the correctness of the algorithm
42. Example of Bands
Assume the following case:
• Suppose 100,000 columns of M (100k docs)
• Signatures of 100 integers (rows)
• Therefore, signatures take 40 MB (100,000 × 100 × 4 bytes)
• Choose b = 20 bands of r = 5 integers/band
• Goal: Find pairs of documents that are at least s = 0.8 similar
43. C1, C2 are 80% Similar
• Find pairs of s = 0.8 similarity; set b = 20, r = 5
• Assume: sim(C1, C2) = 0.8
– Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: we want them to hash to at least 1 common bucket (at least one band is identical)
• Probability C1, C2 are identical in one particular band: (0.8)⁵ = 0.328
• Probability C1, C2 are not identical in any of the 20 bands: (1 − 0.328)²⁰ = 0.00035
– i.e., about 1/3000th of the 80%-similar column pairs are false negatives (we miss them)
– We would find 99.965% of the pairs of truly similar documents
44. C1, C2 are 30% Similar
• Find pairs of s = 0.8 similarity; set b = 20, r = 5
• Assume: sim(C1, C2) = 0.3
– Since sim(C1, C2) < s, we want C1, C2 to hash to NO common buckets (all bands should be different)
• Probability C1, C2 are identical in one particular band: (0.3)⁵ = 0.00243
• Probability C1, C2 are identical in at least 1 of 20 bands: 1 − (1 − 0.00243)²⁰ = 0.0474
– In other words, approximately 4.74% of pairs of docs with similarity 0.3 end up becoming candidate pairs
• They are false positives, since we will have to examine them (they are candidate pairs) but then it will turn out their similarity is below threshold s
45. LSH Involves a Tradeoff
• Pick:
– the number of Min-Hashes (rows of M),
– the number of bands b, and
– the number of rows r per band
to balance false positives/negatives
• Example: If we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up
46. What b Bands of r Rows Gives You
Let t = sim(C1, C2) be the similarity of two sets. Then:
• t^r = probability that all rows of a band are equal
• 1 − t^r = probability that some row of a band is unequal
• (1 − t^r)^b = probability that no bands are identical
• 1 − (1 − t^r)^b = probability that at least one band is identical, i.e., the probability of sharing a bucket
The similarity threshold at which this S-curve rises most steeply is s ≈ (1/b)^(1/r).
47. Example: b = 20; r = 5
• Similarity threshold s
• Prob. that at least 1 band is identical, 1 − (1 − s^r)^b:

s    1 − (1 − s⁵)²⁰
0.2  0.006
0.3  0.047
0.4  0.186
0.5  0.470
0.6  0.802
0.7  0.975
0.8  0.9996
(the table can be reproduced with the sketch below)
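The table follows directly from the formula on the previous slide:

b, r = 20, 5
for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"{s:.1f}  {1 - (1 - s**r)**b:.4f}")  # prob. of >= 1 identical band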
48. LSH Summary
• Tune M, b, r to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures
• Check in main memory that candidate pairs really do have similar signatures
• Optional: In another pass through the data, check that the remaining candidate pairs really represent similar documents
49. Summary: 3 Steps
• Shingling: Convert documents to sets
– We used hashing to assign each shingle an ID
• Min-Hashing: Convert large sets to short signatures, while preserving similarity
– We used similarity-preserving hashing to generate signatures with the property Pr[h_π(C1) = h_π(C2)] = sim(C1, C2)
– We used hashing to get around generating random permutations
• Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
– We used hashing to find candidate pairs of similarity ≥ s
Editor's Notes
• 10 nearest neighbors, 20,000 image database
• 10 nearest neighbors, 2.3 million image database
• Each pair of signatures agrees with probability s, so to estimate s we compute the fraction of hash functions on which they agree.
• False negative: an error in which a test result improperly indicates no presence of a condition (the result is negative), when in reality it is present.