The document describes a method for accelerating set similarity joins using graphics processing units (GPUs). It proposes using MinHash to estimate the Jaccard similarity between sets and generate signature matrices for the sets. These signature matrices are then processed on the GPU in parallel to perform the similarity join and detect similar records above a given threshold. Experiments show the GPU implementation achieves speedups of up to 150x over serial CPU processing and 25x over parallel CPU processing.
More Related Content

A database application differs from regular applications in that some of its inputs may be database queries. The program will execute the queries on a database and may use any result values in its subsequent program logic. This means that a user-supplied query may determine the values that the application will use in subsequent branching conditions. At the same time, a new database application is often required to work well on a body of existing data stored in some large database. For systematic testing of database applications, recent techniques replace the existing database with carefully crafted mock databases. Mock databases return values that will trigger as many execution paths in the application as possible and thereby maximize overall code coverage of the database application.
In this paper we offer an alternative approach to database application testing. Our goal is to support software engineers in focusing testing on the existing body of data the application is required to work well on. For that, we propose to side-step mock database generation and instead generate queries for the existing database. Our key insight is that we can use the information collected during previous program executions to systematically generate new queries that will maximize the coverage of the application under test, while guaranteeing that the generated test cases focus on the existing data.
Evaluating Classification Algorithms Applied To Data Streams (Esteban Donato)
This document summarizes and evaluates several algorithms for classification of data streams: VFDTc, UFFT, and CVFDT. It describes their approaches for handling concept drift and for detecting outliers and noise. The algorithms were tested on synthetic data streams generated with configurable attributes such as drift frequency and noise percentage. Results show VFDTc and UFFT performed best in accuracy, while CVFDT and UFFT were fastest. The study aims to help choose algorithms suited to different data stream characteristics, such as gradual versus sudden drift or frequent versus infrequent drift.
The document discusses different machine learning techniques including regression, classification, clustering, anomaly detection, and recommendation. It then provides examples of data and labels that could be used for training models with these techniques. It also discusses topics like updating model weights, learning rates, and derivatives or gradients of cost functions. Finally, it provides examples of using Azure machine learning services to train models with cloud resources and deploy them for consumption.
On Improving the Performance of Data Leak Prevention using White-list Approach (Patrick Nguyen)
This document proposes improving data leak prevention performance using a white-list approach and Bloom filters. It summarizes previous related work using blacklists and keywords to detect leaks. The authors then improve upon prior work by Fang Hao et al. that used CRC to create fingerprints, by using hash functions with Bloom filters instead to generate fingerprints faster while maintaining accuracy. Experiments test five hash functions on a 9.3GB dataset to evaluate system throughput and percentage of leaked files.
The document discusses stacks as a data structure. It defines a stack as a list where all insertions and deletions are made at one end, called the top. Stacks follow the LIFO (last in, first out) principle. The document provides examples of stack implementations using arrays in C++ and describes the basic stack operations like push, pop, peek, and isEmpty. It also gives examples of real-world stacks like stacks of books, chairs, and cups.
Data Wrangling and Visualization Using Python (MOHITKUMAR1379)
Python is open source and has many libraries for data wrangling and visualization that make data scientists' lives easier. For data wrangling, pandas is the usual choice: it represents tabular data and provides functions for parsing data from different sources, cleaning data, handling missing values, merging data sets, and more. For visualization, the low-level matplotlib library can be used directly, but it also serves as the base for higher-level packages such as seaborn, which draws well-customized plots in a single line of code. Python's Dash framework makes it possible to build interactive web applications in pure Python, without JavaScript or HTML. These Dash applications can be published on any server as well as on clouds such as Google Cloud, and for free on the Heroku cloud.
Entity Resolution is the task of disambiguating manifestations of real world entities through linking and grouping and is often an essential part of the data wrangling process. There are three primary tasks involved in entity resolution: deduplication, record linkage, and canonicalization; each of which serves to improve data quality by reducing irrelevant or repeated data, joining information from disparate records, and providing a single source of information to perform analytics upon. However, due to data quality issues (misspellings or incorrect data), schema variations in different sources, or simply different representations, entity resolution is not a straightforward process and most ER techniques utilize machine learning and other stochastic approaches.
Introduction to the R Statistical Computing Environment (izahn)
Get an introduction to R, the open-source system for statistical computation and graphics. With hands-on exercises, learn how to import and manage datasets, create R objects, and conduct basic statistical analyses. Full workshop materials can be downloaded from http://projects.iq.harvard.edu/rtc/event/introduction-r
Experiments on Design Pattern Discovery (Tim Menzies)
The document describes experiments conducted to discover design patterns from source code. It outlines the approach taken by DP-Miner tool, presents experiment data on four Java systems, and evaluates results by calculating precision and recall values. Benchmarks are lacking for accurately evaluating design pattern discovery techniques.
The document introduces distributed stream processing. It discusses maintaining synopses of streams using single-pass, small space and time algorithms. Distributed queries can be one-shot or continuous, requiring approximation to minimize communication. Tree-based aggregation and decentralized gossiping are introduced for in-network processing. Handling message loss and node failures is also important. Future work includes stream mining queries and compressing XML streams.
This document summarizes a presentation on analyzing word co-occurrences in text data using network analysis techniques. It discusses counting the frequency of word combinations, representing the co-occurrence data as a network with nodes for words and edges for co-occurrences, and visualizing the network in Gephi. It also provides an example analysis of tweets about a political debate, examining which topics were emphasized by each candidate based on word associations on Twitter.
SPSS (Statistical Package for the Social Sciences) is software used for data analysis. It can process questionnaires, report data in tables and graphs, and analyze means, chi-squares, regression, and more. Originally its own company, SPSS is now owned by IBM and integrated into their software portfolio. The document provides an overview of using SPSS, including entering data from questionnaires, different question/response formats, and descriptive statistical analysis functions in SPSS like frequencies, cross-tabs, and graphs.
The document describes an activity analysis and visualization project with the following objectives:
1. Build a system to support groups in learning how to work more effectively through visualizing collaboration data logs.
2. Develop different types of visualizations like activity radars and interaction networks to provide insights into participation, interactions, and timelines of events.
3. Apply data mining techniques to find frequent patterns and sequences of events that characterize aspects of teamwork.
The document describes the automated construction of a large semantic network called SemNet. It analyzes a large text corpus to extract terms and relations using n-gram analysis, part-of-speech tagging, and pattern matching. SemNet contains over 2.7 million terms and 37.5 million relations. The document evaluates SemNet by comparing it to WordNet and ConceptNet, finding that it contains over 77% of WordNet synsets and over 82% of ConceptNet nouns.
This document discusses the process of compiling programs from source code to executable code. It covers lexical analysis, parsing, semantic analysis, code optimization, and code generation. The overall compilation process involves breaking the source code into tokens, generating an abstract syntax tree, performing semantic checks, translating to intermediate representations, optimizing the code, and finally generating target machine code.
This paper describes the outcome of an attempt to implement the same transitive closure (TC) algorithm for Apache MapReduce running on different Apache Hadoop distributions. Apache MapReduce is a software framework used with Apache Hadoop, which has become the de facto standard platform for processing and storing large amounts of data in a distributed computing environment. The research presented here focuses on the variations observed among the results of an efficient iterative transitive closure algorithm when run against different distributed environments. The results from these comparisons were validated against the benchmark results from OYSTER, an open source Entity Resolution system. The experiment results highlighted the inconsistencies that can occur when using the same codebase with different implementations of MapReduce.
VARIATIONS IN OUTCOME FOR THE SAME MAP REDUCE TRANSITIVE CLOSURE ALGORITHM IM... (ijcsit)
This document summarizes research that implemented the same transitive closure algorithm for entity resolution on three different Apache Hadoop distributions: a local HDFS cluster, Cloudera Enterprise, and Talend Big Data Sandbox. The algorithm was run on a synthetic dataset to discover entity clusters. While the local HDFS cluster produced consistent results matching the baseline, the Cloudera and Talend platforms had inconsistent results due to differences in configuration requirements, load balancing, and blocking behavior across nodes. The experiments highlighted scalability issues for entity resolution processes in distributed environments due to inconsistencies introduced by differences in platform implementations.
Visual data mining combines traditional data mining methods with information visualization techniques to explore large datasets. There are three levels of integration between visualization and automated mining methods - no/limited integration, loose integration where methods are applied sequentially, and full integration where methods are applied in parallel. Different visualization methods exist for univariate, bivariate and multivariate data based on the type and dimensions of the data. The document describes frameworks and algorithms for visual data mining, including developing new algorithms interactively through a visual interface. It also summarizes a document on using data mining and visualization techniques for selective visualization of large spatial datasets.
This document discusses the concepts of parallel programming including processes, communication, and synchronization. It describes different approaches to parallel programming such as monitors, message passing, and synchronous communication. The document then introduces SR (Synchronizing Resources) as a language that unifies these approaches and is well-suited for conventional and distributed systems. It provides examples of basic SR concepts like resources, processes, and statements, as well as examples of communication and parallel matrix multiplication in SR.
The document discusses building human-based software estimation models that are accurate, intuitive, and easy to understand. It presents an approach using correlation and scale factors between estimated and actual effort. Experiments on a dataset of 178 samples show that combining correlation and scale factors into a decision tree achieves up to 93.3% accuracy. The resulting model bridges expert and algorithmic estimation methods.
The document outlines a presentation on regression analysis using Stata. It discusses Stata's features and windows. It covers data structure types like cross-sectional, panel, and time series data. Regression diagnostics like normality, heteroskedasticity, multicollinearity, and specification are explained. Other regression models like logistic, probit, and Poisson are also covered. The presentation concludes with suggestions for presenting results and suggested readings.
"An Evaluation of Models for Runtime Approximation in Link Discovery" as presented in the IEEE/WIC/ACM WI, August 25th, 2017, held in Leipzig, Germany.
This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
The document discusses some of the promises and perils of mining software repositories like Git and GitHub for research purposes. It notes that while these sources contain rich data on software development, there are also challenges to consider. For example, decentralized version control systems like Git allow private collaboration that may be missed. And most GitHub projects are personal and inactive, while it is also used for storage and hosting. The document recommends researchers approach these data sources carefully and provides lessons on how to properly analyze and interpret the data from repositories like Git and GitHub.
Inverted Index Based Multi-Keyword Public-key Searchable Encryption with Stro... (Mateus S. H. Cruz)
This document summarizes a research paper that proposes an encrypted search scheme using an inverted index to allow for multi-keyword queries on encrypted data. The key contributions are: (1) supporting the reuse of the same encrypted index for multiple queries while preserving query privacy, (2) enabling conjunctive multi-keyword searches, and (3) providing efficiency by only using multiplication and exponentiation operations. The proposed scheme uses an encrypted inverted index along with trapdoor generation and private set intersection techniques to enable accurate yet private searches on outsourced encrypted data.
Privacy-Preserving Search for Chemical Compound Databases (Mateus S. H. Cruz)
Presentation about the paper "Privacy-Preserving Search for Chemical Compound Databases"*.
This presentation is based on the uploader's understanding of the paper and may contain inaccurate interpretations.
A summary of the paper is available at: https://mshcruz.wordpress.com/2016/09/02/summary-privacy-preserving-search-for-chemical-compound-databases/
*Shimizu et al.: "Privacy-Preserving Search for Chemical Compound Databases". BMC Bioinformatics 2015.
Privacy-Preserving Multi-Keyword Fuzzy Search over Encrypted Data in the Cloud (Mateus S. H. Cruz)
The document proposes a method for privacy-preserving multi-keyword fuzzy search over encrypted data. It uses Bloom filters to represent encrypted indexes and queries, and locality sensitive hashing functions to allow fuzzy matching of keywords. An inner product calculation is used to determine similarity between encrypted indexes and queries. The proposal includes an enhanced scheme that adds a pseudorandom function for additional security against background knowledge attacks. Experiments demonstrate the performance and accuracy of the approach.
Fuzzy Keyword Search over Encrypted Data in Cloud Computing (Mateus S. H. Cruz)
The document proposes a wildcard-based approach for efficient fuzzy keyword search over encrypted data stored in the cloud. It aims to address the large fuzzy sets and high storage costs of the straightforward approach by using wildcards to denote edit operations. This allows for a more efficient construction of smaller fuzzy sets and reduced storage requirements, while still maintaining search privacy.
Fast, Private and Verifiable: Server-aided Approximate Similarity Computation... (Mateus S. H. Cruz)
Presentation given at the SWIM seminar (University of Tsukuba) about the paper "Fast, Private and Verifiable: Server-aided Approximate Similarity Computation over Large-Scale Datasets"*.
This presentation is based on the uploader's understanding of the paper and may contain inaccurate interpretations.
A summary of the paper is available at: https://mshcruz.wordpress.com/2016/08/05/summary-fast-private-and-verifiable-server-aided-approximate-similarity-computation-over-large-scale-datasets/
*Qiu et al.: "Fast, Private and Verifiable: Server-aided Approximate Similarity Computation over Large-Scale Datasets". SCC 2016.
Realizing Fine-Grained and Flexible Access Control to Outsourced Data with At... (Mateus S. H. Cruz)
Presentation given at the SWIM Seminar (University of Tsukuba) about the paper "Realizing Fine-Grained and Flexible Access Control to Outsourced Data with Attribute-Based Cryptosystems"*.
This presentation is based on the uploader's understanding of the paper and may contain inaccurate interpretations.
A summary of the paper is available at: https://mshcruz.wordpress.com/2016/07/22/summary-fine-grained-access-control-using-abe-and-abs/
*Zhao et al.: "Realizing Fine-Grained and Flexible Access Control to Outsourced Data with Attribute-Based Cryptosystems". ISPEC 2011.
DBMask: Fine-Grained Access Control on Encrypted Relational Databases (Mateus S. H. Cruz)
Presentation given at the SWIM Seminar (University of Tsukuba) about DBMask*.
This presentation is based on the uploader's understanding of the paper and may contain inaccurate interpretations.
A summary of the paper is available at: https://mshcruz.wordpress.com/2016/07/15/summary-dbmask/
*Nabeel et al.: "DBMask: Fine-Grained Access Control on Encrypted Relational Databases". CODASPY 2015.
ENKI: Access Control for Encrypted Query Processing (Mateus S. H. Cruz)
Presentation given at the SWIM Seminar (University of Tsukuba) about ENKI*.
This presentation is based on the uploader's understanding of the paper and may contain inaccurate interpretations.
A summary of the paper is available at: https://mshcruz.wordpress.com/2016/07/11/summary-enki/
*Hang et al.: "ENKI: Access Control for Encrypted Query Processing". SIGMOD 2015.
Presentation given at the SWIM Seminar (University of Tsukuba) about MONOMI*.
This presentation is based on the uploader's understanding of the paper and may contain inaccurate interpretations.
A summary of the paper is available at: https://mshcruz.wordpress.com/2016/07/01/summary-monomi/
*Tu et al.: "Processing Analytical Queries over Encrypted Data". VLDB 2013.
Presentation given at the KDE Seminar (University of Tsukuba) about CryptDB*.
This presentation is based on the uploader's understanding of the paper and may contain inaccurate interpretations.
A summary of the paper is available at: https://mshcruz.wordpress.com/2016/06/24/summary-cryptdb/
The official website for CryptDB is: http://css.csail.mit.edu/cryptdb/
*Popa et al.: "CryptDB: Protecting Confidentiality with Encrypted Query Processing". SOSP 2011.
9. Set Similarity Join
Outline: Introduction · Tools · Proposal · Preprocessing · Signature Matrix · Join · Experiments · Summary

Find similar records given a similarity threshold (δ)
Strings can be seen as sets of words (tokens)
Set similarity metric: Jaccard similarity (JS)
Problem: expensive processing

Example (δ = 0.6): join the Student and University tables on the university name.

Student
  Name | Univ. Name
  Bob  | Tsukuba Univ.
  Mary | Harvard Univ.
  John | Harvard Univ.
  Anna | Univ. of Berlin

University
  Univ. Name       | Country
  Univ. of Tsukuba | Japan
  Harvard Univ.    | USA
  Univ. of Berlin  | Germany
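To make the metric concrete, here is a minimal host-side sketch (illustrative; not code from the slides) that tokenizes two of the strings from the example above and computes their Jaccard similarity:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split a string into its set of word tokens.
std::set<std::string> tokenize(const std::string& s) {
    std::istringstream in(s);
    std::set<std::string> tokens;
    std::string tok;
    while (in >> tok) tokens.insert(tok);
    return tokens;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|.
double jaccard(const std::set<std::string>& a, const std::set<std::string>& b) {
    std::vector<std::string> inter;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(inter));
    double uni = a.size() + b.size() - inter.size();
    return uni == 0 ? 0.0 : inter.size() / uni;
}

int main() {
    // {Tsukuba, Univ.} vs. {Univ., of, Tsukuba}: 2 shared of 3 distinct tokens.
    std::cout << jaccard(tokenize("Tsukuba Univ."),
                         tokenize("Univ. of Tsukuba")) << "\n";  // 0.666...
}

At δ = 0.6 this pair qualifies as a match (JS ≈ 0.67); evaluating JS over every record pair is what makes a naive join expensive.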
13. Related Work

Serial similarity joins:
  Xiao et al., Efficient Similarity Joins for Near Duplicate Detection, TODS 2011
Parallel similarity joins using MapReduce:
  Vernica et al., Efficient Parallel Set-similarity Joins Using MapReduce, SIGMOD 2010
Parallel similarity joins using GPU:
  Lieberman et al., A Fast Similarity Join Algorithm Using Graphics Processing Units, ICDE 2008 (normed metric)
  Böhm et al., Index-supported Similarity Join on Graphics Processors, BTW 2009 (Euclidean distance)
17. MinHash¹

Estimates Jaccard similarity
Apply hash functions to the sets and keep the minimum hash value
Similar sets are likely to share the same minimum hash value
Use the hash values to create signatures
  Parts of signatures: bins
Good coupling with GPU (Li et al., GPU-based Minwise Hashing, WWW 2012)
  Efficient storage
  Suitable for parallel processing

¹ Broder, On the Resemblance and Containment of Documents, Compression and Complexity of Sequences: Proceedings 1997
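The sketch below illustrates the estimation idea on the host side (a reconstruction, not the authors' GPU code; the seeded hash family is an assumption): each of k hash functions contributes the minimum hash value over a set's tokens, and the fraction of positions where two signatures agree approximates their Jaccard similarity.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <iostream>
#include <set>
#include <string>
#include <vector>

// One cheap parameterized hash per signature position (assumed for illustration).
uint64_t h(const std::string& tok, uint64_t seed) {
    return std::hash<std::string>{}(tok) ^ (seed * 0x9e3779b97f4a7c15ULL);
}

// MinHash signature: for each seed, the minimum hash over the set's tokens.
std::vector<uint64_t> signature(const std::set<std::string>& s, int k) {
    std::vector<uint64_t> sig(k, UINT64_MAX);
    for (int i = 0; i < k; ++i)
        for (const auto& tok : s)
            sig[i] = std::min(sig[i], h(tok, i + 1));
    return sig;
}

// Fraction of agreeing positions approximates the Jaccard similarity.
double estimate(const std::vector<uint64_t>& a, const std::vector<uint64_t>& b) {
    int eq = 0;
    for (size_t i = 0; i < a.size(); ++i) eq += (a[i] == b[i]);
    return double(eq) / a.size();
}

int main() {
    std::set<std::string> r = {"Univ.", "of", "Tsukuba"}, s = {"Tsukuba", "Univ."};
    std::cout << estimate(signature(r, 128), signature(s, 128)) << "\n";
}

For the two university strings from the earlier example, the printed estimate is roughly the exact JS of 2/3, and it sharpens as k grows.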
32. Result Output

The result size is initially unknown
  Cannot allocate memory beforehand
  Write conflicts between blocks
Three-phase scheme for result output²:
  1. Execute the join and find the number of similar pairs for each block (e.g., per-block counts: 4 2 0 2)
  2. Execute a scan over these counts to obtain the initial writing position for each block (e.g., 0 4 6 6)
  3. Allocate the result array, execute the join again, and output the similar pairs

² He et al., Relational Joins on Graphics Processors, SIGMOD 2008
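A hedged CUDA sketch of this three-phase scheme (reconstructed from the description above, not the authors' code): is_similar is a placeholder predicate standing in for the real signature comparison, and one thread block handles one record of R.

#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Placeholder for the real similarity test over signatures (assumption).
__device__ bool is_similar(int r, int s) { return ((r + s) & 255) == 0; }

// Phase 1: run the join, only counting the matches found by each block.
__global__ void count_pairs(int nS, int* blockCounts) {
    __shared__ int total;                        // per-block pair counter
    if (threadIdx.x == 0) total = 0;
    __syncthreads();
    int r = blockIdx.x, local = 0;               // one block per record of R
    for (int s = threadIdx.x; s < nS; s += blockDim.x)
        if (is_similar(r, s)) ++local;
    atomicAdd(&total, local);
    __syncthreads();
    if (threadIdx.x == 0) blockCounts[r] = total;
}

// Phase 3: run the join again, writing pairs at this block's reserved offset.
__global__ void write_pairs(int nS, const int* writePos, int2* out) {
    __shared__ int cursor;                       // next free slot for this block
    if (threadIdx.x == 0) cursor = writePos[blockIdx.x];
    __syncthreads();
    int r = blockIdx.x;
    for (int s = threadIdx.x; s < nS; s += blockDim.x)
        if (is_similar(r, s))
            out[atomicAdd(&cursor, 1)] = make_int2(r, s);
}

int main() {
    int nR = 1024, nS = 1024;
    thrust::device_vector<int> counts(nR), pos(nR);
    count_pairs<<<nR, 256>>>(nS, thrust::raw_pointer_cast(counts.data()));
    // Phase 2: exclusive scan turns counts into initial writing positions.
    thrust::exclusive_scan(counts.begin(), counts.end(), pos.begin());
    int total = pos.back() + counts.back();      // exact result size
    thrust::device_vector<int2> result(total);
    write_pairs<<<nR, 256>>>(nS, thrust::raw_pointer_cast(pos.data()),
                             thrust::raw_pointer_cast(result.data()));
    cudaDeviceSynchronize();
    std::printf("similar pairs: %d\n", total);
}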
42. Detailed Setup
(Experiments: Environment · Parameters · MinHash Alg. · Join Alg.)

Compilers: GCC 4.4.7 (-O3), NVCC 6.5 (-O3 -use_fast_math), OpenMP 4.0

Component         | Specification
CPU               | Intel Xeon CPU E5-1650
CPU cores         | 6 (12 threads with Hyper-Threading)
CPU clock         | 3.50 GHz
Main memory       | 32 GB
GPU               | NVIDIA Tesla K20Xm
Scalar processors | 2688
Processor clock   | 732 MHz
Global memory     | 6 GB
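For reference, a build invocation consistent with these flags might look as follows (the file and binary names are assumptions, not taken from the slides):

nvcc -O3 -use_fast_math -Xcompiler -fopenmp ssjoin.cu -o ssjoin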
43. Parameters

Not much impact on performance/accuracy:
  Threads per block
  Similarity threshold
  Join selectivity

[Three plots of elapsed time (s): vs. number of threads per block, 32 to 1024 (GPU); vs. similarity threshold, 0.2 to 1.0 (GPU); vs. join selectivity, 0.01 to 0.5 (CPU serial, CPU parallel, GPU). Abstracts dataset, |R| = |S| = 131,072.]
44. Parallel MinHash Algorithm

Algorithm 1: Parallel MinHash
  input : characteristic matrix CM of size t × d (t tokens, d documents); number of bins b
  output: signature matrix SM of size d × b (d documents, b bins)

  binSize ← t/b
  for i ← 0 to d in parallel do        // executed by GPU blocks
    for j ← 0 to t in parallel do      // executed by GPU threads
      if CM[j][i] = 1 then
        h ← hash(j)                    // hash of the token index
        binIdx ← h / binSize
        SM[i][binIdx] ← min(SM[i][binIdx], h)
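A hedged CUDA rendering of Algorithm 1 (my reconstruction: it assumes hash(·) is applied to the token index j and behaves like a permutation of 0..t-1, so that bin indices stay within range; the hash constants are illustrative, not the authors'):

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative hash over token indices 0..t-1 (assumed): an affine map that
// acts like a pseudo-random permutation when the multiplier is coprime to t.
__device__ unsigned hashToken(unsigned j, unsigned t) {
    return (2654435761u * j + 12345u) % t;
}

// Algorithm 1: one block per document i, threads stride over the t tokens.
// CM is t x d (CM[j*d + i]); SM is d x b, pre-filled with UINT_MAX.
__global__ void minhash(const char* CM, unsigned* SM, int t, int d, int b) {
    int i = blockIdx.x;
    int binSize = t / b;                        // assumes b divides t
    for (int j = threadIdx.x; j < t; j += blockDim.x)
        if (CM[j * d + i] == 1) {
            unsigned h = hashToken(j, t);       // hash of the token index
            int binIdx = h / binSize;           // signature bin for this hash
            atomicMin(&SM[i * b + binIdx], h);  // keep the per-bin minimum
        }
}

int main() {
    const int t = 8, d = 2, b = 2;
    char hCM[t * d] = {0};
    int ones0[] = {0, 2, 3, 5}, ones1[] = {0, 2, 4, 5};  // two toy documents
    for (int k = 0; k < 4; ++k) {
        hCM[ones0[k] * d + 0] = 1;
        hCM[ones1[k] * d + 1] = 1;
    }
    char* CM; unsigned* SM;
    cudaMalloc(&CM, sizeof hCM);
    cudaMalloc(&SM, d * b * sizeof(unsigned));
    cudaMemcpy(CM, hCM, sizeof hCM, cudaMemcpyHostToDevice);
    cudaMemset(SM, 0xFF, d * b * sizeof(unsigned));      // UINT_MAX sentinel
    minhash<<<d, 128>>>(CM, SM, t, d, b);
    unsigned hSM[d * b];
    cudaMemcpy(hSM, SM, sizeof hSM, cudaMemcpyDeviceToHost);
    for (int i = 0; i < d; ++i)
        std::printf("doc %d signature: %u %u\n", i, hSM[i * b], hSM[i * b + 1]);
}

Because hashing is independent per token and per document, the min-reductions parallelize naturally across blocks and threads, which is the "good coupling with GPU" noted earlier.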
45. Parallel NLJ Algorithm

Algorithm 2: Parallel nested-loop join
  input : collections R and S; similarity threshold δ
  output: pairs of documents whose similarity is at least δ

  foreach r ∈ R in parallel do         // executed by GPU blocks
    foreach s ∈ S do
      if Sim(r, s) ≥ δ then
        output(r, s)
5/5