This document discusses privacy-preserving record linkage (PPRL), which allows object matching between datasets while preserving privacy. It covers the basics of PPRL, including protocols, encodings with Bloom filters, and scalable approaches such as blocking and parallelization. It also evaluates techniques such as P4Join, metric-space (pivot-based) filtering, and GPU-based matching that make PPRL more efficient and further improve scalability for large datasets.
2. ScaDS Dresden/Leipzig
Big Data Integration
Scalable entity resolution / link discovery
Large-scale schema/ontology matching
Holistic data integration
Privacy-preserving record linkage (PPRL)
Privacy for Big Data
PPRL basics
Scalable PPRL
Graph-based data integration and analytics
Introduction
Graph-based data integration / business intelligence (BIIIG)
Hadoop-based graph analytics (GRADOOP)
AGENDA
2
3. Privacy
- right of individuals to determine by themselves when, how and to what
extent information about them is communicated to others (Agrawal
2002)
Privacy threats
• extensive collection of personal/private information / surveillance
• Information dissemination: disclosure of sensitive/confidential
information
• Invasions of privacy: intrusion attacks to obtain access to private
information
• Information aggregation: combining data, e.g. to enhance personal
profiles or identify persons (de-anonymization)
PRIVACY
3
Protection especially for personally identifiable information (PII)
name, birthdate, address, email address etc
healthcare and genetic records, financial records
criminal justice investigations and proceedings …
Challenge: preserve privacy despite need to use person-related data
for improved analysis / business success (advertisement,
recommendations), website optimizations, clinical/health studies,
identification of criminals …
tracking and profiling of web / smartphone / social network users
(different kinds of cookies, canvas fingerprinting …)
often user agreement needed
INFORMATION / DATA PRIVACY
4
8. Need for comprehensive privacy support (“privacy by design”)
Privacy-preserving publishing of datasets
Anonymization of datasets
Privacy-preserving data mining
analysis of anonymized data without re-identification
Privacy-preserving record linkage
object matching with encoded data to preserve privacy
prerequisite for privacy-preserving data mining
PRIVACY FOR BIG DATA
8
9. Anonymization
removing, generalizing or changing personally identifying attributes
so that people whom the data describe remain anonymous
no way to identify different records for same person (e.g., different
points in time) or to match/combine records from different sources
Pseudonymization
most identifying fields within a data record are replaced by
one or more artificial identifiers, or pseudonyms
one-way pseudonymization (e.g. one-way hash functions) vs.
re-identifiable pseudonymization
records with same pseudonym can be matched
improved potential for data analysis
ANONYMIZATION VS PSEUDONYMIZATION
9
10. RE-IDENTIFICATION OF
„ANONYMOUS DATA“ (SWEENEY 2001)
10
US voter registration data
69% unique on postal code (ZIP) and birth date
87% US-wide with sex, postal code and birth date
Solution approach: K-Anonymity
any combination of values appears
at least k times
generalize values, e.g., on ZIP or birth date
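To make the k-anonymity check concrete, here is a minimal Python sketch (illustrative, not from the slides): it tests whether every quasi-identifier combination occurs at least k times and applies a simple generalization of ZIP code and birth date; the attribute names and generalization rules are assumptions for this example.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check whether every combination of quasi-identifier values
    occurs at least k times in the dataset."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

def generalize(record):
    """Example generalization: keep only the first 3 ZIP digits
    and the birth year."""
    g = dict(record)
    g["zip"] = record["zip"][:3] + "**"
    g["birthdate"] = record["birthdate"][:4]   # YYYY only
    return g

records = [
    {"zip": "04109", "birthdate": "1980-05-12", "sex": "f"},
    {"zip": "04107", "birthdate": "1980-11-30", "sex": "f"},
    {"zip": "04109", "birthdate": "1980-02-01", "sex": "f"},
]
qi = ["zip", "birthdate", "sex"]
print(is_k_anonymous(records, qi, k=2))                          # False: each combination is unique
print(is_k_anonymous([generalize(r) for r in records], qi, 2))   # True after generalization
```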
12. Physically integrated data (e.g. data warehouse) about persons entails greatest
privacy risks
Data mining over distributed data can better protect personal data by limiting data
exchange, e.g. using SMC methods
PRIVACY-PRESERVING DATA MINING
12
(Diagram: centralized setting, where local data sources are integrated into a central data
warehouse on which data mining runs, versus distributed setting, where data mining runs
locally at each source and a combiner merges the local results.)
13. HORIZONTAL VS VERTICAL DATA DISTRIBUTION
13
Source: J. Vaidya, C. Clifton: Privacy-Preserving data mining: Why, How, and When. IEEE Security&Privacy 2004
14. object matching with encoded data to preserve privacy
data exchange / integration of person-related data
many use cases: medicine, sociology (“population informatics”),
business, …
privacy aspects
need to support secure 1-way encoding (pseudonymization)
protection against attacks to identify persons
conflicting requirements:
high privacy
match effectiveness (need to support fuzzy matches)
scalability to large datasets and many parties
PRIVACY-PRESERVING RECORD LINKAGE (PPRL)
14
15. PRIVACY-PRESERVING RECORD LINKAGE
15
(Taxonomy figure: dimensions include data encoding/encryption, blocking, and similarity functions.)
Source: Vatsalan, Christen, Verykios: A taxonomy of privacy-preserving record linkage techniques. Information Systems 2013
16. Two-party protocols
only two data owners communicate who wish to link their data
Three-party protocols
Use of a trusted third party (linkage unit, LU)
LU will never see unencoded data, but collusion is possible
Multi-party protocols (> 2 data owners)
with or without linkage unit
BASIC PPRL CONFIGURATIONS
16
Compute a function across several parties, such that no party learns the
information from the other parties, but all receive the final results
Example 1: millionaire problem
two millionaires, Alice and Bob, are interested in knowing which of
them is richer but without revealing their actual wealth.
Example 2: secure summation
SECURE MULTI-PARTY COMPUTATION (SMC)
20
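A minimal sketch of the secure-summation example (assuming honest-but-curious parties and no collusion): the initiating party masks its input with a random value, every party adds its own private value to the running total, and the initiator removes the mask at the end, so no party learns another party's input. The loop below only simulates the ring of parties within one process; the values and modulus are illustrative.

```python
import random

def secure_sum(values, modulus=2**32):
    """Ring-based secure summation: party 0 adds a random mask R,
    each party adds its private value to the masked running total,
    and party 0 subtracts R at the end."""
    R = random.randrange(modulus)
    running = R
    for v in values:                 # each value stays local to its party
        running = (running + v) % modulus
    return (running - R) % modulus   # party 0 removes the mask

print(secure_sum([17, 42, 5]))  # 64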
Adversary models (from cryptography)
Honest-But-Curious: parties follow agreed-on protocols
Malicious
Privacy attacks
Crack data encoding based on background knowledge:
Frequency attack
Dictionary attack
Collusion between parties
PRIVACY ASPECTS
21
22. effective and simple encoding uses cryptographic bloom filters
(Schnell et al., 2009)
tokenize all match-relevant attribute values, e.g. using bigrams or
trigrams
typical attributes: first name, last name (at birth), sex, date of birth, country of
birth, place of birth
map each token with a family of one-way hash functions to fixed-size
bit vector (fingerprint)
original data cannot be reconstructed
match of bit vectors (Jaccard similarity) is good approximation of
true match result
PPRL WITH BLOOM FILTERS
22
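A minimal Python sketch of this encoding; the concrete parameters (bigrams, 1000-bit fingerprints, 20 salted SHA-1 hash functions) are illustrative choices, not prescribed by the slides.

```python
import hashlib

def bigrams(value):
    """Split an attribute value into character bigrams."""
    v = value.lower().replace(" ", "")
    return {v[i:i + 2] for i in range(len(v) - 1)}

def bloom_fingerprint(record_values, m=1000, k=20):
    """Map all bigrams of the match-relevant attributes into an
    m-bit Bloom filter using k salted one-way hash functions."""
    bits = [0] * m
    for value in record_values:
        for token in bigrams(value):
            for i in range(k):
                h = hashlib.sha1(f"{i}:{token}".encode()).hexdigest()
                bits[int(h, 16) % m] = 1
    return bits

def jaccard(x, y):
    """Jaccard similarity of two bit vectors (set bit positions)."""
    inter = sum(a & b for a, b in zip(x, y))
    union = sum(a | b for a, b in zip(x, y))
    return inter / union if union else 0.0

a = bloom_fingerprint(["john", "smith", "1980-05-12"])
b = bloom_fingerprint(["jon",  "smith", "1980-05-12"])
print(jaccard(a, b))  # close to 1 for similar records
```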
same optimization techniques as for regular
object matching apply
(private) blocking approaches
filtering for specific similarity metrics / thresholds
to reduce number of comparisons
privacy-preserving PPJoin (P4Join)
metric space: utilize triangular inequality
parallel linkage (Hadoop, GPUs, …)
SCALABLE PPRL
24
25. Many possibilities: standard blocking, sorted-neighborhood, Locality-
sensitive Hashing (LSH) …
Simple approach for standard blocking
Each database performs agreed-upon standard blocking, e.g. using
soundex or another function on selected attributes (e.g., names)
Encoded records are transferred blockwise to LU, possibly with
added noise records for improved security
LU only matches records per block
different blocks could be processed in parallel
PRIVATE BLOCKING
25
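A sketch of this simple standard-blocking scheme at the linkage unit, assuming both data owners apply the same agreed-upon key function to selected attributes before encoding; the key used here (first letter of the last name plus birth year) is a simplified stand-in for Soundex, and added noise records are omitted.

```python
from collections import defaultdict
from itertools import product

def block_key(last_name, birthdate):
    """Agreed-upon blocking key, computed by each data owner locally.
    Simplified stand-in for Soundex or another agreed function."""
    return last_name[0].upper() + birthdate[:4]

def match_per_block(encoded_a, encoded_b, compare):
    """Linkage unit: group encoded records (block key, fingerprint) by key
    and only compare record pairs that share a block."""
    blocks_a, blocks_b = defaultdict(list), defaultdict(list)
    for key, fp in encoded_a:
        blocks_a[key].append(fp)
    for key, fp in encoded_b:
        blocks_b[key].append(fp)
    matches = []
    for key in blocks_a.keys() & blocks_b.keys():   # only shared blocks
        for fp_a, fp_b in product(blocks_a[key], blocks_b[key]):
            if compare(fp_a, fp_b):
                matches.append((key, fp_a, fp_b))
    return matches
```

The `compare` argument is expected to be a similarity test on the encoded fingerprints, e.g. a Jaccard comparison with threshold t as in the Bloom-filter sketch above; each block could also be processed in parallel.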
26. one of the most efficient similarity join algorithms
determine all pairs of records with simJaccard(x,y) ≥ t
use of filter techniques to reduce search space
length, prefix, and position filter
relatively easy to run in parallel
good candidate to improve scalability for PPRL
evaluate set bit positions instead of (string) tokens
PP-JOIN: POSITION PREFIX JOIN (XIAO ET AL, 2008)
26
27. matching records pairs must have similar lengths
length / cardinality: number of set bits in bit vector
Example for minimal similarity t = 0,8:
record B of length 4 cannot match with C and all records with
greater length (number of set positions), e.g., A
LENGTH FILTER
SimJaccard(x, y) ≥ t ⇒ |x| ≥ |y| ∗ t
ID   Bit vector                   card.
A    0 1 0 1 1 1 1 1 1 1 0 0 0    8
B    1 0 1 0 0 0 0 0 1 1 0 0 0    4
C    0 0 0 1 1 1 1 1 1 1 0 0 0    7
length filter: 7 ∗ 0.8 = 5.6 > 4
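The length filter is a plain cardinality check; a small sketch (fingerprints stored as Python integers, an assumption made for brevity):

```python
def card(bits: int) -> int:
    """Number of set positions in a fingerprint stored as an int."""
    return bin(bits).count("1")

def length_filter_ok(x: int, y: int, t: float) -> bool:
    """SimJaccard(x, y) >= t requires the smaller cardinality to be
    at least t times the larger one."""
    cx, cy = card(x), card(y)
    return min(cx, cy) >= max(cx, cy) * t

B = int("1010000011000", 2)   # card. 4
C = int("0001111111000", 2)   # card. 7
print(length_filter_ok(B, C, 0.8))  # False: 7 * 0.8 = 5.6 > 4
```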
Similar records must have a minimal overlap α in their sets of tokens (or set bit positions):
SimJaccard(x, y) ≥ t ⇔ Overlap(x, y) ≥ α = ⌈ t/(1+t) ∗ (|x| + |y|) ⌉
Prefix filter approximates this test
reorder bit positions for all fingerprints according to their overall frequency, from
infrequent to frequent
exclude pairs of records without any overlap in their prefixes, with prefix length
prefix_length(x) = ⌈ (1-t) ∗ |x| ⌉ + 1
Example (t = 0.8)
PREFIX FILTER
28
ID   reordered fingerprint            card.   prefix fingerprint
A    0 1 0 1 1 1 1 1 1 1 0 0 0 0      8       0 1 0 1 1
B    1 0 1 0 0 0 0 0 1 1 0 0 0 0      4       1 0 1
C    0 0 0 1 1 1 1 1 1 1 0 0 0 0      7       0 0 0 1 1 1
AND operation on the prefixes shows a non-zero result for C and A, so these records still need
to be considered for matching
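A sketch of the prefix filter on sets of set-bit positions, following the formulas above; the example records A and C and the threshold are taken from the slide, everything else (function names, identity position order) is illustrative.

```python
from collections import Counter
from math import ceil

def reorder_positions(fingerprints):
    """Order bit positions from infrequent to frequent over all fingerprints
    (fingerprints given as sets of set-bit positions)."""
    freq = Counter(p for fp in fingerprints for p in fp)
    return sorted({p for fp in fingerprints for p in fp}, key=lambda p: freq[p])

def prefix(fp, order, t):
    """Prefix of a fingerprint: its ceil((1-t)*|x|) + 1 least frequent positions."""
    length = ceil((1 - t) * len(fp)) + 1
    ranked = [p for p in order if p in fp]
    return set(ranked[:length])

def prefix_filter_ok(fp_x, fp_y, order, t):
    """Keep the pair as a candidate only if the prefixes share a position."""
    return bool(prefix(fp_x, order, t) & prefix(fp_y, order, t))

# records A and C from the example (set-bit positions of the reordered vectors)
A = {1, 3, 4, 5, 6, 7, 8, 9}
C = {3, 4, 5, 6, 7, 8, 9}
order = list(range(14))  # assume positions are already frequency-ordered
print(prefix_filter_ok(A, C, order, 0.8))  # True: prefixes overlap at positions 3 and 4
```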
improvement of the prefix filter to prune candidate pairs even when their prefixes overlap
estimate the maximally possible overlap and check whether it is below the minimal
overlap α needed to meet threshold t
original position filter considers the position of the last common prefix token
revised position filter
P4JOIN: POSITION FILTER
29
record x (length 9), prefix: 1 1 0 1
record y (length 8), prefix: 1 1 1
the highest prefix position (here the fourth position in x) limits the possible overlap with
the other record: the third position in y's prefix cannot overlap with x
maximal possible overlap = #shared prefix tokens (2) + min(9-3, 8-3) = 7
< minimal overlap α = 8
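A sketch of the position filter as used in the example above: the shared prefix positions plus the tokens that can still follow the prefixes bound the possible overlap, and the bound is compared against α. Function and parameter names are illustrative, and this follows the simplified formulation of the slide's example.

```python
from math import ceil

def min_overlap(card_x, card_y, t):
    """alpha: SimJaccard >= t  <=>  Overlap >= t/(1+t) * (|x| + |y|)."""
    return ceil(t / (1 + t) * (card_x + card_y))

def position_filter_ok(prefix_x, prefix_y, card_x, card_y, t):
    """Upper-bound the possible overlap: shared prefix positions plus the
    tokens that can still follow either prefix; prune if below alpha."""
    shared = len(set(prefix_x) & set(prefix_y))
    max_rest = min(card_x - len(prefix_x), card_y - len(prefix_y))
    return shared + max_rest >= min_overlap(card_x, card_y, t)

# example from the slide: |x| = 9, |y| = 8, prefixes of length 3 sharing 2 positions
print(position_filter_ok([0, 1, 3], [0, 1, 2], 9, 8, 0.8))  # False: 2 + min(6, 5) = 7 < alpha = 8
```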
30. comparison between NestedLoop, P4Join, MultiBitTree
MultiBitTree: best filter approach in previous work by Schnell
applies length filter and organizes fingerprints within a binary tree so
that fingerprints with the same set bits are grouped within sub-trees
can be used to filter out many fingerprints from comparison
two input datasets R, S
determined with FEBRL data generator
N=[100.000, 200.000, …, 500.000]. |R|=1/5⋅N, |S|=4/5⋅N
bit vector length: 1000
similarity threshold 0.8
EVALUATION
30
31. runtime in minutes on standard PC
similar results for P4Join and Multibit Tree
relatively small improvements compared to NestedLoop
EVALUATION RESULTS
31
Approach        Dataset size N: 100 000   200 000   300 000   400 000   500 000
NestedLoop        6.1    27.7    66.1   122.0   194.8
MultiBitTree      4.7    19.0    40.6    78.2   119.7
P4Join            2.3    15.5    40.1    77.8   125.5
32. Operations on bit vectors easy to compute on GPUs
Length and prefix filters
Jaccard similarity
Frameworks CUDA and OpenCL support data-parallel
execution of general computations on GPUs
program („kernel“) written in C dialect
limited to base data types (float, long, int, short, arrays)
no dynamic memory allocation (programmer controls memory
management)
important to minimize data transfer between main memory and
GPU memory
GPU-BASED PPRL
32
33. partition inputs R and S (fingerprints sorted by length) into equally-
sized partitions that fit into GPU memory
generate match tasks per pair of partitions
only transfer to GPU if the length intervals of the two partitions satisfy the length
filter
optional use of CPU thread to additionally match on CPU
EXECUTION SCHEME
(Diagram: R partitioned into R0-R3 and S into S0-S4 in main memory (host); match tasks
MRi,Sj are assigned to the GPU thread and optional CPU threads; GPU memory holds the
fingerprint partitions with their bits, cardinality and prefix; after the kernels for a task
such as R0 × S3 finish, the matches MR0,S3 are read back and partition S3 is replaced by S4.)
33
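A host-side sketch of this task generation, assuming fingerprints are given as sets of set-bit positions: the inputs are sorted by cardinality, cut into fixed-size partitions, and a partition pair only becomes a GPU match task if its cardinality intervals can satisfy the length filter.

```python
def partitions(fingerprints, size):
    """Split cardinality-sorted fingerprints into equally sized partitions and
    remember each partition's [min, max] cardinality interval."""
    fps = sorted(fingerprints, key=len)            # fingerprints as sets of positions
    parts = [fps[i:i + size] for i in range(0, len(fps), size)]
    return [(p, len(p[0]), len(p[-1])) for p in parts]

def match_tasks(parts_r, parts_s, t):
    """Generate (Ri, Sj) tasks, skipping pairs whose cardinality intervals
    cannot contain any pair passing the length filter."""
    tasks = []
    for i, (_, rmin, rmax) in enumerate(parts_r):
        for j, (_, smin, smax) in enumerate(parts_s):
            if rmax >= smin * t and smax >= rmin * t:
                tasks.append((i, j))   # only these partition pairs go to the GPU
    return tasks
```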
34. 100 000 200 000 300 000 400 000 500 000
GForce GT 610 0.33 1.32 2.95 5.23 8.15
GeForce GT 540M 0.28 1.08 2.41 4.28 6.67
GPU-BASED EVALUATION RESULTS
34
GeForce GT 610
• 48 Cuda Cores@810MHz
• 1GB
• 35€
GeForce GT 540M
• 96 Cuda Cores@672MHz
• 1GB
improvements by up to a factor of 20, despite low-profile graphic cards
still non-linear increase in execution time with growing data volume
utilize triangle inequality to avoid similarity computations
consider distances for query object q to pivot object p
check whether we need to determine distance d(o,q) to object o for
given similarity threshold (maximal distance max-dist=rad(q))
From triangle inequality we know:
d(p,o) + d(o,q) ≥ d(p,q) ⇒ d(p,q) - d(p,o) ≤ d(o,q)
we can safely exclude object o if d(p,q) – d(p,o) > max-dist = rad(q)
REDUCTION OF SEARCH SPACE
36
(Diagram: query object q with radius rad(q), pivot p at distance d(p,q), and an object o
assigned to p.)
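The pruning condition can be expressed as a tiny predicate; with Jaccard similarity threshold t, the corresponding distance threshold is max-dist = 1 - t (a sketch, with fingerprints as sets of set-bit positions).

```python
def jaccard_distance(x: set, y: set) -> float:
    """Metric (Jaccard) distance between fingerprints."""
    union = len(x | y)
    return 1.0 - len(x & y) / union if union else 0.0

def can_skip(d_pq: float, d_po: float, max_dist: float) -> bool:
    """d(p,o) + d(o,q) >= d(p,q) implies d(o,q) >= d(p,q) - d(p,o);
    if that lower bound exceeds max_dist = rad(q), skip computing d(o,q)."""
    return d_pq - d_po > max_dist
```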
37. Preprocessing:
Given: a set of objects (fingerprints) S1 to be matched
choose a subset P={p1,…, pj} of objects to be the pivots
different possibilities, e.g., random subset
assign each object oi to the nearest pivot pj and save the distance d(pj,oi),
determine radius(p) as maximal distance from p to its assigned objects
Matching:
Given: set of query objects (fingerprints) S2, threshold max-dist
for each q of S2 and each pivot p do:
determine distance d(p,q)
If d(p,q) <= rad(p)+max-dist:
for each object assigned to p
check whether it can be skipped due to triangular inequality (d(p,q)-d(p,o)> max-dist)
otherwise compare o with q (o matches q if d(o,q) <= max-dist)
LINKAGE APPROACH
37
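A compact sketch of this pivot-based linkage, directly following the steps above (random pivots, nearest-pivot assignment with stored distances and radii, then pruning per query object). The choice of Jaccard distance as the metric follows the earlier slides; the function names and the random pivot selection are illustrative.

```python
import random

def dist(x: set, y: set) -> float:
    """Jaccard distance between fingerprints (sets of set-bit positions)."""
    union = len(x | y)
    return 1.0 - len(x & y) / union if union else 0.0

def preprocess(s1, num_pivots, seed=42):
    """Choose random pivots, assign every object to its nearest pivot and
    remember the distance; track each pivot's radius (maximal distance)."""
    rng = random.Random(seed)
    pivots = rng.sample(list(s1), num_pivots)
    assigned = {j: [] for j in range(num_pivots)}   # pivot index -> [(object, d(p,o))]
    radius = [0.0] * num_pivots
    for o in s1:
        d_po, j = min((dist(p, o), j) for j, p in enumerate(pivots))
        assigned[j].append((o, d_po))
        radius[j] = max(radius[j], d_po)
    return pivots, assigned, radius

def match(s2, pivots, assigned, radius, max_dist):
    """For each query q, visit only pivots with d(p,q) <= rad(p) + max_dist and
    skip objects o where d(p,q) - d(p,o) > max_dist (triangle inequality)."""
    matches = []
    for q in s2:
        for j, p in enumerate(pivots):
            d_pq = dist(p, q)
            if d_pq > radius[j] + max_dist:
                continue                      # whole pivot cluster can be skipped
            for o, d_po in assigned[j]:
                if d_pq - d_po > max_dist:
                    continue                  # pruned without computing d(o,q)
                if dist(o, q) <= max_dist:
                    matches.append((o, q))
    return matches
```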
38. Comparison with previous approaches using the same datasets
Runtime in minutes (using faster PC than in previous evaluation)
Pivot-based approach shows the best results and is up to 40X faster
than other algorithms
EVALUATION
38
Algorithm               Dataset size N: 100 000   200 000   300 000   400 000   500 000
NestedLoop                3.8    20.8    52.1    96.8   152.6
MultiBitTree              2.6    11.3    26.5    50.0    75.9
P4Join                    1.4     7.4    24.1    52.3    87.9
Pivots (metric space)     0.2     0.4     0.9     1.3     1.7
39. Privacy for Big Data
privacy-preserving publishing / record linkage / data mining
tradeoff between protection of personal/sensitive data and data
utility for analysis
complete anonymization prevents record linkage -> 1-way
pseudonymization of sensitive attributes good compromise
Scalable Privacy-Preserving Record Linkage
bloom filters allow simple, effective and relatively efficient match
approach
performance improvements by blocking / filtering / parallel PPRL
effective filtering by P4Join and utilizing metric-space distance
functions
GPU usage achieves significant speedup
SUMMARY
39
40. High PPRL match quality for real, dirty data
Quantitative evaluation of privacy characteristics
Efficient SMC approaches for multiple sources without linkage unit
Combined study of PPRL + data mining
OUTLOOK / CHALLENGES
40
E. A. Durham, M. Kantarcioglu, Y. Xue, C. Tóth, M. Kuzu, B. Malin: Composite Bloom Filters for Secure Record Linkage. IEEE Trans. Knowl. Data Eng. 26(12): 2956-2968 (2014)
B. Fung, K. Wang, R. Chen, P. S. Yu: Privacy-preserving data publishing: A survey of recent developments. ACM CSUR 2010
R. Hall, S. E. Fienberg: Privacy-Preserving Record Linkage. Privacy in Statistical Databases 2010: 269-283
D. Karapiperis, V. S. Verykios: A distributed near-optimal LSH-based framework for privacy-preserving record linkage. Comput. Sci. Inf. Syst. 11(2): 745-763 (2014)
C. W. Kelman, A. J. Bass, C. D. Holman: Research use of linked health data - a best practice protocol. Aust NZ J Public Health, 2002
R. Schnell, T. Bachteler, J. Reiher: Privacy-preserving record linkage using Bloom filters. BMC Med. Inf. & Decision Making 9: 41 (2009)
R. Schnell: Privacy-preserving Record Linkage and Privacy-preserving Blocking for Large Files with Cryptographic Keys Using Multibit Trees. Proc. Joint Statistical Meetings, American Statistical Association, 2013
Z. Sehili, L. Kolb, C. Borgs, R. Schnell, E. Rahm: Privacy Preserving Record Linkage with PPJoin. Proc. BTW Conf. 2015
Z. Sehili, E. Rahm: Speeding Up Privacy Preserving Record Linkage for Metric Space Similarity Measures. TR, Univ. Leipzig, 2016
J. Vaidya, C. Clifton: Privacy-Preserving data mining: Why, How, and When. IEEE Security & Privacy, 2004
D. Vatsalan, P. Christen, V. S. Verykios: An evaluation framework for privacy-preserving record linkage. Journal of Privacy and Confidentiality 2014
D. Vatsalan, P. Christen, C. M. O'Keefe, V. S. Verykios: A taxonomy of privacy-preserving record linkage techniques. Information Systems 2013
C. Xiao, W. Wang, X. Lin, J. X. Yu: Efficient Similarity Joins for Near Duplicate Detection. Proc. WWW 2008
REFERENCES
41