SlideShare a Scribd company logo
1
Building a PII
scrubbing layer
Benefits of masking and tokenization of PII
data.
2
1. Ease of access for the data analysts
2. Ensuring the stakeholders from various systems feel safe in providing access to data.
3. Ease of sharing the data.
Primary transformations in our scrubbing
service
Building blocks for other transformations
3
Mask
B68 Shastri Nagar
Bhopal, 462003
xxxxx
Tokenise
Acc no:
00014590900
ac8a58d5da5fd35a7f
60cb906a724589e76
1f25e8fbea4346fd01
23ea351ea51
Derive
Age: 26
Age 20-30
Wipes out
information
completely
Useful for preserving
uniqueness
Useful for selective
preservation of
information
First glimpse:
- We needed cron to invoke this python script.
- It will pick up files from the disk, scrub, tokenize and write to S3.
Initial set of ideas:
1. Python with simple CSV reader.
2. One step ahead, maybe we can use pandas, provides good support for various
readers, and dataframe abstraction.
A simple scrubbing of CSVs, write the output to S3
4
Second glance:
Since, we had chosen pandas, with sqlalchemy interface we were able to achieve this.
We started to notice few behaviours:
1. Too much RAM was being consumed.
2. Reads were slow.
Usual suspects:
1. Intermediate copies.
2. Loading more data then needed.
3. No lazy evaluation.
4. No parallelization
Possible remedies:
1. Load less data aka chunking
2. Use efficient data types.
We need to connect to DBs.
5
Final reckoning
1. We need to connect to plethora of databases systems.
2. We might end up ingesting various raw formats like excels etc.
3. Orchestration would not be as simple as cron.
4. Backfilling of huge amounts of data.
5. We started to see pandas teeter under requirements were we needed to join
multiple data sources and perform aggregates, before pushing the scrubbed
dataset to S3.
We realised pandas won’t scale. We had to migrate to spark.
Towards the sprint end, some realisations
6
You do not need TBs of data for using spark.
Using spark does not necessarily mean you would have to invest and maintain into a spark cluster. If
your scale is small, start with smaller spark setup. You can gradually step up as the scale of data
increases
You can bring a pet elephant, and feed it accordingly.
7
Requirement Deployment mode
One job at a time and data size in few GBs. Local mode, (single node) (local storage)
Multiple jobs with resource allocation in FIFO scheduling and
data size in few GBs.
Standalone (Single node) (local storage)
Multiple jobs with resource allocation in FIFO scheduling and
data size in few 100 GBs.
Standalone (Multi-node or if a big machine using containers)
(local storage)
Multiple jobs with resource allocation in advanced scheduling
and data size in few 100 GBs.
Cluster mode (with YARN, Mesos etc as schedulers)
(distributed storage)
Ways to tokenize your data
Problem statement: In place of real keys in the data we need to put tokens, for given
key the token should be same, and given a key we should be able to tell the real data.
With spark in our arsenal, the next problem to solve.
8
Needs mapping table
No need of mapping
table
create
lookup
Encryption
Hashing with salt
Hashing with pepper
Using UUIDs
Using UUIDs
1. Though using UUIDs for generating token are better as they are completely
random and has no relation with the original data, but they pose the problem of
fetchOrCreate between parallel spark jobs.
9
Lookup Table (Cached Table or Table)
token(12232)
token(12232) token(42)
token(100)
Making a choice
10
Client needs
Lookup table is must and should be
encrypted, this gives them more control
over the tokenized information, than the
encryption method.
Our needs
We wanted to avoid fetchOrCreate problem
with multiple spark jobs trying to lookup
already generated token value.
We wanted the creation of tokens such that
two jobs arrive at the same token without
any coordination.
Using hashing with pepper
11
00014590900
ac8a58d5da5fd35a7f60cb90
6a724589e761f25e8fbea434
6fd0123ea351ea51
CUSTOMER_CODE_TOKEN CUSTOMER_CODE
ac8a58d5da5fd35a7f60
cb906a724589e761f25e
8fbea4346fd0123ea351
ea51
adfa58d5da5fd35a7f60c
b90sdfsd761f25e8fbea4
346fd0123ea351ea51==
CUSTOMER_CODE_TOKEN CUSTOMER_CODE
ac8a58d5da5fd35a7f60
cb906a724589e761f25e
8fbea4346fd0123ea351
ea51
adfa58d5da5fd35a7f60c
b90sdfsd761f25e8fbea4
346fd0123ea351ea51==
adfa58d5da5fd35a7f60
cb90sdfsd761f25e8fbe
a4346fd0123ea351ea5
1==
00014590900
lookup
store
fetch
decrypt
Token creation
Token fetch
The current state of our scrubbing service
1. Airflow 1.10.10 on docker, with CeleryExecutor 5 workers and 1 hourly worker.
2. 48 cores CPU and 128 GB RAM.
3. Spark and airflow on simple docker-compose.
4. 103 data pipelines pushing data to AWS from 28 data sources
5. Oracle has 9,43,68,195 number of tokens.
6. Alerts using emails and prometheus, grafana setup.
7. RBAC UI
The PII gatekeeper to the cloud
12
Spark on docker
1. Mount all the spark related scratch directories. Spark.local.dir or docker diff to
figure out what other data the container is generating.
2. Watchout for zombies: killing jobs, or executor retries leaves zombie processes
behind, since docker has no init system. Add tini or use docker run --init.
A glimpse of few challenges you might face
13
And the stored PII data mapping
It’s something like this
14
Thank You

More Related Content

What's hot

Coral & Transport UDFs: Building Blocks of a Postmodern Data Warehouse​
Coral & Transport UDFs: Building Blocks of a Postmodern Data Warehouse​Coral & Transport UDFs: Building Blocks of a Postmodern Data Warehouse​
Coral & Transport UDFs: Building Blocks of a Postmodern Data Warehouse​
Walaa Eldin Moustafa
 
Washington DC DataOps Meetup -- Nov 2019
Washington DC DataOps Meetup   -- Nov 2019Washington DC DataOps Meetup   -- Nov 2019
Washington DC DataOps Meetup -- Nov 2019
DataKitchen
 
Hazelcast Essentials
Hazelcast EssentialsHazelcast Essentials
Hazelcast Essentials
Rahul Gupta
 
Introdução ao OpenLayers
Introdução ao OpenLayersIntrodução ao OpenLayers
Introdução ao OpenLayers
Fernando Quadro
 
Revolutionizing the Energy Industry with Graphs
Revolutionizing the Energy Industry with GraphsRevolutionizing the Energy Industry with Graphs
Revolutionizing the Energy Industry with Graphs
Neo4j
 
Geospatial Options in Apache Spark
Geospatial Options in Apache SparkGeospatial Options in Apache Spark
Geospatial Options in Apache Spark
Databricks
 
Cebuco Gebieden
Cebuco GebiedenCebuco Gebieden
Cebuco Gebieden
Peteroptimaal
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
Brendan Tierney
 
공간데이터베이스(Spatial db)
공간데이터베이스(Spatial db)공간데이터베이스(Spatial db)
공간데이터베이스(Spatial db)
H.J. SIM
 
Improving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of ServiceImproving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of Service
Ming Ma
 
Using a Semantic and Graph-based Data Catalog in a Modern Data Fabric
Using a Semantic and Graph-based Data Catalog in a Modern Data FabricUsing a Semantic and Graph-based Data Catalog in a Modern Data Fabric
Using a Semantic and Graph-based Data Catalog in a Modern Data Fabric
Cambridge Semantics
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
Databricks
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introduction
Alexey Grigorev
 
Big data case study collection
Big data   case study collectionBig data   case study collection
Big data case study collection
Luis Miguel Salgado
 
Big query
Big queryBig query
Big query
Tanvi Parikh
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Spark Summit
 
AWS Services Overview - September 2016 Webinar Series
AWS Services Overview - September 2016 Webinar SeriesAWS Services Overview - September 2016 Webinar Series
AWS Services Overview - September 2016 Webinar Series
Amazon Web Services
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
Julien Le Dem
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
Cloudera, Inc.
 
Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?
DATAVERSITY
 

What's hot (20)

Coral & Transport UDFs: Building Blocks of a Postmodern Data Warehouse​
Coral & Transport UDFs: Building Blocks of a Postmodern Data Warehouse​Coral & Transport UDFs: Building Blocks of a Postmodern Data Warehouse​
Coral & Transport UDFs: Building Blocks of a Postmodern Data Warehouse​
 
Washington DC DataOps Meetup -- Nov 2019
Washington DC DataOps Meetup   -- Nov 2019Washington DC DataOps Meetup   -- Nov 2019
Washington DC DataOps Meetup -- Nov 2019
 
Hazelcast Essentials
Hazelcast EssentialsHazelcast Essentials
Hazelcast Essentials
 
Introdução ao OpenLayers
Introdução ao OpenLayersIntrodução ao OpenLayers
Introdução ao OpenLayers
 
Revolutionizing the Energy Industry with Graphs
Revolutionizing the Energy Industry with GraphsRevolutionizing the Energy Industry with Graphs
Revolutionizing the Energy Industry with Graphs
 
Geospatial Options in Apache Spark
Geospatial Options in Apache SparkGeospatial Options in Apache Spark
Geospatial Options in Apache Spark
 
Cebuco Gebieden
Cebuco GebiedenCebuco Gebieden
Cebuco Gebieden
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
공간데이터베이스(Spatial db)
공간데이터베이스(Spatial db)공간데이터베이스(Spatial db)
공간데이터베이스(Spatial db)
 
Improving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of ServiceImproving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of Service
 
Using a Semantic and Graph-based Data Catalog in a Modern Data Fabric
Using a Semantic and Graph-based Data Catalog in a Modern Data FabricUsing a Semantic and Graph-based Data Catalog in a Modern Data Fabric
Using a Semantic and Graph-based Data Catalog in a Modern Data Fabric
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
 
Data engineering zoomcamp introduction
Data engineering zoomcamp  introductionData engineering zoomcamp  introduction
Data engineering zoomcamp introduction
 
Big data case study collection
Big data   case study collectionBig data   case study collection
Big data case study collection
 
Big query
Big queryBig query
Big query
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
AWS Services Overview - September 2016 Webinar Series
AWS Services Overview - September 2016 Webinar SeriesAWS Services Overview - September 2016 Webinar Series
AWS Services Overview - September 2016 Webinar Series
 
Data pipelines observability: OpenLineage & Marquez
Data pipelines observability:  OpenLineage & MarquezData pipelines observability:  OpenLineage & Marquez
Data pipelines observability: OpenLineage & Marquez
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?Who Should Own Data Governance – IT or Business?
Who Should Own Data Governance – IT or Business?
 

Similar to Building a PII scrubbing layer

Oscon 2019 - Optimizing analytical queries on Cassandra by 100x
Oscon 2019 - Optimizing analytical queries on Cassandra by 100xOscon 2019 - Optimizing analytical queries on Cassandra by 100x
Oscon 2019 - Optimizing analytical queries on Cassandra by 100x
shradha ambekar
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Ali Hodroj
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
Databricks
 
MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014
Dylan Tong
 
ClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
ClickHouse Analytical DBMS. Introduction and usage, by Alexander ZaitsevClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
ClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
Altinity Ltd
 
Real World Performance - Data Warehouses
Real World Performance - Data WarehousesReal World Performance - Data Warehouses
Real World Performance - Data Warehouses
Connor McDonald
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015
dhiguero
 
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
SegFaultConf
 
Apache Spark 3.0: Overview of What’s New and Why Care
Apache Spark 3.0: Overview of What’s New and Why CareApache Spark 3.0: Overview of What’s New and Why Care
Apache Spark 3.0: Overview of What’s New and Why Care
Databricks
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData
 
Leveraging Apache Spark and Delta Lake for Efficient Data Encryption at Scale
Leveraging Apache Spark and Delta Lake for Efficient Data Encryption at ScaleLeveraging Apache Spark and Delta Lake for Efficient Data Encryption at Scale
Leveraging Apache Spark and Delta Lake for Efficient Data Encryption at Scale
Databricks
 
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...
Riccardo Zamana
 
SQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query setSQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query set
Kognitio
 
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational DatabaseSequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
wangzhonnew
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Ceph Community
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Denodo
 
Spark DC Interactive Meetup: HTAP with Spark and In-Memory Data Grids
Spark DC Interactive Meetup: HTAP with Spark and In-Memory Data GridsSpark DC Interactive Meetup: HTAP with Spark and In-Memory Data Grids
Spark DC Interactive Meetup: HTAP with Spark and In-Memory Data Grids
Ali Hodroj
 
ieee paper
ieee paper ieee paper
ieee paper
Shweta Dolhare
 
Managing Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBManaging Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDB
Jason Terpko
 
Proposed Lightweight Block Cipher Algorithm for Securing Internet of Things
Proposed Lightweight Block Cipher Algorithm for Securing Internet of ThingsProposed Lightweight Block Cipher Algorithm for Securing Internet of Things
Proposed Lightweight Block Cipher Algorithm for Securing Internet of Things
Seddiq Q. Abd Al-Rahman
 

Similar to Building a PII scrubbing layer (20)

Oscon 2019 - Optimizing analytical queries on Cassandra by 100x
Oscon 2019 - Optimizing analytical queries on Cassandra by 100xOscon 2019 - Optimizing analytical queries on Cassandra by 100x
Oscon 2019 - Optimizing analytical queries on Cassandra by 100x
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014
 
ClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
ClickHouse Analytical DBMS. Introduction and usage, by Alexander ZaitsevClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
ClickHouse Analytical DBMS. Introduction and usage, by Alexander Zaitsev
 
Real World Performance - Data Warehouses
Real World Performance - Data WarehousesReal World Performance - Data Warehouses
Real World Performance - Data Warehouses
 
Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015Adios hadoop, Hola Spark! T3chfest 2015
Adios hadoop, Hola Spark! T3chfest 2015
 
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
 
Apache Spark 3.0: Overview of What’s New and Why Care
Apache Spark 3.0: Overview of What’s New and Why CareApache Spark 3.0: Overview of What’s New and Why Care
Apache Spark 3.0: Overview of What’s New and Why Care
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
 
Leveraging Apache Spark and Delta Lake for Efficient Data Encryption at Scale
Leveraging Apache Spark and Delta Lake for Efficient Data Encryption at ScaleLeveraging Apache Spark and Delta Lake for Efficient Data Encryption at Scale
Leveraging Apache Spark and Delta Lake for Efficient Data Encryption at Scale
 
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...
 
SQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query setSQL on Hadoop benchmarks using TPC-DS query set
SQL on Hadoop benchmarks using TPC-DS query set
 
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational DatabaseSequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
 
Spark DC Interactive Meetup: HTAP with Spark and In-Memory Data Grids
Spark DC Interactive Meetup: HTAP with Spark and In-Memory Data GridsSpark DC Interactive Meetup: HTAP with Spark and In-Memory Data Grids
Spark DC Interactive Meetup: HTAP with Spark and In-Memory Data Grids
 
ieee paper
ieee paper ieee paper
ieee paper
 
Managing Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBManaging Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDB
 
Proposed Lightweight Block Cipher Algorithm for Securing Internet of Things
Proposed Lightweight Block Cipher Algorithm for Securing Internet of ThingsProposed Lightweight Block Cipher Algorithm for Securing Internet of Things
Proposed Lightweight Block Cipher Algorithm for Securing Internet of Things
 

Recently uploaded

一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
zsafxbf
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
Vineet
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
Vineet
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
aguty
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
22ad0301
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
NABLAS株式会社
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
agdhot
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
nyvan3
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
nhutnguyen355078
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
9gr6pty
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
eudsoh
 
一比一原版雷丁大学毕业证(UoR毕业证书)学历如何办理
一比一原版雷丁大学毕业证(UoR毕业证书)学历如何办理一比一原版雷丁大学毕业证(UoR毕业证书)学历如何办理
一比一原版雷丁大学毕业证(UoR毕业证书)学历如何办理
mbawufebxi
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
Vineet
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
keesa2
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
yuvarajkumar334
 
一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理
一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理
一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理
lzdvtmy8
 

Recently uploaded (20)

一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
 
一比一原版雷丁大学毕业证(UoR毕业证书)学历如何办理
一比一原版雷丁大学毕业证(UoR毕业证书)学历如何办理一比一原版雷丁大学毕业证(UoR毕业证书)学历如何办理
一比一原版雷丁大学毕业证(UoR毕业证书)学历如何办理
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
 
一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理一比一原版悉尼大学毕业证如何办理
一比一原版悉尼大学毕业证如何办理
 
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCAModule 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
Module 1 ppt BIG DATA ANALYTICS_NOTES FOR MCA
 
一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理
一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理
一比一原版格里菲斯大学毕业证(Griffith毕业证书)学历如何办理
 

Building a PII scrubbing layer

  • 2. Benefits of masking and tokenization of PII data. 2 1. Ease of access for the data analysts 2. Ensuring the stakeholders from various systems feel safe in providing access to data. 3. Ease of sharing the data.
  • 3. Primary transformations in our scrubbing service Building blocks for other transformations 3 Mask B68 Shastri Nagar Bhopal, 462003 xxxxx Tokenise Acc no: 00014590900 ac8a58d5da5fd35a7f 60cb906a724589e76 1f25e8fbea4346fd01 23ea351ea51 Derive Age: 26 Age 20-30 Wipes out information completely Useful for preserving uniqueness Useful for selective preservation of information
  • 4. First glimpse: - We needed cron to invoke this python script. - It will pick up files from the disk, scrub, tokenize and write to S3. Initial set of ideas: 1. Python with simple CSV reader. 2. One step ahead, maybe we can use pandas, provides good support for various readers, and dataframe abstraction. A simple scrubbing of CSVs, write the output to S3 4
  • 5. Second glance: Since, we had chosen pandas, with sqlalchemy interface we were able to achieve this. We started to notice few behaviours: 1. Too much RAM was being consumed. 2. Reads were slow. Usual suspects: 1. Intermediate copies. 2. Loading more data then needed. 3. No lazy evaluation. 4. No parallelization Possible remedies: 1. Load less data aka chunking 2. Use efficient data types. We need to connect to DBs. 5
  • 6. Final reckoning 1. We need to connect to plethora of databases systems. 2. We might end up ingesting various raw formats like excels etc. 3. Orchestration would not be as simple as cron. 4. Backfilling of huge amounts of data. 5. We started to see pandas teeter under requirements were we needed to join multiple data sources and perform aggregates, before pushing the scrubbed dataset to S3. We realised pandas won’t scale. We had to migrate to spark. Towards the sprint end, some realisations 6
  • 7. You do not need TBs of data for using spark. Using spark does not necessarily mean you would have to invest and maintain into a spark cluster. If your scale is small, start with smaller spark setup. You can gradually step up as the scale of data increases You can bring a pet elephant, and feed it accordingly. 7 Requirement Deployment mode One job at a time and data size in few GBs. Local mode, (single node) (local storage) Multiple jobs with resource allocation in FIFO scheduling and data size in few GBs. Standalone (Single node) (local storage) Multiple jobs with resource allocation in FIFO scheduling and data size in few 100 GBs. Standalone (Multi-node or if a big machine using containers) (local storage) Multiple jobs with resource allocation in advanced scheduling and data size in few 100 GBs. Cluster mode (with YARN, Mesos etc as schedulers) (distributed storage)
  • 8. Ways to tokenize your data Problem statement: In place of real keys in the data we need to put tokens, for given key the token should be same, and given a key we should be able to tell the real data. With spark in our arsenal, the next problem to solve. 8 Needs mapping table No need of mapping table create lookup Encryption Hashing with salt Hashing with pepper Using UUIDs
  • 9. Using UUIDs 1. Though using UUIDs for generating token are better as they are completely random and has no relation with the original data, but they pose the problem of fetchOrCreate between parallel spark jobs. 9 Lookup Table (Cached Table or Table) token(12232) token(12232) token(42) token(100)
  • 10. Making a choice 10 Client needs Lookup table is must and should be encrypted, this gives them more control over the tokenized information, than the encryption method. Our needs We wanted to avoid fetchOrCreate problem with multiple spark jobs trying to lookup already generated token value. We wanted the creation of tokens such that two jobs arrive at the same token without any coordination.
  • 11. Using hashing with pepper 11 00014590900 ac8a58d5da5fd35a7f60cb90 6a724589e761f25e8fbea434 6fd0123ea351ea51 CUSTOMER_CODE_TOKEN CUSTOMER_CODE ac8a58d5da5fd35a7f60 cb906a724589e761f25e 8fbea4346fd0123ea351 ea51 adfa58d5da5fd35a7f60c b90sdfsd761f25e8fbea4 346fd0123ea351ea51== CUSTOMER_CODE_TOKEN CUSTOMER_CODE ac8a58d5da5fd35a7f60 cb906a724589e761f25e 8fbea4346fd0123ea351 ea51 adfa58d5da5fd35a7f60c b90sdfsd761f25e8fbea4 346fd0123ea351ea51== adfa58d5da5fd35a7f60 cb90sdfsd761f25e8fbe a4346fd0123ea351ea5 1== 00014590900 lookup store fetch decrypt Token creation Token fetch
  • 12. The current state of our scrubbing service 1. Airflow 1.10.10 on docker, with CeleryExecutor 5 workers and 1 hourly worker. 2. 48 cores CPU and 128 GB RAM. 3. Spark and airflow on simple docker-compose. 4. 103 data pipelines pushing data to AWS from 28 data sources 5. Oracle has 9,43,68,195 number of tokens. 6. Alerts using emails and prometheus, grafana setup. 7. RBAC UI The PII gatekeeper to the cloud 12
  • 13. Spark on docker 1. Mount all the spark related scratch directories. Spark.local.dir or docker diff to figure out what other data the container is generating. 2. Watchout for zombies: killing jobs, or executor retries leaves zombie processes behind, since docker has no init system. Add tini or use docker run --init. A glimpse of few challenges you might face 13
  • 14. And the stored PII data mapping It’s something like this 14

Editor's Notes

  1. T In the initial phase of the project we found the Data Scientists struggling to get data access in the bank. Instead of the database access dumps are often provided, and any change in the dump requires another set of approvals.