Building a PII scrubbing layer


Benefits of masking and tokenization of PII data
1. Ease of access for data analysts.
2. Stakeholders from various systems feel safe providing access to their data.
3. Ease of sharing the data.

Primary transformations in our scrubbing service
Building blocks for other transformations
1. Mask: wipes out information completely.
   Example: "B68 Shastri Nagar, Bhopal, 462003" becomes "xxxxx".
2. Tokenise: useful for preserving uniqueness.
   Example: "Acc no: 00014590900" becomes "ac8a58d5da5fd35a7f60cb906a724589e761f25e8fbea4346fd0123ea351ea51".
3. Derive: useful for selective preservation of information.
   Example: "Age: 26" becomes "Age 20-30".
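A minimal Python sketch of the three building blocks, using the examples from this slide. The function names are hypothetical, and plain SHA-256 is only a stand-in for the token step, not necessarily what the service uses.

```python
import hashlib


def mask(value: str, fill: str = "xxxxx") -> str:
    """Mask: discard the value entirely and return a fixed placeholder."""
    return fill


def tokenise(value: str) -> str:
    """Tokenise: equal inputs give equal tokens, so uniqueness survives.

    Plain SHA-256 is a stand-in; a real scheme would mix in a secret.
    """
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def derive_age_band(age: int, width: int = 10) -> str:
    """Derive: keep only the bucket the age falls into."""
    lower = (age // width) * width
    return f"Age {lower}-{lower + width}"


record = {
    "address": "B68 Shastri Nagar, Bhopal, 462003",
    "acc_no": "00014590900",
    "age": 26,
}
scrubbed = {
    "address": mask(record["address"]),
    "acc_no": tokenise(record["acc_no"]),
    "age": derive_age_band(record["age"]),
}
```

Masking loses everything, tokenising keeps equality between rows, and deriving keeps a coarse version of the value.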

First glimpse:
A simple scrubbing of CSVs, writing the output to S3
- We needed cron to invoke a Python script.
- It would pick up files from disk, scrub and tokenize them, and write the result to S3.
Initial set of ideas:
1. Python with a simple CSV reader.
2. One step further: pandas, which provides good support for various readers and a dataframe abstraction.
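A rough sketch of the first idea (plain Python with a simple CSV reader). The function name, column names and sample row are made up for illustration, and the upload to S3 is left out: the scrubbed output is written to an in-memory buffer instead.

```python
import csv
import io


def scrub_csv(src, dst, mask_columns, fill="xxxxx"):
    """Stream rows from src to dst, replacing the listed columns with a mask."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    rows = 0
    for row in reader:
        for column in mask_columns:
            row[column] = fill
        writer.writerow(row)
        rows += 1
    return rows


# In-memory stand-ins for the file on disk and the output object.
source = io.StringIO("name,address,city\nAsha,B68 Shastri Nagar,Bhopal\n")
output = io.StringIO()
count = scrub_csv(source, output, mask_columns=["address"])
```

Because rows are streamed one at a time, memory use stays flat regardless of file size.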

Second glance:
We need to connect to DBs.
Since we had chosen pandas, we were able to achieve this with its SQLAlchemy interface.
We started to notice a few behaviours:
1. Too much RAM was being consumed.
2. Reads were slow.
Usual suspects:
1. Intermediate copies.
2. Loading more data than needed.
3. No lazy evaluation.
4. No parallelization.
Possible remedies:
1. Load less data, aka chunking.
2. Use efficient data types.
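The two remedies can be sketched with pandas as follows. This assumes pandas is installed; the in-memory SQLite table, its schema and its sample values are hypothetical stand-ins for a real source database.

```python
import sqlite3

import pandas as pd

# In-memory SQLite table standing in for a real source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, city TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(i, "Bhopal" if i % 2 else "Indore", 20 + i % 50) for i in range(10)],
)

total_rows = 0
chunk_dtypes = None
# Remedy 1: select only the needed columns and read them 4 rows at a time.
query = "SELECT city, age FROM customers"
for chunk in pd.read_sql_query(query, conn, chunksize=4):
    # Remedy 2: shrink int64 to the smallest integer type that fits,
    # and store repeated strings once as a categorical.
    chunk["age"] = pd.to_numeric(chunk["age"], downcast="integer")
    chunk["city"] = chunk["city"].astype("category")
    chunk_dtypes = chunk.dtypes
    total_rows += len(chunk)
conn.close()
```

Each chunk can be scrubbed and written out before the next one is read, so only one chunk is held in memory at a time.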

Final reckoning
Towards the sprint end, some realisations
1. We need to connect to a plethora of database systems.
2. We might end up ingesting various raw formats, such as Excel files.
3. Orchestration would not be as simple as cron.
4. Backfilling of huge amounts of data.
5. We started to see pandas teeter under requirements where we needed to join multiple data sources and perform aggregates before pushing the scrubbed dataset to S3.
We realised pandas won’t scale. We had to migrate to Spark.

You do not need TBs of data for using Spark.
You can bring a pet elephant, and feed it accordingly.
Using Spark does not necessarily mean you have to invest in and maintain a Spark cluster. If your scale is small, start with a smaller Spark setup. You can gradually step up as the scale of data increases.
Requirement and matching deployment mode:
1. One job at a time, data size of a few GBs: local mode (single node, local storage).
2. Multiple jobs with FIFO resource scheduling, data size of a few GBs: standalone (single node, local storage).
3. Multiple jobs with FIFO resource scheduling, data size of a few 100 GBs: standalone (multi-node, or containers on one big machine; local storage).
4. Multiple jobs with advanced resource scheduling, data size of a few 100 GBs: cluster mode (with YARN, Mesos etc. as schedulers; distributed storage).
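In code, stepping up between these modes is mostly a matter of changing the master URL. A small sketch: the helper names and the "spark-master" host are hypothetical, and build_session needs pyspark installed.

```python
def master_url(mode: str, host: str = "spark-master", port: int = 7077,
               cores: str = "*") -> str:
    """Return the Spark master URL for a deployment mode."""
    if mode == "local":
        # Everything runs in a single JVM on one machine.
        return f"local[{cores}]"
    if mode == "standalone":
        # Spark's built-in standalone cluster manager.
        return f"spark://{host}:{port}"
    if mode == "yarn":
        # The cluster location comes from the Hadoop configuration.
        return "yarn"
    raise ValueError(f"unknown deployment mode: {mode}")


def build_session(app_name: str, mode: str):
    """Create a SparkSession for the chosen mode (requires pyspark)."""
    # Imported lazily so master_url can be used without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    return builder.master(master_url(mode)).getOrCreate()
```

The job code itself stays the same; only the value passed to master() changes as the setup grows.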

Ways to tokenize your data
With Spark in our arsenal, the next problem to solve.
Problem statement: in place of real keys in the data we need to put tokens. For a given key the token should always be the same, and given a token we should be able to recover the real data.
[Diagram: tokenization options (encryption, hashing with salt, hashing with pepper, using UUIDs), grouped by whether creating or looking up a token needs a mapping table.]

Using UUIDs
UUIDs are a better way to generate tokens, as they are completely random and have no relation to the original data. However, they pose the fetchOrCreate problem between parallel Spark jobs: each job must check whether a token already exists for a key before creating one.
[Diagram: token(12232), token(12232), token(42) and token(100) requests against the Lookup Table (cached table or table).]
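The fetchOrCreate problem can be illustrated with a small simulation. Plain dicts stand in for each job's view of the lookup table; the function and variable names are hypothetical.

```python
import uuid


def fetch_or_create(lookup: dict, key: str) -> str:
    """Return the existing token for key, or create a random UUID token."""
    if key not in lookup:
        lookup[key] = str(uuid.uuid4())
    return lookup[key]


# Two jobs each start from their own snapshot of the (empty) lookup table.
shared_table = {}
job_a_view = dict(shared_table)
job_b_view = dict(shared_table)

# Both jobs see the same key and neither finds an existing token.
token_a = fetch_or_create(job_a_view, "00014590900")
token_b = fetch_or_create(job_b_view, "00014590900")
```

Without coordination, the two jobs end up with different tokens for the same key, which breaks the requirement that a given key always maps to the same token.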

Making a choice
Client needs
A lookup table is a must and should be encrypted. This gives them more control over the tokenized information than the encryption method does.
Our needs
We wanted to avoid the fetchOrCreate problem, with multiple Spark jobs trying to look up an already generated token value.
We wanted tokens to be created such that two jobs arrive at the same token without any coordination.

Using hashing with pepper
Token creation
1. The customer code 00014590900 is hashed to the token ac8a58d5da5fd35a7f60cb906a724589e761f25e8fbea4346fd0123ea351ea51.
2. The token is looked up in the mapping table and stored together with the encrypted customer code:
   CUSTOMER_CODE_TOKEN: ac8a58d5da5fd35a7f60cb906a724589e761f25e8fbea4346fd0123ea351ea51
   CUSTOMER_CODE: adfa58d5da5fd35a7f60cb90sdfsd761f25e8fbea4346fd0123ea351ea51==
Token fetch
1. The token is looked up in the mapping table and the stored CUSTOMER_CODE value adfa58d5da5fd35a7f60cb90sdfsd761f25e8fbea4346fd0123ea351ea51== is fetched.
2. The stored value is decrypted back to 00014590900.
The current state of our scrubbing service
The PII gatekeeper to the cloud
1. Airflow 1.10.10 on Docker, with CeleryExecutor: 5 workers and 1 hourly worker.
2. 48 CPU cores and 128 GB RAM.
3. Spark and Airflow on a simple docker-compose setup.
4. 103 data pipelines pushing data to AWS from 28 data sources.
5. Oracle holds 9,43,68,195 tokens (about 94.4 million).
6. Alerts using email, plus a Prometheus and Grafana setup.
7. RBAC UI.

Spark on Docker
A glimpse of a few challenges you might face
1. Mount all the Spark-related scratch directories. Check spark.local.dir, or use docker diff to figure out what other data the container is generating.
2. Watch out for zombies: killing jobs, or executor retries, leaves zombie processes behind, since a container has no init process by default. Add tini or use docker run --init.

And the stored PII data mapping
It’s something like this
