Automation for test data anonymization
1. TEST DATA ANONYMIZATION
Prateek Gupta
Transforming real data into realistic test data
2. TEST DATA ANONYMIZATION
Test data anonymization is a critical practice in data privacy and software
testing. It transforms sensitive information in a dataset used for testing
so that individuals' privacy is protected and data protection regulations
are met. Personal or sensitive data is therefore not exposed during testing,
while organizations can still evaluate the functionality, performance, and
security of their software or systems.
3. THE NEED FOR TEST DATA ANONYMIZATION
Organizations today collect vast amounts of data for various purposes, including software development and testing.
This data often contains personally identifiable information (PII) and other sensitive information.
Data protection regulations like GDPR and HIPAA require organizations to protect this data.
Test data anonymization is necessary to fulfill these obligations and protect individuals' privacy.
4. COMMON TECHNIQUES FOR TEST DATA ANONYMIZATION
Tokenization: Replace sensitive data with tokens; mapping a token back to the original value requires access to a secure database.
Synthetic Data Generation: Generate fictional data mirroring the characteristics of the original data.
Data Masking: Replace sensitive data with fake or pseudonymous data.
Data Encryption: Convert sensitive data into a scrambled format that can be read only with the decryption key.
Data Subset Selection: Use a subset of non-sensitive data for testing.
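Two of the techniques above, tokenization and data masking, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TokenVault` class, the `mask_email` helper, the salt value, and the sample record are all hypothetical, and the in-memory dictionary merely stands in for the secure database a real tokenization service would use.

```python
import hashlib
import secrets


class TokenVault:
    """Hypothetical in-memory stand-in for the secure token database."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the token for a repeated value so joins across tables still work.
        if value not in self._value_to_token:
            token = "tok_" + secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str) -> str:
        # Only code with access to the vault can recover the original value.
        return self._token_to_value[token]


def mask_email(email: str, salt: str = "test-env") -> str:
    """Replace an email with a pseudonymous one derived from a salted hash."""
    digest = hashlib.sha256((salt + email).encode("utf-8")).hexdigest()[:10]
    return f"user_{digest}@example.com"


vault = TokenVault()
record = {"name": "Jane Doe", "email": "jane.doe@corp.test", "plan": "gold"}
anonymized = {
    "name": vault.tokenize(record["name"]),   # tokenization
    "email": mask_email(record["email"]),     # data masking
    "plan": record["plan"],                   # non-sensitive, passed through
}
```

The design difference is worth noting: the token can be reversed by anyone holding the vault, while the masked email is a one-way pseudonym that stays consistent for the same input and salt.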
5. BENEFITS OF TEST DATA ANONYMIZATION
10/19/2023
Risk Mitigation: Minimizes the risk of exposing sensitive data during testing.
Effective Testing: Allows thorough testing without compromising data privacy.
Privacy Protection: Safeguards individuals' sensitive information.
6. SOLUTION FOR DATA MASKING
A Python script was created to mask data in an acceptable
form. The script takes an input CSV file with column
headers and prompts the user to choose the output file
type: CSV, XML, Excel, JSON, or SQL. Based on the data
type the user chooses for each column, the script
generates mock data and writes it to the selected file
type in the output folder. The output file is named
<"input_CSV_file_name" + "mock_data" + {timestamp}>.
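The flow described above might look roughly like the following sketch. It is not the author's script: every function name, the `GENERATORS` table, the underscore separators and timestamp format in the file name, and the sample input are assumptions made for illustration. The user prompts are replaced by plain arguments, only the CSV and JSON outputs are shown, and simple standard-library generators stand in for a fake-data library.

```python
import csv
import io
import json
import random
import string
from datetime import datetime

# Hypothetical generators keyed by the data type chosen for a column.
GENERATORS = {
    "name": lambda rng: "Name_" + "".join(rng.choices(string.ascii_uppercase, k=6)),
    "email": lambda rng: "user" + str(rng.randrange(10**6)) + "@example.com",
    "integer": lambda rng: rng.randrange(1000),
}


def read_headers(csv_text: str) -> list:
    """Return the column headers from the first row of the input CSV."""
    return next(csv.reader(io.StringIO(csv_text)))


def generate_rows(headers, column_types, n_rows, seed=0):
    """Build n_rows of mock data, one generated value per column."""
    rng = random.Random(seed)
    return [
        {col: GENERATORS[column_types[col]](rng) for col in headers}
        for _ in range(n_rows)
    ]


def output_name(input_name: str, extension: str, now: datetime) -> str:
    """Join input file name, 'mock_data', and a timestamp (format assumed)."""
    stem = input_name.rsplit(".", 1)[0]
    return f"{stem}_mock_data_{now:%Y%m%d_%H%M%S}.{extension}"


def render(rows, file_type: str) -> str:
    """Serialize rows as CSV or JSON; other formats are left out of this sketch."""
    if file_type == "json":
        return json.dumps(rows, indent=2)
    if file_type == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()
    raise ValueError(f"unsupported file type in this sketch: {file_type}")


# Only the header row of the input is used; the real values are never copied.
headers = read_headers("full_name,email,age\nJane Doe,jane@corp.test,41\n")
types = {"full_name": "name", "email": "email", "age": "integer"}
rows = generate_rows(headers, types, n_rows=3)
name = output_name("customers.csv", "json", datetime(2023, 10, 19, 9, 30, 0))
payload = render(rows, "json")
```

Because the generators are looked up by type label, swapping in richer ones (for example, methods of a fake-data library) would only change the `GENERATORS` table, not the rest of the flow.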
8. TECHNOLOGY USED AND ADVANTAGES
Faker library: A Python library for generating fake data, with many customizable data types.
YAML configuration file: A file in a human-readable data serialization format, used to specify the script's input and output file locations.
pandas: A Python library used for data manipulation and analysis.
ElementTree: A Python standard-library module (xml.etree.ElementTree) for working with XML documents, a popular format for storing structured data.
The solution can be used across various environments, such as load testing, performance testing, user-acceptance testing, pre-production, and production.
It can also generate output data in multiple formats: CSV, XML, Excel, JSON, and SQL.
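As one illustration of the XML output path, the sketch below writes mock rows with the standard-library `xml.etree.ElementTree` module. The `rows_to_xml` helper, the `records`/`record` tag names, and the sample rows are hypothetical, and the sketch assumes every column name is already a valid XML tag name.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows, root_tag="records", row_tag="record") -> str:
    """Serialize a list of dict rows into an XML string, one element per column."""
    root = ET.Element(root_tag)
    for row in rows:
        record = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # Assumes the column name is usable as an XML tag.
            ET.SubElement(record, column).text = str(value)
    return ET.tostring(root, encoding="unicode")


mock_rows = [
    {"full_name": "Name_A", "age": 30},
    {"full_name": "Name_B", "age": 41},
]
xml_text = rows_to_xml(mock_rows)
```

Parsing `xml_text` back with `ET.fromstring` gives a quick round-trip check that the generated file is well-formed.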
9. THANK YOU FOR DIVING INTO TEST DATA ANONYMIZATION
PRESENTED BY:
prateek.gupta@thepsi.com