PyData Sri Lanka 2023 Presentation - Nuzhi Meyen-V2.pptx
1. Techniques to Handle PII Data in
Data Engineering Workflows to
Ensure Compliance to Data
Protection Laws
Sri Lanka
2023
Nuzhi Meyen
2. Importance of Compliance
Introduction
• The Personal Data Protection Act No. 9 of 2022
(PDPA) in Sri Lanka.
• Digital Personal Data Protection (DPDP) Act in
India (2023)
• GDPR in the EU (2018)
• CCPA in California, USA (2018)
• COPPA (1998)
• HIPAA (1996)
• PCI DSS (2004)
Source - Data Privacy Vocabulary - W3C Data Privacy
Vocabularies and Controls CG (DPVCG)
• DPDP: Maximum of 2.5 billion INR and minimum of 500 million INR
• GDPR: Up to 10 million euros or 2% of preceding fiscal year turnover*
• PDPA: Up to a maximum of 10 million LKR
3. What is PII Data?
PII stands for Personally Identifiable Information. It is any data that
could potentially identify a specific individual.
Sensitive Data
Sometimes referred to as “Public” data, sensitive data is any
information that can be found in public records like newspapers,
telephone books, or social media sites.
Confidential Data
Confidential (or “private”) data is information that an individual
would prefer not be made public. This can include information such
as:
• Physical home address
• Telephone number (mobile, business, and personal numbers)
• Date or location of their birth
High-Risk Data
Sometimes labeled “Restricted” data, high-risk data is the highly
confidential information that supports cyber-crime activities and
typically can’t be found through legal means of inquiry. This can
include data such as:
• Credit card information
• Medical records
• Social Security or TIN (Tax Identification Number)
Sources - dataprivacymanager.net & digitalguardian.com
4. General Principles for Handling PII
Data Minimization
The principle of data minimization encourages organizations to
only collect the data that is absolutely necessary for the specific
purpose it will serve.
Purpose Limitation
This principle states that data should only be used for the purpose
for which it was initially collected.
Storage Limitation
This principle advocates for the deletion of personal data once it is
no longer necessary for the purpose it was collected for.
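As a minimal sketch of how data minimization and storage limitation might look inside a pipeline step, the Python below keeps only the fields a purpose needs and drops records older than a retention window. The field names, the `REQUIRED_FIELDS` set, and the 365-day retention period are hypothetical values chosen for illustration, not requirements taken from any of the laws above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical: the only fields this pipeline's purpose (order delivery) needs.
REQUIRED_FIELDS = {"order_id", "delivery_city", "collected_at"}
# Hypothetical retention window.
RETENTION = timedelta(days=365)


def minimize(record: dict) -> dict:
    """Data minimization: keep only the fields required for the purpose."""
    return {k: v for k, v in record.items() if k in REQUIRED_FIELDS}


def within_retention(record: dict, now: datetime) -> bool:
    """Storage limitation: True while the record is inside the retention window."""
    return now - record["collected_at"] <= RETENTION


now = datetime(2023, 12, 1, tzinfo=timezone.utc)
raw = [
    {"order_id": 1, "delivery_city": "Colombo", "email": "a@example.com",
     "collected_at": datetime(2023, 11, 1, tzinfo=timezone.utc)},
    {"order_id": 2, "delivery_city": "Kandy", "email": "b@example.com",
     "collected_at": datetime(2021, 1, 1, tzinfo=timezone.utc)},
]
# Expired records are dropped; surviving records lose the unneeded email field.
kept = [minimize(r) for r in raw if within_retention(r, now)]
print(kept)
```

Purpose limitation is harder to show in a few lines; in practice it tends to be enforced through access policies and dataset-level metadata rather than a single function.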
5. Data Engineering Techniques
There are several data engineering techniques that can be
considered in the context of handling PII data. A few of them are
given below.
Tokenization
Replace sensitive data with non-sensitive placeholders.
Encryption
• At-rest: Encrypt data when it's stored.
• In-transit: Use SSL/TLS encryption during data transfer.
Masking
Conceal portions of the data to protect it.
Role-Based Access Control (RBAC)
Limit access to data based on roles within the organization.
Auditing & Monitoring
Track who accesses what data, when, and why.
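Role-based access control and auditing can be illustrated together with a small sketch. The code below filters a row down to the columns a role may see and writes an audit entry recording who read what and why. The role names, the `ROLE_COLUMNS` mapping, and the `read_columns` helper are hypothetical; a real deployment would normally rely on the access controls and audit logs of the database or data platform itself.

```python
import logging

# asctime supplies the "when" of each access in the audit trail.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(name)s %(message)s")
audit_log = logging.getLogger("pii_audit")

# Hypothetical role-to-column permissions.
ROLE_COLUMNS = {
    "analyst": {"customer_id", "city"},
    "support": {"customer_id", "city", "phone"},
}


def read_columns(user: str, role: str, row: dict, reason: str) -> dict:
    """Return only the columns the role may see, and record the access."""
    allowed = ROLE_COLUMNS.get(role, set())  # unknown roles see nothing
    visible = {k: v for k, v in row.items() if k in allowed}
    audit_log.info("user=%s role=%s columns=%s reason=%s",
                   user, role, sorted(visible), reason)
    return visible


row = {"customer_id": 42, "city": "Galle", "phone": "+94 77 123 4567"}
analyst_view = read_columns("nimal", "analyst", row, "monthly report")
support_view = read_columns("kamala", "support", row, "ticket follow-up")
```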
6. Tokenization
Format-Preserving Encryption (FPE)
What it is: This algorithm keeps the format of the input
data. For example, if a 16-digit credit card number is
tokenized, the token will also be a 16-digit number.
Use Case: Useful in scenarios where the format of the
tokenized data needs to be similar to the original data,
such as in legacy systems.
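To make the format-preserving idea concrete, here is a toy sketch: a small Feistel network over 16-digit strings, using HMAC-SHA256 as the round function. It is not the NIST FF1 algorithm and has not been vetted for production use; the function names, round count, and key are assumptions made for illustration. It only demonstrates that a keyed, reversible mapping from 16 digits to 16 digits is possible.

```python
import hashlib
import hmac

HALF = 8            # digits in each half of a 16-digit value
MOD = 10 ** HALF
ROUNDS = 8          # illustrative round count, not a vetted parameter


def _round_value(key: bytes, round_no: int, half: int) -> int:
    """Keyed pseudo-random function for one Feistel round."""
    msg = f"{round_no}:{half:0{HALF}d}".encode()
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % MOD


def fpe_encrypt(key: bytes, digits: str) -> str:
    """Map a 16-digit string to another 16-digit string."""
    if len(digits) != 2 * HALF or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected exactly 16 ASCII digits")
    left, right = int(digits[:HALF]), int(digits[HALF:])
    for r in range(ROUNDS):
        left, right = right, (left + _round_value(key, r, right)) % MOD
    return f"{left:0{HALF}d}{right:0{HALF}d}"


def fpe_decrypt(key: bytes, token: str) -> str:
    """Invert fpe_encrypt by running the rounds backwards."""
    left, right = int(token[:HALF]), int(token[HALF:])
    for r in reversed(range(ROUNDS)):
        left, right = (right - _round_value(key, r, left)) % MOD, left
    return f"{left:0{HALF}d}{right:0{HALF}d}"


demo_key = b"demo-key-not-for-production"
demo_token = fpe_encrypt(demo_key, "4111111111111111")
print(demo_token, fpe_decrypt(demo_key, demo_token))
```

The token is still 16 digits long, so it fits a column or validation rule that expects a card-number-shaped value.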
Secure Hash Algorithm (SHA) Tokenization
What it is: Uses a one-way hash function to create a hash
of the original data. A random salt is combined with the
data before hashing, and the salted hash is stored
as a token.
Use Case: Suited for situations where you don't
need to retrieve the original data but do need to verify the
integrity of the data.
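A minimal sketch of salted hash tokenization using only the Python standard library is given below. The helper names `hash_token` and `verify` are hypothetical. The salt has to be stored next to the token, otherwise the value cannot be verified later.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple


def hash_token(value: str, salt: Optional[bytes] = None) -> Tuple[bytes, str]:
    """Return (salt, token), where token is SHA-256 over salt + value."""
    if salt is None:
        salt = secrets.token_bytes(16)  # fresh random salt per value
    token = hashlib.sha256(salt + value.encode("utf-8")).hexdigest()
    return salt, token


def verify(value: str, salt: bytes, token: str) -> bool:
    """Check a candidate value against a stored (salt, token) pair."""
    _, candidate = hash_token(value, salt)
    return hmac.compare_digest(candidate, token)  # constant-time comparison


salt, token = hash_token("jane.doe@example.com")
print(verify("jane.doe@example.com", salt, token))   # True
print(verify("someone.else@example.com", salt, token))  # False
```

For low-entropy inputs such as phone numbers, a salted hash can still be brute-forced by anyone who holds the salt, so a keyed construction (for example HMAC with a secret key) may be a better fit.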
Random
Tokenization
What it is : Generates a
completely random string as
a token and maps it to the
original data in a secure
lookup table.
Use Case: Good for general-
purpose tokenization where
format preservation is not
necessary.
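Random tokenization can be sketched with a random token generator and a two-way lookup table. The `InMemoryTokenTable` class below is a hypothetical stand-in: a plain in-memory dictionary is used here for brevity, whereas the "secure lookup table" described above would be an access-controlled, encrypted store.

```python
import secrets


class InMemoryTokenTable:
    """Toy lookup table mapping random tokens to original values."""

    def __init__(self) -> None:
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_urlsafe(16)  # random, unrelated to the value
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]


table = InMemoryTokenTable()
tok = table.tokenize("4111 1111 1111 1111")
print(tok, table.detokenize(tok))
```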
Cipher-Based
Tokenization
What it is: Encrypts the original data with a
cryptographic cipher and a secret key. The
ciphertext serves as the token and can be
decrypted back to the original data with the key.
Use Case: Suited to cases where tokens must be
reversible without maintaining a lookup table.
Vault-Based
Tokenization
What it is : Stores the original
data in a highly secure data
vault. Each piece of stored
data is mapped to a unique
token.
Use Case: Ideal for
applications that require high
levels of security but also
need to detokenize data
frequently.
10. Tokenization - Cipher based
# Install with: pip install cryptography
from cryptography.fernet import Fernet

# Generate a symmetric key and build the cipher
key = Fernet.generate_key()
cipher_suite = Fernet(key)

# Tokenize
token = cipher_suite.encrypt(b"Sensitive Data")

# Detokenize
original = cipher_suite.decrypt(token)
print(f'Token: {token}, Original: {original.decode()}')
11. Tokenization - Vault based
# Install with: pip install hvac
import hvac

# Initialize Vault client
client = hvac.Client()

# Verify that Vault is initialized and unsealed
assert client.sys.is_initialized() is True
assert client.sys.is_sealed() is False

# Create a secret in the Vault (Tokenization)
write_response = client.secrets.kv.v2.create_or_update_secret(
    path='my-secret',
    secret=dict(sensitive_data="This is very secret information"),
)
# The returned `write_response` will contain metadata, not the token
# In Vault, the token is usually the path ('my-secret' in this case)

# Retrieve the secret from the Vault (Detokenization)
read_response = client.secrets.kv.v2.read_secret_version(
    path='my-secret',
)
sensitive_data = read_response['data']['data']['sensitive_data']
print(f"Sensitive Data Retrieved: {sensitive_data}")
12. Masking
Redaction
What it is: To completely hide the original data.
Use Case: Useful for fields where the actual information
is extremely sensitive and should not be exposed under
any circumstances, such as Social Security numbers or
passwords in a system log.
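Redaction in a system log can be sketched with a regular expression that replaces anything shaped like a US Social Security number. The pattern and the `[REDACTED]` placeholder are assumptions for illustration; real logs usually need several patterns and some testing against false positives.

```python
import re

# Hypothetical pattern: US SSN written as 3-2-4 digits.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace every SSN-shaped substring with a fixed placeholder."""
    return SSN_PATTERN.sub("[REDACTED]", text)


line = "user=jdoe ssn=123-45-6789 action=login"
print(redact(line))  # user=jdoe ssn=[REDACTED] action=login
```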
Partial Masking
What it is : To conceal only
part of the data, leaving some
characters visible.
Use Case: Commonly used for
email addresses or phone
numbers in customer
interfaces, where the full
visibility of the data is not
necessary but some context is
useful. For instance, showing
only the last four digits of a
credit card number.
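The two partial-masking cases mentioned above can be sketched as follows. The helper names `mask_card` and `mask_email` are hypothetical, as is the choice to keep the first character of the email's local part.

```python
def mask_card(number: str, visible: int = 4) -> str:
    """Show only the last `visible` digits of a card number."""
    digits = "".join(c for c in number if c.isdigit())
    hidden = max(len(digits) - visible, 0)
    return "*" * hidden + digits[hidden:]


def mask_email(email: str) -> str:
    """Keep the first character of the local part and the full domain."""
    local, _, domain = email.partition("@")
    if not local:
        return email  # nothing sensible to mask
    return local[0] + "*" * (len(local) - 1) + "@" + domain


print(mask_card("4111 1111 1111 1111"))  # ************1111
print(mask_email("nuwan@example.com"))   # n****@example.com
```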
Shuffling
What it is : To randomize the
order of the characters in the
data.
Use Case: Suitable for textual
data where the format needs
to be preserved but the data
should not be recognizable.
It's not ideal for numerical
data or data with a specific
pattern.
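Character shuffling is a few lines in Python. The `shuffle_text` helper below is a hypothetical sketch; it takes the random generator as a parameter so that a seeded `random.Random` can be used in demos and a `secrets.SystemRandom` can be swapped in where unpredictability matters.

```python
import random
import secrets


def shuffle_text(value: str, rng: random.Random) -> str:
    """Return the characters of `value` in a random order."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)


# Seeded generator: repeatable output, suitable only for demos and tests.
print(shuffle_text("Perera", random.Random(7)))
# OS-backed generator: not reproducible.
print(shuffle_text("Perera", secrets.SystemRandom()))
```

The length and character set are preserved, which is why a short or highly patterned value may remain guessable after shuffling.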
Substitution
What it is : To replace each
character or substring with
another character or
substring based on a
mapping.
Use Case: Useful when you
need a reversible masking
process. For example, during
software testing, you might
want to mask sensitive data
but need to revert it to its
original form for
verification.
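A reversible substitution can be sketched with a fixed character mapping and its inverse. The digit permutation in `SUBST` is a hypothetical example; in practice the mapping would be kept secret, since anyone who holds it can reverse the masking.

```python
DIGITS = "0123456789"
SUBST = "7046925813"  # hypothetical permutation of the ten digits

FORWARD = str.maketrans(DIGITS, SUBST)
BACKWARD = str.maketrans(SUBST, DIGITS)


def substitute(value: str) -> str:
    """Replace each digit according to the mapping; other characters pass through."""
    return value.translate(FORWARD)


def restore(value: str) -> str:
    """Apply the inverse mapping to recover the original value."""
    return value.translate(BACKWARD)


masked = substitute("077-123-4567")
print(masked, restore(masked))
```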
Number Variance
What it is : To add a random
variance to numerical data.
Use Case: Useful for datasets
involving numbers where the
exact number is sensitive but
the general range is not. For
instance, in a dataset used for
salary analysis, you might add
variance to the actual salaries
to protect individual privacy
while maintaining the overall
distribution for analytical
purposes.
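The salary example can be sketched as below. The `add_variance` helper and the 10% bound are assumptions for illustration; the amount of noise that keeps a dataset both private and analytically useful depends on the data.

```python
import random


def add_variance(value: float, pct: float, rng: random.Random) -> float:
    """Scale `value` by a random factor within +/- pct (e.g. 0.10 for 10%)."""
    factor = 1 + rng.uniform(-pct, pct)
    return round(value * factor, 2)


rng = random.Random(2023)  # seeded only so the demo is repeatable
salaries = [85_000.0, 120_000.0, 64_500.0]
masked_salaries = [add_variance(s, 0.10, rng) for s in salaries]
print(masked_salaries)
```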
14. Points to consider ...
• What will privacy look like in a post-quantum
encryption timeline? NIST has already developed
standards for quantum-safe cryptographic
algorithms.
• How will generative AI technology, such as LLMs in
the form of ChatGPT, affect how developers write
secure code without privacy leaks in the context of
code generation? E.g., CodexLeaks: Privacy Leaks
from Code Generation Language Models in GitHub
Copilot.