Data Privacy Conference, Rootconf, 23-29 April 2021
Synthetic data generation
Sandeep Joshi
[ needl.ai ]
https://www.linkedin.com/in/sanjoshi/
Agenda
1. Introduction to the problem
2. Capturing variation within a column
3. Capturing dependence between columns
4. Masking fields
5. Summary
Background on needl.ai
Privacy requirement
Engineers should not access customers' data
But how do we test new features, especially ML-related ones?
Aim : Generate Synthetic data from Production data
Real data --> Synthetic Data
Real data:
Name           Age  Gender  Response
Tsunami Singh  34   F       Yes
Pappu Pager    23   M       Maybe
Khokha Singh   53   F
Vasooli Bhai   21   M
Jagat Sahni    66   F       No

Synthetic data:
Name           Age  Gender  Response
John Smith     45   M       Yes
Jack Ryan      45   M
Jill Reacher   34   F       Maybe
Myles Togo     23   F
Bill Melater   18   M       No
Capture variation within a column
Name           Age  Gender  Response
               34
               23
               53
               21
               66
Capture correlation between columns
Name           Age  Gender  Response
               34   F
               23   M
               53   F
               21   M
               66   F
Age and Gender may be correlated
Mask actual names
Name Age Gender Response
John May
April Smith
August Ryan
June Jackson
Money Spinner
SDV : Synthetic data vault
Package from MIT Data-to-AI group (https://sdv.dev/)
Single column variation
Capture variation within column
Name           Age  Gender  Response
Tsunami Singh  34   F       Yes
Pappu Pager    23   M       Maybe
Khokha Singh   53   F
Vasooli Bhai   21   M
Jagat Sahni    66   F       No
Statistics : density estimation problem
Age
34
23
53
21
Which distribution generated these values?
Both parametric and non-parametric methods exist for this problem.
Density estimation
Using scipy
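from scipy.stats import gaussian_kde, beta
data = [5, 12, 22, 400, 800, ... ]
# Parametric: fit a beta distribution; loc and scale are fixed by the caller
beta.fit(data, floc=loc, fscale=scale)
# Non-parametric: kernel density estimate with Silverman's bandwidth rule
gaussian_kde(data, bw_method='silverman')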
Demo coming up...
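The two scipy calls on this slide can be tried end to end with a small sketch like the one below. The sample values, the fixed interval (loc = 0, scale = 100) and the sample sizes are assumptions made for illustration; they are not the data from the talk.

```python
import numpy as np
from scipy.stats import beta, gaussian_kde

# Hypothetical sample standing in for a real column (not the slide's data).
rng = np.random.default_rng(0)
ages = rng.integers(18, 70, size=200).astype(float)

# Parametric: fit a beta distribution with location and scale held fixed.
# The interval (loc, loc + scale) must strictly contain every observed value.
loc, scale = 0.0, 100.0
a, b, _, _ = beta.fit(ages, floc=loc, fscale=scale)
beta_samples = beta.rvs(a, b, loc=loc, scale=scale, size=5, random_state=0)

# Non-parametric: Gaussian kernel density estimate with Silverman's bandwidth.
kde = gaussian_kde(ages, bw_method="silverman")
kde_samples = kde.resample(5, seed=0)[0]

print(beta_samples)
print(kde_samples)
```

Either fitted model can then be sampled to produce synthetic values for the column. The beta fit stays inside the chosen interval, while the KDE can place samples slightly outside the observed range.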
Dependence between fields
Capture correlation between columns
Name           Age  Gender  Response
Tsunami Singh  34   F       Yes
Pappu Pager    23   M       Maybe
Khokha Singh   53   F
Vasooli Bhai   21   M
Jagat Sahni    66   F       No
Find how Age and Gender are correlated
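One way to put a number on this, sketched here under the assumption that Gender is encoded as 0/1 (an encoding chosen for illustration, not something the slides specify), is to compute a Pearson correlation on the five rows of the table:

```python
import pandas as pd

# The five rows from the slide's table.
df = pd.DataFrame({
    "Age": [34, 23, 53, 21, 66],
    "Gender": ["F", "M", "F", "M", "F"],
})

# Assumption: encode the two categories as 0/1 so a Pearson correlation is defined.
df["GenderCode"] = (df["Gender"] == "F").astype(int)

corr = df["Age"].corr(df["GenderCode"])
print(round(corr, 2))
```

On these five rows the coefficient is about 0.81, because the three oldest people are all marked F. With so few rows this says little about any real population; it only shows the mechanics.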
Correlation
Two fields can change in the same direction, in opposite directions, or be unrelated
https://brianwhitworth.com/research-correlations/
Finding correlation with Pandas
import pandas as pd
columns_dict = {'height': [160, 156, 175, 180, 165, 143],
                'age': [64, 55, 46, 23, 22, 19]}
df = pd.DataFrame.from_dict(columns_dict)
# Pairwise Pearson correlation matrix of the columns
print(df.corr().values)
Use Pandas df.corr()
Correlation : graphical view
correlation  age  height
age          1    0.7
height       0.7  1
How to generate correlated data
Use Copulas (Gaussian Copula, CopulaGAN, etc)
https://en.wikipedia.org/wiki/Copula_(probability_theory)
Gaussian Copulas
Capture dependence between columns (Age vs Gender)
Name Age Gender Response
Transform individual columns (e.g. Age) while retaining their dependency
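The idea can be sketched with numpy and scipy alone. This is a minimal illustration of a Gaussian copula with empirical marginals, not SDV's implementation; the function name gaussian_copula_sample and the age/height input are hypothetical.

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_copula_sample(data, n_samples, seed=0):
    """Sample rows that keep each column's empirical marginal and the
    rank-based dependence between columns."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = data.shape

    # 1. Map each column to (0, 1) via its ranks, then to normal scores.
    u = rankdata(data, axis=0) / (n_rows + 1)
    z = norm.ppf(u)

    # 2. Capture dependence as the correlation matrix of the normal scores.
    corr = np.corrcoef(z, rowvar=False)

    # 3. Draw correlated normals and map them back to (0, 1).
    z_new = rng.multivariate_normal(np.zeros(n_cols), corr, size=n_samples)
    u_new = norm.cdf(z_new)

    # 4. Invert each column's empirical CDF using quantiles of the real data.
    return np.column_stack(
        [np.quantile(data[:, j], u_new[:, j]) for j in range(n_cols)]
    )

# Hypothetical correlated input: height loosely increasing with age.
rng = np.random.default_rng(1)
age = rng.uniform(20, 70, size=500)
height = 150 + 0.4 * age + rng.normal(0, 5, size=500)
real = np.column_stack([age, height])

synthetic = gaussian_copula_sample(real, n_samples=500)
print(np.corrcoef(real, rowvar=False)[0, 1])
print(np.corrcoef(synthetic, rowvar=False)[0, 1])
```

Each column is transformed on its own (steps 1 and 4), while the dependence lives only in the correlation matrix of step 2, which is why the marginals and the dependence can be modelled separately.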
Demo
https://github.com/sanjosh/machine_learning/blob/main/copulas/synthetic%20data%20generation.ipynb
1. Capture variation within a column
2. Capture dependence between columns
3. Generate data for all columns
Masking fields
Masking emails
SDV uses the Faker library: https://faker.readthedocs.io/
Masking names
SDV uses the Faker library: https://faker.readthedocs.io/
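The masking step itself can be illustrated without any third-party package. The sketch below is a standard-library stand-in, not Faker and not SDV's code: it maps each real name to a replacement from a fixed pool, and the helper mask_name and the salt value are hypothetical. The pool reuses the replacement names from the "Mask actual names" slide.

```python
import hashlib

# Replacement pool taken from the "Mask actual names" slide.
FAKE_NAMES = ["John May", "April Smith", "August Ryan", "June Jackson", "Money Spinner"]

def mask_name(real_name, salt="demo-salt"):
    """Map a real name to a fake one, deterministically for a given salt."""
    digest = hashlib.sha256((salt + real_name).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]

real_names = ["Tsunami Singh", "Pappu Pager", "Khokha Singh", "Vasooli Bhai", "Jagat Sahni"]
masked = [mask_name(name) for name in real_names]

for real, fake in zip(real_names, masked):
    print(real, "->", fake)
```

Because the mapping is deterministic, the same real name always receives the same replacement, which keeps joins across tables consistent. With a pool this small, two real names can collide on one replacement; a generator such as Faker avoids that by producing a much larger variety of values.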
SDV (Synthetic Data Vault)
SDV model for a single table
SDV : custom constraints
https://github.com/sdv-dev/SDV/blob/master/tutorials/single_table_data/05_Handling_Constraints.ipynb
SDV : other features
1. Transformers for fields (convert categorical data to numbers)
2. Capture relations between tables
3. Time series
4. Metrics to compare synthetic with actual data
https://sdv.dev/SDV/index.html
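As an illustration of the fourth item, a single column of synthetic data can be compared with the real one using a two-sample Kolmogorov-Smirnov test. The sketch uses scipy's ks_2samp as a stand-in rather than SDV's own metrics, and all three samples are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real_age = rng.normal(40, 12, size=1000)    # hypothetical real column
good_synth = rng.normal(40, 12, size=1000)  # drawn from the same distribution
bad_synth = rng.normal(60, 5, size=1000)    # clearly different distribution

# The KS statistic is the largest gap between the two empirical CDFs:
# close to 0 when the columns match, close to 1 when they do not.
good = ks_2samp(real_age, good_synth)
bad = ks_2samp(real_age, bad_synth)

print(good.statistic, good.pvalue)
print(bad.statistic, bad.pvalue)
```

A per-column test like this checks marginals only; it says nothing about whether dependence between columns was preserved.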
Conclusion
SDV is a versatile tool
Good
1. Modular : can use parts of the framework (different git repos)
2. Usable with less data, unlike "deep learning"-based solutions (SDV does support GANs)
3. It's explainable (can debug or modify the output)
Issues
1. Difficult to add a custom transformer (no code samples)
2. Does not solve synthetic text generation problem (NLG)
3. Does not solve synthetic graph generation
Questions?
