www.csiro.au	
Data Analytics WITHOUT Seeing the Data
Max Ott
… with input from the entire N1 Team
max.ott@data61.csiro.au
Future Value of Data
[Figure: value of data plotted against time since release]
Data decays with time!
Future Value of Data
[Figure: value of data plotted against time since release]
Joined with another data set – more value!!
Future Value of Data
[Figure: value of data plotted against time since release]
New analytics techniques – more value!!
Future Value of Data
[Figure: value of data plotted against time since release]
Data decay + Joining new data + New analytics techniques
Uncertain future value
Unknown future risk
Challenge
[Diagram: Confidential data, Computation, Result. Labels: "Learn this!" and "Learn NOTHING"]
The Problem
How can we learn valuable insights from sensitive data from multiple organisations?
[Diagram labels: Sensitive data (Confidential), Sensitive data (Confidential), Joint Analysis, Insights]
Three Basic Building Blocks
• Private computation
•  Arithmetic on encrypted numbers
• Distributed, confidential analytics
•  Distributed algorithms, computation & protocols
• Private Record Linkage
•  Privacy preserving record level matching
Solution (1): Private computation
3 → E → 71175935987496430338623223060201843925208459762815635262949815592595…
2 → E → 65535371328094595953647425328511585634791158377797185627083578174160…
“+” of the two encrypted numbers = 95364742532851158563479115837779718562708357817416015729957944589069…
D → 5 (= 3 + 2)
Solution (2): Distributed analytics
[Diagram: Dept 1 and Dept 2 each hold Data and run N1 Secure compute inside their own Confidentiality boundary; both exchange messages with the N1 Coordinator]
Data always remains confidential to the source institution
Messages containing encrypted data
Solution (3): Private Record Linkage
[Diagram: do these records in Dataset A and Dataset B refer to the same person?]
Tori Mckone 7/06/1921 F
Tori Mackon 6/07/1921 F
Victoria Mckon 7/06/1921 F
Use Cases
Scoring Model
[Diagram labels: Own Data, Other Data, Quality ??]
Suspicious Activities
Need to report?
[Diagram label: Model Builder]
Industry using Gov Data
[Diagram labels: Model Builder, Own Data, Gov Data]
Benchmarking
[Diagram labels: Own Data, Model Builder]
Device Analytics
Model of normal behaviour
Private Modeling: learn, deploy
[Diagram: device readings labelled OK or NG, during learning and after deployment]
Private Computation
Homomorphic encryption
• Partial Homomorphic Encryption: allows either addition or multiplication of encrypted numbers
• Somewhat Homomorphic Encryption: allows evaluation of low order polynomials
• Fully Homomorphic Encryption: allows evaluation of arbitrary functions
[Axis: more general towards Fully; faster towards Partial]
Paillier Encryption
Encryption of m:
$c = g^{m} r^{n} \bmod n^2$
Addition of encrypted numbers:
$D\big(E(m_1) \cdot E(m_2) \bmod n^2\big) = m_1 + m_2 \bmod n$
Multiplication of encrypted number by a scalar:
$D\big(E(m_1)^{m_2} \bmod n^2\big) = m_1 m_2 \bmod n$
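The encryption, addition and scalar multiplication rules above can be tried out with a toy implementation. The following is a minimal sketch, not the N1 implementation: it assumes the common choice g = n + 1, uses two tiny hard-coded primes (so it is NOT secure), and all function names are hypothetical.

```python
import math
import random

def keygen(p=293, q=433):
    """Toy key generation from two small primes (illustration only, NOT secure)."""
    n = p * q
    g = n + 1                    # assumed simple choice of g
    phi = (p - 1) * (q - 1)      # phi(n) works as the private exponent when g = n + 1
    mu = pow(phi, -1, n)         # modular inverse of phi modulo n
    return (n, g), (phi, mu)

def encrypt(pub, m):
    """c = g^m * r^n mod n^2, with a fresh random r coprime to n."""
    n, g = pub
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pub, priv, c):
    """Recover m from c using the private values (phi, mu)."""
    n, _ = pub
    phi, mu = priv
    u = pow(c, phi, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(pub, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    n, _ = pub
    return (c1 * c2) % (n * n)

def mul_scalar(pub, c, k):
    """Raising a ciphertext to the power k multiplies the plaintext by k."""
    n, _ = pub
    return pow(c, k, n * n)

pub, priv = keygen()
c3 = encrypt(pub, 3)
c2 = encrypt(pub, 2)
total = decrypt(pub, priv, add_encrypted(pub, c3, c2))
scaled = decrypt(pub, priv, mul_scalar(pub, c3, 7))
print(total, scaled)  # 5 21
```

Results wrap around modulo n, so plaintexts and their sums must stay below n.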
Paillier Encryption
Encryption of m:
$c = g^{m} r^{n} \bmod n^2$
Addition of encrypted numbers:
$g^{m_1} \times g^{m_2} = g^{m_1 + m_2}$
Multiplication of encrypted number by a scalar:
$\left(g^{m_1}\right)^{m_2} = g^{m_1 m_2}$
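The exponent identities above ignore the random factor in each ciphertext. A short worked derivation, writing $r_1$ and $r_2$ for the random values used in the two encryptions (these symbols are introduced here for illustration and are not on the slide), shows that the random factors combine into a new random factor of the same form:

```latex
\begin{align*}
E(m_1)\cdot E(m_2)
  &= \left(g^{m_1} r_1^{n}\right)\left(g^{m_2} r_2^{n}\right)
   = g^{m_1+m_2}\,(r_1 r_2)^{n} \pmod{n^2} \\
E(m_1)^{m_2}
  &= \left(g^{m_1} r_1^{n}\right)^{m_2}
   = g^{m_1 m_2}\,\left(r_1^{m_2}\right)^{n} \pmod{n^2}
\end{align*}
```

Both right-hand sides have the shape $g^{M} r^{n}$, so they decrypt like ordinary encryptions of $M = m_1 + m_2$ and $M = m_1 m_2$ (taken mod $n$).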
Paillier Implementations
• Python – open source
•  www.github.com/nicta/python-paillier
• Java – open source
•  www.github.com/nicta/javallier
• Javascript – still under closed development
Distributed, Confidential Analytics
Distributed Computing with a Twist
[Diagram: Org 1 and Org 2 each hold Data and run N1 Secure compute inside their own Confidentiality boundary; both exchange messages with the N1 Coordinator]
Data always remains confidential to the source organisation
Messages containing ONLY encrypted data
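As a rough illustration of "messages containing ONLY encrypted data", the sketch below has two organisations each send one encrypted partial sum to a coordinator, which combines the ciphertexts without decrypting them. It is a sketch under stated assumptions, not the N1 protocol: it uses a toy Paillier scheme with tiny primes (NOT secure), the JSON field names are hypothetical, and it assumes a separate private key holder decrypts only the final aggregate.

```python
import json
import math
import random

# Toy Paillier parameters (illustration only, NOT secure).
P, Q = 293, 433
N = P * Q
N2 = N * N
PHI = (P - 1) * (Q - 1)      # private: held only by the key holder
MU = pow(PHI, -1, N)         # private: held only by the key holder

def encrypt(m):
    """Public-key operation: c = (n + 1)^m * r^n mod n^2."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (pow(N + 1, m, N2) * pow(r, N, N2)) % N2

def decrypt(c):
    """Private-key operation, run only by the key holder."""
    return ((pow(c, PHI, N2) - 1) // N * MU) % N

def org_message(org, local_values):
    """Runs inside an organisation's confidentiality boundary."""
    return json.dumps({"from": org, "enc_sum": encrypt(sum(local_values))})

def coordinator_combine(messages):
    """Runs at the coordinator: adds plaintexts by multiplying ciphertexts."""
    acc = 1  # 1 is a trivial encryption of 0
    for raw in messages:
        acc = (acc * json.loads(raw)["enc_sum"]) % N2
    return acc

msgs = [org_message("org1", [120, 80, 45]), org_message("org2", [60, 15])]
combined = coordinator_combine(msgs)   # the coordinator never sees 245 or 75
total = decrypt(combined)              # the key holder sees only the aggregate
print(total)  # 320
```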
Graph Computation Engine
[Diagram labels: Domains, Coordinator, Worker, Workers, Properties, Messages; CE and DF nodes connected by messages M]
Legend:
M: JSON Message
CE: AKKA actors
DF: Data frames
N1 Analytics Platform
[Layered diagram]
Privacy Technologies: Partial homomorphic encryption, Private Record Linkage, Irreversible aggregation
Distributed Graph Computation Engine
Analytics: Statistics, Regression, Clustering
Machine Learning: Learn, Evaluate, Deploy
Data, Auth, Network
Logistic Regression
Logistic function (evaluate):
$p(x;\theta) = \dfrac{1}{1 + e^{-\theta \cdot x}}$
Log likelihood (minimise $-L(\theta)$ for $\theta$):
$L(\theta) = \sum_{i=0}^{n} \Big[ y_i \log p(x_i;\theta) + (1 - y_i) \log\big(1 - p(x_i;\theta)\big) \Big]$
Requires “secure log” and “secure inverse” protocol using Paillier encryption
Builds on Han et al. 2010 “Privacy Preserving Gradient Descent Methods”
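For reference, the plaintext computation that the private protocol has to reproduce can be sketched as follows. This is an unencrypted baseline only, with hypothetical function names and made-up data; it does not implement the "secure log" or "secure inverse" protocols, but it marks where a logarithm and an inverse appear.

```python
import math

def p(x, theta):
    """Logistic function; the 1 / (1 + ...) is the inverse to be computed privately."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(X, y, theta):
    """L(theta); the log terms are what a secure log protocol would provide."""
    return sum(
        yi * math.log(p(xi, theta)) + (1 - yi) * math.log(1 - p(xi, theta))
        for xi, yi in zip(X, y)
    )

def fit(X, y, steps=2000, lr=0.05):
    """Gradient ascent on L(theta), i.e. gradient descent on -L(theta)."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            err = yi - p(xi, theta)          # dL/dz for one record
            for j, xij in enumerate(xi):
                grad[j] += err * xij
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

# Made-up data: first column is a constant bias term.
X = [[1, 0.5], [1, 1.0], [1, 1.5], [1, 2.0], [1, 2.5], [1, 3.0], [1, 3.5], [1, 4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]
theta = fit(X, y)
print(theta, log_likelihood(X, y, theta))
```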
Example Paillier Logistic Regression
[Diagram labels: Org A, Org B, N1Analytics; Coordinator, Worker; Logistic Learner, Gradient Descent, Secure Log, Secure Inverse; Private key holder; Features & labels, Features]
Legend:
M: JSON Message
CE: AKKA actors
DF: Data frames
Performance
• Learning
•  Learnt models have the same accuracy as unencrypted calculations
•  “Private learning” is (1000x) slower due to encrypted computations. Learning times are several hours.
• Deployment
•  A score can be generated in real time (<50ms)
•  Customer data that contributes to the score remains private.
Scaling
[Diagram: a Coordinator with Data Provider 1 and Data Provider 2, each backed by many Workers]
[Figure: Learning time scaling. x-axis: Cores (0 to 400); y-axis: Minutes (ticks at 5, 10, 50, 100, 500); series: ● 10,000x10 features, ■ 100,000x10 features, ◆ 1,000,000x10 features]
Confidential Record Linkage
Record Linkage Challenge
[Diagram: do these records in Dataset A and Dataset B refer to the same person?]
Tori Mckone 7/06/1921 F
Tori Mackon 6/07/1921 F
Victoria Mckon 7/06/1921 F
Solution (3): Private Record Linkage
One way hash functions on each side, then Fuzzy Matching on the hashes:
Name        Hash       Name        Hash
Jane Doe    a8bf342    Janet Doe   a8bf242
Paul Doe    f72630b    Bob Doe     b3894f3
Jim Clark   14oe54     Jim Clark   14oe54
Kate Clark  a72bef4    Kat Clark   672bef4
Shan Bo     7830530    Shan Bo     7830530
Reg Pal     4bf6021    Joe Smith   80ac364
Private Record Linkage
[Diagram: Company A and Company B each pass Personally Identifiable Information through a Hasher, using a Shared Secret Salt, to produce an Anonymous Bloom filter; N1 runs the Fuzzy Matcher and produces the Linkage Table]
PII cannot be recovered from the hashes
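The slide does not spell out how the Hasher builds its Bloom filters. One common construction, shown here purely as an assumption, hashes each character bigram of a field several times with a keyed hash (the shared secret salt as key) and compares two filters with the Dice coefficient. The filter size, number of hashes, salt value and function names below are all hypothetical.

```python
import hashlib
import hmac

def bigrams(text):
    """Character bigrams of the padded, lower-cased text."""
    s = f"_{text.lower()}_"
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bloom_encode(text, secret_salt, size=1024, num_hashes=4):
    """Return the set of bit positions switched on in the Bloom filter."""
    bits = set()
    for gram in bigrams(text):
        for k in range(num_hashes):
            digest = hmac.new(secret_salt, f"{k}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:4], "big") % size)
    return bits

def dice(a, b):
    """Dice coefficient of two bit sets: 1.0 for identical filters, near 0 for unrelated ones."""
    return 2 * len(a & b) / (len(a) + len(b))

salt = b"shared-secret-salt"  # hypothetical value, agreed between the two companies
a = bloom_encode("Tori Mckone", salt)
b = bloom_encode("Tori Mackon", salt)
c = bloom_encode("Joe Smith", salt)
print(round(dice(a, b), 2), round(dice(a, c), 2))
```

Similar names share most bigrams and therefore most bit positions, which is what lets a matcher compare records without seeing the underlying text.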
Private Record Linkage
[Diagram: Organisation A and Organisation B each run a Hasher with the Shared Secret Salt; N1 Analytics runs the Fuzzy Matcher and produces the Linkage Table]

A's PII data
Name            DOB       Gender
John Smith      12/01/82  M
Mark Gorgon     1/12/90   M
Hanna Smith     4/02/78   F
…               …         …
Juliet Baker    2/11/72   F

B's PII data
Name            DOB       Gender
Mark Gorgon     1/12/90   M
Juliet Baker    2/11/72   F
Andrew Roberts  4/02/93   M
…               …         …
Hanna Smith     4/02/78   F

A's Cryptographic Hashes
Row     Key
1       10110110...00101010
2       01110110...11010101
3       10011001...10100110
…       …
100000  01101011...00101101

B's Cryptographic Hashes
Row     Key
1       01110110...11010101
2       01101011...00101101
3       01111000...00110011
…       …
100000  10011101...10100111

Linkage Table
Row A   Row B
1       X
2       1
3       100000
…       …
100000  X

Similar in approach to MERLIN - Ranbaduge, Vatsalan, Christen (2015)
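A linkage table of the kind shown above can be derived from pairwise similarities between the two sets of hashes. The sketch below is one simple way to do it, assuming greedy one-to-one matching on the Dice coefficient with a fixed threshold; the slides do not state which matching strategy the Fuzzy Matcher uses, and the bit sets are made-up toy data.

```python
def dice(a, b):
    """Dice coefficient of two sets of Bloom filter bit positions."""
    return 2 * len(a & b) / (len(a) + len(b))

def link(hashes_a, hashes_b, threshold=0.8):
    """Greedy one-to-one linkage: best-scoring pairs first, unmatched rows map to 'X'."""
    scored = sorted(
        (
            (dice(ha, hb), i, j)
            for i, ha in enumerate(hashes_a, start=1)
            for j, hb in enumerate(hashes_b, start=1)
        ),
        reverse=True,
    )
    table = {i: "X" for i in range(1, len(hashes_a) + 1)}
    used_b = set()
    for score, i, j in scored:
        if score < threshold:
            break  # every remaining pair scores lower still
        if table[i] == "X" and j not in used_b:
            table[i] = j
            used_b.add(j)
    return table

# Toy filters: row 2 of A equals row 1 of B; row 3 of A is close to row 3 of B.
hashes_a = [{1, 4, 7, 9}, {2, 3, 5, 8}, {0, 6, 10, 11}]
hashes_b = [{2, 3, 5, 8}, {12, 13, 14, 15}, {0, 6, 10, 12}]
table = link(hashes_a, hashes_b, threshold=0.7)
print(table)  # {1: 'X', 2: 1, 3: 3}
```

Only row numbers leave the matcher, so neither organisation learns the other's records.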
Probabilistic Record Linkage
Common categorical features (e.g. post code, age range, gender)
Record linkage can be a privacy issue
Classification without identity linking
[Diagram labels: Features, Labels, Rados Features, Shared feature, Labels*, Label Proportions]
Learning from Label Proportions
Patrini, Nock, Caetano, & Rivera, NIPS (2014), (Almost) No label no cry
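The label proportions in the diagram can be pictured as a per-group summary: for each value of the shared feature, the fraction of records with a positive label. The sketch below shows only that aggregation step, assuming the grouping is done on the shared feature; it is not the learning algorithm from the cited paper, and the post codes and labels are made up.

```python
from collections import defaultdict

def label_proportions(rows):
    """rows: iterable of (shared_feature_value, label) pairs with label in {0, 1}."""
    counts = defaultdict(lambda: [0, 0])   # value -> [positives, total]
    for value, label in rows:
        counts[value][0] += label
        counts[value][1] += 1
    return {value: pos / total for value, (pos, total) in counts.items()}

# Made-up records held by the party that owns the labels.
rows = [
    ("2000", 1), ("2000", 0), ("2000", 1),
    ("2010", 0), ("2010", 0), ("2010", 1), ("2010", 0),
]
props = label_proportions(rows)
print(props)
```

Only these group-level proportions, not individual labels, would then be shared with the party holding the other features.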
Classification without identity linking
[Diagram labels: Features, Labels, Rados Features, Shared feature, Labels*, Encrypted Label Proportions]
Learning from Encrypted Label Proportions
Current Status
Current Capabilities of N1 platform
• Standard data analytics techniques on confidential data:
•  Correlation analysis
•  Classification / prediction
•  Regression
•  Clustering / outlier detection
• Automated private record linkage
• Fine grained authorisation and access control
[Diagram: Dept 1, Org 2 and Comp3 connected through Private record linkage and Private analytics (Statistics, Classifiers, Anomaly Detection)]
Federated model – No central database
Data is kept local to the source
Beta program
• Not open sourced (yet!)
• Looking for partners who want to use our system in their applications
• Still some warts, but working in a commercial setting
Acknowledgements	
Engineering
Mr. Brian Thorne
Dr. Mentari Djatmiko
Dr. Guillaume Smith
Dr. Wilko Hanecka
Dr. Hamish Ivey-Law
Research
Dr. Richard Nock
Mr. Giorgio Patrini
Dr. Roksana Borelli
Dr. Arik Friedman
Prof. Hugh Durrant-Whyte
Business
Mr. Warren Bradey
Ms. Shelley Copsey
Lead: Dr. Stephen Hardy	

Confidential Computing - Analysing Data Without Seeing Data