Say "Hi!" to Your New Boss

How algorithms might soon control our lives
(and why we should be careful with them)

Motivation
no alternatives, Google?

Outline
Theory
1. Algorithms
2. Machine Learning
3. Big Data & Consequences for Machine Learning
4. Use of Algorithms Today and in the Future
Experiments
1. Discriminating people with machine learning & algorithms
2. Creating persistent user identities by (accidental) de-anonymization
Summary & Outlook
1. Strategies for Handling Data Responsibly

Algorithms, Machine Learning & Big Data

Algorithms
An algorithm is a "recipe" that gives a computer (or a
human) step-by-step instructions in order to achieve a
certain goal.
Flowchart: Start -> Doorbell ringing -> Andreas stands on trapdoor?
yes: Open trapdoor
no: Wait. Our time will come.
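The flowchart can be written down as a tiny function. This is only a sketch; the function name `on_doorbell` and its boolean parameter are made up for illustration.

```python
def on_doorbell(andreas_on_trapdoor: bool) -> str:
    """Follow the flowchart once the doorbell has rung."""
    if andreas_on_trapdoor:
        # "yes" branch of the decision node
        return "Open trapdoor"
    # "no" branch of the decision node
    return "Wait. Our time will come."


print(on_doorbell(True))
print(on_doorbell(False))
```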

Machine Learning
A machine learning algorithm automatically generates
models and checks them against the training data we
provide, trying to find a model that explains the data well
and can predict unknown data.

Data vs. Model
Data: $\mathbf{x}$
Model: $y = m(\mathbf{x}, \mathbf{p}) + \varepsilon$
see e.g. "Machine Learning" by Tom Mitchell (McGraw Hill, 1997).
Figure: data and model, y plotted against x1
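As a minimal sketch of fitting a model to data, assume the simplest possible model, a straight line with parameters p = (p0, p1). The synthetic data, the noise level and the function name `fit_line` are assumptions for illustration, not part of the talk.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = p0 + p1 * x; returns (p0, p1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    p1 = sxy / sxx
    p0 = mean_y - p1 * mean_x
    return p0, p1


# Synthetic training data: a known line plus Gaussian noise (epsilon).
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.1) for x in xs]

p0, p1 = fit_line(xs, ys)
print(p0, p1)
```

The recovered parameters should land close to the values used to generate the data; the remaining gap is the noise term.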

Sources of Error
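$\varepsilon = \varepsilon_{sys} + \varepsilon_{noise} + \varepsilon_{hidden}$
$\varepsilon_{sys}$: systematic errors arise due to imperfect measurements of known variables
$\varepsilon_{noise}$: noise is present due to the nature of the process or our measurement apparatus
$\varepsilon_{hidden}$: many variables are usually unknown to us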

Big Data & Machine Learning
2000 -> 2015:
more data sources
high data volume
higher density
higher frequency
longer retention

Data Volume: More is (usually) better

Exploiting New Sources of Data
$y = m(\mathbf{x}, \mathbf{p}) + \varepsilon_{hidden} + \cdots$
incorporate variables that were hidden
into the model, reducing error

Understanding Results
Models can be easy or very difficult to interpret
Parameter space is often huge and can't be
explored entirely
Figure: decision tree classifier (easy to interpret; nodes "age > 37 ?", "height < 1.78", "projects > 19 ?" with yes / no branches) vs. neural network classifier (hard to interpret)

Example: Deep Learning for Image Recognition
http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html

Classifying Use of Algorithms
low risk: mildly annoying in case of failure / misbehaviour
medium risk: large impact on our life in case of failure / misbehaviour
high risk: critical impact on our life in case of failure / misbehaviour

low risk
personalization of services
(e.g. recommendation engines for webs
video-on-demand, content, ...)
individualized ad targeting
customer rating / profiling
consumer demand prediction

medium risk
personalized health
person classification (e.g. crime,
terrorism)
autonomous cars / planes / machines
automated trading
...

high risk
military intelligence / intervention
political oppression
critical infrastructure services (e.g. elect
life-changing decisions (e.g. about healt

Big Data & Advances in Machine Learning

Data "Mishaps": Two Experiments

Discriminating People With Algorithms
Humans can be prejudiced.
Are algorithms better?

Discrimination
Discrimination is treatment or consideration of, or making
a distinction in favor of or against, a person or thing based
on the group, class, or category to which that person or
thing is perceived to belong to rather than on individual
merit.
Wikipedia
Protected attributes (examples):
Ethnicity, Gender, Sexual Orientation, ...

When is a process discriminating?
Disparate Impact: Adverse impact of a process C on a given group X
Outcome    X = 0               X = 1
C = NO     P(C = NO, X = 0)    P(C = NO, X = 1)
C = YES    P(C = YES, X = 0)   P(C = YES, X = 1)
$\frac{P(C = YES \mid X = 0)}{P(C = YES \mid X = 1)} < \tau$
see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al.

When is a process discriminating?
Estimating τ with real-world data
Outcome    X = 0    X = 1
C = NO     a        b
C = YES    c        d
$\frac{c / (a + c)}{d / (b + d)} < \tau$
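The estimate from the contingency table can be computed directly. The sketch below uses invented counts; the threshold 0.8 is an assumption (the common "80 % rule"), not a value given on the slide.

```python
def disparate_impact_ratio(a, b, c, d):
    """Ratio of YES rates: (c / (a + c)) / (d / (b + d)).

    a, c: number of NO / YES outcomes for group X = 0
    b, d: number of NO / YES outcomes for group X = 1
    """
    yes_rate_x0 = c / (a + c)
    yes_rate_x1 = d / (b + d)
    return yes_rate_x0 / yes_rate_x1


# Hypothetical counts, for illustration only.
ratio = disparate_impact_ratio(a=70, b=50, c=30, d=50)

tau = 0.8  # assumed threshold, not taken from the talk
print(ratio, ratio < tau)
```

A ratio below the chosen threshold flags the process as having disparate impact on group X = 0.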

Discrimination through Data Analysis
Replacing a manual hiring process with
an automated one.
Benefits:
Save time screening CVs by hand
Improve candidate choice

The Setup
Diagram: CV -> human -> C (training data) -> algorithm

The Setup
Use submitted information (CV, work
samples) along with publicly available /
external information to predict candidate
success.
Use data from the manual process (invite / no invite) to train the classifier
Provide it with as much data as possible to

Our decision model
$S = m(Y) + d(X) + \varepsilon$
$m(Y)$: score of candidate (merit function)
$d(X)$: discrimination malus/bonus
$\varepsilon$: hidden variables & luck (if you believe in it)
$C = \begin{cases} YES, & S > t \\ NO, & \text{otherwise} \end{cases}$
Figure: candidate merit vs. luck, without discrimination and with discrimination
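A minimal sketch of this decision model, under assumptions that are not in the talk: merit and luck are drawn from normal distributions, X is a fair coin flip, and d(X) is a fixed malus applied to group X = 0. All names and numbers are illustrative.

```python
import random


def simulate_decisions(n, malus, threshold, seed=0):
    """Draw n candidates and return a list of (x, accepted) pairs."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = rng.randint(0, 1)              # protected attribute X
        merit = rng.gauss(0.0, 1.0)        # m(Y), assumed standard normal
        d = -malus if x == 0 else 0.0      # d(X): malus for group X = 0
        luck = rng.gauss(0.0, 0.3)         # epsilon, assumed noise level
        score = merit + d + luck           # S = m(Y) + d(X) + epsilon
        samples.append((x, score > threshold))  # C = YES if S > t
    return samples


def acceptance_ratio(samples):
    """P(C = YES | X = 0) / P(C = YES | X = 1), estimated from samples."""
    n0 = sum(1 for x, _ in samples if x == 0)
    n1 = sum(1 for x, _ in samples if x == 1)
    yes0 = sum(1 for x, c in samples if x == 0 and c)
    yes1 = sum(1 for x, c in samples if x == 1 and c)
    return (yes0 / n0) / (yes1 / n1)


fair = acceptance_ratio(simulate_decisions(10000, malus=0.0, threshold=0.0))
biased = acceptance_ratio(simulate_decisions(10000, malus=0.5, threshold=0.0))
print(fair, biased)
```

Without a malus the ratio stays near 1; with a malus it drops, which is the disparate impact the later slides measure.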

Training a predictor for C
$C(Y, Z)$
Y: information about Y (unprotected attributes)
Z: additional information we give to the algorithm
$\mathbf{Z} \propto X + \varepsilon_{\gamma}$
we can predict the value of X from Z with fidelity γ
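To see how a noisy proxy Z can reveal X, here is a small sketch. It assumes X is binary, Z is X plus Gaussian noise, and X is guessed by thresholding Z at 0.5; `noise_scale` is a hypothetical knob, and the returned value is the measured prediction fidelity.

```python
import random


def leak_fidelity(noise_scale, n=10000, seed=0):
    """Fraction of samples where X is recovered correctly from Z."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        x = rng.randint(0, 1)                       # protected attribute
        z = x + noise_scale * rng.gauss(0.0, 1.0)   # proxy: X plus noise
        x_guess = 1 if z > 0.5 else 0               # simple threshold rule
        if x_guess == x:
            correct += 1
    return correct / n


print(leak_fidelity(0.1))  # little noise: X is almost fully recoverable
print(leak_fidelity(5.0))  # heavy noise: close to guessing
```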

A Simulation
• Generate 10,000 samples of C with disparate impact
• Train a classifier (e.g. Support-Vector-Machine) on the training data
• Provide it with (noisy) information about X
• Measure the algorithm-based τ on the test data

Discrimination by Algorithm
Figure axes: γ (how much information about X leaks into the data) vs. τ (disparate impact on protected class)

8 % luck / noise
6-8 % discrimination
87 % merit

Why give that information to the algorithm?
We don't! But it leaks through anyway...
Diagram: X -> Z

But can it be done?
Discrimination through information
leakage is possible, but how likely is it in
practice?
Let's try!
We use publicly available data to predict
the gender of GitHub users (protected
attribute X).

Basic Information
Manually classify users as men/women (by looking at profile pictures, names) -> 5,000 training samples with small error
Use the GitHub API to retrieve information about users (followers, repositories, stargazers, contributions, ...)
We only use data that is easy to get and likely to be used in a real-world setting for classification
We only use a limited dataset (proof of concept, not

Stargazers, Followers, Projects, ...
No predictive power for X

GitHub Event Data
https://www.githubarchive.org/
PushEvent, 2015-03-17 21:21h: 3 commits, Log: "..."
PullRequestEvent, 2015-03-17 22:43
CommentEvent, 2015-03-17 23:14h: "Hi, I think we should add more cats to the landing page"

Hourly event patterns & event types

Commit Message Analysis
Use the commit messages (as obtained from the event
data) to predict gender by training a Support Vector
Machine (SVM) classifier on the word frequency data.
Word cloud: lol, emoji, wtf, seriously, rtfm, dude, fuck, git
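The word frequency data can be sketched as a simple bag-of-words step. This covers only the feature extraction; training the SVM on these features is not shown. The tokenizer and the example commit messages are assumptions for illustration.

```python
import re
from collections import Counter


def word_frequencies(messages):
    """Relative frequency of each word across a user's commit messages."""
    counts = Counter()
    for message in messages:
        # Very rough tokenizer: lowercase runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", message.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


# Hypothetical commit messages of a single user.
messages = ["fix typo in readme", "lol fix the fix"]
freqs = word_frequencies(messages)
print(freqs)
```

One such frequency vector per user, together with the manually assigned label, would form one training sample for the classifier.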

Predictive Power of Model
15 %, 35 % error, 50 % baseline fidelity
30 % information leakage
(with a very simple data set)

Takeaways
Algorithms will readily "learn"
discrimination from us if we provide
them with contaminated training
data.
Information leakage of protected
attributes can happen easily.

How we can fix this
Harder than you might think! We need to know X to
measure disparate impact and remove it
Incorporate penalty for discrimination into target function
Remove information about X from dataset by performing a suitable transformation (reduces fidelity of model)
see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al.
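One very simple transformation of this kind is to subtract each group's mean from a feature, so that the feature's average no longer differs between groups. This is a deliberately crude stand-in and not the repair procedure from the cited paper; it removes only mean differences, and the function name and data are invented.

```python
def center_by_group(values, groups):
    """Subtract each group's mean from the values of its members."""
    sums, counts = {}, {}
    for value, group in zip(values, groups):
        sums[group] = sums.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    means = {group: sums[group] / counts[group] for group in sums}
    return [value - means[group] for value, group in zip(values, groups)]


# Hypothetical feature Z that is strongly shifted by X.
z = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
x = [0, 0, 0, 1, 1, 1]

repaired = center_by_group(z, x)
print(repaired)
```

Note that this needs X to be known, which is exactly the difficulty named above, and that differences in spread or shape between groups would still leak.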

Oh, it's you again! De-anonymizing data

What is de-anonymization?
Use data recorded about individuals / entities
to identify those same individuals / entities in
another set of data (exactly or with high
likelihood).
Deanonymization becomes an increasing risk as datasets
about individual entities become larger and more detailed.

"Buckets of Truth"
N boolean attributes per entity - on average M < N of them
are set
$P_{col.} = P(M_1^1 = M_1^2, \cdots, M_N^1 = M_N^2)$
fun with deanonymization: http://en.akinato

Examples
uniform distribution: $P_{col.} = (1 - 2p(1 - p))^N$
long-tailed distribution: $P_{col.} = ?$
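For the uniform case the collision probability is easy to evaluate. The sketch assumes N independent boolean attributes, each set with the same probability p, so two entities agree on one attribute with probability p² + (1 − p)² = 1 − 2p(1 − p). The function name is made up.

```python
def collision_probability(p, n):
    """Probability that two random entities agree on all n attributes."""
    match_one = 1.0 - 2.0 * p * (1.0 - p)  # both set or both unset
    return match_one ** n


# Collisions become rare quickly as the number of attributes grows.
for n in (10, 50, 100):
    print(n, collision_probability(0.1, n))
```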

Geolife Trajectories
http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-
Question:
How easy is it to re-identify single users through their data?
Could an algorithm build a representation of a given user?

Individual trajectories (color-coded)
http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-

How good are our buckets?
𝑒−𝑥 𝑎 𝛾 here's the interesting information

Identifying / comparing fingerprints
$s(u_i, u_j) = \frac{f(u_i) \cdot f(u_j)}{\lVert f(u_i) \rVert \cdot \lVert f(u_j) \rVert}$
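A sketch of comparing two fingerprints, assuming the similarity is the normalized dot product (cosine similarity) of the fingerprint vectors and that a fingerprint is a list of bucket counts. Both are assumptions about the setup, not confirmed details.

```python
import math


def cosine_similarity(f_i, f_j):
    """Dot product of two fingerprint vectors divided by their norms."""
    dot = sum(a * b for a, b in zip(f_i, f_j))
    norm_i = math.sqrt(sum(a * a for a in f_i))
    norm_j = math.sqrt(sum(b * b for b in f_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # an empty fingerprint matches nothing
    return dot / (norm_i * norm_j)


print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction
print(cosine_similarity([1, 0], [0, 1]))        # no shared buckets
```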

Testing De-Anonymization
Use 75 % of the trajectories as prior data set
Predict the user ID belonging to the remaining
25 %
Measure average success probability and
identification rank (i.e. at which position is the
correct user)
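The identification rank can be sketched as follows: compare a held-out fingerprint against every user's prior fingerprint and see at which position the true user ends up. Cosine similarity and the toy fingerprints are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def identification_rank(test_fp, prior_fps, true_user):
    """Position (1 = best match) of true_user among all prior users."""
    ranked = sorted(
        prior_fps,
        key=lambda user: cosine(test_fp, prior_fps[user]),
        reverse=True,
    )
    return ranked.index(true_user) + 1


# Hypothetical bucket counts built from the prior part of the data.
prior = {"u1": [5, 0, 1], "u2": [0, 4, 2], "u3": [1, 1, 5]}
held_out = [4, 1, 1]  # fingerprint from held-out trajectories of u1

rank = identification_rank(held_out, prior, "u1")
print(rank)
```

Averaging this rank, and the share of cases with rank 1, over all held-out trajectories gives the two measures named above.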

Possible Improvements
Use Temporal / Sequence Information
Use speed of movement / mode of transportation
Improve choice of buckets for fingerprinting
Interesting Review Article: "Life in the network: the coming age of computational social science." D. Laze

Summary
The more data we have, the more difficult it is
to keep algorithms from directly learning and
using object identities instead of attributes.
Our data follows us around!

As Data Scientists / Analysts / Programmers
Consume data responsibly: Don't include everything
under the sun just because it increases fidelity by a
slim margin
Check for disparate impact and remove it from the
input data
Test anonymization safety by using machine learning

As Citizens / Hackers / Users
Do not blindly trust decisions made by algorithms
Test them if possible (using different input values)
Reverse-engineer them (using e.g. active learning)
Fight back with data: Collect and analyze
algorithm-based decisions using collaborative
approaches

As a Society
Create better regulations for algorithms and their
use
Force companies / organizations to open up black
boxes
Make access to data easier, also for small
organizations

Algorithms are like children: smart & eager to learn.
So let's make sure we raise them to be responsible adults.

Thanks!
Slides: slideshare.net/japh44
Website: andreas-dewes.de/en
Code (coming soon): github.com/adewes/32c3
E-Mail: andreas@7scientists.com
Twitter: @japh44
License: Creative Commons Attribution 4.0 International (except Google Deep Learning image)

Intro
Whenever we measure user actions, we (automatically) gain
information about them that we can use to classify them.

Classifying and Controlling People

Case Study: Click Rate Optimization
Simple but common use case for big data: Collaborative
filtering
• Users have an opinion on a given topic A (between 0-1)
• They are more likely to like articles that confirm their
opinion
• Our algorithm knows nothing about A, just tries to
optimize click rate
• User opinion may change over time according to the
content he/she is exposed to (2 % change per exposure)

Mathematical Model
$P(Like) \propto A_{article} - A_{user} + \varepsilon_{mood}$

Like Rate vs. Articles Viewed (only observe, don't optimize)

What have we learned?
60 observations / user

Clustering users into groups
Similarity measure: # Articles that both users like or dislike
Clustering: K-Means (minimize distance within clusters, maximize distance betw
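The similarity measure can be sketched as a count of matching verdicts. The assumption here is that each user's feedback is stored as a mapping from article ID to like (True) or dislike (False), and that only articles seen by both users count. The clustering step itself is not shown.

```python
def agreement(likes_a, likes_b):
    """Number of articles that both users like or both dislike."""
    shared = likes_a.keys() & likes_b.keys()
    return sum(1 for article in shared if likes_a[article] == likes_b[article])


# Hypothetical feedback of two users.
alice = {"a1": True, "a2": False, "a3": True}
bob = {"a1": True, "a2": True, "a3": True, "a4": False}

score = agreement(alice, bob)
print(score)
```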

Like Rate vs. Articles Viewed (with click-rate optimization)

Consequence of optimization: "Filter Bubbles"

Switching On User Feedback
$A_{user}^{t+1} = A_{user}^{t} + \gamma \cdot \mathrm{sgn}(A_{user}^{t} - A_{article})$
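A sketch of one feedback step, following the update rule as it reads in the text, sign included. The step size 0.02 mirrors the 2 % change per exposure from the case study; clamping the opinion to the range 0 to 1 is an added assumption, as are the function names.

```python
def sgn(value):
    """Sign function: -1, 0 or 1."""
    return (value > 0) - (value < 0)


def update_opinion(a_user, a_article, gamma=0.02):
    """Apply one exposure to an article and return the new user opinion."""
    new_opinion = a_user + gamma * sgn(a_user - a_article)
    # Assumption: opinions stay within the interval [0, 1].
    return min(1.0, max(0.0, new_opinion))


opinion = 0.5
for _ in range(3):
    opinion = update_opinion(opinion, a_article=0.3)
print(opinion)
```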

User opinions with and without feedback
the algorithm has an interest to steer opinions towards the
Figure panels: no feedback / 2 % feedback
