Say "Hi!" to Your New Boss
How algorithms might soon control our lives
(and why we should be careful with them)
Motivation
no alternatives, Google?
Outline
Theory
1. Algorithms
2. Machine Learning
3. Big Data & Consequences for Machine Learning
4. Use of Algorithms Today and in the Future
Experiments
1. Discriminating people with machine learning & algorithms
2. Creating persistent user identities by (accidental) de-anonymization
Summary & Outlook
1. Strategies for Handling Data Responsibly
Algorithms, Machine Learning & Big Data
Algorithms
An algorithm is a "recipe" that gives a computer (or a
human) step-by-step instructions in order to achieve a
certain goal.
Start → Doorbell ringing → Andreas stands on trapdoor?
yes: Open trapdoor
no: Wait. Our time will come.
Machine Learning
A machine learning algorithm automatically generates
models and checks them against the training data we
provide, trying to find a model that explains the data well
and can predict unknown data.
Data vs. Model
$y = m(\boldsymbol{x}, \boldsymbol{p}) + \varepsilon$
see e.g. "Machine Learning" by Tom Mitchell (McGraw Hill, 1997).
(Figure: data y plotted against x1)
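As a minimal sketch of the relation above, assume a linear model m(x, p) = p0 + p1 * x with Gaussian noise. Both choices, and the function names `fit_line` and `make_data`, are illustrative assumptions and do not come from the slides. Under these assumptions the parameters p can be estimated by ordinary least squares:

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = p0 + p1 * x; returns (p0, p1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    p1 = sxy / sxx
    p0 = mean_y - p1 * mean_x
    return p0, p1

def make_data(n, p0, p1, noise_sd, seed=0):
    """Generate y = m(x, p) + eps with a linear m and Gaussian eps (assumed)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [p0 + p1 * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

xs, ys = make_data(n=500, p0=1.0, p1=2.0, noise_sd=0.5)
p0_hat, p1_hat = fit_line(xs, ys)
print(f"estimated p0={p0_hat:.2f}, p1={p1_hat:.2f}")
```

The estimates should land close to the true parameters, and the remaining gap is the error term ε.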
Sources of Error
$\varepsilon = \varepsilon_{\text{sys}} + \varepsilon_{\text{noise}} + \varepsilon_{\text{hidden}}$
ε_sys: systematic errors arise due to imperfect measurements of known variables
ε_noise: noise is present due to the nature of the process or our measurement apparatus
ε_hidden: many variables are usually unknown to us
Big Data & Machine Learning
2000 2015
more data sources
high data volume
higher density
higher frequency
longer retention
Data Volume: More is (usually) better
Exploiting New Sources of Data
$y = m(x, p) + \varepsilon_{\text{hidden}} + \cdots$
incorporate variables that were hidden
into the model, reducing error
Understanding Results
Models can be easy or very difficult to interpret
Parameter space is often huge and can't be
explored entirely
(Figure: a decision tree classifier with yes/no splits such as "age > 37?", "height < 1.78" and "projects > 19?" is easy to interpret; a neural network classifier is hard to interpret.)
Example: Deep Learning for Image
Recognition
http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
Classifying Use of Algorithms
low risk
mildly annoying in case of failure /
misbehaviour
medium risk
large impact on our life in
case of failure / misbehaviour
high risk
critical impact on our
life in case of failure /
misbehaviour
low risk
personalization of services
(e.g. recommendation engines for webs
video-on-demand, content, ...)
individualized ad targeting
customer rating / profiling
consumer demand prediction
medium risk
personalized health
person classification (e.g. crime,
terrorism)
autonomous cars/ planes/ machines
...
automated trading
military intelligence / intervention
political oppression
critical infrastructure services (e.g. elect
life-changing decisions (e.g. about healt
high risk
Big Data & Advances in Machine
Learning
Data
"Mishaps"
Two Experiments
Discriminating People
With Algorithms
Humans can be prejudiced.
Are algorithms better?
Discrimination
Discrimination is treatment or consideration of, or making
a distinction in favor of or against, a person or thing based
on the group, class, or category to which that person or
thing is perceived to belong to rather than on individual
merit.
Wikipedia
Protected attributes (examples):
Ethnicity, Gender, Sexual Orientation, ...
When is a process discriminating?
Disparate Impact: Adverse impact of a process C on a given
group X
Outcome    X = 0                X = 1
C = NO     P(C = NO, X = 0)     P(C = NO, X = 1)
C = YES    P(C = YES, X = 0)    P(C = YES, X = 1)
$\frac{P(C = \text{YES} \mid X = 0)}{P(C = \text{YES} \mid X = 1)} < \tau$
see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al.
When is a process discriminating?
Estimating τ with real-world data
Outcome    X = 0    X = 1
C = NO     a        b
C = YES    c        d
$\frac{c / (a + c)}{d / (b + d)} < \tau$
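The estimate above can be sketched as a small function over the four counts of the table. The example counts and the threshold value 0.8 are assumptions chosen for illustration and are not taken from the slides.

```python
def disparate_impact(a, b, c, d):
    """Ratio P(C=YES | X=0) / P(C=YES | X=1) from a 2x2 table of counts.

    a, b: counts with C = NO  for X = 0 and X = 1
    c, d: counts with C = YES for X = 0 and X = 1
    """
    if a + c == 0 or b + d == 0:
        raise ValueError("each group needs at least one observation")
    rate_x0 = c / (a + c)
    rate_x1 = d / (b + d)
    if rate_x1 == 0:
        raise ValueError("P(C=YES | X=1) is zero; the ratio is undefined")
    return rate_x0 / rate_x1

# Hypothetical counts, for illustration only.
ratio = disparate_impact(a=30, b=20, c=10, d=40)
tau = 0.8  # assumed threshold, not taken from the slides
print(f"ratio={ratio:.3f}, disparate impact: {ratio < tau}")
```

With these counts, group X = 0 receives a YES in 10 of 40 cases and group X = 1 in 40 of 60 cases, so the ratio is 0.375.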
Discrimination through Data Analysis
Replacing a manual hiring process with
an automated one.
Benefits:
Save time screening CVs by hand
Improve candidate choice
The Setup
human
CV
algorithm
C Training Data
The Setup
Use submitted information (CV, work
samples) along with publicly available /
external information to predict candidate
success.
Use data from the manual process (invite/ no
invite) to train the classifier
Provide it with as much data as possible to
Our decision model
$S = m(Y) + d(X) + \varepsilon$
m(Y): score of candidate (merit function)
d(X): discrimination malus/bonus
ε: hidden variables & luck (if you believe in it)
$C = \begin{cases} \text{YES}, & S > t \\ \text{NO}, & \text{otherwise} \end{cases}$
(Figure: luck plotted against candidate merit, without discrimination and with discrimination)
Training a predictor for C
$C(Y, Z)$
Y: information about Y (unprotected attributes)
Z: additional information we give to the algorithm
$Z \propto X + \varepsilon_{\gamma}$
We can predict the value of X from Z with fidelity γ
A Simulation
• Generate 10,000 samples of C with disparate impact τ
• Train a classifier (e.g. Support-Vector-Machine) on
the test data
• Provide it with (noisy) information about X
• Measure the algorithm-based τ on the test data
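The steps above can be sketched with the standard library only. This is not the simulation from the talk. The Gaussian merit Y, the malus `delta`, the noise levels, the threshold t = 0 and all function names are assumptions. The Support Vector Machine is replaced by a deliberately simple lookup-table classifier (majority decision per feature cell) to keep the example self-contained.

```python
import random
from collections import defaultdict

def simulate(n, delta, leak_noise, seed=0):
    """Draw n candidates; return features (y, z), protected x and decisions c."""
    rng = random.Random(seed)
    feats, xs, cs = [], [], []
    for _ in range(n):
        x = rng.randint(0, 1)               # protected attribute X
        y = rng.gauss(0.0, 1.0)             # merit-related attribute Y
        # S = m(Y) + d(X) + eps, with m(Y) = Y and a malus for X = 0 (assumed)
        s = y - (delta if x == 0 else 0.0) + rng.gauss(0.0, 0.3)
        z = x + rng.gauss(0.0, leak_noise)  # Z: noisy proxy for X
        feats.append((y, z))
        xs.append(x)
        cs.append(s > 0.0)                  # C = YES if S > t, with t = 0
    return feats, xs, cs

def cell(y, z):
    """Discretise features: Y into bins of width 0.5, Z at the midpoint 0.5."""
    return (int(y // 0.5), z > 0.5)

def train(feats, cs):
    """Lookup-table classifier: net YES/NO votes per feature cell."""
    votes = defaultdict(int)
    for (y, z), c in zip(feats, cs):
        votes[cell(y, z)] += 1 if c else -1
    return votes

def predict(votes, y, z):
    """YES if the cell had a YES majority in training; unseen cells give NO."""
    return votes.get(cell(y, z), 0) > 0

def impact_ratio(xs, decisions):
    """P(YES | X=0) / P(YES | X=1)."""
    yes, total = [0, 0], [0, 0]
    for x, c in zip(xs, decisions):
        total[x] += 1
        yes[x] += c
    return (yes[0] / total[0]) / (yes[1] / total[1])

def experiment(leak_noise, n=10_000, delta=1.0):
    """Train on one simulated sample and measure the ratio on a fresh one."""
    train_f, _, train_c = simulate(n, delta, leak_noise, seed=1)
    test_f, test_x, _ = simulate(n, delta, leak_noise, seed=2)
    model = train(train_f, train_c)
    preds = [predict(model, y, z) for y, z in test_f]
    return impact_ratio(test_x, preds)

for noise in (0.1, 1.0, 10.0):
    print(f"leak noise {noise:>4}: algorithm-based ratio = {experiment(noise):.2f}")
```

The classifier never sees X directly. When Z is a clean proxy (small `leak_noise`), the ratio measured on its predictions should fall well below 1. When Z is dominated by noise, it should move back towards 1.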
Discrimination by Algorithm
(Figure axes: γ (how much information about X leaks into the data), τ (disparate impact on protected class))
8 % luck / noise
6-8 % discrimination
87 % merit
Why give that information to the
algorithm?
𝒁
We don't! But it leaks through anyway...
𝑋
But can it be done?
Discrimination through information
leakage is possible, but how likely is it in
practice?
Let's try!
We use publicly available data to predict
the gender of Github users (protected
attribute X).
Basic Information
Manually classify users as men/women (by looking at
profile pictures, names) -> 5,000 training samples with
small error
Use the Github API to retrieve information about users
(followers, repositories, stargazers, contributions, ...)
We only use data that is easy to get and likely to be used in
real-world setting for classification
We only use a limited dataset (proof of concept, not
Stargazers, Followers, Projects, ...
No predictive power for X
Github Event Data
https://www.githubarchive.org/
PushEvent
2015-03-17 21:21h
3 commits
Log : "..."
PullRequestEvent
2015-03-17 22:43
CommentEvent
2015-03-17 23:14h
"Hi, I think we should add more
cats to the landing page"
Hourly event patterns & event types
Commit Message Analysis
Use the commit messages (as obtained from the event
data) to predict gender by training a Support Vector
Machine (SVM) classifier on the word frequency data.
lol, emoji, wtf, seriously, rtfm, dude, fuck, git
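The word frequency data mentioned above can be sketched as follows. The tokenizer, the example messages and the vocabulary are hypothetical, and the classifier itself (an SVM in the talk) is not shown. The block only builds the feature vector that such a classifier would consume.

```python
import re
from collections import Counter

def tokenize(message):
    """Lower-case a commit message and split it into word tokens."""
    return re.findall(r"[a-z']+", message.lower())

def word_frequencies(messages, vocabulary):
    """Relative frequency of each vocabulary word over a user's messages."""
    counts = Counter(tok for msg in messages for tok in tokenize(msg))
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[word] / total for word in vocabulary]

# Hypothetical commit messages and vocabulary, for illustration only.
messages = ["Fix typo, lol", "seriously, fix the build", "lol again"]
vocabulary = ["lol", "fix", "seriously", "git"]
print(word_frequencies(messages, vocabulary))
```

Each user becomes one fixed-length vector, so users with very different numbers of commits remain comparable.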
Predictive Power of Model
15 % 35 % error 50 % baseline fidelity
30 % information leakage
(with a very simple data set)
Takeaways
Algorithms will readily "learn"
discrimination from us if we provide
them with contaminated training
data.
Information leakage of protected
attributes can happen easily.
How we can fix this
Harder than you might think! We need to know X to
measure disparate impact and remove it
Incorporate penalty for discrimination into target
function
Remove information about X from dataset by
performing a suitable transformation (reduces
fidelity of model)
see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al.
Oh, it's you again! De-anonymizing
data
What is de-anonymization?
Use data recorded about individuals / entities
to identify those same individuals / entities in
another set of data (exactly or with high
likelihood).
Deanonymization becomes an increasing risk as datasets
about individual entities become larger and more detailed.
"Buckets of Truth"
N boolean attributes per entity - on average M < N of them
are set
$P_{\text{col.}} = P(M_1^1 = M_1^2, \cdots, M_N^1 = M_N^2)$
fun with deanonymization: http://en.akinato
Examples
uniform distribution: $P_{\text{col.}} = (1 - 2p(1 - p))^N$
long-tailed distribution: $P_{\text{col.}} = ?$
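For the uniform case, the collision probability can be evaluated directly. The sketch assumes that every attribute is set independently with the same probability p for both entities. Two entities then agree on one attribute with probability p² + (1 − p)² = 1 − 2p(1 − p).

```python
def collision_probability(p, n):
    """P_col = (1 - 2p(1 - p))^N for N independent boolean attributes."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    agree = p * p + (1.0 - p) * (1.0 - p)  # equals 1 - 2p(1 - p)
    return agree ** n

for n in (10, 50, 100):
    print(n, collision_probability(0.5, n))
```

With p = 0.5 each added attribute halves the chance that two entities share a fingerprint, which is why even modest numbers of attributes single out individuals.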
Geolife Trajectories
http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-
Question:
How easy is it to re-identify single users through their data?
Could an algorithm build a representation of a given user?
Individual trajectories (color-coded)
http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-
How good are our buckets?
𝑒−𝑥 𝑎 𝛾 here's the interesting information
Identifying / comparing fingerprints
$s(u_i, u_j) = \frac{f(u_i) \cdot f(u_j)}{\lVert f(u_i) \rVert \cdot \lVert f(u_j) \rVert}$
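Read as a cosine similarity between fingerprint vectors, the comparison can be sketched like this. The fingerprints are assumed to be lists of bucket counts of equal length, and returning 0 for an empty fingerprint is a convention chosen here.

```python
import math

def similarity(f_i, f_j):
    """Cosine similarity between two fingerprint vectors of equal length."""
    if len(f_i) != len(f_j):
        raise ValueError("fingerprints must have the same length")
    dot = sum(a * b for a, b in zip(f_i, f_j))
    norm_i = math.sqrt(sum(a * a for a in f_i))
    norm_j = math.sqrt(sum(b * b for b in f_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # convention chosen here for empty fingerprints
    return dot / (norm_i * norm_j)

# Hypothetical bucket counts, for illustration only.
print(similarity([3, 0, 1], [6, 0, 2]))  # same direction, close to 1
print(similarity([1, 0, 0], [0, 1, 0]))  # no shared buckets
```

Because the measure is normalized, a user with twice as many recorded points but the same movement pattern still scores close to 1.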
Testing De-Anonymization
Use 75 % of the trajectories as prior data set
Predict the user ID belonging to the remaining
25 %
Measure average success probability and
identification rank (i.e. at which position is the
correct user)
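The two measurements in this test can be sketched as follows. The sketch assumes that a similarity score between each held-out trajectory and every known user's prior fingerprint has already been computed. The user IDs and scores are hypothetical.

```python
def identification_rank(scores, true_user):
    """1-based position of the true user when candidates are sorted by score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(true_user) + 1

def evaluate(results):
    """results: list of (scores, true_user); returns (success rate, mean rank)."""
    ranks = [identification_rank(scores, user) for scores, user in results]
    success = sum(1 for r in ranks if r == 1) / len(ranks)
    return success, sum(ranks) / len(ranks)

# Hypothetical similarity scores for two held-out trajectories.
results = [
    ({"u1": 0.9, "u2": 0.4, "u3": 0.1}, "u1"),  # correct user ranked first
    ({"u1": 0.5, "u2": 0.7, "u3": 0.6}, "u3"),  # correct user ranked second
]
print(evaluate(results))
```

A mean rank close to 1 means that the correct user is usually at or near the top of the candidate list, even when the top match is wrong.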
Identification Rate
Finding Similar Users
Possible Improvements
Use Temporal / Sequence Information
Use speed of movement / mode of transportation
Improve choice of buckets for fingerprinting
Interesting Review Article: "Life in the network: the coming age of computational social science." D. Laze
Summary
The more data we have, the more difficult it is
to keep algorithms from directly learning and
using object identities instead of attributes.
Our data follows us around!
What can we do?
As Data Scientists / Analysts /
Programmers
Consume data responsibly: Don't include everything
under the sun just because it increases fidelity by a
slim margin
Check for disparate impact and remove it from the
input data
Test anonymization safety by using machine learning
As Citizens / Hackers / Users
Do not blindly trust decisions made by algorithms
Test them if possible (using different input values)
Reverse-engineer them (using e.g. active learning)
Fight back with data: Collect and analyze
algorithm-based decisions using collaborative
approaches
As a Society
Create better regulations for algorithms and their
use
Force companies / organizations to open up black
boxes
Make access to data easier, also for small
organizations
Algorithms are
like children:
Smart & eager to learn
So let's make sure
we raise them to
be responsible
adults.
Thanks!
Slides slideshare.net/japh44
Website andreas-dewes.de/en
Code (coming soon) github.com/adewes/32c3
E-Mail andreas@7scientists.com
Twitter @japh44
License Creative Commons Attribution 4.0
International
(except Google Deep Learning image)
Result
Intro
Whenever we measure user actions, we (automatically) gain
information about them that we can use to classify them.
Classifying and Controlling People
Case Study: Click Rate Optimization
Simple but common use case for big data: Collaborative
filtering
• Users have an opinion on a given topic A (between 0-1)
• They are more likely to like articles that confirm their
opinion
• Our algorithm knows nothing about A, just tries to
optimize click rate
• User opinion may change over time according to the
content he/she is exposed to (2 % change per exposure)
Mathematical Model
$P(\text{Like}) \propto A_{\text{article}} - A_{\text{user}} + \varepsilon_{\text{mood}}$
Like Rate vs. Articles Viewed (only observe, don't optimize)
What have we learned?
60 observations / user
Clustering users into groups
Similarity measure: # Articles that both users like or dislike
Clustering: K-Means (minimize distance within clusters, maximize distance betw
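The similarity measure named above can be sketched as a count of agreements. The sketch assumes that observations are stored as a mapping from article ID to like/dislike and that only articles seen by both users are compared. The names and data are hypothetical, and the K-Means step is not shown.

```python
def agreement(likes_a, likes_b):
    """Number of articles on which two users agree (both like or both dislike)."""
    shared = likes_a.keys() & likes_b.keys()  # articles both users have rated
    return sum(1 for article in shared if likes_a[article] == likes_b[article])

# Hypothetical observations: article id -> True (like) / False (dislike).
alice = {"a1": True, "a2": False, "a3": True}
bob = {"a1": True, "a2": True, "a3": True, "a4": False}
print(agreement(alice, bob))
```

Here the two users agree on a1 and a3, disagree on a2, and a4 is ignored because only one of them saw it.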
Like Rate vs. Articles Viewed (with click-rate optimization)
Consequence of optimization: "Filter
Bubbles"
Switching On User Feedback
$A_{\text{user}}^{t+1} = A_{\text{user}}^{t} + \gamma \cdot \operatorname{sgn}(A_{\text{user}}^{t} - A_{\text{article}})$
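A single feedback step can be sketched as below, following the update rule as it is transcribed here. The default γ = 0.02 corresponds to the 2 % change per exposure mentioned earlier. Clamping the result to [0, 1] is an assumption, made because opinions are defined on that interval.

```python
def sgn(value):
    """Sign function: -1, 0 or +1."""
    return (value > 0) - (value < 0)

def update_opinion(a_user, a_article, gamma=0.02):
    """One exposure step, following the update rule as transcribed above.

    The result is clamped to [0, 1]; the clamp is an assumption based on
    opinions being defined on that interval.
    """
    new_opinion = a_user + gamma * sgn(a_user - a_article)
    return min(1.0, max(0.0, new_opinion))

print(update_opinion(0.5, 0.3))
```

Each exposure shifts the opinion by a fixed step γ regardless of how far apart user and article are, so repeated exposure accumulates linearly until the boundary is reached.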
User opinions with and without
feedback
the algorithm has an interest to steer opinions towards the
no feedback 2 % feedback
Summary
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 

Say "Hi!" to Your New Boss

  • 1. Say "Hi!" to Your New Boss How algorithms might soon control our lives (and why we should be careful with them)
  • 3. Outline Theory 1. Algorithms 2. Machine Learning 3. Big Data & Consequences for Machine Learning 4. Use of Algorithms Today and in the Future Experiments 1. Discriminating people with machine learning & algorithms 2. Creating persistent user identities by (accidental) de-anonymization Summary & Outlook 1. Strategies for Handling Data Responsibly
  • 4. Algorithms, Machine Learning & Big Data
  • 5. Algorithms An algorithm is a "recipe" that gives a computer (or a human) step-by-step instructions in order to achieve a certain goal. Flowchart: Start → Doorbell ringing → Andreas stands on trapdoor? → yes: Open trapdoor; no: Wait. Our time will come.
  • 6. Machine Learning A machine learning algorithm automatically generates models and checks them against the training data we provide, trying to find a model that explains the data well and can predict unknown data.
  • 7. Data vs. Model $y = m(\mathbf{x}, \mathbf{p}) + \varepsilon$ see e.g. "Machine Learning" by Tom Mitchell (McGraw Hill, 1997). Plot axes: y vs. x1
  • 8. Data vs. Model $y = m(\mathbf{x}, \mathbf{p}) + \varepsilon$ see e.g. "Machine Learning" by Tom Mitchell (McGraw Hill, 1997). Plot axes: y vs. x1
  • 9. Sources of Error $\varepsilon = \varepsilon_{sys} + \varepsilon_{noise} + \varepsilon_{hidden}$ Systematic errors arise due to imperfect measurements of known variables. Noise is present due to the nature of the process or our measurement apparatus. Hidden: many variables are usually unknown to us.
  • 10. Big Data & Machine Learning 2000 → 2015: more data sources, high data volume, higher density, higher frequency, longer retention
  • 11. Data Volume: More is (usually) better
  • 12. Data Volume: More is (usually) better
  • 13. Exploiting New Sources of Data $y = m(x, p) + \varepsilon_{hidden} + \cdots$ Incorporate variables that were hidden into the model, reducing error
  • 14. Understanding Results Models can be easy or very difficult to interpret. Parameter space is often huge and can't be explored entirely. Decision tree classifier (easy to interpret), with yes/no nodes such as "age > 37 ?", "height < 1.78", "projects > 19 ?"; neural network classifier (hard to interpret)
  • 15. Example: Deep Learning for Image Recognition http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
  • 16. Classifying Use of Algorithms low risk: mildly annoying in case of failure / misbehaviour; medium risk: large impact on our life in case of failure / misbehaviour; high risk: critical impact on our life in case of failure / misbehaviour
  • 17. low risk personalization of services (e.g. recommendation engines for webs video-on-demand, content, ...) individualized ad targeting customer rating / profiling consumer demand prediction
  • 18. medium risk personalized health person classification (e.g. crime, terrorism) autonomous cars/ planes/ machines ... automated trading
  • 19. high risk military intelligence / intervention political oppression critical infrastructure services (e.g. elect life-changing decisions (e.g. about healt
  • 20. Big Data & Advances in Machine Learning
  • 22. Discriminating People With Algorithms Humans can be prejudiced. Are algorithms better?
  • 23. Discrimination Discrimination is treatment or consideration of, or making a distinction in favor of or against, a person or thing based on the group, class, or category to which that person or thing is perceived to belong to rather than on individual merit. Wikipedia Protected attributes (examples): Ethnicity, Gender, Sexual Orientation, ...
  • 24. When is a process discriminating? Disparate Impact: Adverse impact of a process C on a given group X
    Outcome | X = 0 | X = 1
    C = NO | P(C = NO, X = 0) | P(C = NO, X = 1)
    C = YES | P(C = YES, X = 0) | P(C = YES, X = 1)
    $\frac{P(C = YES \mid X = 0)}{P(C = YES \mid X = 1)} < \tau$
    see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al.
  • 25. When is a process discriminating? Estimating $\tau$ with real-world data
    Outcome | X = 0 | X = 1
    C = NO | a | b
    C = YES | c | d
    $\frac{c / (a + c)}{d / (b + d)} < \tau$
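The estimate on slide 25 follows directly from the four counts in the table. Here is a minimal Python sketch of that calculation. The function name `disparate_impact` and the example counts are illustrative assumptions and do not come from the talk.

```python
def disparate_impact(a, b, c, d):
    """Estimate the disparate impact ratio from a 2x2 table of counts.

    a: C = NO,  X = 0    b: C = NO,  X = 1
    c: C = YES, X = 0    d: C = YES, X = 1
    """
    rate_x0 = c / (a + c)  # estimated P(C = YES | X = 0)
    rate_x1 = d / (b + d)  # estimated P(C = YES | X = 1)
    return rate_x0 / rate_x1


# Hypothetical counts: 40 of 100 accepted in group X = 0, 60 of 100 in group X = 1.
ratio = disparate_impact(a=60, b=40, c=40, d=60)
print(round(ratio, 3))
```

A ratio below the chosen threshold τ would indicate disparate impact against the group X = 0. The value of τ itself is a policy choice and is left open here.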
  • 26. Discrimination through Data Analysis Replacing a manual hiring process with an automated one. Benefits: Save time screening CVs by hand Improve candidate choice
  • 28. The Setup Use submitted information (CV, work samples) along with publicly available / external information to predict candidate success. Use data from the manual process (invite/ no invite) to train the classifier Provide it with as much data as possible to
  • 29. Our decision model $S = m(Y) + d(X) + \varepsilon$ where S is the score of the candidate, m(Y) the merit function, d(X) the discrimination malus/bonus, and $\varepsilon$ hidden variables & luck (if you believe in it). $C = \begin{cases} YES, & S > t \\ NO, & \text{otherwise} \end{cases}$ Plot labels: luck, candidate merit, without discrimination, with discrimination
  • 30. Training a predictor for C $C(Y, Z)$: Y is information about Y (unprotected attributes), Z is additional information we give to the algorithm. $Z \propto X + \varepsilon$: we can predict the value of X from Z with fidelity $\gamma$
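The decision model on slide 29 can be tried out with a small simulation. The sketch below only generates decisions and measures the resulting acceptance-rate ratio; it does not train a classifier. The distributions (standard normal merit, normal luck with standard deviation 0.3), the malus of 0.5 and all function names are assumptions chosen for illustration, not values from the talk.

```python
import random


def simulate_decisions(n, malus, threshold=0.0, seed=0):
    """Draw n candidates and return the 2x2 counts (a, b, c, d).

    Assumed for illustration: merit m(Y) ~ N(0, 1), luck ~ N(0, 0.3),
    X is 0 or 1 with equal probability, and d(X) = -malus for X = 0.
    """
    rng = random.Random(seed)
    counts = {(False, 0): 0, (False, 1): 0, (True, 0): 0, (True, 1): 0}
    for _ in range(n):
        x = rng.randint(0, 1)
        merit = rng.gauss(0.0, 1.0)
        luck = rng.gauss(0.0, 0.3)
        penalty = -malus if x == 0 else 0.0
        score = merit + penalty + luck   # S = m(Y) + d(X) + eps
        accepted = score > threshold     # C = YES if S > t, NO otherwise
        counts[(accepted, x)] += 1
    a, b = counts[(False, 0)], counts[(False, 1)]
    c, d = counts[(True, 0)], counts[(True, 1)]
    return a, b, c, d


def tau(a, b, c, d):
    """Ratio of acceptance rates, (c / (a + c)) / (d / (b + d))."""
    return (c / (a + c)) / (d / (b + d))


counts_fair = simulate_decisions(10_000, malus=0.0)
counts_biased = simulate_decisions(10_000, malus=0.5)
tau_fair = tau(*counts_fair)
tau_biased = tau(*counts_biased)
print(f"ratio without malus: {tau_fair:.2f}, with malus: {tau_biased:.2f}")
```

Without a malus the ratio should stay close to 1; with a malus applied to the group X = 0 it drops noticeably below 1, which is the disparate impact a classifier trained on these decisions could pick up.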
  • 31. A Simulation • Generate 10,000 samples of C with disparate impact $\tau$ • Train a classifier (e.g. Support Vector Machine) on the training data • Provide it with (noisy) information about X • Measure the algorithm-based $\tau$ on the test data
  • 33. Discrimination by Algorithm $\gamma$ (how much information about X leaks into the data)
  • 34. Discrimination by Algorithm $\tau$ (disparate impact on protected class)
  • 35. Discrimination by Algorithm 8 % luck / noise 6-8 % discrimination 87 % merit
  • 38. Why give that information to the algorithm? We don't! But it leaks through anyway... (diagram: X leaks into Z)
  • 39. But can it be done? Discrimination through information leakage is possible, but how likely is it in practice? Let's try! We use publicly available data to predict the gender of Github users (protected attribute X).
  • 40. Basic Information Manually classify users as men/women (by looking at profile pictures, names) -> 5,000 training samples with small error. Use the Github API to retrieve information about users (followers, repositories, stargazers, contributions, ...). We only use data that is easy to get and likely to be used in a real-world setting for classification. We only use a limited dataset (proof of concept, not
  • 41. Stargazers, Followers, Projects, ... No predictive power for X
  • 42. Github Event Data https://www.githubarchive.org/ Example events: PushEvent, 2015-03-17 21:21h, 3 commits, Log: "..."; PullRequestEvent, 2015-03-17 22:43; CommentEvent, 2015-03-17 23:14h, "Hi, I think we should add more cats to the landing page"
  • 43. Hourly event patterns & event types
  • 44. Commit Message Analysis Use the commit messages (as obtained from the event data) to predict gender by training a Support Vector Machine (SVM) classifier on the word frequency data. Example words: lol, emoji, wtf, seriously, rtfm, dude, fuck, git
  • 45. Predictive Power of Model 15 % 35 % error 50 % baseline fidelity 30 % information leakage (with a very simple data set)
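The "word frequency data" on slide 44 can be pictured as one vector per commit message. The sketch below shows one plausible way to build such vectors with the Python standard library. The tokenization rule, the function names and the example messages are assumptions, and the SVM training step itself is not shown.

```python
from collections import Counter
import re


def word_frequencies(message):
    """Return relative word frequencies for one commit message."""
    words = re.findall(r"[a-z']+", message.lower())
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def to_vector(frequencies, vocabulary):
    """Project a frequency dict onto a fixed, ordered vocabulary."""
    return [frequencies.get(word, 0.0) for word in vocabulary]


# Hypothetical commit messages, not taken from the Github event data.
messages = ["Fix typo in README", "fix fix broken build"]
vocabulary = sorted({w for m in messages for w in word_frequencies(m)})
vectors = [to_vector(word_frequencies(m), vocabulary) for m in messages]
print(vocabulary)
print(vectors)
```

Each vector has one entry per vocabulary word, so vectors from different messages are directly comparable and could be passed to any classifier that accepts numeric features.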
  • 46. Takeaways Algorithms will readily "learn" discrimination from us if we provide them with contaminated training data. Information leakage of protected attributes can happen easily.
  • 47. How we can fix this Harder than you might think! We need to know X to measure disparate impact and remove it. Incorporate a penalty for discrimination into the target function. Remove information about X from the dataset by performing a suitable transformation (reduces fidelity of model). See e.g. "Certifying and Removing Disparate Impact", M. Feldman et al.
  • 48. Oh, it's you again! De-anonymizing data
  • 49. What is de-anonymization? Use data recorded about individuals / entities to identify those same individuals / entities in another set of data (exactly or with high likelihood). Deanonymization becomes an increasing risk as datasets about individual entities become larger and more detailed.
  • 50. "Buckets of Truth" N boolean attributes per entity - on average M < N of them are set $P_{col.} = P(M_1^1 = M_1^2, \cdots, M_N^1 = M_N^2)$ fun with deanonymization: http://en.akinato
  • 51. Examples uniform distribution: $P_{col.} = \left(1 - 2p(1 - p)\right)^N$; long-tailed distribution: $P_{col.} = ?$
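The collision probability for the uniform case on slide 51 is easy to evaluate numerically. The sketch below assumes that "uniform" means every one of the N boolean attributes is set independently with the same probability p; that reading, and the function name, are assumptions rather than statements from the talk.

```python
def collision_probability(p, n):
    """Probability that two entities agree on all n boolean attributes,
    assuming each attribute is set independently with probability p."""
    # Two entities agree on one attribute if both have it set or both unset.
    agree_on_one = p * p + (1 - p) * (1 - p)  # equals 1 - 2p(1 - p)
    return agree_on_one ** n


for n in (10, 50, 100):
    print(n, collision_probability(0.5, n))
```

The probability shrinks exponentially with the number of attributes, which is why even modestly detailed records tend to single out individual entities.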
  • 52. Geolife Trajectories http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e- Question: How easy is it to re-identify single users through their data? Could an algorithm build a representation of a given user?
  • 59. How good are our buckets? 𝑒−𝑥 𝑎 𝛾 here's the interesting information
  • 60. Identifying / comparing fingerprints $s(u_i, u_j) = \frac{f(u_i) \cdot f(u_j)}{\lVert f(u_i) \rVert \cdot \lVert f(u_j) \rVert}$
  • 61. Testing De-Anonymization Use 75 % of the trajectories as prior data set Predict the user ID belonging to the remaining 25 % Measure average success probability and identification rank (i.e. at which position is the correct user)
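The test procedure on slide 61 can be sketched as follows, assuming that the fingerprint similarity on slide 60 is the normalized dot product (cosine similarity) of two bucket-count vectors. The fingerprints, user names and function names below are hypothetical, and the 75 % / 25 % split of the trajectories is represented only by a pre-built "known" set and a single held-out query.

```python
import math


def cosine_similarity(f_i, f_j):
    """Normalized dot product of two fingerprint vectors."""
    dot = sum(x * y for x, y in zip(f_i, f_j))
    norm = math.sqrt(sum(x * x for x in f_i)) * math.sqrt(sum(y * y for y in f_j))
    return dot / norm if norm else 0.0


def identification_rank(query, known, true_user):
    """Rank of true_user (1 = best match) when known users are sorted by similarity."""
    ranked = sorted(
        known,
        key=lambda user: cosine_similarity(query, known[user]),
        reverse=True,
    )
    return ranked.index(true_user) + 1


# Hypothetical fingerprints: visit counts per spatial bucket (prior data set).
known = {"u1": [5, 0, 1], "u2": [0, 4, 4], "u3": [1, 1, 0]}
query = [4, 0, 2]  # held-out data that actually belongs to u1
rank = identification_rank(query, known, "u1")
print(rank)
```

Averaging this rank over all held-out users gives the identification rank mentioned on the slide; the share of users with rank 1 gives the success probability.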
  • 64. Possible Improvements Use Temporal / Sequence Information Use speed of movement / mode of transportation Improve choice of buckets for fingerprinting Interesting Review Article: "Life in the network: the coming age of computational social science." D. Laze
  • 65. Summary The more data we have, the more difficult it is to keep algorithms from directly learning and using object identities instead of attributes. Our data follows us around!
  • 66. What can we do?
  • 67. As Data Scientists / Analysts / Programmers Consume data responsibly: Don't include everything under the sun just because it increases fidelity by a slim margin Check for disparate impact and remove it from the input data Test anonymization safety by using machine learning
  • 68. As Citizens / Hackers / Users Do not blindly trust decisions made by algorithms Test them if possible (using different input values) Reverse-engineer them (using e.g. active learning) Fight back with data: Collect and analyze algorithm-based decisions using collaborative approaches
  • 69. As a Society Create better regulations for algorithms and their use. Force companies / organizations to open up black boxes. Make access to data easier, also for small organizations
  • 70. Algorithms are like children: Smart & eager to learn So let's make sure we raise them to be responsible adults.
  • 71. Thanks! Slides: slideshare.net/japh44; Website: andreas-dewes.de/en; Code (coming soon): github.com/adewes/32c3; E-Mail: andreas@7scientists.com; Twitter: @japh44; License: Creative Commons Attribution 4.0 International (except Google Deep Learning image)
  • 73. Intro Whenever we measure user actions, we (automatically) gain information about them that we can use to classify them.
  • 76. Case Study: Click Rate Optimization Simple but common use case for big data: Collaborative filtering • Users have an opinion on a given topic A (between 0-1) • They are more likely to like articles that confirm their opinion • Our algorithm knows nothing about A, just tries to optimize click rate • User opinion may change over time according to the content he/she is exposed to (2 % change per exposure)
  • 77. Mathematical Model $P(Like) \propto A_{article} - A_{user} + \varepsilon_{mood}$
  • 78. Like Rate vs. Articles Viewed
  • 79. Like Rate vs. Articles Viewed only observe, don't optimize
  • 80. What have we learned? 60 observations / user
  • 81. Clustering users into groups Similarity measure: # Articles that both users like or dislike Clustering: K-Means (minimize distance within clusters, maximize distance betw
  • 82. Like Rate vs. Articles Viewed with click-rate optimization
  • 83. Consequence of optimization: "Filter Bubbles"
  • 84. Switching On User Feedback $A_{user}^{t+1} = A_{user}^{t} + \gamma \cdot \operatorname{sgn}\left(A_{user}^{t} - A_{article}\right)$
  • 85. User opinions with and without feedback the algorithm has an interest to steer opinions towards the no feedback 2 % feedback