Is Feature Selection Secure against Training Data Poisoning

Pattern Recognition and Applications Lab
University of Cagliari, Italy
Department of Electrical and Electronic Engineering

Is Feature Selection Secure
against Training Data Poisoning?
Huang Xiao (2), Battista Biggio (1), Gavin Brown (3), Giorgio Fumera (1), Claudia Eckert (2), Fabio Roli (1)
(1) Dept. of Electrical and Electronic Engineering, University of Cagliari, Italy
(2) Department of Computer Science, Technische Universität München, Germany
(3) School of Computer Science, University of Manchester, UK
Jul 6-11, 2015, ICML 2015

http://pralab.diee.unica.it
Motivation
•  Increasing number of services and apps available on the Internet
–  Improved user experience
•  Proliferation and sophistication of attacks and cyberthreats
–  Skilled / economically-motivated attackers
•  Several security systems use machine learning to detect attacks
–  but … is machine learning secure enough?

Is Feature Selection Secure?
•  Adversarial ML: security of learning and clustering algorithms
–  Barreno et al., 2006; Huang et al., 2011; Biggio et al., 2014; 2012; 2013a;
Brueckner et al., 2012; Globerson & Roweis, 2006
•  Feature Selection
–  High-dimensional feature spaces (e.g., spam and malware detection)
–  Dimensionality reduction to improve interpretability and generalization
•  How about the security of feature selection?
[Figure: feature selection reduces the d input features x1, x2, …, xd to k selected features x(1), x(2), …, x(k)]

Feature Selection under Attack
Attacker Model
•  Goal of the attack
•  Knowledge of the attacked system
•  Capability of manipulating data
•  Attack strategy
[Figure labels: PD(X,Y)?, f(x)]

Attacker’s Goal
•  Integrity Violation: to perform malicious activities without
compromising normal system operation
–  enforcing selection of features to facilitate evasion at test time
•  Availability Violation: to compromise normal system operation
–  enforcing selection of features to maximize generalization error
•  Privacy Violation: gaining confidential information on system users
–  reverse-engineering feature selection to get confidential information
Security violation: integrity, availability, privacy

Attacker’s Knowledge
•  Perfect knowledge
–  upper bound on performance degradation under attack
•  Limited knowledge
–  attack on surrogate data sampled from same distribution
[Figure: components of the attacked system: training data, feature representation (x1, x2, …, xd), feature selection algorithm (x(1), x(2), …, x(k))]

Attacker’s Capability
•  Inject points into the training data
•  Constraints on data manipulation
–  Fraction of the training data under the attacker’s control
–  Application-specific constraints
•  Example on PDF data
–  PDF file: hierarchy of interconnected objects
–  Objects can be added but not easily removed without compromising the file structure

13 0 obj
<< /Kids [ 1 0 R 11 0 R ] /Type /Page ... >>
endobj
17 0 obj
<< /Type /Encoding /Differences [ 0 /C0032 ] >>
endobj

Attack Scenarios
•  Different potential attack scenarios depending on assumptions
on the attacker’s goal, knowledge, capability
–  Details and examples in the paper
•  Poisoning Availability Attacks
Enforcing selection of features to maximize generalization error
–  Goal: availability violation
–  Knowledge: perfect / limited
–  Capability: injecting samples into the training data

Embedded Feature Selection Algorithms
•  Linear models
–  Select features according to |w|
LASSO (Tibshirani, 1996)
Ridge Regression (Hoerl & Kennard, 1970)
Elastic Net (Zou & Hastie, 2005)
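
Selecting features according to |w| can be illustrated with a minimal sketch. The function name `select_top_k` and the weight values are hypothetical; the weights stand in for those of any trained linear model (LASSO, ridge, or elastic net).

```python
def select_top_k(weights, k):
    """Return the indices of the k features with the largest |w|, largest first."""
    if not 0 <= k <= len(weights):
        raise ValueError("k must be between 0 and the number of features")
    # Rank feature indices by absolute weight; ties keep their original order.
    ranked = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
    return ranked[:k]

# Hypothetical weight vector of a trained linear model (illustrative values only).
w = [0.0, -2.5, 0.3, 1.2, 0.0, -0.7]
selected = select_top_k(w, k=3)
print(selected)  # [1, 3, 5]
```

With L1-regularized models many weights are exactly zero, so the same ranking also covers the case where the selected set is simply the non-zero weights.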

Poisoning Embedded Feature Selection
•  Attacker’s objective
–  to maximize generalization error on untainted data
•  Solution: subgradient-ascent technique
Notes on the attacker’s objective:
–  Loss estimated on surrogate data (excluding the attack point)
–  Algorithm trained on surrogate data (including the attack point)
–  Maximization is w.r.t. the choice of the attack point
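
The objective on the slide was an image and did not survive in the text. One way to write a bilevel problem matching the three notes above is sketched here. All symbols are assumed notation: x_c is the attack point with label y_c, \hat{D} is the surrogate data with m samples (\hat{x}_j, \hat{y}_j), \ell is the loss, and L is the learner's training objective. The exact form used in the talk may differ.

```latex
\max_{x_c} \; W(x_c) = \frac{1}{m} \sum_{j=1}^{m} \ell\big(\hat{y}_j, f_{x_c}(\hat{x}_j)\big)
\quad \text{s.t.} \quad
f_{x_c} \in \arg\min_{f} \; L\big(f;\, \hat{D} \cup \{(x_c, y_c)\}\big)
```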

Gradient Computation
KKT conditions
How does the solution change w.r.t. xc?
Subgradient is unique at the optimal solution!

Gradient Computation
•  We require the KKT conditions to hold under perturbation of xc

Gradient is now uniquely determined

Poisoning Attack Algorithm
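
The algorithm on this slide was an image and is not reproduced in the text. What follows is a minimal sketch of how a gradient-ascent poisoning loop could look, under stated assumptions: it uses ridge regression (closed form) rather than LASSO, and it swaps the KKT-based gradient for central finite differences. All function names, the box constraint [lo, hi], and the synthetic data are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge regression without a bias term."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def attacker_objective(xc, yc, X_tr, y_tr, X_val, y_val, lam):
    """Train on the data plus the attack point; return MSE on untainted data."""
    w = fit_ridge(np.vstack([X_tr, xc]), np.append(y_tr, yc), lam)
    return float(np.mean((X_val @ w - y_val) ** 2))

def numerical_gradient(fun, x, eps=1e-4):
    """Central finite differences (assumption: stands in for the KKT-based gradient)."""
    grad = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        grad[j] = (fun(x + e) - fun(x - e)) / (2 * eps)
    return grad

def poison_point(x0, yc, X_tr, y_tr, X_val, y_val, lam, lo, hi, step=1.0, max_iter=50):
    """Projected gradient ascent on the attack point; a step is kept only if it helps."""
    fun = lambda x: attacker_objective(x, yc, X_tr, y_tr, X_val, y_val, lam)
    xc = np.clip(np.asarray(x0, dtype=float), lo, hi)
    best = fun(xc)
    for _ in range(max_iter):
        candidate = np.clip(xc + step * numerical_gradient(fun, xc), lo, hi)
        value = fun(candidate)
        if value > best:
            xc, best = candidate, value
        else:
            step *= 0.5          # shrink the step when the objective does not increase
            if step < 1e-6:
                break
    return xc, best

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
d = 5
w_true = rng.normal(size=d)
X_tr = rng.uniform(-1, 1, size=(30, d))
y_tr = np.sign(X_tr @ w_true)
X_val = rng.uniform(-1, 1, size=(200, d))
y_val = np.sign(X_val @ w_true)

lo, hi, lam = -1.0, 1.0, 0.1
x0, yc = X_tr[0].copy(), -y_tr[0]   # start from a training point with flipped label
before = attacker_objective(x0, yc, X_tr, y_tr, X_val, y_val, lam)
x_adv, after = poison_point(x0, yc, X_tr, y_tr, X_val, y_val, lam, lo, hi)
print(before, after)
```

Because a candidate is accepted only when the objective increases, the returned value is never lower than the starting one. A real attack would replace the finite differences with the analytic gradient, since each finite-difference step retrains the model 2d times.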

Experiments on PDF Malware Detection
•  PDF: hierarchy of interconnected objects (keyword/value pairs)
•  Learner’s task: to classify benign vs malware PDF files
•  Attacker’s task: to maximize classification error by injecting
poisoning attack samples
–  Only feature increments are considered (object insertion)
•  Object removal may compromise the PDF file
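
The increments-only constraint can be sketched as a projection applied to each candidate attack sample. The function name and the optional cap `upper` are hypothetical; the only assumption taken from the slide is that keyword counts may grow (object insertion) but not shrink.

```python
def project_increments_only(candidate, original, upper=None):
    """Clamp a candidate sample so no keyword count drops below the original file's count."""
    projected = []
    for cand, orig in zip(candidate, original):
        value = max(int(round(cand)), orig)      # counts are integers; removal is not allowed
        if upper is not None:
            value = min(value, max(upper, orig)) # hypothetical cap on inserted objects
        projected.append(value)
    return projected

original_counts = [2, 1, 1]      # keyword counts of the original file
candidate = [1.4, 3.6, 0.0]      # raw output of an optimization step
print(project_increments_only(candidate, original_counts, upper=3))  # [2, 3, 1]
```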
Example PDF objects:
13 0 obj
<< /Kids [ 1 0 R 11 0 R ] /Type /Page ... >>
endobj
17 0 obj
<< /Type /Encoding /Differences [ 0 /C0032 ] >>
endobj

Features: keyword counts
/Type 2, /Page 1, /Encoding 1, …

Maiorca et al., 2012; 2013;
Smutz & Stavrou, 2012;
Srndic & Laskov, 2013
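
Keyword-count features can be illustrated with a minimal sketch. The regular expression is a simplification that matches any "/Name" token in raw text, including names used as values; real detectors parse the PDF structure. The function name is hypothetical.

```python
import re
from collections import Counter

def keyword_counts(pdf_text):
    """Count name tokens such as /Type or /Page in raw PDF text."""
    return Counter(re.findall(r"/[A-Za-z0-9]+", pdf_text))

sample = """13 0 obj
<< /Kids [ 1 0 R 11 0 R ] /Type /Page >>
endobj
17 0 obj
<< /Type /Encoding /Differences [ 0 /C0032 ] >>
endobj"""

counts = keyword_counts(sample)
print(counts["/Type"], counts["/Page"], counts["/Encoding"])  # 2 1 1
```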

Experimental Results
Perfect Knowledge
Data: 300 (TR) and 5,000 (TS) samples – 114 features
Similar results obtained for limited-knowledge attacks!

Perfect Knowledge
A: selected features in the absence of attack
B: selected features under attack
k: number of features selected out of d
r: number of features common to the two sets
Kuncheva et al., 2007
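
Assuming the quantity plotted on this slide is Kuncheva's consistency index, it can be computed from A, B, k, d, and r as (r·d − k²) / (k·(d − k)). The function name below is hypothetical.

```python
def kuncheva_index(A, B, d):
    """Consistency index between two feature subsets of equal size k drawn from d features."""
    A, B = set(A), set(B)
    k = len(A)
    if len(B) != k:
        raise ValueError("both subsets must contain the same number of features")
    if not 0 < k < d:
        raise ValueError("k must satisfy 0 < k < d")
    r = len(A & B)                      # features common to the two sets
    return (r * d - k * k) / (k * (d - k))

print(kuncheva_index([0, 1, 2], [0, 1, 2], d=10))  # 1.0 (identical subsets)
print(kuncheva_index([0, 1], [2, 3], d=10))        # -0.25 (disjoint subsets)
```

The index equals 1 for identical subsets and is close to 0 when the overlap is what random selection would produce, so a drop under attack indicates that poisoning changed which features are selected.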

Conclusions and Future Work
•  Framework for security evaluation of feature selection under attack
–  Poisoning attacks against embedded feature selection algorithms
•  Poisoning can significantly affect feature selection
–  LASSO significantly vulnerable to poisoning attacks
•  Future research directions
–  Error bounds on the impact of poisoning on learning algorithms
–  Secure / robust feature selection algorithms

L1 regularization: stability against random noise,
but not against adversarial (worst-case) noise?

Any questions?
Thanks for your attention!

[Figure: panels for Perfect Knowledge and Limited Knowledge]
