Battista Biggio @ AISec 2013 - Is Data Clustering in Adversarial Settings Secure?


Clustering algorithms have been increasingly adopted in security applications to spot dangerous or illicit activities.
However, they were not originally devised to deal with deliberate attacks that aim to subvert the clustering process itself. Whether clustering can be safely adopted in such settings thus remains questionable.
In this work we propose a general framework that allows one to identify potential attacks against clustering algorithms, and to evaluate their impact, by making specific assumptions on the adversary's goal, knowledge of the attacked system, and capabilities of manipulating the input data. We show that an attacker may significantly poison the whole clustering process by adding a relatively small percentage of attack samples to the input data, and that some attack samples may be obfuscated to be hidden within some existing clusters.
We present a case study on single-linkage hierarchical clustering, and report experiments on clustering of malware samples and handwritten digits.


Transcript

  • 1. Pattern Recognition and Applications Lab. Is Data Clustering in Adversarial Settings Secure? Battista Biggio (1), Ignazio Pillai (1), Samuel Rota Bulò (2), Davide Ariu (1), Marcello Pelillo (3), and Fabio Roli (1). (1) Università di Cagliari (IT); (2) FBK-irst (IT); (3) Università Ca' Foscari di Venezia (IT). Berlin, 4 November 2013. University of Cagliari, Italy, Department of Electrical and Electronic Engineering.
  • 2. Motivation: is clustering secure?
    • Data clustering is increasingly applied in security-sensitive tasks
      – e.g., malware clustering for anti-virus / IDS signature generation
    • Carefully targeted attacks may mislead the clustering process
      – Samples can be added to merge (and split) existing clusters
      – Samples can be obfuscated and hidden within existing clusters (e.g., fringe clusters)
    (1) D. B. Skillicorn. Adversarial knowledge discovery. IEEE Intelligent Systems, 24:54–61, 2009.
    (2) J. G. Dutrisac and D. Skillicorn. Hiding clusters in adversarial settings. In IEEE Int'l Conf. Intelligence and Security Informatics, pp. 185–187, 2008.
  • 3. Our work
    • Framework for security evaluation of clustering algorithms
      1. Definition of potential attacks
      2. Empirical evaluation of their impact
    • Adversary's model
      – Goal
      – Knowledge
      – Capability
      – Attack strategy
    • Inspired by previous work on adversarial learning
      – Barreno et al., Can machine learning be secure?, ASIACCS 2006
      – Huang et al., Adversarial machine learning, AISec 2011
      – Biggio et al., Security evaluation of pattern classifiers under attack, IEEE Trans. Knowledge and Data Eng., 2013
  • 4. Adversary's goal
    • Security violation
      – Integrity: hiding clusters / malicious activities without compromising normal system operation (e.g., creating fringe clusters)
      – Availability: compromising normal system operation by altering the clustering output (e.g., merging existing clusters)
      – Privacy: gaining confidential information about system users by reverse-engineering the clustering process
    • Attack specificity
      – Targeted: affects the clustering of a given subset of samples
      – Indiscriminate: affects the clustering of any sample
  • 5. Adversary's knowledge
    • The adversary may know:
      – the input data
      – the feature representation
      – the clustering algorithm
      – the algorithm parameters (e.g., initialization)
    • Perfect knowledge (all of the above): yields an upper bound on the performance degradation under attack
  • 6. Adversary's capability
    • The attacker's capability is bounded:
      – maximum number of samples that can be added to the input data
        • e.g., the attacker may only control a small fraction of the malware samples collected by a honeypot
      – maximum amount of modification per sample, measured as a distance in feature space: the manipulated sample x' must remain in a feasible domain around x, e.g., the L1 ball $\|x - x'\|_1 \le d_{\max}$ (see the sketch below)
        • e.g., malware samples should preserve their malicious functionality
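To make the distance constraint concrete, here is a minimal Python sketch of a feasibility-restoring step: it rescales a candidate perturbation so the manipulated sample stays within an L1 ball of radius d_max around the original. The function name and the simple rescaling (rather than an exact Euclidean projection onto the L1 ball) are illustrative, not taken from the paper.

```python
import numpy as np

def clip_to_l1_ball(x, x_adv, d_max):
    # Rescale the perturbation so that ||x_adv - x||_1 <= d_max.
    # A simple feasibility-restoring step; interface and names are ours.
    delta = x_adv - x
    l1 = np.abs(delta).sum()
    if l1 > d_max:
        delta = delta * (d_max / l1)
    return x + delta
```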
  • 7. Formalizing the optimal attack strategy
    • The attacker's goal is encoded by an objective function g; knowledge of the data, features, etc. is encoded by a distribution $\mu$ over the parameters $\theta$; the capability of manipulating the input data is encoded by the feasible set $\Omega(A)$:

      $\max_{A'} \; E_{\theta \sim \mu}\left[\, g(A'; \theta) \,\right] \quad \text{s.t.} \quad A' \in \Omega(A)$

    • Perfect knowledge: $E_{\theta \sim \mu}\left[\, g(A'; \theta) \,\right] = g(A'; \theta_0)$
  • 8. Poisoning attacks (availability violation)
    • Goal: maximally compromising the clustering output on D
    • Capability: adding m attack samples

      $\max_{A'} \; g(A'; \theta_0) = d_c\left(C, f_D(D \cup A')\right) \quad \text{s.t.} \quad A' \in \Omega_p = \left\{ \{a'_i\}_{i=1}^{m} \subset \mathbb{R}^d \right\}$

      where $C = f(D)$ is the clustering of the untainted input data, and $f_D(D \cup A')$ is the clustering of the tainted data restricted to the samples in D.
    • Heuristics tailored to the clustering algorithm enable an efficient solution!
  • 9. Single-linkage hierarchical clustering
    • Bottom-up agglomerative clustering:
      – each point is initially considered a cluster
      – the closest clusters are iteratively merged
      – single-linkage criterion: $dist(C_i, C_j) = \min_{a \in C_i,\, b \in C_j} d(a, b)$
    • The flat clustering C = f(D) is obtained by cutting the dendrogram at a given level (see the sketch below).
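Single-linkage clustering with a k-cluster dendrogram cut is available off the shelf; a minimal sketch using SciPy (the function name and the toy data are ours):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def single_linkage_clusters(D, k):
    # Build the dendrogram bottom-up, merging the two closest clusters
    # (minimum pairwise distance) at each step, then cut into k flat clusters.
    Z = linkage(D, method='single')
    return fcluster(Z, t=k, criterion='maxclust')  # labels in 1..k

# Toy usage: three well-separated blobs
rng = np.random.default_rng(0)
D = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2)) for c in (0.0, 1.0, 2.0)])
labels = single_linkage_clusters(D, k=3)
```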
  • 10. Poisoning attacks vs. single-linkage HC

      $\max_{A'} \; g(A'; \theta_0) = d_c\left(C, f_D(D \cup A')\right) \quad \text{s.t.} \quad A' \in \Omega_p$

    • For a given cut criterion, the distance between two clusterings is measured as
      $d_c(Y, Y') = \left\| Y Y^T - Y' Y'^T \right\|_F$,
      where Y is the binary cluster-assignment matrix (one row per sample, one column per cluster, a single 1 per row), so that $Y Y^T$ is the co-clustering matrix whose (i, j) entry is 1 if samples i and j belong to the same cluster and 0 otherwise.
    • We assume the most advantageous cut criterion for the clustering algorithm: the dendrogram cut is chosen to minimize the attacker's objective!
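This clustering distance translates directly into NumPy; a minimal sketch, with helper names of our choosing, assuming the one-hot assignment matrices described above (the two matrices must share the same number of rows, but may have different numbers of clusters):

```python
import numpy as np

def labels_to_Y(labels, k):
    # One-hot encode flat cluster labels (0..k-1) into the n x k matrix Y.
    Y = np.zeros((len(labels), k))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def clustering_distance(Y1, Y2):
    # d_c(Y, Y') = || Y Y^T - Y' Y'^T ||_F; since Y Y^T is the n x n
    # co-clustering matrix, d_c measures pairwise co-clustering
    # disagreements between the two partitions.
    return np.linalg.norm(Y1 @ Y1.T - Y2 @ Y2.T, ord='fro')
```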
  • 11. Poisoning attacks vs. single-linkage HC
    • Heuristic-based solutions
      – Greedy approach: add one attack sample at each iteration
      – Local maxima of the objective are often found in between clusters, close to the connections (bridges) that have been cut to obtain the final k clusters
      – The k-1 bridges can be obtained directly from the dendrogram!
  • 12. Poisoning attacks vs. single-linkage HC
    • Heuristic-based solutions (a reconstruction of the bridge search is sketched below)
      1. Bridge (Best): evaluates the objective function k-1 times, each time adding an attack point in between a bridge; requires running the clustering algorithm k-1 times!
      2. Bridge (Hard): estimates the objective function by assuming that each attack point will merge the corresponding clusters; does not require running the clustering algorithm
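As a rough reconstruction (not the authors' code): under single linkage, the k-1 merges removed by the dendrogram cut form a minimum spanning tree over the flat clusters, so one candidate attack point can be placed at the midpoint of the closest cross-cluster pair along each MST edge. All names below are ours, and `labels` is assumed 0-based:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree

def bridge_candidates(D, labels, k):
    # For each pair of flat clusters, find the closest cross-cluster pair
    # of points and the midpoint between them (a candidate "bridge" point).
    dmin = np.full((k, k), np.inf)
    midpoint = {}
    for i in range(k):
        for j in range(i + 1, k):
            Di, Dj = D[labels == i], D[labels == j]
            dists = cdist(Di, Dj)
            a, b = np.unravel_index(dists.argmin(), dists.shape)
            dmin[i, j] = dists[a, b]
            midpoint[(i, j)] = (Di[a] + Dj[b]) / 2.0
    # Keep one candidate per edge of the MST over the flat clusters
    # (under single linkage, these are the merges removed by the cut).
    mst = minimum_spanning_tree(np.where(np.isinf(dmin), 0.0, dmin)).tocoo()
    return [midpoint[(int(min(i, j)), int(max(i, j)))]
            for i, j in zip(mst.row, mst.col)]
```

Bridge (Best) would then re-cluster D plus each of the k-1 candidates in turn and keep the candidate maximizing d_c, while Bridge (Hard) scores each candidate by assuming it simply merges its two clusters, avoiding the k-1 clustering runs.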
  • 13. Poisoning attacks vs. single-linkage HC
    • Heuristic-based solutions
      3. Bridge (Soft): similar to Bridge (Hard), but using soft clustering assignments for Y (estimated with Gaussian KDE)
    • [Figure: clustering output after greedily adding 20 attack points]
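One plausible way to build the soft assignment matrix, assuming one Gaussian KDE per cluster with SciPy's default bandwidth and simple row normalization (both our assumptions; each cluster needs more samples than features for the KDE covariance to be non-singular, so this suits low-dimensional data like the Banana set):

```python
import numpy as np
from scipy.stats import gaussian_kde

def soft_assignments(D, labels, k, eps=1e-12):
    # Fit one KDE per cluster, evaluate every sample under every cluster's
    # KDE, and row-normalize so each row of Y is a soft assignment.
    kdes = [gaussian_kde(D[labels == c].T) for c in range(k)]
    Y = np.column_stack([kde(D.T) for kde in kdes]) + eps
    return Y / Y.sum(axis=1, keepdims=True)
```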
  • 14. Experiments on poisoning attacks
    • Banana: artificial data, 80 samples, 2 features, k=4 initial clusters
    • Malware: real data (1), 1,000 samples, 6 features, k≈9 initial clusters (estimated from the data by minimizing the Davies-Bouldin index; a sketch of this step follows below). Features:
      1. number of GET requests
      2. number of POST requests
      3. average URL length
      4. average number of URL parameters
      5. average amount of data sent by POST requests
      6. average response length
    • MNIST Handwritten Digits: real data, 330 samples per cluster, 28 x 28 = 784 features (pixels), k=3 initial clusters corresponding to digits '0', '1', and '6'
    (1) R. Perdisci, D. Ariu, and G. Giacinto. Scalable fine-grained behavioral clustering of HTTP-based malware. Computer Networks, 57(2):487-500, 2013.
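The model-selection step could be sketched as follows, using scikit-learn's Davies-Bouldin implementation over candidate single-linkage dendrogram cuts; the candidate range of k and the function name are our assumptions:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import davies_bouldin_score

def estimate_k(D, k_range=range(2, 16)):
    # Cut the single-linkage dendrogram at each candidate k and keep the
    # cut with the lowest Davies-Bouldin index (lower = better-separated,
    # more compact clusters).
    Z = linkage(D, method='single')
    scores = {k: davies_bouldin_score(D, fcluster(Z, t=k, criterion='maxclust'))
              for k in k_range}
    return min(scores, key=scores.get)
```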
  • 15. Experiments on poisoning attacks
    • Attack strategies: Bridge (Best), Bridge (Hard), Bridge (Soft), Random, Random (Best)
      – Random (Best) selects the best random attack over k-1 attempts
      – Same complexity as Bridge (Best)
    • [Figure: objective function and number of clusters k vs. the fraction of samples controlled by the attacker, for the Banana, Malware, and Digits datasets]
  • 16. Experiments on poisoning attacks
    • Some attack samples obtained by the given attack strategies (Random, Random (Best), Bridge (Best), Bridge (Soft), Bridge (Hard)) on the MNIST Handwritten Digits, at iterations 1, 2, and 10. [Figure]
  • 17. Obfuscation attacks (integrity violation)
    • Goal: hiding the attack samples A without compromising the clustering of D
    • Capability: bounded manipulation of the attack samples

      $\max_{A'} \; g(A'; \theta_0) = -d_c\left(C^t, f(D \cup A')\right), \quad \text{where } \pi_D(C^t) = f(D)$

      $\text{s.t.} \quad A' \in \Omega_o(A) = \left\{ \{a'_i\}_{i=1}^{|A|} : d_s(A, A') = \max_i \|a_i - a'_i\|_2 \le d_{\max} \right\}$

      i.e., the attacker aims at a target clustering $C^t$ whose restriction $\pi_D$ to the untainted data D matches f(D), while each attack sample may move by at most $d_{\max}$.
  • 18. Obfuscation attacks vs. single-linkage HC
    • Heuristic-based solution (see the sketch below): for each attack sample a in A, select the closest sample d in D from the cluster to which a should belong, and define

      $a' = a + \alpha \, \frac{d - a}{\|d - a\|_2}, \quad \alpha = \min\left(d_{\max}, \|d - a\|_2\right)$
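The update rule translates directly into code; a minimal sketch (function name ours):

```python
import numpy as np

def obfuscate(a, d, d_max):
    # Move attack sample a toward target sample d by at most d_max:
    # a' = a + alpha * (d - a) / ||d - a||_2, alpha = min(d_max, ||d - a||_2).
    direction = d - a
    dist = np.linalg.norm(direction)
    if dist == 0.0:
        return a.copy()  # already at the target
    alpha = min(d_max, dist)
    return a + alpha * direction / dist
```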
  • 19. Experiments on obfuscation attacks
    • MNIST Handwritten Digits
      – Initial clusters: '0', '1', '6', '3'
      – The attacker modifies the '3's to have them clustered with the '6's
      – The attacker minimizes the distance from the desired clustering
      – The clustering minimizes the distance from the initial clusters (where the '3's are not manipulated)
    • [Figure: attacker's and clustering's objective values, and number of clusters k, as functions of d_max]
  • 20. Experiments on obfuscation attacks
    • [Figure: same objective and k curves as the previous slide, with an illustration of the '3' and '6' clusters]
    • Why does the attacker's objective increase here? Bridging: the manipulated samples end up forming a bridge between existing clusters. This may suggest a more effective heuristic, based on modifying only a subset of the attack samples!
  • 21. Conclusions and future work
    • Framework for security evaluation of clustering algorithms
    • Definition of poisoning and obfuscation attacks
    • A case study on single-linkage HC highlights its vulnerability to attacks
    • Future work
      – Extensions to other algorithms; a common solver for the attack strategy (e.g., black-box optimization with suitable heuristics)
      – Connections with clustering stability
      – Secure / robust clustering algorithms
  • 22. Thanks for your attention! Any questions?