Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage Hierarchical Clustering

  1. Poisoning Complete-Linkage Hierarchical Clustering
     Battista Biggio (1), Samuel Rota Bulò (2), Ignazio Pillai (1), Michele Mura (1), Eyasu Zemene Mequanint (3), Marcello Pelillo (3), and Fabio Roli (1)
     (1) Pattern Recognition and Applications Lab, Department of Electrical and Electronic Engineering, Università di Cagliari (IT); (2) FBK-irst, Trento (IT); (3) Università Ca’ Foscari di Venezia (IT)
     Joensuu, Finland, S+SSPR 2014, 20-22 August 2014
  2. Threats and Attacks in Computer Security
     • Growing number of devices, services and applications connected to the Internet
     • Vulnerabilities and attacks through malicious software (malware), e.g., Android market, malware applications
     • Identity theft
     • Stolen credentials / credit card numbers
  3. Threats and Attacks in Computer Security
     • Need for (automated) detection (and rule generation): machine learning-based defenses (data clustering)
     • Evasion: malware families / variants, with +65% new malware variants from 2012 to 2013 (Mobile Adware and Malware Analysis, Symantec, 2014)
     • Detection: antivirus systems, rule-based systems
  4. Data Clustering for Computer Security
     • Goal: clustering of malware families to identify common characteristics and design suitable countermeasures, e.g., antivirus rules / signatures
     • Pipeline: data collection (honeypots) -> feature extraction (e.g., URL length, num. of parameters, etc.) -> clustering of malware families (e.g., similar HTTP requests) -> data analysis / countermeasure design (e.g., signature generation)
     • Example of a suspicious HTTP request to a web server: http://www.vulnerablehotel.com/components/com_hbssearch/longDesc.php?h_id=1&id=-2%20union%20select%20concat%28username,0x3a,password%29%20from%20jos_users--
  5. Is Data Clustering Secure?
     • Attackers can poison input data to subvert malware clustering
     • Well-crafted HTTP requests (e.g., http://www.vulnerablehotel.com/…) injected into the collected data: the clustering of malware families is significantly compromised, and the resulting countermeasure design becomes useless (too many false alarms, low detection rate)
     (1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
  6. Is Data Clustering Secure?
     • Earlier work (1,2): qualitative definition of attacks
     • Starting from a clustering on untainted data, samples can be added to merge (and/or split) existing clusters, or samples can be obfuscated and hidden within existing clusters (e.g., fringe clusters)
     (1) D. B. Skillicorn. Adversarial knowledge discovery. IEEE Intelligent Systems, 24:54–61, 2009.
     (2) J. G. Dutrisac and D. Skillicorn. Hiding clusters in adversarial settings. In IEEE Int’l Conf. Intelligence and Security Informatics, pp. 185–187, 2008.
  7. Is Data Clustering Secure?
     • Our previous work (1): framework for security evaluation of clustering algorithms; formalization of poisoning and obfuscation attacks (as optimization problems); case study on single-linkage hierarchical clustering
     • Although hierarchical clustering is widely used for malware clustering (2,3), it is significantly vulnerable to well-crafted attacks!
     • In this work we focus on poisoning attacks against complete-linkage hierarchical clustering
     (1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
     (2) R. Perdisci, D. Ariu, and G. Giacinto. Scalable fine-grained behavioral clustering of http-based malware. Computer Networks, 57(2):487–500, 2013.
     (3) K. Rieck, P. Trinius, C. Willems, and T. Holz. Automatic analysis of malware behavior using machine learning. J. Comput. Secur., 19(4):639–668, 2011.
  8. Complete-Linkage Hierarchical Clustering
     • Bottom-up agglomerative clustering: each point is initially considered as a cluster, and the closest clusters are iteratively merged
     • A linkage criterion defines the distance between clusters; the complete-linkage criterion is dist(C_i, C_j) = max_{a ∈ C_i, b ∈ C_j} d(a, b)
     • The clustering output is a hierarchy of clusterings; a criterion is needed to select a given clustering from it (e.g., the number of clusters)
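As a rough illustration of the procedure described on this slide (a sketch of mine, not code from the authors), SciPy's hierarchical clustering can build the hierarchy with the complete-linkage criterion and then cut it into a chosen number of clusters:

```python
# Minimal sketch: complete-linkage hierarchical clustering with SciPy,
# cutting the resulting hierarchy into k clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))                      # toy 2-D data (stand-in for the Banana set)

Z = linkage(pdist(X), method="complete")          # complete-linkage criterion
labels = fcluster(Z, t=4, criterion="maxclust")   # select a clustering with k = 4 clusters
print(labels[:10])
```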
  9. Poisoning Attacks
     • Goal: to maximally compromise the clustering output on D
     • Capability: adding m attack samples
     • Knowledge: perfect / worst-case attack
     • Attack strategy: max_A d_c(Y, Y'(A)), A = {a_i}_{i=1}^m, where Y = f(D) is the clustering on the untainted data D, Y' = f_D(D ∪ A) is the clustering of the poisoned data restricted to the samples in D, and d_c measures the distance between the clustering in the absence of attack and that under attack
     [Figure: clustering on untainted data D, and attack samples A added to it]
  10. Poisoning Attacks
     • Attack strategy: max_A d_c(Y, Y'(A)), A = {a_i}_{i=1}^m
     • For a given clustering, d_c(Y, Y') = ||YY^T − Y'Y'^T||_F, where Y is the binary sample-to-cluster assignment matrix; e.g., for samples 1…5:
       Y = [1 0 0; 0 0 1; 0 0 1; 1 0 0; 0 1 0],   YY^T = [1 0 0 1 0; 0 1 1 0 0; 0 1 1 0 0; 1 0 0 1 0; 0 0 0 0 1]
     • How to choose a given clustering from the hierarchy? The clustering algorithm chooses the number of clusters that minimizes the attacker's objective: this gives us a lower bound on the worst-case attack's impact!
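The distance d_c between two clusterings can be computed directly from their binary sample-to-cluster assignment matrices. Below is a minimal sketch of mine (not the authors' code); the 5-sample example mirrors the one on the slide, with samples {1, 4}, {2, 3} and {5} forming three clusters:

```python
# Sketch: d_c(Y, Y') = ||Y Y^T - Y' Y'^T||_F for two clusterings of the same samples,
# given as integer cluster labels and converted to binary assignment matrices.
import numpy as np

def clustering_distance(labels_a, labels_b):
    def assignment_matrix(labels):
        labels = np.asarray(labels)
        clusters = np.unique(labels)
        return (labels[:, None] == clusters[None, :]).astype(float)   # n x k binary matrix Y

    Y, Yp = assignment_matrix(labels_a), assignment_matrix(labels_b)
    return np.linalg.norm(Y @ Y.T - Yp @ Yp.T, ord="fro")

y = [0, 2, 2, 0, 1]                                # the slide's example clustering
print(clustering_distance(y, y))                   # 0.0: identical clusterings
print(clustering_distance(y, [0, 0, 2, 0, 1]))     # > 0: sample 2 has changed cluster
```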
  11. Poisoning Complete-Linkage Clustering
     • Attack strategy: max_A d_c(Y, Y'(A)), A = {a_i}_{i=1}^m
     • Heuristic-based solutions; greedy approach: adding one attack sample at a time
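The following is a rough sketch of this greedy strategy (my reading of the slides, not the authors' implementation). The candidate_generator and distance arguments are hypothetical hooks, e.g. the Extend candidates sketched later and the clustering_distance function sketched above; for simplicity the number of clusters k is kept fixed, whereas on the slides it is re-selected by the clustering algorithm so as to minimize the attacker's objective.

```python
# Sketch of the greedy poisoning loop: add one attack point at a time, keeping
# the candidate that maximizes the distance between the untainted clustering
# and the poisoned clustering restricted to the original samples.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster(X, k):
    return fcluster(linkage(pdist(X), method="complete"), t=k, criterion="maxclust")

def greedy_poisoning(X, k, candidate_generator, n_attack, distance):
    y_clean = cluster(X, k)
    A = np.empty((0, X.shape[1]))                  # attack samples added so far
    for _ in range(n_attack):
        X_cur = np.vstack([X, A])
        best_a, best_obj = None, -np.inf
        for a in candidate_generator(X_cur, cluster(X_cur, k)):
            y_pois = cluster(np.vstack([X_cur, a.reshape(1, -1)]), k)[: len(X)]
            obj = distance(y_clean, y_pois)        # attacker's objective d_c(Y, Y'(a))
            if obj > best_obj:
                best_a, best_obj = a, obj
        A = np.vstack([A, best_a.reshape(1, -1)])  # greedily commit the best candidate
    return A
```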
  12. Poisoning Complete-Linkage Clustering
     • Local maxima of the objective d_c(Y, Y'(a)) are found at the clusters' boundaries (wide regions)
     [Figure: surface of d_c(Y, Y'(a)) over the two features x1, x2]
  13. Poisoning Complete-Linkage Clustering
     • Underlying idea: increase the intra-cluster distance (Extend attack)
     • For each cluster, consider two candidate attack points
  14. Poisoning Complete-Linkage Clustering
     • Underlying idea: increase the intra-cluster distance (Extend attack); illustration continued in the figure
  15. Poisoning Complete-Linkage Clustering
     • Underlying idea: increase the intra-cluster distance (Extend attack); the figure shows the candidate attack points
  16. Poisoning Complete-Linkage Clustering
     1. Extend (Best): evaluates Y'(a) for each candidate attack point, retaining the best one; the clustering is re-run for each candidate attack point, i.e., twice per cluster
     2. Extend (Hard): estimates Y'(a) by assuming that each candidate will split the corresponding cluster, potentially merging it with a fragment of the closest cluster; it does not require running the clustering to find the best attack point
     3. Extend (Soft): estimates Y'(a) as Extend (Hard), but using a soft probabilistic estimate instead of 0/1 sample-to-cluster assignments; it does not require running the clustering to find the best attack point
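A possible candidate-generation step for the Extend attack is sketched below. The exact geometry is my assumption (the slides only state that, for each cluster, two candidate points are chosen so as to increase the intra-cluster distance): here each candidate is placed slightly beyond one endpoint of the cluster diameter.

```python
# Sketch (assumed geometry): for each cluster, take the two samples realizing
# the cluster diameter and place one candidate attack point just beyond each
# of them along that direction, so that adding it extends the intra-cluster distance.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def extend_candidates(X, labels, step=0.1):
    candidates = []
    for c in np.unique(labels):
        P = X[labels == c]
        if len(P) < 2:
            continue
        D = squareform(pdist(P))
        i, j = np.unravel_index(np.argmax(D), D.shape)   # farthest pair = cluster diameter
        u = (P[j] - P[i]) / (np.linalg.norm(P[j] - P[i]) + 1e-12)
        candidates.append(P[j] + step * u)               # extend beyond one endpoint
        candidates.append(P[i] - step * u)               # and beyond the other
    return np.asarray(candidates)
```

Plugging these candidates into the greedy loop sketched earlier, with the clustering re-run for each of them, corresponds to the Extend (Best) variant; Extend (Hard) and Extend (Soft) instead replace that re-clustering with the cheaper estimates of Y'(a) described above.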
  17. Poisoning Complete-Linkage Clustering
     • The attack compromises the initial clustering by forming heterogeneous clusters
     [Figures: clustering on untainted data vs. clustering after adding 10 attack samples]
  18. Experimental Setup
     • Banana: artificial data, 80 samples, 2 features, k=4 initial clusters
     • Malware: real data (1), 1,000 samples, 6 features, k≈9 initial clusters (estimated from data by minimizing the Davies-Bouldin index). Features: 1. number of GET requests; 2. number of POST requests; 3. average URL length; 4. average number of URL parameters; 5. average amount of data sent by POST requests; 6. average response length
     • MNIST Handwritten Digits: real data, 330 samples per cluster, 28 x 28 = 784 features (pixels), k=3 initial clusters, each corresponding to a different digit
     (1) R. Perdisci, D. Ariu, and G. Giacinto. Scalable fine-grained behavioral clustering of http-based malware. Computer Networks, 57(2):487–500, 2013.
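As a small illustrative sketch of mine (using a standard implementation of the index, not the authors' code), the number of clusters for the Malware data could be estimated by minimizing the Davies-Bouldin index over candidate cuts of the complete-linkage hierarchy:

```python
# Sketch: pick the k whose cut of the complete-linkage hierarchy minimizes
# the Davies-Bouldin index (lower is better).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import davies_bouldin_score

def estimate_k(X, k_range=range(2, 16)):
    Z = linkage(pdist(X), method="complete")
    scores = {}
    for k in k_range:
        labels = fcluster(Z, t=k, criterion="maxclust")
        if len(np.unique(labels)) > 1:
            scores[k] = davies_bouldin_score(X, labels)
    return min(scores, key=scores.get)
```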
  19. Experimental Results
     • Attack strategies: Extend (Best/Hard/Soft), Random, Random (Best)
     • Banana: Extend (Best) very close to Optimal (Grid Search); Random (Best) competitive with Extend (Hard / Soft)
     [Figure: attacker's objective function on Banana vs. the fraction of samples controlled by the attacker (0% to 20%; 11.1% corresponds to 10 attack samples), for Random, Random (Best), Extend (Hard), Extend (Soft), Extend (Best), and Optimal (Grid Search)]
  20. Experimental Results
     • Attack strategies: Extend (Best/Hard/Soft), Random, Random (Best)
     • Malware: Extend attacks and Random (Best) perform rather well
     • MNIST Handwritten Digits: Random (Best) not effective (high-dimensional feature space); Extend (Soft) outperforms Extend (Best / Hard)
     [Figures: attacker's objective function vs. the fraction of samples controlled by the attacker, on Malware (0% to 5%) and on Digits (0% to 1%), for Random, Random (Best), Extend (Hard), Extend (Soft), Extend (Best), and Optimal]
  21. Conclusions and Future Work
     • Framework for security evaluation of clustering algorithms
     • Poisoning attack vs. complete-linkage hierarchical clustering: even random-based attacks can be effective!
     • Future work: extensions to other clustering algorithms and a common attack strategy (e.g., black-box optimization with suitable heuristics); attacks with limited knowledge of the input data; moving from attacks against clustering towards secure clustering algorithms
  22. Thanks for your attention! Any questions?
  23. Extra slides
  24. Is Data Clustering Secure?
     • Our previous work (1): framework for security evaluation of clustering algorithms, based on (i) a formal definition of potential attacks and (ii) an empirical evaluation of their impact
     • Adversary's model: goal (security violation), knowledge of the attacked system, capability of manipulating the input data, attack strategy (optimization problem)
     • Inspired by previous work on adversarial machine learning:
       – Barreno et al., Can machine learning be secure?, ASIACCS 2006
       – Huang et al., Adversarial machine learning, AISec 2011
       – Biggio et al., Security evaluation of pattern classifiers under attack, IEEE Trans. Knowledge and Data Eng., 2013
     (1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
  25. Adversary's Goal
     • Security violation:
       – Integrity: hiding clusters / malicious activities without compromising normal system operation, e.g., creating fringe clusters (obfuscation attack)
       – Availability: compromising normal system operation by maximally altering the clustering output, e.g., merging existing clusters (poisoning attack)
       – Privacy: gaining confidential information about system users by reverse-engineering the clustering process
     (1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
  26. Adversary's Knowledge
     • Perfect knowledge of the input data, the feature representation, the clustering algorithm (e.g., k-means), and the algorithm's parameters (e.g., initialization)
     • This yields an upper bound on the performance degradation under attack
     (1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
  27. Adversary's Capability
     • The attacker's capability is bounded:
       – maximum number of samples that can be added to the input data, e.g., the attacker may only control a small fraction of the malware samples collected by a honeypot
       – maximum amount of modification (application-specific constraints in feature space), e.g., malware samples should preserve their malicious functionality (elements cannot be removed, so features can only be incremented), which defines a feasible domain for the attack samples
     (1) B. Biggio, I. Pillai, S. R. Bulò, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. ACM Workshop on Artif. Intell. & Sec., AISec ’13, pp. 87–98, 2013.
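A hedged sketch of how such capability constraints might be enforced (my own illustration, with a hypothetical project_to_feasible helper): each candidate attack point is projected onto the feasible domain, e.g. when features may only be incremented with respect to a base malware sample.

```python
# Sketch: projection onto a feasible domain where features can only be
# incremented relative to a base sample (elements cannot be removed),
# optionally with an application-specific upper bound.
import numpy as np

def project_to_feasible(a, base_sample, upper_bound=None):
    a = np.maximum(a, base_sample)       # increment-only constraint
    if upper_bound is not None:
        a = np.minimum(a, upper_bound)   # optional cap on modifications
    return a
```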
  28. Formalizing the Optimal Attack Strategy
     max_A E_{θ∼μ}[ g(A; θ) ]   s.t.   A ∈ Ω
     • g encodes the attacker's goal; the expectation over θ ∼ μ encodes the attacker's knowledge of the data, features, …; the constraint A ∈ Ω encodes the capability of manipulating the input data
     • Perfect knowledge: E_{θ∼μ}[ g(A; θ) ] = g(A; θ0)
