When Cyber Security Meets Machine Learning


  • A taxonomy of information security combined with machine learning. An example work could focus on investigating a family of viruses that cause information loss in web applications. A network antivirus is the protective measure described in the work, where the raw data such a system processes are executable files. The proposed solution uses string signatures. The features for choosing the string are selected using the Fisher score. Finding the signature is done by static code analysis. A classification model is trained using one of the supervised learning algorithms.
  • When Cyber Security Meets Machine Learning

    1. 1. When Cyber Security Meets Machine Learning. Lior Rokach, Information Systems Engineering, Ben-Gurion University of the Negev; College of Information Sciences and Technology, Penn State University
    2. 2. About Me. Prof. Lior Rokach, Department of Information Systems Engineering, Faculty of Engineering Sciences; Head of the Machine Learning Lab, Ben-Gurion University of the Negev. Email: liorrk@bgu.ac.il. PhD (2004) from Tel Aviv University
    3. 3. Why Cyber Security? • Evolving domain – an endless game • Plenty of data • Practical contribution • Strong support of the stakeholders – communications, collaborations, grants 7/30/2012 Ben-Gurion University
    4. 4. Cyber Security. Cyber security is defined as the intersection of • computer security • network security • information security. Timeline of incidents (2008–2011): GhostNet, Aurora, Stuxnet, the Nasdaq breach, Sony, Lockheed Martin.
    5. 5. Cyber Security • It is a problem domain, not a solution domain; thus it seeks solutions from other areas. • Traditionally, security problems were aided by mathematical models, e.g. secrecy using cryptography.
    6. 6. Modern Cyber Security • Deals with abstract threats which cannot be solved only by using mathematical models: – malware detection – intrusion detection – data leakage, etc. • Need for other research methods.
    7. 7. (image-only slide)
    8. 8. The concept of learning in an ML system • Learning = improving with experience at some task: – improve at task T, – with respect to performance measure P, – based on experience E.
    9. 9. Motivating Example: Learning to Filter Spam. Spam: an email that the user does not want to receive and has not asked to receive. T: identify spam emails. P: % of spam emails that were filtered; % of ham (non-spam) emails that were incorrectly filtered out. E: a database of emails that were labelled by users/experts.
    10. 10. The Learning Process: Model Learning → Model Testing.
    11. 11. The Learning Process in our Example: features extracted at the email server feed model learning and model testing – number of recipients, size of message, number of attachments, number of "Re"s in the subject line, …
    12. 12. Data Set – instances (rows) over input attributes and a target attribute:
        Number of new Recipients | Email Length (K) | Country (IP) | Customer Type | Email Type (target)
        0 | 2 | Germany | Gold   | Ham
        1 | 4 | Germany | Silver | Ham
        5 | 2 | Nigeria | Bronze | Spam
        2 | 4 | Russia  | Bronze | Spam
        3 | 4 | Germany | Bronze | Ham
        0 | 1 | USA     | Silver | Ham
        4 | 2 | USA     | Silver | Spam
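As a minimal, self-contained sketch of the learning process on this toy table, the snippet below trains a 1R ("one rule") classifier. This is a simple stand-in for the supervised algorithms discussed later in the deck, not the method the slides actually use: it picks the single input attribute whose per-value majority-class rules make the fewest training errors.

```python
from collections import Counter, defaultdict

# Toy data set from the slide: input attributes + target ("Email Type").
DATA = [  # (new_recipients, length_k, country, customer_type, label)
    (0, 2, "Germany", "Gold",   "Ham"),
    (1, 4, "Germany", "Silver", "Ham"),
    (5, 2, "Nigeria", "Bronze", "Spam"),
    (2, 4, "Russia",  "Bronze", "Spam"),
    (3, 4, "Germany", "Bronze", "Ham"),
    (0, 1, "USA",     "Silver", "Ham"),
    (4, 2, "USA",     "Silver", "Spam"),
]
ATTRS = ["new_recipients", "email_length_k", "country", "customer_type"]

def one_rule(rows):
    """Pick the attribute whose value -> majority-class rules
    misclassify the fewest training instances (the 1R learner)."""
    best = None
    for i, name in enumerate(ATTRS):
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[i]][row[-1]] += 1
        # Errors = instances not matching the majority class of their value.
        errors = sum(sum(c.values()) - max(c.values())
                     for c in by_value.values())
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        if best is None or errors < best[0]:
            best = (errors, name, rule)
    return best

errors, attribute, rule = one_rule(DATA)
```

On seven instances, 1R overfits easily (here a zero-error rule exists on the raw recipient counts), which is exactly why the pipeline in the following slides adds feature selection and held-out testing.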
    13. 13. Information security and machine learning: Taxonomy. Problem domain – Information Security, the problems we need to solve: malware detection, intrusion detection, spam mitigation, etc. Solution domain – Machine Learning, from which solutions are drawn: artificial neural networks, decision trees, etc.
    14. 14. ISML Taxonomy: Computer Security Using Machine Learning (taxonomy chart). Security domain – threat type: viruses, worms, spam, identity theft, DoS attacks, buffer overflow, SQL injection; damage: information leakage, information loss, denial of service, loss of confidentiality; threat domain: web applications, e-mail, end point, messaging, internet service providers; protective security system: intrusion prevention system, intrusion detection system, firewall/VPN, end-point antivirus, network antivirus, anti-spam filter, misuse-detection systems; raw data type: executable components, portable executable, text file, document, IP packet, XML file, network traffic, time series, packet header, device. Machine Learning domain – extracted features: n-grams, OpCode n-grams, string signature, function-based, sequence, frequency, XML features; feature selection method: gain ratio, Fisher score, document frequency, hierarchical feature selection; analysis type: static, dynamic; learning algorithm type: supervised, unsupervised, semi-supervised, positive-examples-only learning.
    15. 15. Detection of Unknown Malicious Code
    16. 16. Malware • Malware, short for malicious software, is software designed to disrupt computer operation, gather sensitive information, or gain unauthorized access to a computer system.
    17. 17. Static vs. Dynamic Analysis • Static – analyze the program (code): leverages structural information (e.g. sequences of bytes); attempts to detect malware before the program under inspection executes. • Dynamic – analyze the running process: leverages runtime information (e.g. network usage); attempts to detect malicious behavior during or after program execution.
    18. 18. Static Analysis Using Machine Learning
    19. 19. Analogy: Malcode Detection as Text Categorization • Classifying malicious code is analogous to text categorization. • Texts → malicious code (files) • Words → code expressions • Then weighting functions, such as tf or tf-idf, can be used.
    20. 20. tf-idf weighting • Best known weighting scheme in information retrieval: w_{t,d} = log(1 + tf_{t,d}) * log10(N / df_t) • The TF (term frequency) tf_{t,d} of term t in document d is the number of times that t occurs in d. • The IDF (inverse document frequency) is based on the inverse of the number of documents that contain t. • The weight increases with the number of occurrences within a document, and with the rarity of the term in the collection.
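The slide's weighting scheme is a one-liner; a minimal sketch:

```python
import math

def tfidf_weight(tf, df, n_docs):
    """w_{t,d} = log(1 + tf_{t,d}) * log10(N / df_t):
    grows sublinearly with occurrences in the document (log tf)
    and with the rarity of the term in the collection (idf)."""
    if tf == 0:
        return 0.0
    return math.log(1 + tf) * math.log10(n_docs / df)

# A term seen 3 times in a document, appearing in 10 of 1,000 documents:
w = tfidf_weight(3, 10, 1000)   # log(4) * log10(100) ~= 2.77
```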
    21. 21. Dataset • We acquired the malicious files from the VX Heaven website – 7,688 malicious files for the Windows OS, including executable and DLL (Dynamic Linked Library) files. • The benign set contained 22,735 files.
    22. 22. Data preparations • Creating vocabularies (TF vectors):
        N-Grams | Vocabulary Size
        3-gram  | 16,777,216
        4-gram  | 1,084,793,035
        5-gram  | 1,575,804,954
        6-gram  | 1,936,342,220
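The vocabularies above come from sliding a window over the raw bytes of each file. A minimal sketch of that extraction step (the file content here is a made-up byte string, not a real executable):

```python
from collections import Counter

def byte_ngram_tf(blob, n):
    """Term-frequency vector over overlapping byte n-grams.
    For n = 3 the full vocabulary has 256**3 = 16,777,216 possible
    entries, matching the first row of the slide's table."""
    return Counter(blob[i:i + n] for i in range(len(blob) - n + 1))

tf = byte_ngram_tf(b"MZ\x90\x00MZ\x90", 3)   # toy stand-in for an EXE
```

In practice only the n-grams that actually occur are stored, and feature selection (next slides) cuts the vector down to a few hundred terms.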
    23. 23. Classification Algorithms • To create rules from the raw data gathered and presented on the previous slide, different classification algorithms were examined: – Artificial Neural Networks (ANN) – Decision Trees (DT) – Naive Bayes (NB) – Support Vector Machines (SVM) – Boosted Decision Trees (BDT) – Boosted Naive Bayes (BNB)
    24. 24. Steps • Determine the best conditions: – the best term representation (TF / TF-IDF) – the best n-gram size (3 / 4 / 5 / 6) – the best top-selection (50 / 100 / 200 / 300) and best feature-selection method (DF / FS / GR)
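Of the feature-selection scores listed (DF / FS / GR), the Fisher score has the simplest closed form; a sketch for a single feature measured in the two classes:

```python
def fisher_score(pos_values, neg_values):
    """Fisher score of one feature:
    (mean_pos - mean_neg)**2 / (var_pos + var_neg).
    Higher = the feature separates the classes better."""
    def mean_var(xs):
        m = sum(xs) / len(xs)
        return m, sum((x - m) ** 2 for x in xs) / len(xs)
    mp, vp = mean_var(pos_values)
    mn, vn = mean_var(neg_values)
    denom = vp + vn
    # Identical constant values in both classes give denom == 0.
    return float("inf") if denom == 0 else (mp - mn) ** 2 / denom

# e.g. one n-gram's frequencies in malicious vs. benign files:
score = fisher_score([4, 5, 6], [1, 2, 3])   # (5-2)^2 / (2/3 + 2/3) = 6.75
```

Ranking every candidate n-gram by this score and keeping the top 50–300 is the "top-selection" step above.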
    25. 25. Performance Measures • True Positive Rate (TPR) – the proportion of positive instances classified correctly. • False Positive Rate (FPR) – the proportion of negative instances misclassified as positive. • Total Accuracy – the proportion of all instances classified correctly.
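Expressed over confusion-matrix counts, a sketch:

```python
def detection_rates(tp, fp, tn, fn):
    """TPR, FPR and total accuracy from confusion-matrix counts."""
    tpr = tp / (tp + fn)                    # detected malware / all malware
    fpr = fp / (fp + tn)                    # false alarms / all benign
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return tpr, fpr, accuracy

# e.g. 90 of 100 malicious files caught, 5 of 100 benign files flagged:
tpr, fpr, acc = detection_rates(tp=90, fp=5, tn=95, fn=10)
```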
    26. 26. Preliminary Results • Mean accuracies are quite similar. • Best performance: top 5,500. • Best representation: TF. • Best n-gram: 5-gram.
    27. 27. Classifiers • Under the best conditions presented above, the classifiers that achieved the highest accuracies with the lowest false positive rates were the Boosted Decision Tree and the Artificial Neural Network:
        Classifier | Accuracy | FP    | FN
        ANN        | 0.941    | 0.033 | 0.134
        DT         | 0.943    | 0.039 | 0.099
        NB         | 0.697    | 0.382 | 0.069
        BDT        | 0.949    | 0.040 | 0.110
        BNB        | 0.697    | 0.382 | 0.069
        SVM-lin    | 0.921    | 0.033 | 0.214
        SVM-poly   | 0.852    | 0.014 | 0.544
        SVM-rbf    | 0.939    | 0.029 | 0.154
    28. 28. Portable Executable (PE) • Features extracted from certain parts of Win32 PE binaries (EXE or DLL). • PE header, describing the physical structure of a PE binary (e.g., creation/modification time, machine type, file size) • Import section: which DLLs were imported, and which functions from which imported DLLs were used • Exports section: which functions were exported (if the file being examined is a DLL) • Resource directory: resources used by a given file (e.g., dialogs, cursors) • Version information (e.g., internal and external name of a file, version number)
    29. 29. n-Grams vs. PE Features
    30. 30. Imbalanced Classification Tasks • A data set is imbalanced if the classes are unequally distributed. • The class of interest (minority class) is often much smaller or rarer. • But the cost of an error on the minority class can be far higher.
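One common, simple remedy is random oversampling of the minority class, sketched below; cost-sensitive learning or undersampling the majority class are equally standard alternatives (the deck does not commit to a specific one here):

```python
import random
from collections import Counter

def oversample_minority(rows, seed=0):
    """Duplicate randomly chosen minority-class rows (label in the last
    position) until both classes are equally represented."""
    counts = Counter(r[-1] for r in rows)
    (maj_label, n_maj), (min_label, n_min) = counts.most_common(2)
    minority = [r for r in rows if r[-1] == min_label]
    rng = random.Random(seed)   # fixed seed for reproducibility
    return rows + [rng.choice(minority) for _ in range(n_maj - n_min)]

# 9 benign files vs. 1 malicious file (hypothetical labels):
rows = [("exe%d" % i, "benign") for i in range(9)] + [("mal0", "malicious")]
balanced = oversample_minority(rows)
```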
    31. 31. The Mal-ID Method • Common libraries • Anti-forensic means to avoid their detection • Chronological evolution of malware – most viruses are variants of previous malware. Mal-ID: Automatic Malware Detection Using Common Segment Analysis and Meta-Features, Journal of Machine Learning Research 1 (2012) 1-48.
    32. 32. Andromaly • A lightweight host-based intrusion detection system for Android-based devices
    33. 33. The "Andromaly" • A lightweight host-based intrusion detection system for Android-based devices • Provides real-time monitoring, collection, preprocessing and analysis of various system metrics • Open framework – possible to apply different types of detection techniques • Threat assessments (TAs) are weighted and smoothed to avoid instantaneous false alarms • An alert is matched against a set of automatic/manual countermeasures
    34. 34. The "Andromaly" architecture (diagram): a graphical user interface on top; feature extractors at the application level and operating-system level; loggers; managers (scheduling, SQLite, processor, memory, config, operation mode, alert, threat, feature weighting); an alert handler and agent service; a communication layer; and a rule-based classifier plus an anomaly detector (KBTA), all running over the Android application framework and Linux kernel, monitoring hardware (processors, keyboard, network, power).
    35. 35. Few screenshots…
    36. 36. Few screenshots…
    37. 37. Collected features (88, by category):
        Touch screen: Avg_Touch_Pressure, Avg_Touch_Area
        Keyboard: Avg_Key_Flight_Time, Del_Key_Use_Rate, Avg_Trans_To_U, Avg_Trans_L_To_R, Avg_Trans_R_To_L, Avg_Key_Dwell_Time, Keyboard_Opening, Keyboard_Closing
        Scheduler: Yield_Calls, Schedule_Calls, Schedule_Idle, Running_Jiffies, Waiting_Jiffies
        CPU Load: CPU_Usage, Load_Avg_1_min, Load_Avg_5_mins, Load_Avg_15_mins, Runnable_Entities, Total_Entities
        Operating System: Running_Processes, Context_Switches, Processes_Created, Orientation_Changing
        Memory: Garbage_Collections, Free_Pages, Inactive_Pages, Active_Pages, Anonymous_Pages, Mapped_Pages, File_Pages, Dirty_Pages, Writeback_Pages, DMA_Allocations, Page_Frees, Page_Activations, Page_Deactivations, Minor_Page_Faults
        Application: Package_Changing, Package_Restarting, Package_Addition, Package_Removal, Package_Restart, UID_Removal
        Network: Local_TX_Packets, Local_TX_Bytes, Local_RX_Packets, Local_RX_Bytes, WiFi_TX_Packets, WiFi_TX_Bytes, WiFi_RX_Packets, WiFi_RX_Bytes
        Hardware: Camera, USB_State
        Binder: BC_Transaction, BC_Reply, BC_Acquire, BC_Release, Binder_Active_Nodes, Binder_Total_Nodes, Binder_Ref_Active, Binder_Ref_Total, Binder_Death_Active, Binder_Death_Total, Binder_Transaction_Active, Binder_Transaction_Total, Binder_Trns_Complete_Active, Binder_Trns_Complete_Total
        Calls: Incoming_Calls, Outgoing_Calls, Missed_Calls, Outgoing_Non_CL_Calls
        Messaging: Outgoing_SMS, Incoming_SMS, Outgoing_Non_CL_SMS
        Power: Charging_Enabled, Battery_Voltage, Battery_Current, Battery_Temp, Battery_Level_Change, Battery_Level
        Leds: Button_Backlight, Keyboard_Backlight, LCD_Backlight, Blue_Led, Green_Led, Red_Led
    38. 38. Evaluation – Preparation of the data sets • The applications were installed on 25 Android G1 devices (each device has one user only) • Each user activated each application • In the background the Android agent was running and logging data (feature vectors) on the SD card (88 features every 2 seconds) • The feature vectors were added to our data set and labeled with the device id, application name and class (game/tool)
    39. 39. Abnormal state detection. Goals: identify the most informative features to monitor; evaluate various detection methods and algorithms; understand the feasibility of running these methods as detection units on Android devices. Pipeline: feature extraction (90 features recorded while activating applications) → feature selection (InfoGain, Chi-Square, Fisher Score; top 10 / 20 / 50 features) → detection algorithms (K-Means, Histograms, Logistic Regression, Decision Tree, Bayesian Net, Naïve Bayes). Training and testing are performed on different devices, so the classifier must differentiate between applications not included in the training set. Step I: differentiating games (23K) from tools (20K) using 25 devices – Logistic Regression / Fisher Score / top 20: accuracy 75.3% (TPR 0.797, FPR 0.303). Step II: detecting Android malware (15K) using 25 devices – Rotation Forest / Fisher Score / top 10: accuracy 87.4% (TPR 0.794, FPR 0.126).
    40. 40. Data Leakage Prevention
    41. 41. Data Leakage
    42. 42. Data Leakage Prevention. A data leakage prevention solution is a system designed to detect potential data breach incidents in a timely manner and to prevent them by monitoring data while in use (endpoint actions), in motion (network traffic), and at rest (data storage).
    43. 43. Honeytokens • Honeytokens – fake digital data (e.g., a credit card number, a database entry or bogus login credentials) planted into a genuine system resource (e.g., databases, files and emails). • Example: – insert a honey-table, a table with a "sweet" name likely to attract a malicious user (e.g. "CREDIT_CARDS") – these tables are not used by any legitimate application, so any access to them is suspicious.
    44. 44. HoneyGen • Challenge: a good honeytoken is an artificial data item that is hard to distinguish from real tokens. • HoneyGen: an automated honeytokens generator [Berkovitch, 2011] – a generic method that, given any database, generates high-quality honeytokens.
    45. 45. HoneyGen System • Rule mining: extrapolates rules that describe the "real" data structure, attributes, constraints and logic (identity, reference, cardinality, value-set, attribute dependency) • Honeytoken generation • Likelihood rating: sorts honeytokens by their similarity to the real tokens in the input database, according to the commonness of their combination of values. Pipeline: INPUT real-tokens DB → rules mining → honeytoken generation → likelihood rating → OUTPUT honeytokens with likelihood scores.
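The likelihood-rating step can be sketched as scoring each candidate by how common its attribute values are among the real records. This is a crude per-column independence approximation for illustration only; the actual HoneyGen also mines structural rules first, and the sample records below are made up:

```python
from collections import Counter

def likelihood(candidate, real_rows):
    """Product over columns of the relative frequency of the candidate's
    value among the real tokens; higher = harder to tell apart."""
    n = len(real_rows)
    score = 1.0
    for col, value in enumerate(candidate):
        column_freq = Counter(r[col] for r in real_rows)
        score *= column_freq[value] / n   # 0.0 if the value never occurs
    return score

# Hypothetical real tokens (holder name, card brand):
real = [("Alice", "Visa"), ("Bob", "Visa"), ("Carol", "MasterCard")]
plausible = likelihood(("Alice", "Visa"), real)       # (1/3) * (2/3)
implausible = likelihood(("Mallory", "Visa"), real)   # name never occurs
```

Candidates are then sorted by this score, and the top-rated ones are planted as honeytokens.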
    46. 46. Activity Based Verification
    47. 47. Motivation • Identity theft is one of the most common crimes in North America; there are close to 10 million victims of identity theft each year. • The Federal Trade Commission (FTC) estimates that identity theft costs companies approximately $50 billion per year, in addition to $5 billion in costs to consumers. • Today, user authentication is based on a username and password, which can be stolen physically, by phishing sites or Trojans, or simply given away.
    48. 48. Current Authentication Mechanisms – Costly and often Unavailable.
        Password – authentication by a predefined user name and password; disadvantages: hard to remember many passwords; a password may be copied, cracked or stolen.
        Token – based on an object (e.g., magnetic card, RFID tag); can be lost or stolen; expensive to deploy and maintain for the consumer market.
        Biometric (palm/finger) – based on a palm signature; expensive; limited availability.
        Biometric (voice) – based on vocal patterns; accuracy limited by illness or background noise.
        Biometric (iris) – based on structural and color patterns of the human iris; expensive; accuracy problems for diabetes patients.
        Secure ID – ID numbers which are constantly generated (e.g., by RSA Secure ID); can be lost or stolen.
        Users are already hassled by current security mechanisms and reluctant to accept new ones.
    49. 49. Activity Based Verification: Solution
    50. 50. Keyboard Dynamics Features. Example with the text "Hello world": extract the di-graphs and their temporal features (H-e 200, e-l 150, l-l 130, l-o 240, o-space 300, space-w 180, w-o 230, o-r 310, r-l 220, l-d 250); sort the di-graphs based on their temporal features; then group similar di-graphs into clusters. Grouping 5 similar di-graphs per cluster: {l-l, e-l, space-w, H-e, r-l} → 176; {w-o, l-o, l-d, o-space, o-r} → 266. Grouping 2 per cluster: {l-l, e-l} → 140; {space-w, H-e} → 190; {r-l, w-o} → 225; {l-o, l-d} → 245; {o-space, o-r} → 305.
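The di-graph extraction step can be sketched directly; the timestamps below are hypothetical key-down times (in ms) chosen to reproduce the slide's first three values:

```python
def digraph_features(key_events):
    """For each pair of consecutive key presses, emit the di-graph and
    the elapsed time between them. Repeated di-graphs stay separate so
    they can later be averaged or clustered, as on the slide."""
    return [((a, b), t2 - t1)
            for (a, t1), (b, t2) in zip(key_events, key_events[1:])]

feats = digraph_features([("H", 0), ("e", 200), ("l", 350), ("l", 480)])
# [(('H', 'e'), 200), (('e', 'l'), 150), (('l', 'l'), 130)]
```

Averaging clustered di-graphs matches the slide's numbers: the mean of 130 (l-l) and 150 (e-l) is 140, the value shown for the {l-l, e-l} cluster.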
    51. 51. Mouse Trajectories Classifier
    52. 52. Various actions for learning purposes: Mouse Move; Point and Left Click (Mouse Move → Left Click); Point and Double Click (Mouse Move → Double Click); Drag and Drop (Mouse Down → Mouse Move → Mouse Up); Point and Right Click (Mouse Move → Right Click).
    53. 53. Evaluation Measures • False Acceptance Rate (FAR) – the ratio between the number of attacks that were erroneously labeled as authentic interactions and the total number of attacks. • False Rejection Rate (FRR) – the ratio between the number of legitimate interactions that were erroneously labeled as attacks and the total number of legitimate interactions.
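As counts, a sketch (the example figures are made up, not the study's results):

```python
def far_frr(accepted_attacks, total_attacks, rejected_legit, total_legit):
    """FAR = attacks wrongly accepted / all attacks;
    FRR = legitimate interactions wrongly rejected / all legitimate ones."""
    return accepted_attacks / total_attacks, rejected_legit / total_legit

# e.g. 3 of 200 attack sessions slip through, 2 of 100 users are rejected:
far, frr = far_frr(3, 200, 2, 100)
```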
    54. 54. Evaluation • Fixed text (password) / continuous verification • The session length (number of actions): more actions → better performance • Keyboard is better than mouse
        Session Size | FAR   | FRR   | EER   | AUC
        1/4 Session  | 4.33% | 3.17% | 3.75% | 0.0308
        1/2 Session  | 2.59% | 2.86% | 2.72% | 0.0234
        Full Session | 1.48% | 1.59% | 1.53% | 0.0144
    • ~3% FAR and FRR after 10 actions • Clint Feher, Yuval Elovici, Robert Moskovitch, Lior Rokach, Alon Schclar, "User Identity Verification via Mouse Dynamics", Information Sciences, Volume 201, 15 October 2012, Pages 19–36.
    55. 55. Privacy Preserving Data Mining
    56. 56. Motivation• Huge databases exist in society today – Medical data – Consumer purchase data – Census data – Communication and media-related data – Data gathered by government agencies• Can this data be utilized? – For medical research – For improving customer service – For homeland security• The Problem: The huge amount of data available means that it is possible to learn a lot of information about individuals
    57. 57. Privacy Challenge (Sweeney, 1998): linking a medical table (Disease, Birth Date, Zip, Sex) with an external list (Name, Birth Date, Zip, Sex). 87% of the population in the USA can be uniquely identified by zip, sex and date of birth.
    58. 58. Quasi-identifier• The minimal set of attributes in a table that can be joined with external information to re-identify individual records
    59. 59. k-AnonymityLet R(A1,...,An) be a relation and QI be the quasi-identifierassociated with it. R is said to satisfy k-anonymity if and only ifevery distinct value of QI has at least k occurrences in R.
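The definition translates directly into a check over the quasi-identifier value combinations; a sketch:

```python
from collections import Counter

def is_k_anonymous(rows, qi_cols, k):
    """True iff every distinct quasi-identifier value combination
    occurs at least k times in the table."""
    groups = Counter(tuple(row[c] for c in qi_cols) for row in rows)
    return all(count >= k for count in groups.values())

# QI = (zip, sex); each QI combination below appears exactly twice:
table = [("537**", "M", "Flu"), ("537**", "M", "Cold"),
         ("537**", "F", "Flu"), ("537**", "F", "Cold")]
```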
    60. 60. Generalization and Suppression. Generalization: replacement of a value by a less specific (more general) value using a domain generalization relationship. Suppression: remove the value. Zip hierarchy: Z0 = {53715, 53710, 53706, 53703} → Z1 = {5371*, 5370*} → Z2 = {537**}. Sex hierarchy: S0 = {Male, Female} → S1 = {Person}.
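The zip-code hierarchy above can be sketched as a generalization function (suppression is the degenerate case where the whole value becomes stars):

```python
def generalize_zip(zip_code, level):
    """Climb the domain generalization hierarchy by masking the last
    `level` digits: level 0 = Z0 (full zip), level 2 gives '537**'."""
    if level == 0:
        return zip_code
    return zip_code[:-level] + "*" * level

z1 = generalize_zip("53715", 1)   # '5371*'
z2 = generalize_zip("53715", 2)   # '537**'
```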
    61. 61. Privacy-preserving data mining (PPDM) • Goal: create accurate data mining models from anonymized data. • Performing anonymization while ignoring the data mining task results in a loss of data quality. • Data owners must balance the desire to share useful data against the need to protect private information within the data – a trade-off.
    62. 62. k-Anonymity Classification Tree Using Suppression • Induce a classification tree with an existing algorithm (such as C4.5) • Walk over the tree and iteratively prune the rules in a bottom-up manner until we reach k-anonymity. The order of attributes along the path (from root to leaf) already denotes their importance (from high to low) for predicting the class.
    63. 63. Example QI = {Marital Status, Education, Occupation, Sex}, K = 100. Marital Status = Married | Education = High School: <=50K (200) | Education = Some college | | Occupation = Handlers-cleaners: <=50K (89) | | Occupation = Exec managerial: >50K (120). Complying nodes – leaves whose frequency is above the k-anonymity threshold. Non-complying nodes – leaves whose frequency is below the threshold. Compensation – use complying nodes to drive the anonymization process by compensating part of their records in favor of non-complying records, using suppression.
    64. 64. Example (cont.) QI = {Marital Status, Education, Occupation, Sex}, K = 100. Tree: Marital Status = Married | Education = High School: <=50K (200) | Education = Some college | | Occupation = Handlers-cleaners: <=50K (89) | | Occupation = Exec managerial: >50K (120). Suppressed output records:
        Marital Status | Education    | Occupation      | Sex | Class | Count
        Married        | High School  | *               | *   | <=50K | 200
        Married        | Some college | *               | *   | <=50K | 89 + 11 (compensated from the complying leaf)
        Married        | Some college | Exec managerial | *   | >50K  | 120 - 11 = 109
    65. 65. Slava Kisilevich, Lior Rokach, Yuval Elovici, Bracha Shapira, Efficient Multidimensional Suppression for K-Anonymity, IEEE Transactions on Knowledge and Data Engineering, 22(3): 334-347 (2010).
    66. 66. Results – Bottom Line (chart): classification accuracy (y-axis, 78–87%) vs. k-anonymity level (x-axis, 0–1000).
    67. 67. Conclusions
    68. 68. Machine Learning and Security • Many current and emerging computer and network security challenges can be solved only by using machine learning techniques: – information leakage – data misuse – anomaly detection – etc. • It is very important to understand how to employ machine learning techniques in an effective way, in particular: carefully constructed training corpora, effective feature extraction, effective feature selection, and valid evaluations on representative corpora.
    69. 69. Thank You. Lior Rokach, liorrk@bgu.ac.il