Rootconf_phishing_v2

Published in: Technology

  1. THE FIFTH ELEPHANT - ARJUN B.M. MUDPIPE: Malicious URL Detection for Phishing Identification and Prevention
  2. PHISHING INTRODUCTION
     Phishing is the fraudulent practice of sending emails purporting to be from reputable companies in order to induce individuals to reveal personal information, such as passwords and credit card numbers.
     MOTIVES: Financial gain, reputational damage, identity theft, fame & notoriety
     Phishing website indicators:
       • Visually resembles the original website
       • Email creates a sense of urgency to force user action
       • Fake HTTPS certificate & domain name
       • Attractive offers that tempt the user to respond
  3. PROBLEM STATEMENT
     For Employees:
       • Employees reach malicious sites after falling victim to phishing emails
       • No self-service mechanism for employees to check whether a site is bad
       • Lack of awareness and training for employees
     For Security Teams:
       • Manual time & effort spent by the Security Operations team to block sites
       • Lack of insights from an internal ML solution on phishing data; current solutions may be rule-based
       • Different teams/networks may have different site-access requirements, which external commercial solutions cannot serve
     For Business:
       • 91% of all cyber-attacks are via phishing, and they have devastating consequences
       • Licensing cost of commercial solutions for detecting phishing sites
       • Dependency on an external solution/product
  4. APPROACH & METHODOLOGY
     MACHINE LEARNING APPROACH
       • Data Collection & Validation
       • Parameter Determination (Address Bar based Features, HTML and JavaScript based Features, Domain based Features, Abnormal Based Features, URL Blacklist Features)
       • Feature Extraction from unknown, incoming data (test data)
       • Create baseline model with initial dataset
       • Evaluate performance of the model and fine-tune it
       • Apply test data to the pre-trained baseline model & make predictions
       • Compare with known data sources & further fine-tune results
       • Retrain the model at frequent intervals for better accuracy, context and relevancy
       • Classification model pickled and exposed as a REST API
     WHITELIST / BLACKLIST APPROACH
       • Identify data sources which provide info on phishing sites
       • Scrape data from data sources
       • Create whitelist / blacklist and compare URLs
       CONS
         • Lack of updated data sources
         • Lack of real-time intelligence
         • Data not comprehensive enough
         • Extensive effort for data scraping
     RULE BASED APPROACH
       • Determine phishing indicators
       • Define rules using combinations of indicators
       • Compare & match URLs against rules to deny/allow
       CONS
         • Complex rule set definitions
         • Overhead in managing and updating rules
         • High False Positive and False Negative rates
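The "create whitelist / blacklist and compare URLs" step can be illustrated with a minimal Python sketch. This is not the deck's implementation: the function names, the host-level matching, the choice to check the blacklist first, and the "unknown" outcome (which could be handed to the ML model) are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def hostname_of(url: str) -> str:
    """Return the lowercased hostname of a URL, or '' if none is found."""
    if "//" not in url:
        # Let urlsplit treat a bare "example.com/path" as host + path.
        url = "//" + url
    return (urlsplit(url).hostname or "").lower()


def check_lists(url: str, whitelist: set, blacklist: set) -> str:
    """Compare a URL's hostname against the two lists.

    Assumption: the blacklist takes precedence if a host is on both lists.
    """
    host = hostname_of(url)
    if host in blacklist:
        return "block"
    if host in whitelist:
        return "allow"
    return "unknown"  # not covered by either list
```

The "unknown" branch is where the cons listed above show up: a list that is stale or not comprehensive simply has no answer for new URLs.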
  5. ARCHITECTURE / WORKFLOW
     [Workflow diagram. Labels: BASELINE DATA SET, TRAIN, BASELINE ML MODEL, EXPOSE AS REST API, WEB TRAFFIC (UNKNOWN DATA), INPUT, FEATURE EXTRACTION, TEST DATA, PREDICTION, OUTPUT, RETRAIN, SECURITY ACTION (BLOCK / ALLOW)]
     Feature extraction (total of 30 features):
       • Address Bar based Features
       • HTML and JavaScript based Features
       • Domain based Features
       • Abnormal Based Features
       • URL Blacklist Features
     Prediction:
       • CLASSIFICATION = 0: LEGITIMATE
       • CLASSIFICATION = 1: PHISHING
     Output:
       • COMPARE WITH KNOWN SOURCES
       • PROBABILITY OF PREDICTION
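As a rough sketch of the last stage of this workflow, the snippet below turns a classification (0 = legitimate, 1 = phishing), its prediction probability, and a known-sources check into a security action. The function name, the `in_known_sources` flag, and the 0.8 threshold are hypothetical; the slides do not say how these signals are combined.

```python
def decide_action(label: int, probability: float,
                  in_known_sources: bool = False,
                  threshold: float = 0.8) -> str:
    """Map model output to a security action.

    label: 0 = legitimate, 1 = phishing (as on the slide).
    probability: the model's confidence in its prediction.
    in_known_sources: True if the URL appears in a known phishing source.
    threshold: hypothetical confidence cut-off for blocking.
    """
    if in_known_sources:
        return "BLOCK"  # corroborated by an external source
    if label == 1 and probability >= threshold:
        return "BLOCK"  # confident phishing prediction
    return "ALLOW"
```

A lower threshold blocks more aggressively at the cost of more false positives, which the deck lists among the cons of the solution.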
  6. FEATURE EXTRACTION
       1. having_IP_Address
       2. URL_Length
       3. Shortening_Service
       4. having_At_Symbol
       5. double_slash_redirecting
       6. Prefix_Suffix
       7. having_Sub_Domain
       8. SSL_State
       9. Domain_registeration_length
       10. Favicon
       11. Open_ports
       12. HTTPS_token_in_URL
       13. Request_URL
       14. URL_of_Anchor
       15. Links_in_tags
       16. Server_Form_Handler
       17. Submitting_to_email
       18. Abnormal_URL
       19. Site_Redirect
       20. on_mouseover_changes
       21. RightClick_Disabled
       22. popUpWindow
       23. Iframe_redirection
       24. age_of_domain
       25. DNS_Record
       26. web_traffic_rank
       27. Page_Rank
       28. Google_Index
       29. Links_pointing_to_page
       30. Statistical_report - top phishing domains
     Classification output: 0 = legitimate, 1 = phishing
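To make the address-bar features concrete, here is a minimal sketch that computes seven of them from the URL string alone. The feature names come from the slide; everything else is an assumption for illustration: the 1 = suspicious / 0 = not suspicious encoding, the 54-character length cut-off, the sample shortener list, and the simplified sub-domain rule are not taken from the deck.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical sample list; a real deployment would maintain its own.
SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co"}


def _is_ip(host: str) -> bool:
    """True if the host is a literal IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_address_bar_features(url: str) -> dict:
    """Return a few address-bar features (1 = suspicious, 0 = not)."""
    host = urlsplit(url).hostname or ""
    is_ip = _is_ip(host)
    bare_host = host[4:] if host.startswith("www.") else host
    return {
        "having_IP_Address": int(is_ip),
        # Assumed cut-off: long URLs can hide the suspicious part.
        "URL_Length": int(len(url) >= 54),
        "Shortening_Service": int(host in SHORTENERS),
        "having_At_Symbol": int("@" in url),
        # A "//" after the scheme separator suggests a redirect.
        "double_slash_redirecting": int(url.rfind("//") > 7),
        "Prefix_Suffix": int("-" in host),
        # Simplified rule: more than two dots in a non-IP host.
        "having_Sub_Domain": int(not is_ip and bare_host.count(".") > 2),
    }
```

The remaining features (for example SSL_State, age_of_domain, DNS_Record) need network lookups or page content and are outside the scope of this sketch.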
  7. DEPLOYING TO PRODUCTION
       • Context-specific use-cases:
         • Certain sub-nets within the org might require access to certain websites to support business functionality
         • Org might want to block access to sites even though commercial software classifies them only as “suspicious”
       • Infrastructure & capacity planning considerations: client, load balancer, web server, queues, etc.
       • REST-API approach: train, retrain & predictions
       • Develop automated test cases for your model (especially on the feature engineering side)
       • Automate evaluation of the production model, which allows efficient back-testing of model changes on historical data to determine whether improvements have been made
       • Possibly have different ML models / end-points exposed for different sections of the network or for different departments
       • Have a fall-back or default value for parameters which fail to get processed by the Feature Engineering module (exception handling)
       • Decouple the input and the output for the model; the model should still work if parameters are added, modified or deleted in feature engineering
       • Single egress point for web traffic, where the ML model can be plugged in through the REST API
       • Have a fail-open or kill-switch mechanism so that traffic flows through if model processing fails
       • Place the model in “monitoring” or “non-blocking” mode initially, which allows the ML model to get additional data, allows for fine-tuning and prevents errors
       • Supplement with existing controls like spam filtering, black-listing, etc.
       • Model should refer to other data sources as well for fine-tuning in the initial stages
       • Baseline and retrain the model at frequent intervals; also maintain model versions
       • Provide security analysts with an option to tweak/edit input data for contextual representation
       • Deploying the model client-side versus server-side
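Three of the points above (default values for features that fail to process, fail-open behaviour, and an initial monitoring mode) can be sketched together in a few lines of Python. The function names, the default value of 0, and the `(action, label)` return shape are hypothetical choices for this sketch, not details from the deck.

```python
def extract_with_defaults(url, extractors, default=0):
    """Run each feature extractor; fall back to a default if one fails."""
    features = {}
    for name, extractor in extractors.items():
        try:
            features[name] = extractor(url)
        except Exception:
            features[name] = default  # exception handling per feature
    return features


def enforce(url, extractors, predict, monitor_only=True):
    """Return (action, label) for a URL.

    Fail-open: if prediction raises, traffic is allowed and label is None.
    Monitoring mode: a phishing label is reported but not blocked.
    """
    try:
        label = predict(extract_with_defaults(url, extractors))
    except Exception:
        return "allow", None
    if label == 1 and not monitor_only:
        return "block", label
    return "allow", label
```

Returning the label even when the action is "allow" lets monitoring mode record what the model would have done, which is the data needed for fine-tuning before blocking is switched on.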
  8. PROS & CONS
     PROS
       • Reduce dependency on, and licensing cost of, third-party external software
       • Re-use of the org's in-house data rather than contributing towards improving commercial software
       • Better insights into online behavior of employees
       • Real-time protection for employees who access malicious websites or click on phishing links
       • Detect and prevent previously unknown phishing attacks, as attackers create new patterns
       • Next level of intelligence on top of signature-based prevention techniques & blacklists
       • Email filtering solutions help in filtering phishing/spam emails, but this provides holistic protection for all outgoing internet traffic
       • Centralized solution implemented org-wide, with no dependency on client-side agents/software
       • Anti-phishing: move from offline to real-time; move from reactive to proactive
     CONS
       • Data collection & building a data repository
       • Initial baseline dataset has too few records
       • Cost / maintenance of the solution/product
       • Fine-tuning of rules & predictions to meet changing threat vectors
       • False positive rate could cause bad user experience
       • Needs to be supplemented with Cyber Threat Intel
       • Solution works only when users are connected to the org network, since there is no client-side agent
  9. AUDIENCE TAKE-AWAYS
       • Opportunity for engineers and analysts to collaborate on building tailored, intelligent security solutions / products
       • Learn the various considerations in designing and deploying an ML solution in the InfoSec domain
  10. EMAIL: arjun.job14@gmail.com
      LINKEDIN: https://www.linkedin.com/in/arjunbm
      FURTHER READING, LINKS & REFERENCES
       • https://www.researchgate.net/publication/226420039_Detection_of_Phishing_Attacks_A_Machine_Learning_Approach
       • https://ieeexplore.ieee.org/document/8004877
       • https://pdfs.semanticscholar.org/188f/3bde688d5a47ce86bc0a8eca03aeb1bb9dfc.pdf
