I FOR ONE WELCOME OUR NEW CYBER OVERLORDS! AN INTRODUCTION TO THE USE OF MACHINE LEARNING IN CYBERSECURITY
Talk given at Pixels Camp 2016 about combining Machine Learning and CyberSecurity

Published in: Technology
  1. 1. I, for one, welcome our new Cyber Overlords! An introduction to the use of data science in cybersecurity. By Tiago Henriques, Filipa Rodrigues, Florentino Bexiga and Ana Barbosa.
  2. 2. Agenda: WHO ARE WE?; MACHINE LEARNING AND CYBERSECURITY; IMAGE WORKFLOW; IMAGE ANALYSIS IN DETAIL; DATA VISUALISATION
  3. 3. Tiago is the CEO and Data Necromancer at BinaryEdge; however, he gets to meddle in the intersection of data science and cybersecurity by providing his team with lovely problems that they solve on a daily basis. Tiago Henriques, Presenter
  4. 4. Florentino is the Data MacGyver at BinaryEdge. On a daily basis he needs to deploy infrastructure used to analyse big and real-time data. When not doing that, he can be found creating models to analyse data. Give him an orange, he’ll give you a skynet. Why an orange, you ask? He’s hungry and likes oranges, there! Florentino Bexiga, Presenter
  5. 5. Filipa is the Data Diva at BinaryEdge. She dances the macarena with numbers to get them to tell her all their dirty secrets. Filipa Rodrigues, Presenter
  6. 6. Ana is the Data Ferret at BinaryEdge. She is small and hides between the 110th and 111th characters of the ASCII code to see and show data in that unique perspective of someone who can’t reach the box of cookies stored on top of the capital 'I'. Ana Barbosa, Presenter
  7. 7. Earlier today
  8. 8. BinaryEdge. Venn diagram labels: HACKING SKILLS; SECURITY DOMAIN EXPERTISE; STATISTICS KNOWLEDGE; MACHINE LEARNING; TRADITIONAL RESEARCH; DANGER ZONE!; DATA SCIENCE. Source: Data-Driven Security: Analysis, visualisation and Dashboards (adapted)
  9. 9. How we got here: 200 port scan of the entire internet / month; 1,400,000,000 scanning events / month*; 746,000 torrents monitored and increasing; 1,362,225,600 torrent events / month*. (* at a minimum)
  10. 10. Worldwide distribution of IPs running services. Map legend (number of IPs found): <= 100; 100 < #found <= 1,000; 1,000 < #found <= 10,000; 10,000 < #found <= 100,000; 100,000 < #found < 1,000,000; >= 1,000,000
  11. 11. Map IPv4 addresses to Hilbert curves. Colour scale: % of coverage, from 0% to 100%.
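
A minimal sketch of how IPv4 addresses can be placed on a Hilbert curve, using the standard distance-to-coordinates conversion. The slide does not say which granularity was used; this sketch assumes one cell per /24 block on a 4096 x 4096 grid (an order-12 curve), and the names `hilbert_d2xy` and `ip_to_cell` are hypothetical.

```python
import ipaddress

ORDER = 12  # assumption: 4**12 cells = 2**24, i.e. one cell per /24 block


def hilbert_d2xy(order, d):
    """Convert a distance d along a Hilbert curve of the given order to (x, y)."""
    x = y = 0
    t = d
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the quadrant so consecutive segments stay connected.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def ip_to_cell(ip):
    """Map a dotted-quad IPv4 address to the grid cell of its /24 block."""
    block = int(ipaddress.IPv4Address(ip)) >> 8  # drop the last octet
    return hilbert_d2xy(ORDER, block)
```

Because the curve preserves locality, neighbouring /24 blocks land in neighbouring cells, so contiguous address ranges show up as compact regions on the map.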
  12. 12. Data Science & Machine Learning. MULTIPLE WILD QUESTIONS APPEAR... How many IP addresses did job X have vs. job Y? What is the average duration of the scans? Can we extract more from all the screenshots we get? Can we have a more optimized job distribution? We can only identify X% of services because we’re using static signatures; can we do better? Can we find similar images? ...ONE COMMON ANSWER: DATA SCIENCE & MACHINE LEARNING
  13. 13. Data Science & Machine Learning: initial analysis and clean up; exploratory data analysis; data visualisation; knowledge discovery; classification; clustering; similarity matching; regression; identification
  14. 14. Problems and Limitations of Machine Learning in CyberSecurity. Lots of adversarial scenarios: attacks on the classifiers go against the foundation of machine learning. Prediction: scenarios and data are too volatile, and there are not enough proper sources of data. Lack of data, in quantity and quality, to train models.
  15. 15. Good use cases. PATTERN DETECTION/OUTLIER DETECTION (IDS/IPS): if a computer is hacked, certain behaviors will change; if data is constantly monitored and fed into a system, the hack could be detected. ANTIVIRUS: further work needs to be done, but this will allow antivirus to move from a static, signature-based system to a much improved dynamic, learning-based system. ANTI-SPAM. SMARTER FUZZERS. SOURCE CODE ANALYSIS: detection of vulnerable patterns during development. INTERNAL ATTACKERS: sentiment analysis applied to emails, tweets and social networks of employees.
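
As an illustration of the outlier-detection idea (behaviour changes when a machine is compromised), here is a deliberately simple z-score check over a monitored metric. This is a sketch, not the speakers' method: the function name `find_outliers`, the threshold of 3 standard deviations and the sample data are all assumptions made for the example.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=3.0):
    """Return the indices of values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # perfectly constant behaviour: nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical metric: outbound connections per hour for one host.
connections_per_hour = [10] * 20 + [100]
suspicious_hours = find_outliers(connections_per_hour)
```

A real IDS/IPS would use richer features and a model that adapts over time; the point is only that a sudden deviation from a host's usual behaviour can be flagged automatically.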
  16. 16. metadata files people photos family & friends behaviour social search company registration ip address url address news forums sub-reddits internal external phone email linked urls likes topics BGP AS whois AS membership AS peer list of IPs shared infrastructure co-hosted sites contact geolocation office locations social networks phone portscan dns torrents binaryedge.io 2016 domains AXFR MX records screenshots web services http https webserver framework headers cookies certificate configuration authorities entities SMB VNC RDP users apps files peers torrent name OCR SW banners image classifier vulnerabilities data points
  17. 17. Torrent Correlation
  18. 18. Torrent Correlation China or Military
  19. 19. Data correlation
  20. 20. Data correlation Turkish IP
  21. 21. DEMO
  22. 22. At PixelsCamp
  23. 23. At PixelsCamp
  24. 24. Data points map (repeat of slide 16)
  25. 25. Microservices (REST API): PORT, WORD, TAG, FACE, COUNTRY, LOGO, IP
  26. 26. Scan. The scan generates events. Does it generate a screenshot? YES: store the image file on the cloud, then generate a notification that a new image was uploaded. NO: finish.
  27. 27. Image Workflow. Stages: INITIALIZER; FILTER; LOGO DETECTION, FACE DETECTION, OPTICAL CHARACTER RECOGNITION (OCR)
  28. 28. Image Workflow (stages as on slide 27). Pull message from queue. Is there a new image? YES: decrypt and store image metadata on a database, generate image signature for similarity comparison, then pass on to the message queue. NO: finish.
  29. 29. Image Workflow (stages as on slide 27). Pull message from queue and perform simple entropy filtering. Does the image have any information? YES: pass on to the message queue. NO: finish.
  30. 30. Image Workflow (stages as on slide 27). Pull message from queue; enhance image with application of some filters; run face and logo detection and OCR algorithms; store results in database; perform additional actions with the results.
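
The slide does not name the signature algorithm used for similarity comparison. As one possible illustration only, the sketch below uses an "average hash": each pixel of a small grayscale thumbnail becomes one bit (brighter than the mean or not), and two images are compared by the Hamming distance between their bit strings. The function names are hypothetical, and the downscaling step (for example to 8 x 8 pixels with an imaging library) is assumed to have happened already.

```python
def average_hash(gray):
    """Build a bit signature from a small grayscale image given as a list of rows."""
    flat = [value for row in gray for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        # One bit per pixel: 1 if brighter than the image mean, else 0.
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two signatures."""
    return bin(a ^ b).count("1")


# Tiny 2 x 2 thumbnails, purely for illustration.
original = [[0, 0], [255, 255]]
slightly_brighter = [[10, 10], [250, 250]]
inverted = [[255, 255], [0, 0]]
```

A small distance suggests near-duplicate screenshots (here `original` and `slightly_brighter` have identical signatures), while a large distance suggests different content.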
  31. 31. Image Workflow. Email DataLeak API: [{"BreachDate": "2013-10-04", "DataClasses": ["Email addresses", "Password hints", "Passwords", "Usernames"], "Title": "Adobe", "IsActive": true, "Description": "In October 2013, 153 million Adobe accounts were breached with each containing an internal ID, username, email, <em>encrypted</em> password and a password hint in plain text. The password cryptography was poorly done and <a href=\"http://stricture-group.com/files/adobe-top100.txt\" target=\"_blank\">many were quickly resolved back to plain text</a>. The unencrypted hints also <a href=\"http://www.troyhunt.com/2013/11/adobe-credentials-and-serious.html\" target=\"_blank\">disclosed much about the passwords</a> adding further to the risk that hundreds of millions of Adobe customers already faced.", "Domain": "adobe.com", "AddedDate": "2013-12-04T00:00:00Z", "PwnCount": 152445165, "IsRetired": false, "IsVerified": true, "LogoType": "svg", "IsSensitive": false, "Name": "Adobe"}]
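
A short sketch of how a response like the one on this slide could be consumed once an email address has been extracted from an image. The field names come from the slide; the trimmed `SAMPLE` payload and the helper `summarise_breaches` are hypothetical and only illustrate the parsing step, not the API call itself.

```python
import json

# Trimmed copy of the record shown on the slide (long Description omitted).
SAMPLE = """[{"Name": "Adobe", "Title": "Adobe", "Domain": "adobe.com",
  "BreachDate": "2013-10-04", "PwnCount": 152445165,
  "DataClasses": ["Email addresses", "Password hints", "Passwords", "Usernames"],
  "IsVerified": true, "IsSensitive": false}]"""


def summarise_breaches(payload):
    """Reduce each breach record to the fields most useful for triage."""
    records = json.loads(payload)
    return [
        {
            "name": record["Name"],
            "date": record["BreachDate"],
            "accounts": record["PwnCount"],
            "passwords_exposed": "Passwords" in record.get("DataClasses", []),
        }
        for record in records
    ]


summary = summarise_breaches(SAMPLE)
```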
  32. 32. Image Workflow. Stages: INITIALIZER; FILTER; LOGO DETECTION, FACE DETECTION, OPTICAL CHARACTER RECOGNITION (OCR)
  33. 33. Filter: Shannon’s Entropy. Three example images: Entropy = 0.00 bits; Entropy ~ 0.03 bits; Entropy ~ 2.13 bits
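
Shannon's entropy of an image can be computed from the distribution of its pixel values, H = -sum(p * log2(p)). The sketch below works on a flat sequence of grayscale values; the cut-off used in `has_information` is an assumption for illustration, since the slides do not state the threshold applied by the filter.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Entropy, in bits, of the distribution of values in `pixels`."""
    counts = Counter(pixels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def has_information(pixels, threshold=0.5):
    """Keep an image only if its entropy exceeds a (hypothetical) threshold."""
    return shannon_entropy(pixels) > threshold


blank_screen = [0] * 100        # a single colour: no information
two_tone = [0, 255] * 50        # two equally likely values: 1 bit
```

A blank or single-colour screenshot scores 0 bits and is dropped before the more expensive face, logo and OCR steps run.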
  34. 34. Data Visualization. Sections: EXPLORATION, REPRESENTATION, DETAILS, TOOLS, FINISHING UP. “a multidisciplinary recipe of art, science, math, technology, and many other interesting ingredients.” Andy Kirk, “Data Visualization: a successful design process”
  35. 35. Exploration. DATA TYPE: hierarchical, relational, temporal, spatial, categorical. RELEVANCE: What is the most interesting? What is most important? Audience’s profile. What is the most relevant information in the context? FILTER: Show all values or just a few? Define periods? Define a threshold?
  36. 36. Representation. Experimentation is important: conceive ideas; storyboarding; do multiple iterations; prototype; test. A design can be used in the future. Chart: "How many open ports does an IP have?" (number of IPs with X open ports). Values as shown: 69,543,915; 25,436,974; 7,008,108; 3,475,472; 1,287,446; 1,043,331; 951,629; 854,817; 789,515; 759,115; 490,290; 288,885; 266,827; 257,105; 219,025; 198,898; 186,286; 141,474
  37. 37. Representation. MARKS (represent records): points, areas, lines. ATTRIBUTES (emphasize the most important aspects of the data): position, connections/patterns, size/color. Chart: distribution of IP addresses running encrypted and unencrypted services. IPs running HTTP services (on port 80): 51,467,779. IPs running HTTPS services (on port 443): 28,671,263. IPs running both HTTP and HTTPS services: 16,519,503.
  38. 38. Representation. PRECISION IN DESIGN: geometric calculations; truncated axis; scales. MAKE IT UNDERSTANDABLE: reference lines; markers. MAKE IT APPEALING: minimise the clutter; priority: preserve function. Chart: "Top 10 Web Servers for the Web", most common web servers found on port 80 (axis in millions). Servers in the order shown: Apache httpd; AkamaiGHost; Microsoft IIS httpd; nginx; lighttpd; Huawei HG532e ADSL modem http admin; Microsoft HTTPAPI httpd; Technicolor DSL modem http admin; Mbedthis-Appweb; micro_httpd. Counts in the order shown: 11,493,552; 8,361,080; 4,843,769; 3,860,883; 2,031,741; 1,539,629; 952,300; 699,202; 694,393; 678,657
  39. 39. Representation. Consider different design solutions. DATA TYPE / CONDITION: hierarchical, relational, temporal, spatial, categorical. Example: CVSS (Common Vulnerability Scoring System) scores by severity, on a scale from 0.0 to 10.0: LOW from 0.0, MEDIUM from 4.0, HIGH from 7.0.
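
The severity bands on this slide (boundaries at 4.0 and 7.0 on a 0.0 to 10.0 scale) amount to a simple bucketing rule, which is handy when colouring a chart by severity. The sketch below treats each boundary as the start of the next band, matching the CVSS v2 qualitative ranking; the function name `cvss_severity` is hypothetical.

```python
def cvss_severity(score):
    """Map a CVSS base score (0.0 to 10.0) to LOW, MEDIUM or HIGH."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS scores range from 0.0 to 10.0")
    if score < 4.0:
        return "LOW"
    if score < 7.0:
        return "MEDIUM"
    return "HIGH"


labels = [cvss_severity(s) for s in (2.1, 5.0, 9.3)]
```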
  40. 40. Representation. Consider different design solutions. DATA TYPE / CONDITION: hierarchical, relational, temporal, spatial, categorical. Example: CVE (Common Vulnerabilities and Exposures): CVE Identifier Number, References, Description.
  41. 41. Representation. Consider different design solutions. DATA TYPE / CONDITION: hierarchical, relational, temporal, spatial, categorical. Chart: "Email Protocols", overview of protocols used for email, according to encryption used (ENCRYPTED / UNENCRYPTED; axes: SERVICE, COUNT). Protocols: POP3, POP3S, SMTP, SMTPS, IMAP, IMAPS. Values as shown: 4,572,161; 3,742,289; 3,531,071; 2,971,159; 4,131,737; 3,703,364; 10,416,812; 12,234,969
  42. 42. Representation. Consider different design solutions. DATA TYPE / CONDITION: hierarchical, relational, temporal, spatial, categorical. Chart: "Big Data Technologies", changes in amount of data exposed without security, for MongoDB, Memcached and Redis in Aug 2015, Jan 2016 and July 2016. Values as shown: 2 TB; 644.3 TB; 724.7 TB; 627.7 TB; 13.2 TB; 11.3 TB; 710.9 TB; 12.0 TB; 598.7 TB; 27.5 TB; 1.5 TB; 1.8 TB; 619.8 TB
  43. 43. Representation. Consider different design solutions. DATA TYPE / CONDITION: hierarchical, relational, temporal, spatial, categorical. Chart: "Heartbleed", countries with the highest number of IPs vulnerable to Heartbleed: Russia 5,264; Republic of Korea 4,564; China 6,790; United States 23,649; Italy 2,508; Germany 6,382; France 5,622; Netherlands 2,779; United Kingdom 3,459; Japan 2,484
  44. 44. VNC wordcloud: login, windows, edition, 2016, delete, ctrl, server, press, microsoft, system, welcome, your, help, file, linux, google, kernel, from, ubuntu
  45. 45. Details. ANNOTATION: titles and subtitles; labels; legends. TYPOGRAPHY: use fonts that are easy to read; don’t use fonts that are considered sloppy. Chart: "SSH Banners", most common SSH banners found (axes: banner, count). Banners in the order shown: SSH-2.0-OpenSSH_5.3; SSH-2.0-OpenSSH_6.6.1p1; SSH-2.0-OpenSSH_6.6.1; SSH-2.0-OpenSSH_4.3; SSH-2.0-OpenSSH_6.0p1; SSH-2.0-OpenSSH_6.7p1; SSH-2.0-dropbear_2014.63; SSH-2.0-OpenSSH_5.5p1; SSH-2.0-ROSSSH; SSH-2.0-OpenSSH_5.9p1. Counts in the order shown: 202,361; 352,978; 436,700; 449,570; 462,616; 537,667; 555,779; 604,579; 1,501,749; 2,632,270
  46. 46. Details. ANNOTATION: titles and subtitles; labels; legends. TYPOGRAPHY: use fonts that are easy to read; don’t use fonts that are considered sloppy. Chart without title or axis labels. Banners in the order shown: SSH-2.0-OpenSSH_5.3; SSH-2.0-OpenSSH_6.6.1p1; SSH-2.0-OpenSSH_6.6.1; SSH-2.0-OpenSSH_4.3; SSH-2.0-OpenSSH_6.0p1; SSH-2.0-OpenSSH_6.7p1; SSH-2.0-dropbear_2014.63; SSH-2.0-OpenSSH_5.5p1; SSH-2.0-ROSSSH; SSH-2.0-OpenSSH_5.9p1. Counts in the order shown: 202,361; 352,978; 436,700; 449,570; 462,616; 537,667; 555,779; 604,579; 1,501,749; 2,632,270
  47. 47. Details. COLOR: legibility; functional purpose; salience; consistency; color blindness. COMPOSITION: chart size/orientation; alignments. Chart: "SSH Key Lengths", most common key lengths found. Count / key length pairs as shown: 641,719 / 1040; 186,070 / 1032; 13,845 / 4096; 5,068,711 / 1024; 3,740,593 / 2048; 9,064 / 512; 7,830 / 2056; 6,265 / 2064; 6,212 / 1016; 4,755 / 768
  48. 48. Tools. BALANCE between automation and a hand-editing process. Programming language to create plots; fine tuning in Illustrator (make it better for the audience); human error; originality. Automated analysis; Illustrator (or other tool) to create the visualization solution; human error.
  49. 49. Finishing up. DOCUMENT EVERY STEP OF THE PROCESS: calculations; choices of visualisations; choices of data points. REVIEW EVERYTHING: What could have been done differently? What could be better? TAKE CONSTRUCTIVE FEEDBACK, even if it means starting over. A visualization can be used in the future.
  50. 50. INTERNET SECURITY EXPOSURE 2016 BinaryEdge.io Be Ready. Be Safe. Be Secure. ise.binaryedge.io
  51. 51. THE SCIENCE BEHIND THE DATA CREATED BY BINARYEDGE
