BSides Lisbon - Data science, machine learning and cybersecurity


In this talk we will present some techniques that we use day to day in our research, where we combine our internet-wide data scanning and acquisition platform with machine learning and data science techniques, which allows us to find things faster or extract results in a more automated way. We will focus on practical cases and examples that even our audience at home will be able to use if they want. A couple of the examples we will look at are how to classify images such as VNC screenshots, how to classify network scans using machine learning, and how to use natural language processing to analyze CVEs. We will also talk a bit about a data analysis and classification pipeline architecture: the different technologies, what they do, and how they can be used.

We will start with a very brief introduction to the data science world and talk about:
Technologies
Techniques
How these relate to infosec
Algorithms and how they can be used
How people can come into the world of data and machine learning
Data visualization techniques and what are the best choices for different types of data
A couple of the examples we will look at are how to classify images such as VNC or X11 screenshots, OCR, how to classify network scans using machine learning, and the use of natural language processing to analyze CVEs. We will look at scoring and classification algorithms and how they can be used on IP addresses, and we will talk about the use of learning and how we are applying it in real life.

We will also talk a bit about a data analysis and classification pipeline architecture, looking at the different technologies, what they do, and how they can be used. Some specific examples of our research that should give you an idea of what we will talk about can be seen here:
https://blog.binaryedge.io/2015/11/10/ssh/
https://blog.binaryedge.io/2015/09/30/vnc-image-analysis-and-data-science/
https://blog.binaryedge.io/2015/08/10/data-technologies-and-security-part-1/
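
As a small illustration of the kind of analysis this research involves, the sketch below tallies SSH banners from scan events. It assumes events are dictionaries shaped like the SSH sample event in the slide transcript (`origin.type`, `target`, `result.data.banner`); the function name `count_ssh_banners`, the sample events, and the documentation-range IP addresses are hypothetical and are not BinaryEdge's actual pipeline.

```python
from collections import Counter


def count_ssh_banners(events):
    """Count how often each SSH banner appears in a list of scan events.

    Assumes each event is a dict with the layout of the SSH sample event
    in the slides: event["origin"]["type"] and
    event["result"]["data"]["banner"]. Events of other types, or without
    a banner, are skipped.
    """
    counts = Counter()
    for event in events:
        if event.get("origin", {}).get("type") != "ssh":
            continue
        banner = event.get("result", {}).get("data", {}).get("banner")
        if banner:
            counts[banner] += 1
    return counts


# Hypothetical sample events (illustrative only, not real scan data).
sample_events = [
    {"origin": {"type": "ssh"},
     "target": {"ip": "192.0.2.1", "port": 22, "protocol": "tcp"},
     "result": {"data": {"banner": "SSH-2.0-OpenSSH_6.6.1p1"}}},
    {"origin": {"type": "ssh"},
     "target": {"ip": "192.0.2.2", "port": 22, "protocol": "tcp"},
     "result": {"data": {"banner": "SSH-2.0-OpenSSH_5.3"}}},
    {"origin": {"type": "ssh"},
     "target": {"ip": "192.0.2.3", "port": 22, "protocol": "tcp"},
     "result": {"data": {"banner": "SSH-2.0-OpenSSH_6.6.1p1"}}},
    {"origin": {"type": "vnc"},
     "target": {"ip": "192.0.2.4", "port": 5900},
     "result": {"data": {"version": "3.7"}}},
]

banner_counts = count_ssh_banners(sample_events)
for banner, count in banner_counts.most_common():
    print(count, banner)
```

A `Counter` keeps the tally in memory, which is fine for a sample; at the scale of an internet-wide scan the same grouping would more likely be done in a database or a distributed job.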

Published in: Technology

BSides Lisbon - Data science, machine learning and cybersecurity

  1. By Tiago Henriques, Filipa Rodrigues, Florentino Bexiga, Ana Barbosa. I, for one, welcome our new Cyber Overlords! An introduction to the use of data science in cybersecurity
  2. Agenda: Who are we? Machine learning and cybersecurity. Image workflow. Image analysis in detail. Data visualisation.
  3. Tiago Henriques (presenter): Tiago is the CEO and Data Necromancer at BinaryEdge; however, he gets to meddle in the intersection of data science and cybersecurity by providing his team with lovely problems that they solve on a daily basis.
  4. Florentino Bexiga (presenter): Florentino is the Data MacGyver at BinaryEdge. On a daily basis he needs to deploy infrastructure used to analyse big and realtime data. When not doing that, he can be found creating models to analyse data. Give him an orange, he'll give you a Skynet. Why an orange, you ask? He's hungry and likes oranges, there!
  5. Filipa Rodrigues (presenter): Filipa is the Data Diva at BinaryEdge. She dances the macarena with numbers to get them to tell her all their dirty secrets.
  6. Ana Barbosa (presenter): Ana is the Data Ferret at BinaryEdge. She is small and hides between the 110th and 111th characters of the ASCII code to see and show data from the unique perspective of someone who can't reach the box of cookies stored on top of the capital 'I'.
  7. Venn diagram labels: hacking skills, security domain expertise, statistics knowledge, machine learning, traditional research, danger zone!, data science. Source: Data-Driven Security: Analysis, Visualisation and Dashboards (adapted). BinaryEdge
  8. How we got here: 200 port scan of the entire internet / month; 1,400,000,000 scanning events / month *; 746,000 torrents monitored and increasing; 1,362,225,600 torrent events / month *. (* at a minimum)
  9. Worldwide distribution of IPs running services. Map legend, number of IPs found: <= 100; 100 < #found <= 1,000; 1,000 < #found <= 10,000; 10,000 < #found <= 100,000; 100,000 < #found < 1,000,000; >= 1,000,000
  10. Map IPv4 addresses to Hilbert curves. Colour scale: % of coverage, from 0% to 100% in steps of 10%.
  11. Data Science & Machine Learning. Multiple wild questions appear: How many IP addresses did job X have vs. job Y? What is the average duration of the scans? Can we extract more from all the screenshots we get? Can we have a more optimized job distribution? We can only identify X% of services because we're using static signatures, can we do better? Can we find similar images? One common answer: data science & machine learning.
  12. Data Science & Machine Learning. Diagram labels: data science, machine learning, initial analysis and clean up, exploratory data analysis, data visualisation, knowledge discovery, classification, clustering, similarity matching, regression, identification.
  13. Problems and limitations of machine learning in cybersecurity. Lots of adversarial scenarios: attacks on the classifiers go against the foundation of machine learning. Prediction: scenarios and data are too volatile, and there are not enough proper sources of data. Lack of data in quantity and quality to train models.
  14. Good use cases: pattern detection / outlier detection (IDS/IPS), antivirus, anti-spam, smarter fuzzers, source code analysis, internal attackers. Notes on the slide: further work needs to be done, but this will allow antivirus to move from a static, signature-based system to a much improved dynamic, learning-based system. If a computer is hacked, certain behaviours will change; if data is constantly monitored and fed into a system, the hack could be detected. Detection of vulnerable patterns during development. Sentiment analysis applied to emails, tweets, and social networks of employees.
  15. Data points (mind map, binaryedge.io 2016): metadata files people photos family & friends behaviour social search company registration ip address url address news forums sub-reddits internal external phone email linked urls likes topics BGP AS whois AS membership AS peer list of IPs shared infrastructure co-hosted sites contact geolocation office locations social networks phone portscan dns torrents domains AXFR MX records screenshots web services http https webserver framework headers cookies certificate configuration authorities entities SMB VNC RDP users apps files peers torrent name OCR SW banners image classifier vulnerabilities
  16. Torrent Correlation
  17. Torrent Correlation: China or Military
  18. Data correlation
  19. Data correlation: Turkish IP
  20. Data points (same mind map as slide 15)
  21. DEMO
  22. At PixelsCamp
  23. At PixelsCamp
  24. Data points (same mind map as slide 15)
  25. Microservices (REST API): port, word, tag, face, country, logo, IP
  26. Scan flow: Scan. Does it generate a screenshot? Yes: store the image file on the cloud, then generate a notification that a new image was uploaded. No: finish. Scan generates events: { "origin": { "type": "vnc", ... }, "target": { "ip": "XX.XXX.XX.XXX", "port": 5900 }, "result": { "data": { "version": "3.7", "width": "1366", "height": "768", "auth_enabled": false, "link": "https://5723981752938cbafeefbcfab42342342.jpg" } }, "@timestamp": "2016-04-22T14:53:02.377Z" }
  27. Image Workflow: Get image. Extract target metadata. Does it contain any content? No: finish. Yes: create image signature; store data; enhance image for logo and face detection and OCR extraction; perform logo and face detection and OCR extraction; store results; perform additional actions.
  28. Image Workflow (same flowchart as slide 27)
  29. Shannon's Entropy. Filter. Example images: Entropy = 0.00 bits; Entropy ~ 0.03 bits; Entropy ~ 2.13 bits
  30. DEMO
  31. Data Visualization (sections: Exploration, Representation, Details, Tools, Finishing up). "a multidisciplinary recipe of art, science, math, technology, and many other interesting ingredients." Andy Kirk, "Data Visualization: a successful design process"
  32. Data Visualization, Exploration. Experimentation is important; a design can be used in the future. Chart: How many open ports does an IP have? Number of IPs with X open ports (axes: port, number of IPs). Values: 69,543,915 25,436,974 7,008,108 3,475,472 1,287,446 1,043,331 951,629 854,817 789,515 759,115 490,290 288,885 266,827 257,105 219,025 198,898 186,286 141,474
  33. Data Visualization. Distribution of IP addresses running encrypted and unencrypted services. Venn diagram: 51,467,779 IPs running HTTP services (on port 80); 28,671,263 IPs running HTTPS services (on port 443); 16,519,503 IPs running both HTTP and HTTPS services. Sample event: { "origin": { "type": "service-simple", ... }, "target": { "ip": "XX.XX.XXX.XXX", "port": 80, "protocol": "tcp" }, "result": { ... "service": { "product": "Microsoft HTTPAPI httpd", "name": "http", "extrainfo": "SSDP/UPnP", "cpe": [ "cpe:/o:microsoft:windows" ] } }, "@timestamp": "2016-04-22T04:07:18.161Z" }
  34. Data Visualization. Top 10 Web Servers for the Web: most common web servers found on port 80 (bar chart, axis in millions). Servers: Apache httpd, AkamaiGHost, Microsoft IIS httpd, nginx, lighttpd, Huawei HG532e ADSL modem http admin, Microsoft HTTPAPI httpd, Technicolor DSL modem http admin, Mbedthis-Appweb, micro_httpd. Values: 11,493,552 8,361,080 4,843,769 3,860,883 2,031,741 1,539,629 952,300 699,202 694,393 678,657. Sample event: { ... "result": { "data": { "apps": [ { "name": "Apache", "confidence": 100, "version": "2.2.26", "categories": [ "web-servers" ] ... } } } }
  35. Data Visualization. Email Protocols: overview of protocols used for email, according to encryption used (encrypted vs. unencrypted; columns: service, count). Services: POP3, POP3S, SMTP, SMTPS, IMAP, IMAPS. Values: 4,572,161 3,742,289 3,531,071 2,971,159 4,131,737 3,703,364 10,416,812 12,234,969. Sample event: { "origin": { "type": "service-simple", ... }, "target": { "ip": "XX.XXX.XXX.XX", "port": 143, "protocol": "tcp" }, "result": { ... "service": { "method": "probe_matching", "product": "Dovecot imapd", "name": "imap", "cpe": [ "cpe:/a:dovecot:dovecot" ] ... }, "@timestamp": "2016-04-22T01:56:54.583Z" }
  36. Data Visualization. Big Data Technologies: changes in amount of data exposed without security, for MongoDB, Memcached, and Redis, at Aug 2015, Jan 2016, and July 2016. Values: 2 TB 644.3 TB 724.7 TB 627.7 TB 13.2 TB 11.3 TB 710.9 TB 12.0 TB 598.7 TB 27.5 TB 1.5 TB 1.8 TB 619.8 TB. Sample event: { "origin": { "type": "redis", ... }, "target": { "ip": "XXX.XX.XX.XXX", "port": 6379 }, "result": { "data": { "redis_version": "3.0.6", ... "used_memory": 1374760, "used_memory_human": "1.31M", "used_memory_rss": 1839104, "used_memory_peak": 25195656, "used_memory_peak_human": "24.03M", "used_memory_lua": 36864, "mem_fragmentation_ratio": 1.34, ... }, "@timestamp": "2016-04-22T15:37:10.913Z" }
  37. Data Visualization. Heartbleed: countries with higher number of IPs vulnerable to Heartbleed. Russia 5,264; Republic of Korea 4,564; China 6,790; United States 23,649; Italy 2,508; Germany 6,382; France 5,622; Netherlands 2,779; United Kingdom 3,459; Japan 2,484. Sample event: { "origin": { "type": "ssl", }, "target": { "ip": "XXX.XX.X.XXX", "port": 443 }, "result": { "data": { "vulnerabilities": { "heartbleed": { "is_vulnerable_to_heartbleed": true }, "openssl_ccs": { "is_vulnerable_to_ccs_injection": false } }, } } }
  38. Data Visualization. VNC wordcloud: login, windows, edition, 2016, delete, ctrl, server, press, microsoft, system, welcome, your, help, file, linux, google, kernel, from, ubuntu
  39. Data Visualization. SSH Banners: most common SSH banners found (axes: count, banner). Banners: SSH-2.0-OpenSSH_5.3, SSH-2.0-OpenSSH_6.6.1p1, SSH-2.0-OpenSSH_6.6.1, SSH-2.0-OpenSSH_4.3, SSH-2.0-OpenSSH_6.0p1, SSH-2.0-OpenSSH_6.7p1, SSH-2.0-dropbear_2014.63, SSH-2.0-OpenSSH_5.5p1, SSH-2.0-ROSSSH, SSH-2.0-OpenSSH_5.9p1. Values: 202,361 352,978 436,700 449,570 462,616 537,667 555,779 604,579 1,501,749 2,632,270. Sample event: { "origin": { "type": "ssh", "job_id": "client-816f1185-4bc1-4b5f-9a7d-61a2df315a6b", "client_id": "client", "country": "uk", "module": "grabber", "ts": 1453385574412 }, "target": { "ip": "X.X.X.X", "port": 22, "protocol": "tcp" }, "result": { "data": { ... "banner": "SSH-2.0-OpenSSH_6.6.1p1" } } }
  40. Data Visualization. SSH Banners (same banners, values, and sample event as slide 39, shown in an alternative chart layout)
  41. Data Visualization. SSH Key Lengths: most common key lengths found (columns: key length, count). Values as listed on the slide: 641,719 1040 186,070 1032 13,845 4096 5,068,711 1024 3,740,593 2048 9,064 512 7,830 2056 6,265 2064 6,212 1016 4,755 768. Sample event: { "origin": { ... }, "target": { "ip": "X.X.X.X", "port": 22, "protocol": "tcp" }, "result": { ... { "cypher": "ssh-rsa", "key": "AAAAB3NzaC1yc2EAAAABIwAAAQEAudfUFJtWp8R5qPxXB0acGHctH0Yyx- VrZZfvnG37osNc32kX35aXVm8Ulk49zl/jMIIQnzP7zeOUJeJJsyXsG6Cu3qjLvD5qlc0tRjoV mV08aDgAsfeq7qQFEzzDqyoL8kV9akj8WyP+aN3QHvM4a/+3Y+UTVqrw5jSUiIIW5JOd+ UWzSz6SCGalFbop1wGELUTY6MDTHwwn+qXYgltQG6hP5tI9tl3gAVajIHg2IxM8IXz4SYH 33ZeOPypzrcr1/DvFx1s0773eGSArIi83BeYyxvN/T68RxIqAieLxVy8zJgyevpqHpUX7/+kDu vVZdfKkmFoNzBTEiIvR5eMrjTw==", "fingerprint": "5b:71:c9:85:6a:ea:40:dc:62:95:4c:25:40:b7:97:55", "length": 2048 } ], ... } } }
  42. Data Visualization. Tools: balance. Automation: programming language to create plots; fine tuning in Illustrator (make it better for the audience); automated analysis; human error. Hand-editing process: Illustrator (or other tool) to create the visualization solution; originality; human error.
  43. Data Visualization. Finishing up. Document every step of the process: calculations, choices of visualisations, choices of data points. Review everything: What could have been done differently? What could be better? Take constructive feedback, even if it means starting over. A visualization can be used in the future.
  44. INTERNET SECURITY EXPOSURE 2016. BinaryEdge.io. Be Ready. Be Safe. Be Secure. ise.binaryedge.io
  45. THE SCIENCE BEHIND THE DATA. CREATED BY BINARYEDGE
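
The image workflow asks whether a screenshot contains any content, and slide 29 presents Shannon's entropy as a filter. A minimal sketch of that idea follows, assuming entropy is computed over the distribution of grayscale pixel values; the function names, the 0.5-bit threshold, and the synthetic pixel lists are hypothetical and are not taken from the talk.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Return the Shannon entropy, in bits, of a sequence of pixel values.

    H = sum over distinct values v of p(v) * log2(1 / p(v)),
    where p(v) is the fraction of pixels with value v.
    """
    total = len(pixels)
    if total == 0:
        return 0.0
    counts = Counter(pixels)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def has_content(pixels, threshold=0.5):
    """Treat an image as non-empty if its entropy exceeds the threshold.

    The threshold is an illustrative assumption; a real pipeline would
    tune it on labelled screenshots.
    """
    return shannon_entropy(pixels) > threshold


# Synthetic examples: a single-colour image and one with four equally
# frequent grey levels (flattened to plain lists of pixel values).
blank_image = [0] * 1000
varied_image = [0, 64, 128, 255] * 250

print(shannon_entropy(blank_image), has_content(blank_image))
print(shannon_entropy(varied_image), has_content(varied_image))
```

A single-colour image has zero entropy because every pixel is the same value, which is why a low-entropy cut-off can drop blank screenshots before the more expensive logo, face, and OCR steps run.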