Shmoocon XV - Analyzing Shodan Images with Optical Character Recognition
1. Analyzing Shodan Images with Optical Character Recognition
Shmoocon XV, 2019/01/19
Michael Portera
Product names mentioned in this document are the trademarks or registered trademarks of their respective
owners and are mentioned for identification purposes only.
2. About me
• Primary domains include threat hunting and OSINT
• B.S. and M.S. from Auburn University
• Certs: OSCP, OSWP, CEH, CISSP, CRISC, Sec+, ITILv3
• Twitter: @mportatoes
3. Overview
• Previous Raspberry Pi projects used optical character recognition (OCR). How can it be applied to information security?
• Build a process that is cost-effective and takes little effort
• Started applying OCR to Shodan images in September 2018 to establish attribution and identify possible third-party risk
• What are other applications of this within information security?
5. Simple Process
• Step 1: Determine volume of screenshots on Shodan
• Step 2: Run Shodan API script or CLI to collect images
• Step 3: Use AWS CLI to invoke Rekognition service
• Step 4: Write output to CSV, databases, JSON, etc.
• Step 5: Analyze the data
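The "write output" step could look like the minimal sketch below. The column layout (ip, type, text, confidence) and the function name `write_detections_csv` are assumptions made for illustration; they are not taken from the author's script.

```python
import csv
import io


def write_detections_csv(rows, handle):
    """Write (ip, detection_type, text, confidence) rows as CSV with a header."""
    writer = csv.writer(handle)
    writer.writerow(["ip", "type", "text", "confidence"])
    for ip, detection_type, text, confidence in rows:
        # Two decimals keep the file readable; the API returns more digits.
        writer.writerow([ip, detection_type, text, f"{confidence:.2f}"])


if __name__ == "__main__":
    # Hypothetical sample row; real rows would come from the OCR step.
    sample = [("198.51.100.7", "LINE", "Administrator", 99.1234)]
    buffer = io.StringIO()
    write_detections_csv(sample, buffer)
    print(buffer.getvalue())
```

Passing an open file handle instead of a path keeps the same function usable for files, in-memory buffers, or standard output.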
6. Shodan
• “Shodan is the world’s first search engine for Internet-connected devices”¹
• Free “Membership” with a .edu account (otherwise a one-time payment of $50, or $5 during Black Friday)
  • Allows image downloads (10k/month) and access to images.shodan.io
• Free tier without a .edu account or payment
  • No bulk downloads or access to images.shodan.io
  • Can still obtain screenshots via the API
• OCR for RDP images introduced in late December 2018
1 – According to shodan.io
7. Shodan
• Can use filters to determine the volume of screenshots before running scripts
  • has_screenshot:true org:"Amazon.com" port:3389 country:US
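The same volume check can be scripted. The sketch below composes the filter string shown above and asks the official `shodan` Python library for a result count only. The helper names and the `SHODAN_API_KEY` environment variable are assumptions, and whether a given filter is available depends on the account tier.

```python
import os


def build_screenshot_query(org=None, port=None, country=None):
    """Compose a Shodan search query restricted to hosts with screenshots."""
    parts = ["has_screenshot:true"]
    if org:
        parts.append(f'org:"{org}"')
    if port:
        parts.append(f"port:{port}")
    if country:
        parts.append(f"country:{country}")
    return " ".join(parts)


def count_screenshots(api_key, query):
    """Return the number of matching results without downloading them."""
    import shodan  # third-party: pip install shodan

    api = shodan.Shodan(api_key)
    return api.count(query)["total"]


if __name__ == "__main__":
    query = build_screenshot_query(org="Amazon.com", port=3389, country="US")
    print(query)
    api_key = os.environ.get("SHODAN_API_KEY")  # hypothetical variable name
    if api_key:
        print(count_screenshots(api_key, query))
```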
8. Shodan
• If using the Membership Tier
  • Easier to use the Shodan CLI
    • pip install shodan
    • shodan init YOUR_API_KEY
    • shodan download --limit -1 file.json.gz "has_screenshot:true port:3389"
    • shodan convert file.json.gz images
• If using the Free Tier
  • Can’t download images from the CLI
  • Use API script: https://github.com/mportatoes/shodan_ocr
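For the API route, a rough sketch of pulling screenshots out of search results follows. It is not the linked script. It assumes each banner carries its screenshot as base64 under `screenshot.data` or `opts.screenshot.data`; that layout should be checked against the current Shodan banner specification. The helper names and file naming scheme are hypothetical.

```python
import base64
import os


def extract_screenshot(banner):
    """Return (image_bytes, mime_type) from a banner dict, or None if absent."""
    # Assumed locations of the screenshot object inside a banner.
    shot = banner.get("screenshot") or banner.get("opts", {}).get("screenshot")
    if not shot or "data" not in shot:
        return None
    return base64.b64decode(shot["data"]), shot.get("mime", "image/jpeg")


def save_screenshots(banners, directory):
    """Write one image per banner that has a screenshot; return the file paths."""
    os.makedirs(directory, exist_ok=True)
    paths = []
    for banner in banners:
        result = extract_screenshot(banner)
        if result is None:
            continue
        image_bytes, mime = result
        extension = "jpg" if mime == "image/jpeg" else mime.split("/")[-1]
        name = f"{banner['ip_str']}_{banner.get('port', 0)}.{extension}"
        path = os.path.join(directory, name)
        with open(path, "wb") as handle:
            handle.write(image_bytes)
        paths.append(path)
    return paths
```

The banners could come from `shodan.Shodan(api_key).search(query)["matches"]`; naming files by IP and port keeps later OCR results traceable to a host.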
9. AWS Rekognition and the CLI
• Rekognition
  • Machine learning API that performs OCR and other visual analytics such as object/scene/activity detection and facial recognition
  • Scan 5k images per month with the Free Tier for one year
  • Can use local files directly with the API
• AWS CLI
  • Set up Identity and Access Management (IAM)
  • pip install boto3
  • Update local config file with secret keys for IAM user: ~/.aws/credentials
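Sending a local file to Rekognition text detection with boto3 can be sketched as follows. The function name `detect_text` and the region are assumptions; credentials are expected in ~/.aws/credentials as described above. The client is passed in as a parameter so the function can be exercised without an AWS account.

```python
import sys


def detect_text(client, image_bytes):
    """Send raw image bytes to Rekognition and return the list of text detections."""
    response = client.detect_text(Image={"Bytes": image_bytes})
    return response["TextDetections"]


if __name__ == "__main__" and len(sys.argv) > 1:
    import boto3  # third-party: pip install boto3

    # The region is an assumption; use whichever region the IAM user targets.
    rekognition = boto3.client("rekognition", region_name="us-east-1")
    with open(sys.argv[1], "rb") as handle:
        detections = detect_text(rekognition, handle.read())
    for item in detections:
        print(item["Type"], item["DetectedText"], item["Confidence"])
```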
10. Automagic
• Rekognition script
  • Available at https://github.com/mportatoes/shodan_ocr
  • python rekognition.py -[t,o] -d image_directory/ -s IP
    • t = send to text detection API
    • o = send to object detection API
    • d = directory of images to look up in AWS
    • s = look up a single IP in Shodan and send it to Rekognition
• Interpreting Line versus Word
  • Lines help quickly determine whether there are multiple accounts, warning banners, etc.
  • Foreign characters will appear as random English characters in AWS
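To illustrate the Line versus Word distinction: Rekognition text detections carry a `Type` of `LINE` or `WORD`, and each word points at its line through `ParentId`. The sketch below separates the two; the sample data is invented for illustration.

```python
def split_lines_and_words(text_detections):
    """Separate LINE detections from WORD detections grouped by parent line Id."""
    lines = {}
    words_by_line = {}
    for item in text_detections:
        if item["Type"] == "LINE":
            lines[item["Id"]] = item["DetectedText"]
        elif item["Type"] == "WORD":
            words_by_line.setdefault(item["ParentId"], []).append(item["DetectedText"])
    return lines, words_by_line


if __name__ == "__main__":
    # Invented detections resembling a login screen with two accounts.
    sample = [
        {"Id": 0, "Type": "LINE", "DetectedText": "Administrator"},
        {"Id": 1, "Type": "LINE", "DetectedText": "Other User"},
        {"Id": 2, "Type": "WORD", "ParentId": 0, "DetectedText": "Administrator"},
        {"Id": 3, "Type": "WORD", "ParentId": 1, "DetectedText": "Other"},
        {"Id": 4, "Type": "WORD", "ParentId": 1, "DetectedText": "User"},
    ]
    lines, words = split_lines_and_words(sample)
    print(lines)
    print(words)
```

Here two lines suggest two accounts on the login screen, which the flat word list alone would not show.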
13. ICS and IoT via VNC – Text Detection
Search Query: shodan download --limit -1 vnc.json.gz "has_screenshot:true port:5900,5901"
Scope
• Download Date: 2019-01-02
• Query: VNC, has screenshot
• Images Analyzed: 2,375
• Identified ICS & IoT devices: 319 (~13%)
• Tagged by Shodan: 12
• Total New ICS and IoT: 307 (96%)
Sample Keywords
• Motor, Liter/Litre, Calibration, Plant, Temp, Pump, Valve, Agri, Light
• Power, On/Off, Solar, kw/kg/mm/etc., Control, Rows of Numbers, Frequency, Thermo, Timer
Discovered Devices (bar chart, ICS and IoT; legend: Tagged by Shodan, New ICS and IoT): bar values 12, 256, 51
Other Findings (bar chart): Cyber Vigilante, Clear-Text Passwords, Email Address, Hacking Attempt; bar values 1, 4, 6, 12
• Hacking Attempt example: 92.63.197.[48,60]/malware.exe
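Keyword-based triage of the detected text could be sketched as below. The list reuses the sample keywords from this slide; the substring matching rule and the function name are assumptions, and entries such as "On/Off", units, or "Rows of Numbers" would need separate handling.

```python
# Lowercased sample keywords from the slide; short stems like "temp" and
# "thermo" are matched as substrings on purpose.
ICS_IOT_KEYWORDS = [
    "motor", "liter", "litre", "calibration", "plant", "temp", "pump",
    "valve", "agri", "light", "power", "solar", "control", "frequency",
    "thermo", "timer",
]


def matched_keywords(detected_text, keywords=ICS_IOT_KEYWORDS):
    """Return the sorted keywords that occur (case-insensitively) in OCR text."""
    lowered = detected_text.lower()
    return sorted(keyword for keyword in keywords if keyword in lowered)


if __name__ == "__main__":
    # Invented OCR output resembling an HMI panel.
    print(matched_keywords("Pump 1 ON  Valve 2 OFF  Temperature 21.5 C"))
```

Substring matching is deliberately loose and will produce false positives (for example "light" inside "highlight"), so matches are best treated as candidates for manual review.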
14. Sample Applications
• Offensive:
  • Reconnaissance: Naming conventions of endpoints, usernames, and domains with minimal effort
  • Social engineering scenarios (e.g., knowing who was logged into a cloud instance in near real time)
  • Could be useful for other processes like analyzing massive amounts of RDP/VNC screenshots from EyeWitness or screenshots from meterpreter sessions
• Defensive:
  • Identifying rogue/unmanaged cloud instances for the organization
  • Identifying third-party risk
• Threat Intelligence:
  • Identifying ransomware victims or IPs/domains being used for malware
• Other
15. Webcams – Object Detection
Search Query: shodan download --limit -1 obj_det.json.gz "has_screenshot:true !port:3389 !port:3388 !port:5901 !port:5900 country:US"
Scope
• Download Date: 2018-12-17
• Images Analyzed: 1,965
• Unique Labels for Detected Objects: 891
Top 10 Labels
Label          Count
Outdoors       900
Nature         891
Building       581
Night*         392
Indoors        356
Astronomy*     354
Universe       347
Space*         347
Outer Space*   347
Urban          307
*Many of these were taken at night or completely blank and defaulted to these terms
Privacy Concerns (bar chart; vertical axis 0 to 180, category labels not recoverable)
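A label tally like the Top 10 table could be produced along these lines. This is a sketch rather than the author's script: `detect_labels` is the boto3 Rekognition call, while the helper names and the 50% minimum confidence are assumptions.

```python
from collections import Counter


def detect_label_names(client, image_bytes, min_confidence=50.0):
    """Return the label names Rekognition reports for one image."""
    response = client.detect_labels(
        Image={"Bytes": image_bytes}, MinConfidence=min_confidence
    )
    return [label["Name"] for label in response["Labels"]]


def count_labels(client, images, min_confidence=50.0):
    """Count in how many images each label appears."""
    counts = Counter()
    for image_bytes in images:
        # set() ensures a label is counted once per image.
        counts.update(set(detect_label_names(client, image_bytes, min_confidence)))
    return counts
```

With a client from `boto3.client("rekognition")`, `count_labels(client, images).most_common(10)` would give the ten most frequent labels.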
17. Results: Webcam – Object Detection
Image: XX.XXX.XX.172:80
Analysis
Label      Confidence
Nature     99.83994
Outdoors   99.47083
Piste      98.90309
Person     98.90309
Sport      98.90309
Snow       98.90309
Sports     98.90309
Human      98.90309
Yard       81.48057
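One way to triage results like these for privacy review is to flag images where a people-related label clears a confidence threshold. The label set, the 90% threshold, and the function name below are assumptions; the sample values are taken from the table above.

```python
PEOPLE_LABELS = {"Person", "Human"}  # assumed set of people-related labels


def people_detected(labels, threshold=90.0):
    """Return True if any people-related label meets the confidence threshold."""
    return any(
        label["Name"] in PEOPLE_LABELS and label["Confidence"] >= threshold
        for label in labels
    )


if __name__ == "__main__":
    # Subset of the labels reported for the webcam image on this slide.
    webcam = [
        {"Name": "Nature", "Confidence": 99.83994},
        {"Name": "Person", "Confidence": 98.90309},
        {"Name": "Yard", "Confidence": 81.48057},
    ]
    print(people_detected(webcam))
```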
18. Other Uses for Text & Obj. Det.
• TraceLabs: Capture The Flag (CTF) for missing persons
  • Use facial recognition to identify missing person in photos or video collection
• Physical Security:
  • Video doorbell + video analysis + passive Wi-Fi monitoring
  • Video analysis of personnel + badging
• Data Loss Prevention:
  • Detect sensitive content in an image file
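For the data loss prevention idea, a rough sketch is to run simple patterns over the text lines that OCR returns for an image. The two patterns and the function name are illustrative assumptions; a real policy would need broader and better-tuned rules.

```python
import re

# Illustrative patterns only: an email address and a "password:" style label.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PASSWORD_HINT = re.compile(r"\b(?:password|passwd|pwd)\b\s*[:=]", re.IGNORECASE)


def find_sensitive(lines):
    """Return (kind, value) pairs for OCR lines that look sensitive."""
    findings = []
    for line in lines:
        for address in EMAIL_PATTERN.findall(line):
            findings.append(("email", address))
        if PASSWORD_HINT.search(line):
            findings.append(("password", line))
    return findings


if __name__ == "__main__":
    # Invented OCR lines for demonstration.
    print(find_sensitive(["Contact: admin@example.com", "Password: hunter2", "Pump 1 ON"]))
```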