This document summarizes Chris Partridge's work on scraping and analyzing DNS data to generate threat intelligence. It covers scraping domain and DNS data at scale, analyzing it for anomalies, integrating threat intelligence feeds, and the approach's limitations. The goal is proactive threat intelligence: identifying relationships between domains, IPs, and known bad actors from DNS data and intelligence feeds. Future work includes scaling up data collection, distributing analysis, and integrating findings into security tools.
2. [~] whois -h tweedge
▪ Founder of dnstrace.pro
▪ Third year RIT CSEC student and BSides regular
▪ Runs Snort on own network
▪ Guacamole aficionado
▪ Dungeons and Dragons
3. Contents
1. Quick Refresher on DNS
2. The Reactive Threat Intelligence Problem
3. Scraping and Ingesting DNS Data at Scale
4. Anomalies, Analysis, and General Findings
22. What Would We Need?
▪ Huge quantities of parsed domain data
▫ Some collect this passively; we won't
▫ Difficult to acquire aggressively
▪ As much threat intelligence as possible
26. Acquiring Domains
▪ Buy access to curated zone files
▫ ~$300/year (╯°□°)╯︵ ┻━┻
▪ Request access to zone files from registrars
▫ ICANN's CZDS is a good start
29. Website Crawling
▪ Find and follow links (sketch below)
▪ Complex and resource-intensive if the entire document is rendered for each page
▪ Requires a webserver to be running
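For illustration only, a minimal stdlib-only link follower (all names here are hypothetical; a real crawler needs robots.txt handling, politeness delays, and deduplication at scale, which is part of why this approach gets expensive):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl_hostnames(seed, limit=50):
    """Follow links from a seed URL and collect the hostnames seen."""
    queue, seen_urls, hosts = [seed], set(), set()
    while queue and len(seen_urls) < limit:
        url = queue.pop(0)
        if url in seen_urls:
            continue
        seen_urls.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # dead link, non-HTML content, timeout, etc.
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            host = urlparse(absolute).hostname
            if host:
                hosts.add(host)
                queue.append(absolute)
    return hosts
```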
30. Search Engines & Passive DNS
▪ Great for real-life engagements; exposes nothing about your recon to a target
▪ Depends on external services
Recommended software:
31. Probabilistic Lookups
▪ Use a list of known FQDNs and parse out the most common subdomains
▪ Combine with anything you know about the target (e.g. wordlists) to increase coverage (sketch below)
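A minimal sketch of the idea, assuming dnspython; COMMON_SUBDOMAINS is an illustrative stand-in for the frequency-ranked list you would parse out of known FQDNs:

```python
import dns.exception
import dns.resolver

# Illustrative placeholder for a frequency-ranked subdomain wordlist
COMMON_SUBDOMAINS = ["www", "mail", "api", "dev", "vpn", "staging"]

def probe(domain, wordlist=COMMON_SUBDOMAINS):
    """Resolve candidate subdomains and return the ones that exist."""
    found = []
    for sub in wordlist:
        fqdn = f"{sub}.{domain}"
        try:
            dns.resolver.resolve(fqdn, "A")
            found.append(fqdn)
        except dns.exception.DNSException:
            continue  # NXDOMAIN, timeout, etc.
    return found
```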
33. Reverse DNS (sketch below)
▪ Useful for IPv4 (dense), less useful for IPv6 (sparse)
▪ Often results in ISP-assigned FQDNs
▪ ...hrm.
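The lookup itself is tiny with dnspython (ptr_lookup is an illustrative helper name, assuming dnspython ≥ 2.x):

```python
import dns.exception
import dns.resolver
import dns.reversename

def ptr_lookup(ip):
    """Return PTR names for an address, or [] when none is delegated."""
    try:
        rev = dns.reversename.from_address(ip)  # 8.8.8.8 -> 8.8.8.8.in-addr.arpa.
        return [r.to_text() for r in dns.resolver.resolve(rev, "PTR")]
    except dns.exception.DNSException:
        return []
```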
34. DNSSEC
▪ A set of security extensions for DNS
▪ Provides:
▫ Origin authentication
▫ Data integrity
▫ Denial of existence
36. NSEC Walks
▪ How does denial of existence work with DNSSEC?
▫ NS returns NSEC response: "next secure record" (sketch below)
[Diagram] Generally: the user requests "test" of example.com (whose records include "api" and "www"), and the NS returns the next secure record.
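A minimal sketch of the walk, assuming dnspython ≥ 2.x; a real-world walk also needs TCP fallback, retries, and handling for servers that refuse direct NSEC queries:

```python
import dns.message
import dns.query
import dns.rdatatype

def nsec_walk(zone, nameserver_ip):
    """Follow NSEC 'next secure' pointers around a signed zone.
    zone should be absolute (trailing dot), e.g. "example.com."."""
    names, current = [], zone
    while True:
        query = dns.message.make_query(current, dns.rdatatype.NSEC,
                                       want_dnssec=True)
        response = dns.query.udp(query, nameserver_ip, timeout=5)
        nsec = next((rr[0] for rr in response.answer
                     if rr.rdtype == dns.rdatatype.NSEC), None)
        if nsec is None:
            break  # no NSEC returned; zone may be unsigned or use NSEC3
        nxt = nsec.next.to_text()
        if nxt == zone or nxt in names:
            break  # chain wrapped back to the apex: walk complete
        names.append(nxt)
        current = nxt
    return names
```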
40. NSEC3 Walks
▪ Privacy improvements to DNSSEC in 2008, creating NSEC3 records by hashing adjacent valid records
[Diagram] Generally: the user requests "test" of example.com (whose records include "api" and "www"), and the NS returns an NSEC3 record stating: "There is nothing between '71f64b...' and '724611...'"
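You can't follow the hashes directly, but you can harvest them for offline cracking (as in the hashcat demo mentioned in the notes). A hedged dnspython sketch, with collect_nsec3_hashes as a hypothetical helper:

```python
import secrets

import dns.message
import dns.query
import dns.rdatatype

def collect_nsec3_hashes(zone, nameserver_ip, rounds=100):
    """Gather NSEC3 owner hashes by asking for names that don't exist;
    each NXDOMAIN proof leaks the hashes bracketing the queried name."""
    hashes = set()
    for _ in range(rounds):
        label = secrets.token_hex(8)  # effectively guaranteed not to exist
        query = dns.message.make_query(f"{label}.{zone}", dns.rdatatype.A,
                                       want_dnssec=True)
        response = dns.query.udp(query, nameserver_ip, timeout=5)
        for rrset in response.authority:
            if rrset.rdtype == dns.rdatatype.NSEC3:
                owner = rrset.name.labels[0].decode()  # base32hex hash of a real name
                rdata = rrset[0]
                hashes.add((owner, rdata.salt.hex(), rdata.iterations))
    return hashes  # format these for an offline cracker, e.g. hashcat's NSEC3 mode
```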
44. DNSSEC, NSEC, NSEC3 Recap
▪ If a target has DNSSEC enabled, it's absolutely worth investigating an NSEC(3) walk
▪ NSEC scales well; NSEC3 does not (on CPU)
▪ NSEC5 on the way
45. Zone Transfers (AXFR Query)
▪ Ask the nameserver politely for all its zone data (sketch below)
▪ Between 1/7 and 1/10 nameservers allow AXFR
▪ Requires little effort for a possibly large payout
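The polite ask is nearly a one-liner with dnspython (try_axfr is an illustrative wrapper; most servers will refuse):

```python
import dns.query
import dns.zone

def try_axfr(domain, nameserver_ip):
    """Request a full zone transfer; returns every name in the zone on success."""
    try:
        zone = dns.zone.from_xfr(dns.query.xfr(nameserver_ip, domain, timeout=10))
    except Exception:
        return None  # REFUSED, timeouts, and partial transfers all land here
    return sorted(name.to_text() for name in zone.nodes)
```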
46. North Korea DNS Leak
Found by mandatoryprogrammer/TLDR
Sept. 2016, 28 domains:
airkoryo.com.kp, cooks.org.kp, friend.com.kp, gnu.rep.kp, kass.org.kp, kcna.kp, kiyctc.com.kp, knic.com.kp, koredufund.org.kp, korelcfund.org.kp, ...
49. Resolving the Domain Space
▪ The DIY solution (sketch below)
1. Try an AXFR (if applicable)
2. Try an ANY query
3. Iterate through desired query types
▪ Thread and geographically distribute
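A compressed sketch of steps 2 and 3 (step 1 is the AXFR shown earlier); the WANTED list is an assumption based on the "most-wanted types" mentioned in the editor's notes:

```python
import dns.exception
import dns.resolver

WANTED = ["A", "AAAA", "CNAME", "MX", "NS", "TXT"]  # assumed "most-wanted" types

def resolve_domain(domain):
    """Try ANY first, then fall back to iterating per-type."""
    try:
        return {"ANY": [r.to_text() for r in dns.resolver.resolve(domain, "ANY")]}
    except dns.exception.DNSException:
        pass  # many providers refuse ANY (complexity, DoS amplification)
    records = {}
    for rdtype in WANTED:
        try:
            records[rdtype] = [r.to_text()
                               for r in dns.resolver.resolve(domain, rdtype)]
        except dns.exception.DNSException:
            continue
    return records
```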
50. Make Use of Open Domain Data
Rapid7 Sonar (sketch below)
▪ SSL, Forward DNS, and Reverse DNS = great
▪ Approx. 2.3 billion data points per week in FDNS
▪ Permits non-malicious, noncommercial use
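FDNS dumps are large, so streaming is the practical approach. A sketch assuming the published format (gzipped JSON lines with name/type/value fields; the filename is illustrative):

```python
import gzip
import json

def iter_fdns(path):
    """Stream (name, type, value) tuples from a Sonar FDNS dump."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            record = json.loads(line)
            yield record["name"], record.get("type"), record.get("value")

# e.g. count CNAME records in a dump
cnames = sum(1 for _, rtype, _ in iter_fdns("fdns_any.json.gz")
             if rtype == "cname")
```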
63. Adding Threat Intelligence
▪ Ingest as many lists as possible (sketch below)
▫ Phishing feeds of URLs
▫ Domain reputation feeds
▫ IP reputation feeds
▫ BOGONs
▪ Considering heuristics
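One way the "trust rating × user bias" scheme from the editor's notes could look; entirely illustrative, with feed names and weights as placeholders:

```python
# Each feed carries a trust rating; user bias multiplies per-category scores
# so users can quickly zero out categories they don't care about.
FEEDS = {
    "openphish": {"category": "phishing", "trust": 0.9, "entries": set()},
    "spamhaus":  {"category": "domain",   "trust": 0.8, "entries": set()},
}

def score_domain(domain, user_bias=None):
    """Sum trust * bias over every feed that lists the domain."""
    user_bias = user_bias or {}
    score = 0.0
    for feed in FEEDS.values():
        if domain in feed["entries"]:
            score += feed["trust"] * user_bias.get(feed["category"], 1.0)
    return score
```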
73. Limitations
▪ Data is far from complete at the moment
▪ Threat intelligence sources are good, not great
▫ Response time to emerging threats is slow
75. Future Improvements
▪ Talk to a lawyer
▪ Scale out, cover more geographic areas, increase query throughput
▫ You can help!
▪ Implement distributed NSEC walking, AXFRs
76. The Endgame
▪ Make dnstrace a proactive tool for geeks
▫ Generate firewall configurations
▫ Generate DNS blocklists
▪ Make dnstrace a proactive tool for
79. This is the Last Slide-
Thank you so much for coming to my talk!
Keep in touch via ...
Email: chris@partridge.tech
LinkedIn: /in/tweedge/
GitHub: @tweedge
Editor's Notes
No questions section at the end re: not an expert. I’m on the mailing list, but this is a discussion, not a lecture.
I know, nobody really likes DNS. me neither.
Strip away request data to get at the parts of a domain
DNS hierarchy - a given NS only knows about itself and its delegates. None are obligated to expose information that isn’t being requested. Replace text-heavy slide with diagram
With the talk about the basics over - what does security in the DNS space look like?
When was this from? What does it mean?
ouch
Getting better? When/where/who?
Even Talos is making reactive decisions, but this time supplemented with proactive heuristics
Reasonable heuristics can help protect people - ok, great. It’s a start. But we don’t have that many heuristics to work with. What about a dynamic address so Bob can access his nextcloud at home?
Requires complex, relational structure. Allows for link-based clustering. Lower dimensionality and high similarity tend to make systems more effective.
Too many characteristics to prove effective for clustering - though specific characteristics and sets of characteristics are classifiable, thus, heuristics
Finite set of characteristics, strong structure of both nodes and characteristics.
We could give it a go
TLDs are easy. TLD holders want you to know that their TLD exists for marketing reasons, technical reasons (parsing), etc
Generally for enumerating domains people buy access to zone files
I’ll make recommendations for discovering out of scope things, and demo discovering in-scope things
Absolutely not. A good lawyer can make it sound like a DoS attack. Especially bad at scale. Read: DDoS.
Limits search to web-only domains, out of scope for the intended function of DNS. Useful in other situations. Lots of options.
Good for limited IRL engagements, can’t perform this at scale due to blacklisting
Can be used with certain limitations at scale. Example assumes randomized brute force search of strlen<=6 at 50% completion using only alpha charset - numbers, dashes, etc. would greatly expand search space
We can usually tell the ISP by AS allocations anyway, but whatever
Implements several new record types including DNSKEY, RRSIG, NSEC... DNSSEC-secured responses can be verified to be signed by the NS and valid, and to prevent an ISP from just intercepting/nulling DNS responses there is authenticated denial of existence.
Denial of existence you say... how does that work?
Replies with an NSEC record. There is not a standard for NSEC records beyond “the next record signed by the nameserver” but this typically means (in most implementations) the next record that exists/returns data.
hmmmmmmmm
There you have it, example.com replies with the next valid domain. So we can enumerate.
But example.com is too small. Let's enumerate the DNS footprint of something bigger... say, PayPal. 689 queries. That's an 80-char wide window; the longest subdomain I saw was ~30 chars, so 7 * 10^55 is a good estimate of the 50% brute force space. ONE SEPTENDECILLION TIMES as many queries.
People realized the former was a bad idea and came up with NSEC3, the big change being hashed subdomain results
You can see we get gibberish back - no easy walks to be found here. But that's not to say they're not possible - you can collect a list of those hashes and then feed them into cracking software.
Here’s an NSEC3 walk of the .pro TLD - in several seconds we crack 11% of domains using hashcat on a Xeon L5640 CPU with a very bad mask. With a last-gen GPU cracking system and a better mask or wordlist, we could push much higher coverage with little extra investment
Cloudflare’s “DNSSEC done right.” NSEC doesn’t have to be the next existing domain - just has to be the next signed record. So CF generates garbage records to prevent a walk.
For our project we'd need to create failure cases. NSEC5 prevents enumeration - it's available but not really used right now.
Should not be enabled for production environments. 1/7 numbers from dns arc, marjorie @ ic3. 1/10 is the approximate lower bound of what I've seen so far. Reveals activity beyond reasonable doubt to any DNS admin checking the logs, though.
Demo axfr - lots of content. We don’t just get a list of valid domains/subdomains - we get query types and responses too. neat!
We now have a lot of ways of acquiring FQDNs - but we’ve only scratched the surface of resolving that domain space
AXFR first - very little effort to enumerate an entire zone, try it against any registrable domain. Failing that, ANY (since we want all the data). Some service providers are refusing ANY, generally due to complexity and DoS amp. Re: cloudflare. Failing that, iterate through most-wanted types (A, AAAA, CNAME, MX, NS, etc.). We thread heavily and maintain 20 nodes in 6 countries and 3 continents.
Really no reason not to if you're doing this noncommercially at the scale we are. More scans = more granular data = better maps.
Decompose for faster querying. It's much worse for performance and utility to select where "10.%" to find everything in the 10.0.0.0/8 space. Useful to have everything that can be an integer represented as an integer, and even FQDNs parsed out into subsections.
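A sketch of that decomposition: store IPv4 addresses as integers so a CIDR filter becomes a range predicate rather than a string match (column and helper names are illustrative):

```python
import ipaddress

def ip_to_int(ip: str) -> int:
    """IPv4 dotted-quad -> integer, suitable for an indexed integer column."""
    return int(ipaddress.IPv4Address(ip))

# 10.0.0.0/8 becomes a BETWEEN over two integers instead of LIKE '10.%'
net = ipaddress.ip_network("10.0.0.0/8")
lo, hi = int(net.network_address), int(net.broadcast_address)
# SQL sketch: SELECT ... WHERE ip_int BETWEEN :lo AND :hi
```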
Now that we have the data we need, let's start playing with it.
Filing an issue
Hmmmmmm
0.1% of addresses went to private or loopback addresses - doesn’t seem like much until you consider that’s about 1.2 million addresses. Here’s the breakdown. An additional 0.01% of addresses resolve to 1.1.1.1, so APNIC was right to lend that address to Cloudflare.
Protocol doesn’t belong here, .invalid???, a series of CNAMEs with the same hash-like value, simple errors, keyboard smash????
Localhost is not, in fact, a valid MX or NS record
Phishing: eg openphish, domain reputation: eg spamhaus, IP rep: eg everything in firehol, BOGONs from team cymru
Assign trust ratings based on common criteria to all threat intelligence lists. User bias is a multiplier so they can quickly zero out things not relevant to them (eg. if they don’t mind ads, or want to access a certain cove of ragged sailors)
A map of a hosting provider, beget.tech
I wouldn’t trust that adobe update, would you?
Heroku, 000webhost
Harder decisions - some things will get caught, so we need to find ways to minimize false positives - eg. integrating heuristics for false positive reduction
AWS but limited
GitHub but limited
Considering making some visualizations for estimating our current coverage
Lots of querying power in few areas
Could show map of where we have nodes
This is the end goal - not just an investigative tool, but a utility that can help people be secure at the domain level with a precision and proactivity that hasn't been achieved before.