These slides were part of a presentation given at HushCon East 2017. The talk covered how we can use big data to improve the effectiveness of offensive security tools.
3. WHOAMI
3
• ATL
• Web development
• Academic researcher
• Haxin’ all the things
• (but I rlllly like networks)
• Founder
• Red team
@_lavalamp
4. • Common Crawl
• MapReduce
• Hadoop
• Amazon Elastic MapReduce (EMR)
• Mining Common Crawl using Hadoop
on EMR
• Other "big" data sources
WHAT’S DIS
4
5. • Academic research =/= industry
research
• Tactics can (and should!) be cross-
applied
• Lots of power in big data, only
problem is how to extract it
• Largely untapped resource
• Content discovery (largely) sucks
WHY’S DIS
5
6. 1. Background
2. Common Crawl
3. MapReduce & Hadoop
4. Elastic MapReduce
5. Mining Common Crawl
6. Data Mining Results
7. Big(ish) Data Sources
8. Conclusion
Agenda
6
8. • DARPA CINDER program
• Continual authentication through
side channel data mining
• Penetration testing
• Web Sight
My Background
8
9. • Penetration testing scopes are
rarely adequate
• Faster, more accurate tools ==
better engagements
• It’s 2017 – application layer often
comprises the majority of attack
surface
• Expedite discovery of application-
layer attack surface
Time == $$$
9
10. • Many web applications map disk
contents to URLs
• Un-linked resources are commonly
less secure
• Older versions
• Debugging tools
• Backups with wrong extensions
• Find via brute force
• Current tools are quite lacking
Web App Content Discovery
10
12. • California-based 501(c)(3) non-
profit organization
• Performing full web crawls on a
regular basis using different user
agents since 2008
• Data stored in AWS S3
• A single crawl contains many
terabytes of data
• Full crawl metadata can exceed 10TB
What is Common Crawl?
12
http://commoncrawl.org/
13. • Crawl data is published in three
file formats
• WARC (Web ARChive) – raw crawl data
• WAT – HTTP request and response
metadata
• WET – plain-text HTTP responses
• WAT files likely contain the juicy bits
you’re interested in
• Use existing libraries for parsing file
contents
CC Data Format
13
14. • Data is stored in AWS S3 and is
read by Hadoop jobs via s3:// paths
http://commoncrawl.org/the-data/get-started/
• Can use the usual AWS S3
command line tools for debugging
• Newer crawls contain files listing
WAT and WET paths
CC Storage in S3
14
15. • When running Hadoop jobs, an S3
input path is supplied to identify all
files to process
• Pulling down single files and
checking them out helps with
debugging code
• Use AWS S3 command line tool to
interact with CC data
Accessing CC Data in S3
15
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2017-17/wat.paths.gz .
aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2017-17/
17. • Programming model for processing
large amounts of data
• Processing done in two phases:
• Map – take input data and extract
what you care about (key-value pairs)
• Reduce – apply a simple aggregation
function across the mapped data
(count, sum, etc)
• Simple concept, but it takes some
creativity to get what you need out of
it (see the word-count sketch below)
What is MapReduce?
17
https://en.wikipedia.org/wiki/MapReduce
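To make the two phases concrete, here is a minimal word-count sketch in Java (an illustrative example, not code from the talk): the mapper emits (word, 1) pairs and the reducer sums the counts per key.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: extract what you care about as key-value pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: apply a simple aggregation (here, a count) per key.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable count : counts) {
                total += count.get();
            }
            context.write(word, new IntWritable(total));
        }
    }
}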
18. • Apache Hadoop
• De facto standard open source
implementation of MapReduce
• Written in Java
• Has an interface to process data in
other languages, but writing code in
Java comes with perks
How ‘bout Hadoop?
18
19. • Use the Hadoop library version that
matches the cluster you'll deploy against
• Extend Configured and implement the
Tool interface (see the driver sketch below)
• Implement mapper and reducer
classes
• Configure data types and
input/output paths
• ???
• Profit
Writing Hadoop Code
19
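A hedged sketch of the driver boilerplate described above, wiring in the word-count classes from the earlier sketch; the job name and argument handling here are placeholders, not the talk's actual HadoopRunner code.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class HadoopRunner extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new HadoopRunner(), args));
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "word-count-example");
        job.setJarByClass(HadoopRunner.class);

        // Mapper and reducer classes (from the word-count sketch on the previous slide).
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);

        // Data types emitted by the job.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input path (e.g. an s3:// path when running in EMR) and an empty output path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }
}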
20. • MapReduce supports exactly one
paradigm: map, then reduce
• This is a fairly restrictive paradigm
• Have to be creative to determine
what to do during both the map
and reduce phases to extract and
aggregate the data you care about
Shoehorning into Hadoop
20
22. • EMR
• Amazon’s cloud service for running
Hadoop jobs
• Usage of all the standard AWS tools
• Spins up a cluster of EC2 instances
to process your data
• Free access to data stored in S3
Elastic MapReduce?!
22
23. • Choose how much you want to pay
for EC2 instances
• EMR allows you to use spot pricing
for your instances
• The master node must stay alive at
all times (no spot pricing for it)
• Choose the right spot price and
your total cost for processing all of
Common Crawl can be <$100.00
Spot Pricing!!!
23
25. • We want to find the most common
URL paths for every server type
• We have access to HTTP request
and response headers
• We must find a way to map our
requirements into the map and
reduce phases
• Map – Collect/generate the data we
care about, fit into key-value pairs
• Reduce – Apply a mathematical
aggregation across the collected data
Here Comes the Shoehorn
25
26. MAP
• Create unique strings that contain
(1) a reference to the type of server
and (2) the URL path segment, for every
path segment of every URL found within
the CC HTTP responses (see sketch below)
REDUCE
• Count the number of instances of
each unique string
My Solution
26
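A simplified sketch of that map/reduce split. The real job parses server headers and URL paths out of WAT records; for this sketch the input is assumed to be pre-extracted "serverType<TAB>urlPath" lines, which is an assumption for illustration only.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PathSegmentJob {

    // Delimiter used in the result keys (matches the key format shown on the next slides).
    private static final String DELIMITER = "_';)_";

    // MAP: emit one ("02" + server type + path segment, 1) pair per URL path segment.
    public static class PathSegmentMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            // Assumed input for this sketch: one "serverType<TAB>urlPath" line per record.
            String[] fields = record.toString().split("\t", 2);
            if (fields.length != 2) {
                return;
            }
            for (String segment : fields[1].split("/")) {
                if (segment.isEmpty()) {
                    continue;
                }
                outKey.set("02" + DELIMITER + fields[0] + DELIMITER + segment);
                context.write(outKey, ONE);
            }
        }
    }

    // REDUCE: count how many times each unique (server type, segment) string appeared.
    public static class CountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable count : counts) {
                total += count.get();
            }
            context.write(key, new LongWritable(total));
        }
    }
}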
27. • Working with big data requires
coercion of input data to expected
values
• Aggregating on random data ==
huge output files
• For processing CC data, I had to
coerce the following values to avoid
massive result files
• Server headers
• GUIDs in URL paths
• Integers in URL paths
Mapping URL Paths
27
28. • People put wonky stuff in server
headers
• Reviewed the contents of a few
WAT files and retrieved all server
headers
• Chose a list of server types to
support
• Coerce header values onto the list of
supported server types (see sketch below)
• Not supported -> misc_server
• No server header -> null_server
Coercing Server Headers
28
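A minimal sketch of the header-coercion idea; the SUPPORTED set below is a small placeholder, not the talk's actual list of supported server types.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ServerHeaderCoercer {

    // Placeholder subset of supported server types (assumption for illustration).
    private static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList(
            "apache", "nginx", "iis", "lighttpd", "jetty"));

    public static String coerce(String serverHeader) {
        // No Server header at all -> null_server.
        if (serverHeader == null || serverHeader.trim().isEmpty()) {
            return "null_server";
        }
        // Map wonky header values onto a supported type via substring matching.
        String normalized = serverHeader.toLowerCase();
        for (String supported : SUPPORTED) {
            if (normalized.contains(supported)) {
                return supported;
            }
        }
        // Anything unrecognized -> misc_server.
        return "misc_server";
    }
}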
29. • URL path segments often contain
randomized or auto-generated data
• Dates
• GUIDs
• Integers
• Replace URL path segments with default
strings (see sketch below) when:
• Length exceeds 16 characters
• Contents are all integers
• Contents are mostly integers
Coercing URL Paths
29
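A hedged sketch of those rules; the replacement strings and the majority-digits threshold below are illustrative assumptions, not the talk's exact values.

public class UrlSegmentCoercer {

    public static String coerce(String segment) {
        int digits = 0;
        for (char c : segment.toCharArray()) {
            if (Character.isDigit(c)) {
                digits++;
            }
        }
        if (segment.length() > 16) {
            return "__long_segment__";     // e.g. GUIDs, hashes, dates
        }
        if (!segment.isEmpty() && digits == segment.length()) {
            return "__all_integers__";     // e.g. numeric record IDs
        }
        if (digits * 2 > segment.length()) {
            return "__mostly_integers__";  // majority of characters are digits
        }
        return segment;                    // keep everything else as-is
    }
}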
30. Mapping process results in strings containing coerced
server header and URL path
1. Record type
2. Server type
3. URL path segment
Mapping Result Key
30
< 02_';)_apache_generic_';)_AthenaCarey >
(1 = record type, 2 = server type, 3 = URL path segment)
31. Mapping Example
31
GET /foo/bar/baz.html?asd=123 HTTP/1.1
Host: www.woot.com
User-Agent: Mozilla/5.0 (Macintosh; Intel
Mac OS X 10.12; rv:53.0) Gecko/20100101
Firefox/53.0
Accept: text/html
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Server: Apache/2.4.9 (Unix)
Connection: close
Upgrade-Insecure-Requests: 1
/foo/bar/baz.html on Apache (Unix)
< 02_';)_apache_unix_';)_/foo/>, 1
< 02_';)_apache_unix_';)_/bar/>, 1
< 02_';)_apache_unix_';)_baz.html>, 1
32. • Swap out the fileInputPath
and fileOutputPath values in
HadoopRunner.java
• Compile using ant (not Eclipse,
unless you really like tearing your
hair out)
• Upload Hadoop JAR file to AWS S3
• Create EMR cluster
• Add a “step” to EMR cluster
referencing the JAR file in AWS S3
Running in EMR
32
33. • Processing took about two days
using five medium-powered EC2
instances as task nodes
• 93,914,151 results (mapped string
combined with # of occurrences)
• ~3.6GB across 14 files
• Still fairly raw data – we need to
process it for it to be useful
Resulting Data
33
34. • We effectively have tuples of server
types, URL path segments, and the
number of occurrences for each
server type and segment pair
• Must process the results and order
by most common path segments
• Parsing code can be found here:
Parsing the Results
34
https://github.com/lavalamp-/lava-hadoop-processing
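A hedged sketch of that post-processing step, not the linked repo's actual code: read the tab-separated "key<TAB>count" result lines and order them with the most common segments first.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;

public class ResultSorter {

    public static void main(String[] args) throws IOException {
        // Each Hadoop result line looks like: 02_';)_server_';)_segment<TAB>count
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        lines.sort(Comparator.comparingLong(ResultSorter::countOf).reversed());
        lines.forEach(System.out::println);
    }

    // Parse the trailing count field from a result line.
    private static long countOf(String line) {
        String[] fields = line.split("\t");
        return Long.parseLong(fields[fields.length - 1].trim());
    }
}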
36. URL Segment Counts
36
[Bar chart: "# of URL Segments by Server Type" – number of discovered URL segments per server type, plotted on a log-scale axis from 500,000 to 500,000,000. Server types shown: Gunicorn, Thin, OpenResty, Zope, Lotus Domino, Sun Web Server, Apache (Windows), Jetty, PWS, Lighttpd, IBM HTTP Server, Resin, Oracle Application Server, LiteSpeed, Miscellaneous, IIS, Nginx, Apache (Unix), Apache (Generic).]
39. Comparison w/ Other Sources
39
FuzzDB (all) 850,425 +99.8%
FuzzDB (web & app server) 7,234 +81.2%
Dirs3arch 5,992 +77.3%
Dirbuster 105,847 +98.7%
Burp Suite 424,203 +99.7%
91.34% Average improvement upon existing technologies
*no other approaches provide coverage guarantees
40. • Common Crawl respects (I believe)
robots.txt
• Certainly has a number of blind
spots
• Results omit highly-repetitive URL
segments (integers, GUIDs)
• Crawling likely misses plenty of
JavaScript-based URLs
• Lots of juicy files are never linked,
therefore missed by Common Crawl
Caveats
40
41. Resulting hit list files can be found in the following repository:
https://goo.gl/lxdPDm
Getchu Some Data
41
43. • Public archive of research data
collected through active scans of
the Internet
• Lots of references to other projects
containing data about
• DNS
• Port scans
• Web crawls
• SSL certificates
Scans.io
43
https://scans.io/
44. • American Registry for Internet
Numbers
• WHOIS records for a significant
amount of the IPv4 address space
• Other regional registries have
similar services
• ARIN
• AFRINIC
• APNIC
• LACNIC
• RIPE NCC
ARIN
44
https://www.arin.net/
45. • Awesome open source tools for
performing Internet-scale data
collection
• ZMap – network scans
• ZGrab – banner grabbing & network
service interaction
• ZDNS – DNS lookups
ZMap
45
https://zmap.io/
46. • Use SQL syntax to search all sorts of
huge datasets
• One public dataset contains all
public GitHub data…
Google BigQuery
46
https://cloud.google.com/bigquery/
47. Google BigQuery Tastiness
47
SELECT count(*)
FROM [bigquery-public-data:github_repos.files] as BQFILES
WHERE BQFILES.path LIKE '%server.pem' OR BQFILES.path LIKE '%id_rsa'
OR BQFILES.path LIKE '%id_dsa';
13,706
SELECT count(*)
FROM [bigquery-public-data:github_repos.files] as BQFILES
WHERE BQFILES.path LIKE '%.aws/credentials';
42
SELECT count(*)
FROM [bigquery-public-data:github_repos.files] as BQFILES
WHERE BQFILES.path LIKE '%.keystore';
14,558
SELECT count(*)
FROM [bigquery-public-data:github_repos.files] as BQFILES
WHERE BQFILES.path LIKE '%robots.txt';
197,694
49. • MapReduce
• Hadoop
• Amazon Elastic MapReduce
• Common Crawl
• Shoehorning problem sets into
MapReduce
• Benefits from using big data
• Additional data sources
Recap
49
50. • Hone content discovery based on
already-found URL paths
• Generate content discovery hit lists
for specific user agents (mobile vs.
desktop)
• Hone network service scanning
based on already-found service
ports
Future Work
50
51. • Common Crawl Hadoop Project
https://github.com/lavalamp-/LavaHadoopCrawlAnalysis
• Common Crawl Results Processing Project
https://github.com/lavalamp-/lava-hadoop-processing
• Content Discovery Hit Lists
https://github.com/lavalamp-/content-discovery-hit-lists
• Lavalamp’s Blog
https://l.avala.mp/
References
51