Cloudstone
Sharpening Your Weapons Through Big Data
Christopher Grayson
@_lavalamp
Introduction
WHOAMI
• ATL
• Web development
• Academic researcher
• Haxin’ all the things
• (but I rlllly like networks)
• Founder
• Red team
@_lavalamp
WHAT’S DIS
• Common Crawl
• MapReduce
• Hadoop
• Amazon Elastic MapReduce (EMR)
• Mining Common Crawl using Hadoop on EMR
• Other “big” data sources
WHY’S DIS
• Academic research != industry research
• Tactics can (and should!) be cross-applied
• Lots of power in big data; the only problem is how to extract it
• Largely untapped resource
• Content discovery (largely) sucks
Agenda
1. Background
2. Common Crawl
3. MapReduce & Hadoop
4. Elastic MapReduce
5. Mining Common Crawl
6. Data Mining Results
7. Big(ish) Data Sources
8. Conclusion
Background
My Background
• DARPA CINDER program
• Continual authentication through side-channel data mining
• Penetration testing
• Web Sight
Time == $$$
• Penetration testing scopes are rarely adequate
• Faster, more accurate tools == better engagements
• It’s 2017 – the application layer often comprises the majority of the attack surface
• Expedite discovery of application-layer attack surface
Web App Content Discovery
• Many web applications map disk contents to URLs
• Unlinked resources are commonly less secure
  • Older versions
  • Debugging tools
  • Backups with wrong extensions
• Find via brute force
• Current tools are quite lacking
Common Crawl
What is Common Crawl?
• California-based 501(c)(3) non-profit organization
• Performing full web crawls on a regular basis using different user agents since 2008
• Data stored in AWS HDFS (S3)
• A single crawl contains many terabytes of data
• Full crawl metadata can exceed 10TB
http://commoncrawl.org/
CC Data Format
• Crawl data is stored in three specialized data formats
  • WARC (Web ARChive) – raw crawl data
  • WAT – HTTP request and response metadata
  • WET – plain-text HTTP responses
• WAT files likely contain the juicy bits you’re interested in
• Use existing libraries for parsing file contents
CC HDFS Storage
• Data is stored in AWS HDFS (S3)
  http://commoncrawl.org/the-data/get-started/
• Can use the usual AWS S3 command line tools for debugging
• Newer crawls contain files listing WAT and WET paths
Accessing HDFS in AWS
• When running Hadoop jobs, an HDFS path is supplied to identify all files to process
• Pulling down single files and checking them out helps with debugging code
• Use the AWS S3 command line tool to interact with CC data:
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2017-17/wat.paths.gz .
aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2017-17/
MapReduce & Hadoop
What is MapReduce?
• Programming model for processing large amounts of data
• Processing done in two phases (tiny example below):
  • Map – take input data and extract what you care about (key-value pairs)
  • Reduce – apply a simple aggregation function across the mapped data (count, sum, etc.)
• Easy concept, quirky to get what you need out of it
https://en.wikipedia.org/wiki/MapReduce
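To make the two phases concrete, here is a minimal word-count sketch in plain Java streams (no Hadoop yet): the map step turns each word into a (word, 1) pair and the reduce step sums the counts per key. The class name and sample data are purely illustrative.

```java
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TinyMapReduce {
    public static void main(String[] args) {
        // Map: each word becomes the key of a (word, 1) pair.
        // Reduce: counting aggregates all pairs that share a key.
        Map<String, Long> counts = Stream.of("foo", "bar", "foo")
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        System.out.println(counts); // {bar=1, foo=2} (key order may vary)
    }
}
```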
How ‘bout Hadoop?
• Apache Hadoop
  • De facto standard open source implementation of MapReduce
  • Written in Java
  • Has an interface to process data in other languages, but writing code in Java comes with perks
Writing Hadoop Code
• Use the Hadoop library for the version you’ll be deploying against
• Extend the Configured class and implement the Tool interface
• Implement mapper and reducer classes
• Configure data types and input/output paths (a minimal sketch follows)
• ???
• Profit
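A minimal sketch of that skeleton against the org.apache.hadoop.mapreduce API, shaped like a word-count job; the class names are illustrative stand-ins, not the code from the repository referenced later.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CrawlAnalysisRunner extends Configured implements Tool {

    // Map phase: extract what you care about from each input record
    // and emit it as a key-value pair.
    public static class AnalysisMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            // Real code would parse the WAT record here; this just echoes it.
            context.write(new Text(record.toString()), ONE);
        }
    }

    // Reduce phase: apply a simple aggregation (here, a sum) across
    // all values emitted for the same key.
    public static class AnalysisReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "cc-analysis");
        job.setJarByClass(CrawlAnalysisRunner.class);
        job.setMapperClass(AnalysisMapper.class);
        job.setReducerClass(AnalysisReducer.class);
        // Configure the data types flowing out of the job.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Input/output paths; on EMR these would be s3:// paths.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new CrawlAnalysisRunner(), args));
    }
}
```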
Shoehorning into Hadoop
• MapReduce supports the map -> reduce paradigm
• This is a fairly restrictive paradigm
• Have to be creative to determine what to do during both the map and reduce phases to extract and aggregate the data you care about
Elastic MapReduce
Elastic MapReduce?!
• EMR – Amazon’s cloud service for running Hadoop jobs
• Usage of all the standard AWS tools
• Set up a cloud of EC2 instances to process your data
• Free access to data stored in S3
Spot Pricing!!!
• Choose how much you want to pay for EC2 instances
• EMR allows you to use spot pricing for your instances
• Must have one or two master nodes alive at all times (no spot pricing)
• Choose the right spot price and your total cost for processing all of Common Crawl can be <$100.00
Mining Common Crawl
Here Comes the Shoehorn
• We want to find the most common URL paths for every server type
• We have access to HTTP request and response headers
• We must find a way to map our requirements into the map and reduce phases
  • Map – collect/generate the data we care about, fit into key-value pairs
  • Reduce – apply a mathematical aggregation across the collected data
My Solution
MAP
• Create unique strings that contain (1) a reference to the type of server and (2) the URL path segment, for every URL path segment in every URL found within the CC HTTP responses
REDUCE
• Count the number of instances of each unique string
Mapping URL Paths
• Working with big data requires coercion of input data to expected values
• Aggregating on random data == huge output files
• For processing CC data, I had to coerce the following values to avoid massive result files
  • Server headers
  • GUIDs in URL paths
  • Integers in URL paths
Coercing Server Headers
• People put wonky stuff in server headers
• Reviewed the contents of a few WAT files and retrieved all server headers
• Chose a list of server types to support
• Coerce header values into the list of supported server types (sketched below)
  • Not supported -> misc_server
  • No server header -> null_server
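A minimal sketch of that coercion, assuming simple substring matching against a hand-picked list. The exact matching rules and full list of supported types are assumptions; the misc_server and null_server fallbacks come from the slide, and the apache_* bucket names mirror the key examples later on.

```java
// Hedged sketch: maps a raw Server header onto one of the supported buckets.
static String coerceServerHeader(String serverHeader) {
    if (serverHeader == null || serverHeader.trim().isEmpty()) {
        return "null_server";                              // no Server header at all
    }
    String value = serverHeader.toLowerCase();
    if (value.contains("apache")) {                        // three Apache buckets
        if (value.contains("unix")) return "apache_unix";
        if (value.contains("win"))  return "apache_windows";
        return "apache_generic";
    }
    if (value.contains("nginx"))     return "nginx";
    if (value.contains("iis"))       return "iis";
    if (value.contains("lighttpd"))  return "lighttpd";
    if (value.contains("litespeed")) return "litespeed";
    if (value.contains("jetty"))     return "jetty";
    // ...remaining supported types elided...
    return "misc_server";                                  // unsupported -> misc bucket
}
```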
Coercing URL Paths
• URL paths can contain regularly randomized data
  • Dates
  • GUIDs
  • Integers
• Replace URL paths with default strings when (see the sketch below)
  • Length exceeds 16
  • Contents all integers
  • Contents majority integers
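A sketch of those replacement rules applied per path segment. The placeholder tokens and the GUID pattern are assumptions; the three numbered conditions come straight from the slide.

```java
import java.util.regex.Pattern;

class SegmentCoercion {
    // Standard 8-4-4-4-12 hex GUID shape (assumed; the talk doesn't give the regex).
    private static final Pattern GUID = Pattern.compile(
            "^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$");

    static String coerceSegment(String segment) {
        if (GUID.matcher(segment).matches()) return "<guid>";
        long digits = segment.chars().filter(Character::isDigit).count();
        if (digits > 0 && digits == segment.length()) return "<int>";        // all integers
        if (segment.length() > 16)                    return "<long>";       // length exceeds 16
        if (digits * 2 > segment.length())            return "<mostly_int>"; // majority integers
        return segment;                                                      // left untouched
    }
}
```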
Mapping Result Key
The mapping process results in strings containing the coerced server header and URL path:
1. Record type
2. Server type
3. URL path segment

< 02_';)_apache_generic_';)_AthenaCarey >
(1 = 02, 2 = apache_generic, 3 = AthenaCarey)
Mapping Example

GET /foo/bar/baz.html?asd=123 HTTP/1.1
Host: www.woot.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0
Accept: text/html
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Server: Apache/2.4.9 (Unix)
Connection: close
Upgrade-Insecure-Requests: 1

/foo/bar/baz.html on Apache (Unix) maps to:
< 02_';)_apache_unix_';)_/foo/ >, 1
< 02_';)_apache_unix_';)_/bar/ >, 1
< 02_';)_apache_unix_';)_baz.html >, 1
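A hedged sketch of the map step that produces those keys, splitting the request path into segments and pairing each with the coerced server type. The `02` record-type prefix and the `_';)_` separator mirror the key format above; the class and helper names are made up.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentMapper {
    private static final String SEP = "_';)_";

    // For "/foo/bar/baz.html" on apache_unix this yields the three keys shown above.
    static List<String> mapKeys(String serverType, String rawPath) {
        String path = rawPath.split("\\?", 2)[0];           // drop the query string
        boolean endsWithSlash = path.endsWith("/");
        String[] segments = path.split("/");
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < segments.length; i++) {
            if (segments[i].isEmpty()) continue;
            // Directory segments keep their slashes; a trailing file name does not.
            boolean isDir = endsWithSlash || i < segments.length - 1;
            String segment = isDir ? "/" + segments[i] + "/" : segments[i];
            keys.add("02" + SEP + serverType + SEP + segment);
        }
        return keys;                                        // each emitted with count 1
    }
}
```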
Running in EMR
• Swap out the fileInputPath and fileOutputPath values in HadoopRunner.java
• Compile using ant (not Eclipse, unless you really like tearing your hair out)
• Upload the Hadoop JAR file to AWS S3
• Create an EMR cluster
• Add a “step” to the EMR cluster referencing the JAR file in AWS S3
Resulting Data
• Processing took about two days using five medium-powered EC2 instances as task nodes
• 93,914,151 results (mapped string combined with # of occurrences)
• ~3.6GB across 14 files
• Still fairly raw data – we need to process it for it to be useful
Parsing the Results
• We effectively have tuples of server types, URL path segments, and the number of occurrences for each server type and segment pair
• Must process the results and order by most common path segments
• Parsing code can be found here:
https://github.com/lavalamp-/lava-hadoop-processing
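The actual parsing code lives in the repository above; as a hedged illustration of the idea, here is a minimal sorter that assumes the standard Hadoop text-output format of key<TAB>count per line and the key layout from the mapping slides.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class ResultSorter {
    public static void main(String[] args) throws IOException {
        Pattern sep = Pattern.compile(Pattern.quote("_';)_"));
        // server type -> (URL path segment -> total occurrences)
        Map<String, Map<String, Long>> byServer = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] kv = line.split("\t");
                if (kv.length != 2) continue;               // skip malformed rows
                String[] parts = sep.split(kv[0]);
                if (parts.length != 3) continue;            // expect type/server/segment
                byServer.computeIfAbsent(parts[1], k -> new HashMap<>())
                        .merge(parts[2], Long.parseLong(kv[1].trim()), Long::sum);
            }
        }
        // Print each server type's segments, most common first.
        byServer.forEach((server, counts) -> {
            System.out.println("== " + server + " ==");
            counts.entrySet().stream()
                  .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                  .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
        });
    }
}
```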
Data Mining Results
URL Segment Counts
[Bar chart: “# of URL Segments by Server Type” – number of discovered URL segments per server type on a log scale (500,000 to 500,000,000), ordered by count from Gunicorn (fewest) through Thin, Openresty, Zope, Lotus Domino, Sun Web Server, Apache (Windows), Jetty, PWS, Lighttpd, IBM HTTP Server, Resin, Oracle Application Server, Litespeed, Miscellaneous, IIS, and Nginx to Apache (Unix) and Apache (Generic) (most)]
Coverage by # of Requests
(# of requests from the hit list required to reach each coverage level, per server type)

Server Type                 50%   75%   90%   95%    99%   99.7%   99.9%
Apache (Generic)             58   217   475   611    749     776     784
Apache (Unix)                53   189   395   502    604     624     629
Apache (Windows)             14    41    78    97    117     121     122
Gunicorn                      2     4     5     6      6       6       6
IBM HTTP Server               6    15    21    24     26      26      26
IIS                         103   330   610   738    859     882     889
Jetty                         4    10    15    17     19      19      19
Lighttpd                     20    76   178   240    306     320     324
Litespeed                    16    43    73    90    109     113     114
Lotus Domino                  3     5     6     7      7       7       7
Miscellaneous                93   329   687   907   1147    1196    1210
Nginx                        87   341   760  1005   1284    1343    1360
Openresty                     7    31    97   159    271     306     318
Oracle Application Server     1     4     6     6      7       7       7
PWS                           6    15    22    25     28      29      29
Resin                         1     5     9    10     12      12      12
Sun Web Server                6    11    14    16     17      17      17
Thin                          3     6    10    11     12      13      13
Zope                         12    25    37    42     47      47      48
Most Common URL Segments

Apache (Unix):     index.php, /forum/, /forums/, /news/, viewtopic.php, showthread.php, /tag/, /index.php/, newreply.php, /cgi-bin/
Apache (Windows):  index.php, index.cfm, /uhtbin/, /cgisirsi.exe/, /NCLD/, /catalog/, modules.php, /events/, /forum/, /item/
Apache (Generic):  /news/, index.php, /wiki/, /forums/, /forum/, /tag/, /search/, showthread.php, viewtopic.php, /en/
IIS:               /article/, /news/, /page/, /id/, default.aspx, /products/, /NEWS/, /en/, /apps/, /search/
Nginx:             /tag/, /news/, /forums/, /forum/, index.php, /tags/, showthread.php, /page/, /category/, /articles/
Comparison w/ Other Sources

Wordlist                    Entries    Improvement
FuzzDB (all)                850,425    +99.8%
FuzzDB (web & app server)     7,234    +81.2%
Dirs3arch                     5,992    +77.3%
Dirbuster                   105,847    +98.7%
Burp Suite                  424,203    +99.7%

91.34% average improvement upon existing technologies*
*no other approaches provide coverage guarantees
Caveats
• Common Crawl respects (I believe) robots.txt
• Certainly has a number of blind spots
• Results omit highly repetitive URL segments (integers, GUIDs)
• Crawling likely misses plenty of JavaScript-based URLs
• Lots of juicy files are never linked, and are therefore missed by Common Crawl
Getchu Some Data
Resulting hit list files can be found in the following repository:
https://goo.gl/lxdPDm
Big(ish) Data Sources
Scans.io
• Public archive of research data collected through active scans of the Internet
• Lots of references to other projects containing data about
  • DNS
  • Port scans
  • Web crawls
  • SSL certificates
https://scans.io/
ARIN
• American Registry for Internet Numbers
• WHOIS records for a significant amount of the IPv4 address space
• Other regional registries have similar services
  • ARIN
  • AFRINIC
  • APNIC
  • LACNIC
  • RIPE NCC
https://www.arin.net/
ZMap
• Awesome open source tools for performing Internet-scale data collection
  • ZMap – network scans
  • ZGrab – banner grabbing & network service interaction
  • ZDNS – DNS lookups
https://zmap.io/
Google BigQuery
• Use SQL syntax to search all sorts of huge datasets
• One public dataset contains all public GitHub data…
https://cloud.google.com/bigquery/
Google BigQuery Tastiness

SELECT count(*)
FROM [bigquery-public-data:github_repos.files] AS BQFILES
WHERE BQFILES.path LIKE '%server.pem'
   OR BQFILES.path LIKE '%id_rsa'
   OR BQFILES.path LIKE '%id_dsa';
-- 13,706

SELECT count(*)
FROM [bigquery-public-data:github_repos.files] AS BQFILES
WHERE BQFILES.path LIKE '%.aws/credentials';
-- 42

SELECT count(*)
FROM [bigquery-public-data:github_repos.files] AS BQFILES
WHERE BQFILES.path LIKE '%.keystore';
-- 14,558

SELECT count(*)
FROM [bigquery-public-data:github_repos.files] AS BQFILES
WHERE BQFILES.path LIKE '%robots.txt';
-- 197,694
Conclusion
Recap
• MapReduce
• Hadoop
• Amazon Elastic MapReduce
• Common Crawl
• Shoehorning problem sets into MapReduce
• Benefits from using big data
• Additional data sources
Future Work
• Hone content discovery based on already-found URL paths
• Generate content discovery hit lists for specific user agents (mobile vs. desktop)
• Hone network service scanning based on already-found service ports
References
• Common Crawl Hadoop Project – https://github.com/lavalamp-/LavaHadoopCrawlAnalysis
• Common Crawl Results Processing Project – https://github.com/lavalamp-/lava-hadoop-processing
• Content Discovery Hit Lists – https://github.com/lavalamp-/content-discovery-hit-lists
• Lavalamp’s Blog – https://l.avala.mp/
THANK YOU!
@_lavalamp
chris [AT] websight [DOT] io
https://github.com/lavalamp-
https://l.avala.mp
