Cloudstone
Sharpening Your Weapons Through Big Data
Christopher Grayson
@_lavalamp
Introduction
WHOAMI
• ATL
• Web development
• Academic researcher
• Haxin’ all the things
• (but I rlllly like networks)
• Founder
• Red team
@_lavalamp
• Common Crawl
• MapReduce
• Hadoop
• Amazon Elastic MapReduce (EMR)
• Mining Common Crawl using Hadoop
on EMR
• Other “big” data sources
WHAT’S DIS
• Academic research =/= industry
research
• Tactics can (and should!) be cross-applied
• Lots of power in big data, only
problem is how to extract it
• Largely untapped resource
• Content discovery (largely) sucks
WHY’S DIS
1. Background
2. Common Crawl
3. MapReduce & Hadoop
4. Elastic MapReduce
5. Mining Common Crawl
6. Data Mining Results
7. Big(ish) Data Sources
8. Conclusion
Agenda
Background
• DARPA CINDER program
• Continual authentication through
side channel data mining
• Penetration testing
• Web Sight
My Background
• Penetration testing scopes are
rarely adequate
• Faster, more accurate tools ==
better engagements
• It’s 2017 – application layer often
comprises the majority of attack
surface
• Expedite discovery of application-layer
attack surface
Time == $$$
• Many web applications map disk
contents to URLs
• Un-linked resources are commonly
less secure
• Older versions
• Debugging tools
• Backups with wrong extensions
• Find via brute force
• Current tools are quite lacking
Web App Content Discovery
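The brute-force step above boils down to joining a list of likely path segments onto a target's base URL. A minimal sketch, assuming a hypothetical helper name (`candidate_urls`) and an example host; it only builds the candidate list and sends no requests:

```python
from typing import Iterable, List


def candidate_urls(base_url: str, segments: Iterable[str]) -> List[str]:
    """Join each hit-list segment onto base_url, keeping first-seen order."""
    base = base_url.rstrip("/")
    seen = set()
    urls = []
    for segment in segments:
        # Segments may or may not carry a leading slash ("/forum/" vs "index.php").
        url = base + "/" + segment.lstrip("/")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    # www.example.com is a placeholder host, not a target from the talk.
    for url in candidate_urls("http://www.example.com/", ["index.php", "/forum/", "/cgi-bin/"]):
        print(url)
```

The quality of the result depends entirely on the segment list, which is what the rest of the talk is about.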
Common Crawl
• California-based 501(c)(3) nonprofit
organization
• Performing full web crawls on a
regular basis using different user
agents since 2008
• Data stored in Amazon S3
• A single crawl contains many
terabytes of data
• Full crawl metadata can exceed 10TB
What is Common Crawl?
http://commoncrawl.org/
• Crawl data is published in three file
formats (WARC is an open ISO standard;
WAT and WET are derived from it)
• WARC (Web ARChive) – raw crawl data
• WAT – HTTP request and response
metadata
• WET – plain text extracted from HTTP
responses
• WAT files likely contain the juicy bits
you’re interested in
• Use existing libraries for parsing file
contents
CC Data Format
• Data is stored in Amazon S3 (Hadoop on
EMR can read it much like HDFS)
http://commoncrawl.org/the-data/get-started/
• Can use the usual AWS S3
command line tools for debugging
• Newer crawls contain files listing
WAT and WET paths
CC S3 Storage
• When running Hadoop jobs, an S3 path
is supplied to identify all files
to process
• Pulling down single files and
checking them out helps with
debugging code
• Use AWS S3 command line tool to
interact with CC data
Accessing Common Crawl Data in S3
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2017-17/wat.paths.gz .
aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2017-17/
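Once a listing such as wat.paths.gz has been copied down with the commands above, its entries can be turned into full S3 URIs for debugging single files. A minimal sketch, assuming the listing holds one object key per line relative to the bucket; the function name and the sample keys are hypothetical:

```python
import gzip
import io
from typing import List

BUCKET = "commoncrawl"  # bucket name taken from the aws commands above


def s3_uris_from_listing(gz_bytes: bytes, bucket: str = BUCKET) -> List[str]:
    """Turn a gzipped listing of object keys (one per line) into s3:// URIs."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as handle:
        return ["s3://{}/{}".format(bucket, line.strip()) for line in handle if line.strip()]


if __name__ == "__main__":
    # Made-up keys standing in for the real contents of a paths file.
    sample = gzip.compress(b"crawl-data/EXAMPLE/one.warc.wat.gz\ncrawl-data/EXAMPLE/two.warc.wat.gz\n")
    for uri in s3_uris_from_listing(sample):
        print(uri)
```

Each resulting URI can then be passed to `aws s3 cp` to pull down one file and inspect it locally.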
MapReduce & Hadoop
• Programming model for processing
large amounts of data
• Processing done in two phases:
• Map – take input data and extract
what you care about (key-value pairs)
• Reduce – apply a simple aggregation
function across the mapped data
(count, sum, etc)
• Easy concept, quirky to get what
you need out of it
What is MapReduce?
https://en.wikipedia.org/wiki/MapReduce
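The two phases can be illustrated in a few lines without any cluster. This is an in-memory sketch of the programming model only, not of Hadoop itself; the function names and the word-count mapper are illustrative choices:

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, int]


def map_phase(records: Iterable[str], mapper: Callable[[str], Iterable[Pair]]) -> Iterator[Pair]:
    """Map: turn every input record into zero or more key-value pairs."""
    for record in records:
        yield from mapper(record)


def reduce_phase(pairs: Iterable[Pair], reducer: Callable[[List[int]], int]) -> Dict[str, int]:
    """Reduce: group values by key, then aggregate each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reducer(values) for key, values in grouped.items()}


def word_mapper(line: str) -> Iterator[Pair]:
    """Emit (word, 1) for every whitespace-separated word."""
    for word in line.split():
        yield word.lower(), 1


if __name__ == "__main__":
    lines = ["map then reduce", "Reduce then profit"]
    print(reduce_phase(map_phase(lines, word_mapper), sum))
```

A real framework adds the parts omitted here: splitting input across machines, shuffling pairs so each key lands on one reducer, and writing results to distributed storage.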
• Apache Hadoop
• De-facto standard open source
implementation of MapReduce
• Written in Java
• Has an interface to process data in
other languages, but writing code in
Java comes with perks
How ‘bout Hadoop?
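The non-Java interface mentioned above is presumably Hadoop Streaming, which runs any executable as a mapper or reducer. The mapper reads lines on stdin and writes tab-separated key-value lines; the reducer receives those lines sorted by key. A minimal sketch of that contract, with hypothetical function names and a token count as the example job:

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def run_mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'token<TAB>1' for each whitespace-separated token."""
    for line in lines:
        for token in line.split():
            yield "{}\t1".format(token)


def run_reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the values of consecutive lines that share a key (input must be sorted)."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield "{}\t{}".format(key, sum(int(value) for _, value in group))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Invoked as "script.py map" or "script.py reduce" (an assumed convention).
    stage = run_mapper if sys.argv[1] == "map" else run_reducer
    for out in stage(sys.stdin):
        print(out)
```

The talk's own project is written in Java, so this only shows what the alternative looks like.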
• Use the Hadoop library for the
version you’ll be deploying against
• Extend the Configured class and
implement the Tool interface
• Implement mapper and reducer
classes
• Configure data types and
input/output paths
• ???
• Profit
Writing Hadoop Code
• MapReduce supports the map ->
reduce paradigm
• This is a fairly constrictive paradigm
• Have to be creative to determine
what to do during both the map
and reduce phases to extract and
aggregate the data you care about
Shoehorning into Hadoop
Elastic MapReduce
• EMR
• Amazon’s cloud service for running
Hadoop jobs
• Usage of all the standard AWS tools
• Set up a cloud of EC2 instances to
process your data
• Free access to data stored in S3
Elastic MapReduce?!
• Choose how much you want to pay
for EC2 instances
• EMR allows you to use spot pricing
for your instances
• Must have one or two master nodes
alive at all times (no spot pricing)
• Choose the right spot price and
your total cost for processing all of
Common Crawl can be <$100.00
Spot Pricing!!!
Mining Common Crawl
• We want to find the most common
URL paths for every server type
• We have access to HTTP request
and response headers
• We must find a way to map our
requirements into the map and
reduce phases
• Map – Collect/generate the data we
care about, fit into key-value pairs
• Reduce – Apply a mathematical
aggregation across the collected data
Here Comes the Shoehorn
MAP
• Create unique strings that contain
(1) a reference to the type of server
and (2) the URL path segment, for
every URL path segment in every URL
found within the CC HTTP responses
REDUCE
• Count the number of instances of
each unique string
My Solution
• Working with big data requires
coercion of input data to expected
values
• Aggregating on random data ==
huge output files
• For processing CC data, I had to
coerce the following values to avoid
massive result files
• Server headers
• GUIDs in URL paths
• Integers in URL paths
Mapping URL Paths
• People put wonky stuff in server
headers
• Reviewed the contents of a few
WAT files and retrieved all server
headers
• Chose a list of server types to
support
• Coerce header values into list of
supported server types
• Not supported -> misc_server
• No server header -> null_server
Coercing Server Headers
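The coercion described above can be sketched as a small lookup. The labels `apache_unix`, `apache_generic`, `misc_server` and `null_server` appear in the slides; every other label and all of the matching rules below are assumptions made for illustration, not the list the talk actually used:

```python
from typing import Optional

# Hypothetical substring -> label rules for non-Apache servers.
SIMPLE_RULES = [
    ("nginx", "nginx"),
    ("microsoft-iis", "iis"),
    ("lighttpd", "lighttpd"),
    ("litespeed", "litespeed"),
]


def coerce_server_header(header: Optional[str]) -> str:
    """Map a raw Server header value onto a fixed set of server types."""
    if header is None or not header.strip():
        return "null_server"  # no Server header at all
    value = header.lower()
    if "apache" in value:
        if "unix" in value:
            return "apache_unix"
        if "win" in value:
            return "apache_windows"  # assumed label
        return "apache_generic"
    for needle, label in SIMPLE_RULES:
        if needle in value:
            return label
    return "misc_server"  # present but not in the supported list


if __name__ == "__main__":
    for raw in ["Apache/2.4.9 (Unix)", "nginx/1.10.0", "totally custom", None]:
        print(repr(raw), "->", coerce_server_header(raw))
```

Keeping the set of outputs small is the point: every distinct server string would otherwise become its own key in the reduce phase.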
• URL paths regularly contain
randomized data
• Dates
• GUIDs
• Integers
• Replace URL path segments with
default strings when
• Length exceeds 16 characters
• Contents are all integers
• Contents are majority integers
Coercing URL Paths
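The three replacement rules can be sketched as one function. The threshold of 16 comes from the slide; the placeholder strings, the order in which the rules are checked, and the choice to measure a segment without its surrounding slashes are assumptions:

```python
MAX_SEGMENT_LENGTH = 16  # threshold from the slide


def coerce_segment(segment: str) -> str:
    """Replace high-entropy URL path segments with fixed placeholder strings."""
    body = segment.strip("/")
    if not body:
        return segment
    digits = sum(ch.isdigit() for ch in body)
    if digits == len(body):
        return "<all_int>"      # hypothetical placeholder
    if len(body) > MAX_SEGMENT_LENGTH:
        return "<too_long>"     # hypothetical placeholder
    if digits * 2 > len(body):
        return "<mostly_int>"   # hypothetical placeholder
    return segment


if __name__ == "__main__":
    for seg in ["/forum/", "/2017/", "/ab123/", "/3f2504e04f8911d39a0c0305e82c3301/"]:
        print(seg, "->", coerce_segment(seg))
```

Dates and GUIDs fall out of the same rules: a date segment is mostly digits, and a GUID is longer than 16 characters.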
Mapping process results in strings containing coerced
server header and URL path
1. Record type
2. Server type
3. URL path segment
Mapping Result Key
< 02_';)_apache_generic_';)_ AthenaCarey >
1 2 3
Mapping Example
GET /foo/bar/baz.html?asd=123 HTTP/1.1
Host: www.woot.com
User-Agent: Mozilla/5.0 (Macintosh; Intel
Mac OS X 10.12; rv:53.0) Gecko/20100101
Firefox/53.0
Accept: text/html
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Server: Apache/2.4.9 (Unix)
Connection: close
Upgrade-Insecure-Requests: 1
/foo/bar/baz.html on Apache (Unix)
< 02_';)_apache_unix_';)_/foo/>, 1
< 02_';)_apache_unix_';)_/bar/>, 1
< 02_';)_apache_unix_';)_baz.html>, 1
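The key construction in this example can be reproduced with a short sketch. The delimiter and the record type "02" are copied from the keys shown above; the function names are hypothetical, and the talk's actual mapper is Java code in the LavaHadoopCrawlAnalysis repository:

```python
from typing import List
from urllib.parse import urlsplit

DELIMITER = "_';)_"   # separator seen in the example keys
RECORD_TYPE = "02"    # record type seen in the example keys


def path_segments(url_path: str) -> List[str]:
    """Split a URL path into directory segments ('/foo/') and a trailing file name."""
    path = urlsplit(url_path).path  # drops the query string
    parts = [part for part in path.split("/") if part]
    if not parts:
        return []
    segments = ["/{}/".format(part) for part in parts[:-1]]
    last = parts[-1]
    segments.append("/{}/".format(last) if path.endswith("/") else last)
    return segments


def mapping_keys(server_type: str, url_path: str) -> List[str]:
    """Build one map-phase key per path segment; each would be emitted with a count of 1."""
    return [DELIMITER.join((RECORD_TYPE, server_type, seg)) for seg in path_segments(url_path)]


if __name__ == "__main__":
    for key in mapping_keys("apache_unix", "/foo/bar/baz.html?asd=123"):
        print("<{}>, 1".format(key))
```

An unusual delimiter matters here because URL segments can contain almost any ordinary separator character.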
• Swap out the fileInputPath
and fileOutputPath values in
HadoopRunner.java
• Compile using ant (not Eclipse,
unless you really like tearing your
hair out)
• Upload Hadoop JAR file to AWS S3
• Create EMR cluster
• Add a “step” to EMR cluster
referencing the JAR file in AWS S3
Running in EMR
• Processing took about two days
using five medium-powered EC2
instances as task nodes
• 93,914,151 results (mapped string
combined with # of occurrences)
• ~3.6GB across 14 files
• Still fairly raw data – we need to
process it for it to be useful
Resulting Data
• We effectively have tuples of server
types, URL path segments, and the
number of occurrences for each
server type and segment pair
• Must process the results and order
by most common path segments
• Parsing code can be found here:
Parsing the Results
https://github.com/lavalamp-/lava-hadoop-processing
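The post-processing step can be sketched as follows. This is not the code from the repository above; it assumes each output line is a mapped key and a count separated by a tab, and the function names are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

DELIMITER = "_';)_"  # separator used inside the mapped keys


def parse_line(line: str) -> Tuple[str, str, int]:
    """Split 'key<TAB>count' into (server_type, segment, count)."""
    key, count = line.rstrip("\n").rsplit("\t", 1)
    _record_type, server_type, segment = key.split(DELIMITER, 2)
    return server_type, segment, int(count)


def rank_segments(lines: Iterable[str]) -> Dict[str, List[Tuple[str, int]]]:
    """Group segments by server type and order each group by descending count."""
    by_server = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        server_type, segment, count = parse_line(line)
        by_server[server_type].append((segment, count))
    return {
        server: sorted(pairs, key=lambda pair: pair[1], reverse=True)
        for server, pairs in by_server.items()
    }


if __name__ == "__main__":
    sample = ["02_';)_apache_unix_';)_/foo/\t3", "02_';)_apache_unix_';)_index.php\t10"]
    print(rank_segments(sample))
```

With roughly 94 million result lines, a real run would stream the files rather than hold everything in memory; the sketch ignores that.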
Data Mining Results
URL Segment Counts
[Chart: # of URL Segments by Server Type. Axes: # of Discovered URL Segments (ticks at 500,000; 5,000,000; 50,000,000; 500,000,000) against Server Type. Server types shown: Gunicorn, Thin, Openresty, Zope, Lotus Domino, Sun Web Server, Apache (Windows), Jetty, PWS, Lighttpd, IBM HTTP Server, Resin, Oracle Application Server, Litespeed, Miscellaneous, IIS, Nginx, Apache (Unix), Apache (Generic).]
Coverage
Server Type                  50%   75%   90%   95%   99%  99.70%  99.90%
Apache (Generic)              58   217   475   611   749     776     784
Apache (Unix)                 53   189   395   502   604     624     629
Apache (Windows)              14    41    78    97   117     121     122
Gunicorn                       2     4     5     6     6       6       6
IBM HTTP Server                6    15    21    24    26      26      26
IIS                          103   330   610   738   859     882     889
Jetty                          4    10    15    17    19      19      19
Lighttpd                      20    76   178   240   306     320     324
Litespeed                     16    43    73    90   109     113     114
Lotus Domino                   3     5     6     7     7       7       7
Miscellaneous                 93   329   687   907  1147    1196    1210
Nginx                         87   341   760  1005  1284    1343    1360
Openresty                      7    31    97   159   271     306     318
Oracle Application Server      1     4     6     6     7       7       7
PWS                            6    15    22    25    28      29      29
Resin                          1     5     9    10    12      12      12
Sun Web Server                 6    11    14    16    17      17      17
Thin                           3     6    10    11    12      13      13
Zope                          12    25    37    42    47      47      48
Coverage by # of Requests
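One way to read the table: for each server type, sort segments by how often they were seen, then count how many of the top segments are needed before their combined occurrences reach a given share of the total. The sketch below computes that under this assumed definition of coverage, which the slides do not spell out; the function name and sample counts are made up:

```python
from typing import Iterable, List


def requests_for_coverage(counts: Iterable[int], targets: Iterable[float]) -> List[int]:
    """For each target fraction, return how many top segments reach that share of occurrences."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    results = []
    for target in targets:
        running = 0
        needed = 0
        for count in ordered:
            if running >= target * total:
                break
            running += count
            needed += 1
        results.append(needed)
    return results


if __name__ == "__main__":
    # Made-up occurrence counts for four segments on one server type.
    print(requests_for_coverage([50, 30, 15, 5], [0.5, 0.75, 0.9, 0.99]))
```

Under this reading, each cell in the table is the length of the hit list needed for that server type at that coverage level.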
Most Common URL Segments
Apache (Unix): index.php, /forum/, /forums/, /news/, viewtopic.php, showthread.php, /tag/, /index.php/, newreply.php, /cgi-bin/
Apache (Windows): index.php, index.cfm, /uhtbin/, /cgisirsi.exe/, /NCLD/, /catalog/, modules.php, /events/, /forum/, /item/
Apache (Generic): /news/, index.php, /wiki/, /forums/, /forum/, /tag/, /search/, showthread.php, viewtopic.php, /en/
IIS: /article/, /news/, /page/, /id/, default.aspx, /products/, /NEWS/, /en/, /apps/, /search/
Nginx: /tag/, /news/, /forums/, /forum/, index.php, /tags/, showthread.php, /page/, /category/, /articles/
Comparison w/ Other Sources
FuzzDB (all)                850,425   +99.8%
FuzzDB (web & app server)     7,234   +81.2%
Dirs3arch                     5,992   +77.3%
Dirbuster                   105,847   +98.7%
Burp Suite                  424,203   +99.7%
Average improvement upon existing technologies: 91.34%
*no other approaches provide coverage guarantees
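The percentages in this comparison are consistent with measuring each word list against the largest hit list in the coverage table (Nginx at 99.90%, 1,360 requests). That baseline is an inference from the numbers and is not stated on the slide. A sketch of the arithmetic, with hypothetical names:

```python
HIT_LIST_SIZE = 1360  # inferred baseline: Nginx at 99.90% coverage

# Sizes as listed in the comparison above.
OTHER_LISTS = {
    "FuzzDB (all)": 850425,
    "FuzzDB (web & app server)": 7234,
    "Dirs3arch": 5992,
    "Dirbuster": 105847,
    "Burp Suite": 424203,
}


def reduction_percent(hit_list_size: int, other_size: int) -> float:
    """Percentage of requests saved by using the hit list instead of the other list."""
    return round(100.0 * (1.0 - hit_list_size / other_size), 1)


if __name__ == "__main__":
    reductions = {name: reduction_percent(HIT_LIST_SIZE, size) for name, size in OTHER_LISTS.items()}
    for name, value in reductions.items():
        print("{}: +{}%".format(name, value))
    print("average:", round(sum(reductions.values()) / len(reductions), 2))
```

Averaging the five rounded percentages gives the 91.34% figure quoted above.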
• Common Crawl respects (I believe)
robots.txt
• Certainly has a number of blind
spots
• Results omit highly repetitive URL
segments (integers, GUIDs)
• Crawling likely misses plenty of
JavaScript-based URLs
• Lots of juicy files are never linked,
therefore missed by Common Crawl
Caveats
Resulting hit list files can be found in the following repository:
https://goo.gl/lxdPDm
Getchu Some Data
Big(ish) Data Sources
• Public archive of research data
collected through active scans of
the Internet
• Lots of references to other projects
containing data about
• DNS
• Port scans
• Web crawls
• SSL certificates
Scans.io
https://scans.io/
• American Registry for Internet
Numbers
• WHOIS records for a significant
amount of the IPv4 address space
• Other regional registries have
similar services
• ARIN
• AFRINIC
• APNIC
• LACNIC
• RIPE NCC
ARIN
https://www.arin.net/
• Awesome open source tools for
performing Internet-scale data
collection
• ZMap – network scans
• ZGrab – banner grabbing & network
service interaction
• ZDNS – DNS lookups
ZMap
https://zmap.io/
• Use SQL syntax to search all sorts of
huge datasets
• One public dataset contains all
public GitHub data…
Google BigQuery
https://cloud.google.com/bigquery/
Google BigQuery Tastiness
-- BigQuery legacy SQL (bracketed table names); results as shown on the slide
SELECT COUNT(*)
FROM [bigquery-public-data:github_repos.files] AS BQFILES
WHERE BQFILES.path LIKE '%server.pem'
   OR BQFILES.path LIKE '%id_rsa'
   OR BQFILES.path LIKE '%id_dsa';
-- Result: 13,706
SELECT COUNT(*)
FROM [bigquery-public-data:github_repos.files] AS BQFILES
WHERE BQFILES.path LIKE '%.aws/credentials';
-- Result: 42
SELECT COUNT(*)
FROM [bigquery-public-data:github_repos.files] AS BQFILES
WHERE BQFILES.path LIKE '%.keystore';
-- Result: 14,558
SELECT COUNT(*)
FROM [bigquery-public-data:github_repos.files] AS BQFILES
WHERE BQFILES.path LIKE '%robots.txt';
-- Result: 197,694
Conclusion
• MapReduce
• Hadoop
• Amazon Elastic MapReduce
• Common Crawl
• Shoehorning problem sets into
MapReduce
• Benefits from using big data
• Additional data sources
Recap
• Hone content discovery based on
already-found URL paths
• Generate content discovery hit lists
for specific user agents (mobile vs.
desktop)
• Hone network service scanning
based on already-found service
ports
Future Work
• Common Crawl Hadoop Project
https://github.com/lavalamp-/LavaHadoopCrawlAnalysis
• Common Crawl Results Processing Project
https://github.com/lavalamp-/lava-hadoop-processing
• Content Discovery Hit Lists
https://github.com/lavalamp-/content-discovery-hit-lists
• Lavalamp’s Blog
https://l.avala.mp/
References
THANK YOU!
@_lavalamp
chris [AT] websight [DOT] io
https://github.com/lavalamp-
https://l.avala.mp

 
Russian Call Girls in Kolkata Samaira 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Samaira 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls in Kolkata Samaira 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Samaira 🤌 8250192130 🚀 Vip Call Girls Kolkataanamikaraghav4
 

Recently uploaded (20)

Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine ServiceHot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
 
Russian Call Girls in Kolkata Ishita 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Ishita 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls in Kolkata Ishita 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Ishita 🤌 8250192130 🚀 Vip Call Girls Kolkata
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
 
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Chennai Call Girls Alwarpet Phone 🍆 8250192130 👅 celebrity escorts service
Chennai Call Girls Alwarpet Phone 🍆 8250192130 👅 celebrity escorts serviceChennai Call Girls Alwarpet Phone 🍆 8250192130 👅 celebrity escorts service
Chennai Call Girls Alwarpet Phone 🍆 8250192130 👅 celebrity escorts service
 
Chennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts service
Chennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts serviceChennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts service
Chennai Call Girls Porur Phone 🍆 8250192130 👅 celebrity escorts service
 
Russian Call Girls Thane Swara 8617697112 Independent Escort Service Thane
Russian Call Girls Thane Swara 8617697112 Independent Escort Service ThaneRussian Call Girls Thane Swara 8617697112 Independent Escort Service Thane
Russian Call Girls Thane Swara 8617697112 Independent Escort Service Thane
 
AlbaniaDreamin24 - How to easily use an API with Flows
AlbaniaDreamin24 - How to easily use an API with FlowsAlbaniaDreamin24 - How to easily use an API with Flows
AlbaniaDreamin24 - How to easily use an API with Flows
 
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
Call Girls South Delhi Delhi reach out to us at ☎ 9711199012
 
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
 
FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607
FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607
FULL ENJOY Call Girls In Mayur Vihar Delhi Contact Us 8377087607
 
VIP Kolkata Call Girl Dum Dum 👉 8250192130 Available With Room
VIP Kolkata Call Girl Dum Dum 👉 8250192130  Available With RoomVIP Kolkata Call Girl Dum Dum 👉 8250192130  Available With Room
VIP Kolkata Call Girl Dum Dum 👉 8250192130 Available With Room
 
Russian Call girls in Dubai +971563133746 Dubai Call girls
Russian  Call girls in Dubai +971563133746 Dubai  Call girlsRussian  Call girls in Dubai +971563133746 Dubai  Call girls
Russian Call girls in Dubai +971563133746 Dubai Call girls
 
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
Packaging the Monolith - PHP Tek 2024 (Breaking it down one bite at a time)
 
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
 
Russian Call Girls in Kolkata Samaira 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Samaira 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls in Kolkata Samaira 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Samaira 🤌 8250192130 🚀 Vip Call Girls Kolkata
 

Cloudstone - Sharpening Your Weapons Through Big Data

  • 1. Cloudstone: Sharpening Your Weapons Through Big Data
      Christopher Grayson (@_lavalamp)
  • 3. WHOAMI
      • ATL
      • Web development
      • Academic researcher
      • Haxin’ all the things
      • (but I rlllly like networks)
      • Founder
      • Red team
      @_lavalamp
  • 4. WHAT’S DIS
      • Common Crawl
      • MapReduce
      • Hadoop
      • Amazon Elastic MapReduce (EMR)
      • Mining Common Crawl using Hadoop on EMR
      • Other “big” data sources
  • 5. WHY’S DIS
      • Academic research =/= industry research
      • Tactics can (and should!) be cross-applied
      • Lots of power in big data, only problem is how to extract it
      • Largely untapped resource
      • Content discovery (largely) sucks
  • 6. Agenda
      1. Background
      2. Common Crawl
      3. MapReduce & Hadoop
      4. Elastic MapReduce
      5. Mining Common Crawl
      6. Data Mining Results
      7. Big(ish) Data Sources
      8. Conclusion
  • 8. My Background
      • DARPA CINDER program
      • Continual authentication through side channel data mining
      • Penetration testing
      • Web Sight
  • 9. Time == $$$
      • Penetration testing scopes are rarely adequate
      • Faster, more accurate tools == better engagements
      • It’s 2017: the application layer often comprises the majority of the attack surface
      • Expedite discovery of application-layer attack surface
  • 10. Web App Content Discovery
      • Many web applications map disk contents to URLs
      • Un-linked resources are commonly less secure
          • Older versions
          • Debugging tools
          • Backups with wrong extensions
      • Find via brute force
      • Current tools are quite lacking
  • 12. What is Common Crawl?
      • California-based 501(c)(3) non-profit organization
      • Performing full web crawls on a regular basis using different user agents since 2008
      • Data stored in AWS HDFS
      • A single crawl contains many terabytes of data
      • Full crawl metadata can exceed 10TB
      http://commoncrawl.org/
  • 13. CC Data Format
      • Crawl data is stored in three web-archive data formats
          • WARC (Web ARChive): raw crawl data
          • WAT: HTTP request and response metadata
          • WET: plain-text HTTP responses
      • WAT files likely contain the juicy bits you’re interested in
      • Use existing libraries for parsing file contents
  • 14. CC HDFS Storage
      • Data is stored in AWS HDFS (S3): http://commoncrawl.org/the-data/get-started/
      • Can use the usual AWS S3 command line tools for debugging
      • Newer crawls contain files listing WAT and WET paths
  • 15. Accessing HDFS in AWS
      • When running Hadoop jobs, an HDFS path is supplied to identify all files to process
      • Pulling down single files and checking them out helps with debugging code
      • Use the AWS S3 command line tool to interact with CC data:
          aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2017-17/wat.paths.gz .
          aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2017-17/
  • 17. What is MapReduce?
      • Programming model for processing large amounts of data
      • Processing done in two phases:
          • Map: take input data and extract what you care about (key-value pairs)
          • Reduce: apply a simple aggregation function across the mapped data (count, sum, etc.)
      • Easy concept, quirky to get what you need out of it
      https://en.wikipedia.org/wiki/MapReduce
  • 18. How ’bout Hadoop?
      • Apache Hadoop
          • De facto standard open source implementation of MapReduce
          • Written in Java
          • Has an interface to process data in other languages, but writing code in Java comes with perks
  • 19. Writing Hadoop Code
      • Use the Hadoop library for the version you’ll be deploying against
      • Extend the Configured class and implement the Tool interface
      • Implement mapper and reducer classes
      • Configure data types and input/output paths
      • ???
      • Profit
  • 20. Shoehorning into Hadoop
      • MapReduce supports the map -> reduce paradigm
      • This is a fairly constrictive paradigm
      • Have to be creative to determine what to do during both the map and reduce phases to extract and aggregate the data you care about
  • 22. Elastic MapReduce?!
      • EMR
          • Amazon’s cloud service for running Hadoop jobs
          • Usage of all the standard AWS tools
          • Set up a cloud of EC2 instances to process your data
          • Free access to data stored in S3
  • 23. Spot Pricing!!!
      • Choose how much you want to pay for EC2 instances
      • EMR allows you to use spot pricing for your instances
      • Must have one or two master nodes alive at all times (no spot pricing)
      • Choose the right spot price and your total cost for processing all of Common Crawl can be <$100.00
  • 25. Here Comes the Shoehorn
      • We want to find the most common URL paths for every server type
      • We have access to HTTP request and response headers
      • We must find a way to map our requirements into the map and reduce phases
          • Map: collect/generate the data we care about, fit into key-value pairs
          • Reduce: apply a mathematical aggregation across the collected data
  • 26. My Solution
      MAP
      • Create unique strings that contain (1) a reference to the type of server and (2) the URL path segment, for every URL path segment in every URL found within the CC HTTP responses
      REDUCE
      • Count the number of instances of each unique string
  • 27. Mapping URL Paths
      • Working with big data requires coercion of input data to expected values
      • Aggregating on random data == huge output files
      • For processing CC data, I had to coerce the following values to avoid massive result files
          • Server headers
          • GUIDs in URL paths
          • Integers in URL paths
  • 28. Coercing Server Headers
      • People put wonky stuff in server headers
      • Reviewed the contents of a few WAT files and retrieved all server headers
      • Chose a list of server types to support
      • Coerce header values into list of supported server types
          • Not supported -> misc_server
          • No server header -> null_server
  • 29. Coercing URL Paths
      • URL paths can contain regularly randomized data
          • Dates
          • GUIDs
          • Integers
      • Replace URL paths with default strings when
          • Length exceeds 16
          • Contents all integers
          • Contents majority integers
  • 30. Mapping Result Key
      Mapping process results in strings containing the coerced server header and URL path:
      1. Record type
      2. Server type
      3. URL path segment
      Example key: < 02_';)_apache_generic_';)_ AthenaCarey >
      (02 is the record type, apache_generic the server type, and AthenaCarey the URL path segment)
  • 31. Mapping Example
      GET /foo/bar/baz.html?asd=123 HTTP/1.1
      Host: www.woot.com
      User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0
      Accept: text/html
      Accept-Language: en-US,en;q=0.5
      Accept-Encoding: gzip, deflate
      Server: Apache/2.4.9 (Unix)
      Connection: close
      Upgrade-Insecure-Requests: 1

      /foo/bar/baz.html on Apache (Unix) maps to:
      < 02_';)_apache_unix_';)_/foo/>, 1
      < 02_';)_apache_unix_';)_/bar/>, 1
      < 02_';)_apache_unix_';)_baz.html>, 1
  • 32. Running in EMR
      • Swap out the fileInputPath and fileOutputPath values in HadoopRunner.java
      • Compile using ant (not Eclipse, unless you really like tearing your hair out)
      • Upload Hadoop JAR file to AWS S3
      • Create EMR cluster
      • Add a “step” to EMR cluster referencing the JAR file in AWS S3
  • 33. Resulting Data
      • Processing took about two days using five medium-powered EC2 instances as task nodes
      • 93,914,151 results (mapped string combined with # of occurrences)
      • ~3.6GB across 14 files
      • Still fairly raw data: we need to process it for it to be useful
  • 34. Parsing the Results
      • We effectively have tuples of server types, URL path segments, and the number of occurrences for each server type and segment pair
      • Must process the results and order by most common path segments
      • Parsing code can be found here: https://github.com/lavalamp-/lava-hadoop-processing
  • 36. URL Segment Counts
      [Bar chart: # of URL Segments by Server Type. One axis shows # of Discovered URL Segments on a logarithmic scale (500,000 to 500,000,000); the other shows Server Type, listed as: Gunicorn, Thin, Openresty, Zope, Lotus Domino, Sun Web Server, Apache (Windows), Jetty, PWS, Lighttpd, IBM HTTP Server, Resin, Oracle Application Server, Litespeed, Miscellaneous, IIS, Nginx, Apache (Unix), Apache (Generic)]
  • 37. Coverage by # of Requests
      Number of requests needed per server type to reach each coverage level:

      Server Type                   50%    75%    90%    95%    99% 99.70% 99.90%
      Apache (Generic)               58    217    475    611    749    776    784
      Apache (Unix)                  53    189    395    502    604    624    629
      Apache (Windows)               14     41     78     97    117    121    122
      Gunicorn                        2      4      5      6      6      6      6
      IBM HTTP Server                 6     15     21     24     26     26     26
      IIS                           103    330    610    738    859    882    889
      Jetty                           4     10     15     17     19     19     19
      Lighttpd                       20     76    178    240    306    320    324
      Litespeed                      16     43     73     90    109    113    114
      Lotus Domino                    3      5      6      7      7      7      7
      Miscellaneous                  93    329    687    907   1147   1196   1210
      Nginx                          87    341    760   1005   1284   1343   1360
      Openresty                       7     31     97    159    271    306    318
      Oracle Application Server       1      4      6      6      7      7      7
      PWS                             6     15     22     25     28     29     29
      Resin                           1      5      9     10     12     12     12
      Sun Web Server                  6     11     14     16     17     17     17
      Thin                            3      6     10     11     12     13     13
      Zope                           12     25     37     42     47     47     48
  • 38. Most Common URL Segments
      • Apache (Unix): index.php, /forum/, /forums/, /news/, viewtopic.php, showthread.php, /tag/, /index.php/, newreply.php, /cgi-bin/
      • Apache (Windows): index.php, index.cfm, /uhtbin/, /cgisirsi.exe/, /NCLD/, /catalog/, modules.php, /events/, /forum/, /item/
      • Apache (Generic): /news/, index.php, /wiki/, /forums/, /forum/, /tag/, /search/, showthread.php, viewtopic.php, /en/
      • IIS: /article/, /news/, /page/, /id/, default.aspx, /products/, /NEWS/, /en/, /apps/, /search/
      • Nginx: /tag/, /news/, /forums/, /forum/, index.php, /tags/, showthread.php, /page/, /category/, /articles/
  • 39. Comparison w/ Other Sources
      • FuzzDB (all): 850,425, +99.8%
      • FuzzDB (web & app server): 7,234, +81.2%
      • Dirs3arch: 5,992, +77.3%
      • Dirbuster: 105,847, +98.7%
      • Burp Suite: 424,203, +99.7%
      91.34% average improvement upon existing technologies
      *no other approaches provide coverage guarantees
  • 40. Caveats
      • Common Crawl respects (I believe) robots.txt
      • Certainly has a number of blind spots
      • Results omit highly-repetitive URL segments (integers, GUIDs)
      • Crawling likely misses plenty of JavaScript-based URLs
      • Lots of juicy files are never linked, therefore missed by Common Crawl
  • 41. Getchu Some Data
      Resulting hit list files can be found in the following repository: https://goo.gl/lxdPDm
  • 43. Scans.io
      • Public archive of research data collected through active scans of the Internet
      • Lots of references to other projects containing data about
          • DNS
          • Port scans
          • Web crawls
          • SSL certificates
      https://scans.io/
  • 44. ARIN
      • American Registry for Internet Numbers
      • WHOIS records for a significant amount of the IPv4 address space
      • Other regional registries have similar services
          • ARIN
          • AFRINIC
          • APNIC
          • LACNIC
          • RIPE NCC
      https://www.arin.net/
  • 45. Zmap
      • Awesome open source tools for performing Internet-scale data collection
          • Zmap: network scans
          • Zgrab: banner grabbing & network service interaction
          • ZDNS: DNS lookups
      https://zmap.io/
  • 46. Google BigQuery
      • Use SQL syntax to search all sorts of huge datasets
      • One public dataset contains all public GitHub data…
      https://cloud.google.com/bigquery/
  • 47. Google BigQuery Tastiness
      SELECT count(*) FROM [bigquery-public-data:github_repos.files] as BQFILES WHERE BQFILES.path LIKE '%server.pem' OR BQFILES.path like "%id_rsa" OR BQFILES.path like "%id_dsa";
      Result: 13,706

      SELECT count(*) FROM [bigquery-public-data:github_repos.files] as BQFILES WHERE BQFILES.path LIKE '%.aws/credentials';
      Result: 42

      SELECT count(*) FROM [bigquery-public-data:github_repos.files] as BQFILES WHERE BQFILES.path LIKE '%.keystore';
      Result: 14,558

      SELECT count(*) FROM [bigquery-public-data:github_repos.files] as BQFILES WHERE BQFILES.path LIKE '%robots.txt';
      Result: 197,694
  • 49. Recap
      • MapReduce
      • Hadoop
      • Amazon Elastic MapReduce
      • Common Crawl
      • Shoehorning problem sets into MapReduce
      • Benefits from using big data
      • Additional data sources
  • 50. Future Work
      • Hone content discovery based on already-found URL paths
      • Generate content discovery hit lists for specific user agents (mobile vs. desktop)
      • Hone network service scanning based on already-found service ports
  • 51. References
      • Common Crawl Hadoop Project: https://github.com/lavalamp-/LavaHadoopCrawlAnalysis
      • Common Crawl Results Processing Project: https://github.com/lavalamp-/lava-hadoop-processing
      • Content Discovery Hit Lists: https://github.com/lavalamp-/content-discovery-hit-lists
      • Lavalamp’s Blog: https://l.avala.mp/
  • 52. THANK YOU!
      @_lavalamp
      chris [AT] websight [DOT] io
      https://github.com/lavalamp-
      https://l.avala.mp
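
The map and reduce steps described in slides 25 to 31 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the talk's actual Hadoop code (which is written in Java and linked in the references). The key layout (record type 02, the _';)_ separator, server type, URL path segment), the misc_server and null_server fallbacks, and the three coercion conditions come from the slides. Everything else is hypothetical: the function names, the placeholder strings (<int>, <long>, <mostly_int>), the short server-matching rules, the labels apache_windows, nginx and iis, and the order in which the coercion checks run.

```python
from collections import Counter
from urllib.parse import urlsplit

RECORD_TYPE = "02"    # record-type prefix shown in the slides' example keys
SEPARATOR = "_';)_"   # field separator shown in the slides' example keys


def coerce_server(header):
    """Collapse a raw Server header into a small set of server types.

    The matching rules below are assumptions; the slides only state that
    unsupported values become misc_server and missing ones null_server.
    """
    if not header:
        return "null_server"
    value = header.lower()
    if "apache" in value:
        if "unix" in value:
            return "apache_unix"
        if "win32" in value or "win64" in value:
            return "apache_windows"  # hypothetical label
        return "apache_generic"
    if "nginx" in value:
        return "nginx"  # hypothetical label
    if "iis" in value:
        return "iis"  # hypothetical label
    return "misc_server"


def coerce_segment(name):
    """Replace high-entropy path segments with placeholder strings.

    The placeholder names and the order of the checks are assumptions.
    """
    if name.isdigit():
        return "<int>"            # contents all integers
    if len(name) > 16:
        return "<long>"           # length exceeds 16
    digits = sum(ch.isdigit() for ch in name)
    if digits * 2 > len(name):
        return "<mostly_int>"     # contents majority integers
    return name


def map_record(url, server_header):
    """Map phase: emit (key, 1) for every path segment of one crawled URL."""
    server = coerce_server(server_header)
    *directories, last = urlsplit(url).path.split("/")
    for directory in directories:
        if directory:  # skip the empty string before the leading slash
            segment = "/" + coerce_segment(directory) + "/"
            yield SEPARATOR.join([RECORD_TYPE, server, segment]), 1
    if last:  # a trailing file name is emitted without surrounding slashes
        yield SEPARATOR.join([RECORD_TYPE, server, coerce_segment(last)]), 1


def reduce_counts(pairs):
    """Reduce phase: sum the counts emitted for each unique key."""
    counts = Counter()
    for key, count in pairs:
        counts[key] += count
    return counts


if __name__ == "__main__":
    records = [
        ("http://www.woot.com/foo/bar/baz.html?asd=123", "Apache/2.4.9 (Unix)"),
        ("http://www.woot.com/foo/12345/", "Apache/2.4.9 (Unix)"),
    ]
    pairs = [pair for url, server in records for pair in map_record(url, server)]
    for key, count in reduce_counts(pairs).most_common():
        print(key, count)
```

Running the sketch on the slide 31 URL yields the same three keys shown there, and the second record shows an all-integer directory collapsing into a single placeholder key instead of producing one key per distinct number, which is the point of the coercion step.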