Deep Web
Part II.B. Techniques and Tools:
Network Forensics
CSF: Forensics Cyber-Security
Fall 2015
Nuno Santos
Summary
}  The Surface Web
}  The Deep Web
Remember where we are
}  Our journey in this course:
}  Part I: Foundations of digital forensics
}  Part II: Techniques and tools
}  A. Computer forensics
}  B. Network forensics
}  C. Forensic data analysis
Current focus
Previously: Three key instruments in cybercrime
}  Anonymity systems: how criminals hide their IDs
}  Botnets: how to launch large-scale attacks
}  Digital currency: how to make untraceable payments
Today: One last key instrument – The Web itself
}  The Web gives access to services used for criminal activity
}  E.g., drug selling, weapon selling, etc.
}  It provides a huge source of information, used in:
}  Crime premeditation, privacy violations, identity theft, extortion, etc.
}  To find services and info, there are powerful search engines
}  Google, Bing, Shodan, etc.
The Web: powerful also for crime investigation
}  Powerful investigation tool about suspects
}  Find evidence in blogs, social networks, browsing activity, etc.
}  The playground where the crime itself is carried out
}  Illegal transactions, cyber stalking, blackmail, fraud, etc.
An eternal cat & mouse race (who’s who?)
}  The sophistication of offenses (and investigations) is driven
by the nature and complexity of the Web
The web is deep, very deep…
}  What’s “visible” through typical search engines is minimal
What can be found in the Deep Web?
}  The Deep Web is not necessarily bad: it’s just that the content is not directly indexed
}  The part of the deep web where criminal activity is carried out is named the Dark Web
Some examples of services in the Web “ocean”
Offenders operate at all layers
}  Investigators too!
Roadmap
}  The Surface Web
}  The Deep Web
The Surface Web
The Surface Web
}  The Surface Web is that portion of the World Wide Web
that is readily available to the general public and
searchable with standard web search engines
}  AKA Visible Web, Clearnet, Indexed Web, Indexable Web or Lightnet
}  As of June 14, 2015, Google's index of the surface web
contains about 14.5 billion pages
Surface Web characteristics
}  Distributed data
}  80 million web sites (hostnames responding) in April 2006
}  40 million active web sites (don’t redirect, …)
}  High volatility
}  Servers come and go …
}  Large volume
}  One study found 11.5 billion pages in January 2005 (at that
time Google indexed 8 billion pages)
Surface Web characteristics
}  Unstructured data
}  Lots of duplicated content (30% estimate)
}  Semantic duplication much higher
}  Quality of data
}  No required editorial process
}  Many typos and misspellings (impacts information retrieval)
}  Heterogeneous data
}  Different media
}  Different languages
Surface Web composition by file type
}  As of 2003, about 70% of Web content is images, HTML, PHP, and PDF files
How to find content and services?
}  Using search engines:
1. A web crawler gathers a snapshot of the Web
2. The gathered pages are indexed for easy retrieval
3. User submits a search query
4. Search engine ranks pages that match the query and returns an ordered list
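To make step 2 concrete, here is a minimal sketch of an indexer in Python (the function names and toy documents are hypothetical): it maps each term to the set of documents containing it, and answers a query by intersecting the posting lists of the query terms.

    from collections import defaultdict

    def build_index(docs):
        # Inverted index: term -> set of IDs of documents containing the term
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def search(index, query):
        # Conjunctive query: return documents containing every query term
        postings = [index[term] for term in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    docs = {1: "deep web forensics", 2: "web crawler basics", 3: "network forensics"}
    print(search(build_index(docs), "web forensics"))   # -> {1}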
How a typical search engine works
}  Architecture of a typical search engine
[Diagram: Users query an Interface backed by a Query Engine and an Index; a Crawler and Indexer feed the Index from the Web; the whole pipeline runs on lots and lots of computers]
What a Web crawler does
}  Creates and repopulates search engine data by navigating the web, fetching docs and files
}  The Web crawler is a foundational species
}  Without crawlers, there would be nothing to search
What a Web crawler is
}  In general, it’s a program for downloading web pages
}  Crawler AKA spider, bot, harvester
}  Given an initial set of seed URLs, recursively download
every page that is linked from pages in the set
}  A focused web crawler downloads only those pages whose
content satisfies some criterion
}  The set of URLs waiting to be crawled is the URL frontier
}  It can include multiple pages from the same host
Crawling the Web: Start from the seed pages
[Diagram: seed pages sit in the crawled-and-parsed set; their links form the URL frontier, which leads on into the unseen Web]
Crawling the Web: Keep expanding URL frontier
2015/16CSF - Nuno Santos23
URLs crawled
and parsed
Unseen Web
Seed
Pages
URL frontier
Crawling thread
Web crawler algorithm is conceptually simple
}  Basic algorithm:

    Initialize queue Q with the initial set of known URLs
    Until Q is empty, or the page or time limit is exhausted:
        Pop URL L from the front of Q
        If L is not an HTML page (.gif, .jpeg, .ps, .pdf, .ppt, ...), continue loop
        If L was already visited, continue loop
        Download page P for L
        If P cannot be downloaded (e.g., 404 error, robot excluded), continue loop
        Index P (e.g., add to inverted index or store cached copy)
        Parse P to obtain list of new links N
        Append N to the end of Q
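A minimal runnable version of this algorithm in Python, assuming the third-party requests and beautifulsoup4 packages (the page limit, timeout, and the indexing step are placeholders, not part of the original pseudocode):

    import requests
    from bs4 import BeautifulSoup
    from collections import deque
    from urllib.parse import urljoin

    def crawl(seed_urls, page_limit=100):
        queue = deque(seed_urls)                      # Q: initial known URLs
        visited = set()
        while queue and page_limit > 0:
            url = queue.popleft()                     # pop URL L from front of Q
            if url in visited:                        # already visited: skip
                continue
            visited.add(url)
            try:
                page = requests.get(url, timeout=5)   # download page P for L
                page.raise_for_status()
            except requests.RequestException:         # 404, timeout, ...: skip
                continue
            if "text/html" not in page.headers.get("Content-Type", ""):
                continue                              # not an HTML page: skip
            page_limit -= 1
            # Index P here (e.g., add to an inverted index or cache a copy)
            soup = BeautifulSoup(page.text, "html.parser")
            for link in soup.find_all("a", href=True):    # parse P for links N
                queue.append(urljoin(url, link["href"]))  # append N to end of Q
        return visited

A real crawler would also respect robots.txt and rate-limit per host, which is exactly why building one is harder than the algorithm suggests (next slide).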
But not so simple to build in practice
}  Performance: How do you crawl 1,000,000,000 pages?
}  Politeness: How do you avoid overloading servers? (see the sketch after this list)
}  Failures: Broken links, timeouts, spider traps.
}  Strategies: How deep to go? Depth first or breadth first?
}  Implementations: How do we store and update the URL
list and other data structures needed?
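For politeness, one common tactic is to enforce a minimum delay between successive requests to the same host. A minimal sketch in Python (the one-second delay is an arbitrary assumption, not a standard):

    import time
    from urllib.parse import urlparse

    last_request = {}   # host -> timestamp of the most recent request to it

    def polite_wait(url, delay=1.0):
        # Sleep just long enough that hits to one host are >= `delay` apart
        host = urlparse(url).netloc
        pause = last_request.get(host, 0.0) + delay - time.time()
        if pause > 0:
            time.sleep(pause)
        last_request[host] = time.time()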
Crawler performance measures
}  Completeness
Is the algorithm guaranteed to find a solution when
there is one?
}  Optimality
Is this solution optimal?
}  Time complexity
How long does it take?
}  Space complexity
How much memory does it require?
No single crawler can crawl the entire Web
}  Crawling technique may depend on goal
}  Types of crawling goals:
}  Create large broad index
}  Create a focused topic or domain-specific index
}  Target topic-relevant sites
}  Index preset terms
}  Create subset of content to model characteristics
of the Web
}  Need to survey appropriately
}  Cannot use simple depth-first or breadth-first
}  Create up-to-date index
}  Use estimated change frequencies
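As a sketch of that last point, a crawler can space revisits in inverse proportion to how often a page is estimated to change (the heuristic and the minimum rate below are illustrative assumptions):

    import time

    def next_crawl_time(last_crawl, est_changes_per_day):
        # Pages that change often get revisited sooner; cap the interval
        # for pages that appear to (almost) never change.
        interval_days = 1.0 / max(est_changes_per_day, 0.01)
        return last_crawl + interval_days * 86400   # seconds in a day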
Crawlers can also be used for nefarious purposes
}  Spiders can be used to collect email addresses for
unsolicited communication
}  From: http://spiders.must.die.net
Crawler code available for free
Spider traps
}  A spider trap is a set of web pages that may be used to
cause a web crawler to make an infinite number of
requests or cause a poorly constructed crawler to crash
}  To “catch” spambots or similar that waste a website's bandwidth
}  Common techniques used are:
•  Creation of indefinitely deep directory structures like http://foo.com/bar/foo/bar/foo/bar/foo/bar/..... (a simple guard is sketched after this list)
•  Dynamic pages like calendars that produce an infinite number of pages for a web crawler to follow
•  Pages filled with many chars, crashing the lexical analyzer parsing the page
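A simple defense against the first kind of trap is to cap the path depth and total length of a URL before queueing it; a minimal sketch in Python (both thresholds are arbitrary):

    from urllib.parse import urlparse

    MAX_DEPTH = 10      # /foo/bar/foo/bar/... deeper than this is suspect
    MAX_LENGTH = 256    # overly long URLs are suspect too

    def looks_like_trap(url):
        path = urlparse(url).path
        depth = sum(1 for segment in path.split("/") if segment)
        return depth > MAX_DEPTH or len(url) > MAX_LENGTH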
Search engines run specific and benign crawlers
}  Search engines obtain their listings in two ways:
}  The search engines “crawl” or “spider” documents by following hypertext links from one page to another
}  Authors may submit their own Web pages
}  As a result, only static Web content can be found on public
search engines
}  Nevertheless, a lot of info can be retrieved by criminals and
investigators, especially when using “hidden” features of the
search engine
Google hacking
}  Google provides keywords for advanced searching
}  Logic operators in search expressions
}  Advanced query attributes: “login password filetype:pdf”
}  intitle, allintitle
}  inurl, allinurl
}  filetype
}  allintext
}  site
}  link
}  inanchor
}  daterange
}  cache
}  info
}  related
}  phonebook
}  rphonebook
}  bphonebook
}  author
}  group
}  msgid
}  insubject
}  stocks
}  define
There are entire books dedicated to Google hacking
Dornfest, Rael, Google Hacks, 3rd ed., O’Reilly, 2006
Ethical Hacking, http://www.nc-net.info/2006conf/Ethical_Hacking_Presentation_October_2006.ppt
A cheat sheet of Google search features: http://www.google.com/intl/en/help/features.html
A Cheat Sheet for Google Search Hacks -- how to find information fast and efficiently: http://www.expertsforge.com/Security/hacking-everything-using-google-3.asp
Google hacking examples: Simple word search
}  A simple search: “cd ls .bash_history ssh”
}  Can return surprising results: this is the contents of a live .bash_history file
Google hacking examples: URL searches
}  inurl: finds the search term within the URL
}  Examples: inurl:admin; inurl:admin users mbox; inurl:admin users passwords
Google hacking examples: File type searches
}  filetype: narrows down search results to a specific file type
}  Example: filetype:xls “checking account” “credit card”
Google hacking examples: Finding servers
intitle:"Welcome to Windows 2000 Internet Services"
intitle:"Under construction" "does not currently have"
Google hacking examples: Finding webcams
}  To find open, unprotected Internet webcams that broadcast to the web, use the following query:
}  inurl:/view.shtml
}  Can also search by manufacturer-specific URL patterns
}  inurl:ViewerFrame?Mode=
}  inurl:ViewerFrame?Mode=Refresh
}  inurl:axis-cgi/jpg
}  ...
Google hacking examples: Finding webcams
}  How to Find and View Millions of Free Live Web Cams, http://www.traveltowork.net/2009/02/how-to-find-view-free-live-web-cams/
}  How to Hack Security Cameras, http://www.truveo.com/How-To-Hack-Security-Cameras/id/180144027190129591
}  How to Hack Security Cams all over the World, http://www.youtube.com/watch?v=9VRN8BS02Rk&feature=related
And we’re just scratching the surface…
What can be found in the depths of the Web?
The Deep Web
The Deep Web
}  The Deep Web is the part of the Web which is not indexed by conventional search engines and therefore doesn’t appear in search results
}  Why is it not indexed by typical search engines?
Some content can’t be found through URL traversal
•  Dynamic web pages and searchable databases
–  Response to a query or accessed only through a form
•  Unlinked contents
–  Pages without any backlinks
•  Private web
–  Sites requiring registration and login
•  Limited access web
–  Sites with captchas, no-cache pragma http headers
•  Scripted pages
–  Pages produced by JavaScript, Flash, etc.
Other times, content won’t be found
}  Crawling restrictions by site owner
}  Use a robots.txt file to keep files off limits from spiders (see the example after this list)
}  Crawling restrictions by the search engine
}  E.g.: a page may be found this way:
http://www.website.com/cgi-bin/getpage.cgi?name=sitemap
}  Most search engines will not read past the ? in that URL
}  Limitations of the crawling engine
}  E.g., real-time data – changes rapidly – too “fresh”
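For example, a site owner can exclude all crawlers from selected directories with a robots.txt file at the site root (the directory names here are made up):

    User-agent: *
    Disallow: /private/
    Disallow: /cgi-bin/

A well-behaved crawler can honor it with Python’s standard urllib.robotparser (the domain reuses the hypothetical example above; the user agent name is invented):

    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser()
    robots.set_url("http://www.website.com/robots.txt")
    robots.read()   # fetch and parse the robots.txt file
    print(robots.can_fetch("MyCrawler", "http://www.website.com/private/page.html"))  # False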
How big is the Deep Web?
}  Studies suggest it’s approx. 500x the surface Web
}  But cannot be determined accurately
}  A 2001 study showed that the 60 largest deep sites collectively exceeded the size of the surface web (at that time) by 40x
Distribution of Deep Web sites by content type
}  Back in 2001, the biggest fraction went to databases
Approaches for finding content in Deep Web
1.  Specialized search engines
2.  Directories
Specialized search engines
}  Crawl deeper
}  Go beyond top page, or homepage
}  Crawl focused
}  Choose sources to spider—topical sites only
}  Crawl informed
}  Indexing based on knowledge of the specific subject
Specialized search engines abound
}  There are hundreds of specialized search engines for almost every topic
Directories
}  Collections of pre-screened web sites organized into categories based on a controlled ontology
}  Including access to content in databases
}  Ontology: classification of human knowledge into
topics, similar to traditional library catalogs
}  Two maintenance models: open or closed
}  Closed model: paid editors; quality control (Yahoo)
}  Open model: volunteer editors; (Open Directory Project)
Example of ontology
}  Ontologies allow for adding structure to Web content
A particularly interesting search engine
}  Shodan lets the user find specific types of computers connected
to the internet using a variety of filters
}  Routers, servers, traffic lights, security cameras, home heating systems
}  Control systems for water parks, gas stations, water plants, power grids,
nuclear power plants and particle-accelerating cyclotrons
}  Why is it interesting?
}  Many devices use "admin" as user name and "1234" as password, and
the only software required to connect them is a web browser
How does Shodan work?
}  Shodan collects data mostly on HTTP servers (port 80)
}  But also from FTP (21), SSH (22), Telnet (23), and SNMP (161)
“Google crawls URLs – I don’t do that at all. The only thing I do is randomly pick an IP out of all the IPs that exist, whether it’s online or not being used, and I try to connect to it on different ports. It’s probably not a part of the visible web in the sense that you can’t just use a browser. It’s not something that most people can easily discover, just because it’s not visual in the same way a website is.”
John Matherly, Shodan's creator
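What Matherly describes is essentially banner grabbing: connect to an address on a port and record whatever the service announces about itself. A minimal sketch using Python’s standard library (the address below is a reserved documentation IP, a placeholder; only probe hosts you are authorized to scan):

    import socket

    def grab_banner(ip, port, timeout=3):
        # Connect to ip:port and return the first bytes the service sends
        try:
            with socket.create_connection((ip, port), timeout=timeout) as sock:
                sock.settimeout(timeout)
                return sock.recv(1024).decode(errors="replace")
        except OSError:
            return None   # closed port, timeout, unreachable host, ...

    # FTP (21), SSH (22), Telnet (23), and SMTP servers usually announce
    # themselves on connect; HTTP servers answer only after a request.
    print(grab_banner("192.0.2.1", 21))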
One can see through the eye of a webcam
Play with the controls for a water treatment facility
Find the creepiest stuff…
}  Controls for a crematorium; accessible from your computer
No words needed
}  Controls of Caterpillar trucks connected to the Internet
A particular case of the Deep Web
Dark Web
Dark Web
}  The Dark Web is the Web content that exists on darknets
}  Darknets are overlay networks which use the public Internet but require specific software or authorization to access
}  Delivered over small peer-to-peer networks
}  As hidden services on top of Tor
}  The Dark Web forms a small part of the Deep Web,
the part of the Web not indexed by search engines
The Dark Web is a haven for criminal activities
}  Hacking services
}  Fraud and fraud-related services
}  Markets for
illegal products
}  Hitmen
}  …
Surface Web vs. Deep Web

Surface Web:
}  Size: Estimated to be 8+ billion (Google) to 45 billion (About.com) web pages
}  Static, crawlable web pages
}  Large amounts of unfiltered information
}  Limited to what is easily found by search engines

Deep Web:
}  Size: Estimated to be 5 to 500x larger (BrightPlanet)
}  Dynamically generated content that lives inside databases
}  High-quality, managed, subject-specific content
}  Growing faster than surface web (BrightPlanet)
Conclusions
}  The Web is a major source of information for both criminal and legal investigation activities
}  The Web content that is typically accessible through conventional search engines is named the Surface Web and represents only a small fraction of the whole Web
}  The Deep Web comprises the largest bulk of the Web; a small part of it (the Dark Web) is used specifically for carrying out criminal activities
References
}  Primary bibliography
}  Michael K. Bergman, The Deep Web: Surfacing Hidden Value, http://brightplanet.com/wp-content/uploads/2012/03/12550176481-deepwebwhitepaper1.pdf
Next class
}  Flow analysis and intrusion detection