Deep Web and Digital Investigations

Damir Delija
Milano 2014

What we will talk about
• Web and “Deep Web”
• Web and documents
• Definitions
• Technical issues
• Forensic issues
• I’m not an expert on deep or dark web
• Discussion based on many sources and
references

Inaccessible Web
• Deep Web is a name for data inaccessible by
regular search engines on the Internet
• Deep Web sounds much better than
inaccessible
• Searchable / Accessible web is also called
surface web
• Dark web is the part of the WWW with illegal or
immoral content
• Dark web is not the same as Deep Web: it is part of it,
but dark pages are on the surface web too

Inaccessible Resources
• Inaccessible resources
– it exists but we don’t know about it or its location
– we can’t use it
• It is an old problem
– you have it, even in your own room
• Is there any solution?
– idea from Gopher days, Veronica
– it works well with static pages and data
– abandoned in web days, becomes a source of tremendous
power and wealth for Search Engines

Web and Internet and Documents
• WWW is not the Internet ☺
– also full data or document space of each networked
computer is not part of the Internet
• WWW is hypertext document based structure
– we have links among documents
– a document is not necessarily a web page
– documents must have a presentation ability to be visible
through the web interface (transcription layer, often
dynamically generated)
– Links, web pages and documents can be static or
dynamically generated
– Dynamic documents exist because of the volume of data
(it can’t be organised in static pages)
Definitions are crucial in understanding deep and
surface web

Volume of Data
• For each document there are on average 11
copies in the system
– enterprise measurements pre SAN calculation
• Shows how document space expands rapidly
• Even simple mail can cause data avalanches
• From the surface web point of view?
• Mostly invisible
• From the Deep Web point of view?
• Data/documents copies are probably floating
around, inaccessible to us

Web and Search Engines
• Web can reach only material which is
referenced by a link and is not access
protected
• Today we mostly assume that search engine span
equals the web and the Internet
• To be effective, search engines must have
pre-organised data to answer a query
• Enormous, changing volume of collected data
and propagation lag
http://en.wikipedia.org/wiki/List_of_search_engines
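
The point about pre-organised data can be illustrated with a toy inverted index. This is only a minimal sketch, not how any real search engine works: the page texts and the example.org URLs are made up, and the function names `build_index` and `search` are illustrative.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    # index.get() avoids creating empty entries for unknown words.
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


# Hypothetical pages a crawler might have collected.
pages = {
    "http://example.org/a": "deep web forensics",
    "http://example.org/b": "surface web",
}
index = build_index(pages)
print(search(index, "deep web"))
```

A page that was never crawled never enters the index, so from the search engine's point of view it does not exist.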

Deep Resources
• Deep Web depends on the method of how
search engines acquire and store data
• Web can be crawled or explored as link space
• Hints are cache, proxy, protocol traffic
• No clear boundary between deep resources
and surface resources

Uncollectible Resources
Deep Web Resources
• Dynamic Web Pages
– returned in response to a query or accessed only through a form
• Unlinked Contents
– Pages without any backlinks
• Private Web
– sites requiring registration and login (password-protected resources)
• Limited Access web
– Sites with captchas, no-cache pragma HTTP headers
• Scripted Pages
– Pages produced by JavaScript, Flash, AJAX etc.
• Non HTML contents
– Multimedia files e.g. images or videos
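
Some of the categories above can be recognised from the page itself. The following is a rough sketch using only the Python standard library; the hint labels and the function name `classify_page` are assumptions made for illustration, and real crawlers use far richer heuristics.

```python
from html.parser import HTMLParser


class FormDetector(HTMLParser):
    """Counts <form> elements and password inputs in an HTML page."""

    def __init__(self):
        super().__init__()
        self.forms = 0
        self.password_inputs = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self.forms += 1
        elif tag == "input" and (attributes.get("type") or "").lower() == "password":
            self.password_inputs += 1


def classify_page(html, headers):
    """Return a list of hints that a page may be hard for a crawler to collect."""
    detector = FormDetector()
    detector.feed(html)
    hints = []
    if detector.forms:
        hints.append("form-gated")
    if detector.password_inputs:
        hints.append("login-required")
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value.lower() for key, value in headers.items()}
    if "no-cache" in lowered.get("pragma", "") or "no-store" in lowered.get("cache-control", ""):
        hints.append("no-cache")
    return hints
```

Scripted pages and non-HTML content are not covered by this sketch, since detecting them requires executing scripts or inspecting content types.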

Uncollectible Resources
Documents and Disk Space
• This comes close to the e-discovery field
• Is this part of Deep Web?
• Documents not in the web tree
– accessible only by direct filesystem access
– or by dedicated script effort
• Files in general, on web servers and non-web
server machines
– accessible only by direct filesystem access

Forgotten Data
• From the security aspect, forgotten data is a
very interesting part of Deep Web
• What is forgotten data? Maybe data without a
custodian?
• Verizon reported on data breaches in
2008:
– unknown data was part of the breach in 66% of
incidents

Data Lifecycle
• Data creation and circulation
• How to find data and correlate it
• Search engines
• Proxies
• Metadata, logs, feeds
• Very interesting ideas in “Programming
Collective Intelligence” By: Toby Segaran,
O'Reilly Media, August 16, 2007

Hidden Data in Surface web ?
• Web handles data available through HTML and
extensions
• What about metadata and embedded data which
is not accessible to search engines?

Surface Web and Deep Issues
• “Hidden Data in Internet Published Documents”
– deep forensic impact
• Specific data formats can have embedded
elements which are not visible to search engines
– like thumbnails embedded in pictures
– EXIF data in images
– metadata in documents
– stego (steganography)
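
As one example of such embedded data, a JPEG file stores EXIF metadata in an APP1 header segment. The sketch below walks the header segments with the standard library only. It is a simplified illustration that assumes a well-formed file, it ignores padding bytes and markers without a length field, and the function names are illustrative.

```python
import struct


def jpeg_segments(data):
    """Yield (marker, payload) for each header segment of a JPEG byte string."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # lost sync; stop rather than guess
        marker = data[pos + 1]
        if marker == 0xDA:  # start of scan: compressed image data follows
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        # The length field counts its own two bytes but not the marker.
        yield marker, data[pos + 4:pos + 2 + length]
        pos += 2 + length


def has_exif(data):
    """True if the JPEG carries an APP1 segment starting with the Exif signature."""
    return any(
        marker == 0xE1 and payload.startswith(b"Exif\x00\x00")
        for marker, payload in jpeg_segments(data)
    )
```

In practice one would call `has_exif(open(path, "rb").read())` on a working copy of the evidence file, where `path` is whatever image is under examination.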

Idea of Treasure Island
• What is not on the map is unknown
• Hidden, like a treasure island
• Idea of the unexplored and uncharted, with big gains
• Because of its size, also the idea of an iceberg

Why Deep Web Exists ?
• Why do search engines fail?
– Technology
• Most of the web data is behind dynamically
generated pages (web gateways)
– Web crawlers cannot reach them or the data is not announced
– Can only be obtained if we have access to the system
containing the information
– Forms have to be populated with values
– This requires understanding the semantics of the web
gateway and the data behind it
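
Populating a form with values can be pictured as generating one query URL per combination of candidate field values. This is a minimal sketch assuming a GET form; the URL, the field names and the candidate values are hypothetical, and choosing meaningful values is the hard part that the sketch does not address.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_queries(action_url, field_values):
    """Build one GET URL per combination of candidate values for the form fields."""
    names = sorted(field_values)
    urls = []
    for combo in product(*(field_values[name] for name in names)):
        query = urlencode(list(zip(names, combo)))
        urls.append(f"{action_url}?{query}")
    return urls


# Hypothetical search form with two fields.
for url in candidate_queries(
    "http://example.org/search",
    {"topic": ["deep web"], "year": ["2013", "2014"]},
):
    print(url)
```

The number of combinations grows multiplicatively with each field, which is one reason why understanding the semantics of the gateway matters more than brute force.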

Measuring the Deep Web
• How to measure – estimates are based on known
examples
• Try to generate pages based on known home pages
and explore the link space, based on hop distances
• First Attempt: Bergman (2000)
– Size of surface web is around 19 TB
– Size of Deep Web is around 7500 TB
– Deep Web is nearly 400 times larger than the Surface Web
• 2004: Mitesh classified the Deep Web more accurately
– Most of the html forms are two hops from the home page
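
Exploring the link space by hop distance amounts to a breadth-first walk from the home page. The sketch below works on an in-memory link graph instead of live pages; the site structure is invented and `hop_distances` is an illustrative name.

```python
from collections import deque


def hop_distances(links, home, max_hops):
    """Breadth-first walk of a link graph, returning page -> hops from home."""
    distances = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        if distances[page] == max_hops:
            continue  # do not expand beyond the hop limit
        for target in links.get(page, []):
            if target not in distances:
                distances[target] = distances[page] + 1
                queue.append(target)
    return distances


# Hypothetical site: the search form sits two hops from the home page.
site = {
    "/": ["/about", "/catalog"],
    "/catalog": ["/catalog/search"],
    "/catalog/search": ["/catalog/results"],
}
print(hop_distances(site, "/", 2))
```

With a limit of two hops the walk reaches the search form but not the results behind it, which only appear after the form is submitted.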

Deep Web Size
Current Estimates 2014
• Deep Web about 7500 Terabytes
• Surface Web about 19 terabytes
• Deep Web has between 400 and 550 times more
public information than the Surface Web.
• 95% of the Deep Web is publicly accessible
• More than 200,000 Deep Web sites currently exist.
• 550 billion documents on Deep Web
• 1 billion documents on Surface Web
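
The "between 400 and 550 times" range follows from the figures above: the lower bound comes from comparing sizes, the upper one from comparing document counts. A small check, using only the numbers quoted on the slide:

```python
def ratio(deep, surface):
    """How many times larger the deep figure is than the surface figure."""
    return deep / surface


size_ratio = ratio(7500, 19)  # terabytes
document_ratio = ratio(550_000_000_000, 1_000_000_000)  # documents

print(f"by size: about {size_ratio:.0f} times")
print(f"by documents: {document_ratio:.0f} times")
```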

History of Deep Web
• Start: static HTML pages that web crawlers can easily
reach, only a few CGI scripts
• In mid-90’s: introduction of dynamic pages, a page
generated as a result of a query or link access
• In 1994: Jill Ellsworth used the term “Invisible
Web” to refer to these websites.
• In 2001, Bergman coined it as “Deep Web”
• Dark web develops in parallel as crime starts to spread
over the Internet

Rough Timeline
• 2001: Raghavan et al. -> Hidden Web Exposer (HiWE)
– domain specific human assisted crawler
• 2002: Stumbleupon used Human Crawler
– human crawlers can find relevant links that algorithmic crawlers miss.
• 2003: Bergman introduced LexiBot
– used for quantifying the Deep Web
• 2004: Yahoo! Content Acquisition Program
– paid inclusion for webmasters
• 2005: Yahoo! Subscriptions
– Yahoo started searching subscription-only sites
• 2005: Ntoulas et al. -> Hidden Web Crawler
– automatically generated meaningful queries to issue against search form
• 2005: Google Sitemaps
– Allows webmasters to inform search engines about URLs on their websites that
are available for crawling.
• Web 2.0 infrastructure
• Today: mobile devices and Internet of things
– each gadget can have (and has) a web server for configuration

From Digital Forensic Viewpoint
• Is there a way to carry out forensically sound
actions on Deep Web ?
• Can we apply standard digital forensic
procedures and best practices ?
• In both cases yes,
– we are always limited in digital forensics, but that
does not prevent reliable results

Web and Digital Forensic
• Web is web ☺
• Web artifacts are web artifacts
• The type of investigation determines how we
handle web data
– key element is: legal
• Many possible scenarios and situations
– follow the forensic principles and best practices as
in any other situation
– use scientific method
– test and experiment to prove method

Deep Web and Forensic Tasks
• How to prove access to Deep Web resources
– same as ordinary resources, because it is mostly
through browsers
– advantage over blind Deep Web access since there
are history, cache and log artifacts which show which
Deep Web resource was accessed
• Deep Web artifacts
– Mostly like any other web artifacts
– Hidden Data in Internet Published Documents
– Dark web as a specific subrange
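
Because access mostly goes through browsers, browser history is a natural place to look for traces. The sketch below filters a Firefox-style history database for hidden-service hosts. It assumes a `moz_places` table with `url` and `visit_count` columns, as found in Firefox's places.sqlite; the suffix list and the function name are illustrative, and other browsers use different schemas.

```python
import sqlite3
from urllib.parse import urlsplit

# Host suffixes treated as hidden-service indicators in this sketch.
HIDDEN_SUFFIXES = (".onion", ".i2p")


def hidden_service_visits(connection):
    """Return (url, visit_count) rows whose host ends in a hidden-service suffix."""
    rows = connection.execute("SELECT url, visit_count FROM moz_places ORDER BY url")
    hits = []
    for url, visit_count in rows:
        host = urlsplit(url).hostname or ""
        if host.endswith(HIDDEN_SUFFIXES):
            hits.append((url, visit_count))
    return hits
```

To stay forensically sound, this would be run against a working copy of the database opened read-only, for example with `sqlite3.connect("file:places.sqlite?mode=ro", uri=True)`, never against the original evidence.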

Forensic Tools Issues
• Forensics of specialised browsers and access tools
– Tor / onion
– Unusual browsers and access tools: links, lynx, wget
– Other networks: I2P, Freenet
• Key Question: Does our forensic framework
support such tools?
– Internet Evidence Finder
– EnCase
– FTK
– If not, how to handle artifacts and data?
• What about mobile devices?

Conclusion and Questions
• Challenging field
• Size will grow with the IPv6 takeover and the
“Internet of things” concept
• Cloud concept is important (size, access, legal
issues)
• Each new technology will add a new layer of
invisibility, i.e. complexity
• Size of available data simply forces the use of
dynamic web pages

References
Too many links ...
• http://papergirls.wordpress.com/2008/10/07/timeline-deep-web
• http://deepwebtechblog.com/federated-search-finds-content-that-google-can’t-reach-part-i-of-iii
• http://deepwebtechblog.com/a-federated-search-primer-part-ii-of-iii
• http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
• http://www.online-college-blog.com/features/100-useful-tips-and-tools-to-research-the-deep-web/
