Deep Web and Digital 
Investigations 
Damir Delija 
Milano 2014 
What we will talk about 
• Web and “Deep Web” 
• Web and documents 
• Definitions 
• Technical issues 
• Forensic issues 
• I’m not an expert on the deep or dark web 
• Discussion based on many sources and 
references
Inaccessible Web 
• Deep Web is a name for data on the Internet that is inaccessible to regular search engines 
• “Deep Web” sounds much better than “inaccessible” 
• The searchable / accessible web is also called the surface web 
• The dark web is the part of the WWW with illegal or immoral content 
• The dark web is not the Deep Web; it is partly inside it, but dark pages exist on the surface web too
Inaccessible Resources 
• Inaccessible resources 
– they exist, but we don’t know about them or their location 
– we can’t use them 
• It is an old problem 
– you have it even in your own room 
• Is there any solution? 
– the idea goes back to the Gopher days and Veronica 
– it works well with static pages and data 
– it was abandoned in the web era and became a source of tremendous power and wealth for search engines
Web and Internet and Documents 
• The WWW is not the Internet ☺ 
– likewise, the full data or document space of each networked computer is not part of the Internet 
• The WWW is a hypertext, document-based structure 
– we have links among documents 
– a document is not necessarily a web page 
– documents must have a presentation layer to be visible through the web interface (a transcription layer, often dynamically generated) 
– links, web pages and documents can be static or dynamically generated 
– dynamic documents exist because of the volume of data (it can’t be organised in static pages) 
Definitions are crucial in understanding the deep and surface web
Volume of Data 
• For each document there are on average 11 copies in the system 
– enterprise measurements from pre-SAN calculations 
• Shows how rapidly the document space expands 
• Even simple mail can cause data avalanches 
• From the surface web point of view? 
– mostly invisible 
• From the Deep Web point of view? 
– copies of data and documents are probably floating around, inaccessible to us
Web and Search Engines 
• The web can access material that is referenced by a link and not access protected 
• Today we mostly assume that search engine coverage equals the web and the Internet 
• To be effective, search engines must have pre-organised data to answer queries (see the sketch below) 
• Enormous, changing volume of collected data and propagation lag 
http://en.wikipedia.org/wiki/List_of_search_engines
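To make the “pre-organised data” point concrete, here is a minimal sketch (my own illustration, not part of the original slides) of an inverted index: only material that has already been fetched and indexed can answer a query. The example pages are hard-coded stand-ins for crawled documents.

```python
# Minimal sketch of the "pre-organised data" a search engine needs:
# an inverted index mapping each term to the pages that contain it.
# The pages dict is a hard-coded stand-in for crawled documents.
from collections import defaultdict

pages = {
    "http://example.com/a": "deep web resources behind forms",
    "http://example.com/b": "surface web pages reachable by links",
}

index = defaultdict(set)          # term -> set of URLs containing it
for url, text in pages.items():
    for term in text.lower().split():
        index[term].add(url)

def search(query):
    """Return URLs containing every term of the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

print(search("web forms"))        # only pages indexed in advance can be found
```

Anything that never enters such an index, however reachable in principle, stays in the Deep Web from that engine’s point of view.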
Deep Resources 
• The Deep Web depends on how search engines acquire and store data 
• The web can be crawled or explored as a link space 
• Hints are caches, proxies and protocol traffic 
• There is no clear boundary between deep resources and surface resources
Uncollectible Resources 
Deep Web Resources 
• Dynamic web pages 
– returned in response to a query or accessed only through a form 
• Unlinked content 
– pages without any backlinks 
• Private web 
– sites requiring registration and login (password-protected resources) 
• Limited-access web 
– sites with CAPTCHAs or no-cache pragma HTTP headers 
• Scripted pages 
– pages produced by JavaScript, Flash, AJAX, etc. 
• Non-HTML content 
– multimedia files, e.g. images or videos 
A rough detector for such pages is sketched below.
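The sketch below flags pages that fall into these categories by inspecting the fetched response; the heuristics (presence of forms, script density, content type, a noindex hint) and the thresholds are illustrative assumptions on my part, not a complete classifier.

```python
# Rough heuristics for spotting pages a plain link-following crawler
# will not fully index: forms, heavy scripting, non-HTML content.
# The rules and thresholds are illustrative only, not a complete classifier.
import urllib.request

def classify(url):
    reasons = []
    with urllib.request.urlopen(url, timeout=10) as resp:
        ctype = resp.headers.get("Content-Type", "")
        body = resp.read(200_000).decode("utf-8", errors="replace").lower()
    if "text/html" not in ctype:
        reasons.append("non-HTML content: " + ctype)
    if "<form" in body:
        reasons.append("content likely behind a form (dynamic page)")
    if body.count("<script") > 5:
        reasons.append("heavily scripted page (JavaScript/AJAX)")
    if 'name="robots"' in body and "noindex" in body:
        reasons.append("explicitly excluded from indexing")
    return reasons

print(classify("http://example.com/"))
```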
Uncollectible Resources 
Documents and Disk Space 
• This comes close to the e-discovery field 
• Is this part of the Deep Web? 
• Documents not in the web tree 
– accessible only by direct filesystem access or by a dedicated script effort (see the sketch below) 
• Files generally sit on web servers and non-web server machines 
– accessible only by direct filesystem access
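A minimal sketch of such a “dedicated script effort”: walking a server’s filesystem and listing document files that live outside the published web root. The paths and the extension list are illustrative assumptions.

```python
# Sketch: find documents on a server that are NOT under the web root,
# i.e. reachable only by direct filesystem access, not through the site.
# WEB_ROOT, SEARCH_ROOT and the extension list are illustrative assumptions.
import os

WEB_ROOT = "/var/www/html"
SEARCH_ROOT = "/home"
DOC_EXTENSIONS = (".pdf", ".doc", ".docx", ".xls", ".xlsx", ".txt")

def documents_outside_web_tree():
    for dirpath, _dirnames, filenames in os.walk(SEARCH_ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.lower().endswith(DOC_EXTENSIONS) and not path.startswith(WEB_ROOT):
                yield path

for path in documents_outside_web_tree():
    print(path)
```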
Forgotten Data 
• From the security aspect, forgotten data is a very interesting part of the Deep Web 
• What is forgotten data – maybe data without a custodian? (one possible test is sketched below) 
• Verizon’s 2008 data breach report 
– unknown data was part of the breach in 66% of incidents
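One possible working test for “data without a custodian”, sketched here for a Unix filesystem: files whose owning account no longer exists, or that nobody has touched in years. The definition, the three-year threshold and the scan root are my assumptions, not from the Verizon report.

```python
# Sketch: flag "forgotten" files on a Unix filesystem - files whose owner
# account no longer exists, or that have not been accessed for years.
# The three-year threshold and this notion of "custodian" are assumptions.
import os, pwd, time

THREE_YEARS = 3 * 365 * 24 * 3600

def forgotten_files(root):
    now = time.time()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue
            try:
                pwd.getpwuid(st.st_uid)        # does the owning account still exist?
                orphaned = False
            except KeyError:
                orphaned = True
            stale = (now - st.st_atime) > THREE_YEARS
            if orphaned or stale:
                yield path, orphaned, stale

for path, orphaned, stale in forgotten_files("/srv"):
    print(path, "orphaned-owner" if orphaned else "", "stale" if stale else "")
```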
Data Lifecycle 
• Data creation and circulation 
• How to find data and correlate it (a log-correlation sketch follows below) 
• Search engines 
• Proxies 
• Metadata, logs, feeds 
• Very interesting ideas in “Programming Collective Intelligence” by Toby Segaran, O'Reilly Media, August 16, 2007
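As a small example of correlation through logs, the sketch below counts which documents were fetched and by which clients from a web server access log. The log path and the Common Log Format assumption are illustrative.

```python
# Sketch: correlate document access via a web server log in Common Log Format.
# Counts which paths were requested successfully and from which client addresses.
# The log path and the format assumption (CLF) are illustrative.
import re
from collections import Counter, defaultdict

LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+) [^"]*" (\d{3})')

hits = Counter()
clients = defaultdict(set)

with open("/var/log/apache2/access.log", errors="replace") as log:
    for line in log:
        m = LOG_LINE.match(line)
        if not m:
            continue
        client, path, status = m.groups()
        if status.startswith("2"):            # only successful requests
            hits[path] += 1
            clients[path].add(client)

for path, count in hits.most_common(10):
    print(count, path, "clients:", len(clients[path]))
```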
Hidden Data in the Surface Web? 
• The web handles data available through HTML and its extensions 
• What about metadata and embedded data that is not accessible to search engines?
Surface Web and Deep Issues 
• “Hidden Data in Internet Published Documents” 
– deep forensic impact 
• Specific data formats can have embedded elements that are not visible to search engines (an EXIF extraction sketch follows below) 
– thumbnail views embedded in pictures 
– EXIF data in images 
– metadata in documents 
– steganographic content
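A minimal sketch of pulling one such embedded element, EXIF metadata, out of an image. It assumes the third-party Pillow library is installed; “photo.jpg” is a placeholder file name.

```python
# Minimal sketch: read EXIF metadata embedded in an image with Pillow
# (pip install Pillow). Such embedded data is rarely indexed by search engines
# even when the image itself sits on the surface web. "photo.jpg" is a placeholder.
from PIL import Image
from PIL.ExifTags import TAGS

def dump_exif(path):
    exif = Image.open(path).getexif()
    for tag_id, value in exif.items():
        name = TAGS.get(tag_id, tag_id)   # map numeric tag ID to a readable name
        print(f"{name}: {value}")

dump_exif("photo.jpg")
```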
Idea of Treasure Island 
• What is not on the map is unknown 
• Hidden like a treasure island 
• The idea of the unexplored and uncharted, with big gains... 
• Because of its size, the iceberg image is also used
Why Does the Deep Web Exist? 
• Why do search engines fail? 
– technology 
• Most web data is behind dynamically generated pages (web gateways) 
– web crawlers cannot reach them, or the data is not announced 
– it can only be obtained if we have access to the system containing the information 
– forms have to be populated with values (see the sketch below) 
– it requires understanding the semantics of the web gateway and the data behind it
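A sketch of what “populating the form” means in practice: the code below POSTs a query value into a hypothetical web gateway, which is the only way its dynamically generated results can be reached. The URL and field names are placeholders, not a real service.

```python
# Sketch: content behind a web gateway can only be reached by populating the
# form it exposes. The URL and field names below are hypothetical placeholders.
import urllib.parse
import urllib.request

def query_gateway(term):
    data = urllib.parse.urlencode({"q": term, "submit": "Search"}).encode()
    req = urllib.request.Request("http://catalogue.example.org/search", data=data)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

# A crawler that cannot guess meaningful values for "q" never sees these pages.
html = query_gateway("forensics")
print(len(html), "bytes of dynamically generated content")
```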
Measuring the Deep Web 
• How to measure – estimates are based on known examples 
• Try to generate pages based on known home pages and explore the link space, based on hop distances (a hop-distance sketch follows below) 
• First attempt: Bergman (2000) 
– size of the surface web is around 19 TB 
– size of the Deep Web is around 7,500 TB 
– the Deep Web is nearly 400 times larger than the surface web 
• In 2004 Mitesh classified the Deep Web more accurately 
– most HTML forms are two hops from the home page
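A sketch of the hop-distance idea: a breadth-first walk of the link space from a home page, reporting the depth at which pages containing an HTML form first appear. The start URL and depth limit are placeholders, and the crawler does no politeness or robots.txt handling.

```python
# Sketch of the hop-distance measurement: breadth-first crawl from a home page,
# recording the depth at which pages containing an HTML form first appear.
# Start URL and depth limit are placeholders; no politeness/robots handling here.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request

class LinkAndFormParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links, self.has_form = [], False
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "form":
            self.has_form = True

def form_hop_distances(start, max_depth=2):
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read(200_000).decode("utf-8", errors="replace")
        except OSError:
            continue
        parser = LinkAndFormParser()
        parser.feed(html)
        if parser.has_form:
            print(depth, "hops:", url)
        if depth < max_depth:
            for href in parser.links:
                nxt = urljoin(url, href)
                if nxt.startswith("http") and nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))

form_hop_distances("http://example.com/")
```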
Deep Web Size 
Current Estimates 2014 
• Deep Web: about 7,500 terabytes 
• Surface web: about 19 terabytes 
• The Deep Web has between 400 and 550 times more public information than the surface web 
• 95% of the Deep Web is publicly accessible 
• More than 200,000 Deep Web sites currently exist 
• 550 billion documents on the Deep Web 
• 1 billion documents on the surface web
History of Deep Web 
• Start: static HTML pages, easily reached by web crawlers, only a few CGI scripts 
• Mid-90s: introduction of dynamic pages, generated as the result of a query or link access 
• 1994: Jill Ellsworth used the term “Invisible Web” to refer to these websites 
• 2001: Bergman coined the term “Deep Web” 
• The dark web grows in parallel as crime starts to spread over the Internet
Rough Timeline 
• 2001: Raghavan et al. -> Hidden Web Exposure 
– domain-specific, human-assisted crawler 
• 2002: StumbleUpon used human crawlers 
– human crawlers can find relevant links that algorithmic crawlers miss 
• 2003: Bergman introduced LexiBot 
– used for quantifying the Deep Web 
• 2004: Yahoo! Content Acquisition Program 
– paid inclusion for webmasters 
• 2005: Yahoo! Subscriptions 
– Yahoo! started searching subscription-only sites 
• 2005: Noulas et al. -> hidden web crawler 
– automatically generated meaningful queries to issue against search forms 
• 2005: Google Sitemaps 
– allows webmasters to inform search engines about URLs on their websites that are available for crawling 
• Then: Web 2.0 infrastructure 
• Today: mobile devices and the Internet of Things 
– each gadget can have (and has) a web server for configuration
Forensic Issues
From Digital Forensic Viewpoint 
• Is there a way to carry out forensically sound actions on the Deep Web? 
• Can we apply standard digital forensic procedures and best practices? 
• In both cases, yes 
– we are always limited in digital forensics, but that does not prevent reliable results
Web and Digital Forensic 
• Web is web ☺ 
• Web artifacts are web artifacts 
• The type of investigation determines how we handle web data 
– the key element is: legal 
• Many possible scenarios and situations 
– follow forensic principles and best practices as in any other situation 
– use the scientific method 
– test and experiment to prove the method
Deep Web and Forensic Tasks 
• How to prove access to Deep Web resources 
– the same as for ordinary resources, because access is mostly through browsers 
– an advantage over blind Deep Web access, since there are history, cache and log artifacts which show which Deep Web resource was accessed (see the sketch below) 
• Deep Web artifacts 
– mostly like any other web artifacts 
– hidden data in Internet-published documents 
– the dark web as a specific subrange
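As an example of such artifacts, the sketch below queries a Firefox-family places.sqlite history database and filters for .onion hosts. The profile path is a placeholder and the moz_places column layout is an assumption; note that Tor Browser itself normally keeps no history on disk, so this mainly applies to ordinary browsers used for such access.

```python
# Sketch: pull visited URLs from a Firefox-family places.sqlite history
# database and flag .onion hosts. The profile path is a placeholder and the
# moz_places schema is assumed (url, title, visit_count, last_visit_date).
import sqlite3
from datetime import datetime, timezone

DB = "/cases/suspect/profile/places.sqlite"   # work on a copy, never the original

con = sqlite3.connect(DB)
rows = con.execute(
    "SELECT url, title, visit_count, last_visit_date FROM moz_places "
    "WHERE url LIKE '%.onion%' ORDER BY last_visit_date DESC"
)
for url, title, visits, last_visit in rows:
    when = (datetime.fromtimestamp(last_visit / 1_000_000, tz=timezone.utc)
            if last_visit else None)          # stored as microseconds since epoch
    print(when, visits, url, title or "")
con.close()
```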
Forensic Tools Issues 
• Forensics of specialised browsers and access tools 
– Tor / .onion 
– unusual browsers and access tools: links, lynx, wget 
– other networks and clients: I2P, Freenet 
• Key question: does our forensic framework support such tools? 
– Internet Evidence Finder 
– EnCase 
– FTK 
– if not, how do we handle the artifacts and data? (a fallback sketch follows below) 
• What about mobile devices?
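If the framework has no parser for such tools, one crude fallback is a filename sweep over the mounted image for traces of Tor Browser, I2P, Freenet, wget or lynx, as sketched below. The mount point and the name patterns are illustrative examples, not an exhaustive signature set.

```python
# Rough sketch of a fallback when the forensic suite lacks parsers for such
# tools: sweep a mounted image for names suggesting Tor Browser, I2P, Freenet,
# wget or lynx. MOUNT and the pattern list are illustrative, not exhaustive.
import fnmatch
import os

MOUNT = "/mnt/evidence"                      # read-only mount of the image
PATTERNS = ["*tor-browser*", "*Tor Browser*", "*.i2p*", "*freenet*",
            "*.wget-hsts", "*lynx*"]

def tool_traces(root):
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if any(fnmatch.fnmatch(name, pat) for pat in PATTERNS):
                yield os.path.join(dirpath, name)

for hit in tool_traces(MOUNT):
    print(hit)
```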
Conclusion and Questions 
• A challenging field 
• Size will grow with the IPv6 takeover and the “Internet of Things” concept 
• The cloud concept is important (size, access, legal issues) 
• Each new technology will add a new layer of invisibility, i.e. complexity 
• The sheer size of available data simply forces the use of dynamic web pages
References 
Too many links ... 
• http://papergirls.wordpress.com/2008/10/07/timeline-deep-web 
• http://deepwebtechblog.com/federated-search-finds-content-that-google-can’t-reach-part-i-of-iii 
• http://deepwebtechblog.com/a-federated-search-primer-part-ii-of-iii 
• http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html 
• http://www.online-college-blog.com/features/100-useful-tips-and-tools-to-research-the-deep-web/
