This document discusses digital forensics techniques used by law enforcement and researchers. It describes how digital forensics emerged in response to criminal use of electronic devices and emphasizes scientifically valid methods. Key techniques discussed include imaging media to obtain evidence, using hashing to filter known files, and data carving to recover deleted information. Challenges include analyzing increasing digital data and addressing ethical issues when recovering deleted files.
Watching the Detectives: Using digital forensics techniques to investigate the digital persona
1. Watching the Detectives
Using digital forensic techniques to
investigate the digital persona
Gareth Knight
Centre for e-Research,
Anatomy Museum, King’s College London, 8th November 2011
2. Overview
• Introduction to digital forensics
• How is it used in law enforcement
• How can it be used for research and digital
curation
• Forensic practices
• Media imaging
• Hash filtering
• Data carving
• Current/future challenges
3. Origin of Digital Forensics
• Emerged in 1980s as a response to increasing use of
electronic devices for criminal activity.
• Practitioner-led approach - a set of methods applied to
gather, retrieve and analyse potential evidence held
on digital devices
• Emphasis upon “scientifically derived and proven
methods” to obtain, analyse & report upon digital
evidence (Digital Forensics Research Workshop,
2001)
• Legal acceptability influenced by Daubert Standard:
• Methods must be tested,
• Subject to peer review and publication,
• Possess a known error rate,
• Subject to standards governing their application
4. Intelligence gathering in law-enforcement
• Role in legal Disclosure (UK)/e-discovery (US) to obtain
data designated as evidence in legal investigation.
• Broad intelligence gathering activities – develop & test
hypothesis
• Several intelligence cycles developed to model
investigation process
[Figures: Robert Clark’s target-centric approach; Peter Pirolli and Stuart Card’s sense-making loop]
5. Value for digital archiving and
research
Increasing amount of digital information:
Analysis of research activities
• When did an author create a notable work?
• What tools did they use?
• What sources did they consult?
• Is there evidence of material they abandoned?
Business function
• Staff have their machine appraised prior to leaving
institution/finishing a project to identify data of long-term
value not held elsewhere
Example: Salman Rushdie Archive
• Emulation of several Apple Macs owned by the author
http://www.emory.edu/home/academics/libraries/salman-rushdie.html
6. Digital Forensics workflow
Forensic activities, as described by Digital Forensics Research Workshop (2001)
Stages: Preservation, Collection, Validation, Identification, Analysis, Interpretation, Documentation, Presentation
Grouped into three phases: Acquisition, Analysis, Reporting
7. Data Acquisition
Act of obtaining possession of digital data for subsequent analysis.
Commonly achieved through creation of a disk image or clone that
provides a bit-for-bit copy of the disk.
Example: a 60GB hard disk is captured as 1 or more files that add up to 60GB.
Motivation for creating a disk image in forensic environment:
1. Backup copy avoids risk of media failure or other damage during use
2. Avoids risk of making inadvertent, unrecoverable change to the
primary copy
• Files can be created/modified/deleted through access to disk
3. Enables analysis using methods and tools that are not
possible/available in the original environment (e.g. emulation, text
mining)
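As a rough illustration of what a bit-for-bit capture involves, the sketch below copies a source to an image file in fixed-size chunks and records a SHA-256 digest of the bytes read. The function names `image_device` and `verify_image` are hypothetical, and this is not how Dc3dd, FTK Imager or any other named tool is implemented. It assumes the source is already attached read-only (for example behind a write blocker) and does not handle bad sectors.

```python
import hashlib


def image_device(source: str, destination: str, chunk_size: int = 1024 * 1024) -> str:
    """Copy `source` to `destination` byte for byte.

    Returns the SHA-256 digest of the data that was read, so the image
    can be checked against it later. Hypothetical helper for illustration.
    """
    digest = hashlib.sha256()
    with open(source, "rb") as src, open(destination, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # end of source reached
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


def verify_image(image_path: str, expected_digest: str, chunk_size: int = 1024 * 1024) -> bool:
    """Re-read the image file and confirm it still matches the recorded digest."""
    digest = hashlib.sha256()
    with open(image_path, "rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_digest
```

On a Unix-like system the source could be a device node or an existing image file; all later analysis would then be run against the copy, not the original media.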
8. Forensic Utility Belt
(1) Capture software
Stored on bootable media (floppy, CD, USB)
Examples: Dc3dd, DDRescue, OSFClone, FTK Imager
(2) Write Blocker
Prevents OS writing to connected devices
E.g. USB plug-through unit
(3) Access Devices
Drive enclosure allows use of internal disks via USB
Kryoflux USB disk controller allows low level disk access
(4) Destination Media
Digital media on which the disk image will be written, e.g. USB hard disk
9. Key Questions to be addressed
1. What type of media do you want to
capture?
• Floppy disk, hard disk, optical media
2. How can the data be accessed?
• Hard disk installed within user’s computer
• Accessed using appropriate reader (USB
hard disk caddy, floppy disk reader,
CD/DVD reader)
• Network connected disk
3. Where will the acquired image be
stored?
• External USB disk,
• Network device over Ethernet/Serial, etc.
4. What software should you use to
capture the disk image?
[Figures: Different Hardware; Different Media]
10. Data Analysis
Content held on digital media serves many purposes:
• Operating system files, e.g. Windows has 30,000+ after fresh install
• Software: Applications, utilities, games, etc.
• Log data: Windows Registry, browser cache, cookies, temp files
• User-generated content: Documents, images, sound, emails, etc.
Different data layers available:
1. Active data: Information readily available as normally seen by an OS
2. Inactive/residual data: Information that has been deleted or modified
• Deleted files located in unallocated space that have yet to be overwritten
(retrieved using undelete application)
• Data fragments that contain information from a partially deleted file
(retrieved through carving)
Inactive data useful, but need to consider ethical issues
11. Locating active files
Common techniques for locating user content:
• Navigate directory structure to get a ‘feel’ for data
files held on disk
• Search by:
• File name, e.g. *report*
• File type, e.g. *.doc, *.pdf, etc.
• Creation/modification date
• Content type, e.g. word usage
• File size
• Additional parameters configurable
Windows search is easy to perform, but does not identify
everything, and the investigation process itself can leave
artefacts (e.g. thumbs.db) behind
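The search criteria listed above (file name, type, date, size) can be sketched in a few lines of Python. The function `find_files` and its parameters are hypothetical names chosen for illustration. The sketch assumes it is run against a read-only mounted copy of the image, so that the search itself does not alter the evidence, and it compares modification times as naive local datetimes.

```python
from datetime import datetime
from fnmatch import fnmatchcase
from pathlib import Path


def find_files(root, name_pattern="*", min_size=0, modified_after=None):
    """Return files under `root` that match a name pattern, size and date.

    name_pattern   -- shell-style wildcard, matched case-insensitively,
                      e.g. "*report*" or "*.doc"
    min_size       -- minimum file size in bytes
    modified_after -- optional naive datetime (local time)
    """
    pattern = name_pattern.lower()
    matches = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        if not fnmatchcase(path.name.lower(), pattern):
            continue
        info = path.stat()
        if info.st_size < min_size:
            continue
        if modified_after and datetime.fromtimestamp(info.st_mtime) < modified_after:
            continue
        matches.append(path)
    return sorted(matches)
```

For example, `find_files("/mnt/image", "*report*")` would list every file whose name contains "report"; the mount point here is an assumed path.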
12. Case Management Tools
Common interface for analysing drive
without content change
Commercial: FTK, OSForensics
OSS: Sleuthkit/Autopsy, Digital
Forensics Framework, PyFlag
Provide tools to sort/visualise data by:
• Name,
• Folder,
• Size,
• Type,
• Creation/Modification date
• Hash set
13. Identifying user data using
checksums
• Checksum algorithm applied to a file
generates a distinct (possibly unique)
alphanumeric value
• Many different types of checksum algorithm
• Commonly used to check for
accidental/deliberate data change/corruption
• Generate checksum on October 1st
• Generate checksum on October 14th & compare
to Oct 1st value – are they the same?
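The two-date comparison can be illustrated with Python's standard `hashlib` module. This is a minimal sketch: the helper names `checksum` and `is_unchanged` are hypothetical, and SHA-256 is used as the default algorithm only as an assumption for the example.

```python
import hashlib


def checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hexadecimal checksum of a file, read in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum, algorithm="sha256"):
    """Compare today's checksum with the value recorded earlier."""
    return checksum(path, algorithm) == recorded_checksum
```

In practice the value generated on October 1st would be stored alongside the file, and `is_unchanged` would be called on October 14th; a `False` result indicates the content changed in between, whether by accident or on purpose.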
14. Hash filtering / Exclusion Hashing
• Technique to identify data files obtained from
different sources
• Calculate checksum (e.g. MD5, SHA-1) of one or
more files
• Compare each checksum against a checksum
database indicating files known to originate from a
third party
Checksum types
• ‘known good’ - Files that perform a legitimate
purpose, e.g. Operating System, application.
• ‘known bad’ - Files that denote viruses, Trojans,
cracker's tools, or other malicious files
• Unknown – Files that have not been previously
encountered.
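The filtering step can be sketched as a lookup of each file's checksum in two sets. The sets `known_good` and `known_bad` are placeholders that would have to be loaded from a hash database; the loading step and the database format are not shown and the function names are hypothetical. SHA-1 is used because the slide names it as an example.

```python
import hashlib
from pathlib import Path


def sha1_of(path):
    """SHA-1 of a file's content (reads the whole file into memory)."""
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()


def classify(paths, known_good, known_bad):
    """Sort files into 'known good', 'known bad' and 'unknown'.

    known_good / known_bad are sets of hexadecimal SHA-1 strings,
    assumed to come from a reference hash database.
    """
    result = {"known good": [], "known bad": [], "unknown": []}
    for path in paths:
        digest = sha1_of(path)
        if digest in known_bad:
            result["known bad"].append(path)
        elif digest in known_good:
            result["known good"].append(path)
        else:
            result["unknown"].append(path)
    return result
```

The 'unknown' list is the interesting one for curation purposes, since it is where user-created content is most likely to be found.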
15. Hash datasets – Information
Sources
NIST National Software Reference Library (NSRL):
• Checksums of legitimate files generated from software products
obtained through purchase/donation.
• Stores 10,000+ software files.
• Reference Data Set published every 3 months & available through 3rd
parties, such as Find-a-Hash
HashKeeper - National Drug Intelligence Center
• Checksums gathered through criminal investigation.
• Academic (and other) institutions must file a FoI request to gain
access to software and database.
Online File Signature Database (OFSDB):
• Subscription based system dependent upon user contribution.
• Full access available through subscription of 25 USD per year
• Currently being used by curators/archivists to distinguish between
known third-party and potential user created files.
16. Practical Example
60GB hard disk (Windows 2000): 9,698 known files, 12,974 unknown files
• Known: files that match the NSRL database content
• Unknown: files that may be user created
Method may be combined with other techniques, e.g. path and filename
analysis to exclude other common files (e.g. thumbs.db)
17. Recovering deleted data
• Data files continue to exist in full or in part for some
time after deletion
• The list of disk clusters occupied by the file is relabelled as
‘unallocated’, i.e. available for use.
Recovering complete files
• Files may be recovered if the space has not been
allocated to new data – recovery software
may be used to recreate the pointer to files
that still exist
• Likelihood of retrieving entire file
decreases over time
18. (Data/File) Carving
“File carving is the process of
recovering computer files from a
storage medium without the use of
the standard file-system metadata
that is typically used during a normal
file retrieval.”
http://www.techheadsitconsulting.com/f/file-carving.html
Useful for data recovery when:
• The File system ‘pointer’ (directory
entry) to the file has been deleted or
corrupted.
• Sectors allocated to data file have
been partially overwritten
20. Header/Footer Carving
Analyse file to identify data sequences that
match a known filetype header & footer
Type  Header                      Footer
GIF   \x47\x49\x46\x38\x37\x61    \x00\x3b
JPG   \xff\xd8\xff\xe0\x00\x10    \xff\xd9
ZIP   PK\x03\x04                  \x3c\xac
Sample header/footer information used by Scalpel to identify files
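A minimal sketch of header/footer carving over an in-memory byte string follows. It searches for a header signature, then for the next footer within a size limit, and extracts everything in between. The function name `carve` and the `max_size` parameter are hypothetical; the JPG byte values are the ones listed on this slide. Real carvers work on disk images far larger than memory and cope with fragmentation, which this sketch does not attempt.

```python
JPG_HEADER = b"\xff\xd8\xff\xe0\x00\x10"
JPG_FOOTER = b"\xff\xd9"


def carve(data: bytes, header: bytes, footer: bytes, max_size: int = 10 * 1024 * 1024):
    """Return byte strings that start with `header` and end with the
    next `footer`, provided the whole candidate fits within `max_size`."""
    carved = []
    start = data.find(header)
    while start != -1:
        # Look for the footer after the header, within the size limit.
        end = data.find(footer, start + len(header), start + max_size)
        if end != -1:
            carved.append(data[start:end + len(footer)])
            start = data.find(header, end + len(footer))
        else:
            # No footer in range: skip this header and keep scanning.
            start = data.find(header, start + 1)
    return carved
```

Because the match is purely on byte patterns, any footer sequence that happens to occur inside the file body ends the extraction early, which is one source of the incomplete and invalid files that carving tools produce.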
21. Other carving methods
• Header/Maximum (file size) Carving: Match header of known
file type and extract data in sequence until a specified file size
(e.g. 10MB) has been reached.
• Header/Embedded Length carving: Technique for carving
formats that store total size(length) in header, e.g. BMP, PDF,
AVI
• File structure based/Deep carving: Use documentation on file
type structure to carve files
• Smart Carving: Use documentation on file system’s data
handling to address disk fragmentation issues
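Header/embedded length carving can be illustrated with BMP, whose file header starts with the two bytes "BM" followed by the total file size as a 4-byte little-endian integer. The sketch below is an assumption-laden illustration, not any tool's implementation: the name `carve_bmp` and the `max_size` sanity limit are hypothetical, and the data is assumed to be unfragmented and held in memory.

```python
import struct


def carve_bmp(data: bytes, max_size: int = 50 * 1024 * 1024):
    """Carve BMP candidates using the length stored in the file header.

    Bytes 0-1 of a BMP are b"BM"; bytes 2-5 hold the total file size
    (little-endian unsigned 32-bit integer).
    """
    carved = []
    position = data.find(b"BM")
    while position != -1:
        if position + 6 <= len(data):
            (length,) = struct.unpack_from("<I", data, position + 2)
            # Reject implausible lengths: shorter than the 14-byte file
            # header, larger than the limit, or running past the data.
            if 14 <= length <= max_size and position + length <= len(data):
                carved.append(data[position:position + length])
                position = data.find(b"BM", position + length)
                continue
        position = data.find(b"BM", position + 1)
    return carved
```

Unlike header/footer carving, no footer search is needed, but a chance occurrence of "BM" followed by a plausible number still yields a false positive.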
22. Data Carving tool capabilities
A disk containing 20 deleted files (five 100KB text files, five 5MB JPEGs,
five 90MB WMV videos and five 300MB AVI videos; approximate file sizes) is
imaged and stored as RAW/DD
1. PhotoRec recovered all texts and JPGs. 3 AVIs were recovered in
entirety, 2 were incomplete (but partially playable).
2. Scalpel – Recovered all JPGs and 3 incomplete (but partially
playable) AVIs. Did not extract WMV or txt
3. MagicRescue – Only recovers files it has a ‘recipe’ for (JPG, AVI,
but not txt or WMV) – recovered JPGs, but not AVI. Did not attempt
other formats.
4. Foremost - unable to recover any files
Planned Carver 2.0 may provide intelligent carving
http://www.forensicswiki.org/wiki/Carver_2.0_Planning_Pag
23. Real world Experience
Laptop containing 60GB hard disk in use for 6-7 years
•Able to extract 363 legitimate files,
but….
• Disk fragmentation a big problem!
• Data carving can take a loooonnng
time – potentially weeks or months
to perform in full
• Software instability
• Data carving requires a lot of disk
space to store extracted data files
• Large number of false positives
(fake files) produced
• Filestreams (e.g. images within
container) often extracted, but not
larger file (PowerPoint)
[Figure: Examples of incomplete & invalid data files]
24. Timeline visualisation
Chronological list of activities performed
on the host machine
Uses:
• Gain understanding of research
activities on machine
• Investigate a specific incident
•Traditionally concerned with File
creation/accessed/modification
•SuperTimeline tools being developed
that merge time data from multiple
sources.
• OSS Timescanner useful for
generating log of events
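A basic file-system timeline of the traditional kind can be sketched by collecting each file's timestamps and sorting them. The function name `build_timeline` is hypothetical, and this only covers file-system times, not the merged multi-source timelines mentioned above. Note that `st_ctime` means metadata change time on Unix-like systems and creation time on Windows, so the label used here is an approximation.

```python
from datetime import datetime, timezone
from pathlib import Path


def build_timeline(root):
    """Return a chronologically sorted list of (time, event, path) tuples
    for every file under `root`, using times reported by the file system."""
    events = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        info = path.stat()
        stamps = (
            ("modified", info.st_mtime),
            ("accessed", info.st_atime),
            ("changed/created", info.st_ctime),  # platform dependent
        )
        for label, stamp in stamps:
            when = datetime.fromtimestamp(stamp, tz=timezone.utc)
            events.append((when, label, str(path)))
    return sorted(events)
```

Running this against a live disk updates access times on some systems, which is another reason to work from a read-only image.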
25. Text Mining
Java characterisation tool (AQUA)
•Uses Apache Tika to obtain information
about a file collection and its textual
content
•Relative path, file name, size, modified
date, SHA-256 digest, MIME type,
•Word frequency of the generated
Lucene index
Stanford MUSE Java tool
Mailbox analysis
•Relationships - Grouping of contacts
•Name lists (people, places,
organizations)
•Sentiment analysis using word lists
– map over time
AQUA http://wiki.opf-labs.org/display/AQuA/Characterising+Externally+Generated+Content
Stanford Muse http://vis.stanford.edu/papers/muse
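The word-frequency idea can be shown with a plain-Python stand-in. This is not how AQUA, Apache Tika or Lucene work internally; it is a simplified sketch that assumes the text has already been extracted from the files, and the function name `word_frequency` is hypothetical.

```python
import re
from collections import Counter


def word_frequency(texts, top=10):
    """Count words across a collection of extracted text strings and
    return the `top` most common as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        # Lower-case and split on anything that is not a letter or apostrophe.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(top)
```

A real analysis would usually remove very common words (stop words) first, so that the counts highlight the vocabulary that distinguishes one collection from another.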
26. Conclusion (1): Challenges for use of
digital forensics in research
Expertise of the researcher
• Some technical expertise required to perform acquisition and
analysis
Ethics of a forensic investigation
• User may not realise that deleted/scraps of content
continues to exist - how do you communicate intent to your
user community?
• Terminology is currently influenced by law enforcement
community and is a barrier to wider use – forensics?
Suspect?
Capabilities of the tools
• No single tool is appropriate – require a combination of
different ones
• Some integration is necessary to simplify process.
27. Conclusion (2):
Current/Future challenges
Multi-user systems
• Distinguishing between data created by multiple users on
same machine is time-consuming - requires analysis of
timestamps and other features.
Archiving data on 3rd party services:
• Ethical issues associated with accessing & archiving user data
on mail servers, second life, and cloud providers etc.
Diverse device & media types:
• Solid State devices subject to ‘wear levelling’ which purges
inactive data
(http://www.jdfsl.org/subscriptions/abstracts/abstract-v5n3-bell.htm)
• Use of portable (personal/work) devices in the workplace, e.g.
iPad, iPhone, Android devices – what is the master copy?
28. References
Digital Forensics and Born-Digital Content in Cultural Heritage
Collections (2010)
http://www.clir.org/pubs/abstract/pub149abst.html
Performance Evaluation of Open-Source Disk Imaging Tools for
Collecting Digital Evidence
http://www.kuis.edu.my/ictconf/proceedings/353_integration2010_proceedings.pdf
The Evolution of File Carving (2009)
http://digital-assembly.com/technology/research/pubs/ieee-spm-2009.pdf
Hash Filtering techniques
http://computer-forensics.sans.org/blog/2010/02/22/extracting-known-bad-hashset-from-nsrl/
Digital Forensic tutorials http://computer-forensics.sans.org/blog/
Open Source Forensics http://www2.opensourceforensics.org/
Forensics Wiki http://www.forensicswiki.org/wiki/Main_Page