Semantically-Enabled Digital Investigations - Research Overview

A Method for Semantic Integration and Correlation of Digital Evidence
Using a Hypothesis-Based Approach
Semantically-Enabled Digital
Investigations
by Spyridon Dosis
February 2013, Stockholm

Problem Definition
 Sophisticated attacks against highly interconnected
networked systems.
 Multitude, variety and size of data sources with
possible evidentiary value.
 Need for continuous state-of-the-art technical
expertise.
 Evidence-oriented first-generation forensic tools
with poor integration and correlation features.
 Lack of common, standardized data
representation/abstraction formats.

Research Questions and Limitations
 How can the Semantic Web technologies and the Linked Data
initiative be applied to Digital Forensics?
 How can a common ontology-based knowledge representation
layer improve the level of integration of currently disjoint
specialized areas of DF such as storage, network, mobile, live
memory and others?
 How might such a new method improve the efficiency and
capabilities of existing DF investigation models, techniques and
tools?
 Not full coverage of the features and capabilities of the
Semantic Web technologies.
 Simplified complexity for the conducted experiments.

Digital Evidence
 “any digital data that contain reliable information
that supports or refutes a hypothesis about an
incident” – (Carrier & Spafford 2004)
 Continuously increasing scope
 Varying layers of abstraction
 (Schatz 2007) identifies 3 basic properties
 Latency -> Semantic Interpretation
 Fidelity -> Chain of Custody
 Volatility -> Order of Volatility

Digital Investigations
 The set of principles and methods that are followed
during the lifecycle of digital evidence with the goal
of event reconstruction.
 Slight definition variations among different contexts.
 The Event-based Digital Forensic Investigation
Framework (Carrier & Spafford 2004)
 System Preservation, Evidence Searching, Event
Reconstruction
 The Digital Investigation Process (Casey 2004)
 The Hypothesis-based Approach (Carrier 2006)

Semantic Web Technologies
 “… information is given well-defined meaning, better
enabling computers and people to work in
cooperation” – (Tim Berners-Lee 2001)
 Metadata – Annotation of data providing contextual
or domain-specific information about the content
 Ontology – “explicit and formal specification of a
conceptualization” – (Gruber 1993)
 Entities, attributes, interrelationships
 Open world assumption
 Reasoning over data by inferring implicit
conclusions

Semantic Web Architecture : Part A
adapted from Antoniou & Van Harmelen 2004
• URI/IRI enables unique
identification of a resource under a
global scope.
• XML provides a consistent
machine-consumable data encoding
scheme in an unambiguous scoped
manner.
• XML Schema is used for defining the
rules and the ‘tag’ vocabulary that
data must conform to.
• RDF provides a simple but flexible
data model for encoding metadata
• Subject-Predicate-Object
• RDF Schema is used for defining RDF
vocabularies
• Class and Property hierarchies
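The Subject-Predicate-Object model and the RDF Schema class hierarchy can be illustrated with a minimal sketch. It is not tied to any RDF library; the URIs and class names (`ex:ExecutableFile`, `ex:File`, `ex:DigitalObject`) are hypothetical, and only the subclass rule for `rdf:type` is applied.

```python
# Minimal sketch: RDF-style triples as (subject, predicate, object) tuples.
# All URIs and class names below are hypothetical illustrations.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

triples = {
    ("urn://disk1/file_42", RDF_TYPE, "ex:ExecutableFile"),
    ("ex:ExecutableFile", SUBCLASS_OF, "ex:File"),
    ("ex:File", SUBCLASS_OF, "ex:DigitalObject"),
}


def infer_types(graph):
    """Apply the RDFS subclass rule until no new rdf:type triple appears."""
    inferred = set(graph)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(inferred):
            if p != RDF_TYPE:
                continue
            for (c, q, d) in list(inferred):
                is_superclass = q == SUBCLASS_OF and c == o
                if is_superclass and (s, RDF_TYPE, d) not in inferred:
                    inferred.add((s, RDF_TYPE, d))
                    changed = True
    return inferred


closure = infer_types(triples)
```

After the call, `closure` also types the file as `ex:File` and `ex:DigitalObject`, which is the kind of implicit conclusion a class hierarchy makes available.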

Semantic Web Architecture : Part B
adapted from Antoniou & Van Harmelen 2004
• OWL 2 is a computational logic-based
language that enables automated
reasoning for inferencing and
consistency verification.
• Increased expressivity
• Property Restrictions
• Class and Property Equivalency
• Property Relationships
• Global Cardinality Constraints and
Individual Identity (no unique-
names assumption)
• OWL Dialects for varying levels of
expressiveness and computational
complexity.
• SWRL supports more advanced
reasoning cases.
• SPARQL is an RDF-based query
language and protocol

Previous Work #1
 XML-based Approaches
 Digital Forensics XML (Garfinkel 2009) for describing disk images
and their contents (partitions, files, byte runs).
 EDRM XML for describing electronic document metadata.
 XIRAF for XML-based extraction, storage and querying of evidence
files.
 DEX for including provenance-related metadata.
 Other domain-specific XML approaches for live forensics, network
forensics, vulnerability assessment, logs, malware.
 Support a level of tool interoperability and standardization
 No support for automated reasoning or semantic
integration of data.

Previous Work #2
 RDF-based Approaches
 AFF forensic format uses RDF for including arbitrary metadata
(system or process-related, user-specific ones)
 Strengthening the chain-of-custody by additional RDF metadata
(evidence-access, examiner or artifact-related information) (Giova
2011)
 Ontological Approaches
 FORE (Schatz 2004) comprises a log parser, a forensic ontology
and a custom rule language for aggregating lower level events into
higher level ones. Later expanded by referencing external ontologies.
 DIALOG conceptualized ‘procedural’ and ‘practical’ aspects of a
digital investigation with practical examples of registry analysis.
Later expanded with additional concepts for encoding forensically
relevant types of data.
 (Saad 2010) applied an ontology in the network forensics area for
modeling network attacks and supporting different types of
reasoning based on collected events

Methodology
 Two main research paradigms in IT (Hevner 2004)
 Behavioural Science
 Design Science
 Outcomes of a design science process can be:
 Constructs
 Models
 Methods
 Instantiations

Design Science Method
adapted from Johannesson & Perjons 2012
• Problem Specification
• Literature Review
• Case studies
• Empirical Observations
• Artifact Outline and
Requirements
• Literature Review
• Case Studies
• Design and Development
• Artifact Demonstration
• Laboratory Experiment
(Simulated cases)
• Artifact Evaluation
• “ex ante evaluation”
• Communication of the artifact

A Semantic Web approach for Digital Investigations
 Information Integration
 Common identifiers
 Different identifiers

A Semantic Web approach for Digital Investigations
 Semi-structured Data Support
 Classification and Inference
 Extensibility
 Provenance
 Named Graphs
 Search

Relation to Digital Investigation Reference Models
• Conceptual Mapping between the Semantic Web
architecture and digital investigation frameworks
• Previous phases are assumed as prerequisites

Evaluation Criteria
 Goal – Question – Metric (GQM) approach
Generic Criteria
Goal: The proposed method should be appropriate for the task at hand.
Questions: What is the relationship of the proposed method to existing digital investigation practices and tools? What are the case context requirements for the method to be applied?
Metric: The ability of the method to handle different types of cases (network-related events, media device examination etc.), measured by the number of different data types it can process.
Goal: The method should provide good support for decision-making by providing relevant and usable results.
Questions: What types of new knowledge can the method extract, and how useful is that knowledge? How can the examiner formulate and evaluate hypotheses about the evidence files and receive informative results?
Metric: The ability of the method to support arbitrary queries and provide answers over the whole body of collected evidence. This can be quantified by the precision and recall information retrieval measures over the query results.
Goal: The method should be cost effective in terms of storage and time needs.
Questions: How does the method accept and store input data, intermediate and final results? What are the storage requirements for such an implementation? How much time is needed for applying the method to the input data, and how can it reduce the time that the investigation process takes?
Metric: Storage size requirements for representing input and output data. Time needed for performing the analysis of data or evaluating user-submitted queries.
Goal: The method should be flexible and scalable.
Questions: Can the method deal with new sources of data and seamlessly integrate new forms of ontologically expressed knowledge and rules? Can the method support large amounts of data, and what problems may such complexity incur?
Metric: The ability of the method to process new data and accept additional ontologies or rules with minor (possibly no) modifications to the existing steps. It can be measured by the amount of configuration or code modifications such changes may require. The method’s ability to handle large amounts of data can be measured by the input size in relation to the processing time or produced errors (e.g. number of captured network packets, firewall logs, disk image sizes etc.).
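The precision and recall measures named in the decision-support metric can be computed as in the following sketch. The session identifiers are hypothetical, and the set of relevant results is assumed to come from a manually established ground truth for a simulated case.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of query results.

    precision = correct results / all returned results
    recall    = correct results / all results that should have been returned
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: sessions returned by a query vs. the ground truth.
retrieved = {"tcpSession_4", "tcpSession_6", "tcpSession_9"}
relevant = {"tcpSession_4", "tcpSession_6", "tcpSession_7", "tcpSession_8"}
p, r = precision_recall(retrieved, relevant)
```

Here two of the three returned sessions are correct and two of the four relevant sessions are found, so precision is 2/3 and recall is 1/2.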

Evaluation Criteria
Forensic Criteria
Goal: The method’s results should be reproducible.
Questions: Do the results of the method behave deterministically when applied to the same input data, or are they inconsistent across multiple tests?
Metric: The method’s results (e.g. inferred axioms, query results) should be the same given the same dataset, independently of other factors such as the order in which the evidence files are processed. This can be measured by the number of errors or differing results after multiple applications of the method to the same dataset.
Goal: The method’s possible errors should be minimal and determined.
Questions: Does the method produce accurate results? Can the method accept inconsistent or malformed input data? How does the method deal with incomplete data? Can the method produce results that are ambiguous or inconsistent with the specified ontologies?
Metric: The method’s results can be automatically checked by a reasoning engine for possible inconsistencies between asserted and inferred axioms and the given ontologies. The method’s error rate can be measured by the error messages produced during its lifecycle.
Goal: The method must provide logging capabilities for the inclusion of arbitrary metadata regarding the case, the entities and the evidence objects involved.
Questions: Does the method support the addition of annotation axioms with respect to the asserted or inferred axioms? Does the method allow its steps and their results to be logged as they are applied?
Metric: The ability to insert logging information during the method can be measured by its flexibility to accept arbitrary metadata.
Goal: The method should protect the integrity of the collected data.
Questions: Can the method operate on forensic copies of the collected evidence? Does the method use hashing algorithms to ensure the consistency and integrity of these forensic copies?
Metric: The method should protect the integrity of the collected data, files and devices throughout its whole lifecycle by being able to work on forensic copies instead of the originals and verify any hash values that these copies carry as forensic metadata. The ability to perform these checks for different data sources can be considered as a metric.
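The integrity check described in the last criterion can be sketched as follows, assuming the forensic copy is a regular file and the expected digest is available as a hexadecimal string in its metadata. The function names and the choice of SHA-256 as the default are illustrative assumptions, not part of the presented implementation.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large forensic images fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_copy(path, expected_hex, algorithm="sha256"):
    """Return True when the copy's digest matches the recorded value."""
    return file_digest(path, algorithm) == expected_hex.lower()
```

A parser could call `verify_copy` before loading an evidence file and refuse to continue on a mismatch; passing `algorithm="md5"` would cover copies whose metadata only carries an MD5 value.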

Evaluation Criteria
Semantic Web Related Criteria
Goal: System Heterogeneity – Platform Independence.
Questions: Can parts of the method be applied on different systems and the partial results later recombined? Are there any restrictions with respect to the configuration of these analysis systems?
Metric: The ability of the method to be successfully applied in different system configurations can be measured through multiple tests on different systems.
Goal: Implementable with the current Semantic Web Stack.
Questions: Can the method’s steps that utilize Semantic Web concepts be implemented with current technology, or are other improvements/extensions needed?
Metric: The method should be able to rely on existing Semantic Web technologies without the need to develop or improve their current status. Errors produced or modifications needed when implementing the proposed method can be considered a metric of how implementable the method currently is.
Goal: The method and its results should be semantically rich, allowing the description of high level contexts and events along with their interrelationships.
Questions: Can the method describe arbitrary data? Can the method accept descriptions of high level and user-defined concepts and associate sets of lower level events with them? Can the method establish relationships between these higher level descriptions?
Metric: The method should be able to accept user-defined high level concepts and associate lower level events with them using well defined rules/restrictions. Errors produced or the inability to define custom events can be considered as a metric of how semantically rich the method is.

Description of the Method
 Design structure of the method
 The Data Collection phase assumes proper acquisition
techniques and possible pre-processing tasks.
 Ontological representation maps the collected data to the
RDF data model using lightweight domain-specific ontologies.
 Automated Reasoning for inferring new axioms (class,
property, inverse property assertion axioms).
 Rule evaluation / integration with rule engines.
 Integrated query against the set of asserted and inferred
axioms.

Ontological Representation of Evidence
 Two types of data
 Case Related Data
 Storage Media Forensic Images, Network Packet Captures,
Firewall Logs
 Supportive Data
 WHOIS domain information, IP geo-location, IP to ASN
mappings, databases of malicious files or hosts
 Lightweight ontologies have been specified with the
Protégé Ontology Editor based on
 PCAP Network Captures, Disk Images, Windows XP Firewall
Logs, WHOIS RIPE Database, VirusTotal, FIRE malicious
networks tracker

Ontologies
 Network Capture
 Protocol stack
reconstruction
 Focused on HTTP
 W3C ERT RDF
vocabulary for HTTP
 Forensic Disk
Image
 DFXML and fiwalk
 Timestamps, hash
values, file type

Ontologies
 Windows XP
Firewall Log
 W3C Extended
Log File Format
 RIPE WHOIS
 RIPE NCC web
interface
 XML/JSON formatted
results

Ontologies
 Malicious Networks
 FIRE project
(Wombat EU FP7)
 Aggregation from
sources like
 Anubis, Wepawet,
SpamCop, PhishTank
 Web interface (Discontinued)
 Malware Detection
 VirusTotal provides a
web interface to a variety
of antimalware engines
 Database search web interface
based on hash values

Semantic Integration of Evidence
 URI Format
 urn://<source_id>/<resource_ID>
 Ontological representation
 Natively supported / Semantic Parsers
 De-duplication
 Single URI resource representation under the same namespace
 owl:sameAs for same resource / differently namespaced URIs
 OWL 2 hasKey
 SWRL rules for integrating individuals in different ontologies
 Realistic (manual) approach
 Integration ontology (IP address, MD5 hash value)
[Figure: PacketCapture:IPAddress linked to WindowsXPFirewallLog:Host through the PcapIPToFWLogHost property]
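The URI format and the key-based de-duplication above can be sketched in a few lines. The source and resource identifiers are hypothetical, and the linking function only imitates the effect of an OWL 2 hasKey axiom on a single key value (here an MD5 hash); it is not a reasoner.

```python
from collections import defaultdict
from urllib.parse import quote


def mint_uri(source_id, resource_id):
    """Build a URI following the urn://<source_id>/<resource_ID> pattern."""
    return "urn://{}/{}".format(quote(source_id, safe=""),
                                quote(resource_id, safe=""))


def same_as_links(individuals):
    """Link individuals from different sources that share a key value.

    `individuals` maps a URI to its key value (for example an MD5 hash).
    """
    by_key = defaultdict(list)
    for uri, key in individuals.items():
        by_key[key.lower()].append(uri)
    links = set()
    for uris in by_key.values():
        uris.sort()
        for i, a in enumerate(uris):
            for b in uris[i + 1:]:
                links.add((a, "owl:sameAs", b))
    return links


# Hypothetical individuals: the same file seen in two evidence sources.
disk_file = mint_uri("disk_image", "file_17")
vt_file = mint_uri("virustotal", "file_3")
links = same_as_links({
    disk_file: "D41D8CD98F00B204E9800998ECF8427E",
    vt_file: "d41d8cd98f00b204e9800998ecf8427e",
})
```

The hash comparison ignores letter case because different tools may report hexadecimal digests in upper or lower case.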

Semantic Correlation of Evidence
 Establishing relations between resources of different
nature.
 Temporal Correlation
 SWRL Temporal Ontology (O’Connor & Das 2011)
 Support for time instants and intervals
 Two approaches
 Modify existing ontologies by
importing the time ontology.
 Specifying existing classes as
subclasses of ‘ExtendedProposition’
in an external ontology.

Semantic Correlation of Evidence
 Temporal Correlation (Cont’d)
 Relations between time intervals
 Allen’s Interval Algebra (Allen 1983)
 Relations between time instants
and intervals
 ‘inside’, ‘before’, ‘after’ (Hobbs 2004)
 SWRL builtins
 Mereological Correlation
 ‘partOf’ relations
 Transitivity
 E.g. IP address (partOf) IP range (partOf) AS =>
IP address (partOfAS) AS
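The transitivity example can be reproduced with a small closure computation. This is a sketch of the inference only; the range name `ip_range_1` is a hypothetical placeholder, while the IP address and AS number are taken from the example query results in this deck.

```python
def transitive_closure(pairs):
    """Return all (a, c) pairs implied by chaining (a, b) and (b, c)."""
    closure = set(pairs)
    while True:
        new = {(a, d)
               for (a, b) in closure
               for (c, d) in closure
               if b == c} - closure
        if not new:
            return closure
        closure |= new


part_of = {
    ("78.46.173.193", "ip_range_1"),  # IP address partOf IP range
    ("ip_range_1", "AS24940"),        # IP range partOf autonomous system
}
closure = transitive_closure(part_of)
```

The closure contains the derived pair linking the IP address directly to the AS, which is what lets a query skip the intermediate range.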

Integrated Query Formulation and Evaluation
 Two methods of query
preparation
 Precomputing inferred axioms
 Back-propagation
 Two methods of query
evaluation
 Merging ontologies
 Named graphs
(Distributed SPARQL
Endpoints)
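The difference between the two evaluation options can be sketched with plain Python sets: merging puts all triples into one graph, while named graphs keep each source separate so that a result can still be traced to its origin. The graph names and individuals are hypothetical, and no SPARQL endpoint is involved.

```python
# Hypothetical named graphs: one set of triples per evidence source.
named_graphs = {
    "urn://pcap": {("ip_1", "hasIPValue", "78.46.173.193")},
    "urn://fwlog": {("host_1", "hasAddress", "78.46.173.193")},
}


def merge(graphs):
    """Merging ontologies: one flat graph, source information is lost."""
    merged = set()
    for triples in graphs.values():
        merged |= triples
    return merged


def find_with_provenance(graphs, predicate):
    """Named graphs: each match is reported with the graph it came from."""
    return sorted(
        (name, s, o)
        for name, triples in graphs.items()
        for (s, p, o) in triples
        if p == predicate
    )


merged = merge(named_graphs)
hits = find_with_provenance(named_graphs, "hasAddress")
```

Keeping the graph name next to each match is what makes the named-graph option attractive when provenance has to be reported.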

A Reference Implementation
 Tools Used
 Java 6
 Protégé 4.1.0
 OWL API 3.2.4
 Pellet 2.3.0
 Protégé OWL API 3.4.8
 Jena 2.6.4
 Jess 7.1p2
 Kraken Pcap API 1.3.0
 Apache HTTP Components, Jsoup, JSON

A Reference Implementation
• Evidence Manager
• Load evidence files
• Semantic Parser
• 6 parsers
• Filtering options (NIST NSRL)
can lead to a 40-50% reduction
of an XP image.
• Collector Objects
• Reduce complexity
• Coupled with parsers
• Inference Engine
• Class Assertion
• Inverse Property Assertion
• Integration Ontology
• Investigator-specific
classes/properties
• SWRL Rule Engine
• SPARQL In-memory endpoint
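The NSRL filtering option of the semantic parser can be sketched as a hash-set lookup. The file records and hash values below are hypothetical placeholders, and the sketch assumes the known-file hashes have already been loaded from a reference set; it does not parse the actual NSRL distribution format.

```python
def filter_known_files(files, known_hashes):
    """Drop files whose MD5 appears in a set of known-file hashes."""
    known = {h.lower() for h in known_hashes}
    return [f for f in files if f["md5"].lower() not in known]


# Hypothetical file records and reference hashes (not real digests).
files = [
    {"name": "notepad.exe", "md5": "0" * 32},
    {"name": "invoice.doc", "md5": "f" * 32},
]
known_hashes = {"0" * 32}
remaining = filter_known_files(files, known_hashes)
```

Only the files that are not in the reference set would then be handed to the ontological representation step, which is where the size reduction comes from.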

Experimental Setup
 2x HP Compaq 8000 Elite
 Intel Core 2 Duo E8400 Processor
 4 GB RAM
 Microsoft Windows XP SP3
 Backtrack 5 R1
 MS11-006
 Vulnerability in Windows Shell Graphics Processing
 Office documents in thumbnail mode
 Analysis Workstation
 Dell XPS 15
 Intel Core i7
 4GB RAM

Results (Experiment A)
CompromisedSystem.xml (Fiwalk output of the system’s disk image)
  Original Disk Size: 25 GB
  Original Fiwalk XML Output File Size: 9.46 MB
  RDF/XML Serialization File Size: 7.08 MB
  Number of Allocated Files in the Disk: 6610
  Number of Nodes in the Graph Representation: 34012
  Number of Edges in the Graph Representation: 83032
Network Packet Capture (filtered for the system’s IP address and TCP protocol only)
  Original File Size: 454 KB
  RDF/XML Serialization File Size: 662 KB
  Number of TCP Sessions: 40
Windows XP Firewall Log of the compromised system
  Number of Log Entries: 413
RIPE NCC WHOIS Database
  Number of Queried IP Addresses: 37
FIRE Malicious Networks Database
  Number of Queried Autonomous Systems: 5
VirusTotal Anti-Malware Web Service
  Number of Queried and Indexed by VT Files: 2304

Results (Experiment A)
 Reasoning Engine
 72130 inferred axioms (approx. 6.1 MB)
 SWRL Engine
 160 ‘bridging’ properties
 PacketCapture:hasIPValue(?x,?y) ^ WindowsXPFirewallLog:hasAddress(?w,?z) ^
swrlb:stringEqualIgnoreCase(?y,?z) ->
IntegrationOntology:PcapIPToFWLogHost(?x,?w)
 39610 time-related re-mapping properties
 DigitalMedia:File(?x) ^ DigitalMedia:hasFileModificationTime(?x,?y) ^
temporal:ValidInstant(?z) ^ temporal:hasTime(?z,?w) ^
swrlb:stringEqualIgnoreCase(?y,?w) ^
swrlx:makeOWLThing(?filemodificationevent,?x) ->
IntegrationOntology:FileModificationEvent(?filemodificationevent) ^
IntegrationOntology:Event(?filemodificationevent) ^
temporal:hasValidTime(?filemodificationevent,?z)
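The effect of the first ‘bridging’ rule can be mirrored in a short sketch: every packet-capture IP individual is paired with every firewall-log host whose address string matches, ignoring case. The individual names are hypothetical and the dictionaries stand in for the hasIPValue and hasAddress property assertions; this is not how the SWRL engine evaluates the rule internally.

```python
def bridge(pcap_ip_values, fwlog_addresses):
    """Pair individuals whose string values are equal, ignoring case.

    Mirrors: hasIPValue(?x,?y) ^ hasAddress(?w,?z) ^
             stringEqualIgnoreCase(?y,?z) -> PcapIPToFWLogHost(?x,?w)
    """
    by_value = {}
    for host, addr in fwlog_addresses.items():
        by_value.setdefault(addr.lower(), []).append(host)
    return sorted(
        (ip_ind, host)
        for ip_ind, value in pcap_ip_values.items()
        for host in by_value.get(value.lower(), [])
    )


# Hypothetical individuals from the two ontologies.
pcap_ip_values = {"pcap:ip_1": "78.46.173.193", "pcap:ip_2": "192.0.2.10"}
fwlog_addresses = {"fwlog:host_7": "78.46.173.193",
                   "fwlog:host_8": "198.51.100.5"}
bridges = bridge(pcap_ip_values, fwlog_addresses)
```

Indexing one side by value avoids comparing every pair of individuals, which matters once the number of hosts and addresses grows.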

Results (Experiment B)
CompromisedSystem.xml (Fiwalk output of the system’s disk image)
  Original Disk Size: 25 GB
  Original Fiwalk XML Output File Size: 9.34 MB
  Number of Allocated Files in the Disk: 3273
Network Packet Capture (filtered for the system’s IP address and TCP protocol only)
  Original File Size: 2.63 MB
  RDF/XML Serialization File Size: 2 MB
  Number of TCP Sessions: 57
Windows XP Firewall Log of the compromised system
  Number of Log Entries: 480
RIPE NCC WHOIS Database
  Number of Queried IP Addresses: 41
FIRE Malicious Networks Database
  Number of Queried Autonomous Systems: 5
VirusTotal Anti-Malware Web Service
  Number of Queried and Indexed by VT Files: 2540

Results (Experiment B)
 Additional Temporal Rules
 temporalBefore between
 Time Instants
 Time Intervals
 Time Instants and Time Periods
 Time Periods and Time Instants
 temporalStarts
 temporalInside
 1024 ValidInstant individuals
 21 ValidPeriod individuals
 58854 inferred temporal relations
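The temporal relations listed above can be sketched as simple comparisons. The definitions follow the usual reading of Allen's 'before' and 'starts' and of an instant lying 'inside' an interval; the strict handling of interval boundaries is an assumption, and the timestamps are hypothetical.

```python
from datetime import datetime


def instant_before_instant(a, b):
    """Instant a occurs strictly before instant b."""
    return a < b


def interval_before(a, b):
    """Allen 'before': interval a ends strictly before interval b starts."""
    return a[1] < b[0]


def interval_starts(a, b):
    """Allen 'starts': a and b begin together and a ends first."""
    return a[0] == b[0] and a[1] < b[1]


def instant_inside(t, interval):
    """Instant t lies strictly within the interval (start, end)."""
    return interval[0] < t < interval[1]


# Hypothetical timestamps: a file modification and two TCP sessions.
file_modified = datetime(2012, 1, 1, 10, 5, 0)
session = (datetime(2012, 1, 1, 10, 0, 0), datetime(2012, 1, 1, 10, 10, 0))
later_session = (datetime(2012, 1, 1, 11, 0, 0), datetime(2012, 1, 1, 11, 30, 0))
```

Evaluating such relations between every pair of instants and periods grows quadratically with their number, which is consistent with the large count of inferred temporal relations reported for a modest number of individuals.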

Example Hypotheses - Queries
Hypothesis
The investigator hypothesizes that the compromised system may have had network
communications with external IP addresses that belong to autonomous systems that may be
listed as malicious networks.
Query
SELECT ?tcpflow ?destipvalue ?netname ?asnumber ?host_fire
WHERE {
?tcpflow packetcapture:hasDestinationIP ?destip .
?destip packetcapture:hasIPValue ?destipvalue .
?destip integration:PcapIPToWHOISIpAddr ?whoisip .
?whoisip whois:isContainedInRange ?range .
?whoisip integration:WHOISIpAddrToFireIPAddr ?fireip .
?fireip fire:IPbelongsToHost ?host_fire .
?host_fire rdf:type fire:MaliciousHost .
?range whois:hasRange ?rangeValue .
?range whois:isContainedInAS ?as .
?as whois:hasNetName ?netname .
?as whois:hasASNumber ?asnumber .
?as whois:hasRoute ?route
}
Results
Row 1
  tcpflow: <urn://bind_tcp_FWed_tcp.pcap#tcpSession_6>
  destipvalue: "78.46.173.193"^^<http://www.w3.org/2001/XMLSchema#string>
  netname: "HETZNER-AS"^^<http://www.w3.org/2001/XMLSchema#string>
  asnumber: "24940"^^<http://www.w3.org/2001/XMLSchema#string>
Row 2
  tcpflow: <urn://bind_tcp_FWed_tcp.pcap#tcpSession_4>
  destipvalue: "78.46.173.193"^^<http://www.w3.org/2001/XMLSchema#string>
  netname: "HETZNER-AS"^^<http://www.w3.org/2001/XMLSchema#string>
  asnumber: "24940"^^<http://www.w3.org/2001/XMLSchema#string>
Interpretation
The results of the query support the hypothesis that the compromised system did indeed have
network communications with IP addresses that belong to autonomous systems known to
demonstrate malicious behavior. The query matches a graph pattern in the provided
dataset, thus retrieving additional information regarding the specific blacklisted AS.
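How a query of this kind matches a graph pattern can be shown with a toy evaluator for a conjunction of triple patterns. It is a simplified stand-in for a SPARQL engine, uses a shortened version of the pattern, and the individual names (`ip_1`, `whois_ip_1`, `range_1`, `as_1`) are hypothetical.

```python
def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`."""
    new_binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if new_binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new_binding


def query(patterns, triples):
    """Evaluate a conjunction of triple patterns over a list of triples."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            for extended in [match_pattern(pattern, triple, binding)]
            if extended is not None
        ]
    return bindings


triples = [
    ("tcpSession_6", "packetcapture:hasDestinationIP", "ip_1"),
    ("ip_1", "packetcapture:hasIPValue", "78.46.173.193"),
    ("tcpSession_1", "packetcapture:hasDestinationIP", "ip_2"),
    ("ip_2", "packetcapture:hasIPValue", "192.0.2.10"),
    ("ip_1", "integration:PcapIPToWHOISIpAddr", "whois_ip_1"),
    ("whois_ip_1", "whois:isContainedInRange", "range_1"),
    ("range_1", "whois:isContainedInAS", "as_1"),
    ("as_1", "whois:hasNetName", "HETZNER-AS"),
]
patterns = [
    ("?tcpflow", "packetcapture:hasDestinationIP", "?destip"),
    ("?destip", "packetcapture:hasIPValue", "?destipvalue"),
    ("?destip", "integration:PcapIPToWHOISIpAddr", "?whoisip"),
    ("?whoisip", "whois:isContainedInRange", "?range"),
    ("?range", "whois:isContainedInAS", "?as"),
    ("?as", "whois:hasNetName", "?netname"),
]
results = query(patterns, triples)
```

Only the session whose destination IP has a bridging link into the WHOIS data survives all six patterns, which is why the integration properties are essential for cross-source queries.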

Evaluation
 The method is applicable to many different kinds of cases
due to its ability to deal with heterogeneous data.
 Ability to formulate complex and expressive queries
over the integrated data that closely match logical
hypotheses
 Efficient data abstraction and query evaluation,
given axiom pre-inference
 Inverse object properties can considerably improve query
evaluation time
 Evidence-neutral implementation
 Temporal correlation can be computationally demanding

Evaluation
 Reliance on online sources may affect the precision of
the results.
 Ontological consistency of the results given valid
ontologies.
 The implementation can be system-independent.
 Ontologies can be dynamically expanded or new
ones (case-specific) introduced.
