Semantically-Enabled Digital Investigations - Research Overview
1. A Method for Semantic Integration and Correlation of Digital Evidence Using a Hypothesis-Based Approach
Semantically-Enabled Digital
Investigations
by Spyridon Dosis
February 2013, Stockholm
2. Problem Definition
Sophisticated attacks against highly interconnected
networked systems.
Multitude, variety and size of data sources with
possible evidentiary value.
Need for continuous state-of-the-art technical
expertise.
Evidence-oriented first-generation forensic tools
with poor integration and correlation features.
Lack of common, standardized data
representation/abstraction formats.
3. Research Questions and Limitations
How can the Semantic Web technologies and the Linked Data
initiative be applied to Digital Forensics?
How can a common ontology-based knowledge representation
layer improve the level of integration of currently disjoint
specialized areas of DF such as storage, network, mobile, live
memory and others?
How might such a new method improve the efficiency and
capabilities of existing DF investigation models, techniques and
tools?
Not full coverage of the features and capabilities of the
Semantic Web technologies.
Simplified complexity for the conducted experiments.
4. Digital Evidence
“any digital data that contain reliable information
that supports or refutes a hypothesis about an
incident” – (Carrier & Spafford 2004)
Continuously increasing scope
Varying layers of abstraction
(Schatz 2007) identifies 3 basic properties
Latency -> Semantic Interpretation
Fidelity -> Chain of Custody
Volatility -> Order of Volatility
5. Digital Investigations
The set of principles and methods that are followed
during the lifecycle of digital evidence with the goal
of event reconstruction.
Slight definition variations among different contexts.
The Event-based Digital Forensic Investigation
Framework (Carrier & Spafford 2004)
System Preservation, Evidence Searching, Event
Reconstruction
The Digital Investigation Process (Casey 2004)
The Hypothesis-based Approach (Carrier 2006)
6. Semantic Web Technologies
“… information is given well-defined meaning, better
enabling computers and people to work in
cooperation” – (Tim Berners-Lee 2001)
Metadata – Annotation of data providing contextual
or domain-specific information about the content
Ontology – “explicit and formal specification of a
conceptualization” – (Gruber 1993)
Entities, attributes, interrelationships
Open world assumption
Reasoning over data by inferencing implicit
conclusions
7. Semantic Web Architecture : Part A
adapted from Antoniou & Van Harmelen 2004
• URI/IRI enables unique
identification of a resource under a
global scope.
• XML provides a consistent
machine-consumable data encoding
scheme in an unambiguous scoped
manner.
• XML Schema is used for defining the
rules and the ‘tag’ vocabulary that
data must conform to.
• RDF provides a simple but flexible
data model for encoding metadata
• Subject-Predicate-Object
• RDF Schema used for defining RDF
vocabularies
• Class and Property hierarchies
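The Subject-Predicate-Object model and the class hierarchies mentioned above can be illustrated with a minimal sketch. It uses plain Python tuples rather than an RDF library, and every URI and class name in it (`ex:TcpSession`, `ex:NetworkEvent`, `ex:Event`, the `urn://capture1/...` resource) is a hypothetical example, not taken from the ontologies described in this deck.

```python
# Toy RDF graph: each statement is a (subject, predicate, object) triple.
# All URIs and class names are hypothetical, chosen only for illustration.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

triples = {
    ("ex:TcpSession", SUBCLASS_OF, "ex:NetworkEvent"),
    ("ex:NetworkEvent", SUBCLASS_OF, "ex:Event"),
    ("urn://capture1/tcpSession_6", RDF_TYPE, "ex:TcpSession"),
}


def superclasses(cls, graph):
    """Return every class reachable from cls via rdfs:subClassOf."""
    found, stack = set(), [cls]
    while stack:
        current = stack.pop()
        for s, p, o in graph:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                stack.append(o)
    return found


def infer_types(graph):
    """Add the rdf:type triples implied by the class hierarchy."""
    inferred = set(graph)
    for s, p, o in graph:
        if p == RDF_TYPE:
            for parent in superclasses(o, graph):
                inferred.add((s, RDF_TYPE, parent))
    return inferred


closed = infer_types(triples)
```

After this runs, `closed` also states that the session is an `ex:NetworkEvent` and an `ex:Event`, which is the kind of implicit conclusion an RDFS-aware reasoner would draw.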
8. Semantic Web Architecture : Part B
adapted from Antoniou & Van Harmelen 2004
• OWL 2 is a computational logic-based
language that enables automated
reasoning for inferencing and
consistency verification.
• Increased expressivity
• Property Restrictions
• Class and Property Equivalency
• Property Relationships
• Global Cardinality Constraints and
Individual Identity (no unique-
names assumption)
• OWL Dialects for varying levels of
expressiveness and computational
complexity.
• SWRL supports more advanced
reasoning cases.
• SPARQL is an RDF-based query
language and protocol
9. Previous Work #1
XML-based Approaches
Digital Forensics XML (Garfinkel 2009) for describing disk images
and their contents (partitions, files, byte runs).
EDRM XML for describing electronic document metadata.
XIRAF for XML-based extraction, storage and querying of evidence
files.
DEX for including provenance-related metadata.
Other domain-specific XML approaches for live forensics, network
forensics, vulnerability assessment, logs, malware.
Support a level of tool interoperability and standardization
No support for automated reasoning or semantic
integration of data.
10. Previous Work #2
RDF-based Approaches
AFF forensic format uses RDF for including arbitrary metadata
(system or process-related, user-specific ones)
Strengthening the chain-of-custody by additional RDF metadata
(evidence-access, examiner or artifact-related information) (Giova
2011)
Ontological Approaches
FORE (Schatz 2004) comprises a log parser, a forensic ontology
and a custom rule language for aggregating lower-level events into
higher-level ones. Later expanded by referencing external ontologies.
DIALOG conceptualized ‘procedural’ and ‘practical’ aspects of a
digital investigation with practical examples of registry analysis.
Later expanded with additional concepts for encoding forensically
relevant types of data.
(Saad 2010) applied an ontology in the network forensics area for
modeling network attacks and supporting different types of
reasoning based on collected events
11. Methodology
Two main research paradigms in IT (Hevner 2004)
Behavioural Science
Design Science
Outcomes of a design science process can be:
Constructs
Models
Methods
Instantiations
12. Design Science Method
adapted from Johannesson & Perjons 2012
• Problem Specification
• Literature Review
• Case studies
• Empirical Observations
• Artifact Outline and
Requirements
• Literature Review
• Case Studies
• Design and Development
• Artifact Demonstration
• Laboratory Experiment
(Simulated cases)
• Artifact Evaluation
• “ex ante evaluation”
• Communication of the artifact
13. A Semantic Web approach for Digital Investigations
Information Integration
Common identifiers
Different identifiers
14. A Semantic Web approach for Digital Investigations
Semi-structured Data Support
Classification and Inference
Extensibility
Provenance
Named Graphs
Search
15. Relation to Digital Investigation Reference Models
• Conceptual Mapping between the Semantic Web
architecture and digital investigation frameworks
• Previous phases are assumed as prerequisites
16. Evaluation Criteria
Goal – Question – Metric (GQM) approach
Generic Criteria
Goal:
The proposed method should be appropriate for the task at hand.
Questions:
What is the relationship of the proposed method to existing digital investigation practices and tools?
What are the case context requirements for the method to be applied?
Metric:
The ability of the method to handle different types of cases (network-related events, media device examination, etc.), measured by the number of different data types it can process.

Goal:
The method should provide good support for decision-making by providing relevant and usable results.
Questions:
What types of new knowledge can the method extract, and how useful is that knowledge?
How can the examiner formulate and evaluate hypotheses about the evidence files and receive informative results?
Metric:
The ability of the method to support arbitrary queries and provide answers over the whole body of collected evidence. This can be quantified by the precision and recall information retrieval measures over the query results.

Goal:
The method should be cost-effective in terms of storage and time needs.
Questions:
How does the method accept and store input data and intermediate and final results? What are the storage requirements for such an implementation?
How much time is needed for applying the method to the input data, and how can it reduce the time that the investigation process takes?
Metric:
Storage size requirements for representing input and output data.
Time needed for performing the analysis of data or evaluating user-submitted queries.

Goal:
The method should be flexible and scalable.
Questions:
Can the method deal with new sources of data and seamlessly integrate new forms of ontologically expressed knowledge and rules?
Can the method support large amounts of data, and what problems may such complexity incur?
Metric:
The ability of the method to process new data and accept additional ontologies or rules with few or no modifications to the existing steps. It can be measured by the amount of configuration or code modification such changes require.
The method’s ability to handle large amounts of data. It can be measured by the input size (e.g. number of captured network packets, firewall logs, disk image sizes) in relation to the processing time or the errors produced.
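The precision and recall measures named in the decision-support metric can be computed as follows. This is a minimal sketch under the assumption that the examiner has a ground-truth set of relevant results for a query; the function name and the session identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: items returned by the query
    relevant:  items that should have been returned (ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the query returns three sessions, two of which
# are truly relevant, and no relevant session is missed.
p, r = precision_recall(
    retrieved={"tcpSession_4", "tcpSession_6", "tcpSession_9"},
    relevant={"tcpSession_4", "tcpSession_6"},
)
```

Here precision is two thirds (one returned session is not relevant) and recall is 1.0 (nothing relevant was missed).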
17. Evaluation Criteria
Forensic Criteria
Goal:
The method’s results should be reproducible.
Questions:
Do the results of the method behave deterministically when the method is applied to the same input data, or are they inconsistent across multiple tests?
Metric:
The method’s results (e.g. inferred axioms, query results) should be the same given the same dataset, independently of other factors such as the order in which the evidence files are processed. This can be measured by the number of errors or differing results after multiple applications of the method to the same dataset.

Goal:
The method’s possible errors should be minimal and determined.
Questions:
Does the method produce accurate results? Can the method accept inconsistent or malformed input data? How does the method deal with incomplete data? Can the method produce results that are ambiguous or inconsistent with the specified ontologies?
Metric:
The method’s results can be automatically checked by a reasoning engine for possible inconsistencies between asserted and inferred axioms and the given ontologies. The method’s error rate can be measured by the error messages produced during its lifecycle.

Goal:
The method must provide logging capabilities for the inclusion of arbitrary metadata regarding the case, the entities and the evidence objects involved.
Questions:
Does the method support the addition of annotation axioms with respect to the asserted or inferred axioms?
Does the method allow logging of its steps as they are applied, together with the results they produce?
Metric:
The ability to insert logging information during the method can be measured by its flexibility to accept arbitrary metadata.

Goal:
The method should protect the integrity of the collected data.
Questions:
Can the method operate on forensic copies of the collected evidence?
Does the method use hashing algorithms in order to ensure the consistency and integrity of these forensic copies?
Metric:
The method should protect the integrity of the collected data, files and devices throughout its whole lifecycle by working on forensic copies instead of the originals and verifying any hash values that these copies carry as forensic metadata. The ability to perform these checks for different data sources can be considered a metric.
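The integrity criterion (verifying the hash value that a forensic copy carries as metadata) can be sketched with Python's standard `hashlib` module. The function names, the choice of SHA-256 and the throwaway file standing in for a disk image are assumptions made for illustration; the slides do not specify how the reference implementation performs this check.

```python
import hashlib
import os
import tempfile


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(path, recorded_hex, algorithm="sha256"):
    """Compare the current hash with the one recorded as forensic metadata."""
    return file_digest(path, algorithm) == recorded_hex.strip().lower()


# Demonstration on a throwaway file standing in for a forensic image.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"stand-in for a disk image")
    image_path = tmp.name

recorded = file_digest(image_path)            # hash recorded at acquisition
intact = verify_copy(image_path, recorded)    # copy is unchanged

with open(image_path, "ab") as handle:
    handle.write(b"!")                        # simulate a modification
tampered = not verify_copy(image_path, recorded)

os.remove(image_path)
```

Running the same check against each data source (disk image, packet capture, log file) before parsing would give one simple way to count how many sources the integrity check covers.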
18. Evaluation Criteria
Semantic Web Related Criteria
Goal:
System Heterogeneity – Platform Independence
Questions:
Can parts of the method be applied on different systems and the partial results later recombined? Are there any restrictions with respect to the configuration of these analysis systems?
Metric:
The ability of the method to be successfully applied in different system configurations can be measured through multiple tests on different systems.

Goal:
Implementable with the current Semantic Web Stack
Questions:
Can the method’s steps that utilize Semantic Web concepts be implemented with current technology, or are further improvements/extensions needed?
Metric:
The method should be able to rely on existing Semantic Web technologies without the need to develop or improve their current status. Errors produced or modifications needed when implementing the proposed method can be considered a metric of how implementable the method currently is.

Goal:
The method and its results should be semantically rich, allowing the description of high-level contexts and events along with their interrelationships.
Questions:
Can the method describe arbitrary data? Can the method accept descriptions of high-level and user-defined concepts and associate sets of lower-level events with them? Can the method establish relationships between these higher-level descriptions?
Metric:
The method should be able to accept user-defined high-level concepts and associate lower-level events with them using well-defined rules/restrictions. Errors produced or the inability to define custom events can be considered a metric of how semantically rich the method is.
19. Description of the Method
Design structure of the method
The Data Collection phase assumes proper acquisition
techniques and possible pre-processing tasks.
Ontological representation maps the collected data to the RDF
data model using lightweight domain-specific ontologies.
Automated Reasoning for inferring new axioms (class,
property, inverse property assertion axioms).
Rule evaluation / integration with rule engines.
Integrated query against the set of asserted and inferred
axioms.
20. Ontological Representation of Evidence
Two types of data
Case Related Data
Storage Media Forensic Images, Network Packet Captures,
Firewall Logs
Supportive Data
WHOIS domain information, IP geo-location, IP to ASN
mappings, databases of malicious files or hosts
Lightweight ontologies have been specified with the
Protégé Ontology Editor based on
PCAP Network Captures, Disk Images, Windows XP Firewall
Logs, WHOIS RIPE Database, VirusTotal, FIRE malicious
networks tracker
21. Ontologies
Network Capture
Protocol stack
reconstruction
Focused on HTTP
W3C ERT RDF
vocabulary for HTTP
Forensic Disk
Image
DFXML and fiwalk
Timestamps, hash
values, file type
22. Ontologies
Windows XP
Firewall Log
W3C Extended
Log File Format
RIPE WHOIS
RIPE NCC web
interface
XML/JSON formatted
results
23. Ontologies
Malicious Networks
FIRE project
(Wombat EU FP7)
Aggregation from
sources like
Anubis, Wepawet,
SpamCop, PhishTank
Web interface (Discontinued)
Malware Detection
VirusTotal provides a
web interface to a variety
of antimalware engines
Database search web interface
based on hash values
24. Semantic Integration of Evidence
URI Format
urn://<source_id>/<resource_ID>
Ontological representation
Natively supported / Semantic Parsers
De-duplication
Single URI resource representation under the same namespace
owl:sameAs for same resource / differently namespaced URIs
OWL 2 hasKey
SWRL rules for integrating individuals in different ontologies
Realistic (manual) approach
Integration ontology (IP address, MD5 hash value)
PacketCapture :
IPAddress
WindowsXPFirewallLog :
Host
PcapIPToFWLogHost
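The integration step above, where an IP address individual from the packet capture ontology is linked to a host individual from the firewall log ontology, can be sketched as a join on a shared key value. This is an illustration in plain Python, not the SWRL or OWL 2 hasKey mechanism itself; only the property name `PcapIPToFWLogHost` and the `urn://<source_id>/<resource_ID>` URI shape come from the slide, while the concrete URIs, the `integration:` prefix on that property and the `192.0.2.10` address are hypothetical.

```python
def link_by_value(pcap_ips, fw_hosts):
    """Link individuals from two sources that share the same key value.

    pcap_ips: maps packet-capture IPAddress URIs to their IP string
    fw_hosts: maps firewall-log Host URIs to their IP string
    Returns integration triples (pcap_uri, property, fw_uri).
    """
    hosts_by_value = {}
    for uri, value in fw_hosts.items():
        hosts_by_value.setdefault(value, []).append(uri)

    links = []
    for uri, value in sorted(pcap_ips.items()):
        for host_uri in sorted(hosts_by_value.get(value, [])):
            links.append((uri, "integration:PcapIPToFWLogHost", host_uri))
    return links


# Hypothetical individuals following the urn://<source_id>/<resource_ID> shape.
pcap_ips = {
    "urn://capture.pcap/ip_1": "78.46.173.193",
    "urn://capture.pcap/ip_2": "192.0.2.10",
}
fw_hosts = {
    "urn://pfirewall.log/host_7": "78.46.173.193",
}

links = link_by_value(pcap_ips, fw_hosts)
```

Only `ip_1` is linked, because it is the only packet-capture address whose value also appears in the firewall log. Keeping the link as a separate integration property, rather than merging the two individuals, leaves the source ontologies untouched.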
25. Semantic Correlation of Evidence
Establishing relations between resources of different
nature.
Temporal Correlation
SWRL Temporal Ontology (O’Connor & Das 2011)
Support for time instants and intervals
Two approaches
Modify existing ontologies by
importing the time ontology.
Specifying existing classes as
subclasses of ‘ExtendedProposition’
in an external ontology.
26. Semantic Correlation of Evidence
Temporal Correlation (Cont’d)
Relations between time intervals
Allen’s Interval Algebra (Allen 1983)
Relations between time instants
and intervals
‘inside’, ‘before’, ‘after’ (Hobbs 2004)
SWRL builtins
Mereological Correlation
‘partOf’ relations
Transitivity
E.g. IP address (partOf) IP range (partOf) AS =>
IP address (partOfAS) AS
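The transitivity example above (IP address partOf IP range partOf AS, therefore IP address partOfAS AS) amounts to computing a transitive closure over ‘partOf’ pairs. A minimal sketch, assuming the relation is given as a set of (part, whole) pairs; the identifier `range:R1` is a hypothetical placeholder for an IP range.

```python
def transitive_closure(pairs):
    """Return all (part, whole) pairs implied by transitivity."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


part_of = {
    ("ip:78.46.173.193", "range:R1"),   # IP address partOf IP range
    ("range:R1", "as:24940"),           # IP range partOf autonomous system
}

closure = transitive_closure(part_of)
```

The closure contains the derived pair `("ip:78.46.173.193", "as:24940")`, which corresponds to the inferred partOfAS relation. An OWL reasoner reaches the same conclusion when the property is declared transitive.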
27. Integrated Query Formulation and Evaluation
Two methods of query
preparation
Precomputing inferred axioms
Back-propagation
Two methods of query
evaluation
Merging ontologies
Named graphs
(Distributed SPARQL
Endpoints)
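The two query evaluation options can be contrasted with a small sketch: merging puts all triples into one set, while named graphs keep each source in its own graph so every answer still carries its origin. The graph names, URIs and the `fwlog:hasIPValue` property are hypothetical, and the dataset is modeled as a plain dictionary rather than a SPARQL endpoint.

```python
# Each named graph holds the triples of one evidence source.
dataset = {
    "urn://graph/pcap": {
        ("urn://pcap/ip_1", "packetcapture:hasIPValue", "78.46.173.193"),
    },
    "urn://graph/fwlog": {
        ("urn://fwlog/host_1", "fwlog:hasIPValue", "78.46.173.193"),
    },
}


def merge(dataset):
    """Option 1: merge everything into a single graph (provenance is lost)."""
    merged = set()
    for triples in dataset.values():
        merged |= triples
    return merged


def find(dataset, subject=None, predicate=None, obj=None):
    """Option 2: query across named graphs; None acts as a wildcard.

    Returns (graph_name, subject, predicate, object) tuples.
    """
    results = []
    for graph_name in sorted(dataset):
        for s, p, o in sorted(dataset[graph_name]):
            if subject in (None, s) and predicate in (None, p) and obj in (None, o):
                results.append((graph_name, s, p, o))
    return results


all_triples = merge(dataset)
hits = find(dataset, obj="78.46.173.193")
```

Both approaches find the same two statements about the IP value, but only the named-graph result records which evidence source each statement came from.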
28. A Reference Implementation
Tools Used
Java 6
Protégé 4.1.0
OWL API 3.2.4
Pellet 2.3.0
Protégé OWL API 3.4.8
Jena 2.6.4
Jess 7.1p2
Kraken Pcap API 1.3.0
Apache HTTP Components, Jsoup, JSON
29. A Reference Implementation
• Evidence Manager
• Load evidence files
• Semantic Parser
• 6 parsers
• Filtering options (NIST NSRL)
can lead to 40-50% reduction
of an XP image.
• Collector Objects
• Reduce complexity
• Coupled with parsers
• Inference Engine
• Class Assertion
• Inverse Property Assertion
• Integration Ontology
• Investigator-specific
classes/properties
• SWRL Rule Engine
• SPARQL In-memory endpoint
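The NSRL-based filtering option listed under the semantic parsers can be illustrated as a known-file filter: file records whose hash appears in a reference set are dropped before any triples are generated. This is a sketch under assumptions, since the slides do not describe how the filter is implemented; the record layout, file names and hash set are hypothetical and built from placeholder byte strings, not real NSRL data.

```python
import hashlib


def md5_hex(data):
    """MD5 of a byte string, used here only to fabricate example hashes."""
    return hashlib.md5(data).hexdigest()


def drop_known(files, known_hashes):
    """Keep only the file records whose hash is not in the reference set."""
    return [f for f in files if f["md5"].lower() not in known_hashes]


# Hypothetical reference set standing in for an NSRL hash list.
known_hashes = {md5_hex(b"stock operating system file")}

# Hypothetical file records, e.g. as parsed from a fiwalk XML report.
files = [
    {"name": "notepad.exe", "md5": md5_hex(b"stock operating system file")},
    {"name": "invoice.doc", "md5": md5_hex(b"user document")},
]

remaining = drop_known(files, known_hashes)
```

Only `invoice.doc` remains, so fewer individuals have to be created, reasoned over and queried.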
30. Experimental Setup
2x HP Compaq 8000 Elite
Intel Core 2 Duo E8400 Processor
4 GB RAM
Microsoft Windows XP SP3
Backtrack 5 R1
MS11-006
Vulnerability in Windows Shell Graphics Processing
Office documents in thumbnail mode
Analysis Workstation
Dell XPS 15
Intel Core i7
4GB RAM
33. Results (Experiment A)
CompromisedSystem.xml (Fiwalk output of the system’s disk image)
Original Disk Size 25GB
Original Fiwalk XML output File Size 9,46MB
RDF/XML Serialization File Size 7,08MB
Number of Allocated Files in the Disk 6610
Number of Nodes in the Graph Representation 34012
Number of Edges in the Graph Representation 83032
Network Packet Capture (filtered for the system’s IP address and TCP protocol only)
Original File Size 454KB
RDF/XML Serialization File Size 662KB
Number of TCP sessions 40
Number of Nodes in the Graph Representation 1616
Number of Edges in the Graph Representation 5891
Windows XP Firewall Log of the compromised system
Original File Size 38KB
RDF/XML Serialization File Size 684KB
Number of Log Entries 413
Number of Nodes in the Graph Representation 1344
Number of Edges in the Graph Representation 5866
RIPE NCC WHOIS Database
RDF/XML Serialization File Size 210KB
Number of Queried IP Addresses 37
Number of Nodes in the Graph Representation 137
Number of Edges in the Graph Representation 395
FIRE Malicious Networks Database
RDF/XML Serialization File Size 113KB
Number of Queried Autonomous Systems 5
Number of Nodes in the Graph Representation 384
Number of Edges in the Graph Representation 1083
VirusTotal Anti-Malware Web Service
RDF/XML Serialization File Size 2,45MB
Number of Queried and Indexed by VT Files 2304
Number of Nodes in the Graph Representation 11519
Number of Edges in the Graph Representation 18508
35. Results (Experiment B)
CompromisedSystem.xml (Fiwalk output of the system’s disk image)
Original Disk Size 25GB
Original Fiwalk XML output File Size 9,34MB
RDF/XML Serialization File Size 6,44MB
Number of Allocated Files in the Disk 3273
Number of Nodes in the Graph Representation 16330
Number of Edges in the Graph Representation 45039
Network Packet Capture (filtered for the system’s IP address and TCP protocol only)
Original File Size 2,63MB
RDF/XML Serialization File Size 2MB
Number of TCP sessions 57
Number of Nodes in the Graph Representation 5419
Number of Edges in the Graph Representation 21712
Windows XP Firewall Log of the compromised system
Original File Size 46KB
RDF/XML Serialization File Size 784KB
Number of Log Entries 480
Number of Nodes in the Graph Representation 1510
Number of Edges in the Graph Representation 6794
RIPE NCC WHOIS Database
RDF/XML Serialization File Size 38KB
Number of Queried IP Addresses 41
Number of Nodes in the Graph Representation 181
Number of Edges in the Graph Representation 326
FIRE Malicious Networks Database
RDF/XML Serialization File Size 113KB
Number of Queried Autonomous Systems 5
Number of Nodes in the Graph Representation 384
Number of Edges in the Graph Representation 1083
VirusTotal Anti-Malware Web Service
RDF/XML Serialization File Size 54KB
Number of Queried and Indexed by VT Files 2540
Number of Nodes in the Graph Representation 253
Number of Edges in the Graph Representation 386
36. Results (Experiment B)
Additional Temporal Rules
temporalBefore between
Time Instants
Time Intervals
Time Instants and Time Periods
Time Periods and Time Instants
temporalStarts
temporalInside
1024 ValidInstant individuals
21 ValidPeriod individuals
58854 inferred temporal relations
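The temporal relations listed above can be sketched as simple comparisons on numeric timestamps. This is an illustration only: the function names are hypothetical, the strict inequalities are an assumption about how the boundary cases are treated, and the rules in the experiment were written in SWRL rather than Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    start: float
    end: float


def before_instants(a, b):
    return a < b


def before_periods(p, q):
    return p.end < q.start


def instant_before_period(t, p):
    return t < p.start


def period_before_instant(p, t):
    return p.end < t


def starts(p, q):
    """Allen's 'starts': same beginning, p ends first."""
    return p.start == q.start and p.end < q.end


def inside(t, p):
    """Instant strictly inside a period (boundaries excluded by assumption)."""
    return p.start < t < p.end


# Hypothetical timestamps, in arbitrary units.
instants = [1.0, 2.0, 5.0]
session = Period(1.5, 4.0)
login = Period(1.5, 6.0)

before_pairs = [(a, b) for a in instants for b in instants if before_instants(a, b)]
```

Even three instants yield three ‘before’ pairs. Because every pair of temporal individuals is a candidate for a relation, the number of inferred relations grows much faster than the number of individuals.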
37. Example Hypotheses - Queries
Hypothesis
The investigator hypothesizes that the compromised system may have had network
communications with external IP addresses that belong to autonomous systems that may be
listed as malicious networks.
Query
SELECT ?tcpflow ?destipvalue ?netname ?asnumber ?host_fire
WHERE {
?tcpflow packetcapture:hasDestinationIP ?destip .
?destip packetcapture:hasIPValue ?destipvalue .
?destip integration:PcapIPToWHOISIpAddr ?whoisip .
?whoisip whois:isContainedInRange ?range .
?whoisip integration:WHOISIpAddrToFireIPAddr ?fireip .
?fireip fire:IPbelongsToHost ?host_fire .
?host_fire rdf:type fire:MaliciousHost .
?range whois:hasRange ?rangeValue .
?range whois:isContainedInAS ?as .
?as whois:hasNetName ?netname .
?as whois:hasASNumber ?asnumber .
?as whois:hasRoute ?route
}
Results
tcpflow | destipvalue | netname | asnumber
<urn://bind_tcp_FWed_tcp.pcap#tcpSession_6> | "78.46.173.193"^^<http://www.w3.org/2001/XMLSchema#string> | "HETZNER-AS"^^<http://www.w3.org/2001/XMLSchema#string> | "24940"^^<http://www.w3.org/2001/XMLSchema#string>
<urn://bind_tcp_FWed_tcp.pcap#tcpSession_4> | "78.46.173.193"^^<http://www.w3.org/2001/XMLSchema#string> | "HETZNER-AS"^^<http://www.w3.org/2001/XMLSchema#string> | "24940"^^<http://www.w3.org/2001/XMLSchema#string>
Interpretation
The results of the query support the hypothesis that the compromised system did have
network communications with IP addresses that belong to autonomous systems known to
demonstrate malicious behavior. The query matches a graph pattern in the provided
dataset, thereby retrieving additional information about the specific blacklisted AS.
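How such a query matches a graph pattern can be sketched with a small pattern matcher over toy triples. It mirrors a simplified subset of the query's triple patterns; the property names come from the query, while the resource URIs and the non-matching `192.0.2.10` flow are hypothetical. A real deployment would send the SPARQL query to an endpoint instead.

```python
def match(patterns, triples, binding=None):
    """Yield variable bindings that satisfy every triple pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)


# Toy dataset with hypothetical URIs.
triples = [
    ("urn://capture.pcap/tcpSession_6", "packetcapture:hasDestinationIP", "urn://capture.pcap/ip_1"),
    ("urn://capture.pcap/tcpSession_9", "packetcapture:hasDestinationIP", "urn://capture.pcap/ip_2"),
    ("urn://capture.pcap/ip_1", "packetcapture:hasIPValue", "78.46.173.193"),
    ("urn://capture.pcap/ip_2", "packetcapture:hasIPValue", "192.0.2.10"),
    ("urn://capture.pcap/ip_1", "integration:PcapIPToWHOISIpAddr", "urn://whois/ip_1"),
    ("urn://whois/ip_1", "whois:isContainedInRange", "urn://whois/range_1"),
    ("urn://whois/range_1", "whois:isContainedInAS", "urn://whois/as_1"),
    ("urn://whois/as_1", "whois:hasNetName", "HETZNER-AS"),
    ("urn://whois/as_1", "whois:hasASNumber", "24940"),
]

patterns = [
    ("?tcpflow", "packetcapture:hasDestinationIP", "?destip"),
    ("?destip", "packetcapture:hasIPValue", "?destipvalue"),
    ("?destip", "integration:PcapIPToWHOISIpAddr", "?whoisip"),
    ("?whoisip", "whois:isContainedInRange", "?range"),
    ("?range", "whois:isContainedInAS", "?as"),
    ("?as", "whois:hasNetName", "?netname"),
    ("?as", "whois:hasASNumber", "?asnumber"),
]

rows = [
    (b["?tcpflow"], b["?destipvalue"], b["?netname"], b["?asnumber"])
    for b in match(patterns, triples)
]
```

Only `tcpSession_6` survives, because the second flow's destination address has no integration link into the WHOIS data. This is why the integration properties matter: without them the pattern cannot cross from one ontology to the next.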
38. Evaluation
The method can be relevant to many different types of cases
due to its ability to deal with heterogeneous data.
Ability to formulate complex and expressive queries
over the integrated data that closely match logical
hypotheses
Efficient data abstraction and query evaluation,
given axiom pre-inference
Inverse object properties can considerably improve query
evaluation time
Evidence-neutral implementation
Temporal correlation can be computationally demanding
39. Evaluation
Reliance on online sources may affect the precision of
the results.
Ontological consistency of the results given valid
ontologies.
The implementation can be system-independent.
Ontologies can be dynamically expanded or new
ones (case-specific) introduced.