1. Defence of MSc Dissertation
XML Accounting Trail: A model for
introducing digital forensic readiness
to XML Accounting and XBRL
by
Dirk Kotze
22 July 2015
Promotor: Prof. Martin S. Olivier
2. Introduction
21st-century economy: information
Business: a need to make sense of and share financially relevant information (e.g. accounting data)
XML
Rise
Requirements
Lit. review – XML weaknesses
Cyber-Crime
$800.5 million – 2014
Mostly fraud
60% discovered by accident/tip-off (ACFE 2014)
Digital forensics
3. Introduction (2)
Big Data Problem: how does an investigator know whether XML financial accounting data has been modified?
Research problem
XML financial data is susceptible to tampering due to the human-readability property required by the XML data specification. Upon presentation of a set of XML financial data, how can one determine whether data has been tampered with, and reconstruct the past events so that the nature of tampering can be determined?
Purpose
Detect
Reconstruct
Research Method
Method (detecting)
Model (reconstructing)
4. Background
Overview of key topics
Most of you should be familiar with these
A brief discussion of the key concepts necessary to understand the later work
Will be discussing
Digital Forensics
Compilers
Won’t be discussing (assumed familiar, and omitted due to time constraints)
XML
5. Background: Digital Forensics
Definition (McKemmish)
“the application of computer science and investigative
procedures for a legal purpose involving the analysis of
digital evidence after proper search authority, chain of
custody, validation with mathematics, use of validated
tools, repeatability, reporting, and possible expert
presentation”
Economics
Cost
Disruption
Complexity (Data to analyse, Anti-Forensics)
Forensic Readiness
6. Background: Compilers
Stages of compilation
Analysis & Synthesis
Synthesis out of scope
Analysis
Lexical Analysis
Syntactic Analysis
Semantic Analysis
Error handling
Panic mode
Phrase Level
Minimise noise (graph)
Error Productions
Pre-specify known patterns of data irregularities
Global Correction
Determine the minimum change required to make the input correct, identifying the potential data irregularities introduced.
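The analysis stages above can be sketched for the XML accounting setting. Below is a minimal lexical-analysis pass; the token patterns are illustrative assumptions, not the dissertation's actual grammar. Any character that matches no pattern is reported as a lexical error, and the lexer recovers in panic mode by skipping it.

```python
import re

# Token patterns for a simplified XML accounting line (illustrative only).
TOKEN_SPEC = [
    ("OPEN",   r"<[A-Za-z]+>"),      # opening tag, e.g. <Amount>
    ("CLOSE",  r"</[A-Za-z]+>"),     # closing tag, e.g. </Amount>
    ("NUMBER", r"\d+(\.\d+)?"),      # numeric content
    ("TEXT",   r"[A-Za-z][\w\-]*"),  # textual content, e.g. Bank
    ("WS",     r"\s+"),              # whitespace (skipped)
]

def tokenize(line):
    """Lexical analysis: split a line into (kind, lexeme) tokens.
    A character no pattern matches is flagged as a lexical error,
    then skipped (panic-mode recovery) so scanning can continue."""
    tokens, errors, pos = [], [], 0
    while pos < len(line):
        for kind, pattern in TOKEN_SPEC:
            m = re.match(pattern, line[pos:])
            if m:
                if kind != "WS":
                    tokens.append((kind, m.group()))
                pos += m.end()
                break
        else:
            errors.append((pos, line[pos]))  # e.g. a stray '@' inside a number
            pos += 1
    return tokens, errors
```

A syntactic pass would then parse the token stream against the schema, and a semantic pass would check the accounting rules.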
7. Detecting Data Irregularities
Problem Statement
Forensic Pathology
Analysing XML files
Rigid structure and accounting rules
Definition of data irregularity
Any unauthorised modification to XML accounting data that impacts the semantic meaning of the financial accounting content.
How do these occur?
Direct modification (bypassing controls and rules)
Indirect modification (via application) of illegitimate
transaction
Large Data Set Problem
8. Detecting Data Irregularities (2)
Analysing XML files
Trend analysis/pattern analysis
Double entry example
Salami attack example
Manual vs Automated Searching
Automating the search for data irregularities
Compiler Theory
Classification of input based on patterns as well as predefined rule sets; and
Recursive identification of patterns using a decision tree.
Handling of errors
10. Detecting Data Irregularities (4)
How process works
Establish Rule Set
normal transactions (no errors will be noted); and
error productions (patterns of transactions that deviate
from the norm i.e. data irregularities).
Execute Compiler
Results
Disclaimers
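The rule-set step could be sketched as follows. The specific rules and the field layout are assumptions for illustration (amounts are in integer cents to avoid floating-point noise): a normal production accepts balanced double entries, while an error production pre-specifies a known irregularity pattern, here a salami-style shaving of tiny amounts.

```python
def balanced_double_entry(txn):
    """Normal production: debits equal credits within one transaction."""
    debits = sum(a for act, a in txn["entries"] if act == "Debit")
    credits = sum(a for act, a in txn["entries"] if act == "Credit")
    return debits == credits

def salami_pattern(txn):
    """Error production: a suspiciously tiny amount shaved off
    (classic salami attack); amounts are in cents."""
    return any(0 < a < 5 for _, a in txn["entries"])

ERROR_PRODUCTIONS = [("salami", salami_pattern)]

def classify(txn):
    """Run one transaction through the rule set and report what it matched."""
    hits = [name for name, rule in ERROR_PRODUCTIONS if rule(txn)]
    if not balanced_double_entry(txn):
        hits.append("unbalanced")
    return hits or ["normal"]
```

An incomplete or wrong rule set makes the classifier useless, which is exactly the "rule set is key" caveat raised in the conclusion.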
11. Applying Automated Detection of Data Irregularities
Application
Consider sample XML Accounting Format
Example 1: Generic XML accounting data format
<Transaction>
  <ID>101-1</ID>
  <Account>Bank</Account>
  <Action>Credit</Action>
  <Amount>25000</Amount>
  <User>012437</User>
  <Date>6/19/2011 8:25:02 AM</Date>
  <Hash>1a88f9a8293e88c87ae1ae5f8bd63585</Hash>
</Transaction>
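One way the embedded Hash field could be checked is sketched below. The slides do not specify the hashing scheme, so this assumes an MD5 digest over the concatenated field values in document order; a real design would pick the scheme deliberately.

```python
import hashlib
import xml.etree.ElementTree as ET

# Field order is an assumption based on the example transaction above.
FIELDS = ["ID", "Account", "Action", "Amount", "User", "Date"]

def transaction_hash(txn_xml):
    """Recompute the digest over the field values, in document order."""
    txn = ET.fromstring(txn_xml)
    payload = "|".join((txn.findtext(f) or "").strip() for f in FIELDS)
    return hashlib.md5(payload.encode()).hexdigest()

def hash_matches(txn_xml):
    """Compare the recomputed digest against the embedded <Hash> value."""
    txn = ET.fromstring(txn_xml)
    return transaction_hash(txn_xml) == (txn.findtext("Hash") or "").strip()
```

A mismatch does not say what changed or who changed it; that is what the reconstruction model in the later slides addresses.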
12. Applying Automated Detection of Data Irregularities (2)
Type of Error: XML Data

Lexical:
1.1. A tag not opened or closed correctly, e.g. a missing ‘<’ or ‘>’.
1.2. Amounts that contain non-numeric characters.
1.3. Reserved characters (‘<’ or ‘>’) used in a transaction statement.

Syntactic:
2.1. The XML schema is violated.
2.2. A transaction entry has a missing or imbalanced tag for: Transaction, Balance, Hash, User, Date, Amount, Account, etc.
2.3. Tags that are not defined, e.g. a tag containing a spelling error in the tag name.
2.4. An entry matches one or more predefined rules specifying an incorrect transaction.

Semantic:
3.1. A tag does not correctly describe the content it contains, e.g. the tag attribute specifies 24-hour time while the time is given in AM/PM: <Time format="HH:mm">12:30 PM</Time>.
13. Applying Automated Detection of Data Irregularities (3)
Type of Error: Data Errors (errors in the accounting data)

Lexical:
1.4. Irregularities in the formatting of the data, introduced by editing the machine-generated data, e.g. numbers within tags use a comma to indicate thousands, but the comma is omitted in certain numbers.
1.5. Bad data within the XML tags, e.g. a ‘;’ or ‘@’ character occurring in a number, or a date with the month larger than 12.

Syntactic:
2.5. A violation of the hierarchical structure and/or order of the tags, e.g. an ID tag that exists in isolation (instead of belonging to a parent tag, such as a transaction), or a transaction tag without children.
2.6. Allocation of optional tags not applicable to the tag object, e.g. listing a vehicle's asset number in a furniture purchase transaction.

Semantic:
3.2. Omission of part of a transaction, e.g. a transaction with a missing corresponding double entry.
3.3. Transaction ID errors: an ID skipped or an ID repeated.
3.4. Violation of transaction logic, e.g. purchase fulfilment comes before the order.
14. Applying Automated Detection of Data Irregularities (4)
Handling of errors:
Lexical: Typically Panic mode.
Syntactic: Panic mode or Phrase-Level correction. Also,
error productions.
Semantic: Can be done in rule set, e.g. error productions
or global correction, but needs additional consideration by
investigator.
Handling of semantic errors:
Allows for hypothesis leading to reconstruction
Investigator can look at:
Statistical analysis, e.g. Benford's law
Benford's law (also known as the first-digit law) applies to most large sources of numerical data and describes the frequency distribution of the leading digit of such data. In summary, numbers with a leading digit of 1 should occur around 30% of the time, while larger leading digits occur progressively less frequently.
Analysis of time trends
Transaction order
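The Benford check above can be sketched directly. The deviation metric below (sum of absolute frequency differences against the expected log10(1 + 1/d) distribution) is an illustrative choice for the sketch, not the dissertation's method; amounts are assumed to be positive integers.

```python
import math
from collections import Counter

def benford_expected(d):
    """Expected frequency of leading digit d under Benford's law."""
    return math.log10(1 + 1 / d)

def leading_digit(n):
    """First significant digit of a positive integer amount."""
    while n >= 10:
        n //= 10
    return n

def first_digit_deviation(amounts):
    """Sum of |observed - expected| leading-digit frequencies.
    Larger values suggest the amounts deviate from Benford's law
    and may merit closer investigation."""
    counts = Counter(leading_digit(a) for a in amounts if a > 0)
    total = sum(counts.values())
    return sum(abs(counts[d] / total - benford_expected(d)) for d in range(1, 10))
```

A high deviation is a lead for the investigator, not proof of tampering; legitimate data sets can also fail a Benford check.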
15. Advantages
Investigation time shortened
Triage: Indication of whether XML accounting data
file requires further investigation
Little chance of error/non-detection of data
irregularities.
16. Reconstructing the events
Problem statement
Investigative questions
When?
What?
Who?
Why?
How?
XML does not store this information, and it is not available elsewhere
Black box
Similar to aircraft crash
Instrumentation
17. Reconstructing the events (2)
Minimum set of evidence required:
Evidence showing the details of the data modifications;
Evidence stating the date and time of the modification;
and
Evidence showing who modified the data.
How & why not covered.
Architecture
Logging of evidence
Event reconstruction history not available
Need for real-time logging
Interrupts vs. Real-time Proxy
Reference monitor
Circumvention? Need for tamper-proofing of XML file.
Digital Signatures (email)
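The tamper-proofing step could be sketched as a seal-and-verify pair. The model calls for digital signatures under the reference monitor's private key; since Python's standard library has no asymmetric crypto, an HMAC over the file stands in here purely to show the mechanics (a real deployment would use an asymmetric scheme such as XML-DSig, so verifiers need no secret).

```python
import hmac
import hashlib

# Placeholder key: in the model this secret lives inside the reference
# monitor and must itself be protected against extraction.
KEY = b"reference-monitor-secret"

def seal(xml_bytes):
    """Produce a tamper-evidence tag over the XML file's bytes."""
    return hmac.new(KEY, xml_bytes, hashlib.sha256).hexdigest()

def verify(xml_bytes, tag):
    """Check the file against its tag; any modification breaks the match."""
    return hmac.compare_digest(seal(xml_bytes), tag)
```

`hmac.compare_digest` is used instead of `==` to avoid timing side channels during verification.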
19. Reconstructing the events (3)
Reconstructing the ‘What?’
Version Control
Reconstructing the ‘When?’
Logging
Timestamps
Local vs. trusted external
Reconstructing the ‘Who?’
Disclaimer
Username/Password authentication
Storing the evidence
Encryption
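The minimum evidence set (what, when, who) could be recorded as a hash-chained log, so that editing or deleting an entry breaks the chain. Field names are assumptions for the sketch; the model's encryption of stored evidence and its trusted external timestamp source are omitted here.

```python
import json
import hashlib
from datetime import datetime, timezone

def log_entry(prev_hash, user, change):
    """Record who changed what, and when; chain to the previous entry."""
    record = {
        "who": user,
        "what": change,
        "when": datetime.now(timezone.utc).isoformat(),  # ideally from a trusted external source
        "prev": prev_hash,
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    return record

def chain_valid(entries):
    """Verify each entry's own hash and its link to the previous entry."""
    prev = "0" * 64  # genesis value for the first entry
    for e in entries:
        body = {k: v for k, v in e.items() if k != "hash"}
        if e["prev"] != prev:
            return False
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != e["hash"]:
            return False
        prev = e["hash"]
    return True
```

The chain makes tampering with the log evident, but availability still depends on the reference monitor not being bypassed.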
22. Conclusion
Research problem
XML financial data is susceptible to tampering due to the
human-readability property required by the XML data
specification. Upon presentation of a set of XML financial
data, how can one determine whether data has been
tampered with, and reconstruct the past events so that the
nature of tampering can be determined?
Proposal
Method to detect data irregularities
Compiler
Model to reconstruct events
Instrumentation
23. Conclusion (2)
Self evaluation & future work
Despite best efforts, there are always areas in research where the answers are not clear or the proposed solution leads to more questions. It is therefore important to step back and reflect on the suggested work to identify shortcomings and areas for future work.
Detecting data irregularities
Shows great promise, but no real-world implementation yet (prototype)
Rule set is key
Incomplete/bad rule set – compiler won’t work
Template Rule Sets (future work)
Expanding use of errors & error handling
Global error correction
24. Conclusion (3)
XML Accounting Trail
Lack of real-world implementation
If the reference monitor is compromised, the model's guarantees are lost.
A securely stored private key already provides some protection, but not complete protection.
Securing the reference monitor using anti-forensic & anti-hacking techniques to protect the private key against extraction.
25. Published Work
XBRL-Trail: A Model for Introducing Digital Forensic
Readiness to XBRL, In Proceedings of the Fourth
International Workshop on Digital Forensics &
Incident Analysis (WDFIA), 2009, pages 93-104.
Detecting XML Data Irregularities by Means of Lexical
Analysis and Parsing. In Proceedings of the 9th
European Conference on Information Warfare and
Security, 2010, pages 151-159.
26. Acknowledgements
Prof. Martin S. Olivier
Prof. Stefan Gruner
Dr. Wynand van Staden
Employers (PwC/RMB) specifically Michael Nean
Fiancée – Dr. Sheena Steyl
Mom & Dad
Dedicated to my Mom (passed away 09 Sept 2009)