2. • H.D. Kuna, Argentina
• R. García-Martinez, Argentina
• F.R. Villatoro, Spain
• Information Systems
• Volume 44, August 2014
• Keywords: Data mining, Systems audit, Outlier detection
4. 1. Introduction
• Systems auditing is composed of a series of
tasks aimed at ensuring that all information
systems within an organization function
properly and at providing the basis that
enables corporations to fulfill their strategic
objectives.
• A holistic audit ensures that information
systems meet business requirements and
provide correct information.
5. 1. Introduction
• Audit logs contain records of every operation
carried out within a software information
system and play a key role in guaranteeing that
each organization's procedures and
regulations are observed.
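• Audit logs (audit trails / working papers)
record every operation in the software and are
used to ensure compliance with the relevant
rules.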
6. 1. Introduction
• Real databases contain anomalies related to
different causes, including errors in data
collection, errors in the information systems,
probable malicious actions, and so on.
• Anomalies include data entry errors, system
errors, malicious operations, and so on.
7. 1. Introduction
• This paper aims to introduce a process that
employs data mining techniques to automate
outlier detection in system audit logs that
include alphanumeric data. Automated
detection can allow an auditor to detect hints of
anomalous activities, which will most likely
require closer scrutiny.
• This paper introduces the relevant data mining
techniques and offers recommendations.
8. 1.1 Related work. data
mining in systems auditing
• Computer-Assisted Auditing Techniques
(CAATs) make it possible to use computers as
part of the auditing process.
• CAATs help us carry out audits with computers.
• Other options also exist:
9. 1.1 Related work. data
mining in systems auditing
• Data analysis software.
• Network security assessment software.
• Assessment software for operating systems
and database management systems.
• Software and source code testing tools.
10. 1.1 Related work. data
mining in systems auditing
• Clustering is a data mining technique that
may be employed for outlier detection.
• Several clustering techniques are available,
including the following:
1. Hierarchical clustering
2. Partitioning methods, which minimize the
deviation within each cluster
3. Density-based clustering, which groups
points according to density
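To make the density-based option concrete, here is a minimal sketch of how a DBSCAN-style method singles out outliers as "noise". It is not the implementation used in the paper: the function name `density_noise`, the parameters `eps` and `min_pts`, the sample points, and the Euclidean distance are all assumptions chosen for illustration (alphanumeric audit data would need a different distance measure).

```python
from math import dist

def density_noise(points, eps, min_pts):
    """Return the indices of points a DBSCAN-style method would call noise.

    A point is a core point if at least `min_pts` points (itself included)
    lie within distance `eps`. Noise points are non-core points that are
    not within `eps` of any core point.
    """
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    return [
        i for i, nb in enumerate(neighbors)
        if i not in core and not any(j in core for j in nb)
    ]

# Hypothetical data: four points packed together and one far away.
sample = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
noise = density_noise(sample, eps=0.5, min_pts=3)
```

With these assumed values, only the isolated fifth point has too few neighbors to join a cluster, so it is the one reported as noise.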
11. 1.1 Related work. data
mining in systems auditing
• However, procedures have not been formally
established to construct a system auditing tool
from data mining techniques applicable to
alphanumeric fields, which is the goal of this
article. Furthermore, the tool we develop must
not require its user to be an expert in data
mining.
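• A methodology for applying data mining to
auditing has not yet been established, and
auditors should not be required to become
data mining experts.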
13. 3. Materials and methods
• 1. Detection of outliers with unsupervised learning.
• 2. Detection of outliers with supervised learning.
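• 3. Detection of outliers with semisupervised
learning. During clustering, the labeled data are first
used to draw a boundary; the overall distribution of
the remaining unlabeled data is then used to adjust
it into a new boundary between the two classes.
This keeps the high degree of automation of
unsupervised learning while reducing the cost of
labeling data.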
14. 3. Materials and methods
Considerations when selecting methods
• The ability of the algorithm to produce results that are comprehensible for
the final user.
• The efficacy in its detection of outliers.
• The false positive rate.
• The compatibility of the algorithms with the objectives of the
procedure.
• The expected improvement of the efficacy by combining several
techniques in comparison to using them separately.
• That the algorithm can operate on alphanumeric data.
• That the algorithms do not require a large number of parameters and can
be easily automated. This is very important in cases lacking expert
auditors for data mining.
• Finally and importantly, that the algorithms are capable of improving their
capacity to specifically detect outliers.
16. 3.2.1. Selection of outlier
detection specific algorithms
• Testing and selection were based on an
artificial database that was created in
accordance with the guidelines set forth by
several authors.
• The test database was created artificially,
with outliers placed at random.
• Table 1 details the characteristics of the test
database, as follows:
17. 3.2.1. Selection of outlier
detection specific algorithms
• DB-Outliers and COF performed excellently,
but they were discarded as they required that
the total number of outliers to be detected be
provided beforehand. As this number is
unknowable in real conditions, LOF and
DBSCAN were chosen instead.
18. 3.2.1.1. Merging results
from LOF and DBSCAN.
LOF
• If “LOF” ≤ 1.2, then “LOF_value” is “0”.
• If “LOF” > 1.2, then “LOF_value” is “1”.
DBSCAN
• If the tuple belongs to a cluster other than cluster 0,
“DBSCAN_value” = “0”.
• If the tuple belongs to cluster 0,
“DBSCAN_value” = “1”.
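The two flagging rules above can be sketched in a few lines of Python. The 1.2 threshold and the role of cluster 0 come from the slide; the function names and, in particular, the way the two flags are combined in `merge_flags` are assumptions for illustration only, since the slide does not state the merging rule.

```python
LOF_THRESHOLD = 1.2   # threshold stated on the slide
FLAGGED_CLUSTER = 0   # cluster whose members the slide flags (presumably DBSCAN noise)

def lof_value(lof_score):
    """Return 1 if the LOF score exceeds the threshold, else 0."""
    return 1 if lof_score > LOF_THRESHOLD else 0

def dbscan_value(cluster_id):
    """Return 1 if the tuple was assigned to cluster 0, else 0."""
    return 1 if cluster_id == FLAGGED_CLUSTER else 0

def merge_flags(lof_score, cluster_id, rule="any"):
    """Combine both flags into an outlier_type label.

    ASSUMPTION: rule="any" marks a tuple as an outlier when either
    algorithm flags it; rule="all" requires both. Neither rule is
    taken from the source.
    """
    flags = (lof_value(lof_score), dbscan_value(cluster_id))
    combine = any if rule == "any" else all
    return "outlier" if combine(flags) else "clean"

# Hypothetical tuple: high LOF score, but placed in a regular cluster.
label_any = merge_flags(1.5, 3)
label_all = merge_flags(1.5, 3, rule="all")
```

Keeping the combination rule as a parameter makes it easy to compare a permissive merge (more detections, more false positives) with a strict one.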
19. 3.2.1.3.Combining LOF and
DBSCAN
• In Table 3, we can see that outlier detection
generated a 4% improvement based on the
use of LOF and a 24% improvement based on
the use of DBSCAN. False positives were
reduced by 1% when compared to LOF and by
4% when compared to DBSCAN.
20. 3.2.2.1. Classification
algorithms combination
• C4.5 kept the efficacy level intact but did not
affect the percentage of false positive results.
• Bayesian Network (BN) eliminated all false
positives but drastically affected outlier
detection efficacy.
• PART gave good results with its own model.
• The algorithms are combined to obtain the best
possible global results through their respective
models.
23. 3.3. Designing the proposed
process
A. Read and pre-process the database.
B. Apply LOF. Add the “LOF_value” attribute.
C. Apply DBSCAN. Add the “DBSCAN_value”
attribute.
D. Merge the results of the two methods.
Add the “outlier_type” attribute.
24. 3.3. Designing the proposed
process
E. Read the database
F. Apply C4.5
G. Apply BN
H. Apply PART
I. Merge the results.
J. Apply the rules for outlier tuple determination
as per the criteria
K. Save the final results of the “outlier_type”
target attribute for each tuple; this can be
either “clean” or “outlier”.
L. End the procedure.
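Steps I to K can be pictured as a small voting step over the three classifier outputs. The slides do not give the actual determination criteria, so the majority rule below, the function name `final_outlier_type`, and the `min_votes` parameter are purely hypothetical stand-ins.

```python
from collections import Counter

def final_outlier_type(predictions, min_votes=2):
    """Return "outlier" or "clean" for one tuple.

    `predictions` maps a classifier name to its label for the tuple.
    ASSUMPTION: a tuple is an outlier when at least `min_votes`
    classifiers say so; the source does not state this rule.
    """
    votes = Counter(predictions.values())
    return "outlier" if votes["outlier"] >= min_votes else "clean"

# Hypothetical outputs of steps F, G and H for a single tuple.
tuple_predictions = {"C4.5": "outlier", "BN": "clean", "PART": "outlier"}
result = final_outlier_type(tuple_predictions)
```

Raising `min_votes` to 3 would trade detection efficacy for fewer false positives, which mirrors the trade-off reported for the individual classifiers.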
26. 4.1. Academic management
system of a university
• The case study uses a database from a
university student management system,
investigating the audit records from “Exam
Management”, “Course Management”, and
“Enrollment Management”.
27. 4.1. Academic management
system of a university
• A minimum efficacy of 65% and a 1%
maximum for false positives were established
after consultation with the aforementioned
experts, two system administrators who work
on the student management system and have
ample experience in academic management
systems.
• Targets: efficacy of at least 65% and a false
positive rate of at most 1%.
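The acceptance targets can be checked with a short script. The 65% and 1% figures come from the slide; the metric definitions are assumptions: efficacy is taken here as the share of true outliers that were detected, and the false positive rate as the share of all tuples wrongly flagged. The function names and sample data are hypothetical.

```python
def evaluate(actual, predicted):
    """Return (efficacy, false_positive_rate) as percentages.

    ASSUMPTION: efficacy = detected outliers / true outliers, and
    false positive rate = clean tuples flagged / all tuples.
    """
    pairs = list(zip(actual, predicted))
    true_outliers = sum(a == "outlier" for a, _ in pairs)
    detected = sum(a == "outlier" and p == "outlier" for a, p in pairs)
    false_pos = sum(a == "clean" and p == "outlier" for a, p in pairs)
    efficacy = 100.0 * detected / true_outliers if true_outliers else 0.0
    fp_rate = 100.0 * false_pos / len(pairs) if pairs else 0.0
    return efficacy, fp_rate

def meets_targets(efficacy, fp_rate, min_efficacy=65.0, max_fp=1.0):
    """Check a result against the targets agreed with the experts."""
    return efficacy >= min_efficacy and fp_rate <= max_fp

# Hypothetical audit table: 4 true outliers among 200 tuples.
actual = ["outlier"] * 4 + ["clean"] * 196
predicted = ["outlier"] * 3 + ["clean"] + ["outlier"] + ["clean"] * 195
efficacy, fp_rate = evaluate(actual, predicted)
accepted = meets_targets(efficacy, fp_rate)
```

In this made-up example three of the four outliers are found and one clean tuple is flagged, so both targets are met.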
28. 4.1. Academic management
system of a university
• The data selected for the test came from the
year 2000, as it was determined by the experts
that most anomalous operations occurred
during that year.
29. 4.1. Academic management
system of a university
• To clarify the meaning of outliers in the
academic management system, let us list
some examples of activities that the experts
consider as anomalies:
• 1. activities in the audit log either during
holidays or outside the shifts of the personnel
30. 4.1. Academic management
system of a university
• 2. operations not meeting the profile or
permissions for a given user
• 3. activities going against the internal
regulations defined by the university
• 4. data recorded outside the dates established
under the calendar of the university
31. 4.4. Result
• The efficacy was always over 66% (above the
65% minimum), with a mean value of 76%. The
false positive rate was below 0.67% in all cases
(below the 1% maximum), with a minimum
value of 0.10%.
32. 4.4. Result
• the classification of the types of outliers
detected by using our procedure applied to the
academic management database.
34. 5. Conclusions
• Based on these findings, we can conclude that
the data mining-algorithm merged approach
can be considered to be a resounding success,
allowing us to develop a process and to apply it
on audit tables from a real database and thus
to facilitate the system auditor's job.
• The combined use of data mining methods
proved successful: it let us develop a process,
apply it to a real database, and so support the
auditor's work.
35. 5. Conclusions
• We have also contemplated the further
optimization of the process's efficacy and would
also like to reduce the rate of false positives to
even lower levels in our future work.
• Future work aims to lower the false positive rate.
• We have analyzed the suitability of employing
fuzzy logic, as in many cases a tuple does not
correspond cleanly to either of the two values
that our process establishes (“clean” or
“outlier”).