
DAI NAM UNIVERSITY
FACULTY OF FOREIGN LANGUAGES
GRADUATION PROJECT REPORT
SOME TECHNIQUES APPLIED FOR
TRANSLATING SCIENTIFIC ARTICLES
Time duration: July - August/2020
Location: Dai Nam University
Student: Duong Thi Mong Thuy
Supervisor: Ms. Pham Thi Hang Nga
HANOI – August

TABLE OF CONTENTS
ACKNOWLEDGMENT
1. INTRODUCTION
1.1. Descriptions of intern facility
1.2. Reasons and purposes
1.3. Expectations
1.4. Title of topic, reason to choose the topic, research scope
2. RESEARCH METHODS
3. RESEARCH CONTENT
3.1. Theory of Translation
3.2. Vietnamese - English Translation
3.3. English - Vietnamese Translation
4. CONCLUSION
4.1 Achievements
4.2 Shortcomings
5. APPENDIX
6. REFERENCES
7. SUPERVISOR’S REMARKS

ACKNOWLEDGMENT
In the process of completing this graduation paper, I have received a great
deal of help, guidance and encouragement from my teachers and my colleagues.
First, I would like to give my sincere thanks to my supervisor, Ms.
Pham Thi Hang Nga, for helping me through this challenging process.
I would also like to express my deep gratitude to all the teachers of the
Faculty of Foreign Languages, whose supportive lectures over four years
equipped me with a good background to complete my graduation paper.
Finally, I would like to thank my family, my friends and my colleagues
who have offered continuous support and valuable encouragement for me to
complete this paper.
Hanoi, August - 2020
Student
Duong Thi Mong Thuy

INTERNSHIP REPORT
1. INTRODUCTION
Internships are a required part of university training programs. They are an
opportunity for learners to gain hands-on experience and useful practical
knowledge. Through this opportunity, I have achieved some of the expected
results, as follows.
I have used the English, translation theories and translation techniques
accumulated during my studies at the Faculty of Foreign Languages, Dai Nam
University, to read and translate scientific documents in different fields.
This translation report covers scientific articles on frequent itemset mining
in data mining. I have used many translation methods that I learned from
the teachers in the Faculty of Foreign Languages. This was a good opportunity
for me to review all of the translation knowledge that I have learned. In this
internship report, I focus only on the translation section and identify the
translation method used for each sentence.
1.1. Descriptions of intern facility
I chose to do my internship at Dai Nam University because of the enthusiastic
guidance of many lecturers at the university, especially Ms. Nga. I chose
translation skills as the subject of my internship report; the data source for
the topic is two scientific articles published in major journals. During this
graduation internship, my supervisor created a comfortable, flexible and
efficient working environment, and watched over, cared for and guided me
throughout. As a result, I completed the internship report on schedule.
1.2. Reasons and purposes
Students review the knowledge learned in the training program and
apply that knowledge in practice. Internships give students the opportunity to
access a practical environment and gain experience working in a professional
English environment. In addition, students are trained in soft skills and
improve their self-study and document collection skills.
1.3. Expectations
- Improve communication skills, quick response, discussion skills and
writing skills.
- Improve paper-reading skills and speaking skills for communication in
the work environment.
- Create a foundation for study and scientific research in
information technology, to later contribute to education in the
locality as well as in Vietnam.
1.4. Title of topic, reason to choose the topic, research scope
- The title of the topic is "Some techniques applied for translating
scientific articles".
- I chose the topic because this is my research and teaching field.
- The research scope is the writing skills and translation techniques for
scientific articles in English.
2. RESEARCH METHODS
- Researching the skill of reading and comprehending scientific articles.
- Researching the skill of writing scientific articles.
- Researching the skill of summarizing the abstracts of academic articles.
- Researching the skill of developing an academic article in the English
language.
3. RESEARCH CONTENT
3.1. Theory of Translation
Translation, by dictionary definition, consists of changing from one state or
form to another, to turn into one’s own or another’s language (The
Merriam-Webster Dictionary, 1974). Translation can be classified into 9
methods:
- Word-for-word Translation (WT): Word-for-word translation
focuses mainly on translating words from the source text into the target
language while the word order of the original is preserved. This method of
translation can be seen in cases where some humorous effect is needed.
- Literal Translation (LiT): In literal translation, the grammatical
structure and the meaning of words are translated as closely as possible
into the target language without paying attention to the situation or
context.
- Faithful Translation (FT): Faithful translation tries to convey the
meaning of words and the contextual situation according to the grammar
rules of the target language; however, some unusualness or unnaturalness
remains in the target language.
- Semantic Translation (SeT): Semantic translation focuses to a great
degree on the meaning (semantic content) and form (syntax) of original texts
of high status, such as religious texts, legal texts, literature and speeches.
- Communicative Translation (CT): Communicative translation is
freer than the above-mentioned types. This strategy gives high priority to the
message communicated in the text, where the actual form of the original is not
closely bound to its intended meaning.
- Idiomatic Translation (IT): Idiomatic translation is based on the
meaning of the text and aims to reproduce the message of the original, but
tends to distort nuances of meaning by using colloquialisms and idioms where
these do not exist in the original.
- Free Translation (FrT): Free translation focuses more on content
than on form in the target language. As a result, the grammatical structure
or the form of the words in the target language may change, and the number
of words and the sentence length may vary, depending on the subjectivity of
the translator.
- Adaptation Translation (AT): This is a highly free type of
translation. Here the focus is on socio-cultural phenomena or practices that
are absent in the target culture, rather than on linguistic units. It is used
mainly for plays (comedies) and poetry: the themes, characters and plots are
usually preserved, the source language culture is converted to the target
culture, and the text is rewritten by an established dramatist or poet.
- Gist Translation (GT): This is the freest type of translation. Gist
translation is characterized by keeping the main idea (gist) of the text and
omitting all its supporting details and subsidiary arguments. Gist translation
can be used in language learning situations to summarize a written text in a
written test.
3.2. Vietnamese - English Translation
No. Source text Translation Method
1 Khai thác dữ liệu là quá
trình tìm kiếm các tri thức
tiềm ẩn có ích, các tập luật
từ cơ sở dữ liệu giao dịch
nhằm phục vụ cho các
công việc dự báo, ra quyết
định.
Data mining is the process of
finding useful latent
knowledge, sets of rules from
transaction databases to
serve for forecasting and
decision making.
Semantic
Translation
2 Trong khi đó, thuộc tính
thời gian có ý nghĩa rất
quan trọng và có yếu tố
quyết định đối với nhiều
chiến lược dự đoán thuộc
nhiều lĩnh vực như kinh
doanh, thương mại, thị
trường chứng khoán...
Meanwhile, the time attribute is
very important and decisive
for many prediction
strategies in many fields such
as business, commerce, the stock
market...
Semantic
Translation
3 Vì vậy, khai thác dữ liệu
có yếu tố thời gian là một
chủ đề có vai trò quan
trọng trong khai thác dữ
liệu.
Therefore, data mining with a
time factor is an important
topic in data mining.
Literal
Translation
4 Nếu khung thời gian hoặc
độ phổ biến thay đổi thì
phải xây dựng lại cây
If the timeframe or popularity
changes, the tree must be
rebuilt.
Faithful
Translation
5 Việc xây dựng cây TSET
tốn kém rất nhiều thời
gian, ứng mỗi nút con
trong cây được sinh ra thì
phải quét lại toàn bộ cơ sở
dữ liệu
The TSET tree construction
takes a lot of time, and for
each new generated child
node in the tree, the entire
database must be re-
scanned.
Faithful
Translation
6 Do đó, để khắc phục
nhược điểm của thuật toán
TSET-Miner, bài báo đề
xuất cây FS-Tree để lưu
trữ các dãy sự kiện tổ hợp
ứng với từng thời điểm
xuất hiện của tập phần tử
Therefore, to overcome the
disadvantages of TSET-
Miner algorithm, this paper
proposes an FS-Tree tree to
store sequences of
combinatorial events
corresponding to each
occurrence of the element
set.
Word – for –
word
Translation
7 Việc xây dựng cây FS-
Tree chỉ cần duyệt cơ sở
dữ liệu giao dịch đúng
1lần
The construction of the FS-
Tree requires scanning the
transaction database exactly
once.
Literal
Translation
8 Thuật toán FS-Alg thực
hiện trích xuất các dãy sự
kiện phổ biến từ cây FS-
Tree ứng với khung thời
gian và độ phổ biến khác
nhau do người dùng chỉ
định.
The FS-Alg algorithm
extracts frequent event
sequences from the FS-Tree
tree with different user-
specified timeframes and
popularity.
Faithful
Translation
9 Thuật toán FS-Alg đã giải
quyết được các yếu điểm
còn tồn tại của thuật toán
TSET-Miner, giúp rút
ngắn thời gian thực thi.
The FS-Alg algorithm has
solved the remaining
weaknesses of TSET-Miner
algorithm to shorten
execution time.
Literal
Translation
10 Mỗi thời điểm xuất hiện
trong cơ sở dữ liệu sẽ tạo
thành 1 nút trong cây ở
mức 1.
Each time that appears in the
database forms a node in the
tree at level 1.
Faithful
Translation
11 Các nút ở mức 2 gồm các
sự kiện ứng với thời điểm
mà các sự kiện đó thuộc
về, tức là các sự kiện có
cùng thời điểm sẽ thuộc
về cùng 1 nhánh.
Nodes at level 2 include
events corresponding to the
time to which they belong,
meaning that events at the
same time will belong to the
same branch.
Literal
Translation
12 Trong cùng 1 nhánh, các
nút sẽ tổ hợp lần lượt với
nhau để tạo thành các nút
ở các mức sâu hơn.
In the same branch, the
nodes will combine together
to form nodes at deeper
levels.
Literal
Translation
13 Cây FS-Tree có nút gốc là
nút rỗng và tập liên kết
đến các nút con link.
The FS-Tree tree has a root
node that is an empty node
and a set of links to child
nodes, link.
Word – for –
word
Translation
14 Các nút Ntime ở mức 1
chứa các thời điểm xuất
hiện của tập phần tử trong
cơ sở dữ liệu DB.
Ntime nodes at level 1
contain the occurrences of
the element set in the DB
database.
Word – for –
word
Translation
15 Nút này bao gồm thời
điểm xuất hiện của sự
kiện time, tập các liên kết
đến nút con link.
This node includes the
occurrence time of the event,
time, and the set of links to
child nodes, link.
Word – for –
word
Translation
16 Lần lượt chèn từng sự
kiện vào cây FS-Tree với
nút gốc là nút có cùng thời
điểm với sự kiện đang xét.
Insert each event into the FS-
Tree one by one, with the
root node being the node at
the same time as the current
event.
Faithful
Translation
17 Các nút con ở mức 2 trong
cây FS-Tree được tạo ra
bằng cách tổ hợp với các
nút đồng cấp và có cùng
thời điểm, tức là thuộc
cùng 1 nhánh.
Nodes at level 2 in the FS-
Tree tree are created by
combining with nodes of the
same level and at the same
time, i.e. belonging to the
same branch.
Literal
Translation
18 Mỗi nút Node trong cây
thuộc mức 2 xuống các
mức sâu hơn gồm dãy sự
kiện seq, thời điểm time,
liên kết đến nút con link
của dãy sự kiện.
Each node Node in the tree,
from level 2 down to deeper
levels, includes the event
sequence seq, the time time,
and the links to child nodes
link.
Literal
Translation
19 Duyệt nhánh ứng với nút
chứa thời điểm 0, chèn
các dãy sự kiện vào tập
X1.
Traverse the branch
corresponding to the node
containing time 0 and insert
the event sequences into
set X1.
Faithful
Translation
20 Xét nhánh chứa thời điểm
1, trích xuất các dãy sự
kiện chứa trong nhánh này
và chuyển vào tập Y.
Consider the branch
containing time 1, extract the
sequence of events contained
in this branch and move to
set Y.
Faithful
Translation
21 Thực hiện tích đề - các
giữa tập X1 với tập Y và
chuyển vào tập X2.
Take the Cartesian product of
set X1 and set Y and move it
into set X2.
Faithful
Translation
22 Thực hiện tương tự cho
nhánh chứa thời điểm 2, 3
ứng với thời điểm gốc là
0.
Do the same for the branches
containing times 2 and 3,
with 0 as the base time.
Faithful
Translation
23 Với thời điểm 1, sao chép
các dãy sự kiện ở thời
điểm 0 được lưu trong tập
X1 vào tập X2.
For time 1, the sequence of
events at time 0 stored in set
X1 are copied into set X2.
Literal
Translation
24 Khi sao chép, trong từng
dãy sự kiện, chỉ lấy các sự
kiện có thời điểm lớn hơn
hoặc bằng 1.
When copying, take from each
sequence of events only
the events with time greater
than or equal to 1.
Faithful
Translation
25 Nếu dãy sự kiện nào xuất
hiện từ 2 lần trở lên thì chỉ
giữ lại 1 dãy.
If a sequence of events
occurs two or more times,
keep only one copy of
it.
Faithful
Translation

26 Sao chép lần lượt từng
dãy sự kiện trong tập X2,
đồng thời chuẩn hóa và
chuyển sang tập X1.
Sequentially copy each
sequence of events in set
X2, normalizing it and
moving it to set
X1.
Faithful
Translation
27 Nếu dãy sự kiện sao chép
được chuẩn hóa đã tồn
tại trong tập X1 thì tăng số
lần xuất hiện, ngược lại
thì chuyển tiếp vào sau
X1.
When a normalized sequence
of events is considered, if it
already exists in set X1, the
number of occurrences
increases, conversely, a new
element is added to set X1.
Faithful
Translation
28 Tiếp tục với thời điểm 2,
xét tập X2 đã có ở thời
điểm 1.
Continuing with time 2,
consider the set X2 already
obtained at time 1.
Literal
Translation
29 Duyệt qua tập X2, chỉ giữ
lại các sự kiện ở đó thời
điểm lớn hơn hoặc bằng 2.
For each sequence of events
in the set X2, sequences of
times greater than or equal
to 2 are retained.
Faithful
Translation
30 Nếu dãy sự kiện nào xuất
hiện từ 2 lần trở lên thì chỉ
giữ lại 1 dãy.
If a sequence of events occurs
two or more times, only a
single sequence is
retained.
Faithful
Translation
31 Thực hiện tương tự như
thời điểm 2 cho các thời
điểm còn lại trên cây FS-
Tree sẽ được tập X1 hoàn
chỉnh.
Do the same for all
remaining times on the FS-
Tree as time 2 to create the
complete X1 set.
Faithful
Translation
32 Tiến hành trích xuất trên
tập X1 này sẽ thu được
các dãy sự kiện phổ biến
ứng với các khung thời
gian và các độ phổ biến
khác nhau.
Extracting from this set X1
yields the frequent
sequences of events
corresponding to different
timeframes and different
frequencies.
Literal
Translation
33 Nếu có sự thay đổi về cơ
sở dữ liệu thì thực hiện
thao tác cập nhật trên cây
FS-Tree.
If there is any change to the
database, perform the update
operation on the FS-Tree.
Word – for –
word
Translation

34 Sau đó, áp dụng thuật toán
FS-Alg để trích xuất các
dãy sự kiện phổ biến từ
cây FS-Tree.
Then apply the FS-Alg
algorithm to extract the
frequent event sequences
from the FS-Tree.
Word – for –
word
Translation
35 Tạo cây FS-Tree cần đối
số đầu vào là tập các
thời điểm và tập các sự
kiện ứng với thời điểm
xuất
hiện thuộc cơ sở dữ liệu
DB, đầu ra là cây FS-
Tree.
To create the FS-Tree, the
input is the set of
times and the set of events
corresponding to their
times of occurrence in the DB
database; the output is
an FS-Tree.
Free
Translation
36 Đầu tiên, khởi tạo nút gốc
là nút rỗng.
First, initialize the root node
as an empty node.
Word – for –
word
Translation
37 Duyệt cơ sở dữ liệu DB,
chèn thời gian đầu tiên
vào dưới nút gốc bằng
thuật toán insertNode.
Browse the DB database,
insert the first time under the
root node with the
insertNode algorithm.
Word – for –
word
Translation
38 Chèn sự kiện thứ nhất vào
dưới nút có cùng thời
điểm với sự kiện đang xét
bằng thuật toán
insertNode.
Insert the first event below
the node at the same time as
the current event using the
insertNode algorithm.
Word – for –
word
Translation
39 Tạo nhánh cho nút này
bằng thuật toán
CreateBranch.
Create a branch for this node
using the CreateBranch
algorithm.
Word – for –
word
Translation
40 Thực hiện tương tự cho
các sự kiện còn lại ứng
với thời điểm đang xét.
Do the same for the
remaining events
corresponding to the events
of the current time.
Faithful
Translation
41 Sau đó, tiếp tục xét các
thời điểm và tập các sự
kiện còn lại trong cơ sở dữ
liệu DB sẽ tạo được cây
FS-Tree phù hợp với dữ
liệu đầu vào.
Then, continuing to consider
the remaining time and set of
events in the DB database, it
will create an FS-Tree that
matches the input data.
Literal
Translation
42 Các nút con của cây FS-
Tree được xây dựng bằng
cách tổ hợp các nút có
chứa sự kiện thuộc cùng
thời điểm ở mức 2, tức là
thuộc cùng 1 nhánh.
The FS-Tree child nodes are
constructed by combining
nodes that contain events at
the same time at level 2, that
is, belonging to the same
branch.
Word – for –
word
Translation
43 Thực hiện tạo nhánh cho
cây cần đối số đầu vào là
nút Node cần tạo nhánh,
cây FS-Tree.
To branch a tree, the input is
the isolated node Node, the
FS-Tree tree, and the output
is the FS-Tree branched with
the node Node .
Free
Translation
44 Lấy các nút có liên kết với
nút có cùng thời điểm với
nút Node và đưa vào hàng
đợi.
Get the nodes associated
with the node at the same
time as the node Node and
put it in the queue.
Word – for –
word
Translation
45 Lấy nút thứ nhất trong
hàng đợi tổ hợp với nút
Node, chuyển vào tập con
của nút đang xét, lấy các
nút có liên kết với nút thứ
nhất chuyển
vào hàng đợi.
Take the first node in the
queue and combine it with the
node Node, move it into the
child set of the current node,
then take the nodes linked
to the first node and move
them into the queue.
Word – for –
word
Translation
46 Trường hợp 1: Thời điểm
thêm vào là mới.Thực
hiện tạo nhánh mới ứng
với nút chứa thời đang
xét. Chèn lần lượt từng sự
kiện thuộc thời điểm thêm
đó vào cây ở mức tiếp
theo.
Case 1: The added time is
new. Create a new
branch corresponding to the
node with the current time
and insert each event at that
time into the tree at the next
level.
Gist
Translation
47 Trường hợp 2: Thời điểm
đã tồn tại trên cây, sự kiện
thuộc thời điểm đó cũng
đã có trên cây thì không
thực hiện thao tác chèn
vào cây.
Case 2: If the time already
exists in the tree and the event
at that time also exists in the
tree, the insertion is
not performed.
Gist
Translation
48 Bài báo đã giải quyết
được vấn đề trích xuất các
dãy sự kiện phổ biến ứng
với các khung thời gian
khác nhau và với các độ
phổ biến khác nhau trên
cơ sở dữ liệu tăng trưởng.
The paper solves the problem
of extracting frequent event
sequences for different
timeframes and with different
frequencies on a growing
database.
Word – for –
word
Translation
49 Thời gian trích xuất được
rút ngắn rất nhiều so với
khi thực hiện trích xuất
bằng thuật toán TSET-
Miner.
The extraction time is greatly
shortened compared to when
performing extraction using
TSET-Miner algorithm.
Word – for –
word
Translation
50 Quá trình xây dựng cây
FS-Tree chỉ cần duyệt cơ
sở dữ liệu 1 lần, giúp tiết
kiệm rất nhiều chi phí so
với khi xây dựng cây
TSET.
Building an FS-Tree with
one database scan saves a lot
of cost compared to building
a TSET tree.
Semantic
Translation
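The tree structure described in sentences 10 to 13 of the table above (an empty root, one level-1 node per time, event nodes below each time, and deeper nodes formed by combining events of the same branch) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the algorithm of the source article: the names Node, build_tree and sequences, the (time, event) input format, and the prefix-tree way of linking combined events are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    time: Optional[int] = None                  # None only for the empty root
    seq: tuple = ()                             # events combined so far; empty for root and time nodes
    links: list = field(default_factory=list)   # links to child nodes


def _extend(parent, events, start):
    """Attach every combination of events[start:] below parent (assumed prefix-tree layout)."""
    for i in range(start, len(events)):
        child = Node(time=parent.time, seq=parent.seq + (events[i],))
        parent.links.append(child)
        _extend(child, events, i + 1)


def build_tree(db):
    """Build the tree from (time, event) pairs with a single pass over db."""
    by_time = {}
    for time, event in db:                      # the only scan of the database
        by_time.setdefault(time, []).append(event)

    root = Node()                               # the root is an empty node
    for time in sorted(by_time):
        time_node = Node(time=time)             # level 1: one node per time
        root.links.append(time_node)
        _extend(time_node, sorted(set(by_time[time])), 0)
    return root


def sequences(node):
    """Yield (time, seq) for every node that holds an event sequence."""
    if node.seq:
        yield node.time, node.seq
    for child in node.links:
        yield from sequences(child)
```

For example, build_tree([(0, "a"), (0, "b"), (1, "c")]) gives a root with two level-1 nodes (times 0 and 1); the branch for time 0 holds the sequences ("a",), ("a", "b") and ("b",), and the branch for time 1 holds ("c",).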
3.3. English - Vietnamese Translation
No. Source text Translation Method
1 The rapid development of
database techniques
facilitates the storage and
usage of massive data
from business
corporations,
governments, and
scientific organizations.
Sự phát triển nhanh chóng
của các kỹ thuật cơ sở dữ
liệu tạo điều kiện cho việc
lưu trữ và sử dụng dữ liệu
khổng lồ từ các tập đoàn
kinh doanh, chính phủ và các
tổ chức khoa học.
Literal
Translation
2 How to obtain valuable
information from various
databases has received
considerable attention,
which results in the sharp
rise of related research
topics.
Làm thế nào để lấy được
thông tin có giá trị từ các cơ
sở dữ liệu khác nhau đã nhận
được sự quan tâm đáng kể,
dẫn đến sự gia tăng mạnh mẽ
của các chủ đề nghiên cứu
liên quan.
Word-for-
word
Translation
3 Among the topics, the
high utility itemset mining
problem is one of the most
important, and it derives
from the famous frequent
itemset mining problem
[7, 8].
Trong số các chủ đề, vấn đề
khai thác tập hữu ích cao là
một trong những chủ đề quan
trọng nhất, và nó bắt nguồn
từ vấn đề khai thác tập phổ
biến đã được biết đến trước
đó [7, 8].
Semantic
Translation

4 Mining frequent itemsets
is to identify the sets of
items that appear
frequently in transactions
in a database
Khai thác các tập phổ biến là
xác định các tập hợp các
mục xuất hiện thường xuyên
trong các giao dịch của cơ
sở dữ liệu
Gist
Translation
5 The frequency of an
itemset is measured with
the support of the itemset,
i.e., the number of
transactions containing the
itemset.
Tần suất của một tập
mục được gọi là độ hỗ trợ
của tập hợp đó, tức là số
lượng giao dịch chứa tập
mục.
Word-for-
word
Translation
6 If the support of an
itemset exceeds a user-
specified minimum
support threshold, the
itemset is considered as
frequent.
Nếu độ hỗ trợ của một tập
hợp thỏa ngưỡng hỗ trợ tối
thiểu do người dùng chỉ định,
thì tập hợp đó được gọi là tập
phổ biến.
Word-for-
word
Translation
7 Most frequent itemset
mining algorithms employ
the downward closure
property of itemsets [4].
Hầu hết các thuật toán khai
thác tập phổ biến áp dụng
tính chất bao đóng giảm của
tập hợp[4].
Semantic
Translation
8 That is, all supersets of an
infrequent itemset are
infrequent, and all subsets
of a frequent itemset are
frequent.
Nghĩa là, tất cả các cha của
một tập không phổ biến là
tập không phổ biến và tất cả
các tập con của một tập phổ
biến đều là tập phổ biến.
Word-for-
word
Translation
9 The property provides the
algorithms with a
powerful pruning strategy.
Tính chất này giúp cho các
thuật toán có một chiến lược
tỉa hiệu quả
Semantic
Translation
10 In the process of mining
frequent itemsets, once an
infrequent itemset is
identified, the algorithms
no longer check all
supersets of the itemset
Trong quá trình khai thác các
tập phổ biến, khi một tập
không phổ biến được xác
định, các thuật toán không
còn kiểm tra tất cả các cha
của tập đó nữa.
Word-for-
word
Translation
11 For example, for a
database with n items,
after the algorithms
identify an infrequent
itemset containing k items,
there is no need to check
all of its supersets, i.e.,
2^(n−k) − 1 itemsets
Ví dụ, cho cơ sở dữ liệu có n
mục, sau khi thuật toán xác
định được một tập không phổ
biến chứa k mục, thuật toán
sẽ không cần phải kiểm tra tất
cả 2^(n−k) − 1 tập cha của nó
Word-for-
word
Translation
12 Mining of frequent
itemsets only takes the
presence and absence of
items into account
Khai thác tập phổ biến chỉ
quan tâm tới sự xuất hiện
hay không xuất hiện của các
mục trong giao dịch
Faithful
Translation
13 Other information about
items is not considered,
such as the independent
utility of an item and the
context utility of an item
in a transaction
Thông tin khác về các mặt
hàng không được xem xét,
chẳng hạn như lợi ích riêng
của một mặt hàng và lợi ích
trong ngữ cảnh của một mặt
hàng khi thực hiện giao dịch
Semantic
Translation
14 Typically, in a
supermarket database,
each item has a distinct
price/profit, and each item
in a transaction is
associated with a distinct
count which means the
quantity of the item one
bought
Thông thường, trong cơ sở
dữ liệu bán hàng, mỗi mặt
hàng có một mức giá / lợi
nhuận khác nhau và mỗi mặt
hàng trong giao dịch được
liên kết với một số lượng
chính là lượng hàng mà
người ta đã mua
Semantic
Translation
15 There are seven items in
the utility table and seven
transactions in the
transaction table in the
database.
Có bảy mục trong bảng độ
hữu ích và bảy giao dịch
trong cơ sở dữ liệu được mô
tả trong bảng.
Semantic
Translation
16 To calculate support, an
algorithm only makes use
of the information of the
first two columns in the
transaction table, the
information of both the
utility table and the other
columns in the transaction
table are discarded.
Để tính toán hỗ trợ, thuật
toán chỉ sử dụng thông tin
của hai cột đầu tiên trong
bảng giao dịch, thông tin của
cả bảng tiện ích và các cột
khác trong bảng giao dịch
đều bị loại bỏ.
Faithful
Translation
17 However, an itemset with
high support may have
low utility, or vice versa
Tuy nhiên, một tập hợp có độ
hỗ trợ cao có thể có độ hữu
ích thấp hoặc ngược lại
Semantic
Translation
18 For example, the support
and utility of itemset {bc}
appearing in T1, T2, and
T6 are 3 and 18
respectively, and those of
itemset {de} appearing in
T2 and T5 are 2 and 22.
Ví dụ: độ hỗ trợ và độ hữu
ích của tập {bc}
xuất hiện trong T1, T2 và T6
lần lượt là 3 và 18, còn độ hỗ
trợ và độ hữu ích của tập {de}
xuất hiện trong T2 và T5 là 2 và
22.
Semantic
Translation
19 In some applications, such
as market analysis, one
may be more interested in
the utility rather than
support of itemsets.
Trong một số ứng dụng, như
phân tích thị trường, người ta
có thể quan tâm đến lợi
nhuận hơn là số lần bán của
các sản phẩm
Semantic
Translation
20 Traditional frequent
itemset mining algorithms
cannot evaluate the utility
information about
itemsets.
Các thuật toán khai thác tập
phổ biến truyền thống không
thể tính độ hữu ích của các
tập hợp
Word-for-
word
Translation

21 Like frequent itemsets,
itemsets with utilities not
less than a user-specified
minimum utility threshold
are generally valuable and
interesting, and they are
called “high utility
itemsets”.
Giống như tập phổ biến, các
tập có độ hữu ích không thấp
hơn ngưỡng độ hữu ích tối
thiểu do người dùng chỉ định
là các tập có giá trị và thú vị
và các tập đó được gọi là tập
có độ hữu ích cao
Idiomatic
Translation
22 To mine all high utility
itemsets from a database
is very intractable,
because the downward
closure property of
itemsets no longer holds
for high utility itemsets.
Để khai thác tất cả các tập có
độ hữu ích cao từ một cơ sở
dữ liệu là rất khó, bởi vì tính
chất bao đóng giảm không áp
dụng được với tập có độ hữu
ích cao.
Idiomatic
Translation
23 When items are appended
to an itemset one by one,
the support of the itemset
monotonously decreases
or remains unchanged, but
the utility of the itemset
varies irregularly.
Khi từng mục được thêm vào
một tập hợp vật phẩm, độ hỗ
trợ của tập hợp đó sẽ giảm
hoặc không thay đổi một
cách đơn điệu, nhưng độ hữu
ích của tập hợp đó lại thay
đổi bất thường.
Word-for-
word
Translation
24 For example, for the
database in Fig. 1, the
supports of {a}, {ab},
{abc}, and {abcd} are 4, 3,
2, and 1, but the utilities
of these itemsets are 16,
26, 21, and 14,
respectively
Ví dụ: đối với cơ sở dữ liệu
trong Hình 1, các hỗ trợ của
{a}, {ab}, {abc} và {abcd}
là 4, 3, 2 và 1, nhưng độ hữu
ích tương ứng của các tập
hợp này là 16, 26 , 21 và 14
Word-for-
word
Translation
25 Suppose the threshold is
20, and then high utility
{abc} contains both high
utility {ab} and low utility
{a}.
Giả sử ngưỡng là 20, thì
{abc} là tập có độ hữu ích
cao chứa cả tập có độ hữu ích
cao {ab} và tập có độ hữu
ích thấp {a}.
Word-for-
word
Translation
26 Therefore, the pruning
strategy used in the
frequent itemset mining
algorithms becomes
invalid.
Do đó, chiến lược cắt tỉa
được sử dụng trong các thuật
toán khai thác tập phổ biến
không còn hợp lệ.
Word-for-
word
Translation
27 Recently, a number of
high utility itemset mining
algorithms have been
proposed [25, 18, 14, 5,
23, 22].
Gần đây, một số thuật toán
khai thác tập có độ hữu ích
cao đã được đề xuất [25, 18,
14, 5, 23, 22].
Faithful
Translation
28 Most of the algorithms
adopt a similar
framework: firstly,
generate candidate high
utility itemsets from a
database; secondly,
compute the exact utilities
of the candidates by
scanning the database to
identify high utility
itemsets.
Hầu hết các thuật toán áp
dụng một phương pháp: thứ
nhất, phát sinh các ứng viên
có độ hữu ích cao từ cơ sở dữ
liệu; thứ hai, tính toán chính
xác độ hữu ích của các ứng
viên bằng cách quét cơ sở dữ
liệu để tìm tập có độ hữu ích
cao.
Word-for-
word
Translation
29 To solve these problems,
we propose in this paper
an algorithm for high
utility itemset mining.
Để giải quyết những vấn đề
này, chúng tôi đề xuất trong
bài báo này một thuật toán để
khai thác tập có độ hữu ích
cao.
Word-for-
word
Translation
30 A novel structure, called
utility-list, is proposed. A
utility-list stores not only
the utility information
about an itemset but also
the heuristic information
about whether the itemset
should be pruned or not
Một cấu trúc mới, được gọi
là danh sách tiện ích (UL),
được đề xuất. Một danh sách
tiện ích không chỉ lưu trữ
thông tin độ hữu ích của một
tập hợp mà còn lưu trữ thông
tin thêm về việc tập hợp vật
phẩm có nên được cắt bớt
hay không
Word-for-
word
Translation
31 An efficient algorithm,
called HUI-Miner (High
Utility Itemset Miner), is
developed.
Một thuật toán hiệu quả,
được gọi là HUI-Miner
(High Utility Itemset Miner),
được phát triển.
Word-for-
word
Translation
32 Different from previous
algorithms, HUI-Miner
does not generate
candidate high utility
itemsets
Khác với các thuật toán trước
đây, HUI-Miner không tạo ra
các ứng viên cho tập có độ
hữu ích cao
Faithful
Translation
33 After constructing the
initial utility-lists from a
mined database, HUI-
Miner can mine high
utility itemsets from these
utility-lists
Sau khi khởi tạo utility-list
ban đầu từ cơ sở dữ liệu,
HUI-Miner có thể khai thác
các tập có độ hữu ích cao từ
các utility-list này
Faithful
Translation
34 Extensive experiments on
various databases were
performed to compare
HUI-Miner with the state-
of-the-art algorithms.
Các thực nghiệm mở rộng
trên các cơ sở dữ liệu khác
nhau đã được thực hiện để so
sánh HUI-Miner với các
thuật toán hiện đại nhất.
Word-for-
word
Translation
35 Experimental results that
show HUI-Miner
outperforms these
algorithms are reported
Kết quả thực nghiệm cho
thấy HUI-Miner thực thi tốt
hơn các thuật toán đã được
trình bày
Word-for-
word
Translation
36 Running time was
recorded by the “time”
command, and it contains
input time, CPU time, and
output time.
Thời gian chạy được ghi lại
bằng lệnh “time” và nó chứa
thời gian đầu vào, thời gian
CPU và thời gian đầu ra.
Word-for-
word
Translation
37 The output results of the
four algorithms are the
same for a mining task,
and they were written to
“/dev/null”.
Kết quả đầu ra của bốn thuật
toán giống nhau đối với mỗi
thực nghiệm và chúng được
viết thành “/dev/null”.
Semantic
Translation
38 We terminated a mining
task, once its running time
exceeds 10000 seconds.
Chúng tôi dừng việc khai
thác nếu thời gian thực thi
vượt quá 10000 giây.
Word-for-
word
Translation
39 When measuring running
time, we varied the
minutil for each database.
Khi đo thời gian chạy, chúng
tôi thay đổi minutil cho mỗi
cơ sở dữ liệu.
Faithful
Translation
40 The lower the minutil is,
the larger the number of
high utility itemsets is,
and thus the more the
running time is.
Minutil càng thấp thì số
lượng tập có độ hữu ích cao
càng lớn và do đó thời gian
chạy càng nhiều.
Faithful
Translation
41 For example, for database
chain in Fig. 11(b), when
the minutils are 0.004%
and 0.009%, the numbers
of high utility itemsets are
18480 and 4578, and the
running times of HUI-
Miner are 580.9 seconds
and 445.1 seconds,
respectively
Ví dụ: đối với cơ sở dữ liệu
chain trong Hình 11 (b), khi
minutils là 0,004% và
0,009%, số lượng các tập có
độ hữu ích cao là 18480 và
4578, và thời gian chạy
tương ứng của HUI-Miner là
580,9 giây và 445,1 giây.
Faithful
Translation
42 In addition, the curve for
UP-Growth almost totally
overlaps the curve for UP-
Growth+ in Fig. 11(a); the
running time of
IHUPTWU for any
minutil exceeds 10000
seconds for database
chess, and thus there is no
curve for IHUPTWU in
Fig. 11(c).
Ngoài ra, đường cong cho
UP-Growth gần như hoàn
toàn trùng lặp với đường
cong cho UP-Growth+ trong
Hình 11(a); thời gian chạy
của IHUPTWU với bất kỳ
minutil nào cũng vượt quá
10000 giây khi thực hiện trên
cơ sở dữ liệu chess, do đó
không có biểu đồ thể hiện
IHUPTWU trong Hình
11(c).
Word-for-
word
Translation
43 For almost all databases
and minutils, HUI-Miner
performs the best.
Đối với hầu hết các cơ sở dữ
liệu và minutils, HUI-Miner
thực thi tốt nhất.
Literal
Translation
44 HUI-Miner is almost two
orders of magnitude faster
than the other algorithms
for dense databases.
HUI-Miner nhanh hơn gần
hai bậc độ lớn so với các
thuật toán khác trên các cơ sở
dữ liệu dày đặc.
45 To mine high utility
itemsets, almost all
existing algorithms first
generate candidate high
utility itemsets and
subsequently compute the
exact utility of each
candidate to identify high
utility itemsets.
Để khai thác các tập có độ
hữu ích cao, hầu như tất cả
các thuật toán hiện có trước
tiên tạo ra các tập ứng viên
và sau đó tính toán độ hữu
ích chính xác của từng ứng
viên để xác định các tập có
độ hữu ích cao
Word-for-
word
Translation
46 To improve performance,
previous studies focus on
how to reduce the number
of candidates, which can
lead to the decrease in the
costs of both candidate
generation and utility
computation.
Để cải thiện hiệu suất, các
nghiên cứu trước đây tập
trung vào cách giảm số lượng
ứng viên, điều này có thể dẫn
đến giảm chi phí của cả việc
tạo ứng viên và tính toán độ
hữu ích.
Word-for-
word
Translation
47 In this paper, we have
proposed a novel data
structure, utility-list, and
developed an efficient
algorithm, HUI-Miner, for
high utility itemset
mining.
Trong bài báo này, chúng tôi
đã đề xuất một cấu trúc dữ
liệu mới, utility-list và phát
triển một thuật toán hiệu quả,
HUI-Miner, để khai thác tập
có độ hữu ích cao.
Literal
Translation
48 Utility-lists provide not
only utility information
about itemsets but also
important pruning
information for HUI-
Miner.
Utility-lists không chỉ cung
cấp thông tin tiện ích về các
tập hợp mà còn cung cấp
thông tin cắt tỉa quan trọng
cho HUI-Miner.
Literal
Translation
49 Previous algorithms have
to process a very large
number of candidate
itemsets during their
mining processes.
Các thuật toán trước đây phải
xử lý một số lượng rất lớn
các tập mục ứng viên trong
quá trình khai thác của
chúng.
Word-for-
word
Translation
50 However, most candidate
itemsets are not high
utility and are discarded
finally.
Tuy nhiên, hầu hết các tập
mục ứng viên không có độ
hữu ích cao và cuối cùng bị
loại bỏ.
Semantic
Translation
4. CONCLUSION
4.1 Achievements
After the internship, I feel very grateful to my supervisor, Ms. Pham Thi
Hang Nga, who gave me great help throughout this period. We exchanged emails
frequently to complete the work in the best way possible. I learned many
things from this report. Translation is neither simple nor easy; it demands
time and knowledge. I used several translation methods to complete this
text: word-for-word translation, literal translation, faithful translation,
communicative translation, semantic translation, and free translation. Finally, I
would like to thank my supervisor, Ms. Pham Thi Hang Nga, once again; she
was always beside me and helped me throughout the reporting
process.

Thanks to this precious opportunity, I was able to review and gain
many skills as well as knowledge for my job, such as translation theory
and paper-writing skills. On the other hand, I still have some
shortcomings.
4.2 Shortcomings
- I only used two skills: reading and writing.
- I do not have an English-speaking environment, so I have not
developed communication skills yet.
5. APPENDIX
[1] FS-ALG: THUẬT TOÁN KHAI THÁC DÃY SỰ KIỆN PHỔ BIẾN
Tạp chí Khoa học Trường Đại học Cần Thơ (2015): 128-135
GIỚI THIỆU
Khai thác dữ liệu là quá trình tìm kiếm các tri thức tiềm ẩn có ích, các tập
luật từ cơ sở dữ liệu giao dịch nhằm phục vụ cho các công việc dự báo, ra quyết
định. Một số kỹ thuật khai thác dữ liệu ([15], [17], [18]) chỉ quan tâm đến tập
phần tử mà bỏ qua yếu tố thời gian. Trong khi đó, thuộc tính thời gian có ý
nghĩa rất quan trọng và có yếu tố quyết định đối với nhiều chiến lược dự đoán
thuộc nhiều lĩnh vực như kinh doanh, thương mại, thị trường chứng khoán... Vì
vậy, khai thác dữ liệu có yếu tố thời gian là một chủ đề có vai trò quan trọng
trong khai thác dữ liệu.
Có nhiều kỹ thuật để khai thác dãy sự kiện phổ biến như Mô hình cơ sở
dữ liệu thời gian và ràng buộc thời gian của Mkaouar M. et al. (2011) [2], Khai
thác hiệu quả luật kết hợp liên giao dịch của Tung, A.K.H. et al. (2003) [1], Truy
vấn và thao tác trên cơ sở dữ liệu thời gian của Mkaouar M. et al. (2011) [3],
Khai thác các đoạn phổ biến trong dãy sự kiện của H. Mannila et al. (1997) [4],
Khai thác dãy sự kiện phổ biến sử dụng cây Seq-Tree (2015) [16], Một cấu trúc
cây để khai thác dãy sự kiện của Francisco Guil et al. (2012) [7],... Trong đó,
thuật toán TSET-Miner khai thác các dãy sự kiện phổ biến dựa trên cấu trúc cây
TSET [7]. Thuật toán này phụ thuộc vào khung thời gian và độ phổ biến. Nếu
khung thời gian hoặc độ phổ biến thay đổi thì phải xây dựng lại cây. Việc xây
dựng cây TSET tốn kém rất nhiều thời gian, ứng mỗi nút con trong cây được
sinh ra thì phải duyệt lại toàn bộ cơ sở dữ liệu.
Do đó, để khắc phục nhược điểm của thuật toán TSET-Miner, bài báo đề
xuất cây FS-Tree để lưu trữ các dãy sự kiện tổ hợp ứng với từng thời điểm xuất
hiện của tập phần tử. Việc xây dựng cây FS-Tree chỉ cần duyệt cơ sở dữ liệu
giao dịch đúng 1 lần. Thuật toán FS-Alg thực hiện trích xuất các dãy sự kiện
phổ biến từ cây FS-Tree ứng với khung thời gian và độ phổ biến khác nhau do
người dùng chỉ định. Vì vậy, khi thay đổi khung thời gian hoặc độ phổ biến thì
chỉ thực hiện trích xuất mà không cần phải duyệt lại cơ sở dữ liệu. Thuật
toán FS-Alg đã giải quyết được các yếu điểm còn tồn tại của thuật toán TSET-
Miner, giúp rút ngắn thời gian thực thi.

Phần còn lại của bài báo được tổ chức như sau: Phần 2 giới thiệu các công
trình liên quan. Phần 3 trình bày cơ sở lý thuyết xây dựng thuật toán FS-Alg.
Phần 4 trình bày thuật toán FS-Alg. Phần 5 minh họa ví dụ của thuật toán FS-
Alg. Kết luận và hướng phát triển được mô tả tại Phần 6.
CƠ SỞ LÝ THUYẾT
Nút gốc của cây FS-Tree là nút rỗng. Mỗi thời điểm xuất hiện trong cơ sở
dữ liệu sẽ tạo thành 1 nút trong cây ở mức 1. Các nút ở mức 2 gồm các sự kiện
ứng với thời điểm mà các sự kiện đó thuộc về, tức là các sự kiện có cùng thời
điểm sẽ thuộc về cùng 1 nhánh. Trong cùng 1 nhánh, các nút sẽ tổ hợp lần lượt
với nhau để tạo thành các nút ở các mức sâu hơn.
Hình 1: Một ví dụ về cây FS-Tree
Định nghĩa 2.1.1 (Nút mức 2 trong cây). Mỗi nút trong cây FS-Tree bao gồm
tập một dãy sự kiện, tập các thời điểm xuất hiện dãy sự kiện và tập các liên kết
đến nút con
Định nghĩa 2.1.2 (Nút mức 1 trong cây). Mỗi nút trong cây gồm thời điểm xuất
hiện của dãy sự kiện và tập liên kết đến các nút mức 2.
Định nghĩa 2.1.3 (Nút gốc trong cây). Một nút gốc là nút rỗng có tập các liên
kết đến các nút con.
Định nghĩa 2.1.4 (Cây FS-Tree). Một cây FSTree = ( gồm một nút gốc và một
tập nút phân cấp trong cây, … Tập các nút phân cấp bao gồm các nút cha liên
kết đến các nút con dựa trên tập các liên kết tại mỗi nút trong cây.
Nhánh trong cây là đường đi liên kết đến các nút con liên tiếp từ đến sao
cho giữa 2 nút kề nhau đều có nhánh.
THUẬT TOÁN FS-ALG
Mô tả thuật toán

Thuật toán FS-Alg được áp dụng trên cơ sở dữ liệu có cấu trúc như sau:
Cây FS-Tree có nút gốc là nút rỗng và tập liên kết đến các nút con. Các
nút ở mức 1 chứa các thời điểm xuất hiện của tập phần tử trong cơ sở dữ liệu.
Nút này bao gồm thời điểm xuất hiện của sự kiện, tập các liên kết đến nút con.
Mỗi thời điểm sẽ tạo thành một nhánh trong cây FS-Tree. Lần lượt chèn từng sự
kiện vào cây FS-Tree với nút gốc là nút có cùng thời điểm với sự kiện đang xét.
Các nút con ở mức 2 trong cây FS-Tree được tạo ra bằng cách tổ hợp với các
nút đồng cấp và có cùng thời điểm, tức là thuộc cùng 1 nhánh. Mỗi nút trong cây
thuộc mức 2 xuống các mức sâu hơn gồm dãy sự kiện, thời điểm, liên kết đến
nút con của dãy sự kiện.
Duyệt nhánh ứng với nút chứa thời điểm 0, chèn các dãy sự kiện vào tập.
Xét nhánh chứa thời điểm X1, trích xuất các dãy sự kiện chứa trong nhánh này
và chuyển vào tập Y. Thực hiện tích đề-các giữa tập X1 với tập Y và chuyển
vào tập X2. Sau đó, chuyển tập X2 vào tiếp sau tập X1. Thực hiện tương tự cho
nhánh chứa thời điểm 2, 3 ứng với thời điểm gốc là 0.
Với thời điểm 1, sao chép các dãy sự kiện ở thời điểm 0 được lưu trong
tập X1 vào tập X2. Khi sao chép, trong từng dãy sự kiện, chỉ lấy các sự kiện có
thời điểm lớn hơn hoặc bằng 1. Nếu dãy sự kiện nào xuất hiện từ 2 lần trở lên thì
chỉ giữ lại 1 dãy. Sao chép lần lượt từng dãy sự kiện trong tập X2, đồng thời
chuẩn hóa và chuyển sang tập X1. Nếu dãy sự kiện sao chép được chuẩn hóa đã
tồn tại trong tập thì tăng số lần xuất hiện. Ngược lại thì chuyển tiếp vào sau.
Tiếp tục với thời điểm 2, xét tập X2 đã có ở thời điểm 1. Duyệt tập X2,
chỉ giữ lại các sự kiện có thời điểm lớn hơn hoặc bằng 2. Nếu dãy sự kiện nào
xuất hiện từ 2 lần trở lên thì chỉ giữ lại 1 dãy. Sao chép lần lượt từng dãy sự kiện
trong tập, đồng thời chuẩn hóa và chuyển sang tập X1. Nếu dãy sự kiện sao chép
được chuẩn hóa đã tồn tại trong tập X1 thì tăng số lần xuất hiện. Ngược lại thì
chuyển tiếp vào sau X1.
Thực hiện tương tự như thời điểm 2 cho các thời điểm còn lại trên cây FS-
Tree sẽ được tập X1 hoàn chỉnh.
Tiến hành trích xuất trên tập X1 này sẽ thu được các dãy sự kiện phổ biến
ứng với các khung thời gian và các độ phổ biến khác nhau.
Nếu có sự thay đổi về cơ sở dữ liệu thì thực hiện thao tác cập nhật trên
cây FS-Tree. Sau đó, áp dụng thuật toán FS-Alg để trích xuất các dãy sự kiện
phổ biến từ cây FS-Tree.
Thuật toán FS-Alg

Tạo cây FS-Tree cần đối số đầu vào là tập các thời điểm và tập các sự
kiện ứng với thời điểm xuất hiện thuộc cơ sở dữ liệu DB, đầu ra là cây FS-Tree.
Đầu tiên, khởi tạo nút gốc là nút rỗng. Duyệt cơ sở dữ liệu, chèn thời điểm đầu
tiên vào dưới nút gốc bằng thuật toán insertNode DB. Chèn sự kiện thứ nhất
vào dưới nút có cùng thời điểm với sự kiện đang xét bằng thuật
toán insertNode. Tạo nhánh cho nút này bằng thuật toán CreateBranch.
Thực hiện tương tự cho các sự kiện còn lại ứng với thời điểm đang xét.
Sau đó, tiếp tục xét các thời điểm và tập các sự kiện còn lại trong cơ sở dữ liệu
sẽ tạo được cây FS-Tree phù hợp với dữ liệu đầu vào.
Các nút con của cây FS-Tree được xây dựng bằng cách tổ hợp các nút có
chứa sự kiện thuộc cùng thời điểm ở mức 2, tức là thuộc cùng 1 nhánh.
Thao tác này chính là tạo nhánh cho cây. Thực hiện tạo nhánh cho cây cần
đối số đầu vào là nút cần tạo nhánh, cây FS-Tree . Đầu ra là cây FS-Tree đã tạo
nhánh. Lấy các nút có liên kết với nút có cùng thời điểm với nút và đưa vào
hàng đợi. Lấy nút thứ nhất trong hàng đợi tổ hợp với nút, chuyển vào tập con
của nút đang xét, lấy các nút có liên kết với nút thứ nhất chuyển vào hàng đợi.
Thực hiện tương tự cho các nút còn lại trong hàng đợi cho đến khi hàng đợi rỗng
thì dừng lại.

Thao tác chèn nút vào cây cần đối số đầu vào là thời điểm và tập các sự
kiện thuộc thời điểm đang xét. Đầu ra là cây FS-Tree đã được cập nhật. Chèn
nút vào cây có các trường hợp có thể xảy ra:
Trường hợp 1 : Thời điểm thêm vào là mới. Thực hiện tạo nhánh mới ứng
với nút chứa thời gian đang xét. Chèn lần lượt từng sự kiện thuộc thời điểm
thêm đó vào cây ở mức tiếp theo.
Trường hợp 2 : Thời điểm đã tồn tại trên cây, sự kiện thuộc thời điểm đó
cũng đã có trên cây thì không thực hiện thao tác chèn vào cây.

Trường hợp 3 : Thời điểm đã tồn tại trên cây, sự kiện là mới ứng với thời
điểm đó. Thực hiện chèn sự kiện đó vào cây và thực hiện tổ hợp với các nút đã
có trong nhánh đang xét.
Trích xuất các dãy sự kiện phổ biến từ cây FS-Tree cần đầu vào là cây FS-
Tree, khung thời gian W, ngưỡng ; đầu ra là tập các dãy sự kiện phổ biến Fseq
thỏa yêu cầu đầu vào. Lần lượt xét từng nhánh ứng với từng thời điểm. Duyệt
nhánh chứa thời điểm 0 bằng thuật toán getX và chuyển vào tập X1, duyệt
nhánh chứa thời điểm 2 bằng thuật toán getX vào tập Y, thực hiện tích đề-các
giữa X1 và Y và chuyển vào tập X2. Thực hiện lần lượt cho các nhánh còn lại
ứng với thời điểm gốc là 0.
Với thời điểm gốc là 1, gán tập X1 vào X2. Trong từng dãy sự kiện của tập X2,
chỉ giữ lại các sự kiện có thời điểm lớn hơn hoặc bằng 1. Nếu các dãy sự kiện
trùng nhau thì chỉ giữ lại 1 dãy sự kiện. Sao chép và chuẩn hóa từng dãy sự kiện
trong tập X2 và chuyển vào tập X1. Thực hiện tương tự cho các thời điểm còn
lại trên cây FS-Tree. Trích xuất trên tập X1 sẽ thu được các dãy sự kiện phổ biến
thỏa yêu cầu.

Duyệt nhánh cây FS-Tree theo thời điểm cần đầu vào là cây FS-Tree và
thời điểm, đầu ra là tập chứa các dãy sự kiện thỏa đầu vào. Lần lượt chèn từng
nút có liên kết với nút chứa thời điểm vào hàng đợi. Xét nút đầu tiên trong hàng
đợi, chuyển sự kiện trong nút này vào tập, lấy các nút có liên kết với nút đầu tiên
và chuyển vào hàng đợi. Thực hiện tương tự như vậy cho đến khi hàng đợi rỗng
thì dừng lại.
KẾT LUẬN
Bài báo đã giải quyết được vấn đề trích xuất các dãy sự kiện phổ biến ứng
với các khung thời gian khác nhau và với các độ phổ biến khác nhau trên cơ sở
dữ liệu tăng trưởng. Thời gian trích xuất được rút ngắn rất nhiều so với khi thực
hiện trích xuất bằng thuật toán TSET-Miner. Điều này đã được minh chứng
trong phần kết quả thực nghiệm. Quá trình xây dựng cây FS-Tree chỉ cần duyệt
cơ sở dữ liệu 1 lần, giúp tiết kiệm rất nhiều chi phí so với khi xây dựng cây
TSET. Hướng phát triển của bài báo là đề xuất phương pháp để tối ưu việc lưu
trữ dữ liệu thời gian sao cho việc truy xuất sẽ tiêu tốn ít thời gian nhất.
[2] Mining High Utility Itemsets without Candidate Generation
ABSTRACT
High utility itemsets refer to the sets of items with high utility like profit
in a database, and efficient mining of high utility itemsets plays a crucial role in
many real-life applications and is an important research issue in data mining area.
To identify high utility itemsets, most existing algorithms first generate
candidate itemsets by overestimating their utilities, and subsequently compute
the exact utilities of these candidates. These algorithms incur the problem that a
very large number of candidates are generated, but most of the candidates are
found out to be not high utility after their exact utilities are computed. In this
paper, we propose an algorithm, called HUI-Miner (High Utility Itemset Miner),
for high utility itemset mining. HUI-Miner uses a novel structure, called utility-
list, to store both the utility information about an itemset and the heuristic
information for pruning the search space of HUI-Miner. By avoiding the costly
generation and utility computation of numerous candidate itemsets, HUI-Miner
can efficiently mine high utility itemsets from the utility-lists constructed from a
mined database. We compared HUI-Miner with the state-of-the-art algorithms
on various databases, and experimental results show that HUI-Miner
outperforms these algorithms in terms of both running time and memory
consumption.
1. INTRODUCTION
The rapid development of database techniques facilitates the storage and
usage of massive data from business corporations, governments, and scientific
organizations. How to obtain valuable information from various databases has
received considerable attention, which results in the sharp rise of related
research topics. Among the topics, the high utility itemset mining problem is one
of the most important, and it derives from the famous frequent itemset mining
problem [7, 8]. Mining frequent itemsets is to identify the sets of items that
appear frequently in transactions in a database. The frequency of an itemset is
measured with the support of the itemset, i.e., the number of transactions
containing the itemset. If the support of an itemset exceeds a user-specified
minimum support threshold, the itemset is considered as frequent. Most frequent
itemset mining algorithms employ the downward closure property of itemsets
[4]. That is, all supersets of an infrequent itemset are infrequent, and all subsets
of a frequent itemset are frequent. The property provides the algorithms with a
powerful pruning strategy. In the process of mining frequent itemsets, once an
infrequent itemset is identified, the algorithms no longer check all supersets of
the itemset. For example, for a database with n items, after the algorithms
identify an infrequent itemset containing k items, there is no need to check all of
its supersets, i.e., $2^{(n-k)} - 1$ itemsets. Mining of frequent itemsets only takes
the presence and absence of items into account. Other information about items is
not considered, such as the independent utility of an item and the context utility
of an item in a transaction. Typically, in a supermarket database, each item has a
distinct price/profit, and each item in a transaction is associated with a distinct
count which means the quantity of the item one bought. Consider the database in
Fig. 1. There are seven items in the utility table and seven transactions in the
transaction table in the database. To calculate support, an algorithm only makes
use of the information in the first two columns of the transaction table; the
information in both the utility table and the other columns of the transaction
table is discarded. However, an itemset with high support may have low utility,
or vice versa. For example, the support and utility of itemset {bc} appearing in
T1, T2, and T6 are 3 and 18 respectively (See Section 2.1 for utility
computation), and those of itemset {de} appearing in T2 and T5 are 2 and 22. In
some applications, such as market analysis, one may be more interested in the
utility rather than support of itemsets. Traditional frequent itemset mining
algorithms cannot evaluate the utility information about itemsets. Like frequent
itemsets, itemsets with utilities not less than a user-specified minimum utility
threshold are generally valuable and interesting, and they are called “high utility
itemsets”. Mining all high utility itemsets from a database is intractable,
because the downward closure property of itemsets no longer holds for high
utility itemsets. When items are appended to an itemset one by one, the support
of the itemset monotonically decreases or remains unchanged, but the utility of
the itemset varies irregularly. For example, for the database in Fig. 1, the
supports of {a}, {ab}, {abc}, and {abcd} are 4, 3, 2, and 1, but the utilities of
these itemsets are 16, 26, 21, and 14, respectively. Suppose the threshold is 20;
then the high utility itemset {abc} contains both the high utility itemset {ab} and
the low utility itemset {a}.
Therefore, the pruning strategy used in the frequent itemset mining algorithms
becomes invalid. Recently, a number of high utility itemset mining algorithms
have been proposed [25, 18, 14, 5, 23, 22]. Most of the algorithms adopt a
similar framework: firstly, generate candidate high utility itemsets from a
database; secondly, compute the exact utilities of the candidates by scanning the
database to identify high utility itemsets. However, the algorithms often generate
a very large number of candidate itemsets and thus are confronted with two
problems: (1) excessive memory requirement for storing candidate itemsets; (2)
a large amount of running time for generating candidates and computing their
exact utilities. When the number of candidates is so large that they cannot be
stored in memory, the algorithms will fail or their performance will be degraded
due to thrashing. To solve these problems, we propose in this paper an algorithm
for high utility itemset mining. The contributions of the paper are as follows:
1. A novel structure, called utility-list, is proposed. A utility-list stores not
only the utility information about an itemset but also the heuristic
information about whether the itemset should be pruned or not.
2. An efficient algorithm, called HUI-Miner (High Utility Itemset Miner), is
developed. Different from previous algorithms, HUI-Miner does not
generate candidate high utility itemsets. After constructing the initial
utility-lists from a mined database, HUI-Miner can mine high utility
itemsets from these utility-lists.
3. Extensive experiments on various databases were performed to compare
HUI-Miner with the state-of-the-art algorithms. Experimental results that
show HUI-Miner outperforms these algorithms are reported.
After the related background is given in Section 2, Sections 3, 4, and 5
present the three contributions listed above in turn. Our work is
summarized in Section 6.
2. BACKGROUND
In this section, we first give the formal description of the high utility
itemset mining problem and subsequently introduce the previous solutions to the
problem.
2.1 Problem Definition
Let I = {i1, i2, i3, . . . , in} be a set of items and DB be a database composed
of a utility table and a transaction table. Each item in I has a utility value in the
utility table. Each transaction T in the transaction table has a unique identifier
(tid) and is a subset of I, in which each item is associated with a count value. An
itemset is a subset of I and is called a k-itemset if it contains k items.
DEFINITION 1. The external utility of item i, denoted as eu(i), is the
utility value of i in the utility table of DB.
DEFINITION 2. The internal utility of item i in transaction T, denoted
as 𝑖𝑢(𝑖, 𝑇), is the count value associated with i in T in the transaction table of
DB.
DEFINITION 3. The utility of item i in transaction T, denoted as
𝑢(𝑖, 𝑇), is the product of iu(i, T) and eu(i), where 𝑢(𝑖, 𝑇) = 𝑖𝑢(𝑖, 𝑇) × 𝑒𝑢(𝑖).
For example, in Fig. 1, 𝑒𝑢(𝑒) = 4, 𝑖𝑢(𝑒, 𝑇5) = 2, and 𝑢(𝑒, 𝑇5) =
𝑖𝑢(𝑒, 𝑇5) × 𝑒𝑢(𝑒) = 2 × 4 = 8.
DEFINITION 4. The utility of itemset X in transaction T, denoted as
$u(X, T)$, is the sum of the utilities of all the items in X in T in which X is
contained, where $u(X, T) = \sum_{i \in X \wedge X \subseteq T} u(i, T)$.
DEFINITION 5. The utility of itemset X, denoted as u(X), is the sum of
the utilities of X in all the transactions containing X in DB, where
$u(X) = \sum_{T \in DB \wedge X \subseteq T} u(X, T)$. For example, in Fig. 1,
$u(\{ae\}, T2) = u(a, T2) + u(e, T2) = 4 \times 1 + 1 \times 4 = 8$, and
$u(\{ae\}) = u(\{ae\}, T2) + u(\{ae\}, T5) = 8 + 13 = 21$.
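Definitions 1 to 5 can be checked with a few lines of Python. The sketch below is illustrative only. Fig. 1 is not reproduced in this text, so the `EU` and `DB` tables are a hypothetical reconstruction chosen to agree with the values quoted here (for example eu(e) = 4, iu(e, T5) = 2, and the utilities of {ae}). The function names `u_item`, `u_in`, `u`, and `support` are illustrative and do not come from the paper.

```python
# Hypothetical reconstruction of the Fig. 1 database (assumption: the figure is
# not reproduced here; values are chosen to match the numbers quoted in the text).
EU = {"a": 1, "b": 2, "c": 1, "d": 5, "e": 4, "f": 3, "g": 1}  # external utilities
DB = {  # tid -> {item: count}; key 1 stands for T1, and so on
    1: {"b": 1, "c": 2, "d": 1, "g": 1},
    2: {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    3: {"a": 4, "c": 2, "d": 1},
    4: {"c": 2, "e": 1, "f": 1},
    5: {"a": 5, "b": 2, "d": 1, "e": 2},
    6: {"a": 3, "b": 4, "c": 1, "f": 2},
    7: {"d": 1, "g": 5},
}


def u_item(i, tid):
    """u(i, T) = iu(i, T) * eu(i) (Definition 3)."""
    return DB[tid][i] * EU[i]


def u_in(X, tid):
    """u(X, T): sum of item utilities when X is contained in T, else 0 (Definition 4)."""
    if not set(X).issubset(DB[tid]):
        return 0
    return sum(u_item(i, tid) for i in X)


def u(X):
    """u(X): sum of u(X, T) over all transactions of DB (Definition 5)."""
    return sum(u_in(X, tid) for tid in DB)


def support(X):
    """Number of transactions that contain every item of X."""
    return sum(1 for tid in DB if set(X).issubset(DB[tid]))


# Itemsets are written as strings of one-letter item names, e.g. "ae" for {ae}.
print(u_item("e", 5))          # u(e, T5)
print(u_in("ae", 2), u("ae"))  # u({ae}, T2) and u({ae})
print(support("bc"), u("bc"))  # support and utility of {bc}
```

Under this reconstruction the three lines print 8, then 8 and 21, then 3 and 18, which are the figures given in the examples of this section and of the introduction.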
DEFINITION 6. The utility of transaction T, denoted as tu(T), is the
sum of the utilities of all the items in T, where $tu(T) = \sum_{i \in T} u(i, T)$, and the
total utility of DB is the sum of the utilities of all the transactions in DB.
Fig. 2 shows the utility of each transaction, for example, tu(T1) =
u(b, T1) + u(c, T1) + u(d, T1) + u(g, T1) = 2 + 2 + 5 + 1 = 10. The total
utility of the database in Fig. 1 is 98. An itemset X is high utility if u(X) is not
less than a user-specified minimum utility threshold denoted as minutil, or the
product of a minutil and the total utility of a mined database if the minutil is a
percentage. Given a database and a minutil, the high utility itemset mining
problem is to discover from the database all the itemsets whose utilities are not
less than the minutil.
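The transaction utility, the total utility, and the high utility test can be sketched in the same way. As before, the `EU` and `DB` tables are a hypothetical reconstruction of Fig. 1 (the figure is not reproduced here), and `is_high_utility` with its `as_fraction` flag is an illustrative helper rather than anything defined in the paper. The flag covers the case where the minutil is given as a share of the total utility.

```python
# Hypothetical reconstruction of the Fig. 1 database (assumption, see lead-in).
EU = {"a": 1, "b": 2, "c": 1, "d": 5, "e": 4, "f": 3, "g": 1}
DB = {
    1: {"b": 1, "c": 2, "d": 1, "g": 1},
    2: {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    3: {"a": 4, "c": 2, "d": 1},
    4: {"c": 2, "e": 1, "f": 1},
    5: {"a": 5, "b": 2, "d": 1, "e": 2},
    6: {"a": 3, "b": 4, "c": 1, "f": 2},
    7: {"d": 1, "g": 5},
}


def tu(tid):
    """tu(T): sum of the utilities of all items in T (Definition 6)."""
    return sum(count * EU[item] for item, count in DB[tid].items())


TOTAL_UTILITY = sum(tu(tid) for tid in DB)


def u(X):
    """u(X): utility of itemset X over all transactions that contain it."""
    items = set(X)
    return sum(
        sum(DB[tid][i] * EU[i] for i in items)
        for tid in DB
        if items.issubset(DB[tid])
    )


def is_high_utility(X, minutil, as_fraction=False):
    """True if u(X) is not less than the threshold.

    With as_fraction=True, minutil is a share of the total utility
    (0.2 means 20%); otherwise it is an absolute utility value.
    """
    threshold = minutil * TOTAL_UTILITY if as_fraction else minutil
    return u(X) >= threshold


print(tu(1), TOTAL_UTILITY)
print(is_high_utility("ab", 20), is_high_utility("a", 20))
```

With a threshold of 20, {ab} (utility 26) passes and {a} (utility 16) does not, which is the situation described in the introduction.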

2.2 Related Work
Before the high utility itemset mining problem was formally proposed
[25] as above, a variation of the problem had been studied, namely the problem
of extracting share frequent itemsets [6, 13, 12] that invariably defines the
external utility of each item as 1. The ZP [6], ZSP [6], FSH [13], ShFSH [12],
and DCG [11] algorithms for share frequent itemset mining can also be used to
mine high utility itemsets. Since the downward closure property cannot be
directly applied, Liu et al. proposed an important property [17] for pruning the
search space of the high utility itemset mining problem.
DEFINITION 7. The transaction-weighted utility of itemset X in DB,
denoted as $twu(X)$, is the sum of the utilities of all the transactions containing
X in DB, where $twu(X) = \sum_{T \in DB \wedge X \subseteq T} tu(T)$.
PROPERTY 1. If twu(X) is less than a given “minutil”, all supersets of
X are not high utility. Rationale. If $X \subseteq X'$, then
$u(X') \le twu(X') \le twu(X) < minutil$.
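A minimal sketch of Definition 7 and of the pruning that Property 1 allows is given below. It again relies on a hypothetical reconstruction of the Fig. 1 database (`EU`, `DB`), and `promising_items` is an illustrative name. The tie-break on the item name in the sort is an added assumption that the text does not specify.

```python
# Hypothetical reconstruction of the Fig. 1 database (assumption, see lead-in).
EU = {"a": 1, "b": 2, "c": 1, "d": 5, "e": 4, "f": 3, "g": 1}
DB = {
    1: {"b": 1, "c": 2, "d": 1, "g": 1},
    2: {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    3: {"a": 4, "c": 2, "d": 1},
    4: {"c": 2, "e": 1, "f": 1},
    5: {"a": 5, "b": 2, "d": 1, "e": 2},
    6: {"a": 3, "b": 4, "c": 1, "f": 2},
    7: {"d": 1, "g": 5},
}


def tu(tid):
    """tu(T): utility of a whole transaction."""
    return sum(count * EU[item] for item, count in DB[tid].items())


def twu(X):
    """twu(X): sum of tu(T) over the transactions containing X (Definition 7)."""
    return sum(tu(tid) for tid in DB if set(X).issubset(DB[tid]))


def promising_items(minutil):
    """Items that survive Property 1, in twu-ascending order.

    An item with twu below minutil cannot belong to any high utility
    itemset, so it is dropped. Ties are broken by item name (assumption).
    """
    kept = [item for item in EU if twu(item) >= minutil]
    return sorted(kept, key=lambda item: (twu(item), item))


print(twu("f"))              # tu(T4) + tu(T6)
print(promising_items(30))   # f and g are pruned
```

Because twu(X) is an upper bound on the utility of X and of every superset of X, discarding an item on this basis never loses a high utility itemset. It can only keep some items that later turn out to be unhelpful.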
Fig. 3 shows the transaction-weighted utilities of all 1-itemsets. For
example, itemset {f} is contained in T4 and T6, and thus 𝑡𝑤𝑢({𝑓}) =
𝑡𝑢(𝑇4) + 𝑡𝑢(𝑇6) = 9 + 18 = 27. If a minutil is equal to 30, all supersets of
{f} are not high utility according to Property 1. The Two-Phase algorithm [18,
17] first adopts Property 1 to prune the search space. Afterwards, the isolated
items discarding strategy (IIDS) was proposed [14]; this strategy can be
incorporated into the above algorithms to improve their performance. For
example, the FUM [14] and DCG+ [14] algorithms outperform ShFSH and
DCG, respectively. ZP, ZSP, FSH, ShFSH, DCG, Two-Phase, FUM, and
DCG+ mine high utility itemsets in the same way that the well-known Apriori
algorithm [4] mines frequent itemsets. Given a database, all 1-itemsets are
initially candidate high utility itemsets. After scanning the database, the
algorithms eliminate unpromising 1-itemsets and generate 2-itemsets from the
remaining 1-itemsets as candidate high utility itemsets. After the second scan
over the database, unpromising 2-itemsets are eliminated and 3-itemsets are
generated as candidates from the remaining 2-itemsets. The procedure is
repeated until no candidate itemset is generated. Finally, these
algorithms, except for DCG and DCG+, compute the exact utilities of all
remaining candidates by an additional database scan to identify high utility
itemsets (DCG and DCG+ compute exact utilities in each database scan).
Besides the two problems mentioned in Section 1, these algorithms suffer from
the level-wise mining problems as well, e.g., repeated database scans. The
algorithms based on the FP-Growth algorithm [9] show better performance.
These algorithms include IHUPTWU [5], UP-Growth [23], and UP-Growth+
[22]. Firstly, they transform a mined database into a prefix-tree, and the tree
maintains the utility information about itemsets. Secondly, for each item of the
tree, if it is estimated to be valuable, namely there is likely to be high utility
itemsets containing the item, the algorithms construct a conditional prefix-tree
for the item. Thirdly, the algorithms recursively process all conditional prefix-
trees to generate candidate high utility itemsets. Finally, the algorithms scan
the database again to compute the exact utilities of all candidates for
identifying high utility itemsets. Reducing the numbers of both database scans
and candidate itemsets, these algorithms outperform the Apriori-based
algorithms. Even so, compared with the number of resultant high utility
itemsets, these algorithms still generate a large number of candidate itemsets in
most cases, and it is very costly to both generate these candidates and compute
their exact utilities. There are also a number of studies that focus on the
problem of mining an approximate set of all high utility itemsets [10, 24] or a
condensed set of all high utility itemsets [20, 21]. In this study, the problem of
mining the complete set of all high utility itemsets from a database is
discussed.
3. UTILITY-LIST STRUCTURE
To mine high utility itemsets, many previous algorithms operate
directly on the original database. Although FP-Growth-based algorithms
generate candidate itemsets from prefix-trees, they have to compute the exact
utilities of candidates by scanning the database. In this section, we propose a
utility-list structure to maintain the utility information about a database.
3.1 Initial Utility-Lists

In our HUI-Miner algorithm, each itemset holds a utility-list. Initial
utility-lists storing the utility information about a mined database can be
constructed by two scans of the database. Firstly, the transaction-weighted
utilities of all items are accumulated by a database scan. If the transaction-
weighted utility of an item is less than a given minutil, the item is no longer
considered according to Property 1 in the subsequent mining process. The
items whose transaction-weighted utilities exceed the minutil are sorted
in transaction-weighted-utility-ascending order. For the database in Fig. 1,
suppose the minutil is 30; then the algorithm no longer takes items f and g
into consideration after the first database scan. The remaining items are
sorted: $e < c < b < a < d$.
DEFINITION 8. A transaction is considered as “revised” after (1) all
the items whose transaction-weighted utilities are less than a given minutil are
deleted from the transaction; (2) the remaining items are sorted in transaction-
weighted-utility-ascending order.
When scanning the database again, the algorithm revises each
transaction for constructing initial utility-lists. The database view in Fig. 4 lists
all revised transactions derived from the database in Fig. 1. From here on, the
following convention holds in the remainder of this paper:
CONVENTION 1. A transaction is considered as revised, and all the
items in an itemset are sorted in transaction-weighted-utility-ascending order,
when mentioned.
DEFINITION 9. Given an itemset X and a transaction (or itemset) T
with X⊆T, the set of all the items after X in T is denoted as T/X.
For example, consider the view in Fig. 4, T2/{eb} = {ad} and T2/{c} =
{bad}.

DEFINITION 10. The remaining utility of itemset X in transaction T,
denoted as $ru(X, T)$, is the sum of the utilities of all the items in T/X in T, where
$ru(X, T) = \sum_{i \in (T/X)} u(i, T)$.
Each element in the utility-list of itemset X contains three fields: tid,
iutil, and rutil.
- Field tid indicates a transaction T containing X.
- Field iutil is the utility of X in T, i.e., u(X, T).
- Field rutil is the remaining utility of X in T, i.e., ru(X, T).
During the second database scan, the algorithm constructs the initial
utility-lists shown in Fig. 5. For example, consider the utility-list of itemset
{c}. In T1, u({c}, T1) = 2 and ru({c}, T1) = u(b, T1) + u(d, T1) = 2 + 5 = 7,
and thus element <1, 2, 7> is in the utility-list of {c} (1 represents T1 for
simplicity). In T2, u({c}, T2) = 3 and ru({c}, T2) = u(b, T2) + u(a, T2) +
u(d, T2) = 2 + 4 + 5 = 11, and thus element <2, 3, 11> belongs to the utility-list of {c}
as well. The rest can be figured out in the same manner.
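The two scans described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. The `EU` and `DB` tables are a hypothetical reconstruction of Fig. 1, `build_initial_utility_lists` is an illustrative name, and each element is stored as a plain `(tid, iutil, rutil)` tuple.

```python
# Hypothetical reconstruction of the Fig. 1 database (assumption, see lead-in).
EU = {"a": 1, "b": 2, "c": 1, "d": 5, "e": 4, "f": 3, "g": 1}
DB = {
    1: {"b": 1, "c": 2, "d": 1, "g": 1},
    2: {"a": 4, "b": 1, "c": 3, "d": 1, "e": 1},
    3: {"a": 4, "c": 2, "d": 1},
    4: {"c": 2, "e": 1, "f": 1},
    5: {"a": 5, "b": 2, "d": 1, "e": 2},
    6: {"a": 3, "b": 4, "c": 1, "f": 2},
    7: {"d": 1, "g": 5},
}


def build_initial_utility_lists(minutil):
    """Return (item order, {item: [(tid, iutil, rutil), ...]})."""
    # First scan: accumulate the transaction-weighted utility of each item.
    twu = {item: 0 for item in EU}
    for items in DB.values():
        t_util = sum(count * EU[i] for i, count in items.items())
        for i in items:
            twu[i] += t_util

    # Drop unpromising items and fix the twu-ascending order.
    order = sorted((i for i in EU if twu[i] >= minutil), key=lambda i: (twu[i], i))
    lists = {i: [] for i in order}

    # Second scan: revise each transaction and emit one element per item.
    for tid in sorted(DB):
        revised = [i for i in order if i in DB[tid]]
        remaining = sum(DB[tid][i] * EU[i] for i in revised)
        for i in revised:
            iutil = DB[tid][i] * EU[i]
            remaining -= iutil  # utility of the items after i in the revised T
            lists[i].append((tid, iutil, remaining))
    return order, lists


order, lists = build_initial_utility_lists(30)
print(order)
print(lists["c"][:2])  # elements of {c} for T1 and T2
```

Under the reconstructed data the first two elements of the utility-list of {c} are (1, 2, 7) and (2, 3, 11), the values derived in the example above. Processing tids in ascending order keeps every list sorted by tid, which is what the later intersection step depends on.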
3.2 Utility-Lists of 2-Itemsets
Without scanning the database, the utility-list of 2-itemset {xy} can be
constructed by intersecting the utility-list of {x} with that of {y}. The
algorithm identifies common transactions by comparing the tids in the two
utility-lists. If the lengths of the utility-lists are m and n respectively,
then at most (m + n) comparisons are enough to identify the common
transactions, because all tids in a utility-list are ordered. The identification
process is in fact a 2-way comparison. For example, the tid comparison
between the utility-lists of itemsets {e} and {c} in Fig. 5 is demonstrated in
Fig. 6(a).
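The 2-way comparison can be sketched as a two-pointer walk over two sorted tid lists. The function name and the sample tid lists are hypothetical; they are illustrative values, not read off Fig. 5 or Fig. 6(a).

```python
def common_tids(tids_x, tids_y):
    """Return the tids present in both sorted lists and the number of comparisons."""
    i = j = 0
    common = []
    comparisons = 0
    while i < len(tids_x) and j < len(tids_y):
        comparisons += 1
        if tids_x[i] == tids_y[j]:
            common.append(tids_x[i])
            i += 1
            j += 1
        elif tids_x[i] < tids_y[j]:
            i += 1                    # tids_x[i] cannot occur later in tids_y
        else:
            j += 1                    # tids_y[j] cannot occur later in tids_x
    return common, comparisons

# Illustrative sorted tid lists of lengths m = 3 and n = 5.
shared, steps = common_tids([2, 4, 5], [1, 2, 3, 4, 6])
```

Every comparison advances at least one pointer, which is why the count can never exceed m + n.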

For each common transaction t, the algorithm generates an element E
and appends it to the utility-list of {xy}. The tid field of E is the tid of t. The
iutil of E is the sum of the iutils associated with t in the utility-lists of {x} and
{y}. If x is before y, then the rutil of E is the rutil associated with t in the
utility-list of {y}. Fig. 6(b) depicts the utility-lists of all the 2-itemsets with
itemset {e} as prefix. For example, to construct the utility-list of itemset
{eb}, the algorithm intersects the utility-list of {e}, i.e., {<2, 4, 14>,
<4, 4, 2>, <5, 8, 14>}, and that of {b}, i.e., {<1, 2, 5>, <2, 2, 9>, <5, 4, 10>,
<6, 8, 3>}, which results in {<2, 6, 9>, <5, 12, 10>}. One can observe from the
database view in Fig. 4 that itemset {eb} only appears in T2 and T5. In T2,
u({eb}, T2) = u(e, T2) + u(b, T2) = 4 + 2 = 6, and ru({eb}, T2) = u(a, T2) +
u(d, T2) = 4 + 5 = 9. Similarly, in T5, the utility of {eb} is 8 + 4 = 12, and the
remaining utility of {eb} is 5 + 5 = 10.
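The element rule for 2-itemsets can be sketched as below. For brevity the sketch looks up tids in a dictionary instead of performing the 2-way comparison, which is a substitution and not the procedure in the text. The function name is hypothetical, and the input lists contain only the elements for T1, T2, and T5, with values reconstructed from the utilities quoted in this section.

```python
def join_single_items(ul_x, ul_y):
    """Utility-list of {xy} from those of {x} and {y}, assuming x is before y."""
    y_by_tid = {tid: (iutil, rutil) for tid, iutil, rutil in ul_y}
    ul_xy = []
    for tid, iutil_x, _ in ul_x:
        if tid in y_by_tid:
            iutil_y, rutil_y = y_by_tid[tid]
            # iutil: sum of both iutils; rutil: taken from the later item y.
            ul_xy.append((tid, iutil_x + iutil_y, rutil_y))
    return ul_xy

# Assumed partial utility-lists as (tid, iutil, rutil) elements.
ul_e = [(2, 4, 14), (5, 8, 14)]
ul_b = [(1, 2, 5), (2, 2, 9), (5, 4, 10)]
ul_eb = join_single_items(ul_e, ul_b)
```

The result has one element for T2 with iutil 6 and rutil 9, and one for T5 with iutil 12 and rutil 10, matching the values computed in the text.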
3.3 Utility-Lists of k-Itemsets (k≥3)
To construct the utility-list of k-itemset $\{i_1 \cdots i_{k-1}, i_k\}$ (k≥3), we can
directly intersect the utility-list of $\{i_1 \cdots i_{k-2}, i_{k-1}\}$ and that of
$\{i_1 \cdots i_{k-2}, i_k\}$, as we do to construct the utility-list of a 2-itemset.
For example, to construct the utility-list of {eba}, we can intersect the
utility-list of {eb} and that of {ea} in Fig. 6(b), and the resultant utility-list
is depicted in Fig. 7(a). Itemset {eba} does appear in T2 and T5 in the database
view in Fig. 4; however, the utilities of the itemset in T2 and T5 are 10 and 17
rather than 14 and 25, respectively. The reason for miscalculating the utility
of {eba} in T2 is that the sum of the utilities of {eb} and {ea} in T2 counts the
utility of {e} in T2 twice. Generally, to calculate the utility of
$\{i_1 \cdots i_{k-2}, i_{k-1}, i_k\}$ in T, the following formula holds:
$u(\{i_1 \cdots i_{k-2}, i_{k-1}, i_k\}, T) = u(\{i_1 \cdots i_{k-2}, i_{k-1}\}, T) + u(\{i_1 \cdots i_{k-2}, i_k\}, T) - u(\{i_1 \cdots i_{k-2}\}, T)$.

Therefore, the iutil of the element associated with T2 in the utility-list of
{eba} is u({eba}, T2) = u({eb}, T2) + u({ea}, T2) − u({e}, T2) = 6 + 8 − 4 = 10.
That associated with T5 is u({eba}, T5) = u({eb}, T5) + u({ea}, T5) −
u({e}, T5) = 12 + 13 − 8 = 17. The values of u({eb}, T), u({ea}, T), and
u({e}, T) can be accessed from the utility-lists of {eb}, {ea}, and {e},
respectively.
Suppose itemsets Px and Py are the combinations of itemset P with items
x and y (x is before y), respectively, and P.UL, Px.UL, and Py.UL are the
utility-lists of itemsets P, Px, and Py. Algorithm 1 shows how to construct the
utility-list of itemset Pxy. The utility-list of a 2-itemset is constructed when
P.UL is empty, namely when P is empty, and the utility-list of a k-itemset
(k≥3) is constructed when P.UL is not empty. Note that element E in line 5 can
always be found when P.UL is not empty, because the tid sets in both
Px.UL and Py.UL are subsets of the tid set in P.UL. The utility-lists of all the
itemsets with {eb} as prefix constructed by Algorithm 1 are shown in Fig.
7(b). Thus far, we have illustrated how to construct the utility-list of an itemset.
When HUI-Miner constructs the utility-list of an itemset, and how it judges
whether or not that utility-list should be constructed, are explained in the
next section.
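The construction of Pxy.UL from P.UL, Px.UL, and Py.UL described above can be sketched as follows. This is an illustration in the spirit of that description, not a transcription of Algorithm 1: the function name, the tuple layout, and the dictionary lookups are assumptions, and the sample lists hold only the elements for T2 and T5, with values reconstructed from the utilities quoted in this section.

```python
def construct(p_ul, px_ul, py_ul):
    """Build Pxy.UL from P.UL, Px.UL and Py.UL (x before y).

    Each utility-list is a list of (tid, iutil, rutil) tuples.
    Pass an empty list as p_ul when the prefix P is empty.
    """
    prefix_iutil = {tid: iutil for tid, iutil, _ in p_ul}
    py_by_tid = {tid: (iutil, rutil) for tid, iutil, rutil in py_ul}
    pxy_ul = []
    for tid, iutil_x, _ in px_ul:
        if tid not in py_by_tid:
            continue                          # Px and Py do not share this tid
        iutil_y, rutil_y = py_by_tid[tid]
        iutil = iutil_x + iutil_y
        if p_ul:
            # The prefix utility is contained in both summands; remove one copy.
            iutil -= prefix_iutil[tid]
        pxy_ul.append((tid, iutil, rutil_y))
    return pxy_ul

# Assumed partial utility-lists for {e}, {eb} and {ea}.
ul_e = [(2, 4, 14), (5, 8, 14)]
ul_eb = [(2, 6, 9), (5, 12, 10)]
ul_ea = [(2, 8, 5), (5, 13, 5)]
ul_eba = construct(ul_e, ul_eb, ul_ea)
naive = construct([], ul_eb, ul_ea)           # ignoring the prefix overcounts {e}
```

With the prefix supplied, the iutils for T2 and T5 come out as 10 and 17. Leaving it out reproduces the overcounted values 14 and 25 discussed in Section 3.3.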
4. HIGH UTILITY ITEMSET MINER
