Searching Keyword-lacking Files based on Latent Interfile Relationships
1. 1
Searching Keyword-lacking Files
based on Latent Interfile
Relationships
Tetsutaro Watanabe (Tokyo Tech., Japan)
Takashi Kobayashi (Nagoya U., Japan)
Haruo Yokota (Tokyo Tech., Japan)
ICSOFT 2010: 5th Intl. Conf. on Software and Data Technologies
22nd July 2010 @ Athens, Greece
2. 2
Outline of today’s talk
Desktop search is a must-have feature
But how often do you say “Good boy!” to him?
A new desktop search method using “LATENT” relationships between files
Our major contributions:
A search method and system that combine inter-file relationships with a full-text search engine
A method for automatically extracting latent inter-file relationships from file access logs
A demonstration of the feasibility and performance of our method through experiments on real data
We DON’T care about the contents of files
3. 3
Background and Goal
Information Explosion
1. Background & Goal
2. Related works
3. Proposed method & system
4. Experiment
5. Conclusion
4. 4
Background
The number of files in file systems keeps increasing [1]
Many files & folders are generated and kept every day
The desktop file system has become a forest of folders!
Hard to classify files into appropriate directories
Difficult to find a desired file in a deep node
Desktop search (DS) is a must-have feature
It lets users give up classifying files and traversing the folder forest
Powerful desktop search functions are seamlessly merged into current OSes.
1. Background and Goal
[1] Agrawal, N., Bolosky, W. J., Douceur, J. R., and Lorch, J. R. A five-year study of file-system metadata. ACM Transactions on Storage, 3(3), 2007.
5. 5
Background (cont.)
DS can ONLY find files that include the search keywords
It is based on a full-text search engine
It CANNOT find keyword-lacking files, even if they are related to the keywords
Many files related to a research paper don’t include the keywords
Image figures
Source data files
Papers on related work
Source code for experiments
An explanatory filename is one solution. But…
“figure_sect2_ICSOFT2010_FRIDAL_outline.jpg”
6. 6
Our research goal
A search method for keyword-lacking files that are related to the given keywords
[Figure: within the file system, files that include the keywords are found by full-text search; the target is the files that do not include the keywords but are related to them]
7. 7
To find keyword-lacking files:
Use metadata (e.g. faceted search)
Enables rich search but needs good metadata
For important archive files, it works fine.
Can you attach metadata to every file you generate?
Use references (e.g. Google image search)
A kind of metadata that can be generated automatically
Images that contain no text can still be found through the text of the documents that refer to them.
Reference information is (very) rare & costly
It needs a target-specific (syntactic, logical) analyzer, such as an HTML/TeX analyzer, an analyzer for a specific XML document type, or a paper analyzer (to find citations)
So…
8. 8
Research Question:
How can we get common, cost-free relation information?
Our Answer:
Mine it from user activity automatically
10. 10
Related works
Semantic approach [1][2]
Attach rich metadata to manage & search files
Time-based metaphor
Searching with a timeline of past activity
Time-machine computing [3], SIS [4], OreDesk [5]
2. Related works
[1] Gifford, D. K. et al. Semantic file systems. In Proc. ACM Symposium on Operating Systems Principles (1991)
[2] Chirita, P. A. et al. Activity based metadata for semantic desktop search. In Proc. Second European Semantic Web Conference (ESWC) (2005)
[3] Rekimoto, J. Time-machine computing: A time-centric approach for the information environment. In Proc. ACM UIST’99 (1999)
[4] Dumais, S. et al. Stuff I’ve seen: A system for personal information retrieval and re-use. In Proc. SIGIR 2003 (2003)
[5] Ohsawa, R. et al. Oredesk: A tool for retrieving data history based on user operations. In Proc. IEEE International Symposium on Multimedia (ISM) (2006)
11. 11
Related works (cont.)
Using relationships between files
Applying the PageRank idea [6]
Using usage analysis techniques [7]
Integrating with full-text search: Connections [8]
Calculates interfile relationships from system calls on files, and searches for files related to the matching files as a context-based search
[6] Nejdl, W. and Paiu, R. Desktop search: How contextual information influences search results and rankings. In Proc. Workshop on Information Retrieval in Context (IRiX) (2005)
[7] Chirita, P. A. and Nejdl, W. Analyzing user behavior to rank desktop items. In Proc. Intl. Symp. on String Processing and Information Retrieval (SPIRE) (2006)
[8] Soules, C. A. and Ganger, G. R. Connections: Using context to enhance file search. In Proc. ACM Symposium on Operating Systems Principles (2005)
12. 12
Connections [Soules and Ganger 2005]
Counts read-write relations within a time window
They assume that a written file refers to the files that were read.
Propagates full-text search points
[Figure: a system call trace log (open(s), read(s), write(s), mmap(s), stat(s), dup(s), link(S,D), rename(S,F)) on a time axis; read() and write() calls on files A, B, and C within an N-second window become weighted edges (1, 2) of a relation graph]
Problem: Raw file I/O information is NOT enough to analyze user activity
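The counting idea described for Connections can be illustrated with a minimal sketch. This is not the implementation of Soules and Ganger: the event format `(timestamp, op, path)`, the function name `count_read_write_relations`, and the rule that every file read within the last `window_sec` seconds is linked to the file being written are assumptions made for illustration only.

```python
from collections import Counter


def count_read_write_relations(events, window_sec):
    """Count (read file -> written file) pairs seen within a time window.

    events: iterable of (timestamp, op, path) sorted by timestamp,
            where op is "read" or "write" (hypothetical log format).
    """
    weights = Counter()
    recent_reads = []  # (timestamp, path) of reads still inside the window
    for ts, op, path in events:
        # Forget reads that are older than the window.
        recent_reads = [(t, p) for t, p in recent_reads if ts - t <= window_sec]
        if op == "read":
            recent_reads.append((ts, path))
        elif op == "write":
            # Assume the written file refers to each recently read file.
            for src in {p for _, p in recent_reads if p != path}:
                weights[(src, path)] += 1
    return weights


trace = [(0, "read", "A"), (5, "write", "B"), (8, "read", "B"),
         (9, "write", "C"), (100, "write", "C")]
relations = count_read_write_relations(trace, window_sec=10)
```

The last write to C happens long after any read, so it adds no relation; this is the kind of raw I/O view that the slide argues is too coarse to capture user activity.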
13. 13
Proposed method & system
1. Background & Goal
2. Related works
3. Proposed method & system
4. Experiment
5. Conclusion
FRIDAL: File Retrieval by Inter-file relationship Derived from Access Log
14. 14
Outline of FRIDAL
Basic assumption:
Files that are frequently used at the same time are related
Key features
Clean the raw file access log to extract the approximate file use duration (AFUD)
Calculate latent relationships by analyzing the overlap of AFUDs
Calculate the ranking for keywords using full-text search and the relationship graph
[Figure: a paper (TeX) file and a figure file]
3. Proposed method
15. 15
Approximate File Use Duration (AFUD)
Case 1: The user keeps files open without using them.
The file use duration (FUD) needs to be trimmed
Detect activity
1) If any activity exists in a frame of length “Ta”, “(s)he was active”
-> Eliminate inactive time.
2) A long (> “Tb”) inactive time means “(s)he went home”
-> Eliminate the time after the inactive period
[Figure: on the time axis, FUDs are reduced to AFUDs by applying 1) with the active time (frames of length Ta) and then 2) at inactive periods longer than Tb]
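One possible reading of rules 1) and 2) is sketched below. It is an illustration under stated assumptions, not the authors' implementation: the function names `active_intervals` and `trim_fud`, the use of fixed frames aligned at time 0, and the interpretation of rule 2) as "drop everything after the first inactive gap longer than Tb" are all hypothetical.

```python
def active_intervals(activity_times, ta):
    """Rule 1: a frame of length ta is active if it contains any activity.

    Returns merged (start, end) intervals of consecutive active frames.
    """
    frames = sorted({int(t // ta) for t in activity_times})
    intervals = []
    for frame in frames:
        start, end = frame * ta, (frame + 1) * ta
        if intervals and intervals[-1][1] == start:
            # Extend the previous interval with the adjacent active frame.
            intervals[-1] = (intervals[-1][0], end)
        else:
            intervals.append((start, end))
    return intervals


def trim_fud(fud, intervals, tb):
    """Turn one FUD (open_time, close_time) into a list of AFUD pieces."""
    open_t, close_t = fud
    pieces = []
    for a_start, a_end in intervals:
        lo, hi = max(open_t, a_start), min(close_t, a_end)
        if lo >= hi:
            continue  # this active interval does not overlap the FUD
        if pieces and lo - pieces[-1][1] > tb:
            break  # Rule 2: a long inactive gap, ignore the rest of the FUD
        pieces.append((lo, hi))
    return pieces


intervals = active_intervals([5, 12, 25, 500], ta=10)
afud_short_tb = trim_fud((3, 505), intervals, tb=100)
afud_long_tb = trim_fud((3, 505), intervals, tb=1000)
```

With `tb=100` the file that stayed open across the long idle period keeps only its first active piece; with a larger `tb` the later piece survives as well.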
16. 16
Approximate File Use Duration (cont.)
Case 2: Some applications don’t keep files open
They have no exclusive access control mechanism, or a different one
Only many short FUDs appear
Detect the application’s manner
“Average FUD < Tc” means “the app doesn’t lock the file”
Fill the time slots between FUDs within active times for such file types
[Figure: on the time axis, short FUDs inside the active time are joined into AFUDs]
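The gap-filling step for Case 2 could look like the following sketch. It is only an assumption-laden illustration: the function name `fill_short_fuds` is hypothetical, the average is taken over the FUDs passed in (the slide applies the test per file type), and the active intervals are assumed to be given as `(start, end)` pairs.

```python
def fill_short_fuds(fuds, intervals, tc):
    """Join short FUDs that fall inside the same active interval.

    fuds: list of (open_time, close_time) for one file.
    intervals: list of (start, end) active-time intervals.
    tc: threshold below which the application is assumed not to keep
        the file open.
    """
    if not fuds:
        return []
    average = sum(end - start for start, end in fuds) / len(fuds)
    if average >= tc:
        return sorted(fuds)  # the application keeps files open; nothing to fill
    afuds = []
    for a_start, a_end in intervals:
        # Clip every FUD that overlaps this active interval.
        inside = [(max(s, a_start), min(e, a_end))
                  for s, e in fuds if s < a_end and e > a_start]
        if inside:
            # One AFUD from the first open to the last close in the interval.
            afuds.append((min(s for s, _ in inside), max(e for _, e in inside)))
    return afuds


filled = fill_short_fuds([(1, 2), (4, 5), (8, 9)], [(0, 30)], tc=5)
unchanged = fill_short_fuds([(0, 20)], [(0, 30)], tc=5)
```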
17. 17
Calculate latent interfile relationships
Calculate the interfile relationships from the file use durations
1. Calculate four relationship elements
T: Total time of COs
C: Number of COs
D: Total time of the time spans between COs
P: Similarity of the timings of the open-file operations
2. Calculate the interfile relationship
$\mathrm{Relationship} = T^{\alpha} \cdot C^{\beta} \cdot D^{\gamma} \cdot P^{\delta}$
COs = co-occurrences (overlaps of the AFUDs of two files on the time axis)
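As a small worked example, the relationship is the product of the four elements, each raised to its exponent α, β, γ, δ. The function name `relationship` is hypothetical, the element values are made up, and the default exponents are the values (1, 1, 0.5, 0.5) reported on the experimental-environment slide.

```python
def relationship(t, c, d, p, alpha=1.0, beta=1.0, gamma=0.5, delta=0.5):
    """Combine the four relationship elements into one score.

    t: total time of co-occurrences (COs)
    c: number of COs
    d: total time of the spans between COs
    p: similarity of the open timings
    """
    return (t ** alpha) * (c ** beta) * (d ** gamma) * (p ** delta)


# Made-up element values: 100 * 4 * sqrt(25) * sqrt(1)
example = relationship(t=100, c=4, d=25, p=1)
```

Exponents below 1 dampen an element, so with these defaults D and P influence the score less than T and C do.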
18. 18
Calculate latent relationships (1 of 3)
T: Total time of COs
C: Number of COs
Length & frequency of co-use
$C = n$
$T = \sum_{i=1}^{n} t_i$
[Figure: AFUDs of files x and y on the time axis; their overlaps are the COs c1 to c4, with durations t1 to t4]
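The definitions of T and C can be sketched as an interval-overlap computation. The function names `co_occurrences` and `total_time_and_count` are hypothetical, and the sketch assumes that each file's AFUDs are given as sorted, non-overlapping `(start, end)` pairs.

```python
def co_occurrences(afuds_x, afuds_y):
    """Return the overlaps (COs) between the AFUDs of two files."""
    cos = []
    i = j = 0
    while i < len(afuds_x) and j < len(afuds_y):
        start = max(afuds_x[i][0], afuds_y[j][0])
        end = min(afuds_x[i][1], afuds_y[j][1])
        if start < end:
            cos.append((start, end))
        # Advance whichever interval ends first.
        if afuds_x[i][1] < afuds_y[j][1]:
            i += 1
        else:
            j += 1
    return cos


def total_time_and_count(cos):
    """T is the summed duration of the COs, C is their number."""
    total = sum(end - start for start, end in cos)
    return total, len(cos)


cos = co_occurrences([(0, 10), (20, 30)], [(5, 25)])
t_value, c_value = total_time_and_count(cos)
```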
19. 19
Calculate latent relationships (2 of 3)
D: Total time of the time spans between COs
When the user co-uses the files in several tasks, the relationship is stronger than when they are co-used in a single task.
$D = \sum_{i=1}^{n-1} d_{i(i+1)}$
[Figure: two timelines (files C1, C2 and files D1, D2) showing AFUDs, COs, and the spans d12 and d23 between consecutive COs]
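A minimal sketch of D follows, assuming that each span d between consecutive COs is measured from the end of one CO to the start of the next. The function name `total_gap_between_cos` and the `(start, end)` representation of a CO are hypothetical.

```python
def total_gap_between_cos(cos):
    """Sum the spans between consecutive co-occurrences (COs).

    cos: list of (start, end) COs that do not overlap each other.
    """
    ordered = sorted(cos)
    return sum(nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:]))


d_value = total_gap_between_cos([(5, 10), (20, 25), (40, 50)])
```

Under this reading, COs that are spread over a long period give a large D, while a single CO gives a D of 0.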
20. 20
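Calculate latent relationships (3 of 3)
P: Similarity of the timings of the open-file operations
[Equation: P is defined piecewise on $\sum_{i=1}^{n} p_i$, with one case for $\sum_{i=1}^{n} p_i \le 1$ and one case for $\sum_{i=1}^{n} p_i > 1$]
[Figure: two timelines (files A1, A2 and files B1, B2) with the per-CO timing terms p1, p2, p3; in one of them p3 = 0]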
21. 21
Search files using interfile relationships
1. Run the full-text search using the input keywords
2. Score the points of all files related to the files found by the full-text search (discussed later)
3. Display the files ordered by points
[Figure: files in the file system connected by relationships; the files found by the full-text search pass points to their related files, which are the target of the proposed method. Example search result: 1st 25pt, 2nd 20pt, 3rd 15pt, 4th 10pt, 5th 5pt]
22. 22
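Score the file point
Use TF-IDF and the normalized relationship
Propagate just one hop to limit the computational cost.
$\mathrm{Point}(F) = \mathrm{TFIDF}(F) + \sum_{i} \mathrm{TFIDF}(X_i) \cdot \mathrm{NormRel}(F, X_i)$
[Figure: relationship graph with TF-IDF scores on the full-text search results and normalized relationships (0.5, 0.75, 1) on the edges; for example, a file with TF-IDF score 0 that is related to files scoring 20 (relationship 0.75) and 30 (relationship 1) gets 0 + 20 * 0.75 + 30 * 1 = 45]
3. Proposed method & System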
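The one-hop scoring rule can be sketched as follows, using the 0 + 20 * 0.75 + 30 * 1 = 45 example from the slide. The function name `score_files`, the dictionary-based inputs, and the file names are hypothetical, and the normalized relationships are taken as given because the normalization itself is not described here.

```python
def score_files(tfidf, norm_rel):
    """Rank files by TF-IDF plus points propagated over one hop.

    tfidf: {file: TF-IDF score} for the full-text search results.
    norm_rel: {(f, x): normalized relationship between file f and hit x}.
    """
    points = {f: float(score) for f, score in tfidf.items()}
    for (f, x), rel in norm_rel.items():
        if x in tfidf:
            # Only the original TF-IDF score is propagated (one hop).
            points[f] = points.get(f, 0.0) + tfidf[x] * rel
    return sorted(points.items(), key=lambda item: item[1], reverse=True)


hits = {"paper_a.tex": 20, "paper_b.tex": 30}
relations = {("figure.jpg", "paper_a.tex"): 0.75,
             ("figure.jpg", "paper_b.tex"): 1.0}
ranking = score_files(hits, relations)
```

The keyword-lacking `figure.jpg` ends up with 45 points and is ranked above both full-text hits, which is the effect the method aims for.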
23. 23
FRIDAL Implementation
[Figure: system architecture with a Web interface, a controller (Java), a full-text search engine (Hyper Estraier) with its full-text index, an RDBMS, and a file server (Samba) on top of the file system]
Preparing phase: users use files, get access logs, calculate relationships, store relationships, make full-text index
Searching phase: search, full-text search, search related files, calculate points, search result
25. 25
Experimental Environment
Parameters of the relationships: (α, β, γ, δ) = (1, 1, 0.5, 0.5), based on a preparatory experiment
File server: Samba 2.2, holding the home directories of testers A, B, and C
Access logs of MS Office, LaTeX, image, and movie files
Tester    OS         Log period
Tester A  WinXP      319 days
Tester B  WinXP      319 days
Tester C  Win Vista  323 days
4. Experiments
26. 26
Mined latent interfile relations
The number of relations did not correlate with the size of the logs
It depends on what the tester was doing
          Lines of logs  #Files  #Relations
Tester A  4,873,703      1,100   17,472
Tester B  4,323,090      713     5,692
Tester C  7,863,206      793     5,236
27. 27
Evaluation 1
Task:
Find specific files in another user’s home directory
Evaluation values
The number of queries
The number of files that the user checked before finding the target files
The number of answer files found
Comparison methods
FRIDAL
Full-text search
28. 28
Evaluation 1: Results
File  Search method  #Queries  #Checked files
F1    FRIDAL         2         1
F1    Full-text      2         15
F4    FRIDAL         1         2
F4    Full-text      1         11
F6    FRIDAL         1         15
F6    Full-text      2         14
Ave.  FRIDAL         1.3       6.0
Ave.  Full-text      1.7       13.3
-> Smaller cost
File  Search method  #Queries  #Checked files  Found
F2    FRIDAL         1         9               1/1
F2    Full-text      1         6               0/1
F3    FRIDAL         1         4               3/8
F3    Full-text      1         0               0/8
F5    FRIDAL         1         2               1/1
F5    Full-text      1         14              0/1
-> Only FRIDAL can find
FRIDAL can find keyword-lacking files, at a smaller cost than full-text search
F1 The paper of tester A
F2 The source of the image files in the paper of tester A
F3 The eight data files for the paper of tester A
F4 The paper of tester C
F5 The source of the image files in the paper of tester C
F6 The data file for the paper of tester C
29. 29
Evaluation 2
Performance comparison with other methods
Prepare six tasks of searching for files in a home directory (details in Table 4 of our paper)
Evaluation values
Average of the 11-point average precision
Average of the top-20 precision and recall
Comparison methods
FRIDAL
Full-text search
Directory search
Connections calculation
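For readers unfamiliar with these evaluation values, the sketch below shows one common way to compute them for a single task. It assumes the standard interpolated form of the 11-point average precision and a top-k precision divided by k; whether the experiments used exactly these variants is not stated here, and the function names and the toy ranking are hypothetical.

```python
def precision_recall_at_k(ranked, relevant, k=20):
    """Precision and recall over the top-k results of one task."""
    hits = sum(1 for f in ranked[:k] if f in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def eleven_point_average_precision(ranked, relevant):
    """Mean interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    if not relevant:
        return 0.0
    points = []  # (recall, precision) after each rank
    hits = 0
    for rank, f in enumerate(ranked, start=1):
        if f in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= level.
        total += max((p for r, p in points if r >= level), default=0.0)
    return total / 11


ranked = ["a", "x", "b", "y"]
relevant = {"a", "b"}
p_at_2, r_at_2 = precision_recall_at_k(ranked, relevant, k=2)
ap11 = eleven_point_average_precision(ranked, relevant)
```

The per-task values would then be averaged over the six tasks to obtain the reported averages.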
30. 30
Evaluation 2: Comparison methods
Directory search
Straightforward strategy
Search the directories that include the full-text search results
[Figure: the full-text search result (1st to 7th) and the directory search result (1st, 2nd, ...), annotated with the files in the same directory as the 1st result and the files in the same directory as the 2nd result]
Connections calculation
Use the calculation method of Connections
Use the read/write attribute of file accesses in the access logs instead of read()/write()
Use the optimal parameter values that the authors reported in their paper.
31. 31
Evaluation 2: Results
Top 20                   Avg. precision  Avg. recall
FRIDAL                   0.72            0.15
Full-text search         0.54            0.12
Directory search         0.61            0.13
Connections calculation  0.48            0.10
FRIDAL has the best score
The precision of FRIDAL is higher than that of the other methods at low recall levels
FRIDAL retrieves more relevant files than the others near the top of the results, so the desired files can be found efficiently with FRIDAL
32. 32
Conclusion & Future work
1. Background & Goal
2. Related works
3. Proposed method & system
4. Experiments
5. Conclusion
33. 33
Conclusion
FRIDAL: A new desktop search method that uses latent relationships to search for keyword-lacking files
A method for automatically extracting latent relationships between files from file access logs
A search method and system that combine inter-file relationships with a full-text search engine
Showed the feasibility and performance of FRIDAL through experiments on real data
Best performance among the compared methods
34. 34
Future work
Improve the implementation
Support copying, moving, and renaming files
Support other file access logs (Windows Event Log)
Improve the calculation of the interfile relationships
Filter noise in the calculation of AFUDs
Consider read/write (& move, delete…) actions
Improve our ranking method
Detailed analysis of multi-user logs
More consideration of time-related information
Need to discuss whether old logs are important or not