Searching Keyword-lacking Files based on Latent Interfile Relationships

1
Tetsutaro Watanabe (Tokyo Tech., Japan)
Takashi Kobayashi (Nagoya U., Japan)
Haruo Yokota (Tokyo Tech., Japan)
ICSOFT 2010: 5th International Conference on Software and Data Technologies
22 July 2010, Athens, Greece

2
Outline of today's talk
Desktop search is a must-have feature
• But how often do you say "Good boy!" to him?
A new desktop search method that uses the "LATENT" relationships between files
Our major contributions:
• A search method and system that combine inter-file relationships with a full-text search engine
• A method for automatically extracting latent inter-file relationships from file access logs
• A demonstration of the feasibility and performance of our method through experiments on real data
We DON'T care about the contents of files

3
Background and Goal
Information Explosion
1. Background & Goal
2. Related works
3. Proposed
method & system
4. Experiment
5. Conclusion

4
Background
• The number of files in file systems keeps increasing [1]
   • Many files and folders are created and kept every day
• The desktop file system has become a forest of folders!
   • It is hard to classify files into appropriate directories
   • It is difficult to find a desired file in a deep node
• Desktop search (DS) is a must-have feature
   • Users give up classifying files and traversing the folder forest
   • Powerful desktop search functions are seamlessly merged into current OSs
[1] Agrawal, N., Bolosky, W. J., Douceur, J. R., and Lorch, J. R. A five-year study of file-system metadata. ACM Transactions on Storage, 3(3), 2007.

5
Background (cont.)
• DS can find ONLY files that include the search keywords
   • It is based on a full-text search engine
   • It CANNOT find keyword-lacking files, even if they are related to the keywords
• Many related files (e.g. those of a research paper) don't include the keywords
   • Image figures
   • Source data files
   • Papers of related work
   • Source code for experiments
• Explanatory filenames are one solution. But…
   • "figure_sect2_ICSOFT2010_FRIDAL_outline.jpg"

6
Our research goal
A search method for keyword-lacking files that match the given keywords
[Figure: within the file system, full-text search covers the files that include the keyword; our target is the files that do not include the keyword but are related to it]

7
To find keyword-lacking files:
• Use metadata (e.g. faceted search)
   • Enables rich search but needs good metadata
   • It works fine for important archive files
   • But can you attach metadata to all the files you generate??
• Use references (e.g. Google image search)
   • One kind of automatically generatable metadata
   • Images that contain no text can still be found through the text of the documents that refer to them
   • Reference information is (very) rare and costly: it needs a target-specific (syntactic, logical) analyzer, such as an HTML/TeX analyzer, an analyzer for a specific XML document type, or a paper analyzer (to find citations)
So…

8
Research Question:
How can we obtain relation information that is common and cost-free?
Our Answer:
Mine it automatically from user activity

9
Related works
2. Related works
3. Proposed
method & system
4. Experiment
5. Conclusion

10
Related works
Semantic approach [1][2]
• Attach rich metadata to manage and search files
Time-based metaphor
• Search along a timeline of past activity
• Time-machine computing [3], SIS [4], OreDesk [5]
[1] Gifford, D. K. et al. Semantic file systems. In Proc. ACM Symposium on Operating Systems Principles (1991)
[2] Chirita, P. A. et al. Activity based metadata for semantic desktop search. In Proc. Second European Semantic Web Conference (ESWC) (2005)
[3] Rekimoto, J. Time-machine computing: A time-centric approach for the information environment. In Proc. ACM UIST'99 (1999)
[4] Dumais, S. et al. Stuff I've seen: A system for personal information retrieval and re-use. In Proc. SIGIR 2003 (2003)
[5] Ohsawa, R. et al. OreDesk: A tool for retrieving data history based on user operations. In Proc. IEEE International Symposium on Multimedia (ISM) (2006)

11
Related works (cont.)
• Using relationships between files
   • Applying the PageRank idea [6]
   • Using usage analysis techniques [7]
• Integration with full-text search: Connections [8]
   • Calculates interfile relationships from the system calls made on files, then searches for files related to the files found, as a context-based search
[6] Nejdl, W. and Paiu, R. Desktop search: how contextual information influences search results and rankings. In Proc. Workshop on Information Retrieval in Context (IRiX) (2005)
[7] Chirita, P. A. and Nejdl, W. Analyzing user behavior to rank desktop items. In Proc. Intl. Symp. on String Processing and Information Retrieval (SPIRE) (2006)
[8] Soules, C. A. and Ganger, G. R. Connections: Using context to enhance file search. In Proc. ACM Symposium on Operating Systems Principles (2005)

12
Connections [Soules and Ganger 2005]
• Counts read-write relations within a time window (N sec)
   • Assumes that a written file refers to the files that were read
• Propagates full-text search points along the relations
• Input: a system call trace log with open(s), read(s), write(s), mmap(s), stat(s), dup(s), link(S,D), rename(S,F)
[Figure: read() and write() calls on files A, B and C along a time axis; calls that fall within the same N-second window become weighted edges (1, 2) between the files]
Problem: Raw file I/O information is NOT enough to analyze user activity
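Note: the read-write counting idea can be sketched as follows. This is only an illustration of "count read-write relations in a time window", not the Connections implementation; the trace format, the function name count_read_write_relations and the rule that every file read within the last N seconds is linked to the written file are assumptions.

```python
from collections import defaultdict

def count_read_write_relations(trace, window_sec):
    """Count (read file -> written file) pairs seen within a time window.

    trace: list of (timestamp, op, path), sorted by timestamp,
           where op is "read" or "write" (hypothetical log format).
    """
    weights = defaultdict(int)
    recent_reads = []  # (timestamp, path) of reads still inside the window
    for ts, op, path in trace:
        # Forget reads that are older than the window.
        recent_reads = [(t, p) for t, p in recent_reads if ts - t <= window_sec]
        if op == "read":
            recent_reads.append((ts, path))
        elif op == "write":
            # Assumption: the written file refers to every recently read file.
            for src in {p for _, p in recent_reads}:
                if src != path:
                    weights[(src, path)] += 1
    return dict(weights)

trace = [(0, "read", "A"), (2, "write", "B"), (3, "read", "A"),
         (4, "write", "B"), (20, "write", "C")]
relations = count_read_write_relations(trace, window_sec=5)
```

The write to C at time 20 gets no edge because no read falls inside its window.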

13
Proposed method & system
2. Related works
3. Proposed
method & system
4. Experiment
5. Conclusion
FRIDAL: File Retrieval by Inter-file relationship Derived from Access Log

14
Outline of FRIDAL
Basic assumption:
• Files frequently used at the same time are related
Key features:
• Cleans the raw file access log to extract approximate file use durations (AFUDs)
• Calculates latent relations by analyzing the overlap of AFUDs
• Calculates a ranking for the keywords using full-text search and the relationship graph
[Figure: a paper (TeX) file and a figure file that are used at the same time]

15
Approximate File Use Duration (AFUD)
Case 1: The user keeps files open without using them
• The file use durations (FUDs) need trimming
• Detect activity:
   1) If any activity exists in a time frame "Ta", the user was active in that frame -> eliminate the inactive time
   2) A long (> "Tb") inactive time means the user went home -> eliminate the time after the inactive period
[Figure: FUDs on a time axis are trimmed into AFUDs by applying 1) and then 2) against the active time]
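Note: a minimal sketch of the Case 1 trimming, under stated assumptions. The slide does not give the exact procedure, so the fixed frames of length Ta, the function names active_intervals and trim_fud, and the way rule 2) cuts everything after the first gap longer than Tb are all illustrative choices.

```python
def active_intervals(event_times, ta):
    """Rule 1): a frame of length ta with at least one event is active.

    Returns merged (start, end) intervals of consecutive active frames.
    """
    frames = sorted({int(t // ta) for t in event_times})
    intervals = []
    for k in frames:
        start, end = k * ta, (k + 1) * ta
        if intervals and intervals[-1][1] == start:
            intervals[-1] = (intervals[-1][0], end)  # extend the previous interval
        else:
            intervals.append((start, end))
    return intervals

def trim_fud(fud, actives, tb):
    """Trim one FUD (open_time, close_time) into AFUD pieces."""
    open_t, close_t = fud
    afud, last_end = [], None
    for a_start, a_end in actives:
        start, end = max(open_t, a_start), min(close_t, a_end)
        if start >= end:
            continue  # this active interval does not overlap the FUD
        if last_end is not None and start - last_end > tb:
            break  # rule 2): long inactivity, drop everything afterwards
        afud.append((start, end))
        last_end = end
    return afud

actives = active_intervals([1, 12, 15, 100], ta=10)
trimmed = trim_fud((5, 105), actives, tb=30)
```

Here the file stays open from 5 to 105, but only the piece inside the first active period survives because the following gap exceeds Tb.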

16
Approximate File Use Duration (cont.)
Case 2: Some applications don't keep files open
• They have no exclusive access control mechanism, or a different one
• Only many short FUDs appear
• Detect the application's manner:
   • "Average of FUD < Tc" means "the application doesn't lock the file"
   • Fill the time slots between FUDs within the active time for such file types
[Figure: short FUDs inside the active time are merged into AFUDs]
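Note: a sketch of the Case 2 gap filling, assuming the FUDs of one file and the active intervals are given as sorted (start, end) pairs. The names keeps_file_open and fill_gaps are hypothetical, and merging all FUDs that overlap one active interval into a single span is one possible reading of "fill time slot between FUDs".

```python
def keeps_file_open(fud_lengths, tc):
    """Decide per file type: average FUD length < tc means 'app does not lock the file'."""
    return sum(fud_lengths) / len(fud_lengths) >= tc

def fill_gaps(fuds, actives):
    """Merge the short FUDs that fall in the same active interval into one AFUD."""
    afuds = []
    for a_start, a_end in actives:
        inside = [(o, c) for o, c in fuds if o < a_end and c > a_start]
        if inside:
            # Span from the first open to the last close, clipped to the active interval.
            afuds.append((max(inside[0][0], a_start), min(inside[-1][1], a_end)))
    return afuds

fuds = [(10, 11), (20, 21), (50, 51), (200, 201)]
actives = [(0, 60), (190, 210)]
if not keeps_file_open([c - o for o, c in fuds], tc=5):
    afuds = fill_gaps(fuds, actives)
else:
    afuds = list(fuds)
```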

17
Calculate latent interfile relationships
• Calculate the interfile relationships from the file use durations (AFUDs)
1. Calculate four relationship elements from the co-occurrences (COs) of two files' AFUDs
   T: total time of the COs
   C: number of COs
   D: total time of the time spans between COs
   P: similarity of the timings of the open-file operations
2. Calculate the interfile relationship
   $\mathrm{Relationship} = T^{\alpha} \cdot C^{\beta} \cdot D^{\gamma} \cdot P^{\delta}$

18
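Calculate latent relationships (1 of 3)
• T: total time of COs
• C: number of COs
• Together they capture the length and frequency of co-use
$C = n, \qquad T = \sum_{i=1}^{n} t_i$
[Figure: AFUDs of files x and y on a time axis; their overlaps are the co-occurrences c1 to c4 with durations t1 to t4]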
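Note: T and C can be computed by intersecting the two files' AFUD lists. A minimal sketch, assuming each list holds sorted, non-overlapping (start, end) intervals; the function names are illustrative.

```python
def co_occurrences(afuds_x, afuds_y):
    """Return the overlaps (COs) between two sorted lists of disjoint intervals."""
    cos = []
    i = j = 0
    while i < len(afuds_x) and j < len(afuds_y):
        start = max(afuds_x[i][0], afuds_y[j][0])
        end = min(afuds_x[i][1], afuds_y[j][1])
        if start < end:
            cos.append((start, end))
        # Advance whichever interval finishes first.
        if afuds_x[i][1] < afuds_y[j][1]:
            i += 1
        else:
            j += 1
    return cos

def total_time_and_count(cos):
    """T = sum of CO durations t_i, C = number of COs n."""
    return sum(end - start for start, end in cos), len(cos)

cos = co_occurrences([(0, 10), (20, 30), (40, 50)], [(5, 25), (45, 60)])
T, C = total_time_and_count(cos)
```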

19
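Calculate latent relationships (2 of 3)
• D: total time of the time spans between COs
• When the user co-uses two files in several tasks, the relation is stronger than when they are co-used within a single task
$D = \sum_{i=1}^{n-1} d_{i(i+1)}$
[Figure: two timelines of AFUDs and their COs; d12 and d23 are the spans between consecutive COs]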
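Note: a sketch of D, assuming each span d_{i(i+1)} is measured from the end of one CO to the start of the next (the slide does not state the endpoints) and that the COs are sorted (start, end) pairs.

```python
def total_gap_between_cos(cos):
    """D = sum over consecutive COs of the time between them."""
    return sum(nxt[0] - cur[1] for cur, nxt in zip(cos, cos[1:]))

D = total_gap_between_cos([(5, 10), (20, 25), (45, 50)])
```

With a single CO the sum is empty, so D is 0 under this reading.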

20
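Calculate latent relationships (3 of 3)
• P: similarity of the timings of the open-file operations
$P = \begin{cases} 1 & \left(\sum_{i=1}^{n} p_i \le 1\right) \\ \left(\sum_{i=1}^{n} p_i\right)^{-1} & \left(\sum_{i=1}^{n} p_i > 1\right) \end{cases}$
[Figure: files A1, A2 and B1, B2 on time axes; p1, p2 and p3 are the differences between the open timings of the two files, with p3 = 0 when both are opened at the same moment]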
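Note: a sketch of P and of the combined relationship value, assuming P is the reciprocal of the summed open-time differences p_i, capped at 1 when the sum is at most 1. The unit of p_i is not stated on the slides, and the function names are illustrative. The default exponents are the values (α, β, γ, δ) = (1, 1, 0.5, 0.5) reported for the experiments.

```python
def timing_similarity(open_gaps):
    """P: 1 when the summed open-time differences are <= 1, else their reciprocal."""
    total = sum(open_gaps)
    return 1.0 if total <= 1 else 1.0 / total

def relationship(T, C, D, P, alpha=1.0, beta=1.0, gamma=0.5, delta=0.5):
    """Relationship = T^alpha * C^beta * D^gamma * P^delta."""
    return (T ** alpha) * (C ** beta) * (D ** gamma) * (P ** delta)

P = timing_similarity([1, 3, 0])
rel = relationship(T=4, C=2, D=9, P=P)
```

Read literally, the product is 0 whenever D is 0, for example when there is only one co-occurrence.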

21
Search files using interfile relationships
1. Run the full-text search using the input keywords
2. Score the points of all files related to the files found by the full-text search (discussed later)
3. Display the files ordered by points
[Figure: files in the file system linked by relationships; both the full-text search result and the related files that are the target of the proposed method receive points, and the search result lists 1st 25pt, 2nd 20pt, 3rd 15pt, 4th 10pt, 5th 5pt]

22
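Score the file point
• Uses TF-IDF and the normalized relationship
• Propagates just one hop to limit the computational cost
$\mathrm{Point}(F) = \text{TF-IDF}(F) + \sum_{i} \text{TF-IDF}(X_i) \cdot \mathrm{NormRel}(F, X_i)$, where the $X_i$ are the files related to $F$
[Figure: worked example. The full-text hits have TF-IDF scores 10, 20 and 30, and the edges have normalized relationships 0.5, 0.75 and 1. A file without the keyword scores 0 + 15 (20 * 0.75) + 30 (30 * 1) = 45; the hits end with final scores 20 (10 + 10), 25 (20 + 5 + 0) and 30 (30 + 0).]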
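Note: the one-hop scoring can be sketched as below, reproducing the numbers of the worked example. The file names A, B, C and K, the assignment of edges to files, and the assumption that the normalized relationship is symmetric are illustrative choices, not taken from the slides.

```python
def score_files(tfidf, norm_rel):
    """Point(F) = TF-IDF(F) + sum of TF-IDF(X_i) * NormRel(F, X_i), one hop only.

    tfidf:    {file: TF-IDF score} for the full-text hits
    norm_rel: {(file_a, file_b): normalized relationship}, treated as undirected
    """
    files = set(tfidf)
    for a, b in norm_rel:
        files.update((a, b))
    points = {f: float(tfidf.get(f, 0.0)) for f in files}
    for (a, b), weight in norm_rel.items():
        # Only original TF-IDF scores are propagated, never propagated points.
        points[a] += tfidf.get(b, 0.0) * weight
        points[b] += tfidf.get(a, 0.0) * weight
    return points

tfidf = {"A": 10, "B": 20, "C": 30}           # full-text hits
norm_rel = {("A", "B"): 0.5, ("B", "K"): 0.75, ("C", "K"): 1.0}
points = score_files(tfidf, norm_rel)
ranking = sorted(points, key=points.get, reverse=True)
```

The keyword-lacking file K has no TF-IDF score of its own but ends up ranked first.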

23
FRIDAL Implementation
• Components: file server (Samba), full-text search engine (Hyper Estraier), RDBMS, controller (Java), web interface
• Preparing phase
   • Users use files on the file server
   • The controller gets the access logs, calculates the relationships, and stores them in the RDBMS
   • The full-text search engine makes a full-text index of the file system
• Searching phase
   • The user searches through the web interface
   • The controller runs the full-text search, searches for related files, calculates the points, and returns the search result

24
Experiments
2. Related works
3. Proposed
method & system
4. Experiments
5. Conclusion

25
Experimental Environment
• Parameters of the relationship: (α, β, γ, δ) = (1, 1, 0.5, 0.5), based on a preparatory experiment
• Access logs of MS Office, LaTeX, image, and movie files, collected on a Samba 2.2 file server that holds each tester's home directory

Tester     OS              Log period
Tester A   Windows XP      319 days
Tester B   Windows XP      319 days
Tester C   Windows Vista   323 days

26
Mined latent interfile relations
• The number of relations did not correlate with the size of the logs
   • It depends on what the user was doing

            Lines of logs   #Files   #Rels
Tester A    4,873,703       1,100    17,472
Tester B    4,323,090       713      5,692
Tester C    7,863,206       793      5,236

27
Evaluation 1
Task:
• Find specific files in another user's home directory
Evaluated values:
• The number of queries
• The number of files that the user checked before finding the target files
• The number of answer files found
Compared methods:
• FRIDAL
• Full-text search

28
Evaluation 1: Results

Smaller cost:
File   Method      #Queries   #Checked files   Found
F1     FRIDAL      2          1                yes
F1     Full-text   2          15               yes
F4     FRIDAL      1          2                yes
F4     Full-text   1          11               yes
F6     FRIDAL      1          15               yes
F6     Full-text   2          14               yes
Ave.   FRIDAL      1.3        6.0
Ave.   Full-text   1.7        13.3

Only FRIDAL can find:
File   Method      #Queries   #Checked files   Found
F2     FRIDAL      1          9                1/1
F2     Full-text   1          6                0/1
F3     FRIDAL      1          4                3/8
F3     Full-text   1          0                0/8
F5     FRIDAL      1          2                1/1
F5     Full-text   1          14               0/1

FRIDAL can find keyword-lacking files, and at smaller cost than full-text search.

F1: The paper of tester A
F2: The source of the image files in the paper of tester A
F3: The eight data files for the paper of tester A
F4: The paper of tester C
F5: The source of the image files in the paper of tester C
F6: The data file for the paper of tester C

29
Evaluation 2
• Performance comparison with other methods
   • Six tasks were prepared, each searching for files in a home directory
   • (Details in Table 4 in our paper)
• Evaluated values
   • Average of the 11-point average precision
   • Average of the top-20 precision and recall
• Compared methods
   • FRIDAL
   • Full-text search
   • Directory search
   • Connections calculation
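Note: the two measures can be sketched as below. These are the textbook definitions of precision/recall at a cutoff and of 11-point interpolated average precision; whether the experiments used exactly these variants (for example, dividing by k when fewer than k files are returned) is an assumption.

```python
def precision_recall_at_k(ranked, relevant, k=20):
    """Precision and recall over the top-k files of a ranked result list."""
    hits = sum(1 for f in ranked[:k] if f in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def eleven_point_average_precision(ranked, relevant):
    """Mean of the interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    if not relevant:
        return 0.0
    curve, hits = [], 0  # (recall, precision) after each rank
    for rank, f in enumerate(ranked, start=1):
        if f in relevant:
            hits += 1
        curve.append((hits / len(relevant), hits / rank))
    total = 0.0
    for level in range(11):
        # Interpolated precision: best precision at any recall >= this level.
        total += max((p for r, p in curve if r >= level / 10), default=0.0)
    return total / 11

ranked = ["a", "x", "b", "y"]
relevant = {"a", "b"}
p_at_2, r_at_2 = precision_recall_at_k(ranked, relevant, k=2)
ap11 = eleven_point_average_precision(ranked, relevant)
```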

30
Evaluation 2: Comparison methods
• Directory search
   • A straightforward strategy
   • Searches the directories that contain the full-text search results
   [Figure: the full-text search result (1st, 2nd, 3rd, …, 7th) is expanded into the directory search result: the 1st hit, the files in the same directory as the 1st hit, the 2nd hit, the files in the same directory as the 2nd hit, and so on]
• Connections calculation
   • Uses the calculation method of Connections
   • Uses the read/write attribute of the file accesses in the access logs instead of read()/write()
   • Uses the optimal parameter values that the authors reported in their paper
31
Evaluation 2: Results

Top 20                     Avg. precision   Avg. recall
FRIDAL                     0.72             0.15
Full-text search           0.54             0.12
Directory search           0.61             0.13
Connections calculation    0.48             0.10

FRIDAL has the best score.
• The precision of FRIDAL is higher than that of the other methods at low recall levels
• FRIDAL retrieves more relevant files than the others in the top ranks of the results, so the desired files can be found efficiently with FRIDAL

32
Conclusion & Future work
2. Related works
3. Proposed
method & system
4. Experiments
5. Conclusion

33
Conclusion
• FRIDAL: a new desktop search method that uses latent relationships to search for keyword-lacking files
   • A method for automatically extracting latent relationships between files from file access logs
   • A search method and system that combine inter-file relationships with a full-text search engine
• Showed the feasibility and performance of FRIDAL through experiments on real data
   • Best performance among the compared methods

34
Future work
• Improve the implementation
   • Support copying, moving, and renaming of files
   • Support other file access logs (Windows Event Log)
• Improve the calculation of the interfile relationships
   • Filter noise in the calculation of AFUDs
   • Consider read/write (and move, delete, …) actions
• Improve our ranking method
   • Detailed analysis of multi-user logs
   • More consideration of time-related information: we need to discuss whether old logs are important or not

35
Thank you! Questions & Comments ?
