
Searching Keyword-lacking Files based on Latent Interfile Relationships

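  • Speaker note (anticipated question, "Why are the other methods missing?"): the interface is not finished, and the cost of opening a directory has not been determined.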

Presentation Transcript

  • Searching Keyword-lacking Files based on Latent Interfile Relationships. Tetsutaro Watanabe (Tokyo Tech, Japan), Takashi Kobayashi (Nagoya U., Japan), Haruo Yokota (Tokyo Tech, Japan). ICSOFT 2010, 5th Intl. Conf. on Software and Data Technologies, 22nd July 2010, Athens, Greece
  • Outline of today's talk
    • Desktop search is a must-have feature
      • But how often do we say "Good boy!" to it?
    • A new desktop search method using "LATENT" relationships between files
      • We DON'T care about the contents of files
    • Our major contributions:
      • A search method and system that combine inter-file relationships with a full-text search engine
      • A method for automatically extracting latent inter-file relationships from file access logs
      • A demonstration of the feasibility and performance of our method in experiments with real data
  • Background and Goal Information Explosion 1. Background & Goal 2. Related works 3. Proposed method & system 4. Experiment 5. Conclusion
  • Background
    • The number of files in file systems keeps increasing [1]
      • Many files and folders are created and kept every day
    • The desktop file system has become a forest of folders!
      • It is hard to classify files into appropriate directories
      • It is difficult to find a desired file in a deeply nested folder
    • Desktop search (DS) is a must-have feature
      • Users give up classifying files and traversing the folder forest
      • Powerful desktop search functions are seamlessly integrated into current operating systems
    [1] Agrawal, N., Bolosky, W. J., Douceur, J. R., and Lorch, J. R. A five-year study of file-system metadata. ACM Transactions on Storage, 3(3). 2007.
  • Background (cont.)
    • DS can find ONLY files that include the search keywords
      • It is based on a full-text search engine
      • It CANNOT find keyword-lacking files, even if they are related to the keywords
    • Many files related to a document (e.g. a research paper) don't include the keywords
      • Image figures
      • Source data files
      • Papers on related work
      • Source code for experiments
    • Explanatory filenames are one solution. But…
      • "figure_sect2_ICSOFT2010_FRIDAL_outline.jpg"
  • Our research goal
    • A search method for keyword-lacking files that match the given keywords
    [Figure: the file system contains files that include the keyword (found by full-text search) and files that do not include the keyword but are related to it; the latter are the target]
  • To find keyword-lacking files:
    • Use metadata (e.g. faceted search)
      • Enables rich search but needs good metadata
      • It works fine for important archived files
      • But can you attach metadata to every file you create?
    • Use references (e.g. Google image search)
      • A kind of metadata that can be generated automatically
      • Images that contain no text can still be found through the text of the documents that refer to them
      • Reference information is (very) rare and costly to obtain
        • It needs a target-specific (syntactic or logical) analyzer, such as an HTML/TeX analyzer, an analyzer for a specific XML document type, or a paper analyzer (to find citations)
      • So…
    • Research question: How can we get common, cost-free relationship information?
    • Our answer: Mine it automatically from user activity
  • Related works 1. Background & Goal 2. Related works 3. Proposed method & system 4. Experiment 5. Conclusion
  • Related works
    • Semantic approach [1][2]
      • Attach rich metadata to manage and search files
    • Time-based metaphor
      • Search along a timeline of past activity
      • Time-machine computing [3], SIS [4], OreDesk [5]
    [1] Gifford, D. K. et al. Semantic file systems. In Proc. ACM Symposium on Operating Systems Principles (1991)
    [2] Chirita, P. A. et al. Activity based metadata for semantic desktop search. In Proc. Second European Semantic Web Conference (ESWC) (2005)
    [3] Rekimoto, J. Time-machine computing: A time-centric approach for the information environment. In Proc. ACM UIST'99 (1999)
    [4] Dumais, S. et al. Stuff I've seen: A system for personal information retrieval and re-use. In Proc. SIGIR 2003 (2003)
    [5] Ohsawa, R. et al. OreDesk: A tool for retrieving data history based on user operations. In Proc. IEEE International Symposium on Multimedia (ISM) (2006)
  • Related works (cont.)
    • Using relationships between files
      • Applying the PageRank idea [6]
        • Using a usage analysis technique [7]
      • Integration with full-text search: Connections [8]
        • Calculates interfile relationships from file system calls, and searches for files related to other files (context-based search)
    [6] Nejdl, W. and Paiu, R. Desktop search – how contextual information influences search results and rankings. In Proc. Workshop on Information Retrieval in Context (IRiX) (2005)
    [7] Chirita, P. A. and Nejdl, W. Analyzing user behavior to rank desktop items. In Proc. Intl. Symp. on String Processing and Information Retrieval (SPIRE) (2006)
    [8] Soules, C. A. and Ganger, G. R. Connections: Using context to enhance file search. In Proc. ACM Symposium on Operating Systems Principles (2005)
  • Connections [Soules and Ganger 2005]
    • Counts read-write relations within a time window
      • It assumes that a written file refers to the files that were read
    • Propagates full-text search points along the relations
    [Figure: system call trace log (open, read, write, mmap, stat, dup, link, rename); read() and write() calls on files A, B, and C within an N-second window on a timeline produce weighted relations among A, B, and C]
    • Problem: Raw file I/O information is NOT enough to analyze user activity
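The read-write counting idea on this slide can be illustrated with a small sketch. This is a simplified illustration, not the Connections algorithm itself: the event tuple format, the function name `count_read_write_relations`, and the rule "every file read within the window before a write is counted once" are assumptions made for this example.

```python
def count_read_write_relations(events, window_sec):
    """Count (written_file, read_file) pairs inside a sliding time window.

    events: list of (timestamp_sec, op, path), sorted by timestamp,
            where op is "read" or "write" (hypothetical log format).
    Returns a dict mapping (written_file, read_file) -> count.
    """
    edges = {}
    recent_reads = []  # (timestamp, path) of reads still inside the window
    for ts, op, path in events:
        # Drop reads that fell out of the window.
        recent_reads = [(t, p) for t, p in recent_reads if ts - t <= window_sec]
        if op == "read":
            recent_reads.append((ts, path))
        elif op == "write":
            # Assume the written file refers to every distinct file read recently.
            for read_path in {p for _, p in recent_reads}:
                if read_path != path:
                    key = (path, read_path)
                    edges[key] = edges.get(key, 0) + 1
    return edges


events = [
    (0, "read", "A"),
    (5, "write", "B"),
    (8, "read", "C"),
    (20, "write", "B"),
    (100, "write", "D"),
]
relations = count_read_write_relations(events, window_sec=30)
```

In this toy trace, the write to D at t=100 produces no relation because no read falls inside its 30-second window.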
  • Proposed method & system 1. Background & Goal 2. Related works 3. Proposed method & system 4. Experiment 5. Conclusion FRIDAL: File Retrieval by Inter-file relationship Derived from Access Log
  • Outline of FRIDAL
    • Basic assumption:
      • Files frequently used at the same time are related
    • Key features
      • Clean the raw file access log to extract the approximate file use duration (AFUD)
      • Calculate latent relationships by analyzing the overlap of AFUDs
      • Calculate the ranking for keywords using full-text search and the relationship graph
    [Figure: a paper (TeX) and a figure file]
  • Approximate File Use Duration (AFUD)
    • Case 1: The user keeps files open without using them
      • The file use duration (FUD) needs to be trimmed
    • Detect activity
      • 1) If any activity exists in a time frame of length "Ta", the user was active in that frame
      • -> Eliminate inactive time
      • 2) A long (> "Tb") inactive time means the user went home
      • -> Eliminate everything after the inactive time
    [Figure: timeline of active time, FUDs, and the AFUDs obtained by applying 1) and 2)]
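The two trimming rules above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `active_intervals` and `trim_fud`, the fixed frame grid starting at time 0, and the reading of rule 2 as "cut the FUD at the first inactive gap longer than Tb" are all choices made for this example. Timestamps are plain numbers (integer seconds keep the frame merging exact).

```python
def active_intervals(event_times, ta):
    """Rule 1: a frame of length `ta` is active if it contains any event.

    Returns merged, sorted (start, end) intervals of active time.
    """
    frames = sorted({int(t // ta) for t in event_times})
    intervals = []
    for frame in frames:
        start, end = frame * ta, (frame + 1) * ta
        if intervals and intervals[-1][1] == start:
            # Adjacent active frames form one interval.
            intervals[-1] = (intervals[-1][0], end)
        else:
            intervals.append((start, end))
    return intervals


def trim_fud(fud, active, tb):
    """Trim one file use duration (open_time, close_time) into AFUD pieces."""
    start, end = fud
    pieces = []
    prev_end = None
    for a_start, a_end in active:
        lo, hi = max(start, a_start), min(end, a_end)
        if lo >= hi:
            continue  # no overlap with this active interval
        if prev_end is not None and lo - prev_end > tb:
            break  # Rule 2: long inactivity, drop everything after it
        pieces.append((lo, hi))  # Rule 1: keep only active time
        prev_end = hi
    return pieces


active = active_intervals([0, 70, 130, 1000], ta=60)
short_tb = trim_fud((30, 1010), active, tb=600)
long_tb = trim_fud((30, 1010), active, tb=900)
```

With `tb=600` the file that stayed open across a 780-second gap is cut at the gap; with `tb=900` the later active time is kept as a second piece.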
  • Approximate File Use Duration (cont.)
    • Case 2: Some applications don't keep files open
      • They have no exclusive access control mechanism, or a different one
      • Only many short FUDs appear
    • Detect application behavior
      • "Average FUD < Tc" means "the application doesn't lock the file"
      • Fill the time slots between FUDs within active times for such file types
    [Figure: timeline of active time, FUDs, and the resulting AFUDs]
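A rough sketch of the gap-filling rule for applications that do not keep files open. The function name `fill_short_fuds` is hypothetical, and the sketch simplifies the slide in one respect: the average is taken over the FUDs passed in, whereas the slide applies the decision per file type.

```python
def fill_short_fuds(fuds, active, tc):
    """Merge short FUDs that fall in the same active interval.

    fuds:   list of (open_time, close_time) for one file
    active: list of (start, end) active-time intervals
    tc:     threshold on the average FUD length
    """
    if not fuds:
        return []
    average = sum(end - start for start, end in fuds) / len(fuds)
    if average >= tc:
        # The application appears to hold the file open; keep FUDs as they are.
        return sorted(fuds)
    merged = []
    for a_start, a_end in active:
        # Clamp each overlapping FUD to the active interval.
        inside = [
            (max(start, a_start), min(end, a_end))
            for start, end in fuds
            if start < a_end and end > a_start
        ]
        if inside:
            # Fill the slots between the first and last short FUD.
            merged.append((min(s for s, _ in inside), max(e for _, e in inside)))
    return merged


fuds = [(10, 11), (50, 52), (300, 301)]
active = [(0, 120), (240, 360)]
filled = fill_short_fuds(fuds, active, tc=5)
unchanged = fill_short_fuds(fuds, active, tc=1)
```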
  • Calculate latent interfile relationships
    • Calculate the interfile relationships from the file use durations
      • Calculate four relationship elements (COs = co-occurrences)
      • T: Total time of COs
      • C: Number of COs
      • D: Total time of the spans between COs
      • P: Similarity of the timings of the open-file operations
      • Calculate the interfile relationship from these elements
        • Relationships = [formula image not included in the transcript]
    [Figure: AFUDs on a timeline and their COs]
  • Calculate latent relationships (1 of 3)
    • T: Total time of COs
    • C: Number of COs
    • These capture the length and frequency of co-use
    [Figure: AFUDs of files x and y on a timeline; their COs c1 to c4 have durations t1 to t4]
  • Calculate latent relationships (2 of 3)
    • D: Total time of the spans between COs
    • When files are co-used across several tasks, the relationship is stronger than when they are co-used within a single task
    [Figure: two timelines of AFUDs and their COs, with spans d12 and d23 between consecutive COs]
  • Calculate latent relationships (3 of 3)
    • P: Similarity of the timings of the open-file operations
    [Figure: open timings of files A1, A2 and B1, B2 on timelines, with similarity values p1, p2, and p3 (p3 = 0)]
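The elements T, C, and D can be computed from two files' AFUD lists with simple interval arithmetic. The sketch below is an illustration under assumptions: the function names are hypothetical, the AFUDs of each file are assumed not to overlap one another, and D is read as the sum of the gaps from the end of one CO to the start of the next. P is left out because the transcript does not define how the timing similarity is measured.

```python
def co_occurrences(afuds_x, afuds_y):
    """Return the sorted overlaps (COs) between two lists of (start, end) AFUDs."""
    cos = []
    for x_start, x_end in afuds_x:
        for y_start, y_end in afuds_y:
            lo, hi = max(x_start, y_start), min(x_end, y_end)
            if lo < hi:
                cos.append((lo, hi))
    return sorted(cos)


def relationship_elements(afuds_x, afuds_y):
    """Compute (T, C, D) for a pair of files.

    T: total time of COs
    C: number of COs
    D: total time of the spans between consecutive COs (assumed end-to-start)
    """
    cos = co_occurrences(afuds_x, afuds_y)
    total_time = sum(hi - lo for lo, hi in cos)
    count = len(cos)
    spans = sum(nxt[0] - cur[1] for cur, nxt in zip(cos, cos[1:]))
    return total_time, count, spans


x = [(0, 10), (20, 30), (50, 60)]
y = [(5, 25), (55, 70)]
t, c, d = relationship_elements(x, y)
```

Here the COs are (5, 10), (20, 25), and (55, 60), so T = 15, C = 3, and D = 10 + 30 = 40.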
  • Search files using interfile relationships
    • Run the full-text search using the input keywords
    • Score all files related to the files found by the full-text search (discussed later)
    • Display the files ordered by points
    [Figure: files in the file system linked by weighted relationships; the full-text search results and the related files targeted by the proposed method receive points, giving the search result ranking 1st 25 pt, 2nd 20 pt, 3rd 15 pt, 4th 10 pt, 5th 5 pt]
  • Score the file point
    • Use TF-IDF and the normalized relationship
    • Propagate just one hop to limit computational cost
    • Point(F) = TF-IDF(F) + Σ_i TF-IDF(X_i) * NormRel(F, X_i)
    [Figure: full-text search results with TF-IDF scores 10, 20, and 30 linked to other files by normalized relationships 0.5, 1, and 0.75; for example, a related file gains +15 (20 * 0.75) and +30 (30 * 1)]
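The one-hop scoring formula Point(F) = TF-IDF(F) + Σ TF-IDF(X) * NormRel(F, X) can be sketched in a few lines. The function name `score_files` and the data layout are assumptions for this example: relationships are treated as symmetric, and each unordered pair of files is assumed to appear only once in `norm_rel`.

```python
def score_files(tfidf, norm_rel):
    """One-hop score propagation from full-text hits to related files.

    tfidf:    {file: TF-IDF score} for files found by the full-text search
    norm_rel: {(file_a, file_b): normalized relationship in [0, 1]}
    Returns (file, point) pairs sorted by point, highest first.
    """
    points = dict(tfidf)  # every hit starts with its own TF-IDF score
    for (a, b), weight in norm_rel.items():
        for source, target in ((a, b), (b, a)):
            if source in tfidf:
                # Only full-text hits propagate, and only one hop.
                points[target] = points.get(target, 0.0) + tfidf[source] * weight
    return sorted(points.items(), key=lambda item: item[1], reverse=True)


tfidf = {"a.tex": 20, "b.txt": 30}
norm_rel = {("a.tex", "fig.jpg"): 0.75, ("b.txt", "fig.jpg"): 1.0}
ranking = score_files(tfidf, norm_rel)
```

The keyword-lacking file `fig.jpg` (a made-up name) has no TF-IDF score of its own but receives 20 * 0.75 + 30 * 1 = 45 points and ranks first, which mirrors the arithmetic shown on the slide.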
  • FRIDAL Implementation
    • Components: web interface, controller (Java), full-text search engine (Hyper Estraier), RDBMS, file server (Samba)
    • Preparing phase: users use files on the file server; the controller gets the access logs, calculates the relationships, and stores them in the RDBMS; the full-text search engine makes the full-text index
    • Searching phase: the user searches through the web interface; the controller runs the full-text search, searches for related files, calculates points, and returns the search result
  • Experiments 1. Background & Goal 2. Related works 3. Proposed method & system 4. Experiments 5. Conclusion
  • Experimental Environment
    • Relationship parameters
      • (α, β, γ, δ) = (1, 1, 0.5, 0.5), based on a preparatory experiment
    • Testers
      • Tester A: Windows XP, 319 days
      • Tester B: Windows XP, 319 days
      • Tester C: Windows Vista, 323 days
    • Each tester's home directory is on a Samba 2.2 file server, which provides access logs for MS Office, LaTeX, image, and movie files
  • Mined latent interfile relations
    • The number of relations did not correlate with the size of the logs
      • It depends on what the tester was doing
    Tester   | Lines of logs | #Files | #Rels
    Tester A | 4,873,703     | 1,100  | 17,472
    Tester B | 4,323,090     | 713    | 5,692
    Tester C | 7,863,206     | 793    | 5,236
  • Evaluation 1
    • Task:
      • Find specific files in another user's home directory
    • Evaluated values
      • The number of queries
      • The number of files the user checked before finding the target files
      • The number of answer files found
    • Compared methods
      • FRIDAL
      • Full-text search
  • Evaluation 1: Results
    • FRIDAL can find keyword-lacking files, and at a smaller cost than full-text search
    • Target files
      • F1: The paper of tester A
      • F2: The source of the image files in the paper of tester A
      • F3: The eight data files for the paper of tester A
      • F4: The paper of tester C
      • F5: The source of the image files in the paper of tester C
      • F6: The data file for the paper of tester C
    • Smaller cost
    File | Search method | #Queries | #Checked files
    F1   | FRIDAL        | 2        | 1
    F1   | Full-text     | 2        | 15
    F4   | FRIDAL        | 1        | 2
    F4   | Full-text     | 1        | 11
    F6   | FRIDAL        | 1        | 15
    F6   | Full-text     | 2        | 14
    Ave. | FRIDAL        | 1.3      | 6.0
    Ave. | Full-text     | 1.7      | 13.3
    • Only FRIDAL can find
    File | Search method | #Queries | #Checked files | Found
    F2   | FRIDAL        | 1        | 9              | 1/1
    F2   | Full-text     | 1        | 6              | 0/1
    F3   | FRIDAL        | 1        | 4              | 3/8
    F3   | Full-text     | 1        | 0              | 0/8
    F5   | FRIDAL        | 1        | 2              | 1/1
    F5   | Full-text     | 1        | 14             | 0/1
  • Evaluation 2
    • Performance comparison with other methods
      • Prepare six tasks that search for files in a home directory
      • (Details are in Table 4 of our paper)
    • Evaluated values
      • Average of the 11-point average precision
      • Average of the top-20 precision and recall
    • Compared methods
      • FRIDAL
      • Full-text search
      • Directory search
      • Connections calculation
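For readers unfamiliar with these measures, here is a short sketch of how they are commonly computed. It assumes the standard interpolated form of 11-point average precision (the slides do not spell out the exact variant used), and the function names are hypothetical.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall over the top-k entries of a ranked result list."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k, hits / len(relevant)


def eleven_point_average_precision(ranked, relevant):
    """Mean interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    hits = 0
    recall_precision = []  # (recall, precision) at each relevant result
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            recall_precision.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolated precision: best precision at any recall >= the level.
    interpolated = [
        max((p for r, p in recall_precision if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
p_at_2, r_at_2 = precision_recall_at_k(ranked, relevant, k=2)
ap11 = eleven_point_average_precision(ranked, relevant)
```

In this toy ranking, the interpolated precision is 1.0 for recall levels up to 0.5 and 2/3 above that, so the 11-point average is (6 * 1.0 + 5 * 2/3) / 11.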
  • Evaluation 2: Comparison methods
    • Directory search
      • A straightforward strategy
      • Search the directories that contain the full-text search results
    • Connections calculation
      • Use the calculation method of Connections
      • Use the read/write attribute of file accesses in the access logs instead of read()/write() system calls
      • Use the optimal parameter values that the authors reported in their paper
    [Figure: full-text search results ranked 1st to 7th; directory search lists the 1st result, then the files in the same directory as the 1st, then the 2nd result, then the files in the same directory as the 2nd, ...]
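The directory search baseline can be sketched as follows. The ordering (each full-text hit followed by the other files in its directory) is an interpretation of the slide's figure, and the function name `directory_search` and the example paths are made up for this illustration.

```python
import posixpath


def directory_search(fulltext_ranked, all_files):
    """Expand each full-text hit with the other files in its directory.

    fulltext_ranked: paths returned by full-text search, best first
    all_files:       every path in the searched file system
    """
    by_directory = {}
    for path in all_files:
        by_directory.setdefault(posixpath.dirname(path), []).append(path)

    results, seen = [], set()
    for hit in fulltext_ranked:
        siblings = sorted(by_directory.get(posixpath.dirname(hit), []))
        for path in [hit] + siblings:
            if path not in seen:  # keep the first (highest) position only
                seen.add(path)
                results.append(path)
    return results


all_files = ["/p/a.tex", "/p/fig.jpg", "/q/n.txt", "/q/data.csv", "/r/x.doc"]
expanded = directory_search(["/p/a.tex", "/q/n.txt"], all_files)
```

Files in directories without any full-text hit, such as `/r/x.doc` here, never appear in the result, which is the main limitation of this baseline compared with relationship-based search.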
  • Evaluation 2: Results
    • FRIDAL has the best score
    • The precision of FRIDAL is higher than that of the other methods at low recall levels
    • FRIDAL retrieves more relevant files than the others near the top of the results, so the desired files can be found efficiently with FRIDAL
    Top 20                  | Avg. precision | Avg. recall
    FRIDAL                  | 0.72           | 0.15
    Full-text search        | 0.54           | 0.12
    Directory search        | 0.61           | 0.13
    Connections calculation | 0.48           | 0.10
  • Conclusion & Future work 1. Background & Goal 2. Related works 3. Proposed method & system 4. Experiments 5. Conclusion
  • Conclusion
    • FRIDAL: A new desktop search method that uses latent relationships to search for keyword-lacking files
      • A method for automatically extracting latent relationships between files from file access logs
      • A search method and system that combine inter-file relationships with a full-text search engine
    • We showed the feasibility and performance of FRIDAL in experiments with real data
      • Best performance among the compared methods
  • Future work
    • Improve the implementation
      • Support copying, moving, and renaming files
      • Support other file access logs (Windows Event Log)
    • Improve the calculation of the interfile relationships
      • Filter noise in the calculation of AFUDs
      • Consider read/write (and move, delete, …) actions
    • Improve our ranking method
      • Detailed analysis of multi-user logs
      • More consideration of time-related information
        • Need to discuss whether old logs are important or not
  • Thank you! Questions & Comments ?