The most common strategy for finding files is to carefully arrange them into folders. This strategy breaks down for teams, where organizational schemes often differ between members, and when information is copied and reused, since versions become harder to track. As storage continues to grow and costs decline, the incentives to carefully archive old versions of files diminish. It is therefore important to explore new and improved search tools. The dominant approach is keyword search, though recalling effective keywords can be challenging, especially as repositories grow and information flows across projects. A less common alternative is provenance: information about the creation, use, and sharing of documents and their context, including collaborators. This paper presents a limited user study showing that provenance data is useful and desirable in search, and that an interface based on a graphical sketchpad is not only feasible but efficient.
Leyline: A provenance-based desktop search
1. The Leyline: A Comparative Approach To Designing a Graphical Provenance-Based Search UI
Soroush Ghorashi, Carlos Jensen
Oregon State University
HICSS 2013
2. What is the problem?
Computers are increasingly “black holes” for information
— Storage is abundant and cheap; there are no incentives to delete or archive
— Collaboration and sharing are growing
— Information is increasingly flowing across devices
More information is available, yet it is harder to (re)find anything
Manual Folder Navigation [Barreau and Nardi 1995; Teevan et al. 2004; Bergman et al. 2008]
— Collaborators use conflicting naming schemes
— Overlapping projects introduce uncertainty
Keyword Search
— Larger repositories and information reuse lead to long lists of hits for common keywords
— Multiple copies and drafts of files
6. Solution?
What about: “Leveraging provenance to enrich file search”
— Provenance: The history of a document’s ownership and transformations, as well as its sources and derivatives
[Figure: example provenance graph — the email “RE: presentation draft” (attachment save) and a copy of data.html (copy/paste) feed into presentation.ppt, which is saved as presentation-v2.ppt]
— Track provenance events: Make available in search queries, use in results
— Allow for fundamentally different types of queries
— People remember related documents [Gonçalves 2004; Blanc-Brude 2007]
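The provenance events in the example above (copy/paste, attachment save, save-as) can be logged as labeled edges in a simple graph and then traversed at search time. A minimal Python sketch using the slide's example files; the class and method names are invented for illustration, not the Leyline's actual data model:

```python
from collections import defaultdict

class ProvenanceGraph:
    """Toy provenance log: each event is a labeled edge from a
    source resource to a derived resource."""

    def __init__(self):
        self.edges = defaultdict(list)  # source -> [(event, target), ...]

    def record(self, source, event, target):
        """Log a provenance event such as a save-as or copy/paste."""
        self.edges[source].append((event, target))

    def derived_from(self, resource):
        """Resources directly derived from `resource`."""
        return [target for _, target in self.edges[resource]]

g = ProvenanceGraph()
g.record("data.html", "copy/paste", "presentation.ppt")
g.record("RE: presentation draft", "attachment-save", "presentation.ppt")
g.record("presentation.ppt", "save-as", "presentation-v2.ppt")
print(g.derived_from("presentation.ppt"))  # ['presentation-v2.ppt']
```

A search query can then be phrased over these edges ("the file I saved as a new version after pasting from a web page") rather than over keywords alone.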
7. Research Goals
— Phase 1: Analyze information reuse, information flow, and provenance events in real-world settings
— Phase 2: Investigate the effectiveness of
provenance cues in desktop search
— Phase 3: Develop and evaluate provenance-based
search tools (if appropriate)
8. Phase 1: Study Real-World Work
Practices (2008/2010)
3-month user study at Intel Corporation
— Logged subjects’ activities on their computers
— Data cleaned of personal and sensitive information
— Recorded provenance and information-access events
Participants
— 17 information workers, 43 workdays on average
— 9 observation sessions
— Exit interview with test
Findings
— 126,620 unique resources
— 7,448 resources per subject (min: 3,211; max: 17,570; σ: 3,326)
File use per person-day: Web* 89.9; Email 73.7; Word 4.4; Excel 2.5; PowerPoint 2.1; Text 0.4; PDF 0.2; Total 173.2
Provenance events: CopyPaste 63%; SaveAs 15%; MoveFile 6%; FileRename 5%; DownloadFile 3%; AttachmentAdd 3%; AttachmentSave 3%; UploadFile 2%
C. Jensen et al., "The life and times of files and information: a study of desktop provenance." In Proceedings of the 28th international
Conference on Human Factors in Computing Systems (Atlanta, GA, April 10 - 15, 2010). CHI '10. ACM, New York, NY, pp. 767-776.
9. Phase 1 contd.
Provenance networks are more common than we expected!
— 521 significant graphs (3+ nodes)
— Average 5.8 resources per graph
— 53.7% of files related to at least one other file in their own network
10. Phase 1 contd.
Half of subjects remembered more about their documents after seeing a provenance graph.
— “It looks like it comes from the IAP tool, and all the green boxes are my Excel spreadsheets that I exported to. The word documents are probably what I copied the Excel data to, probably for email.”
— “I recall uploading those to the SharePoint site!”
— “Oh, I see what’s going on. I tend to open a spreadsheet and sometimes I’ll have more than one open at the same time…”
— “2.4 might have been embedded in a doc, so I had to copy it out from there.”
— “Yeah, that’s what I did, I turned it into Excel… I saved it, and then I changed the name because I wanted to make sure it was distinguished from other files I have with the same name for a different group.”
— “Looks like I copied and pasted from the website into a doc… It’s kind of complicated what I did here. I took 2.2, copied and pasted info into an Excel spreadsheet. And then yeah, there’s number 7, a spreadsheet as well.”
11. Can We Use Provenance More Directly?
Textual query in most traditional keyword search tools
What about drawing queries?
13. Phase 2: Provenance in Search?
Is it Appropriate?
Can provenance be used effectively in search?
— How complex a query do we need to find a file?
— List of all unique walks in provenance graphs
— Find longest repeating strings for each subject
— Worst case unique query: Longest repeating string + 1
— With/without provenance event type to examine impact
Outlook--AS--Word--CP--PowerPoint--SA--PowerPoint--CP--PowerPoint
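The "longest repeating string + 1" worst case can be computed by brute force over all sub-walks: any sequence of steps that occurs more than once cannot uniquely identify a file, so one more step is needed to disambiguate. A sketch with made-up walks (this is an illustration, not necessarily how the paper computed it):

```python
from collections import Counter

def worst_case_query_length(walks):
    """Worst-case unique query length = longest repeated sub-walk + 1.
    Brute force over every contiguous sub-walk; fine for a sketch."""
    counts = Counter()
    for walk in walks:
        for i in range(len(walk)):
            for j in range(i + 1, len(walk) + 1):
                counts[tuple(walk[i:j])] += 1
    # Longest sub-walk that occurs more than once anywhere.
    longest = max((len(s) for s, c in counts.items() if c > 1), default=0)
    return longest + 1

# Hypothetical walks alternating resources and event types (AS = attachment
# save, CP = copy/paste, SA = save-as), loosely mimicking the slide's example.
walks = [
    ("Outlook", "AS", "Word", "CP", "PowerPoint"),
    ("Outlook", "AS", "Word", "SA", "Word"),
]
print(worst_case_query_length(walks))  # 4: "Outlook--AS--Word" repeats
```

With real repositories of ~7,500 items the same computation yields the 3-to-10-step bounds reported on the next slide.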
14. Phase 2 contd.
— Maximum query length for a repository of ~7,500 items:
— Considering the type of provenance events
— 3 to 9, median 4
— Without considering the type of provenance events
— 3 to 10, median 4.5
Provenance events like copy/paste and versioning are too
common to add value!
— Provenance search grows linearly
— 1 node per 200 links
Provenance can be used to narrow search space quickly.
15. Tool Analysis
Categorizing tools that are using provenance-like data to enhance search
— Provenance Types
— Provenance Monitoring
— Provenance Use
— UI Approach
— Evaluation
16. Tool Analysis contd.
Feldspar
— Provenance types: file meta-data, keywords, static relations between resources
— Monitoring: extracts relations from Google Desktop’s database using its API
— Use: query formulation, search process (real-time results updating)
— UI: flow-chart-like; list view of results
— Evaluation: canned data; limited within-subjects user study
Quill
— Provenance types: meta-data such as author, storage place, date, physical place tag (home, work, etc.)
— Monitoring: built-in system monitor records meta-data about the user’s documents, email attachments, web pages, applications, and calendar
— Use: query formulation, search process (real-time results updating)
— UI: narrative-based; list of resource thumbnails
— Evaluation: multiple user studies
SIS
— Provenance types: file meta-data (such as kind, date, author, email attributes)
— Monitoring: Microsoft Desktop Search database; fuzzy matching (“car” and “cars” match); fielded search (author is “john doe”)
— Use: query formulation, search process, results presentation
— UI: text input with selectable filters; list view of results with preview and meta-data
— Evaluation: longitudinal study using real data on subjects’ PCs (234 people), 6 weeks
Phlat
— Provenance types: file meta-data (such as kind, date, author, email attributes); contextual cues such as user-defined tags
— Monitoring: Microsoft Desktop Search database; extra meta-data as tags (labeling system)
— Use: query formulation, search process, results presentation
— UI: text input with selectable filters; list view of results with preview and meta-data
— Evaluation: longitudinal study using real data on subjects’ PCs (225 people), 8 months
YouPivot
— Provenance types: environmental factors as contextual cues; user-defined marks
— Monitoring: integrated system monitor records contextual cues and their occurrences
— Use: query formulation, search process
— UI: textual input and selectable filters; list view of results
— Evaluation: canned data; limited within-subjects user study
17. Tool Analysis
Feldspar
— Feldspar – Chau et al. 2008
— Desktop search
— Uses associations between files and resources
— extracted from Google Desktop database
— Keyword and meta-data search
— Flowchart-like user interface
— Real-time results, fast
— Evaluated with canned data
— Within subject study
18. Tool Analysis
Stuff I’ve seen, Phlat
— Stuff I’ve Seen (SIS) – Dumais et al. 2003; Phlat – Cutrell et al. 2006
— Similar to Windows Desktop Search
— Keyword and meta-data search
— Ranks the results using contextual cues
— Textual input
— List view of results with snippet and meta-data
— Unified labeling (Phlat)
— Longitudinal study
19. Tool Analysis
YouPivot
— YouPivot – Hailpern et al. 2011
— Search web browsing history
— Internal system monitor
— Uses keywords for search and contextual cues to filter the results
— Timeline view for user activities
— Textual input, list view of results
— TimeMarks to filter the results
— Evaluated with canned data
— Within subject study
20. Phase 3: Design Goals
— Use dynamic relations
between files
— Integration with keyword
search
— Graphical UI
— Allowing all kinds of
graphical queries
— Internal system monitor
— Result exploration
21. Phase 3: System Requirements
— Provenance + Keyword search
— Streamline query composition
using a drag-drop graphical
sketchpad
— Allow for flexible exploration
and discovery
— Integration with Windows
Explorer to allow exploration of
workflow and information
provenance
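One way a sketchpad query like those above could be represented internally is as typed nodes joined by event-labeled edges. A hedged sketch; all class names and fields are hypothetical, not the Leyline's actual design:

```python
from dataclasses import dataclass, field

@dataclass
class QueryNode:
    """A box the user drags onto the sketchpad."""
    resource_type: str                      # e.g. "word", "email", "webpage"
    keywords: list = field(default_factory=list)  # optional keyword filter

@dataclass
class QueryEdge:
    """A link the user draws between two boxes."""
    source: QueryNode
    target: QueryNode
    event: str                              # e.g. "copy/paste", "save-as", "*"

# "The Word document with 'budget' in it that I saved from an email
# attachment and pasted Excel data into":
doc = QueryNode("word", keywords=["budget"])
query = [
    QueryEdge(QueryNode("email"), doc, "attachment-save"),
    QueryEdge(QueryNode("excel"), doc, "copy/paste"),
]
print(len(query))  # 2 edges
```

Combining keyword filters on nodes with event labels on edges is what lets provenance search and keyword search share one query.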
22. Phase 3 contd.
Exact pattern matching is NP-complete!
(the subgraph isomorphism problem)
— Introducing * links
— Partial matching
— Easier to solve
— Better matches user recall
— Use the G-Ray algorithm [Tong et al. 2007]
— Best-effort matching
— Fast, scalable, flexible, and forgiving
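Best-effort partial matching with * links can be illustrated with a toy scorer: each candidate file earns a point per satisfied query edge, where a * link accepts any path rather than only a direct edge. This is a simplification for illustration, not the G-Ray algorithm; the file names and query are invented:

```python
from collections import deque

def reachable(graph, src):
    """All nodes reachable from src (BFS); used for '*' wildcard links."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(src)
    return seen

def best_effort_match(graph, target_type, constraints, node_type):
    """Rank nodes of target_type by how many query edges they satisfy.
    'direct' requires an edge; '*' accepts any chain of events."""
    scores = {}
    for node in graph:
        if node_type(node) != target_type:
            continue
        score = 0
        for other, kind in constraints:
            if kind == "direct" and node in graph.get(other, ()):
                score += 1
            elif kind == "*" and node in reachable(graph, other):
                score += 1
        scores[node] = score
    return sorted(scores, key=scores.get, reverse=True)

graph = {
    "mail.eml": {"report.doc"},
    "data.xls": {"draft.doc"},
    "draft.doc": {"report.doc"},
    "report.doc": set(),
}
node_type = lambda name: name.rsplit(".", 1)[-1]
# Query: a .doc derived (via any path) from data.xls and directly from mail.eml
hits = best_effort_match(graph, "doc",
                         [("data.xls", "*"), ("mail.eml", "direct")],
                         node_type)
print(hits[0])  # report.doc ranks first: it satisfies both constraints
```

Partial matches like draft.doc (one constraint out of two) still appear in the ranking, which is what makes the approach forgiving of imperfect recall.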
26. Phase 3: Preliminary Evaluation
Is UI approach reasonable?
— User Study
— Used file repository modeled after those found at Intel
— Participant selection
— Questionnaire to examine knowledge of search tools
— Graduate students
— Interactive tutorial
— 9 experiment tasks, e.g.:
“Find the Word document you created using information copy/pasted from an email, a web page, and an Excel document. Find the emails that have this Word document as an attachment.”
— Tasks ordered randomly
— Think aloud protocol
— 4 minutes for each task
— Exit interview about their experience
S. Ghorashi, C. Jensen, “Leyline: provenance-based search using a graphical sketchpad”, In Proceedings of the 6th Symposium on Human-
Computer Interaction and Information Retrieval (HCIR'12). ACM, New York, NY, USA, Article 2 , 10 pages.
27. Phase 3: Preliminary Evaluation
contd.
— Average completion time: 106 seconds
— Simple tasks (72 seconds – 93 seconds)
— Hard tasks (126 seconds – 155 seconds)
— Query complexity (#nodes & #edges)
— Average of 2.8 nodes and 2 edges
— System scales well (Completion time vs. Complexity)
— Observations
— Importance of target document
— Working on one resource or relation at a time
— Saw marked learning effect
— Interviews
— Overall likability rating: 4.2 out of 5 (σ = 0.4)
— Wanted Leyline in real life
— No one complained about effort/time requirement
— Areas for improvement
— Query composition history panel
— Customization options
— Support more resource types
28. Conclusion
— Provenance events are very common in real-world
settings, and potentially helpful in search
— Provenance alone can quickly and effectively identify
unique files/resources (assuming perfect recall)
— A graphical sketchpad is a viable UI for query
composition
— It will not replace keyword search, but it is a valuable addition
— Users quickly learned how to use our system, and
wanted the tool
29. What about the future?
— Incorporate the feedback and lessons learned into a new
prototype
— Expand feature set to include:
— Auto-completion and suggestion features to speed up the
search process
— Support a broader set of files and resources
— Possibly support other computer platforms
— Prepare for longitudinal study
— How do people adapt and use the Leyline?
— How does the Leyline scale in a large database?
— Does the Leyline change exploration?
— Does the Leyline work in collaborative environments?
30. Thank you
— Thanks to Intel for early funding and subjects!
— For more information:
— Soroush Ghorashi
— (ghorashi@eecs.oregonstate.edu)
— Carlos Jensen
— (cjensen@eecs.oregonstate.edu)