The most common strategy for finding files is to carefully arrange them into folders. This strategy breaks down for teams, where organizational schemes often differ between members, and when information is copied and reused, since versions become harder to track. As storage continues to grow and costs decline, the incentives to carefully archive old versions of files diminish. It is therefore important to explore new and improved search tools. The dominant approach is keyword search, though recalling effective keywords can be challenging, especially as repositories grow and information flows across projects. A less common alternative is provenance: information about the creation, use, and sharing of documents and their context, including collaborators. This paper presents a limited user study showing that provenance data is useful and desirable in search, and that an interface based on a graphical sketchpad is not only feasible but efficient.
Leyline: A provenance-based desktop search
1. The Leyline: A Comparative Approach To Designing a Graphical Provenance-Based Search UI
Soroush Ghorashi, Carlos Jensen
Oregon State University
HICSS 2013
2. What is the problem?
Computers are increasingly “black holes” for information
— Storage is abundant and cheap; there are no incentives to delete or archive
— Collaboration and sharing are growing
— Information is increasingly flowing across devices
More information is available, yet it is harder to (re)find anything
Manual Folder Navigation [Barreau and Nardi 1995; Teevan et al. 2004; Bergman et al. 2008]
— Collaborators use conflicting naming schemes
— Overlapping projects introduce uncertainty
Keyword Search
— Larger repositories and information reuse lead to long lists of hits for common keywords
— Multiple copies and drafts of files
6. Solution?
What about: “Leveraging provenance to enrich file search”
— Provenance: The history of a document’s ownership and transformations, as well as its sources and derivatives
[Figure: example provenance graph — the email “RE: presentation draft” (attachment save) and a copy of data.html (copy/paste) feed into presentation.ppt, which is saved as presentation-v2.ppt]
— Track provenance events: Make available in search queries, use in results
— Allow for fundamentally different types of queries
— People remember related documents [Gonçalves 2004; Blanc-Brude 2007]
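The provenance events in the example above (copy/paste, attachment save, save-as) can be logged as labeled edges in a simple graph and then traversed at search time. A minimal Python sketch using the slide's example files; the class and method names are invented for illustration, not the Leyline's actual data model:

```python
from collections import defaultdict

class ProvenanceGraph:
    """Toy provenance log: each event is a labeled edge from a
    source resource to a derived resource."""

    def __init__(self):
        self.edges = defaultdict(list)  # source -> [(event, target), ...]

    def record(self, source, event, target):
        """Log a provenance event such as a save-as or copy/paste."""
        self.edges[source].append((event, target))

    def derived_from(self, resource):
        """Resources directly derived from `resource`."""
        return [target for _, target in self.edges[resource]]

g = ProvenanceGraph()
g.record("data.html", "copy/paste", "presentation.ppt")
g.record("RE: presentation draft", "attachment-save", "presentation.ppt")
g.record("presentation.ppt", "save-as", "presentation-v2.ppt")
print(g.derived_from("presentation.ppt"))  # ['presentation-v2.ppt']
```

A search query can then be phrased over these edges ("the file I saved as a new version after pasting from a web page") rather than over keywords alone.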
7. Research Goals
— Phase 1: Analyze information reuse, information flow, and provenance events in real-world settings
— Phase 2: Investigate the effectiveness of
provenance cues in desktop search
— Phase 3: Develop and evaluate provenance-based
search tools (if appropriate)
8. Phase 1: Study Real-World Work
Practices (2008/2010)
3-month user study at Intel Corporation
— Logged subjects’ activities on their computers
— Data cleaned of personal and sensitive information
— Recorded provenance and information-access events
Participants
— 17 information workers, 43 workdays on average
— 9 observation sessions
— Exit interview with test
Findings
— 126,620 unique resources
— 7,448 resources per subject (min: 3,211; max: 17,570; σ: 3,326)
File use per person-day: Web* 89.9; Email 73.7; Word 4.4; Excel 2.5; PowerPoint 2.1; Text 0.4; PDF 0.2; Total 173.2
Provenance events: CopyPaste 63%; SaveAs 15%; MoveFile 6%; FileRename 5%; DownloadFile 3%; AttachmentAdd 3%; AttachmentSave 3%; UploadFile 2%
C. Jensen et al., "The life and times of files and information: a study of desktop provenance." In Proceedings of the 28th international
Conference on Human Factors in Computing Systems (Atlanta, GA, April 10 - 15, 2010). CHI '10. ACM, New York, NY, pp. 767-776.
9. Phase 1 contd.
Provenance networks are more common than we expected!
— 521 significant graphs (3+ nodes)
— Average 5.8 resources per graph
— 53.7% of files related to at least one other file in their own network
10. Phase 1 contd.
Half of subjects remembered more about their documents after seeing a provenance graph.
— “It looks like it comes from the IAP tool, and all the green boxes are my Excel spreadsheets that I exported to. The word documents are probably what I copied the Excel data to, probably for email.”
— “I recall uploading those to the SharePoint site!”
— “Oh, I see what’s going on. I tend to open a spreadsheet and sometimes I’ll have more than one open at the same time…”
— “2.4 might have been embedded in a doc, so I had to copy it out from there.”
— “Yeah, that’s what I did, I turned it into Excel… I saved it, and then I changed the name because I wanted to make sure it was distinguished from other files I have with the same name for a different group.”
— “Looks like I copied and pasted from the website into a doc… It’s kind of complicated what I did here. I took 2.2, copied and pasted info into an Excel spreadsheet. And then yeah, there’s number 7, a spreadsheet as well.”
11. Can We Use Provenance More Directly?
Textual query in most traditional keyword search tools
What about drawing queries?
13. Phase 2: Provenance in Search?
Is it Appropriate?
Can provenance be used effectively in search?
— How complex a query do we need to find a file?
— List of all unique walks in provenance graphs
— Find longest repeating strings for each subject
— Worst case unique query: Longest repeating string + 1
— With/without provenance event type to examine impact
Outlook--AS--Word--CP--PowerPoint--SA--PowerPoint--CP--PowerPoint
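The "longest repeating string + 1" worst case can be computed by brute force over all sub-walks: any sequence of steps that occurs more than once cannot uniquely identify a file, so one more step is needed to disambiguate. A sketch with made-up walks (this is an illustration, not necessarily how the paper computed it):

```python
from collections import Counter

def worst_case_query_length(walks):
    """Worst-case unique query length = longest repeated sub-walk + 1.
    Brute force over every contiguous sub-walk; fine for a sketch."""
    counts = Counter()
    for walk in walks:
        for i in range(len(walk)):
            for j in range(i + 1, len(walk) + 1):
                counts[tuple(walk[i:j])] += 1
    # Longest sub-walk that occurs more than once anywhere.
    longest = max((len(s) for s, c in counts.items() if c > 1), default=0)
    return longest + 1

# Hypothetical walks alternating resources and event types (AS = attachment
# save, CP = copy/paste, SA = save-as), loosely mimicking the slide's example.
walks = [
    ("Outlook", "AS", "Word", "CP", "PowerPoint"),
    ("Outlook", "AS", "Word", "SA", "Word"),
]
print(worst_case_query_length(walks))  # 4: "Outlook--AS--Word" repeats
```

With real repositories of ~7,500 items the same computation yields the 3-to-10-step bounds reported on the next slide.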
14. Phase 2 contd.
— Maximum query length for a repository of ~7,500 items:
— Considering the type of provenance events
— 3 to 9, median 4
— Without considering the type of provenance events
— 3 to 10, median 4.5
Provenance events like copy/paste and versioning are too
common to add value!
— Provenance search grows linearly
— 1 node per 200 links
Provenance can be used to narrow search space quickly.
15. Tool Analysis
Categorizing tools that are using provenance-like data to enhance search
— Provenance Types
— Provenance Monitoring
— Provenance Use
— UI Approach
— Evaluation
16. Tool Analysis contd.
Feldspar
— Provenance types: file meta-data, keywords, static relations between resources
— Monitoring: extracts relations from Google Desktop’s database using its API
— Use: query formulation, search process (real-time results updating)
— UI: flow-chart-like; list view of results
— Evaluation: canned data; limited within-subjects user study
Quill
— Provenance types: meta-data such as author, storage place, date, physical place tag (home, work, etc.)
— Monitoring: built-in system monitor records meta-data about the user’s documents, email attachments, web pages, applications, and calendar
— Use: query formulation, search process (real-time results updating)
— UI: narrative-based; list of resource thumbnails
— Evaluation: multiple user studies
SIS
— Provenance types: file meta-data (such as kind, date, author, email attributes)
— Monitoring: Microsoft Desktop Search database; fuzzy matching (“car” and “cars” match); fielded search (author is “john doe”)
— Use: query formulation, search process, results presentation
— UI: text input with selectable filters; list view of results with preview and meta-data
— Evaluation: longitudinal study using real data on subjects’ PCs (234 people), 6 weeks
Phlat
— Provenance types: file meta-data (such as kind, date, author, email attributes); contextual cues such as user-defined tags
— Monitoring: Microsoft Desktop Search database; extra meta-data as tags (labeling system)
— Use: query formulation, search process, results presentation
— UI: text input with selectable filters; list view of results with preview and meta-data
— Evaluation: longitudinal study using real data on subjects’ PCs (225 people), 8 months
YouPivot
— Provenance types: environmental factors as contextual cues; user-defined marks
— Monitoring: integrated system monitor records contextual cues and their occurrences
— Use: query formulation, search process
— UI: textual input and selectable filters; list view of results
— Evaluation: canned data; limited within-subjects user study
17. Tool Analysis
Feldspar
— Feldspar – Chau et al. 2008
— Desktop search
— Uses associations between files and resources
— extracted from Google Desktop database
— Keyword and meta-data search
— Flowchart-like user interface
— Real-time results, fast
— Evaluated with canned data
— Within subject study
18. Tool Analysis
Stuff I’ve seen, Phlat
— Stuff I’ve Seen (SIS) – Dumais et al. 2003; Phlat – Cutrell et al. 2006
— Similar to Windows Desktop Search
— Keyword and meta-data search
— Ranks the results using contextual cues
— Textual input
— List view of results with snippet and meta-data
— Unified labeling (Phlat)
— Longitudinal study
19. Tool Analysis
YouPivot
— YouPivot – Hailpern et al. 2011
— Search web browsing history
— Internal system monitor
— Uses keywords for search and contextual cues to filter the results
— Timeline view for user activities
— Textual input, list view of results
— TimeMarks to filter the results
— Evaluated with canned data
— Within subject study
20. Phase 3: Design Goals
— Use dynamic relations
between files
— Integration with keyword
search
— Graphical UI
— Allowing all kinds of
graphical queries
— Internal system monitor
— Result exploration
21. Phase 3: System Requirements
— Provenance + Keyword search
— Streamline query composition
using a drag-drop graphical
sketchpad
— Allow for flexible exploration
and discovery
— Integration with Windows
Explorer to allow exploration of
workflow and information
provenance
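One way a sketchpad query like those above could be represented internally is as typed nodes joined by event-labeled edges. A hedged sketch; all class names and fields are hypothetical, not the Leyline's actual design:

```python
from dataclasses import dataclass, field

@dataclass
class QueryNode:
    """A box the user drags onto the sketchpad."""
    resource_type: str                      # e.g. "word", "email", "webpage"
    keywords: list = field(default_factory=list)  # optional keyword filter

@dataclass
class QueryEdge:
    """A link the user draws between two boxes."""
    source: QueryNode
    target: QueryNode
    event: str                              # e.g. "copy/paste", "save-as", "*"

# "The Word document with 'budget' in it that I saved from an email
# attachment and pasted Excel data into":
doc = QueryNode("word", keywords=["budget"])
query = [
    QueryEdge(QueryNode("email"), doc, "attachment-save"),
    QueryEdge(QueryNode("excel"), doc, "copy/paste"),
]
print(len(query))  # 2 edges
```

Combining keyword filters on nodes with event labels on edges is what lets provenance search and keyword search share one query.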
22. Phase 3 contd.
Exact pattern matching is NP-complete!
(the subgraph isomorphism problem)
— Introducing * links
— Partial matching
— Easier to solve
— Better matches user recall
— Use the G-Ray algorithm [Tong et al. 2007]
— Best-effort matching
— Fast, scalable, flexible, and forgiving
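Best-effort partial matching with * links can be illustrated with a toy scorer: each candidate file earns a point per satisfied query edge, where a * link accepts any path rather than only a direct edge. This is a simplification for illustration, not the G-Ray algorithm; the file names and query are invented:

```python
from collections import deque

def reachable(graph, src):
    """All nodes reachable from src (BFS); used for '*' wildcard links."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(src)
    return seen

def best_effort_match(graph, target_type, constraints, node_type):
    """Rank nodes of target_type by how many query edges they satisfy.
    'direct' requires an edge; '*' accepts any chain of events."""
    scores = {}
    for node in graph:
        if node_type(node) != target_type:
            continue
        score = 0
        for other, kind in constraints:
            if kind == "direct" and node in graph.get(other, ()):
                score += 1
            elif kind == "*" and node in reachable(graph, other):
                score += 1
        scores[node] = score
    return sorted(scores, key=scores.get, reverse=True)

graph = {
    "mail.eml": {"report.doc"},
    "data.xls": {"draft.doc"},
    "draft.doc": {"report.doc"},
    "report.doc": set(),
}
node_type = lambda name: name.rsplit(".", 1)[-1]
# Query: a .doc derived (via any path) from data.xls and directly from mail.eml
hits = best_effort_match(graph, "doc",
                         [("data.xls", "*"), ("mail.eml", "direct")],
                         node_type)
print(hits[0])  # report.doc ranks first: it satisfies both constraints
```

Partial matches like draft.doc (one constraint out of two) still appear in the ranking, which is what makes the approach forgiving of imperfect recall.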
26. Phase 3: Preliminary Evaluation
Is UI approach reasonable?
— User Study
— Used file repository modeled after those found at Intel
— Participant selection
— Questionnaire to examine knowledge of search tools
— Graduate students
— Interactive tutorial
— 9 experiment tasks, e.g.:
“Find the Word document you created using information copy/pasted from an email, a web page, and an Excel document. Find the emails that have this Word document as an attachment.”
— Tasks ordered randomly
— Think aloud protocol
— 4 minutes for each task
— Exit interview about their experience
S. Ghorashi, C. Jensen, “Leyline: provenance-based search using a graphical sketchpad”, In Proceedings of the 6th Symposium on Human-
Computer Interaction and Information Retrieval (HCIR'12). ACM, New York, NY, USA, Article 2 , 10 pages.
27. Phase 3: Preliminary Evaluation
contd.
— Average completion time: 106 seconds
— Simple tasks (72 seconds – 93 seconds)
— Hard tasks (126 seconds – 155 seconds)
— Query complexity (#nodes & #edges)
— Average of 2.8 nodes and 2 edges
— System scales well (Completion time vs. Complexity)
— Observations
— Importance of target document
— Working on one resource or relation at a time
— Saw marked learning effect
— Interviews
— Overall likability rating: 4.2 out of 5 (σ = 0.4)
— Wanted Leyline in real life
— No one complained about effort/time requirement
— Areas for improvement
— Query composition history panel
— Customization options
— Support more resource types
28. Conclusion
— Provenance events are very common in real-world
settings, and potentially helpful in search
— Provenance alone can quickly and effectively identify
unique files/resources (assuming perfect recall)
— A graphical sketchpad is a viable UI for query
composition
— It will not replace keyword search, but it is a valuable addition
— Users quickly learned how to use our system, and
wanted the tool
29. What about the future?
— Incorporate the feedback and lessons learned into a new
prototype
— Expand feature set to include:
— Auto-completion and suggestion features to speed up the
search process
— Support a broader set of files and resources
— Possibly support other computer platforms
— Prepare for longitudinal study
— How do people adapt and use the Leyline?
— How does the Leyline scale in a large database?
— Does the Leyline change exploration?
— Does the Leyline work in collaborative environments?
30. Thank you
— Thanks to Intel for early funding and subjects!
— For more information:
— Soroush Ghorashi
— (ghorashi@eecs.oregonstate.edu)
— Carlos Jensen
— (cjensen@eecs.oregonstate.edu)