Curating Humanities Data: Law, technology and reality
Remembrance of data past
1. Remembrance of
Data Past
Using Context in Personal
Information Search
Amélie Marian, Rutgers University
Thu D. Nguyen, Rutgers University
Daniela Vianna, Rutgers University
Luan Nguyen, Rutgers University
2. What was the name of that
restaurant?
• I went there with Julia
• We had dinner
• It was pouring rain
Some Sources of helpful data
“With Julia”: Calendar, email, text
“Restaurant”: Check-ins, cell phone GPS logs
“Restaurant”: Credit Card statements
“Pouring rain”: Historical Weather reports
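The restaurant question above can be read as a join across sources on the day of the event. The following is a minimal sketch of that idea, not the system described in these slides: every record, merchant name, and field name is hypothetical and chosen only for illustration.

```python
from datetime import date

# Hypothetical records, one collection per data source.
calendar = [
    {"day": date(2013, 3, 8), "with": "Julia", "what": "dinner"},
    {"day": date(2013, 3, 15), "with": "Julia", "what": "dinner"},
    {"day": date(2013, 3, 12), "with": "Sam", "what": "dinner"},
]
weather = {
    date(2013, 3, 8): "clear",
    date(2013, 3, 12): "heavy rain",
    date(2013, 3, 15): "heavy rain",
}
card_statement = [
    {"day": date(2013, 3, 8), "merchant": "Trattoria Uno", "category": "restaurant"},
    {"day": date(2013, 3, 12), "merchant": "Noodle Bar", "category": "restaurant"},
    {"day": date(2013, 3, 15), "merchant": "Chez Exemple", "category": "restaurant"},
    {"day": date(2013, 3, 15), "merchant": "Parking Garage", "category": "parking"},
]


def restaurants_matching(person, condition):
    """Return restaurant charges on days with a dinner with `person` and matching weather."""
    # "With Julia" + "we had dinner": days taken from the calendar.
    days = {e["day"] for e in calendar if e["with"] == person and e["what"] == "dinner"}
    # "It was pouring rain": keep only days whose weather report mentions the condition.
    days = {d for d in days if condition in weather.get(d, "")}
    # "Restaurant": credit card charges on the remaining days.
    return [
        c["merchant"]
        for c in card_statement
        if c["day"] in days and c["category"] == "restaurant"
    ]


candidates = restaurants_matching("Julia", "rain")
```

Each clue narrows the candidate set, so no single source has to contain the answer on its own.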
Amélie Marian - Rutgers University
3. The Web
[Figure: the Web as hypertext and a universal library of text and multimedia, alongside personal/private data and social data]
5. We remember our data based
on context clues
• “Serge sent me this file while we were on a
conference call with Alkis”
Skype, Google hangout, email, calendar, filesystem
• “I found this shopping web site while talking to
Tova on Skype. She was wearing a blue dress.”
Skype (+ snapshot), calendar, browser history
• “Are my insurance reimbursements up to date?”
Calendar, insurance account, bank account
6. We also remember data from
our social network
• “Mohan posted this interesting article on CS
education on Facebook, or maybe on Twitter, or
maybe it was Moshe Vardi who posted it”
Facebook, Twitter, browser history
• “What are the books my friends recommended”
Facebook (and comments), Twitter, emails
• “What are the places in Maui that my friends
enjoyed”
Facebook, Twitter, emails, Foursquare
7. Data dimensions
• Follow natural interrogative words:
• what? (content)
• who? (with whom, from whom, to whom,...)
• where? (physical or logical, in the real-world
and in the system)
• when? (time and date, but also what was
happening concurrently, before and after)
• why? (sequence of data/events that are
connected)
• how? (application, author, environment).
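One way to picture these dimensions is as the fields of a single record attached to each piece of data. The sketch below assumes that reading; the class name `ContextRecord`, its field types, and the sample values are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ContextRecord:
    """One data item described along the six interrogative dimensions."""
    what: str                                      # content
    who: List[str] = field(default_factory=list)   # with whom, from whom, to whom
    where: Optional[str] = None                    # physical or logical location
    when: Optional[datetime] = None                # time and date
    why: List[str] = field(default_factory=list)   # connected data/events
    how: Optional[str] = None                      # application, author, environment

    def known_dimensions(self) -> List[str]:
        """Names of the dimensions that carry a value for this item."""
        names = ("what", "who", "where", "when", "why", "how")
        return [name for name in names if getattr(self, name)]


# Hypothetical example: a file received during a conference call.
item = ContextRecord(what="budget draft", who=["Serge", "Alkis"], how="email")
```

A query can then be phrased as a partial record, with only the dimensions the user remembers filled in.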
8. What is an answer?
• Content
• Email
• File
• Link
• List of objects (insurance reimbursements)
• But also part of the context
• Location
• Meeting participants
• Time
9. Personal Data Context
• Explicit
• Metadata information stored by the file system or
application, e.g., timestamp, GPS location, tags, directory
structure.
• Implicit
• Identified through application-based semantic
information, e.g., email recipients, calendar meeting
participants, check-in location
• Inferred
• Knowledge about the environment of the data collection.
• System environment (Which applications/documents were
opened concurrently with a given document)
• Social environment (Which Facebook members had access to
an event)
• Real world environment (Who was physically in the room –
RFID tags, Skype –, weather).
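The inferred system environment (which applications or documents were opened concurrently with a given document) can be illustrated with an interval-overlap check. This is a minimal sketch under the assumption that open and close times are logged per document; the log format, the function name `opened_concurrently`, and the sample sessions are hypothetical.

```python
from datetime import datetime


def opened_concurrently(sessions, target):
    """Return the names of items whose open interval overlaps any interval of `target`.

    `sessions` is a list of (name, opened_at, closed_at) tuples.
    """
    target_spans = [(start, end) for name, start, end in sessions if name == target]
    overlapping = set()
    for name, start, end in sessions:
        if name == target:
            continue
        # Two intervals overlap when each one starts before the other ends.
        if any(start < t_end and t_start < end for t_start, t_end in target_spans):
            overlapping.add(name)
    return sorted(overlapping)


# Hypothetical log of one morning.
log = [
    ("report.doc", datetime(2013, 5, 2, 10, 0), datetime(2013, 5, 2, 11, 0)),
    ("budget.xls", datetime(2013, 5, 2, 10, 30), datetime(2013, 5, 2, 10, 45)),
    ("skype-call", datetime(2013, 5, 2, 9, 50), datetime(2013, 5, 2, 10, 10)),
    ("notes.txt", datetime(2013, 5, 2, 11, 30), datetime(2013, 5, 2, 12, 0)),
]
context = opened_concurrently(log, "report.doc")
```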
10. Challenges
• Indexing content and context
• Semantic analysis for extracted context
• Data integration
• Identify inferred context
• Store and index as it is produced (system environment)
• Use API calls on-demand or copy information (social and
real-world environment)
• Unified data model
• Content and structure
• Data in context
• Navigation
11. Challenges (2)
• Powerful data tools
• Access and query (possibly remote) sources
• Search based on content and contextual clues
• Approximate matching
• Explore data to get relevant information
• Discover new relevant information
• “It’s been six months, you need to make a dentist
appointment!”
• “You forgot to pay the home insurance bill!”
• “Last time you bought toothpaste was a month ago,
you are probably running out.”
12. Previous results: Unified Structure, Content,
and Metadata Search
EDBT’08, ICDE’08 (demo), EDBT’11, TKDE’12
with Wei Wang, Chris Peery, and Thu D. Nguyen
• Data and query models that unify content and
structure along one dimension
• System metadata seen as a separate dimension
• A unified multi-dimensional scoring mechanism
• IDF-based scores for each dimension
• Individual dimension scores easily combined
• TF scores to break ties
• Query processing algorithms and index structures
to score and rank answers efficiently
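The scoring bullets above can be illustrated with a small ranking routine. The slides do not say how the per-dimension IDF scores are combined, so this sketch assumes a plain sum, with the TF score used only to order answers whose combined IDF scores are equal. The function name `rank_answers`, the tuple layout, and all scores are hypothetical.

```python
def rank_answers(answers):
    """Rank answers by summed per-dimension IDF score, breaking ties with TF.

    Each answer is a tuple (name, idf_by_dimension, tf_score).
    Summing the dimension scores is an assumption made for this sketch.
    """
    return sorted(
        answers,
        key=lambda a: (sum(a[1].values()), a[2]),
        reverse=True,
    )


# Hypothetical answers scored on three dimensions.
answers = [
    ("a.txt", {"content": 0.5, "structure": 0.5, "metadata": 0.5}, 0.2),
    ("b.txt", {"content": 1.0, "structure": 0.25, "metadata": 0.25}, 0.7),
    ("c.txt", {"content": 0.5, "structure": 0.25, "metadata": 0.25}, 0.9),
]
ranking = [name for name, _, _ in rank_answers(answers)]
```

Here `a.txt` and `b.txt` have the same combined IDF score, so the TF score decides their order, while `c.txt` ranks last despite having the highest TF score.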
13. Unified Structure and Content
Target file: Halloween party pictures taken at home where someone
wears a witch costume
//Home*.//“Halloween” and .//“witch”+
[Figure: query tree with nodes root, Home, “Halloween”, and “witch”, with the file boundary marked]
14. Unified IDF Score
For a unified data tree T, a path query PQ, and a file
F, we define:
• IDF Score
\[
\mathrm{score}_{idf}(PQ) = \frac{\log \dfrac{N}{|\mathit{matches}(T, PQ)|}}{\log N}
\]
where N is the total number of files, and matches(T, PQ) is the
set of files that match PQ in T.
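A short numeric sketch of the IDF score defined above may help: a query matched by a single file scores 1, and a query matched by every file scores 0. The function name `idf_score` and the restriction to at least one match and more than one file are assumptions of this sketch, since the slide does not say how an empty match set is handled.

```python
import math


def idf_score(total_files, matching_files):
    """Normalized IDF: log(N / |matches|) / log(N)."""
    if total_files <= 1:
        raise ValueError("need more than one file so that log(N) is non-zero")
    if not 1 <= matching_files <= total_files:
        raise ValueError("the number of matches must be between 1 and N")
    return math.log(total_files / matching_files) / math.log(total_files)


rare = idf_score(1000, 1)         # query matched by a single file
common = idf_score(1000, 1000)    # query matched by every file
middle = idf_score(1000, 10)      # log(100) / log(1000), about 2/3
```

Dividing by log N keeps the score between 0 and 1, which makes scores from different dimensions comparable before they are combined.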
15. Case Study
Target file: Electronic version of the novel SeaWolf by Jack London
• Date: 26 Feb 07
• File Extension: .txt
• Directory: Personal/Ebook/Novel/JackLondon
Content and filtering Query
• Keywords: sea, wolf, jack, london
• Directory: /JackLondon/Ebooks
• Result: target file does not appear in result
Approximate Query
• Keywords: sea, wolf, jack, london
• Directory: /JackLondon/Ebooks
• Result: target file at Rank 3
Content and filtering Query
• Keywords: sea, wolf, jack, london
• Date: 19 Feb 07; type: pdf
• Directory: /JackLondon/Ebooks
• Result: target file does not appear in result
Approximate Query
• Keywords: sea, wolf, jack, london
• Date: 19 Feb 07; type: pdf
• Directory: /JackLondon/Ebooks
• Result: target file at Rank 2
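The case study shows why filtering fails when the user misremembers details: the remembered directory, date, and file type each differ slightly from the target file. The sketch below contrasts a strict filter with a toy relaxation score on the metadata from this slide. The relaxation rules (per-component directory overlap, a 30-day date window, exact extension match, equal weights) are assumptions made for illustration and are not the scoring used in the cited work.

```python
from datetime import date


def parts(path):
    """Split a directory path into its non-empty components."""
    return [p for p in path.split("/") if p]


def strict_match(file, query):
    """Filtering semantics: every remembered condition must hold exactly."""
    f, q = parts(file["directory"]), parts(query["directory"])
    contiguous = any(f[i:i + len(q)] == q for i in range(len(f) - len(q) + 1))
    return contiguous and file["date"] == query["date"] and file["ext"] == query["ext"]


def relaxed_score(file, query):
    """Toy approximate score in [0, 1]: partial credit per condition."""
    f, q = parts(file["directory"]), parts(query["directory"])
    dir_score = sum(1 for p in q if p in f) / len(q)      # shared path components
    days_apart = abs((file["date"] - query["date"]).days)
    date_score = max(0.0, 1.0 - days_apart / 30)          # decays over 30 days
    ext_score = 1.0 if file["ext"] == query["ext"] else 0.0
    return (dir_score + date_score + ext_score) / 3


# Metadata of the target file and the misremembered query from the case study.
target = {"directory": "Personal/Ebook/Novel/JackLondon",
          "date": date(2007, 2, 26), "ext": "txt"}
query = {"directory": "/JackLondon/Ebooks",
         "date": date(2007, 2, 19), "ext": "pdf"}

found_by_filter = strict_match(target, query)
approximate = relaxed_score(target, query)
```

The strict filter rejects the target file, while the relaxed score stays above zero, so the file can still be ranked against other candidates.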
16. Conclusions
• First step towards an automated Personal Data
Assistant
• Looks at data and its context
• Gathers personal data from remote sources
• Cloud applications, social networks, emails, phone
logs, financial accounts, friends’ public data,…
• Integrates data in a unified data model
• Based on natural questions
• Provides search and discovery capabilities
• Beyond keyword search
• Context-aware
Funded by a Google Research Award