Integration and Exploration of Connected Personal Digital Traces
1. Integration and Exploration of Connected Personal Digital Traces
Valia Kalokyri, Alex Borgida, Amélie Marian, Daniela Vianna
Rutgers University
2. Personal data is fragmented, heterogeneous
5/19/17 Amélie Marian - Rutgers University - ExploreDB'17 2
3. DigitalSelf Project: Goals
1. Integrate personal data from various heterogeneous sources
2. Design of a unified and intuitive model to link and
represent personal information
3. Group personal data with respect to conceptually
coherent episodes – Creation of a Personal Knowledge
Base
4. Search tools for digital memories
5. Design of interactive tools to provide users with narrative
views of their digital memories.
4. PIM – Personal Information Management
• Traditional PIM Systems – focus on objects relationships
• Haystack
• Semex
• OntoPim
• …
• We focus on a narrative of events
• Exploration of connections between events – or Personal Data
Traces (PDTs)
5. Background
• Research in psychology:
Episodic memory – memory of autobiographical events
• It is the collection of past personal experiences that occurred at a particular time
and place. (times, places, associated emotions, and other contextual who, what,
when, where, why knowledge that can be explicitly stated/conjured)
• A natural way to remember past events is by pertinent contextual
information; answers to:
• Who, What, When, Where, Why, How (w5h)
• Derived from the "frame" structure of the events that involve the
digital documents
6. Integrating Personal Data
• Create an infrastructure to retrieve and store personal data
• Gather content from several online services (via APIs, IMAP)
• Social data - Facebook, Twitter, LinkedIn
• Geolocation data - Foursquare
• Email - Gmail, or any other email
• Calendars - Google Calendar
• Personal files - local file system, Google Drive, Dropbox
• Web browsing histories - Chrome, Firefox
• Apply entity resolution – who, where dimension
IIWeb’14 paper; open-source code on GitHub
7. Contributions
• High-level description of episodic scripts
• Group events (PDTs) to connect them into a memory episode
• Scripts: prototypical plans, “a predetermined, stereotyped sequence of
actions that defines a well-known situation”. (Schank and Abelson)
• Heuristic algorithm to find and combine PDTs into scripts
• Case study: Eating out script
• Script description
• Evaluation with user data
Goal: Organize & summarize PDTs into episodes
Allow users to explore, understand and learn from their actions
8. Grouping Data into Coherent Episodes
• Provide a narrative by making connections between PDTs
• Example - Going out to eat at a restaurant
• Script would provide description of possible “event flows” (arrange
where & when to go, make reservation, call a cab/uber, go to the
restaurant, order food, [...], pay, [...], return, [...])
• Emails concerning a dinner
• OpenTable reservation at a restaurant
• Foursquare checkin with photos
• Credit card payment
Narrative for going out to a
dinner
10. Algorithm for instantiating script instances
1. Create a list of “trigger words/phrases”, whose occurrence
indicates that a document has something to do with an
instance of a particular script type.
• Start with goal events/subscripts - AttendEatingOut
• E.g. “Eat”, “eat out” and all their synonyms and hyponyms (WordNet,
ConceptNet5)
• Consider the w5h participants of the goal event (VerbNet, FrameNet)
• E.g. “restaurant” is a where value of “eat” for Eating_Out
• The result is a list of words to search for
• E.g. breakfast, lunch, dinner, restaurant and its hyponyms etc.
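The trigger-word search above can be sketched as a simple matcher. This is a minimal illustration, not the authors' implementation: the `TRIGGERS` set here is a tiny hand-built stand-in for the lists the deck derives from WordNet/ConceptNet5 synonyms and hyponyms.

```python
import re

# Hypothetical mini-lexicon standing in for the WordNet/ConceptNet5-derived
# trigger list ("eat", its synonyms, "restaurant" and its hyponyms, ...).
TRIGGERS = {
    "eat", "eat out", "breakfast", "lunch", "dinner", "brunch",
    "restaurant", "bistro", "diner", "pizzeria",
}

def matches_script(text: str, triggers=TRIGGERS) -> bool:
    """Return True if any trigger word/phrase occurs in the document text."""
    lowered = text.lower()
    # Word-boundary matching so "eat" does not fire on "great".
    return any(re.search(r"\b" + re.escape(t) + r"\b", lowered)
               for t in triggers)

print(matches_script("Want to grab lunch at the new bistro tomorrow?"))  # True
print(matches_script("Quarterly report attached."))                      # False
```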
11. Algorithm for instantiating script instances
2. All retrieved PDTs are preprocessed:
• Entity extraction (Stanford NLTK)
• Who, Where
• Time extraction: explicating/disambiguating temporal information
• E.g., “tomorrow”, “this Wednesday” are resolved to absolute dates
• Technologies used: Stanford NLTK, Python Dateparser, our own regular expressions
• Group certain kinds of documents into single individuals
• E.g., email threads, Facebook messages, etc.
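The time-explication step above can be sketched with the standard library alone. This is an illustrative stand-in, assuming the document's own timestamp is available as the reference date; the deck's actual pipeline uses Stanford tools, Python Dateparser, and regular expressions, and handles far more expressions than the two cases here.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve(expr: str, ref: date) -> date:
    """Map a relative date expression to an absolute date, given a reference."""
    expr = expr.lower().strip()
    if expr == "today":
        return ref
    if expr == "tomorrow":
        return ref + timedelta(days=1)
    if expr.startswith("this "):
        target = WEEKDAYS.index(expr.removeprefix("this "))
        delta = (target - ref.weekday()) % 7   # 0..6 days ahead
        return ref + timedelta(days=delta)
    raise ValueError(f"unsupported expression: {expr!r}")

# Email sent Monday 2017-05-15: "this Wednesday" -> 2017-05-17
print(resolve("this Wednesday", date(2017, 5, 15)))  # 2017-05-17
```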
12. Algorithm for instantiating script instances
3. Each individual leads to the creation of a candidate instance
of the script (or one of its subscripts)
4. Fill some of the script instance sub-properties
• E.g. a restaurant charge in a credit card bill provides evidence for the
attendEatingOut subscript, filling whereEatingOccurred,
whenEatingOccurred, and one whoAttended.
• A corresponding Facebook checkin could provide the remaining
whoAttended values
13. Algorithm for instantiating script instances
5. Score each instance based on the strength of the evidence it
manifests.
• strong evidence:
• Bank statement
• a long email thread mentioning keywords many times and the user participating a
lot in the email exchange
• weak evidence:
• A single email mentioning the word “lunch”
• mild evidence: user sent message, “lunch” in Subject
• null evidence: email from unknown sender
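The evidence tiers above could be turned into numeric scores along these lines. The weights here are purely illustrative assumptions; the deck only names the tiers (strong/mild/weak/null), not their values or the aggregation rule.

```python
# Hypothetical tier weights; the slides name the tiers but not the numbers.
EVIDENCE_WEIGHTS = {
    "strong": 0.9,   # e.g. a bank statement, or a long keyword-rich thread
    "mild":   0.5,   # e.g. user-sent message with "lunch" in the Subject
    "weak":   0.2,   # e.g. a single email mentioning "lunch"
    "null":   0.0,   # e.g. email from an unknown sender
}

def score_instance(evidence_tiers: list) -> float:
    """Score a candidate instance by its strongest piece of evidence."""
    return max((EVIDENCE_WEIGHTS[t] for t in evidence_tiers), default=0.0)

print(score_instance(["weak", "mild"]))  # 0.5
print(score_instance([]))                # 0.0
```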
14. Algorithm for instantiating script instances
6. Merge instances sharing the same/similar “key parts”
• whenEatingOccurred, whereEatingOccurred, and to a lesser
extent, whoAttended.
• The why and what local properties of this script are of secondary
importance (instances of eating pizza need not be merged)
• Merge documents when:
1. “When” property is the same/close
2. The “where”/“who” value is the same, if the tf-idf for the term is low.
• Merge the property fillers; the score becomes 1 − ∏s∈S′ (1 − Score(s)),
where S′ is the set of merged script instances.
• Repeat merging as additional subproperties are filled.
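The merged score above is a noisy-OR combination: each extra piece of evidence pushes the combined score toward 1. A direct sketch of the formula, with illustrative input scores:

```python
from math import prod

def combined_score(scores: list) -> float:
    """Merged score 1 - prod(1 - Score(s)) over the merged instances S'."""
    return 1.0 - prod(1.0 - s for s in scores)

# E.g. a credit-card charge (0.8) merged with a matching checkin (0.5):
print(round(combined_score([0.8, 0.5]), 6))  # 0.9
```

A single instance keeps its own score (`combined_score([s]) == s`), and merging never lowers the score, which matches the intuition that corroborating PDTs strengthen an episode.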
15. Case Study: Eating Out
• Goal: Find, among users’ personal data, instances of eating at
various restaurants.
• Three users: Alice, Bob, Charlie
• Six months of data
• Four types of sources:
• messaging (e.g., email, Facebook messenger, Hangouts)
• calendaring (e.g. Google Calendar)
• financial transactions (e.g. bank and credit card statements)
• location services (e.g. Foursquare, Facebook checkins).
16. Relevant objects to the Eating_out script
Note: the fact that an object is relevant does not mean that it indeed was part of an Eating Out event.
17. Golden set
• Identifying the golden set a posteriori is difficult: we cannot
expect our users to accurately remember every single instance of
Eating Out.
• Every user carefully went over the six months of recorded PDTs
and identified all data that pertained to Eating Out events.
18. Evaluation Metrics
• Percentage of events retrieved: percentage of all user-
identified Eating Out events retrieved by our scripts, as a proxy
for Recall.
• Overall Precision: measured as the percentage of identified
script instances that correspond to actual Eating Out events.
• Precision@k: the percentage of the top-k (based on merged
scores) script instances that correspond to actual Eating Out
events.
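The three metrics above can be sketched directly. The ranked relevance list and the counts below are made-up illustrations, not the paper's data:

```python
def precision_at_k(ranked_relevance: list, k: int) -> float:
    """Fraction of the top-k ranked instances that are actual events."""
    top = ranked_relevance[:k]
    return sum(top) / len(top)

def overall_precision(ranked_relevance: list) -> float:
    """Fraction of all identified instances that are actual events."""
    return sum(ranked_relevance) / len(ranked_relevance)

def pct_events_retrieved(n_retrieved_events: int, n_golden_events: int) -> float:
    """Proxy for recall: share of golden-set events our scripts retrieved."""
    return n_retrieved_events / n_golden_events

# Instances ranked by merged score; True = actual Eating Out event.
ranked = [True, True, False, True, False]
print(precision_at_k(ranked, 3))     # 0.6666666666666666
print(overall_precision(ranked))     # 0.6
print(pct_events_retrieved(30, 40))  # 0.75
```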
23. Implementation Challenges
• Evaluation
• Personal data is sensitive
• Data retrieval is complex
• IRB - privacy
• Misclassified “restaurants” in bank statements
• Used the Google Maps API to correct them; partial success
• Need for NLP analysis
• E.g., we miss "cannot make it for dinner”
• Personalization issues: each person uses PDTs consistently but
very differently (e.g. shared bank accounts)
24. Conclusions and Future Work
• First step towards creation of a PKB for personal data
exploration
• Future work:
• Extensible approach for implementing script instantiation from PDTs.
• declarative description of scripts
• declarative description of clues/evidence
• declarative description of information to extract from each relevant PDT
• Script personalization
• Extended user experiments
• Visualization tools
Digital data is inherently contextual due to various forms of metadata.
Idea of narrative is supported by the notion of “episodic memory” (Tulving, 2002)
As proof of concept, we implemented our scripts for the Eating Out scenario.
Performing experiments on Personal Data is not a trivial endeavor due to the sensitive nature of the data and the difficulty in getting personal data sets for research purposes.
Mint.com is a free, web-based personal financial management service
Relevance was computed using:
-keyword based scoring for Emails/Messaging, Calendar
-metadata categories stored with the original data items for Financial and location data
*Verified and corrected information by using the Google Maps API.
---Alice may have discussed a restaurant in messages with friends but not gone there, or Charlie may have bought food at a business categorized both as a supermarket and a restaurant.
The 3 users have very different patterns, as expected due to the highly individual nature of user behavior.
Charlie shares a credit card account with her spouse, therefore some of the 125 relevant financial data objects are not from her credit card (only 49 are)
To evaluate the quality of the memory retrieval process using our scripts, we need to identify all the instances of Eating Out for each user, aka a golden set.
Without a perfect golden set, we cannot accurately evaluate Recall.
The figure shows the percentage of identified events retrieved by our script for our three users.
A first observation is that the results clearly reflect the different behavior of the three users.
Alice and Bob use email/messaging to make restaurant plans in a majority of cases, but do not always have a financial record of the transaction.
In contrast, Charlie makes very few plans by email/messaging and rarely enters them in her calendar, but most of her outings result in financial transactions.
Results show that looking at several sources of information is critical to identifying user script instances: the percentage of events retrieved increases with the number of sources considered. Moreover, any approach to retrieving user memories of events must consider several sources to adapt to the wide variety of user behaviors.
The quality of the information given by different sources varies:
-Financial data tends to be of high quality (false positives: the user ordered takeout, or bought groceries at a business doubling as a restaurant)
-email/messaging data, which depends on keyword matching for relevance, tends to be of lower quality
Need for merging information from multiple sources of personal data to improve the identification of script instances, and for considering a variety of Personal Information sources to account for the different individual behaviors of users.
However, retrieval systems typically return results in a ranked order, and users are expecting the first few results to be the most relevant.
Alice: her financial data is of very high quality, but it only exists for 67% of her Eating Out events.
By combining Email/Messaging and Financial data information, she is able to identify her Eating Out events with high accuracy for all values of k.
Charlie: similar pattern, lower accuracy for Email/Messaging
Bob: his financial data is not as accurate as expected; the categorization provided by the financial provider is inaccurate for several of his transactions.
When information from multiple sources is combined, precision, especially for low values of k (the first instances returned to users), is higher than when sources are considered individually.