Integration and Exploration of Connected Personal Digital Traces
1. Integration and Exploration of Connected Personal Digital Traces
Valia Kalokyri, Alex Borgida, Amélie Marian, Daniela Vianna
Rutgers University
2. Personal data is fragmented, heterogeneous
5/19/17 Amélie Marian - Rutgers University - ExploreDB'17 2
3. DigitalSelf Project: Goals
1. Integrate personal data from various heterogeneous sources
2. Design of a unified and intuitive model to link and
represent personal information
3. Group personal data with respect to conceptually
coherent episodes – Creation of a Personal Knowledge
Base
4. Search tools for digital memories
5. Design of interactive tools to provide users with narrative
views of their digital memories.
4. PIM – Personal Information Management
• Traditional PIM Systems – focus on objects relationships
• Haystack
• Semex
• OntoPim
• …
• We focus on a narrative of events
• Exploration of connections between events – or Personal Data
Traces (PDTs)
5. Background
• Research in psychology:
Episodic memory – memory of autobiographical events
• It is the collection of past personal experiences that occurred at a particular time
and place. (times, places, associated emotions, and other contextual who, what,
when, where, why knowledge that can be explicitly stated/conjured)
• A natural way to remember past events is by pertinent contextual
information; answers to:
• Who, What, When, Where, Why, How (w5h)
• Derived from the "frame" structure of the events that involve the
digital documents
6. Integrating Personal Data
• Create an infrastructure to retrieve and store personal data
• Gather content from several online services (via APIs, IMAP)
• Social data - Facebook, Twitter, LinkedIn
• Geolocation data - Foursquare
• Email - Gmail, or any other email
• Calendars - Google Calendar
• Personal files - local file system, Google Drive, Dropbox
• Web browsing histories - Chrome, Firefox
• Apply entity resolution – who, where dimension
IIWeb’14 paper; open-source code on GitHub
7. Contributions
• High-level description of episodic scripts
• Group events (PDTs) to connect them into a memory episode
• Scripts: prototypical plans, “a predetermined, stereotyped sequence of
actions that defines a well-known situation”. (Schank and Abelson)
• Heuristic algorithm to find and combine PDTs into scripts
• Case study: Eating out script
• Script description
• Evaluation with user data
Goal: Organize & summarize PDTs into episodes
Allow users to explore, understand and learn from their actions
8. Grouping Data into Coherent Episodes
• Provide a narrative by making connections between PDTs
• Example - Going out to eat at a restaurant
• Script would provide description of possible “event flows” (arrange
where & when to go, make reservation, call a cab/uber, go to the
restaurant, order food, [...], pay, [...], return, [...])
• Emails concerning a dinner
• OpenTable reservation at a restaurant
• Foursquare checkin with photos
• Credit card payment
Narrative for going out to a
dinner
10. Algorithm for instantiating script instances
1. Create a list of “trigger words/phrases”, whose occurrence
indicates that a document has something to do with an
instance of a particular script type.
• Start with goal events/subscripts - AttendEatingOut
• E.g. “Eat”, “eat out” and all their synonyms and hyponyms (WordNet,
ConceptNet5)
• Consider the w5h participants of the goal event (VerbNet, FrameNet)
• E.g. “restaurant” is a where value of “eat” for Eating_Out
• The result is a list of words to search for
• E.g. breakfast, lunch, dinner, restaurant and its hyponyms etc.
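The trigger-word search above can be sketched as a simple matcher. This is a minimal illustration, not the authors' implementation: the `TRIGGERS` set here is a tiny hand-built stand-in for the lists the deck derives from WordNet/ConceptNet5 synonyms and hyponyms.

```python
import re

# Hypothetical mini-lexicon standing in for the WordNet/ConceptNet5-derived
# trigger list ("eat", its synonyms, "restaurant" and its hyponyms, ...).
TRIGGERS = {
    "eat", "eat out", "breakfast", "lunch", "dinner", "brunch",
    "restaurant", "bistro", "diner", "pizzeria",
}

def matches_script(text: str, triggers=TRIGGERS) -> bool:
    """Return True if any trigger word/phrase occurs in the document text."""
    lowered = text.lower()
    # Word-boundary matching so "eat" does not fire on "great".
    return any(re.search(r"\b" + re.escape(t) + r"\b", lowered)
               for t in triggers)

print(matches_script("Want to grab lunch at the new bistro tomorrow?"))  # True
print(matches_script("Quarterly report attached."))                      # False
```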
11. Algorithm for instantiating script instances
2. All retrieved PDTs are preprocessed:
• Entity extraction (Stanford NLTK)
• Who, Where
• Time extraction: explicating/disambiguating temporal information
• E.g., “tomorrow”, “this Wednesday” are resolved to absolute dates
• Technologies used: Stanford NLTK, Python Dateparser, our own regular expressions
• Group certain kinds of documents into single individuals
• E.g., email threads, Facebook messages, etc.
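The time-explication step above can be sketched with the standard library alone. This is an illustrative stand-in, assuming the document's own timestamp is available as the reference date; the deck's actual pipeline uses Stanford tools, Python Dateparser, and regular expressions, and handles far more expressions than the two cases here.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve(expr: str, ref: date) -> date:
    """Map a relative date expression to an absolute date, given a reference."""
    expr = expr.lower().strip()
    if expr == "today":
        return ref
    if expr == "tomorrow":
        return ref + timedelta(days=1)
    if expr.startswith("this "):
        target = WEEKDAYS.index(expr.removeprefix("this "))
        delta = (target - ref.weekday()) % 7   # 0..6 days ahead
        return ref + timedelta(days=delta)
    raise ValueError(f"unsupported expression: {expr!r}")

# Email sent Monday 2017-05-15: "this Wednesday" -> 2017-05-17
print(resolve("this Wednesday", date(2017, 5, 15)))  # 2017-05-17
```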
12. Algorithm for instantiating script instances
3. Each individual leads to the creation of a candidate instance
of the script (or one of its subscripts)
4. Fill some of the script instance sub-properties
• E.g. a restaurant charge in a credit card bill provides evidence for the
attendEatingOut subscript, filling whereEatingOccurred,
whenEatingOccurred, and one whoAttended.
• A corresponding Facebook checkin could provide the remaining
whoAttended values
13. Algorithm for instantiating script instances
5. Score each instance based on the strength of the evidence it
manifests.
• strong evidence:
• Bank statement
• a long email thread mentioning keywords many times and the user participating a
lot in the email exchange
• weak evidence:
• A single email mentioning the word “lunch”
• mild evidence: user sent message, “lunch” in Subject
• null evidence: email from unknown sender
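The evidence tiers above could be turned into numeric scores along these lines. The weights here are purely illustrative assumptions; the deck only names the tiers (strong/mild/weak/null), not their values or the aggregation rule.

```python
# Hypothetical tier weights; the slides name the tiers but not the numbers.
EVIDENCE_WEIGHTS = {
    "strong": 0.9,   # e.g. a bank statement, or a long keyword-rich thread
    "mild":   0.5,   # e.g. user-sent message with "lunch" in the Subject
    "weak":   0.2,   # e.g. a single email mentioning "lunch"
    "null":   0.0,   # e.g. email from an unknown sender
}

def score_instance(evidence_tiers: list) -> float:
    """Score a candidate instance by its strongest piece of evidence."""
    return max((EVIDENCE_WEIGHTS[t] for t in evidence_tiers), default=0.0)

print(score_instance(["weak", "mild"]))  # 0.5
print(score_instance([]))                # 0.0
```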
14. Algorithm for instantiating script instances
6. Merge instances sharing the same/similar “key parts”
• whenEatingOccurred, whereEatingOccurred, and to a lesser
extent, whoAttended.
• The why and what local properties of this script are of secondary
importance (instances of eating pizza need not be merged)
• Merge documents when:
1. “When” property is the same/close
2. The “where”/“who” value is the same, if the tf-idf for the term is low.
• Merge the property fillers; the score becomes 1 − ∏s∈S′ (1 − Score(s)),
where S′ is the set of merged script instances.
• Repeat merging as additional subproperties are filled.
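The merged score above is a noisy-OR combination: each extra piece of evidence pushes the combined score toward 1. A direct sketch of the formula, with illustrative input scores:

```python
from math import prod

def combined_score(scores: list) -> float:
    """Merged score 1 - prod(1 - Score(s)) over the merged instances S'."""
    return 1.0 - prod(1.0 - s for s in scores)

# E.g. a credit-card charge (0.8) merged with a matching checkin (0.5):
print(round(combined_score([0.8, 0.5]), 6))  # 0.9
```

A single instance keeps its own score (`combined_score([s]) == s`), and merging never lowers the score, which matches the intuition that corroborating PDTs strengthen an episode.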
15. Case Study: Eating Out
• Goal: Find, among users’ personal data, instances of eating at
various restaurants.
• Three users: Alice, Bob, Charlie
• Six months of data
• Four types of sources:
• messaging (e.g., email, Facebook messenger, Hangouts)
• calendaring (e.g. Google Calendar)
• financial transactions (e.g. bank and credit card statements)
• location services (e.g. Foursquare, Facebook checkins).
16. Relevant objects to the Eating_out script
Note: the fact that an object is relevant does not mean that it indeed was part of an Eating Out event.
17. Golden set
• Identifying the golden set a posteriori is difficult: we cannot
expect our users to accurately remember every single instance of
Eating Out.
• Every user carefully went over the six months of recorded PDTs
and identified all data that pertained to Eating Out events.
18. Evaluation Metrics
• Percentage of events retrieved: percentage of all user-
identified Eating Out events retrieved by our scripts, as a proxy
for Recall.
• Overall Precision: measured as the percentage of identified
script instances that correspond to actual Eating Out events.
• Precision@k: the percentage of the top-k (based on merged
scores) script instances that correspond to actual Eating Out
events.
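The three metrics above can be sketched directly. The ranked relevance list and the counts below are made-up illustrations, not the paper's data:

```python
def precision_at_k(ranked_relevance: list, k: int) -> float:
    """Fraction of the top-k ranked instances that are actual events."""
    top = ranked_relevance[:k]
    return sum(top) / len(top)

def overall_precision(ranked_relevance: list) -> float:
    """Fraction of all identified instances that are actual events."""
    return sum(ranked_relevance) / len(ranked_relevance)

def pct_events_retrieved(n_retrieved_events: int, n_golden_events: int) -> float:
    """Proxy for recall: share of golden-set events our scripts retrieved."""
    return n_retrieved_events / n_golden_events

# Instances ranked by merged score; True = actual Eating Out event.
ranked = [True, True, False, True, False]
print(precision_at_k(ranked, 3))     # 0.6666666666666666
print(overall_precision(ranked))     # 0.6
print(pct_events_retrieved(30, 40))  # 0.75
```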
23. Implementation Challenges
• Evaluation
• Personal data is sensitive
• Data retrieval is complex
• IRB - privacy
• Misclassified “restaurants” in bank statements
• Used the Google Maps API to correct them; partial success
• Need for NLP analysis
• E.g., we miss "cannot make it for dinner”
• Personalization issues: each person uses PDTs consistently but
very differently (e.g. shared bank accounts)
24. Conclusions and Future Work
• First step towards creation of a PKB for personal data
exploration
• Future work:
• Extensible approach for implementing script instantiation from PDTs.
• declarative description of scripts
• declarative description of clues/evidence
• declarative description of information to extract from each relevant PDT
• Script personalization
• Extended user experiments
• Visualization tools
Digital data is inherently contextual due to various forms of metadata.
Idea of narrative is supported by the notion of “episodic memory” (Tulving, 2002)
As proof of concept, we implemented our scripts for the Eating Out scenario.
Performing experiments on Personal Data is not a trivial endeavor due to the sensitive nature of the data and the difficulty in getting personal data sets for research purposes.
Mint.com is a free, web-based personal financial management service
Relevance was computed using:
-keyword based scoring for Emails/Messaging, Calendar
-metadata categories stored with the original data items for Financial and location data
*Verified and corrected information by using the Google Maps API.
---Alice may have discussed a restaurant in messages with friends but not gone there, or Charlie may have bought food at a business categorized both as a supermarket and a restaurant.
The 3 users have very different patterns, as expected due to the highly individual nature of user behavior.
Charlie shares a credit card account with her spouse, therefore some of the 125 relevant financial data objects are not from her credit card (only 49 are)
To evaluate the quality of the memory retrieval process using our scripts, we need to identify all the instances of Eating Out for each user, aka a golden set.
Without a perfect golden set, we cannot accurately evaluate Recall.
The figure shows the percentage of identified events retrieved by our script for our three users.
A first observation is that the results clearly reflect the different behavior of the three users.
Alice and Bob use email/messaging to make restaurant plans in a majority of cases, but do not always have a financial record of the transaction.
In contrast, Charlie makes very few plans by email/messaging and rarely enters them in her calendar, but most of her outings result in financial transactions.
Results show that looking at several sources of information is critical to identifying user script instances: the percentage of events retrieved increases with the number of sources considered. Moreover, any approach to retrieving user memories of events must consider several sources to adapt to the wide variety of user behaviors.
The quality of the information given by different sources varies:
-Financial data tends to be of high quality (false positives: the user ordered takeout, or bought groceries at a business doubling as a restaurant)
-email/messaging data, which depends on keyword matching for relevance, tends to be of lower quality
Need for merging information from multiple sources of personal data to improve the identification of script instances, and for considering a variety of Personal Information sources to account for the different individual behaviors of users.
However, retrieval systems typically return results in a ranked order, and users are expecting the first few results to be the most relevant.
Alice: her financial data is of very high quality, but it only exists for 67% of her Eating Out events.
By combining Email/Messaging and Financial data information, she is able to identify her Eating Out events with high accuracy for all values of k.
Charlie: similar pattern, lower accuracy for Email/Messaging
Bob: his financial data is not as accurate as expected; the categorization provided by the financial provider is inaccurate for several of his transactions.
When information from multiple sources is combined, precision, especially for low values of k (the first instances returned to users), is higher than when sources are considered individually.