Dunbar

Published in: Technology, Education


  1. 1. Sudheendra Hangal ( [email_address] )
  2. 2. Motivation <ul><li>Gigabytes of email… free </li></ul><ul><ul><li>“Never delete anything - Archive!” </li></ul></ul><ul><li>Eventually 50 years' worth of email per person </li></ul><ul><li>(How) Can we make sense of our lives using this huge corpus? </li></ul>[Timeline figure with events: Cracked SAT, Girlfriend!, Trip to Hawaii, Breakup, Wedding, Son born; years marked: 1985, 1990, 1991, 1992, 1995, 2000]
  3. 3. Possibilities <ul><li>In-situ relationship extraction </li></ul><ul><li>Mining email to form a micro-diary </li></ul><ul><ul><li>Personal, group, </li></ul></ul><ul><li>View attachments as a file system </li></ul>
  4. 4. Weighted Social Graph <ul><li>Weight social graph edges </li></ul><ul><ul><ul><li>E.g. Query all friends + “close friends of close friends” </li></ul></ul></ul><ul><ul><ul><li>Trust-rank for social graphs </li></ul></ul></ul><ul><li>Shortest distance ≠ number of links on path </li></ul><ul><ul><li>Path weight: w<sub>1</sub> × w<sub>2</sub> × … × w<sub>n</sub> </li></ul></ul><ul><li>In-situ inference </li></ul><ul><ul><li>Sent emails </li></ul></ul><ul><ul><li>Phone calls, social network activity, co-authorship, length of association </li></ul></ul>
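The idea on slide 4 that distance depends on the product of edge weights, not on hop count, can be sketched as follows. This is a minimal illustration, not the Dunbar implementation: it assumes weights lie in (0, 1] with higher meaning a closer tie, and the function name `strongest_path`, the graph, and the weights are all hypothetical. Maximizing a product of weights is the same as minimizing the sum of `-log(w)`, so ordinary Dijkstra applies.

```python
import heapq
import math

def strongest_path(graph, source, target):
    """Return (strength, path) for the path with the largest weight product.

    graph maps node -> {neighbor: weight}, with weights assumed in (0, 1].
    """
    best = {source: 0.0}   # node -> smallest sum of -log(w) found so far
    prev = {}              # node -> predecessor on the best path
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best.get(node, math.inf):
            continue       # stale heap entry
        for nbr, w in graph.get(node, {}).items():
            new_cost = cost - math.log(w)
            if new_cost < best.get(nbr, math.inf):
                best[nbr] = new_cost
                prev[nbr] = node
                heapq.heappush(heap, (new_cost, nbr))
    if target not in best:
        return 0.0, []     # unreachable
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return math.exp(-best[target]), path

# Hypothetical weights: one weak direct tie versus two strong ties.
graph = {
    "me": {"alice": 0.9, "carol": 0.2},
    "alice": {"me": 0.9, "carol": 0.8},
    "carol": {"me": 0.2, "alice": 0.8},
}
strength, path = strongest_path(graph, "me", "carol")
```

With these made-up weights the two-hop route through `alice` (0.9 × 0.8) beats the direct link (0.2), which is the sense in which the closest contact need not be the one with the fewest links.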
  5. 5. Attachment Wall <ul><li>Interesting material in attachments </li></ul><ul><li>Pictures, Documents, Presentations </li></ul><ul><li>PicLens “photo wall” for display </li></ul><ul><li>Interlace with text? </li></ul>
  6. 6. Life Browsing [Figure with categories: Work, Family, Travel, Finances, Health, Stuff bought on Amazon]
  7. 7. What we have done <ul><li>Connect to IMAP server, fetch messages </li></ul><ul><li>Cluster similar messages </li></ul><ul><ul><li>k-means/CLUTO </li></ul></ul><ul><ul><li>Descriptive features for each cluster </li></ul></ul><ul><li>1- and 2-grams for each week/month/year (ranked by TF-IDF) </li></ul>[Pipeline diagram: Preprocessing, Clustering, TF-IDF scoring, Visualization]
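The last bullet of slide 7, ranking 1- and 2-grams per time period by TF-IDF, could look roughly like the sketch below. It is an assumption-laden illustration and not the authors' code: each period is treated as one "document", the tokenizer is a simple regular expression, and `ngrams`, `top_terms_by_period`, and the sample messages are hypothetical.

```python
import math
import re
from collections import Counter

def ngrams(text, stop_words=frozenset()):
    """Yield unigrams and bigrams from lowercased text.

    Stop words are dropped before bigrams are formed, so a bigram may
    span a removed word (a simplification).
    """
    tokens = [t for t in re.findall(r"[a-z0-9']+", text.lower())
              if t not in stop_words]
    yield from tokens
    yield from (" ".join(pair) for pair in zip(tokens, tokens[1:]))

def top_terms_by_period(messages, top_n=3, stop_words=frozenset()):
    """messages: iterable of (period_label, text) pairs."""
    tf = {}                                  # period -> Counter of term counts
    for period, text in messages:
        tf.setdefault(period, Counter()).update(ngrams(text, stop_words))
    df = Counter()                           # term -> number of periods containing it
    for counts in tf.values():
        df.update(counts.keys())
    n_periods = len(tf)
    ranked = {}
    for period, counts in tf.items():
        scores = {t: c * math.log(n_periods / df[t]) for t, c in counts.items()}
        # Highest score first; ties broken alphabetically for determinism.
        ranked[period] = sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
    return ranked

# Made-up messages, bucketed by month.
messages = [
    ("2004-01", "trip to hawaii booked"),
    ("2004-01", "hawaii flights"),
    ("2004-02", "project report due"),
    ("2004-02", "trip report"),
]
ranked = top_terms_by_period(messages)
```

A term such as "trip" that appears in every period gets an IDF of zero and drops out of the ranking, while terms concentrated in one period rise to the top.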
  8. 8. Example Dataset <ul><li>Our own email </li></ul><ul><ul><li>Text only, sans attachments, no stemming </li></ul></ul><ul><li>E.g. 7,402 emails sent by Sudheendra from 2004 to 2008 </li></ul><ul><ul><li>3,994,751 tokens </li></ul></ul><ul><ul><li>984,421 stop words (2,302 distinct) </li></ul></ul><ul><ul><li>1,729,621 dictionary words (10,223 distinct) </li></ul></ul><ul><ul><li>129,229 distinct unigrams and 345,869 distinct bigrams </li></ul></ul>
  9. 9. Demo
  10. 10. Learnings so far <ul><li>Phrases more useful </li></ul><ul><ul><li>Fix: Remove unigrams that are dictionary words </li></ul></ul><ul><li>Need to avoid cluster pollution </li></ul><ul><ul><li>(tried 1-50 clusters) </li></ul></ul><ul><ul><li>Limit cluster diameter </li></ul></ul><ul><li>Tune clustering, automatically find k </li></ul><ul><li>Data cleaning issues </li></ul><ul><ul><ul><li>Add HTML tags to stop words </li></ul></ul></ul>
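Two of the fixes on slide 10, removing unigrams that are dictionary words and adding HTML tags to the stop words, can be sketched as a single filter. This is only an illustration: `filter_terms`, the tiny dictionary, and the list of HTML tag names are hypothetical, and dropping any phrase that contains a stop word is an assumption rather than something the slides state.

```python
# Hypothetical sample of HTML tag names that leak into message text.
HTML_TAG_WORDS = {"html", "body", "br", "div", "span", "td", "tr", "font", "nbsp"}

def filter_terms(terms, dictionary, stop_words):
    """Keep phrases and non-dictionary unigrams; drop anything with a stop word."""
    kept = []
    for term in terms:
        words = term.split()
        if any(w in stop_words for w in words):
            continue                     # stop word or HTML residue
        if len(words) == 1 and words[0] in dictionary:
            continue                     # plain dictionary unigram, rarely descriptive
        kept.append(term)
    return kept

dictionary = {"meeting", "trip", "font", "size"}      # stand-in for a real word list
stop_words = {"the", "a", "of"} | HTML_TAG_WORDS
terms = ["meeting", "hawaii trip", "cs345", "br", "font size"]
kept = filter_terms(terms, dictionary, stop_words)
```

Here the phrase "hawaii trip" and the non-dictionary token "cs345" survive, while "meeting" (a dictionary unigram) and the HTML residue are removed.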
  11. 11. Learnings so far (2) <ul><li>Need to tune TF-IDF (normalize/dampen) </li></ul><ul><li>TF-IDF highlights bursty terms </li></ul><ul><ul><li>e.g. course project mates </li></ul></ul>
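One common way to "normalize/dampen" TF-IDF, as slide 11 suggests, is sublinear term-frequency scaling followed by length normalization. The sketch below shows that standard variant; the slides do not say which scheme was intended, so the formula, the function name `damped_tfidf`, and the numbers are assumptions.

```python
import math

def damped_tfidf(counts, df, n_docs):
    """Score terms of one document with damped, L2-normalized TF-IDF.

    counts: term -> raw count in this document
    df:     term -> number of documents containing the term
    Uses (1 + log tf) * log(N / df), then divides by the vector length.
    """
    raw = {t: (1.0 + math.log(c)) * math.log(n_docs / df[t])
           for t, c in counts.items() if c > 0}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    if norm == 0.0:
        return {t: 0.0 for t in raw}
    return {t: v / norm for t, v in raw.items()}

# Made-up counts: a bursty term repeated 100 times versus a rarer, more specific term.
counts = {"project": 100, "hawaii": 10}
df = {"project": 2, "hawaii": 1}
scores = damped_tfidf(counts, df, n_docs=4)
undamped = {t: c * math.log(4 / df[t]) for t, c in counts.items()}
```

With raw counts the bursty term "project" dominates; after damping, the more specific term "hawaii" scores higher, which is the kind of correction the slide asks for.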
  12. 12. Future possibilities (1) <ul><li>Privacy-preserving exchange of contacts </li></ul><ul><ul><li>Get rid of “It’s a small world!” </li></ul></ul><ul><ul><li>Contacts, preferences, reviews, location history </li></ul></ul><ul><li>Group reminiscence </li></ul><ul><ul><li>POMI 2008-2013 (Proposal, Retreats, Photos, Papers, …) </li></ul></ul><ul><li>Visualize relationships with people </li></ul><ul><li>Mine mailing lists </li></ul><ul><ul><li>“Zeitgeist” inside an organization </li></ul></ul>
  13. 13. Future possibilities (2) <ul><li>Large-scale user studies/viral, fun FB app </li></ul><ul><li>More mining techniques </li></ul><ul><ul><li>Length of message, subject, attachments etc </li></ul></ul><ul><ul><li>Identify templates: extract semantic relations, remove boilerplate </li></ul></ul><ul><ul><li>Consider TF-IDF w.r.t. history, not whole corpus </li></ul></ul><ul><ul><li>Better clustering </li></ul></ul><ul><li>Integrate email mine with other personal data: photos, calendar, social networking, chats, SMS </li></ul><ul><li>Automatic interest/expertise extraction </li></ul>
  14. 14. <ul><li>Related Work: </li></ul><ul><ul><li>TheMail (IBM) </li></ul></ul><ul><ul><li>Exploring Enron (Jeff Heer) </li></ul></ul><ul><ul><li>Still wide open </li></ul></ul><ul><li>Acknowledgments: Piyush Agarwal, Anand Rajaraman, Jeff Ullman </li></ul>
  15. 15. <ul><li>Try Dunbar on your own email! </li></ul><ul><ul><li>http://clazz.stanford.edu:8080/dunbar </li></ul></ul><ul><ul><li>Soon: Privacy-preserving contacts exchange via PCBs </li></ul></ul>
  16. 16. Backup Slides
  17. 17. Data mining E-mail <ul><li>Rich set of data mining techniques applicable </li></ul><ul><ul><li>Clustering of emails based on topic </li></ul></ul><ul><ul><ul><li>Travel, Kids, Projects, Stuff bought at Amazon, … </li></ul></ul></ul><ul><ul><ul><li>Approximates the user's file-to-folder action </li></ul></ul></ul><ul><ul><li>Important words/phrases using TF-IDF </li></ul></ul><ul><ul><li>Shingling/min-hashing/LSH to detect lineage </li></ul></ul><ul><ul><ul><li>Replies, forwards, cut-and-paste </li></ul></ul></ul><ul><ul><li>Association rules </li></ul></ul><ul><ul><ul><li>E.g. frequent co-recipients </li></ul></ul></ul><ul><ul><ul><li>Time to reply </li></ul></ul></ul>
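The shingling and min-hashing bullet on slide 17 can be illustrated with a small sketch that estimates how much text two messages share, for example a reply that quotes its original. This is a generic version of the technique under stated assumptions: 3-word shingles, 64 salted BLAKE2b hashes standing in for random permutations, and hypothetical function names and sample messages. The LSH banding step is omitted.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of k-word shingles of lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """Keep the minimum hash value under each of num_hashes salted hashes."""
    if not shingle_set:
        raise ValueError("cannot build a signature from an empty shingle set")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")   # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big")
            for s in shingle_set))
    return signature

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

# Made-up messages: an original, a reply quoting it, and an unrelated mail.
original = "let us meet on friday to discuss the course project report"
reply = "sounds good > let us meet on friday to discuss the course project report"
unrelated = "your amazon order has shipped and will arrive on monday"

sig_original = minhash_signature(shingles(original))
sim_reply = estimated_similarity(sig_original, minhash_signature(shingles(reply)))
sim_unrelated = estimated_similarity(sig_original, minhash_signature(shingles(unrelated)))
```

The reply shares most of its shingles with the original, so its estimated similarity is much higher than that of the unrelated message; pairs above some threshold could then be linked as replies, forwards, or cut-and-paste.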
