Course project presentation for COMP5331: implementation of social profile matching algorithm, research on improvements for it, testing data

  • As shown in the previous slide, Chinese provides broad opportunities for homonymy. However, even in comparatively small Russia there is a person with the same first name and surname as me.
  • Performance has not been tested because the test data set is small
  • However, Jan demonstrated the problem of missing data.
  • Transcript

    • 1. Public profile matching
        Urzhumtcev Oleg, SkolTech, ITMO (1)
        Instructor: Raymond Chi-Wing Wong, HKUST
        27 November 2012, Hong Kong
        (1) http://en.qdinvest.ru
    • 2. ProblemGeneralizationApproach studiedPropositionTestingConclusion 2
    • 3. ProblemThere are many objects in the worldSome of them are named entities, among them — peopleThey may have different representations (profiles) 3
    • 4. Problem 4
    • 5. Problem 5
    • 6. GeneralizationPart of summarization problemPrecisely — first step: accurate data collectionCaused by homonimy 6
    • 7. ProblemGeneralizationApproach studiedPropositionTestingConclusion 7
    • 8. Approaches• User Identification Across Multiple Social Networks – Jan Vosecky, Dan Hong, Vincent Y. Shen• Features: • Direct matching (nearest neighbor) • Vector-Based Comparison Algorithm • Fuzzy string field matching • Weighted parameters 8
    • 9. Approaches: Vector-based comparison
        First profile (Twitter):
            profile {
              :id = "50b2c847e3b24cf21400000"
              :username = "darikcr"
              :type "twitter"
              :source "http://twitter.com/darikcr"
              :name "NetBUG"
              :lang "ru-RU"
              :birthday nil
              :email nil
              :about "Linguist, programmer, also have some XPrience in making startups. Groaning for active shiny people to do business together"
              :status "Two new metro stations in St. Petersburg, 'Bukharestskaya' and 'Mezhdunarodnaya', will open on 27 December. ... http://vk.cc/15wd4M" (translated from Russian)
              :tags nil
            }
        Second profile (Facebook):
            profile {
              :id = "50b2c843e3b24cf214000005"
              :username = "oleg.urzhumtsev"
              :type "facebook"
              :source "http://facebook.com/oleg.urzhumtsev"
              :name "Oleg Urzhumtcev"
              :alias "NetBUG"
              :lang "ru-RU"
              :birthday 1989/10/19
              :email "darikcr@gmail.cm"
              :about ""
              :status "Checked in at HKUST Bus Station"
              :tags nil
              :university ["HKUST" "SkolTech" "ITMO" "SPbSU"]
              :job ["ProMT JSC" "Israeli Embassy"]
              :interests ["Linguistics" "motoschool" "programming" "startups"]
            }
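A minimal sketch of what a vector-based comparison of two such profiles might look like, assuming each profile is a plain Ruby Hash. The helper names (`field_score`, `similarity_vector`, `weighted_score`), the per-field scoring rules (case-insensitive equality for scalars, Jaccard overlap for lists) and the weights are all assumptions made for illustration; they are not taken from the Vosecky, Hong and Shen paper.

```ruby
require 'set'

# Score one field: nil when either side is missing, Jaccard overlap for
# list values, case-insensitive equality for everything else.
def field_score(a, b)
  return nil if a.nil? || b.nil? || a == '' || b == ''
  if a.is_a?(Array) && b.is_a?(Array)
    sa = a.map { |v| v.to_s.downcase }.to_set
    sb = b.map { |v| v.to_s.downcase }.to_set
    union = sa | sb
    return nil if union.empty?
    (sa & sb).size.to_f / union.size
  else
    a.to_s.downcase == b.to_s.downcase ? 1.0 : 0.0
  end
end

# Build the similarity vector: one score (or nil) per compared field.
def similarity_vector(p1, p2, fields)
  fields.to_h { |f| [f, field_score(p1[f], p2[f])] }
end

# Weighted average over the fields that are present in both profiles.
def weighted_score(vector, weights)
  present = vector.reject { |_, s| s.nil? }
  total_weight = present.keys.sum { |f| weights.fetch(f, 1.0) }
  return 0.0 if total_weight.zero?
  present.sum { |f, s| weights.fetch(f, 1.0) * s } / total_weight
end

twitter  = { username: 'darikcr', name: 'NetBUG', lang: 'ru-RU', email: nil }
facebook = { username: 'oleg.urzhumtsev', name: 'Oleg Urzhumtcev',
             lang: 'ru-RU', email: 'darikcr@gmail.cm' }

fields  = %i[username name lang email]
weights = { username: 3.0, name: 2.0, lang: 0.5, email: 3.0 } # illustrative only

vector = similarity_vector(twitter, facebook, fields)
score  = weighted_score(vector, weights)
```

In this sketch the Twitter name "NetBUG" is never compared with the Facebook alias "NetBUG", because fields are only matched one to one; the two profiles therefore score low even though they belong to the same person.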
    • 10. Approaches: Fuzzy matching (VMN algorithm*)
        #  String pair                          VMN   SDS   SD
        1  "Jan Vosecky", "J Vosecky"           0.66  0.82  2.0
        2  "Jan Vosecky", "Vosecky Jan"         1.0   0.55  5.0
        3  "Jan Vosecky", "Honza vosecky"       0.5   0.36  7.0
        4  "Jan Vosecky", "Robert Vosecky"      0.5   0.55  5.0
        5  "Jan Vosecky", "Jan Smith"           0.5   0.45  6.0
        6  "Jan Vosecky", "Jack Vondracek"      0.0   0.27  8.0
        Table 1. String Match Functions Comparison
        • Partial matching
        • Word swapping tolerance
        * Vosecky, Hong, Shen 2009
    • 11. Approaches: Drawbacks
        1. Suitable for well-intersected profiles
        2. Bad for discovery
        3. No cross-parameter search
    • 12. Approaches: Awareness of missing data
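The two properties listed on this slide, partial matching and word swapping tolerance, can be illustrated with a small token-based matcher. This is a hypothetical sketch and not the published VMN algorithm: the function names and the prefix-based partial score are assumptions. On the six string pairs of Table 1 it gives values that agree with the VMN column up to rounding (about 0.67 for the first pair, against 0.66 in the table).

```ruby
# Score two name tokens: 1.0 when equal (ignoring case), a partial score
# when one token is a prefix of the other (e.g. "J" vs "Jan"), else 0.0.
def token_score(a, b)
  short, long = [a.downcase, b.downcase].sort_by(&:length)
  return 0.0 if short.empty?
  long.start_with?(short) ? short.length.to_f / long.length : 0.0
end

# Average, over the tokens of the first name, of the best score against
# any token of the second name. Token order is ignored, which gives the
# word swapping tolerance.
def name_similarity(name_a, name_b)
  tokens_a = name_a.split
  tokens_b = name_b.split
  return 0.0 if tokens_a.empty? || tokens_b.empty?
  total = tokens_a.sum { |ta| tokens_b.map { |tb| token_score(ta, tb) }.max }
  total / tokens_a.length
end

swapped = name_similarity('Jan Vosecky', 'Vosecky Jan')
partial = name_similarity('Jan Vosecky', 'J Vosecky')
```

The sketch is deliberately greedy: a token of the second name may be chosen as the best match for more than one token of the first name, and the score is not symmetric in its two arguments.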
    • 13. Approaches
        • Identifying Users Across Social Tagging Systems, by Tereza Iofciu, Peter Fankhauser, Fabian Abel, Kerstin Bischof
        • Tagged entities
        • 'Bag-of-words' document model
        • Only basic matching
    • 14. Problem, Generalization, Approach studied, Proposition, Testing, Conclusion
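As a rough illustration of a bag-of-words document model over tags, the sketch below turns each user's tags into term counts and compares two users by cosine similarity. Cosine similarity, the whitespace tokenization and the function names are assumptions chosen for the example; the slide does not say which measure Iofciu et al. use.

```ruby
# Turn a list of tags into a bag of words: term => number of occurrences.
def bag_of_words(tags)
  tags.flat_map { |t| t.downcase.split }.tally
end

# Cosine similarity between two bags; 0.0 when either bag is empty.
def cosine_similarity(bag_a, bag_b)
  dot = bag_a.sum { |term, count| count * bag_b.fetch(term, 0) }
  norm_a = Math.sqrt(bag_a.values.sum { |c| c * c })
  norm_b = Math.sqrt(bag_b.values.sum { |c| c * c })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

user_a = bag_of_words(%w[Linguistics programming startups])
user_b = bag_of_words(%w[programming Startups music])
tag_similarity = cosine_similarity(user_a, user_b)
```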
    • 15. Proposition
        1. A profile is a non-uniform document with different features of different types
        2. Parameters split into 'unique' and 'frequent':
            'username' is unique
            'surname' is unique, although homonymy may occur
            'interests' is frequent (shared by many people)
    • 16. Proposition
        3. Use a combined model:
            1. Initial matching as in [1] (vector-based)
            2. If that fails, continue to weight-based unique attribute matching
            3. If that fails, continue to clustering and all-attribute nearest-neighbor prediction
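The fall-through structure of the combined model can be sketched as a chain of matchers that are tried in order until one returns a result. The three lambdas below are hypothetical stand-ins, far simpler than the stages named on the slide, and the profile fields and the threshold of two shared interests are likewise assumptions made only to keep the example runnable.

```ruby
# Try each stage in order; return the first non-nil match together with
# the name of the stage that produced it.
def match_profile(profile, candidates, stages)
  stages.each do |name, matcher|
    match = matcher.call(profile, candidates)
    return { stage: name, match: match } unless match.nil?
  end
  { stage: nil, match: nil }
end

# Stage 1 stand-in: direct matching on an identical username.
direct = lambda do |profile, candidates|
  candidates.find { |c| c[:username] == profile[:username] }
end

# Stage 2 stand-in: match on a unique attribute (here: a known e-mail).
unique_attrs = lambda do |profile, candidates|
  candidates.find { |c| !profile[:email].nil? && c[:email] == profile[:email] }
end

# Stage 3 stand-in: nearest neighbor by number of shared interests.
nearest_neighbor = lambda do |profile, candidates|
  overlap = ->(c) { (Array(c[:interests]) & Array(profile[:interests])).size }
  best = candidates.max_by(&overlap)
  best if best && overlap.call(best) >= 2
end

stages = { direct: direct, unique: unique_attrs, nearest: nearest_neighbor }

profile = { username: 'darikcr', email: nil,
            interests: %w[linguistics programming startups] }
candidates = [
  { username: 'someone', email: 'a@example.com', interests: %w[music] },
  { username: 'oleg.urzhumtsev', email: 'b@example.com',
    interests: %w[linguistics programming motoschool] }
]

result = match_profile(profile, candidates, stages)
```

Here the first two stages find nothing (different usernames, no e-mail on the query profile), so the result comes from the last stage.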
    • 17. Proposition: Weight-based unique attribute matching
        similarity = (this.unique_attrs.sum { |id, attr| other.unique_attrs[id] == attr ? weight_unique[id] : 0 } + (this.frequent_attrs & other.frequent_attrs).size) / (this.frequent_attrs - other.frequent_attrs).size.to_f
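A runnable sketch of this similarity, under one reading of the slide's pseudocode: weighted matches on unique attributes plus the number of shared frequent attributes, divided by the number of frequent attributes that are not shared. The profile layout (`:unique` as a Hash, `:frequent` as an Array), the default weight of 1.0 and the clamping of the denominator to 1 are assumptions added so that the example runs and never divides by zero.

```ruby
# Similarity of `this` to `other`:
# (weighted unique matches + shared frequent attrs) / unshared frequent attrs.
def similarity(this, other, weight_unique)
  unique_part = this[:unique].sum(0.0) { |id, value|
    !value.nil? && other[:unique][id] == value ? weight_unique.fetch(id, 1.0) : 0.0
  }
  shared   = (this[:frequent] & other[:frequent]).size
  unshared = (this[:frequent] - other[:frequent]).size
  # Assumption: clamp the denominator so identical frequent sets do not
  # cause a division by zero.
  (unique_part + shared) / [unshared, 1].max
end

weight_unique = { username: 5.0, surname: 2.0 } # illustrative weights

a = { unique: { username: 'netbug', surname: 'Urzhumtcev' },
      frequent: %w[linguistics programming startups] }
b = { unique: { username: 'darikcr', surname: 'Urzhumtcev' },
      frequent: %w[programming startups music] }

sim = similarity(a, b, weight_unique)
```

Note that the result is not normalized to the range 0..1: it grows with the number of shared attributes and shrinks with the number of unshared ones, so a decision threshold would have to be tuned on real data.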
    • 18. Proposition: Clustering
        Hierarchical: the distribution seems to be even
        • Distance: non-numeric parameter conversion
        • Merging:
            • for vector-like attributes, show features shared by 30% of members or more
                • Slow
                • Reliable
            • Probabilistic for singular features
        Curse of dimensionality
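The merging rule for vector-like attributes can be sketched as follows: when two clusters are merged, the merged cluster is described by the features that at least 30% of its members share. Representing each member as an array of feature strings, and the function names themselves, are assumptions for illustration only.

```ruby
# Features shared by at least `threshold` (a fraction) of the members.
# Each member is an array of features; duplicates within a member count once.
def cluster_features(members, threshold: 0.3)
  return [] if members.empty?
  counts = members.flat_map(&:uniq).tally
  counts.select { |_, n| n.to_f / members.size >= threshold }.keys.sort
end

# Merge two clusters and recompute the shared-feature summary.
def merge_clusters(a, b, threshold: 0.3)
  merged = a + b
  { members: merged, features: cluster_features(merged, threshold: threshold) }
end

cluster_a = [%w[hkust programming], %w[hkust music]]
cluster_b = [%w[itmo programming], %w[hkust chess]]
merged = merge_clusters(cluster_a, cluster_b)
```

Recounting every feature on each merge is simple but costs time proportional to the total number of features in the merged cluster, which is consistent with the slide's remark that this step is slow but reliable.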
    • 19. Technical work
        1. Data fetching:
            1. About.me
            2. Facebook
            3. Twitter
        2. Tools:
            1. Ruby
            2. Document-oriented NoSQL database: MongoDB
        3. Implementation of vector-based weighted comparison
        4. Implementation of VMN algorithm
    • 20. Problem, Generalization, Approach studied, Proposition, Testing, Conclusion
    • 21. Testing
        Columns: Direct nearest neighbor matching | Unique parameter matching | Combined (direct + unique + clustering) | LDA document-based model (experimental)
        Completeness (%): 51% | 56% | 74% | 46%
        Basic set: 53 | 58 | 78 | 95
        Accuracy (%): 100% | 98% | 95% | 51%
        of them false positive: 0 | 1 | 3 | 42
        False negative: 51 | 46 | 29 | 9
        Extended set: 56 | 62 | 127 | N/T
        Accuracy (%): 98% | 54% | 70% | N/T
        of them false positive: 2 | 5 | N/T
        False negative: 51 (basic set) | 37 (basic set) | 28 (basic set) | N/T
    • 22. Future work
        1. Attempt to convert all parameters to numeric format and apply SVM for clustering
        2. Add semantic word similarity via WordNet distance
        3. Named Entity Recognition in text fields
        4. Wrap the developed algorithms into a single sleek Rails web application and run public testing
    • 23. Conclusion
        1. All approaches studied had a strong mathematical background but were poorly adapted to real applications
        2. An intuitive fusion of approaches suited to different situations may improve results
        3. Further work is necessary to develop the best approach
    • 24. Thank you! Questions?
        Slides available at http://n3r.ru/c4
        Demo & code available at http://n3r.ru/c5
        Feel free to contact me: darikcr@gmail.com, http://about.me/netbug
        ...and enlarge your soft skills!
