
Public profile

Course project presentation for COMP5331: implementation of a social profile matching algorithm, research on improvements to it, and testing data

  • As shown on the previous slide, Chinese provides broad opportunities for homonymy. However, even in comparatively small Russia there is someone with the same first name and surname as me.
  • Performance has not been tested because the test data set is small.
  • However, Jan demonstrated the problem of missing data.

    1. Public profile matching
        Urzhumtcev Oleg, SkolTech, ITMO¹
        Instructor: Raymond Chi-Wing Wong, HKUST
        27 November 2012, Hong Kong
        ¹ http://en.qdinvest.ru
    2. Problem, Generalization, Approach studied, Proposition, Testing, Conclusion
    3. Problem
        • There are many objects in the world
        • Some of them are named entities, among them people
        • They may have different representations (profiles)
    4. Problem
    5. Problem
    6. Generalization
        • Part of the summarization problem
        • More precisely, its first step: accurate data collection
        • Caused by homonymy
    7. Problem, Generalization, Approach studied, Proposition, Testing, Conclusion
    8. Approaches
        • User Identification Across Multiple Social Networks, by Jan Vosecky, Dan Hong, Vincent Y. Shen
        • Features:
            • Direct matching (nearest neighbor)
            • Vector-Based Comparison Algorithm
            • Fuzzy string field matching
            • Weighted parameters
    9. Approaches: vector-based comparison
        Twitter profile:
            profile {
              :id = “50b2c847e3b24cf21400000”
              :username = “darikcr”
              :type “twitter”
              :source “http://twitter.com/darikcr”
              :name “NetBUG”
              :lang “ru-RU”
              :birthday nil
              :email nil
              :about “Linguist, programmer, also have some XPrience in making startups. Groaning for active shiny people to do business together”
              :status “Две новые станции метро в Петербурге - "Бухарестскую" и "Международную" - откроют 27 декабря. Б... http://vk.cc/15wd4M”
                  (in English: “Two new metro stations in St. Petersburg, "Bukharestskaya" and "Mezhdunarodnaya", will open on 27 December. ...”)
              :tags nil
            }
        Facebook profile:
            profile {
              :id = “50b2c843e3b24cf214000005”
              :username = “oleg.urzhumtsev”
              :type “facebook”
              :source “http://facebook.com/oleg.urzhumtsev”
              :name “Oleg Urzhumtcev”
              :alias “NetBUG”
              :lang “ru-RU”
              :birthday 1989/10/19
              :email “darikcr@gmail.cm”
              :about “”
              :status “Checked in at HKUST Bus Station”
              :tags nil
              :university [“HKUST” “SkolTech” “ITMO” “SPbSU”]
              :job [“ProMT JSC” “Israeli Embassy”]
              :interests [“Linguistics” “motoschool” “programming” “startups”]
            }
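The field-by-field comparison on this slide can be sketched roughly as follows. This is a minimal illustration, not the implementation from the paper or the course project: the field list, the `WEIGHTS` values, and both helper names are assumptions, and exact (case-insensitive) equality stands in for whatever per-field matcher is actually used.

```ruby
# Field weights are hypothetical; the slides do not give the values used.
WEIGHTS = { username: 0.3, name: 0.3, email: 0.3, lang: 0.1 }.freeze

# Compare two profile hashes field by field.
# Per field: 1.0 = equal, 0.0 = different, nil = missing on either side.
def comparison_vector(a, b, fields = WEIGHTS.keys)
  fields.to_h do |f|
    va = a[f]
    vb = b[f]
    score = if va.nil? || vb.nil? || va == "" || vb == ""
              nil
            else
              va.to_s.downcase == vb.to_s.downcase ? 1.0 : 0.0
            end
    [f, score]
  end
end

# Weighted average over the fields that are present in both profiles.
def weighted_similarity(vector, weights = WEIGHTS)
  known = vector.reject { |_, s| s.nil? }
  return 0.0 if known.empty?

  total = known.keys.sum { |f| weights[f] }
  known.sum { |f, s| weights[f] * s } / total
end

twitter  = { username: "darikcr", name: "NetBUG", lang: "ru-RU", email: nil }
facebook = { username: "oleg.urzhumtsev", name: "Oleg Urzhumtcev",
             alias: "NetBUG", lang: "ru-RU", email: "darikcr@gmail.cm" }

vector = comparison_vector(twitter, facebook)
score  = weighted_similarity(vector)
```

With these sample values only `lang` matches, so the score is about 0.14 even though both profiles belong to the same person: the Twitter `name` equals the Facebook `alias`, but a per-field comparison never looks across fields.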
    10. Approaches: fuzzy matching (VMN algorithm*)
        #   String pair                        VMN    SDS    SD
        1   “Jan Vosecky”, “J Vosecky”         0.66   0.82   2.0
        2   “Jan Vosecky”, “Vosecky Jan”       1.0    0.55   5.0
        3   “Jan Vosecky”, “Honza vosecky”     0.5    0.36   7.0
        4   “Jan Vosecky”, “Robert Vosecky”    0.5    0.55   5.0
        5   “Jan Vosecky”, “Jan Smith”         0.5    0.45   6.0
        6   “Jan Vosecky”, “Jack Vondracek”    0.0    0.27   8.0
        Table 1. String Match Functions Comparison
        • Partial matching
        • Word swapping tolerance
        • *Vosecky, Hong, Shen 2009
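Two things in Table 1 can be illustrated with a short sketch. `word_overlap` is a simplified, order-insensitive word comparison that shows the "word swapping tolerance" idea; it is not the published VMN algorithm and gives no credit for partial words. `sds` encodes a relation that the SDS and SD columns happen to satisfy, namely SDS = 1 - SD / (length of the first string); how SD itself is computed is not stated on the slide, so it is taken as an input here. Both function names are assumptions.

```ruby
# Order-insensitive word overlap between two names.
# Simplified stand-in, NOT the VMN algorithm (no partial-word credit).
def word_overlap(a, b)
  wa = a.downcase.split
  wb = b.downcase.split
  return 0.0 if wa.empty?

  remaining = wb.dup
  hits = wa.count do |w|
    i = remaining.index(w)
    i ? remaining.delete_at(i) : nil # consume each word of b at most once
  end
  hits.to_f / [wa.size, wb.size].max
end

# SDS as it appears in Table 1: 1 - SD / length of the first (reference) string.
def sds(sd, reference)
  (1.0 - sd / reference.length.to_f).round(2)
end

word_overlap("Jan Vosecky", "Vosecky Jan")   # swapped words still match fully
sds(2.0, "Jan Vosecky")                      # row 1 of the table
```

The overlap score agrees with the VMN column for rows 2 to 6 but gives 0.5 rather than 0.66 for “J Vosecky”, because the initial “J” earns no partial credit in this sketch.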
    11. Approaches: drawbacks
        1. Suitable only for well-intersected profiles
        2. Bad for discovery
        3. No cross-parameter search
    12. Approaches: awareness of missing data
    13. Approaches
        • Identifying Users Across Social Tagging Systems, by Tereza Iofciu, Peter Fankhauser, Fabian Abel, Kerstin Bischof
        • Tagged entities
        • ‘Bag-of-words’ document model
        • Only basic matching
    14. Problem, Generalization, Approach studied, Proposition, Testing, Conclusion
    15. Proposition
        1. A profile is a non-uniform document with different features of different types
        2. Parameters split into ‘unique’ and ‘frequent’:
            • ‘username’ is unique
            • ‘surname’ is unique, although homonymy may occur
            • ‘interests’ is frequent (shared by many people)
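One way to make the unique/frequent split concrete is to look at how often values repeat across a set of profiles. The slides appear to assign the two labels by hand, so the data-driven rule below, the `classify_attributes` name, and the 0.9 cutoff are all assumptions made for illustration.

```ruby
# Label an attribute :unique when its values rarely repeat across profiles,
# :frequent otherwise. The 0.9 cutoff is an arbitrary illustrative threshold.
def classify_attributes(profiles, cutoff: 0.9)
  keys = profiles.flat_map(&:keys).uniq
  keys.to_h do |key|
    # Array() turns nil into [] and leaves list-valued attributes as they are.
    values = profiles.flat_map { |p| Array(p[key]) }
    ratio = values.empty? ? 0.0 : values.uniq.size.to_f / values.size
    [key, ratio >= cutoff ? :unique : :frequent]
  end
end

profiles = [
  { username: "darikcr",         interests: %w[programming startups] },
  { username: "oleg.urzhumtsev", interests: %w[programming] },
  { username: "jvosecky",        interests: %w[startups programming] }
]

kinds = classify_attributes(profiles)
```

Here every `username` is distinct, so it is labeled `:unique`, while the five `interests` values contain only two distinct entries and come out as `:frequent`.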
    16. Proposition
        3. Use a combined model:
            1. Initial matching as in [1] (vector-based, Vosecky et al.)
            2. If that fails, continue to weight-based unique attribute matching
            3. If that fails, continue to clustering and all-attribute nearest-neighbor prediction
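The fallback order of the combined model can be sketched as a simple cascade. The three lambdas below are hypothetical placeholders that only stand in for the real stages; the point of the sketch is the control flow, where each stage runs only if the previous one returned no match.

```ruby
# Run matching stages in order and stop at the first one that finds a match.
# Each stage is a callable taking (profile, candidates) and returning a match or nil.
def combined_match(profile, candidates, stages)
  stages.each do |name, stage|
    match = stage.call(profile, candidates)
    return { stage: name, match: match } if match
  end
  nil
end

# Placeholder stages (hypothetical, much simpler than the methods on the slides).
stages = {
  vector_based: ->(p, cs) { cs.find { |c| c[:username] == p[:username] } },
  unique_attributes: ->(p, cs) { cs.find { |c| c[:email] && c[:email] == p[:email] } },
  clustering: lambda do |p, cs|
    cs.find { |c| ((c[:interests] || []) & (p[:interests] || [])).size >= 2 }
  end
}

profile    = { username: "darikcr", email: "someone@example.com" }
candidates = [{ username: "oleg.urzhumtsev", email: "someone@example.com" }]

result = combined_match(profile, candidates, stages)
```

In this example the first stage fails because the usernames differ, and the second stage succeeds on the shared e-mail address, so `result[:stage]` is `:unique_attributes`.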
    17. Proposition: weight-based unique attribute matching
        Similarity = (weighted number of matching unique attributes + number of matching frequent attributes) / number of non-matching frequent attributes
        In Ruby-like pseudocode:
            similarity =
              (this.unique_attrs.sum { |id, attr| other.unique_attrs[id] == attr ? weight_unique[id] : 0 } +
               (this.frequent_attrs & other.frequent_attrs).size) /
              (this.frequent_attrs - other.frequent_attrs).size
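A runnable version of this similarity might look like the sketch below. The `Profile` struct, the default weight of 1.0, and the clamping of the denominator to 1 (to avoid dividing by zero when all frequent attributes are shared) are assumptions; the slide does not say how that edge case is handled.

```ruby
# unique_attrs: Hash of attribute id => value; frequent_attrs: Array of values.
Profile = Struct.new(:unique_attrs, :frequent_attrs) do
  def similarity(other, weight_unique)
    # Weighted count of unique attributes with identical, non-nil values.
    unique_score = unique_attrs.sum(0.0) do |id, attr|
      !attr.nil? && other.unique_attrs[id] == attr ? weight_unique.fetch(id, 1.0) : 0.0
    end
    shared   = (frequent_attrs & other.frequent_attrs).size
    unshared = (frequent_attrs - other.frequent_attrs).size
    # Assumption: clamp the denominator to 1 so full overlap does not divide by zero.
    (unique_score + shared) / [unshared, 1].max
  end
end

a = Profile.new({ username: "netbug", email: nil },
                %w[Linguistics programming startups])
b = Profile.new({ username: "netbug", email: "someone@example.com" },
                %w[programming startups music])

score = a.similarity(b, { username: 2.0, email: 3.0 })
```

For these two profiles the matching `username` contributes 2.0, two frequent attributes are shared, and one is not, giving (2.0 + 2) / 1 = 4.0. The score is unbounded and asymmetric, so it is only meaningful relative to other candidates for the same profile.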
    18. Proposition: clustering
        Hierarchical: the distribution seems to be even
        • Distance: non-numeric parameter conversion
        • Merging:
            • Surface features shared by 30% of members or more, for vector-like attributes
                • Slow
                • Reliable
            • Probabilistic for singular features
        Curse of dimensionality
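The two ingredients named on this slide, a distance over non-numeric attributes and the 30% rule for merged clusters, can be sketched as follows. The slide does not say which distance was used; Jaccard distance over set-valued attributes is swapped in here as one common choice, and both function names are assumptions.

```ruby
require "set"

# Jaccard distance between two set-valued attributes
# (0.0 = identical sets, 1.0 = nothing in common).
def jaccard_distance(a, b)
  a = a.to_set
  b = b.to_set
  union = a | b
  return 0.0 if union.empty?

  1.0 - (a & b).size.to_f / union.size
end

# Features kept for a merged cluster: those present in at least
# `share` (30% by default) of its members.
def cluster_features(members, share: 0.3)
  counts = Hash.new(0)
  members.each { |features| features.uniq.each { |f| counts[f] += 1 } }
  counts.select { |_, n| n >= share * members.size }.keys
end

members = [
  %w[programming startups],
  %w[programming],
  %w[programming music],
  %w[linguistics programming]
]

distance = jaccard_distance(members[0], members[2])
shared   = cluster_features(members)
```

With four members the threshold is 1.2 occurrences, so only `programming`, which appears in all four, survives as a cluster feature.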
    19. Technical work
        1. Data fetching:
            1. About.me
            2. Facebook
            3. Twitter
        2. Tools:
            1. Ruby
            2. Document-oriented NoSQL database: MongoDB
        3. Implementation of vector-based weighted comparison
        4. Implementation of the VMN algorithm
    20. Problem, Generalization, Approach studied, Proposition, Testing, Conclusion
    21. Testing
        Columns: Direct = direct nearest-neighbor matching; Unique = unique parameter matching; Combined = direct + unique + clustering; LDA = LDA document-based model (experimental); N/T = not tested.

        Data                       Direct            Unique            Combined          LDA
        Completeness (%)           51%               56%               74%               46%
        Basic set                  53                58                78                95
        Accuracy (%)               100%              98%               95%               51%
          of them false positive   0                 1                 3                 42
        False negative             51                46                29                9
        Extended set               56                62                127               N/T
        Accuracy (%)               98%               54%               70%               N/T
          of them false positive   2                 5                                   N/T
        False negative             51 (basic set)    37 (basic set)    28 (basic set)    N/T
    22. Future work
        1. Attempt to convert all parameters to numeric format and apply SVM for clustering
        2. Add semantic word similarity via WordNet distance
        3. Named Entity Recognition in text fields
        4. Wrap the developed algorithms into a single sleek Rails web application and open it for public testing
    23. Conclusion
        1. All approaches studied had a strong mathematical background but were poorly adapted to real applications
        2. An intuitive fusion of approaches suited to different situations may improve results
        3. Further work is necessary to develop the best approach
    24. Thank you! Questions?
        Slides available at http://n3r.ru/c4
        Demo & code available at http://n3r.ru/c5
        Feel free to contact me: darikcr@gmail.com, http://about.me/netbug
        ...and enlarge your soft skills!
