Mining the social web 6

Published in: Technology, Business

  1. Mining the Social Web: Chapter 6. LinkedIn: Clustering Your Professional Network for Fun (and Profit?) (chois79)
  2. Contents
       • Introduction
       • Motivation for Clustering
       • Clustering Contacts by Job Title
           – Standardizing and Counting Job Titles
           – Common Similarity Metrics for Clustering
           – A Greedy Approach to Clustering
           – Hierarchical and k-Means Clustering
       • Fetching Extended Profile Information
       • Closing Remarks
  3. Introduction
       • LinkedIn is a popular social networking site focused on professional and business relationships
       • There are two primary ways to access your LinkedIn data
           – Exporting it as address book data
           – Using the LinkedIn API
       • This chapter introduces fundamental clustering techniques to answer the following kinds of queries
           – Which of your connections are the most similar based upon a criterion like job title?
           – Which of your connections have worked in companies you want to work for?
           – Where do most of your connections reside geographically?
  4. Motivation for Clustering
       • Which of your connections are the most similar based upon a criterion like job title?
           – LinkedIn members enter their professional information as free text: job titles, company names, professional interests…
       • There are two issues
           – How to measure the similarity between two values
               • Ex) Chief Executive Officer, Chief Technology Officer
           – How to cluster all members
               • Ideally, every member would be compared to every other member
               • This is an n-squared problem
  5. Clustering Contacts by Job Title: Standardizing and Counting Job Titles
       • Use patterns to transform common job title variants into a standard form
       • Perform a basic frequency analysis on the standardized titles
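
The standardize-then-count step on this slide could be sketched roughly as follows. The abbreviation table and the function names `standardize` and `count_titles` are illustrative assumptions, not taken from the slides; a real table would be built from the variants that actually appear in your contact data.

```python
from collections import Counter

# Hypothetical abbreviation table (an assumption for illustration only).
ABBREVIATIONS = {
    "sr": "Senior",
    "jr": "Junior",
    "vp": "Vice President",
    "ceo": "Chief Executive Officer",
    "cto": "Chief Technology Officer",
}


def standardize(title):
    """Expand known abbreviations token by token; leave other tokens as-is."""
    words = []
    for token in title.split():
        key = token.strip(".,").lower()
        words.append(ABBREVIATIONS.get(key, token))
    return " ".join(words)


def count_titles(titles):
    """Frequency analysis over the standardized titles."""
    return Counter(standardize(title) for title in titles)


if __name__ == "__main__":
    sample = ["CEO", "Chief Executive Officer", "Sr. Software Engineer"]
    for title, count in count_titles(sample).most_common():
        print(count, title)
```

Working on whole tokens rather than raw substrings avoids accidentally rewriting letters inside longer words.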
  6. Clustering Contacts by Job Title: Common Similarity Metrics for Clustering
       • Edit distance (Levenshtein distance)
           – The number of operations required to transform one string into the other
           – Ex1) dad into bad = 1
           – Ex2) park into spake = 3. Aligning park against spake with one gap (□) and counting the mismatched positions:

                 alignment    mismatches
                 s p a k e    (target)
                 □ p a r k    3
                 p a r k □    4
                 p □ a r k    4
                 p a r □ k    5
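
As a rough illustration of the edit distance described on this slide, here is a minimal dynamic-programming sketch. The function name `edit_distance` is our own choice, and the unit cost for each insertion, deletion, and substitution is the usual Levenshtein assumption rather than something the slide spells out.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("dad", "bad"))
    print(edit_distance("park", "spake"))
```

Keeping only two rows of the table is enough because each cell depends only on the current and previous row.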
  7. Clustering Contacts by Job Title: Common Similarity Metrics for Clustering
       • N-gram similarity
           – An n-gram is a terse way of expressing each possible consecutive sequence of n tokens from a text
           – Ex) bigram (n = 2)
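
To make the n-gram idea concrete, the following sketch lists every run of n consecutive tokens. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    title = "Chief Executive Officer".split()
    print(ngrams(title, 2))
```

Two titles can then be compared by how many n-grams they share, for example by passing the resulting sets to a set-based distance.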
  8. Clustering Contacts by Job Title: Common Similarity Metrics for Clustering
       • Jaccard distance
           – Jaccard similarity is the number of items the two sets have in common divided by the total number of distinct items in both sets
               • |Set1 intersection Set2| / |Set1 union Set2|
           – nltk.metrics.distance.jaccard_distance returns the corresponding distance
               • (len(X.union(Y)) - len(X.intersection(Y))) / float(len(X.union(Y)))
       • MASI distance
           – A weighted version of Jaccard similarity
               • Adjusts the score so that the distance is smaller than the Jaccard distance when the sets partially overlap
               • 1 - float(len(X.intersection(Y))) / float(max(len(X), len(Y)))
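
The two formulas on this slide can be written out in plain Python as below. This is a sketch that mirrors the slide's formulas only; the function names are ours, returning 0.0 for two empty sets is our own convention, and the MASI variant shown here is the one quoted on the slide, which may differ from what a given NLTK release computes.

```python
def jaccard_distance(x, y):
    """(|X union Y| - |X intersection Y|) / |X union Y|, as on the slide."""
    union = x | y
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return (len(union) - len(x & y)) / len(union)


def masi_distance_from_slide(x, y):
    """1 - |X intersection Y| / max(|X|, |Y|), the variant quoted on the slide."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1 - len(x & y) / longest


if __name__ == "__main__":
    ceo = set("chief executive officer".split())
    cto = set("chief technology officer".split())
    print(jaccard_distance(ceo, cto))
    print(masi_distance_from_slide(ceo, cto))
```

Because max(|X|, |Y|) is never larger than |X union Y|, this variant never yields a larger distance than Jaccard for the same pair of sets.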
  9. Clustering Contacts by Job Title: Common Similarity Metrics for Clustering
       • Jaccard distance vs. MASI distance
  10. Clustering Contacts by Job Title: A Greedy Approach to Clustering
       • Cluster job titles by comparing them using MASI distance
           – Comparing every title with every other title is an n-squared problem
       • Scalable clustering sure ain't easy
           – An O(n^2) algorithm is simply unacceptable for large data sets
               • The inner comparison runs len(all_titles) * len(all_titles) times
  11. Clustering Contacts by Job Title: A Greedy Approach to Clustering
       • A random sample is selected for the scoring function
           – The inner loop then executes a much smaller, fixed number of times
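
The sampling idea on slides 10 and 11 might look roughly like this. It is a sketch under several assumptions: the names `greedy_cluster`, `sample_size`, and `threshold` are hypothetical, titles are tokenized by lowercasing and splitting on whitespace, and the distance is the MASI variant quoted earlier in the deck.

```python
import random


def slide_masi_distance(x, y):
    """1 - |X intersection Y| / max(|X|, |Y|), the variant quoted in the deck."""
    longest = max(len(x), len(y))
    return 0.0 if longest == 0 else 1 - len(x & y) / longest


def greedy_cluster(titles, threshold=0.5, sample_size=50, seed=None):
    """Compare each title against a random sample instead of all titles."""
    rng = random.Random(seed)
    token_sets = {title: set(title.lower().split()) for title in titles}
    unique_titles = list(token_sets)
    clusters = {}
    for title in unique_titles:
        # The inner loop runs at most sample_size times, not len(titles) times.
        candidates = rng.sample(unique_titles, min(sample_size, len(unique_titles)))
        clusters[title] = [title]
        for other in candidates:
            if other == title:
                continue
            if slide_masi_distance(token_sets[title], token_sets[other]) < threshold:
                clusters[title].append(other)
    return clusters


if __name__ == "__main__":
    titles = [
        "Chief Executive Officer",
        "Chief Technology Officer",
        "Software Engineer",
        "Senior Software Engineer",
    ]
    for title, members in greedy_cluster(titles, seed=1).items():
        print(title, "->", sorted(members))
```

The trade-off is that a similar title outside the sample is missed, so results vary between runs unless the seed is fixed or the sample covers every title.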
  12. Clustering Contacts by Job Title: Hierarchical Clustering
       • Hierarchical clustering (agglomerative clustering)
           – A deterministic and exhaustive technique
               • Computes the full matrix of distances between all items
               • Walks through the matrix, clustering items that meet a minimum distance threshold
           – Needs about 0.5*n^2 distance computations, because each stored result is reused for the mirrored pair (dynamic programming)
               • Ex) (abc, def) and (def, abc) have the same distance
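
A simplified sketch of the threshold idea on this slide follows. It is not a full agglomerative algorithm: it visits each unordered pair once (about 0.5*n^2 comparisons) and merges items whose distance is at or below the threshold, which corresponds to cutting a single-linkage hierarchy at that threshold. The function name `threshold_clusters` and the union-find bookkeeping are our own choices.

```python
from itertools import combinations


def threshold_clusters(items, distance, threshold):
    """Group items connected by pairwise distances at or below threshold."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    comparisons = 0
    # combinations() yields each unordered pair once, so (a, b) and (b, a)
    # are never both computed.
    for i, j in combinations(range(len(items)), 2):
        comparisons += 1
        if distance(items[i], items[j]) <= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for index, item in enumerate(items):
        groups.setdefault(find(index), []).append(item)
    return list(groups.values()), comparisons


if __name__ == "__main__":
    values = [1, 2, 10, 11, 50]
    clusters, comparisons = threshold_clusters(values, lambda a, b: abs(a - b), 2)
    print(clusters, comparisons)
```

Any symmetric distance function can be passed in, including a set-based distance over tokenized job titles.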
  13. Clustering Contacts by Job Title: K-Means Clustering
       • K-means clustering
           – Generally executes on the order of O(k*n) times
           – Steps
               1. Randomly pick k points in the data space as initial values that will be used to compute the k clusters: K1, K2, …, Kk.
               2. Assign each of the n points to a cluster by finding the nearest Ki, effectively creating k clusters and requiring k * n comparisons.
               3. For each of the k clusters, calculate the centroid (the mean of the cluster) and reassign its Ki value to be that value.
               4. Repeat steps 2-3 until the members of the clusters do not change between iterations. Generally speaking, relatively few iterations are required for convergence.
           – http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
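
The four steps above can be sketched for points given as numeric tuples. This is a minimal illustration, not a production implementation: the `initial_centroids` parameter is a hypothetical convenience for reproducible runs, and when it is omitted the starting values are sampled from the data points themselves, which is a common variant of step 1 rather than exactly what the slide states.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, initial_centroids=None, max_iterations=100, seed=None):
    # Step 1: choose k starting values (sampled from the data if none are given).
    if initial_centroids is None:
        centroids = random.Random(seed).sample(points, k)
    else:
        centroids = [tuple(c) for c in initial_centroids]

    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest centroid (k * n comparisons).
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once cluster membership no longer changes.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(data, 2, initial_centroids=[(0, 0), (10, 10)]))
```

Job titles are not numeric points, so applying this to the chapter's data would first require some numeric representation of each title, which is outside what this sketch covers.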
  14. Fetching Extended Profile Information (1/2)
       • OAuth
           – An open standard for authorization
           – Allows users to share private resources stored on one site with another site without having to hand out their credentials
  15. Fetching Extended Profile Information (2/2)
       • Example flow
           – Obtain a request token
           – Redirect the user to the authorization page
           – Exchange the authorized request token for an access token
  16. Closing Remarks
       • This chapter covered some serious ground
           – Introduced fundamental clustering techniques
           – Applied them to your professional network data on LinkedIn in a variety of ways
