Mining the social web 6
 

Usage Rights

© All Rights Reserved


    Mining the social web 6: Presentation Transcript

    • Mining the Social Web, Chapter 6. LinkedIn: Clustering Your Professional Network for Fun (and Profit?) (chois79)
    • Contents
      • Introduction
      • Motivation for Clustering
      • Clustering Contacts by Job Title
        – Standardizing and Counting Job Titles
        – Common Similarity Metrics for Clustering
        – A Greedy Approach to Clustering
        – Hierarchical and k-Means Clustering
      • Fetching Extended Profile Information
      • Closing Remarks
    • Introduction
      • LinkedIn is a popular social networking site focused on professional and business relationships
      • The two primary ways to access your LinkedIn data
        – Exporting it as address book data
        – Using the LinkedIn API
      • This chapter introduces fundamental clustering techniques to answer the following kinds of queries
        – Which of your connections are the most similar based upon a criterion like job title?
        – Which of your connections have worked in companies you want to work for?
        – Where do most of your connections reside geographically?
    • Motivation for Clustering
      • Which of your connections are the most similar based upon a criterion like job title?
        – LinkedIn members enter their professional information as free text
          • Job titles, company names, professional interests, ...
      • There are two issues
        – How to measure similarity between two values
          • Ex) Chief Executive Officer vs. Chief Technology Officer
        – How to cluster all of your contacts
          • Ideally, every member would be compared to every other member
          • That makes this an n-squared problem
    • Clustering Contacts by Job Title: Standardizing and Counting Job Titles
      – Use patterns to transform common job-title variants into a standard form
      – Perform a basic frequency analysis on the standardized titles
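A minimal sketch of standardizing and counting titles, assuming a small, hypothetical abbreviation table (`ABBREVIATIONS`) and made-up sample titles; the chapter's actual patterns are not shown in this transcript.

```python
from collections import Counter

# Hypothetical abbreviation table; extend it to match your own contact data.
ABBREVIATIONS = {
    "Sr.": "Senior",
    "Sr": "Senior",
    "Jr.": "Junior",
    "Jr": "Junior",
    "CEO": "Chief Executive Officer",
    "CTO": "Chief Technology Officer",
    "VP": "Vice President",
}


def standardize(title):
    """Expand known abbreviations token by token."""
    return " ".join(ABBREVIATIONS.get(token, token) for token in title.split())


def count_titles(titles):
    """Frequency analysis over the standardized titles."""
    return Counter(standardize(title) for title in titles)


# Made-up sample data for illustration only.
titles = [
    "Sr. Software Engineer",
    "Senior Software Engineer",
    "CEO",
    "Chief Executive Officer",
    "VP Sales",
]
counts = count_titles(titles)
for title, count in counts.most_common():
    print(count, title)
```

Working token by token avoids accidentally rewriting substrings inside longer words, at the cost of missing abbreviations that are glued to punctuation.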
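    • Clustering Contacts by Job Title: Common Similarity Metrics for Clustering
      • Edit distance (Levenshtein distance)
        – The number of single-character operations required to transform one string into the other
        – Ex. 1) dad into bad = 1
        – Ex. 2) park into spake = 3
          • Aligning park against "s p a k e" with one gap (□) and counting the mismatched positions:
              alignment    mismatches
              □ p a r k    3
              p a r k □    4
              p □ a r k    4
              p a r □ k    5
          • The best alignment gives the edit distance, 3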
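The edit distance described above can be sketched with the standard dynamic-programming recurrence. The function name `edit_distance` here is illustrative and written from scratch; it is not taken from the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + substitution_cost,   # substitute or match
            ))
        previous = current
    return previous[-1]


print(edit_distance("dad", "bad"))     # the slide's first example
print(edit_distance("park", "spake"))  # the slide's second example
```

Keeping only two rows of the table keeps memory proportional to the length of the second string.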
    • Clustering Contacts by Job Title: Common Similarity Metrics for Clustering
      • N-gram similarity
        – An n-gram is a terse way of expressing each possible consecutive sequence of n tokens from a text
        – Ex) bi-gram (n = 2)
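As an illustration of the bi-gram case, here is a small sketch that lists every consecutive sequence of n tokens. The helper name `ngrams` and the sample title are assumptions for this example.

```python
def ngrams(tokens, n=2):
    """Return every consecutive run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


title = "Chief Executive Officer"
bigrams = ngrams(title.split(), n=2)
print(bigrams)
```

Two titles can then be compared by how many n-grams they share, for example by feeding the two n-gram sets into a set-based distance.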
    • Clustering Contacts by Job Title: Common Similarity Metrics for Clustering
      • Jaccard distance
        – Jaccard similarity is the number of items in common between the two sets divided by the total number of distinct items in both
          • |Set1 intersection Set2| / |Set1 union Set2|
        – Jaccard distance is 1 minus that similarity
        – In nltk.metrics.distance.jaccard_distance
          • (len(X.union(Y)) - len(X.intersection(Y))) / float(len(X.union(Y)))
      • MASI distance
        – Weighted version of Jaccard similarity
          • Adjusts the score to give a smaller distance than Jaccard when the two sets partially overlap
          • 1 - float(len(X.intersection(Y))) / float(max(len(X), len(Y)))
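The two formulas quoted on this slide can be tried directly on token sets. This sketch reimplements them as written on the slide rather than calling NLTK, whose own implementation may differ between versions; the sample titles are made up, and neither function guards against two empty sets.

```python
def jaccard_distance(x, y):
    """Jaccard distance as quoted on the slide: (|X u Y| - |X n Y|) / |X u Y|."""
    union = x | y
    return (len(union) - len(x & y)) / float(len(union))


def masi_distance(x, y):
    """MASI distance as quoted on the slide: 1 - |X n Y| / max(|X|, |Y|)."""
    return 1 - float(len(x & y)) / float(max(len(x), len(y)))


ceo = set("Chief Executive Officer".split())
cto = set("Chief Technology Officer".split())

jaccard = jaccard_distance(ceo, cto)  # 2 shared tokens out of 4 distinct tokens
masi = masi_distance(ceo, cto)        # 2 shared tokens, larger set has 3 tokens
print(jaccard, masi)
```

For this partially overlapping pair the slide's MASI formula yields the smaller of the two distances, which matches the slide's description.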
    • Clustering Contacts by Job Title: Common Similarity Metrics for Clustering
      • Jaccard distance vs. MASI distance
    • Clustering Contacts by Job Title: A Greedy Approach to Clustering
      • Cluster job titles by comparing them to one another using MASI distance
        – This is an n-squared problem
      • Scalable clustering sure ain't easy
        – An O(n²) algorithm is simply unacceptable for large networks
          • The comparison runs len(all_titles) * len(all_titles) times
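A rough sketch of the all-pairs greedy comparison, to make the n-squared cost concrete. The threshold value (0.5), the function name `greedy_cluster`, and the sample titles are assumptions for illustration; the MASI formula is the one quoted earlier in the slides.

```python
def masi_distance(x, y):
    """MASI distance as quoted on the earlier slide."""
    return 1 - float(len(x & y)) / float(max(len(x), len(y)))


def greedy_cluster(all_titles, threshold=0.5):
    """Compare every title with every title: len(all_titles) ** 2 comparisons."""
    clusters = {}
    comparisons = 0
    for title1 in all_titles:
        clusters[title1] = []
        for title2 in all_titles:
            comparisons += 1
            distance = masi_distance(set(title1.split()), set(title2.split()))
            if distance < threshold:  # assumed cutoff, tune for real data
                clusters[title1].append(title2)
    return clusters, comparisons


all_titles = [
    "Chief Executive Officer",
    "Chief Technology Officer",
    "Software Engineer",
    "Senior Software Engineer",
]
clusters, comparisons = greedy_cluster(all_titles)
print(comparisons)
for title, similar in clusters.items():
    print(title, "->", similar)
```

Each title ends up in its own cluster because its distance to itself is zero; with n titles the inner comparison always runs n * n times, which is the scaling problem the slide points out.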
    • Clustering Contacts by Job Title: A Greedy Approach to Clustering
      • A random sample is selected for the scoring function
        – The inner loop then executes a much smaller, fixed number of times
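One way the sampling idea might look in code: each title is scored against a fixed-size random sample instead of the full list. The names `sampled_cluster`, `sample_size`, and `seed`, the 0.5 threshold, and the sample titles are all assumptions for this sketch.

```python
import random


def masi_distance(x, y):
    """MASI distance as quoted on the earlier slide."""
    return 1 - float(len(x & y)) / float(max(len(x), len(y)))


def sampled_cluster(all_titles, sample_size, threshold=0.5, seed=0):
    """Score each title against a random sample: n * sample_size comparisons."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    clusters = {}
    comparisons = 0
    for title1 in all_titles:
        clusters[title1] = [title1]
        sample = rng.sample(all_titles, min(sample_size, len(all_titles)))
        for title2 in sample:
            comparisons += 1
            distance = masi_distance(set(title1.split()), set(title2.split()))
            if title2 != title1 and distance < threshold:
                clusters[title1].append(title2)
    return clusters, comparisons


all_titles = [
    "Chief Executive Officer",
    "Chief Technology Officer",
    "Software Engineer",
    "Senior Software Engineer",
    "Product Manager",
    "Senior Product Manager",
]
clusters, comparisons = sampled_cluster(all_titles, sample_size=2)
print(comparisons)  # 6 titles * 2 samples, instead of 6 * 6
```

The trade-off is that a similar title is only found if it happens to land in the sample, so results vary with the sample size and seed.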
    • Clustering Contacts by Job Title: Hierarchical Clustering
      • Hierarchical clustering (agglomerative clustering)
        – Deterministic and exhaustive technique
          • Computes the full matrix of distances between all items
          • Walks through the matrix, clustering items that meet a minimum distance threshold
        – Requires about 0.5*n² distance computations (dynamic programming)
          • Distance is symmetric, so a pair such as (abc, def) and (def, abc) needs to be computed only once
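The symmetry saving can be sketched by visiting each unordered pair once and merging items whose distance falls under a threshold. This is a simplified single-linkage variant written for illustration, not the chapter's implementation; `threshold_clusters`, the 0.5 threshold, and the sample titles are assumptions.

```python
from itertools import combinations


def masi_distance(x, y):
    """MASI distance as quoted on the earlier slide."""
    return 1 - float(len(x & y)) / float(max(len(x), len(y)))


def threshold_clusters(titles, threshold=0.5):
    """Merge titles whose pairwise distance is under the threshold,
    computing each unordered pair only once: n * (n - 1) / 2 comparisons."""
    parent = list(range(len(titles)))  # union-find forest over title indexes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    comparisons = 0
    for i, j in combinations(range(len(titles)), 2):
        comparisons += 1
        distance = masi_distance(set(titles[i].split()), set(titles[j].split()))
        if distance < threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for index, title in enumerate(titles):
        groups.setdefault(find(index), []).append(title)
    return list(groups.values()), comparisons


titles = [
    "Chief Executive Officer",
    "Chief Technology Officer",
    "Software Engineer",
    "Senior Software Engineer",
]
groups, comparisons = threshold_clusters(titles)
print(comparisons)  # 6 pairs for 4 titles, versus 16 for the full n * n loop
print(groups)
```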
    • Clustering Contacts by Job Title: K-Means Clustering
      • K-means clustering
        – Generally executes on the order of O(k*n) times
        – Steps
          1. Randomly pick k points in the data space as initial values that will be used to compute the k clusters: K1, K2, ..., Kk.
          2. Assign each of the n points to a cluster by finding the nearest Ki, effectively creating k clusters and requiring k * n comparisons.
          3. For each of the k clusters, calculate the centroid, or the mean of the cluster, and reassign its Ki value to be that value.
          4. Repeat steps 2-3 until the members of the clusters do not change between iterations. Generally speaking, relatively few iterations are required for convergence.
        – http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
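The four steps can be followed in a small pure-Python sketch on 2-D points. As a simplification, the initial values are drawn from the data points themselves rather than from arbitrary positions in the data space; the function names, the seed, and the sample points are assumptions for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k initial values (here: k distinct data points).
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest centroid (k * n comparisons).
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once cluster membership no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(coords) / float(len(members)) for coords in zip(*members)
                )
    return centroids, assignment


# Made-up sample: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Which cluster receives label 0 depends on the random initial picks, so only the grouping, not the label numbering, should be relied on.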
    • Fetching Extended Profile Information (1/2)
      • OAuth
        – Open standard for authorization
        – Allows users to share their private resources stored on one site with another site without having to hand out their credentials
    • Fetching Extended Profile Information (2/2)
      • Example
        – Request token
        – Redirect to the authorization page
        – Access token
    • Closing Remarks
      • This chapter covered some serious ground
        – Introduced fundamental clustering techniques
        – Applied them to your professional network data on LinkedIn in a variety of ways