
Mining Social Data for Fun and Insight

From adunne, 11 months ago

Speaker: Toby Segaran

10962 views  |  2 comments  |  18 favorites  |  565 downloads  |  5 embeds
 

 
 

Tags

web 2.0 expo berlin web 2.0 web2expoberlin data social mining datamining web2expo kdd social mining


Slideshow transcript

Slide 1: Social Data Mining Toby Segaran

Slide 2: About Me http://kiwitobes.com

Slide 3: What is data mining? Implicit Unknown Useful

Slide 4: What is data?

Slide 5: Data-mining traditional uses

Slide 6: Why it’s important now Data

Slide 7: Why it’s important now

Slide 8: Why it’s important now All products are actually sold on Amazon

Slide 9: Why it’s important now Facebook Google

Slide 10: Why it’s important now

Slide 11: For Social Insight Home Prices Blogs and News Movie Data Fashion Product Prices Hotties

Slide 12: Blogs…

Slide 13: The Technorati Top 100

Slide 14: Getting the content The Six Degrees Hypothesis Experienced It Is When You Travel

Slide 15: Building a Word Matrix
  Words:  The, Six, Degrees, Hypothesis, Experienced, It, Is, When, You, Travel
  Kept:   Six, Degrees, Hypothesis, Experienced, Travel
  Counts: Six 3, Degrees 3, Hypothesis 1, Experienced 5, Travel 6

Slide 16: The Word Matrix
                   "china"  "kids"  "music"  "yahoo"  "travel"
  Gothamist           0        3        3        0        3
  GigaOM              6        0        1        2        4
  QuickOnlineTips     0        2        2       12        0
  O'Reilly Radar      1        0        3        4        6
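A minimal sketch of how a word matrix like the one above could be built. The sample texts, the fixed vocabulary, and the function names are illustrative assumptions, not material from the talk.

```python
import re
from collections import Counter

def word_counts(text):
    """Count lowercase alphabetic words in one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def build_word_matrix(docs, vocabulary):
    """Return {doc_name: [count of each vocabulary word, in order]}."""
    matrix = {}
    for name, text in docs.items():
        counts = word_counts(text)
        matrix[name] = [counts[w] for w in vocabulary]
    return matrix

# Hypothetical sample texts, not taken from the talk.
docs = {
    "blog_a": "Travel to China. Travel with kids, music and more travel.",
    "blog_b": "Yahoo news: Yahoo buys music startup.",
}
vocabulary = ["china", "kids", "music", "yahoo", "travel"]
matrix = build_word_matrix(docs, vocabulary)
print(matrix["blog_a"])  # [1, 1, 1, 0, 3]
```

Each row is one blog and each column one word, which is the shape the clustering slides work from.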

Slide 17: Determining distance
                      "china"  "kids"  "music"  "yahoo"
  Gothamist              0        3        3        0
  GigaOM                 6        0        1        2
  Quick Online Tips      0        2        2       12
  Euclidean ("as the crow flies"), GigaOM vs. Quick Online Tips:
  $\sqrt{(6-0)^2 + (0-2)^2 + (1-2)^2 + (2-12)^2} = \sqrt{141} \approx 12$
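The distance on this slide can be checked with a few lines of Python. This is only a sketch of the Euclidean formula applied to the GigaOM and Quick Online Tips rows; the function name is an assumption.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

gigaom = [6, 0, 1, 2]
quick_online_tips = [0, 2, 2, 12]
d = euclidean(gigaom, quick_online_tips)
print(round(d, 2))  # 11.87, i.e. about 12
```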

Slide 18: Hierarchical Clustering Find the two closest items Combine them into a single item Repeat…
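The three steps on this slide can be sketched as follows, using the word matrix from Slide 16. Representing a merged cluster by the plain average of its two vectors is an assumption made for brevity; the slides do not say how merged items are represented, and other linkage rules exist.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hcluster(rows):
    """rows: {name: vector}. Returns a nested tuple tree of names."""
    clusters = [(name, list(vec)) for name, vec in rows.items()]
    while len(clusters) > 1:
        # Find the two closest items.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: euclidean(clusters[p[0]][1], clusters[p[1]][1]),
        )
        (tree_i, vec_i), (tree_j, vec_j) = clusters[i], clusters[j]
        # Combine them into a single item (assumed: average of the two vectors).
        merged = ((tree_i, tree_j), [(x + y) / 2 for x, y in zip(vec_i, vec_j)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        # Repeat until one item remains.
    return clusters[0][0]

rows = {
    "Gothamist": [0, 3, 3, 0, 3],
    "GigaOM": [6, 0, 1, 2, 4],
    "QuickOnlineTips": [0, 2, 2, 12, 0],
    "O'Reilly Radar": [1, 0, 3, 4, 6],
}
tree = hcluster(rows)
print(tree)
```

The nested tuples are the same structure a dendrogram draws: the innermost pair was merged first.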

Slide 19: Hierarchical Algorithm

Slide 20: Hierarchical Algorithm

Slide 21: Hierarchical Algorithm

Slide 22: Hierarchical Algorithm

Slide 23: Hierarchical Algorithm

Slide 24: Dendrogram

Slide 25: Hierarchical Blog Clusters

Slide 26: Hierarchical Blog Clusters

Slide 27: Hierarchical Blog Clusters

Slide 28: Rotating the Matrix
  Words in a blog -> blogs containing each word
           Gothamist  GigaOM  Quick Onl
  china        0         6        0
  kids         3         0        2
  music        3         1        2
  Yahoo        0         2       12
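Rotating the matrix is a transpose: rows become columns, so the same clustering code can group words instead of blogs. A small sketch using the numbers from this slide (variable names are assumptions):

```python
blogs = ["Gothamist", "GigaOM", "Quick Online Tips"]
words = ["china", "kids", "music", "yahoo"]

# One row per blog, one column per word.
by_blog = [
    [0, 3, 3, 0],
    [6, 0, 1, 2],
    [0, 2, 2, 12],
]

# zip(*rows) yields the columns, giving one row per word.
by_word = [list(col) for col in zip(*by_blog)]
for word, counts in zip(words, by_word):
    print(word, counts)
```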

Slide 29: Hierarchical Word Clusters

Slide 30: K-Means Clustering Divides data into distinct clusters User determines how many Algorithm Start with arbitrary centroids Assign points to centroids Move the centroids Repeat
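The loop described on this slide can be sketched in plain Python. Using the first k points as the arbitrary starting centroids and the toy 2D points below are assumptions chosen to keep the example deterministic.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=100):
    # Start with arbitrary centroids (assumed here: the first k points).
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clusters are stable
        assignment = new_assignment
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical toy data: two obvious groups.
points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
assignment, centroids = kmeans(points, k=2)
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

Unlike hierarchical clustering, the number of clusters k has to be chosen up front, as the slide notes.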

Slide 31: K-Means Algorithm

Slide 32: K-Means Algorithm

Slide 33: K-Means Algorithm

Slide 34: K-Means Algorithm

Slide 35: K-Means Algorithm

Slide 36: K-Means Results
  Cluster 1: The Viral Garden, Copyblogger, Creating Passionate Users, Oilman, ProBlogger Blog Tips, Seth's Blog
  Cluster 2: Wonkette, Gawker, Gothamist, Huffington Post

Slide 37: 2D Visualizations Instead of Clusters, a 2D Map Goals Preserve distances as much as possible Draw in two dimensions Dimension Reduction Principal Components Analysis Multidimensional Scaling

Slide 38: Multidimensional Scaling

Slide 39: Multidimensional Scaling

Slide 40: Multidimensional Scaling

Slide 41: Multidimensional Scaling

Slide 46: Zillow

Slide 47: The Zillow API Allows querying by address Returns information about the property Bedrooms Bathrooms Zip Code Price Estimate Last Sale Price

Slide 48: A home price dataset
  House  Zip    Bathrooms  Bedrooms  Built  Type     Price
  A      02138  1.5        2         1847   Single   505296
  B      02139  3.5        9         1916   Triplex  776378
  C      02140  3.5        4         1894   Duplex   595027
  D      02139  2.5        4         1854   Duplex   552213
  E      02138  3.5        5         1909   Duplex   947528
  F      02138  3.5        4         1930   Single   2107871
  etc.

Slide 49: What can we learn? A made-up house's price How important is Zip Code? What are the important attributes? Can we do better than averages?

Slide 50: Introducing Regression Trees
  A   B       Value
  10  Circle  20
  11  Square  22
  22  Square  8
  18  Circle  6

Slide 51: Introducing Regression Trees
  A   B       Value
  10  Circle  20
  11  Square  22
  22  Square  8
  18  Circle  6

Slide 52: Minimizing deviation Standard deviation is the “spread” of results Try all possible divisions Choose the division that decreases deviation the most
  A   B       Value
  10  Circle  20
  11  Square  22
  22  Square  8
  18  Circle  6
  Initially: Average = 14, Standard Deviation = 8.2

Slide 53: Minimizing deviation Standard deviation is the “spread” of results Try all possible divisions Choose the division that decreases deviation the most
  A   B       Value
  10  Circle  20
  11  Square  22
  22  Square  8
  18  Circle  6
  B = Circle: Average = 13, Standard Deviation = 9.9
  B = Square: Average = 15, Standard Deviation = 9.9

Slide 54: Minimizing deviation Standard deviation is the “spread” of results Try all possible divisions Choose the division that decreases deviation the most
  A   B       Value
  10  Circle  20
  11  Square  22
  22  Square  8
  18  Circle  6
  A > 18:  Average = 8, Standard Deviation = 0
  A <= 18: Average = 16, Standard Deviation = 8.7

Slide 55: Minimizing deviation Standard deviation is the “spread” of results Try all possible divisions Choose the division that decreases deviation the most
  A   B       Value
  10  Circle  20
  11  Square  22
  22  Square  8
  18  Circle  6
  A > 11:  Average = 7, Standard Deviation = 1.4
  A <= 11: Average = 21, Standard Deviation = 1.4
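The search over divisions can be sketched on the same four-row table. Scoring a division by the size-weighted average of the two groups' sample standard deviations is an assumption; the slides only say to pick the division that decreases deviation the most. A group with a single row is treated as having zero spread, matching the "A > 18" case.

```python
from statistics import mean, stdev

# (A, B, Value) rows from the slides; the last column is the target.
rows = [(10, "Circle", 20), (11, "Square", 22), (22, "Square", 8), (18, "Circle", 6)]

def spread(values):
    """Sample standard deviation; a single value has no spread."""
    return stdev(values) if len(values) > 1 else 0.0

def split(rows, column, value):
    """Numeric columns split on '> value', other columns on '== value'."""
    if isinstance(value, (int, float)):
        test = lambda r: r[column] > value
    else:
        test = lambda r: r[column] == value
    return [r for r in rows if test(r)], [r for r in rows if not test(r)]

def best_split(rows):
    best = None
    for column in range(len(rows[0]) - 1):
        for value in sorted({r[column] for r in rows}):
            yes, no = split(rows, column, value)
            if not yes or not no:
                continue  # not a real division
            # Assumed score: size-weighted average of the two spreads.
            score = sum(len(g) * spread([r[-1] for r in g]) for g in (yes, no)) / len(rows)
            if best is None or score < best[0]:
                best = (score, column, value)
    return best

values = [r[-1] for r in rows]
print(mean(values), round(spread(values), 1))  # 14 8.2
score, column, value = best_split(rows)
print(column, value, round(score, 1))          # 0 11 1.4  (split on A > 11)
```

Applying the same search recursively to each resulting group is what grows the tree.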

Slide 56: CART Algorithm
  A   B       Value
  10  Circle  20
  11  Square  22
  22  Square  8
  18  Circle  6

Slide 57: CART Algorithm
  A   B       Value
  10  Circle  20
  11  Square  22
  22  Square  8
  18  Circle  6

Slide 58: CART Algorithm
  Group 1             Group 2
  10  Circle  20      22  Square  8
  11  Square  22      18  Circle  6

Slide 59: CART Algorithm

Slide 60: Zillow Results Bathrooms > 3 Zip: 02139? After 1903? Zip: 02140? Bedrooms > 4? Duplex? Triplex?

Slide 61: Just for Fun… Hot or Not

Slide 62: Variance dividers [Chart: scale from 0 to 9; categories Northeast, South, Male, Female; labels "Low Variance Split" and "High Variance Split"]

Slide 63: Just for Fun… Hot or Not

Slide 64: Supervised and Unsupervised Clustering methods are unsupervised There are no answers Methods just characterize the data Show interesting patterns Regression Trees are supervised “answers” are in the dataset Tree models predict answers

Slide 65: Personal Ads

Slide 66: The Analysis Five Cities W4M Personal Ads

Slide 67: Bayesian filter
  Ad text: "If you listen to NPR, watch Hardball, and love the Red Sox, you may be the guy for me. Please email me back. I'm a professional with a grad school degree who has a sense of humor and loves the Sox."
  Word scores for Boston:
    Sox           0.4
    Red           0.35
    Grad          0.2
    Professional  0.1
    Humor         0.1

Slide 68: Bayesian filter
  P(C | W) = P(C & W) / P(W)
  P(C & W): How often do the word and the city appear together?
  P(W): How often does the word appear overall?
  Rank these, and you have a list of the words most particular to a given city
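The ratio on this slide can be estimated from simple counts. The sketch below counts each word once per ad and uses made-up ads; both choices, and the function name, are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def city_word_scores(ads):
    """ads: list of (city, text). Returns {city: {word: P(city | word)}}."""
    word_total = Counter()            # ads containing the word, any city
    word_city = defaultdict(Counter)  # ads containing the word, per city
    for city, text in ads:
        for word in set(text.lower().split()):  # count each word once per ad
            word_total[word] += 1
            word_city[city][word] += 1
    return {
        city: {w: n / word_total[w] for w, n in counts.items()}
        for city, counts in word_city.items()
    }

# Hypothetical ads, not data from the talk.
ads = [
    ("Boston", "love the sox"),
    ("Boston", "sox fan seeks same"),
    ("Chicago", "love the cubs"),
    ("Chicago", "cubs and sox fan"),
]
scores = city_word_scores(ads)
ranked = sorted(scores["Chicago"].items(), key=lambda kv: -kv[1])
print(ranked[0])  # ('cubs', 1.0)
```

On real data, very rare words score 1.0 trivially, so a minimum-count threshold would probably be needed before ranking.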

Slide 69: Results
  New York     Boston          Chicago
  Mets         Pink            Cubs
  Lounges      Sox             Burbs
  Offense      Poetry          Bears
  Desires      Intellectually  Girlie
  Musical      Punk            Insecure
  Submissive   Appreciation    Cheat
  Create       Exercise        Importance
  Song         Winter          Blunt
  Oral         Education       Mouth

Slide 70: Results
  Los Angeles    San Francisco
  Excellent      Tee
  Vegas          Employment
  Meaningful     Picnic
  Star           STD
  Lame           Tasting
  Industry       Hikes
  Heat           French
  Fitness        .com
  Entertainment  Kayaking
  Latino         Cycling

Slide 71: Newsgroup Discussion

Slide 72: Overlapping themes

Slide 73: Themes in a document

Slide 74: Another word matrix
  Actual Matrix
             Msg1  Msg2  Msg3  Msg4  Msg5
  Gym          2     0     0     3     0
  Calorie      0     2     1     1     3
  Weigh        1     0     2     0     0
  Carbs        0     3     0     0     2
  Treadmill    1     0     0     2     0

Slide 75: Weights and features
  Features Matrix
             F1  F2  F3
  Gym         0   1   2
  Calorie     2   0   1
  Weigh       2   2   1
  Carbs       1   0   3
  Treadmill   0   1   2
  x
  Weight Matrix
      Msg1  Msg2  Msg3  Msg4  Msg5
  F1    1     0     2     3     0
  F2    0     2     1     1     3
  F3    1     0     2     0     0

Slide 76: Matrix factorization
  Features Matrix
             F1  F2  F3
  Gym         0   1   2
  Calorie     2   0   1
  Weigh       2   2   1
  Carbs       1   0   3
  Treadmill   0   1   2
  x
  Weight Matrix
      Msg1  Msg2  Msg3  Msg4  Msg5
  F1    1     0     2     3     0
  F2    0     2     1     1     3
  F3    1     0     2     0     0
  Current Guess
             Msg1  Msg2  Msg3  Msg4  Msg5
  Gym          1     3     3     0     1
  Calorie      0     2     4     1     3
  Weigh        2     3     1     0     1
  Carbs        0     1     1     0     2
  Treadmill    3     2     0     2     2

Slide 77: Matrix factorization
  Features Matrix and Weight Matrix: as on the previous slide
  Target Result
             Msg1  Msg2  Msg3  Msg4  Msg5
  Gym          2     0     0     3     0
  Calorie      0     2     1     1     3
  Weigh        1     0     2     0     0
  Carbs        0     3     0     0     2
  Treadmill    1     0     0     2     0
  Current Guess
             Msg1  Msg2  Msg3  Msg4  Msg5
  Gym          1     3     3     0     1
  Calorie      0     2     4     1     3
  Weigh        2     3     1     0     1
  Carbs        0     1     1     0     2
  Treadmill    3     2     0     2     2

Slide 78: Matrix factorization
  Features Matrix
             F1  F2  F3
  Gym         1   0   0
  Calorie     0   1   1
  Weigh       0   0   2
  Carbs       0   1   0
  Treadmill   1   0   0
  x
  Weight Matrix
      Msg1  Msg2  Msg3  Msg4  Msg5
  F1    2     0     0     1     0
  F2    0     2     0     1     3
  F3    1     0     1     0     0
  Target Result and Current Guess (now identical)
             Msg1  Msg2  Msg3  Msg4  Msg5
  Gym          2     0     0     3     0
  Calorie      0     2     1     1     3
  Weigh        1     0     2     0     0
  Carbs        0     3     0     0     2
  Treadmill    1     0     0     2     0
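One common way to search for such a pair of matrices is the multiplicative update rule for non-negative matrix factorization. The slides do not say which update rule the talk used, so the sketch below is an assumption, written in plain Python on the "Actual Matrix" from Slide 74. The random starting values, the seed, and the helper names are also illustrative.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def error(target, guess):
    """Sum of squared differences between target and current guess."""
    return sum((x - y) ** 2 for rt, rg in zip(target, guess) for x, y in zip(rt, rg))

def factorize(v, k, iterations=200, seed=0, eps=1e-9):
    rng = random.Random(seed)
    w = [[rng.random() for _ in range(k)] for _ in v]     # features matrix (words x k)
    h = [[rng.random() for _ in v[0]] for _ in range(k)]  # weight matrix (k x messages)
    for _ in range(iterations):
        # Update the weight matrix: h *= (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(len(h[0]))]
             for i in range(k)]
        # Update the features matrix: w *= (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(len(w))]
    return w, h

# Rows: Gym, Calorie, Weigh, Carbs, Treadmill; columns: Msg1..Msg5.
actual = [
    [2, 0, 0, 3, 0],
    [0, 2, 1, 1, 3],
    [1, 0, 2, 0, 0],
    [0, 3, 0, 0, 2],
    [1, 0, 0, 2, 0],
]
w1, h1 = factorize(actual, k=3, iterations=1)
w, h = factorize(actual, k=3, iterations=200)
err_start = error(actual, matmul(w1, h1))
err_end = error(actual, matmul(w, h))
print(err_start, err_end)  # the error shrinks as the guess approaches the target
```

Because every update multiplies by a non-negative ratio, both matrices stay non-negative, which is what makes the features readable as themes.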

Slide 79: Interpreting Features
  Features Matrix
             F1  F2  F3
  Gym         1   0   0
  Calorie     0   1   1
  Weigh       0   0   2
  Carbs       0   1   0
  Treadmill   1   0   0
  Theme 1: Gym, Treadmill
  Theme 2: Calorie, Carbs
  Theme 3: Weigh, Calorie
  Weight Matrix
      Msg1  Msg2  Msg3  Msg4  Msg5
  F1    2     0     0     1     0
  F2    0     2     0     1     3
  F3    1     0     1     0     0
  Msg1: Theme 1, Theme 3
  Msg2: Theme 2
  Msg3: Theme 3
  etc.

Slide 80: “Diet and body” themes Calories Weight Atkins Fats Induction Protein South Chocolate Cholesterol Beach Black Carbs Coffee Olive Gym Broccoli Weights Exercise Running Cook Injured Recipe Fried Home Money Organic Want Best

Slide 81: Side note: NMF for faces

Slide 82: Methods covered Regression trees Hierarchical clustering k-means clustering Multidimensional scaling Bayesian Classifier Non-negative Matrix Factorization

Slide 83: Other ideas Finance Analysts already drowning in info Stories sometimes broken on blogs Message boards show sentiment Extremely low signal-to-noise ratio

Slide 84: Other ideas Product problems/ideas Use support message boards Extract themes Understand recurring issues Learn what features people want

Slide 85: Other ideas Entertainment How much buzz is a movie generating? What psychographic profiles like this type of movie? Of interest to studios and media investors