# Mining Social Data for Fun and Insight

Speaker: Toby Segaran

Published in: Technology

### Transcript

• 1. Social Data Mining Toby Segaran
• 3. What is data mining? Implicit, unknown, useful
• 4. What is data?
• 6. Why it’s important now Data
• 7. Why it’s important now
• 8. Why it’s important now All products are actually sold on Amazon
• 10. Why it’s important now
• 11. For Social Insight: Home Prices, Blogs and News, Movie Data, Fashion, Product Prices, Hotties
• 12. Blogs…
• 13. The Technorati Top 100
• 14. Getting the content The Six Degrees Hypothesis Experienced It Is When You Travel
• 15. Building a Word Matrix: the text (“The Six Degrees Hypothesis”, “Experienced It Is When You Travel”) is reduced to a word list (Six, Degrees, Hypothesis, Experienced, Travel) and then to word counts: Six 3, Degrees 3, Hypothesis 1, Experienced 5, Travel 6
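• 16. The Word Matrix

| Blog | “china” | “kids” | “music” | “yahoo” | “travel” |
|---|---|---|---|---|---|
| Gothamist | 0 | 3 | 3 | 0 | 3 |
| GigaOM | 6 | 0 | 1 | 2 | 4 |
| QuickOnlineTips | 0 | 2 | 2 | 12 | 0 |
| O’Reilly Radar | 1 | 0 | 3 | 4 | 6 |
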
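
A word matrix like the one on this slide can be built by counting words per document. The following is a minimal sketch, not the speaker's code: the two sample texts, the names `word_counts` and `build_matrix`, and the choice to split on runs of letters are all assumptions made for illustration.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text, split it on anything that is not a letter, count words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def build_matrix(docs, vocabulary):
    """Return one row per document and one column per vocabulary word."""
    matrix = {}
    for name, text in docs.items():
        counts = word_counts(text)
        matrix[name] = [counts[word] for word in vocabulary]
    return matrix

# Made-up documents standing in for downloaded blog content.
docs = {
    "blog_a": "Six degrees of travel. Travel is the hypothesis.",
    "blog_b": "Kids music and more music.",
}
vocabulary = ["travel", "music", "kids"]
matrix = build_matrix(docs, vocabulary)
print(matrix)
```

Each row of `matrix` is a vector that can be compared with other rows, which is what the distance and clustering slides rely on.
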
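• 17. Determining distance

| Blog | “china” | “kids” | “music” | “yahoo” |
|---|---|---|---|---|
| Gothamist | 0 | 3 | 3 | 0 |
| GigaOM | 6 | 0 | 1 | 2 |
| Quick Online Tips | 0 | 2 | 2 | 12 |

Euclidean (“as the crow flies”) distance between GigaOM and Quick Online Tips:

$$\sqrt{(6-0)^2 + (0-2)^2 + (1-2)^2 + (2-12)^2} = \sqrt{141} \approx 12$$
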
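
The distance on this slide can be checked with a few lines of Python. This is a small sketch; the function name `euclidean` and the variable names are chosen here for illustration, while the two vectors are the GigaOM and Quick Online Tips rows from the slide.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

gigaom = [6, 0, 1, 2]
quick_online_tips = [0, 2, 2, 12]

distance = euclidean(gigaom, quick_online_tips)
print(distance)  # the square root of 141, roughly 12
```
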
• 18. Hierarchical Clustering: find the two closest items; combine them into a single item; repeat…
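
The three steps on this slide can be sketched as follows. This is one possible reading, not necessarily the speaker's implementation: it assumes Euclidean distance and assumes that "combining" two items means replacing them with the size-weighted average of their vectors. The function name `hcluster` and the nested-tuple output are illustrative choices; the data is the word matrix from slide 16.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hcluster(rows):
    """Merge the two closest clusters until one remains.

    Returns a nested tuple; leaves are row names.
    """
    # Each cluster is (tree, vector, number_of_members).
    clusters = [(name, list(vec), 1) for name, vec in rows.items()]
    while len(clusters) > 1:
        # Step 1: find the two closest items.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: euclidean(clusters[p[0]][1], clusters[p[1]][1]),
        )
        (tree_a, vec_a, n_a), (tree_b, vec_b, n_b) = clusters[i], clusters[j]
        # Step 2: combine them into a single item (assumed: weighted average).
        merged_vec = [(x * n_a + y * n_b) / (n_a + n_b) for x, y in zip(vec_a, vec_b)]
        merged = ((tree_a, tree_b), merged_vec, n_a + n_b)
        # Step 3: repeat with the merged item in place of the pair.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

rows = {
    "Gothamist": [0, 3, 3, 0, 3],
    "GigaOM": [6, 0, 1, 2, 4],
    "QuickOnlineTips": [0, 2, 2, 12, 0],
    "O'Reilly Radar": [1, 0, 3, 4, 6],
}
tree = hcluster(rows)
print(tree)
```

The nested tuple records the order of merges, which is the information a dendrogram draws.
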
• 19. Hierarchical Algorithm
• 20. Hierarchical Algorithm
• 21. Hierarchical Algorithm
• 22. Hierarchical Algorithm
• 23. Hierarchical Algorithm
• 24. Dendrogram
• 25. Hierarchical Blog Clusters
• 26. Hierarchical Blog Clusters
• 27. Hierarchical Blog Clusters
• 28. Rotating the Matrix: words in a blog -> blogs containing each word

| Word | Gothamist | GigaOM | Quick Onl |
|---|---|---|---|
| china | 0 | 6 | 0 |
| kids | 3 | 0 | 2 |
| music | 3 | 1 | 2 |
| Yahoo | 0 | 2 | 12 |

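
Rotating the matrix is a transpose: rows become columns so that each word, rather than each blog, is an item to cluster. A brief sketch using the numbers from the slides (the variable names are illustrative):

```python
words = ["china", "kids", "music", "yahoo"]

# One row per blog, one column per word (from the "Determining distance" slide).
by_blog = [
    [0, 3, 3, 0],   # Gothamist
    [6, 0, 1, 2],   # GigaOM
    [0, 2, 2, 12],  # Quick Online Tips
]

# zip(*rows) yields the columns, so each word gets its own row.
by_word = [list(column) for column in zip(*by_blog)]

for word, counts in zip(words, by_word):
    print(word, counts)
```
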
• 29. Hierarchical Word Clusters
• 30. K-Means Clustering: divides data into distinct clusters; the user determines how many. Algorithm: start with arbitrary centroids; assign points to centroids; move the centroids; repeat.
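
The loop described on this slide can be sketched in plain Python. This is a minimal illustration under a few assumptions: Euclidean distance, starting centroids passed in explicitly so the run is reproducible (the slide says they are arbitrary), and made-up 2D points. The name `kmeans` and its parameters are not from the talk.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, centroids, max_iter=100):
    """Return (assignment, centroids); assignment[i] is the cluster of points[i]."""
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clusters are stable
        assignment = new_assignment
        # Move each centroid to the average of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Made-up data: two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
assignment, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
print(assignment, centroids)
```

Unlike hierarchical clustering, the number of clusters is fixed up front by how many starting centroids are supplied.
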
• 31. K-Means Algorithm
• 32. K-Means Algorithm
• 33. K-Means Algorithm
• 34. K-Means Algorithm
• 35. K-Means Algorithm
• 36. K-Means Results

| Cluster 1 | Cluster 2 |
|---|---|
| The Viral Garden | Wonkette |
| Copyblogger | Gawker |
| Creating Passionate Users | Gothamist |
| Oilman | Huffington Post |
| ProBlogger Blog Tips | |
| Seth's Blog | |

• 37. 2D Visualizations: instead of clusters, a 2D map. Goals: preserve distances as much as possible; draw in two dimensions. Dimension reduction: Principal Components Analysis, Multidimensional Scaling.
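
One way to pursue both goals is to place the items at random in the plane and nudge them until the 2D distances resemble the original ones. The sketch below does this with plain gradient descent on the summed squared difference between the two sets of distances. It is an assumed, simplified form of multidimensional scaling, not the speaker's code; `scale_down`, `stress`, the learning rate, the iteration count, and the seed are all illustrative choices.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def stress(coords, target):
    """Sum of squared differences between 2D distances and target distances."""
    n = len(coords)
    return sum(
        (euclidean(coords[i], coords[j]) - target[i][j]) ** 2
        for i in range(n) for j in range(i + 1, n)
    )

def scale_down(target, rate=0.01, iterations=500, seed=0):
    """Return 2D coordinates whose pairwise distances approximate `target`."""
    rng = random.Random(seed)
    n = len(target)
    coords = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(iterations):
        grads = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = max(euclidean(coords[i], coords[j]), 1e-9)  # avoid dividing by zero
                error = d - target[i][j]
                for axis in range(2):
                    grads[i][axis] += 2 * error * (coords[i][axis] - coords[j][axis]) / d
        # Move every point a small step against its gradient.
        for i in range(n):
            for axis in range(2):
                coords[i][axis] -= rate * grads[i][axis]
    return coords

# Blog vectors from the word matrix on slide 16.
rows = [[0, 3, 3, 0, 3], [6, 0, 1, 2, 4], [0, 2, 2, 12, 0], [1, 0, 3, 4, 6]]
target = [[euclidean(a, b) for b in rows] for a in rows]
layout = scale_down(target)
print(layout, stress(layout, target))
```

The resulting coordinates can be handed to any plotting tool; items that were close in the original space should end up near each other on the map.
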
• 38. Multidimensional Scaling
• 39. Multidimensional Scaling
• 40. Multidimensional Scaling
• 41. Multidimensional Scaling
• 42. Zillow
• 43. The Zillow API: allows querying by address; returns information about the property: Bedrooms, Bathrooms, Zip Code, Price Estimate, Last Sale Price
• 44. A home price dataset

| House | Zip | Bathrooms | Bedrooms | Built | Type | Price |
|---|---|---|---|---|---|---|
| A | 02138 | 1.5 | 2 | 1847 | Single | 505296 |
| B | 02139 | 3.5 | 9 | 1916 | Triplex | 776378 |
| C | 02140 | 3.5 | 4 | 1894 | Duplex | 595027 |
| D | 02139 | 2.5 | 4 | 1854 | Duplex | 552213 |
| E | 02138 | 3.5 | 5 | 1909 | Duplex | 947528 |
| F | 02138 | 3.5 | 4 | 1930 | Single | 2107871 |

etc.

• 45. What can we learn? A made-up house's price. How important is Zip Code? What are the important attributes? Can we do better than averages?
• 46. Introducing Regression Trees. Data (A, B, Value): (10, Circle, 20), (11, Square, 22), (22, Square, 8), (18, Circle, 6)
• 47. Introducing Regression Trees. Data (A, B, Value): (10, Circle, 20), (11, Square, 22), (22, Square, 8), (18, Circle, 6)
• 48. Minimizing deviation: standard deviation is the “spread” of results. Try all possible divisions. Choose the division that decreases deviation the most. Data (A, B, Value): (10, Circle, 20), (11, Square, 22), (22, Square, 8), (18, Circle, 6). Initially: Average = 14, Standard Deviation = 8.2
• 49. Minimizing deviation: standard deviation is the “spread” of results. Try all possible divisions. Choose the division that decreases deviation the most. B = Circle: Average = 13, Standard Deviation = 9.9. B = Square: Average = 15, Standard Deviation = 9.9
• 50. Minimizing deviation: standard deviation is the “spread” of results. Try all possible divisions. Choose the division that decreases deviation the most. A > 18: Average = 8, Standard Deviation = 0. A <= 18: Average = 16, Standard Deviation = 8.7
• 51. Minimizing deviation: standard deviation is the “spread” of results. Try all possible divisions. Choose the division that decreases deviation the most. A > 11: Average = 7, Standard Deviation = 1.4. A <= 11: Average = 21, Standard Deviation = 1.4
• 52. CART Algorithm. Data (A, B, Value): (10, Circle, 20), (11, Square, 22), (22, Square, 8), (18, Circle, 6)
• 53. CART Algorithm. Data (A, B, Value): (10, Circle, 20), (11, Square, 22), (22, Square, 8), (18, Circle, 6)
• 54. CART Algorithm. The rows divided into two groups: (10, Circle, 20), (11, Square, 22) and (22, Square, 8), (18, Circle, 6)
• 55. CART Algorithm
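
Applying the best division and then repeating on each side gives a regression tree. The following is a rough sketch of that recursion, not a full CART implementation: it has no pruning, it scores divisions by the size-weighted standard deviation of the two sides (an assumption), and it stops when no division reduces the spread. The names `divide`, `best_split`, `build`, and `predict` and the dictionary tree layout are illustrative.

```python
from statistics import mean, stdev

def spread(values):
    return stdev(values) if len(values) > 1 else 0.0

def divide(rows, col, value):
    """Numbers split on 'greater than', everything else on equality."""
    if isinstance(value, (int, float)):
        test = lambda r: r[col] > value
    else:
        test = lambda r: r[col] == value
    return [r for r in rows if test(r)], [r for r in rows if not test(r)]

def best_split(rows):
    """Return (score, column, value) for the lowest-spread division, or None."""
    best = None
    for col in range(len(rows[0]) - 1):  # last column is the value to predict
        for value in sorted({r[col] for r in rows}):
            yes, no = divide(rows, col, value)
            if not yes or not no:
                continue
            s = (len(yes) * spread([r[-1] for r in yes])
                 + len(no) * spread([r[-1] for r in no])) / len(rows)
            if best is None or s < best[0]:
                best = (s, col, value)
    return best

def build(rows):
    values = [r[-1] for r in rows]
    split = best_split(rows) if len(rows) > 1 else None
    if split is None or split[0] >= spread(values):
        return {"predict": mean(values)}  # leaf: predict the average
    _, col, value = split
    yes, no = divide(rows, col, value)
    return {"col": col, "value": value, "yes": build(yes), "no": build(no)}

def predict(tree, row):
    while "predict" not in tree:
        value = tree["value"]
        if isinstance(value, (int, float)):
            branch = row[tree["col"]] > value
        else:
            branch = row[tree["col"]] == value
        tree = tree["yes"] if branch else tree["no"]
    return tree["predict"]

rows = [(10, "Circle", 20), (11, "Square", 22), (22, "Square", 8), (18, "Circle", 6)]
tree = build(rows)
print(tree["col"], tree["value"])    # the first division chosen
print(predict(tree, (12, "Circle")))
```

With only four rows the tree ends up memorizing the data; on a real dataset a minimum leaf size or pruning step would normally be added.
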
• 56. Zillow Results Bathrooms > 3 Zip: 02139? After 1903? Zip: 02140? Bedrooms > 4? Duplex? Triplex?
• 57. Just for Fun… Hot or Not
• 58. Variance dividers [Chart, scale 0 to 9, comparing Northeast vs. South and Male vs. Female; labels: Low Variance Split, High Variance Split]
• 59. Just for Fun… Hot or Not
• 60. Supervised and Unsupervised. Clustering methods are unsupervised: there are no answers; methods just characterize the data and show interesting patterns. Regression trees are supervised: “answers” are in the dataset; tree models predict answers.
• 62. The Analysis Five Cities W4M Personal Ads
• 63. Bayesian filter. Example ad: “If you listen to NPR, watch Hardball, and love the Red Sox, you may be the guy for me. Please email me back. I'm a professional with a grad school degree who has a sense of humor and loves the Sox.” Word scores: Sox 0.4, Red 0.35, Grad 0.2, Professional 0.1, Humor 0.1. Result: Boston
• 64. Bayesian filter: $P(C \mid W) = P(C \,\&\, W) / P(W)$. How often do the word and the city appear together? How often does the word appear overall? Rank these, and you have a list of the words most particular to a given city.
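
The ratio on this slide can be estimated by counting. In the sketch below, P(C & W) is taken as the share of ads that come from city C and contain word W, and P(W) as the share of ads that contain W, so the total number of ads cancels out. The four ads are made up, and `city_given_word` and `particular_words` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Made-up (city, ad text) pairs standing in for the real personal ads.
ads = [
    ("Boston", "love the red sox"),
    ("Boston", "sox fan and grad student"),
    ("Chicago", "cubs fan in the burbs"),
    ("Chicago", "love the cubs"),
]

def city_given_word(ads):
    """Estimate P(city | word) as (ads from city with word) / (ads with word)."""
    word_total = Counter()
    word_city = defaultdict(Counter)
    for city, text in ads:
        for word in set(text.lower().split()):  # count each word once per ad
            word_total[word] += 1
            word_city[city][word] += 1
    return {
        city: {word: count / word_total[word] for word, count in counts.items()}
        for city, counts in word_city.items()
    }

def particular_words(probs, city, n=5):
    """Words ranked by P(city | word), highest first; ties broken alphabetically."""
    ranked = sorted(probs[city].items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]

probs = city_given_word(ads)
print(probs["Boston"]["sox"], probs["Boston"]["love"])
print(particular_words(probs, "Chicago", n=3))
```

On a real corpus, words that appear only once or twice would score 1.0 by chance, so a minimum count per word is usually worth adding before ranking.
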
• 65. Results

| New York | Boston | Chicago |
|---|---|---|
| Mets | Pink | Cubs |
| Lounges | Sox | Burbs |
| Offense | Poetry | Bears |
| Desires | Intellectually | Girlie |
| Musical | Punk | Insecure |
| Submissive | Appreciation | Cheat |
| Create | Exercise | Importance |
| Song | Winter | Blunt |
| Oral | Education | Mouth |

• 66. Results

| Los Angeles | San Francisco |
|---|---|
| Excellent | Tee |
| Vegas | Employment |
| Meaningful | Picnic |
| Star | STD |
| Lame | Tasting |
| Industry | Hikes |
| Heat | French |
| Fitness | .com |
| Entertainment | Kayaking |
| Latino | Cycling |

• 67. Newsgroup Discussion
• 68. Overlapping themes
• 69. Themes in a document
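• 70. Another word matrix

Actual Matrix:

| | Msg1 | Msg2 | Msg3 | Msg4 | Msg5 |
|---|---|---|---|---|---|
| Gym | 2 | 0 | 0 | 3 | 0 |
| Calorie | 0 | 2 | 1 | 1 | 3 |
| Weigh | 1 | 0 | 2 | 0 | 0 |
| Carbs | 0 | 3 | 0 | 0 | 2 |
| Treadmill | 1 | 0 | 0 | 2 | 0 |
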
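• 71. Weights and features

Features Matrix:

| | F1 | F2 | F3 |
|---|---|---|---|
| Gym | 0 | 1 | 2 |
| Calorie | 2 | 0 | 1 |
| Weigh | 2 | 2 | 1 |
| Carbs | 1 | 0 | 3 |
| Treadmill | 0 | 1 | 2 |

× Weight Matrix:

| | Msg1 | Msg2 | Msg3 | Msg4 | Msg5 |
|---|---|---|---|---|---|
| F1 | 1 | 0 | 2 | 3 | 0 |
| F2 | 0 | 2 | 1 | 1 | 3 |
| F3 | 1 | 0 | 2 | 0 | 0 |
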
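• 72. Matrix factorization

Features Matrix:

| | F1 | F2 | F3 |
|---|---|---|---|
| Gym | 0 | 1 | 2 |
| Calorie | 2 | 0 | 1 |
| Weigh | 2 | 2 | 1 |
| Carbs | 1 | 0 | 3 |
| Treadmill | 0 | 1 | 2 |

× Weight Matrix:

| | Msg1 | Msg2 | Msg3 | Msg4 | Msg5 |
|---|---|---|---|---|---|
| F1 | 1 | 0 | 2 | 3 | 0 |
| F2 | 0 | 2 | 1 | 1 | 3 |
| F3 | 1 | 0 | 2 | 0 | 0 |

Current Guess:

| | Msg1 | Msg2 | Msg3 | Msg4 | Msg5 |
|---|---|---|---|---|---|
| Gym | 1 | 3 | 3 | 0 | 1 |
| Calorie | 0 | 2 | 4 | 1 | 3 |
| Weigh | 2 | 3 | 1 | 0 | 1 |
| Carbs | 0 | 1 | 1 | 0 | 2 |
| Treadmill | 3 | 2 | 0 | 2 | 2 |
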
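• 73. Matrix factorization

Features Matrix:

| | F1 | F2 | F3 |
|---|---|---|---|
| Gym | 0 | 1 | 2 |
| Calorie | 2 | 0 | 1 |
| Weigh | 2 | 2 | 1 |
| Carbs | 1 | 0 | 3 |
| Treadmill | 0 | 1 | 2 |

× Weight Matrix:

| | Msg1 | Msg2 | Msg3 | Msg4 | Msg5 |
|---|---|---|---|---|---|
| F1 | 1 | 0 | 2 | 3 | 0 |
| F2 | 0 | 2 | 1 | 1 | 3 |
| F3 | 1 | 0 | 2 | 0 | 0 |

Target Result:

| | Msg1 | Msg2 | Msg3 | Msg4 | Msg5 |
|---|---|---|---|---|---|
| Gym | 2 | 0 | 0 | 3 | 0 |
| Calorie | 0 | 2 | 1 | 1 | 3 |
| Weigh | 1 | 0 | 2 | 0 | 0 |
| Carbs | 0 | 3 | 0 | 0 | 2 |
| Treadmill | 1 | 0 | 0 | 2 | 0 |

Current Guess:

| | Msg1 | Msg2 | Msg3 | Msg4 | Msg5 |
|---|---|---|---|---|---|
| Gym | 1 | 3 | 3 | 0 | 1 |
| Calorie | 0 | 2 | 4 | 1 | 3 |
| Weigh | 2 | 3 | 1 | 0 | 1 |
| Carbs | 0 | 1 | 1 | 0 | 2 |
| Treadmill | 3 | 2 | 0 | 2 | 2 |
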
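• 74. Matrix factorization

Features Matrix:

| | F1 | F2 | F3 |
|---|---|---|---|
| Gym | 1 | 0 | 0 |
| Calorie | 0 | 1 | 1 |
| Weigh | 0 | 0 | 2 |
| Carbs | 0 | 1 | 0 |
| Treadmill | 1 | 0 | 0 |

× Weight Matrix:

| | Msg1 | Msg2 | Msg3 | Msg4 | Msg5 |
|---|---|---|---|---|---|
| F1 | 2 | 0 | 0 | 1 | 0 |
| F2 | 0 | 2 | 0 | 1 | 3 |
| F3 | 1 | 0 | 1 | 0 | 0 |

Target Result:

| | Msg1 | Msg2 | Msg3 | Msg4 | Msg5 |
|---|---|---|---|---|---|
| Gym | 2 | 0 | 0 | 3 | 0 |
| Calorie | 0 | 2 | 1 | 1 | 3 |
| Weigh | 1 | 0 | 2 | 0 | 0 |
| Carbs | 0 | 3 | 0 | 0 | 2 |
| Treadmill | 1 | 0 | 0 | 2 | 0 |

Current Guess:

| | Msg1 | Msg2 | Msg3 | Msg4 | Msg5 |
|---|---|---|---|---|---|
| Gym | 2 | 0 | 0 | 3 | 0 |
| Calorie | 0 | 2 | 1 | 1 | 3 |
| Weigh | 1 | 0 | 2 | 0 | 0 |
| Carbs | 0 | 3 | 0 | 0 | 2 |
| Treadmill | 1 | 0 | 0 | 2 | 0 |
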
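
The slides show a guess being refined until the product of the two matrices approaches the target. One standard way to do that refinement is the multiplicative update rule for non-negative matrix factorization, sketched below in plain Python. Whether the talk used exactly this rule is an assumption; `factorize`, `matmul`, `cost`, the iteration count, and the seed are illustrative. Following the slide labels, `w` is the features matrix (words by features) and `h` is the weight matrix (features by messages).

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(row) for row in zip(*m)]

def cost(v, guess):
    """Sum of squared differences between the target and the current guess."""
    return sum((x - y) ** 2 for rv, rg in zip(v, guess) for x, y in zip(rv, rg))

def factorize(v, k, iterations=200, seed=0):
    """Approximate v (words x messages) as w (words x k) times h (k x messages)."""
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    w = [[rng.random() for _ in range(k)] for _ in range(rows)]
    h = [[rng.random() for _ in range(cols)] for _ in range(k)]
    eps = 1e-9  # keeps the denominators away from zero
    for _ in range(iterations):
        # Update h: h * (w^T v) / (w^T w h), element by element.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(cols)]
             for i in range(k)]
        # Update w: w * (v h^T) / (w h h^T), element by element.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(matmul(w, h), ht)
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(rows)]
    return w, h

# The "Actual Matrix" from slide 70: rows are words, columns are messages.
v = [
    [2, 0, 0, 3, 0],  # Gym
    [0, 2, 1, 1, 3],  # Calorie
    [1, 0, 2, 0, 0],  # Weigh
    [0, 3, 0, 0, 2],  # Carbs
    [1, 0, 0, 2, 0],  # Treadmill
]
w, h = factorize(v, k=3)
print(cost(v, matmul(w, h)))
```

Because every update multiplies by a non-negative ratio, `w` and `h` stay non-negative, which is what lets each feature be read as an additive theme.
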
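• 75. Interpreting Features

Features Matrix:

| | F1 | F2 | F3 |
|---|---|---|---|
| Gym | 1 | 0 | 0 |
| Calorie | 0 | 1 | 1 |
| Weigh | 0 | 0 | 2 |
| Carbs | 0 | 1 | 0 |
| Treadmill | 1 | 0 | 0 |

Theme 1: Gym, Treadmill. Theme 2: Calorie, Carbs. Theme 3: Weigh, Calorie.

Weight Matrix:

| | Msg1 | Msg2 | Msg3 | Msg4 | Msg5 |
|---|---|---|---|---|---|
| F1 | 2 | 0 | 0 | 1 | 0 |
| F2 | 0 | 2 | 0 | 1 | 3 |
| F3 | 1 | 0 | 1 | 0 | 0 |

Msg1: Theme 1, Theme 3. Msg2: Theme 2. Msg3: Theme 3. etc.
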
• 76. “Diet and body” themes Calories Weight Atkins Fats Induction Protein South Chocolate Cholesterol Beach Black Carbs Coffee Olive Gym Broccoli Weights Exercise Running Cook Injured Recipe Fried Home Money Organic Want Best
• 77. Side note: NMF for faces
• 78. Methods covered: Regression trees, Hierarchical clustering, k-means clustering, Multidimensional scaling, Bayesian Classifier, Non-negative Matrix Factorization
• 79. Other ideas: Finance. Analysts already drowning in info; stories sometimes broken on blogs; message boards show sentiment; extremely low signal-to-noise ratio.
• 80. Other ideas: Product problems/ideas. Use support message boards; extract themes; understand recurring issues; learn what features people want.
• 81. Other ideas: Entertainment. How much buzz is a movie generating? What psychographic profiles like this type of movie? Of interest to studios and media investors.