
Mining Social Data for Fun and Insight

Speaker: Toby Segaran

Published in: Technology

Transcript of "Mining Social Data for Fun and Insight"

  1. 1. Social Data Mining Toby Segaran
  2. 2. About Me http://kiwitobes.com
  3. 3. What is data mining? Implicit Unknown Useful
  4. 4. What is data?
  5. 5. Data-mining traditional uses
  6. 6. Why it’s important now Data
  7. 7. Why it’s important now
  8. 8. Why it’s important now All products are actually sold on Amazon
  9. 9. Why it’s important now Facebook Google
  10. 10. Why it’s important now
  11. 11. For Social Insight Home Prices Blogs and News Movie Data Fashion Product Prices Hotties
  12. 12. Blogs…
  13. 13. The Technorati Top 100
  14. 14. Getting the content The Six Degrees Hypothesis Experienced It Is When You Travel
  15. 15. Building a Word Matrix
        Text:    The Six Degrees Hypothesis Experienced It Is When You Travel
        Words:   Six, Degrees, Hypothesis, Experienced, Travel
        Counts:  Six 3, Degrees 3, Hypothesis 1, Experienced 5, Travel 6
  16. 16. The Word Matrix
                           “china”  “kids”  “music”  “yahoo”  “travel”
        Gothamist             0        3        3        0        3
        GigaOM                6        0        1        2        4
        QuickOnlineTips       0        2        2       12        0
        O’Reilly Radar        1        0        3        4        6
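A word matrix like the one on this slide can be built by counting words per blog. The following is a minimal sketch, not the speaker's code: the function names, the tokenizing regex, and the two toy posts are all assumptions made for illustration.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count each run of letters/apostrophes as a word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def build_word_matrix(docs, vocabulary):
    """Return {document name: [count of each vocabulary word, in order]}."""
    matrix = {}
    for name, text in docs.items():
        counts = word_counts(text)
        matrix[name] = [counts[word] for word in vocabulary]
    return matrix


# Hypothetical toy posts, not real blog content.
docs = {
    "blog_a": "Kids love music. Music for kids!",
    "blog_b": "Yahoo and China: travel news from China",
}
vocabulary = ["china", "kids", "music", "yahoo", "travel"]
matrix = build_word_matrix(docs, vocabulary)
```

Each row of `matrix` is one blog and each column is one word, which is the layout the later distance and clustering slides rely on.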
  17. 17. Determining distance
                             “china”  “kids”  “music”  “yahoo”
        Gothamist               0        3        3        0
        GigaOM                  6        0        1        2
        Quick Online Tips       0        2        2       12
        Euclidean “as the crow flies” distance between GigaOM and Quick Online Tips:
        $\sqrt{(6-0)^2 + (0-2)^2 + (1-2)^2 + (2-12)^2} = \sqrt{141} \approx 12$
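The distance calculation on this slide can be checked with a few lines of Python. This is a small sketch; the function name `euclidean` is an assumption, and the vectors are the GigaOM and Quick Online Tips rows from the slide.

```python
import math


def euclidean(a, b):
    """Straight-line ("as the crow flies") distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


gigaom = [6, 0, 1, 2]             # "china", "kids", "music", "yahoo"
quick_online_tips = [0, 2, 2, 12]
distance = euclidean(gigaom, quick_online_tips)  # square root of 141
```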
  18. 18. Hierarchical Clustering Find the two closest items Combine them into a single item Repeat…
  19. 19. Hierarchical Algorithm
  20. 20. Hierarchical Algorithm
  21. 21. Hierarchical Algorithm
  22. 22. Hierarchical Algorithm
  23. 23. Hierarchical Algorithm
  24. 24. Dendrogram
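The find-closest, combine, repeat loop from the Hierarchical Clustering slide can be sketched as below. This is an illustrative version under stated assumptions, not the speaker's implementation: the names `euclidean` and `hcluster` are made up here, a merged item is represented by the plain average of its two vectors, and the result is a nested tuple standing in for the dendrogram. The input rows are the four blogs from the Word Matrix slide.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hcluster(rows):
    """rows: {name: vector}. Returns a nested tuple describing the merge order."""
    clusters = [(name, list(vec)) for name, vec in rows.items()]
    while len(clusters) > 1:
        # Find the two closest items.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (tree_i, vec_i), (tree_j, vec_j) = clusters[i], clusters[j]
        # Combine them into a single item (assumption: average the vectors).
        merged = ((tree_i, tree_j), [(x + y) / 2 for x, y in zip(vec_i, vec_j)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


rows = {
    "Gothamist": [0, 3, 3, 0, 3],
    "GigaOM": [6, 0, 1, 2, 4],
    "QuickOnlineTips": [0, 2, 2, 12, 0],
    "O'Reilly Radar": [1, 0, 3, 4, 6],
}
tree = hcluster(rows)
```

With these four rows, Gothamist and O'Reilly Radar merge first, GigaOM joins them next, and QuickOnlineTips joins last.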
  25. 25. Hierarchical Blog Clusters
  26. 26. Hierarchical Blog Clusters
  27. 27. Hierarchical Blog Clusters
  28. 28. Rotating the Matrix
        Words in a blog -> blogs containing each word
                  Gothamist   GigaOM   Quick Onl
        china         0          6         0
        kids          3          0         2
        music         3          1         2
        Yahoo         0          2        12
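Rotating the matrix is a transpose: rows of blogs become rows of words. A short sketch using the numbers from this slide (the variable names are assumptions):

```python
# One row per blog: counts for "china", "kids", "music", "yahoo".
by_blog = [
    [0, 3, 3, 0],   # Gothamist
    [6, 0, 1, 2],   # GigaOM
    [0, 2, 2, 12],  # Quick Online Tips
]

# zip(*rows) walks the columns, giving one row per word instead.
by_word = [list(column) for column in zip(*by_blog)]
```

Feeding `by_word` to the same clustering routine groups words that tend to appear in the same blogs.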
  29. 29. Hierarchical Word Clusters
  30. 30. K-Means Clustering
        Divides data into distinct clusters
        User determines how many
        Algorithm
          Start with arbitrary centroids
          Assign points to centroids
          Move the centroids
          Repeat
  31. 31. K-Means Algorithm
  32. 32. K-Means Algorithm
  33. 33. K-Means Algorithm
  34. 34. K-Means Algorithm
  35. 35. K-Means Algorithm
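The four k-means steps (start with centroids, assign points, move centroids, repeat) can be sketched as follows. This is a hedged illustration: the function names, the 2D toy points, and the choice to pass starting centroids explicitly (so the run is repeatable) are assumptions, not details from the talk.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, centroids, max_iter=100):
    """Return (assignment, centroids); assignment[i] is the cluster of points[i]."""
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clusters are stable
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical 2D points forming two obvious groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
assignment, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
```

In practice the starting centroids are usually random, so different runs can give different clusters.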
  36. 36. K-Means Results
        1                            2
        The Viral Garden             Wonkette
        Copyblogger                  Gawker
        Creating Passionate Users    Gothamist
        Oilman                       Huffington Post
        ProBlogger Blog Tips
        Seth's Blog
  37. 37. 2D Visualizations
        Instead of Clusters, a 2D Map
        Goals
          Preserve distances as much as possible
          Draw in two dimensions
        Dimension Reduction
          Principal Components Analysis
          Multidimensional Scaling
  38. 38. Multidimensional Scaling
  39. 39. Multidimensional Scaling
  40. 40. Multidimensional Scaling
  41. 41. Multidimensional Scaling
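One way to get a 2D map that preserves distances as much as possible is to start from random positions and repeatedly nudge points so their 2D distances approach the target distances. The sketch below takes that approach with plain gradient descent; it is an assumption-laden illustration rather than the method shown in the talk. The function names, learning rate, iteration count, and seed are all hypothetical choices, and the targets are the distances between the three blog rows from the "Determining distance" slide.

```python
import math
import random


def stress(coords, target):
    """Sum of squared gaps between 2D distances and target distances."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            total += (math.dist(coords[i], coords[j]) - target[i][j]) ** 2
    return total


def mds(target, rate=0.05, iterations=1000, seed=42):
    """Place len(target) points in 2D by gradient descent on the stress."""
    rng = random.Random(seed)
    n = len(target)
    coords = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(iterations):
        grad = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = max(math.dist(coords[i], coords[j]), 1e-9)
                err = d - target[i][j]
                for axis in range(2):
                    grad[i][axis] += 2 * err * (coords[i][axis] - coords[j][axis]) / d
        for i in range(n):
            for axis in range(2):
                coords[i][axis] -= rate * grad[i][axis]
    return coords


vectors = [[0, 3, 3, 0], [6, 0, 1, 2], [0, 2, 2, 12]]  # Gothamist, GigaOM, Quick Online Tips
target = [[math.dist(a, b) for b in vectors] for a in vectors]
layout = mds(target)
final_stress = stress(layout, target)
```

With only three items the distances can be matched almost exactly; with many items some distortion is unavoidable, which is why the goal is stated as "as much as possible".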
  42. 42. Zillow
  43. 43. The Zillow API Allows querying by address Returns information about the property Bedrooms Bathrooms Zip Code Price Estimate Last Sale Price
  44. 44. A home price dataset
        House   Zip     Bathrooms   Bedrooms   Built   Type      Price
        A       02138   1.5         2          1847    Single    505296
        B       02139   3.5         9          1916    Triplex   776378
        C       02140   3.5         4          1894    Duplex    595027
        D       02139   2.5         4          1854    Duplex    552213
        E       02138   3.5         5          1909    Duplex    947528
        F       02138   3.5         4          1930    Single    2107871
        etc..
  45. 45. What can we learn?
        A made-up house's price
        How important is Zip Code?
        What are the important attributes?
        Can we do better than averages?
  46. 46. Introducing Regression Trees
        A    B        Value
        10   Circle   20
        11   Square   22
        22   Square   8
        18   Circle   6
  47. 47. Introducing Regression Trees
        A    B        Value
        10   Circle   20
        11   Square   22
        22   Square   8
        18   Circle   6
  48. 48. Minimizing deviation
        Standard deviation is the “spread” of results
        Try all possible divisions
        Choose the division that decreases deviation the most
        A    B        Value
        10   Circle   20
        11   Square   22
        22   Square   8
        18   Circle   6
        Initially: Average = 14, Standard Deviation = 8.2
  49. 49. Minimizing deviation
        Standard deviation is the “spread” of results
        Try all possible divisions
        Choose the division that decreases deviation the most
        A    B        Value
        10   Circle   20
        11   Square   22
        22   Square   8
        18   Circle   6
        B = Circle: Average = 13, Standard Deviation = 9.9
        B = Square: Average = 15, Standard Deviation = 9.9
  50. 50. Minimizing deviation
        Standard deviation is the “spread” of results
        Try all possible divisions
        Choose the division that decreases deviation the most
        A    B        Value
        10   Circle   20
        11   Square   22
        22   Square   8
        18   Circle   6
        A > 18: Average = 8, Standard Deviation = 0
        A <= 18: Average = 16, Standard Deviation = 8.7
  51. 51. Minimizing deviation
        Standard deviation is the “spread” of results
        Try all possible divisions
        Choose the division that decreases deviation the most
        A    B        Value
        10   Circle   20
        11   Square   22
        22   Square   8
        18   Circle   6
        A > 11: Average = 7, Standard Deviation = 1.4
        A <= 11: Average = 21, Standard Deviation = 1.4
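The numbers on the "Minimizing deviation" slides can be reproduced with the sample standard deviation. The sketch below scores three of the candidate divisions and picks the one with the lowest spread. The helper names are assumptions, as are two scoring choices the slides do not spell out: a single-value group is given a spread of 0, and the two sides are combined as a size-weighted average.

```python
import statistics

# (A, B, Value) rows from the slides.
rows = [(10, "Circle", 20), (11, "Square", 22), (22, "Square", 8), (18, "Circle", 6)]


def spread(values):
    """Sample standard deviation; a single value counts as zero spread."""
    return statistics.stdev(values) if len(values) > 1 else 0.0


def score(rows, test):
    """Size-weighted average spread of the two sides of a division."""
    yes = [value for a, b, value in rows if test(a, b)]
    no = [value for a, b, value in rows if not test(a, b)]
    if not yes or not no:
        return None  # not a real division
    return (len(yes) * spread(yes) + len(no) * spread(no)) / len(rows)


candidates = {
    "B = Circle": lambda a, b: b == "Circle",
    "A > 18": lambda a, b: a > 18,
    "A > 11": lambda a, b: a > 11,
}
initial_spread = spread([value for _, _, value in rows])
scores = {name: score(rows, test) for name, test in candidates.items()}
best = min(scores, key=scores.get)
```

Here `best` is the `A > 11` division, whose two sides each have a standard deviation of about 1.4.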
  52. 52. CART Algorithm
        A    B        Value
        10   Circle   20
        11   Square   22
        22   Square   8
        18   Circle   6
  53. 53. CART Algorithm
        A    B        Value
        10   Circle   20
        11   Square   22
        22   Square   8
        18   Circle   6
  54. 54. CART Algorithm
        10   Circle   20        22   Square   8
        11   Square   22        18   Circle   6
  55. 55. CART Algorithm
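Applying the best-division step recursively gives a regression tree. The following is a compact sketch in that spirit, not a full CART implementation: the function names, the dictionary-based tree, the size-weighted spread score, and the `min_spread` stopping threshold are all assumptions for illustration.

```python
import statistics


def spread(values):
    return statistics.stdev(values) if len(values) > 1 else 0.0


def passes(row, column, value):
    """Numeric columns are tested with '>', other columns with '=='."""
    if isinstance(value, (int, float)):
        return row[column] > value
    return row[column] == value


def build_tree(rows, columns, target, min_spread=2.0):
    values = [r[target] for r in rows]
    best = None
    if spread(values) > min_spread:  # assumed stopping rule
        for column in columns:
            for value in sorted({r[column] for r in rows}):
                yes = [r for r in rows if passes(r, column, value)]
                no = [r for r in rows if not passes(r, column, value)]
                if not yes or not no:
                    continue
                score = (len(yes) * spread([r[target] for r in yes])
                         + len(no) * spread([r[target] for r in no])) / len(rows)
                if best is None or score < best[0]:
                    best = (score, column, value, yes, no)
    if best is None:
        return {"predict": statistics.mean(values)}  # leaf: average of the group
    _, column, value, yes, no = best
    return {"column": column, "value": value,
            "yes": build_tree(yes, columns, target, min_spread),
            "no": build_tree(no, columns, target, min_spread)}


def predict(tree, row):
    while "predict" not in tree:
        branch = "yes" if passes(row, tree["column"], tree["value"]) else "no"
        tree = tree[branch]
    return tree["predict"]


rows = [
    {"A": 10, "B": "Circle", "Value": 20},
    {"A": 11, "B": "Square", "Value": 22},
    {"A": 22, "B": "Square", "Value": 8},
    {"A": 18, "B": "Circle", "Value": 6},
]
tree = build_tree(rows, ["A", "B"], "Value")
guess = predict(tree, {"A": 20, "B": "Circle"})
```

On the four-row example the tree splits once, on `A > 11`, and predicts the average of each side.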
  56. 56. Zillow Results Bathrooms > 3 Zip: 02139? After 1903? Zip: 02140? Bedrooms > 4? Duplex? Triplex?
  57. 57. Just for Fun… Hot or Not
  58. 58. Variance dividers
        [Chart, vertical axis 0 to 9: Northeast vs. South (Low Variance Split); Male vs. Female (High Variance Split)]
  59. 59. Just for Fun… Hot or Not
  60. 60. Supervised and Unsupervised
        Clustering methods are unsupervised
          There are no answers
          Methods just characterize the data
          Show interesting patterns
        Regression Trees are supervised
          “answers” are in the dataset
          Tree models predict answers
  61. 61. Personal Ads
  62. 62. The Analysis Five Cities W4M Personal Ads
  63. 63. Bayesian filter
        Ad text: If you listen to NPR, watch Hardball, and love the Red Sox, you may be the guy for me. Please email me back. I'm a professional with a grad school degree who has a sense of humor and loves the Sox.
        Boston
          Sox            0.4
          Red            0.35
          Grad           0.2
          Professional   0.1
          Humor          0.1
  64. 64. Bayesian filter
        P(C | W) = P(C & W) / P(W)
        P(C & W): How often do the word and the city appear together?
        P(W): How often does the word appear overall…
        Rank these, and you have a list of the words most particular to a given city
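The formula P(C | W) = P(C & W) / P(W) can be estimated from counts of ads. The sketch below is an illustration under assumptions: the function name, the whitespace tokenizer, counting each word once per ad, and the four toy ads are all made up here and are not the data used in the talk.

```python
from collections import Counter, defaultdict


def city_given_word(ads):
    """ads: list of (city, text). Returns {city: {word: P(city | word)}}."""
    total = len(ads)
    word_ads = Counter()          # number of ads containing the word
    joint = defaultdict(Counter)  # number of ads from the city containing the word
    for city, text in ads:
        for word in set(text.lower().split()):
            word_ads[word] += 1
            joint[city][word] += 1
    return {
        city: {w: (n / total) / (word_ads[w] / total) for w, n in counts.items()}
        for city, counts in joint.items()
    }


# Hypothetical toy ads.
ads = [
    ("Boston", "love the sox"),
    ("Boston", "sox fan seeks humor"),
    ("Chicago", "cubs fan seeks humor"),
    ("Chicago", "love the cubs"),
]
probs = city_given_word(ads)
# Rank words by how particular they are to Boston.
boston_ranked = sorted(probs["Boston"], key=probs["Boston"].get, reverse=True)
```

Words that appear in every city score low, while words concentrated in one city rise to the top of that city's list.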
  65. 65. Results
        New York       Boston           Chicago
        Mets           Pink             Cubs
        Lounges        Sox              Burbs
        Offense        Poetry           Bears
        Desires        Intellectually   Girlie
        Musical        Punk             Insecure
        Submissive     Appreciation     Cheat
        Create         Exercise         Importance
        Song           Winter           Blunt
        Oral           Education        Mouth
  66. 66. Results
        Los Angeles     San Francisco
        Excellent       Tee
        Vegas           Employment
        Meaningful      Picnic
        Star            STD
        Lame            Tasting
        Industry        Hikes
        Heat            French
        Fitness         .com
        Entertainment   Kayaking
        Latino          Cycling
  67. 67. Newsgroup Discussion
  68. 68. Overlapping themes
  69. 69. Themes in a document
  70. 70. Another word matrix
        Actual Matrix   Msg1   Msg2   Msg3   Msg4   Msg5
        Gym              2      0      0      3      0
        Calorie          0      2      1      1      3
        Weigh            1      0      2      0      0
        Carbs            0      3      0      0      2
        Treadmill        1      0      0      2      0
  71. 71. Weights and features
        Features Matrix   F1   F2   F3
        Gym                0    1    2
        Calorie            2    0    1
        Weigh              2    2    1
        Carbs              1    0    3
        Treadmill          0    1    2
        x
        Weight Matrix     Msg1   M2   M3   M4   M5
        F1                 1     0    2    3    0
        F2                 0     2    1    1    3
        F3                 1     0    2    0    0
  72. 72. Matrix factorization
        Features Matrix   F1   F2   F3
        Gym                0    1    2
        Calorie            2    0    1
        Weigh              2    2    1
        Carbs              1    0    3
        Treadmill          0    1    2
        x
        Weight Matrix     Msg1   M2   M3   M4   M5
        F1                 1     0    2    3    0
        F2                 0     2    1    1    3
        F3                 1     0    2    0    0
        Current Guess     Msg1   Msg2   Msg3   Msg4   Msg5
        Gym                1      3      3      0      1
        Calorie            0      2      4      1      3
        Weigh              2      3      1      0      1
        Carbs              0      1      1      0      2
        Treadmill          3      2      0      2      2
  73. 73. Matrix factorization
        Features Matrix   F1   F2   F3
        Gym                0    1    2
        Calorie            2    0    1
        Weigh              2    2    1
        Carbs              1    0    3
        Treadmill          0    1    2
        x
        Weight Matrix     Msg1   M2   M3   M4   M5
        F1                 1     0    2    3    0
        F2                 0     2    1    1    3
        F3                 1     0    2    0    0
        Target Result     Msg1   Msg2   Msg3   Msg4   Msg5
        Gym                2      0      0      3      0
        Calorie            0      2      1      1      3
        Weigh              1      0      2      0      0
        Carbs              0      3      0      0      2
        Treadmill          1      0      0      2      0
        Current Guess     Msg1   Msg2   Msg3   Msg4   Msg5
        Gym                1      3      3      0      1
        Calorie            0      2      4      1      3
        Weigh              2      3      1      0      1
        Carbs              0      1      1      0      2
        Treadmill          3      2      0      2      2
  74. 74. Matrix factorization
        Features Matrix   F1   F2   F3
        Gym                1    0    0
        Calorie            0    1    1
        Weigh              0    0    2
        Carbs              0    1    0
        Treadmill          1    0    0
        x
        Weight Matrix     Msg1   M2   M3   M4   M5
        F1                 2     0    0    1    0
        F2                 0     2    0    1    3
        F3                 1     0    1    0    0
        Target Result     Msg1   Msg2   Msg3   Msg4   Msg5
        Gym                2      0      0      3      0
        Calorie            0      2      1      1      3
        Weigh              1      0      2      0      0
        Carbs              0      3      0      0      2
        Treadmill          1      0      0      2      0
        Current Guess     Msg1   Msg2   Msg3   Msg4   Msg5
        Gym                2      0      0      3      0
        Calorie            0      2      1      1      3
        Weigh              1      0      2      0      0
        Carbs              0      3      0      0      2
        Treadmill          1      0      0      2      0
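The idea of adjusting a features matrix and a weight matrix until their product approaches the target can be sketched with multiplicative update rules, a common way to do non-negative matrix factorization. This is an assumed technique for illustration; the talk does not say which update rule it used. The function names, the random starting values, the seed, and the iteration count are hypothetical, and the target is the "Actual Matrix" from the "Another word matrix" slide.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def sq_error(a, b):
    """Sum of squared differences between two equally sized matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def nmf(v, k, iterations=200, seed=0):
    """Factor v (words x messages) into features (words x k) and weights (k x messages)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    features = [[rng.random() for _ in range(k)] for _ in range(n)]
    weights = [[rng.random() for _ in range(m)] for _ in range(k)]
    eps = 1e-9  # avoids division by zero
    for _ in range(iterations):
        ft = transpose(features)
        num = matmul(ft, v)
        den = matmul(matmul(ft, features), weights)
        weights = [[weights[i][j] * num[i][j] / (den[i][j] + eps)
                    for j in range(m)] for i in range(k)]
        wt = transpose(weights)
        num = matmul(v, wt)
        den = matmul(features, matmul(weights, wt))
        features = [[features[i][j] * num[i][j] / (den[i][j] + eps)
                     for j in range(k)] for i in range(n)]
    return features, weights


target = [
    [2, 0, 0, 3, 0],  # Gym
    [0, 2, 1, 1, 3],  # Calorie
    [1, 0, 2, 0, 0],  # Weigh
    [0, 3, 0, 0, 2],  # Carbs
    [1, 0, 0, 2, 0],  # Treadmill
]
start_features, start_weights = nmf(target, 3, iterations=0)
start_error = sq_error(target, matmul(start_features, start_weights))
features, weights = nmf(target, 3)
error = sq_error(target, matmul(features, weights))
```

Because every update multiplies by a non-negative ratio, both matrices stay non-negative, which is what lets each feature be read as a theme made of words.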
  75. 75. Interpreting Features
        Features Matrix   F1   F2   F3
        Gym                1    0    0
        Calorie            0    1    1
        Weigh              0    0    2
        Carbs              0    1    0
        Treadmill          1    0    0
        Theme 1: Gym, Treadmill
        Theme 2: Calorie, Carbs
        Theme 3: Weigh, Calorie
        Weight Matrix     Msg1   M2   M3   M4   M5
        F1                 2     0    0    1    0
        F2                 0     2    0    1    3
        F3                 1     0    1    0    0
        Msg1: Theme 1, Theme 3
        Msg2: Theme 2
        Msg3: Theme 3
        etc.
  76. 76. “Diet and body” themes Calories Weight Atkins Fats Induction Protein South Chocolate Cholesterol Beach Black Carbs Coffee Olive Gym Broccoli Weights Exercise Running Cook Injured Recipe Fried Home Money Organic Want Best
  77. 77. Side note: NMF for faces
  78. 78. Methods covered Regression trees Hierarchical clustering k-means clustering Multidimensional scaling Bayesian Classifier Non-negative Matrix Factorization
  79. 79. Other ideas Finance Analysts already drowning in info Stories sometimes broken on blogs Message boards show sentiment Extremely low signal-to-noise ratio
  80. 80. Other ideas Product problems/ideas Use support message boards Extract themes Understand recurring issues Learn what features people want
  81. 81. Other ideas Entertainment How much buzz is a movie generating? What psychographic profiles like this type of movie? Of interest to studios and media investors