Mining Social Data for Fun and Insight
 


Speaker: Toby Segaran


    Mining Social Data for Fun and Insight: Presentation Transcript

    • Social Data Mining Toby Segaran
    • About Me http://kiwitobes.com
    • What is data mining? Implicit, Unknown, Useful
    • What is data?
    • Data-mining traditional uses
    • Why it’s important now Data
    • Why it’s important now
    • Why it’s important now All products are actually sold on Amazon
    • Why it’s important now Facebook Google
    • For Social Insight: Home Prices, Blogs and News, Movie Data, Fashion, Product Prices, Hotties
    • Blogs…
    • The Technorati Top 100
    • Getting the content The Six Degrees Hypothesis Experienced It Is When You Travel
    • Building a Word Matrix: text (The Six Degrees Hypothesis Experienced It Is When You Travel) -> words (Six, Degrees, Hypothesis, Experienced, Travel) -> counts (Six 3, Degrees 3, Hypothesis 1, Experienced 5, Travel 6)
    • The Word Matrix

          Blog                “china”  “kids”  “music”  “yahoo”  “travel”
          Gothamist              0       3       3        0        3
          GigaOM                 6       0       1        2        4
          Quick Online Tips      0       2       2       12        0
          O’Reilly Radar         1       0       3        4        6
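
A minimal sketch of how such a word matrix could be built, assuming each blog's text has already been fetched. The sample texts, the `word_counts` and `build_matrix` helpers, and the vocabulary are hypothetical and only illustrate the counting step.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text, pull out runs of letters, and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def build_matrix(docs, vocabulary):
    """Return one row of counts per document, one column per vocabulary word."""
    matrix = {}
    for name, text in docs.items():
        counts = word_counts(text)
        matrix[name] = [counts[word] for word in vocabulary]
    return matrix

# Hypothetical sample text; real input would be the content of each blog.
docs = {
    "blog_a": "Travel to China. China travel tips for kids.",
    "blog_b": "Music for kids, and more music.",
}
vocabulary = ["china", "kids", "music", "travel"]
matrix = build_matrix(docs, vocabulary)
```

Each row of `matrix` plays the role of one blog row in the slide's table.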
    • Determining distance

          Blog                “china”  “kids”  “music”  “yahoo”
          Gothamist              0       3       3        0
          GigaOM                 6       0       1        2
          Quick Online Tips      0       2       2       12

      Euclidean (“as the crow flies”) distance between GigaOM and Quick Online Tips:
      $\sqrt{(6-0)^2 + (0-2)^2 + (1-2)^2 + (2-12)^2} = \sqrt{141} \approx 12$
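
The same distance can be checked in a few lines. This is a sketch using the rows from the slide; the function name `euclidean` is an illustrative choice.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Rows taken from the slide ("china", "kids", "music", "yahoo").
gigaom = [6, 0, 1, 2]
quick_online_tips = [0, 2, 2, 12]

d = euclidean(gigaom, quick_online_tips)  # sqrt(141), roughly 11.87
```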
    • Hierarchical Clustering: find the two closest items; combine them into a single item; repeat…
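
The loop described on this slide can be sketched as follows. It assumes Euclidean distance and assumes that two merged items are represented by the size-weighted average of their vectors; the slides do not say how the combined item is computed, so that choice is an assumption. The blog rows come from the word matrix slide.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hcluster(rows, labels):
    """Repeatedly merge the two closest clusters; return a nested tuple of labels."""
    # Each cluster is (tree, vector, number_of_items).
    clusters = [(label, list(vec), 1) for label, vec in zip(labels, rows)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (ti, vi, ni), (tj, vj, nj) = clusters[i], clusters[j]
        # Assumption: the merged item is the weighted average of its members.
        merged_vec = [(a * ni + b * nj) / (ni + nj) for a, b in zip(vi, vj)]
        merged = ((ti, tj), merged_vec, ni + nj)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

labels = ["Gothamist", "GigaOM", "Quick Online Tips", "O'Reilly Radar"]
rows = [[0, 3, 3, 0, 3], [6, 0, 1, 2, 4], [0, 2, 2, 12, 0], [1, 0, 3, 4, 6]]
tree = hcluster(rows, labels)
```

The nested tuple in `tree` is the structure a dendrogram would draw: the innermost pair was merged first.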
    • Hierarchical Algorithm
    • Dendrogram
    • Hierarchical Blog Clusters
    • Rotating the Matrix: words in a blog -> blogs containing each word

                   Gothamist  GigaOM  Quick Onl
          china        0        6        0
          kids         3        0        2
          music        3        1        2
          Yahoo        0        2       12
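
Rotating the matrix is a transpose. A small sketch with the values from the slide; the variable names are illustrative.

```python
# One row per blog, columns "china", "kids", "music", "yahoo" (from the slide).
by_blog = [
    [0, 3, 3, 0],   # Gothamist
    [6, 0, 1, 2],   # GigaOM
    [0, 2, 2, 12],  # Quick Online Tips
]

# zip(*rows) pairs up the i-th entry of every row, giving one row per word.
by_word = [list(column) for column in zip(*by_blog)]
```

Running the same clustering code on `by_word` groups words instead of blogs.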
    • Hierarchical Word Clusters
    • K-Means Clustering: divides data into distinct clusters; user determines how many. Algorithm: start with arbitrary centroids; assign points to centroids; move the centroids; repeat
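
The four algorithm steps can be sketched in plain Python. As an assumption, the "arbitrary" starting centroids here are k randomly chosen data points, and the loop stops once assignments no longer change. The sample points are made up for illustration.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=100, seed=0):
    """Return (assignment, centroids) for k clusters."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering is stable
        assignment = new_assignment
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical 2D points forming two obvious groups.
points = [[1, 1], [1.5, 2], [1, 0.5], [8, 8], [9, 9.5], [8.5, 9]]
assignment, centroids = kmeans(points, k=2)
```

Cluster numbers are arbitrary; only which points share a number is meaningful.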
    • K-Means Algorithm
    • K-Means Results. Cluster 1: The Viral Garden, Copyblogger, Creating Passionate Users, Oilman, ProBlogger Blog Tips, Seth's Blog. Cluster 2: Wonkette, Gawker, Gothamist, Huffington Post
    • 2D Visualizations: instead of clusters, a 2D map. Goals: preserve distances as much as possible; draw in two dimensions. Dimension reduction: Principal Components Analysis, Multidimensional Scaling
    • Multidimensional Scaling
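
One simple way to approximate multidimensional scaling is to start from random 2D positions and nudge each point so that on-screen distances move toward the target distances. The sketch below assumes plain gradient descent on squared distance error; the function name `scale_down`, the learning rate, and the 3-4-5 triangle used as input are all illustrative choices, not taken from the slides.

```python
import math
import random

def scale_down(target, rate=0.05, iterations=2000, seed=0):
    """target: symmetric matrix of desired distances. Returns a list of [x, y]."""
    rng = random.Random(seed)
    n = len(target)
    loc = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(iterations):
        grad = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = loc[i][0] - loc[j][0]
                dy = loc[i][1] - loc[j][1]
                current = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                error = current - target[i][j]
                # Too far apart pulls i toward j; too close pushes it away.
                grad[i][0] += error * dx / current
                grad[i][1] += error * dy / current
        for i in range(n):
            loc[i][0] -= rate * grad[i][0]
            loc[i][1] -= rate * grad[i][1]
    return loc

# Hypothetical target distances: a 3-4-5 triangle, which fits exactly in 2D.
target = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
points = scale_down(target)
```

With real blog data the distances rarely fit perfectly in two dimensions, so the result is a best-effort map rather than an exact one.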
    • Zillow
    • The Zillow API: allows querying by address; returns information about the property: Bedrooms, Bathrooms, Zip Code, Price Estimate, Last Sale Price
    • A home price dataset

          House  Zip    Bathrooms  Bedrooms  Built  Type     Price
          A      02138     1.5        2      1847   Single    505296
          B      02139     3.5        9      1916   Triplex   776378
          C      02140     3.5        4      1894   Duplex    595027
          D      02139     2.5        4      1854   Duplex    552213
          E      02138     3.5        5      1909   Duplex    947528
          F      02138     3.5        4      1930   Single   2107871
          etc.
    • What can we learn? A made-up house's price. How important is Zip Code? What are the important attributes? Can we do better than averages?
    • Introducing Regression Trees

          A    B       Value
          10   Circle   20
          11   Square   22
          22   Square    8
          18   Circle    6
    • Minimizing deviation: standard deviation is the “spread” of results. Try all possible divisions and choose the division that decreases deviation the most.

          A    B       Value
          10   Circle   20
          11   Square   22
          22   Square    8
          18   Circle    6

          Initially:   Average = 14, Standard Deviation = 8.2
          B = Circle:  Average = 13, Standard Deviation = 9.9
          B = Square:  Average = 15, Standard Deviation = 9.9
          A > 18:      Average = 8,  Standard Deviation = 0
          A <= 18:     Average = 16, Standard Deviation = 8.7
          A > 11:      Average = 7,  Standard Deviation = 1.4
          A <= 11:     Average = 21, Standard Deviation = 1.4
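
The search over divisions can be sketched with the four rows from the slide. Two assumptions are made here that the slides do not spell out: a single-row group is given a spread of 0, and the two sides of a division are combined into one score by a size-weighted average of their standard deviations. The helper names are illustrative.

```python
from statistics import stdev

# (A, B, Value) rows from the slide.
rows = [(10, "Circle", 20), (11, "Square", 22), (22, "Square", 8), (18, "Circle", 6)]

def spread(values):
    """Sample standard deviation; a single value has no spread."""
    return stdev(values) if len(values) > 1 else 0.0

def split(rows, col, value):
    """Numeric columns split on <= value, text columns on == value."""
    if isinstance(value, str):
        inside = [r for r in rows if r[col] == value]
        outside = [r for r in rows if r[col] != value]
    else:
        inside = [r for r in rows if r[col] <= value]
        outside = [r for r in rows if r[col] > value]
    return inside, outside

def best_split(rows):
    """Try every column/value division and keep the one with the lowest spread."""
    best = None
    for col in (0, 1):
        for value in sorted({r[col] for r in rows}):
            inside, outside = split(rows, col, value)
            if not inside or not outside:
                continue
            # Assumption: weight each side's deviation by its share of the rows.
            score = (len(inside) * spread([r[2] for r in inside]) +
                     len(outside) * spread([r[2] for r in outside])) / len(rows)
            if best is None or score < best[0]:
                best = (score, col, value)
    return best

score, col, value = best_split(rows)  # picks column A with threshold 11
```

Applying `best_split` again to each side of the chosen division is what grows the tree.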
    • CART Algorithm

          A    B       Value
          10   Circle   20
          11   Square   22
          22   Square    8
          18   Circle    6

    • CART Algorithm: the rows after the first division

          10  Circle  20        22  Square   8
          11  Square  22        18  Circle   6
    • Zillow Results (tree nodes): Bathrooms > 3; Zip: 02139?; After 1903?; Zip: 02140?; Bedrooms > 4?; Duplex?; Triplex?
    • Just for Fun… Hot or Not
    • Variance dividers [Figure: bar charts on a 0 to 9 scale for Northeast vs. South and Male vs. Female, labeled “Low Variance Split” and “High Variance Split”]
    • Supervised and Unsupervised. Clustering methods are unsupervised: there are no answers; methods just characterize the data and show interesting patterns. Regression trees are supervised: “answers” are in the dataset; tree models predict answers
    • Personal Ads
    • The Analysis Five Cities W4M Personal Ads
    • Bayesian filter. Sample ad: “If you listen to NPR, watch Hardball, and love the Red Sox, you may be the guy for me. Please email me back. I'm a professional with a grad school degree who has a sense of humor and loves the Sox.” Word scores for Boston: Sox 0.4, Red 0.35, Grad 0.2, Professional 0.1, Humor 0.1
    • Bayesian filter: P( C | W ) = P (C & W) / P (W), where C is the city and W is the word. Numerator: how often do the word and the city appear together? Denominator: how often does the word appear overall? Rank these, and you have a list of the words most particular to a given city
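
A rough sketch of this ranking, assuming each ad is a (city, text) pair and that a word is counted at most once per ad. The sample ads and the function names `city_word_scores` and `top_words` are hypothetical.

```python
from collections import Counter, defaultdict

def city_word_scores(ads):
    """Return {city: {word: P(city | word)}} estimated from ad counts."""
    word_total = Counter()             # ads containing the word, any city
    word_city = defaultdict(Counter)   # ads containing the word, per city
    for city, text in ads:
        # Assumption: count each word once per ad, ignoring repeats.
        for word in set(text.lower().split()):
            word_total[word] += 1
            word_city[city][word] += 1
    return {
        city: {word: n / word_total[word] for word, n in counts.items()}
        for city, counts in word_city.items()
    }

def top_words(scores, city, n):
    """Words most particular to a city, highest P(city | word) first."""
    ranked = sorted(scores[city].items(), key=lambda item: -item[1])
    return [word for word, _ in ranked[:n]]

# Hypothetical ads, for illustration only.
ads = [
    ("Boston", "love the sox"),
    ("Boston", "sox and poetry"),
    ("Chicago", "love the cubs"),
    ("Chicago", "cubs and sox"),
]
scores = city_word_scores(ads)
```

On a real dataset, very rare words would score 1.0 by chance, so a minimum count threshold is usually worth adding.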
    • Results

          New York     Boston          Chicago
          Mets         Pink            Cubs
          Lounges      Sox             Burbs
          Offense      Poetry          Bears
          Desires      Intellectually  Girlie
          Musical      Punk            Insecure
          Submissive   Appreciation    Cheat
          Create       Exercise        Importance
          Song         Winter          Blunt
          Oral         Education       Mouth
    • Results

          Los Angeles     San Francisco
          Excellent       Tee
          Vegas           Employment
          Meaningful      Picnic
          Star            STD
          Lame            Tasting
          Industry        Hikes
          Heat            French
          Fitness         .com
          Entertainment   Kayaking
          Latino          Cycling
    • Newsgroup Discussion
    • Overlapping themes
    • Themes in a document
    • Another word matrix

          Actual Matrix
                      Msg1  Msg2  Msg3  Msg4  Msg5
          Gym           2     0     0     3     0
          Calorie       0     2     1     1     3
          Weigh         1     0     2     0     0
          Carbs         0     3     0     0     2
          Treadmill     1     0     0     2     0
    • Weights and features: Features Matrix x Weight Matrix

          Features Matrix
                      F1  F2  F3
          Gym          0   1   2
          Calorie      2   0   1
          Weigh        2   2   1
          Carbs        1   0   3
          Treadmill    0   1   2

          Weight Matrix
                Msg1  Msg2  Msg3  Msg4  Msg5
          F1      1     0     2     3     0
          F2      0     2     1     1     3
          F3      1     0     2     0     0
    • Matrix factorization: Features Matrix x Weight Matrix gives the Current Guess, which is compared with the Target Result

          Features Matrix
                      F1  F2  F3
          Gym          0   1   2
          Calorie      2   0   1
          Weigh        2   2   1
          Carbs        1   0   3
          Treadmill    0   1   2

          Weight Matrix
                Msg1  Msg2  Msg3  Msg4  Msg5
          F1      1     0     2     3     0
          F2      0     2     1     1     3
          F3      1     0     2     0     0

          Target Result
                      Msg1  Msg2  Msg3  Msg4  Msg5
          Gym           2     0     0     3     0
          Calorie       0     2     1     1     3
          Weigh         1     0     2     0     0
          Carbs         0     3     0     0     2
          Treadmill     1     0     0     2     0

          Current Guess
                      Msg1  Msg2  Msg3  Msg4  Msg5
          Gym           1     3     3     0     1
          Calorie       0     2     4     1     3
          Weigh         2     3     1     0     1
          Carbs         0     1     1     0     2
          Treadmill     3     2     0     2     2
    • Matrix factorization: the Current Guess now matches the Target Result

          Features Matrix
                      F1  F2  F3
          Gym          1   0   0
          Calorie      0   1   1
          Weigh        0   0   2
          Carbs        0   1   0
          Treadmill    1   0   0

          Weight Matrix
                Msg1  Msg2  Msg3  Msg4  Msg5
          F1      2     0     0     1     0
          F2      0     2     0     1     3
          F3      1     0     1     0     0

          Target Result and Current Guess
                      Msg1  Msg2  Msg3  Msg4  Msg5
          Gym           2     0     0     3     0
          Calorie       0     2     1     1     3
          Weigh         1     0     2     0     0
          Carbs         0     3     0     0     2
          Treadmill     1     0     0     2     0
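
The slides do not say how the guess is improved from one step to the next. One common approach for non-negative matrix factorization is the multiplicative update rule, sketched below in plain Python as an assumption rather than as the method used in the talk. The target matrix is the one from the slides; the function names, iteration count, and random starting values are illustrative.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def cost(target, guess):
    """Sum of squared differences between target and current guess."""
    return sum((t - g) ** 2 for rt, rg in zip(target, guess) for t, g in zip(rt, rg))

def update(m, numer, denom, eps=1e-9):
    """Element-wise m * numer / denom, which keeps every entry non-negative."""
    return [[x * n / (d + eps) for x, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(m, numer, denom)]

def factorize(target, k, iterations=200, seed=0):
    """Return (features, weights) with features x weights approximating target."""
    rng = random.Random(seed)
    rows, cols = len(target), len(target[0])
    features = [[rng.random() for _ in range(k)] for _ in range(rows)]  # words x k
    weights = [[rng.random() for _ in range(cols)] for _ in range(k)]   # k x messages
    for _ in range(iterations):
        ft = transpose(features)
        weights = update(weights, matmul(ft, target),
                         matmul(matmul(ft, features), weights))
        wt = transpose(weights)
        features = update(features, matmul(target, wt),
                          matmul(features, matmul(weights, wt)))
    return features, weights

# Target matrix from the slides: rows Gym, Calorie, Weigh, Carbs, Treadmill.
target = [[2, 0, 0, 3, 0], [0, 2, 1, 1, 3], [1, 0, 2, 0, 0],
          [0, 3, 0, 0, 2], [1, 0, 0, 2, 0]]
features, weights = factorize(target, k=3)
guess = matmul(features, weights)
```

With only three features the guess generally approximates the target rather than reproducing it exactly; `cost(target, guess)` measures how close it gets.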
    • Interpreting Features

          Features Matrix
                      F1  F2  F3
          Gym          1   0   0
          Calorie      0   1   1
          Weigh        0   0   2
          Carbs        0   1   0
          Treadmill    1   0   0

          Theme 1: Gym, Treadmill
          Theme 2: Calorie, Carbs
          Theme 3: Weigh, Calorie

          Weight Matrix
                Msg1  Msg2  Msg3  Msg4  Msg5
          F1      2     0     0     1     0
          F2      0     2     0     1     3
          F3      1     0     1     0     0

          Msg1: Theme 1, Theme 3
          Msg2: Theme 2
          Msg3: Theme 3
          etc.
    • “Diet and body” themes Calories Weight Atkins Fats Induction Protein South Chocolate Cholesterol Beach Black Carbs Coffee Olive Gym Broccoli Weights Exercise Running Cook Injured Recipe Fried Home Money Organic Want Best
    • Side note: NMF for faces
    • Methods covered: regression trees, hierarchical clustering, k-means clustering, multidimensional scaling, Bayesian classifier, non-negative matrix factorization
    • Other ideas: Finance. Analysts already drowning in info; stories sometimes broken on blogs; message boards show sentiment; extremely low signal-to-noise ratio
    • Other ideas: Product problems/ideas. Use support message boards; extract themes; understand recurring issues; learn what features people want
    • Other ideas: Entertainment. How much buzz is a movie generating? What psychographic profiles like this type of movie? Of interest to studios and media investors