Invited talk at SBP 2012: Intl. Conf. on Social Computing, Behavioral-Cultural Modeling, & Prediction (April 3, 2012). Based on paper by Ryu, Lease, and Woodward, to appear at ACM HyperText 2012. Joint work with Hohyon Ryu and Nicholas Woodward.
Discovering and Navigating Memes in Social Media
Matt Lease
School of Information
University of Texas at Austin
ml@ischool.utexas.edu
@mattlease
Joint Work with
Hohyon Ryu & Nicholas Woodward
Paper to appear at HyperText 2012: 23rd ACM Conference on Hypertext and Social Media
April 3, 2012 SBP 2012: Intl. Conf. on Social Computing, Behavioral-Cultural Modeling, & Prediction 2
Critical Reading (Literacy)
• Context-awareness (how work is situated)
– Related works, Time/Place, Author…
• Recognizing & questioning
– Sources of Influence
– Positions, Assumptions, Bias, …
• New challenges online
– Scale, authorship, citing of sources, borrowing…
• Traditional approach: education
Inspiration #1: Living Stories
livingstories.googlelabs.com
Memes
• Similar phrases found across multiple sources
– Includes multiple phrasings of same idea
• Re-use reveals implicit network
– Sources, Individuals, Communities
– Patterns of re-use reinforce links
• Questions
– Re-use?
– Intended re-use?
– Visible (quoted)?
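A minimal sketch of how phrase re-use can expose an implicit network of sources: each phrase shared by two sources adds weight to the link between them. The function name `reuse_network` and the sample data are hypothetical illustrations, not taken from the talk.

```python
from collections import defaultdict
from itertools import combinations

def reuse_network(phrase_sources):
    """Count, for each pair of sources, how many phrases they share.

    phrase_sources maps a phrase to the sources it appears in
    (hypothetical input format, assumed for illustration).
    """
    edges = defaultdict(int)
    for sources in phrase_sources.values():
        # Every pair of sources using the same phrase gets a stronger link.
        for a, b in combinations(sorted(set(sources)), 2):
            edges[(a, b)] += 1
    return dict(edges)

phrase_sources = {
    "to be or not to be": ["blog_a", "blog_b", "blog_c"],
    "a penny saved": ["blog_a", "blog_b"],
    "then he said": ["blog_c"],
}
network = reuse_network(phrase_sources)
```

Here `blog_a` and `blog_b` share two phrases, so their link is weighted more heavily than the links involving `blog_c`.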
Inspiration #2: Meme Tracker
Where Repeated Text Occurs
• Intended Re-use
– Visible (Quotation): “to be or not to be”
• Leskovec et al., KDD’09 (memetracker.org)
– Hidden: e.g. plagiarism, false plurality
– Unmarked
• Near-Duplicate documents
• Boilerplate: All rights reserved
• Common adage: …a penny saved…
• Style, genre, laziness, …
• Accidental borrowing
• Shared context (e.g. named entities)
– E.g. named entities: S. Skiena et al., Stony Brook (textmap.com)
• Chance (e.g. …then he said…)
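Near-duplicate documents are one source of repeated text listed above. The slides do not say how they are detected; the sketch below swaps in a common textbook technique, word shingling with Jaccard similarity. The shingle size `k=3` is an assumption chosen for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in a text (k=3 is an assumption)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
similarity = jaccard(shingles(doc1), shingles(doc2))
```

Two documents whose similarity exceeds some chosen threshold would be treated as near-duplicates; the threshold itself would have to be tuned on real data.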
Data
• TREC Blogs08 Collection
– http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html
– 28M permalinks (January 2008 – January 2009)
– 250 GB compressed
• ICWSM 2009 Spinn3r Blog Dataset
– http://www.icwsm.org/data/
– 44 million blog posts (August - September, 2008)
– 27 GB compressed
• ICWSM 2011 Spinn3r Blog Dataset
Inspiration #3: Popular Passages
• Kolak & Schilit, HyperText’08
• Find re-use in scanned books
– Find repeated phrases
– Group related phrases
– Rank passages
– MapReduce processing architecture
• Browsing interface with generated links
• Issues: data/task, locality, details, scalability
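Finding repeated phrases fits the map/reduce pattern naturally. The following is a single-machine sketch of that pattern, not the architecture from Kolak & Schilit or from this talk: the phrase length `n=4`, the `min_docs` cutoff, and both function names are assumptions.

```python
from collections import defaultdict

def map_phrases(doc_id, text, n=4):
    """Map step: emit (phrase, doc_id) for every n-word phrase in a document."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n]), doc_id

def reduce_phrases(pairs, min_docs=2):
    """Reduce step: group by phrase, keep phrases seen in >= min_docs documents."""
    docs_by_phrase = defaultdict(set)
    for phrase, doc_id in pairs:
        docs_by_phrase[phrase].add(doc_id)
    return {p: d for p, d in docs_by_phrase.items() if len(d) >= min_docs}

docs = {
    "d1": "she said to be or not to be today",
    "d2": "to be or not to be is the question",
    "d3": "nothing in common here at all",
}
pairs = [kv for doc_id, text in docs.items() for kv in map_phrases(doc_id, text)]
common = reduce_phrases(pairs)
```

Overlapping phrases such as "to be or not" and "be or not to" both survive here; grouping them into one passage is the job of a later step.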
Processing Architecture
• Blogs08 Test Collection: 28M posts, 1.4TB
• Preprocessing (Pseudo-MapReduce)
– Decruft & Language Identification
– HTML Strip & Near-Duplicate Detection
– 16M posts, 960GB
• Common Phrase Extraction (3 MapReduce Stages)
– 15K posts, 43GB
• Common Phrase Ranking (1 MapReduce Process)
– Daily Top 200 Phrases
– 6.2M phrases, 2GB
• Common Phrase Clustering (1 MapReduce Process)
– 75K phrases, 2.6MB
• Meme Browser: 68K memes
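The ranking stage keeps a daily top 200 phrases. The slides do not state the ranking criterion, so this sketch assumes raw daily frequency; `daily_top_phrases` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def daily_top_phrases(occurrences, k=200):
    """Return the k most frequent phrases per day.

    occurrences: iterable of (day, phrase) pairs (assumed input format).
    Ranking by raw count is an assumption, not the talk's stated method.
    """
    counts = defaultdict(Counter)
    for day, phrase in occurrences:
        counts[day][phrase] += 1
    return {day: [p for p, _ in c.most_common(k)] for day, c in counts.items()}

occurrences = [
    ("2008-01-14", "a penny saved"),
    ("2008-01-14", "a penny saved"),
    ("2008-01-14", "a penny saved"),
    ("2008-01-14", "all rights reserved"),
    ("2008-01-14", "all rights reserved"),
    ("2008-01-14", "then he said"),
    ("2008-01-15", "then he said"),
]
top = daily_top_phrases(occurrences, k=2)
```

With `k=2`, the least frequent phrase on the first day is dropped, while the second day keeps its only phrase.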
Meme Browser
Efficiency: Meme Clustering
• From WEKA ARFF format to sparse representation
– From ~96 hours to 11 hours
• Indexed vs. un-indexed
– From 11 hours to 16 minutes (single core)
– From 34 minutes to 3 minutes (136 cores)
• Distributed vs. single core
– From 11 hours to 34 minutes (un-indexed)
– From 16 minutes to 3 minutes (indexed)
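One plausible reading of why a sparse representation plus an index helps so much: an inverted index lets clustering compare only phrases that share at least one term, instead of every pair. This is an illustrative sketch under that assumption, not the authors' implementation; `candidate_pairs` and the sample vectors are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(vectors):
    """Return pairs of phrase ids that share at least one term.

    vectors: {phrase_id: {term: weight}}, a sparse representation that
    stores only non-zero terms (assumed format).
    """
    index = defaultdict(set)  # inverted index: term -> phrase ids
    for pid, vec in vectors.items():
        for term in vec:
            index[term].add(pid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

vectors = {
    "p1": {"penny": 1.0, "saved": 1.0},
    "p2": {"penny": 1.0, "earned": 1.0},
    "p3": {"rights": 1.0, "reserved": 1.0},
    "p4": {"all": 1.0, "rights": 1.0, "reserved": 1.0},
}
pairs = candidate_pairs(vectors)
```

Only 2 of the 6 possible pairs remain as candidates here; similarity would then be computed for those pairs alone.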
Thank You!
Matt Lease
ml@ischool.utexas.edu
www.ischool.utexas.edu/~ml
@mattlease
Joint Work with
– Hohyon (Will) Ryu
– Nicholas Woodward
Support
• FCT of Portugal / UT CoLab
• Amazon Web Services
• UT Austin LIFT Award
• John P. Commons Fellowship
Meme Browser: odyssey.ischool.utexas.edu/mb