Discovering Memes in Social Media

Talk at the ACM SIGKDD - Austin Chapter Meeting, March 21, 2012. Paper by Hohyon Ryu, Matthew Lease, and Nicholas Woodward, at the 23rd ACM Conference on Hypertext and Social Media, 2012.

  1. Discovering Memes in Social Media
     Matt Lease, School of Information, University of Texas at Austin
     ml@ischool.utexas.edu, @mattlease
     Joint work with Hohyon Ryu & Nicholas Woodward
     Research paper to appear at the 23rd ACM Conference on Hypertext and Social Media, 2012
  2. Memes
     • Short, similar phrases found in many different sources
       – Re-use, shared temporal context
     • Evolutionary mutation & propagation as they transmit from source to source
     • Reveals implicit connections between sources, individuals and communities involved
     March 21, 2012, ACM SIGKDD - Austin Chapter Meeting
  3. MemeBrowser & Critical Literacy
  4. Google/NYT Living Stories: livingstories.googlelabs.com
  5. Related Work
     • Jure Leskovec et al. (KDD’09): blogs
       – Quotations only: http://memetracker.org
     • Steven Skiena, Stony Brook NY: blogs
       – Named entities only: http://www.textmap.com
     • O. Kolak and B. Schilit (HT’08): scanned books
       – Mine “popular passages” from complete texts
       – MapReduce “shingling” approach
       – Popular passages found are local, not global
  6. MapReduce @ UT
     • UT LIFT Award to Lease, Baldridge, & Xu in Sept.’10
     • New hard disks @ TACC Longhorn installed Dec.’10
       – 48 Dell R610 nodes
         • 2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz
         • 48GB RAM with ~1.5TB disk per node
         • With 1 NameNode & 47 DataNodes, up to 376 parallel Mappers
       – 16 Dell R710 (same CPU configuration)
         • 144GB RAM with ~0.8TB disk per node
       – Set up Hadoop, testing, benchmarking, etc.
     • Baldridge & Lease teach MapReduce class Fall’11
  7. Datasets
     • TREC Blogs08 Collection
       – http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html
       – 28M permalinks (January 2008 – January 2009)
       – 250 GB compressed
     • ICWSM 2009 Spinn3r Blog Dataset
       – http://www.icwsm.org/data/
       – 44 million blog posts (August - September, 2008)
       – 27 GB compressed
     • ICWSM 2011 Spinn3r Blog Dataset
  8. Processing Architecture (diagram labels, in the order they appear on the slide)
     • Blogs08 Test Collection: 28M posts, 1.4TB
     • Preprocessing (Pseudo-MapReduce)
       – Decruft & Language Identification
       – HTML Strip & Near-Duplicate Detection
     • 16M posts, 960GB
     • Common Phrase Extraction
     • 15K posts, 43GB
     • 3 MapReduce Stages
     • Common Phrase Ranking: Daily Top 200 Phrases
     • 6.2M phrases, 2GB
     • 1 MapReduce Process
     • Common Phrase Clustering
     • 75K phrases, 2.6MB
     • 1 MapReduce Process
     • Meme Browser
     • 68K memes
  9. Creating the Shingle Table
     • e.g. trigram shingles for: what do you think of
       – what do you
       – do you think
       – you think of
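The trigram example on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `shingles` and `build_shingle_table`, the lowercasing and whitespace tokenization, and the choice to store `(doc_id, position)` pairs are all assumptions for illustration, not details taken from the talk.

```python
from collections import defaultdict


def shingles(text, n=3):
    """Return the word n-gram shingles of `text` in reading order."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_shingle_table(docs, n=3):
    """Map each shingle to the (doc_id, position) pairs where it occurs.

    `docs` is a dict of doc_id -> text (a hypothetical input format).
    """
    table = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, shingle in enumerate(shingles(text, n)):
            table[shingle].append((doc_id, pos))
    return dict(table)


if __name__ == "__main__":
    print(shingles("what do you think of"))
    table = build_shingle_table({
        "a": "what do you think of",
        "b": "so what do you think",
    })
    print(table["what do you"])
```

Keeping the position alongside the document id is one possible design choice; it makes it easy to tell later which shingles are adjacent within a document.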
  10. Grouping Shingles by Document
      • Mapper: trivial grouping; Reducer: Identity
  11. Common Phrase (CP) Detection
      • Mapper: Merge adjacent shingles into memes (ignoring small gaps)
      • Reducer: Find set of documents in which each meme occurs
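The mapper and reducer roles described on this slide might look roughly like the sketch below, written as plain Python functions rather than actual Hadoop code. The names `merge_common_shingles` and `documents_per_phrase`, the `max_gap` parameter, and the assumption that the "common" shingle positions are already known for each document are all hypothetical; the slide does not say how gaps are measured.

```python
from collections import defaultdict


def merge_common_shingles(words, common_positions, n=3, max_gap=1):
    """Mapper-side sketch: merge runs of common n-gram shingles into phrases.

    `common_positions` holds the sorted start indices (into `words`) of the
    shingles considered common. Consecutive positions are merged when at most
    `max_gap` shingle positions are skipped between them.
    """
    phrases = []
    if not common_positions:
        return phrases
    start = prev = common_positions[0]
    for pos in common_positions[1:]:
        if pos - prev > max_gap + 1:
            # Gap too large: close the current phrase and start a new one.
            phrases.append(" ".join(words[start:prev + n]))
            start = pos
        prev = pos
    phrases.append(" ".join(words[start:prev + n]))
    return phrases


def documents_per_phrase(pairs):
    """Reducer-side sketch: collect the set of documents for each phrase.

    `pairs` is an iterable of (phrase, doc_id) tuples emitted by the mapper.
    """
    docs = defaultdict(set)
    for phrase, doc_id in pairs:
        docs[phrase].add(doc_id)
    return dict(docs)


if __name__ == "__main__":
    words = "i asked what do you think of the new plan today".split()
    print(merge_common_shingles(words, [2, 3, 4, 8]))
    print(documents_per_phrase([("what do you think of", "a"),
                                ("what do you think of", "b")]))
```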
  12. Ranking Memes
  13. Clustering Memes
      • Mapper: Single-link hierarchical clustering with cosine similarity
      • Reducer: create/merge clusters
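To make the idea of single-link clustering with cosine similarity concrete, here is a small single-machine sketch. It is not the MapReduce implementation from the talk. It assumes term-frequency vectors over whitespace tokens and a fixed similarity cutoff (`threshold`, a hypothetical parameter), and it uses union-find, relying on the fact that single-link clusters at a fixed cutoff are the connected components of the "similar enough" graph.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def single_link_clusters(phrases, threshold=0.5):
    """Group phrases so that any pair with cosine >= threshold shares a cluster."""
    vectors = [Counter(p.lower().split()) for p in phrases]
    parent = list(range(len(phrases)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare all pairs; fine for a sketch, quadratic in the number of phrases.
    for i, j in combinations(range(len(phrases)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, phrase in enumerate(phrases):
        clusters.setdefault(find(i), []).append(phrase)
    return list(clusters.values())


if __name__ == "__main__":
    demo = ["yes we can", "yes we can change",
            "we can change the world", "lipstick on a pig"]
    for cluster in single_link_clusters(demo, threshold=0.6):
        print(cluster)
```

With a cutoff of 0.6, the first and third demo phrases are not directly similar enough, but both are linked through the second one. That chaining behaviour is what distinguishes single-link from complete-link clustering.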
  14. Efficiency: Meme Clustering
      • From WEKA ARFF format to sparse representation
        – From ~96 hours to 11 hours
      • Indexed vs. un-indexed
        – From 11 hours to 16 minutes (single core)
        – From 34 minutes to 3 minutes (136 cores)
      • Distributed vs. single core
        – From 11 hours to 34 minutes (un-indexed)
        – From 16 minutes to 3 minutes (indexed)
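The timings on this slide translate into speedup factors with simple division. The snippet below only does that arithmetic on the numbers quoted above; the dictionary labels are mine, the 96-hour figure is approximate on the slide, and the final "overall" figure assumes the three optimizations compose end to end, which the slide suggests but does not state outright.

```python
# (before, after) running times in minutes, as quoted on the slide.
timings_minutes = {
    "ARFF to sparse representation": (96 * 60, 11 * 60),
    "indexing, single core": (11 * 60, 16),
    "indexing, 136 cores": (34, 3),
    "distribution, un-indexed": (11 * 60, 34),
    "distribution, indexed": (16, 3),
}

# Speedup factor = time before / time after.
speedups = {name: before / after
            for name, (before, after) in timings_minutes.items()}

# Assumption: ~96 hours (ARFF, single core, un-indexed) and 3 minutes
# (sparse, indexed, 136 cores) are the two ends of the same pipeline.
overall_speedup = (96 * 60) / 3

if __name__ == "__main__":
    for name, factor in speedups.items():
        print(f"{name}: about {factor:.1f}x")
    print(f"overall (assumed end to end): about {overall_speedup:.0f}x")
```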
  15. Meme Browser: Original Interface
  16. Meme Browser: Current Interface
  17. Meme Evolution (Leskovec et al.’09)
  18. Thank You!
      Matt Lease: ml@ischool.utexas.edu, www.ischool.utexas.edu/~ml, @mattlease
      • Joint Work with
        – Hohyon (Will) Ryu
          • InfoChimps (Summer’11)
          • Indeed.com (Summer’12)
        – Nicholas Woodward (TACC)
          • Latin American Network Information Center (LANIC)
      • Support
        – FCT of Portugal / UT CoLab
        – Amazon Web Services
        – UT Austin LIFT Award
        – John P. Commons Fellowship
