Discovering Memes in Social Media

Talk at the ACM SIGKDD - Austin Chapter Meeting, March 21, 2012. Paper by Hohyon Ryu, Matthew Lease, and Nicholas Woodward, at the 23rd ACM Conference on Hypertext and Social Media, 2012.

Transcript

  • 1. Discovering Memes in Social Media
      Matt Lease, School of Information, University of Texas at Austin
      ml@ischool.utexas.edu, @mattlease
      Joint work with Hohyon Ryu & Nicholas Woodward
      Research paper to appear at the 23rd ACM Conference on Hypertext and Social Media, 2012
  • 2. Memes
      • Short, similar phrases found in many different sources
          – Re-use, shared temporal context
      • Evolutionary mutation & propagation as they transmit from source to source
      • Reveal implicit connections between the sources, individuals, and communities involved
      March 21, 2012, ACM SIGKDD - Austin Chapter Meeting
  • 3. MemeBrowser & Critical Literacy
  • 4. Google/NYT Living Stories: livingstories.googlelabs.com
  • 5. Related Work
      • Jure Leskovec et al. (KDD’09): blogs
          – Quotations only: http://memetracker.org
      • Steven Skiena, Stony Brook NY: blogs
          – Named entities only: http://www.textmap.com
      • O. Kolak and B. Schilit (HT’08): scanned books
          – Mine “popular passages” from complete texts
          – MapReduce “shingling” approach
          – Popular passages found are local, not global
  • 6. MapReduce @ UT
      • UT LIFT Award to Lease, Baldridge, & Xu in Sept.’10
      • New hard disks @ TACC Longhorn installed Dec.’10
          – 48 Dell R610 nodes
              • 2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz
              • 48GB RAM with ~1.5TB disk per node
              • With 1 NameNode & 47 DataNodes, up to 376 parallel Mappers (47 nodes × 8 cores)
          – 16 Dell R710 (same CPU configuration)
              • 144GB RAM with ~0.8TB disk per node
          – Hadoop setup, testing, benchmarking, etc.
      • Baldridge & Lease teach MapReduce class Fall’11
  • 7. Datasets
      • TREC Blogs08 Collection
          – http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html
          – 28M permalinks (January 2008 – January 2009)
          – 250GB compressed
      • ICWSM 2009 Spinn3r Blog Dataset
          – http://www.icwsm.org/data/
          – 44 million blog posts (August – September 2008)
          – 27GB compressed
      • ICWSM 2011 Spinn3r Blog Dataset
  • 8. Processing Architecture (pipeline diagram, labels in slide order)
      • Blogs08 Test Collection: 28M posts, 1.4TB
      • Preprocessing (Pseudo-MapReduce)
          – Decruft & Language Identification
          – HTML Strip & Near-Duplicate Detection
      • 16M posts, 960GB
      • Common Phrase Extraction
      • 15K posts, 43GB
      • 3 MapReduce Stages
      • Common Phrase Ranking: Daily Top 200 Phrases
      • 6.2M phrases, 2GB
      • 1 MapReduce Process
      • Common Phrase Clustering
      • 75K phrases, 2.6MB
      • 1 MapReduce Process
      • Meme Browser: 68K memes
  • 9. Creating the Shingle Table
      • e.g. trigram shingles for: what do you think of
          – what do you
          – do you think
          – you think of
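
The trigram example on slide 9 can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the MapReduce job from the talk: the function names `word_shingles` and `build_shingle_table`, the document ids, and the second sample sentence are all assumptions made for the example, and the table layout (shingle mapped to the documents containing it) is a guess at what the slide's figure showed.

```python
from collections import defaultdict


def word_shingles(text, n=3):
    """Return the overlapping n-word shingles of text, in order."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_shingle_table(docs, n=3):
    """Map each shingle to the sorted list of document ids containing it."""
    table = defaultdict(set)
    for doc_id, text in docs.items():
        for shingle in word_shingles(text, n):
            table[shingle].add(doc_id)
    return {shingle: sorted(ids) for shingle, ids in table.items()}


# Hypothetical two-document collection; "d1" is the phrase from the slide.
docs = {
    "d1": "what do you think of",
    "d2": "so what do you think",
}
print(word_shingles(docs["d1"]))
print(build_shingle_table(docs)["what do you"])
```

Shingles that map to more than one document are the candidates for shared phrases; shingles seen in a single document can be dropped early.
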
  • 10. Grouping Shingles by Document
      • Mapper: trivial grouping
      • Reducer: identity
  • 11. Common Phrase (CP) Detection
      • Mapper: merge adjacent shingles into memes (ignoring small gaps)
      • Reducer: find the set of documents in which each meme occurs
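
One way to picture the mapper and reducer on slide 11 is the in-memory sketch below. It is an illustration under stated assumptions rather than the authors' code: `merge_shingles`, `phrase_documents`, the `max_gap` parameter, and the sample documents are hypothetical, and the input is assumed to be the word offsets at which a shared shingle starts in each document.

```python
from collections import defaultdict


def merge_shingles(words, common_starts, n=3, max_gap=1):
    """Mapper-style step: merge shared n-word shingles that overlap, touch,
    or sit within max_gap words of each other into longer phrases."""
    phrases = []
    start = end = None  # current phrase covers words[start:end]
    for pos in sorted(common_starts):
        if start is not None and pos <= end + max_gap:
            end = max(end, pos + n)  # extend the current phrase
        else:
            if start is not None:
                phrases.append(" ".join(words[start:end]))
            start, end = pos, pos + n  # open a new phrase
    if start is not None:
        phrases.append(" ".join(words[start:end]))
    return phrases


def phrase_documents(doc_phrases):
    """Reducer-style step: collect the set of documents for each phrase."""
    index = defaultdict(set)
    for doc_id, phrases in doc_phrases.items():
        for phrase in phrases:
            index[phrase].add(doc_id)
    return index


# Hypothetical documents and shared-shingle offsets.
docs = {
    "d1": "so what do you think of this",
    "d2": "tell me what do you think of it",
}
common = {"d1": [1, 2, 3], "d2": [2, 3, 4]}
doc_phrases = {d: merge_shingles(docs[d].split(), common[d]) for d in docs}
print(dict(phrase_documents(doc_phrases)))
```

With `max_gap=0` only overlapping or directly adjacent shingles are merged; a larger value lets a phrase absorb a few unshared words, which is one reading of "ignoring small gaps".
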
  • 12. Ranking Memes
  • 13. Clustering Memes
      • Mapper: single-link hierarchical clustering with cosine similarity
      • Reducer: create/merge clusters
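
As a rough single-machine illustration of slide 13, the sketch below clusters phrases with single-link clustering over cosine similarity of bag-of-words vectors. Several things here are assumptions, not details from the talk: the word-count representation, the fixed similarity `threshold` (cutting a single-link hierarchy at one level is equivalent to taking connected components, done here with union-find), the function names, and the sample phrases.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_link_clusters(phrases, threshold=0.5):
    """Group phrases so that any pair with similarity >= threshold
    ends up in the same cluster (single-link at a fixed cut)."""
    vectors = [Counter(p.lower().split()) for p in phrases]
    parent = list(range(len(phrases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(phrases)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, phrase in enumerate(phrases):
        clusters.setdefault(find(i), []).append(phrase)
    return list(clusters.values())


# Hypothetical phrases for illustration.
phrases = [
    "what do you think of",
    "what do you think",
    "so what do you think of it",
    "thank you very much",
]
for cluster in single_link_clusters(phrases):
    print(cluster)
```

This version compares all pairs, which is quadratic in the number of phrases; the distributed mapper/reducer split on the slide is not reproduced here.
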
  • 14. Efficiency: Meme Clustering
      • From WEKA ARFF format to sparse representation
          – From ~96 hours to 11 hours
      • Indexed vs. un-indexed
          – From 11 hours to 16 minutes (single core)
          – From 34 minutes to 3 minutes (136 cores)
      • Distributed vs. single core
          – From 11 hours to 34 minutes (un-indexed)
          – From 16 minutes to 3 minutes (indexed)
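
The timings on slide 14 are easier to compare as speedup factors. The short script below only does that arithmetic on the figures quoted on the slide; the labels and the `speedup` helper are naming choices for this example, and the ~96 hour figure is approximate in the source, so the first ratio is approximate too.

```python
# (before, after) run times in minutes, taken from slide 14.
timings_minutes = {
    "ARFF to sparse representation": (96 * 60, 11 * 60),
    "un-indexed to indexed (single core)": (11 * 60, 16),
    "un-indexed to indexed (136 cores)": (34, 3),
    "single core to distributed (un-indexed)": (11 * 60, 34),
    "single core to distributed (indexed)": (16, 3),
}


def speedup(before, after):
    """Return how many times faster the 'after' run is."""
    return before / after


for label, (before, after) in timings_minutes.items():
    print(f"{label}: {speedup(before, after):.1f}x")

# End to end: ~96 hours down to 3 minutes.
print(f"overall: {speedup(96 * 60, 3):.0f}x")
```

On these numbers, indexing gives the largest single gain on one core, and distribution helps less once the index is in place.
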
  • 15. Meme Browser: Original Interface
  • 16. Meme Browser: Current Interface
  • 17. Meme Evolution (Leskovec et al.’09)
  • 18. Thank You!
      Matt Lease: ml@ischool.utexas.edu, www.ischool.utexas.edu/~ml, @mattlease
      • Joint work with
          – Hohyon (Will) Ryu
              • InfoChimps (Summer’11)
              • Indeed.com (Summer’12)
          – Nicholas Woodward (TACC)
              • Latin American Network Information Center (LANIC)
      • Support
          – FCT of Portugal / UT CoLab
          – Amazon Web Services
          – UT Austin LIFT Award
          – John P. Commons Fellowship
