Discovering Memes in Social Media

                              Matt Lease
                        School of Information
                      University of Texas at Austin
                        ml@ischool.utexas.edu
                              @mattlease

                             Joint Work with
                     Hohyon Ryu & Nicholas Woodward


Research paper to appear at the 23rd ACM Conference on Hypertext and Social Media, 2012
Memes
• Short, similar phrases found in
  many different sources
  – Re-use, shared temporal context
• Evolutionary mutation &
  propagation as they transmit
  from source-to-source
• Reveals implicit connections
  between sources, individuals
  and communities involved
  March 21, 2012   ACM SIGKDD - Austin Chapter Meeting   2
MemeBrowser & Critical Literacy




Google/NYT Living Stories




                 livingstories.googlelabs.com
Related Work
• Jure Leskovec et al. (KDD’09): blogs
     – quotations only: http://memetracker.org
• Steven Skiena, Stony Brook NY: blogs
     – Named-entities only: http://www.textmap.com
• O. Kolak and B. Schilit (HT’08): scanned books
     – Mine “popular passages” from complete texts
     – MapReduce “shingling” approach
     – Popular passages found are local, not global

MapReduce @ UT
• UT LIFT Award to Lease, Baldridge, & Xu in Sept.’10
• New hard disks @ TACC Longhorn installed Dec.’10
   – 48 Dell R610 nodes
         • 2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz
         • 48GB RAM with ~1.5TB disk per node
         • With 1 NameNode & 47 Datanodes, up to 376 parallel Mappers
   – 16 Dell R710 (same CPU configuration)
         • 144GB RAM with ~0.8TB disk per node
   – Hadoop setup, testing, benchmarking, etc.
• Baldridge & Lease teach MapReduce class Fall’11
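
The "up to 376 parallel Mappers" figure is consistent with one map task per core on each DataNode. A minimal sketch of that arithmetic, assuming one map slot per core (the function name and the `slots_per_core` parameter are hypothetical, not from the slides):

```python
def max_parallel_mappers(data_nodes, cores_per_node, slots_per_core=1):
    """Upper bound on concurrent map tasks, assuming a fixed number of
    map slots per core (an assumption; the slides do not state it)."""
    return data_nodes * cores_per_node * slots_per_core


# 47 DataNodes with 8 cores each, as listed for the R610 nodes.
print(max_parallel_mappers(47, 8))
```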
Datasets
• TREC Blogs08 Collection
     – http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html
     – 28M permalinks (January 2008 – January 2009)
     – 250 GB compressed
• ICWSM 2009 Spinn3r Blog Dataset
     – http://www.icwsm.org/data/
     – 44 million blog posts (August - September, 2008)
     – 27 GB compressed
• ICWSM 2011 Spinn3r Blog Dataset

Processing Architecture
                                                               Blogs08 Test Collection
                                                                  28M posts, 1.4TB
       Preprocessing (Pseudo-MapReduce)
       Decruft & Language Identification
       HTML Strip & Near-Duplicate Detection                       16M posts, 960GB



       Common Phrase Extraction
                                                                    15K posts, 43GB
       3 MapReduce Stages

       Common Phrase Ranking
       Daily Top 200 Phrases                                       6.2M phrases, 2GB
       1 MapReduce Process

       Common Phrase Clustering
                                                                   75K phrases, 2.6MB
       1 MapReduce Process

       Meme Browser
                                                                      68K memes


Creating the Shingle Table
• e.g. trigram shingles for: what do you think of

  – what do you
  – do you think
  – you think of
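
The trigram example above can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names, the lowercasing, the whitespace tokenization, and the table layout (shingle mapped to the documents containing it) are all assumptions.

```python
def shingles(text, n=3):
    """Return the word n-gram shingles of text, in order of appearance."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_shingle_table(docs, n=3):
    """Map each shingle to the sorted list of document IDs that contain it.

    docs is a dict of {doc_id: text}; this layout is an assumption.
    """
    table = {}
    for doc_id, text in docs.items():
        for s in set(shingles(text, n)):
            table.setdefault(s, set()).add(doc_id)
    return {s: sorted(ids) for s, ids in table.items()}


print(shingles("what do you think of"))
```

Shingles that appear in only one document can be dropped from the table before later stages, since they cannot belong to a phrase shared across sources.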




Grouping Shingles by Document
• Mapper: trivial grouping; Reducer: Identity
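
One way to read this step: each row of the shingle table is re-keyed by document ID, and the shuffle does the grouping, so the reducer has nothing left to do. The in-memory simulation below is a sketch under that assumption; the function names and the table layout (shingle mapped to a list of document IDs) are hypothetical.

```python
from collections import defaultdict


def map_shingle_row(shingle, doc_ids):
    """Mapper: re-key one shingle-table row by document ID."""
    for doc_id in doc_ids:
        yield doc_id, shingle


def group_by_document(shingle_table):
    """Simulate map, shuffle, and an identity reducer in memory."""
    grouped = defaultdict(list)
    for shingle, doc_ids in sorted(shingle_table.items()):
        for doc_id, s in map_shingle_row(shingle, doc_ids):
            grouped[doc_id].append(s)  # shuffle: collect values per key
    return dict(grouped)               # identity reducer: emit groups as-is
```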




Common Phrase (CP) Detection
• Mapper:
  Merge adjacent
  shingles into memes
  (ignoring small gaps)

• Reducer:
  Find set of
  documents in which
  each meme occurs
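
The mapper and reducer roles above can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' code: `common` is assumed to be the set of shingles shared by multiple documents, `max_gap` is a hypothetical parameter for the "small gaps" tolerance, and a gap is measured in shingle positions.

```python
from collections import defaultdict


def merge_common_shingles(words, common, n=3, max_gap=1):
    """Mapper step: merge runs of common n-gram shingles into longer phrases.

    Two hits are merged when at most max_gap shingle positions between
    them are missing from the common set.
    """
    hits = [i for i in range(len(words) - n + 1)
            if " ".join(words[i:i + n]) in common]
    phrases, start, prev = [], None, None
    for i in hits:
        if start is None:
            start = i
        elif i - prev > 1 + max_gap:
            # Gap too large: close the current phrase and start a new one.
            phrases.append(" ".join(words[start:prev + n]))
            start = i
        prev = i
    if start is not None:
        phrases.append(" ".join(words[start:prev + n]))
    return phrases


def detect_common_phrases(docs, common, n=3, max_gap=1):
    """Reducer step: collect the set of documents in which each phrase occurs."""
    occurrences = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for phrase in merge_common_shingles(words, common, n, max_gap):
            occurrences[phrase].add(doc_id)
    return dict(occurrences)
```

With `max_gap=0` only strictly adjacent shingles are merged; larger values tolerate small edits inside an otherwise shared phrase.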
Ranking Memes




Clustering Memes
• Mapper:
  Single-link
  hierarchical
  clustering with
  cosine similarity
• Reducer:
  create/merge
  clusters
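
A single-machine sketch of the same idea: phrases become sparse bag-of-words vectors, pairs above a cosine threshold are linked, and linked phrases are merged into clusters. Cutting a single-link hierarchy at a fixed similarity level yields the connected components of that link graph, so a union-find structure is enough here. The threshold value, the bag-of-words weighting, and the function names are assumptions, not details from the slides.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def single_link_clusters(phrases, threshold=0.5):
    """Group phrases whose similarity graph (edges >= threshold) is connected."""
    vectors = [Counter(p.lower().split()) for p in phrases]
    parent = list(range(len(phrases)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # "Mapper" role: find similar pairs. "Reducer" role: merge their clusters.
    for i, j in combinations(range(len(phrases)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(j)] = find(i)

    clusters = {}
    for i, p in enumerate(phrases):
        clusters.setdefault(find(i), []).append(p)
    return sorted(clusters.values())
```

The all-pairs loop is quadratic in the number of phrases, which is the cost that distributing the work is meant to reduce.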


Efficiency: Meme Clustering



• From WEKA ARFF format to sparse representation
   – From ~96 hours → 11 hours
• Indexed vs. un-indexed
   – From 11 hours → 16 minutes (single core)
   – From 34 minutes → 3 minutes (136 cores)
• Distributed vs. single core
   – From 11 hours → 34 minutes (un-indexed)
   – From 16 minutes → 3 minutes (indexed)
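
The timings above can be converted to speedup factors for easier comparison. A small sketch; the helper name and dictionary labels are hypothetical, and the times are taken directly from the bullets, converted to minutes:

```python
def speedup(before_minutes, after_minutes):
    """Ratio of running time before an optimization to the time after it."""
    return before_minutes / after_minutes


# (before, after) in minutes, as reported on the slide.
timings = {
    "ARFF to sparse representation": (96 * 60, 11 * 60),
    "indexing, single core": (11 * 60, 16),
    "indexing, 136 cores": (34, 3),
    "distribution, un-indexed": (11 * 60, 34),
    "distribution, indexed": (16, 3),
}

for name, (before, after) in timings.items():
    print(f"{name}: {speedup(before, after):.1f}x")
```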
Meme Browser: Original Interface




Meme Browser: Current Interface




Meme Evolution (Leskovec et al.’09)




Thank You!
• Matt Lease
  – ml@ischool.utexas.edu
  – www.ischool.utexas.edu/~ml
  – @mattlease
• Joint Work with
  – Hohyon (Will) Ryu
     • InfoChimps (Summer’11)
     • Indeed.com (Summer’12)
  – Nicholas Woodward (TACC)
     • Latin American Network Information Center (LANIC)
• Support
  – FCT of Portugal / UT CoLab
  – Amazon Web Services
  – UT Austin LIFT Award
  – John P. Commons Fellowship

More Related Content

Viewers also liked

Making Memes Latinitas
Making Memes LatinitasMaking Memes Latinitas
Making Memes Latinitas
Andrea Zarate
 
WTF is meme culture? / memes anatomy.
WTF is meme culture? / memes anatomy.WTF is meme culture? / memes anatomy.
WTF is meme culture? / memes anatomy.
Ravard & Co
 
Memes
MemesMemes
Meme Powerpoint
Meme PowerpointMeme Powerpoint
Meme Powerpoint
Connor
 
mems ppt
mems pptmems ppt
mems ppt
sapparao
 
Memes, Memes Everywhere
Memes, Memes EverywhereMemes, Memes Everywhere
Memes, Memes Everywhere
Cast From Clay
 
Fantastic memes and how to use them
Fantastic memes and how to use themFantastic memes and how to use them
Fantastic memes and how to use them
Aaron Hill
 
Social networking PPT
Social networking PPTSocial networking PPT
Social networking PPT
varun0912
 
A Complete Guide To The Best Times To Post On Social Media (And More!)
A Complete Guide To The Best Times To Post On Social Media (And More!)A Complete Guide To The Best Times To Post On Social Media (And More!)
A Complete Guide To The Best Times To Post On Social Media (And More!)
TrackMaven
 
How to Win Friends, Influence People, and Get a Better Valuation with Emoji, ...
How to Win Friends, Influence People, and Get a Better Valuation with Emoji, ...How to Win Friends, Influence People, and Get a Better Valuation with Emoji, ...
How to Win Friends, Influence People, and Get a Better Valuation with Emoji, ...
Dave McClure
 

Viewers also liked (11)

Gdc reports2013 4_13
Gdc reports2013 4_13Gdc reports2013 4_13
Gdc reports2013 4_13
 
Making Memes Latinitas
Making Memes LatinitasMaking Memes Latinitas
Making Memes Latinitas
 
WTF is meme culture? / memes anatomy.
WTF is meme culture? / memes anatomy.WTF is meme culture? / memes anatomy.
WTF is meme culture? / memes anatomy.
 
Memes
MemesMemes
Memes
 
Meme Powerpoint
Meme PowerpointMeme Powerpoint
Meme Powerpoint
 
mems ppt
mems pptmems ppt
mems ppt
 
Memes, Memes Everywhere
Memes, Memes EverywhereMemes, Memes Everywhere
Memes, Memes Everywhere
 
Fantastic memes and how to use them
Fantastic memes and how to use themFantastic memes and how to use them
Fantastic memes and how to use them
 
Social networking PPT
Social networking PPTSocial networking PPT
Social networking PPT
 
A Complete Guide To The Best Times To Post On Social Media (And More!)
A Complete Guide To The Best Times To Post On Social Media (And More!)A Complete Guide To The Best Times To Post On Social Media (And More!)
A Complete Guide To The Best Times To Post On Social Media (And More!)
 
How to Win Friends, Influence People, and Get a Better Valuation with Emoji, ...
How to Win Friends, Influence People, and Get a Better Valuation with Emoji, ...How to Win Friends, Influence People, and Get a Better Valuation with Emoji, ...
How to Win Friends, Influence People, and Get a Better Valuation with Emoji, ...
 

Similar to Discovering Memes in Social Media

Discovering and Navigating Memes in Social Media
Discovering and Navigating Memes in Social MediaDiscovering and Navigating Memes in Social Media
Discovering and Navigating Memes in Social Media
Matthew Lease
 
E Science As A Lens On The World Lazowska
E Science As A Lens On The World   LazowskaE Science As A Lens On The World   Lazowska
E Science As A Lens On The World Lazowska
guest43b4df3
 
E Science As A Lens On The World Lazowska
E Science As A Lens On The World   LazowskaE Science As A Lens On The World   Lazowska
E Science As A Lens On The World Lazowska
WCET
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
Salil Navgire
 
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Noemi Derzsy
 
Data Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA DatasetsData Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA Datasets
PyData
 
Realtime Indexing for Fast Queries on Massive Semi-Structured Data
Realtime Indexing for Fast Queries on Massive Semi-Structured DataRealtime Indexing for Fast Queries on Massive Semi-Structured Data
Realtime Indexing for Fast Queries on Massive Semi-Structured Data
ScyllaDB
 
Startup Bootcamp - Intro to NoSQL/Big Data by DataZone
Startup Bootcamp - Intro to NoSQL/Big Data by DataZoneStartup Bootcamp - Intro to NoSQL/Big Data by DataZone
Startup Bootcamp - Intro to NoSQL/Big Data by DataZone
Idan Tohami
 
RDBMS vs NoSQL
RDBMS vs NoSQLRDBMS vs NoSQL
RDBMS vs NoSQL
Murat Çakal
 
Distributed data mining
Distributed data miningDistributed data mining
Distributed data mining
Ahmad Ammari
 
Lunch & Learn Intro to Big Data
Lunch & Learn Intro to Big DataLunch & Learn Intro to Big Data
Lunch & Learn Intro to Big Data
Melissa Hornbostel
 
Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-
WU (Vienna University of Economics and Business)
 
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar   Intro to Linked Data and SemanticsINSPIRE Hackathon Webinar   Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
plan4all
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
DataWorks Summit/Hadoop Summit
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQL
Don Demcsak
 
RDF: Resource Description Failures?
RDF: Resource Description Failures?RDF: Resource Description Failures?
RDF: Resource Description Failures?
Robert Sanderson
 
Webinar: The Future of SQL
Webinar: The Future of SQLWebinar: The Future of SQL
Webinar: The Future of SQL
Crate.io
 
07 data structures_and_representations
07 data structures_and_representations07 data structures_and_representations
07 data structures_and_representations
Marco Quartulli
 
How DITA Got Her Groove Back: Going Mapless with Don Day
How DITA Got Her Groove Back: Going Mapless with Don DayHow DITA Got Her Groove Back: Going Mapless with Don Day
How DITA Got Her Groove Back: Going Mapless with Don Day
Information Development World
 
IASSIST 2012 - DDI-RDF - Trouble with Triples
IASSIST 2012 - DDI-RDF - Trouble with TriplesIASSIST 2012 - DDI-RDF - Trouble with Triples
IASSIST 2012 - DDI-RDF - Trouble with Triples
Dr.-Ing. Thomas Hartmann
 

Similar to Discovering Memes in Social Media (20)

Discovering and Navigating Memes in Social Media
Discovering and Navigating Memes in Social MediaDiscovering and Navigating Memes in Social Media
Discovering and Navigating Memes in Social Media
 
E Science As A Lens On The World Lazowska
E Science As A Lens On The World   LazowskaE Science As A Lens On The World   Lazowska
E Science As A Lens On The World Lazowska
 
E Science As A Lens On The World Lazowska
E Science As A Lens On The World   LazowskaE Science As A Lens On The World   Lazowska
E Science As A Lens On The World Lazowska
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
 
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
 
Data Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA DatasetsData Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA Datasets
 
Realtime Indexing for Fast Queries on Massive Semi-Structured Data
Realtime Indexing for Fast Queries on Massive Semi-Structured DataRealtime Indexing for Fast Queries on Massive Semi-Structured Data
Realtime Indexing for Fast Queries on Massive Semi-Structured Data
 
Startup Bootcamp - Intro to NoSQL/Big Data by DataZone
Startup Bootcamp - Intro to NoSQL/Big Data by DataZoneStartup Bootcamp - Intro to NoSQL/Big Data by DataZone
Startup Bootcamp - Intro to NoSQL/Big Data by DataZone
 
RDBMS vs NoSQL
RDBMS vs NoSQLRDBMS vs NoSQL
RDBMS vs NoSQL
 
Distributed data mining
Distributed data miningDistributed data mining
Distributed data mining
 
Lunch & Learn Intro to Big Data
Lunch & Learn Intro to Big DataLunch & Learn Intro to Big Data
Lunch & Learn Intro to Big Data
 
Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-Scaling the (evolving) web data –at low cost-
Scaling the (evolving) web data –at low cost-
 
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar   Intro to Linked Data and SemanticsINSPIRE Hackathon Webinar   Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQL
 
RDF: Resource Description Failures?
RDF: Resource Description Failures?RDF: Resource Description Failures?
RDF: Resource Description Failures?
 
Webinar: The Future of SQL
Webinar: The Future of SQLWebinar: The Future of SQL
Webinar: The Future of SQL
 
07 data structures_and_representations
07 data structures_and_representations07 data structures_and_representations
07 data structures_and_representations
 
How DITA Got Her Groove Back: Going Mapless with Don Day
How DITA Got Her Groove Back: Going Mapless with Don DayHow DITA Got Her Groove Back: Going Mapless with Don Day
How DITA Got Her Groove Back: Going Mapless with Don Day
 
IASSIST 2012 - DDI-RDF - Trouble with Triples
IASSIST 2012 - DDI-RDF - Trouble with TriplesIASSIST 2012 - DDI-RDF - Trouble with Triples
IASSIST 2012 - DDI-RDF - Trouble with Triples
 

More from Matthew Lease

Automated Models for Quantifying Centrality of Survey Responses
Automated Models for Quantifying Centrality of Survey ResponsesAutomated Models for Quantifying Centrality of Survey Responses
Automated Models for Quantifying Centrality of Survey Responses
Matthew Lease
 
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Matthew Lease
 
Explainable Fact Checking with Humans in-the-loop
Explainable Fact Checking with Humans in-the-loopExplainable Fact Checking with Humans in-the-loop
Explainable Fact Checking with Humans in-the-loop
Matthew Lease
 
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Matthew Lease
 
AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd
Matthew Lease
 
Designing Human-AI Partnerships to Combat Misinfomation
Designing Human-AI Partnerships to Combat Misinfomation Designing Human-AI Partnerships to Combat Misinfomation
Designing Human-AI Partnerships to Combat Misinfomation
Matthew Lease
 
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Matthew Lease
 
But Who Protects the Moderators?
But Who Protects the Moderators?But Who Protects the Moderators?
But Who Protects the Moderators?
Matthew Lease
 
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Matthew Lease
 
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Matthew Lease
 
Fact Checking & Information Retrieval
Fact Checking & Information RetrievalFact Checking & Information Retrieval
Fact Checking & Information Retrieval
Matthew Lease
 
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Matthew Lease
 
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
Matthew Lease
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Matthew Lease
 
Systematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s ClothingSystematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s Clothing
Matthew Lease
 
The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)
Matthew Lease
 
The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016
Matthew Lease
 
The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)
Matthew Lease
 
Toward Better Crowdsourcing Science
 Toward Better Crowdsourcing Science Toward Better Crowdsourcing Science
Toward Better Crowdsourcing Science
Matthew Lease
 
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work PlatformsBeyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Matthew Lease
 

More from Matthew Lease (20)

Automated Models for Quantifying Centrality of Survey Responses
Automated Models for Quantifying Centrality of Survey ResponsesAutomated Models for Quantifying Centrality of Survey Responses
Automated Models for Quantifying Centrality of Survey Responses
 
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...
 
Explainable Fact Checking with Humans in-the-loop
Explainable Fact Checking with Humans in-the-loopExplainable Fact Checking with Humans in-the-loop
Explainable Fact Checking with Humans in-the-loop
 
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...
 
AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd AI & Work, with Transparency & the Crowd
AI & Work, with Transparency & the Crowd
 
Designing Human-AI Partnerships to Combat Misinfomation
Designing Human-AI Partnerships to Combat Misinfomation Designing Human-AI Partnerships to Combat Misinfomation
Designing Human-AI Partnerships to Combat Misinfomation
 
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...
 
But Who Protects the Moderators?
But Who Protects the Moderators?But Who Protects the Moderators?
But Who Protects the Moderators?
 
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
 
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
 
Fact Checking & Information Retrieval
Fact Checking & Information RetrievalFact Checking & Information Retrieval
Fact Checking & Information Retrieval
 
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
 
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
 
Systematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s ClothingSystematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s Clothing
 
The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)
 
The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016
 
The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)
 
Toward Better Crowdsourcing Science
 Toward Better Crowdsourcing Science Toward Better Crowdsourcing Science
Toward Better Crowdsourcing Science
 
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work PlatformsBeyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
 

Recently uploaded

Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
David Brossard
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Project Management Semester Long Project - Acuity
Project Management Semester Long Project - AcuityProject Management Semester Long Project - Acuity
Project Management Semester Long Project - Acuity
jpupo2018
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 


Discovering Memes in Social Media

  • 1. Discovering Memes in Social Media Matt Lease School of Information University of Texas at Austin ml@ischool.utexas.edu @mattlease Joint Work with Hohyon Ryu & Nicholas Woodward Research paper to appear at the 23rd ACM Conference on Hypertext and Social Media, 2012
  • 2. Memes • Short, similar phrases found in many different sources – Re-use, shared temporal context • Evolutionary mutation & propagation as they transmit from source-to-source • Reveals implicit connections between sources, individuals and communities involved March 21, 2012 ACM SIGKDD - Austin Chapter Meeting 2
  • 3. MemeBrowser & Critical Literacy
  • 4. Google/NYT Living Stories livingstories.googlelabs.com
  • 5. Related Work • Jure Leskovec et al. (KDD’09): blogs – quotations only: http://memetracker.org • Steven Skiena, Stony Brook NY: blogs – Named-entities only: http://www.textmap.com • O. Kolak and B. Schilit (HT’08): scanned books – Mine “popular passages” from complete texts – MapReduce “shingling” approach – Popular passages found are local, not global
  • 6. MapReduce @ UT • UT LIFT Award to Lease, Baldridge, & Xu in Sept.’10 • New hard disks @ TACC Longhorn installed Dec.’10 – 48 Dell R610 nodes • 2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz • 48GB RAM with ~1.5TB disk per node • With 1 NameNode & 47 DataNodes, up to 376 parallel Mappers – 16 Dell R710 (same CPU configuration) • 144GB RAM with ~0.8TB disk per node – Set up Hadoop, testing, benchmarking, etc. • Baldridge & Lease teach MapReduce class Fall’11
  • 7. Datasets • TREC Blogs08 Collection – http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html – 28M permalinks (January 2008 – January 2009) – 250G compressed • ICWSM 2009 Spinn3r Blog Dataset – http://www.icwsm.org/data/ – 44 million blog posts (August - September, 2008) – 27 GB compressed • ICWSM 2011 Spinn3r Blog Dataset
  • 8. Processing Architecture • Blogs08 Test Collection: 28M posts, 1.4TB • Preprocessing (Pseudo-MapReduce): Decruft & Language Identification; HTML Strip & Near-Duplicate Detection • 16M posts, 960GB • Common Phrase Extraction • 15K posts, 43GB • 3 MapReduce Stages • Common Phrase Ranking: Daily Top 200 Phrases • 6.2M phrases, 2GB • 1 MapReduce Process • Common Phrase Clustering • 75K phrases, 2.6MB • 1 MapReduce Process • Meme Browser: 68K memes
  • 9. Creating the Shingle Table • e.g. trigram shingles for: what do you think of – what do you – do you think – you think of
  • 10. Grouping Shingles by Document • Mapper: trivial grouping; Reducer: Identity
  • 11. Common Phrase (CP) Detection • Mapper: Merge adjacent shingles into memes (ignoring small gaps) • Reducer: Find set of documents in which each meme occurs
  • 12. Ranking Memes
  • 13. Clustering Memes • Mapper: Single-link hierarchical clustering with cosine similarity • Reducer: create/merge clusters
  • 14. Efficiency: Meme Clustering • From WEKA ARFF format to sparse representation – From ~96 hours → 11 hours • Indexed vs. un-indexed – From 11 hours → 16 minutes (single core) – From 34 minutes → 3 minutes (136 cores) • Distributed vs. single core – From 11 hours → 34 minutes (un-indexed) – From 16 minutes → 3 minutes (indexed)
  • 15. Meme Browser: Original Interface
  • 16. Meme Browser: Current Interface
  • 17. Meme Evolution (Leskovec et al.’09)
  • 18. Thank You! • Matt Lease – ml@ischool.utexas.edu – www.ischool.utexas.edu/~ml – @mattlease • Joint Work with – Hohyon (Will) Ryu • InfoChimps (Summer’11) • Indeed.com (Summer’12) – Nicholas Woodward (TACC) • Latin American Network Information Center (LANIC) • Support: FCT of Portugal / UT CoLab; Amazon Web Services; UT Austin LIFT Award; John P. Commons Fellowship
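
The shingling steps on slides 9 to 11 can be illustrated with a small single-machine sketch. This is not the authors' MapReduce implementation: it runs in memory, lowercases and splits on whitespace, and does not tolerate the "small gaps" mentioned on slide 11. The function names and the `min_docs` threshold are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def shingles(text, n=3):
    """Return the word n-gram shingles of text, in document order."""
    words = text.lower().split()  # assumption: naive whitespace tokenization
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def shingle_table(docs, n=3):
    """Map each shingle to the set of document ids that contain it."""
    table = defaultdict(set)
    for doc_id, text in docs.items():
        for sh in shingles(text, n):
            table[sh].add(doc_id)
    return table


def common_phrases(docs, n=3, min_docs=2):
    """Merge runs of adjacent shared shingles into longer phrases.

    A shingle counts as shared when it occurs in at least min_docs
    documents (hypothetical threshold). Returns a dict mapping each
    merged phrase to the set of documents in which it was found.
    Simplification: a single unshared shingle ends the run.
    """
    table = shingle_table(docs, n)
    found = defaultdict(set)
    for doc_id, text in docs.items():
        run = []
        # The trailing None acts as a sentinel that flushes the last run.
        for sh in shingles(text, n) + [None]:
            if sh is not None and len(table[sh]) >= min_docs:
                run.append(sh)
                continue
            if run:
                # Overlapping shingles share n-1 words, so each later
                # shingle contributes only its final word.
                words = list(run[0]) + [s[-1] for s in run[1:]]
                found[" ".join(words)].add(doc_id)
                run = []
    return dict(found)


if __name__ == "__main__":
    docs = {
        "a": "so what do you think of this",
        "b": "tell me what do you think of it",
        "c": "nothing shared here at all",
    }
    print(common_phrases(docs))
```

In a MapReduce setting, building the shingle table and collecting the document set for each phrase would presumably be spread across mappers and reducers as the slides describe; the sketch only shows the data flow on toy input.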