Discovering Memes in Social Media

Talk at the ACM SIGKDD - Austin Chapter Meeting, March 21, 2012. Paper by Hohyon Ryu, Matthew Lease, and Nicholas Woodward, at the 23rd ACM Conference on Hypertext and Social Media, 2012.

Discovering Memes in Social Media

Matt Lease
School of Information
University of Texas at Austin
ml@ischool.utexas.edu
@mattlease

Joint Work with
Hohyon Ryu & Nicholas Woodward

Research paper to appear at the 23rd ACM Conference on Hypertext and Social Media, 2012

Memes
• Short, similar phrases found in many different sources
– Re-use, shared temporal context
• Evolutionary mutation & propagation as they transmit from source to source
• Reveal implicit connections between the sources, individuals, and communities involved
March 21, 2012 ACM SIGKDD - Austin Chapter Meeting 2

MemeBrowser & Critical Literacy

Google/NYT Living Stories

livingstories.googlelabs.com

Related Work
• Jure Leskovec et al. (KDD’09): blogs
– quotations only: http://memetracker.org
• Steven Skiena, Stony Brook NY: blogs
– Named-entities only: http://www.textmap.com
• O. Kolak and B. Schilit (HT’08): scanned books
– Mine “popular passages” from complete texts
– MapReduce “shingling” approach
– Popular passages found are local, not global

MapReduce @ UT
• UT LIFT Award to Lease, Baldridge, & Xu in Sept.’10
• New hard disks @ TACC Longhorn installed Dec.’10
– 48 Dell R610 nodes
• 2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz
• 48GB RAM with ~1.5TB disk per node
• With 1 NameNode & 47 Datanodes, up to 376 parallel Mappers
– 16 Dell R710 (same CPU configuration)
• 144GB RAM with ~0.8TB disk per node
– Set up Hadoop, testing, benchmarking, etc.
• Baldridge & Lease teach MapReduce class Fall’11
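
The figure of 376 parallel Mappers follows from the node counts above. A minimal sketch of the arithmetic, assuming one map slot per core (the slide does not state the slot configuration, and the variable names are illustrative):

```python
# Node and core counts taken from the slide; one map slot per core is an assumption.
r610_nodes = 48
namenodes = 1
cores_per_node = 2 * 4  # two quad-core processors per node

datanodes = r610_nodes - namenodes
max_parallel_mappers = datanodes * cores_per_node

print(datanodes, "DataNodes x", cores_per_node, "cores =", max_parallel_mappers, "Mappers")
```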

Datasets
• TREC Blogs08 Collection
– http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html
– 28M permalinks (January 2008 – January 2009)
– 250 GB compressed
• ICWSM 2009 Spinn3r Blog Dataset
– http://www.icwsm.org/data/
– 44 million blog posts (August - September, 2008)
– 27 GB compressed
• ICWSM 2011 Spinn3r Blog Dataset

Processing Architecture
Blogs08 Test Collection
28M posts, 1.4TB
Preprocessing (Pseudo-MapReduce)
Decruft & Language Identification
HTML Strip & Near-Duplicate Detection 16M posts, 960GB

Common Phrase Extraction
15K posts, 43GB
3 MapReduce Stages

Common Phrase Ranking
Daily Top 200 Phrases 6.2M phrases, 2GB
1 MapReduce Process

Common Phrase Clustering
75K phrases, 2.6MB
1 MapReduce Process

Meme Browser
68K memes

Creating the Shingle Table
• e.g. trigram shingles for: what do you think of

– what do you
– do you think
– you think of
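
The trigram example above can be sketched in a few lines of Python. This is a single-machine illustration, not the MapReduce job from the talk; the function names, the whitespace tokenization, and the `(doc_id, position)` table layout are assumptions made for the example.

```python
from collections import defaultdict


def word_shingles(text, n=3):
    """Return the overlapping word n-grams (shingles) of text, in order."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_shingle_table(docs, n=3):
    """Map each shingle to the (doc_id, position) pairs where it occurs."""
    table = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, shingle in enumerate(word_shingles(text, n)):
            table[shingle].append((doc_id, pos))
    return dict(table)


# Hypothetical two-document collection.
docs = {"d1": "what do you think of", "d2": "so what do you think"}
table = build_shingle_table(docs)
print(word_shingles("what do you think of"))
print(table["what do you"])
```

Shingles that occur in more than one document, such as "what do you" here, are the candidates for common phrases.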

Grouping Shingles by Document
• Mapper: trivial grouping; Reducer: Identity
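
One way to read this step is as a re-keying of the shingle table by document. The sketch below simulates it in plain Python under several assumptions that are not on the slide: the table layout (shingle to `(doc_id, position)` postings), the `min_docs` filter that drops shingles seen in only one document, and the sort by position standing in for the shuffle.

```python
from collections import defaultdict

# Hypothetical shingle table: shingle -> list of (doc_id, position) postings.
shingle_table = {
    "what do you": [("d1", 0), ("d2", 1)],
    "do you think": [("d1", 1), ("d2", 2)],
    "you think of": [("d1", 2)],
}


def map_postings(shingle, postings, min_docs=2):
    """Mapper: re-key each posting by document.

    Shingles found in fewer than min_docs documents are skipped (an assumption).
    """
    if len({doc_id for doc_id, _ in postings}) < min_docs:
        return
    for doc_id, pos in postings:
        yield doc_id, (pos, shingle)


def group_by_document(table):
    """Simulate the shuffle plus an identity reducer: values collected per doc_id."""
    grouped = defaultdict(list)
    for shingle, postings in table.items():
        for doc_id, value in map_postings(shingle, postings):
            grouped[doc_id].append(value)
    # Sorting by position stands in for the framework's sort phase.
    return {doc_id: sorted(values) for doc_id, values in grouped.items()}


by_doc = group_by_document(shingle_table)
print(by_doc)
```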

Common Phrase (CP) Detection
• Mapper: merge adjacent shingles into memes (ignoring small gaps)
• Reducer: find the set of documents in which each meme occurs
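
A rough single-machine sketch of the two steps above, assuming trigram shingles, a gap tolerance of one token, and exact string matching of merged phrases in the reduce step. The function names, the `max_gap` parameter, and the sample data are illustrative and not taken from the paper.

```python
from collections import defaultdict


def merge_shingles(tokens, positions, n=3, max_gap=1):
    """Merge common-shingle positions within one document into phrases.

    Shingles whose token spans overlap, touch, or are separated by at most
    max_gap tokens are merged into a single span.
    """
    spans = []
    for pos in sorted(positions):
        start, end = pos, pos + n
        if spans and start <= spans[-1][1] + max_gap:
            spans[-1][1] = max(spans[-1][1], end)
        else:
            spans.append([start, end])
    return [" ".join(tokens[s:e]) for s, e in spans]


def detect_common_phrases(docs, common_positions, n=3, max_gap=1):
    """Map step emits (phrase, doc_id); reduce step collects documents per phrase."""
    phrase_docs = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        positions = common_positions.get(doc_id, [])
        for phrase in merge_shingles(tokens, positions, n, max_gap):
            phrase_docs[phrase].add(doc_id)
    return dict(phrase_docs)


docs = {
    "d1": "tell me what do you think of this",
    "d2": "so what do you think of it",
}
# Hypothetical positions of shingles that also occur in another document.
common_positions = {"d1": [2, 3, 4], "d2": [1, 2, 3]}
phrases = detect_common_phrases(docs, common_positions)
print(phrases)
```

Because the reduce step here keys on the exact phrase text, near-identical variants end up as separate phrases; grouping those is left to the later clustering stage.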

Ranking Memes

Clustering Memes
• Mapper: single-link hierarchical clustering with cosine similarity
• Reducer: create/merge clusters
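
As an illustration of the technique named above, the sketch below cuts a single-link hierarchy at a fixed cosine threshold, which is equivalent to taking connected components of the graph linking every pair of phrases at or above that threshold. It runs on one machine with a union-find structure; the bag-of-words vectors, the 0.5 threshold, and the sample phrases are assumptions, and the split into Mapper and Reducer is not reproduced.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_link_clusters(phrases, threshold=0.5):
    """Group phrases so that any pair with cosine >= threshold shares a cluster."""
    vectors = [Counter(p.lower().split()) for p in phrases]
    parent = list(range(len(phrases)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two clusters

    clusters = {}
    for i, phrase in enumerate(phrases):
        clusters.setdefault(find(i), []).append(phrase)
    return list(clusters.values())


# Illustrative input, not data from the paper.
phrases = [
    "what do you think of",
    "what do you think about",
    "so what do you think",
    "an unrelated common phrase",
]
clusters = single_link_clusters(phrases)
print(clusters)
```

The all-pairs loop is quadratic in the number of phrases, so it is only practical for small inputs.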

Efficiency: Meme Clustering

• From WEKA ARFF format to sparse representation
– From ~96 hours → 11 hours
• Indexed vs. un-indexed
– From 11 hours → 16 minutes (single core)
– From 34 minutes → 3 minutes (136 cores)
• Distributed vs. single core
– From 11 hours → 34 minutes (un-indexed)
– From 16 minutes → 3 minutes (indexed)
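
The running times above translate into speedup factors as follows. This is only arithmetic on the reported numbers; it assumes the distributed runs are the 136-core runs, which the matching 34-minute and 3-minute figures suggest, and it treats the approximate 96 hours as exact.

```python
# Running times from the slide, converted to minutes: (before, after).
timings = {
    "ARFF -> sparse representation": (96 * 60, 11 * 60),
    "un-indexed -> indexed (single core)": (11 * 60, 16),
    "un-indexed -> indexed (136 cores)": (34, 3),
    "single core -> 136 cores (un-indexed)": (11 * 60, 34),
    "single core -> 136 cores (indexed)": (16, 3),
}

speedups = {name: before / after for name, (before, after) in timings.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")

# End to end: ~96 hours on the original setup vs. 3 minutes indexed on 136 cores.
overall = (96 * 60) / 3
print(f"overall: {overall:.0f}x")
```

On these numbers, indexing on a single core (about 41x) gains more than distributing the un-indexed job over 136 cores (about 19x), and the combined improvement is roughly 1,900x.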

Meme Browser: Original Interface

Meme Browser: Current Interface

Meme Evolution (Leskovec et al.’09)

Thank You!
• Joint Work with
– Hohyon (Will) Ryu
• InfoChimps (Summer’11)
• Indeed.com (Summer’12)
– Nicholas Woodward (TACC)
• Latin American Network Information Center (LANIC)

Support
• FCT of Portugal / UT CoLab
• Amazon Web Services
• UT Austin LIFT Award
• John P. Commons Fellowship

Matt Lease
ml@ischool.utexas.edu
www.ischool.utexas.edu/~ml
@mattlease


The Rise of Crowd Computing (December 2015)

Matthew Lease

Crowd computing is rising with two waves - the first using crowds to label large amounts of data for artificial intelligence applications. The second wave delivers applications that go beyond AI abilities by incorporating human computation. Open problems remain around ensuring high quality outputs, task design, understanding the worker context and experience, and addressing ethics concerns around opaque platforms and working conditions. The future holds potential for empowering crowd work but also risks like digital sweatshops if worker freedoms and conditions are not considered.

Toward Better Crowdsourcing Science

Matthew Lease

Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms

Matthew Lease

The document summarizes a presentation about analyzing paid crowd work platforms beyond Mechanical Turk. It discusses how Mechanical Turk has dominated research on paid crowdsourcing due to its early popularity, but that it has limitations. The presentation conducts a qualitative study of 7 alternative crowd work platforms to identify distinguishing capabilities not found on MTurk, such as different payment models, richer worker profiles, and support for confidential tasks. It aims to increase awareness of other platforms to further inform practice and research on crowdsourcing.

More from Matthew Lease (20)

Automated Models for Quantifying Centrality of Survey Responses

Key Challenges in Moderating Social Media: Accuracy, Cost, Scalability, and S...

Explainable Fact Checking with Humans in-the-loop

Adventures in Crowdsourcing : Toward Safer Content Moderation & Better Suppor...

AI & Work, with Transparency & the Crowd

Designing Human-AI Partnerships to Combat Misinfomation

Designing at the Intersection of HCI & AI: Misinformation & Crowdsourced Anno...

But Who Protects the Moderators?

Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...

Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...

Fact Checking & Information Retrieval

Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...

What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...

Deep Learning for Information Retrieval: Models, Progress, & Opportunities

Systematic Review is e-Discovery in Doctor’s Clothing

The Rise of Crowd Computing (July 7, 2016)

The Rise of Crowd Computing - 2016

The Rise of Crowd Computing (December 2015)

Toward Better Crowdsourcing Science

Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms


Discovering Memes in Social Media

1. Discovering Memes in Social Media
Matt Lease
School of Information
University of Texas at Austin
ml@ischool.utexas.edu
@mattlease
Joint Work with Hohyon Ryu & Nicholas Woodward
Research paper to appear at the 23rd ACM Conference on Hypertext and Social Media, 2012

2. Memes
• Short, similar phrases found in many different sources
  – Re-use, shared temporal context
• Evolutionary mutation & propagation as they transmit from source-to-source
• Reveals implicit connections between sources, individuals and communities involved
March 21, 2012 ACM SIGKDD - Austin Chapter Meeting 2

3. MemeBrowser & Critical Literacy

4. Google/NYT Living Stories
livingstories.googlelabs.com

5. Related Work
• Jure Leskovec et al. (KDD’09): blogs
  – Quotations only: http://memetracker.org
• Steven Skiena, Stony Brook NY: blogs
  – Named-entities only: http://www.textmap.com
• O. Kolak and B. Schilit (HT’08): scanned books
  – Mine “popular passages” from complete texts
  – MapReduce “shingling” approach
  – Popular passages found are local, not global

6. MapReduce @ UT
• UT LIFT Award to Lease, Baldridge, & Xu in Sept.’10
• New hard disks @ TACC Longhorn installed Dec.’10
  – 48 Dell R610 nodes
    • 2 Intel Nehalem quad-core processors (8 cores) @ 2.53 GHz
    • 48GB RAM with ~1.5TB disk per node
    • With 1 NameNode & 47 DataNodes, up to 376 parallel Mappers (47 nodes × 8 cores)
  – 16 Dell R710 (same CPU configuration)
    • 144GB RAM with ~0.8TB disk per node
  – Set up Hadoop, testing, benchmarking, etc.
• Baldridge & Lease teach MapReduce class Fall’11

7. Datasets
• TREC Blogs08 Collection
  – http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html
  – 28M permalinks (January 2008 – January 2009)
  – 250G compressed
• ICWSM 2009 Spinn3r Blog Dataset
  – http://www.icwsm.org/data/
  – 44 million blog posts (August - September, 2008)
  – 27 GB compressed
• ICWSM 2011 Spinn3r Blog Dataset

8. Processing Architecture
• Blogs08 Test Collection: 28M posts, 1.4TB
• Preprocessing (Pseudo-MapReduce)
  – Decruft & Language Identification
  – HTML Strip & Near-Duplicate Detection
  – 16M posts, 960GB
• Common Phrase Extraction
  – 15K posts, 43GB
  – 3 MapReduce Stages
• Common Phrase Ranking
  – Daily Top 200 Phrases
  – 6.2M phrases, 2GB
  – 1 MapReduce Process
• Common Phrase Clustering
  – 75K phrases, 2.6MB
  – 1 MapReduce Process
• Meme Browser
  – 68K memes

9. Creating the Shingle Table
• e.g. trigram shingles for: what do you think of
  – what do you
  – do you think
  – you think of
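The trigram example on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization; the function name `shingles` is hypothetical, and the talk does not say how the actual pipeline tokenizes text.

```python
def shingles(text, n=3):
    """Return all word n-grams (shingles) of `text`, in order of appearance.

    Assumption: tokens are separated by whitespace; the real pipeline's
    tokenizer is not described on the slide.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


trigrams = shingles("what do you think of")
print(trigrams)  # ['what do you', 'do you think', 'you think of']
```

A text with fewer than `n` tokens yields an empty list, so very short posts contribute no shingles.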

10. Grouping Shingles by Document
• Mapper: trivial grouping
• Reducer: Identity

11. Common Phrase (CP) Detection
• Mapper: Merge adjacent shingles into memes (ignoring small gaps)
• Reducer: Find set of documents in which each meme occurs
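The two steps on this slide might be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' MapReduce code: the function names, the `max_gap` parameter, and the idea of identifying each common shingle by its starting token position are all hypothetical.

```python
from collections import defaultdict


def merge_shingles(tokens, common_starts, n=3, max_gap=1):
    """Mapper-style step: merge common shingles into longer phrases.

    `common_starts` holds the token positions at which a common n-gram
    shingle begins. Shingles that overlap, or are separated by at most
    `max_gap` uncovered tokens (an assumed tolerance), become one phrase.
    """
    phrases = []
    span = None  # current (start, end) token span, end exclusive
    for start in sorted(common_starts):
        end = start + n
        if span is not None and start - span[1] <= max_gap:
            span = (span[0], max(span[1], end))  # extend the current phrase
        else:
            if span is not None:
                phrases.append(" ".join(tokens[span[0]:span[1]]))
            span = (start, end)
    if span is not None:
        phrases.append(" ".join(tokens[span[0]:span[1]]))
    return phrases


def documents_per_phrase(doc_phrases):
    """Reducer-style step: map each phrase to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, phrases in doc_phrases.items():
        for phrase in phrases:
            index[phrase].add(doc_id)
    return dict(index)


tokens = "so what do you think of this plan".split()
merged = merge_shingles(tokens, common_starts=[1, 2, 3])
print(merged)  # ['what do you think of']

index = documents_per_phrase({"blog_a": merged, "blog_b": merged + ["yes we can"]})
print(index["what do you think of"])
```

With `max_gap=0`, only overlapping or directly touching shingles are merged, which gives a stricter notion of a phrase.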

12. Ranking Memes

13. Clustering Memes
• Mapper: Single-link hierarchical clustering with cosine similarity
• Reducer: create/merge clusters
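Single-link clustering cut at a fixed similarity threshold is equivalent to finding connected components in the graph that links every pair of phrases whose similarity meets the threshold. The single-machine sketch below illustrates that idea with bag-of-words cosine similarity and a union-find structure. The threshold of 0.5, the bag-of-words vectors, and all names are assumptions for illustration; the slide does not give the parameters or the distributed implementation.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def single_link_clusters(phrases, threshold=0.5):
    """Group phrases so that any pair with similarity >= threshold shares a cluster.

    Linking through intermediate phrases is allowed, which is what makes
    this single-link. Compares all pairs, so it is quadratic in the input.
    """
    vectors = [Counter(p.lower().split()) for p in phrases]
    parent = list(range(len(phrases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(phrases)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, phrase in enumerate(phrases):
        clusters.setdefault(find(i), []).append(phrase)
    return list(clusters.values())


memes = ["yes we can", "yes we can change", "lipstick on a pig", "put lipstick on a pig"]
for cluster in single_link_clusters(memes):
    print(cluster)
```

The all-pairs loop is the expensive part, which is consistent with the next slide's emphasis on indexing and distribution to reduce clustering time.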

14. Efficiency: Meme Clustering
• From WEKA ARFF format to sparse representation
  – From ~96 hours → 11 hours
• Indexed vs. un-indexed
  – From 11 hours → 16 minutes (single core)
  – From 34 minutes → 3 minutes (136 cores)
• Distributed vs. single core
  – From 11 hours → 34 minutes (un-indexed)
  – From 16 minutes → 3 minutes (indexed)
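The timings above translate into speedup factors with simple division. The snippet below only restates the slide's numbers in minutes; the labels are informal, and "~96 hours" is itself approximate, so the derived factors are rough.

```python
HOUR = 60  # minutes

# (before, after) timings in minutes, as reported on the slide.
timings = {
    "ARFF -> sparse representation": (96 * HOUR, 11 * HOUR),
    "indexing, single core": (11 * HOUR, 16),
    "indexing, 136 cores": (34, 3),
    "distribution, un-indexed": (11 * HOUR, 34),
    "distribution, indexed": (16, 3),
}


def speedup(before, after):
    """Speedup factor: how many times faster the 'after' configuration is."""
    return before / after


speedups = {name: speedup(before, after) for name, (before, after) in timings.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")

# End to end: from the original ~96 hours down to 3 minutes.
overall = speedup(96 * HOUR, 3)
print(f"overall: {overall:.0f}x")
```

In these figures, indexing on a single core (about 41x) contributes more than distributing the un-indexed job over 136 cores (about 19x).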

15. Meme Browser: Original Interface

16. Meme Browser: Current Interface

17. Meme Evolution (Leskovec et al.’09)

18. Thank You!
Matt Lease
ml@ischool.utexas.edu
www.ischool.utexas.edu/~ml
@mattlease
• Joint Work with
  – Hohyon (Will) Ryu
    • InfoChimps (Summer’11)
    • Indeed.com (Summer’12)
  – Nicholas Woodward (TACC)
    • Latin American Network Information Center (LANIC)
• Support
  – FCT of Portugal / UT CoLab
  – Amazon Web Services
  – UT Austin LIFT Award
  – John P. Commons Fellowship
