News-oriented multimedia search over multiple social networks

Symeon Papadopoulos, Researcher at CERTH-ITI, Co-founder at infalia
Katerina Iliakopoulou, Symeon Papadopoulos and Yiannis Kompatsiaris
Centre for Research and Technology Hellas (CERTH) – Information Technologies Institute (ITI)
CBMI 2015, June 11, 2015, Prague, Czech Republic
Presented by Katerina Andreadou
The rise of Online Social Networks (OSNs)
• Increasingly popular → Massive amounts of data
– Both text and multimedia
• Content peaks when
– A planned event takes place (e.g., Olympic games)
– An unexpected news story breaks (e.g., earthquake)
Journalistic practices now involve the use of user-generated
content from OSNs for reporting on news stories and events
The Problem
• News stories are covered in multiple OSNs
– Twitter, Facebook, Google+, Instagram, Tumblr, Flickr
• No effective means of searching over multiple OSNs
– Necessary to build appropriate queries
– Find relevant hashtags and query keywords
• Effective querying is not straightforward
– Long complicated queries retrieve no results
– Vague queries bring back irrelevant content
The Problem is also OSN-specific
• Flickr search is more flexible
– It returns results that contain all requested keywords or a
portion of them with the appropriate ranking
• Instagram is more restrictive
– It can only handle hashtags
– It returns very few or no results to multi-keyword queries
• The order of keywords is also crucial for some OSNs
Query formulation has to be OSN-specific
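A minimal sketch of what OSN-specific formulation could look like, based only on the behaviours listed above. The function name `format_query` and the rule of collapsing keywords into a single hashtag for Instagram are assumptions made for illustration, not the authors' implementation.

```python
def format_query(osn, keywords):
    """Adapt a keyword list to the assumed search behaviour of one OSN."""
    cleaned = [k.strip().lower() for k in keywords if k.strip()]
    if osn == "instagram":
        # Hashtag-only search: collapse the keywords into one hashtag
        # (an illustrative assumption, not a documented rule).
        return "#" + "".join(cleaned)
    # Default: space-separated keywords, with the original order kept
    # because keyword order matters for some OSNs.
    return " ".join(cleaned)


print(format_query("instagram", ["Academy", "Awards"]))
print(format_query("flickr", ["Academy", "Awards"]))
```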
Content requirements
• High relevance to the topic of interest
• High quality of multimedia
• Diversity of retrieved content
• Usefulness with respect to reporting and publication
Related work
• Optimization of query formulation methods utilizing
terms, proximities and phrases with respect to their
frequency and text position
– Markov random field models (Metzler et al., 2005)
– Positional language models (Lv et al., 2009)
– Query operations (Mishne et al., 2005)
• Improve query formulation by modelling query
concepts
– Learning concept importance (Bendersky et al., 2010)
– Latent concept expansion using Markov random fields
(Metzler et al., 2007)
Goals and Contributions
• A novel graph-based query formulation method
– Tailored to the special characteristics of each OSN
– Captures the primary entities and their associations
– Builds numerous queries by greedy graph traversal
• A relevance classification method
– 12 features based on content (text, visual) and context
(popularity, publication time)
• Evaluation of the framework in real-world events
and stories
Overview of the Framework
Step I: Collection of highly relevant content
• Query six OSNs with a high precision query q0 to
build an initial collection M0
– news story headline
– official name of the event
• Lower the possibility of noisy content by
– discarding all material retrieved before the story broke
• Only some OSNs were found to contribute to the
collection: Twitter, Flickr, Google+
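The noise filter in Step I can be sketched as a timestamp comparison. This reading treats "before the story broke" as a publication time earlier than the story start, which is an assumption; the item layout (`published` field) and the dates are made up for illustration.

```python
from datetime import datetime


def filter_by_time(items, story_start):
    """Keep only items published at or after the story start."""
    return [it for it in items if it["published"] >= story_start]


# Illustrative items; the field names are hypothetical.
items = [
    {"id": 1, "published": datetime(2014, 3, 1)},
    {"id": 2, "published": datetime(2014, 3, 3)},
]
m0 = filter_by_time(items, datetime(2014, 3, 2))
print([it["id"] for it in m0])
```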
Step II: Keyword and hashtag extraction
• Extract the Named Entities from the M0 metadata
• Discard all stop-words and filter out HTML tags, web
links and social network account names
• Perform stemming for keywords that are not listed
as Named Entities to group keywords with similar
meaning
Create a list of keywords and a list of hashtags, each
associated with a frequency count
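The cleaning and counting part of Step II might look roughly like the sketch below. It deliberately leaves out Named Entity extraction and stemming, which need external tools, and the stop-word list is a tiny placeholder; all names here are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "at", "in", "on", "and", "for", "to", "is"}


def extract_terms(texts):
    """Return (keyword counts, hashtag counts) for a list of text metadata."""
    keywords, hashtags = Counter(), Counter()
    for text in texts:
        text = re.sub(r"<[^>]+>", " ", text)       # filter out HTML tags
        text = re.sub(r"https?://\S+", " ", text)  # filter out web links
        text = re.sub(r"@\w+", " ", text)          # filter out account names
        hashtags.update(h.lower() for h in re.findall(r"#(\w+)", text))
        text = re.sub(r"#\w+", " ", text)          # hashtags are counted separately
        words = re.findall(r"[a-z0-9]+", text.lower())
        keywords.update(w for w in words if w not in STOP_WORDS)
    return keywords, hashtags


texts = [
    "<b>Oscars</b> at the Dolby Theatre http://t.co/x @user #oscars",
    "Oscars red carpet #Oscars #redcarpet",
]
keywords, hashtags = extract_terms(texts)
print(keywords.most_common(3), hashtags.most_common(2))
```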
Step III: Graph construction
• Vertices → set of selected keywords
• Edges → their pairwise adjacency relations
– adjacency is computed with respect to the text metadata
• Each edge → frequency of appearance of the phrase
composed of the edge keywords
• Only significant keywords are considered →
keywords with greater frequency than the average
– elimination of noisy keywords
– cost-effectiveness
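One way to read Step III in code is shown below: keep keywords whose frequency exceeds the average, then weight each edge by how often the two keywords appear next to each other. Treating edges as directed (first word to second word) is an assumption, and `build_graph` is a hypothetical name.

```python
from collections import Counter


def build_graph(token_lists):
    """Return (vertices, edges) where edges maps (a, b) to an adjacency count."""
    freq = Counter(t for tokens in token_lists for t in tokens)
    if not freq:
        return set(), Counter()
    avg = sum(freq.values()) / len(freq)
    # Only keywords more frequent than the average become vertices.
    vertices = {t for t, c in freq.items() if c > avg}
    edges = Counter()
    for tokens in token_lists:
        for a, b in zip(tokens, tokens[1:]):
            if a in vertices and b in vertices and a != b:
                edges[(a, b)] += 1  # frequency of the phrase "a b"
    return vertices, edges


token_lists = [
    ["academy", "awards", "red", "carpet"],
    ["academy", "awards", "winner"],
    ["academy", "awards"],
]
vertices, edges = build_graph(token_lists)
print(vertices, dict(edges))
```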
Step IV: Query building
• Query → path from a starting
node to an end node given a
maximum of L hops
• Starting node → high out-degree
or connected to heavily weighted
edges
• Total score for a node
• Penalize queries with high text
similarity → Jaccard coefficient
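A rough sketch of a greedy traversal in the spirit of Step IV follows. The node score used here (sum of outgoing edge weights plus out-degree) is an assumption, since the slide's scoring formula did not survive in the transcript, and the similarity penalty is simplified to rejecting any query whose Jaccard coefficient with an accepted query exceeds a threshold.

```python
def jaccard(a, b):
    """Jaccard coefficient between two keyword collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def build_queries(edges, max_hops=2, max_queries=20, max_sim=0.5):
    """Greedily walk the keyword graph and return query strings."""
    out = {}
    for (a, b), w in edges.items():
        out.setdefault(a, []).append((w, b))

    def score(n):
        # Assumed score: total outgoing weight plus out-degree.
        return sum(w for w, _ in out[n]) + len(out[n])

    starts = sorted(out, key=lambda n: (-score(n), n))
    queries = []
    for start in starts:
        path, node = [start], start
        for _ in range(max_hops):
            candidates = [(w, b) for w, b in out.get(node, []) if b not in path]
            if not candidates:
                break
            # Follow the heaviest edge; ties are broken alphabetically.
            _, node = min(candidates, key=lambda c: (-c[0], c[1]))
            path.append(node)
        # Simplified penalty: drop queries too similar to accepted ones.
        if len(path) > 1 and all(jaccard(path, q) <= max_sim for q in queries):
            queries.append(path)
        if len(queries) == max_queries:
            break
    return [" ".join(q) for q in queries]


edges = {
    ("academy", "awards"): 5,
    ("awards", "ceremony"): 3,
    ("awards", "winner"): 1,
    ("red", "carpet"): 4,
}
print(build_queries(edges))
```

In this toy graph the path starting at "awards" is discarded because it overlaps too much with the path starting at "academy".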
Example: 86th Academy Awards
Step V: Relevance classification
• Popularity
• Textual relevance, computed w.r.t. the high-precision query q0
– title & description
– tags
• Visual similarity
• Temporal proximity to the story
• Image dimensions
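The slides do not say how textual relevance to q0 is measured. As one possible illustration, the sketch below scores a text field by the fraction of q0 terms it contains; this measure and the function names are assumptions, not the features used in the paper.

```python
import re


def tokens(text):
    """Lower-cased alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def textual_relevance(q0, text):
    """Fraction of q0 terms that also occur in the text (assumed measure)."""
    q = tokens(q0)
    return len(q & tokens(text)) / len(q) if q else 0.0


q0 = "86th Academy Awards"
title_score = textual_relevance(q0, "Red carpet at the Academy Awards")
tags_score = textual_relevance(q0, "oscars redcarpet")
print(title_score, tags_score)
```

Computing the score separately for title and description and for tags yields two features, matching the two sub-bullets above.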
Evaluation
• Choose 20 events and news stories which took place
up to five months before data collection
– the older the event, the more content disappears from the
OSNs
• Choose events with considerable size and variety
• Set the maximum number of keyword-based queries to
Mmax=20 and the maximum number of hashtag-based
queries to Mmax=10
Data statistics
• More than 88K images for all
20 events
• ~4.4K images per event/news
story on average
• Events are associated on
average with more images
(5.5K) than news stories (3.3K)
[Figure: number of images collected during the first and second querying steps]
Media volume per OSN
• Flickr contributes the most (66.9%) with Twitter
following (19%)
• Instagram and Google+ contribute less, but still considerable, content
• Tumblr and Facebook contribute the least content
– Tumblr has significantly lower usage
– Facebook has very poor search API behaviour
• Increase between the two retrieval steps
– Facebook, Flickr, Tumblr: 5x
– Google+, Instagram: even higher (8.1x and 6.8x)
– Twitter: 3x
Quality of formulated queries
• Evaluate the relevance and quality of the retrieved
content in the second step (Mext)
– A large majority (90%) of the images retrieved in the first
step (M0) were relevant
– Four human annotators
• Relevance is high (>50%) for 3 events
• Relevance is decent (>40%) for 3 news stories
• Half of the events and news stories are characterized
by low-to-medium relevance (10% - 40%)
• Relevance is very low (<10%) for two events and two
news stories
Why is irrelevant content collected?
• Vague keyword-based queries or hashtags
– Example: British Academy Film Awards → most popular
hashtag → british
– Example: Sundance Film Festival → vague query → film
festival
• False keyword-based queries
– They contain keywords irrelevant to the subject
– They are leftovers from the graph pruning and should
have been eliminated
Relevance classification
DT → Decision Tree, RF → Random Forest,
SVM → Support Vector Machine, MP → Multilayer Perceptron
Relevance classification
• RF outperforms the
rest in all cases
• DT is also very good
• SVM has the worst
performance
– Input features are
not normalized
– A few of them are
quantized to a small
set of possible
values
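Since the slide attributes the weak SVM results to non-normalized input features, a common remedy is to rescale each feature to a shared range before training. The sketch below shows plain min-max scaling; it illustrates the general technique and is not something the slides report having tried. The feature values are made up.

```python
def min_max_scale(rows):
    """Rescale every feature column to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        # A constant column carries no information and is mapped to 0.0.
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


# Hypothetical features: (popularity count, image width in hundreds of pixels)
rows = [[10, 2], [1000, 4], [505, 3]]
scaled = min_max_scale(rows)
print(scaled)
```

Without such rescaling, a feature with a large numeric range, such as a raw popularity count, can dominate the distance computations an SVM relies on.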
Conclusion - Contributions
• Searching for multimedia content around events and
news stories over multiple OSNs is challenging!
– Collect high quality relevant content in spite of the
different behaviors and requirements of the OSNs
• We proposed a multi-step process including
– a graph-based query building method
– a relevance classification step
• We evaluated the framework on a set of 20 large-
scale events and news stories of global interest
Future Work
• Improve the performance of the query building
method when the number of collected items in the
first step is small
• Extract statistically grounded relevance features
– Take into account distribution differences in different OSNs
• Apply the method while the event evolves
• Add support for the collection of video content
Thank you!
• Slides:
http://www.slideshare.net/sympapadopoulos/newsoriented-multimedia-search-over-multiple-social-networks
• Get in touch:
@matzika00 / katerina.iliakopoulou@gmail.com
@sympapadopoulos / papadop@iti.gr
News-oriented multimedia search over multiple social networks

  • 1. News-oriented multimedia search over multiple social networks Katerina Iliakopoulou, Symeon Papadopoulos and Yiannis Kompatsiaris 1Centre for Research and Technology Hellas (CERTH) – Information Technologies Institute (ITI) CBMI 2015, June 11, 2015, Prague, Czech Republic Presented by Katerina Andreadou
  • 2. The rise of Online Social Networks (OSNs) #2 • Increasingly popular  Massive amounts of data – Both text and multimedia • Content peaks when – A planned event takes place (e.g., Olympic games) – An unexpected news story breaks (e.g., earthquake) Journalistic practices now involve the use of user- generated content from OSNs for reporting on news stories and events
  • 3. The Problem #3 • News stories are covered in multiple OSNs – Twitter, Facebook, Google+, Instagram, Tumblr, Flickr • No effective means of searching over multiple OSNs – Necessary to build appropriate queries – Find relevant hashtags and query keywords • Effective querying is not straightforward – Long complicated queries retrieve no results – Vague queries bring back irrelevant content
  • 4. The Problem is also OSN-specific #4 • Flickr search is more flexible – It returns results that contain all requested keywords or a portion of them with the appropriate ranking • Instagram is more restrictive – It can only handle hashtags – It returns very few or no results to multi-keyword queries • The order of keywords is also crucial for some OSNs Query formulation has to be OSN-specific
  • 5. Content requirements #5 • High relevance to the topic of interest • High quality of multimedia • Diversity of retrieved content • Usefulness with respect to reporting and publication
  • 6. Related work • Optimization of query formulation methods utilizing terms, proximities and phrases with respect to their frequency and text position – Markov random field models (Metzler et al., 2005) – Positional language models (Lv et al., 2009) – Query operations (Mishne et al., 2005) • Improve query formulation by modelling query concepts – Learning concept importance (Bendersky et al., 2010) – Latent content expansion using markov random fields (Metzler et al., 2007) #6
  • 7. Goals and Contributions • A novel graph-based query formulation method – Catered for the special characteristics of each OSN – Captures the primary entities and their associations – Builds numerous queries by greedy graph traversal • A relevance classification method – 12 features based on content (text, visual) and context (popularity, publication time) • Evaluation of the framework in real-world events and stories #7
  • 8. Overview of the Framework #8
  • 9. Step I: Collection of highly relevant content • Query six OSNs with a high precision query q0 to build an initial collection M0 – news story headline – official name of the event • Lower the possibility of noisy content by – discarding all material retrieved before the story broke • Only some OSNs were found to contribute to the collection: Twitter, Flickr, Google+ #9
  • 10. Step II: Keyword and hashtag extraction • Extract the Named Entities from the M0 metadata • Discard all stop-words and filter out HTML tags, web links and social network account names • Perform stemming for keywords that are not listed as Named Entities to group keywords with similar meaning Create a list of keywords and a list of hashtags, each associated with a frequency count #10
  • 11. Step III: Graph construction • Vertices  set of selected keywords • Edges  their pairwise adjacency relations – adjacency is computed with respect to the text metadata • Each edge  frequency of appearance of the phrase composed of the edge keywords • Only significant keywords are considered  keywords with greater frequency than the average – elimination of noisy keywords – cost-effectiveness #11
  • 12. Step IV: Query building • Query → path from a starting node to an end node given a maximum number of L hops • Starting node → high out-degree or connected to heavily weighted edges • Total score for a node • Penalize queries with high text similarity → Jaccard coefficient
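
The traversal can be sketched as follows. The node score used in the original work is not reproduced in the transcript, so this example substitutes a simple stand-in: start nodes are ranked by total outgoing edge weight, and each hop follows the heaviest outgoing edge. The threshold `max_similarity` and the function names are likewise assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient between two keyword sequences, treated as sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def greedy_path(edges, start, max_hops):
    """Follow the heaviest outgoing edge for at most `max_hops` hops."""
    path = [start]
    for _ in range(max_hops):
        candidates = [(weight, dst) for (src, dst), weight in edges.items()
                      if src == path[-1] and dst not in path]
        if not candidates:
            break
        path.append(max(candidates)[1])
    return path


def build_queries(edges, max_hops=2, max_queries=20, max_similarity=0.5):
    """Build queries greedily, skipping those too similar to earlier ones."""
    out_weight = {}
    for (src, _), weight in edges.items():
        out_weight[src] = out_weight.get(src, 0) + weight

    queries = []
    # Stand-in ranking (assumption): heaviest total outgoing weight first.
    for start in sorted(out_weight, key=lambda node: (-out_weight[node], node)):
        candidate = greedy_path(edges, start, max_hops)
        if len(candidate) < 2:
            continue
        if all(jaccard(candidate, q) <= max_similarity for q in queries):
            queries.append(candidate)
        if len(queries) == max_queries:
            break
    return [" ".join(q) for q in queries]


example_edges = {
    ("academy", "awards"): 5,
    ("awards", "winners"): 3,
    ("awards", "ceremony"): 1,
    ("red", "carpet"): 4,
}
queries = build_queries(example_edges)
```

In this example the path "awards winners" is dropped because its Jaccard coefficient with "academy awards winners" is 2/3, above the assumed threshold.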
  • 13. Example: 86th Academy Awards
  • 14. Step V: Relevance classification • Feature groups: popularity, textual relevance, visual similarity, temporal proximity to the story, image dimensions • Textual relevance is computed with respect to the high-precision query q0 over – title & description – tags
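
To make the feature groups concrete, here is a small sketch that derives a few of them for one item. It is not the 12-feature set of the original work: the term-overlap measure, the field names (`likes`, `shares`, `width`, `height`) and the function names are all assumptions, and visual similarity is omitted because it needs an image descriptor.

```python
import re
from datetime import datetime, timedelta


def term_overlap(query, text):
    """Fraction of query terms that also occur in the text (assumed measure)."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    text_terms = set(re.findall(r"\w+", text.lower()))
    return len(query_terms & text_terms) / len(query_terms) if query_terms else 0.0


def relevance_features(item, q0, story_start):
    """Compute illustrative content and context features for one media item."""
    return {
        "text_relevance": term_overlap(q0, item["title"] + " " + item["description"]),
        "tag_relevance": term_overlap(q0, " ".join(item["tags"])),
        "hours_from_story": abs((item["published"] - story_start).total_seconds()) / 3600.0,
        "popularity": item["likes"] + item["shares"],
        "pixels": item["width"] * item["height"],
    }


story_start = datetime(2014, 3, 2, 18, 0)
features = relevance_features(
    {
        "title": "Academy Awards red carpet",
        "description": "86th ceremony",
        "tags": ["oscars", "academy"],
        "published": story_start + timedelta(hours=6),
        "likes": 10,
        "shares": 5,
        "width": 640,
        "height": 480,
    },
    "86th Academy Awards",
    story_start,
)
```

The resulting dictionary could then be turned into a feature vector for any of the classifiers compared later in the deck.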
  • 15. Evaluation • Choose 20 events and news stories which took place up to five months before data collection – the older the event, the more content disappears from the OSNs • Choose events with considerable size and variety • Set the maximum number of keyword-based queries to Mmax=20 and the maximum number of hashtag-based queries to Mmax=10
  • 16. Data statistics • More than 88K images for all 20 events • ~4.4K images per event/news story on average • Events are associated on average with more images (5.5K) than news stories (3.3K) [Figures: number of images collected during the first querying step; number of images collected during the second querying step]
  • 17. Media volume per OSN • Flickr contributes the most (66.9%) with Twitter following (19%) • Instagram and Google+ contribute less, but still a considerable amount • Tumblr and Facebook contribute the least content – Tumblr has significantly lower usage – Facebook has very poor search API behaviour • Increase between the two retrieval steps – Facebook, Flickr, Tumblr: 5x – Google+, Instagram: even higher (8.1x and 6.8x) – Twitter: 3x
  • 18. Quality of formulated queries • Evaluate the relevance and quality of the retrieved content in the second step (Mext) – A large majority (90%) of the images retrieved in the first step (M0) were relevant – Four human annotators • Relevance is high (>50%) for 3 events • Relevance is decent (>40%) for 3 news stories • Half of the events and news stories are characterized by low-to-medium relevance (10% - 40%) • Relevance is very low (<10%) for two events and two news stories
  • 19. Why is irrelevant content collected? • Vague keyword-based queries or hashtags – Example: British Academy Film Awards → most popular hashtag → british – Example: Sundance Film Festival → vague query → film festival • False keyword-based queries – They contain keywords irrelevant to the subject – They are leftovers from the graph pruning that should have been eliminated
  • 20. Relevance classification • Classifier abbreviations: DT → Decision Tree, RF → Random Forest, SVM → Support Vector Machine, MP → Multilayer Perceptron
  • 21. Relevance classification • RF outperforms the rest in all cases • DT is also very good • SVM has the worst performance – Input features are not normalized – A few of them are quantized to a small set of possible values
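
Since the slide attributes the weak SVM results partly to unnormalized inputs, one common remedy is to rescale every feature to a shared range before training. The sketch below shows plain min-max scaling; it is a generic technique chosen for illustration, not a step reported in the original evaluation, and `min_max_scale` is a hypothetical helper.

```python
def min_max_scale(rows):
    """Rescale each feature column of `rows` to the range [0, 1].

    Columns with a single distinct value are mapped to 0.0.
    """
    if not rows:
        return []
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(value - low) / (high - low) if high > low else 0.0
         for value, low, high in zip(row, lows, highs)]
        for row in rows
    ]


scaled = min_max_scale([[0, 100], [5, 300], [10, 200]])
```

Tree-based classifiers such as DT and RF are insensitive to this kind of rescaling, which fits the slide's observation that they were not affected in the same way.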
  • 22. Conclusion - Contributions • Searching for multimedia content around events and news stories over multiple OSNs is challenging! – Collect high quality relevant content in spite of the different behaviors and requirements of the OSNs • We proposed a multi-step process including – a graph-based query building method – a relevance classification step • We evaluated the framework on a set of 20 large-scale events and news stories of global interest
  • 23. Future Work • Improve the performance of the query building method when the number of collected items in the first step is small • Extract statistically grounded relevance features – Take into account distribution differences in different OSNs • Apply the method while the event evolves • Add support for the collection of video content
  • 24. Thank you! • Slides: http://www.slideshare.net/sympapadopoulos/newsoriented-multimedia-search-over-multiple-social-networks • Get in touch: @matzika00 / katerina.iliakopoulou@gmail.com, @sympapadopoulos / papadop@iti.gr

Editor's Notes

  1. http://irevolution.net/2014/04/03/using-aidr-to-collect-and-analyze-tweets-from-chile-earthquake/