Searching for Quality Microblog Posts:
Filtering and Ranking based on Content
Analysis and Implicit Links


Jan Vosecky, Kenneth Wai-Ting Leung, and Wilfred Ng
Department of Computer Science and Engineering
HKUST
Hong Kong

DASFAA '12
      Agenda


         Introduction
         Proposed method
         Quality features of tweets
         Experiments
         Conclusions
   Introduction


      Microblogs


          [Figure: two example tweets annotated with the user, a mentioned user,
           the timestamp, a hashtag, and a URL link]

         Both social network and social media
           Links between users (follow, mention, re-tweet)
           Users post updates (tweets)
      Searching for “ipad” on Twitter




          [Screenshot: around 50 tweets mentioning “iPad” posted within a 1-minute period]
      Research challenge


          Twitter: user-generated content
            Short messages, often comments or opinions
            High volume
            Varying quality
               "Most tweets are not of general interest (57%)" (Alonso et al. '10)
               Information overload
          Research questions:
            How to distinguish content worth reading from useless or less important messages?
            How to promote "high quality" content?
      Defining "quality"


          General (global) definition for assessing tweet quality
          3 criteria:
               Well-formedness
                + Well-written, grammatically correct, understandable
                - Heavy slang, misspellings, excessive punctuation
               Factuality
                + News, events, announcements
                - Unclear message, private conversations, generic personal feelings
               Navigational quality (URL links)
                + Reputable external resources (e.g. news articles)
      Quality-based tweet filtering

          [Figure: a list of tweets, each marked "+" (keep) or "-" (filter out)]
      Quality-based tweet ranking

          [Figure: the same tweets re-ordered by quality score, from 5 down to 1]
      Research goals


          Quality-based tweet filtering
            Filtering out low-quality tweets
                 In Twitter feeds
                 In search results

          Quality-based tweet ranking
            Re-ranking Twitter search results
                 For a given time period
   Proposed Method


      Representation of tweets


         Vector-space model: not sufficient
           Short tweet length, terms often malformed
           Ignores special features in Twitter

         Feature-vector representation
           Extract features from tweet
           Traditional features: e.g. length, spelling

            Twitter-specific features:
                 Exploiting hashtags, URL links, mentioned usernames
   Quality Features of Tweets


      Feature categories

           1. Punctuation and Spelling: number of exclamation marks, number of question
              marks, max. no. of repeated letters, % of correctly spelled words, no. of
              capitalized words, max. no. of consecutive capitalized words
           2. Syntactic and semantic complexity: max. & avg. word length, length of tweet,
              percentage of stopwords, contains numbers, contains a measure, contains
              emoticons, uniqueness score
           3. Grammaticality: has first-person part-of-speech, formality score, number of
              proper names, max. no. of consecutive proper names, number of named entities
           4. Link-based: contains link, is reply-tweet, is re-tweet, no. of mentions of
              users, number of hashtags, URL domain reputation score, RT source reputation
              score, hashtag reputation score
           5. Timestamp
      1. Punctuation and spelling


         Excessive punctuation
              Number of exclamation marks
              Number of question marks
              Max. number of consecutive dots
         Capitalization
              Presence of all-capitalized words
              Largest number of consecutive words in capital letters
         Spellchecking
              Number of correctly spelled words
              Percentage of words found in a dictionary
                    RT @_ChocolateCoco: WHO IS CHUCK NORRIS??!!!??
                    lls. He's only the greatest guy next to jesus lmao
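
          Below is a minimal Python sketch (not the authors' code) of how these punctuation
          and spelling features could be computed; the tiny word set standing in for a real
          dictionary and the simple tokenizer are illustrative assumptions.

```python
import re

# Tiny stand-in word list; a real spellchecker or dictionary would be used in practice.
DICTIONARY = {"who", "is", "chuck", "norris", "he's", "only", "the",
              "greatest", "guy", "next", "to", "jesus"}

def punctuation_spelling_features(tweet: str) -> dict:
    words = re.findall(r"[A-Za-z']+", tweet)
    # Longest run of consecutive all-capitalized words.
    max_consec_caps, run = 0, 0
    for w in words:
        run = run + 1 if w.isupper() and len(w) > 1 else 0
        max_consec_caps = max(max_consec_caps, run)
    return {
        "num_exclamation_marks": tweet.count("!"),
        "num_question_marks": tweet.count("?"),
        "max_consecutive_dots": max((len(m) for m in re.findall(r"\.+", tweet)), default=0),
        "has_all_caps_word": any(w.isupper() and len(w) > 1 for w in words),
        "max_consecutive_capitalized": max_consec_caps,
        "pct_words_in_dictionary": (sum(w.lower() in DICTIONARY for w in words) / len(words)
                                    if words else 0.0),
    }

print(punctuation_spelling_features(
    "RT @_ChocolateCoco: WHO IS CHUCK NORRIS??!!!?? lls. "
    "He's only the greatest guy next to jesus lmao"))
```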
      2. Syntactic and semantic complexity
         Syntactic complexity
              Tweet length
              Max. & avg. word length
              Percentage of stopwords
              Presence of emoticons and other sentiment indicators
              Presence of measure symbols ($, %)
              Numbers – number of digits
          Tweet uniqueness
               Uniqueness of the tweet relative to other tweets by the author
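
          For illustration, the syntactic complexity features might be extracted as in the
          sketch below; the stopword list and emoticon pattern are small stand-ins, and the
          uniqueness score (which compares a tweet to the author's other tweets) is not
          included here.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it", "for"}
EMOTICON = re.compile(r"[:;=8][-^']?[)(DPpO3/\\|]")  # crude emoticon pattern

def syntactic_features(tweet: str) -> dict:
    words = re.findall(r"[A-Za-z']+", tweet)
    lengths = [len(w) for w in words] or [0]
    return {
        "tweet_length": len(tweet),
        "max_word_length": max(lengths),
        "avg_word_length": sum(lengths) / len(lengths),
        "pct_stopwords": sum(w.lower() in STOPWORDS for w in words) / (len(words) or 1),
        "contains_emoticon": bool(EMOTICON.search(tweet)),
        "contains_measure": any(sym in tweet for sym in ("$", "%")),
        "num_digits": sum(ch.isdigit() for ch in tweet),
    }
```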
      3. Grammaticality


          Parts-of-speech labelling
               Presence of first-person parts-of-speech
               Formality score [Heylighen '02]
                   F = (noun freq. + adjective freq. + preposition freq. + article freq.
                        − pronoun freq. − verb freq. − adverb freq. − interjection freq. + 100) / 2
          Names
               Number of 'proper names', i.e. words with a single initial capital letter
               Number of consecutive 'proper names'
               Number of named entities




           F. Heylighen and J.-M. Dewaele. Variation in the contextuality of language: An empirical measure.
           Context in Context. Special issue Foundations of Science, 7(3):293–340, 2002.
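
           As an illustration, the formality score could be computed from part-of-speech
           counts roughly as follows. This is a sketch using NLTK's universal tagset;
           mapping DET to articles and X to interjections is our coarse approximation, not
           a detail taken from the paper.

```python
# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
#           nltk.download("universal_tagset")
import nltk
from collections import Counter

def formality_score(text: str) -> float:
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text), tagset="universal")]
    counts, total = Counter(tags), max(len(tags), 1)
    freq = lambda tag: 100.0 * counts[tag] / total  # frequency as a percentage of all words
    # F = (noun + adjective + preposition + article
    #      - pronoun - verb - adverb - interjection + 100) / 2
    # ADP stands in for prepositions, DET for articles, X for interjections.
    return (freq("NOUN") + freq("ADJ") + freq("ADP") + freq("DET")
            - freq("PRON") - freq("VERB") - freq("ADV") - freq("X") + 100) / 2
```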
      4. Link-based features


          Links to other items
            Re-tweet (RT), reply tweet, mention of other users
            Presence of a URL link
            Number of hashtags, as indicated by the "#" sign

          Link target's quality reputation
            Metrics to reflect the quality of tweets which relate to a
                 URL domain
                 Hashtag
                 User
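
          A hedged sketch of extracting the structural link features with simple regular
          expressions; real Twitter entity parsing (and the reputation scores described
          next) would need more care.

```python
import re

def link_features(tweet: str) -> dict:
    # Simplified patterns; Twitter's own entity extraction rules are more involved.
    urls = re.findall(r"https?://\S+|\b\w+\.\w{2,}/\S+", tweet)
    hashtags = re.findall(r"#\w+", tweet)
    mentions = re.findall(r"@\w+", tweet)
    return {
        "contains_link": bool(urls),
        "is_retweet": tweet.startswith("RT @") or " RT @" in tweet,
        "is_reply": tweet.startswith("@"),
        "num_mentions": len(mentions),
        "num_hashtags": len(hashtags),
    }
```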
      URL domain reputation


          Observation:
               Tweets which link to news articles are usually of better quality than
                tweets which link to photo-sharing websites

          [Figure: tweets with quality labels Q=1, Q=2 and Q=3 linking to Tweetpic.com,
           vs. tweets with Q=4 and Q=5 linking to NYtimes.com]

         Questions:
               What does the quality of tweets linking to a website say about its
                quality?
               Can we predict quality of future tweets linking to that website?
      URL domain reputation


          Step 1: URL translation
               Short link to original link
                  bit.ly/e2jt9F → http://www.reuters.com/4151120

          Step 2: summarize tweets linking to a URL domain
               Accumulate "quality reputation" over time
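
          One way Step 1 could look in practice, sketched with the `requests` library; the
          error handling and the stripping of "www." are our simplifications.

```python
import requests
from urllib.parse import urlparse

def resolve_domain(short_url: str) -> str:
    """Resolve a (possibly shortened) link and return only its domain."""
    if not short_url.startswith("http"):
        short_url = "http://" + short_url
    try:
        # Follow redirects without downloading the page body.
        resp = requests.head(short_url, allow_redirects=True, timeout=5)
        final_url = resp.url
    except requests.RequestException:
        final_url = short_url  # fall back to the original link
    netloc = urlparse(final_url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc

# Per the slide's example, resolve_domain("bit.ly/e2jt9F") would yield "reuters.com".
```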
      URL domain reputation


          Average URL domain quality

               AvgQ(d) = (1 / |Td|) · Σ_{t ∈ Td} qt

               Td = set of tweets linking to domain d
               qt = quality label of tweet t

               Weakness:
                   Does not reflect the number of inlink tweets in the score
                   Favours domains with few inlink tweets
      URL domain reputation


          Domain reputation score

               DRS(d) = AvgQ(d) · log10(|Td|),  where AvgQ(d) is in [-1, +1]

               "Collecting evidence" behaviour:
                   The score gets higher as more good-quality inlink tweets accumulate

               [Plot: DRS vs. |Td| (1 to 1000, log scale) for AvgQ values from -1 to +1,
                with DRS ranging from about -4.00 to +4.00]
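
          Taking the formula above at face value, both domain statistics can be computed
          from (domain, quality) pairs as in this sketch; quality labels are assumed to be
          already rescaled to the [-1, +1] range.

```python
import math
from collections import defaultdict

def domain_reputation(tweets):
    """tweets: iterable of (domain, quality) pairs, quality already scaled to [-1, +1]."""
    by_domain = defaultdict(list)
    for domain, quality in tweets:
        by_domain[domain].append(quality)
    scores = {}
    for domain, qualities in by_domain.items():
        avg_q = sum(qualities) / len(qualities)       # AvgQ(d)
        drs = avg_q * math.log10(len(qualities))      # DRS(d) = AvgQ(d) * log10(|Td|)
        scores[domain] = (avg_q, drs)
    return scores

# e.g. 99 inlink tweets averaging +0.96 give DRS ≈ 1.92 (cf. gallup.com in the table below).
```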
      URL domain reputation




      10 domains with a high DRS:                 10 domains with a low DRS:
      Domain            AvgQ   Inlinks    DRS     Domain            AvgQ   Inlinks    DRS
      gallup.com         0.96      99     1.92    tweetphoto.com   -0.77      106    -1.57
      mashable.com       0.79      97     1.58    twitpic.com      -0.75      113    -1.54
      hrw.org            0.86      57     1.51    twitlonger.com   -0.85       66    -1.54
      foxnews.com        0.68      38     1.08    myloc.me         -0.85       54    -1.48
      good.is            0.68      31     1.01    instagr.am       -0.62       52    -1.06
      intuit.com         0.57      60     1.01    formspring.me    -0.78       18    -0.98
      forbes.com         0.68      19     0.87    yfrog.com        -0.55       53    -0.94
      reuters.com        1.00       6     0.78    lockerz.com      -0.63       16    -0.75
      cnn.com            0.36      85     0.70    qik.com          -0.75        8    -0.68

      Mainly news-oriented sites                  Mainly image- and location-sharing sites
      Reputation of hashtag & user




          [Figure: tweets with quality labels Q=1, Q=2 and Q=3 tagged #justforfun,
           vs. tweets with Q=4 and Q=5 tagged #DASFAA]

          Hashtag reputation:                 #DASFAA vs. #justforfun
          Re-tweet source user reputation:    @barackobama vs. @wysz22212
   Experiments


      Dataset


         10,000 tweets
           100    users, 100 recent tweets per user
         Users:
           50 random users
           50 influential users
                Selected  from listorious.com
                5 categories: technology, business, politics,
                 celebrities, activism
                10 users per category
      Labelling


         Crowdsourcing
              Amazon Mechanical Turk
         3 labels per tweet from different reviewers
         Possible labels: 1 to 5
              1 = low quality, 5 = high quality
         Random order of tweets
      Labelling results


          Tweet quality distribution
               [Chart: distribution of crowdsourced quality scores (1 to 5) over the dataset]
      Feature analysis


         Total 29 features
         Top 5 features based on Information Gain:

                   0.374   Domain reputation
                   0.287   Contains link
                   0.130   Formality score
                   0.127   Num. proper names
                   0.113   Max. proper names
      Feature selection


          Greedy attribute selection
            15 selected features:

                Domain reputation                RT source reputation
                Formality                        Tweet uniqueness
                No. named entities               % correctly spelled words
                Max. no. repeated letters        No. hashtags
                Contains numbers                 No. capitalized words
                Is reply-tweet                   Is re-tweet
                Avg. word length                 Contains first-person
                No. exclamation marks
      Classification and Ranking Method
          Classification:
            SVM, binary classification (high-quality vs. low-quality); see the sketch below
            50/50 split for training/testing

          Ranking:
            Learning-to-rank (RankSVM)
            30 queries from 5 topic categories

            Process:
                1.   Retrieve tweets matching a query
                2.   Extract features from the tweets
                3.   "Query-tweet vector" pairs + quality scores of the tweets form the training data
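
          A minimal sketch of the classification setup with scikit-learn; the RBF kernel
          and the threshold used to binarise the 1-5 crowdsourced scores are our
          assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def train_quality_classifier(X: np.ndarray, scores: np.ndarray):
    """X: feature matrix (one row per tweet); scores: crowdsourced quality labels 1-5."""
    y = (scores >= 4).astype(int)  # assumed binarisation of the 1-5 labels
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    clf = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    return clf, auc
```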
      Classification results


       Features                     #attributes   High-Quality      Low-Quality        Overall
                                                  P       R         P        R         AUC
       Link only                         1        0.798   0.702     0.894    0.934     0.818
       TF-IDF                         3322        0.862   0.665     0.885    0.960     0.813
       Subset.Reputation                 3        0.812   0.746     0.909    0.936     0.841
       Subset.SVM ("greedy")            15        0.715   0.758     0.912    0.936     0.847
       All quality features             29        0.815   0.660     0.882    0.944     0.802
       All quality ftrs + TF-IDF      3351        0.739   0.775     0.915    0.899     0.837


      Optimal feature set (15 attrs.) outperforms TF-IDF (3322 attrs.)
      Link-based “reputation” features (3 attrs.) achieve the 2nd best result
      Combining quality features + TF-IDF does not improve result
      Classification results



   Features                     #attributes    AUC
   Link only                         1         0.818
   TF-IDF                         3322         0.813
   Subset.Reputation                 3         0.841
   Subset.SVM ("greedy")            15         0.847
   All quality features             29         0.802
   All quality ftrs + TF-IDF      3351         0.837

    Optimal feature set achieves reduced training time and storage cost
        [Charts: storage cost and training time for each feature set]
      Ranking results



                                                  NDCG@N
       Features                  #attributes      1       2       5       10       MAP
       Link only                      1           0.067   0.111   0.220   0.324    0.398
       Subset.Reputation              3           0.822   0.777   0.777   0.764    0.661
       Subset.SVM ("greedy")         15           0.867   0.767   0.778   0.769    0.653
       All quality features          29           0.733   0.733   0.763   0.753    0.637



      Optimal feature set (15 attrs.) achieves the best result
      Link-based “reputation” features (3 attrs.) achieve the 2nd best result
   Conclusions


      Summary


         Method for quality-based classification and
          ranking of tweets
          Proposed and evaluated a set of tweet
           features that capture tweet quality
         Link-based features lead to the best
          performance
      Future work


         Consider different types of queries in Twitter
           E.g. searching for hot topics, movie reviews,
            facts, opinions, etc.
           Different features may be important in different
            scenarios
         Incorporating recent hot topics
         Personalized re-ranking
      Q/A


      Thank You
Related work


        Spam detection
              Bag-of-words, keyword-based
              Feature-based approaches
              Combinations

        Social networks
            Finding quality answers in Q-A systems
              E.g. Yahoo Answers
              Feature-based

        Web search
            Quality-based ranking of web documents
                 Feature-based quality score (WSDM '11)
ROC Curve




        Area under the ROC curve: probability that a classifier
         will rank a randomly chosen positive instance higher
         than a randomly chosen negative one
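
         The pairwise-probability reading of AUC can be checked directly, e.g. against
         scikit-learn; the toy scores below are illustrative only.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

y_true  = [1, 1, 0, 0, 0]
y_score = [0.9, 0.4, 0.6, 0.3, 0.1]

# Fraction of (positive, negative) pairs ranked correctly (ties count as 0.5).
pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]
pairs = list(product(pos, neg))
pairwise = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print(pairwise, roc_auc_score(y_true, y_score))  # both print 0.8333...
```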
