Microblog (Twitter) mining

          yutao
What is Twitter?
• A tweet is limited to 140 characters
• A hashtag (#) is placed before relevant keywords in a tweet
• RT means “retweet”: forwarding another user’s tweet
• An @ reference refers to a user’s screen name
Why is it different?
• Very short in length
• Written in informal style
• Social
What is Twitter, a social network or a news media? (WWW 2010)
• Following is mostly not reciprocated (not so “social”)
• Users talk about timely topics
• A few users reach a large audience directly
• Most users can reach a large audience quickly by word of mouth
Early analysis
Analysis 1: Take the people out
• Krishnamurthy et al. (2008)
• Users were classified by follower/following counts (numbers and ratios)
• Means and mechanisms of their engagement
  – Web (61.7%), mobile/text (7.5%), software (22.4%)
Analysis 2: Content category
Four meta-categories:
• Daily chatter
• Conversations
• Information / URL sharing
• News reporting
Analysis 3: Measuring user influence
• Measures: indegree, retweets and mentions
• Strong correlation between retweets and mentions
• Most connected != most influential
User influence
How to detect spam?
• Treat it as a classification problem
• Content attributes
  – hashtags, trending topics
  – replies, mentions, HTTP links
• User behavior attributes
  – age of the user account
• Graph-based attributes
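A minimal sketch of how the content and user-behavior attributes listed above might be turned into features for a classifier. The function name, the argument names and the exact feature set are assumptions chosen for illustration; the slide does not specify them.

```python
import re

# Simple patterns for the content attributes named on the slide.
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")


def spam_features(tweet_text, account_age_days):
    """Return a dict of simple spam-detection features for one tweet.

    Hypothetical feature set: counts of hashtags, mentions and links,
    whether the tweet is a reply, and the age of the posting account.
    """
    return {
        "n_hashtags": len(HASHTAG.findall(tweet_text)),
        "n_mentions": len(MENTION.findall(tweet_text)),
        "n_urls": len(URL.findall(tweet_text)),
        "is_reply": tweet_text.startswith("@"),
        "account_age_days": account_age_days,
    }
```

Feature dicts like this could then be fed to any standard supervised classifier; graph-based attributes would need the follower graph and are left out here.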
Sentiment analysis
• Supervised classification
• Training data are collected from Twitter itself rather than labeled by hand
• Happy emoticons: “:-)”, “:)”, “=)”, “:D”, etc.
• Sad emoticons: “:-(”, “:(”, “=(”, “;(”, etc.
• Objective examples: newspapers and magazines such as the “NY Times”
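The emoticon-based labeling above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the choices to skip tweets with mixed emoticons and to strip emoticons from the training text are assumptions, not something the slide states.

```python
# Emoticon lists taken from the slide.
HAPPY = (":-)", ":)", "=)", ":D")
SAD = (":-(", ":(", "=(", ";(")


def emoticon_label(tweet):
    """Return 'positive', 'negative', or None for a raw tweet string."""
    has_happy = any(e in tweet for e in HAPPY)
    has_sad = any(e in tweet for e in SAD)
    if has_happy == has_sad:
        # Neither kind of emoticon, or both: no reliable label (assumption).
        return None
    return "positive" if has_happy else "negative"


def build_training_set(tweets):
    """Return (text, label) pairs for every tweet that can be labeled.

    Emoticons are removed from the text so that a classifier has to
    learn from the remaining words rather than from the label itself.
    """
    examples = []
    for tweet in tweets:
        label = emoticon_label(tweet)
        if label is None:
            continue
        text = tweet
        for emoticon in HAPPY + SAD:
            text = text.replace(emoticon, " ")
        examples.append((" ".join(text.split()), label))
    return examples
```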
Trend detection
• Bursty keyword detection
• Bursty keyword grouping
• Context extraction (such as PCA, SVD)
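One simple way to illustrate bursty keyword detection is to flag a word when its count in the current time window is well above its historical average. The rule below (mean plus `k` standard deviations, with a minimum count) and all names and default values are assumptions for this sketch; the slide does not say which burst test is used.

```python
from collections import Counter
from statistics import mean, pstdev


def bursty_keywords(history, current, k=2.0, min_count=3):
    """Return the sorted list of words that burst in the current window.

    history: non-empty list of token lists, one per past time window.
    current: token list for the current time window.
    A word is flagged when its current count is at least min_count and
    exceeds mean + k * std of its counts over the past windows.
    """
    if not history:
        raise ValueError("history must contain at least one past window")
    past_counts = [Counter(window) for window in history]
    now = Counter(current)
    bursty = []
    for word, count in now.items():
        if count < min_count:
            continue
        series = [c[word] for c in past_counts]  # Counter gives 0 if absent
        threshold = mean(series) + k * pstdev(series)
        if count > threshold:
            bursty.append(word)
    return sorted(bursty)
```

Grouping the flagged keywords (for example by co-occurrence in the same tweets) would be the next step and is not shown.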
Twitter search (WSDM 2011)
The largest difference
• Twitter search: results ordered by time
• Search engine: results ordered by relevance

• Social
• Time
Recommendation
Recommending content from information streams
• The filtering problem:
  – “I get 1000+ items in my stream daily but only have time to read 10 of them. Which ones should I read?”
• The discovery problem:
  – “There are millions of URLs posted daily on Twitter. Am I missing something important there outside my own Twitter stream?”
Recommending content from information streams
• Recency of content: items are only interesting for a short time after they are published
  – always a “cold start” situation
• Explicit interaction among users
  – Users interact explicitly by subscribing or sharing
• User-generated content
  – People are content producers as well as consumers
URL sources
• Considering all URLs is infeasible
• FoF: URLs from followees-of-followees
• Popular: URLs that are popular across the whole of Twitter
Topic relevance scores
• Topic profile of URLs
  – Use term vectors as profiles
  – Built from tweets that have mentioned the URL
• Topic profile of users
  – Self-topic: content profile based on what I post
  – Followee-topic: content profile based on what my followees post
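A minimal sketch of scoring a URL against a user's topic profile with term vectors. Plain term counts and cosine similarity are assumptions here; the slide only says that term vectors are used as profiles, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(texts):
    """Build a bag-of-words term-count vector from a list of tweet strings."""
    vec = Counter()
    for text in texts:
        vec.update(text.lower().split())
    return vec


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_relevance(user_tweets, url_tweets):
    """Score a URL by comparing the user's profile with the URL's profile.

    user_tweets: tweets posted by the user (self-topic) or by the
                 user's followees (followee-topic).
    url_tweets:  tweets that mention the URL.
    """
    return cosine(term_vector(user_tweets), term_vector(url_tweets))
```

A real system would probably weight terms (for example with TF-IDF) and remove stop words before comparing profiles.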
Social network scores
• “Popular vote” among my followees-of-followees
  – People “vote” for a URL by tweeting it
  – Votes are weighted using the social network structure
  – URLs with more votes in total are assigned a higher score
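The voting idea above can be sketched as follows. The slide does not say how votes are weighted, so the weighting used here (a followee-of-followee counts once for each of my followees who follows them) is purely an assumption, as are the function and argument names.

```python
from collections import defaultdict


def vote_scores(my_followees, follows, tweeted_urls):
    """Score URLs by weighted votes from followees-of-followees.

    my_followees: set of user ids that I follow.
    follows:      dict mapping a user id to the set of user ids they follow.
    tweeted_urls: dict mapping a user id to the set of URLs they tweeted.
    """
    # Assumed weighting: number of my followees who follow the voter.
    weight = defaultdict(int)
    for followee in my_followees:
        for fof in follows.get(followee, set()):
            weight[fof] += 1

    # Each voter adds their weight to every URL they tweeted.
    scores = defaultdict(float)
    for voter, w in weight.items():
        for url in tweeted_urls.get(voter, set()):
            scores[url] += w
    return dict(scores)
```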
Recommending Twitter users to follow
• Social graph
• Profile the user based on
  – The user themself
  – Followers
  – Followees
Microblog summarization
The phrase reinforcement algorithm
• Looks for the most commonly occurring phrases
  – Users tend to use similar words when describing a particular topic
  – Retweets (RT) repeat the original wording
Hybrid TF-IDF summarization
• TF: the “document” is the entire collection of posts
• IDF: each “document” is a single post
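A sketch of how the two definitions above might combine into a score per post: term frequency is computed over the whole collection, inverse document frequency treats each post as its own document, and the best-scoring post serves as the summary. The length normalization (`min_length`), the base-2 logarithm and the whitespace tokenization are assumptions made for this illustration.

```python
import math
from collections import Counter


def hybrid_tfidf_scores(posts, min_length=5):
    """Return one hybrid TF-IDF score per post, in input order."""
    tokenized = [post.lower().split() for post in posts]
    all_words = [w for tokens in tokenized for w in tokens]

    tf = Counter(all_words)          # TF: whole collection is one document
    total_words = len(all_words)
    # IDF: each post is a document, so count posts containing the word.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    n_posts = len(posts)

    def weight(word):
        return (tf[word] / total_words) * math.log2(n_posts / df[word])

    scores = []
    for tokens in tokenized:
        # Assumed normalization so that long posts are not favored.
        norm = max(min_length, len(tokens))
        scores.append(sum(weight(w) for w in tokens) / norm)
    return scores
```

For example, `posts[scores.index(max(scores))]` would pick the single post used as the summary.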
Topic model
Content modeling on Twitter
• Surface word features
  – tf.idf cosine similarity, etc.
• Deeper natural language processing
  – Parsing, parts of speech, coreference, etc.
• Example tweet (THE_REAL_SHAQ): “dats yur mom not me lol”
Content modeling on Twitter
• Surface word features
  – tf.idf cosine similarity, etc.
• Topic models, dimensionality reduction
  – Latent Dirichlet Allocation (LDA), LSA, etc.
• Supervised classification (#hashtags, emoticons, questions, etc.)
  – Naïve Bayes, SVM, etc.
• Labeled LDA
• Callout: “Best model in ranking experiments”
Content modeling with Labeled LDA
• Discover unlabeled topics: parameter K=200 latent topic dimensions
  – obama president american america says country russia pope island
  – I’m going go out gonna see im tonight sleep tomorrow about am night
• Model common labels: 500 - 1000 dimensions for hashtags, emoticons, etc.
  – Smile :)
    :) good day morning thanks have happy hope birthday
    :) can’t wait see one yay!!! cant tomorrow got !! next christmas
  – #jobs
    #jobs featured manager sales engineer yahoo location senior
Content modeling with Labeled LDA
• Words in each post are assigned to topics or labels
  – “new muppetblog political commentary link”: 4 1 1 1
  – “@kermit heyy wanna catch a movie”: 2 2 2 3 3
  – “just ate a cookie #yummy”: 5 5 #yummy #yummy
• Histogram as signature for set of posts
Twitter content by category
• Social (23%)
  – can make help if someone tell_me them anyone use makes any sense trying explain
  – up what's hit pick whats hey set twitter sign give catch when show first wats make
• Substance (27%)
  – obama president american america says country russia pope island failed honduras
  – iphone new phone app mobile apple ipod blackberry touch pro store apps free android an
• Style (38%)
  – haha lol :) funny :p omg hahaha yeah too yes thats ha wow cool lmao though kinda
  – im get dont gonna shit gotta wanna cuz damn ur make cant say cause bout ill mad tired
• Status (12%)
  – am still doing sleep so going tired bed awake supposed hell asleep early sleeping sleepy
  – night sleep bed going off tomorrow bye tonight goodnight all im time now nite
Characterizing Microblogs with Topic Models
Outline
• Modeling Twitter content with topic models
• Characterizing, recommending and filtering
Characterizing users
TwitterRank: Finding Topic-sensitive Influential Twitterers
• Apply LDA to distill topics automatically
• Find topics in the twitterer’s content to represent her interests
  – Twitterer’s content = aggregated tweets
• Twitterers with “following” relationships are more similar in the topics they are interested in than those without
Topic-specific TwitterRank
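As a rough illustration of the general idea of a topic-specific rank (a random walk over the following graph that is biased toward users interested in a topic), the sketch below runs a topic-sensitive PageRank-style iteration. It is not a reproduction of the TwitterRank formula: the transition weighting, the damping value `gamma`, the handling of users who follow nobody, and all names are assumptions.

```python
def topic_rank(follows, topic_weight, gamma=0.85, iterations=50):
    """Return a dict user -> topic-specific score (scores sum to 1).

    follows:      dict mapping a user to the set of users they follow.
    topic_weight: dict mapping a user to a non-negative number giving
                  their interest in the topic (e.g. from LDA); at least
                  one user must have a positive weight.
    """
    users = sorted(set(follows) | {v for vs in follows.values() for v in vs})
    total = sum(topic_weight.get(u, 0.0) for u in users)
    # Teleport distribution: proportional to topic interest.
    teleport = {u: topic_weight.get(u, 0.0) / total for u in users}

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {u: (1 - gamma) * teleport[u] for u in users}
        for u in users:
            followees = follows.get(u, set())
            out = sum(topic_weight.get(v, 0.0) for v in followees)
            if out == 0:
                # No usable out-links: spread this user's rank by teleport.
                for v in users:
                    new_rank[v] += gamma * rank[u] * teleport[v]
            else:
                # Influence flows from follower to followee, weighted by
                # the followee's interest in the topic (assumed weighting).
                for v in followees:
                    new_rank[v] += gamma * rank[u] * topic_weight.get(v, 0.0) / out
        rank = new_rank
    return rank
```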
Interesting applications
• Personalized and automatic social summarization of events in video
• Twitter can predict the stock market
• Predicting elections with Twitter
• Earthquake detection (time, location)
Thanks
Many pictures and slides come from the Internet

Twitter mining

  • 2. What is twitter? • 140 character tweet • Hashtag # before relevant keywords in tweet • RT means to “re-tweet” or forward a tweet • @ reference refers to a user’s screen name
  • 3. Why it is different? • Very short in length • Written in informal style • Social
  • 4. What is twitter, a social network or a news media?(www2010) • Following is mostly not reciprocated(not so “social”) • Users talk about timely topics • A few users reach large audience directly • Most users can reach large audience by word- of-mouth quickly
  • 6. Analysis 1: Take the people out • Krishnamurthy et al (2008) • users were classified by follower/following counts, Numbers and ratios • means and mechanisms of their engagement Web (61.7%), mobile/text (7.5%), software (22.4%)
  • 7. Analysis 2: Content Category Four meta-categories • daily chatter • conversations • information / URL sharing • news reporting
  • 8. Analysis 3: measuring user influence • Indegree, retweets and mentions • Strong correlation between retweet and mention • Most connected != most influential
  • 10.
  • 11.
  • 12.
  • 13. How to detect spam? • classification • Content attributes hashtags, trending topics replies, mentions, http links • User behavior attributes age of user account • Graph based attribute
  • 14. Sentiment analysis • Supervised classification • Training data come from twitter, instead of human labeled • Happy emotions: “:-)”, “:)”, “=)”, “:D” etc • Sad emotions: “:-(”, “:(”, “=(”, “;(” etc • Objective: newspapers and magzines such as “NY times”
  • 15. Trend detection • Bursty keywords detection • Bursty keywords grouping • Context extraction(such as PCA, SVD)
  • 17. The largest difference • Twitter search order by time • Search engine order by relevance • Social • Time
  • 19. Recommending content from information streams • The filtering problem: – “I get 1000+ items in my stream daily but only have time to read 10 of them. Which ones should I read?” • The Discovery Problem: – “There are millions of URLs posted daily on twitter. Am I missing something important there outside my own Twitter stream?”
  • 20. Recommending content from information streams • Recency of content: only interesting within a short time after published. – always a “cold start” situation • Explicit interaction among users – Explicitly interact by subscribing or sharing • User-generated content – People are content producers as well as consumers
  • 21. Recommending content from information streams
  • 22. URL Sources • Considering all URLs was impossible • FoF : URLs from followee-of-followees • Popular : URLs that are popular across whole twitter
  • 23. Topic relevance scores • Topic profile of URLs – Use term vectors as profiles – Built from tweets that have mentioned the URL • Topic profile of users – Self-topic: content profile based on what I post – Followee-Topic: content profile based on what my followees post
  • 24. Social network scores • “Popular Vote” in among my followees-of- followees – People “vote” a URL by tweeting it – Votes are weighted using social network structure – URLs with more votes in total are assigned higher score
  • 25.
  • 26. Recommending twitter users to follow • Social graph • Profile user – User himself – Followers – followees
  • 28. The phrase reinforcement algorithm • Looking for the most commonly occurring phrases – Users tend to use similar words when describing a particular topic – RT
  • 29.
  • 30. Hybrid TF-IDF summarization • TF: the document is the entire collection of posts • IDF: the document is a single post
  • 32. Content modeling on Twitter tf.idf cosine similarity, Surface word etc. features Deeper Parsing, parts of dats yur mom not speech, coreferen natural me lol ce, etc language processing THE_REAL_SHAQ 32
  • 33. Content modeling on Twitter tf.idf cosine Topic Latent Dirichlet similarity, Surface word models, Dimen Allocation (LDA), etc. features sionality LSA, etc. reduction Supervised classification #hashtags, emotico ns, questions, etc. Labeled LDA Best model in Naïve Bayes, ranking SVM, etc. experiments 33
  • 34. Content modeling with Labeled LDA Discover unlabeled topics Model common labels Parameter K=200 latent 500 - 1000 dimensions for topic dimensions hashtags, emoticons, etc. obama president Smile : ) american america says :) good day country russia morning thanks #jobs pope island have happy #jobs featured hope birthday I’m going go out manager sales gonna see im :) can‘t wait see engineer yahoo tonight sleep one yay!!! cant location senior tomorrow about tomorrow got !! am night next christmas 34
  • 35. Content modeling with Labeled LDA • [Figure: example posts annotated with per-word topic numbers: “new muppetblog political commentary link”, “@kermit heyy wanna catch a movie”, “just ate a cookie #yummy”] • Histogram as signature for set of posts
  • 36. Twitter content by category • Substance: 27% • Social: 23% • Status: 12% • Style: 38% • [Figure: example topic words, e.g. “obama president american america says country russia pope island”, “iphone new phone app mobile apple ipod blackberry”, “haha lol :) funny :p omg”, “tired bed awake asleep sleeping sleepy”]
  • 37. Characterizing Microblogs with Topic Models: Outline • Modeling Twitter content with topic models • Characterizing, recommending and filtering
  • 40. TwitterRank: Finding Topic-sensitive Influential Twitterers • Apply LDA to distill topics automatically • Find topics in the twitterer’s content to represent her interests – Twitterer’s content = aggregated tweets • Twitterers with “following” relationships are more similar than those without according to the topics they are interested in
  • 42. Interesting applications • Personalized and automatic social summarization of events in video • Twitter can predict the stock market • Predicting elections with Twitter • Earthquake detection (time, location)
  • 43. Thanks • Many pictures and slides come from the internet

Editor's Notes

  1. Hashtags are indicated by a # symbol and are combined with keywords to indicate a topic of interest. Hashtags become popular when many people use them. Popular topics, known as “trending” topics, appear on the main Twitter page and can significantly increase the number of tweets containing that topic.
  2. http://www.slideshare.net/haewoon/what-is-twitter-a-social-network-or-a-news-media-3922095 OSN: “we are friends”. Twitter: “follow you”. Media: the means of communication, such as radio and television, newspapers, and magazines, that reach or influence people widely. Only 22.1% of user pairs follow each other (Flickr 68%, Yahoo 84%). The majority of topics are headlines. Twitter user ranking by followers, PageRank, and RT: followers and PageRank (actor, musician, show host, sports star, model); RT (news). A retweet brings a few hundred additional readers (55% of RTs < 1 hr). Summary: low reciprocity distinguishes Twitter from OSNs. Twitter has characteristics of news media: 1. tweets mentioning timely topics; 2. plenty of hubs reaching a large public directly; 3. fast and wide spread of word-of-mouth.
  3. Indegree: news sources, politicians, athletes, celebrities. Retweet: content aggregation services, news sites. Mention: celebrities.
  4. http://www.slideshare.net/daniel.gayo/overcoming-spammers-in-twitter-a-tale-of-five-algorithms11650140
  5. The follower/followee ratio “matters” more than the raw number of followers. Following people is a simple way to get followers.
  6. TunkRank is an influence ranking tool that helps you identify leading influencers on Twitter. There are two basic ideas: (1) The amount of attention you can give is spread out among all those you follow; the more you follow, the less attention you can give each one. (2) Your influence depends on the amount of attention your followers can give you. As a twitterer, your influence does not depend on how many people you follow. However, your usefulness as a follower does. Having higher influence depends on having many followers who follow relatively few people but are followed by many. Followers like that are more likely to read your tweets and act on them (retweeting, clicking links, responding, blogging, etc.). Their influence trickles up to you. Your TunkRank score reflects both how much attention your followers can give you directly and how much attention they bring you from their own networks of followers.
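The two ideas in this note can be sketched as an iterative computation. The example assumes the commonly cited TunkRank recurrence, in which each follower Y contributes (1 + p × influence(Y)) / |following(Y)| to the people Y follows, with p a constant probability of passing a tweet on. The value of `p`, the fixed iteration count, the function name, and the sample graph are assumptions, not details from these notes.

```python
def tunkrank(follows, p=0.05, iterations=50):
    """Iteratively estimate influence.

    follows: dict mapping each user to the set of users they follow.
    p: assumed constant probability that a follower passes a tweet on.
    """
    users = set(follows) | {v for vs in follows.values() for v in vs}
    influence = {u: 0.0 for u in users}
    for _ in range(iterations):
        updated = {u: 0.0 for u in users}
        for follower, followees in follows.items():
            if not followees:
                continue
            # A follower's attention is split evenly among those they follow.
            share = (1.0 + p * influence[follower]) / len(followees)
            for followee in followees:
                updated[followee] += share
        influence = updated
    return influence


# Hypothetical graph: a and d follow only c; b follows c and d.
follows = {"a": {"c"}, "b": {"c", "d"}, "c": set(), "d": {"c"}}
influence = tunkrank(follows)
top_user = max(influence, key=influence.get)
```

In this graph `c` ranks highest: it receives the full attention of `a` and `d`, half of `b`'s, and a small extra share of the influence that `d` gets from `b`.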
  7. External URL; letter+number patterns in usernames; suggestive keywords (“naked”, “girls”, “webcam”); propagation tree
  8. A context extraction algorithm (such as PCA or SVD) runs over the recent history of the trend and reports the keywords most correlated with it. For example, the keyword ‘NBA’ may usually appear in 5 tweets per minute, yet suddenly exhibit a rate of 100 tweets/min.
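The ‘NBA’ example suggests a simple way to flag a bursty keyword: compare its current rate with its recent baseline. The sketch below is one assumed detector, not the method from the slides: it flags a keyword when the current tweets-per-minute rate exceeds the historical mean by a chosen number of standard deviations. The threshold `min_sigma`, the floor of 1.0 on the deviation, and the sample history are illustrative assumptions.

```python
from statistics import mean, pstdev


def is_bursty(history, current, min_sigma=3.0):
    """Return True if `current` is far above the keyword's usual rate.

    history: recent tweets-per-minute counts for the keyword.
    current: the latest tweets-per-minute count.
    """
    baseline = mean(history)
    # Floor the deviation so a perfectly flat history cannot make
    # every small fluctuation look like a burst (an assumed safeguard).
    spread = max(pstdev(history), 1.0)
    return current > baseline + min_sigma * spread


# Hypothetical history: 'NBA' usually appears about 5 times per minute.
nba_history = [5, 4, 6, 5, 5]
burst_now = is_bursty(nba_history, 100)
normal_now = is_bursty(nba_history, 6)
```

Keywords flagged this way would then be grouped and passed to the context extraction step.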
  9. Lots of celebrity names (e.g. lady gaga). @ and # reduce ambiguity, like advanced query operators. Hashtag queries are particularly popular: the most popular queries contain a hashtag 51% of the time, the least popular queries 7% of the time. Celebrity queries are particularly popular: the most popular queries name a celebrity 25% of the time, the least popular queries 4% of the time. Twitter queries are less diverse than Web queries: only 1 in 4 unique (vs. 2 in 4 unique).
  10. http://www.slideshare.net/PARCInc/recommending-content-from-social-information-streams There is no collaborative filtering
  11. There is no collaborative filtering
  12. http://www.slideshare.net/PARCInc/recommending-content-from-social-information-streams
  13. Length normalization; stopwords; threshold to remove similar ones; cluster
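  14. The researchers used two mood-tracking tools. One is the open-source tool OpinionFinder, which classifies tweets as positive or negative. The other is GPOMS, a new mood-measurement tool the researchers developed on the basis of the Profile of Mood States (POMS) questionnaire used in clinical practice. It divides public mood into six categories: calm, alert, sure, vital, kind, and happy. To validate the two tools, the researchers compared public mood with social events, and the results matched closely. For example, around the presidential election (November 4, 2008), Twitter became tense the day before election day, turned calm, vital, kind, and happy on election day itself, and returned to normal afterwards. On Thanksgiving (November 28), Twitter as a whole was filled with happiness and then returned to normal. Most striking of all, shifting the “calm” mood index back by 3 days brings it into remarkably close agreement with the Dow Jones Industrial Average. The other moods showed no such effect. The researchers also tested a stock market prediction model called SOFNN. With stock market data alone as input, the model already reached 73.3% accuracy; adding the “calm” mood information raised the accuracy to 86.7%. However, Twitter mood indicators still cannot predict sudden events that shock financial markets. For example, on October 13, 2008, the US Federal Reserve unexpectedly launched a bank bailout plan that made the Dow Jones index rebound, and the Twitter calm index from 3 days earlier naturally gave no sign of it. The researchers themselves also recognize that Twitter users and stock market investors do not fully overlap, so the representativeness of the sample is open to question. Two scholars at the Technical University of Munich analyzed Twitter in finer detail [5]. They selected tweets mentioning companies in the S&P 100 index (for example, $AAPL stands for Apple), classified them as “buy”, “hold”, or “sell”, and computed the bullishness of each stock. The results were likewise encouraging. For example, tweet volume and trading volume were closely correlated, as were bullishness and the S&P 100 index. More practically, an investor who followed the strategy of “buying” the 3 most bullish stocks and “selling” the 3 least bullish ones would earn a return of as much as 15% in half a year. Arthur O’Connor, a doctoral student at Pace University in the United States [7], took a different approach. He tracked the popularity of Starbucks, Coca-Cola, and Nike on social media and compared it with their stock prices. He found that the number of fans on Facebook, the number of followers on Twitter, and the number of views on YouTube were all closely correlated with stock price. Brand popularity could also predict stock price rises 10 and 30 days later.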