RT @deepti: #presentation
Streaming First Story Detection
  with application to Twitter

  Sasa Petrovic, Miles Osborne, Victor Lavrenko
Agenda
1. Awesomeness of Twitter
2. Understanding the problem presented in this paper.
      - Streaming first story detection.
3. State of the art in FSD.
4. Proposed system.
5. Experiments.
       - Different datasets.
       - Evaluation metrics.
6. Results.
7. Observations.
8. Questions/ Discussion.
Social Media Tools
Information explosion
What makes twitter tick?
• Twitter and a few other social media tools are
  sometimes ahead of the newswire.
  Ex#1- Protests during the Iranian elections in 2009
  – People posted news first on Twitter, which was later
    picked up by the broadcasting corporations.

  Ex#2- The swine flu outbreak in the US
  – The US Centers for Disease Control and Prevention (CDC)
    used Twitter to post the latest updates on the pandemic.
#Mumbai26/11
• #mumbaiblasts RT @SamuraiSingh: Anyone from inorbit mall Malad
  heading towards Dindoshi in Goregaon east? I could do with a lift.
  #needhelp
• #needhelp RT @OhMyKohli: Need a lift from andheri west to bhandup
  nagar goregaon.
• RT @NabeelN: RT @splurgestar7: #NeedHelp #Mumbai #blasts I live 5
  mins from Kabootar Khana.. Anyone needs help, please let me know!
• #here2help RT @Nakulsud: Stranded in the rain near Gandhi hospital.. No
  cabs. Anyone around? Call 9920722186..going to mahim #needhelp
• #NeedHelp Find @nikhilwarrier RT @jayblawgs @sukhkarni Demon
  Stealer Records is where he works.I'm not sure where it is. Trying Google
  Maps.
• B-ve donors needed tomorow after 10 am, KEM Hospital, Parel, #Mumbai
  contact the hosp. blood bank at 022-24135189/24107421 #needhelp
Twitter leads to..
• Citizen Journalism
• Promotion
• Subjective opinion
Categorizing twitter data
Problem statement



To detect new events from a stream of Twitter
posts.
Topic Detection and Tracking
• An information filtering task.
• Focuses on organizing news documents.
• Subtasks
  – Story Segmentation.
  – Topic Tracking.
  – Topic Detection.
  – First-Story Detection.
  – Link Detection
Definitions
• An event is a unique thing that happens at
  some specific time and place
  – Eg: Earthquake in Italy in April 2009.


• A topic is an event or activity, along with all
  directly related events or activities.
  – Eg: Elections, Natural Disasters etc.
First-Story Detection




Is this the first story on a topic?
FSD on Twitter data
• Challenges
  – Much higher volume of data.
  – High level of noise.


• Benefit
  – First hand information on the impact of an event
    and how people reacted to it.
First Story Detection – traditional approach
[Figure: streaming algorithm; documents arrive along a time axis, from old to new]
Nearest Neighbor Approach
• Documents are represented as vectors in
  term space.
• Coordinates represent the frequency of a
  particular term in a document.
• Each new document is compared to the
  previous ones
     If (similarity to the nearest previous document < threshold)
            First story detected
Allan et al., 2000
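
A minimal sketch of the exact nearest-neighbor approach described above, assuming raw term counts as vector coordinates and cosine similarity as the measure. The function names, the whitespace tokenizer, and the threshold value of 0.5 are illustrative assumptions, not taken from Allan et al.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def detect_first_stories(documents, threshold=0.5):
    """Return indices of documents with no sufficiently similar predecessor."""
    seen = []           # every earlier document is kept: space grows with the stream
    first_stories = []
    for i, text in enumerate(documents):
        vec = Counter(text.lower().split())  # assumption: whitespace tokenization
        best = max((cosine_similarity(vec, old) for old in seen), default=0.0)
        if best < threshold:
            first_stories.append(i)
        seen.append(vec)
    return first_stories

docs = [
    "earthquake hits italy",
    "strong earthquake hits italy",
    "election results announced",
]
print(detect_first_stories(docs))  # [0, 2]
```

Every new document is compared against all earlier ones, so both the `seen` list and the per-document work grow with the stream.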
Disadvantages of NN approach
• Not scalable to the Twitter streaming setting.
• Space and time requirements increase with incoming
  data.

Alternative - Approximate nearest neighbor search
• Find any point that lies within (1+ε)r distance of the
  query point.
   – r here is the distance to the nearest neighbor.
• One way to achieve this.
   – Locality sensitive hashing (LSH)
Hash tables - definitions
• Hash function
  – Mapping from the input value to a hash key
• Hash key
  – Value returned by a hash function
  – Identifiers of each bucket.
• Collision
  – When two or more input values are mapped to
    the same bucket.
  – More buckets -> fewer collisions.
Locality Sensitive Hashing

• Hash each point into buckets in such a
  way that the probability of collision is much higher
  for points that are close together than for points that are far apart.
        Nearer points have a higher chance of being
       hashed into the same bucket.

• Points in the same bucket are inspected to
  find the nearest one.
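Locality Sensitive Hashing (contd..)
• Number of hyperplanes (k)
• The higher the value of k, the lower the probability of
  collision of non-similar points.
• For any two points x and y, the probability of collision
  under a single random hyperplane is

           $P_{coll} = 1 - \theta(x, y) / \pi$

           $\theta(x, y)$ - Angle between x and y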
Hash key values




0   1    0   0   1   1   0   1
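
A minimal sketch of how such a bit string can be produced, assuming the usual random-hyperplane scheme for cosine similarity: one bit per hyperplane, set by the sign of the dot product. The function names and the fixed seed are illustrative assumptions.

```python
import random

def make_hyperplanes(k, dim, seed=0):
    """Draw k random hyperplanes (normal vectors) in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

def hash_key(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its positive side."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)

planes = make_hyperplanes(k=8, dim=4)
key = hash_key([1.0, 0.0, 2.0, 0.0], planes)
print(key)  # an 8-character string of 0s and 1s
```

Only the direction of a vector matters here, so scaling a document vector by a positive constant leaves its key unchanged.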
Multiple hash tables
• To increase the chance that the nearest neighbor
  will collide with our point at least once.
• Each hash table has k independently chosen
  random hyperplanes.
• The number of hash tables L

           $L = \log_{1 - P_{coll}^k} \delta$

           ($\delta$ - probability of missing a nearest neighbor)
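
A worked sketch of how L could be computed, under stated assumptions: the single-hyperplane collision probability is 1 − θ/π, a table with k hyperplanes misses a neighbor with probability 1 − P_coll^k, and L independent tables all miss it with probability (1 − P_coll^k)^L, which should be at most δ. Distance is assumed to be cosine distance (1 − cos θ) with a value strictly between 0 and 2; the function names are illustrative.

```python
import math

def collision_probability(cosine_distance):
    """Single-hyperplane collision probability, assuming distance = 1 - cos(theta)."""
    theta = math.acos(1.0 - cosine_distance)
    return 1.0 - theta / math.pi

def num_hash_tables(k, cosine_distance, delta):
    """Smallest L such that (1 - P_coll**k)**L <= delta."""
    p_same_bucket = collision_probability(cosine_distance) ** k
    return math.ceil(math.log(delta) / math.log(1.0 - p_same_bucket))

# Settings quoted later in the deck: k = 13, distance 0.2, miss probability 2.5%.
L = num_hash_tables(k=13, cosine_distance=0.2, delta=0.025)
print(L)
```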
Variance Reduction Strategy
• LSH fails to find the true near neighbor when
  the query point lies far away from all other
  points.

• Another level of processing is added.
Variance Reduction Strategy (contd..)
[Figure: LSH scheme; documents ordered from old to new]

       • Compare the query with a fixed number
              of most recent documents.

       • Update the distance value if necessary.
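
A minimal sketch of this extra pass. It assumes the pass is triggered only when the LSH result already looks novel (distance at or above a threshold), which the slides do not state; the function name, the window size, and the toy distance function are hypothetical.

```python
from collections import deque

def variance_reduced_distance(query, lsh_distance, recent_docs, distance_fn, threshold):
    """Return the final nearest-neighbor distance for `query`.

    Assumption: the extra comparison runs only when lsh_distance >= threshold.
    """
    best = lsh_distance
    if best >= threshold:
        for doc in recent_docs:
            best = min(best, distance_fn(query, doc))  # update if a closer doc exists
    return best

# Toy usage with numbers standing in for documents.
recent = deque(maxlen=3)      # fixed-size window of the most recent documents
for doc in (1, 3, 9, 20):
    recent.append(doc)        # the oldest entry is evicted once the window is full

def toy_distance(a, b):
    return abs(a - b)

print(variance_reduced_distance(10, 8, recent, toy_distance, threshold=5))  # 1
```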
Streaming First Story Detection - Challenges
• Millions of new documents are published each
   hour
• The volume limits the amount of space and time
  we can spend on each document
  – Cannot compare new document with all documents
    returned by LSH.
  – Cannot store all the previous documents in the main
    memory.
• Additional metadata
  – Time stamp, topic tags etc.
Desiderata for a streaming FSD system
• For each document say whether it discusses a
  previously unseen event and give confidence
  in this decision.
• Decision should be made in bounded time.
• Use bounded space
• Only one pass over data allowed
• Decision should be made immediately.
Using the LSH system without bounds
• Number of documents in each bucket will
  grow without bound.
     => unbounded amount of space

• Number of comparisons also grows without
  bound.
Constant space approach
• Limit the number of documents in a single
  bucket to a constant.
  – Remove the oldest document if the bucket is full.


• The document is removed only from that
  single bucket in one of the L hash tables;
  it may still be present in the other hash tables.
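
A minimal sketch of one hash table with bounded buckets, using a fixed-length queue so that the oldest document is dropped when a bucket is full. The class and method names are illustrative assumptions.

```python
from collections import defaultdict, deque

class BoundedHashTable:
    """One LSH table whose buckets hold at most max_bucket_size documents."""

    def __init__(self, max_bucket_size):
        # deque(maxlen=n) silently discards the oldest item when full.
        self.buckets = defaultdict(lambda: deque(maxlen=max_bucket_size))

    def add(self, key, doc_id):
        self.buckets[key].append(doc_id)

    def candidates(self, key):
        """Documents currently stored under this key, oldest first."""
        return list(self.buckets.get(key, ()))

table = BoundedHashTable(max_bucket_size=2)
for doc_id in (1, 2, 3):
    table.add("01001101", doc_id)
print(table.candidates("01001101"))  # [2, 3]
```

Eviction is local to this table: a document dropped here can still be found through a bucket in another table.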
Constant number of comparisons
• Limiting the number of documents might still
  result in large number of comparisons.
  – A new document can collide with all the
    documents in a bucket.


• An additional limit to make a constant number
  of comparisons.
Constant number of comparisons (contd..)
• Compare each new document with at most 3L
  documents that it most frequently collided with
  across all L hash tables.

• Let S be the set of all documents that collided with a
  new document in any of the L hash tables.
   – Order the elements in S according to the number of hash
     tables where the collision occurred.
   – Pick the first 3L elements of that ordered set and compare
     the new document only with them.
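
The selection step above can be sketched as follows, assuming the colliding documents are available as one list per hash table. The function name and the `per_table` parameter are illustrative; ties between documents with the same collision count are broken arbitrarily here.

```python
from collections import Counter

def select_candidates(colliding_per_table, per_table=3):
    """Pick at most per_table * L documents, most frequent colliders first."""
    num_tables = len(colliding_per_table)
    counts = Counter()
    for bucket in colliding_per_table:
        counts.update(set(bucket))  # count each document once per table
    limit = per_table * num_tables
    return [doc for doc, _ in counts.most_common(limit)]

tables = [["a", "b"], ["a", "c"], ["a", "b"]]
print(select_candidates(tables))  # ['a', 'b', 'c']
```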
Detecting Events in Twitter Posts
• Not all tweets are actual stories.
  – Updates on personal life.
  – Spam.
  – Conversations.
  – Real stories.


• An important event – that which interests a
  larger population.
Detecting Events – Threading
• Run the streaming FSD system and assign a
  novelty score to each tweet.
  – Score is based on a cosine distance to the nearest
    tweet.
  – Output <tweet, its novelty score, its nearest
    tweet>
• Tweet a links to tweet b if
  – b is the nearest neighbor of a
  – 1-cos(a,b) < t (t ∈ [0.5, 0.6]).
Threading (contd..)
• For each tweet a
  – If its nearest neighbor b is within distance t.
     • Assign a to the existing thread to which b belongs.
  – Else
     • a is the first tweet in a new thread.
• Once we have threads of tweets
  – Can identify which threads grow fastest → news
    of a new event is spreading.
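
A minimal sketch of the threading step, assuming the FSD output arrives as (tweet id, nearest tweet id, distance) triples in stream order and that a thread is identified by the id of its first tweet. The function name and data layout are illustrative assumptions.

```python
def assign_threads(fsd_output, t=0.5):
    """Group tweets into threads by linking each tweet to its nearest neighbor."""
    thread_of = {}  # tweet id -> thread id
    threads = {}    # thread id -> tweet ids in arrival order
    for tweet_id, nearest_id, distance in fsd_output:
        if nearest_id in thread_of and distance < t:
            thread_id = thread_of[nearest_id]  # join the neighbor's thread
        else:
            thread_id = tweet_id               # first tweet of a new thread
        thread_of[tweet_id] = thread_id
        threads.setdefault(thread_id, []).append(tweet_id)
    return threads

output = [(1, None, 1.0), (2, 1, 0.2), (3, 2, 0.3), (4, 1, 0.9)]
print(assign_threads(output))  # {1: [1, 2, 3], 4: [4]}
```

Thread growth can then be measured from the length of each list over a time window.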
Analysis of social media - Related Work
1. Luo et al. (2007) worked on new event detection in a
large-scale streaming setting.
   – Used the traditional FSD approach and employed various heuristics.
   – Not a generalized approach; never showed the utility of their system on a
     large-scale task.
2. Saha and Getoor (2009) worked on the maximum coverage
problem.
   – Select k blogs that maximize the coverage of interests specified by the user.
   – 20 days of blog data totaling 2 million posts.
• This paper works on six months of Twitter data totaling over
  160 million posts.
• This paper’s FSD approach is more generalized.
Experiments
• Experiments used the English part of the TDT5 dataset,
  which consists of 221,306 documents from a time
  period spanning April 2003 to Sept 2003.

• Experiments done in two stages
  A) Test and compare the proposed FSD to the state of
  the art FSD system on the standard TDT5 dataset.

  B) Test different ranking methods on the output of the
  proposed FSD applied on twitter data.
TDT5 Experimental Setup
• Aim
  – To test if the proposed system is on par with the best
    existing system (the UMass system in particular).
  – To accurately measure the speedup obtained over the
    existing system.
• Same settings as the UMass system
  – 1-NN clustering
  – Cosine as a similarity measure
  – TF-IDF weighted document representation
  – Top 300 features in each document.
TDT5 Experimental Setup (contd..)
• LSH parameters
  – Higher k, more computation. Lower k, more collisions.
  – k (number of hyperplanes) = 13.
  – Probability of missing a neighbor within the distance
    of 0.2 is less than 2.5%
• The official TDT evaluation requires each system
  to assign a confidence score for its decision.
  – In our case, we assign the score as soon as the new
    story arrives.
TDT5 Experiment Evaluation Metrics
1. Detection Error Tradeoff (DET) curves
  – A graphical plot of error rates for binary
    classification systems, plotting false reject rate vs.
    false accept rate.
  – DET provides tools to select possibly optimal
    models and to discard suboptimal ones
    independently from (and prior to specifying) the
    cost context or the class distribution.
Plotting DET curves
1. Sort all stories according to their scores.
2. Perform threshold sweep. For each threshold
   value:
   – Stories with a score greater than threshold are
     considered new.
   – Calculate false alarm and miss probabilities.
      • False alarm - declaring a story new when it is not.
      • Miss - declaring a new story old.
3. Plot the values on a graph to show the trade-off
between these two quantities.
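
The threshold sweep above can be sketched as follows, assuming each story has a novelty score and a ground-truth flag, and that the data contains at least one new and one old story. The function name is illustrative.

```python
def det_points(scores, is_new):
    """Return (threshold, p_miss, p_false_alarm) for every distinct score."""
    n_new = sum(is_new)
    n_old = len(is_new) - n_new
    points = []
    for threshold in sorted(set(scores)):
        # Stories scoring above the threshold are declared new.
        misses = sum(1 for s, new in zip(scores, is_new) if new and s <= threshold)
        false_alarms = sum(1 for s, new in zip(scores, is_new) if not new and s > threshold)
        points.append((threshold, misses / n_new, false_alarms / n_old))
    return points

scores = [0.9, 0.2, 0.8, 0.1]
is_new = [True, False, True, False]
for point in det_points(scores, is_new):
    print(point)
```

Plotting the second value against the third gives the DET curve.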
DET curves comparing the proposed
system with the UMass FSD system
2. Minimal Normalized Cost

$C_{det} = C_{miss} \cdot P_{miss} \cdot P_{target} + C_{FA} \cdot P_{FA} \cdot P_{\text{non-target}}$

Cmiss and CFA are costs of miss and false alarm.

Pmiss and PFA are probabilities of miss and false alarm.

Ptarget and Pnon-target are the prior target and non-
target probabilities.

Cmin is the minimal value of Cdet over all threshold values.

A lower value of Cmin indicates better performance.
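
A small sketch of the cost computation, assuming the usual TDT form Cdet = Cmiss·Pmiss·Ptarget + CFA·PFA·Pnon-target with Pnon-target = 1 − Ptarget. The default cost and prior values below are illustrative placeholders, not values taken from the slides, and no further normalization is applied.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Cdet for one threshold; defaults are placeholder assumptions."""
    p_non_target = 1.0 - p_target
    return c_miss * p_miss * p_target + c_fa * p_fa * p_non_target

def minimal_cost(points, **cost_params):
    """Cmin: the smallest Cdet over all (p_miss, p_fa) pairs from a threshold sweep."""
    return min(detection_cost(p_miss, p_fa, **cost_params) for p_miss, p_fa in points)

sweep = [(0.0, 0.5), (0.5, 0.0), (1.0, 0.0)]
print(minimal_cost(sweep, c_miss=1.0, c_fa=1.0, p_target=0.5))  # 0.25
```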
TDT5 Results – Minimal Normalized Cost
• No limit on the bucket size.
• Processing time per item was made constant.
TDT5 Results (contd..)
• Variance in case of Pure LSH = 0.046.
• Variance in case of Variance Reduced LSH =
  0.004.
• UMass system took 28 hours to complete the
  processing.
• The proposed system took only 2 hours.
Comparison of processing time for the proposed and the UMass system
TDT5 Results – Minimal Normalized Cost

• Bucket size limited in terms of the fraction of the expected
  number of collisions.
       Eg: A bucket size of 0.5 means that the number of docs in a bucket
   cannot be more than 50% of the expected number of collisions.
• Performance declines when bucket size is limited but is
  still reasonable when bucket size is reduced to 10% of the
  expected collisions.
Memory usage on a month of Twitter data.
[Figure: memory usage over time]

• X-axis shows how long the system has been
  running for.
Twitter Experimental Setup
• Dataset
  – Twitter data gathered over a period of six
    months.
  – 163.5 million timestamped tweets, totaling
    over 2 billion tokens.
     • Only ASCII characters
     • Stripped the words beginning with “@” or “#”
Twitter Experimental Setup ( contd..)
• Not evaluating our FSD system

• Evaluating different methods of ranking
  threads, which are the output of an FSD system
  – To detect important events in a very noisy and
    unstructured stream such as Twitter.
Twitter Experimental Setup ( contd..)
• Gold Standard
  – Human experts manually labeled tweets returned
    by the system.
  – 3 labels
     • Event – A tweet that conveys exactly what happened,
       without requiring any prior knowledge about the event.
       The event referenced should be sufficiently important.
     • Spam – Automatic weather updates, radio station
       updates etc.
     • Neutral – Everything that isn’t an event or spam.
Twitter Experimental Setup ( contd..)
• Only the 1000 fastest growing threads were
  labeled.

• The 820 tweets on which both annotators
  agreed are considered the gold standard.
Twitter Evaluation
• Evaluation is performed by computing average
  precision (AP) on the gold standard sorted
  according to different criteria.
• Test#1
  – Relevant documents – Event tweets
  – Non-relevant documents – Neutral and spam tweets.
• Test#2
  – Relevant documents – Event + Neutral tweets
  – Non-relevant documents – Spam tweets.
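
A minimal sketch of average precision over a ranked list, with the two relevance definitions above applied to a toy ranking. The function name and the label strings are illustrative assumptions.

```python
def average_precision(ranked_relevance):
    """Mean of precision@rank taken at each relevant item in the ranking."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

ranked_labels = ["event", "spam", "neutral"]  # toy ranking, best first
test1 = [label == "event" for label in ranked_labels]
test2 = [label in ("event", "neutral") for label in ranked_labels]
print(average_precision(test1))  # 1.0
print(average_precision(test2))  # (1/1 + 2/3) / 2
```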
Ranking the threads
• Different ways of ranking the threads
  – Baseline : Random ordering of threads
  – Size of the thread – threads are ranked according
    to the number of tweets.
  – Number of users – threads are ranked according
    to the number of unique users posting in a thread.
  – Entropy + users – if the entropy of a thread is <
    3.5, move to the back of the list, otherwise sort
    according to the number of unique users.
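
A minimal sketch of the "Entropy + users" ranking. It assumes thread entropy is computed over the word distribution of the thread's tweets with the natural logarithm; the slides give only the 3.5 cut-off, so the log base, the tokenizer, and the dictionary layout are assumptions.

```python
import math
from collections import Counter

def thread_entropy(tweets):
    """Entropy of the word distribution in a thread (natural log assumed)."""
    counts = Counter(word for tweet in tweets for word in tweet.split())
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def rank_threads(threads, min_entropy=3.5):
    """Low-entropy threads go to the back; the rest sort by unique users."""
    def sort_key(thread):
        low_entropy = thread_entropy(thread["tweets"]) < min_entropy
        return (low_entropy, -len(set(thread["users"])))
    return sorted(threads, key=sort_key)

spam = {"id": "spam", "tweets": ["buy now"] * 5, "users": [1, 2, 3, 4, 5]}
news = {"id": "news", "tweets": [" ".join(f"w{i}" for i in range(40))], "users": [6, 7]}
print([t["id"] for t in rank_threads([spam, news])])  # ['news', 'spam']
```

Repetitive threads contain few distinct words and so score low, which is why they are demoted even when many users post in them.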
Ranking the threads ( contd..)
Average precision chart




• Results of the second experiment are better.
Top ten fastest growing threads in our data.
Observations
• Celebrity deaths are the fastest-spreading news on
  Twitter.
   – Steve Jobs’s death broke the Twitter record with 10,000 tweets
     per second.
   – Tweet count soon after Osama bin Laden’s death.
Questions..
• The language in the tweets – often misspelt,
  quite informal.
• Topic tags might provide richer information about
  the trending topic.
• Time complexity of the algorithm and of the
  sorting techniques not mentioned.
• Entropy is usually a measure of disorder or
  randomness, i.e., the lower the entropy, the
  less the disorder.
  – But the definition here says that higher entropy
    values are better.
THANK YOU!

Streaming First Story Detection with application to Twitter

  • 1. RT @deepti: #presentation Streaming First Story Detection with application to Twitter Sasa Petrovic, Miles Osborne, Victor Lavrenko
  • 2. Agenda 1. Awesomeness of Twitter 2. Understanding the problem presented in this paper. - Streaming first story detection. 3. State of the art in FSD. 4. Proposed system. 5. Experiments. - Different datasets. - Evaluation metrics. 6. Results. 7. Observations. 8. Questions/ Discussion.
  • 5. What makes twitter tick? • Twitter and few other social media tools are sometimes ahead of newswire . Ex#1- Protests during Iranian elections in 2009 – People posted news first on Twitter which was later picked up by the broadcasting corporations. Ex#2- The swine flu outbreak in US – US Centre for disease control CDC used twitter to post latest updates on the pandemic.
  • 7. #Mumbai26/11 • #mumbaiblasts RT @SamuraiSingh: Anyone from inorbit mall Malad heading towards Dindoshi in Goregaon east? I could do with a lift. #needhelp • #needhelp RT @OhMyKohli: Need a lift from andheri west to bhandup nagar goregaon. • RT @NabeelN: RT @splurgestar7: #NeedHelp #Mumbai #blasts I live 5 mins from Kabootar Khana.. Anyone needs help, please let me know! • #here2help RT @Nakulsud: Stranded in the rain near Gandhi hospital.. No cabs. Anyone around? Call 9920722186..going to mahim #needhelp • #NeedHelp Find @nikhilwarrier RT @jayblawgs @sukhkarni Demon Stealer Records is where he works.I'm not sure where it is. Trying Google Maps. • B-ve donors needed tomorow after 10 am, KEM Hospital, Parel, #Mumbai contact the hosp. blood bank at 022-24135189/24107421 #needhelp
  • 8. Twitter leads to.. • Citizen Journalism • Promotion • Subjective opinion
  • 10. Problem statement To detect new events from a stream of Twitter posts.
  • 11. Topic Detection and Tracking • An information filtering task. • Focuses on organizing news documents. • Subtasks – Story Segmentation. – Topic Tracking. – Topic Detection. – First-Story Detection. – Link Detection
  • 12. Definitions • An event is a unique thing that happens at some specific time and place – E.g.: Earthquake in Italy in April 2009. • A topic is an event or activity, along with all directly related events or activities. – E.g.: Elections, natural disasters, etc.
  • 13. First-Story Detection Is this the first story on a topic?
  • 14. FSD on Twitter data • Challenges – Much higher volume of data. – High level of noise. • Benefit – First hand information on the impact of an event and how people reacted to it.
  • 15. First Story Detection – traditional approach (Diagram: a streaming algorithm processing documents along a time axis, from old to new.)
  • 16. Nearest Neighbor Approach • Documents are represented as vectors in term space. • Coordinates represent the frequency of a particular term in a document. • Each new document is compared to the previous ones: if the similarity to its nearest neighbor is below a threshold, a first story is detected.
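The nearest neighbor approach on this slide can be sketched in a few lines. This is a minimal illustration, not the UMass or the paper's implementation: the function names, the whitespace tokenizer, raw term counts (rather than TF-IDF), and the threshold of 0.5 are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_first_stories(docs, threshold=0.5):
    """Flag each document as a first story (True) or not (False).

    Every new document is compared with all previous ones, which is
    exactly why this exact approach does not scale to a stream.
    The threshold value is an assumption for illustration.
    """
    seen = []
    flags = []
    for text in docs:
        vec = Counter(text.lower().split())
        best = max((cosine_similarity(vec, old) for old in seen), default=0.0)
        flags.append(best < threshold)
        seen.append(vec)
    return flags
```

For example, `detect_first_stories(["earthquake hits italy", "earthquake hits italy today", "election results announced"])` flags the first and third documents as first stories.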
  • 18. Disadvantages of NN approach • Not scalable to the Twitter streaming setting. • Space and time requirements increase with incoming data. Alternative - Approximate nearest neighbor search • Find any point that lies within (1+ε)r distance of the query point. – Here r is the distance to the nearest neighbor. • One way to achieve this: – Locality sensitive hashing (LSH)
  • 19. Hash tables - definitions
  • 20. Hash tables - definitions • Hash function – Mapping from the input value to a hash key • Hash key – Value returned by a hash function – Identifies a bucket. • Collision – When two or more input values are mapped to the same bucket. – More buckets -> fewer collisions.
  • 21. Locality Sensitive Hashing • Hash each query point into buckets in such a way that the probability of collision decreases as the distance between the points grows. Nearer points have a higher chance of being hashed into the same bucket. • Points in the same bucket are inspected to find the nearest one.
  • 22. Locality Sensitive Hashing (contd..) • Number of hyperplanes (k) • The higher the value of k, the lower the probability of collision of non-similar points. • For any two points x and y and a single random hyperplane: P(h(x) = h(y)) = 1 − θ(x, y)/π, where θ(x, y) is the angle between x and y.
  • 23. Hash key values 0 1 0 0 1 1 0 1
  • 24. Multiple hash tables • To increase the chance that the nearest neighbor will collide with our point at least once. • Each hash table has k independently chosen random hyperplanes. • The number of hash tables L is chosen from δ, the acceptable probability of missing a nearest neighbor.
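The hashing scheme from the last few slides can be sketched as follows. This is a hedged illustration of random-hyperplane LSH, not the authors' code: the function names are hypothetical, hyperplanes are drawn from a standard Gaussian, and `tables_needed` assumes a neighbor is missed only if it fails to collide in every one of the L tables, so L is the smallest integer with (1 − p^k)^L ≤ δ.

```python
import math
import random


def collision_probability(theta):
    """Chance that one random hyperplane puts two points at angle theta
    (in radians) on the same side."""
    return 1.0 - theta / math.pi


def tables_needed(p_coll, k, delta):
    """Smallest L with (1 - p_coll**k)**L <= delta.

    Assumes 0 < p_coll < 1 and 0 < delta < 1.
    """
    return math.ceil(math.log(delta) / math.log(1.0 - p_coll ** k))


def make_hyperplanes(k, dim, rng):
    """Draw k random hyperplanes (as normal vectors) in dim dimensions."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def hash_key(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )
```

Each of the L tables would hold its own set of k hyperplanes, and a document's bucket in a table is the k-bit key returned by `hash_key`. Because only the sign of each dot product matters, rescaling a vector never changes its key.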
  • 25. Variance Reduction Strategy • LSH fails to find the true near neighbor when the query point lies far away from all other points. • Another level of processing is added.
  • 26. Variance Reduction Strategy (contd..) (Diagram: LSH scheme over documents ordered from old to new.) • Compare the query with a fixed number of most recent documents. • Update the distance value if necessary.
  • 28. Streaming First Story detection - Challenges • Millions of new documents are published each hour • The volume limits the amount of space and time we can spend on each document – Cannot compare new document with all documents returned by LSH. – Cannot store all the previous documents in the main memory. • Additional metadata – Time stamp, topic tags etc.
  • 29. Desiderata for a streaming FSD system • For each document say whether it discusses a previously unseen event and give confidence in this decision. • Decision should be made in bounded time. • Use bounded space • Only one pass over data allowed • Decision should be immediately made.
  • 30. Using the LSH system without bounds • The number of documents in each bucket will grow without bound, which requires an unbounded amount of space. • The number of comparisons also grows without bound.
  • 31. Constant space approach • Limit the number of documents in a single bucket to a constant. – Remove the oldest document if the bucket is full. • The document is removed only from one single bucket in one of the L hash tables.
  • 32. Constant number of comparisons • Limiting the number of documents might still result in large number of comparisons. – A new document can collide with all the documents in a bucket. • An additional limit to make a constant number of comparisons.
  • 33. Constant number of comparisons (contd..) • Compare each new document with at most 3L documents that it most frequently collided with in all L hash tables. • Let S be the set of all documents that collided with a new document in all L hash tables. – Order the elements in S according to the number of hash tables where the collision occurred. – Pick the first 3L elements of that ordered set and compare the new document only with them.
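The two bounds described on the last three slides (fixed bucket size, at most 3L comparisons) can be sketched together. This is a minimal sketch under stated assumptions: the class name `BoundedLSHIndex` and its methods are hypothetical, and the per-table hash keys are assumed to be computed elsewhere and passed in as one key per table.

```python
from collections import Counter, deque


class BoundedLSHIndex:
    """L hash tables whose buckets hold at most bucket_size documents."""

    def __init__(self, num_tables, bucket_size):
        self.num_tables = num_tables
        self.bucket_size = bucket_size
        self.tables = [dict() for _ in range(num_tables)]

    def candidates(self, keys):
        """Return at most 3L document ids, ordered by how many of the
        L tables they collided in with the query."""
        counts = Counter()
        for table, key in zip(self.tables, keys):
            for doc_id in table.get(key, ()):
                counts[doc_id] += 1
        limit = 3 * self.num_tables
        return [doc_id for doc_id, _ in counts.most_common(limit)]

    def add(self, doc_id, keys):
        """Insert a document; a full bucket drops its oldest entry.

        The eviction affects only that one bucket, so the evicted
        document may still be present in the other tables.
        """
        for table, key in zip(self.tables, keys):
            bucket = table.setdefault(key, deque(maxlen=self.bucket_size))
            bucket.append(doc_id)
```

A `deque` with `maxlen` gives the "remove the oldest document if the bucket is full" behavior without extra bookkeeping, and exact distances would then be computed only for the ids returned by `candidates`.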
  • 34. Detecting Events in Twitter Posts • Not all tweets are actual stories. Tweets include: – Updates on personal life. – Spam. – Conversations. – Real stories. • An important event is one that interests a larger population.
  • 35. Detecting Events – Threading • Run the streaming FSD system and assign a novelty score to each tweet. – Score is based on the cosine distance to the nearest tweet. – Output <tweet, its novelty score, its nearest tweet> • Tweet a links to tweet b if – b is the nearest neighbor of a – 1 − cos(a, b) < t (t ∈ [0.5, 0.6]).
  • 36. Threading (contd..) • For each tweet a – If its nearest neighbor b is within distance t. • Assign a to the existing thread to which b belongs. – Else • a is the first tweet in a new thread. • Once we have threads of tweets – Can identify which threads grow fastest, which indicates that news of a new event is spreading.
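The threading rule above can be sketched as a single pass over the FSD output. This is an illustrative sketch only: the function name, the `(tweet_id, nearest_id, distance)` input format, and the default t = 0.5 (the lower end of the range on the slide) are assumptions.

```python
def build_threads(links, t=0.5):
    """Group tweets into threads.

    links: iterable of (tweet_id, nearest_id, distance) in arrival
    order, where distance is 1 - cos(tweet, nearest). nearest_id may
    be None for the very first tweet.
    Returns a dict mapping each thread's first tweet to its members.
    """
    thread_of = {}  # tweet id -> id of the first tweet in its thread
    threads = {}    # first tweet id -> list of tweet ids in the thread
    for tweet_id, nearest_id, distance in links:
        if nearest_id in thread_of and distance < t:
            root = thread_of[nearest_id]  # join the neighbor's thread
        else:
            root = tweet_id               # start a new thread
            threads[root] = []
        thread_of[tweet_id] = root
        threads[root].append(tweet_id)
    return threads
```

Thread growth can then be measured by counting how many tweets join each thread within a time window.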
  • 37. Analysis of social media - Related Work 1. Luo et al (2007) worked on new event detection in a large-scale streaming setting. – Used the traditional FSD approach and employed various heuristics. – Not a generalized approach; never showed the utility of their system on a large-scale task. 2. Saha and Getoor (2009) worked on the maximum coverage problem. – Select k blogs that maximize the cover of interests specified by the user. – 20 days of blog data totaling 2 million posts. • This paper works on six months of Twitter data totaling over 160 million posts. • This paper’s FSD approach is more generalized.
  • 38. Experiments • Experiments used the English part of the TDT5 dataset, which consists of 221,306 documents from a time period spanning April 2003 to Sept 2003. • Experiments were done in two stages A) Test and compare the proposed FSD to the state of the art FSD system on the standard TDT5 dataset. B) Test different ranking methods on the output of the proposed FSD applied on Twitter data.
  • 39. TDT5 Experimental Setup • Aim – To test if the proposed system is on par with the best existing system. ( UMass system in particular) - To accurately measure the speedup obtained over the existing system. - Same settings as the UMass system - 1-NN clustering - Cosine as a similarity measure - TF-IDF weighted document representation - Top 300 features in each document.
  • 40. TDT5 Experimental Setup (contd..) • LSH parameters – Higher k, more computation. Lower k, more collisions. – k ( No of hyperplanes) = 13. – Probability of missing a neighbor within the distance of 0.2 is less than 2.5% • The official TDT evaluation requires each system to assign a confidence score for its decision. – In our case, we assign the score as soon as the new story arrives.
  • 41. TDT5 Experiment Evaluation Metrics 1. Detection Error Tradeoff (DET) curves – A graphical plot of error rates for binary classification systems, plotting false reject rate vs. false accept rate. – DET provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution.
  • 42. Plotting DET curves 1. Sort all stories according to their scores. 2. Perform threshold sweep. For each threshold value: – Stories with a score greater than threshold are considered new. – Calculate false alarm and miss probabilities. • False alarm - declaring a story new when it is not. • Miss - declaring a new story old. 3. Plot the values on a graph to show the trade off between these two quantities.
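The threshold sweep behind a DET curve can be sketched as below. This is a minimal illustration with hypothetical names; it uses every distinct score as a threshold and assumes the input contains at least one new and at least one old story, so that both error rates are defined.

```python
def det_points(scores, is_new):
    """Return (threshold, miss probability, false alarm probability)
    for each distinct score used as a threshold.

    A story is declared new when its score is greater than the
    threshold. Assumes at least one new and one old story.
    """
    n_new = sum(1 for new in is_new if new)
    n_old = len(is_new) - n_new
    points = []
    for threshold in sorted(set(scores)):
        # Miss: a truly new story that was declared old.
        misses = sum(
            1 for s, new in zip(scores, is_new) if new and s <= threshold
        )
        # False alarm: an old story that was declared new.
        false_alarms = sum(
            1 for s, new in zip(scores, is_new) if not new and s > threshold
        )
        points.append((threshold, misses / n_new, false_alarms / n_old))
    return points
```

Plotting the second value against the third for all thresholds gives the trade-off curve.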
  • 43. DET curves comparing the proposed system with the UMass FSD system
  • 44. 2. Minimal Normalized Cost Cdet = Cmiss · Pmiss · Ptarget + CFA · PFA · Pnon-target. Cmiss and CFA are the costs of a miss and a false alarm. Pmiss and PFA are the probabilities of a miss and a false alarm. Ptarget and Pnon-target are the prior target and non-target probabilities. Cmin is the minimal value of Cdet over all threshold values. A lower value of Cmin indicates better performance.
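A short sketch of how this cost could be computed from the miss and false alarm probabilities follows. The function names are hypothetical, and the default constants (Cmiss = 1.0, CFA = 0.1, Ptarget = 0.02) and the normalization by the cheaper of the two trivial systems are assumptions based on common TDT evaluation practice, not values stated on the slide.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Cdet = Cmiss * Pmiss * Ptarget + CFA * PFA * Pnon-target."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def normalized_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Cdet divided by the cost of the better trivial system
    (always answer 'old' or always answer 'new')."""
    norm = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return detection_cost(p_miss, p_fa, c_miss, c_fa, p_target) / norm


def minimal_normalized_cost(operating_points, **costs):
    """Minimum normalized cost over (p_miss, p_fa) pairs, one pair
    per threshold value."""
    return min(normalized_cost(pm, pf, **costs) for pm, pf in operating_points)
```

With these assumed constants, a system that misses every new story and raises no false alarms scores exactly 1.0, so values below 1.0 indicate a system that beats the trivial baseline.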
  • 45. TDT5 Results – Minimal Normalized Cost • No limit on the bucket size. • Processing time per item was made constant.
  • 46. TDT5 Results (contd..) • Variance in case of Pure LSH = 0.046. • Variance in case of Variance Reduced LSH = 0.004. • UMass system took 28 hours to complete the processing. • The proposed system took only 2 hours.
  • 47. Comparison of processing time for the proposed and the UMass system
  • 48. TDT5 Results – Minimal Normalized Cost • Bucket size is limited as a fraction of the expected number of collisions. E.g.: a bucket size of 0.5 means that the number of docs in a bucket cannot be more than 50% of the expected number of collisions. • Performance declines when bucket size is limited but remains reasonable even when bucket size is reduced to 10% of the expected collisions.
  • 49. Memory usage on a month of Twitter data. • X-axis shows how long the system has been running for.
  • 50. Twitter Experimental Setup • Dataset – Twitter data gathered over a period of six months. – 163.5 million timestamped tweets, totaling over 2 billion tokens. • Only ASCII characters • Stripped the words beginning with “@”, “#”
  • 51. Twitter Experimental Setup ( contd..) • Not evaluating our FSD system • Evaluating different methods of ranking threads, which are the output of an FSD system – To detect important events in a very noisy and unstructured stream such as Twitter.
  • 52. Twitter Experimental Setup ( contd..) • Gold Standard – Human experts manually labeled tweets returned by the system. – 3 labels • Event – A tweet from which a reader with no prior knowledge of the event can tell exactly what happened. The event referenced should be sufficiently important. • Spam – Automatic weather updates, radio station updates etc. • Neutral – Everything that isn’t an event or spam.
  • 53. Twitter Experimental Setup ( contd..) • Only the 1000 fastest growing threads were labeled. • 820 tweets on which both the annotators agreed are considered as the gold standard.
  • 54. Twitter Evaluation • Evaluation is performed by computing average precision (AP) on the gold standard sorted according to different criteria. • Test#1 – Relevant documents – Event tweets – Non-relevant documents – Neutral and spam tweets. • Test#2 – Relevant documents – Event + Neutral tweets – Non-relevant documents – Spam tweets.
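Average precision over a ranked list can be computed as in the sketch below. The function name and the boolean input format are assumptions; the definition used is the standard one, the mean of the precision values at each rank where a relevant item appears.

```python
def average_precision(ranked_relevance):
    """Average precision of a ranked list.

    ranked_relevance: booleans in rank order, True where the item is
    relevant (for Test#1, an Event tweet; for Test#2, Event or Neutral).
    Returns 0.0 when the list contains no relevant items.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0
```

Each ranking criterion produces a different ordering of the same gold-standard tweets, so it yields a different list of booleans and a different AP.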
  • 55. Ranking the threads • Different ways of ranking the threads – Baseline : Random ordering of threads – Size of the thread – threads are ranked according to the number of tweets. – Number of users – threads are ranked according to the number of unique users posting in a thread. – Entropy + users – if the entropy of a thread is < 3.5, move to the back of the list, otherwise sort according to the number of unique users.
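The "Entropy + users" ranking can be sketched as follows. The slide does not define the entropy, so this sketch assumes Shannon entropy over the word counts of all tweets in a thread, computed with the natural logarithm; the function names and the `(user, text)` input format are also hypothetical.

```python
import math
from collections import Counter


def thread_entropy(texts):
    """Shannon entropy (natural log, an assumption) of the word
    distribution over all tweets in a thread."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def rank_threads(threads, min_entropy=3.5):
    """Rank threads, each a list of (user, text) pairs.

    Threads with entropy below min_entropy go to the back of the
    list; within each group, threads with more unique users come first.
    """
    def sort_key(thread):
        unique_users = len({user for user, _ in thread})
        low_entropy = thread_entropy([text for _, text in thread]) < min_entropy
        return (low_entropy, -unique_users)

    return sorted(threads, key=sort_key)
```

Under this reading, a thread made of near-identical tweets (such as automated updates) uses few distinct words, has low entropy, and is pushed to the back regardless of how many users posted in it.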
  • 56. Ranking the threads ( contd..)
  • 57. Average precision chart • Results of the second experiment are better.
  • 58. Top ten fastest growing threads in our data.
  • 59. Observations • Celebrity deaths are the fastest spreading news on Twitter. – Steve Jobs’s death broke the Twitter record with 10,000 tweets per second. – Tweet count soon after Osama bin Laden’s death.
  • 60. Questions.. • The language in the tweets – often misspelt, quite informal. • Topic tags might provide richer information about the trending topic. • Time complexity of the algorithm and of the sorting techniques is not mentioned. • Entropy is usually a measure of disorder or randomness – i.e., the lower the entropy, the lower the disorder. – But the definition here says that higher entropy values are better.