Streaming First Story Detection with application to Twitter
RT @deepti: #presentation Streaming First Story Detection with application to Twitter. Sasa Petrovic, Miles Osborne, Victor Lavrenko
Agenda
1. Awesomeness of Twitter
2. Understanding the problem presented in this paper
   – Streaming first story detection
3. State of the art in FSD
4. Proposed system
5. Experiments
   – Different datasets
   – Evaluation metrics
6. Results
7. Observations
8. Questions / Discussion
What makes Twitter tick? • Twitter and a few other social media tools are sometimes ahead of the newswire. Ex#1 – Protests during the Iranian elections in 2009: people posted news first on Twitter, which was later picked up by the broadcasting corporations. Ex#2 – The swine flu outbreak in the US: the US Centers for Disease Control and Prevention (CDC) used Twitter to post the latest updates on the pandemic.
#Mumbai26/11• #mumbaiblasts RT @SamuraiSingh: Anyone from inorbit mall Malad heading towards Dindoshi in Goregaon east? I could do with a lift. #needhelp• #needhelp RT @OhMyKohli: Need a lift from andheri west to bhandup nagar goregaon.• RT @NabeelN: RT @splurgestar7: #NeedHelp #Mumbai #blasts I live 5 mins from Kabootar Khana.. Anyone needs help, please let me know!• #here2help RT @Nakulsud: Stranded in the rain near Gandhi hospital.. No cabs. Anyone around? Call 9920722186..going to mahim #needhelp• #NeedHelp Find @nikhilwarrier RT @jayblawgs @sukhkarni Demon Stealer Records is where he works.Im not sure where it is. Trying Google Maps.• B-ve donors needed tomorow after 10 am, KEM Hospital, Parel, #Mumbai contact the hosp. blood bank at 022-24135189/24107421 #needhelp
Problem statement: To detect new events from a stream of Twitter posts.
Topic Detection and Tracking• An information filtering task.• Focuses on organizing news documents.• Subtasks – Story Segmentation. – Topic Tracking. – Topic Detection. – First-Story Detection. – Link Detection
Definitions • An event is a unique thing that happens at some specific time and place. – E.g.: Earthquake in Italy in April 2009. • A topic is an event or activity, along with all directly related events or activities. – E.g.: Elections, natural disasters, etc.
First-Story Detection: Is this the first story on a topic?
FSD on Twitter data• Challenges – Much higher volume of data. – High level of noise.• Benefit – First hand information on the impact of an event and how people reacted to it.
First Story Detection – traditional approach [Figure: a streaming algorithm processing documents along a time axis, from old to new.]
Nearest Neighbor Approach • Documents are represented as vectors in term space. • Coordinates represent the frequency of a particular term in a document. • Each new document is compared to the previous ones: if (similarity to the nearest previous document < threshold), a first story is detected.
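A minimal sketch of the nearest-neighbor test on this slide, assuming raw term-frequency vectors, whitespace tokenization and cosine similarity. The function names and the threshold of 0.5 are illustrative assumptions, not values from the paper.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_first_stories(documents, threshold=0.5):
    """Yield (document, is_first_story), comparing each document with all earlier ones."""
    seen = []
    for text in documents:
        vector = Counter(text.lower().split())
        # Similarity to the closest previous document (0.0 if there is none).
        best = max((cosine_similarity(vector, old) for old in seen), default=0.0)
        yield text, best < threshold
        seen.append(vector)
```

Every new document is compared with every stored one, so both the work per document and the memory grow with the length of the stream.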
Disadvantages of NN approach • Not scalable to the Twitter streaming setting. • Space and time requirements increase with incoming data.
Alternative – Approximate nearest neighbor search • To find any point that lies within (1+ε)r distance of the query point. – r here is the distance to the nearest neighbor. • One way to achieve this: – Locality sensitive hashing (LSH)
Hash tables – definitions • Hash function – Mapping from the input value to a hash key. • Hash key – Value returned by a hash function. – Identifies a bucket. • Collision – When two or more input values are mapped to the same bucket. – More buckets -> fewer collisions.
Locality Sensitive Hashing • Hash each query point into buckets in such a way that the probability of collision decreases as the distance between the points grows: nearer points have a higher chance of being hashed into the same bucket. • Points in the same bucket are inspected to find the nearest one.
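Locality Sensitive Hashing (contd..) • Number of hyperplanes (k) • The higher the value of k, the lower the probability of collision between dissimilar points. • For any two points x and y, the probability that a single random hyperplane puts them on the same side is $P_{coll} = 1 - \theta(x, y)/\pi$, where $\theta(x, y)$ is the angle between x and y.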
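Multiple hash tables • To increase the chance that the nearest neighbor will collide with our point at least once. • Each hash table has k independently chosen random hyperplanes. • The number of hash tables is $L = \log_{1 - P_{coll}^{k}} \delta$, where $P_{coll}$ is the probability that two points collide under a single hyperplane and δ is the probability of missing a nearest neighbor.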
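The random-hyperplane scheme with multiple tables can be sketched as follows. This assumes dense vectors stored as Python lists and the standard result that one random hyperplane separates two points with probability θ/π. The class and function names are hypothetical and not from the paper.

```python
import math
import random
from collections import defaultdict


def collision_probability(cosine_distance):
    """P(two points fall on the same side of one random hyperplane)."""
    theta = math.acos(1.0 - cosine_distance)
    return 1.0 - theta / math.pi


def tables_needed(k, cosine_distance, delta):
    """Smallest L such that (1 - P_coll^k)^L <= delta."""
    p_bucket = collision_probability(cosine_distance) ** k
    return math.ceil(math.log(delta) / math.log(1.0 - p_bucket))


class HyperplaneLSH:
    """L hash tables, each keyed by the signs of k random projections."""

    def __init__(self, dim, k, num_tables, seed=0):
        rng = random.Random(seed)
        # One independent set of k Gaussian hyperplanes per table.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def _key(self, planes, vector):
        # The bucket key is the tuple of signs of the k projections.
        return tuple(
            sum(p * v for p, v in zip(plane, vector)) >= 0.0 for plane in planes
        )

    def candidates(self, vector):
        """Ids of stored points sharing a bucket with `vector` in any table."""
        found = set()
        for planes, table in zip(self.planes, self.tables):
            found.update(table.get(self._key(planes, vector), ()))
        return found

    def add(self, doc_id, vector):
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, vector)].append(doc_id)
```

Calling `tables_needed(13, 0.2, 0.025)` evaluates the setting quoted in the TDT5 setup (k = 13, distance 0.2, miss probability below 2.5%).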
Variance Reduction Strategy• LSH fails to find the true near neighbor when the query point lies far away from all other points.• Another level of processing is added.
Variance Reduction Strategy (contd..) [Figure: the LSH scheme on a timeline from old to new documents.] • Compare the query with a fixed number of most recent documents. • Update the distance value if necessary.
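A rough sketch of this extra step, assuming the LSH lookup has already produced a nearest-neighbor distance and that a caller-supplied `distance_fn` compares two documents. The class name and window size are hypothetical.

```python
from collections import deque


class RecentDocuments:
    """Fixed-size window over the most recently seen documents."""

    def __init__(self, window_size):
        # The deque drops the oldest document once the window is full.
        self.window = deque(maxlen=window_size)

    def refine(self, query, lsh_distance, distance_fn):
        """Return the smaller of the LSH distance and the best distance in the window."""
        best = lsh_distance
        for doc in self.window:
            best = min(best, distance_fn(query, doc))
        return best

    def add(self, doc):
        self.window.append(doc)
```

Because the window has a fixed size, this adds only a constant amount of work per document.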
Streaming First Story detection - Challenges• Millions of new documents are published each hour• The volume limits the amount of space and time we can spend on each document – Cannot compare new document with all documents returned by LSH. – Cannot store all the previous documents in the main memory.• Additional metadata – Time stamp, topic tags etc.
Desiderata for a streaming FSD system• For each document say whether it discusses a previously unseen event and give confidence in this decision.• Decision should be made in bounded time.• Use bounded space• Only one pass over data allowed• Decision should be immediately made.
Using the LSH system without bounds • The number of documents in each bucket grows without bound, which requires an unbounded amount of space. • The number of comparisons also grows without bound.
Constant space approach• Limit the number of documents in a single bucket to a constant. – Remove the oldest document if the bucket is full.• The document is removed only from one single bucket in one of the L hash tables.
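One way to sketch the bounded bucket is a `deque` with a maximum length, which drops its oldest entry automatically. The class name and interface below are illustrative assumptions.

```python
from collections import defaultdict, deque


class BoundedTable:
    """A single hash table whose buckets hold at most `max_bucket_size` documents."""

    def __init__(self, max_bucket_size):
        self.buckets = defaultdict(lambda: deque(maxlen=max_bucket_size))

    def add(self, key, doc_id):
        # Appending to a full deque evicts the oldest document in this bucket only.
        self.buckets[key].append(doc_id)

    def bucket(self, key):
        return list(self.buckets.get(key, ()))
```

Each of the L tables would hold its own `BoundedTable`, so an eviction in one table leaves the same document untouched in the others.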
Constant number of comparisons• Limiting the number of documents might still result in large number of comparisons. – A new document can collide with all the documents in a bucket.• An additional limit to make a constant number of comparisons.
Constant number of comparisons (contd..)• Compare each new document with at most 3L documents that it most frequently collided with in all L hash tables.• If S is the set of all documents that collided with a new document in all L hash tables. – Order the elements in S according to the number of hash tables where the collision occurred. – Pick the first 3L elements of that ordered set and compare the new document only with them.
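The selection of at most 3L candidates can be sketched as below, assuming the caller passes one list of colliding document ids per hash table. The function name is hypothetical.

```python
from collections import Counter


def select_candidates(colliding_ids_per_table):
    """Pick the 3L documents that collided with the query in the most tables."""
    num_tables = len(colliding_ids_per_table)
    counts = Counter()
    for ids in colliding_ids_per_table:
        # Count each document at most once per table.
        counts.update(set(ids))
    return [doc_id for doc_id, _ in counts.most_common(3 * num_tables)]
```

Only the returned ids are then compared with the new document, which keeps the number of distance computations constant.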
Detecting Events in Twitter Posts • Not all tweets are actual stories. – Updates on personal life. – Spam. – Conversations. – Real stories. • An important event is one that interests a larger population.
Detecting Events – Threading • Run the streaming FSD system and assign a novelty score to each tweet. – The score is based on the cosine distance to the nearest tweet. – Output <tweet, its novelty score, its nearest tweet> • Tweet a links to tweet b if – b is the nearest neighbor of a – 1 − cos(a, b) < t (t ∈ [0.5, 0.6]).
Threading (contd..) • For each tweet a with nearest neighbor b – If b is within distance t: • Assign a to the existing thread to which b belongs. – Else: • a is the first tweet in a new thread. • Once we have threads of tweets – We can identify which threads grow fastest, which indicates that news of a new event is spreading.
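The threading rule can be sketched as follows, assuming the FSD system emits triples of (tweet id, nearest tweet id, cosine distance) in arrival order. The function name and data layout are hypothetical.

```python
def build_threads(links, t=0.5):
    """Group tweets into threads; `links` holds (tweet_id, nearest_id, distance) triples."""
    thread_of = {}  # tweet id -> id of the first tweet in its thread
    threads = {}    # first tweet id -> list of tweet ids in the thread
    for tweet_id, nearest_id, distance in links:
        if nearest_id in thread_of and distance < t:
            # Close enough to its nearest neighbor: join that tweet's thread.
            root = thread_of[nearest_id]
        else:
            # Too far from everything seen so far: start a new thread.
            root = tweet_id
            threads[root] = []
        thread_of[tweet_id] = root
        threads[root].append(tweet_id)
    return threads
```

Thread growth can then be read off from `len(threads[root])` over time.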
Analysis of social media – Related Work 1. Luo et al. (2007) worked on new event detection in a large-scale streaming setting. – Used the traditional FSD approach and employed various heuristics. – Not a generalized approach; never showed the utility of their system on a large-scale task. 2. Saha and Getoor (2009) worked on the maximum coverage problem. – Select k blogs that maximize the cover of interests specified by the user. – 20 days of blog data totaling 2 million posts. • This paper works on 6 months of Twitter data totaling over 160 million posts. • This paper’s FSD approach is more generalized.
Experiments • Experiments used the English part of the TDT5 dataset, which consists of 221,306 documents from a time period spanning April 2003 to Sept 2003. • Experiments were done in two stages: A) Test and compare the proposed FSD system to the state-of-the-art FSD system on the standard TDT5 dataset. B) Test different ranking methods on the output of the proposed FSD system applied to Twitter data.
TDT5 Experimental Setup • Aim – To test if the proposed system is on par with the best existing system (the UMass system in particular). – To accurately measure the speedup obtained over the existing system. • Same settings as the UMass system – 1-NN clustering – Cosine as a similarity measure – TF-IDF weighted document representation – Top 300 features in each document.
TDT5 Experimental Setup (contd..) • LSH parameters – Higher k, more computation. Lower k, more collisions. – k (number of hyperplanes) = 13. – The probability of missing a neighbor within a distance of 0.2 is less than 2.5%. • The official TDT evaluation requires each system to assign a confidence score for its decision. – In our case, we assign the score as soon as the new story arrives.
TDT5 Experiment Evaluation Metrics 1. Detection Error Tradeoff (DET) curves – A graphical plot of error rates for binary classification systems, plotting false reject rate vs. false accept rate. – DET provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution.
Plotting DET curves
1. Sort all stories according to their scores.
2. Perform a threshold sweep. For each threshold value:
   – Stories with a score greater than the threshold are considered new.
   – Calculate false alarm and miss probabilities.
     • False alarm: declaring a story new when it is not.
     • Miss: declaring a new story old.
3. Plot the values on a graph to show the trade-off between these two quantities.
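The threshold sweep can be sketched as below, assuming each story has a novelty score and a gold label saying whether it is a first story, with both classes present. The function name is hypothetical, and the quadratic loop is kept for clarity rather than speed.

```python
def det_points(scores, is_new):
    """Return (threshold, p_miss, p_false_alarm) for each distinct score used as threshold."""
    num_new = sum(is_new)
    num_old = len(is_new) - num_new
    points = []
    for threshold in sorted(set(scores)):
        # Stories scoring above the threshold are declared new.
        declared_new = [score > threshold for score in scores]
        misses = sum(1 for d, n in zip(declared_new, is_new) if n and not d)
        false_alarms = sum(1 for d, n in zip(declared_new, is_new) if d and not n)
        points.append((threshold, misses / num_new, false_alarms / num_old))
    return points
```

Plotting the second and third element of each tuple against each other gives the trade-off curve.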
DET curves comparing the proposed system with the UMass FSD system
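2. Minimal Normalized Cost
$C_{det} = C_{miss} \cdot P_{miss} \cdot P_{target} + C_{FA} \cdot P_{FA} \cdot P_{non\text{-}target}$
• $C_{miss}$ and $C_{FA}$ are the costs of a miss and a false alarm.
• $P_{miss}$ and $P_{FA}$ are the probabilities of a miss and a false alarm.
• $P_{target}$ and $P_{non\text{-}target}$ are the prior target and non-target probabilities.
• $C_{min}$ is the minimal value of $C_{det}$ over all threshold values.
• A lower value of $C_{min}$ indicates better performance.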
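A small sketch of the cost computation, assuming the usual TDT form of the detection cost normalized by the cheaper of the two trivial systems. The default constants (C_miss = 1, C_FA = 0.1, P_target = 0.02) are assumptions that are not stated on the slide, and the function names are hypothetical.

```python
def normalized_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Detection cost divided by the cost of the better trivial system."""
    p_non_target = 1.0 - p_target
    cost = c_miss * p_miss * p_target + c_fa * p_fa * p_non_target
    # Trivial systems: always answer "old" (all misses) or always "new" (all false alarms).
    return cost / min(c_miss * p_target, c_fa * p_non_target)


def minimal_normalized_cost(operating_points, **costs):
    """Minimum normalized cost over (p_miss, p_fa) pairs from a threshold sweep."""
    return min(normalized_cost(p_miss, p_fa, **costs) for p_miss, p_fa in operating_points)
```

Under this normalization, a value of 1.0 corresponds to the better of the two trivial systems.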
TDT5 Results – Minimal Normalized Cost• No limit on the bucket size.• Processing time per item was made constant.
TDT5 Results (contd..)• Variance in case of Pure LSH = 0.046.• Variance in case of Variance Reduced LSH = 0.004.• UMass system took 28 hours to complete the processing.• The proposed system took only 2 hours.
Comparison of processing time for the proposed and the UMass system
TDT5 Results – Minimal Normalized Cost • Bucket size is limited in terms of a fraction of the expected number of collisions. E.g.: a bucket size of 0.5 means that the number of docs in a bucket cannot be more than 50% of the expected number of collisions. • Performance declines when the bucket size is limited, but remains reasonable even when the bucket size is reduced to 10% of the expected number of collisions.
Memory usage on a month of Twitter data.• X-axis shows how long the system has been running for.
Twitter Experimental Setup• Dataset – Twitter data gathered over a period of six months. – 163.5 million timestamped tweets, totaling over 2 billion tokens. • Only ASCII characters • Stripped the words beginning with “@”, “#”
Twitter Experimental Setup (contd..) • Not evaluating our FSD system. • Evaluating different methods of ranking threads, which are the outputs of an FSD system. – To detect important events in a very noisy and unstructured stream such as Twitter.
Twitter Experimental Setup (contd..) • Gold Standard – Human experts manually labeled tweets returned by the system. – 3 labels • Event – A tweet which conveys what exactly happened without requiring any prior knowledge about the event. The event referenced should be sufficiently important. • Spam – Automatic weather updates, radio station updates, etc. • Neutral – Everything that isn’t an event or spam.
Twitter Experimental Setup ( contd..)• Only the 1000 fastest growing threads were labeled.• 820 tweets on which both the annotators agreed are considered as the gold standard.
Twitter Evaluation• Evaluation is performed by computing average precision (AP) on the gold standard sorted according to different criteria.• Test#1 – Relevant documents – Event tweets – Non-relevant documents – Neutral and spam tweets.• Test#2 – Relevant documents – Event + Neutral tweets – Non-relevant documents – Spam tweets.
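Average precision over a ranked list can be sketched as below, assuming the input is a list of booleans in rank order saying whether each item is relevant. The function name is hypothetical.

```python
def average_precision(ranked_relevance):
    """Mean of the precision values measured at each relevant item's rank."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0
```

For Test#1 the relevant items would be the Event tweets; for Test#2 they would be the Event and Neutral tweets.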
Ranking the threads• Different ways of ranking the threads – Baseline : Random ordering of threads – Size of the thread – threads are ranked according to the number of tweets. – Number of users – threads are ranked according to the number of unique users posting in a thread. – Entropy + users – if the entropy of a thread is < 3.5, move to the back of the list, otherwise sort according to the number of unique users.
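A sketch of the "entropy + users" ranking, assuming that a thread's entropy is the Shannon entropy of its word distribution computed with the natural logarithm. The slide does not define the entropy, so both the definition and the data layout here are assumptions.

```python
import math
from collections import Counter


def thread_entropy(tweets):
    """Shannon entropy (natural log) of the word distribution in a thread."""
    counts = Counter(word for tweet in tweets for word in tweet.lower().split())
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def rank_threads(threads, min_entropy=3.5):
    """Sort threads by unique users, pushing low-entropy threads to the back.

    Each thread is a dict with 'tweets' (list of str) and 'users' (list of str).
    """
    def sort_key(thread):
        low_entropy = thread_entropy(thread["tweets"]) < min_entropy
        # False sorts before True, so low-entropy threads end up last.
        return (low_entropy, -len(set(thread["users"])))

    return sorted(threads, key=sort_key)
```

A thread that repeats the same few words, such as an automated update, has low entropy and is demoted regardless of how many users posted in it.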
Observations • Celebrity deaths are the fastest spreading news on Twitter. – Steve Jobs’s death broke the Twitter record with 10,000 tweets per second. – Tweet count soon after Osama bin Laden’s death.
Questions • The language in the tweets is often misspelt and quite informal. • Topic tags might provide richer information about the trending topic. • The time complexity of the algorithm and of the sorting techniques is not mentioned. • Entropy is usually a measure of disorder or randomness, i.e., the lower the entropy, the less the disorder. – But the definition here says that higher entropy values are better.