How to Spot First Stories on Twitter using Storm

http://micvog.com/2013/09/08/storm-first-story-detection/

  1. How to spot first stories on Twitter using Storm
     Michael Vogiatzis - @mvogiatzis
     Software Engineer
  2. The Task
     Find the first document in a stream of documents that discusses a specific event.
  3. Twitter
     • Spam
       ◦ It’s Cooooooooooooolddd !! Brrrrrr…
     • Neutral
       ◦ #nowplaying ♫ Live At The BBC – Dire Straits
     • Events
       ◦ The 6.4-magnitude quake struck just after 9.20pm (CST) on Sunday in the Banda Sea northeast of East Timor.
  4. Algorithm
     • TF-IDF on input Tweet
     • Convert it to Vector
  5. TF-IDF
     • Split text into words
     • Term Frequency * Inverse Document Frequency
     • More frequent words get less weight
     • Remove out-of-vocabulary words, e.g. “lol”, “the”
     • Remove URLs and mentions (@)
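
The TF-IDF steps on slide 5 can be sketched roughly as follows. This is a minimal illustration, not the talk's implementation: the stop-word list, the tokenizer regex, the sample corpus, and the smoothed IDF formula log((1 + N) / (1 + df)) + 1 are all assumptions made here for the example.

```python
import math
import re
from collections import Counter

# Hypothetical word list for illustration; the talk does not give its vocabulary.
STOP_WORDS = {"the", "a", "an", "is", "on", "in", "of", "lol"}

def tokenize(text):
    """Lowercase, strip URLs and @mentions, split into words, drop listed words."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return [w for w in re.findall(r"[a-z0-9#']+", text) if w not in STOP_WORDS]

def tf_idf(tokens, doc_freq, n_docs):
    """Map each term to tf * idf, with the assumed smoothed idf described above."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {
        term: (count / total)
        * (math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1)
        for term, count in counts.items()
    }

# Made-up sample tweets, loosely based on the examples in the slides.
corpus = [
    "Earthquake in the Banda Sea",
    "Yay Sunday! lol",
    "quake struck on Sunday @someone http://t.co/x",
]
docs = [tokenize(t) for t in corpus]
# Document frequency: in how many tweets each term appears.
doc_freq = Counter(term for d in docs for term in set(d))
vector = tf_idf(docs[2], doc_freq, len(docs))
```

In this toy corpus "sunday" appears in two tweets and "quake" in one, so "sunday" ends up with the lower weight, which is the "more frequent words get less weight" behaviour the slide describes.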
  6. Algorithm
     • TF-IDF on input Tweet
     • Convert it to Vector
     • Find N nearest neighbours
       ◦ Locality Sensitive Hashing
  7. Locality Sensitive Hashing
     • Data clustering: near-neighbour search
     • Buckets: hash tables for similar documents
     • Random projection creates a hash
     • Identical hash -> nearest neighbour candidate
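
One common way to build such a hash is random hyperplane projection: each hyperplane contributes one bit, depending on which side of it the vector falls. The sketch below assumes dense, fixed-length vectors and a single hash table; the names `make_hyperplanes`, `lsh_hash` and `LSHTable` are hypothetical, and the talk does not say how many bits or tables it uses.

```python
import random

def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_hash(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

class LSHTable:
    """A single hash table whose buckets hold ids of documents with equal hashes."""

    def __init__(self, n_bits, dim, seed=0):
        self.hyperplanes = make_hyperplanes(n_bits, dim, seed)
        self.buckets = {}

    def add(self, doc_id, vector):
        key = lsh_hash(vector, self.hyperplanes)
        self.buckets.setdefault(key, []).append(doc_id)

    def candidates(self, vector):
        """Ids sharing the vector's bucket, i.e. nearest-neighbour candidates."""
        return list(self.buckets.get(lsh_hash(vector, self.hyperplanes), []))

table = LSHTable(n_bits=8, dim=4)
table.add("t1", [1.0, 0.0, 2.0, 0.5])
# Same direction as t1, twice the length: lands in the same bucket.
same_direction = table.candidates([2.0, 0.0, 4.0, 1.0])
```

Vectors pointing in similar directions tend to agree on most bits, so they are likely (not guaranteed) to share a bucket; using several tables with different hyperplanes is a usual way to reduce misses.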
  8. Locality Sensitive Hashing cont’d
  9. Algorithm
     • TF-IDF on input Tweet
     • Convert it to Vector
     • Find N nearest neighbours
       ◦ Locality Sensitive Hashing
     • Compare distances and find the closest
     • If distance < threshold -> not a first story
  10. Extra Step
     • If the closest distance found in the buckets is not short enough
     • Compare with a fixed number of recent tweets
     • Check again
  11. Algorithm
     • TF-IDF on input Tweet
     • Convert it to Vector
     • Find N nearest neighbours
       ◦ Locality Sensitive Hashing
     • Compare distances and find the closest
     • If distance < threshold -> not a first story
     • Else compare with X most recent tweets (optimization)
     • If new_distance > threshold -> first story!
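
The decision logic of slides 9 to 11 can be sketched as below. The slides do not name the distance measure, so cosine distance over sparse {term: weight} vectors is an assumption here, as are the function names, the threshold value of 0.5, and the buffer size of 3 standing in for "X most recent tweets".

```python
import math
from collections import deque

def cosine_distance(a, b):
    """1 - cosine similarity for sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm

def min_distance(vector, others):
    """Distance to the closest vector in others (1.0 if there are none)."""
    return min((cosine_distance(vector, o) for o in others), default=1.0)

def is_first_story(vector, bucket_neighbours, recent, threshold):
    # Step 1: a close neighbour in the LSH buckets means the story is not new.
    if min_distance(vector, bucket_neighbours) < threshold:
        return False
    # Step 2 (extra step): check again against the most recent tweets.
    return min_distance(vector, recent) > threshold

recent = deque(maxlen=3)  # keeps only the X = 3 most recent vectors
quake = {"quake": 1.0, "banda": 0.8, "sea": 0.5}
sunday = {"yay": 1.0, "sunday": 1.0}
recent.append(sunday)

first = is_first_story(quake, [], recent, threshold=0.5)
repeat = is_first_story(quake, [quake], recent, threshold=0.5)
```

Here `first` is true because nothing stored resembles the quake tweet, while `repeat` is false because an identical vector is already in the bucket.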
  12. Storm
     Real-time computation made easy
  13. Storm
     • Distributed real-time computation system
     • Fault tolerant
     • Fast
     • Scalable
     • Guaranteed message processing
     • Open source
     • Multilang capabilities
  14. Elements
     • Streams
       ◦ Set of tuples
       ◦ Unbounded sequence of data
     • Spout
       ◦ Source of streams
     • Bolts
       ◦ Application logic
       ◦ Functions
       ◦ Streaming aggregations, joins, DB ops
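
To make the roles of these elements concrete, here is a toy, single-process analogy using Python generators. It is not Storm's API and has none of Storm's distribution or fault tolerance; the function names and sample tweets are invented for the example.

```python
def tweet_spout(tweets):
    """Spout: the source of the stream; emits one (id, text) tuple per tweet."""
    for tweet_id, text in enumerate(tweets):
        yield (tweet_id, text)

def tokenize_bolt(stream):
    """Bolt applying a function to every incoming tuple."""
    for tweet_id, text in stream:
        yield (tweet_id, text.lower().split())

def count_bolt(stream):
    """Bolt doing a streaming aggregation: running word counts so far."""
    counts = {}
    for tweet_id, words in stream:
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        # Emit a snapshot so later tuples do not mutate earlier output.
        yield (tweet_id, dict(counts))

# Wiring spout -> bolt -> bolt plays the role of a topology.
topology = count_bolt(tokenize_bolt(tweet_spout(["Yay Sunday", "quake on Sunday"])))
results = list(topology)
```

Each stage consumes tuples lazily from the previous one, which mirrors the idea of an unbounded stream flowing from a spout through a chain of bolts.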
  15. Topology
  16. Part I
  17. Part II
  18. Results
     Input Tweet | Stored Tweet | Similarity score
     @Real_Liam_Payne i wanna be your female pal | i. wanna be your best friend so follow me | 0.385
     RT @damnitstrue: Life is for living, not for stressing. | RT Life is for living, not for stressing. | 0.99
     The 6.4-magnitude quake struck just after 9.20pm (CST) on Sunday in the Banda Sea northeast of East Timor. http://t.co/UhfwCS2xPp | Yay Sunday! | 0.129
  19. Evaluation
     • Evaluation on speed-up metric
       ◦ 1381% vs. single-threaded
       ◦ 372% vs. multi-threaded (4 threads)
     • Having humans label tweets is hard!
     • Implementation tested on newswire and broadcast news
     • False alarms
  20. Future work
     • Reduce false alarms by using threads for topics
     • Image similarity detection
     • Audio similarity?
       ◦ Hello Shazam!
  21. Michael Vogiatzis
     • Twitter: @mvogiatzis
     • Code on Github
     • http://micvog.com
       ◦ Next post: “7 Lessons Learned at a London startup”
