EVENT DETECTION
4th BDE Hang-out “Big Data in Secure societies”
4/05/2017
George Giannakopoulos and
Nikiforos Pittaras,
NCSR "Demokritos"
Pilot Architecture
8 May 2017, www.big-data-europe.eu
Event Detection Workflow
[Workflow diagram: News Crawler, …, Event Detector, Lookup Service]
ED Workflow: News Crawler
 Runs periodically
 Monitored sources:
o Reuters news feeds (RSS)
o Selected Twitter accounts
o Keyword-based search
 Possible to cover more sources if needed
 Possible to process explicitly specified items
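As a rough illustration of one periodic crawler pass over an RSS source, the sketch below parses an RSS 2.0 document with Python's standard library and keeps only the items not seen in earlier runs. The sample feed, the function names `parse_rss_items` and `new_items`, and the in-memory `seen` set are all assumptions made for illustration; the pilot's actual crawler is not shown in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real run would download the RSS document instead.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First story</title><link>http://example.org/1</link></item>
  <item><title>Second story</title><link>http://example.org/2</link></item>
</channel></rss>"""

def parse_rss_items(xml_text):
    """Return (title, link) tuples for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title", default=""), item.findtext("link", default=""))
            for item in root.iter("item")]

def new_items(xml_text, seen_links):
    """Keep only items whose link was not seen in a previous run, and record them."""
    fresh = [(title, link) for title, link in parse_rss_items(xml_text)
             if link not in seen_links]
    seen_links.update(link for _, link in fresh)
    return fresh

seen = set()
first_run = new_items(SAMPLE_RSS, seen)   # both items are new
second_run = new_items(SAMPLE_RSS, seen)  # nothing new on the next periodic run
```

Tracking seen links is one simple way to make repeated runs idempotent; a deployment would presumably persist that state rather than keep it in memory.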
ED Workflow: Cassandra
 Scalable, distributed NoSQL database
 Input Scenario I:
o Individual news items / tweets from News / Twitter Crawler
o Conforms with privacy regulations
 Input Scenario II:
o Events identified by Event Detector
 Input Scenario III:
o Queries about the stored news items and events
ED Workflow: Event Detector
 Runs periodically, parallel execution based on Spark
 Input:
o News items
 Output:
o Events
 Events are associated with metadata: date, locations & specified named entities
 Algorithm based on
Event Detector Algorithm
Two steps:
1. Identify events
o Compare pairs of news items
o If similarity > threshold → related pair
o Form clusters based on related pairs
o If cluster has support > threshold → event
2. Enrich events
o Compare individual social media items with events
o If similarity > threshold → attach to event
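The two steps above can be sketched in a few lines of Python. This is a minimal single-machine sketch, not the pilot's implementation: the similarity measure is not given here, so word-set Jaccard overlap is swapped in; "support" is assumed to mean the number of items in a cluster; clusters are assumed to be connected components of the related-pair graph; and all names and threshold values are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Word-set overlap; a stand-in similarity, not the pilot's actual measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def detect_events(news, sim_threshold=0.3, support_threshold=1):
    """Step 1: link related pairs, cluster them, keep clusters with enough support."""
    parent = list(range(len(news)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(news)), 2):
        if jaccard(news[i], news[j]) > sim_threshold:  # related pair
            parent[find(i)] = find(j)                  # merge their clusters

    clusters = {}
    for i in range(len(news)):
        clusters.setdefault(find(i), []).append(i)
    # Support is assumed to be cluster size.
    return [idxs for idxs in clusters.values() if len(idxs) > support_threshold]

def enrich(events, news, posts, sim_threshold=0.15):
    """Step 2: attach a post to an event if it resembles any of the event's items."""
    attached = {e: [] for e in range(len(events))}
    for p, post in enumerate(posts):
        for e, idxs in enumerate(events):
            if max(jaccard(post, news[i]) for i in idxs) > sim_threshold:
                attached[e].append(p)
    return attached

news = ["earthquake strikes central italy overnight",
        "strong earthquake strikes italy",
        "central bank raises interest rates"]
posts = ["felt the earthquake in italy last night", "new phone released today"]

events = detect_events(news)             # [[0, 1]]: the two earthquake items
attached = enrich(events, news, posts)   # {0: [0]}: first post joins the event
```

The pairwise loop is the quadratic part of the computation, which is the natural candidate for parallel execution.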
Event Detector Scaling
Scaling of performance through Apache Spark
 Reuters-21578 dataset, up to 8192 articles (~33.5 million pairs)
 Parallel vs distributed execution time (lower is better)
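The pair count quoted above follows from the number of unordered pairs among n articles, n(n - 1)/2. A short check, where the function name `unordered_pairs` is only illustrative:

```python
from math import comb

def unordered_pairs(n):
    """Number of distinct article pairs among n articles: n * (n - 1) / 2."""
    return comb(n, 2)

pairs = unordered_pairs(8192)  # 33550336, i.e. about 33.5 million
```

Because the count grows quadratically, doubling the number of articles roughly quadruples the number of comparisons.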
ED Workflow: Lookup service
 Based on Apache Lucene for fuzzy queries
 Based on the GADM dataset
o more than 180,000 location names
 Input:
o Query including an extracted location name
 Output:
o The corresponding geocoordinates
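To give a feel for the input/output contract of such a lookup, the sketch below resolves a possibly misspelled location name to coordinates. It does not use Lucene: Python's `difflib.get_close_matches` stands in for the fuzzy query, and the three-entry `GAZETTEER` with approximate coordinates is a hypothetical stand-in for the location dataset.

```python
from difflib import get_close_matches

# Hypothetical mini-gazetteer: name -> (latitude, longitude), approximate values.
GAZETTEER = {
    "athens": (37.98, 23.73),
    "thessaloniki": (40.64, 22.94),
    "brussels": (50.85, 4.35),
}

def lookup(name, cutoff=0.8):
    """Return coordinates of the closest known location name, or None if no match."""
    matches = get_close_matches(name.lower(), list(GAZETTEER), n=1, cutoff=cutoff)
    return GAZETTEER[matches[0]] if matches else None

hit = lookup("Thesaloniki")  # misspelled query still resolves to thessaloniki
miss = lookup("Qqqq")        # nothing similar enough, so None
```

The `cutoff` parameter trades recall for precision; a full-text index would be the more realistic choice for a dataset of this size.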
ED Workflow: Locations & entities
 Incorporation of semantic metadata extraction
o Enrich location extraction with additional sources
o Augment events by extracting generic named entities
 Grounding to unique entity URI
 Entity metadata queryable from an additional API, if needed
o APIs / thesauri by the Semantic Web Company
ED Workflow: Locations & entities
 Example: famous people thesaurus:
https://en.wikipedia.org/wiki/The_Godfather#Cast
 Extractor API → Entities
o http://bde.poolparty.biz/People/20
o http://bde.poolparty.biz/People/446473
o http://bde.poolparty.biz/People/688722
o ....
 Metadata API → Entity metadata
o name: Marlon Brando
o uri: http://bde.poolparty.biz/People/688722
o grounding: http://dbpedia.org/resource/Marlon_Brando
o broaders: http://bde.poolparty.biz/People/2
o properties: http://www.w3.org/1999/02/22-rdf-syntax-ns#type
o ...
ED Workflow: Change Detection (CD) & Twitter search
 Automatic triggering of the change detection workflow
o Activated for events whose number of source articles exceeds a threshold
 Support for interactive Twitter keyword search
o Receive Twitter post IDs from Sextant
o Processed on the next crawler run
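Both behaviours on this slide can be pictured with a small sketch. Everything here is assumed for illustration: the event dictionary layout, the names `events_to_trigger` and `PendingTweets`, and the threshold value are hypothetical, and the real hand-over between Sextant and the crawler is not described in these slides.

```python
def events_to_trigger(events, min_articles):
    """Return ids of events whose source-article count exceeds the threshold."""
    return [e["id"] for e in events if len(e["articles"]) > min_articles]

class PendingTweets:
    """Holds post IDs received between runs; drained on the next crawler run."""

    def __init__(self):
        self._ids = []

    def receive(self, tweet_ids):
        self._ids.extend(tweet_ids)

    def drain(self):
        ids, self._ids = self._ids, []
        return ids

# Hypothetical events with their source articles.
events = [{"id": "ev1", "articles": ["a1", "a2", "a3"]},
          {"id": "ev2", "articles": ["a4"]}]
triggered = events_to_trigger(events, min_articles=2)  # only ev1 qualifies

queue = PendingTweets()
queue.receive(["101", "102"])   # IDs arrive from the interactive search
next_run = queue.drain()        # picked up by the next crawler run
after_run = queue.drain()       # queue is empty again
```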
Thank you!
Questions?
Links
 Strabon: http://strabon.di.uoa.gr
 GeoTriples: https://github.com/LinkedEOData/GeoTriples
 Event Detection: https://github.com/big-data-europe/docker-event-detection

SC7 Webinar 4 04/05/2017 NCSR Demokritos Presentation "Event Detection"
