4. ED Workflow: News Crawler
Runs periodically
Monitored sources:
o Reuters news feeds (RSS)
o Selected Twitter accounts
o Keyword-based search
Possible to cover more sources if needed
Possible to process explicitly specified items
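A minimal sketch of one crawler pass over an RSS feed, assuming plain RSS 2.0 items with title, link and pubDate fields. The sample feed, the function name `parse_rss_items` and the `seen_links` set used for de-duplication between periodic runs are illustrative assumptions, not the project's actual crawler.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real run would fetch the XML over HTTP.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Flood hits region A</title>
      <link>http://example.org/news/1</link>
      <pubDate>Mon, 08 May 2017 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Storm warning for region B</title>
      <link>http://example.org/news/2</link>
      <pubDate>Mon, 08 May 2017 11:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text, seen_links):
    """Return items not seen on earlier runs and record their links."""
    root = ET.fromstring(xml_text)
    new_items = []
    for item in root.iter("item"):
        link = item.findtext("link", default="")
        if not link or link in seen_links:
            continue  # skip items without a link or already crawled
        seen_links.add(link)
        new_items.append({
            "title": item.findtext("title", default=""),
            "link": link,
            "published": item.findtext("pubDate", default=""),
        })
    return new_items
```

Calling `parse_rss_items(SAMPLE_FEED, seen)` twice with the same `seen` set returns both items on the first call and an empty list on the second, which is the behaviour a periodic crawler needs.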
8 May 2017, www.big-data-europe.eu
5. ED Workflow: Cassandra
Scalable, distributed NoSQL database
Input Scenario I:
o Individual news items / tweets from News / Twitter Crawler
o Conforms with privacy regulations
Input Scenario II:
o Events identified by Event Detector
Input Scenario III:
o Queries about the stored news items and events
6. ED Workflow: Event Detector
Runs periodically, parallel execution based on Spark
Input:
o News items
Output:
o Events
Events are associated with meta-data: date, locations &
specified named entities
Algorithm based on
7. Event Detector Algorithm
Two steps:
1. Identify events
o Compare pairs of news items
o If similarity > threshold → related pair
o Form clusters based on related pairs
o If cluster has support > threshold → event
2. Enrich events
o Compare individual social media items with events
o If similarity > threshold → attach to event
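The two steps can be sketched as follows. This is a sequential, single-machine illustration only: the slides do not name the similarity measure, so token-set Jaccard similarity is used as a stand-in, and the function names, the union-find clustering and the threshold values are all assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set Jaccard similarity; a stand-in for the real measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def detect_events(news, sim_threshold=0.4, support_threshold=1):
    """Step 1: link related pairs, cluster them, keep supported clusters."""
    parent = list(range(len(news)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair of news items; related pairs join one cluster.
    for i, j in combinations(range(len(news)), 2):
        if jaccard(news[i], news[j]) > sim_threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(news)):
        clusters.setdefault(find(i), []).append(i)
    # A cluster becomes an event only if its support exceeds the threshold.
    return [m for m in clusters.values() if len(m) > support_threshold]


def enrich_events(events, news, posts, sim_threshold=0.3):
    """Step 2: attach each social media post to the events it resembles."""
    attached = {idx: [] for idx in range(len(events))}
    for post in posts:
        for idx, members in enumerate(events):
            if any(jaccard(post, news[m]) > sim_threshold for m in members):
                attached[idx].append(post)
    return attached
```

Here an event is a list of indices into `news`, and support is taken to be the number of news items in the cluster; the actual definition of support may differ.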
8. Event Detector Scaling
Scaling of performance through Apache Spark
Reuters-21578 dataset, up to 8192 articles (~33.5 million pairs)
Parallel vs distributed execution time (lower is better)
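The pair count quoted above follows from comparing every unordered pair of articles once, i.e. n(n-1)/2 comparisons. A short check, where the helper name `pair_count` and the smaller article counts are illustrative assumptions:

```python
from math import comb


def pair_count(n_articles):
    """Number of unordered article pairs that must be compared."""
    return comb(n_articles, 2)


# Doubling the number of articles roughly quadruples the work.
for n in (1024, 2048, 4096, 8192):
    print(f"{n:>5} articles -> {pair_count(n):>10,} pairs")
```

For 8192 articles this gives 33,550,336 pairs, which is the ~33.5 million figure on the slide and the reason the pairwise step benefits from being spread across Spark workers.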
9. ED Workflow: Lookup service
Based on Apache Lucene for fuzzy queries
Based on the GADM dataset
o more than 180,000 location names
Input:
o Query including an extracted location name
Output:
o The corresponding geocoordinates
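A minimal sketch of the lookup idea: a misspelled location name is matched fuzzily against a gazetteer and resolved to coordinates. The real service is based on Apache Lucene; here Python's standard `difflib` is swapped in purely for illustration, and the three-entry gazetteer (with approximate coordinates), the function name and the cutoff value are assumptions.

```python
import difflib

# Tiny illustrative gazetteer: name -> approximate (latitude, longitude).
GAZETTEER = {
    "athens": (37.98, 23.73),
    "bonn": (50.73, 7.10),
    "vienna": (48.21, 16.37),
}


def lookup_location(name, cutoff=0.8):
    """Return coordinates of the closest gazetteer name, or None."""
    matches = difflib.get_close_matches(
        name.lower(), list(GAZETTEER), n=1, cutoff=cutoff
    )
    if not matches:
        return None  # nothing similar enough to the query
    return GAZETTEER[matches[0]]
```

With this gazetteer, `lookup_location("Viena")` resolves to the entry for Vienna, while an unrelated string such as `"Xyz"` returns `None`.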
10. ED Workflow: Locations & entities
Incorporation of semantic metadata extraction
o Enrich location extraction with additional sources
o Augment events by extracting generic named entities
Grounding to unique entity URI
Entity metadata queryable from an additional API, if needed
o APIs / thesauri by the Semantic Web Company
11. ED Workflow: Locations & entities
Example: famous people thesaurus
Source: https://en.wikipedia.org/wiki/The_Godfather#Cast
Extractor API returns entities:
o http://bde.poolparty.biz/People/20
o http://bde.poolparty.biz/People/446473
o http://bde.poolparty.biz/People/688722
o ....
Metadata API returns entity metadata:
o name: Marlon Brando
o uri: http://bde.poolparty.biz/People/688722
o grounding: http://dbpedia.org/resource/Marlon_Brando
o broaders: http://bde.poolparty.biz/People/2
o properties: http://www.w3.org/1999/02/22-rdf-syntax-ns#type
o ...
12. ED Workflow: CD & Twitter search
Automatic triggering of the change detection workflow
o Activated for events whose number of source articles exceeds a threshold
Support for interactive Twitter keyword search
o Receive Twitter post IDs from Sextant
o Processed on the next crawler run
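Both behaviours can be sketched in a few lines. The event dictionary layout, the names `events_to_trigger` and `TweetQueue`, and the threshold value are hypothetical; the slides do not specify how Sextant delivers post IDs or what the threshold is.

```python
from collections import deque


def events_to_trigger(events, min_articles=5):
    """Return IDs of events whose source article count exceeds the threshold."""
    return [e["id"] for e in events if len(e["articles"]) > min_articles]


class TweetQueue:
    """Holds post IDs received between crawler runs."""

    def __init__(self):
        self._pending = deque()

    def submit(self, post_ids):
        # Called when post IDs arrive; nothing is processed yet.
        self._pending.extend(post_ids)

    def drain(self):
        # Called at the start of the next crawler run.
        ids = list(self._pending)
        self._pending.clear()
        return ids
```

Queuing the IDs rather than processing them immediately matches the "processed on the next crawler run" behaviour: the interactive request returns quickly and the crawler picks the items up on its usual schedule.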