4. ED Workflow: News Crawler
Runs periodically
Stores parsed content and metadata to Cassandra
RSS feeds:
o Crawler complies with privacy regulations
o Default RSS feed list is set to the Reuters generic categories
Direct links to published articles:
o Best-effort parsing
18 Dec 2017, www.big-data-europe.eu
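A minimal sketch of the RSS step, assuming a standard RSS 2.0 feed and using only the Python standard library. The sample feed, the example.org links and the function name parse_feed are illustrative assumptions, not part of the actual crawler; storage to Cassandra is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content, inlined so the sketch runs without network access.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <link>http://example.org/articles/1</link>
      <pubDate>Mon, 18 Dec 2017 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.org/articles/2</link>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one metadata dict per <item>; missing fields become None (best effort)."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        })
    return items


articles = parse_feed(SAMPLE_RSS)
```

Each returned dict corresponds to one article link that a crawler could then fetch and parse on a best-effort basis.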
5. ED Workflow: Twitter Crawler
Runs periodically
Stores parsed content and metadata to Cassandra
Multiple operation modes:
o Query specified twitter accounts
o Monitor all twitter posts of a specified language
o Keyword-based search
o Parse individual specified posts
6. ED Workflow: Cassandra
Scalable, distributed NoSQL database
I/O scenarios:
1. News & Tweets storage:
o Individual items (news articles or tweets) from the crawlers
2. Event storage:
o Event objects & metadata, as identified by the Event Detector
3. Frontend queries:
o Queries from Sextant about the stored news items and events
7. ED Workflow: Event Detector
Runs periodically
Distributed execution based on Apache Spark
Two algorithm steps:
1. Discovers related news items and clusters them into events
2. Augments the produced events with useful metadata: date,
locations, images and specified named entities
Detector algorithm based on
8. ED Workflow: ED Algorithm
1) Identify events:
o Gather all unique article pairs
o Compute the similarity of the two articles in each pair using graph
representation methods
If similarity > threshold → related pair
o Form clusters based on related pairs
If cluster has support > threshold → event
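The identification steps above can be sketched as follows. This is an illustration under stated assumptions, not the detector's actual code: Jaccard overlap of word bigrams stands in for the graph-based similarity, clusters are taken to be connected components of the related-pair graph, and "support" is assumed to mean cluster size. Both threshold values are hypothetical.

```python
from itertools import combinations


def word_bigrams(text):
    words = text.lower().split()
    return set(zip(words, words[1:]))


def similarity(a, b):
    """Jaccard overlap of word bigrams (a stand-in for the graph-based similarity)."""
    ga, gb = word_bigrams(a), word_bigrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def detect_events(articles, sim_threshold=0.3, support_threshold=1):
    # Union-find over article indices; related pairs are merged into one cluster.
    parent = list(range(len(articles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Gather all unique article pairs and keep those above the similarity threshold.
    for i, j in combinations(range(len(articles)), 2):
        if similarity(articles[i], articles[j]) > sim_threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(articles)):
        clusters.setdefault(find(i), []).append(i)

    # Assumption: support is the number of articles in the cluster.
    return [m for m in clusters.values() if len(m) > support_threshold]


sample_articles = [
    "earthquake strikes central italy overnight",
    "strong earthquake strikes central italy",
    "central bank raises interest rates again",
    "stock markets rally after earnings reports",
]
events = detect_events(sample_articles)
```

With these toy inputs, only the two earthquake articles form a related pair, so a single event containing articles 0 and 1 is produced.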
9. ED Workflow: ED Algorithm
2) Enrich events:
o Assign individual social media items to events
Convert each item to the graph-based representation and classify it by similarity
If similarity > threshold → attach to event
o Augment events with external metadata derived from their member
articles and tweets:
Location names and geocoordinates (GADM)
Named entities (Famous people)
Photographs (Flickr)
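The assignment of social media items to events might look like the sketch below. It assumes that each tweet is attached to the single most similar event, that similarity to an event is the best match against any of its member articles, and that token-set Jaccard overlap is an acceptable stand-in for the graph-based similarity. The function names and the threshold value are hypothetical.

```python
def tokens(text):
    return set(text.lower().split())


def similarity(a, b):
    """Token-set Jaccard overlap (a stand-in for the graph-based similarity)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def attach_tweets(events, tweets, threshold=0.2):
    """events maps an event id to its member article texts.

    Returns a dict mapping each event id to the tweets attached to it.
    """
    attached = {event_id: [] for event_id in events}
    for tweet in tweets:
        best_id, best_score = None, 0.0
        for event_id, texts in events.items():
            score = max((similarity(tweet, t) for t in texts), default=0.0)
            if score > best_score:
                best_id, best_score = event_id, score
        # Attach only when the best match clears the threshold.
        if best_id is not None and best_score > threshold:
            attached[best_id].append(tweet)
    return attached


sample_events = {
    "quake": ["earthquake strikes central italy overnight"],
    "rates": ["central bank raises interest rates again"],
}
sample_tweets = [
    "earthquake in central italy felt in rome",
    "great coffee this morning",
]
attached = attach_tweets(sample_events, sample_tweets)
```

Tweets that match no event above the threshold are simply left unattached.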
10. ED Workflow: Location Extraction
Based on Apache Lucene for fuzzy queries
Based on the GADM dataset
o more than 180,000 location names & geometries
Input: Clean text
Output: Location name(s) with their corresponding
geocoordinates
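To illustrate the input/output contract, here is a small sketch of fuzzy location lookup. It does not use Apache Lucene or GADM: Python's difflib stands in for Lucene fuzzy queries, and a three-entry toy gazetteer with approximate point coordinates stands in for the dataset's names and geometries. The function name and cutoff value are assumptions.

```python
import difflib

# Toy gazetteer: name -> (latitude, longitude). Coordinates are approximate
# and stand in for the real location geometries.
GAZETTEER = {
    "athens": (37.98, 23.73),
    "brussels": (50.85, 4.35),
    "thessaloniki": (40.64, 22.94),
}


def extract_locations(clean_text, cutoff=0.8):
    """Return {location name: (lat, lon)} for words that fuzzily match the gazetteer."""
    found = {}
    for word in clean_text.lower().split():
        word = word.strip(".,;:!?\"'()")
        match = difflib.get_close_matches(word, list(GAZETTEER), n=1, cutoff=cutoff)
        if match:
            found[match[0]] = GAZETTEER[match[0]]
    return found


locations = extract_locations("Protests were reported in Athens and Brusels on Monday.")
```

The misspelled "Brusels" still resolves to the Brussels entry, which is the kind of tolerance fuzzy queries are meant to provide. A word-by-word lookup like this cannot match multi-word names; that limitation belongs to the sketch only.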
11. ED Workflow: Entity extraction
Incorporation of semantic metadata extraction
Augment events by extracting generic named
entities
o Grounded to a unique entity URI
o Highly extensible: entity metadata easily queryable
from additional RESTful APIs, if needed
APIs & thesauri by the Semantic Web Company
12. ED Workflow: Entity extraction
Example: famous people thesaurus
Text (https://en.wikipedia.org/wiki/The_Godfather#Cast) → Extractor API → Entities:
o http://bde.poolparty.biz/People/20
o http://bde.poolparty.biz/People/446473
o http://bde.poolparty.biz/People/688722
o ....
Entities → Metadata API → Entity metadata:
o name: Marlon Brando
o uri: http://bde.poolparty.biz/People/688722
o grounding: http://dbpedia.org/resource/Marlon_Brando
o broaders: http://bde.poolparty.biz/People/2
o properties: http://www.w3.org/1999/02/22-rdf-syntax-ns#type
o ...
13. ED Workflow: Detector Scaling
Study on event detection performance scaling
Distributed execution in Apache Spark
Further experiments on two datasets from two different domains
o News articles (Reuters-21578)
o Biomedical scientific publications (BioASQ)
Up to 10K articles in total (~5 million pairs)
Technical report draft available upon request
14. ED Workflow: Detector Scaling
Preliminary results on Reuters-21578
Parallel vs distributed execution time (lower is better)
Substantial speedup for sufficiently large workloads (> 8K articles)
15. ED Workflow: Image extraction
Enrichment of extracted locations with photographs
o Considers a radial area around the centroid of the
geocoordinates of a location geometry
o Queries the Flickr API for user-uploaded public
photographs within that area
o Filters results to a temporal window relevant to
the date of the event in question
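The three steps above can be sketched by building the request parameters for the Flickr flickr.photos.search method, which accepts lat, lon, radius, min_taken_date and max_taken_date. The sketch makes no network call. The vertex-mean centroid, the 5 km radius, the 3-day window and the placeholder API key are all assumptions made for illustration.

```python
from datetime import date, timedelta


def centroid(points):
    """Mean of (lat, lon) vertices; a simplification that ignores polygon area."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return sum(lats) / len(lats), sum(lons) / len(lons)


def flickr_search_params(geometry, event_date, radius_km=5, window_days=3,
                         api_key="YOUR_API_KEY"):
    """Build query parameters for flickr.photos.search (no request is sent)."""
    lat, lon = centroid(geometry)
    return {
        "method": "flickr.photos.search",
        "api_key": api_key,  # placeholder, not a real key
        "lat": round(lat, 6),
        "lon": round(lon, 6),
        "radius": radius_km,  # radial area around the centroid
        "radius_units": "km",
        # Temporal window around the event date (hypothetical width).
        "min_taken_date": (event_date - timedelta(days=window_days)).isoformat(),
        "max_taken_date": (event_date + timedelta(days=window_days)).isoformat(),
        "format": "json",
        "nojsoncallback": 1,
    }


# Hypothetical rectangular location geometry given as (lat, lon) vertices.
geometry = [(37.9, 23.6), (38.1, 23.6), (38.1, 23.8), (37.9, 23.8)]
params = flickr_search_params(geometry, date(2017, 12, 18))
```

The resulting dict could be passed as the query string of a GET request to the Flickr REST endpoint; filtering on the date taken is one possible reading of "temporal window relevant to the date of the event".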
16. ED Workflow: Connectivity
Workflow inter-connections
Automatic triggering of the CD workflow
o Event support calculated during detection
o Triggers if support is greater than a specified threshold
Twitter Crawler source injection
o Targeted consumption of specified posts
Asynchronous non-blocking operations
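The threshold-based triggering with non-blocking calls could be sketched with asyncio as below. The function trigger_cd_workflow is a hypothetical placeholder for whatever request actually starts the CD workflow, and the threshold value and event fields are assumptions.

```python
import asyncio

SUPPORT_THRESHOLD = 3  # hypothetical value


async def trigger_cd_workflow(event, triggered):
    """Placeholder for the call that starts the CD workflow for one event."""
    await asyncio.sleep(0)  # stands in for a non-blocking network request
    triggered.append(event["id"])


async def dispatch(events):
    """Start one trigger task per event whose support exceeds the threshold."""
    triggered = []
    tasks = [
        asyncio.create_task(trigger_cd_workflow(event, triggered))
        for event in events
        if event["support"] > SUPPORT_THRESHOLD
    ]
    if tasks:
        # Tasks run concurrently; the detector is not blocked per trigger.
        await asyncio.gather(*tasks)
    return triggered


detected = [
    {"id": "e1", "support": 5},
    {"id": "e2", "support": 2},
    {"id": "e3", "support": 4},
]
triggered = asyncio.run(dispatch(detected))
```

Only events e1 and e3 exceed the threshold here, so only those two would start the downstream workflow.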