SC7 Hangout 3: Architecture of the BDE Pilot for Secure Societies
1. Architecture of the BDE Pilot for Secure Societies
3rd BDE Hangout “Big Data in Secure Societies”, 5 December 2016
George Papadakis, University of Athens
Postdoctoral Researcher
4. ED Workflow: News Crawler
Runs periodically
Monitored sources:
o Reuters news feeds (RSS)
o Selected Twitter accounts
o Keyword-based search
Possible to cover more sources if needed
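As an illustration of the RSS part of the crawler, the sketch below extracts individual news items from a feed. It is only a minimal example: the sample feed, its URLs and the output field names are made up, and the pilot's crawler may work differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, inlined so the example runs without network access.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Sample feed</title>
<item><title>Flood warning issued</title><link>http://example.org/1</link>
<pubDate>Wed, 07 Dec 2016 09:00:00 GMT</pubDate></item>
<item><title>Border crossing reopened</title><link>http://example.org/2</link>
<pubDate>Wed, 07 Dec 2016 10:30:00 GMT</pubDate></item>
</channel></rss>"""


def parse_rss(xml_text):
    """Return one dict per <item> element of an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        }
        for item in root.iter("item")
    ]


news_items = parse_rss(SAMPLE_FEED)
```

A periodic run would call `parse_rss` on each monitored feed and pass the resulting items on for storage.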
7 Dec 2016, www.big-data-europe.eu
5. ED Workflow: Cassandra
Scalable, noSQL distributed database
Input Scenario I:
o Individual news items from News Crawler
o Conforms with privacy regulations
Input Scenario II:
o Events identified by Event Detector
Input Scenario III:
o Queries about the stored news items and events
6. ED Workflow: Event Detector
Runs periodically, parallel execution based on Spark
Input:
o News items
Output:
o Events
Every event is associated with meta-data: date & location
Algorithm based on
7. Event Detector Algorithm
Two steps:
1. Identify events
o Compare pairs of news items
o If similarity > threshold → related pair
o Form clusters based on related pairs
o If cluster has support > threshold → event
2. Enrich events
o Compare individual items with events
o If similarity > threshold → attached to event
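The two steps above can be sketched as follows. This is a single-machine illustration, not the pilot's Spark implementation. It makes several assumptions: Jaccard similarity over word sets stands in for the similarity measure, which the slides do not name; "support" is taken to mean cluster size; clusters are the connected components of the related-pair graph; and all threshold values are arbitrary.

```python
from itertools import combinations


def tokens(text):
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed stand-in measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect_events(items, pair_threshold=0.5, support_threshold=1,
                  attach_threshold=0.25):
    toks = [tokens(t) for t in items]

    # Step 1: link related pairs with a union-find structure.
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if jaccard(toks[i], toks[j]) > pair_threshold:
            parent[find(i)] = find(j)

    # Form clusters; keep those whose size exceeds the support threshold.
    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(i)
    cores = [m for m in clusters.values() if len(m) > support_threshold]

    # Step 2: attach remaining items that are similar enough to an event.
    in_event = {i for m in cores for i in m}
    leftovers = [i for i in range(len(items)) if i not in in_event]
    events = []
    for members in cores:
        event_tokens = set().union(*(toks[i] for i in members))
        attached = [i for i in leftovers
                    if jaccard(toks[i], event_tokens) > attach_threshold]
        events.append({"core": members, "attached": attached})
    return events


ITEMS = [
    "earthquake hits central italy",
    "strong earthquake hits central italy today",
    "earthquake in italy",
    "football final tonight",
]
events = detect_events(ITEMS)
```

With these sample items, the first two form an event, the third is attached to it in the enrichment step, and the fourth is left out.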
8. ED Workflow: Lookup service
Based on Apache Lucene for fuzzy queries
Based on the GADM dataset
o more than 180,000 location names
Input:
o Query including an extracted location name
Output:
o The corresponding geocoordinates
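The input/output contract of the lookup service can be illustrated with a small fuzzy lookup. The sketch uses Python's `difflib` as a stand-in for Lucene's fuzzy queries. The three-entry gazetteer, its approximate coordinates and the cutoff value are illustrative assumptions, not data from the pilot.

```python
import difflib

# Tiny illustrative gazetteer: name -> approximate (latitude, longitude).
GAZETTEER = {
    "athens": (37.98, 23.73),
    "brussels": (50.85, 4.35),
    "rome": (41.90, 12.50),
}


def lookup(location_name, cutoff=0.8):
    """Return coordinates of the closest gazetteer name, or None if no match."""
    query = location_name.strip().lower()
    matches = difflib.get_close_matches(query, list(GAZETTEER), n=1,
                                        cutoff=cutoff)
    return GAZETTEER[matches[0]] if matches else None
```

For example, `lookup("Athenss")` tolerates the misspelling and returns the coordinates stored for Athens, while `lookup("Tokyo")` returns `None` because no name in this gazetteer is close enough.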
10. CD Workflow: Image Aggregator
REST service called by GUI & Event Detection
Input (manual or automatic):
o Bounding box of the area of interest (WKT)
o The time of interest
o A past time, before an event of interest took place
Output:
o A set of satellite images downloaded from ESA’s SciHub.
o Subset operator
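A request to the Image Aggregator carries the three inputs listed above. The sketch below builds such a request. The field names (`extent`, `event_date`, `reference_date`) are hypothetical, since the slides do not specify the service's actual parameters.

```python
from datetime import datetime


def bbox_to_wkt(min_lon, min_lat, max_lon, max_lat):
    """Encode a bounding box as a closed WKT polygon ring (lon lat order)."""
    ring = [
        (min_lon, min_lat), (max_lon, min_lat),
        (max_lon, max_lat), (min_lon, max_lat),
        (min_lon, min_lat),
    ]
    return "POLYGON((" + ", ".join(f"{x} {y}" for x, y in ring) + "))"


def build_request(bbox, event_time, reference_time):
    """Assemble the request body; field names are hypothetical."""
    if reference_time >= event_time:
        raise ValueError("reference time must precede the time of interest")
    return {
        "extent": bbox_to_wkt(*bbox),
        "event_date": event_time.isoformat(),
        "reference_date": reference_time.isoformat(),
    }


request = build_request((23, 37, 24, 38),
                        datetime(2016, 12, 5), datetime(2016, 11, 5))
```

The check on the two dates reflects the requirement that the past time lies before the event of interest.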
11. Automatic call of the CD workflow
Best-effort service
Based on a queue
o Maximum capacity: 1,000 events
o Maximum waiting time: 1 week
Input:
o Event meta-data
Output:
o Areas with detected changes & corresponding satellite images
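The queue limits above can be modelled with a small bounded queue. The slides do not say what happens when the queue is full or when an event waits too long. This sketch assumes that new events are rejected when the queue is at capacity and that expired events are silently dropped. The class and method names are illustrative.

```python
from collections import deque
from datetime import datetime, timedelta


class BestEffortQueue:
    """FIFO queue with a size limit and a maximum waiting time per entry."""

    def __init__(self, capacity=1000, max_wait=timedelta(weeks=1)):
        self.capacity = capacity
        self.max_wait = max_wait
        self._items = deque()  # (enqueue_time, event) pairs, oldest first

    def _expire(self, now):
        # Drop events that have waited longer than max_wait.
        while self._items and now - self._items[0][0] > self.max_wait:
            self._items.popleft()

    def offer(self, event, now):
        """Enqueue the event; return False if the queue is full (assumption)."""
        self._expire(now)
        if len(self._items) >= self.capacity:
            return False
        self._items.append((now, event))
        return True

    def poll(self, now):
        """Return the oldest non-expired event, or None if there is none."""
        self._expire(now)
        return self._items.popleft()[1] if self._items else None
```

Passing `now` explicitly keeps the example deterministic; a service would typically use the current time instead.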
12. CD Workflow: HDFS
Input:
o Two satellite images in zip format, each occupying a few GB.
Output:
o Distribute parts of every image to the available cluster nodes to facilitate
their efficient processing.
13. CD Workflow: Change Detector
Parallelizes the change detection algorithm using Spark.
Input:
o Two satellite images depicting the same geolocation.
Output:
o A set of the areas with differences between the two snapshots.
14. Change Detector Algorithm
Three steps:
1. Preprocessing to align the given images
Coregistration (4 successive operators) or
Terrain Correction (1 operator)
2. Main algorithm to perform the actual comparison
3. DBSCAN for clustering together pixels with changes
Two parallelization strategies:
1. Tile-centric approach (subset operator)
2. Image-centric approach (baseline approach)
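A toy version of the tile-centric strategy, followed by the clustering step, might look like this. It assumes the two images are already aligned (step 1 is skipped) and uses a plain per-pixel difference threshold as a placeholder for the main comparison algorithm. Python's built-in `map` stands in for the Spark parallelization, and the DBSCAN parameters are arbitrary.

```python
def split_into_tiles(height, width, tile):
    """Return (row0, col0, row1, col1) bounds covering the whole image."""
    return [(r, c, min(r + tile, height), min(c + tile, width))
            for r in range(0, height, tile)
            for c in range(0, width, tile)]


def changed_pixels(before, after, bounds, threshold):
    """Placeholder comparison: pixels whose value differs by > threshold."""
    r0, c0, r1, c1 = bounds
    return [(r, c) for r in range(r0, r1) for c in range(c0, c1)
            if abs(after[r][c] - before[r][c]) > threshold]


def detect_changes(before, after, tile=2, threshold=0.5):
    tiles = split_into_tiles(len(before), len(before[0]), tile)
    # Each tile is processed independently, so this map could be distributed.
    per_tile = map(lambda b: changed_pixels(before, after, b, threshold), tiles)
    return sorted(p for pixels in per_tile for p in pixels)


def dbscan(points, eps=1.5, min_pts=2):
    """Minimal DBSCAN: returns {point: cluster id}, with -1 marking noise."""
    def neighbours(p):
        return [q for q in points
                if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= eps ** 2]

    labels, cluster = {}, 0
    for p in points:
        if p in labels:
            continue
        nb = neighbours(p)
        if len(nb) < min_pts:
            labels[p] = -1
            continue
        labels[p] = cluster
        queue = [q for q in nb if q != p]
        while queue:
            q = queue.pop()
            if labels.get(q) == -1:
                labels[q] = cluster  # former noise becomes a border point
            if q in labels:
                continue
            labels[q] = cluster
            qn = neighbours(q)
            if len(qn) >= min_pts:
                queue.extend(qn)  # q is a core point: keep expanding
        cluster += 1
    return labels


before = [[0.0] * 4 for _ in range(4)]
after = [row[:] for row in before]
after[0][0] = after[0][1] = after[3][3] = 1.0
changes = detect_changes(before, after)
areas = dbscan(changes)
```

In this example the two adjacent changed pixels form one area of change, and the isolated pixel is labelled as noise.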
16. Common workflow: GeoTriples
Converts geospatial data into RDF.
Input Scenario I:
o Areas of change from Change Detector
Input Scenario II:
o Event summaries from Event Detector
Output:
o RDF statements
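To show what the RDF output could look like for an area of change, the sketch below writes two N-Triples statements by hand using the GeoSPARQL vocabulary. It does not use GeoTriples itself, which works from mapping definitions. The `example.org` base URI and the resource naming scheme are hypothetical.

```python
GEO = "http://www.opengis.net/ont/geosparql#"
BASE = "http://example.org/bde/sc7/"  # hypothetical namespace


def area_to_ntriples(area_id, wkt):
    """Describe one area of change as GeoSPARQL statements in N-Triples."""
    area = f"<{BASE}area/{area_id}>"
    geom = f"<{BASE}area/{area_id}/geometry>"
    return [
        f"{area} <{GEO}hasGeometry> {geom} .",
        f'{geom} <{GEO}asWKT> "{wkt}"^^<{GEO}wktLiteral> .',
    ]


triples = area_to_ntriples(
    42, "POLYGON((23 37, 24 37, 24 38, 23 38, 23 37))")
```

Each area yields one statement linking it to a geometry resource and one statement carrying the geometry as a WKT literal.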
17. Common workflow: Strabon
Scalable & efficient spatiotemporal RDF store.
Input Scenario I:
o Data coming from GeoTriples
Input Scenario II:
o SPARQL queries such as:
Get N latest event summaries from location X.
Get event summaries with keyword Y.
Output:
o Answers to the received queries.
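A query of the first kind ("N latest event summaries from location X") could be generated as sketched below. The `ex:` properties are hypothetical placeholders for the pilot's actual event vocabulary, which the slides do not show. The `geo:` and `geof:` terms come from the GeoSPARQL standard, and location X is assumed to be given as a WKT polygon.

```python
def latest_events_query(area_wkt, n):
    """Build a SPARQL query for the n latest events inside a WKT polygon."""
    return f"""PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX ex: <http://example.org/bde/sc7/>
SELECT ?event ?summary ?date WHERE {{
  ?event ex:summary ?summary ;
         ex:date ?date ;
         geo:hasGeometry ?geom .
  ?geom geo:asWKT ?wkt .
  FILTER(geof:sfWithin(?wkt, "{area_wkt}"^^geo:wktLiteral))
}}
ORDER BY DESC(?date)
LIMIT {int(n)}"""


query = latest_events_query(
    "POLYGON((23 37, 24 37, 24 38, 23 38, 23 37))", 5)
```

Sorting by date in descending order and limiting the result to `n` rows gives the "latest N" behaviour. A production version would also need to validate the WKT string before embedding it in the query.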
18. Common Workflow: SemaGrow
Federates Cassandra and Strabon.
Input:
o Queries from GUI about events or locations with changes.
Output:
o Answers to the received queries.
19. Common Workflow: Sextant - A
Web application implementing the GUI.
Input for Change Detection:
o Area selected by user through the interactive map
o Time interval (optional)
o User info
Output:
o Calls Image Aggregator
o Progress messages
20. Common Workflow: Sextant - B
Input for Event Detection (at least one of the following):
o Keyword
o Location name or coordinates
o Time
Output:
o Latest relevant event summaries & corresponding news items.
21. Common Workflow: Sextant - C
Cybersecurity
o User registration
Pilot credentials (encrypted)
SciHub credentials (encrypted)
Type of user (classified, unclassified)
Requires administration approval
o Authorization
22. Common Workflow: Sextant - D
Twitter keyword search
o Retrieves tweets on the fly
o Input:
Hashtag (e.g., #bdeSC7)
Mention (e.g., @bigDataEurope)
Keyword(s)
o Output:
Latest posts from Twitter Public Stream