This document discusses dockerizing a multi-component open data application called LinkedEconomy, which obtains data from sources such as government procurement records, fuel prices and economic statistics, transforms it, and publishes it as linked open data. It proposes dockerizing each component (Drupal, the Virtuoso triplestore, QGIS, etc.) for scalability and portability, provides example Docker Compose files, and discusses next steps such as running the applications on a Swarm for auto-scaling and using Consul for service discovery.
1. Dockerizing a multi-component Open Data app
Athens Docker Meetup, June 2016
Dimitris Negkas, Stergios Tsiafoulis
dimneg@gmail.com, s.tsiafoulis@gmail.com
2. Description and Scope
LinkedEconomy (http://linkedeconomy.org/)
is a publicly available web platform and linked data repository.
Its scope is to transform, curate, aggregate, interlink and publish economic data in machine-readable format, to enable:
citizen awareness
research with unprecedented data
evidence-based policy
3. Data Sources
Sources Currently used:
Transparency – DIAVGEIA
Central Electronic Registry of Public Procurement - E-
Procurement
National Strategic Reference Framework (NSRF)
Central Market of Thessaloniki (CMT)
e-Prices
Fuel Prices
Municipality of Athens, Municipality of Thessaloniki
Government of Australia
4. Data growth
We use OpenLink Virtuoso for 15 different sources totalling nearly 1B triples.
We host 27 datasets in CKAN from 15 organizations.
The data grows accordingly each month.
5. Data processing
Each data source is handled and processed separately, as its available data are not uniformly provided, nor in a machine-readable format.
Diavgeia, NSRF and the observatories for product and fuel prices provide rich API interfaces that can easily be queried for machine-readable data in JSON format.
For E-Procurement, CMT and the municipalities of Athens and Thessaloniki there is no API available. We have therefore developed a software module which gathers the online information in an automated way, storing it in a machine-readable format.
6. General Architecture
Process model
Open economic data related to public budgeting, spending and prices are characterized by high volume, velocity, variety and veracity.
We have to build custom components under the
common logic of transforming static data to
linked open data streams.
7. Process model: Nucleus
The nucleus of our
approach is semantic
modelling, data
enrichment and
interconnections.
Data are stored raw (as harvested from the sources), as well as in RDF and JSON formats.
8. Process model : Data distribution
Enriched data are distributed through five channels:
1. Data dumps (CKAN),
2. SPARQL queries,
3. Web,
4. Social media
5. Structured inputs to
Business Intelligence (BI)
systems.
Additionally, data can be
further analysed and
exchanged with relevant
platforms (e.g. SPARQL to
R).
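Channel 2 (SPARQL) can be exercised with any plain HTTP client; a minimal sketch, assuming the public endpoint lives at the conventional Virtuoso path /sparql (the exact URL is an assumption, not stated in the deck):

```shell
# Ask the (assumed) public SPARQL endpoint for a handful of triples,
# requesting the standard JSON results serialization.
curl -G 'http://linkedeconomy.org/sparql' \
  --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 5' \
  -H 'Accept: application/sparql-results+json'
```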
9. Process model : Validation and
messenger
The validation
component runs
throughout the whole
process in order to
safeguard high data
quality by detecting
errors.
The messaging
component works as an
internal messaging and
alert system for all
components.
19. Docker MySQL
version: '2'
services:
  mysql:
    build: ./mysql-docker/5.6
    container_name: eLodDrupalmySQL
    volumes:
      - /mysql_drupal:/var/lib/mysql
    environment:
      - MYSQL_DATABASE=drupalelod
      - MYSQL_ROOT_PASSWORD=eLodmysqlpass
    restart: on-failure
Save your data !!
“build” will build the image from your directory.
Do not use the “always” restart flag in your development environment!
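The bind mount above ties the database files to a fixed host path. A Compose-managed named volume is an alternative worth sketching (the volume name elodmysql is illustrative, not part of the repo):

```yaml
version: '2'
services:
  mysql:
    build: ./mysql-docker/5.6
    volumes:
      # Named volume instead of a host path; Compose creates and owns it
      - elodmysql:/var/lib/mysql
volumes:
  elodmysql: {}
```

Named volumes survive docker-compose down and are easier to move between hosts with docker volume commands, at the cost of the data no longer sitting at a predictable host path.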
20. Docker Drupal
drupal:
  build: ./docker-drupal
  command:
    - /start.sh
  depends_on:
    - mysql
  container_name: eLodDrupal
  #image: eLodDrupal
  ports:
    - "8081:80"
  volumes:
    - "/data_drupal:/var/www/html"
  links:
    - "mysql"
  environment:
    - MYSQL_DATABASE=drupalelod
    - MYSQL_USER=root
    - MYSQL_PASSWORD=eLodmysqlpass
    - DRUPAL_ADMIN_PW=eLODDR
    - DRUPAL_ADMIN=admin
    - MYSQL_HOST=eLodDrupalmySQL
    - DRUPAL_ADMIN_EMAIL=stetsiafoulis@gmail.com
  restart: on-failure
Will start the service only after the MySQL service.
Will link the container with the MySQL container.
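Note that depends_on only orders container start-up; it does not wait for MySQL to actually accept connections. A hedged sketch of a retry loop that a start script could run first (wait_for and the mysqladmin probe are illustrative, not part of the repo):

```shell
#!/bin/sh
# Generic retry helper: run a probe command until it succeeds or we give up.
# $1 = probe command (a shell snippet), $2 = maximum number of attempts.
wait_for() {
  i=0
  while [ "$i" -lt "$2" ]; do
    if sh -c "$1" >/dev/null 2>&1; then
      return 0            # probe succeeded
    fi
    i=$((i + 1))
    sleep 1               # back off before the next attempt
  done
  return 1                # probe never succeeded
}

# Inside the Drupal container one would probe MySQL before starting, e.g.:
# wait_for "mysqladmin ping -h \"$MYSQL_HOST\" --silent" 30 && exec /start.sh
```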
22. Docker QGIS
qgisdesktop:
  #image: kartoza/qgis-desktop:2.14
  build: ./qgis-desktop/2.14
  hostname: qgis-server
  volumes:
    # Wherever you want to mount your data from
    - ./gis:/gis
    # Unix socket for X11
    - "/tmp/.X11-unix:/tmp/.X11-unix"
  links:
    - db:db
  environment:
    - DISPLAY=unix:1
  command: /usr/bin/qgis
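Because the container draws through the host's X11 socket mounted above, the host must allow the connection; a sketch, assuming an X session is running on the host:

```shell
# Allow local (non-network) clients to talk to the X server,
# then bring up the QGIS desktop container.
xhost +local:
docker-compose up -d qgisdesktop
```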
23. Build the system
Clone the repository from GitHub:
https://github.com/stetsiafoulis/eLOD
Create the directories where you are going to mount your data.
Enter docker-compose up -d and that’s it !!
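The steps above, spelled out as commands (the host directories match the bind mounts in the Compose files; this is a sketch requiring Docker, Compose and network access, not something to run blindly):

```shell
# 1. Clone the repository
git clone https://github.com/stetsiafoulis/eLOD.git
cd eLOD

# 2. Create the host directories that the Compose files bind-mount
sudo mkdir -p /mysql_drupal /data_drupal
mkdir -p ./gis

# 3. Build the images and start the whole stack in the background
docker-compose up -d

# Check that the containers are up
docker-compose ps
```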
24. Why Docker ?
o Portable
o Lightweight
o Move to different cloud infrastructures
and to Physical servers
o Run on Virtual Machines for
development and testing
o Easily Scale
o Easy Delivery and deployment
o Run Anywhere (regardless of host distro;
physical, cloud or not)
o Run Anything
26. Scaling per Source
(Diagram: one stack per data source, e.g. Di@vgeia and KHMDHS, each comprising Virtuoso, Drupal, MySQL, CouchDB, QGIS Server, QGIS Desktop and small applications.)
28. Next Steps - Swarm
(Diagram: the Virtuoso, Drupal, MySQL, CouchDB and QGIS Server containers deployed on a Swarm.)
Cluster management
Scaling
State reconciliation
Multi-host networking
Service discovery
Load balancing
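With the swarm mode that shipped in Docker 1.12, the features above map onto a few Engine CLI commands; a sketch (service name, image and replica counts are illustrative):

```shell
# Turn this Engine into a swarm manager
docker swarm init

# Run Drupal as a replicated service on the swarm
docker service create --name drupal --replicas 2 -p 8081:80 drupal

# Scale it up; the manager reconciles actual vs. desired state
docker service scale drupal=4

# Roll out an image update incrementally, with a delay between tasks
docker service update --update-delay 10s --image drupal:8 drupal
```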
29. Next Steps - Consul
Service Discovery
Health Checking
Multi Datacenter support
31. Appendix - Data Sources links
LinkedEconomy (http://linkedeconomy.org/).
linkedeconomy@gmail.com
Sources Currently used:
Transparency - DIAVGEIA: https://diavgeia.gov.gr
Central Electronic Registry of Public Procurement - E-Procurement (KHDMHS):
http://www.eprocurement.gov.gr
National Strategic Reference Framework (NSRF): https://www.espa.gr/en
Central Market of Thessaloniki (CMT): http://www.kath.gr/
e-Prices: http://www.e-prices.gr/
Fuel Prices: http://www.fuelprices.gr/
Municipality of Athens: https://www.cityofathens.gr/khe/proypologismos
Municipality of Thessaloniki:
http://www.thessaloniki.gr/portal/page/portal/DioikitikesYpiresies/GenDnsiDioikOikonYpiresion/DnsiDiafanEksipirDimoton/TmimaDiafaneias/AnoiktiDdiathesiDedomenon/DimosiefsiEktelesisProipologismou/ektelesi-proypologismou
Government of Australia: http://data.gov.au/
Editor's Notes
Open economic data related to public budgeting, spending and prices are characterized by high volume, velocity, variety and veracity.
10 virtual machines with memory and storage capacities that span from 2GB to 8GB RAM and 20GB to 100GB respectively, as well as a non-commodity (physical) server of 12 CPUs, 64GB RAM and a storage capacity of more than 4TB.
This map shows which municipalities are the most expensive for a specific product, e.g. milk, fruit or petrol.
The colour scale gives a perception of the product’s price in a municipality: the redder, the more expensive.
We also use QGIS to display on the map geo-information about supermarkets and other POIs.
The system consists of : CKAN data portal, Drupal, Virtuoso, MySQLs, QGIS server, CouchDB and many scripts of different technologies and scope.
We use such a system of apps in order to process information from different data sources.
As we mentioned before, the system is established on a cloud-based infrastructure (~okeanos).
In some cases there is a need to move the system, or back it up, to different cloud or physical infrastructures.
This is where Docker came in and helped us achieve that, almost effortlessly.
We started to dockerize the services one by one, until we decided to use the new Compose 2.
Compose creates the entire system with a single command:
docker-compose up -d
And not only that: it also creates an internal network and attaches the containers to it automatically.
Restart policies:
no
Do not automatically restart the container when it exits. This is the default.
on-failure[:max-retries]
Restart only if the container exits with a non-zero exit status. Optionally, limit the number of restart retries the Docker daemon attempts.
always
Always restart the container regardless of the exit status. When you specify always, the Docker daemon will try to restart the container indefinitely. The container will also always start on daemon startup, regardless of the current state of the container.
unless-stopped
Always restart the container regardless of the exit status, but do not start it on daemon startup if the container has been put to a stopped state before.
An ever increasing delay (double the previous delay, starting at 100 milliseconds) is added before each restart to prevent flooding the server. This means the daemon will wait for 100 ms, then 200 ms, 400, 800, 1600, and so on until either the on-failure limit is hit, or when you docker stop or docker rm -f the container.
If a container is successfully restarted (the container is started and runs for at least 10 seconds), the delay is reset to its default value of 100 ms.
You can specify the maximum amount of times Docker will try to restart the container when using the on-failure policy. The default is that Docker will try forever to restart the container. The number of (attempted) restarts for a container can be obtained via docker inspect. For example, to get the number of restarts for container “my-container”;
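The inspect invocation referred to above is, to the best of our knowledge, the one from the Docker documentation:

```shell
# Number of restart attempts for "my-container" (requires a running daemon)
docker inspect -f "{{ .RestartCount }}" my-container
```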
Cluster management integrated with Docker Engine: Use the Docker Engine CLI to create a Swarm of Docker Engines where you can deploy application services. You don’t need additional orchestration software to create or manage a Swarm.
Decentralized design: Instead of handling differentiation between node roles at deployment time, the Docker Engine handles any specialization at runtime. You can deploy both kinds of nodes, managers and workers, using the Docker Engine. This means you can build an entire Swarm from a single disk image.
Declarative service model: Docker Engine uses a declarative approach to let you define the desired state of the various services in your application stack. For example, you might describe an application comprised of a web front end service with message queueing services and a database backend.
Scaling: For each service, you can declare the number of tasks you want to run. When you scale up or down, the swarm manager automatically adapts by adding or removing tasks to maintain the desired state.
Desired state reconciliation: The swarm manager node constantly monitors the cluster state and reconciles any differences between the actual state and your expressed desired state. For example, if you set up a service to run 10 replicas of a container, and a worker machine hosting two of those replicas crashes, the manager will create two new replicas to replace the ones that crashed. The swarm manager assigns the new replicas to workers that are running and available.
Multi-host networking: You can specify an overlay network for your services. The swarm manager automatically assigns addresses to the containers on the overlay network when it initializes or updates the application.
Service discovery: Swarm manager nodes assign each service in the swarm a unique DNS name and load balances running containers. You can query every container running in the swarm through a DNS server embedded in the swarm.
Load balancing: You can expose the ports for services to an external load balancer. Internally, the swarm lets you specify how to distribute service containers between nodes.
Secure by default: Each node in the swarm enforces TLS mutual authentication and encryption to secure communications between itself and all other nodes. You have the option to use self-signed root certificates or certificates from a custom root CA.
Rolling updates: At rollout time you can apply service updates to nodes incrementally. The swarm manager lets you control the delay between service deployment to different sets of nodes. If anything goes wrong, you can roll-back a task to a previous version of the service.
What is Consul?
Consul has multiple components, but as a whole, it is a tool for discovering and configuring services in your infrastructure.
It provides several key features:
Service Discovery: Clients of Consul can provide a service, such as api or mysql, and other clients can use Consul to discover providers of a given service. Using either DNS or HTTP, applications can easily find the services they depend upon.
Health Checking: Consul clients can provide any number of health checks, either associated with a given service ("is the webserver returning 200 OK"), or with the local node ("is memory utilization below 90%"). This information can be used by an operator to monitor cluster health, and it is used by the service discovery components to route traffic away from unhealthy hosts.
Key/Value Store: Applications can make use of Consul's hierarchical key/value store for any number of purposes, including dynamic configuration, feature flagging, coordination, leader election, and more. The simple HTTP API makes it easy to use.
Multi Datacenter: Consul supports multiple datacenters out of the box. This means users of Consul do not have to worry about building additional layers of abstraction to grow to multiple regions.
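The two interfaces can be tried against a throwaway local agent; a hedged sketch (the service name mysql is illustrative, ports 8500/8600 are Consul's defaults):

```shell
# Start a single-node development agent in the background
consul agent -dev &

# Service discovery over DNS: Consul's DNS interface listens on 8600
dig @127.0.0.1 -p 8600 mysql.service.consul SRV

# Key/value store over the HTTP API: write a key, then read it back
curl -X PUT --data 'bar' http://localhost:8500/v1/kv/foo
curl http://localhost:8500/v1/kv/foo?raw
```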