A description of how visualfabriq's bquery-based reporting and (advanced) analytics environment works. It consists entirely of Docker-based microservices, and each component is horizontally scalable.
We are aiming to open source this by Q4 2015.
2. Architecture Overview
[Diagram: 1 Web Server / API; 2 Routing & Queuing; 3 Metadata; 4 Dynamic Query Engine; 5 Processing & Analytics; A Web Client; B File Backend]
• The architecture consists of five basic components, an HTML5 client and a file backend
• Each instance of a component auto-registers in the metadata master
• Every component defined here
  • Is horizontally scalable
  • Has load balancing
  • Has failover capabilities
• All external communication goes through the fully RESTful API, where each request is checked against a role-based security system
• Besides the RESTful interface, it can also deliver and retrieve results and data through indirect methods (mail, SFTP)
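The auto-registration described above can be pictured with a small, library-free sketch. It is only an illustration under stated assumptions: the `MetadataMaster` class, the heartbeat idea, the 30-second time-to-live and the instance names are all hypothetical and are not taken from the actual system.

```python
import time


class MetadataMaster:
    """Hypothetical in-memory stand-in for the metadata master's registry."""

    def __init__(self, ttl_seconds=30.0):
        self.ttl_seconds = ttl_seconds
        self._instances = {}  # instance_id -> (component, last_seen)

    def register(self, instance_id, component, now=None):
        """Called by an instance at start-up and again as a heartbeat."""
        seen = time.time() if now is None else now
        self._instances[instance_id] = (component, seen)

    def live_instances(self, component, now=None):
        """Return instances of a component whose heartbeat is still fresh."""
        now = time.time() if now is None else now
        return sorted(
            iid for iid, (comp, seen) in self._instances.items()
            if comp == component and now - seen <= self.ttl_seconds
        )


master = MetadataMaster(ttl_seconds=30.0)
master.register("dqe-1", "dynamic_query_engine", now=0.0)
master.register("dqe-2", "dynamic_query_engine", now=20.0)
# At t=40 the heartbeat of dqe-1 is older than the TTL, so only dqe-2 is live.
print(master.live_instances("dynamic_query_engine", now=40.0))
```

Dropping instances whose heartbeat has expired is one simple way to get the failover behaviour listed above; the real mechanism may differ.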
3. 1) Web Server
• The web server receives all requests and checks them against the security model and metadata, then places the resulting actions on the queuing system
• The setup of the security model, the metadata (including the data descriptions it holds) and the entire API (calls and actions) are proprietary code
• Dependencies:
  • Nginx, for the scalable HTTP server
  • uWSGI, for running Python code behind Nginx
  • Flask, a web framework for handling sockets and sessions
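The request flow above (check against the security model, then queue the action) might look roughly like the sketch below. It deliberately uses only the Python standard library instead of Flask and Celery; the role table, the action names and `handle_request` are hypothetical, since the real security model and API are proprietary.

```python
import queue

# Hypothetical role -> permitted actions table (an assumption for illustration).
ROLE_PERMISSIONS = {
    "analyst": {"run_report"},
    "admin": {"run_report", "import_file", "manage_users"},
}

# Stand-in for the queuing system that sits behind the web server.
action_queue = queue.Queue()


def handle_request(user_role, action, payload):
    """Check the request against the role table, then queue the action."""
    if action not in ROLE_PERMISSIONS.get(user_role, set()):
        return 403, "forbidden"
    action_queue.put({"action": action, "payload": payload})
    return 202, "queued"
```

For example, `handle_request("analyst", "import_file", {})` is rejected with status 403, while the same call with the `"admin"` role returns 202 and leaves one item on `action_queue`.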
4. 2) Routing & Queuing
• The queue server receives all action requests from the API, finds where it can execute them and load-balances the requests over these resources
• We have created the queues and the auto-registering setup to provide the generic framework functionality and to ensure load balancing and failover capabilities
• Dependencies:
  • Celery, the Python task-queue library
  • RabbitMQ, the message broker
  • Redis, for exchanging results between the processes
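As a rough picture of load balancing with failover, the sketch below rotates over registered workers and skips any that are marked as down. It is a plain-Python illustration, not how Celery or RabbitMQ distribute work internally; `RoundRobinRouter` and the worker names are hypothetical.

```python
import itertools


class RoundRobinRouter:
    """Spread actions over registered workers, skipping ones marked down."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.down = set()
        self._cycle = itertools.cycle(self.workers)

    def mark_down(self, worker):
        """Record that a worker failed so it is no longer selected."""
        self.down.add(worker)

    def route(self):
        """Return the next healthy worker, or raise if none is left."""
        # One full pass over the worker list is enough to find a healthy one.
        for _ in range(len(self.workers)):
            worker = next(self._cycle)
            if worker not in self.down:
                return worker
        raise RuntimeError("no healthy workers available")
```

With workers `w1`, `w2` and `w3`, successive `route()` calls return them in turn; after `mark_down("w2")` the rotation continues over `w1` and `w3` only.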
5. 3) Metadata
• The metadata server contains all general data on users, databases and security, as well as the metadata on the data available to users (measures, dimensions, tables and how these all relate to each other)
• Dependencies:
  • MongoDB, for storing the metadata
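To make the kind of metadata listed above more concrete, here is one possible shape for a table description, written as a plain Python dict of the sort that could be stored as a MongoDB document. Every field name, the example table and the `visible_fields` helper are assumptions for illustration; the actual schema is not described in this deck.

```python
# Hypothetical metadata document for one table; all field names are assumptions.
SALES_TABLE = {
    "table": "sales",
    "dimensions": ["store_id", "product_id", "date"],
    "measures": ["units", "revenue"],
    "relations": [{"to": "stores", "on": "store_id"}],
    "allowed_roles": ["analyst", "admin"],
}


def visible_fields(table_doc, role):
    """Return the measures and dimensions a role may query, or None."""
    if role not in table_doc["allowed_roles"]:
        return None
    return {
        "measures": table_doc["measures"],
        "dimensions": table_doc["dimensions"],
    }
```

Keeping relations next to the measures and dimensions is what would let other components look up how tables relate without inspecting the data files themselves.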
6. 4) Dynamic Query Engine
• The dynamic query engine server holds a number of data files (which it automatically downloads and synchronizes from the backend) and can analyze and aggregate them
• It can also auto-join tables on commonalities, perform a wide range of calculations and run several distributed analytics operations at row level
• Dependencies:
  • Bcolz, for storing the data files in a compressed, columnar format
  • Pandas, for higher-level operations on the result data set (joins, sorts, etc.)
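The idea of auto-joining tables on commonalities and then aggregating can be sketched in plain Python as follows. This is an illustration only: it uses lists of row dicts for readability, whereas the engine itself works on bcolz columnar files and pandas, and the function names and sample data are hypothetical.

```python
from collections import defaultdict


def common_columns(left, right):
    """Column names the two tables share (candidate join keys)."""
    return sorted(set(left[0]) & set(right[0])) if left and right else []


def auto_join(left, right):
    """Inner-join two lists of row dicts on every column they have in common."""
    keys = common_columns(left, right)
    if not keys:
        raise ValueError("tables share no columns to join on")
    index = defaultdict(list)
    for row in right:
        index[tuple(row[k] for k in keys)].append(row)
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in index.get(tuple(l_row[k] for k in keys), [])
    ]


def aggregate_sum(rows, group_by, measure):
    """Sum a measure per value of the group-by column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_by]] += row[measure]
    return dict(totals)


sales = [
    {"store_id": 1, "units": 3},
    {"store_id": 2, "units": 5},
    {"store_id": 1, "units": 4},
]
stores = [{"store_id": 1, "region": "north"}, {"store_id": 2, "region": "south"}]
joined = auto_join(sales, stores)
print(aggregate_sum(joined, "region", "units"))  # units summed per region
```

Here the only shared column is `store_id`, so it becomes the join key, and the units are then summed per region.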
7. 5) Processing & Analytics
• The processing & analytics server handles (asynchronous) calls for file loading, exporting and analytics
• This includes the creation and execution of machine learning and statistical models
• It also handles the conversion of raw data files into the binary files and the updating of the relevant metadata
• Dependencies:
  • Scikit-learn, for machine learning
  • Statsmodels, for statistical models
  • Pandas, for data manipulation
  • Bcolz, for converting the data files into a compressed, columnar format
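The conversion of a raw file into a compressed, columnar form can be illustrated with the standard library alone. Note the substitution: this sketch uses `zlib` and JSON as a stand-in for bcolz and Blosc, purely to show the column-by-column idea; the function names and sample data are hypothetical.

```python
import csv
import io
import json
import zlib


def csv_to_columns(raw_text):
    """Parse CSV text into a dict mapping column name -> list of values."""
    reader = csv.DictReader(io.StringIO(raw_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def compress_columns(columns):
    """Compress each column separately so it can be read on its own."""
    return {
        name: zlib.compress(json.dumps(values).encode("utf-8"))
        for name, values in columns.items()
    }


def read_column(compressed, name):
    """Decompress a single column without touching the others."""
    return json.loads(zlib.decompress(compressed[name]).decode("utf-8"))


raw = "store_id,units\n1,3\n2,5\n1,4\n"
stored = compress_columns(csv_to_columns(raw))
print(read_column(stored, "units"))  # values stay strings; no type inference here
```

Storing each column on its own is what lets a query read only the columns it needs, which is the main appeal of a columnar layout for aggregation workloads.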
8. A) Web Client
• The web client is a full, web-based HTML5 client that gives access to all
  • Reporting
  • Analytics
  • File import
  • User and security management
  • Server management
• The files are served by the web server as static content, and all calls go through the standard API
• Dependencies:
  • jQuery, for cross-browser JavaScript simplification and UI
  • Bootstrap, for layout
  • D3.js, a library for visualizations
9. B) File Backend
• The file backend contains all raw files and the processed (compressed, columnar) files
• DQE (dynamic query engine) instances automatically retrieve their assigned files from the backend when a file has been updated
• Dependencies:
  • AWS S3, for storing files
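One way to picture the automatic retrieval of updated files is a comparison between the versions a DQE instance holds locally and the versions the backend reports. The sketch below is an assumption about how such a check could work, not a description of the actual mechanism; the version tokens are hypothetical (they could, for example, be S3 ETags) and no S3 calls are made.

```python
def files_to_download(assigned, local_versions, backend_versions):
    """Return assigned files whose backend version differs from the local copy."""
    return sorted(
        name for name in assigned
        if name in backend_versions
        and local_versions.get(name) != backend_versions[name]
    )
```

For example, with `assigned = ["sales", "stores"]`, a local `sales` at `"v1"` and a backend `sales` at `"v2"`, only `sales` is returned; files that are not assigned to the instance are ignored even if they changed.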
10. Architecture Comparison
| Area | Hadoop | Cassandra | Best In Class | visualfabriq | Difference |
|---|---|---|---|---|---|
| Data | Non-structured & structured | Structured, wide-column | Teradata (structured, columnar) | Structured, columnar, compressed | Optimized for numerical data (means: no text analytics etc.) |
| Architecture | Rack-aware, daemon-based cluster | Peer-to-peer cluster | | Horizontally scaling, container-based microservices communicating through RabbitMQ queues | Easier to monitor & scale |
| Setup | Complex | Complex | | Up & running in one minute | Much, much easier to set up and roll out |
| Cluster Maintenance | Node creation and assignment usually through commercial cluster mgmt software | Peer-to-peer network; auto-configures | | Self-registering nodes that can be assigned specific tasks and data in a web interface | |
| ETL | Flume, Sqoop | Bulk Loader | Informatica, Talend | Web based, drag & drop with wizards | Web based, easy to use |
| Language (query) | Map/Reduce; add-ons for SQL (Pig, Hive, Impala, etc.) | CQL | SQL | MOLAP-like; SQL interface to be built | SQL is the standard, but because of the built-in reporting and analytics this is not something users will need |
| Compression | No | No | MongoDB/WiredTiger | Blosc-based | Saves on average 20x in disk space while speeding up reads |
| Performance | Slow, batch based; Spark can add in-memory capability (speeds up 100x) | High, in-memory options | | High; disk-based with compression, delivering within a 2-3x range of in-memory | Out-of-the-box near in-memory performance with file-based scaling; with advances in CPU speed, this might even surpass traditional in-memory performance |
| Interface | RESTful API | RESTful API | RESTful API | RESTful API | |
| Reporting | Only in external tools (that connect to SQL connector) | Only in external tools (that connect to 3rd party connectors) | Tableau (HTML5, interactive, beautiful) | Built-in HTML5, interactive, extensible (D3.js based) | Only solution with out-of-the-box reporting with an easy-to-use, modern web-based interface |
| Analytics | Distributed map/reduce analytics through Mahout | Only as optional, paid-for module | SAS, SPSS | Built-in HTML5, interactive environment that incorporates leading open-source machine learning (scikit-learn), statistics (statsmodels) and proprietary (POS analytics) functionality; NB: the analytics load is not fully distributed yet | Only solution with out-of-the-box analytics with an easy-to-use, modern web-based interface |
| Security | Kerberos-based security | Data object security | | General, role-based security | One point to manage all security from data access to functionality (reporting, accessibility, etc.) |
| Open source | Core is open source; several performance acceleration & mgmt tools are paid | Core is open source; analytics, backup and other options are paid | | Core is open source; large cluster mgmt tools and vertical-specific analytics options are paid | |
| Language (implementation) | Java | Java | | Python (and Cython & C) | |