A holistic approach to analyzing different data models, databases, and database management systems, examining tabular, hierarchical, relational, textual, dimensional, graph, spatial, multimedia, and other types of data and their specifics.
2. Taxonomy !?
Taxonomy is the practice and science of classification of things or concepts, including
the principles that underlie such classification.
3. Terminology
Data are facts and statistics collected for reference or analysis.
A data model is an abstract or conceptual model that organizes elements of data and
standardizes how they relate to one another and to the properties of real-world
entities. It is used to define both a domain model and its metamodel. It differs
from a physical model, which defines how data is stored on storage media.
A database is an organized collection of data, generally stored and accessed
electronically from a computer system.
A DBMS (Database Management System) is a software system that enables users to
define, create, maintain and control access to the database. A DBMS that supports
multiple data models is called a multi-model DBMS.
4. High-level data types
Data types differ from data classes or categories
Something is in a class or category, but of a type
5. Data classifications
Data of all data models can be divided into the following classes:
By temporal value:
Real-time
Stale
By general source:
Machine-generated
Human-generated
By level of abstraction:
Data
Metadata
By structure:
Unstructured
Semi-structured
Structured
By sensitivity:
Public
Internal-only
Confidential
Restricted
6. DBMS classifications
By accessibility:
Online
Local
By distribution:
Centralized
Distributed
Homogeneous
Heterogeneous
By read-write purpose:
OLAP (Online Analytical Processing)
OLTP (Online Transaction Processing)
By indexing:
Indexed
Unindexed
By query language:
SQL (standardized)
NoSQL
By storage medium:
In-memory (RAM)
On disk (HDD, SSD)
By schema existence:
Has schema
Schemaless
8. Tabular
Data presented as a plain-text single table
Considered to be structured
Usually unindexed
Used for data transfer to indexed DBMSes
Relational algebra enabled (SQL !?)
Usual format: CSV/DSV
Implementations: Flat-file, Excel
DBMS: Berkeley DB
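A minimal sketch of the tabular model, assuming a small hypothetical CSV payload (the column names and rows are invented for illustration). It shows that relational-algebra style selection and projection are possible on a single plain-text table, but every query scans all rows because there is no index.

```python
import csv
import io

# Hypothetical CSV payload; in practice this would be a flat file on disk.
CSV_TEXT = "id,name,city\n1,Ana,Zagreb\n2,Ivan,Split\n3,Maja,Zagreb\n"


def load_rows(text):
    """Parse the single table into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def select(rows, column, value):
    """Selection without an index: every row is scanned."""
    return [row for row in rows if row[column] == value]


def project(rows, columns):
    """Projection: keep only the named columns."""
    return [{c: row[c] for c in columns} for row in rows]


rows = load_rows(CSV_TEXT)
zagreb_names = project(select(rows, "city", "Zagreb"), ["name"])
```

All values stay strings after parsing, which is one reason such files are usually loaded into an indexed, typed DBMS before serious querying.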
9. Hierarchical
Organized as a tree-like structure (parent -< child)
Child contains link to parent (usually a unique identifier)
Each child has only one parent
Created by IBM in the 1960s
Considered to be semi-structured data
Suitable for both machine and human generated data
Usually distributed DBMSes
NoSQL (XPath, XQuery, JSON)
Usual format: XML, JSON, YAML, BSON
Implementations: Document-oriented, XML data store
DBMS: MarkLogic, MongoDB
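A minimal sketch of the parent-link idea, assuming hypothetical node identifiers and names. Each child stores the unique identifier of its single parent, which is enough to rebuild a path to the root or to list the children of a node.

```python
# Each child stores the identifier of its one parent (None marks the root).
# Identifiers and names are hypothetical.
NODES = {
    "root": {"parent": None, "name": "catalog"},
    "b1": {"parent": "root", "name": "book"},
    "t1": {"parent": "b1", "name": "title"},
    "a1": {"parent": "b1", "name": "author"},
}


def path_to_root(nodes, node_id):
    """Follow parent links upwards and return an XPath-like path string."""
    path = []
    while node_id is not None:
        path.append(nodes[node_id]["name"])
        node_id = nodes[node_id]["parent"]
    return "/" + "/".join(reversed(path))


def children_of(nodes, parent_id):
    """Return the identifiers of all nodes whose parent is parent_id."""
    return sorted(k for k, v in nodes.items() if v["parent"] == parent_id)
```

For example, `path_to_root(NODES, "t1")` walks title, book, catalog and returns the path in root-first order.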
10. Hierarchical > Document-oriented
Considered to be associative (document identifier)
Differs from the plain associative model by supporting filtering/restriction on document contents
Aggregate data model (DDD)
Direct object mapping
Collections belong to a database
Documents belong to collections
A document contains multiple fields and nested documents
Database, collection and field names are metadata
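A minimal sketch of the document-oriented layout, assuming an in-memory dictionary stands in for the store and that the collection, identifiers and fields are hypothetical. It shows both access paths: lookup by document identifier (the associative part) and filtering by a nested field (the part a plain key-value store lacks).

```python
# database -> collections -> documents -> fields / nested documents
database = {
    "users": {
        "u1": {"name": "Ana", "address": {"city": "Zagreb"}},
        "u2": {"name": "Ivan", "address": {"city": "Split"}},
    }
}


def get_by_id(db, collection, doc_id):
    """Associative access: fetch a document by its identifier."""
    return db[collection].get(doc_id)


def find(db, collection, path, value):
    """Filtering: ids of documents whose dotted field path equals value."""
    matches = []
    for doc_id, doc in db[collection].items():
        current = doc
        for part in path.split("."):
            # Stop descending as soon as the path leaves nested documents.
            current = current.get(part) if isinstance(current, dict) else None
        if current == value:
            matches.append(doc_id)
    return matches
```

The dotted path `"address.city"` is an illustrative convention here, not the query syntax of any particular DBMS.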
11. Relational
Data presented as tuples grouped into relations/tables
A relation consists of a heading and a body
Foreign keys between relations/tables
Each relation has a primary key
Most popular data model
SQL query language usually supported
First described by Codd in 1969
DBMS: Oracle, SQL Server, MySQL
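A minimal sketch of primary and foreign keys, using SQLite from the Python standard library rather than any of the DBMSes listed above. The table names, columns and rows are hypothetical. It shows a join across two relations and the rejection of a row whose foreign key points at no existing tuple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE book ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " author_id INTEGER NOT NULL REFERENCES author(id))"
)
conn.execute("INSERT INTO author VALUES (1, 'Ana')")
conn.execute("INSERT INTO book VALUES (10, 'Data Models', 1)")

# Join the two relations through the foreign key.
joined = conn.execute(
    "SELECT book.title, author.name FROM book"
    " JOIN author ON author.id = book.author_id"
).fetchall()


def insert_orphan_book(connection):
    """Try to insert a book whose author does not exist; report success."""
    try:
        connection.execute("INSERT INTO book VALUES (11, 'Orphan', 99)")
    except sqlite3.IntegrityError:
        return False
    return True
```

The heading of each relation is its column list in `CREATE TABLE`; the body is the set of inserted rows.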
12. Associative
Associative array, dictionary, hash table
Collection of values, objects or records
Values are usually unstructured or raw data
Identifier is a unique key
Search (index) enabled only by key (equality, wildcard)
Keys can represent hierarchy: /folder/subfolder/file
NoSQL
Used for caching (In-memory)
Usually distributed
Implementations: Key-value store
DBMS: Redis, Riak, Memcached
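A minimal sketch of a key-value store, assuming a plain Python dictionary as the backing structure and hypothetical keys. Values are opaque bytes, so the only searches available are by key: exact equality, or a wildcard pattern over the key (here via `fnmatch` from the standard library).

```python
from fnmatch import fnmatchcase

store = {}


def put(key, value):
    """Store an opaque value under a unique key."""
    store[key] = value


def get(key):
    """Equality lookup by key; returns None when the key is absent."""
    return store.get(key)


def keys_matching(pattern):
    """Wildcard lookup over keys only; values are never inspected."""
    return sorted(k for k in store if fnmatchcase(k, pattern))


# Keys can encode a hierarchy even though the store itself is flat.
put("/folder/a.txt", b"raw bytes")
put("/folder/subfolder/b.txt", b"more bytes")
put("/other/c.txt", b"unrelated")
```

Note that `*` in `fnmatch` patterns also matches the `/` separator, so `"/folder/*"` returns keys at any depth below `/folder`.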
13. Textual
Data can be both machine and human generated
Usually indexed with an inverted index
Works like a search engine: full-text search (FTS)
Unstructured data (including multimedia)
NoSQL
Centralized and distributed
Implementation: Search Engine, Content store
DBMS: Solr, Elasticsearch
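A minimal sketch of an inverted index, assuming whitespace tokenization and three hypothetical documents. Real search engines add stemming, ranking and much more; this only shows the core mapping from term to the set of documents that contain it.

```python
from collections import defaultdict

# Hypothetical documents keyed by identifier.
DOCS = {
    1: "graph databases store nodes and edges",
    2: "search engines build an inverted index",
    3: "an inverted index maps terms to documents",
}


def build_index(docs):
    """Map every lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """AND search: ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


index = build_index(DOCS)
```

A query never scans document text; it only intersects the posting sets of its terms.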
14. Dimensional
Data presented with multiple dimensions - cube
(R)OLAP – Business Intelligence
Data warehouse
Fact and dimension tables
Structured and indexed data
Usually centralized and in-memory
MDX queries (Not exactly SQL !?)
DBMS: MS Analysis Services
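A minimal sketch of fact and dimension tables, with hypothetical products, dates and amounts held in plain Python structures. It shows a roll-up of a sales measure along two dimensions (product category and quarter), which is the kind of aggregation a cube precomputes.

```python
from collections import defaultdict

# Dimension tables: descriptive attributes keyed by surrogate id.
DIM_PRODUCT = {
    1: {"name": "laptop", "category": "hardware"},
    2: {"name": "mouse", "category": "hardware"},
    3: {"name": "editor", "category": "software"},
}
DIM_DATE = {
    20240101: {"year": 2024, "quarter": "Q1"},
    20240401: {"year": 2024, "quarter": "Q2"},
}

# Fact table: (product_id, date_id, sales_amount).
FACT_SALES = [
    (1, 20240101, 1000.0),
    (2, 20240101, 50.0),
    (3, 20240401, 200.0),
    (1, 20240401, 1200.0),
]


def roll_up(facts):
    """Sum the sales measure per (category, quarter) cell of the cube."""
    cube = defaultdict(float)
    for product_id, date_id, amount in facts:
        cell = (DIM_PRODUCT[product_id]["category"], DIM_DATE[date_id]["quarter"])
        cube[cell] += amount
    return dict(cube)


cube = roll_up(FACT_SALES)
```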
15. Time-series
Series of data points listed in chronological order
Presenting discrete data points
Append(current_timestamp, value)
High transaction volumes
Statistical queries (aggregation with time dimension)
Structured and indexed
Usually distributed
Mostly machine-generated data
DBMS: InfluxDB, Riak TS, TimescaleDB
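A minimal sketch of the append-and-aggregate pattern, assuming timestamps are integer seconds passed in explicitly (instead of the current time) so the example is repeatable. The sample points are hypothetical. It shows a statistical query that groups points into fixed time windows.

```python
from collections import defaultdict
from statistics import mean

# Points are only ever appended, in chronological order.
series = []


def append(timestamp, value):
    """Append one discrete data point as a (timestamp, value) pair."""
    series.append((timestamp, value))


def mean_per_window(points, window_seconds):
    """Aggregate along the time dimension: mean value per fixed window."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        window_start = timestamp - timestamp % window_seconds
        buckets[window_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


for ts, v in [(0, 10.0), (30, 20.0), (60, 30.0), (90, 50.0)]:
    append(ts, v)
```

With a 60-second window, the first two points fall into the window starting at 0 and the last two into the window starting at 60.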
16. Graph
Graph structures with nodes and edges
Superset of hierarchical
Successor of the early network model
Nodes and edges have fields
NoSQL (graph traversal)
Mostly indexed and centralized
Implementations: Triplestores/RDF store
DBMS: Neo4j
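A minimal sketch of a property graph and a traversal, assuming hypothetical nodes, edge types and an edge list held in memory. Both nodes and edges carry fields, and the query is expressed as a breadth-first walk along edges of one type rather than as a join.

```python
from collections import deque

# Nodes and edges both have fields (properties). All names are hypothetical.
NODES = {
    "ana": {"label": "Person"},
    "ivan": {"label": "Person"},
    "maja": {"label": "Person"},
    "zagreb": {"label": "City"},
}
EDGES = [
    ("ana", "ivan", {"type": "KNOWS"}),
    ("ivan", "maja", {"type": "KNOWS"}),
    ("ana", "zagreb", {"type": "LIVES_IN"}),
]


def neighbours(node, edge_type):
    """Targets of outgoing edges of the given type."""
    return [dst for src, dst, props in EDGES
            if src == node and props["type"] == edge_type]


def reachable(start, edge_type):
    """Breadth-first traversal: nodes reachable from start, in visit order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours(node, edge_type):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order
```

A tree is the special case in which every node has exactly one incoming edge, which is why the graph model can represent hierarchical data.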
17. Spatial data
Data that represents objects defined in a geometric space
Geospatial data - GIS
Vector and raster data
Point, Line, Polygon
Spatial query examples:
Distance
Intersection
Centralized and indexed
DBMS: PostgreSQL + PostGIS
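A minimal sketch of the two spatial query examples, assuming planar coordinates and axis-aligned bounding boxes as a stand-in for polygons. Real GIS systems work with geodetic coordinates and arbitrary geometries; the function names here are hypothetical.

```python
import math


def distance(p, q):
    """Euclidean distance between two points given as (x, y) tuples."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def boxes_intersect(a, b):
    """Intersection test for boxes given as (min_x, min_y, max_x, max_y).

    Two boxes intersect when their extents overlap on both axes;
    boxes that merely touch along an edge count as intersecting.
    """
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]
```

Bounding-box tests like this are also what spatial indexes typically use as a cheap first filter before exact geometry checks.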
18. Multimedia
Sub-classes of multimedia data:
Graphic (vectors) – time-independent
Image (pixels) – time-independent
Audio (sound) – time-dependent
Video (combination) – time-dependent
Time-dependent serving: streaming
Multimedia data is considered unstructured
Multimedia search
Different media formats (BMP, JPEG, GIF, PNG…)
19. Hybrid
Database with multiple models – multi-model DB
Polyglot persistence – maintaining consistency !?
Document + Graph
Relational + Hierarchical
Goes with association/identifier
XML and JSON columns
Object-relational
Relational + Textual - FTS
Associative
Spatial – Geo types
Spatiotabular and spatiotemporal
Column-family
Combining: associative, tabular, hierarchical
Column-oriented
Sparse table
Google Bigtable, Cassandra
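A minimal sketch of the column-family idea as a sparse table, assuming nested dictionaries of the shape row key -> column family -> column -> value. The row keys, families and columns are hypothetical, and this is not the API of Bigtable or Cassandra. It shows how the model combines associative access (row key), hierarchy (family/column) and a table in which rows need not share columns.

```python
# row key -> column family -> column -> value
table = {}


def put(row_key, family, column, value):
    """Write one cell; missing rows and families are created on demand."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value


def get(row_key, family, column, default=None):
    """Read one cell; absent cells cost no storage and return the default."""
    return table.get(row_key, {}).get(family, {}).get(column, default)


def columns_of(row_key, family):
    """List the columns a row actually has in a family (sparse per row)."""
    return sorted(table.get(row_key, {}).get(family, {}))


put("user:1", "profile", "name", "Ana")
put("user:1", "profile", "city", "Zagreb")
put("user:2", "profile", "name", "Ivan")
put("user:2", "stats", "logins", 7)
```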
20. Conclusion
Data model level of abstraction
DBMS choice defines data models too
Structured data
Has schema – metadata
Has better query capabilities – SQL
Semi-structured data
Usually associated with NoSQL
Hierarchical – XML, JSON
Unstructured data
Multimedia and text
Better organization requires more energy
Multi-model vs. polyglot persistence