Relational databases have dominated the IT world for the last 30 years. However, Web 2.0 and the incipient Internet of Things (IoT) are among the sources of a data explosion that, in a growing number of cases, exceeds what modern relational databases can handle. As a result, new technologies had to be developed for these new use cases; we generally group them under the Big Data umbrella. In this two-part presentation, we start by understanding how relational databases have evolved to become the powerhouses they are today. In part 2 we will look at how NoSQL databases are tackling the big data problem to scale beyond what relational databases can provide today.
1. The evolution of database technology (I)
Huibert Aalbers
Senior Certified Executive IT Architect
2. IT Insight podcast
• This podcast belongs to the IT Insight series
• You can subscribe to the podcast through iTunes.
• Additional material such as presentations in PDF format or white
papers mentioned in the podcast can be downloaded from the IT
insight section of my site at http://www.huibert-aalbers.com
• You can send questions or suggestions regarding this podcast to my
personal email, huibert_aalbers@mac.com
3. Hierarchical databases
• In the 1960s, IBM launched the first
computers equipped with a hard disk
drive
• This spurred the development of
technology to store, process and
retrieve data. IMS, in 1968, became
the first commercial database
software, developed by IBM to manage
the very large bill of materials (BOM)
for the Saturn V moon rocket and the
Apollo space vehicle.
• IMS was the first hierarchical database
4. Hierarchical databases
• Hierarchical databases have a serious
limitation: they only support 1-to-n
relationships, which makes data
modeling difficult (see the sketch below)
• A parent can have multiple
children
• A child can only have a single
parent
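A minimal Python sketch (with made-up part numbers) of why the 1-to-n restriction hurts: a component that belongs to several assemblies, such as a common bolt in a bill of materials, has to be stored once under every parent.

  # Hypothetical BOM fragment stored as a strict hierarchy: every node has
  # exactly one parent, so a shared part must be duplicated under each assembly.
  bom = {
      "ROCKET": {
          "STAGE-1": {"ENGINE-A": {}, "BOLT-M10": {}},
          "STAGE-2": {"ENGINE-B": {}, "BOLT-M10": {}},  # same bolt, second copy
      }
  }

  def occurrences(tree, part):
      """Count how many times a part is physically stored in the hierarchy."""
      return sum((name == part) + occurrences(children, part)
                 for name, children in tree.items())

  print(occurrences(bom["ROCKET"], "BOLT-M10"))  # 2 -> one record per parent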
5. Hierarchical databases
• The best-known hierarchical
databases are
• IMS (still popular in large banks)
• Windows registry
• LDAP directories (depending on the
implementation)
• Hierarchical databases can still hold a
significant performance edge over
relational databases for the predictable,
hierarchy-shaped access paths they
were designed for
6. Relational databases
• In 1970, Ted Codd, a British
mathematician who worked at IBM,
published a paper titled “A relational
model of data for large shared data
banks”
• His groundwork generated much interest
in the information management world and
spurred the creation of new companies
such as Oracle (1977) or Informix (1980)
that implemented Codd’s ideas.
Meanwhile, IBM developed DB2, which
first appeared on mainframes (1981) and
later on distributed platforms.
7. Relational databases
• For over thirty years, relational
databases have ruled the database
market, based on their undeniable
strengths
• During that period, users have shaped
the evolution of the technology by
demanding new features and
increased performance
8. Strengths of Relational Databases
• Great technology to store large
volumes of structured data
• The consistency of the data is
guaranteed through the
implementation of the ACID properties
• Atomicity
• Consistency
• Isolation
• Durability
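A minimal sketch of the atomicity property using Python's built-in sqlite3 module (the accounts table and the failing transfer are made up): the whole transaction is either committed or rolled back as a unit.

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
  conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 0)])
  conn.commit()

  try:
      with conn:  # one transaction: commits on success, rolls back on any error
          conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 'A'")
          if conn.execute("SELECT balance FROM accounts WHERE id = 'A'").fetchone()[0] < 0:
              raise ValueError("insufficient funds")  # forces the rollback
          conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 'B'")
  except ValueError:
      pass

  # Neither half of the failed transfer is visible: {'A': 100, 'B': 0}
  print(dict(conn.execute("SELECT id, balance FROM accounts")))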
9. User requirements that have shaped modern
relational databases
• Increased scalability
• Ability to perform complex queries against
large data sets (Data warehousing)
• Support for new programming
languages and types of data
• Requirements inspired by trends in
modern programming languages
• Improved administration features to ease
management of large numbers of
database instances
10. Increased scalability
• Symmetric Multiprocessing (SMP)
• IBM System/360 Model 65 (1967)
• UNIX (starting in the mid 80’s)
• Support for multiple processor cores
• Power 4 (2001)
• Data partitioning
• SQL query optimizer improvements
• Data compression
• Increased use of RAM
• Clustering
11. What are the bottlenecks of relational databases?
• I/O
• SQL joins
• Transactions (Locks), Distributed
Transactions (Two phase commit)
• Concurrency
• Hardware
12. Data partitioning
• Hard disk drives used to be the main
bottleneck preventing quick data access.
That is why a mechanism was needed to
access data on multiple disks in parallel
(see the sketch below).
• A partitioned table has its data spread
over multiple disks, based on:
• Expression
• Range
• Round-Robin
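A minimal sketch of the routing logic behind these schemes (the order rows and range boundaries are made up): each row is sent to one partition, so a large scan can later hit all partitions in parallel.

  # Hypothetical order rows routed across 3 partitions (e.g. 3 separate disks).
  orders = [{"order_id": i, "amount": i * 10} for i in range(1, 10)]

  def by_range(row):
      """Range partitioning: the value of order_id decides the partition."""
      if row["order_id"] <= 3:
          return 0
      return 1 if row["order_id"] <= 6 else 2

  def round_robin(row):
      """Round-robin partitioning: spread rows evenly, ignoring their content."""
      return row["order_id"] % 3

  partitions = {0: [], 1: [], 2: []}
  for row in orders:
      partitions[by_range(row)].append(row)   # swap in round_robin to compare

  print({p: len(rows) for p, rows in partitions.items()})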
13. Data compression
• Data compression allows for significant storage (and
therefore money) savings. In addition, and this may
sound counterintuitive, it also increases
performance, since data is read much faster (with
less I/O), especially when data is stored in columnar
form. Administrators can choose to compress:
• Data
• Indices
• Blobs
• Results are spectacular
• Up to 80% less space needed to store the data
• Up to 20% less I/O
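A minimal sketch of one compression technique that works particularly well on sorted, low-cardinality columns, run-length encoding (the column values are made up); real products combine several such encodings.

  from itertools import groupby

  # A low-cardinality column (e.g. country codes) as it sits in columnar storage.
  column = ["MX"] * 5000 + ["US"] * 3000 + ["CA"] * 2000

  # Run-length encoding: store each value once, together with its repeat count.
  encoded = [(value, len(list(run))) for value, run in groupby(column)]

  print(encoded)                                    # [('MX', 5000), ('US', 3000), ('CA', 2000)]
  print(len(column), "values stored as", len(encoded), "pairs")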
14. In-memory databases
These databases store data in memory (RAM) instead of on hard disk drives, to
scale better and support extremely high transaction volumes
• This technology was originally designed to meet the needs of specific
industries (primarily telcos and financial institutions) that required
processing unusually high volumes of transactions
Recently, the line dividing in-memory databases from traditional databases has
started to blur with the introduction of databases such as DB2 BLU, which
automatically try to make the most of RAM to improve performance without
requiring all the data to be loaded in memory
• Examples: DB2 BLU, Oracle Exalytics
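A minimal sketch using Python's sqlite3 module, which supports a purely in-memory database (the table is made up): the same SQL runs against RAM-resident pages instead of a file on disk.

  import sqlite3

  ram_db  = sqlite3.connect(":memory:")   # lives entirely in RAM, gone when closed
  disk_db = sqlite3.connect("orders.db")  # persists its pages in a file (name made up)

  for db in (ram_db, disk_db):
      db.execute("CREATE TABLE IF NOT EXISTS t (k INTEGER PRIMARY KEY, v TEXT)")
      db.executemany("INSERT OR REPLACE INTO t VALUES (?, ?)",
                     [(i, "row") for i in range(1000)])
      db.commit()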
15. Data Warehouse
• The need for analyzing vast amounts of data was the first
application that challenged the dominance of the RDBMS as
the only tool required to work with data, as performance
became a serious issue.
• In order to avoid impacting the performance of OLTP
(OnLine Transaction Processing) databases, common
sense dictated that data analysis should be performed on
a different data store. As a result, the process is as
follows:
• The data is first moved from the OLTP database to an
operational data store (ODS), a repository used to
transform the data before it can be used
• Then, the data is moved to the database in which the
information is analyzed, the Data Warehouse (DW)
• This process is at the origin of the spectacular growth
in the use of ETL (Extraction, Transformation and
Load) and data quality tools
16. Extraction-Transformation-Load
ETL tools
• In Data warehouse environments, it is
common to update the data regularly
(usually nightly) with the latest information
from the transactional systems (OLTP). In
general, the data needs to be transformed
before it can be loaded into the DWH.
• In addition, it is still very common to
exchange data between systems by
sending flat files from one computer to
another.
• This will probably disappear over time,
as we move to a world where systems
need to be online at all times.
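A minimal extract-transform-load sketch between two sqlite3 databases standing in for the OLTP system and the DWH (table names, columns and the currency rate are all made up).

  import sqlite3

  oltp = sqlite3.connect(":memory:")   # stands in for the transactional system
  dwh  = sqlite3.connect(":memory:")   # stands in for the data warehouse

  oltp.execute("CREATE TABLE sales (id INTEGER, amount_mxn REAL, sold_at TEXT)")
  oltp.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                   [(1, 199.0, "2014-03-01"), (2, 350.5, "2014-03-01")])
  dwh.execute("CREATE TABLE fact_sales (id INTEGER, amount_usd REAL, sold_at TEXT)")

  # Extract
  rows = oltp.execute("SELECT id, amount_mxn, sold_at FROM sales").fetchall()
  # Transform (currency conversion; the exchange rate is made up)
  rows = [(i, round(mxn / 13.0, 2), d) for i, mxn, d in rows]
  # Load
  dwh.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
  dwh.commit()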
17. Data Replication
• In modern environments which are online 24/7,
exchanging flat files to share information among
systems is not a viable solution.
• As soon as a change happens in one
database, it needs to be reflected in the
other repositories that require the
information.
• The data replication needs to be guaranteed,
even if one of the repositories is momentarily
off-line (see the sketch below)
• Most databases include some kind of built-in
replication functionality, but it is usually
limited in scope, e.g. not allowing replication
between databases from different vendors.
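A minimal sketch of one common building block of replication, trigger-based change capture, using sqlite3 (the tables and trigger are made up): every change is recorded so an agent can forward it later, even if the target repository was offline when the change happened.

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.executescript("""
  CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
  CREATE TABLE change_log (table_name TEXT, row_id INTEGER, op TEXT,
                           captured_at TEXT DEFAULT CURRENT_TIMESTAMP);
  -- Record every change so a replication agent can replay it elsewhere later.
  CREATE TRIGGER log_customer_update AFTER UPDATE ON customers
  BEGIN
      INSERT INTO change_log (table_name, row_id, op)
      VALUES ('customers', NEW.id, 'UPDATE');
  END;
  """)
  db.execute("INSERT INTO customers VALUES (1, 'old@example.com')")
  db.execute("UPDATE customers SET email = 'new@example.com' WHERE id = 1")
  print(db.execute("SELECT table_name, row_id, op FROM change_log").fetchall())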
18. Data cleansing and enrichment
• Analyzing dirty data is simply not possible. It
needs to be cleansed.
• Standardize data (addresses, names, etc.)
• Eliminate duplicates, erroneous data, etc.
• Further, deeper analysis can be performed
when the data is enriched with additional data
• Geocoding (distance to store)
• Demographics (age, sex, marital status,
estimated income, house ownership,
attended college, political leaning, charitable
giving, number of cars owned, etc.)
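A minimal cleansing sketch in Python (the records, the state mapping and the matching rule are made up): standardize values first, then collapse records that refer to the same person.

  import unicodedata

  records = [
      {"name": " Ana Pérez ", "state": "cdmx"},
      {"name": "ANA PEREZ",   "state": "Ciudad de México"},
      {"name": "Juan López",  "state": "JAL"},
  ]
  STATE_MAP = {"cdmx": "CDMX", "ciudad de méxico": "CDMX", "jal": "JAL"}

  def standardize(rec):
      """Trim whitespace, fix casing and map state names to one canonical code."""
      return {"name": " ".join(rec["name"].split()).title(),
              "state": STATE_MAP.get(rec["state"].strip().lower(), rec["state"])}

  def fold(text):
      """Drop accents so 'Pérez' and 'Perez' compare equal (a crude match key)."""
      return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode().lower()

  clean = [standardize(r) for r in records]
  unique = {(fold(r["name"]), r["state"]): r for r in clean}.values()
  print(list(unique))  # the two Ana Pérez records collapse into one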
19. Data Warehouse
• Data is usually kept in a star schema, a special
case of the snowflake schema, which is effective
for handling simpler DWH queries
• Fact tables sit at the centre of the schema,
surrounded by the dimension tables.
• These tables are usually not normalized, for
performance reasons. Referential integrity is not
a concern because the data is usually imported
from databases that already enforce it.
• Specialized DWH databases can load data
very quickly, run queries very fast by using
specialized indices and execute critical
operations such as building aggregate tables in
optimized ways
• Example: DB2 BLU
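A minimal star schema sketch in sqlite3 (table and column names are made up): a central fact table holds the measures plus keys into small denormalized dimension tables, and typical queries join and aggregate across them.

  import sqlite3

  dwh = sqlite3.connect(":memory:")
  dwh.executescript("""
  -- Dimension tables surround the fact table (denormalized, FKs not enforced).
  CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
  CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, region TEXT);
  CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
  -- Fact table: one row per sale, measures plus one key per dimension.
  CREATE TABLE fact_sales  (date_id INTEGER, store_id INTEGER,
                            product_id INTEGER, amount REAL);
  """)

  # A typical DWH query: total sales by year and region.
  query = """
  SELECT d.year, s.region, SUM(f.amount)
  FROM fact_sales f
  JOIN dim_date d  ON f.date_id  = d.date_id
  JOIN dim_store s ON f.store_id = s.store_id
  GROUP BY d.year, s.region
  """
  print(dwh.execute(query).fetchall())  # empty here, since no rows were loaded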
20. Database clustering
If web servers can scale horizontally, why can’t
relational databases do the same? Couldn’t we share
the workload among multiple computer nodes?
In order to achieve that, computer scientists have
created two distinct architectures
• Shared disk
• Multiple instances of the database, all
pointing to a single copy of the data
• Shared Nothing
• Multiple instances of the database, each
one owning part of the data set (the data is
partitioned)
21. Shared Disk
• Pros
• If one database instance or even a computing
node fails, the system keeps working
• Good performance when reading data, even
though the shared disk can become a bottleneck
• Cons
• Write operations become the main bottleneck
(especially when using more than two nodes),
because all the nodes need to be coordinated
• This can be mitigated by partitioning the data
• If the shared disk fails, the whole system fails
• Recovery after a node fails is a lengthy operation
• Example: IBM DB2 pureScale
22. Shared Nothing
• Examples: Informix XPS, DB2 EEE/DPF
• Pros
• In general, write operations are extremely fast
• Scales linearly
• Cons
• Read operations can be slower when queries
execute joins on data residing on different disks
• This also applies, to a lesser extent, to write
operations on data residing on multiple disks
• If a computing node or its disk fails, that
node's data becomes unavailable
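A minimal sketch of how a shared-nothing cluster decides data placement (node count and keys are made up): a hash of the partitioning key determines the single node that owns each row, which is why writes scale out while cross-node joins need extra data movement.

  NODES = 4  # each node owns its own disk and its own slice of the data

  def owner(key):
      """Hash partitioning: the key alone decides which node stores the row."""
      return hash(key) % NODES

  rows = [("cust-%d" % i, i * 10) for i in range(12)]
  placement = {}
  for key, value in rows:
      placement.setdefault(owner(key), []).append((key, value))  # one node per write

  print({node: len(r) for node, r in placement.items()})

  # A join on a column other than the partitioning key must first ship rows
  # between nodes, which is why such reads can be slower than on shared disk.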
23. Data marts
• The first DWHs grew extremely quickly
until they became too hard to manage
• That is why organizations started to
build specialized Data marts by
function (HR, finance, sales, etc.) or
department
• In order to avoid creating information
silos, all data marts should use the
same dimensions
• This is usually enforced by the ETL
tools
24. OLAP cubes
The data is stored in a repository using a
star schema, which in turn is used to
build a multidimensional cube, so that
measures such as sales can be analyzed
across multiple dimensions (regions, time
periods, etc.)
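A minimal sketch of pre-aggregating a tiny cube from star-schema rows in plain Python (the regions, quarters and amounts are made up): once every cell is computed, slicing and rolling up become cheap lookups.

  from collections import defaultdict

  # (region, quarter, amount) rows as they would come out of the star schema.
  facts = [("North", "Q1", 100), ("North", "Q2", 150),
           ("South", "Q1", 80),  ("South", "Q1", 20)]

  # Pre-aggregate every (region, quarter) cell of the cube.
  cube = defaultdict(float)
  for region, quarter, amount in facts:
      cube[(region, quarter)] += amount

  print(cube[("South", "Q1")])                                 # 100.0 (one cell)
  print(sum(v for (r, _), v in cube.items() if r == "North"))  # 250.0 (roll-up)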
25. MOLAP / ROLAP tools
• MOLAP tools (Multidimensional OLAP)
load data into a cube, on which the user
can quickly execute complex queries
• ROLAP tools (Relational OLAP) transform
user queries into complex SQL queries
that are executed on a relational database
• This requires a relational database that
has been optimized to handle data
warehouse type queries
• In addition, to improve performance,
aggregate tables need to be built
26. MOLAP tools
• Pros
• In most cases delivers better performance, due
to index optimization, data cache and efficient
storage mechanisms
• Lower disk space usage due to efficient data
compression techniques
• Aggregate data is built automatically
• Cons
• Loading large data sets can be slow
• Working with models that include large amounts
of data and a large number of dimensions is
highly inefficient
27. ROLAP tools
• Pros
• Usually scales better (dimensions and records)
• Loading data with a robust ETL usually is much
faster
• Cons
• Generally offers worse performance when both
MOLAP and ROLAP tools can perform the job. This
can however be mitigated by using ad-hoc
database extensions (for example DB2 cubes)
• Depends on SQL, which does not translate
well to some particular use cases
(budgeting, financial reporting, etc.)
• Uses much more space on disk
28. HOLAP tools
HOLAP (Hybrid Online Analytical Processing)
is a combination of ROLAP and MOLAP
With this technology it becomes possible to
store part of the data in a MOLAP repository
and the rest in a ROLAP one, choosing the
best strategy for each case. For example:
• Keep large tables with the detailed data
in a relational database
• Keep aggregate data in a MOLAP
repository
29. Hardware (Appliances)
In order to obtain the best performance from
the software and simplify database
management, some manufacturers have
opted to develop integrated hardware and
software solutions (a.k.a. appliances)
• They simplify configuration (loading data,
index and schema creation, etc.)
• They simplify maintenance (standard
components and streamlined support)
• They get the most performance out of
the hardware by using specialized
chips and optimized storage devices
30. Hardware (CPU)
• The IBM POWER8 microprocessor was
specially designed to excel at data
processing applications
• Large memory caches (512 KB of L2 per
core, a 96 MB shared L3 cache and a
128 MB L4 cache outside the chip)
• 8 threads per core, 12 cores per
chip (96 threads per chip)
• Up to 5 GHz
31. Columnar databases
In Business Analytics environments, it is very unlikely that all the columns of a
record will be required as part of the result of a query or in the WHERE clause
• Having the data organized by columns instead of by record (row) significantly
improves query times, because usually much less information has to be read
from disk (see the sketch below)
• Modern databases such as DB2 BLU have been designed to excel in both OLTP
and OLAP environments. That means that DBAs can choose at database or table
creation time how the data will be stored on disk (columns or rows)
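A minimal sketch of the two layouts in plain Python (the records are made up): a query that only needs one column has to walk every field of every record in the row layout, but only one array in the column layout.

  # The same three records stored row-wise and column-wise.
  rows = [
      {"id": 1, "amount": 100.0, "country": "MX"},
      {"id": 2, "amount": 250.0, "country": "US"},
      {"id": 3, "amount": 75.0,  "country": "MX"},
  ]
  columns = {
      "id":      [1, 2, 3],
      "amount":  [100.0, 250.0, 75.0],
      "country": ["MX", "US", "MX"],
  }

  # SELECT SUM(amount): the row store visits every field of every record,
  # while the column store reads the single 'amount' array.
  print(sum(r["amount"] for r in rows))   # 425.0 after touching all columns
  print(sum(columns["amount"]))           # 425.0 after touching one column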
32. Support for new data types
During the 90s, developers started to ask for
expanded datatype support in relational databases
• Distinct types based on existing types
• STRUCT-like composite types
• Completely new data types with their own
indexing methods (videos, pictures, sound)
• Time series
• Coordinates (2D, 3D)
• Text documents
• XML, Word, PDF, etc.
• Etc.
33. Requirements inspired by trends in modern
programming languages
• Inheritance
• Tables and types that inherit part of
their structure from other tables/types
• Polymorphism
• More flexibility to define/overload
functions, stored procedures and
operators
• Stored procedures written in modern
programming languages
34. Object-relational databases
• Illustra was a company that developed an
object-relational database and pioneered
many of these interesting concepts, which
came primarily from Java and Smalltalk
• Informix acquired Illustra and integrated these
novel ideas into version 9.x of its flagship IDS
database
• Later on, DB2 and Oracle also implemented
some of those ideas
• Mapping Java objects to a relational database
(O/R mapping) is a different issue that can be
solved using object persistence libraries
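A minimal O/R mapping sketch with SQLAlchemy, one widely used Python object-persistence library (not the products the slide's history refers to; the Customer class and table are made up), assuming SQLAlchemy 1.4 or later:

  from sqlalchemy import create_engine, Column, Integer, String
  from sqlalchemy.orm import declarative_base, Session

  Base = declarative_base()

  class Customer(Base):
      """A plain Python class mapped onto a relational table."""
      __tablename__ = "customers"
      id = Column(Integer, primary_key=True)
      name = Column(String)

  engine = create_engine("sqlite:///:memory:")
  Base.metadata.create_all(engine)

  with Session(engine) as session:
      session.add(Customer(id=1, name="Ana"))   # object in, row out
      session.commit()
      print(session.get(Customer, 1).name)      # row back in as an object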
35. Improved database management
• The more options a DBA has to tune the
system, the better the chances of getting
the most performance out of it
• However, as we provide more knobs to tune
the system, the DBA's job becomes more
and more complex, especially in large
datacenters where a single DBA may be
responsible for hundreds of database
instances
• The solution to this problem is Autonomic
Computing, which allows the database to
tune itself, based on rules that result from
experience
36. Relational databases have evolved, a lot
Just before the data explosion triggered by the Web 2.0 phenomenon, some large
enterprises still used niche databases to work around the limitations of relational
databases in certain edge cases. In most cases, however, the most advanced database
products (such as DB2, Oracle and Informix) were very successful at evolving quickly
to solve virtually every emerging information management problem, and thus avoided
having their privileged position threatened in any significant way by new products.
37. Contact information
On Twitter: @huibert (English), @huibert2 (Spanish)
Web site: http://www.huibert-aalbers.com
Blog: http://www.huibert-aalbers.com/blog