Relational databases have dominated the IT world for the last 30 years. However, Web 2.0 and the incipient Internet of Things (IoT) are among the sources of a data explosion that, in a growing number of cases, exceeds what modern relational databases can handle. As a result, new technologies had to be developed for these new use cases; we generally group them under the Big Data umbrella. In this two-part presentation, we start by understanding how relational databases have evolved to become the powerhouses they are today. In part 2 we will look at how NoSQL databases are tackling the big data problem to scale beyond what relational databases can provide today.
1. The evolution of database technology (I)
Huibert Aalbers
Senior Certified Executive IT Architect
2. IT Insight podcast
• This podcast belongs to the IT Insight series
• You can subscribe to the podcast through iTunes.
• Additional material such as presentations in PDF format or white
papers mentioned in the podcast can be downloaded from the IT
insight section of my site at http://www.huibert-aalbers.com
• You can send questions or suggestions regarding this podcast to my
personal email, huibert_aalbers@mac.com
3. Hierarchical databases
• In the 1960s, IBM launched the first
computers equipped with a hard disk
drive
• This spurred the development of
technology to store, process and
retrieve data. IMS, in 1968, became
the first commercial database
software, developed by IBM to manage
the very large bill of materials (BOM)
for the Saturn V moon rocket and the
Apollo space vehicle.
• IMS was the first hierarchical database
4. Hierarchical databases
• Hierarchical databases have a serious
limitation: they only support 1-to-n
relationships, which makes data
modeling difficult (see the sketch below)
• A parent can have multiple
children
• A child can only have a single
parent
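A minimal Python sketch (with made-up part numbers) of why the 1-to-n restriction hurts: a component that belongs to several assemblies, such as a common bolt in a bill of materials, has to be stored once under every parent.

  # Hypothetical BOM fragment stored as a strict hierarchy: every node has
  # exactly one parent, so a shared part must be duplicated under each assembly.
  bom = {
      "ROCKET": {
          "STAGE-1": {"ENGINE-A": {}, "BOLT-M10": {}},
          "STAGE-2": {"ENGINE-B": {}, "BOLT-M10": {}},  # same bolt, second copy
      }
  }

  def occurrences(tree, part):
      """Count how many times a part is physically stored in the hierarchy."""
      return sum((name == part) + occurrences(children, part)
                 for name, children in tree.items())

  print(occurrences(bom["ROCKET"], "BOLT-M10"))  # 2 -> one record per parent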
5. Hierarchical databases
• The best-known hierarchical
databases are
• IMS (still popular in large banks)
• Windows registry
• LDAP directories (depending on the
implementation)
• Hierarchical databases can still hold a
significant performance edge over
relational databases for the predictable,
hierarchy-shaped access paths they
were designed for
6. Relational databases
• In 1970, Ted Codd, a British
mathematician who worked at IBM,
published a paper titled “A relational
model of data for large shared data
banks”
• His groundwork generated much interest
in the information management world and
spurred the creation of new companies
such as Oracle (1977) or Informix (1980)
that implemented Codd’s ideas.
Meanwhile, IBM developed DB2, which
first appeared on mainframes (1981) and
later on distributed platforms.
7. Relational databases
• For over thirty years, relational
databases have ruled the database
market, based on their undeniable
strengths
• During that period, users have shaped
the evolution of the technology by
demanding new features and
increased performance
8. Strengths of Relational Databases
• Great technology to store large
volumes of structured data
• The consistency of the data is
guaranteed through the
implementation of the ACID properties
• Atomicity
• Consistency
• Isolation
• Durability
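A minimal sketch of the atomicity property using Python's built-in sqlite3 module (the accounts table and the failing transfer are made up): the whole transaction is either committed or rolled back as a unit.

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
  conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 0)])
  conn.commit()

  try:
      with conn:  # one transaction: commits on success, rolls back on any error
          conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 'A'")
          if conn.execute("SELECT balance FROM accounts WHERE id = 'A'").fetchone()[0] < 0:
              raise ValueError("insufficient funds")  # forces the rollback
          conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 'B'")
  except ValueError:
      pass

  # Neither half of the failed transfer is visible: {'A': 100, 'B': 0}
  print(dict(conn.execute("SELECT id, balance FROM accounts")))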
9. User requirements that have shaped modern
relational databases
• Increased scalability
• Ability to perform complex queries against
large data sets (Data warehousing)
• Support for new programming
languages and types of data
• Requirements inspired by trends in
modern programming languages
• Improved administration features to ease
management of large numbers of
database instances
10. Increased scalability
• Symmetric Multiprocessing (SMP)
• IBM System/360 Model 65 (1967)
• UNIX (starting in the mid 80’s)
• Support for multiple processor cores
• Power 4 (2001)
• Data partitioning
• SQL query optimizer improvements
• Data compression
• Increased use of RAM
• Clustering
11. What are the bottlenecks of relational databases?
• I/O
• SQL joins
• Transactions (Locks), Distributed
Transactions (Two phase commit)
• Concurrency
• Hardware
12. Data partitioning
• Hard disk drives used to be the main
bottleneck preventing quick data access.
That is why a mechanism was needed to
access data on multiple disks in parallel
(see the sketch below).
• A partitioned table has its data spread
over multiple disks, based on:
• Expression
• Range
• Round-Robin
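A minimal sketch of the routing logic behind these schemes (the order rows and range boundaries are made up): each row is sent to one partition, so a large scan can later hit all partitions in parallel.

  # Hypothetical order rows routed across 3 partitions (e.g. 3 separate disks).
  orders = [{"order_id": i, "amount": i * 10} for i in range(1, 10)]

  def by_range(row):
      """Range partitioning: the value of order_id decides the partition."""
      if row["order_id"] <= 3:
          return 0
      return 1 if row["order_id"] <= 6 else 2

  def round_robin(row):
      """Round-robin partitioning: spread rows evenly, ignoring their content."""
      return row["order_id"] % 3

  partitions = {0: [], 1: [], 2: []}
  for row in orders:
      partitions[by_range(row)].append(row)   # swap in round_robin to compare

  print({p: len(rows) for p, rows in partitions.items()})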
13. Data compression
• Data compression allows for significant storage (and
therefore money) savings. In addition, and this may
sound counterintuitive, it also increases
performance, since data is read much faster (with
less I/O), especially when data is stored in columnar
form. Administrators can choose to compress:
• Data
• Indices
• Blobs
• Results are spectacular
• Up to 80% less space needed to store the data
• Up to 20% less I/O
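A minimal sketch of one compression technique that works particularly well on sorted, low-cardinality columns, run-length encoding (the column values are made up); real products combine several such encodings.

  from itertools import groupby

  # A low-cardinality column (e.g. country codes) as it sits in columnar storage.
  column = ["MX"] * 5000 + ["US"] * 3000 + ["CA"] * 2000

  # Run-length encoding: store each value once, together with its repeat count.
  encoded = [(value, len(list(run))) for value, run in groupby(column)]

  print(encoded)                                    # [('MX', 5000), ('US', 3000), ('CA', 2000)]
  print(len(column), "values stored as", len(encoded), "pairs")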
14. In-memory databases
These databases store data in memory (RAM) instead of on hard disk drives, to
scale better and support extremely high transaction volumes
• This technology was originally designed to meet the needs of specific
industries (primarily telcos and financial institutions) that required
processing unusually high volumes of transactions
Recently, the line dividing in-memory databases from traditional databases has
started to blur with the introduction of databases such as DB2 BLU, which
automatically try to make the most of RAM to improve performance without
requiring all the data to be loaded in memory
• Examples: DB2 BLU, Oracle Exalytics
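A minimal sketch using Python's sqlite3 module, which supports a purely in-memory database (the table is made up): the same SQL runs against RAM-resident pages instead of a file on disk.

  import sqlite3

  ram_db  = sqlite3.connect(":memory:")   # lives entirely in RAM, gone when closed
  disk_db = sqlite3.connect("orders.db")  # persists its pages in a file (name made up)

  for db in (ram_db, disk_db):
      db.execute("CREATE TABLE IF NOT EXISTS t (k INTEGER PRIMARY KEY, v TEXT)")
      db.executemany("INSERT OR REPLACE INTO t VALUES (?, ?)",
                     [(i, "row") for i in range(1000)])
      db.commit()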
15. Data Warehouse
• The need for analyzing vast amounts of data was the first
application that challenged the dominance of the RDBMS as
the only tool required to work with data, as performance
became a serious issue.
• In order to avoid impacting the performance of OLTP
(OnLine Transaction Processing) databases, common
sense dictated that data analysis should be performed on
a different data store. As a result, the process is as
follows:
• The data is first moved from the OLTP database to an
operational data store (ODS), a repository used to
transform the data before it can be used
• Then, the data is moved to the database in which the
information is analyzed, the Data Warehouse (DW)
• This process is at the origin of the spectacular growth
in the use of ETL (Extraction, Transformation and
Load) and data quality tools
16. Extraction-Transformation-Load
ETL tools
• In Data warehouse environments, it is
common to update the data regularly
(usually nightly) with the latest information
from the transactional systems (OLTP). In
general, the data needs to be transformed
before it can be loaded into the DWH.
• In addition, it is still very common to
exchange data between systems by
sending flat files from one computer to
another.
• This will probably disappear over time,
as we move to a world where systems
need to be online at all times.
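A minimal extract-transform-load sketch between two sqlite3 databases standing in for the OLTP system and the DWH (table names, columns and the currency rate are all made up).

  import sqlite3

  oltp = sqlite3.connect(":memory:")   # stands in for the transactional system
  dwh  = sqlite3.connect(":memory:")   # stands in for the data warehouse

  oltp.execute("CREATE TABLE sales (id INTEGER, amount_mxn REAL, sold_at TEXT)")
  oltp.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                   [(1, 199.0, "2014-03-01"), (2, 350.5, "2014-03-01")])
  dwh.execute("CREATE TABLE fact_sales (id INTEGER, amount_usd REAL, sold_at TEXT)")

  # Extract
  rows = oltp.execute("SELECT id, amount_mxn, sold_at FROM sales").fetchall()
  # Transform (currency conversion; the exchange rate is made up)
  rows = [(i, round(mxn / 13.0, 2), d) for i, mxn, d in rows]
  # Load
  dwh.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
  dwh.commit()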
17. Data Replication
• In modern environments which are online 24/7,
exchanging flat files to share information among
systems is not a viable solution.
• As soon as a change happens in one
database, it needs to be reflected in the
other repositories that require the
information.
• The data replication needs to be guaranteed,
even if one of the repositories is momentarily
off-line (see the sketch below)
• Most databases include some kind of built-in
replication functionality, but it is usually
limited in scope, e.g. not allowing replication
between databases from different vendors.
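A minimal sketch of one common building block of replication, trigger-based change capture, using sqlite3 (the tables and trigger are made up): every change is recorded so an agent can forward it later, even if the target repository was offline when the change happened.

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.executescript("""
  CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
  CREATE TABLE change_log (table_name TEXT, row_id INTEGER, op TEXT,
                           captured_at TEXT DEFAULT CURRENT_TIMESTAMP);
  -- Record every change so a replication agent can replay it elsewhere later.
  CREATE TRIGGER log_customer_update AFTER UPDATE ON customers
  BEGIN
      INSERT INTO change_log (table_name, row_id, op)
      VALUES ('customers', NEW.id, 'UPDATE');
  END;
  """)
  db.execute("INSERT INTO customers VALUES (1, 'old@example.com')")
  db.execute("UPDATE customers SET email = 'new@example.com' WHERE id = 1")
  print(db.execute("SELECT table_name, row_id, op FROM change_log").fetchall())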
18. Data cleansing and enrichment
• Analyzing dirty data is simply not possible. It
needs to be cleansed.
• Standardize data (addresses, names, etc.)
• Eliminate duplicates, erroneous data, etc.
• Further, deeper analysis can be performed
when the data is enriched with additional data
• Geocoding (distance to store)
• Demographics (age, sex, marital status,
estimated income, house ownership,
attended college, political leaning, charitable
giving, number of cars owned, etc.)
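A minimal cleansing sketch in Python (the records, the state mapping and the matching rule are made up): standardize values first, then collapse records that refer to the same person.

  import unicodedata

  records = [
      {"name": " Ana Pérez ", "state": "cdmx"},
      {"name": "ANA PEREZ",   "state": "Ciudad de México"},
      {"name": "Juan López",  "state": "JAL"},
  ]
  STATE_MAP = {"cdmx": "CDMX", "ciudad de méxico": "CDMX", "jal": "JAL"}

  def standardize(rec):
      """Trim whitespace, fix casing and map state names to one canonical code."""
      return {"name": " ".join(rec["name"].split()).title(),
              "state": STATE_MAP.get(rec["state"].strip().lower(), rec["state"])}

  def fold(text):
      """Drop accents so 'Pérez' and 'Perez' compare equal (a crude match key)."""
      return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode().lower()

  clean = [standardize(r) for r in records]
  unique = {(fold(r["name"]), r["state"]): r for r in clean}.values()
  print(list(unique))  # the two Ana Pérez records collapse into one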
19. Data Warehouse
• Data is usually kept in a star schema, a special
case of the snowflake schema, which is effective
for handling simpler DWH queries
• Fact tables sit at the centre of the schema,
surrounded by the dimension tables.
• These tables are usually not normalized, for
performance reasons. Referential integrity is not
a concern because the data is usually imported
from databases that already enforce it.
• Specialized DWH databases can load data
very quickly, run queries very fast by using
specialized indices and execute critical
operations such as building aggregate tables in
optimized ways
• Example: DB2 BLU
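A minimal star schema sketch in sqlite3 (table and column names are made up): a central fact table holds the measures plus keys into small denormalized dimension tables, and typical queries join and aggregate across them.

  import sqlite3

  dwh = sqlite3.connect(":memory:")
  dwh.executescript("""
  -- Dimension tables surround the fact table (denormalized, FKs not enforced).
  CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
  CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, region TEXT);
  CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
  -- Fact table: one row per sale, measures plus one key per dimension.
  CREATE TABLE fact_sales  (date_id INTEGER, store_id INTEGER,
                            product_id INTEGER, amount REAL);
  """)

  # A typical DWH query: total sales by year and region.
  query = """
  SELECT d.year, s.region, SUM(f.amount)
  FROM fact_sales f
  JOIN dim_date d  ON f.date_id  = d.date_id
  JOIN dim_store s ON f.store_id = s.store_id
  GROUP BY d.year, s.region
  """
  print(dwh.execute(query).fetchall())  # empty here, since no rows were loaded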
20. Database clustering
If web servers can scale horizontally, why can’t
relational databases do the same? Couldn’t we share
the workload among multiple computer nodes?
In order to achieve that, computer scientists have
created two distinct architectures
• Shared disk
• Multiple instances of the database, all
pointing to a single copy of the data
• Shared Nothing
• Multiple instances of the database, each
one owning part of the data set (the data is
partitioned)
21. Shared Disk
• Pros
• If one database instance or even a computing
node fails, the system keeps working
• Good performance when reading data, even
though the shared disk can become a bottleneck
• Cons
• Write operations become the main bottleneck
(especially when using more than two nodes),
because all the nodes need to be coordinated
• This can be mitigated by partitioning the data
• If the shared disk fails, the whole system fails
• Recovery after a node fails is a lengthy operation
• Example: IBM DB2 pureScale
22. Shared Nothing
• Examples: Informix XPS, DB2 EEE/DPF
• Pros
• In general, write operations are extremely fast
• Scales linearly
• Cons
• Read operations can be slower when queries
execute joins on data residing on different disks
• This also applies, to a lesser extent, to write
operations on data residing on multiple disks
• If a computing node or its disk fails, that
node's data becomes unavailable
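A minimal sketch of how a shared-nothing cluster decides data placement (node count and keys are made up): a hash of the partitioning key determines the single node that owns each row, which is why writes scale out while cross-node joins need extra data movement.

  NODES = 4  # each node owns its own disk and its own slice of the data

  def owner(key):
      """Hash partitioning: the key alone decides which node stores the row."""
      return hash(key) % NODES

  rows = [("cust-%d" % i, i * 10) for i in range(12)]
  placement = {}
  for key, value in rows:
      placement.setdefault(owner(key), []).append((key, value))  # one node per write

  print({node: len(r) for node, r in placement.items()})

  # A join on a column other than the partitioning key must first ship rows
  # between nodes, which is why such reads can be slower than on shared disk.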
23. Data marts
• The first DWHs grew extremely quickly
until they became too hard to manage
• That is why organizations started to
build specialized Data marts by
function (HR, finance, sales, etc.) or
department
• In order to avoid creating information
silos, all data marts should use the
same dimensions
• This is usually enforced by the ETL
tools
24. OLAP cubes
The data is stored in a repository using a
star schema, which in turn is used to
build a multidimensional cube, so that
measures such as sales can be analyzed
across multiple dimensions (regions, time
periods, etc.)
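A minimal sketch of pre-aggregating a tiny cube from star-schema rows in plain Python (the regions, quarters and amounts are made up): once every cell is computed, slicing and rolling up become cheap lookups.

  from collections import defaultdict

  # (region, quarter, amount) rows as they would come out of the star schema.
  facts = [("North", "Q1", 100), ("North", "Q2", 150),
           ("South", "Q1", 80),  ("South", "Q1", 20)]

  # Pre-aggregate every (region, quarter) cell of the cube.
  cube = defaultdict(float)
  for region, quarter, amount in facts:
      cube[(region, quarter)] += amount

  print(cube[("South", "Q1")])                                 # 100.0 (one cell)
  print(sum(v for (r, _), v in cube.items() if r == "North"))  # 250.0 (roll-up)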
25. MOLAP / ROLAP tools
• MOLAP tools (Multidimensional OLAP)
load data into a cube, on which the user
can quickly execute complex queries
• ROLAP tools (Relational OLAP) transform
user queries into complex SQL queries
that are executed on a relational database
• This requires a relational database that
has been optimized to handle data
warehouse type queries
• In addition, to improve performance,
aggregate tables need to be built
26. MOLAP tools
• Pros
• In most cases delivers better performance, due
to index optimization, data cache and efficient
storage mechanisms
• Lower disk space usage due to efficient data
compression techniques
• Aggregate data is built automatically
• Cons
• Loading large data sets can be slow
• Working with models that include large amounts
of data and a large number of dimensions is
highly inefficient
27. ROLAP tools
• Pros
• Usually scales better (dimensions and records)
• Loading data with a robust ETL usually is much
faster
• Cons
• Generally offers worse performance when both
MOLAP and ROLAP tools can perform the job. This
can however be mitigated by using ad-hoc
database extensions (for example DB2 cubes)
• Depends on SQL, which does not translate
well to some particular use cases
(budgeting, financial reporting, etc.)
• Uses much more space on disk
28. HOLAP tools
HOLAP (Hybrid Online Analytical Processing)
is a combination of ROLAP and MOLAP
With this technology it becomes possible to
store part of the data in a MOLAP repository
and the rest in a ROLAP one, choosing the
best strategy for each case. For example:
• Keep large tables with the detailed data
in a relational database
• Keep aggregate data in a MOLAP
repository
29. Hardware (Appliances)
In order to obtain the best performance from
the software and simplify database
management, some manufacturers have
opted to develop integrated hardware and
software solutions (a.k.a. appliances)
• They simplify configuration (loading data,
index and schema creation, etc.)
• They simplify maintenance (standard
components and streamlined support)
• They get the most performance out of
the hardware by using specialized
chips and optimized storage devices
30. Hardware (CPU)
• The IBM POWER8 microprocessor was
specially designed to excel at data
processing applications
• Large memory caches (512 KB of L2 per
core, a 96 MB shared L3 cache and a
128 MB L4 cache outside the chip)
• 8 threads per core, 12 cores per
chip (96 threads per chip)
• Up to 5 GHz
31. Columnar databases
In Business Analytics environments, it is very unlikely that all the columns of a
record will be required as part of the result of a query or in the WHERE clause
• Having the data organized by columns instead of by record (row) significantly
improves query times, because usually much less information has to be read
from disk (see the sketch below)
• Modern databases such as DB2 BLU have been designed to excel in both OLTP
and OLAP environments. That means that DBAs can choose at database or table
creation time how the data will be stored on disk (columns or rows)
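A minimal sketch of the two layouts in plain Python (the records are made up): a query that only needs one column has to walk every field of every record in the row layout, but only one array in the column layout.

  # The same three records stored row-wise and column-wise.
  rows = [
      {"id": 1, "amount": 100.0, "country": "MX"},
      {"id": 2, "amount": 250.0, "country": "US"},
      {"id": 3, "amount": 75.0,  "country": "MX"},
  ]
  columns = {
      "id":      [1, 2, 3],
      "amount":  [100.0, 250.0, 75.0],
      "country": ["MX", "US", "MX"],
  }

  # SELECT SUM(amount): the row store visits every field of every record,
  # while the column store reads the single 'amount' array.
  print(sum(r["amount"] for r in rows))   # 425.0 after touching all columns
  print(sum(columns["amount"]))           # 425.0 after touching one column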
32. Support for new data types
During the 90s, developers started to ask for
expanded datatype support in relational databases
• Distinct types based on existing types
• STRUCT-like composite types
• Completely new data types with their own
indexing methods (videos, pictures, sound)
• Time series
• Coordinates (2D, 3D)
• Text documents
• XML, Word, PDF, etc.
• Etc.
33. Requirements inspired by trends in modern
programming languages
• Inheritance
• Tables and types that inherit part of
their structure from other tables/types
• Polymorphism
• More flexibility to define/overload
functions, stored procedures and
operators
• Stored procedures written in modern
programming languages
34. Object-relational databases
• Illustra was a company that developed an
object-relational database and pioneered
many of these interesting concepts, which
came primarily from Java and Smalltalk
• Informix acquired Illustra and integrated these
novel ideas into version 9.x of its flagship IDS
database
• Later on, DB2 and Oracle also implemented
some of those ideas
• Mapping Java objects to a relational database
(O/R mapping) is a different issue that can be
solved using object persistence libraries
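A minimal O/R mapping sketch with SQLAlchemy, one widely used Python object-persistence library (not the products the slide's history refers to; the Customer class and table are made up), assuming SQLAlchemy 1.4 or later:

  from sqlalchemy import create_engine, Column, Integer, String
  from sqlalchemy.orm import declarative_base, Session

  Base = declarative_base()

  class Customer(Base):
      """A plain Python class mapped onto a relational table."""
      __tablename__ = "customers"
      id = Column(Integer, primary_key=True)
      name = Column(String)

  engine = create_engine("sqlite:///:memory:")
  Base.metadata.create_all(engine)

  with Session(engine) as session:
      session.add(Customer(id=1, name="Ana"))   # object in, row out
      session.commit()
      print(session.get(Customer, 1).name)      # row back in as an object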
35. Improved database management
• The more options a DBA has to tune the
system, the better the chances of getting
the most performance out of it
• However, as we provide more knobs to tune
the system, the DBA's job becomes more
and more complex, especially in large
datacenters where a single DBA may be
responsible for hundreds of database
instances
• The solution to this problem is Autonomic
Computing, which allows the database to
tune itself, based on rules that result from
experience
36. Relational databases have evolved, a lot
Just before the data explosion triggered by the Web 2.0 phenomenon, some large
enterprises still used niche databases to work around the limitations of relational
databases in certain edge cases. In most cases, however, the most advanced database
products (such as DB2, Oracle and Informix) were very successful at evolving quickly
to solve virtually every emerging information management problem, and thus avoided
having their privileged position threatened in any significant way by new products.
37. Contact information
On Twitter: @huibert (English), @huibert2 (Spanish)
Web site: http://www.huibert-aalbers.com
Blog: http://www.huibert-aalbers.com/blog