Remembering Edgar Frank “Ted” Codd - Founder of Relational Databases
Bala Nagendra Rao Betha
Data Architect
Satyam Computer Services Limited
Abstract- The best way to remember a genius is
through his work. Codd, who gave us Relational
Databases, Normalization Principles, Relational
Algebra, Relational Calculus and the “12 Rules of Codd”,
passed away a few years back. His contributions to the
field of computing, especially the Relational Database
Management System (RDBMS), are immense. Thanks to
his work at IBM Research Laboratory, San Jose,
California, there exists today a $100 billion database
industry. Before him there were Network and
Hierarchical Databases, but he brought simplicity to
data representation with Relations. In this paper we
will explore the foundations of Relational
Database Management Systems, i.e. SETS and
RELATIONS, a branch of mathematics, and establish
some of their fundamental properties.
Keywords- Relations, Set Theory, Binary Relations,
Cartesian Product, Ternary Relations, N-ary Relations
Dr. Edgar Frank “Ted” Codd
Founder of Relational Databases
August 23rd, 1923 - April 18th, 2003
ACM/A. M. Turing award recipient 1981
Citation for A. M. Turing award 1981
For his fundamental and continuing contributions to the
theory and practice of database management systems.
He originated the relational approach to database
management in a series of research papers published
commencing in 1970. His paper "A Relational Model
of Data for Large Shared Data Banks" was a seminal
paper, in a continuing and carefully developed series of
papers. Dr. Codd built upon this space and in doing so
has provided the impetus for widespread research into
numerous related areas, including database languages,
query subsystems, database semantics, locking and
recovery, and inferential subsystems.
http://www.acm.org/awards/turing_citations/codd.html
I. INTRODUCTION
The paper revisits the fundamentals of Sets and
Relations and rebuilds the Relational Database
Management System (RDBMS) in terms of Tables
and Columns. So, from SETS and RELATIONS we
will arrive at the Tables and Columns of an RDBMS.
It answers a few persistent questions: What is a
“Relation” in a Relational Database? What was
Codd’s view of a “Relation”? Why are Relations
selective rather than a cross-product of Sets? Can
a Relation appear within another Relation? And why
are duplicate elements not allowed in Sets?
The paper starts by raising the fundamental
question “What is a ‘Relation’ in the context of
Relational Databases?”, followed by an excerpt
from one of E. F. Codd’s papers giving insight
into the definition of a Relation. In Section IV we
bring in Set Theory concepts and, with a few
examples, explain how Binary and Ternary
Relations are formulated. Section V explains
some more properties of Relations, and we close
the paper with an Appendix.
II. WHAT IS A “RELATION” IN
RELATIONAL DATABASES?
In Relational Databases the term “Relation” is
often used. What does it refer to? Sometimes a
Relation is mistaken for a relationship
between two entities, as in an Entity-Relationship
Diagram (ERD). This leads to a question:
relationships in an Entity-Relationship
Diagram are implemented through Foreign Keys,
so are Foreign Keys Relations?
But there were no foreign keys in early
implementations of Relational Databases such as
Oracle, Sybase etc. Oracle did not have support
for Foreign Keys till version 7. By this argument,
should we infer that Oracle was not a Relational
database till version 6? Is MS Access a
Relational Database? Do you treat dBase III or
IV as Relational Databases? Many such
questions remain unanswered in
Relational Database terms, and the popular
understanding is that “Relation” means a relationship
between two tables.
What does the term “Relation” refer to?
By the term “Relation” we refer to a Table, not
to the Foreign Key relationship between two
tables as popularly understood. C. J. Date, in his
book “An Introduction to Database Systems”, took
pains to explain the correspondence between Relation
and Table. Here is a list of formal Relational terms
and their informal equivalents:
Formal Relational Term    Informal equivalent
Relation                  Table
Tuple                     Row or record
Cardinality               Number of rows in a Table/Relation
Attribute                 Column or field
Degree                    Number of columns
Primary key               Unique identifier
Domain                    Pool of legal values
III. CODD’S VIEW OF A RELATION
In his now famous paper “Normalized
Data Base Structure: A Brief Tutorial”, published
by IBM Research Laboratory, San Jose,
California, E. F. Codd describes the “Relation” as
follows:
A table as normally understood
is a rectangular array with the
following properties:
P1: it is column-homogeneous –
in other words, in any selected
column the items are all of the
same kind, whereas items in
different columns need not be of
the same kind;
P2: each item is a simple
number or a character string
(thus, for example, if we look at
the item in any specified row
and any specified column, we do
not find a set of numbers or a
repeating group).
For data base tables we add 3
more properties:
P3: all rows of a table must be
distinct (duplicate rows are not
allowed);
P4: the ordering of rows within a
table is immaterial;
P5: the columns of a table are
assigned distinct names and the
ordering of columns within a
table is immaterial.
As a result of P3, each row can
be uniquely identified (or
addressed) by its content. As a
result of P3 and P4 together, this
kind of table is what
mathematicians call a relation.
So, the term “Relation” comes from the
mathematical branch of Set Theory. A Relation
is a special form of set.
This one citation from E. F. Codd’s paper has
enough material to trace the evolution of
Normalization Principles as well, but we will
focus on the proper definition of “Relation”.
Also, please note that “Data Base” is
written as two words instead of “Database”, as it is
written today. These were the formative years of
databases and the word “Database” had not yet been
coined. In earlier papers Codd had referred to
databases as “Data Banks” as well.
IV. GOING BACK TO SET THEORY
Set Theory was just a simple subject until someone
found a use for it in Relational Databases. To revive
some memories, we start with simple examples
of Sets:
a. SETS AND RELATIONS
Let A = {1, 2, 3} and
B = {x, y, z} and
Let R = {(1, y), (1, z), (3, y)}.
Then R is a Relation from A to B since R is a subset
of A X B.
Cross-Product of A and B: We know that the
cross-product or Cartesian product of Set-A and
Set-B generates 3 X 3 = 9 elements.
So, A X B is certainly larger than Relation-R.
Relation-R is only a selective Relation between
elements of Set-A and Set-B. Relation-R need
not be a complete product of Set-A and Set-B.
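The definitions above can be checked mechanically. The following is a minimal Python sketch, assuming that a set of tuples stands in for a Relation; the variable names are illustrative and are not taken from Codd's papers.

```python
from itertools import product

# The two sets and the Relation from the example above.
A = {1, 2, 3}
B = {"x", "y", "z"}
R = {(1, "y"), (1, "z"), (3, "y")}

# Cartesian product A X B: every possible ordered pair (a, b).
A_x_B = set(product(A, B))

print(len(A_x_B))   # 9
print(R <= A_x_B)   # True: R is a subset of A X B
print(R < A_x_B)    # True: R is smaller than the full product
```

Only three of the nine possible pairs were selected for R, which is what makes it a Relation rather than the full product.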
The next example makes this clearer.
b. COUNTRIES AND NEIGHBORS
We have countries which are adjacent if they
have a common boundary. Then “is adjacent to”
is a Relation R on the countries of the earth.
Thus:
(ITALY, SWITZERLAND) ∈ R; in common
terms ITALY and SWITZERLAND belong to the
Relation R.
While (CANADA, MEXICO) ∉ R; CANADA
and MEXICO do not belong to the Relation R
because CANADA and MEXICO do not have
a common boundary.
Let us prepare a full Relational data model.
We have a Set COUNTRY of the following elements:
COUNTRY = { NORWAY,
SWEDEN,
FINLAND,
DENMARK,
ICELAND,
GERMANY,
ITALY,
FRANCE,
SPAIN,
POLAND,
ROMANIA,
PORTUGAL,
RUSSIA,
UKRAINE}
And we have Set NEIGHBOR_COUNTRY, which
holds all countries adjacent to those in Set COUNTRY:
NEIGHBOR_COUNTRY={NORWAY,
SWEDEN,
FINLAND,
DENMARK,
GERMANY,
ITALY,
FRANCE,
SPAIN,
POLAND,
ROMANIA,
PORTUGAL,
RUSSIA,
UKRAINE}
Please note that ICELAND is missing from the
NEIGHBOR_COUNTRY set, as ICELAND is an
island and does not share a boundary with any
country. The same would hold for AUSTRALIA, as it
does not share a boundary with any country either.
So, the Relation IS_NEIGHBOR_TO is as
follows:
IS_NEIGHBOR_TO = {COUNTRY,
NEIGHBOR_COUNTRY}
For the Relation IS_NEIGHBOR_TO,
COUNTRY is the Domain and
NEIGHBOR_COUNTRY is the Range.
In other words, the Domain is the identity of the
Relation while the Range is the value the Relation
carries. The elements of the Relation are
identified by the Domain.
Relation IS_NEIGHBOR_TO is:
IS_NEIGHBOR_TO =
{(NORWAY, SWEDEN),
(NORWAY, FINLAND),
(SWEDEN, NORWAY),
(SWEDEN, FINLAND),
(FINLAND, NORWAY),
(FINLAND, SWEDEN),
(DENMARK, GERMANY),
(GERMANY, DENMARK),
(GERMANY, POLAND),
(GERMANY, FRANCE),
(GERMANY, AUSTRIA),
(GERMANY, SWITZERLAND),
(POLAND, GERMANY),
(FRANCE, GERMANY),
(FRANCE, SWITZERLAND),
(FRANCE, ITALY) ...}
Interestingly, NORWAY shares common
boundaries with SWEDEN and FINLAND;
hence NORWAY appears in the domain
of the first two tuples of Relation
IS_NEIGHBOR_TO. Conversely,
SWEDEN appears in the domain part of the next two
tuples, as it shares its boundaries with NORWAY
and FINLAND.
Similarly, GERMANY shares its boundaries
with five or more countries, and each of these
countries in turn shares its boundary with
GERMANY. So, they swap their domain and
range roles with GERMANY.
So, if we want to know which countries, or how many,
share a boundary with GERMANY,
we issue a SELECT statement as below:
SELECT COUNTRY,            -- Domain
       NEIGHBOR_COUNTRY    -- Range
FROM   IS_NEIGHBOR_TO
WHERE  COUNTRY = 'GERMANY';
And the Relation IS_NEIGHBOR_TO will fetch
us the following results:
IS_NEIGHBOR_TO =
{(GERMANY, SWITZERLAND)
(GERMANY, POLAND)
(GERMANY, DENMARK)
(GERMANY, FRANCE)
(GERMANY, AUSTRIA)}
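The same selection can be sketched outside a database. The snippet below is an illustration only: it assumes the Relation is held as a Python set of (COUNTRY, NEIGHBOR_COUNTRY) tuples, it uses just a fragment of the tuples listed earlier, and the helper name neighbors_of is hypothetical.

```python
# A fragment of IS_NEIGHBOR_TO, copied from the tuples listed in the paper.
IS_NEIGHBOR_TO = {
    ("NORWAY", "SWEDEN"), ("NORWAY", "FINLAND"),
    ("DENMARK", "GERMANY"),
    ("GERMANY", "DENMARK"), ("GERMANY", "POLAND"), ("GERMANY", "FRANCE"),
    ("GERMANY", "AUSTRIA"), ("GERMANY", "SWITZERLAND"),
    ("POLAND", "GERMANY"), ("FRANCE", "GERMANY"),
}

def neighbors_of(relation, country):
    """Keep only the tuples whose domain element equals `country`."""
    return {(c, n) for (c, n) in relation if c == country}

germany = neighbors_of(IS_NEIGHBOR_TO, "GERMANY")
for pair in sorted(germany):
    print(pair)
```

The result is itself a set of tuples, that is, a smaller Relation drawn from IS_NEIGHBOR_TO.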
We are able to establish several properties of Relations
from this example, such as:
Theorem 1: R ⊆ A X B, meaning Relation-R is a
subset of the cross-product of Set-A and Set-B. This
is clear from the following example:
Example: The terms Cartesian-product and Cross-product
are used interchangeably. The Cartesian
product of COUNTRY (14 elements) and
NEIGHBOR_COUNTRY (13 elements) has
14 X 13 = 182 elements.
But in IS_NEIGHBOR_TO we have far fewer than
182 elements. Hence,
Relation IS_NEIGHBOR_TO ⊆
COUNTRY X NEIGHBOR_COUNTRY
Or in other terms:
Relation IS_NEIGHBOR_TO is a
subset of (COUNTRY product
NEIGHBOR_COUNTRY)
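The count of 182 and the subset claim can be reproduced with a short Python sketch. It assumes the two sets exactly as listed in the paper and, as a simplification, uses only those listed pairs whose members both appear in the two sets.

```python
from itertools import product

COUNTRY = {
    "NORWAY", "SWEDEN", "FINLAND", "DENMARK", "ICELAND", "GERMANY", "ITALY",
    "FRANCE", "SPAIN", "POLAND", "ROMANIA", "PORTUGAL", "RUSSIA", "UKRAINE",
}
# NEIGHBOR_COUNTRY is the same list without ICELAND.
NEIGHBOR_COUNTRY = COUNTRY - {"ICELAND"}

# Fragment of IS_NEIGHBOR_TO restricted to members of the two sets above.
IS_NEIGHBOR_TO = {
    ("NORWAY", "SWEDEN"), ("NORWAY", "FINLAND"),
    ("SWEDEN", "NORWAY"), ("SWEDEN", "FINLAND"),
    ("FINLAND", "NORWAY"), ("FINLAND", "SWEDEN"),
    ("DENMARK", "GERMANY"), ("GERMANY", "DENMARK"),
    ("GERMANY", "POLAND"), ("GERMANY", "FRANCE"),
    ("POLAND", "GERMANY"), ("FRANCE", "GERMANY"), ("FRANCE", "ITALY"),
}

cross = set(product(COUNTRY, NEIGHBOR_COUNTRY))

print(len(COUNTRY), len(NEIGHBOR_COUNTRY))   # 14 13
print(len(cross))                            # 182
print(IS_NEIGHBOR_TO <= cross)               # True
```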
Theorem 2: A Relation is a set of ordered pairs, which
means “A Relation B ≠ B Relation
A”,
because domain and range are not
interchangeable.
Relation (COUNTRY,
NEIGHBOR_COUNTRY) ≠ Relation
(NEIGHBOR_COUNTRY, COUNTRY)
Example: In our example GERMANY shares its
boundaries with five countries. This does not mean
that all other countries also share their boundaries
with those five countries. That would be
true only if the Relation IS_NEIGHBOR_TO
were the Cartesian product of COUNTRY and
NEIGHBOR_COUNTRY. In a Cartesian product
every country would share a boundary with every
other country, which is not the case. In a Relation
only selected elements of the first set are paired
with selected elements of the second set; hence it is
a set of ordered pairs. Some more data makes this
clear:
GERMANY shares its boundaries with five
countries.
IS_NEIGHBOR_TO =
{(GERMANY, SWITZERLAND)
(GERMANY, POLAND)
(GERMANY, DENMARK)
(GERMANY, FRANCE)
(GERMANY, AUSTRIA)}
While POLAND shares its boundaries with:
IS_NEIGHBOR_TO =
{(POLAND, GERMANY)
(POLAND, UKRAINE)}
And FRANCE shares its boundaries with:
IS_NEIGHBOR_TO =
{(FRANCE, GERMANY)
(FRANCE, SWITZERLAND)
(FRANCE, SPAIN)}
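That the order within a pair matters can be illustrated with the small sets A and B from Section IV. This is a sketch under the assumption that Python tuples model ordered pairs; R_inverse is an illustrative name for the Relation obtained by swapping domain and range.

```python
A = {1, 2, 3}
B = {"x", "y", "z"}
R = {(1, "y"), (1, "z"), (3, "y")}   # Relation from A to B

# Swapping every pair gives a different Relation, from B to A.
R_inverse = {(b, a) for (a, b) in R}

print((1, "y") == ("y", 1))           # False: position within a pair matters
print(R == R_inverse)                 # False
print((1, "y") in R, ("y", 1) in R)   # True False
```

This is also why (GERMANY, POLAND) and (POLAND, GERMANY) are listed as two separate tuples of IS_NEIGHBOR_TO.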
c. TERNARY RELATION
So far we have dealt with binary Relations only. It
is time to take this a little further and handle three
sets together.
To the same example we add a set
DIRECTION. It indicates in which
direction from a country the neighboring country
lies, e.g. NORTH, SOUTH or NORTH-EAST.
The set DIRECTION has the following 8
elements:
DIRECTION={ NORTH,
SOUTH,
EAST,
WEST,
NORTH-EAST,
NORTH-WEST,
SOUTH-EAST,
SOUTH-WEST}
We add this set to the IS_NEIGHBOR_TO Relation:
IS_NEIGHBOR_TO = {COUNTRY,
NEIGHBOR_COUNTRY, DIRECTION}
Here, once again COUNTRY is the Domain, and
NEIGHBOR_COUNTRY and DIRECTION
together are the Range. The number of columns in a
Relation determines the degree of the Relation. The
third-degree Relation IS_NEIGHBOR_TO looks
as follows:
IS_NEIGHBOR_TO =
{(NORWAY, SWEDEN, EAST),
(NORWAY, FINLAND, NORTH-EAST),
(SWEDEN, NORWAY, WEST),
(SWEDEN, FINLAND, EAST),
(FINLAND, NORWAY, NORTH-WEST),
(FINLAND, SWEDEN, WEST),
(DENMARK, GERMANY, SOUTH),
(GERMANY, DENMARK, NORTH),
(GERMANY, POLAND, EAST),
(GERMANY, FRANCE, WEST),
(GERMANY, AUSTRIA, SOUTH-EAST),
(GERMANY, SWITZERLAND, SOUTH-WEST),
(POLAND, GERMANY, WEST),
(FRANCE, GERMANY, NORTH-EAST),
(FRANCE, SWITZERLAND, EAST),
(FRANCE, ITALY, SOUTH-EAST) ...}
Cross Product of Ternary Relation: If we
work out the Cartesian product of COUNTRY
(14 elements), NEIGHBOR_COUNTRY (13
elements) and DIRECTION (8 elements), we
arrive at 14 X 13 X 8 = 1456 elements. But our
Relation IS_NEIGHBOR_TO (ternary) has
the same number of elements as
IS_NEIGHBOR_TO (binary).
Hence a Relation does not grow with the number of
elements in its sets or with the number of sets.
Rather, the number of elements in a Relation could come
down as we add more sets, because we are
adding more filtering.
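A short Python sketch makes the comparison concrete. It assumes tuples of length three for the ternary Relation and uses only a fragment of the tuples listed above; dropping the DIRECTION column recovers the binary Relation with the same number of tuples.

```python
# Fragment of the ternary IS_NEIGHBOR_TO, copied from the listing above.
TERNARY = {
    ("NORWAY", "SWEDEN", "EAST"),
    ("NORWAY", "FINLAND", "NORTH-EAST"),
    ("SWEDEN", "NORWAY", "WEST"),
    ("SWEDEN", "FINLAND", "EAST"),
    ("DENMARK", "GERMANY", "SOUTH"),
    ("GERMANY", "DENMARK", "NORTH"),
}

# Drop the DIRECTION column to get back the binary Relation.
BINARY = {(country, neighbor) for (country, neighbor, _direction) in TERNARY}

# Size of the full Cartesian product of the three sets in the paper.
print(14 * 13 * 8)                  # 1456
# Adding the third set did not add tuples to the Relation.
print(len(TERNARY), len(BINARY))    # 6 6
```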
V. SOME MORE PROPERTIES OF
RELATIONS
A Relation starts as a binary Relation, that is,
with two sets. When more sets are added to
the Relation, its degree goes up to n-ary. In
Relational Databases a table could exist with just
one column, but it is of limited use. We
normally work with tables of nth degree.
Furthermore, a Relation could become part of
another Relation, such as:
COUNTRY_POPULATION =
{COUNTRY, POPULATION}
and
IS_NEIGHBOR_TO =
{COUNTRY_POPULATION,
NEIGHBOR_COUNTRY}
[Figure: Set COUNTRY and Set POPULATION form
Relation COUNTRY_POPULATION; Relation
COUNTRY_POPULATION and Set NEIGHBOR_COUNTRY
form Relation IS_NEIGHBOR_TO]
COUNTRY_POPULATION is a Relation of Set-COUNTRY
and Set-POPULATION, and in turn
Relation IS_NEIGHBOR_TO is built from
Relation COUNTRY_POPULATION and Set
NEIGHBOR_COUNTRY.
a. RELATIONAL, WHOLLY
RELATIONAL AND NOTHING BUT
RELATIONAL
Whatever operations we perform on Relations, the
result should again be a Relation. That is the
CLOSURE property of Relations. It is like adding
Kilograms to Kilograms: the result is in
Kilograms. We do not add weight to length to
arrive at speed, except perhaps in the Theory of Relativity.
Closure is fundamental to Relational Databases. All
8 Relational operators defined in Codd’s
Relational Algebra take Relations as input and
give Relations as output.
CLOSURE of Relation: The following are the 8
Relational operators as defined by Codd in
Relational Algebra:
Relational Algebra
Operator      Description
Restrict      Returns a Relation consisting of all tuples that
              satisfy a specified condition.
Project       Returns a Relation consisting of all tuples that
              remain after eliminating specified attributes.
Product       Returns a Relation consisting of all possible
              combinations of two tuples, one from each of
              the two Relations.
Union         Returns a Relation consisting of all tuples
              appearing in either or both of two specified
              Relations.
Intersect     Returns a Relation consisting of all tuples
              appearing in both of the specified Relations.
Difference    Returns a Relation consisting of all tuples
              appearing in the first Relation but not in the
              second Relation.
Join          Returns a Relation consisting of all tuples
              produced by a Natural Join of the two Relations.
Divide        Takes two Relations, one binary and one unary,
              and returns a Relation consisting of all values
              of one attribute of the binary Relation that
              match all values of the unary Relation.
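The closure property can be sketched in Python for a few of these operators. This is an illustration under simplifying assumptions, not Codd's formal definitions: Relations are plain sets of tuples, attributes are addressed by position rather than by name, and join matches the last attribute of the first Relation with the first attribute of the second. All function names are hypothetical.

```python
def restrict(rel, predicate):
    """Restrict: tuples that satisfy the predicate."""
    return {t for t in rel if predicate(t)}

def project(rel, positions):
    """Project: keep only the attributes at the given positions."""
    return {tuple(t[i] for i in positions) for t in rel}

def product(r, s):
    """Product: every tuple of r combined with every tuple of s."""
    return {a + b for a in r for b in s}

def join(r, s):
    """Simplified join: last attribute of r equals first attribute of s."""
    return {a + b[1:] for a in r for b in s if a[-1] == b[0]}

NEIGHBOR = {("DENMARK", "GERMANY"), ("GERMANY", "POLAND"), ("POLAND", "UKRAINE")}

german = restrict(NEIGHBOR, lambda t: t[0] == "GERMANY")
countries = project(NEIGHBOR, [0])
two_steps = join(NEIGHBOR, NEIGHBOR)
more = NEIGHBOR | {("FRANCE", "SPAIN")}   # Union of two binary Relations

# Every result is again a set of tuples, so operators can be chained.
print(project(restrict(two_steps, lambda t: t[0] == "DENMARK"), [0, 2]))
```

Because each operator returns a Relation, the output of one can be fed directly into another, as in the last line.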
All the information about the database should
also be in Relations. That is the first principle of
Relational Databases. Even the metadata about
the database is represented in Relations only. No
information about the database is stored outside the
database. It is a whole world of Relations. The
data sub-language SQL (Structured Query
Language) is built to work on Relations and
Relations only. Its results are
Relations again.
Information about Relations is stored in a
Relation, which is why Relational Databases are
so popular. Where else can we find such
simplicity?
b. WHY DUPLICATE VALUES ARE NOT
ALLOWED IN SETS?
With hindsight we can now answer the
question of why duplicate values are not allowed in
Sets. A Set is an unordered collection of
non-duplicate elements.
For example:
A = {1, 2, 2, 3, 5, 6} is always reduced to A =
{1, 2, 3, 5, 6} and
A = {1, 2, 3, 5, 6} is the same as A = {2, 3, 6, 5, 1}
The order of elements does not matter in sets. The
same holds for the rows of a table: rows in a table are
unordered. Also, we know that when there are
duplicate rows in a table, there is every chance
that they will generate a Cartesian product when
joined with other tables. Tables, when traced
back to sets, similarly do not allow duplicate
values.
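Both points can be seen in a few lines of Python. The sketch below assumes Python's built-in set for Sets and a list for a table that wrongly holds a duplicate row; the join helper is hypothetical and matches rows on COUNTRY and NEIGHBOR_COUNTRY.

```python
# Duplicates collapse and order is irrelevant in a set.
A = {1, 2, 2, 3, 5, 6}
print(A == {1, 2, 3, 5, 6})   # True
print(A == {2, 3, 6, 5, 1})   # True

# A list keeps the duplicate row; a set, like a Relation, does not.
neighbors_with_duplicate = [("GERMANY", "POLAND"), ("GERMANY", "POLAND")]
directions = [("GERMANY", "POLAND", "EAST")]

def join(pairs, triples):
    """Match rows on (COUNTRY, NEIGHBOR_COUNTRY) and attach DIRECTION."""
    return [(c, n, d)
            for (c, n) in pairs
            for (c2, n2, d) in triples
            if (c, n) == (c2, n2)]

print(len(join(neighbors_with_duplicate, directions)))        # 2
print(len(join(set(neighbors_with_duplicate), directions)))   # 1
```

The duplicate row doubles the join output, while the set-based version returns the single expected row.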
VI. SUMMARY
Unlike Hierarchical and Network Databases,
which were the predecessors of Relational
Databases, data representation in Relational
Databases is very simple.
The beauty of Relational Databases lies in their
simplicity. Only simple concepts last forever.
Codd took the simple subject of Sets and Relations
and led us forward to a brave new world of
Relational Databases. As C. J. Date
says, “A hundred years from now, I’m quite sure
database systems will still be based on Codd’s
Relational foundation”.
VII. APPENDIX
Set Theory jargon:
Symbol               Description
≠                    Inequality
⊆                    Subset of
∈                    Belongs to
∉                    Does not belong to
DOMAIN               Domain of a Relation is the identity field of the
                     Relation. In other words, it becomes the primary
                     key in the physical implementation of Relational
                     Databases.
RANGE                Range of a Relation is the value of the Relation.
CARTESIAN PRODUCT    Cartesian-Product or Cross-Product is the product
or CROSS PRODUCT     of two Sets, where each of the m elements of Set-A
                     is paired with each of the n elements of Set-B.
                     Hence A X B has m x n elements.
VIII. REFERENCES
1. E. F. Codd, “Normalized Data Base Structure: A Brief
Tutorial”, IBM Research Laboratory, San Jose, California
2. E. F. Codd, “A Data Base Sublanguage
Founded on the Relational Calculus”, Proc.
1971 ACM-SIGFIDET Workshop on
Data Description, Access and Control
3. E. F. Codd, “A Relational Model of
Data for Large Shared Data Banks”,
Comm. ACM 13(6), June 1970
4. E. F. Codd, “Further Normalization of
the Data Base Relational Model”,
Courant Computer Science Symposia 6,
“Data Base Systems”, New York City,
1971
5. C. J. Date, “An Introduction to
Database Systems”, Sixth Edition
6. Paul Maxim, “Obituary to E. F. Codd.ppt”
7. C. J. Date, a tribute to E. F. Codd
8. Seymour Lipschutz, “Set Theory and
Related Topics”, Schaum’s Outlines,
Second Edition
9. Ramon A. Mata-Toledo and Pauline K. Cushman,
“Fundamentals of Relational Databases”,
Schaum’s Outlines