Σ KEYSUM
© 1997 Data Management & Warehousing
INTRODUCTION
Keysum is a new and interesting
technique (not a product) in the
generation of keys within a database. It
has particular application within Data
Warehouses where keys are often
made up of de-normalised
alphanumeric data.
THE PROBLEMS
Data that has been de-normalised often
has a primary key that is made up of a
single string, a series of concatenated
strings, or other data types that can be
converted to strings. The key is
traditionally costly in terms of storage
requirements and access speed when
used in an index. It is, however, vital to
the usability of the data.
The second issue is that in a data
warehousing environment data may be
loaded and assigned an arbitrary
unique number as a key. If the data
needs to be re-loaded at a later date,
possibly with additions, then it is
impossible to guarantee that the same
arbitrary key will be assigned to the
same row.
THE SOLUTION
The solution is simplicity itself. The
generated key of the row should be the
checksum of the string that makes up
the unique key. This will, depending
on the checksum algorithm chosen,
generate a large integer that will be
nearly unique within the scope of the
data. For example, using the industry-standard CRC32 algorithm will generate a number in the range 0 to 4,294,967,295 (2^32 - 1), whilst using the Message Digest algorithm MD5 will generate a number between 0 and approximately 3.4 * 10^38 (2^128 - 1).
In addition to this, the result can incorporate the length of the original string, which considerably improves the uniqueness of the results from the lower-order algorithms.
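As an illustration, the following C sketch (C being the language in which the author's shared library was written) builds such a key: it computes a bit-wise CRC32 of the concatenated natural key and folds the string length into the upper half of a 64-bit value. The function names, the field separator and the exact way the length is combined are illustrative assumptions, not a definitive implementation.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Illustrative sketch: CRC32 (polynomial 0xEDB88320) of the natural key,
     * with the string length folded into the high 32 bits of the result. */
    uint32_t crc32_of(const char *s, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= (uint8_t)s[i];
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
        return ~crc;
    }

    uint64_t keysum(const char *natural_key)
    {
        size_t len = strlen(natural_key);
        return ((uint64_t)len << 32) | crc32_of(natural_key, len);
    }

    int main(void)
    {
        /* Hypothetical de-normalised key: concatenated customer fields. */
        const char *key = "SMITH|JOHN|WOKINGHAM|RG41 2NU";
        printf("keysum = %llu\n", (unsigned long long)keysum(key));
        return 0;
    }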
HOW DOES THIS HELP?
The table key is now an integer, the
optimal format on which to index. The
user now calls a function to convert the
required string into the checksum and
uses the index to look up the
appropriate row. On very large tables
this is considerably faster than
conventional string look-up.
Furthermore, the data can be validated: if the current checksum differs from the stored checksum then the data has changed. This also works when re-loading data, as any existing data will still be able to reference the old key. Note, however, that when a field within the key is altered the key must be re-generated.
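The fragment below sketches both uses, validation and look-up, assuming the keysum() function from the previous example is linked in; the row structure and the column names mentioned in the comments are hypothetical.

    #include <stdint.h>

    /* Assumes the keysum() function from the earlier sketch is linked in. */
    extern uint64_t keysum(const char *natural_key);

    /* Hypothetical row as held in the warehouse: the stored key was generated
     * from the de-normalised natural key when the row was loaded. */
    struct customer_row {
        uint64_t    stored_key;
        const char *natural_key;    /* concatenated de-normalised fields */
    };

    /* Validation: if the recomputed checksum differs from the stored key,
     * the underlying data has changed since it was loaded. */
    int row_has_changed(const struct customer_row *row)
    {
        return keysum(row->natural_key) != row->stored_key;
    }

    /* Look-up: convert the search string into its keysum and probe the
     * integer index with it, for example (illustrative names only):
     *   SELECT ... FROM customer WHERE customer_key = :computed_keysum   */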
If this technique is used in contexts such as trend analysis within a Data Warehouse, the occasional mismatch caused by a duplicate checksum may well not be statistically significant, and the key can therefore be treated as unique.
WHAT ARE THE ISSUES?
No checksum is guaranteed to be unique. It is therefore possible that two different records can return the same value. Including the length in the checksum still does not guarantee uniqueness, but it further reduces the risk. When choosing a checksum algorithm it is important to consider the number of records for which the checksum will provide a key. If you have a table with 500,000 rows (such as a table that contains addresses) then the chance of a further record duplicating an existing CRC32 value is roughly 500,000 in 2^32, about 8,500:1 against, without considering the length of the original string.
MD5, on the other hand, has a remote 6.8 * 10^32:1 chance of generating a duplicate checksum. This is because it uses 128 bits rather than the 32 bits used by CRC32.
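The arithmetic behind both figures is easy to reproduce: with N existing rows and a B-bit checksum, one further record collides with an existing row with a probability of roughly N in 2^B. The short, purely illustrative program below prints the two approximations quoted above.

    #include <math.h>
    #include <stdio.h>

    /* Rough check on the odds quoted above: with N existing rows and a B-bit
     * checksum, one further record collides with an existing row with a
     * probability of roughly N / 2^B. */
    int main(void)
    {
        const double rows = 500000.0;

        double crc32_odds = pow(2.0, 32.0) / rows;    /* approx.  8,590 : 1 */
        double md5_odds   = pow(2.0, 128.0) / rows;   /* approx. 6.8e32 : 1 */

        printf("CRC32: about %.0f : 1 against a duplicate\n", crc32_odds);
        printf("MD5:   about %.1e : 1 against a duplicate\n", md5_odds);
        return 0;
    }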
When implementing the algorithm it is
important to note that checksums
normally return unsigned integers as
their result. Your database and routines
that access the checksum must all be
able to handle the size of the result and
ensure that they deal with the issue of
signed versus unsigned variables.
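The point is easy to trip over in C and in database bind variables alike. The small sketch below, using an arbitrary example value, shows why a signed 32-bit variable is not wide enough to hold a CRC32 result, whereas a signed 64-bit variable (or a sufficiently wide numeric column) is.

    #include <stdint.h>
    #include <stdio.h>

    /* A CRC32 result is an unsigned 32-bit value.  Forcing it into a signed
     * 32-bit variable can turn it negative; a signed 64-bit variable holds it
     * intact.  The value used here is purely illustrative. */
    int main(void)
    {
        uint32_t checksum = 0xDEADBEEFu;     /* example CRC32 result */

        int32_t narrow = (int32_t)checksum;  /* becomes a negative number */
        int64_t wide   = (int64_t)checksum;  /* preserved: 3735928559 */

        printf("as signed 32-bit: %d\n",   narrow);
        printf("as signed 64-bit: %lld\n", (long long)wide);
        return 0;
    }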
IS THIS FEATURE AVAILABLE NOW?
There is currently no direct implementation of a checksum within the SQL dialects of the major vendors; however, it can be implemented via an external procedure call.
The author has implemented this
technique within an Oracle7™
database. A daemon was created that
took as its input the string and returned
two values, the checksum and the
length. This was connected to the
database via a ‘Database Pipe’. When a
checksum was required a PL/SQL
stored procedure was called that placed
the string into the database pipe and
received the two values, the checksum
and the length, back.
The daemon was also implemented as
a shared library so that it could be
accessed from the command line and
from other utilities that could call a
shared ‘C’ library.
An optional parameter was included to allow the use of different algorithms in different contexts: for example, where only a small data set needs a checksum key, CRC32 may be suitable, whilst MD5 is reserved for the largest data sets.
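One possible shape for such a library interface is sketched below. The function name, the selector values and the out-parameters are assumptions made for illustration, not the author's actual code, and the MD5 branch is only indicated rather than implemented.

    #include <stdint.h>
    #include <string.h>

    extern uint32_t crc32_of(const char *s, size_t len);   /* earlier sketch */

    enum keysum_alg { KEYSUM_CRC32 = 0, KEYSUM_MD5 = 1 };

    /* Shared-library entry point: the caller passes the key string and an
     * algorithm selector and receives the checksum and the length back.
     * Returns 0 on success, -1 if the requested algorithm is unavailable. */
    int keysum_compute(const char *key, int alg,
                       uint64_t *checksum_out, uint32_t *length_out)
    {
        size_t len = strlen(key);
        *length_out = (uint32_t)len;

        switch (alg) {
        case KEYSUM_CRC32:
            *checksum_out = crc32_of(key, len);
            return 0;
        case KEYSUM_MD5:
            /* Here the RSA reference implementation (MD5Init, MD5Update,
             * MD5Final) would be called and part of the 128-bit digest used;
             * omitted to keep the sketch short. */
            return -1;
        default:
            return -1;
        }
    }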
WHERE DO I GET A CHECKSUM
ALGORITHM?
The inevitable answer to this question
is ‘From the Internet’. Any site that
distributes the source for FreeBSD
includes an implementation of CRC32.
MD5 is also widely available.
THE FUTURE DIRECTION
The author hopes that in the future database vendors such as Oracle will add a checksum function to their SQL dialects. Once available as an in-built function, the need to implement checksums via external procedure calls will disappear and performance will improve still further. It will also allow some standardisation in the choice and handling of checksum algorithms.
Data Management & Warehousing is
the trading name of David M Walker, a
freelance Data Warehousing consultant.
Address: 138, Finchampstead Road,
Wokingham, Berkshire,
RG41 2NU, United Kingdom.
WWW: http://www.datamgmt.com
Telephone: +44 (0) 7050 028 911
Fax: +44 (0) 7050 028 912
Copyright © 1997 All rights reserved.
All Copyrights and Trademarks respected
MD5 Copyright © 1991-2, RSA Data Security, Inc.
Oracle7™ is a trademark of Oracle Corporation
WHAT IS THE MD5 MESSAGE-DIGEST ALGORITHM?
MD5 is a message-digest algorithm. The algorithm takes as input a message of
arbitrary length and produces as output a 128-bit "fingerprint" or "message digest" of
the input. It is conjectured that it is computationally infeasible to produce two
messages having the same message digest, or to produce any message having a given
pre-specified target message digest.
The MD5 algorithm is designed to be quite fast on 32-bit machines. In addition, the
MD5 algorithm does not require any large substitution tables; the algorithm can be
coded quite compactly.
Copyright (C) 1991-2, RSA Data Security, Inc. Created 1991. All rights reserved.
