Natural vs. Surrogate Keys in SQL Server:
Getting the Proper Perspective
By Ron Morgan - sql-server-performance.com/2014
I once walked into a bar, and saw two construction workers pounding each other to a pulp. The
argument was over what was the better tool—a hammer or screwdriver. I feel a similar
sensation when I see SQL developers arguing over whether to use natural or surrogate
keys. Few other arguments in database design can cause tempers to flare so quickly. Fans of
surrogates consider anyone who uses a natural key a drooling idiot, whereas natural key acolytes
believe the use of a surrogate warrants burning alive at the stake.
Which side is right? Neither. Both natural and surrogate keys have their own place, and a
developer who doesn’t use both as the situation demands is shortchanging both himself and his
applications.
Definitions
A natural key is simply a column or set of columns in a table that uniquely identifies each
row. Natural keys are a feature of the data, not the database, and thus have business meaning on
their own. Quite often a natural key is more than a single column. For instance, the natural key
for a table of addresses might be the five columns: street number, street name, city, state, and zip
code.
What is a surrogate key? Most people will define it as some variation of “a system-generated
value used to uniquely identify a row”. Unlike natural keys, surrogates have no business
meaning. In SQL Server, by far the most common technique for generating surrogate values is
the ubiquitous IDENTITY column.
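As a rough illustration of the two definitions, the same address data could be keyed either way. This is only a sketch: the table names, column names, and column sizes are assumptions made for the example, not taken from any real schema.

```sql
-- Natural key: the five address columns together identify each row.
-- All names and sizes here are illustrative assumptions.
CREATE TABLE AddressesNatural
(
    StreetNumber VARCHAR(10) NOT NULL,
    StreetName   VARCHAR(60) NOT NULL,
    City         VARCHAR(50) NOT NULL,
    StateCode    CHAR(2)     NOT NULL,
    ZipCode      CHAR(5)     NOT NULL,
    CONSTRAINT PK_AddressesNatural
        PRIMARY KEY (StreetNumber, StreetName, City, StateCode, ZipCode)
);

-- Surrogate key: a system-generated IDENTITY value identifies each row
-- and carries no business meaning of its own.
CREATE TABLE AddressesSurrogate
(
    AddressID    INT IDENTITY(1,1) NOT NULL,
    StreetNumber VARCHAR(10) NOT NULL,
    StreetName   VARCHAR(60) NOT NULL,
    City         VARCHAR(50) NOT NULL,
    StateCode    CHAR(2)     NOT NULL,
    ZipCode      CHAR(5)     NOT NULL,
    CONSTRAINT PK_AddressesSurrogate PRIMARY KEY (AddressID)
);
```

In the second table, other tables would reference the 4-byte AddressID rather than repeating all five address columns.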
While the above definition is true, there’s another very important part to it. The value of a
surrogate must never be exposed to the outside world. Users should never be allowed to see the
key, under any conditions. Displaying the value of a surrogate key on a report, allowing it to be
viewed on a form, or even accepting it as a search term: these are all forbidden. Once you expose a
surrogate key, it immediately begins acquiring business meaning.
Smart Keys: The Worst of Both Worlds
A smart key is an artificial key with one or more parts that contain business meaning. For
instance, an employee table where each primary key begins with the initial of the employee
(“JD1001” for John Doe), or a table of paint products, where the key identifies the can size,
color, and paint type, e.g. “1G-RED-LATEX”.
Smart keys are a sort of hybrid between natural and surrogate keys. They’re seductively
attractive to many developers, but should be avoided like the plague. They tend to make your
design very brittle and subject to failure as business rules change.
Note: if your data already contains meaningful product codes or other keys such as those
described above, then they are simply natural keys and the above caveat doesn’t apply. It’s a
smart key only when the value is constructed by the developer.
Benefits of Natural Keys
A natural key is…well, natural. Since its values already exist in the data, using a natural key
means you don’t have to add and maintain a new column. This also means smaller tables and
lower storage requirements. As more rows fit on a database page, it can sometimes mean better
performance. (It can also mean worse; more on this later.) In practical terms, however, the space
savings are minor, except for very narrow tables. Using a surrogate key also usually means an
additional index is required.
Generating sequential key values is inherently a serial process, so using a natural key rather than
an IDENTITY column can be a performance boost for inserts, especially in OLTP environments.
Since natural key values are used as foreign keys in child tables, it can mean the elimination of
joins for queries that require no other columns from the parent other than the natural key.
One benefit of natural keys often claimed by their proponents is that they can aid in
self-documenting your database schema. Explicitly naming each natural key documents what
specifically identifies each row in a table, and joining on natural keys helps to identify the
natural relationships between tables. This argument based on elegance appeals strongly to those
working in academia; it may or may not have much value for developers working in the dirt and
grime of real production systems.
Benefits of Surrogate Keys
Since surrogate values are controlled by the system, you never have to worry about duplicate,
missing, or changing values. They’re also an easy and reliable way to join tables; when writing
queries, you never need worry about remembering which combination of columns is the natural
key. However, some of these benefits are less compelling than they seem. I’ll discuss
separately each case for using a surrogate, and whether or not it holds up.
When no natural key on the table exists – Yes.
If the table has no unique identifier, then you must create one. Unkeyed tables are in general a
very bad idea. There are exceptions such as logging or summary tables in which rows are only
inserted, never updated. Otherwise, if your table doesn’t have a unique key—create one.
When the natural key can change – Sometimes.
Immutability is certainly a desirable feature for a key, but it’s by no means a
requirement. Using this as an excuse is the trademark of a lazy developer. Data changes. If the
key changes, update it. If it’s being used as a foreign key in another table – update that table
too. Cascading updates exist for a reason, after all.
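A minimal sketch of a cascading update in SQL Server follows. It assumes a hypothetical Customers table keyed on an email address (a natural key that can change) and an Orders table that references it; none of these names come from the article.

```sql
-- Hypothetical parent table keyed on a natural value that may change.
CREATE TABLE Customers
(
    Email        VARCHAR(100) NOT NULL PRIMARY KEY,
    CustomerName VARCHAR(100) NOT NULL
);

-- The child's foreign key asks SQL Server to propagate key changes.
CREATE TABLE Orders
(
    OrderNumber   INT          NOT NULL PRIMARY KEY,
    CustomerEmail VARCHAR(100) NOT NULL,
    CONSTRAINT FK_Orders_Customers
        FOREIGN KEY (CustomerEmail) REFERENCES Customers (Email)
        ON UPDATE CASCADE
);

INSERT INTO Customers (Email, CustomerName) VALUES ('jdoe@example.com', 'John Doe');
INSERT INTO Orders (OrderNumber, CustomerEmail) VALUES (1, 'jdoe@example.com');

-- Changing the key in the parent also rewrites the matching child rows.
UPDATE Customers
SET Email = 'john.doe@example.com'
WHERE Email = 'jdoe@example.com';

-- The order row should now carry the new email address.
SELECT OrderNumber, CustomerEmail FROM Orders;
```

Each cascaded update touches every referencing row, which is where the cost comes from when a key changes often or is referenced by many tables.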
Obviously if a key changes very often, or will be used as a FK for many other tables, there can
be performance or concurrency implications to making it the primary key. In this case, you do
want to consider a surrogate, and use it if appropriate.
There’s one particular case where the stability of a surrogate key actually works against you. For
lookup tables, particularly those containing selection options for given fields, changes to the
lookup value are often not meant to be cascaded into child tables. For example, an application
may store the “referral source” for new customers or marketing leads, whether they were
generated via an ad in a newspaper or magazine, a yellow pages entry, word of mouth,
etc. These referral codes can be very specific, and are normally stored in a lookup table. Once
set, the code should be preserved historically, even if the original lookup table value is updated
or removed. This is behavior very difficult to achieve with a surrogate key, but trivial with a
natural key.
When natural key values can be missing or duplicated – No.
This is probably the most misunderstood aspect of the natural vs. surrogate debate. Natural keys
are unique by definition. If it isn’t a serious error for a value to be missing or duplicated, then
that value isn’t a natural key to begin with. And if it is an error, then you’re almost always better
off trapping that error at the database level, rather than allowing that bad data into your DB.
As an example, consider a table of employees keyed off Social Security Number. Users are
complaining the database throws errors when they don’t have an SSN or mistakenly enter a
duplicate. So you replace the SSN PK with a surrogate and smugly conclude you’ve solved the
problem. But have you? Now some employees don’t have SSNs, and the accounting module
starts failing when printing tax records … or worse, collates all the NULL SSN entries together,
reporting them as a single employee. The search function starts returning the wrong rows
because some employees are sharing the same SSNs, and the new hire in the mailroom gets the
boss’s paycheck, because someone in HR accidentally cut and pasted an SSN.
In reality, all you’ve done is short-circuit out the data integrity safeguards in your database, and
pass the responsibility for the problem up to the application level. Bad move.
These sorts of problems exist because most tables have a uniqueness requirement at the application
level. A surrogate key only solves the uniqueness problem at the database level, but users
(who cannot and should not see the surrogate value) still don’t have a way to uniquely identify
each record. This also explains why, even if you choose to use a surrogate key, you will
usually want to add a unique constraint on the original natural key as well, since uniqueness is no
longer being automatically enforced by the PK.
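To make the point concrete, here is a small sketch of a surrogate-keyed table that still enforces uniqueness on the natural key. The table name EmployeesDemo, the constraint name, and the sample values are assumptions made for illustration.

```sql
-- Surrogate primary key plus a unique constraint on the natural key.
-- EmployeesDemo and UQ_EmployeesDemo_SSN are hypothetical names.
CREATE TABLE EmployeesDemo
(
    EmployeeID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    SSN        CHAR(9)     NOT NULL,
    Lastname   VARCHAR(50) NOT NULL,
    CONSTRAINT UQ_EmployeesDemo_SSN UNIQUE (SSN)
);

INSERT INTO EmployeesDemo (SSN, Lastname) VALUES ('123456789', 'Doe');

BEGIN TRY
    -- Same SSN again: IDENTITY would happily hand out a new EmployeeID,
    -- but the unique constraint rejects the row.
    INSERT INTO EmployeesDemo (SSN, Lastname) VALUES ('123456789', 'Roe');
END TRY
BEGIN CATCH
    -- Error 2627 is SQL Server's unique/primary key constraint violation.
    SELECT ERROR_NUMBER() AS ErrorNumber, ERROR_MESSAGE() AS ErrorMessage;
END CATCH;
```

Without the unique constraint, the second insert would succeed and the table would hold two employees sharing one SSN.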
But wait a minute! What if your business rules specifically require employees to be input before
you have their SSN data? Or what if your table holds overseas employees who may not have an
SSN at all? Does that mean you can’t use a natural key? Maybe. One possibility is to assign
your own unique values in these cases. One system I’ve seen used randomly generated
alphabetic values for temporary SSNs, whereas the standard numeric value identified a “real”
one. Better yet is to examine your table for some other column or columns that can be used as a
natural key. Or maybe you really do want to drop natural keys altogether. The point here is not
that surrogates should never be used, but simply that if your natural key isn’t unique, you are
going to have problems beyond those that a surrogate will solve.
When the natural key is very wide, or a composite of multiple columns – Sometimes.
Wide keys make for fat indexes. Fat indexes have performance implications. A very wide
key can hurt performance far more than the extra space required by a surrogate. Replacing a
composite key with a surrogate also simplifies your queries, but this should never be a primary
consideration. It’s poor form to replace a natural key of two CHAR(2) columns with an INT
IDENTITY, for no other reason than it makes your queries prettier.
One common case where a multi-column natural key should always be used is the so-called
junction table: a table used to implement a many-many relationship between two other
tables. Most junction tables have only two columns, each a FK back to a parent table. The
combination of these two FKs is itself the primary key for the table. Adding a surrogate to a
table like this is just asking for trouble.
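A junction table of the kind described might look like the sketch below. The Students and Courses tables are hypothetical stand-ins for the two parent tables.

```sql
-- Two hypothetical parent tables in a many-to-many relationship.
CREATE TABLE Students
(
    StudentID   INT          NOT NULL PRIMARY KEY,
    StudentName VARCHAR(100) NOT NULL
);

CREATE TABLE Courses
(
    CourseID    INT          NOT NULL PRIMARY KEY,
    CourseTitle VARCHAR(100) NOT NULL
);

-- Junction table: each column is a foreign key to one parent,
-- and the pair of foreign keys is itself the primary key.
CREATE TABLE StudentCourses
(
    StudentID INT NOT NULL REFERENCES Students (StudentID),
    CourseID  INT NOT NULL REFERENCES Courses (CourseID),
    CONSTRAINT PK_StudentCourses PRIMARY KEY (StudentID, CourseID)
);
```

The composite primary key already prevents the same student from being linked to the same course twice, which a surrogate column alone would not do.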
For performance reasons.
This is the trickiest question of all. Replacing a wide key with a narrower value means smaller
indexes, and more values retrieved from every index page read. This does boost
performance. However, you’ll usually retain the index on the natural key (to enforce uniqueness
if nothing else) and that means another index to maintain. If your table is very narrow, the
additional column for the surrogate can noticeably impact performance and storage
requirements. Finally, some queries that would not have required JOINs with a natural foreign
key may now need them. For instance, our employee SSN example might have a child table
containing reported hours worked:
Table: ReportedHours
StartTime DATETIME
StopTime DATETIME
EmployeeID (Foreign key to Employees table)
If the EmployeeID FK is SSN, then we can retrieve a list of total hours by SSN from this table
alone:
SELECT EmployeeID, SUM(DATEDIFF(hour, StartTime, StopTime)) AS TotalHours
FROM ReportedHours
GROUP BY EmployeeID
With EmployeeID as a surrogate key, however, we must JOIN back to the Employees table:
SELECT e.SSN, SUM(DATEDIFF(hour, h.StartTime, h.StopTime)) AS TotalHours
FROM ReportedHours h
JOIN Employees e ON h.EmployeeID = e.EmployeeID
GROUP BY e.SSN
Performance Testing
Three different examples highlighting three different aspects of the performance issue are
tested. As you will see, neither surrogate nor natural keys win in all cases. Note: the examples
shown here are ‘corner cases’, designed to highlight performance differences. In real world
databases, the differences you see are likely to be smaller than those shown here.
Test Case #1: OLTP Data Insertion
A sample employee table is created, using SSN as primary key. Two client sessions (on
separate machines) are simultaneously started, each inserting 100,000 rows of random test
data. The test is then rerun with an IDENTITY column as primary key, and a unique constraint
added on SSN.
Test Table 1: Natural Key
-- Natural key design: SSN is the primary key, so only one index is maintained
CREATE TABLE Employees
(
SSN CHAR(9) PRIMARY KEY,
Firstname VARCHAR(50),
Lastname VARCHAR(50),
Date1 DATETIME NOT NULL DEFAULT GETDATE(),
Int1 INTEGER NOT NULL DEFAULT 0,
Char1 VARCHAR(50)
)
Test Table 2: Surrogate Key
-- Surrogate key design: IDENTITY primary key plus a unique constraint on SSN (two indexes)
CREATE TABLE Employees
(
EmployeeID INT IDENTITY PRIMARY KEY,
SSN CHAR(9) UNIQUE NOT NULL,
Firstname VARCHAR(50),
Lastname VARCHAR(50),
Date1 DATETIME NOT NULL DEFAULT GETDATE(),
Int1 INTEGER NOT NULL DEFAULT 0,
Char1 VARCHAR(50)
)
Test Results (average of three runs)
· Natural Key Insert: 39.1 sec.
· Surrogate Key Insert: 46.5 sec. (19% slower)
Before I ran this test, I expected the vast majority of any performance differential would be due
to maintaining two indexes, rather than one. However, when I reran it without the unique
constraint on SSN, the difference was almost exactly half the original, meaning contention on the
IDENTITY column is a significant factor.
Test Case #2: Narrowing a Wide Index
A query is used to join a child table to its parent by a 40-byte three-column foreign key. The child table
contains one million rows, and the parent 300K rows. Key values are created randomly. The test is
then rerun using a 4-byte IDENTITY column to perform the JOIN. To force an index seek (rather than an
index scan) a WHERE clause is used to limit rows retrieved to 1% of the table total.
Table Schema:
CREATE TABLE Parent
(
SurrogateKey INT IDENTITY NOT NULL,
NatKeyPart1 CHAR(32) NOT NULL,
NatKeyPart2 INT NOT NULL,
NatKeyPart3 INT NOT NULL,
MiscData VARCHAR(100)
)
-- One index per join path: the 4-byte surrogate and the 40-byte natural key
CREATE INDEX ix_Parent ON Parent(SurrogateKey)
CREATE INDEX ix_ParentParts ON Parent(NatKeyPart1,NatKeyPart2,NatKeyPart3)
CREATE TABLE Child
(
SurrogateKey INT NOT NULL,
NatKeyPart1 CHAR(32) NOT NULL,
NatKeyPart2 INT NOT NULL,
NatKeyPart3 INT NOT NULL,
ChildData VARCHAR(100) PRIMARY KEY CLUSTERED
)
Test Results (average of three runs)
· Natural Key Join: 11.1 sec. (21% slower)
· Surrogate Key Join: 9.2 sec.
Test Case #3: Join Elimination via Natural Key
The same data and methodology from Test Case #2 are used. The difference here is that the
query references only natural key values from the Parent table, rather than all columns. This
means when the natural key is used as the foreign key, no join to the Parent table is necessary;
the query can be fully serviced by the Child table.
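The article does not show the test queries themselves, so the following is only a guess at their general shape, written against the Parent and Child tables from Test Case #2. The filter column and the @Low/@High bounds are placeholders; how the 1% of rows was actually selected is not stated.

```sql
-- Placeholder filter bounds (assumed; the real selection criteria are not given).
DECLARE @Low INT = 0, @High INT = 1000;

-- Natural foreign key: the key values already live in Child,
-- so the Parent table never has to be touched.
SELECT c.NatKeyPart1, c.NatKeyPart2, c.NatKeyPart3, c.ChildData
FROM Child c
WHERE c.NatKeyPart2 BETWEEN @Low AND @High;

-- Surrogate foreign key: the natural key values must be fetched
-- from Parent, so the join cannot be avoided.
SELECT p.NatKeyPart1, p.NatKeyPart2, p.NatKeyPart3, c.ChildData
FROM Child c
JOIN Parent p ON p.SurrogateKey = c.SurrogateKey
WHERE p.NatKeyPart2 BETWEEN @Low AND @High;
```

Both queries return the same columns; the difference is that the first reads a single table.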
Test Results (average of three runs):
· Natural Key (no join): 2.9 sec.
· Surrogate Key (join): 9.2 sec. (about 217% slower, or roughly 3.2 times as long)
Conclusion
This article has hopefully demonstrated that the choice of natural vs. surrogate key is a complex
issue, with no single answer that fits all scenarios. Armed with the information above, however,
you can make better decisions about which is the best solution for your own particular needs.

More Related Content

What's hot

Slide 5 keys
Slide 5 keysSlide 5 keys
Slide 5 keysVisakh V
 
Database Keys & Relationship
Database Keys & RelationshipDatabase Keys & Relationship
Database Keys & RelationshipBellal Hossain
 
Keys in Database Management System
Keys in Database Management SystemKeys in Database Management System
Keys in Database Management SystemAnkit Rai
 
ER model to Relational model mapping
ER model to Relational model mappingER model to Relational model mapping
ER model to Relational model mappingShubham Saini
 
ER DIAGRAM TO RELATIONAL SCHEMA MAPPING
ER DIAGRAM TO RELATIONAL SCHEMA MAPPING ER DIAGRAM TO RELATIONAL SCHEMA MAPPING
ER DIAGRAM TO RELATIONAL SCHEMA MAPPING ARADHYAYANA
 
5. relational structure
5. relational structure5. relational structure
5. relational structurekhoahuy82
 
ER DIAGRAM & ER MODELING IN DBMS
ER DIAGRAM & ER MODELING IN DBMSER DIAGRAM & ER MODELING IN DBMS
ER DIAGRAM & ER MODELING IN DBMSssuser20b618
 
Mapping ER and EER Model
Mapping ER and EER ModelMapping ER and EER Model
Mapping ER and EER ModelMary Brinda
 
27 f157al5enhanced er diagram
27 f157al5enhanced er diagram27 f157al5enhanced er diagram
27 f157al5enhanced er diagramdddgh
 
Database - Entity Relationship Diagram (ERD)
Database - Entity Relationship Diagram (ERD)Database - Entity Relationship Diagram (ERD)
Database - Entity Relationship Diagram (ERD)Mudasir Qazi
 
Er model ppt
Er model pptEr model ppt
Er model pptPihu Goel
 

What's hot (20)

Dbms keys
Dbms keysDbms keys
Dbms keys
 
Types of keys dbms
Types of keys dbmsTypes of keys dbms
Types of keys dbms
 
Slide 5 keys
Slide 5 keysSlide 5 keys
Slide 5 keys
 
Dbms keysppt
Dbms keyspptDbms keysppt
Dbms keysppt
 
Database keys
Database keysDatabase keys
Database keys
 
Database Keys & Relationship
Database Keys & RelationshipDatabase Keys & Relationship
Database Keys & Relationship
 
B & c
B & cB & c
B & c
 
Keys in Database
Keys in DatabaseKeys in Database
Keys in Database
 
Keys in Database Management System
Keys in Database Management SystemKeys in Database Management System
Keys in Database Management System
 
ER model to Relational model mapping
ER model to Relational model mappingER model to Relational model mapping
ER model to Relational model mapping
 
ER DIAGRAM TO RELATIONAL SCHEMA MAPPING
ER DIAGRAM TO RELATIONAL SCHEMA MAPPING ER DIAGRAM TO RELATIONAL SCHEMA MAPPING
ER DIAGRAM TO RELATIONAL SCHEMA MAPPING
 
5. relational structure
5. relational structure5. relational structure
5. relational structure
 
Entity relationship diagram
Entity relationship diagramEntity relationship diagram
Entity relationship diagram
 
Er diagrams presentation
Er diagrams presentationEr diagrams presentation
Er diagrams presentation
 
Entity relationship diagram for dummies
Entity relationship diagram for dummiesEntity relationship diagram for dummies
Entity relationship diagram for dummies
 
ER DIAGRAM & ER MODELING IN DBMS
ER DIAGRAM & ER MODELING IN DBMSER DIAGRAM & ER MODELING IN DBMS
ER DIAGRAM & ER MODELING IN DBMS
 
Mapping ER and EER Model
Mapping ER and EER ModelMapping ER and EER Model
Mapping ER and EER Model
 
27 f157al5enhanced er diagram
27 f157al5enhanced er diagram27 f157al5enhanced er diagram
27 f157al5enhanced er diagram
 
Database - Entity Relationship Diagram (ERD)
Database - Entity Relationship Diagram (ERD)Database - Entity Relationship Diagram (ERD)
Database - Entity Relationship Diagram (ERD)
 
Er model ppt
Er model pptEr model ppt
Er model ppt
 

Similar to Natural vs.surrogate keys

Difference between fact tables and dimension tables
Difference between fact tables and dimension tablesDifference between fact tables and dimension tables
Difference between fact tables and dimension tablesKamran Haider
 
Intro to Data warehousing lecture 12
Intro to Data warehousing   lecture 12Intro to Data warehousing   lecture 12
Intro to Data warehousing lecture 12AnwarrChaudary
 
dotnetMALAGA - Sql query tuning guidelines
dotnetMALAGA - Sql query tuning guidelinesdotnetMALAGA - Sql query tuning guidelines
dotnetMALAGA - Sql query tuning guidelinesJavier García Magna
 
Tips for Database Performance
Tips for Database PerformanceTips for Database Performance
Tips for Database PerformanceKesavan Munuswamy
 
Advance sqlite3
Advance sqlite3Advance sqlite3
Advance sqlite3Raghu nath
 
Handling SQL Server Null Values
Handling SQL Server Null ValuesHandling SQL Server Null Values
Handling SQL Server Null ValuesDuncan Greaves PhD
 
Database management system
Database management systemDatabase management system
Database management systemTushar Desarda
 
Design your own database
Design your own databaseDesign your own database
Design your own databaseFrank Katta
 
Referential integrity
Referential integrityReferential integrity
Referential integrityJubin Raju
 
Steps towards of sql server developer
Steps towards of sql server developerSteps towards of sql server developer
Steps towards of sql server developerAhsan Kabir
 
Advance Sqlite3
Advance Sqlite3Advance Sqlite3
Advance Sqlite3Raghu nath
 
Brad McGehee Intepreting Execution Plans Mar09
Brad McGehee Intepreting Execution Plans Mar09Brad McGehee Intepreting Execution Plans Mar09
Brad McGehee Intepreting Execution Plans Mar09guest9d79e073
 
Brad McGehee Intepreting Execution Plans Mar09
Brad McGehee Intepreting Execution Plans Mar09Brad McGehee Intepreting Execution Plans Mar09
Brad McGehee Intepreting Execution Plans Mar09Mark Ginnebaugh
 
10359485
1035948510359485
10359485kavumo
 
Myth busters - performance tuning 102 2008
Myth busters - performance tuning 102 2008Myth busters - performance tuning 102 2008
Myth busters - performance tuning 102 2008paulguerin
 
Optimize access
Optimize accessOptimize access
Optimize accessAla Esmail
 
Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA
Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USARelevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA
Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USALeonardo Dias
 

Similar to Natural vs.surrogate keys (20)

Difference between fact tables and dimension tables
Difference between fact tables and dimension tablesDifference between fact tables and dimension tables
Difference between fact tables and dimension tables
 
Intro to Data warehousing lecture 12
Intro to Data warehousing   lecture 12Intro to Data warehousing   lecture 12
Intro to Data warehousing lecture 12
 
dotnetMALAGA - Sql query tuning guidelines
dotnetMALAGA - Sql query tuning guidelinesdotnetMALAGA - Sql query tuning guidelines
dotnetMALAGA - Sql query tuning guidelines
 
Tips for Database Performance
Tips for Database PerformanceTips for Database Performance
Tips for Database Performance
 
Advance sqlite3
Advance sqlite3Advance sqlite3
Advance sqlite3
 
SQL Joins
SQL JoinsSQL Joins
SQL Joins
 
Handling SQL Server Null Values
Handling SQL Server Null ValuesHandling SQL Server Null Values
Handling SQL Server Null Values
 
Database management system
Database management systemDatabase management system
Database management system
 
Design your own database
Design your own databaseDesign your own database
Design your own database
 
Referential integrity
Referential integrityReferential integrity
Referential integrity
 
joins dbms.pptx
joins dbms.pptxjoins dbms.pptx
joins dbms.pptx
 
Steps towards of sql server developer
Steps towards of sql server developerSteps towards of sql server developer
Steps towards of sql server developer
 
Advance Sqlite3
Advance Sqlite3Advance Sqlite3
Advance Sqlite3
 
Brad McGehee Intepreting Execution Plans Mar09
Brad McGehee Intepreting Execution Plans Mar09Brad McGehee Intepreting Execution Plans Mar09
Brad McGehee Intepreting Execution Plans Mar09
 
Brad McGehee Intepreting Execution Plans Mar09
Brad McGehee Intepreting Execution Plans Mar09Brad McGehee Intepreting Execution Plans Mar09
Brad McGehee Intepreting Execution Plans Mar09
 
Unit 2 DBMS.pptx
Unit 2 DBMS.pptxUnit 2 DBMS.pptx
Unit 2 DBMS.pptx
 
10359485
1035948510359485
10359485
 
Myth busters - performance tuning 102 2008
Myth busters - performance tuning 102 2008Myth busters - performance tuning 102 2008
Myth busters - performance tuning 102 2008
 
Optimize access
Optimize accessOptimize access
Optimize access
 
Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA
Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USARelevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA
Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA
 

Natural vs.surrogate keys

  • 1. Natural vs. Surrogate Keys in SQL Server : Getting the Proper Perspective By RonMorgan -sql-server-performance.com/2014 I once walked into a bar, and saw two construction workers pounding each other to a pulp. The argument was over what was the better tool—a hammer or screwdriver. I feel a similar sensation when I see SQL developers arguing over whether to use natural or surrogate keys. Few other arguments in database design can cause tempers to flare so quickly. Fans of surrogates consider anyone who uses a natural key a drooling idiot, whereas natural key acolytes believe the use of a surrogate warrants burning alive at the stake. Which side is right? Neither. Both natural and surrogate keys have their own place, and a developer who doesn’t use both as the situation demands is shortchanging both himself and his applications. Definitions A natural key is simply a column or set of columns in a table that uniquely identifies each row. Natural keys are a feature of the data, not the database, and thus have business meaning on their own. Quite often a natural key is more than a single column. For instance, the natural key for a table of addresses might be the five columns: street number, street name, city, state, and zip code. What is a surrogate key? Most people will define it as some variation of “a system-generated value used to uniquely identify a row”. Unlike natural keys, surrogates have no business meaning. In SQL Server, by far the most common technique for generating surrogate values is the ubiquitous IDENTITY column. While the above definition is true, there’s another very important part to it. The value of a surrogate must never be exposed to the outside world. Users should never be allowed to see the key, under any conditions. Display the value of a surrogate key on a report, allow it to be viewed on a form or even used as a search term – these are all forbidden. Once you expose a surrogate key, it immediately begins acquiring business meaning. 
Smart Keys: The Worst of Both Worlds A smart key is an artificial key with one or more parts that contain business meaning. For instance, an employee table where each primary key begins with the initial of the employee (“JD1001” for John Doe), or a table of paint products, where the key identifies the can size, color, and paint type, i.e. “1G-RED- LATEX”.
  • 2. Smart keys are a sort of hybrid between natural and surrogate keys. They’re seductively attractive to many developers, but should be avoided like the plague. They tend to make your design very brittle and subject to failure as business rules change. Note: if your data already contains meaningful product codes or other keys such as those described above, then they are simply natural keys and the above caveat doesn’t apply. It’s a smart key only when the value is constructed by the developer. Benefits of Natural Keys A natural key is…well, natural. Since its values already exist in the data, using a natural key means you don’t have to add and maintain a new column. This also means smaller tables and less storage requirements. As more rows fit on a database page, it can sometimes mean greater performance. (It can also mean less- more on this later) . However, in practical terms, the space savings are minor, except for very narrow tables. Using a surrogate key usually means an additional index is required, though. Generating sequential key values is inherently a serial process, so using a natural key rather than an IDENTITY column can be a performance boost for inserts, especially in OLTP environments. Since natural key values are used as foreign keys in child tables, it can mean the elimination of joins for queries that require no other columns from the parent other than the natural key. One benefit of natural keys often claimed by its proponents is that they can aid in self- documenting your database schema. Explicitly naming each natural key documents what specifically identifies each row in a table, and joining on natural keys helps to identify the natural relationships between tables. This argument based on elegance appeals strongly to those working in academia; it may or may not have much value for developers working in the dirt and grime of real production systems. 
Benefits of Surrogate Keys Since surrogate values are controlled by the system, you never have to worry about duplicate, missing, or changing values. They’re also an easy and reliable way to join tables; when writing queries, you never need worry about remembering which combination of columns is the natural key. However, some of these benefits are less compelling than they seem. I’ll discuss separately each case for using a surrogate, and whether or not it holds up. When no natural key on the table exists – Yes. If the table has no unique identifier, then you must create one. Unkeyed tables are in general a very bad idea. There are exceptions such as logging or summary tables in which rows are only inserted, never updated. Otherwise, if your table doesn’t have a unique key—create one. When the natural key can change- Sometimes. Immutability is certainly a desirable feature for a key, but it’s by no means a
  • 3. requirement. Using this as an excuse is the trademark of a lazy developer. Data changes. If the key changes, update it. If it’s being used as a foreign key in another table – update that table too. Cascading updates exist for a reason, after all. Obviously if a key changes very often, or will be used as a FK for many other tables, there can be performance or concurrency implications to making it the primary key . In this case, you do want to consider a surrogate, and use it if appropriate. There’s one particular case where the stability of a surrogate key actually works against you. For lookup tables, particular those containing selection options for given fields, changes to the lookup value are often not meant to be cascaded into child tables. For example, an application may store the “referral source” for new customers or marketing leads, whether they were generated via an ad in a newspaper or magazine, a yellow pages entry, word of mouth, etc. These referral codes can be very specific, and are normally stored in a lookup table. Once set, the code should be preserved historically, even if the original lookup table value is updated or removed. This is behavior very difficult to achieve with a surrogate key, but trivial with a natural key. When natural key values can be missing or duplicated - No. This is probably the most misunderstood aspect of the natural vs. surrogate debate. Natural keys are unique by definition. If it isn’t a serious error for a value to be missing or duplicated, then that value isn’t a natural key to begin with. And if it is an error, then you’re almost always better off trapping that error at the database level, rather than allowing that bad data into your DB. As example, consider a table of employees keyed off Social Security Number. Users are complaining the database throws errors when they don’t have a SSN or mistakenly enter a duplicate. So you replace the SSN PK with a surrogate and smugly conclude you’ve solved the problem. 
But have you? Now some employees don’t have SSNs, and the accounting module starts failing when printing tax records … or worse, collates all the NULL SSN entries together, reporting them as a single employee. The search function starts returning the wrong rows because some employees are sharing the same SSNs, and the new hire in the mailroom gets the boss’s paycheck, because someone in HR accidentally cut and pasted a SSN. In reality, all you’ve done is short-circuit out the data integrity safeguards in your database, and pass the responsibility for the problem up to the application level. Bad move. These sorts of problems exist because most tables have a uniqueness requirement at the tion level. A surrogate key only solves the uniqueness problem at the database level, but users (who cannot and should not see the surrogate value) still don’t have a way to uniquely identify each record. This also explains why when, even if you choose to use a surrogate key, you will usually want to also add a unique constraint on the original natural key, since uniqueness is no longer being automatically enforced by the PK. But wait a minute! What if your business rules specifically require employees to be input before you have their SSN data? Or what if your table holds overseas employees that may not have a SSN at all? Does that mean you can’t use a natural key? Maybe. One possibility is to assign
  • 4. your own unique values in these cases. One system I’ve seen used randomly generated alphabetic values for temporary SSNs, whereas the standard numeric value identified a “real” one. Better yet is to examine your table for some other column or columns that can be used as a natural key. Or maybe you really do want to drop natural keys altogether. The point here is not that surrogates should never be used, but simply that if your natural key isn’t unique, you are going to have problems beyond those that a surrogate will solve. When the natural key is very wide, or a composite of multiple columns – Sometimes . Wide keys make for fat indexes. Fat indexes have performance implications. A very wide key can hurt performance far more than the extra space required by a surrogate. Replacing a composite key with a surrogate also simplifies your queries, but this should never be a primary consideration. It’s poor form to replace the natural key of a two CHAR(2) columns with am INT IDENTITY, for no other reason than it makes your queries prettier. One common case where a multi-column natural key should always be used is the so-called junction table: a table used to implement a many-many relationship between two other tables. Most junction tables have only two columns, each a FK back to a parent table. The combination of these two FKs is itself the primary key for the table. Adding a surrogate to a table like this is just asking for trouble. For performance reasons. This is the trickiest question of all. Replacing a wide key with a narrower value means smaller indexes, and more values retrieved from every index page read. This does boost performance. However, you’ll usually retain the index on the natural key (to enforce uniqueness if nothing else) and that means another index to maintain. If your table is very narrow, the additional column for the surrogate can noticeably impact performance and storage requirements. 
Finally, some queries that did not require JOINs with a natural foreign key may now need them. For instance, our employee SSN example might have a child table containing reported hours worked:

Table: ReportedHours
    Start_Time   DATETIME
    Stop_Time    DATETIME
    EmployeeID   (foreign key to the Employees table)

If the EmployeeID FK is the SSN, then we can retrieve a list of total hours by SSN from this table alone:

    SELECT EmployeeID, SUM(DATEDIFF(hour, Start_Time, Stop_Time)) AS TotalHours
    FROM ReportedHours
    GROUP BY EmployeeID

With EmployeeID as a surrogate key, however, we must JOIN back to the Employees table:

    SELECT e.SSN, SUM(DATEDIFF(hour, Start_Time, Stop_Time)) AS TotalHours
    FROM ReportedHours h
    JOIN Employees e ON h.EmployeeID = e.EmployeeID
    GROUP BY e.SSN

Performance Testing

Three examples are tested, each highlighting a different aspect of the performance issue. As you will see, neither surrogate nor natural keys win in all cases.

Note: the examples shown here are ‘corner cases’, designed to highlight performance differences. In real-world databases, the differences you see are likely to be smaller than those shown here.

Test Case #1: OLTP Data Insertion

A sample employee table is created, using SSN as primary key. Two client sessions (on separate machines) are started simultaneously, each inserting 100,000 rows of random test data. The test is then rerun with an IDENTITY column as primary key and a unique constraint added on SSN.

Test Table 1: Natural Key

    CREATE TABLE Employees (
        SSN CHAR(9) PRIMARY KEY,
        Firstname VARCHAR(50),
        Lastname VARCHAR(50),
        Date1 DATETIME NOT NULL DEFAULT GETDATE(),
        Int1 INTEGER NOT NULL DEFAULT 0,
        Char1 VARCHAR(50)
    )

Test Table 2: Surrogate Key

    CREATE TABLE Employees (
        EmployeeID INT IDENTITY PRIMARY KEY,
        SSN CHAR(9) UNIQUE NOT NULL,
        Firstname VARCHAR(50),
        Lastname VARCHAR(50),
        Date1 DATETIME NOT NULL DEFAULT GETDATE(),
        Int1 INTEGER NOT NULL DEFAULT 0,
        Char1 VARCHAR(50)
    )

Test Results (average of three runs)
· Natural Key Insert: 39.1 sec.
· Surrogate Key Insert: 46.5 sec. (19% slower)

Before I ran this test, I expected the vast majority of any performance differential to be due to maintaining two indexes rather than one. However, when I reran it without the unique constraint on SSN, the difference was almost exactly half the original, meaning contention on the IDENTITY column is a significant factor.

Test Case #2: Narrowing a Wide Index

A query is used to join a child table to its parent by a 40-byte, three-column foreign key. The child table contains one million rows, and the parent 300K rows. Key values are created randomly. The test is then rerun using a 4-byte IDENTITY column to perform the JOIN. To force an index seek (rather than an index scan), a WHERE clause is used to limit the rows retrieved to 1% of the table total.

Table Schema:

    CREATE TABLE Parent (
        SurrogateKey INT IDENTITY NOT NULL,
        NatKeyPart1 CHAR(32) NOT NULL,
        NatKeyPart2 INT NOT NULL,
        NatKeyPart3 INT NOT NULL,
        MiscData VARCHAR(100)
    )

    CREATE INDEX ix_Parent ON Parent(SurrogateKey)
    CREATE INDEX ix_ParentParts ON Parent(NatKeyPart1, NatKeyPart2, NatKeyPart3)

    CREATE TABLE Child (
        SurrogateKey INT NOT NULL,
        NatKeyPart1 CHAR(32) NOT NULL,
        NatKeyPart2 INT NOT NULL,
        NatKeyPart3 INT NOT NULL,
        ChildData VARCHAR(100) PRIMARY KEY CLUSTERED
    )

Test Results (average of three runs)

· Natural Key Query: 11.1 sec. (21% slower)
· Surrogate Key Query: 9.2 sec.

Test Case #3: Join Elimination via Natural Key

The same data and methodology from Test Case #2 are used. The difference here is that the query references only natural key values from the Parent table, rather than all columns. This means that when the natural key is used as the foreign key, no join to the Parent table is necessary; the query can be fully serviced by the Child table.
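The article does not show the queries that were timed, so the following is only a sketch of the two query shapes being compared, written against the Parent and Child tables from Test Case #2. The column list is an assumption, and the WHERE clause that limited the test to 1% of rows is omitted because its predicate is not given.

```sql
-- Surrogate key as the FK: the natural key values are read from Parent,
-- so the query has to JOIN back to it.
SELECT p.NatKeyPart1, p.NatKeyPart2, p.NatKeyPart3, c.ChildData
FROM Child c
JOIN Parent p ON c.SurrogateKey = p.SurrogateKey;

-- Natural key as the FK: the same values are already stored in Child,
-- so Parent is never touched.
SELECT c.NatKeyPart1, c.NatKeyPart2, c.NatKeyPart3, c.ChildData
FROM Child c;
```

As soon as the query asks for a Parent column that is not part of the key, such as MiscData, the natural key version needs the JOIN as well and the advantage disappears.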
Test Results (average of three runs):

· Natural Key Query: 2.9 sec.
· Surrogate Key Query: 9.2 sec. (217% slower; about 3.2 times as long)

Conclusion

This article has hopefully demonstrated that the choice of natural vs. surrogate key is a complex issue, with no single answer that fits all scenarios. Armed with the information above, however, you can make better decisions about which solution is best for your own particular needs.