Object Relational Mapping: SE Perspective
Tyler Smith
Undergraduate Honors Thesis
Introduction
This paper provides an in-depth analysis of the software engineering merits of incorporating an Object Relational Mapping (ORM) framework. The discussion is organized around answering the following question: in a persistent, object oriented application, is an ORM framework the optimal method of implementing the persistence layer?
The paper progresses in the manner I expect a developer advocating the use of an ORM framework would present ideas to a software architect. First, I give an overview of persistence in an object oriented context, including a justification of relational databases as a persistence back-end. Second, I describe potential problems associated with the paradigm mismatch between in-memory objects and relational database tables. Third, I demonstrate the methods by which ORM frameworks address these problems. Fourth, I discuss the process of integrating an ORM framework into an existing software project. Finally, I present two case studies which follow real projects through the transition to ORM software.
The intended audience is experienced programmers who are considering options for making
an application persistent. It is written with the expectation that the reader has some formal
knowledge of computer science, such as the difference between the heap and the stack.
This paper does ​not assume familiarity with relational database theory.
My goal is not to establish a single, brief answer to the above question, but rather to build a base of contextual evidence which will allow the reader to assess the usefulness of an ORM framework in his or her own projects.
Why did I choose this topic?
I have had the opportunity to work on two projects which were both in the process of
transitioning to an ORM framework. In both cases, I joined the project after the decision
had been made to use an ORM framework. This paper is an investigation into the
justifications for the use of an ORM framework. By exploring this topic, I hope to gain a better understanding both of ORM technology and of the decision making process used when making significant design changes in a software project.
Object Relational Mapping: SE Perspective
Introduction
Why did I choose this topic?
Part 1: Understanding Object Oriented Persistence
Persistence: a simple example:
Adding a relation
Why do we need databases?
Java with a relational database
Part 1 Conclusions
Part 2: The Object/Relational Mismatch
Problem 1: References Between Classes
Problem 2: Sub/Super Class Relationships
Problem 3: Managing Identity
Problem 4: Developer Expertise
Problem 5: Managing object state
Problem 6: Performance
Problem 7: Changing Database Engine
Part 3: What is an Object Relational Mapping framework, and how can it help?
How does ORM fit into a design?
Problem 1: References Between Classes
Problem 2: Sub/Super class relationships
1. Table per Concrete Class - Implicit polymorphism
2. Table per Concrete Class With Unions
3. Table per Class Hierarchy
4. Inheritance Relationships as Foreign Keys
Django Example:
Problem 3: Managing Identity
Problem 4: Developer Expertise
Problem 5: Managing Object State
Problem 6: Performance
Problem 7: Changing database engines
Part 4: Integrating ORM into the development process:
Design basics
Integration
Top Down
Bottom Up
Middle Out
Meet in the Middle
Integrating ORM in Practice
Part 5: Case Studies
Case Study One: Carlson School of Management Help Desk
Motivations
Design
Implementation
Problems
Improvements
Conclusions
Case Study Two: General Dynamics Six-Delta
Motivations
Selecting an ORM solution
Integration
Problems
Improvements
Developer Reaction
Data Metrics
Looking Back
Part 6: Conclusions
Future Research/Open Questions
Object oriented databases
Inferred mapping
Sources
Special Thanks to:
Appendix A: Source Code
SQL Performance Test
Part 1: Understanding Object Oriented Persistence
The following sections give an overview of the basic application of various methods of
persistence in an object oriented environment, and provide an explanation of the use of a
database in a large scale persistent application. If you are already comfortable with these
topics, feel free to skim this section (the examples will be referenced in part 2). I will
provide examples using pseudo code, Java, Python and UML.
Object oriented programming is now an industry standard for large scale applications. Some
programming languages, such as Java, are entirely object oriented. Others, such as C++,
support object oriented principles alongside traditional functionality. Many large and small
scale software projects now rely on objects to encapsulate data and functional elements of
software.
What is an object?
In the simplest sense, an object consists of a collection of data and methods which operate
on that data. The structure of this data and the functionality of the methods are defined by
a ​class. When a class is ​instantiated, an object is created. The term object oriented
programming refers to the design paradigms which guide the creation of an application
based on objects.
What is persistence?
In an object oriented application, objects exist in memory during the execution of the application. A persistent object is an object which can survive a shutdown of the system. Persistence is the method by which an object becomes persistent. In the following example, I will outline some common persistence methods.
Persistence: a simple example:
Consider a simple class called Car. The Car class defines a set of data values relevant to a
Car: color, engine type, and number of doors. By instantiating this class, we can create a
Car object which lives in memory.
/****************************************************************
* Class Car V.1
* Tyler Smith
*****************/
class Car {
    String color;
    String engine;
    int doors;
}
If we wish to make this object persistent, the easiest method is simply to write the data fields of the Car to a file. Files are stored in the file system and can persist through system shutdowns. To make our Car object persistent, we could create a file consisting of key-value pairs. A key-value pair is a set of two elements: a key, which names the related object field, and a value, which is the value of that field for a given object.
#####################################################
# Car File V.1
# Tyler Smith
#############
color = "red"
engine = "V6"
doors = "4"
In a simple application, this strategy could be very effective. It would also be very simple to implement, as it only requires two new methods, writeToFile() and readFromFile().
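As a rough sketch of how little machinery this requires (the method names match those above, but the file format handling and the use of Java 11's Files.writeString are my own assumptions), the two methods might look like this:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

class Car {
    String color;
    String engine;
    int doors;

    // Write the Car's fields to a file as key-value pairs, one per line.
    void writeToFile(Path file) throws IOException {
        String contents = "color = \"" + color + "\"\n"
                + "engine = \"" + engine + "\"\n"
                + "doors = \"" + doors + "\"\n";
        Files.writeString(file, contents);
    }

    // Rebuild a Car by parsing the key-value pairs back out of the file.
    static Car readFromFile(Path file) throws IOException {
        Map<String, String> pairs = new HashMap<>();
        for (String line : Files.readAllLines(file)) {
            String[] parts = line.split("=", 2);
            if (parts.length == 2) {
                pairs.put(parts[0].trim(), parts[1].trim().replace("\"", ""));
            }
        }
        Car car = new Car();
        car.color = pairs.get("color");
        car.engine = pairs.get("engine");
        car.doors = Integer.parseInt(pairs.get("doors"));
        return car;
    }
}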
Adding a relation
Consider making a modification to the car class. We want to add a variable to store the
owner of the car. We will manage owners with a new class, called Owner:
/****************************************************************
* Class Owner V.1
* Tyler Smith
*****************/
class Owner {
    String name;
}
We must now consider the addition of a relation. A relation is a connection or association between multiple data sets. As we need to reference an Owner, our Car objects cannot exist in isolation; they must contain a data field providing a connection to the owner. In the object oriented environment, implementing this field is trivial: we can add an Owner reference field to the Car class definition.
/****************************************************************
* Class Car V.2
* Tyler Smith
*****************/
class Car {
    String color;
    String engine;
    int doors;
    Owner owner; // This field holds a reference to an Owner object
}
In object oriented programming, this addition has a minimal impact on overall complexity. The owner field simply contains a reference to the in-memory location of an Owner object. However, when considered in the scope of persistent data, this reference introduces a complexity which will garner much more discussion later in this paper: the problem of object identity. During program execution, the identity of an object is simply its location in memory. Every unique object has a unique memory location, which means that a given memory address can only refer to a single object. This allows our Car object to trivially store a reference to a unique Owner.
If we were to simply add a key-value pair for this memory location value, our software
would fail. It would fail because we have no guarantee that the memory location of the
object in one execution will be the same as in a later execution. In fact, it will almost
certainly be different.
We need to save some other value which will guarantee that when we read the Car back
into the application from a file the same Owner will be referenced. We could save the name
of the Owner, but this would not allow us to alter the name of an Owner after it was
assigned to a Car. A better choice is to add an identifier field to the owner, the sole task of
which is to manage the identity of the Owner.
That means our Owner class will look like this:
/****************************************************************
* Class Owner V.2
* Tyler Smith
*****************/
class Owner {
    String name;
    int owner_id;
    // For the purpose of this example, I will not discuss how this id
    // is made to be unique.
}
I will also add an ID to the Car class for the same reason - to make the identity of a Car
object constant throughout transitions to and from a persistent state.
/****************************************************************
* Class Car V.2
* Tyler Smith
*****************/
class Car {
    String color;
    String engine;
    int doors;
    Owner owner; // This field holds a reference to an Owner object
    int car_id;
}
Our key-value Car file will now look like this:
#####################################################
# Car File V.2
# Tyler Smith
#############
color = "red"
engine = "V6"
doors = "4"
owner_id = "2"
car_id = "1"
The Owner file will look like this:
#####################################################
# Owner File V.1
# Tyler Smith
#############
name = "Tyler"
owner_id = "1"
So far, we have established a simple method for creating a persistent data type. We are
able to create and manipulate Car objects, and save and restore them from a persistent
state. We are able to maintain a persistent relationship between a Car and an Owner. In
the next section, I will give an example which demonstrates why databases are critical to
persistent systems.
Why do we need databases?
So far we've only considered a small number of Car objects. What if we were to try to store
100,000 Car objects? As we're currently storing one Car per file, this would require a
100,000 files. Most modern file systems are not optimized to manage this much data.
To combat this problem, we could merge the data into a single file, using a Comma
Separated Value format. In this manner we could consolidate the values into a single file
and reduce the overhead:
#####################################################
# Car File V.3
# Tyler Smith
#############
Color, Engine, Doors, Owner_Id, Car_Id #Headers
red, V6, 4, 1,1
blue, V8, 2, 2,2
This reduces the overhead associated with storing hundreds of thousands of separate files.
Now, consider the problem of loading a specific Car from the file. We would have to either
search through the entire file, or load every Car into memory, and then search our in
memory objects. Both methods are O(n), and do not scale well.
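To make the cost concrete, here is a hedged sketch of the linear scan (the class and method names are mine; the column order follows the CSV example above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class CarCsvSearch {
    // Scan the CSV line by line until the requested car_id is found. Every lookup
    // may read the whole file, so the cost grows linearly (O(n)) with the number
    // of stored Cars.
    static String[] findCar(Path csvFile, int carId) throws IOException {
        for (String line : Files.readAllLines(csvFile)) {
            String[] fields = line.split(",");
            // Car_Id is assumed to be the last column, as in the CSV example above.
            if (fields[fields.length - 1].trim().equals(String.valueOf(carId))) {
                return fields;
            }
        }
        return null; // no matching Car
    }
}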
Further, in the simple state presented, we have no method for maintaining the integrity of
the data. For example, we have no way to control whether two Cars can have the same
owner.
Relational databases were created to solve these problems. A relational database is a specialized data management system designed to handle these complexities. It provides a layer of specialized control between the low level file representation (similar to the above CSV example) and the program needing access to the data.
Contemporary relational databases also provide a standard method of communicating with the database. SQL, or Structured Query Language, is the industry standard method of performing operations on a relational database.
A relational database provides quick access to data, along with the ability to apply
constraints which force the data to conform to specific rules. For example, constraints could
be used to enforce the uniqueness of Cars or Owners, or to require that Cars could only
reference existing Owner objects. Relational databases are now the de facto standard in
persistent applications, object oriented and otherwise.
Java with a relational database
We can now expand our Car class to connect to a database by adding read and write
methods which access the database:
/****************************************************************
* Class Car V.3
* Tyler Smith
*****************/
class Car {
    String color;
    String engine;
    int doors;
    Owner owner;
    int car_id;

    public Car readCarFromDatabase() {
        connection = getConnection();
        connection.runQuery("SELECT FROM car WHERE id = ...");
        // SQL statement to retrieve a Car from the database
    }

    public void writeCarToDatabase() {
        connection = getConnection();
        connection.runQuery("INSERT INTO car ....");
        // SQL statement to add a car to the database
    }
}
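The runQuery() and getConnection() calls above are pseudocode. Using the standard JDBC API, a more concrete sketch of the same two operations might look like the following (the table and column names, connection URL, and credentials are assumptions for illustration, and the Owner lookup is omitted):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class CarStore {
    // Assumed connection settings; a real application would read these from configuration.
    private static final String URL = "jdbc:mysql://localhost/cars";

    private Connection getConnection() throws SQLException {
        return DriverManager.getConnection(URL, "user", "password");
    }

    // Load a Car row by primary key and rebuild the object.
    public Car readCar(int carId) throws SQLException {
        String sql = "SELECT color, engine, doors, owner_id FROM car WHERE car_id = ?";
        try (Connection conn = getConnection();
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setInt(1, carId);
            try (ResultSet rs = stmt.executeQuery()) {
                if (!rs.next()) {
                    return null; // no Car with this id
                }
                Car car = new Car();
                car.car_id = carId;
                car.color = rs.getString("color");
                car.engine = rs.getString("engine");
                car.doors = rs.getInt("doors");
                // The owner would be loaded separately using the owner_id column.
                return car;
            }
        }
    }

    // Insert a Car as a new row.
    public void writeCar(Car car) throws SQLException {
        String sql = "INSERT INTO car (car_id, color, engine, doors, owner_id) VALUES (?, ?, ?, ?, ?)";
        try (Connection conn = getConnection();
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setInt(1, car.car_id);
            stmt.setString(2, car.color);
            stmt.setString(3, car.engine);
            stmt.setInt(4, car.doors);
            stmt.setInt(5, car.owner.owner_id);
            stmt.executeUpdate();
        }
    }
}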
We can now store Car objects between executions of our program, store a relationship
between Cars and Owners, and manage many persistent Cars in an efficient manner.
Part 1 Conclusions
For a large scale, persistent, object oriented application, it is necessary to use some manner
of advanced data storage system. Typically, this system is a relational database. Relational
databases are the industry standard for large scale data management, and provide a
reasonable back-end for object persistence. The Car/Owner example demonstrates the
basics of persistence, and provides a justification for the use of relational databases.
However, it is a very simple example. As the data model increases in complexity, the
difficulty of managing its data grows significantly. I will address these complexities in the
following section.
The primary goal of an Object Relational Mapping framework is not to remove this complexity, but rather to encapsulate it within a well vetted tool. Developers working at the object level can then leave the low level details of data management to the ORM tool and focus on the intent of the overall application. In part two, I outline some examples of complexities addressed by ORM tools.
Part 2: The Object/Relational Mismatch
There is a fundamental disparity between the way data is stored in a relational database
and the way data is stored in objects. Object oriented programming provides a framework
with which software can be built to reflect real world analogues. Relational databases
provide a fast, structured and reliable method of saving and restoring data. In a system
which requires both a database and object oriented development, there must be some
resolution whereby data that exists in the form of abstract objects can be preserved in a
relational database. Thus we need some method of correctly mapping object data to
database tables even in complex situations. It is from this need that Object Relational
Mapping tools were born.
In the following section, I will present a set of problems which demonstrate this paradigm
mismatch. I will follow this with a discussion of how ORM tools address these issues. Note
that I do ​not assume the existence of a home brewed ORM persistence layer in these
discussions.
Problem 1: References Between Classes
In a relational database, everything is stored in tables. In a manner very similar to the CSV file example above, all of the instances of a class are stored as rows. In an object, local data is stored in a linear manner, but references to other objects are not. As seen in the Car
example, an object can have a reference to another object. References can exist both as an
aggregation relationship, where an object references but does not own another object, or as
a composition relationship, where an object owns other objects. [Fowler 68] These
relationships do not have a simple or implicit mapping to a table structure. This is the
fundamental problem which spurred the development of ORM frameworks.
In the car example, I provided a system with a fairly natural mapping to a persistent state.
The most basic elements of an object can be trivially mapped to rows in a table.
Simple Car Table (Matches Car V.1)
Color Engine Doors Car_Id
Red V8 4 1
However, if we consider the addition of the reference to an Owner, the mapping becomes less direct. As shown in part 1, one method is to add an Id to the Owner, and have each Car save the Id of its Owner.
Car Table (Version 2)
Color Engine Doors Car_Id Owner_Id
Red V8 4 1 1
Owner Table
Name Owner_Id
Ted 1
In a relational database, this type of relationship is called a ​foreign key reference. Notice
however that there is some ambiguity in this design. For example, we could create a
functionally equivalent relationship by giving each Car an Id, and having each Owner store
the Id of a Car.
Car Table (Version 3)
Color Engine Doors Car_Id
Red V8 4 1
Owner Table
Name Owner_Id Car_Id
Ted 1 1
Neither of these methods is wrong. Both will allow us to load the objects from the database,
and recreate the relationship between them. However, there is not necessarily a database
mapping implicit from the class definitions. A relationship which is straightforward in object
oriented terms is not necessarily as straightforward when applied to a database.
The choice of relationship mapping also has effects on ​multiplicity. Multiplicity is the number
of objects associated with each part of a relationship. For example, a single Owner could
own multiple Cars. If we store the objects in the manner described by the Car table version
2, a Car can only have one owner, but an owner can have multiple cars. Similarly, in Car
table version 3, a Car can have many Owners, but an Owner can only have one car. We call
these relationships One-to-Many relationships.
Suppose we wanted to allow Owners to have multiple Cars, and Cars to have multiple
Owners. In object oriented programming, this is simple to implement. Each Car object can
contain a variable length set of references to Owner objects, and each Owner object
contains a variable length set of references to Car objects. These are called Many-to-Many
relationships.
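In object oriented code, this bidirectional Many-to-Many relationship reduces to a pair of collections (a minimal sketch; the field names are mine):

import java.util.HashSet;
import java.util.Set;

class Car {
    // A Car can be owned by any number of Owners.
    Set<Owner> owners = new HashSet<>();
}

class Owner {
    // An Owner can own any number of Cars.
    Set<Car> cars = new HashSet<>();
}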
A relational database table has a fixed set of columns, so a row cannot hold a variable length collection of references. This means we need some other method of mapping these relationships; we
need a ​join table. A join table resolves the connection in a Many-to-Many relationship. A join
table for this relationship between Cars and Owners might look like this:
Car/Owner Join Table (Version 1)
Car Id Owner Id
1 2
2 2
1 3
In this table, Car 1 is owned by Owners 2 and 3, and Owner 2 owns both cars 1 and 2.
In object oriented programming, we would not necessarily need to define a datatype to
manage this relationship. In object definitions, variable length sets are easy to manage, as
are bi-directional inter-object relationships. In the simplest implementation, Cars and
Owners can just contain a set of references to each other. However, to handle the transition
to a relational database, we had to create a new table with the sole purpose of managing
the connection between Car and Owner objects. This demonstrates that relationships
defined in object oriented terms do not necessarily have a one to one mapping to their
representation in the database.
The Car/Owner relationship above is an ​aggregation - the Car and Owner exist as
independent objects, and the deletion of one would not imply the deletion of the other. Now
suppose we wished to regard a Car object as a ​composition of various parts. For example,
suppose we added an Engine class:
/****************************************************************
* Class Engine V.1
* Tyler Smith
*****************/
class Engine {
    String name;
    float liters;
    int engine_id;
}
The updated Car class:
/****************************************************************
* Class Car V.4
* Tyler Smith
*****************/
class Car {
    String color;
    Engine engine; // note we replaced the String engine with an Engine object
    int doors;
    Owner owner;
    int car_id;
    // Database access methods left out for clarity
}
Once again, there is no direct mapping of this object oriented relationship to a database. We could choose to give Engine its own table in the database:
Engine Table (Version 1)
Name Liters Engine_Id
V8 5.7 1
Car Table (Version 4)
Color Engine_Id Doors Car_Id
Red 1 4 1
Because the Car/Engine relationship is a composition, deleting the Car also means deleting the Engine. As the two are connected directly, it could be more efficient to implement everything in the Car table:
Car Table (Version 5)
Color Doors Car_Id Engine_Name Liters Engine_Id
Red 4 1 V8 5.7 1
Again, neither implementation is strictly correct or incorrect. Using two tables provides a
direct mapping from object definition to table definition, and using one table increases
efficiency, and enforces the intended nature of the relationship.
The problem of inter-object relationships demonstrates the first paradigm mismatch between object oriented programming and relational databases. Objects manage relationships in a very different manner from relational databases. There is not necessarily a one-to-one mapping between class definitions and table definitions. This means that in an object oriented, persistent context, steps must be taken to account for this disparity during an object's transition from in-memory to persistent.
Problem 2: Sub/Super Class Relationships
The object oriented principle of inheritance presents another situation for which there is no
implicit relationship between objects and database tables. Suppose we wish to create
another class called Truck, and a superclass Vehicle of which both Car and Truck are
subclasses. The Vehicle superclass will allow us to manage data shared by Car and Truck in
one place, while allowing type-specific data to be managed at the subclass level.
Once again, this presents an ambiguity. We could create a mapping which ignores the superclass entirely by creating a Car table and a Truck table which duplicate the fields in the superclass. Or we could create a superclass table which contains the superclass fields, and references to the subclass tables. [Bauer 193]
These problems are further complicated if we consider the possibility of an Owner containing
a reference to a generic Vehicle. How should the relationship be mapped? Once again, there
is no single correct method.
Problem 3: Managing Identity
As mentioned in part 1, in a purely object oriented environment, object identity is trivial*.
Each object has a location in memory, so determining if two references point to the same
object is clear. However, when we transfer an object to and from a persistent state, we
can't trust the reference location to demonstrate identity. Suppose we create two car
objects, each with all of the same values. Without adding some additional structure, the
identity of any given object which has been re-loaded from a persistent state is ambiguous.
For example, using class Car V.1, suppose I create a Car with color = red, doors = 4 and engine = V6. If (in the same program execution context) I create another Car with color = red, doors = 4, and engine = V6, they are clearly not the same Car - they will have different memory locations. However, if I save them and reload them in a new execution of the program, I cannot trust the memory locations. How can I tell whether they are the same Car or different Cars? The program needs some additional structures and/or functions to manage identity in a persistent context.
*There are situations where identity is non-trivial in object oriented programs, but in most
cases the memory-location comparison is sufficient.
Problem 4: Developer Expertise
The difference between object data and tabular data from a database is a critical conceptual
distinction. For developers working in an object oriented, database persisted application, it
is critical that they understand both possible representations of the data. For example, this
means understanding that a Car object containing a set of Owners will not have those
owners directly mapped to its row in the database - a row in the Owner table would contain
a reference to the car, or there might be a row establishing their relationship in a join
table.
This requirement of a wider breadth of knowledge will make it harder for developers to become specialized in a given subset of the program, and will mean more training before new developers are comfortable with the system.
Further, consider the problem of developer training. A project which is primarily focused on application functionality, but requires some persistence, would still require developers to be familiar with both database functionality and the functional requirements of the program. This extra requirement adds significant overhead to the project.
Problem 5: Managing object state
In object oriented programming, objects typically have two states, live and dead. If a
process modifies a live object (provided concurrency is properly managed) we can be
confident that the change will take effect.
Objects in a persistent application can exist in four states: transient, persistent, detached, and removed* [Bauer 386]. A transient object is an object without a database identity - it has been created but has not yet been saved to the database. A persistent object is an object with a database identity that is currently managed by the persistence layer. A detached object was persistent but is no longer being managed, so its contents are not necessarily consistent with what is stored in the database - the two are no longer synchronized. A removed object has been scheduled for deletion; once the deletion is carried out it no longer exists in the database, and the in-memory instance is awaiting garbage collection. We need infrastructure to manage the state of the objects, to make sure edits to an object are not lost or overwritten.
Any persistent application needs to manage these states. For example, consider the problem of concurrent updates to a persistent object. In a solely object oriented program, concurrency can be managed at the execution level, using a semaphore or other control structure. If two processes (or machines) read an object into memory from a persistent state, and each makes a different edit to the object, we have a potential race condition as the changes are saved to the database. This can happen even if the accesses to the object are offset by a significant time delay; one process can simply overwrite the changes of the other (by default, it has no way to know that the object it has in memory is stale). There needs to be some infrastructure to manage concurrency at the data access level.
*Note that these are the states defined by Hibernate, other ORM solutions may have
different states.
Problem 6: Performance
Invoking methods on an in-memory object is typically very fast. This is because memory
access has very low overhead compared to accessing a database. Working with a database,
every query run by the program has a significant time cost. A design which does not use
smart query ordering and structuring can have a big performance impact.
To test this impact, I wrote a simple time test program (see appendix A). My program
demonstrates the difference between executing a new query every time a new value is
needed and executing a single query to get all of the needed values at the same time. Even
testing on a local database (low network overhead), the single query was on average three
times faster than running a new query for each needed value. Any persistence
implementation needs to account for the potential performance issues associated with query
ordering and structuring.
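The appendix code is not reproduced here, but the shape of the comparison is roughly the following JDBC sketch (the table and column names follow the Car example, and the row count and timing mechanics are my own):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class QueryTimingSketch {
    static void compare(Connection conn) throws SQLException {
        // One round trip per value: issue a separate query for every Car.
        long start = System.nanoTime();
        try (PreparedStatement stmt =
                     conn.prepareStatement("SELECT color FROM car WHERE car_id = ?")) {
            for (int id = 1; id <= 1000; id++) {
                stmt.setInt(1, id);
                try (ResultSet rs = stmt.executeQuery()) {
                    rs.next();
                }
            }
        }
        long perValue = System.nanoTime() - start;

        // One round trip total: fetch all of the needed values in a single query.
        start = System.nanoTime();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT car_id, color FROM car")) {
            while (rs.next()) {
                // consume each row
            }
        }
        long single = System.nanoTime() - start;

        System.out.printf("per-value: %d ms, single query: %d ms%n",
                perValue / 1_000_000, single / 1_000_000);
    }
}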
Managing how much of an object is loaded can also be very important for performance. Consider
an object which owns a large set of other objects. Suppose we only need to make a small
edit to the object, unrelated to the large set of objects. A 'dumb' system would simply load
everything in the object, needed and otherwise. We need some way to load a subset of the
object and make our change without loading all of the data into memory.
Consider the problem of making a series of subsequent updates to an object. In a raw SQL
system (without significant optimization structure), each update would need its own
UPDATE statement. This potentially adds a lot of overhead (especially if the entire object
must be loaded into memory each time). We need some way to intelligently manage queries
such that they are executed in a logical manner, and with a minimal number of connections
to the database.
Problem 7: Changing Database Engine
In the lifespan of a product, it is likely that some components will change. For example,
suppose you built a database product based on SQL Server 2008. Then, some time later, management decides that SQL Server 2008 can no longer be used, and you need to move to an open source database engine. This means several things: you must find out if and where differences exist between the two database engines, and search for any places in your source code where those differences might affect the program.
For example, suppose you have a column defined as VARCHAR(MAX) in one of your table definitions. There is no VARCHAR(MAX) in MySQL. Thus, to be compatible with MySQL, you will need to search your source code for places where VARCHAR(MAX) is used, and replace them with a valid MySQL column type (such as TEXT).
This problem is even worse if you have custom database functions or datatypes, or if you
need to support multiple database types simultaneously.
Part 3: What is an Object Relational Mapping framework,
and how can it help?
Object relational mapping is the process of transferring data stored in objects to relational
database tables. An ORM framework is a tool which manages this transfer. Any persistent
application already has some semblance of ORM, typically in the form of methods such as
those seen in the above examples - methods which perform operations on the persistent
data. An ORM framework is a method of abstracting the details of these Create, Read,
Update, and Delete operations (commonly referred to as CRUD) such that the high level
implementation does not need to know how they are carried out.
From Java Persistence with Hibernate: "In a nutshell, object/relational mapping is the
automated (and transparent) persistence of objects in a Java application to the tables in a
relational database, using metadata that describes the mapping between the objects and
the database." [Bauer 25]
Object relational mapping software started to gain prominence in the late 1990s with tools such as TopLink (later acquired by Oracle). ORM was originally most popular in Java, but in recent years support has expanded to most contemporary object oriented languages.
How does ORM fit into a design?
In the above Car example, we had a simple two tier system: a Java implementation and a database back-end. In a two tier system, all of the implementation for both the application and data persistence is stored together, in this case in the Car object. If we add an ORM tool to this application, we can hide the details of the persistence functionality.
Now the Car class does not need to know how the data is being retrieved, or how to map
object data to and from tabular data - the ORM framework handles all of these details. This
means that the complexities addressed in part 2 can be hidden from the high level Car
implementation.
I will now outline how an ORM framework can handle the problems outlined in part 2. Note
that not all ORM frameworks implement solutions to all of these problems. I won't provide
extremely low level implementation details, as they can differ between implementations.
Rather, I will give a design-level explanation of how ORM solutions can assist in these
issues. My explanation will be based primarily on Hibernate, the ORM solution employed by
my project at General Dynamics, and Django, the ORM tool I used at the Carlson School of Management.
It is important to note that an ORM framework does not necessarily need to be created by a
third party. Everything I outline here ​can be done by hand. Many of the advantages can be
achieved by establishing a multi tier architecture and hiding the data implementation layer.
However, my discussion is based on the use of a third party tool, as most project leads
likely do not have the time or budget to implement an entire ORM solution.
Problem 1: References Between Classes
References between classes in object oriented software present a problem, as the mapping
of these relationships to a relational database is ambiguous. ORM tools handle these
ambiguities by employing user provided annotations or mapping files to clarify complex
relationships.
In the Hibernate ORM, mapping data provided by the user is used to tell the tool specifically how some objects relate to others. Alongside the data fields in a class, the user provides annotations which specify the nature of the relationship. For example, let's add a Hibernate annotation to our Car class:
/****************************************************************
* Class Car V.4 (Java)
* Tyler Smith
*****************/
class Car {
    // Other fields hidden for clarity.
    // Hibernate annotation to map the relationship.
    // In this case we are saying that many Cars can reference a single Owner.
    @ManyToOne(targetEntity = Owner.class)
    Owner owner;
}
This annotation tells Hibernate exactly how the relationship should be mapped to the
database. Note that the developer still needs to understand how relationships work in a
database - the ORM solution won't allow developers to blindly trust that the objects will be
mapped correctly.
In Django, the functionality is very similar. The user provides information to the ORM to
specify the nature of the relationship:
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" Class Car V.4 (Python)
" Tyler Smith
""""""""""""""""""""""
Class Car(models.Model):
#Other fields hidden for clarity
#In this example, we have a many to many relationship between Car and
Owner
owner = models.ManyToManyField(Owner)
Note that ORM tools are not able to make complex design assumptions, such as establishing whether a relationship is an aggregation or a composition. Instead, the ORM framework provides methods of formalizing such relationships within the class definition, as seen above. Thus the burden of resolving the mapping ambiguity still falls on the developer, but after the mapping is decided, its nature is clear and formalized.
When relationships are formalized in an object definition, the ORM framework can employ
this definition whenever operations are requested. A developer does not need to explicitly
consider the nature of each relationship held by an object to perform operations elsewhere
in the code. The ORM framework can keep track of the relationships associated with a given
object, and manage the data accordingly.
Problem 2: Sub/Super class relationships
This mapping difficulty is not automatically solved in all ORM frameworks; the user needs to make some design decisions as to how inheritance is handled. The Hibernate documentation provides four possible mapping strategies for handling inheritance in persistent classes [Bauer 191] (an annotation-based sketch follows the list):
1. Table per Concrete Class - Implicit polymorphism
In this method, we tell Hibernate to map one table for each concrete (non-abstract)
class. All properties, inherited and local, are mapped to columns. Hibernate can
automatically generate queries for polymorphic method calls against the superclass. This
solution is effective for models with very little inheritance, but presents risks. For example,
if Car and Truck both inherit from an abstract superclass Vehicle, and a User has a reference
to a Vehicle, how do we map this relationship in the database? We can't have a correctly
constrained yet generic reference to one of two tables. [Bauer 193]
2. Table per Concrete Class With Unions
In this method, each concrete class is mapped to its own table, with columns for both its local and its inherited properties. Unlike implicit polymorphism, Hibernate can answer polymorphic queries against the superclass by combining the subclass tables with a UNION. This also solves the problem of associations with inherited classes, as "Hibernate can use a UNION query to simulate a single table as the target of the association mapping" [Bauer 199].
3. Table per Class Hierarchy
In this method, an entire class hierarchy is mapped to a single table, so shared data lives in one place and no unions or extra queries are needed. However, this method presents some potentially critical problems. Columns for subclass properties must be nullable, as they are not populated in every row. This means that extra caution must be used when accessing the data manually, as many columns could be null for a given row. [Bauer 200]
4. Inheritance Relationships as Foreign Keys
In this method, every class which defines its own properties is mapped to its own table. This
implementation will require more tables and queries, but is normalized. It is simple to
understand, but can result in unnecessary complexity. It involves treating every "is a" relationship as a "has a" relationship in terms of the schema. This means abstract and concrete classes, and even interfaces, can have their own tables. [Bauer 203]
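As the sketch below illustrates (using the JPA-style annotations Hibernate supports; the class and field names follow the Vehicle example in the next section and are otherwise my own), choosing among these strategies is typically a one-line declaration on the root of the hierarchy:

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Inheritance;
import javax.persistence.InheritanceType;

// Strategy 4 (inheritance relationships as foreign keys), selected with a single
// annotation on the root class. The other strategies are chosen the same way:
// InheritanceType.SINGLE_TABLE corresponds to table per class hierarchy, and
// InheritanceType.TABLE_PER_CLASS to table per concrete class with unions.
@Entity
@Inheritance(strategy = InheritanceType.JOINED)
abstract class Vehicle {
    @Id
    @GeneratedValue
    Long id;
    String name;
    String color;
}

@Entity
class Truck extends Vehicle {
    int bedLength;
}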
Django Example:
I couldn't find an explanation in the Django documentation as to how inheritance is
implemented. To find out, I did a simple test, using the following Python classes. Note that
all classes are concrete.
from django.db import models

class Vehicle(models.Model):
    name = models.CharField(max_length=200)
    color = models.CharField(max_length=200)

class Car(Vehicle):
    trunk = models.BooleanField()

class Truck(Vehicle):
    bedLength = models.IntegerField(blank=True, default=0)
Django mapped each class to its own table, as follows:
Vehicle:
Field Type
id Int
name varchar
color varchar
Car:
Field Type
vehicle_ptr_id int
trunk tinyint
Truck:
Field Type
vehicle_ptr_id int
bedLength int
Adding a Car, we see:
Vehicle:
id name color
1 Tyler's Car Red
Car:
vehicle_ptr_id trunk
1 1
When an object is added, its subclass and superclass properties are split up and added to the relevant tables. The tables are then connected via a pointer stored in the subclass table. Note that the subclass does not have its own id field; the superclass stores the id, and two such ids would be redundant. This mapping is an example of inheritance relationships as foreign keys: each subclass table joins back to its superclass table through a foreign key.
Problem 3: Managing Identity
This problem has not actually been completely addressed by ORM frameworks as of this writing. The complexity lies in establishing the identifying feature of an object. The memory location fails, as it will change as soon as the current process is completed. Using the database identifier (primary key) works, but leaves us with an identity-less object until the object is made persistent - in the transient state, the object has no identity, and thus two transient objects of the same type could incorrectly be found equal (null == null).
The accepted solution to this has been to use a ​business key, an identifier similar to the
database primary key, but which can be set before the creation of the relevant row in the
database.[Bauer 397] Use of this method requires domain classes which appropriately
manage equality (typically by overriding the default equals method).
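A minimal sketch of a domain class that manages equality through a business key (the choice of the name field as the business key, and the nullable Integer id, are assumptions made for illustration):

class Owner {
    String name;       // business key: assumed unique and assigned at construction
    Integer owner_id;  // database identity: null while the object is still transient

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Owner)) return false;
        // Compare on the business key, which exists before the row is inserted.
        return name.equals(((Owner) other).name);
    }

    @Override
    public int hashCode() {
        return name.hashCode();
    }
}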
Problem 4: Developer Expertise
The data layer abstraction provided by an ORM framework provides developers working on
high level function code with an interface which allows them access to the data, but shields
them from low level data details. While it is still important that they have an idea of how the
data layer works, developers do not need to have an advanced understanding in order to
perform data operations.
Note that this encapsulation is not limited to third party software - home brewed data
access layers can provide the same benefit.
Problem 5: Managing Object State
ORM tools provide structured methods for managing object state to keep track of object
identity and transience. For example, Hibernate maintains a version field in each row of the
database. Each time a the row is updated, Hibernate automatically checks the version of the
in memory object against the version of the object stored in the database. If they are not
the same, then the object is old and needs to be refreshed before any data can be written.
This is called ​offline record locking.
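In Hibernate this check is declared with a version property; a minimal sketch using the standard @Version annotation (the class and column choices are mine):

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

// On each update Hibernate adds "... and version = ?" to the WHERE clause; if no row
// matches, the in-memory copy is stale and the update is rejected.
@Entity
class Car {
    @Id
    int car_id;

    @Version
    int version; // incremented automatically on every successful update

    String color;
}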
Problem 6: Performance
An ORM framework affords many options for performance improvement. These include
methods such as lazy or eager loading, transaction based database operations, and a
variety of caching options.
Earlier, I discussed the problem of making a small update to a large object. Loading the
entire object (and its children) into memory requires a lot of overhead. ​Lazy loading is a
method of loading only certain parts of an object when needed, as opposed to loading the
entire object into memory. For a large object with many children, loading every object in
the set could mean thousands of unnecessary database reads, and even more if we consider any sets contained within those objects. Lazy loading allows us to load the parent object without loading the children. [Bauer 571] Hibernate implements lazy loading by creating proxy objects in place of the sets, and only instantiating the sets when they are explicitly requested by the user.
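A hedged sketch of how such a lazy collection is declared (JPA-style annotations, which Hibernate supports; collections are lazy by default in Hibernate, so the fetch attribute is shown only to make the intent explicit):

import java.util.HashSet;
import java.util.Set;
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.Id;
import javax.persistence.OneToMany;

@Entity
class Owner {
    @Id
    int owner_id;

    // Loading an Owner does not load its cars; Hibernate substitutes a proxy
    // collection and populates it only when the set is actually iterated.
    // "owner" is assumed to be the @ManyToOne field on Car shown earlier.
    @OneToMany(mappedBy = "owner", fetch = FetchType.LAZY)
    Set<Car> cars = new HashSet<>();
}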
In a substantial persistent system, we also need to consider the concept of a transaction. A transaction is a series of operations treated as a single unit of work. For performance reasons, it is often optimal to combine a series of modifications to an object or set of objects into a single statement, or a small series of statements. In Hibernate, transactions are optimized to do minimal work on the database. This means updates may not necessarily be executed in the order the requests appear in the source code.
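Using Hibernate's native API, grouping a series of modifications into one transaction looks roughly like this (a sketch: the SessionFactory setup and the Car mapping are assumed from the earlier examples):

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

class TransactionSketch {
    static void repaintCar(SessionFactory sessionFactory, int carId) {
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        try {
            Car car = (Car) session.get(Car.class, carId);
            car.color = "blue";
            car.doors = 2;
            // Both changes are flushed as one unit of work on commit; Hibernate
            // decides how and when to issue the underlying SQL.
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }
}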
Problem 7: Changing database engines
Most ORM frameworks are written and tested against multiple database back-ends. Typically this allows developers to simply tell the framework which type of back-end is being used, and the low level differences are handled within the tool. This means that if the database back-end changes, the product does not need to change drastically, as the flexibility is already built into the ORM layer.
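For example, with Hibernate the choice of database engine is (ideally) confined to configuration. A hedged sketch of pointing an application at MySQL (the property keys are Hibernate's standard configuration names; the values are assumptions for this example):

import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

class DialectSwitchSketch {
    static SessionFactory buildForMySql() {
        // Mapping files or annotated classes would also be registered on the
        // Configuration; only the engine-specific settings are shown here.
        Configuration cfg = new Configuration()
                .setProperty("hibernate.dialect", "org.hibernate.dialect.MySQLDialect")
                .setProperty("hibernate.connection.driver_class", "com.mysql.jdbc.Driver")
                .setProperty("hibernate.connection.url", "jdbc:mysql://localhost/cars")
                .setProperty("hibernate.connection.username", "user")
                .setProperty("hibernate.connection.password", "password");
        return cfg.buildSessionFactory();
    }
}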
Part 4: Integrating ORM into the development process:
The integration of an ORM framework can take place at many different stages of development. It may happen as early as the design phase, or as late as an update to an already complete product. The decision of when and how to integrate an ORM framework is non-trivial, as the persistence layer of any application is critical to its success.
Design basics
Object oriented persistent projects typically share some key design elements related to
persistence. First, most are built on a domain model (though they may not call it that). A
domain model is a design that has classes corresponding to distinct persistent elements.
The Car example above used a domain model - the class Car is a domain class. [Bauer 107]
Second, many projects also use the Data Access Object (DAO) design pattern. In this
pattern, data access objects live above the domain classes in the class structure, and
provide high level access to data. In projects using the Data Access Object pattern, domain
level classes typically have little or no implementation details, only data fields and mapping
data.
Integration
In ​Java Persistence with Hibernate the authors discuss methods for integrating Hibernate
into a project: [Bauer 40]
Top Down
In a top down development model, an ORM framework is integrated into an existing domain
model. This means the program already has a set of classes defining the persistent data
types, and needs to integrate a method for mapping the persistent classes to the database.
Note that it is assumed that there is not an existing database schema.
Bottom Up
In a bottom up development model, an existing database schema is used as the basis for
integration of an ORM framework, and the development of code to operate on the data. This
would likely occur if a program was required to work on a legacy database - the new system
must be made to work with the existing database schema.
Middle Out
In a middle out model, developers start with a model of the mapping between objects and
tables, and design both the code and the schema based on the mapping. This means that
before the classes are written or the database schema is decided, the specifics of the
mapping between objects and the database are deduced.
Meet in the Middle
In this model, developers start with an existing schema and existing code base, and
integrate an ORM framework. The authors of Java Persistence with Hibernate consider this to
be the most difficult method of integrating Hibernate into a project, as it often requires
significant refactoring to get the domain classes and database schema to agree. "This can
be an incredibly painful scenario, and it is, fortunately, exceedingly rare."
In the next section, I analyze two projects which switched to an ORM implementation from an existing platform - both were established projects. In the first study, we used a bottom up strategy. In the second study, the meet in the middle strategy was used.
Integrating ORM in Practice
In the next section, I describe the process taken by two software projects for integrating an
ORM framework. These case studies serve as real world examples of the processes
described above. At the Carlson School of Management, we used the bottom up strategy,
integrating an ORM tool with an existing database schema. At General Dynamics, we used
the meet in the middle strategy, integrating an ORM framework with an existing schema
and domain model.
Part 5: Case Studies
To provide real world context to the discussion of ORM frameworks, I did two case studies.
The goal of both studies was to investigate whether ORM tools can successfully address the
problems described in part two in a production environment. In both projects I studied, I
focused on the following questions:
1. What was the existing implementation before the integration of an ORM framework?
2. What were the problems/concerns with that implementation?
3. What was the primary motivation for switching to an ORM framework?
4. Why was the ORM framework in question chosen?
5. Was the integration as difficult as anticipated?
6. Has it been successful?
7. What would you do differently?
I gathered this data through my own experience (I worked on both projects), through
interviews with project members, and through code analysis tools.
Case Study One: Carlson School of Management Help Desk
Project Overview
At the Carlson School of Management, I worked on the Laptop Management System. This system is a web based program which manages laptop repairs and equipment checkouts. It tracks over one thousand students and hundreds of laptops, and is used by approximately 15 staff members.
Motivations
The program was originally written in PHP, using a MySQL database. In early 2009, the
decision was made to switch to Django. There were three primary reasons for the switch.
First, Django allows very fast web page development, which would allow us to easily write
report-generation pages. Second, Django has HTML template options which allow more
flexibility in page design. Finally, it was hoped that Django would be easier to maintain, as
many aspects of the PHP based version were fairly old and disconnected.
Design
The original project was implemented with PHP (non-object oriented) and MySQL. The
second version was written in Python, using the ORM tool Django. As we needed to keep
track of legacy data, the database schema remained largely the same.
We used a domain model, implementing each persistent type as a class based on our
existing schema. Django handles all of the communication with the database. We did not
need to use any type of advanced data access structure. Instead, we used built in methods
provided by Django for operations such as selecting a large set of objects. To retrieve a set
of objects from the database, the developer needs only to execute a ​filter operation, which
acts as a SELECT query. To save an object, the developer calls a save() method.
For example, if I want to retrieve all repair tickets with non-null laptops, I can use this line
of python:
all_tickets = Ticket.objects.filter(laptop__isnull=False)
All persistent classes extend the Model superclass. This superclass provides access to
methods such as filter, which returns a subset of the instances of that class saved to the
database. The above line uses a filter to get all Ticket objects from the database where the
laptop field is non-null. The corresponding raw-SQL implementation would require executing
a SELECT statement, followed by a row by row parsing of the returned results, creating an object for each row.
Django also handles all transaction level details - we did not need to worry about whether
an object was persistent or transient, Django managed those details.
Implementation
The implementation took approximately 12 months. Much of this time went to re-implementing the system, as we converted from conventional PHP to
object oriented Python. We also had to write SQL scripts to transfer data from the old
database schema to the new, Django based schema.
The implementation was done in stages. Carl Allen implemented the ticket management
portion first, and then I implemented the equipment management section and report
generation tools. Some of the higher level functionality (forms, etc.) was complicated, but
the data implementation was fairly quick. Typically, adding a class and the associated table
took 2-4 hours.
Problems
Many tricky problems originated from our use of legacy data in the database. In many cases, we had poorly enforced constraints in the legacy data which transferred to the
new database. This meant that the assumptions made by Django regarding data integrity
were often incorrect. For example, Django assumes that if a foreign key reference is
non-null, then the associated object must exist. We often had tickets referencing laptops
which had since been deleted. This resulted in verbose, hard to fix errors. For example, the
following code caused errors when a ticket referenced a non-existent laptop:
all_tickets = Ticket.objects.filter(laptop__isnull=False).order_by('-pk')
for ticket in all_tickets:
    laptop = ticket.laptop
This code parsed all of the tickets in our system with non-null laptop fields, and assigned a local laptop variable to the laptop associated with each ticket. An error was thrown whenever one of the laptops referenced by a ticket did not exist. The error made it clear that a laptop was missing, but gave no information as to which laptop. Carlson has thousands of laptops and thousands of tickets, so to solve this problem I had to bypass Django altogether and execute SQL directly against the database to find the bad laptop references. These errors are often lengthy to read, as most of the lines in the stack trace are within Django source code.
We also had issues stemming from the large degree of control Django has over the application. The most complex of these was a data validation error which would cause the system to silently ignore requests. We think this problem has to do with Django's caching policy, but it has yet to be resolved.
In the PHP implementation, the code to be run with each page request is very
straightforward. In Django, many things happen in the background, hidden from the user
and the developer. While this decreases the amount of code needed to accomplish a given
task, it makes searching for the source of an error much more complex.
Improvements
Using Django allowed us to rapidly add elements to the application. The primary motivation
for the switch was to allow report generation. With Django's built in commands for data
access, we could easily gather and manage a lot of data without lots of low level SQL. We
could just request the data and work with it.
Using Django also allowed us to decrease the size of our code base. We went from 13,000 lines of PHP to about 2,000 lines of Python (as reported by CLOC). There are some minor functionality discrepancies: Django now handles some user management functions which were implemented manually in the older version, and the older version of the software did not have reporting tools.
Conclusions
Overall, users and management have been pleased with the Django based implementation.
However, while our code size and reporting ability have been dramatically improved, errors
have become very hard to trace. Finding the source of an error typically means searching
online for the meaning of the error code, and then manually parsing the database data to
find the row causing the problem. Some errors, such as the silent request rejection error
discussed above, have yet to be resolved. The time spent resolving these errors has
lessened the efficiency we hoped we could gain from Django.
Case Study Two: General Dynamics Six-Delta
Project Overview
Over the past year, I've had the opportunity to do an internship at General Dynamics
Advanced Information Systems in Bloomington, Minnesota. I work on the Six-Delta project,
which is a persistent, Java based application.
Six-Delta uses SQL Server as a back-end. The project has about 12 developers, and about
500,000 lines of code. The program has about 75 persistent classes - classes that are saved
to the database. Six-Delta is based on a domain model of persistence. In the domain model,
the business logic of the system is separated from the data implementation through domain
objects - objects which have corresponding tables in the database.
Motivations
In late 2008, the project moved from a raw SQL based implementation to an Object Relational Mapping framework. I interviewed Six-Delta chief architect Paul Hed and database lead Paul Wehlage to discuss this change.
From an architectural perspective, the motivation to switch to ORM came as part of a general drive to move to an n-tier architecture. Before ORM, Six-Delta was a 2-tier application using custom, direct SQL code to manage persistent data. The program was growing steadily, and continuing to add components would have meant more persistent classes. In a 2-tier architecture, all of the complexity of the data access layer is visible in the functional logic of the program. The goal was to insert an ORM framework as a third layer to manage some of this complexity.
From a data-access perspective, the motivation for ORM came from a desire for better
performance and integrity. The project already had a hand-coded ORM implementation. The
implementation could do some of the specialized operations discussed earlier in this
document, such as lazy loading. However, it was lacking in several key areas. It did not
have a method of offline version checking, which meant that concurrent data access could
result in race conditions and lost changes. It also did not have any concept of a transaction.
Updates to an object were executed field by field, and were very slow.
From a general design perspective, the motivation for ORM came from a desire to decrease complexity in the data access classes. Maintaining the data layer of the implementation was very complex - each persistent class required 5 data maintenance classes. Maintaining the database schema took a lot of work. Even though some support classes could be auto-generated, the system was complex enough that only certain developers had the expertise to make schema changes. As the number of tables increased dramatically (from 15 to close to 50), the complexity and overhead associated with maintaining the home-brewed data access layer grew too great.
Selecting an ORM solution
The team chose an open source ORM framework called Hibernate. Hibernate was considered
alongside two other ORM solutions, Enterprise Java Beans and Java Data Objects. Hibernate
was chosen as it is the most popular open source solution for Java, and is fairly mature.
Enterprise Java Beans was considered too large and complex, and Java Data Objects was
too new.
Using annotations in persistent classes, Hibernate generates SQL code and executes it in
transactions. To save an object, the developer uses a transaction: a series of changes to
persistent objects that is managed by Hibernate. During a transaction, Hibernate generates
SQL statements and executes them against the database.
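As a rough illustration of this workflow, the following is a minimal sketch of a unit of work using Hibernate's Session and Transaction APIs. It assumes a configured SessionFactory and a mapped Car class; both are assumptions for illustration, not Six-Delta code.

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class CarSaver {
    private final SessionFactory sessionFactory; // assumed to be configured elsewhere

    public CarSaver(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    public void saveCar(Car car) {
        Session session = sessionFactory.openSession();
        Transaction tx = null;
        try {
            tx = session.beginTransaction();
            session.save(car); // Hibernate queues the INSERT as part of this unit of work
            tx.commit();       // the generated SQL is executed against the database here
        } catch (RuntimeException e) {
            if (tx != null) tx.rollback(); // undo the unit of work on failure
            throw e;
        } finally {
            session.close();
        }
    }
}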
Integration
The planned integration time was three weeks. While the lead Hibernate developers advocate
against inserting ORM into an existing project with an existing schema, the team was
confident that it could be added quickly, as the data access layer of Six-Delta was
functionally very similar to Hibernate's implementation: both used a domain model with
Data Access Objects (DAOs). Data access objects are objects that provide an interface
to the persistent domain classes.
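For readers unfamiliar with the pattern, the following is a minimal, hypothetical DAO interface; the names are illustrative rather than taken from Six-Delta. Callers work purely in terms of domain objects, while the implementation behind the interface (hand-written SQL before the transition, Hibernate after it) stays hidden.

import java.util.List;

// Illustrative Data Access Object interface for the Car domain class.
public interface CarDao {
    Car findById(long id);               // load a single persistent Car
    List<Car> findByOwner(long ownerId); // load all Cars for a given Owner
    void save(Car car);                  // insert or update
    void delete(Car car);                // remove from the database
}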
Integration was more complex than expected, primarily because of the difficulty of
integrating transactions into all of the Six-Delta data-dependent code. While the base layer of
Hibernate's data access design was consistent with the pre-Hibernate Six-Delta
implementation, Six-Delta had no transactions, and significant refactoring was required
before transactions were successfully implemented. The team also chose to integrate Spring
with Hibernate; Spring provides tools for tighter management of transactions, along with
connection management tools. The integration ultimately took approximately 16 months
before a stable release was available (note that many other software changes took place in
this time; Hibernate was not the only item requiring development time).
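The snippet below is a hedged sketch of the declarative transaction style that Spring adds on top of Hibernate, assuming Spring's annotation-driven transaction management and a Spring-managed SessionFactory are configured; the service class, the changeColor method, and the setColor accessor are hypothetical, not Six-Delta code.

import org.hibernate.SessionFactory;
import org.springframework.transaction.annotation.Transactional;

// Illustrative service class: Spring opens, commits, or rolls back the
// Hibernate transaction around the annotated method, so no explicit
// Transaction handling appears in the business logic.
public class CarService {
    private final SessionFactory sessionFactory; // injected by Spring (assumed configuration)

    public CarService(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    @Transactional
    public void changeColor(long carId, String color) {
        Car car = (Car) sessionFactory.getCurrentSession().get(Car.class, carId);
        car.setColor(color); // the change is flushed when Spring commits the transaction
    }
}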
Problems
As time passed, other elements of Hibernate surfaced which required additional work.
Hibernate does not implement an efficient delete operation: it loads the entire object into
memory before deleting it in the database, so the team had to implement SQL-based delete
operations to allow for efficient deleting. Hibernate does not allow for column defaults in
SQL Server, which meant that Hibernate could not be used to auto-generate a database
schema. Hibernate is slow to initialize, which means that for very small sub-application
elements, Hibernate is too slow to be useful. Hibernate is also unable to manage two
schemas at the same time, which makes database upgrade testing complex. Finally,
Hibernate's debugging options are troublesome: it can print the parameterized skeletons of
the SQL statements it executes (statements look like INSERT INTO table (id, name) VALUES (?, ?)),
but not the complete statements with bound values, so problems can be hard to trace.
While none of these problems were critical, they required more low level coding than would
be ideal in an ORM framework managed data access layer.
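As one example of the kind of low level workaround involved, Hibernate does support HQL-style bulk deletes that run directly in the database without loading the object first; the sketch below assumes a mapped Car entity whose identifier property is named id (an assumption for illustration, not the team's actual fix). For tracing, the hibernate.show_sql and hibernate.format_sql settings print the parameterized statements, though the bound values still appear as question marks unless finer-grained logging is configured.

import org.hibernate.Session;

// Sketch of a bulk delete: the DELETE statement executes in the database,
// so the Car is never loaded into memory first.
public class CarBulkDelete {
    public static int deleteById(Session session, long carId) {
        return session.createQuery("delete from Car where id = :id")
                      .setLong("id", carId)
                      .executeUpdate(); // returns the number of rows deleted
    }
}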
Improvements
Hibernate succeeded in its primary goals. In the current state, it is much easier for
developers to make schema changes without special implementation knowledge. In general
(with the exception of deletion) Hibernate has made database operations much faster.
Finally, Hibernate has dramatically decreased the amount of time required to implement a
schema change.
Developer Reaction
When I first inquired about how Spring and Hibernate were integrated into the program, I
was simply told: magic. Obviously this was hyperbole on the part of the developer I was
speaking to, but the developer mentality implied by the reply was clear: developers were
not familiar with the design of the data access layer. Hibernate errors can be verbose and
confusing, and its integration into the system is not very straightforward. This means that
even experienced developers can have trouble tracing errors that arise in the data
layer, as the layer of abstraction has allowed them to ignore the implementation details.
Data Metrics
I was given permission by Paul Hed to use historical defect data in this report. The following
is a plot of problem and regression reports in the two years before the switch to Hibernate.
[Figure: Problems and Regressions Before Hibernate]
As the current version of the software is larger than the pre-Hibernate version, I also plotted
the regressions and problems relative to the number of lines of code in the last
non-Hibernate release (I don't have access to LOC data for all previous releases, so these
values are all relative to the same release).
[Figure: Problems and Regressions per KLOC]
Overall, we see an average of 47.68 total regressions and problems per month, with an
average of 0.098 total regressions and problems per KLOC. Interestingly, there is a
dramatic spike in problem reports approximately one year before the transition to
Hibernate.
[Figure: Problems and Regressions, Post Hibernate]
There is also a significant spike in December of 2008. Paul Hed noted that this spike was
likely due to complications regarding the transition to Hibernate. The team had to maintain a
non-Hibernate baseline, while implementing transaction management tools in the Hibernate
baseline. This resulted in a lot of code changes.
The spike in April 2010 was due to a major release testing process, which simply found
more bugs - there wasn't a significant change in development, just an increase in testing.
Again, the below values are all relative to a single KLOC value, as I don't have access to
historical KLOC data.
[Figure: Problems and Regressions per KLOC, Post Hibernate]
After the Hibernate release, the average number of regressions and problems per month
dropped to 42.21, and average regressions and problems per KLOC dropped significantly to
0.067. This corresponds to an approximately 12 percent drop in total regressions and
problems, and an approximately 32 percent drop in regressions and problems per KLOC
(noting that KLOC values are not strict).
Despite this improvement, Paul Hed asserted that the pre-Hibernate data access layer was
significantly more stable than the current implementation, largely due to the overall
simplicity of the design. In the Hibernate implementation, the data goes through more steps
and tools as it goes from user to database. This added complexity makes it harder to be
confident in the code.
Note: Lines of code metrics are from the Coverity Prevent static analysis tool.
Looking Back
I asked Paul Wehlage what he would do differently if he could start over with the transition.
He was generally satisfied with Hibernate's performance, but said that spending more time
with it, learning the nuances and tricky spots, would have been beneficial before attempting
integration. He also said more work should have been done to educate developers on
Hibernate.
Part 6: Conclusions
There is unquestionably a significant paradigm mismatch between the object oriented
development and relational databases. Object relational mapping frameworks provide a
compelling solution to this problem.
At the beginning of this document, I posed the question "In a persistent, object oriented
application, is an ORM framework an advantageous method of implementing the persistent
layer?". Through the examples of ORM application, I demonstrated a variety of ways an
ORM framework can assist in implementing complex transitions between the OO world and
relational databases. In the case studies, we saw examples of ORM solutions in action.
While both projects I studied were ultimately successful, both suffered from the same type
of problem: trusting a black-box framework to handle a significant layer of an application
is dangerous. When things work as expected, the ORM framework can be ignored. However,
it is critical that developers working with the system have a solid understanding of the
implementation, so that when errors inevitably arise, the developers charged with finding
a solution are not opening the black box for the first time.
Thus the answer to my question is: an ORM framework can be an immensely helpful tool for
improving the speed and quality of the persistence layer. However, the implementation of
an ORM framework (or any major framework) cannot be a black-box operation; any
developers interacting with it must understand the system. If the system is treated as magic,
the developers will eventually be called on to debug a problem, and it is very hard to debug
magic.
From the perspective of a software architect or project manager, the implication is that the
costs of integrating an ORM framework stretch beyond the initial coding costs. Successfully
integrating an ORM framework requires a commitment to train all of the developers on a
project in the use of the framework. This is not to say that integrating an ORM solution is a
waste of resources; rather, management needs to understand the long term commitment of
including such a complex tool.
Future Research/Open Questions
Object oriented databases
In recent years, relational databases have been the standard for data management. This
paper assumes the reader is planning to use a relational database. However, there are
specialized object oriented databases designed to circumvent the problems caused by the
object relational mismatch. As ORM frameworks become increasingly popular, it would be
interesting to compare the performance of a relational database and ORM framework to an
object oriented database. Performance could be compared both in terms of speed and bug
occurrences.
Inferred mapping
Hibernate allows the developer to map relationships both with XML mapping files and with
in-code annotations. I would like to develop a tool which could analyze user-specified
domain classes and infer the corresponding annotations or mapping files.
Sources
Bauer, Christian, and Gavin King. Java Persistence with Hibernate. Greenwich, Conn.:
Manning, 2007. Print.
Beighley, Lynn. Head First SQL. Beijing: O'Reilly Media, 2007. Print.
Fowler, Martin. UML Distilled: A Brief Guide to the Standard Object Modeling Language.
Boston: Addison-Wesley, 2009. Print.
Johnson, Rod. "J2EE Development Frameworks." Computer 38.1 (Jan. 2005): 107-110.
doi:10.1109/MC.2005.22.
"Version 2.0 (English)." The Django Book. Web. 13 June 2010.
<http://www.djangobook.com/en/2.0/>.
Special Thanks to:
Phil Barry - Primary advisor
Eric Van Wyk and Mats Heimdahl - Secondary advisors.
Paul Wehlage and Paul Hed - General Dynamics engineers who agreed to be interviewed.
Matt Maloney and Garreth McMaster - Carlson School of Management managers who agreed to be
interviewed.
Doug Smith - Reader and database advisor.
Appendix A: Source Code
SQL Performance Test
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/**
 * TimeTest
 * @author Tyler Smith
 *
 * This class tests a series of queries against a database. The queries
 * accomplish the same thing, but one is optimally structured (a single
 * bulk query) and the other is not (four separate single-column queries).
 * The idea is to show the importance of optimal query structuring.
 *
 * Obviously this test is very simplified. However, it does reflect the
 * importance of considering sequential access when performing CRUD
 * operations on object data.
 */
public class TimeTest {
    private static String bulkQuery = "SELECT * FROM [AccessTest].[dbo].[presidents]";
    private static String query1 = "SELECT name FROM [AccessTest].[dbo].[presidents]";
    private static String query2 = "SELECT id FROM [AccessTest].[dbo].[presidents]";
    private static String query3 = "SELECT birthday FROM [AccessTest].[dbo].[presidents]";
    private static String query4 = "SELECT gender FROM [AccessTest].[dbo].[presidents]";

    public static void main(String[] args) {
        Connection con = null;
        try {
            // Load the SQL Server JDBC driver
            Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver").newInstance();
            String url = "jdbc:sqlserver://localhost:1433;" +
                    "user=test.user;password=test.password;" +
                    "databaseName=AccessTest";
            con = DriverManager.getConnection(url);
            Statement st = con.createStatement();

            // Test the single bulk query
            long start_SQL1 = System.nanoTime();
            for (int i = 0; i < 10000; i++) {
                st.executeQuery(bulkQuery);
            }
            long finish_SQL1 = System.nanoTime();
            long net_SQL1 = finish_SQL1 - start_SQL1;
            System.out.println("Net time, single queries = " + net_SQL1);

            // Test the four separate queries
            long start_SQL2 = System.nanoTime();
            for (int i = 0; i < 10000; i++) {
                st.executeQuery(query1);
                st.executeQuery(query2);
                st.executeQuery(query3);
                st.executeQuery(query4);
            }
            long finish_SQL2 = System.nanoTime();
            long net_SQL2 = finish_SQL2 - start_SQL2;
            System.out.println("Net time, multiple queries = " + net_SQL2);

            // Ratio of multiple-query time to bulk-query time
            double ratio = (double) net_SQL2 / (double) net_SQL1;
            System.out.println("Ratio = " + ratio);
        } catch (Exception ee) {
            ee.printStackTrace();
        } finally {
            try {
                if (con != null) con.close();
            } catch (Exception ignore) {
            }
        }
    }
}

More Related Content

Viewers also liked

Frm16 pmi decision agile_yanncoirault
Frm16 pmi decision agile_yanncoiraultFrm16 pmi decision agile_yanncoirault
Frm16 pmi decision agile_yanncoiraultYann Coirault
 
Trabajo practico de freddy
Trabajo practico de freddyTrabajo practico de freddy
Trabajo practico de freddyfreddy71mx
 
Algoritmos selectivos
Algoritmos selectivosAlgoritmos selectivos
Algoritmos selectivoscarlos torres
 
What the hell is imagineering?!
What the hell is imagineering?!What the hell is imagineering?!
What the hell is imagineering?!zuma0000
 
Resume- 4 years exp
Resume- 4 years expResume- 4 years exp
Resume- 4 years expNaren Logi
 
Concurrency presentation
Concurrency presentationConcurrency presentation
Concurrency presentationTed Wentzel
 
Computer Talk presentation
Computer Talk presentationComputer Talk presentation
Computer Talk presentationTed Wentzel
 
Tan Siew Jen Jessica Resume
Tan Siew Jen Jessica ResumeTan Siew Jen Jessica Resume
Tan Siew Jen Jessica Resumejessicatan_91
 
K&S Quality Associates Provides Quality Solutions Matching International Stan...
K&S Quality Associates Provides Quality Solutions Matching International Stan...K&S Quality Associates Provides Quality Solutions Matching International Stan...
K&S Quality Associates Provides Quality Solutions Matching International Stan...qualityassociates1
 
Yalın Sağlık Eğitim Kataloğu 2 Aralık 2016
Yalın Sağlık Eğitim Kataloğu 2 Aralık 2016Yalın Sağlık Eğitim Kataloğu 2 Aralık 2016
Yalın Sağlık Eğitim Kataloğu 2 Aralık 2016Yalın Enstitü Türkiye
 
Bridge communications presentation
Bridge communications presentationBridge communications presentation
Bridge communications presentationTed Wentzel
 
Tech Talk: Customize Reporting Dashboards for New Business Insights
Tech Talk: Customize Reporting Dashboards for New Business InsightsTech Talk: Customize Reporting Dashboards for New Business Insights
Tech Talk: Customize Reporting Dashboards for New Business InsightsCA Technologies
 
Manifiesto metropolitano, Pedro del Piero
Manifiesto metropolitano, Pedro del PieroManifiesto metropolitano, Pedro del Piero
Manifiesto metropolitano, Pedro del PieroAlehayon
 

Viewers also liked (20)

Frm16 pmi decision agile_yanncoirault
Frm16 pmi decision agile_yanncoiraultFrm16 pmi decision agile_yanncoirault
Frm16 pmi decision agile_yanncoirault
 
Resume
ResumeResume
Resume
 
Plan emergencias ce-final
Plan emergencias ce-finalPlan emergencias ce-final
Plan emergencias ce-final
 
Trabajo practico de freddy
Trabajo practico de freddyTrabajo practico de freddy
Trabajo practico de freddy
 
Algoritmos selectivos
Algoritmos selectivosAlgoritmos selectivos
Algoritmos selectivos
 
Plan emergencias ce-final
Plan emergencias ce-finalPlan emergencias ce-final
Plan emergencias ce-final
 
What the hell is imagineering?!
What the hell is imagineering?!What the hell is imagineering?!
What the hell is imagineering?!
 
Resume- 4 years exp
Resume- 4 years expResume- 4 years exp
Resume- 4 years exp
 
Concurrency presentation
Concurrency presentationConcurrency presentation
Concurrency presentation
 
Fiscalidad en internet
Fiscalidad en internetFiscalidad en internet
Fiscalidad en internet
 
Computer Talk presentation
Computer Talk presentationComputer Talk presentation
Computer Talk presentation
 
Oca 3
Oca 3Oca 3
Oca 3
 
Tan Siew Jen Jessica Resume
Tan Siew Jen Jessica ResumeTan Siew Jen Jessica Resume
Tan Siew Jen Jessica Resume
 
Estadística i
Estadística iEstadística i
Estadística i
 
K&S Quality Associates Provides Quality Solutions Matching International Stan...
K&S Quality Associates Provides Quality Solutions Matching International Stan...K&S Quality Associates Provides Quality Solutions Matching International Stan...
K&S Quality Associates Provides Quality Solutions Matching International Stan...
 
Yalın Sağlık Eğitim Kataloğu 2 Aralık 2016
Yalın Sağlık Eğitim Kataloğu 2 Aralık 2016Yalın Sağlık Eğitim Kataloğu 2 Aralık 2016
Yalın Sağlık Eğitim Kataloğu 2 Aralık 2016
 
Resumen 2
Resumen 2Resumen 2
Resumen 2
 
Bridge communications presentation
Bridge communications presentationBridge communications presentation
Bridge communications presentation
 
Tech Talk: Customize Reporting Dashboards for New Business Insights
Tech Talk: Customize Reporting Dashboards for New Business InsightsTech Talk: Customize Reporting Dashboards for New Business Insights
Tech Talk: Customize Reporting Dashboards for New Business Insights
 
Manifiesto metropolitano, Pedro del Piero
Manifiesto metropolitano, Pedro del PieroManifiesto metropolitano, Pedro del Piero
Manifiesto metropolitano, Pedro del Piero
 

Similar to ObjectRelationalMappingSEPerspective

Mapping objects to_relational_databases
Mapping objects to_relational_databasesMapping objects to_relational_databases
Mapping objects to_relational_databasesIvan Paredes
 
Super applied in a sitecore migration project
Super applied in a sitecore migration projectSuper applied in a sitecore migration project
Super applied in a sitecore migration projectdodoshelu
 
Sadcw 7e chapter04_recorded
Sadcw 7e chapter04_recordedSadcw 7e chapter04_recorded
Sadcw 7e chapter04_recordedLamineKaba6
 
Sadcw 7e chapter04(1)
Sadcw 7e chapter04(1)Sadcw 7e chapter04(1)
Sadcw 7e chapter04(1)LamineKaba6
 
Dot Net Fundamentals
Dot Net FundamentalsDot Net Fundamentals
Dot Net FundamentalsLiquidHub
 
Session 3 - Object oriented programming with Objective-C (part 1)
Session 3 - Object oriented programming with Objective-C (part 1)Session 3 - Object oriented programming with Objective-C (part 1)
Session 3 - Object oriented programming with Objective-C (part 1)Vu Tran Lam
 
A350103
A350103A350103
A350103aijbm
 
Enterprise Level Application Architecture with Web APIs using Entity Framewor...
Enterprise Level Application Architecture with Web APIs using Entity Framewor...Enterprise Level Application Architecture with Web APIs using Entity Framewor...
Enterprise Level Application Architecture with Web APIs using Entity Framewor...Akhil Mittal
 
SessionTen_CaseStudies
SessionTen_CaseStudiesSessionTen_CaseStudies
SessionTen_CaseStudiesHellen Gakuruh
 
Repository Pattern in MVC3 Application with Entity Framework
Repository Pattern in MVC3 Application with Entity FrameworkRepository Pattern in MVC3 Application with Entity Framework
Repository Pattern in MVC3 Application with Entity FrameworkAkhil Mittal
 
Object Oriented Analysis and Design with UML2 part2
Object Oriented Analysis and Design with UML2 part2Object Oriented Analysis and Design with UML2 part2
Object Oriented Analysis and Design with UML2 part2Haitham Raik
 
software_engg-chap-03.ppt
software_engg-chap-03.pptsoftware_engg-chap-03.ppt
software_engg-chap-03.ppt064ChetanWani
 
SURE Research Report
SURE Research ReportSURE Research Report
SURE Research ReportAlex Sumner
 
IRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET Journal
 

Similar to ObjectRelationalMappingSEPerspective (20)

Mapping objects to_relational_databases
Mapping objects to_relational_databasesMapping objects to_relational_databases
Mapping objects to_relational_databases
 
Super applied in a sitecore migration project
Super applied in a sitecore migration projectSuper applied in a sitecore migration project
Super applied in a sitecore migration project
 
Sadcw 7e chapter04_recorded
Sadcw 7e chapter04_recordedSadcw 7e chapter04_recorded
Sadcw 7e chapter04_recorded
 
Sadcw 7e chapter04(1)
Sadcw 7e chapter04(1)Sadcw 7e chapter04(1)
Sadcw 7e chapter04(1)
 
L9
L9L9
L9
 
Why hibernater1
Why hibernater1Why hibernater1
Why hibernater1
 
Dot Net Fundamentals
Dot Net FundamentalsDot Net Fundamentals
Dot Net Fundamentals
 
Session 3 - Object oriented programming with Objective-C (part 1)
Session 3 - Object oriented programming with Objective-C (part 1)Session 3 - Object oriented programming with Objective-C (part 1)
Session 3 - Object oriented programming with Objective-C (part 1)
 
Ef overview
Ef overviewEf overview
Ef overview
 
A350103
A350103A350103
A350103
 
Enterprise Level Application Architecture with Web APIs using Entity Framewor...
Enterprise Level Application Architecture with Web APIs using Entity Framewor...Enterprise Level Application Architecture with Web APIs using Entity Framewor...
Enterprise Level Application Architecture with Web APIs using Entity Framewor...
 
SessionTen_CaseStudies
SessionTen_CaseStudiesSessionTen_CaseStudies
SessionTen_CaseStudies
 
Repository Pattern in MVC3 Application with Entity Framework
Repository Pattern in MVC3 Application with Entity FrameworkRepository Pattern in MVC3 Application with Entity Framework
Repository Pattern in MVC3 Application with Entity Framework
 
Jazz
JazzJazz
Jazz
 
Object Oriented Analysis and Design with UML2 part2
Object Oriented Analysis and Design with UML2 part2Object Oriented Analysis and Design with UML2 part2
Object Oriented Analysis and Design with UML2 part2
 
Ad507
Ad507Ad507
Ad507
 
software_engg-chap-03.ppt
software_engg-chap-03.pptsoftware_engg-chap-03.ppt
software_engg-chap-03.ppt
 
Cs2305 programming paradigms lecturer notes
Cs2305   programming paradigms lecturer notesCs2305   programming paradigms lecturer notes
Cs2305 programming paradigms lecturer notes
 
SURE Research Report
SURE Research ReportSURE Research Report
SURE Research Report
 
IRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction Framework
 

ObjectRelationalMappingSEPerspective

  • 1. Object Relational Mapping: SE Perspective Tyler Smith Undergraduate Honors Thesis
  • 2. Introduction The intent of this paper is to provide an in-depth analysis of the software engineering merits of the incorporation of an Object Relational Mapping (ORM) framework. The discussion will be based around answering the following question: ​In a persistent, object oriented application, is an ORM framework the optimal method of implementing a persistent layer? The paper progresses in the manner I expect a developer advocating the use of an ORM framework would present ideas to a software architect. First, an overview of persistence in an object oriented context, including a justification of relational databases as a persistence back-end. Second, potential problems associated with the paradigm mismatch between in-memory objects and relational database tables. Third, a demonstration of the methods by which ORM frameworks address these problems. Fourth, a discussion of the process of integrating an ORM framework into an existing software project. Finally, two case studies which follow real projects through the transition to ORM software. The intended audience is experienced programmers who are considering options for making an application persistent. It is written with the expectation that the reader has some formal knowledge of computer science, such as the difference between the heap and the stack. This paper does ​not assume familiarity with relational database theory. My goal is not to establish a single, brief answer to the above question, but rather to build a base of contextual evidence which will allow the reader to address the usefulness of an ORM framework in his or her own projects. Why did I choose this topic? I have had the opportunity to work on two projects which were both in the process of transitioning to an ORM framework. In both cases, I joined the project after the decision had been made to use an ORM framework. This paper is an investigation into the justifications for the use of an ORM framework. By exploring this topic, I hope to gain both a better understanding of ORM technology, and the decision making process used when making significant design changes in a software project.
  • 3. Object Relational Mapping: SE Perspective Introduction Why did I choose this topic? Part 1: Understanding Object Oriented Persistence Persistence: a simple example: Adding a relation Why do we need databases? Java with a relational database Part 1 Conclusions Part 2: The Object/Relational Mismatch Problem 1: References Between Classes Problem 2: Sub/Super Class Relationships Problem 3: Managing Identity Problem 4: Developer Expertise Problem 5: Managing object state Problem 6: Performance Problem 7: Changing Database Engine Part 3: What is an Object Relational Mapping framework, and how can it help? How does ORM fit into a design? Problem 1: References Between Classes Problem 2: Sub/Super class relationships 1. Table per Concrete Class - Implicit polymorphism 2. Table per Concrete Class With Unions 3. Table per Class Hierarchy 4. Inheritance Relationships as Foreign Keys Django Example: Problem 3: Managing Identity Problem 4: Developer Expertise Problem 5: Managing Object State Problem 6: Performance Problem 7: Changing database engines Part 4: Integrating ORM into the development process: Design basics Integration Top Down Bottom Up Middle Out Meet in the Middle Integrating ORM in Practice Part 5: Case Studies Case Study One Carlson School of Management Help desk Motivations Design Implementation Problems Improvements Conclusions Case Study Two General Dynamics Six-Delta Motivations Selecting an ORM solution Integration Problems
  • 4. Improvements Developer Reaction Data Metrics Looking Back Part 6: Conclusions Future Research/Open Questions Object oriented databases Inferred mapping Sources Special Thanks to: Appendix A: Source Code SQL Performance Test
  • 5. Part 1: Understanding Object Oriented Persistence The following sections give an overview of the basic application of various methods of persistence in an object oriented environment, and provide an explanation of the use of a database in a large scale persistent application. If you are already comfortable with these topics, feel free to skim this section (the examples will be referenced in part 2). I will provide examples using pseudo code, Java, Python and UML. Object oriented programming is now an industry standard for large scale applications. Some programming languages, such as Java, are entirely object oriented. Others, such as C++, support object oriented principles alongside traditional functionality. Many large and small scale software projects now rely on objects to encapsulate data and functional elements of software. What is an object? In the simplest sense, an object consists of a collection of data and methods which operate on that data. The structure of this data and the functionality of the methods are defined by a ​class. When a class is ​instantiated, an object is created. The term object oriented programming refers to the design paradigms which guide the creation of an application based on objects. What is persistence? In an object oriented application, objects exist in memory during the execution of the application. A persistent object is an object which can exist through a shutdown of the system. Persistence is the method by which an object becomes persistent. In the following example, I will outline some common persistence methods. Persistence: a simple example: Consider a simple class called Car. The Car class defines a set of data values relevant to a Car: color, engine type, and number of doors. By instantiating this class, we can create a Car object which lives in memory. /**************************************************************** * Class Car V.1 * Tyler Smith *****************/ class Car{ String color; String engine; int doors; } If we wish to make this object persistent, the easiest method is to simply write the data
  • 6. fields of the Car to a file. Files are stored in the file system and can persist through system shutdowns. To make our Car object persistent, we could create a file consisting of tuples which store our data in a ​key-value pair. A key-value pair is a set of two elements: a key, which specifies the related object field, and a value, which is the value of that field for a given object. ##################################################### # Car File V.1 # Tyler Smith ############# color = "red" engine = "V6" doors = "4" In a simple application, this strategy could be very effective. It would also be very simple to implement, as it only requires two new methods, a writeToFile() and a readFromFile(). Adding a relation Consider making a modification to the car class. We want to add a variable to store the owner of the car. We will manage owners with a new class, called Owner: /**************************************************************** * Class Owner V.1 * Tyler Smith *****************/ class Owner{ String name; } We must now consider the addition of a ​relation. A relation is a connection or association between multiple data sets. As we need to reference an Owner, our Car objects cannot exist in isolation; they must contain a data field providing a connection to the owner. In the object oriented environment, implementing this field is trivial: we can add a Owner reference field to the Car class definition. /**************************************************************** * Class Car V.2 * Tyler Smith *****************/ class Car{ String color; String engine; int doors; Owner owner;//This field holds a reference to an Owner object } In object oriented programming, this addition has a minimal impact on overall complexity. The owner field simply contains a reference to the in-memory location of an Owner object.
  • 7. However, when considered in the scope of persistent data, this reference introduces a complexity which will garner much more discussion later in this paper: the problem of object identity. During program execution, the identity of an object is simply its location in memory. Every unique object has a unique memory location, which means the a given memory address can only refer to a single object. This allows our Car object to trivially store a reference to a unique Owner. If we were to simply add a key-value pair for this memory location value, our software would fail. It would fail because we have no guarantee that the memory location of the object in one execution will be the same as in a later execution. In fact, it will almost certainly be different. We need to save some other value which will guarantee that when we read the Car back into the application from a file the same Owner will be referenced. We could save the name of the Owner, but this would not allow us to alter the name of an Owner after it was assigned to a Car. A better choice is to add an identifier field to the owner, the sole task of which is to manage the identity of the Owner. That means our Owner class will look like this: /**************************************************************** * Class Owner V.2 * Tyler Smith *****************/ class Owner{ String name; int owner_id; //for the purpose of this example, I will not discuss how this id //is made to be unique. } I will also add an ID to the Car class for the same reason - to make the identity of a Car object constant throughout transitions to and from a persistent state. /**************************************************************** * Class Car V.2 * Tyler Smith *****************/ class Car{ String color; String engine; int doors; Owner owner;//This field holds a reference to an Owner object int car_id; } Our key-value Car file will now look like this: ##################################################### # Car File V.2 # Tyler Smith ############# color = "red"
  • 8. engine = "V6" doors = "4" owner_id = "2" car_id = "1" The Owner file will look like this: ##################################################### # Owner File V.1 # Tyler Smith ############# name = "Tyler" Id = "1" So far, we have established a simple method for creating a persistent data type. We are able to create and manipulate Car objects, and save and restore them from a persistent state. We are able to maintain a persistent relationship between a Car and an Owner. In the next section, I will give an example which demonstrates why databases are critical to persistent systems. Why do we need databases? So far we've only considered a small number of Car objects. What if we were to try to store 100,000 Car objects? As we're currently storing one Car per file, this would require a 100,000 files. Most modern file systems are not optimized to manage this much data. To combat this problem, we could merge the data into a single file, using a Comma Separated Value format. In this manner we could consolidate the values into a single file and reduce the overhead: ##################################################### # Car File V.3 # Tyler Smith ############# Color, Engine, Doors, Owner, Owner_Id #Headers red, V6, 4, 1,1 blue, V8, 2, 2,2 This will allow us to reduce the overhead required to store hundreds of thousands of files. Now, consider the problem of loading a specific Car from the file. We would have to either search through the entire file, or load every Car into memory, and then search our in memory objects. Both methods are O(n), and do not scale well. Further, in the simple state presented, we have no method for maintaining the integrity of the data. For example, we have no way to control whether two Cars can have the same owner. Relational databases were created to solve these problems​. A relational database is a
  • 9. specialized file system for persistent data designed to handle these complexities. A relational database provides a layer of specialized control between the low level file representation (similar to the above CSV example) and the program needing access to the data. Contemporary relational databases also provide a standard method of communicating with the the database. SQL or Structured Query Language is the industry standard method of performing operations on the database. A relational database provides quick access to data, along with the ability to apply constraints which force the data to conform to specific rules. For example, constraints could be used to enforce the uniqueness of Cars or Owners, or to require that Cars could only reference existing Owner objects. Relational databases are now the de facto standard in persistent applications, object oriented and otherwise. Java with a relational database We can now expand our Car class to connect to a database by adding read and write methods which access the database: /**************************************************************** * Class Car V.3 * Tyler Smith *****************/ class Car{ String color; String engine; int door; Owner owner; int car_id; public Car readCarFromDatabase(){ connection = getConnection(); connection.runQuery("SELECT FROM car WHERE id = ...") //SQL statement to retrieve a Car from the database } public void writeCarToDatabase(){ connection = getConnection(); connection.runQuery("INSERT INTO car ....") //SQL statement to add a car to the database } } We can now store Car objects between executions of our program, store a relationship between Cars and Owners, and manage many persistent Cars in an efficient manner. Part 1 Conclusions For a large scale, persistent, object oriented application, it is necessary to use some manner
  • 10. of advanced data storage system. Typically, this system is a relational database. Relational databases are the industry standard for large scale data management, and provide a reasonable back-end for object persistence. The Car/Owner example demonstrates the basics of persistence, and provides a justification for the use of relational databases. However, it is a very simple example. As the data model increases in complexity, the difficulty of managing its data grows significantly. I will address these complexities in the following section. The primary goal of an Object Relational Mapping framework is not to remove this complexity, but rather to encapsulate it within a well vetted tool which allows developers working at the object level to leave the low level details of the data management to the ORM tool, and focus on the intent of the overall application. In part two, I outline some examples of complexities addressed by ORM tools. Part 2: The Object/Relational Mismatch There is a fundamental disparity between the way data is stored in a relational database and the way data is stored in objects. Object oriented programming provides a framework with which software can be built to reflect real world analogues. Relational databases provide a fast, structured and reliable method of saving and restoring data. In a system which requires both a database and object oriented development, there must be some resolution whereby data that exists in the form of abstract objects can be preserved in a relational database. Thus we need some method of correctly mapping object data to database tables even in complex situations. It is from this need that Object Relational Mapping tools were born. In the following section, I will present a set of problems which demonstrate this paradigm mismatch. I will follow this with a discussion of how ORM tools address these issues. Note that I do ​not assume the existence of a home brewed ORM persistence layer in these discussions. Problem 1: References Between Classes In a relational database, ​everything is stored in a tables. In a manner very similar to the csv file example above, all of the instances of a class are stored in rows. In an object, local data is stored in a linear manner, but references to other objects are not. As seen in the car example, an object can have a reference to another object. References can exist both as an aggregation relationship, where an object references but does not own another object, or as a composition relationship, where an object owns other objects. [Fowler 68] These relationships do not have a simple or implicit mapping to a table structure. This is the fundamental problem which spurred the development of ORM frameworks. In the car example, I provided a system with a fairly natural mapping to a persistent state. The most basic elements of an object can be trivially mapped to rows in a table. Simple Car Table (Matches Car V.1) Color Engine Doors Car_Id
  • 11. Red V8 4 1 However, if we consider the addition of the reference to an Owner, the mapping becomes less direct. As shown in part 1, one method is to add an Id to the owner, and have each car save a the Id of its owner. Car Table (Version 2) Color Engine Doors Car_Id Owner_Id Red V8 4 1 1 Owner Table Name Owner_Id Ted 1 In a relational database, this type of relationship is called a ​foreign key reference. Notice however that there is some ambiguity in this design. For example, we could create a functionally equivalent relationship by giving each Car an Id, and having each Owner store the Id of a Car. Car Table (Version 3) Color Engine Doors Car_Id Red V8 4 1 Owner Table Name Id Car Id Ted 1 1 Neither of these methods is wrong. Both will allow us to load the objects from the database, and recreate the relationship between them. However, there is not necessarily a database mapping implicit from the class definitions. A relationship which is straightforward in object oriented terms is not necessarily as straightforward when applied to a database. The choice of relationship mapping also has effects on ​multiplicity. Multiplicity is the number of objects associated with each part of a relationship. For example, a single Owner could own multiple Cars. If we store the objects in the manner described by the Car table version 2, a Car can only have one owner, but an owner can have multiple cars. Similarly, in Car table version 3, a Car can have many Owners, but an Owner can only have one car. We call these relationships One-to-Many relationships. Suppose we wanted to allow Owners to have multiple Cars, and Cars to have multiple Owners. In object oriented programming, this is simple to implement. Each Car object can contain a variable length set of references to Owner objects, and each Owner object contains a variable length set of references to Car objects. These are called Many-to-Many relationships. The limitations of a relational database do not allow us to have variable length fields in a database table. This means we need some other method of mapping these relationships; we
  • 12. need a ​join table. A join table resolves the connection in a Many-to-Many relationship. A join table for this relationship between Cars and Owners might look like this: Car/Owner Join Table (Version 1) Car Id Owner Id 1 2 2 2 1 3 In this table, Car 1 is owned by Owners 2 and 3, and Owner 2 owns both cars 1 and 2. In object oriented programming, we would not necessarily need to define a datatype to manage this relationship. In object definitions, variable length sets are easy to manage, as are bi-directional inter-object relationships. In the simplest implementation, Cars and Owners can just contain a set of references to each other. However, to handle the transition to a relational database, we had to a create a new table with the sole purpose of managing the connection between Car and Owner objects. This demonstrates that relationships defined in object oriented terms do not necessarily have a one to one mapping to their representation in the database. The Car/Owner relationship above is an ​aggregation - the Car and Owner exist as independent objects, and the deletion of one would not imply the deletion of the other. Now suppose we wished to regard a Car object as a ​composition of various parts. For example, suppose we added an Engine class: /**************************************************************** * Class Part V.1 * Tyler Smith *****************/ class Part{ String name; float liters; int engine_id; } The updated Car class: /**************************************************************** * Class Car V.4 * Tyler Smith *****************/ class Car{ String color; Engine engine;//note we replaced the String engine with an Engine object int door; Owner owner; int car_id; //Database access methods left out for clarity
  • 13. } Once again, there is no direct mapping of this object object oriented relationship to a database. We could choose to give Engine its own table in the database: Engine Table (Version 1) Name Liters Engine_Id V8 5.7 1 Car Table (Version 4) Color Engine_Id Doors Car_Id Red 1 4 1 As the Car/Engine relationship is a composition, this means that deleting the Car would mean also deleting the Engine. As the two are connected directly, it could be more efficient to implement everything in the Car table: Car Table (Version 5) Color Doors Car_Id Engine_Name Liters Engine_Id Red 4 1 V8 5.7 1 Again, neither implementation is strictly correct or incorrect. Using two tables provides a direct mapping from object definition to table definition, and using one table increases efficiency, and enforces the intended nature of the relationship. The problem of inter-object relationships demontrates the first paradigm mismatch between object oriented programming and relational databases. Objects manage relationships in a very different manner from relational databases. There is not necessarily a one-to-one mapping between class definitions and table definitions. This means that in a object oriented persistent context, steps must be taken to account for this disparity during the object transition from in-memory to persistent. Problem 2: Sub/Super Class Relationships The object oriented principle of inheritance presents another situation for which there is no implicit relationship between objects and database tables. Suppose we wish to create another class called Truck, and a superclass Vehicle to which both Car and Truck are subclasses. The Vehicle superclass will allow us to manage data shared by Car and Truck in one place, while allowing type-specific data to be managed at the subclass level. Once again, this presents an ambiguity. We could create a mapping which ignores the superclass entirely by creating a Car table a Truck table which duplicate the fields in the superclass. Or we could create a superclass table which contains the superclass fields, and references to subclass tables. [Bauer 193] These problems are further complicated if we consider the possibility of an Owner containing a reference to a Generic Vehicle. How should the relationship be mapped? Once again, there
  • 14. is no single correct method. Problem 3: Managing Identity As mentioned in part 1, in a purely object oriented environment, object identity is trivial*. Each object has a location in memory, so determining if two references point to the same object is clear. However, when we transfer an object to and from a persistent state, we can't trust the reference location to demonstrate identity. Suppose we create two car objects, each with all of the same values. Without adding some additional structure, the identity of any given object which has been re-loaded from a persistent state is ambiguous. For example, using class Car V.1, suppose I create a Car with color = red, doors = 4 and engine = v6. If (in the same program execution context) I create another Car with color = red, doors = 4, and engine = v6, they are clearly not the same Car - they will have different memory locations. However, if I save them and reload them in a new execution of the program, I cannot trust the memory locations. How can I trust that they are different? The program needs some additional structures and/or functions to manage identity in a persistent context. *There are situations where identity is non-trivial in object oriented programs, but in most cases the memory-location comparison is sufficient. Problem 4: Developer Expertise The difference between object data and tabular data from a database is a critical conceptual distinction. For developers working in an object oriented, database persisted application, it is critical that they understand both possible representations of the data. For example, this means understanding that a Car object containing a set of Owners will not have those owners directly mapped to its row in the database - a row in the Owner table would contain a reference to the car, or there might be a row establishing their relationship in a join table. This requirement of a wider breath of knowledge will make it harder for developers to become specialized in a given subset of the program, and will mean more training before new developers are comfortable with the system. Further, consider the problem of developer training. A project which is primarily functional, but requires some persistence, would require developers to be familiar with both database functionality and the necessary functionality of the program. This extra requirement adds significant overhead to the project. Problem 5: Managing object state In object oriented programming, objects typically have two states, live and dead. If a process modifies a live object (provided concurrency is properly managed) we can be confident that the change will take effect. Objects in a persistent application can exist in four states, ​transient, ​persistent, detached, and removed* [Bauer 386]. A transient object is an object without a database identity - it
  • 15. has been created but has not yet been saved to the database. A persistent object is an object with a database identity. A detached object is an object whose contents are not necessarily consistent with what is stored in the database - the two are no longer synchronized. Typically, a detached object is an object marked for deletion in the system, but which still exists in the database. A removed object is no longer in the database, and is awaiting garbage collection. We need infrastructure to manage the state of the objects, to make sure edits to an object are not lost or overwritten. Any persistent application needs to manage these states. For example, consider the problem of concurrent updates to a persistent object. In a solely object oriented program, concurrency can be managed at the execution level, using a semaphore or other control structure. If two processes (or machines) read an object into memory from a persistent state, and each makes a different edit to the object, we have a potential race condition as the changes are saved to the database. This can happen even if the access to the object is offset by a significant time delay; one process can simply overwrite the changes of the other (by default, it has no way to know that the object is has in memory is old). There needs to be some infrastructure to manage concurrency at the data access level. *Note that these are the states defined by Hibernate, other ORM solutions may have different states. Problem 6: Performance Invoking methods on an in-memory object is typically very fast. This is because memory access has very low overhead compared to accessing a database. Working with a database, every query run by the program has a significant time cost. A design which does not use smart query ordering and structuring can have a big performance impact. To test this impact, I wrote a simple time test program (see appendix A). My program demonstrates the difference between executing a new query every time a new value is needed and executing a single query to get all of the needed values at the same time. Even testing on a local database (low network overhead), the single query was on average three times faster than running a new query for each needed value. Any persistence implementation needs to account for the potential performance issues associated with query ordering and structuring. Managing the contents of an object can be very important in performance as well. Consider an object which owns a large set of other objects. Suppose we only need to make a small edit to the object, unrelated to the large set of objects. A 'dumb' system would simply load everything in the object, needed and otherwise. We need some way to load a subset of the object and make our change without loading all of the data into memory. Consider the problem of making a series of subsequent updates to an object. In a raw SQL system (without significant optimization structure), each update would need its own UPDATE statement. This potentially adds a lot of overhead (especially if the entire object must be loaded into memory each time). We need some way to intelligently manage queries such that they are executed in a logical manner, and with a minimal number of connections to the database.
  • 16. Problem 7: Changing Database Engine In the lifespan of a product, it is likely that some components will change. For example, suppose you built a database product based on SQL Server 2008. Then, some time later, management says we can no longer use SQL Server 2008, and you need to move to an open source database engine. This means several things: You must find out if and where differences exist between the two database engines, and search for any places in your source code where those changes might affect the program. For example, suppose you have a column defined as VARCHAR(MAX) in one of your table definitions. There is no VARCHAR(MAX) in MySQL. Thus to be compatible with MySQL, you will need to search your source code for places VARCHAR(MAX) is used, and replace them with a valid MySQL field definition. This problem is even worse if you have custom database functions or datatypes, or if you need to support multiple database types simultaneously. Part 3: What is an Object Relational Mapping framework, and how can it help? Object relational mapping is the process of transferring data stored in objects to relational database tables. An ORM framework is a tool which manages this transfer. Any persistent application already has some semblance of ORM, typically in the form of methods such as those seen in the above examples - methods which perform operations on the persistent data. An ORM framework is a method of abstracting the details of these Create, Read, Update, and Delete operations (commonly referred to as CRUD) such that the high level implementation does not need to know the details. From Java Persistence with Hibernate: "In a nutshell, object/relational mapping is the automated (and transparent) persistence of objects in a Java application to the tables in a relational database, using metadata that describes the mapping between the objects and the database." [Bauer 25] Object relational mapping software started to gain notoriety in the late 1990s with the production of tools such as Oracle's TopLink. ORM was originally most popular with Java, but in recent years has been expanded to most contemporary object oriented languages. How does ORM fit into a design? In the above Car example, we had a simple two tier system - we had a java implementation, and database back-end:
  • 17. In a two tier system, all of the implementation for both the application and data persistence are stored together, in this case in the Car object. If we add an ORM tool to this application, we can hide the details of the persistence functionality.
  • 18. Now the Car class does not need to know how the data is being retrieved, or how to map object data to and from tabular data - the ORM framework handles all of these details. This means that the complexities addressed in part 2 can be hidden from the high level Car implementation. I will now outline how an ORM framework can handle the problems outlined in part 2. Note that not all ORM frameworks implement solutions to all of these problems. I won't provide extremely low level implementation details, as they can differ between implementations. Rather, I will give a design-level explanation of how ORM solutions can assist in these issues. My explanation will be based primarily on Hibernate, the ORM solution employed by my project at General Dynamics and Django, the ORM tool I used at the Carlson school of management. It is important to note that an ORM framework does not necessarily need to be created by a third party. Everything I outline here ​can be done by hand. Many of the advantages can be achieved by establishing a multi tier architecture and hiding the data implementation layer. However, my discussion is based on the use of a third party tool, as most project leads likely do not have the time or budget to implement an entire ORM solution.
  • 19. Problem 1: References Between Classes References between classes in object oriented software present a problem, as the mapping of these relationships to a relational database is ambiguous. ORM tools handle these ambiguities by employing user provided annotations or mapping files to clarify complex relationships. In the Hibernate ORM, mapping data provided by the user is used to tell the the tool specifically how some objects relate to others. Alongside the data fields in a class, the user provides annotations which specify the nature of the relationship. For example, let's consider a Hibernate annotation to our Car class: /**************************************************************** * Class Car V.4 (Java) * Tyler Smith *****************/ class Car{ //Other fields hidden for clarity. //Hibernate annotation to make relationship //In this case we are saying a single Car has many Owners @ManyToOne(targetEntity = Owner.class) Owner owner; } This annotation tells Hibernate exactly how the relationship should be mapped to the database. Note that the developer still needs to understand how relationships work in a database - the ORM solution won't allow developers to blindly trust that the objects will be mapped correctly. In Django, the functionality is very similar. The user provides information to the ORM to specify the nature of the relationship: """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" " Class Car V.4 (Python) " Tyler Smith """""""""""""""""""""" Class Car(models.Model): #Other fields hidden for clarity #In this example, we have a many to many relationship between Car and Owner owner = models.ManyToManyField(Owner) Note that the ORM tools are not able to make complex design assumptions, such as establishing whether a relationship is an aggregation or composition. Instead, the ORM framework provides methods of formalizing such relationships with the class definition, as seen above. Thus the burden of resolving the mapping ambiguity still falls on the developer, but after the mapping is decided, its nature is clear and formalized. When relationships are formalized in an object definition, the ORM framework can employ this definition whenever operations are requested. A developer does not need to explicitly consider the nature of each relationship held by an object to perform operations elsewhere
Problem 2: Sub/Super Class Relationships

This difficulty in mapping is not automatically solved in all ORM frameworks; the user needs to make some design decisions as to how inheritance is handled. The Hibernate documentation provides four possible mapping schemes to handle inheritance in persistent classes [Bauer 191]. (A brief annotation sketch showing how one of these schemes is selected follows the list.)

1. Table per Concrete Class - Implicit polymorphism

In this method, we tell Hibernate to map one table for each concrete (non-abstract) class. All properties, inherited and local, are mapped to columns. Hibernate can automatically generate queries for polymorphic method calls against the superclass. This solution is effective for models with very little inheritance, but presents risks. For example, if Car and Truck both inherit from an abstract superclass Vehicle, and a User has a reference to a Vehicle, how do we map this relationship in the database? We can't have a correctly constrained yet generic reference to one of two tables. [Bauer 193]

2. Table per Concrete Class With Unions

In this method, each concrete class, including any concrete superclass, is mapped to its own table. The advantage is that superclass property mappings are shared: with implicit polymorphism, inherited properties had to be mapped again for each subclass, whereas here they are declared once. Hibernate then uses a UNION to combine the results of queries against these tables when shared sub and superclass data is needed. This solution also solves the problem of associations with inherited classes, as "Hibernate can use a UNION query to simulate a single table as the target of the association mapping" [Bauer 199].

3. Table per Class Hierarchy

In this method, each hierarchy of classes is mapped to a single table, so shared data is implicitly shared - there is no need for unions or multiple queries. However, this method presents some potentially critical problems. Subclass properties must be nullable, as they are not necessarily populated in every row. This means that extra caution must be used when accessing the data manually, as many columns could potentially be null for a given row. [Bauer 200]

4. Inheritance Relationships as Foreign Keys

In this method, every class which defines its own properties is mapped to its own table. This implementation requires more tables and queries, but is normalized. It is simple to understand, but can result in unnecessary complexity. It involves treating every is-a relationship as a has-a relationship in terms of the schema. This means abstract and concrete classes, and even interfaces, can have their own tables. [Bauer 203]
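As a rough illustration of how a scheme is chosen in practice, here is a hedged sketch using JPA-style annotations (the form Hibernate supports alongside XML mapping files). The Vehicle/Car classes mirror the Django test below; InheritanceType.JOINED corresponds to scheme 4, while SINGLE_TABLE and TABLE_PER_CLASS correspond to schemes 3 and 2. The identifier field and its generation strategy are assumptions added to make the example complete.

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Inheritance;
import javax.persistence.InheritanceType;

@Entity
@Inheritance(strategy = InheritanceType.JOINED)
class Vehicle {
    @Id
    @GeneratedValue
    Long id;

    String name;
    String color;
}

@Entity
class Car extends Vehicle {
    // With JOINED, this property lands in Car's own table, which holds a
    // foreign key back to the Vehicle table.
    boolean trunk;
}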
Django Example:

I couldn't find an explanation in the Django documentation as to how inheritance is implemented, so I ran a simple test using the following Python classes. Note that all classes are concrete.

from django.db import models

class Vehicle(models.Model):
    name = models.CharField(max_length=200)
    color = models.CharField(max_length=200)

class Car(Vehicle):
    trunk = models.BooleanField()

class Truck(Vehicle):
    bedLength = models.IntegerField(blank=True, default=0)

Django mapped each class to its own table, as follows:

Vehicle:
    Field   Type
    id      int
    name    varchar
    color   varchar

Car:
    Field            Type
    vehicle_ptr_id   int
    trunk            tinyint

Truck:
    Field            Type
    vehicle_ptr_id   int
    bedLength        int

Adding a Car, we see:

Vehicle:
    id   name          color
    1    Tyler's Car   Red

Car:
    vehicle_ptr_id   trunk
    1                1

When an object is added, its sub and superclass properties are split up and added to the relevant tables. The tables are then connected via a pointer, stored in the subclass table. Note that the subclass does not have its own id field; the superclass stores the id, and two such ids would be redundant. This mapping corresponds most closely to scheme 4 above, inheritance relationships as foreign keys: each class gets its own table, and the subclass row references its superclass row through a foreign key.

Problem 3: Managing Identity

This problem has not actually been completely addressed by ORM frameworks as of this writing. The complexity lies in establishing the identifying feature of an object. The memory location fails, as it will change as soon as the current process completes. Using the database identifier (primary key) works, but leaves us with an identity-less object until the object is made persistent - in the transient state, the object has no identity, and thus two transient objects of the same type could incorrectly be found equal (null == null). The accepted solution has been to use a business key, an identifier similar to the database primary key, but which can be set before the creation of the relevant row in the database. [Bauer 397] Use of this method requires domain classes which appropriately manage equality (typically by overriding the default equals method).

Problem 4: Developer Expertise

The data layer abstraction provided by an ORM framework gives developers working on high level functional code an interface which allows them access to the data, but shields them from low level data details. While it is still important that they have an idea of how the data layer works, developers do not need an advanced understanding in order to perform data operations. Note that this encapsulation is not limited to third party software - home-brewed data access layers can provide the same benefit.

Problem 5: Managing Object State

ORM tools provide structured methods for managing object state to keep track of object identity and transience. For example, Hibernate can maintain a version field in each row of the database. Each time the row is updated, Hibernate automatically checks the version of the in-memory object against the version stored in the database. If they are not the same, the in-memory object is stale and needs to be refreshed before any data can be written. This is a form of optimistic offline locking.
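A hedged sketch of how such a version check is typically declared (Java, JPA-style annotation supported by Hibernate); the entity and its fields are illustrative assumptions rather than classes from the projects discussed later:

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
class Ticket {
    @Id
    Long id;

    // The framework increments this value on every update and compares it to
    // the database value before writing; a mismatch means the in-memory copy
    // is stale, and the write is rejected instead of silently overwriting data.
    @Version
    int version;

    String status;
}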
Problem 6: Performance

An ORM framework affords many options for performance improvement. These include lazy or eager loading, transaction based database operations, and a variety of caching options.

Earlier, I discussed the problem of making a small update to a large object. Loading the entire object (and its children) into memory requires a lot of overhead. Lazy loading is a method of loading only certain parts of an object when they are needed, as opposed to loading the entire object into memory. For a large object with many children, loading every object in the set could mean thousands of unnecessary database reads - and even more if we consider any sets contained within those objects. Lazy loading allows us to load the parent object without loading the children. [Bauer 571] Hibernate implements lazy loading by creating proxy objects in the place of sets, and only instantiating the sets when they are explicitly requested by the user.

In a substantial persistent system, we also need to consider the concept of a transaction. A transaction is a series of operations. It is often more efficient to combine a series of modifications to an object, or set of objects, into a single statement or batch of statements, so for performance reasons it is useful to have some method of combining operations. In Hibernate, transactions are optimized to do minimal work on the database. This means updates may not necessarily be executed in the order the requests appear in the source code.

Problem 7: Changing database engines

Most ORM frameworks are written and tested against multiple database back-ends. Typically, the developer simply tells the framework which type of back-end is being used, and the low level differences are handled within the tool. This means that if the database back-end changes, the product does not need to change drastically, as the flexibility is already built into the ORM layer.

Part 4: Integrating ORM into the development process:

The integration of an ORM framework can take place at many different stages in development. It may be as early as the design phase, or as late as an update to an already complete product. The decision of when and how to integrate an ORM framework is non-trivial, as the persistence layer of any application is critical to its success.

Design basics

Object oriented persistent projects typically share some key design elements related to persistence. First, most are built on a domain model (though they may not call it that). A domain model is a design that has classes corresponding to distinct persistent elements. The Car example above used a domain model - the class Car is a domain class. [Bauer 107]

Second, many projects also use the Data Access Object (DAO) design pattern. In this pattern, data access objects live above the domain classes in the class structure, and provide high level access to data. In projects using the Data Access Object pattern, domain level classes typically have little or no implementation detail - only data fields and mapping data. A brief sketch of such an interface follows.
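The following is a minimal sketch of what a Data Access Object for the Car domain class might look like (Java); the interface name and method signatures are illustrative assumptions, not taken from either project studied later.

import java.util.List;

// High level code depends only on this interface; the ORM-backed
// implementation hides sessions, transactions, and the generated SQL.
interface CarDao {
    Car findById(long id);
    List<Car> findByOwner(Owner owner);
    void save(Car car);
    void delete(Car car);
}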
Integration

In Java Persistence with Hibernate, the authors discuss four methods for integrating Hibernate into a project [Bauer 40]:

Top Down

In a top down development model, an ORM framework is integrated into an existing domain model. The program already has a set of classes defining the persistent data types, and needs a method for mapping those classes to the database. Note that it is assumed that there is not an existing database schema. (A configuration sketch for this scenario follows the four strategies below.)

Bottom Up

In a bottom up development model, an existing database schema is used as the basis for the integration of an ORM framework and the development of code to operate on the data. This would likely occur if a program is required to work with a legacy database - the new system must be made to work with the existing database schema.

Middle Out

In a middle out model, developers start with a model of the mapping between objects and tables, and design both the code and the schema based on that mapping. This means that before the classes are written or the database schema is decided, the specifics of the mapping between objects and the database are worked out.

Meet in the Middle

In this model, developers start with an existing schema and an existing code base, and integrate an ORM framework. The authors of Java Persistence with Hibernate consider this the most difficult method of integrating Hibernate into a project, as it often requires significant refactoring to get the domain classes and database schema to agree: "This can be an incredibly painful scenario, and it is, fortunately, exceedingly rare."
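As a rough sketch of what the top down scenario can look like in code (Java), the snippet below asks Hibernate to derive the database schema from the mapped classes. It assumes a Hibernate version in which Configuration.addAnnotatedClass is available and uses placeholder settings; in a bottom up or meet in the middle project, the hbm2ddl setting would instead be something like "validate" against the existing schema.

import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

public class TopDownBootstrapSketch {
    public static SessionFactory build() {
        Configuration cfg = new Configuration()
            .addAnnotatedClass(Car.class)
            .addAnnotatedClass(Owner.class)
            // "create" asks Hibernate to generate the schema from the mappings;
            // connection settings would normally come from hibernate.cfg.xml.
            .setProperty("hibernate.hbm2ddl.auto", "create");
        return cfg.buildSessionFactory();
    }
}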
In the next section, I analyze two projects which switched to an ORM implementation from an existing platform - both were established projects. In the first study, we used a bottom up strategy; in the second study, the meet in the middle strategy was used.

Integrating ORM in Practice

In the next section, I describe the process taken by two software projects to integrate an ORM framework. These case studies serve as real world examples of the processes described above. At the Carlson School of Management, we used the bottom up strategy, integrating an ORM tool with an existing database schema. At General Dynamics, we used the meet in the middle strategy, integrating an ORM framework with an existing schema and domain model.

Part 5: Case Studies

To provide real world context to the discussion of ORM frameworks, I did two case studies. The goal of both studies was to investigate whether ORM tools can successfully address the problems described in part two in a production environment. In both projects, I focused on the following questions:

1. What was the existing implementation before the integration of an ORM framework?
2. What were the problems and concerns with that implementation?
3. What was the primary motivation for switching to an ORM framework?
4. Why was the ORM framework in question chosen?
5. Was the integration as difficult as anticipated?
6. Has it been successful?
7. What would you do differently?

I gathered this data through my own experience (I worked on both projects), through interviews with project members, and through code analysis tools.

Case Study One: Carlson School of Management Help Desk

Project Overview

At the Carlson School of Management, I worked on the Laptop Management System, a web based program which manages laptop repairs and equipment checkouts. The system tracks over one thousand students and hundreds of laptops, and is used by approximately 15 staff members.

Motivations

The program was originally written in PHP, using a MySQL database. In early 2009, the decision was made to switch to Django. There were three primary reasons for the switch. First, Django allows very fast web page development, which would allow us to easily write report-generation pages. Second, Django has HTML template options which allow more flexibility in page design. Finally, it was hoped that Django would be easier to maintain, as many aspects of the PHP based version were fairly old and disconnected.

Design

The original project was implemented with PHP (non-object oriented) and MySQL. The second version was written in Python, using the ORM tool Django. As we needed to keep track of legacy data, the database schema remained largely the same. We used a domain model, implementing each persistent type as a class based on our existing schema.

Django handles all of the communication with the database. We did not need to use any type of advanced data access structure. Instead, we used built in methods
provided by Django for operations such as selecting a large set of objects. To retrieve a set of objects from the database, the developer needs only to execute a filter operation, which acts as a SELECT query. To save an object, the developer calls a save() method. For example, if I want to retrieve all repair tickets with non-null laptops, I can use this line of Python:

all_tickets = Ticket.objects.filter(laptop__isnull=False)

All persistent classes extend the Model superclass. This superclass provides access to methods such as filter, which returns a subset of the instances of that class saved to the database. The line above uses a filter to get all Ticket objects from the database where the laptop field is non-null. The corresponding raw-SQL implementation would require executing a SELECT statement, followed by row by row parsing of the results, creating an object for each row. Django also handles all transaction level details - we did not need to worry about whether an object was persistent or transient; Django managed those details.

Implementation

The implementation took approximately 12 months. Much of this development time was spent re-implementing the system, as we converted from conventional PHP to object oriented Python. We also had to write SQL scripts to transfer data from the old database schema to the new, Django based schema. The implementation was done in stages: Carl Allen implemented the ticket management portion first, and then I implemented the equipment management section and the report generation tools. Some of the higher level functionality (forms, etc.) was complicated, but the data implementation was fairly quick. Typically, adding a class and the associated table took 2-4 hours.

Problems

Many tricky problems originated from our use of legacy data in the database. In many cases, poorly enforced constraints in the legacy data transferred to the new database. This meant that the assumptions made by Django regarding data integrity were often incorrect. For example, Django assumes that if a foreign key reference is non-null, then the associated object must exist. We often had tickets referencing laptops which had since been deleted. This resulted in verbose, hard to fix errors. For example, the following code caused errors when a ticket referenced a non-existent laptop:

all_tickets = Ticket.objects.filter(laptop__isnull=False).order_by('-pk')
for ticket in all_tickets:
    laptop = ticket.laptop

This code parsed all of the tickets in our system with non-null laptop fields, and assigned a local laptop variable to the laptop associated with each ticket. The following error was thrown when one of the laptops referenced by a ticket did not exist:
[Figure: Django stack trace produced by the missing laptop reference]

It is clear from this error that a laptop is missing. However, we're given no information as to which laptop is missing. Carlson has thousands of laptops and thousands of tickets. Therefore,
to solve this problem I had to bypass Django altogether and execute SQL directly against the database to find the bad laptop references. These errors are often lengthy to read, as most of the lines in the stack trace are within Django source code.

We also had issues stemming from the large degree of control Django has over the application. The most complex of these was a data validation error which caused the system to silently ignore requests. We think this problem has to do with Django's caching policy, but it has yet to be resolved. In the PHP implementation, the code run with each page request was very straightforward. In Django, many things happen in the background, hidden from the user and the developer. While this decreases the amount of code needed to accomplish a given task, it makes searching for the source of an error much more complex.

Improvements

Using Django allowed us to rapidly add elements to the application. The primary motivation for the switch was to allow report generation. With Django's built in commands for data access, we could easily gather and manage a lot of data without much low level SQL - we could just request the data and work with it.

Using Django also allowed us to decrease the size of our code base. We went from 13,000 lines of PHP to about 2,000 lines of Python (as reported by CLOC). There are some minor functionality discrepancies, as Django handled some user management functions which were handled manually in the older version, and the older version of the software did not have reporting tools.

Conclusions

Overall, users and management have been pleased with the Django based implementation. However, while our code size and reporting ability have dramatically improved, errors have become very hard to trace. Finding the source of an error typically means searching online for the meaning of the error message, and then manually parsing the database data to find the row causing the problem. Some errors, such as the silent request rejection discussed above, have yet to be resolved. The time spent resolving these errors has reduced the efficiency gains we hoped to get from Django.

Case Study Two: General Dynamics Six-Delta

Project Overview

Over the past year, I've had the opportunity to do an internship at General Dynamics Advanced Information Systems in Bloomington, Minnesota. I work on the Six-Delta project, a persistent, Java based application. Six-Delta uses SQL Server as a back-end. The project has about 12 developers and about 500,000 lines of code. The program has about 75 persistent classes - classes that are saved
to the database. Six-Delta is based on a domain model of persistence. In the domain model, the business logic of the system is separated from the data implementation through domain objects - objects which have corresponding tables in the database.

Motivations

In late 2008, the project moved from a raw SQL based implementation to an Object Relational Mapping framework. I interviewed Six-Delta chief architect Paul Hed and database lead Paul Wehlage to discuss this change.

From an architectural perspective, the motivation to switch to ORM came as part of a general drive to move to an n-tier architecture. Before ORM, Six-Delta was a 2-tier application using custom, direct SQL code to manage persistent data. The program was growing steadily, and continuing to add components would have meant more persistent classes. In a 2-tier architecture, all of the complexity of the data access layer is visible in the functional logic of the program. The goal was to insert an ORM framework as a third layer to manage some of this complexity.

From a data access perspective, the motivation for ORM came from a desire for better performance and integrity. The project already had a hand-coded ORM implementation. That implementation could do some of the specialized operations discussed earlier in this document, such as lazy loading. However, it was lacking in several key areas. It did not have a method of offline version checking, which meant that concurrent data access could result in race conditions and lost changes. It also did not have any concept of a transaction. Updates to an object were executed field by field, and were very slow.

From a general design perspective, the motivation for ORM came from a desire to decrease complexity in the data access classes. Maintaining the data layer of the implementation was very complex - each persistent class required 5 data maintenance classes. Maintaining the database schema took a lot of work. Even though some support classes could be auto-generated, the system was complex enough that only certain developers had the expertise to make schema changes. As the number of tables was increasing dramatically (from 15 to close to 50), the complexity and overhead associated with maintaining the home-brew data access layer was growing too great.

Selecting an ORM solution

The team chose an open source ORM framework called Hibernate. Hibernate was considered alongside two other ORM solutions, Enterprise JavaBeans and Java Data Objects. Hibernate was chosen because it is the most popular open source solution for Java and is fairly mature; Enterprise JavaBeans was considered too large and complex, and Java Data Objects was too new.

Using annotations in persistent classes, Hibernate generates SQL code and executes it in transactions. To save an object, the developer uses a transaction - a series of changes to persistent objects which is managed by Hibernate. During a transaction, Hibernate generates SQL statements, which it executes against the database. (A brief sketch of this pattern follows.)
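The following is a minimal sketch (Java, Hibernate) of the transaction pattern described above; the entities, the setter, and the scenario are illustrative assumptions, not code from Six-Delta.

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class TransactionSketch {
    // Group several changes to persistent objects into a single transaction;
    // Hibernate tracks the changes and flushes the generated SQL at commit.
    public static void transferOwnership(SessionFactory factory, long carId, long newOwnerId) {
        Session session = factory.openSession();
        Transaction tx = session.beginTransaction();
        try {
            Car car = (Car) session.get(Car.class, carId);
            Owner newOwner = (Owner) session.get(Owner.class, newOwnerId);
            car.setOwner(newOwner);   // no SQL is issued yet
            tx.commit();              // the UPDATE statements run here
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }
}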
Integration

The planned integration time was three weeks. While the lead Hibernate developers advocate against inserting ORM into an existing project with an existing schema, the team was confident that it could be added quickly, as the data access layer of Six-Delta was functionally very similar to Hibernate's design - both used a domain model with Data Access Objects. Data access objects are objects that provide an interface to the persistent domain classes.

Integration was more complex than expected, primarily because of the difficulty of integrating transactions into all of the Six-Delta data dependent code. While the base layer of Hibernate's data access design was consistent with the pre-Hibernate Six-Delta implementation, Six-Delta had no transactions, and significant refactoring was required before transactions were successfully implemented. The team also chose to integrate Spring with Hibernate. Spring provides tools for tighter management of transactions, along with connection management tools. The integration ultimately took approximately 16 months before a stable release was available (many other software changes took place in this time; Hibernate was not the only item requiring development time).

Problems

As time passed, other elements of Hibernate surfaced which required additional work. Hibernate does not implement an efficient delete operation - it loads the entire object into memory before deleting it in the database - so the team had to implement SQL based delete operations to allow for efficient deleting. Hibernate does not allow for column defaults in SQL Server, which meant that Hibernate could not be used to auto-generate the database schema. Hibernate is slow to initialize, which means that for very small sub-application elements, Hibernate is too slow to be useful. Hibernate is also incapable of managing two schemas at the same time, which makes database upgrade testing complex. Finally, Hibernate has troublesome debug options: it can print the skeleton of the SQL statements it executes but not the bound values (logged statements look like insert into Car (id, name) values (?, ?)), so problems can be hard to trace. While none of these problems were critical, they required more low level coding than would be ideal in a data access layer managed by an ORM framework.

Improvements

Hibernate succeeded in its primary goals. In the current state, it is much easier for developers to make schema changes without special implementation knowledge. In general (with the exception of deletion), Hibernate has made database operations much faster. Finally, Hibernate has dramatically decreased the amount of time required to implement a schema change.

Developer Reaction

When I first inquired about how Spring and Hibernate were integrated into the program, I was simply told: magic. Obviously this was hyperbole on the part of the developer I was speaking to, but the mentality implied by this reply was clear: developers were not familiar with the design of the data access layer. Hibernate errors can be verbose and confusing, and the integration into the system is not very straightforward. This meant that even experienced developers could have trouble tracking down errors that arise in the data layer, as the layer of abstraction has allowed them to ignore the implementation details.

Data Metrics
I was given permission by Paul Hed to use historical defect data in this report. The following is a plot of problem and regression reports in the two years before the switch to Hibernate.

[Figure: Problems and Regressions Before Hibernate]

As the current version of the software is larger than the pre-Hibernate version, I also plotted the regressions and problems relative to the number of lines of code in the last non-Hibernate release (I don't have access to LOC data for all previous releases, so these values are all relative to the same release).

[Figure: Problems and Regressions per KLOC, Before Hibernate]

Overall, we see an average of 47.68 total regressions and problems per month, with an average of 0.098 total regressions and problems per KLOC. Interestingly, there is a
dramatic spike in problem reports approximately one year before the transition to Hibernate.

[Figure: Problems and Regressions, Post Hibernate]

There is also a significant spike in December of 2008. Paul Hed noted that this spike was likely due to complications surrounding the transition to Hibernate: the team had to maintain a non-Hibernate baseline while implementing transaction management tools in the Hibernate baseline, which resulted in a lot of code changes. The spike in April 2010 was due to a major release testing process, which simply found more bugs - there wasn't a significant change in development, just an increase in testing. Again, the values below are all relative to a single KLOC value, as I don't have access to historical KLOC data.

[Figure: Problems and Regressions per KLOC, Post Hibernate]
After the Hibernate release, the average number of regressions and problems per month dropped to 42.21, and the average regressions and problems per KLOC dropped significantly to 0.067. This corresponds to an approximately 12 percent drop in total regressions and problems, and an approximately 32 percent drop in regressions and problems per KLOC (noting again that the KLOC normalization is approximate). Despite this improvement, Paul Hed asserted that the pre-Hibernate data access layer was significantly more stable than the current implementation, largely due to the overall simplicity of the design. In the Hibernate implementation, the data goes through more steps and tools on its way from user to database. This added complexity makes it harder to be confident in the code.

Note: lines of code metrics are from the Coverity Prevent static analysis tool.

Looking Back

I asked Paul Wehlage what he would do differently if he could start over with the transition. He was generally satisfied with Hibernate's performance, but said that spending more time with it - learning the nuances and tricky spots - would have been beneficial before attempting integration. He also said more work should have been done to educate developers on Hibernate.

Part 6: Conclusions

There is unquestionably a significant paradigm mismatch between object oriented development and relational databases. Object relational mapping frameworks provide a compelling solution to this problem.

At the beginning of this document, I posed the question "In a persistent, object oriented application, is an ORM framework an advantageous method of implementing the persistent
layer?". Through the examples of ORM application, I demonstrated a variety of ways an ORM framework can assist with the complex translation between the OO world and relational databases. In the case studies, we saw examples of ORM solutions in action.

While both projects I studied were ultimately successful, both suffered from the same type of problem: trusting a black-box framework to handle a significant layer of an application is very dangerous. When things work as expected, the ORM framework can be ignored. However, it is critical that developers working with the system have a solid understanding of the implementation, so that when errors inevitably arise, the developers charged with finding a solution are not faced with opening the black box for the first time.

Thus the answer to my question is: an ORM framework can be an immensely helpful tool for improving the speed and quality of the persistence layer. However, the implementation of an ORM framework (or any major framework) cannot be a black box operation; any developer interacting with it must understand the system. If the system is treated as magic, eventually the developers will be called on to debug a problem, and it's very hard to debug magic.

From the perspective of a software architect or project manager, the implication is that the costs of integrating an ORM framework stretch beyond the initial coding costs. Successfully integrating an ORM framework requires a commitment to train all of the developers on a project in the use of the framework. This is not to say that integrating an ORM solution is a waste of resources; rather, management needs to understand the long term commitment that comes with including such a complex tool.

Future Research/Open Questions

Object oriented databases

In recent years, relational databases have been the standard for data management, and this paper assumes the reader is planning to use a relational database. However, there are specialized object oriented databases designed to circumvent the problems caused by the object relational mismatch. As ORM frameworks become increasingly popular, it would be interesting to compare a relational database paired with an ORM framework against an object oriented database. Performance could be compared both in terms of speed and in terms of bug occurrences.

Inferred mapping

Hibernate allows the developer to map relationships both with XML mapping files and with in-code annotations. I would like to develop a tool which could analyze user specified domain classes and infer annotations or mapping files.
Sources

Bauer, Christian, and Gavin King. Java Persistence with Hibernate. Greenwich, Conn.: Manning, 2007. Print.

Beighley, Lynn. Head First SQL. Beijing: O'Reilly Media, 2007. Print.

Fowler, Martin. UML Distilled: A Brief Guide to the Standard Object Modeling Language. Boston: Addison-Wesley, 2009. Print.

Johnson, Rod. "J2EE Development Frameworks." Computer, vol. 38, no. 1, Jan. 2005, pp. 107-110. doi:10.1109/MC.2005.22.

"Version 2.0 (English)." The Django Book. Web. 13 June 2010. <http://www.djangobook.com/en/2.0/>.
Special Thanks to:

Phil Barry - Primary advisor
Eric Van Wyk and Mats Heimdahl - Secondary advisors
Paul Wehlage and Paul Hed - General Dynamics engineers who did interviews
Matt Maloney and Garreth McMaster - Carlson School of Management managers who did interviews
Doug Smith - Reader and database advisor
Appendix A: Source Code

SQL Performance Test

import java.sql.DriverManager;
import java.sql.Statement;

/**
 * TimeTest
 * @author Tyler Smith
 *
 * This class is designed to test a series of queries against a database.
 * The queries accomplish the same thing, but one is optimally structured and
 * the other is not. The idea is to show the importance of optimal query
 * structuring.
 *
 * Obviously this test is very simplified. However, it does reflect the
 * importance of considering sequential access when performing CRUD operations
 * on object data.
 */
public class TimeTest {

    private static String bulkQuery = "SELECT * FROM [AccessTest].[dbo].[presidents]";
    private static String query1 = "SELECT name FROM [AccessTest].[dbo].[presidents]";
    private static String query2 = "SELECT id FROM [AccessTest].[dbo].[presidents]";
    private static String query3 = "SELECT birthday FROM [AccessTest].[dbo].[presidents]";
    private static String query4 = "SELECT gender FROM [AccessTest].[dbo].[presidents]";

    public static void main(String[] args) {
        java.sql.Connection con = null;
        try {
            // Load the SQL Server JDBC driver.
            Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver").newInstance();
            String url = "jdbc:sqlserver://localhost:1433;"
                    + "user=test.user;password=test.password;"
                    + "databaseName=AccessTest";
            con = DriverManager.getConnection(url);
            Statement st = con.createStatement();

            // Test a single bulk statement per iteration.
            long start_SQL1 = System.nanoTime();
            for (int i = 0; i < 10000; i++) {
                st.executeQuery(bulkQuery);
            }
            long finish_SQL1 = System.nanoTime();
            long net_SQL1 = finish_SQL1 - start_SQL1;
            System.out.println("Net time, single queries = " + net_SQL1);

            // Test multiple statements per iteration.
            long start_SQL2 = System.nanoTime();
            for (int i = 0; i < 10000; i++) {
                st.executeQuery(query1);
                st.executeQuery(query2);
                st.executeQuery(query3);
                st.executeQuery(query4);
            }
            long finish_SQL2 = System.nanoTime();
            long net_SQL2 = finish_SQL2 - start_SQL2;
            System.out.println("Net time, multiple queries = " + net_SQL2);

            double net_SQL1d = (double) net_SQL1;
            double net_SQL2d = (double) net_SQL2;
            double ratio = net_SQL2d / net_SQL1d;
            System.out.println("Ratio = " + ratio);
        } catch (Exception ee) {
            ee.printStackTrace();
        }
    }
}