Bioinformatics 501
Table of Contents
Course Outline
Chapter 1: Physical Level
Chapter 2: Algorithm Complexity
Chapter 3: Search Algorithms
Chapter 4: Sort Algorithms
Chapter 5: Trees
Chapter 6: Hashing
Authors’ Notes
Course Outline
Purpose
The purpose of this document is to introduce students to the concepts, structures, and algorithms that form the foundation on which a database is built. Since this document was created for a specific class, we assume students have at least taken CMSC-256 at VCU or an equivalent course.
Format
Consistency is a key element in learning and understanding this material, but unfortunately there are many different ways to format it. We will use this section to illustrate the different formats used in the examples of this text.
We assume the readers of this text have some general understanding of programming, and our target audience is expected to have some background specifically in Java. We will use pseudocode in our examples that can easily be translated to Java code, but can also be applied to many other languages. An example piece of code is below, followed by the Java implementation.
Pseudo Code
Java implementation
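To make the convention concrete, here is a small, hypothetical example (a routine that sums an array; it is not from the original course materials). The pseudocode appears as comments directly above its Java translation:

```java
public class PseudoCodeDemo {
    // Pseudocode:
    //   sum(A[1..N])
    //     total = 0
    //     for i = 1 to N
    //       total = total + A[i]
    //     return total
    public static int sum(int[] a) {
        int total = 0;
        for (int i = 0; i < a.length; i++) { // Java arrays are 0-indexed
            total = total + a[i];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[]{1, 2, 3, 4})); // prints 10
    }
}
```

Note that the pseudocode indexes from 1 while the Java version indexes from 0; this off-by-one shift is the most common adjustment when translating pseudocode into Java.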
Structure
This section briefly introduces what will be covered in this document. We will cover:
• Computer Architecture (Physical Level)
• Algorithm Complexity
• Search Algorithms
• Sort Algorithms
• Trees
• Hashing
Chapter 1: The Physical Level
While we can assume you have taken at least an intermediate Java course, we cannot assume you have taken any courses on computer architecture. The first section will briefly cover key concepts in this field that you will surely hear again in a database course.
You will learn how information is stored in a computer in both temporary and long-term storage, along with the time it takes to set and retrieve information from each.
Chapter 2: Algorithm Complexity
This is a simple concept, but it is perhaps one of the most important we will discuss. We can create algorithms that can solve just about anything. The problem arises when these algorithms take an exceptionally long time to run.
We will discuss how to determine an algorithm’s complexity in terms of its input size. We will show a few examples and give best, average, and worst cases. We generally care the most about worst-case scenarios. If we can reduce this worst case, then we know the algorithm will always run in an acceptable amount of time.
Chapter 3: Search Algorithms
Storing information is an essential part of computer science.
Retrieving that information
is just as important. Retrieving that information in an
appropriate amount of time is even more
important.
You will be introduced to two simple algorithms: sequential
search and binary search.
We will discuss how and when to use each.
After completing Chapter 3, you will be able to complete the
first programming
assignment.
5
Chapter 4: Sort Algorithms
Sorting and ordering information is crucial when it comes to
retrieving information
quickly. There are many situations where you would want to
sort your data. One example is in
data retrieval. Attempting to get information from data that is
not in any order will require
every element to be inspected. However, if that information is
sorted, then the time it takes to
retrieve it later can be greatly reduced.
There are many different sorting algorithms one can use. Some
are very intuitive but not
very efficient. Others can be very efficient but unintuitive and
difficult to code. There are
applications where each is useful so we will discuss many
different types of sorting algorithms
including:
• Insertion Sort
• Selection Sort
• Bubble Sort
• Merge Sort
An application will be provided with this document to help
visualize exactly what these
algorithms are doing while they are running. It will also include
sorts that are not discussed in
this text along with some variations of some of the more
efficient sorts.
After completing chapter 4, you will be able to complete the
second programming
assignment.
Chapter 5: Trees
You should already be familiar with the concept of the tree
data structure already. We
will discuss a simple binary tree as an introduction, but our
primary focus will be on B-Trees. It
is important to note that there are many different tree structures
we will not discuss, such as
general (non-binary, non-balanced) Trees, Heaps, Binary Search
Trees and Balanced Trees.
7. The B-Tree is a way of storing very large amounts of
information. Until now you may
have been able to store all the data you need in RAM. Most
databases have much more
information than available temporary memory so we have no
choice but to store the
information on hard disks. As you will learn in the Physical
Level chapter, retrieving information
from disk is much slower than RAM. B-Trees are constructed
with this in mind, giving us a way
to quickly navigate and gain access to specific files.
After completing chapter 5, you will be able to complete the
third programming
assignment.
Chapter 6: Hashing
Hashing is an important technique which gives extremely fast insertion of data and, when implemented correctly, extremely fast retrieval of that data. Hashing uses a combination of data structures you should already be familiar with: generally an array where each element of the array stores a linked list.
Chapter 1: Physical Level
This chapter gives an overview of some of the basic computer
hardware concepts, and
what they mean to a computer scientist.
Hardware
The first thing we will look at is the primary hardware components of a computer. If we ignore peripherals and output devices, the three main components of a computer are:
1.) The central processing unit (CPU).
2.) Random access memory (RAM), sometimes called the main or primary memory.
3.) The hard drive, also called secondary memory.
The CPU is basically the “brains” of a computer. It is what
executes application code,
manipulates data, and controls the other hardware. The CPU is
composed of three parts:
1.) ALU (Arithmetic Logic Unit) - As its name suggests, this is
what does all the mathematical
and logical operations.
2.) Control Unit - This can be thought of as the conductor; it
does not do anything by itself but
it tells the ALU what to do and communicates where to store
data in memory.
3.) Registers (little memory chips) - These are where the direct results of the ALU are put and where the data to be executed next is stored.
RAM is essentially the computer's workbench. It is the memory where the computer stores code and data that it is actively using. In more technical terms, RAM is a storage area of bytes that the CPU controls. RAM is relatively fast, especially when compared to a hard drive. Retrieving a particular byte from RAM can be done in a few nanoseconds (1 nanosecond = 1 billionth of a second). The main difference between RAM and our last component, the hard drive, is its speed and the fact that RAM is volatile, or non-persistent. This means that when RAM loses power, such as when a computer is turned off, all the data in RAM is lost.
The last primary hardware component is the hard drive (HD). Hard drives are a type of secondary storage. Other types of secondary storage are flash drives, CDs, DVDs, magnetic tape, and Blu-ray discs. A hard drive is used for long-term, or persistent, storage of data. Persistent means that, unlike RAM, when power is removed the data is still there. Hard drives are typically spinning metal disks on which data is stored as magnetic patterns. The other version of a hard drive is the solid state drive (SSD). SSDs have no moving parts and are faster than magnetic disks, but are much more expensive. While an SSD is faster than a magnetic disk, it is still much slower than RAM. However, no matter what type you use, all of them provide persistent storage.
Bytes
No matter what type of memory it is, registers, RAM, or hard drives, all memory is split up into “bytes.” A byte is made up of 8 “bits,” and it is the smallest “addressable” unit. Bytes are represented as base-2 numbers: each bit can have the binary value 1 or 0, and the value of each position is a power of two, from 2^0 up to 2^7. This means that one byte can hold a decimal value from 0 to 255. There are different conventions for the symbol of a byte, but it is typically denoted “B.” Prefixes are also used to represent multiples of bytes. However, since memory is base 2, a kilobyte is 2^10 = 1024 bytes, denoted by the symbol “kB,” rather than 1000 bytes. Just to confuse things, you also have the symbol “kb” for kilobit. You can also have megabytes (MB), gigabytes (GB), terabytes (TB), etc. Because of the different naming conventions, there is sometimes some ambiguity about what a symbol means. The bits in a byte are ordered right to left. The leftmost bit is called the “most significant bit” or “high-order bit”; in the same manner, the rightmost bit is called the “least significant bit” or “low-order bit.” It is important to remember that all the memory in a computer is measured in bytes and that a byte is the smallest addressable unit in memory.
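A short Java sketch (illustrative, not from the original text) showing how the 8 bit positions of a byte are weighted, and how to read its most and least significant bits:

```java
public class ByteDemo {
    public static void main(String[] args) {
        int b = 0b10110001;          // one byte written in binary: bits weighted 2^7 .. 2^0
        System.out.println(b);       // 128 + 32 + 16 + 1 = 177
        int msb = (b >> 7) & 1;      // most significant (high-order) bit
        int lsb = b & 1;             // least significant (low-order) bit
        System.out.println(msb + " " + lsb); // 1 1
        System.out.println(1 << 10); // 1 kB = 2^10 = 1024 bytes
    }
}
```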
This leads us to the term “word.” A word is basically the unit of data that the CPU thinks in. What this means is that the word size of the CPU is how large a piece of data the CPU can manipulate at a time. It is also the size of most of the registers. When you hear about 32-bit and 64-bit machines, this refers to the word size: a 32-bit machine has a word size of 32 bits. You can see why this is important with a simple arithmetic example. Suppose you wanted to execute 1000 + 1000. If you had a word size of one byte (8 bits), the largest number you could represent using a single word is 255. So to represent a larger number you have to use two words. This means that instead of the addition taking just one operation, and therefore one computer cycle, it would have to be split up into several operations taking several computer cycles. As this example demonstrates, when designing the architecture of a computer, the choice of word size is very important.
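The same limitation shows up directly in Java, where an `int` is one 32-bit word; this sketch (not from the original text) shows a value that does not fit in 8 bits and what happens when a 32-bit word overflows:

```java
public class WordSizeDemo {
    public static void main(String[] args) {
        // 8-bit word: the largest unsigned value is 2^8 - 1 = 255, so 1000 does not fit
        System.out.println((1 << 8) - 1);          // 255
        // 32-bit word: Integer.MAX_VALUE is 2^31 - 1; adding 1 overflows and wraps around
        System.out.println(Integer.MAX_VALUE);     // 2147483647
        System.out.println(Integer.MAX_VALUE + 1); // -2147483648 (wraps to MIN_VALUE)
        // Larger values need more than one word: Java's long is a 64-bit type
        System.out.println(2147483647L + 1L);      // 2147483648
    }
}
```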
While the bits in a byte are ordered right to left, that is not always the case for the bytes in a word when they are stored in memory. There are two byte orderings, called “Big Endian” and “Little Endian.” Suppose a given word has the four bytes B1 B2 B3 B4. If the bytes are stored in memory in the order B1 to B4, they are stored using big endian. If the bytes of each word are stored in the order B4 to B1, they are stored using little endian. There are advantages and disadvantages to both formats, and the choice was the basis for one of the old arguments between the PC and the Mac. Little endian means that you store the low-order byte of the number at the lowest memory address and the high-order byte at the highest address. The advantage of this is that it creates a one-to-one relationship between the byte number and the memory address. Big endian means that you store the high-order byte of the number at the lowest memory address and the low-order byte at the highest address. The advantage of this format is that you can test whether a number is positive or negative just by looking at the first byte.
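Java's `ByteBuffer` can lay out the same 32-bit word in either byte order, which makes the difference easy to see (an illustrative sketch, not from the original text):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class EndianDemo {
    public static void main(String[] args) {
        int word = 0x01020304; // four bytes: B1=01, B2=02, B3=03, B4=04
        byte[] big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(word).array();
        byte[] little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(word).array();
        System.out.println(Arrays.toString(big));    // [1, 2, 3, 4]  high-order byte first
        System.out.println(Arrays.toString(little)); // [4, 3, 2, 1]  low-order byte first
    }
}
```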
Encoding Schemes and Variable Types
Characters
Now that we know what a byte is, and that all data is stored as bytes, you might wonder how characters, pictures, and other types of data are represented. The answer is encoding schemes. ASCII (pronounced ask-ee) is an acronym for the American Standard Code for Information Interchange. ASCII is an encoding scheme for representing English alphabet characters as numbers. In this encoding scheme, each character fits in a byte and is assigned a number from 0 to 127. For example, the ASCII code for uppercase N is the decimal value 78, and lowercase n is the decimal value 110. Since ASCII is what most computers use to represent text, it is what makes it possible to share data between computers. The first version of ASCII was published in 1963 and went through several revisions before it became the version we use today in 1986. While ASCII can represent English characters, it does not support characters from other languages. To solve this, another encoding scheme was created called Unicode. Unicode represents characters as a two-byte number. This means it can represent up to 65,536 (2^16) different characters. The disadvantage of Unicode is that, since it is a two-byte encoding scheme, it takes twice the memory of ASCII.
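You can check these codes directly in Java, whose `char` type is a two-byte Unicode code unit (a sketch, not from the original text):

```java
public class AsciiDemo {
    public static void main(String[] args) {
        System.out.println((int) 'N'); // 78  - the ASCII code for uppercase N
        System.out.println((int) 'n'); // 110 - the ASCII code for lowercase n
        System.out.println((char) 78); // N   - the byte value 78 interpreted as a character
        // Java's char is a two-byte (16-bit) type, matching the two-byte scheme above
        System.out.println(Character.SIZE); // 16
    }
}
```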
Numbers
We now know how to represent characters, but what about numbers? Numbers are represented in two different ways. The first is as integers. Since they are integers, they cannot hold fractional values. We could just represent them using the plain binary value. This would mean that for a 32-bit word, we could represent the integer values 0 - 4,294,967,295. The problem with this method is that we cannot represent signed (negative) numbers. The most obvious way to represent signed integers would be to use the most significant bit as the “sign” bit. This would mean that when the most significant bit is “1” the integer is negative, and when it is “0” it is positive. Since the number is represented with 32 bits and 1 bit is used to denote the sign, this leaves us with 31 bits for the number, allowing us to represent the range −2,147,483,647 to 2,147,483,647. This method of representation is called signed magnitude.
The disadvantage, as seen in the following table (shown here for 3-bit values), is that we waste one representation: “100” encodes −0, a duplicate of “000.”
000 = 0    100 = −0
001 = 1    101 = −1
010 = 2    110 = −2
011 = 3    111 = −3
Another disadvantage is seen when executing arithmetic operations. For the CPU to add the two numbers 111 (−3) and 001 (1) together, it would require more than simple binary addition.
The solution to this is 2’s complement. 2’s complement is a representation method that allows the use of plain binary arithmetic operations on signed integers while yielding the correct result. 2’s complement is the method used in today’s computers to represent signed integers. In 2’s complement we still use the most significant bit to represent the sign of the integer. Positive integers, with a leading bit of 0, are straightforward, but negative numbers, with a leading bit of 1, are slightly different. A negative number is represented as the binary number that, when added to the positive number with the same absolute value, equals zero. This makes implementing the logic gates in the CPU much simpler than any other representation.
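Since Java's `int` is a 32-bit two's complement integer, these properties can be checked directly (an illustrative sketch, not from the original text). In two's complement, negating a number is the same as flipping every bit and adding one:

```java
public class TwosComplementDemo {
    public static void main(String[] args) {
        // In two's complement, -x is (~x + 1): flip every bit, then add one
        int x = 3;
        System.out.println(~x + 1); // -3
        // The 32-bit pattern for -3 has a leading sign bit of 1
        System.out.println(Integer.toBinaryString(-3));
        // prints 11111111111111111111111111111101
        // Plain binary addition gives the correct signed result: -3 + 1 == -2
        System.out.println(-3 + 1); // -2
    }
}
```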
The second way to represent numbers is as a floating point number. Floating point numbers represent “real” numbers, meaning they can represent both integers and fractional numbers. Floating point numbers are represented using an exponential format. For a floating point number represented as a single word, this gives us 32 bits. In the typical representation format, the bits are broken up into three parts: the sign, the significand, and the exponent. For a 32-bit number they would typically be separated like this:
The most significant bit (bit 31) is used to represent the sign of the number: 1 for negative, 0 for positive. The next eight bits (bits 30 - 23) are used to represent the exponent. The convention for the exponent is to “bias” it by 127. This means that to represent the exponent 6 we add 127 to it.
Example: 127 + 6 = 133, which is the binary value 10000101
On the other hand, the representation of the exponent −6 would be:
127 − 6 = 121, which is the binary value 01111001
The last 23 bits are used for the significand and are called the “mantissa.” The mantissa M is “normalized” so that it is between 0.5 and 1. The normalization is done by adjusting the binary exponent accordingly. So the decimal value 0.8125 in binary would be:
0.1101 = (1/2 + 1/4 + 1/16 = 13/16 = 0.8125)
The other thing to know about the mantissa is that, because of our normalization process, it always begins with 1. Since this is always the case we do not store the leading bit, which in effect gives the mantissa 24 bits of resolution using only 23 bits. This means that we can represent values ranging from approximately 1.5 × 10^−45 to 3.4 × 10^38 with a precision of about 7 decimal digits.
Let’s look at how the decimal number 0.085 is stored as an example. 0.085 is stored as “0 01111011 01011100001010001111011.” Its decimal values would be 0 for the sign, 123 for the exponent, and 3019899 for the significand. The exact value represented would be:
2^(e−127) × (1 + M/2^23)
= 2^(−4) × (1 + 3019899/8388608)
= 11408507/134217728
= 0.085000000894069671630859375
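Java exposes this bit layout directly through `Float.floatToIntBits`; this sketch (not from the original text) confirms the 0.085 example by extracting the three fields:

```java
public class FloatBitsDemo {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(0.085f);
        int sign = bits >>> 31;              // bit 31
        int exponent = (bits >>> 23) & 0xFF; // bits 30-23, biased by 127
        int mantissa = bits & 0x7FFFFF;      // bits 22-0
        System.out.println(sign);     // 0
        System.out.println(exponent); // 123  (123 - 127 = -4)
        System.out.println(mantissa); // 3019899
        // The stored value is only close to 0.085, not exactly equal to it
        System.out.println(0.085f == 0.085d); // false
    }
}
```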
As we can see, precision is not the same as accuracy. This can make programming with floating point numbers a perilous process for the unwary. Integers are exact, unless the result of a computation is outside the range that integers can represent. Floating point numbers, by contrast, are not exact, since some real numbers require an infinite number of digits to be represented, like 1/3.
Booleans
Booleans are the values true or false, yes or no, on or off. Since we only need to distinguish between two different values and we are using a base-2 system, representing booleans is easy: we just use “00000000” for false and “11111111” for true. While we could represent booleans using only one byte, as that is the smallest piece of addressable memory, a full 32-bit word is often used, as that is the word size of the CPU.
Programming Variables
You should already be familiar with variables from your
previous courses, but now you
know generally how the computer represents variables like
Java's float, int, and boolean. In the
examples given we only looked at 32 bit representations, but the
idea is the same for larger
representations like Java's double and long which represent
floating point and integer numbers
using 64 bits. The only difference is that since you have more
bits to work with you can
represent larger and more precise numbers.
Now that you know how different types of data are represented, you should realize how important it is to keep track of where you are in memory. The byte 00010100 can be read as the decimal value 20, an ASCII control character, or something else entirely, depending on how you look at it. This leads us into the file system.
File System
Before we look at how a file system is structured, we have to look at how the data is physically represented. The information on hard disk drives is split into what we call “tracks” and “sectors.” Tracks are concentric circles on the surface of the disk. Each track on a hard disk is numbered, starting with zero on the outside and increasing as you move toward the center. A sector is a subdivision of a track into fixed-size physical data blocks. Traditionally, hard disks are split into sectors of 512 bytes each. However, modern hard drives are divided into sectors of 2048 bytes, or even 4096 bytes, each. When looking at the information on the hard drive, you look at it sector by sector.
Now that we know how hard disks are divided up physically, we can look at how the data is actually stored. Files are stored on the hard disk as “records.” Records are a physical unit of information made up of “fields.” Another way of thinking about a record is as a subdivision of a file containing data related to a single entity. The fields that a record is made up of can be thought of as the variables in a program. For example, the following 6-byte record has room to hold an integer and two characters. A record like this could be used in a program to represent an ID designation.
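A hypothetical 6-byte record of this shape, a 4-byte integer ID followed by two 1-byte ASCII characters, can be sketched in Java with `ByteBuffer` (the field layout here is an assumption for illustration, not the text's exact record):

```java
import java.nio.ByteBuffer;

public class RecordDemo {
    // Pack a hypothetical 6-byte record: a 4-byte int field plus two 1-byte char fields
    public static byte[] pack(int id, char c1, char c2) {
        return ByteBuffer.allocate(6).putInt(id).put((byte) c1).put((byte) c2).array();
    }

    public static void main(String[] args) {
        byte[] record = pack(1234, 'A', 'B');
        System.out.println(record.length); // 6 - the record occupies exactly 6 bytes
        ByteBuffer in = ByteBuffer.wrap(record);
        System.out.println(in.getInt());     // 1234
        System.out.println((char) in.get()); // A
        System.out.println((char) in.get()); // B
    }
}
```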
The final term we need to know is “block.” A block is a physical data structure that holds records on a storage medium. It is a group of sectors that the operating system can address. It might be one sector, or it could be several. Blocks can also be thought of as groups of bytes. The size of a block is specified by the operating system and is, therefore, operating system dependent. Blocks are manipulated as a whole. For example, when disk drives read and write data, they do so one block at a time, such as in 512-byte blocks.
The last things we need to look at are “files” and the job of the “file system.” The file system organizes the data of the storage medium into an interface that is familiar to the computer's user. A file therefore refers to one or more blocks of data and gives them a useful name like “myFile.txt” for reference by the user. There is generally a very tight “coupling” between the file system and the operating system. The two main file systems in use right now are the proprietary NTFS file system used by Microsoft and its HFS+ equivalent by Apple. Another file system that several electronic devices use is Microsoft's old FAT32 file system.
I/O buffering
We have looked at the basic components of the computer, how memory is divided into units, how data is represented as a binary value, and how files are stored on the hard drive. This leaves us with our last topic for this section, which is I/O buffering. I/O buffering is the temporary storage of data passing between two components. Its purpose is to help smooth out the difference in the rates at which two devices can handle data.
I/O buffering is done at all levels of communication between the components of a computer. You can see why it is so important when trying to write a file. The CPU runs several orders of magnitude faster than disk drives. If we did not have buffers, the CPU would be slowed down to the speed that the disk drive runs at and be unable to do anything else until it finished. It should be apparent how inefficient that would be. With I/O buffering, the CPU can quickly send the information to the buffer and then go about its business while the disk drive writes the data to disk. This idea works for input from a disk drive as well. When the CPU wants a file, it sends the request to the disk drive. The CPU is then free to work on other tasks while the disk drive loads the file into the buffer.
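The same idea appears in Java's standard library. This sketch uses `BufferedOutputStream`, with an in-memory stream standing in for the slow device (an illustration, not from the original text):

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class BufferDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for a slow device
        // The buffer collects many small writes and hands them to the device in large chunks
        BufferedOutputStream out = new BufferedOutputStream(sink, 8192);
        for (int i = 0; i < 1000; i++) {
            out.write('x'); // cheap: usually just fills the in-memory buffer
        }
        out.flush();        // push any remaining buffered bytes to the device
        System.out.println(sink.size()); // 1000
    }
}
```

Without the buffer, each of the 1000 one-byte writes would be a separate request to the device; with it, the device sees only a few large transfers.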
If we look back at our primary components we can see that
RAM is used as an
intermediate buffer. In modern computers there are several
“controllers” that are used to
increase a computer's speed and efficiency. During a normal
execution cycle, whenever the CPU
needs a file from the disk drive it will tell the controller, which
will then load the information
from the buffer to a specific place in RAM. The CPU can then
begin executing using the data in RAM.
Chapter 2: Algorithm Complexity
For a given problem, there is often more than one
algorithm that solves the problem. So the question then
becomes which algorithm should we
use? All things being equal, we should use the algorithm that is easiest to understand, implement, and document. However, when performance is an issue, we have to look at how fast the algorithm runs, as well as how efficiently it uses the computer's resources. This makes understanding the time complexity of algorithms a central concept in computer science.
When we look at the time complexity of an algorithm, we typically do not consider how fast it runs on a specific computer. If one computer has a 1 GHz processor and another has a 2 GHz processor, then the second computer will generally execute the same algorithm twice as fast as the first. Instead, we look at how fast an algorithm runs as a function of the size of its input. We are typically interested in how the running time of an algorithm increases when we supply it with a “larger” input. What is the “size” of the input? That depends on the problem.
Example:
• Number of elements in the array to sort
• Number of vertices and edges in a graph to traverse
To demonstrate how to analyze an algorithm, we will look at an algorithm for finding the majority element in an array of integers. It takes as input an array of N positive integers, so the size of the problem is N. The output is the majority element (M.E.), the element in the array occurring more than N/2 times. For simplicity we will assume that a majority element exists in the array. In the algorithm we go through each element in the array and count how many times each element appears.
Examples:
• <1,4,4,4,2,3> -> no majority element
• <1,4,2,4,4> -> 4
• <2,2,2,3,3,3,3> -> 3
Running time legend: A - assignment, C - comparison, E - expression; conditional operations are marked in parentheses.

MajorityElement(A[1..N])
Repeats  Time        Line
1        A           1  mIdx = 1
1        A+N*(C+E)   2  for i = 1 to N
N        A           3      Counts[i] = 0
N        A+N*(C+E)   4      for j = 1 to N
N*N      C+E         5          if A[i] == A[j] then Counts[i]++       (E conditional)
N        C+A         6      if Counts[i] > Counts[mIdx] then mIdx = i  (A conditional)
1        -           7  return A[mIdx]

Running time = A + [A + N*(C+E)] + N*A + N*[A + N*(C+E)] + N^2*(C+E) + N*(C+A)
= 2A + (2A + 2C + E)*N + (2C + E)*N^2 + A*N + E*N^2
(where the final A*N and E*N^2 terms count the conditional operations)
First we will focus on the number of conditional executions.
Worst case - all elements of A are identical:
• we run E in each execution of line 5
• we never run A in line 6
• Running time = 2A + (2A+2C+E)*N + (2C+E)*N^2 + E*N^2
Best case - only N/2+1 copies of the majority element, at the start of A, all other elements unique:
• we run E (line 5) N/2+1 times in each of the first N/2+1 iterations of the line-2 loop
• we never run A in line 6
• Running time = 2A + (2A+2C+E)*N + (2C+E)*N^2 + E*(1 + N + N^2/4)
Typically, distinguishing between the running times of different elementary operations is:
• Too detailed - it obscures the picture
• Too machine-dependent
So we can assume that all elementary operations execute in the same constant unit amount of time:
• A = C = E = 1
Then the worst-case running time simplifies to:
• 2 + 5N + 3N^2 + N^2 = 2 + 5N + 4N^2
Since N^2 is the part of the expression that grows fastest with growing N, or more formally T(n) = n^2, this means that as the size of the input increases, the time the algorithm takes grows quadratically, as N^2. We call this an O(n^2) algorithm. O(n^2) means the algorithm is an “order n squared” algorithm.
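The counting pseudocode above translates directly into Java; this is a sketch of that O(n^2) approach (using 0-based indices, and assuming, as the text does, that a majority element exists):

```java
public class MajorityNaive {
    // Counting version of the pseudocode: for each element, count its occurrences
    // with a full pass over the array, giving N*N comparisons.
    public static int majorityElement(int[] a) {
        int n = a.length;
        int[] counts = new int[n];
        int mIdx = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (a[i] == a[j]) counts[i]++; // count occurrences of a[i]
            }
            if (counts[i] > counts[mIdx]) mIdx = i; // remember the most frequent so far
        }
        return a[mIdx];
    }

    public static void main(String[] args) {
        System.out.println(majorityElement(new int[]{1, 4, 2, 4, 4}));       // 4
        System.out.println(majorityElement(new int[]{2, 2, 2, 3, 3, 3, 3})); // 3
    }
}
```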
If we are worried about efficiency, then there is a problem with the previous approach: we repeat the same calculation many times. If the element X is in the array M times, then we count how many times X appears in the array M times over. This wastes time and resources. The solution is to group the identical elements together, so that we count each distinct X in the array only once. Since we are only looking at positive integers, we will set the last element to a negative number so we know when to stop.
This time we will ignore elementary operations, meaning executing a small number of A/C/E operations takes constant time (equal to 1).
The function N^2
MajorityElement(A[1..N])
Repeats  Time  Line
1        ?     1   A = sort(A);
1        1     2   me = A[1];
1        1     3   cnt = 1;
1        1     4   currentCnt = 1;
1        1     5   A[N+1] = -1;
1        N     6   for i = 2 to N+1
N        1     7       if A[i-1] == A[i] then
         1     8           currentCnt++;
         1     9       else if currentCnt > cnt then
         1     10          cnt = currentCnt;
         1     11          me = A[i-1];
         1     12          currentCnt = 1;
1              13  return me;
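A Java sketch of this sort-then-scan approach (using `Arrays.sort` and array bounds in place of the -1 sentinel, and assuming a majority element exists):

```java
import java.util.Arrays;

public class MajoritySorted {
    // Sort the array, then make one pass counting runs of identical elements;
    // the longest run belongs to the majority element.
    public static int majorityElement(int[] a) {
        int[] s = Arrays.copyOf(a, a.length); // leave the caller's array untouched
        Arrays.sort(s);                       // O(n log n)
        int me = s[0], cnt = 1, currentCnt = 1;
        for (int i = 1; i < s.length; i++) {  // O(n) scan
            if (s[i] == s[i - 1]) {
                currentCnt++;                 // still inside the current run
            } else {
                currentCnt = 1;               // a new run starts
            }
            if (currentCnt > cnt) {           // remember the longest run so far
                cnt = currentCnt;
                me = s[i];
            }
        }
        return me;
    }

    public static void main(String[] args) {
        System.out.println(majorityElement(new int[]{1, 4, 2, 4, 4}));       // 4
        System.out.println(majorityElement(new int[]{2, 2, 2, 3, 3, 3, 3})); // 3
    }
}
```

Comparing the run length after every step (rather than only when a run ends) also handles the case where the longest run is at the very end of the sorted array.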
We can see that the running time grows as N + (time to sort A). There are several different sorting methods. We will not get into them now, but there are a couple that have a running time of N log N. So if we choose an appropriate sorting method, the running time grows as a function of N log N, or O(n log n).
As we can see in the graph, the running time still grows faster than linearly, but it grows at a much slower rate than N^2.
The function N^2 (red) and N log N (blue)
We have looked at two algorithms for solving the same problem: one with a complexity of O(n^2) and the other O(n log n). Many people seeing this assume that the algorithm with the better time complexity always solves the problem faster. While not demonstrated by the previous algorithms, this is not always the case. We can see this by looking at the graphs of an O(n) and an O(n^2) algorithm.
As we can see, for a small N the algorithm with the better complexity can take longer than the other. This pattern is prominent in the study of algorithms; usually the simple approach is faster for small inputs, while the more complex approach is faster for large inputs. This means that if we are worried about maximum efficiency all the time, we must change our approach depending on the size of the input.
The last thing to note in this section is the usage of the term O(n). We said earlier that this means the time complexity of the algorithm was n. This is not the exact meaning of O(n). O(n) means that the complexity of an algorithm can be bounded above by some function c * n. For example, given an algorithm with a time complexity of 4N, we can come up with a function of order N that will always be above it, and therefore acts as an upper bound. This can be seen graphically in the following graph.
2N^2 (red) and 100N (blue)
The function 4N (red) is bounded above by the function 5N (blue), making the function 4N O(n).

The source for the material in this section came from lecture notes prepared by Dr. Tomasz Arodz.

Other Sources:
http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=complexity1
http://www.cse.buffalo.edu/~szhong/cse191/hw/191Lec6.pdf
Chapter 3: Search Algorithms
Searching is a very common task in computer science.
Examples include searching a list
for a specific element or searching an array of objects for the
smallest object. Although it is a
very common task, there are only two main searching
algorithms. They are a sequential search
and a binary search. Grasping the ideas and limitations of the
two search algorithms is fairly
intuitive so this section will only give a summary example of
each.
Sequential Search (sometimes called a linear search):
The idea behind the sequential search, as its name suggests, is
to start at the beginning
and sequentially look through a data set for the search
parameter and finish when the
parameter is either found or the end of the data set is reached.
Let's take a look at an example.
[4 29 6 3 9 34 23]
Suppose our search parameter was to see if the value 9 existed
in the array. We would
get the first value and see if it equals 9. If it does, we are
finished. If not, we get the next value
and try the comparison again. The whole sequence would look like this (the element currently being compared is shown in parentheses):
[(4) 29 6 3 9 34 23]
[4 (29) 6 3 9 34 23]
[4 29 (6) 3 9 34 23]
[4 29 6 (3) 9 34 23]
[4 29 6 3 (9) 34 23] <- found
The other typical search parameter is to find the smallest or
largest element in the data
set. To do this with a sequential search we have to have a
temporary variable to hold the
current smallest value. So to solve this we initialize the
temporary variable to the first value in
the array and then increment through the entire array, updating
the temporary variable as we
go. It would look like this:
Initialize (current element shown in parentheses):
[(4) 29 6 3 9 34 23] X = 4
Increment through the rest of the array:
[4 (29) 6 3 9 34 23] X = 4
[4 29 (6) 3 9 34 23] X = 4
[4 29 6 (3) 9 34 23] X = 3
[4 29 6 3 (9) 34 23] X = 3
[4 29 6 3 9 (34) 23] X = 3
[4 29 6 3 9 34 (23)] X = 3
Smallest element = 3.
Now we will examine the complexity of the algorithm. It
should be easy to see that for
searches with a parameter like “what is the smallest element”,
we have to go through the
entire array. Since we only have to look at each element once,
this makes the
complexity/runtime in the best, average, and worst case
scenarios O(n).
If we had a parameter like “does this value exist”, then we have
to look a little closer.
The best case would be that the first value we try is the value
we are looking for. The worst
case, of course, would be that the value we are looking for does
not exist or is the last element.
For the average case, we can say that, on average, the value we are looking for is in the middle, requiring about n/2 comparisons.
Summary

Case                             Number of Comparisons (for n = 100000)   Comparisons as a function of n
Best Case (fewest comparisons)   1 (target is first item)                 1
Worst Case (most comparisons)    100000 (target is last item)             n
Average Case                     50000 (target is middle item)            n/2
The best case analysis does not tell us much. If the first
element checked happens to be
the value we are looking for, any algorithm will take only one
comparison. The worst and
average case analysis gives us a better indication of an
algorithm’s efficiency.
Notice that if the size of the array grows, the number of
comparisons required to find a
parameter in both worst and average cases grows linearly. In
general, for an array of size n, the
worst case is n comparisons. The algorithm is also called a
linear search because its complexity
and efficiency can be expressed as a linear function. The
number of comparisons to find a target
increases linearly as the size of the list increases.
Although we have not looked at sorting algorithms yet, another thing to consider when looking at the complexity is whether the run time would change if the array were sorted. If the parameter is "what is the smallest or largest value", then the answer is yes: we would know the positions of the largest and smallest elements, and we would not need to search for them. However, if the parameter is "does this element exist", then the answer is no, as our earlier analysis of the complexity would still be valid.
Pseudo Code
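A minimal Java sketch of both kinds of sequential search described above (the class and method names are our own):

```java
public class SequentialSearch {
    // Returns the index of target in a, or -1 if it is not present.  O(n).
    public static int indexOf(int[] a, int target) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] == target) return i;  // found: finish early
        }
        return -1;                         // reached the end of the data set
    }

    // Returns the smallest value in a non-empty array: always n-1 comparisons.
    public static int min(int[] a) {
        int x = a[0];                      // temporary variable: current smallest
        for (int i = 1; i < a.length; i++) {
            if (a[i] < x) x = a[i];        // update as we go
        }
        return x;
    }

    public static void main(String[] args) {
        int[] a = {4, 29, 6, 3, 9, 34, 23};
        System.out.println(indexOf(a, 9)); // prints 4
        System.out.println(min(a));        // prints 3
    }
}
```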
Binary Search
Our second search algorithm is still intuitive, though slightly
more complex to
implement. You might wonder why we need another search
algorithm, as a sequential search
would technically work in every situation. The answer to that is
efficiency. Since a sequential
search’s complexity grows linearly to the size of the input, the
time it takes to execute grows
linearly as well. This is not an issue for small data sets with
only a few hundred to a few
thousand pieces of data. But what happens when the data set
becomes large like a few million
to a few billion pieces of data? Even with modern computers it
could take several minutes to
complete the search.
This is where a binary search algorithm comes into play. A
binary search can only be
used when the data set is sorted and random access is supported.
Therefore, in data structures
such as a linked list, a binary search cannot be used. As one of the requirements for using a binary search is that the data set be sorted, there is no need to use a binary search to find the largest or smallest element: their positions are already known, so no searching is required.
The premise behind a binary search is simple. Since our data set is sorted, comparing the middle value to our parameter will give us one of three situations.
1.) The value we are looking for is in the upper portion of the
data set,
2.) The value we are looking for is in the lower portion of the
data set, or
3.) The middle value is the value we are looking for.
By always comparing the middle value, the binary search
algorithm allows us to vastly
reduce the number of comparisons. Let's look at an example.
[9 20 34 35 47 49 65 68 80 86]
Suppose our search parameter was to see if the value 34 existed
in the array. We first
find the middle value; if that is the value 34 we are done. If it is
not we “cut” the area in the
array in half, which reduces the potential comparisons by half
as well. We keep doing this
process until we find the value we are looking for or until we
cannot cut the array in half
anymore. The sequence of events would look like this.
Active section: [9 20 34 35 47 49 65 68 80 86]   middle index: (1 + 10)/2 = 5.5 => 5
Active section: [9 20 34 35]                     middle index: (1 + 4)/2 = 2.5 => 2
Active section: [34 35]                          middle index: (3 + 4)/2 = 3.5 => 3, found
Now that we see how it works, let's look at the complexity. We
said earlier that a binary
search was more efficient than a linear search. If that is so, how
much more efficient is it? To
answer this, we look at the number of comparisons needed in
both the best and worst case
scenarios. We will not look at the average case, as it is more
difficult to compute, and it ignores
the differences between the required computations
corresponding to each comparison in the
different algorithms.
The best case of course would be that the middle value is what
we are looking for, so
the best case scenario does not tell us very much about the
algorithms efficiency. That leaves
the worst case scenario. The worst case, as with a sequential
search, is that the value does not
exist or is the last value that we check. So to compare the
complexity of the worst case to the
size of the input we get the following scenario.
[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]
Goal: find the value 16.
The first index we look at is: (1+16)/2 = 8.5 => 8
First comparison
Active Section: [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
16]
8 < 16 next index is: (9+16)/2 = 12.5 => 12
Second comparison
Active Section: [9 10 11 12 13 14 15 16]
12 < 16 next index is: (13+16)/2 = 14.5 => 14
Third comparison
Active Section: [13 14 15 16]
14 < 16 next index is: (15+16)/2 = 15.5 => 15
Fourth comparison
Active Section: [15 16]
15 < 16 next index is: (16+16)/2 = 16
Final comparison
Active Section: [16]
16 = 16
So it takes us a maximum of five comparisons to find any element in a dataset containing sixteen elements. Or, to express it in mathematical terms, given a data set of size n it takes us at most X comparisons, where X = log2(n) + 1. So our complexity is O(log n).
Summary

Case                             Number of Comparisons (for n = 100000)   Comparisons as a function of n
Best Case (fewest comparisons)   1 (target is middle item)                1
Worst Case (most comparisons)    17 (target not in array)                 log2(n) + 1
Pseudo Code
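A Java sketch of the procedure just described (the class and method names are our own; the array must already be sorted):

```java
public class BinarySearch {
    // Returns the index of target in the sorted array a, or -1.  O(log n).
    public static int search(int[] a, int target) {
        int low = 0, high = a.length - 1;
        while (low <= high) {
            int mid = (low + high) / 2;        // middle of the active section
            if (a[mid] == target) return mid;  // situation 3: found it
            if (a[mid] < target) {
                low = mid + 1;                 // situation 1: upper portion
            } else {
                high = mid - 1;                // situation 2: lower portion
            }
        }
        return -1;                             // active section is empty
    }

    public static void main(String[] args) {
        int[] a = {9, 20, 34, 35, 47, 49, 65, 68, 80, 86};
        System.out.println(search(a, 34));     // prints 2
        System.out.println(search(a, 33));     // prints -1
    }
}
```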
Other sources:
http://research.cs.queensu.ca/home/cisc121/2006s/webnotes/search.html
Chapter 4: Sort Algorithms
Sorting

Sorting problem (definition reminder)
Input: a sequence of numbers <a1, a2, ..., an>
Output: a permutation <a1', a2', ..., an'> of the input sequence, such that a1' <= a2' <= ... <= an'
Insertion sort:
Concept for the algorithm (to sort an array):
- Maintain two parts of the array:
  - Sorted part: initially empty (the left part)
  - Unsorted part: initially full (the right part)
- Take one element from the unsorted part and insert it at the correct position in the sorted part.
- Iterate until all elements have been moved to the sorted part of the array and the unsorted part is empty.
Start with an unsorted array of size n, [0...n-1] where 0 is the
first array index and n-1 is
the last index.
[ 4 5 2 0 9 ]
Imagine splitting this array into two different parts. The left
half will be sorted, the right
half is not. I will show my split in the array as two vertical
lines. An array split in the middle
would be of the form [0...i || j..n-1]. It is important to note that
this is still one array, the split is
conceptual only.
We are going to apply this conceptual split to our array. We will
put the split after the
first index, so we have just one element on the left, and n-1
elements on the right. After our
imaginary split, our array now looks like this.
[ 4 || 5 2 0 9 ]
  i    j     n-1
Now during each iteration of insertion sort, we will look at the
first element of the right
part of our array which we previously showed was at index j.
We are going to move this
element (insert it) into the left part of the array. We will keep
moving the element to the left
until the left array is sorted again. As you can see, in this
example we didn’t have to physically
move anything in the array for this iteration.
[ 4 5 || 2 0 9 ]
We will repeat that process by moving 2 into the left section of
the array.
[ 4 5 2 || 0 9 ]
We can clearly see that the left part of the array is no longer
ordered, so let’s continue
moving 2 to the left until the array is ordered again.
[ 4 2 5 || 0 9 ]
[ 2 4 5 || 0 9 ]
We will continue to repeat this process until all elements are in
the left, sorted section.
[ 2 4 5 0 || 9 ]
[ 2 4 0 5 || 9 ]
[ 2 0 4 5 || 9 ]
[ 0 2 4 5 || 9 ] ← End of this iteration.
[ 0 2 4 5 9 || ] ← End of the final iteration.
We can see here that the left section of the array, which is the
sorted section, contains
all the elements and is still sorted. So now the question is, how
do we do this in a program?
Pseudo Code
Note this pseudo code receives a reference to an array.
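In Java, arrays are passed by reference, so a sketch of the algorithm (the class and method names are our own) can sort the caller's array in place:

```java
import java.util.Arrays;

public class InsertionSort {
    // Sorts a in place.  Worst/average case O(n^2), best case O(n).
    public static void sort(int[] a) {
        for (int j = 1; j < a.length; j++) {  // a[0..j-1] is the sorted part
            int key = a[j];                   // first element of the unsorted part
            int i = j - 1;
            while (i >= 0 && a[i] > key) {    // shift larger elements right...
                a[i + 1] = a[i];
                i--;
            }
            a[i + 1] = key;                   // ...then drop the key into place
        }
    }

    public static void main(String[] args) {
        int[] a = {4, 5, 2, 0, 9};
        sort(a);
        System.out.println(Arrays.toString(a)); // prints [0, 2, 4, 5, 9]
    }
}
```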
Algorithm Complexity
Let’s examine the pseudo code for complexity.
Pseudo Code
This will be a little more in-depth than we will go in the future, but for now we will show what everything is doing along with its relative complexity. The outer for loop's counter increments from 1 to n-1, so its run time is directly related to the size of the array, n. You should be able to see that as n gets larger, the number of times the for loop increments grows in proportion to n.
Everything else is contained in the for loop, so we must, in
essence, say that the number
of executions of everything inside the for loop also depends on
n. We have a few constant time
assignment operations and we have another loop. The
complexity of this loop can be a little
tricky to understand.
This is saying that we have a key which is located at index j. We want to move this key until it fits in the right spot, which is when A[i] <= key or i < 0. The number of spaces we actually move this key element will vary throughout the algorithm, but the thing to remember is that, as n gets larger, we will generally have to move the key more spaces until we find its home, which makes this an O(n) loop. Remember that this loop is inside another loop that is also O(n), so the contents of the while loop can run up to n^2 times in a worst case scenario. This means that this algorithm is O(n^2).
Now that we have looked at the complexity, it is easy to see how Insertion Sort performs:
Worst case performance: O(n^2) (ex: a reverse-sorted array)
Best case performance: O(n) (ex: an already sorted array)
Average case performance: O(n^2)
Other Sources:
http://www.algolist.net/Algorithms/Sorting/Insertion_sort
Selection Sort:
Our second sort focuses on a different, yet still intuitive, way of arranging elements into the correct order. Insertion Sort focused on taking an element from the unordered part of the array and finding its place in the part that is ordered. Selection Sort does something similar. However, instead of grabbing any element from the unordered part, it finds the largest element and swaps it with the smallest element of the ordered part. Remember, since that part is ordered, its smallest element will always be its left-most element. Let's take a look at an example.
[4 2 5 1 6 7 0]
This is an unordered array. Let's divide this array into a sorted and an unsorted portion, similar to what we did with insertion sort. However, the sorted part of this array will be the right side.
[4 2 5 1 6 7 || 0]
You can see that the left part of this array is not sorted, and the
right side is sorted as it
only has one element. We will need to know which element is the largest in the unsorted part, so we will keep that element's index as a key. Since color is not available here, we will mark the stored largest element with asterisks and the element currently being examined with parentheses. The first element in the array will initially be marked as the largest, and we will update that as we move through the array. The first complete iteration looks like this:

LargestElement = 4
[*4* 2 5 1 6 7 || 0]     start: 4 is marked as the largest
[*4* (2) 5 1 6 7 || 0]   2 < 4
[*4* 2 (5) 1 6 7 || 0]   5 > 4, largest is now 5
[4 2 *5* (1) 6 7 || 0]   1 < 5
[4 2 *5* 1 (6) 7 || 0]   6 > 5, largest is now 6
[4 2 5 1 *6* (7) || 0]   7 > 6, largest is now 7
[4 2 5 1 6 *7* || (0)]   look at the first element of the ordered part
[4 2 5 1 6 0 || 7]       swap 7 with 0
As you should be able to see, we looked at every element in the
unordered array once.
We also looked at the first element of the ordered array. If we
found an element that was
larger than the previous largest, we simply marked that element
as the new largest and kept
looking. Once we arrived at the end of the unsorted array, all
we had to do was swap it with the
first element of the ordered array.
For each new iteration, we will slide the divider one element to the left and continue. Showing only each iteration's scan result and swap:

[4 2 5 1 6 || 0 7]   scan: largest = 6
[4 2 5 1 0 || 6 7]   swap 6 with 0
[4 2 5 1 || 0 6 7]   scan: largest = 5
[4 2 0 1 || 5 6 7]   swap 5 with 0
[4 2 0 || 1 5 6 7]   scan: largest = 4
[1 2 0 || 4 5 6 7]   swap 4 with 1
[1 2 || 0 4 5 6 7]   scan: largest = 2
[1 0 || 2 4 5 6 7]   swap 2 with 0
[1 || 0 2 4 5 6 7]   scan: largest = 1
[0 || 1 2 4 5 6 7]   swap 1 with 0
Now that we have reached the end of this last iteration, we can
see that, no matter
what the first element in the array is, it will always be smaller
than every element in the sorted
array. This is because every element we moved thus far has
been larger than this last element.
Before we even look at the pseudo code, we can get a good
understanding of the
complexity of this algorithm. For each extra element in the
array, the number of iterations we
would have to do will grow by 1. During each of these
iterations, we have to look at every
element in the unsorted array. While this number gradually gets
smaller as the algorithm
progresses, ultimately as n gets larger, so will the number of
elements we have to look at
during each iteration. This tells us already that Selection Sort
will be O(n^2).
Pseudo-code
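A Java sketch of selection sort as walked through above (the class and method names are our own): it repeatedly scans the unsorted part for the largest element and swaps it into the front of the sorted right-hand section.

```java
import java.util.Arrays;

public class SelectionSort {
    // Sorts a in place.  Best, worst, and average case are all O(n^2).
    public static void sort(int[] a) {
        for (int end = a.length - 1; end > 0; end--) {  // a[end+1..] is sorted
            int largest = 0;                 // index of the largest seen so far
            for (int i = 1; i <= end; i++) { // scan the unsorted part
                if (a[i] > a[largest]) largest = i;
            }
            int tmp = a[end];                // swap the largest into position end
            a[end] = a[largest];
            a[largest] = tmp;
        }
    }

    public static void main(String[] args) {
        int[] a = {4, 2, 5, 1, 6, 7, 0};
        sort(a);
        System.out.println(Arrays.toString(a)); // prints [0, 1, 2, 4, 5, 6, 7]
    }
}
```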
Bubble Sort:
The third sort we will discuss is the Bubble Sort. Unlike
Insertion and Selection sort, this
one is not so intuitive. The name comes from bubbles rising to
the surface of water. As the
bubble passes through the array it moves each number closer to
the location it needs to be.
In order to show this we will once again start off with an
unordered array.
[9 1 2 4 5 8 7 6 3]
[1 9 2 4 5 8 7 6 3]
[1 2 9 4 5 8 7 6 3]
[1 2 4 9 5 8 7 6 3]
[1 2 4 5 9 8 7 6 3]
[1 2 4 5 8 9 7 6 3]
[1 2 4 5 8 7 9 6 3]
[1 2 4 5 8 7 6 9 3]
[1 2 4 5 8 7 6 3 9]
This is one iteration of the bubble sort. The red “bubble” is
going from left to right and
each time it is putting the two elements inside it in the proper
order. Since the largest element,
9, happened to be at the beginning of the array, the 9 was
trapped in the bubble until it was put
at the very end. We will create a new bubble to iterate through
the array and each time it will
grab the next largest element and put it in its place. Let’s go
through the rest of this sort.
[1 2 4 5 8 7 6 3 9]
[1 2 4 5 8 7 6 3 9]
[1 2 4 5 8 7 6 3 9]
[1 2 4 5 8 7 6 3 9]
[1 2 4 5 8 7 6 3 9]
[1 2 4 5 7 8 6 3 9]
[1 2 4 5 7 6 8 3 9]
[1 2 4 5 7 6 3 8 9] <-End of iteration 2
[1 2 4 5 7 6 3 8 9]
[1 2 4 5 7 6 3 8 9]
[1 2 4 5 7 6 3 8 9]
[1 2 4 5 7 6 3 8 9]
[1 2 4 5 7 6 3 8 9]
[1 2 4 5 6 7 3 8 9]
[1 2 4 5 6 3 7 8 9] <-End of iteration 3
[1 2 4 5 6 3 7 8 9]
[1 2 4 5 6 3 7 8 9]
[1 2 4 5 6 3 7 8 9]
[1 2 4 5 6 3 7 8 9]
[1 2 4 5 6 3 7 8 9]
[1 2 4 5 3 6 7 8 9] <-End of iteration 4
[1 2 4 5 3 6 7 8 9]
[1 2 4 5 3 6 7 8 9]
[1 2 4 5 3 6 7 8 9]
[1 2 4 5 3 6 7 8 9]
[1 2 4 3 5 6 7 8 9] <-End of iteration 5
[1 2 4 3 5 6 7 8 9]
[1 2 4 3 5 6 7 8 9]
[1 2 4 3 5 6 7 8 9]
[1 2 3 4 5 6 7 8 9] <-End of iteration 6
As you can probably see, each iteration the bubble has to go
one less index in the array.
This is fairly easy to implement because we can just reduce the
apparent size of the array by
one in each iteration. You can also see that we stopped the
algorithm early because it is sorted.
In a worst case scenario, we would have to iterate through this n-1 times. However, there is also a technique to determine whether the array is sorted, and all it requires is one more pass through the array.
[1 2 3 4 5 6 7 8 9]
[1 2 3 4 5 6 7 8 9]
[1 2 3 4 5 6 7 8 9]
Notice that the bubble never moved anything. If the bubble had
moved something, then
we know that the array was not sorted when we began this
iteration.
Pseudo code
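A Java sketch of bubble sort with the early-exit check described above (the class and method names are our own): when a full pass moves nothing, the array is sorted and the outer loop stops.

```java
import java.util.Arrays;

public class BubbleSort {
    // Sorts a in place.  Worst case O(n^2), best case O(n) on sorted input.
    public static void sort(int[] a) {
        int n = a.length;
        boolean swapped = true;
        while (swapped) {                 // stop once a pass moves nothing
            swapped = false;
            for (int i = 1; i < n; i++) { // the "bubble" sweeps left to right
                if (a[i - 1] > a[i]) {
                    int tmp = a[i - 1];   // put the pair in the proper order
                    a[i - 1] = a[i];
                    a[i] = tmp;
                    swapped = true;
                }
            }
            n--;  // the largest remaining element is now at the end
        }
    }

    public static void main(String[] args) {
        int[] a = {9, 1, 2, 4, 5, 8, 7, 6, 3};
        sort(a);
        System.out.println(Arrays.toString(a)); // prints [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```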
Complexity Summary:
Bubble sort contains a for loop that grows in proportion to n. This loop is inside a while loop. The while loop usually won't iterate n times, but its number of iterations tends to grow linearly with n.

Worst Case: O(n^2)
We will see the worst case when sorting an array that is initially reverse sorted.

Best Case: O(n)
We will see the best case when sorting an array that is already sorted. The for loop will make one full pass, see that the bubble didn't move anything, and break out of the while loop.

Average Case: O(n^2)
For the majority of sorts done with bubble sort, the computation time will grow by some factor of n^2.
Merge Sort:
The final sorting algorithm we will discuss is Merge Sort. This sort uses a divide-and-conquer approach. This is not a way people tend to sort things, but it is much more efficient at sorting very large arrays. We will use a different type of example for this, taken from Dr. Tomasz Arodz's CMSC 401 lecture notes. Many of his notes we use include images from "Introduction to Algorithms, Third Edition" by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
General Merging of two arrays to produce a single, sorted,
array:
We will start by exploring how to merge two arrays together. First we will see the merging of two unordered arrays, then of two ordered arrays. We want to start with two separate arrays and end up with a single, sorted array.
Given two arrays:
[3 5 1 7]
[4 2 8 6]
Let’s think about this for a minute. We have options if we want
to merge these two
arrays. If you apply what you have learned from the past sorts,
we can simply put these two
arrays into one and then sort them with Insertion, Selection or
Bubble sort. So, let’s try that.
We will use the Insertion Sort method. We won’t show every
minute step, just remember that
during each step we are looking at every element in the sorted
side until we find the correct
place for each element.
[3 5 1 7 4 2 8 6]
[3 5|| 1 7 4 2 8 6]
[1 3 5|| 7 4 2 8 6]
[1 3 5 7|| 4 2 8 6]
[1 3 4 5 7|| 2 8 6]
[1 2 3 4 5 7|| 8 6]
[1 2 3 4 5 7 8|| 6]
[1 2 3 4 5 6 7 8||]
We have seen this before. We know this is a O(n^2) algorithm.
Let’s look at a different approach. We will take the same two
arrays, only first we will
sort those arrays before combining them.
[1 3 5 7]
[2 4 6 8]
Now let’s set them up so they are easier to visualize. We will
go through this part step-
by-step. We will have the two initial arrays on top and we will
create a destination array on the
bottom. The destination array will be large enough to fit all of
the elements and initially it will
be empty.
[1 3 5 7] [2 4 6 8]
[ ]
We will keep an index for the element we look at in each array.
We will highlight these
current elements in red.
[1 3 5 7] [2 4 6 8]
[ ]
Now for the fun part, let’s start the merge process. During each
iteration, we will
perform one check. We will find which red element is the
smallest, put that element in the
destination array, and then look at the next element from the
source array. That may be a bit
confusing to conceptualize, so let’s see it in action.
[1 3 5 7] [2 4 6 8]
(1 < 2)? Yes, let’s move 1 down.
[1 ]
[1 3 5 7] [2 4 6 8]
(3 < 2)? No. Let’s move the 2 down.
[1 2 ]
[1 3 5 7] [2 4 6 8]
(3 < 4)? Yes. 3 goes down.
[1 2 3 ]
[1 3 5 7] [2 4 6 8]
(5 < 4)? No. Move 4.
[1 2 3 4 ]
[1 3 5 7] [2 4 6 8]
(5 < 6)? Yes. Move 5.
[1 2 3 4 5 ]
[1 3 5 7] [2 4 6 8]
(7 < 6)? No. Move 6.
[1 2 3 4 5 6 ]
[1 3 5 7] [2 4 6 8]
(7 < 8)? Yes. Move 7.
[1 2 3 4 5 6 7 ]
Now we have a small issue. As you can see, the next, and only, possible item to move now is the 8. However, telling a computer how to do this is a bit more complicated. Luckily, we know a few solutions to this issue. The solution we will cover involves expanding each source array to include one extra element, positive infinity (INF). We would need to do this before we started the merging process, but it doesn't affect anything until now. So, imagine we made this addition before, and we will resume where we left off.
[1 3 5 7 INF] [2 4 6 8 INF]
(INF < 8)? Not even close. Move 8.
[1 2 3 4 5 6 7 8]
Now you may be asking yourself, how do we use infinity in a Java program? With the int type you can't, but you can come close: every integer type has a maximum value, which we can get in Java with Integer.MAX_VALUE. The floating-point primitive types even provide a true infinity value, such as Double.POSITIVE_INFINITY.
Going back to the merging, we can see that, with two pre-
sorted arrays, we only look at
each element once. The number of elements we look at grows
proportionally to the number of
elements we need to merge. This means the merging of two
sorted arrays is O(n). This doesn’t
do us much good though if we have to use a O(n^2) algorithm to
get the two initial arrays
sorted.
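The sentinel-based merging of two sorted arrays could be sketched in Java like this (the class and method names are our own; Integer.MAX_VALUE stands in for INF, as discussed above):

```java
import java.util.Arrays;

public class Merge {
    // Merges two sorted arrays into one sorted array in O(n), using
    // Integer.MAX_VALUE as the "positive infinity" sentinel.
    public static int[] merge(int[] left, int[] right) {
        int[] l = Arrays.copyOf(left, left.length + 1);
        int[] r = Arrays.copyOf(right, right.length + 1);
        l[left.length] = Integer.MAX_VALUE;  // sentinel: never wins a comparison
        r[right.length] = Integer.MAX_VALUE; // until its side has run out
        int[] dest = new int[left.length + right.length];
        int i = 0, j = 0;
        for (int k = 0; k < dest.length; k++) {
            // one comparison per iteration: take the smaller current element
            if (l[i] <= r[j]) dest[k] = l[i++];
            else dest[k] = r[j++];
        }
        return dest;
    }

    public static void main(String[] args) {
        int[] merged = merge(new int[]{1, 3, 5, 7}, new int[]{2, 4, 6, 8});
        System.out.println(Arrays.toString(merged)); // prints [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```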
Divide and Conquer approach:
As mentioned above, merge sort is a divide and conquer
algorithm. It divides the
problem into smaller problems of the same nature, recursively
solves them, and then combines
their solutions. We will divide the array into two smaller arrays
until the arrays contain just one
element each.
This division of the array creates a tree where each child node
is an array that is half the
size of its parent node. Each leaf will be an array containing
just one element. This, as we have
seen before, means that each leaf is a sorted array.
This should be fairly easy to see so far, at least conceptually.
Implementing this in code
will be slightly trickier since computers tend to do things
procedurally. We will see how this is
done later in this text, but for now let’s go over the merging
procedure.
We will start with the leaf nodes. Since they are already sorted,
we will simply use the
merging procedure we talked about above. We want to merge
them into the same parent
arrays they had before.
As you can see, Merge sort first divides the problem into a lot
of small, easy to tackle
problems. The “easy problem” in this case is merging two
arrays that are already sorted.
Each time we go down a level, we cut the array in focus in half. Each level will contain twice as many smaller problems as the last, until the subarrays hold just one element each; this means there are about log2(n) levels, so the dividing process is O(log2(n)). We already discovered that merging two sorted arrays is O(n), so doing an O(n) operation log2(n) times gives us an O(n log2(n)) algorithm. This is significantly better than the O(n^2) sorting algorithms we discussed earlier for very large n.
Pseudo-Code
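Combining the divide step with the sentinel-based merge, a Java sketch of merge sort (the class and method names are our own) might look like this:

```java
import java.util.Arrays;

public class MergeSort {
    public static void sort(int[] a) {
        sort(a, 0, a.length - 1);
    }

    private static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) return;       // one element: a leaf, already sorted
        int mid = (lo + hi) / 2;
        sort(a, lo, mid);           // divide: recursively sort each half...
        sort(a, mid + 1, hi);
        merge(a, lo, mid, hi);      // ...then merge the two sorted halves
    }

    private static void merge(int[] a, int lo, int mid, int hi) {
        int[] l = new int[mid - lo + 2];   // one extra slot for the sentinel
        int[] r = new int[hi - mid + 1];
        for (int i = 0; i < l.length - 1; i++) l[i] = a[lo + i];
        for (int j = 0; j < r.length - 1; j++) r[j] = a[mid + 1 + j];
        l[l.length - 1] = Integer.MAX_VALUE;  // INF sentinels, as in the
        r[r.length - 1] = Integer.MAX_VALUE;  // merging section above
        int i = 0, j = 0;
        for (int k = lo; k <= hi; k++) {
            a[k] = (l[i] <= r[j]) ? l[i++] : r[j++];
        }
    }

    public static void main(String[] args) {
        int[] a = {3, 5, 1, 7, 4, 2, 8, 6};
        sort(a);
        System.out.println(Arrays.toString(a)); // prints [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```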
Chapter 5: Trees
Trees:
We will now take a look at Trees. For our purposes, the only
trees we will be discussing
in detail are B-Trees. However, you will need a firm
understanding of general tree structures
before you can fully understand the concepts behind a B-Tree.
You may have gone over trees in CMSC-256; if so, this section will just be a review.
Trees in computer science are used as a means of storing data
using an acyclic
connected graph where each node has zero or more children
nodes and at most one parent
node. Furthermore, the children of each node have a specific
order.
Much of the following information was used, with permission,
from Dr. Arodz’s CMSC
401 notes.
We will discuss the basic tree operations, but first we need to describe the structures we are using. The first is the Node.
- Each node has a key: x.key
- Each node may have children: x.left, x.right
  - A null child represents no child
- Each node has a parent: x.p
  - The exception to this is the root node. The root node of any given tree, by definition, has no parent.
A binary tree must maintain certain properties with respect to
these nodes. For each node x,
- If node y is in a left subtree of x, y.key <= x.key
- If node y is in a right subtree of x, y.key >= x.key
This tree has other properties you are probably already familiar
with. With the trees we
display, we will only be displaying the node and key. Each of
these nodes in a real application
would contain some information. This information is usually
accessed with a getter method
such as: x.data. Other properties of this particular tree include:
- 6 is the root node
- 5 is the left child of 6. (7 is the right)
- 2, 5 and 8 are leaf nodes.
- 5 is the root of a subtree of 6.
As with all data structures there is a set of operations we will
want to perform on this
tree. We will want to, at minimum, insert and remove nodes
from a tree. Other useful
operations include in order, preorder and post order traversals.
Note: the following methods
can be done recursively. We will only show the iterative
approach.
Insert:
This is the full pseudo code for a Binary Tree insertion. Instead
of explaining how an
insertion works then providing the code at the end, we will take
the opposite approach. This
way we can provide the overview and decompose the code piece
by piece.
This method takes a Tree and a new node to be inserted into the
tree.
We define the current node as the root of the tree. We also need
to keep track of the
parent. Since the root of any tree has no parent, we initialize
this as null.
This is where we find the proper location to put our new node.
We always insert onto a
leaf, so our while loop will iterate until the current node is null
which is why we want to keep
track of the parent. After iterating through this loop, the
currentParent variable will hold the
node which will be a parent to our new node…
…so let’s go ahead and assign the parent reference of our new
node to the
currentParent variable.
Now we have to determine the details of where we are inserting
this node. If the tree
was empty, Tree.root would have returned null. In that case,
currentNode would have been null
and the while loop would have never executed and currentParent
would be null. The above
statement covers this situation and inserts the new node as the
root of the tree.
If the parent is not null then we will have to insert to the left or
the right of that parent.
This statement determines where that is. If the value of the new
node is less than its parent we
will insert as the left child, otherwise right.
That’s it! We are done with inserting a single node into a
binary tree.
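Putting the pieces together, the insertion walk-through might look like this in Java (a sketch; the class and field names are our own, following the x.key, x.left, x.right, x.p convention above):

```java
public class BinaryTree {
    public static class Node {
        public int key;
        public Node left, right, p;  // p is the parent, as in the text
        public Node(int key) { this.key = key; }
    }

    public Node root;

    // Iterative insertion: walk down from the root until we fall off a leaf,
    // then attach z as a child of the last node visited.
    public void insert(Node z) {
        Node current = root;
        Node parent = null;          // the root, by definition, has no parent
        while (current != null) {
            parent = current;
            current = (z.key < current.key) ? current.left : current.right;
        }
        z.p = parent;
        if (parent == null) {
            root = z;                // tree was empty: z becomes the root
        } else if (z.key < parent.key) {
            parent.left = z;         // smaller keys go to the left...
        } else {
            parent.right = z;        // ...larger or equal keys to the right
        }
    }

    public static void main(String[] args) {
        BinaryTree t = new BinaryTree();
        t.insert(new Node(6));
        t.insert(new Node(5));
        t.insert(new Node(7));
        t.insert(new Node(2));
        System.out.println(t.root.key); // prints 6
    }
}
```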
Remove:
Removing from a binary tree is slightly more complicated. We
will try to break it down
into small chunks. To do this, we will introduce two helper
methods. These also have other uses
outside of removing a node which we will not discuss.
The first helper method we will introduce is TreeMinimum. Given a tree, if you follow the left child until you reach a node where there is no left child, you will end up at the smallest value in that tree. This can be used on any node in a tree in order to get the minimum value within a subtree.

Two examples are highlighted in the tree to the right. The TreeMinimum of 6 is 2. The TreeMinimum of 18 is 17.
If you were to take the minimum of 7, you would simply get 7.
This code should be fairly self-explanatory. We extract the root
of the current tree.
While the current node’s left child is not null, we move on to
that left child. When the left child
is null we know we have reached the smallest value in the tree.
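In Java, TreeMinimum might be sketched as follows (the class and field names are our own; the example tree is a hypothetical fragment matching the "TreeMinimum of 6 is 2" case from the text):

```java
public class TreeMinimum {
    public static class Node {
        public int key;
        public Node left, right;
        public Node(int key, Node left, Node right) {
            this.key = key;
            this.left = left;
            this.right = right;
        }
    }

    // Follow left children until there is no left child: that node holds
    // the smallest key in the subtree rooted at x.
    public static Node minimum(Node x) {
        while (x.left != null) {
            x = x.left;
        }
        return x;
    }

    public static void main(String[] args) {
        // a subtree rooted at 6 whose left spine is 6 -> 3 -> 2
        Node six = new Node(6,
                new Node(3, new Node(2, null, null), null),
                new Node(7, null, null));
        System.out.println(minimum(six).key); // prints 2
    }
}
```

Note that taking the minimum of a node with no left child, such as a leaf, simply returns that node itself.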
The second helper method is the Transplant method. This
method does the actual work
of removing a node from a tree. Its parameters include a tree,
the node to be removed and the
subtree that will take the place of that node.
Since this is a bit more complicated than the TreeMinimum, we
will once again break up
the code and explain it piece by piece.
If the node we want to remove has a null parent then that node
was the root of the
tree. In this simple case we simply assign the root of the tree to
the new subtree.
This part usually looks complicated at first. All we are doing is
checking the parent of the
removed node to see which child is being taken away. We will
then replace that child with the
subtree.
Finally, we will check to see if that subtree is empty. If it is
not empty then we finalize the attachment by setting the parent
of the subtree’s root to the removed node’s parent.
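The three cases just described (root, left child, right child) can be sketched in Java; the names are ours, not the text's pseudo code.

```java
// Transplant: replace the subtree rooted at u with the subtree rooted at v.
class TransplantDemo {
    static class Node {
        int value;
        Node left, right, parent;
        Node(int value) { this.value = value; }
    }

    Node root;

    void transplant(Node u, Node v) {
        if (u.parent == null) {
            root = v;                 // u was the root of the tree
        } else if (u == u.parent.left) {
            u.parent.left = v;        // u was its parent's left child
        } else {
            u.parent.right = v;       // u was its parent's right child
        }
        if (v != null) {
            v.parent = u.parent;      // finalize the attachment
        }
    }
}
```

Note that Transplant only rewires u's parent; u's own children are left for the caller, the removal method, to deal with.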
We can now go over the actual removal method.
We use both Transplant and TreeMinimum to do a proper
removal. Care must be taken
when removing a node. You cannot just use a transplant in
every situation. If we remove a node
that has only one child, we can just transplant the subtree
represented by that node’s child to
the parent. If that node has more than one child then a single
transplant won’t work.
This method takes a tree and a node to be removed from that
tree.
This conditional expression takes care of the two easy cases. If
either child of the removed node is null then a single transplant
will effectively remove the node.
Otherwise we will require a bit more manipulation. We will
start by finding the smallest
value in the right subtree of the node to be removed. We want
this value because it is the
smallest value that is larger than every other value in the left
subtree of our removed node.
If the parent of that minimum node is the node we want
removed then we will skip this
next step. Otherwise we will run a Transplant on the minimum
node. This will take the
minimum node out of the right subtree. Remember, this node is
larger than every element of
the removed node’s left subtree and smaller than or equal to
every element in the removed
node’s right subtree. It only makes sense that we should
replace the removed node with this
minimum node.
After the transplant we assign the right child of this minimum
node to the right child of
the node we wish to remove and give that child a new parent.
Because this minimum node is
smaller than every node in the subtree represented by its new
right child, the fundamental
properties of the binary tree hold.
Now we can deal with the left subtree of the node we wish to
remove. We start this by
transplanting the node we wish to remove with our old minimum
reference.
After the transplant the node is now completely removed from
the tree. Unfortunately
its left subtree is still attached. We can fix this by attaching the
left child of the removed node
to that minimum node. Because the minimum node used to
reside in the removed node’s right
subtree and we are attaching the removed node’s left subtree,
every element in that left
subtree will be smaller than the minimum node.
And now we are finally done removing a node from our tree.
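Putting the pieces together, the whole removal can be sketched in Java. This is a sketch under our own names, with TreeMinimum and Transplant inlined so the class is self-contained; insert and search are included only so the class is usable on its own.

```java
// Binary tree removal using the TreeMinimum and Transplant helpers.
class BinaryTreeRemove {
    static class Node {
        int value;
        Node left, right, parent;
        Node(int value) { this.value = value; }
    }

    Node root;

    void insert(int value) {
        Node parent = null, current = root;
        while (current != null) {
            parent = current;
            current = (value < current.value) ? current.left : current.right;
        }
        Node node = new Node(value);
        node.parent = parent;
        if (parent == null) root = node;
        else if (value < parent.value) parent.left = node;
        else parent.right = node;
    }

    Node search(int value) {
        Node x = root;
        while (x != null && x.value != value)
            x = (value < x.value) ? x.left : x.right;
        return x;
    }

    static Node treeMinimum(Node x) {
        while (x.left != null) x = x.left;
        return x;
    }

    void transplant(Node u, Node v) {
        if (u.parent == null) root = v;
        else if (u == u.parent.left) u.parent.left = v;
        else u.parent.right = v;
        if (v != null) v.parent = u.parent;
    }

    void remove(Node z) {
        if (z.left == null) {
            transplant(z, z.right);          // at most one child: easy case
        } else if (z.right == null) {
            transplant(z, z.left);           // mirror image of the easy case
        } else {
            Node min = treeMinimum(z.right); // smallest value in right subtree
            if (min.parent != z) {
                transplant(min, min.right);  // lift min out of the right subtree
                min.right = z.right;
                min.right.parent = min;
            }
            transplant(z, min);              // min takes z's place in the tree
            min.left = z.left;               // reattach z's left subtree
            min.left.parent = min;
        }
    }
}
```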
Complexity:
The operations done on a binary tree vary with the structure of
the tree. As you have
probably noticed, if you insert the values [1,2,3,4,5,6,7,8,9] into
a binary tree, you will
essentially get a list. Any operations done on a tree as
unbalanced as this will yield O(n)
complexity.
However, if a tree is properly balanced then it can yield an
average O(log(n)) complexity.
This is much better than performing operations on a linear list.
Additionally, this performance is comparable to sorting a list
and then doing a binary
search on that list to find and extract some information. In order
to add the unordered
elements of an array to a tree, we will have to iterate through
the list once O(n), and at each
element we will have to perform an insertion O(log(n)).
Extracting that information will simply
be O(log(n)). In total, inserting an entire list and extracting one
value will be O(nlog(n) + log(n)),
or simply O(nlog(n)).
Sorting the linear list can be done with O(nlog(n)) and then
extracting an element from a
sorted list can be done using a binary search which is O(log(n))
for a total of O(nlog(n) + log(n)),
or simply O(nlog(n)).
So, if doing each of these operations is bound by O(nlog(n)),
why use a tree over a linear
list? Well, it depends. Given some list of comparable elements,
sorting that list will be faster
than inserting the entire thing into a binary tree. Using a
standard desktop, a sample set of
50,000,000 integers took about 9 seconds to sort. That same
sample took almost 90 seconds to
insert into a binary tree. So the question remains, why use a
binary tree?
If you insert something into a linear list, the insert will take
O(n) time. An insert into a
binary tree can be done in O(log(n)) time. The same applies for
removals. So, if you are planning
on manipulating the data then a binary tree is probably what you
want. However, if you are not
going to change the data, having a sorted list may be more
beneficial than a tree. Basically, the
binary tree is much faster to maintain after the initial
preparation has been finished. You have
options, use them wisely.
Other Sources:
http://cslibrary.stanford.edu/110/BinaryTrees.html
B-Tree:
Binary Trees tend to be an excellent way of storing data when
all of that data can fit in
RAM. However, as most computer scientists know, there are
many applications where we have
more information than RAM available. This means we will have
to store the actual information
on a hard drive. Accessing a hard drive is much slower than
RAM as you have seen in the
architecture portion of this text.
The time it takes to perform I/O to a hard drive is slow because
there are physical
moving parts.
The platter spins around the
spindle. The read/write head
reads pages off of the current
track. We have to wait until the
information we want is under
the read/write head before we
can access it.
We could use a binary
tree to store very large amounts
of information. Every node
would be some file on the disk.
As we traverse the tree we will have to do a disk I/O for each
node visited. Even if the tree is
perfectly balanced we can still find ourselves doing lots of disk
I/O’s for very large trees.
To prevent this we want to store more than one thing in each
file. To be more specific,
since each disk read gets a page from the hard drive, we want to
fit as much information into a
node as possible before we overflow that page. This will ensure
we make the most efficient use
of our available resources while drastically reducing the time it
takes to find information.
This is where the B-Tree comes in. A B-Tree is basically a
Binary Tree which allows each
node to have more than two children. Each node will have some
number of separator keys. The
number of children coming out of these nodes will be equal to
the number of separator keys + 1.
An easy way to demonstrate the B-Tree is with the alphabet.
The tree above is rooted at
M. The number of separators for the root in this case is one and
it has two children. Note the
number of keys in a node can be much larger than 2 or 3. In
practical applications we may have
thousands of keys in each node.
Each node in a B-Tree contains:
- x.p : a pointer/reference to the parent
- x.n : a number of separator keys in the node
- x.key[1 … x.n] : array with value of separator keys (as
opposed to a single key for the node)
- x.c[1 … x.n+1] : array of pointers to children (as opposed to
x.left and x.right)
- x.leaf : Boolean value representing whether x is a leaf or not.
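As a sketch, these fields translate to a Java class along these lines (note that the text's arrays are 1-based while Java's are 0-based, so x.key[1] in the text is key[0] here):

```java
// Direct transliteration of the B-Tree node fields listed above.
class BTreeNodeSketch {
    BTreeNodeSketch p;     // x.p    : pointer/reference to the parent
    int n;                 // x.n    : number of separator keys in the node
    int[] key;             // x.key  : the separator keys, key[0 .. n-1]
    BTreeNodeSketch[] c;   // x.c    : child pointers, c[0 .. n]
    boolean leaf;          // x.leaf : whether this node is a leaf
}
```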
Other Properties:
- Every leaf in the B-Tree will have the same depth. i.e. the
length of the paths from root to
each leaf is the same (equal to the height of the tree).
- Each node may have no more than 2t-1 keys.
o t is a predefined number that will regulate the properties of
this tree.
- Each node, with the exception of the root, must have at least t-1 keys. In a non-empty tree,
the root must have at least 1 key.
B-Trees support dictionary operations, meaning insert, search,
and remove which we will
go over. They also support other operations such as successor and
predecessor which we will not
look into.
BTreeSearch:
Let’s start with searching for an element in an existing B-Tree.
In this search we will use a
method, FindBranch(x,k) such that x is a node and k is the key
we are looking for. This searches
through a single node to find where the branch to a child should
be. It can be either a linear or
binary search through the node, what is important is that it finds
a key index, i, such that:
- x.key[i] >= k
- x.key[i-1] < k (or i == x.n+1 if x.key[x.n] < k)
Above is the pseudo code for a BTreeSearch. As with the
Binary Tree we will break this
down to help you understand exactly what is going on.
This method takes two parameters, x which is the current node
and k, the key we are
searching for.
This searches through the node to find the branch location as
described above.
This checks to see if the key we are looking for matches the
key we found with our
FindBranch method. If it is, we are done and can return the
node and the index of the key.
Otherwise if this is a leaf node then the key we are looking for
does not exist in this tree.
We have covered the case where we found the item we are
looking for and the case where the key isn’t in the structure,
so now we must move on and repeat this process for the next node.
We have already found the branch to the next child and we
know that child exists, so we
need to read that node into memory. This is done with
DiskRead(x.c[i]), which gets the i-th child of x. In Java, this
could be done by creating an input stream from the file.
We then recursively call BTreeSearch with the new node and
the same key we are looking for.
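Since the extracted pseudo code is not reproduced here, an in-memory Java sketch of FindBranch and BTreeSearch follows. DiskRead is stubbed out (a real B-Tree would read the child's page from disk), the arrays are 0-based rather than the text's 1-based, and all names are ours.

```java
// In-memory sketch of FindBranch and BTreeSearch as described above.
class BTreeSearchDemo {
    static class BTreeNode {
        int n;             // number of separator keys
        int[] key;         // key[0 .. n-1]
        BTreeNode[] c;     // c[0 .. n] children (null for leaves)
        boolean leaf;
    }

    static class Result {
        BTreeNode node;
        int index;
        Result(BTreeNode node, int index) { this.node = node; this.index = index; }
    }

    // Linear scan for the smallest i with key[i] >= k (i == n if none).
    static int findBranch(BTreeNode x, int k) {
        int i = 0;
        while (i < x.n && x.key[i] < k) i++;
        return i;
    }

    static BTreeNode diskRead(BTreeNode child) {
        return child;      // stub: a real B-Tree reads this page from disk
    }

    static Result search(BTreeNode x, int k) {
        int i = findBranch(x, k);
        if (i < x.n && x.key[i] == k) {
            return new Result(x, i);         // found the key in this node
        }
        if (x.leaf) {
            return null;                     // nowhere left to look
        }
        return search(diskRead(x.c[i]), k);  // descend into the i-th child
    }
}
```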
BTreeInsert:
For the insert (and remove later) we won’t show pseudo code.
The implementations for
these are quite ugly because it involves modifying the actual
file structures. If you find yourself
needing to implement a B-Tree, pseudo code is available on the
internet or you may be able to
form the code yourself after reading the descriptions of how
they work.
When we wish to insert into a B-Tree, we always begin at the
root. We will do a search
until we find the appropriate leaf to insert our information. We
can only insert into a leaf if there is space (it must have fewer
than 2t-1 keys).
In this case we do have room. We want to insert ‘B’ into this
tree where t=3.
- When t=3 each node has:
- Keys >= (t-1 = 2)
- Keys <= (2t-1 = 5)
The appropriate leaf node may not always have room to insert.
We don’t want to just add a
new level because B-Trees have to hold the property where
every leaf in the tree will have the
same depth. It also means we will have a node with just one
element in it which is a waste of a
hard drive I/O. Remember: we want to make the most efficient
use of our space as possible
which is why we need to enforce these rules on how many keys
each node can hold.
We will now try to insert ‘Q’ into this B-Tree. Intuitively we
can see that, since ‘Q’ comes
after ‘P’ we will want to insert ‘Q’ into the node with [R S T U
V]. However, this node has 5 keys
already so we cannot insert into it as-is.
The next place we may look to insert is in the root with [G M P
X]. This is not an option
either. Doing this would introduce a child node between ‘P’ and
‘Q’ that has no keys. Since the
number of keys has to be at least t-1, this would violate the
properties of a B-Tree.
We do have another option.
We can split this node in half and
raise the ‘T’ to the root. It has room for one more key. The
pointer between ‘P’ and ‘T’ will
contain [R S] as a child and the pointer between ‘T’ and ‘X’
will contain [U V] as a child. Both of
these abide by the t-1 keys property.
We can then insert ‘Q’ into the tree after making the split.
There is one last contingency we will have to deal with. We
can see that if we tried to
insert ‘F’ into this tree we would have no room for it in either
the appropriate leaf or in the
root.
The way to deal with this is actually quite simple. Whenever
we want to insert
something into a tree, if we run into a node that is full we split
it unconditionally.
Inserting ‘L’ into this tree forced the split of the root node.
This is also why we have an
exception for the minimum number of keys in the root node.
Using this technique will ensure
that the tree not only stays balanced but there will always be
room if we need to split a node.
Special mention should be made in this case. Because we had
to split the root, there
was no parent to place the median key from the root. As you
probably guessed, we just make
that median key its own node and make the tree’s root pointer
point to that. The height of the
tree is increased by one whenever we have to split the root and
this is the only way that the
height of the tree is allowed to grow.
Finally, when we insert ‘F’, we have plenty of room for a split.
All properties discussed
earlier are held true.
At any given node we will have to do a search. Assuming this
is a linear search it will be
bound by O(t) where t is the constraint we applied limiting the
number of keys in our node.
Each node we visit we go down one level in the tree, which
means we have to do another
search. The total time ends up being O(t*h) = O(t*log_t(n)).
B-TreeRemoval:
It is even more complicated to delete a key from a B-Tree than
it is to insert one. We will
start, per usual with any tree structure, from the root. We will
do a search for the key we want
to delete. We will have two main cases: deleting from a leaf and
deleting from an internal node.
Deleting from Leaf:
If the leaf has at least t keys we can just remove the key. The
node will have at least t-1
keys and the structural properties will be maintained.
In this case we want to delete ‘B’. We can’t just remove it
because the leaf would then have fewer than t-1 keys. We will
want to increase the size of the node before attempting any
removal. To make things simple, on removals we will check each
node before we move to it to see if it has t-1 keys. If it does,
we will preemptively increase the size of that node just in case
the key we want to remove is there. There are several different
approaches to do this depending on the siblings of that node.
Case A: The t-1 node has a neighbor with at least t keys.
In the case above, the sibling of the node we want to remove
‘B’ from has more than t-1
keys. We can move the first element from that node, ‘E’, up to
its parent. We can then move
the ‘C’ from the root to the node with which we want to delete
‘B’.
After moving ‘C’ and ‘E’, the node had t elements and we were
able to remove ‘B’ with
no issue.
Case B: Neither left nor right siblings have more than t-1 keys.
In this case we will use a technique that merges nodes with a
key from the parent.
If we wish to delete ‘D’ we will first visit [C L]. This node has
t-1 keys so we want to
increase that. We can’t just take a key from [T X] because it
also only has t-1 keys. Instead we
will merge the two nodes using the key ‘P’ from the parent.
After the merge we are free to delete ‘D’ as its node had more
than t-1 keys. Also, since we removed the only key from the root,
we remove that node and the new root of this tree is [C L P T X].
Why preemptively increase keys to t?
Let’s go back to a previous tree:
In this example we want to delete Z. We can see that we will
have to pass through [T X]
which has t-1 keys. We already know we can increase that by
using a sibling from [C G M].
We can now, because of our preemptive efforts, merge [U V]
and [Y Z] using ‘X’.
The process of removing ‘Z’ is now the simple case of removing
it from the leaf node.
Removing from non-leaf:
This process, again, has several cases. Sometimes, if we are
lucky, we will want to delete
a key that separates two children with t-1 keys. If this is the
case, we simply merge the
children.
Case A: Key to be removed separates two children with t-1
keys.
There is not a lot to explain here. If we first merge [D E] and [J
K] using ‘G’ as a median,
then we will get a single child [D E G J K] between ‘C’ and ‘L’.
G would then be in a child with 2t-
1 keys and we have already seen how to delete that.
Case B: Key to be removed separates children with more than t-1 keys.
If we wish to delete ‘M’ then we will have to find something to
take its place. To do this
we will have to find its predecessor. This will be the “largest”
key in the left subtree of ‘M’. The
predecessor will always be found in a leaf. This will use an
algorithm similar to the
TreeMinimum for Binary Trees.
The substituted node will have to be deleted from the leaf using
the standard deletion
techniques we have already discussed (i.e. ensuring each node
has at least t keys). After it is
“deleted” from the tree, we will just replace the element in the
internal node with our
predecessor. In the example above, we deleted ‘L’ from the left
subtree of ‘M’ and then
replaced ‘M’ with ‘L’, effectively deleting ‘M’ from the tree.
Overall Complexity:
We only move from top to bottom, returning up once only if we
need to delete from an
internal node. At each node we will access at most two of its
children. Since these are constant
values, meaning the extra work we do at each step will not
increase as n increases, we will have
O(log n) operations.
Other Sources:
http://cis.stvincent.edu/carlsond/swdesign/btree/btree.html
Chapter 6: Hashing
In this chapter we will look at hashing and how it is used to
implement the hash table
data structure. While they are outside the scope of this course,
other uses of hashing in
computer science include security and encryption. First off let's
define what hashing is. Hashing,
in general, is the use of a “hash function” that maps a data set
of potentially variable length to
another data set of standardized length.
We will use this table representation of a directory for our
examples:
Index  Name             Phone #       Address           Email
0      John, Smith      804-453-3425  25 West Main St.  [email protected]
1      John, Doe        804-343-7385  54 Marshal Rd.    [email protected]
2      Jane, Wilkerson  804-374-3836  978 Woodman Rd.   [email protected]
At this point you might be wondering why we need another
data structure. We have
already looked at searching algorithms, sorting algorithms, and
trees and come up with fast
implementations for all of them. So why would we not just use
something like a binary tree or
sorted array to represent the directory? As we have seen
previously, if we used a sorted array to
implement the directory and a binary search to find the record,
that would give us a search time
of O(log(n)). While that is much better than some of the other
methods we have seen, for a
large directory it could still take a long time to complete. The
second issue would be adding
new records. Whenever you add a new record you would have to
shift the entire part of the
array that comes after the new record for each addition that you
make. This could take up to
O(n). Even with a binary tree the best we get for insertions and
deletions is O(log(n)).
Imagine if the telephone company stored numbers that way.
When you placed a call the
telephone company would have to search potentially several
directories from different
companies to find the number you are calling. This could still
work if only a few people were
calling at a time. However, when you have millions of calls at a
time, a faster method is required.
Another case would be a guidance system for a missile, where
last second changes are needed.
If the calculation takes too long, the missile would not have
time to change its trajectory.
This is where hashing comes into play. If we look back to the
table of our example
directory and made a hash function for it, we see that our hash
function would need to return
the indices 0, 1, and 2. This leads us to our discussion of
creating hash functions. As creating
hash functions is not the focus of this course we will only look
at one method. The method we
will look at is called modular hashing. Modular hashing is where
we convert our key into an integer, then divide by the table size
M to get the remainder as the index.
In our table above this would give us the function: h(x) = x
mod 3
If we choose the phone number as our key x and pass
“8043743836” to the hash function, it would return
8043743836 mod 3 = 1. So to reiterate, a hash table is an
array data structure that maps
elements to each index by inputting the key of each element into
the hash function.
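A minimal Java sketch of this hash function, using the phone number as the key (M = 3 to match the three-row directory; the class and method names are ours):

```java
// Modular hashing: convert the key to an integer, then take the remainder
// when dividing by the table size m. The remainder becomes the index.
class ModularHash {
    static int hash(String phoneDigits, int m) {
        long key = Long.parseLong(phoneDigits);  // digits fit in a long
        return (int) (key % m);                  // h(x) = x mod m
    }
}
```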
Now that you know what hash tables are, there should be one
issue that jumps out at
you. That of course is, what happens when the hash function
returns the same value for two
different inputs? When this happens we have what is called a
“collision.” Avoiding collisions is
one of the primary concerns when constructing hash functions.
One of the simplest collision-avoidance techniques is the choice of table size. If the table of our
If the table of our
example directory had a size of 10 then all the numbers ending
in “00”, “10” and so on would
each map to the same index. This gives us our standard for the
table size: the size of hash tables
should always be a prime number. With a prime table size, keys
are unlikely to share a common factor with the divisor, so the
remainders are more likely to be spread evenly across the indices.
This will not completely avoid
collisions but it will significantly reduce them. Besides the
table size, the only other way we can
avoid collisions is by adjusting our key and hash function. As
there is no good way to do this,
usually we change our focus from avoiding collisions to
dealing with them.
The most common method for dealing with collisions is to
make a table of linked lists. In
case you do not remember, a linked list is a data structure
composed of a group of nodes where
each node holds a piece of data and a pointer to the next node in
the list. It would look like this:
When you insert an element, you would get the index from the
hash function and add
the element to the head of the linked list. By adding to the head
of the list, it prevents you from
having to traverse the list on insertion. To then find the
element, you would get your index from
the hash function, and then do a sequential search of the list to
find the element you are
looking for. This would give us a data structure that looks like
the picture on the following page.
Linked List
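The scheme just described, an array of linked lists with head insertion and sequential search within a chain, can be sketched in Java; the class and method names are ours.

```java
import java.util.LinkedList;

// Separate chaining: a table of linked lists. Insert prepends to the chain
// at the hashed index; search walks that one chain sequentially.
class ChainedHashTable {
    private final LinkedList<Long>[] table;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int size) {
        table = new LinkedList[size];          // size should be prime
        for (int i = 0; i < size; i++) {
            table[i] = new LinkedList<>();
        }
    }

    private int index(long key) {
        return (int) (key % table.length);     // modular hashing
    }

    void insert(long key) {
        table[index(key)].addFirst(key);       // head insert: no traversal
    }

    boolean search(long key) {
        for (long k : table[index(key)]) {     // sequential search of one chain
            if (k == key) {
                return true;
            }
        }
        return false;
    }
}
```

Only the chain at the hashed index is ever touched, which is what keeps the average cost constant when the chains stay short.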
There are other methods for dealing with collisions. The first is
to increment through the
table until you find an empty space and place the element there.
There are two issues with
this. The first is clustering. Clustering is when elements are
clumped together around the same
index. When this happens it increases the potential for more
collisions to occur. The second
issue with this method is finding the element. Since you use the
next open space you have to
look forward linearly at each element, but you could see an
element that was hashed to that
index. So how do you know how far forward you should look?
Because of these issues, this
method is rarely used. Another method would be to create a
second hash function for when
collisions occur. If this method was used, it should be evident
that you are just pushing the
problem farther along instead of dealing with it, as you would
then have the question of how to
deal with collisions from the second hash function.
At this point you should start seeing why hashing is so
valuable. Our directory example
only had three values, so the execution time would be very
small no matter what data structure
you used. If, on the other hand, you have a directory holding a
few million records, the speed of
a hash table over something like a binary tree would be very
significant.
There are several important things that should be noted about
hash tables.
1.) Even with a good hash function it is still possible for
collisions to occur. So always
anticipate and have a mechanism for dealing with collisions.
2.) Typically, the number of possible keys is much larger than
the actual keys that are used
and stored. This means that you need to know and plan for the
maximum number of entries.
3.) There are several different techniques for creating hash
functions. But no matter what
method you use, if the hash function is called twice with the
same input, it should always return
the same value.
Linked List implementation of a hash table.
4.) The only operations that the hash table data structure
supports are the dictionary
operations, based on the element's key: insert, search, and
delete. This means that there are
some limitations to hash tables. Unlike a binary tree or sorted
array, hash tables do not support
operations based on order. This means that they do not support
operations like minimum,
maximum, successor, and predecessor.
To finish up, we need to look at the complexity of hash tables.
For insertions using the
linked list implementation we have the time it takes to compute
the index plus the time to
insert into the linked list. Since we insert into the head, this
gives us a best and worst case time
complexity of O(1). For searches and deletions we have the time
it takes to compute the index,
plus the time it takes to find the element in the linked list. In
the best case, we have only one
element in each list which would give us a time complexity of
O(1). The worst case, on the other
hand, would be that all the keys hashed to the same index. If
that happened, we would have a
time complexity of O(n). As long as care is taken in designing
the hash table and hash function,
there should be a relatively small number of collisions. So the
average run time for well-
constructed hash tables would be O(1).
Other sources:
http://algs4.cs.princeton.edu/34hash/index.html
http://www.comsci.us/fs/notes/ch11.html
Authors’ Notes
The material for this course has been assembled by Steven
Andrews and Nathan
Stephens. This course is meant to help prepare you for taking
CMSC 508 Database Theory. As a
firm grasp of this material is essential, it is suggested that if
you still have trouble
understanding a topic you look at the other sources provided at
the end of each section.
BNFO-501/Project 1.pdf
BNFO-501 Project 1:
Input:
You will be given, in standard input, two arrays of integers.
The first array will be your data. The
second array will contain integers that may or may not be in the
first.
The first line of input will contain two integers separated by a
space. The first integer, n, will be
the size of the data array. The second integer, m, will be the
size of the query array. The next n lines will
contain a single integer that corresponds to a value in the data
array. Immediately following these will
be m more lines containing the elements of the query array. A
small sample set of input may look like
this:
5 2
4
7
12
89
102
92
89
This will correspond to the two arrays:
- Data: [4 7 12 89 102]
- Query: [92 89]
Output:
You are to write both a sequential and binary search that will
look for each value in the query
array and return true if it exists in the data array. You will also
print the time, in milliseconds, that each
search takes along with the number that is being searched for.
For example, the output of the input
above should look like this:
false:2ms false:0ms 92
true:2ms true:0ms 89
The search times will vary with the machine you are using. If
you were to use this input as a test, you will
most likely get 0ms for each search. To truly see the intended
result, it is recommended that you
generate your own input with at least 1,000,000 data elements.
Files used for grading purposes will not
exceed 50,000,000 data elements.
BNFO-501 Project 1:
Help & Tips:
A template file has been provided that shows one way of reading
from and printing to standard I/O. If you would like to run your
own tests you can easily create a program that generates a file
in the standard format. We will be using the same format of file, with the
exception of ordered elements, for every project
so this would be a wise investment. If you want to use a file to
test your program then you can add a
command-line I/O redirect. For example, if you are
on a Windows machine and your program
is named Project.java with the input file of data.txt, you can use
the command:
java Project < data.txt
If you would like to route the standard output to a file instead
of having it print to command prompt,
you can use:
java Project < data.txt > output.txt
BNFO-501/Project 2.pdf
BNFO-501 Project 2:
Input:
You will be given, in standard input, two arrays of integers.
The first array will be your data. The
second array will contain integers that may or may not be in the
first.
The first line of input will contain two integers separated by a
space. The first integer, n, will be
the size of the data array. The second integer, m, will be the
size of the query array. The next n lines will
contain a single integer that corresponds to a value in the data
array. Immediately following these will
be m more lines containing the elements of the query array. A
small sample set of input may look like
this:
5 2
89
4
12
7
102
92
89
This will correspond to the two arrays:
- Data: [89 4 12 7 102]
- Query: [92 89]
Output:
You are to modify the program you wrote from Project 1. The
output will be the same with the
addition of 1 line of output. You are to print in standard output
the time it takes to prepare the data.
This preparation time will be the time, in milliseconds, that it
takes for you to sort the data. You must
write a merge sort to accomplish this task. You are NOT
allowed to use Arrays.sort().
You will then print the sequential and binary search results as
done in project 1:
Prep time: 45ms
false:2ms false:0ms 92
true:2ms true:0ms 89
The search times will vary with the machine you are using. If
you were to use this input as a test, you will
most likely get 0ms for each search. To truly see the intended
result, it is recommended that you
generate your own input with at least 1,000,000 data elements.
Files used for grading purposes will not
exceed 50,000,000 data elements.
BNFO-501 Project 2:
Help & Tips:
Try also writing one of the simpler O(n^2) sorts and compare it with the time it takes a Merge Sort
to run for very large input.
BNFO-501/Project 3.pdf
BNFO-501 Project 3:
Input:
You will be given, in standard input, two arrays of integers.
The first array will be your data. The
second array will contain integers that may or may not be in the
first.
The first line of input will contain two integers separated by a
space. The first integer, n, will be
the size of the data array. The second integer, m, will be the
size of the query array. The next n lines will
contain a single integer that corresponds to a value in the data
array. Immediately following these will
be m more lines containing the elements of the query array. A
small sample set of input may look like
this:
5 2
89
4
12
7
102
92
89
This will correspond to the two arrays:
- Data: [89 4 12 7 102]
- Query: [92 89]
Output:
You are to once again modify the previous project. You will
write a basic binary tree which only
needs to insert and search for elements. Your prep time in this
example will be the time it takes to add
every element to the binary tree. You will only run one query
per item in the query array, which will be a
tree search to find the element. The output will be similar to
before:
Prep time: 450ms
false:0ms 92
true:0ms 89
The search times will vary with the machine you are using. If
you were to use this input as a test, you will
most likely get 0ms for each search. To truly see the intended
result, it is recommended that you
generate your own input with at least 1,000,000 data elements.
Files used for grading purposes will not
exceed 50,000,000 data elements.
BNFO-501 Project 3:
Help & Tips:
This is the first project where you are required to write
something for which pseudo-code was not explicitly given in the
text. However, you have everything you
need to write this search. Just remember,
in order to insert something you must first search for an element
to insert it under.
If you are familiar with the recursive algorithms for insertion
or searching you are welcome to
use them. Just remember, since the testing can be done with up
to 50,000,000 elements, memory may
be a concern. You can increase the amount of memory allocated
to the JVM for your machine, but the
grader may not.
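One possible shape for the tree, sketched below, is an iterative insert and search. This is only an illustration, not the required solution: the class and method names (IntTree, insert, search) are our own choices, and your project must still handle the timing and input described above. Note that the iterative form avoids deep recursion, which matters if the data happens to arrive in sorted order and the tree degenerates into a long chain.

```java
//Minimal iterative binary search tree supporting only insert and search.
//Illustrative sketch; names are not mandated by the project.
public class IntTree {
    private static class Node {
        int value;
        Node left, right;
        Node(int value){ this.value = value; }
    }

    private Node root;

    //Walk down from the root, comparing at each node, and attach the
    //new node at the first empty spot found. Duplicates go to the right.
    public void insert(int value){
        if(root == null){ root = new Node(value); return; }
        Node current = root;
        while(true){
            if(value < current.value){
                if(current.left == null){ current.left = new Node(value); return; }
                current = current.left;
            }else{
                if(current.right == null){ current.right = new Node(value); return; }
                current = current.right;
            }
        }
    }

    //Follow the same comparisons to decide whether the value is present.
    public boolean search(int value){
        Node current = root;
        while(current != null){
            if(value == current.value) return true;
            current = (value < current.value) ? current.left : current.right;
        }
        return false;
    }

    public static void main(String[] args){
        IntTree tree = new IntTree();
        for(int v : new int[]{89, 4, 12, 7, 102}) tree.insert(v);
        System.out.println(tree.search(92)); //prints false
        System.out.println(tree.search(89)); //prints true
    }
}
```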
BNFO-501/Project program template.txt
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class Template {
    public static void main(String[] args){
        try{ //Try to read the input.
            //Reads standard I/O. When graded, console input will be redirected to read from a file.
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            int dataSize = 0;
            int querySize = 0;
            {//This block reads the first line and assigns the values to variables.
                String[] header = in.readLine().split(" ");
                dataSize = Integer.parseInt(header[0]);
                querySize = Integer.parseInt(header[1]);
            }
            System.out.println("Data Array Contents");
            //Read and echo the contents of the data array. Your solution should store these values.
            for(int i = 0; i < dataSize; i++){
                System.out.println(in.readLine());
            }
            System.out.println("Query Array Contents");
            //Read and echo the contents of the query array. Your solution should store these values.
            for(int i = 0; i < querySize; i++){
                System.out.println(in.readLine());
            }
        }catch(IOException e){
            System.err.println("Error Reading Input: " + e.getMessage());
            System.exit(0);
        }
        //Example of how to time how long a method takes.
        long start = System.currentTimeMillis();
        timeThis();
        long end = System.currentTimeMillis();
        System.out.println((end - start) + "ms");
    }

    //Some method that takes a random amount of time to run.
    public static void timeThis(){
        while((int)(Math.random() * 100) < 95);
    }
}