Bioinformatics 501
Table of Contents
Course Outline
Chapter 1: Physical Level
Chapter 2: Algorithm Complexity
Chapter 3: Search Algorithms
Chapter 4: Sort Algorithms
Chapter 5: Trees
Chapter 6: Hashing
Authors’ Notes
Course Outline
Purpose
The purpose of this document is to introduce students to concepts, structures and algorithms that form a foundation on which a database is built. Since this document was created for a specific class, we will assume students have at least taken CMSC-256 at VCU or an equivalent course.
Format
Consistency will be a key element in learning and understanding this material, but unfortunately there are many different ways to format it. We will use this section to illustrate the different formats used in the examples of this text.
We assume the readers of this text have some general understanding of programming, and our target audience has some background specifically in Java. We will use pseudo code in our examples that can easily be translated to Java, but can also be applied to many other languages. An example piece of code is below, followed by the Java implementation.
Pseudo Code
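A minimal stand-in example (a hypothetical routine that sums the elements of an array):

SumArray(A[1..N])
    total = 0
    for i = 1 to N do
        total = total + A[i]
    return total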
Java implementation
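The same routine as a minimal Java sketch (the method name and parameter are illustrative):

static int sumArray(int[] a) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
        total = total + a[i];  // accumulate each element
    }
    return total;
}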
Structure
This section briefly introduces what will be covered in this document. We will cover:
• Computer Architecture (Physical Level)
• Algorithm Complexity
• Search Algorithms
• Sort Algorithms
• Trees
• Hashing
Chapter 1: The Physical Level
While we can assume you have taken at least an intermediate
Java course, we cannot
assume you have taken any courses on computer architecture.
The first section will briefly
cover key concepts in this field that you will surely hear again
in a Database course.
You will learn how information is stored in a computer in both temporary and long-term storage, along with the time it takes to store and retrieve information from each.
Chapter 2: Algorithm Complexity
This is a simple concept, but it is perhaps one of the most
important we will discuss. We
can create algorithms that can solve just about anything. The
problem comes in when these
algorithms take an exceptionally long time to run.
We will discuss how to determine an algorithm’s complexity in
terms of its input size.
We will show a few examples and give best/average/worst
cases. We generally care the most
about worst case scenarios. If we can reduce this worst case
then we know the algorithm will
always run in an appropriate amount of time.
Chapter 3: Search Algorithms
Storing information is an essential part of computer science.
Retrieving that information
is just as important. Retrieving that information in an
appropriate amount of time is even more
important.
You will be introduced to two simple algorithms: sequential
search and binary search.
We will discuss how and when to use each.
After completing Chapter 3, you will be able to complete the
first programming
assignment.
Chapter 4: Sort Algorithms
Sorting and ordering information is crucial when it comes to
retrieving information
quickly. There are many situations where you would want to
sort your data. One example is in
data retrieval. Attempting to get information from data that is
not in any order will require
every element to be inspected. However, if that information is
sorted, then the time it takes to
retrieve it later can be greatly reduced.
There are many different sorting algorithms one can use. Some
are very intuitive but not
very efficient. Others can be very efficient but unintuitive and
difficult to code. There are
applications where each is useful so we will discuss many
different types of sorting algorithms
including:
• Insertion Sort
• Selection Sort
• Bubble Sort
• Merge Sort
An application will be provided with this document to help
visualize exactly what these
algorithms are doing while they are running. It will also include
sorts that are not discussed in
this text along with some variations of some of the more
efficient sorts.
After completing chapter 4, you will be able to complete the
second programming
assignment.
Chapter 5: Trees
You should already be familiar with the concept of the tree
data structure. We
will discuss a simple binary tree as an introduction, but our
primary focus will be on B-Trees. It
is important to note that there are many different tree structures
we will not discuss, such as
general (non-binary, non-balanced) Trees, Heaps, Binary Search
Trees and Balanced Trees.
The B-Tree is a way of storing very large amounts of
information. Until now you may
have been able to store all the data you need in RAM. Most
databases have much more
information than available temporary memory so we have no
choice but to store the
information on hard disks. As you will learn in the Physical
Level chapter, retrieving information
from disk is much slower than RAM. B-Trees are constructed
with this in mind, giving us a way
to quickly navigate and gain access to specific files.
After completing chapter 5, you will be able to complete the
third programming
assignment.
Chapter 6: Hashing
Hashing is an important technique that gives extremely fast insertion of data and, when implemented correctly, extremely fast retrieval of that data. Hashing uses a combination of data structures you should already be familiar with: generally an array where each element of that array stores a linked list.
Chapter 1: Physical Level
This chapter gives an overview of some of the basic computer
hardware concepts, and
what they mean to a computer scientist.
Hardware
The first thing we will look at is the primary hardware
components of a computer. If we
ignore peripherals and output devices, the three main
components of a computer are the:
1.) Central processing unit (CPU).
2.) Random access memory (RAM) sometimes called the main
or primary memory.
3.) The hard drive which is also called secondary memory.
The CPU is basically the “brains” of a computer. It is what
executes application code,
manipulates data, and controls the other hardware. The CPU is
composed of three parts:
1.) ALU (Arithmetic Logic Unit) - As its name suggests, this is
what does all the mathematical
and logical operations.
2.) Control Unit - This can be thought of as the conductor; it
does not do anything by itself but
it tells the ALU what to do and communicates where to store
data in memory.
3.) Registers (little memory chips) - These are where the direct
results of the ALU are put and
where the data that is to be executed next is stored.
RAM is essentially the computer's workbench. It is the memory
where the computer
stores code and data that it is actively using. In more technical
terms, RAM is basically a storage
area of bytes that the CPU controls. RAM is relatively fast,
especially when compared to a hard
drive. Retrieving a particular byte from RAM can be done in a
few nanoseconds (1 nanosecond =
1 billionth of a second). The main difference between RAM and
our last component, the hard
drive, is its speed and the fact that RAM is volatile, or non-persistent. This means that when RAM
loses power, like when a computer is turned off, all the data in
RAM is lost.
The last primary hardware component is a hard drive (HD).
Hard drives are a type of
secondary storage. Other types of secondary storage are flash drives, CDs, DVDs, magnetic tape, and Blu-ray discs. A hard drive is used for long-term storage of data, or persistent storage. Persistent
means that, unlike RAM, when power is removed the data is
still there. Hard drives are typically
spinning metal disks on which data is stored with magnetic
patterns. The other version of a hard
drive is a solid state disk (SSD). SSDs have no moving parts and are faster than a magnetic disk, but are much more expensive. While an SSD is faster than magnetic disks, it is still much slower than RAM. However, no matter what type you use, all of them provide persistent storage.
Bytes
No matter what type of memory it is, whether registers, RAM, or hard drives, all memory is split up into “bytes.” A byte is made up of 8 “bits,” and it is also the smallest “addressable” unit. Bytes are represented as base 2 numbers, so each bit can have the binary value 1 or 0, and the value of each bit position is a power of two, 2^0 through 2^7. This means that one byte can have a decimal value of 0 – 255. There are different conventions for the symbol of a byte, but it is typically denoted as “B.” Prefixes are also used to represent multiple bytes. However, since it is base 2, a kilobyte (2^10 bytes) is 1024 bytes, denoted with the symbol “kB,” rather than 1000 bytes. Just to confuse things, you also have the symbol “kb” for kilobit. You can also have megabytes (MB), gigabytes (GB), terabytes (TB), etc. Because of the different naming conventions, there is sometimes some ambiguity about what a symbol means in some situations. The bits in a byte are ordered right to left. The left-most bit is called the “most significant bit” or “high-order bit”; in the same manner, the right-most bit is called the “least significant bit” or “low-order bit.” It is important that you remember that all the memory in a computer is measured in bytes and that a byte is the smallest addressable unit in memory.
This leads us to the term “word.” A word is basically the unit of data that the CPU thinks in. What this means is that the word size of the CPU is how large a piece of data the CPU can manipulate at a time. It is also the size of most of the registers. When you hear about 32-bit and 64-bit machines, this refers to the word size, so a 32-bit machine has a word size of 32 bits. You can see why this is important with a simple arithmetic example. Suppose you wanted to execute: 1000 + 1000. If you had a word size of one byte (8 bits), the largest number you could represent using a single word is 255. So to represent a larger number you have to use two words. This means that to do the addition, instead of it taking just one operation and therefore one computer cycle, it would have to be split up into several operations taking several computer cycles. As this example demonstrates, when considering how to design the architecture of a computer, the choice of the word size is very important.
While the bits in a byte are ordered right to left, that is not always the case for the bytes in a word when talking about storing them in memory. There are two ways that bytes are ordered when stored in memory, called “Big Endian” and “Little Endian.” For a given word we have the four bytes B1B2B3B4. If the bytes are stored in memory in the order 1 – 4, they are stored using big endian. If the bytes of each word are stored in memory in the order 4 – 1, they are stored using little endian. There are advantages and disadvantages to both formats, and this was the basis for one of the arguments between the PC and the Mac. Little endian means that you store the low-order byte of the number at the lowest memory address and the high-order byte at the highest memory address. The advantage of this is that it creates a one-to-one relationship between the byte number and the memory address. Big endian means that you store the high-order byte of the number at the lowest memory address and the low-order byte at the highest memory address. The advantage of this format is that you can test whether a number is positive or negative just by looking at the first byte.
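A small Java sketch of the two orderings, using ByteBuffer to choose the byte order explicitly (the value 0x01020304 plays the role of the word B1B2B3B4):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class EndianDemo {
    public static void main(String[] args) {
        int word = 0x01020304;  // bytes B1=01, B2=02, B3=03, B4=04

        byte[] big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(word).array();
        System.out.println(Arrays.toString(big));     // [1, 2, 3, 4]: B1 at the lowest address

        byte[] little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(word).array();
        System.out.println(Arrays.toString(little));  // [4, 3, 2, 1]: B4 at the lowest address
    }
}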
Encoding Schemes and Variable Types
Characters
Now that we know what a byte is, and that all data is stored as bytes, you might wonder how characters, pictures, and other types of data are represented. The answer is encoding schemes. ASCII (pronounced ask-ee) is an acronym for the American Standard Code for Information Interchange. ASCII is an encoding scheme for representing the English alphabet characters as numbers. In this encoding scheme, each letter is a byte and is assigned a number from 0 to 127. For example, the ASCII code for uppercase N is the decimal value 78, and the lowercase n is the decimal value 110. Since ASCII is what most computers use to represent text, it is what makes it possible to share data between computers. The first version of ASCII was published in 1963 and went through several revisions before it became the version we use today in 1986. While ASCII can represent English characters, it does not support characters from other languages. To solve this, another encoding scheme was created called Unicode. Unicode represents characters as a two-byte number. This means it can represent up to 65,536 (2^16) different characters. The disadvantage of Unicode is that, since it is a two-byte encoding scheme, it takes twice the memory of ASCII.
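A quick Java sketch showing the code points mentioned above (casting between char and int exposes the numeric encoding):

public class AsciiDemo {
    public static void main(String[] args) {
        System.out.println((int) 'N');   // 78, the code for uppercase N
        System.out.println((int) 'n');   // 110, the code for lowercase n
        System.out.println((char) 78);   // N, decoding the number back to a character
    }
}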
Numbers
We now know how to represent characters, but what about numbers? Numbers are represented in two different ways. The first is as integers. Since they are integers, they cannot hold fractional values. We could just represent them using the plain binary value. This would mean that for a 32 bit word, we could represent the integer values 0 - 4,294,967,295. The problem with this method is that we cannot represent signed (negative) numbers. The most obvious way to represent signed integers would be to use the most significant bit as the “sign” bit. This would mean that when the most significant bit is “1” the integer is negative, and when it is “0” it is positive. Since the number is represented with 32 bits and 1 bit is used to denote the sign, this leaves us with 31 bits for the number, allowing us to represent the range −2,147,483,647 to 2,147,483,647. This method of representation is called signed magnitude.
The disadvantage, as seen in the following table (using 3-bit values for brevity), is that we are not efficiently using one representation: “100” encodes a redundant “negative zero.”

Binary   Signed magnitude value
000      0
001      1
010      2
011      3
100      -0
101      -1
110      -2
111      -3
Another disadvantage is seen when executing arithmetic operations. For the CPU to add the two numbers 111 (–3) and 001 (1) together, it would require more than simple binary addition.
The solution to this is 2’s complement. 2's complement is a representation method that allows the use of plain binary arithmetic operations on signed integers while yielding the correct result. 2's complement is the method used in today’s computers to represent signed integers. In 2’s complement we still use the most significant bit to represent the sign of the integer. Positive integers, with a leading bit of 0, are straightforward, but negative numbers, with a leading bit of 1, are slightly different. A negative number is represented as the binary number that, when added to the positive number with the same absolute value, equals zero. This makes implementing the logic gates in the CPU much simpler than any other representation.
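A small Java sketch of this idea (Java ints are 32-bit 2's complement, and Integer.toBinaryString shows the raw bit pattern):

public class TwosComplementDemo {
    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(3));   // 11
        System.out.println(Integer.toBinaryString(-3));  // 11111111111111111111111111111101
        System.out.println(-3 + 3);  // 0: plain binary addition gives the correct result
    }
}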
The second way to represent numbers is as a floating point number. Floating point numbers represent “real” numbers, meaning they can represent integers and fractional numbers. Floating point numbers are represented using an exponential format. For a floating point number represented as a single word, this gives us 32 bits. In the typical representation format, the bits are broken up into three parts: the sign, the significand, and the exponent. So for a 32 bit number they would typically be separated like this:

sign (bit 31) | exponent (bits 30 – 23) | mantissa (bits 22 – 0)

The most significant bit (bit 31) is used to represent the sign of the number, 1 for negative, 0 for positive. The next eight bits (bits 30 – 23) are used to represent the exponent. The convention for the exponent is to “bias” it by 127. This means that to represent the exponent 6 we add 127 to it.

Example: 127 + 6 = 133, which is the binary value 10000101

On the other hand, the representation of the exponent −6 would be:

127 − 6 = 121, which is the binary value 01111001

The last 23 bits are used for the significand and are called the “mantissa.” The mantissa M is “normalized” so that it is between 0.5 and 1. The normalization is done by adjusting the binary exponent accordingly. So the decimal value 0.8125 in binary would be:

0.1101 = ( 1/2 + 1/4 + 1/16 = 13/16 = 0.8125 )
The other thing to know about the mantissa is that, because of our normalization process, it always begins with 1. Since this is always the case, we do not store the leading bit; this in effect gives the mantissa 24 bits of resolution using 23 bits. This means that we can represent values ranging from approximately 1.5 × 10^−45 to 3.4 × 10^38 with a precision of 7 digits.
Let’s look at how the decimal number 0.085 is stored as an example. 0.085 is stored as “0 01111011 01011100001010001111011.” Its decimal values would be 0 for the sign, 123 for the exponent, and 3019899 for the significand. The exact representation of this number would be:

2^(e − 127) × (1 + M / 2^23)
= 2^−4 × (1 + 3019899/8388608)
= 11408507/134217728
= 0.085000000894069671630859375
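We can verify this decomposition in Java; Float.floatToIntBits exposes the raw bit pattern, and shifting and masking pull out the three fields:

public class FloatBitsDemo {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(0.085f);
        System.out.println(bits >>> 31);           // 0, the sign (bit 31)
        System.out.println((bits >>> 23) & 0xFF);  // 123, the biased exponent (bits 30-23)
        System.out.println(bits & 0x7FFFFF);       // 3019899, the mantissa (bits 22-0)
    }
}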
As we can see, precision is not the same as accuracy. This can make programming with floating point numbers a perilous process for the unwary. Integers are exact, unless the result of a computation is outside the range that integers can represent. Floating point numbers, by contrast, are not exact, since some real numbers require an infinite number of digits to be represented, like 1/3.
Booleans
Booleans are the values true or false, yes or no, and on or off. Since we only need to distinguish between two different values and we are using a base 2 system, representing booleans is easy. We just use “00000000” for false and “11111111” for true. While we could represent booleans using only one byte, as that is the smallest piece of addressable memory, we typically use all 32 bits, as that is the word size of our CPU.
Programming Variables
You should already be familiar with variables from your previous courses, but now you know generally how the computer represents variables like Java's float, int, and boolean. In the examples given we only looked at 32 bit representations, but the idea is the same for larger representations like Java's double and long, which represent floating point and integer numbers using 64 bits. The only difference is that since you have more bits to work with, you can represent larger and more precise numbers.
Now that you know how different types of data are represented, you should realize how important it is to keep track of where you are in memory. The byte 00010100 can be interpreted as the decimal value 20, the ASCII control character DC4, or something else entirely, depending on how you look at it. This leads us into the file system.
File System
Before we look at how a file system is structured, we have to look at how the data is physically represented. The information on hard disk drives is split into what we call “tracks” and “sectors.” Tracks are concentric circles on the surface of the disk. Each track on a hard disk is numbered, starting with zero on the outside and increasing as you move toward the center. A sector is a subdivision of a track into fixed-size, physical data blocks. Traditionally, hard disks are split into sectors of 512 bytes each. However, modern hard drives may be divided into sectors of 2048 bytes, or even 4096 bytes each. When looking at the information on the hard drive, you look at it sector by sector.
Now that we know how hard disks are divided up physically, we can look at how the data is actually stored. Files are stored on the hard disk as “records.” Records are a physical unit of information made up of “fields.” Another way of thinking about them is as a subdivision of a file containing data related to a single entity. The fields that a record is made up of can be thought of as the variables in a program. For example, the following 6-byte record has room to hold a 4-byte integer and two 1-byte characters. A record like this could be used in a program to represent an ID designation.
The final term we need to know is a “block.” A block is a physical data structure that holds records on a storage medium. It is a group of sectors that the operating system can address. It might be one sector, or it could be several. Blocks can also be thought of as groups of bytes. The size of a block is specified by the operating system and is, therefore, operating system dependent. Blocks are manipulated as a whole. For example, when disk drives read and write data, they do so in blocks, such as 512-byte blocks.
The last things we need to look at are “files” and the job of the “file system.” The file system organizes the data of the storage medium into an interface that is familiar to the computer's user. A file therefore refers to one or more blocks of data and gives them a useful name, like “myFile.txt,” for reference by the computer's user. There is generally a very tight “coupling” between the file system and the operating system. The two main file systems in use right now are the proprietary NTFS file system used by Microsoft and the HFS+ equivalent by Apple. Another file system that many consumer electronics devices use is Microsoft's older FAT32 file system.
I/O buffering
We have looked at the basic components of the computer, how memory is divided into units, how data is represented as a binary value, and how files are stored on the hard drive. This leaves us with our last topic for this section, which is I/O buffering. I/O buffering is where you temporarily store data passing between two components. The purpose of this is to help smooth out the difference in the rates at which two devices can handle data.
I/O buffering is done at all levels of communication between the components of a computer. You can see why it is so important when trying to write a file. The CPU runs several magnitudes faster than disk drives. So if we did not have buffers, the CPU would be slowed down to the speed that the disk drive runs at and would be unable to do anything else until it finished. It should be apparent how inefficient that would be. With I/O buffering the CPU can quickly send the information to the buffer, then go about its business while the disk drive writes the data to disk. This idea works for input from a disk drive also. When the CPU wants a file, it will send the request to the disk drive. The CPU is then free to work on other tasks while the disk drive loads the file into the buffer.
If we look back at our primary components we can see that
RAM is used as an
intermediate buffer. In modern computers there are several
“controllers” that are used to
increase a computer's speed and efficiency. During a normal
execution cycle, whenever the CPU
needs a file from the disk drive it will tell the controller, which
will then load the information
from the buffer to a specific place in RAM. The CPU can then
begin executing using the data in
RAM.
Other Sources:
http://www.stanford.edu/class/cs101/how-computers-work-hardware.html
http://www.stanford.edu/class/cs101/how-computers-work-software.html
http://betterexplained.com/articles/understanding-big-and-little-endian-byte-order/
http://academic.evergreen.edu/projects/biophysics/technotes/program/2s_comp.htm
https://developer.intersystems.com/devconnection/deploy/176-17/disk-io-buffering-and
Chapter 2: Algorithm Complexity
For every problem we see in computer science, there are
usually several different
algorithms to solve the problem, as well as several different
implementations of each specific
algorithm that solves the problem. So the question then
becomes which algorithm should we
use? All things being equal, we should use the algorithm that is
easiest to understand,
implement, and document. However when performance is an
issue, then we have to look at
how fast the algorithm runs, as well as how efficiently it uses
the computer's resources. This
makes understanding the time complexity of algorithms a
central concept to computer science.
When we look at the time complexity of an algorithm, we typically do not consider how fast it runs on a specific computer. If one computer has a 1 GHz processor and another has a 2 GHz processor, then the second computer will generally execute the same algorithm twice as fast as the first. Instead, we look at how fast an algorithm runs as a function of the size of its input. We are typically interested in how the running time of an algorithm increases when we supply it with a “larger” input. What is the “size” of the input? That depends on the problem.
Examples:
• Number of elements in the array to sort
• Number of vertices and edges in a graph to traverse
To demonstrate how to analyze an algorithm, we will look at an algorithm for finding the majority element in an array of integers. It takes as input an array of N positive integers, so the size of the problem is N. The output is the majority element (M.E.), the element in the array occurring more than N/2 times. For simplicity we will assume that a majority element exists in the array. In the algorithm we go through each element in the array and count how many times each element appears.
Examples:
• <1,4,4,4,2,3> -> no majority element
• <1,4,2,4,4> -> 4
• <2,2,2,3,3,3,3> -> 3
Running time: A – assignment, C – comparison, E – expression; operations marked “(conditional)” execute only when the condition before them holds
MajorityElement(A[1..N])

Repeats   Time           Line
1         A              1   mIdx = 1
1         A + N*(C+E)    2   for i = 1 to N do
N         A              3       Counts[i] = 0
N         A + N*(C+E)    4       for j = 1 to N do
N*N       C + E          5           if A[i] == A[j] then Counts[i]++        (E is conditional)
N         C + A          6       if Counts[i] > Counts[mIdx] then mIdx = i   (A is conditional)
1                        7   return A[mIdx]

Running time = A + A + N*(C + E + A + A + N*(C + E + C + E) + C + A)
             = 2A + (2A + 2C + E)*N + (2C + E)*N^2 + A*N + E*N^2
First we will focus on the number of conditional executions.
Worst case – all elements of A are identical:
• we run E in each execution of line 5
• we never run A in line 6
• Running time = 2A + (2A+2C+E)*N + (2C+E)*N^2 + E*N^2
Best case – only N/2+1 copies of the majority element, at the start of A, all other elements unique:
• we run E (line 5) N/2+1 times in each of the first N/2+1 iterations of the line 2 loop
• we never run A in line 6
• Running time = 2A + (2A+2C+E)*N + (2C+E)*N^2 + E*(1 + N + ¼N^2)
Typically, distinguishing between the running times of different elementary operations is:
• Too detailed – it obscures the picture
• Too machine-dependent
So we can assume that all elementary operations execute in the same constant unit amount of time:
• A = C = E = 1
Then the worst-case running time simplifies to:
• 2 + 5N + 3N^2 + N^2 = 2 + 5N + 4N^2
Since N^2 is the part of the running time that grows fastest with a growing N, or more formally T(n) = n^2, this means that as the size of the input increases, the time it takes for the algorithm to complete grows quadratically, as N^2. We call this an O(n^2) algorithm, meaning the algorithm is an “order n squared” algorithm.
If we are worried about efficiency, then there is a problem with the previous approach: we repeat the same calculation many times. If the element X is in the array M times, then we count how many times X appears in the array M times. This wastes time and resources. The solution is to group the identical elements X together, so that we have to count how many X's there are in the array only once for each different X in the array. Since we are only looking at positive integers, we will set the element just past the end of the array to a negative number so we know when to stop.
This time we will ignore elementary operations, meaning executing a small number of A/C/E operations takes constant time (equal to 1).
The function N^2
MajorityElement(A[1..N])

Repeats   Time   Line
1         ?      1    A = sort(A)
1         1      2    me = A[1]
1         1      3    cnt = 1
1         1      4    currentCnt = 1
1         1      5    A[N+1] = -1
1         N      6    for i = 2 to N+1 do
N         1      7        if A[i-1] == A[i] then
                 8            currentCnt++
                 9        else
                 10           if currentCnt > cnt then
                 11               cnt = currentCnt
                 12               me = A[i-1]
                 13           currentCnt = 1
                 14   return me

(Note that the run counter is reset at the end of every run, whether or not that run was a new maximum.)
We can see that the running time grows as N + (time to sort A). There are several different sorting methods. We will not get into them now, but there are a couple that have a run time of N log N. So if we choose an appropriate sort method, the running time grows as a function of N log N, or O(n log n).
As we can see in the graph, the complexity still grows faster than linearly, but at a much slower rate than N^2.
The function N^2 (red) and N log N (blue)
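A minimal Java sketch of this second approach (it assumes a majority element exists; instead of the negative sentinel it simply updates the best run inside the loop, which has the same effect):

import java.util.Arrays;

public class MajorityElement {
    static int majorityElement(int[] a) {
        int[] s = Arrays.copyOf(a, a.length);
        Arrays.sort(s);                        // O(n log n): groups identical elements
        int me = s[0], cnt = 1, currentCnt = 1;
        for (int i = 1; i < s.length; i++) {   // O(n) scan over the runs
            currentCnt = (s[i] == s[i - 1]) ? currentCnt + 1 : 1;
            if (currentCnt > cnt) {            // longest run seen so far wins
                cnt = currentCnt;
                me = s[i];
            }
        }
        return me;
    }

    public static void main(String[] args) {
        System.out.println(majorityElement(new int[]{1, 4, 2, 4, 4}));  // 4
    }
}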
We have looked at two algorithms for solving the same
problem: one with a complexity
of O(n^2) and the other O(n log n). Many people seeing this
make the assumption that the
algorithm with the better time complexity always solves the
problem faster. While not
demonstrated in the previous algorithm, this is not always the case. We can see this by looking at a graph of an O(n) function and an O(n^2) function: 2N^2 (red) and 100N (blue).
As we can see for a small N, the algorithm with a better
complexity takes longer than the
other. This pattern is prominent in the study of algorithms;
usually the simple approach is faster
for small inputs, while the more complex approach is faster for
large inputs. This means that if
we are worried about maximum efficiency all the time we must
change our approach
depending on the size of the input.
The last thing to note in this section is the usage of the term
O(n). We said earlier that
this means the time complexity of the algorithm was n. This is
not the exact meaning of O(n).
O(n) means that the complexity of an algorithm can be bounded
above by some function c * n.
For example given an algorithm with a time complexity of 4N,
we can come up with a function
that will always be above it, and therefore act as an upper bound
of order N. This can be seen
graphically in the following graph: the function 4N (red) is bounded above by the function 5N (blue), making the function 4N O(n).
The source for the material in this section came from lecture notes prepared by Dr. Tomasz Arodz.
Other Sources:
http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=complexity1
http://www.cse.buffalo.edu/~szhong/cse191/hw/191Lec6.pdf
Chapter 3: Search Algorithms
Searching is a very common task in computer science. Examples include searching a list for a specific element or searching an array of objects for the smallest object. Although it is a very common task, there are two main searching algorithms: sequential search and binary search. Grasping the ideas and limitations of the two search algorithms is fairly intuitive, so this section will only give a summary example of each.
Sequential Search (sometimes called a linear search):
The idea behind the sequential search, as its name suggests, is
to start at the beginning
and sequentially look through a data set for the search
parameter and finish when the
parameter is either found or the end of the data set is reached.
Let's take a look at an example.
[4 29 6 3 9 34 23]
Suppose our search parameter was to see if the value 9 existed in the array. We would get the first value and see if it equals 9. If it does, we are finished. If not, we get the next value and try the comparison again. The whole sequence would look like this (the element being inspected is shown in parentheses):
[(4) 29 6 3 9 34 23]
[4 (29) 6 3 9 34 23]
[4 29 (6) 3 9 34 23]
[4 29 6 (3) 9 34 23]
[4 29 6 3 (9) 34 23] ← found
The other typical search parameter is to find the smallest or largest element in the data set. To do this with a sequential search, we need a temporary variable to hold the current smallest value. So we initialize the temporary variable to the first value in the array and then increment through the entire array, updating the temporary variable as we go. It would look like this (the element being inspected is shown in parentheses):
Initialize:
[(4) 29 6 3 9 34 23] X = 4
Increment through the rest of the array:
[4 (29) 6 3 9 34 23] X = 4
[4 29 (6) 3 9 34 23] X = 4
[4 29 6 (3) 9 34 23] X = 3
[4 29 6 3 (9) 34 23] X = 3
[4 29 6 3 9 (34) 23] X = 3
[4 29 6 3 9 34 (23)] X = 3
Smallest element = 3.
Now we will examine the complexity of the algorithm. It should be easy to see that for searches with a parameter like “what is the smallest element,” we have to go through the entire array. Since we only have to look at each element once, this makes the complexity/runtime in the best, average, and worst case scenarios O(n).
If we had a parameter like “does this value exist,” then we have to look a little closer. The best case would be that the first value we try is the value we are looking for. The worst case, of course, would be that the value we are looking for does not exist or is the last element. A simple way to characterize the average case is to say that, on average, the value we are looking for is in the middle.
Summary

Complexity                                  Number of Comparisons (for n = 100000)   Comparisons as a function of n
Best Case (fewest comparisons)              1 (target is first item)                  1
Worst Case (most comparisons)               100000 (target is last item)              n
Average Case (average number of comparisons) 50000 (target is middle item)            n/2
The best case analysis does not tell us much. If the first element checked happens to be the value we are looking for, any algorithm will take only one comparison. The worst and average case analyses give us a better indication of an algorithm’s efficiency.
Notice that if the size of the array grows, the number of comparisons required to find a parameter in both the worst and average cases grows linearly. In general, for an array of size n, the worst case is n comparisons. The algorithm is also called a linear search because its complexity and efficiency can be expressed as a linear function: the number of comparisons to find a target increases linearly as the size of the list increases.
Although we have not looked at sorting algorithms yet, the other thing to consider when looking at the complexity is whether the run time would change if the array were sorted. If the parameter is “what is the smallest or largest value,” then the answer is yes, as we would know the positions of the largest and smallest elements and would not need to search for them. However, if the parameter is “does this element exist,” then the answer is no, as our earlier analysis of the complexity would still be valid.
Pseudo Code
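A minimal Java sketch of the sequential search described above, returning the index of the target or -1 if it is not present:

static int sequentialSearch(int[] a, int target) {
    for (int i = 0; i < a.length; i++) {  // inspect elements left to right
        if (a[i] == target) {
            return i;                     // found the search parameter
        }
    }
    return -1;                            // reached the end: not in the data set
}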
Binary Search
Our second search algorithm is still intuitive, though slightly more complex to implement. You might wonder why we need another search algorithm, as a sequential search would technically work in every situation. The answer is efficiency. Since a sequential search’s complexity grows linearly with the size of the input, the time it takes to execute grows linearly as well. This is not an issue for small data sets with only a few hundred to a few thousand pieces of data. But what happens when the data set becomes large, like a few million to a few billion pieces of data? Even with modern computers it could take several minutes to complete the search.
This is where a binary search algorithm comes into play. A binary search can only be used when the data set is sorted and random access is supported. Therefore, in data structures such as a linked list, a binary search cannot be used. As one of the requirements for using a binary search is that the data set be sorted, there is no need to use a binary search to find the largest or smallest element: although you could search for it, its position is already known, so no searching is required.
The premise behind a binary search is simple. Since our data set is sorted, comparing the middle value to our parameter gives us one of three situations:
1.) The value we are looking for is in the upper portion of the data set,
2.) The value we are looking for is in the lower portion of the data set, or
3.) The middle value is the value we are looking for.
By always comparing the middle value, the binary search algorithm allows us to vastly reduce the number of comparisons. Let's look at an example.
[9 20 34 35 47 49 65 68 80 86]
Suppose our search parameter was to see if the value 34 existed in the array. We first find the middle value; if that is the value 34, we are done. If it is not, we “cut” the active area of the array in half, which reduces the potential comparisons by half as well. We keep doing this process until we find the value we are looking for or until we cannot cut the array in half anymore. The sequence of events would look like this.
Active section: [9 20 34 35 47 49 65 68 80 86]   (1 + 10)/2 = 5.5 => 5
Active section: [9 20 34 35]                     (1 + 4)/2 = 2.5 => 2
Active section: [34 35]                          (3 + 4)/2 = 3.5 => 3 => found 34
Now that we see how it works, let’s look at the complexity. We said earlier that a binary search was more efficient than a linear search. If that is so, how much more efficient is it? To answer this, we look at the number of comparisons needed in both the best and worst case scenarios. We will not look at the average case, as it is more difficult to compute, and it ignores the differences between the required computations corresponding to each comparison in the different algorithms.
The best case of course would be that the middle value is what we are looking for, so the best case scenario does not tell us very much about the algorithm's efficiency. That leaves the worst case scenario. The worst case, as with a sequential search, is that the value does not exist or is the last value that we check. So to compare the complexity of the worst case to the size of the input we get the following scenario.
[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]
Goal: find the value 16.
The first index we look at is: (1+16)/2 = 8.5 => 8
First comparison
Active Section: [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]
8 < 16, next index is: (9+16)/2 = 12.5 => 12
Second comparison
Active Section: [9 10 11 12 13 14 15 16]
12 < 16, next index is: (13+16)/2 = 14.5 => 14
Third comparison
Active Section: [13 14 15 16]
14 < 16, next index is: (15+16)/2 = 15.5 => 15
Fourth comparison
Active Section: [15 16]
15 < 16, next index is: (16+16)/2 = 16
Final comparison
Active Section: [16]
16 = 16
So it takes us a maximum of five comparisons to find any element in a dataset containing sixteen elements. Or to express it in mathematical terms, given a data set of size n, it takes us at most X comparisons, where X = floor(log2(n)) + 1. So our complexity is O(log n).
Summary

Complexity                       Number of Comparisons (for n = 100000)   Comparisons as a function of n
Best Case (fewest comparisons)   1 (target is middle item)                 1
Worst Case (most comparisons)    17 (target not in array)                  floor(log2(n)) + 1
Pseudo Code
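A minimal Java sketch of the binary search described above, again returning the index of the target or -1:

static int binarySearch(int[] a, int target) {
    int low = 0, high = a.length - 1;
    while (low <= high) {                 // the active section is a[low..high]
        int mid = (low + high) / 2;       // middle index, rounding down
        if (a[mid] == target) {
            return mid;                   // the middle value is what we want
        } else if (a[mid] < target) {
            low = mid + 1;                // target is in the upper portion
        } else {
            high = mid - 1;               // target is in the lower portion
        }
    }
    return -1;                            // the section is empty: not found
}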
Other sources:
http://research.cs.queensu.ca/home/cisc121/2006s/webnotes/search.html
Chapter 4: Sort Algorithms
Sorting
The sorting problem – a reminder of the definition:
Input: a sequence of numbers <a1, a2, ..., an>
Output: a permutation <a1', a2', ..., an'> of the input sequence, such that a1' <= a2' <= ... <= an'
Insertion sort:
The concept for the algorithm – to sort an array:
- Maintain two parts of the array:
  - Sorted part: initially empty (the left part)
  - Unsorted part: initially full (the right part)
- Take one element from the unsorted part and insert it at the correct position in the sorted part.
- Iterate until all elements are moved to the sorted part of the array, and the unsorted part is empty.
Start with an unsorted array of size n, [0...n-1], where 0 is the first array index and n-1 is the last index.
[ 4 5 2 0 9 ]
Imagine splitting this array into two different parts. The left
half will be sorted, the right
half is not. I will show my split in the array as two vertical
lines. An array split in the middle
would be of the form [0...i || j..n-1]. It is important to note that
this is still one array; the split is
conceptual only.
We are going to apply this conceptual split to our array. We will
put the split after the
first index, so we have just one element on the left, and n-1
elements on the right. After our
imaginary split, our array now looks like this.
[ 4 || 5 2 0 9 ]
i j n-1
Now during each iteration of insertion sort, we will look at the
first element of the right
part of our array which we previously showed was at index j.
We are going to move this
element (insert it) into the left part of the array. We will keep
moving the element to the left
until the left array is sorted again. As you can see, in this
example we didn’t have to physically
move anything in the array for this iteration.
[ 4 5 || 2 0 9]
We will repeat that process by moving 2 into the left section of
the array.
[ 4 5 2 || 0 9 ]
We can clearly see that the left part of the array is no longer
ordered, so let’s continue
moving 2 to the left until the array is ordered again.
[ 4 2 5 || 0 9 ]
[ 2 4 5 || 0 9 ]
We will continue to repeat this process until all elements are in
the left, sorted section.
[ 2 4 5 0 || 9 ]
[ 2 4 0 5 || 9 ]
[ 2 0 4 5 || 9 ]
[ 0 2 4 5 || 9 ] ← End of this iteration.
[ 0 2 4 5 9 || ] ← End of the final iteration.
We can see here that the left section of the array, which is the
sorted section, contains
all the elements and is still sorted. So now the question is, how
do we do this in a program?
Pseudo Code
Note this pseudo code receives a reference to an array.
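A Java sketch of the insertion sort described above (since the array is passed by reference, it is sorted in place):

static void insertionSort(int[] a) {
    for (int j = 1; j < a.length; j++) {  // a[0..j-1] is the sorted part
        int key = a[j];                   // first element of the unsorted part
        int i = j - 1;
        while (i >= 0 && a[i] > key) {    // shift larger elements one spot right
            a[i + 1] = a[i];
            i--;
        }
        a[i + 1] = key;                   // insert the key where it fits
    }
}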
Algorithm Complexity
Let’s examine the pseudo code above for complexity.
This will be a little more in-depth than we will go in the future, but for now we will show what everything is doing along with the relative complexities. The outer for loop increments from 1 to n-1. The run time of this is directly related to the size of the array, n. You should be able to see that as n gets larger, the number of times the for loop increments grows in proportion to n.
Everything else is contained in the for loop, so we must, in essence, say that the number of executions of everything inside the for loop also depends on n. We have a few constant time assignment operations and we have another loop. The complexity of this inner loop can be a little tricky to understand.
We have a key which is located at index j. We want to move this key until it fits in the right spot, which is when A[i] <= key or i < 0. The number of spaces we actually move this key element will vary throughout the algorithm, but the thing to remember is that, as n gets larger, we will generally have to move the key more spaces until we find its home, which makes this an O(n) loop. Remember that this loop is inside another loop that is also O(n), so the contents of the while loop can run up to about n^2 times in a worst case scenario. This means that this algorithm is O(n^2).
Now that we have looked at the complexity, it is easy to see how insertion sort performs:
Worst case performance: O(n^2) (e.g., a reverse-sorted array)
Best case performance: O(n) (e.g., an already sorted array)
Average case performance: O(n^2)
Other Sources:
http://www.algolist.net/Algorithms/Sorting/Insertion_sort
Selection Sort:
Our second sort focuses on a different, yet still intuitive, way of arranging elements into the correct order. Insertion sort focused on taking an element from the unordered part of the array and finding its place in the part that is ordered. Selection sort does something similar. However, instead of grabbing an arbitrary element from the unordered part, it finds the largest element and swaps it with the smallest element of the ordered part. Remember, since that part is ordered, its smallest element will always be its left-most element. Let’s take a look at an example.
[4 2 5 1 6 7 0]
This is an unordered array. Let’s divide it into a sorted and an unsorted portion, similar to what we did with insertion sort. However, the sorted part of this array will be the right side.
[4 2 5 1 6 7 || 0]
You can see that the left part of this array is not sorted, and the right side is sorted, as it only has one element. We will need to know which element is the largest in the unsorted part, so we will keep that element’s index as a key. The first element in the array is initially marked as the largest, and we update that as we move through the array. In the walkthrough below, the current element we are looking at is shown in parentheses, and the running largest is noted on the right. The first complete iteration looks like this:
LargestElement = 4
[(4) 2 5 1 6 7 || 0]   largest = 4
[4 (2) 5 1 6 7 || 0]   largest stays 4
[4 2 (5) 1 6 7 || 0]   largest = 5
[4 2 5 (1) 6 7 || 0]   largest stays 5
[4 2 5 1 (6) 7 || 0]   largest = 6
[4 2 5 1 6 (7) || 0]   largest = 7
[4 2 5 1 6 7 || (0)]   swap the largest with the first sorted element
[4 2 5 1 6 0 || 7]
As you should be able to see, we looked at every element in the
unordered array once.
We also looked at the first element of the ordered array. If we
found an element that was
larger than the previous largest, we simply marked that element
as the new largest and kept
looking. Once we arrived at the end of the unsorted array, all
we had to do was swap it with the
first element of the ordered array.
For each new iteration, we will slide the divider one element to the left and continue.
[(4) 2 5 1 6 || 0 7]   largest = 4
[4 (2) 5 1 6 || 0 7]   largest stays 4
[4 2 (5) 1 6 || 0 7]   largest = 5
[4 2 5 (1) 6 || 0 7]   largest stays 5
[4 2 5 1 (6) || 0 7]   largest = 6
[4 2 5 1 6 || (0) 7]   swap the largest with the first sorted element
[4 2 5 1 0 || 6 7]
[(4) 2 5 1 || 0 6 7]   largest = 4
[4 (2) 5 1 || 0 6 7]   largest stays 4
[4 2 (5) 1 || 0 6 7]   largest = 5
[4 2 5 (1) || 0 6 7]   largest stays 5
[4 2 5 1 || (0) 6 7]   swap
[4 2 0 1 || 5 6 7]
[(4) 2 0 || 1 5 6 7]   largest = 4
[4 (2) 0 || 1 5 6 7]   largest stays 4
[4 2 (0) || 1 5 6 7]   largest stays 4
[4 2 0 || (1) 5 6 7]   swap
[1 2 0 || 4 5 6 7]
[(1) 2 || 0 4 5 6 7]   largest = 1
[1 (2) || 0 4 5 6 7]   largest = 2
[1 2 || (0) 4 5 6 7]   swap
[1 0 || 2 4 5 6 7]
[(1) || 0 2 4 5 6 7]   largest = 1
[1 || (0) 2 4 5 6 7]   swap
[0 || 1 2 4 5 6 7]
Now that we have reached the end of this last iteration, we can see that, no matter what the first element in the array is, it will always be smaller than every element in the sorted part, because every element we moved thus far has been larger than this last element.
Before we even look at the pseudo code, we can get a good understanding of the complexity of this algorithm. For each extra element in the array, the number of iterations we have to do grows by 1. During each of these iterations, we have to look at every element in the unsorted part. While this number gradually gets smaller as the algorithm progresses, ultimately as n gets larger, so does the number of elements we have to look at during each iteration. This tells us already that selection sort will be O(n^2).
Pseudo-code
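A Java sketch of the selection sort walkthrough above. The divider index marks the swap target, and the scan covers the unsorted part plus the first sorted element, exactly as described:

static void selectionSort(int[] a) {
    for (int divider = a.length - 1; divider > 0; divider--) {
        int largest = 0;                      // index of the largest element so far
        for (int i = 1; i <= divider; i++) {
            if (a[i] > a[largest]) {
                largest = i;                  // mark the new largest and keep looking
            }
        }
        int tmp = a[divider];                 // swap it into the sorted part
        a[divider] = a[largest];
        a[largest] = tmp;
    }
}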
Bubble Sort:
The third sort we will discuss is the bubble sort. Unlike insertion and selection sort, this one is not so intuitive. The name comes from bubbles rising to the surface of water: as the bubble passes through the array, it moves each number closer to the location it needs to be in.
In order to show this we will once again start off with an unordered array. In the walkthrough below, the pair of elements currently inside the “bubble” is shown in parentheses.
[(9 1) 2 4 5 8 7 6 3]   9 > 1, swap
[1 (9 2) 4 5 8 7 6 3]   swap
[1 2 (9 4) 5 8 7 6 3]   swap
[1 2 4 (9 5) 8 7 6 3]   swap
[1 2 4 5 (9 8) 7 6 3]   swap
[1 2 4 5 8 (9 7) 6 3]   swap
[1 2 4 5 8 7 (9 6) 3]   swap
[1 2 4 5 8 7 6 (9 3)]   swap
[1 2 4 5 8 7 6 3 9]
This is one iteration of the bubble sort. The “bubble” (the parenthesized pair) goes from left to right, and each time it puts the two elements inside it in the proper order. Since the largest element, 9, happened to be at the beginning of the array, the 9 was trapped in the bubble until it was put at the very end. We will create a new bubble to iterate through the array, and each time it will grab the next largest element and put it in its place. Let’s go through the rest of this sort.
[(1 2) 4 5 8 7 6 3 9]   no swap
[1 (2 4) 5 8 7 6 3 9]   no swap
[1 2 (4 5) 8 7 6 3 9]   no swap
[1 2 4 (5 8) 7 6 3 9]   no swap
[1 2 4 5 (8 7) 6 3 9]   swap
[1 2 4 5 7 (8 6) 3 9]   swap
[1 2 4 5 7 6 (8 3) 9]   swap
[1 2 4 5 7 6 3 8 9] <- End of iteration 2
[(1 2) 4 5 7 6 3 8 9]   no swap
[1 (2 4) 5 7 6 3 8 9]   no swap
[1 2 (4 5) 7 6 3 8 9]   no swap
[1 2 4 (5 7) 6 3 8 9]   no swap
[1 2 4 5 (7 6) 3 8 9]   swap
[1 2 4 5 6 (7 3) 8 9]   swap
[1 2 4 5 6 3 7 8 9] <- End of iteration 3
[(1 2) 4 5 6 3 7 8 9]   no swap
[1 (2 4) 5 6 3 7 8 9]   no swap
[1 2 (4 5) 6 3 7 8 9]   no swap
[1 2 4 (5 6) 3 7 8 9]   no swap
[1 2 4 5 (6 3) 7 8 9]   swap
[1 2 4 5 3 6 7 8 9] <- End of iteration 4
[(1 2) 4 5 3 6 7 8 9]   no swap
[1 (2 4) 5 3 6 7 8 9]   no swap
[1 2 (4 5) 3 6 7 8 9]   no swap
[1 2 4 (5 3) 6 7 8 9]   swap
[1 2 4 3 5 6 7 8 9] <- End of iteration 5
[(1 2) 4 3 5 6 7 8 9]   no swap
[1 (2 4) 3 5 6 7 8 9]   no swap
[1 2 (4 3) 5 6 7 8 9]   swap
[1 2 3 4 5 6 7 8 9] <- End of iteration 6
As you can probably see, in each iteration the bubble has to go one less index into the array. This is fairly easy to implement, because we can just reduce the apparent size of the array by one in each iteration. You can also see that we stopped the algorithm early because the array is sorted. In a worst case scenario, we would have to iterate through this n-1 times. However, there is also a technique to determine whether the array is sorted, and all it requires is that we iterate through one more time.
[(1 2) 3 4 5 6 7 8 9]   no swap
[1 (2 3) 4 5 6 7 8 9]   no swap
[1 2 (3 4) 5 6 7 8 9]   no swap, and so on through the array
Notice that the bubble never moved anything. If the bubble had
moved something, then
we know that the array was not sorted when we began this
iteration.
Pseudo code
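A Java sketch of bubble sort with the early-exit check described above: a for loop (one pass of the bubble) inside a while loop that stops once a pass swaps nothing:

static void bubbleSort(int[] a) {
    int n = a.length;
    boolean swapped = true;
    while (swapped) {
        swapped = false;
        for (int i = 0; i < n - 1; i++) {  // the bubble passes left to right
            if (a[i] > a[i + 1]) {
                int tmp = a[i];            // put the pair in the proper order
                a[i] = a[i + 1];
                a[i + 1] = tmp;
                swapped = true;
            }
        }
        n--;  // the largest element has settled at the end of this pass
    }
}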
Complexity Summary:
Bubble sort contains a for loop that grows in proportion to n. This loop is inside a while loop. The while loop probably won’t iterate n times, but the number of its iterations will tend to grow linearly with n.
Worst Case: O(n^2)
We see the worst case when trying to sort an array that is initially reverse sorted.
Best Case: O(n)
We see the best case when trying to sort an array that is already sorted. The for loop will iterate once, see that the bubble didn’t move anything, and break out of the while loop.
Average Case: O(n^2)
For the majority of inputs, bubble sort's computation time will grow by some factor of n^2.
Merge Sort:
The final sorting algorithm we will discuss is merge sort. This sort uses a divide-and-conquer approach. This is not a way people tend to sort things by hand, but it is much more efficient at sorting very large arrays. We will use a different type of example for this, adapted from Dr. Tomasz Arodz’s CMSC 401 lecture notes. Many of his notes include images from “Introduction to Algorithms,” Third Edition, by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
General merging of two arrays to produce a single, sorted array:
We will start by exploring how to merge two arrays together. First we will see the merging of two unordered arrays, then of two ordered arrays. We want to start with two separate arrays and end up with a single, sorted array.
Given two arrays:
[3 5 1 7]
[4 2 8 6]
Let’s think about this for a minute. We have options if we want to merge these two arrays. If you apply what you have learned from the past sorts, we can simply put these two arrays into one and then sort them with insertion, selection or bubble sort. So, let’s try that. We will use the insertion sort method. We won’t show every minute step; just remember that during each step we are looking at every element in the sorted side until we find the correct place for each element.
[3 5 1 7 4 2 8 6]
[3 5|| 1 7 4 2 8 6]
[1 3 5|| 7 4 2 8 6]
[1 3 5 7|| 4 2 8 6]
[1 3 4 5 7|| 2 8 6]
[1 2 3 4 5 7|| 8 6]
[1 2 3 4 5 7 8|| 6]
[1 2 3 4 5 6 7 8||]
We have seen this before. We know this is an O(n^2) algorithm.
Let’s look at a different approach. We will take the same two
arrays, only first we will
sort those arrays before combining them.
[1 3 5 7]
[2 4 6 8]
Now let’s set them up so they are easier to visualize. We will
go through this part step-
by-step. We will have the two initial arrays on top and we will
create a destination array on the
bottom. The destination array will be large enough to fit all of
the elements and initially it will
be empty.
[1 3 5 7] [2 4 6 8]
[ ]
We will keep an index for the element we currently look at in each array, starting at the front of each; the comparisons below track these elements.
Now for the fun part, let’s start the merge process. During each iteration, we will perform one check: we find which of the two current elements is the smaller, put that element in the destination array, and then look at the next element from that source array. That may be a bit confusing to conceptualize, so let’s see it in action.
[1 3 5 7] [2 4 6 8]
(1 < 2)? Yes, let’s move 1 down.
[1 ]
[1 3 5 7] [2 4 6 8]
(3 < 2)? No. Let’s move the 2 down.
[1 2 ]
[1 3 5 7] [2 4 6 8]
(3 < 4)? Yes. 3 goes down.
[1 2 3 ]
[1 3 5 7] [2 4 6 8]
(5 < 4)? No. Move 4.
[1 2 3 4 ]
[1 3 5 7] [2 4 6 8]
(5 < 6)? Yes. Move 5.
[1 2 3 4 5 ]
[1 3 5 7] [2 4 6 8]
(7 < 6)? No. Move 6.
[1 2 3 4 5 6 ]
[1 3 5 7] [2 4 6 8]
(7 < 8)? Yes. Move 7.
[1 2 3 4 5 6 7 ]
Now we have a small issue. As you can see, the next, and only, possible item to move now is the 8. However, telling a computer how to do this is a bit more complicated. Luckily, we know a few solutions to this issue. The solution we will cover involves expanding each source array to include one extra element, positive infinity (INF). We would need to do this before we started the merging process, but it doesn’t affect anything until now. So, imagine we made this addition before, and we will resume where we left off.
[1 3 5 7 INF] [2 4 6 8 INF]
(INF < 8)? Not even close. Move 8.
[1 2 3 4 5 6 7 8]
Now you may be asking yourself, how do we use infinity in a Java program? Well, you can’t, but you can come close. All of Java's integer types have a maximum value; we can get it in Java with Integer.MAX_VALUE. The other numerical primitive data types have wrapper classes that provide maximum or infinity values as well.
Going back to the merging, we can see that, with two pre-sorted arrays, we only look at each element once. The number of elements we look at grows proportionally to the number of elements we need to merge. This means the merging of two sorted arrays is O(n). This doesn’t do us much good, though, if we have to use an O(n^2) algorithm to get the two initial arrays sorted.
Divide and Conquer approach:
As mentioned above, merge sort is a divide and conquer
algorithm. It divides the
problem into smaller problems of the same nature, recursively
solves them, and then combines
their solutions. We will divide the array into two smaller arrays
until the arrays contain just one
element each.
This division of the array creates a tree where each child node
is an array that is half the
size of its parent node. Each leaf will be an array containing
just one element. This, as we have
seen before, means that each leaf is a sorted array.
This should be fairly easy to see so far, at least conceptually.
Implementing this in code
will be slightly trickier since computers tend to do things
procedurally. We will see how this is
done later in this text, but for now let’s go over the merging
procedure.
We will start with the leaf nodes. Since they are already sorted,
we will simply use the
merging procedure we talked about above. We want to merge
them into the same parent
arrays they had before.
As you can see, merge sort first divides the problem into a lot of small, easy-to-tackle problems. The “easy problem” in this case is merging two arrays that are already sorted.
Each time we go down a level, we cut the array in focus in half. Each level contains twice as many smaller problems as the last, until the pieces contain one element each. This means that the dividing process creates log2(n) levels. We already discovered that merging two sorted arrays is O(n), so doing an O(n) operation log2(n) times gives us an O(n log2(n)) algorithm. This is significantly better than the O(n^2) sorting algorithms we discussed earlier for very large n.
Pseudo-Code
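The original pseudo-code figure is not reproduced here, but a minimal recursive sketch in Java, consistent with the divide-and-conquer description above and reusing the merge method sketched earlier, might look like this:

//Recursively sort an array with merge sort: split until the
//subarrays hold one element, then merge the sorted halves.
public static int[] mergeSort(int[] data) {
    if (data.length <= 1) {
        return data; //a one-element array is already sorted
    }
    int mid = data.length / 2;
    int[] left = mergeSort(java.util.Arrays.copyOfRange(data, 0, mid));
    int[] right = mergeSort(java.util.Arrays.copyOfRange(data, mid, data.length));
    return merge(left, right); //the sentinel-based merge from above
}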
Chapter 5: Trees
Trees:
We will now take a look at Trees. For our purposes, the only
trees we will be discussing
in detail are B-Trees. However, you will need a firm
understanding of general tree structures
before you can fully understand the concepts behind a B-Tree.
You may have gone over trees in CMSC-256; if so, this section will just be a review.
Trees in computer science are used as a means of storing data
using an acyclic
connected graph where each node has zero or more children
nodes and at most one parent
node. Furthermore, the children of each node have a specific
order.
Much of the following information was used, with permission,
from Dr. Arodz’s CMSC
401 notes.
We will discuss the standard tree operations below, but first we need to describe the structures we are using. The first is the node.
- Each node has a key: x.key
- Each node may have children: x.left, x.right
  - A null child represents no child
- Each node has a parent: x.p
  - The exception to this is the root node. The root node of any given tree, by definition, has no parent.

A binary tree must maintain certain properties with respect to these nodes. For each node x:
- If node y is in the left subtree of x, y.key <= x.key
- If node y is in the right subtree of x, y.key >= x.key
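A minimal Java sketch of such a node, matching the notation above, might look like this (the constructor is our own convenience):

//A binary tree node, mirroring the x.key / x.left / x.right / x.p
//notation used in the text. In a real application the node would
//also carry a data payload (x.data).
class Node {
    int key;
    Node left;   //null means no left child
    Node right;  //null means no right child
    Node p;      //parent; null for the root

    Node(int key) {
        this.key = key;
    }
}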
The example tree from the original figure has other properties you are probably already familiar with. With the trees we display, we will only show each node and its key. In a real application each node would also contain some information, usually accessed with a getter method such as x.data. Other properties of this particular tree include:
- 6 is the root node
- 5 is the left child of 6 (7 is the right)
- 2, 5 and 8 are leaf nodes
- 5 is the root of a subtree of 6
As with all data structures there is a set of operations we will
want to perform on this
tree. We will want to, at minimum, insert and remove nodes
from a tree. Other useful operations include in-order, pre-order and post-order traversals. Note: the following methods can also be done recursively; we will only show the iterative approach.
Insert:
We will walk through a binary tree insertion piece by piece; a complete sketch appears at the end of this subsection. Instead of explaining how an insertion works and then providing the code at the end, we take the opposite approach: this way we can provide the overview and decompose the code piece by piece.
This method takes a Tree and a new node to be inserted into the
tree.
We define the current node as the root of the tree. We also need
to keep track of the
parent. Since the root of any tree has no parent, we initialize
this as null.
This is where we find the proper location to put our new node.
We always insert onto a
leaf, so our while loop will iterate until the current node is null
which is why we want to keep
track of the parent. After iterating through this loop, the
currentParent variable will hold the
node which will be a parent to our new node…
…so let’s go ahead and assign the parent reference of our new
node to the
currentParent variable.
Now we have to determine the details of where we are inserting
this node. If the tree
was empty, Tree.root would have returned null. In that case,
currentNode would have been null
and the while loop would have never executed and currentParent
would be null. The above
statement covers this situation and inserts the new node as the
root of the tree.
If the parent is not null then we will have to insert to the left or
the right of that parent.
This statement determines where that is. If the value of the new
node is less than its parent we
will insert as the left child, otherwise right.
That’s it! We are done with inserting a single node into a
binary tree.
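As promised, here is the complete insertion assembled into one Java sketch. The Tree wrapper with a single root field is our own assumption, since the original pseudo-code figures are not reproduced here.

//A tree is just a reference to its root node (null when empty).
class Tree {
    Node root;
}

//Insert newNode into the tree, walking down to a leaf position.
static void treeInsert(Tree tree, Node newNode) {
    Node currentParent = null;        //the root has no parent
    Node currentNode = tree.root;
    //Walk down until we fall off a leaf, remembering the parent.
    while (currentNode != null) {
        currentParent = currentNode;
        if (newNode.key < currentNode.key) {
            currentNode = currentNode.left;
        } else {
            currentNode = currentNode.right;
        }
    }
    newNode.p = currentParent;
    if (currentParent == null) {
        tree.root = newNode;          //the tree was empty
    } else if (newNode.key < currentParent.key) {
        currentParent.left = newNode; //smaller keys go left
    } else {
        currentParent.right = newNode;
    }
}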
Remove:
Removing from a binary tree is slightly more complicated. We
will try to break it down
into small chunks. To do this, we will introduce two helper
methods. These also have other uses
outside of removing a node which we will not discuss.
The first helper method we will introduce is TreeMinimum. Given a tree, if you follow the left child until you reach a node with no left child, you end up at the smallest value in that tree. This can be used on any node in a tree in order to get the minimum value within a subtree. Two examples were highlighted in the original figure: the TreeMinimum of 6 is 2, and the TreeMinimum of 18 is 17.
If you were to take the minimum of 7, you would simply get 7.
This code should be fairly self-explanatory. We extract the root
of the current tree.
While the current node’s left child is not null, we move on to
that left child. When the left child
is null we know we have reached the smallest value in the tree.
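In Java, TreeMinimum could be sketched like this. We take a node rather than a whole tree, so it can be applied to any subtree as described above:

//Follow left children until there is no left child; that node
//holds the smallest key in the subtree rooted at x.
static Node treeMinimum(Node x) {
    while (x.left != null) {
        x = x.left;
    }
    return x;
}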
The second helper method is the Transplant method. This
method does the actual work
of removing a node from a tree. Its parameters include a tree,
the node to be removed and the
subtree that will take the place of that node.
Since this is a bit more complicated than the TreeMinimum, we
will once again break up
the code and explain it piece by piece.
If the node we want to remove has a null parent, then that node was the root of the tree. In this simple case we simply make the new subtree the root of the tree.
This part usually looks complicated at first. All we are doing is
checking the parent of the
removed node to see which child is being taken away. We will
then replace that child with the
subtree.
Finally, we will check to see if that subtree is empty. If it is not empty, then we finalize the attachment by setting the parent of the subtree's root to the removed node's parent.
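Assembling those three pieces, and using the Node and Tree sketches from earlier, a Java sketch of Transplant might read:

//Replace the subtree rooted at removed with the subtree rooted
//at subtree (which may be null), fixing up parent pointers.
static void transplant(Tree tree, Node removed, Node subtree) {
    if (removed.p == null) {
        tree.root = subtree;           //removed was the root
    } else if (removed == removed.p.left) {
        removed.p.left = subtree;      //removed was a left child
    } else {
        removed.p.right = subtree;     //removed was a right child
    }
    if (subtree != null) {
        subtree.p = removed.p;         //finalize the attachment
    }
}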
We can now go over the actual removal method.
We use both Transplant and TreeMinimum to do a proper
removal. Care must be taken
when removing a node. You cannot just use a transplant in
every situation. If we remove a node
that has only one child, we can just transplant the subtree
represented by that node’s child to
the parent. If that node has more than one child then a single
transplant won’t work.
This method takes a tree and a node to be removed from that
tree.
This conditional expression takes care of the two easy cases: if either child of the node to be removed is null, then a single transplant will effectively remove the node.
Otherwise we will require a bit more manipulation. We will
start by finding the smallest
value in the right subtree of the node to be removed. We want
this value because it is the
smallest value that is larger than every other value in the left
subtree of our removed node.
If the parent of that minimum node is the node we want
removed then we will skip this
next step. Otherwise we will run a Transplant on the minimum
node. This will take the
minimum node out of the right subtree. Remember, this node is larger than every element of the removed node's left subtree and smaller than or equal to every element in the removed node's right subtree. It only makes sense that we should replace the removed node with this minimum node.
After the transplant we assign the right child of this minimum
node to the right child of
the node we wish to remove and give that child a new parent.
Because this minimum node is
smaller than every node in the subtree represented by its new
right child, the fundamental
properties of the binary tree hold.
Now we can deal with the left subtree of the node we wish to
remove. We start this by
transplanting the node we wish to remove with our old minimum
reference.
After the transplant the node is now completely removed from
the tree. Unfortunately
its left subtree is still attached. We can fix this by attaching the
left child of the removed node
to that minimum node. Because the minimum node used to
reside in the removed node’s right
subtree and we are attaching the removed nodes left subtree,
every element in that left
subtree will be smaller than the minimum node.
And now we are finally done removing a node from our tree.
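For reference, here is the removal just described assembled into one Java sketch, using the transplant and treeMinimum helpers above:

//Remove node z from the tree.
static void treeRemove(Tree tree, Node z) {
    if (z.left == null) {
        transplant(tree, z, z.right);   //easy case: no left child
    } else if (z.right == null) {
        transplant(tree, z, z.left);    //easy case: no right child
    } else {
        //Find the smallest key in z's right subtree.
        Node min = treeMinimum(z.right);
        if (min.p != z) {
            //Detach min from the right subtree first...
            transplant(tree, min, min.right);
            //...then give min the removed node's right subtree.
            min.right = z.right;
            min.right.p = min;
        }
        //Put min in z's place and reattach z's left subtree.
        transplant(tree, z, min);
        min.left = z.left;
        min.left.p = min;
    }
}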
Complexity:
The operations done on a binary tree vary with the structure of
the tree. As you have
probably noticed, if you insert the values [1,2,3,4,5,6,7,8,9] into
a binary tree, you will
essentially get a list. Any operations done on a tree as
unbalanced as this will yield O(n)
complexity.
However, if a tree is properly balanced then it can yield an
average O(log(n)) complexity.
This is much better than performing operations on a linear list.
Additionally, this performance is comparable to sorting a list
and then doing a binary
search on that list to find and extract some information. In order
to add the unordered
elements of an array to a tree, we will have to iterate through
the list once O(n), and at each
element we will have to perform an insertion O(log(n)).
Extracting that information will simply
be O(log(n)). In total, inserting an entire list and extracting one
value will be O(nlog(n) + log(n)),
or simply O(nlog(n)).
Sorting the linear list can be done with O(nlog(n)) and then
extracting an element from a
sorted list can be done using a binary search which is O(log(n))
for a total of O(nlog(n) + log(n)),
or simply O(nlog(n)).
So, if doing each of these operations is bound by O(nlog(n)),
why use a tree over a linear
list? Well, it depends. Given some list of comparable elements,
sorting that list will be faster
than inserting the entire thing into a binary tree. Using a
standard desktop, a sample set of
50,000,000 integers took about 9 seconds to sort. That same
sample took almost 90 seconds to
insert into a binary tree. So the question remains, why use a
binary tree?
If you insert something into a linear list, the insert will take
O(n) time. An insert into a
binary tree can be done in O(log(n)) time. The same applies for
removals. So, if you are planning
on manipulating the data then a binary tree is probably what you
want. However, if you are not
going to change the data, having a sorted list may be more
beneficial than a tree. Basically, the
binary tree is much faster to maintain after the initial
preparation has been finished. You have
options, use them wisely.
Other Sources:
http://cslibrary.stanford.edu/110/BinaryTrees.html
B-Tree:
Binary Trees tend to be an excellent way of storing data when
all of that data can fit in
RAM. However, as most computer scientists know, there are
many applications where we have
more information than RAM available. This means we will have
to store the actual information
on a hard drive. Accessing a hard drive is much slower than
RAM as you have seen in the
architecture portion of this text.
The time it takes to perform I/O to a hard drive is slow because
there are physical
moving parts.
The platter spins around the
spindle. The read/write head
reads pages off of the current
track. We have to wait until the
information we want is under
the read/write head before we
can access it.
We could use a binary
tree to store very large amounts
of information. Every node
would be some file on the disk.
As we traverse the tree we will have to do a disk I/O for each
node visited. Even if the tree is
perfectly balanced we can still find ourselves doing lots of disk
I/O’s for very large trees.
To prevent this we want to store more than one thing in each
file. To be more specific,
since each disk read gets a page from the hard drive, we want to
fit as much information into a
node as possible before we overflow that page. This will ensure
we make the most efficient use
of our available resources while drastically reducing the time it
takes to find information.
This is where the B-Tree comes in. A B-Tree is a generalization of the binary search tree that allows each node to have more than two children. Each node will have some number of separator keys, and the number of children coming out of a node will be equal to the number of separator keys + 1.
An easy way to demonstrate the B-Tree is with the alphabet. The example tree from the original figure is rooted at M: the root has one separator key and therefore two children. Note that the number of keys in a node can be much larger than 2 or 3; in practical applications we may have thousands of keys in each node.
Each node in a B-Tree contains:
- x.p : a pointer/reference to the parent
- x.n : the number of separator keys in the node
- x.key[1 … x.n] : an array with the values of the separator keys (as opposed to a single key for the node)
- x.c[1 … x.n+1] : an array of pointers to children (as opposed to x.left and x.right)
- x.leaf : a Boolean value representing whether x is a leaf or not
Other Properties:
- Every leaf in the B-Tree has the same depth, i.e. the length of the path from the root to each leaf is the same (equal to the height of the tree).
- Each node may have no more than 2t-1 keys.
  - t is a predefined number that regulates the properties of the tree.
- Each node, with the exception of the root, must have at least t-1 keys. In a non-empty tree, the root must have at least 1 key.

B-Trees support the dictionary operations, meaning insert, search, and remove, which we will go over. They also support other operations, such as successor and predecessor, which we will not look into.
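A Java rendering of such a node might look like this. The 1-based array slots and the fixed capacity of 2t-1 keys are our own assumptions for the sketch:

//A B-Tree node with up to 2t-1 keys and 2t children, mirroring
//the x.n / x.key / x.c / x.leaf notation above.
class BTreeNode {
    int n;             //number of separator keys currently stored
    int[] key;         //keys, using indices 1 .. n as in the text
    BTreeNode[] c;     //children, using indices 1 .. n+1
    BTreeNode p;       //parent (null for the root)
    boolean leaf;      //true if this node has no children

    BTreeNode(int t, boolean leaf) {
        this.leaf = leaf;
        this.key = new int[2 * t];         //slot 0 unused; keys live in 1 .. 2t-1
        this.c = new BTreeNode[2 * t + 1]; //slot 0 unused; children live in 1 .. 2t
    }
}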
BTreeSearch:
Let's start with searching for an element in an existing B-Tree. In this search we will use a method, FindBranch(x, k), where x is a node and k is the key we are looking for. This searches through a single node to find where the branch to a child should be. It can be either a linear or binary search through the node; what is important is that it finds a key index i such that:
- x.key[i] >= k, and
- x.key[i-1] < k (or i == x.n + 1 if x.key[x.n] < k, i.e. every key in the node is smaller than k).
As with the binary tree, we will break the BTreeSearch procedure down to help you understand exactly what is going on; a complete sketch appears at the end of this subsection.
This method takes two parameters, x which is the current node
and k, the key we are
searching for.
This searches through the node to find the branch location as
described above.
This checks to see if the key we are looking for matches the
key we found with our
FindBranch method. If it is, we are done and can return the
node and the index of the key.
Otherwise if this is a leaf node then the key we are looking for
does not exist in this tree.
We have covered the case where we found the item we are looking for and the case where the key isn't in the structure, so now we must move on and repeat this process for the next node.
We have already found the branch to the next child and we know that child exists, so we need to read that node into memory. This is done with DiskRead(x.c[i]), which gets the i-th child of x. In Java, this could be done by creating an input stream from the file.
We then recursively call BTreeSearch with the new node and
the same key we are looking for.
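Here is a sketch of the whole search in Java, using the BTreeNode sketch above with int keys (the examples in this section use letters, but the logic is identical). FindBranch is written as a simple linear scan, and diskRead stands in for whatever I/O mechanism actually loads a child page from disk; both signatures are our own assumptions.

//Returns the first index i (1-based) with x.key[i] >= k,
//or x.n + 1 if every key in x is smaller than k.
static int findBranch(BTreeNode x, int k) {
    int i = 1;
    while (i <= x.n && x.key[i] < k) {
        i++;
    }
    return i;
}

//Search for key k starting at node x; returns the node containing
//k (for simplicity, just the node) or null if k is not in the tree.
static BTreeNode bTreeSearch(BTreeNode x, int k) {
    int i = findBranch(x, k);
    if (i <= x.n && x.key[i] == k) {
        return x;                      //found the key in this node
    }
    if (x.leaf) {
        return null;                   //a leaf with no match: k is absent
    }
    BTreeNode child = diskRead(x, i);  //load x.c[i] into memory
    return bTreeSearch(child, k);
}

//Placeholder: a real implementation would read the page holding
//the i-th child from disk, e.g. via an input stream on the file.
static BTreeNode diskRead(BTreeNode x, int i) {
    return x.c[i];
}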
BTreeInsert:
For the insert (and the remove later) we won't show pseudo code. The implementations are quite ugly because they involve modifying the actual file structures. If you find yourself needing to implement a B-Tree, pseudo code is available on the internet, or you may be able to form the code yourself after reading the descriptions of how the operations work.
When we wish to insert into a B-Tree, we always begin at the
root. We will do a search
until we find the appropriate leaf to insert our information. We
can only insert into a leaf if there is space (it must have fewer than 2t-1 keys).
In this case we do have room. We want to insert 'B' into this tree, where t = 3.
- When t = 3, each node has:
  - at least t - 1 = 2 keys
  - at most 2t - 1 = 5 keys
The appropriate leaf node may not always have room to insert.
We don’t want to just add a
new level because B-Trees have to hold the property where
every leaf in the tree will have the
same depth. It also means we will have a node with just one
element in it which is a waste of a
hard drive I/O. Remember: we want to make the most efficient
use of our space as possible
which is why we need to enforce these rules on how many keys
each node can hold.
We will now try to insert ‘Q’ into this B-Tree. Intuitively we
can see that, since ‘Q’ comes
after ‘P’ we will want to insert ‘Q’ into the node with [R S T U
V]. However, this node has 5 keys
already so we cannot insert into it as-is.
The next place we may look to insert is in the root with [G M P
X]. This is not an option
either. Doing this would introduce a child node between ‘P’ and
‘Q’ that has no keys. Since the
number of keys has to be at least t-1, this would violate the
properties of a B-Tree.
We do have another option: we can split this node in half and raise the median key 'T' to the root, which has room for one more key. The pointer between 'P' and 'T' will point to [R S] as a child, and the pointer between 'T' and 'X' will point to [U V]. Both of these children abide by the t-1 keys property.
We can then insert ‘Q’ into the tree after making the split.
There is one last contingency we will have to deal with. We
can see that if we tried to
insert ‘F’ into this tree we would have no room for it in either
the appropriate leaf or in the
root.
The way to deal with this is actually quite simple. Whenever
we want to insert
something into a tree, if we run into a node that is full we split
it unconditionally.
Inserting ‘L’ into this tree forced the split of the root node.
This is also why we have an
exception for the minimum number of keys in the root node.
Using this technique will ensure
that the tree not only stays balanced but there will always be
room if we need to split a node.
Special mention should be made in this case. Because we had to split the root, there was no parent in which to place the median key from the root. As you
probably guessed, we just make
that median key its own node and make the tree’s root pointer
point to that. The height of the
tree is increased by one whenever we have to split the root and
this is the only way that the
height of the tree is allowed to grow.
Finally, when we insert ‘F’, we have plenty of room for a split.
All properties discussed
earlier are held true.
At any given node we will have to do a search. Assuming this
is a linear search it will be
bound by O(t) where t is the constraint we applied limiting the
number of keys in our node.
Each node we visit takes us down one level in the tree, where we have to do another search. The total time ends up being O(t*h) = O(t*log_t(n)).
B-TreeRemoval:
It is even more complicated to delete a key from a B-Tree than it is to insert one. We will start, as usual with any tree structure, at the root and do a search for the key we want to delete. There are two main cases: deleting from a leaf and deleting from an internal node.
Deleting from Leaf:
If the leaf has at least t keys we can just remove the key. The
node will have at least t-1
keys and the structural properties will be maintained.
In this case we want to delete 'B'. We can't just remove it, because the leaf would then have fewer than t-1 keys. We will want to increase the size of the node before attempting any removal. To
make things simple, on removals we will check each node
before we move to it to see if it has t-
1 keys. If it does, we will preemptively increase the size of that
node just in case the key we
want to remove is there. There are several different approaches
to do this depending on the
siblings of that node.
Case A: The t-1 node has a neighbor with at least t keys.
In the case above, the sibling of the node we want to remove
‘B’ from has more than t-1
keys. We can move the first element from that node, ‘E’, up to
its parent. We can then move
the 'C' from the root to the node from which we want to delete 'B'.
After moving ‘C’ and ‘E’, the node had t elements and we were
able to remove ‘B’ with
no issue.
Case B: Neither left nor right siblings have more than t-1 keys.
In this case we will use a technique that merges nodes with a
key from the parent.
If we wish to delete ‘D’ we will first visit [C L]. This node has
t-1 keys so we want to
increase that. We can’t just take a key from [T X] because it
also only has t-1 keys. Instead we
will merge the two nodes using the key ‘P’ from the parent.
After the merge we are free to delete 'D', as its node had more than t-1 keys. Also, since we removed the only key from the root, we remove that node, and the new root of this tree is [C L P T X].
Why preemptively increase keys to t?
Let’s go back to a previous tree:
In this example we want to delete Z. We can see that we will
have to pass through [T X]
which has t-1 keys. We already know we can increase that by
using a sibling from [C G M].
We can now, because of our preemptive efforts, merge [U V]
and [Y Z] using ‘X’.
The process of removing 'Z' is now the simple case of deleting it directly from the leaf node.
Removing from non-leaf:
This process, again, has several cases. Sometimes, if we are lucky, we will want to delete a key that separates two children with t-1 keys each. If this is the case, we simply merge the children.
Case A: Key to be removed separates two children with t-1
keys.
There is not a lot to explain here. If we first merge [D E] and [J
K] using ‘G’ as a median,
then we will get a single child [D E G J K] between ‘C’ and ‘L’.
G would then be in a child with 2t-
1 keys and we have already seen how to delete that.
Case B: Key to be removed separates children with more than t-
1 keys.
If we wish to delete ‘M’ then we will have to find something to
take its place. To do this
we will have to find its predecessor. This will be the “largest”
key in the left subtree of ‘M’. The
predecessor will always be found in a leaf. This will use an
algorithm similar to the
TreeMinimum for Binary Trees.
The substituted node will have to be deleted from the leaf using
the standard deletion
techniques we have already discussed (i.e. ensuring each node
has at least t keys). After it is
“deleted” from the tree, we will just replace the element in the
internal node with our
predecessor. In the example above, we deleted ‘L’ from the left
subtree of ‘M’ and then
replaced ‘M’ with ‘L’, effectively deleting ‘M’ from the tree.
Overall Complexity:
We only move from top to bottom, returning up once only if we
need to delete from an
internal node. At each node we will access at most two of its
children. Since these are constant
values, meaning the extra work we do at each step will not
increase as n increases, we will have
O(log n) operations.
Other Sources:
http://cis.stvincent.edu/carlsond/swdesign/btree/btree.html
Chapter 6: Hashing
In this chapter we will look at hashing and how it is used to
implement the hash table
data structure. While they are outside the scope of this course,
other uses of hashing in
computer science include security and encryption. First off, let's
define what hashing is. Hashing,
in general, is the use of a “hash function” that maps a data set
of potentially variable length to
another data set of standardized length.
We will use this table representation of a directory for our
examples:
Index | Name            | Phone #      | Address          | Email
------+-----------------+--------------+------------------+------------------
0     | John, Smith     | 804-453-3425 | 25 West Main St. | [email protected]
1     | John, Doe       | 804-343-7385 | 54 Marshal Rd.   | [email protected]
2     | Jane, Wilkerson | 804-374-3836 | 978 Woodman Rd.  | [email protected]
At this point you might be wondering why we need another
data structure. We have
already looked at searching algorithms, sorting algorithms, and
trees and come up with fast
implementations for all of them. So why would we not just use
something like a binary tree or
sorted array to represent the directory? As we have seen
previously, if we used a sorted array to
implement the directory and a binary search to find the record,
that would give us a search time
of O(log2(n)). While that is much better than some of the other
methods we have seen, for a
large directory it could still take a long time to complete. The
second issue would be adding
new records. Whenever you add a new record you would have to
shift the entire part of the
array that comes after the new record for each addition that you
make. This could take up to
O(n). Even with a binary tree, the best we get for insertions and deletions is O(log2(n)).
Imagine if the telephone company stored numbers that way.
When you placed a call the
telephone company would have to search potentially several
directories from different
companies to find the number you are calling. This could still
work if only a few people were
calling at a time. However, when you have millions of calls at a
time, a faster method is required.
Another case would be a guidance system for a missile, where
last second changes are needed.
If the calculation takes too long, the missile would not have
time to change its trajectory.
This is where hashing comes into play. If we look back to the
table of our example
directory and made a hash function for it, we see that our hash function would need to return the indices 0, 1, and 2. This leads us to our discussion of creating hash functions. As creating hash functions is not the focus of this course, we will only look at one method: modular hashing. In modular hashing we convert our key into an integer, then divide by the table size M and take the remainder as the index.
In our table above this would give us the function: h(x) = x mod 3.
If we choose the phone number as our key x and pass 8043743836 to the hash function, it returns 8043743836 mod 3 = 1. So, to reiterate, a hash table is an array data structure that maps each element to an index by passing the element's key through the hash function.
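A minimal Java version of that hash function, assuming the key is a non-negative numeric string such as a phone number, might be:

//Modular hashing: convert the key to a non-negative integer and
//take the remainder modulo the table size M.
static int hash(String key, int tableSize) {
    //Use a long so a 10-digit phone number like "8043743836" fits.
    long numericKey = Long.parseLong(key);
    return (int) (numericKey % tableSize);
}

For example, hash("8043743836", 3) returns 1, matching the worked example above.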
Now that you know what hash tables are, there should be one
issue that jumps out at
you. That of course is, what happens when the hash function
returns the same value for two
different inputs? When this happens we have what is called a
“collision.” Avoiding collisions is
one of the primary concerns when constructing hash functions.
One of the simplest techniques we use is the size of the table.
If the table of our
example directory had a size of 10 then all the numbers ending
in “00”, “10” and so on would
each map to the same index. This gives us our standard for the
table size: the size of hash tables
should always be a prime number. A prime modulus spreads the remainders more evenly, because the table size shares no common factor (other than 1) with the keys. This will not completely avoid collisions, but it will significantly reduce them. Besides the table size, the only other way we can avoid collisions is by adjusting our key and hash function. As there is no good way to do this, we usually change our focus from avoiding collisions to dealing with them.
The most common method for dealing with collisions is to
make a table of linked lists. In
case you do not remember, a linked list is a data structure composed of a group of nodes, where each node holds a piece of data and a pointer to the next node in the list.
When you insert an element, you would get the index from the
hash function and add
the element to the head of the linked list. By adding to the head
of the list, it prevents you from
having to traverse the list on insertion. To then find the
element, you would get your index from
the hash function, and then do a sequential search of the list to
find the element you are
looking for. This gives us a data structure like the one in the original figure: an array of buckets, where each bucket holds a linked list of the records that hashed to that index.
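A minimal Java sketch of such a separate-chaining table might look like this; the Record fields are our own simplification of the directory table above.

//A hash table with separate chaining: an array of singly linked
//lists, inserting new records at the head of the chain.
class ChainedHashTable {
    static class Record {
        String phone;      //the key
        String name;       //the payload (simplified)
        Record next;       //next node in this bucket's chain

        Record(String phone, String name, Record next) {
            this.phone = phone;
            this.name = name;
            this.next = next;
        }
    }

    private final Record[] buckets;

    ChainedHashTable(int size) {
        buckets = new Record[size]; //ideally a prime, as discussed
    }

    private int index(String phone) {
        return (int) (Long.parseLong(phone) % buckets.length);
    }

    //O(1): hash the key and prepend to the head of the chain.
    void insert(String phone, String name) {
        int i = index(phone);
        buckets[i] = new Record(phone, name, buckets[i]);
    }

    //Hash the key, then sequentially search that bucket's chain.
    Record search(String phone) {
        for (Record r = buckets[index(phone)]; r != null; r = r.next) {
            if (r.phone.equals(phone)) {
                return r;
            }
        }
        return null; //not present
    }
}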
There are other methods for dealing with collisions. The first is
to increment through the
table until you find an empty space and place the element there.
There are two issues with
this. The first is clustering. Clustering is when elements are
clumped together around the same
index. When this happens it increases the potential for more
collisions to occur. The second issue with this method is finding the element. Since an element may have been placed in the next open space, you have to probe forward linearly from its hash index, and along the way you may pass elements that hashed to entirely different indices. So how do you know how far forward you should look? Because of these issues, this
method is rarely used. Another method would be to create a
second hash function for when
collisions occur. If this method was used, it should be evident
that you are just pushing the
problem farther along instead of dealing with it, as you would
then have the question of how to
deal with collisions from the second hash function.
At this point you should start seeing why hashing is so
valuable. Our directory example
only had three values, so the execution time would be very
small no matter what data structure
you used. If, on the other hand, you have a directory holding a
few million records, the speed of
a hash table over something like a binary tree would be very
significant.
There are several important things that should be noted about
hash tables.
1.) Even with a good hash function it is still possible for
collisions to occur. So always
anticipate and have a mechanism for dealing with collisions.
2.) Typically, the number of possible keys is much larger than
the actual keys that are used
and stored. This means that you need to know and plan for the
maximum number of entries.
3.) There are several different techniques for creating hash
functions. But no matter what
method you use, if the hash function is called twice with the
same input, it should always return
the same value.
4.) The only operations that the hash table data structure
supports are the dictionary
operations, based on the element's key: insert, search, and
delete. This means that there are
some limitations to hash tables. Unlike a binary tree or sorted
array, hash tables do not support
operations based on order. This means that they do not support
operations like minimum,
maximum, successor, and predecessor.
To finish up, we need to look at the complexity of hash tables.
For insertions using the
linked list implementation we have the time it takes to compute
the index plus the time to
insert into the linked list. Since we insert into the head, this
gives us a best and worst case time
complexity of O(1). For searches and deletions we have the time
it takes to compute the index,
plus the time it takes to find the element in the linked list. In
the best case, we have only one
element in each list which would give us a time complexity of
O(1). The worst case, on the other
hand, would be that all the keys hashed to the same index. If
that happened, we would have a
time complexity of O(n). As long as care is taken in designing
the hash table and hash function,
there should be a relatively small number of collisions. So the
average run time for well-
constructed hash tables would be O(1).
Other sources:
http://algs4.cs.princeton.edu/34hash/index.html
http://www.comsci.us/fs/notes/ch11.html
Authors’ Notes
The material for this course has been assembled by Steven
Andrews and Nathan
Stephens. This course is meant to help prepare you for taking
CMSC 508 Database Theory. As a
firm grasp of this material is essential, it is suggested that if you still have trouble understanding a topic, you look at the other sources provided at the end of each section.
BNFO-501/Project 1.pdf
BNFO-501 Project 1:
Input:
You will be given, in standard input, two arrays of integers.
The first array will be your data. The
second array will contain integers that may or may not be in the
first.
The first line of input will contain two integers separated by a
space. The first integer, n, will be
the size of the data array. The second integer, m, will be the
size of the query array. The next n lines will
contain a single integer that corresponds to a value in the data
array. Immediately following these will
be m more lines containing the elements of the query array. A
small sample set of input may look like
this:
5 2
4
7
12
89
102
92
89
This will correspond to the two arrays:
- Data: [4 7 12 89 102]
- Query: [92 89]
Output:
You are to write both a sequential and binary search that will
look for each value in the query
array and return true if it exists in the data array. You will also
print the time, in milliseconds, that each
search takes along with the number that is being searched for.
For example, the output of the input
above should look like this:
false:2ms false:0ms 92
true:2ms true:0ms 89
The search times will vary with the machine you are using. If
you were to use this input as a test, you will
most likely get 0ms for each search. To truly see the intended
result, it is recommended that you
generate your own input with at least 1,000,000 data elements.
Files used for grading purposes will not
exceed 50,000,000 data elements.
BNFO-501 Project 1:
Help & Tips:
A template file has been provided that shows one way of reading from and printing to standard I/O. If you would like to run your own tests, you can easily create a program that generates an input file in the standard format. We will be using the same file format, with the exception of ordered elements, for every project, so this would be a wise investment. If you want to use a file to test your program, you can redirect standard I/O on the command line. For example, if you are on a Windows machine and your program is named Project.java with the input file data.txt, you can use the command:
java Project < data.txt
If you would like to route the standard output to a file instead of having it print to the command prompt, you can use:
java Project < data.txt > output.txt
BNFO-501/Project 2.pdf
BNFO-501 Project 2:
Input:
You will be given, in standard input, two arrays of integers.
The first array will be your data. The
second array will contain integers that may or may not be in the
first.
The first line of input will contain two integers separated by a
space. The first integer, n, will be
the size of the data array. The second integer, m, will be the
size of the query array. The next n lines will
contain a single integer that corresponds to a value in the data
array. Immediately following these will
be m more lines containing the elements of the query array. A
small sample set of input may look like
this:
5 2
89
4
12
7
102
92
89
This will correspond to the two arrays:
- Data: [4 7 12 89 102]
- Query: [92 89]
Output:
You are to modify the program you wrote from Project 1. The
output will be the same with the
addition of 1 line of output. You are to print in standard output
the time it takes to prepare the data.
This preparation time will be the time, in milliseconds, that it
takes for you to sort the data. You must
write a merge sort to accomplish this task. You are NOT
allowed to use Arrays.sort().
You will then print the sequential and binary search results as
done in project 1:
Prep time: 45ms
false:2ms false:0ms 92
true:2ms true:0ms 89
The search times will vary with the machine you are using. If
you were to use this input as a test, you will
most likely get 0ms for each search. To truly see the intended
result, it is recommended that you
generate your own input with at least 1,000,000 data elements.
Files used for grading purposes will not
exceed 50,000,000 data elements.
BNFO-501 Project 2:
Help & Tips:
Try also writing one of the simpler O(n^2) sorts and compare it with the time it takes a merge sort to run for very large input.
BNFO-501/Project 3.pdf
BNFO-501 Project 3:
Input:
You will be given, in standard input, two arrays of integers.
The first array will be your data. The
second array will contain integers that may or may not be in the
first.
The first line of input will contain two integers separated by a
space. The first integer, n, will be
the size of the data array. The second integer, m, will be the
size of the query array. The next n lines will
contain a single integer that corresponds to a value in the data
array. Immediately following these will
be m more lines containing the elements of the query array. A
small sample set of input may look like
this:
5 2
89
4
12
7
102
92
89
This will correspond to the two arrays:
- Data: [4 7 12 89 102]
- Query: [92 89]
Output:
You are to once again modify the previous project. You will
write a basic binary tree which only
needs to insert and search for elements. Your prep time in this
example will be the time it takes to add
every element to the binary tree. You will only run one query
per item in the query array, which will be a
tree search to find the element. The output will be similar to
before:
Prep time: 450ms
false:0ms 92
true:0ms 89
The search times will vary with the machine you are using. If
you were to use this input as a test, you will
most likely get 0ms for each search. To truly see the intended
result, it is recommended that you
generate your own input with at least 1,000,000 data elements.
Files used for grading purposes will not
exceed 50,000,000 data elements.
BNFO-501 Project 3:
Help & Tips:
This is the first project where you are required to write something for which pseudo-code was not explicitly given in the text. However, you have everything you need to write this search. Just remember: in order to insert something, you must first search for the element to insert it under.
If you are familiar with the recursive algorithms for insertion
or searching you are welcome to
use them. Just remember, since the testing can be done with up
to 50,000,000 elements, memory may
be a concern. You can increase the amount of memory allocated
to the JVM for your machine, but the
grader may not.
BNFO-501/Project program template.txt
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class Template {

    public static void main(String[] args) {
        try { //Try to read the input.
            //Reads standard I/O. When graded, console input will be
            //redirected to read from a file.
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            int dataSize = 0;
            int querySize = 0;
            { //This block reads the first line and assigns the values to variables.
                String[] header = in.readLine().split(" ");
                dataSize = Integer.parseInt(header[0]);
                querySize = Integer.parseInt(header[1]);
            }
            System.out.println("Data Array Contents");
            //Read and store the contents of the data array.
            for (int i = 0; i < dataSize; i++) {
                System.out.println(in.readLine());
            }
            System.out.println("Query Array Contents");
            //Read and store the contents of the query array.
            for (int i = 0; i < querySize; i++) {
                System.out.println(in.readLine());
            }
        } catch (IOException e) {
            System.err.println("Error Reading Input: " + e.getMessage());
            System.exit(0);
        }
        //Example of how to time how long a method takes.
        long start = System.currentTimeMillis();
        timeThis();
        long end = System.currentTimeMillis();
        System.out.println((end - start) + "ms");
    }

    //Some method that takes a random amount of time to finish.
    public static void timeThis() {
        while ((int) (Math.random() * 100) < 95);
    }
}
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 

BNFO-501.DS_Store__MACOSXBNFO-501._.DS_StoreBNFO-501.docx

Storing information is an essential part of computer science. Retrieving that information is just as important, and retrieving it in an appropriate amount of time is more important still. You will be introduced to two simple algorithms: sequential search and binary search. We will discuss how and when to use each.

After completing Chapter 3, you will be able to complete the first programming assignment.

Chapter 4: Sort Algorithms

Sorting and ordering information is crucial when it comes to retrieving information quickly. There are many situations where you would want to sort your data. One example is in data retrieval: attempting to get information from data that is not in any order requires every element to be inspected. However, if that information is sorted, then the time it takes to retrieve it later can be greatly reduced.
There are many different sorting algorithms one can use. Some are very intuitive but not very efficient. Others can be very efficient but unintuitive and difficult to code. There are applications where each is useful, so we will discuss several different sorting algorithms, including:
• Insertion Sort
• Selection Sort
• Bubble Sort
• Merge Sort

An application will be provided with this document to help visualize exactly what these algorithms are doing while they are running. It will also include sorts that are not discussed in this text, along with variations of some of the more efficient sorts.

After completing Chapter 4, you will be able to complete the second programming assignment.

Chapter 5: Trees

You should already be familiar with the concept of the tree data structure. We will discuss a simple binary tree as an introduction, but our primary focus will be on B-Trees. It is important to note that there are many different tree structures we will not discuss, such as general (non-binary, non-balanced) trees, heaps, binary search trees and balanced trees.
The B-Tree is a way of storing very large amounts of information. Until now you may have been able to store all the data you need in RAM. Most databases hold much more information than available temporary memory, so we have no choice but to store the information on hard disks. As you will learn in the Physical Level chapter, retrieving information from disk is much slower than retrieving it from RAM. B-Trees are constructed with this in mind, giving us a way to quickly navigate and gain access to specific files.

After completing Chapter 5, you will be able to complete the third programming assignment.

Chapter 6: Hashing

Hashing is an important technique which gives extremely fast insertion of data and, when implemented correctly, extremely fast retrieval of that data. Hashing uses a combination of data structures you should be familiar with already: generally an array where each element of that array stores a linked list.
Chapter 1: Physical Level

This chapter gives an overview of some basic computer hardware concepts and what they mean to a computer scientist.

Hardware

The first thing we will look at is the primary hardware components of a computer. If we ignore peripherals and output devices, the three main components of a computer are:
1.) The central processing unit (CPU).
2.) Random access memory (RAM), sometimes called the main or primary memory.
3.) The hard drive, also called secondary memory.

The CPU is basically the "brains" of a computer. It is what executes application code, manipulates data, and controls the other hardware. The CPU is composed of three parts:
1.) ALU (Arithmetic Logic Unit) - As its name suggests, this is what does all the mathematical and logical operations.
2.) Control Unit - This can be thought of as the conductor; it does not do anything by itself, but it tells the ALU what to do and communicates where to store data in memory.
3.) Registers (little memory chips) - These are where the direct results of the ALU are put and where the data that is to be executed next is stored.
RAM is essentially the computer's workbench. It is the memory where the computer stores code and data that it is actively using. In more technical terms, RAM is a storage area of bytes that the CPU controls. RAM is relatively fast, especially when compared to a hard drive: retrieving a particular byte from RAM can be done in a few nanoseconds (1 nanosecond = 1 billionth of a second). The main differences between RAM and our last component, the hard drive, are its speed and the fact that RAM is volatile, or non-persistent. This means that when RAM loses power, like when a computer is turned off, all the data in RAM is lost.

The last primary hardware component is the hard drive (HD). Hard drives are a type of secondary storage; other types of secondary storage are flash drives, CDs, DVDs, magnetic tape, and Blu-ray discs. A hard drive is used for long-term, persistent storage of data. Persistent means that, unlike RAM, when power is removed the data is still there. Hard drives are typically spinning metal disks on which data is stored as magnetic patterns. The other kind of drive is the solid state disk (SSD). SSDs have no moving parts and are faster than magnetic disks, but are much more expensive. And while an SSD is faster than a magnetic disk, it is still much slower than RAM. However, no matter what type you use, all of them provide persistent storage.
Bytes

No matter what type of memory it is, registers, RAM, or hard drives, all memory is split up into "bytes." A byte is made up of 8 "bits," and it is the smallest "addressable" unit. Bytes are interpreted as base 2 numbers: each bit can have the binary value 1 or 0, and the value of each position is a power of two, from 2^0 up to 2^7. This means that one byte can hold a decimal value from 0 to 255.

There are different conventions for the symbol of a byte, but it is typically denoted "B." Prefixes are also used to represent multiples of bytes. However, since memory sizes are base 2, a kilobyte (symbol "kB") is 2^10 = 1024 bytes rather than 1000 bytes. Just to confuse things, you also have the symbol "kb" for kilobit. You can also have megabytes (MB), gigabytes (GB), terabytes (TB), etc. Because of the different naming conventions, there is sometimes some ambiguity about what a symbol means.

The bits in a byte are ordered right to left. The leftmost bit is called the "most significant bit" or "high order bit"; in the same manner, the rightmost bit is called the "least significant bit" or "low order bit." It is important that you remember that all the memory in a computer is measured in bytes and that a byte is the smallest addressable unit in memory.
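To make the bit positions concrete, here is a small Java example (ours, not from the original text) that reads the most and least significant bits of a byte-sized value:

public class ByteDemo {
    public static void main(String[] args) {
        int b = 0b10010110;                  // one byte written in binary: decimal 150
        System.out.println(b);               // prints 150
        int msb = (b >> 7) & 1;              // most significant (high order) bit
        int lsb = b & 1;                     // least significant (low order) bit
        System.out.println(msb + " " + lsb); // prints "1 0"
    }
}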
This leads us to the term "word." A word is basically the unit of data that the CPU thinks in: the word size of a CPU is how large a piece of data the CPU can manipulate at a time. It is also the size of most of the registers. When you hear about 32-bit and 64-bit machines, this refers to the word size; a 32-bit machine has a word size of 32 bits.

You can see why this is important with a simple arithmetic example. Suppose you wanted to execute 1000 + 1000. If you had a word size of one byte (8 bits), the largest number you could represent using a single word is 255, so to represent a larger number you have to use two words. This means that instead of the addition taking just one operation, and therefore one computer cycle, it would have to be split up into several operations taking several computer cycles. As this example demonstrates, when designing the architecture of a computer, the choice of the word size is very important.

While the bits in a byte are ordered right to left, that is not always the case for the bytes in a word when they are stored in memory. There are two ways that bytes are ordered when stored in memory, called "Big Endian" and "Little Endian." Suppose a given word consists of the four bytes B1 B2 B3 B4.
If the bytes are stored in memory in the order 1 to 4, they are stored big endian. If the bytes of each word are stored in the order 4 to 1, they are stored little endian. There are advantages and disadvantages to both formats, and the choice is the basis for one of the old arguments between PC and Mac hardware.

Little endian means that you store the low order byte of the number at the lowest memory address and the high order byte at the highest memory address. The advantage of this is that it creates a one-to-one relationship between the byte number and the memory address. Big endian means that you store the high order byte of the number at the lowest memory address and the low order byte at the highest memory address. The advantage of this format is that you can test whether a number is positive or negative just by looking at the first byte.
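Java's ByteBuffer lets you choose the byte order when laying a word out in memory, so the two formats can be compared directly. This is a sketch of ours, not part of the original text:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    public static void main(String[] args) {
        int word = 0x01020304;  // four bytes: B1=01, B2=02, B3=03, B4=04

        byte[] big = ByteBuffer.allocate(4)
                .order(ByteOrder.BIG_ENDIAN).putInt(word).array();
        byte[] little = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN).putInt(word).array();

        for (byte b : big)    System.out.printf("%02x ", b);  // 01 02 03 04
        System.out.println();
        for (byte b : little) System.out.printf("%02x ", b);  // 04 03 02 01
    }
}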
Encoding Schemes and Variable Types

Characters

Now that we know what a byte is, and that all data is stored as bytes, you might wonder how characters, pictures, and other types of data are represented. The answer is encoding schemes. ASCII (pronounced ask-ee) is an acronym for the American Standard Code for Information Interchange. ASCII is an encoding scheme for representing the English alphabet characters as numbers. In this encoding scheme, each character fits in a byte and is assigned a number from 0 to 127. For example, the ASCII code for uppercase N is the decimal value 78, and for lowercase n it is the decimal value 110. Since ASCII is what most computers use to represent text, it is what makes it possible to share text between computers. The first version of ASCII was published in 1963, and it went through several revisions before it became the version we use today in 1986.

While ASCII can represent English characters, it does not support characters from other languages. To solve this, another encoding scheme was created called Unicode. Unicode represents characters as a two-byte number. This means it can represent up to 65,536 (2^16) different characters. The disadvantage of Unicode is that, since it is a two-byte encoding scheme, it takes twice the memory of ASCII.
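In Java, a char is exactly this kind of two-byte Unicode value, and the ASCII codes occupy the bottom of that range, so casting between char and int shows the codes directly (a quick illustration of ours):

public class AsciiDemo {
    public static void main(String[] args) {
        System.out.println((int) 'N');  // 78  -- the ASCII code for uppercase N
        System.out.println((int) 'n');  // 110 -- the ASCII code for lowercase n
        System.out.println((char) 78);  // N   -- and back again
    }
}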
Numbers

We now know how to represent characters, but what about numbers? Numbers are represented in two different ways. The first is as integers. Since they are integers, they cannot hold fractional values. We could just represent them using the plain binary value; this would mean that for a 32-bit word we could represent the integer values 0 to 4,294,967,295. The problem with this method is that we cannot represent signed (negative) numbers.

The most obvious way to represent signed integers would be to use the most significant bit as the "sign" bit. This would mean that when the most significant bit is 1 the integer is negative, and when it is 0 it is positive. Since the number is represented with 32 bits and 1 bit is used to denote the sign, this leaves us 31 bits for the magnitude, allowing us to represent the range -2,147,483,647 to 2,147,483,647. This method of representation is called signed magnitude. One disadvantage is that we do not use every representation efficiently: with three bits, for example, the pattern 100 is a second, negative zero. Another disadvantage is seen when executing arithmetic operations. For the CPU to add the two numbers 111 (-3 in signed magnitude) and 001 (1) together, it would require more than simple binary addition.

The solution to this is 2's complement. 2's complement is a representation method that allows the use of plain binary arithmetic operations on signed integers while still yielding the correct result, and it is the method used in today's computers to represent signed integers.
In 2's complement we still use the most significant bit to represent the sign of the integer. Positive integers, with a leading bit of 0, are straightforward, but negative numbers, with a leading bit of 1, are slightly different: a negative number is represented as the binary pattern that, when added to a positive number with the same absolute value, equals zero. This makes implementing the logic gates in the CPU much simpler than any other representation.
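Java's int is a 32-bit 2's complement integer, so the bit patterns can be inspected directly with Integer.toBinaryString (our example, not the original's):

public class TwosComplementDemo {
    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(1));       // 1
        System.out.println(Integer.toBinaryString(-3));
        // 11111111111111111111111111111101
        // Plain binary addition of the two patterns yields the correct result, -2:
        System.out.println(-3 + 1);                          // -2
        System.out.println(Integer.toBinaryString(-3 + 1));
        // 11111111111111111111111111111110
    }
}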
The second way to represent numbers is as a floating point number. Floating point numbers represent "real" numbers, meaning they can represent both integers and fractional numbers. Floating point numbers are stored in an exponential format. For a floating point number represented as a single word, this gives us 32 bits, broken up into three parts: the sign, the significand, and the exponent. For a 32-bit number they are typically separated as follows.

The most significant bit (bit 31) is used to represent the sign of the number, 1 for negative, 0 for positive. The next eight bits (bits 30 to 23) are used to represent the exponent. The convention for the exponent is to "bias" it by 127. This means that to represent the exponent 6 we add 127 to it:

127 + 6 = 133, which is the binary value 10000101

On the other hand, the representation of the exponent -6 would be:

127 - 6 = 121, which is the binary value 01111001

The last 23 bits are used for the significand, also called the "mantissa." The mantissa M is "normalized" by adjusting the binary exponent so that its leading significant bit is always 1. As an example of a binary fraction, the decimal value 0.8125 in binary is 0.1101 (1/2 + 1/4 + 1/16 = 13/16 = 0.8125). The other thing to know about the mantissa is that, because of the normalization process, it always begins with 1. Since this is always the case, we do not store the leading bit; this in effect gives the mantissa 24 bits of resolution using 23 bits. With this format we can represent values ranging from approximately 1.5 × 10^-45 to 3.4 × 10^38 with a precision of about 7 decimal digits.

Let's look at how the decimal number 0.085 is stored as an example. 0.085 is stored as 0 01111011 01011100001010001111011. Its field values are 0 for the sign, 123 for the exponent, and 3019899 for the significand. The exact value represented is:

2^(e-127) × (1 + M/2^23) = 2^(123-127) × (1 + 3019899/8388608) = 11408507/134217728 = 0.085000000894069671630859375
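Java exposes this exact layout through Float.floatToIntBits, which we can use to pull the three fields back out of 0.085f and confirm the values above (a sketch of ours):

public class FloatBitsDemo {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(0.085f);
        int sign     = (bits >>> 31) & 0x1;   // 0   (positive)
        int exponent = (bits >>> 23) & 0xFF;  // 123 (biased; 123 - 127 = -4)
        int mantissa = bits & 0x7FFFFF;       // 3019899
        System.out.println(sign + " " + exponent + " " + mantissa);
    }
}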
As we can see, precision is not the same as accuracy. This can make programming with floating point numbers a perilous process for the unwary. Integers are exact, unless the result of a computation is outside the range that integers can represent. Floating point numbers, by contrast, are not exact, since some real numbers, like 1/3, require an infinite number of digits to be represented.

Booleans

Booleans are the values true or false, yes or no, on or off. Since we only need to distinguish between two different values and we are using a base 2 system, representing booleans is easy: we can simply use 00000000 for false and 11111111 for true.
While we could represent booleans using only one byte, as that is the smallest piece of addressable memory, we typically use all 32 bits, as that is the word size of the CPU.

Programming Variables

You should already be familiar with variables from your previous courses, but now you know generally how the computer represents variables like Java's float, int, and boolean. In the examples given we only looked at 32-bit representations, but the idea is the same for larger representations like Java's double and long, which represent floating point and integer numbers using 64 bits. The only difference is that, since you have more bits to work with, you can represent larger and more precise numbers.

Now that you know how different types of data are represented, it should be clear how important it is to keep track of where you are in memory. The byte 00100000 can be interpreted as the decimal value 32, as the ASCII character for a space, or as something else entirely, depending on how you look at it. This leads us into the file system.

File System

Before we look at how a file system is structured, we have to look at how the data is physically represented. The information on hard disk drives is split into what we call "tracks" and "sectors."
Tracks are concentric circles on the surface of the disk. Each track on a hard disk is numbered, starting with zero on the outside and increasing as you move toward the center. A sector is a subdivision of a track into fixed-size physical data blocks. Traditionally, hard disks are split into sectors of 512 bytes each; however, many modern hard drives are divided into sectors of 2048 bytes, or even 4096 bytes, each. When looking at the information on the hard drive, you look at it sector by sector.

Now that we know how hard disks are divided up physically, we can look at how the data is actually stored. Files are stored on the hard disk as "records." Records are a physical unit of information made up of "fields." Another way of thinking about a record is as a subdivision of a file containing data related to a single entity. The fields that a record is made up of can be thought of as the variables in a program. For example, a 6-byte record has room to hold a 4-byte integer and two 1-byte characters; a record like this could be used in a program to represent an ID designation.
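One way to picture such a record in Java is to pack the fields into a byte buffer by hand. Here we assume a 4-byte integer ID plus two 1-byte ASCII characters, giving the 6-byte record described above; the field values are made up for illustration:

import java.nio.ByteBuffer;

public class RecordDemo {
    public static void main(String[] args) {
        ByteBuffer record = ByteBuffer.allocate(6); // a 6 byte record
        record.putInt(1042);       // the 4 byte integer field
        record.put((byte) 'V');    // first 1 byte character field
        record.put((byte) 'A');    // second 1 byte character field

        record.flip();             // rewind so we can read the fields back
        System.out.println(record.getInt() + " "
                + (char) record.get() + (char) record.get());  // 1042 VA
    }
}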
The final term we need to know is the "block." A block is a physical data structure that holds records on a storage medium. It is a group of sectors that the operating system can address: it might be one sector, or it could be several. Blocks can also be thought of as groups of bytes. The size of a block is specified by the operating system and is, therefore, operating system dependent. Blocks are manipulated as a whole; for example, when disk drives read and write data, they do so in 512-byte blocks.

The last things we need to look at are "files" and the job of the "file system." The file system organizes the data of the storage medium into an interface that is familiar to the computer's user. A file, therefore, refers to one or more blocks of data and gives them a useful name, like "myFile.txt," for reference by the computer's user. There is generally a very tight "coupling" between the file system and the operating system. The two main desktop file systems in use right now are the proprietary NTFS file system used by Microsoft and the HFS+ equivalent by Apple; another file system used by several electronic devices is Microsoft's older FAT32 file system.

I/O Buffering

We have looked at the basic components of the computer, how memory is divided into units, how data is represented as a binary value, and how files are stored on the hard drive. This leaves us with our last topic for this section, which is I/O buffering. I/O buffering is where you temporarily store data passing between two components.
The purpose of this is to help smooth out the difference in the rates at which two devices can handle data. I/O buffering is done at all levels of communication between the components of a computer. You can see why it is so important when trying to write a file. The CPU runs several orders of magnitude faster than disk drives, so if we did not have buffers, the CPU would be slowed down to the speed of the disk drive and would be unable to do anything else until the write finished. It should be apparent how inefficient that would be. With I/O buffering, the CPU can quickly send the information to the buffer and then go about its business while the disk drive writes the data out. This idea works for input from a disk drive as well: when the CPU wants a file, it sends the request to the disk drive and is then free to work on other things while the disk drive loads the file into the buffer.

If we look back at our primary components, we can see that RAM is used as an intermediate buffer. In modern computers there are several "controllers" that are used to increase a computer's speed and efficiency. During a normal execution cycle, whenever the CPU needs a file from the disk drive it will tell the controller, which will then load the information from the buffer to a specific place in RAM. The CPU can then begin executing using the data in RAM.
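The same idea appears at the application level in Java's buffered streams: each write lands in an in-memory buffer, and the slow disk write happens only when the buffer fills or the stream is closed. A minimal sketch (ours; the file name is made up):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class BufferDemo {
    public static void main(String[] args) throws IOException {
        try (BufferedOutputStream out =
                 new BufferedOutputStream(new FileOutputStream("myFile.txt"))) {
            for (int i = 0; i < 100_000; i++) {
                out.write('x');   // goes to the in-memory buffer, not straight to disk
            }
        }                         // close() flushes the final partial buffer to disk
    }
}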
Other Sources:
http://www.stanford.edu/class/cs101/how-computers-work-hardware.html
http://www.stanford.edu/class/cs101/how-computers-work-software.html
http://betterexplained.com/articles/understanding-big-and-little-endian-byte-order/
http://academic.evergreen.edu/projects/biophysics/technotes/program/2s_comp.htm
https://developer.intersystems.com/devconnection/deploy/176-17/disk-io-buffering-and

Chapter 2: Algorithm Complexity

For every problem we see in computer science, there are usually several different algorithms to solve it, as well as several different implementations of each specific algorithm.
So the question then becomes: which algorithm should we use? All things being equal, we should use the algorithm that is easiest to understand, implement, and document. However, when performance is an issue, we have to look at how fast the algorithm runs, as well as how efficiently it uses the computer's resources. This makes understanding the time complexity of algorithms a central concept of computer science.

When we look at the time complexity of an algorithm, we typically do not consider how fast it runs on a specific computer. If one computer has a 1 GHz processor and another has a 2 GHz processor, the second computer will generally execute the same algorithm twice as fast as the first. Instead, we look at how fast an algorithm runs as a function of the size of its input: we are typically interested in how the running time of an algorithm increases when we supply it with a larger input.

What is the "size" of the input? That depends on the problem. Examples:
• The number of elements in the array to sort
• The number of vertices and edges in a graph to traverse

To demonstrate how to analyze an algorithm, we will look at an algorithm for finding the majority element in an array of integers. It takes as input an array of N positive integers, so the size of the problem is N. The output is the majority element (M.E.), the element in the array occurring more than N/2 times.
For simplicity we will assume that a majority element exists in the array. In the algorithm we go through each element in the array and count how many times each element appears. Examples:
• <1,4,4,4,2,3> -> no majority element
• <1,4,2,4,4> -> 4
• <2,2,2,3,3,3,3> -> 3

Running time analysis (A = assignment, C = comparison, E = expression):

MajorityElement(A[1..N])                           repeats   time
1   mIdx = 1                                       1         A
2   for i = 1 to N do                              1         A + N*(C+E)
3       Counts[i] = 0                              N         A
4       for j = 1 to N do                          N         A + N*(C+E)
5           if A[i] == A[j] then Counts[i]++       N*N       C + E
6       if Counts[i] > Counts[mIdx] then mIdx = i  N         C + A
7   return A[mIdx]

(The E on line 5 and the A on line 6 are conditional executions: they run only when the corresponding if test succeeds.)
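Translated into Java (our translation; the pseudo code above is the version being analyzed), the counting algorithm looks like this:

public class MajorityElementCounting {
    // Returns the majority element, assuming one exists (occurs more than N/2 times).
    static int majorityElement(int[] a) {
        int n = a.length;
        int[] counts = new int[n];
        int mIdx = 0;
        for (int i = 0; i < n; i++) {
            counts[i] = 0;
            for (int j = 0; j < n; j++) {
                if (a[i] == a[j]) counts[i]++;  // count occurrences of a[i]
            }
            if (counts[i] > counts[mIdx]) mIdx = i;
        }
        return a[mIdx];
    }

    public static void main(String[] args) {
        System.out.println(majorityElement(new int[]{1, 4, 2, 4, 4}));  // prints 4
    }
}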
Running time = A + A + N*(C + E + A + A + N*(C + E + C + E) + C + A)
             = 2A + N*(2A + 2C + E + A + N*(2C + E + E))
             = 2A + (2A + 2C + E)*N + (2C + E)*N^2 + A*N + E*N^2

First we will focus on the number of conditional executions.

Worst case: all elements of A are identical.
• we run the E in line 5 in every execution
• we never run the A in line 6
• Running time = 2A + (2A + 2C + E)*N + (2C + E)*N^2 + E*N^2

Best case: only N/2 + 1 copies of the majority element, at the start of A, with all other elements unique.
• we run the E in line 5 only N/2 + 1 times in each of the first N/2 + 1 iterations of the line 2 loop
• we never run the A in line 6
• Running time = 2A + (2A + 2C + E)*N + (2C + E)*N^2 + E*(1 + N + N^2/4)

Typically, distinguishing between the running times of different elementary operations is:
• too detailed, which obscures the picture
• too machine-dependent
So we can assume that all elementary operations execute in the same constant unit of time:
• A = C = E = 1

Then the worst-case running time simplifies to:
• 2 + 5N + 3N^2 + N^2 = 2 + 5N + 4N^2

Since N^2 is the part that grows fastest with a growing N, or more formally T(n) = n^2, this means that as the size of the input increases, the time it takes for the algorithm to complete grows quadratically, as N^2. We call this an O(n^2) algorithm; O(n^2) means the algorithm is an "order n squared" algorithm.

If we are worried about efficiency, then there is a problem with the previous approach: we repeat the same calculation many times. If the element X is in the array M times, then we count how many times X appears in the array M times. This wastes time and resources. The solution is to group the identical elements together, so that we count how many X's there are in the array only once for each different X in the array. Since we are only looking at positive integers, we will set the last element to a negative number so we know when to stop.
This time we will ignore the distinctions between elementary operations, meaning that executing a small number of A/C/E operations takes constant time (equal to 1).

MajorityElement(A[1..N])                 repeats   time
1    A = sort(A);                        1         ?
2    me = A[1];                          1         1
3    cnt = 1;                            1         1
4    currentCnt = 1;                     1         1
5    A[N+1] = -1;                        1         1
6    for i = 2 to N+1 do                 1         N
7        if A[i-1] == A[i] then          N         1
8            currentCnt++;                         1
9        else
             if currentCnt > cnt then              1
10               cnt = currentCnt;                 1
11               me = A[i-1];                      1
12           currentCnt = 1;                       1
13   return me;                          1         1

(Line 12 runs whenever A[i-1] != A[i], resetting the count for the new run of identical values.)
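In Java (again our translation), using the library sort and the same negative sentinel trick, the improved version might look like this:

import java.util.Arrays;

public class MajorityElementSorted {
    // Assumes all elements are positive and a majority element exists.
    static int majorityElement(int[] a) {
        int n = a.length;
        int[] s = Arrays.copyOf(a, n + 1);  // one extra slot for the sentinel
        Arrays.sort(s, 0, n);               // the O(n log n) part
        s[n] = -1;                          // sentinel: never equal to a positive element
        int me = s[0], cnt = 1, currentCnt = 1;
        for (int i = 1; i <= n; i++) {
            if (s[i - 1] == s[i]) {
                currentCnt++;               // still inside a run of identical values
            } else {
                if (currentCnt > cnt) {     // a run just ended; is it the longest so far?
                    cnt = currentCnt;
                    me = s[i - 1];
                }
                currentCnt = 1;             // start counting the next run
            }
        }
        return me;
    }

    public static void main(String[] args) {
        System.out.println(majorityElement(new int[]{2, 2, 2, 3, 3, 3, 3}));  // prints 3
    }
}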
  • 28. 13 return me; We can see that the running time is growing as N + (time to sort A). There are several different sorting methods. We will not get into them now, but there are a couple that have a run time of N log N. So if we choose an appropriate sort method, running time can be growing as a function of N log N or O(n log n). As we can see in the graph, the complexity is still exponential, but it grows at a much slower rate. The function N^2 (red) and N log N (blue) 20 We have looked at two algorithms for solving the same problem: one with a complexity of O(n^2) and the other O(n log n). Many people seeing this make the assumption that the algorithm with the better time complexity always solves the problem faster. While not demonstrated in the previous algorithm, this is not always the case. We can see this by looking at the graph of a O(n) and O(n^2). As we can see for a small N, the algorithm with a better complexity takes longer than the other. This pattern is prominent in the study of algorithms;
usually the simple approach is faster for small inputs, while the more complex approach is faster for large inputs. This means that if we are worried about maximum efficiency all the time, we must change our approach depending on the size of the input.

[Figure: 2N^2 (red) and 100N (blue)]

The last thing to note in this section is the usage of the term O(n). We said earlier that this means the time complexity of the algorithm was n. This is not the exact meaning of O(n): O(n) means that the complexity of an algorithm can be bounded above by some function c * n. For example, given an algorithm with a time complexity of 4N, we can come up with a function that will always be above it and therefore acts as an upper bound of order N. This can be seen graphically in the following figure.

[Figure: the function 4N (red) is bounded above by the function 5N (blue), making the function 4N O(n)]

The source for the material in this section came from lecture notes prepared by Dr. Tomasz Arodz.

Other Sources:
http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=complexity1
http://www.cse.buffalo.edu/~szhong/cse191/hw/191Lec6.pdf

Chapter 3: Search Algorithms

Searching is a very common task in computer science. Examples include searching a list for a specific element or searching an array of objects for the smallest object. Although it is a very common task, there are only two main searching algorithms: the sequential search and the binary search. Grasping the ideas and limitations of the two search algorithms is fairly intuitive, so this section will only give a summary example of each.

Sequential Search (sometimes called a linear search):
The idea behind the sequential search, as its name suggests, is to start at the beginning and sequentially look through a data set for the search parameter, finishing when the parameter is either found or the end of the data set is reached. Let's take a look at an example.

[4 29 6 3 9 34 23]

Suppose our search parameter was to see if the value 9 existed in the array. We would get the first value and see if it equals 9. If it does, we are finished. If not, we get the next value and try the comparison again. The whole sequence would look like this, with the element being compared at each step marked with asterisks:

[*4* 29 6 3 9 34 23]
[4 *29* 6 3 9 34 23]
[4 29 *6* 3 9 34 23]
[4 29 6 *3* 9 34 23]
[4 29 6 3 *9* 34 23]  <- found

The other typical search parameter is to find the smallest or largest element in the data set. To do this with a sequential search, we have to have a temporary variable to hold the current smallest value. So we initialize the temporary variable to the first value in the array and then step through the entire array, updating the temporary variable as we go. It would look like this:

Initialize:
[*4* 29 6 3 9 34 23]  X = 4

Increment through the rest of the array:
[4 *29* 6 3 9 34 23]  X = 4
[4 29 *6* 3 9 34 23]  X = 4
[4 29 6 *3* 9 34 23]  X = 3
[4 29 6 3 *9* 34 23]  X = 3
[4 29 6 3 9 *34* 23]  X = 3
[4 29 6 3 9 34 *23*]  X = 3

Smallest element = 3.

Now we will examine the complexity of the algorithm. It should be easy to see that for searches with a parameter like "what is the smallest element," we have to go through the entire array. Since we only have to look at each element once, this makes the complexity in the best, average, and worst case scenarios O(n).

If we had a parameter like "does this value exist," then we have to look a little closer. The best case would be that the first value we try is the value we are looking for. The worst case, of course, would be that the value we are looking for does not exist or is the last element. For the average case we can say that, on average, the value we are looking for is in the middle.
Summary:

Case                               Number of Comparisons (for n = 100000)   Comparisons as a function of n
Best case (fewest comparisons)     1 (target is first item)                 1
Worst case (most comparisons)      100000 (target is last item)             n
Average case                       50000 (target is middle item)            n/2

The best case analysis does not tell us much. If the first element checked happens to be the value we are looking for, any algorithm will take only one comparison. The worst and average case analyses give us a better indication of an algorithm's efficiency.
Notice that as the size of the array grows, the number of comparisons required to find a target in both the worst and average cases grows linearly. In general, for an array of size n, the worst case is n comparisons. The algorithm is also called a linear search because its complexity can be expressed as a linear function: the number of comparisons to find a target increases linearly as the size of the list increases.

Although we have not looked at sorting algorithms yet, the other thing to consider is whether the run time would change if the array were sorted. If the parameter is "what is the smallest or largest value," then the answer is yes, as we would know the positions of the largest and smallest elements and would not need to search for them. However, if the parameter is "does this element exist," then the answer is no, as our earlier basis for the complexity would still be valid.

Pseudo Code
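A sketch of ours of what this pseudo code might look like, written as Java, covering both search parameters:

// Sequential search: returns the index of target in a, or -1 if it does not exist.
static int sequentialSearch(int[] a, int target) {
    for (int i = 0; i < a.length; i++) {
        if (a[i] == target) return i;  // found: stop early
    }
    return -1;                         // reached the end: not in the array
}

// Sequential search for the smallest element (assumes a is non-empty).
static int findSmallest(int[] a) {
    int smallest = a[0];               // temporary variable holding the current smallest
    for (int i = 1; i < a.length; i++) {
        if (a[i] < smallest) smallest = a[i];
    }
    return smallest;
}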
Binary Search

Our second search algorithm is still intuitive, though slightly more complex to implement. You might wonder why we need another search algorithm, as a sequential search would technically work in every situation. The answer is efficiency. Since a sequential search's complexity grows linearly with the size of the input, the time it takes to execute grows linearly as well. This is not an issue for small data sets with only a few hundred to a few thousand pieces of data. But what happens when the data set becomes large, like a few million to a few billion pieces of data? Even with modern computers the search could take a noticeably long time to complete. This is where the binary search algorithm comes into play.

A binary search can only be used when the data set is sorted and random access is supported. Therefore, in data structures such as a linked list, a binary search cannot be used. And since one of the requirements for using a binary search is that the data set be sorted, there is no need to use a binary search to find the largest or smallest element: their positions are already known, so no searching is required.

The premise behind a binary search is simple. Since our data set is sorted, comparing the middle value to our parameter will give us one of three situations:
1.) The value we are looking for is in the upper portion of the data set,
2.) The value we are looking for is in the lower portion of the data set, or
3.) The middle value is the value we are looking for.

By always comparing the middle value, the binary search algorithm allows us to vastly reduce the number of comparisons. Let's look at an example (note that the array must be sorted):

[9 20 34 35 47 49 65 68 80 86]

Suppose our search parameter was to see if the value 34 existed in the array. We first find the middle value; if that is the value 34, we are done. If it is not, we "cut" the active area of the array in half, which reduces the potential comparisons by half as well. We keep doing this until we find the value we are looking for or until we cannot cut the array in half anymore. The sequence of events would look like this.

Active section: [9 20 34 35 47 49 65 68 80 86]   (1 + 10)/2 = 5.5 => index 5, value 47; 34 < 47
Active section: [9 20 34 35]                     (1 + 4)/2 = 2.5 => index 2, value 20; 34 > 20
Active section: [34 35]                          (3 + 4)/2 = 3.5 => index 3, value 34; found
Now that we see how it works, let's look at the complexity. We said earlier that a binary search is more efficient than a linear search. If that is so, how much more efficient is it? To answer this, we look at the number of comparisons needed in both the best and worst case scenarios. We will not look at the average case, as it is more difficult to compute, and it ignores the differences between the computations corresponding to each comparison in the different algorithms. The best case, of course, would be that the middle value is what we are looking for, so the best case scenario does not tell us very much about the algorithm's efficiency. That leaves the worst case. The worst case, as with a sequential search, is that the value does not exist or is the last value that we check. So, comparing the worst-case work to the size of the input, we get the following scenario.

[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]   Goal: find the value 16.

The first index we look at is: (1+16)/2 = 8.5 => 8
First comparison. Active section: [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]
8 < 16, next index is: (9+16)/2 = 12.5 => 12
Second comparison. Active section: [9 10 11 12 13 14 15 16]
12 < 16, next index is: (13+16)/2 = 14.5 => 14
Third comparison. Active section: [13 14 15 16]
14 < 16, next index is: (15+16)/2 = 15.5 => 15
Fourth comparison. Active section: [15 16]
15 < 16, next index is: (16+16)/2 = 16
Final comparison. Active section: [16]
16 = 16, found.

So it takes us a maximum of five comparisons to find any element in a data set containing sixteen elements. To express it in mathematical terms: given a data set of size n, it takes us at most X comparisons, where X = floor(log2 n) + 1. So our complexity is O(log n).

Summary:
Case                               Number of Comparisons (for n = 100000)   Comparisons as a function of n
Best case (fewest comparisons)     1 (target is middle item)                1
Worst case (most comparisons)      17 (target not in array)                 floor(log2 n) + 1

Pseudo Code
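Again, a sketch of ours of how the binary search pseudo code might look in Java:

// Binary search: returns the index of target in the sorted array a, or -1.
static int binarySearch(int[] a, int target) {
    int low = 0, high = a.length - 1;
    while (low <= high) {                    // while the active section is non-empty
        int mid = (low + high) / 2;          // middle of the active section
        if (a[mid] == target) return mid;    // situation 3: found it
        if (a[mid] < target) low = mid + 1;  // situation 1: target is in the upper portion
        else high = mid - 1;                 // situation 2: target is in the lower portion
    }
    return -1;                               // cannot cut the section in half anymore
}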
Other sources:
http://research.cs.queensu.ca/home/cisc121/2006s/webnotes/search.html

Chapter 4: Sort Algorithms

Sorting

As a reminder, the sorting problem is defined as follows.
Input: a sequence of numbers <a1, a2, ..., an>
Output: a permutation <a1', a2', ..., an'> of the input sequence such that a1' <= a2' <= ... <= an'

Insertion Sort:

The concept for the algorithm, to sort an array, is:
- Maintain two parts of the array.
- Sorted part: initially empty; the left part.
- Unsorted part: initially the full array; the right part.
- Take one element from the unsorted part and insert it at the correct position in the sorted part.
- Iterate until all elements have moved to the sorted part of the array and the unsorted part is empty.

Start with an unsorted array of size n, indexed [0...n-1], where 0 is the first array index and n-1 is the last index.

[ 4 5 2 0 9 ]
Imagine splitting this array into two different parts: the left part will be sorted, the right part is not. We will show the split in the array as two vertical lines. An array split in the middle would be of the form [0...i || j...n-1]. It is important to note that this is still one array; the split is conceptual only.

We are going to apply this conceptual split to our array. We will put the split after the first index, so we have just one element on the left and n-1 elements on the right. After our imaginary split, with i the last index of the sorted part and j the first index of the unsorted part, our array looks like this:

[ 4 || 5 2 0 9 ]
  • 42. [ 4 5 || 2 0 9] We will repeat that process by moving 2 into the left section of the array. [ 4 5 2 || 0 9 ] We can clearly see that the left part of the array is no longer ordered, so let’s continue moving 2 to the left until the array is ordered again. [ 4 2 5 || 0 9 ] [ 2 4 5 || 0 9 ] We will continue to repeat this process until all elements are in the left, sorted section. [ 2 4 5 0 || 9 ] [ 2 4 0 5 || 9 ] [ 2 0 4 5 || 9 ] [ 0 2 4 5 || 9 ] ← End of the this iteration. [ 0 2 4 5 9 || ] ← End of the final iteration. We can see here that the left section of the array, which is the sorted section, contains all the elements and is still sorted. So now the question is, how do we do this in a program? Pseudo Code
Note this pseudo code receives a reference to an array.

Algorithm Complexity

Let's examine the pseudo code above for complexity. This will be a little more in depth than we will go in the future, but for now we will show what everything is doing along with the relative complexities.

The outer for loop increments from 1 to n-1. Its run time is directly related to the size of the array, n: you should be able to see that as n gets larger, the number of times the for loop iterates grows in proportion to n. Everything else is contained in the for loop, so the number of executions of everything inside the for loop also depends on n. We have a few constant-time assignment operations, and we have another loop. The complexity of this inner loop can be a little tricky to understand. We have a key, which is located at index j, and we want to move this key until it fits in the right spot, which is when A[i] <= key or i < 0. The number of spaces we actually move this key will vary throughout the algorithm, but the thing to remember is that, as n gets larger, we will generally have to move the key more spaces until we find its home, which makes this an O(n) loop. Remember that this loop is inside another loop that is also O(n), so the contents of the inner loop can run on the order of n^2 times in a worst case scenario. This means that this algorithm is O(n^2).
  • 44. This is saying that we have a key which is located at index ‘ j ‘. We want to move this key until it fits in the right spot, which is when A[ i ] <= key or i < 0. The number of spaces we actually move this key element will vary throughout the algorithm, but the thing to remember is that, as n gets larger, we will generally have to move that key more spaces until we find it’s home, which would make this an O(n) loop. Remember still that this loop is inside another loop that is also O(n), so the contents of the while loop can possibly run n^2 times in a worst case scenario. This means that this algorithm is O(n^2). 31 Now that we have look at the complexity it is easy to see how Insertion Sort performs: Worst case performance: О(n^2) ex: A reverse sorted array. Best case performance: О(n) ex: A sorted array
Average case performance: O(n^2).

Other Sources:
http://www.algolist.net/Algorithms/Sorting/Insertion_sort

Selection Sort:

Our second sort will focus on a different, yet still intuitive, way of arranging elements into the correct order. Insertion Sort focused on taking elements from an unordered section and finding their place in a section that is ordered.
Selection Sort does something similar. However, instead of taking whatever element comes next in the unordered section, it finds the largest element and swaps it with the element sitting at the left edge of the ordered section; the divider then slides one position to the left, so the displaced element rejoins the ordered section in its correct place. Since the ordered section is kept sorted, the element at its left edge is always its smallest. Let's take a look at an example.

[4 2 5 1 6 7 0]

This is an unordered array. Let's divide it into an unsorted and a sorted portion, similar to what we did with Insertion Sort; however, the sorted part of this array will be the right side.

[4 2 5 1 6 7 || 0]

You can see that the left part of this array is not sorted, and the right side is trivially sorted, as it only has one element. We will need to know which element is the largest, so we will keep that element's index as a key. In the snapshots below, the element currently being examined is marked with asterisks and the largest value seen so far is noted alongside. The first element in the array is initially marked as the largest, and we update that as we move through the array. The first complete iteration looks like this:

[*4* 2 5 1 6 7 || 0]   largest = 4
[4 *2* 5 1 6 7 || 0]   largest = 4
[4 2 *5* 1 6 7 || 0]   largest = 5
[4 2 5 *1* 6 7 || 0]   largest = 5
[4 2 5 1 *6* 7 || 0]   largest = 6
[4 2 5 1 6 *7* || 0]   largest = 7
[4 2 5 1 6 7 || *0*]   largest = 7; swap 7 with 0
[4 2 5 1 6 0 || 7]

As you should be able to see, we looked at every element in the unordered section once, and we also looked at the first element of the ordered section. If we found an element that was larger than the previous largest, we simply marked it as the new largest and kept looking. Once we arrived at the end, all we had to do was swap the largest with the first element of the ordered section.

For each new iteration, we slide the divider one element to the left and continue.

[*4* 2 5 1 6 || 0 7]   largest = 4
[4 *2* 5 1 6 || 0 7]   largest = 4
[4 2 *5* 1 6 || 0 7]   largest = 5
[4 2 5 *1* 6 || 0 7]   largest = 5
[4 2 5 1 *6* || 0 7]   largest = 6
[4 2 5 1 6 || *0* 7]   largest = 6; swap 6 with 0
[4 2 5 1 0 || 6 7]
[4 2 5 1 || 0 6 7]
[*4* 2 5 1 || 0 6 7]   largest = 4
[4 *2* 5 1 || 0 6 7]   largest = 4
[4 2 *5* 1 || 0 6 7]   largest = 5
[4 2 5 *1* || 0 6 7]   largest = 5
[4 2 5 1 || *0* 6 7]   largest = 5; swap 5 with 0
[4 2 0 1 || 5 6 7]
[4 2 0 || 1 5 6 7]
[*4* 2 0 || 1 5 6 7]   largest = 4
[4 *2* 0 || 1 5 6 7]   largest = 4
[4 2 *0* || 1 5 6 7]   largest = 4
[4 2 0 || *1* 5 6 7]   largest = 4; swap 4 with 1
[1 2 0 || 4 5 6 7]
[1 2 || 0 4 5 6 7]
[*1* 2 || 0 4 5 6 7]   largest = 1
[1 *2* || 0 4 5 6 7]   largest = 2
[1 2 || *0* 4 5 6 7]   largest = 2; swap 2 with 0
[1 0 || 2 4 5 6 7]
[1 || 0 2 4 5 6 7]
[*1* || 0 2 4 5 6 7]   largest = 1
[1 || *0* 2 4 5 6 7]   largest = 1; swap 1 with 0
[0 || 1 2 4 5 6 7]

Now that we have reached the end of this last iteration, we can see that, no matter what the first element in the array is, it will always be smaller than every element in the sorted section, because every element we moved so far was larger than it.

Before we even look at the pseudo code, we can get a good understanding of the complexity of this algorithm. For each extra element in the array, the number of iterations we have to do grows by 1. During each of these iterations, we have to look at every element in the unsorted section.
While this number gradually gets smaller as the algorithm progresses, ultimately, as n gets larger, so does the number of elements we have to look at during each iteration. This already tells us that Selection Sort will be O(n^2).

Pseudo-code
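A sketch of ours of the variant described above, where the sorted section grows from the right, written as Java:

// Selection sort: each pass moves the largest remaining element into place on the right.
static void selectionSort(int[] a) {
    for (int end = a.length - 1; end > 0; end--) {  // end = position to fill this pass
        int largest = 0;                            // index of the largest seen so far
        for (int i = 1; i <= end; i++) {
            if (a[i] > a[largest]) largest = i;
        }
        int tmp = a[end];                           // swap the largest into position end
        a[end] = a[largest];
        a[largest] = tmp;
    }
}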
Bubble Sort:

The third sort we will discuss is the Bubble Sort. Unlike Insertion and Selection Sort, this one is not so intuitive. The name comes from bubbles rising to the surface of water: as the bubble passes through the array, it moves each number closer to the location it needs to be in. To show this we will once again start off with an unordered array; the pair of elements currently inside the bubble is marked with asterisks.

[*9 1* 2 4 5 8 7 6 3]
[1 *9 2* 4 5 8 7 6 3]
[1 2 *9 4* 5 8 7 6 3]
[1 2 4 *9 5* 8 7 6 3]
[1 2 4 5 *9 8* 7 6 3]
[1 2 4 5 8 *9 7* 6 3]
[1 2 4 5 8 7 *9 6* 3]
[1 2 4 5 8 7 6 *9 3*]
[1 2 4 5 8 7 6 3 9]

This is one iteration of the bubble sort. The "bubble" goes from left to right, and each time it puts the two elements inside it in the proper order. Since the largest element, 9, happened to be at the beginning of the array, the 9 was trapped in the bubble until it was put at the very end. We then create a new bubble to iterate through the array, and each pass will grab the next largest element and put it in its place. Let's go through the rest of this sort.

[*1 2* 4 5 8 7 6 3 9]
[1 *2 4* 5 8 7 6 3 9]
[1 2 *4 5* 8 7 6 3 9]
[1 2 4 *5 8* 7 6 3 9]
[1 2 4 5 *8 7* 6 3 9]
[1 2 4 5 7 *8 6* 3 9]
[1 2 4 5 7 6 *8 3* 9]
[1 2 4 5 7 6 3 8 9]  <- End of iteration 2

[*1 2* 4 5 7 6 3 8 9]
[1 *2 4* 5 7 6 3 8 9]
[1 2 *4 5* 7 6 3 8 9]
[1 2 4 *5 7* 6 3 8 9]
[1 2 4 5 *7 6* 3 8 9]
[1 2 4 5 6 *7 3* 8 9]
[1 2 4 5 6 3 7 8 9]  <- End of iteration 3
[*1 2* 4 5 6 3 7 8 9]
[1 *2 4* 5 6 3 7 8 9]
[1 2 *4 5* 6 3 7 8 9]
[1 2 4 *5 6* 3 7 8 9]
[1 2 4 5 *6 3* 7 8 9]
[1 2 4 5 3 6 7 8 9]  <- End of iteration 4

[*1 2* 4 5 3 6 7 8 9]
[1 *2 4* 5 3 6 7 8 9]
[1 2 *4 5* 3 6 7 8 9]
[1 2 4 *5 3* 6 7 8 9]
[1 2 4 3 5 6 7 8 9]  <- End of iteration 5

[*1 2* 4 3 5 6 7 8 9]
[1 *2 4* 3 5 6 7 8 9]
[1 2 *4 3* 5 6 7 8 9]
[1 2 3 4 5 6 7 8 9]  <- End of iteration 6

As you can probably see, in each iteration the bubble has to go one less index into the array. This is fairly easy to implement, because we can just reduce the apparent size of the array by one in each iteration. You can also see that we stopped the algorithm early, because the array is sorted.
In a worst case scenario, we would have to iterate through this n-1 times. However, there is also a technique to determine whether the array is sorted, and all it requires is that we iterate through one more time.

[1 2 3 4 5 6 7 8 9]
[1 2 3 4 5 6 7 8 9]
[1 2 3 4 5 6 7 8 9]

Notice that the bubble never moved anything. If the bubble had moved something, then we would know that the array was not sorted when we began this iteration.

Pseudo code:
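Again, the pseudo-code figure is missing from this extraction, so the following Java sketch is one way to implement the bubble sort with the early-exit check just described; the method name bubbleSort and the int array are our own assumptions.

public static void bubbleSort(int[] a) {
    int end = a.length;            // apparent size of the array
    boolean swapped = true;        // did the bubble move anything this pass?
    while (swapped) {
        swapped = false;
        for (int i = 0; i < end - 1; i++) {
            if (a[i] > a[i + 1]) { // order the two elements inside the bubble
                int temp = a[i];
                a[i] = a[i + 1];
                a[i + 1] = temp;
                swapped = true;
            }
        }
        end--;                     // the bubble travels one less index next pass
    }
}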
Complexity Summary:
Bubble sort contains a for loop that grows in proportion to n. This loop is inside a while loop. The while loop probably won't iterate n times, but the number of iterations of this loop will tend to grow linearly with n.

Worst Case: O(n^2). We see the worst case when sorting an array that is initially reverse sorted.
Best Case: O(n). We see the best case when sorting an array that is already sorted. The for loop will iterate once, see that the bubble didn't move anything, and break out of the while loop.
Average Case: O(n^2). For the majority of sorts done with bubble sort, the computation time will grow by some factor of n^2.

Merge Sort:
The final sorting algorithm we will discuss is Merge Sort. This sort uses a divide-and-conquer approach. This is not the way people tend to sort things by hand, but it is much more efficient at sorting very large arrays. We will use a different type of example for this, drawn from Dr. Tomaz Arodz's CMSC 401 lecture notes. Many of his notes include images from "Introduction to Algorithms, Third Edition" by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.

General merging of two arrays to produce a single, sorted array:
We will start by exploring how to merge two arrays together. First we will see the merging of two unordered arrays, then of two ordered arrays. We want to start with two separate arrays and end up with a single, sorted array. Given two arrays:

[3 5 1 7] [4 2 8 6]

Let's think about this for a minute. We have options if we want to merge these two arrays. Applying what we have learned from the past sorts, we can simply put these two arrays into one and then sort it with Insertion, Selection, or Bubble Sort. So, let's try that with the Insertion Sort method. We won't show every minute step; just remember that during each step we are looking at every element on the sorted side until we find the correct place for each element.

[3 5 1 7 4 2 8 6]
[3 5 || 1 7 4 2 8 6]
[1 3 5 || 7 4 2 8 6]
[1 3 5 7 || 4 2 8 6]
[1 3 4 5 7 || 2 8 6]
[1 2 3 4 5 7 || 8 6]
[1 2 3 4 5 7 8 || 6]
[1 2 3 4 5 6 7 8 ||]

We have seen this before. We know this is an O(n^2) algorithm.
Let's look at a different approach. We will take the same two arrays, only this time we will sort each of them before combining them.

[1 3 5 7] [2 4 6 8]

Now let's set them up so they are easier to visualize. We will go through this part step by step. We will have the two initial arrays on top, and we will create a destination array on the bottom. The destination array will be large enough to fit all of the elements, and initially it will be empty.

[1 3 5 7] [2 4 6 8]
[ ]

We will keep an index for the element we are looking at in each array (these current elements were highlighted in red in the original figures). Now for the fun part: let's start the merge process. During each iteration, we will perform one check. We will find which current element is the smaller, put that element in the destination array, and then look at the next element from that source array. That may be a bit confusing to conceptualize, so let's see it in action.
[1 3 5 7] [2 4 6 8]   (1 < 2)? Yes, let's move 1 down.
[1 ]

[1 3 5 7] [2 4 6 8]   (3 < 2)? No. Let's move the 2 down.
[1 2 ]

[1 3 5 7] [2 4 6 8]   (3 < 4)? Yes. 3 goes down.
[1 2 3 ]

[1 3 5 7] [2 4 6 8]   (5 < 4)? No. Move 4.
[1 2 3 4 ]

[1 3 5 7] [2 4 6 8]   (5 < 6)? Yes. Move 5.
[1 2 3 4 5 ]

[1 3 5 7] [2 4 6 8]   (7 < 6)? No. Move 6.
[1 2 3 4 5 6 ]

[1 3 5 7] [2 4 6 8]   (7 < 8)? Yes. Move 7.
[1 2 3 4 5 6 7 ]
Now we have a small issue. As you can see, the next, and only, possible item to move now is the 8. However, telling a computer how to do this is a bit more complicated. Luckily, we know a few solutions to this issue. The solution we will cover involves expanding each source array to include one extra element, positive infinity (INF). We would need to do this before we started the merging process, but it doesn't affect anything until now. So, imagine we made this addition before, and we will resume where we left off.

[1 3 5 7 INF] [2 4 6 8 INF]   (INF < 8)? Not even close. Move 8.
[1 2 3 4 5 6 7 8]

Now you may be asking yourself: how do we use infinity in a Java program? Well, you can't, but you can come close. Every integer type has a maximum value, which we can get in Java with Integer.MAX_VALUE. The other numerical primitive types have wrapper classes that provide maximum or infinity values as well (for example, Double.POSITIVE_INFINITY).

Going back to the merging, we can see that, with two pre-sorted arrays, we only look at each element once. The number of elements we look at grows proportionally to the number of elements we need to merge. This means the merging of two sorted arrays is O(n). This doesn't do us much good, though, if we have to use an O(n^2) algorithm to get the two initial arrays sorted.
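Before moving on, here is a minimal Java sketch of the sentinel-based merge just described. The method name merge and the use of Arrays.copyOf are our own choices; the sentinel is Integer.MAX_VALUE, as discussed above.

public static int[] merge(int[] left, int[] right) {
    // Copy each source array with one extra slot for the sentinel.
    int[] a = java.util.Arrays.copyOf(left, left.length + 1);
    int[] b = java.util.Arrays.copyOf(right, right.length + 1);
    a[left.length] = Integer.MAX_VALUE;   // "infinity"
    b[right.length] = Integer.MAX_VALUE;  // "infinity"

    int[] dest = new int[left.length + right.length];
    int i = 0, j = 0;
    for (int k = 0; k < dest.length; k++) {
        // One check per iteration: move the smaller current element down.
        if (a[i] <= b[j]) {
            dest[k] = a[i++];
        } else {
            dest[k] = b[j++];
        }
    }
    return dest;
}

One comparison and one copy per destination element: that is the O(n) behavior claimed above.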
Divide and Conquer approach:
As mentioned above, merge sort is a divide-and-conquer algorithm. It divides the problem into smaller problems of the same nature, recursively solves them, and then combines their solutions. We will divide the array into two smaller arrays until the arrays contain just one element each. This division creates a tree where each child node is an array that is half the size of its parent node. Each leaf will be an array containing just one element. This, as we have seen before, means that each leaf is a sorted array.

This should be fairly easy to see so far, at least conceptually. Implementing it in code will be slightly trickier, since computers tend to do things procedurally. We will see how this is done shortly, but for now let's go over the merging procedure. We will start with the leaf nodes. Since they are already sorted, we will simply use the merging procedure we talked about above. We want to merge them into the same parent arrays they had before.
As you can see, Merge Sort first divides the problem into a lot of small, easy-to-tackle problems. The "easy problem" in this case is merging two arrays that are already sorted. Each time we go down a level, we cut the array in focus in half. Each level contains twice as many smaller problems as the last, until the sub-arrays hold just one element each, which takes log2(n) levels. We already discovered that merging is O(n), and all the merges on one level together touch n elements, so doing an O(n) amount of work on each of log2(n) levels gives us an O(nlog2(n)) algorithm. This is significantly better than the O(n^2) sorting algorithms we discussed earlier for very large n.

Pseudo-Code:
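The merge-sort pseudo-code figure is likewise missing, so here is a compact recursive Java sketch building on the merge method sketched earlier. The copyOfRange allocations keep the sketch short; an in-place version with index bounds would avoid the extra copies.

public static int[] mergeSort(int[] a) {
    if (a.length <= 1) {
        return a; // a one-element array is already sorted (a leaf of the tree)
    }
    int mid = a.length / 2;
    // Recursively sort each half, then merge the two sorted halves.
    int[] left  = mergeSort(java.util.Arrays.copyOfRange(a, 0, mid));
    int[] right = mergeSort(java.util.Arrays.copyOfRange(a, mid, a.length));
    return merge(left, right);
}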
Chapter 5: Trees

Trees:
We will now take a look at trees. For our purposes, the only trees we will discuss in detail are B-Trees. However, you will need a firm understanding of general tree structures before you can fully understand the concepts behind a B-Tree. You may have gone over trees in CMSC-256; if so, this section will just be a review.

Trees in computer science are used as a means of storing data using an acyclic connected graph, where each node has zero or more child nodes and at most one parent node. Furthermore, the children of each node have a specific order. Much of the following information was used, with permission, from Dr. Arodz's CMSC 401 notes.

We will discuss the tree operations shortly, but first we need to describe the structures we are using. The first is the Node.
- Each node has a key: x.key
- Each node may have children: x.left, x.right
  o A null child represents no child
- Each node has a parent: x.p
  o The exception to this is the root node. The root node of any given tree, by definition, has no parent.
A binary tree must maintain certain properties with respect to these nodes. For each node x:
- If node y is in the left subtree of x, then y.key <= x.key
- If node y is in the right subtree of x, then y.key >= x.key

The example tree (shown in the original figure) has other properties you are probably already familiar with. With the trees we display, we will only show the node and its key. In a real application, each of these nodes would contain some information, usually accessed with a getter method such as x.data. Other properties of this particular tree include:
- 6 is the root node
- 5 is the left child of 6 (7 is the right)
- 2, 5 and 8 are leaf nodes
- 5 is the root of a subtree of 6

As with all data structures, there is a set of operations we will want to perform on this tree. We will want to, at minimum, insert and remove nodes. Other useful operations include in-order, preorder and postorder traversals. Note: the following methods can be done recursively; we will only show the iterative approach.

Insert:
The original presents the full pseudo code for a binary tree insertion (a Java version is given after this walkthrough). Instead of explaining how an insertion works and then providing the code at the end, we will take the opposite approach: provide the overview first, then decompose the code piece by piece.

This method takes a tree and a new node to be inserted into the tree. We define the current node as the root of the tree. We also need to keep track of the parent. Since the root of any tree has no parent, we initialize this as null.

This is where we find the proper location to put our new node. We always insert onto a leaf, so our while loop will iterate until the current node is null, which is why we want to keep track of the parent. After iterating through this loop, the currentParent variable will hold the node which will be a parent to our new node…
…so let's go ahead and assign the parent reference of our new node to the currentParent variable. Now we have to determine the details of where we are inserting this node. If the tree was empty, Tree.root would have returned null. In that case, currentNode would have been null, the while loop would never have executed, and currentParent would still be null. The first conditional covers this situation and inserts the new node as the root of the tree. If the parent is not null, then we will have to insert to the left or the right of that parent, and the next conditional determines which: if the value of the new node is less than its parent's, we insert it as the left child; otherwise, as the right. That's it! We are done with inserting a single node into a binary tree.
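The pseudo-code figures for this walkthrough do not survive in this text, so here is a Java sketch that mirrors the steps just described. The class and field names (Node, Tree, key, left, right, p) follow the notation used in this chapter; the data payload is omitted.

class Node {
    int key;
    Node left, right, p;
    Node(int key) { this.key = key; }
}

class Tree {
    Node root;

    // Iterative insertion: walk down from the root until we fall off a leaf,
    // remembering the parent as we go.
    void insert(Node newNode) {
        Node currentParent = null;      // the root has no parent
        Node currentNode = root;
        while (currentNode != null) {
            currentParent = currentNode;
            if (newNode.key < currentNode.key) {
                currentNode = currentNode.left;
            } else {
                currentNode = currentNode.right;
            }
        }
        newNode.p = currentParent;
        if (currentParent == null) {
            root = newNode;             // the tree was empty
        } else if (newNode.key < currentParent.key) {
            currentParent.left = newNode;
        } else {
            currentParent.right = newNode;
        }
    }
}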
Remove:
Removing from a binary tree is slightly more complicated. We will try to break it down into small chunks. To do this, we will introduce two helper methods. These also have uses outside of removing a node, which we will not discuss.

The first helper method we will introduce is TreeMinimum. Given a tree, if you follow the left child until you reach a node with no left child, you end up at the smallest value in that tree. This can be used on any node in order to get the minimum value within that node's subtree. For example, in the tree from the original figure, the TreeMinimum of 6 is 2, and the TreeMinimum of 18 is 17. If you were to take the minimum of 7, you would simply get 7.

The code should be fairly self-explanatory. We extract the root of the current tree. While the current node's left child is not null, we move on to that left child. When the left child is null, we know we have reached the smallest value in the tree.
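A Java sketch of TreeMinimum, matching the description above; it would live alongside the Tree class from the insertion sketch, and x may be any node, not just the tree's root.

static Node treeMinimum(Node x) {
    // Follow left children until there are none left.
    while (x.left != null) {
        x = x.left;
    }
    return x;
}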
The second helper method is Transplant. This method does the actual work of removing a node from a tree. Its parameters are a tree, the node to be removed, and the subtree that will take the place of that node. Since this is a bit more complicated than TreeMinimum, we will once again break up the code and explain it piece by piece.

If the node we want to remove has a null parent, then that node was the root of the tree. In this simple case, we just assign the root of the tree to the new subtree. The next part usually looks complicated at first: all we are doing is checking the parent of the removed node to see which of its children is being taken away, and replacing that child with the subtree. Finally, we check whether the subtree is empty. If it is not empty, then we finalize the attachment by setting the parent of the subtree's root node to the removed node's parent.
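And a sketch of Transplant, following the same piece-by-piece logic; u is the node being removed and v is the subtree that takes its place (v may be null, an empty subtree).

static void transplant(Tree t, Node u, Node v) {
    if (u.p == null) {
        t.root = v;              // u was the root of the tree
    } else if (u == u.p.left) {
        u.p.left = v;            // u was its parent's left child
    } else {
        u.p.right = v;           // u was its parent's right child
    }
    if (v != null) {
        v.p = u.p;               // finalize the attachment
    }
}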
We can now go over the actual removal method. We use both Transplant and TreeMinimum to do a proper removal. Care must be taken when removing a node: you cannot just use a transplant in every situation. If we remove a node that has only one child, we can simply transplant the subtree represented by that node's child to the parent. If that node has two children, then a single transplant won't work.

This method takes a tree and a node to be removed from that tree. The first conditional expression takes care of the two easy cases: if either child of the removed node is null, then a single transplant will effectively remove the node. Otherwise, we will require a bit more manipulation. We will start by finding the smallest value in the right subtree of the node to be removed. We want this value because it is the smallest value that is larger than every value in the left subtree of our removed node.
If the parent of that minimum node is the node we want removed, then we skip this next step. Otherwise, we run a Transplant on the minimum node, which takes the minimum node out of the right subtree. Remember, this node is larger than every element of the removed node's left subtree and smaller than or equal to every element in the removed node's right subtree. It only makes sense that we should replace the removed node with this minimum node.
After that transplant, we assign the right child of the node we wish to remove as the right child of this minimum node, and give that child a new parent. Because this minimum node is smaller than every node in the subtree represented by its new right child, the fundamental properties of the binary tree hold. Now we can deal with the left subtree of the node we wish to remove. We start by transplanting the node we wish to remove with our old minimum reference. After the transplant, the node is completely removed from the tree; unfortunately, its left subtree is still attached to it. We can fix this by attaching the left child of the removed node to the minimum node. Because the minimum node used to reside in the removed node's right subtree, every element in that left subtree will be smaller than the minimum node. And now we are finally done removing a node from our tree.
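Putting the helpers together, here is a hedged Java sketch of the removal just walked through; z is the node to be removed, and the steps match the walkthrough above.

static void remove(Tree t, Node z) {
    if (z.left == null) {
        transplant(t, z, z.right);         // easy case: no left child
    } else if (z.right == null) {
        transplant(t, z, z.left);          // easy case: no right child
    } else {
        // Smallest key larger than everything in z's left subtree.
        Node min = treeMinimum(z.right);
        if (min.p != z) {
            transplant(t, min, min.right); // take min out of the right subtree
            min.right = z.right;           // give min the removed node's right child
            min.right.p = min;
        }
        transplant(t, z, min);             // put min where z was
        min.left = z.left;                 // reattach z's left subtree
        min.left.p = min;
    }
}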
Complexity:
The operations done on a binary tree vary with the structure of the tree. As you have probably noticed, if you insert the values [1,2,3,4,5,6,7,8,9] into a binary tree in that order, you essentially get a list. Any operations done on a tree as unbalanced as this will yield O(n) complexity. However, if a tree is properly balanced, then it can yield an average O(log(n)) complexity. This is much better than performing operations on a linear list, and the performance is comparable to sorting a list and then doing a binary search on that list to find and extract some information.

In order to add the unordered elements of an array to a tree, we have to iterate through the list once, O(n), and at each element perform an insertion, O(log(n)). Extracting one value is then O(log(n)). In total, inserting an entire list and extracting one value will be O(nlog(n) + log(n)), or simply O(nlog(n)). Sorting the linear list can be done in O(nlog(n)), and then extracting an element from the sorted list can be done with a binary search, O(log(n)), for a total of O(nlog(n) + log(n)), or simply O(nlog(n)).

So, if each of these approaches is bound by O(nlog(n)), why use a tree over a linear list? Well, it depends.
Given some list of comparable elements, sorting that list will be faster than inserting the entire thing into a binary tree. On a standard desktop, a sample set of 50,000,000 integers took about 9 seconds to sort; that same sample took almost 90 seconds to insert into a binary tree. So the question remains: why use a binary tree? If you insert something into a sorted linear list, the insert will take O(n) time, while an insert into a balanced binary tree can be done in O(log(n)) time. The same applies for removals. So, if you are planning on manipulating the data, then a binary tree is probably what you want. However, if you are not going to change the data, having a sorted list may be more beneficial than a tree. Basically, the binary tree is much faster to maintain after the initial preparation has been finished. You have options; use them wisely.

Other Sources:
http://cslibrary.stanford.edu/110/BinaryTrees.html

B-Tree:
Binary trees tend to be an excellent way of storing data when all of that data can fit in RAM.
However, as most computer scientists know, there are many applications where we have more information than RAM available. This means we will have to store the actual information on a hard drive. Accessing a hard drive is much slower than RAM, as you saw in the architecture portion of this text. Hard drive I/O is slow because there are physical moving parts: the platter spins around the spindle, and the read/write head reads pages off of the current track. We have to wait until the information we want is under the read/write head before we can access it.

We could use a binary tree to store very large amounts of information, with every node being some file on the disk. As we traverse the tree, we would have to do a disk I/O for each node visited. Even if the tree is perfectly balanced, we can still find ourselves doing lots of disk I/Os for very large trees. To prevent this, we want to store more than one thing in each file. To be more specific, since each disk read gets a page from the hard drive, we want to fit as much information into a node as possible without overflowing that page. This will ensure we make the most efficient use of our available resources while drastically reducing the time it takes to find information.
This is where the B-Tree comes in. A B-Tree is essentially a binary tree generalized so that each node may have more than two children. Each node holds some number of separator keys, and the number of children coming out of a node is equal to the number of separator keys + 1.

An easy way to demonstrate the B-Tree is with the alphabet. The example tree (in the original figure) is rooted at M; the root has one separator key and therefore two children. Note that the number of keys in a node can be much larger than 2 or 3. In practical applications we may have thousands of keys in each node.

Each node in a B-Tree contains:
- x.p : a pointer/reference to the parent
- x.n : the number of separator keys in the node
- x.key[1 … x.n] : an array with the values of the separator keys (as opposed to a single key for the node)
- x.c[1 … x.n+1] : an array of pointers to children (as opposed to x.left and x.right)
- x.leaf : a Boolean value representing whether x is a leaf or not

Other properties:
- Every leaf in the B-Tree has the same depth, i.e. the length of the path from the root to each leaf is the same (equal to the height of the tree).
- Each node may have no more than 2t-1 keys.
  o t is a predefined number that regulates the properties of the tree.
- Each node, with the exception of the root, must have at least t-1 keys. In a non-empty tree, the root must have at least 1 key.

B-Trees support the dictionary operations (insert, search, and remove), which we will go over. They also support other operations, such as successor and predecessor, which we will not look into.
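To make the structure concrete, here is a minimal Java sketch of such a node. Note that the text uses 1-based key indexes (x.key[1 … x.n]) in the CLRS style; this sketch, like most Java code, uses 0-based arrays, so keys live in key[0 .. n-1] and children in c[0 .. n]. The constructor parameterized by t is our own convenience.

class BTreeNode {
    int n;            // number of separator keys currently in the node
    int[] key;        // separator keys, key[0 .. n-1]
    BTreeNode[] c;    // children, c[0 .. n]; unused when leaf is true
    boolean leaf;     // true if this node has no children
    BTreeNode p;      // reference to the parent (null for the root)

    BTreeNode(int t, boolean leaf) {
        this.leaf = leaf;
        this.key = new int[2 * t - 1];  // at most 2t-1 keys
        this.c = new BTreeNode[2 * t];  // at most 2t children
    }
}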
BTreeSearch:
Let's start with searching for an element in an existing B-Tree. In this search we will use a method FindBranch(x,k), where x is a node and k is the key we are looking for. It searches through a single node to find where the branch to a child should be. It can be either a linear or a binary search through the node; what is important is that it finds a key index i such that:
- x.key[i] >= k
- x.key[i-1] < k, or i == x.n+1 if x.key[x.n] < k

The original gives the pseudo code for BTreeSearch; as with the binary tree, we will break it down to help you understand exactly what is going on. The method takes two parameters: x, the current node, and k, the key we are searching for. First, FindBranch searches through the node to find the branch location as described above. We then check whether the key we are looking for matches the key we found with FindBranch. If it does, we are done and can return the node and the index of the key. Otherwise, if this is a leaf node, then the key we are looking for does not exist in this tree.
We have covered the case where we found the item we are looking for and the case where the key isn't in the structure, so now we must move on and repeat the process for the next node. We have already found the branch to the next child and we know that child exists, so we need to read that node into memory. This is done with DiskRead(x.c[i]), which gets the i-th child of x; in Java, this could be done by creating an input stream from the child's file. We then recursively call BTreeSearch with the new node and the same key we are looking for.
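Here is a Java sketch of BTreeSearch with a linear FindBranch, using the 0-based node sketch from above (so the conditions shift down by one relative to the 1-based description). The diskRead call is a stand-in for whatever I/O actually loads a child node from its file; in this in-memory sketch it simply follows the reference.

// Scan the node for the first key >= k; returns n if every key is smaller.
static int findBranch(BTreeNode x, int k) {
    int i = 0;
    while (i < x.n && x.key[i] < k) {
        i++;
    }
    return i;
}

// Returns the node containing k, or null if k is not in the tree.
// (A fuller version would also return i, the index of k within the node.)
static BTreeNode bTreeSearch(BTreeNode x, int k) {
    int i = findBranch(x, k);
    if (i < x.n && x.key[i] == k) {
        return x;                       // found k in this node
    }
    if (x.leaf) {
        return null;                    // k does not exist in this tree
    }
    BTreeNode child = diskRead(x, i);   // load the i-th child into memory
    return bTreeSearch(child, k);
}

// Placeholder: in this in-memory sketch the child is already loaded.
static BTreeNode diskRead(BTreeNode x, int i) {
    return x.c[i];
}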
BTreeInsert:
For the insert (and the removal later), we won't show pseudo code. The implementations are quite ugly because they involve modifying the actual file structures. If you find yourself needing to implement a B-Tree, pseudo code is available on the internet, or you may be able to write the code yourself after reading the descriptions of how the operations work.

When we wish to insert into a B-Tree, we always begin at the root. We do a search until we find the appropriate leaf in which to insert our information. We can only insert into a leaf if there is space (it must have fewer than 2t-1 keys). In the example from the original figure, we do have room; we want to insert 'B' into a tree where t=3.
- When t=3, each node has:
  o at least t-1 = 2 keys
  o at most 2t-1 = 5 keys

The appropriate leaf node may not always have room for the insert. We don't want to just add a new level, because B-Trees must maintain the property that every leaf in the tree has the same depth. It would also mean creating a node with just one element in it, which is a waste of a hard drive I/O. Remember: we want to make the most efficient use of our space, which is why we enforce these rules on how many keys each node can hold.

We will now try to insert 'Q' into this B-Tree.
Intuitively, we can see that, since 'Q' comes after 'P', we will want to insert 'Q' into the node with [R S T U V]. However, this node has 5 keys already, so we cannot insert into it as-is. The next place we might look to insert is the root, [G M P X]. This is not an option either: doing so would introduce a child node between 'P' and 'Q' that has no keys, and since the number of keys must be at least t-1, this would violate the properties of a B-Tree.

We do have another option. We can split this node in half and raise the 'T' to the root, which has room for one more key. The pointer between 'P' and 'T' will lead to [R S] as a child, and the pointer between 'T' and 'X' will lead to [U V] as a child. Both of these abide by the t-1 keys property. We can then insert 'Q' into the tree after making the split.

There is one last contingency we will have to deal with. We can see that if we tried to insert 'F' into this tree, we would have no room for it in either the appropriate leaf or the root. The way to deal with this is actually quite simple.
Whenever we want to insert something into a tree, if we run into a node that is full, we split it unconditionally.

Inserting 'L' into this tree forced the split of the root node. This is also why we have an exception for the minimum number of keys in the root node. Using this technique ensures that the tree not only stays balanced, but that there will always be room if we need to split a node. Special mention should be made in this case: because we had to split the root, there was no parent in which to place the median key from the root. As you probably guessed, we just make that median key its own node and make the tree's root pointer point to it. The height of the tree increases by one whenever we have to split the root, and this is the only way the height of the tree is allowed to grow. Finally, when we insert 'F', we have plenty of room for a split. All the properties discussed earlier still hold.

At any given node we have to do a search. Assuming this is a linear search, it will be bound by O(t), where t is the constraint we applied limiting the number of keys in each node.
Each node we visit takes us down one level in the tree, which means we have to do another search. The total time ends up being O(t*h) = O(t*log_t(n)).

B-TreeRemoval:
It is even more complicated to delete a key from a B-Tree than it is to insert one. We start, as usual with any tree structure, from the root, and do a search for the key we want to delete. We will have two main cases: deleting from a leaf, and deleting from an internal node.

Deleting from a Leaf:
If the leaf has at least t keys, we can just remove the key. The node will still have at least t-1 keys, and the structural properties will be maintained. In the example from the original figure, we want to delete 'B'. We can't just remove it, because the leaf would then have fewer than t-1 keys. We will want to increase the size of the node before attempting any removal.
To make things simple, on removals we will check each node before we move to it to see if it has t-1 keys. If it does, we will preemptively increase the size of that node, just in case the key we want to remove is there. There are several different approaches to doing this, depending on the siblings of that node.

Case A: The t-1 node has a neighbor with at least t keys.
In this case, the sibling of the node we want to remove 'B' from has more than t-1 keys. We can move the first element of that sibling, 'E', up to its parent, and then move the 'C' from the root down into the node from which we want to delete 'B'. After moving 'C' and 'E', the node has t elements, and we are able to remove 'B' with no issue.

Case B: Neither the left nor the right sibling has more than t-1 keys.
In this case we will use a technique that merges nodes with a key from the parent.
If we wish to delete 'D', we will first visit [C L]. This node has t-1 keys, so we want to increase that. We can't just take a key from [T X], because it also has only t-1 keys. Instead, we will merge the two nodes using the key 'P' from the parent.

After the merge we are free to delete 'D', as its node now has more than t-1 keys. Also, since we removed the only key from the root, we remove that node, and the new root of this tree is [C L P T X].

Why preemptively increase keys to t? Let's go back to a previous tree. In this example we want to delete 'Z'. We can see that we will have to pass through [T X], which has t-1 keys. We already know we can increase that by using a sibling from [C G M]. We can now, because of our preemptive efforts, merge [U V] and [Y Z] using 'X'.
The process of removing 'Z' is now our simple case of removing it from a leaf node.

Removing from a non-leaf:
This process, again, has several cases. Sometimes, if we are lucky, we will want to delete a key that separates two children with t-1 keys each. If this is the case, we simply merge the children.

Case A: The key to be removed separates two children with t-1 keys.
There is not a lot to explain here. If we first merge [D E] and [J K] using 'G' as a median, then we get a single child [D E G J K] between 'C' and 'L'. 'G' is then in a node with 2t-1 keys, and we have already seen how to delete from that.

Case B: The key to be removed separates children with more than t-1 keys.
If we wish to delete 'M', then we will have to find something to take its place. To do this we find its predecessor: the largest key in the left subtree of 'M'. The predecessor will always be found in a leaf, using an algorithm similar to TreeMinimum for binary trees (but following the rightmost branches instead). The substitute key will have to be deleted from its leaf using the standard deletion techniques we have already discussed (i.e. ensuring each node along the way has at least t keys). After it is "deleted" from the tree, we just replace the key in the internal node with this predecessor. In the example from the original figure, we deleted 'L' from the left subtree of 'M' and then replaced 'M' with 'L', effectively deleting 'M' from the tree.

Overall Complexity:
We only move from top to bottom, returning up only if we need to delete from an internal node. At each node we access at most two of its children. Since these are constant factors, meaning the extra work we do at each step does not increase as n increases, we end up with O(log n) operations.

Other Sources:
http://cis.stvincent.edu/carlsond/swdesign/btree/btree.html
Chapter 6: Hashing

In this chapter we will look at hashing and how it is used to implement the hash table data structure. While they are outside the scope of this course, other uses of hashing in computer science include security and encryption.

First off, let's define what hashing is. Hashing, in general, is the use of a "hash function" that maps a data set of potentially variable length to another data set of standardized length. We will use this table representation of a directory for our examples:

Index | Name            | Phone #      | Address          | Email
0     | John, Smith     | 804-453-3425 | 25 West Main St. | [email protected]
1     | John, Doe       | 804-343-7385 | 54 Marshal Rd.   | [email protected]
2     | Jane, Wilkerson | 804-374-3836 | 978 Woodman Rd.  | [email protected]

At this point you might be wondering why we need another data structure. We have already looked at searching algorithms, sorting algorithms, and trees, and come up with fast implementations for all of them. So why not just use something like a binary tree or a sorted array to represent the directory? As we have seen previously, if we used a sorted array to implement the directory and a binary search to find a record, that would give us a search time of O(log2(n)).
While that is much better than some of the other methods we have seen, for a large directory it could still take a long time to complete. The second issue would be adding new records: whenever you add a new record, you have to shift the entire part of the array that comes after it, for each addition you make. This could take up to O(n). Even with a binary tree, the best we get for insertions and deletions is O(log2(n)).

Imagine if the telephone company stored numbers that way. When you placed a call, the telephone company would have to search potentially several directories from different companies to find the number you are calling. This could still work if only a few people were calling at a time; however, when you have millions of calls at a time, a faster method is required. Another case would be a guidance system for a missile, where last-second changes are needed. If the calculation takes too long, the missile would not have time to change its trajectory.

This is where hashing comes into play. If we look back at the table of our example directory and made a hash function for it, we see that our hash function would need to return the indexes 0, 1, and 2. This leads us to our discussion of creating hash functions. As creating hash functions is not the focus of this course, we will only look at one method.
The method we will look at is called modular hashing. Modular hashing is where we convert our key into an integer, then divide by the table size M and take the remainder as the index. In our table above this would give us the function:

h(x) = x mod 3

If we choose the phone number as our key x, and we pass 8043743836 to the hash function, it returns 8043743836 mod 3 = 1, so that record would be stored at index 1. So, to reiterate: a hash table is an array data structure that maps elements to indexes by feeding the key of each element into the hash function.

Now that you know what hash tables are, there should be one issue that jumps out at you. That, of course, is: what happens when the hash function returns the same value for two different inputs? When this happens, we have what is called a "collision." Avoiding collisions is one of the primary concerns when constructing hash functions. One of the simplest techniques we can use is the size of the table. If the table of our example directory had a size of 10, then all the numbers ending in "00", "10", and so on would each map to the same index. This gives us our standard for the table size: the size of a hash table should always be a prime number.
With a prime table size, each remainder is more likely to be unique, since the table size shares no common factors with patterns in the keys. This will not completely avoid collisions, but it will significantly reduce them. Besides the table size, the only other way we can avoid collisions is by adjusting our key and hash function. As there is no reliable way to do this, we usually change our focus from avoiding collisions to dealing with them.

The most common method for dealing with collisions is to make a table of linked lists. In case you do not remember, a linked list is a data structure composed of a group of nodes where each node holds a piece of data and a pointer to the next node in the list. When you insert an element, you get the index from the hash function and add the element to the head of the linked list at that index. Adding at the head prevents you from having to traverse the list on insertion. To then find an element, you get the index from the hash function and do a sequential search of that list to find the element you are looking for. This gives us the data structure sketched below.
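To tie the pieces together, here is a minimal Java sketch of a hash table that uses modular hashing with a prime table size and resolves collisions by chaining, as described above. The table size 31 and the long phone-number key are illustrative choices of ours, not from the original.

class HashTable {
    static final int M = 31;               // a prime table size

    static class ListNode {
        long key;                          // e.g., a phone number
        String value;                      // e.g., the directory record
        ListNode next;
    }

    ListNode[] table = new ListNode[M];

    int hash(long key) {
        return (int) (key % M);            // modular hashing: h(x) = x mod M
    }

    void insert(long key, String value) {
        int i = hash(key);
        ListNode node = new ListNode();
        node.key = key;
        node.value = value;
        node.next = table[i];              // add at the head of the chain: O(1)
        table[i] = node;
    }

    String search(long key) {
        // Sequential search of the chain at the hashed index.
        for (ListNode n = table[hash(key)]; n != null; n = n.next) {
            if (n.key == key) {
                return n.value;
            }
        }
        return null;                       // not found
    }
}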
  • 88. to increment through the table until you find an empty space and place the element there. There are two issues with this. The first is clustering. Clustering is when elements are clumped together around the same index. When this happens it increases the potential for more collisions to occur. The second issue with this method is finding the element. Since you use the next open space you have to look forward linearly at each element, but you could see an element that was hashed to that index. So how do you know how far forward you should look? Because of these issues, this method is rarely used. Another method would be to create a second hash function for when collisions occur. If this method was used, it should be evident that you are just pushing the problem farther along instead of dealing with it, as you would then have the question of how to deal with collisions from the second hash function. At this point you should start seeing why hashing is so valuable. Our directory example only had three values, so the execution time would be very small no matter what data structure you used. If, on the other hand, you have a directory holding a few million records, the speed of a hash table over something like a binary tree would be very significant. There are several important things that should be noted about hash tables. 1.) Even with a good hash function it is still possible for collisions to occur. So always
So always anticipate and have a mechanism for dealing with collisions.
2.) Typically, the number of possible keys is much larger than the number of keys actually used and stored. This means you need to know and plan for the maximum number of entries.
3.) There are several different techniques for creating hash functions. But no matter what method you use, if the hash function is called twice with the same input, it must always return the same value.
  • 90. the best case, we have only one element in each list which would give us a time complexity of O(1). The worst case, on the other hand, would be that all the keys hashed to the same index. If that happened, we would have a time complexity of O(n). As long as care is taken in designing the hash table and hash function, there should be a relatively small number of collisions. So the average run time for well- constructed hash tables would be O(1). Other sources: http://algs4.cs.princeton.edu/34hash/index.html http://www.comsci.us/fs/notes/ch11.html http://algs4.cs.princeton.edu/34hash/index.html http://www.comsci.us/fs/notes/ch11.html http://www.comsci.us/fs/notes/ch11.html 66 Authors’ Notes The material for this course has been assembled by Steven Andrews and Nathan Stephens. This course is meant to help prepare you for taking CMSC 508 Database Theory. As a firm grasp of this material is essential, it is suggested that if
  • 91. your still have trouble understanding a topic you look at the other sources provided at the end of each section. __MACOSX/BNFO-501/._BNFO501 Course Guide.pdf BNFO-501/Project 1.pdf BNFO-501 Project 1: Input: You will be given, in standard input, two arrays of integers. The first array will be your data. The second array will contain integers that may or may not be in the first. The first line of input will contain two integers separated by a space. The first integer, n, will be the size of the data array. The second integer, m, will be the size of the query array. The next n lines will contain a single integer that corresponds to a value in the data array. Immediately following these will be m more lines containing the elements of the query array. A small sample set of input may look like this:
  • 92. 5 2 4 7 12 89 102 92 89 This will correspond to the two arrays: - Data: [4 7 12 89 102] - Query: [92 89] Output: You are to write both a sequential and binary search that will look for each value in the query array and return true if it exists in the data array. You will also print the time, in milliseconds, that each search takes along with the number that is being searched for. For example, the output of the input above should look like this: false:2ms false:0ms 92 true:2ms true:0ms 89 The search times will vary with the machine you are using. If you were to use this input as a test, you will
  • 93. most likely get 0ms for each search. To truly see the intended result, it is recommended that you generate your own input with at least 1,000,000 data elements. Files used for grading purposes will not exceed 50,000,000 data elements. BNFO-501 Project 1: Help & Tips: A template file has been provided that shows one way of reading to and printing from standard I/O. If you would like to run your own tests you can easily create a program that generates the standard file. We will be using the same format of file, with the exception of ordered elements, for every project so this would be a wise investment. If you want to use a file to test your program then you can add a command line parameter I/O redirect. For example, if you are on a windows machine and your program is named Project.java with the input file of data.txt, you can use the command:
  • 94. java Project < data.txt If you would like to route the standard output to a file instead of having it print to command prompt, you can use: java Project < data.txt > output.txt __MACOSX/BNFO-501/._Project 1.pdf BNFO-501/Project 2.pdf BNFO-501 Project 2: Input: You will be given, in standard input, two arrays of integers. The first array will be your data. The second array will contain integers that may or may not be in the first. The first line of input will contain two integers separated by a space. The first integer, n, will be the size of the data array. The second integer, m, will be the size of the query array. The next n lines will contain a single integer that corresponds to a value in the data array. Immediately following these will
  • 95. be m more lines containing the elements of the query array. A small sample set of input may look like this: 5 2 89 4 12 7 102 92 89 This will correspond to the two arrays: - Data: [4 7 12 89 102] - Query: [92 89] Output: You are to modify the program you wrote from Project 1. The output will be the same with the addition of 1 line of output. You are to print in standard output the time it takes to prepare the data. This preparation time will be the time, in milliseconds, that it takes for you to sort the data. You must write a merge sort to accomplish this task. You are NOT allowed to use Arrays.sort().
  • 96. You will then print the sequential and binary search results as done in project 1: Prep time: 45ms false:2ms false:0ms 92 true:2ms true:0ms 89 The search times will vary with the machine you are using. If you were to use this input as a test, you will most likely get 0ms for each search. To truly see the intended result, it is recommended that you generate your own input with at least 1,000,000 data elements. Files used for grading purposes will not exceed 50,000,000 data elements. BNFO-501 Project 2: Help & Tips: Try also writing one of the simpler O(n 2 ) sorts and compare it with the time it takes a Merge Sort to run for very large input. __MACOSX/BNFO-501/._Project 2.pdf
  • 97. BNFO-501/Project 3.pdf BNFO-501 Project 3: Input: You will be given, in standard input, two arrays of integers. The first array will be your data. The second array will contain integers that may or may not be in the first. The first line of input will contain two integers separated by a space. The first integer, n, will be the size of the data array. The second integer, m, will be the size of the query array. The next n lines will contain a single integer that corresponds to a value in the data array. Immediately following these will be m more lines containing the elements of the query array. A small sample set of input may look like this: 5 2 89 4 12 7 102 92 89
  • 98. This will correspond to the two arrays: - Data: [4 7 12 89 102] - Query: [92 89] Output: You are to once again modify the previous project. You will write a basic binary tree which only needs to insert and search for elements. Your prep time in this example will be the time it takes to add every element to the binary tree. You will only run one query per item in the query array, which will be a tree search to find the element. The output will be similar to before: Prep time: 450ms false:0ms 92 true:0ms 89 The search times will vary with the machine you are using. If you were to use this input as a test, you will most likely get 0ms for each search. To truly see the intended result, it is recommended that you generate your own input with at least 1,000,000 data elements. Files used for grading purposes will not
  • 99. exceed 50,000,000 data elements. BNFO-501 Project 3: Help & Tips: This will be the first project you are required to write something where pseudo-code was not explicitly given in the text. However, you have everything you need to write this search. Just remember, in order to insert something you must first search for an element to insert it under. If you are familiar with the recursive algorithms for insertion or searching you are welcome to use them. Just remember, since the testing can be done with up to 50,000,000 elements, memory may be a concern. You can increase the amount of memory allocated to the JVM for your machine, but the grader may not. __MACOSX/BNFO-501/._Project 3.pdf BNFO-501/Project program template.txt
  • 100. import java.io.BufferedReader; import java.io.IOException; import java.io.InputStreamReader; public class Template { public static void main(String[] args){ FileContents file = null; try{ //Try to read the input. //Reads standard I/O. When graded, console input will be redirected to read from a file. BufferedReader in = new BufferedReader(new InputStreamReader(System.in)); int dataSize = 0; int querySize = 0; {//This block reads the first line and assigns the values to variables. String[] header = in.readLine().split(" "); dataSize = Integer.parseInt(header[0]); querySize = Integer.parseInt(header[1]); } System.out.println("Data Array Contents"); //Read and store the contents of the input array. for(int i = 0; i < dataSize; i++){ System.out.println(in.readLine()); } System.out.println("Query Array Contents"); //Read and store the contents of the query array. for(int i = 0; i < querySize; i++){
                System.out.println(in.readLine());
            }
        } catch (IOException e) {
            System.err.println("Error Reading Input: " + e.getMessage());
            System.exit(0);
        }

        // Example of how to time how long a method takes.
        long start = System.currentTimeMillis();
        timeThis();
        long end = System.currentTimeMillis();
        System.out.println((end - start) + "ms");
    }

    // Some method to time: busy-loops for a random amount of work.
    public static void timeThis() {
        while ((int) (Math.random() * 100) < 95);
    }
}