2. Unit - 1
• Introduction to Algorithms
• Analysing Algorithms
• Arrays: Sparse Matrices
• Representation of Arrays
• Stacks and Queues
• Fundamentals: Evaluation of Expressions, Infix to Postfix Conversion
• Multiple Stacks and Queues
19-06-2023 Data Structures
3. Introduction
• An algorithm is a step-by-step procedure that defines a set of instructions to be executed in a certain order to get the desired output.
• An algorithm can be implemented in more than one programming language (e.g. C, C++, Python, Ruby).
• Algorithms operate on data structures.
• Categories of algorithms:
Search − search for an item in a data structure.
Sort − sort items in a certain order.
Insert − insert an item into a data structure.
Update − update an existing item in a data structure.
Delete − delete an existing item from a data structure.
4. Characteristics
• Unambiguous − An algorithm should be clear and unambiguous. Each of its steps (or phases), and their inputs/outputs, should be clear and must lead to only one meaning.
• Input − An algorithm should have 0 or more well-defined inputs.
• Output − An algorithm should have 1 or more well-defined outputs that match the desired output.
• Finiteness − An algorithm must terminate after a finite number of steps.
• Feasibility − An algorithm should be feasible with the available resources.
• Independent − An algorithm should have step-by-step directions that are independent of any programming language.
5. How to Write an Algorithm?
• An algorithm is written as a step-by-step procedure.
• Algorithm writing is a process that is carried out after the problem domain is well defined.
• Example
6. Advantages of Algorithms:
• An algorithm is easy to understand.
• An algorithm is a step-wise representation of a solution to a given problem.
• Because the problem is broken down into smaller pieces or steps, it is easier for the programmer to convert an algorithm into an actual program.
Disadvantages of Algorithms:
• Writing an algorithm is time-consuming.
• Understanding complex logic through algorithms can be very difficult.
• Branching and looping statements are difficult to show in algorithms.
7. Analysis of Algorithms
• Provides a theoretical estimate of the resources an algorithm requires to solve a specific computational problem.
• Analysis of algorithms is the determination of the amount of time and space resources required to execute an algorithm.
• Efficiency (CPU, memory, disk, network)
• Time complexity
• Space complexity
• Different ways of analysis:
Asymptotic Analysis
Worst, Average and Best Cases
Asymptotic Notations
Analysis of Loops
Solving Recurrences
Amortized Analysis
9. Asymptotic Analysis
• Measures the performance of an algorithm as a function of the input size.
• Describes the relation between the running time and the input size.
• Covers both the time and space factors.
Worst, Average and Best Cases
• Running time is divided into three cases:
Best Case (Ω) − minimum time taken to execute the program.
Average Case (θ) − average time taken to execute the program.
Worst Case (O) − maximum time taken to execute the program.
Asymptotic Notations
• Asymptotic notations are mathematical tools to represent the time complexity of algorithms for asymptotic analysis.
Ο (Big O) Notation
Ω (Omega) Notation
θ (Theta) Notation
10. Analysis of Loops
• Analysis of iterative programs.
O(1): The time complexity of a function (or set of statements) is O(1) if it contains no loop, no recursion, and no call to any other non-constant-time function.
c = a + b;
print c;  // a fixed set of statements runs in constant time
O(n): The time complexity of a loop is O(n) if the loop variable is incremented/decremented by a constant amount.
O(n^c): The time complexity of nested loops equals the number of times the innermost statement is executed.
11. O(log n): The time complexity of a loop is O(log n) if the loop variable is divided/multiplied by a constant amount. A recursive function that reduces its input by a constant factor on each call also takes O(log n) time.
O(log log n): The time complexity of a loop is O(log log n) if the loop variable is reduced/increased exponentially (e.g. squared) on each iteration.
Time Complexity of Loops
O(1) − fixed set of statements
O(n) − loop variable incremented/decremented by a constant amount
O(n^c) − innermost statement of nested loops executed that many times
O(log n) − loop variable divided/multiplied by a constant amount
O(log log n) − loop variable reduced/increased exponentially
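The loop patterns in the table above can be sketched directly in C. This is a minimal illustration (function names are my own, not from the slides) that counts iterations instead of doing real work, so the growth rates can be checked:

```c
#include <assert.h>

/* O(n): loop variable incremented by a constant amount */
int count_linear(int n) {
    int steps = 0;
    for (int i = 0; i < n; i += 1)
        steps++;
    return steps;            /* n iterations */
}

/* O(log n): loop variable multiplied by a constant amount */
int count_log(int n) {
    int steps = 0;
    for (int i = 1; i < n; i *= 2)
        steps++;
    return steps;            /* about log2(n) iterations */
}

/* O(n^2): nested loops -- the innermost statement runs n*n times */
int count_quadratic(int n) {
    int steps = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            steps++;
    return steps;
}
```

For n = 16 these counters return 16, 4 and 256 respectively, matching O(n), O(log n) and O(n^2).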
12. Solving Recurrences
• Used for analysing recursive algorithms.
• There are mainly three ways of solving recurrences:
Substitution Method − make a guess for the solution, then use mathematical induction to prove the guess correct or incorrect.
Recurrence Tree Method − draw a recurrence tree and calculate the time taken by every level of the tree; finally, sum the work done at all levels (e.g. divide-and-conquer algorithms).
Master Method − a direct way to get the solution.
Amortized Analysis
• Used for algorithms where an occasional operation is very slow, but most of the other operations are faster.
14. Basics of Data Structures
• Structuring/organizing data in a computer so that it can be used effectively.
• Data must be atomic, traceable, accurate, clear and concise.
• Data type
• Basic operations:
Traverse
Search
Insert
Delete
Sort
Merge
Create
Retrieve
Store
Built-in data types: Integers, Boolean (true, false), Floating point (decimal numbers), Characters and Strings
Derived data types: List, Array, Stack, Queue
15. Arrays
• A fixed-size sequenced collection of variables belonging to the same data type and stored in contiguous memory.
• A set of (index, value) pairs.
• An array uses adjacent memory locations to store values.
• A convenient structure for representing data.
• Two terms needed to understand the concept of an array are Element and Index:
Element − each item stored in an array is called an element.
Index − each location of an element in an array has a numerical index, which is used to identify the element.
data_type array_name [array_size];
16. • The index starts at 0.
• An array of length 10 can store 10 elements.
• Each element can be accessed via its index (a mapping). For example, the element at index 6 here is 9.
structure ARRAY(value, index)
declare CREATE( ) → array
RETRIEVE(array, index) → value
STORE(array, index, value) → array;
Need for Arrays
• Without arrays, the number of individual variables needed grows with the amount of data.
17. Ordered Lists
• A list in which the elements must always be ordered in a particular way.
• Also called a sorted list.
E.g. (SUNDAY, MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY)
Representation of Arrays
One-dimensional array
A one-dimensional array is also called a single-dimensional array; its elements are accessed in sequential order by a single subscript, e.g. a[n] or an.
Two-dimensional array
When the number of dimensions specified is more than one, the array is called a multi-dimensional array, e.g. a[3][3] (row × column).
18. E.g. a[3][4]
• A two-dimensional array is accessed by using the subscripts of the row and column index, e.g. a[1][1].
19. Three-dimensional array
In a three-dimensional array, there are three dimensions, e.g. a[2][3][4].
#include <stdio.h>
int main()
{
    int one_dim [10];       // declaration of 1D array
    int two_dim [2][2];     // declaration of 2D array
    int three_dim [2][3][4] =
    { { {3, 4, 2, 3}, {0, -3, 9, 11}, {23, 12, 23, 2} },
      { {13, 4, 56, 3}, {5, 9, 3, 5}, {3, 1, 4, 9} } };
    return 0;
}
21. Sparse Matrices
• A matrix is a sparse matrix if most of its elements are 0; conventionally, no more than about a third (roughly 30%) of its elements are non-zero.
• Storing it in full takes a large amount of memory to no purpose.
• To avoid wasting space, a sparse matrix is stored in a table structure of triplets.
Triplet representation of a sparse matrix:
Row Column Value
1 4 12
1 6 -14
2 2 7
2 3 3
3 4 -8
5 1 91
6 3 25
A full 6x6 matrix needs 6x6 = 36 locations; the 7 triplets need 7x3 = 21, so 15 memory locations are saved.
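The triplet table above can be built mechanically by scanning the matrix for non-zero terms. A minimal sketch (the struct and function names are my own, not from the slides), using 0-based C indices rather than the 1-based indices in the table:

```c
#include <assert.h>

#define ROWS 6
#define COLS 6

/* One non-zero term of the matrix, stored as (row, col, value). */
struct triplet { int row, col, value; };

/* Scan a matrix and store only its non-zero terms as triplets;
   returns the number of triplets written. */
int to_triplets(int m[ROWS][COLS], struct triplet out[]) {
    int k = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            if (m[i][j] != 0) {
                out[k].row = i;
                out[k].col = j;
                out[k].value = m[i][j];
                k++;
            }
    return k;
}
```

For the 6x6 example with 7 non-zero terms, the scan produces 7 triplets occupying 21 integers instead of 36.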
22. Linked list representation
• The complexity of inserting or deleting a node in a linked list is lower than in the array representation.
• The four fields of each linked-list node are as follows:
Row − the index of the row where the non-zero element is located.
Column − the index of the column where the non-zero element is located.
Value − the value of the non-zero element located at the index (row, column).
Next node − the address of the next node.
23. • For example, the linked list representation of the above matrix:
25. Benefits of using the sparse matrix representation
• Storage
• Computing time
26. Stacks and Queues
Stacks
• An Abstract Data Type (ADT).
• A stack allows operations (insertion and deletion) at one end only, called the TOP.
• Insertion and deletion of an element are done by two operations:
• PUSH (store)
• POP (access and remove)
• At any given time, only the top element of a stack is accessible.
• The element that is placed (inserted or added) last is accessed first, so a stack is also called LIFO (Last In, First Out).
• A stack is called empty or null when it contains no elements.
• S = (a1, a2, a3, …, an), where a1 is the bottom-most element and an is the top-most element.
27. • The status of a stack can be known through the operations below:
peek() − get the top data element of the stack, without removing it.
isEmpty() − check if the stack is empty.
isFull() − check if the stack is full.
28. Push Operation
• The process of putting a new data element onto a stack is known as a push operation. A push operation involves a series of steps:
I. Step 1 − Check if the stack is full.
II. Step 2 − If the stack is full, produce an error and exit.
III. Step 3 − If the stack is not full, increment top to point to the next empty space.
IV. Step 4 − Add the data element to the stack location where top is pointing.
V. Step 5 − Return success.
29. Pop Operation
• Accessing the content of the top element while removing it from the stack.
• In an array implementation, the data element is not physically removed; instead, top is decremented to a lower position in the stack to point to the next value. In a linked-list implementation, the removed node's memory is deallocated.
• A pop operation may involve the following steps:
I. Step 1 − Check if the stack is empty.
II. Step 2 − If the stack is empty, produce an error and exit.
III. Step 3 − If the stack is not empty, access the data element at which top is pointing.
IV. Step 4 − Decrease the value of top by 1.
V. Step 5 − Return success.
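The push and pop steps above map directly onto an array-based stack. A minimal C sketch (names and the MAXSIZE constant are my own choices, not from the slides):

```c
#include <assert.h>

#define MAXSIZE 100

/* Array-based stack: top is -1 when the stack is empty. */
struct stack { int items[MAXSIZE]; int top; };

int is_empty(struct stack *s) { return s->top == -1; }
int is_full(struct stack *s)  { return s->top == MAXSIZE - 1; }

/* Push: check for overflow, increment top, store the element. */
int push(struct stack *s, int value) {
    if (is_full(s)) return 0;          /* overflow */
    s->items[++s->top] = value;
    return 1;
}

/* Pop: check for underflow, read the top element, decrement top.
   The element is not physically erased; top just moves down. */
int pop(struct stack *s, int *value) {
    if (is_empty(s)) return 0;         /* underflow */
    *value = s->items[s->top--];
    return 1;
}

/* Peek: read the top element without removing it. */
int peek(struct stack *s, int *value) {
    if (is_empty(s)) return 0;
    *value = s->items[s->top];
    return 1;
}
```

Pushing 1, 2, 3 and then popping yields 3, 2, 1 − the LIFO order described above.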
31. Queues
• Similar to stacks.
• A queue has two ends and is open at both of them.
• Insertions are made at one end (the rear, by enqueue) and deletions are made at the other end (the front, by dequeue).
• E.g. Q = {a1, a2, …, an}, with a1 at the front and an at the rear.
• First-In-First-Out (FIFO) methodology: the data item stored first will be accessed first.
32. • Used for scheduling of jobs in computer applications.
• The basic operations associated with queues:
enqueue() − add (store) an item to the queue.
dequeue() − remove (access) an item from the queue.
Enqueue Operation (Insertion/Rear)
• Queues maintain two data pointers, front and rear; therefore, queue operations are comparatively more difficult to implement than those of stacks.
• The following steps should be taken to enqueue (insert) data into a queue:
I. Step 1 − Check if the queue is full.
II. Step 2 − If the queue is full, produce an overflow error and exit.
III. Step 3 − If the queue is not full, increment the rear pointer to point to the next empty space.
IV. Step 4 − Add the data element to the queue location where rear is pointing.
V. Step 5 − Return success.
34. Dequeue Operation (Deletion/Front)
• Accessing data from the queue is a process of two tasks: access the data where front is pointing, and remove the data after access.
• The following steps are taken to perform a dequeue operation:
I. Step 1 − Check if the queue is empty.
II. Step 2 − If the queue is empty, produce an underflow error and exit.
III. Step 3 − If the queue is not empty, access the data where front is pointing.
IV. Step 4 − Increment the front pointer to point to the next available data element.
V. Step 5 − Return success.
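The enqueue and dequeue steps above can be sketched as a circular-array queue in C. This is one possible implementation, not the slides' own; the wrap-around with `%` lets the front and rear pointers reuse freed slots:

```c
#include <assert.h>

#define MAXSIZE 100

/* Circular array queue: count tracks the number of stored items. */
struct queue { int items[MAXSIZE]; int front, rear, count; };

void q_init(struct queue *q) { q->front = 0; q->rear = -1; q->count = 0; }
int q_empty(struct queue *q) { return q->count == 0; }
int q_full(struct queue *q)  { return q->count == MAXSIZE; }

/* Enqueue: advance rear (wrapping around) and store at rear. */
int enqueue(struct queue *q, int value) {
    if (q_full(q)) return 0;               /* overflow */
    q->rear = (q->rear + 1) % MAXSIZE;
    q->items[q->rear] = value;
    q->count++;
    return 1;
}

/* Dequeue: read the item at front, then advance front. */
int dequeue(struct queue *q, int *value) {
    if (q_empty(q)) return 0;              /* underflow */
    *value = q->items[q->front];
    q->front = (q->front + 1) % MAXSIZE;
    q->count--;
    return 1;
}
```

Enqueuing 1, 2, 3 and then dequeuing yields 1, 2, 3 − the FIFO order described above.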
36. • A few more functions are:
peek() − gets the element at the front of the queue without removing it.
isfull() − checks if the queue is full.
isempty() − checks if the queue is empty.
• peek() − this function helps to see the data at the front of the queue.
37. • isfull() − checks whether the rear pointer has reached MAXSIZE to determine that the queue is full.
• isempty() − if the value of front is less than MIN or 0, the queue is not yet initialized, hence empty.
40. Multiple Stacks and Queues
• A single stack is sometimes not sufficient to store a large amount of data.
• Multiple stacks solve this problem: a single array holds more than one stack, with the array divided among them.
• Memory of size m is divided among n stacks, each receiving an equal share of the memory.
• If the size of each stack is known, the m words of memory can instead be divided into the known number of stacks accordingly.
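For the common case of two stacks sharing one array, the usual trick is to grow stack 1 from the left end and stack 2 from the right end, so neither overflows until the whole array is full. A minimal sketch of this idea (the names are my own, not from the slides):

```c
#include <assert.h>

#define M 10   /* total memory shared by the two stacks */

/* Two stacks in one array: stack 1 grows from the left end,
   stack 2 grows from the right end; memory is full when they meet. */
struct twostacks { int a[M]; int top1, top2; };

void ts_init(struct twostacks *s) { s->top1 = -1; s->top2 = M; }

int push1(struct twostacks *s, int v) {
    if (s->top1 + 1 == s->top2) return 0;  /* no free space left */
    s->a[++s->top1] = v;
    return 1;
}

int push2(struct twostacks *s, int v) {
    if (s->top2 - 1 == s->top1) return 0;
    s->a[--s->top2] = v;
    return 1;
}

int pop1(struct twostacks *s, int *v) {
    if (s->top1 == -1) return 0;           /* stack 1 empty */
    *v = s->a[s->top1--];
    return 1;
}

int pop2(struct twostacks *s, int *v) {
    if (s->top2 == M) return 0;            /* stack 2 empty */
    *v = s->a[s->top2++];
    return 1;
}
```

Unlike a fixed m/n split, this layout lets either stack use whatever free space the other has not claimed.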
42. Evaluation of Expressions
Expression − an expression is a collection of operators and operands that represents a specific value.
• An operator is a symbol which performs a particular task, such as an arithmetic, logical or conditional operation.
• Operands are the values on which the operators perform the task. An operand can be a direct value, a variable, or the address of a memory location.
43. • There are three different types of expressions based on the operator position:
Infix Expression − the operator is placed between the operands, e.g. a+b
Postfix Expression − the operator is used after the operands, e.g. ab+
Prefix Expression − the operator is used before the operands, e.g. +ab
• An expression can be converted from one form to another: Infix to Postfix, Infix to Prefix, Prefix to Postfix, and vice versa.
• To convert any infix expression into a postfix or prefix expression:
Find all the operators in the given infix expression.
Find the order in which the operators are evaluated according to their operator precedence.
Convert each operator into the required type of expression (postfix or prefix) in the same order.
44. Steps to convert an Infix Expression to a Postfix Expression
D = A + B * C
Step 1 − The operators in the given infix expression: =, +, *
Step 2 − The order of the operators according to their precedence: *, +, =
Step 3 − Convert the first operator *  ----- D = A + BC*
Step 4 − Convert the next operator +  ----- D = ABC*+
Step 5 − Convert the next operator =  ----- DABC*+=
Operator precedence:
Operator Priority
**, unary -, unary +, ¬ 7
^ (exponentiation) 6
*, / 5
+, - 4
<, >, =, ≠, ≤, ≥ 3
and 2
or 1
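The same conversion can be done mechanically with an operator stack (the classic shunting-yard idea). A sketch for single-letter operands and the operators + - * / ( ) only − a simplification of the full precedence table above, with names of my own choosing:

```c
#include <assert.h>
#include <string.h>
#include <ctype.h>

/* Precedence: * and / bind tighter than + and -; '(' stays on the stack. */
static int prec(char op) {
    switch (op) {
    case '*': case '/': return 2;
    case '+': case '-': return 1;
    default:            return 0;
    }
}

/* Convert an infix expression with single-letter operands and the
   operators + - * / ( ) into postfix, using an operator stack. */
void infix_to_postfix(const char *infix, char *postfix) {
    char stack[100];
    int top = -1, k = 0;
    for (const char *p = infix; *p; p++) {
        char c = *p;
        if (isalpha((unsigned char)c)) {
            postfix[k++] = c;                 /* operands pass straight through */
        } else if (c == '(') {
            stack[++top] = c;
        } else if (c == ')') {
            while (top >= 0 && stack[top] != '(')
                postfix[k++] = stack[top--];  /* pop back to the '(' */
            top--;                            /* discard the '(' itself */
        } else {                              /* operator */
            while (top >= 0 && prec(stack[top]) >= prec(c))
                postfix[k++] = stack[top--];  /* pop higher/equal precedence */
            stack[++top] = c;
        }
    }
    while (top >= 0)
        postfix[k++] = stack[top--];          /* flush remaining operators */
    postfix[k] = '\0';
}
```

On "A+B*C" this produces "ABC*+", matching the worked example; parentheses override precedence, so "(A+B)*C" becomes "AB+C*".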
46. Unit - 2 Linked Lists
• Linked List: Singly Linked List
• Linked Stacks and Queues
• Polynomial Addition
• More on Linked Lists
• Sparse Matrices
• Doubly Linked List and Dynamic Storage Management
• Garbage Collection and Compaction
47. Linked Lists
• A linked list is a linear data structure in which the elements are not stored at contiguous memory locations.
• The elements in a linked list are linked using pointers.
• A linked list consists of nodes, where each node contains a data field and a reference (link) to the next node in the list.
• The address of the first/starting node is identified as HEAD, and the link of the last node is NULL.
• A linked list can grow and shrink in size, as per the requirement.
• It does not waste memory space.
48. • Different types of linked lists:
Singly linked list − item navigation is forward only.
50. Circular linked list − the last item contains the link of the first element as next, and the first element has a link to the last element as previous.
51. • The basic operations of a linked list are:
Insert − adds a node to the list.
Display − displays the complete list.
Search − searches for an element using the given key.
Delete − deletes an element using the given key.
• Insert − adding a new node to a linked list:
NewNode.next −> RightNode;
LeftNode.next −> NewNode
54. Inserting the node GAT:
1. Get a node which is currently unused and address it as X.
2. Set the DATA field of this node to GAT.
3. Set the LINK field of X to point to the node after FAT, which contains HAT.
4. Set the LINK field of the node containing FAT to X.
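The four steps above translate almost line for line into C. A minimal sketch with names of my own choosing (DATA and LINK become struct fields):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A node holds a DATA field and a LINK to the next node. */
struct node {
    char data[4];          /* room for a 3-letter word plus '\0' */
    struct node *link;
};

/* Steps 1 and 2: get an unused node X and set its DATA field. */
struct node *getnode(const char *data) {
    struct node *x = malloc(sizeof *x);
    strcpy(x->data, data);
    x->link = NULL;
    return x;
}

/* Insert a new node carrying `data` immediately after node `after`.
   Step 3: X's LINK points to the node following `after`.
   Step 4: `after` now links to X. */
void insert_after(struct node *after, const char *data) {
    struct node *x = getnode(data);
    x->link = after->link;
    after->link = x;
}
```

Starting from the chain FAT → HAT, calling `insert_after(fat, "GAT")` produces FAT → GAT → HAT, exactly as in the slide's example.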
55. Deletion
• Locate the target node to be removed by using a searching algorithm, then unlink it:
TargetNode.next −> NULL;
56. • Either the node is simply removed from the linked list, or its memory is deallocated and wiped off completely.
• Suppose the node GAT is to be deleted from the list.
57. • Memory is divided into nodes, each having at least one link field.
• A mechanism is needed to determine which nodes are free and which are in use.
• A mechanism is needed to transfer nodes from the reserved pool to the free pool and vice versa.
Storage pool
• Contains all nodes that are not currently being used.
• RET (return a node to the pool) and GETNODE (get a node from the pool) procedures.
• When a node is no longer needed, it is returned to the pool.
• Initially, all of the available nodes are linked together in a single list, AV.
• AV is a singly linked list in which the available nodes are linked.
61. Example 1 − Assume that each node has two fields, DATA and LINK. The following algorithm creates a linked list with two nodes whose DATA fields are set to the values 'MAT' and 'PAT' respectively. T is a pointer to the first node in this list.
62. Example 2 − Let T be a pointer to a linked list; T = 0 if the list has no nodes. Let X be a pointer to some arbitrary node in the list T. The following algorithm inserts a node with DATA field 'OAT' following the node pointed at by X.
63. Example 3 − Let X be a pointer to some node in a linked list T. Let Y be the node preceding X; Y = 0 if X is the first node in T (i.e., if X = T). The following algorithm deletes node X from T.
64. Array vs Linked List
• An array is a collection of elements of a similar data type. A linked list is a collection of objects known as nodes, where each node consists of two parts: data and address.
• Array elements are stored in contiguous memory locations. Linked list elements can be stored anywhere in memory.
• An array works with static memory: the memory size is fixed and cannot be changed at run time. A linked list works with dynamic memory: the memory size can be changed at run time according to our requirements.
• Array elements are independent of each other. Linked list elements are dependent on each other: each node contains the address of the next node, so to access a node we must go through its previous node.
• An array takes more time for operations like insertion and deletion. A linked list takes less time for such operations.
• Accessing any element in an array is faster, as an element can be accessed directly through its index. Accessing an element in a linked list is slower, as traversal starts from the first element of the list.
• In an array, memory is allocated at compile time. In a linked list, memory is allocated at run time.
• Memory utilization is inefficient in an array: if the size of the array is 6 and it holds only 3 elements, the rest of the space is unused. Memory utilization is efficient in a linked list, as memory can be allocated or deallocated at run time according to requirements.
65. Polynomial Addition
• Polynomials are expressions that contain a number of terms with non-zero coefficients and exponents.
• Consider the following general representation of a polynomial.
• In the linked representation of polynomials, each term is a node, and each node contains three fields:
• Coefficient field − holds the value of the coefficient of a term
• Exponent field − contains the exponent value of the term
• Link field − contains the address of the next term in the polynomial
66. • Let us consider two polynomials A and B, having three terms each:
A = 3x^14 + 2x^8 + 1
B = 8x^14 − 3x^10 + 10x^6
• The two polynomials are represented as linked lists below.
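The three-field node and the addition of two such lists can be sketched as follows. This is one possible implementation, not the slides' algorithm verbatim; it assumes both lists keep their terms in decreasing exponent order, and the `attach` helper plays the role of the ATTACH procedure mentioned later:

```c
#include <assert.h>
#include <stdlib.h>

/* Each term is a node with coefficient, exponent, and a link field. */
struct term {
    int coef, exp;
    struct term *link;
};

/* ATTACH-style helper: append a (coef, exp) term after *last
   and return the new last node. */
struct term *attach(int coef, int exp, struct term *last) {
    struct term *t = malloc(sizeof *t);
    t->coef = coef; t->exp = exp; t->link = NULL;
    last->link = t;
    return t;
}

/* Add polynomials a and b (terms in decreasing exponent order),
   returning a new list; a dummy head node simplifies appending. */
struct term *padd(struct term *a, struct term *b) {
    struct term head = { 0, 0, NULL }, *last = &head;
    while (a && b) {
        if (a->exp == b->exp) {              /* equal exponents: add coefs */
            int c = a->coef + b->coef;
            if (c != 0) last = attach(c, a->exp, last);
            a = a->link; b = b->link;
        } else if (a->exp > b->exp) {        /* copy the larger-exponent term */
            last = attach(a->coef, a->exp, last);
            a = a->link;
        } else {
            last = attach(b->coef, b->exp, last);
            b = b->link;
        }
    }
    for (; a; a = a->link) last = attach(a->coef, a->exp, last);
    for (; b; b = b->link) last = attach(b->coef, b->exp, last);
    return head.link;
}
```

For A = 3x^14 + 2x^8 + 1 and B = 8x^14 − 3x^10 + 10x^6, the sum comes out as 11x^14 − 3x^10 + 2x^8 + 10x^6 + 1.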
67. • The following algorithm computes the time and cost for the operations below:
• Coefficient additions
• Coefficient comparisons
• Additions/deletions on available space
• Creating a new node for C
69. • The ATTACH procedure creates a new node with C (coefficient), E (exponent) and d (current last node).
• Whenever a new node is generated with C and E, it is appended to the end of the list C.
72. • Linked lists are well suited for all polynomial operations, such as addition, subtraction and multiplication, by writing procedures for collecting input and displaying output.
• For example
D(x) = A(x) * B(x) + C(x)
can be written as
73. • To compute more polynomial operations, the nodes of T(x) are reclaimed to hold other polynomials for future use.
74. • Repeated calls to the RET procedure are avoided by using the ERASE procedure.
• The time taken to erase T(x) is proportional to the number of nodes in T.
• A more efficient way to erase the nodes is to modify the list structure so that the link field of the last node points back to the first node.
• A circular list erases the nodes in a fixed amount of time, independent of the number of nodes in the list.
76. • Zero and non-zero polynomials are handled as a special case.
• One special head node is added for handling zero polynomials.
• A = 3x^14 + 2x^8 + 1
78. • Inverting a linked list:
• https://www.youtube.com/watch?v=sYcOK51hl-A
• https://www.youtube.com/watch?v=D7y_hoT_YZI
79. CONCATENATE Procedure
• A subroutine that concatenates two chains X and Y; it runs in linear time.
• Concatenation means joining two linked lists − appending one linked list to another to generate a combined linked list.
• The time complexity of the CONCATENATE procedure is O(n).
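A minimal sketch of the procedure (names are my own): walk to the last node of X, then splice Y after it. The walk is what makes the procedure O(n) in the length of X.

```c
#include <assert.h>
#include <stddef.h>

struct node { int data; struct node *link; };

/* CONCATENATE: append chain y to the end of chain x and return
   the head of the combined chain. */
struct node *concatenate(struct node *x, struct node *y) {
    if (x == NULL) return y;       /* empty x: the result is just y */
    struct node *p = x;
    while (p->link != NULL)        /* find the last node of x -- O(n) */
        p = p->link;
    p->link = y;                   /* splice y after it */
    return x;
}
```

With a circular list that keeps a pointer to its last node, the same splice could be done in O(1), which is the motivation for the circular variants discussed earlier.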
85. Sparse Matrix: linked list representation
• Each column of a sparse matrix is represented by a circularly linked list with a head node.
• Each row is also a circularly linked list with a head node.
• Each node in the structure, other than a head node, represents a non-zero term of the matrix A.
• The linked list representation of a sparse matrix uses nodes with 5 fields:
Down − links to the next non-zero element in the same column.
Right − links to the next non-zero element in the same row.
86. • a[i][j] is linked into the circular linked list for row i and the circular linked list for column j.
• So a[i][j] is a member of two lists at the same time.
• Every row and column has a head node, whose value is set to zero.
• For every non-zero term of matrix A, one 5-field node is allocated.
90. Doubly Linked List
• A node in a DLL has 3 fields: DATA, LLINK, RLINK.
• May or may not be circular.
• The DATA field of the head node does not contain information.
91. • Let P point to any node in the doubly linked list.
93. Dynamic Storage Management
• In a multiprogrammed system, several programs reside in memory at the same time.
• Different programs have different memory requirements.
• When the OS receives a request for memory in a dynamic environment, the required size is not known ahead of time.
• After a program finishes executing, its memory is freed, in some order different from that of allocation.
• At system start-up, the whole memory is available for allocation, with no jobs present.
• Then jobs are submitted to the computer and request memory allocation.
94. • For example, start with 100,000 words of memory and 5 programs.
• The unshaded area indicates memory that is not currently in use.
• Assume P2 and P4 complete execution, freeing the memory used by them.
Memory Programs
10,000 P1
15,000 P2
6,000 P3
8,000 P4
20,000 P5
41,000 (free)
95. • The OS has to maintain a list of all blocks of storage currently not in use, and allocate storage from this unused pool as required.
• A chain structure is adopted to maintain the available-space list.
• All the free blocks are linked together, retaining the memory size of each block.
• Each node on the free list has 2 fields in its first word: SIZE and LINK.
96. • When memory for storing N words is requested, finding the necessary free block in the list of free blocks is done by an allocation strategy.
• There are two allocation strategies:
First fit
Best fit
• First fit − allocate N words out of the first block whose size ≥ N.
• Best fit − allocate from the block whose size is ≥ N and as close to N as possible.
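The two strategies differ only in how they search the free list. A simplified sketch over an array of free-block sizes (the slides use a linked free list; the array form here is my own simplification to show just the search logic):

```c
#include <assert.h>

/* Search an array of free-block sizes for a request of n words.
   Both sketches return the chosen block's index, or -1 if none fits. */

/* First fit: take the first block whose size >= n. */
int first_fit(const int size[], int blocks, int n) {
    for (int i = 0; i < blocks; i++)
        if (size[i] >= n) return i;
    return -1;
}

/* Best fit: take the block whose size is >= n and closest to n. */
int best_fit(const int size[], int blocks, int n) {
    int best = -1;
    for (int i = 0; i < blocks; i++)
        if (size[i] >= n && (best == -1 || size[i] < size[best]))
            best = i;
    return best;
}
```

For free blocks of 15,000, 8,000 and 41,000 words and a request of 7,000, first fit picks the 15,000-word block (the first that fits) while best fit picks the 8,000-word block (the tightest fit).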
98. • Allocation of a portion of memory in a free block is made from the bottom of the block, to avoid changing links in the available list.
• The blocks in the available list are maintained as a circular linked list with a head node of size 0.
• Allocation and freeing of nodes are made here.
• When freeing a node (returning it to AV), recognize whether its neighbours are also free so that they can be coalesced into a single block.
99. If P3 is the next program to terminate, rather than adding it to the free list it is better to combine the adjacent free blocks corresponding to P2 and P4.
100. • Unless free blocks are combined together, the available block sizes get smaller and smaller.
• To determine free adjacent memory blocks without searching the available list, a node structure is adopted for allocated and free nodes.
101. • Assume a memory of size 5000, from which the following allocations are made:
Request Size
R1 300
R2 600
R3 900
R4 700
R5 1500
R6 1000
Memory configuration:
The different blocks of storage and the available-space list:
102. • When a portion of a free block is allocated, the allocation is made from the bottom of the block.
• When R1 is freed:
103. • When R4 is freed:
• When R3 is freed:
105. Garbage Collection and Compaction
• Garbage collection is the process of collecting all unused nodes and returning them to available space.
• It is carried out in two phases:
• First phase (marking phase) − all nodes in use are marked.
• Second phase − all unmarked nodes are returned to the available-space list. This phase is trivial when all nodes are of fixed size: every node is examined to check whether it is marked or unmarked, taking O(n) steps. Making the free nodes form a contiguous block of memory is called memory compaction.
• Each node contains a mark bit, which can be changed at any time by the marking algorithm.
• The marking algorithm marks all directly and indirectly accessible nodes.
• Initially, all the mark bits are set to zero.
106. • Each node has MARK and TAG fields.
• A node whose TAG bit is 1 contains DLINK and RLINK fields; such nodes are called list nodes.
• A node whose TAG bit is 0 contains atomic information; such nodes are called atomic nodes.
• The marking algorithm is used to mark the nodes.
• Initially all the nodes are unmarked: MARK(i) = 0 for all nodes i.
• A driver for the marking algorithm is called to mark the nodes accessible from the pointer variables.
109. Storage Compaction
• When storage requests may be for blocks of varying sizes, compact storage so that the free storage forms one contiguous block.
• Nodes in use have MARK bit = 1; free nodes have MARK bit = 0.
• Nodes are labelled 1 to 8.
• Free nodes can be linked together to obtain the available space, moving the nodes currently in use to one end while the free nodes are moved to the other.
110. • Relocating the storage of nodes forms two contiguous blocks: one for the nodes in use and one free.
• Storage compaction should update the links to point to the relocated address of the respective node.
111. • Storage compaction involves three tasks:
• Determine new addresses for nodes in use.
• Update all links in nodes in use.
• Relocate nodes to new addresses.
112. • Each node has SIZE, NEW_ADDR, LINK1 and LINK2 fields.
114. Trees
• Basic Terminology
• Binary Trees
• Binary Tree Representations
• Binary Tree Traversal
• More on Binary Trees
• Threaded Binary Trees
• Representation of Binary Trees
• Counting Binary Trees
115. Trees
• A tree is a non-linear data structure: the data is organized so that items of information are related by branches.
• It is easier and quicker to access.
• Data is organised in the form of a tree with a root node, branches and leaf nodes.
• A familiar example is a genealogy. There are two different types of genealogical charts:
• Pedigree chart (tree of organisms or genes)
• Lineal chart (tree of languages)
117. Recursive definition of a tree − A tree consists of a root and zero or more subtrees T1, T2, …, Tk such that there is an edge from the root of the tree to the root of each subtree.
118. • A node stands for an item of information plus the branches to other items.
• The number of subtrees of a node is called its degree.
• Nodes that have degree zero are called leaf or terminal nodes.
• The other nodes, which have nonzero degree, are called non-terminal nodes.
• Tree nodes can also be referred to as parent and child nodes.
• Children of the same parent are called siblings.
• The degree of a tree is the maximum degree of the nodes in the tree.
• The ancestors of a node are all the nodes along the path from the root to that node.
• The level of a node is defined by letting the root be at level one.
• The height or depth of the tree is the maximum level of any node in the tree.
119. • A forest is a set of n ≥ 0 disjoint trees.
• Removing the root of a tree leaves a forest of its subtrees.
• We get 3 trees if node A is removed.
120. • Another useful way to write a tree is as a list.
• The example tree can be written in list form as shown.
• The node structure of a tree represented in the form of a linked list:
123. Tree vs Binary Tree
• A general tree is a tree in which each node can have many children. In a binary tree, each node can have at most two children.
• The subtrees of a general tree do not hold the ordered property. The subtrees of a binary tree hold the ordered property.
• A general tree cannot be empty, while a binary tree can be empty.
• In a general tree there is no limit on the degree of a node. In a binary tree the degree is limited, because a node cannot have more than two children.
• In a general tree there are either zero or many subtrees. In a binary tree there are mainly two subtrees: the left subtree and the right subtree.
125. Skewed Binary Tree Complete Binary Tree
• Degree, level, height, leaf, parent, and child also apply to binary trees.
• https://www.javatpoint.com/discrete-mathematics-binary-trees
• https://www.geeksforgeeks.org/introduction-to-binary-tree-data-
structure-and-algorithm-tutorials/
126. Binary Tree Representation
• A full binary tree of depth k has 2^k − 1 nodes.
• In the sequential representation, the nodes are numbered sequentially, starting from node 1 at level 1.
• Nodes on any level are numbered from left to right.
• A binary tree with n nodes and depth k is complete if its nodes correspond to the nodes numbered 1 to n in the full binary tree of depth k.
128. • For a complete binary tree, the sequential (array) representation wastes no space.
• However, insertion or deletion of a node in the middle of the tree requires the movement of many nodes to reflect their changed level numbers.
• This drawback is overcome easily by using a linked representation.
130. • In the linked representation it is difficult to determine the parent of a node.
• So a fourth field, PARENT, may be included to identify the parent node.
Binary Tree Traversal
• Many operations can be performed on trees.
• Traversing a tree means visiting each node exactly once.
• A full traversal of a tree produces a linear order for the information in the tree.
• During traversal, every node is treated in the same manner.
131. • Six possible combinations of traversal are
LDR
LRD
DLR
DRL
RDL
RLD
• Traversing the left subtree before the right subtree gives 3 traversals:
LDR
LRD
DLR
• These traversals are called
Inorder
Postorder
Preorder
https://www.youtube.com/watch?v=WLvU5EQVZqY
132. Inorder – move down the tree towards the left until there is no left child, visit that node, then move to the node on the right and continue.
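The inorder rule above can be sketched in Python (a minimal sketch; the Node class and visit callback are illustrative, not the slide's notation):

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right

def inorder(node, visit):
    # LDR: traverse the left subtree, visit the node, then the right subtree
    if node is not None:
        inorder(node.left, visit)
        visit(node.data)
        inorder(node.right, visit)
```

Calling `inorder(root, out.append)` on a tree with root A, left child B and right child C visits B, A, C.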
135. COPY of a binary tree
• Producing an exact copy or clone or duplicate of a given binary tree
• Modification of post order traversal gives the copy of the binary tree
136. EQUAL of a binary tree(identical/same)
• Binary trees are equivalent if they have the same topology and the
information in corresponding nodes is identical
• By the same topology every branch in one tree corresponds to a branch in
the second in the same order
• EQUAL traverses the binary trees in preorder
137. Algorithm to check whether two binary trees are identical
• Compare the current nodes of tree1 and tree2:
• If both nodes are null, this branch of the traversal completed successfully; return true.
• If exactly one of the nodes is null, the trees are not identical; return false.
• Compare the data of the nodes of tree1 and tree2:
• If the data is the same for both nodes,
• go through the left subtree and the right subtree:
• traverse the left child of tree1 and the left child of tree2,
• traverse the right child of tree1 and the right child of tree2.
• If the data is not the same,
• the trees are not identical; return false.
• After the above traversal, we know whether the binary trees are identical (equal/same).
• Time Complexity:
• Let tree1 contain p nodes and tree2 contain q nodes.
• Time complexity: O(min(p, q)), since the traversal stops at the first mismatch or null.
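The steps above can be sketched as a short recursive Python function (a minimal sketch; the Node class and function name are illustrative):

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right

def identical(t1, t2):
    # Both subtrees empty: this branch matched successfully
    if t1 is None and t2 is None:
        return True
    # Exactly one empty: topologies differ
    if t1 is None or t2 is None:
        return False
    # Compare data, then recurse on left and right subtrees (preorder)
    return (t1.data == t2.data
            and identical(t1.left, t2.left)
            and identical(t1.right, t2.right))
```

The recursion visits corresponding nodes in both trees in preorder and stops at the first mismatch.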
138. Example 1: Identical or Same binary trees
• Structure of both the trees is same
• Data nodes of corresponding binary trees are same.
139. Example 2: Non-Identical binary trees
• Structure of both the trees is same
• Data nodes of corresponding binary trees are NOT same.
• Node C and Node R have different values.
• Node D and Node S have different values.
141. Propositional logic and binary trees
• A propositional formula contains variables x1, x2, x3, …
• and operators (∧, ∨, ¬).
• The variables combined with these operators form expressions, which have only 2 possible values, TRUE or FALSE.
• The system of such expressions and operators is called the propositional calculus.
• For example,
• can be read as
142. • If x1 and x3 are false and x2 is true, then the value of the above expression is
• For example,
143. Threaded Binary Tree (TBT)
• The linked-list representation of a binary tree contains more null links than actual pointers:
• n + 1 null links out of 2n total links.
• A TBT is a technique, devised by A. J. Perlis and C. Thornton, that makes clever use of these null links.
• Their idea was to replace the null links by pointers, called threads, to other nodes.
• Rules to be followed for a threaded binary tree:
• The left thread of the leftmost node and the right thread of the rightmost node remain NULL.
• Change all other null pointers so that:
• a null left pointer points to the node's inorder predecessor
• a null right pointer points to the node's inorder successor
144. H, D, I, B, E, A, F, C, G – inorder traversal
• The tree has 9 nodes and 10 NULL links.
• These NULL links are replaced by threads:
left pointer → inorder predecessor
right pointer → inorder successor
146. • In the memory representation, normal pointers and threads must be differentiated.
• This is done by marking whether an address refers to a child (1) or to a parent/ancestor (0),
• using two extra one-bit fields, LBIT and RBIT.
• The node structure of a threaded binary tree with LBIT and RBIT is:
Left pointer | LBIT | Data | RBIT | Right pointer
• If the left pointer points to a child node, LBIT is 1; it is 0 if the pointer is a thread to an ancestor.
• If the right pointer points to a child node, RBIT is 1; it is 0 if it is a thread to an ancestor.
• https://www.youtube.com/watch?v=ffgg_zmbaxw
148. • A dummy (head) node is introduced: the NULL left pointer of node H points to the dummy node, and the right pointer of the dummy node points to the dummy node itself,
• to maintain the consistency of the TBT.
149. • The computing time is O(n) for n nodes.
• The same idea can be applied to preorder and postorder traversal.
• Insertion is possible in a threaded binary tree.
• Procedure to grow a threaded tree:
• If a node has an empty subtree, it is easy to insert a new node there; otherwise the existing right subtree is made the right subtree of the newly inserted node.
151. Binary tree representation of trees
• Every tree can be represented as binary tree
• Array representation
• Linkedlist representation
• Relationship representation
154. Relationship representation –
• The relationship between the nodes is characterized by two quantities:
• the leftmost-child and next-right-sibling relationships.
• Every node has at most one leftmost child and at most one next right sibling.
• The leftmost child of B is E, and the next right sibling of B is C.
155. • Connect together all siblings of a node.
• Delete all links from a node to its children, except the link to its leftmost child.
157. A tree can be represented formally as
Preorder, inorder, and postorder traversals of the binary tree can also be applied here.
Preorder –
158. Inorder traversal of T
Post-order traversal of T
https://prod-edxapp.edx-cdn.org/assets/courseware/v1/0f0865e1fe974ec8b2244cdcd7f5d68a/c4x/PekingX/04830050x/asset/chapter6_001_en.pdf
159. Counting Binary Trees-
• Determining distinct binary trees with n nodes
• When n=0 and n=1 there is only one binary tree
• When n=2 ,two distinct binary trees
• When n=3, five distinct binary trees
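The counts above (1, 1, 2, 5 for n = 0, 1, 2, 3) are the Catalan numbers. A minimal Python sketch, assuming the standard closed form C(n) = C(2n, n) / (n + 1):

```python
from math import comb

def count_binary_trees(n):
    # Number of structurally distinct binary trees with n nodes:
    # the nth Catalan number, C(2n, n) // (n + 1)
    return comb(2 * n, n) // (n + 1)
```

For example, `count_binary_trees(3)` gives the five distinct binary trees mentioned on the slide.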
164. GRAPHS
• A graph consists of two sets V and E
• Vertices (V)-units of a graph
• Edges(E)-connection of units
• A Graph is a non-linear data structure consisting of
vertices and edges.
• V is a finite non empty set of vertices or units of the graph
• E is a set of pairs of vertices called edges
• V(G) and E(G) represents the vertices and Edges
of Graph G
• A graph is represented as G=(V,E)
165. • A graph is of two types:
Directed graph
Undirected graph
166. • Multigraph-A graph is said to be a multigraph if the graph doesn't consist of
any self-loops, but parallel edges are present in the graph. If there is more
than one edge present between two vertices, then that pair of vertices is
said to be having parallel edges.
167. • Complete Graph-
A graph is said to be a complete graph if, for all the vertices of the graph, there exists an
edge between every pair of the vertices.
168. • Adjacent – Two nodes, or vertices, are adjacent if they are connected to each other through an edge. The vertices adjacent to vertex 2 are 4, 5, and 1.
169. • Subgraph-A graph in data structure is said to be a subgraph if it is a part of
another graph.
171. • Length- the length of a path is the number of edges on it .
• Simple Path-A path that does not repeat vertices is called a simple path.
• Cycle-is a simple path in which
the first and last vertices are the same.
172. • In-degree – the in-degree of a vertex is the number of edges coming into the vertex.
• Out-degree – the out-degree of a vertex is the number of edges going out from the vertex.
173. Graph Representation-
• There are three representations of graphs
Adjacency Matrix
Adjacency List
Adjacency multilists
Adjacency Matrix –
• Let G = (V, E) be a graph with n vertices, n ≥ 1.
• The adjacency matrix of G is a two-dimensional n × n array, say A, with the property that A(i, j) = 1 if the edge (vi, vj) is in E(G),
• and A(i, j) = 0 if there is no such edge in G.
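The definition above can be sketched in Python (a minimal sketch; vertices are assumed to be numbered 0 to n−1, and the function name is illustrative):

```python
def adjacency_matrix(n, edges, directed=False):
    # A[i][j] = 1 if edge (i, j) is in E(G), else 0
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
        if not directed:
            a[j][i] = 1   # undirected edges appear in both directions
    return a
```

For an undirected graph the matrix is symmetric, as the slide's examples show.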
174. • The adjacency matrices for graphs G1, G3 and G4 are given below.
176. Adjacency Lists –
• The n rows of the adjacency matrix are represented as n linked lists.
• There is one list for each vertex in G.
• Each node has at least 2 fields:
• VERTEX – contains the index of a vertex adjacent to vertex i
• LINK – points to the next node in the list
• Each list has a head node.
• The head nodes are stored sequentially, providing easy random access to the list of any vertex.
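The same structure can be sketched with Python lists standing in for the linked lists (a minimal sketch; the sequential `adj` array plays the role of the head nodes):

```python
def adjacency_list(n, edges, directed=False):
    # one list per vertex; each entry is the index of an adjacent vertex
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        if not directed:
            adj[j].append(i)   # each undirected edge appears on two lists
    return adj
```

Indexing `adj[i]` gives the list for vertex i, mirroring the random access the head nodes provide.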
179. • The adjacency-list representation requires n head nodes and 2e list nodes.
• In terms of the number of bits of storage needed, this count should be multiplied by log n for the head nodes and log n + log e for the list nodes,
• since it takes O(log m) bits to represent a value m.
• The sparse-matrix representation of a graph has 4 fields.
181. Adjacency multilists –
• are an edge-based, rather than vertex-based, graph representation.
• The multilist representation of a graph consists of two parts:
a directory of node information, and
a set of linked lists of edge information.
• For each edge there is exactly one node, but this node appears on two lists.
m – a one-bit mark field indicating whether the edge has been examined
V1 – one endpoint of the edge (v1, v2)
V2 – the other endpoint of the edge (v1, v2)
LIST1 – pointer to the next edge node on the list for v1
LIST2 – pointer to the next edge node on the list for v2
183. Traversals, Connected Components, and Spanning Trees
• Given an undirected graph G = (V, E) and a vertex v in V(G),
• visit all the vertices in G that are reachable from v.
• Two ways to visit:
• Depth First Search (DFS)
• Breadth First Search (BFS)
184. Depth First Search (DFS) Traversal / Algorithm –
• The start vertex v is visited.
• Next, an unvisited vertex w adjacent to v is selected,
• and a depth-first search from w is initiated.
• When a vertex u is reached such that all its adjacent vertices have been visited, the search backs up to the last visited vertex that still has an unvisited adjacent vertex.
• The search terminates when no unvisited vertex can be reached from any of the visited vertices.
• The DFS algorithm is a recursive algorithm that uses the idea of backtracking.
• https://www.youtube.com/watch?v=iaBEKo5sM7w
185. • This recursive nature of DFS can be implemented using stacks.
• The basic idea is as follows:
Pick a starting node and push all its adjacent nodes into a stack.
Pop a node from stack to select the next node to visit and push all its adjacent nodes
into a stack.
Repeat this process until the stack is empty.
However, ensure that the nodes that are visited are marked.
This will prevent you from visiting the same node more than once.
If you do not mark the nodes that are visited and you visit the same node more than
once, you may end up in an infinite loop.
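The stack-based idea above can be sketched in Python (a minimal sketch; `adj` is an adjacency list indexed by vertex, and `reversed` keeps the visiting order close to the recursive version):

```python
def dfs(adj, start):
    visited, order = set(), []
    stack = [start]
    while stack:
        v = stack.pop()
        if v in visited:
            continue            # already marked: skip to avoid revisits
        visited.add(v)          # mark the node when it is visited
        order.append(v)
        # push unvisited neighbours; the last pushed is explored first
        for w in reversed(adj[v]):
            if w not in visited:
                stack.append(w)
    return order
```

Without the `visited` check, a cycle in the graph would make the loop run forever, as the slide warns.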
193. Breadth First Search (BFS) Traversal / Algorithm –
• Start at vertex v (the root node) and mark it as visited.
• Traverse the graph layer-wise, visiting the neighbour nodes (those directly connected to the root node).
• Then move breadth-wise to the next level of neighbour nodes.
• In BFS, all nodes in layer 1 must be visited before moving to layer 2.
https://www.youtube.com/watch?v=QRq6p9s8NVg
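The layer-by-layer traversal above can be sketched with a queue (a minimal sketch; `adj` is an adjacency list indexed by vertex):

```python
from collections import deque

def bfs(adj, start):
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        v = queue.popleft()       # nodes leave the queue level by level
        order.append(v)
        for w in adj[v]:
            if w not in visited:  # mark on enqueue so no node enters twice
                visited.add(w)
                queue.append(w)
    return order
```

The FIFO queue is what guarantees that every node in one layer is visited before the next layer begins.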
199. Connected components
• Connectivity in an undirected graph means that every vertex can reach every
other vertex via any path.
• Strong Connectivity applies only to directed graphs. A directed graph is
strongly connected if there is a directed path from any vertex to every other
vertex.
• If the graph is not connected the graph can be broken down into Connected
Components.
• Strong connectivity is the same idea as connectivity in an undirected graph, the only difference being that it applies to directed graphs and requires directed paths instead of just paths. Similar to connected components, a directed graph can be broken down into strongly connected components.
• To determine all the connected components of a graph,
• make DFS(v) or BFS(v) calls repeatedly, once for each still-unvisited vertex v.
203. Spanning Trees and Minimum Cost Spanning Trees
• A spanning tree is a subgraph that contains all the vertices of the graph with the minimum number of edges.
• If any vertex is missed, it is not a spanning tree.
• A spanning tree contains n − 1 edges, where n is the number of vertices.
• The edges may or may not have weights assigned to them.
• All possible spanning trees have the same number of vertices, and the number of edges is always n − 1.
n = 4
e = n − 1 = 4 − 1 = 3
204. • A cycle must not be formed while constructing a spanning tree.
• When BFS is used, the resulting tree is called a BFS spanning tree; when DFS is used, it is called a DFS spanning tree.
205.
Application of Spanning Trees
• A spanning tree is basically used to find a minimum set of edges connecting all nodes in a graph. Common applications of spanning trees are −
Civil Network Planning
Computer Network Routing Protocol
Cluster Analysis
206. Minimum Spanning Tree –
• The cost of a spanning tree is the sum of the costs of the edges in that tree.
• One approach to finding the minimum cost spanning tree is Kruskal's algorithm.
• In this approach, the minimum cost spanning tree T is built edge by edge.
• Edges are considered for inclusion in T in nondecreasing order of their costs.
• Loops (self-edges) and parallel edges are removed.
• An edge is included in T if it does not form a cycle with the edges already in T.
• Since G is connected and has n > 0 vertices, exactly n − 1 edges will be selected for inclusion in T.
• The time complexity of the minimum cost spanning tree algorithm is O(e log e), where e is the number of edges in E.
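Kruskal's edge-by-edge construction above can be sketched in Python, using a union-find structure to detect cycles (a minimal sketch; edge tuples `(cost, u, v)` and the helper names are illustrative):

```python
def kruskal(n, edges):
    # edges: list of (cost, u, v); considered in nondecreasing cost order
    parent = list(range(n))

    def find(x):                      # union-find root, with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for cost, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:                  # different components: no cycle formed
            parent[ru] = rv           # include the edge in T and merge
            tree.append((u, v, cost))
        if len(tree) == n - 1:        # exactly n-1 edges are selected
            break
    return tree
```

Sorting the edges dominates, giving the O(e log e) bound quoted on the slide.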
213. Shortest Path
• The length of a path is defined to be the sum of the weights of the edges on that path, rather than the number of edges.
• The starting vertex of the path is referred to as the source, and the last vertex is called the destination.
• The graphs are digraphs, and the weights assigned are positive.
Single Source, All Destinations
• Given a directed graph G = (V, E), a weighting function w(e) for the edges of G, and a source vertex v0,
• find the shortest paths from v0 to all the remaining vertices of G.
215. • The shortest-path algorithm, first given by Dijkstra, determines the shortest paths from v0 to all other vertices in G.
• Vertices are numbered from 1 through n.
• The set S is maintained as a bit array, with S(i) = 0 if vertex i is not in S and S(i) = 1 if it is.
• The graph is represented by its cost adjacency matrix, with COST(i, j) being the weight of the edge (i, j).
• DIST(i) holds the length of the shortest path currently known from v0 to vertex i.
216. Basics of Dijkstra's Algorithm
• Dijkstra's Algorithm basically starts at the node that you choose (the source
node) and it analyzes the graph to find the shortest path between that node
and all the other nodes in the graph.
• The algorithm keeps track of the currently known shortest distance from
each node to the source node and it updates these values if it finds a
shorter path.
• Once the algorithm has found the shortest path between the source node
and another node, that node is marked as "visited" and added to the path.
• The process continues until all the nodes in the graph have been added to
the path. This way, we have a path that connects the source node to all
other nodes following the shortest path possible to reach each node.
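The description above can be sketched in Python using a priority queue in place of a linear scan over DIST (a minimal sketch; `adj` maps each vertex to `(neighbour, cost)` pairs, and all costs are assumed non-negative):

```python
import heapq

def dijkstra(adj, source):
    dist = {source: 0}
    heap = [(0, source)]
    visited = set()
    while heap:
        d, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)                 # shortest distance to v is now final
        for w, cost in adj[v]:
            nd = d + cost
            if nd < dist.get(w, float("inf")):
                dist[w] = nd           # shorter path found: update estimate
                heapq.heappush(heap, (nd, w))
    return dist
```

The `visited` set plays the role of the slide's set S: once a vertex enters it, its distance is never updated again.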
219. Transitive Closure
• Determining the existence of the path between every pair of vertices
• Given a directed graph, find out if a vertex j is reachable from another vertex
i for all vertex pairs (i, j) in the given graph.
• Reachable means that there is a path from vertex i to vertex j. The reachability matrix is called the transitive closure of the graph.
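One standard way to compute the reachability matrix described above is Warshall's algorithm (a minimal sketch; the input is assumed to be an n × n 0/1 adjacency matrix):

```python
def transitive_closure(a):
    # After iteration k, reach[i][j] == 1 iff there is a path from i to j
    # whose intermediate vertices all lie in {0, ..., k}.
    n = len(a)
    reach = [row[:] for row in a]     # copy so the input is not modified
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = 1
    return reach
```

The three nested loops give an O(n^3) computation of reachability for all vertex pairs (i, j).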
221. Unit4
External Sorting
• Storage Devices
• Sorting with disks
• Sorting with Tapes
• Symbol Tables
• Static tree tables
• Dynamic Tree tables
• Hash tables
222. • Techniques to sort large files.
• The files are too large to fit in the internal memory of a computer.
• Characteristics of external storage devices:
• External storage devices are broadly categorized as
• sequential access (tapes)
• direct access (drums and disks)
224. Storage Devices –
Magnetic Tapes
• Used for computer input/output.
• Data is recorded on magnetic tape approximately 1/2" wide.
• The tape is wound around a spool.
• A new reel of tape is normally 2400 ft long.
• Tracks run along the length of the tape, with a tape typically having 7 to 9 tracks across its width.
• Depending on the direction of magnetization, a spot on a track can represent either a 0 or a 1.
• A combination of bits across the tracks represents a character (A–Z, 0–9, etc.)
225. • The number of bits written per inch of a track is referred to as the tape density.
• Reading from a magnetic tape, or writing onto it, is done by a tape drive.
• A tape drive consists of 2 spindles:
• one spindle is mounted with the source reel, and the other takes up the reel.
• For forward reading or writing, the tape is pulled from the source reel, across the read/write heads, and onto the take-up reel.
• Some tape drives also permit backward reading and writing of tapes.
226. • If characters are packed onto a tape at a density of 800 bpi, then a 2400 ft tape holds a little over 23 × 10^6 characters (800 × 2400 × 12 = 23,040,000).
• Information on the tape is grouped into several blocks.
• These blocks may be of variable or fixed size.
• Between blocks of data is an interblock gap, normally about 3/4 inch long.
• The interblock gap is long enough to permit the tape to accelerate from rest to the correct read/write speed before the beginning of the next block reaches the read/write heads.
• To read a block from a tape, one specifies the length of the block and also the address A in memory.
227. • To write a block of data onto a tape, one specifies the starting address in memory and the number of consecutive words to be written.
• The block size corresponds to the size of the input/output buffers set up in memory.
• Computer tape is an example of a sequential access device.
• If the read head is positioned at the front of the tape and one wishes to read the information in a block 2000 ft down the tape, then it is necessary to forward-space the tape the correct number of blocks.
• To then read the first block, the tape would have to be rewound 2000 ft to the front before the first block could be read.
• A typical rewind over 2400 ft of tape takes about 1 minute.
228. • Some assumptions about the tape drive:
• Tapes can be written and read in the forward direction only.
• The I/O channel of the computer permits 3 tasks to be carried out in parallel: writing onto one tape, reading from another tape, and CPU operation.
229. Disk Storage –
• A disk is a direct access storage device.
• A disk system has two distinct components:
the disk pack (simply the disks on which information is stored), and
the disk drive (corresponding to the tape drive, which performs the reading and writing of information on the disks).
• Disk packs can be removed from or mounted onto a disk drive.
• The disk pack consists of several platters that are similar to phonograph records. The number of platters per pack varies and is typically about 6.
• Each platter has 2 surfaces on which information can be recorded.
230. • The outer surfaces of the top and bottom platters are not used,
• so there are a total of 10 surfaces on which information may be recorded.
• The disk drive contains the spindle on which the disk pack may be mounted, and a set of read/write heads.
• There is one read/write head for each surface.
• During a read/write, the heads are held stationary over the position of the platter where the read/write is to be performed,
• while the disk itself rotates at high speed (2000–3000 rpm).
231. • Data is read/written in concentric circles on each surface.
• The area that can be read from or written onto by a single stationary head is referred to as a track.
• Tracks are thus concentric circles, and each time the disk completes a revolution an entire track passes under a read/write head.
• There may be from 100 to 1000 tracks on each surface of a platter.
• The collection of tracks simultaneously under the read/write heads on the surfaces of all the platters is called a cylinder.
232. • Tracks are divided into sectors.
• A sector is the smallest addressable segment of a track.
• Information is stored along the tracks of a surface in blocks.
• In order to use a disk, the sector number has to be specified.
• The read/write head assembly is first positioned over the correct cylinder.
• Before the read/write can start, it has to wait for the right sector to come beneath the read/write head.
• Then transmission can take place.
• Three factors contribute to the I/O time for disks:
Seek time – the time taken to position the read/write heads over the correct cylinder; it depends on the number of cylinders across which the heads have to move.
Latency time – the time until the right sector of the track is under the read/write head.
Transmission time – the time taken to transmit the block of data to/from the disk.
233. Sorting with disks –
• The most popular method of external sorting is merge sort.
• This method has two distinct phases:
1. First, divide the file into runs such that the size of a run is small enough to fit into main memory. Then sort each run in main memory using a standard internal sorting algorithm.
2. Finally, merge the resulting runs into successively bigger runs until the file is sorted.
• Calculate the overall computing time.
• For example:
235. 1. Internally sort three blocks at a time (i.e. 750 records) to obtain six runs R1–R6. A method such as heap sort or quicksort could be used. These six runs are written out onto the disk.
2. Set aside 3 blocks of internal memory, each capable of holding 250 records. Two of these blocks will be used as input buffers and one as the output buffer. Merge R1 and R2: this is carried out by first reading one block of each of these runs into the input buffers.
3. Blocks of runs are merged from the input buffers into the output buffer.
4. When the output buffer gets full, it is written onto the disk.
5. If an input buffer gets empty, it is refilled with another block from the same run.
6. Then R3, R4 and finally R5, R6 are merged.
237. • Analysing the time required to sort these 4500 records, the analysis uses the following notation.
• Seek time can be reduced by writing the blocks on the same cylinder or on adjacent cylinders.
• A close look at the computing time shows that it depends on the number of passes made over the data.
238. • This scheme does not efficiently use the computer's ability to carry out I/O and CPU operations in parallel and overlap some of the time.
• Parallelism is an important consideration when sorting is done in a non-multiprogramming environment (if I/O and CPU processing do not go on in parallel, the CPU is idle during I/O).
• Full parallelism may not be achievable because of the structure of the operating system.
239. • K-way merging –
To merge k sorted arrays into one sorted output,
a min-heap of k elements is used.
The K-way merge pattern looks like this:
• Push the smallest (first) element of each sorted array into a min-heap to get the overall minimum.
• Take out the smallest (top) element from the heap and add it to the merged list.
• After removing the smallest element from the heap, insert the next element of the same list into the heap.
• Repeat steps 2 and 3 to populate the merged list in sorted order.
• Time complexity = O(N log K), where N is the total number of elements in all the K input arrays.
• Space complexity = O(K)
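The K-way merge pattern above can be sketched with Python's heapq (a minimal sketch; the heap entries carry the list index so the next element comes from the same list):

```python
import heapq

def k_way_merge(sorted_lists):
    # seed the min-heap with the first element of each non-empty list
    heap = [(lst[0], k, 0) for k, lst in enumerate(sorted_lists) if lst]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, k, i = heapq.heappop(heap)   # overall minimum across lists
        merged.append(value)
        if i + 1 < len(sorted_lists[k]):    # push next element of same list
            heapq.heappush(heap, (sorted_lists[k][i + 1], k, i + 1))
    return merged
```

The heap never holds more than K entries, matching the O(K) space and O(N log K) time quoted on the slide.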
243. • A significant reduction in the number of comparisons needed to find the next smallest element is achieved by using a selection tree.
• A selection tree is a binary tree in which each node represents the smaller of its 2 children.
• Thus the root node represents the smallest node in the tree.
246. Sorting with tapes –
• Sorting on tapes is carried out using the same steps as sorting on disks.
• The difference between sorting on tapes and on disks lies in the manner in which runs are maintained on the external storage medium.
• Tapes are sequential access devices.
• Seek and latency times differ between tapes and disks:
• they are high on tapes,
• so the blocks on a tape should be read sequentially during a k-way merge of runs.
252. • The computing-time analysis assumes that no operations are carried out in parallel.
253. Symbol tables
• A symbol table is a set of name–value pairs.
• Associated with each name in the table is an attribute, a collection of attributes, or some directions about processing.
• A symbol table has a fixed number of entries.
• Operations performed on a symbol table:
Ask whether a particular name is already present
Retrieve the attributes of that name
Insert a new name and its value
Delete a name and its value
255. • Different ways to implement symbol tables:
• static tree tables
• dynamic tree tables
Static tree tables –
• The identifiers are known in advance;
• no insertions or deletions are allowed.
• Symbol tables with this property are called static.
• The names are sorted and stored sequentially, and searched using binary search or the Fibonacci search method.
• Any name can be found in O(log2 n) operations.
257. • When evaluating a BST, add a special "square" node at every place there is a null link.
258. • Every binary tree with null links can be viewed as having two kinds of nodes:
• external nodes (or failure nodes) – they are not part of the original tree
• internal nodes – the remaining (original) nodes
• A binary search tree together with its external nodes is called an extended binary tree.
• Each time the binary search tree is examined for an identifier that is not in the tree,
• the search terminates at an external node: an unsuccessful search.
260. • Finding the external and internal path lengths of a binary tree:
• the external path length of a binary tree is the sum, over all external nodes, of the lengths of the paths from the root to those nodes.
262. The weighted external path length of such a binary tree is
Σ (1 ≤ i ≤ n+1) qi ki,
where ki is the distance from the root node to the external node with weight qi.
Suppose n = 3, q1 = 15, q2 = 2, q3 = 4 and q4 = 5.
264. • Over all binary trees with n internal nodes, find the minimum and maximum values for I (the internal path length).
• To obtain trees with minimal I, as many internal nodes as possible should be as close to the root as possible.
• One tree with minimal internal path length is the complete binary tree.
• Binary trees with minimal weighted external path length are used in many applications, such as finding an optimal set of codes for messages M1, …, Mn+1.
• Each code, a binary string, is used for transmission of the corresponding message.
• At the receiving end the code is decoded using a decode tree.
• A decode tree is a binary tree in which external nodes represent messages.
• The binary bits in the code word for a message determine the branching needed at each level of the decode tree to reach the correct external node.
265.
Huffman Codes-
M1=000
M2=001
M3=01
M4=1
• The cost of decoding a code word is proportional to the number of bits in the code
• This number is equal to the distance of the corresponding external node from the root node
• The expected decode time is minimized by choosing code words resulting in a decode tree
with minimal weighted external path length.
266. Huffman Algorithm –
• Huffman coding is a technique for compressing data to reduce its size without losing any detail. It was first developed by David Huffman in 1951.
• It follows a greedy approach, since it generates minimum-length prefix-free binary codes.
• Huffman coding is most useful for compressing data in which some characters occur frequently.
• Each character occupies 8 bits. There are a total of 15 characters in the example string, so 8 × 15 = 120 bits are required to send the string.
• Using the Huffman coding technique, we can compress the string to a smaller size.
• Huffman coding first creates a tree using the frequencies of the characters and then generates a code for each character.
267. Steps of Huffman encoding algorithm
1. Calculate the frequency of each character in the string.
2. Sort the characters in increasing order of the frequency. These are stored in
a priority queue Q.
268. 3. Make each unique character a leaf node.
4. Create a new internal node: assign the node with minimum frequency as its left child and the node with the second-minimum frequency as its right child. Set its value to the sum of the two frequencies.
5. Repeat steps 3 & 4 until a single tree remains.
269. 6. For each non-leaf node, assign 0 to the left edge and 1 to the right edge.
270.
• Without encoding, the total size of the string was 120 bits. After encoding the
size is reduced to 32 + 15 + 28 = 75.
Decoding –
• For decoding the code, we can take the code and traverse through the tree to
find the character.
• Let 101 is to be decoded, we can traverse from the root as in the figure below.
272. • create a priority queue Q consisting of each unique character
• sort them in ascending order of their frequencies
• while Q contains more than one node:
• create a newNode
• extract the minimum-value node from Q and assign it to leftChild of newNode
• extract the minimum-value node from Q and assign it to rightChild of newNode
• set the value of newNode to the sum of these two minimum values
• insert newNode back into Q
• return the remaining node as rootNode
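The steps above can be sketched in Python using the standard-library `heapq` as the priority queue. The example string `"BCAADDDCCACACAC"` is an assumption: a common 15-character example whose encoded size works out to 28 bits, matching the totals quoted earlier. The function name and the tie-breaking counter are illustrative choices, not part of the original algorithm statement.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman tree from character frequencies and return a code table."""
    freq = Counter(text)
    # Priority queue entries are (frequency, tiebreaker, tree);
    # a tree is either a single character or a (left, right) pair.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # minimum -> left child
        f2, _, right = heapq.heappop(heap)   # second minimum -> right child
        count += 1
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: 0 on left, 1 on right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"      # single-character edge case
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("BCAADDDCCACACAC")
encoded_bits = sum(len(codes[ch]) for ch in "BCAADDDCCACACAC")
```

With these frequencies (A:5, B:1, C:6, D:3) the encoded message occupies 28 bits, versus 8 * 15 = 120 bits uncompressed.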
273. Time Complexity –
• The time complexity for encoding each unique character based on its frequency is O(n log n).
• Extracting the minimum from the priority queue takes place 2*(n-1) times, and each extraction costs O(log n). Thus the overall complexity is O(n log n).
Advantages of Huffman Encoding-
• This encoding scheme saves a lot of storage space, since the binary codes generated are variable in length.
• It generates shorter binary codes for symbols/characters that appear more frequently in the input string.
• The binary codes generated are prefix-free.
274. Disadvantages of Huffman Encoding-
• Lossless data encoding schemes, like Huffman encoding, achieve a lower
compression ratio compared to lossy encoding techniques. Thus, lossless
techniques like Huffman encoding are suitable only for encoding text and
program files and are unsuitable for encoding digital images.
• Huffman encoding is a relatively slower process since it uses two passes- one for
building the statistical model and another for encoding. Thus, the lossless
techniques that use Huffman encoding are considerably slower than others.
• Since the lengths of the binary codes differ, it is difficult for the decoding software to detect whether the encoded data is corrupt. This can result in incorrect decoding and, subsequently, a wrong output.
275. Real-life applications of Huffman Encoding-
• Huffman encoding is widely used in compression formats like GZIP, PKZIP (WinZip) and BZIP2.
• Multimedia codecs like JPEG, PNG and MP3 use Huffman encoding (to be more precise, prefix codes).
276. Dynamic tree tables-
• Dynamic tables may also be maintained as binary search trees (BSTs).
• Insertion, deletion and searching of a node can be done.
• When insertions and deletions are performed, it may be necessary to restructure the tree to accommodate the changes while keeping it a complete binary tree.
• This gives a worst-case time complexity of O(h), where h is the height of the tree.
• To reduce this, the tree should be kept self-balanced (height-balanced) using a balance factor.
• A method of growing a self-balanced/height-balanced tree is followed.
277. • Worst-case time complexity: O(h)
• For a height-balanced tree, h = log(n)
278. AVL Tree-
• Adelson-Velskii and Landis in 1962 introduced a binary search tree that is balanced with respect to the heights of its subtrees.
• Dynamic searching in a balanced BST can be performed in O(log n) time if the tree has n nodes.
• Insertion and deletion in the same tree can also be done in O(log n) time.
• The resulting tree remains balanced.
279. Balance factor = height of left subtree − height of right subtree
A tree in which any node has a balance factor greater than 1 or less than −1 is not a balanced tree (not an AVL tree).
281. • If the tree is not an AVL tree, it can be converted to one by performing these rotations:
• LL
• RR
• LR
• RL
282. • Left rotation - If a tree becomes unbalanced when a node is inserted into the right subtree of the right subtree, we perform a single left rotation.
• Right rotation - An AVL tree may become unbalanced if a node is inserted into the left subtree of the left subtree. The tree then needs a right rotation.
284. Right-Left Rotation
• The second type of double rotation is the Right-Left Rotation. It is a right rotation followed by a left rotation.
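A minimal sketch of the two single rotations, assuming a bare `Node` class (hypothetical names); the full AVL bookkeeping (stored heights, insert/delete logic) is omitted.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def height(n):
    return -1 if n is None else 1 + max(height(n.left), height(n.right))

def balance_factor(n):
    # height of left subtree minus height of right subtree
    return height(n.left) - height(n.right)

def rotate_left(z):
    """Single left rotation: fixes a right-right (RR) imbalance."""
    y = z.right
    z.right, y.left = y.left, z
    return y  # y becomes the new subtree root

def rotate_right(z):
    """Single right rotation: fixes a left-left (LL) imbalance."""
    y = z.left
    z.left, y.right = y.right, z
    return y

# RR case: inserting 10, 20, 30 in order gives balance factor -2 at the root,
# so a single left rotation restores balance.
root = Node(10, right=Node(20, right=Node(30)))
root = rotate_left(root)
```

The double rotations (LR, RL) are just one rotation on the child followed by the opposite rotation on the parent.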
285. Hashing-
• Hashing is an important data structure designed to solve the problem of efficiently finding and storing data in an array.
• Hashing is a method for storing and retrieving records from a database.
• Records can be inserted, deleted, and searched for by a search key value in constant time.
• A hash system stores records in an array called a hash table (HT).
• Every hash table contains values or records stored sequentially.
• Hashing works by performing a computation on a search key K in a way that is intended to identify the position in HT that contains the record with key K.
• The hash table is partitioned into b buckets HT(0) ... HT(b-1).
• Each bucket is capable of holding s records in s slots, each slot being large enough to hold one record.
• Each bucket can hold exactly one record in each slot.
287. • Hash tables use a technique to generate a unique index number for each value stored in an array format. This technique is called the hash technique, or hashing.
• Hashing searches for an identifier or record by the address or location of the record.
288. • A hash function returns a small integer value (also known as the hash value, hash code, or hash sum):
• hash = hashfunc(key)
• index = hash % array_size
• Hashing in a data structure is a two-step process:
1. The hash function converts the item into a small integer, or hash value. This integer is used as an index to store the original data.
2. The data is stored in a hash table, where the hash key can be used to locate it quickly.
289. • Overflow occurs when a new identifier is mapped or hashed into a full bucket.
• Collision occurs when two non-identical identifiers are hashed into the same bucket; in other words, the hash function returns the same index for more than one element, so two or more elements compete for the same slot in the hash table.
• When the bucket size is 1 (s = 1), collisions and overflows occur simultaneously.
• Common hashing functions are:
Mid square
Division
Folding
Digit analysis
290. Mid-square (middle of square):
• Mid-square (fm) hashing is a hashing technique in which unique keys are generated.
• A seed value is taken and squared.
• Some digits from the middle are then extracted; these extracted digits form a number which is taken as the new seed.
• This technique can generate keys with high randomness if a big enough seed value is taken.
• The process is repeated as many times as keys are required.
291. Example-
Suppose a 4-digit seed is taken: seed = 4765
The square of the seed is 4765 * 4765 = 22705225
From this 8-digit number, four digits are extracted (say, the middle four).
The new seed value becomes seed = 7052
The square of this new seed is 7052 * 7052 = 49730704
Again, the same set of 4 digits is extracted, so the new seed becomes seed = 7307, and so on.
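The example above can be reproduced with a small sketch (the function name and the zero-padding convention for odd-length squares are illustrative assumptions):

```python
def mid_square_sequence(seed, digits=4, count=3):
    """Generate keys by repeatedly squaring the seed and taking the middle digits."""
    keys = []
    for _ in range(count):
        sq = str(seed * seed).zfill(2 * digits)  # pad so a middle always exists
        mid = (len(sq) - digits) // 2
        seed = int(sq[mid:mid + digits])         # middle `digits` digits -> new seed
        keys.append(seed)
    return keys
```

Starting from seed 4765 this yields 7052 and then 7307, exactly as in the worked example.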
292. Division-
• The hash function is obtained by using the modulo (mod) operator.
• The key value is divided by some number M (the size of the hash table) and the remainder is used as the hash address for X.
• Example
Size of hash table (m) = 1000 (addresses 0 - 999)
Suppose we want to calculate the index of element x, where x = 123789456
index = 123789456 mod 1000 = 456
The element x is stored at position 456 in the hash table.
293. Folding –
• The key k is partitioned into a number of parts k1, k2, ..., kn where each part, except possibly the last, has the same number of digits as the required address.
• The parts are then added together, ignoring the final carry.
• There are two types of folding:
Shift folding – all parts are shifted so their least significant digits line up, and then added.
Boundary folding – the key is folded at the part boundaries, so alternate parts are reversed before adding; the reversed part is indicated by pᵢʳ.
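A sketch of both folding variants, assuming decimal keys split into fixed-width parts (function names and the example key are illustrative):

```python
def shift_fold(key, part_digits):
    """Shift folding: split the key into parts, align and add them, dropping overflow."""
    s = str(key)
    parts = [s[i:i + part_digits] for i in range(0, len(s), part_digits)]
    total = sum(int(p) for p in parts)
    return total % 10 ** part_digits   # ignore the final carry

def boundary_fold(key, part_digits):
    """Boundary folding: reverse every other part (the p_i^r parts) before adding."""
    s = str(key)
    parts = [s[i:i + part_digits] for i in range(0, len(s), part_digits)]
    total = sum(int(p if i % 2 == 0 else p[::-1]) for i, p in enumerate(parts))
    return total % 10 ** part_digits
```

For key 123203241112 split into 3-digit parts, shift folding adds 123 + 203 + 241 + 112 = 679, while boundary folding adds 123 + 302 + 241 + 211 = 877.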
294. Digit analysis-
• Digit analysis is used with static files.
• A static file is one in which all the identifiers are known in advance. Using this method, we first transform the identifiers into numbers using some radix r.
• We then examine the digits of each identifier, deleting those digits that have the most skewed distributions. We continue deleting digits until the number of remaining digits is small enough to give an address in the range of the hash table.
• The digits used to calculate the hash address must be the same for all identifiers and must not have abnormally high peaks or valleys (the standard deviation must be small).
295. Overflow handling –
• Overflows and collisions are handled by open addressing: probing for another slot within the table.
• The different probing schemes are:
Linear probing
Quadratic probing
Double hashing
Linear probing –
In linear probing, the hash table is searched sequentially, starting from the original hash location. If that location is already occupied, the next location is checked.
It is also called rehashing.
296. For example, let us consider a simple hash function "key mod 7" and the sequence of keys 50, 700, 76, 85, 92, 73, 101:
50 % 7 = 1
700 % 7 = 0
76 % 7 = 6
85 % 7 = 1
92 % 7 = 1
73 % 7 = 3
101 % 7 = 3
As a second example, consider the hash function "key mod 5" and the keys to be inserted: 50, 70, 76, 93.
Let hash(x) be the slot index computed using a hash function and S be the table size.
If slot hash(x) % S is full, then we try (hash(x) + 1) % S
If (hash(x) + 1) % S is also full, then we try (hash(x) + 2) % S
If (hash(x) + 2) % S is also full, then we try (hash(x) + 3) % S
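The "key mod 7" example can be traced with a short linear-probing sketch (the helper name is an illustrative choice):

```python
def linear_probe_insert(table, key):
    """Insert a key with linear probing: try hash, hash+1, hash+2, ... mod table size."""
    size = len(table)
    home = key % size
    for step in range(size):
        slot = (home + step) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("hash table is full")

table = [None] * 7
for k in [50, 700, 76, 85, 92, 73, 101]:
    linear_probe_insert(table, k)
```

After all insertions, 85 lands in slot 2, 92 in slot 3, 73 in slot 4, and 101 in slot 5, because their home slots were already occupied.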
298. Quadratic probing-
• In this method, we look for the i²-th slot in the i-th iteration.
• We always start from the original hash location; if that location is occupied, we check the other slots.
Let hash(x) be the slot index computed using the hash function:
If slot hash(x) % S is full, then we try (hash(x) + 1*1) % S
If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S
If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S
299. Let us consider a table of size 7 and the hash function Hash(x) = x % 7.
Insert 22, 30, 50.
300. • Insert 22: Hash(22) = 22 % 7 = 1. Since the cell at index 1 is empty, we can easily insert 22 at slot 1.
• Insert 30: Hash(30) = 30 % 7 = 2. Since the cell at index 2 is empty, we can easily insert 30 at slot 2.
301. • Insert 50: Hash(50) = 50 % 7 = 1.
• In our hash table slot 1 is already occupied. So we search slot 1 + 1², i.e. 1 + 1 = 2.
• Slot 2 is also occupied, so we search cell 1 + 2², i.e. 1 + 4 = 5.
• Cell 5 is not occupied, so we place 50 in slot 5.
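The same trace can be reproduced in code (the helper name is an illustrative choice):

```python
def quadratic_probe_insert(table, key):
    """Insert a key with quadratic probing: try (hash + i*i) % size in iteration i."""
    size = len(table)
    home = key % size
    for i in range(size):
        slot = (home + i * i) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("no free slot found")

table = [None] * 7
for k in [22, 30, 50]:
    quadratic_probe_insert(table, k)
```

As in the worked example, 22 lands in slot 1, 30 in slot 2, and 50, after probing slots 1 + 1² and 1 + 2², in slot 5.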
302. Double hashing-
• In this technique, the increments for the probing sequence are computed using another hash function.
• We use a second hash function hash2(x) and look for the i*hash2(x) slot in the i-th iteration.
Let hash(x) be the slot index computed using the hash function:
If slot hash(x) % S is full, then we try (hash(x) + 1*hash2(x)) % S
If (hash(x) + 1*hash2(x)) % S is also full, then we try (hash(x) + 2*hash2(x)) % S
If (hash(x) + 2*hash2(x)) % S is also full, then we try (hash(x) + 3*hash2(x)) % S
303. • Insert the keys 27, 43, 692, 72 into a hash table of size 7, where the first hash function is h1(k) = k mod 7 and the second hash function is h2(k) = 1 + (k mod 5).
• Insert 27: 27 % 7 = 6. Location 6 is empty, so insert 27 into slot 6.
304. • Insert 43: 43 % 7 = 1. Location 1 is empty, so insert 43 into slot 1.
305. • Insert 692
• 692 % 7 = 6, but location 6 is already occupied: a collision.
• So we need to resolve this collision using double hashing:
• h1(k) = k mod 7
• h2(k) = 1 + (k mod 5)
hnew = [h1(692) + i * h2(692)] % 7
= [6 + 1 * (1 + 692 % 5)] % 7
= 9 % 7
= 2
Now, as 2 is an empty slot,
we can insert 692 into slot 2.
306. • Insert 72
• 72 % 7 = 2, but location 2 is already occupied: a collision.
• So we need to resolve this collision using double hashing:
hnew = [h1(72) + i * h2(72)] % 7
= [2 + 1 * (1 + 72 % 5)] % 7
= 5 % 7
= 5
Now, as 5 is an empty slot,
we can insert 72 into slot 5.
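The whole double-hashing example can be traced with a short sketch (the helper name is an illustrative choice; h2 is hard-coded to the example's 1 + k mod 5):

```python
def double_hash_insert(table, key):
    """Insert with double hashing: probe (h1 + i*h2) % size for i = 0, 1, 2, ..."""
    size = len(table)
    h1 = key % size
    h2 = 1 + key % 5        # second hash function from the example
    for i in range(size):
        slot = (h1 + i * h2) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("no free slot found")

table = [None] * 7
for k in [27, 43, 692, 72]:
    double_hash_insert(table, k)
```

As in the slides, 27 lands in slot 6, 43 in slot 1, 692 (after one probe of step h2 = 3) in slot 2, and 72 in slot 5.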
307. Unit-5
Internal Sorting
• Sorting is categorized into
• Internal sorting
• External sorting
• Internal sorting methods are
• Insertion sort
• Quick sort
• 2-way Merge sort
• Heap sort
• Shell sort
308. Insertion sort-
• The basic step is to insert a record r into a sequence of ordered records.
• The sort begins with an ordered sequence of one record and then successively inserts the remaining records into that sequence.
309. • This algorithm is not suitable for large data sets, as its average and worst-case complexity are O(n²), where n is the number of items.
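The basic step described above can be sketched in a few lines (function name is an illustrative choice):

```python
def insertion_sort(a):
    """Insert each record into its place within the ordered prefix a[0..i-1]."""
    for i in range(1, len(a)):
        r = a[i]                      # record to insert
        j = i - 1
        while j >= 0 and a[j] > r:    # shift larger records one position right
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = r
    return a
```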
Quick sort-
• It was developed by C.A.R. Hoare.
• It is a sort with good average behaviour.
• Quick sort is a highly efficient sorting algorithm based on partitioning an array of data into smaller arrays.
• A large array is partitioned into two arrays: one holds values smaller than a specified value, say the pivot, on which the partition is made, and the other holds values greater than the pivot.
• Quicksort partitions an array and then calls itself recursively twice to sort the two resulting subarrays. This algorithm is quite efficient for large data sets: its average-case complexity is O(n log n), although its worst case is O(n²).
310. • This algorithm follows the divide and conquer approach.
• Divide and conquer is a technique of breaking down the algorithms into
subproblems, then solving the subproblems, and combining the results back
together to solve the original problem.
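The divide-and-conquer structure can be sketched as follows; choosing the last element as pivot is one common convention, not the only one:

```python
def partition(a, low, high):
    """Place the pivot so smaller values are left of it and larger values right."""
    pivot = a[high]                  # last element as pivot (one common choice)
    i = low - 1
    for j in range(low, high):
        if a[j] <= pivot:            # smaller values move to the left side
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[high] = a[high], a[i + 1]
    return i + 1

def quick_sort(a, low=0, high=None):
    """Partition, then recursively sort the two resulting subarrays."""
    if high is None:
        high = len(a) - 1
    if low < high:
        p = partition(a, low, high)
        quick_sort(a, low, p - 1)
        quick_sort(a, p + 1, high)
    return a
```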
316. Heap Sort-
• A heap is a tree-based data structure in which all the tree nodes are in a particular order, such that the tree satisfies the heap properties.
• Heap sort may be regarded as a two-stage method:
The array is converted to a heap with the property that the value of each node is at least as large as the values of its children; the root holds the largest key in the tree.
The output sequence is generated in decreasing order by successively outputting the root and restructuring the remaining tree into a heap.
• Follow the given steps to solve the problem:
1. Build a max heap from the input data.
2. At this point, the maximum element is stored at the root of the heap. Replace it with the last item of the heap, reduce the size of the heap by 1, and heapify the root of the tree.
3. Repeat step 2 while the size of the heap is greater than 1.
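The two stages above can be sketched directly (helper names are illustrative choices):

```python
def heapify(a, n, i):
    """Sift a[i] down so the subtree rooted at i satisfies the max-heap property."""
    largest, left, right = i, 2 * i + 1, 2 * i + 2
    if left < n and a[left] > a[largest]:
        largest = left
    if right < n and a[right] > a[largest]:
        largest = right
    if largest != i:
        a[i], a[largest] = a[largest], a[i]
        heapify(a, n, largest)

def heap_sort(a):
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):   # stage 1: build a max heap
        heapify(a, n, i)
    for end in range(n - 1, 0, -1):       # stage 2: move root to the end, re-heapify
        a[0], a[end] = a[end], a[0]
        heapify(a, end, 0)
    return a
```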
319. Shell sort-
• Shell sort is a generalization of insertion sort that overcomes its drawbacks by comparing elements separated by a gap of several positions.
• It is an extended version of insertion sort and improves its average time complexity. Like insertion sort, it is a comparison-based, in-place sorting algorithm.
• Shell sort is efficient for medium-sized data sets.
• In insertion sort, elements can be moved ahead by only one position at a time. To move an element to a far-away position, many movements are required, which increases the algorithm's execution time. Shell sort overcomes this drawback by allowing the movement and swapping of far-away elements as well.
• This algorithm first sorts elements that are far apart, then successively reduces the gap between them. This gap is called the interval, and it can be calculated using Knuth's formula:
• h = h * 3 + 1
• where 'h' is the interval, with initial value 1.
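A sketch of shell sort using the Knuth interval formula above; note that for n = 8 Knuth's sequence gives gaps 4 and 1, slightly different from the n/2 = 4, n/4 = 2 gaps walked through on the following slides. The example array is an assumption chosen to match the gap-2 sublists quoted there.

```python
def shell_sort(a):
    """Gapped insertion sort using Knuth's gap sequence h = 3*h + 1."""
    n = len(a)
    h = 1
    while h * 3 + 1 < n:                  # largest Knuth interval below n
        h = h * 3 + 1
    while h >= 1:
        for i in range(h, n):             # insertion sort with stride h
            tmp = a[i]
            j = i
            while j >= h and a[j - h] > tmp:
                a[j] = a[j - h]
                j -= h
            a[j] = tmp
        h //= 3                           # shrink the interval: 13 -> 4 -> 1
    return a
```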
320. In the first loop, the element at the 0th position is compared with the element at the 4th position. If the 0th element is greater, it is swapped with the element at the 4th position; otherwise, it remains the same. This process continues for the remaining elements.
321. In the second loop, elements lie at an interval of 2 (n/4 = 2, where n = 8).
Now we take an interval of 2 to sort the rest of the array. With an interval of 2, two sublists are generated: {12, 25, 33, 40} and {17, 8, 31, 42}.
322. Files, Queries and Sequential Organizations
Files-
• A file is a collection of records, where each record consists of one or more fields.
• The primary objective of file organization is to provide means for record retrieval and update.
• Update includes deletion, changes in fields, or insertion of an entirely new record.
323. • Certain fields in the record are designated as key fields.
• Records may be retrieved by specifying values for some or all of these keys.
• A combination of key values specified for retrieval is called a query.
• An invalid query to the file would be location = Los Angeles.
324. • Obtaining data representations of files on external storage devices for efficient use depends on several factors:
Kind of external storage device available
Type of queries allowed
Number of keys
Mode of retrieval/update
Storage device types
• We are concerned with files stored on disks/tapes.
Query types
325. Number of keys –
• A distinction is made between files having only one key and files with more than one key.
Mode of retrieval-
• May be either real time or batched.
• In real time, the response time for any query should be minimal.
• In batched mode, the response time is not significant. Requests for retrieval are batched together on a transaction file until either enough requests have been received or a suitable amount of time has passed; then all transactions are processed.
Mode of update-
• Either real time or batched.
• Real-time update is needed, for example, for flight reservations: the flight file must be changed immediately to show the new status.
326. • Batched update would be suitable in a bank account system: for example, all withdrawals and deposits made on a particular day are collected on a transaction file and updates are made at the end of the day.
• Batched update involves two files: a master file and a transaction file.
• Master file - represents the file status after the previous update.
• Transaction file - holds all the update requests that have not yet been reflected in the master file, so the master file is always somewhat "out of date".
• Records are placed sequentially onto the storage media (adjacent to each other).
• The physical sequence of records is ordered on some key, called the primary key.
• For batched retrieval and update, ordered sequential files are preferred over unordered sequential files since they are easier to process.
327. • File organization breaks down into two or more aspects:
The directory
The physical organization of the records (sequential)
• Processing a query/update request proceeds in 2 steps:
Indexes are used to determine the parts of the physical file that are to be searched.
These parts of the file are searched, and the records satisfying the query are accessed.
329. Sequential Organization-
• A cylinder-surface index is maintained for the primary key.
• In order to retrieve records efficiently, indexes can be used.
• The structure of the indexes is based on the indexing techniques.
Random organization-
• Records are stored at random locations on the disk.
• Several techniques are used for randomization:
Direct addressing
Directory lookup
Hashing
330. Direct addressing-
• The available disk space is divided into nodes large enough to store records of equal size.
• The numeric value of the primary key is used to determine the node into which a particular record is stored.
• Searching or deleting a record by primary key value requires one disk access.
• Updating a record requires 2 accesses (1 to read and 1 to write back the modified record).
• If variable-size records are used, an index can be set up with pointers to the actual records on the disk.
332. Directory lookup-
• Retrieving a record involves searching the index for the record address and then accessing the record itself.
• The records can be of fixed or variable size.
• Searching for a record by index requires more than 1 access.
• Every record has a unique primary key.
• 2 or more records hashed to the same address can cause collisions.
Hashing-
• The available space is divided into buckets and slots.
• Every record has a hashed index.
• Some space is set aside to handle overflow.