SlideShare a Scribd company logo
1 of 17
Become a Data Architect – session 2
Data Structures & Algorithms
There are many sites with questions and answers, books, youtube videos, etc.
Books:
• Introduction to Algorithms, 3rd Edition (The MIT Press)
• Cracking the Coding Interview: 189 Programming Questions and Solutions 6th Edition - by Gayle Laakmann McDowell
• Computer Algorithms - by Horowitz, Sahni, Rajsekaran
• Data Structures and Algorithms - by Aho, Ullman, Hopcroft
• Algorithm Design: Foundations, Analysis, and Internet Examples by Goodrich, Tamassia
• Designing Data-Intensive Applications - by Martin Kleppmann
Training websites:
- https://leetcode.com/
- https://www.hackerrank.com/
- https://www.educative.io/
- https://www.interviewcake.com/
Youtube - multiple channels, for example: Gaurav Sen
- https://www.youtube.com/watch?v=_5vrfuwhvlQ
- https://www.youtube.com/watch?v=zaRkONvyGr8
etc.
=======================================
array - arr[i] - elements have same type
=======================================
list - lst[i] - like array but:
- elements may have different size and type
- elements may be complext structures (lists, dicts, etc.)
[1,2,3]
[1,"dog",[2,3]]
=======================================
tuple - like a list, but immutable (can not be changed)
(1,2,3)
((1,2),(3,4))
=======================================
string - "abcde fghij" - ordered set of characters
=======================================
sequence - any ordered set (list, tuples, string, ...)
=======================================
index - numbering elements in a sequence (usually starts with 0)
aa[i]
aa[i][j]
=======================================
slice - subset of elements of a sequence, for example:
aa = "mama papa"
012345678
bb = aa[2:5] = "ma p" (2 - included, 5 - not included)
=======================================
=======================================
stack (FILO = First In Last Out)
st = [1,2,3]
st.append(4) [1,2,3,4]
aa = st.pop() aa=4, st=[1,2,3]
=======================================
queue (FIFO = First In First Out)
import collections
qq = collections.deque()
for ii in range(6):
qq.append(ii) # qq == deque([0, 1, 2, 3, 4, 5])
aa = qq.popleft() # 0
bb = qq.popleft() # 1
# qq == deque([2, 3, 4, 5])
=======================================
Priority Queue
Suppose that we create a queue of tickets.
Each ticket is described by a tuple with two numbers:
(priority_num, ticket_num)
priority_num : 1..10 (1=highest, 10=lowest priority)
ticket_num : sequentially growing number
As tickets enter the queue, we sort the queue
(by priority_num, ticket_num)
so that tickets with higher priority (lower priority_num)
will move forward.
And within same priority, "earlier" tickets (with lower
ticket_num) will move forward.
Usually the queue was sorted before adding a new element.
So sorting simply means moving the element forward
until it finds its place.
Usually priority queue is implemented using
a min-heap structure (see later in this document)
set (unique elements)
aa = set(range(1,10)) # {1, 2, 3, 4, 5, 6, 7, 8, 9}
bb = set(range(5,15)) # {5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
# Operations between sets
aa - bb # {1, 2, 3, 4}
bb - aa # {10, 11, 12, 13, 14}
aa | bb # {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
aa & bb # {5, 6, 7, 8, 9}
aa ^ bb # {1, 2, 3, 4, 10, 11, 12, 13, 14}
hash (hashmap, dict)
stores key-value pairs
famous for having constant-time for insert/delete/read
aa = {"k1":"v1", "k2":55} # {'k1':'v1', 'k2':55}
aa['k3'] = 33 # {'k1':'v1', 'k2':55, 'k3':33}
# deleting
if 'k2' in aa:
del aa['k2'] # {'k1':'v1', 'k3':33}
# upsert
if 'k2' in aa:
aa['k2'] += 1
else:
aa['k2'] = 1
Merging two dicts:
starting python 3.5: z = (**d1, **d2)
starting python 3.9: z = d1 | d2
Using defaultdict:
from collections import defaultdict
dd = defaultdict(list)
dd['k1'].append(1)
dd['k1'].append(2)
dd['k1'].append(3)
for kk in dd.keys():
print(f"{kk} => {dd[kk]}") # k1 => [1, 2, 3]
How hash works:
We make an array of buckets.
The key is mapped to one of the buckets
using a simple hashing function
The tuple (key,val) is placed in this bucket.
Reading/deleting by key - same hashing function is used
to locate the item
# ------------------------------
Example of hashing function:
# ------------------------------
hash = 0
for char in key_str:
hash = hash*33 + ord(char)
idx = hash / num_buckets
# ------------------------------
Consistent hashing (ring hash)
Imagine that you have a layer of web servers followed by layer of app servers
• How do you do load balancing between layers?
• How you determine to which server on the next layer to go?
• How you do it consistently to take advantage of caching on layers?
• How you re-hash the system when you scale numbers of server up or down?
Consistent hashing solves these problems by providing a distribution scheme
which does not directly depend on the number of servers.
Idea:
1. Make a circle with a big number of "positions" (232 .. 2160)
This circle is called "hash ring".
2. Map each server to 1024 random positions on circle
3. Map each request (IP/port/...) to the same circle, and go clockwise
to find the server to process this request
4. Continue going clockwise to find a 2nd server (for backup)
• https://medium.com/system-design-blog/consistent-hashing-b9134c8a9062
• https://en.wikipedia.org/wiki/Consistent_hashing
• https://www.youtube.com/watch?v=zaRkONvyGr8
Original MIT Thesis by Daniel Lewin (1998):
"Consistent hashing and random trees : algorithms for caching in distributed networks"
• https://dspace.mit.edu/handle/1721.1/9947
• https://github.com/papers-we-love/papers-we-love/blob/master/distributed_systems/consistent-hashing-and-
random-trees.pdf
Bloom filter
Imagine that we have a huge lookup table (~millions of words)
and we testing if a word is in this table or not.
We may use a huge hash on disk, causing big disk I/O.
Bloom filter uses a small compact bitmap instead of big hash.
This allows to reject most negatives,
while allowing very few false-positives.
The negatives are definitely negatives,
but positives are "maybe" positives.
An empty Bloom filter is a bit array of m bits, all set to 0.
There must also be k different hash functions defined,
each of which maps an element to one of m bits (sets it to 1).
So if we use k hash functions - we can get up to "k" bits set.
Example: m=30, k=10
To query for an element (test whether it is in the set),
feed it to each of the k hash functions to get k array positions.
If any of the bits at these positions is 0, the element is not in the set.
If all are 1, then it may be positive - or false-positive.
If map is big enough, the patterns will be sparse,
and probability of false-positives very low.
- https://hur.st/bloomfilter
Bloom filter was developed by
Burton Howard Bloom, MIT
graduate, in 1970.
https://www.cs.princeton.edu/courses/archive/spr
05/cos598E/bib/p422-bloom.pdf
https://www.quora.com/Where-can-one-find-a-photo-
and-biographical-details-for-Burton-Howard-Bloom-
inventor-of-the-Bloom-filter
Bloom filters are called filters
because they are often used
as a cheap first pass to filter out
segments of a dataset that do not
match a query
BST (Binary Search Tree)
rebalancing BST (rotations)
height of the tree h = lg2(N)..N, search time ~h
=======================================
Trie - prefix tree, good for words/characters
(one node may have 26 children)
typically implemented using dictionaries
=======================================
Linked List
head of the list
going through the list
finding cycle in the linked list
by using turtle and rabbit (slow and fast runners)
Binary Heaps (MaxHeap and MinHeap)
Data structure which looks like binary tree
where each parent is >= (or <= for min-heap) than values of children
Binary heaps are a common way of implementing priority queues.
Given an array
we can "heapify" it in place by swapping elements:
0 - top (root) element
1,2 - next layer (left and right children)
3,4,5,6 - next layer
etc.
0 1 2 3 4 5 6 ...
--- ------- ...
last parent's idx = floor(N/2) - 1
Time complexity:
building heap - O(N)
push (at bottom) and adjust - O(lg(N))
pop (from top) and adjust - O(lg(N))
heap sort - O(N*log(N))
https://stackoverflow.com/questions/9755721/how-can-building-a-heap-be-on-time-complexity
Sorting:
• in-place
• stable sort (preserves order of duplicates)
• bubble sort O(n^2)
(go through whole list swapping neighbors until nothing to swap)
• selection sort O(n^2)
divide list into two portions: left sorted, right not sorted yet
on each step: select min value from right - append to left
• insert sort O(n^2)
(take one element, insert it in its place on the left, repeat)
• merge sort O(n log n)
binary division, sort small pieces, then merge them layer by layer
• heap sort O(n log n)
make MinHeap in the array,
take min value, out of the heap, fix the heap, repeat.
• quick sort O(n log n) - or O(n^2) in worst case
pick an element in the middle called a pivot
move to the right of it all elements which are > pivot
move to the left of it all elements which are < pivot
Now recursively apply the same process to left and right subarrays.
• tim sort O(n) - O(n log n)
used in python, combination of merge and insertion sort algorithms.
Takes advantage of runs of consecutive ordered elements
topological sort
schedule a sequence of jobs based on their dependencies.
Example: gnu-make files for compilation, etc.
Dependencies may be represented by a graph:
jobs are points (vertices of a graph)
x ---> y means that we need to calculate "x" before "y"
(x needed for calculating "y")
Kahn algorithm:
jobs=[1,2,3,4, 5, 6]
dependencies = [(1,3), (3,2), (3,4), (5,6)]
# (m, n) where n is a dependency of m
1. for each job calculate number of dependencies NoD
2. find jobs with no dependencies - put them into a queue.
3. process the queue one by one like this:
take queue element, append to the output
take its dependencies, reduce their NoD by one
if NoD for any of them becomes zero, add those to the queue
Complexity: O(Njobs + Ndependencies)
Searches:
• Binary search
• DFS (Depth-First Search) - recursion
• BFS (Breadth-First Search) - queue, level by level
Note: Recursion can always be rewritten as iteration
factorial_recursive(n):
if n<=1:
return 1
else:
return n * factorial_recursive(n-1)
factorial_iterative(n):
ff = 1
for ii in range(1,n+1):
ff *= ii
return ff
Algorithm complexity (time and space):
Big "O" notation
O(1), O(N) , O(N^2) , O(N*log(N)), O(p*k), etc.
==================================
- bottoms-up and top-down algorithms
- divide-and-conquer
- greediness
==================================
declarative vs imperative procedural programming
==================================
functional programming
• declarative
• evaluate functions instead of simply setting values
(recursion instead of for-loop)
• create/modify functions at run time
• pass functions as arguments
• use pure functions (avoid side effects (global vars, ...))
==================================
dynamic programming = memoization
(caching results to reuse them instead of recalculating)
common to use a dictionary for that
# procedural:
def factorial(n):
f=1
for i in range(1,n+1):
f = f*i
return f
factorial(4) # 24
factorial(6) # 720
# --------------------------------
# functional
def multiply(x, y):
return x * y
def factorial(n):
from functools import reduce
return reduce(multiply, range(1,n+1))
factorial(4) # 24
factorial(6) # 720
Bit manipulation
binary positive, negative, addition, shifting, masks
& – Bitwise AND
| – Bitwise OR
~ – Bitwise NOT
^ – XOR
<< – Left Shift
>> – Right Shift
Typical algorithmic tasks:
• merging two sorted lists
• finding largest word
• finding duplicates (use hash)
• func(N) as func of smaller numbers
• calculate Fibonacci numbers
• do coin-change (using memoization)
• recursive staircase (1,2,3 steps - how many ways?)
• shortest reach path (using BFSearch)
• balanced parentheses (using stack)
• queue with 2 stacks
• keep contacts in a Trie
• mirroring BST
• finding the 2nd largest value in BST (binary tree)
(or verifying that the tree is correct)
Turing Machine (TM) (Turing, 1936) - an very simple abstract
computational machine.
An infinite memory tape is divided into discrete "cells".
The machine positions its "head" over a cell and "reads" it.
Then it uses a "finite table" of instructions/rules to:
- optionally update the cell
- move head by one position left or right
- or halt the computation.
NTM (Non-deterministic TM) - more than one possible action
can be taken in some situations. A NTM effectively is able
to duplicate itself at any time, and have each duplicate
take a different execution path.
Turing-complete set of rules - a set which can be used
to simulate a Turing Machine.
Intractable problem - can be solved in theory, but in reality
takes too much resources/time
Time-complexity classes:
P (Polinomial time by a deterministic Turing machine)
NP (Nondeterministic Polynomial)
NP-hard is the class of decision problems to which all problems
in NP can be reduced to in polynomial time
by a TM (Deterministic Turing Machine).
NP-complete is the intersection of NP-hard and NP.
NP-complete is the class of decision problems in NP
to which all other problems in NP can be reduced to
in polynomial time by a TM (Deterministic Turing Machine).
Minimum spanning tree (MST) is a tree that:
- Contains all the nodes (vertices) of the graph.
- has no cycles
- has minimal total length (sum of "weights" of edges)
Example - a cable company wanting to lay lines to multiple houses
while minimizing the amount of cable laid to save money.
Kruskal's algorithm (1956) - a minimum-spanning-tree algorithm
which finds an edge of the least possible weight
that connects any two trees in the forest.
- https://en.wikipedia.org/wiki/Kruskal%27s_algorithm
Shortest Path finding (road navigators, etc.)
- https://en.wikipedia.org/wiki/Shortest_path_problem
Multiple algorithms:
- Dijkstra - single-source shortest path problem with non-negative edge weight.
- Bellman–Ford - single-source problem if edge weights may be negative.
- A* search algorithm - single pair shortest path using heuristics
to try to speed up the search.
- Floyd–Warshall - all pairs shortest paths.
- Johnson's - all pairs shortest paths, and may be faster
than Floyd–Warshall on sparse graphs.
- Viterbi - shortest stochastic path problem with an additional
probabilistic weight on each node.
- etc.
==================================
Combinatorial search algorithms - achieve efficiency by
reducing the effective size of the search space
or by employing heuristics.
Classic combinatorial search problems include solving
the eight queens puzzle or evaluating moves in games
with a large game tree, such as reversi or chess.

More Related Content

Similar to DA_02_algorithms.pptx

L1 - Recap.pdf
L1 - Recap.pdfL1 - Recap.pdf
L1 - Recap.pdfIfat Nix
 
C++ STL (quickest way to learn, even for absolute beginners).pptx
C++ STL (quickest way to learn, even for absolute beginners).pptxC++ STL (quickest way to learn, even for absolute beginners).pptx
C++ STL (quickest way to learn, even for absolute beginners).pptxAbhishek Tirkey
 
C++ STL (quickest way to learn, even for absolute beginners).pptx
C++ STL (quickest way to learn, even for absolute beginners).pptxC++ STL (quickest way to learn, even for absolute beginners).pptx
C++ STL (quickest way to learn, even for absolute beginners).pptxGauravPandey43518
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningBig_Data_Ukraine
 
Effective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyEffective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyKimikazu Kato
 
Data structure and algorithm All in One
Data structure and algorithm All in OneData structure and algorithm All in One
Data structure and algorithm All in Onejehan1987
 
Algorithms notes tutorials duniya
Algorithms notes   tutorials duniyaAlgorithms notes   tutorials duniya
Algorithms notes tutorials duniyaTutorialsDuniya.com
 
Matplotlib adalah pustaka plotting 2D Python yang menghasilkan gambar berkual...
Matplotlib adalah pustaka plotting 2D Python yang menghasilkan gambar berkual...Matplotlib adalah pustaka plotting 2D Python yang menghasilkan gambar berkual...
Matplotlib adalah pustaka plotting 2D Python yang menghasilkan gambar berkual...HendraPurnama31
 
python-numpyandpandas-170922144956 (1).pptx
python-numpyandpandas-170922144956 (1).pptxpython-numpyandpandas-170922144956 (1).pptx
python-numpyandpandas-170922144956 (1).pptxAkashgupta517936
 
Python - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning LibrariesPython - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning LibrariesAndrew Ferlitsch
 
03-data-structures.pdf
03-data-structures.pdf03-data-structures.pdf
03-data-structures.pdfNash229987
 
GE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python ProgrammingGE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python ProgrammingMuthu Vinayagam
 
Hash presentation
Hash presentationHash presentation
Hash presentationomercode
 

Similar to DA_02_algorithms.pptx (20)

L1 - Recap.pdf
L1 - Recap.pdfL1 - Recap.pdf
L1 - Recap.pdf
 
C++ STL (quickest way to learn, even for absolute beginners).pptx
C++ STL (quickest way to learn, even for absolute beginners).pptxC++ STL (quickest way to learn, even for absolute beginners).pptx
C++ STL (quickest way to learn, even for absolute beginners).pptx
 
C++ STL (quickest way to learn, even for absolute beginners).pptx
C++ STL (quickest way to learn, even for absolute beginners).pptxC++ STL (quickest way to learn, even for absolute beginners).pptx
C++ STL (quickest way to learn, even for absolute beginners).pptx
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Effective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPyEffective Numerical Computation in NumPy and SciPy
Effective Numerical Computation in NumPy and SciPy
 
Data structure and algorithm All in One
Data structure and algorithm All in OneData structure and algorithm All in One
Data structure and algorithm All in One
 
linkedlist.pptx
linkedlist.pptxlinkedlist.pptx
linkedlist.pptx
 
Aggregate.pptx
Aggregate.pptxAggregate.pptx
Aggregate.pptx
 
Algorithms notes tutorials duniya
Algorithms notes   tutorials duniyaAlgorithms notes   tutorials duniya
Algorithms notes tutorials duniya
 
stack & queue
stack & queuestack & queue
stack & queue
 
AD3251-Data Structures Design-Notes-Searching-Hashing.pdf
AD3251-Data Structures  Design-Notes-Searching-Hashing.pdfAD3251-Data Structures  Design-Notes-Searching-Hashing.pdf
AD3251-Data Structures Design-Notes-Searching-Hashing.pdf
 
Matplotlib adalah pustaka plotting 2D Python yang menghasilkan gambar berkual...
Matplotlib adalah pustaka plotting 2D Python yang menghasilkan gambar berkual...Matplotlib adalah pustaka plotting 2D Python yang menghasilkan gambar berkual...
Matplotlib adalah pustaka plotting 2D Python yang menghasilkan gambar berkual...
 
python-numpyandpandas-170922144956 (1).pptx
python-numpyandpandas-170922144956 (1).pptxpython-numpyandpandas-170922144956 (1).pptx
python-numpyandpandas-170922144956 (1).pptx
 
Python - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning LibrariesPython - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning Libraries
 
03-data-structures.pdf
03-data-structures.pdf03-data-structures.pdf
03-data-structures.pdf
 
Recursion Lecture in Java
Recursion Lecture in JavaRecursion Lecture in Java
Recursion Lecture in Java
 
Parallel search
Parallel searchParallel search
Parallel search
 
GE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python ProgrammingGE8151 Problem Solving and Python Programming
GE8151 Problem Solving and Python Programming
 
Hash presentation
Hash presentationHash presentation
Hash presentation
 
R workshop
R workshopR workshop
R workshop
 

Recently uploaded

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 

Recently uploaded (20)

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 

DA_02_algorithms.pptx

  • 1. Become a Data Architect – session 2 Data Structures & Algorithms There are many sites with questions and answers, books, youtube videos, etc. Books: • Introduction to Algorithms, 3rd Edition (The MIT Press) • Cracking the Coding Interview: 189 Programming Questions and Solutions 6th Edition - by Gayle Laakmann McDowell • Computer Algorithms - by Horowitz, Sahni, Rajsekaran • Data Structures and Algorithms - by Aho, Ullman, Hopcroft • Algorithm Design: Foundations, Analysis, and Internet Examples by Goodrich, Tamassia • Designing Data-Intensive Applications - by Martin Kleppmann Training websites: - https://leetcode.com/ - https://www.hackerrank.com/ - https://www.educative.io/ - https://www.interviewcake.com/ Youtube - multiple channels, for example: Gaurav Sen - https://www.youtube.com/watch?v=_5vrfuwhvlQ - https://www.youtube.com/watch?v=zaRkONvyGr8 etc.
  • 2. ======================================= array - arr[i] - elements have same type ======================================= list - lst[i] - like array but: - elements may have different size and type - elements may be complext structures (lists, dicts, etc.) [1,2,3] [1,"dog",[2,3]] ======================================= tuple - like a list, but immutable (can not be changed) (1,2,3) ((1,2),(3,4)) ======================================= string - "abcde fghij" - ordered set of characters ======================================= sequence - any ordered set (list, tuples, string, ...) ======================================= index - numbering elements in a sequence (usually starts with 0) aa[i] aa[i][j] ======================================= slice - subset of elements of a sequence, for example: aa = "mama papa" 012345678 bb = aa[2:5] = "ma p" (2 - included, 5 - not included) =======================================
  • 3. ======================================= stack (FILO = First In Last Out) st = [1,2,3] st.append(4) [1,2,3,4] aa = st.pop() aa=4, st=[1,2,3] ======================================= queue (FIFO = First In First Out) import collections qq = collections.deque() for ii in range(6): qq.append(ii) # qq == deque([0, 1, 2, 3, 4, 5]) aa = qq.popleft() # 0 bb = qq.popleft() # 1 # qq == deque([2, 3, 4, 5]) ======================================= Priority Queue Suppose that we create a queue of tickets. Each ticket is described by a tuple with two numbers: (priority_num, ticket_num) priority_num : 1..10 (1=highest, 10=lowest priority) ticket_num : sequentially growing number As tickets enter the queue, we sort the queue (by priority_num, ticket_num) so that tickets with higher priority (lower priority_num) will move forward. And within same priority, "earlier" tickets (with lower ticket_num) will move forward. Usually the queue was sorted before adding a new element. So sorting simply means moving the element forward until it finds its place. Usually priority queue is implemented using a min-heap structure (see later in this document)
  • 4. set (unique elements) aa = set(range(1,10)) # {1, 2, 3, 4, 5, 6, 7, 8, 9} bb = set(range(5,15)) # {5, 6, 7, 8, 9, 10, 11, 12, 13, 14} # Operations between sets aa - bb # {1, 2, 3, 4} bb - aa # {10, 11, 12, 13, 14} aa | bb # {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14} aa & bb # {5, 6, 7, 8, 9} aa ^ bb # {1, 2, 3, 4, 10, 11, 12, 13, 14}
  • 5. hash (hashmap, dict) stores key-value pairs famous for having constant-time for insert/delete/read aa = {"k1":"v1", "k2":55} # {'k1':'v1', 'k2':55} aa['k3'] = 33 # {'k1':'v1', 'k2':55, 'k3':33} # deleting if 'k2' in aa: del aa['k2'] # {'k1':'v1', 'k3':33} # upsert if 'k2' in aa: aa['k2'] += 1 else: aa['k2'] = 1 Merging two dicts: starting python 3.5: z = (**d1, **d2) starting python 3.9: z = d1 | d2 Using defaultdict: from collections import defaultdict dd = defaultdict(list) dd['k1'].append(1) dd['k1'].append(2) dd['k1'].append(3) for kk in dd.keys(): print(f"{kk} => {dd[kk]}") # k1 => [1, 2, 3] How hash works: We make an array of buckets. The key is mapped to one of the buckets using a simple hashing function The tuple (key,val) is placed in this bucket. Reading/deleting by key - same hashing function is used to locate the item # ------------------------------ Example of hashing function: # ------------------------------ hash = 0 for char in key_str: hash = hash*33 + ord(char) idx = hash / num_buckets # ------------------------------
  • 6. Consistent hashing (ring hash) Imagine that you have a layer of web servers followed by layer of app servers • How do you do load balancing between layers? • How you determine to which server on the next layer to go? • How you do it consistently to take advantage of caching on layers? • How you re-hash the system when you scale numbers of server up or down? Consistent hashing solves these problems by providing a distribution scheme which does not directly depend on the number of servers. Idea: 1. Make a circle with a big number of "positions" (232 .. 2160) This circle is called "hash ring". 2. Map each server to 1024 random positions on circle 3. Map each request (IP/port/...) to the same circle, and go clockwise to find the server to process this request 4. Continue going clockwise to find a 2nd server (for backup) • https://medium.com/system-design-blog/consistent-hashing-b9134c8a9062 • https://en.wikipedia.org/wiki/Consistent_hashing • https://www.youtube.com/watch?v=zaRkONvyGr8 Original MIT Thesis by Daniel Lewin (1998): "Consistent hashing and random trees : algorithms for caching in distributed networks" • https://dspace.mit.edu/handle/1721.1/9947 • https://github.com/papers-we-love/papers-we-love/blob/master/distributed_systems/consistent-hashing-and- random-trees.pdf
  • 7. Bloom filter Imagine that we have a huge lookup table (~millions of words) and we testing if a word is in this table or not. We may use a huge hash on disk, causing big disk I/O. Bloom filter uses a small compact bitmap instead of big hash. This allows to reject most negatives, while allowing very few false-positives. The negatives are definitely negatives, but positives are "maybe" positives. An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps an element to one of m bits (sets it to 1). So if we use k hash functions - we can get up to "k" bits set. Example: m=30, k=10 To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is not in the set. If all are 1, then it may be positive - or false-positive. If map is big enough, the patterns will be sparse, and probability of false-positives very low. - https://hur.st/bloomfilter Bloom filter was developed by Burton Howard Bloom, MIT graduate, in 1970. https://www.cs.princeton.edu/courses/archive/spr 05/cos598E/bib/p422-bloom.pdf https://www.quora.com/Where-can-one-find-a-photo- and-biographical-details-for-Burton-Howard-Bloom- inventor-of-the-Bloom-filter Bloom filters are called filters because they are often used as a cheap first pass to filter out segments of a dataset that do not match a query
  • 8. BST (Binary Search Tree) rebalancing BST (rotations) height of the tree h = lg2(N)..N, search time ~h ======================================= Trie - prefix tree, good for words/characters (one node may have 26 children) typically implemented using dictionaries ======================================= Linked List head of the list going through the list finding cycle in the linked list by using turtle and rabbit (slow and fast runners)
  • 9. Binary Heaps (MaxHeap and MinHeap) Data structure which looks like binary tree where each parent is >= (or <= for min-heap) than values of children Binary heaps are a common way of implementing priority queues. Given an array we can "heapify" it in place by swapping elements: 0 - top (root) element 1,2 - next layer (left and right children) 3,4,5,6 - next layer etc. 0 1 2 3 4 5 6 ... --- ------- ... last parent's idx = floor(N/2) - 1 Time complexity: building heap - O(N) push (at bottom) and adjust - O(lg(N)) pop (from top) and adjust - O(lg(N)) heap sort - O(N*log(N)) https://stackoverflow.com/questions/9755721/how-can-building-a-heap-be-on-time-complexity
  • 10. Sorting: • in-place • stable sort (preserves order of duplicates) • bubble sort O(n^2) (go through whole list swapping neighbors until nothing to swap) • selection sort O(n^2) divide list into two portions: left sorted, right not sorted yet on each step: select min value from right - append to left • insert sort O(n^2) (take one element, insert it in its place on the left, repeat) • merge sort O(n log n) binary division, sort small pieces, then merge them layer by layer • heap sort O(n log n) make MinHeap in the array, take min value, out of the heap, fix the heap, repeat. • quick sort O(n log n) - or O(n^2) in worst case pick an element in the middle called a pivot move to the right of it all elements which are > pivot move to the left of it all elements which are < pivot Now recursively apply the same process to left and right subarrays. • tim sort O(n) - O(n log n) used in python, combination of merge and insertion sort algorithms. Takes advantage of runs of consecutive ordered elements
  • 11. topological sort schedule a sequence of jobs based on their dependencies. Example: gnu-make files for compilation, etc. Dependencies may be represented by a graph: jobs are points (vertices of a graph) x ---> y means that we need to calculate "x" before "y" (x needed for calculating "y") Kahn algorithm: jobs=[1,2,3,4, 5, 6] dependencies = [(1,3), (3,2), (3,4), (5,6)] # (m, n) where n is a dependency of m 1. for each job calculate number of dependencies NoD 2. find jobs with no dependencies - put them into a queue. 3. process the queue one by one like this: take queue element, append to the output take its dependencies, reduce their NoD by one if NoD for any of them becomes zero, add those to the queue Complexity: O(Njobs + Ndependencies)
  • 12. Searches: • Binary search • DFS (Depth-First Search) - recursion • BFS (Breadth-First Search) - queue, level by level Note: Recursion can always be rewritten as iteration factorial_recursive(n): if n<=1: return 1 else: return n * factorial_recursive(n-1) factorial_iterative(n): ff = 1 for ii in range(1,n+1): ff *= ii return ff
  • 13. Algorithm complexity (time and space): Big "O" notation O(1), O(N) , O(N^2) , O(N*log(N)), O(p*k), etc. ================================== - bottoms-up and top-down algorithms - divide-and-conquer - greediness ================================== declarative vs imperative procedural programming ================================== functional programming • declarative • evaluate functions instead of simply setting values (recursion instead of for-loop) • create/modify functions at run time • pass functions as arguments • use pure functions (avoid side effects (global vars, ...)) ================================== dynamic programming = memoization (caching results to reuse them instead of recalculating) common to use a dictionary for that # procedural: def factorial(n): f=1 for i in range(1,n+1): f = f*i return f factorial(4) # 24 factorial(6) # 720 # -------------------------------- # functional def multiply(x, y): return x * y def factorial(n): from functools import reduce return reduce(multiply, range(1,n+1)) factorial(4) # 24 factorial(6) # 720
  • 14. Bit manipulation binary positive, negative, addition, shifting, masks & – Bitwise AND | – Bitwise OR ~ – Bitwise NOT ^ – XOR << – Left Shift >> – Right Shift
  • 15. Typical algorithmic tasks: • merging two sorted lists • finding largest word • finding duplicates (use hash) • func(N) as func of smaller numbers • calculate Fibonacci numbers • do coin-change (using memoization) • recursive staircase (1,2,3 steps - how many ways?) • shortest reach path (using BFSearch) • balanced parentheses (using stack) • queue with 2 stacks • keep contacts in a Trie • mirroring BST • finding the 2nd largest value in BST (binary tree) (or verifying that the tree is correct)
  • 16. Turing Machine (TM) (Turing, 1936) - an very simple abstract computational machine. An infinite memory tape is divided into discrete "cells". The machine positions its "head" over a cell and "reads" it. Then it uses a "finite table" of instructions/rules to: - optionally update the cell - move head by one position left or right - or halt the computation. NTM (Non-deterministic TM) - more than one possible action can be taken in some situations. A NTM effectively is able to duplicate itself at any time, and have each duplicate take a different execution path. Turing-complete set of rules - a set which can be used to simulate a Turing Machine. Intractable problem - can be solved in theory, but in reality takes too much resources/time Time-complexity classes: P (Polinomial time by a deterministic Turing machine) NP (Nondeterministic Polynomial) NP-hard is the class of decision problems to which all problems in NP can be reduced to in polynomial time by a TM (Deterministic Turing Machine). NP-complete is the intersection of NP-hard and NP. NP-complete is the class of decision problems in NP to which all other problems in NP can be reduced to in polynomial time by a TM (Deterministic Turing Machine).
  • 17. Minimum spanning tree (MST) is a tree that: - Contains all the nodes (vertices) of the graph. - has no cycles - has minimal total length (sum of "weights" of edges) Example - a cable company wanting to lay lines to multiple houses while minimizing the amount of cable laid to save money. Kruskal's algorithm (1956) - a minimum-spanning-tree algorithm which finds an edge of the least possible weight that connects any two trees in the forest. - https://en.wikipedia.org/wiki/Kruskal%27s_algorithm Shortest Path finding (road navigators, etc.) - https://en.wikipedia.org/wiki/Shortest_path_problem Multiple algorithms: - Dijkstra - single-source shortest path problem with non-negative edge weight. - Bellman–Ford - single-source problem if edge weights may be negative. - A* search algorithm - single pair shortest path using heuristics to try to speed up the search. - Floyd–Warshall - all pairs shortest paths. - Johnson's - all pairs shortest paths, and may be faster than Floyd–Warshall on sparse graphs. - Viterbi - shortest stochastic path problem with an additional probabilistic weight on each node. - etc. ================================== Combinatorial search algorithms - achieve efficiency by reducing the effective size of the search space or by employing heuristics. Classic combinatorial search problems include solving the eight queens puzzle or evaluating moves in games with a large game tree, such as reversi or chess.