This document surveys data structures for maps (dictionaries), including linked lists, sorted arrays, balanced binary search trees, and arrays indexed directly by key. It then focuses on hashing: mapping keys to array indices with a hash function, handling collisions through open addressing (probing for another open slot) or external chaining (a linked list per slot), and analyzing performance in terms of the table's load factor.
Computer Science 385
Analysis of Algorithms
Siena College
Spring 2011
Topic Notes: Maps and Hashing
Maps/Dictionaries
A map or dictionary is a structure used for looking up items via key-value associations.
There are many possible implementations of maps. If we think abstractly about a few variants,
we can consider their time complexity on common operations and space complexity. In the table
below, n denotes the actual number of elements in the map, and N denotes the maximum number
of elements it could contain (where that restriction makes sense).
Structure                     Search     Insert     Delete     Space
Linked List                   O(n)       O(1)       O(n)       O(n)
Sorted Array                  O(log n)   O(n)       O(n)       O(N)
Balanced BST                  O(log n)   O(log n)   O(log n)   O(n)
Array[KeyRange] of EltType    O(1)       O(1)       O(1)       O(KeyRange)
That last line is very interesting. If we know the range of keys (suppose they're all integers between
0 and KeyRange-1), we can use the key as the index directly into an array and have constant-time
access!
Hashing
The map implementation of an array with keys as the subscripts and values as contents makes a
lot of sense. For example, we could store information about members of a team whose uniform
numbers are all within a given range by storing each player’s information in the array location
indexed by their uniform number. If some uniform numbers are unused, we’ll have some empty
slots, but it gives us constant time access to any player’s information as long as we know the
uniform number.
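The uniform-number scheme above can be sketched directly. This is a minimal illustration, not code from the notes: the 0..99 number range, the put/get names, and the player strings are all assumptions.

```java
// A direct-address table for the uniform-number example: the key itself is
// the array index, giving O(1) access. The 0..99 range and player names are
// illustrative assumptions.
public class DirectAddressTable {
    static final int KEY_RANGE = 100;               // uniform numbers 0..99
    static String[] table = new String[KEY_RANGE];  // unused numbers stay null

    static void put(int uniformNumber, String player) {
        table[uniformNumber] = player;              // O(1) insert
    }
    static String get(int uniformNumber) {
        return table[uniformNumber];                // O(1) search; null if empty
    }
    public static void main(String[] args) {
        put(23, "Jordan");
        System.out.println(get(23));                // prints Jordan
    }
}
```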
However, there are some important restrictions on the use of this representation.
This implementation assumes that the data has a key which is of a type that can be used as an array
subscript (some enumerated type in Pascal, integers in C/C++/Java), which is not always the case.
What if we’re trying to store strings or doubles or something more complex?
Moreover, the size requirements for this implementation could be prohibitive. Suppose we wanted
to store 2000 student records indexed by social security number or a Siena ID number. We would
need an array with 1 billion elements!
In cases like this, where most of the entries will be empty, we would like to use a smaller array, but
come up with a good way to map our elements into the array entries.
Suppose we have a lot of data elements of type E and a set of indices in which we can store data
elements. We would like to obtain a function H: E → index, which has the properties:
1. H(elt) can be computed quickly
2. If elt1 ≠ elt2 then H(elt1) ≠ H(elt2). (i.e., H is a one-to-one function)
This is called a perfect hashing function. Unfortunately, they are difficult to find unless you know
all possible entries to the table in advance. This is rarely the case.
Instead we use something that behaves well, but not necessarily perfectly.
The goal is to scatter elements through the array randomly so that they won’t attempt to be stored
in the same location, or “bump into” each other.
So we will define a hash function H: Keys → Addresses (or array indices), and we will call
H(element.key) the home address of element.
Assuming no two distinct elements wind up mapping to the same location, we can now add, search
for, and remove them in O(1) time.
The one thing we can no longer do efficiently that we could have done with some of the other maps
is to list them in order.
Since the function is not guaranteed to be one-to-one, we can’t guarantee that no two distinct
elements map to the same home address.
This suggests two problems to consider:
1. How to choose a good hashing function?
2. How to resolve the situation where two different elements are mapped to same home address?
Selecting Hashing Functions
The following quote should be memorized by anyone trying to design a hashing function:
“A given hash function must always be tried on real data in order to find out whether it is effective
or not.”
Data which contains unexpected and unfortunate regularities can completely destroy the usefulness
of any hashing function!
We will start by considering hashing functions for integer-valued keys.
Digit selection As the name suggests, this involves choosing digits from certain positions of the key
(e.g., last 3 digits of an SSN).
Unfortunately it is easy to make a very unfortunate choice. We would need careful analysis
of the expected keys to see which digits will work best. Likely patterns can cause problems.
We want to make sure that the expected keys are at least capable of generating all possible
table positions.
For example the first digits of SSN’s reflect the region in which they were assigned and hence
usually would work poorly as a hashing function.
Even worse, what about the first few digits of a Siena ID number?
Division Here, we let H(key) = key mod TableSize
This is very efficient and often gives good results if the TableSize is chosen properly.
If it is chosen poorly then you can get very poor results. Consider the case where TableSize
= 2^8 = 256 and the keys are computed from the ASCII equivalent of two-letter pairs, i.e.
Key(xy) = 2^8 · ASCII(x) + ASCII(y); then all pairs ending with the same letter get
mapped to the same address. Similar problems arise with any power of 2.
It is usually safest to choose the TableSize to be a prime number in the range of your
desired table size. The default size in the structure package’s hash table is 997.
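A runnable sketch of the division method using the prime table size 997 mentioned above. The class and method names and the sample key are illustrative; keys are assumed non-negative.

```java
// A sketch of the division method with the prime table size 997 mentioned
// above. Keys are assumed non-negative (Java's % can return a negative
// value for negative operands).
public class DivisionHash {
    static final int TABLE_SIZE = 997;  // prime, as recommended

    static int hash(int key) {
        return key % TABLE_SIZE;        // home address in 0..TABLE_SIZE-1
    }
    public static void main(String[] args) {
        System.out.println(hash(123456789));  // prints 273
    }
}
```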
Mid-Square Here, the approach is to square the key and then select a subset of the bits from the
result. Often, bits from the middle part of the square are taken. The mixing provided by the
multiplication ensures that all digits are used in the computation of the hash code.
Example: Let the keys range between 1 and 32000 and let the TableSize be 2048 = 2^11.
Square the key and select 11 bits from the middle of the result. Note that such bit manipulations
can be done very efficiently using shifting and bitwise and operations.
H(Key) = (key >> 10) & 0x07FF;
Will pull out the bits:
Key       = 123456789abcdefghijklmnopqrstuvw
Key >> 10 = 0000000000123456789abcdefghijklm
&  0x07FF = 000000000000000000000cdefghijklm
Using r bits produces a table of size 2^r.
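The mid-square computation can be made runnable as follows. The snippet above is applied to a key that has already been squared; here the squaring is explicit, the mask 0x7FF keeps 11 bits, and a long avoids 32-bit overflow during squaring. The class name and sample key are assumptions.

```java
// A runnable version of the mid-square computation: square the key, then
// extract 11 middle bits by shifting and masking.
public class MidSquareHash {
    static int hash(int key) {
        long square = (long) key * key;        // square the key, no overflow
        return (int) ((square >> 10) & 0x7FF); // 11 middle bits: 0..2047
    }
    public static void main(String[] args) {
        System.out.println(hash(12345));       // prints 1371
    }
}
```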
Folding Here, we break the key into chunks of digits (sometimes reversing alternating chunks)
and add them up.
This is often used if the key is too big. Suppose the keys are Social Security numbers, which
are 9-digit numbers. We can break each one into three pieces: perhaps the first digit, the next 4,
and then the last 4. Then add them together.
This technique is often used in conjunction with other methods – one might do a folding
step, followed by division.
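As a sketch, here is the 1 + 4 + 4 digit split suggested above, with the chunks added together. The method name and sample key are illustrative, and a division step would typically follow.

```java
// Folding a 9-digit key (such as an SSN) into 1 + 4 + 4 digit chunks
// that are added together.
public class FoldingHash {
    static int fold(int key) {                // key assumed to have 9 digits
        int last4 = key % 10000;              // last four digits
        int mid4  = (key / 10000) % 10000;    // next four digits
        int first = key / 100000000;          // leading digit
        return first + mid4 + last4;          // often followed by % TableSize
    }
    public static void main(String[] args) {
        System.out.println(fold(123456789));  // 1 + 2345 + 6789 = 9135
    }
}
```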
What if the key is a String?
We can use a formula like Key(xy) = 2^8 · (int)x + (int)y to convert from alphabetic keys to
ASCII equivalents. (Use 2^16 for Unicode.) This is often used in combination with folding (for the
rest of the string) and division.
Here is a very simple-minded hash code for strings: Add together the ordinal equivalent of all
letters and take the remainder mod TableSize.
However, this has a potential problem: words with same letters get mapped to same places:
miles, slime, smile
This would be much improved if you took the letters in pairs before division.
Here is a function which adds up the ASCII codes of letters and then takes the remainder when
divided by TableSize:
hash = 0;
for (int charNo = 0; charNo < word.length(); charNo++)
    hash = hash + (int) word.charAt(charNo);
hash = hash % tableSize; /* gives 0 <= hash < tableSize */
This is not going to be a very good hash function, but it’s at least simple.
All Java objects come equipped with a hashCode method (we’ll look at this soon). The hashCode
for Java Strings is defined as specified in its Javadoc:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
using int arithmetic, where s[i] is the ith character of the string, n is the length of the string,
and ^ indicates exponentiation. (The hash value of the empty string is zero.)
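The formula above is exactly a Horner's-rule evaluation, and comparing it with the naive add-up-the-letters hash shows why the multiplier matters: the naive hash collides on the anagrams miles, slime, and smile, while the 31-based hash separates them. The class and method names here are illustrative.

```java
// The String hashCode formula evaluated with Horner's rule, next to the
// naive sum-of-characters hash from earlier in the notes.
public class StringHashes {
    static int javaStyleHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = 31 * h + s.charAt(i);    // s[0]*31^(n-1) + ... + s[n-1]
        return h;
    }
    static int naiveHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = h + s.charAt(i);         // order-insensitive: anagrams collide
        return h;
    }
    public static void main(String[] args) {
        System.out.println(naiveHash("miles") == naiveHash("slime"));         // true
        System.out.println(javaStyleHash("miles") == javaStyleHash("slime")); // false
    }
}
```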
See Example:
~jteresco/shared/cs385/examples/HashCodes
Handling Collisions
The home address of a key is the location that the hash function returns for that key.
A hash collision occurs when two different keys have the same home location.
There are two main options for getting out of trouble:
1. Open addressing, which Levitin most confusingly calls closed hashing: if the desired slot is
full, do something to find an open slot. This must be done in such a way that one can find
the element again quickly!
2. External chaining, which Levitin calls open hashing or separate chaining: make each slot
capable of holding multiple items, and store all objects with the same home address in
that slot.
For our discussion, we will assume our hash table consists of an array of TableSize entries of
the appropriate type:
T[] table = new T[TableSize];
Open Addressing
Here, we find the home address of the key (using the hash function). If it happens to be occupied,
we keep trying new slots until an empty slot is located.
There will be three types of entries in the table:
• an empty slot, represented by null
• deleted entries - marked by inserting a special “reserved object” (the need for this will soon
become obvious), and
• normal entries which contain a reference to an object in the hash table.
This approach requires a rehashing. There are many ways to do this – we will consider a few.
First, linear probing. We simply look for the next available slot in the array beyond the home
address. If we get to the end, we wrap around to the beginning:
Rehash(i) = (i + 1) % TableSize.
This is about as simple as possible. Successive rehashing will eventually try every slot in the table
for an empty slot.
For example, consider a simple hash table of strings, where TableSize = 26, and we choose
a (poor) hash function:
H(key) = the alphabetic position (zero-based) of the first character in the string.
Strings to be input are GA, D, A, G, A2, A1, A3, A4, Z, ZA, E
Here’s what happens with linear rehashing:
Slot:   0   1   2   3   4   5   6   7   8   9  10  ...  25
Entry:  A  A2  A1   D  A3  A4  GA   G  ZA   E            Z
In this example, we see the potential problem with an open addressing approach in the presence of
collisions: clustering.
Primary clustering occurs when more than one key has the same home address. If the same re-
hashing scheme is used for all keys with the same home address then any new key will collide with
all earlier keys with the same home address when the rehashing scheme is employed.
In the example, this happened with A, A2, A1, A3, and A4.
Secondary clustering occurs when a key has a home address which is occupied by an element
which originally got hashed to a different home address, but in rehashing got moved to the address
which is the same as the home address of the new element.
In the example, this happened with E.
When we search for A3, we need to look in 0,1,2,3,4.
What if we search for AB? When do we know we will not find it? We have to continue our search
to the first empty slot! This is not until position 10!
Now let’s delete A2 and then search for A1. If we just put a null in slot 1, we would not find A1
before encountering the null in slot 1.
So we must mark deletions (not just make them empty). These slots can be reused on a new
insertion, but we cannot stop a search when we see a slot marked as deleted. This can be pretty
complicated.
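The mechanics above (linear probing with wraparound, plus a reserved deletion marker) can be sketched as follows, reusing the toy first-letter hash and the GA, D, A, ... example. The class and field names are assumptions, not the notes' code.

```java
// A minimal open-addressing table with linear probing and a "reserved"
// tombstone marker for deletions.
public class LinearProbing {
    static final String DELETED = "__RESERVED__";   // tombstone marker
    static final int SIZE = 26;
    static String[] table = new String[SIZE];

    static int home(String key) {
        return key.charAt(0) - 'A';                 // toy hash: first letter
    }
    static void put(String key) {
        int i = home(key);
        while (table[i] != null && table[i] != DELETED && !table[i].equals(key))
            i = (i + 1) % SIZE;                     // Rehash(i) = (i+1) % SIZE
        table[i] = key;                             // may reuse a tombstone
    }
    static boolean contains(String key) {
        int i = home(key);
        while (table[i] != null) {                  // stop only at an empty slot
            if (table[i] != DELETED && table[i].equals(key)) return true;
            i = (i + 1) % SIZE;                     // skip tombstones, keep going
        }
        return false;
    }
    static void remove(String key) {
        int i = home(key);
        while (table[i] != null) {
            if (table[i] != DELETED && table[i].equals(key)) { table[i] = DELETED; return; }
            i = (i + 1) % SIZE;
        }
    }
    public static void main(String[] args) {
        for (String s : new String[] {"GA","D","A","G","A2","A1","A3","A4","Z","ZA","E"})
            put(s);
        remove("A2");
        System.out.println(contains("A1"));   // still found past the tombstone
    }
}
```

Searching for A1 after deleting A2 works only because the tombstone is skipped rather than treated as an empty slot, which is exactly the deletion problem described above.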
Minor variants of linear rehashing would include adding any number k (which is relatively prime
to TableSize), rather than adding 1.
If the number k is divisible by any factor of TableSize (i.e., k is not relatively prime to
TableSize), then not all entries of the table will be explored when rehashing. For instance,
if TableSize = 100 and k = 50, the linear rehashing function will only explore two slots no
matter how many times it is applied to a starting location.
Often the use of k = 1 works as well as any other choice.
Next, we consider quadratic rehashing.
Here, we try slot (home + j^2) % TableSize on the jth rehash.
This variant can help with secondary clustering but not primary clustering.
It can also result in instances where in rehashing you don’t try all possible slots in table.
For example, suppose the TableSize is 5, with slots 0 to 4. If the home address of an element
is 1, then successive rehashings result in 2, 0, 0, 2, 1, 2, 0, 0, ... The slots 3 and 4 will never be
examined to see if they have room. This is clearly a disadvantage.
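A quick check of that example, confirming that with TableSize 5 and home address 1 the probe sequence is 2, 0, 0, 2, 1, 2, 0, 0, ... and slots 3 and 4 never appear. Class and method names are illustrative.

```java
// Checking the quadratic-probing example: TableSize 5, home address 1,
// probing slot (home + j^2) % TableSize on the jth rehash.
public class QuadraticProbes {
    static int probe(int home, int j, int tableSize) {
        return (home + j * j) % tableSize;
    }
    public static void main(String[] args) {
        StringBuilder seq = new StringBuilder();
        for (int j = 1; j <= 8; j++)
            seq.append(probe(1, j, 5)).append(' ');
        System.out.println(seq);   // 2 0 0 2 1 2 0 0 (slots 3 and 4 never appear)
    }
}
```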
Our last option for rehashing is called double hashing.
Rather than computing a uniform jump size for successive rehashes, we make it depend on the key
by using a different hashing function to calculate the rehash.
For example, we might compute
delta(Key) = Key % (TableSize -2) + 1
and add this delta for successive rehash attempts.
If the TableSize is chosen well, this should alleviate both primary and secondary clustering.
For example, suppose the TableSize is 5, and H(n) = n % 5. We calculate delta as above.
So while H(1) = H(6) = H(11) = 1, the jump sizes for the rehash differ since delta(1)
= 2, delta(6) = 1, and delta(11) = 3.
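The example can be verified directly; the class and method names are illustrative.

```java
// Verifying the double-hashing example: keys 1, 6, and 11 share home
// address 1 under H(n) = n % 5 but get different jump sizes from delta.
public class DoubleHashing {
    static final int TABLE_SIZE = 5;
    static int home(int key)  { return key % TABLE_SIZE; }
    static int delta(int key) { return key % (TABLE_SIZE - 2) + 1; }
    public static void main(String[] args) {
        System.out.println(home(1) + " " + home(6) + " " + home(11));    // 1 1 1
        System.out.println(delta(1) + " " + delta(6) + " " + delta(11)); // 2 1 3
    }
}
```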
External Chaining
We can eliminate some of our troubles entirely by allowing each slot in the hash table to hold as
many items as necessary.
The easiest way to do this is to let each slot be the head of a linked list of elements.
The simplest way to represent this is to allocate a table as an array of references, with each
non-null entry referring to a linked list of elements which got mapped to that slot.
We can organize these lists as ordered, singly or doubly linked, circular, etc.
We can even let them be binary search trees if we want.
Of course with good hashing functions, the size of lists should be short enough so that there is no
need for fancy representations (though one may wish to hold them as ordered lists).
There are some advantages to this over open addressing:
1. The “deletion problem” doesn’t arise.
2. The number of elements stored in the table can be larger than the table size.
3. It avoids all problems of secondary clustering (though primary clustering can still be a
problem – resulting in long lists in some buckets).
Analysis
We can measure the performance of various hashing schemes with respect to the load factor of the
table.
The load factor, α, of a table is defined as the ratio of the number of elements stored in the table to
the size of the table.
α = 1 means the table is full, α = 0 means it is empty.
Larger values of α lead to more collisions.
(Note that with external chaining, it is possible to have α > 1).
The table below (from Bailey’s Java Structures) summarizes the performance of our collision res-
olution techniques in searching for an element.
Strategy            Unsuccessful                Successful
Linear probing      (1/2)(1 + 1/(1 - α)^2)      (1/2)(1 + 1/(1 - α))
Double hashing      1/(1 - α)                   (1/α) ln(1/(1 - α))
External chaining   α + e^(-α)                  1 + α/2
The value in each slot represents the average number of compares necessary for a search. The
first column represents the number of compares if the search is ultimately unsuccessful, while the
second represents the case when the item is found.
The main point to note is that the time for linear probing goes up dramatically as α approaches
1.
Double hashing is similar, but not as bad, whereas external chaining increases only linearly
with α.
In particular, if α = .9, we get:
Strategy            Unsuccessful   Successful
Linear probing      50.5           5.5
Double hashing      10             ∼4
External chaining   3              1.45
The differences become greater with heavier loading.
The space requirements (in words of memory) are roughly the same for both techniques. If we
have n items stored in a table allocated with TableSize entries:
• open addressing: TableSize + n · objectsize
• external chaining: TableSize + n · (objectsize + 1)
With external chaining we can get away with a smaller TableSize, since it responds better to
the resulting higher load factor.
A general rule of thumb might be to use open addressing when we have a small number of elements
and expect a low load factor. When we have a larger number of elements, external chaining can be
more appropriate.
Hashtable Implementations
Hashing is a central enough idea that all Objects in Java come equipped with a method hashCode
that computes a hash code for the object. Like equals, it is defined in Object. We can override
it if we know how we want to have our specific types of objects behave.
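For instance, a minimal value class overriding both methods might look like this (the Point class and its fields are illustrative, not from the notes):

```java
import java.util.Objects;

public class Point {
    private final int x, y;

    public Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        // The contract: objects that are equal must report equal hash codes.
        return Objects.hash(x, y);
    }
}
```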
Java includes hash tables in java.util.Hashtable.
It uses external chaining, with the buckets implemented as a linear structure. It allows two param-
eters to be set through arguments to the constructor:
• initialCapacity - how big is the array of buckets
• loadFactor - how full can the table get before it automatically grows
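A quick usage sketch of those two constructor parameters:

```java
import java.util.Hashtable;

public class HashtableParams {
    public static void main(String[] args) {
        // 101 buckets to start; the table grows automatically once it is
        // more than 75% full.
        Hashtable<String, Integer> table = new Hashtable<>(101, 0.75f);
        table.put("alpha", 1);
        System.out.println(table.get("alpha"));  // 1
    }
}
```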
Suppose an insertion causes the hash table to exceed its load factor. What does the resize operation
entail? Every item in the hash table must be rehashed! If only a small number of buckets get used,
this could be a problem. We could be resizing when most buckets are still empty. Moreover, the
resize and rehash may or may not help. In fact, shrinking the table might be just as effective as
growing it in some circumstances.
Next, we will consider the hash table implementations in the structure package.
The Hashtable Class
Interesting features:
• Implementation of open addressing with default size 997.
• The load factor threshold for a rehashing is fixed at 0.6. Exceeding this threshold will result
in the table being doubled in size and all entries being rehashed.
• There is a special RESERVED entry to make sure we can still locate values after we have
removed items that caused clustering.
• The protected locate method finds the entry where a given key is stored in the table or
an empty slot where it should be stored if it is to be added. Note that we need to search
beyond the first RESERVED location because we’re not sure yet if we might still find the
actual value. But we do want to use the first reserved value we come across to minimize
search distance for this key later.
• Now get and put become simple – most of the work is done in locate.
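A minimal sketch of the idea behind locate – this is not the structure package's actual code, just linear probing that searches past RESERVED slots while remembering the first one it passes:

```java
public class OpenAddressingSketch {
    static final Object RESERVED = new Object();  // marker for a deleted slot
    Object[] keys;

    OpenAddressingSketch(int size) { keys = new Object[size]; }

    // Returns the slot where key lives, or the slot where it should be stored.
    int locate(Object key) {
        int hash = Math.abs(key.hashCode()) % keys.length;
        int firstReserved = -1;
        for (int i = 0; i < keys.length; i++) {
            int slot = (hash + i) % keys.length;
            if (keys[slot] == null)                // truly empty: key not present
                return firstReserved >= 0 ? firstReserved : slot;
            if (keys[slot] == RESERVED) {          // deleted: remember, keep looking
                if (firstReserved < 0) firstReserved = slot;
            } else if (keys[slot].equals(key)) {   // found the key itself
                return slot;
            }
        }
        return firstReserved;  // probed everything; reuse a reserved slot
    }
}
```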
The ChainedHashtable Class
Interesting features:
• Implementation of external chaining, default size 997.
• We create an array of buckets that holds Lists of Associations. SinglyLinkedLists
are used.
• Lists are allocated when we try to locate the bucket for a key, even if we’re not inserting.
This allows other operations to degenerate into list operations.
For example, to check if a key is in the table, we locate its bucket and use the list’s contains
method.
• We add an entry with the put method.
1. we locate the correct bucket
2. we create a new Association to hold the new item
3. we remove the item (!) from the bucket
4. we add the item to the bucket (why?)
5. if we actually removed a value we return the old one
6. otherwise, we increase the count and return null
• The get method is also in a sense destructive. Rather than just looking up the value, we
remove and reinsert the key if it’s found.
• This reinsertion for put and get means that we will move the values we are using to the front
of their bucket. Helpful?
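The remove-then-add steps behind put (and the move-to-front effect they produce) can be sketched like this, using java.util.LinkedList and AbstractMap.SimpleEntry in place of the structure package's SinglyLinkedList and Association:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.LinkedList;
import java.util.Map;

public class BucketPutDemo {
    // Returns the old value, or null if the key was new -- just as put does.
    static String put(LinkedList<Map.Entry<String, String>> bucket,
                      String key, String value) {
        String old = null;
        for (Map.Entry<String, String> e : bucket) {
            if (e.getKey().equals(key)) {  // remove any existing entry for the key...
                old = e.getValue();
                bucket.remove(e);
                break;
            }
        }
        // ...then add the new entry at the front, so recently used keys
        // are found first on later searches.
        bucket.addFirst(new SimpleEntry<>(key, value));
        return old;
    }
}
```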