This document provides a tutorial overview of hash table methods. It discusses calculating hash addresses to map records to table locations, handling collisions when multiple records hash to the same location, theoretical analyses of hash functions, alternatives to hashing, and areas for further research. The document aims to introduce programmers and students to hash tables for efficient searching of large files where search time is independent of file size.
This document discusses database concepts including:
1) Atomic and non-atomic domains, and how non-atomic values complicate data storage and encourage data redundancy.
2) Functional dependencies, which constrain the set of legal relations by requiring that values for one set of attributes determine values for another set.
3) Third normal form, which places constraints on attribute dependencies to reduce data redundancy.
4) Indexing mechanisms used to speed up data access by mapping search keys to file attributes.
This document discusses different indexing and hashing techniques. It describes ISAM which allows both sequential and random access to records through indexes. It then explains static hashing which uses a fixed hash function and dynamic hashing techniques like extendible hashing and linear hashing which allow the hash table to expand. Extendible hashing uses a directory to point to the logical structure while linear hashing expands the hash table one slot at a time. Finally, it briefly introduces B+ trees which are balanced search trees used for range queries through index and data pages.
Hashing provides a way to access records in constant time by mapping keys to addresses using a hash function. Collisions occur when different keys map to the same address. Common solutions include spreading out records, using extra memory, or storing multiple records at an address. The distribution of records can be analyzed using mathematical tools like the Poisson distribution to predict collisions and optimize performance. Various hashing methods like double hashing and chaining help resolve collisions.
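To make the Poisson analysis mentioned here concrete, the sketch below (an illustration, not taken from the summarized document) estimates how many addresses receive exactly k records when records are hashed uniformly at random; the function name and the example numbers are assumptions.

```python
import math

def expected_addresses_with_k_records(n_addresses, n_records, k):
    """Poisson estimate of how many addresses receive exactly k records
    when n_records are hashed uniformly at random into n_addresses."""
    a = n_records / n_addresses                      # mean records per address
    return n_addresses * math.exp(-a) * a ** k / math.factorial(k)

# Example: 1000 records hashed into a 1000-address table (load factor 1).
for k in range(4):
    print(k, round(expected_addresses_with_k_records(1000, 1000, k), 1))
# About 368 addresses stay empty and 368 hold one record, so roughly
# 1000 - 632 = 368 records collide with an earlier arrival at their home address.
```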
This document provides an overview of indexing and hashing techniques for database systems. It discusses ordered indices like primary and secondary indices, which are based on sorted keys. It also covers hash indices, which distribute keys uniformly across hash buckets. The document evaluates different indexing techniques based on factors like access time, insertion/deletion time, and space overhead. It describes B+ tree indices, which maintain efficiency during data modifications. Multi-level indexing is introduced to handle large index files that do not fit in memory.
The document discusses different techniques for storing and searching data, including sequential search, binary search, and hashing. It provides details on open hashing and closed hashing, describing that closed hashing stores elements within buckets and can cause collisions when multiple elements are mapped to the same bucket. The document also outlines characteristics of good hash functions and different hashing methods like division, mid-square, folding, digit analysis, length dependent, algebraic coding, and multiplicative hashing.
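As a rough illustration of three of the hash-function methods listed above (division, mid-square, and folding), here is a minimal Python sketch; the function names and the choice of a two-digit middle slice are assumptions, not definitions from the summarized document.

```python
def division_hash(key, table_size):
    """Division method: remainder of the key modulo the table size."""
    return key % table_size

def mid_square_hash(key, table_size):
    """Mid-square method: square the key and take digits from the middle."""
    squared = str(key * key)
    start = max(len(squared) // 2 - 1, 0)
    return int(squared[start:start + 2]) % table_size

def folding_hash(key, table_size, chunk=2):
    """Folding method: split the key's digits into chunks and add them up."""
    digits = str(key)
    total = sum(int(digits[i:i + chunk]) for i in range(0, len(digits), chunk))
    return total % table_size

print(division_hash(123456, 97), mid_square_hash(123456, 97), folding_hash(123456, 97))
```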
Cosequential processing and the sorting of large files (Devyani Vaidya)
This document discusses techniques for efficiently sorting large files using cosequential processing and merge sort algorithms. It begins by defining cosequential operations and describing how they can be used for matching and merging sequential lists. It then provides examples of implementing matching and merging algorithms. The document focuses on using these algorithms to efficiently sort files too large to fit in memory by breaking them into runs, sorting the runs using an in-memory algorithm like heapsort, and then merging the runs. It discusses optimizations like using multiple buffers and disks to overlap I/O with processing.
The document discusses various indexing techniques used to improve data access performance in databases, including ordered indices like B-trees and B+-trees, as well as hashing techniques. It covers the basic concepts, data structures, operations, advantages and disadvantages of each approach. B-trees and B+-trees store index entries in sorted order to support range queries efficiently, while hashing distributes entries uniformly across buckets using a hash function but does not support ranges.
International Journal of Engineering Research and Development (IJERD Editor)
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture Engineering,
Aerospace Engineering.
Indexing and hashing are crucial techniques for efficiently finding and accessing data in databases. There are various types of indices such as ordered, hash, dense, sparse, and multilevel indices that each have their own tradeoffs regarding speed, space usage, and ease of updates. B-tree and B+-tree data structures provide fast indexed access while also efficiently handling updates. Hashing techniques like static, dynamic, and extendable hashing map data to buckets through hash functions but require mechanisms like overflow chaining to handle collisions. The most appropriate technique depends on factors like the query types and frequencies of data access, insertion, and deletion.
Interactive Knowledge Discovery over Web of Data (Mehwish Alam)
This document describes research on classifying and exploring data from the Web of Data. It discusses building a classification structure over RDF data by classifying triples based on RDF Schema and creating views through SPARQL queries. This structure can then be used for data completion and interactive knowledge discovery through data analysis and visualization. Formal concept analysis and pattern structures are introduced as techniques for dealing with complex data types from the Web of Data like graphs and linked data. Range minimum queries are also proposed as a way to compute the lowest common ancestor for structured attribute sets in the pattern structures.
The document discusses different file structures and indexing techniques used in databases. It describes secondary keys as keys that are not selected as the primary key but are candidate keys. It then explains inverted and multi-user files, and different file organization methods like sequential, heap, hash, and B-tree organizations. It provides details on B-trees and B+trees, including their properties, time complexities for operations, and structure for internal and leaf nodes in B+trees.
Framester: A Wide Coverage Linguistic Linked Data Hub (Mehwish Alam)
Framester is a linguistic linked data hub that aims to improve coverage of FrameNet by extending mappings between FrameNet and other resources like WordNet and BabelNet. Framester represents over 40 million triples linking linguistic and factual resources and aligning frames, roles, and types to foundational ontologies. It provides a word frame disambiguation service and was evaluated on annotated corpora, showing improved performance over previous approaches.
This document summarizes indexing and hashing techniques for database systems. It describes ordered indices like B-trees that store index entries in sorted order, and hash indices that distribute entries uniformly across buckets. B+-tree index files are introduced as an improvement over indexed-sequential files: they reorganize themselves automatically with small local changes, avoiding the need to periodically reorganize the entire file. The structure and properties of B+-tree nodes and trees are defined.
The session focused on Data Mining using R Language where I analyzed a large volume of text files to find out some meaningful insights using concepts like DocumentTermMatrix and WordCloud.
Spatial databases have become increasingly popular in recent years, and there is growing commercial and research interest in location-based search over spatial data. Spatial keyword search has been studied for years because of its importance to commercial search engines. Specifically, a spatial keyword query takes a user location and user-supplied keywords as arguments and returns objects that are spatially and textually relevant to those arguments. Geo-textual indexes play an important role in spatial keyword querying, and a number of them have been proposed in recent years, mainly combining the R-tree and its variants with the inverted file. This paper proposes a new index structure that combines the k-d tree and the inverted file for spatial range keyword queries, ranking results by spatial and textual relevance to the query point within a given range.
This document discusses hashing techniques for storing data in a hash table. It describes hash collisions that can occur when multiple keys map to the same hash value. Two primary techniques for dealing with collisions are chaining and open addressing. Open addressing resolves collisions by probing to subsequent table indices, but this can cause clustering issues. The document proposes various rehashing functions that incorporate secondary hash values or quadratic probing to reduce clustering in open addressing schemes.
Hashing is a technique used to store and retrieve information quickly by mapping keys to values in a hash table using a hash function. Common hash functions include division, mid-square, and folding methods. Collision resolution techniques like chaining, linear probing, quadratic probing, and double hashing are used to handle collisions in the hash table. Hashing provides constant-time lookup and is widely used in applications like databases, dictionaries, and encryption.
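The collision-resolution schemes named here differ mainly in the probe sequence they follow after a collision. A minimal sketch, assuming integer keys and a simple modulo hash, of the slots examined under linear probing, quadratic probing, and double hashing:

```python
def probe_sequence(key, table_size, method="linear"):
    """Yield the slots examined for `key` under three common open-addressing
    schemes; a real table would stop at the first empty slot."""
    h1 = key % table_size
    h2 = 1 + key % (table_size - 1)        # secondary hash for double hashing
    for i in range(table_size):
        if method == "linear":
            yield (h1 + i) % table_size
        elif method == "quadratic":
            yield (h1 + i * i) % table_size
        elif method == "double":
            yield (h1 + i * h2) % table_size

for m in ("linear", "quadratic", "double"):
    print(m, list(probe_sequence(42, 11, m))[:5])
```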
Indexing is used to speed up access to desired data.
E.g. author catalog in library
A search key is an attribute or set of attributes used to look up records in a file; it is unrelated to the keys of the database schema.
An index file consists of records called index entries.
An index entry for key k may consist of
An actual data record (with search key value k)
A pair (k, rid) where rid is a pointer to the actual data record
A pair (k, bid) where bid is a pointer to a bucket of record pointers
Index files are typically much smaller than the original file if the actual data records are in a separate file.
If the index contains the data records, there is a single file with a special organization.
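As a small illustration of the (k, rid) index-entry form described above, the sketch below keeps a dense index as a sorted list of (search-key, rid) pairs and follows the rid pointer into a separate data file; the record ids and key values are made up for the example.

```python
import bisect

# Hypothetical data file: record id -> record.
data_file = {10: "rec-A", 11: "rec-B", 12: "rec-C"}

# Dense index: sorted (search-key, rid) pairs, one entry per record.
index = [("adams", 11), ("brown", 12), ("clark", 10)]

def lookup(key):
    """Binary-search the index for `key` and follow the rid pointer."""
    keys = [k for k, _ in index]
    pos = bisect.bisect_left(keys, key)
    if pos < len(index) and index[pos][0] == key:
        return data_file[index[pos][1]]
    return None

print(lookup("brown"))   # rec-C
```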
The document discusses hashing and hash tables. It defines hashing as a technique where the location of an element in a collection is determined by a hashing function of the element's value. Collisions can occur if multiple elements map to the same location. Common techniques for resolving collisions include chaining and open addressing. The Java Collections API provides several implementations of hash tables like HashMap and HashSet.
Dynamic multi level indexing Using B-Trees And B+ Trees (Pooja Dixit)
B-Tree, properties of a B-Tree, B-Tree of minimum degree 3, drawbacks of B-Trees, B+ tree, structure of the internal nodes of a B+ tree, structure of the leaf nodes of a B+ tree, example of a B+ tree
This document discusses different searching methods like sequential, binary, and hashing. It defines searching as finding an element within a list. Sequential search searches lists sequentially until the element is found or the end is reached, with efficiency of O(n) in worst case. Binary search works on sorted arrays by eliminating half of remaining elements at each step, with efficiency of O(log n). Hashing maps keys to table positions using a hash function, allowing searches, inserts and deletes in O(1) time on average. Good hash functions uniformly distribute keys and generate different hashes for similar keys.
This document introduces inverted files, which are a core data structure for text search engines. It describes inverted files and how they allow for efficient indexing, construction, and querying. The document then outlines some common extensions to inverted file indexes, such as compression, phrase querying, and distribution. It concludes by providing context on text search and information retrieval.
The document discusses how hash maps work and the process of rehashing. It explains that inserting a key-value pair into a hash map involves: 1) Hashing the key to get an index, 2) Searching the linked list at that index for an existing key, updating its value if found or adding a new node. Rehashing is done when the load factor increases above a threshold, as that increases lookup time. Rehashing doubles the size of the array and rehashes all existing entries to maintain a low load factor and constant time lookups.
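A minimal sketch of that insert-then-rehash behaviour, assuming chaining with Python lists and a 0.75 load-factor threshold (both assumptions, not details from the summarized document):

```python
class ChainedHashMap:
    """Minimal chained hash map with rehashing, as sketched above."""

    def __init__(self, capacity=8, max_load=0.75):
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0
        self.max_load = max_load

    def put(self, key, value):
        idx = hash(key) % len(self.buckets)          # 1) hash the key to an index
        bucket = self.buckets[idx]
        for i, (k, _) in enumerate(bucket):          # 2) search the chain
            if k == key:
                bucket[i] = (key, value)             # update an existing key
                return
        bucket.append((key, value))                  # or add a new node
        self.size += 1
        if self.size / len(self.buckets) > self.max_load:
            self._rehash()

    def _rehash(self):
        """Double the array and re-insert every entry so chains stay short."""
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        self.size = 0
        for bucket in old:
            for k, v in bucket:
                self.put(k, v)

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return None
```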
Modified version of Chapter 18 of the book Fundamentals_of_Database_Systems,_6th_Edition with review questions
as part of database management system course
This document discusses information retrieval systems and how they differ from database systems. It covers topics like relevance ranking using terms, relevance using hyperlinks, indexing of documents, and measuring retrieval effectiveness. Information retrieval systems use a simpler data model than databases and focus on locating relevant documents based on keywords rather than structured querying. Web search engines are a common example of information retrieval systems.
The document discusses hashing techniques for storing and retrieving data from memory. It covers hash functions, hash tables, open-addressing techniques like linear probing and quadratic probing, and separate chaining with linked lists. Hashing maps keys to memory addresses using a hash function so that data can be stored and found in time largely independent of the number of items. Collisions may occur, and different collision-resolution methods are used: open addressing resolves collisions by probing within the table, while separate chaining keeps colliding records in linked lists. The efficiency of hashing depends on factors like the load factor and the average number of probes.
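The dependence on the load factor can be made concrete with the classical approximations for linear probing (a standard textbook result, stated here for reference rather than taken from the summarized document): with load factor $\alpha = n/m$,

```latex
E[\text{probes, successful search}] \approx \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right),
\qquad
E[\text{probes, unsuccessful search}] \approx \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^{2}}\right).
```

Both quantities stay small while the load factor is moderate and grow rapidly as it approaches 1, which is why open-addressing tables are usually kept well below full.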
The document discusses access paths in database management systems. It covers hashing and B-trees as the two main techniques used. Hashing maps attribute values to database addresses using a hash function, but requires reorganization if the file size changes. B-trees support efficient retrieval, range queries, and dynamic resizing through a balanced tree structure with index and leaf nodes. The document provides details on properties, implementation, and optimizations of hashing and B-trees.
Locality Sensitive Hashing (LSH) is a technique for finding similar items in large datasets. It works in 3 steps:
1. Shingling converts documents to sets of n-grams (sequences of tokens). This represents documents as high-dimensional vectors.
2. MinHashing maps these high-dimensional sets to short signatures or sketches, in a way that preserves similarity according to the Jaccard coefficient. It uses random permutations to select the minimum value in each permutation.
3. LSH partitions the signature matrix into bands and hashes each band separately, so that similar signatures are likely to hash to the same buckets. Candidate pairs are those that share a bucket in one or more bands, reducing the number of pairs that must be compared directly.
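A compact sketch of the three steps, using random salted hash functions in place of true row permutations for MinHash (a common simplification); the shingle size, signature length, and band count are arbitrary choices for illustration:

```python
import random

def shingles(text, n=3):
    """Step 1: represent a document as its set of character n-grams."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingle_set, num_hashes=50, seed=0):
    """Step 2: map a set to a short signature; the fraction of agreeing
    positions approximates the Jaccard similarity of the sets."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, s)) for s in shingle_set) for salt in salts]

def lsh_bands(signature, bands=10):
    """Step 3: split the signature into bands; documents sharing any band
    hash become candidate pairs."""
    rows = len(signature) // bands
    return [hash(tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]

a = minhash_signature(shingles("locality sensitive hashing"))
b = minhash_signature(shingles("locality-sensitive hashing"))
print(sum(x == y for x, y in zip(a, b)) / len(a))   # rough Jaccard estimate
```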
Hash Tables
The memory available to maintain the symbol table is assumed to be sequential. This memory is referred to as the hash table, HT. The term bucket denotes a unit of storage that can store one or more records. A bucket is typically one disk block size but could be chosen to be smaller or larger than a disk block.
If the number of buckets in a hash table HT is b, then the buckets are designated HT(0), ..., HT(b-1). Each bucket can hold one or more records; the number of records a bucket can store is its slot size, so a bucket is said to consist of s slots if it can hold s records.
A function used to compute the address of a record in the hash table is known as a hash function. Usually s = 1, in which case each bucket holds exactly one record.
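A minimal sketch of this bucket-and-slot organization, with b buckets HT(0) ... HT(b-1) of s slots each and a simple division hash; how overflow is handled is left open, as in the text above:

```python
class BucketHashTable:
    """Hash table HT with b buckets of s slots each. A full bucket signals
    an overflow that a real file structure would send to an overflow area
    or resolve by probing."""

    def __init__(self, b=8, s=2):
        self.b, self.s = b, s
        self.table = [[] for _ in range(b)]      # HT(0) .. HT(b-1)

    def hash(self, key):
        return key % self.b                      # simple division hash

    def insert(self, key, record):
        bucket = self.table[self.hash(key)]
        if len(bucket) < self.s:
            bucket.append((key, record))
        else:
            raise OverflowError("bucket full; overflow handling needed")

    def search(self, key):
        for k, record in self.table[self.hash(key)]:
            if k == key:
                return record
        return None
```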
This document discusses hashing and different techniques for implementing dictionaries using hashing. It begins by explaining that dictionaries store elements using keys to allow for quick lookups. It then discusses different data structures that can be used, focusing on hash tables. The document explains that hashing allows for constant-time lookups on average by using a hash function to map keys to table positions. It discusses collision resolution techniques like chaining, linear probing, and double hashing to handle collisions when the hash function maps multiple keys to the same position.
The document discusses hashing techniques and collision resolution methods for hash tables. It covers:
- Hashing maps keys of variable length to smaller fixed-length values using a hash function. Hash tables use hashing to efficiently store and retrieve key-value pairs.
- Collisions occur when two keys hash to the same value. Common collision resolution methods are separate chaining, where each slot points to a linked list, and open addressing techniques like linear probing and double hashing.
- Bucket hashing groups hash table slots into buckets to improve performance. Records are hashed to buckets and stored sequentially within a bucket, or in an overflow bucket if the home bucket is full. This reduces disk accesses when the hash table is stored on disk.
This document discusses hashing and its applications. It begins by describing dictionary operations like search, insert, delete, minimum, maximum, and their implementations using different data structures. It then focuses on hash tables, explaining how they work using hash functions to map keys to array indices. The document discusses collisions, good and bad hash functions, and performance of hash table operations. It also describes how hashing can be used for substring pattern matching and other applications like document fingerprinting.
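For the substring pattern matching application mentioned here, a rolling-hash (Rabin-Karp style) sketch is a natural illustration; the base and modulus below are arbitrary choices:

```python
def rabin_karp(text, pattern, base=256, mod=1_000_003):
    """Find occurrences of `pattern` in `text` by comparing a rolling hash
    of each window with the pattern's hash, confirming matches character
    by character to rule out hash collisions."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)                # weight of the leading character
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        if w_hash == p_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:                           # slide the window one character
            w_hash = ((w_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits

print(rabin_karp("hash tables use hashing", "hash"))   # [0, 16]
```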
Hashing is an algorithm that maps keys of variable length to fixed-length values called hash values. A hash table uses a hash function to map keys to values for efficient search and retrieval. Linear probing is an open-addressing collision-resolution technique: when a collision occurs, it searches sequentially for the next empty slot, wrapping around to the beginning if it reaches the end of the table. This can cause clustering, where many collisions pile up in the same area. Lazy deletion marks slots as deleted rather than emptying them, so that probe sequences are not broken.
This document discusses data structures and algorithms, specifically dictionaries and hash tables. It defines dictionaries as collections of keys and values where each key maps to a value. Hash tables are described as an optimized data structure for lookup with average constant time search using a hash function to map keys to indexes. The document covers implementing hash tables using linked lists to handle collisions, as well as common operations like put, get, and hasKey. It also provides hints for an assignment to implement a hash table data structure.
Trees are hierarchical data structures that consist of nodes connected by edges. They are used to store and access information efficiently. Binary trees are a type of tree where each node has at most two children. Graphs model relationships between objects using nodes connected by edges. Hash tables store key-value pairs and allow for very fast lookup, insertion, and deletion of data using hash functions, but collisions can decrease efficiency.
Hashing is a technique used to map data of arbitrary size to values of fixed size. It allows for fast lookup of data in near constant time. Common applications include dictionaries, databases, and search engines. Hashing works by applying a hash function to a key that returns an index value. Collisions occur when different keys hash to the same index, and must be resolved through techniques like separate chaining or open addressing.
Probabilistic data structures. Part 4. Similarity (Andrii Gakhov)
The book "Probabilistic Data Structures and Algorithms in Big Data Applications" is now available at Amazon and from local bookstores. More details at https://pdsa.gakhov.com
In this presentation, I described popular algorithms that employ Locality Sensitive Hashing (LSH) to solve similarity-related problems. I started with LSH in general and then switched to algorithms such as MinHash (LSH for Jaccard similarity) and SimHash (LSH for cosine similarity). Each approach comes with the math behind it and simple examples to clarify the theory.
Hashing is a common technique for implementing dictionaries that provides constant-time operations by mapping keys to table positions using a hash function, though collisions require resolution strategies like separate chaining or open addressing. Popular hash functions include division and cyclic shift hashing to better distribute keys across buckets. Both open hashing using linked lists and closed hashing using linear probing can provide average constant-time performance for dictionary operations depending on load factor.
Record linkage is used to identify records from different data sources that represent the same real-world entity. It involves preprocessing data, reducing the search space using blocking methods, computing similarity functions to compare records, and applying decision models to classify record pairs. A common blocking method is the sorted neighborhood method, which sorts records by a blocking key and compares nearby records within a fixed window. The effectiveness of record linkage depends on selecting good blocking keys and similarity functions.
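A minimal sketch of the sorted neighborhood blocking step described above; the record layout, blocking key, and window size are assumptions made for the example:

```python
def sorted_neighborhood_pairs(records, blocking_key, window=3):
    """Sort records by a blocking key and only compare records that fall
    within a sliding window of the sorted order, instead of all pairs."""
    ordered = sorted(records, key=blocking_key)
    candidate_pairs = set()
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + window, len(ordered))):
            candidate_pairs.add((ordered[i]["id"], ordered[j]["id"]))
    return candidate_pairs

records = [
    {"id": 1, "name": "Jon Smith"},
    {"id": 2, "name": "John Smith"},
    {"id": 3, "name": "Mary Jones"},
    {"id": 4, "name": "M. Jones"},
]
# Hypothetical blocking key: first three letters of surname + first initial.
key = lambda r: r["name"].split()[-1][:3].lower() + r["name"][0].lower()
print(sorted_neighborhood_pairs(records, key, window=2))
```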
Hashing and File Structures in Data Structure.pdf (JaithoonBibi)
Hashing is a technique for storing data in an array such that each element is assigned a unique location based on its key value. This allows for constant time retrieval but collisions can occur when two elements hash to the same location. Collision resolution techniques like chaining, linear probing, quadratic probing, and double hashing are used to handle collisions. File structures like sequential, indexed, and relative organization are used to store records on storage devices efficiently with different access methods. Indexing uses a separate index file to speed up retrieval by mapping keys to record locations.
The document describes the process of depth-first search (DFS) on a graph using an adjacency list representation. It shows the recursive DFS algorithm by stepping through an example graph with 8 nodes. At each step, it marks the currently visited node as visited, marks the predecessors, and makes recursive calls to visit neighboring unvisited nodes. This traces out the DFS tree, showing how the structure captures the recursive calls. It concludes that DFS finds valid paths in the graph and runs in O(V+E) time like breadth-first search when using an adjacency list.
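A compact sketch of recursive DFS over an adjacency list, in the spirit of the walkthrough described above (the example graph is made up, not the 8-node graph from the document):

```python
def dfs(graph, start, visited=None):
    """Recursive depth-first search over an adjacency-list graph; prints
    each node the first time it is reached. Runs in O(V + E)."""
    if visited is None:
        visited = set()
    visited.add(start)
    print("visit", start)
    for neighbour in graph[start]:
        if neighbour not in visited:
            dfs(graph, neighbour, visited)
    return visited

# Small example graph as an adjacency list (assumed for illustration).
graph = {1: [2, 3], 2: [4], 3: [4, 5], 4: [6], 5: [6], 6: []}
dfs(graph, 1)
```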
Data to be Signed: Suppose you have a message or data that you want to sign to prove its authenticity and integrity.
Hashing: First, this message is subjected to a cryptographic hash function (commonly SHA-1, SHA-256, or similar). The purpose of this step is to produce a fixed-length hash value, which represents the unique fingerprint of the data. The hash value is considerably shorter than the original data.
Signing: The DSA algorithm then uses this hash value to create a digital signature. The signature is generated using the private key of the signer and some mathematical operations.
Verification: The recipient of the data, who has access to the corresponding public key, can use the same cryptographic hash function to generate a hash value from the received data. They then use the sender's public key to verify the digital signature.
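A hedged sketch of this hash-then-sign-then-verify flow using the third-party Python `cryptography` package; the calls follow that library's documented DSA interface as best recalled here, so treat the exact signatures as an assumption rather than a definitive recipe:

```python
# Hash-then-sign-then-verify with DSA, using the `cryptography` package
# (API assumed from its documentation).
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import dsa
from cryptography.exceptions import InvalidSignature

message = b"data to be signed"

private_key = dsa.generate_private_key(key_size=2048)   # signer's key pair
signature = private_key.sign(message, hashes.SHA256())  # hash the data, sign the digest

public_key = private_key.public_key()
try:
    public_key.verify(signature, message, hashes.SHA256())  # recompute hash, check signature
    print("signature valid")
except InvalidSignature:
    print("signature rejected")
```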
This document provides information about dictionaries and hash tables. It defines dictionaries as dynamic sets that support operations like insertion, deletion, and searching. Hash tables are described as an efficient implementation of dictionaries that map keys to array positions using a hash function. The document discusses hash functions, collisions, open and closed addressing techniques to handle collisions, and qualities of good hash functions.
A hash table is a data structure that uses a hash function to map keys to unique indices in an underlying array. Collisions occur when two keys hash to the same index and must be resolved. Open addressing resolves collisions by probing through alternative array locations until an empty slot is found. Double hashing is an open addressing collision resolution technique that uses a secondary hash of the key as an offset when probing for the next index. Hash tables provide efficient lookup, insertion and deletion of key-value pairs and are used widely in applications like databases, caching and cryptography.
Hash tables store records in a bucket array using hash functions. Main memory hash tables store records directly in buckets, while secondary storage hash tables store records in blocks associated with buckets. Records are inserted by computing their hash value and storing in the corresponding bucket block. Hash tables can be static or dynamic, with dynamic tables like extendible hashing allowing the number of buckets to grow. Extendible hashing uses a level of indirection, doubles the number of buckets during growth, and splits blocks as needed during insertion. kd-trees and quad trees are data structures for multi-dimensional data that partition space using splitting planes or hyperplanes.
Hashing is a technique used to store and retrieve data efficiently. It involves using a hash function to map keys to integers that are used as indexes in an array. This improves searching time from O(n) to O(1) on average. However, collisions can occur when different keys map to the same index. Collision resolution techniques like chaining and open addressing are used to handle collisions. Chaining resolves collisions by linking keys together in buckets, while open addressing resolves them by probing to find the next empty index. Both approaches allow basic dictionary operations like insertion and search to be performed in O(1) average time when load factors are low.
presentation on important DAG,TRIE,Hashing.pptx (jainaaru59)
Directed acyclic graph (DAG) is used to represent the flow of values between basic blocks of code. A DAG is a directed graph with no cycles. It is generated during intermediate code generation. DAGs determine common subexpressions and the flow of names and computed values between blocks of code. An algorithm is described to construct a DAG by creating nodes for operands and adding edges between nodes and operator nodes. Examples show how expressions are represented by a DAG. The complexity of a DAG depends on its width and depth. Applications of DAGs include determining common subexpressions, names used in blocks, and which statements' values may be used outside blocks.
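A minimal sketch of DAG construction for three-address statements, reusing a node when the same operator meets the same operand nodes so that common subexpressions share one node; the statement format is an assumption for illustration:

```python
def build_dag(statements):
    """Build DAG nodes for (result, op, arg1, arg2) three-address statements,
    reusing a node when the same operator is applied to the same operand
    nodes, so common subexpressions are shared."""
    node_of = {}      # current node id for each name, plus (op, left, right) keys
    nodes = []        # node id -> description

    def node_for(name):
        """Leaf node for a name that has no node yet."""
        if name not in node_of:
            node_of[name] = len(nodes)
            nodes.append(("leaf", name))
        return node_of[name]

    for result, op, a, b in statements:
        key = (op, node_for(a), node_for(b))
        if key not in node_of:                 # new subexpression: create a node
            node_of[key] = len(nodes)
            nodes.append(key)
        node_of[result] = node_of[key]         # result name now labels that node

    return nodes

# a = b + c ; d = b + c  ->  the second statement reuses the "+" node.
print(build_dag([("a", "+", "b", "c"), ("d", "+", "b", "c")]))
# [('leaf', 'b'), ('leaf', 'c'), ('+', 0, 1)]
```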
The document summarizes XML Linking, which is developing specifications to enable more advanced hypertext functionality on the web. It examines goals and approaches, describes HTML linking limitations that XML Linking seeks to overcome, and surveys key specifications including XPath, XPointer, and XLink. These specifications aim to provide a standard way to address portions of resources, support external and bidirectional links, and separate linking structure from behavior.
The document provides an overview of approaches for clustering XML data based on structure and content. It first outlines applications where XML clustering is useful, including XML query processing and data integration. It then presents a generic framework for XML clustering with three phases: data representation, similarity computation, and clustering/grouping. The document surveys current approaches and aims to classify them and identify common features. It also discusses challenges in XML clustering and future research directions.
This document provides a survey of word sense disambiguation (WSD). It introduces WSD as the task of identifying the meaning of words in context computationally. WSD is considered an AI-complete problem due to various challenges, including knowledge representation and acquisition bottlenecks. The document surveys supervised, unsupervised, and knowledge-based WSD approaches, and discusses evaluation methods and applications, as well as open problems in the field.
Web page classification features and algorithms (unyil96)
This document summarizes research on classifying web pages. It discusses how web page classification is important for tasks like maintaining web directories, improving search results, and building focused crawlers. The document reviews different types of web page classification problems and features that are useful for classification, like content-based features and link-based features. It also discusses algorithms that have been used for web page classification.
The document discusses the history and significance of links in hypertext and hypermedia. It covers:
- The evolution of links from static embedded links to dynamic links stored separately in link databases.
- The distinction between navigation using links that don't require similarity, versus retrieval which relies on similarity between a query and document.
- The challenges of extending content-based retrieval and navigation to non-text media like images and video.
- The goal of building systems that can extract semantics from media and associate media with concepts to enable more versatile concept-based navigation and retrieval.
Techniques for automatically correcting words in text (unyil96)
The problem of automatically correcting words in text has been an ongoing research challenge since the 1960s. Existing spelling checkers and text recognition techniques are limited in their accuracy. Three main areas of research have focused on detecting and correcting (1) nonwords, (2) isolated misspelled words, and (3) context-dependent real-word errors. While progress has been made, fully automatic correction of all word errors requires techniques that can analyze contextual information to detect errors resulting in other valid words.
Strict intersection types for the lambda calculus (unyil96)
This article discusses strict intersection types for the lambda calculus. It focuses on an essential intersection type assignment system (E) that is almost syntax directed. The system E is shown to satisfy all major properties of the Barendregt-Coppo-Dezani type system (BCD), including the approximation theorem, characterization of normalization, completeness of type assignment using filter semantics, strong normalization for cut-elimination, and the principal pair property. Some proofs of these properties for E are new. E is a true restriction of BCD and provides a less complicated approach than BCD while achieving the same results.
Smart meeting systems: a survey of the state-of-the-art (unyil96)
Smart meeting systems aim to automatically record, analyze, and summarize meetings. The article surveys the state-of-the-art technologies in smart meeting systems, including their typical architecture, methods for capturing meetings through video, audio and other sensors, techniques for recognizing meeting content, processing meeting semantics, and evaluating system performance. It also discusses various open issues that could extend the capabilities of current smart meeting systems.
Semantically indexed hypermedia linking information disciplines (unyil96)
This document discusses semantic indexing as a way to facilitate access to information in large hypertexts and the semantic web. Semantic indexing represents semantic knowledge about a domain through a controlled vocabulary of index terms with semantic relationships. This allows indirect navigation between information items based on queries to the semantic index space. Reasoning over the semantic relationships can provide intelligent navigation support through techniques like expanding queries, filtering options, and ranking results based on semantic distance in the index.
The document presents an overview of searching in metric spaces. It discusses how similarity searching is needed for unstructured data like text, images, and audio, where exact matching is not possible. It describes how similarity is modeled using a distance function between objects in a metric space. The document surveys existing solutions from different fields that address proximity searching in metric spaces and vector spaces. It aims to provide a unified framework to analyze and categorize existing algorithms.
Searching in high dimensional spaces: index structures for improving the perfo... (unyil96)
This document provides an overview of index structures for improving the performance of multimedia databases. It discusses how multimedia databases require content-based retrieval of similar objects, which is challenging due to the high-dimensional nature of feature spaces used to represent multimedia objects. The document summarizes the problems that arise from processing queries in high-dimensional spaces, known as the "curse of dimensionality", and provides an overview of index structure approaches that have been proposed to overcome these problems to efficiently process similarity queries in multimedia databases.
Realization of natural language interfaces using... (unyil96)
The document discusses research on using lazy functional programming (LFP) to build natural language interfaces (NLIs). LFP involves delaying evaluation of function arguments until needed. Over 45 researchers have investigated using LFP for NLI design and implementation due to similarities between some linguistic theories and LFP theories. The research has resulted in over 60 papers on using LFP for natural language processing tasks like syntactic and semantic analysis. The paper provides a comprehensive survey of this research area at the intersection of computer science and computational linguistics.
This document surveys ontology visualization methods. It begins by defining ontologies as sets of concepts and relationships in a domain that have proven useful for digital libraries, the semantic web, and personalized information management. However, effectively visualizing ontologies is challenging due to the complex relationships and attributes involved. The document aims to categorize existing ontology visualization techniques and their characteristics in order to help with method selection and further research. It provides context on related work reviewing data visualization techniques before analyzing ontology visualization methods in detail.
On nonmetric similarity search problems in complex domains (unyil96)
This document surveys the use of nonmetric similarity functions for efficient similarity search across complex domains. It begins by discussing the growth of digital data and need for content-based retrieval beyond text-based search. Similarity functions were traditionally metric, but increasingly complex data requires nonmetric functions. The document scopes the topic to context-free, static nonmetric functions and surveys domains using them along with techniques for efficient nonmetric similarity search, both exact and approximate. It aims to demonstrate the importance of nonmetric search across disciplines and review current methods.
The document discusses nonmetric similarity search problems in complex domains. It begins by defining similarity measuring and similarity search. Specifically, it defines similarity spaces, similarity functions that assign scores to object pairs, and two common similarity queries: range queries and k-nearest neighbor queries. The document then surveys domains that require nonmetric similarity functions for effective similarity search, and methods for efficient nonmetric similarity search.
This document provides an overview of multidimensional access methods for spatial databases. It discusses how spatial data has complex structures and is often dynamic, with large volumes. Specialized indexing is needed to support common spatial queries like point queries and region queries that search for objects within a given point or region. The document surveys both point access methods for searching point data and spatial access methods for extended objects like rectangles or polyhedra. It concludes by discussing theoretical and experimental performance analyses of different access methods.
This document provides a survey of machine transliteration approaches. It begins with an introduction to machine transliteration, defining it as the process of transforming the script of a word from one language to another while preserving pronunciation. The survey then reviews the key methodologies introduced in the transliteration literature, categorizing approaches based on the resources and algorithms used and comparing their effectiveness.
Machine learning in automated text categorization (unyil96)
This document summarizes a research paper on machine learning for automated text categorization. It discusses how machine learning techniques are used to automatically build classifiers that can categorize texts into predefined categories. Specifically, it discusses how machine learning involves using a set of pre-classified documents to learn the characteristics of different categories and build a classifier. This classifier can then categorize new texts. The document also discusses key aspects of text categorization like document representation, classifier construction, and classifier evaluation.
This article surveys probabilistic approaches to modeling information retrieval. It outlines the basic concepts of probabilistic IR and describes various probabilistic models proposed over time, classifying and comparing them using a common formalism. The article also describes new approaches that constitute the basis of future research in probabilistic IR modeling.
Integrating content search with structure analysis for hypermedia retrieval a... (unyil96)
This document summarizes research on integrating content search and structure analysis for hypermedia retrieval and management. It discusses how link analysis and topic distillation techniques can organize query results and identify authoritative pages. Database approaches aim to facilitate search, navigation and associating web pages through extended query languages and logical document representations. Overall the paper outlines the state-of-the-art in utilizing both content and link structure to improve hypermedia search and organization.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
CONTENTS
Introduction
HASHING FUNCTIONS
COLLISION AND BUCKET OVERFLOW
THEORETICAL ANALYSES
ALTERNATIVES TO HASHING
FURTHER AREAS OF STUDY
Acknowledgments
References
determined value. The data item R_i1 is called the key.
As an example, consider a symbol table in an assembler, compiler, interpreter, or more general translator. There is one record R_i corresponding to each identifier in the source program. The identifier itself is R_i1; the other information which must be associated with that identifier (an address, type, rank, dimensions, etc.) is R_i2, R_i3, and so forth. Each time an identifier is encountered as the source program is processed, we must find the record R_i which contains that identifier as R_i1. If there is no such record, we must create a new one. All of the records are normally kept in main memory.
As another example, consider a data base. Here the records are normally kept on disk or some other form of addressable auxiliary memory. The key, R_i1, in each record R_i may be a part number, an airline flight number and date, an employee number, an automobile license number, or the like. Whenever a transaction takes place, we must find, on disk, that record which contains a given key value.
All hashing methods involve a hash code or hashing function or mapping function or randomizing technique or key-to-address transform, which we shall denote by h. If K is an arbitrary key, then h(K) is an address. Specifically, h(K) is the address of some position in a table, known as a hash table or scatter storage table, at which we intend to store the record whose key is K. If we can do this, then if at some later time we want to search for the record whose key is K, all we have to do is to calculate h(K) again. This is what makes hashing methods so inviting; most of the time, we can find the record we want immediately, without any repeated comparison with other items.
The only time we cannot store a record at its home address--that is, at the address h(K), where K is the key in that record--is when two or more records have the same home address. In practice, there is no way we can prevent this from happening, because there are normally several orders of magnitude more possible keys than there are possible home addresses. For example, with six bits per character, six characters per word, and eighteen bits per address, there are 2^18 possible addresses, while the number of possible six-character alphabetic identifiers is 26^6, which is over 1000 times greater than 2^18.
The phenomenon of two records having the same home address is called collision, and the records involved are often called synonyms. The possibility of collision, although slight, is the chief problem with hash table methods. It may be circumvented in a number of ways, as follows:
1) If the home addresses are in main memory, we can make a list of all the synonyms of each record. To find a record in the table, we would first find its home address, and then search the list which is associated with that particular home address.
2) Alternatively, we can determine some address, other than the home address, in which to store a record. We must then make sure that we can retrieve such a record. This subject is taken up further in the section below on "Collision and Bucket Overflow."
3) There are some situations in which more than one record can be stored at the same address. This is particularly common when the records are to be stored on disk. A disk address is often the address of an area (such as a track) which is large enough to hold several records. Similarly, a drum channel is an addressable area which may hold several records. For the purposes of hashing, such an area is called a bucket. If n records can be stored in a bucket, then any record which is kept in that bucket can have n-1 synonyms, with no problem. Of course, it might conceivably have more synonyms than that, in which case we have to find another bucket. This is called overflow, and will also be taken up in the section on "Collision and Bucket Overflow."
4) Finally, we can seek to minimize the probability of collisions--or, for records in buckets as above, the probability of bucket overflow--by a suitable choice of the function h(K). Intuitively, we can see that this should be more likely to happen the more thoroughly the hashing function "mixes up" the bits in the key (in fact, this is how "hashing" got its name). Ideally, for a hash table of size n, the probability that two randomly chosen identifiers have the same home address should be 1/n. However, there are hashing methods which are actually "better than random" in that it is possible to prove that all members of a certain class of identifiers must have distinct home addresses. This is considered in the following section, "Hashing Functions"; further "Theoretical Analyses" are taken up in the section so titled.
A hash table has fixed size. If it is necessary to increase the size of a hash table during a run, all quantities currently in the
table must have their hash codes recalculated, and they must be stored at new locations. Also, hashing is applicable only to
certain rather simple (although quite commonly occurring) searching situations. The
section titled "Alternatives to Hashing"
offers solutions for these problems, together
with guidelines as to whether hashing is or is
not applicable in various situations.
Hashing was first developed by a number of IBM personnel, who did not publish their work. (See [14], pp. 540-541, for a discussion of this early work.) Hashing methods were first published by Dumey [9], and independently by Peterson [27]. The term "synonym" seems to be due to Schay and Raver [30]. Several surveys of hash table methods exist, of which the best (although also the oldest, and restricted to file addressing methods) is that of Buchholz [6]; also see Morris [24], Knuth [14], and Severance [32]. An interesting method of recalculating hash codes when the hash table size is increased is given by Bays [2].
HASHING FUNCTIONS
As an example of a hashing function, consider the following. Take the key and divide it by some number n. Consider the remainder. This is an integer less than n and greater than or equal to zero. Hence we may use it as an index in a table of size n. In other words, if T is the starting address of a hash table and K is an arbitrary key, then h(K) = T + MOD(K, n) (as it is referred to in FORTRAN) is the home address of K.
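As a concrete illustration (ours, not from the survey), here is a minimal sketch of the division method in Python; it assumes keys are non-negative integers, and that character keys are first packed into an integer, much as they would occupy a machine word.

    # Division method: the home address is the remainder on dividing the key
    # by the table size n (the table is indexed 0 .. n-1 here).
    def division_hash(key: int, n: int) -> int:
        return key % n

    # Character keys must first be encoded as integers; packing the bytes of
    # the string is one simple choice.
    def encode(key: str) -> int:
        return int.from_bytes(key.encode("ascii"), "big")

    table_size = 13
    for k in ("PRA", "PRB", "PRC"):
        print(k, division_hash(encode(k), table_size))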
This is the division method of calculating h(K). At least six other suggestions have been made, over the years, as to what hashing function h(K) should be used. The choice of a hashing function is influenced by the following considerations:
1) Are we allowed to choose our hashing function to fit the particular records being searched? In an assembler or compiler, the answer is clearly no; we do not know what identifiers we will encounter in the source program until we actually encounter them. In a data base environment, however, the answer might be yes, particularly if we have a data base which remains constant, or nearly so, over a period of time. This is discussed further when we consider the digit analysis method.
2) Does it matter to us how long it takes to calculate h(K)? If our records are on disk or drum, the answer might very well be no, particularly if access time is large. In such a case, we should use the method which minimizes the number of disk or drum accesses, regardless of how long it takes to calculate h(K). We shall see in the discussion of "Theoretical Analyses" that this often means we want to use the division method as outlined above, even on a computer which does not have a divide instruction. In an assembler or compiler, on the other hand, it would not make much sense to minimize the number of accesses to the hash table, when a single calculation of h(K) could, for some hashing functions h, take as long as several such (main memory) accesses.
3) Are we allowed to design hardware to implement the hashing function? Most computers do not have instructions which perform hashing directly; we must use some combination of existing instructions. This is quite unfortunate, since hashing is a common operation in a variety of programs. One hashing function (algebraic coding, described below) is constructed specifically for implementation in hardware. Another possibility would be to design a hashing instruction which incorporates one or another of the collision algorithms discussed in the next section.
4) How long are the keys? If we are concerned about calculation time, we might not want to use the division method for keys which are longer than one computer word. For longer keys, we may combine one of the other methods with the method of folding, which is discussed further on.
5) Are the addresses binary or decimal? Some of the methods described below rely heavily on specific instructions which are normally available only in binary or only in decimal form.
6) Are we free to choose which addresses or indices will be used in our table? If so, we can choose the table size to be a power of 2 (or a power of 10, if the addresses are in decimal form). This is required by certain of the methods discussed in this paper. It may be, however, that our hashing routines are to be used in an environment in which the addresses (presumably on drum or disk) have already been chosen in some other way.
7) Are the keys more or less randomly chosen? Usually there will be some degree of uniformity in the keys. It is very often possible to show that, for a certain choice of hashing function, a wide class of plausible keys will all have the same hash code (or home address). Also, it may be that the keys are so nonrandom that we may not want to use hashing at all. This is discussed further in the section, "Alternatives to Hashing."
Under certain conditions, we may show that keys constructed in special ways do not have the same home address. In particular, suppose that K is a key and suppose that K+1, K+2, etc., are also keys. If the division method above is used, and h(K) = a, then h(K+1) = a+1, h(K+2) = a+2, and so on (modulo the size of the table). Thus, under these conditions, all of these keys must have distinct home addresses. This subject is considered further in the discussion of "Theoretical Analyses."
Let us now look at various other methods which have been proposed for performing hashing.
a) The random method. Almost all computers have subroutines which are used in statistical work when a random sequence of numbers is desired; they are called "random number generators," although the numbers they produce are actually not quite random (in fact such subroutines are more properly referred to as pseudo-random number generators). Typically such a subroutine is provided with a starting number, called the seed, and proceeds to produce, when called repeatedly, a random-looking sequence of numbers. The idea here is to use the key K as the seed, choose the first of the random numbers for the home address, and use the others, if necessary, for collision handling (see the following section). The output of a random number generator is usually a floating-point number between 0 and 1, and this must be normalized by first multiplying by n and then converting from real to integer, giving a random number between 0 and n-1 inclusive. We remark for future reference that, if this method produces a truly random result, the probability of any two keys having the same home address is strictly positive, even if the keys are such that, as noted before, that probability would be zero if we used the division method. Thus, even perfect randomization gives less than optimal results in many cases.
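A brief sketch of the random method (an illustration of ours, with Python's standard generator standing in for whatever library routine a given machine provides): the key seeds the generator, the first draw is normalized to give the home address, and further draws give addresses for collision handling.

    import random

    # The key is used as the seed; each draw is a float in [0, 1), which is
    # normalized by multiplying by n and truncating to an integer in 0 .. n-1.
    def random_addresses(key: int, n: int):
        rng = random.Random(key)
        while True:
            yield int(rng.random() * n)

    addresses = random_addresses(123456, 1024)
    home = next(addresses)        # home address
    alternate = next(addresses)   # first subsequent address, if a collision occurs
    print(home, alternate)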
b) Midsquare and other multiplication methods. Take the key and multiply it either by itself or by some constant. This presumably "hashes up" the bits rather thoroughly (this assumption is often false, and must be thought through carefully in every hashing situation). Now pick out a certain field, say 10 bits long, from the middle (approximately) of the result. Mask out everything but this 10-bit field, and consider it as a more or less random integer between 0 and 2^10 - 1, inclusive. This means that the hash table will be 2^10 = 1024 words long. If we need a larger or a smaller hash table, we can use a different power of 2.
c) The radix method. Take the key and consider it as a string of octal digits. Now consider this same string of digits as if they were digits, not in base 8, but in some other base (say 11). Convert the resulting number, in base 11, to base 10 (if you want a decimal address). This process "hashes up" the bits as before, and we can now take some field as in the midsquare method. This presumes, of course, that the table has size 10^n, for some n; for a table of size 10,000, a four-digit field is required. A general discussion of this method, with parameters p, q, and m (we have here taken p = 11, q = 10, and m = 4) is given by Lin [16].
d) The algebraic coding method. Suppose the key K is n bits long. Consider a polynomial of degree n-1, each of whose coefficients is one of these n bits. (Each coefficient is therefore either 0 or 1.) Divide this polynomial by some constant kth degree polynomial; all arithmetic operations in this polynomial division are to be carried out modulo 2. (Mathematically, this amounts to performing the polynomial division over GF(2), the Galois field of two elements.) Consider the remainder, just as in the division method; this is a polynomial of degree k-1, and may be considered as a string of k bits. In algebraic coding theory, we would append these k bits as check bits to the original n bits; if the constant polynomial by which we divide is chosen carefully, we obtain an error-correcting code. In the present application, however, we simply use the resulting k bits as the home address of K. As in the midsquare method, this requires that the table size be a power of 2 (in this case, 2^k).
e) Folding and its generalizations. A very fast method of obtaining a k-bit hash code from an n-bit key, for k < n, is by picking out several k-bit fields from the n-bit key and adding them up, or, alternatively, taking the exclusive OR. This is sometimes called "fold-shifting," since it is related to the original idea of folding, which is as follows. Consider the key as a string of digits written down on a slip of paper which is then folded as in Figure 1. All the digits which are "next to each other" as a result of the folding process are now added up. It will be seen that this is the same as fold-shifting except that some of the fields have reversed bit (or digit) strings, making the method considerably slower in that case. Folding is often combined with other methods when the keys are long; for example, for 16-byte (128-bit) keys on the IBM 370, we may consider these as four single words, add these up (or exclusive-OR them), and apply division or some other method to the single-word result.
[Figure 1. Folding (sometimes, in the form illustrated, known as the fold-boundary method).]
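A minimal sketch of fold-shifting (our illustration, assuming the key is held as an integer): the key is cut into k-bit fields which are combined with exclusive OR, yielding a k-bit hash code for a table of size 2^k.

    # Fold-shifting: combine successive k-bit fields of the key with XOR
    # (summing the fields, as in plain folding, works equally well).
    def fold_shift(key: int, k: int) -> int:
        mask = (1 << k) - 1
        h = 0
        while key:
            h ^= key & mask   # take the low-order k-bit field
            key >>= k         # expose the next field
        return h              # an address in 0 .. 2**k - 1

    print(fold_shift(0x123456789ABC, 10))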
f) Digit analysis. Of all the methods presented here, this is the only one which depends on the specific data to be hashed. It is usually used with decimal-number keys and decimal-number addresses, and in this context it works as follows. Look at the first digit of each key. If there are N different keys, then N/10 of these, on the average, should have the first digit zero, another N/10 should have the first digit 1, and so on. Suppose that there are actually N_i keys whose first digit is i, 0 <= i <= 9. Sums such as

    sum from i = 0 to 9 of |N_i - N/10|    or    sum from i = 0 to 9 of (N_i - N/10)^2

represent the "skewness" of the distribution of the first digit. Repeat this analysis for each digit, and find the k digits which are the best--that is, which have the least amount of skewness, using one or the other measure or both--where k is the number of digits in an address. The home address of an arbitrary key is now found by "crossing out" all but those particular digits from that key.
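The following sketch (ours, assuming fixed-width decimal keys given as strings) shows one way to carry out the digit analysis: compute the skewness of each digit position, keep the k least-skewed positions, and form addresses by crossing out all other digits.

    # Digit analysis: pick the k digit positions whose digit distribution is
    # closest to uniform, measured here by sum |N_i - N/10|.
    def least_skewed_positions(keys, k):
        n = len(keys)
        width = len(keys[0])
        skew = []
        for pos in range(width):
            counts = [0] * 10
            for key in keys:
                counts[int(key[pos])] += 1
            skew.append((sum(abs(c - n / 10) for c in counts), pos))
        return sorted(pos for _, pos in sorted(skew)[:k])

    def home_address(key, positions):
        # "Cross out" every digit except the chosen positions.
        return int("".join(key[p] for p in positions))

    keys = ["914728", "914655", "913982", "914301", "915540"]
    positions = least_skewed_positions(keys, 2)
    print(positions, [home_address(k, positions) for k in keys])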
The division method is mentioned explicitly by Dumey [9]. We have already mentioned Lin's work on the radix method. The algebraic coding method seems to have been first presented by Hanan and Palermo [11] and Schay and Raver [30]. The origin of the other methods we have mentioned is lost at the present time. Buchholz [6] surveys all of these methods; an introductory account of key-to-address transformation by Price [28] mentions folding and digit analysis, while a random search method is mentioned in a "pracnique" by McIlroy [23]. Several books ([8], [12], [14], [22], [38]) also survey hashing methods.
COLLISION AND BUCKET OVERFLOW
We draw in this survey a sharp distinction between the problem of collision and the problem of overflow of buckets. These problems arise in similar ways, but they are not identical. A collision problem arises when we are assuming that, most of the time, all of our keys have distinct home addresses. When two keys happen to have the same home address, then we invoke a collision handling method. An overflow problem arises when we are assuming that, most of the time, no more than n keys, for some fixed value of n, have the same home address. In this case each bucket will contain n records. We invoke an overflow handling method in the few cases where there is one home address corresponding to more than n keys.
There are quite a number of problems which have to be solved in connection with collision; and, in order to give the reader a feel for these problems, we now describe one particular collision handling method in detail. Suppose that the home address of key K is a; we store at a the record whose key is K. Now suppose that another key, K', also has home address a. When we try to store the new record at a, we find that the old record is already there; so we store the new record at a+1 instead. This raises the following further problems:
1) There may be a record at a+1 already, as well. In this case we try a+2, a+3, and so on.
2) We have to be able to tell whether there is, in fact, a record currently at any given position. This means that we have to have a code which means "nothing is at the given location"; the entire table is initialized at the start of our program in such a way that every position contains this special code.
3) If a is the last location in the table, and there is something there already, we cannot store the new record at a+1; so we cycle back to the beginning of the table (that is, we try to store the new record as the first record in the table).
4) If we ever have to take any records out of the table, there is a rather subtle difficulty which will be discussed below under deletions.
5) The next position in our table after the one with address a might have address (say) a+8, rather than a+1. This would happen, for example, if the records were double words on the IBM 370, each of which is eight bytes long. The adjustments in our algorithms which are necessary in order to take care of this phenomenon are so trivial that we will ignore them hereafter.
This is the linear method of handling collisions. If a = h(K) is the home address, then we shall refer to a+1, a+2, and so on (possibly adjusted as in (5) preceding) as the subsequent addresses for the record whose key is K. In general, any hashing method involves a home address h(K) and a sequence of subsequent addresses, which we shall denote by h1(K), h2(K), and so on. Simplified flowcharts for storing and retrieving a record, given an arbitrary hashing function and collision handling method, are shown in Figure 2.
[Figure 2. Storing and retrieving a record in a hash table in main memory: calculate the home address of the current record (or of the record to be searched), and, while the address in hand holds some other record, calculate the next subsequent address.]
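A compact sketch of the linear method in main memory (ours; the flowcharts in Figure 2 describe the same logic). The table is initialized to a special "empty" code, and both storing and retrieving step through a, a+1, a+2, ..., wrapping around at the end of the table.

    EMPTY = None   # the special code meaning "nothing is at the given location"

    class LinearHashTable:
        def __init__(self, size):
            self.slots = [EMPTY] * size

        def _home(self, key):
            return hash(key) % len(self.slots)    # any hashing function h(K)

        def store(self, key, record):
            a = self._home(key)
            for i in range(len(self.slots)):
                j = (a + i) % len(self.slots)     # subsequent addresses, cyclically
                if self.slots[j] is EMPTY or self.slots[j][0] == key:
                    self.slots[j] = (key, record)
                    return
            raise RuntimeError("hash table is full")

        def retrieve(self, key):
            a = self._home(key)
            for i in range(len(self.slots)):
                j = (a + i) % len(self.slots)
                if self.slots[j] is EMPTY:
                    return None                   # an empty slot ends the search
                if self.slots[j][0] == key:
                    return self.slots[j][1]
            return None

    t = LinearHashTable(8)
    t.store("GENER1", "address of GENER1")
    t.store("GENER2", "address of GENER2")
    print(t.retrieve("GENER2"))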
There is also a linear method of handling bucket overflow, known as open addressing (or progressive overflow), which works much the same way. If we are putting a new record in the table, and the home address for that record refers to a bucket which is entirely full of records, we try the next bucket. We will refer to the address of the next bucket, the next one after that, and so on, as the subsequent addresses h1(K), h2(K), and so on, just as before. Simplified flowcharts for storing and retrieving records in buckets, given arbitrary hashing and bucket overflow methods, are given in Figure 3.
[Figure 3. Storing and retrieving records in buckets: calculate the home address, and, while the bucket at the address in hand is full (or does not contain the record sought), calculate the next subsequent address.]
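The bucket version can be sketched in a few lines (again ours): each address names a bucket that holds up to c records, and a record whose home bucket is full is placed in the next bucket, cyclically.

    # Open addressing (progressive overflow) for buckets of capacity c.
    def store_in_bucket(buckets, c, key, record, home):
        for i in range(len(buckets)):
            b = buckets[(home + i) % len(buckets)]
            if len(b) < c:
                b.append((key, record))
                return (home + i) % len(buckets)  # where the record actually went
        raise RuntimeError("file is full")

    buckets = [[] for _ in range(4)]   # four buckets, e.g. one per track
    for k in (8, 12, 16, 20):          # every key here has home bucket 0 (k mod 4)
        store_in_bucket(buckets, 2, k, "record %d" % k, k % 4)
    print(buckets)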
Just as there are several methods of constructing hashing functions, so there are
various ways of handling collisions and
handling overflow. The choice of method is
influenced by the following:
1) Suppose we have just accessed a record with address α. How long does it take to access another record with address β; and, more importantly, in what ways does this timing depend on the relation between α and β? For example, if α and β are disk addresses, the time will be considerably shorter if they are on the same cylinder (for a movable-head disk), because then the access arm does not have to move. Similarly, if β immediately follows α on a disk track, then it is much faster to get from α to β than it is, say, to get from β all the way around the track to α again.
2) Does it matter how long it takes to calculate the subsequent addresses? If our records are in auxiliary memory, the answer is almost certainly no. If they are in main memory, however, there arises the possibility that the improvement of one method over another in reducing the number of accesses is more than cancelled out by the increased calculation time for each of the subsequent addresses.
3) How well does our basic hashing method perform? We can improve the performance of a hashing method by using a good collision-handling method. In particular, if there are a number of synonyms of a particular record, certain methods allow us to find most of these synonyms by calculating only one subsequent address in each case. A similar phenomenon is clustering, in which certain records are not synonyms, but have hash codes in sequence, so that, if the linear method is used, they occupy the same space in the hash table as if they were synonyms.
We will now present a number of methods of handling collision and bucket overflow that have been used as alternatives to the linear and open addressing methods described above.
a) The overflow-area method. This is extremely simple, and is used when records are to be kept on disk or drum and the expected probability of overflow is low. It consists simply in having a general overflow bucket for all overflow records, or, sometimes, a separate such area for each cylinder on the disk.
b) The list method. In main memory, this consists of making a list (in Knuth's notation [14], a list with linked allocation) of all the synonyms of each record, and placing the head of this list at the home address of the given record. Items on such lists reside in a free storage area which is separate from the hash table and which, unlike the hash table, may be expanded in length without any necessity for reorganization. As in the case of the linear method, there is a bucket overflow handling method similar to this one, known as the closed chaining method, in which each bucket contains a pointer to that bucket which contains its overflow records.
c) The random method. We recall from the previous section that the random method of obtaining a home address also supplies us with subsequent addresses to be used if collision occurs. This has one immediate advantage over the list method and over the linear method, in that there is no need to make a linear search of a list of synonyms.
d) The quadratic method and its extensions. Suppose that our data are left-justified, blank-filled identifiers, and suppose that we have several of these in sequence and of maximum length (such as GENER1, GENER2, GENER3, and so on, for 6 bits per character and 36 bits per word). If we use the division method of hashing, the home addresses of these will also be in sequence, and, as we have noted, they will, in particular, all be different. However, if we use the linear method of handling collisions, we have to search through this cluster of records whenever we have a subsequent key whose home address falls anywhere within the cluster. The original quadratic method, devised by one of the authors of this paper [22], consists of calculating the subsequent address h_i(K) as h(K) + mi + ni^2, for some fixed values of m and n. This device circumvents the difficulty mentioned above, although it does nothing about the problem of a large number of actual synonyms of a record. The quadratic method has been subject to numerous extensions ([1], [3], [4], [5], [7], [17], [29]), some of which are formulated to attack the synonym problem as well. Many of these extensions are included in the survey by Severance [32].
Whenever there are deletions from a hash table, the positions at which these occur cannot merely be reset to the same value they had originally (meaning "no record at this position"). The reason is that, when a search for another item encounters this record, the search will terminate, and the search routine will conclude, perhaps erroneously, that the given item is not to be found in the table. Instead, a second special code, signifying a deleted item, should be used.
The open method, as presented here, was mentioned by Peterson [27] and modified by Schay and Spruth [31]; an analysis of the modified method is carried out by Tainiter [33]. The overflow-area method is compared with the open method by Olsen [26]. Closed chaining is mentioned first by Johnson [13] and is compared to open chaining in a comparative study by Lum, Yuen, and Dodd [18] of a number of hashing methods in a file system context. The observation made here concerning deletions was noted by Morris [24].
THEORETICAL ANALYSES
A certain amount of common sense should be applied before any hashing method is used. For example, we have noticed that quite a number of hashing methods require that the size of the table (for binary addresses) be a power of 2. It is not hard to see, however, that this would be disastrous if we used the division method. The remainder upon dividing an arbitrary binary quantity by 2^k is simply the last k bits of that quantity. An extreme case happens when k <= 12 and the keys are left-justified, blank-filled identifiers, with 6 bits per character and c characters per word; all identifiers of length less than or equal to c-2 would be given the same home address.
Similar difficulties can occur with other hashing methods, for slightly more involved reasons. The CDC 6000 series of computers do not have fixed-point divide instructions,
but do have floating-point multiplication. If we use the midsquare method on such a computer, however, all one- and two-character identifiers (again left-justified and blank-filled) have the same last 48 bits, namely eight blanks. These 48 bits are the mantissa of a floating-point number on these computers--and the mantissa of the product of two floating-point quantities depends only on the mantissas of those quantities. The result is that all one- and two-character identifiers have the same hash code if this method is used. Much the same problem arises with any of the multiplication methods, although it may be circumvented by shifting the keys before multiplication.
In comparing the various hashing methods, there are two factors to be taken into account. One is the time taken to calculate the value of the hashing function, and the other is the expected number of subsequent addresses that are to be calculated for each record. For a record with no synonyms, of course, or with fewer synonyms than the number of records in a bucket, this expected number is zero. We have already mentioned the fact that, if our records are on disk, the first of these factors can often be ignored, because it is negligible in comparison with the second. That is, the calculation of even the most complex hashing function takes less time than a disk access.
Under these conditions, a thorough analysis of basic hashing methods has been made in a series of papers by Lum, Yuen, and Ghosh ([19], [20], [10]). The main results here are that, of all the commonly used methods, the division method and the random method minimize disk accesses the best, and furthermore that the division method is better than the random method even if the random method gives perfect randomization (which of course it never does). This last, rather startling, conclusion may be explained as follows. Key sets often contain runs of keys such as PRA, PRB, PRC, etc., or STR1, STR2, STR3, etc., which are in sequence. We have already seen that, if such a character string is of maximum length (that is, if it contains exactly as many characters as will fit into one word), the resulting keys will be in sequence, and will therefore, using the division method, have different hash codes. But now suppose that these strings do not have maximum length. If they are u bits long, and there are v bits per word, they will still be in sequence by multiples of k = 2^(v-u); that is, they will be of the form K, K+k, K+2k, K+3k, etc. So long as k is relatively prime to the size of the table, all these keys will still have different hash codes, if the division method is used. This can always be brought about by choosing a prime number for the table size (this is also required, for a different reason, by the original quadratic search method of collision handling).
The most widely usable consequence of these results is the following. Suppose we are choosing a basic hashing method for records stored on a movable-head disk. Suppose our computer is an IBM 360 or 370 which does not have the optional DP (decimal divide) instruction. We have the binary divide (D) instruction, but recall that disk addresses on this machine are decimal quantities. The results of Ghosh and Lum now tell us that we should use the division method anyway--even if it means going over to a macro.
One important basic property of hash table methods is that they start working very badly when the hash table becomes almost full (unless the list method of handling collisions is used). In the extreme case in which the table is completely full except for one space, and the linear method of handling collisions is used, a search for this space takes, on the average, N/2 steps, for a table of size N. In practice, a hash table should never be allowed to get that full. The ratio of the number of entries currently in a hash table to the number of spaces for such entries is the load(ing) factor of the hash table; it ranges between 0 and 1. When the load factor is about 0.7 or 0.8--or, in other words, when the table is about 70% or 80% full--the size of the hash table should be increased, and the records in the table should be rehashed. The expected number of subsequent addresses that must be calculated in a hash table of size m, when n positions in the table are currently filled, and when the linear method of handling collisions
is used, is given by

    d(m, n) = ½ [ 2(n-1)/m + 3(n-1)(n-2)/m^2 + 4(n-1)(n-2)(n-3)/m^3 + ... ]

(see Morris [24] or Knuth [15]). The load factor here is n/m; if this is held to the constant value a while m and n themselves go to infinity, the above quantity has the limiting value ½a/(1 - a).
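A short numerical sketch (ours) of this series, summing terms until they become negligible; m is the table size and n the number of filled positions.

    # Expected number of subsequent addresses for the linear method,
    # computed from the series given above.
    def d(m, n):
        total, term, j = 0.0, 1.0, 1
        while True:
            term *= (n - j) / m        # running product (n-1)(n-2)...(n-j) / m^j
            if term <= 0 or (j + 1) * term < 1e-12:
                break
            total += (j + 1) * term
            j += 1
        return total / 2

    print(d(1000, 700))   # a table that is 70% full
    print(d(1000, 999))   # an almost completely full table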
When we use the list method of handling collisions, the performance of our hashing algorithm deteriorates slowly, rather than quickly, as the table fills. The list method, of course, puts no restrictions on the total number of records to be stored, which might even be greater than the size of the table, since synonyms of a record are kept in a separate free storage area. One method which was used on early UNIVAC 1107 software (in particular, the SLEUTH II assembler, under the EXEC II operating system), in fact, involved a very small hash table (64 positions), together with a much larger free storage space for lists. (Fold-shifting was used as the hashing function.) Under these conditions the average time taken to retrieve an item is N/128, where there are N items in the table (provided that N is large). For N < 1000, this compares favorably to binary searching (see the following section); larger values of N, in this context, were held to be too infrequent to matter much.
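A sketch of the list method (ours): the table holds only list heads, and synonyms live in separate storage, so the number of records may exceed the table size and performance degrades only gradually.

    # The list (chaining) method: one synonym list per home address.
    class ChainedHashTable:
        def __init__(self, size):
            self.heads = [[] for _ in range(size)]

        def store(self, key, record):
            self.heads[hash(key) % len(self.heads)].append((key, record))

        def retrieve(self, key):
            for k, r in self.heads[hash(key) % len(self.heads)]:
                if k == key:
                    return r
            return None

    t = ChainedHashTable(64)          # a deliberately tiny table, as in the 1107 example
    for i in range(1000):
        t.store("SYM%d" % i, i)
    print(t.retrieve("SYM500"))       # average list length here is about 1000/64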
Comparison of the time of calculation of
various hashing methods, in a context where
this matters, is unfortunately completely
machine-dependent. On many machines, it
is actually impossible, because instruction
timings are not published and may, in fact,
vary from one execution to another, particularly if pipeline hardware is used. Fold-shifting is probably the fastest, followed by
division (on machines that have divide instructions). The radix method and the algebraic coding method both suffer much more,
in such a comparison, than they would if
they were implemented in special-purpose
hardware, and the same is true of conventional folding (as opposed to fold-shifting)
and, to a lesser extent, of digit analysis.
Recent work of van der Pool ([35], [36]) and Webb [37] incorporates other parameters into the evaluation process to determine performance. Van der Pool includes in his analysis storage cost for the whole file, cost of storage for one record per unit time, fraction of the key set accessed per unit time, cost of access to the prime area, and cost of additional accesses. He derives formulas for calculating the total cost, with and without additions and deletions, but always limited to the case of separate overflow areas. His results show that loading factors higher than 1 may give better results than loading factors less than 1, particularly if wasted space is taken into account. (A loading factor higher than 1 is obtained if we divide the number of records in the table, including those in overflow areas, by the space available, excluding space in overflow areas.) Webb includes the computer CPU time in his evaluation of hash coding systems. He combines the various hashing functions with various collision handling methods to determine an overall cost. Basically, his results agree with those of Lum, Yuen, and Dodd [18].
ALTERNATIVES TO HASHING
There are many searching situations in which we either cannot or should not use hashing. We will now discuss some alternative methods of searching. It is not our purpose here to compare various alternative methods with each other (this is done in the surveys by Severance [32] and by Nievergelt [25]), but only to compare them with hashing.
First of all, there is no order in a hash table. The entries in the table are scattered around, seemingly at random (this is why we sometimes use the term "scatter storage" for a hash table). There is no way that we can go through a hash table and immediately print out all keys in order (without using a separate sorting process). If we want that capability, there are two methods which will still allow us to search our table relatively fast.
The first is good for tables which do not change, or which change only infrequently, over time. We simply keep a sorted array and use a binary search, which repeatedly divides the table in half and determines in which half the given key is. A binary search of a table of N records, with two-way comparisons at each stage, takes 1 + log2 N steps [22] (and is therefore sometimes called a logarithmic search). The steps are short, so this is comparable with many of the hashing functions--though not with the fastest of them--even for a table which is quite large, as long as it remains in main memory. Insertion of a new item in such a table, however, takes N/2 steps on the average (since the table must remain sorted), although a group of new items which are already sorted can be inserted by a merging process which takes N + N' steps, where there are N' new items.
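A sketch of the binary (logarithmic) search just described (ours), which makes about 1 + log2 N two-way comparisons on a sorted table of N keys:

    def binary_search(table, key):
        lo, hi = 0, len(table) - 1
        while lo <= hi:
            mid = (lo + hi) // 2          # divide the remaining half in half again
            if table[mid] == key:
                return mid
            if table[mid] < key:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1                         # key is not in the table

    table = sorted(["ALBERT", "JOHN", "JONES", "SMITH", "SON"])
    print(binary_search(table, "JONES"))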
The second method gives faster insertion, but also takes up more space. It is the binary tree search. A binary tree is constructed as in Figure 4. Each item contains a left pointer to a left subtree and a right pointer to a right subtree. Every item in the left subtree is less than every item in the right subtree, and the item which points to these two subtrees is between the two. Searching proceeds from top to bottom in the figure, and takes log2 N steps, for a tree of N items which is balanced (that is, in which each left subtree is as large as the corresponding right subtree). Insertion of a new item, after we have searched the table unsuccessfully for it, only takes one step; we simply "hook it on the bottom" as shown in Figure 4. The main disadvantage of this method is the 2N additional pointers required.
[Figure 4. Binary tree search and insertion.]
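A minimal binary tree search sketch (ours), in which an unsuccessful search ends exactly where the new item is hooked on:

    class Node:
        def __init__(self, key):
            self.key, self.left, self.right = key, None, None

    def insert(root, key):
        if root is None:                  # the place where the search "fell off"
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
        elif key > root.key:
            root.right = insert(root.right, key)
        return root

    def search(root, key):
        while root is not None and root.key != key:
            root = root.left if key < root.key else root.right
        return root is not None

    root = None
    for k in ("JONES", "ALBERT", "SMITH", "JOHN", "SON"):
        root = insert(root, k)
    print(search(root, "JOHN"), search(root, "DOE"))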
Binary methods, in this form, do not work well when the records are in auxiliary memory. For a table of size 2^10 = 1024, for example, we would need to make 10 accesses to drum or disk, and this process is much too slow. A variation of the binary tree search, however, in which there is some larger number, k, of pointers at each level, rather than two, is widely used. For 1 <= i < j <= k, each item in the ith subtree at any given level is less than each item in the jth subtree at that level. This, in fact, is the basic idea behind the indexed sequential (sometimes index-sequential) file organization method. Indexed sequential files are easy to process sequentially, in order, if this is desired, at the expense of a few more disk accesses for each individual record (almost never more than four, however).
Another method of searching, allied to the binary search, is the distributed key method. It is commonly used with keys which are strings of characters; a typical tree structure, as required for this method, is shown in Figure 5.
[Figure 5. The distributed key method, illustrated for the keys SMITH, SON, JONES, JOHN, and ALBERT.]
At each level, there is a variable number of pointers, each of which corresponds to a character at that level. Searching proceeds by starting at the root and taking the branch corresponding to the first character in the key, then from there the branch corresponding to the second character in the key, and so on.
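A sketch of the distributed key structure (ours), with one branch per character at each level:

    def build_tree(keys):
        root = {}
        for key in keys:
            node = root
            for ch in key:
                node = node.setdefault(ch, {})   # one branch per character
            node["end"] = True                   # marks the end of a stored key
        return root

    def lookup(root, key):
        node = root
        for ch in key:
            if ch not in node:
                return False
            node = node[ch]
        return node.get("end", False)

    tree = build_tree(["SMITH", "SON", "JONES", "JOHN", "ALBERT"])
    print(lookup(tree, "JOHN"), lookup(tree, "JON"))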
Linear searching, in which every record in the table is retrieved and checked, is still required for any type of searching which is not specifically provided for by the structure of the table. It often happens that we need to know which record R is such that some function f(R) has its maximum value over all records. In a data base environment, for example, we might want to know which customer ordered the most of product X last year, which department spent the most on stationery, which supplier had the greatest total slippage, or which salesman has the best record on some specific item. This type of search always takes as many steps as there are records in the table, unless we know beforehand that we need certain specific information of this kind and build facilities for this into the program.
An important class of searching methods, sometimes called direct methods, may be used to advantage when the keys are highly nonrandom. It is generally advantageous, whenever the choice of values for the keys is within our control, to choose them in such a way that hashing becomes unnecessary. As an example, consider a personnel file with records stored by employee number. If the social security number of an employee is used as the key, some form of hashing is generally necessary. However, some organizations have their own employee numbering systems. In this case, one commonly encountered device is to keep an employee's record at a disk address which is itself the employee number. This, of course, allows us to make access to the disk directly (without hashing), although if such an organization gets a new computer it will probably have to renumber all its employees (which might not necessarily be bad).
Even when the choice of keys is not under our control, the keys might still be so nonrandom that a direct access method gives better results than a hashing method. A common example of this has to do with zip codes. If a mailing list is kept in zip-code order, and all the zip codes in the list have the same first three digits, it is not necessary to use hashing; we simply take the last two digits and use them directly as an index into a 100-position table of disk addresses. This superficially resembles hashing by digit analysis, but has the additional property that, since the first three digits are always the same, we have a file which can be accessed sequentially, as well as by the direct access method described above.
Usually, the situation is not this simple, but direct methods can still be used in many cases. Suppose that zip codes in a certain city range from a1 to a1 + k1, and those in the suburbs lie in two other ranges, a2 to a2 + k2, and a3 to a3 + k3. In this case we can test a zip code to see which range it belongs in, subtract an appropriate constant, and directly obtain the index in a (k1 + k2 + k3 + 3)-position table.
Finally, we must mention the all-important fact that these are basic techniques only. An improved solution to a specific problem may almost always be achieved by using a combination of techniques. We have already mentioned folding combined with division as a hashing method; it is also possible to combine hashing with binary tree search methods, or with direct methods. In particular, it is quite often true that a large proportion of a table will be accessible directly, with hashing used only for the exceptional cases. This would be true, for example, if (say) 80% of the zip codes appearing in a system were in a single city and its suburbs, as in the preceding example, with the remainder scattered around the country.
FURTHER AREAS OF STUDY
By far the greatest need for research in hashing at the time of this writing is for further empirical studies of the effects of varying hashing methods, collision and overflow handling methods, and alternatives to hashing in a wide variety of programming situations. One interesting paper along these lines is due to Ullman [34]. We should never forget the observational aspect of computer science, particularly since mathematical results in this area are always open to questions concerning the validity of the mathematical assumptions which must be made.
This is not to downplay the need for further mathematical studies. Lower bounds on search time as a function of other factors in a computational process would be greatly appreciated. Also, distributions of key sets, other than a uniform distribution, should be studied in order to determine, for each such distribution, a best hashing function. This would extend the work of Lum [20] which concerned itself solely with the uniform distribution.
W o r k of this kind, however, should always
be subjected to a critical eye. One of the
authors of this paper can r e m e m b e r vividly
the a t t e n t i o n t h a t was showered recently on
a new algorithm which performed a certain
operation in one fewer multiplication t h a n
was previously t h o u g h t n e c e s s a r y - - a t the
expense of ten m o r e addition a n d s u b t r a c t i o n
operations. Of course, it is perfectly t r u e
t h a t , as the n u m b e r of bits in the quantities
being multiplied goes to infinity, one multiplication m u s t always take longer t h a n t e n
additions; in m o s t practical cases, however,
this will n o t be the case.
Another area of importance, which has more to do with development than with research, is in the making of plans for providing the next generation of computers with instructions which perform hashing directly. At the present time, we have instructions on many disk and drum units which will find a record in a bucket, given the bucket address and the key, and which will automatically read that record into main memory. It should be just as easy to develop instructions which will calculate the home address and the subsequent addresses as well. This is particularly true of the digit analysis method; one can easily imagine a hardware operation of removing certain digits from a key, as indicated by a digit analysis pattern, which would be analogous to the IBM 360 and 370 editing patterns.
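As a software analogue, the following sketch (in C) suggests what such a digit-removal operation might look like. The pattern convention (an 'X' keeps a digit, a '.' discards it) and the function name extract_digits are assumptions made purely for illustration; they do not describe any actual machine instruction or editing-pattern format.

    /* Software analogue of a hypothetical "extract digits by pattern"
     * operation for digit analysis.  For each character of the key,
     * 'X' in the pattern keeps the digit and '.' discards it.        */
    #include <stdio.h>

    /* Copy into 'out' the characters of 'key' selected by 'pattern'. */
    static void extract_digits(const char *key, const char *pattern, char *out)
    {
        size_t j = 0;
        for (size_t i = 0; key[i] != '\0' && pattern[i] != '\0'; i++) {
            if (pattern[i] == 'X')
                out[j++] = key[i];
        }
        out[j] = '\0';
    }

    int main(void)
    {
        char address[16];
        /* Keep the 3rd, 5th, and 7th digits of a nine-digit key
         * (an assumed choice of positions, for illustration only). */
        extract_digits("407258913", "..X.X.X..", address);
        printf("home address digits: %s\n", address);   /* prints "759" */
        return 0;
    }

In hardware, the pattern would presumably be supplied as an operand, much as an editing pattern is today, and the selected digits would form the home address directly.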
ACKNOWLEDGMENTS

The first author received partial support from the National Science Foundation under Grants GJ-41160 and DCR73-03431-A01. The authors would like to express their appreciation to V. Y. Lum for his thoughtful comments on an initial version of the paper, and to Stephen Chen for a careful reading of the final version.
REFERENCES

[1] ACKERMAN, A. F. "Quadratic search for hash tables of size p^n," Comm. ACM 17, 3 (March 1974), p. 164.
[2] BAYS, C. "The reallocation of hash-coded tables," Comm. ACM 16, 1 (Jan. 1973), pp. 11-14.
[3] BELL, J. R. "The quadratic quotient method: a hash code eliminating secondary clustering," Comm. ACM 13, 2 (Feb. 1970), pp. 107-109.
[4] BELL, J. R.; AND KAMAN, C. H. "The linear quotient hash code," Comm. ACM 13, 11 (Nov. 1970), pp. 675-677.
[5] BRENT, R. P. "Reducing the retrieval time of scatter storage techniques," Comm. ACM 16, 2 (Feb. 1973), pp. 105-109.
[6] BUCHHOLZ, W. "File organization and addressing," IBM Systems J. 2 (June 1963), pp. 86-111.
[7] DAY, A. C. "Full table quadratic searching for scatter storage," Comm. ACM 13, 8 (August 1970), pp. 481-482.
[8] DIPPEL, G.; AND HOUSE, W. C. Information systems, Scott, Foresman & Co., Glenview, Ill., 1969.
[9] DUMEY, A. I. "Indexing for rapid random-access memory systems," Computers and Automation 5, 12 (Dec. 1956), pp. 6-9.
[10] GHOSH, S. P.; AND LUM, V. Y. "An analysis of collisions when hashing by division," Tech. Report RJ-1218, IBM, May 1973.
[11] HANAN, M.; AND PALERMO, F. P. "An application of coding theory to a file address problem," IBM J. Res. & Development 7, 2 (April 1963), pp. 127-129.
[12] HELLERMAN, H. Digital computer system principles, McGraw-Hill, New York, 1967.
[13] JOHNSON, L. R. "An indirect chaining method for addressing on secondary keys," Comm. ACM 4, 5 (May 1961), pp. 218-222.
[14] KNUTH, D. E. The art of computer programming, Vol. III: Sorting and searching, Addison-Wesley, Reading, Mass., 1973.
[15] KNUTH, D. E. "Computer science and its relation to mathematics," Amer. Math. Monthly 81, 4 (April 1974), pp. 323-343.
[16] LIN, A. D. "Key addressing of random access memories by radix transformation," Proc. 1963 Spring Joint Computer Conf., AFIPS Vol. 23, Spartan Books, Baltimore, 1963, pp. 355-366.
[17] LUCCIO, F. "Weighted increment linear search for scatter tables," Comm. ACM 15, 12 (Dec. 1972), pp. 1045-1047.
[18] LUM, V. Y.; YUEN, P. S. T.; AND DODD, M. "Key-to-address transform techniques, a fundamental performance study on large existing formatted files," Comm. ACM 14, 4 (April 1971), pp. 228-239.
[19] LUM, V. Y.; AND YUEN, P. S. T. "Additional results on key-to-address transform techniques," Comm. ACM 15, 11 (Nov. 1972), pp. 996-997.
[20] LUM, V. Y. "General performance analysis of key-to-address transformation methods using an abstract file concept," Comm. ACM 16, 10 (Oct. 1973), pp. 603-612.
[21] MAURER, W. D. "An improved hash code for scatter storage," Comm. ACM 11, 1 (Jan. 1968), pp. 35-38.
[22] MAURER, W. D. Programming, Holden-Day, San Francisco, Calif., 1968.
[23] MCILROY, M. D. "A variant method of file searching," Comm. ACM 6, 3 (March 1963), p. 101.
[24] MORRIS, R. "Scatter storage techniques," Comm. ACM 11, 1 (Jan. 1968), pp. 38-43.
[25] NIEVERGELT, J. "Binary search trees and file organization," Computing Surveys 6, 3 (Sept. 1974), pp. 195-207.
[26] OLSON, C. A. "Random access file organization for indirectly accessed records," Proc. ACM 24th National Conf., 1969, pp. 539-549.
[27] PETERSON, W. W. "Addressing for random-access storage," IBM J. Res. & Development 1, 2 (April 1957), pp. 130-146.
[28] PRICE, C. E. "Table lookup techniques," Computing Surveys 3, 2 (June 1971), pp. 49-65.
[29] RADKE, C. E. "The use of quadratic residue search," Comm. ACM 13, 2 (Feb. 1970), pp. 103-105.
[30] SCHAY, G.; AND RAVER, N. "A method for key-to-address transformation," IBM J. Res. & Development 7, 2 (April 1963), pp. 121-126.
[31] SCHAY, G.; AND SPRUTH, W. G. "Analysis of a file addressing method," Comm. ACM 5, 8 (August 1962), pp. 459-462.
[32] SEVERANCE, D. G. "Identifier search mechanisms: a survey and generalized model," Computing Surveys 6, 3 (Sept. 1974), pp. 175-194.
[33] TAINITER, M. "Addressing for random-access storage with multiple bucket capacities," J. ACM 10, 3 (July 1963), pp. 307-315.
[34] ULLMAN, J. D. "A note on the efficiency of hashing functions," J. ACM 19, 3 (July 1972), pp. 569-575.
[35] VAN DER POOL, J. A. "Optimum storage allocation for initial loading of a file," IBM J. Res. & Development 16, 6 (Nov. 1972), pp. 579-586.
[36] VAN DER POOL, J. A. "Optimum storage allocation for a file in steady state," IBM J. Res. & Development 17, 1 (Jan. 1973), pp. 27-38.
[37] WEBB, D. A. "The development and application of an evaluation model for hash coding systems," PhD Thesis, Syracuse Univ., Syracuse, N.Y., August 1972.
[38] WEGNER, P. Programming languages, information structures, and machine organization, McGraw-Hill, New York, 1968.