Hash tables are quite different from the previous data structures we’ve covered. Hash tables are very specialized for one purpose: searching. The hash table is the end result of a journey to build a data structure with O(1) searching. The design of the hash table data structure sacrifices everything, especially memory, to obtain near perfect searching.
To understand hash tables, first you have to understand hashing. Perfect hashing is the process by which a given value is mapped to a unique value (the value’s hash). If the same value is hashed multiple times, it will result in the same hash every time. However, different values will always have unique hashes. Sometimes it’s practical to reduce the constraint that all values must have unique hashes. It is often good enough as long as the hash function evenly distributes the values among all the possible hashes.
At it’s core the hash table uses an array to store elements. However, new elements are not simply placed at the end of the array. Instead, each new element is hashed to an integer. Then, the element is stored in the array at the index corresponding to the its hash value. To accomodate this constraint, the underlying array is initialized with a size corresponding to the number of possible hashes (with perfect hashing, this is the same as the number of possible elements).
In a perfect hash table that stores 32 bit integers as its elements, each element must have a unique hash value. Therefore, the array must be initialized to a size of 4.29 billion (the number of possible 32 bit integers). This would result in hash table with the memory footprint of about 16 GB. If the possible inputs consisted of 5 character unicode strings, then the array would need a size of over one septillion which would result in a hash table size of just over 10 quadrillion GB.
Clearly, perfect hashing becomes impractical for all but the smallest of problems. Instead, for hash tables, we relax the perfect hash constraint and ensure that different values are simply very likely (but not guaranteed) to produce different hashes.
There are many different ways to write a hash function. It is a topic of its own and is outside the scope of this article. For the purpose of our IDynamicSet<T> hash table, we’ll just use C#’s built-in GetHashCode function which will work well for our purposes. The GetHashCode function uses the 32 bit integer as the hash and will therefore only have aproximately 4.29 billion possible hashes, no matter what the input is. Because we don’t want our hash table to consume 16 GB of memory, we’ll further constrain the number possible hashes depending on the number of elements in the hash table. To do this we perform a mod operation of the size of the array (hash % size). For example, if our array had a current capacity of 100, and the hash was 456, then we’d perform 456 % 100 = 56 and store the element in the array at index 56.
This allows us to search for any element by hashing its value to get immediate access to the index where the element would be if it exists in the hash table. Hash functions are typically very fast, and because the input is only a single element, they necesarily take a constant.
Because we are not using perfect hashing, there will be more possible inputs than there are array indexes. Therefore, some collisions are likely to occur and you can’t store multiple different elements in the same array index. Or can you? If we change our array of elements to be an array of data structures of elements, then each index in our array corresponds to a data structure of elements. For this internal data structure, we could use a number of different structures: a dynamic array, another hash table, or, in this implementation a linked list. This allows us to store multiple elements at the same index in our main array. When we search for an element, we will find the index associated with a linked list. Then, all we need to do is search this list to find the element. Because the elements should be roughly distributed among the lists, and because we constantly increase the number of lists as we add new elements, it is very likely that any given list will have only a very few elements. In fact, it is very likely that most of the lists will only have one element in them. And it is likley that only a very small percentage of lists would have more than two or three elements. However, this is just probability. Although with a good hash algorithm it would be extremely unlikely, it is nevertheless possible that every element has the same hash and all elements end up colliding. If this were to happen, then searching for an element would be equivalent to searching for an element in a linked list, which we know is Θ(n).
Note: The math to figure out the likelyhood of a collision is the same as the birthday paradox.
There are many different strategies for building a hash table. Some things to consider are what hash function to use, when and how to resize the internal array, and what internal data structure to use inside the array (or there is even a strategy called probing that does not involve using a second internal data structure).
To determine when and how to resize the array. The tradeoff will always come down to time vs space. When there are more possible hash values, more space will be needed, but less collisions will occur which will results in faster searches. In this implementation we use the same array expansions strategy used when we built linked lists, stacks and queues. That is, we start with an array of size one and whenever we need to insert new elements into an already full array, we rebuild the array onto a new array with twice the capacity.
To determine what internal data structure to use inside each element of the main array, simply fall back to your knowledge of the data structures and their tradeoffs. This is an example of data structures building on top of each other to acheive more and more specialized goals.
Unlike the previous data structures we’ve covered, hash tables have an average runtime of Θ(1) for searching. This is by far the biggest advantage of a hash table.
Unlike linked lists, if the underlying array is full and needs to be rebuilt, the insertion will take Θ(n) time.
To minimize collisions, hash tables can often take up more memory than other data structures.
- Because hash tables have Θ(1) average searching, they will be very performant in situations when there will be many searches.
Linked Lists in C#
- The built-in C# generic implementation of the hash table is System.Collections.Generic.Dictionary<T>. The implementation is very similar, but with a few more optimizations built in.
|Average Case Time Complexity
|Worst Case Time Complexity