Practical Hash Tables

2021/08/08

This is a quick review of hash tables and some musings about their performance. The article was written in 2021, around the first time I needed to really pay attention to the performance and detailed design of hash tables. It comes off as fairly rudimentary to me now, and probably to you too if you know anything about performance optimization or CPU architecture. Please be warned.

Hash tables: from the outside

Most languages come with standard library implementations of hash tables, usually presented as an associative array of (key, value) pairs. They support amortized constant-time lookup, insertion, and removal by key, as well as iteration over all stored (key, value) pairs.
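
As a quick illustration, using Rust's std::collections::HashMap (the scores map below is just an example):

    use std::collections::HashMap;

    fn main() {
        let mut scores: HashMap<String, u32> = HashMap::new();

        // Insertion and lookup are amortized O(1).
        scores.insert("alice".to_string(), 10);
        scores.insert("bob".to_string(), 20);
        assert_eq!(scores.get("alice"), Some(&10));

        // Removal by key is also amortized O(1).
        scores.remove("bob");

        // Iteration visits every stored (key, value) pair, in no particular order.
        for (name, score) in &scores {
            println!("{name}: {score}");
        }
    }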

Beyond the basics, there are two common gotchas:

Hash tables: the internals

Internally, hash tables typically allocate a contiguous array of hash buckets for storage. Elements are assigned to buckets based on their hash value. A hash collision occurs when multiple elements hash to the same bucket. Practically speaking, there are two primary differentiators for the performance of a hash table: the choice of hash function, and the collision resolution scheme.
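
In its simplest form, the mapping from key to bucket is just the key's hash reduced modulo the bucket count. A tiny sketch (the bucket_index name is illustrative, and real tables often use power-of-two sizes with bit masking instead of a modulo):

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    // Hash the key, then reduce the hash to a bucket index. Keys whose
    // hashes land on the same index collide.
    fn bucket_index<K: Hash>(key: &K, num_buckets: usize) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % num_buckets
    }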

Hash function quality is not strictly a property of the hash table, but it is vitally important in practice. It dictates how often hash collisions occur, and collisions destroy performance. Choosing a good hash function is half the battle in getting a well-behaved hash table. This burden is often shared between the user and the library. While the default hash function of a language or library is often good, it may not be as efficient as possible due to its general-purpose nature, or it may perform badly on a custom data type. For performance-critical code, it pays to evaluate hash functions against the specific use case at hand.
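
In Rust, for instance, the hasher is a type parameter of HashMap and can be swapped out. Below is a sketch that plugs in a hand-rolled FNV-1a hasher; the Fnv1aHasher and FnvMap names are made up for this example, production code would more likely reach for an established crate, and note that FNV trades away the hash-flooding resistance of the default SipHash-based hasher:

    use std::collections::HashMap;
    use std::hash::{BuildHasherDefault, Hasher};

    // A minimal FNV-1a hasher: simple and fast on short keys.
    struct Fnv1aHasher(u64);

    impl Default for Fnv1aHasher {
        fn default() -> Self {
            Fnv1aHasher(0xcbf29ce484222325) // 64-bit FNV offset basis
        }
    }

    impl Hasher for Fnv1aHasher {
        fn write(&mut self, bytes: &[u8]) {
            for &b in bytes {
                self.0 ^= b as u64;
                self.0 = self.0.wrapping_mul(0x100000001b3); // 64-bit FNV prime
            }
        }

        fn finish(&self) -> u64 {
            self.0
        }
    }

    // A HashMap that uses the custom hasher instead of the default one.
    type FnvMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1aHasher>>;

    fn main() {
        let mut m: FnvMap<&str, u32> = FnvMap::default();
        m.insert("key", 1);
        assert_eq!(m.get("key"), Some(&1));
    }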

Beyond hash functions, the collision resolution scheme determines how collisions are handled, and also informs the internal data layout of the table. Collision resolution comes in two flavors: separate chaining and open addressing.

Many variations of the above strategies exist. In particular, there is a variety of probing schemes.
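
To make the two flavors concrete, here is a rough sketch of the storage layout each one implies (the ChainedMap and OpenAddressedMap names are illustrative; growth, deletion, and the hashing step are omitted):

    // Separate chaining: each bucket owns a (possibly empty) chain of entries,
    // and colliding keys are simply appended to the same bucket's chain.
    struct ChainedMap<K, V> {
        buckets: Vec<Vec<(K, V)>>, // textbook versions often use a linked list per bucket
    }

    // Open addressing: one flat array of slots shared by all keys; a colliding
    // key is stored in some other slot of the same array, located by probing.
    struct OpenAddressedMap<K, V> {
        slots: Vec<Option<(K, V)>>,
    }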

Load factor and resizing

When talking about collision resolution, load factor comes up a lot. It refers to the ratio between the number of elements and number of buckets in a hash table. A higher load factor usually means there are more collisions in the hash table, and growing the table would help reduce collisions and improve performance. Some probing techniques (e.g. Robin Hood Hashing, Hopscotch Hashing) help avoid long probe chains, and therefore enable hash tables with very high load factors to still perform well.

General-purpose hash tables usually grow and shrink to keep the load factor within a certain range. It’s useful to note that if one can determine the maximum storage requirements of a hash table up front, or stop adding data to it once the load factor becomes high, then resizing is unnecessary. Resizing can even be harmful because it worsens tail latency.
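
With Rust's HashMap, for example, pre-sizing looks like the sketch below. Note that capacity() reports how many elements the map can hold before it must resize, not the raw bucket count, so the ratio printed here only approximates the true load factor:

    use std::collections::HashMap;

    fn main() {
        // If the maximum number of elements is known up front, allocate enough
        // space once so that no rehashing happens during insertion.
        let expected_items = 10_000;
        let mut m: HashMap<u64, u64> = HashMap::with_capacity(expected_items);

        for i in 0..expected_items as u64 {
            m.insert(i, i * 2);
        }

        // Rough occupancy check: len() / capacity().
        let occupancy = m.len() as f64 / m.capacity() as f64;
        println!("len = {}, capacity = {}, occupancy ~ {:.2}", m.len(), m.capacity(), occupancy);
    }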

One algorithm to rule them all?

There are many different hash table implementations in the standard libraries of different languages, so there isn’t really a hard and fast rule on which algorithm to use. However, I’d like to focus on one particular, simple approach taken by open-source general-purpose implementations in recent years: open addressing with linear probing, where adjacent hash buckets are simply probed one by one. Examples are Google’s Swisstable (talk), Rust’s std::collections::HashMap (based on Swisstable), and Facebook’s F14. Why the sudden popularity of linear probing?

My guess: a combination of cache-friendliness and SIMD operations. Let’s step back and talk about performance.

Performance 101

I’m only starting to scratch the surface of program optimization, and there are many cautions in the wild about premature optimization, but this topic is increasingly relevant in a world where Moore’s law is slowing down, and hardware resource efficiency is emphasized.

Performance analysis often starts with big-O, but big-O is only a first step. It helps one avoid big-picture mistakes like using a quadratic algorithm when sorting + a linear scan is sufficient, but it says nothing about constant factors, which can be huge, and it misleads one into thinking that linked lists and binary trees are decent data structures (they’re not; see below).

So, what aspects of an algorithm or data structure affect its performance beyond big-O complexity?

Generally, how it interacts with computer hardware. Hardware is built to optimize certain patterns of computation, and by taking advantage of these patterns, an algorithm can be orders of magnitude more efficient than an otherwise similar algorithm that does not.

In the domain of single-threaded, in-memory computing (the scope of this article), the most prominent limiting factor of performance is the “memory wall” - the fact that RAM latency has become overwhelmingly high compared to CPU cycle latency. Modern CPUs remedy this with multi-level (typically 3-level) cache hierarchies. CPU cache sits much closer to the CPU than RAM and is much faster. You may have seen a list of latency numbers illustrating this - an L1 cache reference (0.5ns) is two orders of magnitude faster than a main memory reference (100ns). On a 4GHz CPU, the difference is about 2 cycles for L1 vs roughly 400 cycles for RAM. Without other optimizations, the CPU could be spinning idle for hundreds of cycles every time a hash table slot is read.
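
Spelling the conversion out: at 4GHz a cycle lasts 0.25ns, so an L1 hit costs roughly 0.5 / 0.25 = 2 cycles, while a RAM reference costs roughly 100 / 0.25 = 400 cycles.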

What this means in practice is that, when operating completely in-memory (e.g. hash tables in standard libraries), to achieve good performance, one must do everything possible to avoid cache misses. Chandler Carruth summarized this point succinctly:

Dis-contiguous data structures are the root of all performance evil. Say no to linked lists.

Caching is also not the only hardware optimization done to hide memory latency. Here are a few others I’ve learned about:

Hash tables again

One primary advantage of open addressing with linear probing is its cache-friendliness (or cache locality). In particular, modern processors have 64-byte or 128-byte cache lines. When a memory location is fetched, the entire cache line containing that location is fetched along with it. Thus, the cost to traverse, say, 4 bytes of memory is comparable to traversing 64 bytes - both much less than traversing the same amount of data by repeatedly chasing pointers to random memory locations. A data structure with a contiguous storage layout exploits cache locality far better than a dis-contiguous one.

Consider a separate chaining hash table with a linked list in each bucket. This has the same big-O complexity as a hash table with open addressing. However, each step of the linked-list traversal is a RAM reference, and each step requires the previous read to complete first. This translates to anywhere from 2 to N RAM references for a lookup, where N is the maximum number of elements in a bucket (the minimum is 2, assuming the head of the linked list is not stored directly in the bucket). Compare this with linear probing, where one read is needed to find the bucket, and subsequent linear probing is ~free until we cross a cache line boundary - if elements are small, the chance of needing more than 1 or 2 RAM references is minimal.
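
To see why the open-addressing side stays cheap, here is a sketch of a linear-probing lookup over a flat slot array (the OpenAddressedMap type is illustrative, tombstones for deleted slots are omitted, and home stands for the bucket index derived from the key's hash):

    struct OpenAddressedMap<K, V> {
        slots: Vec<Option<(K, V)>>,
    }

    impl<K: PartialEq, V> OpenAddressedMap<K, V> {
        // Starting from the key's home slot, scan adjacent slots one by one.
        // The slots are contiguous in memory, so after the first (potentially
        // cold) read, the next several probes tend to hit the same cache line.
        fn lookup(&self, home: usize, key: &K) -> Option<&V> {
            for i in 0..self.slots.len() {
                match &self.slots[(home + i) % self.slots.len()] {
                    Some((k, v)) if k == key => return Some(v),
                    None => return None, // an empty slot terminates the probe chain
                    _ => {}              // occupied by a different key: keep probing
                }
            }
            None
        }
    }

    fn main() {
        let m = OpenAddressedMap {
            slots: vec![None, Some((17u64, "a")), Some((42u64, "b")), None],
        };
        // Suppose key 42 hashes to home slot 1: the probe visits slot 1, then slot 2.
        assert_eq!(m.lookup(1, &42), Some(&"b"));
    }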

An additional optimization that linear probing is amenable to is SIMD, as explained in more detail in the talks and blog posts linked above. Briefly, one can group N buckets together, and store a subset of the hash code (say, 8 out of 64 bits) per bucket as metadata, such that the resulting metadata per group has 8 * N bits. Then, a set of SIMD instructions can operate on all of those bits at once, probing many buckets in just a few cycles.
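
Here is a minimal sketch of the group-matching step using x86-64 SSE2 intrinsics; the 16-byte group size and one-byte-per-slot metadata loosely follow Swisstable's design, but this is an illustration rather than any library's actual code:

    #[cfg(target_arch = "x86_64")]
    fn match_group(metadata: &[u8; 16], wanted: u8) -> u16 {
        use std::arch::x86_64::*;

        // SSE2 is part of the x86-64 baseline, so these intrinsics are always
        // available on this architecture.
        unsafe {
            // Load the 16 one-byte metadata entries of a group into one register.
            let group = _mm_loadu_si128(metadata.as_ptr() as *const __m128i);
            // Broadcast the 8-bit hash fragment being searched for.
            let needle = _mm_set1_epi8(wanted as i8);
            // Compare all 16 bytes at once: equal bytes become 0xFF, others 0x00.
            let eq = _mm_cmpeq_epi8(group, needle);
            // Collapse the comparison into a 16-bit mask: bit i is set if slot i
            // is a candidate match, which then still needs a full key comparison.
            _mm_movemask_epi8(eq) as u16
        }
    }

    #[cfg(target_arch = "x86_64")]
    fn main() {
        let mut meta = [0x80u8; 16]; // e.g. 0x80 marking empty slots
        meta[3] = 0x2a;
        meta[9] = 0x2a;
        // Slots 3 and 9 hold the wanted hash fragment, so bits 3 and 9 are set.
        assert_eq!(match_group(&meta, 0x2a), (1 << 3) | (1 << 9));
    }

    #[cfg(not(target_arch = "x86_64"))]
    fn main() {}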

Naive linear probing suffers from long probe chains when the load factor becomes high, but the SIMD optimization seems to have mostly mitigated this shortcoming. That would explain why earlier hash table implementations (Python, Ruby) chose probing sequences with a constant multiplicative factor + perturbation, while recent implementations opt for the simpler linear probing.

Aside: databases

As hinted at in the analysis above, one way to think about optimizing a program is to “count memory references”. More precisely: if the data fits entirely into cache, count instructions; otherwise, count memory references.

This is strikingly similar to how query optimizers of traditional on-disk databases evaluate the cost of a query, except that memory references are replaced by disk reads. Magnetic hard drives are orders of magnitude slower than RAM, so when evaluating a query execution plan, it’s often effective to count the number of disk reads it performs.

I think this is also an argument for why B-Trees, the preferred data structure of databases, can be the right choice for in-memory data when ordering is required, even though hash tables are the go-to solution for plain associative arrays.