
A better naive HashMap

How can we do better than our naive implementation? Obviously, there's scope for algorithmic improvement (we could implement any kind of probing) but if we're trying to compete with the standard library's HashMap, it's likely that we have a specialized reason for doing so. A specialized reason implies we know something unique about our data, or are willing to make a trade-off that a general data structure cannot. Broadly speaking, the main goals one should have when building software at the limit of machine performance are as follows:

  1. Improve the underlying algorithm, reducing total work done.
  2. Improve cache locality of data accesses (see the sketch after this list). This may mean:
    1. Keeping your working set in L1 cache.
    2. Compressing your data to fit better into cache.
  3. Avoid branch mispredictions. This may mean:
    1. Shaving off branches entirely when possible.
    2. Constraining the probability distribution of your branches.
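
As a minimal, hypothetical illustration of point 2 (not from this book's project), the two functions below compute the same sum, but the row-major walk reads memory sequentially while the column-major walk hops between separately allocated rows on every step, making far worse use of the cache:

fn sum_column_major(m: &[Vec<u64>]) -> u64 {
    let rows = m.len();
    let cols = m[0].len();
    let mut total = 0;
    for col in 0..cols {
        for row in 0..rows {
            // Strided access: consecutive reads land in different rows,
            // which are separate heap allocations.
            total += m[row][col];
        }
    }
    total
}

fn sum_row_major(m: &[Vec<u64>]) -> u64 {
    // Sequential access: each row is read front to back.
    m.iter().map(|row| row.iter().sum::<u64>()).sum()
}

fn main() {
    let matrix = vec![vec![1u64; 1024]; 1024];
    assert_eq!(sum_column_major(&matrix), sum_row_major(&matrix));
}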

How does that apply here? Well, say we knew that for our application K == u8 but we still have an unconstrained V. u8 is a key type with low cardinality, and we ought to be able to trade a little memory to build a faster structure for u8 keys. Unfortunately, Rust does not yet have type specialization in a stable channel. Type specialization is very important for producing high-performance software without breaking abstractions. It allows the programmer to define an abstract interface and a general implementation, and then, at a later date, specialize some of the parameterized types into concrete form with a special-purpose implementation. Rust RFC 1210 (https://github.com/rust-lang/rfcs/pull/1210/files#diff-b652f1feca90247198ee29514ac22cf3) details how specialization in Rust will work and Rust PR 31844 (https://github.com/rust-lang/rust/issues/31844) tracks the ongoing implementation, which is to say, all of this is only exposed in nightly. This chapter sticks to stable and so, unfortunately, we'll need to create a new HashMap rather than specializing. The reader is encouraged to try out specialization for themselves. It's quite nice.
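
To give a flavor of what that looks like, here is a minimal sketch of specialization on a nightly toolchain; the trait and types in it are purely illustrative and have nothing to do with this project:

#![feature(specialization)]

trait Summary {
    fn summarize(&self) -> String;
}

// A blanket implementation whose method is marked `default`, meaning it
// may be overridden by more specific impls.
impl<T> Summary for T {
    default fn summarize(&self) -> String {
        String::from("(no details available)")
    }
}

// A specialized implementation that takes precedence whenever the
// concrete type is u8.
impl Summary for u8 {
    fn summarize(&self) -> String {
        format!("a byte with value {}", self)
    }
}

fn main() {
    println!("{}", 42u8.summarize()); // uses the u8 specialization
    println!("{}", 3.14f64.summarize()); // falls back to the default impl
}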

We'll park our HashMapU8 implementation in naive_hashmap/src/lib.rs. The implementation is quite small:

use std::mem;
use std::ptr;

pub struct HashMapU8<V>
where
    V: ::std::fmt::Debug,
{
    // One slot per possible u8 key; None means the key has never been set.
    data: [Option<V>; 256],
}

impl<V> HashMapU8<V>
where
    V: ::std::fmt::Debug,
{
    pub fn new() -> HashMapU8<V> {
        let data = unsafe {
            // Build the array without requiring V: Copy or V: Default by
            // writing None into every uninitialized slot.
            let mut data: [Option<V>; 256] = mem::uninitialized();
            for element in data.iter_mut() {
                ptr::write(element, None)
            }
            data
        };
        HashMapU8 { data }
    }

    pub fn insert(&mut self, k: u8, v: V) -> Option<V> {
        // The key is its own index; return whatever was stored there before.
        mem::replace(&mut self.data[k as usize], Some(v))
    }

    pub fn get(&mut self, k: &u8) -> Option<&V> {
        // Every u8 is a valid index into a 256-element array, so the
        // bounds check can safely be skipped.
        let val = unsafe { self.data.get_unchecked(*k as usize) };
        val.as_ref()
    }
}
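
As a quick sanity check (hypothetical, not part of the book's listing), a small test can sit at the bottom of naive_hashmap/src/lib.rs and exercise both methods:

#[cfg(test)]
mod tests {
    use super::HashMapU8;

    #[test]
    fn insert_then_get() {
        let mut map: HashMapU8<&str> = HashMapU8::new();
        // A fresh map holds no value for any key.
        assert!(map.get(&7).is_none());
        // Inserting into an empty slot returns None...
        assert_eq!(map.insert(7, "seven"), None);
        // ...and overwriting returns the previous value.
        assert_eq!(map.insert(7, "SEVEN"), Some("seven"));
        assert_eq!(map.get(&7), Some(&"SEVEN"));
    }
}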

The idea here is simple: u8 is a type with such low cardinality that we can turn every possible key directly into an array offset. The value for each key is an Option<V>, None if no value has ever been set for the key, and Some otherwise. No hashing needs to be done and, absent specialization, we simply drop the hashing-related type requirements. Every HashMapU8 will reserve 256 * ::core::mem::size_of::<Option<V>>() bytes. Since there's unsafe code in this implementation, it's worthwhile doing an AFL run to search for crashes. The interpreter for the specialized map is similar to the naive interpreter, except that we now take care to parse for u8 keys:

extern crate naive_hashmap;

use std::io;
use std::io::prelude::*;
use std::str::FromStr;

fn main() {
    let mut hash_map = naive_hashmap::HashMapU8::new();

    let n = io::stdin();
    for line in n.lock().lines() {
        if let Ok(line) = line {
            let mut cmd = line.split(" ");
            match cmd.next() {
                Some("LOOKUP") => {
                    if let Some(key) = cmd.next() {
                        if let Ok(key) = u8::from_str(key) {
                            let _ = hash_map.get(&key);
                        } else {
                            continue;
                        }
                    } else {
                        continue;
                    }
                }
                Some("INSERT") => {
                    if let Some(key) = cmd.next() {
                        if let Ok(key) = u8::from_str(key) {
                            if let Some(val) = cmd.next() {
                          let _ = hash_map.insert(key, 
val.to_string()); } else { continue; } } } else { continue; } } _ => continue, } } else { break; } } }

I'll spare you the AFL output but, as a reminder, here's how you run the specialized interpreter through AFL:

> cargo afl build --release
> cargo afl fuzz -i resources/in/ -o resources/out/ target/release/specialized_interpreter

Producing a criterion benchmark will be very similar to the approach taken for the naive implementation, save that we'll swap out a few names here and there. We'll skip listing the code, in the hope that you'll be able to reproduce it as desired; a rough sketch follows the results below. The results, however, are promising:

HashMap/100000/speciali time:   [662.01 us 665.28 us 668.44 us]
HashMap/100000/standard time:   [2.3471 ms 2.3521 ms 2.3583 ms]

HashMap/10000/specializ time:   [64.294 us 64.440 us 64.576 us]
HashMap/10000/standard  time:   [253.14 us 253.31 us 253.49 us]
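
For reference, a benchmark in that spirit might look like the following sketch. It uses criterion's bench_function API; the benchmark name and workload here are illustrative and are not the exact code that produced the numbers above:

#[macro_use]
extern crate criterion;
extern crate naive_hashmap;

use criterion::Criterion;

fn bench_specialized(c: &mut Criterion) {
    c.bench_function("HashMap/10000/specialized", |b| {
        b.iter(|| {
            // Mix inserts and lookups across the full u8 key space.
            let mut map = naive_hashmap::HashMapU8::new();
            for i in 0..10_000u32 {
                let key = (i % 256) as u8;
                if i % 2 == 0 {
                    let _ = map.insert(key, i);
                } else {
                    let _ = map.get(&key);
                }
            }
        })
    });
}

criterion_group!(benches, bench_specialized);
criterion_main!(benches);

For this to run under cargo bench, criterion would need to be listed as a dev-dependency and the file registered as a bench target with harness = false in Cargo.toml.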

Just as with naive.rs and standard.rs from before, we've also got a specialized.rs runner which, to avoid duplication, we won't list here. Let's run specialized through Valgrind cachegrind:

naive_hashmap > valgrind --tool=cachegrind --branch-sim=yes target/release/specialized
==24235== Cachegrind, a cache and branch-prediction profiler
==24235== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==24235== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==24235== Command: target/release/specialized
==24235==
--24235-- warning: L3 cache found, using its data for the LL simulation.
INSERT
  empty:   256
  present: 50032
LOOKUP
  fail:    199
  success: 49513
==24235==
==24235== I   refs:      5,200,051
==24235== I1  misses:        2,248
==24235== LLi misses:        2,068
==24235== I1  miss rate:      0.04%
==24235== LLi miss rate:      0.04%
==24235==
==24235== D   refs:        443,562  (233,633 rd   + 209,929 wr)
==24235== D1  misses:        4,736  (  3,249 rd   +   1,487 wr)
==24235== LLd misses:        3,512  (  2,112 rd   +   1,400 wr)
==24235== D1  miss rate:       1.1% (    1.4%     +     0.7%  )
==24235== LLd miss rate:       0.8% (    0.9%     +     0.7%  )
==24235==
==24235== LL refs:           6,984  (  5,497 rd   +   1,487 wr)
==24235== LL misses:         5,580  (  4,180 rd   +   1,400 wr)
==24235== LL miss rate:        0.1% (    0.1%     +     0.7%  )
==24235==
==24235== Branches:        393,599  (391,453 cond +   2,146 ind)
==24235== Mispredicts:      57,380  ( 56,657 cond +     723 ind)
==24235== Mispred rate:       14.6% (   14.5%     +    33.7%   )

Compared to the standard valgrind run, we're doing around 1/5 the total number of instructions, with substantially fewer D1 and LLd misses. No surprise there: our hash for HashMapU8 is an exceedingly cheap pointer offset, and the storage is small enough to fit comfortably into the cache (a quick size check follows the perf numbers below). Linux perf tells a similar story:

naive_hashmap > perf stat --event task-clock,context-switches,page-faults,cycles,instructions,branches,branch-misses,cache-references,cache-misses target/release/specialized > /dev/null

 Performance counter stats for 'target/release/specialized':

     1.433884      task-clock (msec)         #    0.788 CPUs utilized
            0      context-switches          #    0.000 K/sec
          104      page-faults               #    0.073 M/sec
    4,138,436      cycles                    #    2.886 GHz
    6,141,529      instructions              #    1.48  insn per cycle
      749,155      branches                  #  522.466 M/sec
       59,914      branch-misses             #    8.00% of all branches
       74,760      cache-references          #   52.138 M/sec

       0.001818537 seconds time elapsed
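
To put a rough number on "fits comfortably into the cache": on a typical 64-bit target the whole table is only a few kilobytes, well under a common 32 KB L1 data cache. A quick, hypothetical check:

use std::mem;

fn main() {
    // Option<u64> has no spare bit pattern to reuse, so it occupies 16
    // bytes: the full table is 256 * 16 = 4096 bytes.
    println!("u64 table:    {} bytes", 256 * mem::size_of::<Option<u64>>());
    // String carries a non-null pointer, so Option<String> is the same
    // size as String (24 bytes here): 256 * 24 = 6144 bytes.
    println!("String table: {} bytes", 256 * mem::size_of::<Option<String>>());
}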

Phew! Let's summarize our efforts: