oasislmf.pytools.common.hashmap

Robin Hood hash table for numba JIT code.

Supports structured-record keys (e.g. np.dtype([(‘a’,’i4’),(‘b’,’u1’)])) and scalar keys — numeric (int8..int64, uint8..uint64, bool, float32, float64) or unicode (fixed-width ‘U<n>’ / variable-length unicode strings). Hash and equality are synthesized per dtype at compile time via @numba.extending.overload, so each call site does typed field accesses only — no byte view, no buffer allocation.

All hashmap state is stored in a single packed uint8 buffer (“table”) holding [info | lookup_table | index_table]. Use unpack(table) to get views.

Public API (convenience — unpacks per call):

init_dict(hint_size) -> table try_add_key(table, key_storage, key, i_item=None) -> slot | new_slot_bit, or i_add_key_fail find_key(table, key_storage, key) -> slot, or NOT_FOUND rehash(table, key_storage) -> new_table unpack(table) -> (info, lookup, index) views

Internal API (hot-path — caller unpacks once):

_try_add_key(info, lookup, index, key_storage, key, i_item=None) _find_key(info, lookup, index, key_storage, key)

Performance (indicative — varies with hardware, numba / numpy versions, and workload; here on int32 scalar keys, 1M ops, 50% hit / 50% miss):

Public try_add_key : a few M ops/s (unpack overhead per call) Public find_key : a few M ops/s Internal _try_add_key: ~1.5–2x faster than public Internal _find_key : ~2–3x faster than public

Records, floats, and unicode keys are slower than int32 scalars roughly in proportion to the number of fields / code points hashed.

Rule of thumb: use the public API for one-off or low-volume calls. For tight loops (inserting/looking up thousands of keys), unpack once and call the internal _impl functions directly.

try_add_key has two modes selected via i_item (compile-time literal branch — both modes specialize with no runtime overhead):

i_item is None (by-value mode):

On insertion, the hashmap writes key to key_storage[info[HM_INFO_N_VALID]] and stores info[HM_INFO_N_VALID] in the index. The caller doesn’t track positions; key_storage[:info[HM_INFO_N_VALID]] ends up holding the unique keys in insertion order.

i_item is an int (by-position mode):

On insertion, stores i_item in the index. The hashmap does NOT write to key_storage — the caller is asserting that key_storage[i_item] already holds key.

Sizing and rehash:

init_dict(hint_size) pre-allocates the table so that hint_size unique keys fit within the load factor (n_full_factor = 0.95). When the caller knows an upper bound on the number of unique keys and passes it as hint_size, the load-factor guard (info[HM_INFO_N_VALID] >= info[HM_INFO_N_FULL]) is guaranteed never to fire and can be omitted from the insert loop.

The while result == i_add_key_fail rehash loop must still be kept even with a correctly pre-sized table. i_add_key_fail signals that the Robin Hood displacement chain exceeded max_rh due to hash collisions — this is independent of the load factor and can (rarely) occur at any occupancy.

When hint_size is not known up front (e.g. streaming inserts), the caller must check the load-factor guard before each insertion:

if info[HM_INFO_N_VALID] >= info[HM_INFO_N_FULL]:

table = rehash(table, key_storage) info, lookup, index = unpack(table)

Example — by-value insert, size known (pre-sized table, no load-factor check):

from oasislmf.pytools.common.hashmap import (
    init_dict, unpack, rehash, _try_add_key,
    i_add_key_fail, new_slot_bit, slot_mask,
)

@nb.jit(cache=True)
def build_vuln_dict(vuln_ids):
    key_storage = np.empty(len(vuln_ids), dtype=np.int32)
    table = init_dict(len(vuln_ids))  # pre-sized: no load-factor rehash needed
    info, lookup, index = unpack(table)

    for i in range(len(vuln_ids)):
        result = _try_add_key(info, lookup, index, key_storage, vuln_ids[i])
        while result == i_add_key_fail:  # Robin Hood collision — still possible
            table = rehash(table, key_storage)
            info, lookup, index = unpack(table)
            result = _try_add_key(info, lookup, index, key_storage, vuln_ids[i])

        dense_idx = index[result & slot_mask]  # works for both new and existing

    return table, key_storage[:info[HM_INFO_N_VALID]]

Example — by-value insert, size unknown (must check load factor):

from oasislmf.pytools.common.hashmap import (
    init_dict, unpack, rehash, _try_add_key,
    i_add_key_fail, new_slot_bit, slot_mask,
    HM_INFO_N_VALID, HM_INFO_N_FULL,
)

@nb.jit(cache=True)
def build_vuln_dict(vuln_ids):
    key_storage = np.empty(len(vuln_ids), dtype=np.int32)
    table = init_dict()
    info, lookup, index = unpack(table)

    for i in range(len(vuln_ids)):
        if info[HM_INFO_N_VALID] >= info[HM_INFO_N_FULL]:
            table = rehash(table, key_storage)
            info, lookup, index = unpack(table)

        result = _try_add_key(info, lookup, index, key_storage, vuln_ids[i])
        while result == i_add_key_fail:
            table = rehash(table, key_storage)
            info, lookup, index = unpack(table)
            result = _try_add_key(info, lookup, index, key_storage, vuln_ids[i])

        dense_idx = index[result & slot_mask]  # works for both new and existing

    return table, key_storage[:info[HM_INFO_N_VALID]]

Example — lookup with internal API:

from oasislmf.pytools.common.hashmap import (
    unpack, _find_key, NOT_FOUND,
)

@nb.jit(cache=True)
def lookup_vuln_ids(table, key_storage, query_ids, out):
    info, lookup, index = unpack(table)
    for i in range(len(query_ids)):
        slot = _find_key(info, lookup, index, key_storage, query_ids[i])
        if slot != NOT_FOUND:
            out[i] = index[slot]   # dense index
        else:
            out[i] = -1
Notes:
  • The hash function (fnv1a) is deterministic with no random seed. Both table and key_storage can be saved/loaded from binary files and used directly with find_key — no rebuild needed.

  • Float keys (scalar or record field) are hashed AND compared bit-wise: IEEE 754 bytes are reinterpreted as a same-width uint for both fnv1a and key_eq. Consequences:

    • +0.0 and -0.0 are distinct keys (their sign bits differ).

    • NaN equals NaN iff the bit patterns match (no NaN != NaN rule).

    • Two NaN values with different mantissa/sign bits are distinct keys.

    Integer/uint/bool keys keep ordinary value-equality semantics.

  • Fixed-width unicode keys (‘U<n>’, whether scalar or record field) are hashed and compared code-point-by-code-point, with trailing NUL padding stripped (numpy’s fixed-width convention). So “ab” stored in ‘U3’ hashes and compares equal to “ab” stored in ‘U4’.

Attributes

Functions

extract_hash_bit(hash_key)

make_lookup_val(is_full, i_rh, hash_lookup_bit)

fnv1a(record[, h])

FNV-1a hash of a numpy scalar or structured record, followed by a

fnv1a_overload_record(record[, h])

fnv1a_overload_scalar(key[, h])

Handles any numeric scalar: int8..int64, uint8..uint64, float32, float64, bool.

fnv1a_overload_unichr(key[, h])

Handles unicode scalar keys — both fixed-width UnicodeCharSeq

key_eq(a, b)

Equality of two keys for numpy scalars / structured records. Under JIT,

key_eq_overload_record(a, b)

key_eq_overload_scalar(a, b)

Handles any numeric scalar. Floats are bit-compared to match the

key_eq_overload_unichr(a, b)

Handles unicode scalar key pairs — any combination of

unpack(table)

Extract (info, lookup_table, index_table) views from a packed buffer.

init_dict([hint_size])

Create a packed table buffer. Returns a single uint8 array.

try_add_key(table, key_storage, key[, i_item])

Try to insert key. Caller must check for i_add_key_fail first, then

find_key(table, key_table, key)

Look up key in the table without inserting. Returns the slot index

rehash(table, key_table)

Double the table size and re-insert every live key. Returns a new

jit_factorize(key_table)

Return a 1-based id per row of key_table; identical rows share an id.

factorize(df)

pd.factorize-equivalent driver for a pandas DataFrame.

Module Contents

oasislmf.pytools.common.hashmap.init_hash[source]
oasislmf.pytools.common.hashmap.FNV_PRIME[source]
oasislmf.pytools.common.hashmap.M3_C1[source]
oasislmf.pytools.common.hashmap.M3_C2[source]
oasislmf.pytools.common.hashmap.lookup_dtype[source]
oasislmf.pytools.common.hashmap.nb_lookup_dtype[source]
oasislmf.pytools.common.hashmap.lookup_bitsize[source]
oasislmf.pytools.common.hashmap.lookup_hash_bitsize[source]
oasislmf.pytools.common.hashmap.hash_key_shift[source]
oasislmf.pytools.common.hashmap.full_bit[source]
oasislmf.pytools.common.hashmap.hash_mask[source]
oasislmf.pytools.common.hashmap.max_rh[source]
oasislmf.pytools.common.hashmap.i_rh_mask[source]
oasislmf.pytools.common.hashmap.i_rh_increment[source]
oasislmf.pytools.common.hashmap.full_rh[source]
oasislmf.pytools.common.hashmap.n_full_factor = 0.95[source]
oasislmf.pytools.common.hashmap.inverse_n_full_factor = 1.0526315789473684[source]
oasislmf.pytools.common.hashmap.index_dtype[source]
oasislmf.pytools.common.hashmap.nb_index_dtype[source]
oasislmf.pytools.common.hashmap.HM_INFO_MASK = 0[source]
oasislmf.pytools.common.hashmap.HM_INFO_N_VALID = 1[source]
oasislmf.pytools.common.hashmap.HM_INFO_N_FULL = 2[source]
oasislmf.pytools.common.hashmap.i_add_key_fail[source]
oasislmf.pytools.common.hashmap.index_bitsize[source]
oasislmf.pytools.common.hashmap.new_slot_bit[source]
oasislmf.pytools.common.hashmap.slot_mask[source]
oasislmf.pytools.common.hashmap.NOT_FOUND[source]
oasislmf.pytools.common.hashmap.LOOKUP_ITEMSIZE[source]
oasislmf.pytools.common.hashmap.INDEX_ITEMSIZE[source]
oasislmf.pytools.common.hashmap.INFO_N[source]
oasislmf.pytools.common.hashmap.INFO_BYTES[source]
oasislmf.pytools.common.hashmap.extract_hash_bit(hash_key)[source]
oasislmf.pytools.common.hashmap.make_lookup_val(is_full, i_rh, hash_lookup_bit)[source]
oasislmf.pytools.common.hashmap.fnv1a(record, h=init_hash)[source]

FNV-1a hash of a numpy scalar or structured record, followed by a finalizer mix. Under JIT, @overload synthesizes a typed impl with the same contract.

Floats are hashed via bit-cast (IEEE 754 bytes reinterpreted as uint), so +0.0 and -0.0 hash distinctly and NaN bit patterns are preserved. Fixed-width unicode scalars and record fields (‘U<n>’) are hashed code-point-by-code-point after stripping trailing NUL padding (numpy’s fixed-width convention).

The finalizer is Murmur3 fmix64 (three shift/multiply rounds; see _fmix64). Without it, biased low-bit inputs (e.g. floats in [0, 1) or integers spaced by a power of 2) produce a non-uniform bucket distribution and exceed the Robin Hood probe limit.

Note: plain Python float / int keys (without a numpy dtype) take the np.uint64(record) value-cast path. Production code uses numpy types throughout, so this restriction is academic.

oasislmf.pytools.common.hashmap.fnv1a_overload_record(record, h=init_hash)[source]
oasislmf.pytools.common.hashmap.fnv1a_overload_scalar(key, h=init_hash)[source]

Handles any numeric scalar: int8..int64, uint8..uint64, float32, float64, bool. Floats are bit-cast to match the record overload’s behavior.

Unicode scalars (UnicodeCharSeq / UnicodeType) are handled by fnv1a_overload_unichr (the next overload arm).

oasislmf.pytools.common.hashmap.fnv1a_overload_unichr(key, h=init_hash)[source]

Handles unicode scalar keys — both fixed-width UnicodeCharSeq (e.g. when indexed inside JIT from a ‘U<n>’ array) and variable-length UnicodeType (e.g. when a numpy.str_ scalar is passed in from Python and lowered to a unicode string). str(key) is a no-op for UnicodeType and trims trailing NUL padding for UnicodeCharSeq, so both converge on the same hash for the same logical string.

oasislmf.pytools.common.hashmap.key_eq(a, b)[source]

Equality of two keys for numpy scalars / structured records. Under JIT, @overload synthesizes a typed impl with the same contract.

Records are compared field-by-field (NOT via a.tobytes() == b.tobytes(), which would also compare any padding bytes left uninitialized by np.empty under aligned dtypes). Float fields and float scalars are bit-compared so +0.0 ≠ -0.0 and NaN == NaN when bit patterns match; int/uint/bool fields use value equality. Fixed-width unicode scalars and record fields (‘U<n>’) compare via numpy / numba’s UnicodeCharSeq equality, which strips trailing NUL padding — matching fnv1a’s hashing semantics.

Note: plain Python float / int keys (without a numpy dtype) fall back to Python == semantics. Production code uses numpy types throughout, so this restriction is academic.

oasislmf.pytools.common.hashmap.key_eq_overload_record(a, b)[source]
oasislmf.pytools.common.hashmap.key_eq_overload_scalar(a, b)[source]

Handles any numeric scalar. Floats are bit-compared to match the fnv1a hash side: +0.0 ≠ -0.0, NaN equals NaN iff bit patterns match.

Unicode scalar pairs are handled by key_eq_overload_unichr (the next overload arm).

oasislmf.pytools.common.hashmap.key_eq_overload_unichr(a, b)[source]

Handles unicode scalar key pairs — any combination of UnicodeCharSeq and UnicodeType. Numba’s == for these types (charseq_eq) compares code-by-code after trimming trailing NUL padding on the UnicodeCharSeq side, matching fnv1a’s hashing semantics.

oasislmf.pytools.common.hashmap.unpack(table)[source]

Extract (info, lookup_table, index_table) views from a packed buffer. info: uint8 view → .view(index_dtype) → 3-element array [HM_INFO_MASK, HM_INFO_N_VALID, HM_INFO_N_FULL] lookup: uint8 view → .view(lookup_dtype) → table_size elements index: uint8 view → .view(index_dtype) → table_size elements

oasislmf.pytools.common.hashmap.init_dict(hint_size=15)[source]

Create a packed table buffer. Returns a single uint8 array.

oasislmf.pytools.common.hashmap.try_add_key(table, key_storage, key, i_item=None)[source]

Try to insert key. Caller must check for i_add_key_fail first, then mask off new_slot_bit to recover the slot index. See _try_add_key for the i_item mode semantics.

Returns:

i_add_key_fail — rehash needed slot — key already present at slot slot | new_slot_bit — key was just inserted at slot

oasislmf.pytools.common.hashmap.find_key(table, key_table, key)[source]

Look up key in the table without inserting. Returns the slot index on hit, or NOT_FOUND on miss.

Note: takes a key value (not an index into key_table), since the caller typically has a raw key in hand rather than a position in key_table.

oasislmf.pytools.common.hashmap.rehash(table, key_table)[source]

Double the table size and re-insert every live key. Returns a new packed table buffer (the old one becomes stale).

oasislmf.pytools.common.hashmap.jit_factorize(key_table)[source]

Return a 1-based id per row of key_table; identical rows share an id.

Uses _try_add_key in by-position mode (i_item passed): no extra storage allocated, no input modification. index[slot] points to the i_item of the first occurrence; agg_id is tracked via info[HM_INFO_N_VALID] (which increments to match the insertion order).

oasislmf.pytools.common.hashmap.factorize(df)[source]

pd.factorize-equivalent driver for a pandas DataFrame.

Per-column coercion to a numpy-friendly dtype:
  • Nullable integer columns (‘Int*’) → float32 (NaN-preserving).

  • Other numeric columns → kept as-is.

  • Everything else → fixed-width unicode ‘U<max_len>’ via astype(str).

The resulting columns are packed into a single structured array and handed to jit_factorize for grouping.