oasislmf.pytools.common.hashmap¶
Robin Hood hash table for numba JIT code.
Supports structured-record keys (e.g. np.dtype([(‘a’,’i4’),(‘b’,’u1’)])) and scalar keys — numeric (int8..int64, uint8..uint64, bool, float32, float64) or unicode (fixed-width ‘U<n>’ / variable-length unicode strings). Hash and equality are synthesized per dtype at compile time via @numba.extending.overload, so each call site does typed field accesses only — no byte view, no buffer allocation.
All hashmap state is stored in a single packed uint8 buffer (“table”) holding [info | lookup_table | index_table]. Use unpack(table) to get views.
- Public API (convenience — unpacks per call):
init_dict(hint_size) -> table try_add_key(table, key_storage, key, i_item=None) -> slot | new_slot_bit, or i_add_key_fail find_key(table, key_storage, key) -> slot, or NOT_FOUND rehash(table, key_storage) -> new_table unpack(table) -> (info, lookup, index) views
- Internal API (hot-path — caller unpacks once):
_try_add_key(info, lookup, index, key_storage, key, i_item=None) _find_key(info, lookup, index, key_storage, key)
Performance (indicative — varies with hardware, numba / numpy versions, and workload; here on int32 scalar keys, 1M ops, 50% hit / 50% miss):
Public try_add_key : a few M ops/s (unpack overhead per call) Public find_key : a few M ops/s Internal _try_add_key: ~1.5–2x faster than public Internal _find_key : ~2–3x faster than public
Records, floats, and unicode keys are slower than int32 scalars roughly in proportion to the number of fields / code points hashed.
Rule of thumb: use the public API for one-off or low-volume calls. For tight loops (inserting/looking up thousands of keys), unpack once and call the internal _impl functions directly.
try_add_key has two modes selected via i_item (compile-time literal branch — both modes specialize with no runtime overhead):
- i_item is None (by-value mode):
On insertion, the hashmap writes key to key_storage[info[HM_INFO_N_VALID]] and stores info[HM_INFO_N_VALID] in the index. The caller doesn’t track positions; key_storage[:info[HM_INFO_N_VALID]] ends up holding the unique keys in insertion order.
- i_item is an int (by-position mode):
On insertion, stores i_item in the index. The hashmap does NOT write to key_storage — the caller is asserting that key_storage[i_item] already holds key.
Sizing and rehash:
init_dict(hint_size) pre-allocates the table so that hint_size unique keys fit within the load factor (n_full_factor = 0.95). When the caller knows an upper bound on the number of unique keys and passes it as hint_size, the load-factor guard (info[HM_INFO_N_VALID] >= info[HM_INFO_N_FULL]) is guaranteed never to fire and can be omitted from the insert loop.
The while result == i_add_key_fail rehash loop must still be kept even with a correctly pre-sized table. i_add_key_fail signals that the Robin Hood displacement chain exceeded max_rh due to hash collisions — this is independent of the load factor and can (rarely) occur at any occupancy.
When hint_size is not known up front (e.g. streaming inserts), the caller must check the load-factor guard before each insertion:
- if info[HM_INFO_N_VALID] >= info[HM_INFO_N_FULL]:
table = rehash(table, key_storage) info, lookup, index = unpack(table)
Example — by-value insert, size known (pre-sized table, no load-factor check):
from oasislmf.pytools.common.hashmap import (
init_dict, unpack, rehash, _try_add_key,
i_add_key_fail, new_slot_bit, slot_mask,
)
@nb.jit(cache=True)
def build_vuln_dict(vuln_ids):
key_storage = np.empty(len(vuln_ids), dtype=np.int32)
table = init_dict(len(vuln_ids)) # pre-sized: no load-factor rehash needed
info, lookup, index = unpack(table)
for i in range(len(vuln_ids)):
result = _try_add_key(info, lookup, index, key_storage, vuln_ids[i])
while result == i_add_key_fail: # Robin Hood collision — still possible
table = rehash(table, key_storage)
info, lookup, index = unpack(table)
result = _try_add_key(info, lookup, index, key_storage, vuln_ids[i])
dense_idx = index[result & slot_mask] # works for both new and existing
return table, key_storage[:info[HM_INFO_N_VALID]]
Example — by-value insert, size unknown (must check load factor):
from oasislmf.pytools.common.hashmap import (
init_dict, unpack, rehash, _try_add_key,
i_add_key_fail, new_slot_bit, slot_mask,
HM_INFO_N_VALID, HM_INFO_N_FULL,
)
@nb.jit(cache=True)
def build_vuln_dict(vuln_ids):
key_storage = np.empty(len(vuln_ids), dtype=np.int32)
table = init_dict()
info, lookup, index = unpack(table)
for i in range(len(vuln_ids)):
if info[HM_INFO_N_VALID] >= info[HM_INFO_N_FULL]:
table = rehash(table, key_storage)
info, lookup, index = unpack(table)
result = _try_add_key(info, lookup, index, key_storage, vuln_ids[i])
while result == i_add_key_fail:
table = rehash(table, key_storage)
info, lookup, index = unpack(table)
result = _try_add_key(info, lookup, index, key_storage, vuln_ids[i])
dense_idx = index[result & slot_mask] # works for both new and existing
return table, key_storage[:info[HM_INFO_N_VALID]]
Example — lookup with internal API:
from oasislmf.pytools.common.hashmap import (
unpack, _find_key, NOT_FOUND,
)
@nb.jit(cache=True)
def lookup_vuln_ids(table, key_storage, query_ids, out):
info, lookup, index = unpack(table)
for i in range(len(query_ids)):
slot = _find_key(info, lookup, index, key_storage, query_ids[i])
if slot != NOT_FOUND:
out[i] = index[slot] # dense index
else:
out[i] = -1
- Notes:
The hash function (fnv1a) is deterministic with no random seed. Both table and key_storage can be saved/loaded from binary files and used directly with find_key — no rebuild needed.
Float keys (scalar or record field) are hashed AND compared bit-wise: IEEE 754 bytes are reinterpreted as a same-width uint for both fnv1a and key_eq. Consequences:
+0.0 and -0.0 are distinct keys (their sign bits differ).
NaN equals NaN iff the bit patterns match (no NaN != NaN rule).
Two NaN values with different mantissa/sign bits are distinct keys.
Integer/uint/bool keys keep ordinary value-equality semantics.
Fixed-width unicode keys (‘U<n>’, whether scalar or record field) are hashed and compared code-point-by-code-point, with trailing NUL padding stripped (numpy’s fixed-width convention). So “ab” stored in ‘U3’ hashes and compares equal to “ab” stored in ‘U4’.
Attributes¶
Functions¶
|
|
|
|
|
FNV-1a hash of a numpy scalar or structured record, followed by a |
|
|
|
Handles any numeric scalar: int8..int64, uint8..uint64, float32, float64, bool. |
|
Handles unicode scalar keys — both fixed-width |
|
Equality of two keys for numpy scalars / structured records. Under JIT, |
|
|
|
Handles any numeric scalar. Floats are bit-compared to match the |
|
Handles unicode scalar key pairs — any combination of |
|
Extract (info, lookup_table, index_table) views from a packed buffer. |
|
Create a packed table buffer. Returns a single uint8 array. |
|
Try to insert key. Caller must check for i_add_key_fail first, then |
|
Look up key in the table without inserting. Returns the slot index |
|
Double the table size and re-insert every live key. Returns a new |
|
Return a 1-based id per row of key_table; identical rows share an id. |
|
pd.factorize-equivalent driver for a pandas DataFrame. |
Module Contents¶
- oasislmf.pytools.common.hashmap.fnv1a(record, h=init_hash)[source]¶
FNV-1a hash of a numpy scalar or structured record, followed by a finalizer mix. Under JIT, @overload synthesizes a typed impl with the same contract.
Floats are hashed via bit-cast (IEEE 754 bytes reinterpreted as uint), so +0.0 and -0.0 hash distinctly and NaN bit patterns are preserved. Fixed-width unicode scalars and record fields (‘U<n>’) are hashed code-point-by-code-point after stripping trailing NUL padding (numpy’s fixed-width convention).
The finalizer is Murmur3 fmix64 (three shift/multiply rounds; see
_fmix64). Without it, biased low-bit inputs (e.g. floats in [0, 1) or integers spaced by a power of 2) produce a non-uniform bucket distribution and exceed the Robin Hood probe limit.Note: plain Python
float/intkeys (without a numpy dtype) take thenp.uint64(record)value-cast path. Production code uses numpy types throughout, so this restriction is academic.
- oasislmf.pytools.common.hashmap.fnv1a_overload_scalar(key, h=init_hash)[source]¶
Handles any numeric scalar: int8..int64, uint8..uint64, float32, float64, bool. Floats are bit-cast to match the record overload’s behavior.
Unicode scalars (
UnicodeCharSeq/UnicodeType) are handled byfnv1a_overload_unichr(the next overload arm).
- oasislmf.pytools.common.hashmap.fnv1a_overload_unichr(key, h=init_hash)[source]¶
Handles unicode scalar keys — both fixed-width
UnicodeCharSeq(e.g. when indexed inside JIT from a ‘U<n>’ array) and variable-lengthUnicodeType(e.g. when a numpy.str_ scalar is passed in from Python and lowered to a unicode string).str(key)is a no-op for UnicodeType and trims trailing NUL padding for UnicodeCharSeq, so both converge on the same hash for the same logical string.
- oasislmf.pytools.common.hashmap.key_eq(a, b)[source]¶
Equality of two keys for numpy scalars / structured records. Under JIT, @overload synthesizes a typed impl with the same contract.
Records are compared field-by-field (NOT via
a.tobytes() == b.tobytes(), which would also compare any padding bytes left uninitialized bynp.emptyunder aligned dtypes). Float fields and float scalars are bit-compared so +0.0 ≠ -0.0 and NaN == NaN when bit patterns match; int/uint/bool fields use value equality. Fixed-width unicode scalars and record fields (‘U<n>’) compare via numpy / numba’s UnicodeCharSeq equality, which strips trailing NUL padding — matching fnv1a’s hashing semantics.Note: plain Python
float/intkeys (without a numpy dtype) fall back to Python==semantics. Production code uses numpy types throughout, so this restriction is academic.
- oasislmf.pytools.common.hashmap.key_eq_overload_scalar(a, b)[source]¶
Handles any numeric scalar. Floats are bit-compared to match the fnv1a hash side: +0.0 ≠ -0.0, NaN equals NaN iff bit patterns match.
Unicode scalar pairs are handled by
key_eq_overload_unichr(the next overload arm).
- oasislmf.pytools.common.hashmap.key_eq_overload_unichr(a, b)[source]¶
Handles unicode scalar key pairs — any combination of
UnicodeCharSeqandUnicodeType. Numba’s == for these types (charseq_eq) compares code-by-code after trimming trailing NUL padding on the UnicodeCharSeq side, matching fnv1a’s hashing semantics.
- oasislmf.pytools.common.hashmap.unpack(table)[source]¶
Extract (info, lookup_table, index_table) views from a packed buffer. info: uint8 view → .view(index_dtype) → 3-element array [HM_INFO_MASK, HM_INFO_N_VALID, HM_INFO_N_FULL] lookup: uint8 view → .view(lookup_dtype) → table_size elements index: uint8 view → .view(index_dtype) → table_size elements
- oasislmf.pytools.common.hashmap.init_dict(hint_size=15)[source]¶
Create a packed table buffer. Returns a single uint8 array.
- oasislmf.pytools.common.hashmap.try_add_key(table, key_storage, key, i_item=None)[source]¶
Try to insert key. Caller must check for i_add_key_fail first, then mask off new_slot_bit to recover the slot index. See _try_add_key for the i_item mode semantics.
- Returns:
i_add_key_fail — rehash needed slot — key already present at slot slot | new_slot_bit — key was just inserted at slot
- oasislmf.pytools.common.hashmap.find_key(table, key_table, key)[source]¶
Look up key in the table without inserting. Returns the slot index on hit, or NOT_FOUND on miss.
Note: takes a key value (not an index into key_table), since the caller typically has a raw key in hand rather than a position in key_table.
- oasislmf.pytools.common.hashmap.rehash(table, key_table)[source]¶
Double the table size and re-insert every live key. Returns a new packed table buffer (the old one becomes stale).
- oasislmf.pytools.common.hashmap.jit_factorize(key_table)[source]¶
Return a 1-based id per row of key_table; identical rows share an id.
Uses _try_add_key in by-position mode (i_item passed): no extra storage allocated, no input modification. index[slot] points to the i_item of the first occurrence; agg_id is tracked via info[HM_INFO_N_VALID] (which increments to match the insertion order).
- oasislmf.pytools.common.hashmap.factorize(df)[source]¶
pd.factorize-equivalent driver for a pandas DataFrame.
- Per-column coercion to a numpy-friendly dtype:
Nullable integer columns (‘Int*’) → float32 (NaN-preserving).
Other numeric columns → kept as-is.
Everything else → fixed-width unicode ‘U<max_len>’ via astype(str).
The resulting columns are packed into a single structured array and handed to
jit_factorizefor grouping.