oasislmf.pytools.common.hashmap =============================== .. py:module:: oasislmf.pytools.common.hashmap .. autoapi-nested-parse:: Robin Hood hash table for numba JIT code. Supports structured-record keys (e.g. np.dtype([('a','i4'),('b','u1')])) and scalar keys — numeric (int8..int64, uint8..uint64, bool, float32, float64) or unicode (fixed-width 'U' / variable-length unicode strings). Hash and equality are synthesized per dtype at compile time via @numba.extending.overload, so each call site does typed field accesses only — no byte view, no buffer allocation. All hashmap state is stored in a single packed uint8 buffer ("table") holding [info | lookup_table | index_table]. Use unpack(table) to get views. Public API (convenience — unpacks per call): init_dict(hint_size) -> table try_add_key(table, key_storage, key, i_item=None) -> slot | new_slot_bit, or i_add_key_fail find_key(table, key_storage, key) -> slot, or NOT_FOUND rehash(table, key_storage) -> new_table unpack(table) -> (info, lookup, index) views Internal API (hot-path — caller unpacks once): _try_add_key(info, lookup, index, key_storage, key, i_item=None) _find_key(info, lookup, index, key_storage, key) Performance (indicative — varies with hardware, numba / numpy versions, and workload; here on int32 scalar keys, 1M ops, 50% hit / 50% miss): Public try_add_key : a few M ops/s (unpack overhead per call) Public find_key : a few M ops/s Internal _try_add_key: ~1.5–2x faster than public Internal _find_key : ~2–3x faster than public Records, floats, and unicode keys are slower than int32 scalars roughly in proportion to the number of fields / code points hashed. Rule of thumb: use the public API for one-off or low-volume calls. For tight loops (inserting/looking up thousands of keys), unpack once and call the internal _impl functions directly. try_add_key has two modes selected via `i_item` (compile-time literal branch — both modes specialize with no runtime overhead): i_item is None (by-value mode): On insertion, the hashmap writes `key` to key_storage[info[HM_INFO_N_VALID]] and stores info[HM_INFO_N_VALID] in the index. The caller doesn't track positions; key_storage[:info[HM_INFO_N_VALID]] ends up holding the unique keys in insertion order. i_item is an int (by-position mode): On insertion, stores `i_item` in the index. The hashmap does NOT write to key_storage — the caller is asserting that key_storage[i_item] already holds `key`. Sizing and rehash: init_dict(hint_size) pre-allocates the table so that `hint_size` unique keys fit within the load factor (n_full_factor = 0.95). When the caller knows an upper bound on the number of unique keys and passes it as hint_size, the load-factor guard (info[HM_INFO_N_VALID] >= info[HM_INFO_N_FULL]) is guaranteed never to fire and can be omitted from the insert loop. The `while result == i_add_key_fail` rehash loop must still be kept even with a correctly pre-sized table. i_add_key_fail signals that the Robin Hood displacement chain exceeded max_rh due to hash collisions — this is independent of the load factor and can (rarely) occur at any occupancy. When hint_size is not known up front (e.g. streaming inserts), the caller must check the load-factor guard before each insertion: if info[HM_INFO_N_VALID] >= info[HM_INFO_N_FULL]: table = rehash(table, key_storage) info, lookup, index = unpack(table) Example — by-value insert, size known (pre-sized table, no load-factor check):: from oasislmf.pytools.common.hashmap import ( init_dict, unpack, rehash, _try_add_key, i_add_key_fail, new_slot_bit, slot_mask, ) @nb.jit(cache=True) def build_vuln_dict(vuln_ids): key_storage = np.empty(len(vuln_ids), dtype=np.int32) table = init_dict(len(vuln_ids)) # pre-sized: no load-factor rehash needed info, lookup, index = unpack(table) for i in range(len(vuln_ids)): result = _try_add_key(info, lookup, index, key_storage, vuln_ids[i]) while result == i_add_key_fail: # Robin Hood collision — still possible table = rehash(table, key_storage) info, lookup, index = unpack(table) result = _try_add_key(info, lookup, index, key_storage, vuln_ids[i]) dense_idx = index[result & slot_mask] # works for both new and existing return table, key_storage[:info[HM_INFO_N_VALID]] Example — by-value insert, size unknown (must check load factor):: from oasislmf.pytools.common.hashmap import ( init_dict, unpack, rehash, _try_add_key, i_add_key_fail, new_slot_bit, slot_mask, HM_INFO_N_VALID, HM_INFO_N_FULL, ) @nb.jit(cache=True) def build_vuln_dict(vuln_ids): key_storage = np.empty(len(vuln_ids), dtype=np.int32) table = init_dict() info, lookup, index = unpack(table) for i in range(len(vuln_ids)): if info[HM_INFO_N_VALID] >= info[HM_INFO_N_FULL]: table = rehash(table, key_storage) info, lookup, index = unpack(table) result = _try_add_key(info, lookup, index, key_storage, vuln_ids[i]) while result == i_add_key_fail: table = rehash(table, key_storage) info, lookup, index = unpack(table) result = _try_add_key(info, lookup, index, key_storage, vuln_ids[i]) dense_idx = index[result & slot_mask] # works for both new and existing return table, key_storage[:info[HM_INFO_N_VALID]] Example — lookup with internal API:: from oasislmf.pytools.common.hashmap import ( unpack, _find_key, NOT_FOUND, ) @nb.jit(cache=True) def lookup_vuln_ids(table, key_storage, query_ids, out): info, lookup, index = unpack(table) for i in range(len(query_ids)): slot = _find_key(info, lookup, index, key_storage, query_ids[i]) if slot != NOT_FOUND: out[i] = index[slot] # dense index else: out[i] = -1 Notes: - The hash function (fnv1a) is deterministic with no random seed. Both `table` and `key_storage` can be saved/loaded from binary files and used directly with find_key — no rebuild needed. - Float keys (scalar or record field) are hashed AND compared bit-wise: IEEE 754 bytes are reinterpreted as a same-width uint for both fnv1a and key_eq. Consequences: * +0.0 and -0.0 are distinct keys (their sign bits differ). * NaN equals NaN iff the bit patterns match (no `NaN != NaN` rule). * Two NaN values with different mantissa/sign bits are distinct keys. Integer/uint/bool keys keep ordinary value-equality semantics. - Fixed-width unicode keys ('U', whether scalar or record field) are hashed and compared code-point-by-code-point, with trailing NUL padding stripped (numpy's fixed-width convention). So "ab" stored in 'U3' hashes and compares equal to "ab" stored in 'U4'. Attributes ---------- .. autoapisummary:: oasislmf.pytools.common.hashmap.init_hash oasislmf.pytools.common.hashmap.FNV_PRIME oasislmf.pytools.common.hashmap.M3_C1 oasislmf.pytools.common.hashmap.M3_C2 oasislmf.pytools.common.hashmap.lookup_dtype oasislmf.pytools.common.hashmap.nb_lookup_dtype oasislmf.pytools.common.hashmap.lookup_bitsize oasislmf.pytools.common.hashmap.lookup_hash_bitsize oasislmf.pytools.common.hashmap.hash_key_shift oasislmf.pytools.common.hashmap.full_bit oasislmf.pytools.common.hashmap.hash_mask oasislmf.pytools.common.hashmap.max_rh oasislmf.pytools.common.hashmap.i_rh_mask oasislmf.pytools.common.hashmap.i_rh_increment oasislmf.pytools.common.hashmap.full_rh oasislmf.pytools.common.hashmap.n_full_factor oasislmf.pytools.common.hashmap.inverse_n_full_factor oasislmf.pytools.common.hashmap.index_dtype oasislmf.pytools.common.hashmap.nb_index_dtype oasislmf.pytools.common.hashmap.HM_INFO_MASK oasislmf.pytools.common.hashmap.HM_INFO_N_VALID oasislmf.pytools.common.hashmap.HM_INFO_N_FULL oasislmf.pytools.common.hashmap.i_add_key_fail oasislmf.pytools.common.hashmap.index_bitsize oasislmf.pytools.common.hashmap.new_slot_bit oasislmf.pytools.common.hashmap.slot_mask oasislmf.pytools.common.hashmap.NOT_FOUND oasislmf.pytools.common.hashmap.LOOKUP_ITEMSIZE oasislmf.pytools.common.hashmap.INDEX_ITEMSIZE oasislmf.pytools.common.hashmap.INFO_N oasislmf.pytools.common.hashmap.INFO_BYTES Functions --------- .. autoapisummary:: oasislmf.pytools.common.hashmap.extract_hash_bit oasislmf.pytools.common.hashmap.make_lookup_val oasislmf.pytools.common.hashmap.fnv1a oasislmf.pytools.common.hashmap.fnv1a_overload_record oasislmf.pytools.common.hashmap.fnv1a_overload_scalar oasislmf.pytools.common.hashmap.fnv1a_overload_unichr oasislmf.pytools.common.hashmap.key_eq oasislmf.pytools.common.hashmap.key_eq_overload_record oasislmf.pytools.common.hashmap.key_eq_overload_scalar oasislmf.pytools.common.hashmap.key_eq_overload_unichr oasislmf.pytools.common.hashmap.unpack oasislmf.pytools.common.hashmap.init_dict oasislmf.pytools.common.hashmap.try_add_key oasislmf.pytools.common.hashmap.find_key oasislmf.pytools.common.hashmap.rehash oasislmf.pytools.common.hashmap.jit_factorize oasislmf.pytools.common.hashmap.factorize Module Contents --------------- .. py:data:: init_hash .. py:data:: FNV_PRIME .. py:data:: M3_C1 .. py:data:: M3_C2 .. py:data:: lookup_dtype .. py:data:: nb_lookup_dtype .. py:data:: lookup_bitsize .. py:data:: lookup_hash_bitsize .. py:data:: hash_key_shift .. py:data:: full_bit .. py:data:: hash_mask .. py:data:: max_rh .. py:data:: i_rh_mask .. py:data:: i_rh_increment .. py:data:: full_rh .. py:data:: n_full_factor :value: 0.95 .. py:data:: inverse_n_full_factor :value: 1.0526315789473684 .. py:data:: index_dtype .. py:data:: nb_index_dtype .. py:data:: HM_INFO_MASK :value: 0 .. py:data:: HM_INFO_N_VALID :value: 1 .. py:data:: HM_INFO_N_FULL :value: 2 .. py:data:: i_add_key_fail .. py:data:: index_bitsize .. py:data:: new_slot_bit .. py:data:: slot_mask .. py:data:: NOT_FOUND .. py:data:: LOOKUP_ITEMSIZE .. py:data:: INDEX_ITEMSIZE .. py:data:: INFO_N .. py:data:: INFO_BYTES .. py:function:: extract_hash_bit(hash_key) .. py:function:: make_lookup_val(is_full, i_rh, hash_lookup_bit) .. py:function:: fnv1a(record, h=init_hash) FNV-1a hash of a numpy scalar or structured record, followed by a finalizer mix. Under JIT, @overload synthesizes a typed impl with the same contract. Floats are hashed via bit-cast (IEEE 754 bytes reinterpreted as uint), so +0.0 and -0.0 hash distinctly and NaN bit patterns are preserved. Fixed-width unicode scalars and record fields ('U') are hashed code-point-by-code-point after stripping trailing NUL padding (numpy's fixed-width convention). The finalizer is Murmur3 fmix64 (three shift/multiply rounds; see ``_fmix64``). Without it, biased low-bit inputs (e.g. floats in [0, 1) or integers spaced by a power of 2) produce a non-uniform bucket distribution and exceed the Robin Hood probe limit. Note: plain Python ``float`` / ``int`` keys (without a numpy dtype) take the ``np.uint64(record)`` value-cast path. Production code uses numpy types throughout, so this restriction is academic. .. py:function:: fnv1a_overload_record(record, h=init_hash) .. py:function:: fnv1a_overload_scalar(key, h=init_hash) Handles any numeric scalar: int8..int64, uint8..uint64, float32, float64, bool. Floats are bit-cast to match the record overload's behavior. Unicode scalars (``UnicodeCharSeq`` / ``UnicodeType``) are handled by ``fnv1a_overload_unichr`` (the next overload arm). .. py:function:: fnv1a_overload_unichr(key, h=init_hash) Handles unicode scalar keys — both fixed-width ``UnicodeCharSeq`` (e.g. when indexed inside JIT from a 'U' array) and variable-length ``UnicodeType`` (e.g. when a numpy.str_ scalar is passed in from Python and lowered to a unicode string). ``str(key)`` is a no-op for UnicodeType and trims trailing NUL padding for UnicodeCharSeq, so both converge on the same hash for the same logical string. .. py:function:: key_eq(a, b) Equality of two keys for numpy scalars / structured records. Under JIT, @overload synthesizes a typed impl with the same contract. Records are compared field-by-field (NOT via ``a.tobytes() == b.tobytes()``, which would also compare any padding bytes left uninitialized by ``np.empty`` under aligned dtypes). Float fields and float scalars are bit-compared so +0.0 ≠ -0.0 and NaN == NaN when bit patterns match; int/uint/bool fields use value equality. Fixed-width unicode scalars and record fields ('U') compare via numpy / numba's UnicodeCharSeq equality, which strips trailing NUL padding — matching fnv1a's hashing semantics. Note: plain Python ``float`` / ``int`` keys (without a numpy dtype) fall back to Python ``==`` semantics. Production code uses numpy types throughout, so this restriction is academic. .. py:function:: key_eq_overload_record(a, b) .. py:function:: key_eq_overload_scalar(a, b) Handles any numeric scalar. Floats are bit-compared to match the fnv1a hash side: +0.0 ≠ -0.0, NaN equals NaN iff bit patterns match. Unicode scalar pairs are handled by ``key_eq_overload_unichr`` (the next overload arm). .. py:function:: key_eq_overload_unichr(a, b) Handles unicode scalar key pairs — any combination of ``UnicodeCharSeq`` and ``UnicodeType``. Numba's `==` for these types (``charseq_eq``) compares code-by-code after trimming trailing NUL padding on the UnicodeCharSeq side, matching fnv1a's hashing semantics. .. py:function:: unpack(table) Extract (info, lookup_table, index_table) views from a packed buffer. info: uint8 view → .view(index_dtype) → 3-element array [HM_INFO_MASK, HM_INFO_N_VALID, HM_INFO_N_FULL] lookup: uint8 view → .view(lookup_dtype) → table_size elements index: uint8 view → .view(index_dtype) → table_size elements .. py:function:: init_dict(hint_size=15) Create a packed table buffer. Returns a single uint8 array. .. py:function:: try_add_key(table, key_storage, key, i_item=None) Try to insert `key`. Caller must check for i_add_key_fail first, then mask off `new_slot_bit` to recover the slot index. See `_try_add_key` for the `i_item` mode semantics. Returns: i_add_key_fail — rehash needed slot — key already present at `slot` slot | new_slot_bit — key was just inserted at `slot` .. py:function:: find_key(table, key_table, key) Look up `key` in the table without inserting. Returns the slot index on hit, or NOT_FOUND on miss. Note: takes a key *value* (not an index into key_table), since the caller typically has a raw key in hand rather than a position in key_table. .. py:function:: rehash(table, key_table) Double the table size and re-insert every live key. Returns a new packed table buffer (the old one becomes stale). .. py:function:: jit_factorize(key_table) Return a 1-based id per row of `key_table`; identical rows share an id. Uses _try_add_key in by-position mode (i_item passed): no extra storage allocated, no input modification. index[slot] points to the i_item of the first occurrence; agg_id is tracked via info[HM_INFO_N_VALID] (which increments to match the insertion order). .. py:function:: factorize(df) pd.factorize-equivalent driver for a pandas DataFrame. Per-column coercion to a numpy-friendly dtype: - Nullable integer columns ('Int*') → float32 (NaN-preserving). - Other numeric columns → kept as-is. - Everything else → fixed-width unicode 'U' via astype(str). The resulting columns are packed into a single structured array and handed to ``jit_factorize`` for grouping.