oasislmf.utils.data

Module Contents

Functions

factorize_array(arr[, sort_opt])

Groups a 1D Numpy array by item value, and optionally enumerates the groups, starting from 1.

factorize_ndarray(ndarr[, row_idxs, col_idxs, sort_opt])

Groups an n-D Numpy array by item value, and optionally enumerates the groups, starting from 1.

factorize_dataframe(df[, by_row_labels, ...])

Groups a selection of rows or columns of a Pandas DataFrame by value, and optionally enumerates the groups, starting from 1.

fast_zip_arrays(*arrays)

Speedy zip of a sequence or ordered iterable of Numpy arrays.

fast_zip_dataframe_columns(df, cols)

Speedy zip of a sequence or ordered iterable of Pandas DataFrame columns.

detect_encoding(filepath)

Given a path to a CSV of unknown encoding, reads lines to detect its encoding type.

get_dataframe([src_fp, src_type, src_buf, src_data, ...])

Loads a Pandas dataframe from a source CSV or JSON file, or a text buffer of such a file (io.StringIO), or another Pandas dataframe.

get_dtypes_and_required_cols(get_dtypes[, all_dtypes])

Get OED column data types and required column names from JSON.

get_ids(df, usecols[, group_by, sort_keys])

Enumerates (counts) the rows of a given dataframe in a given subset of dataframe columns.

get_json(src_fp)

Loads JSON from file.

get_timestamp([thedate, fmt])

Get a timestamp string from a datetime.datetime object

get_utctimestamp([thedate, fmt])

Get a UTC timestamp string from a datetime.datetime object

merge_check(left, right[, on, raise_error])

Checks two dataframes for key intersection; use before performing a merge.

merge_dataframes(left, right[, join_on])

Merges two dataframes by ensuring there is no duplication of columns.

prepare_location_df(location_df)

prepare_account_df(accounts_df)

prepare_reinsurance_df(ri_info, ri_scope)

get_exposure_data(computation_step[, add_internal_col])

print_dataframe(df[, cols, string_cols, show_index, ...])

A method to pretty-print a Pandas dataframe - calls on the tabulate package

set_dataframe_column_dtypes(df, dtypes)

A method to set column datatypes for a Pandas dataframe

validate_vuln_csv_contents(file_path)

Validate the contents of the CSV file for vulnerability replacements.

validate_vulnerability_replacements(analysis_settings_json)

Validate vulnerability replacements in analysis settings file.

fill_na_with_categoricals(df, fill_value)

Fill NA values in a Pandas DataFrame, with handling for Categorical dtype columns.

Attributes

oasislmf.utils.data.PANDAS_BASIC_DTYPES[source]
oasislmf.utils.data.PANDAS_DEFAULT_NULL_VALUES[source]
oasislmf.utils.data.RI_INFO_DEFAULTS[source]
oasislmf.utils.data.RI_SCOPE_DEFAULTS[source]
oasislmf.utils.data.factorize_array(arr, sort_opt=False)[source]

Groups a 1D Numpy array by item value, and optionally enumerates the groups, starting from 1. The default or assumed type is a Numpy array, although a Python list, tuple or Pandas series will work too.

Parameters:

arr (numpy.ndarray) – 1D Numpy array (or list, tuple, or Pandas series)

Returns:

A 2-tuple consisting of the enumeration and the value groups

Return type:

tuple
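
The grouping and 1-based enumeration described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: factorize_sketch is a hypothetical name, and the sketch assumes that sort_opt controls whether the distinct values are sorted before they are numbered, which the description does not state explicitly.

```python
def factorize_sketch(values, sort=False):
    """Return (enumeration, groups), with group numbers starting from 1."""
    # Distinct values in order of first appearance
    groups = list(dict.fromkeys(values))
    if sort:
        # Assumed meaning of sort_opt: number the groups in sorted order
        groups.sort()
    # Map each distinct value to its 1-based group number
    number = {value: i for i, value in enumerate(groups, start=1)}
    return [number[value] for value in values], groups


enumeration, groups = factorize_sketch(["b", "a", "b", "c"])
sorted_enumeration, sorted_groups = factorize_sketch(["b", "a", "b", "c"], sort=True)
```

Here enumeration holds one group number per input item and groups holds the distinct values, matching the shape of the 2-tuple described under Returns.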

oasislmf.utils.data.factorize_ndarray(ndarr, row_idxs=[], col_idxs=[], sort_opt=False)[source]

Groups an n-D Numpy array by item value, and optionally enumerates the groups, starting from 1. The default or assumed type is a Numpy array, although an appropriate Python structure or a Pandas dataframe will work too.

Parameters:
  • ndarr (numpy.ndarray) – n-D Numpy array (or appropriate Python structure or Pandas dataframe)

  • row_idxs (list) – A list of row indices to use for factorization (optional)

  • col_idxs (list) – A list of column indices to use for factorization (optional)

Returns:

A 2-tuple consisting of the enumeration and the value groups

Return type:

tuple

oasislmf.utils.data.factorize_dataframe(df, by_row_labels=None, by_row_indices=None, by_col_labels=None, by_col_indices=None)[source]

Groups a selection of rows or columns of a Pandas DataFrame by value, and optionally enumerates the groups, starting from 1.

Parameters:
  • df (pandas.DataFrame) – Pandas DataFrame

  • by_row_labels (list, tuple) – A list or tuple of row labels

  • by_row_indices (list, tuple) – A list or tuple of row indices

  • by_col_labels (list, tuple) – A list or tuple of column labels

  • by_col_indices (list, tuple) – A list or tuple of column indices

Returns:

A 2-tuple consisting of the enumeration and the value groups

Return type:

tuple

oasislmf.utils.data.fast_zip_arrays(*arrays)[source]

Speedy zip of a sequence or ordered iterable of Numpy arrays (Python iterables with ordered elements such as lists and tuples, or iterators or generators of these, will also work).

Parameters:

arrays (list, tuple, collections.Iterator, types.GeneratorType) – An iterable or iterator or generator of Numpy arrays

Returns:

A Numpy 1D array of n-tuples of the zipped sequences

Return type:

np.array
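
As a rough illustration of what "zipping" the arrays means, the sketch below pairs up the i-th elements of each input into one tuple. It uses only the built-in zip and returns a plain list, whereas the function documented above returns a Numpy 1D array; zip_sketch is a hypothetical name.

```python
def zip_sketch(*arrays):
    """Combine equally long sequences element-wise into a list of n-tuples."""
    # The i-th tuple holds the i-th element of every input sequence
    return list(zip(*arrays))


zipped = zip_sketch([1, 2, 3], ["a", "b", "c"], [0.1, 0.2, 0.3])
```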

oasislmf.utils.data.fast_zip_dataframe_columns(df, cols)[source]

Speedy zip of a sequence or ordered iterable of Pandas DataFrame columns (Python iterables with ordered elements such as lists and tuples, or iterators or generators of these, will also work).

Parameters:
  • df (pandas.DataFrame) – Pandas DataFrame

  • cols (list, tuple, collections.Iterator, types.GeneratorType) – An iterable or iterator or generator of Pandas DataFrame columns

Returns:

A Numpy 1D array of n-tuples of the dataframe columns to be zipped

Return type:

np.array

oasislmf.utils.data.detect_encoding(filepath)[source]

Given a path to a CSV of unknown encoding, reads lines to detect its encoding type.

Parameters:

filepath (str) – Filepath to check

Returns:

The detected encoding details, for example {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

Return type:

dict
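
For readers who want a feel for the problem, the sketch below guesses an encoding by trial decoding with the standard library only. This is a different and much cruder technique than statistical detection: it returns no confidence or language fields, and it is not how the function above works. The name guess_encoding_sketch and the candidate list are assumptions made for illustration.

```python
import os
import tempfile


def guess_encoding_sketch(filepath, candidates=("utf-8", "iso-8859-1")):
    """Return the first candidate encoding that decodes the whole file."""
    with open(filepath, "rb") as f:
        raw = f.read()
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return {"encoding": encoding}
        except UnicodeDecodeError:
            continue  # try the next candidate
    return {"encoding": None}


# Write a small CSV whose accented characters are not valid UTF-8 bytes
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("name,city\nJos\u00e9,M\u00e1laga\n".encode("iso-8859-1"))

result = guess_encoding_sketch(tmp.name)
os.remove(tmp.name)
```

The order of candidates matters: ISO-8859-1 can decode any byte sequence, so it only makes sense as the last fallback.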

oasislmf.utils.data.get_dataframe(src_fp=None, src_type=None, src_buf=None, src_data=None, float_precision='high', empty_data_error_msg=None, lowercase_cols=True, required_cols=(), col_defaults={}, non_na_cols=(), col_dtypes={}, sort_cols=None, sort_ascending=None, memory_map=False, low_memory=False, encoding=None)[source]

Loads a Pandas dataframe from a source CSV or JSON file, or a text buffer of such a file (io.StringIO), or another Pandas dataframe.

Parameters:
  • src_fp (str) – Source CSV or JSON file path (optional)

  • src_type (str) – Type of source file, CSV or JSON (optional; default is csv)

  • src_buf (io.StringIO) – Text buffer of a source CSV or JSON file (optional)

  • float_precision (str) – Indicates whether to support high-precision numbers present in the data (optional; default is high)

  • empty_data_error_msg (str) – The message of the exception that is thrown if there is no data content, i.e. no rows (optional)

  • lowercase_cols (bool) – Whether to convert the dataframe columns to lowercase (optional; default is True)

  • required_cols (list, tuple, collections.Iterable) – An iterable of columns required to be present in the source data (optional)

  • col_defaults (dict) – A dict of column names and their default values. This can include both existing columns and new columns - defaults for existing columns are set row-wise using pd.DataFrame.fillna, while defaults for non-existent columns are set column-wise using assignment (optional)

  • non_na_cols (list, tuple, collections.Iterable) – An iterable of names of columns which must be dropped if they contain any null values (optional)

  • col_dtypes (dict) – A dict of column names and corresponding data types - Python built-in datatypes are accepted but are mapped to the corresponding Numpy datatypes (optional)

  • sort_cols (list, tuple, collections.Iterable) – An iterable of column names by which to sort the frame rows (optional)

  • sort_ascending (bool) – Whether to perform an ascending or descending sort - is used only in conjunction with the sort_cols option (optional)

  • memory_map (bool) – Memory-efficient option used when loading a frame from a file or text buffer - is a direct optional argument for the pd.read_csv method

  • low_memory (bool) – Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types, either set this to False or specify the column data types.

  • encoding (str) – Try to read CSV or JSON data with the given encoding type; if None, auto-detection is attempted on UnicodeDecodeError

Returns:

A Pandas dataframe

Return type:

pd.DataFrame
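
The col_defaults behaviour described above (nulls in existing columns are filled, missing columns are created with the default) can be sketched in pure Python. The sketch models a frame as a list of row dicts with None standing in for a null value; apply_col_defaults_sketch and the column names are hypothetical, and the real function works on a Pandas dataframe instead.

```python
def apply_col_defaults_sketch(rows, col_defaults):
    """rows: list of dicts, one per row; None stands in for a null value."""
    for row in rows:
        for col, default in col_defaults.items():
            # A missing column is created with the default, and a null
            # in an existing column is replaced by the default
            if row.get(col) is None:
                row[col] = default
    return rows


rows = [
    {"tiv": 100.0, "currency": None},
    {"tiv": None, "currency": "GBP"},
]
filled = apply_col_defaults_sketch(
    rows, {"tiv": 0.0, "currency": "USD", "layer_id": 1}
)
```

After the call every row has a layer_id column, and the nulls in tiv and currency have been replaced while existing values are left alone.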

oasislmf.utils.data.get_dtypes_and_required_cols(get_dtypes, all_dtypes=False)[source]

Get OED column data types and required column names from JSON.

Parameters:
  • all_dtypes (boolean) – If true return every dtype field, otherwise only categoricals

  • get_dtypes (function) – method to get dict from JSON

oasislmf.utils.data.get_ids(df, usecols, group_by=[], sort_keys=True)[source]

Enumerates (counts) the rows of a given dataframe in a given subset of dataframe columns, and optionally does the enumeration with respect to subgroups of the column subset.

Parameters:
  • df (pandas.DataFrame) – Input dataframe

  • usecols (list) – The column subset

  • group_by (list) – A subset of the column subset to use as a subgroup key

  • sort_keys (bool) – Sort keys by value before assigning ids, for example:

    index  PortNumber  AccNumber  locnumbera   id (returned)
    0      1           A11111     10002082049  3
    1      1           A11111     10002082050  4
    2      1           A11111     10002082051  5
    3      1           A11111     10002082053  7
    4      1           A11111     10002082054  8
    5      1           A11111     10002082052  6
    6      1           A11111     10002082046  1
    7      1           A11111     10002082046  1
    8      1           A11111     10002082048  2
    9      1           A11111     10002082055  9

Returns:

The enumeration

Return type:

numpy.ndarray
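
The example table for sort_keys can be reproduced with a small pure-Python sketch: each distinct key (the tuple of values in the column subset) gets an id, and with sorting enabled the ids follow the sorted order of the keys. get_ids_sketch is a hypothetical name, the group_by option is not covered, and the behaviour shown for sort_keys=False (ids in order of first appearance) is an assumption rather than something the description states.

```python
def get_ids_sketch(rows, sort_keys=True):
    """rows: list of tuples holding the values of the chosen column subset."""
    # Distinct keys in order of first appearance
    keys = list(dict.fromkeys(rows))
    if sort_keys:
        keys.sort()
    # 1-based id per distinct key; duplicate rows share an id
    ids = {key: i for i, key in enumerate(keys, start=1)}
    return [ids[row] for row in rows]


loc_numbers = [
    10002082049, 10002082050, 10002082051, 10002082053, 10002082054,
    10002082052, 10002082046, 10002082046, 10002082048, 10002082055,
]
rows = [(1, "A11111", loc) for loc in loc_numbers]

sorted_ids = get_ids_sketch(rows)
unsorted_ids = get_ids_sketch(rows, sort_keys=False)
```

With sort_keys=True the result matches the id column of the table: the two rows sharing the smallest location number both receive id 1.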

oasislmf.utils.data.get_json(src_fp)[source]

Loads JSON from file.

Parameters:

src_fp (str) – Source JSON file path

Returns:

dict

Return type:

dict

oasislmf.utils.data.get_timestamp(thedate=datetime.now(), fmt='%Y%m%d%H%M%S')[source]

Get a timestamp string from a datetime.datetime object

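Parameters:
  • thedate (datetime.datetime) – datetime.datetime object

  • fmt (str) – Timestamp format string, default is “%Y%m%d%H%M%S”

Returns: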

Timestamp string

Return type:

str

oasislmf.utils.data.get_utctimestamp(thedate=datetime.utcnow(), fmt='%Y-%b-%d %H:%M:%S')[source]

Get a UTC timestamp string from a datetime.datetime object

Parameters:
  • thedate (datetime.datetime) – datetime.datetime object

  • fmt (str) – Timestamp format string, default is “%Y-%b-%d %H:%M:%S”

Returns:

UTC timestamp string

Return type:

str
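
To show what the two default format strings produce, the sketch below applies them to a fixed datetime with the standard strftime method. It does not call the functions documented above; the date used is arbitrary.

```python
from datetime import datetime, timezone

# A fixed, timezone-aware moment so the output is reproducible
moment = datetime(2024, 3, 5, 14, 7, 9, tzinfo=timezone.utc)

# Default format of get_timestamp: compact and sortable
compact = moment.strftime("%Y%m%d%H%M%S")

# Default format of get_utctimestamp: %b is the locale's abbreviated
# month name, e.g. "Mar" in an English or C locale
readable = moment.strftime("%Y-%b-%d %H:%M:%S")
```

The compact form is convenient in file names, while the second form is easier to read in logs but depends on the active locale for the month abbreviation.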

oasislmf.utils.data.merge_check(left, right, on=[], raise_error=True)[source]

Checks two dataframes for key intersection; use before performing a merge.

Parameters:
  • left (pd.DataFrame) – The first of two dataframes to be merged

  • right (pd.DataFrame) – The second of two dataframes to be merged

  • on (list) – column keys to test

Returns:

A dict of booleans, True for an intersection between left/right, e.g. {'PortNumber': False, 'AccNumber': True, 'layer_id': True, 'condnumber': True}

Return type:

dict
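
The idea behind the check can be sketched without Pandas: for each key column, test whether the two frames share at least one value. The sketch models each frame as a dict of column name to list of values; merge_check_sketch and the sample data are hypothetical, and the raise_error option is not modelled.

```python
def merge_check_sketch(left, right, on):
    """left/right: dicts mapping a column name to its list of values."""
    # True when the two columns have at least one value in common
    return {key: bool(set(left[key]) & set(right[key])) for key in on}


left = {"PortNumber": [1, 1], "AccNumber": ["A1", "A2"]}
right = {"PortNumber": [2, 2], "AccNumber": ["A2", "A3"]}

check = merge_check_sketch(left, right, on=["PortNumber", "AccNumber"])
```

A False entry signals that a merge on that key would match no rows, which is why the check is worth running before the merge itself.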

oasislmf.utils.data.merge_dataframes(left, right, join_on=None, **kwargs)[source]

Merges two dataframes by ensuring there is no duplication of columns.

Parameters:
  • left (pd.DataFrame) – The first of two dataframes to be merged

  • right (pd.DataFrame) – The second of two dataframes to be merged

  • kwargs (dict) – Optional keyword arguments passed directly to the underlying pd.merge method that is called, including options for the join keys, join type, etc. - please see the pd.merge documentation for details of these optional arguments

Returns:

A merged dataframe

Return type:

pd.DataFrame

oasislmf.utils.data.prepare_location_df(location_df)[source]
oasislmf.utils.data.prepare_account_df(accounts_df)[source]
oasislmf.utils.data.prepare_reinsurance_df(ri_info, ri_scope)[source]
oasislmf.utils.data.get_exposure_data(computation_step, add_internal_col=False)[source]
oasislmf.utils.data.print_dataframe(df, cols=[], string_cols=[], show_index=False, frame_header=None, column_headers='keys', tablefmt='psql', floatfmt=',.2f', end='\n', **tabulate_kwargs)[source]

A method to pretty-print a Pandas dataframe - calls on the tabulate package

Parameters:
  • df (pd.DataFrame) – The dataframe to pretty-print

  • cols (list, tuple, collections.Iterable) – An iterable of names of columns whose values should be printed (optional). If unset, all columns will be printed.

  • string_cols (list, tuple, collections.Iterable) – An iterable of names of columns whose values should be treated as strings (optional)

  • show_index (bool) – Whether to display the index column in the printout (optional; default is False)

  • frame_header (str) – Header string to display on top of the printed dataframe (optional)

  • column_headers (list, str) – Column header format - see the tabulate.tabulate method documentation (optional, default is ‘keys’)

  • tablefmt (str, list, tuple) – Table format - see the tabulate.tabulate method documentation (optional; default is ‘psql’)

  • floatfmt (str) – Floating point format - see the tabulate.tabulate method documentation (optional; default is “,.2f”)

  • end (str) – String to append after printing the dataframe (optional; default is newline)

  • tabulate_kwargs (dict) – Additional optional arguments passed directly to the underlying tabulate.tabulate method - see the method documentation for more details

oasislmf.utils.data.set_dataframe_column_dtypes(df, dtypes)[source]

A method to set column datatypes for a Pandas dataframe

Parameters:
  • df (pd.DataFrame) – The dataframe to process

  • dtypes (dict) – A dict of column names and corresponding Numpy datatypes - Python built-in datatypes can be passed in but they will be mapped to the corresponding Numpy datatypes

Returns:

The processed dataframe with column datatypes set

Return type:

pandas.DataFrame

oasislmf.utils.data.validate_vuln_csv_contents(file_path)[source]

Validate the contents of the CSV file for vulnerability replacements.

Parameters:

file_path (str) – Path to the vulnerability CSV file

Returns:

True if the file is valid, False otherwise

Return type:

bool

oasislmf.utils.data.validate_vulnerability_replacements(analysis_settings_json)[source]

Validate vulnerability replacements in analysis settings file. If vulnerability replacements are specified as a file path, check that the file exists. This way the user will be warned early if the vulnerability option selected is not valid.

Parameters:

analysis_settings_json (str) – JSON file path to analysis settings file

Returns:

True if the vulnerability replacements are present and valid, False otherwise

Return type:

bool

oasislmf.utils.data.fill_na_with_categoricals(df, fill_value)[source]

Fill NA values in a Pandas DataFrame, with handling for Categorical dtype columns.

The input dataframe is modified in place.

Parameters:
  • df (pd.DataFrame) – The dataframe to process

  • fill_value (int, float, str, dict) – A single value to use in all columns, or a dict of column names and corresponding values to fill.