`oasislmf.utils.data`

Module Contents

Functions

`factorize_array`(arr[, sort_opt])	Groups a 1D Numpy array by item value, and optionally enumerates the groups, starting from 1.
`factorize_ndarray`(ndarr[, row_idxs, col_idxs, sort_opt])	Groups an n-D Numpy array by item value, and optionally enumerates the groups, starting from 1.
`factorize_dataframe`(df[, by_row_labels, ...])	Groups a selection of rows or columns of a Pandas DataFrame by value, and optionally enumerates the groups, starting from 1.
`fast_zip_arrays`(*arrays)	Speedy zip of a sequence or ordered iterable of Numpy arrays.
`fast_zip_dataframe_columns`(df, cols)	Speedy zip of a sequence or ordered iterable of Pandas DataFrame columns.
`detect_encoding`(filepath)	Detects the encoding of a CSV file of unknown encoding.
`get_dataframe`([src_fp, src_type, src_buf, src_data, ...])	Loads a Pandas dataframe from a source CSV or JSON file, a text buffer of such a file, or another Pandas dataframe.
`get_dtypes_and_required_cols`(get_dtypes[, all_dtypes])	Get OED column data types and required column names from JSON.
`get_ids`(df, usecols[, group_by, sort_keys])	Enumerates (counts) the rows of a given dataframe in a given subset of dataframe columns.
`get_json`(src_fp)	Loads JSON from file.
`get_timestamp`([thedate, fmt])	Get a timestamp string from a `datetime.datetime` object
`get_utctimestamp`([thedate, fmt])	Get a UTC timestamp string from a `datetime.datetime` object
`merge_check`(left, right[, on, raise_error])	Checks two dataframes for key intersection; use before performing a merge.
`merge_dataframes`(left, right[, join_on])	Merges two dataframes by ensuring there is no duplication of columns.
`prepare_location_df`(location_df)
`prepare_account_df`(accounts_df)
`prepare_reinsurance_df`(ri_info, ri_scope)
`get_exposure_data`(computation_step[, add_internal_col])
`print_dataframe`(df[, cols, string_cols, show_index, ...])	A method to pretty-print a Pandas dataframe - calls on the `tabulate` package.
`set_dataframe_column_dtypes`(df, dtypes)	A method to set column datatypes for a Pandas dataframe
`validate_vuln_csv_contents`(file_path)	Validate the contents of the CSV file for vulnerability replacements.
`validate_vulnerability_replacements`(analysis_settings_json)	Validate vulnerability replacements in analysis settings file.
`fill_na_with_categoricals`(df, fill_value)	Fill NA values in a Pandas DataFrame, with handling for Categorical dtype columns.

Attributes

`PANDAS_BASIC_DTYPES`
`PANDAS_DEFAULT_NULL_VALUES`
`RI_INFO_DEFAULTS`
`RI_SCOPE_DEFAULTS`

oasislmf.utils.data.PANDAS_BASIC_DTYPES

oasislmf.utils.data.PANDAS_DEFAULT_NULL_VALUES

oasislmf.utils.data.RI_INFO_DEFAULTS

oasislmf.utils.data.RI_SCOPE_DEFAULTS

oasislmf.utils.data.factorize_array(arr, sort_opt=False)

Groups a 1D Numpy array by item value, and optionally enumerates the groups, starting from 1. The default or assumed type is a Numpy array, although a Python list, tuple or Pandas series will work too.

Parameters:: arr (numpy.ndarray) – 1D Numpy array (or list, tuple, or Pandas series)
Returns:: A 2-tuple consisting of the enumeration and the value groups
Return type:: tuple
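The grouping described above can be illustrated without the library. This is a minimal pure-Python sketch, not the oasislmf implementation: the helper name `factorize_sketch` is hypothetical, and it assumes that groups are listed in order of first appearance unless `sort_opt` is set, in which case they are sorted.

```python
def factorize_sketch(values, sort_opt=False):
    """Hypothetical stand-in illustrating the documented behaviour."""
    # Distinct values: sorted, or in order of first appearance (assumption).
    groups = sorted(set(values)) if sort_opt else list(dict.fromkeys(values))
    # Number the groups starting from 1, as the description states.
    group_id = {value: i for i, value in enumerate(groups, start=1)}
    enumeration = [group_id[value] for value in values]
    return enumeration, groups


enum, groups = factorize_sketch(["b", "a", "b", "c"])
enum_sorted, groups_sorted = factorize_sketch(["b", "a", "b", "c"], sort_opt=True)
```

Each input item is replaced by the 1-based number of its group, and the second element of the tuple lists the group values in the same numbering order.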

oasislmf.utils.data.factorize_ndarray(ndarr, row_idxs=[], col_idxs=[], sort_opt=False)

Groups an n-D Numpy array by item value, and optionally enumerates the groups, starting from 1. The default or assumed type is a Numpy array, although a Python list, tuple or Pandas series will work too.

Parameters:

ndarr (numpy.ndarray) – n-D Numpy array (or appropriate Python structure or Pandas dataframe)
row_idxs (list) – A list of row indices to use for factorization (optional)
col_idxs (list) – A list of column indices to use for factorization (optional)

Returns:

A 2-tuple consisting of the enumeration and the value groups

Return type:

tuple
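For a 2D input, one plausible reading of `col_idxs` is that each row is keyed by the tuple of its values in the selected columns. The sketch below shows that reading in pure Python; `factorize_rows_sketch` is a hypothetical helper, and the first-appearance ordering of groups is an assumption rather than confirmed library behaviour.

```python
def factorize_rows_sketch(rows, col_idxs):
    """Hypothetical helper: group rows by the values in the selected columns."""
    # Build one hashable key per row from the chosen column positions.
    keys = [tuple(row[i] for i in col_idxs) for row in rows]
    # Distinct keys in order of first appearance (assumption).
    groups = list(dict.fromkeys(keys))
    group_id = {key: i for i, key in enumerate(groups, start=1)}
    return [group_id[key] for key in keys], groups


table = [
    [1, "A", 10.0],
    [2, "B", 20.0],
    [1, "A", 30.0],
]
enum, groups = factorize_rows_sketch(table, col_idxs=[0, 1])
```

Rows 0 and 2 agree on the first two columns, so they receive the same group number even though the third column differs.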

oasislmf.utils.data.factorize_dataframe(df, by_row_labels=None, by_row_indices=None, by_col_labels=None, by_col_indices=None)

Groups a selection of rows or columns of a Pandas DataFrame by value, and optionally enumerates the groups, starting from 1.

Parameters:

df (pandas.DataFrame) – Pandas DataFrame
by_row_labels (list, tuple) – A list or tuple of row labels
by_row_indices (list, tuple) – A list or tuple of row indices
by_col_labels (list, tuple) – A list or tuple of column labels
by_col_indices (list, tuple) – A list or tuple of column indices

Returns:

A 2-tuple consisting of the enumeration and the value groups

Return type:

tuple

oasislmf.utils.data.fast_zip_arrays(*arrays)

Speedy zip of a sequence or ordered iterable of Numpy arrays (Python iterables with ordered elements such as lists and tuples, or iterators or generators of these, will also work).

Parameters:: arrays (list, tuple, collections.Iterator, types.GeneratorType) – An iterable or iterator or generator of Numpy arrays
Returns:: A Numpy 1D array of n-tuples of the zipped sequences
Return type:: np.array
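The shape of the result can be pictured with the built-in `zip`. This is only an illustration of the n-tuple structure: the library function returns a Numpy 1D array of tuples rather than a list, and the column values below are made up.

```python
# Two parallel sequences of equal length (illustrative values).
port_numbers = [1, 1, 2]
acc_numbers = ["A11111", "A22222", "A11111"]

# One tuple per position, combining the items found at that position.
zipped = list(zip(port_numbers, acc_numbers))

# A generator of sequences can be unpacked in the same way.
zipped_from_gen = list(zip(*(seq for seq in (port_numbers, acc_numbers))))
```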

oasislmf.utils.data.fast_zip_dataframe_columns(df, cols)

Speedy zip of a sequence or ordered iterable of Pandas DataFrame columns (Python iterables with ordered elements such as lists and tuples, or iterators or generators of these, will also work).

Parameters:

df (pandas.DataFrame) – Pandas DataFrame
cols (list, tuple, collections.Iterator, types.GeneratorType) – An iterable or iterator or generator of Pandas DataFrame columns

Returns:

A Numpy 1D array of n-tuples of the dataframe columns to be zipped

Return type:

np.array

oasislmf.utils.data.detect_encoding(filepath)

Given a path to a CSV file of unknown encoding, reads lines from the file to detect its encoding.

Parameters:: filepath (str) – Filepath to check
Returns:: A dict describing the detected encoding, for example {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
Return type:: dict
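A typical use of the returned dict is to pass its `encoding` entry to whatever reads the file next. The sketch below does not call `detect_encoding`; it uses a hard-coded dict in the documented shape and a temporary file so that it is self-contained.

```python
import os
import tempfile

# Hypothetical detection result, in the shape documented above.
detected = {"encoding": "ISO-8859-1", "confidence": 0.73, "language": ""}

# Write a small CSV in that encoding so the example can run on its own.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as handle:
    handle.write("name,city\nZoë,Málaga\n".encode("ISO-8859-1"))
    csv_path = handle.name

try:
    # Reopen the file using the detected encoding.
    with open(csv_path, encoding=detected["encoding"], newline="") as source:
        text = source.read()
finally:
    os.remove(csv_path)
```

The `confidence` entry can be checked before trusting the result; the threshold to apply is left to the caller.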

oasislmf.utils.data.get_dataframe(src_fp=None, src_type=None, src_buf=None, src_data=None, float_precision='high', empty_data_error_msg=None, lowercase_cols=True, required_cols=(), col_defaults={}, non_na_cols=(), col_dtypes={}, sort_cols=None, sort_ascending=None, memory_map=False, low_memory=False, encoding=None)

Loads a Pandas dataframe from a source CSV or JSON file, or a text buffer of such a file (io.StringIO), or another Pandas dataframe.

Parameters:

src_fp (str) – Source CSV or JSON file path (optional)
src_type (str) – Type of source file, CSV or JSON (optional; default is csv)
src_buf (io.StringIO) – Text buffer of a source CSV or JSON file (optional)
float_precision (str) – Indicates whether to support high-precision numbers present in the data (optional; default is high)
empty_data_error_msg (str) – The message of the exception that is thrown if there is no data content, i.e. no rows (optional)
lowercase_cols (bool) – Whether to convert the dataframe columns to lowercase (optional; default is True)
required_cols (list, tuple, collections.Iterable) – An iterable of columns required to be present in the source data (optional)
col_defaults (dict) – A dict of column names and their default values. This can include both existing columns and new columns - defaults for existing columns are set row-wise using pd.DataFrame.fillna, while defaults for non-existent columns are set column-wise using assignment (optional)
non_na_cols (list, tuple, collections.Iterable) – An iterable of names of columns which must be dropped if they contain any null values (optional)
col_dtypes (dict) – A dict of column names and corresponding data types - Python built-in datatypes are accepted but are mapped to the corresponding Numpy datatypes (optional)
sort_cols (list, tuple, collections.Iterable) – An iterable of column names by which to sort the frame rows (optional)
sort_ascending (bool) – Whether to perform an ascending or descending sort - is used only in conjunction with the sort_cols option (optional)
memory_map (bool) – Memory-efficient option used when loading a frame from a file or text buffer - is a direct optional argument for the pd.read_csv method
low_memory (bool) – Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types, set this to False.
encoding (str) – Try to read CSV or JSON data with the given encoding type; if None, the encoding is auto-detected on UnicodeDecodeError

Returns:

A Pandas dataframe

Return type:

pd.DataFrame
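The two behaviours described for `col_defaults` (filling gaps in existing columns versus creating new columns) can be sketched without Pandas. The helper `apply_col_defaults_sketch`, the dict-of-lists stand-in for a dataframe, and the column names are all hypothetical; `None` plays the role of a null value.

```python
def apply_col_defaults_sketch(columns, col_defaults):
    """Hypothetical helper over a dict of column name -> list of values."""
    n_rows = len(next(iter(columns.values()), []))
    result = {name: list(values) for name, values in columns.items()}
    for name, default in col_defaults.items():
        if name in result:
            # Existing column: only missing entries receive the default.
            result[name] = [default if v is None else v for v in result[name]]
        else:
            # New column: every row receives the default.
            result[name] = [default] * n_rows
    return result


frame = {"locnumber": [1, 2, 3], "tiv": [100.0, None, 250.0]}
filled = apply_col_defaults_sketch(frame, {"tiv": 0.0, "currency": "USD"})
```

Here `tiv` keeps its two existing values and only the gap is filled, while `currency` is created with the default in every row.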

oasislmf.utils.data.get_dtypes_and_required_cols(get_dtypes, all_dtypes=False)

Get OED column data types and required column names from JSON.

Parameters:

get_dtypes (function) – Function used to get the dtypes dict from JSON
all_dtypes (bool) – If True, return every dtype field; otherwise only categoricals

oasislmf.utils.data.get_ids(df, usecols, group_by=[], sort_keys=True)

Enumerates (counts) the rows of a given dataframe in a given subset of dataframe columns, and optionally does the enumeration with respect to subgroups of the column subset.

Parameters:

df (pandas.DataFrame) – Input dataframe
usecols (list) – The column subset
group_by (list) – A subset of the column subset to use as a subgroup key
sort_keys (bool) – Sort keys by value before assigning ids

Example:

index  PortNumber  AccNumber  locnumbera   id (returned)
0      1           A11111     10002082049  3
1      1           A11111     10002082050  4
2      1           A11111     10002082051  5
3      1           A11111     10002082053  7
4      1           A11111     10002082054  8
5      1           A11111     10002082052  6
6      1           A11111     10002082046  1
7      1           A11111     10002082046  1
8      1           A11111     10002082048  2
9      1           A11111     10002082055  9

Returns:

The enumeration

Return type:

numpy.ndarray
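The id column in the example above is consistent with numbering each distinct key by its position in sorted order. The following pure-Python sketch reproduces those ids for the case without `group_by`; `get_ids_sketch` is a hypothetical helper and is not the library implementation.

```python
def get_ids_sketch(rows):
    """Hypothetical helper: number each distinct key by its sorted position."""
    ordered_keys = sorted(set(rows))
    key_id = {key: i for i, key in enumerate(ordered_keys, start=1)}
    # Ids are returned in the original row order.
    return [key_id[row] for row in rows]


locnumbers = [
    10002082049, 10002082050, 10002082051, 10002082053, 10002082054,
    10002082052, 10002082046, 10002082046, 10002082048, 10002082055,
]
# Each row key is (PortNumber, AccNumber, locnumbera), as in the example.
rows = [(1, "A11111", loc) for loc in locnumbers]
ids = get_ids_sketch(rows)
```

The two rows sharing `10002082046` receive the same id, which is why ten rows yield only nine distinct ids.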

oasislmf.utils.data.get_json(src_fp)

Loads JSON from file.

Parameters:: src_fp (str) – Source JSON file path
Returns:: The parsed JSON content
Return type:: dict

oasislmf.utils.data.get_timestamp(thedate=datetime.now(), fmt='%Y%m%d%H%M%S')

Get a timestamp string from a datetime.datetime object

Parameters:

thedate (datetime.datetime) – datetime.datetime object
fmt (str) – Timestamp format string

Returns:

Timestamp string

Return type:

str
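The default format string yields a compact, sortable timestamp. The sketch below applies it with `datetime.strftime`, on the assumption that the function formats `thedate` this way; the date is an arbitrary example.

```python
from datetime import datetime

# Arbitrary fixed moment, so the result is reproducible.
moment = datetime(2024, 1, 5, 13, 45, 9)

# Default format from the signature: year, month, day, hour, minute, second.
stamp = moment.strftime("%Y%m%d%H%M%S")
```

Because every field is zero-padded and ordered from most to least significant, strings in this format sort chronologically.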

oasislmf.utils.data.get_utctimestamp(thedate=datetime.utcnow(), fmt='%Y-%b-%d %H:%M:%S')

Get a UTC timestamp string from a datetime.datetime object

Parameters:

thedate (datetime.datetime) – datetime.datetime object
fmt (str) – Timestamp format string, default is “%Y-%b-%d %H:%M:%S”

Returns:

UTC timestamp string

Return type:

str
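The signature shows a call expression as the default for `thedate`. In Python, default argument values are evaluated once, when the function is defined, so a default written that way would not advance between calls; passing `thedate` explicitly avoids relying on it. The sketch below is a hypothetical helper (`utc_stamp_sketch`) that resolves the default at call time instead; it is not the library's code.

```python
from datetime import datetime, timezone

UTC_FMT = "%Y-%b-%d %H:%M:%S"  # default format from the signature


def utc_stamp_sketch(thedate=None, fmt=UTC_FMT):
    """Hypothetical helper: resolve the default at call time."""
    if thedate is None:
        thedate = datetime.now(timezone.utc)
    return thedate.strftime(fmt)


# Arbitrary fixed UTC moment, so the result is reproducible.
fixed = datetime(2024, 1, 5, 13, 45, 9, tzinfo=timezone.utc)
example = utc_stamp_sketch(fixed)
current = utc_stamp_sketch()
```

Note that `%b` is the abbreviated month name and depends on the active locale; the default C locale gives English abbreviations such as `Jan`.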

oasislmf.utils.data.merge_check(left, right, on=[], raise_error=True)

Checks two dataframes for key intersection; use before performing a merge.

Parameters:

left (pd.DataFrame) – The first of two dataframes to be merged
right (pd.DataFrame) – The second of two dataframes to be merged
on (list) – column keys to test

Returns:

A dict of booleans, True for an intersection between left/right

Return type:

dict

Example: {'PortNumber': False, 'AccNumber': True, 'layer_id': True, 'condnumber': True}
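The returned dict can be read as one flag per key column. The pure-Python sketch below assumes that "intersection" means the two frames share at least one value in that column; `merge_check_sketch` and the dict-of-lists stand-ins for dataframes are hypothetical.

```python
def merge_check_sketch(left, right, on):
    """Hypothetical helper over dicts of column name -> list of values."""
    result = {}
    for key in on:
        # Values present in this column on both sides.
        shared = set(left.get(key, [])) & set(right.get(key, []))
        result[key] = bool(shared)
    return result


left = {"PortNumber": [1, 1], "AccNumber": ["A11111", "A22222"]}
right = {"PortNumber": [2], "AccNumber": ["A11111"]}
checked = merge_check_sketch(left, right, on=["PortNumber", "AccNumber"])
```

A `False` entry signals that a merge on that key would match no rows, which is the situation the check is meant to catch before merging.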

oasislmf.utils.data.merge_dataframes(left, right, join_on=None, **kwargs)

Merges two dataframes by ensuring there is no duplication of columns.

Parameters:

left (pd.DataFrame) – The first of two dataframes to be merged
right (pd.DataFrame) – The second of two dataframes to be merged
kwargs (dict) – Optional keyword arguments passed directly to the underlying pd.merge method that is called, including options for the join keys, join type, etc. - please see the pd.merge documentation for details of these optional arguments

Returns:

A merged dataframe

Return type:

pd.DataFrame

oasislmf.utils.data.prepare_location_df(location_df)

oasislmf.utils.data.prepare_account_df(accounts_df)

oasislmf.utils.data.prepare_reinsurance_df(ri_info, ri_scope)

oasislmf.utils.data.get_exposure_data(computation_step, add_internal_col=False)

oasislmf.utils.data.print_dataframe(df, cols=[], string_cols=[], show_index=False, frame_header=None, column_headers='keys', tablefmt='psql', floatfmt=',.2f', end='\n', **tabulate_kwargs)

A method to pretty-print a Pandas dataframe - calls on the tabulate package

Parameters:

df (pd.DataFrame) – The dataframe to pretty-print
cols (list, tuple, collections.Iterable) – An iterable of names of columns whose values should be printed (optional). If unset, all columns will be printed.
string_cols (list, tuple, collections.Iterable) – An iterable of names of columns whose values should be treated as strings (optional)
show_index (bool) – Whether to display the index column in the printout (optional; default is False)
frame_header (str) – Header string to display on top of the printed dataframe (optional)
column_headers (list, str) – Column header format - see the tabulate.tabulate method documentation (optional, default is ‘keys’)
tablefmt (str, list, tuple) – Table format - see the tabulate.tabulate method documentation (optional; default is ‘psql’)
floatfmt (str) – Floating point format - see the tabulate.tabulate method documentation (optional; default is ',.2f')
end (str) – String to append after printing the dataframe (optional; default is newline)
tabulate_kwargs (dict) – Additional optional arguments passed directly to the underlying tabulate.tabulate method - see the method documentation for more details
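The default `floatfmt` of `',.2f'` reads as a standard Python format specification: a comma as thousands separator and two decimal places. The sketch below applies it with the built-in `format`, on the assumption that tabulate interprets the value in the same way; the numbers are illustrative.

```python
values = [1234567.891, 0.5, 42.0]

# ',' inserts thousands separators; '.2f' fixes two decimal places.
formatted = [format(value, ",.2f") for value in values]
```

This gives `1,234,567.89`, `0.50` and `42.00` for the three inputs.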

oasislmf.utils.data.set_dataframe_column_dtypes(df, dtypes)

A method to set column datatypes for a Pandas dataframe

Parameters:

df (pd.DataFrame) – The dataframe to process
dtypes (dict) – A dict of column names and corresponding Numpy datatypes - Python built-in datatypes can be passed in but they will be mapped to the corresponding Numpy datatypes

Returns:

The processed dataframe with column datatypes set

Return type:

pandas.DataFrame

oasislmf.utils.data.validate_vuln_csv_contents(file_path)

Validate the contents of the CSV file for vulnerability replacements.

Parameters:: file_path (str) – Path to the vulnerability CSV file
Returns:: True if the file is valid, False otherwise
Return type:: bool

oasislmf.utils.data.validate_vulnerability_replacements(analysis_settings_json)

Validate vulnerability replacements in analysis settings file. If vulnerability replacements are specified as a file path, check that the file exists. This way the user will be warned early if the vulnerability option selected is not valid.

Parameters:: analysis_settings_json (str) – JSON file path to analysis settings file
Returns:: True if the vulnerability replacements are present and valid, False otherwise
Return type:: bool

oasislmf.utils.data.fill_na_with_categoricals(df, fill_value)

Fill NA values in a Pandas DataFrame, with handling for Categorical dtype columns.

The input dataframe is modified in place.

Parameters:

df (pd.DataFrame) – The dataframe to process
fill_value (int, float, str, dict) – A single value to use in all columns, or a dict of column names and corresponding values to fill.
