oasislmf.utils.data
Attributes
Functions
- Groups a 1D Numpy array by item value, and optionally enumerates the groups, starting from 1.
- Groups an n-D Numpy array by item value, and optionally enumerates the groups, starting from 1.
- Groups a selection of rows or columns of a Pandas DataFrame array by value, and optionally enumerates the groups, starting from 1.
- Speedy zip of a sequence or ordered iterable of Numpy arrays (Python iterables with ordered elements such as lists and tuples, or iterators or generators of these, will also work).
- Speedy zip of a sequence or ordered iterable of Pandas DataFrame columns (Python iterables with ordered elements such as lists and tuples, or iterators or generators of these, will also work).
- Given a path to a CSV of unknown encoding, reads lines to detect its encoding type.
- Loads a Pandas dataframe from a source CSV or JSON file, or a text buffer of such a file (io.StringIO), or another Pandas dataframe.
- Get OED column data types and required column names from JSON.
- Enumerates (counts) the rows of a given dataframe in a given subset of dataframe columns, and optionally does the enumeration with respect to subgroups of the column subset.
- Loads JSON from file.
- Get a timestamp string from a datetime.datetime object.
- Get a UTC timestamp string from a datetime.datetime object.
- Check two dataframes for keys intersection, use before performing a merge.
- Merges two dataframes by ensuring there is no duplication of columns.
- A method to pretty-print a Pandas dataframe - calls on the tabulate package.
- A method to set column datatypes for a Pandas dataframe.
- Validate the contents of the CSV file for vulnerability replacements.
- Validate vulnerability replacements in analysis settings file.
- Fill NA values in a Pandas DataFrame, with handling for Categorical dtype columns.
Module Contents
- oasislmf.utils.data.factorize_array(arr, sort_opt=False)
Groups a 1D Numpy array by item value, and optionally enumerates the groups, starting from 1. The default or assumed type is a Numpy array, although a Python list, tuple or Pandas series will work too.
- Parameters:
arr (numpy.ndarray) – 1D Numpy array (or list, tuple, or Pandas series)
- Returns:
A 2-tuple consisting of the enumeration and the value groups
- Return type:
tuple
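The grouping described above can be illustrated with a pure-Python sketch. It is not the library's implementation: `factorize_sketch` is a hypothetical stand-in, and the assumption that `sort_opt` orders the groups by value before numbering them is inferred from the parameter name only.

```python
def factorize_sketch(values, sort_opt=False):
    """Hypothetical stand-in: return (enumeration, groups), with ids starting at 1."""
    if sort_opt:
        # Assumption: sorting orders the distinct values before numbering them.
        groups = sorted(set(values))
    else:
        # Keep distinct values in order of first appearance.
        groups = list(dict.fromkeys(values))
    ids = {value: i for i, value in enumerate(groups, start=1)}
    return [ids[value] for value in values], groups


enum, groups = factorize_sketch(["b", "a", "b", "c"])
# enum == [1, 2, 1, 3], groups == ["b", "a", "c"]
```

Each input item is replaced by the 1-based index of its group, so equal items always share an id.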
- oasislmf.utils.data.factorize_ndarray(ndarr, row_idxs=[], col_idxs=[], sort_opt=False)
Groups an n-D Numpy array by item value, and optionally enumerates the groups, starting from 1. The default or assumed type is a Numpy array, although a Python list, tuple or Pandas series will work too.
- Parameters:
- Returns:
A 2-tuple consisting of the enumeration and the value groups
- Return type:
tuple
- oasislmf.utils.data.factorize_dataframe(df, by_row_labels=None, by_row_indices=None, by_col_labels=None, by_col_indices=None)
Groups a selection of rows or columns of a Pandas DataFrame array by value, and optionally enumerates the groups, starting from 1.
- Parameters:
df (pandas.DataFrame) – Pandas DataFrame
- Returns:
A 2-tuple consisting of the enumeration and the value groups
- Return type:
tuple
- oasislmf.utils.data.fast_zip_arrays(*arrays)
Speedy zip of a sequence or ordered iterable of Numpy arrays (Python iterables with ordered elements such as lists and tuples, or iterators or generators of these, will also work).
- oasislmf.utils.data.fast_zip_dataframe_columns(df, cols)
Speedy zip of a sequence or ordered iterable of Pandas DataFrame columns (Python iterables with ordered elements such as lists and tuples, or iterators or generators of these, will also work).
- oasislmf.utils.data.detect_encoding(filepath)
Given a path to a CSV of unknown encoding, reads lines to detect its encoding type.
- oasislmf.utils.data.get_dataframe(src_fp=None, src_type=None, src_buf=None, src_data=None, float_precision='high', empty_data_error_msg=None, lowercase_cols=True, required_cols=(), col_defaults={}, non_na_cols=(), col_dtypes={}, sort_cols=None, sort_ascending=None, memory_map=False, low_memory=False, encoding=None)
Loads a Pandas dataframe from a source CSV or JSON file, or a text buffer of such a file (io.StringIO), or another Pandas dataframe.
- Parameters:
src_fp (str) – Source CSV or JSON file path (optional)
src_type (str) – Type of source file - CSV or JSON (optional; default is csv)
src_buf (io.StringIO) – Text buffer of a source CSV or JSON file (optional)
float_precision (str) – Indicates whether to support high-precision numbers present in the data (optional; default is high)
empty_data_error_msg (str) – The message of the exception that is thrown when there is no data content, i.e. no rows (optional)
lowercase_cols (bool) – Whether to convert the dataframe columns to lowercase (optional; default is True)
required_cols (list, tuple, collections.Iterable) – An iterable of columns required to be present in the source data (optional)
col_defaults (dict) – A dict of column names and their default values. This can include both existing columns and new columns - defaults for existing columns are set row-wise using pd.DataFrame.fillna, while defaults for non-existent columns are set column-wise using assignment (optional)
non_na_cols (list, tuple, collections.Iterable) – An iterable of names of columns which must be dropped if they contain any null values (optional)
col_dtypes (dict) – A dict of column names and corresponding data types - Python built-in datatypes are accepted but are mapped to the corresponding Numpy datatypes (optional)
sort_cols (list, tuple, collections.Iterable) – An iterable of column names by which to sort the frame rows (optional)
sort_ascending (bool) – Whether to perform an ascending or descending sort - is used only in conjunction with the sort_cols option (optional)
memory_map (bool) – Memory-efficient option used when loading a frame from a file or text buffer - is a direct optional argument for the pd.read_csv method
low_memory (bool) – Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False,
encoding (str) – Try to read CSV or JSON data with the given encoding type; if None, the encoding is auto-detected on UnicodeDecodeError
- Returns:
A Pandas dataframe
- Return type:
pd.DataFrame
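The `lowercase_cols`, `col_defaults` and `required_cols` options described above can be sketched directly in pandas. This is a minimal illustration of the documented behaviour, not the function's actual code; the CSV content and the column names `LocNumber`, `TIV` and `currency` are made up for the example.

```python
import io

import pandas as pd

# Hypothetical source data: one missing TIV value, no "currency" column.
csv_text = "LocNumber,TIV\n1,100.0\n2,\n"
df = pd.read_csv(io.StringIO(csv_text), float_precision="high")

# lowercase_cols=True: column names are converted to lowercase.
df.columns = df.columns.str.lower()

# col_defaults: existing columns are filled row-wise with fillna,
# non-existent columns are created column-wise by assignment.
col_defaults = {"tiv": 0.0, "currency": "USD"}
for col, default in col_defaults.items():
    if col in df.columns:
        df[col] = df[col].fillna(default)
    else:
        df[col] = default

# required_cols: columns that must be present in the source data.
required_cols = ("locnumber",)
missing = [col for col in required_cols if col not in df.columns]
```

After these steps `df` has the columns `locnumber`, `tiv` and `currency`, the missing `tiv` value is `0.0`, and `missing` is empty.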
- oasislmf.utils.data.get_dtypes_and_required_cols(get_dtypes, all_dtypes=False)
Get OED column data types and required column names from JSON.
- Parameters:
all_dtypes (boolean) – If true return every dtype field, otherwise only categoricals
get_dtypes (function) – method to get dict from JSON
- oasislmf.utils.data.get_ids(df, usecols, group_by=[], sort_keys=True)
Enumerates (counts) the rows of a given dataframe in a given subset of dataframe columns, and optionally does the enumeration with respect to subgroups of the column subset.
- Parameters:
df (pandas.DataFrame) – Input dataframe
usecols (list) – The column subset
group_by (list) – A subset of the column subset to use as a subgroup key
sort_keys (bool) – Sort keys by value before assigning ids
index  PortNumber  AccNumber  locnumbera   id (returned)
0      1           A11111     10002082049  3
1      1           A11111     10002082050  4
2      1           A11111     10002082051  5
3      1           A11111     10002082053  7
4      1           A11111     10002082054  8
5      1           A11111     10002082052  6
6      1           A11111     10002082046  1
7      1           A11111     10002082046  1
8      1           A11111     10002082048  2
9      1           A11111     10002082055  9
- Returns:
The enumeration
- Return type:
numpy.ndarray
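The example table can be reproduced with a pure-Python sketch of the enumeration. `get_ids_sketch` is a hypothetical stand-in that covers only the `sort_keys` case without `group_by`, and it returns a list rather than a Numpy array.

```python
# Rows of the example table: (PortNumber, AccNumber, location number).
rows = [
    (1, "A11111", 10002082049),
    (1, "A11111", 10002082050),
    (1, "A11111", 10002082051),
    (1, "A11111", 10002082053),
    (1, "A11111", 10002082054),
    (1, "A11111", 10002082052),
    (1, "A11111", 10002082046),
    (1, "A11111", 10002082046),
    (1, "A11111", 10002082048),
    (1, "A11111", 10002082055),
]


def get_ids_sketch(rows, sort_keys=True):
    """Hypothetical stand-in: number the distinct rows starting from 1."""
    # With sort_keys, distinct rows are ordered by value before numbering;
    # otherwise they keep their order of first appearance.
    keys = sorted(set(rows)) if sort_keys else list(dict.fromkeys(rows))
    ids = {key: i for i, key in enumerate(keys, start=1)}
    return [ids[row] for row in rows]


ids = get_ids_sketch(rows)
# ids == [3, 4, 5, 7, 8, 6, 1, 1, 2, 9]
```

Rows 6 and 7 carry identical values, so they receive the same id, which is why ten rows yield only nine distinct ids.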
- oasislmf.utils.data.get_timestamp(thedate=datetime.now(), fmt='%Y%m%d%H%M%S')
Get a timestamp string from a datetime.datetime object.
- Parameters:
thedate (datetime.datetime) – datetime.datetime object
fmt (str) – Timestamp format string
- Returns:
Timestamp string
- Return type:
str
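The signature above lists `datetime.now()` as the default for `thedate`. Python evaluates default argument values once, when the function is defined, so if the rendered signature mirrors the real definition, the default would be the time the module was imported rather than the time of the call. Passing `thedate` explicitly avoids the question. A minimal sketch of the formatting step, using a hypothetical stand-in `timestamp_sketch` rather than the library function:

```python
from datetime import datetime


def timestamp_sketch(thedate=None, fmt="%Y%m%d%H%M%S"):
    """Hypothetical stand-in that resolves the default at call time."""
    if thedate is None:
        thedate = datetime.now()
    return thedate.strftime(fmt)


stamp = timestamp_sketch(datetime(2024, 1, 5, 13, 7, 9))
# stamp == "20240105130709"
```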
- oasislmf.utils.data.get_utctimestamp(thedate=datetime.utcnow(), fmt='%Y-%b-%d %H:%M:%S')
Get a UTC timestamp string from a datetime.datetime object.
- Parameters:
thedate (datetime.datetime) – datetime.datetime object
fmt (str) – Timestamp format string, default is “%Y-%b-%d %H:%M:%S”
- Returns:
UTC timestamp string
- Return type:
str
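`datetime.utcnow()` is deprecated since Python 3.12 in favour of timezone-aware objects. The sketch below uses only the standard library and shows one way to obtain a UTC string in the documented default format from an aware datetime. The sample date and offset are made up, and `%b` yields an abbreviated month name that depends on the active locale.

```python
from datetime import datetime, timedelta, timezone

# A timezone-aware local time, two hours ahead of UTC (illustrative values).
local = datetime(2024, 1, 5, 15, 7, 9, tzinfo=timezone(timedelta(hours=2)))

# Convert to UTC first, then format with the documented default format string.
utc = local.astimezone(timezone.utc)
stamp = utc.strftime("%Y-%b-%d %H:%M:%S")
# stamp == "2024-Jan-05 13:07:09" under the default C locale
```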
- oasislmf.utils.data.merge_check(left, right, on=[], raise_error=True)
Check two dataframes for keys intersection, use before performing a merge
- Parameters:
left (pd.DataFrame) – The first of two dataframes to be merged
right (pd.DataFrame) – The second of two dataframes to be merged
on (list) – column keys to test
- Returns:
A dict of booleans, True for an intersection between left/right
- Return type:
dict, for example {'PortNumber': False, 'AccNumber': True, 'layer_id': True, 'condnumber': True}
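The key-intersection check can be sketched without pandas by treating each frame as a mapping from column name to values. `merge_check_sketch` is a hypothetical stand-in; in particular, the behaviour of `raise_error` is not described above, so the sketch assumes it raises when any key has no common values.

```python
def merge_check_sketch(left, right, on, raise_error=True):
    """Hypothetical stand-in: True per key if left and right share a value."""
    checked = {key: bool(set(left[key]) & set(right[key])) for key in on}
    # Assumption: a failed check raises when raise_error is set.
    if raise_error and not all(checked.values()):
        failed = [key for key, ok in checked.items() if not ok]
        raise ValueError(f"No common values for keys: {failed}")
    return checked


# Plain dicts of column -> values stand in for the two dataframes.
left = {"PortNumber": [1, 1], "AccNumber": ["A1", "A2"]}
right = {"PortNumber": [2], "AccNumber": ["A2", "A3"]}

result = merge_check_sketch(
    left, right, on=["PortNumber", "AccNumber"], raise_error=False
)
# result == {"PortNumber": False, "AccNumber": True}
```

A key that maps to False would produce no matching rows in an inner merge on that key, which is the situation the check is meant to catch early.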
- oasislmf.utils.data.merge_dataframes(left, right, join_on=None, **kwargs)
Merges two dataframes by ensuring there is no duplication of columns.
- Parameters:
left (pd.DataFrame) – The first of two dataframes to be merged
right (pd.DataFrame) – The second of two dataframes to be merged
kwargs (dict) – Optional keyword arguments passed directly to the underlying pd.merge method that is called, including options for the join keys, join type, etc. - please see the pd.merge documentation for details of these optional arguments
- Returns:
A merged dataframe
- Return type:
pd.DataFrame
- oasislmf.utils.data.print_dataframe(df, cols=[], string_cols=[], show_index=False, frame_header=None, column_headers='keys', tablefmt='psql', floatfmt=',.2f', end='\n', **tabulate_kwargs)
A method to pretty-print a Pandas dataframe - calls on the tabulate package.
- Parameters:
df (pd.DataFrame) – The dataframe to pretty-print
cols (list, tuple, collections.Iterable) – An iterable of names of columns whose values should be printed (optional). If unset, all columns will be printed.
string_cols (list, tuple, collections.Iterable) – An iterable of names of columns whose values should be treated as strings (optional)
show_index (bool) – Whether to display the index column in the printout (optional; default is False)
frame_header (str) – Header string to display on top of the printed dataframe (optional)
column_headers (list, str) – Column header format - see the tabulate.tabulate method documentation (optional, default is ‘keys’)
tablefmt (str, list, tuple) – Table format - see the tabulate.tabulate method documentation (optional; default is ‘psql’)
floatfmt (str) – Floating point format - see the tabulate.tabulate method documentation (optional; default is “,.2f”)
end (str) – String to append after printing the dataframe (optional; default is newline)
tabulate_kwargs (dict) – Additional optional arguments passed directly to the underlying tabulate.tabulate method - see the method documentation for more details
- oasislmf.utils.data.set_dataframe_column_dtypes(df, dtypes)
A method to set column datatypes for a Pandas dataframe
- Parameters:
df (pd.DataFrame) – The dataframe to process
dtypes (dict) – A dict of column names and corresponding Numpy datatypes - Python built-in datatypes can be passed in but they will be mapped to the corresponding Numpy datatypes
- Returns:
The processed dataframe with column datatypes set
- Return type:
pandas.DataFrame
- oasislmf.utils.data.validate_vuln_csv_contents(file_path)
Validate the contents of the CSV file for vulnerability replacements.
- Args:
file_path (str): Path to the vulnerability CSV file
- Returns:
bool: True if the file is valid, False otherwise
- oasislmf.utils.data.validate_vulnerability_replacements(analysis_settings_json)
Validate vulnerability replacements in analysis settings file. If vulnerability replacements are specified as a file path, check that the file exists. This way the user will be warned early if the vulnerability option selected is not valid.
- Args:
analysis_settings_json (str): JSON file path to analysis settings file
- Returns:
bool: True if the vulnerability replacements are present and valid, False otherwise