oasislmf.utils.data
===================

.. py:module:: oasislmf.utils.data


Attributes
----------

.. autoapisummary::

   oasislmf.utils.data.PANDAS_BASIC_DTYPES
   oasislmf.utils.data.PANDAS_DEFAULT_NULL_VALUES
   oasislmf.utils.data.RI_INFO_DEFAULTS
   oasislmf.utils.data.RI_SCOPE_DEFAULTS


Functions
---------

.. autoapisummary::

   oasislmf.utils.data.factorize_array
   oasislmf.utils.data.factorize_ndarray
   oasislmf.utils.data.factorize_dataframe
   oasislmf.utils.data.fast_zip_arrays
   oasislmf.utils.data.fast_zip_dataframe_columns
   oasislmf.utils.data.detect_encoding
   oasislmf.utils.data.get_dataframe
   oasislmf.utils.data.get_dtypes_and_required_cols
   oasislmf.utils.data.get_ids
   oasislmf.utils.data.get_json
   oasislmf.utils.data.get_timestamp
   oasislmf.utils.data.get_utctimestamp
   oasislmf.utils.data.merge_check
   oasislmf.utils.data.merge_dataframes
   oasislmf.utils.data.prepare_location_df
   oasislmf.utils.data.prepare_account_df
   oasislmf.utils.data.prepare_reinsurance_df
   oasislmf.utils.data.get_exposure_data
   oasislmf.utils.data.print_dataframe
   oasislmf.utils.data.set_dataframe_column_dtypes
   oasislmf.utils.data.validate_vuln_csv_contents
   oasislmf.utils.data.validate_vulnerability_replacements
   oasislmf.utils.data.fill_na_with_categoricals


Module Contents
---------------

.. py:data:: PANDAS_BASIC_DTYPES

.. py:data:: PANDAS_DEFAULT_NULL_VALUES

.. py:data:: RI_INFO_DEFAULTS

.. py:data:: RI_SCOPE_DEFAULTS

.. py:function:: factorize_array(arr, sort_opt=False)

   Groups a 1D Numpy array by item value, and optionally enumerates the
   groups, starting from 1. The default or assumed type is a Numpy array,
   although a Python list, tuple or Pandas series will work too.

   :param arr: 1D Numpy array (or list, tuple, or Pandas series)
   :type arr: numpy.ndarray

   :return: A 2-tuple consisting of the enumeration and the value groups
   :rtype: tuple

.. py:function:: factorize_ndarray(ndarr, row_idxs=[], col_idxs=[], sort_opt=False)

   Groups an n-D Numpy array by item value, and optionally enumerates
   the groups, starting from 1.
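The 1-based enumeration these factorize helpers return can be sketched with ``pandas.factorize``, which enumerates groups from 0 (a sketch of the documented return shape only, not the library's implementation):

```python
import numpy as np
import pandas as pd

# Sketch only: pandas.factorize enumerates groups from 0, so add 1 to
# mirror the 1-based enumeration documented for factorize_array.
arr = np.array(['b', 'a', 'b', 'c'])
enum, groups = pd.factorize(arr)
enum = enum + 1
print(enum.tolist())    # [1, 2, 1, 3]
print(groups.tolist())  # ['b', 'a', 'c']
```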
   The default or assumed type is a Numpy array, although a Python
   list, tuple or Pandas series will work too.

   :param ndarr: n-D Numpy array (or appropriate Python structure or Pandas dataframe)
   :type ndarr: numpy.ndarray

   :param row_idxs: A list of row indices to use for factorization (optional)
   :type row_idxs: list

   :param col_idxs: A list of column indices to use for factorization (optional)
   :type col_idxs: list

   :return: A 2-tuple consisting of the enumeration and the value groups
   :rtype: tuple

.. py:function:: factorize_dataframe(df, by_row_labels=None, by_row_indices=None, by_col_labels=None, by_col_indices=None)

   Groups a selection of rows or columns of a Pandas DataFrame by value,
   and optionally enumerates the groups, starting from 1.

   :param df: Pandas DataFrame
   :type df: pandas.DataFrame

   :param by_row_labels: A list or tuple of row labels
   :type by_row_labels: list, tuple

   :param by_row_indices: A list or tuple of row indices
   :type by_row_indices: list, tuple

   :param by_col_labels: A list or tuple of column labels
   :type by_col_labels: list, tuple

   :param by_col_indices: A list or tuple of column indices
   :type by_col_indices: list, tuple

   :return: A 2-tuple consisting of the enumeration and the value groups
   :rtype: tuple

.. py:function:: fast_zip_arrays(*arrays)

   Speedy zip of a sequence or ordered iterable of Numpy arrays (Python
   iterables with ordered elements such as lists and tuples, or
   iterators or generators of these, will also work).

   :param arrays: An iterable or iterator or generator of Numpy arrays
   :type arrays: list, tuple, collections.Iterator, types.GeneratorType

   :return: A Numpy 1D array of n-tuples of the zipped sequences
   :rtype: numpy.ndarray

.. py:function:: fast_zip_dataframe_columns(df, cols)

   Speedy zip of a sequence or ordered iterable of Pandas DataFrame
   columns (Python iterables with ordered elements such as lists and
   tuples, or iterators or generators of these, will also work).
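A sketch of the zipped result these fast-zip helpers produce, a 1D object array of row tuples; this is illustrative only, not the optimised library implementation:

```python
import numpy as np
import pandas as pd

# Illustrative sketch: build a 1D object array of row tuples, which is
# the documented shape of the fast-zip helpers' output.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})
zipped = np.empty(len(df), dtype=object)
for i, tup in enumerate(zip(df['a'].tolist(), df['b'].tolist())):
    zipped[i] = tup
print(zipped[0])  # (1, 10)
```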
   :param df: Pandas DataFrame
   :type df: pandas.DataFrame

   :param cols: An iterable or iterator or generator of Pandas DataFrame columns
   :type cols: list, tuple, collections.Iterator, types.GeneratorType

   :return: A Numpy 1D array of n-tuples of the dataframe columns to be zipped
   :rtype: numpy.ndarray

.. py:function:: detect_encoding(filepath)

   Given a path to a CSV of unknown encoding, reads lines to detect its
   encoding type.

   :param filepath: Filepath to check
   :type filepath: str

   :return: Example ``{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}``
   :rtype: dict

.. py:function:: get_dataframe(src_fp=None, src_type=None, src_buf=None, src_data=None, float_precision='high', empty_data_error_msg=None, lowercase_cols=True, required_cols=(), col_defaults={}, non_na_cols=(), col_dtypes={}, sort_cols=None, sort_ascending=None, memory_map=False, low_memory=False, encoding=None)

   Loads a Pandas dataframe from a source CSV or JSON file, or a text
   buffer of such a file (``io.StringIO``), or another Pandas dataframe.

   :param src_fp: Source CSV or JSON file path (optional)
   :type src_fp: str

   :param src_type: Type of source file - CSV or JSON (optional; default is CSV)
   :type src_type: str

   :param src_buf: Text buffer of a source CSV or JSON file (optional)
   :type src_buf: io.StringIO

   :param src_data: Source data as an existing Pandas dataframe (optional)
   :type src_data: pd.DataFrame

   :param float_precision: Indicates whether to support high-precision numbers present in the data (optional; default is 'high')
   :type float_precision: str

   :param empty_data_error_msg: The message of the exception that is thrown if there is no data content, i.e. no rows (optional)
   :type empty_data_error_msg: str

   :param lowercase_cols: Whether to convert the dataframe columns to lowercase (optional; default is True)
   :type lowercase_cols: bool

   :param required_cols: An iterable of columns required to be present in the source data (optional)
   :type required_cols: list, tuple, collections.Iterable

   :param col_defaults: A dict of column names and their default values.
       This can include both existing columns and new columns - defaults for existing columns are set row-wise using ``pd.DataFrame.fillna``, while defaults for non-existent columns are set column-wise using assignment (optional)
   :type col_defaults: dict

   :param non_na_cols: An iterable of names of columns which must be dropped if they contain any null values (optional)
   :type non_na_cols: list, tuple, collections.Iterable

   :param col_dtypes: A dict of column names and corresponding data types - Python built-in datatypes are accepted but are mapped to the corresponding Numpy datatypes (optional)
   :type col_dtypes: dict

   :param sort_cols: An iterable of column names by which to sort the frame rows (optional)
   :type sort_cols: list, tuple, collections.Iterable

   :param sort_ascending: Whether to perform an ascending or descending sort - used only in conjunction with the sort_cols option (optional)
   :type sort_ascending: bool

   :param memory_map: Memory-efficient option used when loading a frame from a file or text buffer - passed directly to the pd.read_csv method (optional)
   :type memory_map: bool

   :param low_memory: Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure there are no mixed types, set to False or specify the types with the col_dtypes option (optional)
   :type low_memory: bool

   :param encoding: Try to read CSV or JSON data with the given encoding type; if ``None``, auto-detection is attempted on UnicodeDecodeError (optional)
   :type encoding: str

   :return: A Pandas dataframe
   :rtype: pd.DataFrame

.. py:function:: get_dtypes_and_required_cols(get_dtypes, all_dtypes=False)

   Get OED column data types and required column names from JSON.

   :param get_dtypes: Method to get the dtypes dict from JSON
   :type get_dtypes: function

   :param all_dtypes: If True return every dtype field, otherwise only categoricals
   :type all_dtypes: bool

.. py:function:: get_ids(df, usecols, group_by=[], sort_keys=True)

   Enumerates (counts) the rows of a given dataframe in a given subset
   of dataframe columns, and optionally does the enumeration with
   respect to subgroups of the column subset.

   :param df: Input dataframe
   :type df: pandas.DataFrame

   :param usecols: The column subset
   :type usecols: list

   :param group_by: A subset of the column subset to use as a subgroup key
   :type group_by: list

   :param sort_keys: Sort keys by value before assigning ids
   :type sort_keys: bool

   Example, if ``sort_keys=True``::

       index  PortNumber  AccNumber  locnumbera   id (returned)
       0      1           A11111     10002082049  3
       1      1           A11111     10002082050  4
       2      1           A11111     10002082051  5
       3      1           A11111     10002082053  7
       4      1           A11111     10002082054  8
       5      1           A11111     10002082052  6
       6      1           A11111     10002082046  1
       7      1           A11111     10002082046  1
       8      1           A11111     10002082048  2
       9      1           A11111     10002082055  9

   :return: The enumeration
   :rtype: numpy.ndarray

.. py:function:: get_json(src_fp)

   Loads JSON from file.

   :param src_fp: Source JSON file path
   :type src_fp: str

   :return: Parsed JSON content
   :rtype: dict

.. py:function:: get_timestamp(thedate=datetime.now(), fmt='%Y%m%d%H%M%S')

   Get a timestamp string from a ``datetime.datetime`` object

   :param thedate: ``datetime.datetime`` object
   :type thedate: datetime.datetime

   :param fmt: Timestamp format string
   :type fmt: str

   :return: Timestamp string
   :rtype: str

.. py:function:: get_utctimestamp(thedate=datetime.utcnow(), fmt='%Y-%b-%d %H:%M:%S')

   Get a UTC timestamp string from a ``datetime.datetime`` object

   :param thedate: ``datetime.datetime`` object
   :type thedate: datetime.datetime

   :param fmt: Timestamp format string, default is "%Y-%b-%d %H:%M:%S"
   :type fmt: str

   :return: UTC timestamp string
   :rtype: str

.. py:function:: merge_check(left, right, on=[], raise_error=True)

   Checks two dataframes for key intersections; use before performing a merge.

   :param left: The first of two dataframes to be merged
   :type left: pd.DataFrame

   :param right: The second of two dataframes to be merged
   :type right: pd.DataFrame

   :param on: Column keys to test
   :type on: list

   :return: A dict of booleans, True for an intersection between left/right, e.g. ``{'PortNumber': False, 'AccNumber': True, 'layer_id': True, 'condnumber': True}``
   :rtype: dict

.. py:function:: merge_dataframes(left, right, join_on=None, **kwargs)

   Merges two dataframes while ensuring there is no duplication of columns.

   :param left: The first of two dataframes to be merged
   :type left: pd.DataFrame

   :param right: The second of two dataframes to be merged
   :type right: pd.DataFrame

   :param kwargs: Optional keyword arguments passed directly to the underlying pd.merge method, including options for the join keys, join type, etc. - please see the pd.merge documentation for details of these optional arguments
   :type kwargs: dict

   :return: A merged dataframe
   :rtype: pd.DataFrame

.. py:function:: prepare_location_df(location_df)

.. py:function:: prepare_account_df(accounts_df)

.. py:function:: prepare_reinsurance_df(ri_info, ri_scope)

.. py:function:: get_exposure_data(computation_step, add_internal_col=False)

.. py:function:: print_dataframe(df, cols=[], string_cols=[], show_index=False, frame_header=None, column_headers='keys', tablefmt='psql', floatfmt=',.2f', end='\n', **tabulate_kwargs)

   A method to pretty-print a Pandas dataframe - calls on the ``tabulate`` package.

   :param df: The dataframe to pretty-print
   :type df: pd.DataFrame

   :param cols: An iterable of names of columns whose values should be printed (optional). If unset, all columns will be printed.
   :type cols: list, tuple, collections.Iterable

   :param string_cols: An iterable of names of columns whose values should be treated as strings (optional)
   :type string_cols: list, tuple, collections.Iterable

   :param show_index: Whether to display the index column in the printout (optional; default is False)
   :type show_index: bool

   :param frame_header: Header string to display on top of the printed dataframe (optional)
   :type frame_header: str

   :param column_headers: Column header format - see the tabulate.tabulate method documentation (optional; default is 'keys')
   :type column_headers: list, str

   :param tablefmt: Table format - see the tabulate.tabulate method documentation (optional; default is 'psql')
   :type tablefmt: str, list, tuple

   :param floatfmt: Floating point format - see the tabulate.tabulate method documentation (optional; default is ',.2f')
   :type floatfmt: str

   :param end: String to append after printing the dataframe (optional; default is newline)
   :type end: str

   :param tabulate_kwargs: Additional optional arguments passed directly to the underlying tabulate.tabulate method - see the method documentation for more details
   :type tabulate_kwargs: dict

.. py:function:: set_dataframe_column_dtypes(df, dtypes)

   A method to set column datatypes for a Pandas dataframe.

   :param df: The dataframe to process
   :type df: pd.DataFrame

   :param dtypes: A dict of column names and corresponding Numpy datatypes - Python built-in datatypes can be passed in but they will be mapped to the corresponding Numpy datatypes
   :type dtypes: dict

   :return: The processed dataframe with column datatypes set
   :rtype: pandas.DataFrame

.. py:function:: validate_vuln_csv_contents(file_path)

   Validate the contents of the CSV file for vulnerability replacements.

   Args:
       file_path (str): Path to the vulnerability CSV file

   Returns:
       bool: True if the file is valid, False otherwise

.. py:function:: validate_vulnerability_replacements(analysis_settings_json)

   Validate vulnerability replacements in an analysis settings file.

   If vulnerability replacements are specified as a file path, check
   that the file exists. This way the user is warned early if the
   selected vulnerability option is not valid.

   Args:
       analysis_settings_json (str): JSON file path to the analysis settings file

   Returns:
       bool: True if the vulnerability replacements are present and valid, False otherwise

.. py:function:: fill_na_with_categoricals(df, fill_value)

   Fill NA values in a Pandas DataFrame, with handling for Categorical
   dtype columns. The input dataframe is modified in place.

   :param df: The dataframe to process
   :type df: pd.DataFrame

   :param fill_value: A single value to use in all columns, or a dict of column names and corresponding fill values
   :type fill_value: int, float, str, dict
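The special Categorical handling matters because a plain ``fillna`` rejects values that are not already registered categories. A minimal sketch of the idea (not the library's implementation):

```python
import pandas as pd

# Sketch: filling NA in a Categorical column requires the fill value to
# be a registered category, so add it before calling fillna.
df = pd.DataFrame({'peril': pd.Categorical(['WSS', None, 'WTC'])})
fill = 'unknown'
if fill not in df['peril'].cat.categories:
    df['peril'] = df['peril'].cat.add_categories([fill])
df['peril'] = df['peril'].fillna(fill)
print(df['peril'].tolist())  # ['WSS', 'unknown', 'WTC']
```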