oasislmf.utils.data
===================

.. py:module:: oasislmf.utils.data


Attributes
----------

.. autoapisummary::

   oasislmf.utils.data.PANDAS_BASIC_DTYPES
   oasislmf.utils.data.PANDAS_DEFAULT_NULL_VALUES
   oasislmf.utils.data.RI_INFO_DEFAULTS
   oasislmf.utils.data.RI_SCOPE_DEFAULTS


Functions
---------

.. autoapisummary::

   oasislmf.utils.data.factorize_array
   oasislmf.utils.data.factorize_ndarray
   oasislmf.utils.data.factorize_dataframe
   oasislmf.utils.data.fast_zip_arrays
   oasislmf.utils.data.fast_zip_dataframe_columns
   oasislmf.utils.data.detect_encoding
   oasislmf.utils.data.get_dataframe
   oasislmf.utils.data.get_dtypes_and_required_cols
   oasislmf.utils.data.get_ids
   oasislmf.utils.data.get_json
   oasislmf.utils.data.get_timestamp
   oasislmf.utils.data.get_utctimestamp
   oasislmf.utils.data.merge_check
   oasislmf.utils.data.merge_dataframes
   oasislmf.utils.data.prepare_location_df
   oasislmf.utils.data.prepare_account_df
   oasislmf.utils.data.prepare_reinsurance_df
   oasislmf.utils.data.get_exposure_data
   oasislmf.utils.data.print_dataframe
   oasislmf.utils.data.set_dataframe_column_dtypes
   oasislmf.utils.data.validate_vuln_csv_contents
   oasislmf.utils.data.validate_vulnerability_replacements
   oasislmf.utils.data.fill_na_with_categoricals


Module Contents
---------------

.. py:data:: PANDAS_BASIC_DTYPES

.. py:data:: PANDAS_DEFAULT_NULL_VALUES

.. py:data:: RI_INFO_DEFAULTS

.. py:data:: RI_SCOPE_DEFAULTS

.. py:function:: factorize_array(arr, sort_opt=False)

   Groups a 1D Numpy array by item value, and optionally enumerates the
   groups, starting from 1. The default or assumed type is a Numpy array,
   although a Python list, tuple or Pandas series will work too.

   :param arr: 1D Numpy array (or list, tuple, or Pandas series)
   :type arr: numpy.ndarray

   :return: A 2-tuple consisting of the enumeration and the value groups
   :rtype: tuple

.. py:function:: factorize_ndarray(ndarr, row_idxs=[], col_idxs=[], sort_opt=False)

   Groups an n-D Numpy array by item value, and optionally enumerates
   the groups, starting from 1.
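The 1-based enumeration these factorize helpers return can be sketched with ``pandas.factorize``, which enumerates groups from 0 (a sketch of the documented return shape only, not the library's implementation):

```python
import numpy as np
import pandas as pd

# Sketch only: pandas.factorize enumerates groups from 0, so add 1 to
# mirror the 1-based enumeration documented for factorize_array.
arr = np.array(['b', 'a', 'b', 'c'])
enum, groups = pd.factorize(arr)
enum = enum + 1
print(enum.tolist())    # [1, 2, 1, 3]
print(groups.tolist())  # ['b', 'a', 'c']
```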
   The default or assumed type is a Numpy array, although a Python
   list, tuple or Pandas series will work too.

   :param ndarr: n-D Numpy array (or appropriate Python structure or Pandas dataframe)
   :type ndarr: numpy.ndarray

   :param row_idxs: A list of row indices to use for factorization (optional)
   :type row_idxs: list

   :param col_idxs: A list of column indices to use for factorization (optional)
   :type col_idxs: list

   :return: A 2-tuple consisting of the enumeration and the value groups
   :rtype: tuple

.. py:function:: factorize_dataframe(df, by_row_labels=None, by_row_indices=None, by_col_labels=None, by_col_indices=None)

   Groups a selection of rows or columns of a Pandas DataFrame by value,
   and optionally enumerates the groups, starting from 1.

   :param df: Pandas DataFrame
   :type df: pandas.DataFrame

   :param by_row_labels: A list or tuple of row labels
   :type by_row_labels: list, tuple

   :param by_row_indices: A list or tuple of row indices
   :type by_row_indices: list, tuple

   :param by_col_labels: A list or tuple of column labels
   :type by_col_labels: list, tuple

   :param by_col_indices: A list or tuple of column indices
   :type by_col_indices: list, tuple

   :return: A 2-tuple consisting of the enumeration and the value groups
   :rtype: tuple

.. py:function:: fast_zip_arrays(*arrays)

   Speedy zip of a sequence or ordered iterable of Numpy arrays (Python
   iterables with ordered elements such as lists and tuples, or
   iterators or generators of these, will also work).

   :param arrays: An iterable or iterator or generator of Numpy arrays
   :type arrays: list, tuple, collections.Iterator, types.GeneratorType

   :return: A Numpy 1D array of n-tuples of the zipped sequences
   :rtype: numpy.ndarray

.. py:function:: fast_zip_dataframe_columns(df, cols)

   Speedy zip of a sequence or ordered iterable of Pandas DataFrame
   columns (Python iterables with ordered elements such as lists and
   tuples, or iterators or generators of these, will also work).
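A sketch of the zipped result these fast-zip helpers produce, a 1D object array of row tuples; this is illustrative only, not the optimised library implementation:

```python
import numpy as np
import pandas as pd

# Illustrative sketch: build a 1D object array of row tuples, which is
# the documented shape of the fast-zip helpers' output.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})
zipped = np.empty(len(df), dtype=object)
for i, tup in enumerate(zip(df['a'].tolist(), df['b'].tolist())):
    zipped[i] = tup
print(zipped[0])  # (1, 10)
```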
   :param df: Pandas DataFrame
   :type df: pandas.DataFrame

   :param cols: An iterable or iterator or generator of Pandas DataFrame columns
   :type cols: list, tuple, collections.Iterator, types.GeneratorType

   :return: A Numpy 1D array of n-tuples of the dataframe columns to be zipped
   :rtype: numpy.ndarray

.. py:function:: detect_encoding(filepath)

   Given a path to a CSV of unknown encoding, reads lines to detect its
   encoding type.

   :param filepath: Filepath to check
   :type filepath: str

   :return: Example ``{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}``
   :rtype: dict

.. py:function:: get_dataframe(src_fp=None, src_type=None, src_buf=None, src_data=None, float_precision='high', empty_data_error_msg=None, lowercase_cols=True, required_cols=(), col_defaults={}, non_na_cols=(), col_dtypes={}, sort_cols=None, sort_ascending=None, memory_map=False, low_memory=False, encoding=None)

   Loads a Pandas dataframe from a source CSV or JSON file, or a text
   buffer of such a file (``io.StringIO``), or another Pandas dataframe.

   :param src_fp: Source CSV or JSON file path (optional)
   :type src_fp: str

   :param src_type: Type of source file - CSV or JSON (optional; default is CSV)
   :type src_type: str

   :param src_buf: Text buffer of a source CSV or JSON file (optional)
   :type src_buf: io.StringIO

   :param src_data: Source data as an existing Pandas dataframe (optional)
   :type src_data: pd.DataFrame

   :param float_precision: Indicates whether to support high-precision numbers present in the data (optional; default is 'high')
   :type float_precision: str

   :param empty_data_error_msg: The message of the exception that is thrown if there is no data content, i.e. no rows (optional)
   :type empty_data_error_msg: str

   :param lowercase_cols: Whether to convert the dataframe columns to lowercase (optional; default is True)
   :type lowercase_cols: bool

   :param required_cols: An iterable of columns required to be present in the source data (optional)
   :type required_cols: list, tuple, collections.Iterable

   :param col_defaults: A dict of column names and their default values.
       This can include both existing columns and new columns - defaults for existing columns are set row-wise using ``pd.DataFrame.fillna``, while defaults for non-existent columns are set column-wise using assignment (optional)
   :type col_defaults: dict

   :param non_na_cols: An iterable of names of columns which must be dropped if they contain any null values (optional)
   :type non_na_cols: list, tuple, collections.Iterable

   :param col_dtypes: A dict of column names and corresponding data types - Python built-in datatypes are accepted but are mapped to the corresponding Numpy datatypes (optional)
   :type col_dtypes: dict

   :param sort_cols: An iterable of column names by which to sort the frame rows (optional)
   :type sort_cols: list, tuple, collections.Iterable

   :param sort_ascending: Whether to perform an ascending or descending sort - used only in conjunction with the sort_cols option (optional)
   :type sort_ascending: bool

   :param memory_map: Memory-efficient option used when loading a frame from a file or text buffer - passed directly to the pd.read_csv method (optional)
   :type memory_map: bool

   :param low_memory: Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure there are no mixed types, set to False or specify the types with the col_dtypes option (optional)
   :type low_memory: bool

   :param encoding: Try to read CSV or JSON data with the given encoding type; if ``None``, auto-detection is attempted on UnicodeDecodeError (optional)
   :type encoding: str

   :return: A Pandas dataframe
   :rtype: pd.DataFrame

.. py:function:: get_dtypes_and_required_cols(get_dtypes, all_dtypes=False)

   Get OED column data types and required column names from JSON.

   :param get_dtypes: Method to get the dtypes dict from JSON
   :type get_dtypes: function

   :param all_dtypes: If True return every dtype field, otherwise only categoricals
   :type all_dtypes: bool

.. py:function:: get_ids(df, usecols, group_by=[], sort_keys=True)

   Enumerates (counts) the rows of a given dataframe in a given subset
   of dataframe columns, and optionally does the enumeration with
   respect to subgroups of the column subset.

   :param df: Input dataframe
   :type df: pandas.DataFrame

   :param usecols: The column subset
   :type usecols: list

   :param group_by: A subset of the column subset to use as a subgroup key
   :type group_by: list

   :param sort_keys: Sort keys by value before assigning ids
   :type sort_keys: bool

   Example, if ``sort_keys=True``::

       index  PortNumber  AccNumber  locnumbera   id (returned)
       0      1           A11111     10002082049  3
       1      1           A11111     10002082050  4
       2      1           A11111     10002082051  5
       3      1           A11111     10002082053  7
       4      1           A11111     10002082054  8
       5      1           A11111     10002082052  6
       6      1           A11111     10002082046  1
       7      1           A11111     10002082046  1
       8      1           A11111     10002082048  2
       9      1           A11111     10002082055  9

   :return: The enumeration
   :rtype: numpy.ndarray

.. py:function:: get_json(src_fp)

   Loads JSON from file.

   :param src_fp: Source JSON file path
   :type src_fp: str

   :return: Parsed JSON content
   :rtype: dict

.. py:function:: get_timestamp(thedate=datetime.now(), fmt='%Y%m%d%H%M%S')

   Get a timestamp string from a ``datetime.datetime`` object

   :param thedate: ``datetime.datetime`` object
   :type thedate: datetime.datetime

   :param fmt: Timestamp format string
   :type fmt: str

   :return: Timestamp string
   :rtype: str

.. py:function:: get_utctimestamp(thedate=datetime.utcnow(), fmt='%Y-%b-%d %H:%M:%S')

   Get a UTC timestamp string from a ``datetime.datetime`` object

   :param thedate: ``datetime.datetime`` object
   :type thedate: datetime.datetime

   :param fmt: Timestamp format string, default is "%Y-%b-%d %H:%M:%S"
   :type fmt: str

   :return: UTC timestamp string
   :rtype: str

.. py:function:: merge_check(left, right, on=[], raise_error=True)

   Checks two dataframes for key intersections; use before performing a merge.

   :param left: The first of two dataframes to be merged
   :type left: pd.DataFrame

   :param right: The second of two dataframes to be merged
   :type right: pd.DataFrame

   :param on: Column keys to test
   :type on: list

   :return: A dict of booleans, True for an intersection between left/right, e.g. ``{'PortNumber': False, 'AccNumber': True, 'layer_id': True, 'condnumber': True}``
   :rtype: dict

.. py:function:: merge_dataframes(left, right, join_on=None, **kwargs)

   Merges two dataframes while ensuring there is no duplication of columns.

   :param left: The first of two dataframes to be merged
   :type left: pd.DataFrame

   :param right: The second of two dataframes to be merged
   :type right: pd.DataFrame

   :param kwargs: Optional keyword arguments passed directly to the underlying pd.merge method, including options for the join keys, join type, etc. - please see the pd.merge documentation for details of these optional arguments
   :type kwargs: dict

   :return: A merged dataframe
   :rtype: pd.DataFrame

.. py:function:: prepare_location_df(location_df)

.. py:function:: prepare_account_df(accounts_df)

.. py:function:: prepare_reinsurance_df(ri_info, ri_scope)

.. py:function:: get_exposure_data(computation_step, add_internal_col=False)

.. py:function:: print_dataframe(df, cols=[], string_cols=[], show_index=False, frame_header=None, column_headers='keys', tablefmt='psql', floatfmt=',.2f', end='\n', **tabulate_kwargs)

   A method to pretty-print a Pandas dataframe - calls on the ``tabulate`` package.

   :param df: The dataframe to pretty-print
   :type df: pd.DataFrame

   :param cols: An iterable of names of columns whose values should be printed (optional). If unset, all columns will be printed.
   :type cols: list, tuple, collections.Iterable

   :param string_cols: An iterable of names of columns whose values should be treated as strings (optional)
   :type string_cols: list, tuple, collections.Iterable

   :param show_index: Whether to display the index column in the printout (optional; default is False)
   :type show_index: bool

   :param frame_header: Header string to display on top of the printed dataframe (optional)
   :type frame_header: str

   :param column_headers: Column header format - see the tabulate.tabulate method documentation (optional; default is 'keys')
   :type column_headers: list, str

   :param tablefmt: Table format - see the tabulate.tabulate method documentation (optional; default is 'psql')
   :type tablefmt: str, list, tuple

   :param floatfmt: Floating point format - see the tabulate.tabulate method documentation (optional; default is ',.2f')
   :type floatfmt: str

   :param end: String to append after printing the dataframe (optional; default is newline)
   :type end: str

   :param tabulate_kwargs: Additional optional arguments passed directly to the underlying tabulate.tabulate method - see the method documentation for more details
   :type tabulate_kwargs: dict

.. py:function:: set_dataframe_column_dtypes(df, dtypes)

   A method to set column datatypes for a Pandas dataframe.

   :param df: The dataframe to process
   :type df: pd.DataFrame

   :param dtypes: A dict of column names and corresponding Numpy datatypes - Python built-in datatypes can be passed in but they will be mapped to the corresponding Numpy datatypes
   :type dtypes: dict

   :return: The processed dataframe with column datatypes set
   :rtype: pandas.DataFrame

.. py:function:: validate_vuln_csv_contents(file_path)

   Validate the contents of the CSV file for vulnerability replacements.

   Args:
       file_path (str): Path to the vulnerability CSV file

   Returns:
       bool: True if the file is valid, False otherwise

.. py:function:: validate_vulnerability_replacements(analysis_settings_json)

   Validate vulnerability replacements in an analysis settings file.

   If vulnerability replacements are specified as a file path, check
   that the file exists. This way the user is warned early if the
   selected vulnerability option is not valid.

   Args:
       analysis_settings_json (str): JSON file path to the analysis settings file

   Returns:
       bool: True if the vulnerability replacements are present and valid, False otherwise

.. py:function:: fill_na_with_categoricals(df, fill_value)

   Fill NA values in a Pandas DataFrame, with handling for Categorical
   dtype columns. The input dataframe is modified in place.

   :param df: The dataframe to process
   :type df: pd.DataFrame

   :param fill_value: A single value to use in all columns, or a dict of column names and corresponding fill values
   :type fill_value: int, float, str, dict
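The special Categorical handling matters because a plain ``fillna`` rejects values that are not already registered categories. A minimal sketch of the idea (not the library's implementation):

```python
import pandas as pd

# Sketch: filling NA in a Categorical column requires the fill value to
# be a registered category, so add it before calling fillna.
df = pd.DataFrame({'peril': pd.Categorical(['WSS', None, 'WTC'])})
fill = 'unknown'
if fill not in df['peril'].cat.categories:
    df['peril'] = df['peril'].cat.add_categories([fill])
df['peril'] = df['peril'].fillna(fill)
print(df['peril'].tolist())  # ['WSS', 'unknown', 'WTC']
```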