oasislmf.lookup.builtin
Module for the built-in Lookup class.
In the future we may want to improve the management of the files used to generate the keys. Tutorial for pandas and parquet: https://towardsdatascience.com/a-gentle-introduction-to-apache-arrow-with-apache-spark-and-pandas-bb19ffe0ddae
Attributes
OPT_INSTALL_MESSAGE |
key_columns |
Classes
PerilCoveredDeterministicLookup | Basic abstract class for KeyLookup
Lookup | Built-in Lookup class that implements the OasisLookupInterface
Functions
get_nearest | Find nearest neighbors for all source points from a set of candidate points
nearest_neighbor | For each point in left_gdf, find closest point in right GeoDataFrame and return them.
Module Contents
- oasislmf.lookup.builtin.OPT_INSTALL_MESSAGE = "install oasislmf with extra packages by running 'pip install oasislmf[extra]'"
- oasislmf.lookup.builtin.get_nearest(src_points, candidates, k_neighbors=1)
Find nearest neighbors for all source points from a set of candidate points.
- oasislmf.lookup.builtin.nearest_neighbor(left_gdf, right_gdf, return_dist=False)
For each point in left_gdf, find the closest point in the right GeoDataFrame and return them.
NOTICE: Assumes that the input Points are in WGS84 projection (lat/lon).
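As a rough illustration of what a nearest-neighbor search over WGS84 points involves, the sketch below uses a brute-force haversine distance with the standard library only. It is not the implementation behind get_nearest or nearest_neighbor; the names haversine_m and brute_force_nearest, the Earth radius constant and the sample coordinates are assumptions made for this example.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed constant)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def brute_force_nearest(src_points, candidates):
    """For each source point, return (index of closest candidate, distance in metres)."""
    results = []
    for lat, lon in src_points:
        distances = [haversine_m(lat, lon, c_lat, c_lon) for c_lat, c_lon in candidates]
        best = min(range(len(candidates)), key=distances.__getitem__)
        results.append((best, distances[best]))
    return results


# Hypothetical data: two locations and three candidate points, as (lat, lon).
locations = [(51.50, -0.12), (48.85, 2.35)]
candidates = [(48.86, 2.34), (51.51, -0.13), (40.71, -74.00)]
nearest = brute_force_nearest(locations, candidates)
```

A brute-force scan costs one distance computation per (source, candidate) pair, which is why spatial indexes are normally preferred for large candidate sets.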
- oasislmf.lookup.builtin.key_columns = ['loc_id', 'peril_id', 'coverage_type', 'area_peril_id', 'vulnerability_id', 'status', 'message']
- class oasislmf.lookup.builtin.PerilCoveredDeterministicLookup(config, config_dir=None, user_data_dir=None, output_dir=None)
Bases: oasislmf.lookup.base.AbstractBasicKeyLookup
Basic abstract class for KeyLookup
- class oasislmf.lookup.builtin.Lookup(config, config_dir=None, user_data_dir=None, output_dir=None)
Bases: oasislmf.lookup.base.AbstractBasicKeyLookup, oasislmf.lookup.base.MultiprocLookupMixin
Built-in Lookup class that implements the OasisLookupInterface. The aim of this class is to provide a data-driven lookup capability that is both flexible and efficient.
It provides several generic function factories that can be defined in the config under the "step_definition" key, for example:
    "step_definition": {
        "split_loc_perils_covered": {
            "type": "split_loc_perils_covered",
            "columns": ["locperilscovered"],
            "parameters": {
                "model_perils_covered": ["WTC", "WSS"]
            }
        },
        "vulnerability": {
            "type": "merge",
            "columns": ["peril_id", "coverage_type", "occupancycode"],
            "parameters": {
                "file_path": "%%KEYS_DATA_PATH%%/vulnerability_dict.csv",
                "id_columns": ["vulnerability_id"]
            }
        }
    }
- mapper key: the key of each entry, called the step_name.
Once the function has been built, it is added to the lookup object as a method. It can take any value, but make sure it does not collide with an already existing method.
- type: defines the function factory to call.
For type <fct_type>, the function factory called on the class is build_<fct_type>, e.g. "type": "merge" => build_merge.
- columns: the columns required to be able to apply the step.
These are important because any column (except 'loc_id') of the original locations DataFrame that is not listed in any step is dropped to reduce memory consumption.
- parameters: the parameters passed to the function factory.
Once all the functions have been defined, the order in which they must be applied is defined in the config under the "strategy" key, for example:
    "strategy": ["split_loc_perils_covered", "vulnerability"]
You can subclass Lookup to create your own custom step or function factory.
For a custom step: add your step definition to "step_definition" with no parameters:
    "my_custom_step": {
        "type": "custom_type",
        "columns": […]
    }
Then add it to your "strategy": ["split_loc_perils_covered", "vulnerability", "my_custom_step"] and code the function in your subclass:
    class MyLookup(Lookup):
        @staticmethod
        def my_custom_step(locations):
            <do something on locations>
            return modified_locations
For a function factory: add your step definition to "step_definition" with the required parameters:
    "my_custom_step": {
        "type": "custom_type",
        "columns": […],
        "parameters": {
            "param1": "value1"
        }
    }
Then add your step to "strategy": ["split_loc_perils_covered", "vulnerability", "my_custom_step"] and code the function factory in your subclass:
    class MyLookup(Lookup):
        def build_custom_type(self, param1):
            def fct(locations):
                <do something on locations that depends on param1>
                return modified_locations
            return fct
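The mechanism described above can be pictured with the toy class below. It is a simplified sketch, not the oasislmf implementation: MiniLookup, build_add_constant, the "set_peril" step and the list-of-dicts locations are all hypothetical, and only mimic how a step's "type" selects a build_<type> factory and how "strategy" orders the steps.

```python
class MiniLookup:
    """Toy stand-in for a config-driven lookup (hypothetical, not oasislmf's Lookup)."""

    def __init__(self, config):
        self.config = config

    def set_step_function(self, step_name):
        # Reuse the step if it already exists as a method or was built before.
        if hasattr(self, step_name):
            return getattr(self, step_name)
        step_config = self.config["step_definition"][step_name]
        # "type": "<fct_type>" selects the factory named build_<fct_type>.
        factory = getattr(self, f"build_{step_config['type']}")
        step_function = factory(**step_config.get("parameters", {}))
        setattr(self, step_name, step_function)
        return step_function

    def process_locations(self, locations):
        # Apply the steps in the order given by "strategy".
        for step_name in self.config["strategy"]:
            locations = self.set_step_function(step_name)(locations)
        return locations


class MyLookup(MiniLookup):
    @staticmethod
    def my_custom_step(locations):
        # Custom step: defined directly as a method, so no factory is needed.
        return [dict(loc, status="success") for loc in locations]

    def build_add_constant(self, column, value):
        # Function factory: returns a step that depends on its parameters.
        def fct(locations):
            return [dict(loc, **{column: value}) for loc in locations]
        return fct


config = {
    "step_definition": {
        "set_peril": {
            "type": "add_constant",
            "columns": [],
            "parameters": {"column": "peril_id", "value": "WTC"},
        },
        "my_custom_step": {"type": "custom_type", "columns": []},
    },
    "strategy": ["set_peril", "my_custom_step"],
}
result = MyLookup(config).process_locations([{"loc_id": 1}, {"loc_id": 2}])
```

In this sketch the step name doubles as the attribute name on the object, which is why a step name that collides with an existing method would silently reuse that method.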
- set_step_function(step_name, step_config, function_being_set=None)
Set the step as a function of the lookup object if that has not already been done, and return it. If the step is composed of several child steps, the child steps are set recursively.
- Args:
    step_name (str): name of the strategy for this step
    step_config (dict): config of the strategy for this step
    function_being_set (set, None): set of all the strategies that are parents of this step
- Returns:
    function: function corresponding to this step
- process_locations(locations)
Process location rows, passed in as a pandas DataFrame. Results can be a list, tuple, generator or pandas DataFrame.
- to_abs_filepath(filepath)
Replace each placeholder matching r'%%(.+?)%%' (e.g. %%KEYS_DATA_PATH%%) with the path set in self.config.
- Args:
    filepath (str): filepath that may contain a placeholder
- Returns:
    str: filepath where placeholders are replaced by their actual values.
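The placeholder substitution can be sketched with the regular expression quoted above. This is only an illustration: replace_placeholders is a hypothetical helper, the assumption that config keys are the lower-cased placeholder names is ours, and the sketch covers the substitution only, not any conversion to an absolute path.

```python
import re

PLACEHOLDER = re.compile(r"%%(.+?)%%")


def replace_placeholders(filepath, config):
    """Replace every %%NAME%% in filepath with config[name.lower()]."""
    # Assumption: config keys are the lower-cased placeholder names.
    return PLACEHOLDER.sub(lambda match: config[match.group(1).lower()], filepath)


config = {"keys_data_path": "/data/keys"}  # hypothetical config
resolved = replace_placeholders("%%KEYS_DATA_PATH%%/vulnerability_dict.csv", config)
```

The non-greedy `.+?` matters: it lets a path contain several placeholders without the first match swallowing everything up to the last `%%`.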
- static set_id_columns(df, id_columns)
In DataFrames, only float columns can hold NaN values. After a left join, for example, NaN values in a column change its type from the original type to float. This function replaces the NaN values with OASIS_UNKNOWN_ID and resets the column type to int.
- build_interval_to_index(value_column_name, sorted_array, index_column_name=None, side='left')
Map a value column to an index according to its position in the intervals defined by sorted_array. NaN values are kept as NaN.
- Args:
    value_column_name: name of the column to map
    sorted_array: sorted values that define the intervals to map to
    index_column_name: name of the output column
    side: which index is returned (left or right) when a value equals one of the interval boundaries
- Returns:
    function: the mapping function
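The effect of the side argument is easiest to see on a small example. The sketch below uses the standard library bisect module on a plain list; it assumes that side follows the usual "insertion point" convention (left or right of equal values) and is not the library's implementation. build_interval_to_index_sketch and the boundaries are hypothetical.

```python
import bisect
import math


def build_interval_to_index_sketch(sorted_array, side="left"):
    """Return a function mapping each value to its position within sorted_array."""
    search = bisect.bisect_left if side == "left" else bisect.bisect_right

    def fct(values):
        # NaN values are kept as NaN; everything else becomes an integer index.
        return [v if math.isnan(v) else search(sorted_array, v) for v in values]

    return fct


boundaries = [0, 10, 20, 30]  # hypothetical interval boundaries
values = [5, 10, 35, float("nan")]
left_idx = build_interval_to_index_sketch(boundaries, side="left")(values)
right_idx = build_interval_to_index_sketch(boundaries, side="right")(values)
```

Only the value 10, which sits exactly on a boundary, is mapped differently by the two settings.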
- static build_combine(id_columns, strategy)
Build a function that combines several strategies that try to achieve the same purpose by different means into one, for example finding the correct area_peril_id for a location with one method using (latitude, longitude) and one using postcode. Each strategy is applied sequentially to the locations that still have OASIS_UNKNOWN_ID in their id_columns after the preceding strategy.
- Args:
    id_columns (list): columns that will be checked to determine if a strategy has succeeded
    strategy (list): list of strategies to apply
- Returns:
    function: function combining all strategies
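The fallback behaviour can be sketched on plain dictionaries. This is a hypothetical illustration, not the library code: build_combine_sketch, by_lat_lon, by_postcode and the sample rows are invented for the example, and OASIS_UNKNOWN_ID is assumed to be -1, in line with the -1 mentioned elsewhere on this page.

```python
OASIS_UNKNOWN_ID = -1  # sentinel for "no id found" (value assumed for this sketch)


def build_combine_sketch(id_columns, strategies):
    """Combine several step functions; later ones only see unresolved rows."""

    def is_unresolved(row):
        return any(row.get(col, OASIS_UNKNOWN_ID) == OASIS_UNKNOWN_ID for col in id_columns)

    def fct(locations):
        resolved, pending = [], list(locations)
        for strategy in strategies:
            if not pending:
                break
            processed = strategy(pending)
            resolved += [row for row in processed if not is_unresolved(row)]
            pending = [row for row in processed if is_unresolved(row)]
        return resolved + pending

    return fct


# Two hypothetical strategies for finding area_peril_id.
def by_lat_lon(rows):
    return [dict(r, area_peril_id=100 if r.get("latitude") is not None else OASIS_UNKNOWN_ID)
            for r in rows]


def by_postcode(rows):
    known = {"AB1": 200}
    return [dict(r, area_peril_id=known.get(r.get("postcode"), OASIS_UNKNOWN_ID)) for r in rows]


combine = build_combine_sketch(["area_peril_id"], [by_lat_lon, by_postcode])
rows = combine([
    {"loc_id": 1, "latitude": 51.5, "postcode": "AB1"},
    {"loc_id": 2, "latitude": None, "postcode": "AB1"},
    {"loc_id": 3, "latitude": None, "postcode": "ZZ9"},
])
ids = {r["loc_id"]: r["area_peril_id"] for r in rows}
```

Location 1 keeps the id found by the first strategy, location 2 falls back to the postcode strategy, and location 3 stays at the unknown id because neither strategy succeeds.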
- static build_split_loc_perils_covered(model_perils_covered=None)
Split the value of LocPerilsCovered into multiple lines, taking peril groups into account, and drop all lines whose peril is not in the list model_perils_covered.
Useful inspirational code: https://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-rows
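A minimal sketch of the splitting logic on plain dictionaries follows. Several details are assumptions made for illustration: the ";" separator, the contents of PERIL_GROUPS, the peril_id output column and the name build_split_sketch are not taken from this page.

```python
# Hypothetical peril groups: a group code expands to the perils it contains.
PERIL_GROUPS = {"WW1": ["WTC", "WEC", "WSS"]}


def build_split_sketch(model_perils_covered=None):
    """Return a step that emits one row per covered peril handled by the model."""

    def fct(locations):
        rows = []
        for loc in locations:
            seen = set()
            for code in loc["locperilscovered"].split(";"):
                code = code.strip()
                for peril in PERIL_GROUPS.get(code, [code]):
                    if peril in seen:
                        continue  # avoid duplicate rows for the same peril
                    seen.add(peril)
                    if model_perils_covered is None or peril in model_perils_covered:
                        rows.append(dict(loc, peril_id=peril))
        return rows

    return fct


split = build_split_sketch(model_perils_covered=["WTC", "WSS"])
rows = split([
    {"loc_id": 1, "locperilscovered": "WW1"},
    {"loc_id": 2, "locperilscovered": "WTC;QEQ"},
])
pairs = [(r["loc_id"], r["peril_id"]) for r in rows]
```

The group code on location 1 expands to three perils, of which the model keeps two; location 2 loses its QEQ row because the model does not cover that peril.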
- static build_prepare(**kwargs)
Prepare the DataFrame by setting default, min and max values and types. Several simple DataFrame preparations are supported:
    default: create the column if missing and replace NaN values with the default value
    max: truncate the values in a column to the specified max
    min: truncate the values in a column to the specified min
    type: convert the type of the column to the specified numpy dtype
Note that we use the string representation of numpy dtypes, described at https://numpy.org/doc/stable/reference/arrays.dtypes.html#arrays-dtypes-constructing
- build_rtree(file_path, file_type, id_columns, area_peril_read_params=None, nearest_neighbor_min_distance=-1)
Function factory to associate locations to area_peril based on the rtree method.
!!! Please note that this method is quite time consuming (especially if you use the nearest point option). If your peril areas are squares, you should use the area_peril function fixed_size_geo_grid instead. !!!
- file_path: the path to the file containing the area_peril_dictionary.
This file must be a geopandas DataFrame with a valid geometry. An example of how to create such a DataFrame is available in PiWind. If you are new to geo data (in Python) and want to learn more, you may have a look at this excellent course: https://automating-gis-processes.github.io/site/index.html
- file_type: can be any format readable by geopandas ('file', 'parquet', …).
See: https://geopandas.readthedocs.io/en/latest/docs/reference/io.html You may have to install additional libraries, such as pyarrow for parquet.
- id_columns: columns to transform into an 'id_column' (type int32, with NaN replaced by -1).
- nearest_neighbor_min_distance: option to compute the nearest point if the intersection method fails.
We use https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html but alternatives can be found here: https://gis.stackexchange.com/questions/222315/geopandas-find-nearest-point-in-other-dataframe
- static build_fixed_size_geo_grid(lat_min, lat_max, lon_min, lon_max, arc_size, lat_reverse=False, lon_reverse=False)
Associate an id to each square of the grid defined by the limits of lat and lon. The reverse options change the ordering of the ids from (min to max) to (max to min).
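One way to picture a fixed-size grid lookup is the sketch below. The id layout is an assumption made for illustration (row-major, ids starting at 1, -1 outside the grid) and may differ from what build_fixed_size_geo_grid actually produces; build_grid_sketch and the sample bounds are hypothetical.

```python
import math


def build_grid_sketch(lat_min, lat_max, lon_min, lon_max, arc_size,
                      lat_reverse=False, lon_reverse=False):
    """Return a function mapping (lat, lon) to the id of its grid square."""
    n_lat = math.ceil((lat_max - lat_min) / arc_size)
    n_lon = math.ceil((lon_max - lon_min) / arc_size)

    def cell_id(lat, lon):
        if not (lat_min <= lat < lat_max and lon_min <= lon < lon_max):
            return -1  # outside the grid
        lat_idx = int((lat - lat_min) // arc_size)
        lon_idx = int((lon - lon_min) // arc_size)
        if lat_reverse:
            lat_idx = n_lat - 1 - lat_idx  # count rows from lat_max downwards
        if lon_reverse:
            lon_idx = n_lon - 1 - lon_idx  # count columns from lon_max downwards
        return lat_idx * n_lon + lon_idx + 1  # ids start at 1 (assumed layout)

    return cell_id


# Hypothetical 4 x 4 grid of half-degree squares.
grid = build_grid_sketch(50, 52, -1, 1, 0.5)
grid_lat_reversed = build_grid_sketch(50, 52, -1, 1, 0.5, lat_reverse=True)
```

Because the id is computed arithmetically from the coordinates, no spatial index or geometry file is needed, which is what makes this approach cheaper than an rtree lookup for square areas.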
- build_merge(file_path, id_columns=[], **kwargs)
This method merges the locations DataFrame with the DataFrame present in file_path. Every column listed in id_columns that has no match is set to -1.
This is an efficient way to map a combination of columns that have a finite scope to an id.
- static build_simple_pivot(pivots, remove_pivoted_col=True)
Pivot columns of the locations DataFrame into multiple rows. Each pivot in the pivot list may define:
    "on": to rename a column into a new one
    "new_cols": to create a new column with a given value
ex:
    "pivots": [{"on": {"vuln_str": "vulnerability_id"},
                "new_cols": {"coverage_type": 1}},
               {"on": {"vuln_con": "vulnerability_id"},
                "new_cols": {"coverage_type": 3}},
              ],

    loc_id  vuln_str  vuln_con
    1       3         2
    2       18        4
    =>
    loc_id  vuln_str  vuln_con  vulnerability_id  coverage_type
    1       3         2         3                 1
    2       18        4         18                1
    1       3         2         2                 3
    2       18        4         4                 3
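The pivot example can be reproduced on plain dictionaries with the sketch below. It is an illustration only, not the library code: build_simple_pivot_sketch is a hypothetical name, and remove_pivoted_col is set to False here so that the output keeps vuln_str and vuln_con, as in the table shown in the example.

```python
def build_simple_pivot_sketch(pivots, remove_pivoted_col=True):
    """Return a step that emits one copy of every row per pivot."""
    pivoted_cols = {old for pivot in pivots for old in pivot.get("on", {})}

    def fct(locations):
        rows = []
        for pivot in pivots:
            for loc in locations:
                row = dict(loc)
                # "on": copy the source column under its new name.
                for old_name, new_name in pivot.get("on", {}).items():
                    row[new_name] = loc[old_name]
                # "new_cols": add constant-valued columns.
                row.update(pivot.get("new_cols", {}))
                if remove_pivoted_col:
                    for col in pivoted_cols:
                        row.pop(col, None)
                rows.append(row)
        return rows

    return fct


pivots = [
    {"on": {"vuln_str": "vulnerability_id"}, "new_cols": {"coverage_type": 1}},
    {"on": {"vuln_con": "vulnerability_id"}, "new_cols": {"coverage_type": 3}},
]
locations = [
    {"loc_id": 1, "vuln_str": 3, "vuln_con": 2},
    {"loc_id": 2, "vuln_str": 18, "vuln_con": 4},
]
pivoted = build_simple_pivot_sketch(pivots, remove_pivoted_col=False)(locations)
summary = [(r["loc_id"], r["vulnerability_id"], r["coverage_type"]) for r in pivoted]
```

Each pivot contributes one full pass over the input, so two pivots on two locations yield four rows.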