gojo.core package
Submodules
gojo.core.evaluation module
- class gojo.core.evaluation.Metric(name: str, function: callable, bin_threshold: Optional[float] = None, ignore_bin_threshold: bool = False, multiclass: bool = False, number_of_classes: Optional[int] = None, use_multiclass_sparse: bool = True, **kwargs)[source]
Bases: object
Base class used to create any type of performance evaluation metric compatible with the gojo framework.
- name : str
Name given to the performance metric.
- function : callable
Function that will receive as input two numpy.ndarray (y_true and y_pred) and must return a scalar or a numpy.ndarray.
- bin_threshold : float or int, default=None
Threshold used to binarize the input predictions. By default, no thresholding is applied.
- ignore_bin_threshold : bool, default=False
If True, the bin_threshold parameter will be ignored.
- multiclass : bool, default=False
Parameter indicating if a multi-class classification metric is being computed.
- number_of_classes : int, default=None
Parameter indicating the number of classes in a multi-class classification problem. This parameter will not have any effect when multiclass=False.
- use_multiclass_sparse : bool, default=True
Parameter indicating whether the multi-class predictions are provided as one-hot vectors. This parameter will not have any effect when multiclass=False.
- **kwargs
Optional parameters provided to the input callable specified by function.
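As an additional illustration (not part of the original docstring), a minimal sketch of a custom metric whose extra keyword argument is forwarded to the underlying callable via **kwargs; the use of sklearn's f1_score and the 0.5 threshold are assumptions for the example:
>>> from gojo import core
>>> from sklearn import metrics
>>>
>>> # hypothetical metric: the 'average' keyword is forwarded to
>>> # sklearn.metrics.f1_score through **kwargs; predictions are
>>> # binarized at 0.5 via bin_threshold
>>> f1 = core.Metric(
>>>     'f1_score',
>>>     metrics.f1_score,
>>>     bin_threshold=0.5,
>>>     average='binary')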
- gojo.core.evaluation.flatFunctionInput(fn: callable)[source]
Function used to flatten the input predictions before the computation of the metric. Internally, the input y_pred and y_true will be flattened before calling the provided function.
>>> from gojo import core
>>> from sklearn import metrics
>>>
>>> metric = core.Metric(
>>>     'accuracy',
>>>     core.flatFunctionInput(metrics.accuracy_score),
>>>     bin_threshold=0.5)
- gojo.core.evaluation.getAvailableDefaultMetrics(task: Optional[str] = None) dict [source]
Return a dictionary with task names and the default metrics defined for those tasks. The output can be filtered by the task parameter, which indicates the task whose metrics you want to see.
- task : str, default=None
Specify a task to see only the metrics associated with that task.
- task_info : dict
Dictionary where the keys correspond to the task and the values to the metrics defined by default for the associated task.
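A minimal usage sketch (the exact metric names returned depend on the library defaults):
>>> from gojo import core
>>>
>>> # dictionary mapping each task to its default metrics
>>> core.getAvailableDefaultMetrics()
>>>
>>> # restrict the output to a single task
>>> core.getAvailableDefaultMetrics(task='binary_classification')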
- gojo.core.evaluation.getDefaultMetrics(task: str, select: Optional[list] = None, bin_threshold: Optional[float] = None, multiclass: bool = False, number_of_classes: Optional[int] = None, use_multiclass_sparse: bool = False) list [source]
Function used to get a series of pre-defined metrics for evaluating model performance.
- task : str
Task-associated metrics. Currently available tasks are: binary_classification and regression.
- select : list, default=None
Subset of the returned metrics to select (in case you do not want to calculate all of them). By default, all metrics associated with the task will be returned.
Note: metrics are represented by strings.
- bin_threshold : float or int, default=None
Threshold used to binarize the input predictions. By default, no thresholding is applied.
- multiclass : bool, default=False
Parameter indicating if a multi-class classification metric is being computed.
- number_of_classes : int, default=None
Parameter indicating the number of classes in a multi-class classification problem. This parameter will not have any effect when multiclass=False.
- use_multiclass_sparse : bool, default=False
Parameter indicating if the multi-class level predictions are provided as a one-hot vector. This parameter will not have any effect when multiclass=False.
- metrics : list
List of instances of the gojo.core.Metric class.
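A short usage sketch (the same call appears in the evalCrossVal example below); the select value assumes 'f1_score' is among the default binary-classification metric names, as suggested by the nested-HPO example further down:
>>> from gojo import core
>>>
>>> # all default binary-classification metrics, binarizing predictions at 0.5
>>> all_metrics = core.getDefaultMetrics('binary_classification', bin_threshold=0.5)
>>>
>>> # keep only a subset of the default metrics (identified by their string names)
>>> f1_only = core.getDefaultMetrics(
>>>     'binary_classification', select=['f1_score'], bin_threshold=0.5)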
- gojo.core.evaluation.getScores(y_true: numpy.ndarray, y_pred: numpy.ndarray, metrics: list) dict [source]
Function used to calculate the scores given by the metrics passed within the metrics parameter.
- y_true : np.ndarray
True labels.
- y_pred : np.ndarray
Predicted labels.
- metrics : List[gojo.core.Metric]
List of gojo.core.Metric instances.
- metric_scores : dict
Dictionary where the keys will correspond to the metric names and the values to the metric scores.
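A minimal sketch with made-up toy data:
>>> import numpy as np
>>> from gojo import core
>>>
>>> # toy ground truth and predicted probabilities (illustrative only)
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_pred = np.array([0.1, 0.8, 0.6, 0.3, 0.4])
>>>
>>> # compute every default binary-classification metric on the toy data
>>> metrics = core.getDefaultMetrics('binary_classification', bin_threshold=0.5)
>>> scores = core.getScores(y_true=y_true, y_pred=y_pred, metrics=metrics)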
gojo.core.loops module
- gojo.core.loops.evalCrossVal(X: pandas.DataFrame, y: pandas.Series, model: gojo.interfaces.model.Model, cv: gojo.util.splitter.SimpleSplitter, transforms: Optional[List[gojo.interfaces.transform.Transform]] = None, verbose: int = -1, n_jobs: int = 1, save_train_preds: bool = False, save_transforms: bool = False, save_models: bool = False, op_instance_args: Optional[dict] = None) gojo.core.report.CVReport [source]
Subroutine used to evaluate a model according to a cross-validation scheme provided by the cv argument.
- X : np.ndarray or pd.DataFrame
Variables used to fit the model.
- y : np.ndarray or pd.DataFrame or pd.Series
Target prediction variable.
- model : gojo.interfaces.Model
Model to be trained. The input model must follow the gojo.base.Model interface.
- cv : Cross-validation splitter
Cross-validation schema. For more information about cross-validation see the sklearn.model_selection module. The gojo module implements useful functions for easy loading of cross-validation objects (see gojo.util.getCrossValObj()). Supported splitters are sklearn.model_selection.RepeatedKFold, sklearn.model_selection.RepeatedStratifiedKFold, sklearn.model_selection.LeaveOneOut, gojo.util.splitter.SimpleSplitter, gojo.util.splitter.InstanceLevelKFoldSplitter, or gojo.util.splitter.PredefinedSplitter.
- transforms : List[Transform] or None, default=None
Transformations applied to the data before being provided to the models. These transformations will be fitted using the training data, and will be applied to both training and test data. For more information see the module gojo.core.transform.
- verbose : int, default=-1
Verbosity level.
- n_jobs : int, default=1
Number of jobs used for parallelization.
- save_train_preds : bool, default=False
Parameter that indicates whether the predictions made on the training set will be saved in gojo.core.report.CVReport. For large training sets this may involve higher computational and storage costs.
- save_transforms : bool, default=False
Parameter that indicates whether the fitted transforms will be saved in gojo.core.report.CVReport.
- save_models : bool, default=False
Parameter that indicates whether the fitted models will be saved in gojo.core.report.CVReport. For larger models this may involve higher computational and storage costs.
- op_instance_args : dict, default=None
Instance-level optional arguments. This parameter should be a dictionary whose values must be a list or an array-like iterable containing the same number of elements as there are instances in X and y.
- cv_obj : gojo.core.report.CVReport
Cross-validation report. For more information see gojo.core.report.CVReport.
>>> import pandas as pd
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.decomposition import PCA
>>>
>>> # GOJO libraries
>>> import gojo
>>> from gojo import core
>>> from gojo import interfaces
>>>
>>> N_JOBS = 8
>>>
>>> # load test dataset (Wine)
>>> wine_dt = datasets.load_wine()
>>>
>>> # create the target variable. Classification problem 0 vs rest
>>> # to see the target names you can use wine_dt['target_names']
>>> y = (wine_dt['target'] == 1).astype(int)
>>> X = wine_dt['data']
>>>
>>> # previous model transforms
>>> transforms = [
>>>     interfaces.SKLearnTransformWrapper(StandardScaler),
>>>     interfaces.SKLearnTransformWrapper(PCA, n_components=5)
>>> ]
>>>
>>> # default model
>>> model = interfaces.SklearnModelWrapper(
>>>     SVC, kernel='poly', degree=1, coef0=0.0,
>>>     cache_size=1000, class_weight=None
>>> )
>>>
>>> # evaluate the model using a simple cross-validation strategy with
>>> # default parameters
>>> cv_report = core.evalCrossVal(
>>>     X=X, y=y,
>>>     model=model,
>>>     cv=gojo.util.splitter.getCrossValObj(cv=5, repeats=1, stratified=True, loocv=False, random_state=1997),
>>>     transforms=transforms,
>>>     verbose=True,
>>>     save_train_preds=True,
>>>     save_models=False,
>>>     save_transforms=False,
>>>     n_jobs=N_JOBS
>>> )
>>>
>>> scores = cv_report.getScores(core.getDefaultMetrics('binary_classification', bin_threshold=0.5))
>>> results = pd.concat([
>>>     pd.DataFrame(scores['train'].mean(axis=0)).round(decimals=3),
>>>     pd.DataFrame(scores['test'].mean(axis=0)).round(decimals=3)],
>>>     axis=1).drop(index=['n_fold'])
>>> results.columns = ['Train', 'Test']
>>> results
- gojo.core.loops.evalCrossValNestedHPO(X: pandas.DataFrame, y: pandas.Series, model: gojo.interfaces.model.Model, search_space: dict, outer_cv: gojo.util.splitter.SimpleSplitter, inner_cv: gojo.util.splitter.SimpleSplitter, hpo_sampler: optuna.samplers.BaseSampler, hpo_n_trials: int, minimization: bool, metrics: List[gojo.core.evaluation.Metric], objective_metric: Optional[str] = None, agg_function: Optional[callable] = None, transforms: Optional[List[gojo.interfaces.transform.Transform]] = None, verbose: int = -1, n_jobs: int = 1, inner_cv_n_jobs: int = 1, save_train_preds: bool = False, save_transforms: bool = False, save_models: bool = False, op_instance_args: Optional[dict] = None, enable_experimental: bool = False)[source]
Subroutine used to evaluate a model according to a cross-validation scheme provided by the outer_cv argument. This function also performs a nested cross-validation for hyperparameter optimization (HPO) based on the optuna library.
- X : np.ndarray or pd.DataFrame
Variables used to fit the model.
- y : np.ndarray or pd.DataFrame or pd.Series
Target prediction variable.
- model : gojo.interfaces.Model
Model to be trained. The input model must follow the gojo.base.Model interface.
- search_space : dict
Search space used for performing the HPO. For more information about distributions and sampling strategies consult optuna.
>>> search_space = {
>>>     # sample from an integer distribution
>>>     'max_depth': ('suggest_int', (2, 10)),
>>>     # ... from a uniform distribution
>>>     'max_samples': ('suggest_float', (0.5, 1.0)),
>>> }
Keyword arguments can be passed by providing a dictionary in the third position where the key will correspond to the name of the parameter:
>>> search_space = {
>>>     # sample from an integer distribution in log space
>>>     'max_depth': ('suggest_int', (2, 40), dict(step=1, log=True)),
>>>     # ... from a uniform distribution
>>>     'max_samples': ('suggest_float', (0.5, 1.0)),
>>> }
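For reference, each (method, args[, kwargs]) entry presumably maps onto the corresponding optuna.Trial suggestion call; the following expansion is an assumption about the internals, not gojo code:
>>> import optuna
>>>
>>> def _expand(trial: optuna.Trial) -> dict:
>>>     # ('suggest_int', (2, 40), dict(step=1, log=True)) presumably becomes:
>>>     max_depth = trial.suggest_int('max_depth', 2, 40, step=1, log=True)
>>>     # ('suggest_float', (0.5, 1.0)) presumably becomes:
>>>     max_samples = trial.suggest_float('max_samples', 0.5, 1.0)
>>>     return {'max_depth': max_depth, 'max_samples': max_samples}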
- outer_cv : Cross-validation splitter
Cross-validation schema. For more information about cross-validation see the sklearn.model_selection module. The gojo module implements useful functions for easy loading of cross-validation objects (see gojo.util.getCrossValObj()). Supported splitters are sklearn.model_selection.RepeatedKFold, sklearn.model_selection.RepeatedStratifiedKFold, sklearn.model_selection.LeaveOneOut, gojo.util.splitter.SimpleSplitter, gojo.util.splitter.InstanceLevelKFoldSplitter, or gojo.util.splitter.PredefinedSplitter.
- inner_cv : Cross-validation splitter
Inner cross-validation schema used to evaluate model performance within the nested cross-validation that optimizes the model hyperparameters. For more information about cross-validation see the sklearn.model_selection module. The gojo module implements useful functions for easy loading of cross-validation objects (see gojo.util.getCrossValObj()). Supported splitters are sklearn.model_selection.RepeatedKFold, sklearn.model_selection.RepeatedStratifiedKFold, sklearn.model_selection.LeaveOneOut, gojo.util.splitter.SimpleSplitter, gojo.util.splitter.InstanceLevelKFoldSplitter, or gojo.util.splitter.PredefinedSplitter.
- hpo_sampler : optuna.samplers.BaseSampler
Sampler used to suggest model hyperparameters. For more information see optuna.
- hpo_n_trials : int
Number of HPO iterations.
- minimization: bool
Parameter indicating if the HPO objective function must be minimized. If minimization=False the objective function will be maximized.
- metrics : List[gojo.core.evaluation.Metric]
Metrics used within the nested cross-validation to evaluate the hyperparameter configuration.
- objective_metric : str, default=None
It is possible to indicate which of the metrics provided via the metrics parameter is to be optimized within the HPO. The metric must be provided as a string and must be included in the list of provided metrics. If this parameter is not provided, an aggregation function must be provided by means of the agg_function parameter.
- agg_function : callable, default=None
This function will receive a dataframe with the metrics calculated on each of the folds generated by inner_cv and, based on this information, will provide a score that will be maximized/minimized within the HPO (see the sketch after the return-value description below). If the objective_metric parameter is not provided, this parameter must be provided. If both parameters are provided, objective_metric will be ignored.
- transforms : List[Transform] or None, default=None
Transformations applied to the data before being provided to the models. These transformations will be fitted using the training data, and will be applied to both training and test data. For more information see the module gojo.core.transform.
- verbose : int, default=-1
Verbosity level.
- inner_cv_n_jobs : int, default=1
Number of cores used to parallelize the inner cross-validation.
- n_jobs : int, default=1
Number of jobs used for parallelization. Parallelization will be done at the optuna trial level and will depend on a temporary database that will be created and automatically removed once the optimization ends. This is an experimental feature; to enable this parameter you have to specify enable_experimental=True.
- save_train_preds : bool, default=False
Parameter that indicates whether the predictions made on the training set will be saved in gojo.core.report.CVReport. For large training sets this may involve higher computational and storage costs.
- save_transforms : bool, default=False
Parameter that indicates whether the fitted transforms will be saved in gojo.core.report.CVReport.
- save_models : bool, default=False
Parameter that indicates whether the fitted models will be saved in gojo.core.report.CVReport. For larger models this may involve higher computational and storage costs.
- op_instance_args : dict, default=None
Instance-level optional arguments. This parameter should be a dictionary whose values must be a list or an array-like iterable containing the same number of elements as there are instances in X and y.
- enable_experimental: bool, default=False
Parameter indicating whether the experimental features of the function are enabled.
- cv_obj : gojo.core.report.CVReport
Cross-validation report. For more information see gojo.core.report.CVReport. The HPO history will be saved in the report metadata (gojo.core.report.CVReport.metadata).
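As referenced in the agg_function description above, a hypothetical aggregation function; it assumes the inner-fold metrics dataframe contains an 'f1_score' column (the available columns depend on the metrics provided):
>>> import pandas as pd
>>>
>>> # aggregate the inner-fold metrics into a single score to maximize:
>>> # mean F1 penalized by its dispersion across folds
>>> def aggScore(fold_metrics: pd.DataFrame) -> float:
>>>     return float(fold_metrics['f1_score'].mean() - fold_metrics['f1_score'].std())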
>>> import optuna
>>> import pandas as pd
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.decomposition import PCA
>>>
>>> # GOJO libraries
>>> import gojo
>>> from gojo import core
>>>
>>> N_JOBS = 8
>>>
>>> # load test dataset (Wine)
>>> wine_dt = datasets.load_wine()
>>>
>>> # create the target variable. Classification problem 0 vs rest
>>> # to see the target names you can use wine_dt['target_names']
>>> y = (wine_dt['target'] == 1).astype(int)
>>> X = wine_dt['data']
>>>
>>> # previous model transforms
>>> transforms = [
>>>     core.SKLearnTransformWrapper(StandardScaler),
>>>     core.SKLearnTransformWrapper(PCA, n_components=5)
>>> ]
>>>
>>> # model hyperparameters
>>> search_space = {
>>>     'degree': ('suggest_int', (1, 10)),
>>>     'class_weight': ('suggest_categorical', [('balanced', None)]),
>>>     'coef0': ('suggest_float', (0.0, 100.00))
>>> }
>>>
>>> # default model
>>> model = core.SklearnModelWrapper(
>>>     SVC, kernel='poly', degree=1, coef0=0.0,
>>>     cache_size=1000, class_weight=None
>>> )
>>>
>>> # perform the HPO to optimize model hyperparameters
>>> cv_report = core.evalCrossValNestedHPO(
>>>     X=X,
>>>     y=y,
>>>     model=model,
>>>     search_space=search_space,
>>>     outer_cv=gojo.util.splitter.getCrossValObj(cv=5, repeats=1, stratified=True, loocv=False, random_state=1997),
>>>     inner_cv=gojo.util.splitter.getCrossValObj(cv=5, repeats=1, stratified=True, loocv=False, random_state=1997),
>>>     hpo_sampler=optuna.samplers.TPESampler(n_startup_trials=40),
>>>     hpo_n_trials=80,
>>>     minimization=False,
>>>     transforms=transforms,
>>>     metrics=core.getDefaultMetrics('binary_classification', bin_threshold=0.5),
>>>     objective_metric='f1_score',
>>>     verbose=1,
>>>     save_train_preds=True,
>>>     save_models=False,
>>>     n_jobs=1
>>> )
>>>
>>> scores = cv_report.getScores(core.getDefaultMetrics('binary_classification', bin_threshold=0.5))
>>> results = pd.concat([
>>>     pd.DataFrame(scores['train'].mean(axis=0)).round(decimals=3),
>>>     pd.DataFrame(scores['test'].mean(axis=0)).round(decimals=3)],
>>>     axis=1).drop(index=['n_fold'])
>>> results.columns = ['Train', 'Test']
>>> results
gojo.core.report module
- class gojo.core.report.CVReport(raw_results: list, X_dataset: gojo.interfaces.data.Dataset, y_dataset: gojo.interfaces.data.Dataset, n_fold_key: str, pred_test_key: str, true_test_key: str, pred_train_key: str, true_train_key: str, test_idx_key: str, train_idx_key: str, trained_model_key: str, fitted_transforms_key: str)[source]
Bases: object
Object returned by the subroutines defined in gojo.core.loops with the results of the cross-validation.
- getFittedTransforms(copy: bool = True) dict [source]
Function that returns the fitted transforms if they have been saved in the gojo.core.loops subroutine.
- copy : bool, default=True
Parameter that indicates whether to return a deepcopy of the transforms or the saved transforms directly. Defaults to True to avoid in-place modifications.
- fitted_transforms : dict or None
Fitted transforms or None if the transforms were not saved.
- getScores(metrics: list, loocv: bool = False, supress_warnings: bool = False) dict [source]
Method used to calculate performance metrics for each fold from a list of metrics (gojo.core.evaluation.Metric instances) provided. If the subroutine from gojo.core.loops performed a leave-one-out cross-validation, you must specify the loocv parameter as True.
- metrics : list
List of gojo.core.evaluation.Metric instances.
- loocv : bool
Parameter indicating if the predictions correspond to a LOOCV schema
- supress_warnings : bool, default=False
Indicates whether to suppress the possible warnings returned by the method.
- performance_metrics : dict
Dictionary with the performance associated with the test data (identified with the ‘test’ key) and with the training data (identified with the ‘train’ key).
>>> from gojo import core
>>>
>>> # ... cv_report = core.loops.evalCrossVal(...)
>>> scores = cv_report.getScores(core.getDefaultMetrics('binary_classification', bin_threshold=0.5))
- getTestPredictions() pandas.DataFrame [source]
Function that returns a dataframe with the model predictions, indices, and true labels for the test set.
- test_predictions : pd.DataFrame
Model predictions over the test set.
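A minimal usage sketch (assuming a report produced by one of the gojo.core.loops subroutines):
>>> from gojo import core
>>>
>>> # ... cv_report = core.loops.evalCrossVal(...)
>>> test_preds = cv_report.getTestPredictions()
>>> test_preds.head()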
- getTrainPredictions(supress_warnings: bool = False) None [source]
Function that returns a dataframe with the model predictions, indices, and true labels for the train set.
Predictions will only be returned if they are available. Note that in some subroutines of gojo.core.loops the predictions made on the training set are not saved, or this decision is left to the user.
- supress_warnings : bool, default=False
Silence the warning raised when no training predictions have been made.
- train_predictions : pd.DataFrame or None
Model predictions over the train set.
- getTrainedModels(copy: bool = True) dict [source]
Function that returns the trained models if they have been saved in the gojo.core.loops subroutine.
- copy : bool, default=True
Parameter that indicates whether to return a deepcopy of the models (using copy.deepcopy) or the saved models directly. Defaults to True to avoid in-place modifications.
- trained_models : dict or None
Trained models or None if the models were not saved.
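A minimal usage sketch (assuming the models were saved with save_models=True; the dictionary keys are assumed to identify the folds):
>>> from gojo import core
>>>
>>> # ... cv_report = core.loops.evalCrossVal(..., save_models=True)
>>> fold_models = cv_report.getTrainedModels(copy=True)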
- property metadata: dict
Return the report metadata.
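A minimal sketch of accessing the metadata (as noted above, evalCrossValNestedHPO stores the HPO history here; the exact structure of the dictionary is not documented in this section):
>>> # ... cv_report = core.loops.evalCrossValNestedHPO(...)
>>> report_metadata = cv_report.metadata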