gojo.core package

Submodules

gojo.core.evaluation module

class gojo.core.evaluation.Metric(name: str, function: callable, bin_threshold: Optional[float] = None, ignore_bin_threshold: bool = False, multiclass: bool = False, number_of_classes: Optional[int] = None, use_multiclass_sparse: bool = True, **kwargs)[source]

Bases: object

Base class used to create any type of performance evaluation metric compatible with the gojo framework.

name: str

Name given to the performance metric.

function: callable

Function that will receive as input two numpy.ndarray (y_true and y_pred) and must return a scalar or a numpy.ndarray.

bin_threshold: float or int, default=None

Threshold used to binarize the input predictions. By default, no thresholding is applied.

ignore_bin_threshold: bool, default=False

If True, the bin_threshold parameter will be ignored.

multiclass: bool, default=False

Parameter indicating if a multi-class classification metric is being computed.

number_of_classes: int, default=None

Parameter indicating the number of classes in a multi-class classification problem. This parameter will not have any effect when multiclass=False.

use_multiclass_sparse: bool, default=True

Parameter indicating if the multi-class level predictions are provided as a one-hot vector. This parameter will not have any effect when multiclass=False.

**kwargs

Optional parameters provided to the input callable specified by function.
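
As an illustrative sketch of how these parameters fit together, the following builds a metric whose extra keyword argument is forwarded to the underlying function through **kwargs (the choice of sklearn.metrics.fbeta_score and beta=2 is an example, not part of gojo):

>>> from sklearn import metrics
>>> from gojo import core
>>>
>>> # F2-score for binary predictions; 'beta' is forwarded to fbeta_score
>>> # through **kwargs, and predictions are binarized at a 0.5 threshold
>>> f2_metric = core.Metric(
>>>     'f2_score',
>>>     metrics.fbeta_score,
>>>     bin_threshold=0.5,
>>>     beta=2)
>>>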

gojo.core.evaluation.flatFunctionInput(fn: callable)[source]

Function used to wrap a metric function so that the input predictions are flattened before the metric is computed. Internally, the inputs y_pred and y_true will be flattened before calling the provided function.

>>> from gojo import core
>>> from sklearn import metrics
>>> metric = core.Metric(
>>>     'accuracy',
>>>     core.flatFunctionInput(metrics.accuracy_score),
>>>     bin_threshold=0.5)
>>>
gojo.core.evaluation.getAvailableDefaultMetrics(task: Optional[str] = None) dict[source]

Returns a dictionary with task names and the default metrics defined for those tasks. The tasks for which you want to see the metrics can be filtered via the task parameter.

task: str, default=None

Specify a task to see only the metrics defined for that task.

task_info: dict

Dictionary where the keys correspond to the task and the values to the metrics defined by default for the associated task.
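
A minimal usage sketch (the task name shown is the one referenced elsewhere in this documentation):

>>> from gojo import core
>>>
>>> # list all tasks and their default metric names
>>> core.getAvailableDefaultMetrics()
>>>
>>> # restrict the listing to a single task
>>> core.getAvailableDefaultMetrics(task='binary_classification')
>>>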

gojo.core.evaluation.getDefaultMetrics(task: str, select: Optional[list] = None, bin_threshold: Optional[float] = None, multiclass: bool = False, number_of_classes: Optional[int] = None, use_multiclass_sparse: bool = False) list[source]

Function used to get a series of pre-defined metrics for evaluating model performance.

task: str

Task for which to retrieve the associated metrics. Currently available tasks are: binary_classification and regression.

select: list, default=None

Subset of the returned metrics to select (in case you do not want to compute all of them). By default, all metrics associated with the task will be returned.

Note: metrics are represented by strings.

bin_threshold: float or int, default=None

Threshold used to binarize the input predictions. By default, no thresholding is applied.

multiclass: bool, default=False

Parameter indicating if a multi-class classification metric is being computed.

number_of_classes: int, default=None

Parameter indicating the number of classes in a multi-class classification problem. This parameter will not have any effect when multiclass=False.

use_multiclass_sparse: bool, default=False

Parameter indicating if the multi-class level predictions are provided as a one-hot vector. This parameter will not have any effect when multiclass=False.

metrics: list

List of instances of the gojo.core.Metric class.
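
A short usage sketch; the names passed to select are illustrative and must match the metric names listed by gojo.core.evaluation.getAvailableDefaultMetrics():

>>> from gojo import core
>>>
>>> # all default binary-classification metrics, binarizing predictions at 0.5
>>> metrics = core.getDefaultMetrics('binary_classification', bin_threshold=0.5)
>>>
>>> # keep only a subset of the default metrics (metric names are strings)
>>> selected = core.getDefaultMetrics(
>>>     'binary_classification', select=['accuracy', 'f1_score'], bin_threshold=0.5)
>>>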

gojo.core.evaluation.getScores(y_true: numpy.ndarray, y_pred: numpy.ndarray, metrics: list) dict[source]

Function used to calculate the scores given by the metrics passed within the metrics parameter.

y_true: np.ndarray

True labels.

y_pred: np.ndarray

Predicted labels.

metrics: List[gojo.core.Metric]

List of gojo.core.Metric instances.

metric_scores: dict

Dictionary where the keys will correspond to the metric names and the values to the metric scores.
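
A minimal sketch using synthetic arrays for a binary-classification problem:

>>> import numpy as np
>>> from gojo import core
>>>
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_pred = np.array([0.1, 0.9, 0.8, 0.4, 0.3])
>>>
>>> # compute all default binary-classification metrics at a 0.5 threshold
>>> scores = core.getScores(
>>>     y_true=y_true,
>>>     y_pred=y_pred,
>>>     metrics=core.getDefaultMetrics('binary_classification', bin_threshold=0.5))
>>>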

gojo.core.loops module

gojo.core.loops.evalCrossVal(X: pandas.DataFrame, y: pandas.Series, model: gojo.interfaces.model.Model, cv: gojo.util.splitter.SimpleSplitter, transforms: Optional[List[gojo.interfaces.transform.Transform]] = None, verbose: int = - 1, n_jobs: int = 1, save_train_preds: bool = False, save_transforms: bool = False, save_models: bool = False, op_instance_args: Optional[dict] = None) gojo.core.report.CVReport[source]

Subroutine used to evaluate a model according to a cross-validation scheme provided by the cv argument.

X: np.ndarray or pd.DataFrame

Variables used to fit the model.

y: np.ndarray or pd.DataFrame or pd.Series

Target prediction variable.

model: gojo.interfaces.Model

Model to be trained. The input model must follow the gojo.interfaces.Model interface.

cv: Cross-validation splitter

Cross-validation scheme. For more information about cross-validation see the sklearn.model_selection module. The gojo module implements useful functions for easily loading cross-validation objects (see gojo.util.getCrossValObj()). Supported splitters are sklearn.model_selection.RepeatedKFold, sklearn.model_selection.RepeatedStratifiedKFold, sklearn.model_selection.LeaveOneOut, gojo.util.splitter.SimpleSplitter, gojo.util.splitter.InstanceLevelKFoldSplitter, and gojo.util.splitter.PredefinedSplitter.

transforms: List[Transform] or None, default=None

Transformations applied to the data before being provided to the models. These transformations will be fitted using the training data, and will be applied to both training and test data. For more information see the module gojo.core.transform.

verbose: int, default=-1

Verbosity level.

n_jobs: int, default=1

Number of jobs used for parallelization.

save_train_preds: bool, default=False

Parameter that indicates whether the predictions made on the training set will be saved in gojo.core.report.CVReport. For large training sets this may involve higher computational and storage costs.

save_transforms: bool, default=False

Parameter that indicates whether the fitted transforms will be saved in gojo.core.report.CVReport.

save_models: bool, default=False

Parameter that indicates whether the fitted models will be saved in gojo.core.report.CVReport. For larger models this may involve higher computational and storage costs.

op_instance_args: dict, default=None

Instance-level optional arguments. This parameter should be a dictionary whose values must be lists or array-like iterables containing the same number of elements as instances in X and y.

cv_obj: gojo.core.report.CVReport

Cross validation report. For more information see gojo.core.report.CVReport.

>>> import pandas as pd
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.decomposition import PCA
>>>
>>> # GOJO libraries
>>> import gojo
>>> from gojo import core
>>> from gojo import interfaces
>>>
>>> N_JOBS = 8
>>>
>>> # load test dataset (Wine)
>>> wine_dt = datasets.load_wine()
>>>
>>> # create the target variable. Classification problem 0 vs rest
>>> # to see the target names you can use wine_dt['target_names']
>>> y = (wine_dt['target'] == 1).astype(int)
>>> X = wine_dt['data']
>>>
>>> # previous model transforms
>>> transforms = [
>>>     interfaces.SKLearnTransformWrapper(StandardScaler),
>>>     interfaces.SKLearnTransformWrapper(PCA, n_components=5)
>>> ]
>>>
>>> # default model
>>> model = interfaces.SklearnModelWrapper(
>>>     SVC, kernel='poly', degree=1, coef0=0.0,
>>>     cache_size=1000, class_weight=None
>>> )
>>>
>>> # evaluate the model using a simple cross-validation strategy with a
>>> # default parameters
>>> cv_report = core.evalCrossVal(
>>>     X=X, y=y,
>>>     model=model,
>>>     cv=gojo.util.splitter.getCrossValObj(cv=5, repeats=1, stratified=True, loocv=False, random_state=1997),
>>>     transforms=transforms,
>>>     verbose=True,
>>>     save_train_preds=True,
>>>     save_models=False,
>>>     save_transforms=False,
>>>     n_jobs=N_JOBS
>>> )
>>>
>>> scores = cv_report.getScores(core.getDefaultMetrics('binary_classification', bin_threshold=0.5))
>>> results = pd.concat([
>>>     pd.DataFrame(scores['train'].mean(axis=0)).round(decimals=3),
>>>     pd.DataFrame(scores['test'].mean(axis=0)).round(decimals=3)],
>>>     axis=1).drop(index=['n_fold'])
>>> results.columns = ['Train', 'Test']
>>> results
>>>
gojo.core.loops.evalCrossValNestedHPO(X: pandas.DataFrame, y: pandas.Series, model: gojo.interfaces.model.Model, search_space: dict, outer_cv: gojo.util.splitter.SimpleSplitter, inner_cv: gojo.util.splitter.SimpleSplitter, hpo_sampler: optuna.samplers.BaseSampler, hpo_n_trials: int, minimization: bool, metrics: List[gojo.core.evaluation.Metric], objective_metric: Optional[str] = None, agg_function: Optional[callable] = None, transforms: Optional[List[gojo.interfaces.transform.Transform]] = None, verbose: int = - 1, n_jobs: int = 1, inner_cv_n_jobs: int = 1, save_train_preds: bool = False, save_transforms: bool = False, save_models: bool = False, op_instance_args: Optional[dict] = None, enable_experimental: bool = False)[source]

Subroutine used to evaluate a model according to a cross-validation scheme provided by the outer_cv argument. This function also performs a nested cross-validation for hyperparameter optimization (HPO) based on the optuna library.

X: np.ndarray or pd.DataFrame

Variables used to fit the model.

y: np.ndarray or pd.DataFrame or pd.Series

Target prediction variable.

model: gojo.interfaces.Model

Model to be trained. The input model must follow the gojo.interfaces.Model interface.

search_space: dict

Search space used for performing the HPO. For more information about distributions and sampling strategies consult optuna.

>>> search_space = {
>>>     # sample from an integer distribution
>>>     'max_depth': ('suggest_int', (2, 10)),
>>>     # ... from a uniform distribution
>>>     'max_samples': ('suggest_float', (0.5, 1.0)),
>>> }

Keyword arguments can be passed by providing a dictionary in the third position where the key will correspond to the name of the parameter:

>>> search_space = {
>>>     # sample from an integer distribution in log space
>>>     'max_depth': ('suggest_int', (2, 40), dict(step=1, log=True)),
>>>     # ... from a uniform distribution
>>>     'max_samples': ('suggest_float', (0.5, 1.0)),
>>> }

outer_cv: Cross-validation splitter

Cross-validation scheme. For more information about cross-validation see the sklearn.model_selection module. The gojo module implements useful functions for easily loading cross-validation objects (see gojo.util.getCrossValObj()). Supported splitters are sklearn.model_selection.RepeatedKFold, sklearn.model_selection.RepeatedStratifiedKFold, sklearn.model_selection.LeaveOneOut, gojo.util.splitter.SimpleSplitter, gojo.util.splitter.InstanceLevelKFoldSplitter, and gojo.util.splitter.PredefinedSplitter.

inner_cv: Cross-validation splitter

Inner cross-validation scheme used to evaluate model performance within the nested cross-validation used to optimize the model hyperparameters. For more information about cross-validation see the sklearn.model_selection module. The gojo module implements useful functions for easily loading cross-validation objects (see gojo.util.getCrossValObj()). Supported splitters are sklearn.model_selection.RepeatedKFold, sklearn.model_selection.RepeatedStratifiedKFold, sklearn.model_selection.LeaveOneOut, gojo.util.splitter.SimpleSplitter, gojo.util.splitter.InstanceLevelKFoldSplitter, and gojo.util.splitter.PredefinedSplitter.

hpo_sampler: optuna.samplers.BaseSampler

Sampler used to suggest model hyperparameters. For more information see optuna.

hpo_n_trials: int

Number of HPO iterations.

minimization: bool

Parameter indicating if the HPO objective function must be minimized. If minimization=False, the objective function will be maximized.

metrics: List[gojo.core.evaluation.Metric]

Metrics used within the nested cross-validation to evaluate the hyperparameter configurations.

objective_metric: str, default=None

It is possible to indicate which of the metrics provided via the metrics parameter is to be optimized within the HPO. The metric must be provided as a string and must be included in the list of provided metrics. If this parameter is not provided, an aggregation function must be provided by means of the agg_function parameter.

agg_function: callable, default=None

This function will receive a dataframe with the metrics computed on each of the folds generated by inner_cv and, based on this information, it will provide a score to be maximized/minimized within the HPO. If the objective_metric parameter is not provided, this parameter must be provided. If both parameters are provided, objective_metric will be ignored.

transforms: List[Transform] or None, default=None

Transformations applied to the data before being provided to the models. These transformations will be fitted using the training data, and will be applied to both training and test data. For more information see the module gojo.core.transform.

verbose: int, default=-1

Verbosity level.

inner_cv_n_jobs: int, default=1

Number of cores used to parallelize the inner cross-validation.

n_jobs: int, default=1

Number of jobs used for parallelization. Parallelization will be done at the optuna trial level and will depend on a temporary database that will be created and automatically removed once the optimization ends. This is an experimental feature; to enable this parameter you have to specify enable_experimental=True.

save_train_preds: bool, default=False

Parameter that indicates whether the predictions made on the training set will be saved in gojo.core.report.CVReport. For large training sets this may involve higher computational and storage costs.

save_transforms: bool, default=False

Parameter that indicates whether the fitted transforms will be saved in gojo.core.report.CVReport.

save_models: bool, default=False

Parameter that indicates whether the fitted models will be saved in gojo.core.report.CVReport. For larger models this may involve higher computational and storage costs.

op_instance_args: dict, default=None

Instance-level optional arguments. This parameter should be a dictionary whose values must be lists or array-like iterables containing the same number of elements as instances in X and y.

enable_experimental: bool, default=False

Parameter indicating whether the experimental features of the function are enabled.

cv_obj: gojo.core.report.CVReport

Cross-validation report. For more information see gojo.core.report.CVReport. The HPO history will be saved in the report metadata (gojo.core.report.CVReport.metadata).

>>> import optuna
>>> import pandas as pd
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.decomposition import PCA
>>>
>>> # GOJO libraries
>>> import gojo
>>> from gojo import core
>>> from gojo import interfaces
>>>
>>> N_JOBS = 8
>>>
>>> # load test dataset (Wine)
>>> wine_dt = datasets.load_wine()
>>>
>>> # create the target variable. Classification problem 0 vs rest
>>> # to see the target names you can use wine_dt['target_names']
>>> y = (wine_dt['target'] == 1).astype(int)
>>> X = wine_dt['data']
>>>
>>> # previous model transforms
>>> transforms = [
>>>     interfaces.SKLearnTransformWrapper(StandardScaler),
>>>     interfaces.SKLearnTransformWrapper(PCA, n_components=5)
>>> ]
>>>
>>> # model hyperparameters
>>> search_space = {
>>>     'degree': ('suggest_int', (1, 10)),
>>>     'class_weight': ('suggest_categorical', [('balanced', None)]),
>>>     'coef0': ('suggest_float', (0.0, 100.00 ))
>>> }
>>>
>>> # default model
>>> model = interfaces.SklearnModelWrapper(
>>>     SVC, kernel='poly', degree=1, coef0=0.0,
>>>     cache_size=1000, class_weight=None
>>> )
>>>
>>> # perform the HPO to optimize the model hyperparameters
>>> cv_report = core.evalCrossValNestedHPO(
>>>     X=X,
>>>     y=y,
>>>     model=model,
>>>     search_space=search_space,
>>>     outer_cv=gojo.util.splitter.getCrossValObj(cv=5, repeats=1, stratified=True, loocv=False, random_state=1997),
>>>     inner_cv=gojo.util.splitter.getCrossValObj(cv=5, repeats=1, stratified=True, loocv=False, random_state=1997),
>>>     hpo_sampler=optuna.samplers.TPESampler(n_startup_trials=40),
>>>     hpo_n_trials=80,
>>>     minimization=False,
>>>     transforms=transforms,
>>>     metrics=core.getDefaultMetrics('binary_classification', bin_threshold=0.5),
>>>     objective_metric='f1_score',
>>>     verbose=1,
>>>     save_train_preds=True,
>>>     save_models=False,
>>>     n_jobs=1
>>> )
>>>
>>> scores = cv_report.getScores(core.getDefaultMetrics('binary_classification', bin_threshold=0.5))
>>> results = pd.concat([
>>>     pd.DataFrame(scores['train'].mean(axis=0)).round(decimals=3),
>>>     pd.DataFrame(scores['test'].mean(axis=0)).round(decimals=3)],
>>>     axis=1).drop(index=['n_fold'])
>>> results.columns = ['Train', 'Test']
>>> results
>>>

gojo.core.report module

class gojo.core.report.CVReport(raw_results: list, X_dataset: gojo.interfaces.data.Dataset, y_dataset: gojo.interfaces.data.Dataset, n_fold_key: str, pred_test_key: str, true_test_key: str, pred_train_key: str, true_train_key: str, test_idx_key: str, train_idx_key: str, trained_model_key: str, fitted_transforms_key: str)[source]

Bases: object

Object returned by the subroutines defined in gojo.core.loops with the results of the cross-validation.

addMetadata(**kwargs)[source]

Function used to add metadata to the report.
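
A brief sketch; the metadata keys shown are arbitrary examples, since any keyword arguments are accepted:

>>> from gojo import core
>>>
>>> # ... cv_report = core.loops.evalCrossVal(...)
>>> cv_report.addMetadata(dataset='wine', description='baseline SVC run')
>>> cv_report.metadata
>>>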

getFittedTransforms(copy: bool = True) dict[source]

Function that returns the fitted transforms if they have been saved in the gojo.core.loops subroutine.

copy: bool, default=True

Parameter that indicates whether to return a deepcopy of the transforms or directly the saved transforms. Defaults to True to avoid inplace modifications.

fitted_transforms: dict or None

Fitted transforms, or None if the transforms were not saved.
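
Usage sketch, assuming the cross-validation loop was executed with save_transforms=True:

>>> from gojo import core
>>>
>>> # ... cv_report = core.loops.evalCrossVal(..., save_transforms=True)
>>> fitted_transforms = cv_report.getFittedTransforms()
>>>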

getScores(metrics: list, loocv: bool = False, supress_warnings: bool = False) dict[source]

Method used to calculate performance metrics for each fold from a provided list of metrics (gojo.core.evaluation.Metric instances). If the subroutine from gojo.core.loops performed a leave-one-out cross-validation, you must set the loocv parameter to True.

metrics: list

List of gojo.core.evaluation.Metric instances.

loocv: bool, default=False

Parameter indicating if the predictions correspond to a LOOCV scheme.

supress_warnings: bool, default=False

Indicates whether to suppress the possible warnings returned by the method.

performance_metrics: dict

Dictionary with the performance associated with the test data (identified with the ‘test’ key) and with the training data (identified with the ‘train’ key).

>>> from gojo import core
>>>
>>> # ... cv_report = core.loops.evalCrossVal(...)
>>> scores = cv_report.getScores(core.getDefaultMetrics('binary_classification', bin_threshold=0.5))
>>>
getTestPredictions() pandas.DataFrame[source]

Function that returns a dataframe with the model predictions, indices, and true labels for the test set.

test_predictions: pd.DataFrame

Model predictions over the test set.

getTrainPredictions(supress_warnings: bool = False) None[source]

Function that returns a dataframe with the model predictions, indices, and true labels for the train set.

Predictions will only be returned if they are available. Note that in some subroutines of gojo.core.loops the predictions made on the training set are not saved, or this decision is left to the user.

supress_warnings: bool, default=False

Silence the warning raised when no training predictions have been made.

train_predictions: pd.DataFrame or None

Model predictions over the train set.
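
Usage sketch, assuming the cross-validation loop was executed with save_train_preds=True so that training predictions are available:

>>> from gojo import core
>>>
>>> # ... cv_report = core.loops.evalCrossVal(..., save_train_preds=True)
>>> test_preds = cv_report.getTestPredictions()
>>> train_preds = cv_report.getTrainPredictions()
>>>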

getTrainedModels(copy: bool = True) dict[source]

Function that returns the trained models if they have been saved in the gojo.core.loops subroutine.

copy: bool, default=True

Parameter that indicates whether to return a deepcopy of the models (using copy.deepcopy) or directly the saved models. Defaults to True to avoid inplace modifications.

trained_models: dict or None

Trained models or None if the models were not saved.
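
Usage sketch, assuming the cross-validation loop was executed with save_models=True:

>>> from gojo import core
>>>
>>> # ... cv_report = core.loops.evalCrossVal(..., save_models=True)
>>> trained_models = cv_report.getTrainedModels()
>>>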

property metadata: dict

Return the report metadata.

Module contents