gojo.core package
Submodules
gojo.core.evaluation module
- class gojo.core.evaluation.Metric(name: str, function: callable, bin_threshold: Optional[float] = None, ignore_bin_threshold: bool = False, multiclass: bool = False, number_of_classes: Optional[int] = None, use_multiclass_sparse: bool = True, **kwargs)[source]
Bases: object
Base class used to create any type of performance evaluation metric compatible with the gojo framework.
- name : str
Name given to the performance metric.
- function : callable
Function that will receive as input two numpy.ndarray (y_true and y_pred) and must return a scalar or a numpy.ndarray.
- bin_threshold : float or int, default=None
Threshold used to binarize the input predictions. By default, no thresholding is applied.
- ignore_bin_threshold : bool, default=False
If True, the bin_threshold parameter will be ignored.
- multiclass : bool, default=False
Parameter indicating if a multi-class classification metric is being computed.
- number_of_classes : int, default=None
Parameter indicating the number of classes in a multi-class classification problem. This parameter will not have any effect when multiclass=False.
- use_multiclass_sparse : bool, default=True
Parameter indicating whether the multi-class predictions are provided as one-hot vectors. This parameter will not have any effect when multiclass=False.
- **kwargs
Optional parameters provided to the input callable specified by function.
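As an additional illustration (not part of the original docstring), a minimal sketch of a custom metric whose extra keyword argument is forwarded to the underlying callable via **kwargs; the use of sklearn's f1_score and the 0.5 threshold are assumptions for the example:
>>> from gojo import core
>>> from sklearn import metrics
>>>
>>> # hypothetical metric: the 'average' keyword is forwarded to
>>> # sklearn.metrics.f1_score through **kwargs; predictions are
>>> # binarized at 0.5 via bin_threshold
>>> f1 = core.Metric(
>>>     'f1_score',
>>>     metrics.f1_score,
>>>     bin_threshold=0.5,
>>>     average='binary')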
- gojo.core.evaluation.flatFunctionInput(fn: callable)[source]
Function used to flatten the input predictions before the computation of the metric. Internally, the input y_pred and y_true will be flattened before calling the provided function.
>>> from gojo import core
>>> from sklearn import metrics
>>>
>>> metric = core.Metric(
>>>     'accuracy',
>>>     core.flatFunctionInput(metrics.accuracy_score),
>>>     bin_threshold=0.5)
- gojo.core.evaluation.getAvailableDefaultMetrics(task: Optional[str] = None) dict [source]
Return a dictionary with task names and the default metrics defined for those tasks. The output can be filtered by the task parameter, which indicates the task whose metrics you want to see.
- task : str, default=None
Specify a task to see only the metrics associated with that task.
- task_info : dict
Dictionary where the keys correspond to the task and the values to the metrics defined by default for the associated task.
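A minimal usage sketch (the exact metric names returned depend on the library defaults):
>>> from gojo import core
>>>
>>> # dictionary mapping each task to its default metrics
>>> core.getAvailableDefaultMetrics()
>>>
>>> # restrict the output to a single task
>>> core.getAvailableDefaultMetrics(task='binary_classification')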
- gojo.core.evaluation.getDefaultMetrics(task: str, select: Optional[list] = None, bin_threshold: Optional[float] = None, multiclass: bool = False, number_of_classes: Optional[int] = None, use_multiclass_sparse: bool = False) list [source]
Function used to get a series of pre-defined metrics for evaluating model performance.
- task : str
Task-associated metrics. Currently available tasks are: binary_classification and regression.
- select : list, default=None
Subset of the returned metrics to select (in case you do not want to calculate all of them). By default, all metrics associated with the task will be returned.
Note: metrics are represented by strings.
- bin_threshold : float or int, default=None
Threshold used to binarize the input predictions. By default, no thresholding is applied.
- multiclass : bool, default=False
Parameter indicating if a multi-class classification metric is being computed.
- number_of_classes : int, default=None
Parameter indicating the number of classes in a multi-class classification problem. This parameter will not have any effect when multiclass=False.
- use_multiclass_sparse : bool, default=False
Parameter indicating if the multi-class level predictions are provided as a one-hot vector. This parameter will not have any effect when multiclass=False.
- metrics : list
List of instances of the gojo.core.Metric class.
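A short usage sketch (the same call appears in the evalCrossVal example below); the select value assumes 'f1_score' is among the default binary-classification metric names, as suggested by the nested-HPO example further down:
>>> from gojo import core
>>>
>>> # all default binary-classification metrics, binarizing predictions at 0.5
>>> all_metrics = core.getDefaultMetrics('binary_classification', bin_threshold=0.5)
>>>
>>> # keep only a subset of the default metrics (identified by their string names)
>>> f1_only = core.getDefaultMetrics(
>>>     'binary_classification', select=['f1_score'], bin_threshold=0.5)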
- gojo.core.evaluation.getScores(y_true: numpy.ndarray, y_pred: numpy.ndarray, metrics: list) dict [source]
Function used to calculate the scores given by the metrics passed within the metrics parameter.
- y_true : np.ndarray
True labels.
- y_pred : np.ndarray
Predicted labels.
- metrics : List[gojo.core.Metric]
List of gojo.core.Metric instances.
- metric_scores : dict
Dictionary where the keys will correspond to the metric names and the values to the metric scores.
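A minimal sketch with made-up toy data:
>>> import numpy as np
>>> from gojo import core
>>>
>>> # toy ground truth and predicted probabilities (illustrative only)
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_pred = np.array([0.1, 0.8, 0.6, 0.3, 0.4])
>>>
>>> # compute every default binary-classification metric on the toy data
>>> metrics = core.getDefaultMetrics('binary_classification', bin_threshold=0.5)
>>> scores = core.getScores(y_true=y_true, y_pred=y_pred, metrics=metrics)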
gojo.core.loops module
- gojo.core.loops.evalCrossVal(X: pandas.DataFrame, y: pandas.Series, model: gojo.interfaces.model.Model, cv: gojo.util.splitter.SimpleSplitter, transforms: Optional[List[gojo.interfaces.transform.Transform]] = None, verbose: int = -1, n_jobs: int = 1, save_train_preds: bool = False, save_transforms: bool = False, save_models: bool = False, op_instance_args: Optional[dict] = None) gojo.core.report.CVReport [source]
Subroutine used to evaluate a model according to a cross-validation scheme provided by the cv argument.
- X : np.ndarray or pd.DataFrame
Variables used to fit the model.
- y : np.ndarray or pd.DataFrame or pd.Series
Target prediction variable.
- model : gojo.interfaces.Model
Model to be trained. The input model must follow the gojo.base.Model interface.
- cv : Cross-validation splitter
Cross-validation schema. For more information about cross-validation see the sklearn.model_selection module. The gojo module implements useful functions for easy loading of cross-validation objects (see gojo.util.getCrossValObj()). Supported splitters are sklearn.model_selection.RepeatedKFold, sklearn.model_selection.RepeatedStratifiedKFold, sklearn.model_selection.LeaveOneOut, gojo.util.splitter.SimpleSplitter, gojo.util.splitter.InstanceLevelKFoldSplitter, or gojo.util.splitter.PredefinedSplitter.
- transforms : List[Transform] or None, default=None
Transformations applied to the data before being provided to the models. These transformations will be fitted using the training data, and will be applied to both training and test data. For more information see the module gojo.core.transform.
- verbose : int, default=-1
Verbosity level.
- n_jobs : int, default=1
Number of jobs used for parallelization.
- save_train_preds : bool, default=False
Parameter that indicates whether the predictions made on the training set will be saved in gojo.core.report.CVReport. For large training sets this may involve higher computational and storage costs.
- save_transforms : bool, default=False
Parameter that indicates whether the fitted transforms will be saved in gojo.core.report.CVReport.
- save_models : bool, default=False
Parameter that indicates whether the fitted models will be saved in gojo.core.report.CVReport. For larger models this may involve higher computational and storage costs.
- op_instance_args : dict, default=None
Instance-level optional arguments. This parameter should be a dictionary whose values must be a list or an array-like iterable containing the same number of elements as there are instances in X and y.
- cv_obj : gojo.core.report.CVReport
Cross-validation report. For more information see gojo.core.report.CVReport.
>>> import pandas as pd
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.decomposition import PCA
>>>
>>> # GOJO libraries
>>> import gojo
>>> from gojo import core
>>> from gojo import interfaces
>>>
>>> N_JOBS = 8
>>>
>>> # load test dataset (Wine)
>>> wine_dt = datasets.load_wine()
>>>
>>> # create the target variable. Classification problem 0 vs rest
>>> # to see the target names you can use wine_dt['target_names']
>>> y = (wine_dt['target'] == 1).astype(int)
>>> X = wine_dt['data']
>>>
>>> # previous model transforms
>>> transforms = [
>>>     interfaces.SKLearnTransformWrapper(StandardScaler),
>>>     interfaces.SKLearnTransformWrapper(PCA, n_components=5)
>>> ]
>>>
>>> # default model
>>> model = interfaces.SklearnModelWrapper(
>>>     SVC, kernel='poly', degree=1, coef0=0.0,
>>>     cache_size=1000, class_weight=None
>>> )
>>>
>>> # evaluate the model using a simple cross-validation strategy with
>>> # default parameters
>>> cv_report = core.evalCrossVal(
>>>     X=X, y=y,
>>>     model=model,
>>>     cv=gojo.util.splitter.getCrossValObj(cv=5, repeats=1, stratified=True, loocv=False, random_state=1997),
>>>     transforms=transforms,
>>>     verbose=True,
>>>     save_train_preds=True,
>>>     save_models=False,
>>>     save_transforms=False,
>>>     n_jobs=N_JOBS
>>> )
>>>
>>> scores = cv_report.getScores(core.getDefaultMetrics('binary_classification', bin_threshold=0.5))
>>> results = pd.concat([
>>>     pd.DataFrame(scores['train'].mean(axis=0)).round(decimals=3),
>>>     pd.DataFrame(scores['test'].mean(axis=0)).round(decimals=3)],
>>>     axis=1).drop(index=['n_fold'])
>>> results.columns = ['Train', 'Test']
>>> results
- gojo.core.loops.evalCrossValNestedHPO(X: pandas.DataFrame, y: pandas.Series, model: gojo.interfaces.model.Model, search_space: dict, outer_cv: gojo.util.splitter.SimpleSplitter, inner_cv: gojo.util.splitter.SimpleSplitter, hpo_sampler: optuna.samplers.BaseSampler, hpo_n_trials: int, minimization: bool, metrics: List[gojo.core.evaluation.Metric], objective_metric: Optional[str] = None, agg_function: Optional[callable] = None, transforms: Optional[List[gojo.interfaces.transform.Transform]] = None, verbose: int = -1, n_jobs: int = 1, inner_cv_n_jobs: int = 1, save_train_preds: bool = False, save_transforms: bool = False, save_models: bool = False, op_instance_args: Optional[dict] = None, enable_experimental: bool = False)[source]
Subroutine used to evaluate a model according to a cross-validation scheme provided by the outer_cv argument. This function also performs a nested cross-validation for hyperparameter optimization (HPO) based on the optuna library.
- X : np.ndarray or pd.DataFrame
Variables used to fit the model.
- y : np.ndarray or pd.DataFrame or pd.Series
Target prediction variable.
- model : gojo.interfaces.Model
Model to be trained. The input model must follow the gojo.base.Model interface.
- search_space : dict
Search space used for performing the HPO. For more information about distributions and sampling strategies consult optuna.
>>> search_space = {
>>>     # sample from an integer distribution
>>>     'max_depth': ('suggest_int', (2, 10)),
>>>     # ... from a uniform distribution
>>>     'max_samples': ('suggest_float', (0.5, 1.0)),
>>> }
Keyword arguments can be passed by providing a dictionary in the third position where the key will correspond to the name of the parameter:
>>> search_space = {
>>>     # sample from an integer distribution in log space
>>>     'max_depth': ('suggest_int', (2, 40), dict(step=1, log=True)),
>>>     # ... from a uniform distribution
>>>     'max_samples': ('suggest_float', (0.5, 1.0)),
>>> }
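For reference, each (method, args[, kwargs]) entry presumably maps onto the corresponding optuna.Trial suggestion call; the following expansion is an assumption about the internals, not gojo code:
>>> import optuna
>>>
>>> def _expand(trial: optuna.Trial) -> dict:
>>>     # ('suggest_int', (2, 40), dict(step=1, log=True)) presumably becomes:
>>>     max_depth = trial.suggest_int('max_depth', 2, 40, step=1, log=True)
>>>     # ('suggest_float', (0.5, 1.0)) presumably becomes:
>>>     max_samples = trial.suggest_float('max_samples', 0.5, 1.0)
>>>     return {'max_depth': max_depth, 'max_samples': max_samples}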
- outer_cv : Cross-validation splitter
Cross-validation schema. For more information about cross-validation see the sklearn.model_selection module. The gojo module implements useful functions for easy loading of cross-validation objects (see gojo.util.getCrossValObj()). Supported splitters are sklearn.model_selection.RepeatedKFold, sklearn.model_selection.RepeatedStratifiedKFold, sklearn.model_selection.LeaveOneOut, gojo.util.splitter.SimpleSplitter, gojo.util.splitter.InstanceLevelKFoldSplitter, or gojo.util.splitter.PredefinedSplitter.
- inner_cv : Cross-validation splitter
Inner cross-validation schema used to evaluate model performance within the nested cross-validation that optimizes the model hyperparameters. For more information about cross-validation see the sklearn.model_selection module. The gojo module implements useful functions for easy loading of cross-validation objects (see gojo.util.getCrossValObj()). Supported splitters are sklearn.model_selection.RepeatedKFold, sklearn.model_selection.RepeatedStratifiedKFold, sklearn.model_selection.LeaveOneOut, gojo.util.splitter.SimpleSplitter, gojo.util.splitter.InstanceLevelKFoldSplitter, or gojo.util.splitter.PredefinedSplitter.
- hpo_sampler : optuna.samplers.BaseSampler
Sampler used to suggest model hyperparameters. For more information see optuna.
- hpo_n_trials : int
Number of HPO iterations.
- minimization: bool
Parameter indicating if the HPO objective function must be minimized. If minimization=False the objective function will be maximized.
- metrics : List[gojo.core.evaluation.Metric]
Metrics used within the nested cross-validation to evaluate the hyperparameter configuration.
- objective_metric : str, default=None
It is possible to indicate which of the metrics provided via the metrics parameter is to be optimized within the HPO. The metric must be provided as a string and must be included in the list of provided metrics. If this parameter is not provided, an aggregation function must be provided by means of the agg_function parameter.
- agg_function : callable, default=None
This function will receive a dataframe with the metrics calculated on each of the folds generated by inner_cv and, based on this information, will provide a score that will be maximized/minimized within the HPO (see the sketch after the return-value description below). If the objective_metric parameter is not provided, this parameter must be provided. If both parameters are provided, objective_metric will be ignored.
- transforms : List[Transform] or None, default=None
Transformations applied to the data before being provided to the models. These transformations will be fitted using the training data, and will be applied to both training and test data. For more information see the module gojo.core.transform.
- verbose : int, default=-1
Verbosity level.
- inner_cv_n_jobs : int, default=1
Number of cores used to parallelize the inner cross-validation.
- n_jobs : int, default=1
Number of jobs used for parallelization. Parallelization will be done at the optuna trial level and will depend on a temporary database that will be created and automatically removed once the optimization ends. This is an experimental feature; to enable this parameter you have to specify enable_experimental=True.
- save_train_preds : bool, default=False
Parameter that indicates whether the predictions made on the training set will be saved in gojo.core.report.CVReport. For large training sets this may involve higher computational and storage costs.
- save_transforms : bool, default=False
Parameter that indicates whether the fitted transforms will be saved in gojo.core.report.CVReport.
- save_models : bool, default=False
Parameter that indicates whether the fitted models will be saved in gojo.core.report.CVReport. For larger models this may involve higher computational and storage costs.
- op_instance_args : dict, default=None
Instance-level optional arguments. This parameter should be a dictionary whose values must be a list or an array-like iterable containing the same number of elements as there are instances in X and y.
- enable_experimental: bool, default=False
Parameter indicating whether the experimental features of the function are enabled.
- cv_obj : gojo.core.report.CVReport
Cross-validation report. For more information see gojo.core.report.CVReport. The HPO history will be saved in the report metadata (gojo.core.report.CVReport.metadata).
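As referenced in the agg_function description above, a hypothetical aggregation function; it assumes the inner-fold metrics dataframe contains an 'f1_score' column (the available columns depend on the metrics provided):
>>> import pandas as pd
>>>
>>> # aggregate the inner-fold metrics into a single score to maximize:
>>> # mean F1 penalized by its dispersion across folds
>>> def aggScore(fold_metrics: pd.DataFrame) -> float:
>>>     return float(fold_metrics['f1_score'].mean() - fold_metrics['f1_score'].std())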
>>> import optuna
>>> import pandas as pd
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.decomposition import PCA
>>>
>>> # GOJO libraries
>>> import gojo
>>> from gojo import core
>>>
>>> N_JOBS = 8
>>>
>>> # load test dataset (Wine)
>>> wine_dt = datasets.load_wine()
>>>
>>> # create the target variable. Classification problem 0 vs rest
>>> # to see the target names you can use wine_dt['target_names']
>>> y = (wine_dt['target'] == 1).astype(int)
>>> X = wine_dt['data']
>>>
>>> # previous model transforms
>>> transforms = [
>>>     core.SKLearnTransformWrapper(StandardScaler),
>>>     core.SKLearnTransformWrapper(PCA, n_components=5)
>>> ]
>>>
>>> # model hyperparameters
>>> search_space = {
>>>     'degree': ('suggest_int', (1, 10)),
>>>     'class_weight': ('suggest_categorical', [('balanced', None)]),
>>>     'coef0': ('suggest_float', (0.0, 100.00))
>>> }
>>>
>>> # default model
>>> model = core.SklearnModelWrapper(
>>>     SVC, kernel='poly', degree=1, coef0=0.0,
>>>     cache_size=1000, class_weight=None
>>> )
>>>
>>> # perform the HPO to optimize model hyperparameters
>>> cv_report = core.evalCrossValNestedHPO(
>>>     X=X,
>>>     y=y,
>>>     model=model,
>>>     search_space=search_space,
>>>     outer_cv=gojo.util.splitter.getCrossValObj(cv=5, repeats=1, stratified=True, loocv=False, random_state=1997),
>>>     inner_cv=gojo.util.splitter.getCrossValObj(cv=5, repeats=1, stratified=True, loocv=False, random_state=1997),
>>>     hpo_sampler=optuna.samplers.TPESampler(n_startup_trials=40),
>>>     hpo_n_trials=80,
>>>     minimization=False,
>>>     transforms=transforms,
>>>     metrics=core.getDefaultMetrics('binary_classification', bin_threshold=0.5),
>>>     objective_metric='f1_score',
>>>     verbose=1,
>>>     save_train_preds=True,
>>>     save_models=False,
>>>     n_jobs=1
>>> )
>>>
>>> scores = cv_report.getScores(core.getDefaultMetrics('binary_classification', bin_threshold=0.5))
>>> results = pd.concat([
>>>     pd.DataFrame(scores['train'].mean(axis=0)).round(decimals=3),
>>>     pd.DataFrame(scores['test'].mean(axis=0)).round(decimals=3)],
>>>     axis=1).drop(index=['n_fold'])
>>> results.columns = ['Train', 'Test']
>>> results
gojo.core.report module
- class gojo.core.report.CVReport(raw_results: list, X_dataset: gojo.interfaces.data.Dataset, y_dataset: gojo.interfaces.data.Dataset, n_fold_key: str, pred_test_key: str, true_test_key: str, pred_train_key: str, true_train_key: str, test_idx_key: str, train_idx_key: str, trained_model_key: str, fitted_transforms_key: str)[source]
Bases: object
Object returned by the subroutines defined in gojo.core.loops with the results of the cross-validation.
- getFittedTransforms(copy: bool = True) dict [source]
Function that returns the fitted transforms if they have been saved in the gojo.core.loops subroutine.
- copy : bool, default=True
Parameter that indicates whether to return a deepcopy of the transforms or the saved transforms directly. Defaults to True to avoid in-place modifications.
- fitted_transforms : dict or None
Fitted transforms or None if the transforms were not saved.
- getScores(metrics: list, loocv: bool = False, supress_warnings: bool = False) dict [source]
Method used to calculate performance metrics for each fold from a list of metrics (gojo.core.evaluation.Metric instances) provided. If the subroutine from gojo.core.loops performed a leave-one-out cross-validation, you must specify the loocv parameter as True.
- metrics : list
List of gojo.core.evaluation.Metric instances.
- loocv : bool
Parameter indicating if the predictions correspond to a LOOCV schema
- supress_warnings : bool, default=False
Indicates whether to suppress the possible warnings returned by the method.
- performance_metrics : dict
Dictionary with the performance associated with the test data (identified with the ‘test’ key) and with the training data (identified with the ‘train’ key).
>>> from gojo import core
>>>
>>> # ... cv_report = core.loops.evalCrossVal(...)
>>> scores = cv_report.getScores(core.getDefaultMetrics('binary_classification', bin_threshold=0.5))
- getTestPredictions() pandas.DataFrame [source]
Function that returns a dataframe with the model predictions, indices, and true labels for the test set.
- test_predictions : pd.DataFrame
Model predictions over the test set.
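A minimal usage sketch (assuming a report produced by one of the gojo.core.loops subroutines):
>>> from gojo import core
>>>
>>> # ... cv_report = core.loops.evalCrossVal(...)
>>> test_preds = cv_report.getTestPredictions()
>>> test_preds.head()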
- getTrainPredictions(supress_warnings: bool = False) None [source]
Function that returns a dataframe with the model predictions, indices, and true labels for the train set.
Predictions will only be returned if they are available. Note that in some subroutines of gojo.core.loops the predictions made on the training set are not saved, or this decision is left to the user.
- supress_warnings : bool, default=False
Silence the warning raised when no training predictions have been made.
- train_predictions : pd.DataFrame or None
Model predictions over the train set.
- getTrainedModels(copy: bool = True) dict [source]
Function that returns the trained models if they have been saved in the gojo.core.loops subroutine.
- copy : bool, default=True
Parameter that indicates whether to return a deepcopy of the models (using copy.deepcopy) or the saved models directly. Defaults to True to avoid in-place modifications.
- trained_models : dict or None
Trained models or None if the models were not saved.
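A minimal usage sketch (assuming the models were saved with save_models=True; the dictionary keys are assumed to identify the folds):
>>> from gojo import core
>>>
>>> # ... cv_report = core.loops.evalCrossVal(..., save_models=True)
>>> fold_models = cv_report.getTrainedModels(copy=True)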
- property metadata: dict
Return the report metadata.
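A minimal sketch of accessing the metadata (as noted above, evalCrossValNestedHPO stores the HPO history here; the exact structure of the dictionary is not documented in this section):
>>> # ... cv_report = core.loops.evalCrossValNestedHPO(...)
>>> report_metadata = cv_report.metadata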