Cross-validation Examples
Optunity offers a simple interface to k-fold cross-validation.
The fold generation procedure is aware of both strata and clusters.
Please refer to Cross-validation for an overview and optunity.cross_validated() for implementation and API details.
We will build examples step by step. The basic setup is a train and predict function, along with some data to construct folds over:
from __future__ import print_function
import optunity as opt

def train(x, y, filler=''):
    print(filler + 'Training data:')
    for instance, label in zip(x, y):
        print(filler + str(instance) + ' ' + str(label))

def predict(x, filler=''):
    print(filler + 'Testing data:')
    for instance in x:
        print(filler + str(instance))

data = list(range(9))
labels = [0] * 9
The recommended way to perform cross-validation is using the optunity.cross_validation.cross_validated() function decorator:
@opt.cross_validated(x=data, y=labels, num_folds=3)
def cved(x_train, y_train, x_test, y_test):
    train(x_train, y_train)
    predict(x_test)
    return 0.0

cved()
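To make the mechanics concrete, here is a minimal plain-Python sketch of what such a decorator conceptually does (this is not Optunity's implementation, and kfold_mean is a hypothetical helper): partition the data into num_folds folds, call the wrapped function once per train/test split, and average the per-fold scores.

```python
import random

def kfold_mean(x, y, num_folds, f):
    """Hypothetical helper: split the data into num_folds folds,
    evaluate f on every train/test split and return the mean score."""
    indices = list(range(len(x)))
    random.shuffle(indices)
    # deal the shuffled indices round-robin into num_folds folds
    folds = [indices[i::num_folds] for i in range(num_folds)]
    scores = []
    for test_idx in folds:
        train_idx = [i for i in indices if i not in test_idx]
        scores.append(f([x[i] for i in train_idx], [y[i] for i in train_idx],
                        [x[i] for i in test_idx], [y[i] for i in test_idx]))
    return sum(scores) / num_folds

data = list(range(9))
labels = [0] * 9

# score each fold by its test-set size: 9 instances over 3 folds
score = kfold_mean(data, labels, 3, lambda xtr, ytr, xte, yte: len(xte))
```

Optunity's decorator additionally handles strata, clusters and score aggregation; this sketch only captures the core fold-and-average loop.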
Nested cross-validation
Nested cross-validation is a commonly used approach to estimate the generalization performance of a modeling process which includes model selection internally. A good summary is provided here.
Nested cv consists of two cross-validation procedures wrapped around each other: the inner cv is used for model selection, while the outer cv estimates generalization performance.
This can be done in a straightforward manner using Optunity:
@opt.cross_validated(x=data, y=labels, num_folds=3)
def nested_cv(x_train, y_train, x_test, y_test):

    @opt.cross_validated(x=x_train, y=y_train, num_folds=3)
    def inner_cv(x_train, y_train, x_test, y_test):
        train(x_train, y_train, '...')
        predict(x_test, '...')
        return 0.0

    inner_cv()
    predict(x_test)
    return 0.0

nested_cv()
The inner optunity.cross_validated() decorator has access to the train and test folds generated by the outer procedure (x_train and x_test). For notational simplicity we assume a problem without labels here.
Note
The inner folds are regenerated in every iteration (since we are redefining inner_cv each time). The inner folds will therefore be different each time. The outer folds remain static, unless regenerate_folds=True is passed.
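A small sketch of why this happens (make_folds is a hypothetical stand-in, not Optunity's code): if fold generation draws a fresh shuffle on every call, then redefining the inner cross-validation re-runs that shuffle, so the inner folds generally differ between outer iterations.

```python
import random

def make_folds(n, num_folds, rng):
    """Hypothetical fold generator: shuffle the indices and deal them
    round-robin into num_folds folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::num_folds] for i in range(num_folds)]

rng = random.Random(0)
first = make_folds(9, 3, rng)
second = make_folds(9, 3, rng)  # a fresh shuffle: generally different folds

# both calls nevertheless yield valid partitions of the same nine indices
assert sorted(sum(first, [])) == list(range(9))
assert sorted(sum(second, [])) == list(range(9))
```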
Below we illustrate a more complete example of nested cv, which includes hyperparameter optimization with optunity.maximize(). Assume we have access to the functions svm = svm_train(x, y, c, g) and predictions = svm_predict(svm, x), where c and g are hyperparameters to be optimized for accuracy:
@opt.cross_validated(x=data, y=labels, num_folds=3)
def nested_cv(x_train, y_train, x_test, y_test):

    @opt.cross_validated(x=x_train, y=y_train, num_folds=3)
    def inner_cv(x_train, y_train, x_test, y_test, c, g):
        svm = svm_train(x_train, y_train, c, g)
        predictions = svm_predict(svm, x_test)
        return opt.score_functions.accuracy(y_test, predictions)

    optimal_parameters, _, _ = opt.maximize(inner_cv, num_evals=100,
                                            c=[0, 10], g=[0, 10])
    optimal_svm = svm_train(x_train, y_train, **optimal_parameters)
    predictions = svm_predict(optimal_svm, x_test)
    return opt.score_functions.accuracy(y_test, predictions)

overall_accuracy = nested_cv()
Note
You are free to use different score and aggregation functions in the inner and outer cv.
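As a minimal plain-Python sketch of that idea (these helpers are illustrative, not Optunity's API): an aggregation function reduces the per-fold scores to the cross-validation result, and nothing forces the inner and outer procedures to use the same one.

```python
def mean(scores):
    """Default-style aggregation: one summary number across folds."""
    return sum(scores) / len(scores)

def identity(scores):
    """Alternative aggregation: keep the per-fold scores, e.g. to
    inspect their spread before trusting the summary."""
    return scores

fold_scores = [0.8, 0.9, 1.0]   # per-fold scores from a hypothetical 3-fold run
summary = mean(fold_scores)     # close to 0.9
per_fold = identity(fold_scores)
```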