Cross-validation Examples
Optunity offers a simple interface to k-fold cross-validation.
The fold generation procedure is aware of both strata and clusters.
Please refer to Cross-validation for an overview and optunity.cross_validated() for implementation and API details.
We will build examples step by step. The basic setup is a train and predict function, along with some data to construct folds over:
from __future__ import print_function
import optunity as opt

def train(x, y, filler=''):
    print(filler + 'Training data:')
    for instance, label in zip(x, y):
        print(filler + str(instance) + ' ' + str(label))

def predict(x, filler=''):
    print(filler + 'Testing data:')
    for instance in x:
        print(filler + str(instance))

data = list(range(9))
labels = [0] * 9
The recommended way to perform cross-validation is using the optunity.cross_validation.cross_validated() function decorator:
@opt.cross_validated(x=data, y=labels, num_folds=3)
def cved(x_train, y_train, x_test, y_test):
    train(x_train, y_train)
    predict(x_test)
    return 0.0

cved()
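To make the mechanics concrete, here is a minimal plain-Python sketch of what such a decorator conceptually does (this is not Optunity's implementation, and kfold_mean is a hypothetical helper): partition the data into num_folds folds, call the wrapped function once per train/test split, and average the per-fold scores.

```python
import random

def kfold_mean(x, y, num_folds, f):
    """Hypothetical helper: split the data into num_folds folds,
    evaluate f on every train/test split and return the mean score."""
    indices = list(range(len(x)))
    random.shuffle(indices)
    # deal the shuffled indices round-robin into num_folds folds
    folds = [indices[i::num_folds] for i in range(num_folds)]
    scores = []
    for test_idx in folds:
        train_idx = [i for i in indices if i not in test_idx]
        scores.append(f([x[i] for i in train_idx], [y[i] for i in train_idx],
                        [x[i] for i in test_idx], [y[i] for i in test_idx]))
    return sum(scores) / num_folds

data = list(range(9))
labels = [0] * 9

# score each fold by its test-set size: 9 instances over 3 folds
score = kfold_mean(data, labels, 3, lambda xtr, ytr, xte, yte: len(xte))
```

Optunity's decorator additionally handles strata, clusters and score aggregation; this sketch only captures the core fold-and-average loop.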
Nested cross-validation
Nested cross-validation is a commonly used approach to estimate the generalization performance of a modeling process which includes model selection internally. A good summary is provided here.
Nested cv consists of two cross-validation procedures wrapped around each other: the inner cv is used for model selection, while the outer cv estimates generalization performance.
This can be done in a straightforward manner using Optunity:
@opt.cross_validated(x=data, y=labels, num_folds=3)
def nested_cv(x_train, y_train, x_test, y_test):

    @opt.cross_validated(x=x_train, y=y_train, num_folds=3)
    def inner_cv(x_train, y_train, x_test, y_test):
        train(x_train, y_train, '...')
        predict(x_test, '...')
        return 0.0

    inner_cv()
    predict(x_test)
    return 0.0

nested_cv()
The inner optunity.cross_validated() decorator has access to the train and test folds generated by the outer procedure (x_train and x_test). For notational simplicity we assume a problem without labels here.
Note
The inner folds are regenerated in every iteration (since we are redefining inner_cv each time). The inner folds will therefore be different each time. The outer folds remain static, unless regenerate_folds=True is passed.
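A small sketch of why this happens (make_folds is a hypothetical stand-in, not Optunity's code): if fold generation draws a fresh shuffle on every call, then redefining the inner cross-validation re-runs that shuffle, so the inner folds generally differ between outer iterations.

```python
import random

def make_folds(n, num_folds, rng):
    """Hypothetical fold generator: shuffle the indices and deal them
    round-robin into num_folds folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::num_folds] for i in range(num_folds)]

rng = random.Random(0)
first = make_folds(9, 3, rng)
second = make_folds(9, 3, rng)  # a fresh shuffle: generally different folds

# both calls nevertheless yield valid partitions of the same nine indices
assert sorted(sum(first, [])) == list(range(9))
assert sorted(sum(second, [])) == list(range(9))
```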
Below we illustrate a more complete example of nested cv, which includes hyperparameter optimization with optunity.maximize(). Assume we have access to the functions svm = svm_train(x, y, c, g) and predictions = svm_predict(svm, x), where c and g are hyperparameters to be optimized for accuracy:
@opt.cross_validated(x=data, y=labels, num_folds=3)
def nested_cv(x_train, y_train, x_test, y_test):

    @opt.cross_validated(x=x_train, y=y_train, num_folds=3)
    def inner_cv(x_train, y_train, x_test, y_test, c, g):
        svm = svm_train(x_train, y_train, c, g)
        predictions = svm_predict(svm, x_test)
        return opt.score_functions.accuracy(y_test, predictions)

    optimal_parameters, _, _ = opt.maximize(inner_cv, num_evals=100,
                                            c=[0, 10], g=[0, 10])
    optimal_svm = svm_train(x_train, y_train, **optimal_parameters)
    predictions = svm_predict(optimal_svm, x_test)
    return opt.score_functions.accuracy(y_test, predictions)

overall_accuracy = nested_cv()
Note
You are free to use different score and aggregation functions in the inner and outer cv.
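As a minimal plain-Python sketch of that idea (these helpers are illustrative, not Optunity's API): an aggregation function reduces the per-fold scores to the cross-validation result, and nothing forces the inner and outer procedures to use the same one.

```python
def mean(scores):
    """Default-style aggregation: one summary number across folds."""
    return sum(scores) / len(scores)

def identity(scores):
    """Alternative aggregation: keep the per-fold scores, e.g. to
    inspect their spread before trusting the summary."""
    return scores

fold_scores = [0.8, 0.9, 1.0]   # per-fold scores from a hypothetical 3-fold run
summary = mean(fold_scores)     # close to 0.9
per_fold = identity(fold_scores)
```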