Dataset¶
This module contains classes and functions related to data handling.
CVSplit¶
This class is responsible for performing the NeuralNet
’s
internal cross validation. For this, it sticks closely to the sklearn
standards. For more information on how sklearn handles cross
validation, look here.
The first argument that CVSplit
takes is cv
. It works
analogously to the cv
argument from sklearn
GridSearchCV
,
cross_val_score()
, etc. For those not
familiar, here is a short explanation of what you may pass:
None
: Use the default 3-fold cross validation.- integer: Specifies the number of folds in a
(Stratified)KFold
, - float: Represents the proportion of the dataset to include in the
validation split (e.g.
0.2
for 20%). - An object to be used as a cross-validation generator.
- An iterable yielding train, validation splits.
Furthermore, CVSplit
takes a stratified
argument that
determines whether a stratified split should be made (only makes sense
for discrete targets), and a random_state
argument, which is used
in case the cross validation split has a random component.
One difference to sklearn’s cross validation is that skorch makes only a single split. In sklearn, you would expect that in a 5-fold cross validation, the model is trained 5 times on the different combination of folds. This is often not desirable for neural networks, since training takes a lot of time. Therefore, skorch only ever makes one split.
If you would like to have all splits, you can still use skorch in
conjunction with the sklearn functions, as you would do with any
other sklearn-compatible estimator. Just remember to set
train_split=None
, so that the whole dataset is used for
training. Below is shown an example of making out-of-fold predictions
with skorch and sklearn:
net = NeuralNetClassifier(
module=MyModule,
train_split=None,
)
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(net, X, y, cv=5)
Dataset¶
In PyTorch, we have the concept of a
Dataset
and a
DataLoader
. The former is purely the
container of the data and only needs to implement __len__()
and
__getitem__(<int>)
. The latter does the heavy lifting, such as
sampling, shuffling, and distributed processing.
skorch uses the PyTorch DataLoader
s by default.
However, the Dataset
s provided by PyTorch
are not sufficient for our usecase; for instance, they don’t work with
numpy.ndarray
s. That’s why we provide our own
Dataset
class. This container works with:
numpy.ndarray
s- PyTorch
Tensor
s - pandas DataFrames or Series
In addition, you can pass dictionaries or lists of one of those data
types, e.g. a dictionary of numpy.ndarray
s. When you pass
dictionaries, the keys of the dictionaries are used as the argument
name for the forward()
method of the net’s
module
. Similarly, the column names of pandas DataFrame
s are
used as argument names. The example below should illustrate how to use
this feature:
import numpy as np
import torch
import torch.nn.functionl as F
class MyModule(torch.nn.Module):
def __init__(self):
super().__init__()
self.dense_a = torch.nn.Linear(10, 100)
self.dense_b = torch.nn.Linear(20, 100)
self.output = torch.nn.Linear(200, 2)
def forward(self, key_a, key_b):
hid_a = F.relu(self.dense_a(key_a))
hid_b = F.relu(self.dense_b(key_b))
concat = torch.cat((hid_a, hid_b), dim=1)
out = F.softmax(self.output(concat))
return out
net = NeuralNetClassifier(MyModule)
X = {
'key_a': np.random.random((1000, 10)).astype(np.float32),
'key_b': np.random.random((1000, 20)).astype(np.float32),
}
y = np.random.randint(0, 2, size=1000)
net.fit(X, y)
Note that the keys in the dictionary X
exactly match the argument
names in the forward()
method. This way, you
can easily work with several different types of input features.
The Dataset
from skorch makes the assumption that you always
have an X
and a y
, where X
represents the input data and
y
the target. However, you may leave y=None
, in which case
Dataset
returns a dummy variable.
Dataset
applies a transform final transform on the data
before passing it on to the PyTorch
DataLoader
. By default, it replaces y
by a dummy variable in case it is None
. If you would like to
apply your own transformation on the data, you should subclass
Dataset
and override the
transform()
method, then pass your
custom class to NeuralNet
as the dataset
argument.