Fine-tuning
Regularization-based Fine-tuning
L2
class talib.finetune.delta.L2Regularization(model)
The L2 regularization of parameters \(w\) can be described as:
\[\Omega(w) = \dfrac{1}{2} \Vert w \Vert_2^2,\]
- Parameters
model (torch.nn.Module) – The model to which the L2 penalty is applied.
- Shape:
Output: scalar.
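Example: a minimal sketch of adding the penalty to a task loss. Calling the instance with no arguments to obtain the scalar penalty, and the trade-off weight 0.01, are assumptions:
    import torch
    import torch.nn as nn
    from talib.finetune.delta import L2Regularization

    model = nn.Linear(10, 2)
    l2_reg = L2Regularization(model)

    x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
    task_loss = nn.functional.cross_entropy(model(x), y)
    loss = task_loss + 0.01 * l2_reg()  # assumed no-argument call returning a scalar
    loss.backward()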
L2-SP
class talib.finetune.delta.SPRegularization(source_model, target_model)
The SP (Starting Point) regularization from Explicit Inductive Bias for Transfer Learning with Convolutional Networks (ICML 2018).
The SP regularization of parameters \(w\) can be described as:
\[\Omega(w) = \dfrac{1}{2} \Vert w - w_0 \Vert_2^2,\]
where \(w_0\) is the parameter vector of the model pretrained on the source problem, acting as the starting point (SP) in fine-tuning.
- Parameters
source_model (torch.nn.Module) – The source (starting point) model.
target_model (torch.nn.Module) – The target (fine-tuning) model.
- Shape:
Output: scalar.
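Example: a sketch of L2-SP fine-tuning, where a frozen copy of the pretrained model serves as the starting point. The no-argument call and the trade-off weight are assumptions:
    import copy
    from torchvision import models
    from talib.finetune.delta import SPRegularization

    target_model = models.resnet18(pretrained=True)  # the model being fine-tuned
    source_model = copy.deepcopy(target_model)       # frozen starting point w_0
    for param in source_model.parameters():
        param.requires_grad = False

    sp_reg = SPRegularization(source_model, target_model)
    # During training: loss = task_loss + 0.01 * sp_reg()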
DELTA: DEep Learning Transfer using Feature Map with Attention
class talib.finetune.delta.BehavioralRegularization
The behavioral regularization from DELTA: DEep Learning Transfer using Feature Map with Attention for convolutional networks (ICLR 2019).
It can be described as:
\[\Omega(w) = \sum_{j=1}^{N} \Vert FM_j(w, \boldsymbol x) - FM_j(w^0, \boldsymbol x) \Vert_2^2,\]
where \(w^0\) is the parameter vector of the model pretrained on the source problem, acting as the starting point (SP) in fine-tuning, and \(FM_j(w, \boldsymbol x)\) is the feature map generated by the \(j\)-th layer of the model parameterized with \(w\), given the input \(\boldsymbol x\).
- Inputs:
layer_outputs_source (OrderedDict): The dictionary for the source model, where the keys are layer names and the values are the corresponding feature maps.
layer_outputs_target (OrderedDict): The dictionary for the target model, where the keys are layer names and the values are the corresponding feature maps.
- Shape:
Output: scalar.
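Example: a minimal sketch with hand-built feature-map dictionaries, calling the instance with the two OrderedDicts as described under Inputs. In practice, both dictionaries come from IntermediateLayerGetter (documented below) applied to the source and target models:
    from collections import OrderedDict
    import torch
    from talib.finetune.delta import BehavioralRegularization

    behavioral_reg = BehavioralRegularization()
    # Feature maps of one selected layer for a batch of 8 inputs.
    layer_outputs_source = OrderedDict([('layer4', torch.randn(8, 512, 7, 7))])
    layer_outputs_target = OrderedDict([('layer4', torch.randn(8, 512, 7, 7))])
    penalty = behavioral_reg(layer_outputs_source, layer_outputs_target)  # scalar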
class talib.finetune.delta.AttentionBehavioralRegularization(channel_attention)
The behavioral regularization with attention from DELTA: DEep Learning Transfer using Feature Map with Attention for convolutional networks (ICLR 2019).
It can be described as:
\[\Omega(w) = \sum_{j=1}^{N} W_j(w) \Vert FM_j(w, \boldsymbol x) - FM_j(w^0, \boldsymbol x) \Vert_2^2,\]
where \(w^0\) is the parameter vector of the model pretrained on the source problem, acting as the starting point (SP) in fine-tuning, \(FM_j(w, \boldsymbol x)\) is the feature map generated by the \(j\)-th layer of the model parameterized with \(w\) given the input \(\boldsymbol x\), and \(W_j(w)\) is the channel attention of the \(j\)-th layer of the model parameterized with \(w\).
- Parameters
channel_attention (list) – The channel attentions of the feature maps generated by each selected layer. For a layer with \(C\) channels, the channel attention is a tensor of shape [C].
- Inputs:
layer_outputs_source (OrderedDict): The dictionary for the source model, where the keys are layer names and the values are the corresponding feature maps.
layer_outputs_target (OrderedDict): The dictionary for the target model, where the keys are layer names and the values are the corresponding feature maps.
- Shape:
Output: scalar.
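Example: a sketch for a single selected layer with \(C = 512\) channels. The attention vector here is random; in DELTA it is estimated from the importance of each channel on the target task:
    from collections import OrderedDict
    import torch
    from talib.finetune.delta import AttentionBehavioralRegularization

    channel_attention = [torch.softmax(torch.randn(512), dim=0)]  # one layer, shape [512]
    attention_reg = AttentionBehavioralRegularization(channel_attention)
    layer_outputs_source = OrderedDict([('layer4', torch.randn(8, 512, 7, 7))])
    layer_outputs_target = OrderedDict([('layer4', torch.randn(8, 512, 7, 7))])
    penalty = attention_reg(layer_outputs_source, layer_outputs_target)  # scalar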
class talib.finetune.delta.IntermediateLayerGetter(model, return_layers, keep_output=True)
Wraps a model to get the intermediate outputs of selected layers.
- Parameters
model (torch.nn.Module) – The model from which to collect intermediate feature maps.
return_layers (list) – The names of the selected modules whose outputs are returned.
keep_output (bool) – If True, the model’s final output is also returned; otherwise the final output is None. Default: True
- Returns
An OrderedDict of intermediate outputs, whose keys are the selected layer names in return_layers and whose values are the corresponding feature maps, in the same order as return_layers.
The model’s final output, or None if keep_output is False.
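Example: wrapping a torchvision backbone to collect two feature maps; the (intermediates, output) call convention follows the Returns description above:
    import torch
    from torchvision import models
    from talib.finetune.delta import IntermediateLayerGetter

    backbone = models.resnet18(pretrained=True)
    getter = IntermediateLayerGetter(backbone, return_layers=['layer3', 'layer4'])

    x = torch.randn(2, 3, 224, 224)
    intermediates, output = getter(x)
    for name, feature_map in intermediates.items():
        print(name, feature_map.shape)  # layer3: (2, 256, 14, 14); layer4: (2, 512, 7, 7)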
LWF: Learning without Forgetting
class talib.finetune.lwf.Classifier(backbone, num_classes, head_source, head_target=None, bottleneck=None, bottleneck_dim=-1, finetune=True, pool_layer=None)
A classifier used in Learning without Forgetting (ECCV 2016).
- Parameters
backbone (torch.nn.Module) – Any backbone to extract 2-d features from data.
num_classes (int) – Number of classes.
head_source (torch.nn.Module) – Classifier head of source model.
head_target (torch.nn.Module, optional) – Any classifier head. Uses torch.nn.Linear by default.
finetune (bool) – Whether to fine-tune the classifier or train it from scratch. Default: True
- Inputs:
x (tensor): input data fed to backbone
- Outputs:
y_s: predictions of source classifier head
y_t: predictions of target classifier head
- Shape:
Inputs: (b, *) where b is the batch size and * means any number of additional dimensions
y_s: (b, N), where b is the batch size and N is the number of classes
y_t: (b, N), where b is the batch size and N is the number of classes
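Example: a sketch of building the classifier from a torchvision backbone. How the backbone is made to expose 2-d features is library-specific, so the nn.Identity adaptation below is a hypothetical one:
    import torch
    import torch.nn as nn
    from torchvision import models
    from talib.finetune.lwf import Classifier

    backbone = models.resnet18(pretrained=True)
    head_source = backbone.fc    # reuse the pretrained source classifier head
    backbone.fc = nn.Identity()  # hypothetical: make the backbone return 2-d features

    classifier = Classifier(backbone, num_classes=10, head_source=head_source)
    x = torch.randn(4, 3, 224, 224)
    y_s, y_t = classifier(x)     # source-head and target-head predictions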
Co-Tuning
class talib.finetune.co_tuning.CoTuningLoss
The Co-Tuning loss from Co-Tuning for Transfer Learning (NeurIPS 2020).
- Inputs:
input: p(y_s) predicted by the source classifier.
target: p(y_s|y_t), where y_t is the ground-truth class label in the target dataset.
- Shape:
input: (b, N_p), where b is the batch size and N_p is the number of classes in source dataset
target: (b, N_p), where b is the batch size and N_p is the number of classes in source dataset
Outputs: scalar.
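Example: a sketch with random tensors standing in for the inputs. In practice, target is looked up from a Relationship instance (documented below) using each sample's ground-truth label:
    import torch
    from talib.finetune.co_tuning import CoTuningLoss

    loss_fn = CoTuningLoss()
    b, n_p = 8, 1000               # batch size, number of source classes
    input = torch.randn(b, n_p)    # source-class predictions for target images
    target = torch.rand(b, n_p)    # p(y_s|y_t) row for each sample's label
    loss = loss_fn(input, target)  # scalar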
class talib.finetune.co_tuning.Relationship(data_loader, classifier, device, cache=None)
Learns the category relationship p(y_s|y_t) between the source dataset and the target dataset.
- Parameters
data_loader (torch.utils.data.DataLoader) – A data loader of target dataset.
classifier (torch.nn.Module) – A classifier for Co-Tuning.
device (torch.device) – The device on which to run the classifier.
cache (str, optional) – Path at which to load and save the relationship file.
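Example: a sketch combining Relationship with CoTuningLoss. Here data_loader, classifier, and device are assumed to already exist, and the NumPy-style indexing by label is an assumption:
    import torch
    from talib.finetune.co_tuning import CoTuningLoss, Relationship

    # data_loader, classifier and device are assumed to be defined already.
    relationship = Relationship(data_loader, classifier, device, cache='relationship.npy')
    cotuning_loss = CoTuningLoss()
    for images, labels in data_loader:
        # Assumed: indexing by an array of labels returns the p(y_s|y_t) rows.
        target = torch.from_numpy(relationship[labels.numpy()]).to(device)
        y_s = classifier(images.to(device))  # hypothetical: source-head predictions
        loss = cotuning_loss(y_s, target)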
StochNorm: Stochastic Normalization
class talib.finetune.stochnorm.StochNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, p=0.5)
Applies Stochastic Normalization over a 2D or 3D input (a mini-batch of 1D inputs with an optional additional channel dimension).
Stochastic Normalization is proposed in Stochastic Normalization (NeurIPS 2020).
\[\begin{aligned}
\hat{x}_{i,0} &= \frac{x_i - \tilde{\mu}}{\sqrt{\tilde{\sigma} + \epsilon}} \\
\hat{x}_{i,1} &= \frac{x_i - \mu}{\sqrt{\sigma + \epsilon}} \\
\hat{x}_i &= (1-s) \cdot \hat{x}_{i,0} + s \cdot \hat{x}_{i,1} \\
y_i &= \gamma \hat{x}_i + \beta
\end{aligned}\]
where \(\mu\) and \(\sigma\) are the mean and variance of the current mini-batch, \(\tilde{\mu}\) and \(\tilde{\sigma}\) are the current moving statistics of the training data, and \(s\) is a branch-selection variable drawn from a Bernoulli distribution with \(P(s=1)=p\).
During training, there are two normalization branches: one uses the mean and variance of the current mini-batch, while the other uses the current moving statistics of the training data, as in standard batch normalization.
During evaluation, the moving statistics are used for normalization.
- Parameters
num_features (int) – \(c\) from an expected input of size \((b, c, l)\) or \(l\) from an expected input of size \((b, l)\).
eps (float) – A value added to the denominator for numerical stability. Default: 1e-5
momentum (float) – The value used for the running_mean and running_var computation. Default: 0.1
affine (bool) – A boolean value that, when set to True, gives the layer learnable affine parameters. Default: True
track_running_stats (bool) – A boolean value that, when set to True, makes this module track the running mean and variance; when set to False, the module does not track such statistics and initializes the statistics buffers running_mean and running_var as None. When these buffers are None, the module always uses batch statistics, in both training and eval modes. Default: True
p (float) – The probability of choosing the second branch (standard BN). Default: 0.5
- Shape:
Input: \((b, l)\) or \((b, c, l)\)
Output: \((b, l)\) or \((b, c, l)\) (same shape as input)
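Example: StochNorm1d used as a drop-in replacement for torch.nn.BatchNorm1d; the drop-in behavior is an assumption based on the shared constructor arguments:
    import torch
    from talib.finetune.stochnorm import StochNorm1d

    norm = StochNorm1d(64, p=0.5)  # in place of nn.BatchNorm1d(64)
    x = torch.randn(8, 64)

    norm.train()
    y = norm(x)  # each forward pass samples s ~ Bernoulli(p) to mix the two branches
    norm.eval()
    y = norm(x)  # moving statistics only, as in standard batch normalization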
class talib.finetune.stochnorm.StochNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, p=0.5)
Applies Stochastic Normalization over a 4D input (a mini-batch of 2D inputs with an additional channel dimension).
Stochastic Normalization is proposed in Stochastic Normalization (NeurIPS 2020).
\[\begin{aligned}
\hat{x}_{i,0} &= \frac{x_i - \tilde{\mu}}{\sqrt{\tilde{\sigma} + \epsilon}} \\
\hat{x}_{i,1} &= \frac{x_i - \mu}{\sqrt{\sigma + \epsilon}} \\
\hat{x}_i &= (1-s) \cdot \hat{x}_{i,0} + s \cdot \hat{x}_{i,1} \\
y_i &= \gamma \hat{x}_i + \beta
\end{aligned}\]
where \(\mu\) and \(\sigma\) are the mean and variance of the current mini-batch, \(\tilde{\mu}\) and \(\tilde{\sigma}\) are the current moving statistics of the training data, and \(s\) is a branch-selection variable drawn from a Bernoulli distribution with \(P(s=1)=p\).
During training, there are two normalization branches: one uses the mean and variance of the current mini-batch, while the other uses the current moving statistics of the training data, as in standard batch normalization.
During evaluation, the moving statistics are used for normalization.
- Parameters
num_features (int) – \(c\) from an expected input of size \((b, c, h, w)\).
eps (float) – A value added to the denominator for numerical stability. Default: 1e-5
momentum (float) – The value used for the running_mean and running_var computation. Default: 0.1
affine (bool) – A boolean value that, when set to True, gives the layer learnable affine parameters. Default: True
track_running_stats (bool) – A boolean value that, when set to True, makes this module track the running mean and variance; when set to False, the module does not track such statistics and initializes the statistics buffers running_mean and running_var as None. When these buffers are None, the module always uses batch statistics, in both training and eval modes. Default: True
p (float) – The probability of choosing the second branch (standard BN). Default: 0.5
- Shape:
Input: \((b, c, h, w)\)
Output: \((b, c, h, w)\) (same shape as input)
class talib.finetune.stochnorm.StochNorm3d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, p=0.5)
Applies Stochastic Normalization over a 5D input (a mini-batch of 3D inputs with an additional channel dimension).
Stochastic Normalization is proposed in Stochastic Normalization (NeurIPS 2020).
\[\begin{aligned}
\hat{x}_{i,0} &= \frac{x_i - \tilde{\mu}}{\sqrt{\tilde{\sigma} + \epsilon}} \\
\hat{x}_{i,1} &= \frac{x_i - \mu}{\sqrt{\sigma + \epsilon}} \\
\hat{x}_i &= (1-s) \cdot \hat{x}_{i,0} + s \cdot \hat{x}_{i,1} \\
y_i &= \gamma \hat{x}_i + \beta
\end{aligned}\]
where \(\mu\) and \(\sigma\) are the mean and variance of the current mini-batch, \(\tilde{\mu}\) and \(\tilde{\sigma}\) are the current moving statistics of the training data, and \(s\) is a branch-selection variable drawn from a Bernoulli distribution with \(P(s=1)=p\).
During training, there are two normalization branches: one uses the mean and variance of the current mini-batch, while the other uses the current moving statistics of the training data, as in standard batch normalization.
During evaluation, the moving statistics are used for normalization.
- Parameters
num_features (int) – \(c\) from an expected input of size \((b, c, d, h, w)\).
eps (float) – A value added to the denominator for numerical stability. Default: 1e-5
momentum (float) – The value used for the running_mean and running_var computation. Default: 0.1
affine (bool) – A boolean value that, when set to True, gives the layer learnable affine parameters. Default: True
track_running_stats (bool) – A boolean value that, when set to True, makes this module track the running mean and variance; when set to False, the module does not track such statistics and initializes the statistics buffers running_mean and running_var as None. When these buffers are None, the module always uses batch statistics, in both training and eval modes. Default: True
p (float) – The probability of choosing the second branch (standard BN). Default: 0.5
- Shape:
Input: \((b, c, d, h, w)\)
Output: \((b, c, d, h, w)\) (same shape as input)
talib.finetune.stochnorm.convert_model(module, p)
Traverses the input module and its children recursively, replacing every instance of BatchNorm with StochNorm.
- Parameters
module (torch.nn.Module) – The input module to be converted to a StochNorm model.
p (float) – The hyper-parameter \(p\) for the StochNorm layers.
- Returns
The module converted to its StochNorm version.
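Example: converting every BatchNorm layer of a pretrained ResNet-50 before fine-tuning:
    from torchvision import models
    from talib.finetune.stochnorm import convert_model

    model = models.resnet50(pretrained=True)
    model = convert_model(model, p=0.5)  # each BatchNorm2d is replaced by a StochNorm2d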
Adaptive Fine-tuning
Bi-Tuning
class talib.finetune.bi_tuning.Bituning(encoder_q, encoder_k, num_classes, K=40, m=0.999, T=0.07)
The Bi-Tuning module from Bi-tuning of Pre-trained Representations.
- Parameters
encoder_q (Classifier) – Query encoder.
encoder_k (Classifier) – Key encoder.
num_classes (int) – Number of classes
K (int) – Queue size. Default: 40
m (float) – Momentum coefficient. Default: 0.999
T (float) – Temperature. Default: 0.07
- Inputs:
im_q (tensor): input data fed to encoder_q
im_k (tensor): input data fed to encoder_k
labels (tensor): classification labels of input data
- Outputs: y_q, logits_z, logits_y, labels_c
y_q: query classifier’s predictions
logits_z: projector’s predictions on both positive and negative samples
logits_y: classifier’s predictions on both positive and negative samples
labels_c: contrastive labels
- Shape:
im_q, im_k: (minibatch, *) where * means any number of additional dimensions
labels: (minibatch, )
y_q: (minibatch, num_classes)
logits_z: (minibatch, 1 + num_classes x K, projection_dim)
logits_y: (minibatch, 1 + num_classes x K, num_classes)
labels_c: (minibatch, 1 + num_classes x K)
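Example: a sketch of the forward call. The construction of the two Classifier encoders is library-specific and elided here, and the loss combination is an illustrative assumption:
    import torch.nn as nn
    from talib.finetune.bi_tuning import Bituning

    # encoder_q and encoder_k are Classifier instances with identical architecture;
    # their construction, and the batch (im_q, im_k, labels), are elided here.
    bituning = Bituning(encoder_q, encoder_k, num_classes=10, K=40, m=0.999, T=0.07)

    y_q, logits_z, logits_y, labels_c = bituning(im_q, im_k, labels)
    ce = nn.CrossEntropyLoss()
    loss = ce(y_q, labels)  # plus contrastive terms on logits_z / logits_y vs. labels_c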
Rejecting Untransferable Information
BSS: Batch Spectral Shrinkage
class talib.finetune.bss.BatchSpectralShrinkage(k=1)
The regularization term from Catastrophic Forgetting Meets Negative Transfer: Batch Spectral Shrinkage for Safe Transfer Learning (NeurIPS 2019).
The BSS regularization of feature matrix \(F\) can be described as:
\[L_{bss}(F) = \sum_{i=1}^{k} \sigma_{-i}^2,\]
where \(k\) is the number of singular values to be penalized and \(\sigma_{-i}\) is the \(i\)-th smallest singular value of the feature matrix \(F\).
All the singular values of feature matrix \(F\) are computed by SVD:
\[F = U \Sigma V^T,\]
where the main diagonal elements of the singular value matrix \(\Sigma\) are \([\sigma_1, \sigma_2, ..., \sigma_b]\).
- Parameters
k (int) – The number of singular values to be penalized. Default: 1
- Shape:
Input: \((b, |\mathcal{f}|)\) where \(b\) is the batch size and \(|\mathcal{f}|\) is the feature dimension.
Output: scalar.
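Example: penalizing the smallest singular value of a batch of backbone features; the trade-off weight mentioned in the comment is illustrative:
    import torch
    from talib.finetune.bss import BatchSpectralShrinkage

    bss = BatchSpectralShrinkage(k=1)
    features = torch.randn(32, 2048, requires_grad=True)  # (b, |f|) backbone features
    penalty = bss(features)  # scalar: sum of the squared k smallest singular values
    # Typical use: loss = task_loss + 0.001 * penalty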