Self Training Methods¶
Self Ensemble¶
class dalib.adaptation.self_ensemble.ConsistencyLoss(distance_measure, reduction='mean')[source]¶

Consistency loss between the output of the student model and the output of the teacher model. Given a distance measure \(D\), the student model's output \(y\), the teacher model's output \(y_{teacher}\) and a binary mask \(mask\), the consistency loss is

\[D(y, y_{teacher}) * mask\]

- Parameters
distance_measure (callable) – Distance measure function.
reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Default: 'mean'
- Inputs:
y: predictions from student model
y_teacher: predictions from teacher model
mask: binary mask
- Shape:
y, y_teacher: \((N, C)\) where C means the number of classes.
mask: \((N, )\) where N means mini-batch size.
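A minimal usage sketch (not from the library docs). The exact contract of distance_measure is an assumption here; an unreduced MSE is used purely for illustration, mirroring L2ConsistencyLoss below:
>>> import torch
>>> import torch.nn.functional as F
>>> # distance measure returning unreduced distances (assumed contract)
>>> distance = lambda y, y_teacher: F.mse_loss(y, y_teacher, reduction='none')
>>> consistency_loss = ConsistencyLoss(distance, reduction='mean')
>>> y = torch.randn(32, 31)          # student predictions
>>> y_teacher = torch.randn(32, 31)  # teacher predictions
>>> mask = torch.ones(32)            # keep every sample in the mini-batch
>>> loss = consistency_loss(y, y_teacher, mask)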
class dalib.adaptation.self_ensemble.L2ConsistencyLoss(reduction='mean')[source]¶

L2 consistency loss. Given the student model's output \(y\), the teacher model's output \(y_{teacher}\) and a binary mask \(mask\), the L2 consistency loss is
\[\text{MSELoss}(y, y_{teacher}) * mask\]
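A usage sketch along the lines of the inputs documented for ConsistencyLoss above (the same y, y_teacher, mask inputs are assumed):
>>> l2_consistency_loss = L2ConsistencyLoss()
>>> y = torch.randn(32, 31)          # student predictions
>>> y_teacher = torch.randn(32, 31)  # teacher predictions
>>> mask = torch.ones(32)            # keep every sample in the mini-batch
>>> loss = l2_consistency_loss(y, y_teacher, mask)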
class dalib.adaptation.self_ensemble.ClassBalanceLoss(num_classes)[source]¶

Class balance loss that penalises the network for making predictions that exhibit large class imbalance. Given predictions \(y\) with dimension \((N, C)\), we first calculate the mean across the mini-batch dimension, resulting in the mini-batch mean per-class probability \(y_{mean}\) with dimension \((C, )\)
\[y_{mean}^j = \frac{1}{N} \sum_{i=1}^N y_i^j\]

Then we calculate the binary cross entropy loss between \(y_{mean}\) and a uniform probability vector \(u\) of the same dimension, where \(u^j = \frac{1}{C}\)
\[loss = \text{BCELoss}(y_{mean}, u)\]

- Parameters
num_classes (int) – Number of classes
- Inputs:
y (tensor): predictions from classifier
- Shape:
y: \((N, C)\) where C means the number of classes.
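A minimal sketch (not from the library docs); y is assumed to hold per-class probabilities, since a binary cross entropy against the uniform vector \(u\) is computed:
>>> class_balance_loss = ClassBalanceLoss(num_classes=31)
>>> y = torch.softmax(torch.randn(32, 31), dim=1)  # per-class probabilities (assumed input)
>>> loss = class_balance_loss(y)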
class dalib.adaptation.self_ensemble.EmaTeacher(model, alpha)[source]¶

Exponential moving average model used in Self-ensembling for Visual Domain Adaptation (ICLR 2018).
We denote \(\theta_t'\) as the parameters of the teacher model at training step \(t\), \(\theta_t\) as the parameters of the student model at training step \(t\), and \(\alpha\) as the decay rate. The teacher model is then updated in an exponential moving average manner as follows
\[\theta_t'=\alpha \theta_{t-1}' + (1-\alpha)\theta_t\]

- Parameters
model (torch.nn.Module) – student model
alpha (float) – decay rate for EMA.
- Inputs:
x (tensor): input data fed to teacher model
Examples:
>>> classifier = ImageClassifier(backbone, num_classes=31, bottleneck_dim=256).to(device)
>>> # initialize teacher model
>>> teacher = EmaTeacher(classifier, 0.9)
>>> num_iterations = 1000
>>> for _ in range(num_iterations):
>>>     # x denotes input of one mini-batch
>>>     # you can get teacher model's output by teacher(x)
>>>     y_teacher = teacher(x)
>>>     # when you want to update teacher, you should call teacher.update()
>>>     teacher.update()
MCC: Minimum Class Confusion¶
class dalib.adaptation.mcc.MinimumClassConfusionLoss(temperature)[source]¶

Minimum Class Confusion loss minimizes the class confusion in the target predictions.
You can see more details in Minimum Class Confusion for Versatile Domain Adaptation (ECCV 2020)
- Parameters
temperature (float) – The temperature for rescaling; the prediction reduces to a vanilla softmax when temperature is 1.0.
Note
Make sure that temperature is larger than 0.
- Inputs:
g_t (tensor): unnormalized classifier predictions on target domain, \(g^t\)
- Shape:
g_t: \((minibatch, C)\) where C means the number of classes.
Output: scalar.
Examples:
>>> temperature = 2.0
>>> loss = MinimumClassConfusionLoss(temperature)
>>> # logits output from target domain
>>> g_t = torch.randn(batch_size, num_classes)
>>> output = loss(g_t)
MCC can also serve as a regularizer for existing methods. Examples:
>>> from dalib.modules.domain_discriminator import DomainDiscriminator
>>> from dalib.adaptation.cdan import ConditionalDomainAdversarialLoss
>>> num_classes = 2
>>> feature_dim = 1024
>>> batch_size = 10
>>> temperature = 2.0
>>> # the discriminator input dimension matches the multilinear map of features and predictions
>>> discriminator = DomainDiscriminator(in_feature=feature_dim * num_classes, hidden_size=1024)
>>> cdan_loss = ConditionalDomainAdversarialLoss(discriminator, reduction='mean')
>>> mcc_loss = MinimumClassConfusionLoss(temperature)
>>> # features from source domain and target domain
>>> f_s, f_t = torch.randn(batch_size, feature_dim), torch.randn(batch_size, feature_dim)
>>> # logits output from source domain and target domain
>>> g_s, g_t = torch.randn(batch_size, num_classes), torch.randn(batch_size, num_classes)
>>> total_loss = cdan_loss(g_s, f_s, g_t, f_t) + mcc_loss(g_t)
MMT: Mutual Mean-Teaching¶
State-of-the-art unsupervised domain adaptation methods utilize clustering algorithms to generate pseudo labels on the target domain, which are noisy and thus harmful for training. Inspired by teacher-student approaches, the MMT framework provides robust soft pseudo labels in an on-line peer-teaching manner.
We denote the two networks as \(f_1,f_2\) and their parameters as \(\theta_1,\theta_2\). The authors also propose to use the temporally averaged model of each network, \(\text{ensemble}(f_1)\) and \(\text{ensemble}(f_2)\), to generate more reliable soft pseudo labels for supervising the other network. Specifically, the parameters of the temporally averaged models of the two networks at the current iteration \(T\) are denoted as \(E^{(T)}[\theta_1]\) and \(E^{(T)}[\theta_2]\) respectively, which can be calculated as

\[E^{(T)}[\theta_1]=\alpha E^{(T-1)}[\theta_1]+(1-\alpha)\theta_1\]

\[E^{(T)}[\theta_2]=\alpha E^{(T-1)}[\theta_2]+(1-\alpha)\theta_2\]

where \(E^{(T-1)}[\theta_1],E^{(T-1)}[\theta_2]\) indicate the temporally averaged parameters of the two networks in the previous iteration \((T-1)\), the initial temporally averaged parameters are \(E^{(0)}[\theta_1]=\theta_1,E^{(0)}[\theta_2]=\theta_2\), and \(\alpha\) is the momentum.
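As a sketch of this temporal averaging step (hypothetical module names, not the library's API; the same update is applied to the second network):
>>> # ensemble_1 holds E[theta_1], net_1 holds theta_1, alpha is the momentum
>>> alpha = 0.999
>>> with torch.no_grad():
>>>     for e_p, p in zip(ensemble_1.parameters(), net_1.parameters()):
>>>         e_p.mul_(alpha).add_(p, alpha=1 - alpha)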
These two networks cooperate with each other in three ways:
- When running the clustering algorithm, we average the features produced by \(\text{ensemble}(f_1)\) and \(\text{ensemble}(f_2)\) instead of only considering one of them.
- A soft triplet loss is optimized between \(f_1\) and \(\text{ensemble}(f_2)\) (and vice versa) to force each network to learn from the temporal average of the other network.
- A cross entropy loss is optimized between \(f_1\) and \(\text{ensemble}(f_2)\) (and vice versa) to force each network to learn from the temporal average of the other network.
The above-mentioned loss functions are listed below; more details can be found in the training scripts.
class common.vision.models.reid.loss.SoftTripletLoss(margin=None, normalize_feature=False)[source]¶

Soft triplet loss from Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification (ICLR 2020). Consider a triplet \(x,x_p,x_n\) (anchor, positive, negative) with corresponding features \(f,f_p,f_n\). We optimize for a smaller distance between \(f\) and \(f_p\) and a larger distance between \(f\) and \(f_n\). The inner product is adopted as the similarity measure, so the soft triplet loss is defined as
\[loss = \mathcal{L}_{\text{bce}}(\frac{\text{exp}(f^Tf_p)}{\text{exp}(f^Tf_p)+\text{exp}(f^Tf_n)}, 1)\]

where \(\mathcal{L}_{\text{bce}}\) means binary cross entropy loss. We denote the first argument in the above loss function as \(T\). When features from a teacher network can be obtained, we can calculate \(T_{teacher}\) from them and use it as the label, resulting in the following soft version
\[loss = \mathcal{L}_{\text{bce}}(T, T_{teacher})\]
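A hypothetical usage sketch; the call pattern below (features from one network, features from the other network's temporal average, and identity labels) is an assumption, so consult the training scripts for the authoritative usage:
>>> soft_triplet_loss = SoftTripletLoss(margin=None, normalize_feature=False)
>>> f_1 = torch.randn(32, 2048)            # features from f_1
>>> f_2_ema = torch.randn(32, 2048)        # features from ensemble(f_2)
>>> labels = torch.randint(0, 751, (32,))  # person identity labels
>>> loss = soft_triplet_loss(f_1, f_2_ema, labels)  # assumed call pattern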
class common.vision.models.reid.loss.CrossEntropyLoss[source]¶

We use \(C\) to denote the number of classes and \(N\) to denote the mini-batch size. This criterion expects unnormalized predictions \(y\_{logits}\) of shape \((N, C)\) and \(target\_{logits}\) of the same shape \((N, C)\). We first normalize them into probability distributions over the classes
\[y = \text{softmax}(y\_{logits})\]

\[target = \text{softmax}(target\_{logits})\]

The final objective is calculated as
\[\text{loss} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^C -target_i^j \times \text{log} (y_i^j)\]
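A minimal sketch based on the shapes described above (the argument order (y_logits, target_logits) is an assumption):
>>> criterion = CrossEntropyLoss()
>>> y_logits = torch.randn(32, 751)       # unnormalized predictions from one network
>>> target_logits = torch.randn(32, 751)  # unnormalized predictions from the teacher
>>> loss = criterion(y_logits, target_logits)  # assumed argument order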
Other Methods¶
AFN: Adaptive Feature Norm¶
class dalib.adaptation.afn.AdaptiveFeatureNorm(delta)[source]¶

The Stepwise Adaptive Feature Norm loss from Larger Norm More Transferable: An Adaptive Feature Norm Approach for Unsupervised Domain Adaptation (ICCV 2019).
Instead of using a restrictive scalar \(R\) to match the corresponding feature norms, the Stepwise Adaptive Feature Norm is used in order to learn task-specific features with large norms in a progressive manner. We denote the parameters of the backbone \(G\) as \(\theta_g\), the parameters of the bottleneck \(F_f\) as \(\theta_f\), the parameters of the classifier head \(F_y\) as \(\theta_y\), and the features extracted from sample \(x_i\) as \(h(x_i;\theta)\). The full loss is calculated as follows
\[L(\theta_g,\theta_f,\theta_y)=\frac{1}{n_s}\sum_{(x_i,y_i)\in D_s}L_y(x_i,y_i)+\frac{\lambda}{n_s+n_t} \sum_{x_i\in D_s\cup D_t}L_d(h(x_i;\theta_0)+\Delta_r,h(x_i;\theta))\]

where \(L_y\) denotes the classification loss, \(L_d\) denotes the norm loss, and \(\theta_0\) and \(\theta\) represent the model parameters from the last iteration and the current iteration, respectively.
- Parameters
delta (float) – positive residual scalar to control the feature norm enlargement.
- Inputs:
f (tensor): feature representations on source or target domain.
- Shape:
f: \((N, F)\) where F means the dimension of input features.
Outputs: scalar.
Examples:
>>> adaptive_feature_norm = AdaptiveFeatureNorm(delta=1)
>>> f_s = torch.randn(32, 1000)
>>> f_t = torch.randn(32, 1000)
>>> norm_loss = adaptive_feature_norm(f_s) + adaptive_feature_norm(f_t)
class dalib.adaptation.afn.Block(in_features, bottleneck_dim=1000, dropout_p=0.5)[source]¶

Basic building block for the AFN Image Classifier, with structure FC-BN-ReLU-Dropout. We use \(L_2\)-preserved dropout layers. Given mask probability \(p\), input \(x_k\) and generated mask \(a_k\), vanilla dropout layers calculate
\[\hat{x}_k = a_k\frac{1}{1-p}x_k\]

while in \(L_2\)-preserved dropout layers
\[\hat{x}_k = a_k\frac{1}{\sqrt{1-p}}x_k\]
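A minimal numeric sketch (not the library's implementation) of the scaling difference between the two dropout variants:
>>> import torch
>>> p = 0.5
>>> x = torch.randn(32, 1000)
>>> a = (torch.rand_like(x) > p).float()      # dropout mask a_k
>>> x_vanilla = a * x / (1 - p)               # vanilla dropout rescaling
>>> x_l2_preserved = a * x / (1 - p) ** 0.5   # L2-preserved dropout rescaling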
class dalib.adaptation.afn.ImageClassifier(backbone, num_classes, num_blocks=1, bottleneck_dim=1000, dropout_p=0.5, **kwargs)[source]¶

ImageClassifier for AFN.
- Parameters
backbone (torch.nn.Module) – Any backbone to extract 2-d features from data
num_classes (int) – Number of classes
num_blocks (int, optional) – Number of basic blocks. Default: 1
bottleneck_dim (int, optional) – Feature dimension of the bottleneck layer. Default: 1000
dropout_p (float, optional) – dropout probability. Default: 0.5
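A hypothetical training-time sketch combining ImageClassifier with AdaptiveFeatureNorm. Here backbone, device, x_s, x_t and labels_s are assumed to be defined elsewhere, and the (predictions, features) output convention is an assumption:
>>> import torch.nn.functional as F
>>> classifier = ImageClassifier(backbone, num_classes=31, bottleneck_dim=1000).to(device)
>>> adaptive_feature_norm = AdaptiveFeatureNorm(delta=1)
>>> y_s, f_s = classifier(x_s)  # source predictions and bottleneck features (assumed outputs)
>>> y_t, f_t = classifier(x_t)  # target predictions and bottleneck features (assumed outputs)
>>> cls_loss = F.cross_entropy(y_s, labels_s)
>>> norm_loss = adaptive_feature_norm(f_s) + adaptive_feature_norm(f_t)
>>> total_loss = cls_loss + norm_loss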