Self-supervised learning aims to extract representations from unlabeled visual data, and it has become very popular in computer vision. This article covers SWAV, a robust self-supervised learning method, from a mathematical perspective. To that end, we provide insights and intuitions for why this method works. Additionally, we discuss the optimal transport problem with an entropy constraint and its fast approximation, which is a key point of SWAV that stays hidden when you simply read the paper.
In any case, if you want to learn more about the general aspects of self-supervised learning, like augmentations, intuitions, softmax with temperature, and contrastive learning, consult our previous article.
SWAV Method overview
Definitions
Let $\mathbf{z}_t$ and $\mathbf{z}_s$ be two image features extracted from two different augmentations (views) $t$ and $s$ of the same image.
Source: BYOL

Our actual targets: Let $\mathbf{q}_t$ and $\mathbf{q}_s$ be the codes (targets) computed by matching the features $\mathbf{z}_t$ and $\mathbf{z}_s$ to the set of prototypes.

Prototypes: consider a set of $K$ prototypes $\{\mathbf{c}_1, \dots, \mathbf{c}_K\}$, which are trainable vectors learned jointly with the network.
Source: SWAV paper, Caron et al 2020
Clusters and prototypes are used interchangeably throughout this article. Don’t confuse them with “codes” though! Nonetheless, codes and assignments are also used interchangeably.
SWAV compares the features $\mathbf{z}_t$ and $\mathbf{z}_s$ indirectly, through their intermediate codes $\mathbf{q}_t$ and $\mathbf{q}_s$: the code computed from one view is predicted from the feature of the other view.
Intuition: If $\mathbf{z}_t$ and $\mathbf{z}_s$ capture similar information, it should be possible to predict the code $\mathbf{q}_s$ from the feature $\mathbf{z}_t$, and vice versa.
Source: SWAV’s github page
Difference between SWAV and SimCLR
In contrastive learning methods, the features from different transformations of the same image are compared directly to each other. SWAV does not directly compare image features. Why?
In SwAV, there is an intermediate “codes” step ($Q$). To create the codes (targets), we assign the image features to prototype vectors. We then solve a “swapped” prediction problem wherein the codes (targets) are swapped between the two image views.
Source: SWAV paper, Caron et al 2020
Prototype vectors $\{\mathbf{c}_1, \dots, \mathbf{c}_K\}$ are learned by backpropagation along with the network’s parameters and are L2-normalized, so they live on the unit sphere.
The unit sphere and its implications
By definition, a unit sphere is the set of points with L2 distance equal to 1 from a fixed central point, here the origin. Note that this is different from a unit ball, where the L2 distance is less than or equal to 1 from the centre.
Moving on the surface of the sphere corresponds to a smooth change in the assignments. In fact, many self-supervised methods use this L2-normalization trick, especially contrastive ones. SWAV also applies L2-normalization to the features as well as to the prototypes throughout training.
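The effect of L2-normalization can be sketched in a few lines of NumPy: after normalization, the dot product of two vectors equals their cosine similarity and is bounded in $[-1, 1]$. This is a minimal illustration with made-up shapes, not code from the SWAV repository:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Project each vector onto the unit sphere: ||x|| = 1 after this call.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

rng = np.random.default_rng(0)
z = l2_normalize(rng.normal(size=(4, 128)))   # 4 feature vectors
c = l2_normalize(rng.normal(size=(3, 128)))   # 3 prototype vectors

# Dot products of unit vectors are exactly their cosine similarities.
sims = z @ c.T  # shape (4, 3), all entries in [-1, 1]
```

Because every vector has unit norm, the similarity scores $z^T c$ used later by SWAV are directly comparable across samples and prototypes.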
SWAV method steps
Let’s recap the steps of SWAV:

Create $N$ views from the input image $X$ using a set of stochastic transformations $T$.

Calculate the image feature representations $z$ and L2-normalize them.

Calculate the softmax-normalized similarities between all $z$ and $c$: $\text{softmax}(z^T c)$.

Calculate the code matrix $Q$ iteratively. We intentionally defer this part; see further on for this step.

Calculate the cross-entropy loss between the predicted assignment of view $t$ (computed from $z_t$) and the code of the other view, and vice versa.

Average the loss over all $N$ views.
Source: SWAV paper, Caron et al 2020
Again, notice the difference between cluster assignments (codes) and cluster prototype vectors ($c$). Here is a detailed explanation of the loss function:
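The swapped prediction loss can be sketched in NumPy. This is a minimal, illustrative version assuming the codes $q_t, q_s$ are already computed; the official implementation uses PyTorch and a temperature $\tau$ in the softmax:

```python
import numpy as np

def softmax(x, tau=0.1):
    # Temperature-scaled softmax over prototype similarity scores.
    e = np.exp((x - x.max(axis=-1, keepdims=True)) / tau)
    return e / e.sum(axis=-1, keepdims=True)

def swapped_loss(z_t, z_s, q_t, q_s, C, tau=0.1):
    # Predicted soft assignments from each view's features...
    p_t = softmax(z_t @ C, tau)  # shape (B, K)
    p_s = softmax(z_s @ C, tau)
    # ...compared against the *other* view's codes via cross-entropy.
    ce_ts = -np.mean(np.sum(q_s * np.log(p_t), axis=1))
    ce_st = -np.mean(np.sum(q_t * np.log(p_s), axis=1))
    return ce_ts + ce_st

# Toy usage with random unit-norm features and (dummy) uniform codes.
rng = np.random.default_rng(1)
B, D, K = 4, 16, 8
norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
z_t, z_s = norm(rng.normal(size=(B, D))), norm(rng.normal(size=(B, D)))
C = norm(rng.normal(size=(K, D))).T  # D x K matrix of prototypes
q = np.full((B, K), 1.0 / K)         # placeholder codes for illustration
loss = swapped_loss(z_t, z_s, q, q, C)
```

The key point is the swap: the prediction from $z_t$ is supervised by $q_s$, and the prediction from $z_s$ by $q_t$.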
Digging into SWAV’s math: approximating Q
Understanding the Optimal Transport Problem with Entropic Constraint
As discussed, the code vectors $\mathbf{q}_1, \dots, \mathbf{q}_B$ for a batch of $B$ feature vectors are not predicted by the network; instead, they are computed online by solving an optimal transport problem.
For $K$ prototypes and batch size $B$, we collect the codes into the matrix $\mathbf{Q} = [\mathbf{q}_1, \dots, \mathbf{q}_B] \in \mathbb{R}^{K \times B}$.
For a formal and very detailed formulation, analysis, and solution of this problem, I recommend having a look at the referenced paper on optimal transport.
For SwAV, we define the optimal code matrix $\mathbf{Q}^*$ as:

$$\mathbf{Q}^* = \underset{\mathbf{Q} \in \mathcal{Q}}{\mathrm{argmax}} \; \text{Tr}(\mathbf{Q}^T \mathbf{C}^T \mathbf{Z}) + \varepsilon H(\mathbf{Q}),$$

with $H$ being the entropy $H(\mathbf{Q}) = -\sum_{ij} \mathbf{Q}_{ij} \log \mathbf{Q}_{ij}$ and $\varepsilon$ a parameter that controls the smoothness of the solution. The trace $\text{Tr}$ is defined to be the sum of the elements on the main diagonal.
A matrix $\mathbf{Q}$ from the set $\mathcal{Q}$ is constrained in three ways:

All its entries have to be positive.

The sum of each row has to be $1/K$.

The sum of each column has to be $1/B$.

Note that this also implies that the sum of all entries is $1$; hence these matrices allow for a probabilistic interpretation, for example w.r.t. entropy. However, $\mathbf{Q}$ is not a stochastic matrix: its rows and columns sum to $1/K$ and $1/B$, not $1$.
A simple matrix in this set is a matrix whose entries are all $1/(BK)$, which corresponds to a uniform distribution over all entries. This matrix maximizes the entropy $H(Q)$.
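We can verify numerically that the uniform matrix satisfies both marginal constraints and attains entropy $\log(BK)$, the maximum over the set $\mathcal{Q}$. A small NumPy check with arbitrarily chosen $B$ and $K$:

```python
import numpy as np

K, B = 3, 4
Q = np.full((K, B), 1.0 / (B * K))  # the uniform matrix in the set Q

row_sums = Q.sum(axis=1)    # each should equal 1/K
col_sums = Q.sum(axis=0)    # each should equal 1/B
H = -np.sum(Q * np.log(Q))  # entropy of Q
```

Here `H` comes out to exactly $\log(BK)$, the entropy of a uniform distribution over $BK$ entries.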
With a good intuition on the set $\mathcal{Q}$, we can examine the target function.
Optimal transport without entropy
Ignoring the entropy term for now, we can go step by step through the first term, $\text{Tr}(\mathbf{Q}^T \mathbf{C}^T \mathbf{Z})$.

Since both $\mathbf{C}$ and $\mathbf{Z}$ are L2-normalized, the matrix product $\mathbf{C}^T \mathbf{Z} \in \mathbb{R}^{K \times B}$ contains the cosine similarities between every prototype and every feature vector.

The first column of $\mathbf{C}^T \mathbf{Z}$ holds the similarity scores between all $K$ prototypes and the first feature vector $\mathbf{z}_1$.

This means that the first diagonal entry of $\mathbf{Q}^T \mathbf{C}^T \mathbf{Z}$ is the sum of these similarities, weighted by the entries of the first code vector $\mathbf{q}_1$, while its contribution to the entropy term is $-\sum_{k} \mathbf{Q}_{k1} \log \mathbf{Q}_{k1}$.

Similarly, the second diagonal entry of $\mathbf{Q}^T \mathbf{C}^T \mathbf{Z}$ is the sum of the similarities of $\mathbf{z}_2$ to all prototypes, weighted by $\mathbf{q}_2$.

Doing this for all diagonal entries and taking the sum results in $\text{Tr}(\mathbf{Q}^T \mathbf{C}^T \mathbf{Z}) = \sum_{i=1}^{B} \mathbf{q}_i^T \mathbf{C}^T \mathbf{z}_i$: the total similarity between features and prototypes under the weighting $\mathbf{Q}$.
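This trace-as-a-sum view is easy to check numerically with random matrices (the shapes below are arbitrary; the identity holds for any $\mathbf{Q}$):

```python
import numpy as np

rng = np.random.default_rng(0)
K, B, D = 3, 5, 8
C = rng.normal(size=(D, K))  # prototypes as columns
Z = rng.normal(size=(D, B))  # features as columns
Q = rng.random(size=(K, B))
Q /= Q.sum()                 # any weighting works for this identity

# Trace of Q^T C^T Z equals the sum of per-sample weighted similarities.
trace = np.trace(Q.T @ C.T @ Z)
per_sample = sum(Q[:, i] @ (C.T @ Z[:, i]) for i in range(B))
```

Both quantities agree up to floating-point error.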
Intuition: While the optimal matrix $\mathbf{Q}^*$ has to respect the row and column constraints, without the entropy term it would put as much weight as possible on the largest similarity scores, i.e. it would be a (near-)hard assignment of features to prototypes.
By this design, such a method would be more biased toward mode collapse, i.e. choosing a single prototype for all features, than toward collapsing to a uniform distribution.
Solution? Enforcing entropy to the rescue!
The entropy constraint
So why do we need the entropy term at all?
Well, while the resulting code vectors $\mathbf{q}_1, \dots, \mathbf{q}_B$ would be valid, hard assignments turn out to provide worse training targets than smoothed ones, and the entropy term gives us explicit control over this smoothness.
For $\varepsilon \rightarrow \infty$, the entropy term dominates, and $\mathbf{Q}^*$ approaches the trivial solution: the uniform matrix with all entries equal to $1/(BK)$.

When $\varepsilon = 0$, we recover the plain optimal transport problem from above, whose solution tends toward hard assignments.

Finally, small positive values for $\varepsilon$ result in a slightly smoothed $\mathbf{Q}^*$, which is what SWAV uses in practice ($\varepsilon = 0.05$ in the authors' code).
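The effect of $\varepsilon$ can be illustrated with a toy solver: normalize $\exp(S/\varepsilon)$ with a few alternating row/column scalings. This is a simplified NumPy sketch, not the official implementation, and the score matrix here is random:

```python
import numpy as np

def toy_sinkhorn(S, eps, niters=50):
    # Approximate Q* = argmax Tr(Q^T S) + eps * H(Q) over the set Q.
    K, B = S.shape
    Q = np.exp(S / eps)
    Q /= Q.sum()
    for _ in range(niters):
        Q *= (1.0 / K) / Q.sum(axis=1, keepdims=True)  # row marginals 1/K
        Q *= (1.0 / B) / Q.sum(axis=0, keepdims=True)  # column marginals 1/B
    return Q

def entropy(Q):
    return -np.sum(Q * np.log(Q + 1e-12))

rng = np.random.default_rng(0)
S = rng.normal(size=(3, 3))              # toy similarity scores
Q_smooth = toy_sinkhorn(S, eps=100.0)    # large eps -> near-uniform Q*
Q_sharp = toy_sinkhorn(S, eps=0.05)      # small eps -> near-hard assignment
```

With large $\varepsilon$ the entropy of the result approaches the maximum $\log(BK)$, while small $\varepsilon$ concentrates the mass on the highest-scoring entries.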
Revisiting the constraints for $\mathcal{Q}$, the rowsum and columnsum constraints imply an equal amount of total weight is assigned to each prototype and each feature vector respectively.
The constraints impose a strong regularization that results in avoiding mode collapse, where all feature vectors are assigned to the same prototype all the time.
Online estimation of Q* for SWAV
What is left now is to compute $\mathbf{Q}^*$ efficiently, on the fly, for every batch during training.
Using Lemma 2 from page 5 of the referenced optimal transport paper, we know that the solution takes the form:

$$\mathbf{Q}^* = \text{Diag}(\mathbf{u}) \exp\left(\frac{\mathbf{C}^T \mathbf{Z}}{\varepsilon}\right) \text{Diag}(\mathbf{v}),$$

where $\mathbf{u}$ and $\mathbf{v}$ act as row and column normalization vectors, respectively. An exact computation here is inefficient. However, the Sinkhorn-Knopp algorithm provides a fast, iterative alternative. We can initialize a matrix $\mathbf{Q}$ as the exponential term $\exp(\mathbf{C}^T \mathbf{Z} / \varepsilon)$ and then alternately rescale its rows and columns to match the marginals $1/K$ and $1/B$.
Sinkhorn-Knopp code analysis
Here is the authors' pseudocode for approximating $Q$ from the similarity scores, lightly edited into runnable PyTorch:
import torch

@torch.no_grad()
def sinkhorn(scores, eps=0.05, niters=3):
    # scores: matrix of similarity scores, shape (B, K)
    Q = torch.exp(scores / eps).T        # shape (K, B)
    Q /= torch.sum(Q)                    # normalize so all entries sum to 1
    K, B = Q.shape
    r, c = torch.ones(K) / K, torch.ones(B) / B  # target marginals
    for _ in range(niters):
        u = torch.sum(Q, dim=1)                       # current row sums
        Q *= (r / u).unsqueeze(1)                     # enforce row sums of 1/K
        Q *= (c / torch.sum(Q, dim=0)).unsqueeze(0)   # enforce column sums of 1/B
    return (Q / torch.sum(Q, dim=0, keepdim=True)).T  # each column sums to 1: soft codes
To approximate $Q$, we take as input only the similarity score matrix $C^T Z$, together with the smoothing parameter $\varepsilon$ and the number of iterations; three iterations are enough in practice.
Intuition on the clusters/prototypes
So what is actually learned in these clusters/prototypes?
Well, the prototypes’ main purpose is to summarize the dataset. SWAV is thus still a form of contrastive learning: it can be interpreted as contrasting image views by comparing their cluster assignments instead of their features directly.
Ultimately, we contrast against the clusters and not the whole dataset. SimCLR instead uses the other samples in the batch as negatives, and these are not always representative of the whole dataset. This makes the SWAV objective more tractable.
This can be observed in the experiments. Compared to SimCLR, SWAV pretraining converges faster and is less sensitive to the batch size. Moreover, SWAV is not that sensitive to the number of clusters: typically, 3K clusters are used for ImageNet. In general, it is recommended to use roughly one order of magnitude more clusters than the number of real classes. For STL-10, which has 10 classes, 512 clusters would be enough.
The multi-crop idea: augmenting views with smaller images
Every time I read about contrastive self-supervised learning methods, I wonder: why just 2 views? This obvious question is also answered in the SWAV paper.
Multi-crop. Source: SWAV paper, Caron et al 2020
To this end, SwAV proposes a multi-crop augmentation strategy, where the same image is randomly cropped to obtain 2 global (i.e. 224×224) views and $N=4$ additional low-resolution (e.g. 96×96) local views.
As shown below, multi-crop is a very general trick to improve self-supervised learning representations. It can be used out of the box for any method, with surprisingly good results: roughly a 2% improvement on SimCLR!
Source: SWAV paper, Caron et al 2020
The authors also observed that mapping small parts of a scene to more global views significantly boosts the performance.
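The multi-crop idea can be sketched on raw arrays in a few lines. This is illustrative only; the official implementation uses torchvision's random resized crops with different scale ranges, and the counts and sizes below follow the setup described above:

```python
import numpy as np

def random_crop(img, size, rng):
    # Take a random square crop of side `size` from an H x W x C array.
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def multi_crop(img, rng, n_global=2, n_local=4, g=224, l=96):
    # 2 global (high-resolution) views plus N smaller local views.
    views = [random_crop(img, g, rng) for _ in range(n_global)]
    views += [random_crop(img, l, rng) for _ in range(n_local)]
    return views

rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))  # stand-in for a real image
views = multi_crop(img, rng)     # 6 views total: 2 global + 4 local
```

The local views are much cheaper to encode, so the extra views increase the number of comparisons at a modest computational cost.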
Results
To evaluate the learned representations of $f$, the backbone model, i.e. a ResNet-50, is frozen, and a single linear layer is trained on top. This protocol, called linear evaluation, provides a fair comparison of the learned representations. Below are the results of SWAV compared to other state-of-the-art methods.
(left) Comparison between clustering-based and contrastive instance methods and impact of multi-crop. (right) Performance as a function of epochs. Source: SWAV paper, Caron et al 2020
Left: Classification accuracy on ImageNet is reported. The linear layers are trained on frozen features from different selfsupervised methods with a standard ResNet50. Right: Performance of wide ResNets50 by factors of 2, 4, and 5.
Conclusion
In this post, we provided an overview of SWAV and its hidden math. We covered the details of optimal transport with and without the entropy constraint. This post would not have been possible without the detailed mathematical analysis of Tim.
You can also check out this interview on SWAV with its first author, Mathilde Caron.
For further reading, take a look at self-supervised representation learning on videos or SWAV’s experimental report. You can even run your own experiments with the official code if you have a multi-GPU machine!
Finally, I have to say that I’m a bit biased in favor of FAIR’s work on visual self-supervised learning. This team really rocks!
References

SWAV Paper

SWAV Code

Ref paper on optimal transport

SWAV’s Report in WANDB
Cite as:
@article{kaiser2021swav,
  title = "Understanding SWAV: self-supervised learning with contrasting cluster assignments",
  author = "Kaiser, Tim and Adaloglou, Nikolaos",
  journal = "https://theaisummer.com/",
  year = "2021",
  howpublished = {https://theaisummer.com/swav/},
}