A note on the equivalence between inverse OT and minimizing the Monge gap.
This blog is about an elegant and practical reformulation of inverse Optimal Transport (OT) that enables efficient computations. It is based on a derivation found in
Entropic OT
We consider two discrete distributions that we wish to compare: \(\sum_i a_i \delta_{\mathbf{x}_i}\) and \(\sum_j b_j \delta_{\mathbf{y}_j}\), where moving one unit of mass from \(\mathbf{x}_i\) to \(\mathbf{y}_j\) costs \(C_{ij}\).
Primal problem. The entropic OT problem reads
\[\begin{align}\label{eq:eot} \min_{\mathbf{P} \in \Pi(\mathbf{a}, \mathbf{b})} \: \: \langle \mathbf{C}, \mathbf{P} \rangle - \varepsilon \mathrm{H}(\mathbf{P}) \end{align}\]
where \(\Pi(\mathbf{a}, \mathbf{b})=\left\{\mathbf{P}\mathbf{\geq0},\mathbf{P}\mathbf{1}=\mathbf{a},\mathbf{P}^{\top}\mathbf{1}=\mathbf{b}\right\}\) is the set of couplings with marginals \((\mathbf{a}, \mathbf{b})\), \(\mathrm{H}(\mathbf{P}) = - \langle \mathbf{P}, \log \mathbf{P} - \mathbf{1} \mathbf{1}^\top \rangle\) is the entropy of \(\mathbf{P}\), and \(\varepsilon > 0\) controls the regularization strength.
Dual problem. The above entropic OT problem \eqref{eq:eot} can be solved through the following dual
\[\begin{align}\label{eq:dual_eot} \max_{\mathbf{f},\mathbf{g}} \: \: \langle \mathbf{f}, \mathbf{a} \rangle + \langle \mathbf{g}, \mathbf{b} \rangle - \varepsilon \left\langle \exp((\mathbf{f} \oplus \mathbf{g} - \mathbf{C}) / \varepsilon), \mathbf{1} \mathbf{1}^\top \right\rangle \:. \end{align}\]The solution \(\mathbf{P}^\star\) of the primal problem \eqref{eq:eot} can be expressed in terms of the optimal dual variables \((\mathbf{f}^\star, \mathbf{g}^\star)\) solving \eqref{eq:dual_eot} as \(\mathbf{P}^{\star} = \exp((\mathbf{f}^\star \oplus \mathbf{g}^\star - \mathbf{C}) / \varepsilon)\).
The Lagrangian of the above problem is as follows, with dual variables \(\mathbf{f}\) and \(\mathbf{g}\) \(\begin{align}\label{eq:lagrangian_eot} \langle \mathbf{C}, \mathbf{P} \rangle - \varepsilon \mathrm{H}(\mathbf{P}) - \langle \mathbf{f}, \mathbf{P} \mathbf{1} - \mathbf{a} \rangle - \langle \mathbf{g}, \mathbf{P}^\top \mathbf{1} - \mathbf{b} \rangle \:. \end{align}\) Strong duality holds for \eqref{eq:eot} and the first-order KKT condition gives \(\begin{align} \mathbf{C} + \varepsilon \log(\mathbf{P}^\star) - \mathbf{f}^\star\mathbf{1}^\top - \mathbf{1}(\mathbf{g}^\star)^{\top} = \mathbf{0} \end{align}\) for optimal primal \(\mathbf{P}^\star\) and dual \((\mathbf{f}^\star, \mathbf{g}^\star)\) variables.
It gives the primal/dual relation \(\mathbf{P}^\star = \exp((\mathbf{f}^\star \oplus \mathbf{g}^\star - \mathbf{C}) / \varepsilon)\).
Plugging it back into the Lagrangian we recover the dual objective of equation \eqref{eq:dual_eot}.
Problem \eqref{eq:dual_eot} can be solved using block coordinate ascent, alternately optimizing with respect to \(\mathbf{f}\) and \(\mathbf{g}\) with the following updates: \(\begin{align} f_i &\leftarrow \varepsilon \log a_i - \varepsilon \log \sum_j e^{(g_j-C_{ij}) / \varepsilon} \label{eq:sinkhorn-f} \\ g_j &\leftarrow \varepsilon \log b_j - \varepsilon \log \sum_i e^{(f_i-C_{ij}) / \varepsilon} \label{eq:sinkhorn-g} \:. \end{align}\) The above updates are known as Sinkhorn iterations (in log domain) due to the seminal work of Sinkhorn and Knopp who proved their convergence
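As a sanity check, here is a minimal NumPy/SciPy sketch of these log-domain iterations (the function name `sinkhorn_log` and the fixed iteration count are my own choices, not from any reference implementation):

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(C, a, b, eps, n_iters=2000):
    """Log-domain Sinkhorn iterations for entropic OT.

    Alternately applies the block coordinate ascent updates on the dual
    potentials f and g, then returns the primal plan
    P = exp((f ⊕ g - C) / eps) together with (f, g)."""
    f = np.zeros(len(a))
    g = np.zeros(len(b))
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(n_iters):
        # f_i <- eps * log a_i - eps * log sum_j exp((g_j - C_ij) / eps)
        f = eps * log_a - eps * logsumexp((g[None, :] - C) / eps, axis=1)
        # g_j <- eps * log b_j - eps * log sum_i exp((f_i - C_ij) / eps)
        g = eps * log_b - eps * logsumexp((f[:, None] - C) / eps, axis=0)
    P = np.exp((f[:, None] + g[None, :] - C) / eps)
    return P, f, g
```

At convergence, the returned plan should satisfy both marginal constraints \(\mathbf{P}\mathbf{1}=\mathbf{a}\) and \(\mathbf{P}^\top\mathbf{1}=\mathbf{b}\), which is easy to check on a small random instance.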
In inverse OT, the problem is reversed: given an observed coupling \(\widehat{\mathbf{P}}\), we seek a cost matrix \(\mathbf{C}\) whose OT solution matches it, as studied in
When using entropic OT, the inverse OT problem is usually formulated with a KL divergence \(\mathrm{KL}(\mathbf{P} \| \mathbf{Q}) = \langle \mathbf{P}, \log (\mathbf{P} \oslash \mathbf{Q}) \rangle - \langle \mathbf{P} - \mathbf{Q}, \mathbf{1} \mathbf{1}^\top \rangle\). The problem we consider is as follows
\[\DeclareMathOperator*{\argmin}{arg\,min} \begin{align} \min_{\mathbf{C}} \quad &\mathrm{KL}(\widehat{\mathbf{P}} \| \mathbf{P}^{\mathbf{C}}) \label{eq:outer_invot}\\[1em] \text{s.t.} \quad &\mathbf{P}^{\mathbf{C}} = \argmin_{\mathbf{P} \in \Pi(\mathbf{a}, \mathbf{b})} \: \: \langle \mathbf{C}, \mathbf{P} \rangle - \varepsilon \mathrm{H}(\mathbf{P}) \label{eq:inner_invot} \:. \end{align}\]Issue: the above is a nested bilevel problem, and solving the outer problem \eqref{eq:outer_invot} requires unrolling the Sinkhorn iterations of the inner problem \eqref{eq:inner_invot}. Another approach would be to rely on the implicit function theorem, but it requires a costly matrix inversion.
Fortunately, a computationally simpler formulation can be derived from the above, as shown in Theorem 1 of
We detail this derivation in what follows.
A first step is to observe that the outer objective \eqref{eq:outer_invot} of inverse OT can be expressed in terms of the optimal dual variables \((\mathbf{f}^\star,\mathbf{g}^\star)\) of the entropic OT inner problem \eqref{eq:inner_invot}. Indeed, it holds
\[\begin{align}\label{eq:first_step} \varepsilon \left( \mathrm{KL}(\widehat{\mathbf{P}} \| \mathbf{P}^{\mathbf{C}}) + \operatorname{H}(\widehat{\mathbf{P}}) \right) &= \left\langle \widehat{\mathbf{P}}, \mathbf{C} \right\rangle - \langle \mathbf{f}^\star, \mathbf{a} \rangle - \langle \mathbf{g}^\star, \mathbf{b} \rangle \:. \end{align}\]The KL can be decomposed as \(\begin{align} \operatorname{KL}(\widehat{\mathbf{P}} \| \mathbf{P}^{\mathbf{C}}) = - \langle \widehat{\mathbf{P}}, \log \mathbf{P}^{\mathbf{C}} \rangle - \operatorname{H}(\widehat{\mathbf{P}}) \:. \end{align}\)
For optimal dual variables \((\mathbf{f}^\star, \mathbf{g}^\star)\), the solution of the primal of entropic OT is given by \(\begin{align} \mathbf{P}^{\mathbf{C}} = \exp((\mathbf{f}^\star \oplus \mathbf{g}^\star - \mathbf{C}) / \varepsilon) \:. \end{align}\)
Therefore we have \(\begin{align} \varepsilon \left( \mathrm{KL}(\widehat{\mathbf{P}} \| \mathbf{P}^{\mathbf{C}}) + \operatorname{H}(\widehat{\mathbf{P}}) \right) &= - \left\langle \widehat{\mathbf{P}}, \mathbf{f}^\star \oplus \mathbf{g}^\star - \mathbf{C} \right\rangle \\ &= \left\langle \widehat{\mathbf{P}}, \mathbf{C} \right\rangle - \left\langle \widehat{\mathbf{P}}, \mathbf{f}^\star \oplus \mathbf{g}^\star \right\rangle \:. \end{align}\)
Focusing on the last term, using that \(\widehat{\mathbf{P}} \in \Pi(\mathbf{a}, \mathbf{b})\) it holds \(\begin{align} \left\langle \widehat{\mathbf{P}}, \mathbf{f}^\star \oplus \mathbf{g}^\star \right\rangle &= \sum_i f^\star_i \sum_j \widehat{P}_{ij} + \sum_j g^\star_j \sum_i \widehat{P}_{ij} \\ &= \sum_i f^\star_i a_i + \sum_j g^\star_j b_j \\ &= \langle \mathbf{f}^\star, \mathbf{a} \rangle + \langle \mathbf{g}^\star, \mathbf{b} \rangle \:. \end{align}\) Therefore \(\begin{align} \varepsilon \left( \mathrm{KL}(\widehat{\mathbf{P}} \| \mathbf{P}^{\mathbf{C}}) + \operatorname{H}(\widehat{\mathbf{P}}) \right) &= \left\langle \widehat{\mathbf{P}}, \mathbf{C} \right\rangle - \langle \mathbf{f}^\star, \mathbf{a} \rangle - \langle \mathbf{g}^\star, \mathbf{b} \rangle \:. \end{align}\)
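This marginal identity is easy to verify numerically. Below is a small NumPy sketch; the product coupling \(\mathbf{a}\mathbf{b}^\top\) is just one convenient element of \(\Pi(\mathbf{a}, \mathbf{b})\), and the potentials are arbitrary since the identity only uses the marginals:

```python
import numpy as np

# Check that <P_hat, f ⊕ g> = <f, a> + <g, b> for a coupling P_hat in Pi(a, b).
rng = np.random.default_rng(0)
n, m = 4, 5
a = np.ones(n) / n
b = np.ones(m) / m
P_hat = np.outer(a, b)       # the product coupling has marginals (a, b)
f = rng.standard_normal(n)   # arbitrary potentials
g = rng.standard_normal(m)
lhs = np.sum(P_hat * (f[:, None] + g[None, :]))  # <P_hat, f ⊕ g>
rhs = f @ a + g @ b                              # <f, a> + <g, b>
assert np.isclose(lhs, rhs)
```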
In equation \eqref{eq:first_step}, \(\mathbf{f}^\star\) and \(\mathbf{g}^\star\) implicitly depend on \(\mathbf{C}\) through problem \eqref{eq:dual_eot}. Thus we are still stuck with the bilevel structure and haven't made any real progress yet.
Recall that we would like to derive a joint single-level objective for both the outer variable \(\mathbf{C}\) and the inner variables \((\mathbf{f}, \mathbf{g})\). To do so, one can notice that equation \eqref{eq:first_step} has terms in common with the dual problem of entropic OT \eqref{eq:dual_eot}. Indeed, in both \eqref{eq:dual_eot} and \eqref{eq:first_step} we find \(\begin{align} \langle\mathbf{f},\mathbf{a}\rangle+\langle\mathbf{g},\mathbf{b}\rangle \:. \end{align}\)
The trick is to add the missing term of the entropic OT dual \eqref{eq:dual_eot} to \eqref{eq:first_step}. Doing so, we define the following joint objective \(\begin{align} \cal{G}(\widehat{\mathbf{P}}, \mathbf{C}, \mathbf{f}, \mathbf{g}) = &\left\langle \widehat{\mathbf{P}}, \mathbf{C} \right\rangle - \langle \mathbf{f}, \mathbf{a} \rangle - \langle \mathbf{g}, \mathbf{b} \rangle \\ + &\varepsilon \left\langle \exp(\left(\mathbf{f} \oplus \mathbf{g} - \mathbf{C}\right) / \varepsilon), \mathbf{1} \mathbf{1}^\top \right\rangle \:. \end{align}\)
For any \(\mathbf{C}\), minimizing \(\cal{G}\) with respect to \((\mathbf{f}, \mathbf{g})\) exactly amounts to solving dual entropic OT \eqref{eq:dual_eot}, because \(\left\langle \widehat{\mathbf{P}}, \mathbf{C} \right\rangle\) does not depend on \((\mathbf{f}, \mathbf{g})\). Hence we have: \(\begin{align} \min_{\mathbf{f},\mathbf{g}} \: \cal{G}(\widehat{\mathbf{P}}, \mathbf{C}, \mathbf{f}, \mathbf{g}) = \left\langle \widehat{\mathbf{P}}, \mathbf{C} \right\rangle - \langle \mathbf{f}^\star, \mathbf{a} \rangle - \langle \mathbf{g}^\star, \mathbf{b} \rangle + \varepsilon \left\langle \mathbf{P}^{\mathbf{C}}, \mathbf{1} \mathbf{1}^\top \right\rangle \end{align}\)
where \(\mathbf{P}^{\mathbf{C}} = \exp((\mathbf{f}^\star \oplus \mathbf{g}^\star - \mathbf{C}) / \varepsilon)\) as we have seen in the first part.
Importantly, because \(\mathbf{P}^{\mathbf{C}} \in \Pi(\mathbf{a}, \mathbf{b})\), the term we added no longer depends on \(\mathbf{C}\) when evaluated at \((\mathbf{f}^\star,\mathbf{g}^\star)\). Indeed \(\begin{align} \left\langle \mathbf{P}^{\mathbf{C}}, \mathbf{1} \mathbf{1}^\top \right\rangle = \sum_{ij} P^{\mathbf{C}}_{ij} = \sum_i a_i = 1 \:. \end{align}\)
Thus, when evaluated at \((\mathbf{f}^\star,\mathbf{g}^\star)\), thanks to equation \eqref{eq:first_step} the objective writes \(\begin{align} \min_{\mathbf{f},\mathbf{g}} \: \cal{G}(\widehat{\mathbf{P}}, \mathbf{C}, \mathbf{f}, \mathbf{g}) &= \left\langle \widehat{\mathbf{P}}, \mathbf{C} \right\rangle - \langle \mathbf{f}^\star, \mathbf{a} \rangle - \langle \mathbf{g}^\star, \mathbf{b} \rangle + \varepsilon \\ &= \varepsilon \left( \mathrm{KL}(\widehat{\mathbf{P}} \| \mathbf{P}^{\mathbf{C}}) + \operatorname{H}(\widehat{\mathbf{P}}) + 1 \right) \label{eq:final_derivation} \:. \end{align}\)
Minimizing the above with respect to \(\mathbf{C}\) then amounts to minimizing \(\mathrm{KL}(\widehat{\mathbf{P}}\|\mathbf{P}^{\mathbf{C}})\) since it is the only term that depends on \(\mathbf{C}\) in equation \eqref{eq:final_derivation}.
Therefore solving inverse OT is equivalent to the following jointly convex problem \(\begin{align}\label{eq:new_form_invot} \min_{\mathbf{C}, \mathbf{f}, \mathbf{g}} \: \: \cal{G}(\widehat{\mathbf{P}}, \mathbf{C}, \mathbf{f}, \mathbf{g}) \:. \end{align}\)
Concretely, this means that \((\mathbf{C}^\star, \mathbf{f}^\star, \mathbf{g}^\star)\) solves \eqref{eq:new_form_invot} if and only if \(\mathbf{C}^\star\) solves inverse OT \eqref{eq:outer_invot} where \(\mathbf{P}^{\mathbf{C}} = \exp((\mathbf{f}^\star \oplus \mathbf{g}^\star - \mathbf{C}) / \varepsilon)\) solves the inner problem \eqref{eq:inner_invot}.
Let’s take a moment to decipher this new expression closely.
Since strong duality holds for entropic OT, the primal optimal objective equals the dual optimal objective, i.e. \eqref{eq:eot} = \eqref{eq:dual_eot}.
Therefore we have \(\begin{align}\label{eq:min_formulation_invot} \min_{\mathbf{f},\mathbf{g}} \: \cal{G}(\widehat{\mathbf{P}}, \mathbf{C}, \mathbf{f}, \mathbf{g}) &= \left\langle \widehat{\mathbf{P}}, \mathbf{C} \right\rangle - \left(\min_{\mathbf{P} \in \Pi(\mathbf{a}, \mathbf{b})} \: \: \langle \mathbf{C}, \mathbf{P} \rangle - \varepsilon \mathrm{H}(\mathbf{P}) \right) \:. \end{align}\)
Hence \(\cal{G}\) quantifies the excess transport cost incurred when using \(\widehat{\mathbf{P}}\) instead of the solution of the inner problem \(\mathbf{P}^{\mathbf{C}}\). This quantity is known as the Monge gap
As discussed earlier, optimizing an argmin like \eqref{eq:inner_invot} with respect to \(\mathbf{C}\) requires computationally demanding tools such as unrolling or the implicit function theorem. On the contrary, optimizing the min as in \eqref{eq:min_formulation_invot} is much simpler: it can be done using Danskin's theorem (also known as the envelope theorem).
In our case, this result simply states that, for each update of \(\mathbf{C}\), we can optimize \(\cal{G}\) in \(\mathbf{C}\) by considering \(\mathbf{f}\) and \(\mathbf{g}\) as constants. Without further constraint on \(\mathbf{C}\), the update reads \(\begin{align}\label{eq:update_C} \mathbf{C} &\leftarrow \mathbf{f} \oplus \mathbf{g} - \varepsilon \log \widehat{\mathbf{P}} \:. \end{align}\)
Overall, to efficiently solve inverse OT one can use block coordinate descent alternating between updating \(\mathbf{f}\) and \(\mathbf{g}\) with Sinkhorn iterations \eqref{eq:sinkhorn-f}-\eqref{eq:sinkhorn-g} and updating \(\mathbf{C}\) with \eqref{eq:update_C}.
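Putting the pieces together, a minimal NumPy sketch of this block coordinate descent could look as follows (the function name `inverse_ot_bcd` and the iteration count are my own choices; this is an illustration under the blog's assumptions, not a reference implementation):

```python
import numpy as np
from scipy.special import logsumexp

def inverse_ot_bcd(P_hat, a, b, eps, n_iters=100):
    """Solve inverse OT by block coordinate descent on G:
    Sinkhorn updates on (f, g) followed by the closed-form cost update
    C <- f ⊕ g - eps * log(P_hat)."""
    n, m = P_hat.shape
    f, g = np.zeros(n), np.zeros(m)
    C = np.zeros((n, m))
    for _ in range(n_iters):
        # Sinkhorn updates on the dual potentials for the current cost C
        f = eps * np.log(a) - eps * logsumexp((g[None, :] - C) / eps, axis=1)
        g = eps * np.log(b) - eps * logsumexp((f[:, None] - C) / eps, axis=0)
        # closed-form update of the cost, from Danskin's theorem
        C = f[:, None] + g[None, :] - eps * np.log(P_hat)
    return C, f, g
```

Note that after each cost update the plan \(\exp((\mathbf{f} \oplus \mathbf{g} - \mathbf{C})/\varepsilon)\) equals \(\widehat{\mathbf{P}}\) exactly by construction, so when \(\widehat{\mathbf{P}}\) already lies in \(\Pi(\mathbf{a}, \mathbf{b})\) the scheme stabilizes very quickly.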
In this last part, we are going to see how inverse OT and the presented trick can be used to learn data representations, as shown in
To do so, we are going to look for a cost of the form \(d(\mathbf{z}_i, \mathbf{z}_j)\) which solves inverse OT with an input \(\widehat{\mathbf{P}}\) computed from \((\mathbf{x}_1, .., \mathbf{x}_n)\). To compute \(\widehat{\mathbf{P}}\), one can simply solve the symmetric variant of entropic OT, which is exactly problem \eqref{eq:eot} with symmetric \(\mathbf{C}\) \(=(d(\mathbf{x}_i, \mathbf{x}_j))_{ij}\) and \(\mathbf{a}=\mathbf{b}\). We pick \(\mathbf{a}=\mathbf{b}=\mathbf{1}\) to give the same mass to every data point.
In symmetric entropic OT, we only have one dual variable \(\mathbf{f}\), as the primal solution is given by \(\widehat{\mathbf{P}} = \exp((\mathbf{f}^\star \oplus \mathbf{f}^\star - \mathbf{C}) / \varepsilon)\). Moreover, \(\mathbf{f}^\star\) can be computed by simply iterating the averaged update \(\begin{align} f_i \leftarrow \frac{1}{2} \left( f_i + \varepsilon \log a_i - \varepsilon \log \sum_j e^{(f_j - C_{ij}) / \varepsilon} \right) \:, \end{align}\) i.e. the Sinkhorn update \eqref{eq:sinkhorn-f} averaged with the current iterate, which ensures convergence in the symmetric case.
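A minimal NumPy sketch of this symmetric scheme, with uniform marginals \(\mathbf{a}=\mathbf{b}=\mathbf{1}\) and the averaging step \(\mathbf{f} \leftarrow (\mathbf{f} + T(\mathbf{f}))/2\) where \(T\) is the usual Sinkhorn update (function name and iteration count are mine):

```python
import numpy as np
from scipy.special import logsumexp

def symmetric_sinkhorn(C, eps, n_iters=1000):
    """Symmetric entropic OT with marginals a = b = 1 (unit mass per point).
    Iterates f <- (f + T(f)) / 2, where T is the Sinkhorn update, then
    returns the symmetric plan P = exp((f ⊕ f - C) / eps)."""
    n = C.shape[0]
    f = np.zeros(n)
    for _ in range(n_iters):
        # T(f)_i = -eps * log sum_j exp((f_j - C_ij) / eps)  (log a_i = 0 here)
        Tf = -eps * logsumexp((f[None, :] - C) / eps, axis=1)
        f = 0.5 * (f + Tf)  # plain alternation can oscillate; averaging converges
    P = np.exp((f[:, None] + f[None, :] - C) / eps)
    return P, f
```

Since \(\mathbf{C}\) is symmetric and a single potential is used, the returned plan is symmetric by construction, and at convergence each row sums to one.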
In symmetric entropic OT, each point spreads its mass to its closest neighbors, thus capturing the geometry of the data. In this context, the regularization strength \(\varepsilon\) controls the scale of the dependencies that are captured.
Once we have computed \(\widehat{\mathbf{P}}\), the goal is to solve the inverse problem of finding the embeddings \((\mathbf{z}_1, .., \mathbf{z}_n)\) that would generate a similar entropic OT plan in low dimension. In other words, we want the geometry in the low-dimensional space to be similar to the one in input space. This method has strong connections with the t-SNE algorithm, as developed in
To do so, we rely on the presented trick for inverse OT and therefore focus on solving \(\begin{align} \min_{(\mathbf{z}_1, .., \mathbf{z}_n), \mathbf{f}, \mathbf{g}} \: \: \cal{G}(\widehat{\mathbf{P}}, \mathbf{C}_{\mathbf{Z}}, \mathbf{f}, \mathbf{g}) \:. \end{align}\)
where \(\mathbf{C}_{\mathbf{Z}}\) is the symmetric cost matrix with entries \(d(\mathbf{z}_i, \mathbf{z}_j)\).
We consider the common task of embedding the Swiss roll (depicted below) from 3D to 2D.
In the experiments, we take the squared Euclidean distance for \(d\), \(\varepsilon=10\) for the entropic regularizer and independent \(\cal{N}(0,1)\) variables to initialize the embedding coordinates. The code is provided in the box below.
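A minimal NumPy sketch of this experiment is given below. It is an illustration under my own choices (function name, plain gradient descent with a fixed learning rate, \(\mathbf{a}=\mathbf{b}=\mathbf{1}\)), not the post's exact code: the gradient with respect to \(\mathbf{Z}\) uses \(\partial \cal{G} / \partial \mathbf{C} = \widehat{\mathbf{P}} - \exp((\mathbf{f} \oplus \mathbf{g} - \mathbf{C})/\varepsilon)\) together with the chain rule through \(C_{ij} = \lVert \mathbf{z}_i - \mathbf{z}_j \rVert^2\).

```python
import numpy as np
from scipy.special import logsumexp

def embed_inverse_ot(P_hat, dim=2, eps=10.0, lr=0.05, n_iters=200, seed=0):
    """Learn embeddings Z by minimizing G(P_hat, C_Z, f, g), alternating
    Sinkhorn updates on (f, g) with a gradient step on Z."""
    rng = np.random.default_rng(seed)
    n = P_hat.shape[0]
    Z = rng.standard_normal((n, dim))  # N(0, 1) initialization
    f = np.zeros(n)
    g = np.zeros(n)
    for _ in range(n_iters):
        # squared Euclidean cost in the embedding space
        C = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        # Sinkhorn updates on the duals (a = b = 1, so eps * log a = 0)
        f = -eps * logsumexp((g[None, :] - C) / eps, axis=1)
        g = -eps * logsumexp((f[:, None] - C) / eps, axis=0)
        P = np.exp((f[:, None] + g[None, :] - C) / eps)
        # Danskin: dG/dC = P_hat - P with (f, g) treated as constants
        M = P_hat - P
        grad_Z = 2 * ((M.sum(1) + M.sum(0))[:, None] * Z - (M + M.T) @ Z)
        Z -= lr * grad_Z
    return Z
```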
First, as shown in the figure below, we can verify that we obtain exactly the same embeddings \((\mathbf{z}_1, .., \mathbf{z}_n)\) using either unrolling or the Monge gap trick presented in this blog.
Regarding run-time, the Monge gap approach is faster than unrolling, as the following plot shows. Hence the trick presented in this blog is of great practical interest, especially for large-scale applications.
Inverse OT is also useful for contrastive learning as shown in
Feel free to contact me with any questions or remarks about this blog!
If you found this useful, you can cite this blog post using:
@article{inverse_ot_unrolling,
title = {Inverse optimal transport does not require unrolling},
author = {Hugues Van Assel},
year = {2024},
month = {April},
url = {https://huguesva.github.io/blog/2024/inverseOT_mongegap/}
}