Given a closed, proper, convex function $f:\mathbf{R}^n\to\mathbf{R}\cup\{+\infty\}$, its proximity operator $\operatorname{prox}_f:\mathbf{R}^n\to\mathbf{R}^n$ is defined as \[ \operatorname{prox}_f(y)=\operatorname{argmin}_{x\in\mathbf{R}^n}\{f(x)+\tfrac{1}{2}\|x-y\|_2^2\}\mbox{.} \] It is well known that the scaled Euclidean norm $\lambda\|\cdot\|_2$ with $\lambda>0$ has a closed-form proximity operator, \[ \operatorname{prox}_{\lambda\|\cdot\|_2}(y)=\left(1-\frac{\lambda}{\max\{\|y\|_2,\lambda\}}\right)y\mbox{.} \] This can be derived using the Moreau identity or via optimality conditions; I’ll spare the details here, since there will be plenty in the sequel.
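For concreteness, here is a minimal NumPy sketch of this operator (the function name `prox_scaled_l2` is my own):

```python
import numpy as np

def prox_scaled_l2(y, lam):
    """Proximity operator of lam * ||.||_2 (block soft-thresholding)."""
    # max(||y||_2, lam) handles small y gracefully: whenever
    # ||y||_2 <= lam, the scaling factor is zero and the output is 0.
    return (1.0 - lam / max(np.linalg.norm(y), lam)) * y
```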
What is not so well-known (although it is certainly known in some circles) is that there is actually something of a closed-form expression for the proximity operator of $x\mapsto\lambda\|Mx\|_2$, where $M\in\mathbf{R}^{m\times n}$ is any real matrix. This problem was brought to me by one of my colleagues earlier this year, and I was unable to find a source that provided a closed-form expression. Stack Exchange user River Li posted a formula similar to the one below for diagonal $M$ in 2020, but did not extend the argument to handle non-diagonal $M$. Such a formula does exist, though, provided that you take a suitably liberal interpretation of “closed-form”.
Let $M\in\mathbf{R}^{m\times n}$ and $\lambda>0$. Denote the positive singular values of $M$ by $\sigma_1\geq\cdots\geq\sigma_r>0$ and the corresponding right-singular vectors by $v_1,\ldots,v_r$. Then \begin{equation} \operatorname{prox}_{\lambda\|M\cdot\|_2}(y)=\begin{cases} \Pi_{\operatorname{ker}M}(y) & \mbox{if $\|M^{+\top}y\|_2\leq\lambda$}\\ (I+\tfrac{\lambda^2}{\eta}M^\top M)^{-1} y & \mbox{otherwise,} \end{cases} \tag{1}\label{eq:prox-weighted-euclidean} \end{equation} where $\Pi_{\operatorname{ker}M}$ is the projection onto the kernel (also known as the null space) of $M$, $M^+$ denotes the Moore–Penrose pseudoinverse of $M$, and $\eta>0$ is the unique positive solution to the equation \begin{equation} \sum_{i=1}^r\left(\frac{\lambda^2\sigma_i^2}{(\eta+\lambda^2\sigma_i^2)^2}\right)(v_i^\top y)^2=1\mbox{.} \tag{2}\label{eq:eta-choice} \end{equation}
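As a quick worked example (not part of the statement above, but easy to verify): when $r=1$, equation \eqref{eq:eta-choice} has a single term and can be solved by hand, \[ \frac{\lambda^2\sigma_1^2}{(\eta+\lambda^2\sigma_1^2)^2}(v_1^\top y)^2=1 \quad\Longrightarrow\quad \eta=\lambda\sigma_1|v_1^\top y|-\lambda^2\sigma_1^2\mbox{,} \] taking the positive square root since $\eta+\lambda^2\sigma_1^2>0$. This $\eta$ is positive exactly when $\|M^{+\top}y\|_2=\sigma_1^{-1}|v_1^\top y|>\lambda$, consistent with the case split in \eqref{eq:prox-weighted-euclidean}.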
Before demonstrating this, I’d like to point out a few things. First, the norm threshold can be written $\|\Sigma_{11}^{-1}V_1^\top y\|_2\leq\lambda$, where $\Sigma_{11}=\operatorname{diag}(\sigma_1,\ldots,\sigma_r)$ and $V_1=\begin{bmatrix} v_1 & \cdots & v_r\end{bmatrix}\in\mathbf{R}^{n\times r}$, since $M^{+\top}=U_1\Sigma_{11}^{-1}V_1^\top$ (with $U_1$ the matrix of left-singular vectors corresponding to $\sigma_1,\ldots,\sigma_r$) and multiplication by a matrix with orthonormal columns preserves the $2$-norm. Second, after clearing denominators, equation \eqref{eq:eta-choice} in general yields a polynomial equation in $\eta$ of degree $2r$, which means it (probably) does not have a closed form in general for $r>2$. However, if one happens to be in the case $\|M^{+\top}y\|_2>\lambda$, then \eqref{eq:eta-choice} can be solved very easily numerically. Indeed, the left-hand side is strictly decreasing in $\eta$ on $[0,\infty)$; at $\eta=0$, the left-hand side minus one equals $\left\|\tfrac{1}{\lambda}\Sigma_{11}^{-1}V_1^\top y\right\|_2^2-1=\left\|\tfrac{1}{\lambda}M^{+\top}y\right\|_2^2-1>0$, while at $\eta=\tfrac{r}{2}\|V_1^\top y\|_\infty^2$, one has \begin{align*} \sum_{i=1}^r\left(\frac{\lambda^2\sigma_i^2}{(\eta+\lambda^2\sigma_i^2)^2}\right)(v_i^\top y)^2-1 &=\sum_{i=1}^r\frac{(v_i^\top y)^2\lambda^2\sigma_i^2}{\eta^2+2\eta\lambda^2\sigma_i^2+\lambda^4\sigma_i^4}-1\\ &\leq\sum_{i=1}^r\frac{(v_i^\top y)^2\lambda^2\sigma_i^2}{r(v_i^\top y)^2\lambda^2\sigma_i^2}-1\\ &\leq 0\mbox{,} \end{align*} where the first inequality drops the $\eta^2$ and $\lambda^4\sigma_i^4$ terms from the denominator and uses $2\eta=r\|V_1^\top y\|_\infty^2\geq r(v_i^\top y)^2$ (terms with $v_i^\top y=0$ vanish and can be ignored). Applying bisection on the interval $[0,\tfrac{r}{2}\|V_1^\top y\|_\infty^2]$ should quickly yield a sufficiently accurate $\eta$.
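To make that recipe concrete, here is a NumPy sketch of \eqref{eq:prox-weighted-euclidean} using bisection on the bracket above (the function name, rank cutoff, and tolerance are my own choices):

```python
import numpy as np

def prox_weighted_l2(y, M, lam, tol=1e-12):
    """Prox of x -> lam * ||M x||_2, following (1)-(2) above."""
    U, s, Vt = np.linalg.svd(M)          # full SVD; rows of Vt are v_i^T
    r = int(np.sum(s > 1e-12 * s[0])) if s.size else 0  # numerical rank
    w = Vt[:r] @ y                       # w_i = v_i^T y, i = 1..r
    # Case 1: ||M^{+T} y||_2 <= lam  =>  project y onto ker M.
    if r == 0 or np.linalg.norm(w / s[:r]) <= lam:
        V2 = Vt[r:].T
        return V2 @ (V2.T @ y)
    # Case 2: bisect for eta on the bracket [0, (r/2) * ||V_1^T y||_inf^2].
    def g(eta):
        return np.sum(lam**2 * s[:r]**2 * w**2
                      / (eta + lam**2 * s[:r]**2)**2) - 1.0
    lo, hi = 0.0, 0.5 * r * np.max(np.abs(w))**2
    while hi - lo > tol * max(hi, 1.0):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)  # g is decreasing
    eta = 0.5 * (lo + hi)
    return np.linalg.solve(np.eye(M.shape[1]) + (lam**2 / eta) * (M.T @ M), y)
```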
Derivation
To prove \eqref{eq:prox-weighted-euclidean}, we will actually solve a more general problem. In particular, with $\lambda>0$, $M\in\mathbf{R}^{m\times n}$, $A\in\mathbf{R}^{k\times n}$, and $b\in\mathbf{R}^k$, we will symbolically solve the optimization problem \begin{align} \begin{array}{ll} \mbox{minimize}_{x\in\mathbf{R}^n} & \lambda\|Mx\|_2+\tfrac{1}{2}\|Ax-b\|_2^2\mbox{.} \end{array} \tag{3}\label{eq:main-problem} \end{align} To derive the optimal $x$, begin by taking the full SVD of $M$, which we’ll write $M=U\Sigma V^\top$, where $U\in\mathbf{R}^{m\times m}$ and $V\in\mathbf{R}^{n\times n}$ are orthogonal and $\Sigma\in\mathbf{R}_+^{m\times n}$ is diagonal. Since $\|\cdot\|_2$ is invariant under orthogonal transformation, problem \eqref{eq:main-problem} is equivalent to \[ \begin{array}{ll} \mbox{minimize}_{x\in\mathbf{R}^n} & \lambda\|\Sigma V^\top x\|_2+\tfrac{1}{2}\|Ax-b\|_2^2\mbox{.} \end{array} \] Let $\sigma_1\geq\cdots\geq\sigma_r>0$ denote the positive singular values of $M$, and assume without loss of generality that the main diagonal of $\Sigma$ is arranged in decreasing order. Define $\Sigma_{11}\in\mathbf{R}^{r\times r}$ to be the diagonal matrix $\Sigma_{11}=\operatorname{diag}(\sigma_1,\ldots,\sigma_r)$ and partition $V=\begin{bmatrix} V_1& V_2\end{bmatrix}$ so that $V_1\in\mathbf{R}^{n\times r}$ and $V_2\in\mathbf{R}^{n\times(n-r)}$. We now make the change of variables \[ \tilde{x}=\begin{bmatrix}\tilde{x}_1 \\ \tilde{x}_2\end{bmatrix}=\begin{bmatrix} \Sigma_{11}V_1^\top x \\ V_2^\top x\end{bmatrix}\mbox{,} \] which is invertible since $V$ is orthogonal and $\Sigma_{11}$ is nonsingular. This change of variables yields the problem \[ \begin{array}{ll} \mbox{minimize}_{\tilde{x}\in\mathbf{R}^n} & \lambda\|\tilde{x}_1\|_2+\tfrac{1}{2}\|AV_1\Sigma_{11}^{-1}\tilde{x}_1+AV_2\tilde{x}_2-b\|_2^2\mbox{.} \end{array} \tag{4}\label{eq:real-problem} \] Optimizing first over $\tilde{x}_2$, which is a linear least-squares problem, we may take \[ \tilde{x}_2=(AV_2)^+(b-AV_1\Sigma_{11}^{-1}\tilde{x}_1)\mbox{,} \] where $(AV_2)^+$ is the Moore–Penrose pseudoinverse of $AV_2$. Plugging this into \eqref{eq:real-problem}, we arrive at the problem \[ \begin{array}{ll} \mbox{minimize}_{\tilde{x}_1\in\mathbf{R}^r} & \lambda\|\tilde{x}_1\|_2+\tfrac{1}{2}\left\|\left(I-AV_2(AV_2)^+\right)AV_1\Sigma_{11}^{-1}\tilde{x}_1-\left(I-AV_2(AV_2)^+\right)b\right\|_2^2\mbox{.} \end{array} \] This is simply a least-squares problem with (unsquared) $\ell_2$-norm regularization. Puig, Wiesel, and Hero characterized the solution to this problem in 2009. In our case, the optimal $\tilde{x}_1$ is \[ \tilde{x}_1=\begin{cases} 0 & \mbox{if $\left\|\Sigma_{11}^{-1}V_1^\top A^\top\left(I-AV_2(AV_2)^+\right)b\right\|_2\leq\lambda$} \\ \left[\Sigma_{11}^{-1}V_1^\top A^\top\left(I-AV_2(AV_2)^+\right)AV_1\Sigma_{11}^{-1}+\tfrac{\lambda^2}{\eta} I\right]^{-1}\Sigma_{11}^{-1}V_1^\top A^\top\left(I-AV_2(AV_2)^+\right)b & \mbox{otherwise,} \end{cases} \] where $\eta>0$ is the unique positive solution to the equation \[ \left\|\lambda\left[\eta\Sigma_{11}^{-1}V_1^\top A^\top\left(I-AV_2(AV_2)^+\right)AV_1\Sigma_{11}^{-1}+\lambda^2 I\right]^{-1}\Sigma_{11}^{-1}V_1^\top A^\top\left(I-AV_2(AV_2)^+\right)b\right\|_2^2=1\mbox{.} \] Note that in specializing Corollary 3.1 of Puig, Wiesel, and Hero, I have used the fact that $I-AV_2(AV_2)^+$ is symmetric and idempotent; it is the orthogonal projector onto the orthogonal complement of the range of $AV_2$.
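If you want to check the Puig–Wiesel–Hero characterization numerically before trusting the algebra, here is a small sketch for the generic problem $\min_x\,\lambda\|x\|_2+\tfrac{1}{2}\|Bx-c\|_2^2$ (the function name and bracketing strategy are my own):

```python
import numpy as np
from scipy.optimize import brentq

def multidim_shrinkage(B, c, lam):
    """Solve min_x lam*||x||_2 + 0.5*||B x - c||_2^2 via the two-case formula."""
    g = B.T @ c
    if np.linalg.norm(g) <= lam:
        return np.zeros(B.shape[1])
    H = B.T @ B
    I = np.eye(B.shape[1])
    # eta > 0 solves ||lam * (eta*H + lam^2*I)^{-1} g||_2^2 = 1.
    def resid(eta):
        return np.linalg.norm(lam * np.linalg.solve(eta * H + lam**2 * I, g))**2 - 1.0
    # resid(0) = ||g/lam||^2 - 1 > 0 in this branch, and resid decreases to -1
    # as eta -> infinity (g lies in range(H)), so grow a bracket and root-find.
    hi = 1.0
    while resid(hi) > 0:
        hi *= 2.0
    eta = brentq(resid, 0.0, hi)
    return np.linalg.solve(H + (lam**2 / eta) * I, g)
```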
Reversing the change of variables, we finally arrive at the optimal $x$ to \eqref{eq:main-problem}, \[ \small{ x=\begin{cases} V_2(AV_2)^+b & \mbox{if $\left\|\Sigma_{11}^{-1}V_1^\top A^\top\left(I-AV_2(AV_2)^+\right)b\right\|_2\leq\lambda$}\\ \left[\left(I-V_2(AV_2)^+A\right)V_1\Sigma_{11}^{-1}\left[\Sigma_{11}^{-1}V_1^\top A^\top\left(I-AV_2(AV_2)^+\right)AV_1\Sigma_{11}^{-1}+\tfrac{\lambda^2}{\eta}I\right]^{-1}\Sigma_{11}^{-1}V_1^\top A^\top\left(I-AV_2(AV_2)^+\right)+V_2(AV_2)^+\right]b & \mbox{otherwise,} \end{cases}} \] where $\eta>0$ is as defined above. If someone knows a simpler form of the above expression, I’d be happy to hear it. In any case, we can specialize this formula to obtain \eqref{eq:prox-weighted-euclidean}. There we take $A$ to be the $n\times n$ identity matrix and $b=y\in\mathbf{R}^n$. This simplifies the above formula dramatically, since $A$ “vanishes” and $(AV_2)^+$ reduces to simply $V_2^\top$. (The latter holds because the columns of $V_2$ are orthonormal; we will also repeatedly use the identity $V_1V_1^\top+V_2V_2^\top=I$, which holds because $V$ is orthogonal.) First, the expression in the norm threshold simplifies down to \begin{align*} \|\Sigma_{11}^{-1}V_1^\top A^\top\left(I-AV_2(AV_2)^+\right)b\|_2 &= \|\Sigma_{11}^{-1}V_1^\top(I-V_2V_2^\top)y\|_2 \\ &=\|\Sigma_{11}^{-1}V_1^\top V_1V_1^\top y\|_2 \\ &=\|\Sigma_{11}^{-1}V_1^\top y\|_2 \\ &=\|M^{+\top}y\|_2\mbox{,} \end{align*} which is exactly the norm threshold provided in \eqref{eq:prox-weighted-euclidean}. If the norm threshold is satisfied, then $V_2(AV_2)^+b=V_2V_2^\top y=\Pi_{\operatorname{ker}M}(y)$, since the columns of $V_2$ form an orthonormal basis of $\operatorname{ker}M$. Now we’ll handle the ugly case. It’s not the most pleasant calculation to follow, but if you read carefully, you’ll see \begin{align*} &\left[\left(I-V_2(AV_2)^+A\right)V_1\Sigma_{11}^{-1}\left[\Sigma_{11}^{-1}V_1^\top A^\top\left(I-AV_2(AV_2)^+\right)AV_1\Sigma_{11}^{-1}+\tfrac{\lambda^2}{\eta}I\right]^{-1}\Sigma_{11}^{-1}V_1^\top A^\top\left(I-AV_2(AV_2)^+\right)+V_2(AV_2)^+\right]b\\ &{\quad}=\left[\left(I-V_2V_2^\top\right)V_1\Sigma_{11}^{-1}\left[\Sigma_{11}^{-1}V_1^\top\left(I-V_2V_2^\top\right)V_1\Sigma_{11}^{-1}+\tfrac{\lambda^2}{\eta}I\right]^{-1}\Sigma_{11}^{-1}V_1^\top\left(I-V_2V_2^\top\right)+V_2V_2^\top\right]y\\ &{\quad}=\left[V_1V_1^\top V_1\Sigma_{11}^{-1}\left(\Sigma_{11}^{-1}V_1^\top V_1V_1^\top V_1\Sigma_{11}^{-1}+\tfrac{\lambda^2}{\eta}I\right)^{-1}\Sigma_{11}^{-1}V_1^\top V_1V_1^\top+V_2V_2^\top\right]y\\ &{\quad}=\left[V_1\Sigma_{11}^{-1}\left(\Sigma_{11}^{-2}+\tfrac{\lambda^2}{\eta}I\right)^{-1}\Sigma_{11}^{-1}V_1^\top+V_2V_2^\top\right]y\\ &{\quad}=V_1\left(I+\tfrac{\lambda^2}{\eta}\Sigma_{11}^2\right)^{-1}V_1^\top y+V_2V_2^\top y\\ &{\quad}=\begin{bmatrix}V_1 & V_2\end{bmatrix}\begin{bmatrix}(I+\tfrac{\lambda^2}{\eta}\Sigma_{11}^2)^{-1} & 0 \\ 0 & I\end{bmatrix}\begin{bmatrix} V_1^\top \\ V_2^\top\end{bmatrix}y\\ &{\quad}=V\left(I+\tfrac{\lambda^2}{\eta}\Sigma^\top\Sigma\right)^{-1}V^\top y \\ &{\quad}=\left(I+\tfrac{\lambda^2}{\eta}M^\top M\right)^{-1}y\mbox{,} \end{align*} which is the desired expression.
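Ahead of the promised benchmarks, here is a quick (unoptimized) spot check of \eqref{eq:prox-weighted-euclidean} against a general-purpose solver; `prox_weighted_l2` refers to the sketch given earlier, and the problem sizes are arbitrary:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m, n, lam = 3, 5, 0.7
M = rng.standard_normal((m, n))
y = rng.standard_normal(n)

x_closed = prox_weighted_l2(y, M, lam)   # closed form (1)-(2)

# The objective is smooth near the minimizer whenever the threshold is
# inactive (there M x* != 0), so a generic solver is a fair spot check.
obj = lambda x: lam * np.linalg.norm(M @ x) + 0.5 * np.linalg.norm(x - y)**2
res = minimize(obj, y, method="Nelder-Mead",
               options={"xatol": 1e-12, "fatol": 1e-12, "maxiter": 50_000})

print(np.max(np.abs(x_closed - res.x)))  # should be close to zero
```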
Finally, we can simplify down the equation that determines $\eta$, which looks like \begin{align*} &\left\|\lambda\left[\eta\Sigma_{11}^{-1}V_1^\top A^\top\left(I-AV_2(AV_2)^+\right)AV_1\Sigma_{11}^{-1}+\lambda^2 I\right]^{-1}\Sigma_{11}^{-1}V_1^\top A^\top\left(I-AV_2(AV_2)^+\right)b\right\|_2^2=1\\ &{}\iff\left\|\lambda\left[\eta\Sigma_{11}^{-1}V_1^\top V_1V_1^\top V_1\Sigma_{11}^{-1}+\lambda^2 I\right]^{-1}\Sigma_{11}^{-1}V_1^\top V_1V_1^\top y\right\|_2^2=1\\ &{}\iff\left\|\lambda\left(\eta\Sigma_{11}^{-2}+\lambda^2 I\right)^{-1}\Sigma_{11}^{-1}V_1^\top y\right\|_2^2=1\\ &{}\iff\sum_{i=1}^r\frac{\lambda^2\sigma_i^{-2}}{(\eta\sigma_i^{-2}+\lambda^2)^2}(v_i^\top y)^2=1\\ &{}\iff\sum_{i=1}^r\frac{\lambda^2\sigma_i^2}{(\eta+\lambda^2\sigma_i^2)^2}(v_i^\top y)^2=1\mbox{,} \end{align*} which is, indeed, the same as \eqref{eq:eta-choice}.
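As a cross-check against the diagonal case mentioned at the top: if $M=\operatorname{diag}(m_1,\ldots,m_n)$, then (up to signs and ordering) $\sigma_i=|m_i|$ and $v_i=e_i$, so \eqref{eq:eta-choice} becomes \[ \sum_{i\,:\,m_i\neq 0}\frac{\lambda^2 m_i^2}{(\eta+\lambda^2 m_i^2)^2}\,y_i^2=1\mbox{,} \] and the second case of \eqref{eq:prox-weighted-euclidean} acts componentwise as $x_i=\frac{\eta}{\eta+\lambda^2 m_i^2}\,y_i$. This is presumably the shape of the diagonal-case formula alluded to above.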
I plan at some point to revisit this blog post to expand on some interesting special cases and to provide several code implementations of the proximity operator. In particular, I intend to benchmark \eqref{eq:prox-weighted-euclidean} against iterative optimization methods. Until then, feel free to play with this yourself! I’d be happy to hear about anything useful you find.