Guest post with Dóra Jámbor
This is a half-guest-post written jointly with Dóra, a fellow participant in a reading group where we recently discussed the original paper on $\beta$-VAEs.
On the face of it, $\beta$-VAEs are a straightforward extension of VAEs where we are allowed to directly control the tradeoff between the reconstruction and KL loss terms. In an attempt to better understand where the $\beta$-VAE objective comes from, and to further motivate why it makes sense, here we derive $\beta$-VAEs from different first principles than how they are presented in the paper. Over to mostly Dóra for the rest of this post:
First, some notation:
- $p_\mathcal{D}(\mathbf{x})$: data distribution
- $q_\psi(\mathbf{z}\vert \mathbf{x})$: representation distribution
- $q_\psi(\mathbf{z}) = \int p_\mathcal{D}(\mathbf{x})q_\psi(\mathbf{z}\vert \mathbf{x})\,d\mathbf{x}$: aggregate posterior – the marginal distribution of the representation $Z$
- $q_\psi(\mathbf{x}\vert \mathbf{z}) = \frac{q_\psi(\mathbf{z}\vert \mathbf{x})p_\mathcal{D}(\mathbf{x})}{q_\psi(\mathbf{z})}$: "inverted posterior"
Motivation and assumptions of $\beta$-VAEs
Learning disentangled representations that recover the independent data generative factors has been a long-standing goal of unsupervised representation learning.
$\beta$-VAEs were introduced in 2017 as a modification of the original VAE formulation that can achieve better disentanglement in the posterior $q_\psi(\mathbf{z}\vert \mathbf{x})$. An assumption of $\beta$-VAEs is that there are two sets of latent factors, $\mathbf{v}$ and $\mathbf{w}$, that contribute to generating observations $\mathbf{x}$ in the real world. One set, $\mathbf{v}$, is coordinate-wise conditionally independent given the observed variable, i.e., $\log p(\mathbf{v}\vert \mathbf{x}) = \sum_k \log p(v_k\vert \mathbf{x})$. At the same time, we do not assume anything about the remaining factors $\mathbf{w}$.
The factors $\mathbf{v}$ are going to be the main object of interest for us. The conditional independence assumption allows us to formulate what it means to disentangle these factors of variation. Consider a representation $\mathbf{z}$ which entangles coordinates of $\mathbf{v}$, in that each coordinate of $\mathbf{z}$ depends on several coordinates of $\mathbf{v}$, e.g. $z_1 = f_1(v_1, v_2)$ and $z_2 = f_2(v_1, v_2)$. Such a $\mathbf{z}$ will not necessarily satisfy coordinate-wise conditional independence $\log p(\mathbf{z}\vert \mathbf{x}) = \sum_k \log p(z_k\vert \mathbf{x})$. However, if each component of $\mathbf{z}$ depended only on one corresponding coordinate of $\mathbf{v}$, for example $z_1 = g_1(v_1)$ and $z_2 = g_2(v_2)$, the coordinate-wise conditional independence would hold for $\mathbf{z}$ too.
Thus, under these assumptions we can encourage disentanglement to happen by encouraging the posterior $q_\psi(\mathbf{z}\vert \mathbf{x})$ to be coordinate-wise conditionally independent. This can be done by adding a new hyperparameter $\beta$ to the original VAE formulation
$$
\mathcal{L}(\theta, \phi; x, z, \beta) = -\mathbb{E}_{q_{\phi}(z\vert x)p_\mathcal{D}(x)}[\log p_{\theta}(x \vert z)] + \beta \operatorname{KL}(q_{\phi}(z\vert x)\| p(z)),
$$
where $\beta$ controls the trade-off between the capacity of the latent information channel and learning conditionally independent latent factors. When $\beta$ is higher than 1, we encourage the posterior $q_\psi(z\vert x)$ to be close to the isotropic unit Gaussian $p(z) = \mathcal{N}(0, I)$, which itself is coordinate-wise independent.
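To make the trade-off concrete, here is a minimal sketch of how this loss is typically estimated on a minibatch, assuming a diagonal Gaussian posterior and a Bernoulli decoder; the tensor names and shapes are our own choices, not from the paper:

```python
import torch

def beta_vae_loss(x, x_logits, mu, log_var, beta=4.0):
    """Monte Carlo estimate of the beta-VAE objective for one minibatch.

    x        : observed batch with values in [0, 1], shape (B, D)
    x_logits : decoder outputs, Bernoulli logits of p_theta(x | z), shape (B, D)
    mu, log_var : parameters of the Gaussian posterior q(z | x), shape (B, K)
    """
    # Reconstruction term: -E_q[log p_theta(x | z)] under a Bernoulli likelihood.
    recon = torch.nn.functional.binary_cross_entropy_with_logits(
        x_logits, x, reduction="none").sum(dim=1)

    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior.
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1.0).sum(dim=1)

    return (recon + beta * kl).mean()
```

Setting `beta=1.0` recovers the standard VAE objective; values above 1 trade reconstruction quality for a posterior that is closer to the factorized prior.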
Marginal versus Conditional Independence
In this post, we revisit the conditional independence assumption on the latent factors, and argue that a more appropriate goal would be marginal independence of the latent factors. To show you our intuition, let's revisit the "explaining away" phenomenon from probabilistic graphical models.
Explaining away
Consider three random variables:
$A$: Ferenc is grading exams
$B$: Ferenc is in a good mood
$C$: Ferenc is tweeting a meme
with the following graphical model $A \rightarrow C \leftarrow B$.
Here we might assume that Ferenc grading exams is independent of him being in a good mood, i.e., $A \perp B$. However, the stress of marking exams results in an increased likelihood of procrastination, which increases the chances of tweeting memes, too.
However, as soon as we see a meme being tweeted by him, we know that he is either in a good mood or he is grading exams. If we know he is grading exams, that explains why he is tweeting memes, so it is less likely he is tweeting memes because he is in a good mood. Consequently, $A \not\perp B\vert C$.
In all seriousness: if we have a graphical model $A \rightarrow C \leftarrow B$, then given evidence of $C$, independence between $A$ and $B$ no longer holds.
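Here is a tiny numerical sketch of this collider structure, with made-up probabilities just for illustration: marginally $P(A\vert B) = P(A)$, but once we condition on $C$, learning $B$ lowers the probability of $A$.

```python
import itertools

# Toy joint distribution for the collider A -> C <- B (all numbers made up):
# A and B are independent a priori, and C is likely whenever A or B holds.
p_a, p_b = 0.3, 0.4

def p_c_given(a, b):
    return 0.9 if (a or b) else 0.05

def joint(a, b, c):
    pc = p_c_given(a, b)
    return ((p_a if a else 1 - p_a)
            * (p_b if b else 1 - p_b)
            * (pc if c else 1 - pc))

def prob(event):
    """Probability of an event, by summing the joint over all 8 outcomes."""
    return sum(joint(a, b, c)
               for a, b, c in itertools.product([0, 1], repeat=3)
               if event(a, b, c))

# Marginally, B tells us nothing about A: P(A | B) equals P(A) = 0.3.
print(prob(lambda a, b, c: a and b) / prob(lambda a, b, c: b))

# Given evidence C = 1, learning B "explains away" A: P(A | C, B) < P(A | C).
print(prob(lambda a, b, c: a and c) / prob(lambda a, b, c: c))              # ~0.50
print(prob(lambda a, b, c: a and b and c) / prob(lambda a, b, c: b and c))  # 0.30
```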
Why does this matter?
We argue that the explaining away phenomenon makes conditional independence of the latent factors undesirable. A much more reasonable assumption about the generative process of the data is that the factors of variation $\mathbf{v}$ are drawn independently, and then the observations are generated conditioned on them. However, if we consider two coordinates of $\mathbf{v}$ and the observation $\mathbf{x}$, we have a $V_1 \rightarrow \mathbf{X} \leftarrow V_2$ graphical model, and thus conditional independence cannot hold.
Instead, we argue that to recover the generative factors of the data, we should encourage the latent factors to be marginally independent. In the next section, we set out to derive an algorithm that encourages marginal independence in the representation $Z$. We will also show that the resulting loss function from this new derivation is in fact equivalent to the original $\beta$-VAE formulation.
Marginally Independent Latent Factors
We will start from desired properties of the representation distribution $q_\psi(z\vert x)$. We would like this representation to satisfy two properties:
- Marginal independence: We would like the aggregate posterior $q_\psi(z)$ to be close to some fixed and factorized unit Gaussian prior $p(z) = \prod_i p(z_i)$. This encourages $q_\psi(z)$ to exhibit coordinate-wise independence.
- Maximum information: We would like the representation $Z$ to retain as much information as possible about the input data $X$.
Note that without (1), (2) is insufficient, because then any deterministic and invertible function of $X$ would contain maximal information about $X$, but that would not make it a useful or disentangled representation. Similarly, without (2), (1) is insufficient, because setting $q_\psi(z\vert x) = p(z)$ would give us a latent representation $Z$ that is coordinate-wise independent, but it is also independent of the data, which is not very useful.
Deriving a practical objective
We can achieve a combination of these desiderata by optimizing an objective with a weighted combination of two terms corresponding to the two goals we set out above:
$$
\mathcal{L}(\psi) = \operatorname{KL}[q_\psi(z)\| p(z)] - \lambda \mathbb{I}_{q_\psi(z\vert x) p_\mathcal{D}(x)}[X, Z]
$$
Remember, we use $q_\psi(z)$ to denote the aggregate posterior. We will refer to this as the InfoMax objective. Now we will show how this objective can be related to the $\beta$-VAE objective. Let's first consider the KL term in the above objective:
\begin{align}
\operatorname{KL}[q_\psi(z)\| p(z)] &= \mathbb{E}_{q_\psi(z)} \log \frac{q_\psi(z)}{p(z)} \\
&= \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(z)}{p(z)} \\
&= \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(z)}{q_\psi(z\vert x)} + \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(z\vert x)}{p(z)} \\
&= \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(z)p_\mathcal{D}(x)}{q_\psi(z\vert x)p_\mathcal{D}(x)} + \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log \frac{q_\psi(z\vert x)}{p(z)} \\
&= -\mathbb{I}_{q_\psi(z\vert x)p_\mathcal{D}(x)}[X,Z] + \mathbb{E}_{p_\mathcal{D}}\operatorname{KL}[q_\psi(z\vert x)\| p(z)]
\end{align}
This is interesting. If the mutual information between $X$ and $Z$ is non-zero (which is ideally the case), the above equation shows that the latent factors cannot be both marginally and conditionally independent at the same time. It also gives us a way to relate the KL terms representing marginal and conditional independence.
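If you want to convince yourself of this identity numerically, here is a small sketch with arbitrary discrete distributions (the Dirichlet-sampled tables are, of course, just an illustration we chose):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary discrete distributions: 4 data values, 3 latent values.
p_x = rng.dirichlet(np.ones(4))                  # data distribution p_D(x)
q_z_given_x = rng.dirichlet(np.ones(3), size=4)  # representation q(z|x), rows sum to 1
p_z = rng.dirichlet(np.ones(3))                  # fixed prior p(z)

q_xz = p_x[:, None] * q_z_given_x                # joint q(z|x) p_D(x)
q_z = q_xz.sum(axis=0)                           # aggregate posterior q(z)

kl = lambda p, q: np.sum(p * np.log(p / q))

marginal_kl = kl(q_z, p_z)                                           # KL[q(z) || p(z)]
mutual_info = np.sum(q_xz * np.log(q_xz / (p_x[:, None] * q_z[None, :])))
avg_conditional_kl = np.sum(p_x * np.array([kl(row, p_z) for row in q_z_given_x]))

# The two sides of the identity agree up to floating point error.
print(marginal_kl, -mutual_info + avg_conditional_kl)
```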
Putting this back into the InfoMax objective, we have that
\begin{align}
\mathcal{L}(\psi) &= \operatorname{KL}[q_\psi(z)\| p(z)] - \lambda \mathbb{I}_{q_\psi(z\vert x)p_\mathcal{D}(x)}[X, Z] \\
&= \mathbb{E}_{p_\mathcal{D}}\operatorname{KL}[q_\psi(z\vert x)\| p(z)] - (\lambda + 1) \mathbb{I}_{q_\psi(z\vert x)p_\mathcal{D}(x)}[X, Z]
\end{align}
Using the KL term in the InfoMax objective, we were able to recover the KL-divergence term that also appears in the $\beta$-VAE (and consequently, VAE) objective.
At this point, we still have not defined the generative model $p_\theta(x\vert z)$; the above objective expresses everything in terms of the data distribution $p_\mathcal{D}$ and the posterior/representation distribution $q_\psi$.
We now turn to the second term in our desired objective, the weighted mutual information, which we still cannot easily evaluate. We will show that we can recover the reconstruction term in $\beta$-VAEs by making a variational approximation to the mutual information.
Variational bound on mutual information
Note the following equality:
\begin{equation}
\mathbb{I}[X,Z] = \mathbb{H}[X] - \mathbb{H}[X\vert Z]
\end{equation}
Since we sample $X$ from the data distribution $p_\mathcal{D}$, the first term $\mathbb{H}[X]$, the entropy of $X$, is constant with respect to the variational parameter $\psi$. We are left to focus on finding a good approximation to the second term $\mathbb{H}[X\vert Z]$. We can do so by minimizing the KL divergence between $q_\psi(x\vert z)$ and an auxiliary distribution $p_\theta(x\vert z)$ to make a variational approximation to the mutual information:
$$\mathbb{H}[X\vert Z] = - \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log q_\psi(x\vert z) \leq \inf_\theta - \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log p_\theta(x\vert z)$$
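The inequality holds because the gap between the right-hand side and $\mathbb{H}[X\vert Z]$ is an expected KL divergence, which is non-negative and vanishes when $p_\theta(x\vert z)$ matches the inverted posterior:

\begin{align}
- \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log p_\theta(x\vert z) = \mathbb{H}[X\vert Z] + \mathbb{E}_{q_\psi(z)} \operatorname{KL}[q_\psi(x\vert z)\| p_\theta(x\vert z)] \geq \mathbb{H}[X\vert Z]
\end{align}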
Putting this bound back into the objective: having found this lower bound on the mutual information, we have now recovered the reconstruction term from the $\beta$-VAE objective:
$$
\mathcal{L}(\psi) + \text{const} \leq - (1 + \lambda) \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log p_\theta(x\vert z) + \mathbb{E}_{p_\mathcal{D}}\operatorname{KL}[q_\psi(z\vert x)\| p(z)]
$$
This is essentially the same as the $\beta$-VAE objective function, where $\beta$ is related to our earlier $\lambda$. Specifically, $\beta = \frac{1}{1 + \lambda}$. Thus, since we assumed $\lambda>0$ for the InfoMax objective to make sense, we can say that the $\beta$-VAE objective encourages disentanglement in the InfoMax sense for values of $0<\beta<1$.
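To see where $\beta = \frac{1}{1 + \lambda}$ comes from, note that dividing the right-hand side of the bound by $(1+\lambda)$ does not change what we optimize over $\psi$ and $\theta$, and yields

\begin{align}
- \mathbb{E}_{q_\psi(z\vert x)p_\mathcal{D}(x)} \log p_\theta(x\vert z) + \frac{1}{1+\lambda} \mathbb{E}_{p_\mathcal{D}}\operatorname{KL}[q_\psi(z\vert x)\| p(z)],
\end{align}

which has the form of the $\beta$-VAE objective with the KL term weighted by $\frac{1}{1+\lambda}$.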
Takeaways
Conceptually, this derivation is interesting because the main object of interest is now the recognition model, $q_\psi(z\vert x)$. That is, the posterior becomes the focus of the objective function – something that is not the case when we are maximizing model likelihood alone (as explained here). In this respect, this derivation of the $\beta$-VAE makes more sense from a representation learning viewpoint than the derivation of the VAE from maximum likelihood.
There is a nice symmetry between these two views. There are two joint distributions over latents and observable variables in a VAE: on one hand we have $q_\psi(z\vert x)p_\mathcal{D}(x)$, and on the other we have $p(z)p_\theta(x\vert z)$. The "latent variable model" $q_\psi(z\vert x)p_\mathcal{D}(x)$ is a family of LVMs whose marginal distribution over the observable $\mathbf{x}$ is exactly the same as the data distribution $p_\mathcal{D}$. So one can say $q_\psi(z\vert x)p_\mathcal{D}(x)$ is a parametric family of latent variable models whose likelihood is maximal – and we want to choose from this family a model where the representation $q_\psi(z\vert x)$ has nice properties.
On the flipside, $p(z)p_\theta(x\vert z)$ is a parametric set of models where the marginal distribution of latents is coordinate-wise independent, but we would like to choose from this family a model that has good data likelihood.
The VAE objective tries to move these two latent variable models closer to one another. From the perspective of $q_\psi(z\vert x)p_\mathcal{D}(x)$ this amounts to reproducing the prior $p(z)$ with the aggregate posterior. From the perspective of $p(z)p_\theta(x\vert z)$, it amounts to maximizing the data likelihood. When the $\beta$-VAE objective is used, we additionally wish to maximize the mutual information between the observed data and the representation.
This dual role of information maximization and maximum likelihood has been pointed out before, for example in this paper about the IM algorithm. The symmetry of variational learning has been exploited a few times, for example in yin-yang machines, and more recently also in methods like adversarially learned inference.
