Project Conclusions

So, as the project is winding down and I have no more time to run experiments (also, I have to write a 10-page paper on Thompson sampling for tomorrow), I thought I would write a brief post recapping what I did and learned from this project.

The common architecture throughout my project is the variational autoencoder. While the results from the rest of the class using LSTMs were markedly better, I decided to explore the VAE in a bit more depth rather than trying to replicate the results. I guess this is because I find playing with these latent variable models more interesting in a sense, rather than tuning the hyperparameters of an LSTM. I suspect that this is because the nature of this project (as opposed to the Cats & Dogs project) was more qualitative – I was content with trying to ‘make interesting things happen’ rather than ‘get the best score’, because nobody was posting their negative log likelihood scores (indeed, when I tried to submit them it didn’t work).

In the end, while the audio I generated wasn’t terrible, it was still quite far off from many others using large LSTMs. I suspect that if I had increased the capacity of my decoder significantly by adding more layers, the generated audio would have been quite a bit better. Now that I think about it, I think what happened was that the model was resorting to some kind of averaging in order to minimize the L2 reconstruction error – this points to another solution, which would be to have a better measure of loss (maybe some sort of adversarial training?). Also, this kind of averaging is somewhat similar to what we see in the dialogue problem, where our decoders generate very generic sentences in order to minimize the NLL. Maybe adversarial training could help there as well…

Looking back on my samples, it seems like the audio I generated using iterative refinement and a 50-dimensional Z (https://drive.google.com/open?id=0B-OCmk1sbIXRWXFacW1RN3Bzbkk) gave the best results, even though there was quite a bit of noise. I had forgotten this when I ran my latent sampling models, so I had reduced the size of my Z to 10 for convenience. I would have liked to run a couple more experiments, using a 100-dimensional Z and a sampling rate of 4000, but I will have to do that another day.

I’m happy that the project gave me an opportunity to play around with the VAE, something I’ve been wanting to do for a while. A few of the lessons I learned from the project:

1. If your learning curves are fishy, always suspect your learning rate!
2. Tuning hyperparameters is pretty important. No surprises there.
3. It is always useful to come up with toy examples to debug your model, before you actually train and test on a dataset. I suspect that if I had fed a simple sine wave into my model before I figured out I was binarizing everything, I would have figured out my problems much sooner.
4. If you are under a tight deadline and need the best results, it’s easier to build off of the results of others, rather than writing everything from scratch. That’s why we have open source code!

-Ryan

Final results – trajectory indifference

So I’ve finished running some of the experiments that I talked about in the last post. In particular, I experimented with different trajectories in the latent space. I also decided to reduce the dimensionality of the input space from 32000 to 16000 while decreasing the sampling rate from 16000 to 4000. This corresponds to 4 seconds of audio instead of 2. The hope is that this will lead the VAE to capture more interesting variations in its window, since 2 seconds is a bit too small to have significant changes in tone. Also, reducing the sampling rate should make it easier for the model to learn.
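To make the window arithmetic above explicit, here is a tiny sketch. The function name `window_seconds` is just an illustrative helper, not something from the project code.

```python
def window_seconds(n_samples, sample_rate_hz):
    """Duration in seconds of a window of n_samples at the given sampling rate."""
    return n_samples / sample_rate_hz

old_window = window_seconds(32000, 16000)  # previous setup
new_window = window_seconds(16000, 4000)   # setup used in this post
print(old_window, new_window)
```

So halving the input size while dividing the sampling rate by 4 doubles the length of audio the model sees per example.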

These are the hyperparameters used:

• 16000-sized inputs and outputs (corresponding to 4s of audio sampled at a rate of 4000). Note that this is down from the previous sampling rate of 16000
• 2 hidden layers in the encoder with 2000, 4000, or 8000 units
• 1 hidden layer in the decoder with 2000, 4000, or 8000 units
• a 10-dimensional latent variable Z
• a learning rate of 1e-5
• not re-weighting the KL
• run for 400 epochs with early stopping if the validation error doesn’t improve for 50 epochs

Here are the training curves for the model with 4000 hidden units:

And 8000 hidden units:

Somewhat surprisingly, despite the reduction in the sampling rate these models are not able to achieve the same level of loss as the previous models.

The goal of this experiment was to see if different sampling paths in the latent space led to qualitatively different generated audio. To that end, I tried 2 different paths – one where the latent variables were sampled diagonally across the 10-dimensional space (both in the range [-1,1] and [-2,2]), and one where they were sampled in a circle (this was implemented by taking the vector [sin(x), cos(x), sin(x), cos(x), …], and varying x accordingly). The results were fairly interesting, but the audio wasn’t particularly great.
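The two kinds of path can be sketched roughly as follows. This is a minimal illustration, not the code used for the experiments: the function names, the number of steps, and the choice to sweep x over one full turn are all assumptions.

```python
import math

def diagonal_path(dim, low, high, n_steps):
    """Points (a, a, ..., a) with a moving linearly from low to high."""
    path = []
    for t in range(n_steps):
        a = low + (high - low) * t / (n_steps - 1)
        path.append([a] * dim)
    return path

def circular_path(dim, n_steps):
    """Points [sin(x), cos(x), sin(x), cos(x), ...] with x sweeping one full turn."""
    path = []
    for t in range(n_steps):
        x = 2.0 * math.pi * t / n_steps
        path.append([math.sin(x) if i % 2 == 0 else math.cos(x) for i in range(dim)])
    return path

diag = diagonal_path(dim=10, low=-1.0, high=1.0, n_steps=5)
circ = circular_path(dim=10, n_steps=8)
print(diag[0][:2], diag[-1][:2])
print(len(circ), len(circ[0]))
```

Each point on either path would then be passed through the decoder to produce one window of audio. Note that every point on the circular path has the same distance from the origin, whereas the diagonal path passes through the origin and moves away from it.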

Here is what the audio looks like for the diagonal sampling in the range [-1,1] with 4000 hidden units:

The corresponding audio is here: https://drive.google.com/open?id=0B-OCmk1sbIXRYXFWLTEtTE9PU2s. This graph is fairly interesting – what it shows essentially is that there is a range from about [-0.3,0.3] (in 10-dimensional latent space) where the encoder is placing most of the mass for the posterior. Outside that range, there is increasingly more noise in the generated output. What is disappointing is that traversing through this region of the space didn’t seem to alter anything other than the amount of noise – there is really no qualitative difference between the 4 second samples other than this.

Here is the graph for the circular trajectory with 8000 hidden units:

and here is the audio: https://drive.google.com/open?id=0B-OCmk1sbIXRYV9GeVJJdlJIYWs. What’s nice is that this graph sort of seems to indicate that there is something that is changing when we alter where we sample from in the latent space. Unfortunately, this is not really reflected in the quality of the audio – unless you listen very closely, each part sounds approximately the same. It seems like the VAE, as I’ve set it up, doesn’t learn anything interesting in its latent space, at least in terms of the quality of the generated audio.

What’s also disappointing is that decreasing the sampling rate by a factor of 4 and increasing the input window to 4 seconds did not have a significant impact on the audio quality. The audio still sounds very uninteresting… not quite noisy, but muted, and without significant variations in pitch like the training data. In particular, it is still quite a bit worse than the results generated by others using LSTMs. My main suspect is model capacity – I should have experimented with a deeper decoder and even longer (~8s) time sequences. But alas, I wasn’t able to figure out the dimension mismatch bug in the decoder code in time. As it is, the model seems to generate some kind of ‘average sound’ from the input audio, in order to minimize the reconstruction error. This also suggests that the loss type (L2) may be part of the problem.
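Here is a toy illustration of why an L2 loss can favour an ‘average sound’. It assumes, purely for illustration, that the model has to give one output for two equally likely targets: the same sine wave at two different phases. The waves, the phase shift, and all names are made up for the sketch.

```python
import math

n = 16
wave_a = [math.sin(2.0 * math.pi * i / n) for i in range(n)]
wave_b = [math.sin(2.0 * math.pi * i / n + 2.0) for i in range(n)]  # shifted by 2 radians

def total_sq_err(pred, targets):
    """Summed squared error of one prediction against several target waveforms."""
    return sum((p - t) ** 2 for target in targets for p, t in zip(pred, target))

# The pointwise mean of the targets is the single prediction that minimizes L2 error.
mean_wave = [(a + b) / 2.0 for a, b in zip(wave_a, wave_b)]

print(total_sq_err(mean_wave, [wave_a, wave_b]) < total_sq_err(wave_a, [wave_a, wave_b]))
print(max(abs(v) for v in wave_a), max(abs(v) for v in mean_wave))
```

The mean has a lower error than either real waveform, but its peak amplitude is noticeably smaller, which is at least consistent with the muted sound described above.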

Due to the time constraints, these are the last experiments I’m going to run. My next post will be a wrap-up of what I learned from this project.

Results – VAE generation via latent trajectories

In the last post, I talked about a way to generate longer sequences with the VAE – by simply ignoring the recognition network after training and sampling from the latent space. Since you can sample from the latent space as many times as you like, there is no limit on the length of sequences that you can generate. Hopefully, if your consecutive samples are close enough together, the audio will sound relatively smooth, since points that are close together in your latent space should encode similar-sounding audio.

I put this to the test with some alterations on the network design:

• 32000-sized inputs and outputs
• 2 hidden layers in the encoder with 8000 units (note this is changed from only having 1 hidden layer previously)
• 1 hidden layer in the decoder with 8000 units
• a 2- or 10-dimensional latent variable Z
• a learning rate of 1e-5
• not re-weighting the KL
• run for 400 epochs with early stopping if the validation error doesn’t improve for 50 epochs

There are some strange things about this setup – notably there aren’t 2 layers in the decoder, and the latent variable dimension is very small. The reason for only having 1 decoder layer is that there is a dimension mismatch bug that I’m trying to work out (although it works fine for the encoder). The reason for the latter point is that I wanted to make it easy to give sample trajectories through the latent space (which would require a bit more effort in 50-dimensions).

Here is what the training curve looked like for the 2-dimensional latent variable:

And for the 10-dimensional Z. As expected, the added capacity allows the model to achieve a lower validation error (a loss of about -2.4e+6 vs -2.2e+6):

In order to generate the audio, in 2-dimensions I sampled along a 2-d grid from [-1,1], in steps of 0.5. In 10-dimensions, I did the same thing for the first 2-dimensions of the latent variable, and left the other 8 dimensions at 0. I am trying another experiment where I sample diagonally across all dimensions of the latent space, so we’ll see how that turns out in comparison.
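The grid described above can be sketched like this. The function name and structure are illustrative assumptions; only the range, the step size, and the ‘other dimensions at 0’ rule come from the description.

```python
def grid_latents(latent_dim, low=-1.0, high=1.0, step=0.5):
    """Nested-loop grid over the first two latent dimensions; the rest stay at 0."""
    n_values = int(round((high - low) / step)) + 1
    values = [low + i * step for i in range(n_values)]
    points = []
    for a in values:        # outer loop: first latent dimension
        for b in values:    # inner loop: second latent dimension
            z = [0.0] * latent_dim
            z[0], z[1] = a, b
            points.append(z)
    return points

grid_2d = grid_latents(latent_dim=2)
grid_10d = grid_latents(latent_dim=10)
print(len(grid_2d), len(grid_10d))
```

With 5 values per dimension this gives 25 latent points, and each one is decoded into a single 2 second window.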

Here is what the generated audio looks like for the 2-dimensional Z:

And here are the samples from the models (the links look the exact same for some reason, but they seem to go to the proper files):

First note – the quality of the audio seems to be better, with less noise. This must be due to the increased capacity of the encoder, but it is not clear to me exactly how this is contributing to better samples since we are not actually using it in the generation process (maybe during training it allows the model more flexibility in ‘mapping out’ its latent space, or something).

Another thing to note is that there is not a lot of variety from the samples from the 10-dimensional Z. This is probably because I’m sampling from such a small region of the latent space – hopefully sampling diagonally in all dimensions should help this. Things are a bit better in the 2-dimensional case, especially at the end where there is a noticeable difference in the generated audio. I am re-running some experiments where I sample from a larger range in the 10-dimensional space.

Another thing the diagonal sampling should alleviate is the ‘cut-offs’ after each 10 second block. This corresponds to discontinuities in the latent space trajectory – since I’m sampling in a grid (implemented via a nested loop), every time I go back to the initial point along one dimension, there is a ‘jump’. While it’s not particularly obvious in the video, you can see it by looking at the waveform in the 2-dimensional case. Ideally, you want your trajectories to be “contiguous”, so that there isn’t a large gap between neighbouring 2 second samples, but the audio still changes over time.
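To see where the jumps come from, here is a small sketch comparing the step sizes of the nested-loop grid with those of a diagonal path of the same length. The diagonal path with 25 points is an assumption chosen to match the grid size, and the names are illustrative.

```python
import math

values = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Nested-loop order: the inner coordinate snaps back from 1 to -1 at each new row.
raster = [(a, b) for a in values for b in values]

# Diagonal path with the same number of points, from (-1, -1) to (1, 1).
n = len(raster)
diagonal = [(-1.0 + 2.0 * t / (n - 1),) * 2 for t in range(n)]

def step_sizes(path):
    """Euclidean distance between each pair of consecutive points."""
    return [math.dist(p, q) for p, q in zip(path, path[1:])]

raster_steps = step_sizes(raster)
diagonal_steps = step_sizes(diagonal)
big_jumps = [i for i, d in enumerate(raster_steps) if d > 1.0]
print(big_jumps, max(raster_steps), max(diagonal_steps))
```

The large steps in the grid order occur after every 5th point, which at 2 seconds per window lines up with a cut-off every 10 seconds. The diagonal path has the same small step everywhere.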

Unfortunately, I won’t be able to get results from the recurrent models in time for the deadline (since I also have another project due tomorrow), but I think these VAE results are fairly interesting in themselves.

A journey through the latent space

Although I was able to find a way to generate longer sequences using the ‘iterative refinement’ technique in the last blog post, I’m still not completely satisfied because each of the 2 second snippets from the model is very similar (understandably, because it is trying to minimize reconstruction error). So I would like to have a better way to generate longer sequences.

Of course, there is an obvious way that somehow I hadn’t thought of before – by training the VAE, presumably we have already learned some latent representation that is similar to a Gaussian distribution prior in the Z space. So in fact we don’t have to use the inference network at all – we can just sample directly from the latent space! This should lead to samples that are qualitatively different in terms of how they sound. You can sample as many times as you like to obtain as many different samples as you like, and thus form arbitrarily long sequences.
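In code, the idea looks roughly like the sketch below. `toy_decoder` is a hypothetical stand-in for the trained decoder (it just turns the first latent coordinate into a sine frequency), and the window size is tiny to keep the example readable.

```python
import math
import random

def toy_decoder(z, n_out=8):
    """Hypothetical stand-in for a trained decoder: maps a latent vector to a waveform."""
    freq = 1.0 + abs(z[0])
    return [math.sin(2.0 * math.pi * freq * i / n_out) for i in range(n_out)]

def generate(decoder, latent_dim, n_windows, rng):
    """Draw z ~ N(0, I) n_windows times, decode each draw, and concatenate the results."""
    audio = []
    for _ in range(n_windows):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        audio.extend(decoder(z))
    return audio

rng = random.Random(0)
audio = generate(toy_decoder, latent_dim=10, n_windows=3, rng=rng)
print(len(audio))  # 3 windows of 8 samples each
```

The inference network never appears here: only the prior and the decoder are needed at generation time.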

In fact, this is done quite often when training VAEs on MNIST – if you constrain the Z to be 2-dimensional, you can have a nice visualization where you sample from the 2-D latent space in a grid, and visualize the reconstructions. What you find is that the model ‘sees’ different digits in different parts of the latent space. Hopefully, I can train a VAE to exhibit the same kind of behaviour in the audio domain.

The only problem with this method is that the samples generated by the model will still exhibit a ‘discontinuity at the edges’, meaning that two consecutive samples from the latent variable space will not magically smooth themselves together.

I have just finished training some of these models, but I can’t upload the audio now because I’m at my cottage for my brother’s birthday. So hopefully these should be up soon!

I’ve also been experimenting with the variational recurrent autoencoder (VRAE), proposed by Fabius & van Amersfoort (2015) at ICLR. However, I’m not sure if I’m going to be able to get results for this in time… Instead, I might focus on replicating the LSTM results that some of the other students are getting. While I would like to think that my VAE exploration has been slightly more interesting from a technical perspective, it is undeniable that the samples from the LSTM sound quite a bit better overall. So much to do, so little time!

Generating longer sequences via iterative refinement

So, I was thinking about methods to generate longer sequences with the VAE. One idea I came up with was iterative refinement: that is, given a data input, you compute a reconstruction with the VAE, and then you feed that reconstruction back into the VAE to compute another reconstruction. Intuitively, I think of the VAE as performing a sort of random walk in the data space, akin to MCMC. If you concatenate the output of these reconstructions together, you can form an audio sequence of arbitrary length (depending on how many times you run this process).
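The loop itself is short. In this sketch `toy_reconstruct` is a hypothetical stand-in for one encode-decode pass of the VAE (it just returns the input plus a little noise), so only the structure of `iterative_refinement` is meant to carry over.

```python
import random

def iterative_refinement(reconstruct, x0, n_iters):
    """Feed each reconstruction back in and concatenate all the reconstructions."""
    audio, x = [], list(x0)
    for _ in range(n_iters):
        x = reconstruct(x)   # one encode-decode pass
        audio.extend(x)
    return audio

rng = random.Random(0)

def toy_reconstruct(x):
    """Hypothetical stand-in for a VAE pass: the input plus a small random perturbation."""
    return [v + rng.gauss(0.0, 0.01) for v in x]

x0 = [0.0, 1.0, 0.0, -1.0]
audio = iterative_refinement(toy_reconstruct, x0, n_iters=20)
print(len(audio))  # 20 windows of 4 samples each
```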

I ran an experiment with the following hyperparameters:

• 32000-sized inputs and outputs (corresponding to 2s of audio sampled at a rate of 16000)
• 1 hidden layer in the encoder with 8000 units
• 1 hidden layer in the decoder with 8000 units
• a 50-dimensional latent variable Z
• a learning rate of 1e-5
• not re-weighting the KL
• run for 400 epochs with early stopping if the validation error doesn’t improve for 50 epochs

The learning curve looks very similar to the previous case with 4000 units and a latent variable dimension of 20:

And here is what the audio sounds like after iterative refinement:

Again, there’s a bit of noise, but the quality isn’t bad. The problem is that there is a clear auditory separation between each 2 second clip (it doesn’t ‘flow’), and each 2 second clip sounds about the same. This similarity is to be expected, since the lower bound explicitly accounts for the reconstruction cost (squared difference between the output and input).

This is reinforced by looking at the plot of the audio signal:

Each reconstruction is about the same, but the amplitude is slightly reduced. This is an interesting observation – I suspect what is happening here is that to minimize the L2 reconstruction error, the reconstruction is slightly ‘contracting to the mean’. While it only happens a very small amount each time, it becomes noticeable after 20 iterations.
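A small shrink per pass compounds quickly, which this sketch tries to illustrate. The per-pass gain of 0.98 is a made-up number, not something measured from the model, and `shrink_towards_mean` is only a caricature of what a reconstruction pass might do.

```python
import math

def shrink_towards_mean(x, gain):
    """Pull every sample towards the window mean by a constant factor."""
    mean = sum(x) / len(x)
    return [mean + gain * (v - mean) for v in x]

window = [math.sin(2.0 * math.pi * i / 16) for i in range(16)]

peaks = []
x = window
for _ in range(20):
    x = shrink_towards_mean(x, gain=0.98)  # hypothetical 2% contraction per pass
    peaks.append(max(abs(v) for v in x))

print(round(peaks[0], 3), round(peaks[-1], 3))
```

Under this assumption a 2% loss per pass, barely visible after one iteration, has removed about a third of the amplitude after 20.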

We’re going bug hunting

So, after many days of frustration, I’ve figured out the cause of the poor samples from my model.

It was not, as I had previously postulated, that the model was overfitting, or underfitting, or didn’t have enough capacity (although it still probably doesn’t have enough capacity).

The problem was that I hadn’t re-set the flag to give the model continuous outputs!

Let me explain – the original version of this code was meant for generating MNIST images. There are, in fact, two different kinds of MNIST datasets – the binarized dataset, and the regular (continuous-valued) dataset. Depending on the dataset you use, you will have to change your decoder accordingly. For the binary dataset, you are outputting the parameter $p$ of a Bernoulli(p) distribution. For the continuous dataset, you are predicting the mean and variance of a Gaussian distribution, I believe.
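For concreteness, the two reconstruction costs look roughly like this. These are the standard log-likelihood formulas written in plain Python, not the code from the project; the function names and the clipping constant `eps` are assumptions.

```python
import math

def bernoulli_nll(x, p, eps=1e-7):
    """Negative log-likelihood of binary targets x under Bernoulli(p), summed over dims."""
    total = 0.0
    for xi, pi in zip(x, p):
        pi = min(max(pi, eps), 1.0 - eps)  # keep the logs finite
        total -= xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total

def gaussian_nll(x, mean, log_var):
    """Negative log-likelihood of real-valued targets x under a diagonal Gaussian."""
    total = 0.0
    for xi, mi, lv in zip(x, mean, log_var):
        total += 0.5 * (math.log(2.0 * math.pi) + lv + (xi - mi) ** 2 / math.exp(lv))
    return total

print(bernoulli_nll([1.0, 0.0], [0.9, 0.2]))
print(gaussian_nll([0.3, -0.4], [0.25, -0.5], [-2.0, -2.0]))
```

A Bernoulli decoder outputs probabilities between 0 and 1, so it cannot represent the negative half of an audio waveform at all. It may also be part of why the outputs did not look strictly binary, since the parameter $p$ is a probability rather than a hard 0 or 1, though that is only a guess.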

So, when I was using the binary version, it was predicting mostly zeros, and only a few non-zero values (not a sinusoidal wave that one would expect). The reason I didn’t suspect this sooner was that, when I looked at the actual output values from samples from the decoder, it wasn’t just binary-valued (there were several different non-zero numbers in a range).

So while I’m still not sure why there were non-binary outputs in the binary version, I’m pretty sure that this was my problem all along. I re-trained the VAE with the following criteria:

• 32000-sized inputs and outputs (corresponding to 2s of audio sampled at a rate of 16000)
• 1 hidden layer in the encoder with 4000 units
• 1 hidden layer in the decoder with 4000 units
• a 20-dimensional latent variable Z
• a learning rate of 1e-5
• not re-weighting the KL
• run for 400 epochs with early stopping if the validation error doesn’t improve for 50 epochs

It converged after about 160 epochs (in fact, it converged much sooner – I think I can now go back and make my validation stopping more aggressive):

Here, red is the validation error, blue is the training error, the vertical axis is the loss, and the horizontal axis is the number of iterations.

Here is a depiction of the output waveforms (for various samples from the latent variable):

And here is a snippet of the samples I am getting:

Original audio:

Corresponding VAE sample:

Sounds much more promising! Lesson learned: when you are getting fishy results, check every. single. flag. in. the. code. Also, it would have been a good idea to try a ‘toy example’ (eg. a sine wave) to see if my network could have reconstructed that, to make sure there were no bugs. Once I saw it couldn’t, I would have taken a more careful look at the code earlier!
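A sine-wave check along these lines might look like the sketch below. `identity` and `binarizing` are hypothetical stand-ins for a working and a buggy model, and the tolerance is arbitrary; in practice `reconstruct` would be the trained network.

```python
import math

def sine_wave(n_samples=64, cycles=4.0):
    """A clean test signal with values in [-1, 1]."""
    return [math.sin(2.0 * math.pi * cycles * i / n_samples) for i in range(n_samples)]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

def passes_sine_check(reconstruct, tol=0.05):
    """True if reconstruct() reproduces a sine wave to within tol mean squared error."""
    x = sine_wave()
    return mse(x, reconstruct(x)) < tol

def identity(x):
    """Stand-in for a model that reconstructs perfectly."""
    return list(x)

def binarizing(x):
    """Hypothetical buggy model whose outputs are thresholded to 0 or 1."""
    return [1.0 if v > 0.5 else 0.0 for v in x]

print(passes_sine_check(identity))
print(passes_sine_check(binarizing))
```

A check like this takes seconds to run and would flag a binarized output immediately, long before training on the real dataset.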

So I will try some more hyperparameter configurations to see if I can get a better result. Then, if I have time, I will try to either go bug hunting and get the LSTM working, or skip right to a variational recurrent autoencoder – stay tuned!

On having a limited model capacity

So, I ran several more experiments using a 1-layer VAE. I used the following hyperparameter settings:

• 32000-sized inputs and outputs (corresponding to 2s of audio sampled at a rate of 16000)
• 1 hidden layer in the encoder with 500, 1000, and 2000 units
• 1 hidden layer in the decoder with 500, 1000, and 2000 units
• a 5-, 10-, or 20-dimensional latent variable Z
• a learning rate of 1e-5
• not re-weighting the KL
• run for 4000 epochs with early stopping if the validation error doesn’t improve for 50 epochs
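For reference, ‘re-weighting the KL’ means scaling the KL term in the variational lower bound. A minimal sketch, assuming a diagonal Gaussian posterior and a standard normal prior (the usual VAE setup), with illustrative names:

```python
import math

def kl_diag_gaussian(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mean, log_var))

def negative_elbo(recon_nll, mean, log_var, kl_weight=1.0):
    """Reconstruction cost plus the (optionally re-weighted) KL term."""
    return recon_nll + kl_weight * kl_diag_gaussian(mean, log_var)

print(kl_diag_gaussian([0.0, 0.0], [0.0, 0.0]))         # posterior equals the prior
print(negative_elbo(10.0, [1.0], [0.0]))                 # regular lower bound
print(negative_elbo(10.0, [1.0], [0.0], kl_weight=0.5))  # reduced KL term
```

With `kl_weight=1.0` this is the regular variational lower bound (negated); a smaller weight lets the encoder drift further from the prior in exchange for better reconstructions.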

It seems like the only reason the network was overfitting before was because of the reduced KL term. Now, I am running it with the regular variational lower bound, but simply for longer than I ran it initially (since I presumed it was not converging). Interestingly, what I found is that the training and validation error reduced for the full 4000 epochs. However, the samples still sound terrible:

Sample from VAE with 20-dim Z, 1000 units: