CS180 Project 5: Diffusion Models Exploration
Author: Ziqian Luo
Table of Contents
- Part 0: Setup
- Part 1: Sampling Loops
- Part 1.1: Implementing the Forward Process
- Part 1.2: Classical Denoising
- Part 1.3: One-Step Denoising
- Part 1.4: Iterative Denoising
- Part 1.5: Diffusion Model Sampling
- Part 1.6: Classifier-Free Guidance (CFG)
- Part 1.7: Image-to-Image Translation
- Part 1.8: Visual Anagrams
- Part 1.9: Hybrid Images
- Part 2: Training a Single-Step Denoising UNet
- Part 2.1: Implementing the UNet
- Part 2.2: Using the UNet to Train a Denoiser
- Part 2.2.1: Training
- Part 2.2.2: Out-of-Distribution Testing
- Part 3: Training a Diffusion Model
- Part 3.1: Adding Time Conditioning to UNet
- Part 3.2: Training the Time-Conditioned UNet
- Part 3.3: Sampling from the Time-Conditioned UNet
- Part 3.4: Adding Class-Conditioning to UNet
- Part 3.5: Sampling from the Class-Conditioned UNet
- Conclusion
Part 0: Setup
I used two different values for num_inference_steps: 40 and 100.
I noticed that increasing the number of inference steps improved the quality of the generated images, making them more detailed and visually appealing.
I used a fixed random seed of 42.
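Here is a rough sketch of the setup, assuming a HuggingFace diffusers pipeline; the model id and prompt below are placeholders, not necessarily what I used:

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder model id -- substitute the pretrained diffusion model used in the project.
pipe = DiffusionPipeline.from_pretrained("stage-1-model-id", torch_dtype=torch.float16).to("cuda")

for steps in (40, 100):
    # Re-seed before every run so both step counts start from the same noise.
    generator = torch.Generator(device="cuda").manual_seed(42)
    image = pipe("an example text prompt", num_inference_steps=steps, generator=generator).images[0]
    image.save(f"sample_{steps}_steps.png")
```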
Part 1: Sampling Loops
Part 1.1: Implementing the Forward Process
To implement the forward process, I added noise to a clean image by scaling it down and adding Gaussian noise whose strength grows with the timestep. The results below show how the image becomes progressively noisier as the timestep increases.
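In code, the forward process can be sampled in closed form. A minimal sketch, assuming a precomputed cumulative-product noise schedule `alphas_cumprod` (names here are illustrative):

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # eps ~ N(0, I)
    return abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
```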
Part 1.2: Classical Denoising
For classical denoising, I applied a Gaussian blur filter to the noisy images generated in the forward process. The results show the noisy images alongside their blurred versions: the blur suppresses some noise but also washes out the underlying image.
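The blur itself is a one-liner with torchvision, applied to a noisy tensor from Part 1.1 (the kernel size and sigma here are arbitrary example values):

```python
import torchvision.transforms.functional as TF

# Blur each noisy image; this suppresses some noise but also blurs away image detail.
blurred = TF.gaussian_blur(noisy_image, kernel_size=5, sigma=2.0)
```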
Part 1.3: One-Step Denoising
Using a pre-trained diffusion model (UNet), I denoised the images that had been corrupted at timesteps t=250, 500, and 750. This one-step denoising method used the model to estimate and remove the noise in a single iteration. The results below compare the original noisy images with the model's estimated clean versions.
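A sketch of the one-step estimate, assuming a diffusers-style UNet that predicts the noise (the call signature here is an assumption):

```python
import torch

@torch.no_grad()
def one_step_denoise(x_t, t, unet, alphas_cumprod, prompt_embeds):
    """Estimate the clean image from x_t with a single noise prediction."""
    abar_t = alphas_cumprod[t]
    eps_hat = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample  # predicted noise
    # Invert the forward process: x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)
    return (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
```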
Part 1.4: Iterative Denoising
To improve upon the one-step denoising, I implemented an iterative denoising loop. This approach involved creating strided timesteps and repeatedly applying the diffusion model to progressively remove noise. I added noise to a test image and then iteratively denoised it, showing the image at each significant step. The final image is compared to the one-step denoised version and the Gaussian-blurred image.
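A condensed sketch of the loop, where `predict_noise(x, t)` stands in for the pretrained UNet's noise estimate and `strided_timesteps` runs from the noisiest timestep down to 0 (the variance term of the update is omitted for brevity):

```python
def iterative_denoise(x, strided_timesteps, alphas_cumprod, predict_noise):
    for i in range(len(strided_timesteps) - 1):
        t, t_next = strided_timesteps[i], strided_timesteps[i + 1]    # t > t_next
        abar_t, abar_next = alphas_cumprod[t], alphas_cumprod[t_next]
        alpha = abar_t / abar_next
        beta = 1 - alpha

        eps_hat = predict_noise(x, t)
        x0_hat = (x - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()  # current clean estimate

        # Blend the clean estimate with the current noisy image (DDPM posterior mean).
        x = (abar_next.sqrt() * beta / (1 - abar_t)) * x0_hat \
            + (alpha.sqrt() * (1 - abar_next) / (1 - abar_t)) * x
    return x
```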
Part 1.5: Diffusion Model Sampling
In this part, I used the iterative denoising process to generate images from scratch by starting with random noise. By applying the diffusion model iteratively, I was able to create realistic images from pure noise. Below are 5 sample images that were generated using this method.
Part 1.6: Classifier-Free Guidance (CFG)
Classifier-Free Guidance (CFG) was used to enhance the quality of the generated images. By combining conditional and unconditional noise estimates, CFG allowed me to steer the image generation towards more visually appealing results. Below are 5 images generated using CFG.
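The core of CFG is a single line: run the UNet with and without the text prompt and extrapolate past the unconditional estimate. A sketch, where `predict_noise(x, t, embeds)` stands in for a prompt-conditioned UNet call and the guidance scale is only an example value:

```python
gamma = 7.0                                            # gamma > 1 strengthens the guidance
eps_uncond = predict_noise(x, t, null_prompt_embeds)   # empty-prompt (unconditional) estimate
eps_cond = predict_noise(x, t, prompt_embeds)          # text-conditional estimate
eps_hat = eps_uncond + gamma * (eps_cond - eps_uncond)
```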
Part 1.7: Image-to-Image Translation
For image-to-image translation, I used the diffusion model to edit existing images: I added noise to an image and then ran the iterative denoising loop, effectively projecting the noisy image back onto the natural image manifold. Larger amounts of noise produce edits that stray further from the original, while smaller amounts stay close to it. Below are examples of edited images at several noise levels.
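A sketch of the editing procedure, reusing the `forward` and `iterative_denoise` helpers sketched earlier; the starting index controls how much noise is added and therefore how large the edit is:

```python
# Noise the source image to an intermediate timestep, then denoise from there.
# Index 0 of strided_timesteps is the noisiest level, so a smaller i_start means a bigger edit.
t_start = strided_timesteps[i_start]
x_noisy = forward(source_image, t_start, alphas_cumprod)
edited = iterative_denoise(x_noisy, strided_timesteps[i_start:], alphas_cumprod, predict_noise)
```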
Part 1.7.1: Editing Hand-Drawn and Web Images
Hand-drawn and web images were edited using the diffusion model to enhance their quality and realism. By projecting these images onto a natural image manifold, I was able to transform them into more photorealistic versions. Below are examples of the edited images.
Part 1.7.2: Inpainting
Inpainting was implemented to edit specific parts of an image. Using a binary mask, I was able to generate new content for selected regions while preserving the unchanged areas. This technique is particularly useful for making localized edits without affecting the rest of the image. Below are examples of inpainting results.
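The masking itself is one line inside the denoising loop (a sketch, with `mask` equal to 1 where new content should appear and 0 elsewhere, reusing the `forward` helper from Part 1.1):

```python
# After each denoising step, force the unmasked pixels back to the original image,
# noised to the current timestep, so only the masked region is regenerated.
x = mask * x + (1 - mask) * forward(original_image, t, alphas_cumprod)
```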
Part 1.7.3: Text-Conditional Image-to-Image Translation
In this part, I modified image generation using text prompts to guide the edits. By providing a new text prompt, I was able to influence the model to generate new features or alter existing ones in an image, starting from various noise levels. Below are examples of images generated using text-conditional image-to-image translation.
Part 1.8: Visual Anagrams
Visual anagrams were created to generate images that change their interpretation when flipped upside down. By implementing this technique, I produced images that display different scenes depending on their orientation. Below are examples of visual anagrams that exhibit this property; you can absolutely tell which prompts I used for each image.
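A sketch of the anagram noise estimate: predict noise for the image under one prompt and for the vertically flipped image under the other, then average (again using the `predict_noise(x, t, embeds)` placeholder):

```python
import torch

eps1 = predict_noise(x, t, prompt_embeds_1)                          # upright interpretation
eps2 = predict_noise(torch.flip(x, dims=[-2]), t, prompt_embeds_2)   # flipped interpretation
eps_hat = (eps1 + torch.flip(eps2, dims=[-2])) / 2                   # average after flipping back
```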
Part 1.9: Hybrid Images
Hybrid images were created by combining features from two different prompts. By blending high-frequency and low-frequency information from two separate images, I was able to produce a composite image that appears different depending on the viewing distance. Below are examples of hybrid images, including one that appears as a skull from far away but transforms into a waterfall when viewed up close.
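A sketch of the hybrid noise estimate: low frequencies from the "far away" prompt, high frequencies from the "up close" prompt. The Gaussian kernel size and sigma are example values that set the frequency cutoff:

```python
import torchvision.transforms.functional as TF

eps_far = predict_noise(x, t, prompt_embeds_far)      # visible from far away
eps_close = predict_noise(x, t, prompt_embeds_close)  # visible up close
low = TF.gaussian_blur(eps_far, kernel_size=33, sigma=2.0)
high = eps_close - TF.gaussian_blur(eps_close, kernel_size=33, sigma=2.0)
eps_hat = low + high
```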
Part 2: Training a Single-Step Denoising UNet
I started by building a simple one-step denoiser. Given a noisy image, I wanted to train a denoiser that could map it to a clean image. To do this, I optimized over an L2 loss.
Part 2.1: Implementing the UNet
I implemented the denoiser as a UNet. It has a few downsampling and upsampling blocks, with skip connections in between.
The diagram above shows the standard tensor operations I used: convolutions, downsampling and upsampling blocks, flattening and unflattening, and concatenation for the skip connections.
Part 2.2: Using the UNet to Train a Denoiser
To solve the denoising problem, I wanted the UNet to take in a noisy image and output a clean version of it. The training goal was to minimize the L2 loss between the predicted and clean images.
I generated training data pairs of (clean image, noisy image). Each clean image was an MNIST digit, and for each batch I added Gaussian noise to create the noisy version:
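The noising step is a single line (a sketch; sigma is a fixed noise level such as 0.5, shown here only as an example):

```python
import torch

def add_noise(x, sigma):
    """z = x + sigma * eps, with eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)
```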
Part 2.2.1: Training
Then, I trained the model to perform denoising.
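Here is a condensed sketch of the training loop, with `denoiser` the UNet from Part 2.1 and `add_noise` the helper above; the hyperparameters shown are illustrative, not necessarily the exact values behind the curves below:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_loader = DataLoader(
    datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor()),
    batch_size=256, shuffle=True)
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
sigma = 0.5  # fixed training noise level (example value)

for epoch in range(5):
    for x, _ in train_loader:              # labels are not needed for plain denoising
        z = add_noise(x, sigma)            # noisy input
        loss = F.mse_loss(denoiser(z), x)  # L2 loss against the clean digit
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```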
Below is the training loss curve for the UNet:
I visualized the denoised results on the test set as training progressed. Below are sample results after the 1st and 5th epochs:
Part 2.2.2: Out-of-Distribution Testing
Next, I tested the model on noise levels that it wasn't trained for to see how well it could generalize:
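A short sketch of this test, keeping the trained weights fixed and only varying sigma (the list of noise levels is illustrative):

```python
import torch

with torch.no_grad():
    for sigma in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
        z = add_noise(test_digit, sigma)   # noise levels possibly unseen during training
        x_hat = denoiser(z)                # denoised estimate to visualize
```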
Part 3: Training a Diffusion Model
Now, I was ready to train a diffusion model using a UNet that can iteratively denoise an image. This part involved implementing a DDPM (Denoising Diffusion Probabilistic Model).
To do this, I conditioned the UNet on a timestep. This allowed the model to learn how to remove noise effectively at different stages. Conditioning the UNet meant adding information about the current timestep to guide the denoising process more accurately.
Part 3.1: Adding Time Conditioning to UNet
To add time conditioning, I needed to inject the timestep information into the UNet. I did this using a fully-connected block (FCBlock), which let me add the scalar timestep as an extra input to the model.
The diagram above shows where the FCBlock output enters the network to condition it on the timestep.
Since the timestep is just a single number, I normalized it to be between 0 and 1 before feeding it into the FCBlock.
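One possible FCBlock is a tiny MLP; this is a sketch of the shape of the idea, not necessarily the exact architecture I used:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Maps the normalized timestep t/T (shape (B, 1)) to a per-channel vector."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_channels, out_channels),
            nn.GELU(),
            nn.Linear(out_channels, out_channels),
        )

    def forward(self, t):
        return self.net(t)
```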
Part 3.2: Training the Time-Conditioned UNet
To train the time-conditioned UNet, I picked random images from the training set along with random timesteps, and trained the model to predict the noise that had been added. This process was repeated for different images and different timesteps until the model learned to effectively denoise at all stages.
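A sketch of one training step (the noise schedule, total number of timesteps, and UNet call signature are assumptions used for illustration):

```python
import torch
import torch.nn.functional as F

T = 300
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)

def train_step(x0, unet, optimizer):
    """Noise a clean batch to a random timestep and regress the added noise."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps               # forward process
    loss = F.mse_loss(unet(x_t, t.float().unsqueeze(1) / T), eps)  # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```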
Below is the training loss curve for the time-conditioned UNet:
Part 3.3: Sampling from the Time-Conditioned UNet
Once the time-conditioned UNet was trained, I used it to generate images from pure noise. The sampling process was similar to the iterative denoising approach I used earlier, but now the model was better at understanding the noise level at each step, thanks to the timestep conditioning.
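A sketch of the sampling loop, using the standard DDPM update and the schedule variables from the training sketch above (the image shape and UNet call signature are again assumptions):

```python
import torch

alphas = 1 - betas

@torch.no_grad()
def sample(unet, shape=(16, 1, 28, 28)):
    x = torch.randn(shape)                                     # start from pure noise
    for t in range(T - 1, -1, -1):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        t_in = torch.full((shape[0], 1), t / T)                # normalized timestep
        eps_hat = unet(x, t_in)
        # x_{t-1} = (x_t - (1 - a_t)/sqrt(1 - abar_t) * eps_hat) / sqrt(a_t) + sqrt(b_t) * z
        x = (x - (1 - alphas[t]) / (1 - alphas_cumprod[t]).sqrt() * eps_hat) / alphas[t].sqrt() \
            + betas[t].sqrt() * z
    return x
```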
Part 3.4: Adding Class-Conditioning to UNet
To make the generated images even better, I also added class conditioning to the UNet. This allowed the model to generate specific digit classes, giving me more control over the output. I added two more FCBlocks to the UNet to inject class information alongside the timestep.
Instead of a scalar, I used a one-hot vector to represent the class. This way, the model could understand which class it was working with, while still being flexible enough to generate images without any class conditioning (by dropping the conditioning information).
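A sketch of how the class vector can be built and occasionally dropped during training (the drop rate shown is just an example):

```python
import torch
import torch.nn.functional as F

c = F.one_hot(labels, num_classes=10).float()       # one-hot class vector, shape (B, 10)
drop = (torch.rand(c.shape[0], 1) < 0.1).float()    # drop conditioning ~10% of the time
c = c * (1 - drop)                                  # dropped rows become all-zero vectors
eps_hat = unet(x_t, t_in, c)                        # the UNet now also takes the class vector
```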
Below is the training curve for the class-conditioned UNet:
Part 3.5: Sampling from the Class-Conditioned UNet
After training the class-conditioned UNet, I sampled images from the model to see how well it could generate digits of different classes. The sampling process was the same as for the time-conditioned model, but with the added benefit of class control. I also used classifier-free guidance to improve the quality of the generated images by adjusting the conditioning signal during sampling.
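Inside the sampling loop, the only change from Part 3.3 is the class vector plus the CFG mix (the guidance scale is an example value):

```python
import torch

gamma = 5.0                                              # example guidance scale
eps_cond = unet(x, t_in, c_onehot)                       # conditioned on the desired digit
eps_uncond = unet(x, t_in, torch.zeros_like(c_onehot))   # all-zero class vector = unconditional
eps_hat = eps_uncond + gamma * (eps_cond - eps_uncond)
```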
Below are the results of sampling from the class-conditioned UNet at different stages of training, such as after 5 and 20 epochs:
Conclusion
Overall, this project was a great learning experience. I was able to explore various diffusion models and understand how they can be used to generate images from noise. By training a UNet to denoise images, I learned how to improve the quality of generated images and control the output using class and time conditioning. I also experimented with different techniques such as classifier-free guidance and image-to-image translation to enhance the generated images further. This project has given me a deeper understanding of diffusion models and their applications in image generation and manipulation.