Image-to-Image Generation Using depth2img Pre-Trained Models

Mobarak Inuwa 13 Jun, 2023 • 10 min read

Introduction

Hugging Face provides different ways to carry out image-to-image generation using pre-trained models and other available libraries. This article generates new images from an input image using the UNet2DConditionModel. The implementation is based on PyTorch and the Hugging Face depth2img model. Stable Diffusion models make it possible to modify existing images and generate new, creative ones.

Learning Objectives

  • A practical look at Stable Diffusion
  • Building a customized pipeline
  • Using Hugging Face transformers and diffusers

This article was published as a part of the Data Science Blogathon.

What are Stable Diffusion Models?

Stable Diffusion models act as latent diffusers: they learn the latent makeup of images and how that structure behaves across the latent space. They can be seen as members of the family of deep generative neural networks.

A beginner may ask what the word "stable" implies in Stable Diffusion. The generation is stable because the original image and the supplied parameters guide the result as it moves through the latent space. Without such guidance the diffusion is unstable: the generation space is unpredictable and essentially random.

The Architecture of Stable Diffusion

The working concept of stable diffusion is a bit complex since it goes beyond regular image processing or deep learning. The attempt in this article is to demystify this concept using simple explanations.

The model uses a probabilistic model known as a latent diffusion model (LDM). During training, noise is added to the data step by step so that, after enough steps, the distribution of the noised samples approaches a normal distribution. In other words, even as the inputs are being noised, the process stays within a known statistical form. The noise added at each step is referred to as Gaussian noise.

The image below illustrates the diffusion process in the latent space, with noise, using the U-Net model.

[Image: the diffusion process in latent space with a U-Net denoiser]
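
To make the Gaussian noising step concrete, here is a minimal sketch of the forward process under a simple linear beta schedule (an assumption for illustration; the actual pipeline delegates this step to the diffusers scheduler through its add_noise method):

import torch

# Forward (noising) process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
num_train_timesteps = 1000
betas = torch.linspace(1e-4, 0.02, num_train_timesteps)  # illustrative linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_gaussian_noise(x0, t):
    noise = torch.randn_like(x0)              # eps ~ N(0, I)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.randn(1, 4, 64, 64)      # a stand-in latent
xt = add_gaussian_noise(x0, t=500)  # heavily noised version of x0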

This is a sequence of denoising autoencoders (DAEs). DAEs adjust the reconstruction criterion over the pixel space of the images by adding noise to a standard autoencoder. Stable Diffusion consists of a few essential parts: a variational autoencoder (VAE), an artificial neural network that learns a probabilistic latent representation of images; a U-Net block, a convolutional neural network originally designed for image segmentation; and finally a text encoder, a trained CLIP text encoder (ViT-L/14 in the original Stable Diffusion) that transforms text prompts into an embedding space.

The VAE compresses the image values into a lower-dimensional latent space, preserving detail while keeping memory manageable.

[Image: the variational autoencoder (VAE) in the image-to-image pipeline]
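
As a rough, hedged illustration of that compression, the minimal sketch below pushes a stand-in 512×512 tensor through the same AutoencoderKL weights used later in this article and checks the latent shape (exact shapes depend on the checkpoint):

import torch
from diffusers import AutoencoderKL

# A minimal sketch of VAE compression (illustrative; shapes depend on the checkpoint)
vae = AutoencoderKL.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='vae')

x = torch.randn(1, 3, 512, 512)                   # stand-in for a normalized RGB image in [-1, 1]
with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample()  # roughly [1, 4, 64, 64]
    recon = vae.decode(latents).sample            # back to [1, 3, 512, 512]
print(latents.shape, recon.shape)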

Image-to-Image Generation

The idea is to transform an existing image into a modified one using a diffuser within an image-generation pipeline. We give an image to the pipeline, and it generates another image. This is one among the many use cases of Stable Diffusion. Other instances are text-to-image, where no image is provided initially, only a piece of text describing the expected image; text-to-video, which generates new videos from text; text-to-speech; and so on. In this article, we present a hands-on approach to building a customized pipeline for image-to-image generation.
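
For comparison, diffusers also ships a ready-made depth-guided pipeline. The snippet below is a hedged sketch of that prebuilt route (it assumes a recent diffusers release, a CUDA GPU, and a PIL image already loaded as init_image); the rest of this article builds the equivalent by hand for full control.

import torch
from diffusers import StableDiffusionDepth2ImgPipeline

# Prebuilt depth2img pipeline (for comparison with the custom pipeline built below)
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")
result = pipe(prompt="two hibiscus flowers", image=init_image, strength=0.8).images[0]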

Approach

Let us see the process we will follow to build our customized pipeline.

  • Import libraries
  • Create a class for the Diffusion pipeline
  • Create an instance of the pipeline
  • Load functions for preprocessing our text (prompts)
  • Understand image depth with Dense Prediction Transformers (DPT)
  • Create a variable for the pipeline
  • Load the image to be used
  • Generate new images

Project GitHub repo

Importing Libraries

Installing libraries

#  Installing libraries
%pip install --quiet --upgrade diffusers transformers scipy ftfy
#  Installing libraries
%pip install --quiet --upgrade accelerate

Let us look at the use of the above libraries:

Diffusers: latent diffusion models trained to generate photo-realistic images from text. This library is the backbone of the whole project.

Transformers: They are models used for their ability to understand features in a context like text and interpret them to produce token embeddings. The choice of transformers enhances how the context will be understood.

Scipy: SciPy is a traditional Python library for scientific computation that uses the NumPy backend. It provides our pipeline with optimization and signal processing of the data.

Ftfy: It provides tools for cleaning up text (for example, fixing broken Unicode) before tokenization.

Accelerate: It allows PyTorch code to be cross-platform and to run in different configurations and environments with no stress.

import numpy as np
from tqdm import tqdm
from PIL import Image

# PyTorch backend
import torch
from torch import autocast

# Transformers containing tools
from transformers import CLIPTextModel, CLIPTokenizer
from transformers import DPTForDepthEstimation, DPTFeatureExtractor

# Accessing our pipeline
from diffusers import AutoencoderKL, UNet2DConditionModel
from diffusers.schedulers.scheduling_pndm import PNDMScheduler

Creating Diffusion Pipeline

Although Hugging Face provides a simple predefined pipeline for diffusion, in this section we will build our own. This gives us a flexible approach: we specify what we want and how we want it. We create a number of methods in the class, which we will initialize later. To follow along hands-on, please use the notebook available on GitHub.

# Creating a class for the Diffusion pipeline

class DiffusionPipeline:

    def __init__(self, 
                 vae, 
                 tokenizer, 
                 text_encoder, 
                 unet, 
                 scheduler):
        
        self.vae = vae
        self.tokenizer = tokenizer
        self.text_encoder = text_encoder
        self.unet = unet
        self.scheduler = scheduler
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    
    def get_text_embeds(self, text):
    
        # Tokenizing the text
        text_input = self.tokenizer(text, 
                                    padding='max_length', 
                                    max_length=self.tokenizer.model_max_length, 
                                    truncation=True, 
                                    return_tensors='pt')
                                    
        # Embedding the tokenized text
        with torch.no_grad():
            text_embeds = self.text_encoder(text_input.input_ids.to(self.device))[0] # Get embeddings
        return text_embeds


    def get_prompt_embeds(self, prompt):
        if isinstance(prompt, str):
            prompt = [prompt]
            
        # get conditional embeddings
        cond_embeds = self.get_text_embeds(prompt)
        
        # get unconditional embeddings
        uncond_embeds = self.get_text_embeds([''] * len(prompt))
        
        # concatenate the above 2 embeds
        prompt_embeds = torch.cat([uncond_embeds, cond_embeds])
        return prompt_embeds



    def decode_img_latents(self, img_latents):
        img_latents = 1 / self.vae.config.scaling_factor * img_latents
        with torch.no_grad():
            img = self.vae.decode(img_latents).sample
        
        img = (img / 2 + 0.5).clamp(0, 1)
        img = img.cpu().permute(0, 2, 3, 1).float().numpy()
        return img



    def transform_img(self, img):
    
        # scale images to [0, 255] and convert to int
        img = (img * 255).round().astype('uint8')
        
        # convert to PIL Image objects
        img = [Image.fromarray(i) for i in img]
        return img


    def encode_img_latents(self, img, latent_timestep):
        if not isinstance(img, list):
            img = [img]
        
        img = np.stack([np.array(i) for i in img], axis=0)
        
        # scale images to [-1, 1]
        img = 2 * ((img / 255.0) - 0.5)
        img = torch.from_numpy(img).float().permute(0, 3, 1, 2)
        img = img.to(self.device)

        # encode images
        img_latents_dist = self.vae.encode(img)
        img_latents = img_latents_dist.latent_dist.sample()
        
        # scale images
        img_latents = self.vae.config.scaling_factor * img_latents
        
        # add noise to the latents
        noise = torch.randn(img_latents.shape).to(self.device)
        img_latents = self.scheduler.add_noise(img_latents, noise, latent_timestep)

        return img_latents

Next we extend the base class. The new Depth2ImgPipeline class adds depth estimation on top of DiffusionPipeline and will be assigned to a variable that we later call on our target image.

class Depth2ImgPipeline(DiffusionPipeline):
    def __init__(self, 
                 vae, 
                 tokenizer, 
                 text_encoder, 
                 unet, 
                 scheduler, 
                 depth_feature_extractor, 
                 depth_estimator):
        
        super().__init__(vae, tokenizer, text_encoder, unet, scheduler)

        self.depth_feature_extractor = depth_feature_extractor
        self.depth_estimator = depth_estimator


    def get_depth_mask(self, img):
        if not isinstance(img, list):
            img = [img]

        width, height = img[0].size
        
        # pre-process the input image and get its pixel values
        pixel_values = self.depth_feature_extractor(img, return_tensors="pt").pixel_values

        # use autocast for automatic mixed precision (AMP) inference
        with autocast('cuda'):
            depth_mask = self.depth_estimator(pixel_values).predicted_depth
        
        # get the depth mask
        depth_mask = torch.nn.functional.interpolate(depth_mask.unsqueeze(1),
                                                     size=(height//8, width//8),
                                                     mode='bicubic',
                                                     align_corners=False)
        
        # scale the mask to [-1, 1]
        depth_min = torch.amin(depth_mask, dim=[1, 2, 3], keepdim=True)
        depth_max = torch.amax(depth_mask, dim=[1, 2, 3], keepdim=True)
        depth_mask = 2.0 * (depth_mask - depth_min) / (depth_max - depth_min) - 1.0
        depth_mask = depth_mask.to(self.device)

        # replicate the mask for classifier free guidance 
        depth_mask = torch.cat([depth_mask] * 2)
        return depth_mask



    # Denoising the image over the latent space
    def denoise_latents(self, 
                        img,
                        prompt_embeds,
                        depth_mask,
                        strength,
                        num_inference_steps=20,
                        guidance_scale=7.5,
                        height=512, width=512):
        
        # clip the value of strength to ensure strength lies in [0, 1]
        strength = max(min(strength, 1), 0)

        # compute timesteps
        self.scheduler.set_timesteps(num_inference_steps)

        init_timestep = int(num_inference_steps * strength)
        t_start = num_inference_steps - init_timestep
        
        timesteps = self.scheduler.timesteps[t_start: ]
        num_inference_steps = num_inference_steps - t_start

        latent_timestep = timesteps[:1].repeat(1)

        latents = self.encode_img_latents(img, latent_timestep)

        # use autocast for automatic mixed precision (AMP) inference
        with autocast('cuda'):
            for i, t in tqdm(enumerate(timesteps)):
                latent_model_input = torch.cat([latents] * 2)
                latent_model_input = torch.cat([latent_model_input, depth_mask], dim=1)
                
                # predict noise residuals
                with torch.no_grad():
                    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=prompt_embeds)['sample']

                # separate predictions for unconditional and conditional outputs
                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                
                # perform guidance
                noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

                # remove the noise from the current sample i.e. go from x_t to x_{t-1}
                latents = self.scheduler.step(noise_pred, t, latents)['prev_sample']

        return latents


    def __call__(self, 
                 prompt, 
                 img, 
                 strength=0.8,
                 num_inference_steps=50,
                 guidance_scale=7.5,
                 height=512, width=512):


        prompt_embeds = self.get_prompt_embeds(prompt)

        depth_mask = self.get_depth_mask(img)

        latents = self.denoise_latents(img,
                                       prompt_embeds,
                                       depth_mask,
                                       strength,
                                       num_inference_steps,
                                       guidance_scale,
                                       height, width)

        img = self.decode_img_latents(latents)

        img = self.transform_img(img)
        
        return img

We specify a device so the heavy work runs on the GPU; this notebook assumes a CUDA-capable GPU is available.

# Setting a GPU device
device = 'cuda'

Loading Functions for Preprocessing Text

Since the prompt is text, we need a way of converting it into a form the pipeline can work with. The tokenizer and text encoder handle this; alongside them we load the VAE, the U-Net, and the noise scheduler.

# Loading autoencoder for reconstructing the image
vae = AutoencoderKL.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='vae').to(device)

# Load the tokenizer and the text encoder
tokenizer = CLIPTokenizer.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='tokenizer')
text_encoder = CLIPTextModel.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='text_encoder').to(device)

# Load UNet model
unet = UNet2DConditionModel.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='unet').to(device)

# Load the PNDM noise scheduler, which controls the noise level at each denoising step
scheduler = PNDMScheduler(beta_start=0.00085, 
                          beta_end=0.012, 
                          beta_schedule='scaled_linear', 
                          num_train_timesteps=1000)
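
As a quick, hedged sanity check (the prompt string below is arbitrary), you can confirm that the tokenizer and text encoder loaded above work together and produce embeddings of the shape the pipeline expects:

# Sanity check: tokenize a sample prompt and embed it
sample = tokenizer("two red roses",
                   padding='max_length',
                   max_length=tokenizer.model_max_length,
                   truncation=True,
                   return_tensors='pt')
with torch.no_grad():
    sample_embeds = text_encoder(sample.input_ids.to(device))[0]
print(sample_embeds.shape)  # roughly [1, 77, hidden_size], depending on the checkpoint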

Using Dense Prediction Transformers (DPT)

DPT was first released by Intel Labs in 2021. It is a dense vision transformer: an architecture that leverages vision transformers instead of traditional convolutional backbones for dense prediction tasks such as depth estimation and semantic segmentation (its segmentation variant is trained on more than 150 classes). It splits the image into patch tokens, processes them with a transformer, and produces a prediction for every pixel, here a depth value.

[Image: the Dense Prediction Transformer (DPT) architecture]

You can find the official paper on DPT released by Intel here
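
For intuition, here is a hedged, standalone sketch of DPT depth estimation using the general-purpose Intel/dpt-large checkpoint and a public example image (both are assumptions for illustration; the pipeline in this article instead loads the depth components bundled with the Stable Diffusion 2 depth checkpoint, as shown next):

import requests
import torch
from PIL import Image
from transformers import DPTForDepthEstimation, DPTFeatureExtractor

# Standalone DPT depth estimation (illustrative only)
extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = extractor(images=image, return_tensors="pt")
with torch.no_grad():
    depth = model(**inputs).predicted_depth  # one relative-depth value per (downscaled) pixel
print(depth.shape)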

# Load the DPT depth estimator for predicting the relative depth of each pixel
depth_estimator = DPTForDepthEstimation.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='depth_estimator')

# Load DPT Feature Extractor for dense prediction
depth_feature_extractor = DPTFeatureExtractor.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='feature_extractor')

Creating an Instance of the Pipeline

We can now initialize our pipeline. This approach makes our code brief when calling our pipeline on the input image.

# Initializing pipeline
depth2img = Depth2ImgPipeline(vae, 
                              tokenizer, 
                              text_encoder, 
                              unet, 
                              scheduler,
                              depth_feature_extractor,
                              depth_estimator)

Loading Images

We need helper functions to validate the image source and load the image.

import urllib.parse as parse
import os
import requests

# Determine if a string is a URL
def check_url(string):
    try:
        result = parse.urlparse(string)
        return all([result.scheme, result.netloc, result.path])
    except:
        return False

We can now build a function to load the images.

# Load an image
def load_image(image_path):
    if check_url(image_path):
        return Image.open(requests.get(image_path, stream=True).raw)
    elif os.path.exists(image_path):
        return Image.open(image_path)

Generating New Images

We are now set to use our model. To generate new images with our pipeline, we load an image from a URL. Let us view a sample image.

# Getting an image URL
url = "https://img.freepik.com/free-vector/two-red-roses-white_1308-35268.jpg?size=626&ext=jpg&uid=R21895281&ga=GA1.1.821631087.1678296847&semt=ais"
img = load_image(url)
img
[Image: the loaded input image, two red roses]

Prompt = “two hibiscus flowers”

# Assigning Pipeline to prompt
depth2img("two hibiscus flowers", img)[0]
[Image: generated output for the prompt "two hibiscus flowers"]

Let us try another sample.

# Getting an image URL
url = "https://img.freepik.com/free-vector/cute-pink-bicycle-isolated_1284-43044.jpg?t=st=1684396069~exp=1684396669~hmac=fb265438f0680c00b7c156182201f5c15b602bd1733a5b051a2d9c77ff83a4fd"
img = load_image(url)
img
[Image: the loaded input image, a pink bicycle]

Prompt = “bicycle”

# Assigning Pipeline to prompt
depth2img("bicycle", img)[0]
[Image: generated output for the prompt "bicycle"]

You can experiment with different images and text. With more detailed prompts, you can generate new creative images.

Memory Complexity

It is vital to consider memory before we wrap up. You may encounter memory problems when running this customized pipeline in a constrained environment with limited CPU, RAM, or GPU resources. If you try this on the free tier of Google Colab, you are likely to face a "CUDA out of memory" error.

OutOfMemoryError: CUDA Out of Memory


To work around this common Stable Diffusion issue, there are a few quick escape routes. The first is to reduce the image resolution, which also spares processing time (the alternative is simply paying for more memory). The models are trained at a resolution of roughly 512 pixels; halving this to 256 saves a substantial amount of memory. In this custom pipeline the latent size follows the input image, so resizing the input image (and lowering the height/width defaults in the method signatures, as shown below) is the quickest fix.

# In denoise_latents() and __call__(), lower the default dimensions:
height=256, width=256): # reduced image dimensions
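
Another hedged option, assuming a CUDA GPU, is to load the two heaviest components in half precision, which roughly halves their memory footprint. Note that the custom pipeline above would then also need its image and latent tensors cast to float16 (for example inside encode_img_latents), so treat this as a starting point rather than a drop-in change.

# Half-precision loading (sketch; CUDA only, keep float32 on CPU)
vae = AutoencoderKL.from_pretrained('stabilityai/stable-diffusion-2-depth',
                                    subfolder='vae',
                                    torch_dtype=torch.float16).to(device)
unet = UNet2DConditionModel.from_pretrained('stabilityai/stable-diffusion-2-depth',
                                            subfolder='unet',
                                            torch_dtype=torch.float16).to(device)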

Conclusion

We have looked at building an image-to-image generation pipeline using depth2img pre-trained models. This Hugging Face-powered approach is an alternative to the prebuilt pipeline, which offers less customization. Building a pipeline on top of the pre-trained models makes things more adjustable. Utilizing the power of generative AI and Stable Diffusion could bring about a lot of changes and benefits to everyday business processes.

Key Takeaways

  • Hugging Face provides several ways to carry out image-to-image generation using pre-trained models and other available libraries.
  • The prompt is text, so it must be converted into a form the pipeline can read; the tokenizer and text encoder achieve this.
  • DPT is a dense vision transformer whose architecture leverages vision transformers instead of traditional convolutional networks.
  • Generating images is convenient with Hugging Face.


The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What is an image-to-image generation?

A. Image-to-image generation refers to the process of generating new images from existing ones while preserving certain attributes or transforming them into a different representation. It involves mapping input images to corresponding output images using various techniques like generative adversarial networks (GANs) or conditional variational autoencoders (CVAEs).

Q2. What is the process of image generation?

A. The process of image generation typically involves training a machine learning model on a dataset of images. This model learns the underlying patterns and structures of the images and can generate new images by sampling from the learned distribution. The process may involve preprocessing, feature extraction, model training, and post-processing steps.

Q3. What is depth-to-image in Stable Diffusion?

A. In Stable Diffusion, depth-to-image refers to generating realistic images from depth maps or information. It involves converting depth representations, which typically indicate the distance of objects from the camera, into visually plausible images with realistic details, textures, and colors. Stable Diffusion is a method that leverages diffusion models for this purpose, enabling high-quality depth-to-image synthesis.

Q4. What is image-to-image generation using Stable Diffusion?

A. Image-to-image generation using Stable Diffusion refers to the process of generating new images from existing ones by leveraging the Stable Diffusion method. This approach involves converting input images into a latent space representation, applying diffusion processes to manipulate the latent variables, and then mapping them back to the image space to generate visually coherent and diverse output images with desired attributes or transformations.

