Stable Diffusion XL 1.0 vs Stable Diffusion 2 or 1.5
What is Stable Diffusion?
Stable Diffusion is a generative model in the field of AI that uses diffusion technology to produce photorealistic images from text prompts. It is a latent diffusion model for text-to-image synthesis that can generate images based on textual descriptions. Image-to-image generation is also possible by passing a text prompt and an initial image.
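As an illustration, image-to-image generation with the Hugging Face Diffusers library looks roughly like the sketch below; the checkpoint name, input file, and prompt are just example placeholders, not a prescribed setup.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# load an example checkpoint; any SD 1.x/2.x checkpoint works here
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

init_image = Image.open("sketch.png").convert("RGB")   # hypothetical initial image
image = pipe(prompt="a watercolor painting of a mountain village",
             image=init_image,
             strength=0.75).images[0]  # strength: how far to deviate from the input image
image.save("img2img_result.png")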
Main Components of Stable Diffusion
- Text encoder: Creates embeddings from the text prompt; these embeddings condition the image generation.
- Variational autoencoder (VAE): Encodes images into the latent space and decodes latents back into pixel images.
- U-Net: Predicts the noise to be removed at each step of the reverse diffusion process.
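With the Hugging Face Diffusers library, these three components can be inspected directly on a loaded pipeline. Below is a minimal sketch; the checkpoint name is just an example.
import torch
from diffusers import StableDiffusionPipeline

# load an example checkpoint; the pipeline bundles all three components
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)

print(type(pipe.text_encoder).__name__)  # text encoder: turns the prompt into embeddings
print(type(pipe.vae).__name__)           # VAE: maps images to/from the latent space
print(type(pipe.unet).__name__)          # U-Net: predicts noise during reverse diffusion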
Stable Diffusion 1.5
- Stable Diffusion 1.5 used OpenAI’s CLIP (Contrastive Language-Image Pre-training) model as its text encoder. CLIP is an open-source pre-trained model, but its training dataset has not been publicly released.
- It can generate images of (512 x 512) resolution.
Stable Diffusion 2.x (2.0 and 2.1)
- Stable Diffusion 2.x used a new text encoder, OpenCLIP, trained on the publicly available LAION-5B dataset.
- An NSFW filter was used to drop adult content from the LAION-5B dataset.
- It generates images at a higher resolution (768 x 768) and can also generate (512 x 512) images.
Important features introduced in SD 2.0
- Super-resolution Upscaler Diffusion Models
It can upscale a low-resolution image into a higher-resolution image by a factor of 4. Below is the official release statement from Stability AI:
“Stable Diffusion 2.0 also includes an Upscaler Diffusion model that enhances the resolution of images by a factor of 4. Below is an example of our model upscaling a low-resolution generated image (128x128) into a higher-resolution image (512x512). Combined with our text-to-image models, Stable Diffusion 2.0 can generate images with resolutions of 2048x2048–or even higher.”
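In the Diffusers library, this upscaler is available as a dedicated pipeline. Below is a minimal sketch; the input file name and prompt are illustrative placeholders.
import torch
from diffusers import StableDiffusionUpscalePipeline
from PIL import Image

upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16).to("cuda")

low_res_img = Image.open("low_res_128.png").convert("RGB")  # e.g. a 128x128 image
upscaled = upscaler(prompt="a white cat", image=low_res_img).images[0]
upscaled.save("upscaled_512.png")  # 4x the input resolution (512x512 here)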
- Depth-to-Image Diffusion Model
It infers the depth of an input image and then generates new images using both the text prompt and depth information. Below is the official release statement from Stability AI:
“Our new depth-guided stable diffusion model, called depth2img, extends the previous image-to-image feature from V1 with brand-new possibilities for creative applications. Depth2img infers the depth of an input image (using an existing model) and then generates new images using both the text and depth information.”
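The Diffusers library exposes this as a depth-to-image pipeline. Below is a minimal sketch; the input image and prompts are illustrative placeholders.
import torch
from diffusers import StableDiffusionDepth2ImgPipeline
from PIL import Image

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16).to("cuda")

init_image = Image.open("living_room.png").convert("RGB")  # hypothetical input image
image = pipe(prompt="a cozy wooden cabin interior",
             image=init_image,
             negative_prompt="blurry, low quality",
             strength=0.7).images[0]  # depth is estimated internally from init_image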
- Updated Inpainting Diffusion Model
Inpainting replaces or edits specific areas of an image. Inpainting relies on a mask (black & white) to determine which regions of an image to fill in (white pixels) and which to keep (black pixels). The white pixels are filled in by the prompt.
For inpainting, we need to provide the image, a text prompt, and a mask image. Below is the official release statement from Stability AI:
“We also include a new text-guided inpainting model, fine-tuned on the new Stable Diffusion 2.0 base text-to-image, which makes it super easy to switch out parts of an image intelligently and quickly.”
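A minimal inpainting sketch with the Diffusers library looks like the following; the image, mask, and prompt are illustrative placeholders.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16).to("cuda")

image = Image.open("dog_on_bench.png").convert("RGB")  # hypothetical source image
mask = Image.open("mask.png").convert("RGB")           # white pixels = region to repaint
result = pipe(prompt="a yellow cat sitting on a bench",
              image=image,
              mask_image=mask).images[0]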
Takeaways from Stable Diffusion 2.x
Due to aggressive filtering of NSFW images, the representation of humans in the SD 2.0 training data was low, which makes it hard to generate good images of humans.
LAION-5B also contains fewer celebrity images and fewer artistic images than CLIP’s training data, so it is more difficult to generate good celebrity or artistic images.
Because of these shortcomings relative to SD 1.5, community users criticized SD 2.0 and preferred SD 1.5 over 2.0.
Stable Diffusion 2.1 was released shortly after 2.0 with a less restrictive NSFW filter to address these issues.
A negative prompt is indispensable for getting good results with SD 2.x. While a normal prompt describes what you want to see in the image, a negative prompt describes what you don’t want.
Celebrity image example:
Artistic image example:
Stable Diffusion XL (SDXL) 1.0
From the abstract of the original SDXL paper:
“Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.”
The main highlights of SDXL are in the text encoder and U-Net.
- It uses two text encoders, OpenCLIP ViT-bigG in combination with CLIP ViT-L.
- It uses a 3 times larger U-Net backbone compared to previous Stable Diffusion models (such as SD 1.5 or 2.1).
By default, SDXL generates (1024 x 1024) resolution images for the best results. (768 x 768) or (512 x 512) resolution images can be generated by setting the height and width parameters, but anything below (512 x 512) is not likely to work.
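With the Diffusers library, the two text encoders can be inspected on the loaded SDXL pipeline, and a non-default resolution can be requested through the height and width parameters. Below is a minimal sketch; the prompt is just an example.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, use_safetensors=True, variant="fp16").to("cuda")

print(type(pipe.text_encoder).__name__)    # CLIP ViT-L text encoder
print(type(pipe.text_encoder_2).__name__)  # OpenCLIP ViT-bigG text encoder

# request a 768x768 image instead of the default 1024x1024
image = pipe(prompt="A mountain lake at sunrise", height=768, width=768).images[0]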
SDXL introduces a two-stage process: the base model (which can also be run on its own) generates an image that is passed as input to the refiner model, which adds additional high-quality details.
Stable Diffusion Generated Images
I have generated images with SD 1.5, 2.0, 2.1, and SDXL 1.0 using the same prompt. A negative prompt was used for all models except SD 1.5. The comparison of images and the code are given below:
Code for image generation with SD 2.x or 1.5
################## For SD 2.x or 1.5 ######################
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

# e.g. "stabilityai/stable-diffusion-2-1"; use "runwayml/stable-diffusion-v1-5" for SD 1.5
model_id = "stabilityai/stable-diffusion-2-1"
# one common scheduler choice, loaded from the checkpoint's own config
scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
# define the pipeline
pipe = StableDiffusionPipeline.from_pretrained(model_id,
                                               scheduler=scheduler,
                                               torch_dtype=torch.float16)
pipe = pipe.to("cuda")
seed = 42
prompt = "A beautiful farmer working in paddy field"
negative_prompt = "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy"
num_steps = 50
generator = torch.Generator(device='cuda').manual_seed(seed)
image = pipe(prompt=prompt,
             num_inference_steps=num_steps,
             negative_prompt=negative_prompt,
             generator=generator).images[0]
We can see that the image generated by SD 1.5 is more realistic, and the human depiction generated by SD 2.0 is not as good as that of SD 2.1.
Code for image generation with SDXL
################### For SDXL 1.0 #########################
import torch
from diffusers import DiffusionPipeline

# load base model
base_pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16").to("cuda")
# load refiner model (reuses the base model's second text encoder and VAE to save memory)
refiner_pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base_pipe.text_encoder_2,
    vae=base_pipe.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16").to("cuda")
####################################################################
seed = 42
prompt = "A beautiful farmer working in paddy field"
negative_prompt = "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy"
num_steps = 50
generator = torch.Generator(device='cuda').manual_seed(seed)
# run the base model for the first 70% of the denoising steps and return latents
base_image_latent = base_pipe(
    prompt=prompt,
    num_inference_steps=num_steps,
    negative_prompt=negative_prompt,
    generator=generator,
    denoising_end=0.7,
    output_type="latent").images
# If you want to use only the base model, there is no need to pass the
# 'denoising_end' and 'output_type' parameters to base_pipe.
####################################################################
# reuse the same prompt, negative prompt, and step count as above;
# re-seed the generator so that the refiner run is reproducible
generator = torch.Generator(device='cuda').manual_seed(seed)
refined_image = refiner_pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=num_steps,
    denoising_start=0.7,  # pick up where the base model left off
    image=base_image_latent,
    generator=generator).images[0]
The SDXL 1.0 base model on its own produces good images; incorporating the refiner model adds further high-quality detail.
Performance Graph
References:
- High-Resolution Image Synthesis with Latent Diffusion Models
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
- Hugging Face Diffusers library
- https://www.jonstokes.com/p/stable-diffusion-20-and-21-an-overview
- https://www.assemblyai.com/blog/stable-diffusion-1-vs-2-what-you-need-to-know/