Segmind Stable Diffusion (SSD-1B) vs SDXL 1.0 and SD 2.1, 2.0 or 1.5

Kousik Sasmal
7 min read · Nov 20, 2023


Segmind Stable Diffusion 1B (SSD-1B) generated image

What is Stable Diffusion?

Stable Diffusion is a generative AI model that uses diffusion to produce photorealistic images from text prompts. It is a latent diffusion model for text-to-image synthesis. Image-to-image generation is also possible by passing both a text prompt and an initial image.

Main Components of Stable Diffusion

  • Text encoder: Creates embeddings from text prompts; these embeddings condition the image generation.
  • Variational autoencoder (VAE): Encodes images into the latent space and decodes latents back into images.
  • U-Net: Predicts the noise during the reverse diffusion process.
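
If you use the Hugging Face diffusers library, these three components are exposed directly on the pipeline object. Below is a minimal sketch (not from the original post) that loads an example checkpoint and prints the class of each component:

# A minimal sketch: inspecting the main components of a Stable Diffusion
# pipeline with the Hugging Face diffusers library.
from diffusers import StableDiffusionPipeline

# "runwayml/stable-diffusion-v1-5" is used here purely as an example checkpoint
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

print(type(pipe.text_encoder).__name__)  # text encoder (CLIPTextModel)
print(type(pipe.vae).__name__)           # variational autoencoder (AutoencoderKL)
print(type(pipe.unet).__name__)          # U-Net noise predictor (UNet2DConditionModel)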

Text encoder

Stable Diffusion 1.5 used OpenAI’s CLIP (Contrastive Language-Image Pre-training) model as its text encoder. CLIP is an open-source pre-trained model, but its training dataset is not available in the public domain.

Stable Diffusion 2.0 used a new text encoder, OpenCLIP, trained on the publicly available LAION-5B dataset. An NSFW filter was used to drop adult content from LAION-5B.

Due to aggressive filtering of NSFW images, SD 2.0 had low human representation in its training data, making it challenging to obtain good human images. LAION-5B also contains fewer celebrity and artistic images than CLIP’s training data, making it harder to generate such content. Due to criticism and a preference for SD 1.5, Stable Diffusion 2.1 was released shortly after 2.0, addressing these limitations and featuring a less restrictive NSFW filter. See the comparison below:

Celebrity image example:

An image of Robert Downey Jr. (source)

Artistic image example:

A monster fighting a hero by Greg Rutkowski (source)

Stable Diffusion XL 1.0 uses two text encoders, OpenCLIP ViT-bigG in combination with CLIP ViT-L. This is one of its main highlights.

Since the Segmind Stable Diffusion Model (SSD-1B) is a distilled version of the Stable Diffusion XL 1.0, it also uses the two text encoders of SDXL 1.0.
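
As a quick check (a sketch, not part of the original post), the diffusers SDXL pipeline exposes both encoders as text_encoder and text_encoder_2:

# A minimal sketch: the SDXL pipeline in diffusers exposes both text encoders.
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")

print(type(pipe.text_encoder).__name__)    # CLIP ViT-L text encoder
print(type(pipe.text_encoder_2).__name__)  # OpenCLIP ViT-bigG text encoder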

Model Architecture

Stable Diffusion XL 1.0 uses a UNet backbone roughly three times larger than that of Stable Diffusion 1.5 or 2.1. This is another important highlight of SDXL 1.0.

From SDXL original paper: https://arxiv.org/abs/2307.01952
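
A rough way to verify this size difference yourself is to compare UNet parameter counts. The sketch below (not from the original post) assumes both checkpoints can be downloaded from the Hugging Face Hub; the numbers are printed rather than hard-coded:

# A minimal sketch: comparing UNet parameter counts of SD 2.1 and SDXL 1.0
from diffusers import UNet2DConditionModel

unet_sd21 = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet")
unet_sdxl = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet")

def count_params(model):
    return sum(p.numel() for p in model.parameters())

print(f"SD 2.1 UNet: {count_params(unet_sd21) / 1e9:.2f}B parameters")
print(f"SDXL UNet:   {count_params(unet_sdxl) / 1e9:.2f}B parameters")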

Unlike Stable Diffusion 1.5 or 2.1, SDXL introduces a two-stage process: the base model (which can also be run as a standalone model) generates an image that is passed to the refiner model, which adds additional high-quality details.

Visualization of the two-stage pipeline (source)

The Segmind Stable Diffusion Model (SSD-1B) is a distilled version of the Stable Diffusion XL (SDXL) 1.0 base model: it is 50% smaller, up to 60% faster, and maintains high-quality text-to-image generation. It has been trained on diverse datasets, including Grit and Midjourney scrape data, to enhance its ability to create a wide range of visual content from textual prompts.

SDXL vs SSD-1B (source)

Image quality

Stable Diffusion 1.5 can generate images of (512 x 512) resolution.

Stable Diffusion 2.0 or 2.1 generates images of higher resolution (768 x 768). It can also generate images of (512 x 512) resolution.

Stable Diffusion XL 1.0 generates (1024 x 1024) resolution images by default for the best results. (768 x 768) or (512 x 512) resolution images can be generated by setting the height and width parameters, but anything below (512 x 512) is not likely to work.
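
For example, with a diffusers pipeline the resolution is controlled through the height and width parameters of the call. A sketch (not from the original post), assuming `pipe` is an already loaded pipeline:

# A minimal sketch: requesting a non-default resolution
image = pipe(prompt="A beautiful farmer working in paddy field",
             height=768,
             width=768).images[0]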

Segmind Stable Diffusion (SSD-1B) has multi-resolution support.

SSD-1B Multi-Resolution Support (source)

Negative Prompt

A negative prompt is not necessary for Stable Diffusion 1.5 to get good images.

A negative prompt is indispensable for Stable Diffusion 2.x (2.0 or 2.1) to get a good result. While the normal prompt describes what you want to see in the image, the negative prompt describes what you don’t want.

A negative prompt is also needed for Segmind Stable Diffusion (SSD-1B).

Stable Diffusion Generated Images

I have generated images with SD 1.5, 2.0, 2.1, SDXL 1.0, and SSD-1B using the same prompt. A negative prompt was used for all models except SD 1.5. The image comparisons and the code are given below:

Code for image generation with SD 2.x or 1.5

################## For SD 2.x or 1.5  ######################
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

# example checkpoints: "runwayml/stable-diffusion-v1-5",
# "stabilityai/stable-diffusion-2" or "stabilityai/stable-diffusion-2-1"
model_id = "stabilityai/stable-diffusion-2-1"
scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")

# define the pipeline
pipe = StableDiffusionPipeline.from_pretrained(model_id,
                                               scheduler=scheduler,
                                               torch_dtype=torch.float16)
pipe = pipe.to("cuda")

seed = 42
prompt = "A beautiful farmer working in paddy field"
negative_prompt = "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy"
num_steps = 50
generator = torch.Generator(device='cuda').manual_seed(seed)
image = pipe(prompt=prompt,
             num_inference_steps=num_steps,
             negative_prompt=negative_prompt,
             generator=generator).images[0]

SD 1.5 generated image
SD 2.0 generated image
SD 2.1 generated image
SD 2.x and 1.5 generated images in one place

We can see that the image generated by SD 1.5 is more realistic, and the human depiction from SD 2.0 is not as good as that from SD 2.1.

Code for image generation with SDXL 1.0

################### For SDXL 1.0 #########################
import torch
from diffusers import DiffusionPipeline

# load base model
base_pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16").to("cuda")

# load refiner model (reuses the base model's second text encoder and VAE)
refiner_pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base_pipe.text_encoder_2,
    vae=base_pipe.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16").to("cuda")

####################################################################
seed = 42
prompt = "A beautiful farmer working in paddy field"
negative_prompt = "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy"
num_steps = 50

# run the base model for the first 70% of the denoising steps and hand
# its latents over to the refiner
generator = torch.Generator(device='cuda').manual_seed(seed)
base_image_latent = base_pipe(
    prompt=prompt,
    num_inference_steps=num_steps,
    negative_prompt=negative_prompt,
    generator=generator,
    denoising_end=0.7,
    output_type="latent").images
# if you want to use only the base model, don't provide the
# 'denoising_end' and 'output_type' parameters in base_pipe

####################################################################
# run the refiner on the base model's latents for the remaining 30% of steps
generator = torch.Generator(device='cuda').manual_seed(seed)
refined_image = refiner_pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=num_steps,
    denoising_start=0.7,
    image=base_image_latent,
    generator=generator).images[0]

SDXL 1.0 generated images

The SDXL 1.0 base model on its own produces good images. However, by incorporating the refiner model, we can generate even higher-quality images.

Code for image generation with SSD-1B

One can either load SSD-1B on their own GPU or use the Segmind API.
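
For the local route, here is a minimal sketch (not from the original post) of loading SSD-1B with diffusers, following the publicly documented segmind/SSD-1B checkpoint; the sampling settings are illustrative:

# Image generation with SSD-1B loaded on a local GPU (sketch)
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "segmind/SSD-1B",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16").to("cuda")

prompt = "A beautiful farmer working in paddy field"
# the same full negative prompt used in the earlier code blocks can be reused here
negative_prompt = "ugly, blurry, poorly drawn hands, bad anatomy, watermark"
generator = torch.Generator(device='cuda').manual_seed(42)
image = pipe(prompt=prompt,
             negative_prompt=negative_prompt,
             num_inference_steps=30,
             generator=generator).images[0]

The Segmind API route is shown below.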

# Image generation with Segmind API
import requests
from io import BytesIO
from PIL import Image

prompt = "A beautiful farmer working in paddy field"
negative_prompt = "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy"

api_key = "YOUR_API_KEY"
url = "https://api.segmind.com/v1/ssd-1b"

# Request payload
data = {
    "prompt": prompt,
    "negative_prompt": negative_prompt,
    "samples": 1,
    "scheduler": "UniPC",
    "num_inference_steps": 30,
    "guidance_scale": "9",
    "seed": "42",
    "img_width": "1024",
    "img_height": "1024",
    "base64": False
}

response = requests.post(url, json=data, headers={'x-api-key': api_key})

if response.status_code == 200:
    # the API returns the generated image bytes in the response body
    img = Image.open(BytesIO(response.content))
    img.save("ssd-1b_image.png")
else:
    print("Request failed:", response.status_code, response.text)

SSD-1B generated image

The human depiction is more accurate with SSD-1B than with the other Stable Diffusion models shown here; it performs better than SDXL 1.0 for human image generation. However, it’s worth noting that the background in SSD-1B generated images is not as high-quality as in SDXL 1.0.

Performance Graph

SDXL with the refiner model outperforms all the Stable Diffusion models that came before it in user-preference comparisons.

Comparing user preferences between SDXL and Stable Diffusion 1.5 & 2.1 (source)

I tested all of the models, including SSD-1B, for human image generation; SSD-1B outperformed every other model, including SDXL 1.0, for human depiction.

The Segmind team showed that SSD-1B is up to 60% faster than the base SDXL model. Below is a comparison on an A100 80GB.

Speed Comparison between SDXL 1.0 base and SSD-1B (source)
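
If you want to reproduce such a speed comparison yourself, a rough timing sketch is given below (not from the original post); it assumes sdxl_pipe and ssd_pipe have already been loaded on the same GPU, e.g. as in the earlier code blocks:

# A rough sketch for timing a diffusers pipeline on a CUDA GPU
import time
import torch

def time_pipeline(pipe, prompt, steps=30, runs=3):
    # warm-up run to exclude first-call overhead
    pipe(prompt=prompt, num_inference_steps=steps)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        pipe(prompt=prompt, num_inference_steps=steps)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs

prompt = "A beautiful farmer working in paddy field"
print("SDXL base:", time_pipeline(sdxl_pipe, prompt), "s/image")
print("SSD-1B:   ", time_pipeline(ssd_pipe, prompt), "s/image")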
