Segmind Stable Diffusion (SSD-1B) vs SDXL 1.0 and SD 2.1, 2.0 or 1.5
What is Stable Diffusion?
Stable Diffusion is a generative AI model that uses diffusion techniques to produce photorealistic images from text prompts. It is a latent diffusion model for text-to-image synthesis, generating images from textual descriptions. Image-to-image generation is also possible by passing a text prompt together with an initial image.
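As a concrete illustration of the image-to-image mode, here is a minimal sketch using the Hugging Face Diffusers library; the checkpoint id, input file name, and strength value are example choices, not settings taken from this post.

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Example checkpoint; any Stable Diffusion checkpoint in diffusers format works
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

# The initial image guides the composition; the prompt guides the content
init_image = Image.open("init.png").convert("RGB").resize((512, 512))
image = pipe(prompt="A beautiful farmer working in paddy field",
             image=init_image,
             strength=0.75).images[0]
image.save("img2img_result.png")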
Main Components of Stable Diffusion
- Text encoder: Creates an embedding from the text prompt, and that embedding conditions the image generation.
- Variational autoencoder (VAE): Encodes images into the latent space and decodes latents back into images.
- U-Net: Predicts the noise during the reverse diffusion process (see the code sketch just below for how these components appear in the Diffusers library).
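In the Hugging Face Diffusers library, these three components are exposed as attributes of the pipeline object. A minimal sketch, assuming the SD 1.5 checkpoint as an example:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)

print(type(pipe.text_encoder).__name__)  # text encoder (CLIPTextModel)
print(type(pipe.vae).__name__)           # variational autoencoder (AutoencoderKL)
print(type(pipe.unet).__name__)          # U-Net (UNet2DConditionModel)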
Text encoder
Stable Diffusion 1.5 used OpenAI’s CLIP (Contrastive Language-Image Pre-training) model as a text encoder. It is an open-source pre-trained model, but the training dataset of CLIP is not available in the public domain.
Stable Diffusion 2.0 used a new text encoder, OpenCLIP, trained on the LAION-5B dataset, which is publicly available. An NSFW filter was used to drop adult content from LAION-5B before training.
Due to this aggressive filtering of NSFW images, SD 2.0 had low human representation in its training data, making it challenging to obtain good human images. Compared with CLIP's training data, LAION-5B also contains fewer celebrity and artist-style images, making such content harder to generate well. In response to this criticism and to users' continued preference for SD 1.5, Stable Diffusion 2.1 was released shortly after 2.0, addressing these limitations with a less restrictive NSFW filter. See the comparison below:
Celebrity image example:
Artistic image example:
Stable Diffusion XL 1.0 uses two text encoders, OpenCLIP ViT-bigG in combination with CLIP ViT-L. This is one of its main highlights.
Since the Segmind Stable Diffusion Model (SSD-1B) is a distilled version of the Stable Diffusion XL 1.0, it also uses the two text encoders of SDXL 1.0.
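Assuming the SDXL 1.0 base checkpoint, the two text encoders can be inspected directly on the Diffusers pipeline; this is only an illustrative sketch:

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, use_safetensors=True, variant="fp16")

print(type(pipe.text_encoder).__name__)    # first encoder, CLIP ViT-L
print(type(pipe.text_encoder_2).__name__)  # second encoder, OpenCLIP ViT-bigG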
Model Architecture
Stable Diffusion XL 1.0 uses a UNet backbone roughly three times larger than that of Stable Diffusion 1.5 or 2.1. This is another important highlight of SDXL 1.0.
Unlike Stable Diffusion 1.5 or 2.1, SDXL introduces a two-stage process: the base model (which can also be run standalone) generates an image that is passed as input to the refiner model, which adds additional high-quality details.
The Segmind Stable Diffusion Model (SSD-1B) is a distilled version of the Stable Diffusion XL (SDXL) 1.0 base model; it is 50% smaller and up to 60% faster while maintaining high-quality text-to-image generation. It has been trained on diverse datasets, including Grit and Midjourney scrape data, to enhance its ability to create a wide range of visual content from textual prompts.
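One rough way to see these size differences for yourself is to count UNet parameters with Diffusers. The sketch below is only an illustration (it downloads each UNet, so it needs disk space and RAM); it is not a benchmark from this post.

import torch
from diffusers import UNet2DConditionModel

def unet_param_count(repo_id):
    # Load only the UNet sub-module of a diffusers-format repository
    unet = UNet2DConditionModel.from_pretrained(
        repo_id, subfolder="unet", torch_dtype=torch.float16)
    return sum(p.numel() for p in unet.parameters())

for repo in ["runwayml/stable-diffusion-v1-5",
             "stabilityai/stable-diffusion-xl-base-1.0",
             "segmind/SSD-1B"]:
    print(repo, f"{unet_param_count(repo) / 1e9:.2f}B UNet parameters")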
Image quality
Stable Diffusion 1.5 can generate images of (512 x 512) resolution.
Stable Diffusion 2.0 or 2.1 generates images of higher resolution (768 x 768). It can also generate images of (512 x 512) resolution.
Stable Diffusion XL 1.0 generates (1024 x 1024) resolution images by default for the best results. (768 x 768) or (512 x 512) resolution images can be generated by setting the height and width parameters, but anything below (512 x 512) is not likely to work.
Segmind Stable Diffusion (SSD-1B) has multi-resolution support.
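Both SDXL 1.0 and SSD-1B accept explicit height and width arguments in the pipeline call. A minimal sketch with the SDXL base model, where 768 x 768 is chosen purely as an example:

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, use_safetensors=True, variant="fp16").to("cuda")

# Default output is 1024 x 1024; request a different size explicitly
image = pipe(prompt="A beautiful farmer working in paddy field",
             height=768, width=768).images[0]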
Negative Prompt
A negative prompt is not necessary for Stable Diffusion 1.5 to get good images.
A negative prompt is indispensable for Stable Diffusion 2.x (2.0 or 2.1) to get good results. While a normal prompt describes what you want to see in the image, a negative prompt describes what you don't want.
A negative prompt is also needed with Segmind Stable Diffusion (SSD-1B).
Stable Diffusion-generated Images
I generated images with SD 1.5, 2.0, 2.1, SDXL 1.0, and SSD-1B using the same prompt. A negative prompt was used for every model except SD 1.5. The image comparison and the code are given below:
Code for image generation with SD 2.x or 1.5
################## For SD 2.x or 1.5 ######################
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# choose the checkpoint, e.g. "runwayml/stable-diffusion-v1-5",
# "stabilityai/stable-diffusion-2" or "stabilityai/stable-diffusion-2-1"
model_id = "stabilityai/stable-diffusion-2-1"
# the scheduler shown here is an example choice
scheduler = DPMSolverMultistepScheduler.from_pretrained(model_id, subfolder="scheduler")

# define the pipeline
pipe = StableDiffusionPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

seed = 42
prompt = "A beautiful farmer working in paddy field"
negative_prompt = "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy"
num_steps = 50
generator = torch.Generator(device='cuda').manual_seed(seed)
image = pipe(prompt=prompt,
             num_inference_steps=num_steps,
             negative_prompt=negative_prompt,
             generator=generator).images[0]
We can see that the image generated by SD 1.5 is more realistic, and the human depiction from SD 2.0 is not as good as that from SD 2.1.
Code for image generation with SDXL 1.0
################### For SDXL 1.0 #########################
import torch
from diffusers import DiffusionPipeline

# load base model
base_pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16").to("cuda")

# load refiner model (reusing the base model's second text encoder and VAE)
refiner_pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base_pipe.text_encoder_2,
    vae=base_pipe.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16").to("cuda")
####################################################################
seed = 42
prompt = "A beautiful farmer working in paddy field"
negative_prompt = "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy"
num_steps = 50
generator = torch.Generator(device='cuda').manual_seed(seed)
base_image_latent = base_pipe(
    prompt=prompt,
    num_inference_steps=num_steps,
    negative_prompt=negative_prompt,
    generator=generator,
    denoising_end=0.7,
    output_type="latent").images
# if you want to use only the base model, there is no need to provide
# the 'denoising_end' and 'output_type' parameters to base_pipe
####################################################################
seed = 42
prompt = "A beautiful farmer working in paddy field"
negative_prompt = "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy"
num_steps = 50
generator = torch.Generator(device='cuda').manual_seed(seed)
refined_image = refiner_pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=num_steps,
    denoising_start=0.7,
    image=base_image_latent,
    generator=generator).images[0]
The SDXL 1.0 base model alone produces good images; incorporating the refiner model yields even higher-quality results.
Code for image generation with SSD-1B
One can either load SSD-1B on their own GPU or use the Segmind API.
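For the local-GPU route, the segmind/SSD-1B model card shows usage with the Diffusers StableDiffusionXLPipeline; a minimal sketch along those lines (with a shortened version of the negative prompt used in this post):

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "segmind/SSD-1B", torch_dtype=torch.float16,
    use_safetensors=True, variant="fp16").to("cuda")

prompt = "A beautiful farmer working in paddy field"
# shortened negative prompt, used only for illustration
negative_prompt = "ugly, poorly drawn hands, bad anatomy, blurry, watermark"
image = pipe(prompt=prompt, negative_prompt=negative_prompt).images[0]
image.save("ssd_1b_result.png")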
# Image generation with Segmind API
import requests
prompt = "A beautiful farmer working in paddy field"
negative_prompt = "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy"
api_key = "YOUR_API_KEY"
url = "https://api.segmind.com/v1/ssd-1b"
# Request payload
data = {
    "prompt": prompt,
    "negative_prompt": negative_prompt,
    "samples": 1,
    "scheduler": "UniPC",
    "num_inference_steps": 30,
    "guidance_scale": "9",
    "seed": "42",
    "img_width": "1024",
    "img_height": "1024",
    "base64": False
}

response = requests.post(url, json=data, headers={'x-api-key': api_key})
if response.status_code == 200:
    from PIL import Image
    from io import BytesIO
    image_data = response.content
    img = Image.open(BytesIO(image_data))
The human depiction is more accurate with SSD-1B than with the other Stable Diffusion models shown here; it performs better than SDXL 1.0 for human image generation. However, it's worth noting that the background in SSD-1B-generated images is not as high-quality as in those from SDXL 1.0.
Performance Graph
SDXL with the refiner model outperforms all earlier Stable Diffusion models.
I tested all the models, including SSD-1B, for human image generation; SSD-1B outperformed them all, including SDXL 1.0, for human depiction.
The Segmind team showed that SSD-1B is up to 60% faster than the base SDXL model. Below is a comparison on an A100 80GB.
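To reproduce a latency comparison of this kind yourself, a rough timing sketch along the following lines could be used; the repository ids, step count, and number of runs are assumptions, and the actual numbers will depend on hardware and settings.

import time
import torch
from diffusers import StableDiffusionXLPipeline

prompt = "A beautiful farmer working in paddy field"

def average_latency(repo_id, n_runs=3, steps=30):
    pipe = StableDiffusionXLPipeline.from_pretrained(
        repo_id, torch_dtype=torch.float16,
        use_safetensors=True, variant="fp16").to("cuda")
    pipe(prompt, num_inference_steps=steps)  # warm-up run
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        pipe(prompt, num_inference_steps=steps)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

for repo in ["stabilityai/stable-diffusion-xl-base-1.0", "segmind/SSD-1B"]:
    print(repo, f"{average_latency(repo):.2f} s per image")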
References:
- High-Resolution Image Synthesis with Latent Diffusion Models
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
- Hugging Face Diffusers library
- https://huggingface.co/segmind/SSD-1B
- https://www.jonstokes.com/p/stable-diffusion-20-and-21-an-overview
- https://www.assemblyai.com/blog/stable-diffusion-1-vs-2-what-you-need-to-know/