stable-diffusion-2

Stable Diffusion 2: A Breakthrough Text-to-Image Generation Model

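Stable Diffusion 2 is a text-to-image generation model developed by Robin Rombach and Patrick Esser. It is a diffusion-based model that generates and modifies images from text prompts. The model was trained further starting from Stable Diffusion 2 Base, with the aim of producing higher-quality and more varied images.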

Model Features

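Uses a fixed, pretrained text encoder (OpenCLIP-ViT/H)
Built on the Latent Diffusion Model architecture
Supports image generation at 768x768 resolution
Trained with the v-objective, which is intended to improve image quality
Offers several special-purpose checkpoints, such as depth-conditioned generation and inpainting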
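
To make the latent-diffusion point concrete, the sketch below computes the shape of the latent tensor that the UNet would operate on for a given image size. It assumes an autoencoder downsampling factor of 8 and 4 latent channels, values commonly cited for Stable Diffusion's autoencoder but not stated in this text. The function name `latent_shape` is hypothetical and used only for illustration.

```python
def latent_shape(height, width, downsample_factor=8, latent_channels=4):
    """Return (channels, height, width) of the latent for an image size.

    Assumes the autoencoder shrinks each spatial side by `downsample_factor`
    and produces `latent_channels` channels (8 and 4 are assumed defaults).
    """
    if height % downsample_factor or width % downsample_factor:
        raise ValueError("height and width must be divisible by the downsampling factor")
    return (latent_channels, height // downsample_factor, width // downsample_factor)


print(latent_shape(768, 768))  # (4, 96, 96)
```

Under these assumptions a 768x768 RGB image (1,769,472 values) is represented by 36,864 latent values, a 48-fold reduction, which is why diffusion in latent space is cheaper than diffusion in pixel space.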

Training Data and Procedure

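The training data for Stable Diffusion 2 comes from a subset of the LAION-5B dataset. To reduce inappropriate content, the researchers filtered it with LAION's NSFW detector. Training proceeds through the following steps: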

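Encode images into latent representations
Encode the text prompt with OpenCLIP-ViT/H
Feed the text encoder's output into the UNet backbone through cross-attention
Optimize with a reconstruction objective together with the v-objective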
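
The last step can be illustrated with a small sketch of the v-objective, assuming its usual definition from the progressive-distillation literature: with alpha = sqrt(alpha_bar) and sigma = sqrt(1 - alpha_bar), the noisy latent is z_t = alpha * x0 + sigma * noise and the training target is v = alpha * noise - sigma * x0. The function names, the plain-list "latent", and the value alpha_bar=0.6 are illustrative assumptions, not the actual training code.

```python
import math
import random


def v_target(x0, noise, alpha_bar):
    """Build the noisy latent z_t and the v-prediction target.

    alpha_bar is the cumulative signal fraction in (0, 1), so that
    alpha = sqrt(alpha_bar) and sigma = sqrt(1 - alpha_bar).
    """
    alpha = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    z_t = [alpha * x + sigma * e for x, e in zip(x0, noise)]
    v = [alpha * e - sigma * x for x, e in zip(x0, noise)]
    return z_t, v


def recover_from_v(z_t, v, alpha_bar):
    """Invert the parameterization: recover x0 and the noise from (z_t, v)."""
    alpha = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    x0 = [alpha * z - sigma * w for z, w in zip(z_t, v)]
    noise = [sigma * z + alpha * w for z, w in zip(z_t, v)]
    return x0, noise


def mse(pred, target):
    """Mean squared error between a prediction and its target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(4)]     # stand-in for a flattened latent
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]  # Gaussian noise added to it
z_t, v = v_target(x0, noise, alpha_bar=0.6)
x0_rec, noise_rec = recover_from_v(z_t, v, alpha_bar=0.6)
```

In training, a network would predict `v` from `z_t` and the text embedding, and `mse(prediction, v)` would be minimized. Because `recover_from_v` inverts the mapping exactly, a v-prediction carries the same information as predicting the noise or the clean latent directly.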

Usage

Stable Diffusion 2 can be used through Hugging Face's Diffusers library. A simple example:

from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler
import torch

model_id = "stabilityai/stable-diffusion-2"

# Load the Euler scheduler configuration shipped with the model
scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
# Load the pipeline in half precision and move it to the GPU
pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate one image from the text prompt and save it
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
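
The pipeline call also accepts explicit generation settings. The sketch below gathers them in a small helper that checks the image size before anything is sent to the GPU. `height`, `width`, `num_inference_steps`, and `guidance_scale` are existing keyword arguments of the pipeline call; the helper name `build_generation_kwargs` and the chosen default values are assumptions made for illustration.

```python
def build_generation_kwargs(prompt, height=768, width=768,
                            num_inference_steps=50, guidance_scale=7.5):
    """Collect keyword arguments for a pipeline call, validating the size first."""
    # The pipeline expects image sides that are divisible by 8
    if height % 8 != 0 or width % 8 != 0:
        raise ValueError("height and width must be divisible by 8")
    return {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
    }


kwargs = build_generation_kwargs("a photo of an astronaut riding a horse on mars")
# With the `pipe` object from the example above, the call would be:
# image = pipe(**kwargs).images[0]
```

Fewer inference steps shorten generation at some cost in detail, and a higher guidance scale makes the image follow the prompt more closely, so both are worth tuning per use case.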