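ReVersion: Diffusion-Based Relation Inversion from Images

This repository contains the implementation of the following paper:

ReVersion: Diffusion-Based Relation Inversion from Images
Ziqi Huang∗, Tianxing Wu∗, Yuming Jiang, Kelvin C.K. Chan, Ziwei Liu

From MMLab@NTU, affiliated with S-Lab, Nanyang Technological University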

:open_book: Overview

overall_structure

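We propose a new task, Relation Inversion: given a few exemplar images in which a particular relation co-exists, we aim to find a relation prompt <R> that captures this interaction, and to apply the relation to new entities to synthesize new scenes. The images above were generated by our ReVersion framework.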

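:heavy_check_mark: Updates

[03/2024] We optimized the code implementation. You now only need to save and load the relation prompt, rather than the entire text-to-image model.
[08/2023] We released the training code for Relation Inversion.
[04/2023] We released the ReVersion benchmark.
[04/2023] Integrated into Hugging Face 🤗 using Gradio. Try out the online demo.
[03/2023] arXiv paper available.
[03/2023] Pre-trained models and relation prompts released.
[03/2023] Project page and video available.
[03/2023] Inference code released.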

:hammer: Installation

Clone the repository

git clone https://github.com/ziqihuangg/ReVersion
cd ReVersion

Create a Conda environment and install dependencies

conda create -n reversion
conda activate reversion
conda install python=3.8 pytorch==1.11.0 torchvision==0.12.0 cudatoolkit=11.3 -c pytorch
pip install "diffusers[torch]"
pip install -r requirements.txt

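:page_with_curl: Usage

Relation Inversion

Given a set of exemplar images and coarse descriptions of their entities, you can optimize a relation prompt **<R>** to capture the relation that co-exists in these images. This process is called Relation Inversion.

Prepare the exemplar images (e.g., 0.jpg - 9.jpg) and coarse descriptions (text.json), and put them in one folder. You can use our ReVersion benchmark or prepare your own images. Below is an example from our ReVersion benchmark: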

./reversion_benchmark_v1
├── painted_on
│   ├── 0.jpg
│   ├── 1.jpg
│   ├── 2.jpg
│   ├── 3.jpg
│   ├── 4.jpg
│   ├── 5.jpg
│   ├── 6.jpg
│   ├── 7.jpg
│   ├── 8.jpg
│   ├── 9.jpg
│   └── text.json
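
Before training on your own data, it can help to confirm that a folder matches this layout. The following is a minimal sketch and is not part of the repository: `check_relation_folder` and `IMAGE_SUFFIXES` are hypothetical names, and the set of accepted image extensions is an assumption. It only verifies that text.json parses as JSON and assumes nothing about its schema.

```python
import json
from pathlib import Path

# Assumption: which file extensions count as exemplar images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def check_relation_folder(folder):
    """Return the sorted image file names in `folder`, or raise if the layout is off."""
    folder = Path(folder)
    images = sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    if not images:
        raise ValueError(f"no exemplar images found in {folder}")

    text_file = folder / "text.json"
    if not text_file.is_file():
        raise FileNotFoundError(f"missing coarse descriptions: {text_file}")

    # Only verify that the file is valid JSON; its schema is not assumed here.
    with text_file.open(encoding="utf-8") as f:
        json.load(f)
    return images
```

For instance, `check_relation_folder("./reversion_benchmark_v1/painted_on")` would return the list of image file names if the folder is laid out as shown above.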

Taking the relation painted_on as an example, you can start training with the following script:

accelerate launch \
    --config_file="./configs/single_gpu.yml" \
    train.py \
    --seed="2023" \
    --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
    --train_data_dir="./reversion_benchmark_v1/painted_on" \
    --placeholder_token="<R>" \
    --initializer_token="and" \
    --train_batch_size="2" \
    --gradient_accumulation_steps="4" \
    --max_train_steps="3000" \
    --learning_rate='2.5e-04' --scale_lr \
    --lr_scheduler="constant" \
    --lr_warmup_steps="0" \
    --output_dir="./experiments/painted_on" \
    --save_steps="1000" \
    --importance_sampling \
    --denoise_loss_weight="1.0" \
    --steer_loss_weight="0.01" \
    --num_positives="4" \
    --temperature="0.07" \
    --only_save_embeds

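Here, train_data_dir is the path to the exemplar images and coarse descriptions, and output_dir is the path where the inverted relation and the experiment logs are saved. To generate relation-specific images, see the Generation section below.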
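
The training command combines train_batch_size, gradient_accumulation_steps, and --scale_lr, so the learning rate actually used may differ from the value passed on the command line. The sketch below works through the arithmetic under two assumptions: that train.py follows the convention of the Hugging Face diffusers textual-inversion example script, where --scale_lr multiplies the learning rate by the batch size, the accumulation steps, and the number of processes, and that the single-GPU config runs one process. The function names are hypothetical.

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Number of samples that contribute to one optimizer step."""
    return train_batch_size * gradient_accumulation_steps * num_processes


def scaled_learning_rate(base_lr, train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Learning rate after scaling, assuming the diffusers --scale_lr convention."""
    return base_lr * effective_batch_size(
        train_batch_size, gradient_accumulation_steps, num_processes
    )


# Values taken from the training command above; num_processes=1 is an assumption.
batch = effective_batch_size(train_batch_size=2, gradient_accumulation_steps=4)
lr = scaled_learning_rate(2.5e-4, train_batch_size=2, gradient_accumulation_steps=4)
print(f"effective batch size: {batch}, scaled learning rate: {lr:.1e}")
```

Under these assumptions, each optimizer step sees 8 samples and the learning rate is scaled to about 2e-3. If you change the batch size or the number of GPUs, the scaled value changes accordingly.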

Note that the only_save_embeds option saves only the relation prompt <R> instead of the entire Stable Diffusion model. You can decide whether to turn it on.

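:framed_picture: Generation

We can use the learned relation prompt <R> to generate relation-specific images with new objects, backgrounds, and styles.

You can obtain a learned <R> by running Relation Inversion on your own data, or use one of the pre-trained relation prompts that we provide for download.

Put the models (i.e., the learned relation prompts <R>) under ./experiments/ as follows: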

./experiments/
├── painted_on
│   ├── checkpoint-500
│   ...
│   └── model_index.json
├── carved_by
│   ├── checkpoint-500
│   ...
│   └── model_index.json
├── inside
│   ├── checkpoint-500
│   ...
│   └── model_index.json
...

Taking the relation painted_on as an example, you can use the following script to generate images from a single prompt, e.g., "cat <R> stone":
```
python inference.py \
--model_id ./experiments/painted_on \
--prompt "cat <R> stone" \
--placeholder_string "<R>" \
--num_samples 10 \
--guidance_scale 7.5 \
--only_load_embeds
```
Alternatively, write a list of prompts in ./templates/templates.py under the key $your_template_name, and generate images for every prompt in that list:
```
your_template_name='painted_on_examples'
python inference.py \
--model_id ./experiments/painted_on \
--template_name $your_template_name \
--placeholder_string "<R>" \
--num_samples 10 \
--guidance_scale 7.5 \
--only_load_embeds
```
Here, model_id is the model directory, num_samples is the number of images to generate for each prompt, and guidance_scale is the classifier-free guidance scale.
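
To make the role of guidance_scale concrete, here is a minimal sketch of the standard classifier-free guidance update, written with plain Python lists in place of tensors. It illustrates the general technique and is not code from inference.py; `apply_cfg` is a hypothetical name.

```python
def apply_cfg(noise_uncond, noise_cond, guidance_scale):
    """Combine unconditional and text-conditioned noise predictions.

    Standard classifier-free guidance: start from the unconditional prediction
    and move along the direction (conditional - unconditional), scaled by
    guidance_scale.
    """
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(noise_uncond, noise_cond)
    ]


# Toy two-element "noise predictions" for illustration only.
uncond = [0.0, 1.0]
cond = [1.0, 3.0]
print(apply_cfg(uncond, cond, guidance_scale=1.0))  # equals the conditional prediction
print(apply_cfg(uncond, cond, guidance_scale=7.5))  # pushed further toward the prompt
```

A scale of 1.0 reproduces the conditional prediction unchanged. Larger values such as 7.5 follow the prompt more strongly, usually at some cost in sample diversity.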

We provide several example templates for each relation in ./templates/templates.py, such as painted_on_examples, carved_by_examples, etc.
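
As a rough illustration of a named prompt list, the sketch below maps a template name to its prompts and checks that each prompt contains the placeholder. The dictionary `example_templates` and the helper `prompts_for` are hypothetical. The actual layout of ./templates/templates.py may differ, so check that file before adding your own entries.

```python
PLACEHOLDER = "<R>"

# Hypothetical stand-in for a template dictionary; the real contents of
# ./templates/templates.py may be organized differently.
example_templates = {
    "painted_on_examples": [
        "cat <R> stone",
        "michael jackson <R> wall",
    ],
}


def prompts_for(template_name, templates=example_templates):
    """Look up a prompt list by name and verify each prompt uses the placeholder."""
    prompts = templates[template_name]
    missing = [p for p in prompts if PLACEHOLDER not in p]
    if missing:
        raise ValueError(f"prompts without {PLACEHOLDER}: {missing}")
    return prompts


for prompt in prompts_for("painted_on_examples"):
    print(prompt)
```

A prompt that omits <R> would not exercise the learned relation at all, which is why the sketch rejects it.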

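Note that if you saved the entire model during inversion, i.e., without the only_save_embeds option, you should turn off only_load_embeds at inference time. The only_load_embeds option loads only the relation prompt <R> from the experiment folder, and automatically loads the rest of the Stable Diffusion model (including the embeddings of all other text tokens) from the default cache location that holds the pre-trained Stable Diffusion model.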
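
One way to keep the training and inference options consistent is to assemble the inference command programmatically. The sketch below is a hypothetical helper (`build_inference_command` is not part of the repository) that uses only the flags shown in the commands above and appends --only_load_embeds only when training saved the embeddings alone.

```python
import shlex


def build_inference_command(model_id, prompt, num_samples=10, guidance_scale=7.5,
                            saved_embeds_only=True, placeholder="<R>"):
    """Build the argument list for inference.py from the flags documented above."""
    cmd = [
        "python", "inference.py",
        "--model_id", model_id,
        "--prompt", prompt,
        "--placeholder_string", placeholder,
        "--num_samples", str(num_samples),
        "--guidance_scale", str(guidance_scale),
    ]
    # Mirror the training choice: load embeddings only if only they were saved.
    if saved_embeds_only:
        cmd.append("--only_load_embeds")
    return cmd


cmd = build_inference_command("./experiments/painted_on", "cat <R> stone")
# shlex.join quotes arguments such as the prompt so the line is shell-safe.
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Pass `saved_embeds_only=False` if training was run without only_save_embeds.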

:hugs: Gradio Demo

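We also provide a Gradio demo for testing our method through a UI. The demo supports relation-specific text-to-image generation on the fly. Launch it with the following command: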
```
python app_gradio.py
```
Alternatively, you can try the online demo.

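:art: Diverse Generation

You can also use the relation prompt <R> in more varied prompts to generate images with diverse backgrounds and styles. For example, your prompt could be "michael jackson <R> wall, in the desert", "cat <R> stone, on the beach", etc.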

diverse_results
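
Prompts of this form can be generated in bulk by pairing entities with context phrases. The sketch below is illustrative only: `diverse_prompts` is a hypothetical helper, and the "entity <R> entity, context" pattern is taken from the two example prompts above.

```python
from itertools import product


def diverse_prompts(entity_pairs, contexts, placeholder="<R>"):
    """Combine every (subject, object) pair with every context phrase."""
    return [
        f"{subject} {placeholder} {obj}, {context}"
        for (subject, obj), context in product(entity_pairs, contexts)
    ]


pairs = [("michael jackson", "wall"), ("cat", "stone")]
contexts = ["in the desert", "on the beach"]
for prompt in diverse_prompts(pairs, contexts):
    print(prompt)
```

Each resulting string can be passed to inference.py through --prompt, or collected into a template list.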

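:straight_ruler: The ReVersion Benchmark

The ReVersion benchmark consists of diverse relations and entities, along with a set of well-defined text descriptions.

Relations and Entities. We define ten representative object relations with different levels of abstraction, ranging from basic spatial relations (e.g., "on top of") and entity interactions (e.g., "shake hands") to abstract concepts (e.g., "is carved by"). A wide range of entities, such as animals, humans, and household items, are involved to further increase the diversity of the benchmark.
Exemplar Images and Text Descriptions. For each relation, we collect four to ten exemplar images containing different entities. We further annotate each exemplar image with several text templates at different levels of detail. These training templates can be used to optimize the relation prompt.
Benchmark Scenarios. For each relation, we design 100 inference templates composed of different object entities.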

:fountain_pen: Citation

If you find our repository useful for your research, please consider citing our paper:

@article{huang2023reversion,
     title={{ReVersion}: Diffusion-Based Relation Inversion from Images},
     author={Huang, Ziqi and Wu, Tianxing and Jiang, Yuming and Chan, Kelvin C.K. and Liu, Ziwei},
     journal={arXiv preprint arXiv:2303.13495},
     year={2023}
}