visualwebarena

A benchmark platform for evaluating multimodal agents on realistic visual web tasks

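VisualWebArena is a realistic benchmark for evaluating multimodal autonomous language agents. It comprises a variety of complex web-based visual tasks that evaluate a broad range of agent capabilities. The project builds on the reproducible evaluation methodology of WebArena and includes a demo for running multimodal agents on arbitrary webpages; scripts for end-to-end training and environment resets are listed among its TODOs. It also publishes the trajectories of the GPT-4V + SoM agent on 910 tasks to support analysis and evaluation by researchers.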

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

[<a href="https://jykoh.com/vwa">Website</a>] [<a href="https://arxiv.org/abs/2401.13649">Paper</a>]

<i>VisualWebArena</i> is a realistic and diverse benchmark for evaluating multimodal autonomous language agents. It comprises a set of diverse and complex web-based visual tasks that evaluate various capabilities of autonomous multimodal agents. It builds on the reproducible, execution-based evaluation introduced in <a href="https://webarena.dev" target="_blank">WebArena</a>.

TODOs

  • Add human trajectories.
  • Add GPT-4V + SoM trajectories from our paper.
  • Add scripts for end-to-end training and reset of environments.
  • Add demo to run multimodal agents on any arbitrary webpage.

News

  • [08/05/2024]: Added an Amazon Machine Image with all VWA (and WA) websites pre-installed, so that you don't have to set them up yourself!
  • [03/08/2024]: Added the agent trajectories of our GPT-4V + SoM agent on the full set of 910 VWA tasks.
  • [02/14/2024]: Added a demo script for running the GPT-4V + SoM agent on any task on an arbitrary website.
  • [01/25/2024]: GitHub repo released with tasks and scripts for setting up the VWA environments.

Install

# Python 3.10 (or 3.11, but not 3.12, because 3.12 removed distutils, which is needed here)
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
playwright install
pip install -e .

You can also run the unit tests to ensure that VisualWebArena is installed correctly:

pytest -x
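
The install note above restricts the interpreter to Python 3.10 or 3.11. As a minimal sketch, that constraint could be checked before creating the virtual environment. The name `is_supported_python` is hypothetical and is not part of the VisualWebArena repository.

```python
import sys

# Interpreter versions named in the install note above. Python 3.12 is
# excluded because it no longer ships the distutils module.
SUPPORTED_MINORS = {(3, 10), (3, 11)}


def is_supported_python(version_info=sys.version_info):
    """Return True if the (major, minor) pair is one the install note allows."""
    return (version_info[0], version_info[1]) in SUPPORTED_MINORS


if __name__ == "__main__":
    major, minor = sys.version_info[0], sys.version_info[1]
    status = "supported" if is_supported_python() else "not supported"
    print(f"Python {major}.{minor} is {status} by the install note")
```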

End-to-end Evaluation

  1. Set up the standalone environments. Please check out this page for details.

  2. Configure the URLs for each website. First, export the DATASET to be visualwebarena:

export DATASET=visualwebarena

Then, set the URL for each website:

export CLASSIFIEDS="<your_classifieds_domain>:9980"
export CLASSIFIEDS_RESET_TOKEN="4b61655535e7ed388f0d40a93600254c"  # Default reset token for classifieds site, change if you edited its docker-compose.yml
export SHOPPING="<your_shopping_site_domain>:7770"
export REDDIT="<your_reddit_domain>:9999"
export WIKIPEDIA="<your_wikipedia_domain>:8888"
export HOMEPAGE="<your_homepage_domain>:4399"

In addition, if you want to run on the original WebArena tasks, make sure to also set up the CMS, GitLab, and map environments, and then set their respective environment variables:

export SHOPPING_ADMIN="<your_e_commerce_cms_domain>:7780/admin"
export GITLAB="<your_gitlab_domain>:8023"
export MAP="<your_map_domain>:3000"

  3. Generate config files for each test example:

python scripts/generate_test_data.py

You will see *.json files generated in the config_files folder. Each file contains the configuration for one test example.

  4. Obtain and save the auto-login cookies for all websites:

bash prepare.sh

  5. Set up API keys.

If using OpenAI models, set a valid OpenAI API key (starting with sk-) as the environment variable:

export OPENAI_API_KEY=your_key

If using Gemini, first install the gcloud CLI. Configure the API key by authenticating with Google Cloud:

gcloud auth login
gcloud config set project <your_project_name>
  6. Launch the evaluation. For example, to reproduce our GPT-3.5 captioning baseline:

python run.py \
  --instruction_path agent/prompts/jsons/p_cot_id_actree_3s.json \
  --test_start_idx 0 \
  --test_end_idx 1 \
  --result_dir <your_result_dir> \
  --test_config_base_dir=config_files/vwa/test_classifieds \
  --model gpt-3.5-turbo-1106 \
  --observation_type accessibility_tree_with_captioner

This script will run the first Classifieds example with the GPT-3.5 caption-augmented agent. The trajectory will be saved in <your_result_dir>/0.html. Note that the baselines that include a captioning model run on GPU by default (e.g., BLIP-2-T5XL as the captioning model will take up approximately 12GB of GPU VRAM).
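
Most failed runs of the steps above are likely to come from a missing environment variable or from config files that were never generated. The following is a minimal preflight sketch, not part of the repository: the helper names `missing_env_vars` and `count_config_files` are hypothetical, the variable list is copied from step 2, and the directory is the one passed to `--test_config_base_dir` in the example command.

```python
import os
from pathlib import Path

# Variables exported in step 2 above for the VisualWebArena sites.
REQUIRED_VARS = (
    "DATASET",
    "CLASSIFIEDS",
    "CLASSIFIEDS_RESET_TOKEN",
    "SHOPPING",
    "REDDIT",
    "WIKIPEDIA",
    "HOMEPAGE",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def count_config_files(config_dir):
    """Count the *.json files directly inside config_dir (0 if it is absent)."""
    path = Path(config_dir)
    if not path.is_dir():
        return 0
    return sum(1 for _ in path.glob("*.json"))


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    # Directory taken from the --test_config_base_dir flag in the example above.
    found = count_config_files("config_files/vwa/test_classifieds")
    print("Config files found:", found)
```

If you also run the original WebArena tasks, SHOPPING_ADMIN, GITLAB, and MAP would need to be added to the tuple.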

GPT-4V + SoM Agent

To run the GPT-4V + SoM agent we proposed in our paper, you can run evaluation with the following flags:

python run.py \
  --instruction_path agent/prompts/jsons/p_som_cot_id_actree_3s.json \
  --test_start_idx 0 \
  --test_end_idx 1 \
  --result_dir <your_result_dir> \
  --test_config_base_dir=config_files/vwa/test_classifieds \
  --model gpt-4-vision-preview \
  --action_set_tag som --observation_type image_som

To run Gemini models, you can change the provider, model, and the max_obs_length (as Gemini uses characters instead of tokens for inputs):

python run.py \
  --instruction_path agent/prompts/jsons/p_som_cot_id_actree_3s.json \
  --test_start_idx 0 \
  --test_end_idx 1 \
  --max_steps 1 \
  --result_dir <your_result_dir> \
  --test_config_base_dir=config_files/vwa/test_classifieds \
  --provider google --model gemini --mode completion --max_obs_length 15360 \
  --action_set_tag som --observation_type image_som

If you'd like to reproduce the results from our paper, we have also provided scripts in scripts/ to run the full evaluation pipeline on each of the VWA environments. For example, to reproduce the results from the Classifieds environment, you can run:

bash scripts/run_classifieds_som.sh
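
If you prefer to drive `run.py` from Python rather than from the provided shell scripts, the GPT-4V + SoM command above can be assembled programmatically. This is a rough sketch under stated assumptions: the helpers `build_run_command` and `batched_ranges` are hypothetical, the flags are copied from the command shown earlier, and `--test_end_idx` is assumed to be exclusive because indices 0 to 1 are described as running only the first example. The sketch only prints the commands and does not launch them.

```python
import shlex


def build_run_command(start_idx, end_idx, result_dir):
    """Build the argument list for the GPT-4V + SoM run shown above."""
    return [
        "python", "run.py",
        "--instruction_path", "agent/prompts/jsons/p_som_cot_id_actree_3s.json",
        "--test_start_idx", str(start_idx),
        "--test_end_idx", str(end_idx),
        "--result_dir", result_dir,
        "--test_config_base_dir=config_files/vwa/test_classifieds",
        "--model", "gpt-4-vision-preview",
        "--action_set_tag", "som",
        "--observation_type", "image_som",
    ]


def batched_ranges(total, batch_size):
    """Yield (start, end) index pairs covering range(total) in batches."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


if __name__ == "__main__":
    # 10 tasks in batches of 4 are illustrative numbers, not values from the repo.
    for start, end in batched_ranges(10, 4):
        command = build_run_command(start, end, f"results/{start}_{end}")
        print(shlex.join(command))
```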

Agent Trajectories

To facilitate analysis and evals, we have also released the trajectories of the GPT-4V + SoM agent on the full set of 910 VWA tasks here. The release consists of .html files that record the agent's observations and output at each step of the trajectory.
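
When browsing a directory of trajectories, it can help to order the files by task index rather than alphabetically. The sketch below assumes files are named `<index>.html`, which is inferred from the `<your_result_dir>/0.html` path mentioned earlier; the released files may be named differently, and `sort_trajectory_files` is a hypothetical helper.

```python
from pathlib import PurePath


def sort_trajectory_files(names):
    """Return the .html names whose stem is an integer, ordered by that integer."""
    numbered = []
    for name in names:
        path = PurePath(name)
        # Skip anything that is not of the assumed form <index>.html.
        if path.suffix == ".html" and path.stem.isdecimal():
            numbered.append((int(path.stem), name))
    return [name for _, name in sorted(numbered)]


if __name__ == "__main__":
    # Hypothetical listing; real file names may differ.
    print(sort_trajectory_files(["10.html", "2.html", "notes.txt", "0.html"]))
    # prints ['0.html', '2.html', '10.html']
```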

Demo

We have also prepared a demo for you to run the agents on your own task on an arbitrary webpage. An example is shown above where the agent is tasked to find the best Thai restaurant in Pittsburgh.

After following the setup instructions above and setting the OpenAI API key (the other environment variables for website URLs aren't really used, so you should be able to set them to dummy values), you can run the GPT-4V + SoM agent with the following command:

python run_demo.py \
  --instruction_path agent/prompts/jsons/p_som_cot_id_actree_3s.json \
  --start_url "https://www.amazon.com" \
  --image "https://media.npr.org/assets/img/2023/01/14/this-is-fine_wide-0077dc0607062e15b476fb7f3bd99c5f340af356-s1400-c100.jpg" \
  --intent "Help me navigate to a shirt that has this on it." \
  --result_dir demo_test_amazon \
  --model gpt-4-vision-preview \
  --action_set_tag som --observation_type image_som \
  --render

This tasks the agent with finding a shirt on Amazon that looks like the provided image (the "This is fine" dog). Have fun!

Human Evaluations

We collected human trajectories on 233 tasks (one from each template type) and the Playwright recording files are provided here. These are the same tasks reported in our paper (with a human success rate of ~89%). You can view the HTML pages, actions, etc., by running playwright show-trace <example_id>.zip. The example_id follows the same structure as the examples from the corresponding site in config_files/.

Citation

If you find our environment or our models useful, please consider citing <a href="https://jykoh.com/vwa" target="_blank">VisualWebArena</a> as well as <a href="https://webarena.dev/" target="_blank">WebArena</a>:

@article{koh2024visualwebarena,
  title={VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks},
  author={Koh, Jing Yu and Lo, Robert and Jang, Lawrence and Duvvur, Vikram and Lim, Ming Chong and Huang, Po-Yu and Neubig, Graham and Zhou, Shuyan and Salakhutdinov, Ruslan and Fried, Daniel},
  journal={arXiv preprint arXiv:2401.13649},
  year={2024}
}

@article{zhou2024webarena,
  title={WebArena: A Realistic Web Environment for Building Autonomous Agents},
  author={Zhou, Shuyan and Xu, Frank F and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Bisk, Yonatan and Fried, Daniel and Alon, Uri and others},
  journal={ICLR},
  year={2024}
}

Acknowledgements

Our code is heavily based on the <a href="https://github.com/web-arena-x/webarena">WebArena codebase</a>.
