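Exploring inverse scaling in large language model performance

The Inverse Scaling Prize contest aims to find tasks on which large language models show inverse scaling: tasks where performance gets worse, not better, as model size grows. Finding such tasks helps reveal potential problems with language model pretraining and scaling, and matters for the safe and responsible use of models. The contest evaluates submitted tasks and includes the strongest ones in a benchmark, giving language model research new insight.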

<p align="center"> <img src="docs/promo-image.png" alt="Two graphs, one with regular scaling marked 'Many tasks like this', and one with inverse scaling marked 'Any tasks like this?'" width=500px/> </p>

Inverse Scaling Prize

TL;DR: Win up to $100,000 for finding an important task where larger language models do worse.

Submissions due August 27, 2022 (Round 1) and October 27, 2022 (Round 2).

The contest has ended! Results: Round 1, Round 2.

Recent changes

11 October, 2023

21 March, 2023

  • Updated prize pool info

1 March, 2023

17 December, 2022

  • Updated prize eligibility for FAR employees

12 December, 2022

  • Added prize terms update to the ‘Prize information’ section
  • Updated ‘About us’

9 October, 2022

  • Added Huggingface Hub evaluation setup description to the tips section

4 October, 2022

  • BUG FIX: Reported total probabilities should now be more accurate for all classification tasks

26 September, 2022

  • Demonstrating positive scaling on the ‘incorrect’ answer is now allowed
  • Added stronger recommendation to aim for roughly 1000 examples
  • Added requirement to name the task
  • Added request to include data for control experiments
  • Added field to specify how the data was generated
  • Added field for links to dataset sources
  • Added field for code that generated dataset
  • Added requirement that submissions in multiple parts should upload all .csv files together in one .zip
  • Added request to make file names anonymous
  • Added option to use a variable number of classes in classification datasets
  • Added print out to colabs of the total probability given to class labels
  • Added reminder that the submitted plot should be from our official colab
  • Added request that people edit their form submission rather than resubmit to update
  • Added reminder to specify correct behavior on the task
  • Added field to specify whether the task is zero-shot or few-shot
  • Updated terms and conditions

Motivation

As language models get larger, they seem to only get better. Larger language models score better on benchmarks and unlock new capabilities like arithmetic [1], few-shot learning [1], and multi-step reasoning [2]. However, language models are not without flaws, exhibiting many biases [3] and producing plausible misinformation [4]. The purpose of this contest is to find evidence for a stronger failure mode: tasks where language models get worse as they become better at language modeling (next word prediction).

The standard paradigm in natural language processing today is to pretrain large language models to autocomplete text corpora. The resulting models are then either frozen and used directly for other tasks (zero-shot or using few-shot learning), or additionally trained on other tasks (fine-tuning). We focus on the case of zero-shot/few-shot evaluation on downstream tasks without task-specific gradient optimization: it's typically easier to use in practice and to study.

Scaling laws [5][6] show that language models get predictably better (in terms of test loss and downstream performance [7]) as the number of parameters, amount of compute used, and dataset size increase. The improvement follows a power law in each of parameters, compute, and dataset size. We hypothesize that there are tasks with trends in the opposite direction: task performance gets monotonically, predictably worse as the overall test loss of the language model improves. We call this phenomenon inverse scaling, in contrast with the standard scaling laws. There are some tasks that appear to show inverse scaling under some conditions [4][8][10], but such tasks appear to be rare.
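As a rough illustration of the power-law form referred to above (the symbols are ours and are not taken from the cited papers), write $L$ for test loss and $N$ for the number of parameters, with constants $N_c > 0$ and $\alpha_N > 0$:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \log L(N) \approx \alpha_N \left(\log N_c - \log N\right).$$

On a log-log plot this is a straight line with slope $-\alpha_N$, so loss falls steadily as $N$ grows. In this notation, inverse scaling on a task would show up as a task loss whose slope against $\log N$ has the opposite sign, i.e. a loss that rises as the model gets larger.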

This contest aims to find inverse scaling tasks, especially those of importance to the safe and responsible use of language models. We hope that task submissions will teach us more about what types of tasks exhibit inverse scaling; inverse scaling tasks will also highlight potential issues with the current paradigm of language model pretraining and scaling. Inverse scaling tasks are important because they represent a mismatch between the behavior we want language models to exhibit and the behavior we get in practice from the training objectives and data we use. As language models continue to get bigger and are used in more real-world applications, it is important that they are not getting worse or harming users in yet-undetected ways.

After two rounds of the contest, we will write a survey of the submitted tasks and other examples found in the literature. Authors of winning tasks will be awarded prize money and invited to be co-authors on the resulting paper. Below, we detail our call for submissions. Feel free to join our Slack to message us with questions, find collaborators, and participate in contest-related discussions with other participants (code, ideas, findings, and related work sharing).

Prize information

2023/03/21 Update: The prize pool has been funded by Open Philanthropy.

We will award up to $250,000 in total prize money for task submissions, distributed as follows:

  1. Up to 1 Grand Prize of $100,000.
  2. Up to 5 Second Prizes of $20,000 each.
  3. Up to 10 Third Prizes of $5,000 each.

All prize decisions will be made by the organizers and anonymous reviewers, using the Prize Rubric below. Prize winners may nominate a non-profit to receive the prize money on their behalf. Some prizes may remain unawarded if there are not enough tasks that meet the eligibility for a prize tier, as detailed in the Prize Rubric.

Benchmark and Co-authorship: Authors of prize-winning submissions will be invited as co-authors on the paper written after the contest concludes. We will also offer co-authorship to authors of submissions that met our acceptability criteria but did not receive prizes, in the event that we receive more acceptable submissions than we can award with prizes. We will include all accepted submissions in our final benchmark, which we plan to release to the research community after the contest.

Timeline: The contest begins on June 27, 2022. We will host a first round of evaluations on submissions received on or before August 27, 2022 (Anywhere on Earth) and a second, final round of evaluations on submissions received on or before October 27, 2022 (Anywhere on Earth). After the first round, we will award eligible tasks with third prizes (up to 5) and second prizes (up to 2). To help improve first-round submissions, we will also return reviewer feedback and scaling law plots/results from our private evaluation models. Submissions will be paused for two weeks at the end of the first round to allow any necessary improvements to be made. At the end of the second round, we will reward eligible tasks at all prize tiers, with the possibility of upgrading first-round submissions to higher prize tiers based on both rounds of submissions.

<a name="prize-rubric"></a>Prize Rubric

Here, we detail our submission evaluation rubric. The rubric will guide an anonymous panel of reviewers in judging submissions for prizes. A submission must meet all criteria in the "Grand Prize" column to win the grand prize. Likewise, a submission must meet all criteria in the "Accepted Task" column to be accepted into our benchmark and for co-authorship on our paper. For second prizes, submissions must meet all "Accepted Task" criteria and some "Grand Prize" criteria. Third prizes must meet the "Accepted Task" criteria. We may receive more eligible submissions than we have prizes for a given tier. In this case, we will first break ties based on how many “Grand Prize” criteria are met and then by having reviewers make subjective rankings within tiers (e.g., more granular measures of how much various criteria are met or the relative difficulty or importance of each criterion met). We will consider inverse scaling trends on publicly-available models like GPT-3, as well as held-out, private models for which we will run evaluation.

| Criterion | Description | Prize Tier: No Prize | Prize Tier: Accepted Task | Prize Tier: Grand Prize |
| --- | --- | --- | --- | --- |
| Inverse Scaling Strength | How straight and steep is the inverse scaling trend on public models? | Shows flat, very bumpy, or standard scaling. | Shows approximately monotonic inverse scaling. | Shows a clear, strictly monotonic inverse scaling trend. |
| Inverse Scaling Generality | Do different models all show inverse scaling? | No inverse scaling on private models. | Shows inverse scaling on some public and some private models. | Shows inverse scaling across all public and private models tested. |
| Task Importance | Is the task important to the safe and responsible use of LMs, or for shedding light on where LMs fail? How strong are the arguments? | Weak. No users or third parties would be harmed, and the task does not shed light on where LMs fail. | Fairly convincing. Some LM users or third parties would be harmed by the discovered behavior, or the task sheds light on where LMs fail (e.g., sensitivity to prompts). | Very convincing. Significant implications for how LM research or deployment will need to be developed to be reliably safe and effective. |
| Novelty and Surprisingness | Is inverse scaling on the task novel (not shown in prior work) and surprising? | Not novel or surprising. | Novel and somewhat surprising. | Novel and surprising, teaching us something new about LMs. |
| Task Coverage | Are the examples fully representative of the described task? | Examples only cover a special subcategory or phrasing of the described task. There's no evidence of inverse scaling on other subcategories or phrasings. | Examples cover different subcategories and phrasings for the described task. | Examples cover almost all important task subcategories and phrasings, suggesting robust inverse scaling on the described task. |
| Reproducibility | Does inverse scaling appear to occur if we reproduce the task based on its description? | No, we see flat, very bumpy, or standard scaling. The particular examples submitted may have been over-optimized for inverse scaling, to the extent that the examples are unrepresentative of the described task. | Yes, but to a lesser extent. | Yes, to a similar or stronger extent. |
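
The tier rules in the rubric can be read as a small decision procedure. The sketch below is only an illustration of that reading, not the reviewers' actual process: the criterion keys, the `Level` enum, and the function name are hypothetical, and "some Grand Prize criteria" is assumed to mean at least one. It reports the highest tier a task would be eligible for; whether a prize is actually awarded still depends on the number of prizes available and on reviewer tie-breaking.

```python
from enum import IntEnum
from typing import Mapping


class Level(IntEnum):
    """Rubric column a task reaches on one criterion."""
    NO_PRIZE = 0
    ACCEPTED = 1
    GRAND = 2


# Hypothetical short keys for the six rubric criteria.
CRITERIA = (
    "strength", "generality", "importance",
    "novelty", "coverage", "reproducibility",
)


def highest_eligible_tier(levels: Mapping[str, Level]) -> str:
    """Return the highest tier the task is eligible for under the rubric."""
    missing = [c for c in CRITERIA if c not in levels]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    scores = [levels[c] for c in CRITERIA]
    # Any criterion below "Accepted Task" rules the task out entirely.
    if any(s < Level.ACCEPTED for s in scores):
        return "no prize"
    n_grand = sum(s == Level.GRAND for s in scores)
    if n_grand == len(scores):
        return "grand prize"   # all criteria at the Grand Prize level
    if n_grand >= 1:
        return "second prize"  # assumption: "some" means at least one
    return "third prize"       # all criteria at the Accepted Task level
```

For example, a task rated "Accepted Task" on every criterion except a "Grand Prize" rating for novelty would come out as eligible for a second prize.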

Answering the below, optional questions in our submission form (in the free-form response) will make your task stand out more:

  • Does inverse scaling persist even if the model is conditioned with few-shot examples to behave correctly? If providing enough few-shot examples eliminates inverse scaling, how many examples are required for that?
  • Does inverse scaling persist even after fine-tuning on the task? Are there good reasons to think it would persist after fine-tuning?
  • Does inverse scaling persist for InstructGPT models trained with Reinforcement Learning from Human Feedback (RLHF)? To test this, you can use the same code as that for GPT-3 evaluation. We may also evaluate submissions on private RLHF models of various sizes from Anthropic [Bai et al. 2022].

We reserve the right to update the prize tier standards or criteria, e.g., between rounds if we observe submissions gaming them in some way.

Evaluation Eligibility: To be eligible for official review, a task submission must:

  1. Include a plot of loss vs. model size across ada, babbage, curie, and davinci GPT-3 models, using the provided code for GPT-3 evaluation. The plot must not show a standard scaling law. A very bumpy trend is okay for submission; we expect cleaner scaling laws on our held-out evaluation models, which show clear scaling trends.
  2. Meet the formatting requirements described in the Submission Guidelines.
    • This requirement should already be satisfied if you are able to successfully run the evaluation code.
  3. Include a coherent description of the task.
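
Before submitting, it can help to sanity-check the trend numerically as well as by eye. The sketch below is not the official evaluation code (the official plot must still come from the provided colab); the function names, the `tolerance` parameter, and the loss values in the usage note are made up for illustration. It takes per-model losses ordered from the smallest model to the largest and labels the trend using strict step-by-step monotonicity, which is stricter than the "approximately monotonic" wording in the rubric.

```python
from typing import Sequence

# GPT-3 models named in the eligibility rule, ordered smallest to largest.
GPT3_ORDER = ("ada", "babbage", "curie", "davinci")


def classify_trend(losses: Sequence[float], tolerance: float = 0.0) -> str:
    """Label a loss curve whose entries run from smallest to largest model.

    "inverse"       : loss rises at every step (larger models do worse)
    "standard"      : loss falls at every step (larger models do better)
    "flat or bumpy" : anything else
    """
    if len(losses) < 2:
        raise ValueError("need losses for at least two model sizes")
    steps = [later - earlier for earlier, later in zip(losses, losses[1:])]
    if all(step > tolerance for step in steps):
        return "inverse"
    if all(step < -tolerance for step in steps):
        return "standard"
    return "flat or bumpy"


def eligible_for_review(losses: Sequence[float]) -> bool:
    """The plot must not show a standard scaling law; bumpy trends are okay."""
    return classify_trend(losses) != "standard"
```

With made-up losses such as `[0.52, 0.61, 0.70, 0.93]` for the four models in `GPT3_ORDER`, `classify_trend` returns `"inverse"`; reversing the list gives `"standard"`, which would not be eligible.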

<a name="models"></a>Models

This contest uses pretrained autoregressive language models such as GPT-3. We offer Google colab notebooks for evaluating inverse scaling with the GPT-3, OPT, and GPT-2 model series when developing a task. However, to avoid overfitting to publicly available models, we use private models to run the evaluations for awarding prizes. Currently, we are using the series of pretrained language models (without additional finetuning) from Anthropic [Bai et al. 2022]. We are in discussions with other organizations to use their models, which may be added later on to strengthen the evaluation.

<a name="reviewers"></a>Reviewers

Prize decisions will be made by an anonymous panel of reviewers. Reviewers will be selected by the contest organizers and will have ML and NLP experience relevant to inverse scaling. The panel may include some contest organizers. Reviewers will not be allowed to make submissions to the contest.

<a name="submission"></a>Submission guidelines

  1. Each task submission should be a language modeling test set (in the style of BIG-Bench) of inputs with corresponding answers, which will be evaluated according to one of four evaluation metrics (detailed later).
  2. This prize is to incentivize original work, so submissions should find a new phenomenon for which inverse scaling has not been previously documented.
    1. If a task has already shown inverse scaling in prior work (even if the original authors did not identify it as such) then it is ineligible for the contest.
    2. If an existing task has not been subjected to any kind of scaling analysis, then it is likely eligible for the contest.
    3. If you would like to check whether an existing task is eligible, message us on our Slack or email us at inverse.scaling@gmail.com with [PRIOR WORK] in the subject line and a link to where the task has previously been published.
  3. Data must be formatted as a .zip containing .csv files.
    • The zip should be called <task_name>.zip, where task_name is the name you provide for your submission in the form (e.g. lambada.zip).
    • The file should be called <task_name>.csv (e.g. lambada.csv).
      • If you have multiple parts to the same task, add -PART<i> to each (e.g. lambada-PART1.csv, lambada-PART2.csv).
      • If you have control experiments, add these as <task_name>-CONTROL<i>.csv (e.g. lambada-CONTROL1.csv).
    • The .csv files will be read using the pandas package, using the default arguments.
    • Specific formats are given below in the Evaluation metrics section.
  4. Examples will be given as a prompt to an autoregressive language model.
    • I.e., either zero-shot or few-shot prompts (prompts containing a few examples). Few-shot examples must demonstrate the correct behavior on the task.
  5. Tasks must contain at least 300 examples.
    • We strongly recommend aiming for on the order of 1000 examples so that inverse scaling trends are clearer.
    • Submissions with unclear scaling trends and close to the minimum number of examples are unlikely to win a prize.
  6. In the submission form, you will be asked to add:
    1. Evaluation metric used
      • The metric should be one of:
        1. Classification loss in a multiple-choice format (classification).
        2. Loss on a sequence at the end of the prompt (sequence_prob).
        3. Difference in logodds between two possible responses (logodds).
        4. Absolute difference in logodds between two possible responses (absolute_logodds).
    2. Authors
    3. Task name
    4. Description of intended task
      • What is the task aiming to test?
      • Remember to explain what good behavior
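
The file naming and size rules above can be checked with a short script before uploading. This is a minimal sketch using only the Python standard library, not an official tool: the helper names are hypothetical, and it assumes the 300-example minimum applies to the total across non-control files, which the guidelines do not spell out. Column layouts are left to the caller because they depend on the chosen evaluation metric.

```python
import csv
import io
import zipfile
from pathlib import Path
from typing import Dict, List

MIN_EXAMPLES = 300  # minimum stated in the guidelines


def expected_csv_names(task_name: str, n_parts: int = 1, n_controls: int = 0) -> List[str]:
    """File names following the <task_name>, -PART<i>, and -CONTROL<i> rules."""
    if n_parts < 1:
        raise ValueError("need at least one data file")
    if n_parts == 1:
        names = [f"{task_name}.csv"]
    else:
        names = [f"{task_name}-PART{i}.csv" for i in range(1, n_parts + 1)]
    names += [f"{task_name}-CONTROL{i}.csv" for i in range(1, n_controls + 1)]
    return names


def count_examples(csv_text: str) -> int:
    """Number of non-empty data rows, treating the first row as the header."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    return max(len(rows) - 1, 0)


def write_submission_zip(task_name: str, files: Dict[str, str], out_dir: Path) -> Path:
    """Write <task_name>.zip containing the given {file name: csv text} entries."""
    # Assumption: the minimum counts examples across all non-control files.
    n_main = sum(
        count_examples(text) for name, text in files.items() if "-CONTROL" not in name
    )
    if n_main < MIN_EXAMPLES:
        raise ValueError(f"only {n_main} examples; at least {MIN_EXAMPLES} are required")
    zip_path = Path(out_dir) / f"{task_name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return zip_path
```

Since the organizers read the files with pandas using default arguments, loading each .csv with `pandas.read_csv(path)` and inspecting the result is a sensible final check that the first row is picked up as the header.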
