<p align="center"> <img src="docs/promo-image.png" alt="Two graphs, one with regular scaling marked 'Many tasks like this', and one with inverse scaling marked 'Any tasks like this?'" width=500px/> </p>

Inverse Scaling Prize

TL;DR: Win up to $100,000 for finding an important task where larger language models do worse.

Submissions due August 27, 2022 (Round 1) and October 27, 2022 (Round 2).

The contest has ended! Results: Round 1, Round 2.

Recent changes

11 October, 2023

21 March, 2023

  • Updated prize pool info

1 March, 2023

17 December, 2022

  • Updated prize eligibility for FAR employees

12 December, 2022

  • Added prize terms update to the ‘Prize information’ section
  • Updated ‘About us’

9 October, 2022

  • Added Huggingface Hub evaluation setup description to the tips section

4 October, 2022

  • BUG FIX: Reported total probabilities should now be more accurate for all classification tasks

26 September, 2022

  • Demonstrating positive scaling on the ‘incorrect’ answer is now allowed
  • Added stronger recommendation to aim for roughly 1000 examples
  • Added requirement to name the task
  • Added request to include data for control experiments
  • Added field to specify how the data was generated
  • Added field for links to dataset sources
  • Added field for code that generated dataset
  • Added requirement that submissions in multiple parts should upload all .csv files together in one .zip
  • Added request to make file names anonymous
  • Added option to use a variable number of classes in classification datasets
  • Added print out to colabs of the total probability given to class labels
  • Added reminder that the submitted plot should be from our official colab
  • Added request that people edit their form submission rather than resubmit to update
  • Added reminder to specify correct behavior on the task
  • Added field to specify whether the task is zero-shot or few-shot
  • Updated terms and conditions

Motivation

As language models get larger, they seem to only get better. Larger language models score better on benchmarks and unlock new capabilities like arithmetic [1], few-shot learning [1], and multi-step reasoning [2]. However, language models are not without flaws, exhibiting many biases [3] and producing plausible misinformation [4]. The purpose of this contest is to find evidence for a stronger failure mode: tasks where language models get worse as they become better at language modeling (next word prediction).

The standard paradigm in natural language processing today is to pretrain large language models to autocomplete text corpora. The resulting models are then either frozen and used directly for other tasks (zero-shot or using few-shot learning), or additionally trained on other tasks (fine-tuning). We focus on the case of zero-shot/few-shot evaluation on downstream tasks without task-specific gradient optimization: it's typically easier to use in practice and to study.
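
For concreteness, here is a minimal sketch of the two evaluation settings on a hypothetical true/false task (the wording is ours, not an official contest example). Few-shot examples must demonstrate the correct behavior, as noted in the submission guidelines below.

```python
# Zero-shot: the model sees only the test input.
zero_shot_prompt = (
    "Question: Is the following statement true or false?\n"
    "Statement: The Eiffel Tower is in Berlin.\n"
    "Answer:"
)

# Few-shot: the same test input, preceded by solved examples that
# demonstrate the correct behavior on the task.
few_shot_prompt = (
    "Question: Is the following statement true or false?\n"
    "Statement: Water boils at 100 degrees Celsius at sea level.\n"
    "Answer: true\n\n"
    "Question: Is the following statement true or false?\n"
    "Statement: The Eiffel Tower is in Berlin.\n"
    "Answer:"
)

print(zero_shot_prompt)
print(few_shot_prompt)
```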

Scaling laws [5][6] show that language models get predictably better (in terms of test loss and downstream performance [7]) as the number of parameters, amount of compute used, and dataset size increase. The improvement follows a power law in each of parameters, compute, and dataset size. We hypothesize that there are tasks with trends in the opposite direction: task performance gets monotonically, predictably worse as the overall test loss of the language model improves. We call this phenomenon inverse scaling, in contrast with the standard scaling laws. There are some tasks that appear to show inverse scaling under some conditions [4][8][10], but such tasks appear to be rare.
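
Concretely, a power law L(N) ∝ N^(-α) is a straight line in log-log space, so the direction of a scaling trend can be read off a simple linear fit of log loss against log parameter count. A minimal sketch with made-up numbers (the constants are illustrative, not taken from [5]):

```python
import numpy as np

# Hypothetical (parameter count, task loss) pairs for four model sizes.
params = np.array([3.5e8, 1.3e9, 6.7e9, 1.75e11])
loss = np.array([2.9, 2.6, 2.3, 2.0])  # made-up values

# Fit log(loss) = a * log(params) + b. A negative slope means the usual
# standard scaling; a positive slope would indicate inverse scaling.
a, b = np.polyfit(np.log(params), np.log(loss), deg=1)
print(f"log-log slope: {a:.3f} -> {'standard' if a < 0 else 'inverse'} scaling")
```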

This contest aims to find inverse scaling tasks, especially those of importance to the safe and responsible use of language models. We hope that task submissions will teach us more about what types of tasks exhibit inverse scaling; inverse scaling tasks will also highlight potential issues with the current paradigm of language model pretraining and scaling. Inverse scaling tasks are important because they represent a mismatch between the behavior we want language models to exhibit and the behavior we get in practice from the training objectives and data we use. As language models continue to get bigger and to be used in more real-world applications, it is important that they are not getting worse, or harming users, in ways that have so far gone undetected.

After two rounds of the contest, we will write a survey of the submitted tasks and other examples found in the literature. Authors of winning tasks will be awarded prize money and invited to be co-authors on the resulting paper. Below, we detail our call for submissions. Feel free to join our Slack to message us with questions, find collaborators, and participate in contest-related discussions with other participants (sharing code, ideas, findings, and related work).

Prize information

2023/03/21 Update: The prize pool has been funded by Open Philanthropy.

We will award up to $250,000 in total prize money for task submissions, distributed as follows:

  1. Up to 1 Grand Prize of $100,000.
  2. Up to 5 Second Prizes of $20,000 each.
  3. Up to 10 Third Prizes of $5,000 each.

All prize decisions will be made by the organizers and anonymous reviewers, using the Prize Rubric below. Prize winners may nominate a non-profit to receive the prize money on their behalf. Some prizes may remain unawarded if there are not enough tasks that meet the eligibility for a prize tier, as detailed in the Prize Rubric.

Benchmark and Co-authorship: Authors of prize-winning submissions will be invited as co-authors on the paper written after the contest concludes. We will also offer co-authorship to authors of submissions that met our acceptability criteria but did not receive prizes, in the event that we receive more acceptable submissions than we can award with prizes. We will include all accepted submissions in our final benchmark, which we plan to release to the research community after the contest.

Timeline: The contest begins on June 27, 2022. We will host a first round of evaluations on submissions received on or before August 27, 2022 (Anywhere on Earth) and a second, final round of evaluations on submissions received on or before October 27, 2022 (Anywhere on Earth). After the first round, we will award eligible tasks with third prizes (up to 5) and second prizes (up to 2). To help improve first-round submissions, we will also return reviewer feedback and scaling law plots/results from our private evaluation models. Submissions will be paused for two weeks at the end of the first round to allow any necessary improvements to be made. At the end of the second round, we will award eligible tasks at all prize tiers, with the possibility of upgrading first-round submissions to higher prize tiers based on both rounds of submissions.

<a name="prize-rubric"></a>Prize Rubric

Here, we detail our submission evaluation rubric. The rubric will guide an anonymous panel of reviewers in judging submissions for prizes. A submission must meet all criteria in the "Grand Prize" column to win the grand prize. Likewise, a submission must meet all criteria in the "Accepted Task" column to be accepted into our benchmark and for co-authorship on our paper. For second prizes, submissions must meet all "Accepted Task" criteria and some "Grand Prize" criteria. Third prizes must meet the "Accepted Task" criteria. We may receive more eligible submissions than we have prizes for a given tier. In this case, we will first break ties based on how many “Grand Prize” criteria are met and then by having reviewers make subjective rankings within tiers (e.g., more granular measures of how much various criteria are met or the relative difficulty or importance of each criterion met). We will consider inverse scaling trends on publicly-available models like GPT-3, as well as held-out, private models for which we will run evaluation.

| Criterion | Description | No Prize | Accepted Task | Grand Prize |
| --- | --- | --- | --- | --- |
| Inverse Scaling Strength | How straight and steep is the inverse scaling trend on public models? | Shows flat, very bumpy, or standard scaling. | Shows approximately monotonic inverse scaling. | Shows a clear, strictly monotonic inverse scaling trend. |
| Inverse Scaling Generality | Do different models all show inverse scaling? | No inverse scaling on private models. | Shows inverse scaling on some public and some private models. | Shows inverse scaling across all public and private models tested. |
| Task Importance | Is the task important to the safe and responsible use of LMs, or for shedding light on where LMs fail? How strong are the arguments? | Weak. No users or third parties would be harmed, and the task does not shed light on where LMs fail. | Fairly convincing. Some LM users or third parties would be harmed by the discovered behavior, or the task sheds light on where LMs fail (e.g., sensitivity to prompts). | Very convincing. Significant implications for how LM research or deployment will need to be developed to be reliably safe and effective. |
| Novelty and Surprisingness | Is inverse scaling on the task novel (not shown in prior work) and surprising? | Not novel or surprising. | Novel and somewhat surprising. | Novel and surprising, teaching us something new about LMs. |
| Task Coverage | Are the examples fully representative of the described task? | Examples only cover a special subcategory or phrasing of the described task. There's no evidence of inverse scaling on other subcategories or phrasings. | Examples cover different subcategories and phrasings for the described task. | Examples cover almost all important task subcategories and phrasings, suggesting robust inverse scaling on the described task. |
| Reproducibility | Does inverse scaling appear to occur if we reproduce the task based on its description? | No; we see flat, very bumpy, or standard scaling. The particular examples submitted may have been over-optimized for inverse scaling, to the extent that they are unrepresentative of the described task. | Yes, but to a lesser extent. | Yes, to a similar or stronger extent. |
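
To make the "Inverse Scaling Strength" row concrete: order the per-model task losses from smallest to largest model, and ask whether every step gets worse (strict) or whether the trend is upward with only minor reversals (approximate). A minimal sketch of such a check; the 75% threshold is our own illustration, not the official rubric's definition:

```python
def inverse_scaling_strength(losses):
    """Classify a trend of task losses over models ordered small -> large.

    Higher loss = worse task performance, so inverse scaling means losses
    increase with model size. The threshold below is illustrative only.
    """
    steps = [b - a for a, b in zip(losses, losses[1:])]
    if all(s > 0 for s in steps):
        return "strictly monotonic inverse scaling"
    if sum(steps) > 0 and sum(s > 0 for s in steps) >= 0.75 * len(steps):
        return "approximately monotonic inverse scaling"
    return "flat, bumpy, or standard scaling"

print(inverse_scaling_strength([1.1, 1.3, 1.4, 1.9]))        # strict
print(inverse_scaling_strength([1.1, 1.3, 1.25, 1.6, 1.9]))  # approximate
```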

Answering the optional questions below in our submission form (in the free-form response) will make your task stand out more:

  • Does inverse scaling persist even if the model is conditioned with few-shot examples to behave correctly? If providing enough few-shot examples eliminates inverse scaling, how many examples are required for that?
  • Does inverse scaling persist even after fine-tuning on the task? Are there good reasons to think it would persist after fine-tuning?
  • Does inverse scaling persist for InstructGPT models trained with Reinforcement Learning from Human Feedback (RLHF)? To test this, you can use the same code as that for GPT-3 evaluation. We may also evaluate submissions on private RLHF models of various sizes from Anthropic [Bai et al. 2022].

We reserve the right to update the prize tier standards or criteria, e.g., between rounds if we observe submissions gaming them in some way.

Evaluation Eligibility: To be eligible for official review, a task submission must:

  1. Include a plot of loss vs. model size across the ada, babbage, curie, and davinci GPT-3 models, using the provided code for GPT-3 evaluation (a sketch of this kind of plot follows this list). The plot must not show a standard scaling law. A very bumpy trend is okay for submission; we expect to observe cleaner trends with our held-out evaluation models, which show clear scaling behavior.
  2. Meet the formatting requirements described in the Submission Guidelines.
    • This requirement should already be satisfied if you are able to successfully run the evaluation code.
  3. Include a coherent description of the task.
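
For intuition, here is a minimal sketch of the kind of figure required. The parameter counts for the four GPT-3 API models are unofficial community estimates, and the losses are made up; the plot you actually submit must come from the official colab (per the 26 September change above).

```python
import matplotlib.pyplot as plt

# Unofficial parameter-count estimates for the four GPT-3 API models,
# paired with hypothetical mean task losses (replace with colab outputs).
model_params = {"ada": 3.5e8, "babbage": 1.3e9, "curie": 6.7e9, "davinci": 1.75e11}
mean_loss = {"ada": 1.05, "babbage": 1.21, "curie": 1.38, "davinci": 1.62}

names = list(model_params)
plt.plot([model_params[m] for m in names], [mean_loss[m] for m in names], marker="o")
plt.xscale("log")
plt.xlabel("Parameters (log scale)")
plt.ylabel("Mean task loss")
plt.title("Loss vs. model size (rising loss = inverse scaling)")
plt.savefig("scaling_plot.png")
```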

<a name="models"></a>Models

This contest uses pretrained autoregressive language models such as GPT-3. We offer Google colab notebooks for evaluating inverse scaling with the GPT-3, OPT, and GPT-2 model series when developing a task. However, to avoid overfitting to publicly available models, we use private models to run the evaluations for awarding prizes. Currently, we are using the series of pretrained language models (without additional finetuning) from Anthropic [Bai et al. 2022]. We are in discussions with other organizations to use their models, which may be added later on to strengthen the evaluation.

<a name="reviewers"></a>Reviewers

Prize decisions will be made by an anonymous panel of reviewers. Reviewers will be selected by the contest organizers, may include some organizers, and will have ML and NLP experience relevant to inverse scaling. Reviewers will not be allowed to make submissions to the contest.

<a name="submission"></a>Submission guidelines

  1. Each task submission should be a language modeling test set (in the style of BIG-Bench) of inputs with corresponding answers, which will be evaluated according to one of four evaluation metrics (detailed later).
  2. This prize is to incentivize original work, so submissions should find a new phenomenon for which inverse scaling has not been previously documented.
    1. If a task has already shown inverse scaling in prior work (even if the original authors did not identify it as such) then it is ineligible for the contest.
    2. If an existing task has not been subjected to any kind of scaling analysis, then it is likely eligible for the contest.
    3. If you would like to check whether an existing task is eligible, message us on our Slack or email us at inverse.scaling@gmail.com with [PRIOR WORK] in the subject line and a link to where the task has previously been published.
  3. Data must be formatted as a .zip containing .csv files.
    • The zip should be called <task_name>.zip, where task_name is the name you provide for your submission in the form (e.g. lambada.zip).
    • The file should be called <task_name>.csv (e.g. lambada.csv).
      • If you have multiple parts to the same task, add -PART<i> to each (e.g. lambada-PART1.csv, lambada-PART2.csv).
      • If you have control experiments, add these as <task_name>-CONTROL<i>.csv (e.g. lambada-CONTROL1.csv).
    • The .csv files will be read with the pandas package, using its default arguments.
    • Specific formats are given below in the Evaluation metrics section.
  4. Examples will be given as a prompt to an autoregressive language model.
    • I.e., either zero-shot or few-shot prompts (prompts containing a few examples). Few-shot examples must demonstrate the correct behavior on the task.
  5. Tasks must contain at least 300 examples.
    • We strongly recommend aiming for on the order of 1000 examples so that inverse scaling trends are clearer.
    • Submissions with unclear scaling trends and close to the minimum number of examples are unlikely to win a prize.
  6. In the submission form, you will be asked to add:
    1. Evaluation metric used
      • The metric should be one of the following (sketched informally after this list):
        1. Classification loss in a multiple-choice format (classification).
        2. Loss on a sequence at the end of the prompt (sequence_prob).
        3. Difference in logodds between two possible responses (logodds).
        4. Absolute difference in logodds between two possible responses (absolute_logodds).
    2. Authors
    3. Task name
    4. Description of intended task
      • What is the task aiming to test?
      • Remember to explain what good behavior on the task looks like.
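
The Evaluation metrics section referenced above specifies the exact file formats. As a rough, unofficial sketch of how the four metrics relate to model log probabilities (helper names are ours, and the logodds reading assumes two response options renormalized to sum to 1; the official definitions live in the contest's evaluation code):

```python
import numpy as np

def classification_loss(class_logprobs, answer_index):
    """classification: cross-entropy over the answer options, with the
    options' probabilities renormalized to sum to 1."""
    lp = np.asarray(class_logprobs, dtype=float)
    log_norm = np.log(np.exp(lp).sum())
    return float(log_norm - lp[answer_index])

def sequence_prob_loss(token_logprobs):
    """sequence_prob: negative log probability of the target sequence at
    the end of the prompt (sum of its tokens' log probabilities)."""
    return -float(np.sum(token_logprobs))

def logodds_diff(logprob_a, logprob_b):
    """logodds: with two options renormalized to sum to 1, the log-odds
    of option A reduces to log p_A - log p_B (our reading)."""
    return float(logprob_a - logprob_b)

def absolute_logodds_diff(logprob_a, logprob_b):
    """absolute_logodds: magnitude of the logodds difference."""
    return abs(logodds_diff(logprob_a, logprob_b))

# Example: two answer options with log probabilities from a model.
print(classification_loss([-1.2, -0.4], answer_index=1))
print(logodds_diff(-1.2, -0.4))
```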
