Inverse Scaling Prize

TL;DR: Win up to $100,000 for finding an important task where larger language models do worse.

~~Submissions due August 27, 2022 (Round 1) and October 27, 2022 (Round 2).~~

The contest has ended! Results: Round 1, Round 2.

Recent changes

11 October, 2023

Added Modus Tollens caveat to data-release README

21 March, 2023

Updated prize pool info

1 March, 2023

Released data for all winning tasks

17 December, 2022

Updated prize eligibility for FAR employees

12 December, 2022

Added prize terms update to the ‘Prize information’ section
Updated ‘About us’

9 October, 2022

Added Hugging Face Hub evaluation setup description to the tips section

4 October, 2022

BUG FIX: Reported total probabilities should now be more accurate for all classification tasks

26 September, 2022

Demonstrating positive scaling on the ‘incorrect’ answer is now allowed
Added stronger recommendation to aim for roughly 1000 examples
Added requirement to name the task
Added request to include data for control experiments
Added field to specify how the data was generated
Added field for links to dataset sources
Added field for code that generated dataset
Added requirement that submissions in multiple parts should upload all .csv files together in one .zip
Added request to make file names anonymous
Added option to use a variable number of classes in classification datasets
Added a printout to the Colab notebooks of the total probability given to class labels
Added reminder that the submitted plot should be from our official Colab notebook
Added request that people edit their form submission rather than resubmit to update
Added reminder to specify correct behavior on the task
Added field to specify whether the task is zero-shot or few-shot
Updated terms and conditions

Motivation

As language models get larger, they seem to only get better. Larger language models score better on benchmarks and unlock new capabilities like arithmetic [1], few-shot learning [1], and multi-step reasoning [2]. However, language models are not without flaws, exhibiting many biases [3] and producing plausible misinformation [4]. The purpose of this contest is to find evidence for a stronger failure mode: tasks where language models get worse as they become better at language modeling (next-word prediction).

The standard paradigm in natural language processing today is to pretrain large language models to autocomplete text corpora. The resulting models are then either frozen and used directly for other tasks (zero-shot or using few-shot learning), or additionally trained on other tasks (fine-tuning). We focus on the case of zero-shot/few-shot evaluation on downstream tasks without task-specific gradient optimization: it's typically easier to use in practice and to study.

Scaling laws [5][6] show that language models get predictably better (in terms of test loss and downstream performance [7]) as the number of parameters, amount of compute used, and dataset size increase. The improvement follows a power law in each of parameters, compute, and dataset size. We hypothesize that there are tasks with trends in the opposite direction: task performance gets monotonically, predictably worse as the overall test loss of the language model improves. We call this phenomenon inverse scaling, in contrast with the standard scaling laws. There are some tasks that appear to show inverse scaling under some conditions [4][8][10], but such tasks appear to be rare.
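To make the contrast concrete, here is a minimal sketch of how one might tell the two trends apart. Under a power law, loss plotted against model size is a straight line in log-log space, so the sign of the fitted slope separates standard scaling (negative) from inverse scaling (positive). The function name `loglog_slope` and all numbers are hypothetical, chosen only for illustration; they are not contest results.

```python
import math


def loglog_slope(sizes, losses):
    """Least-squares slope of log(loss) against log(model size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical parameter counts and losses (illustration only).
sizes = [1e8, 1e9, 1e10, 1e11]
standard_losses = [4.0, 3.2, 2.56, 2.048]  # loss falls as size grows
inverse_losses = [0.5, 0.6, 0.72, 0.864]   # loss rises as size grows

standard_slope = loglog_slope(sizes, standard_losses)
inverse_slope = loglog_slope(sizes, inverse_losses)

print("standard scaling slope:", standard_slope)  # negative
print("inverse scaling slope:", inverse_slope)    # positive
```

A slope near zero would correspond to a flat trend, which is neither standard nor inverse scaling.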

This contest aims to find inverse scaling tasks, especially those of importance to the safe and responsible use of language models. We hope that task submissions will teach us more about what types of tasks exhibit inverse scaling; inverse scaling tasks will also highlight potential issues with the current paradigm of language model pretraining and scaling. Inverse scaling tasks are important because they represent a mismatch between the behavior we want language models to exhibit and the behavior we get in practice from the training objectives and data we use. As language models continue to get bigger and are used in more real-world applications, it is important that they do not get increasingly worse or harm users in yet-undetected ways.

After two rounds of the contest, we will write a survey of the submitted tasks and other examples found in the literature. Authors of winning tasks will be awarded prize money and invited to be co-authors on the resulting paper. Below, we detail our call for submissions. Feel free to join our Slack to message us with questions, find collaborators, and participate in contest-related discussions with other participants (code, ideas, findings, and related work sharing).

Prize information

2023/03/21 Update: The prize pool has been funded by Open Philanthropy.

We will award up to $250,000 in total prize money for task submissions, distributed as follows:

Up to 1 Grand Prize of $100,000.
Up to 5 Second Prizes of $20,000 each.
Up to 10 Third Prizes of $5,000 each.

All prize decisions will be made by the organizers and anonymous reviewers, using the Prize Rubric below. Prize winners may nominate a non-profit to receive the prize money on their behalf. Some prizes may remain unawarded if there are not enough tasks that meet the eligibility for a prize tier, as detailed in the Prize Rubric.

Benchmark and Co-authorship: Authors of prize-winning submissions will be invited as co-authors on the paper written after the contest concludes. We will also offer co-authorship to authors of submissions that met our acceptability criteria but did not receive prizes, in the event that we receive more acceptable submissions than we can award with prizes. We will include all accepted submissions in our final benchmark, which we plan to release to the research community after the contest.

Timeline: The contest begins on June 27, 2022. We will host a first round of evaluations on submissions received on or before August 27, 2022 (Anywhere on Earth) and a second, final round of evaluations on submissions received on or before October 27, 2022 (Anywhere on Earth). After the first round, we will award eligible tasks with third prizes (up to 5) and second prizes (up to 2). To help improve first-round submissions, we will also return reviewer feedback and scaling law plots/results from our private evaluation models. Submissions will be paused for two weeks at the end of the first round to allow any necessary improvements to be made. At the end of the second round, we will reward eligible tasks at all prize tiers, with the possibility of upgrading first-round submissions to higher prize tiers based on both rounds of submissions.

<a name="prize-rubric"></a>Prize Rubric

Here, we detail our submission evaluation rubric. The rubric will guide an anonymous panel of reviewers in judging submissions for prizes. A submission must meet all criteria in the "Grand Prize" column to win the grand prize. Likewise, a submission must meet all criteria in the "Accepted Task" column to be accepted into our benchmark and for co-authorship on our paper. For second prizes, submissions must meet all "Accepted Task" criteria and some "Grand Prize" criteria. Third prizes must meet the "Accepted Task" criteria. We may receive more eligible submissions than we have prizes for a given tier. In this case, we will first break ties based on how many "Grand Prize" criteria are met and then by having reviewers make subjective rankings within tiers (e.g., more granular measures of how much various criteria are met or the relative difficulty or importance of each criterion met). We will consider inverse scaling trends on publicly available models like GPT-3, as well as on held-out, private models for which we will run the evaluation.

Criterion	Description		Prize Tier
		No Prize	Accepted Task	Grand Prize
Inverse Scaling Strength	How straight and steep is the inverse scaling trend on public models?	Shows flat, very bumpy, or standard scaling.	Shows approximately monotonic inverse scaling.	Shows a clear, strictly monotonic inverse scaling trend.
Inverse Scaling Generality	Do different models all show inverse scaling?	No inverse scaling on private models.	Shows inverse scaling on some public and some private models.	Shows inverse scaling across all public and private models tested.
Task Importance	Is the task important to the safe and responsible use of LMs, or for shedding light on where LMs fail? How strong are the arguments?	Weak. No users or third parties would be harmed, and the task does not shed light on where LMs fail.	Fairly convincing. Some LM users or third parties would be harmed by the discovered behavior, or the task sheds light on where LMs fail (e.g., sensitivity to prompts).	Very convincing. Significant implications for how LM research or deployment will need to be developed to be reliably safe and effective.
Novelty and Surprisingness	Is inverse scaling on the task novel (not shown in prior work) and surprising?	Not novel or surprising.	Novel and somewhat surprising.	Novel and surprising, teaching us something new about LMs.
Task Coverage	Are the examples fully representative of the described task?	Examples only cover a special subcategory or phrasing of the described task. There's no evidence of inverse scaling on other subcategories or phrasings.	Examples cover different subcategories and phrasings for the described task.	Examples cover almost all important task subcategories and phrasings, suggesting robust inverse scaling on the described task.
Reproducibility	Does inverse scaling appear to occur if we reproduce the task based on its description?	No, we see flat, very bumpy, or standard scaling. The particular examples submitted may have been over-optimized for inverse scaling, to the extent that the examples are unrepresentative of the described task.	Yes, but to a lesser extent.	Yes, to a similar or stronger extent.

Answering the optional questions below in our submission form (in the free-form response) will make your task stand out more:

Does inverse scaling persist even if the model is conditioned with few-shot examples to behave correctly? If providing enough few-shot examples eliminates inverse scaling, how many examples are required for that?
Does inverse scaling persist even after fine-tuning on the task? Are there good reasons to think it would persist after fine-tuning?
Does inverse scaling persist for InstructGPT models trained with Reinforcement Learning from Human Feedback (RLHF)? To test this, you can use the same code as that for GPT-3 evaluation. We may also evaluate submissions on private RLHF models of various sizes from Anthropic [Bai et al. 2022].

We reserve the right to update the prize tier standards or criteria, e.g., between rounds if we observe submissions gaming them in some way.

Evaluation Eligibility: To be eligible for official review, a task submission must:

Include a plot of loss vs. model size across the ada, babbage, curie, and davinci GPT-3 models, using the provided code for GPT-3 evaluation. The plot must not show a standard scaling law. A very bumpy trend is okay for submission; we expect cleaner trends on our held-out evaluation models, which show clear scaling behavior.
Meet the formatting requirements described in the Submission Guidelines.
- This requirement should already be satisfied if you are able to successfully run the evaluation code.
Include a coherent description of the task.
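As a rough pre-submission check of the plot requirement above, the sketch below classifies a set of per-model losses by trend. It is not the official evaluation code. It assumes that "standard scaling" means loss strictly decreasing from the smallest to the largest model; the function name `classify_trend` and the example losses are hypothetical.

```python
# GPT-3 models named in the eligibility criteria, smallest to largest.
GPT3_ORDER = ["ada", "babbage", "curie", "davinci"]


def classify_trend(loss_by_model):
    """Label the loss trend across the four GPT-3 models.

    Assumption: "standard scaling" is taken to mean that loss strictly
    decreases at every step from ada to davinci.
    """
    losses = [loss_by_model[name] for name in GPT3_ORDER]
    steps = [later - earlier for earlier, later in zip(losses, losses[1:])]
    if all(step < 0 for step in steps):
        return "standard scaling"
    if all(step > 0 for step in steps):
        return "strictly monotonic inverse scaling"
    return "non-monotonic"


# Hypothetical losses, for illustration only.
example_losses = {"ada": 0.61, "babbage": 0.66, "curie": 0.64, "davinci": 0.75}
trend = classify_trend(example_losses)
print(trend)
```

Under the stated assumption, only the "standard scaling" label would rule a plot out; a "non-monotonic" result corresponds to the bumpy trends that remain acceptable for submission.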

<a name="models"></a>Models

This contest uses pretrained autoregressive language models such as GPT-3. We offer Google Colab notebooks for evaluating inverse scaling with the GPT-3, OPT, and GPT-2 model series when developing a task. However, to avoid overfitting to publicly available models, we use private models to run the evaluations for awarding prizes. Currently, we are using the series of pretrained language models (without additional fine-tuning) from Anthropic [Bai et al. 2022]. We are in discussions with other organizations to use their models, which may be added later on to strengthen the evaluation.

<a name="reviewers"></a>Reviewers

Prize decisions will be made by an anonymous panel of reviewers. Reviewers will be selected by the contest organizers and may include some organizers. Reviewers will have ML and NLP experience relevant to inverse scaling. Reviewers will not be allowed to make submissions to the contest.

<a name="submission"></a>Submission guidelines

Each task submission should be a language modeling test set (in the style of BIG-Bench) of inputs with corresponding answers, which will be evaluated according to one of four evaluation metrics (detailed later).
This prize is to incentivize original work, so submissions should find a new phenomenon for which inverse scaling has not been previously documented.
1. If a task has already shown inverse scaling in prior work (even if the original authors did not identify it as such) then it is ineligible for the contest.
2. If an existing task has not been subjected to any kind of scaling analysis, then it is likely eligible for the contest.
3. If you would like to check whether an existing task is eligible, message us on our Slack or email us at inverse.scaling@gmail.com with [PRIOR WORK] in the subject line and a link to where the task has previously been published.
Data must be formatted as a .zip containing .csv files.
- The zip should be called <task_name>.zip, where task_name is the name you provide for your submission in the form (e.g. lambada.zip).
- The file should be called <task_name>.csv (e.g. lambada.csv).
  - If you have multiple parts to the same task, add -PART<i> to each (e.g. lambada-PART1.csv, lambada-PART2.csv).
  - If you have control experiments, add these as <task_name>-CONTROL<i>.csv (e.g. lambada-CONTROL1.csv).
- The .csv files will be read using the pandas package, using the default arguments.
- Specific formats are given below in the Evaluation metrics section.
Examples will be given as a prompt to an autoregressive language model.
- That is, prompts are either zero-shot or few-shot (containing a few examples). Few-shot examples must demonstrate the correct behavior on the task.
Tasks must contain at least 300 examples.
- We strongly recommend aiming for on the order of 1000 examples so that inverse scaling trends are clearer.
- Submissions with unclear scaling trends and close to the minimum number of examples are unlikely to win a prize.
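The naming, packaging, and minimum-size rules above can be sketched as a small helper. This is an illustration under stated assumptions, not the official tooling: the function names are hypothetical, the column names in the demo are placeholders (the real formats are given in the Evaluation metrics section), and the sketch assumes the 300-example minimum applies to the total across all parts, with control files excluded. It counts rows with the standard-library `csv` module rather than pandas, which should agree for well-formed files with a header row.

```python
import csv
import io
import zipfile

MIN_EXAMPLES = 300  # minimum stated in the guidelines


def count_examples(csv_text):
    """Count data rows, treating the first non-empty row as the header."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    return max(len(rows) - 1, 0)


def build_submission_zip(task_name, parts, controls=()):
    """Return (zip file name, zip bytes) for a list of CSV texts.

    Assumption: the minimum example count is checked on the total
    across all parts, and control files are not counted.
    """
    if len(parts) == 1:
        named = {f"{task_name}.csv": parts[0]}
    else:
        named = {
            f"{task_name}-PART{i}.csv": text
            for i, text in enumerate(parts, start=1)
        }
    for i, text in enumerate(controls, start=1):
        named[f"{task_name}-CONTROL{i}.csv"] = text

    total = sum(count_examples(text) for text in parts)
    if total < MIN_EXAMPLES:
        raise ValueError(f"only {total} examples; need at least {MIN_EXAMPLES}")

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in named.items():
            archive.writestr(name, text)
    return f"{task_name}.zip", buffer.getvalue()


# Placeholder columns and rows, for illustration only.
demo_csv = "col_a,col_b\n" + "".join(f"example {i},x\n" for i in range(300))
zip_name, payload = build_submission_zip("lambada", [demo_csv])
print(zip_name)
```

Writing `payload` to a file named `zip_name` would produce an archive with the layout described above; all parts and controls go into the same archive, matching the requirement to upload every .csv file together in one .zip.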
In the submission form, you will be asked to add:
1. Evaluation metric used
  - The metric should be one of:
    1. Classification loss in a multiple-choice format (classification).
    2. Loss on a sequence at the end of the prompt (sequence_prob).
    3. Difference in log-odds between two possible responses (logodds).
    4. Absolute difference in log-odds between two possible responses (absolute_logodds).
2. Authors
3. Task name
4. Description of intended task
  - What is the task aiming to test?
  - Remember to explain what good behavior