
# A Comprehensive Collection of Foundation Model Evaluation Leaderboards and Tools
This project curates a diverse set of foundation model evaluation leaderboards, development tools, and evaluation organizations. It covers model evaluation in text, image, code, math, and other domains, as well as solution- and data-oriented evaluations, and provides a leaderboard search feature for quick lookup. The resource helps researchers and developers compare and analyze the performance of different foundation models.
Awesome Foundation Model Leaderboard is a curated list of awesome foundation model leaderboards (for an explanation of what a leaderboard is, please refer to this post), along with various development tools and evaluation organizations according to our survey:
<p align="center"><strong>On the Workflows and Smells of Leaderboard Operations (LBOps):<br>An Exploratory Study of Foundation Model Leaderboards</strong></p> <p align="center"><a href="https://github.com/zhimin-z">Zhimin (Jimmy) Zhao</a>, <a href="https://abdulali.github.io">Abdul Ali Bangash</a>, <a href="https://www.filipecogo.pro">Filipe Roseiro Côgo</a>, <a href="https://mcis.cs.queensu.ca/bram.html">Bram Adams</a>, <a href="https://research.cs.queensu.ca/home/ahmed">Ahmed E. Hassan</a></p> <p align="center"><a href="https://sail.cs.queensu.ca">Software Analysis and Intelligence Lab (SAIL)</a></p>

If you find this repository useful, please consider giving us a star :star: and citing it:
@article{zhao2024workflows,
title={On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards},
author={Zhao, Zhimin and Bangash, Abdul Ali and C{\^o}go, Filipe Roseiro and Adams, Bram and Hassan, Ahmed E},
journal={arXiv preprint arXiv:2407.04065},
year={2024}
}
Additionally, we provide a search toolkit that helps you quickly navigate through the leaderboards.
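Conceptually, the search toolkit performs keyword matching over each leaderboard's name and description. The snippet below is a purely illustrative sketch of that idea; the sample entries and the `search_leaderboards` helper are hypothetical and are not the toolkit's actual data or API.

```python
# Illustrative only: a minimal keyword search over catalog entries, mirroring what
# the search toolkit does conceptually. The entries and helper are hypothetical.
from typing import Dict, List

CATALOG: List[Dict[str, str]] = [
    {"name": "AlpacaEval", "description": "Automatic evaluator for instruction-following LLMs."},
    {"name": "C-Eval", "description": "Chinese evaluation suite for LLMs."},
    {"name": "GAIA", "description": "Tests fundamental abilities an AI assistant should possess."},
]

def search_leaderboards(query: str, catalog: List[Dict[str, str]]) -> List[str]:
    """Return names of entries whose name or description contains the query (case-insensitive)."""
    q = query.lower()
    return [e["name"] for e in catalog if q in e["name"].lower() or q in e["description"].lower()]

print(search_leaderboards("chinese", CATALOG))  # ['C-Eval']
```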
If you want to contribute to this list (please do), feel free to open a pull request.
If you have any suggestions, critiques, or questions regarding this list, feel free to raise an issue.
Also, a leaderboard should be included only if:
## Tools

| Name | Description |
|---|---|
| gradio_leaderboard | gradio_leaderboard helps users build fully functional and performant leaderboard demos with Gradio (see the sketch after this table). |
| Demo leaderboard | Demo leaderboard helps users easily deploy their leaderboards with a standardized template. |
| Leaderboard Explorer | Leaderboard Explorer helps users navigate the diverse range of leaderboards available on Hugging Face Spaces. |
| open_llm_leaderboard | open_llm_leaderboard helps users access Open LLM Leaderboard data easily. |
| open-llm-leaderboard-renamer | open-llm-leaderboard-renamer helps users rename their models in Open LLM Leaderboard easily. |
| Open LLM Leaderboard Results PR Opener | Open LLM Leaderboard Results PR Opener helps users showcase Open LLM Leaderboard results in their model cards. |
| Open LLM Leaderboard Scraper | Open LLM Leaderboard Scraper helps users scrape and export data from Open LLM Leaderboard. |
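As an example of the first tool above, the sketch below shows how a leaderboard demo might be assembled with gradio_leaderboard. It is a minimal sketch under assumptions: it assumes the package exposes a `Leaderboard` component that accepts a pandas DataFrame via `value=`; check the component's documentation for the exact constructor parameters and filtering options.

```python
# Minimal sketch (assumptions noted above): build a simple leaderboard demo
# from a pandas DataFrame using the gradio_leaderboard component.
import gradio as gr
import pandas as pd
from gradio_leaderboard import Leaderboard  # assumed import path; see the package docs

# Toy results table; a real leaderboard would load evaluation results from disk or a dataset.
results = pd.DataFrame(
    {
        "model": ["model-a", "model-b"],
        "mmlu": [71.2, 68.5],
        "gsm8k": [82.0, 77.3],
    }
)

with gr.Blocks() as demo:
    gr.Markdown("## My leaderboard")
    # Assumed signature: Leaderboard(value=<DataFrame>); search/filter arguments
    # offered by the component are omitted here.
    Leaderboard(value=results)

if __name__ == "__main__":
    demo.launch()
```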
## Organizations

| Name | Description |
|---|---|
| Allen Institute for AI | Allen Institute for AI is a non-profit research institute with the mission of conducting high-impact AI research and engineering in service of the common good. |
| Papers With Code | Papers With Code is a community-driven platform for learning about state-of-the-art research papers on machine learning. |
## Evaluation Organizations

| Name | Description |
|---|---|
| CompassRank | CompassRank is a platform to offer a comprehensive, objective, and neutral evaluation reference of foundation models for the industry and research. |
| FlagEval | FlagEval is a comprehensive platform for evaluating foundation models. |
| GenAI-Arena | GenAI-Arena hosts the visual generation arena, where various vision models compete based on their performance in image generation, image editing, and video generation. |
| Holistic Evaluation of Language Models | Holistic Evaluation of Language Models (HELM) is a reproducible and transparent framework for evaluating foundation models. |
| nuScenes | nuScenes enables researchers to study challenging urban driving situations using the full sensor suite of a real self-driving car. |
| SuperCLUE | SuperCLUE is a series of benchmarks for evaluating Chinese foundation models. |
## Text Leaderboards

| Name | Description |
|---|---|
| ACLUE | ACLUE is an evaluation benchmark for ancient Chinese language comprehension. |
| AIR-Bench | AIR-Bench is a benchmark to evaluate heterogeneous information retrieval capabilities of language models. |
| AlignBench | AlignBench is a multi-dimensional benchmark for evaluating LLMs' alignment in Chinese. |
| AlpacaEval | AlpacaEval is an automatic evaluator designed for instruction-following LLMs. |
| ANGO | ANGO is a generation-oriented Chinese language model evaluation benchmark. |
| Arabic Tokenizers Leaderboard | Arabic Tokenizers Leaderboard compares the efficiency of LLMs in parsing Arabic in its different dialects and forms. |
| Arena-Hard-Auto | Arena-Hard-Auto is a benchmark for instruction-tuned LLMs. |
| Auto-Arena | Auto-Arena is a benchmark in which various language model agents engage in peer-battles to evaluate their performance. |
| BeHonest | BeHonest is a benchmark to evaluate honesty - awareness of knowledge boundaries (self-knowledge), avoidance of deceit (non-deceptiveness), and consistency in responses (consistency) - in LLMs. |
| BenBench | BenBench is a benchmark to evaluate the extent to which LLMs conduct verbatim training on a benchmark's training set (rather than its test set) to inflate their capabilities. |
| BiGGen-Bench | BiGGen-Bench is a comprehensive benchmark to evaluate LLMs across a wide variety of tasks. |
| Biomedical Knowledge Probing Leaderboard | Biomedical Knowledge Probing Leaderboard aims to track, rank, and evaluate biomedical factual knowledge probing results in LLMs. |
| BotChat | BotChat assesses the multi-round chatting capabilities of LLMs through a proxy task, evaluating whether two ChatBot instances can engage in smooth and fluent conversation with each other. |
| C-Eval | C-Eval is a Chinese evaluation suite for LLMs. |
| C-Eval Hard | C-Eval Hard is a more challenging version of C-Eval, which involves complex LaTeX equations and requires non-trivial reasoning abilities to solve. |
| Capability leaderboard | Capability leaderboard is a platform to evaluate long context understanding capabilities of LLMs. |
| Chain-of-Thought Hub | Chain-of-Thought Hub is a benchmark to evaluate the reasoning capabilities of LLMs. |
| ChineseFactEval | ChineseFactEval is a factuality benchmark for Chinese LLMs. |
| CLEM | CLEM is a framework designed for the systematic evaluation of chat-optimized LLMs as conversational agents. |
| CLiB | CLiB is a benchmark to evaluate Chinese LLMs. |
| CMMLU | CMMLU is a Chinese benchmark to evaluate LLMs' knowledge and reasoning capabilities. |
| CMB | CMB is a multi-level medical benchmark in Chinese. |
| CMMLU | CMMLU is a benchmark to evaluate the performance of LLMs in various subjects within the Chinese cultural context. |
| CMMMU | CMMMU is a benchmark to test the capabilities of multimodal models in understanding and reasoning across multiple disciplines in the Chinese context. |
| CompMix | CompMix is a benchmark for heterogeneous question answering. |
| Compression Leaderboard | Compression Leaderboard is a platform to evaluate the compression performance of LLMs. |
| CoTaEval | CoTaEval is a benchmark to evaluate the feasibility and side effects of copyright takedown methods for LLMs. |
| ConvRe | ConvRe is a benchmark to evaluate LLMs' ability to comprehend converse relations. |
| CriticBench | CriticBench is a benchmark to evaluate LLMs' ability to make critique responses. |
| CRM LLM Leaderboard | CRM LLM Leaderboard is a platform to evaluate the efficacy of LLMs for business applications. |
| DecodingTrust | DecodingTrust is an assessment platform to evaluate the trustworthiness of LLMs. |
| Domain LLM Leaderboard | Domain LLM Leaderboard is a platform to evaluate the popularity of domain-specific LLMs. |
| DyVal | DyVal is a dynamic evaluation protocol for LLMs. |
| Enterprise Scenarios Leaderboard | Enterprise Scenarios Leaderboard aims to assess the performance of LLMs on real-world enterprise use cases. |
| EQ-Bench | EQ-Bench is a benchmark to evaluate aspects of emotional intelligence in LLMs. |
| Factuality Leaderboard | Factuality Leaderboard compares the factual capabilities of LLMs. |
| FuseReviews | FuseReviews aims to advance grounded text generation tasks, including long-form question-answering and summarization. |
| FELM | FELM is a meta-benchmark that evaluates factuality evaluators for LLMs. |
| GAIA | GAIA aims to test fundamental abilities that an AI assistant should possess. |
| GPT-Fathom | GPT-Fathom is an LLM evaluation suite, benchmarking 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. |
| Guerra LLM AI Leaderboard | Guerra LLM AI Leaderboard compares and ranks the performance of LLMs across quality, price, performance, context window, and others. |
| Hallucinations Leaderboard | Hallucinations Leaderboard aims to track, rank and evaluate hallucinations in LLMs. |
| HalluQA | HalluQA is a benchmark to evaluate the phenomenon of hallucinations in Chinese LLMs. |
| HellaSwag | HellaSwag is a benchmark to evaluate common-sense reasoning in LLMs. |
| HHEM Leaderboard | HHEM Leaderboard evaluates how often a language model introduces hallucinations when summarizing a document. |
| IFEval | IFEval is a benchmark to evaluate LLMs' instruction following capabilities with verifiable instructions. |
| Indic LLM Leaderboard | Indic LLM Leaderboard is a benchmark to track progress and rank the performance of Indic LLMs. |
| InstructEval | InstructEval is an evaluation suite to assess instruction selection methods in the context of LLMs. |
| Japanese Chatbot Arena | Japanese Chatbot Arena hosts the chatbot arena, where various LLMs compete based on their performance in Japanese. |
| JustEval | JustEval is a powerful tool designed for fine-grained evaluation of LLMs. |
| Ko Chatbot Arena | Ko Chatbot Arena hosts the chatbot arena, where various LLMs compete based on their performance in Korean. |
| KoLA | KoLA is a benchmark to evaluate the world knowledge of LLMs. |
| L-Eval | L-Eval is a Long Context Language Model (LCLM) evaluation benchmark to evaluate the ability of LLMs to handle extensive context. |
| Language Model Council | Language Model Council (LMC) is a benchmark to evaluate tasks that are highly subjective and often lack majoritarian human agreement. |
| LawBench | LawBench is a benchmark to evaluate the legal capabilities of LLMs. |

