
# A Comprehensive Collection of Foundation Model Evaluation Leaderboards and Tools
This project curates a diverse set of foundation model evaluation leaderboards, development tools, and evaluation organizations. It covers model evaluation in text, image, code, math, and other domains, as well as solution- and data-oriented evaluations, and provides a leaderboard search feature for quick lookup. The resource helps researchers and developers compare and analyze the performance of different foundation models.
Awesome Foundation Model Leaderboard is a curated list of awesome foundation model leaderboards (for an explanation of what a leaderboard is, please refer to this post), along with various development tools and evaluation organizations according to our survey:
<p align="center"><strong>On the Workflows and Smells of Leaderboard Operations (LBOps):<br>An Exploratory Study of Foundation Model Leaderboards</strong></p> <p align="center"><a href="https://github.com/zhimin-z">Zhimin (Jimmy) Zhao</a>, <a href="https://abdulali.github.io">Abdul Ali Bangash</a>, <a href="https://www.filipecogo.pro">Filipe Roseiro Côgo</a>, <a href="https://mcis.cs.queensu.ca/bram.html">Bram Adams</a>, <a href="https://research.cs.queensu.ca/home/ahmed">Ahmed E. Hassan</a></p> <p align="center"><a href="https://sail.cs.queensu.ca">Software Analysis and Intelligence Lab (SAIL)</a></p>

If you find this repository useful, please consider giving us a star :star: and citing it:
@article{zhao2024workflows,
title={On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards},
author={Zhao, Zhimin and Bangash, Abdul Ali and C{\^o}go, Filipe Roseiro and Adams, Bram and Hassan, Ahmed E},
journal={arXiv preprint arXiv:2407.04065},
year={2024}
}
Additionally, we provide a search toolkit that helps you quickly navigate through the leaderboards.
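Conceptually, the search toolkit performs keyword matching over each leaderboard's name and description. The snippet below is a purely illustrative sketch of that idea; the sample entries and the `search_leaderboards` helper are hypothetical and are not the toolkit's actual data or API.

```python
# Illustrative only: a minimal keyword search over catalog entries, mirroring what
# the search toolkit does conceptually. The entries and helper are hypothetical.
from typing import Dict, List

CATALOG: List[Dict[str, str]] = [
    {"name": "AlpacaEval", "description": "Automatic evaluator for instruction-following LLMs."},
    {"name": "C-Eval", "description": "Chinese evaluation suite for LLMs."},
    {"name": "GAIA", "description": "Tests fundamental abilities an AI assistant should possess."},
]

def search_leaderboards(query: str, catalog: List[Dict[str, str]]) -> List[str]:
    """Return names of entries whose name or description contains the query (case-insensitive)."""
    q = query.lower()
    return [e["name"] for e in catalog if q in e["name"].lower() or q in e["description"].lower()]

print(search_leaderboards("chinese", CATALOG))  # ['C-Eval']
```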
If you want to contribute to this list (please do), feel free to open a pull request.
If you have any suggestions, critiques, or questions regarding this list, feel free to raise an issue.
Also, a leaderboard should be included only if:
## Tools

| Name | Description |
|---|---|
| gradio_leaderboard | gradio_leaderboard helps users build fully functional and performant leaderboard demos with Gradio (see the sketch after this table). |
| Demo leaderboard | Demo leaderboard helps users easily deploy their leaderboards with a standardized template. |
| Leaderboard Explorer | Leaderboard Explorer helps users navigate the diverse range of leaderboards available on Hugging Face Spaces. |
| open_llm_leaderboard | open_llm_leaderboard helps users access Open LLM Leaderboard data easily. |
| open-llm-leaderboard-renamer | open-llm-leaderboard-renamer helps users rename their models in Open LLM Leaderboard easily. |
| Open LLM Leaderboard Results PR Opener | Open LLM Leaderboard Results PR Opener helps users showcase Open LLM Leaderboard results in their model cards. |
| Open LLM Leaderboard Scraper | Open LLM Leaderboard Scraper helps users scrape and export data from Open LLM Leaderboard. |
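As an example of the first tool above, the sketch below shows how a leaderboard demo might be assembled with gradio_leaderboard. It is a minimal sketch under assumptions: it assumes the package exposes a `Leaderboard` component that accepts a pandas DataFrame via `value=`; check the component's documentation for the exact constructor parameters and filtering options.

```python
# Minimal sketch (assumptions noted above): build a simple leaderboard demo
# from a pandas DataFrame using the gradio_leaderboard component.
import gradio as gr
import pandas as pd
from gradio_leaderboard import Leaderboard  # assumed import path; see the package docs

# Toy results table; a real leaderboard would load evaluation results from disk or a dataset.
results = pd.DataFrame(
    {
        "model": ["model-a", "model-b"],
        "mmlu": [71.2, 68.5],
        "gsm8k": [82.0, 77.3],
    }
)

with gr.Blocks() as demo:
    gr.Markdown("## My leaderboard")
    # Assumed signature: Leaderboard(value=<DataFrame>); search/filter arguments
    # offered by the component are omitted here.
    Leaderboard(value=results)

if __name__ == "__main__":
    demo.launch()
```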
## Organizations

| Name | Description |
|---|---|
| Allen Institute for AI | Allen Institute for AI is a non-profit research institute with the mission of conducting high-impact AI research and engineering in service of the common good. |
| Papers With Code | Papers With Code is a community-driven platform for learning about state-of-the-art research papers on machine learning. |
## Evaluation Organizations

| Name | Description |
|---|---|
| CompassRank | CompassRank is a platform to offer a comprehensive, objective, and neutral evaluation reference of foundation models for the industry and research. |
| FlagEval | FlagEval is a comprehensive platform for evaluating foundation models. |
| GenAI-Arena | GenAI-Arena hosts the visual generation arena, where various vision models compete based on their performance in image generation, image editing, and video generation. |
| Holistic Evaluation of Language Models | Holistic Evaluation of Language Models (HELM) is a reproducible and transparent framework for evaluating foundation models. |
| nuScenes | nuScenes enables researchers to study challenging urban driving situations using the full sensor suite of a real self-driving car. |
| SuperCLUE | SuperCLUE is a series of benchmarks for evaluating Chinese foundation models. |
## Text Leaderboards

| Name | Description |
|---|---|
| ACLUE | ACLUE is an evaluation benchmark for ancient Chinese language comprehension. |
| AIR-Bench | AIR-Bench is a benchmark to evaluate heterogeneous information retrieval capabilities of language models. |
| AlignBench | AlignBench is a multi-dimensional benchmark for evaluating LLMs' alignment in Chinese. |
| AlpacaEval | AlpacaEval is an automatic evaluator designed for instruction-following LLMs. |
| ANGO | ANGO is a generation-oriented Chinese language model evaluation benchmark. |
| Arabic Tokenizers Leaderboard | Arabic Tokenizers Leaderboard compares the efficiency of LLMs in parsing Arabic in its different dialects and forms. |
| Arena-Hard-Auto | Arena-Hard-Auto is a benchmark for instruction-tuned LLMs. |
| Auto-Arena | Auto-Arena is a benchmark in which various language model agents engage in peer-battles to evaluate their performance. |
| BeHonest | BeHonest is a benchmark to evaluate honesty - awareness of knowledge boundaries (self-knowledge), avoidance of deceit (non-deceptiveness), and consistency in responses (consistency) - in LLMs. |
| BenBench | BenBench is a benchmark to evaluate the extent to which LLMs conduct verbatim training on a benchmark's training set (rather than its test set) to inflate their capabilities. |
| BiGGen-Bench | BiGGen-Bench is a comprehensive benchmark to evaluate LLMs across a wide variety of tasks. |
| Biomedical Knowledge Probing Leaderboard | Biomedical Knowledge Probing Leaderboard aims to track, rank, and evaluate biomedical factual knowledge probing results in LLMs. |
| BotChat | BotChat assesses the multi-round chatting capabilities of LLMs through a proxy task, evaluating whether two ChatBot instances can engage in smooth and fluent conversation with each other. |
| C-Eval | C-Eval is a Chinese evaluation suite for LLMs. |
| C-Eval Hard | C-Eval Hard is a more challenging version of C-Eval, which involves complex LaTeX equations and requires non-trivial reasoning abilities to solve. |
| Capability leaderboard | Capability leaderboard is a platform to evaluate long context understanding capabilities of LLMs. |
| Chain-of-Thought Hub | Chain-of-Thought Hub is a benchmark to evaluate the reasoning capabilities of LLMs. |
| ChineseFactEval | ChineseFactEval is a factuality benchmark for Chinese LLMs. |
| CLEM | CLEM is a framework designed for the systematic evaluation of chat-optimized LLMs as conversational agents. |
| CLiB | CLiB is a benchmark to evaluate Chinese LLMs. |
| CMMLU | CMMLU is a Chinese benchmark to evaluate LLMs' knowledge and reasoning capabilities. |
| CMB | CMB is a multi-level medical benchmark in Chinese. |
| CMMLU | CMMLU is a benchmark to evaluate the performance of LLMs in various subjects within the Chinese cultural context. |
| CMMMU | CMMMU is a benchmark to test the capabilities of multimodal models in understanding and reasoning across multiple disciplines in the Chinese context. |
| CompMix | CompMix is a benchmark for heterogeneous question answering. |
| Compression Leaderboard | Compression Leaderboard is a platform to evaluate the compression performance of LLMs. |
| CoTaEval | CoTaEval is a benchmark to evaluate the feasibility and side effects of copyright takedown methods for LLMs. |
| ConvRe | ConvRe is a benchmark to evaluate LLMs' ability to comprehend converse relations. |
| CriticBench | CriticBench is a benchmark to evaluate LLMs' ability to make critique responses. |
| CRM LLM Leaderboard | CRM LLM Leaderboard is a platform to evaluate the efficacy of LLMs for business applications. |
| DecodingTrust | DecodingTrust is an assessment platform to evaluate the trustworthiness of LLMs. |
| Domain LLM Leaderboard | Domain LLM Leaderboard is a platform to evaluate the popularity of domain-specific LLMs. |
| DyVal | DyVal is a dynamic evaluation protocol for LLMs. |
| Enterprise Scenarios Leaderboard | Enterprise Scenarios Leaderboard aims to assess the performance of LLMs on real-world enterprise use cases. |
| EQ-Bench | EQ-Bench is a benchmark to evaluate aspects of emotional intelligence in LLMs. |
| Factuality Leaderboard | Factuality Leaderboard compares the factual capabilities of LLMs. |
| FuseReviews | FuseReviews aims to advance grounded text generation tasks, including long-form question-answering and summarization. |
| FELM | FELM is a meta-benchmark that evaluates factuality evaluators for LLMs. |
| GAIA | GAIA aims to test fundamental abilities that an AI assistant should possess. |
| GPT-Fathom | GPT-Fathom is an LLM evaluation suite, benchmarking 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. |
| Guerra LLM AI Leaderboard | Guerra LLM AI Leaderboard compares and ranks the performance of LLMs across quality, price, performance, context window, and others. |
| Hallucinations Leaderboard | Hallucinations Leaderboard aims to track, rank and evaluate hallucinations in LLMs. |
| HalluQA | HalluQA is a benchmark to evaluate the phenomenon of hallucinations in Chinese LLMs. |
| HellaSwag | HellaSwag is a benchmark to evaluate common-sense reasoning in LLMs. |
| HHEM Leaderboard | HHEM Leaderboard evaluates how often a language model introduces hallucinations when summarizing a document. |
| IFEval | IFEval is a benchmark to evaluate LLMs' instruction following capabilities with verifiable instructions. |
| Indic LLM Leaderboard | Indic LLM Leaderboard is a benchmark to track progress and rank the performance of Indic LLMs. |
| InstructEval | InstructEval is an evaluation suite to assess instruction selection methods in the context of LLMs. |
| Japanese Chatbot Arena | Japanese Chatbot Arena hosts the chatbot arena, where various LLMs compete based on their performance in Japanese. |
| JustEval | JustEval is a powerful tool designed for fine-grained evaluation of LLMs. |
| Ko Chatbot Arena | Ko Chatbot Arena hosts the chatbot arena, where various LLMs compete based on their performance in Korean. |
| KoLA | KoLA is a benchmark to evaluate the world knowledge of LLMs. |
| L-Eval | L-Eval is a Long Context Language Model (LCLM) evaluation benchmark to evaluate the ability of LLMs to handle extensive context. |
| Language Model Council | Language Model Council (LMC) is a benchmark to evaluate tasks that are highly subjective and often lack majoritarian human agreement. |
| LawBench | LawBench is a benchmark to evaluate the legal capabilities of LLMs. |

