A Survey on Data Selection for Language Models

This repo is a convenient listing of papers relevant to data selection for language models across all stages of training. It is meant to be a resource for the community, so please contribute if you see anything missing!
For more detail on these and other works, see our survey paper: A Survey on Data Selection for Language Models.
By this incredible team: Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang
<img src="fig1.png" alt="A conceptual demonstration of the data pipeline for language model training" width=75% align="center">
Table of Contents
Data Selection for Pretraining
<img src="learning-stages-pretraining.png" alt="Conceptualization of objectives and constraints on data selection for pretraining" width=50% align="right">
Language Filtering
Back to Table of Contents
- FastText.zip: Compressing text classification models: 2016<br/> Armand Joulin and Edouard Grave and Piotr Bojanowski and Matthijs Douze and Hervé Jégou and Tomas Mikolov<br/>
- Learning Word Vectors for 157 Languages: 2018<br/> Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas<br/>
- Cross-lingual Language Model Pretraining: 2019<br/> Conneau, Alexis and Lample, Guillaume<br/>
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: 2020<br/> Raffel, Colin and Shazeer, Noam and Roberts, Adam... 3 hidden ... Zhou, Yanqi and Li, Wei and Liu, Peter J.<br/>
- Language ID in the wild: Unexpected challenges on the path to a thousand-language web text corpus: 2020<br/> Caswell, Isaac and Breiner, Theresa and van Esch, Daan and Bapna, Ankur<br/>
- Unsupervised Cross-lingual Representation Learning at Scale: 2020<br/> Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman... 4 hidden ... Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin<br/>
- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data: 2020<br/> Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis... 1 hidden ... Guzmán, Francisco and Joulin, Armand and Grave, Edouard<br/>
- A reproduction of Apple's bi-directional LSTM models for language identification in short strings: 2021<br/> Toftrup, Mads and Asger Sørensen, Søren and Ciosici, Manuel R. and Assent, Ira<br/>
- Evaluating Large Language Models Trained on Code: 2021<br/> Mark Chen and Jerry Tworek and Heewoo Jun... 52 hidden ... Sam McCandlish and Ilya Sutskever and Wojciech Zaremba<br/>
- mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer: 2021<br/> Xue, Linting and Constant, Noah and Roberts, Adam... 2 hidden ... Siddhant, Aditya and Barua, Aditya and Raffel, Colin<br/>
- Competition-level code generation with AlphaCode: 2022<br/> Li, Yujia and Choi, David and Chung, Junyoung... 20 hidden ... de Freitas, Nando and Kavukcuoglu, Koray and Vinyals, Oriol<br/>
- PaLM: Scaling Language Modeling with Pathways: 2022<br/> Aakanksha Chowdhery and Sharan Narang and Jacob Devlin... 61 hidden ... Jeff Dean and Slav Petrov and Noah Fiedel<br/>
- The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset: 2022<br/> Laurençon, Hugo and Saulnier, Lucile and Wang, Thomas... 48 hidden ... Mitchell, Margaret and Luccioni, Sasha Alexandra and Jernite, Yacine<br/>
- Writing System and Speaker Metadata for 2,800+ Language Varieties: 2022<br/> van Esch, Daan and Lucassen, Tamar and Ruder, Sebastian and Caswell, Isaac and Rivera, Clara<br/>
- FinGPT: Large Generative Models for a Small Language: 2023<br/> Luukkonen, Risto and Komulainen, Ville and Luoma, Jouni... 5 hidden ... Muennighoff, Niklas and Piktus, Aleksandra and others<br/>
- MC²: A Multilingual Corpus of Minority Languages in China: 2023<br/> Zhang, Chen and Tao, Mingxu and Huang, Quzhe and Lin, Jiuheng and Chen, Zhibin and Feng, Yansong<br/>
- MADLAD-400: A multilingual and document-level large audited dataset: 2023<br/> Kudugunta, Sneha and Caswell, Isaac and Zhang, Biao... 5 hidden ... Stella, Romi and Bapna, Ankur and others<br/>
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only: 2023<br/> Guilherme Penedo and Quentin Malartic and Daniel Hesslow... 3 hidden ... Baptiste Pannier and Ebtesam Almazrouei and Julien Launay<br/>
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research: 2024<br/> Luca Soldaini and Rodney Kinney and Akshita Bhagia... 30 hidden ... Dirk Groeneveld and Jesse Dodge and Kyle Lo<br/>
Heuristic Approaches
Back to Table of Contents
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: 2020<br/> Raffel, Colin and Shazeer, Noam and Roberts, Adam... 3 hidden ... Zhou, Yanqi and Li, Wei and Liu, Peter J.<br/>
- Language Models are Few-Shot Learners: 2020<br/> Brown, Tom and Mann, Benjamin and Ryder, Nick... 25 hidden ... Radford, Alec and Sutskever, Ilya and Amodei, Dario<br/>
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling: 2020<br/> Leo Gao and Stella Biderman and Sid Black... 6 hidden ... Noa Nabeshima and Shawn Presser and Connor Leahy<br/>
- Evaluating Large Language Models Trained on Code: 2021<br/> Mark Chen and Jerry Tworek and Heewoo Jun... 52 hidden ... Sam McCandlish and Ilya Sutskever and Wojciech Zaremba<br/>
- mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer: 2021<br/> Xue, Linting and Constant, Noah and Roberts, Adam... 2 hidden ... Siddhant, Aditya and Barua, Aditya and Raffel, Colin<br/>
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher: 2022<br/> Jack W. Rae and Sebastian Borgeaud and Trevor Cai... 74 hidden ... Demis Hassabis and Koray Kavukcuoglu and Geoffrey Irving<br/>
- The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset: 2022<br/> Laurençon, Hugo and Saulnier, Lucile and Wang, Thomas... 48 hidden ... Mitchell, Margaret and Luccioni, Sasha Alexandra and Jernite, Yacine<br/>
- HTLM: Hyper-Text Pre-Training and Prompting of Language Models: 2022<br/> Armen Aghajanyan and Dmytro Okhonko and Mike Lewis... 1 hidden ... Hu Xu and Gargi Ghosh and Luke Zettlemoyer<br/>
- LLaMA: Open and Efficient Foundation Language Models: 2023<br/> Hugo Touvron and Thibaut Lavril and Gautier Izacard... 8 hidden ... Armand Joulin and Edouard Grave and Guillaume Lample<br/>
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only: 2023<br/> Guilherme Penedo and Quentin Malartic and Daniel Hesslow... 3 hidden ... Baptiste Pannier and Ebtesam Almazrouei and Julien Launay<br/>
- The foundation model transparency index: 2023<br/> Bommasani, Rishi and Klyman, Kevin and Longpre, Shayne... 2 hidden ... Xiong, Betty and Zhang, Daniel and Liang, Percy<br/>
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research: 2024<br/> Luca Soldaini and Rodney Kinney and Akshita Bhagia... 30 hidden ... Dirk Groeneveld and Jesse Dodge and Kyle Lo<br/>
Data Quality
Back to Table of Contents
- KenLM: Faster and Smaller Language Model Queries: 2011<br/> Heafield, Kenneth<br/>
- FastText.zip: Compressing text classification models: 2016<br/> Armand Joulin and Edouard Grave and Piotr Bojanowski and Matthijs Douze and Hervé Jégou and Tomas Mikolov<br/>
- Learning Word Vectors for 157 Languages: 2018<br/> Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas<br/>
- Language Models are Unsupervised Multitask Learners: 2019<br/> Alec Radford and Jeff Wu and Rewon Child and David Luan and Dario Amodei and Ilya Sutskever<br/>
- Language Models are Few-Shot Learners: 2020<br/> Brown, Tom and Mann, Benjamin and Ryder, Nick... 25 hidden ... Radford, Alec and Sutskever, Ilya and Amodei, Dario<br/>
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling: 2020<br/> Leo Gao and Stella Biderman and Sid Black... 6 hidden ... Noa Nabeshima and Shawn Presser and Connor Leahy<br/>
- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data: 2020<br/> Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis... 1 hidden ... Guzmán, Francisco and Joulin, Armand and Grave, Edouard<br/>
- Detoxifying language models risks marginalizing minority voices: 2021<br/> Xu, Albert and Pathak, Eshaan and Wallace, Eric and Gururangan, Suchin and Sap, Maarten and Klein, Dan<br/>
- PaLM: Scaling Language Modeling with Pathways: 2022<br/> Aakanksha Chowdhery and Sharan Narang and Jacob Devlin... 61 hidden ... Jeff Dean and Slav Petrov and Noah Fiedel<br/>
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher: 2022<br/> Jack W. Rae and Sebastian Borgeaud and Trevor Cai... 74 hidden ... Demis Hassabis and Koray Kavukcuoglu and Geoffrey Irving<br/>
- Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection: 2022<br/> Gururangan, Suchin and Card, Dallas and Dreier, Sarah... 2 hidden ... Wang, Zeyu and Zettlemoyer, Luke and Smith, Noah A.<br/>
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts: 2022<br/> Du, Nan and Huang, Yanping and Dai, Andrew M... 21 hidden ... Wu, Yonghui and Chen, Zhifeng and Cui,