Friday, January 16, 2026

The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining


Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score defined as the classifier's score and retains only the top-scoring ones. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily improve language modeling on the high-quality dataset. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to those trained on synthetic data of increasing quality, obtained via random token permutations, and find starkly different trends. Our results challenge the view that CQF captures a meaningful notion of data quality.
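The CQF procedure described above can be sketched in a few lines: train a binary classifier to separate a small high-quality set from the pretraining pool, score each pretraining document with the classifier's probability for the high-quality class, and keep only the top-scoring fraction. This is a minimal toy illustration, not the paper's implementation; the example corpora, the hashing vectorizer, the logistic-regression classifier, and the 50% keep fraction are all illustrative assumptions.

```python
# Minimal sketch of Classifier-based Quality Filtering (CQF).
# Assumptions (not from the paper): toy corpora, HashingVectorizer
# features, a logistic-regression classifier, and a 50% keep fraction.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# A small "high-quality" reference set (label 1).
high_quality = [
    "The theorem follows from a straightforward induction on n.",
    "We evaluate the model on held-out benchmarks and report accuracy.",
]
# Web-crawled pretraining pool of mixed quality (label 0).
pretraining = [
    "click here to win a free prize now now now",
    "We describe an inductive proof strategy for the main theorem.",
    "buy cheap followers best deal limited offer",
    "Benchmark accuracy is reported on a held-out evaluation set.",
]

vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False)
X = vectorizer.transform(high_quality + pretraining)
y = [1] * len(high_quality) + [0] * len(pretraining)

# Binary classifier: high-quality set vs. pretraining data.
clf = LogisticRegression().fit(X, y)

# Quality score of each pretraining document = classifier's
# probability of the "high-quality" class.
scores = clf.predict_proba(vectorizer.transform(pretraining))[:, 1]

# Retain only the top-scoring half of the pretraining pool.
keep_n = len(pretraining) // 2
kept = [doc for _, doc in
        sorted(zip(scores, pretraining), reverse=True)[:keep_n]]
```

Note that the filtered pool `kept` is selected purely by the classifier's score; nothing in the procedure verifies that the retained documents actually improve language modeling on the high-quality set, which is exactly the gap the paper examines.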
