We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents in the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training. While standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretraining a 3B-parameter and a 6B-parameter model on up to 1T tokens from scratch. We find that SBP consistently improves upon a strong repetition baseline and delivers up to 60% of the performance improvement attainable by an oracle upper bound with access to 20x more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases: SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it. Beyond strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.
- † Stanford University
- ‡ Equal contribution
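
For concreteness, below is a minimal, illustrative sketch of the two-stage pipeline the abstract describes, not the authors' implementation: TF-IDF nearest-neighbour pair mining stands in for the learned relation model, and the helper names (`mine_related_pairs`, `build_synthesizer_examples`) are hypothetical; the synthesizer fine-tuning, large-scale sampling, and joint pretraining steps are only indicated in comments.

```python
# Illustrative SBP-style sketch (assumptions: TF-IDF similarity as the
# relation model; helper names are hypothetical, not from the paper).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def mine_related_pairs(corpus, top_k=1, min_sim=0.2):
    """Step 1: model inter-document relations by pairing each document
    with its most similar neighbours in the pretraining corpus."""
    tfidf = TfidfVectorizer().fit_transform(corpus)
    sims = cosine_similarity(tfidf)
    pairs = []
    for i, row in enumerate(sims):
        row[i] = 0.0  # ignore self-similarity
        for j in row.argsort()[::-1][:top_k]:
            if row[j] >= min_sim:
                pairs.append((corpus[i], corpus[j]))  # (seed, related target)
    return pairs


def build_synthesizer_examples(pairs):
    """Step 2 (data): turn (seed, target) pairs into conditional-generation
    examples so a synthesizer LM can learn p(related document | seed document)."""
    return [f"<seed> {seed} <doc> {target}" for seed, target in pairs]


if __name__ == "__main__":
    corpus = [
        "Gradient descent minimises a loss by following its negative gradient.",
        "Stochastic gradient descent estimates the gradient on mini-batches.",
        "Transformers process tokens in parallel with self-attention.",
    ]
    pairs = mine_related_pairs(corpus)
    examples = build_synthesizer_examples(pairs)
    # Steps 2-3 (training, not shown): fine-tune a synthesizer LM on
    # `examples`, sample a large synthetic corpus by conditioning on seed
    # documents, then jointly pretrain the final LM on real + synthetic
    # data under a compute-matched budget.
    print(f"mined {len(pairs)} seed/target pairs for the synthesizer")
```

The key design point the sketch tries to surface is that the synthesizer is trained on cross-document pairs rather than single documents, which is what lets the synthesized corpus capture inter-document correlations instead of producing simple paraphrases.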
