Recent advances in test-time alignment strategies, such as Best-of-N sampling, offer a simple and effective way to steer language models (LMs) toward preferred behaviors using reward models (RMs). However, these approaches can be computationally expensive, especially when applied uniformly across prompts without accounting for differences in alignment difficulty. In this work, we propose a prompt-adaptive strategy for Best-of-N alignment that allocates inference-time compute more efficiently. Motivated by latency constraints, we develop a two-stage algorithm: an initial exploratory phase estimates the reward distribution for each prompt using a small exploration budget, and a second stage adaptively allocates the remaining budget using these estimates. Our method is simple, practical, and compatible with any LM-RM combination. Empirical results on prompts from the AlpacaEval, HH-RLHF, and PKU-SafeRLHF datasets for 12 LM/RM pairs and 50 different batches of prompts show that our adaptive strategy outperforms uniform allocation under the same inference budget. Moreover, we show that our adaptive strategy remains competitive against uniform allocations with 20 percent larger inference budgets and improves in performance as the batch size grows.
- † University of Michigan
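
To make the two-stage idea concrete, below is a minimal sketch of a prompt-adaptive Best-of-N loop. The helpers `generate(prompt)` and `reward(prompt, response)` are hypothetical stand-ins for an LM sampling call and a reward-model score, and the dispersion-based difficulty heuristic is an illustrative assumption rather than the paper's exact allocation rule.

```python
# Sketch: two-stage, prompt-adaptive Best-of-N allocation (illustrative assumptions).
import numpy as np

def adaptive_best_of_n(prompts, generate, reward, total_budget, explore_per_prompt=4):
    """Allocate a total sampling budget across prompts in two stages."""
    n = len(prompts)

    # Stage 1: spend a small, equal exploration budget on every prompt
    # and score the samples with the reward model.
    samples = {p: [generate(p) for _ in range(explore_per_prompt)] for p in prompts}
    scores = {p: [reward(p, r) for r in samples[p]] for p in prompts}

    # Estimate per-prompt "difficulty" from the exploration scores; here we use
    # the standard deviation of observed rewards (an assumption for illustration).
    difficulty = np.array([np.std(scores[p]) + 1e-8 for p in prompts])

    # Stage 2: split the remaining budget in proportion to estimated difficulty.
    remaining = max(total_budget - n * explore_per_prompt, 0)
    extra = np.floor(remaining * difficulty / difficulty.sum()).astype(int)

    for p, k in zip(prompts, extra):
        for _ in range(int(k)):
            r = generate(p)
            samples[p].append(r)
            scores[p].append(reward(p, r))

    # Standard Best-of-N selection: return the highest-reward response per prompt.
    return {p: samples[p][int(np.argmax(scores[p]))] for p in prompts}
```

Under this sketch, prompts whose exploration rewards vary widely (and so stand to gain more from extra samples) receive a larger share of the remaining budget, while easy prompts are settled cheaply.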
