Transcending Scaling Laws with 0.1% Extra Compute

Scaling language models improves performance but comes with significant
computational costs. This paper proposes UL2R, a method that substantially
improves existing language models and their scaling curves with a relatively
tiny amount of extra compute. The key idea is to continue training a
state-of-the-art large language model (e.g., PaLM) on a few more steps with
UL2’s mixture-of-denoiser objective. We show that, with almost negligible extra
computational costs and no new sources of data, we are able to substantially
improve the scaling properties of large language models on downstream metrics.
In this paper, we continue training PaLM with UL2R, introducing a new set of
models at 8B, 62B, and 540B scale which we call U-PaLM. Impressively, at 540B
scale, we show an approximately 2x computational savings rate where U-PaLM
achieves the same performance as the final PaLM 540B model at around half its
computational budget (i.e., saving ~4.4 million TPUv4 hours). We further
show that this improved scaling curve leads to ‘emergent abilities’ on
challenging BIG-Bench tasks – for instance, U-PaLM does much better than PaLM
on some tasks or demonstrates better quality at much smaller scale (62B as
opposed to 540B). Overall, we show that U-PaLM outperforms PaLM on many
few-shot setups, including English NLP tasks (e.g., commonsense reasoning, question
answering), reasoning tasks with chain-of-thought (e.g., GSM8K), multilingual
tasks (MGSM, TydiQA), MMLU and challenging BIG-Bench tasks. Finally, we provide
qualitative examples showing the new capabilities of U-PaLM for single and
multi-span infilling.
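
At its core, the UL2R recipe only changes the training objective for a short final stretch of training: instead of pure left-to-right language modeling, examples are drawn from a mixture of denoising tasks (span corruption at different severities plus a prefix-LM-style "sequential" denoiser). The sketch below, in plain Python, is a minimal illustration of how such mixture-of-denoisers training examples could be constructed from a token sequence; the specific span lengths, corruption rates, sentinel tokens, and mode tags are assumptions for illustration, not the paper's exact configuration.

```python
import random

# Illustrative denoiser settings in the spirit of UL2's R/S/X mixture.
# The span lengths, corruption rates, sentinel naming, and mode tags below
# are assumptions for illustration, not the exact UL2R / U-PaLM configuration.
DENOISERS = {
    "R": {"mean_span": 3, "corrupt_rate": 0.15},   # regular span corruption
    "X": {"mean_span": 32, "corrupt_rate": 0.50},  # extreme (long-span / heavy) corruption
    "S": None,                                     # sequential denoising (prefix-LM)
}


def span_corrupt(tokens, mean_span, corrupt_rate, rng):
    """Drop random spans from the input, marking each with a sentinel;
    the target is the concatenation of sentinels and the dropped spans."""
    n = len(tokens)
    num_spans = max(1, int(n * corrupt_rate) // mean_span)
    starts = sorted(rng.sample(range(n), num_spans))
    inputs, targets, cursor, sid = [], [], 0, 0
    for s in starts:
        if s < cursor:                       # skip overlapping spans
            continue
        e = min(n, s + mean_span)
        sentinel = f"<extra_id_{sid}>"       # T5-style sentinel token (assumed naming)
        inputs += tokens[cursor:s] + [sentinel]
        targets += [sentinel] + tokens[s:e]
        cursor, sid = e, sid + 1
    inputs += tokens[cursor:]
    return inputs, targets


def make_example(tokens, rng):
    """Sample one denoiser mode and build a (tagged input, target) pair."""
    mode = rng.choice(sorted(DENOISERS))
    if mode == "S":                          # prefix-LM: predict the suffix given the prefix
        split = rng.randint(1, len(tokens) - 1)
        return [f"[{mode}]"] + tokens[:split], tokens[split:]
    cfg = DENOISERS[mode]
    corrupted, target = span_corrupt(tokens, cfg["mean_span"], cfg["corrupt_rate"], rng)
    return [f"[{mode}]"] + corrupted, target


if __name__ == "__main__":
    rng = random.Random(0)
    toks = [f"t{i}" for i in range(64)]      # stand-in for a tokenized document
    inp, tgt = make_example(toks, rng)
    print(" ".join(inp))
    print("->", " ".join(tgt))
```

Per the abstract, batches of examples like these are used to continue training the existing PaLM checkpoint for only a small number of additional steps, which is where the roughly 0.1% extra compute figure comes from.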
