Researchers at Google and the University of Illinois at Urbana-Champaign (UIUC) have published a technique called Language Model Self-Improved (LMSI), which fine-tunes a large language model (LLM) on a dataset generated by that same model. Using LMSI, the researchers improved the performance of the LLM on six benchmarks and set new state-of-the-art accuracy records on four of them.
The team began with a pre-trained 540B parameter PaLM model. The model was given questions from an unlabeled training dataset as input, along with chain-of-thought prompts. The model generated answers for these questions, which were then paired with the inputs to form a fine-tuning dataset. The fine-tuned model was then evaluated on a suite of benchmark datasets for three different natural language processing (NLP) tasks: arithmetic reasoning, commonsense reasoning, and natural language inference. On four of the benchmarks (ARC-c, OpenBookQA, ANLI-A2, and ANLI-A3), the model outperformed previous records. According to the Google team:
We hope our simple approach and strong empirical results could encourage more future work by the community to investigate optimal performances of pretrained LLMs without additional human supervision…. As part of our future work, we plan to combine large-scale generated data from our approach and existing supervised data, to further improve the performance of LLMs.
Chain-of-thought (CoT) prompting augments the input question given to a language model by prepending an example question and answer, along with the reasoning steps used to arrive at that answer. InfoQ recently covered Google’s PaLM model, which, when used with CoT prompting, achieves state-of-the-art few-shot performance on several reasoning benchmarks. Given this few-shot performance, the LMSI researchers wanted to investigate PaLM’s performance when fine-tuned on additional datasets.
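As a rough illustration of how a CoT prompt might be assembled, the Python sketch below prepends worked examples to a new question. The `Exemplar` class, the `build_cot_prompt` function, and the "Q:/A:" layout are assumptions made for this example; they are not taken from the paper or from any PaLM API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """A worked example: question, reasoning steps, and final answer (hypothetical type)."""
    question: str
    reasoning: str
    answer: str


def build_cot_prompt(exemplars: List[Exemplar], question: str,
                     trigger: str = "Let's think step by step.") -> str:
    """Prepend worked examples to a new question.

    The layout and the trailing trigger phrase are illustrative choices,
    not the exact prompt format used by the researchers.
    """
    parts = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The model is expected to continue the text after the final "A:".
    parts.append(f"Q: {question}\nA: {trigger}".rstrip())
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = [Exemplar(
        question="Roger has 5 balls and buys 6 more. How many balls does he have?",
        reasoning="He starts with 5 and adds 6, so 5 + 6 = 11.",
        answer="11",
    )]
    print(build_cot_prompt(demo, "What is 2 + 3?"))
```

Ending each example with a fixed phrase such as "The answer is" makes the final answer easy to extract from the generated text later.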
The challenge with fine-tuning, though, is the same as with any supervised learning problem: acquiring a labeled dataset. The key idea in LMSI is to use the PaLM model itself to generate this dataset. To do this, the team took questions from a training dataset and augmented them by prepending CoT examples and applying prompts, such as “let’s think step-by-step.” These were fed into PaLM, which generated multiple candidate answers for each question. The candidates were filtered using self-consistency, which takes a majority vote over the sampled answers, so that only the highest-confidence answers were kept. The resulting question/answer dataset was then used to fine-tune the original PaLM model.
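The self-consistency filtering step can be sketched in Python as a majority vote over sampled completions. This is a minimal sketch under stated assumptions, not the researchers' implementation: the "The answer is" extraction pattern, the `self_consistency_filter` function, and the `min_agreement` threshold are all hypothetical, and the sampled completions are passed in as plain strings rather than produced by a real model call.

```python
import re
from collections import Counter
from typing import List, NamedTuple, Optional

# Assumes each completion ends with "The answer is <X>." (an illustrative convention).
ANSWER_PATTERN = re.compile(r"The answer is\s+(.+?)\.?\s*$")


class VoteResult(NamedTuple):
    answer: Optional[str]   # majority answer, or None if nothing could be parsed
    confidence: float       # fraction of all samples that agree with the majority
    kept: List[str]         # completions retained for the fine-tuning dataset


def extract_answer(completion: str) -> Optional[str]:
    """Pull the final answer out of one sampled completion, if present."""
    match = ANSWER_PATTERN.search(completion.strip())
    return match.group(1).strip() if match else None


def self_consistency_filter(completions: List[str],
                            min_agreement: float = 0.0) -> VoteResult:
    """Majority-vote over sampled answers and keep the agreeing completions.

    `min_agreement` is a hypothetical knob: questions whose majority answer
    has less support than this fraction contribute no training examples.
    """
    answers = [extract_answer(c) for c in completions]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return VoteResult(None, 0.0, [])
    best, count = votes.most_common(1)[0]
    confidence = count / len(completions)
    if confidence < min_agreement:
        return VoteResult(best, confidence, [])
    kept = [c for c, a in zip(completions, answers) if a == best]
    return VoteResult(best, confidence, kept)
```

In this sketch, each kept completion would be paired with its question to form one fine-tuning example, so the reasoning text is retained along with the final answer.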
Image Source: https://arxiv.org/abs/2210.11610
Besides the fine-tuned 540B PaLM model, the team also investigated knowledge distillation, using the generated dataset to fine-tune smaller versions of PaLM. The team found that a fine-tuned 62B parameter model outperformed the pre-trained 540B parameter one, and a fine-tuned 8B parameter model outperformed the pre-trained 62B parameter model.
This result was particularly interesting to several readers of the paper. Jules Gagnon-Marchand, an NLP grad student, called it “an incredibly strong side-result,” noting that it was “equivalent to multiplying the parameters by 10.” In a discussion on Reddit, one user wrote:
They show that the large LMSI models can be distilled into smaller models while maintaining accuracy, but I wonder what size model is necessary for the LMSI training itself to be viable. They only show results for 540B. Would be very curious to see a study here if there is a certain model size where this kicks in.
LMSI’s rank on several NLP benchmark leaderboards is available on Papers with Code.