Generating Synthetic Data for LLM Post-Training
Jonas Vetterle (@jvetterle)
I recently wrote an article about test-time compute, in which I covered a couple of papers where synthetic data is used to teach LLMs to solve problems step by step.
A lot of those papers focussed on solving math problems, which are a good fit for synthetic data generation: there is often a definitive answer to a problem, so it's easy to verify if a solution is correct or not.
In this article I wanted to dive deeper into domains other than math and review how some of the big LLM releases from this year made use of synthetic data.
I had to draw a line in the sand somewhere, so I'm focussing on synthetic data generation for SFT (supervised fine-tuning), i.e. for post-training, unless stated otherwise.
The LLMs I looked at for this blog post are:
- Llama 3.1 by Meta [paper]
- AFM by Apple [paper]
- Qwen2 by Alibaba [paper]
- Hunyuan-Large by Tencent [paper]
Why does training on synthetic data even work?
It sounds counterintuitive at first: why would it be beneficial to create synthetic data with a given model, and then use that data to fine-tune the same model? If you take the view that an LLM is just a 'lossy text compression algorithm' then this doesn't make sense: you're just resampling from the same distribution over and over again.
Part of the reason why this works is that, as Saunders et al. argue, it is much easier to spot errors in an answer than it is to generate an error-free answer. In the same way, it is easier to verify a solution than it is to come up with one (compare the P vs NP question from complexity theory: the widely believed conjecture that P ≠ NP captures exactly the intuition that verifying a solution can be much easier than finding one).
Another big driver is the fact that, by creating responses with an LLM, you are effectively denoising the training data. If you think about it, the internet (which makes up a large part of the pre-training data) is full of incorrect statements, typos, and other noise. Using an LLM to create synthetic training data for another LLM is a form of data cleaning or data augmentation, both common pre-processing steps in machine learning: you throw away the noisy data and keep the clean bits.
But it goes beyond that, as one of the main authors of the Llama 3 paper, Thomas Scialom, points out in this podcast. Once you give an LLM access to tools, like a calculator, a code interpreter or a search engine, it can get feedback on its generations - figure out whether its answers are correct or not. This way you're effectively adding expert knowledge that the model hasn't encountered during pre-training. So this goes beyond purely augmenting existing data from the pre-training corpus.
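To make the feedback idea concrete, here is a toy illustration (mine, not from any of the papers): checking a claimed result mechanically is far cheaper than producing it, and that gap is exactly what tool feedback exploits.

```python
# Toy illustration (not from the papers): verifying a model's claimed
# factorisation with plain Python. The check is mechanical and cheap,
# even though finding the factors required real work.
def verify_factorisation(n: int, factors: list[int]) -> bool:
    product = 1
    for f in factors:
        product *= f
    return product == n

# Suppose a model claims that 15251 = 101 * 151; one line confirms it.
print(verify_factorisation(15251, [101, 151]))  # True
```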
Motivations for generating synthetic data
While reading those papers, I noticed some common motivations for generating synthetic data:
- Simply improving performance on tasks like solving math problems or coding
- Teaching new skills: for example, we might want to teach models to use tools like a web browser, code interpreter or calculator, or call functions that are defined in-context
- Covering underrepresented domains: we can generate synthetic data for domains that are underrepresented in the pre-training corpus, e.g. less common programming languages or less widely spoken languages
- Alignment & safety: see e.g. Qwen2 where the authors define a set of guidelines and create synthetic data from it to teach the model to follow them. Or Llama 3.1 where the authors create synthetic adversarial data in order to train the model to generate safe responses
Techniques used in synthetic data generation
I found a bunch of techniques for generating synthetic data in those papers and roughly group them into the following categories (a minimal sketch of the recurring generate-judge-filter loop follows the list):
- Using LLMs to generate responses: seems obvious, but just putting it here for completeness
- Using LLMs to generate questions: e.g. in Llama 3.1 post-training, the model is used to create synthetic user prompts that require tool use, so as to trigger the generation of tool use responses
- Bootstrapping from seed data: e.g. in AFM, the authors use a set of seed programming topics as a starting point and generate synthetic coding problems from those
- Using LLMs to judge/critique and score synthetic data: e.g. for Llama 3.1, the authors use the model to score the informativeness and correctness of generated answers
- External feedback: as mentioned above, this is used for things like coding or math problems where an answer can be verified using an external tool like a calculator or a code interpreter
- Instructing LLMs to self-improve: e.g. when generating synthetic coding problems, solutions and unit tests, the model is asked to refine its generations based on execution feedback
- Translation and adaptation: e.g. in Llama 3, the authors use Llama to translate code from common to less common programming languages
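Several of these techniques compose into one loop: generate candidates, have an LLM judge them, and keep only the best. Here is a minimal sketch of that loop; the `llm` callable, the prompt wording and the 1-5 scoring scale are my assumptions, not any paper's actual implementation.

```python
from typing import Callable
import re

LLM = Callable[[str], str]  # any prompt -> completion function

def generate_candidates(llm: LLM, question: str, n: int = 4) -> list[str]:
    """Sample n candidate answers for one question."""
    return [llm(f"Answer step by step:\n{question}") for _ in range(n)]

def judge_score(llm: LLM, question: str, answer: str) -> int:
    """LLM-as-judge: ask the model to rate an answer from 1 to 5."""
    reply = llm(
        "Rate this answer for correctness from 1 to 5. Reply with one digit.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 1  # treat unparseable replies as bad

def build_sft_examples(llm: LLM, questions: list[str],
                       threshold: int = 4) -> list[dict]:
    """Keep only candidates the judge scores at or above the threshold."""
    kept = []
    for q in questions:
        for a in generate_candidates(llm, q):
            if judge_score(llm, q, a) >= threshold:
                kept.append({"prompt": q, "response": a})
    return kept
```

Passing the model in as a plain prompt-to-completion callable keeps the sketch independent of any particular API client.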
Let's take a closer look into how each of the LLMs use synthetic data for post-training.
Llama 3.1
Meta released the Llama 3.1 [paper] in July 2024. Their post-training phase consists of 6 rounds of iterative SFT and RLHF training. We're just focussing on the synthetic data generation part for SFT here.
The list below summarises the techniques used for generating synthetic SFT data.
Coding
- Execution Feedback: Llama is used to (1) generate programming problem statements, (2) generate solutions, (3) generate unit tests, and (4) take feedback from failed tests (and static analysis) into account and self-correct (see the sketch after this list).
- Programming Language Translation: As there is a lack of training data for less common programming languages, the authors use Llama to translate code from common languages to less common languages. After translation there are also some quality checks like syntax parsing, compilation and execution.
- Backtranslation: For tasks where execution feedback is less indicative of correctness (like adding documentation or explanations), the authors let Llama (1) generate documentation for a piece of code, (2) backtranslate the documentation to code, i.e. generate code using only the documentation, and (3) judge the quality of the resulting code. Only the highest-scoring examples make it into the SFT corpus.
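Here is a hedged sketch of what such an execution-feedback loop could look like. Everything below (function names, prompts, the subprocess test runner) is my illustration of the idea, not Meta's implementation, and LLM-generated code should of course run in a sandbox in practice.

```python
import subprocess
import tempfile
from typing import Callable

def run_tests(solution: str, tests: str, timeout: int = 10) -> tuple[bool, str]:
    """Write solution + tests to a file and execute them in a subprocess.

    In a real pipeline this should run inside a sandbox: the code is
    model-generated and untrusted.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True,
                                text=True, timeout=timeout)
        return result.returncode == 0, result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return False, "timeout"

def solve_with_feedback(llm: Callable[[str], str], problem: str,
                        tests: str, max_rounds: int = 3) -> str | None:
    """Generate a solution; on failing tests, feed the log back and retry."""
    solution = llm(f"Write a Python solution to this problem:\n{problem}")
    for _ in range(max_rounds):
        passed, log = run_tests(solution, tests)
        if passed:
            return solution  # keep for the SFT corpus
        solution = llm(
            f"Problem:\n{problem}\n\nYour solution:\n{solution}\n\n"
            f"It failed these tests:\n{log}\n\nReturn a corrected solution."
        )
    return None  # discard generations that never pass
```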
Math
- Generate step-by-step reasoning traces and filter the generations based on whether the final answer is correct (see the sketch after this list). The authors also use self-verification (i.e. asking Llama itself) to filter out incorrect reasoning paths. In both cases, the goal is to generate high-quality fine-tuning data.
- Train ORMs and PRMs (outcome and process reward models) to filter out incorrect reasoning paths
- Interleaving code and text reasoning: let the model generate text as well as code, so that it can execute the code to get feedback on its generation and check whether its reasoning process is correct
- Learning from feedback and mistakes: use incorrect generations to perform error correction by prompting the model to correct itself
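A minimal sketch of the first filtering strategy, assuming answers end with a line like `Answer: 42` (that extraction convention is mine, not the paper's):

```python
import re
from typing import Callable

def extract_answer(trace: str) -> str | None:
    """Pull the final answer out of a trace ending in 'Answer: <result>'."""
    match = re.search(r"Answer:\s*(.+)", trace)
    return match.group(1).strip() if match else None

def filter_traces(llm: Callable[[str], str], problem: str,
                  gold_answer: str, n: int = 8) -> list[str]:
    """Sample n step-by-step traces and keep those with the correct answer."""
    prompt = f"Solve step by step, ending with 'Answer: <result>'.\n{problem}"
    traces = [llm(prompt) for _ in range(n)]
    return [t for t in traces if extract_answer(t) == gold_answer]
```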
Tool Use
Llama is fine-tuned to use tools (Brave API, Wolfram Alpha API and a Python interpreter) in a chat setup to solve user queries. Apart from those tools it can also be used for function-calling (aka zero-shot tool use) where an arbitrary function is defined in the prompt.
Collecting fine-tuning data for this task requires detailed human feedback, potentially for multiple steps in the same interaction. That's expensive so the authors use earlier Llama 3 checkpoints to generate this feedback synthetically instead of using human annotators.
But Llama isn't only used for generating answers (including tool calls). The starting prompts, or 'user queries', are model-generated too: the model is prompted to generate user queries that require tool use, and then to generate the step-by-step answers to those synthetic prompts.
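A sketch of that two-stage setup, where both the query and the tool-using answer come from the model; the prompt wording and the returned dictionary format are illustrative assumptions:

```python
from typing import Callable

def synthesize_tool_use_pair(llm: Callable[[str], str], tool: str) -> dict:
    """Generate a tool-requiring user query, then a tool-using answer to it."""
    query = llm(
        f"Write a realistic user question that can only be answered by "
        f"calling the {tool} tool. Reply with the question only."
    )
    answer = llm(
        f"You may call the {tool} tool. Answer the user step by step, "
        f"showing each tool call.\nUser: {query}"
    )
    return {"prompt": query, "response": answer}

# Usage sketch:
# pairs = [synthesize_tool_use_pair(llm, t)
#          for t in ("search engine", "calculator", "code interpreter")]
```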
Multilinguality
Similarly to less common programming languages, there is also a lack of training data for less commonly spoken languages. So the authors branch off the main pre-training run and train a multilingual expert model on 90% multilingual data. This model is then used to collect higher quality annotations in non-English languages.
Long Context
Generating human SFT data for long-context tasks can be very expensive, because it's time-consuming and tedious for humans to read lengthy contexts. Instead, the authors used earlier Llama versions to create synthetic fine-tuning data for long-context tasks like summarisation and question answering.
Factuality
In order to reduce hallucinations, the authors generate synthetic data to align the model with factual information present in the pre-training data. They generate synthetic data by (1) sampling context from the pre-training data, (2) asking Llama to generate a factual question, then (3) asking it to generate answers, and then (4) asking it to generate a correctness and an informativeness score. For answers that have a high informativeness but a low correctness score, the authors ask Llama to generate a refusal instead of providing an incorrect answer.
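Put together, the four steps could look roughly like this; the scoring prompts and the thresholds below are my assumptions:

```python
import random
import re
from typing import Callable

def score(llm: Callable[[str], str], instruction: str) -> int:
    """Ask the model for a 1-5 rating and parse the digit."""
    reply = llm(instruction + "\nReply with a single digit from 1 to 5.")
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 1

def factuality_example(llm: Callable[[str], str], corpus: list[str]) -> dict:
    context = random.choice(corpus)                                # (1) sample context
    question = llm(f"Write a factual question about:\n{context}")  # (2) question
    answer = llm(question)                                         # (3) answer, no context
    correctness = score(llm, f"Given this context:\n{context}\n"
                             f"Is this answer to '{question}' correct?\n{answer}")
    informativeness = score(llm, f"How informative is this answer to "
                                 f"'{question}'?\n{answer}")       # (4) scores
    if informativeness >= 4 and correctness <= 2:
        # Informative-sounding but wrong: teach a refusal instead.
        answer = llm(f"Politely decline to answer '{question}', explaining "
                     f"that you are not certain of the facts.")
    return {"prompt": question, "response": answer}
```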
Safety
In addition to human fine-tuning data, the authors use in-context learning, guided mutation of seed prompts and advanced methods like Rainbow Teaming to generate synthetic adversarial examples.
AFM
Apple released their [AFM paper] in July 2024 as well. Synthetic data generation is used both in the pre-training and post-training phases of AFM. In the pre-training phase, the authors use synthetic long-context Q&A data for context lengthening, i.e. to train the model on using contexts of up to 32k tokens. For post-training, they create synthetic data for specific domains like coding, math and tool use, which we'll focus on here.
Coding
The authors use a self-instruct method with rejection sampling to generate a synthetic coding data set.
Here is what that means:
- Self-instruct: AFM is prompted to generate coding problems itself, based on a set of 71 different seed programming topics.
- Rejection sampling: AFM is then prompted to generate a number of candidate solutions and unit tests for each question, and the best solution is selected based on execution feedback (a sketch follows this list).
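In code, best-of-N selection by execution feedback could look like the sketch below, where `passes` is any function you supply that runs the generated unit tests against a candidate and returns how many pass (e.g. built on a subprocess runner like the one in the Llama section); the rest is my illustration, not Apple's implementation.

```python
from typing import Callable

def best_of_n(llm: Callable[[str], str], passes: Callable[[str], int],
              problem: str, n: int = 8) -> str | None:
    """Sample n candidate solutions and keep the one passing the most tests."""
    candidates = [llm(f"Write a Python solution to:\n{problem}")
                  for _ in range(n)]
    best = max(candidates, key=passes)
    return best if passes(best) > 0 else None  # reject if nothing passes
```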
Math
The authors create synthetic math problems and solutions for SFT. Math problems are generated in 2 ways:
- Starting from a set of seed math problems, ask AFM to (1) rephrase the problem or (2) reverse it (e.g. finding an input variable given the answer)
- Prompt AFM to do an (1) in-depth or (2) in-breadth evolution of a seed problem. In-depth means making the problem statement more complex, and in-breadth means covering more topics.
Then, given those synthetically created math problems, the authors prompt AFM to solve them using chain-of-thought, i.e. generating a step-by-step solution. The authors generate multiple responses per question and then select the best one.
In cases where a ground truth answer is available, the correctness can easily be verified. And in cases where it's unavailable, the authors use an LLM judge to decide whether an answer is correct and ok to be added to the SFT corpus.
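Those two verification paths, exact match against a ground-truth answer where one exists and an LLM judge where it doesn't, might be combined like this (prompt wording is an assumption):

```python
from typing import Callable

def accept_solution(llm: Callable[[str], str], problem: str, solution: str,
                    final_answer: str, gold: str | None = None) -> bool:
    """Exact match against a gold answer if we have one, LLM judge otherwise."""
    if gold is not None:
        return final_answer.strip() == gold.strip()
    verdict = llm(
        f"Problem: {problem}\nProposed solution:\n{solution}\n"
        "Is the solution correct? Reply YES or NO."
    )
    return verdict.strip().upper().startswith("YES")
```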
Tool Use
Here the authors use a combination of synthetic and human annotated data. Similar to Llama 3.1, AFM is able to call functions, use a code interpreter and browse the internet. The basic tool use capabilities (single tool use) are bootstrapped with synthetic data. Instruction data for more advanced tool use including multi-tool and multi-step are then collected using human annotators.
Qwen2
Alibaba also released their paper on Qwen2 in July 2024 (July 2024 was a hot month for LLM releases). They mention that they used synthetic data during the pre-training phase, though there isn't much detail on what kind of data or how it was generated. In the post-training section they describe the following techniques for generating synthetic data for SFT (demonstration data) and RLHF (preference data).
Coding
LLMs generate solutions and test cases, which are then compiled and executed to assert their correctness.
Math
The authors use LLMs to generate multiple responses for an instruction, and rejection sampling to select the best ones. Generations that lead to the correct answer are used for SFT, and pairs of correct and incorrect generations are used for RLHF.
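One way to read this is as a single sampling pass that yields both data sets; a hedged sketch, where `check` is whatever correctness test fits the domain (e.g. comparing a final answer to a gold label):

```python
from typing import Callable

def split_generations(llm: Callable[[str], str], check: Callable[[str], bool],
                      instruction: str, n: int = 8) -> tuple[list, list]:
    """Sample n responses; correct ones become SFT data, and
    correct/incorrect pairs become preference data."""
    generations = [llm(instruction) for _ in range(n)]
    correct = [g for g in generations if check(g)]
    incorrect = [g for g in generations if not check(g)]
    sft = [{"prompt": instruction, "response": g} for g in correct]
    preferences = [{"prompt": instruction, "chosen": c, "rejected": r}
                   for c in correct for r in incorrect]
    return sft, preferences
```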
Data Repurposing
The authors create demonstration data for literary writing tasks by pairing high-quality literary works with LLM-generated instructions.
Constitutional Feedback
LLMs were used to generate responses that either align with or deviate from a set of guidelines, to create demonstration and preference data.
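A sketch of how a guideline could be turned into a preference pair; the two prompts below are illustrative assumptions, not Qwen2's actual setup:

```python
# Hedged sketch: for each guideline, sample one response that follows it and
# one that ignores it, yielding a preference pair for RLHF.
from typing import Callable

def constitutional_pair(llm: Callable[[str], str],
                        guideline: str, user_prompt: str) -> dict:
    chosen = llm(f"Follow this guideline strictly: {guideline}\n"
                 f"User: {user_prompt}")
    rejected = llm(f"Answer without regard to this guideline: {guideline}\n"
                   f"User: {user_prompt}")
    return {"prompt": user_prompt, "chosen": chosen, "rejected": rejected}
```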
Hunyuan-Large
This is Tencent's [paper] which got released in November 2024. Their paper mentions synthetic data both in the pre-training and post-training sections.
Pre-training
The authors don't go into great detail here, but what is clear is that they use synthetic data for (1) math, (2) coding, and (3) "low-resource" and "high-educational-value" fields. The process for generating synthetic data follows these steps (a sketch of the pipeline follows the list):
- Instruction Generation: Starting from high quality seed questions, instruct an LLM to generate more synthetic instructions covering various domains
- Instruction Evolution: Improve the instructions by making them clearer, more informative or more difficult (through self-instructing the LLM)
- Response Generation: Use different specialised models to generate responses
- Response Filtering: Use a critique-model and conduct self-consistency checks to filter out low quality answers
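Chained together, the four stages might look like the sketch below; in the paper each stage may use a different specialised model, and every prompt here is my assumption:

```python
from typing import Callable

def hunyuan_style_example(llm: Callable[[str], str],
                          seed_question: str) -> dict | None:
    instruction = llm(  # 1. instruction generation from a seed
        f"Write a new question in the same domain as:\n{seed_question}")
    instruction = llm(  # 2. instruction evolution
        f"Rewrite this question to be clearer, more informative and harder:\n"
        f"{instruction}")
    response = llm(instruction)  # 3. response generation
    verdict = llm(  # 4. critique-based filtering
        f"Question: {instruction}\nAnswer: {response}\n"
        "Is the answer high quality and correct? Reply YES or NO.")
    if verdict.strip().upper().startswith("YES"):
        return {"prompt": instruction, "response": response}
    return None  # filtered out
```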
Post-training
The authors create synthetic SFT data for domains like math, logical reasoning and knowledge-based question answering. Here the process is as follows:
- Instruction Extraction: Use specialised models to extract instruction-answer pairs from high-quality, publicly available data sources like websites and encyclopedias
- Instruction Generalization: The authors train a specialised model to generate new instructions of increasing difficulty and complexity levels.
- Instruction Balancing: They classify the synthetically generated instructions (presumably using a model) and balance the data set to make sure certain problem domains aren't over- or underrepresented.
- Data Quality Control: The authors use a critique model (in addition to a rules-based step and a human review step) to filter out instructions that score low on criteria like accuracy or relevance
Conclusion
I was really looking forward to seeing ablation studies on the effectiveness of synthetic data generation. For example, it would have been awesome to see the performance gain from using synthetic data after each round of SFT + RLHF training in Llama 3.1.
Unfortunately, none of the papers I reviewed had such information. However, the end results speak for themselves and the authors of the AFM paper talk very explicitly about the quality and usefulness of synthetic data:
"Our findings suggest that when guided by our robust reward models, AFMs are capable of generating high quality responses and for some specific domains, these responses are found to be on par with, or even superior to, human annotations. Therefore, we extend our prompt set to increase the diversity and find that those generated responses can benefit AFMs themselves."
It's exciting to wonder where the limits of synthetic data generation lie. Do the benefits plateau after a certain amount of data? And apart from the domains that were covered in this blog post, and which are kind of "easy" to generate synthetic data for, what other domains could benefit from synthetic data generation?
To wrap this up, here is a short snippet from Mark Zuckerberg on Llama 3 and synthetic data generation: