
A Comprehensive Guide to Synthetic Data Generation with Large Language Models

Published July 5, 2024

Updated April 4, 2026

Aayush Mittal

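Large language models (LLMs) are powerful tools not only for generating human-like text but also for creating high-quality synthetic data. This capability is changing how we approach AI development, particularly where real-world data is scarce, expensive, or privacy-sensitive. In this guide, we explore LLM-driven synthetic data generation, covering its methods, applications, and best practices.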

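1. Introduction to Synthetic Data Generation with LLMs

Synthetic data generation with LLMs means using these models to create artificial datasets that mimic real-world data. The approach offers several advantages:

Cost-effectiveness: Generating synthetic data is often cheaper than collecting and annotating real-world data.
Privacy protection: Synthetic data can be created without exposing sensitive information.
Scalability: LLMs can generate large volumes of diverse data quickly.
Customization: Data can be tailored to specific use cases or scenarios.

Let's start with the basic process of generating synthetic data with an LLM: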

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load a pre-trained LLM
model_name = "gpt2-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Define the prompt for synthetic data generation
prompt = "Generate a customer review for a smartphone:"

# Generate synthetic data (GPT-2 has no pad token, so EOS is reused for padding)
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(
    input_ids,
    max_length=100,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode and print the generated text
synthetic_review = tokenizer.decode(output[0], skip_special_tokens=True)
print(synthetic_review)

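This simple example shows how an LLM can generate a synthetic customer review. The real power of LLM-driven synthetic data generation, however, lies in more sophisticated techniques and applications.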
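One practical detail worth noting: for a causal model, the decoded output contains the prompt followed by the continuation, so the prompt usually has to be stripped before the text is stored as data. The following is a minimal sketch of that clean-up step. The helper names `strip_prompt` and `dedupe` are hypothetical (they are not part of transformers), and the sample strings are hand-written stand-ins rather than real model output.

```python
def strip_prompt(prompt: str, decoded: str) -> str:
    """Return only the continuation if `decoded` starts with `prompt`."""
    if decoded.startswith(prompt):
        decoded = decoded[len(prompt):]
    return decoded.strip()


def dedupe(samples):
    """Drop empty and exact-duplicate samples while preserving order."""
    seen = set()
    unique = []
    for text in samples:
        if text and text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


if __name__ == "__main__":
    prompt = "Generate a customer review for a smartphone:"
    # Hand-written stand-ins for decoded model outputs (not real generations).
    decoded_outputs = [
        prompt + " Great battery life, average camera.",
        prompt + " Great battery life, average camera.",
        prompt,  # the model produced nothing beyond the prompt
    ]
    reviews = dedupe(strip_prompt(prompt, d) for d in decoded_outputs)
    print(reviews)
```

Exact-match deduplication is the simplest possible filter; near-duplicate detection would need a similarity measure and is left out of this sketch.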

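2. Advanced Techniques for Synthetic Data Generation

2.1 Prompt Engineering

Prompt engineering is crucial for guiding LLMs to generate high-quality, relevant synthetic data. By designing prompts carefully, we can control aspects of the generated data such as style, content, and format.

An example of a more sophisticated prompt: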

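prompt = """
Generate a detailed customer review for a smartphone with the following characteristics:
- Brand: {brand}
- Model: {model}
- Key features: {features}
- Rating: {rating}/5 stars

The review should be between 50 and 100 words and include both positive and negative aspects.

Review:
"""

brands = ["Apple", "Samsung", "Google", "OnePlus"]
models = ["iPhone 13 Pro", "Galaxy S21", "Pixel 6", "9 Pro"]
features = ["5G, OLED display, triple camera", "120Hz refresh rate, 8K video", "AI-powered camera, 5G", "Fast charging, 120Hz display"]
ratings = [4, 3, 5, 4]

# Generate multiple reviews.
# The loop variable is named phone_model so it does not shadow the LLM stored in `model`.
for brand, phone_model, feature, rating in zip(brands, models, features, ratings):
    filled_prompt = prompt.format(brand=brand, model=phone_model, features=feature, rating=rating)
    input_ids = tokenizer.encode(filled_prompt, return_tensors="pt")
    output = model.generate(input_ids, max_length=200, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)
    synthetic_review = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"Review for {brand} {phone_model}:\n{synthetic_review}\n")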

This approach yields more controlled and diverse synthetic data, tailored to specific scenarios or product types.
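The loop above pairs each brand with a single fixed rating and feature list. To cover more of the attribute space, one option is to cross the attribute lists and sample from the resulting combinations. This is a minimal sketch under that assumption; `TEMPLATE`, `build_prompts`, and the sample values are illustrative names and data, not part of the original example.

```python
import itertools
import random

TEMPLATE = (
    "Generate a customer review for a smartphone.\n"
    "- Brand: {brand}\n"
    "- Key features: {features}\n"
    "- Rating: {rating}/5 stars\n\n"
    "Review:"
)


def build_prompts(brands, features, ratings, n, seed=0):
    """Sample up to n distinct (brand, features, rating) combinations and fill the template."""
    combos = list(itertools.product(brands, features, ratings))
    rng = random.Random(seed)  # fixed seed so the same prompt set can be regenerated
    chosen = rng.sample(combos, k=min(n, len(combos)))
    return [
        TEMPLATE.format(brand=b, features=f, rating=r)
        for b, f, r in chosen
    ]


if __name__ == "__main__":
    prompts = build_prompts(
        brands=["Apple", "Samsung"],
        features=["5G, OLED display", "Fast charging, 120Hz display"],
        ratings=[2, 4, 5],
        n=5,
    )
    for p in prompts:
        print(p, end="\n---\n")
```

Each returned string can be passed to the tokenizer and `model.generate` exactly as `filled_prompt` is in the loop above. Sampling without replacement keeps every prompt distinct, which helps avoid generating the same review specification twice.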

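2.2 Few-Shot Learning

Few-shot learning provides the LLM with a few examples of the desired output format and style. This technique can significantly improve the quality and consistency of the generated data.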

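# The speaker labels and the agent (A) turns are reconstructed for illustration;
# adapt the example dialogue to your own domain.
few_shot_prompt = """
Generate a customer support conversation between an agent (A) and a customer (C) about a product issue, as follows:

C: Hello, I'm having trouble with my new headphones. The right earbud isn't working.
A: I'm sorry to hear that. Which model of headphones do you have?
C: It's the SoundMax Pro 3000.
A: Thank you. Have you tried resetting them?
C: Yes, I tried that, but it didn't help.

Now generate a new conversation about a different product issue:

C: Hi, I just received my new smartwatch, but it won't turn on.
"""

# Generate the conversation
input_ids = tokenizer.encode(few_shot_prompt, return_tensors="pt")
output = model.generate(input_ids, max_length=500, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)
synthetic_conversation = tokenizer.decode(output[0], skip_special_tokens=True)
print(synthetic_conversation)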

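This approach helps the LLM pick up the desired conversation structure and style, producing more realistic synthetic customer support interactions.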
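When many few-shot prompts are needed, writing each one by hand gets error-prone. A small helper can assemble the instruction, the worked examples, and the opening turn of the new dialogue in a consistent layout. This is a sketch only: `build_few_shot_prompt`, the "Example N" headers, and the sample dialogue are hypothetical choices rather than a prescribed format.

```python
def build_few_shot_prompt(instruction, examples, new_opening):
    """Assemble instruction, worked examples, and the first turn of a new dialogue.

    `examples` is a list of dialogues; each dialogue is a list of
    (speaker, utterance) tuples such as ("C", "Hello").
    """
    blocks = [instruction.strip()]
    for i, dialogue in enumerate(examples, start=1):
        turns = "\n".join(f"{speaker}: {text}" for speaker, text in dialogue)
        blocks.append(f"Example {i}:\n{turns}")
    # End on the agent label so the model continues with the agent's reply.
    blocks.append(f"New conversation:\nC: {new_opening}\nA:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Illustrative dialogue; the agent turn is made up for this sketch.
    example = [
        ("C", "The right earbud of my new headphones isn't working."),
        ("A", "I'm sorry to hear that. Which model do you have?"),
        ("C", "It's the SoundMax Pro 3000."),
    ]
    print(build_few_shot_prompt(
        "Generate a customer support conversation between an agent (A) and a customer (C).",
        [example],
        "My new smartwatch won't turn on.",
    ))
```

Ending the prompt with a bare `A:` label is a design choice: it nudges the model to answer in the agent's voice instead of continuing the customer's turn.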

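2.3 Conditional Generation

Conditional generation lets us control specific attributes of the generated data. It is particularly useful when we need a diverse dataset with specific, controlled characteristics.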

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")

def generate_conditional_text(prompt, condition, max_length=100):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)

    # Encode the condition
    condition_ids = tokenizer.encode(condition, add_special_tokens=False, return_tensors="pt")

    # Prepend the condition tokens to the prompt tokens
    input_ids = torch.cat([condition_ids, input_ids], dim=-1)
    attention_mask = torch.cat([torch.ones(condition_ids.shape, dtype=torch.long, device=condition_ids.device), attention_mask], dim=-1)

    output = model.generate(input_ids, attention_mask=attention_mask, max_length=max_length, num_return_sequences=1, no_repeat_ngram_size=2, do_sample=True, top_k=50, top_p=0.95, temperature=0.7, pad_token_id=tokenizer.eos_token_id)

    return tokenizer.decode(output[0], skip_special_tokens=True)

# Generate product descriptions under different conditions
conditions = ["Luxury", "Budget-friendly", "Eco-friendly", "High-tech"]
prompt = "Describe a backpack:"

for condition in conditions:
    description = generate_conditional_text(prompt, condition)
    print(f"{condition} backpack description:\n{description}\n")
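Because every output is generated under a known condition, the condition can be stored next to the text as a label, which turns the raw generations into a labeled dataset. Below is a minimal sketch of that bookkeeping. `build_labeled_dataset` and `to_jsonl` are hypothetical helpers, and a stand-in generator replaces the model so the example runs without downloading weights.

```python
import json


def build_labeled_dataset(generate_fn, prompt, conditions, samples_per_condition=2):
    """Call generate_fn(prompt, condition) repeatedly and keep the condition as a label."""
    records = []
    for condition in conditions:
        for _ in range(samples_per_condition):
            text = generate_fn(prompt, condition)
            records.append({"condition": condition, "prompt": prompt, "text": text})
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    # Stand-in generator (an assumption for this sketch); in practice this
    # would be generate_conditional_text from the code above.
    def fake_generate(prompt, condition):
        return f"{condition} {prompt} ..."

    data = build_labeled_dataset(
        fake_generate, "Describe a backpack:", ["Luxury", "Eco-friendly"]
    )
    print(to_jsonl(data))
```

Passing the generator in as a function keeps the dataset logic independent of any particular model, so the same code works with a local model or a hosted API.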

Aayush Mittal

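I have spent the past five years immersed in the fascinating world of machine learning and deep learning. My passion and expertise have led me to contribute to over 50 diverse software engineering projects, with a particular focus on AI/ML. My ongoing curiosity has also drawn me toward natural language processing, a field I am eager to explore further.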

Unite.AI