MindSpore 25-Day Learning Camp, Day 13 | LLM Principles and Practice: Text Decoding Principles, Using MindNLP as an Example

1. Text decoding principles, using MindNLP as an example

1.1 Autoregressive language models

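  • An autoregressive language model predicts the next word from the preceding text.
  • The probability distribution of a text sequence of length $T$ factorizes into the product of each word's conditional probability given its context: $P(w_{1:T} \mid W_0) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, W_0)$
  • $W_0$: the initial context word sequence
  • $t$: the time step
  • Generation stops when the EOS token is produced.
  • The sections below walk through text generation methods provided by MindNLP / Hugging Face Transformers.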
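
As a small worked instance of the factorization described above, the sketch below multiplies a few conditional probabilities into a sequence probability and also sums their logarithms, which is the form usually preferred for long sequences because it avoids underflow. The list `conditional_probs` is a hypothetical example, not model output.

```python
import math

# Hypothetical conditional probabilities P(w_t | w_{1:t-1}, W_0) for t = 1, 2, 3.
conditional_probs = [0.5, 0.4, 0.9]

# Probability of the whole sequence: the product of the conditionals.
sequence_prob = math.prod(conditional_probs)

# The same quantity in log space: a sum of log-probabilities.
sequence_log_prob = sum(math.log(p) for p in conditional_probs)

print(sequence_prob, math.exp(sequence_log_prob))
```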

1.2 Environment setup

!pip uninstall mindvision -y
!pip uninstall mindinsight -y
!pip install mindnlp

1.3 Greedy search

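Greedy search is a decoding strategy used in natural language processing (NLP) and machine learning, particularly in text generation tasks such as language modeling and machine translation. At every step it picks the option that is most probable right now, without considering how that choice affects later steps.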


$w_t = \operatorname{argmax}_{w} P(w \mid w_{1:t-1})$

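With greedy search, the output sequence ("The", "nice", "woman") has a conditional probability of 0.5 × 0.4 = 0.2.

Drawback: it misses high-probability words hidden behind a lower-probability word. For example, "has" (0.9) is only reachable through "dog" (0.4), which greedy search never selects.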
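
The example above can be traced with a toy next-word table. This is only a sketch: the probabilities 0.5, 0.4 and 0.9 come from the text, while the remaining words and values in `NEXT_WORD_PROBS` are illustrative assumptions, and `greedy_search` is a hypothetical helper rather than a MindNLP function.

```python
# Toy next-word table. Only 0.5 ("nice"), 0.4 ("dog" and "woman") and 0.9 ("has")
# appear in the text; every other entry is an illustrative assumption.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def greedy_search(prefix, steps):
    """Extend `prefix` by always taking the most probable next word."""
    sequence, prob = tuple(prefix), 1.0
    for _ in range(steps):
        candidates = NEXT_WORD_PROBS.get(sequence)
        if not candidates:
            break  # no known continuation for this prefix
        word = max(candidates, key=candidates.get)
        prob *= candidates[word]
        sequence += (word,)
    return sequence, prob


greedy_sequence, greedy_prob = greedy_search(("The",), steps=2)
print(greedy_sequence, round(greedy_prob, 2))  # ('The', 'nice', 'woman') 0.2
```

Because "nice" wins the first step, the branch through "dog" is never explored, even though it leads to a more probable two-word continuation.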

# Import the GPT-2 tokenizer and model
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel
# Initialize the tokenizer and load the pretrained weights from the modelscope mirror
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')
# Initialize the model and load the pretrained weights; use the EOS token as the PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')
# Encode the conditioning context to get input_ids
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')
# Generate text with greedy search until the output length (including the context) reaches 50
greedy_output = model.generate(input_ids, max_length=50)
# Print the generated text
print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))


I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.I'm not sure if I'll

1.4 Beam search

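Beam search is a heuristic search algorithm commonly used for sequence generation problems in natural language processing, such as machine translation and speech recognition. Unlike greedy search, it considers several candidates at each step rather than only the single best one, which helps it avoid the local optima that greedy search can get stuck in.
The basic idea is to maintain a candidate list of size K (the "beam"). At each step every candidate in the list is extended, the K most probable results are kept as the new list, and the process repeats until the sequences end. K is called the beam width and is a hyperparameter that can be tuned as needed.

By keeping the num_beams most probable hypotheses at each time step and finally choosing the sequence with the highest overall probability, beam search reduces the risk of missing a hidden high-probability sequence. Taking num_beams=2 as an example: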

("The", "dog", "has"): 0.4 × 0.9 = 0.36

("The", "nice", "woman"): 0.5 × 0.4 = 0.20
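
The num_beams=2 case can be reproduced on a toy next-word table. This is a minimal sketch, not the MindNLP implementation: only the probabilities 0.5, 0.4 and 0.9 come from the text, the other entries in `NEXT_WORD_PROBS` are illustrative assumptions, and `beam_search` is a hypothetical helper.

```python
# Toy next-word table. Only 0.5 ("nice"), 0.4 ("dog" and "woman") and 0.9 ("has")
# appear in the text; every other entry is an illustrative assumption.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def beam_search(prefix, steps, num_beams):
    """Keep the `num_beams` most probable hypotheses after every step."""
    beams = [(tuple(prefix), 1.0)]
    for _ in range(steps):
        expanded = []
        for sequence, prob in beams:
            for word, p in NEXT_WORD_PROBS.get(sequence, {}).items():
                expanded.append((sequence + (word,), prob * p))
        if not expanded:
            break  # none of the hypotheses can be extended
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:num_beams]
    return beams


beams = beam_search(("The",), steps=2, num_beams=2)
best_sequence, best_prob = beams[0]
print(best_sequence, round(best_prob, 2))  # ('The', 'dog', 'has') 0.36
```

With a beam of 2, the ("The", "dog") hypothesis survives the first step, so the higher-scoring continuation through "has" is found.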


Drawbacks: 1. it does not solve the repetition problem; 2. it performs poorly on open-ended generation.

# Import the GPT-2 tokenizer and model
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel
# Initialize the tokenizer and load the pretrained weights from the modelscope mirror
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')
# Initialize the model and load the pretrained weights; use the EOS token as the PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')
# Encode the conditioning context to get input_ids
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')
# Generate text with beam search and early stopping: beam width 5, maximum length 50
beam_output = model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)
# Print the generated text
print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')
# Set no_repeat_ngram_size=2 so that no 2-gram is generated twice
beam_output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
# Print the generated text
print("Beam search with ngram, Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')
# Return more than one sequence by setting num_return_sequences > 1
beam_outputs = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, num_return_sequences=5, early_stopping=True)
# Print every generated sequence
print("return_num_sequences, Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
print(100 * '-')


I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again.""I don't think I'll ever be able to walk with her again.""I don't think I
Beam search with ngram, Output:
I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again.""I'm not sure what to say to that," she said. "I mean, it's not like I'm
return_num_sequences, Output:
0: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again.""I'm not sure what to say to that," she said. "I mean, it's not like I'm
1: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again.""I'm not sure what to say to that," she said. "I mean, it's not like she's
2: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again.""I'm not sure what to say to that," she said. "I mean, it's not like we're
3: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again.""I'm not sure what to say to that," she said. "I mean, it's not like I've
4: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with her again.""I'm not sure what to say to that," she said. "I mean, it's not like I can

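An n-gram penalty can mitigate the repetition problem: the probability of any candidate word that would recreate an n-gram already present in the output is set to 0. With no_repeat_ngram_size=2, no 2-gram appears twice. Note: real text sometimes needs to repeat an n-gram, so this penalty should be used with care.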
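
To make the n-gram penalty concrete, the sketch below computes which next words would be forbidden at the current step. It illustrates the idea on plain word lists and is not the code MindNLP runs internally; `banned_next_tokens` is a hypothetical helper.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the words that would complete an n-gram already seen in `tokens`."""
    if ngram_size < 1 or len(tokens) + 1 < ngram_size:
        return set()
    # The last (ngram_size - 1) words form the prefix of the n-gram being built.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])  # this word would repeat an earlier n-gram
    return banned


generated = "I walk my dog and I".split()
banned = banned_next_tokens(generated, ngram_size=2)
print(banned)  # {'walk'}
```

Here the 2-gram ("I", "walk") has already been generated, so after the second "I" the word "walk" would get probability 0.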

1.5 Sampling

The output word $w_t$ is chosen at random according to the current conditional probability distribution.
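
A minimal sketch of this sampling step, using only the Python standard library: the distribution `probs` is a hypothetical example, and a seeded `random.Random` plays the role that `mindspore.set_seed` plays in the MindNLP code.

```python
import random

# Hypothetical next-word distribution P(w | w_{1:t-1}); not model output.
probs = {"nice": 0.5, "dog": 0.4, "car": 0.1}

# A seeded generator makes the draws reproducible.
rng = random.Random(0)

# Draw 1000 next words in proportion to their probabilities.
samples = rng.choices(list(probs), weights=list(probs.values()), k=1000)
counts = {word: samples.count(word) for word in probs}
print(counts)
```

Unlike greedy search, low-probability words such as "car" are still chosen from time to time, which is what makes sampled text more varied.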


# Import MindSpore, used to set the random seed
import mindspore
# Import the GPT-2 tokenizer and model
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel
# Initialize the tokenizer and load the pretrained weights from the modelscope mirror
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')
# Initialize the model and load the pretrained weights; use the EOS token as the PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')
# Encode the conditioning context to get input_ids
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')
# Set the random seed for reproducible results
mindspore.set_seed(0)
# Enable sampling, disable top_k filtering (top_k=0), maximum length 50
sample_output = model.generate(input_ids, do_sample=True, max_length=50, top_k=0)
# Print the generated text
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


I enjoy walking with my cute dog Neddy as much as I'd like. Keep up the good work Neddy!"I realized what Neddy meant when he first launched the website. "Thank you so much for joining."I

1.6 Temperature

Lowering the softmax temperature makes the distribution $P(w \mid w_{1:t-1})$ sharper.
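
The effect can be checked numerically with a temperature-scaled softmax, $P_i = \exp(z_i / T) / \sum_j \exp(z_j / T)$. The sketch below is a plain-Python illustration; the `logits` values are hypothetical and `softmax_with_temperature` is not a MindNLP function.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by `temperature` (must be > 0)."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words

baseline = softmax_with_temperature(logits, 1.0)
sharpened = softmax_with_temperature(logits, 0.7)
print([round(p, 3) for p in baseline])
print([round(p, 3) for p in sharpened])
```

With temperature 0.7 the most likely word gains probability mass and the least likely word loses it, so sampling behaves more like greedy search.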

# Import MindSpore, used to set the random seed
import mindspore
# Import the GPT-2 tokenizer and model
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel
# Initialize the tokenizer and load the pretrained weights from the modelscope mirror
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')
# Initialize the model and load the pretrained weights; use the EOS token as the PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')
# Encode the conditioning context to get input_ids
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')
# Set the random seed for reproducible results
mindspore.set_seed(1234)
# Enable sampling, disable top_k filtering (top_k=0), maximum length 50, temperature 0.7
sample_output = model.generate(input_ids, do_sample=True, max_length=50, top_k=0, temperature=0.7)
# Print the generated text
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


I enjoy walking with my cute dog and have never had a problem with her until now.A large dog named Chucky managed to get a few long stretches of grass on her back and ran around with it for about 5 minutes, ran around

1.7 Top-K sampling


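Top-K sampling selects the K most probable words, renormalizes their probabilities, and then samples from those K words.
Restricting the sampling pool to a fixed size K can:

  • produce gibberish when the distribution is sharp
  • limit the model's creativity when the distribution is flat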
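
The filtering step itself is short enough to sketch directly. The distribution `probs` below is a hypothetical example and `top_k_filter` is an illustrative helper, not the MindNLP implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize them to sum to 1."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}


# Hypothetical next-word distribution; not model output.
probs = {"nice": 0.5, "dog": 0.3, "car": 0.15, "zebra": 0.05}

filtered = top_k_filter(probs, k=2)
print(filtered)
```

Sampling then proceeds from `filtered` instead of `probs`, so words outside the top K can never be chosen.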

# Import MindSpore, used to set the random seed
import mindspore
# Import the GPT-2 tokenizer and model
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel
# Initialize the tokenizer and load the pretrained weights from the modelscope mirror
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')
# Initialize the model and load the pretrained weights; use the EOS token as the PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')
# Encode the conditioning context to get input_ids
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')
# Set the random seed for reproducible results
mindspore.set_seed(0)
# Enable sampling with top_k=50, maximum length 50
sample_output = model.generate(input_ids, do_sample=True, max_length=50, top_k=50)
# Print the generated text
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


I enjoy walking with my cute dog.She's always up for some action, so I have seen her do some stuff with it.Then there's the two of us.The two of us I'm talking about were

1.8 Top-P sampling


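Top-P sampling draws from the smallest set of words whose cumulative probability exceeds p, after renormalizing the probabilities within that set.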
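
A minimal sketch of how such a set can be built: sort the words by probability, accumulate until the threshold is passed, then renormalize. The distribution `probs` is hypothetical, `top_p_filter` is an illustrative helper rather than the MindNLP implementation, and treating a cumulative probability exactly equal to p as sufficient is a simplifying assumption.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:  # assumption: reaching p exactly also stops the scan
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}


# Hypothetical next-word distribution; not model output.
probs = {"nice": 0.5, "dog": 0.3, "car": 0.15, "zebra": 0.05}

nucleus = top_p_filter(probs, p=0.9)
print(nucleus)
```

Unlike a fixed K, the size of the kept set adapts to the distribution: a sharp distribution keeps few words, a flat one keeps many.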

# Import MindSpore, used to set the random seed
import mindspore
# Import the GPT-2 tokenizer and model
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel
# Initialize the tokenizer and load the pretrained weights from the modelscope mirror
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')
# Initialize the model and load the pretrained weights; use the EOS token as the PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')
# Encode the conditioning context to get input_ids
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')
# Set the random seed for reproducible results
mindspore.set_seed(0)
# Enable sampling with top_p=0.92, disable top_k filtering (top_k=0), maximum length 50
sample_output = model.generate(input_ids, do_sample=True, max_length=50, top_p=0.92, top_k=0)
# Print the generated text
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


I enjoy walking with my cute dog Neddy as much as I'd like. Keep up the good work Neddy!"I realized what Neddy meant when he first launched the website. "Thank you so much for joining."I

1.9 top_k_top_p (combined sampling)


# Import MindSpore, used to set the random seed
import mindspore
# Import the GPT-2 tokenizer and model
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel
# Initialize the tokenizer and load the pretrained weights from the modelscope mirror
tokenizer = GPT2Tokenizer.from_pretrained("iiBcai/gpt2", mirror='modelscope')
# Initialize the model and load the pretrained weights; use the EOS token as the PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("iiBcai/gpt2", pad_token_id=tokenizer.eos_token_id, mirror='modelscope')
# Encode the conditioning context to get input_ids
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')
# Set the random seed for reproducible results
mindspore.set_seed(0)
# Enable sampling with top_k=5 and top_p=0.95, maximum length 50, and return 3 sequences
sample_outputs = model.generate(input_ids, do_sample=True, max_length=50, top_k=5, top_p=0.95, num_return_sequences=3)
# Print the generated text
print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
print(100 * '-')



2. Summary






