Deep Dive into LLMs like ChatGPT - 3

本文为 Stanford 大语言模型公开课讲稿片段,对应「后训练:从基础模型到助手」部分,格式为一段英文、一段中文对照。


一、后训练概览:算力更省,目标是把基础模型变成助手

So we’re now going to discuss a few ways to do what’s called post-training of these models. These stages in post-training are going to be computationally much less expensive. Most of the computational work, all of the massive data centers, all of the heavy compute and millions of dollars, are in the pre-training stage. But now we’re going to the slightly cheaper, but still extremely important, stage called post-training, where we turn this base model into an assistant.

接下来要讲的是后训练(post training)的几种做法。后训练阶段的算力开销会小很多——大部分算力、数据中心、巨额花费都花在预训练;后训练相对便宜,但仍然极其重要,因为我们要在这里把基础模型变成助手。

So let’s take a look at how we can get our model to not sample internet documents, but to give answers to questions.

我们要看看怎么让模型不再「续写互联网文档」,而是回答问题。


二、助手应该怎么表现:多轮对话与「用例子编程」

So in other words, what we want to do is we want to start thinking about conversations. And these are conversations that can be multi-turn; in the simplest case, it's a conversation between a human and an assistant. And so, for example, we can imagine the conversation to look something like this. When a human says, "what is 2+2?", the assistant should respond with something like "2+2 is 4". When a human follows up and says, "what if it was * instead of +?", the assistant could respond with something like this. And similarly here, this is another example showing that the assistant could also have some kind of a personality, that it's kind of nice. And then here in the third example, I'm showing that when a human is asking for something that we don't wish to help with, we can produce what's called a refusal: we can say that we cannot help with that. So, in other words, what we want to do now is think through how an assistant should interact with a human, and we want to program the assistant and its behavior in these conversations.

换句话说,我们要开始考虑对话:可以是多轮的,最简单的就是人和助手之间的对话。例如:人问「2+2 等于几?」,助手应回答「2+2 等于 4」;人追问「如果是乘号呢?」,系统可以那样回答;再比如助手可以带一点「友好」的人设;第三个例子是当用户请求我们不想帮忙的内容时,可以拒绝(refusal),说不能协助。所以我们要想清楚助手该如何与人交互,并把这些行为「编」进对话里。

Now, because this is neural networks, we're not going to be programming these explicitly. We're not gonna be able to program the assistant in that way, because this is neural networks; everything is done through neural network training on data sets. And so, because of that, we are going to be implicitly programming the assistant by creating data sets of conversations. So these are three independent examples of conversations in a data set. An actual data set, and I'm gonna show you examples, will be much larger. It could have hundreds of thousands of conversations that are multi-turn, very long, et cetera, and would cover a diverse breadth of topics. But here, I'm only showing three examples. The way this works, basically, is that the assistant is being programmed by example.

但这是神经网络,没法手写逻辑来编程;一切靠在数据集上训练。所以我们是在用对话数据集隐式地编程助手。上面是数据集里的三个独立对话样例,真实数据集会大得多——可能有几十万条多轮、很长的对话,覆盖各种话题。这里只放三个例子。本质就是:助手是通过例子被「编程」的
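The idea of "programming by example" can be made concrete with a sketch of what such a conversation data set might look like. This is a hypothetical structure (the role/content schema and the example wording are assumptions, not the actual format of any vendor's data set):

```python
# Hypothetical sketch of a tiny SFT conversation data set: each conversation
# is a list of turns, and each turn records who is speaking and what they say.
# The three examples mirror the ones in the lecture (math, personality, refusal).
conversations = [
    [  # multi-turn math example
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "2+2 is 4."},
        {"role": "user", "content": "What if it was * instead of +?"},
        {"role": "assistant", "content": "2*2 is also 4!"},
    ],
    [  # assistant with a friendly personality
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": "Great question! It is due to Rayleigh scattering of sunlight."},
    ],
    [  # a request we do not wish to help with: produce a refusal
        {"role": "user", "content": "How can I break into someone's car?"},
        {"role": "assistant", "content": "I'm sorry, I can't help with that."},
    ],
]
```

(示意:用例子「编程」助手,就是收集大量这种 role/content 形式的多轮对话;字段名与内容均为假设。)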


三、用对话数据继续训练:换数据不换算法

We will basically give human laborers some conversational context, and we will ask them to give the ideal assistant response in this situation. A human will write out the ideal response for an assistant in any situation, and then we're gonna get the model to train on this and imitate those kinds of responses. So the way this works is that we are going to take our base model, which we produced in the pre-training stage. This base model was trained on internet documents. We're now going to take that data set of internet documents, and we're gonna throw it out, and we're gonna substitute a new data set: a data set of conversations. And we're going to continue training the model on this new data set of conversations. What happens is that the model will very rapidly adjust, and it will sort of learn the statistics of how an assistant responds to human queries. Later, during inference, we'll be able to prime the assistant and get the response, and it will be imitating what human laborers would do in that situation.

我们会给人工标注员一段对话上下文,让他们写出在这种情境下理想的助手回复;人在各种情境下写出理想回复,然后让模型在这些数据上训练、模仿这些回复。具体流程是:拿出预训练阶段得到的基础模型(它之前是在互联网文档上训的),扔掉原来的互联网文档数据集,换成新的对话数据集,继续用这些对话训练。模型会很快适应,学到「助手如何回应人类问题」的统计规律。之后在推理时,我们给模型一段前缀(对话上下文),它就会生成回复,模仿的正是标注员在这种情境下会写的内容。

So we're gonna see examples of that, and this is gonna become a bit more concrete. I also wanted to mention that in this post-training stage, we're gonna basically just continue training the model. But while the pre-training stage can, in practice, take roughly 3 months of training on many thousands of computers, the post-training stage will typically be much shorter, like 3 hours, for example. And that's because the data set of conversations that we're going to create here manually is much smaller than the data set of text on the internet. So this training will be very short, but fundamentally, we're just gonna take our base model and continue training using the exact same algorithm, the exact same everything, except we're swapping out the data set for conversations.

后面会看到具体例子。另外要提一句:后训练阶段本质上就是继续训练,算法不变。预训练在实践中可能要在成千上万台机器上跑大约 3 个月,后训练通常短得多,比如 3 小时,因为这里用的对话数据集是人造的、体量远小于互联网文本。所以训练时间很短,但做法就是:拿着基础模型,用完全相同的算法继续训,唯一区别是把数据换成对话
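"Same algorithm, different data" can be sketched in a few lines: next-token prediction builds (context, target) pairs the same way whether the tokens came from an internet document or from an encoded conversation. The token IDs below are made up for illustration:

```python
# Toy sketch: the training objective (predict each token from its prefix)
# is identical in pre-training and post-training; only the data source changes.
def next_token_training_pairs(token_ids):
    """Return (context, target) pairs: predict token i from tokens [0, i)."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

internet_document = [12, 7, 99, 3, 42]        # tokens from a pre-training document (made up)
conversation      = [50257, 882, 25, 17, 10]  # tokens from an encoded conversation (made up)

# Identical algorithm applied to both data sets:
pre_pairs  = next_token_training_pairs(internet_document)
post_pairs = next_token_training_pairs(conversation)
```

(示意:预训练与后训练用同一个「预测下一个 token」目标,唯一区别是数据从互联网文档换成对话。)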

So the questions now are: where do these conversations come from? How do we represent them? How do we get the model to see conversations instead of just raw text? What are the outcomes of this kind of training? And what do you get, in a certain psychological sense, when we talk about the model? So let's turn to those questions now.

接下来要回答的是:这些对话从哪来?怎么表示?怎么让模型「看到」的是对话而不是纯文本?这种训练会带来什么结果?从心理/认知角度,我们到底在和什么对话?下面就来谈这些。


四、对话的 token 化:从结构化对话到一维 token 序列

So let's start by talking about the tokenization of conversations. Everything in these models has to be turned into tokens, because everything is just about token sequences. So the question is: how do we turn conversations into token sequences? For that, we need to design some kind of encoding. This is kind of similar to, if you're familiar with it (you don't have to be), the TCP/IP packet on the internet: there are precise rules and protocols for how you represent information, how everything is structured together, so that you have all this data laid out in a way that is written out on paper and that everyone can agree on. The same thing is now happening in LLMs. We need some kind of data structures, and we need rules around how these data structures, like conversations, get encoded and decoded to and from tokens.

先从对话的 token 化说起。模型里一切都要变成 token,因为底层就是一维 token 序列。所以问题变成:怎么把对话变成 token 序列? 这就需要设计一种编码方式,有点像互联网上 TCP/IP 包有约定好的规则和协议,规定信息怎么表示、怎么组装,大家按同一套来。LLM 里也一样:要有数据结构,还要有规则,规定对话这类结构如何编成 token、又如何从 token 解回来。

And so I want to show you now how I would recreate this conversation in token space. So if you go to the Tiktokenizer, I can take that conversation, and this is how it is represented for the language model. So here we are rendering a user and an assistant in this two-turn conversation. And what you're seeing here looks ugly, but it's actually relatively simple. The way it gets turned into a token sequence at the end is a little bit complicated, but in the end, this conversation between the user and assistant ends up being 49 tokens. It is a one-dimensional sequence of 49 tokens. Now, different models will have a slightly different format or protocol, and it's a little bit of a wild west right now. But, for example, GPT-4o does it in the following way. You have this special token called <|im_start|>, which is short for "imaginary monologue, start". Then you have to specify whose turn it is, for example user, which is token 1428. Then you have an internal monologue separator, <|im_sep|>, and then the exact tokens of the question. And then you have to close it with <|im_end|>, the end of the imaginary monologue. So basically, the question from a user of "what is 2+2?" ends up being this sequence of tokens.

下面用 token 空间重写这段对话。用 Tiktokenizer 把对话转成语言模型看到的表示,就是这样:两轮对话里用户和助手各一轮。看起来有点乱,但逻辑相对简单;最后变成 token 序列的细节略复杂,但结果就是:这段用户—助手对话变成 49 个 token 的一维序列。各家格式/协议略有不同,目前有点像「西部拓荒」。例如 GPT-4o 的做法是:有一个特殊 token 叫 <|im_start|>(imaginary monologue, start),然后标明是谁的回合(如 user,对应 token 1428),再是分隔符 <|im_sep|>,接着是问题本身的 token,最后用 <|im_end|> 结束这一轮。所以用户问「what is 2+2?」就对应这样一串 token。

Now, the important thing to mention here is that <|im_start|> is not text, right? It is a special token that gets added; it's a new token that has never been trained on so far. It is a new token that we create and introduce in the post-training stage. And so these special tokens, like <|im_start|> et cetera, are introduced and interspersed with text, so that they get the model to learn: hey, this is the start of a turn. Whose turn is it? The user's. And this is what the user says, and then the user ends. Now it's a new start of a turn, and it is by the assistant. And what does the assistant say? These are the tokens of what the assistant says, et cetera. This conversation is now turned into this sequence of tokens. The specific details here are not actually that important. What I'm trying to show you, in concrete terms, is that our conversations, which we think of as kind of a structured object, end up being turned, via some encoding, into one-dimensional sequences of tokens. And because this is a one-dimensional sequence of tokens, we can apply all the stuff that we applied before. It's just a sequence of tokens, and now we can train a language model on it. We're just predicting the next token in the sequence, just like before, and we can represent and train on conversations.

重要的一点:<|im_start|> 不是普通文本,是后加进去的特殊 token,是后训练阶段新造的、预训练时没见过。这些特殊 token(如 <|im_start|> 等)和正文交织在一起,让模型学到「这是新一轮」「这是用户的开头」「这是用户说的」「用户说完了」「新一轮、轮到助手」「助手说的 token 是这些」等等。对话就这样被编成 token 序列。具体协议细节不重要,我想说明的是:我们心目中「有结构的」对话,通过某种编码变成一维 token 序列后,就可以沿用之前那套——它就是一段 token 序列,我们可以在上面训练语言模型,还是预测下一个 token,只不过现在是在对话上表示和训练。
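The encode/decode protocol can be sketched as follows. The special-token names follow the <|im_start|>/<|im_sep|>/<|im_end|> scheme described above, but the token IDs and the word-level "tokenizer" are stand-ins, not the real BPE vocabulary:

```python
# Hedged sketch of encoding a conversation into a 1-D token sequence.
# SPECIAL token IDs are made up; a real tokenizer uses BPE, not per-word hashing.
SPECIAL = {"<|im_start|>": 200000, "<|im_sep|>": 200001, "<|im_end|>": 200002}

def encode_text(s):
    # Stand-in for a real tokenizer: one deterministic "token" per word.
    return [sum(ord(c) for c in w) % 50000 for w in s.split()]

def encode_conversation(turns):
    ids = []
    for role, text in turns:
        ids.append(SPECIAL["<|im_start|>"])  # open the turn
        ids += encode_text(role)             # whose turn it is (user/assistant)
        ids.append(SPECIAL["<|im_sep|>"])    # separator between header and content
        ids += encode_text(text)             # the actual message tokens
        ids.append(SPECIAL["<|im_end|>"])    # close the turn
    return ids

tokens = encode_conversation([("user", "what is 2+2?"),
                              ("assistant", "2+2 is 4")])
```

(示意:结构化对话按协议编成一维 token 序列;特殊 token 的具体 ID 与分词方式为假设。)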

And then what does it look like at test time, during inference? Say we've trained a model on these kinds of data sets of conversations, and now we want to do inference. What does this look like when you're on ChatGPT? You come to ChatGPT and you have, say, a dialogue with it. The way this works is, basically, say that this was already filled in: "what is 2+2?", "2+2 is 4". And now you issue "what if it was * instead?". What basically ends up happening on the servers of OpenAI, or something like that, is that they put in <|im_start|>assistant<|im_sep|>, and this is where they end, right here. So they construct this context, and now they start sampling from the model. It's at this stage that they will go to the model and say: okay, what is a good first token? What is a good second token? What is a good third token? And this is where the model takes over and creates a response, for example a response that looks something like this. It doesn't have to be identical to this, but it will have the flavor of this, if this kind of a conversation was in the data set. So that's roughly how the protocol works, although the details of this protocol are not important. Again, my goal is just to show you that everything ends up being a one-dimensional token sequence, so we can apply everything we've already seen; but we're now training on conversations, and we're now basically generating conversations as well.

推理/测试时是什么样?假设模型已经在这种对话数据集上训好了,现在要做推理。你在 ChatGPT 里和它对话:前面已经填好了「2+2 等于几?」「2+2 等于 4」,你又问「如果是乘号呢?」。在 OpenAI 等服务器上,他们会把上下文构造到 <|im_start|>assistant<|im_sep|> 为止,然后从模型开始采样——问模型「第一个 token 该是什么?第二个?第三个?」……模型就这样接管并生成回复,不一定和训练集里某条一模一样,但若训练集里有类似对话,风格会很像。协议大致就是这样。细节不重要,我想强调的是:一切最终都是一维 token 序列,所以之前那套都能用,只不过现在是在对话上训练、也在生成对话。
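The inference-time priming step can be sketched as string assembly: render the dialogue so far, then open a fresh assistant turn and let the model sample from there. The delimiter strings follow the scheme above; the function itself is a hypothetical illustration, not OpenAI's actual server code:

```python
# Sketch of how a server might prime the model: render the dialogue so far,
# then append an open assistant turn; sampling continues from this point
# until the model emits the end-of-turn token.
IM_START, IM_SEP, IM_END = "<|im_start|>", "<|im_sep|>", "<|im_end|>"

def build_prompt(history):
    parts = [f"{IM_START}{role}{IM_SEP}{text}{IM_END}" for role, text in history]
    parts.append(f"{IM_START}assistant{IM_SEP}")  # model takes over from here
    return "".join(parts)

prompt = build_prompt([
    ("user", "What is 2+2?"),
    ("assistant", "2+2 is 4."),
    ("user", "What if it was * instead of +?"),
])
```

(示意:推理时把历史对话编码后补上助手回合的开头标记,从该位置开始采样生成。)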


五、实践中的数据集:InstructGPT 与标注指南

So now I would like to turn to what these data sets look like in practice. The first paper that I would like to show you, and the first effort in this direction, is this paper from OpenAI in 2022, called InstructGPT, after the technique that they developed. This was the first time that OpenAI talked about how you can take language models and fine-tune them on conversations. This paper has a number of details that I would like to take you through. The first stop I would like to make is in section 3.4, where they talk about the human contractors that they hired, in this case from Upwork or through Scale AI, to construct these conversations. So there are human laborers involved whose professional job it is to create these conversations. These laborers are asked to come up with prompts, and then they are asked to also complete the ideal assistant responses. So these are the kinds of prompts that people came up with: "List five ideas for how to regain enthusiasm for my career." "What are the top ten science fiction books I should read next?" There are many different types of prompts here, like "translate this sentence to Spanish", et cetera. People first come up with the prompt, and then they also answer that prompt, giving the ideal assistant response.

下面看这些数据集在实践里长什么样。第一篇要提的是 OpenAI 2022 年的 InstructGPT,是他们在这条路线上的首次公开工作,也是 OpenAI 第一次系统讲如何把语言模型在对话上微调。论文里有很多细节,先看 3.4 节:他们通过 Upwork 或 Scale AI 雇人类标注员来构造对话,这些人的工作就是专业地生产对话。标注员既要出题(prompt),也要写出理想的助手回复。所以你会看到各种他们想出来的问题:列出五个重燃职业热情的办法、推荐十本科幻书、把某句翻译成西班牙语等等。流程就是:先想 prompt,再自己答一遍,写出「理想助手回复」。

Now, how do they know what the ideal assistant response is that they should write for these prompts? When we scroll down a little bit further, we see an excerpt of the labeling instructions that are given to the human laborers. The company that is developing the language model, like, for example, OpenAI, writes up labeling instructions for how the humans should create ideal responses. Here, for example, is an excerpt of these kinds of labeling instructions. On a high level, you're asking people to be helpful, truthful, and harmless. You can pause the video if you'd like to see more here, but on a high level: try to be helpful, try to be truthful, and don't answer questions that we don't want the system to handle later in ChatGPT. So, roughly speaking, the company comes up with the labeling instructions. Usually they are not this short; usually they are hundreds of pages, and people have to study them professionally. Then they write out the ideal assistant responses following those labeling instructions. So this is a very human-heavy process, as it was described in this paper.

他们怎么知道该写什么样的「理想回复」?往下翻会看到发给标注员的标注指南(labeling instructions)摘录。开发语言模型的公司(如 OpenAI)会写一套指南,规定人类该如何写出理想回复。例如摘录里高层原则就是:helpful, truthful, harmless(有帮助、真实、无害)——尽量有用、尽量真实,不要回答公司不希望系统在 ChatGPT 里回答的那类问题。大致就是公司定标注指南,通常远没有摘录这么短,往往有几百页,标注员要专业学习,然后按指南写出理想助手回复。所以如论文所述,这是非常依赖人力的过程。

Now, the data set for InstructGPT was never actually released by OpenAI, but we do have some open-source reproductions that try to follow this kind of a setup and collect their own data. One that I'm familiar with, for example, is the Open Assistant effort from a while back. This is just one of many examples, but I wanna show you one. These are people on the internet that were asked to create conversations, similar to what OpenAI did with human laborers. So here's an entry of a person who came up with this prompt: "Can you write a short introduction to the relevance of the term monopsony in economics? Please use examples", et cetera. And then the same person, or potentially a different person, will write out the ideal assistant response to this. And then this is an example of how the conversation could continue: "Now explain it to a dog", and then you try to come up with a slightly simpler explanation, or something like that. This then becomes the label, and we end up training on this.

InstructGPT 的数据集 OpenAI 从未公开,但有一些开源复现按类似设置自己收集数据,例如早期的 Open Assistant。这里是网上志愿者像 OpenAI 的标注员一样生产对话的样例:有人出题「能否写一段经济学里 monopsony 的简短介绍,并举例?」然后同一人或另一个人写助手回复,再可能续一轮(比如「用给狗解释的方式再说一遍」)。这些就变成标签,我们最终就在这些数据上训练。


六、从纯人工到合成数据:UltraChat 与 SFT 混合

So what happens during training is that we're not gonna have full coverage of all the possible questions that the model will encounter at test time, during inference. We can't possibly cover all the possible prompts that people are gonna be asking. But if we have a data set of enough of these examples, then the model, during training, will start to take on this persona of a helpful, truthful, harmless assistant. And it's all programmed by example. These are all examples of behavior, and if you have enough conversations of these example behaviors, like 100,000, and you train on them, the model starts to understand the statistical pattern and kind of takes on the personality of this assistant. Now, it's possible that when you get the exact same question at test time, the answer will be recited exactly as it was in the training set. But more likely than that, the model will do something of a similar vibe; it will understand that this is the kind of answer that you want. So that's what we're doing: we're programming the system by example, and the system statistically adopts this persona of a helpful, truthful, harmless assistant, which is reflected in the labeling instructions that the company creates.

训练时我们不可能覆盖推理时用户会问的所有问题,但只要有足够多这类样例(比如十几万条对话)并在这上面训练,模型就会在统计上学会这种「有帮助、真实、无害」的助手人设——全是用例子编程。若测试时问题恰好和训练集里某条一样,有可能几乎背出那条答案;更常见的是模型会给出风格类似的回复,你感觉「对,就要这种答案」。所以我们就是在用例子编程系统,系统在统计上呈现出公司通过标注指南定义的那种助手人格。

I want to show you that the state of the art has advanced in the last 2 or 3 years since the InstructGPT paper. In particular, it's no longer very common for humans to be doing all the heavy lifting just by themselves, because we now have language models, and these language models are helping us create these data sets and conversations. It is very rare that people will literally just write out the response from scratch; it is a lot more likely that they will use an existing LLM to come up with an answer, and then edit it, or things like that. There are many different ways in which LLMs have started to permeate this post-training data stack, and LLMs are basically used to help create these massive data sets of conversations.

InstructGPT 之后两三年,业界已经往前走了不少。纯靠人类从零写回复已经很少见,因为现在有语言模型帮忙造对话数据集:人很少真的从头写一整段回复,更常见的是用现有 LLM 先生成一个答案,再人工编辑。所以 LLM 已经渗透进后训练数据流水线的很多环节,用来帮助生成海量对话数据。

I want to show you one. UltraChat is one such example of a more modern data set of conversations. It is, to a very large extent, synthetic, but I believe there's some human involvement; I could be wrong about that. Usually there will be a little bit of human work, but a huge amount of synthetic help, all constructed in different ways. And UltraChat is just one example of many SFT data sets that currently exist. The main thing I wanna show you is that these data sets now have millions of conversations. These conversations are mostly synthetic, but they're probably augmented to some extent by humans, and they span a huge diversity of areas and so on. So these are fairly extensive artifacts by now. There are all these SFT mixtures, as they're called: a mixture of lots of different types and sources, partially synthetic, partially human. It's kind of gone in that direction since. But roughly speaking, we still have SFT data sets; they're made up of conversations; and we're training on them, just like we did before.

UltraChat 是更现代的一类对话数据集例子,很大程度上是合成的,可能仍有一点人工参与。通常是人少量参与、合成占大头,用各种方式构造。UltraChat 只是众多 SFT 数据集中的一个;想说明的是:这类数据集现在动辄百万级对话,多半是合成的,再经人类一定程度增补,覆盖领域很广,已经是相当成规模的数据产物。现在有很多所谓的 SFT mixture:多种类型、多种来源混合,一部分合成、一部分人工,整体朝这个方向发展。但大致上仍然是:SFT 数据集由对话组成,我们在上面训练,和以前一样。
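An "SFT mixture" can be sketched as weighted sampling over several sources. The source names, contents, and weights below are illustrative assumptions, not the recipe of any real data set:

```python
import random

# Hypothetical mixture: mostly synthetic conversations, some human-written.
# Each source maps to (list of conversation IDs, sampling weight).
sources = {
    "synthetic_ultrachat_style": (["conv_s1", "conv_s2", "conv_s3"], 0.7),
    "human_written":             (["conv_h1", "conv_h2"],            0.3),
}

def sample_mixture(sources, n, seed=0):
    """Draw n conversations, picking a source by weight each time."""
    rng = random.Random(seed)
    names = list(sources)
    weights = [sources[name][1] for name in names]
    batch = []
    for _ in range(n):
        name = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(sources[name][0]))
    return batch

batch = sample_mixture(sources, n=5)
```

(示意:SFT mixture 就是按权重混合多来源(合成+人工)的对话;来源名与权重均为假设。)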


七、你在和谁对话:统计模拟与「LLM 心理」

So I guess the last thing to note is that I wanted to dispel a little bit of the magic of talking to an AI. When you go to ChatGPT and you give it a question and hit enter, what is coming back is statistically aligned with what's happening in the training set. And these training sets really just have a seed in humans following labeling instructions. So what are you actually talking to in ChatGPT, or how should you think about it? It's not coming from some magical AI. Roughly speaking, it's coming from something that is statistically imitating human laborers, who in turn follow labeling instructions written by these companies. It's almost as if the answer that is given to you by ChatGPT is some kind of a simulation of a human laborer; it's like asking, what would a human laborer say in this kind of a conversation? And this human laborer is not just a random person from the internet, because these companies actually hire experts. For example, when you are asking questions about code and so on, the human laborers that would be involved in the creation of these conversation data sets are usually educated, expert people. You're kind of asking a question of a simulation of those people. So you're not talking to a magical AI; you're talking to an average laborer. This average laborer is probably fairly highly skilled, but you're talking to kind of an instantaneous simulation of the kind of person that would be hired in the construction of these data sets.

最后想稍微「祛魅」一下:你在和 ChatGPT 对话、按回车时,返回的内容在统计上是对齐训练集里发生的事的,而训练集的种子就是人类按标注指南写的回复。所以你在和什么对话? 不是某种神秘 AI,大致可以理解为:你在和一种在统计上模仿人类标注员的东西对话,而标注员又是按公司写的标注指南来写的。也就是说,你得到的答案有点像「若请一位标注员在这种对话里回答,他会写什么」的模拟。而且这些标注员不是随便从网上抓的人,公司会雇专家——比如你问代码相关问题时,参与构造对话数据集的往往是受过教育、有专业背景的人。所以你可以理解为:你在和这类人的即时模拟对话;不是魔法,而是「被雇来造数据的那种人」的统计模拟,通常技能不低。

So let me give you one more specific example before we move on. For example, when I go to ChatGPT and I say, "recommend the top five landmarks to see in Paris", and I hit enter, what's coming out here? How do I think about it? It's not some kind of a magical AI that has gone out and researched all the landmarks and then ranked them using its infinite intelligence, et cetera. What I'm getting is a statistical simulation of a laborer that was hired by OpenAI; you can think about it roughly in that way. So if this specific question is in the post-training data set somewhere at OpenAI, then I'm very likely to see an answer that is probably very, very similar to what that human laborer would have put down for those five landmarks. How does the human laborer come up with this? Well, they go on the internet, do their own little research for 20 minutes, and come up with a list. Now, if they come up with this list and it's in the data set, I'm very likely to see what they submitted as the correct answer from the assistant. If this specific query is not part of the post-training data set, then what I'm getting here is a little bit more emergent, because the model understands it statistically: the kinds of landmarks that are in the data set are usually the prominent landmarks, the landmarks that people usually wanna see, the kinds of landmarks that are very often talked about on the internet. And remember that the model already has a ton of knowledge from its pre-training on the internet, so it's probably seen a ton of conversations about Paris, about landmarks, about the kinds of things that people like to see. It's the pre-training knowledge combined with the post-training data set that results in this kind of an imitation.

再举一个具体例子:我去 ChatGPT 问「推荐巴黎必去的五个地标」,按回车。出来的东西该怎么理解?不是有个魔法 AI 去查了所有地标再用「无限智能」排序;你得到的是被 OpenAI 雇来的那种标注员的统计模拟。若「巴黎五大地标」这道题正好在 OpenAI 的后训练数据集里,你看到的答案很可能和那位标注员当时写下的非常像。标注员当时怎么写的?他们上网查一二十分钟,列个单子,交上去,进了数据集;所以你很可能看到的就是他们交的「助手正确答案」。若你这句具体问法不在后训练集里,你得到的东西就更「涌现」一点:模型在统计上理解「数据集里的地标」通常是知名、大家常去、网上常提的,再加上预训练时已经见过大量关于巴黎、地标、人们爱去哪的文本,所以是预训练知识 + 后训练对话数据共同作用,产生这种模仿。

So that's roughly how you can think about what's happening behind the scenes here, in a statistical sense. Now, I want to turn to the topic of LLM psychology, as I like to call it: what are the emergent cognitive effects of the training pipeline that we have for these models?

从统计角度,大致可以这样理解幕后发生的事。接下来我想转到「**LLM 心理**」这个话题:我们为这些模型设计的训练流水线,会带来哪些涌现的认知/心理效果。


八、全文总结与初学者学习重点

文章讲了什么

本片段对应 Stanford 大语言模型课中的「后训练」部分:如何把预训练得到的基础模型变成能问答的助手。

  • 后训练:算力比预训练小很多(例如几小时 vs 数月),但非常关键;做法是换数据不换算法——用对话数据集替代互联网文档,继续用「预测下一个 token」训练,模型很快学会助手式回复。
  • 助手行为:通过对话数据集隐式编程;多轮、可拒绝;人类标注员按公司标注指南写出理想回复,模型通过例子学会「有帮助、真实、无害」等人设。
  • 对话的表示:对话通过编码协议变成一维 token 序列(含特殊 token 如 im_start);推理时把已有对话编成上下文,模型从该处开始采样生成回复。
  • 数据来源演变:InstructGPT 时代以人工标注为主(Upwork/Scale AI,标注指南可达数百页);现在大量使用 LLM 辅助/合成(如 UltraChat),SFT 数据集常为百万级对话、多源混合、部分合成部分人工。
  • 你在和谁对话:从统计上讲,回复是对「按标注指南写答案的人类标注员」的模拟;标注员多为专家,所以也可理解为「被雇来造数据的那种人」的即时模拟;若问题落在后训练集里,答案可能很接近某条标注;若不在,则结合预训练知识与后训练统计做「风格一致」的生成。

初学者应关注的重点

  • **后训练 vs 预训练**:后训练算力小、时间短;数据从互联网文档换成对话;算法相同(下一个 token 预测)。
  • **助手 = 用例子编程**:无法手写逻辑,靠对话数据集;标注员出题+写理想回复,模型模仿统计规律。
  • **对话 token 化**:对话通过协议编成 token 序列(含特殊 token);推理 = 构造上下文 + 从模型采样。
  • **标注指南与人力**:公司写标注指南(如 helpful/truthful/harmless);早期重人工,现在 LLM 参与造数据。
  • **合成与混合**:现代 SFT 常为合成+人工混合、百万级对话;UltraChat 等为典型。
  • **心理模型**:回复 ≈ 人类标注员(按指南)的统计模拟;结合预训练知识理解「不在集内」的泛化。

一句话总结

后训练用对话数据继续训基础模型,把「续写互联网」变成「按例子学助手」;对话被编成 token 序列训练与生成,数据从纯人工标注演进到合成+混合;你对话的对象在统计上是对「按公司标注指南写答案的标注员」的模拟,而不是一个独立研究再回答的魔法 AI。