Stanford大学公开课vibe coding
CS146S: The Modern Software Developer
https://themodernsoftware.dev/
Deep Dive into LLMs like ChatGPT
https://www.youtube.com/watch?v=7xTGNNLPyMI
视频转文字
一、开篇:我们在这期视频里要讲什么
Hi everyone. So I've wanted to make this video for a while. It is a comprehensive but general-audience introduction to large language models like ChatGPT. What I'm hoping to achieve in this video is to give you kind of mental models for thinking through what this tool is. It's obviously magical and amazing in some respects; it's really good at some things, not very good at other things, and there are also a lot of sharp edges to be aware of. So what is behind this text box? You can put anything in there and press Enter, but what should we be putting there? And what are these words generated back? How does this work? And what are you talking to, exactly?
大家好,这期视频我酝酿了很久,旨在面向普通大众,全面介绍像 ChatGPT 这样的大型语言模型。我希望通过这段视频,帮大家建立一套思维框架,来理解这个工具的本质。
它显然很神奇、很出色,在某些方面能力极强,但在另一些方面却并不擅长,同时还存在不少需要警惕的风险与局限。
那么,这个输入框背后到底是什么?
你可以随便输入任何内容并按下回车,但你究竟应该输入什么?
它返回的这些文字是如何生成的?
它的工作原理是什么?
你到底在和什么对话?
我希望在视频里把这些问题一一讲清楚。
So I'm hoping to get at all of those topics in this video. We're going to go through the entire pipeline of how this stuff is built, but I'm going to keep everything accessible to a general audience. So let's take a look first at how you build something like ChatGPT, and along the way I'm going to talk about some of the cognitive and psychological implications of these tools.
所以我希望在这期视频里把这些问题都讲清楚。我们会完整梳理这类模型的整个构建流程,但我会用普通观众都能轻松理解的方式来讲解。
那么我们先来看看,像 ChatGPT 这样的系统是如何搭建出来的。在讲解过程中,我也会聊聊这些工具在认知与心理学层面带来的一些影响。
二、预训练第一步:互联网数据的获取与整理
Okay, so let’s move on to ChatGPT. So there’s going to be multiple stages arranged sequentially. The first stage is called the pre-training stage. The first step of the pre-training stage is to download and process the internet. Now to get a sense of what this roughly looks like, I recommend looking at this URL here. So this company called Hugging Face has collected and created a curated data set called FineWeb. And they go into a lot of detail in this blog post on how they constructed the FineWeb data set. And all of the major LLM providers like OpenAI, Anthropic, and Google and so on will have some equivalent internally of something like the FineWeb data set.
好的,我们继续来讲 ChatGPT。它的构建会按顺序分为多个阶段。第一个阶段叫作预训练阶段。
预训练阶段的第一步,是下载并处理互联网上的数据。为了让大家对这个过程有个大致概念,我推荐大家看看这个链接。有一家名叫 Hugging Face 的公司,收集并精心整理了一个名为 FineWeb 的数据集。他们在这篇博客文章中,详细介绍了如何构建 FineWeb 数据集。
而所有主流的大模型公司,比如 OpenAI、Anthropic、Google 等,内部都会有类似于 FineWeb 的等效数据集。
So roughly, what are we trying to achieve here? We’re trying to get a ton of text from the internet from publicly available sources. So we’re trying to have a huge quantity of very high quality documents. And we also want very large diversity of documents because we want to have a lot of knowledge inside these models. So we want large diversity of high quality documents, and we want many of them and achieving this is quite complicated. And as you can see here, takes multiple stages to do well. So let’s take a look at what some of these stages look like in a bit.
简单来说,我们在这里想要达成什么目标?
我们要从互联网上的公开来源获取海量文本数据,收集数量极其庞大、质量非常高的文档,同时还要保证文档高度多样化,因为我们希望模型里能装进丰富的知识。
所以我们需要:大量、高质量、种类丰富的文档,而要做到这一点其实非常复杂。正如你所看到的,需要经过多个阶段才能做好。接下来我们就来看看其中一些阶段具体是怎样的。
三、数据从哪来:Common Crawl 与多阶段过滤
For now, I’d like to just note that, for example, the final data set, which is fairly representative of what you would see in a production-grade application, actually ends up being only about 44 terabytes of disk space. You can get a USB stick for like a terabyte very easily. Or I think this could fit on a single hard drive almost today. So this is not a huge amount of data. At the end of the day, even though the internet is very, very large, we’re working with text. And we’re also filtering it aggressively. So we end up with about 44 terabytes in this example.
说到这里我想提一下,例如这个最终数据集,作为生产级应用中比较有代表性的规模,实际只占大约 44TB 的磁盘空间。现在很容易就能买到 1TB 的 U 盘,或者说几乎一块硬盘就能装下。所以体量并没有大到夸张。归根结底,虽然互联网非常大,我们处理的只是文本,而且会做很激进的过滤,所以在这个例子里最终大约是 44TB。
So let’s take a look at what this data looks like. And what some of these stages are. The starting point for a lot of these efforts, and what contributes most of the data by the end of it, is data from Common Crawl. So Common Crawl is an organization that has been basically scouring the internet since 2007. So as of 2024, for example, Common Crawl has indexed 2.7 billion web pages. And they have all these crawlers going around the internet. And what you end up doing basically is you start with a few seed web pages and then you follow all the links, and you just keep following links and you keep indexing all the information, and you end up with a ton of data of the internet over time.
那我们来看一下这些数据长什么样,以及这些阶段具体是什么。很多这类工作的起点、以及最终数据里占大头的来源,都是 Common Crawl。Common Crawl 是一个从 2007 年起就在爬取互联网的组织。例如到 2024 年,Common Crawl 已经索引了 27 亿个网页。他们用各种爬虫在互联网上抓取,基本上就是从少量种子页面出发、跟着链接一路爬、不断索引信息,久而久之就积累下海量的互联网数据。
So this is usually the starting point for a lot of these efforts. Now, this Common Crawl data is quite raw and is filtered in many, many different ways. So here they document in the same diagram a little bit the kind of processing that happens in these stages. So the first thing here is something called URL filtering. So what that is referring to is that there are these block lists of basically URLs or domains that you don’t want to be getting data from. So usually this includes things like malware websites, spam websites, marketing websites, racist websites, adult sites, and things like that. So there are different types of websites that are just eliminated at this stage, because we don’t want them in our data set.
所以这通常是很多这类工作的起点。Common Crawl 的数据很原始,会经过多种多样的过滤。他们在这个图里也简单记录了各个阶段做了哪些处理。第一步叫 URL 过滤,意思是有一些屏蔽列表,列出你不想采数据的 URL 或域名,通常包括恶意软件站、垃圾站、营销站、种族主义网站、成人网站等等。这类网站在这一阶段就会被直接剔除,因为我们不想要它们进数据集。
The second part is text extraction. You have to remember that all these web pages—this is the raw HTML of these web pages that are being saved by these crawlers. So when I go to inspect here, this is what the raw HTML actually looks like. You’ll notice that it’s got all this markup like lists and stuff like that and their CSS and all this kind of stuff. So this is computer code almost for these web pages. But what we really want is we just want the text, right? We just want the text of this web page, and we don’t want the navigation and things like that. There’s a lot of filtering and processing and heuristics that go into adequately filtering for just the good content of these web pages.
第二部分是文本提取。要记住,爬虫保存下来的都是网页的原始 HTML。你在开发者工具里看到的,就是原始 HTML 的样子:一堆列表、CSS 之类的标记,本质上是网页的代码。我们真正需要的只是正文文本,不想要导航栏那些。要只保留网页里的优质正文,需要大量过滤、处理和启发式规则。
The next stage here is language filtering. So for example, FineWeb uses a language classifier. They try to guess what language every single web page is in. And then they only keep web pages that have more than 65% English, as an example. So you can get a sense that this is a design decision that different companies can take for themselves. What fraction of all different types of languages are we going to include in our data set? Because for example, if we filter out all of the Spanish, then you might imagine that our model later will not be very good at Spanish, because it’s just never seen that much data of that language.
下一阶段是语言过滤。例如 FineWeb 会用语言分类器,对每个网页判断是什么语言,然后只保留英文占比超过 65% 的页面。所以你可以把这理解成各家公司的设计选择:数据集里要包含多少种语言、各占多少比例。比如如果把西班牙语都过滤掉,模型以后在西班牙语上就不会太好,因为它几乎没见过那么多西班牙语数据。
And so different companies can focus on multilingual performance to a different degree. So FineWeb is quite focused on English. And so their language model, if they end up training one later, will be very good at English, but maybe not very good at other languages. After language filtering, there are a few other filtering steps and deduplication and things like that, finishing with, for example, PII removal. And this is personally identifiable information. So things like addresses, social security numbers and so on—you try to detect them and filter out those kinds of web pages from the data set as well.
所以不同公司对多语言能力的侧重可以很不一样。FineWeb 就比较偏英文,这样训出来的语言模型英文会很好,其他语言可能就一般。语言过滤之后还有若干过滤步骤、去重等,最后还有例如 PII 移除。PII 是个人可识别信息,比如地址、社保号之类,会尽量检测出来,把包含这些的页面从数据集里筛掉。
So there’s a lot of stages here, and I won’t go into full detail, but it is a fairly extensive part of the preprocessing. And you end up with, for example, the FineWeb data set.
所以这里阶段很多,我不展开讲细节了,但预处理是相当庞大的一环。最终你会得到例如 FineWeb 这样的数据集。
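把上面提到的各个过滤阶段串起来,大致可以写成下面这个极简的 Python 流水线示意。强调一下:这只是帮助理解的玩具代码,其中的屏蔽域名、用 ASCII 字母占比粗糙代替语言分类器的近似(对应正文的 65% 阈值)、以及识别社保号的正则,全都是假设出来的演示设定,并非 FineWeb 的真实实现:

```python
import re

BLOCKLIST = {"spam.example.com", "malware.example.net"}   # 假设的屏蔽域名列表
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")             # 粗略匹配美国社保号格式(PII 示意)

def extract_text(html: str) -> str:
    """极简的「文本提取」: 直接去掉 HTML 标签; 真实系统复杂得多。"""
    return re.sub(r"<[^>]+>", " ", html).strip()

def mostly_english(text: str, threshold: float = 0.65) -> bool:
    """用 ASCII 字母占比粗略代替真正的语言分类器。"""
    if not text:
        return False
    letters = sum(c.isascii() and c.isalpha() for c in text)
    return letters / len(text) >= threshold

def clean_corpus(pages):
    seen, out = set(), []
    for page in pages:
        if page["domain"] in BLOCKLIST:      # 1. URL 过滤
            continue
        text = extract_text(page["html"])    # 2. 文本提取
        if not mostly_english(text):         # 3. 语言过滤
            continue
        if text in seen:                     # 4. 去重(真实系统会用 MinHash 等)
            continue
        seen.add(text)
        if SSN_RE.search(text):              # 5. PII 过滤
            continue
        out.append(text)
    return out

pages = [
    {"domain": "blog.example.org", "html": "<p>Tornadoes swept the plains in 2012.</p>"},
    {"domain": "spam.example.com", "html": "<p>Buy now!!!</p>"},
    {"domain": "blog.example.org", "html": "<p>Tornadoes swept the plains in 2012.</p>"},
    {"domain": "docs.example.org",
     "html": "<p>Contact John Smith, social security number 123-45-6789, for details.</p>"},
]
cleaned = clean_corpus(pages)   # 只有第一条干净的英文正文能留下来
```

每一步在真实流水线里都是一个独立的大工程,这里只为体现「层层筛选」的结构。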
So when you click in on it, you can see some examples here of what this actually ends up looking like. And anyone can download this on the Hugging Face web page. So here are some examples of the final text that ends up in the training set. This is some article about tornadoes in 2012—what happened. This next one is something like: did you know you have two little yellow, 9-volt-battery-sized adrenal glands in your body? So there’s some kind of odd medical article. So just think of these as basically web pages on the internet filtered down to just the text in various ways. And now we have a ton of text, 40 terabytes of it. And that now is the starting point for the next step of the stage.
点进去就能看到一些样例,看最终数据长什么样,大家也可以在 Hugging Face 页面上下载。这里就是最终进入训练集的一些文本示例:有 2012 年龙卷风相关的文章,还有像「你知道吗,你体内有两个像 9 伏电池那么大的黄色肾上腺」之类的冷门医学文章。可以把它们想成互联网网页经过各种方式过滤后剩下的纯文本。现在我们手上有海量文本,大约 40TB,这就是下一阶段步骤的起点。
四、文本如何送进模型:从比特、字节到 Token(分词)
Now I wanted to give you an intuitive sense of where we are right now. So I took the first 200 web pages here, and remember we have tons of them. And I just take all that text, and I just put it all together, concatenate it. And so this is what we end up with. We just get this raw internet text. And there’s a ton of it, even in these 200 web pages. So I can continue zooming out here. And we just have this massive tapestry of text data. And this text data has all these patterns. And what we want to do now is we want to start training neural networks on this data, so the neural net can internalize and model how this text flows.
我想先让大家直观感受一下我们现在处在哪一步。我取了这里的前 200 个网页(别忘了我们总共有海量网页),把其中所有文本拼成一整段。你看到的就是这样:一段原始的互联网文本。光这 200 页就已经很多了,还可以继续缩小比例看,整体就是一大片文本数据,里面充满各种模式。接下来我们要做的,就是在这类数据上训练神经网络,让网络内化并建模这段文本是如何流动的。
So we just have this giant tapestry of text, and now we want to get neural nets that mimic it. Now, before we plug text into neural networks, we have to decide how we’re going to represent this text and how we’re going to feed it.
所以我们有这一大块文本,现在要得到能模仿它的神经网络。在把文本送进神经网络之前,我们必须先决定如何表示这段文本、如何喂给它。
Now, the way our technology works for these neural nets is that they expect a one-dimensional sequence of symbols. They want a finite set of symbols that are possible. We have to decide what the symbols are. And then we have to represent our data as a one-dimensional sequence of those symbols. So right now, what we have is a one-dimensional sequence of text. It starts here and it goes here, and then it comes here, et cetera. So this is a one-dimensional sequence, even though on my monitor it’s laid out in a two-dimensional way—it goes from left to right and top to bottom. Right? It’s a one-dimensional sequence of text.
我们这套技术里,神经网络期望的输入是一维的符号序列。符号来自一个有限的集合,我们要先定好有哪些符号,然后把数据表示成这些符号的一维序列。目前我们有的就是一维的文本序列:从这里开始,到这里,再到这里,等等。所以虽然在我屏幕上看起来是二维排布的,从左到右、从上到下,但本质上是一维的文本序列。
Now, this being computers, there’s an underlying representation here. So if I do what’s called UTF-8 encode this text, then I can get the raw bits that correspond to this text in the computer. And that looks like this. So it turns out that, for example, this very first bar here is the first 8 bits. So what is this thing? Right? This is the representation that we are looking for. In a certain sense, we have exactly two possible symbols, zero and one. And we have a very long sequence of it. Right?
在计算机里,底下还有一层表示。如果我对这段文本做 UTF-8 编码,就能得到它在计算机里对应的原始比特,大概就像这样。比如这里第一小段就是前 8 个比特。这是什么?其实就是我们想要的那种表示:某种意义上我们只有两种符号——0 和 1,然后得到很长的一串。
Now, as it turns out, this sequence length is actually going to be a very finite and precious resource in our neural network. And we actually don’t want extremely long sequences of just two symbols. What we want is we want to trade off the size of this vocabulary, as we call it, and the resulting sequence length. So we don’t want just two symbols and extremely long sequences. We’re going to want more symbols and shorter sequences. One naive way of compressing or decreasing the length of our sequence here is to basically consider some group of consecutive bits, for example 8 bits, and group them into a single what’s called byte.
实际上,在我们神经网络里,序列长度是非常有限、非常宝贵的资源。我们并不想要只有两个符号、却极长的序列,而是要在「词汇表大小」和「序列长度」之间做权衡:更多符号、更短序列。一种简单的压缩/缩短序列的方式,就是把连续的若干比特(比如 8 个)合成一个单位,也就是一个「字节」。
So because these bits are either on or off, if we take a group of eight of them, there turns out to be only 256 possible combinations of how these bits could be on or off. Therefore, we can represent this sequence as a sequence of bytes instead. So this sequence of bytes will be 8 times shorter. But now we have 256 possible symbols. So every number here goes from 0 to 255. Now, I really encourage you to think of these not as numbers, but as unique IDs or unique symbols. So maybe it’s better to think of replacing every one of these with a unique emoji—you’d get something like this. So we basically have a sequence of symbols, and there are 256 possible symbols. You can think of it that way.
因为每个比特要么开要么关,八个一组就只有 256 种组合,所以我们可以把这段序列表示成字节序列,长度就缩短为原来的 1/8,但符号数变成 256 个,每个对应 0 到 255。建议大家别把它们当数字,而是当唯一的 ID 或符号,甚至可以想象成每个都换成一个独特的表情符号——我们得到的是一串符号,共有 256 种可能。
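正文说的「文本 → UTF-8 比特 → 字节」的对应关系,可以用几行 Python 直接验证(这里用 "hi" 两个字符做示意):

```python
text = "hi"
raw = text.encode("utf-8")                # 字节序列: 每个元素是 0..255 之间的一个「符号」
bits = "".join(f"{b:08b}" for b in raw)   # 再展开成 0/1 比特序列

print(list(raw))   # [104, 105]
print(bits)        # 0110100001101001
```

同一段内容,字节表示比比特表示短 8 倍,但符号种类从 2 个变成了 256 个。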
Now, it turns out that in production for state-of-the-art language models, you actually want to go even beyond this. You want to continue to shrink the length of the sequence, because again, it is a precious resource, in return for more symbols in your vocabulary. And the way this is done is by running what’s called the byte pair encoding algorithm. And the way this works is we’re basically looking for consecutive bytes or symbols that are very common.
在实际的顶尖语言模型里,还会再往前推一步:继续缩短序列长度(因为序列长度很宝贵),换取更大的词汇表。做法是跑一种叫「字节对编码」(BPE)的算法:不断找出现频率很高的连续字节或符号对。
So for example, it turns out that the sequence 1, 16 followed by 32 is quite common and occurs very frequently. So we’re going to group this pair into a new symbol. So we’re gonna mint a symbol with an ID 256, and we’re gonna rewrite every single pair 1,16 32 with this new symbol. And then we can iterate this algorithm as many times as we wish. And each time when we mint a new symbol, we’re decreasing the length. We’re increasing the symbol size. And in practice, it turns out that a pretty good setting for the vocabulary size turns out to be about 100,000 possible symbols. So in particular, GPT-4 uses 100,277 symbols. And this process of converting from raw text into these symbols, or as we call them tokens, is the process called tokenization.
比如可能发现「1, 16 后面跟 32」很常见,就把这个组合当成一个新符号,给它一个 ID 256,把数据里所有这样的配对都替换成这个新符号。这个算法可以反复做很多轮,每轮都会缩短序列、增加符号种类。实践中,词汇表大小设在约 10 万个符号比较合适;GPT-4 就用了 100,277 个符号。把原始文本变成这些符号(我们叫 token)的过程,就是「分词」或叫 tokenization。
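字节对编码里「找最常见的相邻符号对 → 铸造新符号 → 全局替换」这一步,可以用下面的玩具代码示意(用经典的 "aaabdaaabac" 例子;这与 GPT-4 实际分词器的工程实现无关,只演示算法思想):

```python
from collections import Counter

def bpe_merge_step(ids, next_id):
    """做一轮 BPE 合并: 找出现最多的相邻符号对, 铸成新符号 next_id 并替换。"""
    pairs = Counter(zip(ids, ids[1:]))
    if not pairs:
        return ids, None
    top = pairs.most_common(1)[0][0]
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == top:
            out.append(next_id)   # 用新铸造的符号替换这一对
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out, top

ids = list("aaabdaaabac".encode("utf-8"))   # 从字节序列出发: a=97, b=98, c=99, d=100
ids, merged = bpe_merge_step(ids, 256)      # 第一轮: 最常见的对是 (97, 97)

print(merged)   # (97, 97)
print(ids)      # [256, 97, 98, 100, 256, 97, 98, 97, 99] —— 序列从 11 个符号缩到 9 个
```

实践中这样的合并会做很多轮,每轮序列更短、词表更大,直到词表涨到约 10 万个符号为止。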
So let’s now take a look at how GPT-4 performs tokenization, converting from text to tokens and from tokens back to text.
下面我们就看一下 GPT-4 是怎么做分词、把文本变成 token、再从 token 变回文本的。
4.1 分词器实战:tiktokenizer 与 GPT-4 词表
And what this actually looks like. So one website I like to use to explore these token representations is called tiktokenizer. You go to the dropdown and select cl100k_base, which is the GPT-4 base model tokenizer. And here on the left, you can put in text, and it shows you the tokenization of that text. So for example, “hello world” turns out to be exactly two tokens: the token “hello”, which is the token with ID 15339, and the token “ world” (space world), which is token 1917. So “hello” + “ world”. Now if I were to join these two—for example “helloworld” without the space—I’m gonna get again two tokens, but it’s the token “hell” and the token “oworld”. If I put in two spaces here between hello and world, it’s again a different tokenization.
具体长什么样呢?我常用一个叫 tiktokenizer 的网站来看 token 表示。在下拉菜单里选 cl100k_base,就是 GPT-4 基础模型的分词器。左边输入文本,右边会显示对应的 token。比如 “hello world” 正好是两个 token:”hello” 是 ID 15339,” world”(带空格的 world)是 1917。要是写成 “helloworld” 不留空格,又会变成两个 token:”hell” 和 “oworld”。如果 hello 和 world 之间打两个空格,又会得到另一种分词。
There’s a new token 220 here. So you can play with this and see what happens. Also keep in mind this is case-sensitive. So if this is a capital H it is something else. Or if it’s “Hello world”, then actually this ends up being three tokens instead of just two tokens. So you can play with this and get an intuitive sense of what these tokens work like. We’re actually going to come back to tokenization a bit later in the video.
这里会多出一个 token 220。大家可以自己试不同输入看效果。还要注意大小写是区分的,大写 H 会变成别的 token,”Hello world” 会变成三个 token 而不是两个。玩一玩就能对 token 有个直观感觉。后面视频里我们还会再回到分词。
Now I just wanted to show you the website. And I wanted to show you that this text, basically at the end of the day—so for example, if I take one line here—this is what GPT-4 will see it as. So this text will be a sequence of length 62. This is the sequence here. And this is how the chunks of text correspond to these symbols. And again, there are 100,277 possible symbols. We now have one-dimensional sequences of those symbols. We’re gonna come back to tokenization, but that’s for now where we are.
我就是想让大家看看这个网站,以及这段文本在 GPT-4 眼里最终长什么样。比如取这一行,它会变成长度 62 的序列,每个片段对应一个符号,一共大约 10 万多种可能的符号。现在我们有的就是一维的符号序列。分词后面还会提到,目前就先到这里。
Okay, so what I’ve done now is I’ve taken this sequence of text that we have here in the data set, and I have re-represented it using our tokenizer into a sequence of tokens. And this is what that looks like now.
好,我做的就是把数据集里这段文本用分词器重新表示成 token 序列,现在你看到的就是它的样子。
So for example, when we go back to the FineWeb data set, they mentioned that not only is it 44 terabytes in disk space, but this is about a 15 trillion token sequence in this data set.
比如回到 FineWeb 数据集,他们提到不仅是 44TB 的磁盘占用,整份数据大约是 15 万亿个 token 组成的序列。
And so here, these are just some of the first one or two or three or a few thousand tokens of this data set. But there are 15 trillion here to keep in mind. And again, keep in mind one more time that all of these represent little text chunks. They’re all like atoms of these sequences. And the numbers here don’t make any sense—they’re just unique IDs. Okay? So now we get to the fun part, which is the neural network training. This is where a lot of the heavy lifting happens computationally when you’re training these neural networks. So what we do here in this step is we want to model the statistical relationships of how these tokens follow each other in the sequence.
这里显示的只是数据集里最前面的一两千个 token,但要记住整份有 15 万亿个。再强调一次:每一个都代表一小段文本,是序列里的「原子」,数字本身没有含义,只是唯一 ID。好,接下来就是好玩的部分——神经网络训练。训练这些网络时,大部分计算量都花在这里。我们这一步要做的是建模这些 token 在序列里如何前后相继的统计关系。
五、神经网络训练:预测下一个 Token
So what we do is we go into the data, and we take windows of tokens. So we take a window of tokens from this data fairly randomly. And the window length can range anywhere between zero tokens, actually, all the way up to some maximum size that we decide on. So for example, in practice, you could see token windows of, say, 8,000 tokens. Now, in principle, we can use arbitrary window lengths of tokens, but processing very long window sequences would just be very computationally expensive. So we just kind of decide that say 8,000 is a good number or 4,000 or 16,000, and we cap it there.
做法是:从数据里随机取一「窗」token,窗口长度可以从 0 一直到我们设定的某个上限。实践中常见的是比如 8,000 个 token 的窗口。理论上可以用任意长度,但非常长的序列算起来太贵,所以会定一个上限,比如 8,000、4,000 或 16,000。
Now, in this example, I’m going to be taking the first four tokens, just so everything fits nicely. These tokens—we’re going to take a window of four tokens, “bar”, “buz”, “hing” and space “single”, which are these token IDs. And now what we’re trying to do here is we’re trying to basically predict the token that comes next in a sequence. So 3962 comes next, right? So what we do now here is that we call this the context. These four tokens are context, and they feed into a neural network, and this is the input to the neural network.
在这个例子里为了演示方便,我只取前四个 token(对应某些 token ID)。我们要做的事就是:根据这段序列预测下一个 token,也就是 3962。这前四个 token 我们叫「上下文」,它们作为神经网络的输入喂进去。
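「随机取一窗 token、拿窗口后面那个 token 当预测目标」的取样过程,大致可以这样示意(窗口上限取 8 只是为了演示,正文中的实际设定是几千;token 序列也是随手编的 ID):

```python
import random

def sample_window(tokens, max_len=8):
    """随机取一个训练样本: 返回 (上下文, 要预测的下一个 token)。"""
    end = random.randrange(1, len(tokens))   # 「下一个 token」所在的位置
    start = max(0, end - max_len)            # 上下文最长 max_len 个 token
    return tokens[start:end], tokens[end]

tokens = [4438, 656, 584, 3962, 1917, 860, 287]   # 随手编的 token ID
context, target = sample_window(tokens)
print(context, "->", target)   # 例如 [4438, 656] -> 584
```

真实训练中会并行采出一大批这样的窗口,组成一个 batch 一起送进网络。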
Now, I’m going to go into the detail of what’s inside this neural network in a little bit.
神经网络内部结构稍后再说。
Now it’s important to understand the input and the output of the neural net. The input are sequences of tokens of variable length, anywhere between zero and some maximum size like 8,000. The output now is a prediction for what comes next. So because our vocabulary has 100,277 possible tokens, the neural network is going to output exactly that many numbers. And all of those numbers correspond to the probability of that token coming next in the sequence. So it’s making guesses about what comes next. In the beginning, this neural network is randomly initialized. And we’re going to see in a little bit what that means, but it’s a random transformation. So these probabilities in the very beginning of the training are also going to be kind of random.
重要的是理解网络的输入和输出。输入是长度不固定的 token 序列,从 0 到某个上限(比如 8,000)。输出是对「下一个是什么」的预测。因为词表有 100,277 个 token,网络就输出这么多个数,每个数表示对应 token 作为下一个出现的概率。所以它是在猜下一个 token。一开始网络是随机初始化的,所以这些概率一开始也是随机的。
So here I have three examples, but keep in mind that there are 100,000 numbers here. So the probability of this token “ direction”—the network is saying that this is 4% likely right now. 11,799 is 2%. And then here, the probability of 3962, which is “post”, is 3%.
这里只举了三个例子,但要记住实际有 10 万多个数。比如「 direction」(带空格)这个 token 当前概率是 4%,11,799 是 2%,3962 也就是 “post” 是 3%。
Now we’ve sampled this window from our data set. So we know what comes next. We know—and that’s the label. We know that the correct answer is that 3962 actually comes next in the sequence. So now we have this mathematical process for doing an update to the neural network. We have a way of tuning it. And we’re going to go into a little bit of detail in a bit. But basically, we know that this probability here of 3%, we want this probability to be higher. And we want the probabilities of all the other tokens to be lower. And so we have a way of mathematically calculating how to adjust and update the neural network so that the correct answer has a slightly higher probability.
这个窗口是从数据集里采出来的,所以我们知道下一个其实是 3962,这就是标签。于是我们有一套数学方法,用来根据这个对网络做一次更新、微调参数。细节稍后再说,大致思路是:我们希望正确 token(3962)的概率变高,其他 token 的概率变低,并且能算出该怎么改参数才能做到这一点。
So if I do an update to the neural network, now the next time I feed this particular sequence of four tokens into the neural network, the neural network will be slightly adjusted now, and it will say, okay, “post” is maybe 4% and “case” now maybe is 1%. “Direction” could become 2% or something like that. We have a way of nudging or slightly updating the neural net to basically give a higher probability to the correct token that comes next in the sequence.
所以做完一次更新之后,下次再喂进同样的四个 token,网络已经稍微变了一点,可能会说 “post” 变成 4%、”case” 变成 1%、”direction” 变成 2% 之类。我们就是在用这种方式一点点调整网络,让「下一个正确 token」的概率提高。
And now we just have to remember that this process happens not just for this token here, where these four fed in and predicted this one. This process happens at the same time for all of these tokens in the entire dataset. In practice, we sample little windows, little batches of windows. And then at every single one of these tokens, we want to adjust our neural network so that the probability of that token becomes slightly higher. And this all happens in parallel in large batches of these tokens. This is the process of training the neural network. It’s a sequence of updating it, so that its predictions match up with the statistics of what actually happens in your training set. And its probabilities become consistent with the statistical patterns of how these tokens follow each other in the data.
还要记住:不只是「这四个 token 预测这一个」会这样更新,整个数据集里所有 token 都在同时经历类似的过程。实践中我们会采样很多小窗口、组成一批,对这批里每个位置都希望把「下一个 token」的概率调高一点,这些更新是并行、批量做的。这就是训练神经网络的过程:不断更新参数,让模型的预测越来越符合训练数据里的统计规律,概率分布和真实序列的统计模式一致。
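「让正确 token 的概率升一点、其他 token 的概率降一点」这个更新方向,可以在一个只有 4 个 token 的玩具词表上直接演示。下面的代码把 logits 向量本身当作可调参数(真实模型里这些数是几十亿参数网络的输出,但交叉熵梯度 probs − onehot 给出的「轻推」方向是一样的;词表大小、学习率等都是演示用的假设值):

```python
import math

def softmax(logits):
    """把任意实数向量变成一组和为 1 的概率。"""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [0.0, 0.0, 0.0, 0.0]   # 玩具词表: 4 个 token, 初始概率各 25%
correct = 2                     # 假设数据里实际出现的下一个 token 是 2 号
lr = 0.5

before = softmax(logits)[correct]
for _ in range(10):             # 反复做几次「轻推」更新
    probs = softmax(logits)
    for i in range(len(logits)):
        grad = probs[i] - (1.0 if i == correct else 0.0)   # 交叉熵对 logits 的梯度
        logits[i] -= lr * grad
after = softmax(logits)[correct]

print(before, "->", after)   # 0.25 -> 约 0.83, 正确 token 的概率被逐步调高
```

可以看到每一步更新都在把概率质量从错误 token 挪向正确 token,这正是训练在做的事。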
六、神经网络内部:参数、数学式子与 Transformer
Let’s now briefly get into the internals of these neural networks, just to give you a sense of what’s inside. So neural network internals—as I mentioned, we have these inputs that are sequences of tokens. In this case, this is four input tokens, but this can be anywhere between zero up to, let’s say, a thousand tokens. In principle, this could be an infinite number of tokens. It would just be too computationally expensive to process an infinite number of tokens. So we just crop it at a certain length, and that becomes the maximum context length of that model.
下面简单看一下神经网络内部长什么样。输入就是 token 序列,这里举的是 4 个 token,也可以是 0 到比如一千个。理论上可以无限长,但算不动,所以会截到某个长度,这个长度就是模型的「最大上下文长度」。
Now, these inputs x are mixed up in a giant mathematical expression together with the parameters or the weights of these neural networks. So here I’m showing six example parameters and their setting. But in practice, these modern neural networks will have billions of these parameters. In the beginning, these parameters are completely randomly set. With a random setting of parameters, you might expect that this neural network would make random predictions, and it does in the beginning—it’s totally random predictions. But it’s through this process of iteratively updating the network—and we call that process training the neural network—that the setting of these parameters gets adjusted, such that the outputs of our neural network become consistent with the patterns seen in our training set.
这些输入 x 会和网络的参数(权重)一起放进一个巨大的数学式子。这里只画了 6 个参数做例子,实际现代网络有几十亿个参数。一开始参数全是随机的,所以一开始预测也是随机的。我们说的「训练」就是不断迭代更新这些参数,让网络的输出越来越符合训练数据里看到的模式。
So think of these parameters as kind of like knobs on a stereo. And as you’re twiddling these knobs, you’re getting different predictions for every possible token sequence input. And training a neural network just means discovering a setting of parameters that seems to be consistent with the statistics of the training set.
可以把参数想象成音响上的旋钮:拧不同的旋钮,对同一段输入就会得到不同预测。训练神经网络,就是在找一组参数,让预测和训练集的统计规律一致。
Now, let me just give you an example of what the giant mathematical expression looks like, just to give you a sense. Modern networks are massive expressions with trillions of terms, probably. But let me just show you a simple example here. It would look something like this. These are the kinds of expressions—just to show you that it’s not very scary. We have inputs x like x1, x2, in this case two example inputs. And they get mixed up with the weights of the network, w0, w1, w2, w3, et cetera. And this mixing is simple things like multiplication, addition, exponentiation, division, et cetera. And it is the subject of neural network architecture research to design effective mathematical expressions that have a lot of convenient characteristics. They are expressive, they’re optimizable, they’re parallelizable, et cetera.
给大家看一个简化版的大式子长什么样。现代网络可能是万亿项级别的式子,这里只举一个小例子,形式大概像这样,并不吓人:输入 x1、x2 和网络的权重 w0、w1、w2、w3 等混在一起,做乘法、加法、指数、除法之类的运算。神经网络架构研究就是在设计这类数学式子,让它们表达力强、好优化、好并行等等。
But at the end of the day, these are not complex expressions. Basically, they mix up the inputs with the parameters to make predictions. And we’re optimizing the parameters of this neural network so that the predictions come out consistent with the training set.
说到底,式子本身不复杂:就是把输入和参数混在一起得到预测,然后我们通过优化参数,让预测和训练集一致。
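正文说的那类「式子」大概是这样的感觉:输入 x 和权重 w 做乘法、加法,再套上指数之类的非线性。下面是一个完全虚构的小函数,仅用来体会「拧动参数旋钮,同样的输入就给出不同预测」:

```python
import math

def tiny_net(x1, x2, w):
    """把两个输入和 5 个参数混在一起, 输出压到 (0,1) 区间, 可当作一个概率。"""
    h = math.tanh(w[0] * x1 + w[1] * x2 + w[2])        # 乘加 + 非线性
    return 1 / (1 + math.exp(-(w[3] * h + w[4])))      # 再做一层, 过 sigmoid

x1, x2 = 1.0, 2.0
print(tiny_net(x1, x2, [0.1, -0.2, 0.3, 0.5, 0.0]))   # 0.5 (这组参数恰好让隐藏值为 0)
print(tiny_net(x1, x2, [0.9, 0.4, -0.1, -0.7, 0.2]))  # 换一组参数, 同样输入, 预测随之改变
```

真实网络只是把这种「乘加 + 非线性」堆到几十亿个参数的规模而已。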
Now, I would like to show you an actual production-grade example of what these neural networks look like. So for that, I encourage you to go through this website that has a very nice visualization of one of these networks. So this is what you will find on this website. And this neural network here that is used in production settings has this special kind of structure. This network is called the Transformer. And this particular one, as an example, has roughly 85,000 parameters. Out here on the top, we take the inputs, which are the token sequences. And then information flows through the neural network until the output, which here is the final softmax. But these are the predictions for what token comes next.
再给大家看一个真正生产级网络长什么样。可以去看一个做了很好可视化的网站,上面就是你会看到的结构。这种在生产里用的网络有一种特定结构,叫 Transformer。比如这个例子大约有 8.5 万个参数。最上面是输入(token 序列),信息一层层往下传,最后得到输出,也就是一个 softmax,表示下一个 token 的预测分布。
And then here there’s a sequence of transformations and all these intermediate values that get produced inside this mathematical expression, as it is sort of predicting what comes next.
中间是一连串变换,以及这个数学表达式里产生的各种中间结果,一路算下来就是在预测「下一个是什么」。
So as an example, these tokens are embedded into what’s called a distributed representation. So every possible token has kind of like a vector that represents it inside the neural network.
比如 token 会先被嵌入成一种「分布式表示」,每个 token 在网络里对应一个向量。
So first, we embed the tokens and then those values kind of flow through this diagram. And these are all very simple mathematical expressions individually. So we have layer norms and matrix multiplication and softmax and so on. So here’s kind of like the attention block of this Transformer. And then information kind of flows through into the multilayer perceptron block and so on. And all these numbers here—these are the intermediate values of the expression. And you can almost think of these as kind of like the firing rates of these synthetic neurons. I would caution you to not think of it too much like neurons, because these are extremely simple compared to the neurons you would find in your brain.
先做嵌入,然后这些值沿着图流动,每一步都是很简单的运算:层归一化、矩阵乘、softmax 等。这里是 Transformer 的注意力块,接着进多层感知机块等等。这些数字都是式子里的中间结果,可以粗略理解为这些「人工神经元」的激活强度。但别和生物神经元类比太多,它们比人脑里的神经元简单得多。
Your biological neurons are very complex dynamical processes that have memory and so on. There’s no memory in this expression. It’s a fixed mathematical expression from input to output with no memory. It’s just stateless. So these are very simple neurons in comparison to biological neurons, but you can still kind of loosely think of this as like a synthetic piece of brain tissue, if you like, to think about it that way. So information flows through, all these neurons fire, until we get to the predictions.
生物神经元是带记忆的复杂动力系统,而这个式子里没有记忆,从输入到输出是固定的、无状态的。所以和生物神经元比非常简化,但你仍然可以大致把它想成一块「人造脑组织」:信息流过,这些单元被激活,最后得到预测。
Now, I’m not actually going to dwell too much on the precise mathematical details of all these transformations. Honestly, I don’t think it’s that important to get into. What’s really important to understand is that this is a mathematical function. It is parameterized by some fixed set of parameters, like say 85,000 of them. And it is a way of transforming inputs into outputs. And as we twiddle the parameters, we are getting different kinds of predictions. And then we need to find a good setting of these parameters so that the predictions sort of match up with the patterns seen in the training set.
这些变换的精确数学细节我就不展开了,对理解主干没那么关键。重要的是:这就是一个数学函数,由一组固定参数(比如 8.5 万个)决定,把输入变成输出;拧参数就得到不同预测,训练就是在找一组参数,让预测和训练集里的模式对上。
So that’s the Transformer. So I’ve shown you the internals of the neural network and we talked a bit about the process of training it. I want to cover one more major stage of working with these networks. And that is the stage called inference. So in inference, what we’re doing is we’re generating new data from the model. We want to basically see what kind of patterns it has internalized in the parameters of its network.
Transformer 就介绍到这里。我们已经看了网络内部和训练过程,接下来要讲和这些网络打交道的另一个重要阶段:「推理」。推理就是从已经训好的模型里生成新数据,看看它参数里到底内化了哪些模式。
七、推理:从模型里生成文本(采样与随机性)
So to generate from the model is relatively straightforward. We start with some tokens that are basically your prefix—like what you want to start with.
从模型里生成文本其实很直接:先给一段前缀 token,表示你想从哪儿开始。
So say we want to start with the token “to anyone”. We feed it into the network. And remember that the network gives us probabilities, right? It gives us this probability vector here. So what we can do now is basically flip a biased coin: we sample a token based on this probability distribution. The tokens that are given high probability by the model are more likely to be sampled when you flip this biased coin. So we sample from the distribution to get a single unique token. For example, token 860 comes next. So in this case, when we’re generating from the model, 860 could come next. Now, 860 is a relatively likely token, but it might not be the only possible one; there could be many other tokens that could have been sampled.
比如我们从 “to anyone” 对应的 token 开始,喂进网络,网络给出一个概率分布。接下来我们就按这个分布「掷一枚有偏的硬币」:按概率抽样出一个 token,概率高的更容易被抽到。比如这次抽到的是 860,它是比较可能的一个,但不是唯一可能,也可能抽到别的。
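The “biased coin flip” described above can be sketched in a few lines of Python. The token ids and probabilities here are made up for illustration; a real model emits one probability per token in its vocabulary.
上面说的「有偏硬币」抽样可以用几行 Python 勾勒一下。这里的 token id 和概率都是为演示编的;真实模型会对词表里的每个 token 都给出一个概率。

```python
import random

# Toy "biased coin flip": sample the next token id from a probability
# distribution. Token ids and probabilities are hypothetical.
random.seed(0)

token_ids = [860, 287, 13659, 3962]   # made-up candidate tokens
probs     = [0.55, 0.25, 0.15, 0.05]  # model-assigned probabilities (sum to 1)

# random.choices implements exactly this weighted draw:
next_token = random.choices(token_ids, weights=probs, k=1)[0]

# Over many draws, high-probability tokens dominate:
counts = {t: 0 for t in token_ids}
for _ in range(10_000):
    counts[random.choices(token_ids, weights=probs, k=1)[0]] += 1
```

Sampling once gives a single token; repeating the draw 10,000 times shows why token 860, with probability 0.55, is the one you see most often.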
But we could see that 860 is a relatively likely token, as an example. And indeed, in our training example here, 860 does follow “anyone”. Let’s say that we continue the process. So after 91, we appended 860. And we again ask, what is the third token? Let’s sample. And let’s just say that it’s 287. Let’s do that again. We come back in. Now we have a sequence of three, and we ask, what is the likely 4th token? And we sample from that and get this one.
比如 860 就是一个相对可能的 token,在我们之前的训练例子里,”anyone” 后面也确实可能是它。我们继续:把 860 接到后面,再问「第三个 token 是什么?」再按分布抽样,比如得到 287。再重复:现在序列有三个 token,问第四个,再抽样得到一个。
And now let’s say we do it one more time. We take those four, we sample, and we get this one: 13,659. This is not actually 3,962 as we had before. This token is the token “article”. So “viewing a single article”—in this case we didn’t exactly reproduce the sequence that we saw here in the training data.
再做一次:用这四个 token 再抽下一个,得到 13,659。这次就不是之前的 3962 了,这个 token 对应 “article”。所以生成的是 “viewing a single article” 这类内容,并没有完全复现训练数据里的那段序列。
So keep in mind that these systems are stochastic. We’re sampling, we’re flipping coins, and sometimes we luck out and we reproduce some small chunk of the text in the training set. But sometimes we’re getting a token that was not verbatim part of any of the documents in the training data.
要记住这些系统是随机的:我们在抽样、在掷骰子,有时会碰巧复现训练集里的一小段,有时会抽到训练文档里从没原样出现过的 token。
So we’re gonna get sort of like remixes of all the data that we saw in training. At every step along the way we can flip the coin and get a slightly different token, and once that token makes it in, you sample the next one and so on—you very quickly start to generate token streams that are very different from the token streams that occur in the training documents. So statistically, they will have similar properties, but they are not identical to the training data. They’re kind of like inspired by the training data.
所以生成的是某种「再混合」:训练里见过的数据,在每一步我们都可以掷出略不同的 token,一个接一个采样下去,很快就会得到和训练文档里很不一样的 token 流。统计上会有相似性,但不会和某条训练数据一模一样,可以说是受训练数据启发的新序列。
So in this case, we got a slightly different sequence. And why would we get “article”? You might imagine that “article” is a relatively likely token in the context of “bar viewing single”, et cetera. And you could imagine that the word “article” followed this context window somewhere in the training documents, to some extent. And we just happened to sample it here at that stage.
这里我们就得到了一条略不同的序列。为什么会抽到 “article”?可以想象在 “bar viewing single” 这类上下文里,”article” 是相对常见的下一个词,训练文档里某处可能就有这样的续写,我们只是在这一步碰巧抽到了它。
So basically, inference is just predicting from these distributions one at a time. We continue feeding back tokens and getting the next one. And we are always flipping these coins. And depending on how lucky or unlucky we get, we might get very different kinds of patterns, depending on how we sample from these probability distributions. So that’s inference. So in most common scenarios, basically, downloading the internet and organizing it is a preprocessing step. You do that a single time. Once you have your token sequence, we can start training networks. And in practical cases, you would try to train many different networks of different kinds of settings and different kinds of arrangements and different kinds of sizes.
所以推理就是:按这些分布一个一个预测,把已生成的 token 再喂回去、得到下一个,一直掷骰子。运气不同,采样的结果就会差很多。这就是推理。大多数情况下,下载互联网、整理数据是预处理,做一次就行;有了 token 序列就可以开始训网络;实际中会尝试很多不同配置、不同规模、不同结构的网络。
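The whole inference loop just described—sample a token, append it, feed the longer sequence back in—fits in a short sketch. The `model` function below is a hypothetical stand-in that fakes a distribution from the context; in reality this is one forward pass of the Transformer.
上面描述的整个推理循环(抽样、拼接、再喂回去)可以写成一个小草图。下面的 `model` 只是假想的占位:真实情况下这一步是 Transformer 的一次前向计算。

```python
import random

VOCAB_SIZE = 50_257  # GPT-2's vocabulary size

def model(context):
    # Hypothetical stand-in for the trained network: a real forward pass
    # would return a probability distribution over the whole vocabulary.
    # Here we fake a deterministic distribution from the context.
    rng = random.Random(sum(context))
    weights = [rng.random() for _ in range(VOCAB_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(prefix, n_tokens, seed=0):
    # Autoregressive inference: sample a token from the model's
    # distribution, append it, feed the longer sequence back in, repeat.
    rng = random.Random(seed)
    tokens = list(prefix)
    for _ in range(n_tokens):
        probs = model(tokens)
        next_id = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        tokens.append(next_id)
    return tokens

out = generate([91, 860], n_tokens=4)  # continue a two-token prefix
```

With a fixed seed the run is reproducible, but changing the seed (re-flipping the coins) gives a different continuation from the same prefix—exactly the stochasticity the text describes.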
And so you’d be doing a lot of neural network training.
也就是说会做大量神经网络训练。
Then once you have a neural network and you train it, and you have some specific set of parameters that you’re happy with, then you can take the model and do inference—you can actually generate data from the model. When you’re on ChatGPT and you’re talking with a model, that model was trained by OpenAI, probably many months ago. They have a specific set of weights that work well. When you’re talking to the model, all of that is just inference. There’s no more training. Those parameters are held fixed. You’re just talking to the model—giving it some of the tokens, and it’s kind of completing token sequences. And that’s what you’re seeing generated when you actually use the model on ChatGPT. So that model just does inference alone. So let’s now look at an example of training and inference that is kind of concrete and gives you a sense of what this actually looks like when these models are trained.
等你训好一个网络、得到一组满意的参数,就可以拿这个模型做推理、从模型里生成数据。你在 ChatGPT 上对话时,用的就是 OpenAI 很可能好几个月前就训好、已经定死的那组参数。你和模型对话时,发生的全是推理,没有训练,参数不变,只是你给它一些 token,它在续写 token 序列,你在界面上看到的生成内容就是这么来的。下面看一个具体的训练与推理例子,感受一下这些模型在训的时候实际长什么样。
八、实例:GPT-2 与预训练成本
Now, the example that I would like to work with and that I am particularly fond of is that of OpenAI’s GPT-2. So GPT stands for Generative Pre-trained Transformer. This is the second iteration of the GPT series by OpenAI. When you are talking to ChatGPT today, the model that is underlying all of the magic of that interaction is GPT-4. So the 4th iteration of that series. Now GPT-2 was published in 2019 by OpenAI in this paper that I have right here. And the reason I like GPT-2 is that it is the first time that a recognizably modern stack came together. All the pieces of GPT-2 are recognizable today. By modern standards, it’s just everything has gotten bigger.
我想用的例子是 OpenAI 的 GPT-2,我自己特别喜欢。GPT 是 Generative Pre-trained Transformer 的缩写,是 OpenAI GPT 系列的第二代;今天你和 ChatGPT 对话时,背后已经是第四代 GPT-4。GPT-2 是 2019 年 OpenAI 在这篇论文里发布的。我喜欢它的原因是:第一次,一套「一眼能认出来的现代架构」完整出现,GPT-2 里的每一块在今天都还能对上号,只是按现在的标准,规模都变大了。
Now, I’m not gonna be able to go into the full details of this paper, because it is a technical publication. But some of the details that I would like to highlight are as follows. GPT-2 was a Transformer neural network, just like a neural network you would work with today. It had 1.6 billion parameters, right? So these are the parameters that we looked at here. We would have 1.6 billion of them. Today, modern Transformers would have a lot closer to 1 trillion or several hundred billion, probably. Maximum context length here was 1,024 tokens. So when we are sampling chunks or windows of tokens from the data set, we’re never taking more than 1,024 tokens. And so when you are trying to predict the next token in a sequence, you will never have more than 1,024 tokens in your context in order to make that prediction.
论文的完整细节就不展开了,只提几个要点。GPT-2 是 Transformer 神经网络,和今天用的那种同一类,有 16 亿参数;现在的主流模型往往是几千亿甚至接近一万亿参数。它的最大上下文长度是 1,024 个 token,也就是说从数据集里采样窗口时,最多只用 1,024 个 token,预测下一个 token 时上下文也不会超过这个长度。
Now, this is also tiny by modern standards. Today, the context length would be a lot closer to a couple hundred thousand or maybe even 1 million. You have a lot more context, a lot more tokens in history. And you can make a lot better prediction about the next token in the sequence in that way. And finally, GPT-2 was trained on approximately a hundred billion tokens. And this is also fairly small by modern standards. As I mentioned, the final data set that we looked at here has 15 trillion tokens. So hundred billion is quite small.
以今天的标准看,1,024 也很小;现在的上下文长度往往是几十万甚至百万 token,历史更长,预测下一个 token 会更有依据。GPT-2 的训练数据大约是 1000 亿个 token,同样偏小;前面说的 FineWeb 最终有 15 万亿 token,所以 1000 亿算很少。
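The figures quoted above, collected in one place for comparison. The “modern” numbers are the rough orders of magnitude mentioned in the text, not any specific model’s published spec.
把上面提到的数字放到一起对比一下。「现代」一栏是文中说的量级估计,并非某个具体模型的公开参数。

```python
# GPT-2 (2019) vs. rough modern orders of magnitude, per the text above.
gpt2 = {
    "parameters":      1_600_000_000,    # 1.6 billion
    "context_length":  1_024,            # tokens
    "training_tokens": 100_000_000_000,  # ~100 billion
}

modern = {
    "parameters":      1_000_000_000_000,   # several hundred billion to ~1T
    "context_length":  1_000_000,           # hundreds of thousands to ~1M
    "training_tokens": 15_000_000_000_000,  # e.g. FineWeb's ~15 trillion
}

# How much bigger has each axis gotten?
growth = {k: modern[k] / gpt2[k] for k in gpt2}
```

Every axis has grown by roughly two to three orders of magnitude in about five years.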
Now, I actually tried to reproduce GPT-2 for fun, as part of this project called llm.c, so you can see my write-up of doing that in a post on GitHub under the llm.c repository. In particular, the cost of training GPT-2 in 2019 was estimated to be approximately $40,000. But today you can do significantly better than that. In particular, here it took about one day and about $600. And this wasn’t even trying too hard. I think you could really bring this down to about $100 today.
我为了玩一玩还复现过 GPT-2,写在了 llm.c 项目里,GitHub 上可以找到。2019 年训 GPT-2 的成本估计在 4 万美元左右,现在可以便宜很多;我这次大约花了一天和 600 美元,而且还没特别抠,我觉得现在压到 100 美元左右是可行的。
Now, why is it that the costs have come down so much? Well, number one, these data sets have gotten a lot better. The way we filter them, extract them, and prepare them has gotten a lot more refined. The data set is of just a lot higher quality. So that’s one thing. But really, the biggest difference is that our computers have gotten much faster in terms of the hardware. And we’re gonna look at that in a second. And also the software for running these models and really squeezing out all the speed from the hardware as much as possible—that software has also gotten much better as everyone has focused on these models and tried to run them very efficiently.
成本为什么能降这么多?一是数据集变好了:过滤、抽取、清洗的流程都更成熟,数据质量高很多。但更大的原因是算力:硬件快了很多,我们马上会看;同时跑这些模型、把硬件榨干的软件也进步很大,因为大家都在盯着这类模型、拼命优化。
九、研究者视角:训练时你在看什么(Loss、步数、生成样本)
https://youtu.be/7xTGNNLPyMI?t=1869&si=5ZZfJ1y9z6lp6DZR
Now, I’m not gonna be able to go into the full detail of this reproduction. And this is a long technical post. But I would like to still give you an intuitive sense for what it looks like to actually train one of these models as a researcher. Like, what are you looking at? And what does it look like? What does it feel like? So let me give you a sense of that a little bit. So this is what it looks like. Let me slide this over. So what I’m doing here is I am training a GPT-2 model right now. What’s happening here is that every single line here, like this one, is one update to the model.
复现的完整细节就不展开了,帖子很长。但我想让大家直观感受一下:作为研究者,真正训一个这样的模型时,你在看什么、界面长什么样、是什么感觉。大概就是这样:我现在在训一个 GPT-2,这里的每一行代表对模型的一次更新。
So remember how here we are basically making the prediction better for every one of these tokens, updating the weights or parameters of the neural net. So here every single line is one update to the neural network, where we change its parameters by a little bit so that it is better at predicting the next token in the sequence. In particular, every single line here is improving the prediction on 1 million tokens in the training set. So we’ve basically taken 1 million tokens out of this data set, and we’ve tried to improve the prediction of the next token at all 1 million of those positions simultaneously. And at every single one of these steps, we are making an update to the network for that.
前面说过,我们是在让「下一个 token」的预测变好,也就是在更新网络的权重/参数。所以每一行都是一次参数更新,针对 100 万个 token:从数据集里取 100 万个 token,希望在这 100 万个位置上都把「下一个 token」预测得更准,每一步都对网络做一次更新。
Now the number to watch closely is this number called loss. The loss is a single number that is telling you how well your neural network is performing right now. And it is created so that low loss is good. So you’ll see that the loss is decreasing as we make more updates to the neural net, which corresponds to making better predictions on the next token in a sequence. And so the loss is the number that you are watching as a neural network researcher. You are kind of waiting, twiddling your thumbs. You’re drinking coffee and you’re making sure that this looks good so that with every update, the loss is improving, the network is getting better at prediction.
要盯着的数字叫 loss(损失)。它是一个数,表示当前模型表现如何,设计成「越低越好」。你会看到随着更新变多,loss 在下降,也就是对「下一个 token」的预测在变好。做神经网络研究时,你就是在盯着这个数,喝咖啡、确认每次更新后 loss 在降、网络在变好。
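The loss being watched here is, concretely, cross-entropy: the negative log probability the model assigned to the token that actually came next. A tiny worked example with a hypothetical 4-token vocabulary makes the “low is good” design visible:
这里盯的 loss 具体来说就是交叉熵:模型给「真实的下一个 token」分配的概率取负对数。用一个假想的 4 词词表算一下,就能看出「越低越好」是怎么设计出来的:

```python
import math

def next_token_loss(probs, target_id):
    # Cross-entropy at one position: negative log probability the model
    # assigned to the token that actually came next in the training data.
    return -math.log(probs[target_id])

# Hypothetical 4-token vocabulary; token 2 is the true next token.
bad_model  = [0.25, 0.25, 0.25, 0.25]  # uniform: knows nothing
good_model = [0.05, 0.05, 0.85, 0.05]  # confident and correct

print(next_token_loss(bad_model, 2))   # ≈ 1.386
print(next_token_loss(good_model, 2))  # ≈ 0.163
```

As the network’s predictions get sharper on the true next tokens, this number falls—which is exactly the downward drift the researcher is watching for.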
Now, here you see that we are processing 1 million tokens per update. Each update takes about 7 seconds, roughly. Here we are going to process a total of 32,000 steps of optimization. So 32,000 steps times 1 million tokens each is about 32 billion tokens that we are going to process. And we’re currently only about 420 steps in out of 32,000. So we are still only a bit more than 1% done, because I’ve only been running this for 10 or 15 minutes or something like that.
这里每次更新处理 100 万个 token,每次大约 7 秒。一共要跑 32,000 步优化,也就是 32,000 × 100 万 ≈ 320 亿 token。目前才到第 420 步,32,000 里只跑了一点点,所以只完成了 1% 多,因为才跑了十来分钟。
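The bookkeeping of this run is simple arithmetic, and checking it confirms both the “32 billion tokens” total and the “a bit more than 1%” progress claim:
这一轮训练的账本就是简单算术,算一下就能验证「320 亿 token」和「1% 多一点」这两个说法:

```python
# Bookkeeping for the run described above.
tokens_per_step  = 1_000_000
total_steps      = 32_000
seconds_per_step = 7  # rough per-update time quoted above

total_tokens = tokens_per_step * total_steps  # 32 billion tokens
progress     = 420 / total_steps              # ~1.3%: "a bit more than 1%"

# Rough total wall-clock time at 7 s/step:
wall_clock_days = total_steps * seconds_per_step / 86_400  # ~2.6 days
```

The ~2.6-day total also matches the later remark that the run “has to go for about a day or two more.”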
Now every 20 steps, I have configured this optimization to do inference. So what you’re seeing here is the model predicting the next token in the sequence. And so you sort of started randomly, and then you continue plugging in the tokens. So we’re running this inference, then. And this is the model sort of predicting the next token in the sequence. And every time you see something appear, that’s a new token. So let’s just look at this. And you can see that this is not yet very coherent, and keep in mind that this is only 1% of the way through training. The model is not yet very good at predicting the next token in the sequence.
我设成每 20 步做一次推理,所以你会看到模型在续写 token:从随机起点开始,把已生成的 token 再喂进去,不断预测下一个,每出现一段就是新生成的 token。可以看到现在还很不连贯,而且这才训了 1%,模型对「下一个 token」还预测得不好。
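The logging rhythm described here—one loss line per update, plus a generation every 20 steps to eyeball progress—can be mocked up as a skeleton. `MockModel` is entirely hypothetical; its decaying loss just imitates the shape of a healthy run.
这里描述的节奏(每次更新打印一行 loss,每 20 步生成一段看看效果)可以写成一个骨架。`MockModel` 完全是假想的,衰减的 loss 只是模仿一轮正常训练的形状。

```python
import random

class MockModel:
    # Hypothetical stand-in: loss drifts downward with each update, and
    # sampling returns random token ids (a real model would run inference
    # with its current, partially trained weights).
    def __init__(self):
        self.step = 0
    def train_step(self):
        self.step += 1
        return 11.0 * (0.999 ** self.step)
    def sample(self, n_tokens):
        return [random.randrange(50_257) for _ in range(n_tokens)]

model = MockModel()
checkpoints = []
for step in range(1, 101):        # first 100 of the 32,000 steps
    loss = model.train_step()     # one parameter update on a batch
    if step % 20 == 0:            # every 20 steps: do inference
        checkpoints.append((step, loss, model.sample(8)))

# checkpoints now holds 5 snapshots (steps 20, 40, ..., 100),
# each with a lower loss than the one before.
```

Early snapshots correspond to the random-looking text at step 20; later ones to the increasingly coherent samples the text describes.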
So what comes out is actually a little bit of gibberish, right? But it still has a little bit of local coherence: “So since she is mine, it’s a part of the information should discuss my father. Great companions. Gordon showed me sitting over it” and et cetera. So I know it doesn’t look very good, but let’s actually scroll up and see what it looked like when I started the optimization—all the way here at step one.
所以输出有点像胡话,但局部还是有一点连贯,比如「So since she is mine…」那段。我知道看起来不怎么样,但可以往上翻,看看优化刚开始、第一步时是什么样。
So after 20 steps of optimization, you see that what we’re getting here looks completely random. And that’s because the model has only had 20 updates to its parameters. It’s giving you random text because it’s a random network. And so you can see that, at least in comparison to this, the model is starting to do much better. And indeed, if we waited the entire 32,000 steps, the model would have improved to the point that it is actually generating fairly coherent English, and the token streams make up much better English. So this has to run for about a day or two more now. And so at this stage, we just make sure that the loss is decreasing, everything is looking good, and we just wait.
刚跑 20 步时,生成的内容完全随机,因为参数只更新了 20 次,网络还是随机的。和那时比,现在已经好不少了。要是跑完全部 32,000 步,模型会进步到能生成比较连贯的英文、token 流也像话很多。所以还得再跑一两天。现阶段就是确认 loss 在降、一切正常,然后等着。
全文总结与初学者学习重点
文章讲了什么(简要总结)
本文整理自 Stanford CS146S 课程中「像 ChatGPT 这样的大语言模型」入门讲解的讲稿(中英对照),按「从数据到模型」的流程,说明「你输入一句话、模型为什么会这样回」背后的机制。
核心线索:大模型是怎么被「造」出来的
预训练 = 用互联网文本学「下一个词是什么」
数据来自 Common Crawl、FineWeb 等,经 URL 过滤、正文抽取、语言过滤、去重、PII 移除等,得到几十 TB 级高质量文本。
文本要先变成 Token
模型不直接读「字」,而是读一串符号(Token)。文中讲了:比特 → 字节 → BPE 分词(字节对编码),以及 GPT-4 用约 10 万词表、可用 tiktokenizer 自行体验。
训练在做什么
从长文本里随机截一段(如 8000 个 token)作为「上文」,让神经网络预测下一个 token;根据预测对错更新参数(降低 loss),在海量数据上重复,即预训练。
模型长什么样
用的是 Transformer:输入 token → 嵌入成向量 → 多层注意力 + 前馈网络 → 输出「下一个 token 的概率分布」。参数是几十亿到上万亿个数字,训练就是调这些数字。
推理 = 用训好的模型「续写」
给你一段前缀,模型按学到的概率逐 token 抽样生成后面的内容;因抽样带随机性,同一问题可能得到不同回答,生成内容像是训练数据的「统计重混」而非背答案。
以 GPT-2 为例
用 16 亿参数、1024 上下文、约 1000 亿 token 训练;2019 年要几万美元,现在复现可便宜到几百甚至约 100 美元,并展示了训练时 loss 下降、每若干步做一次生成看效果。
算力从哪来
真正训大模型要用云上 GPU(如 8× H100 节点、按小时租),大厂会建数据中心、抢 GPU;NVIDIA 市值高也与「所有人都在为预测下一个 token 买算力」有关。
预训练得到的叫「基础模型」
只会续写文本、不会乖乖回答问题;要变成 ChatGPT 那种助手,还要后面的后训练(监督微调、强化学习等),本文只讲到「预训练 + 基础模型」为止。
作为初学者应该关注的重点
| 优先级 | 重点内容 | 为什么重要 |
|---|---|---|
| 必懂 | Token 与分词(第四节) | 一切输入输出都是 token 序列;不懂「hello world」怎么被切成 2 个 token,后面 prompt、长度限制、成本都难理解。 |
| 必懂 | 训练在学什么:预测下一个 Token(第五节) | 大模型本质就是一个「下一个 token 预测器」;理解「窗口 + 预测 + 更新参数」,就抓住了预训练在干什么。 |
| 必懂 | 推理 = 按概率抽样(第七节) | 为什么同一问题会得到不同答案、为什么会有「幻觉」,都跟「按概率抽样」有关,这是使用和理解模型行为的基础。 |
| 建议懂 | Transformer 是啥(第六节) | 不需要记公式,但要懂:输入 token → 向量 → 多层计算 → 输出概率分布;知道「参数巨大、一次前向传播算下一个 token」。 |
| 建议懂 | 数据从哪来、怎么过滤(第二、三节) | 理解「高质量、多样化、大规模」文本从 Common Crawl 到 FineWeb 的流程,有助于理解模型能力边界和偏见来源。 |
| 可选 | GPT-2 参数规模与成本(第八、九节) | 建立「参数量、数据量、算力、成本」的直观感觉,知道预训练很贵、但小规模复现已经可以很便宜。 |
| 可选 | 算力与 GPU(第十节) | 知道大模型依赖云 GPU、数据中心,对产业和「为什么大模型集中在少数公司」有概念即可。 |
一句话总结
本文讲的是:ChatGPT 这类模型的第一步「预训练」——用互联网文本做成 token 序列,训练一个巨大的「下一个 token 预测器」(Transformer),你平时用的对话其实是它在做带随机性的续写;初学者优先搞懂「Token、预测下一个 token、推理时的抽样」,再看 Transformer 和数据/算力,就能搭起一个清晰的心理模型。