Implementing a Large Language Model from Scratch (Part 2)

Posted by Neil Ning on 2025-07-15
Learning

This post implements a large language model from scratch to help explain the basic principles behind LLMs. It is a set of reading notes based on Build a Large Language Model (From Scratch).

Table of Contents

This series covers the following topics; we are currently on topic 2:

1. Text processing

2. Attention mechanisms

3. Building an LLM with the Transformer architecture

4. Pretraining on unlabeled data

5. Fine-tuning for classification

6. Instruction fine-tuning

Attention Mechanisms

The attention mechanism is a technique used in the Transformer architecture; its purpose is to let each position in a long sequence relate to the elements at every other position.

Simple self-attention without trainable weights

To fully understand how self-attention works, we start with a simplified version of the mechanism.

To obtain a context vector, we multiply the input vectors by their attention weights and sum the results. The overall process looks roughly like this:

(Figure 07.webp: computing a context vector from the attention weights and the input vectors)

Computing the attention scores

To obtain the attention weights, the first step is to compute the attention scores. For a single input, the computation looks like this:

import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89], # Your     (x^1)
     [0.55, 0.87, 0.66], # journey  (x^2)
     [0.57, 0.85, 0.64], # starts   (x^3)
     [0.22, 0.58, 0.33], # with     (x^4)
     [0.77, 0.25, 0.10], # one      (x^5)
     [0.05, 0.80, 0.55]] # step     (x^6)
)
query = inputs[1]  # take the second input as an example

attn_scores_2 = torch.empty(inputs.shape[0]) # attention scores for the second input
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query) # dot product

print(attn_scores_2) # print the attention scores for the second input token
tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])

What we computed above are essentially the attention scores of the six input words with respect to the second word, "journey", which is why the result has length 6, the total number of inputs.
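
As a side note, the for loop above can be replaced by a single matrix-vector multiplication. A minimal sketch using the inputs and query defined above:

attn_scores_2 = inputs @ query # same result as the element-wise dot products above
print(attn_scores_2)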

Computing the attention weights

Once the attention scores are available from the previous step, the second step is simply to normalize them; the result is the attention weights, whose defining property is that they sum to 1.

attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()

print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())
Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)

The code above illustrates the general idea of obtaining attention weights. In practice, we can directly use the softmax function provided by torch to compute them:

attn_weights_2 = torch.softmax(attn_scores_2, dim=0)

print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())
Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
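
Softmax exponentiates each score and then normalizes by the sum of the exponentials. The naive implementation below is only a sketch to show what torch.softmax does; the built-in version is more numerically stable:

def softmax_naive(x):
    # exponentiate each element, then divide by the sum of the exponentials
    return torch.exp(x) / torch.exp(x).sum(dim=0)

print(softmax_naive(attn_scores_2)) # matches torch.softmax for these values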

Computing the context vector

The third step is to compute the context vector: multiply each input token's embedding by its attention weight and sum the results to obtain the context vector for the query token. Note that the resulting context vector has the same dimensionality as the input token embedding, i.e. the context vector for "journey" has the same dimension as the embedding vector for "journey".

query = inputs[1] # take the second input as an example

context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i # weight each input vector by its attention weight and accumulate

print(context_vec_2)
tensor([0.4419, 0.6515, 0.5683])

The procedure above only computes the context vector for a single input; in an LLM we need context vectors for all inputs.

Computing context vectors for all inputs

Now that we understand how a single context vector is computed, we can compute the context vectors for all inputs at once:

# 1. compute the attention scores
attn_scores = inputs @ inputs.T
# 2. compute the attention weights
attn_weights = torch.softmax(attn_scores, dim=-1)
# 3. compute the context vectors
all_context_vecs = attn_weights @ inputs
print(all_context_vecs)
tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])
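
As a quick sanity check (an added snippet, not from the original notes), the second row should match the context_vec_2 computed earlier with the loop:

print(torch.allclose(all_context_vecs[1], context_vec_2)) # should print True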

Implementing self-attention with trainable weights

Self-attention with trainable weights follows the same procedure as above: we still compute a context vector for a given input token. The difference is that it introduces new weight matrices that are updated while the model is trained. These trainable weight matrices are crucial for an LLM, because they allow the model to learn to produce better context vectors.

The three weight matrices are W_query, W_key, and W_value. Multiplying an input vector x by each of them yields the query, key, and value vectors:

  • Query vector: q = x @ W_query
  • Key vector: k = x @ W_key
  • Value vector: v = x @ W_value

As before, we use the word "journey" to walk through the computation.

x_2 = inputs[1] # the second word, "journey"
d_in = inputs.shape[1] # the input embedding size, 3
d_out = 2 # the output embedding size, d=2

Initializing the weight matrices

torch.manual_seed(123)

# requires_grad=False keeps the example outputs uncluttered; for actual training these matrices would be trainable
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

Computing the query, key, and value vectors:

query_2 = x_2 @ W_query 
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value

print(query_2)
tensor([0.4306, 1.4551])

The example above computes the Q, K, V vectors for a single input; computing them for all inputs is just as simple:

keys = inputs @ W_key 
values = inputs @ W_value

print("keys.shape:", keys.shape)
print("values.shape:", values.shape)
keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])

With the query, key, and value vectors in hand, we can compute the attention scores:

keys_2 = keys[1] # compute the attention score for the second word
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)
tensor(1.8524)
attn_scores_2 = query_2 @ keys.T # compute the attention scores against all inputs
print(attn_scores_2)
tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])

The third step is the same as before: apply softmax to normalize the attention scores into attention weights, which sum to 1. The difference is that we now scale the scores first by dividing them by the square root of the key dimension.

d_k = keys.shape[1] # dimension of the key vectors
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)
tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])
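
The reason for the 1/sqrt(d_k) scaling is that dot products of higher-dimensional vectors tend to have larger variance, which pushes softmax into regions with very small gradients; dividing by sqrt(d_k) keeps the variance roughly constant. A small illustration (an added sketch, not from the original notes):

torch.manual_seed(123)
a = torch.randn(1000, 64)
b = torch.randn(1000, 64)
dots = (a * b).sum(dim=-1)    # dot products of random 64-dimensional vectors
print(dots.var())             # roughly 64, i.e. the dimension d_k
print((dots / 64**0.5).var()) # roughly 1 after scaling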

The final step is to compute the context vector:

context_vec_2 = attn_weights_2 @ values
print(context_vec_2)
tensor([0.3061, 0.8210])
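
Putting the steps together gives scaled dot-product attention. The sketch below (reusing the W_query, keys, values, and d_k defined above; an added example, not from the original notes) computes the context vectors for all inputs at once:

queries_all = inputs @ W_query
attn_weights_all = torch.softmax(queries_all @ keys.T / d_k**0.5, dim=-1)
context_vecs_all = attn_weights_all @ values
print(context_vecs_all.shape) # torch.Size([6, 2])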

Implementing a self-attention class

Tidying up the code above, we can implement a self-attention class:

import torch.nn as nn

class SelfAttention_v2(nn.Module):

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec

torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))
tensor([[-0.0739,  0.0713],
        [-0.0748,  0.0703],
        [-0.0749,  0.0702],
        [-0.0760,  0.0685],
        [-0.0763,  0.0679],
        [-0.0754,  0.0693]], grad_fn=<MmBackward0>)

The implementation above uses nn.Linear instead of nn.Parameter(torch.rand(...)), because nn.Linear comes with a better weight-initialization scheme, which makes training more stable.
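
One way to see the relationship between the two: with bias=False, an nn.Linear layer's forward pass is simply a multiplication by the transpose of its stored weight matrix. A minimal sketch (an added example, not from the original notes):

# nn.Linear stores its weight with shape (d_out, d_in), so the forward pass computes x @ W.T
manual = inputs @ sa_v2.W_query.weight.T
print(torch.allclose(manual, sa_v2.W_query(inputs))) # should print True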

Implementing causal self-attention

In causal self-attention, the entries above the diagonal of the attention-score matrix must be masked out. This ensures that when the attention scores are used to compute context vectors, the model cannot use tokens that come after the current position; it may only use the current token and the ones before it. In other words, the LLM can only compute its next output from the outputs it has already generated, as shown in the figure below.

(Figure 19.webp: the causal attention mask hides the positions above the diagonal)

Here is how the mask mechanism can be implemented in code:

queries = sa_v2.W_query(inputs)
keys = sa_v2.W_key(inputs)
attn_scores = queries @ keys.T

context_length = attn_scores.shape[0]
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)
tensor([[0.2899,   -inf,   -inf,   -inf,   -inf,   -inf],
        [0.4656, 0.1723,   -inf,   -inf,   -inf,   -inf],
        [0.4594, 0.1703, 0.1731,   -inf,   -inf,   -inf],
        [0.2642, 0.1024, 0.1036, 0.0186,   -inf,   -inf],
        [0.2183, 0.0874, 0.0882, 0.0177, 0.0786,   -inf],
        [0.3408, 0.1270, 0.1290, 0.0198, 0.1290, 0.0078]],
       grad_fn=<MaskedFillBackward0>)
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1) # compute the attention weights
print(attn_weights)
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)
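
The reason -inf works as the mask value is that exp(-inf) is 0, so the masked positions receive exactly zero weight after softmax while the remaining weights in each row still sum to 1. A one-line illustration (an added example, not from the original notes):

print(torch.softmax(torch.tensor([1.0, 2.0, -torch.inf]), dim=0)) # the masked (third) position gets weight 0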

Adding dropout

In practice, to keep the model from overfitting during training, we also randomly drop some values. Dropout can be applied at either of the following points:

  1. after computing the attention weights, randomly drop some of the weights
  2. after computing the context vectors, randomly drop some of the values

The first option is the more common one. The figure below shows a dropout rate of 50%:

(Figure 22.webp: attention weights with a dropout rate of 50%)

The code is straightforward:

torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5) # dropout rate of 50%
print(dropout(attn_weights))
tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.7599, 0.6194, 0.6206, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.4921, 0.4925, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.3966, 0.0000, 0.3775, 0.0000, 0.0000],
        [0.0000, 0.3327, 0.3331, 0.3084, 0.3331, 0.0000]],
       grad_fn=<MulBackward0>)
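
Note that the surviving values are scaled up: in training mode, nn.Dropout multiplies the elements it keeps by 1/(1-p), here a factor of 2, so that the expected value stays unchanged. A small illustration with a tensor of ones (an added example, not from the original notes):

torch.manual_seed(123)
example = torch.ones(6, 6)
print(dropout(example)) # kept elements become 2.0, the dropped ones become 0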

Implementing a causal self-attention class

Next we implement a causal self-attention class that includes both the mask and dropout:

import torch.nn as nn
class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout) # New
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) # New

    def forward(self, x):
        b, num_tokens, d_in = x.shape # New batch dimension b
        # For inputs where `num_tokens` exceeds `context_length`, this will result in errors
        # in the mask creation further below.
        # In practice, this is not a problem since the LLM (chapters 4-7) ensures that inputs
        # do not exceed `context_length` before reaching this forward method.
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2) # Changed transpose
        attn_scores.masked_fill_( # New, _ ops are in-place
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf) # `:num_tokens` to account for cases where the number of tokens in the batch is smaller than the supported context_size
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights) # New

        context_vec = attn_weights @ values
        return context_vec


x_2 = inputs[1] # second input element
d_in = inputs.shape[1] # the input embedding size, d=3
d_out = 2 # the output embedding size, d=2
torch.manual_seed(123)
batch = torch.stack((inputs, inputs), dim=0)

context_length = batch.shape[1]
ca = CausalAttention(d_in, d_out, context_length, 0.0)

context_vecs = ca(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
tensor([[[-0.4519,  0.2216],
         [-0.5874,  0.0058],
         [-0.6300, -0.0632],
         [-0.5675, -0.0843],
         [-0.5526, -0.0981],
         [-0.5299, -0.1081]],

        [[-0.4519,  0.2216],
         [-0.5874,  0.0058],
         [-0.6300, -0.0632],
         [-0.5675, -0.0843],
         [-0.5526, -0.0981],
         [-0.5299, -0.1081]]], grad_fn=<UnsafeViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])

Extending single-head attention to multi-head attention

The core idea of multi-head attention is to run the single-head attention mechanism several times in parallel and combine the results. A naive version simply stacks several CausalAttention modules, as sketched below, while the MultiHeadAttention class that follows achieves the same effect more efficiently by splitting a single set of weight matrices into heads.
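
A minimal sketch of that naive stacking approach, implemented as a thin wrapper over the CausalAttention class defined above (an added example, not from the original notes):

class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        # one independent causal-attention module per head
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
             for _ in range(num_heads)]
        )

    def forward(self, x):
        # concatenate the per-head context vectors along the last dimension
        return torch.cat([head(x) for head in self.heads], dim=-1)

torch.manual_seed(123)
wrapper = MultiHeadAttentionWrapper(d_in, d_out, batch.shape[1], 0.0, num_heads=2)
print(wrapper(batch).shape) # torch.Size([2, 6, 4]) -- two heads, each producing 2 dimensions

In this naive version each head keeps its own weight matrices and the outputs are simply concatenated; the class below instead projects once and splits the result into heads, which is cheaper.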

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out) # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        # As in `CausalAttention`, for inputs where `num_tokens` exceeds `context_length`,
        # this will result in errors in the mask creation further below.
        # In practice, this is not a problem since the LLM (chapters 4-7) ensures that inputs
        # do not exceed `context_length` before reaching this forward method.

        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec) # optional projection

        return context_vec
torch.manual_seed(123)

batch_size, context_length, d_in = batch.shape
d_out = 2
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=2)

context_vecs = mha(batch)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)
tensor([[[0.3190, 0.4858],
         [0.2943, 0.3897],
         [0.2856, 0.3593],
         [0.2693, 0.3873],
         [0.2639, 0.3928],
         [0.2575, 0.4028]],

        [[0.3190, 0.4858],
         [0.2943, 0.3897],
         [0.2856, 0.3593],
         [0.2693, 0.3873],
         [0.2639, 0.3928],
         [0.2575, 0.4028]]], grad_fn=<ViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])
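
As a final illustration of how the class scales up, here is a hypothetical instantiation with GPT-2-small-like dimensions (embedding size 768, context length 1024, 12 heads); the numbers are only for illustration, not from the original notes:

torch.manual_seed(123)
gpt2_like_mha = MultiHeadAttention(
    d_in=768, d_out=768, context_length=1024, dropout=0.0, num_heads=12
)
dummy_batch = torch.rand(2, 1024, 768) # (batch_size, num_tokens, embedding_dim)
print(gpt2_like_mha(dummy_batch).shape) # torch.Size([2, 1024, 768])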

References

  1. Build a Large Language Model (From Scratch)