从零实现一个大模型（三）

#LLM #Python #BPE #GPT #Transformer

学习

本文从零实现一个大模型，用于帮助理解大模型的基本原理，本文是一个读书笔记，内容来自于Build a Large Language Model (From Scratch)

构建模型

类GPT模型的架构图如下，该类主要包含以下两个个未实现的模块：Output layers和Transformer block。其他模块在之前的章节中已经实现。

首先初始化GPTModel类所需的配置参数

GPT_CONFIG_124M = {
    "vocab_size": 50257,    # 词汇表大小
    "context_length": 1024, # 上下文长度
    "emb_dim": 768,         # 嵌入向量的维度
    "n_heads": 12,          # 多头注意力的个数
    "n_layers": 12,         # 模型中 Transformer 模块的层数
    "drop_rate": 0.1,       # 丢弃率，防止模型过拟合
    "qkv_bias": False       # Query-Key-Value bias，注意力类中会用到该参数
}

模型类的的代码结构如下

import torch
import torch.nn as nn

class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        
        # Use a placeholder for TransformerBlock
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        
        # Use a placeholder for LayerNorm
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits


class DummyTransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # 待实现

    def forward(self, x):
        # 待实现
        return x


class DummyLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        # 待实现

    def forward(self, x):
        # 待实现
        return x

归一化层

首先需要实现的的归一化层，即上文代码中的LayerNorm，该类的作用是将inputs的最后一个维度（嵌入向量的维度，emb_dim）进行标准化，将他们的值调整为均值为 0，方差为 1（即单位方差）。
这样做的目的是为了防止梯度消失或者梯度爆炸，加速权重的收敛速度，确保训练过程的一致性和稳定性。归一化层用在多头注意力层之前和之后，也应用在最终的输出层之前

梯度指的是变化率，描述了某个值（例如函数输出值）对另一个值（如输入变量）的变化趋势。大模型在应用梯度的概念时，首先会设计一个损失函数，用来衡量模型的预测结果与目标结果的差距。在训练过程中，它通过梯度去帮助每个模型参数不断调整来快速减少损失函数的值，从而提高模型的预测精度。

上图演示了一个具有5个输入和6个输出的神经网络层，输出层的6个值，其均值为0，方差为1。

torch.manual_seed(123)
# 创建2个训练样本，每个样本有5个维度（特征）
batch_example = torch.randn(2, 5) 
# 创建一个具有 5 个输入和 6 个输出的神经网络层
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
# 查看输出
print(out)

# 查看层归一化处理之前的均值和方差
mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)

print("\n")

# 查看层归一化处理之前的均值和方差
out_norm = (out - mean) / torch.sqrt(var)
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)
print("Normalized layer outputs:\n", out_norm)
print("Mean:\n", mean)
print("Variance:\n", var)

tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
        [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
       grad_fn=<ReluBackward0>)
Mean:
 tensor([[0.1324],
        [0.2170]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[0.0231],
        [0.0398]], grad_fn=<VarBackward0>)


Normalized layer outputs:
 tensor([[ 0.6159,  1.4126, -0.8719,  0.5872, -0.8719, -0.8719],
        [-0.0189,  0.1121, -1.0876,  1.5173,  0.5647, -1.0876]],
       grad_fn=<DivBackward0>)
Mean:
 tensor([[-5.9605e-08],
        [ 1.9868e-08]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)

以上是层归一化的实现原理，根据以上过程，LayerNorm的具体实现如下：

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        # scale 和 shift 是两个可训练参数（与输入具有相同的维度）。大语言模型（LLM）在训练中会自动调整这些参数，
        # 以改善模型在训练任务上的性能。这使得模型能够学习适合数据处理的最佳缩放和偏移方式。
        return self.scale * norm_x + self.shift

调用上面的层归一化类

ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)

Mean:
 tensor([[-2.9802e-08],
        [ 0.0000e+00]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)

使用GELU激活函数实现前馈神经网络

接下来需要实现一个前馈神经网络FeedForward类，但是在实现这个类之前要先了解一下GELU激活函数，GELU激活函数是更复杂、平滑的激活函数，分别结合了高斯分布和 sigmoid 门控线性单元。他可以为深度学习模型提供更好的性能。这里暂时不深究他的数学原理，直接实现：

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * 
            (x + 0.044715 * torch.pow(x, 3))
        ))

实现了GELU激活函数之后，就可以实现前馈神经网络了，FeedForward 模块是一个小型神经网络，由两个线性层和一个 GELU 激活函数组成

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]), # 线性层
            GELU(), # 激活函数
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)

前馈神经网络的输入和输出具有相同的维度

ffn = FeedForward(GPT_CONFIG_124M)

# input shape: [batch_size, num_token, emb_size]
x = torch.rand(2, 3, 768) 
out = ffn(x)
print(out.shape)

torch.Size([2, 3, 768])

这里实现的 FeedForward 模块对模型能力的增强（主要体现在从数据中学习模式并泛化方面）起到了关键作用。尽管该模块的输入和输出维度相同，但在内部，它首先通过第一个线性层将嵌入维度扩展到一个更高维度的空间。之后再接入非线性 GELU 激活，最后再通过第二个线性层变换回原始维度。这样的设计能够探索更丰富的表示空间。上面的架构如图所示

实现快捷连接

接下来，还要再讨论一下快捷连接（也称跳跃连接或残差连接）的概念，它用于缓解梯度消失问题。梯度消失是指在训练中指导权重更新的梯度在反向传播过程中逐渐减小，导致早期层（靠近输入端的网络层）难以有效训练。

简单的说，快捷连接有以下两个作用：

保持信息（或者说是特征）流畅传递
缓解梯度消失问题

快捷连接的实现示例如下

class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU())
        ])

    def forward(self, x):
        for layer in self.layers:
            # 计算当前层的输出层
            layer_output = layer(x)
            # 如果使用了快捷连接，则将当前层和输出层直接相加
            if self.use_shortcut and x.shape == layer_output.shape:
                x = x + layer_output
            else:
                x = layer_output
        return x

接下来实现一个函数打印梯度对比观察使用快捷连接的效果

def print_gradients(model, x):
    # Forward pass
    output = model(x)
    target = torch.tensor([[0.]])

    # Calculate loss based on how close the target
    # and output are
    loss = nn.MSELoss()
    loss = loss(output, target)
    
    # Backward pass to calculate the gradients
    loss.backward()

    for name, param in model.named_parameters():
        if 'weight' in name:
            # Print the mean absolute gradient of the weights
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")

不使用快捷连接时的结果如下：

layer_sizes = [3, 3, 3, 3, 3, 1]  

sample_input = torch.tensor([[1., 0., -1.]])

torch.manual_seed(123)
model_without_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=False
)
print_gradients(model_without_shortcut, sample_input)

layers.0.0.weight has gradient mean of 0.00020173587836325169
layers.1.0.weight has gradient mean of 0.0001201116101583466
layers.2.0.weight has gradient mean of 0.0007152041071094573
layers.3.0.weight has gradient mean of 0.0013988735154271126
layers.4.0.weight has gradient mean of 0.005049645435065031

梯度在从最后一层（layers.4）到第一层（layers.0）时逐渐减小，这种现象称为梯度消失问题。再看看使用快捷连接的效果：

torch.manual_seed(123)
model_with_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=True
)
print_gradients(model_with_shortcut, sample_input)

layers.0.0.weight has gradient mean of 0.22169791162014008
layers.1.0.weight has gradient mean of 0.20694106817245483
layers.2.0.weight has gradient mean of 0.32896995544433594
layers.3.0.weight has gradient mean of 0.2665732204914093
layers.4.0.weight has gradient mean of 1.3258540630340576

可以已经没有梯度消失的问题了。

实现TransformerBlock类

以上模块实现之后，接下来就可以实现GPTModel类的核心模块TransformerBlock类了，他的结构图如下

根据上面的架构图，将之前实现的子模块组装成TransformerBlock类

from gpt import MultiHeadAttention # 上一章实现的多头注意力类

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"], 
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # 快捷连接
        shortcut = x
        x = self.norm1(x) # 第一次使用归一化层处理
        x = self.att(x)  # 使用多头注意力层处理
        x = self.drop_shortcut(x) # 随机丢弃一些值
        x = x + shortcut  # 第一次使用快捷连接

        # 快捷连接
        shortcut = x
        x = self.norm2(x) # 第二次使用归一化层处理
        x = self.ff(x) # 使用前馈网络层处理
        x = self.drop_shortcut(x) # 随机丢弃一些值
        x = x + shortcut  # 第二次使用快捷连接

        return x

实现GPTModel类

核心模块TransformerBlock类实现之后，就可以实现GPTModel类了，类架构图如下：

根据上面的架构图，将之前实现的函数组装成GPTModel类

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        # 将TransformBlock类重复12次
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx) # 获取输入的嵌入向量
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device)) # 获取输入的位置嵌入向量
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x) # 使用TransformBlock类处理
        x = self.final_norm(x) # 使用归一化层处理
        logits = self.out_head(x) # 最后使用线性层处理
        return logits

文本测试

创建一个辅助函数测试上面的GPTModel类

def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):
        
        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]
        
        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)
        
        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]  

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx

import tiktoken

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval()  # disable dropout

start_context = "Hello, I am"

tokenizer = tiktoken.get_encoding("gpt2")
encoded = tokenizer.encode(start_context)
encoded_tensor = torch.tensor(encoded).unsqueeze(0)

print(f"\n{50*'='}\n{22*' '}IN\n{50*'='}")
print("\nInput text:", start_context)
print("Encoded input text:", encoded)
print("encoded_tensor.shape:", encoded_tensor.shape)

out = generate_text_simple(
    model=model,
    idx=encoded_tensor,
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)
decoded_text = tokenizer.decode(out.squeeze(0).tolist())

print(f"\n\n{50*'='}\n{22*' '}OUT\n{50*'='}")
print("\nOutput:", out)
print("Output length:", len(out[0]))
print("Output text:", decoded_text)

可以看到，经过模型处理后，已经有输出了，但是输出的内容还是错误的，这是因为模型还没有经过任何的训练，下一章要训练模型，来输出合理的内容。

从零实现一个大模型（三）

目录

1. 文本处理

2. 注意力机制

3. 开发一个Transform架构的大模型

4. 使用无标记数据预训练模型

5. 分类微调

6. 指令微调