Linear Regression

The loss function: $l^{(i)} (w, b) = \frac{1}{2} (\hat{y}^{(i)} - y^{(i)})^{2} .$
To measure the quality of a model on the entire dataset of $n$ examples, we average the per-example losses over the training set (summing instead only rescales the objective by $n$ and leaves the minimizer unchanged): $L (w, b) = \frac{1}{n} \sum_{i = 1}^{n} l^{(i)} (w, b) = \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{2} (w^{\top} x^{(i)} + b - y^{(i)})^{2} .$ We then look for the parameters $w^{*}, b^{*}$ that minimize the total loss: $w^{*}, b^{*} = \operatorname*{argmin}_{w, b} L (w, b) .$
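
The two formulas above can be checked by hand on a tiny dataset. The following is a minimal pure-Python sketch; the helper names `predict`, `squared_loss`, and `mean_loss` and the toy numbers are illustrative assumptions, not part of the original notes.

```python
def predict(w, b, x):
    """Affine prediction w^T x + b for one example (w and x are lists)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def squared_loss(y_hat, y):
    """Per-example loss l = 1/2 * (y_hat - y)^2."""
    return 0.5 * (y_hat - y) ** 2


def mean_loss(w, b, X, Y):
    """Dataset loss L(w, b): the average of the per-example losses."""
    return sum(squared_loss(predict(w, b, x), y) for x, y in zip(X, Y)) / len(X)


# Toy data generated by y = 1*x1 + 2*x2 (assumed values for illustration)
X = [[1.0, 2.0], [3.0, 4.0]]
Y = [5.0, 11.0]

print(mean_loss([1.0, 2.0], 0.0, X, Y))  # perfect fit, prints 0.0
print(mean_loss([0.0, 0.0], 0.0, X, Y))  # (12.5 + 60.5) / 2, prints 36.5
```

The second call shows why the argmin matters: the all-zero parameters give a loss of 36.5, while the generating parameters drive it to 0.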

Minibatch SGD (stochastic gradient descent)

Learning rate: $\eta$
Hyperparameters: minibatch size ($|B|$) and learning rate. Hyperparameter tuning is the process of choosing their values.

initialize the values of the model parameters, typically at random;
iteratively sample random minibatches from the data, updating the parameters in the direction of the negative gradient.

$$(w, b) \leftarrow (w, b) - \frac{\eta}{|B|} \sum_{i \in B_{t}} \partial_{(w, b)} l^{(i)} (w, b) .$$

For quadratic losses and affine transformations, this has a closed-form expansion:

$$\begin{aligned}
w &\leftarrow w - \frac{\eta}{|B|} \sum_{i \in B_{t}} \partial_{w} l^{(i)} (w, b) = w - \frac{\eta}{|B|} \sum_{i \in B_{t}} x^{(i)} (w^{\top} x^{(i)} + b - y^{(i)}) , \\
b &\leftarrow b - \frac{\eta}{|B|} \sum_{i \in B_{t}} \partial_{b} l^{(i)} (w, b) = b - \frac{\eta}{|B|} \sum_{i \in B_{t}} (w^{\top} x^{(i)} + b - y^{(i)}) .
\end{aligned}$$
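
To make the update rule concrete, here is a minimal pure-Python sketch of a single minibatch step for the squared loss. The function name `sgd_step` and the two-example batch are assumptions chosen for illustration; the arithmetic follows the expanded formulas for $w$ and $b$.

```python
def sgd_step(w, b, batch, lr):
    """One minibatch SGD update for squared loss with an affine model.

    batch is a list of (x, y) pairs, where x is a list of features.
    Returns the updated (w, b) without modifying the inputs.
    """
    n = len(batch)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in batch:
        # Prediction error (w^T x + b - y) for this example
        err = sum(wj * xj for wj, xj in zip(w, x)) + b - y
        for j, xj in enumerate(x):
            grad_w[j] += xj * err  # accumulates x * err
        grad_b += err              # accumulates err
    # Average the summed gradients over the batch and step against them
    new_w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    new_b = b - lr * grad_b / n
    return new_w, new_b


# Two examples from y = 2x, starting at w = 0, b = 0
w, b = sgd_step([0.0], 0.0, [([1.0], 2.0), ([2.0], 4.0)], lr=0.1)
print(w, b)  # w moves to about 0.5 and b to about 0.3
```

Working it by hand: the errors are $-2$ and $-4$, so the summed gradients are $-10$ for $w$ and $-6$ for $b$; dividing by $|B| = 2$ and multiplying by $\eta = 0.1$ gives steps of $0.5$ and $0.3$.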

Vectorization for Speed

Goal: to process whole minibatches of examples simultaneously.
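
As a sketch of what processing a whole minibatch at once means for the SGD updates: stack the examples of $B_{t}$ as the rows of a matrix $X \in \mathbb{R}^{|B| \times d}$ and the targets as a vector $y \in \mathbb{R}^{|B|}$ (this stacking notation is an assumption introduced here, not taken from the notes). The per-example sums then collapse into matrix products:

$$\begin{aligned}
\hat{y} &= X w + b , \\
\frac{1}{|B|} \sum_{i \in B_{t}} x^{(i)} (w^{\top} x^{(i)} + b - y^{(i)}) &= \frac{1}{|B|} X^{\top} (X w + b - y) , \\
\frac{1}{|B|} \sum_{i \in B_{t}} (w^{\top} x^{(i)} + b - y^{(i)}) &= \frac{1}{|B|} \mathbf{1}^{\top} (X w + b - y) .
\end{aligned}$$

Here $b$ is added to every entry of $X w$ and $\mathbf{1}$ is the all-ones vector of length $|B|$. A single matrix-vector product replaces the explicit loop over examples, which is where the speedup comes from.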

Object-Oriented Design for Implementation (optional)

Figure: core architecture of the object-oriented design

Three classes:

(i) Module contains models, losses, and optimization methods; (ii) DataModule provides data loaders for training and validation; (iii) both classes are combined using the Trainer class.

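1. `Module` class: the core of the model (The Brain)

The Module class encapsulates the network architecture, the loss function, and the optimizer configuration. Core functions:

Parameter storage: holds the tensors that need updating, such as $w$ and $b$.
Forward pass (forward): defines how data flows through the model and computes $\hat{y}$.
Loss computation (loss): measures the gap between predictions and ground truth (for example, MSE).
Optimizer configuration (configure_optimizers): chooses the algorithm used to update the parameters (for example, SGD).

2. `DataModule` class: the data steward (The Pantry)

DataModule manages the data lifecycle: downloading, preprocessing, and batching. It keeps the model supplied with uniformly shaped minibatches during training. Core functions:

Data loading: reads the raw data into memory or points to where it is stored.
Iterator wrapping (get_dataloader): returns a DataLoader, which shuffles the data and splits it into batches of size batch_size.
Multi-dataset management: manages the training set and the validation set together.

3. `Trainer` class: the executor of training (The Cook)

Trainer brings the "brain" and the "ingredients" together. It contains a nested loop: epochs (outer) and batches (inner). Core functions:

Training loop (prepare_batch, fit): automatically handles backpropagation, gradient zeroing, and parameter updates.
Device management (advanced): automatically decides whether computation runs on the CPU or the GPU.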
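
The division of labor among the three classes can be sketched in plain Python. This is a deliberately simplified stand-in, not the d2l or PyTorch API: it assumes a scalar model $\hat{y} = w x + b$, derives the gradient of the squared loss by hand because there is no autograd here, and has `configure_optimizers` return a simple closure instead of an optimizer object. Only the class and method names come from the notes above.

```python
import random


class Module:
    """Holds the parameters, the forward pass, and the loss."""

    def __init__(self, lr):
        self.w, self.b, self.lr = 0.0, 0.0, lr

    def forward(self, x):
        return self.w * x + self.b

    def loss(self, y_hat, y):
        return 0.5 * (y_hat - y) ** 2

    def configure_optimizers(self):
        # Assumption: plain SGD expressed as a closure over the parameters
        def step(grad_w, grad_b):
            self.w -= self.lr * grad_w
            self.b -= self.lr * grad_b
        return step


class DataModule:
    """Owns the data and hands it out in minibatches."""

    def __init__(self, xs, ys, batch_size):
        self.xs, self.ys, self.batch_size = xs, ys, batch_size

    def get_dataloader(self, train=True):
        idx = list(range(len(self.xs)))
        if train:
            random.shuffle(idx)  # shuffle only the training data
        for i in range(0, len(idx), self.batch_size):
            chunk = idx[i:i + self.batch_size]
            yield [self.xs[j] for j in chunk], [self.ys[j] for j in chunk]


class Trainer:
    """Runs the nested loop: epochs outside, batches inside."""

    def __init__(self, max_epochs):
        self.max_epochs = max_epochs

    def fit(self, model, data):
        step = model.configure_optimizers()
        for _ in range(self.max_epochs):
            for xs, ys in data.get_dataloader(train=True):
                # Hand-derived gradient of Module.loss, averaged over the batch
                errs = [model.forward(x) - y for x, y in zip(xs, ys)]
                grad_w = sum(e * x for e, x in zip(errs, xs)) / len(xs)
                grad_b = sum(errs) / len(xs)
                step(grad_w, grad_b)


random.seed(0)
xs = [i / 10 for i in range(-20, 21)]
ys = [2.0 * x - 1.0 for x in xs]  # noiseless data from y = 2x - 1
model = Module(lr=0.1)
Trainer(max_epochs=50).fit(model, DataModule(xs, ys, batch_size=8))
print(model.w, model.b)  # should end up close to 2.0 and -1.0
```

The point of the split is that the Trainer never needs to know what the model computes or where the data comes from, so either can be swapped without touching the loop.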

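Summary

In Section 3.1:

Model: $\hat{y} = w^{\top} x + b$
Loss: $L = \frac{1}{2 n} \sum_{i = 1}^{n} (\hat{y}^{(i)} - y^{(i)})^{2}$
Update: $w \leftarrow w - \eta g$, where $g$ is the minibatch-averaged gradient

In Section 3.2:

The model formula goes into Module.forward.
The loss formula goes into Module.loss.
The update logic goes into Trainer.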

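Implementing Linear Regression

The implementation covers the data pipeline, the model, the loss function, and the minibatch stochastic gradient descent optimizer.

The dataset is a matrix $X$: each row is one example and each column is one feature (for example height, weight, or house price).
$w$: the weights, which express how much each feature contributes to the prediction.
torch.matmul(X, w): matrix product.
torch.normal(mean, std, size): returns a tensor of random numbers drawn from a normal distribution with the given mean and standard deviation.

Workflow:

Read the dataset: draw one minibatch of examples at a time to update the model.
Initialize the model parameters $w, b$.
Define the model, the loss function, and the optimization algorithm (the optimizer is shown below).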

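import torch

def sgd(params, lr, batch_size):  #@save
    """Minibatch stochastic gradient descent."""
    with torch.no_grad():  # do not record the update in the autograd graph
        for param in params:
            # Gradients were summed over the batch, so divide to get the average
            param -= lr * param.grad / batch_size
            param.grad.zero_()  # reset so the next backward() starts from zero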

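| Operation | Purpose | Underlying principle |
| --- | --- | --- |
| `no_grad()` | Graph isolation | Disables autograd recording, so the parameter update is not tracked and no memory is spent on it. |
| `lr * grad / n` | Normalized update | Performs a first-order update with the averaged gradient, so the step size does not scale with the batch size. |
| `zero_()` | State reset | Clears the stored gradient by hand, so the current batch's gradient does not accumulate into the next iteration. |

In formula form, with $n$ the batch size: $\theta_{t + 1} = \theta_{t} - \eta \cdot \frac{1}{n} \sum_{i = 1}^{n} \nabla l^{(i)} (\theta_{t})$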
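
The first workflow step, reading the dataset one minibatch at a time, might be sketched as follows. This is a pure-Python stand-in using lists rather than tensors; the generating parameters `[2.0, -3.4]` and `4.2`, the noise level, and the `seed` arguments are assumptions picked for illustration.

```python
import random


def synthetic_data(true_w, true_b, num_examples, noise_std=0.01, seed=0):
    """Generate y = w^T x + b + noise with standard-normal features."""
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(num_examples):
        x = [rng.gauss(0.0, 1.0) for _ in true_w]
        y = sum(wj * xj for wj, xj in zip(true_w, x)) + true_b
        features.append(x)
        labels.append(y + rng.gauss(0.0, noise_std))
    return features, labels


def data_iter(batch_size, features, labels, seed=None):
    """Yield shuffled minibatches; the last one may be smaller."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # visit examples in random order
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        yield [features[i] for i in batch], [labels[i] for i in batch]


features, labels = synthetic_data([2.0, -3.4], 4.2, num_examples=10)
batch_sizes = [len(X) for X, y in data_iter(4, features, labels)]
print(batch_sizes)  # prints [4, 4, 2]
```

Every example appears exactly once per pass, and with 10 examples and a batch size of 4 the final batch holds the remaining 2.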

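Training

In each iteration we read a minibatch of training examples and pass it through the model to obtain a set of predictions. After computing the loss, we run backpropagation, which stores the gradient of each parameter. Finally, we call the optimization algorithm sgd to update the model parameters.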

lr = 0.03
num_epochs = 3
net = linreg  # linreg(X, w, b) returns torch.matmul(X, w) + b
loss = squared_loss

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y)  # minibatch loss on X and y
        # l has shape (batch_size, 1) rather than being a scalar, so sum its
        # elements and compute the gradient with respect to [w, b] from that sum
        l.sum().backward()
        sgd([w, b], lr, batch_size)  # update the parameters using their gradients
    with torch.no_grad():
        train_l = loss(net(features, w, b), labels)
        print(f'epoch {epoch + 1}, loss {float(train_l.mean()):f}')
