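Softmax Regression · Loss Functions · Image Classification Dataset

📖 Corresponds to Dive into Deep Learning v2, Chapter 3 · Linear Neural Networks 🏷️ Tags: Deep Learning, Classification, Softmax, Loss Functions, Fashion-MNIST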

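1. From Regression to Classification

Regression vs. classification
Regression estimates a continuous value.
Classification predicts a discrete category.
Example datasets: MNIST (handwritten digit recognition), ImageNet (natural object classification).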

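Key differences

| Problem type | Output | Activation | Loss function |
| --- | --- | --- | --- |
| Linear regression | A single continuous value | None / identity | MSE (mean squared error) |
| Softmax regression | Discrete classes (one probability per class) | Softmax | Cross-entropy |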

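Fully connected layer

A fully connected layer is "fully" connected: every input feeds every output, so it can have many learnable parameters. Specifically, any fully connected layer with $d$ inputs and $q$ outputs has a parameter cost of $O(dq)$.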
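As a rough illustration of the $O(dq)$ cost, the sketch below counts the parameters of a fully connected layer in PyTorch. The sizes $d = 784$ and $q = 10$ are assumptions chosen to match the Fashion-MNIST setting; note that `nn.Linear` also stores a bias of length $q$ on top of the $d \times q$ weights.

```python
from torch import nn

d, q = 784, 10                      # assumed sizes: 28x28 inputs, 10 classes
layer = nn.Linear(d, q)             # weight has shape (q, d), bias has shape (q,)

num_weights = layer.weight.numel()  # d * q
num_biases = layer.bias.numel()     # q
num_params = sum(p.numel() for p in layer.parameters())

print(num_weights, num_biases, num_params)  # 7840 10 7850
```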

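One-Hot Encoding

For a $q$-class classification problem, the label $y$ is encoded as a vector of length $q$ whose only nonzero entry is a 1 at the position of the true class $i$:

$$y = [0, \dots, 0, \underbrace{1}_{i\text{-th entry}}, 0, \dots, 0]^{\top}$$

Example: a 3-class problem with the classes cat, dog, and bird:

$y_{\text{cat}} = [1, 0, 0]^{\top}, \quad y_{\text{dog}} = [0, 1, 0]^{\top}, \quad y_{\text{bird}} = [0, 0, 1]^{\top}$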
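The encoding above can be reproduced with PyTorch's `torch.nn.functional.one_hot`. This is a minimal sketch; the mapping of integer indices to cat, dog, and bird is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

# Assumed class order for illustration: 0 = cat, 1 = dog, 2 = bird
labels = torch.tensor([0, 1, 2])
one_hot = F.one_hot(labels, num_classes=3)

print(one_hot)
# tensor([[1, 0, 0],
#         [0, 1, 0],
#         [0, 0, 1]])

# argmax recovers the integer class index from each one-hot row
recovered = one_hot.argmax(dim=1)
print(recovered)  # tensor([0, 1, 2])
```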

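2. The Softmax Regression Model

2.1 Network architecture

Input layer        Output layer
x₁  ───────────► o₁ → ŷ₁
x₂  ──── W,b ──► o₂ → ŷ₂
x₃  ───────────► o₃ → ŷ₃
x₄  ───────────► (q output nodes)
(d features)

Input: $x \in \mathbb{R}^{d}$ ($d$ features)
Output: $o \in \mathbb{R}^{q}$ (raw scores for the $q$ classes, called logits)
Parameters: weight matrix $W \in \mathbb{R}^{d \times q}$, bias $b \in \mathbb{R}^{q}$

$$o = W^{\top} x + b$$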

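2.2 The softmax function

Problem: the linear outputs do not satisfy the requirements of a probability distribution.

The entries of the linear output $o$ can be negative and generally do not sum to 1, so they cannot be interpreted directly as probabilities.

The softmax transformation turns the logits into a valid probability distribution:

$$\hat{y}_{j} = \mathrm{softmax}(o)_{j} = \frac{\exp(o_{j})}{\sum_{k=1}^{q} \exp(o_{k})}$$

Checking the properties:

$$\sum_{j=1}^{q} \hat{y}_{j} = \frac{\sum_{j=1}^{q} \exp(o_{j})}{\sum_{k=1}^{q} \exp(o_{k})} = 1$$ ✓

$$\hat{y}_{j} > 0 \quad \forall j$$ ✓ (the exponential function is always positive)

Order preservation

Softmax does not change the relative ordering of the classes: $\operatorname{argmax}_{j} \hat{y}_{j} = \operatorname{argmax}_{j} o_{j}$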
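The three properties (sums to 1, strictly positive, same argmax as the logits) can be checked numerically. The following is a small standard-library sketch; the logits are arbitrary example values, and this plain version omits any numerical-stability handling.

```python
import math

def softmax(o):
    """Plain softmax over a list of logits (no stability tricks)."""
    exps = [math.exp(v) for v in o]
    total = sum(exps)
    return [v / total for v in exps]

o = [2.0, -1.0, 0.5]          # example logits, chosen arbitrarily
y_hat = softmax(o)

print([round(p, 3) for p in y_hat])
print(abs(sum(y_hat) - 1.0) < 1e-12)                 # True: probabilities sum to 1
print(all(p > 0 for p in y_hat))                     # True: every entry is positive
print(y_hat.index(max(y_hat)) == o.index(max(o)))    # True: argmax is preserved
```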

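2.3 Vectorized form

For a minibatch of $n$ examples $X \in \mathbb{R}^{n \times d}$:

$$O = XW + b, \quad O \in \mathbb{R}^{n \times q}$$

$$\hat{Y} = \mathrm{softmax}(O)$$

Here the bias $b$ is broadcast across the rows, and softmax is computed independently for each row (each example).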
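A short shape check of the minibatch formulas, sketched in PyTorch. The sizes `n`, `d`, `q` and the random tensors are assumptions for illustration only.

```python
import torch

n, d, q = 4, 5, 3                  # assumed minibatch size, features, classes
X = torch.randn(n, d)
W = torch.randn(d, q)
b = torch.zeros(q)

O = X @ W + b                      # b is broadcast across the n rows
Y_hat = torch.softmax(O, dim=1)    # softmax over the class dimension, row by row

print(O.shape, Y_hat.shape)        # torch.Size([4, 3]) torch.Size([4, 3])
print(Y_hat.sum(dim=1))            # each row sums to 1 (up to rounding)
```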

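3. Loss Functions

A loss function measures the gap between the model's prediction and the true value. The goal of training is to keep adjusting the model parameters so that the loss becomes as small as possible. Let:

the true value be $y$
the predicted value be $\hat{y}$

and denote the error by:

$$e = \hat{y} - y$$

Three common loss functions follow.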

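L2 Loss

L2 loss, also called squared loss, is defined as:

$$L(y, \hat{y}) = \frac{1}{2} (\hat{y} - y)^{2}$$

Equivalently:

$$L = \frac{1}{2} e^{2}$$

The factor $\frac{1}{2}$ is there only to simplify the derivative, since:

$$\frac{\partial L}{\partial \hat{y}} = \hat{y} - y = e$$

Properties:

Penalizes large errors more heavily, because the error is squared
Smooth and differentiable everywhere, which makes optimization convenient
Sensitive to outliers, because large errors are amplified further

Intuition:

If a prediction is far off, L2 loss assigns it a very large penalty, so the model works especially hard to correct examples with large errors.

When to use:

Commonly used when the errors are approximately Gaussian
Squared loss is the classic loss function for linear regression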

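L1 Loss

L1 loss, also called absolute loss, is defined as:

$$L(y, \hat{y}) = |\hat{y} - y|$$

That is:

$$L = |e|$$

For $e \neq 0$, its derivative with respect to the prediction is:

$$\frac{\partial L}{\partial \hat{y}} = \begin{cases} 1, & \hat{y} > y \\ -1, & \hat{y} < y \end{cases}$$

It is not differentiable at $\hat{y} = y$, but in practice a subgradient can be used there.

Properties:

The penalty grows linearly with the error instead of squaring large errors as L2 does
As a result it is less sensitive to outliers and more robust
It is not differentiable at 0, so optimization is less smooth than with L2

Intuition:

L1 loss does not let a single example with a very large error dominate the total loss, so it resists outliers better than L2.

When to use:

The data may contain outliers
You want the model to be less sensitive to outliers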

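Huber's Robust Loss

Huber loss, also called Huber's robust loss, combines the advantages of L1 and L2 loss. It is defined piecewise. Given a threshold $\delta > 0$:

$$L(y, \hat{y}) = \begin{cases} \frac{1}{2} (\hat{y} - y)^{2}, & |\hat{y} - y| \leq \delta \\ \delta |\hat{y} - y| - \frac{1}{2} \delta^{2}, & |\hat{y} - y| > \delta \end{cases}$$

In terms of the error:

$$L(e) = \begin{cases} \frac{1}{2} e^{2}, & |e| \leq \delta \\ \delta |e| - \frac{1}{2} \delta^{2}, & |e| > \delta \end{cases}$$

Its derivative with respect to the prediction is:

$$\frac{\partial L}{\partial \hat{y}} = \begin{cases} e, & |e| \leq \delta \\ \delta \, \mathrm{sign}(e), & |e| > \delta \end{cases}$$

where:

$$\mathrm{sign}(e) = \begin{cases} 1, & e > 0 \\ -1, & e < 0 \end{cases}$$

Properties:

For small errors it takes the L2 form, so the function is smooth and optimization is stable
For large errors it takes the L1 form, so it is not overly sensitive to outliers
This makes it a more robust loss function

Intuition:

The idea behind Huber loss is:

Small errors: adjust carefully and precisely, using squared loss
Large errors: keep anomalous examples from having too much influence, using absolute loss

For this reason it is often viewed as a compromise between L1 and L2.

When to use:

The data may contain a small number of outliers
You want to keep the smooth optimization behavior of L2 while gaining robustness to outliers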
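The piecewise definition and its derivative translate almost line by line into code. The sketch below uses plain Python; the function names `huber` and `huber_grad` and the default $\delta = 1$ are illustrative choices, not part of the original notes.

```python
def huber(e, delta=1.0):
    """Huber loss as a function of the error e = y_hat - y."""
    if abs(e) <= delta:
        return 0.5 * e * e                      # quadratic (L2-like) branch
    return delta * abs(e) - 0.5 * delta * delta  # linear (L1-like) branch

def huber_grad(e, delta=1.0):
    """Derivative of the Huber loss with respect to the prediction."""
    if abs(e) <= delta:
        return e
    return delta if e > 0 else -delta

for e in (0.5, 1.0, 3.0, -3.0):
    print(e, huber(e), huber_grad(e))
# 0.5 0.125 0.5
# 1.0 0.5 1.0
# 3.0 2.5 1.0
# -3.0 2.5 -1.0
```

At $|e| = \delta$ both branches give $\frac{1}{2}\delta^{2}$ and both derivatives give $\pm\delta$, which is why the loss is smooth at the switch point.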

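Comparing the three

L2 loss:

Handles small and large errors smoothly
Penalizes large errors heavily
Sensitive to outliers

L1 loss:

Penalizes errors linearly
More robust to outliers
Not differentiable at 0, so optimization is less smooth than with L2

Huber loss:

Behaves like L2 for small errors
Behaves like L1 for large errors
Balances smoothness and robustness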
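One way to see the comparison concretely is to let autograd compute the gradient of each loss at a small and a large error. This is a sketch under a few assumptions: the target is fixed at 0, $\delta = 1$, and `torch.nn.functional.huber_loss` is available (it was added in PyTorch 1.9). The helper name `grads` is hypothetical.

```python
import torch
import torch.nn.functional as F

def grads(error, delta=1.0):
    """Return d(loss)/d(y_hat) for L2, L1, and Huber at a given error."""
    y = torch.tensor(0.0)                              # target fixed at 0, so e = y_hat
    out = []
    for loss_fn in (
        lambda p: 0.5 * (p - y) ** 2,                  # L2 with the 1/2 factor
        lambda p: F.l1_loss(p, y),                     # L1
        lambda p: F.huber_loss(p, y, delta=delta),     # Huber
    ):
        y_hat = torch.tensor(error, requires_grad=True)
        loss_fn(y_hat).backward()
        out.append(y_hat.grad.item())
    return out

print(grads(0.5))    # [0.5, 1.0, 0.5]   small error: Huber follows L2
print(grads(10.0))   # [10.0, 1.0, 1.0]  large error: Huber follows L1
```

The L2 gradient grows with the error, so an outlier pulls hard on the parameters, while the L1 and Huber gradients stay bounded.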

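3.1 Cross-entropy loss

Maximum likelihood estimation (MLE) view:

Given the predicted probabilities $\hat{y}$ and the one-hot true label $y$, the predicted probability assigned to the true class is $\hat{y}_{y}$, where the subscript $y$ denotes the integer index of the true class.

Maximizing the log-likelihood is equivalent to minimizing the cross-entropy loss:

$$l(y, \hat{y}) = -\sum_{j=1}^{q} y_{j} \log \hat{y}_{j}$$

Because $y$ is a one-hot vector (only the component at the true class index is 1), this simplifies to:

$$l(y, \hat{y}) = -\log \hat{y}_{y} = -\log \frac{\exp(o_{y})}{\sum_{k=1}^{q} \exp(o_{k})}$$

Intuition:

Probability of the true class ↑  →  -log(p) ↓  →  loss ↓
Correct and confident prediction → loss close to 0
Wrong prediction → large loss (tends to +∞ as the true-class probability approaches 0)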
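A few concrete numbers make the intuition tangible. In this standard-library sketch the three probability vectors are made-up examples, and class 0 is assumed to be the true class in each case.

```python
import math

def cross_entropy(y_hat, true_class):
    """Cross-entropy for a single example: -log of the true-class probability."""
    return -math.log(y_hat[true_class])

confident_right = [0.90, 0.05, 0.05]   # true class 0 gets 0.90
unsure          = [0.40, 0.30, 0.30]   # true class 0 gets 0.40
confident_wrong = [0.01, 0.98, 0.01]   # true class 0 gets 0.01

for name, y_hat in [("confident, right", confident_right),
                    ("unsure", unsure),
                    ("confident, wrong", confident_wrong)]:
    print(f"{name}: {cross_entropy(y_hat, 0):.3f}")
# confident, right: 0.105
# unsure: 0.916
# confident, wrong: 4.605
```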

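3.2 Information-theoretic view

| Concept | Formula | Meaning |
| --- | --- | --- |
| Entropy $H(y)$ | $-\sum_{j} y_{j} \log y_{j}$ | Uncertainty of the true distribution (a lower bound on the cross-entropy) |
| Cross-entropy $H(y, \hat{y})$ | $-\sum_{j} y_{j} \log \hat{y}_{j}$ | Cost of encoding $y$ using $\hat{y}$ |
| KL divergence $D_{KL}(y \parallel \hat{y})$ | $H(y, \hat{y}) - H(y)$ | Difference between the two distributions (≥ 0) |

Relationship

$$H(y, \hat{y}) = H(y) + D_{KL}(y \parallel \hat{y}) \geq H(y)$$

Since $H(y)$ does not depend on the model: minimizing the cross-entropy ⟺ minimizing the KL divergence ⟺ making $\hat{y}$ approach $y$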
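The identity $H(y, \hat{y}) = H(y) + D_{KL}(y \parallel \hat{y})$ can be verified numerically. The sketch below uses the standard library; the two distributions are made-up examples, and the "true" distribution is deliberately not one-hot so that $H(y) > 0$.

```python
import math

def entropy(p):
    """H(p) = -sum p_j log p_j, using the convention 0 * log 0 = 0."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p_j log q_j (assumes q_j > 0 wherever p_j > 0)."""
    return -sum(pj * math.log(qj) for pj, qj in zip(p, q) if pj > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum p_j log(p_j / q_j)."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

y     = [0.5, 0.25, 0.25]   # example "true" distribution
y_hat = [0.7, 0.2, 0.1]     # example predicted distribution

h, ce, kl = entropy(y), cross_entropy(y, y_hat), kl_divergence(y, y_hat)
print(f"H(y) = {h:.4f}, H(y, y_hat) = {ce:.4f}, KL = {kl:.4f}")
print(abs(ce - (h + kl)) < 1e-12)   # True: H(y, y_hat) = H(y) + KL

one_hot = [1.0, 0.0, 0.0]
print(entropy(one_hot) == 0)        # True: a one-hot label has zero entropy
```

For one-hot labels the entropy term vanishes, so the cross-entropy loss equals the KL divergence exactly.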

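3.3 Gradient of softmax with cross-entropy

Substitute softmax into the cross-entropy and take the partial derivative with respect to the logit $o_{j}$:

$$\frac{\partial l}{\partial o_{j}} = \frac{\partial}{\partial o_{j}} \left[ -o_{y} + \log \sum_{k=1}^{q} \exp(o_{k}) \right]$$

The first term contributes $-1$ when $j$ is the true class and 0 otherwise, which is exactly $-y_{j}$; the second term contributes $\exp(o_{j}) / \sum_{k} \exp(o_{k}) = \hat{y}_{j}$. Therefore:

$$\frac{\partial l}{\partial o_{j}} = \hat{y}_{j} - y_{j}$$

The elegance of the gradient

Gradient = predicted probability − true probability

This has the same form as the linear regression gradient $(\hat{y} - y)$

Numerically well-behaved (each component lies in $[-1, 1]$) and easy to implement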
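The result $\partial l / \partial o_{j} = \hat{y}_{j} - y_{j}$ can be checked against automatic differentiation. This is a minimal PyTorch sketch; the number of classes, the random logits, and the true class index are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = 5
o = torch.randn(q, requires_grad=True)    # logits for one example
y = 2                                     # assumed true class index

# Cross-entropy written directly from the formula: -o_y + log sum_k exp(o_k)
loss = -o[y] + torch.logsumexp(o, dim=0)
loss.backward()

y_hat = torch.softmax(o.detach(), dim=0)
y_one_hot = F.one_hot(torch.tensor(y), num_classes=q).float()

print(o.grad)                 # gradient from autograd
print(y_hat - y_one_hot)      # closed-form gradient
print(torch.allclose(o.grad, y_hat - y_one_hot, atol=1e-6))   # True
```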

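Numerical stability trick:

$$\hat{y}_{j} = \frac{\exp(o_{j} - \max_{i} o_{i})}{\sum_{k} \exp(o_{k} - \max_{i} o_{i})}$$

Subtracting the maximum does not change the result, because the common factor $\exp(-\max_{i} o_{i})$ cancels between numerator and denominator, but it prevents exp from overflowing.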

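# PyTorch implementation (numerically stable)
import torch
import torch.nn.functional as F
logits = torch.randn(4, 10)              # illustrative logits: 4 examples, 10 classes
labels = torch.randint(0, 10, (4,))      # illustrative integer class labels
loss = F.cross_entropy(logits, labels)   # numerical stability is handled internally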
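To see why subtracting the maximum matters, the sketch below compares a naive softmax with the shifted version on deliberately large logits. The function names `naive_softmax` and `stable_softmax` are illustrative, and the behavior shown assumes the default float32 dtype.

```python
import torch

def naive_softmax(o):
    exp_o = torch.exp(o)                 # overflows to inf for large logits
    return exp_o / exp_o.sum()

def stable_softmax(o):
    shifted = o - o.max()                # largest entry becomes 0, so exp <= 1
    exp_o = torch.exp(shifted)
    return exp_o / exp_o.sum()

o = torch.tensor([1000.0, 1001.0, 1002.0])   # deliberately large example logits

print(naive_softmax(o))                  # tensor([nan, nan, nan])
print(stable_softmax(o))                 # finite probabilities that sum to 1
print(torch.allclose(stable_softmax(o), torch.softmax(o, dim=0)))  # True
```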

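4. The Fashion-MNIST Image Classification Dataset

4.1 Dataset overview

| Property | Value |
| --- | --- |
| Source | Zalando Research |
| Training set | 60,000 images |
| Test set | 10,000 images |
| Image size | 28 × 28 pixels (grayscale) |
| Number of classes | 10 clothing categories |
| Drop-in replacement for | Classic MNIST (handwritten digits); more challenging |

The 10 classes:

0: T-shirt/top   1: Trouser      2: Pullover
3: Dress         4: Coat         5: Sandal
6: Shirt         7: Sneaker      8: Bag
9: Ankle boot

Why Fashion-MNIST?

Classic MNIST (handwritten digits) is too easy: modern models routinely exceed 99% accuracy on it. Fashion-MNIST is moderately difficult, which makes it better suited to validating algorithm performance.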

4.2 Loading the dataset

import torchvision
from torchvision import transforms
from torch.utils import data

def get_dataloader_workers():
    """Use 4 worker processes to read the data."""
    return 4

def load_data_fashion_mnist(batch_size, resize=None):
    """Download the Fashion-MNIST dataset and load it into memory.

    Returns a (train, test) pair of DataLoader iterators.
    """
    # ToTensor converts a PIL image to a float tensor in [0, 1]
    trans = [transforms.ToTensor()]
    if resize:
        # Resize must run before ToTensor, so it goes at the front
        trans.insert(0, transforms.Resize(resize))
    trans = transforms.Compose(trans)

    mnist_train = torchvision.datasets.FashionMNIST(
        root="../data", train=True, transform=trans, download=True)
    mnist_test = torchvision.datasets.FashionMNIST(
        root="../data", train=False, transform=trans, download=True)

    return (
        data.DataLoader(mnist_train, batch_size, shuffle=True,
                        num_workers=get_dataloader_workers()),
        data.DataLoader(mnist_test, batch_size, shuffle=False,
                        num_workers=get_dataloader_workers())
    )

# Usage example
batch_size = 256
train_iter, test_iter = load_data_fashion_mnist(batch_size)

# Inspect the shape of one batch
for X, y in train_iter:
    print(f"X shape: {X.shape}")   # torch.Size([256, 1, 28, 28])
    print(f"y shape: {y.shape}")   # torch.Size([256])
    break

4.3 Visualizing the data

import matplotlib.pyplot as plt

def get_fashion_mnist_labels(labels):
    """Return the text labels for Fashion-MNIST class indices."""
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]

def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):
    """Plot a list of images in a grid."""
    figsize = (num_cols * scale, num_rows * scale)
    _, axes = plt.subplots(num_rows, num_cols, figsize=figsize)
    axes = axes.flatten()
    for i, (ax, img) in enumerate(zip(axes, imgs)):
        ax.imshow(img.numpy())
        ax.axes.get_xaxis().set_visible(False)
        ax.axes.get_yaxis().set_visible(False)
        if titles:
            ax.set_title(titles[i])
    return axes

# Visualize the first batch (reuses torchvision, transforms, and data from section 4.2).
# mnist_train is local to load_data_fashion_mnist, so the dataset is built again here.
mnist_train = torchvision.datasets.FashionMNIST(
    root="../data", train=True, transform=transforms.ToTensor(), download=True)
X, y = next(iter(data.DataLoader(mnist_train, batch_size=18)))
show_images(X.reshape(18, 28, 28), 2, 9,
            titles=get_fashion_mnist_labels(y))
plt.show()

5. Softmax Regression from Scratch

import torch
from torch import nn

# ── 1. Data preparation ──────────────────────────────────
batch_size = 256
train_iter, test_iter = load_data_fashion_mnist(batch_size)

# ── 2. Parameter initialization ──────────────────────────
num_inputs  = 784   # 28×28 image, flattened
num_outputs = 10    # 10 classes

W = torch.normal(0, 0.01, size=(num_inputs, num_outputs), requires_grad=True)
b = torch.zeros(num_outputs, requires_grad=True)

# ── 3. Softmax ───────────────────────────────────────────
def softmax(X):
    # Plain version without max-subtraction; very large logits could overflow
    X_exp = torch.exp(X)
    partition = X_exp.sum(1, keepdim=True)   # sum over each row
    return X_exp / partition                 # broadcasting divides row by row

# ── 4. Model ─────────────────────────────────────────────
def net(X):
    return softmax(torch.matmul(X.reshape((-1, W.shape[0])), W) + b)

# ── 5. Loss function ─────────────────────────────────────
def cross_entropy(y_hat, y):
    # y_hat[range(len(y)), y] picks each example's predicted probability for its true class
    return -torch.log(y_hat[range(len(y)), y])

# ── 6. Classification accuracy ───────────────────────────
def accuracy(y_hat, y):
    """Return the number of correct predictions in the batch."""
    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
        y_hat = y_hat.argmax(axis=1)
    cmp = y_hat.type(y.dtype) == y
    return float(cmp.type(y.dtype).sum())

# ── 7. Training ──────────────────────────────────────────
lr         = 0.1
num_epochs = 10
updater    = torch.optim.SGD([W, b], lr=lr)

for epoch in range(num_epochs):
    total_loss, total_acc, total_n = 0.0, 0.0, 0
    for X, y in train_iter:
        y_hat = net(X)
        loss  = cross_entropy(y_hat, y)
        updater.zero_grad()
        loss.mean().backward()
        updater.step()
        total_loss += float(loss.sum())
        total_acc  += accuracy(y_hat, y)
        total_n    += y.numel()
    print(f"Epoch {epoch+1}: loss={total_loss/total_n:.4f}, "
          f"train_acc={total_acc/total_n:.4f}")
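The training loop above reports training accuracy only. A possible way to measure accuracy on held-out data is sketched below; the helper name `evaluate_accuracy` is an assumption (it is not defined in these notes), and the demo runs on tiny synthetic scores with a stand-in model so that it works without downloading Fashion-MNIST.

```python
import torch

def evaluate_accuracy(net, data_iter):
    """Fraction of correctly classified examples over an iterable of (X, y) batches."""
    correct, total = 0, 0
    with torch.no_grad():                       # no gradients needed for evaluation
        for X, y in data_iter:
            y_hat = net(X)                      # shape: (batch, num_classes)
            correct += (y_hat.argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

# Self-contained demo on synthetic data (hypothetical toy model, not Fashion-MNIST)
toy_scores = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
toy_labels = torch.tensor([1, 0, 0, 0])
toy_iter = [(toy_scores[:2], toy_labels[:2]), (toy_scores[2:], toy_labels[2:])]
identity_net = lambda X: X                      # returns the given scores unchanged

print(evaluate_accuracy(identity_net, toy_iter))   # 0.75
```

With the definitions from this section in scope, calling `evaluate_accuracy(net, test_iter)` after training would report test accuracy in the same way.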

6. Concise Implementation of Softmax Regression

import torch
from torch import nn

# ── 1. Data ──────────────────────────────────────────────
batch_size = 256
train_iter, test_iter = load_data_fashion_mnist(batch_size)

# ── 2. Model definition ──────────────────────────────────
# Flatten: [batch, 1, 28, 28] → [batch, 784]
# Linear:  [batch, 784]       → [batch, 10]
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 10)
)

# Weight initialization
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=0.01)

net.apply(init_weights)

# ── 3. Loss function (includes a numerically stable softmax) ──
# CrossEntropyLoss = LogSoftmax + NLLLoss, with numerical stability built in,
# so the model outputs raw logits and has no softmax layer of its own
loss = nn.CrossEntropyLoss(reduction='none')

# ── 4. Optimizer ─────────────────────────────────────────
trainer = torch.optim.SGD(net.parameters(), lr=0.1)

# ── 5. Training ──────────────────────────────────────────
num_epochs = 10

for epoch in range(num_epochs):
    net.train()
    for X, y in train_iter:
        trainer.zero_grad()
        y_hat = net(X)
        l = loss(y_hat, y)
        l.mean().backward()
        trainer.step()

    # Evaluation on the test set
    net.eval()
    with torch.no_grad():
        correct = sum(
            (net(X).argmax(axis=1) == y).sum().item()
            for X, y in test_iter
        )
        total = len(test_iter.dataset)
    print(f"Epoch {epoch+1}: test_acc={correct/total:.4f}")
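The statement that `CrossEntropyLoss` combines `LogSoftmax` and `NLLLoss` can be checked directly. The sketch below uses synthetic logits and labels (assumptions for illustration) and compares the two routes element by element.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 10)               # 4 examples, 10 classes (synthetic)
labels = torch.tensor([3, 0, 9, 1])

# Route 1: cross-entropy applied to raw logits
ce = nn.CrossEntropyLoss(reduction='none')(logits, labels)

# Route 2: log-softmax followed by negative log-likelihood
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss(reduction='none')(log_probs, labels)

print(ce)
print(nll)
print(torch.allclose(ce, nll))            # True
```

This is also why the model in this section ends with a plain `nn.Linear` layer: applying softmax inside the model and then `CrossEntropyLoss` would apply the normalization twice.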

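7. Summary and Comparison

Quick reference

Softmax regression
├── Model: o = Wᵀx + b, ŷ = softmax(o)
├── softmax: ŷⱼ = exp(oⱼ) / Σₖ exp(oₖ)
├── Loss: cross-entropy l = -Σⱼ yⱼ log(ŷⱼ) = -log(ŷᵧ)
├── Gradient: ∂l/∂oⱼ = ŷⱼ - yⱼ  (simple and elegant)
└── Numerical stability: subtract the maximum logit before applying softmax

Softmax regression vs. linear regression

| Aspect | Linear regression | Softmax regression |
| --- | --- | --- |
| Problem type | Regression | Multiclass classification |
| Output activation | Identity | Softmax |
| Weight size | $d \times 1$ | $d \times q$ |
| Loss function | MSE | Cross-entropy |
| Gradient form | $\hat{y} - y$ | $\hat{y} - y$ |

Fashion-MNIST quick reference

| Item | Value |
| --- | --- |
| Training set | 60,000 |
| Test set | 10,000 |
| Size | 28×28 grayscale |
| Classes | 10 clothing categories |
| Loading | `torchvision.datasets.FashionMNIST` |

Key formulas

Softmax: $\hat{y}_{j} = \frac{e^{o_{j}}}{\sum_{k} e^{o_{k}}}$

Cross-entropy loss: $l = -\log \hat{y}_{y} = -o_{y} + \log \sum_{k} e^{o_{k}}$

Gradient: $\frac{\partial l}{\partial o_{j}} = \hat{y}_{j} - y_{j}$

Reference: Dive into Deep Learning v2, Aston Zhang et al. | d2l.ai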

Starry's Blog
