Question 1: Geometric Understanding of SVM
Question:
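Consider a binary classification dataset $\{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$, and a linear classifier $f(x) = \operatorname{sign}(w^\top x + b)$: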
- Define the geometric margin and explain its meaning.
- Show that after proper scaling of the parameters, maximizing the geometric margin is equivalent to minimizing $\|w\|$.
- Explain why only the support vectors affect the final classifier.
Answer:
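The geometric margin is the signed distance from a sample point to the separating hyperplane $w^\top x + b = 0$. For a sample $(x_i, y_i)$, the geometric margin is
$$\gamma_i = \frac{y_i (w^\top x_i + b)}{\|w\|}.$$
It measures how "safely" the classifier handles that sample: the larger the margin, the farther the sample lies from the decision boundary and the more stable its classification. A negative value means the sample is misclassified.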
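Multiplying $(w, b)$ by the same positive constant does not change the decision boundary, so the parameters can be rescaled to satisfy
$$\min_i \; y_i (w^\top x_i + b) = 1.$$
The minimum geometric margin is then
$$\gamma = \min_i \gamma_i = \frac{1}{\|w\|}.$$
Maximizing the geometric margin is therefore equivalent to minimizing $\|w\|$ subject to $y_i (w^\top x_i + b) \ge 1$ for all $i$, which is usually written as minimizing $\frac{1}{2}\|w\|^2$.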
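The rescaling argument can be checked numerically. The sketch below is illustrative only: the toy data, the starting parameters, and the helper names geometric_margins and rescale_to_canonical are assumptions, not part of the original answer. It rescales a separating $(w, b)$ so that the smallest functional margin is 1, then compares the smallest geometric margin with $1/\|w\|$.

```python
import numpy as np

def geometric_margins(w, b, X, y):
    """Signed distance y_i (w^T x_i + b) / ||w|| of every sample to the hyperplane."""
    return y * (X @ w + b) / np.linalg.norm(w)

def rescale_to_canonical(w, b, X, y):
    """Scale (w, b) by a positive constant so that min_i y_i (w^T x_i + b) = 1."""
    min_functional = np.min(y * (X @ w + b))
    if min_functional <= 0:
        raise ValueError("(w, b) must classify every sample correctly")
    return w / min_functional, b / min_functional

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([3.0, 3.0]), 0.0  # some separating hyperplane, not the optimal one

w_c, b_c = rescale_to_canonical(w, b, X, y)
min_margin_before = geometric_margins(w, b, X, y).min()
min_margin_after = geometric_margins(w_c, b_c, X, y).min()

# Rescaling leaves the geometric margin unchanged, and in canonical form
# the smallest geometric margin equals 1 / ||w||.
print(min_margin_before, min_margin_after, 1.0 / np.linalg.norm(w_c))
```

The three printed values coincide, which is why the search over hyperplanes can be restricted to the canonical scaling without losing any candidate.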
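The final classifier is determined only by the support vectors: the samples closest to the hyperplane, for which the constraint holds with equality, $y_i (w^\top x_i + b) = 1$. Non-support vectors lie farther from the boundary and their constraints are inactive at the optimum, so they have no direct effect on the optimal margin. Moving them slightly generally does not change the final hyperplane.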
Question 2: Generative vs Discriminative Models
Question:
Consider a binary classification problem with features $x = (x_1, \dots, x_d)$.
- Write the Naive Bayes assumption and its classification rule.
- Show that Gaussian NB with equal class variances leads to a linear decision boundary.
- Write the Logistic Regression model form and its decision boundary.
- Compare NB and Logistic Regression: What are the differences in learning objectives? Which is preferred when the training data is small? Large? Why?
Answer:
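Naive Bayes assumes that the features are mutually independent given the class $y$:
$$P(x_1, \dots, x_d \mid y) = \prod_{j=1}^{d} P(x_j \mid y).$$
The classification rule picks the class with the largest posterior probability:
$$\hat{y} = \arg\max_{c} \; P(y = c) \prod_{j=1}^{d} P(x_j \mid y = c).$$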
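With Gaussian NB, if each feature has the same variance under both classes, the quadratic terms in the log posterior ratio cancel, leaving only terms that are first-order in $x$. The decision boundary can therefore be written as
$$w^\top x + b = 0,$$
which is a linear boundary.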
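The cancellation can be written out explicitly. This is a sketch under stated assumptions: labels $y \in \{0, 1\}$, class-conditional means $\mu_{c,j}$, and a variance $\sigma_j^2$ shared by both classes (these symbols are introduced here for illustration). The normalizing constants $-\frac{1}{2}\log(2\pi\sigma_j^2)$ are identical for both classes and drop out of the ratio.

```latex
\begin{aligned}
\log\frac{P(y=1\mid x)}{P(y=0\mid x)}
&= \log\frac{P(y=1)}{P(y=0)}
 + \sum_{j=1}^{d}\left[-\frac{(x_j-\mu_{1,j})^2}{2\sigma_j^2}
 + \frac{(x_j-\mu_{0,j})^2}{2\sigma_j^2}\right] \\
&= \log\frac{P(y=1)}{P(y=0)}
 + \sum_{j=1}^{d}\frac{\mu_{1,j}-\mu_{0,j}}{\sigma_j^2}\,x_j
 - \sum_{j=1}^{d}\frac{\mu_{1,j}^2-\mu_{0,j}^2}{2\sigma_j^2} \\
&= w^\top x + b,
\qquad w_j = \frac{\mu_{1,j}-\mu_{0,j}}{\sigma_j^2},
\quad b = \log\frac{P(y=1)}{P(y=0)}
 - \sum_{j=1}^{d}\frac{\mu_{1,j}^2-\mu_{0,j}^2}{2\sigma_j^2}.
\end{aligned}
```

The $x_j^2$ terms appear with opposite signs and the same denominator, so they cancel. If the variances differed between classes, they would not cancel and the boundary would be quadratic.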
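Logistic Regression models the conditional probability directly:
$$P(y = 1 \mid x) = \sigma(w^\top x + b),$$
where $\sigma(z) = \frac{1}{1 + e^{-z}}$. Its decision boundary is usually taken to be
$$w^\top x + b = 0,$$
that is, the set of points where $P(y = 1 \mid x) = 0.5$.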
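The link between the model form and the boundary can be illustrated with a few lines of NumPy. This is a minimal sketch: the parameter values and the three sample points are made up, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """P(y = 1 | x) for every row of X."""
    return sigmoid(X @ w + b)

def predict(X, w, b):
    """Predict 1 where w^T x + b >= 0, i.e. where P(y = 1 | x) >= 0.5."""
    return (X @ w + b >= 0).astype(int)

w, b = np.array([1.0, -2.0]), 0.5  # illustrative parameters, not fitted
X = np.array([
    [1.5, 1.0],  # w^T x + b = 0: exactly on the boundary
    [3.0, 0.0],  # w^T x + b > 0: positive side
    [0.0, 2.0],  # w^T x + b < 0: negative side
])

probs = predict_proba(X, w, b)
labels = predict(X, w, b)
```

Thresholding the probability at 0.5 and thresholding the linear score at 0 give the same labels, because the sigmoid is monotone and $\sigma(0) = 0.5$.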
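NB is a generative model: it learns $P(y)$ and $P(x \mid y)$, which together describe the joint distribution $P(x, y)$. Logistic Regression is a discriminative model: it learns $P(y \mid x)$ directly by maximizing the conditional likelihood. With little training data, NB is often the better choice because its independence assumption makes the parameters easier to estimate. With plenty of data, Logistic Regression usually does better because it does not rely on a strong independence assumption and can fit the decision boundary more flexibly.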
Question 3: Ensemble Methods
Question:
- Explain the core idea of Bagging. Does it primarily reduce bias or variance? Why?
- Describe the basic workflow of Boosting, such as AdaBoost.
- Explain why Boosting reduces bias, and why Bagging is suitable for high-variance models like decision trees.
Answer:
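The core idea of Bagging is to draw multiple bootstrap samples (sampling with replacement) from the original training set, train one base model on each, and combine their predictions by voting or averaging. It primarily reduces variance: the errors of the individual models partially cancel, so the aggregated prediction is more stable.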
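A minimal sketch of the bootstrap-and-average procedure, using NumPy only. The base model here is a degree-6 polynomial fit, chosen purely as a convenient high-variance learner. It stands in for the decision trees discussed in this question, and the function name and default values are assumptions.

```python
import numpy as np

def bagged_polyfit_predict(x_train, y_train, x_test,
                           degree=6, n_models=50, seed=0):
    """Average the predictions of polynomial fits trained on bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(x_train)
    preds = np.empty((n_models, len(x_test)))
    for m in range(n_models):
        # Bootstrap sample: n indices drawn with replacement.
        idx = rng.integers(0, n, size=n)
        coeffs = np.polyfit(x_train[idx], y_train[idx], degree)
        preds[m] = np.polyval(coeffs, x_test)
    # Averaging is the aggregation step for regression;
    # a classifier would use a majority vote instead.
    return preds.mean(axis=0)

# Illustrative noisy data (made up).
rng = np.random.default_rng(1)
x_train = np.linspace(-1.0, 1.0, 40)
y_train = np.sin(3.0 * x_train) + rng.normal(0.0, 0.3, size=x_train.shape)
x_test = np.linspace(-0.9, 0.9, 5)

y_bagged = bagged_polyfit_predict(x_train, y_train, x_test)
```

Each individual fit changes noticeably from one bootstrap sample to the next, while their average changes much less. That is the variance reduction described above.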
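Boosting trains a sequence of weak classifiers one after another. Each round increases the weights of the samples that the previous round misclassified, so later models focus on the hard samples. The weak classifiers are finally combined with weights to form a strong classifier.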
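The workflow can be sketched as a small AdaBoost with decision stumps. This is a minimal illustration under stated assumptions: labels are in $\{-1, +1\}$, the weak learner is an exhaustive search over one-feature threshold stumps, and all function names are hypothetical.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    """Predict +1 where pol * (x_j - thr) >= 0, else -1."""
    return np.where(pol * (X[:, j] - thr) >= 0, 1, -1)

def fit_stump(X, y, w):
    """Return (feature, threshold, polarity, weighted error) of the best stump."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = w[stump_predict(X, j, thr, pol) != y].sum()
                if err < best[3]:
                    best = (j, thr, pol, err)
    return best

def adaboost_fit(X, y, n_rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)  # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        j, thr, pol, err = fit_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)   # avoid log(0) for a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak classifier
        pred = stump_predict(X, j, thr, pol)
        # Misclassified samples (y * pred = -1) are up-weighted,
        # correctly classified ones are down-weighted.
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def adaboost_predict(X, ensemble):
    """Sign of the alpha-weighted sum of the stump predictions."""
    score = sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy 1-D data that no single stump can separate (made up).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, -1, -1, 1, 1])
pred_1 = adaboost_predict(X, adaboost_fit(X, y, n_rounds=1))
pred_3 = adaboost_predict(X, adaboost_fit(X, y, n_rounds=3))
```

On this toy set a single stump misclassifies two of the six points, while three rounds of reweighting combine three stumps that together fit all six labels.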
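Boosting reduces bias because each round corrects the errors of the previous ones, so the expressive power of the combined model keeps growing. Bagging suits high-variance models such as decision trees because a single tree is very sensitive to changes in the training data, and averaging many trees substantially reduces that instability.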
Coding Problem: Gaussian Naive Bayes
Question:
Implement a Gaussian Naive Bayes classifier from scratch using NumPy.
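- In fit, estimate the class prior log probability $\log P(y = c)$, the per-feature mean $\mu_{c,j}$, and the unbiased variance $\sigma_{c,j}^2$ for each class.
- In predict_log_proba, compute the log unnormalized posterior:
$$\log P(y = c) + \sum_{j=1}^{d} \log \mathcal{N}(x_j \mid \mu_{c,j}, \sigma_{c,j}^2)$$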
Return the class with the highest score and compare the result with sklearn.naive_bayes.GaussianNB.
Answer:
Code implementation:
Accuracy: 88.75%
Comparison with scikit-learn:
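A minimal sketch of what such an implementation and comparison could look like. It is not the code behind the accuracy figure above, and no dataset is assumed. The class name GaussianNaiveBayes, the helper agreement_with_sklearn, and the 1e-9 variance floor are assumptions made for illustration.

```python
import numpy as np

class GaussianNaiveBayes:
    """Gaussian Naive Bayes with per-class, per-feature mean and unbiased variance."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # log P(y = c), estimated from class frequencies.
        self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        # mean_[c, j] and var_[c, j]; ddof=1 gives the unbiased variance.
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([X[y == c].var(axis=0, ddof=1) for c in self.classes_])
        self.var_ = self.var_ + 1e-9  # small floor (an assumption) against zero variance
        return self

    def predict_log_proba(self, X):
        """Unnormalized log posterior, shape (n_samples, n_classes)."""
        X = np.asarray(X, dtype=float)
        # diff has shape (n_samples, n_classes, n_features).
        diff = X[:, None, :] - self.mean_[None, :, :]
        log_lik = -0.5 * (np.log(2 * np.pi * self.var_)[None] + diff**2 / self.var_[None])
        return self.log_prior_[None, :] + log_lik.sum(axis=2)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_log_proba(X), axis=1)]


def agreement_with_sklearn(X_train, y_train, X_test):
    """Fraction of test points on which this sketch and sklearn's GaussianNB agree."""
    from sklearn.naive_bayes import GaussianNB  # lazy import; requires scikit-learn
    ours = GaussianNaiveBayes().fit(X_train, y_train).predict(X_test)
    theirs = GaussianNB().fit(X_train, y_train).predict(X_test)
    return np.mean(ours == theirs)
```

One caveat when comparing: scikit-learn's GaussianNB estimates variances with the biased (maximum-likelihood) formula and adds its own smoothing term, so the two models can disagree on a few points near the boundary even when both are implemented correctly. A class with a single training sample would also need special handling here, since the unbiased variance is undefined in that case.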