
Course Information

Name: Foundations of Machine Learning

Lecture Notes


Lesson 2


Linear Regression with RANSAC


Root mean squared error


Algebraic Linear Regression

Uses matrix algebra to solve for the weights $\mathbf{w}$ and bias $b$ directly, minimizing the least squares error: $\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$, where a column of ones is appended to $X$ so that $b$ is included in $\mathbf{w}$.
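A minimal NumPy sketch of this closed-form solution, assuming a small synthetic 1-D dataset; the bias is handled by appending a column of ones to the design matrix.

```python
import numpy as np

# Synthetic 1-D data: y = 2x + 1 plus noise (illustrative data, not from the notes)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 1 + rng.normal(0, 0.5, size=50)

# Design matrix with a bias column of ones
X = np.column_stack([x, np.ones_like(x)])

# Normal equation: w = (X^T X)^{-1} X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print("slope, intercept:", w)  # close to (2, 1)
```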


Logistic Regression

The sigmoid function $\sigma(z) = \dfrac{1}{1 + e^{-z}}$ maps the linear output $z = \mathbf{w}^\top \mathbf{x} + b$ to a probability between 0 and 1 for classification.
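A tiny sketch of the sigmoid applied to a linear score, just to show the mapping into (0, 1); the weights and input here are arbitrary placeholders.

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([1.5, -0.8]), 0.2   # placeholder weights and bias
x = np.array([0.4, 1.1])            # one example input
p = sigmoid(w @ x + b)              # predicted probability of the positive class
print(p)
```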


Support Vector Machine

Finds a hyperplane maximizing the margin between classes, controlled by the hyperparameter $C$ (the trade-off between margin size and classification error).

  • Large C: narrow margin, fewer misclassifications.

  • Small C: wider margin, more slack (misclassifications allowed).
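A hedged scikit-learn sketch showing how the C hyperparameter is passed to a linear SVM; the toy dataset and the two C values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy binary classification data (illustrative)
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)

# Small C: wider margin, more slack; large C: narrow margin, fewer training errors
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: training accuracy = {clf.score(X, y):.3f}")
```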


K-Medoids Clustering

Clusters data by selecting actual data points as medoids, iteratively assigning points to the nearest medoid and updating medoids to minimize total distance.
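A minimal NumPy sketch of the alternating loop described above (assign each point to its nearest medoid, then pick the in-cluster point that minimizes total distance). It assumes Euclidean distance and random initial medoids, and is not tied to any particular library implementation.

```python
import numpy as np

def k_medoids(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)   # medoids are actual data points
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest medoid
        dists = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: within each cluster, pick the point minimizing total distance to the others
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            intra = np.linalg.norm(X[members][:, None, :] - X[members][None, :, :], axis=2)
            medoid_idx[j] = members[intra.sum(axis=1).argmin()]
    return medoid_idx, labels

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 5])  # two toy clusters
medoids, labels = k_medoids(X, k=2)
print("medoid points:\n", X[medoids])
```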


Lesson 5


Dimensionality reduction

Dimensionality reduction lowers the dimensionality of the data and hence its complexity, so the data becomes easier to analyse. When the data is reduced to just a few abstract features, model training becomes faster and the model becomes smaller; and when the dimensionality is reduced to 2 or 3, the data can also be visualised more intuitively.


Principal Component Analysis, PCA

Principal component analysis is a classic linear dimensionality-reduction method. Its goal is to find an optimal orthogonal linear projection that maps a set of $d$-dimensional data points down to $k$ dimensions ($k < d$), such that the projected data retains as much variance as possible in the low-dimensional subspace.

This subspace has $k$ dimensions and is spanned by $k$ orthogonal eigenvectors $\mathbf{u}_1, \dots, \mathbf{u}_k$, called the principal components, which satisfy the following conditions:

  1. Orthogonality: $\mathbf{u}_i^\top \mathbf{u}_j = 0$ for $i \neq j$, ensuring the vectors carry no redundant information; each vector represents a different direction.

  2. Unit length: $\|\mathbf{u}_i\| = 1$, ensuring the projected data is not scaled up or down by the length of the basis vectors.

Variance describes how spread out the data is. If the data varies a lot along some direction, different data points are clearly distinguishable along it, which means that direction carries more, richer information.

PCA has three steps (see the sketch after this list):

  1. Compute the covariance matrix of the data.

  2. Compute the eigenvalues and eigenvectors of the covariance matrix; the eigenvector with the largest eigenvalue is the first principal component, and the eigenvector with the second-largest eigenvalue is the second principal component.

  3. Project the data onto these principal components to obtain the reduced-dimensional data.
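A sketch of the three steps in NumPy on toy data; np.linalg.eigh is used because the covariance matrix is symmetric, and the number of components k is an arbitrary choice here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # toy data: 100 samples, 5 features
Xc = X - X.mean(axis=0)                # centre the data first

# Step 1: covariance matrix of the data
C = np.cov(Xc, rowvar=False)

# Step 2: eigenvalues/eigenvectors; sort descending so the first column is PC1
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 3: project onto the top-k principal components
k = 2
Z = Xc @ eigvecs[:, :k]
print("reduced data shape:", Z.shape)  # (100, 2)
```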


Multidimensional Scaling, MDS


Lesson 6

Kernel Methods

Kernel methods are techniques that let linear models (like linear regression and SVMs) handle non-linear data.

The idea is to map the data into a higher-dimensional space where a linear model can be applied. However, the mapping is never performed explicitly; instead, kernel functions compute the inner product in the higher-dimensional space directly.


Linear Regression

The linear regression model tries to fit a linear function (a straight line) to the data that minimizes the sum of squared errors between the predicted values and the true values.

The equation of the line is $y = wx + b$, and for each data point $(x_i, y_i)$, the model predicts $\hat{y}_i = w x_i + b$.

The error for each data point is $e_i = y_i - \hat{y}_i$, and the sum of squared errors is $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.

The reason for using the squared error is that it avoids the cancellation of positive and negative errors and also penalizes large errors more than small ones.


Ridge Regression

When trying to minimize the loss, we might run into a degenerate case: the solution is not unique because there are too few data points or too many features.

Ridge regression solves this by adding a regularization term to the loss function, the sum of squared weights $\lambda \|\mathbf{w}\|^2$, which encourages the weights to stay small while minimizing the loss, to prevent overfitting.

Because of this term, we always obtain a unique solution even if the data is not sufficient.


Primal Solution

Primal and dual solutions are two different ways to solve the same problem. The primal solution focuses on finding the weights $\mathbf{w}$ directly.

Differentiating the loss function with respect to $\mathbf{w}$ and setting it to zero gives the optimal weights $\mathbf{w} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$.

  • $X \in \mathbb{R}^{n \times d}$ is the data matrix.

  • $n$ is the number of samples, $d$ is the number of features.

  • $X^\top X$ is a $d \times d$ matrix.

  • The inverse is of a $d \times d$ matrix, with computational time complexity $O(d^3)$.

  • The solution $\mathbf{w}$ is a $d$-dimensional vector, which lives in the feature space.
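A NumPy sketch of the primal solution $\mathbf{w} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$; the toy data and $\lambda$ are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=n)

lam = 0.1
# Primal solution: invert a d x d matrix -> O(d^3)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print("primal weights:", w_primal)
```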


Dual Solution

If the dimensionality of the data is very large (many features), computing the inverse of the $d \times d$ matrix in the primal solution becomes very slow. The dual solution addresses this problem.

The dual solution focuses on finding the Lagrange multipliers $\boldsymbol{\alpha}$, one weight per data point, with $\mathbf{w} = X^\top \boldsymbol{\alpha}$.

Substituting this into the original ridge regression loss function, we get the optimal $\boldsymbol{\alpha} = (XX^\top + \lambda I)^{-1} \mathbf{y}$.

  • $G = XX^\top$, the Gram matrix, is an $n \times n$ matrix.

  • The inverse is of an $n \times n$ matrix, with computational time complexity $O(n^3)$.

  • The solution $\boldsymbol{\alpha}$ is an $n$-dimensional vector, which lives in the sample space.

| Formulation | Primal Solution | Dual Solution |
| --- | --- | --- |
| Solution | $\mathbf{w} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$ | $\boldsymbol{\alpha} = (XX^\top + \lambda I)^{-1} \mathbf{y}$ |
| Space | Feature space | Sample space |
| Computational Complexity | $O(d^3)$ | $O(n^3)$ |
| Kernel Methods | Not practical in high-dimensional data | Yes |
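A companion sketch of the dual solution $\boldsymbol{\alpha} = (XX^\top + \lambda I)^{-1} \mathbf{y}$ on the same kind of toy data; it also recovers $\mathbf{w} = X^\top \boldsymbol{\alpha}$ to show that the two formulations agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=n)
lam = 0.1

# Dual solution: invert an n x n Gram matrix -> O(n^3)
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(n), y)

# Recover the primal weights from the dual variables
w_from_dual = X.T @ alpha
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.allclose(w_from_dual, w_primal))  # True
```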


Polynomial Regression

In the previous linear regression, we assumed the relationship between the input and output is linear. However, in some cases the relationship is not linear but polynomial.

Polynomial regression first maps the input to a higher-dimensional space using $\phi(x) = (1, x, x^2, \dots, x^p)$, then applies linear regression in this space.

If we apply this mapping to the primal and dual solutions, with $\Phi$ the matrix of mapped inputs, they become: $\mathbf{w} = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top \mathbf{y}$ and $\boldsymbol{\alpha} = (\Phi\Phi^\top + \lambda I)^{-1} \mathbf{y}$.
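A sketch of polynomial regression via an explicit feature map $\phi(x) = (1, x, x^2, x^3)$, assuming a 1-D input and a ridge-style least-squares fit; the degree and $\lambda$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=80)
y = 0.5 * x**3 - x + rng.normal(0, 0.5, size=80)   # a cubic relationship

# Explicit feature map phi(x) = (1, x, x^2, x^3)
degree = 3
Phi = np.vander(x, N=degree + 1, increasing=True)

lam = 1e-3
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ y)
print("coefficients (1, x, x^2, x^3):", w)         # roughly (0, -1, 0, 0.5)
```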


Kernel Functions

Kernel function: $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top \phi(\mathbf{x}')$

It is defined as the inner product of the mapped data in the transformed space.

Kernel Trick

It is very expensive to compute $\phi(\mathbf{x})$ when the transformed space is very high-dimensional. Instead of computing $\phi(\mathbf{x})$ directly, we can evaluate the kernel function to get the inner product of the mapped data in the transformed space without actually mapping the data.
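A small numerical check of the kernel trick for the degree-2 polynomial kernel: the kernel value $(\mathbf{x}^\top \mathbf{x}' + 1)^2$ equals the inner product of an explicitly mapped pair, so the mapping never has to be built. The feature map below is one standard choice for this kernel in 2-D, written out only for the check.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel (x.x' + 1)^2 in 2-D
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    # Kernel value computed directly in the input space
    return (x @ z + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([-0.5, 3.0])
print(phi(x) @ phi(z))     # inner product after explicit mapping
print(poly_kernel(x, z))   # same number, without ever forming phi
```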

Polynomial Kernel

$k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + c)^d$

Gaussian Kernel

$k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\dfrac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)$

which maps the data to an infinite-dimensional space.

Periodic Kernel

$k(x, x') = \exp\!\left(-\dfrac{2\sin^2(\pi |x - x'| / p)}{\ell^2}\right)$
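NumPy sketches of the three kernels above; the hyperparameters (c, degree, $\sigma$, period p, length scale $\ell$) are placeholders.

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, degree=3):
    # k(x, z) = (x.z + c)^degree
    return (x @ z + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2)); infinite-dimensional feature space
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

def periodic_kernel(x, z, p=1.0, length=1.0):
    # k(x, z) = exp(-2 sin^2(pi |x - z| / p) / length^2), for scalar inputs
    return np.exp(-2 * np.sin(np.pi * np.abs(x - z) / p) ** 2 / length**2)

x, z = np.array([1.0, 0.5]), np.array([0.2, -0.3])
print(polynomial_kernel(x, z), gaussian_kernel(x, z), periodic_kernel(1.0, 2.5))
```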

Lesson 7, Probability

Probabilistic Model

It is a probability distribution defined by a model family and its parameters; it assigns probabilities to different outcomes.

  • Model family: specifies the functional form of the distribution.

  • Parameters: control the distribution’s specific characteristics.


Likelihood

It is a function (not a distribution) of $\theta$ for fixed data $D$; it describes how probable the data is under a specific $\theta$, according to the model: $L(\theta) = p(D \mid \theta)$.

It is not a PDF over $\theta$ and it does not integrate to 1 with respect to $\theta$, because it is not a probability distribution over $\theta$.


Log-Likelihood

It is used for numerical stability and computational efficiency, because the product over data points is replaced with a sum in log space: $\log p(D \mid \theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$.


Maximum Likelihood Principle

The best parameters $\theta$ for a model are those that maximize the likelihood $p(D \mid \theta)$ of observing the given data $D$.


Maximum Likelihood Estimation

The process of applying the maximum likelihood principle to find the specific parameter values that maximize the likelihood for a given dataset and model.

To find the maximum-likelihood parameter estimate $\hat{\theta}_{\mathrm{MLE}}$, solve directly for the maximum of the likelihood function by setting its derivative to zero:

$\dfrac{\partial}{\partial \theta} \log p(D \mid \theta) = 0$

Solving this yields $\hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta p(D \mid \theta)$; for example, for $n$ coin flips with $k$ heads, this gives $\hat{\theta} = k/n$.

Limitation: It is optimal with infinite data but may overfit with small samples; this can be mitigated by shrinking the parameter space or using prior knowledge.
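A small numerical illustration of MLE, assuming the model family is a 1-D Gaussian: setting the derivative of the log-likelihood to zero gives the sample mean and the (biased) sample variance in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=1000)   # samples from N(3, 2^2)

# For a Gaussian model, the zero-derivative condition gives closed-form estimates:
# the sample mean and the (biased) sample variance.
mu_mle = data.mean()
var_mle = ((data - mu_mle) ** 2).mean()
print(mu_mle, np.sqrt(var_mle))   # close to 3.0 and 2.0
```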


Bayes’ Law

  • Prior: $p(\theta)$, the initial belief about the parameter before seeing the data.

  • Likelihood: $p(D \mid \theta)$, the probability of observing the data given a specific value of $\theta$.

  • Evidence: $p(D) = \int p(D \mid \theta)\,p(\theta)\,d\theta$, the total probability of the data across all possible $\theta$ values.

  • Posterior: $p(\theta \mid D) = \dfrac{p(D \mid \theta)\,p(\theta)}{p(D)}$, the updated probability distribution of $\theta$ after observing the data $D$. It combines our prior belief with the evidence from the data.

General form of Bayes' Law

$P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$

  • $P(A \mid B)$: The probability of event A given that event B has occurred (called the posterior).

  • $P(B \mid A)$: The probability of B given A (called the likelihood).

  • $P(A)$: The probability of A before considering B (called the prior).

  • $P(B)$: The probability of B, acting as a normalizing constant (called the evidence).

Example of Bayes' Law

  • A: You have the disease.

  • B: You test positive for the disease.

  • $P(A \mid B)$: The probability that you actually have the disease given that you tested positive. This is the posterior.

  • $P(B \mid A)$: The probability of testing positive given that you have the disease. This is the likelihood.

  • $P(A)$: The probability that you have the disease before knowing anything about the test result. This is the prior.

  • $P(B)$: The probability of testing positive, which acts as the normalizing constant.
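A worked numeric version of the disease-test example. The prevalence, sensitivity, and false-positive rate below are made-up illustrative numbers, not values from the notes.

```python
# Hypothetical numbers for illustration only
p_disease = 0.01              # P(A): prior probability of having the disease
p_pos_given_disease = 0.99    # P(B|A): sensitivity of the test
p_pos_given_healthy = 0.05    # P(B|not A): false-positive rate

# Evidence P(B): total probability of testing positive
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior P(A|B) via Bayes' law
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # about 0.167: still fairly unlikely despite the positive test
```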


Maximum A Posteriori Principle

We should pick the $\theta$ that maximizes the posterior $p(\theta \mid D)$, to best balance what the data tells us (via the likelihood) with what we already believe (via the prior).


Maximum A Posteriori Estimation

MAP estimation is the practical process of applying the MAP principle to compute the specific parameter estimate that maximizes the posterior for a given dataset and prior.

Write the posterior up to a constant as $p(\theta \mid D) \propto p(D \mid \theta)\,p(\theta)$, then take the derivative of its logarithm and set it to zero.


MLE v.s. MAP

MLE maximizes only the likelihood $p(D \mid \theta)$, ignoring prior knowledge. MAP adds the prior $p(\theta)$, constraining the estimate to avoid overfitting (e.g., a coin observed to land heads 3 times out of 3 doesn’t force $\hat{\theta} = 1$); see the sketch below.

The parameter estimates from MAP lie somewhere between where the prior thought they would be and the estimates we would have obtained from MLE alone.
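A sketch contrasting MLE and MAP for the coin example, assuming a Beta(2, 2) prior (a mild belief that the coin is fair, an assumption not stated in the notes): with 3 heads out of 3 flips, MLE gives 1.0 while MAP pulls the estimate back toward 0.5.

```python
# Coin with 3 heads out of 3 flips
k, n = 3, 3

# MLE: maximize the likelihood only
theta_mle = k / n                           # = 1.0

# MAP with a Beta(a, b) prior on theta (assumed prior, mildly favouring a fair coin)
a, b = 2, 2
theta_map = (k + a - 1) / (n + a + b - 2)   # posterior mode of Beta(k+a, n-k+b)
print(theta_mle, theta_map)                 # 1.0 vs 0.8
```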


Mixture Models

They combine multiple simple distributions into a more flexible, multi-modal distribution. The density is a weighted sum:

$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, p_k(\mathbf{x})$

where $\pi_k$ are weights satisfying $\pi_k \ge 0$ and $\sum_{k=1}^{K} \pi_k = 1$.

Each data point is assumed to come from one of the component distributions, but the identity of the component is unobserved (latent variable).
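A tiny sketch evaluating the weighted-sum density of a two-component 1-D Gaussian mixture; the weights and component parameters are arbitrary.

```python
import numpy as np
from scipy.stats import norm

weights = [0.3, 0.7]                   # mixture weights, sum to 1
components = [norm(loc=-2, scale=0.5), norm(loc=1, scale=1.0)]

def mixture_pdf(x):
    # p(x) = sum_k pi_k * p_k(x)
    return sum(w * c.pdf(x) for w, c in zip(weights, components))

print(mixture_pdf(0.0), mixture_pdf(-2.0))
```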


Latent Variables

The component assignment of each data point is a latent variable, meaning it is not directly observed.


Gaussian Mixture Models (GMMs):

A specific case of mixture models where components are Gaussian distributions. Each component has its own mean vector and covariance matrix.

Maximum likelihood estimation (MLE) for GMMs lacks a closed-form solution. The log-likelihood involves a sum inside a logarithm, making direct optimization difficult.


Expectation-Maximisation Algorithm

In GMMs, there are observed data $X$, unknown parameters $\theta$, and hidden latent variables $Z$. Neither can be solved for directly without knowing the other.

EM uses alternating optimization:

  • If $\theta$ were known, $Z$ could be inferred using Bayes’ law.

  • If $Z$ were known, $\theta$ could be estimated via standard MLE.

EM iterates between guessing one and refining the other until convergence.


Jensen’s Inequality

For a concave function like $\log$,

$\log\!\big(\mathbb{E}[X]\big) \ge \mathbb{E}[\log X]$

This helps transform the tricky log-likelihood

$\log p(X \mid \theta) = \log \sum_{Z} p(X, Z \mid \theta)$

into a manageable lower bound.


Evidence Lower Bound (ELBO)

By introducing an arbitrary distribution $q(Z)$, the log-likelihood is bounded as:

$\log p(X \mid \theta) \ge \sum_{Z} q(Z) \log \dfrac{p(X, Z \mid \theta)}{q(Z)}$

This ELBO is easier to optimize than the original likelihood.


EM Algorithm Steps

  1. Initialization: Start with a guess for $\theta$ or for the responsibilities $r_{ik}$ (the probability that $x_i$ belongs to component $k$).

  2. E-Step (Expectation): Use the current value of $\theta$ to tighten the ELBO by setting $q(z_i) = p(z_i \mid x_i, \theta)$. This results in the responsibilities $r_{ik}$.

  3. M-Step (Maximization): Update $\theta$ by maximizing the ELBO, improving the likelihood.

  4. Repeat: Alternate E and M steps until a stopping criterion (e.g., a small likelihood increase) is met.

  • EM finds local maxima.

  • The ELBO equals the likelihood when $q(Z) = p(Z \mid X, \theta)$, ensuring each iteration improves or maintains the likelihood. (A sketch of the E- and M-steps follows this list.)
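A compact NumPy sketch of the E- and M-steps for a 1-D two-component GMM, following the alternating recipe above; the toy data and initial guesses are arbitrary, and in practice a library routine such as sklearn.mixture.GaussianMixture would normally be used.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Toy 1-D data drawn from two Gaussians
x = np.concatenate([rng.normal(-2, 0.7, 150), rng.normal(3, 1.0, 250)])

# Initial guesses for weights, means, and standard deviations
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities r[i, k] = P(z_i = k | x_i, theta)
    dens = np.stack([p * norm.pdf(x, m, s) for p, m, s in zip(pi, mu, sigma)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters from the responsibility-weighted data
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(pi, mu, sigma)   # roughly recovers the two components
```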


Nonparametric Density Estimation


Empirical Density Estimation


Naive Density Estimation


Kernel Density Estimation (KDE)

  • Places a kernel (small PDF) around each data point and aggregates them.

  • Formula: the sum of kernel functions centred at each observation, divided by the sample size (see the sketch after this list):

$\hat{p}(x) = \dfrac{1}{nh} \sum_{i=1}^{n} K\!\left(\dfrac{x - x_i}{h}\right)$

where $K$ is a kernel (e.g., Gaussian) and $h$ is the bandwidth.

  • Bandwidth ($h$), similar to the window width in histograms, controls the width of each kernel.

  • Small $h$: narrow kernels, tightly hugging each data point, giving a spiky, jagged density with lots of detail.

  • Large $h$: wide kernels, spreading probability far from each point, giving a smooth, overly flattened density that might miss details.
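A sketch of Gaussian KDE on toy data at two bandwidths, written out with the explicit sum so the role of h is visible; scipy.stats.gaussian_kde could be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.8, 100)])

def kde(x, samples, h):
    # p_hat(x) = 1/(n h) * sum_i K((x - x_i) / h), with a standard Gaussian kernel K
    u = (x[:, None] - samples[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (len(samples) * h)

grid = np.linspace(-5, 5, 200)
spiky = kde(grid, data, h=0.05)    # small bandwidth: jagged estimate
smooth = kde(grid, data, h=1.5)    # large bandwidth: over-smoothed estimate
print(spiky.max(), smooth.max())
```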


Histograms

Divides the data domain into bins of equal width.

Counts the data points in each bin and normalizes the counts to create a PDF.

  • Bin width ($w$) controls the level of detail.

  • Trade-off: too few bins (high bias) vs. too many bins (high variance).
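A two-line NumPy illustration of the same idea: counts per equal-width bin are normalized into a density.

```python
import numpy as np

data = np.random.default_rng(0).normal(size=500)
density, edges = np.histogram(data, bins=20, density=True)  # normalized so the total area is 1
print(density[:5], edges[:3])
```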


Lesson 8


Information Content

Also known as Shannon information, it measures how surprising an event is (its uncertainty): $I(x) = -\log p(x)$.

Why $-\log p(x)$?

  • Monotonic: the less probable an event, the more information it carries.

When an event will definitely happen ($p(x) = 1$), the information content is $-\log 1 = 0$, meaning it is not surprising at all.


Entropy

The expected information content of a random variable $X$; it describes the expected uncertainty in a probability mass function: $H(X) = -\sum_x p(x) \log p(x)$.

The unit of entropy is bits if using $\log_2$, dits (hartleys) if using $\log_{10}$, or nats if using $\ln$.


Perplexity

Perplexity is another way to express entropy (uncertainty): $\text{perplexity}(X) = 2^{H(X)}$ when entropy is measured in bits.

It is the amount of surprise of the model when predicting a sample, i.e. the effective number of equally likely outcomes.
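A short sketch computing information content, entropy (in bits), and perplexity for a small discrete distribution; the probabilities are arbitrary.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])    # a toy probability mass function

info = -np.log2(p)                         # information content of each outcome, in bits
entropy = np.sum(p * info)                 # expected information content H(X)
perplexity = 2 ** entropy                  # effective number of equally likely outcomes

print(info, entropy, perplexity)           # entropy = 1.75 bits, perplexity ~ 3.36
```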


Cross Entropy

The expected information content of the probability mass function $q$ measured with respect to the true probability mass function $p$: $H(p, q) = -\sum_x p(x) \log q(x)$.


Kullback-Leibler Divergence

The KLD is a measure of how much one probability distribution differs from another.

It compares the true distribution (the data, $p$) with the approximate distribution (the model, $q$) by taking the difference between the cross entropy and the entropy: $D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p) = \sum_x p(x) \log \dfrac{p(x)}{q(x)}$.

Why is it not a distance?

  • Not symmetric: $D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p)$.

  • Does not satisfy the triangle inequality.

Forward KL: $D_{\mathrm{KL}}(p \,\|\, q)$

  • Approximates $p$ with $q$.

  • Penalizes: $q$ heavily if it assigns low probability to regions where $p$ has high probability ($\log \frac{p(x)}{q(x)}$ becomes very large).

  • Behavior: mass-covering; $q$ can’t be too small in any region where $p$ has support, so it spreads probability mass to cover all regions where $p$ exists.

Reverse KL: $D_{\mathrm{KL}}(q \,\|\, p)$

  • Approximates $p$ with $q$.

  • Penalizes: $q$ if it assigns high probability to regions where $p$ has low probability ($\log \frac{q(x)}{p(x)}$ becomes very large).

  • Behavior: mode-seeking; it focuses on matching a single mode (peak) of $p$, and $q$ needs to be small in regions where $p$ is small.
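A sketch computing cross entropy, forward KL, and reverse KL for two small discrete distributions, showing the asymmetry; p and q are arbitrary.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.4, 0.4, 0.2])   # model / approximation

def entropy(a):
    return -np.sum(a * np.log2(a))

def cross_entropy(a, b):
    return -np.sum(a * np.log2(b))

def kl(a, b):
    # D_KL(a || b) = H(a, b) - H(a) = sum_x a(x) log(a(x)/b(x))
    return np.sum(a * np.log2(a / b))

print(cross_entropy(p, q) - entropy(p))   # equals the forward KL below
print(kl(p, q), kl(q, p))                 # asymmetric: the two values differ
```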


Decision Tree


What is the difference between information gain and Gini impurity as splitting criteria?

  • Information gain measures the reduction in Shannon entropy (uncertainty) after a split, favoring splits that maximize the difference in entropy before and after.

  • Gini impurity measures the probability of misclassifying a randomly labeled point in a subset, favoring splits that minimize this impurity.

  • Gini is computationally simpler (no logarithms) and less sensitive to rare events, while information gain aligns with information theory principles.

  • Gini impurity might be preferred because it’s computationally faster (uses squared probabilities instead of logarithms) and less sensitive to small probability changes, making it more robust in noisy datasets.
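A small sketch computing Shannon entropy and Gini impurity for one node's class labels, to make the two criteria concrete; the label counts are arbitrary.

```python
import numpy as np

labels = np.array([0] * 6 + [1] * 4)          # a node with 6 of class 0 and 4 of class 1
_, counts = np.unique(labels, return_counts=True)
p = counts / counts.sum()

entropy = -np.sum(p * np.log2(p))             # Shannon entropy of the node
gini = 1.0 - np.sum(p ** 2)                   # Gini impurity: no logarithms needed
print(entropy, gini)                          # ~0.971 bits vs 0.48
```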


How does setting a max_depth limit help prevent overfitting in decision trees?

Setting a max_depth limit stops the tree from growing too deep, preventing it from creating overly specific splits that capture noise in the training data. This reduces variance (overfitting) by enforcing simpler models.
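A scikit-learn sketch showing the max_depth argument; the dataset and depth value are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Restricting depth keeps the tree from carving out overly specific, noise-fitting splits
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X, y)
print(shallow.get_depth(), deep.get_depth())
```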


Exercise 9


Summary

  • Simple Gaussian Model: A basic generative model assumes each pixel is independently Gaussian-distributed. Maximum likelihood estimation (MLE) computes pixel-wise means and standard deviations, but samples are noisy due to ignored pixel correlations.

  • PCA-Based Gaussian Model: To address correlations, Principal Component Analysis (PCA) with 250 components transforms the data into uncorrelated coefficients. A Gaussian model is fitted to these coefficients, and inverse PCA reconstructs images, reducing noise by capturing pixel covariances implicitly.

  • Temperature Adjustment: Reducing the “temperature” (scaling standard deviation) lowers entropy, concentrating samples around likely outcomes, improving perceived quality.

  • As T increases from 0 to 0.25, the features of the faces become more obvious and recognizable. However, when T increases to 1, the shapes become distorted, likely due to excessive variation in pixel values.

  • Gaussian Mixture Model (GMM): A non-Gaussian approach uses GMMs (25 and 250 components) on PCA coefficients, capturing multimodal distributions and further enhancing sample quality by modeling data clusters.

  • It can better capture the structure of the data by assigning different Gaussian components to different regions of the distribution, allowing the generated samples to retain more detailed features and become sharper.