PPO

PPO (Proximal Policy Optimization)

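PPO aims to address the poor sample efficiency of vanilla Policy Gradient and the implementation complexity of TRPO.
On-Policy framework
- Uses Importance Sampling to reuse the same batch of collected data for several updates
- Improves sample efficiency
Trust Region
- Uses a clipping or penalty mechanism to limit the difference between the new policy $π_{θ}$ and the old policy $π_{θ_{old}}$
- Prevents an overly large update step from collapsing the policy
Two main versions
- PPO1: KL-divergence penalty (Adaptive KL Penalty)
- PPO2: Clipped Surrogate Objective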
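
The Importance Sampling idea can be checked on a toy problem. The sketch below is only an illustration, not PPO itself: the two-action policies `pi_old` and `pi_new` and the `reward` function are made-up values. It shows that reweighting by the ratio $π_{θ} / π_{θ_{old}}$ lets an expectation under the new policy be computed from the old policy's distribution.

```python
def expectation(policy, f):
    """Exact expectation of f(a) when actions follow `policy`."""
    return sum(p * f(a) for a, p in policy.items())


def importance_weighted_expectation(old_policy, new_policy, f):
    """Expectation under new_policy, written as an expectation under old_policy.

    Each term is weighted by the ratio new(a) / old(a), the same ratio
    that appears in the PPO objective.
    """
    return sum(
        p_old * (new_policy[a] / p_old) * f(a)
        for a, p_old in old_policy.items()
    )


def reward(action):
    # Hypothetical reward, chosen only for this example.
    return 1.0 if action == "left" else 2.0


pi_old = {"left": 0.5, "right": 0.5}  # policy that collected the data
pi_new = {"left": 0.3, "right": 0.7}  # policy being updated

direct = expectation(pi_new, reward)
reweighted = importance_weighted_expectation(pi_old, pi_new, reward)
print(direct, reweighted)
```

The two printed values agree up to floating-point rounding. With sampled data the reweighted estimate gets noisier as the two policies drift apart, which is one reason the difference between them is kept bounded.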

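PPO1: KL-Divergence Penalty

Adds the KL divergence (the gap between the two policies) to the objective function as a penalty term. Here $θ'$ denotes the parameters of the old policy.

$J_{PPO} (θ) = J^{θ'} (θ) - β \cdot KL (θ, θ')$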

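KL divergence
- Measures distance in behavior (the difference between output action distributions), not distance between parameter values
Dynamic adjustment of $β$ (Adaptive KL):
- If $KL > KL_{max}$: the gap is too large, so increase $β$ (heavier penalty)
- If $KL < KL_{min}$: the gap is too small (too conservative), so decrease $β$ (lighter penalty)
The predecessor TRPO treats the KL divergence as a hard constraint, which is difficult to compute; PPO turns it into a penalty term, which is easier to implement
However, it does not perform significantly better than the Clip version (PPO2), so it is used less often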
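
The adaptive rule above can be sketched as a small function. This is a minimal illustration under stated assumptions: the function names, the multiplicative `factor` of 2.0, and the example thresholds are chosen for the example and are not prescribed by the notes.

```python
def adapt_beta(beta, kl, kl_min, kl_max, factor=2.0):
    """Return the updated penalty coefficient after measuring the KL.

    `factor` is an assumed step size for illustration.
    """
    if kl > kl_max:
        return beta * factor  # policies drifted too far apart: penalize more
    if kl < kl_min:
        return beta / factor  # update was too conservative: penalize less
    return beta               # KL inside the target band: keep beta


def penalized_objective(surrogate, kl, beta):
    """J_PPO(theta) = J^{theta'}(theta) - beta * KL(theta, theta')."""
    return surrogate - beta * kl


beta = 1.0
beta = adapt_beta(beta, kl=0.05, kl_min=0.005, kl_max=0.02)
print(beta)  # KL exceeded kl_max, so beta doubled to 2.0
```

In a training loop this adjustment would typically run once per policy update, after measuring the KL between the old and the updated policy on the collected batch.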

Kullback-Leibler Divergence

The KLD is a measure of how much one probability distribution differs from another.

It compares the true distribution (the data $P_{X}$ ) with the approximate distribution (the model $Q_{X}$ ) by calculating the difference between the cross entropy and the entropy.
$D_{KL} (P_{X} ∥ Q_{X}) = H (P_{X}, Q_{X}) - H (P_{X}) = E_{x \sim P_{X}} \left[ \log \frac{P_{X} (x)}{Q_{X} (x)} \right] = \sum_{x \in Ω} P_{X} (x) \log \frac{P_{X} (x)}{Q_{X} (x)}$

$D_{KL} (P_{X} ∥ Q_{X}) \geq 0 \quad \forall P_{X}, Q_{X}$

$D_{KL} (P_{X} ∥ Q_{X}) = 0 ⟺ P_{X} = Q_{X}$
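
The definition and its two properties can be verified numerically for discrete distributions. The sketch below uses made-up probability vectors `p` and `q`, and assumes `q` is strictly positive wherever `p` is.

```python
import math


def entropy(p):
    """H(P) = -sum_x P(x) log P(x), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.3, 0.2]  # example "true" distribution
q = [0.4, 0.4, 0.2]  # example "model" distribution

kl_pq = kl_divergence(p, q)
gap = cross_entropy(p, q) - entropy(p)  # should match kl_pq
kl_pp = kl_divergence(p, p)             # identical distributions

print(kl_pq, gap, kl_pp)
```

The first two printed values agree up to rounding, `kl_pq` is non-negative, and `kl_pp` is zero.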

Why is it not a distance?

Not symmetric:

$D_{KL} (P_{X} ∥ Q_{X}) \neq D_{KL} (Q_{X} ∥ P_{X})$ in general

Does not satisfy the triangle inequality:

$D_{KL} (P_{X} ∥ Q_{X}) + D_{KL} (Q_{X} ∥ R_{X}) \ngeq D_{KL} (P_{X} ∥ R_{X})$ in general
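
A concrete counterexample makes both failures visible. The three two-outcome distributions below are arbitrary values picked for illustration.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.9, 0.1]
Q = [0.5, 0.5]
R = [0.1, 0.9]

# Asymmetry: swapping the arguments changes the value.
kl_pq = kl_divergence(P, Q)
kl_qp = kl_divergence(Q, P)

# Triangle inequality: going through Q is "shorter" than going directly.
via_q = kl_divergence(P, Q) + kl_divergence(Q, R)
direct = kl_divergence(P, R)

print(kl_pq, kl_qp)
print(via_q, direct)
```

Here `kl_pq` and `kl_qp` differ, and `via_q` is smaller than `direct`, which a true distance would not allow.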

Forward KL: $D_{KL} (P ∥ Q)$

Approximates $P$ with $Q$

Penalizes: $Q$ heavily if it assigns low probability to regions where $P$ has high probability ( $\log \frac{P(x)}{Q(x)}$ becomes very large).

Behavior: Mass-covering. $Q$ cannot be too small anywhere in the support of $P$, so it spreads its probability mass to cover every region where $P$ has mass.

Reverse KL: $D_{KL} (Q ∥ P)$

Also fits $Q$ to $P$, but the expectation is taken under $Q$ instead of $P$

Penalizes: $Q$ if it assigns high probability to regions where $P$ has low probability ( $\log \frac{Q(x)}{P(x)}$ becomes very large).

Behavior: Mode-seeking. $Q$ must be small wherever $P$ is small, so it tends to concentrate on a single mode (peak) of $P$.

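KL divergence is not symmetric. The Forward and Backward (Reverse) forms are given below; the closer the two distributions are, the smaller the KL value.

$D_{KL} (P_{X} ∥ Q_{X}) = \sum_{x \in Ω} P_{X} (x) \log \frac{P_{X} (x)}{Q_{X} (x)}$

In the Forward form, $Q$ cannot be too small anywhere $P$ has mass, otherwise the log term becomes very large. It tends toward averaging: $Q$ covers all of $P$.

$D_{KL} (Q_{X} ∥ P_{X}) = \sum_{x \in Ω} Q_{X} (x) \log \frac{Q_{X} (x)}{P_{X} (x)}$

In the Backward form, $Q$ cannot be too large anywhere $P$ is small, otherwise the log term becomes very large. It tends toward peak-seeking: $Q$ fits one high-probability region of $P$.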
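
The contrast between the two forms can be seen by scoring two fixed candidates against a bimodal target. This is a sketch with made-up numbers: `P` has two peaks, `Q_cover` spreads its mass evenly, and `Q_mode` sits on one peak. It compares only these two candidates and does not run an optimization.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.49, 0.01, 0.01, 0.49]        # bimodal target
Q_cover = [0.25, 0.25, 0.25, 0.25]  # covers everything
Q_mode = [0.97, 0.01, 0.01, 0.01]   # locks onto the first peak

forward_cover = kl_divergence(P, Q_cover)  # D_KL(P || Q)
forward_mode = kl_divergence(P, Q_mode)
reverse_cover = kl_divergence(Q_cover, P)  # D_KL(Q || P)
reverse_mode = kl_divergence(Q_mode, P)

print("forward:", forward_cover, forward_mode)
print("reverse:", reverse_cover, reverse_mode)
```

Forward KL scores `Q_cover` lower than `Q_mode`, because `Q_mode` is nearly zero on the second peak of `P`. Reverse KL scores `Q_mode` lower, because `Q_cover` puts mass in the valleys where `P` is small.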

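PPO2: Clipped Surrogate Objective

Skips the more complex KL divergence computation and limits the probability ratio directly with a clip function. This is currently the more commonly used version.

$L (θ) = \sum_{(s, a)} \min \left( \frac{π_{θ} (a ∣ s)}{π_{θ'} (a ∣ s)} A, \operatorname{clip} \left( \frac{π_{θ} (a ∣ s)}{π_{θ'} (a ∣ s)}, 1 - ϵ, 1 + ϵ \right) A \right)$

$ϵ$: a hyperparameter (e.g. 0.2) that limits the size of the update
If $A > 0$ (good action): we want to increase the probability $π_{θ}$, but once the ratio exceeds $1 + ϵ$ it is clipped and no longer rewarded, so the probability is not pushed up so far that the distribution drifts too much
If $A < 0$ (bad action): we want to decrease the probability $π_{θ}$, but once the ratio falls below $1 - ϵ$ it is clipped and no longer penalized
This keeps the behavior of the updated $π_{θ}$ within a bounded range of $π_{θ'}$ (Trust Region), gaining the data-reuse efficiency associated with Off-Policy methods while keeping training stable
Role of min: a pessimistic lower bound that always picks the worse of the original objective and the clipped objective
When the ratio crosses the bound in the direction the advantage favors, min selects the boundary value (a constant), so the gradient for that sample becomes zero and it stops contributing to the update. This prevents a single oversized step from collapsing the policy and lets the model improve steadily over many iterations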
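
The behavior of min and clip can be traced per sample. The sketch below is a plain-Python illustration, not a training implementation: the function names are made up, and `grad_wrt_ratio` uses a finite difference in place of automatic differentiation.

```python
def clipped_term(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for one (s, a) pair."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def clipped_objective(ratios, advantages, eps=0.2):
    """Sum of the clipped terms over a batch of (s, a) pairs."""
    return sum(clipped_term(r, a, eps) for r, a in zip(ratios, advantages))


def grad_wrt_ratio(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference slope of the term with respect to the ratio."""
    upper = clipped_term(ratio + h, advantage, eps)
    lower = clipped_term(ratio - h, advantage, eps)
    return (upper - lower) / (2 * h)


# A > 0 and ratio above 1 + eps: clipped, slope is zero.
print(clipped_term(1.5, 2.0), grad_wrt_ratio(1.5, 2.0))
# A < 0 and ratio below 1 - eps: clipped, slope is zero.
print(clipped_term(0.5, -1.0), grad_wrt_ratio(0.5, -1.0))
# A > 0 but ratio below 1 - eps: min keeps the unclipped term, slope is A.
print(clipped_term(0.5, 2.0), grad_wrt_ratio(0.5, 2.0))
```

The third case shows that clipping is one-sided: when the ratio has moved against the advantage, the unclipped term is the smaller one, so the sample still produces a gradient that can correct the policy.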
