| Concept | Formula | Notes |
|---|---|---|
| Expectation | E[X] = Σ x·P(x) | Discrete case. |
| Variance | Var(X) = E[X²] − (E[X])² | Spread of distribution. |
| Bayes’ theorem | P(A\|B) = P(B\|A)P(A) / P(B) | Posterior inference. |
| Gaussian distribution | N(x\|μ,σ²) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} | Common in ML. |
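As a quick sanity check, the formulas above can be evaluated directly in Python (a minimal sketch; the toy distribution is an assumption for illustration):

```python
import math

# Toy discrete distribution: values and their probabilities (illustrative).
values = [1, 2, 3, 4]
probs = [0.1, 0.2, 0.3, 0.4]

# Expectation: E[X] = sum of x * P(x)
mean = sum(x * p for x, p in zip(values, probs))

# Variance: Var(X) = E[X^2] - (E[X])^2
second_moment = sum(x**2 * p for x, p in zip(values, probs))
variance = second_moment - mean**2

# Gaussian density N(x | mu, sigma^2)
def gaussian_pdf(x, mu, sigma2):
    return math.exp(-(x - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
```

Note that the variance identity lets you compute spread from two moments without a second pass over the data.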
| Concept | Formula | Notes |
|---|---|---|
| Linear regression model | ŷ = Xw | Prediction as dot product. |
| Least squares solution | w = (XᵀX)⁻¹ Xᵀ y | Closed‑form solution. |
| Ridge regression | w = (XᵀX + λI)⁻¹ Xᵀ y | L2 regularization. |
| Logistic function | σ(z) = 1 / (1 + e^{−z}) | Binary classification. |
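The closed-form solutions translate almost verbatim to NumPy (a sketch on synthetic data; the shapes, noise level, and λ are assumptions):

```python
import numpy as np

# Synthetic regression data with known weights (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=50)

# Least squares: w = (X^T X)^(-1) X^T y, via solve() rather than an explicit inverse
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: w = (X^T X + lambda I)^(-1) X^T y
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Logistic function for binary classification
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
```

Using `np.linalg.solve` instead of forming the inverse is the standard numerically stable way to apply these formulas.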
| Concept | Formula | Notes |
|---|---|---|
| Mean squared error (MSE) | L = (1/n) Σ (yᵢ − ŷᵢ)² | Regression tasks. |
| Cross‑entropy (binary) | L = −[y log p + (1−y) log(1−p)] | Logistic regression. |
| Cross‑entropy (multi‑class) | L = −Σ yᵢ log pᵢ | Softmax outputs. |
| Softmax | pᵢ = e^{zᵢ} / Σ e^{zⱼ} | Probability distribution. |
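These losses are short one-liners in practice; a minimal sketch (the clipping constant and the max-subtraction trick are standard numerical-stability conventions, not part of the formulas above):

```python
import numpy as np

def mse(y, y_hat):
    # Mean squared error over n samples
    return np.mean((np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)) ** 2)

def binary_cross_entropy(y, p, eps=1e-12):
    # Clip probabilities so log never sees exactly 0 or 1
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def softmax(z):
    # Subtracting the max leaves the result unchanged but avoids overflow
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()
```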
| Concept | Formula | Notes |
|---|---|---|
| Gradient descent update | w ← w − η ∇L(w) | η = learning rate. |
| Stochastic gradient descent | w ← w − η ∇Lᵢ(w) | Uses one sample or mini‑batch. |
| Momentum | v ← βv + (1−β)∇L;  w ← w − ηv | Smoother updates. |
| Adam optimizer | m ← β₁m + (1−β₁)g;  v ← β₂v + (1−β₂)g²;  w ← w − η m̂ / (√v̂ + ε) | Adaptive learning rates. |
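The momentum and Adam updates can be sketched on a simple quadratic objective (the objective, learning rate, and iteration counts are illustrative assumptions; m̂ and v̂ are the bias-corrected moments):

```python
import numpy as np

# Objective: L(w) = 0.5 * ||w||^2, whose gradient is simply w.
def grad(w):
    return w

eta, beta = 0.1, 0.9

# Gradient descent with momentum
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(200):
    v = beta * v + (1 - beta) * grad(w)
    w = w - eta * v

# Adam: bias-corrected first and second moment estimates
b1, b2, eps = 0.9, 0.999, 1e-8
w_adam = np.array([5.0, -3.0])
m = np.zeros_like(w_adam)
s = np.zeros_like(w_adam)
for t in range(1, 501):
    g = grad(w_adam)
    m = b1 * m + (1 - b1) * g            # first moment
    s = b2 * s + (1 - b2) * g**2         # second moment
    m_hat = m / (1 - b1**t)              # bias correction
    s_hat = s / (1 - b2**t)
    w_adam = w_adam - eta * m_hat / (np.sqrt(s_hat) + eps)
```

Both runs drive the weights toward the minimum at the origin; Adam's per-coordinate scaling by √v̂ is what makes its effective step size adaptive.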
| Concept | Formula | Notes |
|---|---|---|
| Neuron output | a = f(wᵀx + b) | Activation function f. |
| ReLU | f(x) = max(0, x) | Most common activation. |
| Backpropagation (weight gradient) | ∂L/∂w = δ x | δ = error term. |
| Softmax gradient | ∂L/∂zᵢ = pᵢ − yᵢ | Cross‑entropy w.r.t. softmax inputs; elegant simplification. |
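The softmax gradient pᵢ − yᵢ is easy to verify numerically; a minimal sketch (the logits and one-hot target are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def neuron(w, x, b):
    # a = f(w^T x + b) with f = ReLU
    return relu(w @ x + b)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Gradient of cross-entropy loss w.r.t. the softmax logits: dL/dz = p - y
z = np.array([2.0, 0.5, -1.0])
y = np.array([1.0, 0.0, 0.0])   # one-hot target
p = softmax(z)
grad_z = p - y
```

Comparing `grad_z` against a finite-difference estimate of −Σ yᵢ log pᵢ confirms the simplification.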
| Concept | Formula | Notes |
|---|---|---|
| Scaled dot‑product attention | Attention(Q,K,V) = softmax(QKᵀ / √dₖ) V | Core of transformer models. |
| Layer normalization | y = (x − μ) / √(σ² + ε) | Stabilizes training. |
| Residual connection | y = x + F(x) | Mitigates vanishing gradients. |
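These three building blocks fit in a few lines of NumPy (a single-head, unbatched sketch; the learned scale/shift parameters of layer norm are omitted for brevity):

```python
import numpy as np

def softmax_rows(scores):
    # Row-wise softmax with max-subtraction for stability
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    return softmax_rows(Q @ K.T / np.sqrt(d_k)) @ V

def layer_norm(x, eps=1e-5):
    # Normalize to zero mean, unit variance (learned gain/bias omitted)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def residual_block(x, F):
    # y = x + F(x): the identity path carries gradients around F
    return x + F(x)
```

Each attention output row is a convex combination of the rows of V, since the softmax weights are non-negative and sum to 1.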
| Concept | Formula | Notes |
|---|---|---|
| Entropy | H(X) = −Σ p(x) log p(x) | Uncertainty measure. |
| Kullback–Leibler divergence | KL(P‖Q) = Σ P log(P/Q) | Distribution difference. |
| Mutual information | I(X;Y) = H(X) + H(Y) − H(X,Y) | Shared information. |
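All three quantities follow directly from the formulas above; a minimal sketch in bits (base-2 logs; the convention 0·log 0 = 0 is handled by dropping zero-probability outcomes):

```python
import numpy as np

def entropy(p):
    # H = -sum p log2 p, skipping zero-probability outcomes
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    # KL(P || Q) = sum P log2(P / Q); terms with P = 0 contribute nothing
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def mutual_information(joint):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), computed from a joint probability table
    joint = np.asarray(joint, dtype=float)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint.ravel())
```

For independent variables the joint factorizes, H(X,Y) = H(X) + H(Y), and the mutual information is zero.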