| Concept | Formula | Notes |
|---|---|---|
| Expectation | E[X] = Σ x·P(x) | Discrete case. |
| Variance | Var(X) = E[X²] − (E[X])² | Spread of distribution. |
| Bayes’ theorem | P(A\|B) = P(B\|A)P(A) / P(B) | Posterior inference. |
| Gaussian distribution | N(x\|μ,σ²) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} | Common in ML. |
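As a quick sanity check, the formulas above can be evaluated directly in Python (a minimal sketch; the toy distribution is an assumption for illustration):

```python
import math

# Toy discrete distribution: values and their probabilities (illustrative).
values = [1, 2, 3, 4]
probs = [0.1, 0.2, 0.3, 0.4]

# Expectation: E[X] = sum of x * P(x)
mean = sum(x * p for x, p in zip(values, probs))

# Variance: Var(X) = E[X^2] - (E[X])^2
second_moment = sum(x**2 * p for x, p in zip(values, probs))
variance = second_moment - mean**2

# Gaussian density N(x | mu, sigma^2)
def gaussian_pdf(x, mu, sigma2):
    return math.exp(-(x - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
```

Note that the variance identity lets you compute spread from two moments without a second pass over the data.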
| Concept | Formula | Notes |
|---|---|---|
| Linear regression model | ŷ = Xw | Prediction as dot product. |
| Least squares solution | w = (XᵀX)⁻¹ Xᵀ y | Closed‑form solution. |
| Ridge regression | w = (XᵀX + λI)⁻¹ Xᵀ y | L2 regularization. |
| Logistic function | σ(z) = 1 / (1 + e^{−z}) | Binary classification. |
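The closed-form solutions translate almost verbatim to NumPy (a sketch on synthetic data; the shapes, noise level, and λ are assumptions):

```python
import numpy as np

# Synthetic regression data with known weights (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=50)

# Least squares: w = (X^T X)^(-1) X^T y, via solve() rather than an explicit inverse
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: w = (X^T X + lambda I)^(-1) X^T y
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Logistic function for binary classification
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
```

Using `np.linalg.solve` instead of forming the inverse is the standard numerically stable way to apply these formulas.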
| Concept | Formula | Notes |
|---|---|---|
| Mean squared error (MSE) | L = (1/n) Σ (yᵢ − ŷᵢ)² | Regression tasks. |
| Cross‑entropy (binary) | L = −[y log p + (1−y) log(1−p)] | Logistic regression. |
| Cross‑entropy (multi‑class) | L = −Σ yᵢ log pᵢ | Softmax outputs. |
| Softmax | pᵢ = e^{zᵢ} / Σ e^{zⱼ} | Probability distribution. |
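These losses are short one-liners in practice; a minimal sketch (the clipping constant and the max-subtraction trick are standard numerical-stability conventions, not part of the formulas above):

```python
import numpy as np

def mse(y, y_hat):
    # Mean squared error over n samples
    return np.mean((np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)) ** 2)

def binary_cross_entropy(y, p, eps=1e-12):
    # Clip probabilities so log never sees exactly 0 or 1
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def softmax(z):
    # Subtracting the max leaves the result unchanged but avoids overflow
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()
```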
| Concept | Formula | Notes |
|---|---|---|
| Gradient descent update | w ← w − η ∇L(w) | η = learning rate. |
| Stochastic gradient descent | w ← w − η ∇Lᵢ(w) | Uses one sample or mini‑batch. |
| Momentum | v ← βv + (1−β)∇L;  w ← w − ηv | Smoother updates. |
| Adam optimizer | m ← β₁m + (1−β₁)g;  v ← β₂v + (1−β₂)g²;  w ← w − η m̂ / (√v̂ + ε) | Adaptive learning rates. |
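The momentum and Adam updates can be sketched on a simple quadratic objective (the objective, learning rate, and iteration counts are illustrative assumptions; m̂ and v̂ are the bias-corrected moments):

```python
import numpy as np

# Objective: L(w) = 0.5 * ||w||^2, whose gradient is simply w.
def grad(w):
    return w

eta, beta = 0.1, 0.9

# Gradient descent with momentum
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(200):
    v = beta * v + (1 - beta) * grad(w)
    w = w - eta * v

# Adam: bias-corrected first and second moment estimates
b1, b2, eps = 0.9, 0.999, 1e-8
w_adam = np.array([5.0, -3.0])
m = np.zeros_like(w_adam)
s = np.zeros_like(w_adam)
for t in range(1, 501):
    g = grad(w_adam)
    m = b1 * m + (1 - b1) * g            # first moment
    s = b2 * s + (1 - b2) * g**2         # second moment
    m_hat = m / (1 - b1**t)              # bias correction
    s_hat = s / (1 - b2**t)
    w_adam = w_adam - eta * m_hat / (np.sqrt(s_hat) + eps)
```

Both runs drive the weights toward the minimum at the origin; Adam's per-coordinate scaling by √v̂ is what makes its effective step size adaptive.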
| Concept | Formula | Notes |
|---|---|---|
| Neuron output | a = f(wᵀx + b) | Activation function f. |
| ReLU | f(x) = max(0, x) | Most common activation. |
| Backpropagation (weight gradient) | ∂L/∂w = δ x | δ = error term. |
| Softmax gradient | ∂L/∂zᵢ = pᵢ − yᵢ | Cross‑entropy w.r.t. softmax inputs; elegant simplification. |
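The softmax gradient pᵢ − yᵢ is easy to verify numerically; a minimal sketch (the logits and one-hot target are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def neuron(w, x, b):
    # a = f(w^T x + b) with f = ReLU
    return relu(w @ x + b)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Gradient of cross-entropy loss w.r.t. the softmax logits: dL/dz = p - y
z = np.array([2.0, 0.5, -1.0])
y = np.array([1.0, 0.0, 0.0])   # one-hot target
p = softmax(z)
grad_z = p - y
```

Comparing `grad_z` against a finite-difference estimate of −Σ yᵢ log pᵢ confirms the simplification.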
| Concept | Formula | Notes |
|---|---|---|
| Scaled dot‑product attention | Attention(Q,K,V) = softmax(QKᵀ / √dₖ) V | Core of transformer models. |
| Layer normalization | y = (x − μ) / √(σ² + ε) | Stabilizes training. |
| Residual connection | y = x + F(x) | Mitigates vanishing gradients. |
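These three building blocks fit in a few lines of NumPy (a single-head, unbatched sketch; the learned scale/shift parameters of layer norm are omitted for brevity):

```python
import numpy as np

def softmax_rows(scores):
    # Row-wise softmax with max-subtraction for stability
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    return softmax_rows(Q @ K.T / np.sqrt(d_k)) @ V

def layer_norm(x, eps=1e-5):
    # Normalize to zero mean, unit variance (learned gain/bias omitted)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def residual_block(x, F):
    # y = x + F(x): the identity path carries gradients around F
    return x + F(x)
```

Each attention output row is a convex combination of the rows of V, since the softmax weights are non-negative and sum to 1.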
| Concept | Formula | Notes |
|---|---|---|
| Entropy | H(X) = −Σ p(x) log p(x) | Uncertainty measure. |
| Kullback–Leibler divergence | KL(P‖Q) = Σ P log(P/Q) | Distribution difference. |
| Mutual information | I(X;Y) = H(X) + H(Y) − H(X,Y) | Shared information. |
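All three quantities follow directly from the formulas above; a minimal sketch in bits (base-2 logs; the convention 0·log 0 = 0 is handled by dropping zero-probability outcomes):

```python
import numpy as np

def entropy(p):
    # H = -sum p log2 p, skipping zero-probability outcomes
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    # KL(P || Q) = sum P log2(P / Q); terms with P = 0 contribute nothing
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def mutual_information(joint):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), computed from a joint probability table
    joint = np.asarray(joint, dtype=float)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint.ravel())
```

For independent variables the joint factorizes, H(X,Y) = H(X) + H(Y), and the mutual information is zero.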