
AI & Machine Learning Formula Sheet

1. Core Probability & Statistics

| Concept | Formula | Notes |
|---|---|---|
| Expectation | E[X] = Σ x·P(x) | Discrete case. |
| Variance | Var(X) = E[X²] − (E[X])² | Spread of the distribution. |
| Bayes’ theorem | P(A\|B) = P(B\|A)·P(A) / P(B) | Posterior inference. |
| Gaussian distribution | N(x\|μ,σ²) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} | Common in ML. |
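As a quick numeric sanity check, the formulas above can be sketched in plain Python; the discrete distribution here is purely illustrative:

```python
import math

# Hypothetical discrete distribution over outcomes (illustrative values).
xs = [0, 1, 2, 3]
ps = [0.1, 0.2, 0.3, 0.4]

# Expectation: E[X] = sum of x * P(x)
mean = sum(x * p for x, p in zip(xs, ps))

# Variance: Var(X) = E[X^2] - (E[X])^2
var = sum(x**2 * p for x, p in zip(xs, ps)) - mean**2

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
```

For this distribution, E[X] = 2 and Var(X) = 5 − 4 = 1.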

2. Linear Models

| Concept | Formula | Notes |
|---|---|---|
| Linear regression model | ŷ = Xw | Prediction as a dot product. |
| Least-squares solution | w = (XᵀX)⁻¹ Xᵀ y | Closed-form solution. |
| Ridge regression | w = (XᵀX + λI)⁻¹ Xᵀ y | L2 regularization. |
| Logistic function | σ(z) = 1 / (1 + e^{−z}) | Binary classification. |
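A minimal NumPy sketch of the closed-form solutions above; the toy data (one feature plus a bias column, generated exactly by y = 1 + 2x) is illustrative:

```python
import numpy as np

# Toy data: bias column of ones plus one feature; targets satisfy y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

# Least squares: w = (X^T X)^{-1} X^T y  (solve is preferred over an explicit inverse)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: w = (X^T X + lambda * I)^{-1} X^T y
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
```

Since the data is noiseless, least squares recovers w = (1, 2) exactly, while the ridge penalty shrinks the coefficients slightly toward zero.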

3. Loss Functions

| Concept | Formula | Notes |
|---|---|---|
| Mean squared error (MSE) | L = (1/n) Σᵢ (yᵢ − ŷᵢ)² | Regression tasks. |
| Cross-entropy (binary) | L = −[y log p + (1−y) log(1−p)] | Logistic regression. |
| Cross-entropy (multi-class) | L = −Σᵢ yᵢ log pᵢ | Softmax outputs. |
| Softmax | pᵢ = e^{zᵢ} / Σⱼ e^{zⱼ} | Converts logits into a probability distribution. |
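These losses are short enough to implement directly; a NumPy sketch (using the max-subtraction trick for a numerically stable softmax):

```python
import numpy as np

def mse(y, yhat):
    # L = (1/n) * sum (y_i - yhat_i)^2
    return np.mean((y - yhat) ** 2)

def binary_cross_entropy(y, p):
    # L = -[y log p + (1 - y) log(1 - p)]; assumes 0 < p < 1
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()
```

For example, a perfect prediction gives MSE 0, and predicting p = 0.5 for a positive label costs log 2 nats.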

4. Optimization

| Concept | Formula | Notes |
|---|---|---|
| Gradient descent | w ← w − η ∇L(w) | η is the learning rate. |
| Stochastic gradient descent | w ← w − η ∇Lᵢ(w) | Uses one sample or a mini-batch. |
| Momentum | v ← βv + (1−β)∇L;  w ← w − ηv | Smoother updates (EMA form of momentum). |
| Adam optimizer | m ← β₁m + (1−β₁)g;  v ← β₂v + (1−β₂)g²;  w ← w − η m̂ / (√v̂ + ε) | Adaptive learning rates; m̂, v̂ are bias-corrected estimates. |
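The basic gradient-descent update is easy to verify on a toy objective; a sketch minimizing f(w) = (w − 3)², whose gradient is 2(w − 3):

```python
# Gradient of the toy objective f(w) = (w - 3)^2.
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0      # initial parameter
eta = 0.1    # learning rate

# Repeated update: w <- w - eta * grad(w)
for _ in range(200):
    w -= eta * grad(w)
```

Each step multiplies the error (w − 3) by (1 − 2η) = 0.8, so the iterate converges geometrically to the minimum at w = 3.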

5. Neural Networks

| Concept | Formula | Notes |
|---|---|---|
| Neuron output | a = f(wᵀx + b) | f is the activation function. |
| ReLU | f(x) = max(0, x) | Most common activation. |
| Backpropagation (weight gradient) | ∂L/∂w = δ·x | δ is the backpropagated error term. |
| Softmax + cross-entropy gradient | ∂L/∂zᵢ = pᵢ − yᵢ | Elegant simplification. |
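A minimal NumPy sketch of a single neuron and the softmax/cross-entropy gradient pᵢ − yᵢ; the weight and input values are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def neuron(w, x, b, f=relu):
    # a = f(w^T x + b)
    return f(w @ x + b)

def softmax_ce_grad(z, y_onehot):
    # Gradient of cross-entropy through softmax w.r.t. the logits: p_i - y_i
    e = np.exp(z - z.max())
    p = e / e.sum()
    return p - y_onehot

a = neuron(np.array([1.0, 1.0]), np.array([2.0, 3.0]), -4.0)  # relu(5 - 4) = 1.0
```

Note that the gradient components always sum to zero, since both p and y sum to one.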

6. Transformers & Deep Learning

| Concept | Formula | Notes |
|---|---|---|
| Scaled dot-product attention | Attention(Q,K,V) = softmax(QKᵀ / √dₖ) V | Core of transformer models. |
| Layer normalization | y = (x − μ) / √(σ² + ε) | Stabilizes training; usually followed by a learnable scale γ and shift β. |
| Residual connection | y = x + F(x) | Helps mitigate vanishing gradients. |
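Scaled dot-product attention is only a few lines in NumPy; a single-head sketch without masking or batching:

```python
import numpy as np

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    dk = K.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)
    # Row-wise, numerically stable softmax over the key dimension.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

As a sanity check, if all keys are identical the softmax weights are uniform and each output row is just the mean of the value rows.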

7. Information Theory

| Concept | Formula | Notes |
|---|---|---|
| Entropy | H(X) = −Σ p(x) log p(x) | Uncertainty measure. |
| Kullback–Leibler divergence | KL(P‖Q) = Σ P log(P/Q) | Asymmetric measure of distribution difference. |
| Mutual information | I(X;Y) = H(X) + H(Y) − H(X,Y) | Shared information. |
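Entropy and KL divergence can be checked numerically; a plain-Python sketch using base-2 logarithms (so the units are bits), with the convention 0 log 0 = 0:

```python
import math

def entropy(p, base=2.0):
    # H(X) = -sum p(x) log p(x); terms with p(x) = 0 contribute nothing.
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def kl_divergence(p, q, base=2.0):
    # KL(P || Q) = sum P log(P / Q); assumes q_i > 0 wherever p_i > 0.
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)
```

A fair coin has entropy 1 bit, and KL(P‖Q) is zero exactly when P = Q.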