

\[ \hat y = \hat f(x) + u \\ \hat f(x) = E[y \vert x] \]


| Denotation | Term | Comment |
|---|---|---|
| \(x\) | input; feature; predictor | |
| \(y\) | output; target; response | |
| \(\hat y\) | prediction | |
| \(E[y \vert x]\) | CEF (Conditional Expectation Function) | |
| \(\hat f\) | Target function | Gives the mapping between \(x\) and \(y\) to obtain the CEF |
| \(p(y \vert x)\) | Target distribution / posterior distribution of \(y\) | Gives the mapping between \(x\) and \(y\) to obtain the conditional distribution |
| \(u\) | Random component | |

Model Types

Ideal Non-Parametric
(Nearest Neighbor)
Semi-Parametric Parametric
\(\hat y\) \(\text{Mean}(y \vert x)\) \(\text{Mean} \Big(y \vert x_i \in N(x) \Big)\)
\(N(x)\) is neighborhood of \(x\)
Functional Form assumption None None Assumes functional form with a finite & fixed number of parameters, before data is observed
Advantages - learns complex patterns
- in a high-dimensional space
- without being specifically directed
- learns interactions
Limitation Suffers from curse of dimensionality: Requires large dataset, especially when \(k\) is large Suffers from curse of dimensionality: Requires large dataset, especially when \(k\) is large

Black box: Lacks interpretability
Large storage cost: Stores all training records
Complexity Function of training set size
Example Nearest Neighbor averaging Spline Linear Regression

Fundamentally, a parametric model can be thought of as data compression: the training data is summarized by a finite, fixed set of parameters.

Modelling Types

| | Reduced Form | Structural | |
|---|---|---|---|
| Characteristic | Mathematical/Statistical | Theoretical | Mix of first principles & machine learning |
| Goal | 1. \(\hat p(y \vert x)\)<br>2. \(\hat y = \hat E(y \vert x)\) | 1. \(\hat p(x, y)\)<br>2. \(\hat p(y \vert x)\)<br>3. \(\hat y = \hat E(y \vert x)\) | \(\hat y = \text{g}(x) + d(x)\) |
| Includes causal theory | ❌ | ✅ | Same as Structural |
| Preferred for | Interpolation | Extrapolation<br>Counterfactual simulation<br>Synthetic data generation | Same as Structural |
| Out-of-sample accuracy | Low | High (only for a good theoretical model) | Same as Structural |
| Limitations | Unstable for equilibrium effects | | |
| Derivation time | 0 | High | Same as Structural |
| Example models | Non-probabilistic classifiers | Probabilistic classifiers (Bayesian/Gaussian) | |
| Example 1 | 1. Position of Mars with respect to Earth, assuming that Mars revolves around the Earth<br>2. Relationship between wage and college education, estimated directly | 1. Position of Mars with respect to Earth, assuming that Mars & Earth revolve around the Sun<br>2. Relationship between wage and college education, accounting for the effect of the supply of college-educated students in the market | |
| Example 2 | Linear regression of wage on education | Wage & education using supply-demand curves | |
| Comment | The shortcoming of the reduced form was seen in the 2008 recession. The prediction model for defaults was built only on data from a period of rising housing prices, so it performed poorly once prices started falling. | Learning \(p(x, y)\) can help understand \(p(u, v)\) if \(\{x, y \}\) and \(\{ u, v \}\) share a common underlying causal mechanism.<br>For example, apples falling from trees and the Earth orbiting the Sun both inform us about the gravitational constant. | |

Given a structural model \(g (x, y) = 0\)

If \(y\) is solved as a function of \(x\), i.e. \(y = f(x)\), then \(f\) is referred to as the reduced form of \(g\)
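As a minimal sketch of this definition, assume a hypothetical linear supply-demand market in which \(x\) is a demand shifter and \(y\) is the equilibrium price. The parameter names `a`, `b`, `c`, `d`, their default values, and the function names are illustrative assumptions. The structural relation \(g(x, y) = 0\) states that excess demand is zero; solving it for \(y\) gives the reduced form \(f\).

```python
def g(x, y, a=10.0, b=2.0, c=1.0, d=1.0):
    """Structural relation: excess demand at price y, given demand shifter x."""
    demand = a + x - b * y
    supply = c + d * y
    return demand - supply


def reduced_form(x, a=10.0, b=2.0, c=1.0, d=1.0):
    """Reduced form y = f(x): closed-form solution of g(x, y) = 0 for y."""
    return (a - c + x) / (b + d)


def solve_structural(x, lo=0.0, hi=100.0, tol=1e-10):
    """Solve g(x, y) = 0 numerically by bisection (g is decreasing in y)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(x, mid) > 0:
            lo = mid  # excess demand: equilibrium price is higher
        else:
            hi = mid  # excess supply: equilibrium price is lower
    return (lo + hi) / 2


x = 3.0
# The numerical root of the structural model agrees with the reduced form.
print(solve_structural(x), reduced_form(x))
```

The reduced form predicts \(y\) from \(x\) directly, but it hides the separate supply and demand parameters: if `d` changes, the structural model says how the mapping shifts, while a reduced form estimated from old data does not.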


Structural model vs Reduced form


Number of Variables

| | Univariate Regression | Multi-Variate |
|---|---|---|
| \(\hat y\) | \(f(X_1)\) | \(f(X_1, X_2, \dots, X_n)\) |
| Equation | \(\beta_0 + \beta_1 X_1\) | \(\sum\limits_{i=0}^n \beta_i X_i\), with \(X_0 = 1\) |
| Best Fit | Straight line | Plane |

Degree of Model

| | Simple Linear Regression | Curve-Fitting / Polynomial Linear Regression | Non-Linear Regression |
|---|---|---|---|
| Equation | \(\sum\limits_{j=0}^k \beta_j X_j\) | \(\sum \limits_{j=0}^k \sum\limits_{i=0}^n \beta_{ij} (X_j)^i\) | Non-linear in at least one of the \(\beta\) |
| Example | \(\beta_0 + \beta_1 X_1 + \beta_2 X_2\) | \(\beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2^{10}\) | \(\beta_0 + e^{\textcolor{hotpink}{\beta_1} X_1}\) |
| Best Fit | Straight line | Curve | Curve |

You can alternatively apply a transformation to make a non-linear regression linear, but this is not ideal:
1. The regression minimizes the transformed errors, not the back-transformed errors (which are what actually matter), so the errors are not weighted as expected.
2. The transformed errors may be random, but the back-transformed errors (which are what actually matter) will not be a random process.

The term linear refers to linearity in the coefficients \(\beta\), not in the predictors.
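To illustrate linearity in the coefficients, here is a minimal sketch that fits a quadratic curve with ordinary linear least squares. It uses only the standard library; the helpers `solve` and `ols`, the synthetic data, and the true coefficients `1.0, 2.0, -0.5` are assumptions made for the example.

```python
import random


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols(X, y):
    """Least squares via the normal equations (X'X) beta = X'y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


random.seed(1)
xs = [random.uniform(-3, 3) for _ in range(400)]
ys = [1.0 + 2.0 * x - 0.5 * x**2 + random.gauss(0, 0.3) for x in xs]

# The columns are non-linear functions of x, yet the model is linear in beta,
# so the same linear solver applies.
X = [[1.0, x, x**2] for x in xs]
beta = ols(X, ys)
print(beta)  # should be close to [1.0, 2.0, -0.5]
```

A model such as \(\beta_0 + e^{\beta_1 X_1}\) cannot be written as a design matrix times \(\beta\), so it needs an iterative non-linear solver instead.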

Jensen’s Inequality

\[ E[\log y] < \log (E[y]) \]

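Therefore, for a log-linear model

\[ \log y = \beta_0 + \beta_1 x + u \iff y = \exp(\beta_0 + \beta_1 x + u) \]

the conditional mean is not obtained by simply exponentiating the linear predictor:

\[ E[y \vert x] \ne \exp(\beta_0 + \beta_1 x) \]

However, if you assume that \(u \sim N(0, \sigma^2)\), then

\[ E[y \vert x] = \exp \left( \beta_0 + \beta_1 x + \dfrac{\sigma^2}{2} \right) \]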
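A small simulation can check the correction term. This sketch assumes the log-linear model \(\log y = \beta_0 + \beta_1 x + u\) with \(u \sim N(0, \sigma^2)\); the parameter values and the point `x = 2.0` are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(2)

# Assumed parameters of the log-linear model: log y = beta0 + beta1 * x + u
beta0, beta1, sigma = 0.5, 0.3, 0.8
x = 2.0

# Draw y at a fixed x, with u ~ N(0, sigma^2)
ys = [math.exp(beta0 + beta1 * x + random.gauss(0, sigma)) for _ in range(200_000)]

empirical = statistics.fmean(ys)                         # Monte Carlo estimate of E[y | x]
naive = math.exp(beta0 + beta1 * x)                      # exponentiated linear predictor
corrected = math.exp(beta0 + beta1 * x + sigma**2 / 2)   # with the sigma^2 / 2 term

mean_log = statistics.fmean(math.log(y) for y in ys)     # estimate of E[log y | x]

print(f"empirical E[y|x] = {empirical:.3f}")
print(f"naive            = {naive:.3f}")
print(f"corrected        = {corrected:.3f}")
print(f"E[log y] = {mean_log:.3f} < log E[y] = {math.log(empirical):.3f}")
```

The naive back-transformation underestimates the conditional mean, while the corrected value should agree with the simulated mean up to Monte Carlo error.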

Generalized Linear Models

A generalized linear model describes a response variable \(y\) drawn from a known exponential-family distribution with mean \(\mu\), which is related to the predictors through a link function \(g\):

\[ g(\mu) = x' \beta \]

where \(g\) is a monotonic function.

| Model | \(y\) | \(x' \beta\) |
|---|---|---|
| Normal Linear | \(N(\mu, \sigma^2)\) | \(\mu\) |
| Logistic | \(\text{Binomial}(n, p)\) | \(\log \left(\dfrac{p}{1-p} \right)\) |
| Poisson | \(\text{Poisson}(\mu)\) | \(\log \mu\) |
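As one concrete instance of the table above, the sketch below fits the logistic row (binary response, logit link) by Newton-Raphson on the log-likelihood, using only the standard library. The simulated data, the true coefficients `true_b0` and `true_b1`, and the fixed iteration count are assumptions for illustration; in practice a statistics library would normally be used.

```python
import math
import random


def logit(p):
    """Link function g(mu) for the logistic model."""
    return math.log(p / (1 - p))


def inv_logit(eta):
    """Inverse link: maps the linear predictor x'beta back to the mean."""
    return 1 / (1 + math.exp(-eta))


random.seed(3)
true_b0, true_b1 = -1.0, 2.0  # assumed true coefficients
xs = [random.uniform(-2, 2) for _ in range(5000)]
ys = [1 if random.random() < inv_logit(true_b0 + true_b1 * x) else 0 for x in xs]

b0, b1 = 0.0, 0.0
for _ in range(25):
    # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]]
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        mu = inv_logit(b0 + b1 * x)
        w = mu * (1 - mu)  # variance of a Bernoulli(mu) response
        g0 += y - mu
        g1 += (y - mu) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    # Newton step: beta += information^{-1} @ score (2x2 inverse written out)
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

print(b0, b1)  # should be close to the assumed true values -1.0 and 2.0
```

The fitted model is linear on the link scale: `logit(inv_logit(b0 + b1 * x))` returns `b0 + b1 * x`, which is the relation \(g(\mu) = x' \beta\).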
Last Updated: 2024-05-12 ; Contributors: AhmedThahir
