CS After Dark

[NOTES] Classification Method - P1



Statistical Classifiers

Note on Frequentist vs Bayesian: Bayesian methods consider priors; frequentist methods don't.

Bayes Classifier


G has a discrete distribution with probability mass assigned to a small number of outcomes (the classes).



We have: $posterior \propto prior \times likelihood$

Bayes Theorem

The posterior probability is given by $Pr(G = k \mid X = x)$, where:

We are essentially checking the probability of class $G = k$ given input variables $X = x$.

Classification decision:

Assign a "new" observation with inputs $x_{new}$ to class $j \in \{1, 2, \cdots, k\}$ if $Pr(G = j \mid x)$ is the highest among $Pr(G = 1 \mid x), \cdots, Pr(G = k \mid x)$.

Bayes Theorem is given by:

$$Pr(G = k \mid X = x) = \frac{Pr(G = k, X = x)}{Pr(X = x)} = \frac{Pr(X = x \mid G = k) \cdot Pr(G = k)}{Pr(X = x)}$$

Input variable is a RANDOM VARIABLE (uppercase). Observations are denoted in lowercase.

Note: We have $Pr(X = x)$ as a normalizing factor in denominator.
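The posterior computation and the "pick the highest posterior" decision rule can be sketched in a few lines of numpy. The priors and likelihood values below are made-up numbers for illustration:

```python
import numpy as np

# Hypothetical priors Pr(G = k) and likelihoods Pr(X = x | G = k)
# evaluated at one observed x, for 3 classes (illustrative numbers).
priors = np.array([0.5, 0.3, 0.2])
likelihoods = np.array([0.1, 0.4, 0.3])

# Bayes theorem: posterior = likelihood * prior / normalizing factor.
unnormalized = likelihoods * priors
# Pr(X = x) = sum_k Pr(X = x | G = k) * Pr(G = k) is the normalizing factor.
posteriors = unnormalized / unnormalized.sum()

# Assign x to the class with the highest posterior probability.
predicted_class = int(np.argmax(posteriors))
```

Note that the normalizing factor $Pr(X = x)$ is the same for every class, so it never changes which class wins the argmax.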

Discriminant Analysis (LDA/QDA/RDA)

Model Assumptions:

The Multivariate normal probability distribution function takes form:

$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}$$

Here $p$ is the number of predictors and $k$ is the class number. $\Sigma_k$ is the variance-covariance matrix and $\mu_k$ is the mean vector for class $k$.

$$\Sigma_k(\alpha) = \alpha \Sigma_k + (1 - \alpha) \Sigma$$

This is the RDA (regularized) covariance: it interpolates between the class-specific covariance $\Sigma_k$ (QDA) and the pooled covariance $\Sigma$ (LDA) via the parameter $\alpha$.

Numerical Illustration

[1] Classes $k = 1, 2, \cdots, 5$ (5 classes)

[2] $X = [X_1, X_2, X_3]$, say 3 input variables (rich, sleep, ) where $x_1 = [x_{11}, x_{12}, \cdots, x_{1n}]^T$

Now we get a mean for class k =1: $\mu_{k=1} = [E(X_1), E(X_2), E(X_3)]$

This is the mean of x for class k = 1.

Then,

$$\Sigma_{k=1} = \begin{bmatrix} Var(X_1) & Cov(X_1, X_2) & Cov(X_1, X_3) \\ Cov(X_1, X_2) & Var(X_2) & Cov(X_2, X_3) \\ Cov(X_1, X_3) & Cov(X_2, X_3) & Var(X_3) \end{bmatrix}$$

The dimensions of $\Sigma_k$ are 3x3 (i.e. $d \times d$, where $d$ is the number of input variables, the columns in your dataset!). The existence of covariance terms means that there can be dependence between any two x variables.
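Estimating $\mu_k$ and $\Sigma_k$ from a class's observations is a one-liner each in numpy. The data below is randomly generated just to show the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical class-1 data: n = 50 observations of p = 3 input variables.
X_class1 = rng.normal(size=(50, 3))

# Mean vector mu_1 = [E(X_1), E(X_2), E(X_3)], shape (3,).
mu_1 = X_class1.mean(axis=0)

# Variance-covariance matrix Sigma_1, shape (3, 3); rowvar=False
# tells numpy that columns (not rows) are the variables.
sigma_1 = np.cov(X_class1, rowvar=False)
```

The off-diagonal entries of `sigma_1` are the $Cov(X_i, X_j)$ terms, so the matrix comes out symmetric.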

Exam Question:

Sample sizes for students (observations) in each class can be different! For each class, the mean vector and variance-covariance matrix can be different.

Different classes will have different data likelihoods!


In comparing 2 classes, we set their posterior probabilities ($Pr(G=k|X=x)$ and $Pr(G=l|X=x)$) equal. Upon simplification we get the following decision boundary.

$$\log\frac{Pr(G=k \mid X=x)}{Pr(G=l \mid X=x)} = \log\left(\frac{\pi_k}{\pi_l}\right) - \frac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x^T \Sigma^{-1} (\mu_k - \mu_l)$$

Now here $\pi_k$ and $\pi_l$ are the prior probabilities for classes k and l. Note that we can get the prior probabilities for each class directly from our data.
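A minimal numpy sketch of this pairwise LDA log-ratio, assuming a shared covariance matrix for both classes (`lda_log_ratio` is a name chosen here for illustration):

```python
import numpy as np

def lda_log_ratio(x, mu_k, mu_l, sigma, pi_k, pi_l):
    """Log posterior ratio log Pr(G=k|x) - log Pr(G=l|x) under a shared
    covariance matrix. Positive favors class k, negative favors class l,
    and zero means x sits exactly on the decision boundary."""
    sigma_inv = np.linalg.inv(sigma)
    diff = mu_k - mu_l
    return (np.log(pi_k / pi_l)
            - 0.5 * (mu_k + mu_l) @ sigma_inv @ diff
            + x @ sigma_inv @ diff)
```

With equal priors, the boundary passes through the midpoint of the two class means, which is a quick sanity check on the formula.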


Now remember that QDA has no such funny business of assuming the variance-covariance matrix is the same for each class. Hence its decision boundary becomes slightly more complicated and looks as follows:

$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$$

REMEMBER SANDWICH! $\Sigma_k^{-1}$ is caught between $(x - \mu_k)^T$ and $(x - \mu_k)$. This sandwich actually stems from the scaling property of variance.
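The QDA discriminant translates directly into numpy; the "sandwich" is the quadratic form in the middle term (`qda_discriminant` is an illustrative name, not from the notes):

```python
import numpy as np

def qda_discriminant(x, mu_k, sigma_k, pi_k):
    """QDA discriminant for one class:
    -1/2 log|Sigma_k| - 1/2 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log pi_k.
    Classify x to whichever class gives the largest value."""
    diff = x - mu_k
    # slogdet is numerically safer than log(det(...)) for the log-determinant.
    _, logdet = np.linalg.slogdet(sigma_k)
    quadratic = diff @ np.linalg.inv(sigma_k) @ diff  # the "sandwich"
    return -0.5 * logdet - 0.5 * quadratic + np.log(pi_k)
```

Because each class keeps its own $\Sigma_k$, the boundary between two classes is quadratic in $x$ rather than linear.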

Naive Bayes Classifier!

Visually for dependent data:

$$VarCovar(X) = \begin{bmatrix} Var(x_1) & Cov(x_1, x_2) & \cdots & Cov(x_1, x_p) \\ Cov(x_2, x_1) & Var(x_2) & \cdots & Cov(x_2, x_p) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(x_p, x_1) & Cov(x_p, x_2) & \cdots & Var(x_p) \end{bmatrix}$$

And for independent data (Naive Bayes assumption) it is as follows:

$$VarCovar(X) = \begin{bmatrix} Var(x_1) & 0 & \cdots & 0 \\ 0 & Var(x_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Var(x_p) \end{bmatrix}$$
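A diagonal covariance means the joint Gaussian density factorizes into a product of univariate normals, so the log-likelihood is just a sum over features. A minimal sketch of that (Gaussian) Naive Bayes likelihood, with an illustrative function name:

```python
import numpy as np

def naive_bayes_log_likelihood(x, mu_k, var_k):
    """Log-likelihood of x under the Naive Bayes assumption: features are
    independent given the class, so the joint density is a product of
    univariate normals (equivalently, a diagonal covariance matrix)."""
    # Sum of per-feature log N(x_i; mu_i, var_i) terms.
    return np.sum(-0.5 * np.log(2 * np.pi * var_k)
                  - 0.5 * (x - mu_k) ** 2 / var_k)
```

Compare this with the full multivariate normal: here no matrix inverse is needed, which is why Naive Bayes scales so cheaply with the number of predictors $p$.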

Statistical vs Computing based classification approaches


Separating Hyperplanes



Decision Trees

Objective functions:


Artificial Neural Networks (ANN)




$x_j = [x_{j1}, x_{j2}, \cdots, x_{jn}]^T$


Kernel based classification


KNN Classification
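The notes leave this section bare, so here is a minimal sketch of the standard KNN rule (majority vote among the k nearest training points under Euclidean distance); the function name and toy data are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest
    training points (Euclidean distance)."""
    # Distance from x_new to every training observation.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest neighbours.
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels.
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
```

Note KNN needs no distributional assumptions at all, unlike the statistical classifiers above.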

Cross Validation

Exam Q:

h * n

MISE (mean integrated squared error)



We use this because it's continuous (i.e. we no longer have individual discrete datapoints of X that we can sum over; since it's continuous, we integrate).

MISE and AISE are too complicated to minimize directly. Hence most computer packages use Cross-Validation.

Select best kernel

Go through all candidate kernels, find the best bandwidth $h$ for each kernel, and then compare the kernels at their best $h$.
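The "scan a grid of bandwidths by cross-validation" idea can be sketched with a leave-one-out score for a Gaussian-kernel density estimate. The function name, sample, and grid are illustrative; real packages use the same principle with more efficient formulas:

```python
import numpy as np

def loo_log_likelihood(data, h):
    """Leave-one-out CV score for a Gaussian-kernel density estimate
    with bandwidth h. Higher is better; scanning h over a grid is a
    practical stand-in for minimizing MISE directly."""
    n = len(data)
    score = 0.0
    for i in range(n):
        others = np.delete(data, i)
        # Gaussian kernel density at data[i], estimated without data[i].
        kernel_vals = (np.exp(-0.5 * ((data[i] - others) / h) ** 2)
                       / (h * np.sqrt(2 * np.pi)))
        # Tiny epsilon guards against log(0) when h is very small.
        score += np.log(kernel_vals.mean() + 1e-300)
    return score

# Pick the bandwidth with the highest CV score from a hypothetical grid.
rng = np.random.default_rng(1)
sample = rng.normal(size=100)
grid = [0.05, 0.2, 0.5, 1.0]
best_h = max(grid, key=lambda h: loo_log_likelihood(sample, h))
```

Too small an h overfits (spiky density), too large an h oversmooths; the CV score penalizes both.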

#classification #data-science #interview #machine-learning #machine-learning-interview #notes