CS After Dark

[NOTES] Classification Method - P1

Download the PDF from here:

Overview

Statistical Classifiers

Note on Freq vs Bayes: Bayesian methods take prior probabilities into account; frequentist methods do not.

Bayes Classifier

Pr(G|X=x)

G has discrete dist with prob mass assigned to a few outcomes

Terminology

We have: posterior ∝ prior × likelihood

Bayes Theorem

Posterior Probability is given by:

Pr(G=k|X=x)

where

We are essentially checking the probability of class G=k given input variables X=x.

Classification decision:

Assign a "new" observation with inputs x_new to class j (j = 1, 2, …, K) if Pr(G=j|x) is the highest among Pr(G=1|x), …, Pr(G=K|x).

Bayes Theorem is given by:

$$\Pr(G=k|X=x)=\frac{\Pr(G=k, X=x)}{\Pr(X=x)}=\frac{\Pr(X=x|G=k)\cdot\Pr(G=k)}{\Pr(X=x)}$$

Input variable is a RANDOM VARIABLE (uppercase). Observations are denoted in lowercase.

Note: We have Pr(X=x) as a normalizing factor in denominator.
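The whole theorem can be sketched in a few lines of numpy. The likelihood and prior numbers below are hypothetical, just to show the mechanics: multiply likelihood by prior, normalize by Pr(X=x), and pick the argmax class.

```python
import numpy as np

# Hypothetical likelihoods Pr(X=x | G=k) and priors Pr(G=k) for 3 classes.
likelihood = np.array([0.2, 0.5, 0.1])
prior = np.array([0.5, 0.3, 0.2])

joint = likelihood * prior    # Pr(X=x, G=k) = Pr(X=x|G=k) * Pr(G=k)
evidence = joint.sum()        # Pr(X=x), the normalizing factor
posterior = joint / evidence  # Pr(G=k | X=x), sums to 1

print(posterior, posterior.argmax())  # assign x to the argmax class
```

Note the normalizing denominator is the same for every class, so the argmax is unchanged if you skip it.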

Discriminant Analysis (LDA/QDA/RDA)

Model Assumptions:

The Multivariate normal probability distribution function takes form:

$$f_k(x)=\frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}$$

Here p is the number of predictors and k is the class index. Σ_k is the variance-covariance matrix and μ_k is the mean vector of class k.

[1] k = 1, 2, …, K (say K = 5 classes)

[2] X = [X_1, X_2, X_3], say 3 input variables (rich, sleep, ) where,

$$x_1=\begin{bmatrix}x_{11}\\x_{12}\\\vdots\\x_{1n}\end{bmatrix}$$

Now we get a mean for class k =1: μk=1=[E(X1),E(X2),E(X3)]

This is the mean of x for class k = 1.

Then,

$$\Sigma_{k=1}=\begin{bmatrix}Var(X_1)&Cov(X_1,X_2)&Cov(X_1,X_3)\\Cov(X_1,X_2)&Var(X_2)&Cov(X_2,X_3)\\Cov(X_1,X_3)&Cov(X_2,X_3)&Var(X_3)\end{bmatrix}$$

Dimensions of Σ_k are 3x3 (i.e. d×d, where d is the number of predictor columns in your dataset, not the number of observations!). Here the existence of covariance terms means that there can be dependence between any two x variables.
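Estimating μ_k and Σ_k from a class's data is a one-liner each in numpy. A minimal sketch with hypothetical simulated data (50 observations, 3 predictors for one class):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: n = 50 observations of p = 3 predictors for class k = 1.
X_k = rng.normal(size=(50, 3))

mu_k = X_k.mean(axis=0)              # mean vector, shape (3,)
Sigma_k = np.cov(X_k, rowvar=False)  # variance-covariance matrix, shape (3, 3)

# Sigma_k is d x d where d = number of predictors (columns), NOT observations.
print(mu_k.shape, Sigma_k.shape)
```

`rowvar=False` tells `np.cov` that rows are observations and columns are variables.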

Exam Question:

Sample sizes for students (observations) in each class can be different! Each class can have its own mean vector and variance-covariance matrix.

Different class will have different data likelihood!

LDA

In comparing 2 classes we set their posterior probabilities Pr(G=k|X=x) and Pr(G=l|X=x) equal. Then, upon simplification (using the LDA assumption of a shared Σ across classes), we get the following decision boundary:

$$\log\frac{\pi_k}{\pi_l}-\frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l)+x^T\Sigma^{-1}(\mu_k-\mu_l)=0$$

Now here π_k and π_l are the prior probabilities for class k and l. Note that we can get the prior probabilities for each class directly from our data (e.g. as class proportions).
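Equivalently, each class gets a linear discriminant δ_k(x) = x^T Σ^{-1} μ_k − ½ μ_k^T Σ^{-1} μ_k + log π_k and we assign x to the argmax. A minimal numpy sketch with hypothetical numbers (2 classes, 2 predictors, shared Σ):

```python
import numpy as np

def lda_discriminant(x, mu_k, Sigma_inv, pi_k):
    # delta_k(x) = x^T Sigma^{-1} mu_k - 1/2 mu_k^T Sigma^{-1} mu_k + log pi_k
    return x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k)

# Hypothetical shared covariance, class means, and priors.
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.2], [0.2, 1.0]]))
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
priors = [0.6, 0.4]

x = np.array([1.8, 1.9])  # a "new" observation, close to class 2's mean
scores = [lda_discriminant(x, m, Sigma_inv, p) for m, p in zip(mus, priors)]
print(int(np.argmax(scores)))  # assign to the class with the largest delta_k
```

Because Σ is shared, every term quadratic in x cancels between classes, which is exactly why the boundary is linear.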

QDA

Now remember that QDA has no such funny business of assuming the variance-covariance matrix is the same for each class. Hence its decision boundary becomes slightly more complicated, and the discriminant looks as follows:

$$\delta_k(\mathbf{x})=-\frac{1}{2}\log|\Sigma_k|-\frac{1}{2}(\mathbf{x}-\mu_k)^T\Sigma_k^{-1}(\mathbf{x}-\mu_k)+\log\pi_k$$

REMEMBER SANDWICH! Σ_k^{-1} is caught between (x−μ_k)^T and (x−μ_k). This sandwich actually stems from the scaling property of variance.
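The QDA discriminant translates directly to numpy. A sketch with hypothetical numbers, where each class keeps its own Σ_k (the part LDA assumes away):

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    # delta_k(x) = -1/2 log|Sigma_k| - 1/2 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log pi_k
    diff = x - mu_k  # the "sandwich" filling
    _, logdet = np.linalg.slogdet(Sigma_k)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.inv(Sigma_k) @ diff + np.log(pi_k)

# Hypothetical 2-class example: each class has its OWN covariance matrix.
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.0], [0.0, 0.5]])]
priors = [0.5, 0.5]

x = np.array([2.5, 0.0])
scores = [qda_discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
print(int(np.argmax(scores)))
```

Since the Σ_k differ, the quadratic terms no longer cancel between classes, which is what makes the QDA boundary a curve rather than a line.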

Naive Bayes Classifier!

$$VarCovar(X) = \begin{bmatrix} var(x_{1}) & covar(x_{1}, x_{2}) & \cdots & covar(x_{1}, x_{p}) \\ covar(x_{2}, x_{1}) & var(x_{2}) & \cdots & covar(x_{2}, x_{p}) \\ \vdots & \vdots & \ddots & \vdots \\ covar(x_{p}, x_{1}) & covar(x_{p}, x_{2}) & \cdots & var(x_{p}) \end{bmatrix}$$

And for independent data (Naive Bayes assumption) it is as follows:

$$VarCovar(X) = \begin{bmatrix} var(x_{1}) & 0 & \cdots & 0 \\ 0 & var(x_{2}) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & var(x_{p}) \end{bmatrix}$$
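The payoff of the diagonal Σ: the joint Gaussian density factorizes into a product of 1-D normal densities, so Naive Bayes only needs p univariate estimates. A sketch verifying this numerically with hypothetical means/variances:

```python
import numpy as np

def norm_pdf(x, mu, var):
    # Univariate normal density N(mu, var) evaluated at x.
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Hypothetical per-feature means and variances (the diagonal of Sigma).
mu = np.array([0.0, 1.0, -1.0])
var = np.array([1.0, 2.0, 0.5])
x = np.array([0.3, 0.8, -1.2])

# Full multivariate normal density with a diagonal Sigma ...
p = len(x)
Sigma = np.diag(var)
diff = x - mu
joint = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / np.sqrt(
    (2 * np.pi) ** p * np.linalg.det(Sigma)
)

# ... equals the product of the univariate marginals.
product = np.prod(norm_pdf(x, mu, var))
print(np.isclose(joint, product))  # True
```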

Statistical vs Computing based classification approaches

COMPUTING OPTIMISATION BASED PROCEDURES

Separating Hyperplanes

SVM

TODO ADD MORE


Decision Trees

Objective functions:

$$C_\alpha(T)=\sum_{m=1}^{M}\sum_{x_i\in R_m}(y_i-\hat{c}_m)^2+\alpha|T|$$
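The objective is just within-region squared error plus a complexity penalty α|T|. A tiny sketch with hypothetical regions and fitted means (ĉ_m is the mean of each region's y values):

```python
def tree_cost(regions_y, c_hat, alpha):
    # C_alpha(T) = sum_m sum_{x_i in R_m} (y_i - c_m)^2 + alpha * |T|
    sse = sum((y - c) ** 2 for ys, c in zip(regions_y, c_hat) for y in ys)
    return sse + alpha * len(regions_y)

# Two terminal regions R_1, R_2 with their fitted means (hypothetical numbers).
regions_y = [[1.0, 2.0, 3.0], [10.0, 12.0]]
c_hat = [2.0, 11.0]
print(tree_cost(regions_y, c_hat, alpha=0.5))  # 4.0 + 0.5 * 2 = 5.0
```

Larger α penalizes trees with more terminal regions, which is how cost-complexity pruning trades fit for size.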

Artificial Neural Networks (ANN)

Considerations:

$$Z_m=\sigma(\alpha_{0m}+\alpha_m^T X)$$

$$\sigma(v)=\frac{1}{1+e^{-v}}$$

$$x_j=\begin{bmatrix}x_{j1}\\x_{j2}\\\vdots\\x_{jn}\end{bmatrix}$$

$$\beta_{km}^{(r+1)}=\beta_{km}^{(r)}-\gamma_r\sum_{i=1}^{N}\frac{\partial R_i}{\partial\beta_{km}^{(r)}}$$
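The hidden-layer computation Z_m = σ(α_0m + α_m^T X) is one matrix-vector product plus a sigmoid. A minimal forward-pass sketch with hypothetical random weights (3 inputs, M = 4 hidden units):

```python
import numpy as np

def sigma(v):
    # Logistic activation: sigma(v) = 1 / (1 + e^{-v})
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
X = rng.normal(size=3)           # one observation with 3 input variables
alpha0 = rng.normal(size=4)      # biases alpha_{0m}, one per hidden unit
alpha = rng.normal(size=(4, 3))  # weights alpha_m stacked as rows

Z = sigma(alpha0 + alpha @ X)    # all Z_m at once; values lie in (0, 1)
print(Z.shape)
```

The gradient-descent update for β would then adjust the output-layer weights using these Z values; here we only sketch the forward pass.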

Kernel based classification

$$K(x,y)=\langle\phi(x),\phi(y)\rangle$$
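A concrete instance: for the polynomial kernel K(x, y) = (x·y)² on R², an explicit feature map is φ(x) = [x₁², x₂², √2·x₁x₂]. The kernel computes the inner product in that 3-D feature space without ever forming φ. A sketch verifying the identity:

```python
import numpy as np

def phi(x):
    # Explicit feature map for K(x, y) = (x . y)^2 on R^2.
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

kernel = (x @ y) ** 2        # kernel trick: work in input space
explicit = phi(x) @ phi(y)   # same value via the explicit feature map
print(np.isclose(kernel, explicit))  # True
```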

KNN Classification

Cross Validation

Exam Q:

MISE (mean integrated squared error)

$$MISE=E\left(\int\big(f(x)-\hat{f}(x)\big)^2\,dx\right)=\int Bias^2\big(\hat{f}(x)\big)\,dx+\int Var\big(\hat{f}(x)\big)\,dx$$

We use this because it's continuous (i.e. we no longer have individual discrete datapoints of X that we can sum over; since X is continuous, we integrate instead).

MISE and AISE are too complicated to minimize directly. Hence most computer packages use cross-validation instead.

Select best kernel

Go through all candidate kernels, each with a range of bandwidths h, and compare the kernels at their respective best h.
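The steps above can be sketched with leave-one-out cross-validation: for each candidate h, score the density estimate at each point using only the other points, and keep the h with the best score. The Gaussian kernel, the simulated data, and the h grid below are all arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=40)  # hypothetical 1-D sample

def gauss(u):
    # Gaussian kernel (one of the candidate kernels to compare).
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def loo_log_likelihood(x, h, kernel):
    # Leave-one-out CV score: density at x_i estimated from the OTHER points.
    n = len(x)
    total = 0.0
    for i in range(n):
        others = np.delete(x, i)
        fhat = kernel((x[i] - others) / h).sum() / ((n - 1) * h)
        total += np.log(fhat)
    return total

bandwidths = [0.2, 0.4, 0.8]
scores = [loo_log_likelihood(data, h, gauss) for h in bandwidths]
best_h = bandwidths[int(np.argmax(scores))]
print(best_h)
```

Repeating this loop for each candidate kernel and comparing each kernel at its best h implements the selection described above.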

#classification #data-science #interview #machine-learning #machine-learning-interview #notes