# [NOTES] Classification Method - P1


## Overview

Statistical classifiers

- Bayes classifiers
  - LDA/QDA/RDA
  - Naive Bayes (assumes no correlation between inputs $x_i$)
- Frequentist classifiers
  - Logistic Regression

Computing/optimizing classifiers

- SVM/Decision Tree/ANN...

**Note on Frequentist vs Bayes:**
Bayes procedures consider priors. Frequentist procedures **don't consider priors**.

## Bayes Classifier

- Classifies each case into its most probable class
- Works with the posterior (conditional) distribution of the class given the inputs
- Logistic regression is **not a Bayes procedure. It's a frequentist procedure**

The class variable $G$ has a discrete distribution, with probability mass assigned to a few outcomes (the classes).

### Terminology

Prior: previously observed probability of a class $k$

- e.g. we have 5 classes and $Pr(G=1) = 32\%$ for class $k=1$ (blue)
- Ways to get priors:
  - Empirical studies (like just calling `value_counts(normalize=True)` on the class column of your dataframe)

Likelihood: the data distribution, $Pr(X=x|G=k)$

Posterior: $Pr(G=k|X=x)$, which drives the classification decision

We have: $posterior \propto prior \times likelihood$
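The empirical route to priors can be sketched in a few lines. This is only an illustration: the `labels` list is made-up data, and `empirical_priors` is a hypothetical helper name, not something from the course material.

```python
from collections import Counter

def empirical_priors(labels):
    """Estimate Pr(G = k) as the relative frequency of each class label."""
    counts = Counter(labels)
    n = len(labels)
    return {k: c / n for k, c in counts.items()}

# Hypothetical class labels for 10 observations.
labels = ["blue", "red", "blue", "green", "blue",
          "red", "blue", "green", "red", "blue"]
priors = empirical_priors(labels)
print(priors)
```

The priors always sum to 1, since every observation belongs to exactly one class.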

## Bayes Theorem

Posterior Probability is given by:

$$Pr(G=k|X=x)$$

where

- $k$ is the class (e.g. Blue, Red)
- $x$ is the vector of input variables

We are essentially checking the probability of class $G=k$ given input variables $X=x$.

Classification decision:

Assign a "new" observation with inputs ${x}_{new}$ to class $j \in \{1,2,\cdots ,K\}$ if $Pr(G=j|x)$ is the highest among $Pr(G=1|x),\cdots ,Pr(G=K|x)$.

Bayes Theorem is given by:

$$Pr(G=k|X=x)=\frac{Pr(G=k\cap X=x)}{Pr(X=x)}=\frac{Pr(X=x|G=k)\,Pr(G=k)}{Pr(X=x)}$$

The input variable $X$ is a RANDOM VARIABLE (uppercase). Observations $x$ are denoted in lowercase.

Note: We have $Pr(X=x)$ as a normalizing factor in the denominator.
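A minimal sketch of the classification decision, assuming we already know the priors and the likelihoods evaluated at one observed $x$. All numbers below are invented for illustration, and `posteriors` is a hypothetical helper name.

```python
def posteriors(priors, likelihoods):
    """Return Pr(G=k | X=x) for each class k via Bayes' theorem.

    priors[k]      = Pr(G = k)
    likelihoods[k] = Pr(X = x | G = k), evaluated at the observed x
    """
    joint = {k: likelihoods[k] * priors[k] for k in priors}
    evidence = sum(joint.values())  # Pr(X = x), the normalizing factor
    return {k: joint[k] / evidence for k in joint}

# Hypothetical numbers for three classes.
priors = {"blue": 0.5, "red": 0.3, "green": 0.2}
likelihoods = {"blue": 0.10, "red": 0.40, "green": 0.25}

post = posteriors(priors, likelihoods)
predicted = max(post, key=post.get)  # class with the highest posterior
print(post, predicted)
```

Note that "red" wins here even though "blue" has the larger prior, because the data likelihood under "red" is much higher.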

## Discriminant Analysis (LDA/QDA/RDA)

### Model Assumptions:

- A multivariate normal distribution is assumed for the joint distribution of the p-dimensional X-variables within each class, with density ${f}_{k}(x)=Pr(X=x|G=k)$
- **Homogeneity**: for each class $G = k$, $k = 1, 2, \cdots, K$, its inputs are considered as random variables
- **Independence**: the inputs for each class $k$ are different from each other

The multivariate normal probability density function takes the form:

$${f}_{k}(x)=\frac{1}{(2\pi)^{p/2}|{\Sigma}_{k}|^{1/2}}\exp\left(-\frac{1}{2}(x-{\mu}_{k})^{T}{\Sigma}_{k}^{-1}(x-{\mu}_{k})\right)$$

Here $p$ is the number of predictors and $k$ is the class number. ${\Sigma}_{k}$ is the variance-covariance matrix and ${\mu}_{k}$ is the mean vector of class $k$.
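To make the density formula concrete, here is a small sketch restricted to $p = 2$, so the determinant and inverse can be written out by hand. The function name and the test point are assumptions chosen for illustration.

```python
import math

def mvn_density_2d(x, mu, sigma):
    """Bivariate normal density f_k(x) for p = 2.

    x, mu : length-2 sequences
    sigma : 2x2 covariance matrix as nested lists [[a, b], [c, d]]
    """
    (a, b), (c, d) = sigma
    det = a * d - b * c                      # |Sigma_k|
    inv = [[d / det, -b / det],              # Sigma_k^{-1} for a 2x2 matrix
           [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]        # x - mu_k
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    p = 2
    norm_const = (2 * math.pi) ** (p / 2) * math.sqrt(det)
    return math.exp(-0.5 * quad) / norm_const

# At the mean of a standard bivariate normal the density is 1 / (2*pi).
peak = mvn_density_2d([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(peak)
```

For general $p$ one would use a linear-algebra library for the inverse and determinant; the hand-written 2x2 case is only meant to mirror the formula term by term.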

- In **LDA** we assume that ${\Sigma}_{k}=\Sigma $, i.e. the **variance-covariance matrix** is the same for EVERY class.
- In **QDA** the variance-covariance matrix ${\Sigma}_{k}$ is different for each class.
- In **RDA** the variance-covariance matrix is a function of $\alpha $ (tuning parameter) that blends the class-specific matrix with the common one, and can be given like so:

$$\Sigma_{k}(\alpha) = \alpha\Sigma_{k} + (1-\alpha)\Sigma$$

**Numerical Illustration**

[1] $k=1,2,\cdots ,5$ (5 classes)

[2] $X=[{x}_{1},{x}_{2},{x}_{3}]$ say 3 input variables (rich, sleep, ) where,

$${x}_{1}=\begin{bmatrix}{x}_{11}\\ {x}_{12}\\ \vdots \\ {x}_{1n}\end{bmatrix}$$

Now we get a mean for class $k=1$: ${\mu}_{k=1}=[E({X}_{1}),E({X}_{2}),E({X}_{3})]$

This is the mean of x for class k = 1.

Then,

$${\Sigma}_{k=1}=\begin{bmatrix}Var({X}_{1})& Cov({X}_{1},{X}_{2})& Cov({X}_{1},{X}_{3})\\ Cov({X}_{1},{X}_{2})& Var({X}_{2})& Cov({X}_{2},{X}_{3})\\ Cov({X}_{1},{X}_{3})& Cov({X}_{2},{X}_{3})& Var({X}_{3})\end{bmatrix}$$

The dimensions of ${\Sigma}_{k}$ are 3x3, i.e. $p \times p$ where $p$ is the number of input variables (the columns of your dataset, not the observations). The covariance terms off the diagonal mean that there can be dependence between any two x variables.
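The numerical illustration can be reproduced with a short sketch. The data for class $k=1$ below is invented (4 students, 3 input variables), and the helper names are hypothetical. The covariance uses the usual $n-1$ divisor, which is an assumption since the notes do not say which divisor is meant.

```python
def mean_vector(rows):
    """Column-wise sample mean of a list of observations (rows)."""
    n = len(rows)
    p = len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(p)]

def covariance_matrix(rows):
    """Sample variance-covariance matrix (p x p), using the n - 1 divisor."""
    n = len(rows)
    p = len(rows[0])
    mu = mean_vector(rows)
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
             for j in range(p)]
            for i in range(p)]

# Hypothetical observations for class k = 1: 4 students, 3 input variables.
class_1 = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 0.0],
    [4.0, 8.0, 1.0],
]
mu_1 = mean_vector(class_1)           # length 3
sigma_1 = covariance_matrix(class_1)  # 3 x 3, symmetric
print(mu_1)
print(sigma_1)
```

Repeating this for every class gives one mean vector and one variance-covariance matrix per class, which is exactly what QDA needs.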

### Exam Question:

Sample size for students (observations) in each class can be different! For each class the mean value and variance-covariance value can be different.

Different classes will have different data likelihoods!

### LDA

When comparing 2 classes $k$ and $l$, we set their posterior probabilities ($Pr(G=k|X=x)$ and $Pr(G=l|X=x)$) to be equal. Upon simplification we get the following decision boundary.

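**Decision Boundary:** The boundary between classes $k$ and $l$ is the set of $x$ where

$$\log\frac{{\pi}_{k}}{{\pi}_{l}}-\frac{1}{2}({\mu}_{k}+{\mu}_{l})^{T}{\Sigma}^{-1}({\mu}_{k}-{\mu}_{l})+{x}^{T}{\Sigma}^{-1}({\mu}_{k}-{\mu}_{l})=0$$

Equivalently ${\delta}_{k}(x)={\delta}_{l}(x)$, where the discriminant function for LDA takes the following form:

$${\delta}_{k}(x)={x}^{T}{\Sigma}^{-1}{\mu}_{k}-\frac{1}{2}{\mu}_{k}^{T}{\Sigma}^{-1}{\mu}_{k}+\log{\pi}_{k}$$

Now here ${\pi}_{k}$ and ${\pi}_{l}$ are the **prior probabilities for class k and l**. Note that we can get the prior probabilities for each class directly from our data.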

- Note that the discriminant function is a linear function of our input $X = x$.
- We estimate the means (${\mu}_{k}$) and the common variance-covariance matrix ($\Sigma$) from our input data.
- Okay, but what does the decision boundary even represent?
  - Well, if it's a linear function of X, then it is a (p-1)-dimensional hyperplane in a p-dimensional space.
- Also note how in the above decision boundary we do not have ${\Sigma}_{k}$!! This is because of the assumption we made above that all classes have the same variance-covariance matrix. Boom!
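The cancellation can be spelled out as a short derivation. This is a sketch of the standard argument, starting from the log of prior times the Gaussian density with a common $\Sigma$:

$$\log\big({\pi}_{k}{f}_{k}(x)\big)=\log{\pi}_{k}-\frac{p}{2}\log(2\pi)-\frac{1}{2}\log|\Sigma|-\frac{1}{2}(x-{\mu}_{k})^{T}{\Sigma}^{-1}(x-{\mu}_{k})$$

Expanding the quadratic form gives

$$(x-{\mu}_{k})^{T}{\Sigma}^{-1}(x-{\mu}_{k})={x}^{T}{\Sigma}^{-1}x-2{x}^{T}{\Sigma}^{-1}{\mu}_{k}+{\mu}_{k}^{T}{\Sigma}^{-1}{\mu}_{k}$$

The terms $-\frac{p}{2}\log(2\pi)$, $-\frac{1}{2}\log|\Sigma|$ and $-\frac{1}{2}{x}^{T}{\Sigma}^{-1}x$ are identical for every class, so they drop out when two classes are compared. What remains is linear in $x$:

$${\delta}_{k}(x)={x}^{T}{\Sigma}^{-1}{\mu}_{k}-\frac{1}{2}{\mu}_{k}^{T}{\Sigma}^{-1}{\mu}_{k}+\log{\pi}_{k}$$

If each class kept its own ${\Sigma}_{k}$, neither the log-determinant nor the ${x}^{T}{\Sigma}_{k}^{-1}x$ term would cancel, which is where QDA comes from.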

### QDA

Now remember that QDA has no such funny business of assuming the variance-covariance matrix is the same for each class. Hence its discriminant function becomes slightly more complicated and looks as follows:

$${\delta}_{k}(\mathbf{x})=-\frac{1}{2}\log|{\Sigma}_{k}|-\frac{1}{2}(\mathbf{x}-{\mu}_{k})^{T}{\Sigma}_{k}^{-1}(\mathbf{x}-{\mu}_{k})+\log{\pi}_{k}$$

**REMEMBER SANDWICH! ${\Sigma}_{k}^{-1}$ is caught between $(\mathbf{x}-{\mu}_{k})^{T}$ and $(\mathbf{x}-{\mu}_{k})$.**
This sandwich actually stems from the scaling property of variance.
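A minimal sketch of the QDA discriminant for the simplest case $p = 1$, where ${\Sigma}_{k}$ reduces to a scalar variance. The two classes and their parameters are invented; they share a mean and a prior and differ only in variance, a situation LDA could not separate at all.

```python
import math

def qda_score_1d(x, mu, var, prior):
    """QDA discriminant delta_k(x) for a single input (p = 1).

    With p = 1 the matrix Sigma_k reduces to the scalar variance `var`.
    """
    return -0.5 * math.log(var) - 0.5 * (x - mu) ** 2 / var + math.log(prior)

# Hypothetical per-class parameters: (mean, variance, prior).
classes = {
    "narrow": (0.0, 1.0, 0.5),
    "wide": (0.0, 9.0, 0.5),
}

def classify(x):
    """Assign x to the class with the largest discriminant score."""
    scores = {k: qda_score_1d(x, *params) for k, params in classes.items()}
    return max(scores, key=scores.get)

for x in (-4.0, 0.5, 4.0):
    print(x, classify(x))
```

Points near 0 go to "narrow" and points far out on either side go to "wide", so the decision region is bounded by two thresholds rather than one. That is the quadratic boundary in one dimension.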

- Why is it called Quadratic then?
  - Well, check the $(\mathbf{x}-{\mu}_{k})^{T}{\Sigma}_{k}^{-1}(\mathbf{x}-{\mu}_{k})$ part. Compared to the LDA discriminant function, $x$ now appears twice in the product, i.e. it has become a 2nd-order function of $x$, aka a quadratic function.
- Note that you can manipulate LDA to **behave similarly to QDA** if you also include the squares and interactions of the inputs, i.e. if you have inputs X1 and X2, perform LDA in a 5-dimensional space with $\{{X}_{1},{X}_{2},{X}_{1}{X}_{2},{X}_{1}^{2},{X}_{2}^{2}\}$.
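The expansion trick can be illustrated without fitting anything. In the sketch below the weights are hand-picked (not estimated by LDA) purely to show that a rule which is linear in the 5 expanded features is quadratic in the original $({X}_{1},{X}_{2})$ space. `expand` and `linear_score` are hypothetical helper names.

```python
def expand(x1, x2):
    """Map (X1, X2) to the 5-dimensional space {X1, X2, X1*X2, X1^2, X2^2}."""
    return [x1, x2, x1 * x2, x1 ** 2, x2 ** 2]

def linear_score(features, weights, intercept):
    """A linear function of the (expanded) inputs."""
    return intercept + sum(w * f for w, f in zip(weights, features))

# Hand-picked weights: the linear rule  X1^2 + X2^2 - 1 = 0  in the expanded
# space is the unit circle in the original (X1, X2) space.
weights = [0.0, 0.0, 0.0, 1.0, 1.0]
intercept = -1.0

inside = linear_score(expand(0.5, 0.5), weights, intercept)   # negative: inside the circle
outside = linear_score(expand(1.0, 1.0), weights, intercept)  # positive: outside the circle
print(inside, outside)
```

An LDA fit in the expanded space would estimate such weights from data instead of taking them as given.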

## Naive Bayes Classifier!

- Special case of LDA/QDA/RDA
- We must remember that LDA/QDA/RDA assume that your input data is **dependent multivariate** data. In the case of Naive Bayes, we just assume that the input matrix X is **INDEPENDENT multivariate** data.
  - What does dependent and independent multivariate data mean?
    - In simple terms, if the p columns of your input dataset are somehow dependent on each other, then the data is dependent (example: scores in a physics and a math class are somewhat correlated). If we assume that each column has no effect whatsoever on any other column, then the data is independent (example: scores in a physics and a singing class are not related at all).
- Mathematically it just means that in the case of Naive Bayes your variance-covariance matrix is a diagonal matrix (i.e. everything except the diagonal is 0).

Visually, for dependent data:

$$VarCovar(X) = \begin{bmatrix}
var(x_{1}) & covar(x_{1}, x_{2}) & \cdots & covar(x_{1}, x_{p}) \\
covar(x_{2}, x_{1}) & var(x_{2}) & \cdots & covar(x_{2}, x_{p}) \\
\vdots & \vdots & \ddots & \vdots \\
covar(x_{p}, x_{1}) & covar(x_{p}, x_{2}) & \cdots & var(x_{p})
\end{bmatrix}$$

And for independent data (Naive Bayes assumption) it is as follows:

$$VarCovar(X) = \begin{bmatrix}
var(x_{1}) & 0 & \cdots & 0 \\
0 & var(x_{2}) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & var(x_{p})
\end{bmatrix}$$
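With a diagonal variance-covariance matrix the multivariate normal density factorizes into a product of univariate normals, one per input. A minimal sketch of a Gaussian Naive Bayes decision under that assumption follows; the class parameters are invented and the function names are hypothetical.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def naive_bayes_log_posterior(x, means, variances, prior):
    """Unnormalized log posterior for one class under the diagonal-covariance
    assumption: the joint likelihood is a product of per-input densities,
    so the log-likelihood is a sum."""
    log_lik = sum(log_normal_pdf(xj, m, v)
                  for xj, m, v in zip(x, means, variances))
    return math.log(prior) + log_lik

# Hypothetical per-class parameters: (means, variances, prior).
params = {
    "A": ([0.0, 0.0], [1.0, 1.0], 0.5),
    "B": ([3.0, 3.0], [1.0, 1.0], 0.5),
}

def predict(x):
    """Pick the class with the highest (unnormalized) log posterior."""
    scores = {k: naive_bayes_log_posterior(x, *p) for k, p in params.items()}
    return max(scores, key=scores.get)

print(predict([0.2, -0.1]), predict([2.8, 3.5]))
```

Only $p$ variances per class are needed here, instead of the $p(p+1)/2$ distinct entries of a full variance-covariance matrix.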

### Statistical vs Computing based classification approaches

- Statistical techniques are more rigorous, but require more assumptions
- Statistical techniques involve **probability distributions**, whereas computing procedures involve an **objective function**

## Computing/Optimisation Based Procedures

### Separating Hyperplanes

### SVM

- Sensitive to data
- Data critical
- SVMs are bad in higher dimensions. Hence we do dimensionality reduction in standard cases and make it work

TODO ADD MORE

## Decision Trees

- Complexity of trees:
  - Number of branches
  - Number of layers (depth)
  - OR count the number of decision points

Objective functions:

- Number of misclassified observations
- Entropy (advanced)
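Both objective functions can be computed for a single tree node from the class labels that fall into it. A small sketch, assuming the node predicts its majority class and entropy is measured in bits (both are assumptions; the notes do not fix either):

```python
import math
from collections import Counter

def misclassified_count(labels):
    """Number of observations in a node that are not in the majority class."""
    counts = Counter(labels)
    return len(labels) - max(counts.values())

def entropy(labels):
    """Shannon entropy (in bits) of the class proportions in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Hypothetical nodes: one pure, one evenly mixed.
pure = ["red"] * 4
mixed = ["red", "red", "blue", "blue"]

print(misclassified_count(pure), entropy(pure))
print(misclassified_count(mixed), entropy(mixed))
```

A split is good when the child nodes score lower than the parent on the chosen objective.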

## Artificial Neural Networks (ANN)

Considerations:

- How many hidden layers?
  - The more non-linear the relation between Y and X, the more layers will help

Each hidden unit $m$ computes $\sigma ({\alpha}_{0m}+{\alpha}_{m}^{T}\overrightarrow{X})$:

- ${\alpha}_{0m}$ is the intercept and ${\alpha}_{m}^{T}\overrightarrow{X}$ is weights times X (same as linear regression).
- We first perform the linear combination and then apply the non-linear transformation.
- $\sigma $ is some non-linear transformation (sigmoid etc.).
- The arrows in the ANN diagram between two layers are dot products (weighted sums).
- It might involve tuning parameters.
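The "linear combination, then non-linear transformation" step can be sketched for one hidden layer. The weights below are made up, the sigmoid is just one possible choice of $\sigma$, and `hidden_layer` is a hypothetical helper name.

```python
import math

def sigmoid(v):
    """A common choice for the non-linear transformation sigma."""
    return 1.0 / (1.0 + math.exp(-v))

def hidden_layer(x, alpha0, alpha):
    """Compute sigma(alpha_0m + alpha_m^T x) for every hidden unit m.

    alpha0 : list of M intercepts
    alpha  : list of M weight vectors, each the same length as x
    """
    z = []
    for a0m, am in zip(alpha0, alpha):
        linear = a0m + sum(w * xi for w, xi in zip(am, x))  # linear combination
        z.append(sigmoid(linear))                            # non-linear transformation
    return z

# Hypothetical weights for 2 inputs and 3 hidden units.
x = [1.0, 2.0]
alpha0 = [0.0, -1.0, 0.5]
alpha = [[0.0, 0.0], [1.0, 0.0], [-0.5, 0.25]]
z = hidden_layer(x, alpha0, alpha)
print(z)
```

Each entry of `z` corresponds to one node of the hidden layer, and each weight vector in `alpha` corresponds to the bundle of arrows pointing into that node.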

## Kernel based classification

$$K(x,y)=\langle \varphi (x),\varphi (y)\rangle $$

- don't really care about this
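For reference, one concrete instance of the identity: the degree-2 polynomial kernel on 2-D inputs, with its explicit feature map. The choice of kernel and the names `phi` and `poly2_kernel` are assumptions picked for illustration.

```python
import math

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return [x1 ** 2, math.sqrt(2) * x1 * x2, x2 ** 2]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """K(x, y) = (x . y)^2, computed without ever building phi."""
    return dot(x, y) ** 2

x, y = [1.0, 2.0], [3.0, 4.0]
lhs = poly2_kernel(x, y)     # (1*3 + 2*4)^2
rhs = dot(phi(x), phi(y))    # <phi(x), phi(y)>
print(lhs, rhs)
```

Both sides agree up to floating-point rounding, which is the point of a kernel: the inner product in the feature space is obtained without mapping the inputs there.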

## KNN Classification

- The lower the K, the more sensitive the classification becomes
- For larger values of K, the smoother and less sensitive the classification becomes
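The effect of K can be seen on a toy example. This is a sketch with invented data, using Euclidean distance and a plain majority vote; `knn_predict` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    train : list of (point, label) pairs, where point is a sequence of floats
    """
    by_distance = sorted(train, key=lambda pl: math.dist(pl[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set: one "red" outlier sits inside the "blue" cluster.
train = [
    ([0.0, 0.0], "blue"), ([0.0, 1.0], "blue"), ([1.0, 0.0], "blue"),
    ([1.0, 1.0], "blue"), ([0.5, 0.5], "red"),
    ([5.0, 5.0], "red"), ([5.0, 6.0], "red"), ([6.0, 5.0], "red"),
]
query = [0.6, 0.6]

print(knn_predict(train, query, k=1))
print(knn_predict(train, query, k=5))
```

With K = 1 the single outlier decides the prediction, while with K = 5 it is outvoted by the surrounding cluster.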

## Cross Validation

Exam Q:

- How many regression models do we need to build to locate the best $h$? (number of candidate $h$ values) $\times$ $n$, i.e. $n$ fits for every candidate $h$
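The count can be checked with a small sketch. It rests on several assumptions that are not in the notes: $h$ is the bandwidth of a kernel regression (a Nadaraya-Watson estimator with a Gaussian kernel is used here), $n$ is the number of observations, and the cross-validation is leave-one-out. The data and candidate values are invented.

```python
import math

def nw_predict(x0, xs, ys, h):
    """Nadaraya-Watson estimate at x0 with a Gaussian kernel of bandwidth h."""
    weights = [math.exp(-0.5 * ((x0 - xi) / h) ** 2) for xi in xs]
    return sum(w * yi for w, yi in zip(weights, ys)) / sum(weights)

def loocv_best_h(xs, ys, candidates):
    """Return (best h, number of fits) using leave-one-out cross-validation."""
    n = len(xs)
    fits = 0
    best_h, best_err = None, float("inf")
    for h in candidates:
        sq_err = 0.0
        for i in range(n):
            # Refit on all points except i, then predict the held-out point.
            xs_train = xs[:i] + xs[i + 1:]
            ys_train = ys[:i] + ys[i + 1:]
            sq_err += (ys[i] - nw_predict(xs[i], xs_train, ys_train, h)) ** 2
            fits += 1
        if sq_err / n < best_err:
            best_h, best_err = h, sq_err / n
    return best_h, fits

# Hypothetical data and candidate bandwidths.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.1, 2.9, 4.2, 5.0]
candidates = [0.5, 1.0, 2.0]
best_h, fits = loocv_best_h(xs, ys, candidates)
print(best_h, fits)
```

The `fits` counter ends at `len(candidates) * len(xs)`, matching the count in the exam answer under these assumptions.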

### MISE (mean integrated squared error)

$$MISE=E\left(\int (f(x)-\hat{f}(x))^{2}dx\right)=\int Bias^{2}(\hat{f}(x))dx+\int Var(\hat{f}(x))dx$$

We use this because $x$ is continuous (i.e. we no longer have individual discrete datapoints of X that we can sum; it's continuous, so we integrate).
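The step from the expected integrated error to the bias and variance integrals rests on the pointwise bias-variance decomposition. A sketch, for a fixed $x$:

$$E\left[(\hat{f}(x)-f(x))^{2}\right]=\left(E[\hat{f}(x)]-f(x)\right)^{2}+E\left[\left(\hat{f}(x)-E[\hat{f}(x)]\right)^{2}\right]=Bias^{2}(\hat{f}(x))+Var(\hat{f}(x))$$

Integrating both sides over $x$, and assuming the expectation and the integral can be swapped, gives the two-term form of the MISE.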

MISE and AISE are too complicated to minimize directly. Hence most computer packages use cross-validation.

### Select best kernel

Go through all kernels: for each kernel, try the N candidate values of $h$ to find its best $h$, then compare the kernels at their best $h$.