# [NOTES] Classification Method - P1


Download the PDF from here:

## Overview

Statistical Classifiers

- Bayes Classifiers
- LDA/QDA/RDA
- Naive Bayes (assumes no correlation between inputs x_i)

- Frequency Domain Classifiers (TODO: make flashcards on these!)
- Logistic Regression

- Computing/Optimizing classifiers
- SVM/Decision Tree/ANN...

**Note on Freq vs Bayes:**
Bayes classifiers consider priors. Frequency-domain classifiers **don't consider priors**

## Bayes Classifier

- classifies a case into its most probable class
- based on the posterior conditional distribution
- Logistic regression is **not a Bayes procedure; it's a frequency-domain procedure**

$$Pr(G|X=x)$$

G has a discrete distribution, with probability mass assigned to a small number of outcomes (the classes)

### Terminology

- Prior: previously observed probability of a class $k$
- e.g. we have 5 classes

$$Pr(G=1)=0.32 \quad \text{(say class 1 is "blue")}$$

- Ways to get priors:
  - Empirical studies (like just running `value_counts()` on your dataframe column)

Likelihood: the data distribution!

Posterior: drives the classification decisions.

We have: $\text{posterior} \propto \text{prior} \times \text{likelihood}$

## Bayes Theorem

Posterior Probability is given by: $$Pr(G=k|X=x)$$ where

- $k$ indexes the classes (e.g. Blue, Red, ...)
- x is input variables

We are essentially checking the probability of class $G = k$ given input variables $X = x$.

Classification decision:

Assign a "new" observation with inputs $x_{new}$ to class $j$ if $Pr(G=j | x)$ is the highest among $Pr(G=k | x)$ for $k = 1, 2, \cdots, K$.

Bayes Theorem is given by: $$Pr(G=k|X=x)=\frac{Pr(G=k\cap X=x)}{Pr(X=x)}=\frac{Pr(X=x|G=k)\,Pr(G=k)}{Pr(X=x)}$$

Input variable is a RANDOM VARIABLE (uppercase). Observations are denoted in lowercase.

Note: We have $Pr(X = x)$ as a normalizing factor in denominator.
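The theorem above can be checked with a quick numeric sketch (all numbers below are made up for illustration):

```python
# Toy Bayes-theorem check with two classes and a single observed input x.
priors = {"blue": 0.6, "red": 0.4}       # Pr(G = k)
likelihoods = {"blue": 0.2, "red": 0.5}  # Pr(X = x | G = k) at the observed x

# Denominator Pr(X = x): the normalizing factor (total probability over classes).
evidence = sum(priors[k] * likelihoods[k] for k in priors)

# Posterior Pr(G = k | X = x) for each class.
posteriors = {k: priors[k] * likelihoods[k] / evidence for k in priors}

print(posteriors)                            # posteriors sum to 1
print(max(posteriors, key=posteriors.get))   # Bayes classifier picks the argmax
```

Note how the class with the larger prior ("blue") still loses here because its likelihood at the observed $x$ is much smaller.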

# Discriminant Analysis (LDA/QDA/RDA)

### Model Assumptions:

- **Multivariate normality**: for each class $G = k$, the $p$-dimensional input vector $X$ follows a multivariate normal distribution with class-conditional density $f_k(x) = Pr(X = x \mid G = k)$
- **Homogeneity**: for each class $G = k$, $k = 1, 2, \ldots, K$, the inputs are treated as random variables from that class's distribution
- **Independence**: the observations within each class $k$ are independent of each other

The Multivariate normal probability distribution function takes form:

$$f_{k}(x)=\frac{1}{(2\pi)^{p/2}\,|\Sigma_{k}|^{1/2}}\,e^{-\frac{1}{2}(x-\mu_{k})^{T}\Sigma_{k}^{-1}(x-\mu_{k})}$$ Here $p$ is the number of predictors and $k$ is the class index. $\Sigma_k$ is the variance-covariance matrix and $\mu_k$ is the mean vector of class $k$.

- In **LDA** we assume that $\Sigma_{k} = \Sigma$, i.e. the **variance-covariance matrix** is the same for EVERY class.
- In **QDA** the variance-covariance matrix $\Sigma_{k}$ is different for each class.
- In **RDA** the variance-covariance matrix is a function of a tuning parameter $\alpha$ and can be given like so:

$$\Sigma_k(\alpha)=\alpha \Sigma_{k}+(1-\alpha)\Sigma$$

where $\Sigma$ is the shared (pooled) covariance matrix, so RDA interpolates between QDA ($\alpha = 1$) and LDA ($\alpha = 0$).

**Numerical Illustration**

[1] $k = 1, 2, \cdots, K$ (say $K = 5$ classes)

[2] $X = [x_1, x_2, x_3]$ say 3 input variables (rich, sleep, ) where, $$x_{1}=\begin{bmatrix}x_{11}\\ x_{12}\\ \vdots \\ x_{1n}\end{bmatrix}$$

Now we get a mean for class k =1: $\mu_{k=1} = [E(X_1), E(X_2), E(X_3)]$

This is the mean of x for class k = 1.

Then, $$\Sigma_{k=1}=\begin{bmatrix}Var(X_{1}) & Cov(X_{1},X_{2}) & Cov(X_{1},X_{3})\\ Cov(X_{1},X_{2}) & Var(X_{2}) & Cov(X_{2},X_{3})\\ Cov(X_{1},X_{3}) & Cov(X_{2},X_{3}) & Var(X_{3})\end{bmatrix}$$

Dimensions of $\Sigma_k$ are 3x3, i.e. $p \times p$ where $p$ is the number of input variables (columns in your dataset), not the number of observations. Here the existence of covariance terms means that there can be dependence between any two x variables.
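The class mean vector and covariance matrix above can be estimated directly from data; a minimal sketch with made-up numbers (the shapes are the point here):

```python
import numpy as np

# Estimate mu_k and Sigma_k for one class from its observations.
# X_k: n = 50 observations (rows) of class k, p = 3 input variables (columns).
rng = np.random.default_rng(0)
X_k = rng.normal(size=(50, 3))          # made-up data for class k

mu_k = X_k.mean(axis=0)                 # shape (3,): [E(X1), E(X2), E(X3)]
Sigma_k = np.cov(X_k, rowvar=False)     # shape (3, 3): p x p, NOT n x n

print(mu_k.shape, Sigma_k.shape)
```

`rowvar=False` tells NumPy that the variables are in the columns; the result is symmetric with variances on the diagonal and covariances off it.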

### Exam Question:

Sample sizes (numbers of student observations) in each class can be different! For each class, the mean vector and variance-covariance matrix can be different.

Different classes will therefore have different data likelihoods!

### LDA

In comparing 2 classes we set their posterior probabilities ($P(G=k|X=x)$) equal. Then, upon simplification, we get the following decision boundary.

**Decision Boundary:** The discriminant function for LDA takes the following form:

$$\delta_{k}(x)=\log(\pi_{k})-\frac{1}{2}\mu_{k}^{T}\Sigma^{-1}\mu_{k}+x^{T}\Sigma^{-1}\mu_{k}$$

Here $\pi_{k}$ is the **prior probability of class $k$**; comparing classes $k$ and $l$ (i.e. setting $\delta_k(x) = \delta_l(x)$) is where a $\log(\pi_k/\pi_l)$ term appears. Note that we can get the prior probabilities for each class directly from our data.

- Note that the discriminant function is a linear function of our input $X = x$.
- We estimate the mean ($\mu_k$) and the shared covariance ($\Sigma$) from our input data.
- Okay, but what does the decision boundary even represent?
  - Well, since it's a linear function of $x$, it is a $(p-1)$-dimensional hyperplane in a $p$-dimensional space.
- Also note how the decision boundary above does not contain $\Sigma_{k}$! This is because of the assumption we made above that all classes share the same variance-covariance matrix. Boom!
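The discriminant above is easy to compute by hand; a minimal sketch (my own toy implementation with made-up means and priors, not a library call):

```python
import numpy as np

# LDA discriminant: delta_k(x) = log(pi_k) - 0.5 mu_k^T S^-1 mu_k + x^T S^-1 mu_k
def lda_discriminant(x, mu_k, Sigma_inv, pi_k):
    return (np.log(pi_k)
            - 0.5 * mu_k @ Sigma_inv @ mu_k
            + x @ Sigma_inv @ mu_k)

# Two classes sharing ONE covariance matrix -- the LDA assumption.
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
mu = {"blue": np.array([0.0, 0.0]), "red": np.array([2.0, 2.0])}
pi = {"blue": 0.5, "red": 0.5}

x_new = np.array([1.8, 1.9])    # a point near the red mean
scores = {k: lda_discriminant(x_new, mu[k], Sigma_inv, pi[k]) for k in mu}
print(max(scores, key=scores.get))   # classify into the argmax class
```

Note that `Sigma_inv` is computed once and reused for every class, which is exactly what the shared-covariance assumption buys you.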

### QDA

Now remember that QDA has no such funny business of assuming the variance-covariance matrix is the same for each class. Hence its decision boundary becomes slightly more complicated and looks as follows:

$$\delta_{k}(\mathbf{x})=-\frac{1}{2}\log|\Sigma_{k}|-\frac{1}{2}(\mathbf{x}-\mu_{k})^{T}\Sigma_{k}^{-1}(\mathbf{x}-\mu_{k})+\log\pi_{k}$$
**REMEMBER SANDWICH! $\Sigma_k^{-1}$ is caught between the $(x-\mu_k)$ terms.**

This sandwich actually stems from the scaling property of variance.

- Why is this called Quadratic then?
  - Check the $(\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)$ part. Compared to the LDA decision boundary, $x$ now appears twice in the equation, i.e. it has become a 2nd-order function, aka a quadratic function.

- Note that you can make LDA **behave similarly to QDA** if you just take interactions between inputs, i.e. if you have inputs $X_1$ and $X_2$, perform LDA in a 5-dimensional space with $\{X_{1}, X_{2}, X_{1}X_{2}, X_{1}^2, X_{2}^2\}$.
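The QDA discriminant can be sketched the same way as the LDA one (toy means, covariances, and priors of my own choosing):

```python
import numpy as np

# QDA discriminant: each class keeps its OWN covariance matrix.
# delta_k(x) = -0.5 log|S_k| - 0.5 (x-mu_k)^T S_k^-1 (x-mu_k) + log(pi_k)
def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    diff = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)        # numerically stable log|Sigma_k|
    maha = diff @ np.linalg.inv(Sigma_k) @ diff   # the "sandwich" term
    return -0.5 * logdet - 0.5 * maha + np.log(pi_k)

# Toy setup: the classes differ in covariance, not just in mean.
mu = {"a": np.zeros(2), "b": np.array([3.0, 0.0])}
Sigma = {"a": np.eye(2), "b": np.array([[2.0, 0.5], [0.5, 1.0]])}
pi = {"a": 0.5, "b": 0.5}

x_new = np.array([2.9, 0.1])
scores = {k: qda_discriminant(x_new, mu[k], Sigma[k], pi[k]) for k in mu}
print(max(scores, key=scores.get))
```

Unlike the LDA sketch, `logdet` and the inverse are recomputed per class because $\Sigma_k$ now differs between classes.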

## Naive Bayes Classifier!

- Special case of LDA/QDA/RDA
- We must remember that LDA/QDA/RDA assume that your input data is **dependent multivariate** data. In the case of Naive Bayes, we just assume that the input matrix X is **INDEPENDENT multivariate** data.
- What does dependent and independent multivariate data mean?
  - In simple terms, if the p columns of your input dataset are somehow dependent on each other, then it's dependent (example: scores in a physics and a math class are somewhat correlated). If we assume that each column has no effect on any other column whatsoever, then the assumption is independence (example: scores in a physics and a singing class are not related at all).
  - Mathematically it just means that in the case of Naive Bayes your variance-covariance matrix is a diagonal matrix (i.e. except for the diagonal, everything else is 0).


Visually, for dependent data:

$$VarCovar(X) = \begin{bmatrix} var(x_{1}) & covar(x_{1}, x_{2}) & \cdots & covar(x_{1}, x_{p}) \\ covar(x_{2}, x_{1}) & var(x_{2}) & \cdots & covar(x_{2}, x_{p}) \\ \vdots & & \ddots & \vdots \\ covar(x_{p}, x_{1}) & covar(x_{p}, x_{2}) & \cdots & var(x_{p}) \end{bmatrix}$$

And for independent data (the Naive Bayes assumption) it is as follows:

$$VarCovar(X) = \begin{bmatrix} var(x_{1}) & 0 & \cdots & 0 \\ 0 & var(x_{2}) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & var(x_{p}) \end{bmatrix}$$
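Because the covariance matrix is diagonal, the joint density factorizes into a product of 1-D normals, so the log-posterior is just a sum. A minimal Gaussian Naive Bayes sketch (all means, variances, and priors below are made up):

```python
import numpy as np

# 1-D normal log-density for a single feature.
def log_gaussian(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

# Naive Bayes: independence => sum of per-feature log-densities
# (i.e. the log of the product of 1-D normals), plus the log-prior.
def nb_log_posterior(x, means, variances, prior):
    return np.log(prior) + np.sum(log_gaussian(x, means, variances))

# Two classes, two features -- e.g. [exam score, hours of sleep], made up.
means = {"pass": np.array([70.0, 7.0]), "fail": np.array([40.0, 5.0])}
variances = {"pass": np.array([100.0, 1.0]), "fail": np.array([150.0, 2.0])}
priors = {"pass": 0.7, "fail": 0.3}

x_new = np.array([65.0, 6.5])
scores = {k: nb_log_posterior(x_new, means[k], variances[k], priors[k])
          for k in means}
print(max(scores, key=scores.get))
```

Notice there is no matrix inverse anywhere: the diagonal assumption reduces the "sandwich" term to a sum of scaled squared differences.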

### Statistical vs Computing based classification approaches

- Statistical techniques are more rigorous but require more assumptions
- Statistical techniques involve **probability distributions**, whereas computing procedures involve an **objective function**

## COMPUTING OPTIMISATION BASED PROCEDURES

### Separating Hyperplanes

### SVM

- Sensitive to data
- Data critical
- SVMs struggle in higher dimensions. Hence in standard cases we do dimensionality reduction first and make it work

TODO ADD MORE

## Decision Trees

- Complexity of trees:
- Number of branches
- Number of layers (depth)
- OR count the number of decision points

Objective functions:

- Number of misclassified
- Entropy (advanced)

$$C_{\alpha}(T)=\sum_{m=1}^{M}\ \sum_{x_i \in R_m}(y_{i}-c_{m})^{2}+\alpha |T|$$

where $R_m$ are the terminal regions, $c_m$ is the mean response in region $m$, and $|T|$ is the number of terminal nodes.
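The cost-complexity criterion above can be computed directly for a toy partition (the regions, y-values, and $\alpha$ below are arbitrary made-up numbers):

```python
# Cost-complexity of a regression tree: within-region sum of squared errors
# plus a penalty alpha * |T| on the number of terminal regions.
def cost_complexity(regions, alpha):
    # regions: list of lists of y-values, one inner list per terminal region R_m
    sse = 0.0
    for ys in regions:
        c_m = sum(ys) / len(ys)              # region prediction = region mean
        sse += sum((y - c_m) ** 2 for y in ys)
    return sse + alpha * len(regions)        # |T| = number of terminal regions

regions = [[1.0, 1.2, 0.8], [5.0, 5.5]]      # two terminal regions (made up)
print(cost_complexity(regions, alpha=0.5))
```

Increasing $\alpha$ penalizes larger trees more, which is how pruning trades fit against complexity.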

## Artificial Neural Networks (ANN)

Considerations:

- How many hidden layers?
- The more non-linear the relation between Y and X, the more additional layers will help

$${Z}_{m}=\sigma ({\alpha}_{0m}+{\alpha}_{m}^{T}\overrightarrow{X})$$

- $\alpha_{0m}$ is the intercept and $\alpha_m^T\vec{X}$ is weight times X (same as linear regression).
- We first perform linear-combination and then do non-linear transformation.
- $\sigma$ is some non-linear transformation (sigmoid etc).

$$\sigma (v)=\frac{1}{(1+{e}^{-v})}$$

- The arrows in the ANN diagram between two layers are dot-products (weighted sums)

$$x_{j}=\begin{bmatrix}x_{j1}\\ x_{j2}\\ \vdots \\ x_{jn}\end{bmatrix}$$

- It might involve tuning parameters

$$\beta_{km}^{(r+1)}=\beta_{km}^{(r)}-\gamma_{r}\sum_{i=1}^{N}\frac{\partial R_{i}}{\partial \beta_{km}^{(r)}}$$
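The hidden-layer step ($Z_m = \sigma(\alpha_{0m} + \alpha_m^T X)$) is a one-liner in NumPy; a minimal forward-pass sketch with random made-up weights:

```python
import numpy as np

# One hidden layer forward pass: linear combination FIRST,
# then the non-linear sigma transformation, exactly as in Z_m above.
def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(42)
X = rng.normal(size=(4,))          # one observation with 4 inputs
alpha0 = rng.normal(size=(3,))     # intercepts, one per hidden unit
alpha = rng.normal(size=(3, 4))    # weights: 3 hidden units x 4 inputs

Z = sigmoid(alpha0 + alpha @ X)    # hidden activations Z_1..Z_3
print(Z)                           # each entry lies in (0, 1)
```

The matrix product `alpha @ X` is exactly the "arrows as dot-products" note above: each hidden unit's input is a weighted sum of every input variable.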

## Kernel based classification

$$K(x,y)=\langle \varphi(x),\varphi(y)\rangle$$

- don't really care about this

## KNN Classification

- the lower the K, the more sensitive the classification will become
- for larger values of K, the smoother (and more stable) the decision boundary becomes, at the cost of more bias
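The K sensitivity is easy to see with a bare-bones implementation (my own sketch; the training points below are made up, with one "red" outlier planted inside the blue region):

```python
import numpy as np

# Bare-bones KNN: majority vote among the k nearest training points.
def knn_predict(X_train, y_train, x, k):
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[0.0], [0.2], [0.4], [0.5], [3.0], [3.2], [3.4]])
y_train = np.array(["blue", "blue", "blue", "red", "red", "red", "red"])

x = np.array([0.48])                           # query next to the red outlier
print(knn_predict(X_train, y_train, x, k=1))   # k=1 hugs the outlier -> "red"
print(knn_predict(X_train, y_train, x, k=5))   # k=5 is smoothed -> "blue"
```

With k=1 the prediction flips to whatever single point happens to be closest; with k=5 the wider neighbourhood outvotes the outlier.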

## Cross Validation

Exam Q:

- How many regression models do we need to build in locating the best h?

h * n (i.e. the number of candidate bandwidth values h, times n models for leave-one-out cross-validation over the n observations)

### MISE (mean integrated squared error)

$$MISE=E\left(\int (\hat{f}(x)-f(x))^{2}\,dx\right)$$

$$=\int Bias^{2}(\hat{f}(x))\,dx+\int Var(\hat{f}(x))\,dx$$

We use this because it's continuous (i.e. we no longer have individual discrete datapoints of X that we can sum over; the domain is continuous, so we integrate).

MISE and AISE are too complicated to minimize directly. Hence most computer packages use cross-validation.

### Select best kernel

Go through all the kernels, each with its N candidate bandwidths h, and compare the kernels at their respective best h.