[NOTES] Classification Method - P1

18 Feb, 2024

Overview

Statistical Classifiers

Bayes Classifiers
- LDA/QDA/RDA
- Naive Bayes (assumes no correlation between inputs x_i)
Frequentist Classifiers
- Logistic Regression
Computing/Optimizing classifiers
- SVM/Decision Tree/ANN...

Note on Freq vs Bayes: Bayes considers priors. Frequentist methods don't consider priors

Bayes Classifier

Classifies each case into its most probable class
Works from the posterior (conditional) distribution of the class given the inputs
Logistic regression is not a Bayes procedure. It's a frequentist procedure

$$ Pr(G \mid X = x) $$

G has a discrete distribution, with probability mass assigned to a few outcomes (the classes)

Terminology

Prior: previously observed probability of a class $k$, i.e. $Pr(G = k)$
- e.g. we have 5 classes and $Pr(G = 1) = 32\%$, where class 1 is blue
- Ways to get priors:
-- Empirical studies (like just calling pandas value_counts(normalize=True) on the class column of your dataframe)
likelihood: Data distribution!
posterior: Classification decisions

We have: $posterior \propto prior \times likelihood$ (the missing constant is the normalizing factor $Pr(X = x)$)
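A minimal sketch of getting empirical priors by counting class labels. The label list is made up for illustration. With pandas, the same numbers come from `value_counts(normalize=True)` on the class column.

```python
from collections import Counter

# Hypothetical class labels for 10 training observations.
labels = ["blue", "red", "blue", "green", "blue",
          "red", "blue", "red", "green", "blue"]

counts = Counter(labels)
n = len(labels)

# Empirical prior: share of observations that fall in each class.
priors = {k: c / n for k, c in counts.items()}
print(priors)
```

The priors always sum to 1, and they are the only place where class imbalance enters the Bayes classifier.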

Bayes Theorem

Posterior Probability is given by:

$$ Pr(G = k \mid X = x) $$

where

k indexes the classes (e.g. Blue, Red)
x is input variables

We are essentially checking the probability of class $G = k$ given input variables $X = x$ .

Classification decision:

Assign a "new" observation with inputs $x_{new}$ to class $j \in \{1, 2, \dots, K\}$ if $Pr(G = j \mid X = x_{new})$ is the highest among $Pr(G = k \mid X = x_{new})$, $k = 1, 2, \dots, K$.

Bayes Theorem is given by:

$$ Pr(G = k \mid X = x) = \frac{Pr(G = k \cap X = x)}{Pr(X = x)} = \frac{Pr(X = x \mid G = k) \, Pr(G = k)}{Pr(X = x)} $$

Input variable is a RANDOM VARIABLE (uppercase). Observations are denoted in lowercase.

Note: We have $P r (X = x)$ as a normalizing factor in denominator.
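A small numeric sketch of Bayes theorem plus the classification decision. The priors and likelihoods are invented numbers for one observed x, not taken from any dataset.

```python
# Hypothetical priors Pr(G = k) and likelihoods Pr(X = x | G = k) for one observed x.
priors = {"blue": 0.5, "red": 0.3, "green": 0.2}
likelihoods = {"blue": 0.10, "red": 0.40, "green": 0.25}

# Numerator of Bayes theorem for each class: likelihood * prior.
joint = {k: likelihoods[k] * priors[k] for k in priors}

# Pr(X = x): the normalizing factor, summed over all classes.
evidence = sum(joint.values())

posteriors = {k: joint[k] / evidence for k in joint}

# Classification decision: pick the class with the highest posterior.
predicted = max(posteriors, key=posteriors.get)
print(posteriors, predicted)
```

Here blue has the largest prior, but red wins because its likelihood at this x is much higher. The denominator is the same for every class, so it never changes which class wins.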

Discriminant Analysis (LDA/QDA/RDA)

Model Assumptions:

Multivariate Normal Distribution assumed for the joint distribution of the p-dimensional X-variables within each class, with density $f_{k}(x) = Pr(X = x \mid G = k)$
Homogeneity: for each class $G = k$, $k = 1, 2, \dots, K$, the inputs are treated as random variables
Independence: the inputs (observations) for each class $k$ are distinct from those of the other classes

The Multivariate normal probability distribution function takes form:

$$ f_{k}(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_{k}|^{1/2}} e^{-\frac{1}{2} (x - \mu_{k})^{T} \Sigma_{k}^{-1} (x - \mu_{k})} $$

Here $p$ is the number of predictors and $k$ is the class index. $\Sigma_{k}$ is the variance-covariance matrix and $\mu_{k}$ is the mean vector of class $k$.
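A sketch of the density above in NumPy, assuming `mu` is a length-p vector and `sigma` a p x p positive-definite matrix. The function name and test values are made up for illustration.

```python
import numpy as np

def mvn_density(x, mu, sigma):
    """Multivariate normal density f_k(x) for mean vector mu and covariance sigma."""
    p = len(mu)
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed with a linear solve.
    quad = diff @ np.linalg.solve(sigma, diff)
    norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 0.0])
sigma = np.eye(2)

# At the mean of a standard bivariate normal the density is 1 / (2 * pi).
peak = mvn_density(np.array([0.0, 0.0]), mu, sigma)
print(peak)
```

Using `np.linalg.solve` instead of an explicit inverse is a numerical-stability habit, not something the formula requires.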

In LDA assume that $\Sigma_{k} = \Sigma$, i.e. the variance-covariance matrix is the same for EVERY class.
In QDA the variance-covariance matrix $\Sigma_{k}$ is different for each class.
In RDA the variance-covariance matrix is a function of $\alpha$ (tuning param) that blends the class matrix $\Sigma_{k}$ with the shared matrix $\Sigma$: $$ \Sigma_{k}(\alpha) = \alpha\Sigma_{k} + (1-\alpha)\Sigma $$

Numerical Illustration

[1] $k = 1, 2, \dots, K$ with $K = 5$ classes

[2] $X = [x_{1}, x_{2}, x_{3}]$ say 3 input variables (rich, sleep, ) where,

$$ x_{1} = \begin{bmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1n} \end{bmatrix} $$

Now we get a mean for class k =1: $μ_{k = 1} = [E (X_{1}), E (X_{2}), E (X_{3})]$

This is the mean of x for class k = 1.

Then,

$$ \Sigma_{k=1} = \begin{bmatrix} Var(X_{1}) & Cov(X_{1}, X_{2}) & Cov(X_{1}, X_{3}) \\ Cov(X_{1}, X_{2}) & Var(X_{2}) & Cov(X_{2}, X_{3}) \\ Cov(X_{1}, X_{3}) & Cov(X_{2}, X_{3}) & Var(X_{3}) \end{bmatrix} $$

Dimensions of $\Sigma_{k}$ are 3x3 (i.e. p x p where p is the number of input columns in your dataset, not the number of observations!). Here the existence of covariance terms means that there can be dependence between any two x variables.
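A sketch of estimating the class means and variance-covariance matrices, then building the shared (LDA) matrix and an RDA blend. The data is random and hypothetical. The pooled estimator with an N - K divisor and the blend $\alpha\Sigma_k + (1-\alpha)\Sigma$ are the standard textbook forms, assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 2 classes, 3 inputs, different sample sizes per class.
X1 = rng.normal(size=(40, 3))
X2 = rng.normal(loc=1.0, size=(60, 3))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# rowvar=False: columns are variables, rows are observations.
S1 = np.cov(X1, rowvar=False)   # QDA uses one matrix per class
S2 = np.cov(X2, rowvar=False)

# Pooled (LDA) covariance: weight by degrees of freedom, divide by N - K.
n1, n2, K = len(X1), len(X2), 2
S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - K)

def rda_cov(S_k, S, alpha):
    """RDA blend: alpha = 1 gives the QDA matrix, alpha = 0 gives the LDA matrix."""
    return alpha * S_k + (1 - alpha) * S

S1_rda = rda_cov(S1, S_pooled, alpha=0.5)
print(S1.shape, S_pooled.shape, S1_rda.shape)
```

Each matrix is 3x3 regardless of how many observations each class has.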

Exam Question:

Sample size for students (observations) in each class can be different! For each class the mean value and variance-covariance value can be different.

Different class will have different data likelihood!

LDA

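In comparing 2 classes $k$ and $l$ we set their posterior probabilities ( $P (G = k | X = x)$ and $P (G = l | X = x)$ ) to be equal. Then upon simplification we get the following decision boundary.

Discriminant function: for LDA, class $k$ gets the score

$$ \delta_{k}(x) = x^{T} \Sigma^{-1} \mu_{k} - \frac{1}{2} \mu_{k}^{T} \Sigma^{-1} \mu_{k} + \log \pi_{k} $$

Decision Boundary: the set of $x$ where $\delta_{k}(x) = \delta_{l}(x)$, i.e.

$$ \log \left( \frac{\pi_{k}}{\pi_{l}} \right) - \frac{1}{2} (\mu_{k} + \mu_{l})^{T} \Sigma^{-1} (\mu_{k} - \mu_{l}) + x^{T} \Sigma^{-1} (\mu_{k} - \mu_{l}) = 0 $$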

Now here $π_{k}$ and $π_{l}$ are the prior probabilities for class k and l. Note that we can get the prior-probabilities for each class directly from our data.

Note that the discriminant function is a linear function of our input X = x.
We estimate the class means ( $\mu_{k}$ ) and the shared variance-covariance matrix ( $\Sigma$ ) from our input data.
Okay, but what does the decision boundary even represent?
- Well, if it's a linear function of X then the boundary is a (p-1)-dimensional hyperplane in a p-dimensional space
Also note how in the above decision boundary we do not have $\Sigma_{k}$ !! This is cuz of the assumption we made above that all classes have the same variance-covariance matrix. Boom!
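A sketch of LDA classification using the standard linear discriminant score $x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$. The means, priors and shared covariance below are hypothetical values, not estimates from real data.

```python
import numpy as np

def lda_score(x, mu_k, sigma_inv, prior_k):
    """Linear discriminant: x^T S^-1 mu_k - 0.5 mu_k^T S^-1 mu_k + log(prior_k)."""
    return x @ sigma_inv @ mu_k - 0.5 * mu_k @ sigma_inv @ mu_k + np.log(prior_k)

# Hypothetical two-class setup with one shared covariance matrix.
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
priors = [0.5, 0.5]
sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
sigma_inv = np.linalg.inv(sigma)

def classify(x):
    """Return the index of the class with the highest discriminant score."""
    scores = [lda_score(x, m, sigma_inv, p) for m, p in zip(mus, priors)]
    return int(np.argmax(scores))

print(classify(np.array([0.2, -0.1])), classify(np.array([1.9, 2.3])))
```

With equal priors the midpoint between the two means, (1, 1), gets the same score from both classes, so it lies exactly on the hyperplane.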

QDA

Now remember that QDA has no such funny business of assuming the variance-covariance matrix is the same for each class. Hence its discriminant function becomes slightly more complicated and looks as follows:

$$ \delta_{k}(\mathbf{x}) = -\frac{1}{2} \log |\Sigma_{k}| - \frac{1}{2} (\mathbf{x} - \mu_{k})^{T} \Sigma_{k}^{-1} (\mathbf{x} - \mu_{k}) + \log \pi_{k} $$

REMEMBER SANDWICH! $\Sigma_{k}^{-1}$ is caught between $(\mathbf{x} - \mu_{k})^{T}$ and $(\mathbf{x} - \mu_{k})$. This sandwich actually stems from the scaling property of variance.

Why is this called Quadratic then?
- Well, check the $(\mathbf{x} - \mu_{k})^{T} \Sigma_{k}^{-1} (\mathbf{x} - \mu_{k})$ part. You will notice that, compared to the LDA discriminant function, $x$ now appears twice in the product, i.e. it's become a 2nd order function aka a quadratic function
Note that you can manipulate LDA to behave similarly to QDA if you add squares and interactions of the inputs, i.e. if you have inputs X1 and X2, perform LDA in a 5-dimensional space with $\{X_{1}, X_{2}, X_{1} X_{2}, X_{1}^{2}, X_{2}^{2}\}$ .
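A sketch of the QDA discriminant in NumPy. The two classes are hypothetical: same mean, different spread, which is a case no linear boundary can separate.

```python
import numpy as np

def qda_score(x, mu_k, sigma_k, prior_k):
    """-0.5 log|S_k| - 0.5 (x - mu_k)^T S_k^-1 (x - mu_k) + log(prior_k)."""
    diff = x - mu_k
    _, logdet = np.linalg.slogdet(sigma_k)
    quad = diff @ np.linalg.solve(sigma_k, diff)   # the "sandwich" term
    return -0.5 * logdet - 0.5 * quad + np.log(prior_k)

# Hypothetical classes: identical means, tight vs wide covariance.
mu = np.array([0.0, 0.0])
tight, wide = 0.25 * np.eye(2), 4.0 * np.eye(2)

def classify(x):
    """0 = tight class, 1 = wide class."""
    scores = [qda_score(x, mu, tight, 0.5), qda_score(x, mu, wide, 0.5)]
    return int(np.argmax(scores))

print(classify(np.array([0.1, 0.1])), classify(np.array([3.0, 3.0])))
```

Points near the shared mean go to the tight class and points far away go to the wide class, so the boundary is a circle rather than a hyperplane.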

Naive Bayes Classifier!

Special case of LDA/QDA/RDA
We must remember that LDA/QDA/RDA assumes that your input data is dependent multivariate data. In case of Naive Bayes, we just assume that inputs matrix X is INDEPENDENT multivariate data.
- What does dependent and independent multivariate data mean?
  - In simple terms, if the p columns of your input dataset somehow depend on each other then the data is dependent (example: scores in a physics and a math class are somewhat correlated). If we assume that each column has no effect whatsoever on any other column then the data is independent (example: scores in a physics and a singing class are not related at all).
- Mathematically it just means that in the case of Naive Bayes your variance-covariance matrix is a diagonal matrix (i.e. everything except the diagonal is 0). Visually, for dependent data:

$$VarCovar(X) = \begin{bmatrix} var(x_{1}) & covar(x_{1}, x_{2}) & \cdots & covar(x_{1}, x_{p}) \\ covar(x_{2}, x_{1}) & var(x_{2}) & \cdots & covar(x_{2}, x_{p}) \\ \vdots & \vdots & \ddots & \vdots \\ covar(x_{p}, x_{1}) & covar(x_{p}, x_{2}) & \cdots & var(x_{p}) \end{bmatrix}$$

And for independent data (Naive Bayes assumption) it is as follows:

$$VarCovar(X) = \begin{bmatrix} var(x_{1}) & 0 & \cdots & 0 \\ 0 & var(x_{2}) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & var(x_{p}) \end{bmatrix}$$
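A quick check of what the diagonal matrix buys you: the joint normal density factorizes into a product of one-dimensional normal densities, one per input. All numbers below are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Univariate normal density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def diag_mvn_pdf(x, means, variances):
    """Multivariate normal density written out for a diagonal covariance matrix."""
    p = len(x)
    det = math.prod(variances)  # determinant of a diagonal matrix
    quad = sum((xi - m) ** 2 / v for xi, m, v in zip(x, means, variances))
    return math.exp(-0.5 * quad) / ((2 * math.pi) ** (p / 2) * math.sqrt(det))

x = [1.0, -0.5, 2.0]
means = [0.0, 0.0, 1.0]
variances = [1.0, 2.0, 0.5]

joint = diag_mvn_pdf(x, means, variances)
product = math.prod(normal_pdf(xi, m, v) for xi, m, v in zip(x, means, variances))
print(joint, product)
```

The two numbers agree up to floating-point error, which is why Naive Bayes only needs p means and p variances per class instead of a full p x p matrix.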

Statistical vs Computing based classification approaches

Statistical techniques are more rigorous but require more assumptions
Statistical techniques involve probability distributions, whereas computing procedures involve an objective function

COMPUTING OPTIMISATION BASED PROCEDURES

Separating Hyperplanes

SVM

Sensitive to data
Data critical
SVMs are bad in higher dimensions. Hence we do dimensionality reduction in standard cases and make it work

TODO ADD MORE

Decision Trees

Complexity of trees:
- Number of branches
- Number of layers (depth)
- OR count the number of decision points

Objective functions:

Number of misclassified
Entropy (advanced)

$$ C_{\alpha}(T) = \sum_{m=1}^{M} \sum_{x_{i} \in R_{m}} (y_{i} - c_{m})^{2} + \alpha |T| $$
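A sketch of the cost-complexity objective, assuming each leaf (region $R_m$) predicts the mean of its y-values as $c_m$ and $|T|$ is the number of leaves. The two trees are hypothetical groupings of the same 6 observations.

```python
def cost_complexity(leaves, alpha):
    """Sum of squared errors over leaves plus alpha times the number of leaves.

    `leaves` is a list of lists: the y-values that fall in each region R_m.
    """
    total = 0.0
    for ys in leaves:
        c_m = sum(ys) / len(ys)  # fitted constant for the region
        total += sum((y - c_m) ** 2 for y in ys)
    return total + alpha * len(leaves)

# Hypothetical trees on the same 6 observations.
small_tree = [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]]      # 2 leaves
big_tree = [[1.0], [2.0, 3.0], [7.0, 8.0], [9.0]]    # 4 leaves

for alpha in (0.0, 2.0):
    print(alpha, cost_complexity(small_tree, alpha), cost_complexity(big_tree, alpha))
# prints: 0.0 4.0 1.0
# prints: 2.0 8.0 9.0
```

With no penalty the bigger tree wins on fit alone. Once $\alpha = 2$ the extra leaves cost more than they save, so the smaller tree has the lower objective.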

Artificial Neural Networks (ANN)

Considerations:

How many hidden layers?
- The more non-linear the relation between Y and X, the more the extra layers will help

$$ Z_{m} = \sigma(\alpha_{0m} + \alpha_{m}^{T} \vec{X}) $$

$α_{0 m}$ is the intercept and $α_{m}^{T} \vec{X}$ is weight times X (same as linear regression).
We first perform linear-combination and then do non-linear transformation.
$σ$ is some non-linear transformation (sigmoid etc).

$$ \sigma(v) = \frac{1}{1 + e^{-v}} $$

The arrows in the ANN diagram between two layers are dot-products (weighted sums)
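A minimal sketch of one hidden unit: linear combination first, sigmoid second. The input, intercept and weights are hypothetical numbers picked so the weighted sum comes out to 0.

```python
import math

def sigmoid(v):
    """sigma(v) = 1 / (1 + e^(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def hidden_unit(x, alpha_0, alpha):
    """Z_m = sigma(alpha_0m + alpha_m^T x): weighted sum, then the non-linearity."""
    linear = alpha_0 + sum(a * xi for a, xi in zip(alpha, x))
    return sigmoid(linear)

x = [1.0, 2.0, -1.0]        # hypothetical input
alpha_0 = 0.5               # hypothetical intercept
alpha = [0.2, -0.1, 0.5]    # hypothetical weights

z = hidden_unit(x, alpha_0, alpha)
print(z)
```

Since the weighted sum is 0 here, the unit outputs sigmoid(0) = 0.5. Each arrow into a hidden unit corresponds to one `a * xi` product in the sum.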

$$ x_{j} = \begin{bmatrix} x_{j1} \\ x_{j2} \\ \vdots \\ x_{jn} \end{bmatrix} $$

It might involve tuning parameters

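$$ \beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma_{r} \sum_{i=1}^{n} \frac{\partial R_{i}}{\partial \beta_{km}^{(r)}} $$

Here $r$ is the iteration, $\gamma_{r}$ is the learning rate (a tuning parameter) and $R_{i}$ is the error on observation $i$.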
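A sketch of the gradient descent update on the simplest possible network: one sigmoid unit with squared error $R_i = (y_i - \sigma(\beta^T x_i))^2$. The data, the constant learning rate and the single-unit setup are assumptions for illustration. A real network repeats the same update for every weight via back-propagation.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def total_error(beta, X, y):
    """R = sum_i R_i with R_i = (y_i - sigmoid(beta . x_i))^2."""
    return sum((yi - sigmoid(sum(b * xj for b, xj in zip(beta, xi)))) ** 2
               for xi, yi in zip(X, y))

def gradient_step(beta, X, y, gamma):
    """One update: beta_new = beta - gamma * sum_i dR_i/dbeta."""
    grad = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * xj for b, xj in zip(beta, xi)))
        for j, xj in enumerate(xi):
            # Chain rule: dR_i/dbeta_j = -2 (y_i - p) * p (1 - p) * x_ij
            grad[j] += -2.0 * (yi - p) * p * (1.0 - p) * xj
    return [b - gamma * g for b, g in zip(beta, grad)]

# Hypothetical data; the leading 1.0 in each row acts as the intercept input.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.0, 0.0, 1.0, 1.0]

beta = [0.0, 0.0]
before = total_error(beta, X, y)
for _ in range(500):
    beta = gradient_step(beta, X, y, gamma=0.1)
after = total_error(beta, X, y)
print(before, after)
```

The learning rate is the tuning parameter here: too large and the updates overshoot, too small and the error barely moves.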

Kernel based classification

$$ K(x, y) = \langle \phi(x), \phi(y) \rangle $$

don't really care about this
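Still, a tiny sketch of what the formula says: a kernel is an inner product in some feature space, computed without ever building the features. The degree-2 polynomial kernel and its feature map `phi` are one hypothetical choice used only for illustration.

```python
def phi(v):
    """Explicit degree-2 feature map for a 2-D input."""
    x1, x2 = v
    return [x1 * x1, 2 ** 0.5 * x1 * x2, x2 * x2]

def inner(a, b):
    """Plain dot product."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: (x . y)^2."""
    return inner(x, y) ** 2

x, y = [1.0, 2.0], [3.0, -1.0]
via_features = inner(phi(x), phi(y))   # map to 3-D, then take the dot product
via_kernel = poly_kernel(x, y)         # stay in 2-D
print(via_features, via_kernel)
```

Both routes give the same number (up to floating-point error), which is the whole point of the kernel trick.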

KNN Classification

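The lower the K, the more sensitive the classification becomes to individual training points
For larger values of K, the more points vote on each prediction, so the classification becomes smoother and less sensitive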
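A sketch of how K changes sensitivity, using a plain majority-vote KNN. The 1-D training set is hypothetical and has one "red" outlier sitting inside the "blue" cluster.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 1-D data: (features, label) pairs.
train = [((0.0,), "blue"), ((0.2,), "blue"), ((0.4,), "blue"), ((0.6,), "blue"),
         ((0.5,), "red"), ((3.0,), "red"), ((3.2,), "red")]

query = (0.52,)
print(knn_predict(train, query, k=1), knn_predict(train, query, k=5))
# prints: red blue
```

With K = 1 the single outlier decides the answer. With K = 5 it gets outvoted by its four blue neighbours.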

Cross Validation

Exam Q:

How many regression models do we need to build to locate the best h? h * n (number of candidate h values times n, since leave-one-out refits once per held-out observation)
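A sketch that counts the fits explicitly, assuming leave-one-out cross-validation and a Nadaraya-Watson kernel regression with a Gaussian kernel. The estimator, the data and the candidate bandwidths are all assumptions made for illustration.

```python
import math

def nw_predict(x0, xs, ys, h):
    """Nadaraya-Watson estimate at x0 with a Gaussian kernel and bandwidth h."""
    weights = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

def loocv_select(xs, ys, candidates):
    """Return (best_h, number_of_fits) using leave-one-out squared error."""
    fits = 0
    best_h, best_err = None, float("inf")
    for h in candidates:
        err = 0.0
        for i in range(len(xs)):
            # Hold out observation i and fit on the remaining n - 1 points.
            x_rest, y_rest = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
            err += (ys[i] - nw_predict(xs[i], x_rest, y_rest, h)) ** 2
            fits += 1
        if err < best_err:
            best_h, best_err = h, err
    return best_h, fits

xs = [i / 10 for i in range(20)]
ys = [math.sin(3 * x) for x in xs]   # hypothetical noise-free target
candidates = [0.05, 0.2, 1.0]

best_h, fits = loocv_select(xs, ys, candidates)
print(best_h, fits)
```

With 3 candidate bandwidths and n = 20 observations the loop does 3 * 20 = 60 fits, matching the h * n count.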

MISE (mean integrated squared error)

$$ MISE = E \left( \int (f(x) - \hat{f}(x))^{2} \, dx \right) $$

$$ = \int Bias^{2}(\hat{f}(x)) \, dx + \int Var(\hat{f}(x)) \, dx $$

We use this cuz it's continuous (i.e. we no longer have individual discrete datapoints of X that we can sum. It's continuous, so integrate instead.)

MISE and AISE are too complicated to minimize. Hence most computer packages use cross-validation.

Select best kernel

Go through all candidate kernels with N h(x): for each kernel find its best h, then compare the kernels at their best h.

#classification #data-science #interview #machine-learning #machine-learning-interview #notes