[NOTES] Classification Method - P1
Overview
Statistical Classifiers
- Bayes Classifiers
- LDA/QDA/RDA
- Naive Bayes (assumes inputs $x_i$ are conditionally independent given the class)
- Frequentist Classifiers
-- Logistic Regression
- Computing/Optimization-based classifiers
-- SVM/Decision Tree/ANN...
Note on frequentist vs Bayes: Bayesian methods incorporate priors; frequentist methods don't.
Bayes Classifier
- classifies a case into the most probable class
- the decision is based on the posterior conditional distribution $Pr(G = k \mid X = x)$
- Logistic regression is not a Bayes procedure. It's a frequentist procedure
G has a discrete distribution, with probability mass assigned to a few outcomes (the classes)
Terminology
- Prior $\pi_k$: the probability of class $k$ before seeing the inputs
- eg. with 5 classes, a uniform prior would be $\pi_k = 1/5$ for each
- Ways to get priors:
-- Empirical studies (like just taking class frequencies, e.g. `value_counts()` on your dataframe's label column; see the sketch below)
- Likelihood $f_k(x) = Pr(X = x \mid G = k)$: the data distribution within class $k$!
- Posterior $Pr(G = k \mid X = x)$: what the classification decision is based on
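A minimal pandas sketch of getting empirical priors from class frequencies (toy labels, made up here):

```python
import pandas as pd

# Toy class labels; in practice this is the target column of your dataframe.
labels = pd.Series(["blue", "red", "blue", "blue", "red"])

# Empirical priors = relative class frequencies.
priors = labels.value_counts(normalize=True)
print(priors)
# blue    0.6
# red     0.4
```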
We have:
Bayes Theorem
Posterior probability is given by:
$$Pr(G = k \mid X = x) = \frac{\pi_{k} f_{k}(x)}{\sum_{l=1}^{K} \pi_{l} f_{l}(x)}$$
where
- $k$ indexes the classes (e.g. Blue, Red, ...)
- $x$ is the vector of input variables
We are essentially checking the probability of class $G = k$ given input variables $X = x$.
Classification decision:
Assign a "new" observation with inputs $x_{new}$ to class $j$ if $Pr(G = j \mid x_{new})$ is the highest among $Pr(G = k \mid x_{new})$, $k = 1, 2, \cdots, K$.
Bayes Theorem is given by:
$$Pr(G = k \mid X = x) = \frac{Pr(X = x \mid G = k)\,Pr(G = k)}{Pr(X = x)}$$
Input variable is a RANDOM VARIABLE (uppercase). Observations are denoted in lowercase.
Note: We have $Pr(X = x)$ as a normalizing factor in denominator.
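A tiny numeric sketch of the posterior computation (all numbers made up):

```python
import numpy as np

# Made-up priors for 3 classes and likelihoods f_k(x) at one observed x.
priors = np.array([0.5, 0.3, 0.2])       # pi_k
likelihoods = np.array([0.1, 0.4, 0.4])  # f_k(x) = Pr(X = x | G = k)

# Bayes theorem: posterior ∝ prior * likelihood; the sum normalizes.
posterior = priors * likelihoods / np.sum(priors * likelihoods)
print(posterior)           # [0.2  0.48 0.32]
print(posterior.argmax())  # assign to the most probable class (index 1)
```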
Discriminant Analysis (LDA/QDA/RDA)
Model Assumptions:
- Multivariate Normal Distribution assumed for the joint distribution of the p-dimensional X variables within each class, with density $f_k(x) = Pr(X = x \mid G = k)$
- Homogeneity: for each class $G = k$, $k = 1, 2, \ldots, K$, all observations in that class come from the same distribution (same $\mu_k$ and $\Sigma_k$)
- Independence: the observations are independent of each other
The multivariate normal probability density function takes the form:
$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right)$$
Here $p$ is the number of predictors and $k$ is the class index. $\Sigma_k$ is the variance-covariance matrix and $\mu_k$ is the mean vector.
- In LDA we assume that $\Sigma_{k} = \Sigma$, i.e. the variance-covariance matrix is the same for EVERY class.
- In QDA the variance-covariance matrix $\Sigma_{k}$ is different for each class.
- In RDA the variance-covariance matrix is a function of a tuning parameter $\alpha$ and can be given like so (see the sketch after this list):
$$\Sigma_{k}(\alpha) = \alpha\,\Sigma_{k} + (1 - \alpha)\,\Sigma$$
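A quick sketch of evaluating the class likelihood $f_k(x)$ and the RDA blend (toy parameters, not from the notes):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up class-k parameters for p = 2 predictors.
mu_k = np.array([1.0, 2.0])
sigma_k = np.array([[1.0, 0.3],
                    [0.3, 2.0]])

# Likelihood f_k(x) of one observation under class k's multivariate normal.
x = np.array([1.5, 1.0])
print(multivariate_normal.pdf(x, mean=mu_k, cov=sigma_k))

# RDA-style covariance: blend the class covariance with a pooled estimate.
sigma_pooled = np.eye(2)  # placeholder pooled covariance
alpha = 0.5               # tuning parameter
sigma_rda = alpha * sigma_k + (1 - alpha) * sigma_pooled
print(sigma_rda)
```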
Numerical Illustration
[1] $k = 1, 2, \cdots, K$ ($K = 5$ classes)
[2] $X = [X_1, X_2, X_3]$, say 3 input variables (rich, sleep, ...)
Now we get a mean vector for class $k = 1$: $\mu_{k=1} = [E(X_1 \mid G = 1), E(X_2 \mid G = 1), E(X_3 \mid G = 1)]$
This is the mean of $X$ for class $k = 1$.
Then,
$$\Sigma_{k=1} = \begin{bmatrix} var(X_1) & covar(X_1, X_2) & covar(X_1, X_3) \\ covar(X_2, X_1) & var(X_2) & covar(X_2, X_3) \\ covar(X_3, X_1) & covar(X_3, X_2) & var(X_3) \end{bmatrix}$$
Dimensions of $\Sigma_k$ are 3x3 (i.e. $p \times p$ where $p$ is the number of input variables, i.e. columns in your dataset). Here the existence of covariance terms means that there can be dependence between any two x variables.
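A quick sketch of estimating $\mu_k$ and $\Sigma_k$ per class with pandas (toy simulated data, made-up column names):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
    "x3": rng.normal(size=100),
    "G": rng.choice(["a", "b"], size=100),
})

# Per-class mean vector mu_k and 3x3 variance-covariance matrix Sigma_k.
for k, grp in df.groupby("G"):
    mu_k = grp[["x1", "x2", "x3"]].mean()
    sigma_k = grp[["x1", "x2", "x3"]].cov()
    print(k, mu_k.values, sigma_k.shape)  # Sigma_k is 3x3
```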
Exam Question:
Sample sizes (numbers of observations) in each class can be different! For each class the mean vector and variance-covariance matrix can be different.
Different classes will have different data likelihoods!
LDA
In comparing 2 classes we set their posterior probabilities ($Pr(G = k \mid X = x)$ and $Pr(G = l \mid X = x)$) to be equal. Then upon simplification we get the following decision boundary.
- Decision Boundary: The discriminant function for LDA takes the following form:
$$\log \frac{Pr(G = k \mid X = x)}{Pr(G = l \mid X = x)} = \log \frac{\pi_{k}}{\pi_{l}} - \frac{1}{2} (\mu_{k} + \mu_{l})^T \Sigma^{-1} (\mu_{k} - \mu_{l}) + x^T \Sigma^{-1} (\mu_{k} - \mu_{l})$$
Now here $\pi_{k}$ and $\pi_{l}$ are the prior probabilities for class k and l. Note that we can get the prior-probabilities for each class directly from our data.
- Note that the discriminant function is a linear function of our input X = x.
- We estimate the mean ($\mu_k$) and variance-covariance matrix ($\Sigma$) from our input data.
- Okay, but what does the decision boundary even represent?
- Well, since it is a linear function of X, it is a (p-1)-dimensional hyperplane in a p-dimensional space.
- Also note how in the above decision boundary we do not have a class-specific $\Sigma_{k}$! This is because of the assumption we made above that all classes have the same variance-covariance matrix. Boom!
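A minimal scikit-learn LDA sketch (assuming sklearn; the data is simulated, not from the notes):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two classes sharing the same covariance (the LDA assumption).
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 2], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis()  # estimates mu_k, pooled Sigma, priors pi_k
lda.fit(X, y)
print(lda.predict([[1.0, 1.0]]))        # class with the highest posterior
print(lda.predict_proba([[1.0, 1.0]]))  # posterior Pr(G = k | x)
```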
QDA
Now remember that QDA has no such funny business of assuming the variance-covariance matrix for each class is the same. Hence its decision boundary becomes slightly more complicated and looks as follows:
$$\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$$
REMEMBER SANDWICH! $\Sigma_k^{-1}$ is caught between $(x - \mu_k)^T$ and $(x - \mu_k)$. This sandwich actually stems from the scaling property of variance ($Var(aX) = a^2 Var(X)$ generalizes to $a^T \Sigma a$ for vectors).
- Why is this called Quadratic then?
- Well, check the $(\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)$ part. You will notice that, compared to the LDA decision boundary, we now have two $x$'s in the equation, i.e. it has become a 2nd-order function, aka a quadratic function.
- Note that you can manipulate LDA to behave similarly to QDA if you just take interactions between inputs, i.e. if you have inputs $X_1$ and $X_2$, perform LDA in a 5-dimensional space with $\{X_{1}, X_{2}, X_{1}X_{2}, X_{1}^2, X_{2}^2\}$ (see the sketch below).
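A sketch of that trick, QDA vs LDA on expanded quadratic features (toy simulated data):

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
# Classes with different covariances, where QDA should beat plain LDA.
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),
               rng.normal(0, 2.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

qda = QuadraticDiscriminantAnalysis().fit(X, y)
# LDA on {x1, x2, x1*x2, x1^2, x2^2}: linear in 5-D, quadratic in 2-D.
lda_quad = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                         LinearDiscriminantAnalysis()).fit(X, y)
print(qda.score(X, y), lda_quad.score(X, y))
```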
Naive Bayes Classifier!
- Special case of LDA/QDA/RDA
- We must remember that LDA/QDA/RDA allow your input data to be dependent multivariate data. In the case of Naive Bayes, we instead assume that the input matrix X is INDEPENDENT multivariate data (given the class).
- What does dependent and independent multivariate data mean?
- In simple terms, if the p columns of your input dataset are somehow dependent on each other, then the data is dependent (example: scores in a physics and a math class are somewhat correlated). If we assume each column has no effect whatsoever on any other column, then the assumption is independence (example: scores in a physics and a singing class are not related at all).
- Mathematically it just means that in the case of Naive Bayes your variance-covariance matrix is a diagonal matrix (i.e. except for the diagonal, everything else is 0).
Visually, for dependent data:
$$VarCovar(X) = \begin{bmatrix} var(x_{1}) & covar(x_{1}, x_{2}) & \cdots & covar(x_{1}, x_{p}) \\ covar(x_{2}, x_{1}) & var(x_{2}) & \cdots & covar(x_{2}, x_{p}) \\ \vdots & \vdots & \ddots & \vdots \\ covar(x_{p}, x_{1}) & covar(x_{p}, x_{2}) & \cdots & var(x_{p}) \end{bmatrix}$$
and for independent data (Naive Bayes):
$$VarCovar(X) = \begin{bmatrix} var(x_{1}) & 0 & \cdots & 0 \\ 0 & var(x_{2}) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & var(x_{p}) \end{bmatrix}$$
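scikit-learn's GaussianNB fits exactly this per-class diagonal-covariance model; a toy sketch:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 3)),
               rng.normal(2, 1, size=(50, 3))])
y = np.array([0] * 50 + [1] * 50)

nb = GaussianNB().fit(X, y)     # per-class means and VARIANCES only
print(nb.theta_)                # class means, shape (2, 3)
print(nb.var_)                  # per-feature variances (the diagonal of Sigma_k)
print(nb.predict_proba(X[:1]))  # posterior via Bayes theorem
```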
Statistical vs Computing based classification approaches
- Statistical techniques are more rigorous, but require more assumptions
- Statistical techniques involve probability distributions, whereas computing procedures involve an objective function
COMPUTING OPTIMISATION BASED PROCEDURES
Separating Hyperplanes
SVM
- Sensitive to the data
- The data points near the boundary are critical
- SVMs struggle in higher dimensions. Hence we do dimensionality reduction in standard cases and make it work (a sketch follows below)
TODO ADD MORE
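In the meantime, a minimal separating-hyperplane sketch with scikit-learn's SVC (toy data):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)),
               rng.normal(2, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

svm = SVC(kernel="linear").fit(X, y)  # separating hyperplane w.x + b = 0
print(svm.coef_, svm.intercept_)      # w and b
print(svm.support_vectors_.shape)     # only these points define the boundary
```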
Decision Trees
- Complexity of trees:
- Number of branches
- Number of layers (depth)
- OR count the number of decision points
Objective functions (a sketch of both follows below):
- Number of misclassified
- Entropy (advanced)
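A tiny sketch of both objective functions at a single tree node (class counts made up):

```python
import numpy as np

def misclassification(counts):
    """Number misclassified if the node predicts its majority class."""
    counts = np.asarray(counts)
    return counts.sum() - counts.max()

def entropy(counts):
    """Entropy of the class distribution at a node (in bits)."""
    p = np.asarray(counts) / np.sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

node = [40, 10]  # made-up: 40 of class A, 10 of class B at this node
print(misclassification(node))  # 10
print(entropy(node))            # ~0.72 bits
```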
Artificial Neural Networks (ANN)
Considerations:
- How many hidden layers?
- The more non-linear the relation between Y and X, the more layers will help
- Each hidden unit computes $Z_m = \sigma(\alpha_{0m} + \alpha_m^T\vec{X})$, where $\alpha_{0m}$ is the intercept and $\alpha_m^T\vec{X}$ is weights times X (same as linear regression).
- We first perform linear-combination and then do non-linear transformation.
- $\sigma$ is some non-linear transformation (sigmoid etc).
- The arrows in the ANN diagram between two layers are dot products (weighted sums)
- It might involve tuning parameters
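A toy forward pass showing the linear combination followed by the non-linear transformation (random made-up weights):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid non-linearity

rng = np.random.default_rng(0)
x = rng.normal(size=3)               # one observation, p = 3 inputs

# One hidden layer with M = 4 units (weights are random placeholders).
alpha0 = rng.normal(size=4)          # intercepts alpha_{0m}
alpha = rng.normal(size=(4, 3))      # weight vectors alpha_m

# Linear combination first, then non-linear transformation.
z = sigma(alpha0 + alpha @ x)        # hidden activations Z_m

# Output layer: another linear combination of the hidden units.
beta0, beta = rng.normal(), rng.normal(size=4)
y_hat = beta0 + beta @ z
print(z, y_hat)
```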
Kernel based classification
- don't really care about this
KNN Classification
- the lower the K, the more sensitive (wiggly) the classification will become
- for larger values of K, the smoother and more stable the classification becomes (a sketch follows below)
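A quick sketch of the K sensitivity with scikit-learn (toy data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(2, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

for k in (1, 5, 25):  # small k: wiggly/sensitive; large k: smooth/stable
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, knn.score(X, y))  # training accuracy drops as k grows
```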
Cross Validation
Exam Q:
- How many regression models do we need to build in locating the best h?
- (number of candidate h values) × n, since with leave-one-out CV each candidate h requires n model fits
MISE (mean integrated squared error)
We use this because it is continuous (i.e. we no longer have individual discrete data points of X that we can sum over; since X is continuous, we integrate).
MISE and AMISE (its asymptotic approximation) are too complicated to minimize directly. Hence most computer packages use cross-validation.
Select best kernel
Go through all candidate kernels, search the grid of bandwidths h for each, and compare the kernels at their respective best h (a sketch follows below).
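A sketch of the whole search: leave-one-out CV over a kernel × bandwidth grid with Nadaraya-Watson kernel regression (toy data; the grids are made up). Note the model count: each (kernel, h) pair costs n fits, matching the h × n answer above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 50)

kernels = {
    "gaussian": lambda u: np.exp(-0.5 * u**2),
    "epanechnikov": lambda u: np.clip(1 - u**2, 0, None),
}
h_grid = [0.05, 0.1, 0.2, 0.4]

def loocv_error(K, h):
    # Leave-one-out CV: one Nadaraya-Watson fit per held-out point,
    # so each candidate h costs n "model builds".
    errs = []
    for i in range(len(x)):
        w = K((x[i] - np.delete(x, i)) / h)
        if w.sum() == 0:
            continue  # bandwidth too small to cover this point
        y_hat = np.sum(w * np.delete(y, i)) / w.sum()
        errs.append((y[i] - y_hat) ** 2)
    return np.mean(errs) if errs else np.inf

for name, K in kernels.items():
    best_h = min(h_grid, key=lambda h: loocv_error(K, h))
    print(name, "best h =", best_h)
```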