[NOTES] Classification Method - P1
Overview
Statistical Classifiers
- Bayes Classifiers
- LDA/QDA/RDA
- Naive Bayes (assumes the inputs x_i are independent, i.e. no correlation between them)
- Frequentist ("frequency domain") Classifiers
- Logistic Regression
- Computing/Optimizing classifiers
- SVM/Decision Tree/ANN...
Note on frequentist vs Bayes: Bayes classifiers use priors; frequentist procedures don't use priors
Bayes Classifier
- classifies a case into its most probable class
- based on the posterior conditional distribution $Pr(G = k \mid X = x)$
- Logistic regression is not a Bayes procedure; it's a frequentist procedure
G has a discrete distribution with probability mass assigned to a few outcomes
Terminology
Prior: the probability of a class observed before seeing the inputs
- eg. we have 5 classes and $$ \pi_1 = Pr(G = 1\ (\text{blue})) = 32\% $$
- Ways to get priors:
-- Empirical studies (e.g. just doing `value_counts(normalize=True)` on the class column of your dataframe; see the pandas sketch below)
likelihood: the data distribution within each class, $f_k(x)$!
posterior: what classification decisions are based on, $Pr(G = k \mid X = x)$
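A minimal pandas sketch of getting empirical priors from labelled data (the column name `class` and the data here are made up for illustration):

```python
import pandas as pd

# toy labelled data; the 'class' column holds the class of each observation
df = pd.DataFrame({"class": ["blue", "red", "blue", "green", "blue", "red"]})

# empirical priors = relative frequency of each class in the data
priors = df["class"].value_counts(normalize=True)
print(priors)  # e.g. blue 0.5, red 0.33..., green 0.16...
```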
We have: a categorical response $G \in \{1, \dots, K\}$ and a vector of input variables $X \in \mathbb{R}^p$.
Bayes Theorem
Posterior probability is given by $Pr(G = k \mid X = x)$,
where
- k is the class (ex. Blue, Red, ...)
- x is the vector of input variables
We are essentially checking the probability of the class given the input variables $x$.
Classification decision:
Assign a "new" observation with inputs $X = x$ to class $k$ if $Pr(G = k \mid X = x)$ is highest among all classes $k = 1, \dots, K$.
Bayes Theorem is given by:
$$ Pr(G = k \mid X = x) = \frac{Pr(X = x \mid G = k)\, Pr(G = k)}{Pr(X = x)} = \frac{f_k(x)\, \pi_k}{\sum_{l=1}^{K} f_l(x)\, \pi_l} $$
The input variable $X$ is a RANDOM VARIABLE (uppercase). Observed values are denoted in lowercase ($x$).
Note: We have $Pr(X = x) = \sum_{l=1}^{K} f_l(x)\,\pi_l$ as a normalizing factor in the denominator.
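As a rough illustration (not from the lecture): a numpy/scipy sketch of the Bayes classification rule with made-up priors and univariate normal likelihoods; the means, standard deviations and priors are all assumed for the example.

```python
import numpy as np
from scipy.stats import norm

# assumed toy setup: 1 input variable, 3 classes with known priors and
# class-conditional normal densities (likelihoods)
priors = np.array([0.5, 0.3, 0.2])   # pi_k
means = np.array([0.0, 2.0, 4.0])    # mean of f_k
sds = np.array([1.0, 1.0, 1.5])      # sd of f_k

def bayes_classify(x):
    # unnormalized posteriors: pi_k * f_k(x)
    unnorm = priors * norm.pdf(x, loc=means, scale=sds)
    posteriors = unnorm / unnorm.sum()      # divide by the normalizing factor
    return posteriors.argmax(), posteriors  # class with the highest posterior

k, post = bayes_classify(1.2)
print(k, post.round(3))
```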
Discriminant Analysis (LDA/QDA/RDA)
Model Assumptions:
- Multivariate Normal Distribution assumed for the joint distribution of the p-dimensional X-variables within each class, with density $f_k(x)$ (given below)
- Homogeneity: for each class G = k, k = 1, 2, ... K, its inputs are treated as random variables from the same class-specific distribution
- Independence: the input observations are independent of each other (within and across classes)
The multivariate normal probability density function takes the form:
$$ f_k(x) = \frac{1}{(2\pi)^{p/2} \lvert \Sigma_k \rvert^{1/2}} \exp\!\left( -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right) $$
Here $p$ is the number of predictors and $k$ is the class number. $\Sigma_k$ is the variance-covariance matrix and $\mu_k$ is the mean vector.
- In LDA we assume $\Sigma_k = \Sigma$ for all $k$, i.e. the variance-covariance matrix is the same for EVERY class.
- In QDA the variance-covariance matrix $\Sigma_k$ is different for each class
- in RDA the variance-covariance matrix is a function of $\alpha$ (tuning-param) and blends the class-specific matrix with the pooled (LDA) one: $$ \Sigma_k(\alpha) = \alpha\Sigma_{k} + (1-\alpha)\Sigma $$
Numerical Illustration
[1] Say we have $K = 5$ classes.
[2] Say we have 3 input variables (rich, sleep, ...) where,
Now we get a mean vector for class k = 1:
$\mu_1$ is the $3 \times 1$ vector of per-variable means of x for class k = 1.
Then,
the dimensions of $\Sigma_1$ are 3x3 (i.e. d x d, where d is the number of input columns in your dataset, not the number of observations). Here the existence of covariance terms means that there can be dependence between any two x-variables.
Exam Question:
The sample size of students (observations) in each class can be different! For each class the mean vector and variance-covariance matrix can be different.
Different classes will have different data likelihoods $f_k(x)$!
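A minimal numpy/pandas sketch of estimating the per-class mean vectors and variance-covariance matrices; the column `x3` stands in for the unnamed third input variable and all data are made up:

```python
import numpy as np
import pandas as pd

# toy data frame: 30 students, 3 input variables, 5 classes (6 students each)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rich": rng.normal(size=30),
    "sleep": rng.normal(size=30),
    "x3": rng.normal(size=30),
    "class": np.repeat([1, 2, 3, 4, 5], 6),
})

features = ["rich", "sleep", "x3"]
for k, group in df.groupby("class"):
    mu_k = group[features].mean().to_numpy()                     # length-3 mean vector for class k
    sigma_k = np.cov(group[features].to_numpy(), rowvar=False)   # 3x3 variance-covariance matrix
    print(k, group.shape[0], mu_k.round(2))                      # class sizes can differ in general
```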
LDA
In comparing 2 classes $k$ and $l$ we set their posterior probabilities ($Pr(G = k \mid X = x) = Pr(G = l \mid X = x)$) to be equal. Then upon simplification we get the following decision boundary.
- Decision Boundary: The discriminant function for LDA takes the following form (set the log posterior ratio to zero):
$$ \log\frac{\pi_k}{\pi_l} - \frac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x^T \Sigma^{-1} (\mu_k - \mu_l) = 0 $$
Now here $\pi_k$ and $\pi_l$ are the prior probabilities for class k and l. Note that we can get the prior probabilities for each class directly from our data.
- Note that the discriminant function is a linear function of our input $X = x$.
- We estimate the means ($\mu_k$) and the variance-covariance matrix ($\Sigma$) from our input data.
- Okay, but what does the decision boundary even represent?
- Well, since it is a linear function of $x$, the boundary is a (p-1)-dimensional hyperplane in a p-dimensional space
- Also note how the above decision boundary has no quadratic term $x^T \Sigma^{-1} x$ in it!! This is because of the assumption we made above that all classes have the same variance-covariance matrix, so those terms cancel. Boom!
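A minimal numpy sketch (not from the notes) of the equivalent per-class LDA discriminant $\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$, classifying by the largest $\delta_k$; all numbers below are made up.

```python
import numpy as np

def lda_discriminants(x, mus, sigma, priors):
    """delta_k(x) = x^T S^-1 mu_k - 0.5 mu_k^T S^-1 mu_k + log(pi_k) for every class k."""
    sigma_inv = np.linalg.inv(sigma)   # shared (pooled) covariance, same for all classes
    deltas = []
    for mu_k, pi_k in zip(mus, priors):
        delta = x @ sigma_inv @ mu_k - 0.5 * mu_k @ sigma_inv @ mu_k + np.log(pi_k)
        deltas.append(delta)
    return np.array(deltas)

# toy example: 2 input variables, 2 classes
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
priors = [0.6, 0.4]
x = np.array([1.0, 0.5])
print(lda_discriminants(x, mus, sigma, priors).argmax())  # index of the predicted class
```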
QDA
Now remember that QDA has no such funny business of assuming the variance-covariance matrix for each class is the same. Hence its decision boundary becomes slightly more complicated; the discriminant looks as follows:
$$ \delta_k(x) = -\frac{1}{2}\log\lvert \Sigma_k \rvert - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log\pi_k $$
REMEMBER THE SANDWICH! $\Sigma_k^{-1}$ is caught between $(x - \mu_k)^T$ and $(x - \mu_k)$. This sandwich actually stems from the scaling property of variance.
- Why is this called Quadratic then?
- Well, check the $(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)$ part. You will notice that, compared to the LDA decision boundary function, $x$ now appears twice in the above equation, i.e. it has become a 2nd-order function aka a quadratic function
- Note that you can manipulate LDA to behave similarly to QDA if you just take interactions between inputs, i.e. if you have inputs X1 and X2, perform LDA in a 5-dimensional space with $X_1, X_2, X_1^2, X_2^2, X_1 X_2$.
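A rough sklearn sketch of that trick (data and usage assumed for illustration, not from the lecture): expand (X1, X2) into the 5 quadratic features, run LDA on them, and compare with plain QDA on the original inputs.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.preprocessing import PolynomialFeatures

# toy 2-class data with 2 inputs X1, X2
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(1.5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# expand to the 5-dimensional space X1, X2, X1^2, X1*X2, X2^2
X5 = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

lda_quad = LinearDiscriminantAnalysis().fit(X5, y)   # LDA on the expanded features
qda = QuadraticDiscriminantAnalysis().fit(X, y)      # plain QDA on the original features
print(lda_quad.score(X5, y), qda.score(X, y))        # similar training accuracy
```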
Naive Bayes Classifier!
- Special case of LDA/QDA/RDA
- We must remember that LDA/QDA/RDA assume that your input data is dependent multivariate data. In the case of Naive Bayes, we just assume that the input matrix X is INDEPENDENT multivariate data.
- What does dependent and independent multivariate data mean?
- In simple terms, if the p columns of your input dataset are somehow dependent on each other, then it's dependent (example: scores in a physics and a math class are somewhat correlated). If we assume that each column has no effect on any other column whatsoever, then the assumption is independence (example: scores in physics and singing class are not related at all).
- Mathematically it just means that in the case of Naive Bayes your variance-covariance matrix is a diagonal matrix (i.e. apart from the diagonal, everything else is 0). Visually, for dependent data:
$$VarCovar(X) = \begin{bmatrix}
var(x_{1}) & covar(x_{1}, x_{2}) & \cdots & covar(x_{1}, x_{p}) \\
covar(x_{2}, x_{1}) & var(x_{2}) & \cdots & covar(x_{2}, x_{p}) \\
\vdots & \vdots & \ddots & \vdots \\
covar(x_{p}, x_{1}) & covar(x_{p}, x_{2}) & \cdots & var(x_{p})
\end{bmatrix}$$
And for independent data (Naive Bayes assumption) it is as follows:
$$VarCovar(X) = \begin{bmatrix}
var(x_{1}) & 0 & \cdots & 0 \\
0 & var(x_{2}) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & var(x_{p})
\end{bmatrix}$$
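A minimal numpy sketch of the Gaussian Naive Bayes idea under that diagonal-covariance assumption: per class we keep only per-variable means and variances, so the joint likelihood is a product of univariate normals. Data and shapes below are made up.

```python
import numpy as np
from scipy.stats import norm

# toy training data: 100 observations, 3 inputs (assumed independent), 2 classes
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

classes = np.unique(y)
priors = np.array([(y == k).mean() for k in classes])        # empirical priors
means = np.array([X[y == k].mean(axis=0) for k in classes])  # per-class, per-variable means
stds = np.array([X[y == k].std(axis=0) for k in classes])    # sqrt of the diagonal of Sigma_k

def naive_bayes_predict(x_new):
    # class likelihood = product of univariate normal densities (independence assumption)
    likelihoods = norm.pdf(x_new, loc=means, scale=stds).prod(axis=1)
    posteriors = priors * likelihoods
    return classes[posteriors.argmax()]

print(naive_bayes_predict(np.array([0.1, -0.2, 0.3])))
```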
Statistical vs Computing based classification approaches
- Statistical techniques are more rigorous but require more assumptions
- Statistical techniques involve probability distributions, whereas computing procedures involve an objective function
COMPUTING / OPTIMISATION-BASED PROCEDURES
Separating Hyperplanes
SVM
- Sensitive to the data (the fitted boundary depends strongly on the particular observations)
- Data critical
- SVMs are bad in higher dimensions. Hence in standard cases we do dimensionality reduction first and make it work (a small sklearn sketch follows)
TODO ADD MORE
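A rough sklearn sketch of the reduce-then-fit idea from the bullet above (made-up data; the pipeline and parameters are assumptions for illustration, not the lecture's recipe):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# toy high-dimensional data: 200 observations, 50 inputs, 2 classes
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 50))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# reduce dimensionality first, then fit the SVM on the reduced inputs
clf = make_pipeline(StandardScaler(), PCA(n_components=5), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy on the toy data
```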
Decision Trees
- Complexity of trees:
- Number of branches
- Number of layers (depth)
- OR count the number of decision points
Objective functions:
- Number of misclassified observations (see the sketch below)
- Entropy (advanced)
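A minimal numpy sketch of both objective functions evaluated on one toy tree node (the data are made up):

```python
import numpy as np

def misclassification_count(labels):
    """Number of observations not in the node's majority class."""
    _, counts = np.unique(labels, return_counts=True)
    return len(labels) - counts.max()

def entropy(labels):
    """Shannon entropy of the class distribution in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# toy node with 10 observations from 2 classes
node = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
print(misclassification_count(node))  # 4
print(entropy(node))                  # about 0.971 bits
```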
Artificial Neural Networks (ANN)
Considerations:
- How many hidden layers?
- The more non-linear the relation between Y and X, the more extra layers will help
- The bias term is the intercept and the weights times X give the weighted sum (same as linear regression).
- We first perform the linear combination and then apply a non-linear transformation.
- The activation $\sigma(\cdot)$ is some non-linear transformation (sigmoid etc).
- The arrows in the ANN diagram between two layers are dot-products (weighted sums); a small numpy sketch follows this list
- It might involve tuning parameters
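A minimal numpy sketch of one forward pass, i.e. linear combination then non-linear transform; the layer sizes and weights below are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    # non-linear transformation applied after the linear combination
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """One hidden layer: linear combination (dot product) then non-linear transform."""
    h = sigmoid(W1 @ x + b1)    # hidden layer: weights times x plus intercept, then sigmoid
    out = sigmoid(W2 @ h + b2)  # output layer: again linear combination then sigmoid
    return out

# toy network: 3 inputs -> 4 hidden units -> 1 output, random weights
rng = np.random.default_rng(4)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
print(forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2))
```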
Kernel based classification
- don't really care about this
KNN Classification
- The lower the K, the more sensitive the classification will become (it follows individual neighbouring points closely)
- For larger values of K, the smoother and more stable the classification becomes
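A rough sklearn sketch of that effect (made-up data); small K fits the training data almost perfectly, larger K gives a smoother rule:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy 2-class data with 2 inputs
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# small K -> very flexible/sensitive fit; large K -> smoother decision rule
for k in (1, 5, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, knn.score(X, y))   # training accuracy typically decreases as K grows
```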
Cross Validation
Exam Q:
- How many regression models do we need to build in locating the best h? h * n, where h here counts the candidate bandwidth values and n is the number of observations (each candidate bandwidth is cross-validated over all n observations)
MISE (mean integrated squared error)
We use this because X is continuous (i.e. we no longer have individual discrete data points of X that we can sum over; it is continuous, so we integrate).
MISE and AISE are too complicated to minimize directly. Hence most computer packages use cross-validation.
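For reference, the usual form of MISE (standard definition, assumed here rather than copied from the lecture) for a bandwidth-h estimator $\hat{f}_h$ of the true function $f$: the squared error is integrated over continuous x and then averaged over samples.
$$ \operatorname{MISE}(h) = E\left[ \int \big( \hat{f}_h(x) - f(x) \big)^2 \, dx \right] $$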
Select best kernel
Go through all candidate kernels, and for each kernel compare the N candidate bandwidths h (via cross-validation) to find its best h; then compare the kernels at their respective best h.
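A rough numpy sketch of that search (the kernels, bandwidth grid and data are all made up for illustration, not the lecture's code): leave-one-out CV over a grid of h for each kernel, which is where the "h * n model fits" count comes from.

```python
import numpy as np

def kernel_regression(x0, x, y, h, kernel):
    """Nadaraya-Watson kernel regression estimate of y at the point x0."""
    w = kernel((x0 - x) / h)
    return (w * y).sum() / w.sum() if w.sum() > 0 else y.mean()  # guard: no neighbours in range

# candidate kernels (forms assumed for illustration)
kernels = {
    "gaussian": lambda u: np.exp(-0.5 * u ** 2),
    "epanechnikov": lambda u: np.maximum(1 - u ** 2, 0.0),
}

# toy 1-D data
rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 10, 80))
y = np.sin(x) + rng.normal(0, 0.3, 80)

hs = np.linspace(0.2, 2.0, 10)   # grid of candidate bandwidths
for name, kern in kernels.items():
    cv_err = []
    for h in hs:                  # (number of h values) x n model fits in total
        errs = [(y[i] - kernel_regression(x[i], np.delete(x, i), np.delete(y, i), h, kern)) ** 2
                for i in range(len(x))]   # leave-one-out cross-validation
        cv_err.append(np.mean(errs))
    best = int(np.argmin(cv_err))
    print(name, "best h =", round(hs[best], 2), "LOOCV error =", round(cv_err[best], 4))
```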