PCA: Principal Component Analysis
To overcome the shortcomings of high-dimensional models, various dimensionality reduction methods have been introduced; they convert a high-dimensional model into a low-dimensional one, making analysis easier. Feature extraction is one of the most important of these. Feature extraction rebuilds the original features into new features that are more informative and generalized (which helps prevent overfitting) and non-redundant, meaning they do not contain correlated or inter-related features. PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis) are two such feature extraction methods.
PCA (Principal Component Analysis): This dimensionality reduction method is based on maximizing variance: it increases interpretability while minimizing information loss. It builds a usable model by forming new, more general features that help prevent overfitting and redundancy between features. It is an unsupervised learning method.
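The variance-maximizing idea above can be sketched with plain NumPy: center the data, compute the feature covariance matrix, take its eigenvectors (the principal components), and project onto the top few. The dataset here is synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy dataset: 100 samples, 3 features; feature 2 is made correlated with feature 0
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)

# 1. Center the data (PCA is defined on mean-centered data)
Xc = X - X.mean(axis=0)

# 2. Covariance matrix of the features
cov = np.cov(Xc, rowvar=False)

# 3. Eigendecomposition; eigenvectors are the principal components
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the top-2 components (dimensionality 3 -> 2)
Z = Xc @ eigvecs[:, :2]
print(Z.shape)  # (100, 2)
```

In practice one would reach for a library implementation such as `sklearn.decomposition.PCA`, which follows the same steps (typically via an SVD of the centered data).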
Advantages of PCA:
- Non-redundant features: Redundant features are correlated and can therefore be expressed as a linear combination of one another; they add no further information to the existing features. PCA produces non-redundant features that are independent of one another. This is achieved by considering the variance among the features: independent directions show high variance, whereas redundant (correlated) directions show comparatively lower variance, so the independent features can be identified.
- Generalized new features: Given the goal of non-redundancy, one might wonder how a single generalized feature is formed out of several correlated features. Correlated features share some common aspects; a linear combination of such features captures both the common elements and a generalization of the varying ones, and this linear combination represents nothing but a line (the new generalized feature).
- Reduced dimensionality: A dataset with high dimensionality is reduced to a lower one; for example, a 2-dimensional model with independent features makes the dataset easy to visualize.
- Minimized chance of overfitting: Because a more general model is formed, the chance of overfitting is reduced, so a better and possibly optimized output can be expected.
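The non-redundancy claim above can be checked directly: starting from two strongly correlated features, the PCA scores come out uncorrelated, because the eigenvectors diagonalize the covariance matrix. The data below is a hypothetical illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two highly correlated features (the second is roughly twice the first)
x = rng.normal(size=500)
X = np.column_stack([x, 2 * x + 0.05 * rng.normal(size=500)])

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
Z = Xc @ eigvecs[:, order]                  # principal-component scores

# Original features are strongly correlated...
corr_before = np.corrcoef(X, rowvar=False)[0, 1]
# ...but the PCA scores are (numerically) uncorrelated
corr_after = np.corrcoef(Z, rowvar=False)[0, 1]
print(round(corr_before, 3), abs(corr_after) < 1e-3)
```

Almost all of the variance ends up in the first component, which is why keeping only the top components loses little information when the original features are redundant.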
Disadvantages of PCA:
- Recovery of the original variables becomes difficult (information loss): After the various generalizations, the dataset's original features cannot be easily recovered from the generalized model. Moreover, some information may be lost during generalization, which makes recovering the original information a significant hurdle.
- Reduced interpretability of the new variables: The final generalized model consists of linear combinations of the original features (the principal components). Interpreting this model in terms of the original features remains a major challenge.
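The information-loss point can be made concrete: if components are discarded, reconstructing the data from the remaining ones leaves a nonzero residual, equal to the variance of the dropped directions. A minimal sketch, again on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
W = eigvecs[:, np.argsort(eigvals)[::-1][:2]]   # keep top-2 of 3 components

Z = Xc @ W                          # reduce: 3 -> 2 dimensions
X_rec = Z @ W.T + X.mean(axis=0)    # best linear reconstruction from 2 components

# The discarded component's variance shows up as reconstruction error
mse = np.mean((X - X_rec) ** 2)
print(mse > 0)  # True: some information is irrecoverably lost
```

The reconstruction is exact only when all components are kept; otherwise the original variables can only be approximated, which is precisely the trade-off described above.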
February 19, 2021