The principal components are vectors, but they are not chosen at random. The **first principal component** is computed so that it explains the greatest amount of variance in the original features. The **second component** is orthogonal to the first, and it explains the greatest amount of variance left *after* the first principal component.

The original data can be represented as feature vectors. PCA allows us to go a step further and represent the data as linear combinations of principal components. Getting principal components is equivalent to a linear transformation of data from the feature1 x feature2 axis to a PCA1 x PCA2 axis.
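As a minimal sketch of this change of axes, the NumPy snippet below builds two correlated, made-up features (`feature1` and `feature2` are synthetic, not data from this article), finds the principal directions, and re-expresses every data point in PCA1/PCA2 coordinates. The variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated, synthetic features (illustrative data only)
feature1 = rng.normal(size=200)
feature2 = 0.8 * feature1 + 0.3 * rng.normal(size=200)
X = np.column_stack([feature1, feature2])
X_centered = X - X.mean(axis=0)

# Principal directions are the eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]   # largest variance first
components = eigenvectors[:, order]     # columns: PCA1, PCA2

# Linear transformation from the feature axes to the PCA axes
scores = X_centered @ components

# Each point is a linear combination of the components, so we can map back
X_back = scores @ components.T
```

Because both components are kept here, `X_back` matches `X_centered` up to floating-point error; nothing is lost until components are dropped.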

Why is this useful?

In the small 2-dimensional example above, we do not gain much by using PCA, since a feature vector of the form (feature1, feature2) will be very similar to a vector of the form (first principal component (PCA1), second principal component (PCA2)). But in very large datasets (where the number of dimensions can surpass 100 different variables), **principal components remove noise by reducing a large number of features to just a couple of principal components**. Principal components are orthogonal projections of data onto lower-dimensional space.

In theory, PCA produces the same number of principal components as there are features in the training dataset. In practice, though, we do not keep all of the principal components. Each successive principal component explains the variance that is left after its preceding component, so picking just a few of the first components sufficiently approximates the original dataset *without* the need for additional features.
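To illustrate how a few components can approximate a dataset, here is a rough sketch on synthetic data. It assumes 10 features that are driven by 2 hidden factors plus a little noise (an invented setup, chosen so that the effect is easy to see), and measures how much of the data is lost when only the first `k` components are kept.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 10 features driven by 2 hidden factors plus small noise
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 10))
X_centered = X - X.mean(axis=0)

# The rows of Vt are the principal directions, ordered by explained variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

def reconstruction_error(k):
    """Relative squared error after keeping only the first k components."""
    X_k = (U[:, :k] * S[:k]) @ Vt[:k]
    return np.sum((X_centered - X_k) ** 2) / np.sum(X_centered ** 2)

errors = [reconstruction_error(k) for k in range(1, 11)]
```

In this setup the error drops sharply once `k` reaches the number of hidden factors and shrinks only marginally after that, which is the pattern that justifies discarding the later components.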

The result is a new set of features in the form of principal components, which have multiple practical applications.

Keboola is *the* platform for data scientists and takes care of all the steps in the machine learning workflow from deployment to production, so you can focus on your machine learning models, and leave the infrastructure to us.

## What is PCA used for?

The algorithm can be used on its own, or it can serve as a __data cleaning or data preprocessing__ technique used before another machine learning algorithm.

On its own, PCA is used across a variety of use cases:

- **Visualize multidimensional data**. Data visualizations are a great tool for communicating multidimensional data as 2- or 3-dimensional plots.
- **Compress information**. Principal Component Analysis is used to compress information to store and transmit data more efficiently. For example, it can be used to compress images without losing too much quality, or in signal processing. The technique has successfully been applied across a wide range of compression problems in pattern recognition (specifically face recognition), image recognition, and more.
- **Simplify complex business decisions**. PCA has been employed to simplify traditionally complex business decisions. For example, traders use over 300 financial instruments to manage portfolios. The algorithm has proven successful in the risk management of interest rate derivative portfolios, lowering the number of financial instruments from more than 300 to just 3-4 principal components.
- **Clarify convoluted scientific processes**. The algorithm has been applied extensively to understand the convoluted and multidirectional factors that increase the probability of neural ensembles triggering action potentials.

When PCA is used as part of preprocessing, the algorithm is applied to:

- **Reduce the number of dimensions** in the training dataset.
- **De-noise** the data. Because PCA is computed by finding the components which explain the greatest amount of variance, it captures the signal in the data and omits the noise.

Let's take a look at how Principal Component Analysis is computed.

## How is PCA calculated?

There are multiple ways to calculate PCA:

- Eigendecomposition of the covariance matrix
- Singular value decomposition of the data matrix
- Eigenvalue approximation via power iteration
- Non-linear iterative partial least squares (NIPALS) computation
- … and more.
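The first two routes lead to the same result. As a small sanity check on synthetic data (the data and variable names are illustrative assumptions), the sketch below compares the eigenvalues of the covariance matrix with the squared singular values of the centered data matrix, divided by `n - 1`.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data with correlated columns
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X_centered = X - X.mean(axis=0)
n = X_centered.shape[0]

# Route 1: eigenvalues of the covariance matrix, largest first
eigvals = np.linalg.eigvalsh(np.cov(X_centered, rowvar=False))[::-1]

# Route 2: singular values of the centered data matrix
singular_values = np.linalg.svd(X_centered, compute_uv=False)
variances_from_svd = singular_values ** 2 / (n - 1)
```

Both arrays hold the variance explained by each component. Library implementations often prefer the SVD route because it avoids forming the covariance matrix explicitly.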

Let’s take a closer look at the first method - eigendecomposition of the covariance matrix - to gain a deeper appreciation of PCA. There are several steps in computing PCA:

1. **Standardize the features**. We standardize each feature to have a mean of 0 and a variance of 1. As we explain later in assumptions and limitations, features with values that are on different orders of magnitude prevent PCA from computing the best principal components.
2. **Compute the covariance matrix**. The covariance matrix is a square matrix of *d x d* dimensions, where *d* stands for “dimension” (or feature or column, if our data is tabular). It holds the covariance between every pair of features. Because the features were standardized in the previous step, these values equal the pairwise correlations.
3. **Calculate the eigendecomposition of the covariance matrix**. We calculate the eigenvectors (unit vectors) of the covariance matrix and their associated eigenvalues (the scalars by which the matrix stretches each eigenvector). If you want to brush up on your linear algebra, __this is a good resource__ to refresh your knowledge of eigendecomposition.
4. **Sort the eigenvectors from the highest eigenvalue to the lowest**. The eigenvector with the highest eigenvalue is the first principal component. Higher eigenvalues correspond to greater amounts of shared variance explained.
5. **Select the number of principal components**. Select the top N eigenvectors (based on their eigenvalues) to become the N principal components. The optimal number of principal components is both subjective and problem-dependent. Usually, we look at the cumulative share of variance explained by the first components and keep the smallest number of components that still explains a significant share of it.
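The steps above can be sketched in a few lines of NumPy. The dataset below is synthetic (three features that follow one shared signal, plus two independent features), and the 90% cumulative-variance threshold is an arbitrary assumption for illustration, not a recommended value.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic dataset: three features follow one shared signal, two are independent
base = rng.normal(size=(500, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(500, 1)),
    2.0 * base + 0.1 * rng.normal(size=(500, 1)),
    -base + 0.1 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

# 1. Standardize each feature to mean 0 and variance 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Compute the d x d covariance matrix
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition (eigh is meant for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort eigenvectors from the highest eigenvalue to the lowest
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 5. Keep the smallest number of components that reaches the chosen threshold
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)
n_components = int(np.searchsorted(cumulative, 0.90) + 1)  # 0.90 is an assumption

# Project the standardized data onto the selected components
X_pca = X_std @ eigenvectors[:, :n_components]
```

Plotting `cumulative` against the component index gives the familiar cumulative explained-variance curve that is often used to pick N by eye.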

Keep in mind that the majority of data scientists will not calculate PCA by hand, but rather implement it in Python with __scikit-learn__, or use R to compute it. These mathematical foundations enrich our understanding of PCA but are not necessary for its implementation. Understanding PCA allows us to have a better idea of its advantages and disadvantages.
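For reference, a minimal scikit-learn version might look like the sketch below. The data is synthetic and the 95% variance target is an arbitrary assumption. `StandardScaler` handles the standardization step, and passing a float to `n_components` together with `svd_solver="full"` asks `PCA` to keep just enough components to reach that share of explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)

# Synthetic data: six features, two of which are almost duplicates
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

# Standardize first, then fit PCA on the standardized features
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95, svd_solver="full")  # 0.95 is an illustrative target
X_pca = pca.fit_transform(X_std)

# Share of variance explained by each kept component
ratios = pca.explained_variance_ratio_
```

After fitting, `pca.n_components_` reports how many components were kept and `pca.components_` holds the principal directions as rows.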

## What are the advantages and disadvantages of PCA?

PCA offers multiple benefits, but it also suffers from certain shortcomings.

**Advantages of PCA:**

- **Easy to compute**. PCA is based on linear algebra, which computers solve efficiently.
- **Speeds up other machine learning algorithms**. Machine learning algorithms often converge faster when trained on principal components instead of the original dataset.
- **Counteracts the issues of high-dimensional data**. High-dimensional data causes regression-based algorithms to overfit easily. By using PCA beforehand to lower the dimensions of the training dataset, we reduce the risk of the predictive algorithms overfitting.

**Disadvantages of PCA:**

- **Low interpretability of principal components**. Principal components are linear combinations of the features from the original data, but they are not as easy to interpret. For example, it is difficult to tell which are the most important features in the dataset after computing principal components.
- **The trade-off between information loss and dimensionality reduction**. Although dimensionality reduction is useful, it comes at a cost. Information loss is a necessary part of PCA. Balancing the trade-off between dimensionality reduction and information loss is unfortunately a necessary compromise that we have to make when using PCA.

To start PCA on the right foot, you will need to have the right tools that help you collect data from multiple sources and prepare it for machine learning models. Keboola covers all the steps, so you won't have to think about the infrastructure, only about the added-value your machine learning models will bring.

## What are the assumptions and limitations of PCA?

PCA is related to the set of operations in the Pearson correlation, so it inherits similar assumptions and limitations:

- **PCA assumes a correlation between features**. If the features (or dimensions or columns, in tabular data) are not correlated, PCA still returns components, but they will not summarize the data any better than the original features do.
- **PCA is sensitive to the scale of the features**. Imagine we have two features - one takes values between 0 and 1000, while the other takes values between 0 and 1. PCA will be extremely biased towards the first feature being the first principal component, regardless of the *actual* maximum variance within the data. This is why it’s so important to standardize the values first.
- **PCA is not robust against outliers**. Similar to the point above, the algorithm will be biased in datasets with strong outliers. This is why it is recommended to remove outliers before performing PCA.
- **PCA assumes a linear relationship between features**. The algorithm is not well suited to capturing non-linear relationships. That’s why it’s advised to turn non-linear features or relationships between features into linear ones, using standard methods such as log transforms.
- **Technical implementations often assume no missing values**. Statistical software tools that compute PCA often assume that the feature set has no missing values (no empty rows). Be sure to remove those rows and/or columns with missing values, or impute missing values with a close approximation (e.g. the mean of the column).
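The scale sensitivity is easy to reproduce. The sketch below uses two synthetic features that carry the same underlying signal, one spanning roughly 0 to 1000 and the other roughly 0 to 1 (both invented for illustration), and compares the first principal component before and after standardization.

```python
import numpy as np

rng = np.random.default_rng(5)
shared = rng.normal(size=1000)

# Synthetic features: same signal strength, very different scales
large_scale = 500 + 150 * (shared + 0.5 * rng.normal(size=1000))   # roughly 0..1000
small_scale = 0.5 + 0.15 * (shared + 0.5 * rng.normal(size=1000))  # roughly 0..1
X = np.column_stack([large_scale, small_scale])

def first_component(data):
    """Unit-length direction of greatest variance for the given data."""
    centered = data - data.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
    return eigenvectors[:, np.argmax(eigenvalues)]

# Without standardization, the large-scale feature dominates the first component
raw_pc1 = first_component(X)

# After standardization, both features contribute equally
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std_pc1 = first_component(X_std)
```

On the raw data the first component points almost entirely along the large-scale feature, while on the standardized data it weights both features equally, which reflects that they carry the same signal.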
