PPCA is a probabilistic latent variable model whose maximum-likelihood solution corresponds to PCA. For an introduction to PPCA, see [1].
This implementation uses the expectation-maximization (EM) algorithm to find maximum-likelihood estimates of the PPCA model parameters. This enables a principled handling of missing values in the dataset, assuming that the values are missing at random (see equations further below).
This implementation requires Python >= 3.9 and can be installed as a package with:

```bash
pip install -e .
```
To run the `demo.ipynb` notebook, the following packages are additionally required:

```bash
pip install notebook scikit-learn matplotlib
```
In the `demo.ipynb` notebook we show basic usage and compare this implementation to the `sklearn` implementation. In short, `PPCA` can be used similarly to its `sklearn` counterpart:
```python
from ppca import PPCA

...

ppca = PPCA(n_components=2)

# X contains data with possibly missing values (= np.nan)
# Z are the transformed values
Z = ppca.fit_transform(X)

print("explained variance: ", ppca.explained_variance_)

...
```
However, in contrast to the `sklearn` implementation, `PPCA` can additionally output distributions, handle missing values, etc.
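For instance, missing entries are simply marked as `np.nan` before fitting. A minimal sketch on synthetic data, using only the interface shown above (the data-generating code is purely illustrative):

```python
import numpy as np
from ppca import PPCA

rng = np.random.default_rng(0)

# synthetic data from a 2-dimensional latent space: 500 samples, 10 features
Z_true = rng.normal(size=(500, 2))
W_true = rng.normal(size=(10, 2))
X = Z_true @ W_true.T + 0.1 * rng.normal(size=(500, 10))

# randomly mark roughly 10% of the entries as missing
X[rng.random(size=X.shape) < 0.1] = np.nan

# fit PPCA directly on the incomplete data and transform it
ppca = PPCA(n_components=2)
Z = ppca.fit_transform(X)

print("explained variance:", ppca.explained_variance_)
```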
Most implementations of PCA with missing values on GitHub use the EM imputation algorithm described by Roweis [3]. However, without formulating a probabilistic model, there is, for example, no obvious way to transform unseen data with missing values. Instead, this repository implements the full probabilistic PCA model.
References [1] and [2] provide derivations and detailed discussions of the model and its optimization with EM; however, the missing-value case is not explained in detail. Therefore, the necessary equations are provided here in compact form. Familiarity with [1] and [2] is assumed.
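To fix the notation used below (this is the standard PPCA model of [1, Sec. 12.2] and [2]; the symbol choices here may differ from those in the code):

$$
\mathbf{x}_n = \mathbf{W}\mathbf{z}_n + \boldsymbol{\mu} + \boldsymbol{\epsilon}_n,
\qquad
\mathbf{z}_n \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
\qquad
\boldsymbol{\epsilon}_n \sim \mathcal{N}\!\left(\mathbf{0}, \sigma^2 \mathbf{I}\right),
$$

with data points $\mathbf{x}_n \in \mathbb{R}^D$ for $n = 1, \dots, N$, latent variables $\mathbf{z}_n \in \mathbb{R}^M$, weights $\mathbf{W} \in \mathbb{R}^{D \times M}$, mean $\boldsymbol{\mu} \in \mathbb{R}^D$ and isotropic noise variance $\sigma^2$.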
First, note that we can simply integrate the missing values out of the marginal likelihood. Let $o_n \subseteq \{1, \dots, D\}$ denote the indices of the observed entries of $\mathbf{x}_n$, and let $\mathbf{x}_n^o$, $\boldsymbol{\mu}_n^o$ and $\mathbf{W}_n^o$ denote the corresponding entries of $\mathbf{x}_n$, the corresponding entries of $\boldsymbol{\mu}$, and the corresponding rows of $\mathbf{W}$, respectively.
Note that marginalizing a Gaussian over a subset of its dimensions simply removes the corresponding entries of the mean and the corresponding rows/columns of the covariance, so the likelihood contribution of an incomplete data point is

$$
p\!\left(\mathbf{x}_n^o \mid \mathbf{W}, \boldsymbol{\mu}, \sigma^2\right)
= \mathcal{N}\!\left(\mathbf{x}_n^o \mid \boldsymbol{\mu}_n^o,\; \mathbf{W}_n^o \left(\mathbf{W}_n^o\right)^\top + \sigma^2 \mathbf{I}\right),
$$

assuming the entries are missing at random.
The expectation of the complete-data log-likelihood w.r.t. the latent variables, restricted to the observed entries (cf. [1, Sec. 12.2.2]), is

$$
\mathbb{E}\!\left[\ln p\!\left(\mathbf{X}^o, \mathbf{Z} \mid \boldsymbol{\mu}, \mathbf{W}, \sigma^2\right)\right]
= -\sum_{n=1}^{N} \Bigg\{
\frac{|o_n|}{2} \ln\!\left(2\pi\sigma^2\right)
+ \frac{1}{2} \operatorname{Tr}\!\left(\mathbb{E}\!\left[\mathbf{z}_n \mathbf{z}_n^\top\right]\right)
+ \frac{1}{2\sigma^2} \left\| \mathbf{x}_n^o - \boldsymbol{\mu}_n^o \right\|^2
- \frac{1}{\sigma^2} \mathbb{E}[\mathbf{z}_n]^\top \left(\mathbf{W}_n^o\right)^\top \left(\mathbf{x}_n^o - \boldsymbol{\mu}_n^o\right)
+ \frac{1}{2\sigma^2} \operatorname{Tr}\!\left(\mathbb{E}\!\left[\mathbf{z}_n \mathbf{z}_n^\top\right] \left(\mathbf{W}_n^o\right)^\top \mathbf{W}_n^o\right)
\Bigg\} + \text{const},
$$

where $\mathbf{X}^o$ collects the observed entries of the dataset, $\mathbf{Z}$ the latent variables, and the constant does not depend on the parameters.
In the E-step, we estimate the sufficient statistics of the latent posterior:

$$
\mathbb{E}[\mathbf{z}_n] = \mathbf{M}_n^{-1} \left(\mathbf{W}_n^o\right)^\top \left(\mathbf{x}_n^o - \boldsymbol{\mu}_n^o\right),
\qquad
\mathbb{E}\!\left[\mathbf{z}_n \mathbf{z}_n^\top\right] = \sigma^2 \mathbf{M}_n^{-1} + \mathbb{E}[\mathbf{z}_n]\,\mathbb{E}[\mathbf{z}_n]^\top,
$$

where

$$
\mathbf{M}_n = \left(\mathbf{W}_n^o\right)^\top \mathbf{W}_n^o + \sigma^2 \mathbf{I}.
$$
In the M-step, we maximize the expectation of the complete-data log-likelihood with respect to $\boldsymbol{\mu}$, $\mathbf{W}$ and $\sigma^2$ while keeping the latent posterior fixed; with missing values, all sums run only over the observed entries.
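For reference, one consistent set of per-dimension updates (a sketch adapted from the standard updates in [1, Sec. 12.2.2]; each expression maximizes the expectation above over one block of parameters with the others held fixed, and the exact scheme in the code, e.g. the order of the $\boldsymbol{\mu}$ and $\mathbf{W}$ updates, may differ). Writing $\mathbf{w}_d$ for the $d$-th row of $\mathbf{W}$ (as a column vector) and $N_d = \{\, n : d \in o_n \,\}$ for the samples in which dimension $d$ is observed:

$$
\mathbf{w}_d^{\text{new}} = \left[\sum_{n \in N_d} \mathbb{E}\!\left[\mathbf{z}_n \mathbf{z}_n^\top\right]\right]^{-1} \sum_{n \in N_d} \left(x_{nd} - \mu_d\right) \mathbb{E}[\mathbf{z}_n],
\qquad
\mu_d^{\text{new}} = \frac{1}{|N_d|} \sum_{n \in N_d} \left(x_{nd} - \left(\mathbf{w}_d^{\text{new}}\right)^\top \mathbb{E}[\mathbf{z}_n]\right),
$$

$$
\left(\sigma^2\right)^{\text{new}} = \frac{1}{\sum_{n} |o_n|} \sum_{n=1}^{N} \sum_{d \in o_n} \left\{
\left(x_{nd} - \mu_d^{\text{new}}\right)^2
- 2 \left(x_{nd} - \mu_d^{\text{new}}\right) \left(\mathbf{w}_d^{\text{new}}\right)^\top \mathbb{E}[\mathbf{z}_n]
+ \left(\mathbf{w}_d^{\text{new}}\right)^\top \mathbb{E}\!\left[\mathbf{z}_n \mathbf{z}_n^\top\right] \mathbf{w}_d^{\text{new}}
\right\}.
$$

When no entries are missing and $\boldsymbol{\mu}$ is held at the sample mean, the $\mathbf{W}$ and $\sigma^2$ updates reduce to the standard PPCA EM updates of [1, Sec. 12.2.2].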
[1] Bishop, C. M., Pattern Recognition and Machine Learning. New York: Springer, 2006.
[2] Tipping, M. E. and Bishop, C. M., Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 1999.
[3] Roweis, S., EM Algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems 10 (NIPS 1997), 1998, pp. 626-632.