Given a matrix with missing values, impute the missing entries using a low-rank SVD approximation estimated by the EM algorithm.
x: a matrix to impute the missing entries of.
k: the rank of the SVD approximation.
tol: the convergence tolerance for the EM algorithm.
maxiter: the maximum number of EM steps to take.
the completed version of the matrix.
the sum of squares between the SVD approximation and the non-missing values in x.
the number of EM iterations before the algorithm stopped.
Impute the missing values of x as follows: First, initialize all NA values to the column means, or 0 if all entries in the column are missing. Then, until convergence, compute the first k terms of the SVD of the completed matrix. Replace the previously missing values with their approximations from the SVD, and compute the RSS between the non-missing values and the SVD.
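The initialization and a single update described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper names init_colmeans and em_step are hypothetical, and the sketch assumes k is at least 1 and no larger than min(nrow(x), ncol(x)).

```r
# Sketch only: init_colmeans() and em_step() are hypothetical helper names,
# not functions provided by the package documented here.

# Fill each NA with its column mean; a column with no observed entries gets 0.
init_colmeans <- function(x) {
  mu <- colMeans(x, na.rm = TRUE)   # NaN for columns that are entirely NA
  mu[is.nan(mu)] <- 0
  miss <- is.na(x)
  xc <- x
  xc[miss] <- mu[col(x)[miss]]      # look up the mean of each missing cell's column
  xc
}

# One update: rank-k SVD of the completed matrix, refill the originally
# missing cells from the fit, and compute the RSS on the observed cells.
em_step <- function(xc, x, k) {
  miss <- is.na(x)
  s <- svd(xc, nu = k, nv = k)
  fit <- s$u %*% (s$d[seq_len(k)] * t(s$v))   # first k terms of the SVD
  xc[miss] <- fit[miss]
  list(x = xc, rss = sum((x[!miss] - fit[!miss])^2))
}

# Small usage example: the third column is entirely missing.
x <- matrix(c(1, 2, NA,
              2, 4, 6,
              NA, NA, NA), nrow = 3)
xc <- init_colmeans(x)
res <- em_step(xc, x, k = 1)
```

Observed entries are never overwritten in this sketch; only the cells that were NA in x are updated at each step.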
Declare convergence if abs(rss0 - rss1) / (.Machine$double.eps + rss1) < tol, where rss0 and rss1 are the RSS values computed from successive iterations. If convergence has not been reached after maxiter iterations, stop and issue a warning.
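The stopping rule can be written as a small driver loop. The sketch below is an assumption-laden illustration rather than the package's code: run_until_converged and toy_step are hypothetical names, the default values for tol and maxiter are arbitrary, and step stands for any function that takes the current completed matrix and returns a list with the updated matrix (x) and its RSS (rss).

```r
# Sketch only: run_until_converged() and toy_step() are hypothetical names.
# `step` maps the current estimate to list(x = updated estimate, rss = RSS).
run_until_converged <- function(x0, step, tol = 1e-8, maxiter = 100) {
  x <- x0
  rss0 <- Inf            # no previous RSS yet, so the first test cannot pass
  rss1 <- Inf
  iter <- 0
  converged <- FALSE
  while (!converged && iter < maxiter) {
    iter <- iter + 1
    res <- step(x)
    x <- res$x
    rss1 <- res$rss
    # Relative change in RSS; the epsilon guards against division by zero.
    converged <- abs(rss0 - rss1) / (.Machine$double.eps + rss1) < tol
    rss0 <- rss1
  }
  if (!converged) {
    warning("no convergence after ", maxiter, " iterations")
  }
  list(x = x, rss = rss1, iter = iter)
}

# Toy step, used only to exercise the loop: x moves halfway towards 1 and
# the "RSS" tends to 1, so the relative change shrinks at every iteration.
toy_step <- function(x) {
  xn <- (x + 1) / 2
  list(x = xn, rss = 1 + (xn - 1)^2)
}

res <- run_until_converged(3, toy_step, tol = 1e-8, maxiter = 100)
```

Adding .Machine$double.eps to the denominator keeps the ratio finite when the RSS reaches exactly zero, which can happen when the observed entries are fit perfectly.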
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. and Altman, R.B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17(6), 520-525.
# Generate a matrix with missing entries
n <- 20
p <- 10
u <- rnorm( n )
v <- rnorm( p )
xfull <- u %*% rbind( v ) + rnorm( n*p )
miss <- sample( seq_len( n*p ), n )
x <- xfull
x[miss] <- NA
# impute the missing entries with a rank-1 SVD approximation
xhat <- impute.svd( x, 1 )$x
# compute the prediction error for the missing entries
# (observed entries are unchanged in xhat, so they contribute 0 to the sum)
sum( ( xfull-xhat )^2 )
#> [1] 24.40642