4/8/10

Probabilistic PCA



Maximum Likelihood PCA Approach for Handwritten Letter Data Set 




EXECUTIVE SUMMARY
Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but it is not based on a probability model, and its effectiveness is further limited by its global linearity. Here we demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of the parameters of a latent variable model closely related to factor analysis.

Note. We have not considered mixtures of probabilistic principal component analysers, whose parameters can also be determined with an Expectation Maximization (EM) algorithm that estimates the principal subspace iteratively.

The advantages of the PPCA model over conventional PCA are:

  • It addresses the limitations of regular PCA
  • PPCA can be used as a general Gaussian density model in addition to reducing dimensions
  • Maximum-likelihood estimates can be computed for the elements associated with the principal components
  • It captures the dominant correlations with few parameters
  • Multiple PCA models can be combined as a probabilistic mixture
  • It can be used as a base for Bayesian PCA
PROBLEM DEFINITION AND BACKGROUND
In the statistical literature on factor analysis, a high-dimensional data set is expressed as the sum of a linear combination of independent, normally distributed latent (or hidden) variables, the means along the different dimensions, and independent, normally distributed error (or noise) terms. Mathematically, this can be represented as

t = Wx + μ + ε

Where t is the d x N matrix of observed data, x is the q x N matrix of latent variables, W is the d x q factor loading matrix, μ is the d x N matrix formed by repeating the column of row means N times, and ε is the d x N matrix of independent error (noise) columns. Both the latent variables and the noise have zero mean, i.e. x ~ N(0, I) and ε ~ N(0, Ψ), where I is the identity matrix and Ψ is a diagonal matrix of noise variances. In the PPCA model the noise is isotropic, Ψ = σ²I, and the marginal distribution of the observations is t ~ N(μ, C), where C = WW^T + σ²I.
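As an illustration of this generative model, the following short Matlab sketch draws a synthetic data set from it (this is not part of the project code; d, q, N and the parameter values are arbitrary choices made for illustration):

% Draw a synthetic data set from the latent variable model t = W*x + mu + eps
d = 16; q = 3; N = 500;              % illustrative sizes
W      = randn(d, q);                % factor loading matrix
mu     = randn(d, 1);                % mean vector
sigma2 = 0.1;                        % isotropic noise variance (PPCA case)

X = randn(q, N);                     % latent variables, x ~ N(0, I)
E = sqrt(sigma2) * randn(d, N);      % noise, eps ~ N(0, sigma2*I)
T = W*X + repmat(mu, 1, N) + E;      % d x N matrix of observed data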

The project tries to probabilistically estimate W, whose columns represent the features or principal components of the observed data set t. Two approaches have been followed for identifying the features or reduced dimensions.

(1) Expectation Maximization

(2) Maximum Likelihood PCA





Expectation Maximization

In the EM approach, we have tried to determine the factor loading matrix WML and the noise variance σ²ML by maximizing the log-likelihood Lc of the observed data with respect to W and σ². The log-likelihood is calculated as

Lc = -(N/2) { d ln(2π) + ln|C| + tr(C^-1 S) }

where S is the sample covariance matrix of the observed data and C = WW^T + σ²I as defined above. In EM this maximization is done indirectly, by iteratively maximizing the expectation of the complete-data log-likelihood Σ_n ln p(tn, xn).

Considering the conditional distribution of tn given xn and the prior distribution of xn, both of which are Gaussian, and using Bayes' rule, the conditional distribution of xn given tn can be calculated as follows:

xn | tn ~ N( M^-1 W^T (tn - μ), σ² M^-1 ), where M = W^T W + σ²I

E-step

The above conditional distribution yields the posterior statistics needed in the E-step, namely the posterior mean and second moment

⟨xn⟩ = M^-1 W^T (tn - μ),   ⟨xn xn^T⟩ = σ² M^-1 + ⟨xn⟩⟨xn⟩^T
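A minimal Matlab sketch of this E-step, assuming Tc is the mean-adjusted d x N data matrix, W and sigma2 are the current parameter estimates, q is the latent dimension and N the number of observations (these variable names are ours, not taken from the project's Java classes):

M    = W'*W + sigma2*eye(q);         % M = W'W + sigma2*I  (q x q)
Minv = inv(M);
Ex   = Minv * W' * Tc;               % q x N matrix of posterior means <x_n>
Sx   = N*sigma2*Minv + Ex*Ex';       % sum over n of <x_n x_n'>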

Least Squares Reconstruction

The mean-adjusted observed data are projected onto W, giving the posterior means ⟨xn⟩ above, and the data are then reconstructed as W⟨xn⟩ + μ. The squared reconstruction error between the observed and reconstructed data is

SSE = Σ_n || tn - (W⟨xn⟩ + μ) ||²
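Continuing the sketch with the same assumed variable names, this reconstruction error can be computed as:

That = W * Ex;                       % reconstruction of the mean-adjusted data
SSE  = norm(Tc - That, 'fro')^2;     % squared reconstruction error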

M-step

Setting the derivatives of Lc with respect to W and σ² to zero, the updated maximum likelihood estimates are obtained as

Wnew = [ Σ_n (tn - μ)⟨xn⟩^T ] [ Σ_n ⟨xn xn^T⟩ ]^-1

σ²new = (1/(Nd)) Σ_n { ||tn - μ||² - 2⟨xn⟩^T Wnew^T (tn - μ) + tr( ⟨xn xn^T⟩ Wnew^T Wnew ) }

The values obtained in the M-step are used in the calculation of the posterior statistics ⟨xn⟩ and ⟨xn xn^T⟩ in the E-step of the next iteration. This is done using W = Wnew and σ² = σ²new.
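The corresponding Matlab sketch of the M-step updates, again with the variable names assumed in the E-step snippet:

Wnew   = (Tc * Ex') / Sx;            % Wnew = [sum_n (t_n - mu)<x_n>'] [sum_n <x_n x_n'>]^-1
sigma2 = (norm(Tc,'fro')^2 - 2*trace((Wnew'*Tc)*Ex') ...
          + trace(Sx*(Wnew'*Wnew))) / (N*d);
W = Wnew;                            % carried into the next E-step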

At the end of each iteration, the previous squared reconstruction error SSEprev is assigned the value of the current SSE (SSEprev = SSE); the iteration stops once the newly computed SSE becomes greater than SSEprev.
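Putting the E-step, reconstruction and M-step sketches together, the whole iteration can be written as the following self-contained Matlab sketch (T is the d x N data matrix and q the chosen latent dimension; this is an illustration only, not the project's Java/JAMA implementation mentioned later):

[d, N]  = size(T);
mu      = mean(T, 2);
Tc      = T - repmat(mu, 1, N);      % mean-adjusted data
W       = randn(d, q);               % random initialization
sigma2  = 1;
SSEprev = Inf;
while true
    M    = W'*W + sigma2*eye(q);     % E-step
    Minv = inv(M);
    Ex   = Minv * W' * Tc;
    Sx   = N*sigma2*Minv + Ex*Ex';
    SSE  = norm(Tc - W*Ex, 'fro')^2; % least-squares reconstruction error
    if SSE > SSEprev, break; end     % stop once the error starts increasing
    SSEprev = SSE;
    Wnew   = (Tc * Ex') / Sx;        % M-step
    sigma2 = (norm(Tc,'fro')^2 - 2*trace((Wnew'*Tc)*Ex') ...
              + trace(Sx*(Wnew'*Wnew))) / (N*d);
    W = Wnew;
end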

Maximum Likelihood PCA

Unlike the EM approach, the maximum likelihood approach is a closed-form approach: W and σ² are obtained not through an iterative process but from closed-form expressions derived from the stationary-point conditions of the log-likelihood function Lc, which has the same definition as in the EM approach.

At the stationary points, differentiating Lc with respect to W yields the equation

S C^-1 W = W,   where   S = (1/N) Σ_n (tn - μ)(tn - μ)^T

is the covariance matrix of the observed data and C and W have the usual definitions as before.

Now, writing the singular value decomposition W = U L V^T, it can be shown that S U L = U (σ²I + L²) L, so the eigenvalues of S satisfy λj = σ² + lj², where lj are the diagonal elements of L.

Thus the eigenvalues of S corresponding to the retained directions have to be at least σ², with equality only when lj = 0. The number of discarded dimensions has been chosen to be the number of eigenvalues equal to the smallest eigenvalue.

The maximum likelihood estimate of σ² is then calculated as the average of the d - q discarded (smallest) eigenvalues of S:

σ²ML = (1/(d - q)) Σ_{j = q+1..d} λj

Next, WML is estimated using the equation

WML = Uq (Kq - σ²ML I)^(1/2) R

where Uq is obtained by taking the q most significant column eigenvectors of S, Kq is a q x q diagonal matrix with the q largest eigenvalues along the diagonal, and R is an arbitrary orthogonal matrix.

It can be shown that the above estimates of WML and σ²ML correspond to the global maximum of the specified log-likelihood function.
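A hedged Matlab sketch of these closed-form estimates (again not the project's Java/JAMA class; T is the d x N data matrix, q the number of retained dimensions, and R is taken to be the identity):

[d, N]   = size(T);
mu       = mean(T, 2);
Tc       = T - repmat(mu, 1, N);
S        = (Tc * Tc') / N;           % sample covariance matrix of the observed data
[U, D]   = eig(S);
[lam, k] = sort(diag(D), 'descend'); % eigenvalues of S, largest first
U        = U(:, k);

sigma2ML = mean(lam(q+1:d));         % average of the d-q discarded eigenvalues
Uq       = U(:, 1:q);                % q principal column eigenvectors of S
WML      = Uq * diag(sqrt(lam(1:q) - sigma2ML));   % WML = Uq (Kq - sigma2ML*I)^(1/2), R = I

Up to the arbitrary rotation R, the columns of WML span the same principal subspace as conventional PCA.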

TECHNICAL SCOPE OF PROJECT
The technical scope of the project involves extracting the hidden variables from a set of handwritten letter recognition data, taken from the UCI Machine Learning Repository data set titled Letter Image Recognition Data, using the two approaches described above, and then checking whether the recovered hidden variables correspond to the true eigenvectors of the data set.

The data set is very large, with 20,000 instances and 16 attributes, and class labels given as letter names (A, T, C, etc.). For our sample experiment, only 500 instances have been chosen for convenience.







Java Class for EM Approach (using JAMA package)
(Not Shown) 

Java Class for Maximum Likelihood Approach (using JAMA package)

(Not Shown)


EXPERIMENTAL RESULTS
The high-dimensional parallel coordinate plot for the UCI Letter Image Recognition Data is as follows:

[Figure: parallel coordinate plot of the observed data across the 16 variables.]

The above plot has 16,000 observations with 16 variables or dimensions.

[Figure: parallel coordinate plot of the data grouped into three classes after projection onto three hidden variables.]

After projecting the observed data onto three hidden variables and forming three classes, the parallel coordinate plot looks as shown above.

The Matlab code for the plot is:

Z = load('letter-recognition-training-unclassified-5000.txt');
% Z is of size 5000 x 16

varNames = {'x-box';'y-box';'width';'high';'onpix';'x-bar';'y-bar';'x2bar';'y2bar';'xybar';'x2ybr';'xy2br';'x-ege';'xegvy';'y-ege';'yegvx'};

% Adapted from the Matlab example "Visualizing Multivariate Data"

% evec holds the estimated principal axes (e.g. WML) as columns
P = Z * evec;                        % project the observations onto the hidden variables
[r, c] = size(P);

grp = zeros(r, 1);
for i = 1:r
    [~, grp(i)] = max(P(i, :));      % class = index of the largest projected coordinate
end

parallelcoords(Z, 'group', grp, 'standardize', 'on', 'labels', varNames);
set(gcf, 'color', 'white');







WORK DISTRIBUTION
The project has been a team effort, with the individual members' roles as listed below:

Somnath Chakrabarti – Design and Development

Sanket Gupte – Development and Experimentation

PROJECT SCHEDULE
Intermediate Project Status report submitted on Oct 29, 2009

Project Submitted on Dec 20, 2009

REFERENCES:
[1] Michael E. Tipping and Christopher M. Bishop, "Probabilistic Principal Component Analysis," Journal of the Royal Statistical Society, Series B, vol. 61, part 3, pp. 611-622.

[2] The Minh Luong, lecture notes, "Probabilistic Principal Component Analysis and the E-M algorithm," CS3750, Fall 2007.

[3] Sam Roweis, "EM Algorithms for PCA and SPCA," Computation & Neural Systems, California Institute of Technology.

[4] UCI Machine Learning Repository, data set: Letter Image Recognition Data.
