4/8/10

Probabilistic PCA



Maximum Likelihood PCA Approach for Handwritten Letter Data Set 




EXECUTIVE SUMMARY
Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but it is not based on a probability model, and its effectiveness is further limited by its global linearity. Here we demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of the parameters of a latent variable model closely related to factor analysis.

Note. We have not considered mixtures of probabilistic principal component analysers, whose parameters can also be determined with an Expectation Maximization (EM) algorithm that estimates the principal subspace iteratively.

The advantages of the PPCA model over conventional PCA are:

  • It addresses the limitations of regular PCA
  • PPCA can be used as a general Gaussian density model in addition to reducing dimensions
  • Maximum-likelihood estimates can be computed for the elements associated with the principal components
  • It captures the dominant correlations with few parameters
  • Multiple PCA models can be combined as a probabilistic mixture
  • It can be used as a base for Bayesian PCA
PROBLEM DEFINITION AND BACKGROUND
In the statistical literature on factor analysis, a high-dimensional data set is expressed as the sum of a linear combination of independent, normally distributed latent (or hidden) variables, the means along the different dimensions, and independent, normally distributed error (or noise) terms. Mathematically, this can be represented as

t = Wx + μ + ε

Where t is the d x N matrix of observed data, x is the q x N matrix of latent variables, W is the d x q factor loading matrix, μ is the d x N matrix formed by repeating the column of row means N times, and ε is the d x N matrix of independent error (noise) columns. Both the latent variables and the noise have zero mean, i.e. x ~ N(0, I) and ε ~ N(0, Ψ), where I is the identity matrix and Ψ is a diagonal matrix of noise variances. In the PPCA model the noise is isotropic, Ψ = σ²I, and the marginal distribution of the observations is t ~ N(μ, C), where C = WW^T + σ²I.
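As an illustration of this generative model, the following short Matlab sketch draws a synthetic data set from it (this is not part of the project code; d, q, N and the parameter values are arbitrary choices made for illustration):

% Draw a synthetic data set from the latent variable model t = W*x + mu + eps
d = 16; q = 3; N = 500;              % illustrative sizes
W      = randn(d, q);                % factor loading matrix
mu     = randn(d, 1);                % mean vector
sigma2 = 0.1;                        % isotropic noise variance (PPCA case)

X = randn(q, N);                     % latent variables, x ~ N(0, I)
E = sqrt(sigma2) * randn(d, N);      % noise, eps ~ N(0, sigma2*I)
T = W*X + repmat(mu, 1, N) + E;      % d x N matrix of observed data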

The project tries to probabilistically estimate W, whose columns represent the features or principal components of the observed data set t. Two approaches have been followed for identifying the features or reduced dimensions.

(1) Expectation Maximization

(2) Maximum Likelihood PCA





Expectation Maximization

In the EM approach, we have tried to determine the factor loading matrix WML and the noise variance σ²ML by maximizing the log-likelihood Lc of the observed data with respect to W and σ². The log-likelihood is calculated as

Lc = -(N/2) { d ln(2π) + ln|C| + tr(C^-1 S) }

where S is the sample covariance matrix of the observed data and C = WW^T + σ²I as defined above. In EM this maximization is done indirectly, by iteratively maximizing the expectation of the complete-data log-likelihood Σ_n ln p(tn, xn).

Considering the conditional distribution of tn given xn and the prior distribution of xn, both of which are Gaussian, and using Bayes' rule, the conditional distribution of xn given tn can be calculated as follows:

xn | tn ~ N( M^-1 W^T (tn - μ), σ² M^-1 ), where M = W^T W + σ²I

E-step

The above conditional distribution yields the posterior statistics needed in the E-step, namely the posterior mean and second moment

⟨xn⟩ = M^-1 W^T (tn - μ),   ⟨xn xn^T⟩ = σ² M^-1 + ⟨xn⟩⟨xn⟩^T
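A minimal Matlab sketch of this E-step, assuming Tc is the mean-adjusted d x N data matrix, W and sigma2 are the current parameter estimates, q is the latent dimension and N the number of observations (these variable names are ours, not taken from the project's Java classes):

M    = W'*W + sigma2*eye(q);         % M = W'W + sigma2*I  (q x q)
Minv = inv(M);
Ex   = Minv * W' * Tc;               % q x N matrix of posterior means <x_n>
Sx   = N*sigma2*Minv + Ex*Ex';       % sum over n of <x_n x_n'>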

Least Squares Reconstruction

The mean-adjusted observed data are projected onto W, giving the posterior means ⟨xn⟩ above, and the data are then reconstructed as W⟨xn⟩ + μ. The squared reconstruction error between the observed and reconstructed data is

SSE = Σ_n || tn - (W⟨xn⟩ + μ) ||²
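Continuing the sketch with the same assumed variable names, this reconstruction error can be computed as:

That = W * Ex;                       % reconstruction of the mean-adjusted data
SSE  = norm(Tc - That, 'fro')^2;     % squared reconstruction error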

M-step

Setting the derivatives of Lc with respect to W and σ² to zero, the updated maximum likelihood estimates are obtained as

Wnew = [ Σ_n (tn - μ)⟨xn⟩^T ] [ Σ_n ⟨xn xn^T⟩ ]^-1

σ²new = (1/(Nd)) Σ_n { ||tn - μ||² - 2⟨xn⟩^T Wnew^T (tn - μ) + tr( ⟨xn xn^T⟩ Wnew^T Wnew ) }

The values obtained in the M-step are used in the calculation of the posterior statistics ⟨xn⟩ and ⟨xn xn^T⟩ in the E-step of the next iteration. This is done using W = Wnew and σ² = σ²new.
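The corresponding Matlab sketch of the M-step updates, again with the variable names assumed in the E-step snippet:

Wnew   = (Tc * Ex') / Sx;            % Wnew = [sum_n (t_n - mu)<x_n>'] [sum_n <x_n x_n'>]^-1
sigma2 = (norm(Tc,'fro')^2 - 2*trace((Wnew'*Tc)*Ex') ...
          + trace(Sx*(Wnew'*Wnew))) / (N*d);
W = Wnew;                            % carried into the next E-step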

At the end of each iteration, the previous squared reconstruction error SSEprev is assigned the value of the current SSE (SSEprev = SSE); the iteration stops once the newly computed SSE becomes greater than SSEprev.
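Putting the E-step, reconstruction and M-step sketches together, the whole iteration can be written as the following self-contained Matlab sketch (T is the d x N data matrix and q the chosen latent dimension; this is an illustration only, not the project's Java/JAMA implementation mentioned later):

[d, N]  = size(T);
mu      = mean(T, 2);
Tc      = T - repmat(mu, 1, N);      % mean-adjusted data
W       = randn(d, q);               % random initialization
sigma2  = 1;
SSEprev = Inf;
while true
    M    = W'*W + sigma2*eye(q);     % E-step
    Minv = inv(M);
    Ex   = Minv * W' * Tc;
    Sx   = N*sigma2*Minv + Ex*Ex';
    SSE  = norm(Tc - W*Ex, 'fro')^2; % least-squares reconstruction error
    if SSE > SSEprev, break; end     % stop once the error starts increasing
    SSEprev = SSE;
    Wnew   = (Tc * Ex') / Sx;        % M-step
    sigma2 = (norm(Tc,'fro')^2 - 2*trace((Wnew'*Tc)*Ex') ...
              + trace(Sx*(Wnew'*Wnew))) / (N*d);
    W = Wnew;
end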

Maximum Likelihood PCA

Unlike the EM approach, the maximum likelihood approach is a closed-form approach: W and σ² are obtained not through an iterative process but from closed-form expressions derived from the stationary-point conditions of the log-likelihood function Lc, which has the same definition as in the EM approach.

At the stationary points, differentiating Lc with respect to W yields the equation

S C^-1 W = W,   where   S = (1/N) Σ_n (tn - μ)(tn - μ)^T

is the covariance matrix of the observed data and C and W have the usual definitions as before.

Now, writing the singular value decomposition W = U L V^T, it can be shown that S U L = U (σ²I + L²) L, so the eigenvalues of S satisfy λj = σ² + lj², where lj are the diagonal elements of L.

Thus the eigenvalues of S corresponding to the retained directions have to be at least σ², with equality only when lj = 0. The number of discarded dimensions has been chosen to be the number of eigenvalues equal to the smallest eigenvalue.

The maximum likelihood estimate of σ² is then calculated as the average of the d - q discarded (smallest) eigenvalues of S:

σ²ML = (1/(d - q)) Σ_{j = q+1..d} λj

Next, WML is estimated using the equation

WML = Uq (Kq - σ²ML I)^(1/2) R

where Uq is obtained by taking the q most significant column eigenvectors of S, Kq is a q x q diagonal matrix with the q largest eigenvalues along the diagonal, and R is an arbitrary orthogonal matrix.

It can be shown that the above estimates of WML and σ²ML correspond to the global maximum of the specified log-likelihood function.
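A hedged Matlab sketch of these closed-form estimates (again not the project's Java/JAMA class; T is the d x N data matrix, q the number of retained dimensions, and R is taken to be the identity):

[d, N]   = size(T);
mu       = mean(T, 2);
Tc       = T - repmat(mu, 1, N);
S        = (Tc * Tc') / N;           % sample covariance matrix of the observed data
[U, D]   = eig(S);
[lam, k] = sort(diag(D), 'descend'); % eigenvalues of S, largest first
U        = U(:, k);

sigma2ML = mean(lam(q+1:d));         % average of the d-q discarded eigenvalues
Uq       = U(:, 1:q);                % q principal column eigenvectors of S
WML      = Uq * diag(sqrt(lam(1:q) - sigma2ML));   % WML = Uq (Kq - sigma2ML*I)^(1/2), R = I

Up to the arbitrary rotation R, the columns of WML span the same principal subspace as conventional PCA.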

TECHNICAL SCOPE OF PROJECT
The technical scope of the project involves extracting the hidden variables from a set of handwritten letter recognition data, taken from the UCI Machine Learning Repository data set titled Letter Image Recognition Data, using the two approaches described above, and then checking whether the recovered hidden variables correspond to the true eigenvectors of the data set.

The data set is very large, with 20,000 instances and 16 attributes, and class labels given as letter names (A, T, C, etc.). For our sample experiment, only 500 instances have been chosen for convenience.







Java Class for EM Approach (using JAMA package)
(Not Shown) 

Java Class for Maximum Likelihood Approach (using JAMA package)

(Not Shown)


EXPERIMENTAL RESULTS
The high-dimensional parallel coordinate plot for the UCI Letter Image Recognition Data is as follows:

[Figure: parallel coordinate plot of the observed data across the 16 variables.]

The above plot has 16,000 observations with 16 variables or dimensions.

[Figure: parallel coordinate plot of the data grouped into three classes after projection onto three hidden variables.]

After projecting the observed data onto three hidden variables and forming three classes, the parallel coordinate plot looks as shown above.

The Matlab code for the plot is:

Z = load('letter-recognition-training-unclassified-5000.txt');
% Z is of size 5000 x 16

varNames = {'x-box';'y-box';'width';'high';'onpix';'x-bar';'y-bar';'x2bar';'y2bar';'xybar';'x2ybr';'xy2br';'x-ege';'xegvy';'y-ege';'yegvx'};

% Adapted from the Matlab example "Visualizing Multivariate Data"

% evec holds the estimated principal axes (e.g. WML) as columns
P = Z * evec;                        % project the observations onto the hidden variables
[r, c] = size(P);

grp = zeros(r, 1);
for i = 1:r
    [~, grp(i)] = max(P(i, :));      % class = index of the largest projected coordinate
end

parallelcoords(Z, 'group', grp, 'standardize', 'on', 'labels', varNames);
set(gcf, 'color', 'white');







WORK DISTRIBUTION
The project has been a team effort, with the individual members' roles as listed below:

Somnath Chakrabarti – Design and Development

Sanket Gupte – Development and Experimentation

PROJECT SCHEDULE
Intermediate Project Status report submitted on Oct 29, 2009

Project Submitted on Dec 20, 2009

REFERENCES:
[1] Michael E. Tipping and Christopher M. Bishop, "Probabilistic Principal Component Analysis," Journal of the Royal Statistical Society, Series B, vol. 61, part 3, pp. 611-622.

[2] The Minh Luong, lecture notes, "Probabilistic Principal Component Analysis and the E-M algorithm," CS3750, Fall 2007.

[3] Sam Roweis, "EM Algorithms for PCA and SPCA," Computation & Neural Systems, California Institute of Technology.

[4] UCI Machine Learning Repository, data set: Letter Image Recognition Data.
