4/8/10

Incomplete DM-Project




High-dimensional Clustering of Web-based text documents using iterative PCA based approach 

PROLOGUE

Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional data spaces are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once, and the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the dictionary.
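As a small illustration of how quickly the dimensionality grows for text, the toy Python sketch below turns three short documents into word-frequency vectors whose length equals the vocabulary size (the corpus and the helper name are invented purely for illustration):

from collections import Counter

# Toy corpus: each document becomes a word-frequency vector whose
# dimensionality equals the size of the shared dictionary.
docs = [
    "data mining of web documents",
    "clustering high dimensional text data",
    "principal component analysis of text",
]

tokenized = [d.split() for d in docs]
vocabulary = sorted(set(word for doc in tokenized for word in doc))

def word_frequency_vector(tokens, vocabulary):
    """Map a token list onto a fixed-length count vector over the vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [word_frequency_vector(doc, vocabulary) for doc in tokenized]
print(len(vocabulary), "dimensions")   # one dimension per dictionary word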

According to Kriegel, Kröger & Zimek (2009), four problems need to be overcome for clustering in high-dimensional data:



  • Multiple dimensions are hard to think in, impossible to visualize, and, due to the exponential growth of the number of possible values with each dimension, impossible to enumerate. This problem is known as the curse of dimensionality.



  • For spatial data, the concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges. The discrimination of the nearest and farthest point in particular becomes meaningless:

$$\lim_{d \to \infty} \frac{\mathrm{dist}_{\max}(d) - \mathrm{dist}_{\min}(d)}{\mathrm{dist}_{\min}(d)} \to 0$$


  • A cluster is intended to group objects that are related, based on observations of their attributes' values. However, given a large number of attributes, some of them will usually not be meaningful for a given cluster. For example, in newborn screening a cluster of samples might identify newborns that share similar blood values, which might lead to insights about the relevance of certain blood values for a disease. But for different diseases, different blood values might form a cluster, and other values might be uncorrelated. This is known as the local feature relevance problem: different clusters might be found in different subspaces, so a global filtering of attributes is not sufficient.
  • Given a large number of attributes, it is likely that some attributes are correlated. Hence, clusters might exist in arbitrarily oriented affine subspaces.


The traditional problem of clustering high-dimensional data arises from the fact that meaningful clusters lie in different subspaces and do not span all the dimensions. This has led to subspace clustering techniques such as PROCLUS [1], ORCLUS [2] and CLIQUE [3], which select a corresponding subset of features to find meaningful groups of clusters in high-dimensional data. Most of these algorithms consider different feature subset selection methods for clustering but ignore the feature transformation aspect. In this iterative PCA approach, we try to use feature transformation as a technique to discover features in high-dimensional data, rather than using it purely for dimensionality reduction. Since this concept may find good application in clustering text documents by category, we may try to extend it in future to web document classification.





ABSTRACT
Principal component analysis is a ubiquitous technique for data analysis and processing, but it is not based on a probability model, and its effectiveness is limited by its global linearity. Here we try to demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of the parameters in a latent variable model closely related to factor analysis. The result is a mixture of probabilistic principal component analyzers whose parameters can be determined using an Expectation Maximization (EM) algorithm that estimates the principal subspace iteratively. The advantages of this model are explained with respect to clustering and dimensionality reduction.





INTRODUCTION
Principal Component Analysis is a popular technique for dimensionality reduction. The most common derivation of PCA is in terms of a standardized linear projection which maximizes the variance in the projected space (Hotelling 1933). For a set of observed d-dimensional data vectors $\{t_n\}$, $n \in \{1, \ldots, N\}$, the q principal axes $w_j$, $j \in \{1, \ldots, q\}$, are those orthonormal axes onto which the retained variance under projection is maximal. It can be shown that the vectors $w_j$ are given by the q dominant eigenvectors (those with the largest associated eigenvalues $\lambda_j$) of the sample covariance matrix $S = \frac{1}{N}\sum_{n}(t_n - \bar{t})(t_n - \bar{t})^T$, where $\bar{t}$ is the data sample mean, such that $S w_j = \lambda_j w_j$. The q principal components of the observed vector $t_n$ are given by the vector $x_n = W^T(t_n - \bar{t})$, where $W = (w_1, w_2, \ldots, w_q)$. The variables $x_j$ are then uncorrelated such that their covariance matrix is diagonal with elements $\lambda_j$. [7]
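As a concrete illustration of this eigen-decomposition view, the short NumPy sketch below computes the principal axes and PC-scores of a random data matrix (the array shapes and variable names are ours, chosen only for illustration):

import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(size=(200, 10))            # N = 200 observations, d = 10 dimensions
q = 2                                     # number of retained principal axes

t_bar = T.mean(axis=0)                    # data sample mean
S = (T - t_bar).T @ (T - t_bar) / len(T)  # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)      # eigh: eigen-pairs of the symmetric matrix S
order = np.argsort(eigvals)[::-1]         # sort eigenvalues in descending order
W = eigvecs[:, order[:q]]                 # q dominant eigenvectors (columns of W)

X = (T - t_bar) @ W                       # PC-scores: x_n = W^T (t_n - t_bar)
print(np.cov(X, rowvar=False).round(3))   # approximately diagonal, entries ~ eigenvalues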

However, as remarked in many texts, this formulation of PCA does not include an associated probabilistic model for the observed data. It is possible to obtain a probabilistic formulation of PCA which is closely related to statistical factor analysis. The latent variable formulation leads to an iterative, computationally efficient Expectation Maximization (EM) algorithm. [7]

Factor Analysis:

Factor analysis is a linear latent variable model, which relates a d-dimensional observation vector t to a corresponding q-dimensional vector of latent variables x.
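Following [7], the model can be written as

$$t = W x + \mu + \epsilon,$$

where the $d \times q$ matrix $W$ contains the factor loadings, $\mu$ allows the model to have a non-zero mean, the latent variables are conventionally taken to be $x \sim \mathcal{N}(0, I)$, and the noise term $\epsilon \sim \mathcal{N}(0, \Psi)$ has a diagonal covariance matrix $\Psi$ (in probabilistic PCA, $\Psi$ is further restricted to the isotropic form $\sigma^2 I$).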

Factor analysis is related to principal component analysis (PCA) but not identical. Because PCA performs a variance-maximizing rotation of the variable space, it takes into account all variability in the variables. In contrast, factor analysis estimates how much of the variability is due to common factors ("communality"). The two methods become essentially equivalent if the error terms in the factor analysis model (the variability not explained by common factors, see below) can be assumed to all have the same variance.

In this model, by constraining the error covariance $\Psi$ to be a diagonal matrix whose elements are usually estimated from the data, the observed variables $t_i$ are conditionally independent given the values of the latent variables x. These latent variables are thus intended to explain the correlations between the observation variables, while $\epsilon_i$ represents the variability unique to a particular $t_i$.

Maximum Likelihood Estimators:

The maximum-likelihood estimate of the parameter $\mu$ is given by the mean of the data.

The log-likelihood is maximized when the columns of W span the principal subspace of the data.
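In fact, as shown in [7], the maximum-likelihood estimator of W can be written in closed form as

$$W_{ML} = U_q \left(\Lambda_q - \sigma^2 I\right)^{1/2} R,$$

where the q columns of $U_q$ are the principal eigenvectors of S, $\Lambda_q$ is the $q \times q$ diagonal matrix of the corresponding eigenvalues $\lambda_1, \ldots, \lambda_q$, and R is an arbitrary $q \times q$ orthogonal rotation matrix.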

Also, for $\sigma^2 > 0$, the maximum-likelihood estimator for $\sigma^2$ is given by

$$\sigma^2_{ML} = \frac{1}{d-q} \sum_{j=q+1}^{d} \lambda_j,$$

where $\lambda_{q+1}, \ldots, \lambda_d$ are the smallest eigenvalues of S, and so $\sigma^2_{ML}$ has a clear interpretation as the average variance lost per discarded dimension. [7]



Equalities of Eigen-values:

Equality among any of the q principal eigenvalues does not affect the maximum-likelihood estimates; however, the case where all of the d − q minor eigenvalues are equal and identical to at least one retained eigenvalue needs to be considered separately.

Expectation maximization:

In the EM approach, the latent variables $\{x_n\}$ are regarded as 'missing' data. If their values were known, the estimation of W would be straightforward using standard least-squares techniques. The value of $x_n$ for a given $t_n$ is unknown; however, the joint distribution of the observed and latent variables, $p(t, x)$, is known, and so the expectation of the corresponding complete-data log-likelihood can be calculated.

In the E-step of the EM algorithm, this expectation, taken with respect to the posterior distribution of $x_n$ given $t_n$, is computed. In the M-step, new parameter values $\widetilde{W}$ and $\widetilde{\sigma}^2$ are determined which maximize the expected complete-data log-likelihood, and this is guaranteed to increase the likelihood of interest unless it is already at a local maximum (Dempster, Laird, and Rubin 1977).
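For reference, with $M = W^T W + \sigma^2 I$, the E-step and M-step quantities given in [7] take the form

$$\langle x_n \rangle = M^{-1} W^T (t_n - \mu), \qquad \langle x_n x_n^T \rangle = \sigma^2 M^{-1} + \langle x_n \rangle \langle x_n \rangle^T,$$

$$\widetilde{W} = \left[\sum_n (t_n - \mu)\langle x_n \rangle^T\right]\left[\sum_n \langle x_n x_n^T \rangle\right]^{-1},$$

$$\widetilde{\sigma}^2 = \frac{1}{Nd}\sum_n \left\{\|t_n - \mu\|^2 - 2\langle x_n \rangle^T \widetilde{W}^T (t_n - \mu) + \operatorname{tr}\!\left(\langle x_n x_n^T \rangle \widetilde{W}^T \widetilde{W}\right)\right\}.$$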

The complete-data log-likelihood is given by

$$\mathcal{L}_C = \sum_{n=1}^{N} \ln\, p(t_n, x_n).$$

The overall distribution of a mixture of such latent variable models can be considered to be of the form

$$p(t) = \sum_{i=1}^{M} \pi_i\, p(t \mid i),$$

where $p(t \mid i)$ is a single probabilistic PCA model and $\pi_i$ is the corresponding mixing proportion, with $\pi_i \ge 0$ and $\sum_i \pi_i = 1$. The parameters of this mixture model can be determined by an extension of the EM algorithm.

The missing data now include a set of latent variables $\{x_{ni}\}$ for each model i, together with indicator variables $z_{ni}$ labelling which model is responsible for generating each data point $t_n$. The complete-data log-likelihood for the standard EM algorithm then takes the form

$$\mathcal{L}_C = \sum_{n=1}^{N} \sum_{i=1}^{M} z_{ni} \ln\!\left\{\pi_i\, p(t_n, x_{ni})\right\}.$$
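In the corresponding E-step, the expected value of $z_{ni}$ is the posterior responsibility of model i for data point $t_n$,

$$R_{ni} = \langle z_{ni} \rangle = \frac{\pi_i\, p(t_n \mid i)}{p(t_n)},$$

which is then used to weight the per-model updates of $\pi_i$, $\mu_i$, $W_i$ and $\sigma_i^2$ in the M-step.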





PROBLEM DEFINITION
Given a high-dimensional data set D with n observations and d dimensions, n being comparatively large, the objective is to find good-quality clusters in the high-dimensional data.

The algorithm starts by performing Principal Component Analysis (PCA) on D to obtain the first set of PC-scores, i.e. the scores along the most significant principal component, the one with the largest eigenvalue. Since the principal components are mutually uncorrelated, each of these components of D can be considered as an i.i.d. Gaussian random variable with mean equal to the corresponding column mean of D.

Since these values are PC-scores, they can be normalized to unit length without loss of relative similarity with respect to the principal components; the resulting values are then treated as normalized i.i.d. random variables.

Since PCA is not used here for dimensionality reduction, the original data records corresponding to these PC-scores can be recovered and then removed from the original data set D. For convenience, we denote the remaining records as D1, the reduced data set. Next, another round of PCA is performed on D1 to obtain the next set of PC-scores along its most significant principal component. The entire process described above is repeated until the remaining data set is empty, and the number of iterations yields the number of independent sub-clusters present in the high-dimensional data D.




The Iterative PCA Algorithm
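A minimal NumPy sketch of the loop described above, assuming cosine similarity of each centred record against the dominant eigenvector and a similarity threshold et as discussed in the next section; the function and variable names are ours and purely illustrative:

import numpy as np

def iterative_pca_clusters(D, et=0.9):
    """Sketch of the iterative PCA clustering loop.

    At each iteration, PCA is run on the remaining records, records whose
    centred direction lies close to the dominant eigenvector
    (cosine similarity >= et) are grouped into one sub-cluster and removed,
    and the process repeats until no records remain.
    """
    remaining = np.arange(len(D))          # indices of records not yet clustered
    clusters = []

    while len(remaining) > 0:
        X = D[remaining]
        X_centered = X - X.mean(axis=0)

        # Dominant eigenvector of the sample covariance matrix.
        S = X_centered.T @ X_centered / len(X)
        eigvals, eigvecs = np.linalg.eigh(S)
        w1 = eigvecs[:, np.argmax(eigvals)]

        # Cosine similarity of each centred record with the dominant axis.
        norms = np.linalg.norm(X_centered, axis=1)
        norms[norms == 0] = 1.0            # guard against zero vectors
        cos_sim = np.abs(X_centered @ w1) / norms

        members = cos_sim >= et
        if not members.any():              # avoid an infinite loop: take the best match
            members = cos_sim == cos_sim.max()

        clusters.append(remaining[members])
        remaining = remaining[~members]    # feed back: drop clustered records

    return clusters                        # number of iterations = number of sub-clusters

For example, iterative_pca_clusters(D, et=0.9) returns one index array per sub-cluster, and the number of arrays equals the number of iterations, i.e. the number of independent sub-clusters.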






Notion of Similarity and choice of et

Since we are dealing with high-dimensional data, there is little meaning in defining a similarity metric based on conventional distance norms, as shown by C. C. Aggarwal et al. in [6]. So this approach measures similarity based on the concept of cosine similarity, computed in the feature space, and then feeds the information back to the original data set to remove those records from the original space. This is continued until the remaining data are found to be similar to some eigenvector, lower-bounded by the eigenvalue threshold et.

The initial choice of et is a debatable issue; as an initial assumption we take it to be a fixed constant, say 0.9. As the algorithm iterates, it recalculates et. We take the mean of the unit-normalized PC-score vectors and check that this mean is a close approximation of the unit eigenvector passing through those PC-scores. We keep adding significant PC-scores in steps, updating the mean incrementally, until a deterioration from the previously calculated mean is found, and then fix et at the lowest of the most significant values of the lot. This adaptive calculation of et is carried out at each iteration until the entire data set is clustered.
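The adaptive recalculation above is described only loosely, so the helper below is a hypothetical sketch of one way to read it: records are ordered by their similarity to the dominant eigenvector w1, the set is grown in steps until the mean direction stops improving, and et is set to the lowest similarity retained. None of these names come from the original write-up.

import numpy as np

def adaptive_et(X_centered, w1, step=10, default=0.9):
    """Hypothetical reading of the adaptive et update (our interpretation)."""
    scores = X_centered @ w1                              # PC-scores along w1
    norms = np.linalg.norm(X_centered, axis=1) + 1e-12
    cos_sim = np.abs(scores) / norms                      # similarity to w1
    order = np.argsort(-cos_sim)                          # most significant first
    directions = X_centered[order] / norms[order][:, None]

    best_alignment = -np.inf
    et = default
    for k in range(step, len(order) + 1, step):
        mean_dir = directions[:k].mean(axis=0)
        alignment = abs(mean_dir @ w1) / (np.linalg.norm(mean_dir) + 1e-12)
        if alignment < best_alignment:                    # deterioration: stop growing
            break
        best_alignment = alignment
        et = cos_sim[order[k - 1]]                        # lowest similarity retained so far
    return et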



P.S.: Project incomplete. Topic changed to "Maximum Likelihood PCA Approach for Handwritten Letter Data Set", given in another blog post.


Members: Sanket Gupte, Somnath Chakrabarti




REFERENCES:

[1] Aggarwal C. C., Procopiuc C., Wolf J. L., Yu P. S., Park J. S. (1999) Fast Algorithms for Projected Clustering. SIGMOD, 1999.

[2] Aggarwal C. C., Yu P. S. (2000) Finding Generalized Projected Clusters in High Dimensional Spaces. SIGMOD, 2000.

[3] Agrawal R., Gehrke J., Gunopulos D., Raghavan P. (1998) Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. SIGMOD, 1998.

[4] Domeniconi C., Gunopulos D., Ma S., Yan B., Al-Razgan M., Papadopoulos D. (2007) Locally Adaptive Metrics for Clustering High Dimensional Data. Data Min Knowl Disc 14:63-97. DOI 10.1007/s10618-006-0060-8.

[5] Domeniconi C., Papadopoulos D., Gunopulos D., Ma S. (2004) Subspace Clustering of High Dimensional Data. SIAM International Conference on Data Mining, pp 517-520.

[6] Aggarwal C. C., Hinneburg A., Keim D. A. (2001) On the Surprising Behavior of Distance Metrics in High Dimensional Space. ICDT Conference Proceedings, 2001.

[7] Tipping M. E., Bishop C. M. (1999) Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society, Series B, 61(3), pp 611-622.
