An Iterative PCA-Based Approach for Clustering High-Dimensional Text Documents on the Web
Project Members: Somnath Chakrabarti, Sanket Gupte

Executive Summary:

The traditional problem of clustering high-dimensional data arises from the fact that meaningful clusters lie in different subspaces and do not span all the dimensions. This has led to subspace clustering techniques such as PROCLUS [1], ORCLUS [2], and CLIQUE [3], which select corresponding subsets of features to find meaningful groups of clusters in high-dimensional data. Most of these algorithms consider different feature subset selection methods for clustering but ignore the feature transformation aspect. In this iterative PCA approach, we try to use feature transformation as a potential technique for finding structure in high-dimensional data, rather than using it for dimensionality reduction. Since this concept may find good application in clustering text documents by category, we may try to extend it in the future to web document classification.

Problem Definition and Background:

Given a high-dimensional data set D whose observations have n dimensions each, n being comparatively large, the objective is to find good-quality clusters in D.

The algorithm starts by performing Principal Component Analysis (PCA) on D to obtain the first PC-scores, those having the largest eigenvalue along the most significant principal component. Since principal components are independent of each other, each of these components of D can be considered an i.i.d. random variable with a Gaussian distribution whose mean equals the corresponding column-mean of D.
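
As an illustration of this step, here is a minimal sketch in Python/numpy (Python is used purely for illustration, since the implementation is planned in MATLAB; the function name is ours):

    import numpy as np

    def top_pc_scores(D, k=1):
        """Return the scores along the k most significant principal
        components of D, with their eigenvalues and directions."""
        Dc = D - D.mean(axis=0)                  # center on the column-means
        U, S, Vt = np.linalg.svd(Dc, full_matrices=False)
        scores = Dc @ Vt[:k].T                   # PC-scores, one row per record
        eigvals = S[:k] ** 2 / (D.shape[0] - 1)  # eigenvalues of the covariance
        return scores, eigvals, Vt[:k]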

Since these values are PC-scores, the corresponding random variables can be normalized without loss of relative similarity with respect to the principal components; the PC-scores thus become unit-normalized i.i.d. random variables.
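
Concretely, because cosine similarity ignores vector magnitude, scaling each PC-score vector to unit length leaves the relative similarities intact (a sketch; the helper name is ours):

    import numpy as np

    def unit_normalize(scores, eps=1e-12):
        """Scale each row to unit L2 norm; cosine similarity between
        rows then reduces to a plain dot product."""
        norms = np.linalg.norm(scores, axis=1, keepdims=True)
        return scores / np.maximum(norms, eps)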

Since PCA is not used here for dimensionality reduction, the original data records can be recovered from the PC-scores and then removed from the original data set D. For convenience, we denote the remaining data set as D_1. Next, another round of PCA is performed on D_1 to obtain the next first PC-scores with the largest eigenvalue along the most significant principal component. The entire process described above is repeated until the remaining data set is empty. The number of iterations yields the number of independent sub-clusters present in the high-dimensional data D.



The Iterative PCA Algorithm

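The following is a minimal sketch of the loop described above, in Python/numpy for illustration (the proposal itself targets MATLAB). It assumes that, in each iteration, one cluster is formed from the records whose cosine similarity with the leading principal direction clears a fixed e_t; all names, the default threshold, and the fallback for a non-matching pass are our own assumptions, not part of the proposal:

    import numpy as np

    def iterative_pca_clusters(D, e_t=0.9):
        """Sketch of the iterative-PCA clustering loop: records close
        (in cosine) to the leading principal direction form one cluster,
        are removed, and PCA is rerun on the remainder until empty."""
        remaining = np.asarray(D, dtype=float)
        idx = np.arange(remaining.shape[0])
        clusters = []
        while remaining.shape[0] > 0:
            Dc = remaining - remaining.mean(axis=0)   # column-mean centering
            _, _, Vt = np.linalg.svd(Dc, full_matrices=False)
            v = Vt[0]                                 # leading principal direction
            norms = np.linalg.norm(Dc, axis=1)
            norms[norms == 0] = 1.0                   # guard degenerate rows
            cos = np.abs(Dc @ v) / norms              # cosine similarity to v
            members = cos >= e_t
            if not members.any():                     # guarantee progress
                members = cos == cos.max()
            clusters.append(idx[members].tolist())
            remaining = remaining[~members]
            idx = idx[~members]
        return clusters                               # one cluster per iteration

Called on a term-document matrix, each returned group of record indices corresponds to one iteration of the loop, so the number of groups equals the number of sub-clusters found.
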
Notion of Similarity and Choice of e_t

Since we are dealing with high-dimensional data, defining a similarity metric based on the usual L_p norms is not meaningful, as shown by C. C. Aggarwal et al. in [6]. This approach therefore measures similarity using cosine similarity, computed in the feature space, and then feeds that information back to the original data set to remove the corresponding records from the original space. This continues until the remaining data is found to be similar to some eigenvector, lower-bounded by the eigenvalue threshold e_t.
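
As a small illustration of this selection step (a sketch; select_similar is our own name, and Dc is assumed to be the centered data):

    import numpy as np

    def select_similar(Dc, v, e_t):
        """Boolean mask of the (centered) records whose cosine
        similarity with the unit eigenvector v is at least e_t."""
        norms = np.linalg.norm(Dc, axis=1)
        norms[norms == 0] = 1.0
        return np.abs(Dc @ v) / norms >= e_t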

The initial choice of e_t is a debatable issue; as a starting assumption we take it to be a fixed constant, say 0.9. As the algorithm iterates, it recalculates e_t. We take the mean of the unit-normalized PC-scores and ensure that this mean is a close approximation of the unit eigenvector passing through those PC-scores. We keep taking significant PC-scores in steps, updating the mean incrementally, until we find a deterioration from the previously calculated mean, and then fix e_t at the lowest of the most significant eigenvalues of the lot. This adaptive e_t calculation is carried out at each iteration until the entire data set is clustered.
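
One possible reading of this adaptive rule, again as a hedged sketch: we interpret "the lowest of the most significant eigen-values of the lot" as the smallest similarity score among the records kept; the step size, the alignment measure, and all names are our assumptions:

    import numpy as np

    def adapt_threshold(Dc, v, step=10):
        """Add records in decreasing order of cosine similarity to the
        unit eigenvector v, track how closely the mean of the
        unit-normalized records follows v, stop at the first
        deterioration, and return the lowest similarity kept as e_t."""
        norms = np.linalg.norm(Dc, axis=1)
        norms[norms == 0] = 1.0
        unit = Dc / norms[:, None]              # unit-normalized records
        cos = np.abs(unit @ v)
        order = np.argsort(-cos)                # most similar records first
        best, n_kept = -np.inf, step
        while n_kept <= len(order):
            m = unit[order[:n_kept]].mean(axis=0)
            align = abs(m @ v) / max(np.linalg.norm(m), 1e-12)
            if align < best:                    # deterioration: step back, stop
                n_kept -= step
                break
            best, n_kept = align, n_kept + step
        n_kept = min(n_kept, len(order))
        return cos[order[:n_kept]].min()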

Technical Scope of the Project

The initial algorithm and any visualization will be implemented in MATLAB. If satisfactory results are obtained, the algorithm will be compared against locally adaptive clustering (LAC) techniques [4] [5]. Depending on the results, there may be an additional plan to extend the work to web documents using Java and XML technologies.

Distribution of the Work among the Project Members

Somnath – Design and Implementation in MATLAB
Sanket – Implementation and Experimentation in MATLAB

If the work is extended in the future, the tasks will be redistributed between the team members.

Project Schedule

The initial implementation of the algorithm and its results are due by 29 Oct 2009.

References:

[1] Aggarwal C. C, Procopiuc C, Wolf J. L, Yu P. S, Park J. S (1999) Fast Algorithms for Projected Clustering. SIGMOD, 1999

[2] Aggarwal C. C, Yu P. S (2000) Finding generalized projected clusters in high dimensional spaces. SIGMOD, 2000

[3] Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. SIGMOD, 1998

[4] Domeniconi C, Gunopulos D, Ma S, Yan B, Al-Razgan M, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Disc 14:63-97, DOI 10.1007/s10618-006-0060-8

[5] Domeniconi C, Papadopoulos D, Gunopulos D, Ma S (2004) Subspace Clustering of High Dimensional Data. SIAM international conference on data mining, pp 517-520

[6] Aggarwal C. C, Hinneberg A, Keim D. A (2001) On the Surprising Behavior of Distance Metrics in High Dimensional Space. ICDT Conference Proceedings, 2001
