4/22/10

Life Before the Computer

An application was for employment
A program was a TV show
A cursor used profanity
A keyboard was a piano!
Memory was something that you lost with age
A CD was a bank account
And if you had a 3 1/2 inch floppy
You hoped nobody found out!
Compress was something you did to garbage
Not something you did to a file
And if you unzipped anything in public
You'd be in jail for a while!
Log on was adding wood to a fire
Hard drive was a long trip on the road
A mouse pad was where a mouse lived
And a backup happened to your commode!
Cut - you did with a pocket knife
Paste you did with glue
A web was a spider's home
And a virus was the flu!
I guess I'll stick to my pad and paper
And the memory in my head
I hear nobody's been killed in a computer crash
But when it happens they wish they were dead!

Courtesy: http://www.jimgibb.com

Poem for Computer Geeks

What if Dr. Seuss Wrote Computer Manuals?


If a packet hits a pocket on a socket on a port,

and the bus is interrupted as a very last resort,

and the address of the memory makes your floppy disk abort,

Then the socket packet pocket has an error to report.



If your cursor finds a menu item followed by a dash

and the double-clicking icons put your window in the trash

and your data is corrupted 'cause the index doesn't hash,

Then your situation's hopeless, and your system's gonna crash.



If the label on your cable on the gable at your house,

says the network is connected to the button on your mouse,

But your packets want to tunnel to another protocol,

That's repeatedly rejected by the printer down the hall.



And your screen is all distorted by the side effects of gauss,

so your icons in the window are as wavy as a souse,

Then you may as well reboot and go out with a bang,

Cause as sure as I'm a poet, the sucker's gonna hang.



When the copy of your floppy's getting sloppy on the disk,

And the microcode instructions cause unnecessary RISC,

then you have to flash your memory and you'll want to RAM your ROM,

Quickly turn off your computer and be sure to tell your mom!

Courtesy: http://www.jimgibb.com

4/21/10

Interesting .. How Stats Can Go Wrong!

Impressive .. Stats and Probability


MatheMagic

Whoa... True Mathematical Genius..  Awesomeness Redefined


4/16/10

High Performance Cloud Computing

http://www.buyya.com/papers/HPCC-ISPAN2009-Keynote.pdf

Cloud Computing Briefing - Jan'10

http://www.cpni.gov.uk/Docs/cloud-computing-briefing.pdf

Introduction to Cloud Platforms-Enterprise Oriented View

http://www.davidchappell.com/CloudPlatforms--Chappell.pdf

Cloud Ontology

http://www.cs.ucsb.edu/~lyouseff/CCOntology/CloudOntology.pdf

4/11/10

A Break in the Clouds: Towards a Cloud Definition

A Break in the Clouds: Towards a Cloud Definition

http://ccr.sigcomm.org/drupal/files/p50-v39n1l-vaqueroA.pdf


A Berkeley View of Cloud Computing

Above the Clouds: A Berkeley View of Cloud Computing

http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf

A Walk in the Clouds

A Walk in the Clouds: Broadband Computing and Communication

http://www.online-pr.com/Holding/Cloud_Computing.pdf

4/8/10

Incomplete DM-Project




High-dimensional Clustering of Web-based Text Documents Using an Iterative PCA-based Approach

PROLOGUE

Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional data spaces are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once, and the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the dictionary.

According to Kriegel, Kröger & Zimek (2009), four problems need to be overcome for clustering in high-dimensional data:



  • Multiple dimensions are hard to think in, impossible to visualize, and, due to the exponential growth of the number of possible values with each dimension, impossible to enumerate. This problem is known as the curse of dimensionality.



  • For spatial data, the concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges. The discrimination of the nearest and farthest point in particular becomes meaningless: \lim_{d \to \infty} (D_{max} - D_{min}) / D_{min} \to 0.


  • A cluster is intended to group objects that are related, based on observations of their attributes' values. However, given a large number of attributes, some of the attributes will usually not be meaningful for a given cluster. For example, in newborn screening a cluster of samples might identify newborns that share similar blood values, which might lead to insights about the relevance of certain blood values for a disease. But for different diseases, different blood values might form a cluster, and other values might be uncorrelated. This is known as the local feature relevance problem: different clusters might be found in different subspaces, so a global filtering of attributes is not sufficient.
  • Given a large number of attributes, it is likely that some attributes are correlated. Hence, clusters might exist in arbitrarily oriented affine subspaces.


The traditional problem of clustering high-dimensional data arises from the fact that different subspaces contain meaningful clusters which do not span all the dimensions. This has led to different subspace clustering techniques such as PROCLUS [1], ORCLUS [2] and CLIQUE [3], which are based on selecting a corresponding subset of features to find meaningful groups of clusters in high-dimensional data. Most of these algorithms consider different feature-subset selection methods for clustering but ignore the feature-transformation aspect. In this iterative PCA approach, we try to use feature transformation as a technique to discover features in high-dimensional data, rather than using it for dimensionality reduction. Since this concept may find good application in text-document clustering based on document category, we may try to extend it in the future to web document classification.





ABSTRACT
Principal component analysis is a ubiquitous technique for data analysis and processing, but it is not based on a probability model, and its effectiveness is therefore limited by its global linearity. Here we try to demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis. The result is a mixture of probabilistic principal component analyzers whose parameters can be determined using an Expectation-Maximization algorithm, estimating the principal subspace iteratively. The advantages of this model are explained with respect to clustering and dimensionality reduction.





INTRODUCTION
Principal Component Analysis is a popular technique for dimensionality reduction. The most common derivation of PCA is in terms of a standardized linear projection which maximizes the variance in the projected space (Hotelling 1933). For a set of observed d-dimensional data vectors {t_n}, n = 1, ..., N, the q principal axes w_j, j = 1, ..., q, are those orthonormal axes onto which the retained variance under projection is maximal. It can be shown that the vectors w_j are given by the q dominant eigenvectors (those with the largest associated eigenvalues \lambda_j) of the sample covariance matrix S = \frac{1}{N} \sum_n (t_n - \bar{t})(t_n - \bar{t})^T, where \bar{t} is the data sample mean, such that S w_j = \lambda_j w_j. The q principal components of an observed vector t_n are given by the vector x_n = W^T (t_n - \bar{t}), where W = (w_1, ..., w_q). The variables x_j are then uncorrelated, such that the covariance matrix \frac{1}{N} \sum_n x_n x_n^T is diagonal with elements \lambda_j. [7]
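As a concrete illustration of this projection (not part of the original report; the data matrix below is a random placeholder), a minimal MATLAB sketch of classical PCA via the sample covariance matrix might look like this:

% Minimal sketch: classical PCA via the sample covariance matrix.
% T is an N x d data matrix of placeholder values; q is the number of retained axes.
T = randn(100, 16);
q = 3;

mu = mean(T, 1);                        % data sample mean
Tc = T - repmat(mu, size(T, 1), 1);     % mean-adjusted data
S  = (Tc' * Tc) / size(T, 1);           % sample covariance matrix S

[V, D] = eig(S);                        % eigenvectors / eigenvalues of S
[lambda, order] = sort(diag(D), 'descend');
W = V(:, order(1:q));                   % q dominant eigenvectors (principal axes)

X = Tc * W;                             % PC-scores x_n = W'*(t_n - mean), one row per observation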

However, as remarked in many texts, PCA does not come with an associated probabilistic model for the observed data. It is possible to obtain a probabilistic formulation of PCA which is closely related to statistical factor analysis. The latent-variable formulation gives us an algorithm that is iterative, computationally efficient, and an instance of Expectation Maximization. [7]

Factor Analysis:

Factor analysis is a linear latent variable model, which relates a d-dimensional observation vector t to a corresponding q-dimensional vector of latent variables x.

Factor analysis is related to principal component analysis (PCA) but not identical. Because PCA performs a variance-maximizing rotation of the variable space, it takes into account all variability in the variables. In contrast, factor analysis estimates how much of the variability is due to common factors ("communality"). The two methods become essentially equivalent if the error terms in the factor analysis model (the variability not explained by common factors, see below) can be assumed to all have the same variance.

In this model, by constraining the error covariance \Psi to be a diagonal matrix whose elements are usually estimated from the data, the observed variables t are conditionally independent given the values of the latent variables x. The latent variables are thus intended to explain the correlations between the observed variables, while \epsilon_i represents the variability unique to a particular t_i.
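For reference, the factor-analysis model described above can be written out explicitly; the equations below follow the standard PPCA formulation in [7] rather than anything shown in the original post:

t = W x + \mu + \epsilon, \qquad x \sim N(0, I), \qquad \epsilon \sim N(0, \Psi)

with \Psi diagonal, so that t \sim N(\mu, WW^T + \Psi). Probabilistic PCA is the special isotropic case \Psi = \sigma^2 I, giving the observation covariance C = WW^T + \sigma^2 I.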

Maximum Likelihood Estimators:

The maximum-likelihood estimate of the parameter \mu is given by the mean of the data.

The log-likelihood is maximized when the columns of W span the principal subspace of the data.

Also, for the isotropic case \Psi = \sigma^2 I, the maximum-likelihood estimator for \sigma^2 is given by

\sigma^2_{ML} = \frac{1}{d - q} \sum_{j=q+1}^{d} \lambda_j

where \lambda_{q+1}, ..., \lambda_d are the smallest eigenvalues of S, and so \sigma^2_{ML} has a clear interpretation as the average variance lost per discarded dimension. [7]



Equalities of Eigen-values:

Equality among any of the q principal eigenvalues does not affect the maximum-likelihood estimates, but the case when all the d-q minor eigenvalues are equal and identical to at least one retained eigenvalue is taken under consideration.

Expectation maximization:

In the EM approach, the latent variables {x_n} are considered 'missing' data. If their values were known, estimating W from them using standard least-squares techniques would be easy. But the value of x_n for a given t_n is unknown; instead, the joint distribution of the observed and latent variables, p(t, x), is known, and the expectation of the corresponding complete-data log-likelihood can be calculated.

In the E-step of the EM algorithm, this expectation is computed with respect to the posterior distribution of x_n given t_n. In the M-step, new parameter values \tilde{W} and \tilde{\sigma}^2 are determined which maximize the expected complete-data log-likelihood; this is guaranteed to increase the likelihood of interest unless it is already at a local maximum (Dempster, Laird, and Rubin 1977).

The complete-data log-likelihood is given by

L_C = \sum_{n=1}^{N} \ln \{ p(t_n, x_n) \}

The overall model distribution of a latent variable model can be considered to be of the form

p(t) = \sum_{i=1}^{M} \pi_i \, p(t \mid i)

where p(t | i) is a single probabilistic PCA model and \pi_i is the corresponding mixing proportion. The parameters of this model can be determined by an extension of the EM algorithm.

The missing data then include a set of latent variables {x_n} for each model i, together with indicator variables z_{ni} labeling which model is responsible for generating each data point t_n. The complete-data log-likelihood of the corresponding standard EM algorithm takes the form

L_C = \sum_{n} \sum_{i} z_{ni} \ln \{ \pi_i \, p(t_n, x_n \mid i) \}



PROBLEM DEFINITION
Given a high-dimensional data set with n observations and d dimensions, n being comparatively large, the objective is to find good-quality clusters in the high-dimensional data.

The algorithm starts by performing Principal Component Analysis (PCA) on the data set to obtain the first set of PC-scores, those having the largest eigenvalues along the most significant principal component. Since principal components are independent of each other, each of these components can be considered as i.i.d. random variables with Gaussian distributions having means equal to the column means of the data.

Since these are PC-scores, the random variables can be normalized without loss of relative similarity with respect to the principal components, so we work with the normalized i.i.d. random variables.

Since PCA is not used here for dimensionality reduction, the original data records corresponding to these PC-scores can be retrieved and then removed from the original data set; for convenience, we refer to what is left as the remaining data set. Next, another round of PCA is performed on the remaining data to get the next set of first PC-scores with the largest eigenvalues along the most significant principal component. The entire process described above is repeated until the remaining data set is empty. The number of iterations yields the number of independent sub-clusters present in the high-dimensional data.




The Iterative PCA Algorithm






Notion of Similarity and choice of et

Since we are dealing with high-dimensional data, there is little meaning in defining a similarity metric based on the usual norms, as shown by C. C. Aggarwal et al. in [6]. This approach therefore measures similarity based on cosine similarity, and does so in feature space; the information is then fed back to the original data set to remove those records from the original space. This is continued until the remaining data is found to be similar to some eigenvector, bounded below by the eigenvalue threshold et.

The initial choice of et is a debatable issue; as an initial assumption, we take it to be a fixed constant, say 0.9. As the algorithm iterates, it recalculates et. We take the mean of the normalized PC-scores and ensure that this mean is a close approximation of the unit eigenvector passing through those PC-scores. We keep taking significant PC-scores in steps, updating the mean incrementally, until we find deterioration from the previously calculated mean, and then fix et at the lowest of the most significant eigenvalues of that lot. This adaptive et calculation is carried out in each iteration until the entire data set is clustered.
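The algorithm listing itself is not reproduced in this post. Purely as an illustration of the loop described above, a minimal MATLAB sketch might look like the following; it assumes cosine similarity against the leading principal axis of the remaining data and a fixed threshold et, and omits the adaptive recalculation of et:

% Minimal sketch of the iterative PCA clustering loop (illustrative only).
% Assumptions: cosine similarity against the leading principal axis, fixed et.
X  = randn(200, 50);                    % placeholder high-dimensional data
et = 0.9;                               % similarity threshold (initial choice)
labels    = zeros(size(X, 1), 1);       % cluster label per record
remaining = (1:size(X, 1))';            % records not yet clustered
k = 0;                                  % cluster counter

while ~isempty(remaining)
    k  = k + 1;
    R  = X(remaining, :);
    Rc = R - repmat(mean(R, 1), size(R, 1), 1);
    [V, D] = eig(Rc' * Rc / size(R, 1));
    [dummy, j] = max(diag(D));
    w = V(:, j);                                        % leading principal axis
    sim = abs(Rc * w) ./ (sqrt(sum(Rc.^2, 2)) + eps);   % cosine similarity in feature space
    members = (sim >= et);
    if ~any(members), members = (sim == max(sim)); end  % guard against an empty step
    labels(remaining(members)) = k;
    remaining = remaining(~members);
end
% k gives the number of sub-clusters found; labels holds the assignment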



P.S.: Project incomplete. The topic was changed to "Maximum Likelihood PCA Approach for Handwritten Letter Data Set", given in another blog post.


Members: Sanket Gupte, Somnath Chakrabarti




REFERENCES:

[1] Aggarwal C, Procopiuc C, Wolf J. L, Yu P. S, Park J. S (1999) Fast Algorithms for Projected Clustering. SIGMOD, 1999

[2] Aggarwal C. C, Yu P. S (2000) Finding generalized projected clusters in high dimensional spaces. SIGMOD, 2000

[3] Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. SIGMOD, 1998

[4] Domeniconi C, Gunopulos D, Ma S, Yan B, Al-Razgan Muna, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Disc (2007) 14:63-97 DOI 10.1007/s10618-006-0060-8

[5] Domeniconi C, Papadopoulos D, Gunopulos D, Ma S (2004) Subspace Clustering of High Dimensional Data. SIAM international conference on data mining, pp 517-520

[6] Aggarwal C. C, Hinneberg A, Keim D. A (2001) On the Surprising Behavior of Distance Metrics in High Dimensional Space. ICDT Conference Proceedings, 2001

[7] Tipping M. E, Bishop C. M (1999) Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society, Series B, 61, Part 3, pp. 611-622

Proposal DM



An Iterative PCA-based Approach for Clustering High-dimensional Text Documents on the Web
Project Members: Somnath Chakrabarti, Sanket Gupte

Executive Summary:

The traditional problem of clustering high-dimensional data arises from the fact that different subspaces contain meaningful clusters which do not span all the dimensions. This has led to different subspace clustering techniques such as PROCLUS [1], ORCLUS [2] and CLIQUE [3], which are based on selecting a corresponding subset of features to find meaningful groups of clusters in high-dimensional data. Most of these algorithms consider different feature-subset selection methods for clustering but ignore the feature-transformation aspect. In this iterative PCA approach, we try to use feature transformation as a technique to discover features in high-dimensional data, rather than using it for dimensionality reduction. Since this concept may find good application in text-document clustering based on document category, we may try to extend it in the future to web document classification.

Problem Definition and Background:

Given a high-dimensional data set with n observations and d dimensions, n being comparatively large, the objective is to find good-quality clusters in the high-dimensional data.

The algorithm starts by performing Principal Component Analysis (PCA) on the data set to obtain the first set of PC-scores, those having the largest eigenvalues along the most significant principal component. Since principal components are independent of each other, each of these components can be considered as i.i.d. random variables with Gaussian distributions having means equal to the column means of the data.

Since these are PC-scores, the random variables can be normalized without loss of relative similarity with respect to the principal components, so we work with the normalized i.i.d. random variables.

Since PCA is not used here for dimensionality reduction, the original data records corresponding to these PC-scores can be retrieved and then removed from the original data set; for convenience, we refer to what is left as the remaining data set. Next, another round of PCA is performed on the remaining data to get the next set of first PC-scores with the largest eigenvalues along the most significant principal component. The entire process described above is repeated until the remaining data set is empty. The number of iterations yields the number of independent sub-clusters present in the high-dimensional data.



The Iterative PCA Algorithm






Notion of Similarity and choice of et

Since we are dealing with high-dimensional data, there is little meaning in defining a similarity metric based on the usual norms, as shown by C. C. Aggarwal et al. in [6]. This approach therefore measures similarity based on cosine similarity, and does so in feature space; the information is then fed back to the original data set to remove those records from the original space. This is continued until the remaining data is found to be similar to some eigenvector, bounded below by the eigenvalue threshold et.

The initial choice of et is a debatable issue; as an initial assumption, we take it to be a fixed constant, say 0.9. As the algorithm iterates, it recalculates et. We take the mean of the normalized PC-scores and ensure that this mean is a close approximation of the unit eigenvector passing through those PC-scores. We keep taking significant PC-scores in steps, updating the mean incrementally, until we find deterioration from the previously calculated mean, and then fix et at the lowest of the most significant eigenvalues of that lot. This adaptive et calculation is carried out in each iteration until the entire data set is clustered.

Technical Scope of the Project

The initial algorithm of the project and any visualization will be implemented in MATLAB; if satisfactory results are obtained, the algorithm can be compared against the locally adaptive clustering (LAC) techniques [4] [5]. Based on the results, there may be an additional plan to extend the work to Web documents using Java and XML technologies.

Distribution of the work among the project members

Somnath – Design and Implementation in MATLAB
Sanket – Implementation and Experimentation in MATLAB

In case of future extension of this work, there will be separate work distributions between the team members.

Project Schedule

The initial implementation of the algorithm and its results are due by 29 Oct 2009.

References:

[1] Aggarwal C, Procopiuc C, Wolf J. L, Yu P. S, Park J. S (1999) Fast Algorithms for Projected Clustering. SIGMOD, 1999

[2] Aggarwal C. C, Yu P. S (2000) Finding generalized projected clusters in high dimensional spaces. SIGMOD, 2000

[3] Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. SIGMOD, 1998

[4] Domeniconi C, Gunopulos D, Ma S, Yan B, Al-Razgan Muna, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Disc (2007) 14:63-97 DOI 10.1007/s10618-006-0060-8

[5] Domeniconi C, Papadopoulos D, Gunopulos D, Ma S (2004) Subspace Clustering of High Dimensional Data. SIAM international conference on data mining, pp 517-520

[6] Aggarwal C. C, Hinneberg A, Keim D. A (2001) On the Surprising Behavior of Distance Metrics in High Dimensional Space. ICDT Conference Proceedings, 2001

Probabilistic PCA



Maximum Likelihood PCA Approach for Handwritten Letter Data Set 




EXECUTIVE SUMMARY
Principal component analysis is a ubiquitous technique for data analysis and processing, but it is not based on a probability model, and its effectiveness is therefore limited by its global linearity. Here we try to demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis.

Note: We have not considered the mixture of probabilistic principal component analyzers whose parameters can be determined using an Expectation-Maximization algorithm to estimate the principal subspace iteratively.

The advantages of the PPCA model over conventional PCA are:

  • Addresses limitations of regular PCA
  • PPCA can be used as a general Gaussian density model in addition to reducing dimensions
  • Maximum-likelihood estimates can be computed for elements associated with principal components
  • Captures dominant correlations with few parameters
  • Multiple PCA models can be combined as a probabilistic mixture
  • Can be used as a base for Bayesian PCA
PROBLEM DEFINITION AND BACKGROUND
In the statistical literature on factor analysis, any high-dimensional data set can be represented as the sum of a linear combination of independent, normally distributed latent (or hidden) variables, the means along the different dimensions, and independent, normally distributed error (or noise) terms. Mathematically, this can be represented as

t = W x + \mu + \epsilon

where t is the d x N observed data set, x is the q x N set of latent variables, W is the d x q factor loading matrix, \mu is the d x N matrix of row means repeated for N columns, and \epsilon is the d x N matrix of error columns. Both the latent variables and the noise are zero-mean, i.e. x ~ N(0, I) and \epsilon ~ N(0, \Psi), where I is the identity matrix and \Psi is a diagonal matrix of noise variances. It follows that t ~ N(\mu, C), where C = WW^T + \Psi (with \Psi = \sigma^2 I in the probabilistic PCA case).
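As a small illustration of this generative model (with arbitrary placeholder sizes and parameter values, not the project's data), one could sample from it in MATLAB as follows:

% Illustrative sampling from the latent-variable model t = W*x + mu + e.
% All sizes and parameters below are placeholders.
d = 16; q = 3; N = 500;
W      = randn(d, q);                   % factor loading matrix (d x q)
mu     = randn(d, 1);                   % means along the d dimensions
sigma2 = 0.1;                           % isotropic noise variance, Psi = sigma2 * I

X = randn(q, N);                        % latent variables, x ~ N(0, I)
E = sqrt(sigma2) * randn(d, N);         % noise, e ~ N(0, sigma2 * I)
T = W * X + repmat(mu, 1, N) + E;       % observed data, t ~ N(mu, W*W' + sigma2*I)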

The project tries to probabilistically estimate W, the parameter of the features or principal components of the observed data set t. There are two approaches that have been followed for identifying the features or reduced dimensions.

(1) Expectation Maximization

(2) Maximum Likelihood PCA





Expectation Maximization

In the EM approach, we have tried to determine the factor loading matrix W_ML using the maximization condition of the posterior log-likelihood L_C with respect to W_ML and \sigma^2_{ML}. The log-likelihood is calculated as

L_C = \sum_{n=1}^{N} \ln \{ p(t_n, x_n) \}

Considering the joint distribution of t_n and x_n, and the distribution of x_n, to be Gaussian, and using Bayes' rule, the conditional distribution of x_n given t_n can be calculated as follows:

x_n | t_n ~ N( M^{-1} W^T (t_n - \mu), \sigma^2 M^{-1} ), where M = W^T W + \sigma^2 I

E-step

The above conditional distribution parameters yield the posterior mean \langle x_n \rangle = M^{-1} W^T (t_n - \mu).

Least Square Reconstruction

The mean-adjusted observed data is projected onto W and the data is then reconstructed using the equation

\hat{t}_n = W \langle x_n \rangle + \mu

SSE = \sum_{n=1}^{N} \| t_n - \hat{t}_n \|^2

M-step

Maximizing L_C via its derivatives with respect to W and \sigma^2, the maximum-likelihood estimates are obtained as

\tilde{W} = \Big[ \sum_n (t_n - \mu) \langle x_n \rangle^T \Big] \Big[ \sum_n \langle x_n x_n^T \rangle \Big]^{-1}

\tilde{\sigma}^2 = \frac{1}{Nd} \sum_n \Big\{ \| t_n - \mu \|^2 - 2 \langle x_n \rangle^T \tilde{W}^T (t_n - \mu) + \mathrm{tr}\big( \langle x_n x_n^T \rangle \tilde{W}^T \tilde{W} \big) \Big\}

The values obtained in the M-step are used iteratively in the calculation of \langle x_n \rangle in the E-step of the next iteration. This is done by setting W = \tilde{W} and \sigma^2 = \tilde{\sigma}^2.

Meanwhile the previous squared reconstruction error SSEprev is assigned the value of the current SSE, to help stop the iteration when SSE becomes greater than SSEprev. This is done using SSEprev = SSE.
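The project's actual EM implementation was a Java class using the JAMA package and is not shown here. As a rough illustration of the E- and M-steps described above, a minimal MATLAB sketch using the same update equations (on placeholder data) might look like this:

% Minimal sketch of the EM iteration for PPCA (illustrative only).
T = randn(16, 500);                     % observed data, d x N (placeholder)
[d, N] = size(T);
q = 3;                                  % number of latent dimensions
mu = mean(T, 2);
Tc = T - repmat(mu, 1, N);              % mean-adjusted data

W = randn(d, q); sigma2 = 1; SSEprev = Inf;
for iter = 1:200
    % E-step: posterior statistics of the latent variables
    M    = W' * W + sigma2 * eye(q);
    Minv = inv(M);
    Xn   = Minv * W' * Tc;                              % posterior means <x_n>
    Sx   = N * sigma2 * Minv + Xn * Xn';                % sum over n of <x_n x_n'>

    % M-step: maximum-likelihood updates of W and sigma^2
    Wnew   = (Tc * Xn') / Sx;
    sigma2 = (sum(sum(Tc.^2)) - 2 * sum(sum(Xn .* (Wnew' * Tc))) ...
              + trace(Sx * (Wnew' * Wnew))) / (N * d);
    W = Wnew;

    % Stop when the squared reconstruction error no longer decreases
    SSE = sum(sum((Tc - W * Xn).^2));
    if SSE > SSEprev, break; end
    SSEprev = SSE;
end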

Maximum Likelihood PCA

Unlike the EM approach, maximum-likelihood PCA is a closed-form approach in which W and \sigma^2 are obtained not through an iterative process but from closed-form expressions derived from the conditions for stationary points of the log-likelihood function L_C, which has the same definition as in the EM approach.

At the stationary points, differentiating L_C with respect to W yields the equation

S C^{-1} W = W

where S = \frac{1}{N} \sum_n (t_n - \mu)(t_n - \mu)^T is the covariance matrix of the observed data, and C and W have the usual definitions as before.

Now, writing W = U L V^T (singular value decomposition), it can be shown that the columns of U are eigenvectors of S with eigenvalues \lambda_j = \sigma^2 + l_j^2, where l_j are the diagonal elements of L.

Thus each retained eigenvalue of S has to be at least \sigma^2. The number of discarded dimensions d - q has been chosen to be the number of eigenvalues equal to the smallest eigenvalue of S.

The maximum-likelihood estimate of \sigma^2 is then calculated as

\sigma^2_{ML} = \frac{1}{d - q} \sum_{j=q+1}^{d} \lambda_j

Next, W_ML is estimated using the equation

W_{ML} = U_q ( K_q - \sigma^2_{ML} I )^{1/2} R

where U_q is obtained by taking the q most significant column eigenvectors of S, K_q is a q x q diagonal matrix with the q largest eigenvalues along the diagonal, and R is an arbitrary orthogonal matrix.

It can be shown that the above estimates of W_ML and \sigma^2_{ML} correspond to the global maximum of the specified log-likelihood function.
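Again purely as an illustration (the project's Java/JAMA class is not shown), the closed-form estimates above can be computed in a few lines of MATLAB, taking the arbitrary rotation R to be the identity:

% Closed-form maximum-likelihood PPCA estimates (illustrative only; R = I).
T = randn(16, 500);                     % observed data, d x N (placeholder)
[d, N] = size(T);
q = 3;
mu = mean(T, 2);
Tc = T - repmat(mu, 1, N);
S  = Tc * Tc' / N;                      % covariance matrix of the observed data

[U, L] = eig(S);
[lambda, order] = sort(diag(L), 'descend');
U = U(:, order);

sigma2ML = mean(lambda(q+1:d));                       % average discarded eigenvalue
Kq  = diag(lambda(1:q));                              % q largest eigenvalues of S
WML = U(:, 1:q) * sqrtm(Kq - sigma2ML * eye(q));      % W_ML = Uq*(Kq - sigma2*I)^(1/2)*R, R = I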

TECHNICAL SCOPE OF PROJECT
The technical scope of the project involves finding the hidden variables in a set of handwritten letter recognition data from the UCI Machine Learning Repository, titled "Letter Image Recognition Data", using the two approaches mentioned above, and then checking whether the recovered hidden variables correspond to the true eigenvectors of the data set.

The data set is very large, with 20,000 instances and 16 attributes, and class labels given as letter names (A, T, C, etc.). For our sample experiment, only 500 instances have been chosen for convenience.







Java Class for EM Approach (using the JAMA package)
(Not Shown) 

Java Class for Maximum Likelihood Approach (using the JAMA package)

(Not Shown)


EXPERIMENTAL RESULTS
The high dimensional parallel coordinate plot for the UCI Letter Image Recognition Data is as follows:







The above plot has 16000 observations with 16 variables or dimensions.


After projecting the observed data onto three hidden variables and forming three classes, the parallel coordinate plot looks as shown above.

The MATLAB code for the plot is:

Z = load('letter-recognition-training-unclassified-5000.txt');

% Z is of size 5000 x 16

varNames = {'x-box';'y-box';'width';'high';'onpix';'x-bar';'y-bar';'x2bar';'y2bar';'xybar';'x2ybr';'xy2br';'x-ege';'xegvy';'y-ege';'yegvx'};

% Plot style follows the MATLAB example "Visualizing Multivariate Data"

% Number of hidden variables (classes) onto which the data is projected
dim = 3;

% evec was not defined in the original listing; here it is assumed to hold the
% 'dim' principal axes of Z (eigenvectors of its covariance matrix, largest
% eigenvalues first)
[evec, evals] = eig(cov(Z));
[dummy, order] = sort(diag(evals), 'descend');
evec = evec(:, order(1:dim));

% Project the observed data onto the retained axes
P = Z * evec;
[r, c] = size(P);

% Assign each observation to the hidden variable with the largest projection
grp = zeros(r, 1);
for i = 1:r
    grp(i) = 1;
    for j = 2:dim
        if P(i, j) > P(i, grp(i))
            grp(i) = j;
        end
    end
end

parallelcoords(Z, 'group', grp, 'standardize', 'on', 'labels', varNames);

set(gcf, 'color', 'white');







WORK DISTRIBUTION
The project has been a team effort with the roles of the individual members in the areas as mentioned below:

Somnath Chakrabarti – Design and Development

Sanket Gupte – Development and Experimentation

PROJECT SCHEDULE
Intermediate Project Status report submitted on Oct 29, 2009

Project Submitted on Dec 20, 2009

REFERENCES:
[1] Michael E. Tipping, Chris M. Bishop, Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society, Series B, 61, Part 3, pp. 611-622

[2] The Minh Luong, lecture notes titled "Probabilistic Principal Component Analysis and the E-M algorithm", CS3750, Fall 2007

[3] Sam Roweis, EM Algorithms for PCA and SPCA, Computation & Neural Systems, California Institute of Technology

[4] UCI Machine Learning Repository Data Set: Letter Image Recognition Data