5/20/10

Moving to the Clouds

Sanket Gupte
Technical Project Report for CMSC 601-Basic Research Skills
CSEE- University of Maryland Baltimore County


Abstract

Adopting Cloud Computing is creating a huge paradigm shift in the computing world. The research community has recently embarked on its journey towards this contemporary technology. It is the convergence and evolution of several concepts from virtualization, distributed application design, grid, and enterprise IT management to enable a more flexible approach for deploying and scaling applications. This paper discusses how cloud computing transforms the way we design, build, and deliver applications, and the architectural considerations that enterprises must make when adopting and using cloud computing technology. Applications built on Cloud Architectures use the underlying computing infrastructure only when it is needed: they draw the necessary resources on demand, perform a specific job, then relinquish the unneeded resources, and often dispose of them, after the job is done. While in operation, the application scales up or down elastically based on resource needs. Performing such tasks in the most optimized way requires an architecture that takes into account all the constraints and delivers the best service. A better understanding of the technology would help the scientific community facilitate the adoption of this approach and expedite contributions and insights in this field.

1.Introduction


Cloud is a metaphor for the 'Internet'. When we represent the Internet in a network diagram, we draw it as a cloud. It is like the 'etc.' of the computer world: an area which is someone else's concern.
More precisely, it is used to access applications that are not on our own computer. Cloud computing has its roots in various technologies such as service-oriented architecture, virtualization, and distributed and grid computing.

1.1 What is 'Cloud Computing'?

In its broadest usage, the term cloud computing refers to the delivery of scalable IT resources over the Internet, as opposed to hosting and operating those resources locally, such as on a college or university network. Those resources can include applications and services, as well as the infrastructure on which they operate. By deploying IT infrastructure and services over the network, an organization can purchase these resources on an as-needed basis and avoid the capital costs of software and hardware. With cloud computing, IT capacity can be adjusted quickly and easily to accommodate changes in demand. While remotely hosted, managed services have long been a part of the IT landscape, a heightened interest in cloud computing is being fueled by ubiquitous networks, maturing standards, the rise of hardware and software virtualization, and the push to make IT costs variable and transparent. [4] [10]


1.2 How is it different from other similar technologies?

  • Autonomic computing – Integrated computer technology which allows networks to manage themselves with little or no human intervention.

  • Client-server – Client-server computing refers broadly to any distributed application that has two logical parts: a server, which provides information or services, and a client, which requests them.

  • Grid computing – A form of distributed and parallel computing in which a 'super and virtual computer' is composed of a cluster of networked, loosely coupled computers acting in concert to perform very large tasks. A large project is divided into smaller parts that run in parallel on different machines, which makes it cost-effective and fast and lets resources be shared.

  • Utility computing – A form of computer service in which the company providing the service charges you according to how much you use it.

  • Peer-to-peer – Systems in which data transfer uses the shared resources of every system attached to the network, set up for a specific task or purpose.

  • Cloud computing – Used to run multiple smaller applications at the same time. Cloud computing customers do not own the physical infrastructure; they avoid capital expenditure by renting usage from a third-party provider. They consume resources as a service and pay only for the resources that they use.
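
The pay-for-what-you-use idea behind utility and cloud computing can be illustrated with a little arithmetic. The following is only a sketch: the workload, the per-instance capacity and the hourly rate are hypothetical numbers chosen for illustration, not figures from any provider.

```python
import math

def instances_needed(demand, capacity_per_instance):
    """Smallest number of instances that can serve the given demand."""
    return math.ceil(demand / capacity_per_instance)

def pay_per_use_cost(hourly_demand, capacity_per_instance, rate):
    """Cost when the number of rented instances follows demand hour by hour."""
    instance_hours = sum(
        instances_needed(d, capacity_per_instance) for d in hourly_demand
    )
    return instance_hours * rate

def fixed_capacity_cost(hourly_demand, capacity_per_instance, rate):
    """Cost when enough instances for the peak hour run all the time."""
    peak = instances_needed(max(hourly_demand), capacity_per_instance)
    return peak * len(hourly_demand) * rate

# Hypothetical workload: requests per second during six consecutive hours.
demand = [50, 120, 400, 900, 300, 80]
elastic = pay_per_use_cost(demand, capacity_per_instance=100, rate=1)
fixed = fixed_capacity_cost(demand, capacity_per_instance=100, rate=1)
print(elastic, fixed)  # 20 instance-hours versus 54
```

With this made-up workload, the elastic customer pays for 20 instance-hours, while provisioning for the peak would mean paying for 54.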

1.3 What are the Categories in Cloud Services?

  • SaaS – Software as a Service
    • Network-hosted application. A model of software deployment in which a provider licenses an application to customers for use as a service on demand.

  • DaaS – Data as a Service
    • Customer queries against provider's database

  • PaaS – Platform as a Service
    • Network-hosted software development platform. It facilitates the deployment of applications without the cost and complexity of buying and managing the underlying hardware and software layers. PaaS provides the facilities required to support the complete lifecycle of building and delivering web applications and services.

  • IaaS – Infrastructure as a Service
    • Provider hosts customer VMs or provides network storage. Rather than buying servers, software, data center space or network equipment, clients instead buy resources as a fully outsourced service.

Other similar terminology includes:

  • IPMaaS – Identity and Policy Management as a Service
  • NaaS – Network as a Service
  • XaaS – Everything as a Service

2.Motivation

The first high-profile usage of the term 'Cloud Computing' was by Eric Schmidt of Google at a conference in August 2006, and the abbreviations SaaS, IaaS, PaaS, etc. reflect Google's way of managing data centers and infrastructure. The first detailed explanation of 'Cloud Computing', however, was given in the paper of the same title by a team of IBM employees, namely Greg Boss, Padma Malladi, Dennis Quan, Linda Legregni and Harold Hall, in October 2007. As the first in-depth description of cloud computing, the paper could be called 'seminal'. It describes cloud computing, its business benefits and management considerations for the cloud. Giving a detailed description of architecture and infrastructure for the cloud, it shows a clear and concise understanding of the cloud, but from the perspective of management. This paper has its roots in "Introducing HiGIG: The HiPODS Global Innovation Grid", a paper by a different IBM team that describes the "High Performance On Demand Solutions (HiPODS) global innovation grid (HiGIG)", which is based on a composite service-oriented architecture pattern and provides IT infrastructure management services for multiple, globally distributed IT infrastructures.

'Above the Clouds: A Berkeley View of Cloud Computing' [3] is a phenomenal paper written in early 2009 which describes cloud computing in detail, from its beginnings to where it is likely to lead in a few years. Although it touches only lightly on the technical side of 'cloud computing', it concentrates on its qualities and characteristics. It starts with how the idea began and why 'now' is the perfect time for bringing out the concept. It then describes in detail 10 problems that would be faced in embracing it, and ways to deal with them. It is a very significant paper that forms the basis for much related research, and it is cited in more than 240 papers.

'A Tale of Clouds: Paradigm Comparisons and Some Thoughts on Research Issues' [12] is a significant paper written by Mei, Chan and Tse, which compares service computing and pervasive computing with cloud computing and sets out the major differences between them. Since it is a short paper, it does not go into the full details of these differences and only makes a qualitative comparison based on the classic model of computer architecture. Finally, it mentions a few research questions in cloud computing, such as storing and accessing large-scale data in the clouds, adaptability to internal and external changes, plugging into the clouds, and the discovery of entities by the clouds. It describes the significance of addressing these questions but does not mention any means of solving them.

A paper written by the developers of 'Eucalyptus', an open-source software framework for cloud computing, [2] describes their infrastructure in detail. It is an in-depth report on the framework they developed, which cloud computing users and developers can use in various ways. 'Eucalyptus' employs a hierarchical design and is divided into a Node controller, Cluster controller, Storage controller and Cloud controller. The Node controller runs on the host and controls the execution, inspection and termination of VM instances. The Cluster controller is responsible for collecting information about VM execution on specific Node controllers and for scheduling it. The Storage controller provides a mechanism for storing and accessing VM images and user data, using Amazon's S3 interface. The Cloud controller is the point where users and administrators enter the cloud. It makes requests to Cluster controllers to implement high-level scheduling decisions.

A well-written technical white paper, published through the collaborative efforts of contributors from Amazon Web Services, RightScale and rPath, explains that although there can be significant similarities between normal applications and cloud applications, a large number of things have to be taken care of. Careful attention to design will ensure a successful deployment. All cloud-based applications need to be deployed as virtual appliances so that they contain all the components needed to operate, update and manage them. A simple design helps scalability, and planning for failure can keep the worst from happening.

Sun Microsystems published an elaborate white paper [23] on the architecture of cloud computing. The paper explains cloud computing and the components present in it. It shows that the cloud can be divided into precise and distinct architectural layers such as PaaS, IaaS, SaaS, etc. It then mentions the benefits of using cloud computing over normal computing and applications. The main emphasis of the paper is on IaaS. It describes the changes that have to be made in order to develop an infrastructure for the cloud and the characteristics that the architecture should possess. At the end it gives details on the physics and management of data for the cloud. The concluding chapter is about the approach and innovations of Sun Microsystems towards cloud computing.

In March this year, CPNI published a Cloud Computing briefing. [38] It gives a detailed overview of cloud computing from a security perspective. It concentrates on the potential benefits and risks and identifies mitigation advice to reduce vulnerability. It tries to address the conflicting opinions about the concept of cloud computing, but does not provide a clear definition either. Many of the details given are focused on satisfying the customers of cloud services. It shows the significance of having some accepted standard for cloud computing. While mentioning various security issues, it also gives recommendations for customers about how they should manage security from their side, and what they should ensure in order to get good, standard cloud facilities.

2.1 Current Scenario: How different companies/Organizations are implementing 'Cloud Computing'



The following is a short list of different organizations implementing 'cloud computing'. [25][26][28][30][33][34][35][36][37]

Table 1:


Organization  Service  Name                          Description
------------  -------  ----------------------------  -----------
Amazon        IaaS     Elastic Compute Cloud (EC2)   Virtual servers
                       Dynamo                        Key-value storage system
                       Simple Storage Service (S3)   Storage buckets
                       SimpleDB                      Database as a service
                       CloudFront                    Content delivery
                       SQS                           Queuing services
              HaaS     Mechanical Turk               Scalable workforce
Google        IaaS     Google Big Table              Distributed storage system
                       Google File System            Distributed file system
              PaaS     App Engine                    Scalable runtime environment for Python web applications
              SaaS     Google Docs                   Online office suite
                       Google Maps API               To embed Google Maps in their own web pages
                       OpenSocial                    Common API for social applications across multiple websites
Microsoft     PaaS     Azure                         Development environment and runtime for Microsoft applications
                       Live Mesh                     Platform to sync, share and access a wide range of devices with Microsoft operating systems
              SaaS     Office Live                   Online office suite





2.1.1 Amazon Web-Services

Amazon was one of the first companies to take the initiative in launching a cloud product for the general public. [26][30][36]



If you need CPU cycles, you can spin up virtual machines with Elastic Compute Cloud (EC2). If it's data you want to store, you can park objects of up to 5GB in the Simple Storage Service (S3). Amazon has also built a limited database on top of S3, which has not yet been released. Your machines can talk among themselves with the Simple Queue Service (SQS), a message-passing API.
Amazon is thoroughly committed to the cloud paradigm: it is rethinking how we design these systems and producing some innovative tools and the most extensive set of solutions.

Amazon Web Services is a collection of remote computing services that together make up a cloud computing platform. The following is a list of Amazon services, in no specific order.



  • Amazon Associates Web Service (A2S, formerly Amazon E-Commerce Service or ECS), providing access to Amazon's product data and electronic commerce functionality.

  • Amazon AWS Authentication is an implicit service, the authentication infrastructure used to authenticate access to the various services.

  • Amazon CloudFront, a content delivery network (CDN) for distributing objects stored in S3 to so-called "edge locations" near the requester.

  • Amazon DevPay, currently in limited beta, is a billing and account management system for applications that developers have built atop Amazon Web Services.

  • Amazon Elastic Block Store (EBS), providing persistent block level storage volumes for EC2.

  • Amazon Elastic Compute Cloud (EC2), providing scalable virtual private servers using Xen.

  • Amazon Elastic MapReduce, a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of EC2 and Amazon S3.

  • Amazon Flexible Payments Service (FPS), provides an interface for micropayments.

  • Amazon Fulfillment Web Service provides a programmatic web service for sellers to ship items to and from Amazon using Fulfillment by Amazon.

  • Amazon Historical Pricing, providing access to Amazon's historical sales data from its affiliates. (Appears this service has been discontinued)

  • Amazon Mechanical Turk (Mturk), managing small units of work distributed amongst many people.

  • Amazon Relational Database Service (RDS), providing a scalable MySQL-compatible database server.

  • Amazon Simple Storage Service (S3), providing Web Service based storage.

  • Amazon Simple Queue Service (SQS), providing a hosted message queue for web applications.

  • Amazon Simple Notification Service (SNS), providing a hosted multiprotocol "push" messaging for web applications.

  • Amazon SimpleDB, allows developers to run queries on structured data. It operates in concert with EC2 and S3 to provide "the core functionality of a database."

  • Amazon Virtual Private Cloud (VPC), a web service that creates a logically isolated set of Amazon EC2 instances to be connected to an existing network using a VPN connection.

  • AWS Management Console (AWS Console), a web-based point-and-click interface to manage and monitor the Amazon infrastructure suite including EC2, EBS, Amazon Elastic MapReduce, and Amazon CloudFront.

  • AWS Simple Monthly Calculator, which helps you estimate your monthly AWS cost.

The most central and widely used of these services is the Amazon Elastic Compute Cloud (EC2).

2.1.2 Google App Engine
Google's App Engine [25] is the polar opposite of Amazon's offerings. While it is possible to get root privileges on Amazon, it is impossible to write a file in your own directory with App Engine; it is unclear whether we are even offered our own directory. As a quick way to avoid security loopholes, Google has removed the file-write feature from Python. For storing data we must use Google's database.
Google has stripped Web applications down to a core set of features and built up a pretty good framework for delivering them. Writing applications is quite simple and quick. Google offers some nice tools for debugging the applications on our own machine. Deploying applications into the cloud takes very little time.
Google App Engine is a platform for developing and hosting web applications, which virtualizes applications across multiple servers and data centers.
It currently supports Python, Java and other JVM languages. App Engine provides additional infrastructure to make it easy to write scalable applications. Google handles deploying code to a cluster, monitoring, failover and launching application instances as necessary. App Engine requires its developers to use only its supported languages, APIs and frameworks. Current APIs allow storing and retrieving data from a BigTable non-relational database, making HTTP requests, sending e-mail, manipulating images, and caching. Most existing web applications cannot run on App Engine without modification since they need a relational database.
The Google App Engine datastore has a SQL-like query syntax called 'GQL'. GQL differs in many ways from standard SQL as used in relational databases, and the Datastore API is not relational in the SQL sense.
App Engine has many restrictions: read-only access to the filesystem, execution only of code invoked by HTTP requests, use of pure-Python modules only, no ability to create new threads, no support for naked domains (without www), etc.

 

Fig 1: Comparison between Amazon and Google Clouds

 

2.1.3 Windows Azure
Windows Azure [34] is a service which is a part of the larger Azure Services Platform. Microsoft customers can use this service to run applications and store data on Internet-accessible machines. Windows Azure has three main parts: the compute service, the storage service, and the fabric.
 
The Compute Service: The primary goal of this platform is to support applications that have a large number of simultaneous users. It does this by scaling out (instead of scaling up), running multiple copies of the same code across many servers.
To this end, an application can have multiple instances, each executing in its own virtual machine. These VMs run a server operating system and are provided by a hypervisor that has been modified for Microsoft's cloud.
 
The Storage Service: Windows Azure has three main types of storage. The simplest is the 'blob', which contains binary data and sits in a simple hierarchy: a storage account can have one or more containers, each containing one or more blobs. To allow applications to work with data in a more fine-grained way, there are 'tables'. Unlike ordinary relational tables, these tables store data in a simple hierarchy of entities that contain properties. Applications access the data with ADO.NET Data Services. This simple structure makes scaling out much easier. The third form of storage is the 'queue'. A primary function of queues is to provide a way for web role instances to communicate with worker role instances. Regardless of the form in which data is stored in Azure, it is replicated three times, which gives fault tolerance.
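
The three storage types can be pictured with a small in-memory model. This is only a sketch of the data shapes described above, written in Python; the class and method names are hypothetical and are not the Windows Azure or ADO.NET Data Services API, and replication is not modeled.

```python
from collections import deque

class StorageAccount:
    """Toy in-memory model of blobs, tables and queues (not the Azure API)."""

    def __init__(self):
        self.containers = {}  # container name -> {blob name -> bytes}
        self.tables = {}      # table name -> list of entities (property dicts)
        self.queues = {}      # queue name -> deque of messages

    def put_blob(self, container, name, data):
        self.containers.setdefault(container, {})[name] = data

    def get_blob(self, container, name):
        return self.containers[container][name]

    def insert_entity(self, table, **properties):
        # An entity is just a bag of properties; there is no fixed schema.
        self.tables.setdefault(table, []).append(properties)

    def query(self, table, **match):
        # Return the entities whose properties equal all the given values.
        return [
            e for e in self.tables.get(table, [])
            if all(e.get(k) == v for k, v in match.items())
        ]

    def enqueue(self, queue, message):
        self.queues.setdefault(queue, deque()).append(message)

    def dequeue(self, queue):
        q = self.queues.get(queue)
        return q.popleft() if q else None

account = StorageAccount()

# Web role: store an upload as a blob and leave a message for a worker.
account.put_blob("photos", "cat.jpg", b"raw-bytes")
account.enqueue("jobs", {"container": "photos", "blob": "cat.jpg"})

# Worker role: pick up the message, read the blob, record a result entity.
job = account.dequeue("jobs")
data = account.get_blob(job["container"], job["blob"])
account.insert_entity("results", source=job["blob"], size=len(data))
print(account.query("results", source="cat.jpg"))
```

The queue is the only link between the two roles here, which is what lets web role and worker role instances be scaled independently.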

The Fabric: All Azure applications and data live in some Microsoft data center. Within that data center, the set of machines dedicated to Windows Azure is organized into a fabric. The fabric contains a large group of machines, all of which are managed by software called the 'fabric controller'. The fabric controller is replicated across a group of five to seven machines, and it owns all the resources in the fabric: computers, switches, load balancers and more. It monitors all running applications, manages operating systems, decides where new applications should run and chooses physical servers so as to optimize hardware utilization.

2.1.4 Resin 4.0
According to the technical white paper [22], Resin 4.0 is a very good environment for deploying and scaling Java and PHP web applications. Resin has its own architecture for sophisticated clustering, providing fast, distributed sessions, distributed object caching and cloud-wide application deployment. Resin 4.0 has the following underlying architecture.

The Triad is a set of three servers that provide the central repository for persistent data and maintain up-to-date record of dynamic servers in the system. It provides a point of stability and persistence to reduce management complexity.

There are dynamic servers, which can be started and stopped at will, providing elastic scaling.
Each dynamic server has access to the shared data within the system. The cost of starting a new dynamic server is simply to start a new virtual machine. When a new dynamic server is brought online, it contacts one of the triad servers to announce its availability and import all application data. As applications are updated on the triad, the changes are pushed out to all the dynamic servers by the triad to keep them updated.
The dynamic servers use the triad as their persistent store for session and object cache data. At the same time, optimizations keep frequently used data in memory on the dynamic servers to improve application performance and reduce network load. The triad servers and the dynamic servers together form a cluster.
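
The relationship between the triad and the dynamic servers can be sketched in a few lines of Python. This is not Resin's API: the class and method names are hypothetical, and the three triad servers are collapsed into a single object for brevity.

```python
class DynamicServer:
    def __init__(self, name):
        self.name = name
        self.apps = {}  # application name -> deployed version

class Triad:
    """Stands in for the three triad servers, treated here as one unit."""

    def __init__(self):
        self.apps = {}     # persistent record of deployed applications
        self.members = []  # dynamic servers currently online

    def join(self, server):
        # A new dynamic server announces itself and imports all application data.
        server.apps = dict(self.apps)
        self.members.append(server)

    def leave(self, server):
        self.members.remove(server)

    def deploy(self, app, version):
        # Update the triad's own copy, then push the change to every member.
        self.apps[app] = version
        for server in self.members:
            server.apps[app] = version

triad = Triad()
triad.deploy("shop", "1.0")

a = DynamicServer("a")
triad.join(a)                # "a" imports shop 1.0 on joining
triad.deploy("shop", "1.1")  # the update is pushed to "a"

b = DynamicServer("b")
triad.join(b)                # "b" joins later and gets the current version
print(a.apps, b.apps)
```

Because a joining server copies everything from the triad, dynamic servers hold no state that cannot be rebuilt, which is what allows them to be started and stopped at will.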

Using a combination of triad servers and dynamic servers minimizes the complexity of managing an application deployment. The triad servers are brought up first on system start up and at least one should be available at any time during the life of the system. Thus these three servers can be the main focus of administration time and effort because all of the other servers may go up or down at any time without affecting the functional performance visible to clients. Assuming a virtualized environment, if one or more triad servers become faulty at any time, a replacement or replacements can be brought into place quickly. Having three servers in a triad avoids a single point of failure, allows up to two servers to fail at any given time, and allows normal maintenance of a single server without downtime.
Resin 4.0 includes a software load balancer that distributes HTTP requests for web application clients. The triad server keeps track of the current members of the cluster and communicates with the load balancer to update it on which dynamic servers are available to handle requests. When choosing a server to handle new requests, the load balancer takes into account the CPU load of a server as well as the number of simultaneous requests that server is already handling. Depending on the algorithm selected by the administrator, the load balancer can either direct the request to the least loaded server to keep load even or to the same set of servers until they are fully loaded to avoid starting new servers. Once a server has been selected for a new request, subsequent requests from the same client will go to the same server to avoid unnecessary load times.
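
The two selection strategies and the client affinity described above can be sketched as follows. This is an illustration in Python rather than Resin's implementation: the policy names, the `Server` and `LoadBalancer` classes and the way CPU load and request count are combined into one score are all assumptions.

```python
class Server:
    def __init__(self, name, max_requests):
        self.name = name
        self.max_requests = max_requests
        self.active = 0  # simultaneous requests currently being handled
        self.cpu = 0.0   # CPU load between 0.0 and 1.0

    def load(self):
        # Hypothetical score: CPU load plus the fraction of request slots in use.
        return self.cpu + self.active / self.max_requests

class LoadBalancer:
    def __init__(self, servers, policy="least_loaded"):
        self.servers = servers
        self.policy = policy
        self.sessions = {}  # client id -> server chosen for that client

    def route(self, client):
        server = self.sessions.get(client)
        if server is None or server not in self.servers:
            candidates = [s for s in self.servers if s.active < s.max_requests]
            if not candidates:
                raise RuntimeError("no server has spare capacity")
            if self.policy == "least_loaded":
                # Keep the load even across all servers.
                server = min(candidates, key=Server.load)
            else:
                # "fill_first": keep using the earliest server until it is full.
                server = candidates[0]
            self.sessions[client] = server
        server.active += 1
        return server

even = LoadBalancer([Server("s1", 2), Server("s2", 2)], policy="least_loaded")
print([even.route(c).name for c in ["alice", "bob", "alice"]])

packed = LoadBalancer([Server("s1", 2), Server("s2", 2)], policy="fill_first")
print([packed.route(c).name for c in ["alice", "bob", "carol"]])
```

In the first run the repeat request from "alice" returns to the server that handled her first request; in the second, the second server is only used once the first has no free request slots.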

3.Architecture and characteristics of a cloud:


3.1 Physical Structure of a Cloud [1]


- Surface Layer or Application Layer (SaaS)
This layer forms the most visible part of the infrastructure. Services are provided to users through web portals. This alleviates the burden of software maintenance and the operation and support costs. On the users' side, it reduces the hardware requirements. An application for the cloud is deployed on the provider's infrastructure, so application developers can add small patches and new features to the system without disturbing users with update requests. Configuring and testing applications is less complicated since the deployment environment is quite restricted. Cloud applications can be developed on the cloud software environments, and they can be composed as a service from other cloud services offered by other cloud systems, using SOA concepts. Applications for this higher layer of the stack are quite simple to develop, have a short time-to-market and are less error-prone.
Several issues hinder the wide acceptance and adoption of the cloud model. Security and availability are major areas, which are currently sidestepped through lenient service level agreements. Coping with network outages and system failures is another major concern in this area.
 
- Software Platform Layer (PaaS)
Cloud application developers implement and deploy their applications on this layer. The providers of this layer supply developers with a programming-language-level environment; the service is therefore called Platform as a Service (PaaS). The main benefits of developing cloud applications on the cloud programming environment are automatic scaling, load balancing and integration with other services provided by the PaaS provider. Developers can also integrate additional services on demand, which not only simplifies the task but also shortens deployment time and minimizes logic faults in the application.

- Software-Infrastructure Layer (IaaS+DaaS+CaaS)
It provides the fundamental resources to the higher layers. This layer can be bypassed to form a combined platform + infrastructure layer, which could improve efficiency but would require more development effort and add complexity.

This layer can be divided into the following three categories.
1)Computational Resources:
The most common and important way to provide computational resources is through Virtual Machines (VMs). Such a service is often referred to as Infrastructure as a Service (IaaS). Virtualization allows users a lot of flexibility in configuring their settings while protecting the physical infrastructure of the provider's data center. Recent advances in operating system virtualization techniques have made IaaS quite plausible.

2)Data Storage:
This allows users to store data remotely and access it from any place. This service is referred to as Data Storage as a Service (DaaS). Data storage systems must meet several rigorous requirements for maintaining users' data and information, including high availability, reliability, performance, replication and data consistency. Because these requirements conflict with one another, it is impossible to fulfill all of them together. DaaS providers are therefore free to implement and favor some features over others, indicating their choice through the SLA.

  

Fig 2: The Architecture


3)Communication:
Communication is a very important part of any cloud infrastructure. It must be service-oriented, configurable, schedulable, predictable and reliable. Communication as a Service (CaaS) emerged to support such requirements, as well as network security, dynamic provisioning of virtual overlays for traffic isolation or dedicated bandwidth, guaranteed message delay, communication encryption, and network monitoring.

There are several design features which can be common between the three infrastructural components, such as security, availability, quality etc.

- Software Kernel Layer
It provides the basic software management for the physical servers that compose the cloud. Software kernels at this level can be implemented as an OS kernel, hypervisor, virtual machine monitor and/or clustering middleware. Grid computing applications can be well suited to being deployed and run on this layer. However, because grid computing lacks a virtualization abstraction, jobs are closely tied to the actual hardware infrastructure, so providing migration, checkpointing and load balancing to applications at this level is quite complicated. A large amount of research has been done in grid computing, and grid development concepts are realized in cloud computing. Further grid computing research could also be integrated into the research area of the cloud.

- Hardware Layer (HaaS)
The lowest layer of the cloud stack is the actual physical hardware and switches. The users of this layer are usually big enterprises with huge IT requirements that need to sublease Hardware as a Service (HaaS). The HaaS provider operates, manages and upgrades the hardware on behalf of its customers for the lifetime of the sublease. Such providers have the technical expertise and cost-effective infrastructure to host the systems. HaaS providers face several challenges, such as the efficiency, ease and speed of provisioning large-scale systems.

3.2 Architecture of Platform and Infrastructure Layer


These two layers can be combined and, with respect to infrastructure, divided into four parts: [2]

-Node Manager
There is a Node manager for every node that hosts VM instances. It queries and controls the software on its node: it makes queries to discover the node's physical resources and to learn about the state of the VM instances on the node, and all the information collected is sent up to the cluster manager. Cluster managers control the VM instances on a node by sending requests to its Node manager. After verifying the authorization and confirming resource availability, the Node manager executes the request with the help of the hypervisor. To start an instance, the Node manager makes a node-local copy of the instance image files, either from a remote image repository or from the local cache, creates a new end point in the virtual network overlay and instructs the hypervisor to boot the instance. To stop an instance, it instructs the hypervisor to terminate the VM, tears down the virtual network endpoint, and cleans up the associated files.
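
The start and stop sequences can be sketched as follows. This is a simplified illustration and not code from [2]: the hypervisor is replaced by a stand-in that only records which VMs are running, resources are reduced to a count of free slots, and all class and method names are hypothetical.

```python
class FakeHypervisor:
    """Stand-in for a real hypervisor; it only records which VMs are running."""

    def __init__(self):
        self.running = set()

    def boot(self, instance_id, image):
        self.running.add(instance_id)

    def terminate(self, instance_id):
        self.running.discard(instance_id)

class NodeManager:
    def __init__(self, hypervisor, repository, free_slots):
        self.hypervisor = hypervisor
        self.repository = repository  # remote image repository: name -> bytes
        self.cache = {}               # images already fetched to this node
        self.free_slots = free_slots  # simplified view of physical resources
        self.instances = {}           # instance id -> per-instance state

    def start_instance(self, instance_id, image, authorized=True):
        if not authorized:
            raise PermissionError("request not authorized")
        if self.free_slots == 0:
            raise RuntimeError("no free resources on this node")
        # Node-local copy, taken from the cache or else the remote repository.
        if image not in self.cache:
            self.cache[image] = self.repository[image]
        local_copy = bytes(self.cache[image])
        endpoint = f"vnet-{instance_id}"  # new end point in the network overlay
        self.hypervisor.boot(instance_id, local_copy)
        self.instances[instance_id] = {"image": local_copy, "endpoint": endpoint}
        self.free_slots -= 1

    def stop_instance(self, instance_id):
        self.hypervisor.terminate(instance_id)
        # Dropping the record stands for tearing down the endpoint and files.
        del self.instances[instance_id]
        self.free_slots += 1

    def describe(self):
        # The information a cluster manager would collect from this node.
        return {"free_slots": self.free_slots, "instances": sorted(self.instances)}

hypervisor = FakeHypervisor()
node = NodeManager(hypervisor, {"ubuntu": b"image-bytes"}, free_slots=1)
node.start_instance("i-1", "ubuntu")
print(node.describe())
node.stop_instance("i-1")
print(node.describe())
```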


 
Fig 3: Detailed Platform + Infrastructure

 
-Cluster manager
It runs on a machine which has connectivity with both the Node managers and the Interface manager. It schedules incoming requests to run instances onto specific Node managers, controls the instances' virtual network overlay and collects information about a set of Node managers. When a cluster manager receives a set of instances to run, it contacts each Node manager and sends the request to host the instances to the first Node manager which has free resources. From the list of resource characteristics that it receives, it calculates how many simultaneous instances of a specific type can execute on its collection of Node managers and reports that number to the Interface manager.
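
Both tasks, first-fit placement and the capacity calculation, are short enough to sketch. The following Python is an illustration under assumptions: resources are reduced to cores and memory, and the VM types and their sizes are hypothetical.

```python
class Node:
    def __init__(self, name, free_cores, free_memory_mb):
        self.name = name
        self.free_cores = free_cores
        self.free_memory_mb = free_memory_mb

# Hypothetical VM types: (cores, memory in MB) required by one instance.
VM_TYPES = {"small": (1, 512), "large": (4, 4096)}

def capacity(nodes, vm_type):
    """How many instances of vm_type could run simultaneously on these nodes."""
    cores, memory = VM_TYPES[vm_type]
    return sum(
        min(n.free_cores // cores, n.free_memory_mb // memory) for n in nodes
    )

def schedule(nodes, vm_type):
    """Place one instance on the first node with free resources (first fit)."""
    cores, memory = VM_TYPES[vm_type]
    for node in nodes:
        if node.free_cores >= cores and node.free_memory_mb >= memory:
            node.free_cores -= cores
            node.free_memory_mb -= memory
            return node.name
    return None  # no node can host the instance

nodes = [Node("n1", 2, 1024), Node("n2", 8, 8192)]
print(capacity(nodes, "small"))   # 2 on n1 + 8 on n2 = 10
print(schedule(nodes, "large"))   # n1 is too small, so n2 is chosen
print(schedule(nodes, "small"))   # n1 is the first node with room
print(capacity(nodes, "small"))   # 1 on n1 + 4 on n2 = 5
```

First fit is simple and fast, but it does not try to balance load or pack nodes tightly; other placement policies could be substituted in `schedule` without changing the capacity report.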

-Data-Store Manager
A Data store manager would have special functionalities like allowing users to stream data in and out of the cloud and also from instances that they have started on nodes. It also acts as a storage and management service for VM images and other images used to instantiate VMs. The images are compressed and encrypted and split into parts. It also decrypts and verifies images uploaded by users. For performance optimization a cache is used to keep track of images that have already been decrypted. It supports concurrent and serial data transfers. Users are guaranteed that a consistent copy of the object will be saved in the case of concurrent writes on the same object.
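
The image handling described above can be sketched with the Python standard library. This is an illustration only: encryption is left out because the standard library provides no cipher, a SHA-256 digest stands in for verification, and the function names and manifest layout are hypothetical.

```python
import hashlib
import zlib

def pack_image(image, part_size):
    """Compress an image and split it into parts, with a digest for checking."""
    compressed = zlib.compress(image)
    parts = [
        compressed[i:i + part_size]
        for i in range(0, len(compressed), part_size)
    ]
    manifest = {"sha256": hashlib.sha256(image).hexdigest(), "parts": len(parts)}
    return parts, manifest

# Cache of images that have already been unpacked and verified.
_unpacked = {}

def unpack_image(parts, manifest):
    """Reassemble, decompress and verify an image, using the cache if possible."""
    digest = manifest["sha256"]
    if digest in _unpacked:
        return _unpacked[digest]
    if len(parts) != manifest["parts"]:
        raise ValueError("image is missing parts")
    image = zlib.decompress(b"".join(parts))
    if hashlib.sha256(image).hexdigest() != digest:
        raise ValueError("image failed verification")
    _unpacked[digest] = image
    return image

original = b"kernel and root filesystem " * 1000
parts, manifest = pack_image(original, part_size=16)
restored = unpack_image(parts, manifest)
print(len(original), manifest["parts"], restored == original)
```

A second call to `unpack_image` with the same manifest returns the cached copy without decompressing again, which is the optimization the cache is meant to provide.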

-Interface Manager
Interface manager is the entry-point into the cloud for users and administrators. It queries node managers for information about resources, makes high level scheduling decisions, and implements them by making requests to cluster controllers. The services of an interface manager can be grouped into three categories.
Resource Services perform system wide arbitration of resource allocations, let users manipulate properties of the VMs and networks, and monitor both system components and virtual resources.
Data Services govern persistent user and system data and provide for a configurable user environment for formulating resource allocation request properties.
Interface Services present user-visible interfaces, handle authentication and protocol translation, and expose system management tools.

3.3 Virtualization and cloud computing
Virtualization is the notion that compute resources can be made available with a large degree of agility and flexibility without requiring that users have detailed knowledge of underlying physical resources. In a virtualized environment, computing environments can be dynamically created, expanded, shrunk, or moved as demand varies. Virtualization provides important advantages in sharing, manageability, and isolation (the ability for multiple users and applications to share physical resources without affecting each other) in a cloud environment. Virtualization allows a set of underutilized physical servers to be consolidated into a smaller number of more fully utilized physical servers, contributing to significant cost savings and providing tremendous advantages in a new enterprise data center cloud environment.
Cloud infrastructure commonly relies on virtualization. Customers can provide their own VMs, and the cloud provider would run them often without knowledge of the guest OSes or their configurations.
Virtual cloud computing is emerging as a promising solution to Information Technology (IT) management, both easing the provisioning and administration of complex hardware and software systems and reducing operational costs. Several industry and university leaders have recently presented possible implementations. As more Virtual Machines (VMs) come into use, particularly because VMs are easy to clone and snapshot, a large number of VMs becomes unavoidable.
Virtualization provides a coarse-grained isolation mechanism that results in large systems, with full operating systems and a complete software stack as their foundation. Though much of this foundation is not strictly necessary, the programmatic burden of building systems at a finer-granularity, on a smaller foundation, has previously been shown to be prohibitive.
Here we present an alternative, lightweight and composite approach to virtualization for cloud computing that we call Macro-Components—software components that offer a lightweight container for software components that run in isolation from the rest of the system, but without the full foundations of their more traditionally virtualized counterparts. By reducing the foundation upon which virtualization is built, and by incorporating first principles of component based software development, this approach can provide a more scalable and sustainable approach for composing robust services in cloud environments, both in terms of dynamic system properties and software engineering qualities. [18]

These component systems would allow developers to build programs from scratch from compositions of components which they would further customize. Due to component systems it would be easy to replace existing parts without requiring major changes to the rest of the program. A virtualized component model would need the following characteristics, to function effectively.
Componentization
The granularity of the system, i.e. the appropriate component size, is a very important factor in component design. The component size would be relatively small compared to a VM running a full OS. Two major factors that would affect the component size are 1) the computational overhead of the components and 2) the development burden of the components. Moreover, a few pragmatic concerns, such as how components are packaged and how resources and information are shared between components, must be addressed.

Composition
The way in which a working system is formed by combining components is an important characteristic for a component system. Most of the component systems have a simple programmatic way in which you can reference and invoke a component. Some systems have a dynamic model in which the components can be invoked after finding them at runtime. Ideally a system should have programmatic dynamic composition of components. The design points to consider for such a model are:
  • The means by which components are referenced,
  • The control of the life-cycle of the components
  • Their customization within anticipated parameters
In modern virtual architectures the control and creation of VMs happens outside of the normal system, in a special 'privileged' area. This mismatch in the location of control has to be addressed in a component system design.
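The three design points above (reference by name, life-cycle control, and customization within anticipated parameters) can be illustrated with a small in-process registry. This is only a sketch under simplifying assumptions: the ComponentRegistry class and its method names are hypothetical, and real macro-components would run in isolated containers rather than as Python callables.

```python
class ComponentRegistry:
    """Hypothetical registry that resolves components by name at runtime."""

    def __init__(self):
        self._factories = {}
        self._running = {}

    def register(self, name, factory):
        """Make a component type available under a name."""
        self._factories[name] = factory

    def start(self, name, **params):
        """Create a component, customized through keyword parameters."""
        component = self._factories[name](**params)
        self._running[name] = component
        return component

    def stop(self, name):
        """End the life-cycle of a running component."""
        self._running.pop(name, None)

    def lookup(self, name):
        """Find a running component dynamically, or None if it is not running."""
        return self._running.get(name)


def make_greeter(greeting="hello"):
    return lambda who: f"{greeting}, {who}"


registry = ComponentRegistry()
registry.register("greeter", make_greeter)
registry.start("greeter", greeting="hi")
message = registry.lookup("greeter")("cloud")
registry.stop("greeter")
```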

Communication

Communication can be as simple as a function call, since most component systems are designed to run in a single address space. A virtual component model, by contrast, would need inter-process communication and middleware, because its components communicate through a shared medium such as shared memory or the network rather than through direct function invocation or data access. Some of the design points to consider with respect to communication are:
  • The medium used to communicate between VMs,
  • The way entities in the system are named and registered,
  • The interface, if typed communication is required, and the communication semantics,
  • The identification of essential services.
Control

The mechanisms for controlling the communication and behavior of components should follow a key design principle: the separation of policy and mechanism. An application designer should be allowed to specify a component's policy and have the system enforce that policy through its mechanisms. The design points with respect to control for fine-grained decomposition are:
  • The definition of the mechanisms necessary to correctly control a component,
  • How and from where these mechanisms are controlled, in terms of system decomposition, and
  • Where the policies of components are implemented.
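A minimal sketch of the separation of policy and mechanism: the policy is plain data supplied by the application designer, and the mechanism is a generic channel that enforces it. The policy format, the component names and the make_channel function are assumptions chosen for illustration.

```python
def make_channel(policy):
    """Mechanism: deliver a message only if the policy allows the route."""
    inboxes = {}

    def send(sender, receiver, message):
        if (sender, receiver) not in policy["allowed_routes"]:
            return False  # the mechanism refuses; it does not decide why
        inboxes.setdefault(receiver, []).append((sender, message))
        return True

    return send, inboxes


# Policy: declared by the application designer, separate from the channel code.
policy = {"allowed_routes": {("web", "billing"), ("billing", "ledger")}}

send, inboxes = make_channel(policy)
ok = send("web", "billing", "charge order 17")
blocked = send("web", "ledger", "rewrite history")
```

Changing which components may talk to each other then means editing the policy data only, not the channel.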
Major Challenges in Using Virtualization as a composition mechanism

  • Changing the centralized control mechanisms employed in current virtualization platforms.
  • The scale of the system in terms of the number of VMs and the amount of inter-VM communication may need to be addressed.
  • The speed with which VMs can be created, and the latency of inter-VM communication needs to be addressed.

4. Characteristics of a cloud application: [8][13][19]

The greatest advantage of cloud computing is 'flexibility', which distinguishes it from other similar technologies such as Grid and utility computing. A cloud application must have many important features, of which the ability to launch new instances of the application with minimal labor and expense is among the most important. Once the application has been offered in a cloud computing environment, it is important to avoid the 'success disaster', that is, being overwhelmed when the application becomes an overnight success. The following are important characteristics of an ideal cloud application.

-Ability to scale

Cloud computing applications require huge scalability. Some simple design guidelines help to achieve it.
Simplicity:- Avoid complex design and enhancements. It is better to start with a simple application and depend on the cloud to ensure good performance. Efficiency can be improved as the application gains traction and demand grows, which allows serving more users with the same number of servers while maintaining performance. Caching, server affinity, multi-threading and tight data sharing can improve performance, but they make it difficult to distribute the application across many servers. It is therefore recommended to introduce these characteristics later rather than at the beginning, so that horizontal scalability is not broken.
 
Splitting functions and Loose Coupling:- Use separate systems for different functionalities and avoid synchronous connection between them. So, as the demand grows each individual unit can be scaled independently instead of the whole application.
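As an illustration of avoiding synchronous connections, the sketch below places a queue between two functions of an application. The order and billing names are hypothetical, and an in-process queue.Queue stands in for what would be a network message queue in a real cloud deployment.

```python
import queue
import threading

orders = queue.Queue()  # stands in for a message queue between two separate systems
processed = []


def billing_worker():
    """Consume orders at its own pace, independently of the front end."""
    while True:
        order = orders.get()
        if order is None:  # sentinel: no more work
            break
        processed.append(f"billed:{order}")


worker = threading.Thread(target=billing_worker)
worker.start()

# The front end only enqueues and returns; it never waits for billing to finish.
for order_id in ("o-1", "o-2", "o-3"):
    orders.put(order_id)

orders.put(None)
worker.join()
```

Because the two sides share only the queue, more billing workers could be added without touching the front end.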
 
Network Communication:- Design the application to use network based interfaces instead of interprocess communication or file-based communication paradigms. Since each piece of application can be separated into distinct systems, it helps in scaling.
 
Clustering:- Instead of scaling a single system for all the servers, consider splitting the system into multiple small clusters, each serving a fraction of the application load. Different services can be split up along one dimension and requests can be directed to an appropriate cluster. To deploy a clustered system determine the right collection of servers that yield efficient application performance, taking into account the required functional redundancy. The scaling can then be done by replicating the ideal cluster size and splitting the system load across the servers in the clusters.
The following are the advantages of scaling in the above mentioned way.

  • Testing can be done against a test cluster without risking the performance or integrity of the entire system. Cluster performance testing can also be done using artificial users to generate load in the cloud.
  • Perform testing on an instance of the cluster first, and one cluster at a time. If the tests do not succeed then just roll back.
  • Separate clusters on the basis of varying customer demands.
  • With the readily available resources, the applications can be built to recognize the need to scale when the limits of the current configuration have been reached.
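Directing requests to an appropriate cluster along one dimension can be sketched with a simple hash of the customer identifier. The cluster names, the choice of customer id as the splitting dimension, and the cluster_for function are assumptions for illustration.

```python
import hashlib
from collections import Counter


def cluster_for(key, clusters):
    """Map a key (here a customer id) to one cluster, always the same one."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return clusters[int.from_bytes(digest[:8], "big") % len(clusters)]


clusters = ["cluster-a", "cluster-b", "cluster-c"]
load = Counter(cluster_for(f"customer-{i}", clusters) for i in range(1000))
```

One design caveat: with plain modulo hashing, adding a cluster remaps most keys. A scheme such as consistent hashing would move fewer customers when the number of clusters changes.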
-Ability to handle failure

Any application will eventually fail, no matter what characteristics it has, and cloud applications are no exception.

Resilience and Fault tolerance:- To tolerate failures, an application must operate as part of a group while staying loosely coupled to its peers. Each application piece should be able to continue execution despite the loss of other functions. Asynchronous interfaces between components are an ideal mechanism for tolerating failures.
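A small sketch of one application piece continuing despite the loss of another. The page renderer and the recommendation service are hypothetical examples; the point is only that a failed peer degrades the result instead of failing the whole request.

```python
def render_page(product_id, fetch_recommendations):
    """Build a page; recommendations are optional and may be unavailable."""
    page = {"product": product_id, "recommendations": []}
    try:
        page["recommendations"] = fetch_recommendations(product_id)
    except ConnectionError:
        pass  # peer is down: serve the page without recommendations
    return page


def healthy_service(product_id):
    return ["p-2", "p-3"]


def failed_service(product_id):
    raise ConnectionError("recommendation service is down")


full_page = render_page("p-1", healthy_service)
degraded_page = render_page("p-1", failed_service)
```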

Distribute failure impact:- In a cloud application the failure of one application cluster would affect only a portion of the application. By spreading the load across multiple clusters, it would be easy to isolate individual clusters against failure in another cluster.

Quick Recovery:- Application components must be able to come up automatically, configure and join the application cluster, in order to have proper and quick recovery.

Data Preservation:- Keep all data on persistent storage and make sure that it is replicated and distributed in the correct manner. If the system state is stored as well, the system can be restarted from the point of failure.
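Restarting from the point of failure can be sketched with a checkpoint that is written after every unit of work. In this sketch a local JSON file stands in for replicated persistent storage, and the function names and state layout are assumptions.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then swap it in atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the stored state, or the default when no checkpoint exists yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default


def process(items, path):
    """Sum the items, resuming after the last item recorded in the checkpoint."""
    state = load_checkpoint(path, {"next_index": 0, "total": 0})
    for index in range(state["next_index"], len(items)):
        state["total"] += items[index]
        state["next_index"] = index + 1
        save_checkpoint(path, state)
    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "state.json")
    partial = process([1, 2], checkpoint)        # a run that stops after two items
    resumed = process([1, 2, 3, 4], checkpoint)  # the restart continues at the third item
```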

Test Failure scenario:- Thanks to the flexibility of cloud computing, it is easy to bring up an instance of the application and test various failure scenarios at very reasonable cost. Single application instances can be taken offline to check the effect on the rest of the application. Similarly, multiple recovery scenarios can be planned and executed for any type of failure.
Also, the application should be built after comparing the costs for developing fail-proof application and the costs for recovering from a failure, and choosing the better of the two.

-Ability to manage applications
It is quite easy to manage applications if they are deployed as virtual appliances. The appliances should bring with them the entire software needed for the lifecycle in the cloud. They should be built systematically so that there is consistency of creating and re-creating images. Rapid provisioning of new systems, for scaling and failure recovery, can be done only if the images to be provisioned are consistent and repeatable.
The appliance configuration and updates should be automated so that the management overhead does not grow in proportion to the application. In this way the appliances can stay in the cloud for any length of time with minimum management overhead.
When appliances are instantiated in the cloud they should plug into a monitoring and management system. This system allows tracking the application instances running in the cloud, migrating or shutting down instances as needed, and gathering logs and other necessary information. Without a management system the application could slowly sprawl across the cloud and waste resources and money.
A management system also plays a major role in the testing and deployment process. Including testing in the management system allows us to bring up a test cluster, conduct the required testing and then migrate the application into production. Because the resources are uniform, updated features and functions can be released to production faster. Automating the creation and management of these appliances, so that a consistent appliance image is produced and managed effectively, removes variability from the release management and deployment process and hence reduces the chance of mistakes.
The following are the advantages of developing application for management in the cloud.

  • Reducing cost and overhead of preparing the application for the cloud
  • Reducing the overhead of bringing up new instances of the application
  • Eliminating application sprawl
  • Reducing the chance for mistakes as the application is scaled out, failed over, upgraded etc
Automating Cloud services [11]

Cloud services are particularly challenging to automate. Most cloud applications are specific to one vendor's API, whereas applications valuable for an enterprise need access and control across heterogeneous systems. Since cloud services are essentially extensions of data centers, developing cloud automation applications raises the same issues as developing system applications, with the added complexity of dealing with multiple vendor interfaces.
  • Automating recovery of cloud resources:- An important advantage of cloud compute services is that they can be started under program control. Virtualized systems allow resources to be assigned dynamically. Automating recovery requires interaction between the monitoring system, which should be aware of system alerts, and the cloud services APIs to restart the resource. Also, creating problem tickets and notifying operations personnel when recovery occurs in order to track service levels can be a part of automation.

  • Auto-scaling:- In this scenario the auto-scaling depends on the load on the application. The load calculation is explicitly or implicitly derived from the monitoring system metrics, which triggers the process to start a compute instance using the cloud API. The additional complexity comes from integrating interaction with other application components.

  • Storage backup:- Most cloud service providers offer theoretically 'unlimited' storage, which makes the cloud an attractive place to keep backups. The automation application must coordinate the storage capacity alert triggered by the monitoring system, the application control commands, and the cloud API action that initiates the transfer to cloud storage. Data storage applications that track critical application data would also need to be notified.

  • Hybrid Applications:- Many applications keep some of their resources in the cloud and the rest in their own datacenter. The interaction between cloud and non-cloud components can be governed by an automation application.

  • Cross vendor services:- When a single vendor's service is used, that vendor is a single point of failure. One way to mitigate this is to spread the services across multiple vendors. To do this, the following must be automated:
    • Interfacing to multiple cloud APIs.
    • Recognizing vendor failures versus network or component failures
    • Reconfiguring components to work with either redundant cloud service
    • Planning for and handling changes and failure scenarios in both sets of services
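Two of the scenarios above, auto-scaling and automated recovery, can be sketched as follows. The proportional scaling rule, the 60% CPU target, the instance limits, and the callback names (is_healthy, restart, notify) are all assumptions. A real automation application would read its metrics from the monitoring system and call a vendor's cloud API instead of the callbacks used here.

```python
import math


def desired_instances(current, avg_cpu_percent, target_cpu_percent=60.0,
                      min_instances=1, max_instances=10):
    """Scale the instance count in proportion to load, within fixed limits."""
    if current < 1:
        raise ValueError("current must be at least 1")
    wanted = math.ceil(current * avg_cpu_percent / target_cpu_percent)
    return max(min_instances, min(max_instances, wanted))


def recover(instances, is_healthy, restart, notify):
    """Restart every unhealthy instance and notify operations about it."""
    restarted = []
    for instance_id in instances:
        if not is_healthy(instance_id):
            restart(instance_id)
            notify(f"restarted {instance_id}")
            restarted.append(instance_id)
    return restarted


scale_up = desired_instances(4, 90.0)
scale_down = desired_instances(4, 15.0)
capped = desired_instances(8, 95.0)

log = []
restarted = recover(
    ["i-1", "i-2", "i-3"],
    is_healthy=lambda instance_id: instance_id != "i-2",
    restart=lambda instance_id: log.append(("restart", instance_id)),
    notify=lambda text: log.append(("notify", text)),
)
```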

Conclusion

Cloud computing offers great potential to improve productivity and reduce costs. In spite of its numerous advantages, there are still fears and unanswered questions that hinder its universal adoption. This paper was written with the goal of explaining the architecture of an ideal cloud, to support more rapid adoption by the scientific community. It presents an ontology of the cloud, dissected into five main layers, and illustrates the inter-relations and inter-dependencies between these layers. It then examines the most important layers in depth, giving a more comprehensive analysis of the cloud computing knowledge domain. We used composability as our methodology in constructing the cloud ontology, which allowed us to capture the inter-relations between the different cloud components.

We also proposed using macro-components for virtualization in cloud computing in order to facilitate easy development and deployment. Finally, we described the characteristics that every cloud application should have in order to make use of the full potential of the proposed cloud architecture. These characteristics are essential to any cloud application.
Better comprehension of the technology would enable the community to design more efficient portals and gateways for the cloud, and facilitate the adoption of this novel computing approach in scientific environments. In turn, this will assist the scientific community to expedite its contributions and insights into this evolving computing field.

References

  1. L. Youseff, M. Butrico, and D. Da Silva, "Toward a unified ontology of cloud computing," in Grid Computing Environments Workshop, 2008. GCE '08, pp. 1-10, 2008.
  2. D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov, "The eucalyptus open-source cloud-computing system," in CCGRID '09: Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, (Washington, DC, USA), pp. 124-131, IEEE Computer Society, 2009.
  3. M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "Above the clouds: A berkeley view of cloud computing," tech. rep., February 2009.
  4. R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud computing and emerging it platforms: Vision, hype, and reality for delivering computing as the 5th utility," Future Generation Computer Systems, vol. 25, pp. 599-616, 2009.
  5. B. Hayes, "Cloud computing," Commun. ACM, vol. 51, pp. 9-11, 2008.
  6. S. Pearson, "Taking account of privacy when designing cloud computing services," in CLOUD '09: Proceedings of the 2009 ICSE Workshop on Software Engineering Challenges of Cloud Computing, pp. 44-52, IEEE Computer Society, 2009.
  7. M. Jensen, J. Schwenk, N. Gruschka, and L. Lo Iacono, "On technical security issues in cloud computing," Cloud Computing, IEEE International Conference on, pp. 109-116, 2009.
  8. A. Lenk, M. Klems, J. Nimis, S. Tai, and T. Sandholm, "What's inside the cloud? an architectural map of the cloud landscape," in CLOUD '09: Proceedings of the 2009 ICSE Workshop on Software Engineering Challenges of Cloud Computing, (Washington, DC, USA), pp. 23-31, IEEE Computer Society, 2009.
  9. J. J. Rehr, J. P. Gardner, M. Prange, L. Svec, and F. Vila, "Scientific computing in the cloud," 2008.
  10. L. M. Vaquero, L. R. Merino, J. Caceres, and M. Lindner, "A break in the clouds: towards a cloud definition," SIGCOMM Comput. Commun. Rev., vol. 39, pp. 50-55, 2009.
  11. H. C. Lim, S. Babu, J. S. Chase, and S. S. Parekh, "Automated control in cloud computing: challenges and opportunities," in ACDC '09: Proceedings of the 1st workshop on Automated control for datacenters and clouds, (New York, NY, USA), pp. 13-18, ACM, 2009.
  12. L. Mei, W. K. Chan, and T. H. Tse, "A tale of clouds: Paradigm comparisons and some thoughts on research issues," Asia-Pacific Conference on Services Computing. 2006 IEEE, pp. 464-469, 2008.
  13. G. Reese and R. George, Cloud Application Architectures: Building Applications and Infrastructure in the Cloud. O'Reilly Media, 1 ed., 2009.
  14. GridBatch: Cloud Computing for Large-Scale Data-Intensive Batch Applications, 2008.
  15. C. Vecchiola, S. Pandey, and R. Buyya, "High-performance cloud computing: A view of scientific applications," 2009.
  16. R. Buyya, C. S. Yeo, and S. Venugopal, "Market-oriented cloud computing: Vision, hype, and reality for delivering it services as computing utilities," Aug 2008.
  17. P. C.-Y. Sheu, S. Wang, Q. Wang, K. Hao, and R. Paul, "Semantic computing, cloud computing, and semantic search engine," International Conference on Semantic Computing, pp. 654-657, 2009.
  18. C. Matthews and Y. Coady, "Virtualized recomposition: Cloudy or clear?," Software Engineering Challenges of Cloud Computing, ICSE Workshop on, pp. 38-43, 2009.
  19. J. S. Rellermeyer, M. Duller, and G. Alonso, "Engineering the cloud from software modules," in CLOUD '09: Proceedings of the 2009 ICSE Workshop on Software Engineering Challenges of Cloud Computing, (Washington, DC, USA), pp. 32-37, IEEE Computer Society, 2009.
  20. R. Chow, P. Golle, M. Jakobsson, E. Shi, J. Staddon, R. Masuoka, and J. Molina, "Controlling data in the cloud: outsourcing computation without outsourcing control," in CCSW '09: Proceedings of the 2009 ACM workshop on Cloud computing security, (New York, NY, USA), pp. 85-90, ACM, 2009.
Other References


  1. P. Loh, "Automating cloud services," TapInSystems WhitePaper, 2009.
  2. E. Ong, "Scaling web applications in a cloud environment using resin 4.0," Caucho-WhitePaper, pp. 38-43, 2009.
  3. W. Paper, "Introduction to cloud computing architecture," Sun Microsystems-WhitePaper, 2009.
  4. W. P. in Enterprise architecture, "Architectural strategies for cloud computing," Oracle-WhitePaper, 2009.
  5. "GOOGLE App Engine," http://code.google.com/appengine
  6. "Amazon elastic compute cloud," http://aws.amazon.com/ec2/
  7. "Salesforce Customer Relationships Management (CRM) system," http://www.salesforce.com/
  8. "GOOGLE Apps," http://www.google.com/apps/business/index.html
  9. "Eucalyptus home page," http://eucalyptus.cs.ucsb.edu/
  10. "Amazon simple storage service," http://aws.amazon.com/s3/
  11. "Microsoft Connected Service Framework," http://www.microsoft.com/serviceproviders/solutions/connectedservicesframework.mspx
  12. "GoGrid," http://www.gogrid.com
  13. "Amazon S3 Availability," http://status.aws.amazon.com/s3-20080720.html
  14. Microsoft Azure. http://www.microsoft.com/azure/
  15. Sun network.com (Sun Grid). http://www.network.com
  16. Amazon Elastic Compute Cloud (EC2). http://www.amazon.com/ec2/
  17. Morgan Stanley. Technology Trends. 12 June 2008. http://www.morganstanley.com/institutional/techresearch/pdfs/TechTrends062008.pdf
  18. Security Guidance for Critical Areas of Focus in Cloud Computing. April 2009. Available at http://www.cloudsecurityalliance.org/guidance/casguide.pdf
  19. Top threats to Cloud Computing V1.0. March 2010. Available at www.cloudsecurityalliance.org/topthreats/csathreats.v1.0.pdf

4/22/10

Life Before the Computer

An application was for employment
A program was a TV show
A cursor used profanity
A keyboard was a piano!
Memory was something that you lost with age
A CD was a bank account
And if you had a 3 1/2 inch floppy
You hoped nobody found out!
Compress was something you did to garbage
Not something you did to a file
And if you unzipped anything in public
You'd be in jail for awhile!
Log on was adding wood to a fire
Hard drive was a long trip on the road
A mouse pad was where a mouse lived
And a backup happened to your commode!
Cut - you did with a pocket knife
Paste you did with glue
A web was a spider's home
And a virus was the flu!
I guess I'll stick to my pad and paper
And the memory in my head
I hear nobody's been killed in a computer crash
But when it happens they wish they were dead!

Courtesy: http://www.jimgibb.com

Poem for Computer Geeks

What if Dr. Seuss Wrote Computer Manuals?


If a packet hits a pocket on a socket on a port,

and the bus is interrupted as a very last resort,

and the address of the memory makes your floppy disk abort,

Then the socket packet pocket has an error to report.



If your cursor finds a menu item followed by a dash

and the double-clicking icons put your window in the trash

and your data is corrupted 'cause the index doesn't hash,

Then your situation's hopeless, and your system's gonna crash.



If the label on your cable on the gable at your house,

says the network is connected to the button on your mouse,

But your packets want to tunnel to another protocol,

That's repeatedly rejected by the printer down the hall.



And your screen is all distorted by the side effects of gauss,

so your icons in the window are as wavy as a souse,

Then you may as well reboot and go out with a bang,

Cause as sure as I'm a poet, the sucker's gonna hang.



When the copy of your floppy's getting sloppy on the disk,

And the microcode instructions cause unnecessary RISC,

then you have to flash your memory and you'll want to RAM your ROM,

Quickly turn off your computer and be sure to tell your mom!

Courtesy: http://www.jimgibb.com

4/21/10

Interesting .. How stats can go Wrong. !

Impressive .. Stats and Probability


MatheMagic

Whoa... True Mathematical Genius..  Awesomeness Redefined

4/20/10

Search Results


4/16/10

High Performance Cloud Computing

http://www.buyya.com/papers/HPCC-ISPAN2009-Keynote.pdf

Cloud Computing Briefing - Jan'10

http://www.cpni.gov.uk/Docs/cloud-computing-briefing.pdf

Introduction to Cloud Platforms-Enterprise Oriented View

http://www.davidchappell.com/CloudPlatforms--Chappell.pdf

Cloud Ontology

http://www.cs.ucsb.edu/~lyouseff/CCOntology/CloudOntology.pdf

4/11/10

A Break in the Clouds: Towards a Cloud Definition

A Break in the Clouds: Towards a Cloud Definition

http://ccr.sigcomm.org/drupal/files/p50-v39n1l-vaqueroA.pdf


A Berkeley View of Cloud Computing

Above the Clouds: A Berkeley View of Cloud Computing

http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf

A Walk in the Clouds

A Walk in the Clouds: Broadband Computing and Communication

http://www.online-pr.com/Holding/Cloud_Computing.pdf

4/8/10

Incomplete DM-Project




High-dimensional Clustering of Web-based text documents using iterative PCA based approach 

PROLOGUE

Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional data spaces are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once, and the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the dictionary.

According to Kriegel, Kröger & Zimek (2009), four problems need to be overcome for clustering in high-dimensional data:



  • Multiple dimensions are hard to think in, impossible to visualize, and, due to the exponential growth of the number of possible values with each dimension, impossible to enumerate. This problem is known as the curse of dimensionality.



  • For spatial data, the concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges. The discrimination of the nearest and farthest point in particular becomes meaningless: $\lim_{d \to \infty} \frac{dist_{\max} - dist_{\min}}{dist_{\min}} = 0$


  • A cluster is intended to group objects that are related, based on observations of their attributes' values. However, given a large number of attributes, some of the attributes will usually not be meaningful for a given cluster. For example, in newborn screening a cluster of samples might identify newborns that share similar blood values, which might lead to insights about the relevance of certain blood values for a disease. But for different diseases, different blood values might form a cluster, and other values might be uncorrelated. This is known as the local feature relevance problem: different clusters might be found in different subspaces, so a global filtering of attributes is not sufficient.
  • Given a large number of attributes, it is likely that some attributes are correlated. Hence, clusters might exist in arbitrarily oriented affine subspaces.
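The loss of contrast between the nearest and the farthest point can be observed numerically. The sketch below draws uniform random points and measures the relative contrast (farthest minus nearest distance, divided by the nearest distance) around a random query point. The sample size, the seed and the uniform distribution are arbitrary choices made for illustration.

```python
import math
import random


def relative_contrast(dimensions, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


low_dimensional = relative_contrast(2)
high_dimensional = relative_contrast(1000)
```

In two dimensions the farthest point is many times farther away than the nearest one, while in a thousand dimensions the two distances differ only by a small fraction.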


The traditional problem of clustering high-dimensional data arises because different subspaces contain meaningful clusters that do not span all the dimensions. This has led to subspace clustering techniques such as PROCLUS [1], ORCLUS [2] and CLIQUE [3], which select a corresponding subset of features to find meaningful groups of clusters in high-dimensional data. Most of these algorithms consider different feature subset selection methods for clustering but ignore the feature transformation aspect. In this iterative PCA approach, we use feature transformation as a technique for finding features in high-dimensional data, rather than for dimensionality reduction. Since this concept may apply well to clustering text documents by category, we may extend it in future to web document classification.





ABSTRACT
Principal component analysis is a ubiquitous technique for data analysis and processing, but it is not based on a probability model, and its effectiveness is limited by its global linearity. Here we try to demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis. The result is a mixture of probabilistic principal component analyzers whose parameters can be determined using an Expectation Maximization algorithm that estimates the principal subspace iteratively. The advantages of this model are explained with respect to clustering and dimensionality reduction.





INTRODUCTION
Principal Component Analysis is a popular technique for dimensionality reduction. The most common derivation of PCA is in terms of a standardized linear projection which maximizes the variance in the projected space (Hotelling 1933). For a set of observed d-dimensional data vectors $\{t_n\}$, $n \in \{1, \ldots, N\}$, the q principal axes $w_j$, $j \in \{1, \ldots, q\}$, are those orthonormal axes onto which the retained variance under projection is maximal. It can be shown that the vectors $w_j$ are given by the q dominant eigenvectors of the sample covariance matrix $S = \frac{1}{N} \sum_{n} (t_n - \bar{t})(t_n - \bar{t})^T$, where $\bar{t}$ is the data sample mean, such that $S w_j = \lambda_j w_j$. The q principal components of the observed vector $t_n$ are given by the vector $x_n = W^T (t_n - \bar{t})$, where $W = (w_1, w_2, \ldots, w_q)$. The variables $x_j$ are then uncorrelated such that the covariance matrix $\frac{1}{N} \sum_{n} x_n x_n^T$ is diagonal with elements $\lambda_j$. [7]
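As a small worked example of this definition, the sketch below computes the sample covariance matrix of two-dimensional data, its eigenvalues in closed form, the dominant principal axis, and the resulting principal components. It is restricted to d = 2 and q = 1 so that it needs only the standard library. The data points and the function name are made up for illustration.

```python
import math


def pca_2d(points):
    """Return (eigenvalues, first principal axis, projections) for 2-D data."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix S = [[a, b], [b, c]].
    a = sum((p[0] - mean_x) ** 2 for p in points) / n
    c = sum((p[1] - mean_y) ** 2 for p in points) / n
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / n
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    middle = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    lam1, lam2 = middle + radius, middle - radius
    # Eigenvector belonging to the larger eigenvalue.
    if b != 0:
        vx, vy = b, lam1 - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    axis = (vx / norm, vy / norm)
    # Principal components: projection of the centred data onto the axis.
    scores = [axis[0] * (p[0] - mean_x) + axis[1] * (p[1] - mean_y) for p in points]
    return (lam1, lam2), axis, scores


points = [(2, 2), (0, 0), (-2, -2), (1, -1), (-1, 1)]
eigenvalues, axis, scores = pca_2d(points)
score_variance = sum(s * s for s in scores) / len(scores)
```

For these points the variance of the projected data equals the larger eigenvalue, which is the sense in which the first principal axis retains the maximal variance.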

However, as remarked in many texts, the usual derivations of PCA do not involve an associated probabilistic model for the observed data. It is possible to obtain a probabilistic formulation of PCA that is closely related to statistical factor analysis. The latent variable formulation leads to an iterative, computationally efficient Expectation Maximization algorithm. [7]

Factor Analysis:

Factor analysis is a linear latent variable model, which relates a d-dimensional observation vector t to a corresponding q-dimensional vector of latent variables x.
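For reference, the usual textbook form of the factor analysis model is written out below. The symbols are an assumption about the notation intended here: W is the d × q matrix of factor loadings, μ is the mean, and ε is a noise term with diagonal covariance Ψ.

```latex
t = W x + \mu + \epsilon, \qquad
x \sim \mathcal{N}(0, I), \qquad
\epsilon \sim \mathcal{N}(0, \Psi)
\quad\Longrightarrow\quad
t \sim \mathcal{N}(\mu,\; W W^{T} + \Psi)
```

Under these assumptions the covariance of the observations splits into a part explained by the latent variables, $W W^{T}$, and a part unique to each observed variable, $\Psi$.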

Factor analysis is related to principal component analysis (PCA) but not identical. Because PCA performs a variance-maximizing rotation of the variable space, it takes into account all variability in the variables. In contrast, factor analysis estimates how much of the variability is due to common factors ("communality"). The two methods become essentially equivalent if the error terms in the factor analysis model (the variability not explained by common factors, see below) can be assumed to all have the same variance.

In this model, by constraining the error covariance $\Psi$ to be a diagonal matrix whose elements are usually estimated from the data, the observed variables $t_i$ are conditionally independent given the values of the latent variables x. These latent variables are thus intended to explain the correlations between the observation variables, while the error term $\epsilon_i$ represents the variability unique to a particular $t_i$.

Maximum Likelihood Estimators:

The maximum likelihood estimate of the parameter $\mu$ is given by the mean of the data.

The log-likelihood is maximized when the columns of W span the principal subspace of the data.

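Also, for $W = W_{ML}$, the maximum-likelihood estimator for the noise variance $\sigma^2$ is given by

$$\sigma^2_{ML} = \frac{1}{d-q} \sum_{j=q+1}^{d} \lambda_j$$

where $\lambda_{q+1}, \dots, \lambda_d$ are the smallest eigenvalues of S, and so $\sigma^2_{ML}$ has a clear interpretation as the average variance lost per discarded dimension. [7]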
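The estimators above can be checked numerically. The sketch below computes the noise variance as the mean of the d − q smallest eigenvalues of S, and builds a weight matrix using the closed form $W_{ML} = U_q (\Lambda_q - \sigma^2 I)^{1/2} R$ given in [7], with the arbitrary rotation R taken to be the identity. The function name `ppca_ml` and the toy covariance matrix are assumptions for illustration.

```python
import numpy as np

def ppca_ml(S, q):
    """Closed-form ML estimates (W_ml, sigma2_ml) for a sample covariance S, q < d."""
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]            # descending eigenvalues
    lam, U = eigvals[order], eigvecs[:, order]
    sigma2 = lam[q:].mean()                      # average variance of discarded dimensions
    # Columns of W span the principal subspace; rotation R is assumed to be I.
    scale = np.sqrt(np.maximum(lam[:q] - sigma2, 0.0))
    W = U[:, :q] * scale
    return W, sigma2

# Toy covariance with known eigenvalues 5, 3, 1, 0.5 (illustrative only).
S = np.diag([5.0, 3.0, 1.0, 0.5])
W_ml, sigma2_ml = ppca_ml(S, q=2)

# Model covariance implied by the estimates.
C = W_ml @ W_ml.T + sigma2_ml * np.eye(4)
```

With these inputs the two discarded eigenvalues 1 and 0.5 average to 0.75, and the model covariance `C` keeps the two retained eigenvalues while replacing the discarded ones by that average.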



Equalities of Eigen-values:

Equality among any of the q principal eigenvalues does not affect the maximum likelihood estimates. However, the case in which all the d − q minor eigenvalues are equal and identical to at least one retained eigenvalue needs to be taken into consideration.

Expectation maximization:

In the EM approach, the latent variables $\{x_n\}$ are considered 'missing' data. If their values were known, estimating W would be easy using standard least-squares techniques. However, the value of $x_n$ for a given $t_n$ is unknown. Instead, the joint distribution of the observed and latent variables, p(t,x), is known, so the expectation of the corresponding complete-data log-likelihood can be calculated.

In the E step of the EM algorithm, this expectation is computed with respect to the posterior distribution of $x_n$ given $t_n$. In the M step, new parameter values $\tilde{W}$ and $\tilde{\sigma}^2$ are determined which maximize the expected complete-data log-likelihood; this is guaranteed to increase the likelihood of interest unless it is already at a local maximum (Dempster, Laird, and Rubin 1977).

The complete-data log-likelihood is given by:

$$L_C = \sum_{n=1}^{N} \ln\{p(t_n, x_n)\}$$

The overall model distribution of a mixture of latent variable models takes the form

$$p(t) = \sum_{i=1}^{M} \pi_i \, p(t \mid i)$$

where $p(t \mid i)$ is a single probabilistic PCA model and $\pi_i$ is the corresponding mixing proportion. The parameters for this model can be determined by an extension of the EM algorithm.
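To make the mixture form concrete, the sketch below evaluates each component density as a Gaussian with covariance $W_i W_i^T + \sigma_i^2 I$ (the marginal of a probabilistic PCA model) and computes, for every data point, the posterior probability that each component generated it. The function names and the toy parameters are assumptions for illustration; they are not fitted values from the project.

```python
import numpy as np

def ppca_log_density(T, mu, W, sigma2):
    """Log density of each row of T under one PPCA model N(mu, W W^T + sigma2 I)."""
    d = T.shape[1]
    C = W @ W.T + sigma2 * np.eye(d)
    diff = T - mu
    _, logdet = np.linalg.slogdet(C)
    maha = np.sum((diff @ np.linalg.inv(C)) * diff, axis=1)   # squared Mahalanobis distance
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def responsibilities(T, pis, mus, Ws, sigma2s):
    """Posterior probability of each mixture component for each row of T."""
    log_joint = np.stack(
        [np.log(pi) + ppca_log_density(T, mu, W, s2)
         for pi, mu, W, s2 in zip(pis, mus, Ws, sigma2s)],
        axis=1,
    )
    log_joint -= log_joint.max(axis=1, keepdims=True)   # stabilize before exponentiating
    R = np.exp(log_joint)
    return R / R.sum(axis=1, keepdims=True)

# Two well-separated toy components in 2 dimensions (illustrative only).
T = np.array([[0.0, 0.0], [10.0, 10.0], [0.5, -0.2]])
pis = [0.5, 0.5]
mus = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
Ws = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
sigma2s = [0.5, 0.5]

R = responsibilities(T, pis, mus, Ws, sigma2s)
```

Each row of `R` sums to one. In an EM extension for the mixture, these posterior probabilities would play the role of the expected component labels in the E step.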

The missing data now includes a set of $x_{ni}$ for each model i and variables $z_{ni}$ labeling which model is responsible for generating each data point $t_n$. The complete-data log-likelihood then takes the form

$$L_C = \sum_{n=1}^{N} \sum_{i=1}^{M} z_{ni} \ln\{\pi_i \, p(t_n, x_{ni})\}$$




PROBLEM DEFINITION
Given a high-dimensional data set, described by its number of observations and its number of dimensions (n being comparatively large), the objective is to find good-quality clusters in that data.

The algorithm starts by performing Principal Component Analysis (PCA) on the data set to obtain the first PC-scores, those with the largest eigenvalues along the most significant principal component. Since principal components are uncorrelated with each other (and independent under the Gaussian assumption), each of these components can be treated as an i.i.d. random variable with a Gaussian distribution whose mean equals the corresponding column mean of the data.

Since these are PC-scores, the random variables can be normalized without loss of relative similarity with respect to the principal components, which yields normalized i.i.d. random variables.

Since PCA is not used here for dimensionality reduction, the original data records can be recovered from the PC-scores and then removed from the original data set. For convenience, the data that is left is referred to as the remaining data set. Next, another round of PCA is performed on the remaining data set to get its first PC-scores, again those with the largest eigenvalues along the most significant principal component. The entire process described above is repeated until the remaining data set is empty. The number of iterations yields the number of independent sub-clusters present in the high-dimensional data.




The Iterative PCA Algorithm






Notion of Similarity and choice of et

Since we are dealing with high-dimensional data, it is not meaningful to define a similarity metric based on ordinary norms, as shown by C. C. Aggarwal et al. in [6]. Instead, this approach measures similarity using cosine similarity, computed in feature space, and then feeds the information back to the original data set so that the matching records can be removed from the original space. This continues until the remaining data is found to be similar to some eigenvector, subject to the lower bound given by the eigenvalue threshold et.

The initial choice of et is debatable; as an initial assumption, we take it to be a fixed constant, say 0.9. As the algorithm iterates, it tries to recalculate et. We take the mean of the normalized PC-scores and ensure that the mean is a close approximation of the unit eigenvector passing through those PC-scores. We keep taking significant PC-scores in steps and update the mean incrementally until we find a deterioration from the previously calculated mean, and then fix et at the lowest of the most significant eigenvalues of the lot. This adaptive et calculation carries on in each iteration until the entire data set is clustered.
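Since the algorithm listing itself is not reproduced in this post, the following is only one possible reading of the loop described above, written as a NumPy sketch. Several choices are assumptions and not the authors' method: et is kept fixed at 0.9 rather than recalculated adaptively, similarity is taken to be the absolute cosine between a record's PC-score vector and the leading principal axis, and a fallback assigns all remaining records to a final cluster when none passes the threshold so that the loop always terminates. The function name `iterative_pca_clusters` is hypothetical.

```python
import numpy as np

def iterative_pca_clusters(X, et=0.9):
    """Peel off one cluster per PCA round; returns an integer label per record."""
    X = np.asarray(X, dtype=float)
    labels = np.full(len(X), -1)
    remaining = np.arange(len(X))          # indices of records not yet clustered
    cluster = 0
    while remaining.size > 0:
        data = X[remaining]
        centered = data - data.mean(axis=0)
        # Rows of Vt are the principal axes, ordered by decreasing variance.
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        scores = centered @ Vt.T           # PC-scores of the remaining records
        norms = np.linalg.norm(scores, axis=1)
        norms[norms == 0] = 1.0            # avoid division by zero
        # Assumed similarity: |cosine| between score vector and the first axis.
        cosine = np.abs(scores[:, 0]) / norms
        selected = cosine >= et
        if not selected.any():
            selected[:] = True             # assumed fallback: close out with one last cluster
        labels[remaining[selected]] = cluster
        remaining = remaining[~selected]   # remove clustered records from the data set
        cluster += 1
    return labels

# Unstructured random data, used only to exercise the loop.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
labels = iterative_pca_clusters(X)
```

Each round removes at least one record, so the loop finishes, and the number of distinct labels corresponds to the number of iterations, which the text interprets as the number of sub-clusters.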



P.S: Project Incomplete- Topic changed to "Maximum Likelihood PCA Approach for Handwritten Letter Data Set" Given in Another Blog-post


Members: Sanket Gupte, Somnath Chakrabarti




REFERENCES:

[1] Aggarwal C, Procopiuc C, Wolf J. L, Yu P. S, Park J. S (1999) Fast Algorithms for Projected Clustering. SIGMOD, 1999

[2] Aggarwal C. C, Yu P. S (2000) Finding generalized projected clusters in high dimensional spaces. SIGMOD, 2000

[3] Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. SIGMOD, 1998

[4] Domeniconi C, Gunopulos D, Ma S, Yan B, Al-Razgan Muna, Papadopoulos D (2007) Locally adaptive metrics for clustering high dimensional data. Data Min Knowl Disc (2007) 14:63-97 DOI 10.1007/s10618-006-0060-8

[5] Domeniconi C, Papadopoulos D, Gunopulos D, Ma S (2004) Subspace Clustering of High Dimensional Data. SIAM international conference on data mining, pp 517-520

[6] Aggarwal C. C, Hinneburg A, Keim D. A (2001) On the Surprising Behavior of Distance Metrics in High Dimensional Space. ICDT Conference Proceedings, 2001

[7] Tipping M. E, Bishop C. M (1999) Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society, Series B, 61, Part 3, pp 611-622