Sunday, July 26, 2009

Unbound Development Environment for HPC

Software development can be real fun if your development environment is available to you anywhere and everywhere. This is especially true for a typical HPC developer, who has to worry not only about the right compilers, build environment, test environment, dependencies etc. but also about the complex logic of parallel applications.

Many times you will find that you are tied to your development environment because creating a new one is just a hassle. Moreover, the time required to create such a development environment may not be justified if you are running on a tight schedule. The same goes for the test environment. I mean, there is no way of quickly testing your application in a parallel environment without actually submitting the job to a real cluster. I know you can run multiple MPI processes on the same system, but that's not a really good test. What if you have an application which reads from and writes to a local file system (say, in /opt)? In that case you actually need multiple hosts to test whether it does that I/O properly on each individual host.
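To make that local I/O point concrete, here is a minimal sketch in Python of the kind of check I have in mind. It assumes the script is launched by Open MPI, which exports OMPI_COMM_WORLD_RANK to every process it starts. The IO_CHECK_DIR variable and the file naming are made up for illustration; they are not part of any Sun or Open MPI tool.

```python
import os
import socket
import tempfile


def write_and_verify(directory, rank, host):
    """Write a small marker file for this rank, read it back, return its path."""
    path = os.path.join(directory, f"io_check_{host}_rank{rank}.txt")
    payload = f"rank {rank} on {host}\n"
    with open(path, "w") as handle:
        handle.write(payload)
    with open(path) as handle:
        if handle.read() != payload:
            raise IOError(f"read-back mismatch in {path}")
    return path


if __name__ == "__main__":
    # Open MPI sets OMPI_COMM_WORLD_RANK for each launched process;
    # fall back to 0 so the script also runs on its own.
    rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
    # Hypothetical variable naming the local directory under test.
    target = os.environ.get("IO_CHECK_DIR", tempfile.gettempdir())
    print(write_and_verify(target, rank, socket.gethostname()))
```

Run on one machine, every process lands in the same directory, so the check proves very little. Launched with something like `mpirun -np 2 --host node1,node2 python io_check.py` (node1 and node2 being placeholder host names), each host has to pass the check against its own local file system.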

Now what if someone bundles compilers and an IDE (with advanced DTrace and SGE plugins) in a virtual machine and ships it? Well, then you can set up your own dependencies and build environment in that virtual machine. There you go... now you can download this virtual machine to your laptop or desktop, or run it on some VMware Infrastructure and access it using your netbook. You can make backup copies of this virtual machine too, and there is one thing you can do with virtual machines which is impossible with physical systems... you can use snapshots to go back in time. So if you change something in your development environment while developing, say, version 2.0 and now want the settings from version 1.0, all you have to do is restore the snapshot! I am talking about Sun HPC Software, Developer Edition for OpenSolaris 1.0. But that's not all...

Earlier I was talking about test environments. Sun HPC Software, Developer Edition for OpenSolaris 1.0 comes with a pre-configured, ready-to-use two-node cluster embedded within this virtual machine. This mini cluster is for testing your parallel applications right there in your development environment. So even if you are disconnected from the rest of the world, you are still just a job submission away from your test cluster! This test environment comes with Open MPI and Sun Grid Engine pre-installed and pre-configured.

There is something really interesting about this test environment... It can help you test your parallel application even if you don't get that revered slot on your organization's HPC cluster. The test environment in that virtual machine comes with the Cloud Adapter for Sun Grid Engine (SGE). This Cloud Adapter helps you create your own HPC cluster in the cloud using Amazon EC2, with just a click of the mouse!

You can download the Sun HPC Software, Developer Edition for OpenSolaris 1.0 from here.

Friday, July 24, 2009

Sun Grid Engine and the Soul of Cloud Computing

Cloud computing! Ah... the word is everywhere. Cloud is this, cloud is that... so much so that many times you wonder, and hope, that it's not just vapor hanging in the middle of nowhere, waiting for the right temperature to set in before pouring down the valley. There is a saying in Urdu (the national language of my neighboring country, which is currently in the news for all the wrong reasons). The saying goes like this:

Jo baraste hai
Woh garajte nahi
Jo garajte hai
Woh baraste nahi

Those clouds that thunder need not shower
and those that shower need not thunder

We hear a lot of noise from early cloud resource providers about how scalable they are, how they can simply go to infinity and back if the demand for resources fluctuates in the same range. But what if, tomorrow, cloud computing is adopted by the masses? Will they really be able to scale up? Do they have the right kind of software to handle such loads? They will need to schedule their individual virtual machines to run on servers as optimally as possible; this is a must for the cloud to be a cost-effective solution for both the provider and the subscriber of cloud resources.

The thought of scheduling jobs (think of an individual virtual machine as a job) effectively over the given hardware reminds me of something which was considered the baap (father in Hindi) of all sorts of distributed computing... grid computing. There we go... What if someone uses Sun Grid Engine (SGE), the baap of all cluster schedulers, in a cloud environment?

Think of something along these lines...

  • Imagine you are the guy who is going to provide cloud resources to basically everyone over the internet. So you need to be a truly web-scale provider.
  • You will have a certain number of systems, on which you will most probably be running Xen servers.
  • These Xen servers will run Xen virtual machines, which you will lend out as a cloud service through your web service. (I am just considering IaaS for this example.)
  • Now you want to make sure those virtual machines run on your physical systems in such a way that you push your input costs down as much as possible.
  • Then why not think of SGE as a way to handle the start/stop/migration of these virtual machines? After all, SGE is well known for handling and supporting large loads very effectively.
  • So that brings us to the point: just as we start/stop a job using a job script in SGE, why not start/stop/restart/deploy/destroy a virtual machine the same way?
  • You don't have to invest in writing your own scheduler either!
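
The idea in the list above can be sketched roughly as follows. This is only an illustration, not how any provider actually does it: the Python helpers are hypothetical and merely build command lines. What they build are ordinary `qsub` options (`-N`, `-b y`, `-l`) wrapped around the Xen `xm create` and `xm shutdown` commands. The VM name, config path, host name and memory size are all made up.

```python
def vm_start_command(vm_name, config_path, memory="2G"):
    """Build a qsub command line that asks SGE to place and start a Xen VM."""
    return [
        "qsub",
        "-N", f"vm-{vm_name}",        # job name shown in qstat
        "-b", "y",                    # treat the command as a binary, not a script
        "-l", f"mem_free={memory}",   # only pick a host with this much free memory
        "xm", "create", config_path,  # the "job" itself: start the VM
    ]


def vm_stop_command(vm_name, host):
    """Build a qsub command line that shuts the VM down on the host running it."""
    return [
        "qsub",
        "-N", f"stop-{vm_name}",
        "-b", "y",
        "-l", f"hostname={host}",     # must run where the VM lives
        "xm", "shutdown", vm_name,
    ]


if __name__ == "__main__":
    print(" ".join(vm_start_command("web01", "/etc/xen/web01.cfg")))
    print(" ".join(vm_stop_command("web01", "node1")))
```

One caveat with this sketch: `xm create` returns as soon as the domain is started, so a real setup would need the job to stay alive for as long as the virtual machine runs. Otherwise SGE would consider the resources free again right away.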

Thursday, July 23, 2009

Google Wave and Sun Grid Engine

Google Wave caught my attention after the story got slashdotted. In case you are not already aware of the waves Google is making, I strongly suggest you go through this video before reading any further. Make sure you stay on that video till it finishes; you won't regret it.

This seems to be reason enough for more chairs to fly out of Redmond offices.

I watched the whole demo in that video with my jaw dropping to the floor. Initially it seemed like, big deal? Another web 2.0 crap trying to make its way through an already over-hyped market. But as the demo progresses you start to realize that you are dealing with someone who knows what they are up to better than most of the industry.

While watching the demo on Wave Extensions, something struck me. I started wondering, what if someone writes an SGE Wave robot? Imagine...

  • A typical cluster user creates a Wave and adds the SGE robot to that Wave.
  • This SGE robot is running on the master node of the cluster, probably the same system on which the Wave server is also running. Having your Wave server running on the master node makes a lot more sense than hosting it somewhere outside of the cluster. Why send all the XML deltas that the Wave client creates whenever an event takes place to some outside server?
  • The SGE Wave robot then presents a form in that Wave for authentication on the master node.
  • A cluster user can submit a job to the cluster using the SGE robot.
  • An SGE admin or cluster user can see the status of the cluster in that Wave.
  • The best part of using Wave is collaboration. Users can share and collaborate on the results without duplicating them. One can use other robots like 'Graphy' to analyze results and share them instantly.
  • There could be an assistant robot which can deploy a cluster on the cloud right from the Wave and submit the jobs.
  • Even the main cluster status page can be a Wave itself, which can be shared with users, on blogs and even on Twitter.
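
Leaving the Wave side out completely, the core of such a robot could be as small as the sketch below: a function that turns a line typed into the Wave into an SGE command line. The message vocabulary (submit, status, cancel) is my own invention for illustration; only `qsub`, `qstat`, `qstat -j` and `qdel` are real SGE commands. How the text reaches this function from a Wave, and the authentication step, are not shown.

```python
import shlex


def to_sge_command(message):
    """Map a short text message to an SGE command line (as a list of words)."""
    words = shlex.split(message)
    if not words:
        raise ValueError("empty message")
    action, args = words[0].lower(), words[1:]
    if action == "submit" and len(args) == 1:
        return ["qsub", args[0]]         # submit a job script
    if action == "status" and not args:
        return ["qstat"]                 # list this user's jobs
    if action == "status" and len(args) == 1:
        return ["qstat", "-j", args[0]]  # details for one job id
    if action == "cancel" and len(args) == 1:
        return ["qdel", args[0]]         # delete one job
    raise ValueError(f"don't know how to handle: {message!r}")


if __name__ == "__main__":
    for text in ("submit run_sim.sh", "status", "status 42", "cancel 42"):
        print(text, "->", to_sge_command(text))
```

A real robot would hand the resulting list to something like `subprocess.run` on the master node and write the output back into the Wave.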
To understand how all this can be achieved, watch the video below.

Wednesday, July 22, 2009