Just a few years ago, to most people, the terms "Linux cluster" and "Beowulf cluster" were virtually synonymous. However, these days, many people are realizing that Linux clusters can not only be used to make cheap supercomputers, but can also be used for high availability, load balancing, rendering farms, and more.
In this review, I will try to go through most of the over 100 projects that are listed in freshmeat's Clustering/Distributed Networks category that relate to Linux clustering. In order to do this effectively, I will break down the projects into a few categories. Here is a quick outline of how this review will be structured:
Software for building and using clusters High Performance Computing Software (Beowulf/Scyld, OSCAR, OpenMosix...), including special mention of specific groups of projects such as Single System Image systems and Cluster-specific Operating Systems. High Availability Software (Kimberlite, Heartbeat...). Load Balancing Software (Linux Virtual Server, Ultra Monkey...). Software used on, and for using, clusters File Systems (Intermezzo, ClusterNFS, DRBD...). Installation and Configuration Software (FAI, System Installation Suite...). Monitoring and Management Software (Ganglia, MOSIXVIEW, Performance Co-Pilot...). Programming and Execution Environments and Tools (MPI, PVM, spread...). Miscellaneous (things that don't necessarily fit into other categories well). Software for building and using clusters In this part of the review, I will talk about projects that offer "complete solutions" for the various kinds of Linux Clustering. This means that projects that offer complete start-to-finish solutions for the respective goals will be here, while specific, partial solutions (such as installation, management, monitoring, or programming environment tools) will be covered in Part II. Note that these "start-to-finish" solutions are different for the different types of clustering; for example, while this usually means some sort of slave node installation mechanism for HPC clusters, an installation mechanism is not needed for a High Availability or Load Balancing cluster.
High Performance Computing Software High Performance Computing seems to be the term that everyone likes to use these days. With regards to Linux clustering, this refers to creating a cluster to do any type of task that involves a great deal of computing power, whether it be modeling galaxy collisions or rendering the animation of the latest box office hit.
In this arena, there are quite a few projects out there. Unfortunately, this is sometimes due to the fact that some people are interested in building their own solution to a problem, rather than basing work on what is already out there. I will try to concentrate some of the better projects.
Let's start off by talking a little about the project that most everyone has heard of, the Beowulf Project, also known these days as Scyld. Scyld contains an enhanced kernel and some tools and libraries that are used to present the cluster as a "Single System Image". This idea of a single system image means that processes that are running on slave nodes in the cluster are visible and manageable from the master node, giving the impression of the cluster being just a single system.
The important thing to remember about these single system image systems is that because they alter the way the kernel works, they either distribute their own version of the kernel or distribute kernel patches. This means that you can only use the software with the specific kernel they give you or the version of the kernel that the patch is made for. There are other projects that use this idea of a single system image. Compaq has adopted their NonStop Clusters for Unixware software for Linux and put it into Single System Image Clusters. OpenMosix is a set of extensions to the standard kernel, as well as some userland tools that they are developing to help use the cluster more efficiently. SCE, or the Scalable Cluster Environment, is a set of tools that allow you to build and use a Beowulf cluster. Bproc, the program at the heart of the Beowulf project's ability to present a single system image, is used in Clubmask, as well as some other Open Source projects and tools like Kickstart, cfengine, the Maui scheduler, LAM/MPI, and more.
There are other HPC clustering solutions that do not change the way the kernel functions. These use other means to run jobs and deal with showing information about them. Cplant, the Ka Clustering Toolkit, and OSCAR all allow you to build, use, and manage your cluster in this manner.
There are certain Operating Systems that are geared towards high performance computing and clusters. They can serve various functions, from providing common HPC tools to actually being a specific cluster installation. Warewulf is a distro that you configure and burn onto a CD, then pop in the slave nodes. Boot them off it, and you have an instant cluster. ClumpOS is a MOSIX distribution that is put on a CD to allow the user to quickly add nodes to a MOSIX cluster. One of the nice things about these CD-based complete systems is that they can be temporary; slave nodes can be booted off the CD to be part of the cluster, and then rebooted without the CD to go back to their regular operating systems that reside on their hard disks. MSC.Linux is focused on high performance computing and comes with various kernel extensions, engineering tools, Beowulf tools, and desktop components that facilitate cluster computing.
High Availability Software These days, maintaining system availability is important to everyone. No one wants to have any critical services unavailable, be they a mail server, Web server, database, or something else. In the past few years, many Linux projects have sprung up to try to meet this demand, which has for so many years been filled well by the commercial operating system and software market (AIX or Solaris, for example).
I like to break this idea of High Availability into two separate categories, as it makes the concepts easier to think about. These are straight High Availability and Load Balancing and Sharing, which can be considered a specialized variant of High Availability. Let's talk about straight High Availability first. In its simplest form, you have at least two systems, one of which is live and is currently the box that users connect to when they access your Web page or database. The other is in some sort of stand-by state, waiting and watching to take over should something happen to the live server. There are more complicated issues that can come into play, such as access to shared resources, MAC address takeover, dealing with "quorums" in setups that have more than two boxes, etc. Each of the projects deals with some subset of these issues.
Kimberlite specializes in shared data storage and maintaining data integrity. Piranha (a.k.a. the Red Hat High Availability Server Project), can serve in one of two ways; it can be a two-node high availability failover solution or a multi-node load balancing solution. One of the better-known projects in this space is probably the High Availability Linux Project, also known as Linux-HA. The heart of Linux-HA is Heartbeat, which provides a heartbeat, monitoring, and IP takeover functionality. It can run heartbeats over serial ports or UDP broadcast or multicast, and can re-allocate IP addresses and other resources to various members of the cluster when a node goes down, and restore them when the node comes back up. A very lightweight High Availability setup is peerd, which will do simple monitoring and failover. The purpose of this project is not to provide sub-second failovers, but to have a total system failover in under a minute while allowing the systems to be located in physically separate locations (rather than in, say, the same rack in the server room). Another lightweight project is Poor Man's High Availability, which uses the Dynamic DNS service from dnsart.com to do simple failover and failback of a Web server.
Now, let's go on to talk a bit about that other type of High Availability:
Load Balancing Software Load Balancing is a special kind of High Availability, because not only does it offer the same monitoring and failover services that the straight HA packages offer, it can also balance and share incoming traffic, so the load is distributed somewhat evenly across multiple servers, allowing for faster responses and less likelihood of a server overloading.
Load Balancing is achieved by having at least one (though to achieve "real" High Availability, you'd need at least two) load balancers/routers and at least two backend servers. The live load balancer receives incoming requests, monitors the load and available system resources on the backend servers, and redirects requests to the most appropriate server. The backend servers are all live, in the sense that they are all handling requests they receive from the router. All of this is completely transparent to the end user.
One of the best known projects in this area is the Linux Virtual Server Project. It uses the load balancers to pass along requests to the servers, and can "virtualize" almost any TCP or UDP service, such as HTTP(S), DNS, ssh, POP, IMAP, SMTP, and FTP. Many Load Balancing projects are based on LVS. Ultra Monkey incorporates LVS, a heartbeat, and service monitoring to provide highly available and load balanced services. As mentioned previously, Piranha has a load balancing mode, which it refers to in its documentation as LVS mode. Keepalived adds a strong and robust keepalive facility to LVS. It monitors the server pools, and when one of the servers goes down, it tells the kernel and has the server removed from the LVS topology. The Zeus Load Balancer is not based on LVS, but offers similar functionality. It combines content-aware traffic management, site health monitoring, and failover services in its Web site load balancing. Another project not based on LVS is Pen, a simple load balancer for TCP-based protocols like HTTP or SMTP. Turbolinux Cluster Server is the last of the load balancing projects I will talk about. It is from the folks at Turbolinux, and its load balancing and monitoring software allows detection and recovery from hardware and software failures (if recovery is possible).
Software used on, and for using, clusters In this part of the review, I will talk about software that is used on a cluster for a specific purpose. This purpose can be an execution environment, a programming interface, monitoring and/or management software, filesystems that lend themselves well to clustering, and more.
Filesystems Let's first talk about filesystems and filesystem-related projects. There are a lot of filesystems for Linux. These days, many of them have features like journaling, and some were designed specifically for use in clustered and distributed systems. The strong and weak points and differences in the various filesystems is a bit too large a topic to get into here, so I will just mention some of them, with the recommendation that more research be done before you decide what is best for you.
OpenAFS (an open version of the Andrew Filesystem originally developed at Carnegie Mellon University), GFS (the Global Filesystem), Coda, and InterMezzo are all distributed/cluster filesystems.
There are some other projects that I want to mention in this section because they relate to filesystems, though may not technically be considered filesystems by some. First, there is the ClusterNFS project, a set of patches to the Universal NFS Daemon that allows multiple clients to NFS mount the same root filesystem. It uses a mechanism in which it "tags" files. This is quite useful for NFS mounting the root filesystem for a cluster of diskless slave nodes.
The next two projects I will mention both deal with block devices. ENBD, the Enhanced Network Block Device, and DRBD allow remote disks to appear as if they were local block devices. This is especially useful when setting up a RAID mirror, to have your data updated in realtime but located on separate machines so it can be highly available.
Installation and Configuration Software When dealing with perhaps hundreds or thousands of slave nodes in a cluster, installing and configuring them all is a huge task, and it is quite boring, especially when each of the nodes is functionally the exact duplicate of its hundreds or thousands of counterparts. Any tool to automate as much of this process as possible is always a welcome addition to the arsenal of the cluster administrator. As this is a problem that many other people have already faced, there are a few different solutions, each attacking the issue a little differently and solving different aspects of the overall problem.
FAI stands for Fully Automatic Installation. It is a non-interactive system to install the Debian GNU/Linux distribution. The installation of one or many nodes is initiated, and after installation is complete, the systems are fully configured and running, with no intervention on the part of the user. While FAI is specific to the Debian GNU/Linux distro, there are other projects that are not.
To install boxes System Installation Suite It is the answer
(I thought I would present this next project in haiku form, just to see if you were still paying attention.) The System Installation Suite is an image-based installation tool. Images are created and stored on an image server and copied to nodes as needed. Currently, Red Hat, Mandrake, SuSE, Conectiva, and Turbolinux are supported by SIS, with the hope that Debian and Debian-variant support will be coming sometime soon. SIS is made up of three projects, SystemImager, System Installer, and System Configurator.
Monitoring and Management Software Being able to monitor and manage a cluster from one central location is another key to not going insane while administering a cluster of a couple hundred nodes. In this space, there seems to be a large number of projects, and again, each one does things a little differently, solving different variations on the common problem.
ClusterIt is only for management and maintenance of large numbers of systems. Ganglia is a very stable, well-known and well-tested, realtime monitoring and remote execution environment. It is in use in universities, government labs, and complete cluster implementations all over the world, and is highly respected by everyone I know that has used it. Performance Co-Pilot is a monitoring and management package from SGI. Originally written for IRIX, SGI ported it to Linux and released it to the Open Source community back in February 2000. With years of development and testing behind its commercial version, SGI is able to use its knowledge to offer the Linux version the stability and functionality that is sometimes lacking in Open Source projects.
Besides these projects that can be used on any set of boxes, there are some projects that were developed to work specifically with other projects. MOSIXVIEW is a GUI for a MOSIX cluster. It supports both MOSIX and OpenMosix, acting as a frontend to the mosctl commands. LVSmon is a monitoring daemon written to be used for maintaining LVS tables (LVS referring to the Linux Virtual Server Project, previously mentioned in the Load Balancing Software section above).
In addition to these large projects that encompass a lot of capabilities, there are many projects that are simpler, with smaller scope. Syncopt is a solution to the problem of keeping software on multiple systems up-to-date without undertaking the task of installing the software on each system by hand. With Syncopt, the software is installed on a central server and propagated to all systems that need it. Fsync is similar to rsync and CVS; it allows file synchronization between remote hosts, with functionality to merge differences and hooks for tools to merge the trees. It is a single Perl script, designed to function over modem speed (read: slow) network connections. Ghosts (global hosts) is a system that allows creation of macros to define groups of machines. Using these macros, gsh, an included parallel task runner, can execute code on a group of machines simultaneously. Last, but not least in this section, there is pconsole, a tool with a functionality somewhat similar to ghosts, but with a different approach. pconsole is an interactive administrative shell for clusters. It allows a user to open connections to a number of different cluster nodes and type commands into a special window, and the command is sent to each of the machines.
Programming and Execution Environments and Tools Once you have a cluster built, what do you do with it? How do you write programs that take advantage of it? What libraries and programming tools are out there to help you write code optimized to run on a cluster? What software is out there to help you run a job on the cluster or keep a schedule of what is going to run, and when? These are the types of questions that the projects in this section answer.
PVM stands for the Parallel Virtual Machine. It is a portable message passing interface that lets a heterogeneous set of machines function as a cluster. Applications that use it can be written in C, C++, or Fortran, and can be comprised of a number of separate components (processes). PVM++ attempts to provide an easy-to-use library to program for PVM in C++. Programs like pvmpov or PVM Gmake use the PVM interface. pvmpov is a patch for POV-Ray which allows rendering to take place on a cluster using PVM-based code. PVM Gmake is a variant of GNU make which uses PVM to build in parallel on several remote machines.
Another of these message passing interfaces is MPI, which, coincidentally, stands for "The Message Passing Interface". There are a few different implementations of MPI. LAM/MPI and MPICH are two of them, and Object-Oriented MPI is an OO interface to the MPI standard which can be used at a much higher level than the standard C++ bindings. PETSc is a set of data structures and routines used for parallel applications that employs the MPI standard for all its message passing.
Another messaging system that is available is Spread. Spread is a toolkit that provides a message passing interface that can be used for highly available applications, a message bus for clusters, replicated databases, and failover applications.
Now, you may be asking how someone might keep track of all the jobs that are going to be run on a cluster. For that, you need a scheduler. There are a number of different schedulers out there, but when it comes to Linux Clustering, I mainly hear about two of them, Condor and Maui. These schedulers can handle scheduling policies, heterogeneous resources, dynamic priorities, reservations, and more.
Miscellaneous Cluster-Related Projects There are some projects that are related to clustering but do not really fit into the categories that I have used to group things above. I will now briefly mention two these, as they are useful and worth noting.
First, there is the Distributed Lock Manager from IBM. The DLM provides an implementation of locking semantics modeled after the VAX Cluster locking semantics. This is a specific tool, not an entire cluster environment. An API is embodied in a shared library, and applications that wish to use the DLM can connect through this.
Secondly, the Linux Terminal Server Project is a great way of providing access to diskless systems (workstations, cluster slave nodes, or any other diskless systems you may have). LTSP provides the administrative tools that make accessing these systems easier than it might otherwise be.
Some Comments In Closing This is by no means an all-encompassing list of the projects that exist in the world of Linux Clustering. There are a lot of other projects out there. I tried not to mention projects that have dead links or which haven't had any updates in a couple of years. At the pace that the world of clustering moves, two years is pretty outdated, unfortunately. Also, I may have missed some projects. Maybe they aren't included in the Clustering/Distributed Networks Category here on freshmeat. Some other good categories to look in would be:
System::Installation/Setup System::Monitoring System::Systems Administration , just to name a few. Maybe they aren't listed on freshmeat at all. There are also many Web sites on the great big World Wide Web that are devoted to Linux Clustering in its various forms:
The High Availability Linux Project This site is not only the home of the Linux-HA project, but also has lots of information about Linux High Availability. The Linux Clustering Information Center Ok, I may be a little biased, as this is my Web site, but I think it's a pretty useful place to find links to all sorts of information about all the types of clustering, from software to documentation to Linux Clustering haikus. The Beowulf Project The project that started it all. This page contains the history of the Beowulf Project, as well as some links to some other interesting sources of information. The Sourceforge Clustering Foundry A Sourceforge Foundry devoted to Clustering. What more could you ask for? Again, this is a partial listing. I would recommend these places as good starting points in your search for more information; more likely than not, they will be able to point you in the direction you want to go.
What is the state of Linux Clustering? Where does it stand now, and where is it going?
These are good questions. Linux Clustering is, at least in my biased opinion, off to a great start. With recent announcements such as Pacific Northwest National Laboratory's purchase of a 1,400 node McKinley cluster running Linux (with an expected peak performance of 8.3 Teraflops) and vendors like Penguin Computing announcing their new 1U SMP Athlon system (ideal for clustering), the future is looking quite bright. Many people are taking Linux seriously these days, with support from companies like IBM, SGI, Intel, and Sun, just to name a few, as well as efforts at many universities and the DOE National Labs, where software development and testing take place. With these pieces at the base, there is no real ceiling of how far this can go. This may sound kind of hokey, but I think it is true. Being able to have supercomputer speeds or fault tolerant systems for a fraction of what it used to cost, I think many people are going to embrace this and run with it