CARNet WebWorld - CUC 2004

Abstract

Every computer system has to be monitored systematically in order to recognize critical conditions that call for troubleshooting, system or application tuning or, ultimately, an upgrade of the system. Over the years, a great many tools have been developed for that purpose on the UNIX/Linux platform. This presentation gives an overview of both traditional tools for monitoring UNIX/Linux systems and complex tools for monitoring distributed systems. The traditional tools are considered in groups according to their specific field of use: basic system monitoring tools, system integrity monitoring tools, system performance monitoring tools and service activity monitoring tools. Visualization of the data collected by a monitoring system plays a very important role in long-term diagnostics and decision making, so the most prominent solutions in that area are discussed. Finally, special attention is paid to the concept of cluster monitoring and its fundamental principles. Active response and job distribution based on monitoring systems are also considered.

One prominent cluster monitoring system is Ganglia. Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at groups of clusters. It relies on a multicast-based listen/announce protocol to monitor state within clusters and uses a tree of point-to-point connections amongst representative cluster nodes to group clusters and aggregate their state. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency. The implementation is robust, has been ported to an extensive set of operating systems and processor architectures, and is currently in use on many clusters around the world.
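To make the phrase "XDR for compact, portable data transport" concrete, the following is a minimal sketch of how a single metric could be packed with XDR primitives. The field layout (name, units, value) and the function names are assumptions made for illustration; this is not Ganglia's actual wire format. Only the XDR encoding rules themselves are standard: 4-byte big-endian integers, IEEE single-precision floats, and strings stored as a length followed by the bytes, zero-padded to a multiple of four.

```python
import struct


def xdr_string(text: str) -> bytes:
    """Encode a string the XDR way: 4-byte big-endian length,
    the raw bytes, then zero padding up to a 4-byte boundary."""
    data = text.encode("utf-8")
    padding = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding


def xdr_float(value: float) -> bytes:
    """Encode an IEEE 754 single-precision float, big-endian."""
    return struct.pack(">f", value)


def encode_metric(name: str, units: str, value: float) -> bytes:
    # Hypothetical layout chosen for this sketch (name, units, value).
    return xdr_string(name) + xdr_string(units) + xdr_float(value)


def decode_metric(packet: bytes):
    """Reverse encode_metric and return (name, units, value)."""
    offset = 0

    def read_string() -> str:
        nonlocal offset
        (length,) = struct.unpack_from(">I", packet, offset)
        offset += 4
        text = packet[offset:offset + length].decode("utf-8")
        # Skip the payload and its padding.
        offset += length + (4 - length % 4) % 4
        return text

    name = read_string()
    units = read_string()
    (value,) = struct.unpack_from(">f", packet, offset)
    return name, units, value


if __name__ == "__main__":
    packet = encode_metric("load_one", "", 0.5)
    print(len(packet), decode_metric(packet))
```

Because every field is aligned to four bytes and uses a fixed byte order, the same packet can be decoded on any architecture without negotiation, which is what makes this kind of encoding attractive for heterogeneous clusters.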

Another monitoring system, Supermon, is a flexible set of tools for high-speed, scalable cluster monitoring. Node behavior can be monitored at high sampling rates. In addition, Supermon uses a data protocol based on symbolic expressions at all levels, from individual nodes to entire clusters. This contributes to Supermon's scalability and allows it to function in a heterogeneous environment.
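As an illustration of why a protocol based on symbolic expressions is easy to handle on any platform, here is a minimal sketch of an s-expression reader. The sample message and its field names (node, cpu, mem) are hypothetical and are not taken from Supermon's real output; the point is only that the same small recursive parser works whether the text describes one node or a whole cluster.

```python
def tokenize(text):
    """Split an s-expression into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def to_atom(token):
    """Convert a token to int or float when possible, else keep the string."""
    for convert in (int, float):
        try:
            return convert(token)
        except ValueError:
            pass
    return token


def _parse(tokens):
    """Consume tokens from the front and return one parsed expression."""
    token = tokens.pop(0)
    if token == ")":
        raise ValueError("unexpected ')'")
    if token != "(":
        return to_atom(token)
    items = []
    while tokens and tokens[0] != ")":
        items.append(_parse(tokens))
    if not tokens:
        raise ValueError("missing ')'")
    tokens.pop(0)  # drop the closing parenthesis
    return items


def parse_sexpr(text):
    """Parse a complete s-expression into nested Python lists."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("empty input")
    result = _parse(tokens)
    if tokens:
        raise ValueError("trailing tokens after expression")
    return result


if __name__ == "__main__":
    # Hypothetical sample message, invented for this sketch.
    sample = "(node (name n01) (cpu (user 12) (system 3)) (mem (free 2048)))"
    print(parse_sexpr(sample))
```

Since the nesting is self-describing, a collector could wrap several per-node expressions in an outer list to represent a cluster and reuse the same parser unchanged.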

Finally, the Hawkeye monitoring system will be presented. It utilizes technologies present in Condor and ClassAds to provide rich mechanisms for collecting, storing, and using information about computers, but it can also be used outside Condor. A Hawkeye system can be used to monitor various attributes of a collection of systems.