CUC 2004 / New Frontiers / New Technologies for New Needs
Job Management Systems Analysis / F2
Authors: Emir Imamagić, Dobriša Dobrenić, Branimir Radić, University Computing Centre – Srce, Croatia

A Job Management System (JMS) is a system responsible for controlling user jobs and cluster nodes. The main objective of a JMS is to achieve maximal utilization of cluster resources while satisfying users' needs. A JMS is also known as a Resource Management System, a Workload Manager or a Batch System. A JMS has three basic functionalities: queuing, scheduling and resource management. These functionalities are implemented in three corresponding JMS modules: the Queuing Server, the Scheduler and the Resource Manager. The Server is responsible for job queuing and interaction with users. The Scheduler decides where jobs will be executed; the decision is based on various types of policies. The Resource Manager monitors resources and jobs, allocates resources to jobs and prepares the environment for job execution. Furthermore, the Resource Manager notifies the Server of resource and job status.
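The interaction among the three modules can be sketched as a toy simulation. This is a hypothetical, heavily simplified illustration (all class and attribute names are invented); a real JMS is far more elaborate:

```python
# Minimal, hypothetical sketch of the three JMS modules: a Server that
# queues jobs, a Scheduler that picks a node, and a Resource Manager
# that tracks free CPU slots. Illustrative only; not any real JMS API.

from collections import deque

class ResourceManager:
    def __init__(self, nodes):
        self.free = dict(nodes)          # node name -> free CPU slots

    def allocate(self, node, cpus):
        self.free[node] -= cpus

    def status(self):
        return dict(self.free)

class Scheduler:
    def choose_node(self, job, rm):
        # first-fit policy: pick the first node with enough free slots
        for node, slots in rm.status().items():
            if slots >= job["cpus"]:
                return node
        return None                      # job must wait in the queue

class Server:
    def __init__(self, scheduler, rm):
        self.queue = deque()
        self.scheduler, self.rm = scheduler, rm

    def submit(self, job):
        self.queue.append(job)

    def run_cycle(self):
        started = []
        for _ in range(len(self.queue)):
            job = self.queue.popleft()
            node = self.scheduler.choose_node(job, self.rm)
            if node is None:
                self.queue.append(job)   # stays queued
            else:
                self.rm.allocate(node, job["cpus"])
                started.append((job["name"], node))
        return started

server = Server(Scheduler(), ResourceManager({"node1": 2, "node2": 4}))
server.submit({"name": "a", "cpus": 2})
server.submit({"name": "b", "cpus": 4})
print(server.run_cycle())  # → [('a', 'node1'), ('b', 'node2')]
```

Swapping the first-fit loop in `choose_node` for another policy is exactly the kind of change an external scheduling module (such as Maui, discussed below) makes possible.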

In order to evaluate existing JMSs we defined a set of criteria, based on similar studies [1], [2], [3], [4], [5], [6], [7]. The criteria are divided into five groups: JMS software characteristics, queuing, scheduling, resource management and security. The first group covers general characteristics of the JMS software; the most important criteria here are platform dependency, user interface, and compatibility with distributed file systems and parallel libraries. The next group relates to the queuing capabilities of the JMS: the queuing module should support the creation of multiple queues and allow users to define required resources (e.g. memory, walltime), control their jobs, and view job status and various statistics. The criteria for the scheduling module are: a set of standard scheduling policies, advance reservation of resources, fair-share scheduling and the use of multiple policies. The requirements for the Resource Manager are: checkpointing, job migration, load balancing, fault tolerance, and job and node monitoring. The security requirements are AAA (authentication, authorization and accounting) and encryption.
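Fair-share scheduling, one of the scheduling criteria above, prioritizes users who have consumed less than their allotted share of resources. A hypothetical sketch (the formula is invented for illustration; real schedulers use more elaborate, decaying usage metrics):

```python
# Hypothetical illustration of fair-share priority: a user's priority
# drops as their recent usage grows relative to their assigned share.
# Not taken from any particular JMS; actual formulas vary.

def fair_share_priority(usage, share):
    """Return a priority in (0, 1]; higher means scheduled sooner."""
    # Ratio of consumed to deserved resources; the +1 avoids division
    # by zero for users with no usage and bounds the result.
    return 1.0 / (1.0 + usage / share)

# (cpu-hours used, assigned share) -- both users deserve half the cluster
users = {"alice": (80.0, 0.5), "bob": (10.0, 0.5)}

order = sorted(users, key=lambda u: fair_share_priority(*users[u]),
               reverse=True)
print(order)  # → ['bob', 'alice']: bob has used less of his share
```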

We used the criteria described above to evaluate numerous existing JMS solutions. The evaluated systems are Condor, CSS, LSF, LoadLeveler, OpenPBS/Torque, PBSPro and SGE. Based on preliminary research we decided to install and practically test the following three: Condor, Torque and SGE.
Torque [11] is based on OpenPBS [14] with some additional features (mainly related to scalability). We installed Torque together with the Maui scheduling system. A good feature of Torque is that it can use an external scheduling module. The strengths it inherits from OpenPBS are that the system is mature and well tested; furthermore, OpenPBS provides an interface used by parallel applications. Maui is a reservation-based scheduler. The advantages of Maui are advance reservation, complex scheduling policies, fair-share scheduling and various tools for managing and diagnosing cluster resources. The major disadvantage of OpenPBS is that it has no certain roadmap, because of the commercial version, PBSPro.
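The advance reservation that Maui adds can be illustrated by its core admission test: a waiting job may be backfilled onto reserved resources only if it is guaranteed to finish before the reservation begins. A toy sketch (hypothetical simplification; Maui's real backfill algorithm considers many more factors):

```python
# Toy illustration of the admission test behind backfill with advance
# reservations: a job started now may use the node only if its
# requested walltime ends before the reservation starts. Hypothetical
# simplification of what a reservation-based scheduler like Maui does.

def can_backfill(now, job_walltime, reservation_start):
    """True if a job started now completes before the reservation."""
    return now + job_walltime <= reservation_start

# A reservation begins at t=10; jobs are considered at t=6.
print(can_backfill(6, 3, 10))  # → True: ends at t=9, before t=10
print(can_backfill(6, 5, 10))  # → False: would still run at t=10
```

This is why users must declare a walltime at submission: without it, the scheduler cannot prove a job will vacate the node in time.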
SGE [10] is a product of Sun Microsystems. The advantages of SGE are: job migration, load balancing, fault tolerance, security mechanisms and an excellent GUI to all modules. In addition, a major advantage is that SGE is evolving rapidly. Its disadvantages are: limited parallel job support, checkpointing only via external mechanisms, no advanced scheduling and no possibility of creating multiple queues.
Condor [8] is a project of the University of Wisconsin. Condor is profoundly different from the other two JMSs: it is designed for High Throughput Computing and cycle stealing. The good features of Condor are checkpointing, process migration, security mechanisms, dynamic node addition and removal, and the possibility of sending jobs from one Condor cluster to another. The major disadvantage of Condor is that it is not designed for parallel applications, although it provides additional modules to support them. Furthermore, Condor does not support interactive jobs and cannot have multiple queues.
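Condor's cycle stealing rests on matchmaking: jobs and machines both advertise attributes, and a job runs only on a machine whose owner is currently idle and which satisfies the job's requirements. A toy sketch (attribute names are invented simplifications of Condor's ClassAd mechanism):

```python
# Sketch of Condor-style matchmaking: machines advertise attributes,
# and the matchmaker pairs a job with the first machine satisfying its
# requirements. Attribute names here are hypothetical simplifications
# of Condor's ClassAd mechanism.

machines = [
    {"name": "m1", "memory_mb": 512,  "idle": True},
    {"name": "m2", "memory_mb": 2048, "idle": True},
    {"name": "m3", "memory_mb": 4096, "idle": False},  # owner is active
]

def match(job_requirements, machines):
    """Return the first eligible machine meeting the job's needs."""
    for m in machines:
        # cycle stealing: only machines whose owner is idle are eligible
        if m["idle"] and m["memory_mb"] >= job_requirements["memory_mb"]:
            return m["name"]
    return None

print(match({"memory_mb": 1024}, machines))  # → m2
```

When the owner of m3 stops using it, flipping its `idle` flag would make it eligible again, which is how Condor harvests otherwise wasted cycles.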

[1] J. A. Kaplan, M. L. Nelson: "A Comparison of Queueing, Cluster and Distributed Computing Systems", June 1994.
[2] M. Baker, G. Fox, H. W. Yau: "Cluster Computing Review", NPAC Technical Report SCCS-748, Northeast Parallel Architectures Center, Syracuse University, November 1995.
[3] J. P. Jones: "NAS Requirements Checklist for Job Queuing/Scheduling Software", NAS Technical Report NAS-96-003, April 1996.
[4] C. Byun, C. Duncan, S. Burks: "A Comparison of Job Management Systems in Supporting HPC ClusterTools", Proc. SUPerG, Vancouver, Fall 2000.
[5] O. Hassaine: "Issues in Selecting a Job Management System (JMS)", Proc. SUPerG, Tokyo, April 2001.
[6] T. El-Ghazawi, K. Gaj, N. Alexandridis, F. Vroman, N. Nguyen, J. R. Radzikowski, P. Samipagdi, S. A. Suboh: "A Performance Study of Job Management Systems"
[7] T. El-Ghazawi, K. Gaj, N. Alexandridis, B. Schott, A. V. Staicu, J. R. Radzikowski, N. Nguyen, S. A. Suboh: "Conceptual Comparative Study of Job Management Systems", Technical Report, February 2001
[8] Condor,
[9] LSF,
[10] SGE,
[11] Torque,
[12] IBM Loadleveler,
[13] PBSPro,
[14] OpenPBS,


Emir Imamagić graduated from the Department of Electronics, Microelectronics, Computer and Intelligent Systems, Faculty of Electrical Engineering and Computing, University of Zagreb in May 2004. His research interests are high performance computing, distributed computing, computer clusters and grid systems.

Before graduating, he worked on the AliEn Grid project at CERN, Switzerland, in the summer of 2003, and on the MidArc middleware project at Ericsson Nikola Tesla in the summer of 2002. He is currently working as a researcher on the CRO-GRID Infrastructure project at the University Computing Centre.
