Job Management Systems Analysis / F2
A Job Management System (JMS) is a system responsible for controlling user jobs and cluster nodes. The main objective of a JMS is to maximize the utilization of cluster resources while satisfying users' needs. A JMS is also known as a Resource Management System, Workload Manager or Batch System. A JMS has three basic functions: queuing, scheduling and resource management. These functions are implemented in three JMS modules: the Queuing Server, the Scheduler and the Resource Manager. The Queuing Server is responsible for job queuing and interaction with users. The Scheduler decides where jobs will be executed, based on various types of policies. The Resource Manager monitors resources and jobs, allocates resources to jobs and prepares the environment for job execution. Furthermore, the Resource Manager notifies the Queuing Server of resource and job status.
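The division of work among the three modules can be illustrated with a minimal sketch. It is not taken from any of the evaluated systems: the class and method names, the CPU-only resource model and the first-fit policy are all assumptions chosen for brevity.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpus: int              # requested CPUs (hypothetical, CPU-only resource model)
    state: str = "queued"


@dataclass
class Node:
    name: str
    free_cpus: int


class ResourceManager:
    """Monitors nodes and allocates resources to jobs."""

    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}

    def free_nodes(self):
        return list(self.nodes.values())

    def allocate(self, job, node_name):
        node = self.nodes[node_name]
        node.free_cpus -= job.cpus
        job.state = "running"
        # The returned string stands in for the status notification to the server.
        return f"{job.name} running on {node.name}"


class Scheduler:
    """Decides where a job runs; the policy here is plain first fit (an assumption)."""

    def pick_node(self, job, nodes):
        for node in nodes:
            if node.free_cpus >= job.cpus:
                return node.name
        return None


class QueuingServer:
    """Accepts jobs from users and keeps them in a single FIFO queue."""

    def __init__(self, scheduler, resource_manager):
        self.queue = deque()
        self.scheduler = scheduler
        self.rm = resource_manager
        self.status_log = []

    def submit(self, job):
        self.queue.append(job)

    def dispatch(self):
        """Start queued jobs in order; stop at the first job that does not fit."""
        while self.queue:
            job = self.queue[0]
            target = self.scheduler.pick_node(job, self.rm.free_nodes())
            if target is None:
                break
            self.queue.popleft()
            self.status_log.append(self.rm.allocate(job, target))
```

In this sketch the server only queues and dispatches, the scheduler only chooses a node, and the resource manager only tracks and allocates node capacity, so a different policy could be substituted by replacing `Scheduler.pick_node` alone.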
In order to evaluate existing JMSs, we defined a set of criteria based on similar studies. The criteria are divided into five groups: JMS Software Characteristics, Queuing, Scheduling, Resource Management and Security. The first group covers general characteristics of the JMS software; its most important criteria are platform dependency, user interface, and compatibility with distributed file systems and parallel libraries. The second group relates to the queuing capabilities of the JMS. The queuing module should support the creation of multiple queues and allow users to specify required resources (e.g. memory, walltime), control jobs, and view job status and various statistics. The criteria for the scheduling module are: a set of standard scheduling policies, advance reservation of resources, fair share scheduling and the use of multiple policies. The requirements for the Resource Manager are: checkpointing, job migration, load balancing, fault tolerance, and job and node monitoring. The security requirements are AAA (authentication, authorization and accounting) and encryption.
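One way to keep such criteria manageable is to record them as a checklist and compute, per group, how many criteria a system meets. The sketch below is only an illustration: the criterion labels are paraphrased from the text, while the `coverage` function and the equal weighting of criteria are assumptions, not the evaluation method used in this work.

```python
# Criteria groups as listed in the text; labels are paraphrased.
CRITERIA = {
    "Software characteristics": [
        "platform dependency",
        "user interface",
        "distributed file system compatibility",
        "parallel library compatibility",
    ],
    "Queuing": [
        "multiple queues",
        "resource requests",
        "job control",
        "job status and statistics",
    ],
    "Scheduling": [
        "standard scheduling policies",
        "advance reservation",
        "fair share scheduling",
        "multiple policy usage",
    ],
    "Resource management": [
        "checkpointing",
        "job migration",
        "load balancing",
        "fault tolerance",
        "job and node monitoring",
    ],
    "Security": [
        "authentication",
        "authorization",
        "accounting",
        "encryption",
    ],
}


def coverage(supported):
    """Return, for each group, the fraction of its criteria found in `supported`.

    Every criterion counts equally (an assumption made for this sketch).
    """
    return {
        group: sum(item in supported for item in items) / len(items)
        for group, items in CRITERIA.items()
    }


# Hypothetical system used only to show the call; it is not one of the evaluated JMSs.
example_features = {"checkpointing", "encryption", "multiple queues"}
example_scores = coverage(example_features)
```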
We used the criteria described above to evaluate a number of existing JMS solutions. The evaluated systems are Condor, CSS, LSF, LoadLeveler, OpenPBS/Torque, PBSPro and SGE. Based on preliminary research, we decided to implement and test in practice the following three: Condor, Torque and SGE.
Emir Imamagić graduated from the Department of Electronics, Microelectronics, Computer and Intelligent Systems, Faculty of Electrical Engineering and Computing, University of Zagreb in May 2004. His research interests are high performance computing, distributed computing, computer clusters and grid systems.
Before graduation, he worked on the AliEn Grid project at CERN, Switzerland, in summer 2003 and on the MidArc middleware project at Ericsson Nikola Tesla in summer 2002. He is currently working as a researcher on the CRO-GRID Infrastructure project at the University Computing Centre.