                                                                     
                                                                     
                                                                     
                                             
TORQUE Resource Manager 4.1.1 Release Notes

We have left the 4.1.0 release notes in place to make it easier for the user to know what were 
the major changes in the TORQUE 4.1.x branch.

Changes specific to 4.1.1 will be designated as 4.1.1 changes.

August 2012


****************************************************************
The release notes file contains the following sections:

- Overview
- New Features
- Known Issues
- System Requirements
- Installation Information
- Upgrading and Backward Compatibility
- Documentation
****************************************************************


=== Overview ===

TORQUE 4.1.0 ehnaces support for Cray. With this release, all apbasil commands are executed by the pbs_mom daemons inside the Cray network. TORQUE 4.1.0 scales to newer, larger systems and can handle larger numbers of jobs, nodes, and commands per second.

The New Features section provides more information about what TORQUE 4.1.0 has to offer.


=== New Features ===

Cray: Improved ALPS interaction

With this release, ALPS interaction is no longer handled via scripts. All apbasil commands are executed by the pbs_mom daemons, which provides two major benefits: speed and mobility (meaning that pbs_server and Moab can be moved outside of Cray).

The mobility offers many benefits:
  - High availability is now possible.
  - You can place pbs_server and Moab on external nodes where you choose the size (meaning you can make these hosts as fast as you like).
  - If your Cray goes down, you can still submit jobs and look at queued jobs (though you cannot run them).


Hostname lookup

pbs_server now only looks up a hostname one time. All information is cached and pbs_server retreives information from the cache. This reduces the stress on the DNS and increases speed within TORQUE 4.1.0.


=== System Requirements  ===

The following software is required to run TORQUE 4.1.1:

- libxml2-devel package
- openssl-devel package
- ANSI C compiler (The native C compiler is recommended if it is ANSI; otherwise use gcc.) 
- A fully POSIX make. If you are unable to "make" PBS with your make, we suggest using gmake from GNU. 
- Tcl/Tk version 8 or higher if you plan to build the GUI portion of TORQUE or use a Tcl-based scheduler. 
- If you use cpusets, libhwloc 1.1 or later is required (for TORQUE 4.0 and later)


=== Installation Information ===

The directions to install and configure TORQUE are in chapter 1 of the TORQUE 4.1.0 Administrator Guide [http://www.adaptivecomputing.com/resources/docs/torque/4-1-0/help.htm]. Also note additional instructions in the PBS Administrators Guide and README.building_40.

Note that you may need to install libssl-dev in order for the source to make successfully. Specifically, the build system is looking for libssl.so and libcrypto.so. For non-RPM setups, you may need to make a symbolic link from the ssl and crypto libraries to the respective .so names.


=== Upgrading to TORQUE 4.1.1 and Backward Compatibility ===

TORQUE 4.1.1 is not backward compatible with versions of TORQUE prior to 4.0. When you upgrade to TORQUE 4.1.1, all MOM and server daemons must be upgraded at the same time. 

The job format is compatible between 4.1.1 and previous versions of TORQUE. Any queued jobs will upgrade to the new version with the exception of job arrays in TORQUE 2.4 and earlier. It is not recommended to upgrade TORQUE while jobs are in a running state.

Because TORQUE 4.1.1 has removed all use of UDP/IP and moved all communication to use TCP/IP, previous versions of TORQUE will not be able to communicate with the components of TORQUE 4.1.1. However, all files in the /var/spool/torque ($TORQUE_HOME) directory and all subdirectories are forwardly compatible.

++ Job Arrays ++

Job arrays from TORQUE version 2.5 and 3.0 are compatible with TORQUE 4.1.1. Job arrays were introduced in TORQUE version 2.4 but modified in 2.5. If upgrading from TORQUE 2.4, you will need to make sure all job arrays are complete before upgrading.

++ serverdb ++

The pbs_server configuration is saved in the file $TORQUE_HOME/server_priv/serverdb. When running TORQUE 4.1.1 for the first time, this file converts from a binary file to an XML-like format. This format can be used by TORQUE versions 2.5 and 3.0, but not earlier versions. Back up the $TORQUE_HOME/server_priv/serverdb file before moving to TORQUE 4.1.1.

++ Upgrading ++

Because TORQUE 4.1.0 will not communicate with versions of TORQUE prior to 4.0.x, it is not possible to upgrade one component and not upgrade the others. Rolling upgrades will not work.

Before upgrading the system, all running jobs must complete. To prevent queued jobs from starting, nodes can be set to offline or all queues can be disabled. Once all running jobs are complete, the upgrade can be made. Remember to allow any job arrays in version 2.4 to complete before upgrading. Queued array jobs will be lost.

== Cray ==

For upgrading TORQUE to 4.1.0 on a Cray system, refer to the Installation Notes for Moab and TORQUE on a Cray System [http://www.adaptivecomputing.com/resources/docs/mwm/7-1-0/help.htm#xtinstall.html] in Appendix G of the Moab Workload Manager Administrator Guide.


=== Documentation ===

The online help for TORQUE 4.1.0 is available in HTML [http://www.adaptivecomputing.com/resources/docs/torque/4-1-0/help.htm] and PDF [http://www.adaptivecomputing.com/resources/docs/torque/4-1-0/torqueAdminGuide-4.1.0.pdf] format.

=== What's new in 4.1.1 ===

Several problems with deadlocks were fixed in the area of job arrays and routing queues. Unfortunately, this combination did not get tested prior to the release of 4.1.0. However, we believe we have resolved the issues around this combination of functionality.

x11Forwarding was broken in 4.1.0 and has now been fixed in 4.1.1.

For a complete list of bugs fixed for this version of TORQUE please see the CHANGELOG.

=== Known Issues ===

-------------------------------------------------------------------------------
Jobs will occasionally get stuck in an exiting state. 

Elapsed time as displayed in qstat -r is garbage. 

Multi-req jobs with MPMD alps reservations currently not implemented

pbs_connect will use the default server list even if a specific server is requested.

TORQUE does not keep nodes offline after reboot
-------------------------------------------------------------------------------------

Each of these issues will be fixed in TORQUE 4.1.2.

© Copyright 2012, Adaptive Computing Enterprises, Inc.
