﻿                               Release Notes

TORQUE Resource Manager

Software Version 4.0.1

                        ---------- WARNING! ----------
TORQUE 4.0.0 is not backward compatible with previous versions of TORQUE. 
When you upgrade to TORQUE 4.0.0 all MOM and server daemons must be 
upgraded at the same time. 

The job format is compatible between 4.0.0 and previous versions of TORQUE. 
Any queued jobs will upgrade to the new version with the exception of 
job arrays in versions 2.4 and earlier of TORQUE. See the Upgrading and 
Backward Compatibility section of this document for more details. It is not
recommended to upgrade TORQUE while jobs are in a running state.

*******************************************************

The readme file contains the following sections:

+ Overview +
+ New Feature Highlights +
+ New Feature Details +
+ Upgrading and Backward Compatibility +
+ Requirements +
+ Build and Installation Directions +
+ Known Issues +

********************************************************

OVERVIEW

    TORQUE needs to be able to scale to newer, larger systems. This includes:
    handling larger numbers of jobs, nodes, and commands per second. TORQUE 
    also needs to become far more reliable. To respond to these challenges,
    TORQUE has undergone a huge makeover - pbs_server is now multi-threaded,
    RPP/UDP is no longer used, mom reporting can be organized into 
    hierarchies, job reporting can be divided into a tree structure, and 
    many other smaller yet significant changes.

*******************************************************

+ New Feature Highlights +

*******************************************************

What's new in 4.0.2
  This build has several key fixes which were discovered after 4.0.1 was released.
  Following are a list of the key fixes. The remainder of changes and enhancements
  can be found in CHANGELOG.

  -- Enable TORQUE 4.0.2 to listen on all interfaces. This is critical for 
     multi-homed systems.
  -- Fix a problem where TORQUE would lose communication with
     Moab frequently. Particularly a problem with Cray systems.
  -- Patched a memory leak with nvidia GPUs enabled
  -- Fixed a problem where qmgr could only run 9 commands
  -- Fixed a deadlock when job logging is enabled.

The following is a summary of key new features in TORQUE 4.0

- Multi-threaded pbs_server
  -- Faster response to client commands. No more waiting for someone else's call to qstat to 
     finish before your request can run.
  -- Higher job throughput
  -- More robust 

- New trqauthd client daemon
  -- trqauthd replaces pbs_iff
  -- trqauthd is multi-threaded and able to reliably handle multiple simultaneous client requests.
  -- Run as root. Gives Administrators more control

- Communication
  -- RPP (reliable packet protocol) removed
  -- UDP removed
  -- All communication done using TCP/IP

- Mom Hierarchy
  -- Reduces overall compute node update traffic to pbs_server
  -- Allows administrators to manage network communications 

- Scalability
  -- Ability to support over 15,000 compute nodes in a cluster
  -- Job radix allows users to submit jobs with very large node needs


*******************************************************

+ New Feature Details +

*******************************************************

This section contains detailed information pertaining to specific new features. 

---------------------------------
=== Multi-threaded pbs_server ===
---------------------------------

pbs_server is multi-threaded starting with TORQUE 4.0. By default the number of available
threads is 2*(number of cores)+1. This value can be modified using a set of three
server parameters; min_threads, max_threads and thread_idle_seconds. Each of these
parameters take an integer as an argument and can be set using qmgr.

  min_threads: Sets the minimum number of threads that will be available in pbs_server

  max_threads: Sets the maximum nuber of threads that can be on pbs_server

  thread_idle_seconds: The number of seconds a thread can be idle before it is removed from
  the system threadpool. Note the number of threads will never fall below the minimum.
  thread_idle_seconds by default is -1. This indicates to TORQUE to never remove a thread
  from the threadpool.

By setting max_threads greater than min_threads pbs_server is able to dynamically add threads
as the server load increases. The thread_idle_seconds parameter is used to detect a drop
in the server load and removes threads as they are idle for the number of seconds given in 
the parameter.

Setting max_threads equal to min_threads will keep the number of threads static.




---------------------
=== Mom Hierarchy ===
---------------------

The Mom Hierarchy is a new optional feature in TORQUE 4.0 which is designed to improve the 
efficiency of communications between pbs_server and the compute nodes (pbs_mom). By default
all compute nodes send status updates directly to pbs_server. As cluster sizes increase the 
need to reduce the traffic and time required to keep the cluster status up to date becomes
more important. The Mom Hierarchy allows administrators to configure compute nodes in a way
where each node sends its status update information to another compute node. The compute nodes 
pass the information up a tree or hierarchy until eventually the information reaches a node
that will pass the information directly to pbs_server. 

Setting up the Mom Hierarchy

The name of the file that contains the configuration information is named mom_hierarchy. By 
default it is located in the /var/spool/torque/server_priv directory. The file uses an XML like
syntax. By the time TORQUE 4.0 is released this will be an XML compatible syntax. For now 
they syntax is as follows:

   <path attr=val>
     <level attr=val> comma separated node list </level>
     <level attr=val> comma separated node list </level>
     ...
   </path attr=val>
   <path attr=val>
     <level attr=val> comma separated node list </level>
     ...
   </path attr=val>
   ...

The <path> </path> tag pair identifies a group of compute nodes. The <level></level> tag pair
contains a comma separated list of compute node names. Multiple paths can be defined with 
multiple levels within each path. 

Within a <path> tag pair the levels define the hierarchy. All nodes in the top level communicate
directly with the server. All nodes in lower levels communicate to the first available node
in the level directly above it. If the first node in the upper level goes down the nodes in the 
subordinate level will then communicate to the next node in the upper level. If no nodes are 
available in an upper level then the node will communicate directly to the server.

If an upper level node has fallen out and it is back in again the lower level nodes will eventually
find that the node is available and start sending their updates to that node.


----------------
=== trqauthd ===
----------------

trqauthd is a new daemon starting in TORQUE 4.0. It replaces pbs_iff which is used by 
TORQUE client utilities to authorize user connections to pbs_server. Unlike pbs_iff 
which is executed by the TORQUE client utilities each time the utility is run, trqauthd
is started once and remains resident. TORQUE client utilities then communicate with 
trqauthd on port 15005 on the loopback interface. Unlike pbs_iff trqauthd is multi-threaded
and is able to successfully handle a greater volume of simultaneous requests than pbs_iff.

Running trqauthd

trqauthd MUST be run as root. It must be running on any host where TORQUE client commands 
will execute.

By default trqauthd is installed to /usr/local/sbin.

trqauthd can be invoked directly from the command line or by the use of init.d scripts 
which are located in the contrib/init.d directory of the TORQUE source.

There are three init.d scripts for trqauthd in the contrib/init.d directory of the 
TORQUE source tree. 

debian.trqauthd
Used for the apt based systems (debian, ubuntu are the most common variations of this)

suse.trqauthd
Used for the rpm based systems. (redhat, suse, scientific, centos, fedora, are some 
common examples)

trqauthd
An example for other packages managers. (anything that doesn't use rpm or apt)

Inside each of the scripts are the variable PBS_DAEMON and PBS_HOME. These two variables
should be updated to match your torque installation. PBS_DAEMON needs to point to the
location of trqauthd. PBS_HOME needs to match your TORQUE installation. For more information
about PBS_HOME please see the TORQUE documentation at www.adaptivecomputing.com.

Choose the script that matches your dist system and copy it to /etc/init.d. If needed, 
renamed it to trqauthd

To start the daemon type:
     /etc/init.d/trqauthd start

To stop the daemon type:
     /etc/init.d/trqauthd stop

You can also use the following:
     service trqauthd start/stop

---------------------------------------------------
=== Upgrading to 4.0 and Backward Compatibility ===
---------------------------------------------------

Because TORQUE 4.0 has removed all use of UDP/IP and moved all communication to use 
TCP/IP previous versions of TORQUE will not be able to communicate with the components 
of TORQUE 4.0. However, all files in the /var/spool/torque ($TORQUE_HOME) directory 
and all subdirectories are forwardly compatible.

Job Arrays
Job arrays from TORQUE version 2.5 and 3.0 are compatible with TORQUE 4.0. Job arrays 
were introduced in TORQUE version 2.4 but modified in 2.5. If upgrading from TORQUE
2.4 you will nned to make sure all job arrays are complete before upgrading.

serverdb
The pbs_server configuration is saved in the file $TORQUE_HOME/server_priv/serverdb. 
When TORQUE 4.0 is run for the first time this file will be converted from a binary 
file to an XML like format. This format can be used by TORQUE versions 2.5 and 3.0 
but not earlier versions. It is recommended this file be backed up before moving to 
TORQUE 4.0.

Upgrading
Because TORQUE 4.0 will not communicate with previous versions of TORQUE it is not 
possible to upgrade one component and not upgrade the others. Rolling upgrades will
not work.

Before upgrading the system all running jobs must complete. To prevent queued jobs
from starting nodes can be set to offline or all queues can be disabled. Once all
running jobs are complete the upgrade can be made. Remember to allow any job arrays
in version 2.4 to complete before upgrading. Queued array jobs will be lost.

----------------------------
=== Requirements ===
----------------------------

An ANSI C compiler is required.   The native C compiler is recommended if it
is ANSI, otherwise use gcc.

A fully POSIX make is required.  If you are unable to "make" PBS with your
make, we suggest use of gmake from GNU.

Tcl/Tk version 8 or higher is required if you plan to build the GUI portion
of TORQUE or use a Tcl based scheduler.

-----------------------------------------------------
=== Build and Installation Directions  ====
-----------------------------------------------------

The directions to build and install are found in the PBS Administrators Guide.
A postscript and PDF copy are found in this directory.  Please read and
follow the directions CAREFULLY.

Also consider additional considerations in README.building_40.

Note that you may need to install libssl-dev in order for the source to make successfully.
Specifically, the build system is looking for libssl.so and libcrypto.so. For non-RPM 
setups you may need to make a symbolic link from the ssl and crypto libraries to the respective
.so names.

Installation instructions can also be found at http://www.clusterresources.com/products/torque/docs/

-------------------
=== Know Issues ===
-------------------

