

Deprecated: please see new documentation site.



Overview

TotalView is a GUI-based debugging tool developed by TotalView Technologies (previously known as Etnus). TotalView gives a programmer complete control over process and thread execution and can debug one or many processes and/or threads. It provides analytical displays of the state of your running program for efficient debugging of memory errors and leaks and for diagnosing subtle problems such as deadlocks and race conditions. It works with multiple languages, supports many parallel programming models and is supported on most HPC platforms.

Languages supported by TotalView

  • C/C++
  • Fortran
  • Mixed C/C++ and Fortran
  • Assembler

Parallel programming models supported by TotalView

  • MPI
  • PVM
  • OpenMP
  • Threads
  • SHMEM

Availability

Currently, TotalView is available only on tezpur.hpc.lsu.edu (for LSU users only), qb.loni.org and eric.loni.org. The current version is 8.8.0-0, for 64-bit x86 Linux.

Getting Started

Before you start

Setting up the environment

Make sure that your ~/.soft contains the proper software key for TotalView:

  • On Queen Bee: +totalview-8.8.0 or +totalview-8.3.0.1
  • On Tezpur: +totalview-8.3
  • On Eric: +totalview-8.3.0.1

Once this is added, type in the command:

% resoft

Test that TotalView is now in PATH:

% which totalview
/usr/local/packages/TotalView/toolworks/totalview.8.3.0-1/linux-x86-64/bin/totalview
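The two checks above can be combined into a small helper. This is only a sketch: the function names has_totalview_key and check_totalview_env are hypothetical, and it assumes that a software key sits at the start of its own line in ~/.soft.

```shell
# Return 0 if the given .soft file contains a line starting with
# "+totalview" (assumed key format), non-zero otherwise.
has_totalview_key() {
    grep -Eq '^[[:space:]]*\+totalview' "$1" 2>/dev/null
}

# Report on both the software key and the PATH lookup.
# With no argument, ~/.soft is checked.
check_totalview_env() {
    soft_file="${1:-$HOME/.soft}"
    if has_totalview_key "$soft_file"; then
        echo "key: found in $soft_file"
    else
        echo "key: missing from $soft_file"
    fi
    if command -v totalview >/dev/null 2>&1; then
        echo "PATH: $(command -v totalview)"
    else
        echo "PATH: totalview not found (did you run resoft?)"
    fi
}
```

After pasting the functions into your shell, run check_totalview_env with no arguments. If the key is found but totalview is not on PATH, run resoft again or log out and back in.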

Getting X window to work

The TotalView GUI requires a working X Window connection to the cluster.

  • For *nix users
    • Connect to the cluster with -X -Y turned on: ssh -X -Y username@hostname
  • For Mac OS X users
    • Use the X11 utility to connect to the cluster
  • For Windows users
    • Install Xming and PuTTY
    • Enable X11 forwarding in PuTTY

Detailed information can be found here.
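A quick way to confirm that forwarding is active after logging in is to check the DISPLAY environment variable, which ssh sets when X11 forwarding succeeds. The sketch below wraps that check in a function; the name x11_forwarding_active is hypothetical.

```shell
# Return 0 if DISPLAY is set and non-empty in the current shell,
# which is what a successful "ssh -X -Y" login normally leaves behind.
x11_forwarding_active() {
    [ -n "${DISPLAY:-}" ]
}
```

For example, running x11_forwarding_active && echo "DISPLAY=$DISPLAY" prints the display name when forwarding works and nothing otherwise. A non-empty DISPLAY does not guarantee that the X server on your desktop is reachable, so treat this as a first check only.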

Compiling your program

  • For serial programs
    • Compile with NO optimization and the debugging flag turned on:
% icc -O0 -g myfile.c -o myfile.x
  • For parallel programs
    • Use the proper MPI implementation
      • Queen Bee and Eric: +mvapich-1.1-intel-11.1
      • Tezpur: +mvapich-1.0-intel10.1-tvdbg
    • Compile with NO optimization and the debugging flag turned on:
% mpicc -g -O0 -o myexec myprogram.c

If you are compiling Fortran or C++ programs, replace "icc" and "mpicc" with the corresponding compilers, which can be found here. The "-O0 -g" options remain the same.

Getting an interactive debugging session

Running TotalView on the head node is not a good idea, since it can consume significant resources such as CPU time and memory. Instead, set up an interactive session on compute nodes and run TotalView there.

Here is an example of how to obtain a 4-hour session with 2 nodes (16 cores) on Queen Bee (detailed information on how to set up an interactive session can be found here):

  • First, submit an interactive job (you might end up waiting in the queue for a while):
qsub -I -X -V -l walltime=4:00:00,nodes=2:ppn=8 -A my_loni_allocation -q checkpt
  • After your interactive job starts to run, you will see something like this:
...
PBS has allocated the following nodes:
qb235
qb234
...
  • Now you are ready to run TotalView!
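Once the session has started, it can be useful to confirm which hosts and how many slots were actually allocated. The sketch below summarizes a PBS node file. It assumes the batch system writes one host name per line to the file named by $PBS_NODEFILE, typically repeating a host once per requested core; the function name nodefile_summary is hypothetical.

```shell
# Print "<host> <slots>" for each distinct host in a PBS node file,
# sorted by host name. Uses $PBS_NODEFILE when no file is given.
nodefile_summary() {
    sort "${1:-${PBS_NODEFILE:-}}" | uniq -c | awk '{ print $2, $1 }'
}
```

Inside the interactive job, running nodefile_summary with no arguments lists each allocated host followed by the number of times it appears in the node file.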

Starting TotalView

Serial programs

To start TotalView with a serial program, type the following on the command line:

totalview <options> <serial_executable> -a <arguments for your program>

Parallel programs

MVAPICH

If you are debugging a parallel program with TotalView, mpirun_rsh must be used to start the program:

mpirun_rsh -tv -np <num_procs> <host list> <executable>

For example, to debug a parallel program 'buggy_mpi' with 2 processes (assuming that we are still on qb235):

mpirun_rsh -tv -np 2 qb235 qb235 ./buggy_mpi

Alternatively, you can use a hostfile rather than typing everything on the command line:

mpirun_rsh -tv -np <num_procs> -hostfile <path_to_hostfile> <executable>

For the above example, we can run:

mpirun_rsh -tv -np 2 -hostfile $PBS_NODEFILE ./buggy_mpi

where $PBS_NODEFILE is an environment variable set automatically by the batch system when the job starts.
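If you want one process per entry in the node file, the process count can be derived from the file instead of being typed by hand. The following sketch only prints the resulting command so that it can be inspected before it is run; the function name tv_launch_cmd is hypothetical, and it assumes the node file holds one host name per line.

```shell
# Print (do not run) an mpirun_rsh command that starts one process
# per non-empty line of the node file under TotalView.
tv_launch_cmd() {
    nodefile="$1"
    exe="$2"
    nprocs=$(grep -c . "$nodefile")   # number of non-empty lines
    echo "mpirun_rsh -tv -np $nprocs -hostfile $nodefile $exe"
}
```

For example, tv_launch_cmd "$PBS_NODEFILE" ./buggy_mpi prints the full command line. Copy and run it once the process count looks right, or lower -np by hand for a smaller debugging run.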

OpenMPI

You can start your program like this:

mpirun -tv -np <num_procs> --host node1,node2... <executable>

Or:

mpirun -tv -np <num_procs> --hostfile <path_to_hostfile> <executable>

TotalView GUI

If you start TotalView successfully, there should be two windows like these:

image:TotalView_rootwindow.png

image:Totalview_processwindow.png

Besides the root and process windows, TotalView has several other windows. One of the most frequently used is the variable window:

image:Totalview_pointerarray.png

Basic debugging functions

A programmer usually debugs a program by

  • Setting up action points (breakpoints etc.)
  • Controlling execution (next, stop etc.)
  • Examining the value of variables
  • ...

Now, we will see how to do these things with TotalView.

Adding action points

image:Totalview_sourcelookup.png

image:Totalview_breakpoint.png

image:Totalview_breakpoint_property.png

image:Totalview_evaluate.png

image:Totalview_evaluate_procwin.png

image:Totalview_watchpoint_dive.png

Conditional watch point

Unconditional watch point

Controlling execution

Viewing/Editing data

Debugging Parallel Programs

Coming soon!

Memory Debugging

Coming soon!

Command Line Interface

If you have ever used a command line debugger such as gdb, the debugging experience with the TotalView command line interface (CLI) will be quite similar. The TotalView CLI is integrated with Tcl, a scripting language. For that reason, all Tcl commands are usable within the TotalView CLI, and it helps if you already know Tcl. That said, you can still use the TotalView CLI without any prior knowledge of Tcl.

Starting a debugging session

To start a debugging session with CLI, use the totalviewcli command:

% totalviewcli
Linux x86_64 TotalView 8.8.0-0
Copyright 2007-2010 by TotalView Technologies, LLC. ALL RIGHTS RESERVED.
Copyright 1999-2007 by Etnus, LLC.
Copyright 1999 by Etnus, Inc.
Copyright 1996-1998 by Dolphin Interconnect Solutions, Inc.
Copyright 1989-1996 by BBN Inc.
TotalView Technologies ReplayEngine
Copyright 2010 TotalView Technologies
ReplayEngine uses the UndoDB Reverse Execution Engine
Copyright 2005-2010 Undo Limited
d1.<>

To terminate a session, just use the quit command.

Most TotalView CLI commands have the form dxxxx, and aliases are provided for the most frequently used commands. To see the aliases currently available, simply type alias:

d1.<> alias
BAW { dfocus gW dbarrier -stop_when_done group }
CO { dfocus g dcont }
G { dfocus g dgo }
H { dfocus g dhalt }
HP { dfocus g dhold -process }
...

Users can define their own aliases using the alias command.

To load a program for debugging, use the lo command, which is an alias for dload:

d1.<> lo ./cell_seq_f

To start an MPI program, one needs to specify a few more options:

d1.<> lo -mpi MVAPICH1 -np 4 -starter_args "-hostfile /var/spool/torque/aux/288346.qb2" ./cell_mpi_f

Passing command line arguments to your program can be tricky, because the dload command does not appear to accept arguments for the program being debugged. One option is to load the program when the TotalView CLI starts:

% totalviewcli ./cell_seq_f -a 400 100

or use the dset command after loading the program:

d1.<> dset ARGS_DEFAULT {400 100}
400 100

The dlist command (alias l) displays the source code:

d1.<> l -n 5 20
 20     call getarg(1,paramin)
 21     read(paramin,*) Ndim
 22     call getarg(2,paramin)
 23     read(paramin,*) Niter
 24

In the above example, -n 5 20 indicates that 5 lines of the source code should be shown starting from line 20. The default (without any argument) is to display 20 lines starting from the current location.

Basic debugging functions

Like with the GUI, basic debugging operations include controlling execution, setting breakpoints and viewing/editing data.

Controlling execution

The table below shows the commands that control execution:

Command   Alias   Description
dgo       g       resume execution
dkill     k       terminate execution
dhalt     h       suspend execution
dhold             hold a group, process or thread
dunhold           release a group, process or thread
drun      r       start or restart processes
dnext     n       step source lines (over subroutines)
dstep     s       step source lines (into subroutines)
dout      ou      run to the end of the current subroutine
duntil    un      run until reaching a target place

Some of the commands shown above accept arguments. For example, one can step through 5 source lines with dstep (alias s):

d1.<> s 5
 20 >   call getarg(1,paramin)
 21 >   read(paramin,*) Ndim
 22 >   call getarg(2,paramin)
 23 >   read(paramin,*) Niter
 25 >   write(*,*)

The command dwhere (alias w) displays the current location and the call stack:

d1.<> w
>  0 cell_seq         PC=0x00402a84, FP=0x7fbfffe540 [/home/lyan1/traininglab/debugging/cell_seq.f90#25]
   1 main             PC=0x004028dd, FP=0x7fbfffe560 [/home/lyan1/traininglab/debugging/cell_seq_f]
   2 __libc_start_main PC=0x3270c1c3f7, FP=0x7fbfffe610 [/lib64/tls/libc.so.6]
   3 _start           PC=0x00402825, FP=0x7fbfffe620 [/home/lyan1/traininglab/debugging/cell_seq_f]

Setting action points

The commands related to action points are:

Command    Alias   Description
dbreak     b       set breakpoints and evaluation points
dbarrier   ba      set barrier points
dwatch     wa      set watch points
denable    en      enable action points
ddisable   di      disable action points
ddelete    de      delete action points
dactions   ac      display, save and load action points

Here is an example where a breakpoint is set, disabled and deleted:

d1.<> b 34
1
d1.<> ac
1 shared action point for group 3:
   1 [cell_seq.f90#34] Enabled
d1.<> di 1
d1.<> ac
1 shared action point for group 3:
   1 [cell_seq.f90#34] Disabled
d1.<> de 1
d1.<> ac
No matching breakpoints were found

The return value "1" for the "b 34" command is the ID of the action point.

When an action point is set, it is marked by "@" when the source code is displayed:

d1.<> b 34
2
d1.<> l 20
 20     call getarg(1,paramin)
 21     read(paramin,*) Ndim
 22     call getarg(2,paramin)
 23     read(paramin,*) Niter
 24
 25 >   write(*,*)
 26     write(*,'(A,T24,I8)') "Size of the array: ",Ndim
 27     write(*,'(A,T24,I8)') "Number of iterations: ",Niter
 28
 29     allocate(buffer(Ndim))
 30     allocate(nextbuffer(Ndim))
 31     allocate(tmp(Ndim))
 32
 33     ! Initialize the array.
 34@    do x=1,Ndim

Unlike the GUI, the CLI source listing does not show what type of action point it is. The ">" indicates the current location.

To set an evaluation point where a code fragment is executed, use the -e option with the dbreak command:

d1.<> b 35 -e {x=1;goto $52}
4

The above command can be translated to "when the execution hits line 35, set x to 1 and skip to line 52".

Viewing/editing data

To view the value of a scalar variable, use the dprint (alias p) command:

d1.<> p x
x = 1 (0x00000001)

When viewing an array, slicing is usually helpful:

d1.<> p buffer(1:10:2)
buffer(1:10:2) = {
  (1) = 0 (0x00000000)
  (2) = 2 (0x00000002)
  (3) = 4 (0x00000004)
  (4) = 6 (0x00000006)
  (5) = 8 (0x00000008)
}

The dprint command can also be used to evaluate an expression:

d1.<> p "x+1"
x+1 = 2 (0x00000002)

The value of a scalar variable or an array element can be edited with the dassign command (alias as):

d1.<> p x
 x = 1 (0x00000001)
d1.<> as x 101
d1.<> p x
 x = 101 (0x00000065)

Debugging parallel programs

When debugging a parallel program, it is important to be aware of the current focus and the scope of control commands, both of which are displayed in the prompt when a CLI debugging session starts:

d1.<>

Here "d" indicates that the command scope is the default one and "1." means the current focus is on the first user thread in process 1. To change the focus and/or the command scope, use the dfocus (alias f) command:

d1.<> f t3.1
t3.1
t3.1> f g3.1
g3.1
g3.1>

In the example above, both commands set the focus on the first thread of process 3. The difference is that f t3.1 sets the command scope to thread 3.1, while with f g3.1 the command scope is still the entire group.

It is also possible to execute a control command in a specific process/thread group by combining the dfocus command and a control command:

t3.1> w
>  0 __select_nocancel PC=0x351d7c017a, FP=0x7fbfffe7e0 [/lib64/tls/libc.so.6]
...
   7 _start           PC=0x00403fe5, FP=0x7fbfffec80 [/home/lyan1/traininglab/debugging/cell_mpi_f]
t3.1> f p2 w
Thread 2.1:
>  0 __select_nocancel PC=0x351d7c0176, FP=0x7fbfffe7e0 [/lib64/tls/libc.so.6]
...
   7 _start           PC=0x00403fe5, FP=0x7fbfffec80 [/home/lyan1/traininglab/debugging/cell_mpi_f]
Thread 2.2:
>  0 __read_nocancel  PC=0x351e00b19f, FP=0x402000d0 [/lib64/tls/libpthread.so.0]
...
   3 start_thread     PC=0x351e006130, FP=0x40200270 [/lib64/tls/libpthread.so.0]

In the example above, the first command w shows the call stack and current location for thread 3.1, since the current focus is "t3.1". The second command f p2 w shows the information for both threads within process 2 while the current focus remains unchanged.

Additional Sources of Help


Users may direct questions to sys-help@loni.org.
