Deprecated: please see new documentation site.


Profiling tools allow one to identify the areas of a program that act as performance bottlenecks, and more generally to determine the performance characteristics of the program being run.

IBM p5 575

prof

This is the traditional Unix profiling tool. It measures how much time is spent in each function call, and how many times these calls are made.
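
As a minimal illustration (a hypothetical program, not part of the 2dheat test code used below), consider a small C program with one obviously expensive function; prof would attribute most of the run time to accumulate and report how many times it was called:

/* hypothetical example: one hot function dominates the run time */
#include <stdio.h>

/* does the bulk of the floating-point work; prof attributes most of
   the run time here and reports the call count */
double accumulate(double x, long n)
{
    double sum = 0.0;
    long i;
    for (i = 0; i < n; i++)
        sum += x / (double)(i + 1);
    return sum;
}

int main(void)
{
    double total = 0.0;
    int call;
    for (call = 0; call < 1000; call++)   /* 1000 calls shows up under #Calls */
        total += accumulate((double)call, 100000L);
    printf("total = %f\n", total);
    return 0;
}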

Step 1. Compile the program, passing the -g and -p flags to the compiler; -g adds debugging information to the executable and -p adds the profiling instrumentation that prof needs.

Step 2. Run the executable as normal; some files named mon.*.out will have been created.

Step 3. Extract the useful information from each mon.*.out file using prof

% prof mon.0.out > prof.out

Step 4. Open up the prof.out file, and it should look something like:

 Name                 %Time     Seconds     Cumsecs  #Calls   msec/call
.floor                24.6        3.73        3.73 87716946      0.0000
.__itrunc             15.2        2.31        6.04
.get_start            13.3        2.02        8.06 22497362      0.0001
.MPI__Comm_size       11.7        1.77        9.83
.__mcount             11.1        1.68       11.51
.get_val_par           5.4        0.82       12.33 7230000      0.0001
.pthread_key_create    4.6        0.69       13.02
._mpi_lock             2.6        0.39       13.41
.get_end               2.6        0.39       13.80 6665102      0.0001
._mpi_unlock           2.3        0.35       14.15
.pow                   1.5        0.22       14.37 2410000      0.0001
.gauss_seidel.GL       0.9        0.14       14.51     482      0.29
.enforce_bc_par        0.7        0.11       14.62 1205000      0.0001
.pthread_setspecific   0.6        0.09       14.71
.pthread_setspecific   0.5        0.08       14.79
.PMPI_Comm_size        0.5        0.08       14.87
.pthread_getspecific   0.4        0.06       14.93
.get_convergence_sqd   0.4        0.06       14.99     482      0.12
.MPI_Comm_size.GL      0.4        0.06       15.05
.main                  0.3        0.05       15.10       1     50.0
.pthread_getspecific   0.3        0.04       15.14
.global_to_local       0.1        0.02       15.16 9167158      0.0000
.get_num_rows          0.0        0.00       15.16     967      0.00
.init_domain           0.0        0.00       15.16       2      0.0
.f                     0.0        0.00       15.16 1205000      0.0000
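
In this listing, %Time is the share of the measured run time spent in each routine, Seconds is the time spent in that routine, Cumsecs is the running cumulative total, and #Calls and msec/call give the call count and average time per call (shown only for routines where call counting was available).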

This output was generated with the test code available via Subversion:

%svn co https://svn.loni.org/repos/2dheat/trunk 2dheat # LONI login required

To use the above code to test prof (and any other tool discussed here):

  1. get code via subversion
  2. cd into 2dheat
  3. compile using ./compile-debug.sh
  4. run executable, ./bin/2dheat.x
  5. analyze mon.*.out file(s)

To run interactively on a 575:

% poe ./bin/2dheat.x -v -rmpool 1 -nodes 1 -procs 2 # for a 2 proc interactive job

hpmcount

This tool provides access to low-level hardware performance metrics on AIX.

To view the utility's options, use the -h flag:

% hpmcount -h

To use hpmcount on a serial executable, simply run the command:

% hpmcount -o hpm_output_file [hpm_options] program

CPU information for the single-process run will be contained in hpm_output_file_proc#.procid.

To use hpmcount under a poe parallel environment, run the command:

% poe hpmcount -o hpm_output_file [hpm_options] program -rmpool 1 -nodes 1 -procs [1-8]

CPU information for each of the (up to eight) processes will be contained in hpm_output_file_proc#.procid.

An example output file follows:

Execution time (wall clock time): 135.196644 seconds
########  Resource Usage Statistics  ########
Total amount of time in user mode            : 133.911128 seconds
Total amount of time in system mode          : 0.038906 seconds
Maximum resident set size                    : 9528 Kbytes
Average shared memory use in text segment    : 8062 Kbytes*sec
Average unshared memory use in data segment  : 914346 Kbytes*sec
Number of page faults without I/O activity   : 2443
Number of page faults with I/O activity      : 3
Number of times process was swapped out      : 0
Number of times file system performed INPUT  : 0
Number of times file system performed OUTPUT : 0
Number of IPC messages sent                  : 0
Number of IPC messages received              : 0
Number of signals delivered                  : 0
Number of voluntary context switches         : 1807
Number of involuntary context switches       : 153
#######  End of Resource Statistics  ########
Set: 1
Counting duration: 134.381408594 seconds
 PM_FPU_1FLOP (FPU executed one flop instruction)            :      6273200299
 PM_CYC (Processor cycles)                                   :    255526001030
 PM_MRK_FPU_FIN (Marked instruction FPU processing finished) :               0
 PM_FPU_FIN (FPU produced a result)                          :     14328458593
 PM_INST_CMPL (Instructions completed)                       :    202754087537
 PM_RUN_CYC (Run cycles)                                     :    255526001030
 Utilization rate                                 :          99.366 %
 MIPS                                             :        1499.698 MIPS
 Instructions per cycle                           :           0.793
 HW floating point instructions per Cycle         :           0.056
 HW floating point instructions / user time       :         106.659 M HWflops/s
 HW floating point rate (HW Flops / WCT)          :         105.982 M HWflops/s
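
The derived metrics at the bottom are simple ratios of the raw counters above; for example, instructions per cycle is PM_INST_CMPL / PM_CYC (202754087537 / 255526001030 ≈ 0.793), and the hardware floating point rate is PM_FPU_FIN divided by the wall clock time (14328458593 / 135.196644 s ≈ 105.98 M HWflops/s).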

Additional hpmcount Details

See this page.

Related Tools and Libraries

mpiP

  1. mpiP is a lightweight profiling library for MPI applications.
    • Software developed by LLNL; installed under /usr/local/packages/mpiP-3.1.1 (some documentation included).
    • Collects only statistical information about MPI routines
    • Captures and stores information local to each task (local memory and disk)
    • Uses communication only at the end of the application to merge results from all tasks into one output file.
  2. mpiP provides statistical information about a program's MPI calls:
    • Percent of a task's time attributed to MPI calls
    • Where each MPI call is made within the program (callsites)
    • Top 20 callsites
    • Callsite statistics (for all callsites)

Using mpiP

Step 1. Compile your program, linking it against the mpiP library:

%mpcc_r -L/usr/local/packages/mpiP-3.1.1/lib -lmpiP [additional libs...] myprog.c -o myexe.x
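
No mpiP-specific code is required in the source: mpiP intercepts MPI calls through the standard MPI profiling interface once the library is linked in. For reference, a minimal sketch of what a program such as myprog.c might contain (hypothetical):

/* hypothetical myprog.c: each MPI call below appears as a callsite
   in the mpiP report; no mpiP headers or calls are needed */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, value = 0, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        value = 42;

    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum across %d tasks = %d\n", size, sum);

    MPI_Finalize();
    return 0;
}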

Step 2. Run the code as you would normally:

%poe ./myexe.x -rmpool 1 -nodes 1 -procs 4

Step 3. Look at the output from mpiP:

%vi myexe.x.<numprocs>.<procid>.1.mpiP

Additional Information

TAU

Introduction

TAU stands for "Tuning and Analysis Utilities." It is jointly developed by the University of Oregon, Los Alamos National Laboratory, and ZAM (Germany).

Here's the introduction from its website:

"TAU (Tuning and Analysis Utilities) is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java, Python."

"TAU is capable of gathering performance information through instrumentation of functions, methods, basic blocks, and statements. All C++ language features are supported including templates and namespaces. The API also provides selection of profiling groups for organizing and controlling instrumentation. The instrumentation can be inserted in the source code using an automatic instrumentor tool based on the Program Database Toolkit (PDT), dynamically using DyninstAPI, at runtime in the Java virtual machine, or manually using the instrumentation API."

"TAUs profile visualization tool, paraprof, provides graphical displays of all the performance analysis results, in aggregate and single node/context/thread forms. The user can quickly identify sources of performance bottlenecks in the application using the graphical interface. In addition, TAU can generate event traces that can be displayed with the Vampir, Paraver or JumpShot trace visualization tools."

Code Profiling

You can profile your code either by using the API provided by TAU or by automatically instrumenting your code through PDT.

TAU API

Here is a sample C++ code written with the TAU API (other sample codes can be found under /usr/local/packages/tau-2.16/examples):

// This application calculates the value of pi and e using a parallel
// algorithm for integrating a function using Riemann sum. Uses MPI.
#include "mpi.h"
#include <stdio.h>
#include <math.h>
#include <Profile/Profiler.h>
#ifndef M_E
#define M_E         2.7182818284590452354       /* e */
#endif
#ifndef M_PI
#define M_PI        3.14159265358979323846      /* pi */
#endif
double f(double a)
{
    TAU_PROFILE("f()", "double (double)", TAU_USER);
    return (4.0 / (1.0 + a*a));
}
int main(int argc, char* argv[])
{
    int i, n, myid, numprocs, namelen;
    double mySum, h, sum, x;
    double startwtime, timePi, timeE, time1;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    TAU_PROFILE("main()", "int (int, char **)", TAU_DEFAULT);
    TAU_PROFILE_INIT(argc,argv);
    MPI_Init(&argc,&argv);
    TAU_PROFILE_TIMER(t1, "main-init()", "int (int, char **)",  TAU_USER);
    TAU_PROFILE_START(t1);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    MPI_Get_processor_name(processor_name,&namelen);
    TAU_PROFILE_SET_NODE(myid);
    <omitted computation>
}

To compile the code, you need to use a makefile like the one below:

TAUROOTDIR      = /usr/local/packages/tau-2.16/
include $(TAUROOTDIR)/rs6000/lib/Makefile.tau-mpi-pthread-pdt-openmp-profile-trace
CXX             = $(TAU_CXX)
CC              = $(TAU_CC)
CFLAGS          = $(TAU_INCLUDE) $(TAU_DEFS) $(TAU_MPI_INCLUDE)
LIBS            = $(TAU_MPI_LIBS) $(TAU_LIBS) -lm
LDFLAGS         = $(USER_OPT) $(TAU_LDFLAGS)
MAKEFILE        = Makefile
PRINT           = pr
RM              = /bin/rm -f
TARGET          = cpi
EXTRAOBJS       =
##############################################
all:            $(TARGET)
install:        $(TARGET)
$(TARGET):      $(TARGET).o
        $(CXX) $(LDFLAGS) $(TARGET).o -o $@ $(LIBS)
$(TARGET).o : $(TARGET).cpp
        $(CXX) $(CFLAGS) -c $(TARGET).cpp
clean:
       $(RM) $(TARGET).o $(TARGET)
##############################################

Note that you don't have to do anything to your environment: all the dependencies are handled in the makefile.

Automatic Instrumentation

If you don't want to rewrite or revise your code with the TAU API, TAU provides automatic instrumentation through PDT, a tool that parses your code and inserts the necessary TAU calls at the appropriate locations. Here's an example of how it works.

Suppose I have the following Fortran code that I'd like to profile:

program allgatherv
include 'mpif.h'
integer :: isend(4),irecv(10)
integer :: ircnt(0:3),idisp(0:3)
data ircnt/1,2,3,4/ idisp/0,1,3,6/
call mpi_init(ierror) 
call mpi_comm_size(mpi_comm_world,nprocs,ierror)
call mpi_comm_rank(mpi_comm_world,myid,ierror)
do i=1,myid+1
  isend(i)=myid+1
enddo
iscnt=myid+1
call mpi_allgatherv(isend,iscnt,mpi_integer,irecv,ircnt, &
    idisp,mpi_integer,mpi_comm_world,ierror)
write(*,*) 'Process',myid,':'
write(*,*) 'irecv=',irecv
write(*,*)
call mpi_finalize(ierr)
stop
end

After the automatic instrumentation, it becomes:

program allgatherv
include 'mpif.h'
integer :: isend(4),irecv(10)
integer :: ircnt(0:3),idisp(0:3)
data ircnt/1,2,3,4/ idisp/0,1,3,6/
     integer profiler(2) / 0, 0 /
     save profiler
     call TAU_PROFILE_INIT()
     call TAU_PROFILE_TIMER(profiler, '                                &
    &ALLGATHERV [{allgatherv.f90} {1,9}]')
       call TAU_PROFILE_START(profiler)
     call mpi_init(ierror)
call mpi_comm_size(mpi_comm_world,nprocs,ierror)
call mpi_comm_rank(mpi_comm_world,myid,ierror)
do i=1,myid+1
  isend(i)=myid+1
enddo
iscnt=myid+1
call mpi_allgatherv(isend,iscnt,mpi_integer,irecv,ircnt, &
    idisp,mpi_integer,mpi_comm_world,ierror)
write(*,*) 'Process',myid,':'
write(*,*) 'irecv=',irecv
write(*,*)
call mpi_finalize(ierr)
       call TAU_PROFILE_EXIT('exit')
stop
end

You don't have to perform this automatic instrumentation yourself; the makefile shown above can handle it with minor revisions:

TAUROOTDIR      = /usr/local/packages/tau-2.16
include $(TAUROOTDIR)/rs6000/lib/Makefile.tau-mpi-pthread-pdt-openmp-profile-trace
CXX             = $(TAU_CXX)
CC              = $(TAU_CC)
F90             = mpxlf90_r
PDTPARSE        = $(PDTDIR)/$(PDTARCHDIR)/bin/f95parse
TAUINSTR        = $(TAUROOTDIR)/$(CONFIG_ARCH)/bin/tau_instrumentor
FFLAGS          = $(TAU_INCLUDE)
LIBS            = $(TAU_LIBS) $(TAU_MPI_FLIBS) $(TAU_FORTRANLIBS)
LDFLAGS         = $(USER_OPT) $(TAU_LDFLAGS) $(TAU_CXXLIBS)
MAKEFILE        = Makefile
PRINT           = pr
RM              = /bin/rm -f
EXTRAOBJS       =
##############################################
# Modified Rules
##############################################
all:    $(TARGET) $(PDTPARSE) $(TAUINSTR)
$(TARGET): $(TARGET).o
       $(F90) $(LDFLAGS) $(TARGET).o -o $@ $(LIBS)
# Use the instrumented source code to generate the object code
$(TARGET).o : $(TARGET).inst.f90
       $(F90) -c $(FFLAGS) $(TARGET).inst.f90 -o $(TARGET).o
# Generate the instrumented source from the original source and the pdb file
$(TARGET).inst.f90 : $(TARGET).pdb $(TARGET).f90 $(TAUINSTR)
        $(TAUINSTR) $(TARGET).pdb $(TARGET).f90 -o $(TARGET).inst.f90
# Parse the source file to generate the pdb file
$(TARGET).pdb : $(PDTPARSE) $(TARGET).f90
       $(PDTPARSE) $(TARGET).f90 $(FFLAGS)

Using paraprof

After executing the instrumented code, you should find the TAU profile files, named profile.x.0.0 by default, where x is the task id. Using paraprof, a tool provided by TAU, you can read and analyze these files.

To use paraprof, you need to add it to your PATH:

export PATH=$PATH:/usr/local/packages/rs6000/bin (ksh or bash)
setenv PATH ${PATH}:/usr/local/packages/rs6000/bin (csh or tcsh)

Then execute it from the command line:

lyan1@l1f1n01$ paraprof

Availability

If TAU is available on a particular platform, it will be listed in softenv.

References

Detailed information about TAU can be found here: [1]

Detailed information about PDT can be found here: [2]

IPM

PAPI

MPI Tracing

x86 Linux Clusters

mpiP

  1. mpiP is a lightweight profiling library for MPI applications.
    • Software developed by LLNL; installed under /usr/local/packages/mpiP-3.1.1-mvapich-0.98-gcc (some documentation included).
    • Collects only statistical information about MPI routines
    • Captures and stores information local to each task (local memory and disk)
    • Uses communication only at the end of the application to merge results from all tasks into one output file.
  2. mpiP provides statistical information about a program's MPI calls:
    • Percent of a task's time attributed to MPI calls
    • Where each MPI call is made within the program (callsites)
    • Top 20 callsites
    • Callsite statistics (for all callsites)

Using mpiP

Step 1. Compile your program, linking it against the mpiP library:

%mpicc -L/usr/local/packages/mpiP-3.1.1-mvapich-0.98-gcc/lib -lmpiP \
        [additional libs...] myprog.c -o myexe.x

Step 2. Run the code as you would normally:

%qsub -I -V -lnodes=1:ppn=4 # start interactive parallel session
%cd /to/your/dir 
%mpirun -np 4 -machinefile $PBS_NODEFILE ./myexe.x

Step 3. Look at the output from mpiP:

%vi myexe.x.<numprocs>.<procid>.1.mpiP

Additional Information


Users may direct questions to sys-help@loni.org.
