2.6.0 release notes

David Böhme edited this page Jun 25, 2021 · 2 revisions

What's new in v2.6.0

A list of new features in Caliper v2.6.0.

New ConfigManager recipes for CUDA profiling

Profile CUDA GPU activities such as memory copies and kernels with the cuda-activity-profile and cuda-activity-report configs. This example output for cuda-activity-report shows GPU time spent in various CUDA kernels:

$ CALI_CONFIG=cuda-activity-report,show_kernels lrun -n 4 ./tea_leaf
Path                        Kernel                                           Avg Host Time Max Host Time Avg GPU Time Max GPU Time GPU %
timestep_loop
 |-                                                                              17.068956     17.069917     0.239392     0.240725 1.402501
 |-                         device_unpack_top_buffe~~le*, double*, int, int)                                 0.091051     0.092734
 |-                         device_tea_leaf_ppcg_so~~ const*, double const*)                                 5.409844     5.419096
 |-                         device_tea_leaf_ppcg_so~~t*, double const*, int)                                 5.316101     5.320777
 |-                         device_pack_right_buffe~~le*, double*, int, int)                                 0.112455     0.113198
 |-                         device_pack_top_buffer(~~le*, double*, int, int)                                 0.092634     0.092820
(..)
 |-                         device_pack_bottom_buff~~le*, double*, int, int)                                 0.098929     0.099095
  summary
   |-                                                                             0.000881      0.000964     0.000010     0.000011 1.179024
   |-                       device_field_summary_ke~~ble*, double*, double*)                                 0.000325     0.000326
   |-                       void reduction<double, ~~N_TYPE)0>(int, double*)                                 0.000083     0.000084
    cudaMemcpy                                                                    0.000437      0.000457     0.000010     0.000011 2.376874
    cudaLaunchKernel
     |-                                                                           0.000324      0.000392
     |-                     device_field_summary_ke~~ble*, double*, double*)                                 0.000325     0.000326
     |-                     void reduction<double, ~~N_TYPE)0>(int, double*)                                 0.000083     0.000084

While cuda-activity-report prints human-readable data, the cuda-activity-profile config produces a JSON or .cali file for processing with Hatchet or cali-query. Learn more about CUDA profiling in the CUDA profiling how-to.

OpenMP profiling with the OpenMP tools interface

Caliper v2.6.0 introduces basic support for profiling CPU-side OpenMP constructs, such as parallel regions and workshare constructs, with the OpenMP tools interface (OMPT). Note that only OpenMP 5.0 compliant compilers like clang v9+ support OMPT. When OMPT support is available, Caliper provides the openmp-report config. Here is example output showing the time spent in OpenMP workshare regions and barriers:

$ CALI_CONFIG=openmp-report ./caliper-openmp-example
Path   #Threads Time (thread) Time (total) Work %    Barrier % Time (work) Time (barrier)
main                 0.005122     0.027660 85.969388 14.030612
  work        4      0.005110     0.027572 85.969388 14.030612    0.011121       0.001815

Learn more about OpenMP profiling in the OpenMP profiling how-to.

Write reports into user-defined C++ streams

There is a new API to write ConfigManager reports into user-defined C++ streams for MPI programs:

auto res = cali::make_collective_output_channel("runtime-report(profile.mpi)");
auto channel = res.first;

channel->start();
// ... application code to profile ...
channel->collective_flush(std::cout, MPI_COMM_WORLD); // any std::ostream works here

Find a more detailed example program here.
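Because the flush call takes a C++ output stream, the report can also be captured in memory, for example to embed it in an application's own log. The sketch below shows only the stream handling and compiles without Caliper or MPI. The helper capture_report is hypothetical, and the Caliper call appears only in a comment, mirroring the snippet above.

```cpp
#include <functional>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of Caliper): runs `write_report` against an
// in-memory stream and returns whatever it wrote.
std::string capture_report(const std::function<void(std::ostream&)>& write_report)
{
    std::ostringstream os;
    write_report(os);
    return os.str();
}

// With Caliper and MPI available, the callback would wrap the flush call:
//
//   std::string report = capture_report([&](std::ostream& os) {
//       channel->collective_flush(os, MPI_COMM_WORLD);
//   });
```

The same pattern applies to std::ofstream or any other std::ostream subclass.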

Show only main thread annotations

The new main_thread_only ConfigManager option shows profiling data only from the program's main thread. Consider this program:

int main()
{
    CALI_MARK_BEGIN("main");

#pragma omp parallel for
    for (int i = 0; i < 42; ++i) {
        CALI_MARK_BEGIN("parallel");
        /* ... */
        CALI_MARK_END("parallel");
    }

    CALI_MARK_END("main");
    return EXIT_SUCCESS;
}

Caliper measures the time in the "parallel" region on each thread. Meanwhile, the "main" region is only visible on the main thread. Therefore, you'll find an "orphaned" entry with the time inside the "parallel" region from the OpenMP child threads in the report output:

$ CALI_CONFIG=runtime-report ./caliper-threads
Path       Min time/rank Max time/rank Avg time/rank Time %    
main            0.002054      0.002054      0.002054  8.955744 
  parallel      0.003233      0.003233      0.003233 14.096359 
parallel        0.009773      0.009773      0.009773 42.611729 

With the main_thread_only option, Caliper only reports data from the main thread:

$ CALI_CONFIG=runtime-report,main_thread_only ./caliper-threads
Path       Min time/rank Max time/rank Avg time/rank Time %    
main            0.001465      0.001465      0.001465 11.204589 
  parallel      0.003339      0.003339      0.003339 25.537285 

Region count metric

The region.count metric counts the number of times a Caliper region was called:

$ ./examples/apps/cxx-example -P runtime-report,aggregate_across_ranks=false,region.count
Path       Time (E) Time (I) Time % (E) Time % (I) Calls    
main       0.000157 0.000993   7.822621  49.476831 1.000000 
  mainloop 0.000109 0.000813   5.430992  40.508221 5.000000 
    foo    0.000704 0.000704  35.077230  35.077230 4.000000 
  init     0.000023 0.000023   1.145989   1.145989 1.000000 

Note that counts for Caliper regions that are hidden in the report output are added to the surrounding region. The example above has hidden loop iteration annotations, whose counts are added to the count of the "mainloop" region.
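The roll-up of hidden counts can be illustrated with a small tree model. This is a sketch of the arithmetic only, not Caliper's implementation: the Region struct and reported_count function are hypothetical. The sample tree assumes that "mainloop" was entered once and contains a hidden iteration region entered four times, with "foo" called once per iteration.

```cpp
#include <string>
#include <vector>

// Hypothetical region record for illustration; not a Caliper data structure.
struct Region {
    std::string         name;
    long                count;    // times this region itself was entered
    bool                hidden;   // true if the report does not show this region
    std::vector<Region> children;
};

// Count shown for a region: its own count plus the counts of hidden
// descendants reachable without crossing another visible region.
long reported_count(const Region& r)
{
    long total = r.count;
    for (const Region& child : r.children)
        if (child.hidden)
            total += reported_count(child);
    return total;
}

// Assumed structure behind the sample output (see lead-in).
Region sample_mainloop()
{
    Region foo{"foo", 4, false, {}};
    Region iteration{"iteration", 4, true, {foo}};
    return Region{"mainloop", 1, false, {iteration}};
}
```

Under these assumptions, reported_count yields 5 for "mainloop" (1 + 4 hidden iterations) and 4 for "foo", which agrees with the Calls column above.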

rocTX support

The roctx service forwards Caliper regions to AMD rocprofiler as rocTX annotations:

$ CALI_SERVICES_ENABLE=roctx rocprof (...) ./app

Load custom JSON configs in ConfigManager

You can load custom ConfigManager configuration recipes or options from JSON files. This example defines a new ConfigManager option "tot_ins" that adds the PAPI_TOT_INS PAPI counter:

{
  "options": [
    { "name"        : "tot_ins",
      "description" : "Instructions",
      "category"    : "metric",
      "services"    : [ "papi" ],
      "config"      : { "CALI_PAPI_COUNTERS": "PAPI_TOT_INS" },
      "query"       : [
        { "level": "local", "select": [ { "expr": "sum(sum#papi.PAPI_TOT_INS)", "as": "Instr." } ] },
        { "level": "cross", "select": [
            { "expr": "avg(sum#sum#papi.PAPI_TOT_INS)", "as": "Instr. (avg)"   },
            { "expr": "sum(sum#sum#papi.PAPI_TOT_INS)", "as": "Instr. (total)" }
          ] 
        }
      ]
    }
  ]
}

We can load this file using the load command in a ConfigManager config string and use the option in compatible configs, like runtime-report:

$ ./examples/apps/cxx-example -P "load(tot_ins.json),runtime-report,tot_ins"
Path       Min time/rank Max time/rank Avg time/rank Time %    Instr. (avg)  Instr. (total) 
main            0.000115      0.000115      0.000115  6.068602 207335.000000  207335.000000 
  mainloop      0.000102      0.000102      0.000102  5.382586 222597.000000  222597.000000 
    foo         0.000664      0.000664      0.000664 35.039578  76007.000000   76007.000000 
  init          0.000020      0.000020      0.000020  1.055409  32511.000000   32511.000000 
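The config strings in these notes follow one pattern: optional load(...) commands, then the config name, then options, all separated by commas. A program that assembles such strings at run time could use a small helper like the one sketched here. The function make_config_spec is hypothetical and not part of Caliper; it only builds the string.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Caliper): joins the pieces of a
// ConfigManager config string with commas, load commands first.
std::string make_config_spec(const std::vector<std::string>& json_files,
                             const std::string&              recipe,
                             const std::vector<std::string>& options)
{
    std::string spec;
    for (const std::string& file : json_files)
        spec += "load(" + file + "),";
    spec += recipe;
    for (const std::string& opt : options)
        spec += "," + opt;
    return spec;
}
```

For example, make_config_spec({"tot_ins.json"}, "runtime-report", {"tot_ins"}) returns the string passed to -P in the command above.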

You can also write entirely new config recipes. This is an advanced feature; reach out via the GitHub discussion page if you want to learn more.

Improved detection for CUDA components in build system

The Caliper build system should now find CUDA components like CUPTI and NVTX automatically, or with only CUDA_TOOLKIT_ROOT_DIR set.