Read a tab-delimited SGE accounting file (without parsing it)

Coerce to an 'sge_accounting' Object

Locate the SGE Accounting File on the Current System

Read an SGE Accounting File

Parse SGE Accounting 'category' Field

Table of SGE failed codes with descriptions

read_raw_sge_accounting(
  file,
  offset = 0,
  n_max = Inf,
  skip = if (is.character(file) && offset == 0) 4L else 0L,
  ...
)

write_raw_sge_accounting(x, file, header = attr(x, "header"), ...)

as_sge_accounting(x, ...)

# S3 method for class 'raw_sge_accounting'
as_sge_accounting(x, ...)

# S3 method for class 'sge_accounting'
print(x, format = c("pretty", "raw"), ...)

sge_accounting_file(
  filename = "accounting",
  path = do.call(file.path, args = as.list(c(Sys.getenv(c("SGE_ROOT", "SGE_CELL")),
    "common")))
)

read_sge_accounting(file = sge_accounting_file(), ...)

parse_category(x, ...)

# S3 method for class 'sge_accounting'
parse_category(x, properties = c("h_rt", "s_rt", "mem_free"), ...)

sge_failed_codes()

Arguments

file

(character) The SGE accounting file to read.

offset

The file offset position (in bytes) from where to start reading.

n_max

(numeric) The maximum number of rows to read.

skip

(integer) Number of lines to skip before parsing file content.

x

An sge_accounting object.

header

(character vector) Zero or more header lines to be written at the top of the file.

format

Either "pretty" or "raw".

filename

(character string) The name of the accounting file.

path

(character string) The path to the accounting file.

properties

(character vector) The properties to extract.

...

(optional) Not used.

Value

(character string) The pathname to the SGE accounting file. If not found, an error is thrown.

A tibble data frame with columns:

  • qname (character) - name of the cluster queue in which the job has run

  • hostname (character) - name of the execution host

  • group (character) - the effective group id of the job owner when executing the job

  • owner (character) - owner of the Grid Engine job

  • job_name (character) - job name

  • job_number (integer) - job identifier

  • account (character) - an account string as specified by the qsub or qalter

  • priority (integer) - priority value assigned to the job

  • submission_time (dttm) - submission time

  • start_time (dttm) - start time

  • end_time (dttm) - end time

  • failed (integer) - indicates the problem which occurred in case a job failed at the system level (as opposed to the job script or binary having a non-zero exit status), for example when a job could not be started on the execution host because the owner of the job did not have a valid account on that machine. If Sun Grid Engine tries to start a job multiple times, this may lead to multiple entries in the accounting file corresponding to the same job ID

  • exit_status (integer) - exit status of the job script (or Grid Engine-specific status in case of certain error conditions). The exit status is determined by following the normal shell conventions. If the command terminates normally the value of the command is its exit status. However, in the case that the command exits abnormally, a value of 0200 (octal), 128 (decimal) is added to the value of the command to make up the exit status. For example: If a job dies through signal 9 (SIGKILL) - probably issued by Grid Engine through qdel, or because the job exceeded time or memory hard limits - then the exit status is 128 + 9 = 137.

  • ru_wallclock (drtn) - Difference between 'end_time' and 'start_time' (time interval), except that if the job fails, it is zero.

  • ru_utime (drtn) - user CPU time (in seconds) used, i.e. total amount of time spent executing in user mode

  • ru_stime (drtn) - system CPU time (in seconds) used, i.e. total amount of time spent executing in kernel mode

  • ru_maxrss (character) - maximum resident set size (in kB)

  • ru_ixrss (character) - integral shared memory size (in kB) [UNUSED]

  • ru_ismrss (character) - ???

  • ru_idrss (character) - integral unshared data size (in kB) [UNUSED]

  • ru_isrss (character) - integral unshared stack size (in kB) [UNUSED]

  • ru_minflt (numeric) - page reclaims (soft page faults)

  • ru_majflt (numeric) - page faults (hard page faults)

  • ru_nswap (numeric) - number of swaps [UNUSED]

  • ru_inblock (numeric) - number of block input operations

  • ru_oublock (numeric) - number of block output operations

  • ru_msgsnd (numeric) - number of IPC messages sent [UNUSED]

  • ru_msgrcv (numeric) - number of IPC messages received [UNUSED]

  • ru_nsignals (numeric) - number of signals received

  • ru_nvcsw (numeric) - number of voluntary context switches (number of times a context switch resulted due to a process voluntarily giving up the processor before its time slice was completed, usually to await availability of a resource)

  • ru_nivcsw (numeric) - number of involuntary context switches (number of times a context switch resulted due to a higher priority process becoming runnable or because the current process exceeded its time slice)

  • project (character) -

  • department (character) -

  • granted_pe (character) - the parallel environment which was selected for the job

  • slots (integer) - the number of slots which were dispatched to the job by the scheduler

  • task_number (integer) -

  • cpu (drtn) - The CPU time usage (in seconds)

  • mem (character) - the integral memory usage (in GB seconds)

  • io (character) - the amount of data transferred in input/output operations (in GB) if available, otherwise 0

  • category (character) -

  • iow (drtn) - the input/output wait time (in seconds) if available, otherwise 0

  • pe_taskid (character) - if this identifier is not equal to NONE, the task was part of a parallel job, and was passed to Grid Engine via the qrsh -inherit interface

  • maxvmem (numeric) - the maximum vmem size (in bytes)

  • arid (numeric) - advance reservation identifier

  • ar_sub_time (dttm) - advance reservation submission time, if the job uses the resources of an advance reservation, otherwise 0

A tibble data frame with columns corresponding to the requested properties.
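To illustrate what extracting properties from a 'category' string involves, here is a minimal base-R sketch. It assumes the field holds qsub-style options such as "-l h_rt=3600,mem_free=2G". That layout, the made-up strings, and the helper extract_property() are assumptions for illustration, not the implementation of parse_category().

```r
## Made-up 'category' strings; the "-l name=value,..." layout is an assumption
category <- c(
  "-u owner09 -l h_rt=3600,mem_free=2G",
  "-u owner09 -l mem_free=1G"
)

## Hypothetical helper: pull the value of one 'name=value' property,
## or NA if the property is absent from the string
extract_property <- function(x, name) {
  pattern <- sprintf(".*[ ,]%s=([^, ]+).*", name)
  value <- sub(pattern, "\\1", x)
  value[!grepl(pattern, x)] <- NA_character_
  value
}

## One column per requested property, one row per category string
props <- c("h_rt", "s_rt", "mem_free")
names(props) <- props
res <- as.data.frame(
  lapply(props, function(p) extract_property(category, p)),
  stringsAsFactors = FALSE
)
print(res)
```

Properties that are not present in a string come back as NA, so every requested property yields a column even when no job set it.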

A tibble

Location of the SGE accounting file

The SGE accounting file is typically located in a subfolder of the folder $SGE_ROOT/$SGE_CELL/. On Wynton HPC, the pathname is given by sge_accounting_file().
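The default location can be sketched in base R as follows. The helper default_accounting_pathname() and the values "/opt/sge" and "default" are hypothetical and used only for illustration. The sketch mirrors the default 'path' argument shown in the usage of sge_accounting_file(), but it does not check that the file exists.

```r
## Hypothetical helper: compose <SGE_ROOT>/<SGE_CELL>/common/<filename>
default_accounting_pathname <- function(root = Sys.getenv("SGE_ROOT"),
                                        cell = Sys.getenv("SGE_CELL"),
                                        filename = "accounting") {
  file.path(root, cell, "common", filename)
}

## Placeholder values; on a real cluster these come from the environment
pathname <- default_accounting_pathname(root = "/opt/sge", cell = "default")
print(pathname)
#> [1] "/opt/sge/default/common/accounting"
```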

File offset positions for each job entry

If you know the file offset (in bytes) for the first job entry you wish to read, then specify it via argument offset. This speeds up the reading, because it avoids having to parse jobs from the beginning. To find the file offsets for job entries, see make_file_index().
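The idea behind byte offsets can be shown with base R connections alone. The sketch below writes a few made-up records to a temporary file, computes where each record starts, and then jumps straight to the third one with seek(). The records are toy data, not real accounting entries, and the sketch does not use make_file_index().

```r
## Write three made-up records, one per line, using "\n" line endings
pathname <- tempfile()
records <- c("long.q:host1:job_a", "long.q:host2:job_b", "short.q:host3:job_c")
con <- file(pathname, open = "wb")
writeLines(records, con = con, sep = "\n")
close(con)

## Byte offset of each record: total size of all preceding lines,
## where each line occupies its own bytes plus one newline
sizes <- nchar(records, type = "bytes") + 1L
offsets <- c(0, cumsum(sizes)[-length(sizes)])

## Jump directly to the third record without reading the first two
con <- file(pathname, open = "rb")
seek(con, where = offsets[3], origin = "start")
line <- readLines(con, n = 1L)
close(con)
unlink(pathname)

print(line)
#> [1] "short.q:host3:job_c"
```

Because the reader starts at the requested byte position, the cost of skipping earlier entries does not grow with the size of the file.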

Failed code

Code  Description                                    OK  Explanation
0     no failure                                     Y   ran and exited normally
1     assumedly before job                           N   failed early in execd
3     before writing config                          N   failed before execd set up local spool
4     before writing PID                             N   shepherd failed to record its pid - filesystem problem?
6     setting processor set                          N   failed setting up processor set (obsolete)
7     before prolog                                  N   failed before prolog
8     in prolog                                      N   failed in prolog
9     before pestart                                 N   failed before starting PE
10    in pestart                                     N   failed in PE starter
11    before job                                     N   failed in shepherd before starting job
12    before pestop                                  Y   ran, but failed before calling PE stop procedure
13    in pestop                                      Y   ran, but PE stop procedure failed
14    before epilog                                  Y   ran, but failed before calling epilog
15    in epilog                                      Y   ran, but failed in epilog
16    releasing processor set                        Y   ran, but processor set could not be released (obsolete)
17    through signal                                 Y   job killed by signal (possibly qdel)
18    shepherd returned error                        N   shepherd died somehow
19    before writing exit_status                     N   shepherd didn't write reports correctly - probably program or machine crash
20    found unexpected error file                    ?   shepherd encountered a problem
21    in recognizing job                             N   qmaster asked about an unknown job (not in accounting?)
24    migrating (checkpointing jobs)                 Y   ran, will be migrated
25    rescheduling                                   Y   ran, will be rescheduled
26    opening output file                            N   failed opening stderr/stdout file
27    searching requested shell                      N   failed finding specified shell
28    changing to working directory                  N   failed changing to start directory
29    AFS setup                                      N   failed setting up AFS security
30    application error returned                     Y   ran and exited 100 - maybe re-scheduled
36    checking configured daemons                    N   failed because of configured remote startup daemon
37    qmaster enforced h_rt, h_cpu, or h_vmem limit  Y   ran, but killed due to exceeding run time limit
38    adding supplementary group                     N   failed adding supplementary gid to job
100   assumedly after job                            Y   ran, but killed by a signal (perhaps due to exceeding resources), task died, shepherd died (e.g. node crash), etc.

The following failed codes are specific to MS Windows:

Code  Description                                    OK  Explanation
31    accessing sgepasswd file                       N   failed because sgepasswd not readable*
32    entry is missing in password file              N   failed because user not in sgepasswd*
33    wrong password                                 N   failed because of wrong password against sgepasswd*
34    communicating with Grid Engine Helper Service  N   failed because of failure of helper service*
35    before job in Grid Engine Helper Service       N   failed because of failure running helper service*

Source: man sge_status.

In addition to the above, I, the package author, have tried to gather additional information about the failed codes below, based on real-world observations.

  • 21: When this happens, both qname and hostname are UNKNOWN, qsub_time is 0 ("Wed Dec 31 16:00:00 1969"), start_time and end_time are 0 ("-/-"), and all run-time data, including ru_wallclock, are 0. It appears to happen to old jobs (per job_number) from times before one or more major downtimes. Because of this, I believe these are jobs in the Eqw state that SGE eventually tries to flush out.
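As a small illustration of how the failed-code table can be used, the sketch below looks up the description and the OK flag for a vector of 'failed' values. The data frame is a hand-copied subset of the table above and the 'failed' values are made up. It is not the output of sge_failed_codes(), whose exact columns are not shown here.

```r
## Hand-copied subset of the failed-code table (illustration only)
failed_codes <- data.frame(
  code = c(0L, 1L, 17L, 37L, 100L),
  description = c(
    "no failure",
    "assumedly before job",
    "through signal",
    "qmaster enforced h_rt, h_cpu, or h_vmem limit",
    "assumedly after job"
  ),
  ok = c("Y", "N", "Y", "Y", "Y"),
  stringsAsFactors = FALSE
)

## Made-up 'failed' values for four job entries
failed <- c(0L, 37L, 1L, 0L)

## Match each job entry against the table
idx <- match(failed, failed_codes$code)
description <- failed_codes$description[idx]
ran <- failed_codes$ok[idx] == "Y"

print(data.frame(failed, description, ran))
```

A code that is missing from the lookup table gives NA in both derived columns, which makes unknown codes easy to spot.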

Common exit codes

Code     Description
0        Success
1        Catchall for general errors
2        Misuse of shell builtins (according to Bash documentation)
126      Command invoked cannot execute, e.g. /dev/null
127      "command not found"
128      Invalid argument to exit, e.g. exit 3.14
128 + n  Fatal error signal n
134      128 + 6 = 128 + SIGABRT - Abort signal from abort
135      128 + 7 = 128 + SIGBUS - Bus error (bad memory access)
136      128 + 8 = 128 + SIGFPE - Floating-point exception
137      128 + 9 = 128 + SIGKILL
138      128 + 10 = 128 + SIGUSR1
140      128 + 12 = 128 + SIGUSR2
255      128 + 127 = Exit status out of range, e.g. exit -1

Comment: exit only takes integers in [0,255]
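The "128 + n" convention can be turned around to recover the signal number from an exit status. The helper exit_signal() below is a hypothetical sketch, not part of the package. It treats values strictly between 128 and 255 as signal exits and returns NA otherwise, following the table above.

```r
## Hypothetical helper: signal number implied by an exit status, or NA
exit_signal <- function(exit_status) {
  ifelse(exit_status > 128 & exit_status < 255,
         exit_status - 128L, NA_integer_)
}

## Made-up exit statuses: success, general error, SIGKILL, SIGUSR2
status <- c(0L, 1L, 137L, 140L)
print(exit_signal(status))
#> [1] NA NA  9 12
```

Applied to the exit_status column, this gives a quick way to count jobs that ended through a signal such as SIGKILL (9).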

Benchmarking

The accounting file on Wynton HPC took ~2 minutes to read when it was 4.8 GB in size and ~6-8 minutes when it was 12 GB in size.

References

  • man accounting

  • man sge_status

Examples

pathname <- system.file("exdata", "accounting", package = "wyntonquery")

jobs <- read_raw_sge_accounting(pathname)
print(jobs)
#> # A tibble: 1,000 × 45
#>    qname    hostname group  owner   job_name job_number account priority
#>    <chr>    <chr>    <chr>  <chr>   <chr>         <int> <chr>      <int>
#>  1 member.q cin-id3  group4 owner09 sleep.sh          1 sge            0
#>  2 long.q   cc-id2   group4 owner09 sleep.sh          2 sge           19
#>  3 long.q   cc-id2   group4 owner09 sleep.sh          2 sge           19
#>  4 long.q   cc-id2   group4 owner09 sleep.sh          2 sge           19
#>  5 long.q   cc-id2   group4 owner09 sleep.sh          2 sge           19
#>  6 long.q   cc-id2   group4 owner09 sleep.sh          2 sge           19
#>  7 long.q   cc-id2   group4 owner09 sleep.sh          2 sge           19
#>  8 member.q cin-id2  group4 owner09 sleep.sh          2 sge            0
#>  9 long.q   cin-id2  group4 owner09 sleep.sh          2 sge           19
#> 10 long.q   cin-id2  group4 owner09 sleep.sh          2 sge           19
#> # ℹ 990 more rows
#> # ℹ 37 more variables: submission_time <dbl>, start_time <dbl>, end_time <dbl>,
#> #   failed <int>, exit_status <int>, ru_wallclock <dbl>, ru_utime <dbl>,
#> #   ru_stime <dbl>, ru_maxrss <dbl>, ru_ixrss <dbl>, ru_ismrss <dbl>,
#> #   ru_idrss <dbl>, ru_isrss <dbl>, ru_minflt <dbl>, ru_majflt <dbl>,
#> #   ru_nswap <dbl>, ru_inblock <dbl>, ru_oublock <dbl>, ru_msgsnd <dbl>,
#> #   ru_msgrcv <dbl>, ru_nsignals <dbl>, ru_nvcsw <dbl>, ru_nivcsw <dbl>, …

## Anonymize (although actually already anonymized)
jobs_anon <- anonymize(jobs)
print(jobs_anon)
#> # A tibble: 1,000 × 45
#>    qname    hostname group  owner   job_name job_number account priority
#>    <chr>    <chr>    <chr>  <chr>   <chr>         <int> <chr>      <int>
#>  1 member.q cin-id3  group6 owner04 sleep.sh          1 sge            0
#>  2 long.q   cc-id2   group6 owner04 sleep.sh          2 sge           19
#>  3 long.q   cc-id2   group6 owner04 sleep.sh          2 sge           19
#>  4 long.q   cc-id2   group6 owner04 sleep.sh          2 sge           19
#>  5 long.q   cc-id2   group6 owner04 sleep.sh          2 sge           19
#>  6 long.q   cc-id2   group6 owner04 sleep.sh          2 sge           19
#>  7 long.q   cc-id2   group6 owner04 sleep.sh          2 sge           19
#>  8 member.q cin-id2  group6 owner04 sleep.sh          2 sge            0
#>  9 long.q   cin-id2  group6 owner04 sleep.sh          2 sge           19
#> 10 long.q   cin-id2  group6 owner04 sleep.sh          2 sge           19
#> # ℹ 990 more rows
#> # ℹ 37 more variables: submission_time <dbl>, start_time <dbl>, end_time <dbl>,
#> #   failed <int>, exit_status <int>, ru_wallclock <dbl>, ru_utime <dbl>,
#> #   ru_stime <dbl>, ru_maxrss <dbl>, ru_ixrss <dbl>, ru_ismrss <dbl>,
#> #   ru_idrss <dbl>, ru_isrss <dbl>, ru_minflt <dbl>, ru_majflt <dbl>,
#> #   ru_nswap <dbl>, ru_inblock <dbl>, ru_oublock <dbl>, ru_msgsnd <dbl>,
#> #   ru_msgrcv <dbl>, ru_nsignals <dbl>, ru_nvcsw <dbl>, ru_nivcsw <dbl>, …
pathname <- system.file("exdata", "accounting", package = "wyntonquery")

jobs <- read_sge_accounting(pathname)
print(jobs)
#> # A tibble: 1,000 × 45
#>    qname    hostname group  owner   job_name job_number account priority
#>    <chr>    <chr>    <chr>  <chr>   <chr>         <int> <chr>      <int>
#>  1 member.q cin-id3  group4 owner09 sleep.sh          1 sge            0
#>  2 long.q   cc-id2   group4 owner09 sleep.sh          2 sge           19
#>  3 long.q   cc-id2   group4 owner09 sleep.sh          2 sge           19
#>  4 long.q   cc-id2   group4 owner09 sleep.sh          2 sge           19
#>  5 long.q   cc-id2   group4 owner09 sleep.sh          2 sge           19
#>  6 long.q   cc-id2   group4 owner09 sleep.sh          2 sge           19
#>  7 long.q   cc-id2   group4 owner09 sleep.sh          2 sge           19
#>  8 member.q cin-id2  group4 owner09 sleep.sh          2 sge            0
#>  9 long.q   cin-id2  group4 owner09 sleep.sh          2 sge           19
#> 10 long.q   cin-id2  group4 owner09 sleep.sh          2 sge           19
#> # ℹ 990 more rows
#> # ℹ 37 more variables: submission_time <dttm>, start_time <dttm>,
#> #   end_time <dttm>, failed <int>, exit_status <int>, ru_wallclock <drtn>,
#> #   ru_utime <drtn>, ru_stime <drtn>, ru_maxrss <chr>, ru_ixrss <chr>,
#> #   ru_ismrss <chr>, ru_idrss <chr>, ru_isrss <chr>, ru_minflt <dbl>,
#> #   ru_majflt <dbl>, ru_nswap <dbl>, ru_inblock <dbl>, ru_oublock <dbl>,
#> #   ru_msgsnd <dbl>, ru_msgrcv <dbl>, ru_nsignals <dbl>, ru_nvcsw <dbl>, …
## # A tibble: 1,000 x 45
##    qname    hostname group  owner   job_name job_number account priority submission_time    
##    <chr>    <chr>    <chr>  <chr>   <chr>         <int> <chr>      <int> <dttm>             
##  1 member.q cin-id3  group4 owner09 sleep.sh          1 sge            0 2017-08-15 11:59:21
##  2 long.q   cc-id2   group4 owner09 sleep.sh          2 sge           19 2017-08-15 12:00:23
##  3 long.q   cc-id2   group4 owner09 sleep.sh          2 sge           19 2017-08-15 12:00:23
## ...

## Identify successful and failed jobs
jobs_success <- subset(jobs, failed == 0)
jobs_fail <- subset(jobs, failed > 0)

## CPU time consumed
t <- c(sum(jobs_success$cpu), sum(jobs_fail$cpu))
units(t) <- "days"
print(t)
#> Time differences in days
#> [1] 1283.6721  328.9508

## Fraction of successful and failed CPU time
u <- as.numeric(t)
u <- u / sum(u)
names(u) <- c("success", "failed")
print(u)
#>   success    failed 
#> 0.7960151 0.2039849 