Debugging Achelous Jobs

Basic Concepts of Achelous Jobs

  • Job: every Achelous workload is submitted as a job.
  • Task: every job consists of one or more tasks.

Finding the Failure Reason

1. Viewing the failure reason of a job

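The failure reason of a job is written to the FailReason field of the job status, which you can view with the biocli job status command. When something fails, check the job-level FailReason first.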

[yawei@Cc1Apc simple-error]$ biocli  job status b6aaa5b9-6f69-4d70-63b5-7dea8498d813
Status of Job b6aaa5b9-6f69-4d70-63b5-7dea8498d813:
 Name: simple-job
 Pipeline: SIMPLE-PIPELINE1111
 State: FAIL
 Owner: yawei
 WorkDir: vol6@xtfs-cluster1:yawei/bioflow/wdl/job-simple-job-b6aaa5b9-6f69-4d70-63b5-7dea8498d813
 PausedState:
 Created: 2021-05-13T14:07:43Z
 Finished: 2021-05-13T14:07:43Z
 RetryLimit: 3
 RunCount: 0
 UserStageCount: 0
 StageQuota: -1
 Priority: 9
 FailReason:
    reason 1: Preapre work directories fail: No mounter config for xtfs-cluster1
 GraphBuildPipelinePos: 0
 DoneStages: No stage done
 RunningStages: No stage running
 WaitingStages: No stage waiting
 ForbiddenStages: No stage forbidden
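
When many jobs need checking, the job-level State and FailReason entries can be pulled out of the status text programmatically. The following is a minimal Python sketch, assuming the text layout is exactly what the transcript above shows; the function name job_fail_reasons and the constant SAMPLE_STATUS are hypothetical and are not part of biocli.

```python
import re

# Shortened copy of the `biocli job status` output shown above.
SAMPLE_STATUS = """\
Status of Job b6aaa5b9-6f69-4d70-63b5-7dea8498d813:
 Name: simple-job
 State: FAIL
 RetryLimit: 3
 FailReason:
    reason 1: Preapre work directories fail: No mounter config for xtfs-cluster1
 GraphBuildPipelinePos: 0
 DoneStages: No stage done
"""

# Assumption: each job-level reason is printed as "reason <n>: <text>".
REASON_RE = re.compile(r"^reason\s+\d+:\s*(.*)$")


def job_fail_reasons(status_text):
    """Return (state, reasons) for the job-level fields of the status text."""
    state = None
    reasons = []
    collecting = False
    for line in status_text.splitlines():
        field = line.strip()
        if collecting:
            match = REASON_RE.match(field)
            if match:
                reasons.append(match.group(1))
                continue
            break  # the first non-reason line ends the job-level FailReason block
        if state is None and field.startswith("State:"):
            state = field.split(":", 1)[1].strip()
        elif field == "FailReason:":
            collecting = True
    return state, reasons


if __name__ == "__main__":
    print(job_fail_reasons(SAMPLE_STATUS))
```

An empty reasons list means the job-level FailReason is blank, in which case the stage-level information has to be inspected instead.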

2. Viewing the failure reason of a task

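If the job-level FailReason is empty, look at the stage whose State is STAGE_FAIL to find out why its task failed. First check the stage-level FailReason in the biocli job status output; if that is not enough to determine the cause, analyze the task logs.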

[yawei@Cc1Apc simple-error]$ biocli  job status 88c29e76
Status of Job 88c29e76-b7fd-4afc-4142-7a37f1b512ec:
 Name: simple-job
 Pipeline: SIMPLE-PIPELINE1111
 State: PSUDONE
 Owner: yawei
 WorkDir: vol6@xtfs-cluster:yawei/bioflow/wdl/job-simple-job-88c29e76-b7fd-4afc-4142-7a37f1b512ec
 PausedState:
 Created: 2021-05-13T14:09:49Z
 Finished: 2021-05-13T14:12:11Z
 RetryLimit: 3
 RunCount: 0
 UserStageCount: 1
 StageQuota: -1
 Priority: 9
 FailReason:
 GraphBuildPipelinePos: 0
 DoneStages(User and System): 3
    Stage wk-wk.tool-tool.smoke1:
        Name: stage-smoke1
        Image: gatk3:latest
        State: STAGE_FAIL
        Output:
        Backend: paladin:paladin-backend.servicemgr.apc:1026
        Task: paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489
        RetryCount: 0
        Submited: Thu May 13 14:09:49 CST 2021
        Scheduled: Thu May 13 14:11:53 CST 2021
        Finished: Thu May 13 14:11:54 CST 2021
        Duration: 2.083333
        ExecutionHost: Cc1Apc(172.27.158.5)
        CPU: 2.000000
        Memory: 4000.000000
        FailReason: Container exited with status 2
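
The stage blocks in this output can also be filtered for STAGE_FAIL automatically. This is a rough Python sketch, assuming each stage starts with a "Stage <stage-id>:" line followed by "Key: value" lines as in the transcript above; failed_stages and SAMPLE_STATUS are hypothetical names, not part of biocli.

```python
# Shortened copy of the `biocli job status` output shown above.
SAMPLE_STATUS = """\
Status of Job 88c29e76-b7fd-4afc-4142-7a37f1b512ec:
 State: PSUDONE
 FailReason:
 DoneStages(User and System): 3
    Stage wk-wk.tool-tool.smoke1:
        Name: stage-smoke1
        State: STAGE_FAIL
        Backend: paladin:paladin-backend.servicemgr.apc:1026
        Task: paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489
        FailReason: Container exited with status 2
"""


def failed_stages(status_text):
    """Return one dict per stage whose State is STAGE_FAIL."""
    stages = []
    current = None
    for line in status_text.splitlines():
        field = line.strip()
        if field.startswith("Stage ") and field.endswith(":"):
            # A "Stage <stage-id>:" line opens a new stage block.
            current = {"id": field[len("Stage "):-1]}
            stages.append(current)
        elif current is not None and ":" in field:
            # Split on the first colon only, since values may contain colons.
            key, value = field.split(":", 1)
            current[key.strip()] = value.strip()
    return [stage for stage in stages if stage.get("State") == "STAGE_FAIL"]


if __name__ == "__main__":
    for stage in failed_stages(SAMPLE_STATUS):
        print(stage["id"], stage.get("Task"), stage.get("FailReason"))
```

The sketch does not track indentation, so any job-level line printed after a stage block would be attached to that stage; for the output shown here that does not matter. The returned id is the value to pass as the stage ID when fetching logs.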

3. Analyzing task logs

Run biocli job logs <job-id> -S <stage-id>:

[yawei@Cc1Apc simple-error]$ biocli  job logs 88c29e76  -S wk-wk.tool-tool.smoke1
========================
| Job 88c29e76 Log:
========================

++++++++++++++++++++++++
+ Stage stage-smoke1:
++++++++++++++++++++++++

-------------------------------------------------------------------
| TaskID: paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489.info
-------------------------------------------------------------------
The executor profiler is enabled
Task exited, message is Container exited with status 2, state is TASK_FAILED


-------------------------------------------------------------------
| TaskID: paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489.cmd_err
-------------------------------------------------------------------
ls: cannot access /home/yawei: No such file or directory


-------------------------------------------------------------------
| TaskID: paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489.cmd_info
-------------------------------------------------------------------


-------------------------------------------------------------------
| TaskID: paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489.err
-------------------------------------------------------------------
I0513 14:11:24.628684  7140 exec.cpp:167] Version: 1.9.0
I0513 14:11:29.277527  7149 exec.cpp:240] Executor registered on agent bd092ed7-fd12-443a-b988-08a198de4291-S0
I0513 14:11:29.289402  7154 executor.cpp:157] Registered docker executor on Cc1Apc
I0513 14:11:36.175945  7169 executor.cpp:213] Starting task paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489
I0513 14:11:36.178061  7169 docker.cpp:1373] Running docker -H unix:///var/run/docker.sock run --cpu-shares 2048 --memory 4194304000 -e NVIDIA_VISIBLE_DEVICES= -e PALADIN_ARRAYID=-1 -e PALADIN_JOBID=paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489 -e SRM_ALLOCATION_ROLE=* -e SRM_CONTAINER_NAME=srm-0188b77c-d2b3-4298-9b10-094151507393 -e SRM_SANDBOX=/mnt/srm/sandbox -e SRM_TASK_ID=paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489 -e XTAO_JOB_ID=88c29e76-b7fd-4afc-4142-7a37f1b512ec -e XT_PALADIN_SERVER=:1026 -v/mnt/vol6:/mnt/vol6:rw -v /tmp:/vols/temp:rw -v /var/lib/srm/slaves/bd092ed7-fd12-443a-b988-08a198de4291-S0/frameworks/paladin-backend/executors/paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489/runs/0188b77c-d2b3-4298-9b10-094151507393:/mnt/srm/sandbox --net host --entrypoint /bin/sh --name srm-0188b77c-d2b3-4298-9b10-094151507393 --log-driver=none --workdir=/mnt/vol6/yawei/bioflow/wdl/job-simple-job-88c29e76-b7fd-4afc-4142-7a37f1b512ec/shadow-wk-wk.tool-tool.smoke1-run0 --user=10003:10003 registry.servicemgr.apc:5000/gatk3:latest -c        echo "Hello world" > smoke.txt
ls /home/yawei
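
The log output above is a sequence of sections, each introduced by a "| TaskID: <task-id>.<suffix>" banner. A small Python sketch for splitting it into per-suffix sections follows, assuming the banner layout is exactly as shown; split_task_logs and SAMPLE_LOG are hypothetical names, not part of biocli.

```python
# Shortened copy of the `biocli job logs` output shown above.
SAMPLE_LOG = """\
-------------------------------------------------------------------
| TaskID: paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489.info
-------------------------------------------------------------------
The executor profiler is enabled
Task exited, message is Container exited with status 2, state is TASK_FAILED


-------------------------------------------------------------------
| TaskID: paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489.cmd_err
-------------------------------------------------------------------
ls: cannot access /home/yawei: No such file or directory


-------------------------------------------------------------------
| TaskID: paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489.cmd_info
-------------------------------------------------------------------
"""


def split_task_logs(log_text):
    """Map each log suffix (info, cmd_err, cmd_info, err) to its text."""
    sections = {}
    current = None
    for line in log_text.splitlines():
        if line.startswith("| TaskID:"):
            name = line.split(":", 1)[1].strip()
            current = name.rsplit(".", 1)[-1]  # text after the last dot
            sections[current] = []
        elif current is not None and set(line) != {"-"}:
            # Keep everything except the dashed banner rules.
            sections[current].append(line)
    return {suffix: "\n".join(lines).strip() for suffix, lines in sections.items()}


if __name__ == "__main__":
    logs = split_task_logs(SAMPLE_LOG)
    print(logs["cmd_err"])
```

Reading the cmd_err section first is usually the quickest route, since it carries the error text of the user's own command.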

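The contents of paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489.cmd_err show the real cause of the task failure. The *.cmd_err file is the output log of the user's task, and the *.err file records the command that launched the task, from which you can read the launch parameters. For a "No such file or directory" error, first confirm that the file really exists, then check in the launch command whether the directory it lives in is mapped into the container by a -v argument.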
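
The -v check can be sketched in Python as follows. This is only an illustration, assuming the launch command has been copied out of the *.err log as a single string and that every volume is given as -v host:container[:mode]; the names container_mounts, is_covered_by_mount and SAMPLE_CMD are hypothetical. A path that no mount covers may still exist inside the image itself, so a negative result is a hint rather than proof.

```python
import re
from pathlib import PurePosixPath

# Shortened copy of the docker command from the *.err log shown above.
SAMPLE_CMD = (
    "docker -H unix:///var/run/docker.sock run --cpu-shares 2048 "
    "-e XT_PALADIN_SERVER=:1026 -v/mnt/vol6:/mnt/vol6:rw -v /tmp:/vols/temp:rw "
    "--net host --entrypoint /bin/sh registry.servicemgr.apc:5000/gatk3:latest -c"
)

# Matches "-v spec" and "-vspec", but only when -v starts a token.
VOLUME_RE = re.compile(r"(?<!\S)-v\s*(\S+)")


def container_mounts(docker_cmd):
    """Return (host_path, container_path) pairs for every -v argument."""
    mounts = []
    for spec in VOLUME_RE.findall(docker_cmd):
        parts = spec.split(":")
        if len(parts) >= 2:
            mounts.append((parts[0], parts[1]))
    return mounts


def is_covered_by_mount(path, mounts):
    """True if the container-side path lies at or below a mounted directory."""
    target = PurePosixPath(path)
    for _host, container in mounts:
        mount_point = PurePosixPath(container)
        if target == mount_point or mount_point in target.parents:
            return True
    return False


if __name__ == "__main__":
    mounts = container_mounts(SAMPLE_CMD)
    for candidate in ("/home/yawei", "/mnt/vol6/yawei/bioflow"):
        print(candidate, is_covered_by_mount(candidate, mounts))
```

For the command in this example, /home/yawei is not under any mounted directory, which is consistent with the ls error in cmd_err.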


Job Failure States

1. The job is in the FAIL state

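This kind of failure is usually a system error. In the case above, the job was submitted with a cluster and volume name that do not exist, so it failed. Correct the volume and cluster names specified at submission and submit the job again.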

[yawei@Cc1Apc simple-error]$ biocli  job status b6aaa5b9-6f69-4d70-63b5-7dea8498d813
Status of Job b6aaa5b9-6f69-4d70-63b5-7dea8498d813:
 Name: simple-job
 Pipeline: SIMPLE-PIPELINE1111
 State: FAIL
 Owner: yawei
 WorkDir: vol6@xtfs-cluster1:yawei/bioflow/wdl/job-simple-job-b6aaa5b9-6f69-4d70-63b5-7dea8498d813
 PausedState:
 Created: 2021-05-13T14:07:43Z
 Finished: 2021-05-13T14:07:43Z
 RetryLimit: 3
 RunCount: 0
 UserStageCount: 0
 StageQuota: -1
 Priority: 9
 FailReason:
    reason 1: Preapre work directories fail: No mounter config for xtfs-cluster1
 GraphBuildPipelinePos: 0
 DoneStages: No stage done
 RunningStages: No stage running
 WaitingStages: No stage waiting
 ForbiddenStages: No stage forbidden

2. The job is in the PSUDONE state

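When the job failed because one of its tasks failed, you only need to locate the cause of the task failure, fix the corresponding task script (WDL or BSL), update the pipeline, and then recover the job. Execution resumes from the point where it last failed.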

[yawei@Cc1Apc simple-error]$ biocli  job status 88c29e76
Status of Job 88c29e76-b7fd-4afc-4142-7a37f1b512ec:
 Name: simple-job
 Pipeline: SIMPLE-PIPELINE1111
 State: PSUDONE
 Owner: yawei
 WorkDir: vol6@xtfs-cluster:yawei/bioflow/wdl/job-simple-job-88c29e76-b7fd-4afc-4142-7a37f1b512ec
 PausedState:
 Created: 2021-05-13T14:09:49Z
 Finished: 2021-05-13T14:12:11Z
 RetryLimit: 3
 RunCount: 0
 UserStageCount: 1
 StageQuota: -1
 Priority: 9
 FailReason:
 GraphBuildPipelinePos: 0
 DoneStages(User and System): 3
    Stage wk-wk.tool-tool.smoke1:
        Name: stage-smoke1
        Image: gatk3:latest
        State: STAGE_FAIL
        Output:
        Backend: paladin:paladin-backend.servicemgr.apc:1026
        Task: paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489
        RetryCount: 0
        Submited: Thu May 13 14:09:49 CST 2021
        Scheduled: Thu May 13 14:11:53 CST 2021
        Finished: Thu May 13 14:11:54 CST 2021
        Duration: 2.083333
        ExecutionHost: Cc1Apc(172.27.158.5)
        CPU: 2.000000
        Memory: 4000.000000
        FailReason: Container exited with status 2
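
The two failure states and their follow-up actions can be summarized as a small decision helper. This Python sketch only restates the guidance in this section; suggest_next_step is a hypothetical name, and it covers just the FAIL and PSUDONE states discussed here.

```python
def suggest_next_step(state, job_reasons, stage_reasons):
    """Map a job state to the follow-up action described in this guide.

    job_reasons and stage_reasons are lists of FailReason strings taken
    from the job level and from failed stages respectively.
    """
    # The job-level FailReason takes priority over stage-level reasons.
    if job_reasons:
        detail = "; ".join(job_reasons)
    elif stage_reasons:
        detail = "; ".join(stage_reasons)
    else:
        detail = "no FailReason recorded, check the task logs"

    if state == "FAIL":
        action = "fix the submission parameters (such as volume and cluster names) and resubmit the job"
    elif state == "PSUDONE":
        action = "fix the failing task's script, update the pipeline, then recover the job"
    else:
        action = "state not covered by this guide"
    return f"{action} [{detail}]"


if __name__ == "__main__":
    print(suggest_next_step("PSUDONE", [], ["Container exited with status 2"]))
```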
Powered by XTAO Technology. Last modified on: 2023-03-24 09:05:17
