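How to Debug Achelous Jobs
Basic Concepts of Achelous Jobs
- Job: every Achelous workload is submitted as a job
- Task: each job consists of one or more tasks
Finding the Failure Reason
1. Checking why a job failed
The reason a job failed is reported in the FailReason field of the job status, which you can view with the biocli job status command.
When something fails, check the job-level FailReason first.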
[yawei@Cc1Apc simple-error]$ biocli job status b6aaa5b9-6f69-4d70-63b5-7dea8498d813
Status of Job b6aaa5b9-6f69-4d70-63b5-7dea8498d813:
Name: simple-job
Pipeline: SIMPLE-PIPELINE1111
State: FAIL
Owner: yawei
WorkDir: vol6@xtfs-cluster1:yawei/bioflow/wdl/job-simple-job-b6aaa5b9-6f69-4d70-63b5-7dea8498d813
PausedState:
Created: 2021-05-13T14:07:43Z
Finished: 2021-05-13T14:07:43Z
RetryLimit: 3
RunCount: 0
UserStageCount: 0
StageQuota: -1
Priority: 9
FailReason:
reason 1: Preapre work directories fail: No mounter config for xtfs-cluster1
GraphBuildPipelinePos: 0
DoneStages: No stage done
RunningStages: No stage running
WaitingStages: No stage waiting
ForbiddenStages: No stage forbidden
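When many jobs need to be checked, the job-level FailReason can be pulled out of the status text programmatically. The following is a minimal sketch, assuming the output of biocli job status has been saved to a string and that each reason appears on its own line as "reason N: ...", as in the transcript above. The function name job_fail_reasons is hypothetical and not part of biocli.

```python
import re


def job_fail_reasons(status_text):
    """Return the job-level FailReason entries found in `biocli job status` output.

    Assumption: the job-level field is a bare "FailReason:" line followed by
    zero or more "reason N: ..." lines, as in the transcript above.
    """
    reasons = []
    in_block = False
    for raw in status_text.splitlines():
        line = raw.strip()
        if not in_block:
            # Stage-level reasons look like "FailReason: <text>" and are skipped here.
            if line == "FailReason:":
                in_block = True
            continue
        match = re.match(r"reason\s+\d+:\s*(.*)", line)
        if match:
            reasons.append(match.group(1))
        else:
            # The first line that is not a "reason N:" entry ends the block.
            break
    return reasons
```

An empty list means the job-level FailReason is empty, in which case the stage-level information has to be inspected instead.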
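2. Checking why a task failed
If the job-level FailReason is empty, look at the tasks whose state is STAGE_FAIL to find out why a specific task failed. First check the stage FailReason in the biocli job status output; if that is not enough to determine the cause, analyze the task logs.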
[yawei@Cc1Apc simple-error]$ biocli job status 88c29e76
Status of Job 88c29e76-b7fd-4afc-4142-7a37f1b512ec:
Name: simple-job
Pipeline: SIMPLE-PIPELINE1111
State: PSUDONE
Owner: yawei
WorkDir: vol6@xtfs-cluster:yawei/bioflow/wdl/job-simple-job-88c29e76-b7fd-4afc-4142-7a37f1b512ec
PausedState:
Created: 2021-05-13T14:09:49Z
Finished: 2021-05-13T14:12:11Z
RetryLimit: 3
RunCount: 0
UserStageCount: 1
StageQuota: -1
Priority: 9
FailReason:
GraphBuildPipelinePos: 0
DoneStages(User and System): 3
Stage wk-wk.tool-tool.smoke1:
Name: stage-smoke1
Image: gatk3:latest
State: STAGE_FAIL
Output:
Backend: paladin:paladin-backend.servicemgr.apc:1026
Task: paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489
RetryCount: 0
Submited: Thu May 13 14:09:49 CST 2021
Scheduled: Thu May 13 14:11:53 CST 2021
Finished: Thu May 13 14:11:54 CST 2021
Duration: 2.083333
ExecutionHost: Cc1Apc(172.27.158.5)
CPU: 2.000000
Memory: 4000.000000
FailReason: Container exited with status 2
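The stage lookup described above can be sketched in a few lines. This assumes the status text follows the layout shown in the transcript: a "Stage <stage-id>:" line followed by "Key: value" lines for that stage. The function name failed_stages is hypothetical.

```python
def failed_stages(status_text):
    """Return (stage_id, fail_reason) for every stage whose State is STAGE_FAIL.

    Assumption: each stage starts with a "Stage <stage-id>:" line and its
    fields follow as "Key: value" lines until the next stage begins.
    """
    stages = []
    current = None
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("Stage ") and line.endswith(":"):
            current = {"id": line[len("Stage "):-1], "state": "", "reason": ""}
            stages.append(current)
        elif current is not None:
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.partition(":")
            if key == "State":
                current["state"] = value.strip()
            elif key == "FailReason":
                current["reason"] = value.strip()
    return [(s["id"], s["reason"]) for s in stages if s["state"] == "STAGE_FAIL"]
```

The stage id returned here is the value to pass as <stage-id> when fetching the logs of the failed stage.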
3. Analyzing task logs
Run the command biocli job logs <job-id> -S <stage-id>
[yawei@Cc1Apc simple-error]$ biocli job logs 88c29e76 -S wk-wk.tool-tool.smoke1
========================
| Job 88c29e76 Log:
========================
++++++++++++++++++++++++
+ Stage stage-smoke1:
++++++++++++++++++++++++
-------------------------------------------------------------------
| TaskID: paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489.info
-------------------------------------------------------------------
The executor profiler is enabled
Task exited, message is Container exited with status 2, state is TASK_FAILED
-------------------------------------------------------------------
| TaskID: paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489.cmd_err
-------------------------------------------------------------------
ls: cannot access /home/yawei: No such file or directory
-------------------------------------------------------------------
| TaskID: paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489.cmd_info
-------------------------------------------------------------------
-------------------------------------------------------------------
| TaskID: paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489.err
-------------------------------------------------------------------
I0513 14:11:24.628684 7140 exec.cpp:167] Version: 1.9.0
I0513 14:11:29.277527 7149 exec.cpp:240] Executor registered on agent bd092ed7-fd12-443a-b988-08a198de4291-S0
I0513 14:11:29.289402 7154 executor.cpp:157] Registered docker executor on Cc1Apc
I0513 14:11:36.175945 7169 executor.cpp:213] Starting task paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489
I0513 14:11:36.178061 7169 docker.cpp:1373] Running docker -H unix:///var/run/docker.sock run --cpu-shares 2048 --memory 4194304000 -e NVIDIA_VISIBLE_DEVICES= -e PALADIN_ARRAYID=-1 -e PALADIN_JOBID=paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489 -e SRM_ALLOCATION_ROLE=* -e SRM_CONTAINER_NAME=srm-0188b77c-d2b3-4298-9b10-094151507393 -e SRM_SANDBOX=/mnt/srm/sandbox -e SRM_TASK_ID=paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489 -e XTAO_JOB_ID=88c29e76-b7fd-4afc-4142-7a37f1b512ec -e XT_PALADIN_SERVER=:1026 -v/mnt/vol6:/mnt/vol6:rw -v /tmp:/vols/temp:rw -v /var/lib/srm/slaves/bd092ed7-fd12-443a-b988-08a198de4291-S0/frameworks/paladin-backend/executors/paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489/runs/0188b77c-d2b3-4298-9b10-094151507393:/mnt/srm/sandbox --net host --entrypoint /bin/sh --name srm-0188b77c-d2b3-4298-9b10-094151507393 --log-driver=none --workdir=/mnt/vol6/yawei/bioflow/wdl/job-simple-job-88c29e76-b7fd-4afc-4142-7a37f1b512ec/shadow-wk-wk.tool-tool.smoke1-run0 --user=10003:10003 registry.servicemgr.apc:5000/gatk3:latest -c echo "Hello world" > smoke.txt
ls /home/yawei
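Because the log output above is one long stream, it can help to split it into its per-file sections before reading it. The sketch below assumes each section is introduced by a "| TaskID: <task-id>.<suffix>" header surrounded by dashed separator lines, as in the transcript. The function name split_task_logs is hypothetical.

```python
import re

# Matches headers such as "| TaskID: paladin-task.<uuid>.cmd_err"
HEADER = re.compile(r"^\|\s*TaskID:\s*(\S+)\.(\w+)\s*$")


def split_task_logs(log_text):
    """Map each log suffix (info, cmd_err, cmd_info, err) to its content lines.

    Blank lines and separator lines made only of dashes are dropped.
    """
    sections = {}
    current = None
    for line in log_text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            current = sections.setdefault(match.group(2), [])
        elif current is not None and not set(line.strip()) <= {"-"}:
            current.append(line)
    return sections
```

With the sections separated, split_task_logs(text)["cmd_err"] gives just the error output of the user command.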
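The content of paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489.cmd_err identifies the real cause of the task failure. The *.cmd_err file is the log output of the user's task; the *.err file contains the command used to launch the task, which shows the parameters the task was started with. For a "No such file" error, first confirm that the file really exists, then use the launch command to check whether the directory it lives in is mapped into the container by a -v parameter.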
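The -v check can be automated roughly as follows. This is a sketch under stated assumptions: the docker run command line has been copied out of the *.err log as a single string, only the short -v form is used (as in the log above; --volume is not handled), and each mount is written as host:container[:mode]. The function names docker_volume_mounts and is_mounted are hypothetical.

```python
import posixpath
import shlex


def docker_volume_mounts(command_line):
    """Return the container-side paths of all -v mounts in a `docker run` command."""
    tokens = shlex.split(command_line)
    mounts = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        spec = None
        if token == "-v" and i + 1 < len(tokens):
            spec = tokens[i + 1]          # form: -v /tmp:/vols/temp:rw
            i += 1
        elif token.startswith("-v") and len(token) > 2:
            spec = token[2:]              # form: -v/mnt/vol6:/mnt/vol6:rw
        if spec:
            parts = spec.split(":")
            if len(parts) >= 2:
                # parts[0] is the host path, parts[1] the path inside the container
                mounts.append(posixpath.normpath(parts[1]))
        i += 1
    return mounts


def is_mounted(path, mounts):
    """True if path is one of the mount points or lies underneath one."""
    path = posixpath.normpath(path)
    return any(path == m or path.startswith(m.rstrip("/") + "/") for m in mounts)
```

For the command in the log above, /home/yawei is not under any of the mounted container paths (/mnt/vol6, /vols/temp, /mnt/srm/sandbox), which matches the "No such file or directory" message in *.cmd_err.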
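Job Failure States
1. The job is in the FAIL state
This is usually a system error. In the case above, a cluster and volume name that do not exist were used, which caused the job to fail. Correct the volume and cluster name specified at submission and resubmit the job.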
[yawei@Cc1Apc simple-error]$ biocli job status b6aaa5b9-6f69-4d70-63b5-7dea8498d813
Status of Job b6aaa5b9-6f69-4d70-63b5-7dea8498d813:
Name: simple-job
Pipeline: SIMPLE-PIPELINE1111
State: FAIL
Owner: yawei
WorkDir: vol6@xtfs-cluster1:yawei/bioflow/wdl/job-simple-job-b6aaa5b9-6f69-4d70-63b5-7dea8498d813
PausedState:
Created: 2021-05-13T14:07:43Z
Finished: 2021-05-13T14:07:43Z
RetryLimit: 3
RunCount: 0
UserStageCount: 0
StageQuota: -1
Priority: 9
FailReason:
reason 1: Preapre work directories fail: No mounter config for xtfs-cluster1
GraphBuildPipelinePos: 0
DoneStages: No stage done
RunningStages: No stage running
WaitingStages: No stage waiting
ForbiddenStages: No stage forbidden
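2. The job is in the PSUDONE state
When a job failed because one of its tasks failed, you only need to locate the cause of the task failure, fix the corresponding task script (wdl or bsl), and update the pipeline; recovering the job then resumes execution from the point where it last failed.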
[yawei@Cc1Apc simple-error]$ biocli job status 88c29e76
Status of Job 88c29e76-b7fd-4afc-4142-7a37f1b512ec:
Name: simple-job
Pipeline: SIMPLE-PIPELINE1111
State: PSUDONE
Owner: yawei
WorkDir: vol6@xtfs-cluster:yawei/bioflow/wdl/job-simple-job-88c29e76-b7fd-4afc-4142-7a37f1b512ec
PausedState:
Created: 2021-05-13T14:09:49Z
Finished: 2021-05-13T14:12:11Z
RetryLimit: 3
RunCount: 0
UserStageCount: 1
StageQuota: -1
Priority: 9
FailReason:
GraphBuildPipelinePos: 0
DoneStages(User and System): 3
Stage wk-wk.tool-tool.smoke1:
Name: stage-smoke1
Image: gatk3:latest
State: STAGE_FAIL
Output:
Backend: paladin:paladin-backend.servicemgr.apc:1026
Task: paladin-task.a132c6b8-4f93-4cdd-b8f3-4a116436d489
RetryCount: 0
Submited: Thu May 13 14:09:49 CST 2021
Scheduled: Thu May 13 14:11:53 CST 2021
Finished: Thu May 13 14:11:54 CST 2021
Duration: 2.083333
ExecutionHost: Cc1Apc(172.27.158.5)
CPU: 2.000000
Memory: 4000.000000
FailReason: Container exited with status 2
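The two cases can be summarized as a small lookup from job state to the recovery steps described in this section. This is only an illustrative sketch: the steps are written as plain text rather than commands, and the names RECOVERY_STEPS and recovery_steps are hypothetical.

```python
# Recovery steps per job state, as described in this section.
RECOVERY_STEPS = {
    "FAIL": [
        "Read the job-level FailReason in the job status",
        "Correct the volume and cluster name used at submission",
        "Resubmit the job",
    ],
    "PSUDONE": [
        "Find the stage in STAGE_FAIL and determine why its task failed",
        "Fix the task script (wdl or bsl) and update the pipeline",
        "Recover the job so it resumes from the failed stage",
    ],
}


def recovery_steps(state):
    """Return the recovery steps for a job state, or an empty list if none are listed."""
    return RECOVERY_STEPS.get(state.strip().upper(), [])
```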