RHEL AI: ilab model train command fails with "Unable to find" error for the skills training messages file
Issue
- On RHEL AI, the ilab model train command fails with an "Unable to find" error:
$ ilab -v -v model train --pipeline simple --data-path /var/home/instruct/.local/share/instructlab/skills_train_msgs_2025-02-12T09_50_37.jsonl --device cuda
INFO 2025-02-12 10:10:00,989 numexpr.utils:148: Note: NumExpr detected 48 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO 2025-02-12 10:10:00,989 numexpr.utils:161: NumExpr defaulting to 16 threads.
INFO 2025-02-12 10:10:02,098 datasets:59: PyTorch version 2.4.1 available.
LINUX_TRAIN.PY: NUM EPOCHS IS: 8
LINUX_TRAIN.PY: TRAIN FILE IS: /var/home/instruct/.local/share/instructlab/skills_train_msgs_2025-02-12T09_50_37.jsonl
LINUX_TRAIN.PY: TEST FILE IS: /var/home/instruct/.local/share/instructlab/skills_train_msgs_2025-02-12T09_50_37.jsonl
LINUX_TRAIN.PY: Using device 'cuda:0'
NVidia CUDA version: 12.4
AMD ROCm HIP version: n/a
cuda:0 is 'NVIDIA L40S' (44.1 GiB of 44.5 GiB free, capability: 8.9)
cuda:1 is 'NVIDIA L40S' (44.1 GiB of 44.5 GiB free, capability: 8.9)
cuda:2 is 'NVIDIA L40S' (44.1 GiB of 44.5 GiB free, capability: 8.9)
cuda:3 is 'NVIDIA L40S' (44.1 GiB of 44.5 GiB free, capability: 8.9)
LINUX_TRAIN.PY: LOADING DATASETS
Unable to find '/var/home/instruct/.local/share/instructlab/skills_train_msgs_2025-02-12T09_50_37.jsonl/train_gen.jsonl
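The last log line shows that, with --pipeline simple, the value passed to --data-path appears to be treated as a directory. The trainer appended /train_gen.jsonl to it and looked for that file, so a path that points at a single .jsonl file cannot resolve. The Python sketch below is a hypothetical pre-flight check, not part of ilab. It assumes, based only on that log line, that the simple pipeline expects <data-path>/train_gen.jsonl. The function name check_simple_pipeline_data_path is illustrative.

```python
import os
import tempfile


def check_simple_pipeline_data_path(data_path: str) -> list:
    """Return a list of problems found with a --data-path value.

    Hypothetical helper. It assumes, based on the log line
    "Unable to find '<data-path>/train_gen.jsonl", that the simple
    pipeline joins train_gen.jsonl onto the --data-path value.
    """
    problems = []
    if os.path.isfile(data_path):
        # A single file was passed; the trainer would look for
        # <file>/train_gen.jsonl, which cannot exist.
        problems.append(f"{data_path} is a file, not a directory")
    elif not os.path.isdir(data_path):
        problems.append(f"{data_path} does not exist")
    # The file the log shows the trainer trying to open.
    expected = os.path.join(data_path, "train_gen.jsonl")
    if not os.path.isfile(expected):
        problems.append(f"missing {expected}")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        msgs = os.path.join(tmp, "skills_train_msgs.jsonl")
        open(msgs, "w").close()
        # Passing the messages file directly mirrors the failing command.
        print(check_simple_pipeline_data_path(msgs))
        # A directory that contains train_gen.jsonl passes the check.
        open(os.path.join(tmp, "train_gen.jsonl"), "w").close()
        print(check_simple_pipeline_data_path(tmp))
```

Running the script prints two problems for the file path (it is a file, and <file>/train_gen.jsonl is missing) and an empty list for a directory that contains train_gen.jsonl.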
Environment
- Red Hat Enterprise Linux AI 1.3
- Red Hat Enterprise Linux AI 1.4
- NVIDIA GPUs