ClearML Dockerfile fix #9876
Signed-off-by: Glenn Jocher <[email protected]>
@thepycoder ran into a ValueError with the default ClearML install in the Dockerfile during DDP training. It occurs when ClearML is installed and a training command is run (no auth or other steps taken).
root@44816c00311e:/usr/src/app# python -m torch.distributed.run --nproc_per_node 2 --master_port 1 train.py --data coco128.yaml --weights yolov5s.pt --img 640 --device 2,3
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=2,3, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
remote: Enumerating objects: 29, done.
remote: Counting objects: 100% (29/29), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 29 (delta 17), reused 22 (delta 13), pack-reused 0
Unpacking objects: 100% (29/29), 19.13 KiB | 2.13 MiB/s, done.
From https://github.com/ultralytics/yolov5
6371de8..3b1a9d2 master -> origin/master
319b395..c6e9ea5 exp8 -> origin/exp8
github: ⚠️ YOLOv5 is out of date by 1 commit. Use `git pull` or `git clone https://github.com/ultralytics/yolov5` to update.
YOLOv5 🚀 v6.2-203-g6371de8 Python-3.8.13 torch-1.12.1+cu113 CUDA:2 (A100-SXM-80GB, 81251MiB)
CUDA:3 (A100-SXM-80GB, 81251MiB)
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
Traceback (most recent call last):
File "train.py", line 630, in <module>
main(opt)
File "train.py", line 524, in main
train(opt.hyp, opt, device, callbacks)
File "train.py", line 93, in train
loggers = Loggers(save_dir, weights, opt, hyp, LOGGER) # loggers instance
File "/usr/src/app/utils/loggers/__init__.py", line 121, in __init__
self.clearml = ClearmlLogger(self.opt, self.hyp)
File "/usr/src/app/utils/loggers/clearml/clearml_utils.py", line 87, in __init__
self.task = Task.init(
File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 601, in init
task = cls._create_dev_task(
File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 3122, in _create_dev_task
task = cls(
File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 199, in __init__
super(Task, self).__init__(**kwargs)
File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/task/task.py", line 155, in __init__
super(Task, self).__init__(id=task_id, session=session, log=log)
File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/base.py", line 145, in __init__
super(IdObjectBase, self).__init__(session, log, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/base.py", line 39, in __init__
self._session = session or self._get_default_session()
File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/base.py", line 115, in _get_default_session
InterfaceBase._default_session = Session(
File "/opt/conda/lib/python3.8/site-packages/clearml/backend_api/session/session.py", line 186, in __init__
raise ValueError(
ValueError: ClearML configuration could not be found (missing `~/clearml.conf` or Environment CLEARML_API_HOST)
To get started with ClearML: setup your own `clearml-server`, or create a free account at https://app.clear.ml
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 418 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 417) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 765, in <module>
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-10-20_18:10:27
host : 44816c00311e
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 417)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root@44816c00311e:/usr/src/app#
Also occurs with basic single-GPU training:
root@44816c00311e:/usr/src/app# python train.py --data coco128.yaml --weights yolov5s.pt --img 640 --device 2
train: weights=yolov5s.pt, cfg=, data=coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=2, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 5 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (5/5), 3.12 KiB | 3.13 MiB/s, done.
From https://github.com/ultralytics/yolov5
* [new branch] glenn-jocher-patch-2 -> origin/glenn-jocher-patch-2
github: ⚠️ YOLOv5 is out of date by 1 commit. Use `git pull` or `git clone https://github.com/ultralytics/yolov5` to update.
YOLOv5 🚀 v6.2-203-g6371de8 Python-3.8.13 torch-1.12.1+cu113 CUDA:2 (A100-SXM-80GB, 81251MiB)
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
Traceback (most recent call last):
File "train.py", line 630, in <module>
main(opt)
File "train.py", line 524, in main
train(opt.hyp, opt, device, callbacks)
File "train.py", line 93, in train
loggers = Loggers(save_dir, weights, opt, hyp, LOGGER) # loggers instance
File "/usr/src/app/utils/loggers/__init__.py", line 121, in __init__
self.clearml = ClearmlLogger(self.opt, self.hyp)
File "/usr/src/app/utils/loggers/clearml/clearml_utils.py", line 87, in __init__
self.task = Task.init(
File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 601, in init
task = cls._create_dev_task(
File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 3122, in _create_dev_task
task = cls(
File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 199, in __init__
super(Task, self).__init__(**kwargs)
File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/task/task.py", line 155, in __init__
super(Task, self).__init__(id=task_id, session=session, log=log)
File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/base.py", line 145, in __init__
super(IdObjectBase, self).__init__(session, log, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/base.py", line 39, in __init__
self._session = session or self._get_default_session()
File "/opt/conda/lib/python3.8/site-packages/clearml/backend_interface/base.py", line 115, in _get_default_session
InterfaceBase._default_session = Session(
File "/opt/conda/lib/python3.8/site-packages/clearml/backend_api/session/session.py", line 186, in __init__
raise ValueError(
ValueError: ClearML configuration could not be found (missing `~/clearml.conf` or Environment CLEARML_API_HOST)
To get started with ClearML: setup your own `clearml-server`, or create a free account at https://app.clear.ml
root@44816c00311e:/usr/src/app#
@thepycoder seems unrelated to Docker. I'll raise a bug report.
EDIT: raised in #9877
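For illustration only, here is a minimal sketch of a pre-check for the two configuration sources named in the ValueError above (`~/clearml.conf` and the `CLEARML_API_HOST` environment variable). The helper name `clearml_configured` is hypothetical and is not part of ClearML or YOLOv5. It only looks at those two locations and is an assumption about how a caller might guard logger setup, not the fix applied in this PR or in #9877.

```python
import os
from pathlib import Path


def clearml_configured(env=None, home=None):
    """Hypothetical helper: True if either config source named in the error exists.

    Only checks the two places mentioned in the ValueError message:
    the CLEARML_API_HOST environment variable and ~/clearml.conf.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    if env.get("CLEARML_API_HOST"):
        return True
    return (home / "clearml.conf").is_file()


if __name__ == "__main__":
    # A caller could skip ClearML logging instead of failing at startup.
    if clearml_configured():
        print("ClearML configuration found")
    else:
        print("ClearML configuration not found, skipping ClearML logging")
```

With a guard like this, an unconfigured container would fall back to the remaining loggers rather than aborting training.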
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Update to YOLOv5 Dockerfile, optimizing Python package installations.
📊 Key Changes
- torchtext and torchvision from the pip uninstall command.
- clearml and fix OpenCV version constraint.
🎯 Purpose & Impact
- clearml could hint at a streamlining of dependencies for specific use-cases, reducing image size and build time.