Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add stopped status for HealthCheck #25423

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Honny1
Copy link
Member

@Honny1 Honny1 commented Feb 28, 2025

This PR adds new status for HealthCheck. If the container is stopped and the ongoing HealthCheck has no chance to complete, the check is evaluated as stopped.

Fixes: https://issues.redhat.com/browse/RUN-2520
Fixes: #25276

Does this PR introduce a user-facing change?

HealthCheck now reports a stopped status if the container stops before the check can complete.

@openshift-ci openshift-ci bot added release-note do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Feb 28, 2025
Copy link
Contributor

openshift-ci bot commented Feb 28, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Honny1
Once this PR has been reviewed and has the lgtm label, please assign l0rd for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

If the container is stopped and the ongoing HealthCheck has no chance to complete the check is evaluated as stopped.

Fixes: https://issues.redhat.com/browse/RUN-2520
Fixes: containers#25276

Signed-off-by: Jan Rodák <[email protected]>
@github-actions github-actions bot added the kind/api-change Change to remote API; merits scrutiny label Feb 28, 2025
@Honny1 Honny1 marked this pull request as ready for review February 28, 2025 13:16
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 28, 2025
Comment on lines +364 to +372
if !c.batched {
c.lock.Lock()
defer c.lock.Unlock()

if err := c.syncContainer(); err != nil {
logrus.Errorf("Error syncing container %s state: %v", c.ID(), err)
return false
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should really not look and sync, IMO this function should simply not exist. While I think your usage for healthcheck is fine thus will totally be misused. First a private function should not take the container look here, that is totally inconsistent with how deal with the majority of libpod functions.
Most importantly this can only lead to unsound code, you should never check the state, then unlock and then do X,Y,Z based on the state while being unlocked. So IMO just delete this func.

Comment on lines +149 to +155
if exitCode != 0 && c.ensureCurrentState(define.ContainerStateStopped, define.ContainerStateStopping, define.ContainerStateExited) {
hcResult = define.HealthCheckContainerStopped
}

hcl := newHealthCheckLog(timeStart, timeEnd, returnCode, eventLog)

healthCheckResult, err := c.updateHealthCheckLog(hcl, inStartPeriod, isStartup)
healthCheckResult, err := c.updateHealthCheckLog(hcl, hcResult, inStartPeriod, isStartup)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this ensureCurrentState takes the lock and updateHealthCheckLog takes the lock again, this is inefficient and should not happen.
I would say remove both locks from the functions and take on lock here in the caller instead.

Comment on lines +410 to +413
if hcResult == define.HealthCheckContainerStopped {
healthCheck.Status = define.HealthCheckStopped
healthCheck.FailingStreak = oldFailingStreak
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like the wrong ordering, there is no need to reset FailingStreak when you don't update it in the first place.

		if hcResult == define.HealthCheckContainerStopped {
			healthCheck.Status = define.HealthCheckStopped
		} else if !inStartPeriod {
			// increment failing streak
			healthCheck.FailingStreak++
			// if failing streak > retries, then status to unhealthy
			if healthCheck.FailingStreak >= c.HealthCheckConfig().Retries {
				healthCheck.Status = define.HealthCheckUnhealthy
			}
		}
		


run_podman run -d --name $ctr \
--health-cmd "sleep 20; echo $msg" \
$IMAGE /home/podman/pause

timeout --foreground -v --kill=10 60 \
$PODMAN healthcheck run $ctr &
$PODMAN healthcheck run $ctr > $hcStatus &
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use &> to capture both stdout and stderr, in case of an error printed we will see it directly in the assert below

@@ -487,13 +488,20 @@ function _check_health_log {
rc=0
wait -n $hc_pid || rc=$?
assert $rc -eq 1 "exit status check of healthcheck command"
assert $(cat $hcStatus) =~ "stopped" "Health status"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"$(< $hcStatus)", cat is not needed. It Would also be good if you could do a full string match == here to avoid any extra error lines printed

Comment on lines +499 to +504
run_podman inspect $ctr --format "{{.State.Health.Status}}"
assert "$output" == "stopped" "Health status"

run_podman inspect $ctr --format "{{.State.Health.FailingStreak}}"
assert "$output" == "0"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

running podman inspect over and over is slow, you can get all the data in one single inspect command

i.e. see #24749 (comment) and 8fa1ffb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/api-change Change to remote API; merits scrutiny release-note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The result of a HealthCheck when the container is stopped during a HealthCheck in progress
2 participants