Add stopped status for HealthCheck #25423

Honny1 · 2025-02-28T11:34:57Z

This PR adds a new status for HealthCheck. If the container is stopped and the ongoing HealthCheck has no chance to complete, the check is evaluated as stopped.

Fixes: https://issues.redhat.com/browse/RUN-2520
Fixes: #25276

Does this PR introduce a user-facing change?

HealthCheck now reports a stopped status if the container stops before the check can complete.

openshift-ci · 2025-02-28T11:35:04Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Honny1
Once this PR has been reviewed and has the lgtm label, please assign l0rd for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

If the container is stopped and the ongoing HealthCheck has no chance to complete, the check is evaluated as stopped.

Fixes: https://issues.redhat.com/browse/RUN-2520
Fixes: containers#25276

Signed-off-by: Jan Rodák <[email protected]>

Luap99 · 2025-02-28T15:59:06Z

libpod/container_internal.go

+	if !c.batched {
+		c.lock.Lock()
+		defer c.lock.Unlock()
+
+		if err := c.syncContainer(); err != nil {
+			logrus.Errorf("Error syncing container %s state: %v", c.ID(), err)
+			return false
+		}
+	}


This should really not lock and sync; IMO this function should simply not exist. While I think your usage for healthcheck is fine, this will totally be misused. First, a private function should not take the container lock here; that is totally inconsistent with how we deal with the majority of libpod functions.
Most importantly, this can only lead to unsound code: you should never check the state, then unlock, and then do X, Y, Z based on the state while being unlocked. So IMO just delete this func.
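
A minimal sketch of the check-then-act concern, assuming a hypothetical `container` type with a plain mutex and a string state (not libpod's real types). `isStopped` unlocks before the caller can act on its answer, while the hypothetical `markIfStopped` keeps the check and the action inside one critical section:

```go
package main

import (
	"fmt"
	"sync"
)

// container is a hypothetical stand-in for a libpod container; only a
// lock and a state field are modeled.
type container struct {
	mu    sync.Mutex
	state string
}

// isStopped locks, reads the state, and unlocks again. The answer may
// already be stale when the caller acts on it, which is the unsound
// pattern described above.
func (c *container) isStopped() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.state == "stopped"
}

// markIfStopped performs the check and the dependent action while the
// lock is held, so the state cannot change in between.
func (c *container) markIfStopped(status *string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.state != "stopped" {
		return false
	}
	*status = "stopped"
	return true
}

func main() {
	c := &container{state: "stopped"}
	var status string
	fmt.Println(c.markIfStopped(&status), status)
}
```

Code that calls `isStopped` and then acts on the result is only correct if nothing else can change the state in the meantime, which the helper itself cannot guarantee.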

Luap99 · 2025-02-28T16:01:50Z

libpod/healthcheck.go

+	if exitCode != 0 && c.ensureCurrentState(define.ContainerStateStopped, define.ContainerStateStopping, define.ContainerStateExited) {
+		hcResult = define.HealthCheckContainerStopped
+	}
+
 	hcl := newHealthCheckLog(timeStart, timeEnd, returnCode, eventLog)

-	healthCheckResult, err := c.updateHealthCheckLog(hcl, inStartPeriod, isStartup)
+	healthCheckResult, err := c.updateHealthCheckLog(hcl, hcResult, inStartPeriod, isStartup)


ensureCurrentState takes the lock and updateHealthCheckLog then takes the lock again; this is inefficient and should not happen.
I would say remove both locks from the functions and take one lock here in the caller instead.
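
One way the suggestion could look, sketched with hypothetical names (`countingMutex`, `inState`, `setStatus`, `finishCheck`) and string states instead of the real `define` constants. The helpers do no locking and rely on the caller holding the lock; the counter exists only to show that a single acquisition is enough:

```go
package main

import (
	"fmt"
	"sync"
)

// countingMutex wraps sync.Mutex and counts acquisitions. It is purely
// illustrative, to make the number of lock operations visible.
type countingMutex struct {
	sync.Mutex
	acquired int
}

func (m *countingMutex) Lock() {
	m.Mutex.Lock()
	m.acquired++
}

// container is a hypothetical stand-in, not libpod's Container.
type container struct {
	lock   countingMutex
	state  string
	status string
}

// inState does no locking; the caller must hold c.lock.
func (c *container) inState(states ...string) bool {
	for _, s := range states {
		if c.state == s {
			return true
		}
	}
	return false
}

// setStatus does no locking; the caller must hold c.lock.
func (c *container) setStatus(status string) {
	c.status = status
}

// finishCheck takes the lock once and performs both the state check and
// the status update under it.
func (c *container) finishCheck(exitCode int) {
	c.lock.Lock()
	defer c.lock.Unlock()
	if exitCode != 0 && c.inState("stopped", "stopping", "exited") {
		c.setStatus("stopped")
	}
}

func main() {
	c := &container{state: "exited"}
	c.finishCheck(1)
	fmt.Println(c.status, c.lock.acquired)
}
```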

Luap99 · 2025-02-28T16:09:12Z

libpod/healthcheck.go

+		if hcResult == define.HealthCheckContainerStopped {
+			healthCheck.Status = define.HealthCheckStopped
+			healthCheck.FailingStreak = oldFailingStreak
+		}


This looks like the wrong ordering: there is no need to reset FailingStreak when you don't update it in the first place.

	if hcResult == define.HealthCheckContainerStopped {
		healthCheck.Status = define.HealthCheckStopped
	} else if !inStartPeriod {
		// increment failing streak
		healthCheck.FailingStreak++
		// if failing streak > retries, then status to unhealthy
		if healthCheck.FailingStreak >= c.HealthCheckConfig().Retries {
			healthCheck.Status = define.HealthCheckUnhealthy
		}
	}
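
The effect of this ordering can be tried in isolation. The sketch below is standalone and uses hypothetical types (`result`, `health`) and plain strings for the status values, which are assumptions rather than the real `define` constants; `applyFailure` mirrors only the branch structure suggested above:

```go
package main

import "fmt"

// result is a hypothetical stand-in for the health check result type.
type result int

const (
	resultFailure result = iota
	resultContainerStopped
)

// health is a hypothetical stand-in for the stored health check state.
type health struct {
	Status        string
	FailingStreak int
}

// applyFailure handles a failed check. A stopped container only changes
// the status; the failing streak is never touched on that path, so there
// is nothing to restore afterwards.
func applyFailure(h *health, res result, inStartPeriod bool, retries int) {
	if res == resultContainerStopped {
		h.Status = "stopped"
	} else if !inStartPeriod {
		h.FailingStreak++
		if h.FailingStreak >= retries {
			h.Status = "unhealthy"
		}
	}
}

func main() {
	h := &health{Status: "healthy"}
	applyFailure(h, resultContainerStopped, false, 3)
	fmt.Printf("%s %d\n", h.Status, h.FailingStreak)
}
```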

Luap99 · 2025-02-28T16:09:52Z

test/system/220-healthcheck.bats


    run_podman run -d --name $ctr             \
           --health-cmd "sleep 20; echo $msg" \
           $IMAGE /home/podman/pause

    timeout --foreground -v --kill=10 60 \
-        $PODMAN healthcheck run $ctr &
+        $PODMAN healthcheck run $ctr > $hcStatus &


Use &> to capture both stdout and stderr; if an error is printed, we will see it directly in the assert below.

Luap99 · 2025-02-28T16:11:34Z

test/system/220-healthcheck.bats

@@ -487,13 +488,20 @@ function _check_health_log {
    rc=0
    wait -n $hc_pid || rc=$?
    assert $rc -eq 1 "exit status check of healthcheck command"
+    assert $(cat $hcStatus) =~ "stopped" "Health status"


Use "$(< $hcStatus)"; cat is not needed. It would also be good to do a full string match with == here, so that any extra error lines printed make the assert fail.

Luap99 · 2025-02-28T16:16:50Z

test/system/220-healthcheck.bats

+    run_podman inspect $ctr --format "{{.State.Health.Status}}"
+    assert "$output" == "stopped" "Health status"
+
+    run_podman inspect $ctr --format "{{.State.Health.FailingStreak}}"
+    assert "$output" == "0"
+


Running podman inspect over and over is slow; you can get all the data in one single inspect command.

For example, see #24749 (comment) and 8fa1ffb.
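
Since `--format` takes a Go template, one format string can reference several fields at once. The sketch below only demonstrates that with the standard `text/template` package; the `inspectData` struct is a hypothetical, heavily reduced stand-in for the real inspect output, and `render` is an illustrative helper, not podman code:

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Hypothetical, reduced model of the inspect data used in the test.
type health struct {
	Status        string
	FailingStreak int
}

type state struct {
	Health health
}

type inspectData struct {
	State state
}

// render executes a Go template against the data and returns the text.
func render(format string, data inspectData) (string, error) {
	tmpl, err := template.New("inspect").Parse(format)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	data := inspectData{State: state{Health: health{Status: "stopped"}}}
	// Both fields come from a single template, so one call is enough.
	out, err := render("{{.State.Health.Status}} {{.State.Health.FailingStreak}}", data)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

In the test, the combined output of such a single call could then be compared against one expected string.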

openshift-ci bot added the release-note and do-not-merge/work-in-progress labels Feb 28, 2025

Add stopped status for HealthCheck

d6d7ea1


Honny1 force-pushed the hc-kill-status branch from dcc1405 to d6d7ea1 Compare February 28, 2025 11:54

github-actions bot added the kind/api-change label Feb 28, 2025

Honny1 marked this pull request as ready for review February 28, 2025 13:16

openshift-ci bot removed the do-not-merge/work-in-progress label Feb 28, 2025

Luap99 reviewed Feb 28, 2025
