[Question]: Upgrade implementation vs recommendations in docs #363

ivareri · 2025-01-26T10:43:25Z

Ask a question

Hi

I've been meaning to provide feedback on the upgrade functionality for quite some time, but life has gotten in the way. Maybe this should have been multiple issues, and I might have missed some details or points, but it is what it is.

The latest Elastic docs and the implemented process differ in a few ways:

Stopping ML nodes is not implemented.
The docs say "cluster.routing.allocation.enable": "primaries", while the implementation uses none.
The docs recommend upgrading tier by tier (frozen, cold, warm, hot).
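
For comparison, a task applying the documented setting might look roughly like this. It is only a sketch: `es_api_url` is a hypothetical variable, authentication and TLS options are omitted, and the role's actual task may be structured differently.

```yaml
# Sketch only: es_api_url is a hypothetical variable, not one defined by the role.
# Authentication and TLS parameters are left out for brevity.
- name: Restrict shard allocation to primaries before stopping a node
  ansible.builtin.uri:
    url: "{{ es_api_url }}/_cluster/settings"
    method: PUT
    body_format: json
    body:
      persistent:
        cluster.routing.allocation.enable: primaries
    status_code: 200
```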

Things I have observed during testing:

The wait period for the cluster to return to Green status is not always long enough.
Sometimes the cluster never returns to Green status because there are no eligible nodes for the replica shards.
If a node fails, the entire play should abort. Currently it just drops the failed node and keeps running for the rest of the nodes.

Questions:
Is the "cluster.routing.allocation.enable" based on earlier recommendations, or is there another reason to choose none over primaries?

My biggest blocker currently is that the cluster remains in a yellow state when there are replicas with no eligible nodes. The docs say to proceed with the upgrade in these cases. This means we would have to check the init and relo columns in _cat/health?v=true. This might be either trivial or far from trivial; I'm not sure, to be honest.

Regarding failing entire play vs node, this might be something in my ansible setup, or something in my playbook. I've not had time to give this a hard look yet.
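
One way to get the abort-on-first-failure behaviour at the play level is Ansible's `any_errors_fatal` keyword. The snippet below is a minimal sketch, not taken from the collection: the host group and the role reference are placeholders, and whether this interacts cleanly with how the role iterates over nodes is untested.

```yaml
# Sketch only: the host group and role name are placeholders for illustration.
- name: Rolling upgrade of Elasticsearch
  hosts: elasticsearch
  # Abort the play for all hosts as soon as any host fails a task,
  # instead of dropping the failed host and continuing with the rest.
  any_errors_fatal: true
  roles:
    - role: elasticsearch
```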

Adding a task to start/stop ML nodes should be trivial; I might drop a PR for this if/when I find the time.
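
As a rough sketch of what such tasks could look like, assuming the ML set upgrade mode API is the mechanism used (it temporarily halts ML tasks rather than stopping nodes): `es_api_url` is a hypothetical variable, and authentication and TLS options are omitted.

```yaml
# Sketch only: es_api_url is a hypothetical variable; auth/TLS options omitted.
- name: Enable ML upgrade mode before the rolling upgrade starts
  ansible.builtin.uri:
    url: "{{ es_api_url }}/_ml/set_upgrade_mode?enabled=true"
    method: POST
    status_code: 200
  run_once: true

# ... upgrade the nodes here ...

- name: Disable ML upgrade mode once all nodes are upgraded
  ansible.builtin.uri:
    url: "{{ es_api_url }}/_ml/set_upgrade_mode?enabled=false"
    method: POST
    status_code: 200
  run_once: true
```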


ivareri · 2025-01-27T14:16:18Z

Ok, so by replacing this test

ansible-collection-elasticstack/roles/elasticsearch/tasks/elasticsearch-rolling-upgrade.yml

Line 102 in 87a7dc6

until: "response.json.status == 'green'"

with

until:
  - "response.json.relocating_shards == 0"
  - "response.json.initializing_shards == 0"
  - "response.json.status == 'green' or
     response.json.status == 'yellow'"

it solved the deadlock that occurs when there are no eligible nodes. It should probably verify this status a couple of times before proceeding.
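
A possible way to verify the status more than once is to wrap the same check in a loop, so it has to pass on several separate polls with a pause in between. This is an untested sketch: `es_api_url`, the number of passes, and the retry, delay, and pause values are all assumptions, and a failed poll is simply retried rather than resetting the count.

```yaml
# Sketch only: es_api_url and all numeric values are illustrative assumptions.
- name: Check that the cluster stays settled on three separate polls
  ansible.builtin.uri:
    url: "{{ es_api_url }}/_cluster/health"
    method: GET
    status_code: 200
  register: response
  # With a loop, the registered variable holds the current item's result
  # while "until" is being evaluated.
  until:
    - response.json.relocating_shards == 0
    - response.json.initializing_shards == 0
    - response.json.status in ['green', 'yellow']
  retries: 50
  delay: 30
  # Run the whole check three times, pausing 10 seconds between passes.
  loop: "{{ range(3) | list }}"
  loop_control:
    pause: 10
```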

@widhalmt: Do you want PRs for this against main or #349?

widhalmt · 2025-02-04T15:08:50Z

Hi,

As you know from our collaboration, "life getting in the way" is a thing I know all too well. Your contributions are very welcome and I'm personally extremely thankful for the work you put into it.

To be honest, the "old implementation" came from very old documentation in which updates were done mostly manually. Until now we have not encountered a problem with the update procedure. That doesn't mean I don't believe it exists; we only work with a limited number of different setups.

Please provide PRs against #349 as this will be the new way for updates/upgrades.

I haven't encountered an upgrade where the system was left without eligible nodes. My personal approach would be to treat this as a fatal exception that needs manual intervention. On the other hand, I don't see a problem with using the code you provided in the future, all the more since the official documentation supports the idea of proceeding.

ivareri added the question Further information is requested label Jan 26, 2025
