[Question]: Upgrade implementation vs recommendations in docs #363

Open
ivareri opened this issue Jan 26, 2025 · 2 comments
Labels
question Further information is requested

Comments

Contributor

ivareri commented Jan 26, 2025

Ask a question

Hi

I've been meaning to provide feedback on the upgrade functionality for quite some time, but life has gotten in the way. Maybe this should have been multiple issues, and I might have missed some details or points, but it is what it is.

The latest Elastic docs and the implemented process differ in a few ways:

  • Stopping ML nodes is not implemented.
  • The docs say "cluster.routing.allocation.enable": "primaries", while the implementation uses none.
  • The docs recommend upgrading tier by tier (frozen, then cold, warm, and hot).
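
For reference, the docs' variant of the allocation step could be sketched as an Ansible task like the one below. This is an illustration, not the collection's actual task: `es_api_url` is a hypothetical variable holding the cluster endpoint, and authentication and TLS options are left out.

```yaml
# Sketch only: es_api_url is a hypothetical variable; auth/TLS options omitted.
- name: Restrict shard allocation to primaries before stopping a node
  ansible.builtin.uri:
    url: "{{ es_api_url }}/_cluster/settings"
    method: PUT
    body_format: json
    body:
      persistent:
        cluster.routing.allocation.enable: primaries
    status_code: 200

# Once the upgraded node has rejoined, setting the value to null
# resets it to the Elasticsearch default (all).
- name: Re-enable shard allocation
  ansible.builtin.uri:
    url: "{{ es_api_url }}/_cluster/settings"
    method: PUT
    body_format: json
    body:
      persistent:
        cluster.routing.allocation.enable: null
    status_code: 200
```

With `primaries`, primary shards can still be allocated while a node is down, whereas `none` blocks allocation for all shards.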

Things I have observed during testing:

  • The wait period for a cluster to return to green status is not always long enough.
  • Sometimes the cluster never returns to green status because there are no eligible nodes for the replica shards.
  • If a node fails, the entire play should abort. Currently it just drops the node that failed and keeps running for the rest of the nodes.

Questions:
Is the "cluster.routing.allocation.enable" value based on earlier recommendations, or is there another reason to choose none over primaries?

My biggest blocker currently is that the cluster remains in a yellow state when there are replicas with no eligible nodes. The docs say to proceed with the upgrade in these cases. This means we would have to check the init and relo columns in _cat/health?v=true. This might be trivial or far from trivial; I'm not sure, to be honest.

Regarding failing the entire play vs. a single node, this might be something in my Ansible setup or in my playbook. I've not had time to give this a hard look yet.
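
For reference, Ansible has a play-level keyword, `any_errors_fatal`, that aborts the play for all hosts as soon as one host fails. A minimal sketch of how it might be set on a rolling-upgrade play follows; the group name and role name are hypothetical placeholders, not the names used in this collection.

```yaml
# Sketch only: the hosts group and role name are hypothetical placeholders.
- name: Rolling upgrade of Elasticsearch nodes
  hosts: elasticsearch
  serial: 1                # upgrade one node at a time
  any_errors_fatal: true   # abort the whole play as soon as any host fails
  roles:
    - role: elasticsearch
```

Whether this changes the observed behaviour depends on how the failing task is handled inside the role, for example with `ignore_errors` or a `rescue` block.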

Adding a task to start/stop ML nodes should be trivial; I might drop a PR for this if/when I find the time.
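
One option the Elasticsearch rolling-upgrade docs describe for machine learning is to temporarily halt ML tasks through the `_ml/set_upgrade_mode` API, rather than stopping the nodes themselves. A rough sketch of such tasks, assuming a hypothetical `es_api_url` variable and omitting authentication:

```yaml
# Sketch only: es_api_url is a hypothetical variable; auth/TLS options omitted.
- name: Halt ML jobs and datafeeds before the upgrade
  ansible.builtin.uri:
    url: "{{ es_api_url }}/_ml/set_upgrade_mode?enabled=true"
    method: POST
    status_code: 200
  run_once: true

# ... upgrade the nodes here ...

- name: Resume ML jobs and datafeeds after the upgrade
  ansible.builtin.uri:
    url: "{{ es_api_url }}/_ml/set_upgrade_mode?enabled=false"
    method: POST
    status_code: 200
  run_once: true
```

Because `run_once` applies per batch, in a play that uses `serial` these tasks would more naturally live in separate plays before and after the rolling part.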

ivareri added the question label Jan 26, 2025
Contributor Author

ivareri commented Jan 27, 2025

Ok, so by replacing this test

with

    until:
      - "response.json.relocating_shards == 0"
      - "response.json.initializing_shards == 0"
      - "response.json.status == 'green' or
         response.json.status == 'yellow'"

it solved the deadlock with no eligible nodes. It should probably verify this status a couple of times before proceeding.
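
A possible shape for the full task, including the repeated verification, is sketched below. It is untested against this collection: `es_api_url` is a hypothetical variable, authentication is omitted, and the retry, delay, and repeat counts are arbitrary.

```yaml
# Sketch only: es_api_url is hypothetical; all counts and delays are arbitrary.
- name: Wait until no shards are initializing or relocating
  ansible.builtin.uri:
    url: "{{ es_api_url }}/_cluster/health"
    method: GET
    status_code: 200
  register: response
  until:
    - response.json.relocating_shards == 0
    - response.json.initializing_shards == 0
    - response.json.status in ['green', 'yellow']
  retries: 60   # per check: up to 60 attempts ...
  delay: 10     # ... ten seconds apart
  # Run the whole check three times, five seconds apart, so that one
  # quiet sample between two relocations is not enough to proceed.
  loop: "{{ range(3) | list }}"
  loop_control:
    pause: 5
```

When `until` is combined with `loop`, the condition is evaluated separately for each loop item, so each of the three passes has to succeed on its own.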

@widhalmt: Do you want PRs for this against main or #349?

Member

widhalmt commented Feb 4, 2025

Hi,

As you know from our collaboration, "life getting in the way" is a thing I know all too well. Your contributions are very welcome, and I'm personally extremely thankful for the work you put into this.

To be honest, the "old implementation" came from very old documentation, back when updates were done mostly manually. Until now we have not encountered a problem with the update procedure. That doesn't mean I don't believe one exists; we only work with a limited number of different setups.

Please provide PRs against #349 as this will be the new way for updates/upgrades.

I haven't encountered an upgrade where the system was left without eligible nodes. My personal approach would be to treat this as a fatal exception that needs manual intervention. On the other hand, I don't see a problem with using the code you provided in the future, all the more since the official documentation supports the idea of proceeding.
