Index commit occasionally fails with AlreadyClosedException in some deployments
#2379
We did encounter this as well. These two errors occur every time DT starts up. I was not able to fix this. First occurred after upgrading to 4.7.0 a few weeks ago.
@holdebeSICKAG Thanks for your input. Does this happen when restarting DT, or generally when starting (as in, a longer delay between stopping and starting the instance)? Are you stopping the DT process gracefully, or killing it forcefully?
Thanks for replying @nscuro. It happens when restarting DT using compose. Also, I never had to forcefully shut down DT.

Observing the behavior for the past few days: the INDEX error for the vulnerability index occurs only at start-up. It first wants to check the index and assumes it is corrupt, as DT is not happy with the lock file. Then it re-creates the index, but the same error occurs.

Furthermore, the error below occurs regularly throughout the day, every time a BOM was processed. So usually, I see these two log lines:

2023-02-02 07:38:45,714 INFO [BomUploadProcessingTask] Processed 233 components and 0 services uploaded to project d697145b-938e-4841-9395-46f121b6da34

If I can give you more information, please tell me. If you want me to test or change something, please tell me too.
We see the same exceptions in our deployment as well.
Thanks @holdebeSICKAG and @olafz! So far it seems like the `write.lock` file really is being modified after the indexes were opened, referring to the part of the exception stating that it was "changed by an external force".
The vast majority of the exceptions happen when committing the indexes. Some examples (with context) below:
Just to rule this out: is anything outside of DT (e.g. other services on the host) possibly modifying files in the index directory? And is the index directory located on a network filesystem, or on local disk?
Separately, DT performs a health check on each index when it starts. Are there any occurrences in your logs where the indexes were not healthy and had to be recreated? Can you correlate the modification timestamps reported in the exceptions with any other events in your logs?
I attached the log file entries from start to DT ready. Maybe it helps:

2023-02-01 09:28:33,667 INFO [EmbeddedJettyServer] alpine-executable-war v2.2.0 (050a7a2a-12f7-4d02-b7b0-cd6ec432eac6) built on: 2022-12-14T20:24:42Z
No
No. It's a locally mounted filesystem.
Yes:
@holdebeSICKAG @olafz Is there a correlation between indexes being reported as unhealthy on startup, and `AlreadyClosedException`s being thrown afterwards? Looking at the code, it's odd that the indexes are reporting an existing lock file. We do have a check in place that will delete existing lock files before checking the index's health (dependency-track/src/main/java/org/dependencytrack/search/IndexManager.java, lines 438 to 444 in 93321f8).
That deletion happens right before the health check opens the index, so at that point the file should've been deleted already. I verified that this is actually happening locally. Potentially some sort of race condition...
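For readers skimming the thread: the check being referred to boils down to deleting a presumed-stale `write.lock` before verifying the index. Below is a minimal sketch of that pattern; it is not the actual `IndexManager` code, and the use of `CheckIndex` here is an assumption about how the health check might be performed.

```java
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class IndexHealthCheckSketch {

    // Illustrative only: remove a (presumed stale) write.lock, then verify the index.
    static boolean isHealthy(Path indexPath) throws IOException {
        // Delete any leftover lock file before the check. If another component still
        // holds this lock, deleting it is exactly the kind of interference discussed above.
        Files.deleteIfExists(indexPath.resolve(IndexWriter.WRITE_LOCK_NAME));

        try (FSDirectory directory = FSDirectory.open(indexPath);
             CheckIndex checkIndex = new CheckIndex(directory)) { // acquires the write lock itself
            return checkIndex.checkIndex().clean;
        }
    }
}
```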
I've spun up a test instance on a server, added loads of projects, and will let it run for a while. So far I'm not seeing any index issues. Already tried restarting the instance at various times, but not able to trigger this behavior. Same is true locally, not able to reproduce. @olafz @holdebeSICKAG Are you able to spin up another instance of DT on your servers and see if the errors occur there as well? Thanks a lot for your help so far!
This is the defect that I should have logged myself after talking to @nscuro back in January.

Anyway, it is still occurring for me with our production 4.7.1: 322 occurrences since the last restart on 1st Feb. It is also occurring on 4.8.0-SNAPSHOT (built on 2023-02-01T18:41:56Z), but only 4 occurrences. Both systems are deployed by ArgoCD on K8s and are on Azure. Both our systems have almost all the same projects and even have matching project UUIDs; BOMs uploaded to production are synced to the test server via webhook and Benthos.

I find it interesting that there is a big disparity between the two systems when it comes to exception count. The test server was set up long after the production server and thus never underwent a v4.0.0 upgrade migration, etc. Maybe there is a need for a tool that could be run to validate local data? Or, what can be deleted and then be allowed to rebuild dynamically, thus (and I say this as an optimistic non-dev) cleaning out bad data?
Thanks @msymons!

At least for the indexes, that's already happening as of #2200. It will even rebuild the index when it detects that the index deviates too much from what's inside the DB.

The entire index directory can also simply be deleted; the indexes will then be rebuilt from the database.
Found the culprit. Since 4.7.0 there is an index health check on startup, but also one recurring every 72h (per default; configurable in the UI at Administration -> Search -> Periodic consistency check -> Cadence). The health check uses Lucene's `CheckIndex` (dependency-track/src/main/java/org/dependencytrack/search/IndexManager.java, lines 438 to 442 in 7cfc466), which acquires the index's write lock.

Write locks, however, are also held by the `IndexWriter` instances that DT keeps open per index. So this is what happens:

1. On startup, DT opens an `IndexWriter` per index, which creates and locks `write.lock`.
2. The health check runs, deletes the supposedly stale `write.lock`, and acquires its own write lock, recreating the file.
3. The next time the original `IndexWriter` commits, Lucene notices that `write.lock` has changed since the writer acquired it and throws the `AlreadyClosedException` ("changed by an external force").
As a short-term fix, disable the periodic consistency check (Administration -> Search -> Periodic consistency check). As for a proper fix in the code, I see multiple options.
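To make the failure mode concrete, here is a minimal, self-contained sketch (not Dependency-Track code) that reproduces the exception by recreating `write.lock` behind an open `IndexWriter`'s back; the temp directory and analyzer choice are arbitrary. On Windows the delete may be refused while the lock is held; on Linux it goes through, which matches the server deployments discussed here.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Files;
import java.nio.file.Path;

public final class WriteLockRaceDemo {

    public static void main(String[] args) throws Exception {
        Path indexDir = Files.createTempDirectory("lucene-lock-demo");

        try (FSDirectory directory = FSDirectory.open(indexDir);
             IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(new StandardAnalyzer()))) {

            Document doc = new Document();
            doc.add(new StringField("id", "1", Field.Store.YES));
            writer.addDocument(doc);
            writer.commit(); // fine: write.lock is still the file the writer created

            // Simulate the "external force": delete and recreate write.lock while the writer
            // still holds it (roughly what a competing health check or a stray `touch` does).
            Path lockFile = indexDir.resolve(IndexWriter.WRITE_LOCK_NAME);
            Files.delete(lockFile);
            Files.createFile(lockFile);

            writer.addDocument(doc);
            writer.commit(); // throws AlreadyClosedException: underlying file changed by an external force
        }
    }
}
```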
Hello @nscuro, thanks for the spot-on analysis.

I don't think that a race condition between the first task execution and the initial health check is that common. I rather implemented this check to protect against a hard process kill/crash, which would leave a stale lock file behind (too many bad experiences with flaky file-based lock implementations 😓). There's maybe another alternative for a fix.
@syalioune Are lock files supposed to / OK to be shared by multiple classes that perform write operations? In my naive thinking that would kind of defeat the purpose of the lock.
You and me both 😓. I was naively thinking that sharing the lock would be fine.

In that case, I vote for the option where the periodic check doesn't acquire the lock itself. The periodic task would only compute the delta ratio between DB and Lucene and fire the corresponding index events when the deviation is too large.
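A rough sketch of that idea, with made-up names and thresholds rather than the actual #2477 implementation: the periodic task only compares document counts and dispatches a rebuild event when the deviation crosses a configurable threshold, and never touches `write.lock` itself.

```java
// Hypothetical sketch of a "delta only" consistency check; the types and names here
// are illustrative and not taken from the actual Dependency-Track code base.
final class DeltaConsistencyCheckSketch {

    interface EventDispatcher {          // stand-in for DT's event system
        void dispatchReindex(String indexName);
    }

    private final EventDispatcher dispatcher;
    private final double threshold;      // e.g. 0.05 = rebuild at more than 5% deviation

    DeltaConsistencyCheckSketch(EventDispatcher dispatcher, double threshold) {
        this.dispatcher = dispatcher;
        this.threshold = threshold;
    }

    // Compare counts only; never touch write.lock or open an IndexWriter here.
    void check(String indexName, long dbCount, long indexCount) {
        if (dbCount == 0 && indexCount == 0) {
            return; // nothing to compare
        }
        double delta = Math.abs(dbCount - indexCount) / (double) Math.max(dbCount, 1);
        if (delta > threshold) {
            // The reindex task owns the IndexWriter and therefore the lock.
            dispatcher.dispatchReindex(indexName);
        }
    }
}
```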
Thanks @syalioune, I raised #2477 that does exactly that. I also added proper closing of the `IndexWriter` instances.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Current Behavior

In some deployments, committing Lucene indexes may fail with an `AlreadyClosedException`, stating that the `write.lock` file of the respective index was "changed by an external force".

`write.lock` is empty and is created when an `IndexWriter` instance is created. DT creates an `IndexWriter` instance per index on startup. Write access to indexes is serialized using a single-threaded `ExecutorService`, so there's no way an index could be modified by multiple threads. Lucene uses timestamps of `write.lock` to determine whether it has been modified after the index was opened.
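As an aside, a minimal illustration (not DT's actual code) of serializing all writes to an `IndexWriter` through a single-threaded executor, which is why concurrent modification from within the application can be ruled out:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class SerializedIndexWriterSketch {

    private final IndexWriter writer;
    // A single-threaded executor guarantees that add/commit operations never overlap.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    SerializedIndexWriterSketch(IndexWriter writer) {
        this.writer = writer;
    }

    void addDocument(String id) {
        executor.submit(() -> {
            try {
                Document doc = new Document();
                doc.add(new StringField("id", id, Field.Store.YES));
                writer.addDocument(doc);
                writer.commit();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}
```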
Steps to Reproduce

I am unable to reproduce locally. It seems like it's happening predominantly on servers, but I am yet unsure what exactly is the cause. Could be third-party services modifying the file, or file systems incapable of tracking file timestamps correctly, or skewed server times.

The exception can be forced by `touch`ing (i.e. modifying access timestamps) the `write.lock` file after DT has started, e.g.:

touch ~/.dependency-track/index/component/write.lock

Afterwards, upload a BOM and monitor the log for `AlreadyClosedException`s.

Expected Behavior
Committing indexes should not fail.
Dependency-Track Version
4.7.0
Dependency-Track Distribution
Container Image, Executable WAR
Database Server
N/A
Database Server Version
No response
Browser
N/A