[Kraken] Fix aging problem when both kirin & chaos are active #3909
Conversation
👏 A bit shocked that uri itself is the major part of the issue
ℹ️ minor = feel free to disregard.
Co-authored-by: Pierre-Etienne Bougué <[email protected]>
source/kraken/maintenance_worker.cpp (outdated)
```cpp
auto res = applied_visited_id.insert(entity.id());
if (!res.second) {
    continue;
}
```
Suggested change (add an explanatory comment):
```cpp
auto res = applied_visited_id.insert(entity.id());
// a newer disruption with the same id has already been seen, so we can ignore this one
if (!res.second) {
    continue;
}
```
I am a bit worried about this "ignore disruptions whose id has already been seen" approach.
- do we receive kirin disruptions with the same id? When this happens, are we sure we only need to take into account the last one? Can we have two disruptions with the same id that affect different VJs? poke @pbougue
- what about chaos disruptions? In particular, when we cancel a disruption, don't we get the same id twice?
This may deserve a code-comment, as the 3 of us asked ourselves the question 😃
- For kirin, it is guaranteed that the entity ID is the same only when the VJ is the same, and that taking only the last one in the queue is valid.
  Details: actually it's not entirely true with parallelism on kirin, but the guarantee is that it will be as good as it was (nothing else can currently be used to decide which message is the last one). Actually, even with parallelism it is guaranteed (there is no concurrent processing of the same VJ in Kirin).
  💡 This may deserve a comment in https://github.com/hove-io/chaos-proto/blob/master/kirin_proto_doc.rs ? (I can have a shot if you agree)
- For chaos, this was discussed with the chaos team and seemed OK, but I don't know about that particular case; I let @xlqian reply if he knows.
Alright, if this has been checked, it sounds good to me.
I do agree that some comments (here and in the proto doc) explaining this may be useful.

> For kirin, it is guaranteed that the entity ID is the same only when the VJ is the same

same VJ and same date?
Yes, same VJ and same date 👍
✅ I'll try to add a little something into kirin_proto_doc.rs
@pbench Now I'm having second thoughts because of your comment, and I think I made a mistake after thinking it through...
@pbougue I'm going to remove this trick (reversing the vector) from this PR and open another PR to tackle this problem, so that in case I messed it up, we don't have to revert the whole thing.
We also have to reverse the order of the entities within each message to be sure that we still get the same result (although it would be weird for someone to send the same entity multiple times in the same message).
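For illustration, a minimal sketch of this idea, with hypothetical type and function names (not the actual maintenance_worker.cpp code): iterate the buffered messages and the entities inside each message in reverse, and skip any entity whose id has already been seen, so the newest occurrence always wins, even when duplicates appear inside a single message.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for the protobuf entities kraken consumes;
// names are illustrative, not the real chaos/kirin types.
struct Entity {
    std::string id_;
    const std::string& id() const { return id_; }
};
struct FeedMessage {
    std::vector<Entity> entities;
};

// Walk the buffered messages, and the entities inside each message,
// from newest to oldest; apply an entity only the first time its id is
// seen, so the newest occurrence always wins.
template <typename ApplyFn>
void apply_newest_only(const std::vector<FeedMessage>& buffer, ApplyFn apply) {
    std::unordered_set<std::string> applied_visited_id;
    for (auto msg = buffer.rbegin(); msg != buffer.rend(); ++msg) {
        for (auto entity = msg->entities.rbegin(); entity != msg->entities.rend(); ++entity) {
            auto res = applied_visited_id.insert(entity->id());
            if (!res.second) {
                continue;  // a newer entity with the same id has already been applied
            }
            apply(*entity);
        }
    }
}
```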
After discussion: this was reverted from this PR because of uncertainty around chaos.
Tracked in JIRA https://navitia.atlassian.net/browse/NAV-1878
Kudos, SonarCloud Quality Gate passed!
…y.id

As documented, this is possible to save some processing for both chaos and kirin. See:
https://github.com/hove-io/chaos-proto/blob/ab4a1265f933a98bf0efee65f498a331a260b3b4/chaos_proto_doc.rs#L37-L39
and
https://github.com/hove-io/chaos-proto/blob/ab4a1265f933a98bf0efee65f498a331a260b3b4/kirin_proto_doc.rs#L37-L39
Inspired by #3909 (comment), temporarily reverted in 6bfc01c#diff-e3c74159dd251840eba27db267ef32654abe662d6a1333da9a7878dbf866a338
JIRA: https://navitia.atlassian.net/browse/NAV-1878
https://navitia.atlassian.net/browse/NAV-1650
In this PR, several issues are tackled.

1. Gargantuan uris of the `vehicle_journey` impacted by `RailSection` and `LineSection`: during my test, I spotted that one of these gargantuan uris attained a length of 33998 characters.
2. A trip may be impacted several times by the disruptions/trip_updates from kirin, whereas only the last trip_update will be taken into account. In order to avoid applying trip_updates redundantly, aka a faster disruption integration ratio, we just apply the newest trip_update by reversing the buffer and ignoring the older ones which have the same id.
3. To be tackled in another PR: this is due to the low timeout (100ms in prod) in `BasicConsumeMessage`. A fast way to fix that is to increase that timeout. The risk is that if disruptions arrive every (timeout - 1ms), in the worst case we have to wait `buffer_size * (timeout - 1ms)` to fill the buffer (e.g. 5000 * (500 - 1)ms ≈ 25s) before kraken starts to handle them. The solution is to add another timeout to limit that waiting time (see the sketch after this list).
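To make the "second timeout" idea concrete, here is a rough sketch under assumed names (`Message`, `consume_one` and `fill_buffer` are placeholders; the real worker calls the AMQP client's `BasicConsumeMessage`): keep the short per-message timeout, but also stop filling the buffer once an overall deadline has passed, so a steady trickle of messages cannot postpone processing by up to `buffer_size * timeout`.

```cpp
#include <chrono>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

struct Message {
    std::string body;
};

// Placeholder for the real AMQP call (BasicConsumeMessage in the actual
// worker); here it just simulates an idle queue by always timing out.
std::optional<Message> consume_one(std::chrono::milliseconds /*timeout*/) {
    return std::nullopt;
}

// Fill the buffer until it is full, until one poll times out, or until an
// overall deadline expires, whichever comes first. The deadline bounds the
// worst case where messages keep arriving just before each per-message
// timeout, which would otherwise delay processing by up to
// buffer_size * per_message_timeout.
std::vector<Message> fill_buffer(std::size_t buffer_size,
                                 std::chrono::milliseconds per_message_timeout,
                                 std::chrono::milliseconds max_total_wait) {
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + max_total_wait;

    std::vector<Message> buffer;
    buffer.reserve(buffer_size);
    while (buffer.size() < buffer_size && clock::now() < deadline) {
        auto msg = consume_one(per_message_timeout);
        if (!msg) {
            break;  // queue stayed idle for per_message_timeout: handle what we have
        }
        buffer.push_back(std::move(*msg));
    }
    return buffer;
}
```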