
[Kraken] Fix aging problem when both kirin & chaos are active #3909

Merged · 8 commits · Jan 18, 2023

Conversation

@xlqian (Member) commented Jan 13, 2023

https://navitia.atlassian.net/browse/NAV-1650

In this PR, several issues are tackled:

1. When both Kirin and Chaos are active, we observed a slow memory leak, and the integration of disruptions got progressively slower. This is due to the infinitely growing uri of the vehicle_journeys impacted by RailSection and LineSection disruptions. During my tests, one of these gargantuan uris reached a length of 33998 characters.

2. A trip may be impacted several times by the disruptions/trip_updates from Kirin, whereas only the last trip_update actually needs to be taken into account. To avoid applying trip_updates redundantly (and thus speed up disruption integration), we apply only the newest trip_update, by reversing the buffer and ignoring older entries that have the same id (a minimal sketch of this deduplication is given right after this list). To be tackled in another PR.

3. Kraken didn't poll all messages from RabbitMQ, even though the buffer (5000 entries by default) was not yet full. This is due to the low timeout (100ms in prod) in BasicConsumeMessage. A quick way to fix that is to increase the timeout, but then, if disruptions arrive every (timeout - 1)ms, in the worst case we have to wait buffer_size * (timeout - 1)ms to fill the buffer (e.g. 5000 * (500 - 1)ms ≈ 2495s) before Kraken starts to handle them. The solution is to add a second, global timeout that limits that waiting time (a sketch of the two-timeout polling loop also follows this list).
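
As a rough, self-contained illustration of the deduplication described in point 2 (a sketch only: the FeedEntity struct and the sample data are made up for the example, this is not the actual Kraken code), the buffer is walked from newest to oldest and an id is applied only the first time it is seen, so only the most recent trip_update per id wins:

```cpp
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a protobuf FeedEntity: only the fields needed here.
struct FeedEntity {
    std::string id;
    std::string payload;
};

int main() {
    // Oldest first, as the entities would sit in the polled buffer.
    std::vector<FeedEntity> buffer = {
        {"trip:A:20230113", "delay +5min"},
        {"trip:B:20230113", "cancelled"},
        {"trip:A:20230113", "delay +10min"},  // newer update for the same trip/date
    };

    std::set<std::string> applied_visited_id;
    // Walk the buffer from newest to oldest; skip any id that was already applied.
    for (auto it = buffer.rbegin(); it != buffer.rend(); ++it) {
        auto res = applied_visited_id.insert(it->id);
        if (!res.second) {
            continue;  // a newer entity with the same id has already been applied
        }
        std::cout << "apply " << it->id << ": " << it->payload << "\n";
    }
    // trip:A gets only "delay +10min"; the older "+5min" update is skipped.
}
```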
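
Similarly, here is a hedged sketch of the two-timeout polling loop from point 3. The RabbitMQ call is stubbed out (pop_one stands in for a BasicConsumeMessage-style blocking consume), and the names and parameter values are illustrative rather than the actual Kraken implementation; the point is the loop shape: a short per-message timeout plus a global deadline, so a steady trickle of messages can no longer delay handling by buffer_size * timeout.

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Stand-in for a blocking consume such as BasicConsumeMessage(tag, envelope, timeout):
// returns a message if one arrives within `timeout`, std::nullopt otherwise.
// Stubbed out here so the example compiles on its own.
std::optional<std::string> pop_one(std::chrono::milliseconds /*timeout*/) {
    return std::nullopt;
}

std::vector<std::string> consume_batch(std::size_t max_batch_size,              // e.g. 5000
                                       std::chrono::milliseconds msg_timeout,   // per message, e.g. 100ms
                                       std::chrono::milliseconds total_timeout) // global cap on waiting
{
    std::vector<std::string> batch;
    const auto deadline = Clock::now() + total_timeout;

    while (batch.size() < max_batch_size && Clock::now() < deadline) {
        auto msg = pop_one(msg_timeout);
        if (!msg) {
            break;  // nothing arrived within msg_timeout: the queue is (momentarily) empty
        }
        batch.push_back(std::move(*msg));
    }
    // We stop when the batch is full, the queue goes quiet, or total_timeout elapses;
    // in all three cases the caller can start applying the batch without further waiting.
    return batch;
}

int main() {
    auto batch = consume_batch(5000, std::chrono::milliseconds(100), std::chrono::seconds(1));
    std::cout << "polled " << batch.size() << " messages\n";
}
```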

@xlqian xlqian marked this pull request as ready for review January 13, 2023 13:37
@pbougue (Contributor) left a comment

👏 A bit shocked that uri itself is the major part of the issue

ℹ️ minor=feel free to disregard.

@woshilapin woshilapin removed their request for review January 13, 2023 15:30
Patrick Qian and others added 3 commits January 13, 2023 17:19
Comment on lines 381 to 384
auto res = applied_visited_id.insert(entity.id());
if (!res.second) {
continue;
}
Contributor commented:

Suggested change
auto res = applied_visited_id.insert(entity.id());
if (!res.second) {
continue;
}
auto res = applied_visited_id.insert(entity.id());
// a newer disruption with the same id has already been seen, so we can ignore this one
if (!res.second) {
continue;
}

I am a bit worried about this "ignore disruptions whose id has already been seen".

  • Do we receive Kirin disruptions with the same id? When this happens, are we sure we only need to take the last one into account? Can we have two disruptions with the same id but that affect different VJs? poke @pbougue
  • What about Chaos disruptions? In particular, when we cancel a disruption, don't we get the same id twice?

@pbougue (Contributor) commented Jan 18, 2023

This may deserve a code comment, as the 3 of us asked ourselves the question 😃

  • For Kirin, it is guaranteed that the entity ID is the same only when the VJ is the same, and that taking only the last one in the queue is valid.
    Details: this is actually not quite true with parallelism on Kirin's side, but the guarantee is that it will be at least as good as before (nothing else can currently be used to decide which message is the last one). And in fact, even with parallelism it is guaranteed (there is no concurrent processing of the same VJ in Kirin).
    💡 This may deserve a comment in https://github.com/hove-io/chaos-proto/blob/master/kirin_proto_doc.rs ? (I can have a shot if you agree)

  • For Chaos, this was discussed with the Chaos team and seemed OK, but I don't know about that particular case; I let @xlqian reply if he knows.

@pbench (Contributor) commented Jan 18, 2023

Alright, if this has been checked, it sounds good to me.
I do agree that some comments (here and in the proto doc) explaining this may be useful.

For Kirin, it is guaranteed that the entity ID is the same only when the VJ is the same

same VJ and same date?

Contributor commented:

Yes, same VJ and same date 👍

✅ I'll try to add a little something into kirin_proto_doc.rs

@xlqian (Member, Author) commented Jan 18, 2023

@pbench Now I'm having second thoughts after your comment, and I think I made a mistake after thinking it through...
@pbougue I'm going to remove this trick (reversing the vector) from this PR and open another PR to tackle this problem, so that if I messed it up, we don't have to revert the whole thing.

Contributor commented:

We would also have to reverse the order of the entities within the message to be sure we still get the same result (although it would be weird for someone to send the same entity multiple times in the same message).

Contributor commented:

After discussion: this was reverted from this PR because of uncertainty around chaos.
Tracked in JIRA https://navitia.atlassian.net/browse/NAV-1878

@sonarqubecloud commented

Kudos, SonarCloud Quality Gate passed!

Bugs: A (0 Bugs)
Vulnerabilities: A (0 Vulnerabilities)
Security Hotspots: A (0 Security Hotspots)
Code Smells: A (36 Code Smells)

Coverage: 61.7%
Duplication: 0.2%
