Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kraken: Add metrics around RT handling #3955

Merged
merged 7 commits into from
Mar 28, 2023
Merged

Kraken: Add metrics around RT handling #3955

merged 7 commits into from
Mar 28, 2023

Conversation

pbougue
Copy link
Contributor

@pbougue pbougue commented Mar 24, 2023

🔍 please review by commit with message

Added metrics:

  • retrieval duration of RT messages from Rabbit
  • number of RT messages retrieved from Rabbit
  • number of RT entities applied
  • RT messages' ages (min/average/max)

Also a small fix.

✔️ tested locally that it works on NewRelic metrics.
❌ Sonar fails only on new code coverage, which is OK on metrics.

JIRA: https://navitia.atlassian.net/browse/NAV-1892

@pbougue pbougue requested review from woshilapin, xlqian and pbench March 24, 2023 17:08
Comment on lines +176 to +190
this->rt_message_age_min_histogram = &prometheus::BuildHistogram()
.Name("kraken_rt_message_age_min_seconds")
.Help("Minimum age of RT message from a batch")
.Labels({{"coverage", coverage}})
.Register(*registry)
.Add({}, create_exponential_buckets(0.5, 2, 10));

this->rt_message_age_average_histogram = &prometheus::BuildHistogram()
.Name("kraken_rt_message_age_average_seconds")
.Help("Average age of RT message from a batch")
.Labels({{"coverage", coverage}})
.Register(*registry)
.Add({}, create_exponential_buckets(0.5, 2, 10));

this->rt_message_age_max_histogram = &prometheus::BuildHistogram()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use a single message_age_histogram ? I seems to me you can recover the max and average from that (not use about the min)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and you could also recover the number of messages from this single histogram

Copy link
Contributor Author

@pbougue pbougue Mar 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit that I didn't dig on this and cut short to have it available on time: the min and max values are not precise from what I saw (maybe it's related to the config of histogram and its buckets).
One other advantage is to avoid tracking that many values, but I'm not sure as it's probably regrouped.
I'll try to dig a bit before making a choice.

EDIT: here is what's exposed by /metrics for average

# HELP kraken_rt_message_age_average_seconds Average age of RT message from a batch
# TYPE kraken_rt_message_age_average_seconds histogram
kraken_rt_message_age_average_seconds_count{coverage="default"} 4
kraken_rt_message_age_average_seconds_sum{coverage="default"} 20.293984
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="0.500000"} 0
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="1.000000"} 0
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="2.000000"} 0
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="4.000000"} 0
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="8.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="16.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="32.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="64.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="128.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="256.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="+Inf"} 4

So min/max are dependant on the number of buckets and their size, we may use something like quantiles but I think it will be too short for this PR.

Copy link
Contributor Author

@pbougue pbougue Mar 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the number of messages, the count from this histogram measures the number of messages retrieved that have a date, not all of them (not sure the ones from chaos have some), and not all the entities applied.
So it can be interesting to have it, but it's not what I aimed for when I listed the needs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my question is why do you need to know min/max precisely?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xlqian : This metric gives the latency between the reception of RT info from provider to the moment it is available for the travelers (kirin's time is monitored too, to complete it).
So it is quite important for some clients to track it, and max or average is important and tracked to some clients.
min is less important, but it's interesting because it measures minimal latency unrelated to RT-messages themselves.

This is actually the end-goal of improving the RT handling time, the metric we are going to track.

To work on it, especially on max, steps of 128s (~2 min), 256s (~4 min) or "above" is too large (as we want to be able to show a precise max to client - and track it for our work).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can specify more precise buckets if it fits your needs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like :

 this->rt_message_average_histogram = &prometheus::BuildHistogram()
                                                  .Name("kraken_rt_message_age_average_seconds")
                                                  .Help(" Age of RT messages")
                                                  .Labels({{"coverage", coverage}})
                                                  .Register(*registry)
                                                  .Add({}, 
prometheus::Histogram::BucketBoundaries{0.1, 0.5, 1,   2,   5,   10, 30, 60, 90, 120, 180});

Copy link
Contributor Author

@pbougue pbougue Mar 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried adding a unique/more precise histogram (see fd891b3 on branch metrics_only_one_histogram_rt_message_age) and the result in NewRelic on max is not good on one test.
As discussed, we will probably move on as-is and decide to dig after the release.

PR opened and closed immediately for memory: #3971

* No more number, count is clearer
* typo on ret`r`ieval corrected
* sonar warned about implicit casts: explicit them
@sonarqubecloud
Copy link

SonarCloud Quality Gate failed.    Quality Gate failed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

29.1% 29.1% Coverage
0.0% 0.0% Duplication

@pbougue pbougue merged commit 937d0e0 into dev Mar 28, 2023
@pbougue pbougue deleted the add_rt_metrics branch March 28, 2023 06:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants