-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Kraken: Add metrics around RT handling #3955
Conversation
These metrics will indicate the time between the RT message's emission (from Chaos or Kirin) and the moment they are available for requests. JIRA: https://navitia.atlassian.net/browse/NAV-1892
this->rt_message_age_min_histogram = &prometheus::BuildHistogram() | ||
.Name("kraken_rt_message_age_min_seconds") | ||
.Help("Minimum age of RT message from a batch") | ||
.Labels({{"coverage", coverage}}) | ||
.Register(*registry) | ||
.Add({}, create_exponential_buckets(0.5, 2, 10)); | ||
|
||
this->rt_message_age_average_histogram = &prometheus::BuildHistogram() | ||
.Name("kraken_rt_message_age_average_seconds") | ||
.Help("Average age of RT message from a batch") | ||
.Labels({{"coverage", coverage}}) | ||
.Register(*registry) | ||
.Add({}, create_exponential_buckets(0.5, 2, 10)); | ||
|
||
this->rt_message_age_max_histogram = &prometheus::BuildHistogram() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use a single message_age_histogram
? I seems to me you can recover the max and average from that (not use about the min)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and you could also recover the number of messages from this single histogram
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I admit that I didn't dig on this and cut short to have it available on time: the min and max values are not precise from what I saw (maybe it's related to the config of histogram and its buckets).
One other advantage is to avoid tracking that many values, but I'm not sure as it's probably regrouped.
I'll try to dig a bit before making a choice.
EDIT: here is what's exposed by /metrics
for average
# HELP kraken_rt_message_age_average_seconds Average age of RT message from a batch
# TYPE kraken_rt_message_age_average_seconds histogram
kraken_rt_message_age_average_seconds_count{coverage="default"} 4
kraken_rt_message_age_average_seconds_sum{coverage="default"} 20.293984
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="0.500000"} 0
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="1.000000"} 0
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="2.000000"} 0
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="4.000000"} 0
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="8.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="16.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="32.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="64.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="128.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="256.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="+Inf"} 4
So min/max are dependant on the number of buckets and their size, we may use something like quantiles but I think it will be too short for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for the number of messages, the count
from this histogram measures the number of messages retrieved that have a date, not all of them (not sure the ones from chaos have some), and not all the entities applied.
So it can be interesting to have it, but it's not what I aimed for when I listed the needs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my question is why do you need to know min/max precisely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xlqian : This metric gives the latency between the reception of RT info from provider to the moment it is available for the travelers (kirin's time is monitored too, to complete it).
So it is quite important for some clients to track it, and max
or average
is important and tracked to some clients.
min
is less important, but it's interesting because it measures minimal latency unrelated to RT-messages themselves.
This is actually the end-goal of improving the RT handling time, the metric we are going to track.
To work on it, especially on max, steps of 128s (~2 min), 256s (~4 min) or "above" is too large (as we want to be able to show a precise max to client - and track it for our work).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can specify more precise buckets if it fits your needs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like :
this->rt_message_average_histogram = &prometheus::BuildHistogram()
.Name("kraken_rt_message_age_average_seconds")
.Help(" Age of RT messages")
.Labels({{"coverage", coverage}})
.Register(*registry)
.Add({},
prometheus::Histogram::BucketBoundaries{0.1, 0.5, 1, 2, 5, 10, 30, 60, 90, 120, 180});
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried adding a unique/more precise histogram (see fd891b3 on branch metrics_only_one_histogram_rt_message_age
) and the result in NewRelic on max is not good on one test.
As discussed, we will probably move on as-is and decide to dig after the release.
PR opened and closed immediately for memory: #3971
* No more number, count is clearer * typo on ret`r`ieval corrected * sonar warned about implicit casts: explicit them
SonarCloud Quality Gate failed. |
🔍 please review by commit with message
Added metrics:
Also a small fix.
✔️ tested locally that it works on NewRelic metrics.
❌ Sonar fails only on new code coverage, which is OK on metrics.
JIRA: https://navitia.atlassian.net/browse/NAV-1892