Kraken: Add metrics around RT handling #3955

pbougue · 2023-03-24T17:08:33Z

🔍 please review by commit with message

Added metrics:

retrieval duration of RT messages from Rabbit
number of RT messages retrieved from Rabbit
number of RT entities applied
RT messages' ages (min/average/max)

Also a small fix.

✔️ tested locally that it works on NewRelic metrics.
❌ Sonar fails only on new code coverage, which is OK on metrics.

JIRA: https://navitia.atlassian.net/browse/NAV-1892

These metrics will indicate the time between the RT message's emission (from Chaos or Kirin) and the moment it is available for requests.
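As an illustration of the min/average/max ages listed above, here is a minimal sketch of how one batch of message ages could be summarized before being observed by the metrics. The `AgeStats` struct and `summarize_ages` function are hypothetical names and are not taken from the Kraken code. Ages are assumed to be already computed in seconds (now minus emission timestamp) for the messages that carry a date.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical summary of one batch of RT message ages, in seconds.
struct AgeStats {
    double min;
    double average;
    double max;
};

// ages_seconds holds (now - emission timestamp) for each dated message of a batch.
// Assumption: the caller skips empty batches, so the vector is never empty here.
AgeStats summarize_ages(const std::vector<double>& ages_seconds) {
    assert(!ages_seconds.empty());
    const auto min_max = std::minmax_element(ages_seconds.begin(), ages_seconds.end());
    const double sum = std::accumulate(ages_seconds.begin(), ages_seconds.end(), 0.0);
    return {*min_max.first, sum / static_cast<double>(ages_seconds.size()), *min_max.second};
}
```

Each of the three values would then be observed once per batch, which is why the PR registers three separate histograms.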

pbench · 2023-03-27T08:42:16Z

source/kraken/metrics.cpp

+    this->rt_message_age_min_histogram = &prometheus::BuildHistogram()
+                                              .Name("kraken_rt_message_age_min_seconds")
+                                              .Help("Minimum age of RT message from a batch")
+                                              .Labels({{"coverage", coverage}})
+                                              .Register(*registry)
+                                              .Add({}, create_exponential_buckets(0.5, 2, 10));
+
+    this->rt_message_age_average_histogram = &prometheus::BuildHistogram()
+                                                  .Name("kraken_rt_message_age_average_seconds")
+                                                  .Help("Average age of RT message from a batch")
+                                                  .Labels({{"coverage", coverage}})
+                                                  .Register(*registry)
+                                                  .Add({}, create_exponential_buckets(0.5, 2, 10));
+
+    this->rt_message_age_max_histogram = &prometheus::BuildHistogram()


Why not use a single message_age_histogram? It seems to me you can recover the max and average from that (not sure about the min).

and you could also recover the number of messages from this single histogram

I admit that I didn't dig into this and cut it short to have it available on time: from what I saw, the min and max values are not precise (maybe it's related to the configuration of the histogram and its buckets).
Another advantage would be to avoid tracking that many values, but I'm not sure, as they are probably grouped together.
I'll try to dig a bit before making a choice.

EDIT: here is what's exposed by /metrics for average

# HELP kraken_rt_message_age_average_seconds Average age of RT message from a batch
# TYPE kraken_rt_message_age_average_seconds histogram
kraken_rt_message_age_average_seconds_count{coverage="default"} 4
kraken_rt_message_age_average_seconds_sum{coverage="default"} 20.293984
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="0.500000"} 0
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="1.000000"} 0
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="2.000000"} 0
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="4.000000"} 0
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="8.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="16.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="32.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="64.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="128.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="256.000000"} 4
kraken_rt_message_age_average_seconds_bucket{coverage="default",le="+Inf"} 4

So min/max are dependent on the number of buckets and their size. We could use something like quantiles, but I think time is too short to do that in this PR.
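To make the point about buckets concrete, here is a rough sketch of what can be recovered from a single cumulative histogram such as the one exposed above. The `HistogramSnapshot` struct and both functions are hypothetical helpers written for this illustration. They are not part of prometheus-cpp or Kraken. The average is exact because the histogram keeps `_sum` and `_count`, while the maximum is only known to fall at or below the first bucket bound that already contains every observation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical in-memory copy of what /metrics exposes for one histogram:
// upper_bounds[i] is the "le" value of a finite bucket and cumulative_counts[i]
// is the number of observations less than or equal to that bound.
struct HistogramSnapshot {
    std::vector<double> upper_bounds;
    std::vector<std::uint64_t> cumulative_counts;
    double sum;
    std::uint64_t count;
};

// Exact average, computed from the exposed sum and count.
double average(const HistogramSnapshot& h) {
    assert(h.count > 0);
    return h.sum / static_cast<double>(h.count);
}

// Upper bound on the maximum: the first bucket that already holds every observation.
// Returns +infinity when some observation exceeds the last finite bound.
double max_upper_bound(const HistogramSnapshot& h) {
    assert(h.upper_bounds.size() == h.cumulative_counts.size());
    for (std::size_t i = 0; i < h.upper_bounds.size(); ++i) {
        if (h.cumulative_counts[i] == h.count) {
            return h.upper_bounds[i];
        }
    }
    return std::numeric_limits<double>::infinity();
}
```

With the values exposed above (sum 20.293984, count 4), the average is about 5.07 s, while the maximum can only be placed between 4 s and 8 s.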

As for the number of messages, the count from this histogram measures the number of retrieved messages that have a date, not all of them (I'm not sure the ones from Chaos have one), and not all the entities applied.
So it can be interesting to have, but it's not what I aimed for when I listed the needs.

my question is why do you need to know min/max precisely?

@xlqian: This metric gives the latency between the reception of RT info from the provider and the moment it is available to travelers (Kirin's time is monitored too, to complete it).
So it is quite important for some clients to track it, and max or average is important to, and tracked for, some clients.
min is less important, but it's interesting because it measures the minimal latency, unrelated to the RT messages themselves.

This is actually the end-goal of improving the RT handling time, the metric we are going to track.

To work on it, especially on max, steps of 128s (~2 min), 256s (~4 min) or "above" are too coarse (as we want to be able to show a precise max to clients, and track it for our work).
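The coarseness mentioned here can be quantified with a small sketch. `exponential_buckets` is a hypothetical stand-in for the PR's `create_exponential_buckets(0.5, 2, 10)` helper. The meaning of its arguments (first bound, growth factor, number of buckets) is an assumption, although it is consistent with the `le` values shown in the /metrics output above. `widest_bucket` is likewise an illustrative helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for create_exponential_buckets: returns `count` upper
// bounds, starting at `start` and multiplying by `factor` at each step.
std::vector<double> exponential_buckets(double start, double factor, std::size_t count) {
    assert(start > 0.0 && factor > 1.0);
    std::vector<double> bounds;
    bounds.reserve(count);
    double bound = start;
    for (std::size_t i = 0; i < count; ++i) {
        bounds.push_back(bound);
        bound *= factor;
    }
    return bounds;
}

// Width of the widest finite bucket, i.e. the worst-case imprecision on a value
// read back from the histogram. Bounds must be sorted in increasing order.
double widest_bucket(const std::vector<double>& bounds) {
    assert(!bounds.empty());
    double widest = bounds.front();  // the first bucket spans [0, bounds[0]]
    for (std::size_t i = 1; i < bounds.size(); ++i) {
        widest = std::max(widest, bounds[i] - bounds[i - 1]);
    }
    return widest;
}
```

Under these assumptions, the ten exponential buckets end at 256 s and the last one is 128 s wide, whereas a hand-picked list of boundaries can keep every step much smaller, at the price of choosing the range in advance.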

you can specify more precise buckets if it fits your needs

Something like :

this->rt_message_average_histogram = &prometheus::BuildHistogram()
                                          .Name("kraken_rt_message_age_average_seconds")
                                          .Help(" Age of RT messages")
                                          .Labels({{"coverage", coverage}})
                                          .Register(*registry)
                                          .Add({}, prometheus::Histogram::BucketBoundaries{0.1, 0.5, 1, 2, 5, 10, 30, 60, 90, 120, 180});

I tried adding a unique/more precise histogram (see fd891b3 on branch metrics_only_one_histogram_rt_message_age) and the result in NewRelic on max is not good on one test.
As discussed, we will probably move on as-is and decide to dig after the release.

PR opened and closed immediately, for the record: #3971

sonarqubecloud · 2023-03-27T14:22:26Z

SonarCloud Quality Gate failed.

0 Bugs
0 Vulnerabilities
0 Security Hotspots
1 Code Smell

29.1% Coverage
0.0% Duplication

Pierre-Etienne Bougue added 5 commits March 24, 2023 11:37

Add metric on retrieval duration of RT messages from Rabbit

c96150f

JIRA: https://navitia.atlassian.net/browse/NAV-1892

Add metric on number of RT messages retrieved from Rabbit

6afa230

JIRA: https://navitia.atlassian.net/browse/NAV-1892

Add metric on number of RT entities applied

65dc8c4

JIRA: https://navitia.atlassian.net/browse/NAV-1892

Add metrics on RT messages' ages (min/average/max)

05fd968

These metrics will indicate the time between the RT message's emission (from Chaos or Kirin) and the moment they are available for requests. JIRA: https://navitia.atlassian.net/browse/NAV-1892

Fix in case of parse error: just skip one message, not all

91c5a30

pbougue requested review from woshilapin, xlqian and pbench March 24, 2023 17:08

pbench reviewed Mar 27, 2023

woshilapin approved these changes Mar 27, 2023

After review + sonar: rename + typo + explicit casts

e7122af

* No more number, count is clearer
* typo on ret`r`ieval corrected
* sonar warned about implicit casts: make them explicit

pbench approved these changes Mar 27, 2023

woshilapin approved these changes Mar 27, 2023

xlqian reviewed Mar 27, 2023

Second pass of improvements after review and sonar :-)

64b8191

xlqian approved these changes Mar 27, 2023

pbougue merged commit 937d0e0 into dev Mar 28, 2023

pbougue deleted the add_rt_metrics branch March 28, 2023 06:58

pbougue mentioned this pull request Apr 6, 2023

Kraken: only one metric histogram for RT message age #3971

Closed
