Improve bad queries (with excessive number of groups) observability #15254

Draft
wants to merge 8 commits into
base: master
from


albertobastos
Contributor

This PR improves observability for queries that may degrade cluster performance.

In particular, it targets group-by queries that produce an excessive number of groups.

It makes the following changes:

  • Change the LOGGER statement that traces every incoming query from DEBUG to INFO. Without this change, queries are not logged until they finish, which does not help when troubleshooting the latest queries against a cluster that is having issues.
2025/03/12 13:19:33.174 INFO [BaseSingleStageBrokerRequestHandler] [jersey-server-managed-async-executor-4] SQL query for request 2067850605000000001: select * from airlineStats limit 10
  • Create a new Server configuration parameter, num.groups.limit.default.warn.factor, with a default value of 1.5. When a group-by operator detects that more than (default num.groups.limit * num.groups.limit.default.warn.factor) groups have been created, it emits a warning message such as:
WARN ... [qid=xxx] Query number of groups above warning limit: 150000 (actual: 197312).
  • Add a numGroups attribute to the Broker response, next to the existing numGroupsLimitReached attribute. When numGroupsLimitReached is true, numGroups matches the configured limit. Otherwise, it shows the total number of groups processed by the query. For non-aggregated queries the value is 0.

  • Add a new Server metric named aggregateTimesNumGroupsLimitWarning that counts the number of times the warning message above has been logged.
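
The warning check described in the list above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the class and method names are hypothetical, and the limit of 100000 is an assumption inferred from the 150000 threshold in the sample log line combined with the 1.5 default factor.

```java
// Illustrative sketch only: names are hypothetical and do not mirror Pinot's classes.
public final class NumGroupsWarningSketch {
  private NumGroupsWarningSketch() {
  }

  /** Warning threshold: default num groups limit multiplied by the warn factor. */
  public static long warnThreshold(int numGroupsLimit, double warnFactor) {
    return (long) (numGroupsLimit * warnFactor);
  }

  /** True when the number of groups created is strictly above the warning threshold. */
  public static boolean shouldWarn(long numGroups, int numGroupsLimit, double warnFactor) {
    return numGroups > warnThreshold(numGroupsLimit, warnFactor);
  }

  /** Builds a message shaped like the sample warning in the PR description. */
  public static String warnMessage(String queryId, long threshold, long actual) {
    return String.format("[qid=%s] Query number of groups above warning limit: %d (actual: %d).",
        queryId, threshold, actual);
  }

  public static void main(String[] args) {
    int limit = 100_000;   // assumed default limit (inferred from the sample log)
    double factor = 1.5;   // default warn factor stated in the description
    long actual = 197_312; // group count taken from the sample log
    if (shouldWarn(actual, limit, factor)) {
      System.out.println(warnMessage("xxx", warnThreshold(limit, factor), actual));
    }
  }
}
```

In a real operator the message would go through the logger at WARN level and the check would also bump the aggregateTimesNumGroupsLimitWarning meter; both are left out here to keep the sketch self-contained.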

@albertobastos albertobastos marked this pull request as draft March 12, 2025 13:30
* But if a single query has 2 different aggregate operators and each one reaches the limit, this will be increased
* by 2.
*/
AGGREGATE_TIMES_NUM_GROUPS_LIMIT_WARNING("times", true),

We would probably also want a similar metric that is not global.

@@ -123,6 +124,7 @@ protected void processSegments() {
if (resultsBlock.isNumGroupsLimitReached()) {
_numGroupsLimitReached = true;
}
_numGroups = resultsBlock.getNumGroups();

This code can be called more than once, so assigning here overwrites the value from earlier calls. I think you should instead increase _numGroups by mergedKeys, which is calculated below.
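
A minimal sketch of that suggestion, assuming the merge step may run several times (possibly from several threads) and that each call knows how many keys it merged. The class, the method names, and the use of AtomicInteger are hypothetical and not taken from the PR:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the combine step; not Pinot's actual operator.
public final class NumGroupsAccumulatorSketch {
  private final AtomicInteger _numGroups = new AtomicInteger();
  private volatile boolean _numGroupsLimitReached;

  /** Called once per merged results block; may be invoked more than once per query. */
  public void onBlockMerged(int mergedKeys, boolean limitReached) {
    // Accumulate instead of assigning, so earlier blocks are not overwritten.
    _numGroups.addAndGet(mergedKeys);
    if (limitReached) {
      _numGroupsLimitReached = true;
    }
  }

  public int getNumGroups() {
    return _numGroups.get();
  }

  public boolean isNumGroupsLimitReached() {
    return _numGroupsLimitReached;
  }
}
```

With plain assignment, two calls merging 3 and 4 keys would report 4; with accumulation they report 7.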

codecov-commenter commented Mar 12, 2025

Codecov Report

Attention: Patch coverage is 73.01587% with 17 lines in your changes missing coverage. Please review.

Project coverage is 63.62%. Comparing base (59551e4) to head (cf2318d).
Report is 1837 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...t/core/operator/query/FilteredGroupByOperator.java | 25.00% | 5 Missing and 1 partial ⚠️ |
| ...che/pinot/core/operator/query/GroupByOperator.java | 37.50% | 4 Missing and 1 partial ⚠️ |
| .../apache/pinot/spi/trace/DefaultRequestContext.java | 0.00% | 3 Missing ⚠️ |
| ...n/java/org/apache/pinot/client/ExecutionStats.java | 50.00% | 0 Missing and 1 partial ⚠️ |
| ...common/response/broker/BrokerResponseNativeV2.java | 80.00% | 1 Missing ⚠️ |
| ...t/common/response/broker/CursorResponseNative.java | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #15254      +/-   ##
============================================
+ Coverage     61.75%   63.62%   +1.87%     
- Complexity      207     1459    +1252     
============================================
  Files          2436     2772     +336     
  Lines        133233   156301   +23068     
  Branches      20636    23982    +3346     
============================================
+ Hits          82274    99450   +17176     
- Misses        44911    49362    +4451     
- Partials       6048     7489    +1441     

| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (+99.99%) ⬆️ |
| integration | 100.00% <ø> (+99.99%) ⬆️ |
| integration1 | 100.00% <ø> (+99.99%) ⬆️ |
| integration2 | 0.00% <ø> (ø) |
| java-11 | 63.59% <73.01%> (+1.88%) ⬆️ |
| java-21 | 63.52% <73.01%> (+1.90%) ⬆️ |
| skip-bytebuffers-false | 63.61% <73.01%> (+1.86%) ⬆️ |
| skip-bytebuffers-true | 63.50% <73.01%> (+35.77%) ⬆️ |
| temurin | 63.62% <73.01%> (+1.87%) ⬆️ |
| unittests | 63.62% <73.01%> (+1.87%) ⬆️ |
| unittests1 | 56.15% <71.42%> (+9.26%) ⬆️ |
| unittests2 | 34.18% <15.87%> (+6.44%) ⬆️ |


3 participants