Significant Latency Overhead due to empty Response Handling #15064

praveenc7 · 2025-02-14T04:01:25Z

Recent improvements (#14836) to ensure a valid data type is returned when queries result in empty responses (due to broker pruning or segment pruning) have introduced substantial query overhead.
• In some cases, we observed a 200x increase in query latency (e.g., from 6ms to 1200ms).
• On average, the overhead was at least 100x, leading to high CPU utilization and even host failures.

Root Cause

This issue appears to be linked to the following PRs:
1. PR #13831
2. PR #14918

Findings

A/B testing and profiling point to operations performed in compileQuery, such as optimize and toRelation, as the source of the significant overhead. For example:

pinot/pinot-query-planner/src/main/java/org/apache/pinot/query/QueryEnvironment.java
Line 347 in ad6662f

RelNode optimized = optimize(relation, plannerContext);

This can be easily reproduced by running one of the integration tests with and without the improvements from the above PRs.
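
A rough way to compare the two code paths outside a profiler is a small timing helper like the sketch below. It is not part of Pinot: `LatencyProbe` and `medianNanos` are hypothetical names, and the task passed in is assumed to be whatever wraps the query call with or without the changes from the above PRs.

```java
import java.util.Arrays;
import java.util.function.Supplier;

// Hypothetical helper, not a Pinot API: runs a task repeatedly and reports the median latency.
final class LatencyProbe {
  private LatencyProbe() {
  }

  static long medianNanos(Supplier<?> task, int warmupRuns, int measuredRuns) {
    if (measuredRuns <= 0) {
      throw new IllegalArgumentException("measuredRuns must be positive");
    }
    // Warm-up runs so that JIT compilation does not dominate the first samples.
    for (int i = 0; i < warmupRuns; i++) {
      task.get();
    }
    long[] samples = new long[measuredRuns];
    for (int i = 0; i < measuredRuns; i++) {
      long start = System.nanoTime();
      task.get();
      samples[i] = System.nanoTime() - start;
    }
    Arrays.sort(samples);
    // Middle element after sorting (the upper median when the count is even).
    return samples[measuredRuns / 2];
  }
}
```

Calling `medianNanos` once per variant with the same query and comparing the two numbers gives a quick sanity check; a profiler is still needed to attribute the time to a specific call such as optimize or toRelation.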

Open Question

We’ve observed that this logic is commonly used in the multi-stage query engine for query compilation.
• Is this latency a known issue, or has something changed downstream that we may not be aware of?
• The current Calcite library has been in place for nearly eight months, so it’s unclear if a recent change is causing this behavior.

cc: @vvivekiyer @Jackie-Jiang @albertobastos

praveenc7 · 2025-02-15T01:22:07Z

For example, decorrelation is unnecessary overhead for simple single-stage queries that have no joins and no subqueries.

I guess there is scope for some optimization here when we are doing this for empty responses.

yashmayya · 2025-02-17T08:32:11Z

@praveenc7 can you share some of the queries where you saw disproportionately large latency overheads due to the MSQE compilation? Do they have really large IN clauses? Based on the image of the profile that you've shared, it looks like the root cause is this known issue - #13617. There are currently some attempts at solving it (#14615, #15027) but we're still discussing the cleanest option to fix that issue.

praveenc7 · 2025-02-17T18:34:49Z

@yashmayya Yes, we do see this in queries that have large IN clauses. However, it was observed in some simple queries as well.

Query pattern

SELECT col_a, MAX(col_b)
FROM table_x
WHERE col_b >= 10000
  AND col_c NOT IN ('value_x')
  AND col_d = 123456789
  -- Large IN clause
  AND col_a IN (
      'x1',
      'x2',
      'x3',
      'x4',
      .......
      'x1000'
  );

SELECT col_a, col_b, col_c, SUM(col_d) 
FROM table_y 
WHERE col_a IN ('XXXXX')  -- High cardinality
  AND col_c >= 10000 
GROUP BY col_a, col_b, col_c 
ORDER BY SUM(col_d) DESC 
LIMIT 20000;

yashmayya added query performance labels Feb 17, 2025

This was referenced Feb 17, 2025

disable polyfill until we face the performance penalty #15075

Closed

protect usage MSQE compiler for empty schema polyfill with config param #15078

Merged
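
The config-gated approach named in the title of #15078 can be pictured with the minimal sketch below. It is an illustration only, not the code from that PR: the class `EmptyResponseSchemaFiller`, the flag `useCompilerForTypes`, and the STRING placeholder type are all assumptions, and the `Function` argument stands in for the expensive compile path (optimize, toRelation) discussed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch, not a Pinot API: decides how column types are chosen for an empty response.
final class EmptyResponseSchemaFiller {
  // Hypothetical config flag: when false, the query compiler is never invoked.
  private final boolean useCompilerForTypes;
  // Stand-in for the expensive compile path that derives column types from the query.
  private final Function<String, List<String>> compilerTypes;

  EmptyResponseSchemaFiller(boolean useCompilerForTypes, Function<String, List<String>> compilerTypes) {
    this.useCompilerForTypes = useCompilerForTypes;
    this.compilerTypes = compilerTypes;
  }

  List<String> columnTypes(String sql, int numColumns) {
    if (useCompilerForTypes) {
      // Accurate types, at the cost of a full query compilation.
      return compilerTypes.apply(sql);
    }
    // Cheap fallback (assumed for illustration): label every column STRING without compiling.
    List<String> types = new ArrayList<>(numColumns);
    for (int i = 0; i < numColumns; i++) {
      types.add("STRING");
    }
    return types;
  }
}
```

With the flag off, an empty response costs only the loop over the column count, so the latency regression cannot occur; the trade-off is that clients see placeholder types instead of the real ones.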
