Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Significant Latency Overhead due to empty Response Handling #15064

Open
praveenc7 opened this issue Feb 14, 2025 · 3 comments
Open

Significant Latency Overhead due to empty Response Handling #15064

praveenc7 opened this issue Feb 14, 2025 · 3 comments

Comments

@praveenc7
Copy link
Contributor

praveenc7 commented Feb 14, 2025

Recent improvements (#14836) to ensure a valid data type is returned when queries result in empty responses (due to broker pruning or segment pruning) have introduced substantial query overhead.
• In some cases, we observed a 200x increase in query latency (e.g., from 6ms to 1200ms).
• On average, the overhead was at least 100x, leading to high CPU utilization and even host failures.

Root Cause

This issue appears to be linked to the following PRs:
1. PR #13831
2. PR #14918

Findings

A/B testing and profiling point to some operation done in compileQuery like optimize(

RelNode optimized = optimize(relation, plannerContext);
) and toRelation as the significant overhead.

Image

Image

This can be easily reproduced from one of the integ-test with and without the improvements from the above PRs

Open Question

We’ve observed that this logic is commonly used in the multi-stage query engine for query compilation.
• Is this latency a known issue, or has something changed downstream that we may not be aware of?
• The current Calcite library has been in place for nearly eight months, so it’s unclear if a recent change is causing this behavior.

cc : @vvivekiyer @Jackie-Jiang @albertobastos

@praveenc7
Copy link
Contributor Author

praveenc7 commented Feb 15, 2025

For example decorrelation is un-necessary overhead for simple single-stage queries that have no joins, no subqueries

I guess there are scope of some optimization here, when we are doing this for empty responses

@yashmayya
Copy link
Contributor

@praveenc7 can you share some of the queries where you saw disproportionately large latency overheads due to the MSQE compilation? Do they have really large IN clauses? Based on the image of the profile that you've shared, it looks like the root cause is this known issue - #13617. There are currently some attempts at solving it (#14615, #15027) but we're still discussing the cleanest option to fix that issue.

@praveenc7
Copy link
Contributor Author

@yashmayya Yes we do see this in queries having large IN clauses. However this was observed in some simple queries as well

Query pattern

SELECT col_a, MAX(col_b) 
FROM table_x 
WHERE col_b >= 10000 
  AND col_c NOT IN ('value_x') 
  AND col_d = 123456789 
 // Large IN clause 
  AND col_a IN (
      'x1',
      x2',
      'x3',
      'x4'
   ....... 
     'x1000'
  );
SELECT col_a, col_b, col_c, SUM(col_d) 
FROM table_y 
WHERE col_a IN ('XXXXX')  -- High cardinality
  AND col_c >= 10000 
GROUP BY col_a, col_b, col_c 
ORDER BY SUM(col_d) DESC 
LIMIT 20000;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants