-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set projection before configuring the source #14685
Set projection before configuring the source #14685
Conversation
Hello, thanks for reproducing the bug and the PR! let (
_projected_schema,
_constraints,
projected_statistics,
_projected_output_ordering,
) = self.project();
self.source = source.with_statistics(projected_statistics); |
Thanks for checking! I also thought that, but couldn't do because I just pushed explicit refresh, but maybe it's time to introduce |
Yes, I agree with the builder approach, it might be helpful for such cases. However, I'm a bit concerned about with the explicit approach, since it might be forgotten. It looks like this should be in the FileScanConfigs responsibility, that's why I think we should call the source updater method implicitly on both with_projection and with_table_partition_cols |
Fair, let's do it that way as a hotfix and properly solve with the builder then 🤝 |
Would it make sense to add a test which would cover plan round trip? |
let ( | ||
_projected_schema, | ||
_constraints, | ||
projected_statistics, | ||
_projected_output_ordering, | ||
) = self.project(); | ||
|
||
self.source = self.source.with_statistics(projected_statistics); | ||
self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about
let source = self.source.clone();
self.with_source(source)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @blaginin, the current state is LGTM. Could we add a test to cover #14679?
PTAL @mertak-synnada
|
||
if max_projection_column | ||
>= self.file_schema.fields().len() + self.table_partition_cols.len() | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this case possible? it seems not obvious to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed this now. but answering to the question - this could happen if projection
set but table_partition_cols
isn't yet set (or vice versa).
This whole logic should be much cleaner when we switch to the builder approach (I want to do a PR on top after this one is merged)
I am pretty sure I am hitting this issue while upgrading to delta.rs as well in
|
_projected_output_ordering, | ||
) = self.project(); | ||
|
||
self.source = self.source.with_statistics(projected_statistics); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand why the source would need projected statistics
I am testing out if the issue is that the FileScanConfig is providing the wrong statistics (like maybe this line should be self.statistics rather than self.source.statistics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great idea! We can't use self.source.statistics
directly, because statistics match projection we're applying - so I had to apply projection before (89ed225).
This made the PR a bit messier, and I had to comment several test lines - LMK if you prefer the old version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this PR @blaginin and @berkaysynnada and @mertak-synnada
I am somewhat confused about the current state of statistics as both FileSource and DataSource have statistics and it isn't clear to me how the statistics should be related. I am working to understand this more
I filed a separate ticket for the issue I was seeing in the delta-rs upgrade I tried running the delta-rs upgrade with this change and it also fixes the delta.rs issue 🤗 I will spend more time today looking at this code / trying to help |
Thank you Andrew!! I'll resolve all the comments later today |
Thank you for working on this |
Hi @blaginin -- I also wrote some additional tests in a different PR: Feel free to bring them into this PR (or we can keep them in a separate PR too) |
@@ -1934,7 +1934,8 @@ mod tests { | |||
|
|||
// test metadata | |||
assert_eq!(exec.statistics()?.num_rows, Precision::Exact(8)); | |||
assert_eq!(exec.statistics()?.total_byte_size, Precision::Exact(671)); | |||
// assert_eq!(exec.statistics()?.total_byte_size, Precision::Exact(671)); | |||
// todo: uncomment when FileScanConfig::projection_stats puts byte size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think those lines were correct, because tests had the same bug: projection was set after the source, and so source statistics (which were used here) weren't really updated properly
data:image/s3,"s3://crabby-images/29cd2/29cd2bf5d7536329f6695fdf25426eccee709624" alt="RustRover-EAP 2025-02-28 08 01 28"
A proper way to fix this is set bytes size in projection_stats
data:image/s3,"s3://crabby-images/c3853/c385336707497259c2b783f581b7bce150bfaf12" alt="image"
Since it's a separate existing todo, I think it's better to change it in a PR on top of this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed this ticket to track:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much @blaginin @mertak-synnada and @xudong963
While I agree this using the existing total_byte_size is not correct I think we should keep the same existing behavior (to minimize downstream impacts) and handle fixing it in a follow on ticket.
I took the liberty of reverting this test change and filing an issue to discuss it separately in
.constraints | ||
.project(proj_indices) | ||
.unwrap_or_else(Constraints::empty); | ||
let schema = self.projected_schema(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks really nice now
@@ -1626,7 +1659,6 @@ async fn roundtrip_parquet_select_star_predicate() -> Result<()> { | |||
roundtrip_test_sql_with_context(sql, &ctx).await | |||
} | |||
|
|||
#[ignore = "Test failing due to https://github.com/apache/datafusion/issues/14679"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
@@ -1934,7 +1934,8 @@ mod tests { | |||
|
|||
// test metadata | |||
assert_eq!(exec.statistics()?.num_rows, Precision::Exact(8)); | |||
assert_eq!(exec.statistics()?.total_byte_size, Precision::Exact(671)); | |||
// assert_eq!(exec.statistics()?.total_byte_size, Precision::Exact(671)); | |||
// todo: uncomment when FileScanConfig::projection_stats puts byte size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed this ticket to track:
It would be nice if we could avoid having Statistics on both FileScanConfig and file_source. I played around with this but hit some issues and don't have time to pursue it now. Filed this ticket to track: |
Thanks again for sticking with this @blaginin -- the result looks really nice |
Thank you for the help! 🥹 I'll do follow up prs |
Thank you so much |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR appears to have accidentally updated the datafusion-testing pin - I will fix that
PR to fix:
Which issue does this PR close?
DataSourceExec
#14905Rationale for this change
FileScanConfig::new
now also configures source, but this currently without projection set before. And so whensource.with_statistics
is called, it is passed with yet emptyprojected_statistics
What changes are included in this PR?
Setting projection before configuring the source... but should the old method exist?
Are these changes tested?
Extended a test to check stats
Are there any user-facing changes?
No