Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set projection before configuring the source #14685

Merged
merged 16 commits into from
Feb 28, 2025

Conversation

blaginin
Copy link
Contributor

@blaginin blaginin commented Feb 15, 2025

Which issue does this PR close?

Rationale for this change

FileScanConfig::new now also configures source, but this currently without projection set before. And so when source.with_statistics is called, it is passed with yet empty projected_statistics

What changes are included in this PR?

Setting projection before configuring the source... but should the old method exist?

Are these changes tested?

Extended a test to check stats

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate proto Related to proto crate labels Feb 15, 2025
@mertak-synnada
Copy link
Contributor

Hello, thanks for reproducing the bug and the PR!
I think the problem is not with the call order of the functions but with missing behavior. How about changing the with_projection method to update statistics as well? Just like we do in the with_source.

let (
    _projected_schema,
    _constraints,
    projected_statistics,
    _projected_output_ordering,
) = self.project();
self.source = source.with_statistics(projected_statistics);

@blaginin
Copy link
Contributor Author

Thanks for checking! I also thought that, but couldn't do because project relies on other fields to be set. You can see that for example in partition_column_projector test, where we call with_projection before with_table_partition_cols, so recomputing on isn't feasible

I just pushed explicit refresh, but maybe it's time to introduce FileScanConfigBuilder (in a separate pr)?

@mertak-synnada
Copy link
Contributor

Yes, I agree with the builder approach, it might be helpful for such cases. However, I'm a bit concerned about with the explicit approach, since it might be forgotten. It looks like this should be in the FileScanConfigs responsibility, that's why I think we should call the source updater method implicitly on both with_projection and with_table_partition_cols

@github-actions github-actions bot removed the proto Related to proto crate label Feb 19, 2025
@blaginin
Copy link
Contributor Author

Fair, let's do it that way as a hotfix and properly solve with the builder then 🤝

@blaginin blaginin marked this pull request as ready for review February 19, 2025 21:39
@xudong963 xudong963 self-requested a review February 23, 2025 14:42
@milenkovicm
Copy link
Contributor

Would it make sense to add a test which would cover plan round trip?

Comment on lines 366 to 374
let (
_projected_schema,
_constraints,
projected_statistics,
_projected_output_ordering,
) = self.project();

self.source = self.source.with_statistics(projected_statistics);
self
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about

let source = self.source.clone();
self.with_source(source)

Copy link
Contributor

@berkaysynnada berkaysynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @blaginin, the current state is LGTM. Could we add a test to cover #14679?

PTAL @mertak-synnada


if max_projection_column
>= self.file_schema.fields().len() + self.table_partition_cols.len()
{
Copy link
Contributor

@berkaysynnada berkaysynnada Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this case possible? it seems not obvious to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this now. but answering to the question - this could happen if projection set but table_partition_cols isn't yet set (or vice versa).

This whole logic should be much cleaner when we switch to the builder approach (I want to do a PR on top after this one is merged)

@alamb
Copy link
Contributor

alamb commented Feb 26, 2025

I am pretty sure I am hitting this issue while upgrading to delta.rs as well in

_projected_output_ordering,
) = self.project();

self.source = self.source.with_statistics(projected_statistics);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand why the source would need projected statistics

I am testing out if the issue is that the FileScanConfig is providing the wrong statistics (like maybe this line should be self.statistics rather than self.source.statistics

https://github.com/apache/datafusion/blob/1c54b38e4a4012fd8d1b4f48e2c3d6d35016bad0/datafusion/core/src/datasource/physical_plan/file_scan_config.rs#L233-L232

Copy link
Contributor Author

@blaginin blaginin Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great idea! We can't use self.source.statistics directly, because statistics match projection we're applying - so I had to apply projection before (89ed225).

This made the PR a bit messier, and I had to comment several test lines - LMK if you prefer the old version

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this PR @blaginin and @berkaysynnada and @mertak-synnada

I am somewhat confused about the current state of statistics as both FileSource and DataSource have statistics and it isn't clear to me how the statistics should be related. I am working to understand this more

@alamb
Copy link
Contributor

alamb commented Feb 27, 2025

I filed a separate ticket for the issue I was seeing in the delta-rs upgrade

I tried running the delta-rs upgrade with this change and it also fixes the delta.rs issue 🤗

I will spend more time today looking at this code / trying to help

@blaginin
Copy link
Contributor Author

Thank you Andrew!! I'll resolve all the comments later today

@alamb
Copy link
Contributor

alamb commented Feb 27, 2025

Thank you Andrew!! I'll resolve all the comments later today

Thank you for working on this

@github-actions github-actions bot removed the core Core DataFusion crate label Feb 27, 2025
@github-actions github-actions bot added the proto Related to proto crate label Feb 28, 2025
@alamb
Copy link
Contributor

alamb commented Feb 28, 2025

Hi @blaginin -- I also wrote some additional tests in a different PR:

Feel free to bring them into this PR (or we can keep them in a separate PR too)

@alamb alamb mentioned this pull request Feb 28, 2025
26 tasks
@github-actions github-actions bot added the core Core DataFusion crate label Feb 28, 2025
@@ -1934,7 +1934,8 @@ mod tests {

// test metadata
assert_eq!(exec.statistics()?.num_rows, Precision::Exact(8));
assert_eq!(exec.statistics()?.total_byte_size, Precision::Exact(671));
// assert_eq!(exec.statistics()?.total_byte_size, Precision::Exact(671));
// todo: uncomment when FileScanConfig::projection_stats puts byte size
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think those lines were correct, because tests had the same bug: projection was set after the source, and so source statistics (which were used here) weren't really updated properly

RustRover-EAP 2025-02-28 08 01 28

A proper way to fix this is set bytes size in projection_stats

image

Since it's a separate existing todo, I think it's better to change it in a PR on top of this one

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 28, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @blaginin @mertak-synnada and @xudong963

While I agree this using the existing total_byte_size is not correct I think we should keep the same existing behavior (to minimize downstream impacts) and handle fixing it in a follow on ticket.

I took the liberty of reverting this test change and filing an issue to discuss it separately in

.constraints
.project(proj_indices)
.unwrap_or_else(Constraints::empty);
let schema = self.projected_schema();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks really nice now

@@ -1626,7 +1659,6 @@ async fn roundtrip_parquet_select_star_predicate() -> Result<()> {
roundtrip_test_sql_with_context(sql, &ctx).await
}

#[ignore = "Test failing due to https://github.com/apache/datafusion/issues/14679"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@@ -1934,7 +1934,8 @@ mod tests {

// test metadata
assert_eq!(exec.statistics()?.num_rows, Precision::Exact(8));
assert_eq!(exec.statistics()?.total_byte_size, Precision::Exact(671));
// assert_eq!(exec.statistics()?.total_byte_size, Precision::Exact(671));
// todo: uncomment when FileScanConfig::projection_stats puts byte size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Copy link
Contributor

alamb commented Feb 28, 2025

It would be nice if we could avoid having Statistics on both FileScanConfig and file_source. I played around with this but hit some issues and don't have time to pursue it now. Filed this ticket to track:

@alamb alamb merged commit 2fd558f into apache:main Feb 28, 2025
24 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 28, 2025

Thanks again for sticking with this @blaginin -- the result looks really nice

@blaginin
Copy link
Contributor Author

Thank you for the help! 🥹 I'll do follow up prs

@alamb
Copy link
Contributor

alamb commented Feb 28, 2025

Thank you for the help! 🥹 I'll do follow up prs

Thank you so much

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR appears to have accidentally updated the datafusion-testing pin - I will fix that

Screenshot 2025-02-28 at 11 53 39 AM

PR to fix:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate proto Related to proto crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
6 participants