Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support pushdown filters for non-cast date conversion functions (e.g. to_date) #933

Open
omerhadari opened this issue Feb 2, 2025 · 9 comments

Comments

@omerhadari
Copy link
Contributor

omerhadari commented Feb 2, 2025

Currently, for queries that compare timestamps/dates using TO_DATE in order to for example truncate a timestamp column, no pushdown predicates are applied. This is because the functions TO_DATE, TO_TIMESTAMP are not casts, and so they reach the function to_iceberg_predicate as ScalarFunctions, and not matched on any branch.

It would be nice to identify these cases because it is very common to have a partition key on a timestamp column with a Day Transformation, and currently queries such as SELECT * FROM table WHERE TO_DATE(table.timestamp_col) = '2025-01-01' result in no pushdown predicates at all.

Something along the line of these tests would ideally pass:

    #[test]
    fn test_to_timestamp_comparison_creates_predicate() {
        let sql = "TO_TIMESTAMP(ts) >= timestamp '2023-01-05T00:00:00'";
        let predicate = convert_to_iceberg_predicate(sql).unwrap();
        let expected_predicate =
            Reference::new("ts").greater_than_or_equal_to(Datum::string("2023-01-05T00:00:00"));
        assert_eq!(predicate, expected_predicate);
    }

    #[test]
    fn test_to_timestamp_comparison_to_cast_creates_predicate() {
        let sql = "TO_TIMESTAMP(ts) >= CAST('2023-01-05T00:00:00' AS TIMESTAMP)";
        let predicate = convert_to_iceberg_predicate(sql).unwrap();
        let expected_predicate =
            Reference::new("ts").greater_than_or_equal_to(Datum::string("2023-01-05T00:00:00"));
        assert_eq!(predicate, expected_predicate);
    }

    #[test]
    fn test_to_date_comparison_creates_predicate() {
        let sql = "TO_DATE(ts) >= CAST('2023-01-05T00:00:00' AS DATE)";
        let predicate = convert_to_iceberg_predicate(sql).unwrap();
        let expected_predicate =
            Reference::new("ts").greater_than_or_equal_to(Datum::string("2023-01-05"));
        assert_eq!(predicate, expected_predicate);
    }
@kevinjqliu
Copy link
Contributor

Thanks for creating this issue @omerhadari. Is this something you would like to contribute?
As we discussed on slack, Fokko added something similar in #821 for cast operation

@omerhadari
Copy link
Contributor Author

Yes I would like to take a shot at this. Will be free a bit later this week to try if that's ok?

@Fokko
Copy link
Contributor

Fokko commented Feb 2, 2025

@omerhadari That would be great, feel free to ping me for a review of reach out in case of any questions 👍

@liurenjie1024
Copy link
Contributor

Thanks @omerhadari for raising this. To support this feature, there are some blocking issues since iceberg-rust's expression following java's implementation is quite limited, so it's a little difficult to do it in the core crate. I think an easy way to do this would add a transform to convert to_data(ts) = '2025-01-01' to to_data(ts) = '2025-01-01' and ts >= '2025-01-01 0:00:00' and ts < '2025-01-02 0:00:00', and ts >= '2025-01-01 0:00:00' and ts < '2025-01-02 0:00:00' could be pushed down to help pruning and already supported by iceberg.

There are other long term solutions to support more scalar functions, but it requires another design.

@omerhadari
Copy link
Contributor Author

Thank you @liurenjie1024 for the elaboration! Is this an issue in the Java implementation as well, or does it have a way to express functions?

Copying a comment from my PR because maybe it makes more sense to discuss in the issue. Note the point about how CAST expressions are handled, I think this bug is a bit more worrying because it can potentially cause incorrect query results, not just slower runtimes.

Regarding your suggested alternative calculation, this is actually what I did on my part to work around the issue, but didn't want to implement here because I'm new to the project and did not know if this is too workaround-y.

Here is my comment from the PR itself:

I wanted to ask, is there a way to express function within iceberg predicates? Is this even desired? The reason this could be beneficial is that sometimes you need access to the column value and then you could perform much better manifest elimination. A few examples I have in this context:

  • TO_DATE essentially converts the column to Timestamp, and then truncates to the nearest day. I cannot easily do that in the context of generating the predicate
  • TO_TIMESTAMP accepts format for strings, but I see no way to pass the format inside the predicate and use it correctly.

This also reveals what I think is a bug. In datafusion (as well as many engines) when you cast for example a string to a DATE, it truncates to the nearest day. Currently in the conversion function - the expression is simply extracted from the Cast.

See here:
image

If I understand correctly, this could cause wrong results for example for the query
SELECT * FROM table WHERE date_col > CAST('2025-01-01T00:10:00' AS DATE)

would result in the predicate date_col > '2025-01-01T00:10:00' which will filter out data files where 2025-01-01T00:00:00 < date_col < 2025-01-01T00:10:00 even though they are supposed to be included.

Would appreciate some guidance about how to tackle this issue of propagating more information, I don't think it makes sense in the scope of this PR but maybe I am missing something basic.

@omerhadari
Copy link
Contributor Author

Mainly - I think I accidentally stepped into a rabbit hole, and need some help scoping this issue and the PR.

Here is a suggestion, please let me know if this makes sense as achievables within the scope:

  • Support TO_TIMESTAMP (with a single arg) same as CAST is handled today
  • Support BinaryExpressions with TO_DATE on one side, if TO_DATE is called with a single, String Literal as an argument

Out of scope:

  • Support TO_TIMESTAMP with more than 1 arugment (i.e. date formats)
  • Support TO_DATE on anything other than String literals (i.e. columns, other expressions, timestamps).

@Fokko
Copy link
Contributor

Fokko commented Feb 6, 2025

@omerhadari I think that's a great first step. When you start doing to_date on columns, then you start introducing a lot of complexity.

@omerhadari
Copy link
Contributor Author

Updated the PR according to this set of problems for now. It doesn't solve the entire issue, but I am not sure I feel comfortable with the approach @liurenjie1024 suggested for dealing with dates, despite being logically sound. I think it adds a lot of unexpected complexity and is only a solution for a subset of the issue (only day-resolution comparisons).

@liurenjie1024
Copy link
Contributor

I wanted to ask, is there a way to express function within iceberg predicates? Is this even desired? The reason this could be beneficial is that sometimes you need access to the column value and then you could perform much better manifest elimination.

Currently there is no way to expression function within iceberg predicates, which is also a problem in iceberg-java/iceberg-python. I'm not sure about the background why this doesn't exist. cc @Fokko

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants