Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace parquet metadata thrift version with in memory version. #1004

Open
liurenjie1024 opened this issue Feb 25, 2025 · 7 comments
Open

Replace parquet metadata thrift version with in memory version. #1004

liurenjie1024 opened this issue Feb 25, 2025 · 7 comments
Labels
good first issue Good for newcomers
Milestone

Comments

@liurenjie1024
Copy link
Contributor

liurenjie1024 commented Feb 25, 2025

In parquet crate, there are two kinds of data structures for metadata: in memory version vs auto generated version from parquet's thrift definition. For example, there are two versions of FileMetadata: in memory vs thrift definition.

We should use the in memory one as it provides more features, while thrift version was only used for ser/de in parquet.

There are several places in our crate which is using thrift version:

  1. parquet writer
@jonathanc-n
Copy link
Contributor

@liurenjie1024 I think the current problem with this is that ArrowFileReader (reader) returns ParquetMetadata and AsyncFileWriter (writer) returns the thrift definition. The solution I was thinking of is creating a conversion from thrift -> ParquetMetadata, but this seems like an unnecessary step. I think keeping both functions so that the parquet writer can convert to datafile given any of the two metadatas without an unnecessary conversion step in between seems to be fine.

@liurenjie1024
Copy link
Contributor Author

@liurenjie1024 I think the current problem with this is that ArrowFileReader (reader) returns ParquetMetadata and AsyncFileWriter (writer) returns the thrift definition. The solution I was thinking of is creating a conversion from thrift -> ParquetMetadata, but this seems like an unnecessary step. I think keeping both functions so that the parquet writer can convert to datafile given any of the two metadatas without an unnecessary conversion step in between seems to be fine.

I think we should always return the in memory representation, rather the thrift one. Is there any case where returning the thrift one is more useful then the in memory one?

@liurenjie1024 liurenjie1024 changed the title Replace FileMetadata in parquet writer with in memory representation. Replace parquet metadata thrift version with in memory version. Mar 6, 2025
@jonathanc-n
Copy link
Contributor

Probably not, so should we cahnge the AsyncFileWriter to return the in memory representation?

@liurenjie1024
Copy link
Contributor Author

Probably not, so should we cahnge the AsyncFileWriter to return the in memory representation?

Yes, but it seems there is no built no approach to do that? We may need to ask for help in arrow community?

@jonathanc-n
Copy link
Contributor

Yes, I'll look to submit an issue

@liurenjie1024
Copy link
Contributor Author

Hi, @jonathanc-n I found this method in parquet crate. I think there are two ways to do this:

  1. We could use thrift api to serialize thrift version to bytes, and read withi this method.
  2. We could simulate the implementation in this method.

@jonathanc-n
Copy link
Contributor

Thanks for that! I'll look into it later today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
Status: No status
Development

No branches or pull requests

2 participants