Replace parquet metadata thrift version with in memory version. #1004

liurenjie1024 · 2025-02-25T03:42:02Z

In the parquet crate, there are two kinds of data structures for metadata: an in-memory version and a version auto-generated from parquet's thrift definition. For example, there are two versions of FileMetadata: the in-memory one and the thrift one.

We should use the in-memory one, as it provides more features, while the thrift version is only used for ser/de within parquet.

There are several places in our crate that use the thrift version:

parquet writer

jonathanc-n · 2025-03-03T19:25:43Z

@liurenjie1024 I think the current problem is that ArrowFileReader (reader) returns ParquetMetadata while AsyncFileWriter (writer) returns the thrift definition. The solution I was thinking of is creating a conversion from thrift -> ParquetMetadata, but this seems like an unnecessary step. I think keeping both functions, so that the parquet writer can convert to a data file from either kind of metadata without a conversion step in between, should be fine.
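To make the thrift -> ParquetMetadata conversion idea concrete, here is a minimal sketch. It is only an illustration under loud assumptions: `ThriftFileMetaData`, `InMemoryMetadata`, and `ConversionError` are hypothetical stand-ins with a reduced set of fields, not the parquet crate's real structs, and the row-count check is an invented example of validation a conversion might perform.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for the thrift-generated metadata (NOT the real struct).
#[derive(Debug, Clone, PartialEq)]
pub struct ThriftFileMetaData {
    pub version: i32,
    pub num_rows: i64,
    pub created_by: Option<String>,
    /// Row count of each row group, in file order.
    pub row_group_num_rows: Vec<i64>,
}

/// Hypothetical stand-in for the in-memory metadata (NOT the real struct).
#[derive(Debug, Clone, PartialEq)]
pub struct InMemoryMetadata {
    pub version: i32,
    pub num_rows: i64,
    pub created_by: Option<String>,
    pub row_group_num_rows: Vec<i64>,
}

/// Illustrative error type for a failed conversion.
#[derive(Debug, PartialEq)]
pub enum ConversionError {
    RowCountMismatch { declared: i64, summed: i64 },
}

impl TryFrom<ThriftFileMetaData> for InMemoryMetadata {
    type Error = ConversionError;

    fn try_from(t: ThriftFileMetaData) -> Result<Self, Self::Error> {
        // Assumed sanity check: the per-row-group counts must add up
        // to the file-level row count.
        let summed: i64 = t.row_group_num_rows.iter().sum();
        if summed != t.num_rows {
            return Err(ConversionError::RowCountMismatch {
                declared: t.num_rows,
                summed,
            });
        }
        Ok(InMemoryMetadata {
            version: t.version,
            num_rows: t.num_rows,
            created_by: t.created_by,
            row_group_num_rows: t.row_group_num_rows,
        })
    }
}

/// Downstream code then needs only one entry point, taking the in-memory form.
pub fn total_rows(meta: &InMemoryMetadata) -> i64 {
    meta.num_rows
}
```

With a conversion like this, the writer path would call `InMemoryMetadata::try_from(thrift)` once and share the same downstream function as the reader path. The alternative of keeping both functions avoids that extra step, at the cost of two code paths to maintain.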

liurenjie1024 · 2025-03-06T01:57:02Z

> @liurenjie1024 I think the current problem is that ArrowFileReader (reader) returns ParquetMetadata while AsyncFileWriter (writer) returns the thrift definition. The solution I was thinking of is creating a conversion from thrift -> ParquetMetadata, but this seems like an unnecessary step. I think keeping both functions, so that the parquet writer can convert to a data file from either kind of metadata without a conversion step in between, should be fine.

I think we should always return the in-memory representation rather than the thrift one. Is there any case where returning the thrift one is more useful than the in-memory one?

jonathanc-n · 2025-03-06T02:22:27Z

Probably not, so should we change the AsyncFileWriter to return the in-memory representation?

liurenjie1024 · 2025-03-06T02:30:00Z

> Probably not, so should we change the AsyncFileWriter to return the in-memory representation?

Yes, but it seems there is no built-in approach to do that? We may need to ask the arrow community for help.

jonathanc-n · 2025-03-06T06:19:30Z

Yes, I'll look to submit an issue

liurenjie1024 · 2025-03-11T09:47:20Z

Hi @jonathanc-n, I found this method in the parquet crate. I think there are two ways to do this:

We could use the thrift API to serialize the thrift version to bytes, then read it back with this method.
We could replicate the implementation of this method ourselves.
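
The two options can be compared with a toy sketch. Everything here is an assumption made for illustration: `ThriftMeta` and `MemMeta` are hypothetical stand-ins, and `encode`/`decode` use a made-up fixed-width byte layout rather than the thrift wire format or the parquet crate's actual method.

```rust
use std::convert::TryInto;

/// Hypothetical stand-in for the thrift version (NOT the real struct).
#[derive(Debug, Clone, PartialEq)]
pub struct ThriftMeta {
    pub version: i32,
    pub num_rows: i64,
}

/// Hypothetical stand-in for the in-memory version (NOT the real struct).
#[derive(Debug, Clone, PartialEq)]
pub struct MemMeta {
    pub version: i32,
    pub num_rows: i64,
}

/// Stand-in for "serialize the thrift version to bytes".
/// Toy layout: 4 bytes of version, then 8 bytes of row count, little-endian.
pub fn encode(t: &ThriftMeta) -> Vec<u8> {
    let mut buf = Vec::with_capacity(12);
    buf.extend_from_slice(&t.version.to_le_bytes());
    buf.extend_from_slice(&t.num_rows.to_le_bytes());
    buf
}

/// Stand-in for the existing decoding method: bytes -> in-memory metadata.
pub fn decode(bytes: &[u8]) -> Option<MemMeta> {
    if bytes.len() != 12 {
        return None;
    }
    let version = i32::from_le_bytes(bytes[0..4].try_into().ok()?);
    let num_rows = i64::from_le_bytes(bytes[4..12].try_into().ok()?);
    Some(MemMeta { version, num_rows })
}

/// Option 1: round-trip through bytes and reuse the existing decoder.
pub fn via_bytes(t: &ThriftMeta) -> Option<MemMeta> {
    decode(&encode(t))
}

/// Option 2: mirror the decoder's field mapping directly, with no byte buffer.
pub fn direct(t: &ThriftMeta) -> MemMeta {
    MemMeta {
        version: t.version,
        num_rows: t.num_rows,
    }
}
```

Option 1 reuses the decoding logic as-is but pays for an extra serialize/deserialize pass. Option 2 skips the buffer but duplicates mapping logic that would have to be kept in sync with the upstream method.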

jonathanc-n · 2025-03-11T16:28:14Z

Thanks for that! I'll look into it later today.

liurenjie1024 added the good first issue Good for newcomers label Feb 25, 2025

liurenjie1024 added this to the 0.5.0 Release milestone Feb 25, 2025

liurenjie1024 added this to iceberg-rust Feb 25, 2025

This was referenced Feb 25, 2025

feat: Add existing parquet files #960

Merged

Consolidate methods of converting parquet file to data file builder. #1033

Open

liurenjie1024 changed the title ~~Replace FileMetadata in parquet writer with in memory representation.~~ Replace parquet metadata thrift version with in memory version. Mar 6, 2025

jonathanc-n mentioned this issue Mar 9, 2025

Have writer return parsed ParquetMetadata apache/arrow-rs#7254

Closed

jonathanc-n mentioned this issue Mar 12, 2025

feat: Add conversion from FileMetaData to ParquetMetadata #1074

Open
