[SEDONA-457] Don't write GeometryUDT into org.apache.spark.sql.parquet.row.metadata when writing GeoParquet files #1164

Kontinuation · 2023-12-28T04:36:53Z

Did you read the Contributor Guide?

Yes, I have read the Contributor Rules and the Contributor Development Guide.

Is this PR related to a JIRA ticket?

Yes, the URL of the associated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-457. The PR name follows the format [SEDONA-XXX] my subject.

What changes were proposed in this PR?

Spark SQL primarily uses org.apache.spark.sql.parquet.row.metadata to infer the schema of Parquet files. It falls back to the native Parquet schema only when org.apache.spark.sql.parquet.row.metadata is absent. Writing the schema of DataFrames with GeometryUDT columns into org.apache.spark.sql.parquet.row.metadata may cause compatibility problems. Please refer to the JIRA ticket for more details.

This patch replaces the GeometryUDT written into the metadata with the binary type, since binary is the physical data type used to represent geometry values.
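As a rough illustration of the replacement described above, the sketch below rewrites a Spark-style schema JSON so that geometry UDT nodes become plain `binary`. It is not the code from this PR (the actual change lives in Sedona's GeoParquet writer); the function name `replace_geometry_udt`, the sample schema, and the simplified UDT node (only the `type`, `class`, and `sqlType` keys) are assumptions made for the example, and the class name is assumed to be the one Spark serializes for Sedona's GeometryUDT.

```python
import json

# Assumed fully qualified class name of Sedona's geometry UDT (illustration only).
GEOMETRY_UDT_CLASS = "org.apache.spark.sql.sedona_sql.UDT.GeometryUDT"


def replace_geometry_udt(node):
    """Return a copy of a Spark-style schema JSON node in which every
    geometry UDT is replaced by the plain "binary" type."""
    if not isinstance(node, dict):
        return node  # primitive types are plain strings such as "long"
    kind = node.get("type")
    if kind == "udt" and node.get("class") == GEOMETRY_UDT_CLASS:
        return "binary"
    if kind == "struct":
        fields = [{**f, "type": replace_geometry_udt(f["type"])} for f in node["fields"]]
        return {**node, "fields": fields}
    if kind == "array":
        return {**node, "elementType": replace_geometry_udt(node["elementType"])}
    if kind == "map":
        return {
            **node,
            "keyType": replace_geometry_udt(node["keyType"]),
            "valueType": replace_geometry_udt(node["valueType"]),
        }
    return node


# Simplified UDT node; a real one may carry additional keys.
geometry_udt = {"type": "udt", "class": GEOMETRY_UDT_CLASS, "sqlType": "binary"}

# Hypothetical schema with a top-level geometry column and a nested one.
schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": False, "metadata": {}},
        {"name": "geometry", "type": geometry_udt, "nullable": True, "metadata": {}},
        {
            "name": "parts",
            "type": {"type": "array", "elementType": geometry_udt, "containsNull": True},
            "nullable": True,
            "metadata": {},
        },
    ],
}

sanitized = replace_geometry_udt(schema)

# The string that would be stored under org.apache.spark.sql.parquet.row.metadata.
row_metadata = json.dumps(sanitized)
```

A reader that does not know GeometryUDT can still parse `row_metadata`, because it now contains only built-in Spark types. The function builds new dictionaries rather than mutating its input, so the original schema remains available to the writer.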

How was this patch tested?

Added assertions to verify that the metadata written by Spark SQL does not contain GeometryUDT.

Did this PR include necessary documentation updates?

No. This PR does not affect any public API, so the docs do not need to change.

jiayuasu · 2023-12-29T01:25:43Z

This PR will fix OvertureMaps/data#89

…t.row.metadata when writing GeoParquet files (apache#1164) * Don't write GeometryUDT into org.apache.spark.sql.parquet.row.metadata. This is for maximum compatibility with various versions of Apache Sedona. * Apply this patch to Spark 3.4 and Spark 3.5

Kontinuation added 2 commits December 28, 2023 12:13

Don't write GeometryUDT into org.apache.spark.sql.parquet.row.metadata. This is for maximum compatibility with various versions of Apache Sedona.

fdd611f

Apply this patch to Spark 3.4 and Spark 3.5

26238a5

Kontinuation marked this pull request as ready for review December 28, 2023 06:54

jiayuasu approved these changes Dec 29, 2023

jiayuasu added bug behavior change labels Dec 29, 2023

jiayuasu merged commit 7bb0ece into apache:master Dec 29, 2023
