Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2417: Add support for geometry logical type #2971

Open
wants to merge 57 commits into
base: master
Choose a base branch
from

Conversation

zhangfengcdt
Copy link

This PR is to provide a POC to support the proposed changes to the parquet-format to add geometry type to parquet.

Here is the proposal: apache/parquet-format#240

Jira

Tests

  • My PR adds the following unit tests: TestGeometryTypeRoundTrip

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@zhangfengcdt zhangfengcdt marked this pull request as draft July 26, 2024 15:39
@zhangfengcdt
Copy link
Author

CC: @jiayuasu @Kontinuation @wgtmac

@zhangfengcdt zhangfengcdt marked this pull request as ready for review August 12, 2024 15:57
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! I have left some comments. I think we are reaching the finish line!

}

@Test
public void testEPSG4326BasicReadWriteGeometryValue() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding these tests!

I think we are missing tests in following cases:

  • verify geometry type metadata is well preserved.
  • verify all kinds of geometry stats are preserved, including bbox, covering and geometry types.
  • verify geo stats in the column index have been generated.

I can do these later.

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! Please see my inline comments.

Comment on lines 85 to 86
void update(Geometry geometry, String crs) {
GeospatialUtils.normalizeLongitude(geometry);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we always normalize the x coordinate and handle wrap-around? The geometry could be in a projected CRS such as EPSG:3857, normalizing longitude or wrap-around should not be applied to such geometries.

The parquet implementation may need to be aware of the crs, or have some other options to turn on/off this longitude normalization and wrap-around behavior. I found that the C++ implementation does not support wrap-around, which may be a better default for geometry types. CC @jiayuasu

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, how will the system handle geometries that cross the antimeridian? Should they be flagged as unsupported? It seems reasonable to continue performing wraparound for antimeridian-crossing geometries, rather than simply rejecting them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paleolimbot I thought the C++ implementation should also support wraparound?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for missing the ping!

Wraparound is complicated (e.g., requires some knowledge about the bounds of the coordinate system that is not strictly the business of a Parquet implementation, can have various heuristics that apply for different geometry types) and I don't think is a great target for the initial implementation. Non-wrapped bounding boxes are completely valid, of course, just not strictly optimal for certain cases.

For both Java and C++, probably a good model for implementing this would be to allow injecting a custom function for calculating bounds that could also be used for Geography bounding (which is even more complicated and a very big ask for a mostly non-spatial Parquet reader/writer).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangfengcdt @Kontinuation @paleolimbot Let's only do normalizeLongitude when the CRS field == "srid:4326" or unset in both Geography and Geometry types.

In all other cases, let's still accept these geometries and proceed without normalizing.

Of course, a more mature solution in the future would be always parsing the CRS and understand the allowed ranges of longitude.

Splitting a geometry to 2 halves will cause lots of trouble as it will create 2 rows with duplicate non-spatial information (especially when there is a primary key column column) unless we make it a single object of MultiPolygon? Most importantly, there is no good GDAL binding in Java.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's only do normalizeLongitude when the CRS field == "srid:4326" or unset in both Geography and Geometry types.

I think we don't want to mess with user coordinates for Geometry (where there's nothing in the spec constraining the valid values)...I would be surprised if there is any file writer anywhere that does this. For Geography we do constrain the valid values (plus it would not make somebody's valid geometry invalid since the definition of Geography ensures this); however, I am still not sure that messing with user coordinates is a good idea for a Parquet implementation (for Sedona, perhaps, this would be more squarely in scope).

Splitting a geometry to 2 halves will cause lots of trouble

Apologies for not making that clear, I don't think a Parquet implementation should do the splitting...they typically arrive this way (e.g., Fiji is usually a MULTIPOLYGON with valid geometries on either side of the antimeridian). The algorithm for generating a more effective bounding box with wrapping is something like "generate bounds for all contiguous sequences and recursively accumulate them looking for a good place to split" (with apologies if that's implemented here and I missed it!)

I think both of these things are good ideas...my point is just that they are complex topics that need a bit more research/testing and not strictly necessary for merging the initial support.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, this is not changing user coordinates at all, of course, it's just the edges of the box 🤦. As long as it's correct this is no problem but I think it would benefit from a dedicated PR (at least on the C++ side where there is already quite a lot going on).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. It is alright to have it in the Java PR since it is already doing that. We can do it a follow-up PR on the C++ side 👌

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the logic to check crs to decide if we need to normalize the coordinate.


public class BoundingBox {

boolean allowWraparound = Boolean.parseBoolean(System.getenv().getOrDefault("ALLOW_BBOX_WRAPAROUND", "true"));
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac @jiayuasu Could you please suggest a better way to inject additional user options like this one? I temporarily use environment variable for now, but open to any other suggestions. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the write path (i.e. when the writer collects bbox), what about adding a configuration to ParquetProperties? This is available when we create the ColumnValueCollector in the ColumnWriterBase.java.

On the read path, we don't know whether users will read bbox from a parquet file and then call update or merge to the bbox. What about adding void enableWraparound(bool enable) to BoundingBox class so they have the chance to set it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants