PARQUET-2417: Add support for geometry logical type #2971
base: master
This PR is copied from apache#1379.
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/BoundingBox.java
...uet-column/src/main/java/org/apache/parquet/column/statistics/geometry/EnvelopeCovering.java
...uet-column/src/main/java/org/apache/parquet/column/statistics/geometry/EnvelopeCovering.java
parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java
…e spherical edge is specified.
…apache-parquet-2417-geospatial
Thanks for the update! I have left some comments. I think we are reaching the finish line!
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/GeometryUtils.java
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/BoundingBox.java
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/BoundingBox.java
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/BoundingBox.java
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/Covering.java
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
parquet-hadoop/src/test/java/org/apache/parquet/statistics/TestGeometryTypeRoundTrip.java
}

@Test
public void testEPSG4326BasicReadWriteGeometryValue() throws Exception {
Thanks for adding these tests!
I think we are missing tests in the following cases:
- verify geometry type metadata is well preserved.
- verify all kinds of geometry stats are preserved, including bbox, covering and geometry types.
- verify geo stats in the column index have been generated.
I can do these later.
parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveType.java
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
Thanks for the update! Please see my inline comments.
parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java
parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/BoundingBox.java
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/BoundingBox.java
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/GeometryTypes.java
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java
...column/src/main/java/org/apache/parquet/column/statistics/geometry/GeospatialStatistics.java
...column/src/main/java/org/apache/parquet/column/statistics/geometry/GeospatialStatistics.java
...column/src/main/java/org/apache/parquet/column/statistics/geometry/GeospatialStatistics.java
parquet-column/src/main/java/org/apache/parquet/column/statistics/geometry/GeospatialUtils.java
void update(Geometry geometry, String crs) {
    GeospatialUtils.normalizeLongitude(geometry);
Should we always normalize the x coordinate and handle wrap-around? The geometry could be in a projected CRS such as EPSG:3857, and neither longitude normalization nor wrap-around should be applied to such geometries.
The Parquet implementation may need to be aware of the CRS, or offer some other option to turn this longitude normalization and wrap-around behavior on or off. I found that the C++ implementation does not support wrap-around, which may be a better default for geometry types. CC @jiayuasu
cc @paleolimbot
In that case, how will the system handle geometries that cross the antimeridian? Should they be flagged as unsupported? It seems reasonable to continue performing wraparound for antimeridian-crossing geometries, rather than simply rejecting them.
@paleolimbot I thought the C++ implementation should also support wraparound?
Sorry for missing the ping!
Wraparound is complicated (e.g., requires some knowledge about the bounds of the coordinate system that is not strictly the business of a Parquet implementation, can have various heuristics that apply for different geometry types) and I don't think it is a great target for the initial implementation. Non-wrapped bounding boxes are completely valid, of course, just not strictly optimal for certain cases.
For both Java and C++, probably a good model for implementing this would be to allow injecting a custom function for calculating bounds that could also be used for Geography bounding (which is even more complicated and a very big ask for a mostly non-spatial Parquet reader/writer).
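One possible shape for the "inject a custom function" idea is sketched below. Every name here (`Bounds`, `BoundsCalculator`, `PlainBounds`) is hypothetical and not part of parquet-java or the C++ implementation, and coordinates are passed as flat arrays only to keep the sketch self-contained, whereas the PR itself works on JTS `Geometry` objects.

```java
/** Hypothetical value type: an axis-aligned box in the coordinate units of the column. */
final class Bounds {
    final double xMin, xMax, yMin, yMax;

    Bounds(double xMin, double xMax, double yMin, double yMax) {
        this.xMin = xMin;
        this.xMax = xMax;
        this.yMin = yMin;
        this.yMax = yMax;
    }
}

/** Hypothetical strategy interface that a writer could accept instead of hard-coding the bounding logic. */
interface BoundsCalculator {
    Bounds compute(double[] xs, double[] ys);
}

/** Default strategy: plain min/max with no wraparound and no knowledge of the CRS. */
final class PlainBounds implements BoundsCalculator {
    @Override
    public Bounds compute(double[] xs, double[] ys) {
        double xMin = Double.POSITIVE_INFINITY, xMax = Double.NEGATIVE_INFINITY;
        double yMin = Double.POSITIVE_INFINITY, yMax = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < xs.length; i++) {
            xMin = Math.min(xMin, xs[i]);
            xMax = Math.max(xMax, xs[i]);
            yMin = Math.min(yMin, ys[i]);
            yMax = Math.max(yMax, ys[i]);
        }
        return new Bounds(xMin, xMax, yMin, yMax);
    }
}
```

A spatial engine could then supply its own `BoundsCalculator` (for wraparound or Geography bounding) while the Parquet writer keeps only the plain default.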
@zhangfengcdt @Kontinuation @paleolimbot Let's only do normalizeLongitude when the CRS field == "srid:4326" or unset in both Geography and Geometry types.
In all other cases, let's still accept these geometries and proceed without normalizing.
Of course, a more mature solution in the future would be to always parse the CRS and understand the allowed ranges of longitude.
Splitting a geometry to 2 halves will cause lots of trouble, as it will create 2 rows with duplicate non-spatial information (especially when there is a primary key column), unless we make it a single MultiPolygon object? Most importantly, there is no good GDAL binding in Java.
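The rule proposed in this comment could look roughly like the following. This is a sketch under assumptions: the class and method names are hypothetical, "unset" is taken to mean a null or empty CRS string, and wrapping into the half-open range [-180, 180) is one possible convention rather than what `GeospatialUtils.normalizeLongitude` necessarily does.

```java
/** Hypothetical helper; not the PR's GeospatialUtils. */
final class LongitudeNormalization {
    private LongitudeNormalization() {}

    /** Normalize only when the CRS is unset (assumed: null or empty) or exactly "srid:4326". */
    static boolean shouldNormalize(String crs) {
        return crs == null || crs.isEmpty() || crs.equals("srid:4326");
    }

    /** Wraps a longitude into [-180, 180); the double modulo keeps negative inputs in range. */
    static double normalize(double x) {
        return ((x + 180.0) % 360.0 + 360.0) % 360.0 - 180.0;
    }

    /** Leaves coordinates untouched for any other CRS, e.g. a projected one such as EPSG:3857. */
    static double maybeNormalize(double x, String crs) {
        return shouldNormalize(crs) ? normalize(x) : x;
    }
}
```

With this shape, a projected x value such as 20037508.34 under "EPSG:3857" passes through unchanged, while 190.0 with an unset CRS becomes -170.0.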
> Let's only do normalizeLongitude when the CRS field == "srid:4326" or unset in both Geography and Geometry types.
I think we don't want to mess with user coordinates for Geometry (where there's nothing in the spec constraining the valid values)...I would be surprised if there is any file writer anywhere that does this. For Geography we do constrain the valid values (plus it would not make somebody's valid geometry invalid since the definition of Geography ensures this); however, I am still not sure that messing with user coordinates is a good idea for a Parquet implementation (for Sedona, perhaps, this would be more squarely in scope).
> Splitting a geometry to 2 halves will cause lots of trouble
Apologies for not making that clear, I don't think a Parquet implementation should do the splitting...they typically arrive this way (e.g., Fiji is usually a MULTIPOLYGON with valid geometries on either side of the antimeridian). The algorithm for generating a more effective bounding box with wrapping is something like "generate bounds for all contiguous sequences and recursively accumulate them looking for a good place to split" (with apologies if that's implemented here and I missed it!)
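For intuition only, here is a much simpler stand-in for the algorithm described above: instead of accumulating bounds of contiguous sequences, it takes a flat set of longitudes and excludes the largest empty gap. The class name is hypothetical, inputs are assumed to lie in [-180, 180], and a result with xMin > xMax is assumed to mean the range crosses the antimeridian.

```java
import java.util.Arrays;

/** Hypothetical "largest gap" heuristic for a wrapped longitude range. */
final class WrappedLongitudeRange {
    private WrappedLongitudeRange() {}

    /**
     * Returns {xMin, xMax} of the narrowest longitude interval covering all inputs.
     * Assumes at least one longitude, all within [-180, 180].
     */
    static double[] narrowest(double[] longitudes) {
        double[] xs = longitudes.clone();
        Arrays.sort(xs);
        int n = xs.length;
        // Start with the gap that runs east from the largest longitude, across the antimeridian, to the smallest.
        double largestGap = xs[0] + 360.0 - xs[n - 1];
        int gapStart = n - 1; // index of the point where the largest gap begins (going east)
        for (int i = 0; i + 1 < n; i++) {
            double gap = xs[i + 1] - xs[i];
            if (gap > largestGap) {
                largestGap = gap;
                gapStart = i;
            }
        }
        // The covering interval is everything except the largest gap.
        double xMin = xs[(gapStart + 1) % n];
        double xMax = xs[gapStart];
        return new double[] {xMin, xMax};
    }
}
```

For Fiji-like input such as {179, -179, 178, -178} this yields {178, -178}, a 4-degree wrapped range, where plain min/max would give the 358-degree range {-179, 179}. For input that does not straddle the antimeridian, such as {-10, 0, 10}, it reduces to the ordinary {-10, 10}.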
I think both of these things are good ideas...my point is just that they are complex topics that need a bit more research/testing and not strictly necessary for merging the initial support.
Apologies, this is not changing user coordinates at all, of course, it's just the edges of the box 🤦. As long as it's correct this is no problem but I think it would benefit from a dedicated PR (at least on the C++ side where there is already quite a lot going on).
Cool. It is alright to have it in the Java PR since it is already doing that. We can do it in a follow-up PR on the C++ side 👌
I have added logic that checks the CRS to decide whether we need to normalize the coordinates.
public class BoundingBox {

    boolean allowWraparound = Boolean.parseBoolean(System.getenv().getOrDefault("ALLOW_BBOX_WRAPAROUND", "true"));
On the write path (i.e. when the writer collects the bbox), what about adding a configuration to ParquetProperties? This is available when we create the ColumnValueCollector in ColumnWriterBase.java.
On the read path, we don't know whether users will read a bbox from a Parquet file and then call update or merge on it. What about adding void enableWraparound(boolean enable) to the BoundingBox class so they have the chance to set it?
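A minimal sketch of that suggestion follows, replacing the environment-variable lookup with explicit configuration. `WriterOptions` is a hypothetical stand-in for a new setting on `ParquetProperties`, and `WraparoundAwareBox` stands in for `BoundingBox`; neither reflects the actual classes. The wraparound-aware update logic itself is deliberately left out.

```java
/** Hypothetical stand-in for a wraparound setting carried by ParquetProperties. */
final class WriterOptions {
    final boolean bboxWraparound;

    WriterOptions(boolean bboxWraparound) {
        this.bboxWraparound = bboxWraparound;
    }
}

/** Hypothetical stand-in for BoundingBox that takes its flag from configuration, not from System.getenv(). */
final class WraparoundAwareBox {
    private boolean wraparoundEnabled;
    private double xMin = Double.POSITIVE_INFINITY;
    private double xMax = Double.NEGATIVE_INFINITY;

    /** Write path: the writer passes its options when it creates the box. */
    WraparoundAwareBox(WriterOptions options) {
        this.wraparoundEnabled = options.bboxWraparound;
    }

    /** Read path: callers who deserialize a box can set the flag before calling update or merge. */
    void enableWraparound(boolean enable) {
        this.wraparoundEnabled = enable;
    }

    boolean isWraparoundEnabled() {
        return wraparoundEnabled;
    }

    /** Plain min/max update; a wraparound-aware branch would consult wraparoundEnabled here. */
    void update(double x) {
        xMin = Math.min(xMin, x);
        xMax = Math.max(xMax, x);
    }

    double getXMin() {
        return xMin;
    }

    double getXMax() {
        return xMax;
    }
}
```

Keeping the flag on the instance makes the behavior testable per writer and avoids process-wide state.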
This PR provides a POC of the proposed parquet-format changes that add a geometry type to Parquet.
Here is the proposal: apache/parquet-format#240