GH-3163: Reduce memory and time overhead of ParquetRewriterTests #3164

Merged

rahulketch
Contributor

Rationale for this change

Reduce the memory overhead and the time taken to run ParquetRewriterTests

What changes are included in this PR?

Reduces the number of records (numRecord) from 100000 to 10000.

Are these changes tested?

Are there any user-facing changes?

No

Closes #GH-3163

@@ -107,7 +107,7 @@
@RunWith(Parameterized.class)
public class ParquetRewriterTest {

-  private final int numRecord = 100000;
+  private final int numRecord = 10000;
Member

This may result in a single page for each column chunk. Could you try the following things:

Contributor Author

@wgtmac : thanks for your comment. Do you happen to know how I can run a single test in the repository after making my changes? I wanted to run only the ParquetRewriterTest, but have not figured out a good way to achieve that via mvn.

Member

cd ~/Projects/parquet-java   # replace with your project root directory

cd parquet-hadoop

mvn test -Dtest=org.apache.parquet.hadoop.rewrite.ParquetRewriterTest

Contributor Author

As far as I understand, the relevant config parameter is parquet.page.row.count.limit and I have changed that to be num_records / 5. Hope it looks good!
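The single-page concern and the num_records / 5 setting can be illustrated with plain arithmetic. The sketch below is not code from this PR: the class and method names are hypothetical, the 20,000 default for parquet.page.row.count.limit is an assumption not stated in this thread, and it assumes all rows land in one row group and that only the row count limit splits pages (the page size limit in bytes could produce more).

```java
public class PageCountSketch {
    // Assumed default for parquet.page.row.count.limit; not stated in this thread.
    static final int ASSUMED_DEFAULT_PAGE_ROW_COUNT_LIMIT = 20_000;

    /**
     * Estimated minimum number of data pages in one column chunk when pages
     * are split only by the row count limit: ceil(rowsInChunk / limit).
     */
    static int minPages(int rowsInChunk, int pageRowCountLimit) {
        if (rowsInChunk < 0 || pageRowCountLimit <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and limit must be > 0");
        }
        // Integer ceiling division without floating point.
        return (rowsInChunk + pageRowCountLimit - 1) / pageRowCountLimit;
    }

    public static void main(String[] args) {
        int oldNumRecord = 100000;
        int numRecord = 10000;

        // Before the change: several pages per chunk under the assumed default.
        System.out.println(minPages(oldNumRecord, ASSUMED_DEFAULT_PAGE_ROW_COUNT_LIMIT)); // 5

        // After the change with the assumed default: a single page per chunk.
        System.out.println(minPages(numRecord, ASSUMED_DEFAULT_PAGE_ROW_COUNT_LIMIT)); // 1

        // With the limit lowered to numRecord / 5, as described above.
        System.out.println(minPages(numRecord, numRecord / 5)); // 5
    }
}
```

Under these assumptions, lowering the limit to numRecord / 5 keeps roughly the same number of pages per chunk as the original 100000-record setup, so multi-page code paths should still be exercised.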

@rahulketch rahulketch force-pushed the reduce-parquet-rewriter-test-overhead branch from 86cd806 to d20ed14 Compare February 28, 2025 12:27
@rahulketch
Contributor Author

@wgtmac : Will this line also require a change?

@wgtmac
Member

wgtmac commented Feb 28, 2025

> @wgtmac : Will this line also require a change?

Yes, I think it would be good to test more than one row group.

@rahulketch
Contributor Author

> > @wgtmac : Will this line also require a change?
>
> Yes, I think it would be good to test more than one row group.

I verified that the test still creates more than one row group with the current values. Let me know if there are any more concerns before merging.

Member

@wgtmac wgtmac left a comment

LGTM Thanks!

cc @ConeyLiu

Contributor

@ConeyLiu ConeyLiu left a comment

+1, thanks for the contribution.

@wgtmac wgtmac merged commit 976e2d2 into apache:master Mar 4, 2025
7 checks passed