-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-3163: Reduce memory and time overhead of ParquetRewriterTests #3164
GH-3163: Reduce memory and time overhead of ParquetRewriterTests #3164
Conversation
@@ -107,7 +107,7 @@ | |||
@RunWith(Parameterized.class) | |||
public class ParquetRewriterTest { | |||
|
|||
private final int numRecord = 100000; | |||
private final int numRecord = 10000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may result in a single page for each column chunk. Could you try following things:
- Make sure more than one page are produced by setting config like
parquet.page.value.count.threshold
and relevant page size check parameters at https://github.com/apache/parquet-java/blob/master/parquet-hadoop/README.md - Parameterize
numRecord
(and page size check config) and by default use 10000?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wgtmac : thanks for your comment. Do you happen to know how I can run a single test in the repository after making my changes? I wanted to run only the ParquetRewriterTest
, but have not figured out a good way to achieve that via mvn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cd ~/Projects/parquet-java # replace with your project root directory
cd parquet-hadoop
mvn test -Dtest=org.apache.parquet.hadoop.rewrite.ParquetRewriterTest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand, the relevant config parameter is parquet.page.row.count.limit
and I have changed that to be num_records / 5
. Hope it looks good!
86cd806
to
d20ed14
Compare
@wgtmac : Will this line also require a change? parquet-java/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/rewrite/ParquetRewriterTest.java Line 928 in d20ed14
|
Yes, I think it would be good to test more than one row group. |
I verified that currently the test creates more than 1 row group with the current values. Let me know if any more concerns before merging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM Thanks!
cc @ConeyLiu
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thanks for the contribution.
Rationale for this change
Reduce the memory overhead and the time taken to run ParquetRewriterTests
What changes are included in this PR?
Reducing number of records from 100000 to 10000
Are these changes tested?
Are there any user-facing changes?
No
Closes #GH-3163