[Bug]: Spark batch write failed w/ Already closed files for partition #613
Iceberg requires the data to be sorted according to the partition spec per task (Spark partition) prior to writing to a partitioned table. You can find more information here: https://iceberg.apache.org/docs/latest/spark-writes/#writing-to-partitioned-tables
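The failure mode behind the issue title can be sketched without Spark at all. The toy `ClusteredWriter` below is an illustration of the behavior described above, not Iceberg's actual code: a clustered writer keeps a single open file and closes it when the partition value changes, so a row for an already-closed partition fails with the same symptom as "Already closed files for partition", while sorting the input by the partition column first succeeds.

```python
# Illustrative sketch (NOT Iceberg's actual implementation): a clustered
# writer keeps only ONE open file and closes it when the partition value
# changes. A row for an already-closed partition is an error.

class ClusteredWriter:
    def __init__(self):
        self.current = None   # partition of the currently open file
        self.closed = set()   # partitions whose files were already closed
        self.files = []       # (partition, rows) pairs, one per file

    def write(self, partition, row):
        if partition != self.current:
            if partition in self.closed:
                raise RuntimeError(f"Already closed files for partition {partition}")
            if self.current is not None:
                self.closed.add(self.current)
            self.current = partition
            self.files.append((partition, []))
        self.files[-1][1].append(row)

rows = [("2023-01-01", 1), ("2023-01-02", 2), ("2023-01-01", 3)]

# Unsorted input revisits a closed partition and fails:
w = ClusteredWriter()
try:
    for p, r in rows:
        w.write(p, r)
    failed = False
except RuntimeError:
    failed = True

# Sorting by the partition column first (what the docs ask for) succeeds:
w2 = ClusteredWriter()
for p, r in sorted(rows):
    w2.write(p, r)
```

With sorted input each partition's file is opened exactly once, which is why Iceberg's documented fix is to cluster/sort the dataset by the partition columns before writing.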
How can fanout write be enabled on Arctic?
SPARK-23889 still has obvious drawbacks: the strict distribution requirement before writing may cause low parallelism and data skew, which significantly impacts performance. As Arctic provides an async optimize service, it should be OK to enable fanout write by default.
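The skew concern mentioned above can be sketched in a few lines (a simplified model, not Spark's actual shuffle): hash-distributing rows by their table-partition value sends every row of a hot partition to the same task, no matter how many tasks are available.

```python
# Hypothetical model of clustering rows by partition value before a write:
# each row goes to the task chosen by hashing its partition value, so a hot
# partition is processed by a single task while other tasks sit nearly idle.

def distribute(rows, num_tasks):
    tasks = [[] for _ in range(num_tasks)]
    for part, row in rows:
        tasks[hash(part) % num_tasks].append((part, row))
    return tasks

# 1000 rows in one hot partition, 10 rows in a cold one:
rows = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]
tasks = distribute(rows, num_tasks=8)
sizes = [len(t) for t in tasks]
# All 1000 "hot" rows land in one task, capping effective parallelism.
```

This is the trade-off the comment refers to: the distribution requirement avoids the "already closed" error, but at the cost of skew-bound parallelism.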
Arctic does not support enabling fanout write for unkeyed tables now, because it uses the Iceberg Spark writer directly. cc @baiyangtx
So the user must know the table's distribution details and redistribute the dataset to match exactly before inserting with Spark? Sounds unfriendly. What do you think about switching the hardcoded fanout write from disabled to enabled?
My biggest concern is that the fanout writer will make each writer open too many files at the same time, and this may produce OOM problems in Spark. In my opinion, this is the core reason why Iceberg disables the fanout writer in Spark.
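The memory concern can be made concrete with a counterpart to the clustered sketch (again an illustration, not Iceberg's actual code): a fanout writer never closes a file until the task ends, so it tolerates unsorted input but holds one open file, and its write buffer, per distinct partition seen by the task.

```python
# Illustrative sketch of the fanout trade-off: no "already closed" errors,
# but open files (and their buffers) grow with the number of distinct
# partitions a task writes to -- the source of the OOM concern.

class FanoutWriter:
    def __init__(self):
        self.open_files = {}  # partition -> buffered rows, all kept open

    def write(self, partition, row):
        self.open_files.setdefault(partition, []).append(row)

# Unsorted input is fine:
rows = [("2023-01-01", 1), ("2023-01-02", 2), ("2023-01-01", 3)]
w = FanoutWriter()
for p, r in rows:
    w.write(p, r)

# Memory cost: one open file per distinct partition seen by the task.
many = FanoutWriter()
for day in range(365):
    many.write(f"day-{day}", day)
```

A task that touches hundreds of partitions keeps hundreds of files open simultaneously, which is the OOM scenario described above.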
BTW, it's one of the major reasons why I don't have strong confidence recommending users replace Hive with Iceberg; such default behavior is really a burden on users.
Enabling the fanout writer looks good to me. The keyed Arctic table enables the fanout writer by default now. But some users DO find the Arctic writer needs more memory than the Iceberg writer.
Maybe we can add a config to enable the fanout writer and make it default to true. Then users will find it easier to insert data into tables, and can change this config and sort the data when they meet OOM problems.
sgtm |
@zhoujinsong Which module will implement this table property? core or spark module? |
What happened?
Unexpected result.
Spark batch write failed w/ "Already closed files for partition"
Affects Versions
0.3.2-rc1
What engines are you seeing the problem on?
Spark 3.1.3
How to reproduce
Relevant log output
Anything else
No response
Code of Conduct