-
Notifications
You must be signed in to change notification settings - Fork 892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a JSON reader option to ignore type conflicts #7276
base: main
Are you sure you want to change the base?
Conversation
for p in pos { | ||
if !matches!(tape.get(*p), TapeElement::Null) { | ||
return Err(tape.error(*p, "null")); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: Indentation-only change
Have you run the benchmarks for this? |
Not yet... but https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md makes it look very easy. Will do so and report back. |
@tustvold is |
I think this one is probably what @tustvold is referring to: https://github.com/apache/arrow-rs/blob/a75da00eed762f8ab201c6cb4388921ad9b67e7e/arrow/benches/json_reader.rs#L30-L29 so like cargo bench --bench json_reader |
Hmm, the benchmark results are not stable from run to run. Even benchmarking the main branch against itself gives a randomly and changing set of regressions and improvements. I tried on two very different computers: 2021 MacBook Pro (Apple M1 Max) and a 2019 Lenovo T490s (Intel Core i5-8365U). Different absolute numbers, same large jitter. Is there some trick for getting stable numbers? I tried increasing the measurement interval to 10s, it didn't solve the problem. |
I am not suepr familar Maybe you could use a non laptop (sometimes they vary based on thermostats, etc)? |
Finally got some reasonably stable benchmark results using an EC2 m6i.8xlarge instance, rustc 1.85.0. It uncovered one issue with the
I don't know why my changes should have caused a speedup, but at least there's no slowdown. Benchmarking commands used# 5 runs against upstream main branch
git checkout 82c2d5f4c
for i in $(seq 5); do cargo bench --bench json_reader -- --save-baseline main$i; done
# 5 runs against this PR
git switch json-ignore-type-conflicts-option
for i in $(seq 5); do cargo bench --bench json_reader -- --save-baseline feature$i; done
# compare the run results
for i in $(seq 5); do cargo bench --bench json_reader -- --load-baseline feature$i --baseline main$i; done (see https://bheisler.github.io/criterion.rs/book/user_guide/command_line_options.html#baselines) |
@tustvold -- does the above work? Or are there other benchmarks to double check? |
Which issue does this PR close?
Rationale for this change
JSON data is notoriously non-homogenous, but the JSON parser today is super strict -- it requires a concrete schema and parsing fails if any field of any row encounters a type conflict. In such cases, it can be preferable for incompatible fields to parse as NULL instead of producing a hard error.
What changes are included in this PR?
Adds a new method
arrow_json::reader::ReaderBuilder::with_ignore_type_conflicts
, which can override the default behavior of throwing on type conflict, to return NULL values instead.Plumb that flag through to all ten decoders so they honor it: Null, Boolean, Primitive, Decimal, Timestamp, String, StringView, List, Map, Struct.
Add both positive and negative unit tests for each decoder type, to ensure the plumbing worked.
Are there any user-facing changes?
New API method, see above.