Add a JSON reader option to ignore type conflicts #7276

scovich · 2025-03-12T18:20:13Z

Which issue does this PR close?

Closes Option for JSON parser to return NULL field values on type mismatch #7230

Rationale for this change

JSON data is notoriously non-homogeneous, but the JSON parser today is super strict -- it requires a concrete schema, and parsing fails if any field of any row encounters a type conflict. In such cases, it can be preferable for incompatible fields to parse as NULL instead of producing a hard error.

What changes are included in this PR?

Adds a new method arrow_json::reader::ReaderBuilder::with_ignore_type_conflicts, which overrides the default behavior of failing on a type conflict and returns NULL values instead.

Plumb that flag through to all ten decoders so they honor it: Null, Boolean, Primitive, Decimal, Timestamp, String, StringView, List, Map, Struct.

Add both positive and negative unit tests for each decoder type, to ensure the plumbing worked.
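
As a rough illustration of the intended semantics, here is a toy model that uses only the standard library. It is not the arrow-json implementation: `JsonValue` and `decode_bools` are hypothetical names invented for this sketch, and the real decoders work on the reader's tape rather than on an enum like this. The only point is the difference between the strict default (first mismatch is an error) and the new option (mismatches become NULL).

```rust
/// Toy stand-in for a parsed JSON value (hypothetical; not arrow-json's tape).
#[derive(Debug, Clone, PartialEq)]
enum JsonValue {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
}

/// Decode a column of values as nullable booleans.
///
/// With `ignore_type_conflicts == false`, the first value that is neither a
/// boolean nor null is a hard error. With `true`, such values decode as
/// `None` (NULL) and decoding continues.
fn decode_bools(
    values: &[JsonValue],
    ignore_type_conflicts: bool,
) -> Result<Vec<Option<bool>>, String> {
    let mut out = Vec::with_capacity(values.len());
    for (row, value) in values.iter().enumerate() {
        match value {
            JsonValue::Bool(b) => out.push(Some(*b)),
            JsonValue::Null => out.push(None),
            // Type conflict, but the caller asked for NULL instead of an error.
            _ if ignore_type_conflicts => out.push(None),
            // Type conflict under the strict default.
            other => return Err(format!("row {row}: expected boolean, got {other:?}")),
        }
    }
    Ok(out)
}
```

In this model, a column such as `[true, "oops", null, 1.0]` fails in strict mode and decodes to `[Some(true), None, None, None]` when conflicts are ignored.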

Are there any user-facing changes?

New API method, see above.

scovich · 2025-03-12T18:28:13Z

arrow-json/src/reader/null_array.rs

+            for p in pos {
+                if !matches!(tape.get(*p), TapeElement::Null) {
+                    return Err(tape.error(*p, "null"));
+                }


NOTE: Indentation-only change

tustvold · 2025-03-12T19:26:44Z

Have you run the benchmarks for this?

scovich · 2025-03-12T20:32:20Z

> Have you run the benchmarks for this?

Not yet... but https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md makes it look very easy. Will do so and report back.

scovich · 2025-03-12T20:35:55Z

@tustvold is cargo bench -p arrow-json sufficient? Or do I need to benchmark some other sub-crates as well? Asking because there didn't seem to be very many benchmarks in the arrow-json crate?

alamb · 2025-03-12T20:43:30Z

> @tustvold is cargo bench -p arrow-json sufficient? Or do I need to benchmark some other sub-crates as well? Asking because there didn't seem to be very many benchmarks in the arrow-json crate?

I think this one is probably what @tustvold is referring to: https://github.com/apache/arrow-rs/blob/a75da00eed762f8ab201c6cb4388921ad9b67e7e/arrow/benches/json_reader.rs#L30-L29

so like

cargo bench --bench json_reader

scovich · 2025-03-12T21:45:41Z

Hmm, the benchmark results are not stable from run to run. Even benchmarking the main branch against itself gives a random and changing set of regressions and improvements. I tried on two very different computers: a 2021 MacBook Pro (Apple M1 Max) and a 2019 Lenovo T490s (Intel Core i5-8365U). Different absolute numbers, same large jitter.

Is there some trick for getting stable numbers? I tried increasing the measurement interval to 10s, it didn't solve the problem.

alamb · 2025-03-12T22:11:12Z

> Is there some trick for getting stable numbers? I tried increasing the measurement interval to 10s, it didn't solve the problem.

I am not super familiar

Maybe you could use a non-laptop (sometimes they vary based on thermostats, etc.)?

scovich · 2025-03-13T16:19:15Z

Finally got some reasonably stable benchmark results using an EC2 m6i.8xlarge instance, rustc 1.85.0. It uncovered one issue with the append helpers I had introduced. After addressing that, we now have:

benchmark	run1	run2	run3	run4	run5
small_bench_primitive	--noise--	--noise--	--noise--	--noise--	--noise--
large_bench_primitive	2.27% faster	2.27% faster	1.65% faster	2.71% faster	1.36% faster
small_bench_list	--noise--	--noise--	3.69% faster	2.71% faster	6.02% faster

I don't know why my changes should have caused a speedup, but at least there's no slowdown.
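
For readers unfamiliar with how entries like "2.27% faster" and "--noise--" relate to raw timings, here is a simplified sketch of the arithmetic. It assumes the percentages denote the relative reduction in measured time versus the baseline. The `Change` enum, the `classify` function, and the `noise_threshold_pct` parameter are hypothetical names for this sketch; Criterion itself applies a statistical test over many samples, which this does not reproduce.

```rust
/// Outcome of comparing one benchmark against its baseline (hypothetical type).
#[derive(Debug, PartialEq)]
enum Change {
    /// The relative change is smaller than the noise threshold.
    Noise,
    /// The new run took this many percent less time than the baseline.
    Faster(f64),
    /// The new run took this many percent more time than the baseline.
    Slower(f64),
}

/// Classify the relative change between a baseline time and a new time.
///
/// `noise_threshold_pct` is an assumed cutoff: changes smaller than this
/// (in absolute percent) are reported as noise.
fn classify(baseline_ns: f64, new_ns: f64, noise_threshold_pct: f64) -> Change {
    // Negative means the new run took less time than the baseline.
    let pct = (new_ns - baseline_ns) / baseline_ns * 100.0;
    if pct.abs() < noise_threshold_pct {
        Change::Noise
    } else if pct < 0.0 {
        Change::Faster(-pct)
    } else {
        Change::Slower(pct)
    }
}
```

For example, with a 1% threshold, a baseline of 400 ns and a new time of 300 ns classifies as 25% faster, while 400 ns versus 401 ns classifies as noise.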

Benchmarking commands used

# 5 runs against upstream main branch
git checkout 82c2d5f4c 
for i in $(seq 5); do cargo bench --bench json_reader -- --save-baseline main$i; done

# 5 runs against this PR
git switch json-ignore-type-conflicts-option
for i in $(seq 5); do cargo bench --bench json_reader -- --save-baseline feature$i; done

# compare the run results
for i in $(seq 5); do cargo bench --bench json_reader -- --load-baseline feature$i --baseline main$i; done

(see https://bheisler.github.io/criterion.rs/book/user_guide/command_line_options.html#baselines)

scovich · 2025-03-13T20:35:32Z

@tustvold -- does the above work? Or are there other benchmarks to double check?

scovich added 2 commits March 6, 2025 12:36

Initial prototype

c7c74ce

test coverage

5e87e5b

github-actions bot added the arrow label (Changes to the arrow crate) Mar 12, 2025

better null decoding

9f9d7c5

scovich added 3 commits March 13, 2025 08:38

More efficient append helpers

0936aa5

Merge remote-tracking branch 'oss/main' into json-ignore-type-conflicts-option

995f1cd

missing unit tests

bb4e8d4
