Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a JSON reader option to ignore type conflicts #7276

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

scovich
Copy link
Contributor

@scovich scovich commented Mar 12, 2025

Which issue does this PR close?

Rationale for this change

JSON data is notoriously non-homogenous, but the JSON parser today is super strict -- it requires a concrete schema and parsing fails if any field of any row encounters a type conflict. In such cases, it can be preferable for incompatible fields to parse as NULL instead of producing a hard error.

What changes are included in this PR?

Adds a new method arrow_json::reader::ReaderBuilder::with_ignore_type_conflicts, which can override the default behavior of throwing on type conflict, to return NULL values instead.

Plumb that flag through to all ten decoders so they honor it: Null, Boolean, Primitive, Decimal, Timestamp, String, StringView, List, Map, Struct.

Add both positive and negative unit tests for each decoder type, to ensure the plumbing worked.

Are there any user-facing changes?

New API method, see above.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 12, 2025
Comment on lines +38 to +41
for p in pos {
if !matches!(tape.get(*p), TapeElement::Null) {
return Err(tape.error(*p, "null"));
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: Indentation-only change

@tustvold
Copy link
Contributor

Have you run the benchmarks for this?

@scovich
Copy link
Contributor Author

scovich commented Mar 12, 2025

Have you run the benchmarks for this?

Not yet... but https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md makes it look very easy. Will do so and report back.

@scovich
Copy link
Contributor Author

scovich commented Mar 12, 2025

@tustvold is cargo bench -p arrow-json sufficient? Or do I need to benchmark some other sub-crates as well? Asking because there didn't seem to be very many benchmarks in the arrow-json crate?

@alamb
Copy link
Contributor

alamb commented Mar 12, 2025

@tustvold is cargo bench -p arrow-json sufficient? Or do I need to benchmark some other sub-crates as well? Asking because there didn't seem to be very many benchmarks in the arrow-json crate?

I think this one is probably what @tustvold is referring to: https://github.com/apache/arrow-rs/blob/a75da00eed762f8ab201c6cb4388921ad9b67e7e/arrow/benches/json_reader.rs#L30-L29

so like

cargo bench --bench json_reader

@scovich
Copy link
Contributor Author

scovich commented Mar 12, 2025

Hmm, the benchmark results are not stable from run to run. Even benchmarking the main branch against itself gives a randomly and changing set of regressions and improvements. I tried on two very different computers: 2021 MacBook Pro (Apple M1 Max) and a 2019 Lenovo T490s (Intel Core i5-8365U). Different absolute numbers, same large jitter.

Is there some trick for getting stable numbers? I tried increasing the measurement interval to 10s, it didn't solve the problem.

@alamb
Copy link
Contributor

alamb commented Mar 12, 2025

Is there some trick for getting stable numbers? I tried increasing the measurement interval to 10s, it didn't solve the problem.

I am not suepr familar

Maybe you could use a non laptop (sometimes they vary based on thermostats, etc)?

@scovich
Copy link
Contributor Author

scovich commented Mar 13, 2025

Finally got some reasonably stable benchmark results using an EC2 m6i.8xlarge instance, rustc 1.85.0. It uncovered one issue with the append helpers I had introduced. After addressing that, we now have:

benchmark run1 run2 run3 run4 run5
small_bench_primitive --noise-- --noise-- --noise-- --noise-- --noise--
large_bench_primitive 2.27% faster 2.27% faster 1.65% faster 2.71% faster 1.36% faster
small_bench_list --noise-- --noise-- 3.69% faster 2.71% faster 6.02% faster

I don't know why my changes should have caused a speedup, but at least there's no slowdown.

Benchmarking commands used
# 5 runs against upstream main branch
git checkout 82c2d5f4c 
for i in $(seq 5); do cargo bench --bench json_reader -- --save-baseline main$i; done

# 5 runs against this PR
git switch json-ignore-type-conflicts-option
for i in $(seq 5); do cargo bench --bench json_reader -- --save-baseline feature$i; done

# compare the run results
for i in $(seq 5); do cargo bench --bench json_reader -- --load-baseline feature$i --baseline main$i; done

(see https://bheisler.github.io/criterion.rs/book/user_guide/command_line_options.html#baselines)

@scovich
Copy link
Contributor Author

scovich commented Mar 13, 2025

@tustvold -- does the above work? Or are there other benchmarks to double check?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Option for JSON parser to return NULL field values on type mismatch
3 participants