[ENH] Allow participants.tsv to contain a superset of subject directories and subjects listed in phenotype files #2044

ericearl · 2025-02-05T15:29:33Z

The participants description in src/schema/objects/files.yaml now contains the comprehensive superset rule from #914. This change allows phenotype-only participant_ids (participants not present in the sub-XX folders) to be included in the participants.tsv file. I believe @effigies has a plan to integrate this change into the next BIDS release for the BIDS schema validator.

The participants schema description now contains the comprehensive superset rule from bids-standard#914.
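
As a rough illustration of the superset rule described above, here is a minimal Python sketch. It is not the validator's implementation; the function name `superset_violations` and the example IDs are made up for illustration. It only shows the set logic: participants.tsv may list extra participants, but every subject directory and every participant_id found in phenotype files should appear in it.

```python
def superset_violations(participants, subject_dirs, phenotype_ids):
    """Return IDs missing from participants.tsv, grouped by where they were found.

    Hypothetical helper: all three arguments are plain iterables of
    participant_id strings such as "sub-01".
    """
    listed = set(participants)
    return {
        # Subject directories that participants.tsv does not list.
        "subject_dirs": sorted(set(subject_dirs) - listed),
        # participant_ids in phenotype files that participants.tsv does not list.
        "phenotype": sorted(set(phenotype_ids) - listed),
    }


report = superset_violations(
    participants=["sub-01", "sub-02", "sub-03"],
    subject_dirs=["sub-01", "sub-02"],
    phenotype_ids=["sub-01", "sub-03", "sub-04"],
)
print(report)
```

Both lists being empty would mean the rule is satisfied; in this made-up input, sub-04 appears only in phenotype data and is reported.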

codecov · 2025-02-05T15:33:39Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.44%. Comparing base (6994398) to head (a8de39d).
Report is 8 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #2044   +/-   ##
=======================================
  Coverage   82.44%   82.44%           
=======================================
  Files          17       17           
  Lines        1504     1504           
=======================================
  Hits         1240     1240           
  Misses        264      264


src/schema/objects/files.yaml

Committing the good suggestion. Co-authored-by: Chris Markiewicz <[email protected]>

ericearl · 2025-02-05T15:44:43Z

Yes, that looks like it satisfies our need. Thanks for the suggestion @effigies!

effigies · 2025-02-05T15:51:45Z

@rwblair We pre-load all phenotype files at the beginning of the run in order to populate dataset.subjects.phenotype. What if we dropped that and instead used the rule:

RuleName:
  selectors:
    - datatype == 'phenotype'
    - extension == '.tsv'
  checks:
    - |
      allequal(
        sorted(intersects(dataset.subjects.participant_id, columns.participant_id)),
        sorted(columns.participant_id)
      )

I'm curious which one would be less efficient:

1. Load all phenotype files, take the union, and run the intersection once. Then load all phenotype files again when validating them. (current)
2. When validating each phenotype file, run the intersection with this file's column only. (proposal)

It would also be worth considering which one could be optimized under the hood. While it is simplest if the context continues to be serializable to a JSON object, we could consider set-like structures that make it more efficient to run intersects() when encountered.
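
To make the comparison concrete, here is a small Python sketch of both strategies using built-in sets. It is an assumption-laden emulation, not the validator's code: `intersects` below only mimics the "return the intersection when non-empty" behaviour mentioned in this PR, the function names are hypothetical, and comparing as sets ignores duplicate rows, which the real expression may treat differently.

```python
def intersects(a, b):
    """Return the sorted common elements of a and b, or False if there are none."""
    common = sorted(set(a) & set(b))
    return common if common else False


def phenotype_ids_are_listed(participants_ids, column_ids):
    """True if every ID in column_ids also appears in participants_ids."""
    common = intersects(participants_ids, column_ids)
    # Treat the "no overlap" result (False) as an empty list for the comparison.
    return (common or []) == sorted(set(column_ids))


def check_union_once(participants_ids, phenotype_files):
    """Strategy 1: take the union over all files, then intersect once."""
    union = set()
    for ids in phenotype_files.values():
        union.update(ids)
    return phenotype_ids_are_listed(participants_ids, union)


def check_per_file(participants_ids, phenotype_files):
    """Strategy 2: intersect with each file's participant_id column separately."""
    return {
        name: phenotype_ids_are_listed(participants_ids, ids)
        for name, ids in phenotype_files.items()
    }


participants = ["sub-01", "sub-02", "sub-03", "sub-04"]
files = {  # hypothetical file names and contents
    "phenotype/a.tsv": ["sub-01", "sub-03"],
    "phenotype/b.tsv": ["sub-05"],
}
print(check_union_once(participants, files))
print(check_per_file(participants, files))
```

One practical difference the sketch makes visible: the per-file variant can say which phenotype file contains an unlisted participant, while the union variant only reports that some file does.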

yarikoptic · 2025-02-05T15:55:45Z

Note (please correct me if I am wrong): at the moment this rule can only be stated in human language and cannot be operationalized as a "schema rule" for the validator, since the placement of participant_ids within phenotype/ is not formalized.

My post above overlapped with @effigies actually providing the "howto" ;)

yarikoptic · 2025-02-05T16:30:06Z

I crossed out my prior note, but reflecting on the rule by @effigies above: do we already provide the top-level directory phenotype/ as a "datatype"?

ATM no rule mentions it as a datatype; here is the list with counts:

❯ git grep -h 'datatype ==' | sed -e 's,^ *,,g' | sort | uniq -c | sort -n
      1 - datatype == 'fmap'
      2 - datatype == "beh"
      2 - datatype == "dwi"
      2 - datatype == "mrs"
      3 - datatype == "anat"
      6 - datatype == "motion"
      7 - datatype == "micr"
      9 - datatype == "fmap"
      9 - datatype == "func"
     13 - datatype == "ieeg"
     17 - datatype == "eeg"
     18 - datatype == "pet"
     20 - datatype == "perf"
     21 - datatype == "meg"
     24 - datatype == "nirs"

Would we similarly define stimuli and potentially other data types for other possible top level directories?

effigies · 2025-02-05T16:33:45Z

Classify "phenotype/" as a datatype directory with no subject/session parent #1828

I did pragmatically use it as a datatype in #1672.

I don't think there's a call to make stimuli a datatype, as long as there is no constraint on the contents of the stimuli directory. My understanding was that your preference is to classify stimuli as a new dataset type and validate its contents separately?


effigies · 2025-02-05T21:41:07Z

@ericearl I took a quick pass at updating the schema. Would you mind putting together a small example for bids-examples? Maybe one with sub-01/ and sub-02/ directories, phenotype data for sub-01 and sub-03, and a participants.tsv that contains subs 1-4?
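
For readers who want to see the requested layout concretely, the following Python sketch builds it in a temporary directory and checks the superset relationship. It is only an illustration under stated assumptions: the phenotype file name `measure.tsv`, its `score` column, and the helper names are invented here, and this is not the content of the actual pheno004 example.

```python
import csv
import tempfile
from pathlib import Path


def write_tsv(path, rows):
    """Write rows (lists of strings) as a tab-separated file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        csv.writer(f, delimiter="\t", lineterminator="\n").writerows(rows)


def read_ids(path):
    """Return the set of participant_id values in a TSV file."""
    with path.open(newline="") as f:
        return {row["participant_id"] for row in csv.DictReader(f, delimiter="\t")}


def build_example(root):
    """sub-01/ and sub-02/ on disk, phenotype rows for sub-01 and sub-03,
    and a participants.tsv listing sub-01 through sub-04."""
    for sub in ("sub-01", "sub-02"):
        (root / sub).mkdir(parents=True)
    write_tsv(root / "phenotype" / "measure.tsv",  # hypothetical file name
              [["participant_id", "score"], ["sub-01", "1"], ["sub-03", "2"]])
    write_tsv(root / "participants.tsv",
              [["participant_id"]] + [[f"sub-{i:02d}"] for i in range(1, 5)])


def ids_missing_from_participants(root):
    """IDs found in sub-* directories or phenotype files but not in participants.tsv."""
    listed = read_ids(root / "participants.tsv")
    dirs = {p.name for p in root.glob("sub-*") if p.is_dir()}
    pheno = set()
    for tsv in (root / "phenotype").glob("*.tsv"):
        pheno |= read_ids(tsv)
    return sorted((dirs | pheno) - listed)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_example(root)
    print(ids_missing_from_participants(root))
```

An empty list means participants.tsv is a superset of both sources, which is the situation this layout is meant to exercise: sub-04 is listed but has neither a directory nor phenotype data.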

ericearl · 2025-02-06T20:40:26Z

@effigies I made our draft PR ready for review over on bids-examples at bids-standard/bids-examples#465. You'll want pheno004 for the example you're asking for.

ericearl · 2025-02-11T16:21:39Z

@effigies What else needs to happen next to finish off this PR? I know we still need the two reviews that aren't done by you or me.

effigies · 2025-02-11T16:49:28Z

We need to get the examples validating.

ericearl · 2025-02-11T16:58:27Z

@effigies All 4 or just pheno004?

effigies · 2025-02-11T17:03:42Z

I guess just 004 for this, but if the others aren't going to be fixed, it probably makes sense to pull out into its own PR.

ericearl · 2025-02-13T19:06:25Z

I added just the pheno004/ example dataset to a fresh PR: bids-standard/bids-examples#483.

effigies · 2025-02-14T14:28:47Z

2 independent reviews and more than a week since substantive changes. Merging.

Update src/schema/objects/files.yaml

f734f1f

The participants schema description now contains the comprehensive superset rule from bids-standard#914.

ericearl added the phenotype label Feb 5, 2025

ericearl requested review from effigies, Remi-Gau and tsalo February 5, 2025 15:29

ericearl self-assigned this Feb 5, 2025

ericearl requested a review from erdalkaraca as a code owner February 5, 2025 15:29

ericearl mentioned this pull request Feb 5, 2025

PHENOTYPE_SUBJECTS_MISSING issue bids-standard/bids-validator#54

Open

effigies reviewed Feb 5, 2025


Update src/schema/objects/files.yaml

4ab2679

Committing the good suggestion. Co-authored-by: Chris Markiewicz <[email protected]>

effigies reviewed Feb 5, 2025


effigies and others added 4 commits February 5, 2025 11:44

Update src/schema/objects/files.yaml

4781f25

Merge branch 'master' into issue-914-dev

9ea17c5

doc(schema): Update intersects() to return the intersection if non-empty

2f9b5f6

feat(schema): Require participants.tsv to be a superset of sub_dirs/p…

a8de39d

…articipants

effigies mentioned this pull request Feb 6, 2025

feat(expr): Make intersects() return the intersection when non-empty bids-standard/bids-validator#150

Merged

ericearl mentioned this pull request Feb 6, 2025

Example datasets for bep036 bids-standard/bids-examples#465

Open

4 tasks


effigies mentioned this pull request Feb 12, 2025

Classify "phenotype/" as a datatype directory with no subject/session parent #1828

Open

effigies mentioned this pull request Feb 13, 2025

Add pheno004 example dataset bids-standard/bids-examples#483

Merged

effigies approved these changes Feb 13, 2025


effigies changed the title ~~[ENH] Add in participants+phenotype files comprehensive superset rule from issue 914~~ [ENH] Allow participants.tsv to contain a superset of subject directories and subjects listed in phenotype files Feb 13, 2025

effigies added the needs review label Feb 13, 2025

schema: Improve error messages

4ce9fea

ericearl requested a review from nellh February 13, 2025 20:45

nellh approved these changes Feb 13, 2025


Remi-Gau approved these changes Feb 14, 2025


effigies merged commit fa2b5d8 into bids-standard:master Feb 14, 2025
24 of 25 checks passed
