Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Allow participants.tsv to contain a superset of subject directories and subjects listed in phenotype files #2044

Merged
merged 7 commits into from
Feb 14, 2025

Conversation

ericearl
Copy link
Collaborator

@ericearl ericearl commented Feb 5, 2025

The participants description in src/schema/objects/files.yaml now contains the comprehensive superset rule from #914. This change allows phenotype-only participant_ids (participants not present in the sub-XX folders) to be included in the participants.tsv file. @effigies I believe has a plan to integrate this change into the next BIDS release for the BIDS schema validator.

The participants schema description now contains the comprehensive superset rule from bids-standard#914.
Copy link

codecov bot commented Feb 5, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.44%. Comparing base (6994398) to head (a8de39d).
Report is 8 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2044   +/-   ##
=======================================
  Coverage   82.44%   82.44%           
=======================================
  Files          17       17           
  Lines        1504     1504           
=======================================
  Hits         1240     1240           
  Misses        264      264           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Committing the good suggestion.

Co-authored-by: Chris Markiewicz <[email protected]>
@ericearl
Copy link
Collaborator Author

ericearl commented Feb 5, 2025

Yes, that looks like it satisfies our need. Thanks for the suggestion @effigies!

@effigies
Copy link
Collaborator

effigies commented Feb 5, 2025

@rwblair We pre-load all phenotype files at the beginning of the run in order to populate dataset.subjects.phenotype. What if we dropped that and instead used the rule:

RuleName:
  selectors:
    - datatype == 'phenotype'
    - extension == '.tsv'
  checks:
    - |
      allequal(
        sorted(intersects(dataset.subjects.participant_id, columns.participant_id)),
        sorted(columns.participant_id)
      )

I'm curious which one would be more inefficient:

  1. Load all phenotype files, take the union, and run the intersection once. Then load all phenotype files when validating them. (current)
  2. When validating each phenotype file, run the intersection with this file's column only. (proposal)

It would also be worth considering which one could be optimized under the hood. While it is simplest if the context continues to be serializable to a JSON object, we could consider set-like structures that make it more efficient to run intersects() when encountered.

@yarikoptic
Copy link
Collaborator

yarikoptic commented Feb 5, 2025

note (please correct me if I am wrong): this rule ATM could only be stated in human language and cannot be operationalized as a "schema rule" for the validator since placement of participant_ids within phenotype/ is not formalized.

me posting above overlapped with @effigies actually providing "howto" ;)

@yarikoptic
Copy link
Collaborator

I crossed my prior note, but reflecting on the rule by @effigies above, do we already provide top level directory phenotype/ as "datatype" ?

ATM no rule mentions it as a datatype, here is the list/counts
❯ git grep -h 'datatype ==' | sed -e 's,^ *,,g' | sort | uniq -c | sort -n
      1 - datatype == 'fmap'
      2 - datatype == "beh"
      2 - datatype == "dwi"
      2 - datatype == "mrs"
      3 - datatype == "anat"
      6 - datatype == "motion"
      7 - datatype == "micr"
      9 - datatype == "fmap"
      9 - datatype == "func"
     13 - datatype == "ieeg"
     17 - datatype == "eeg"
     18 - datatype == "pet"
     20 - datatype == "perf"
     21 - datatype == "meg"
     24 - datatype == "nirs"

Would we similarly define stimuli and potentially other data types for other possible top level directories?

@effigies
Copy link
Collaborator

effigies commented Feb 5, 2025

I did pragmatically use it as a datatype in #1672.

I don't think there's a call to make stimuli that, as long as there is no constraint on the contents of the stimuli directory. My understanding was your preference was to classify stimuli as a new dataset type and validate its contents separately?

@effigies
Copy link
Collaborator

effigies commented Feb 5, 2025

@ericearl I took a quick pass at updating the schema. Would you mind putting together a small example for bids-examples? Maybe one with sub-01/ and sub-02/ directories, phenotype data for sub-01 and sub-03, and a participants.tsv that contains subs 1-4?

@ericearl
Copy link
Collaborator Author

ericearl commented Feb 6, 2025

@effigies I made our draft PR ready for review over on bids-examples at bids-standard/bids-examples#465. You'll want pheno004 for the example you're asking for.

@ericearl
Copy link
Collaborator Author

@effigies What else needs to happen next to finish off this PR? I know there's got to be the two reviews that aren't done by you or I.

@effigies
Copy link
Collaborator

We need to get the examples validating.

@ericearl
Copy link
Collaborator Author

@effigies All 4 or just pheno004?

@effigies
Copy link
Collaborator

I guess just 004 for this, but if the others aren't going to be fixed, it probably makes sense to pull out into its own PR.

@yarikoptic

This comment was marked as off-topic.

@ericearl
Copy link
Collaborator Author

I added just the pheno004/ example dataset to a fresh PR: bids-standard/bids-examples#483.

@effigies effigies changed the title [ENH] Add in participants+phenotype files comprehensive superset rule from issue 914 [ENH] Allow participants.tsv to contain a superset of subject directories and subjects listed in phenotype files Feb 13, 2025
@ericearl ericearl requested a review from nellh February 13, 2025 20:45
@effigies
Copy link
Collaborator

2 independent reviews and more than a week since substantive changes. Merging.

@effigies effigies merged commit fa2b5d8 into bids-standard:master Feb 14, 2025
24 of 25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants