How EEG preprocessing shapes decoding performance

Working title: Multiverse 4 Decoding (m4d)

Kessler et al., 2024, How EEG preprocessing shapes decoding performance. arXiv. doi.org/10.48550/arXiv.2410.14453

  • Read the preprint here

  • Feel free to send me feedback via email

  • An interactive dashboard to explore the impact of changing single preprocessing steps on decoding performance is available on Streamlit

Abstract:

EEG preprocessing varies widely between studies, but its impact on classification performance remains poorly understood. To address this gap, we analyzed seven experiments with 40 participants drawn from the public ERP CORE dataset. We systematically varied key preprocessing steps, such as filtering, referencing, baseline interval, detrending, and multiple artifact correction steps. Then we performed trial-wise binary classification (i.e., decoding) using neural networks (EEGNet) or time-resolved logistic regression. Our findings demonstrate that preprocessing choices influenced decoding performance considerably. All artifact correction steps reduced decoding performance across experiments and models, while higher high-pass filter cutoffs consistently increased decoding performance. For EEGNet, baseline correction further increased decoding performance; for time-resolved classifiers, linear detrending and lower low-pass filter cutoffs increased decoding performance. The influence of other preprocessing choices was specific to each experiment or event-related potential component. The current results underline the importance of carefully selecting preprocessing steps for EEG-based decoding. While uncorrected artifacts may increase decoding performance, this comes at the expense of interpretability and model validity, as the model may exploit structured noise rather than the neural signal.

Structure of this repository

Subfolders contain more specific READMEs.

Note: The multiverse-preprocessed epoch data comprises >15 TB of storage.

If you are interested in the TBs of epoch data, send me an email and we will figure out a way to share them.

Single large files can be accessed via Zenodo, such as the summary CSVs for analysis and modeling (single accuracy and T-sum values per participant, experiment, and forking path).
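
For orientation, a minimal pandas sketch of loading such a summary table; the file name and column names are hypothetical placeholders, so check the Zenodo record for the actual layout:

# Minimal sketch: inspect a summary CSV downloaded from Zenodo.
# NOTE: the file name and column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("summary_accuracies.csv")  # one row per participant x experiment x forking path
print(df.groupby("experiment")["accuracy"].describe())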

If you reuse the scripts or pipeline, please adapt all the paths in the scripts! Some paths are absolute because data was shared across file servers to meet computing requirements.
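
One way to keep this manageable is to collect the machine-specific roots in a single place and derive all other paths from them; a hypothetical sketch, not part of the repository:

# Hypothetical sketch: centralize machine-specific roots before running the scripts.
from pathlib import Path

BASE_DIR = Path("/your/fileserver/m4d")  # adjust to your system
RAW_DIR = BASE_DIR / "data" / "raw"
INTERIM_DIR = BASE_DIR / "data" / "interim"
MODELS_DIR = BASE_DIR / "models"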

General project structure was adapted from cookiecutter:


├── README.md          <- The top-level README.
│
├── dashboard          <- dashboard submodule pointing to a different repository used for the streamlit app
│
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── env                <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `conda env export > env.json`
│
├── julia              <- Julia scripts.
│
├── manuscript         <- Manuscript submodule pointing to a different repository synced with Overleaf
│
├── models             <- Trained models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks and similar with one-off analyses
│
├── plots              <- Plots (other plots are directly plotted into the manuscript folder)
│
├── poster             <- Conference posters
│
├── presentation       <- Project presentations
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│
├── src                <- Python source code for use in this project.
│
└── targets            <- R (targets) pipeline.

Environments / Packages

The conda environment is saved in the folder env. All Python/Bash/SLURM scripts can be found in src.

The R environment used in the targets pipeline, together with all related processing scripts, can be found in targets; a list of packages is in env.

The Julia environment for LMM fitting is in env; the Julia scripts are in julia.

The system architecture and hardware details of the HPC used for all Python and Bash scripts (run via the SLURM job scheduling system) can be found in the MPCDF RAVEN user guide.

The system architecture and hardware details of the MacBook Pro (2020, M1, 16 GB RAM) used to run the targets pipeline in R and Julia can be found here.

Run analyses

The following is done on an HPC cluster with the SLURM job scheduling system and the conda environment set up.

Multiverse preprocessing and machine learning model fitting

Download the ERP CORE data for all participants and experiments. ⏳ several minutes to hours, depending on bandwidth

python3 src/0-download.py

Prepare the data ⏳ <1h (steps sketched below)

  • rearrange trigger values
  • rename annotations
  • get times
  • resample to 256 Hz
  • calculate artificial EOG channels
  • set montage

python3 src/1-pre-multiverse.py
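
The following is a hedged MNE-Python sketch of these preparation steps, not the repository's exact code; the file name, annotation mapping, and channel names are placeholders:

# Hedged MNE-Python sketch of the preparation steps (placeholders throughout).
import mne

raw = mne.io.read_raw_fif("sub-001_task-N170_raw.fif", preload=True)  # placeholder file
raw.annotations.rename({"stimulus/11": "face"})  # rename annotations (illustrative mapping)
raw.resample(256)                                # resample to 256 Hz
# derive an artificial bipolar EOG channel from two monopolar electrodes (placeholder names)
raw = mne.set_bipolar_reference(raw, anode="FP2", cathode="VEOG_lower",
                                ch_name="VEOG", drop_refs=False)
raw.set_channel_types({"VEOG": "eog"})
raw.set_montage("standard_1020", on_missing="ignore")  # set montage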

Run multiverse preprocessing: For each experiment and participant, preprocess the raw data using >2500 different preprocessing pipelines. ⏳ 24h per participant and experiment

bash src/2-multiverse.sh
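
Conceptually, the forking paths are the Cartesian product of the preprocessing options; a toy sketch of enumerating such a grid, where the option names and values are illustrative rather than the exact grid from the paper:

# Toy sketch: enumerate forking paths as the Cartesian product of options.
# Option names and values are illustrative, not the paper's exact grid.
from itertools import product

options = {
    "hpf": [None, 0.1, 0.5],                  # high-pass cutoff (Hz)
    "lpf": [None, 6, 20, 45],                 # low-pass cutoff (Hz)
    "reference": ["average", "Cz", "P9P10"],
    "emc": [None, "ica"],                     # ocular artifact correction
    "mac": [None, "ica"],                     # muscle artifact correction
    "baseline": [None, (-0.2, 0.0)],          # baseline interval (s)
    "detrend": [None, "linear"],
    "autoreject": [None, "interpolate"],
}
forking_paths = [dict(zip(options, values)) for values in product(*options.values())]
print(len(forking_paths))  # number of pipelines in this toy grid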

Calculate evoked responses and visualize them, in particular for an example forking path. ⏳ <1h

python3 src/3-evoked.py
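
A minimal MNE-Python sketch of this step; the file and condition names are placeholders:

# Minimal sketch: average the epochs of one forking path into an evoked response.
import mne

epochs = mne.read_epochs("sub-001_N170-epo.fif")  # placeholder file name
evoked = epochs["face"].average()                 # condition name is illustrative
evoked.plot()                                     # butterfly plot of the ERP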

Run decoding for each forking path, participant, and experiment:

  • EEGNet decoding ⏳ 24h per participant and experiment
  • Time-resolved decoding ⏳ <1h per participant and experiment (sketched below)

bash src/4a-eegnet.sh
bash src/4b-sliding.sh
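
For the time-resolved case, a hedged sketch with scikit-learn and MNE that fits one logistic regression per time point; this is not the repository's exact implementation, and the file name and labels are placeholders:

# Hedged sketch: time-resolved decoding with one logistic regression per time point.
import mne
from mne.decoding import SlidingEstimator, cross_val_multiscore
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

epochs = mne.read_epochs("sub-001_N170-epo.fif")  # placeholder file name
X = epochs.get_data()                             # trials x channels x time points
y = (epochs.events[:, 2] == epochs.event_id["face"]).astype(int)  # illustrative binary labels

clf = make_pipeline(StandardScaler(), LogisticRegression())
time_decoder = SlidingEstimator(clf, scoring="accuracy")
scores = cross_val_multiscore(time_decoder, X, y, cv=5)  # folds x time points
print(scores.mean(axis=0))                               # accuracy over time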

Aggregate EEGNet results for analysis in R/targets. ⏳ <1h

python src/5a-aggregate_results.py
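
In spirit, this amounts to concatenating the per-run result files into one long table; a hedged pandas sketch, where the directory layout and output path are assumptions:

# Hedged sketch: collect per-run accuracy files into one long CSV for R/targets.
# The glob pattern and output path are assumptions about the layout.
from pathlib import Path
import pandas as pd

files = sorted(Path("models").rglob("accuracy_*.csv"))
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
df.to_csv("data/processed/eegnet_accuracies.csv", index=False)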

Aggregate time-resolved results at the group level for analysis in R/targets, and visualize them for an example forking path. ⏳ <1h

python src/5b-sliding_group.py

Fitting Linear Mixed Models in Julia

All the following steps were performed on a MacBook Pro (2020, M1).

Run from a terminal with Julia installed, using the environment provided in env. ⏳ <24h

julia julia/pretarget_model_fitting_en.jl
julia julia/pretarget_model_fitting_tr.jl

Model fitting in Julia is orders of magnitude faster than in R, especially for large models and data sets. The bottleneck, however, is the conversion from a Julia LMM object to an R LMM object, which takes a few hours per model (for reasons that escape me).

These steps were performed before the targets pipeline to prevent computationally intensive steps from rerunning after pipeline invalidation. Other, less intensive steps shown in the manuscript appendix, which also run in Julia, are performed from within the targets pipeline.

Modeling the impact of preprocessing on decoding performance

The following is performed within an R targets pipeline with access to Julia. From within RStudio, source targets/renv/activate.R and targets/_targets.R; _targets.R contains the entire pipeline.

The pipeline (and the status of each node) can be visualized using

tar_visnetwork()

The complete pipeline is run using ⏳ <2h

tar_make()

The resulting plots are directly plotted into the manuscript folder (git submodule).

License

CC BY 4.0

This work is licensed under a Creative Commons Attribution 4.0 International License.

