Computational Notebooks for "Morphology-Aware Profiling of Highly Multiplexed Tissue Images using Variational Autoencoders"
Gregory J. Baker1,2,3,&,*,#, Edward Novikov1,4,*, Shannon Coy1,2,5, Yu-An Chen1,2, Clemens B. Hug1, Zergham Ahmed1,4, Sebastián A. Cajas Ordóñez4, Siyu Huang4,%, Clarence Yapp1, Artem Sokolov1, Hanspeter Pfister4, Peter K. Sorger1,2,3,#
1 Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA
2 Ludwig Center for Cancer Research at Harvard, Harvard Medical School, Boston, MA
3 Department of Systems Biology, Harvard Medical School, Boston, MA
4 Harvard John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA
5 Department of Pathology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA
& Current affiliation: Division of Oncological Sciences, Knight Cancer Institute, Oregon Health & Science University, Portland, OR
% Current affiliation: Visual Computing Division, School of Computing, Clemson University, Clemson, SC
*Co-first Authors: G.J.B., E.N.
#Corresponding Authors: gbak7696@gmail.com (G.J.B.), peter_sorger@hms.harvard.edu (P.K.S.)
Spatial proteomics (highly multiplexed tissue imaging) provides unprecedented insight into the types, states, and spatial organization of cells within preserved tissue environments. To enable single-cell analysis, high-plex images are typically segmented using algorithms that assign marker signals to individual cells. However, conventional segmentation is often imprecise and susceptible to signal spillover between adjacent cells, interfering with accurate cell type identification. Segmentation-based methods also fail to capture the morphological detail that histopathologists rely on for disease diagnosis and staging. Here, we present a method that combines unsupervised, pixel-level machine learning using autoencoders with traditional segmentation to generate single-cell data that captures information on protein abundance, morphology, and local neighborhood in a manner analogous to human experts while overcoming the problem of signal spillover. The result is a more accurate and nuanced characterization of cell types and states than segmentation-based analysis alone.
Python code in this GitHub repository is organized into Jupyter notebooks used to generate the figures shown in the paper. To run the code, first clone this repository onto your computer by opening a terminal window and entering the following command:
git clone https://github.com/labsyspharm/vae-paper.git
Next, change directories into the top level directory of the cloned repository and create and activate a dedicated Conda environment containing the necessary Python libraries for running the code:
cd <path/to/cloned/repo>
conda env create -f environment.yml
conda activate morphaeus-paper
If conda is not already installed, you can download it by following the installation instructions in the official Conda documentation (https://docs.conda.io).
To browse the Jupyter notebooks, change directories to the src folder and launch Jupyter Lab with the following commands:
cd src
jupyter lab
To re-run the Jupyter notebooks, input data must first be downloaded from our public Amazon S3 bucket. This can be done by running the download.py script located in the src folder. In addition to the required input data, this script also downloads a folder of precomputed output files to serve as a reference (output_reference):
# from the top level directory
python src/download.py
Note: ~313GB of storage space is required to download the complete file set.
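To confirm there is enough room before starting the download, here is a quick free-space check in Python (a convenience sketch, not part of the repository; run it from the drive where the data will be stored):

# Free-space check before downloading the ~313GB file set.
import shutil

free_gb = shutil.disk_usage(".").free / 1e9  # free bytes on this volume, in GB
print(f"{free_gb:.0f} GB free")
if free_gb < 313:
    print("Warning: not enough space for the complete file set.")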
To re-run any of the notebooks in Jupyter Lab, first double-click a .ipynb file in the file browser at the left of the screen; the notebook will open at the right. Then click the double-arrow button at the top of the notebook to restart the kernel and run all cells. Notebook output will be saved to a folder called output at the top level of the repository.
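Notebooks can also be executed headlessly. A minimal Python sketch using the nbformat and nbclient libraries (assuming both are installed in the active environment; the notebook filename below is a placeholder for any .ipynb file in src):

# Headless notebook execution; "example.ipynb" is a placeholder filename.
import nbformat
from nbclient import NotebookClient

nb = nbformat.read("example.ipynb", as_version=4)
NotebookClient(nb, kernel_name="python3").execute()  # runs all cells in order
nbformat.write(nb, "example_executed.ipynb")         # save the executed copy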
MORPHÆUS source code is freely available for academic re-use under the MIT license on GitHub and is archived on Zenodo.
To demo the data analysis pipeline, be sure that the input data files have first been downloaded as described above, then change directories to the demo directory and run the following commands:
cd demo
vae config.yml
This will execute the pipeline on a small subsample of data from the CyCIF-1A image presented in the paper, demonstrating all major modules, from single-cell CSV subsampling and image patch generation to VAE model training, plot visualization, and concept saliency analysis.
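For orientation, below is a minimal sketch of a convolutional VAE of the kind trained on image patches, written with Keras. This is not the MORPHÆUS implementation: the 64x64 patch size, single channel, and 16-dimensional latent space are placeholder assumptions.

# Minimal convolutional VAE sketch (illustrative only; hyperparameters are
# placeholder assumptions, not the values used by MORPHAEUS).
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

LATENT_DIM = 16

class Sampling(layers.Layer):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

# Encoder: image patch -> parameters of a Gaussian in latent space
enc_in = keras.Input(shape=(64, 64, 1))
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(enc_in)
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Flatten()(x)
z_mean = layers.Dense(LATENT_DIM)(x)
z_log_var = layers.Dense(LATENT_DIM)(x)
z = Sampling()([z_mean, z_log_var])
encoder = keras.Model(enc_in, [z_mean, z_log_var, z], name="encoder")

# Decoder: latent vector -> reconstructed patch
dec_in = keras.Input(shape=(LATENT_DIM,))
x = layers.Dense(16 * 16 * 64, activation="relu")(dec_in)
x = layers.Reshape((16, 16, 64))(x)
x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
dec_out = layers.Conv2DTranspose(1, 3, padding="same", activation="sigmoid")(x)
decoder = keras.Model(dec_in, dec_out, name="decoder")

class VAE(keras.Model):
    """Jointly optimizes pixel reconstruction loss and the KL divergence
    that regularizes the latent space."""
    def __init__(self, encoder, decoder, **kwargs):
        super().__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder

    def train_step(self, data):
        with tf.GradientTape() as tape:
            z_mean, z_log_var, z = self.encoder(data)
            recon = self.decoder(z)
            recon_loss = tf.reduce_mean(tf.reduce_sum(
                keras.losses.binary_crossentropy(data, recon), axis=(1, 2)))
            kl_loss = -0.5 * tf.reduce_mean(tf.reduce_sum(
                1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1))
            loss = recon_loss + kl_loss
        grads = tape.gradient(loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        return {"loss": loss}

vae = VAE(encoder, decoder)
vae.compile(optimizer=keras.optimizers.Adam())
# vae.fit(patches, epochs=100, batch_size=32)  # patches: float array (N, 64, 64, 1) in [0, 1]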
Note: demo results will differ from those shown in the paper due to the smaller training dataset and fewer training epochs. Each epoch takes roughly 30 seconds to 1 minute running locally on CPUs, and ~100 epochs are required before learned reconstructions begin to resemble cells and the data start to form discrete clusters in feature space. As a convenience, lightly pre-trained encoder and decoder networks are provided so that the pipeline skips the VAE model training step. To train a new model instead, comment out the encoder.hdf5 and decoder.hdf5 files, as well as the TRAIN_VAE.txt checkpoint file, before executing the pipeline.
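The provided networks can also be inspected directly. A sketch using Keras (this assumes the .hdf5 files are full serialized Keras models rather than weights-only checkpoints, and that paths are relative to the demo directory; models containing custom layers may additionally require the custom_objects argument):

# Load and inspect the lightly pre-trained encoder/decoder networks.
# Paths and serialization format are assumptions; adjust to your local layout.
from tensorflow import keras

encoder = keras.models.load_model("encoder.hdf5", compile=False)
decoder = keras.models.load_model("decoder.hdf5", compile=False)
encoder.summary()  # print the layer architecture of the encoder
decoder.summary()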
This GitHub repository will be archived on Zenodo following publication of the manuscript.
This work was supported by Ludwig Cancer Research and the Ludwig Center at Harvard (P.K.S., S.S.), the Gray Foundation, and by NIH NCI grants U01-CA284207 and U2C-CA233262. S.S. is supported by the BWH President’s Scholars Award. Results shown in this study are in part based upon data generated by the Human Tumor Atlas Network (HTAN, https://humantumoratlas.org/).
Baker GJ, Novikov E, et al. Morphology-Aware Profiling of Highly Multiplexed Tissue Images using Variational Autoencoders. bioRxiv (2025). https://doi.org/10.1101/2025.06.23.661064