2026
In: (9th International Conference on Medical Imaging with Deep Learning, MIDL 2026, 8-10 July 2026, Chientan). 2026. 3443-3463 ( ; 315)
Genetic variation provides stable, time-invariant markers of disease risk and can therefore reveal upstream mechanisms underlying complex traits. Genome-wide association studies (GWAS) have identified thousands of loci associated with disease, yet most remain difficult to interpret because the intermediate phenotypes linking genotype to disease are unknown. Here, we address the question whether disease-associated genetic loci can be directly used to extract such risk-related features from quantitative phenotypes, including functional tests and medical imaging. We introduce GEMCONT (GEnetics-based Multimodal CONTrastive Learning), a multimodal contrastive learning framework that aligns genotype and phenotype representations in a shared latent space. Unlike task-agnostic multimodal pretraining, GEMCONT is disease-conditioned: GWAS-informed variant panels act as targeted supervision to learn risk-relevant imaging embeddings. To reflect the weak, additive nature of genetic effects, it employs a linear genetic encoder alongside a deep phenotypic encoder. We validate GEMCONT in controlled simulations and apply it to two real-world settings: spirometry curves for asthma and retinal fundus images for glaucoma. In both, GEMCONT improves disease risk prediction and enhances recovery of genetic associations compared with standard unsupervised or polygenic risk–based models. Altogether, our results demonstrate that incorporating stable genetic supervision into multimodal representation learning enables the extraction of genetically informed risk traits, refining disease phenotypes and improving the interpretability of association studies.
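A minimal sketch of the kind of genotype-phenotype alignment described above, assuming a symmetric InfoNCE objective over matched pairs in a batch. The abstract does not name the loss, so InfoNCE, the temperature value, and the helper names `linear_encode` and `info_nce` are illustrative assumptions, not the authors' implementation. The phenotype embeddings are taken as given (in the paper they come from a deep encoder).

```python
import math

def linear_encode(genotypes, weights):
    """Linear genetic encoder: each output dimension is a weighted sum of allele counts.

    genotypes: list of per-individual allele-count vectors.
    weights:   list of weight vectors, one per output dimension (hypothetical layout).
    """
    return [[sum(g * w for g, w in zip(row, col)) for col in weights] for row in genotypes]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(geno_emb, pheno_emb, temperature=0.1):
    """Symmetric InfoNCE (assumed loss): matched pairs are positives, all other pairs negatives."""
    g = [l2_normalize(v) for v in geno_emb]
    p = [l2_normalize(v) for v in pheno_emb]
    n = len(g)
    # Cosine similarity between every genotype and every phenotype embedding.
    logits = [[sum(a * b for a, b in zip(g[i], p[j])) / temperature for j in range(n)]
              for i in range(n)]

    def row_loss(mat):
        total = 0.0
        for i, row in enumerate(mat):
            m = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(mat)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (row_loss(logits) + row_loss(transposed))
```

The loss is small when each individual's genotype embedding is closest to that same individual's phenotype embedding and large when the pairing is scrambled.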
Int. J. Mol. Sci. 27:4803 (2026)
Chromatin conformation capture technologies have revealed the complex 3D organization of the genome and its key regulatory role. Single-cell Hi-C (scHi-C) maps this architecture at single-cell level, but its sparse nature makes data interpretation challenging, and tools for their analysis remain limited. Here, we present a physics-based framework that combines polymer modeling with computational methods to reconstruct full 3D genome structures from sparse scHi-C data. Using both artificial and experimental data, we show that our approach imputes missing contacts and recovers accurate structures validated against independent Hi-C and established polymer models. Applied to scHi-C from a 15 Mb region of human HeLa-S3 cells as a case study, the method uncovers distinct structural classes defined by the spatial distribution of chromatin binding domains. The reconstructed models enable robust downstream analyses, including the identification of single-cell topologically associated domains (TADs), which appear highly variable across cells yet tend to accumulate around those observed in bulk. Importantly, the inferred 3D polymer models capture diverse epigenetic signatures, with active chromatin domains exhibiting greater structural variability than repressive ones across single cells. Overall, our study provides a mechanistic and interpretable framework to analyze sparse scHi-C data, highlighting how polymer physics can be leveraged to uncover genome architecture and its functional variability at single-cell resolution.
Scientific Article
Genome Biol. 27:122 (2026)
Understanding how genetic variation shapes tissue structure is crucial for disease biology, yet scalable, general-purpose frameworks for genetic analysis of histology traits are lacking. We present HistoGWAS, a framework for genome-wide association studies of histology data that leverages foundation models for automated trait definition, variance component models for efficient association testing, and generative models for variant effect interpretation. Applied to 11 tissues from the Genotype-Tissue Expression project, HistoGWAS identifies four genome-wide significant loci associated with tissue histology (tissue quantitative trait loci, tissueQTLs), which we link to molecular changes and complex traits. Power analyses demonstrate scalability to population-scale histology cohorts.
Scientific Article
2025
In: (20th Machine Learning in Computational Biology, MLCB 2025, 10-11 September 2025, New York). 2025. accepted ( ; 311)
AI foundation models have transformed cancer histopathology by enabling rich, data-driven feature extraction from H&E-stained whole-slide images. However, their application to studying how germline variation shapes tumor morphology remains limited. Here, we perform the first genome-wide association study of breast cancer morphology, independently analyzing AI-derived features from histology images and diagnostic pathology reports. Analyzing H&E slides from 753 patients with matched germline data, we identified six genome-wide significant loci associated with either imaging or textual features, two of which replicated across modalities. We then linked these two loci to histological features described in pathology reports, visual histological features through generative modelling, gene expression modules and patient survival. We found that rs819976 in ATAD3B is associated with disorganized, necrotic tumor morphology, poor-prognosis expression programs, and clinical features including invasive lobular carcinoma and ER positivity. These findings demonstrate the power of AI-based histology to uncover and characterize germline variants that shape tumor morphology, and assess their clinical significance.
Genome Res. 35, 2682-2690 (2025)
Gene-level rare variant association tests (RVATs) are essential for uncovering disease mechanisms and identifying therapeutic targets. Advances in sequence-based machine learning have generated diverse variant pathogenicity scores, creating opportunities to improve RVATs. However, existing methods often rely on rigid models or single annotations, limiting their ability to leverage these advances. We introduce BayesRVAT, a Bayesian rare variant association test that jointly models multiple annotations. By specifying priors on annotation effects and estimating gene-trait-specific posterior burden scores, BayesRVAT flexibly captures diverse rare-variant architectures. In simulations, BayesRVAT improves power while maintaining calibration. In UK Biobank analyses, it detects 10.2% more blood-trait associations and reveals novel gene-disease links, including PRPH2 with retinal disease. Integrating BayesRVAT within omnibus frameworks further increases discoveries, demonstrating that flexible annotation modeling captures complementary signals beyond existing burden and variance-component tests.
Scientific Article
In: (Research in Computational Molecular Biology). 2025. 428-431 (Lect. Notes Comput. Sc. ; 15647 LNBI)
Gene-based rare variant association tests (RVATs) are essential for uncovering disease mechanisms and identifying candidate drug targets, yet existing frameworks lack flexibility in integrating multiple variant annotations. Here, we introduce BayesRVAT, a Bayesian framework for RVAT which models variant effects using priors informed by multiple annotations. We show that BayesRVAT outperforms state-of-the-art burden test strategies in both simulations and an analysis of 12 blood traits from the UK Biobank.
Alzheimers Dement. 21:e70170 (2025)
INTRODUCTION: In Alzheimer's disease (AD), fibrillar tau gradually spreads from an initial seed region to larger brain areas. However, the brain properties underlying the region-dependent susceptibility to tau accumulation remain unclear. METHODS: We constructed multimodal spatial gradients to characterize molecular properties and connectomic architecture. A predictive model for regional tau deposition was developed by integrating embeddings in the principal gradients of global connectome gradients with gene expression, neurotransmitters, myelin, and amyloid-beta. The model was trained on amyloid-beta-positive participants from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and externally validated in independent datasets. RESULTS: The combination of gradients explained up to 77.7% of cross-sectional and 77.3% of longitudinal inter-regional variance of tau deposition. Gene set enrichment analysis of a major gene expression gradient suggests that synaptic transmission confers increased susceptibility to tau. DISCUSSION: Our findings reveal a spatially heterogeneous molecular landscape shaping regional susceptibility to tau deposition, presenting a powerful system-level explanatory model of tau pathology in AD. HIGHLIGHTS: Spatial gradients of fundamental molecular brain properties associated with tau pathology. The explanatory power showed high consistency across studies. Genetic analyses suggested that synapse expression plays a vital role in tau accumulation.
Scientific Article
Nat. Commun. 16:3061 (2025)
Despite the frequent implication of aberrant gene expression in diseases, algorithms predicting aberrantly expressed genes of an individual are lacking. To address this need, we compile an aberrant expression prediction benchmark covering 8.2 million rare variants from 633 individuals across 49 tissues. While not geared toward aberrant expression, the deleteriousness score CADD and the loss-of-function predictor LOFTEE show mild predictive ability (1-1.6% average precision). Leveraging these and further variant annotations, we next train AbExp, a model that yields 12% average precision by combining in a tissue-specific fashion expression variability with variant effects on isoforms and on aberrant splicing. Integrating expression measurements from clinically accessible tissues leads to another two-fold improvement. Furthermore, we show on UK Biobank blood traits that performing rare variant association testing using the continuous and tissue-specific AbExp variant scores instead of LOFTEE variant burden increases gene discovery sensitivity and enables improved phenotype predictions.
Scientific Article
Nat. Commun. 16:3278 (2025)
Longitudinal multi-view omics data offer unique insights into the temporal dynamics of individual-level physiology, which provides opportunities to advance personalized healthcare. However, the common occurrence of incomplete views makes extrapolation tasks difficult, and there is a lack of tailored methods for this critical issue. Here, we introduce LEOPARD, an innovative approach specifically designed to complete missing views in multi-timepoint omics data. By disentangling longitudinal omics data into content and temporal representations, LEOPARD transfers the temporal knowledge to the omics-specific content, thereby completing missing views. The effectiveness of LEOPARD is validated on four real-world omics datasets constructed with data from the MGH COVID study and the KORA cohort, spanning periods from 3 days to 14 years. Compared to conventional imputation methods, such as missForest, PMM, GLMM, and cGAN, LEOPARD yields the most robust results across the benchmark datasets. LEOPARD-imputed data also achieve the highest agreement with observed data in our analyses for age-associated metabolites detection, estimated glomerular filtration rate-associated proteins identification, and chronic kidney disease prediction. Our work takes the first step toward a generalized treatment of missing views in longitudinal omics data, enabling comprehensive exploration of temporal dynamics and providing valuable insights into personalized healthcare.
Scientific Article
2024
Genome Res. 34, 1276-1285 (2024)
Accurate predictive models of future disease onset are crucial for effective preventive healthcare, yet longitudinal data sets linking early risk factors to subsequent health outcomes are limited. To overcome this challenge, we introduce a novel framework, Predictive Risk modeling using Mendelian Randomization (PRiMeR), which utilizes genetic effects as supervisory signals to learn disease risk predictors without relying on longitudinal data. To do so, PRiMeR leverages risk factors and genetic data from a healthy cohort, along with results from genome-wide association studies of diseases of interest. After training, the learned predictor can be used to assess risk for new patients solely based on risk factors. We validate PRiMeR through comprehensive simulations and in future type 2 diabetes predictions in UK Biobank participants without diabetes, using follow-up onset labels for validation. Moreover, we apply PRiMeR to predict future Alzheimer's disease onset from brain imaging biomarkers and future Parkinson's disease onset from accelerometer-derived traits. Overall, with PRiMeR we offer a new perspective in predictive modeling, showing it is possible to learn risk predictors leveraging genetics rather than longitudinal data.
Scientific Article
In: (Research in Computational Molecular Biology). Gewerbestrasse 11, Cham, CH-6330, Switzerland: Springer International Publishing AG, 2024. 385-389 (Lect. Notes Comput. Sc. ; 14758 LNCS)
Predicting future disease onset is crucial in preventive healthcare, yet longitudinal datasets linking early risk factors to subsequent health outcomes are scarce. To address this challenge, we introduce Differentiable Mendelian Randomization (DMR), an extension of the classical Mendelian Randomization framework for disease risk predictions without longitudinal data. To do so, DMR leverages risk factors and genetic profiles from a healthy cohort, along with results from genome-wide association studies (GWAS) of diseases of interest. In this work, we describe the DMR framework and confirm its reliability and effectiveness in simulations and an application to a type 2 diabetes (T2D) cohort.
In: Proceedings of Machine Learning Research (27th International Conference on Artificial Intelligence and Statistics (AISTATS), 2-4 May 2024, Valencia, Spain). 2024. 3664-3672 (Int. Conf. art. intell. stat. ; 238)
Predicting patient features from single-cell data can help identify cellular states implicated in health and disease. Linear models and average cell type expressions are typically favored for this task for their efficiency and robustness, but they overlook the rich cell heterogeneity inherent in single-cell data. To address this gap, we introduce MixMIL, a framework integrating Generalized Linear Mixed Models (GLMM) and Multiple Instance Learning (MIL), upholding the advantages of linear models while modeling cell state heterogeneity. By leveraging predefined cell embeddings, MixMIL enhances computational efficiency and aligns with recent advancements in single-cell representation learning. Our empirical results reveal that MixMIL outperforms existing MIL models in single-cell datasets, uncovering new associations and elucidating biological mechanisms across different domains.
2023
In: (Proceedings - International Symposium on Biomedical Imaging, 18-21 April 2023, Cartagena, Colombia). 345 E 47th St, New York, NY 10017 USA: IEEE, 2023. 5 ( ; 2023-April)
Multiple Instance Learning (MIL) has become the predominant approach for classification tasks on gigapixel histopathology whole slide images (WSIs). Within the MIL framework, single WSIs (bags) are decomposed into patches (instances), with only WSI-level annotation available. Recent MIL approaches produce highly informative bag level representations by utilizing the transformer architecture's ability to model the dependencies between instances. However, when applied to high magnification datasets, problems emerge due to the large number of instances and the weak supervisory learning signal. To address this problem, we propose to additionally train transformers with a novel Bag Embedding Loss (BEL). BEL forces the model to learn a discriminative bag-level representation by minimizing the distance between bag embeddings of the same class and maximizing the distance between different classes. We evaluate BEL with the Transformer architecture TransMIL on two publicly available histopathology datasets, BRACS and CAMELYON17. We show that with BEL, TransMIL outperforms the baseline models on both datasets, thus contributing to the clinically highly relevant AI-based tumor classification of histological patient material.
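The idea of pulling same-class bag embeddings together and pushing different-class embeddings apart can be sketched as a pairwise contrastive loss. This is only an illustration under stated assumptions: the abstract does not give the exact form of BEL, so the squared Euclidean distance, the `margin` parameter, and the function name `bag_embedding_loss` are hypothetical choices, not the published definition.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def bag_embedding_loss(embeddings, labels, margin=1.0):
    """Pairwise contrastive sketch over bag-level embeddings.

    Same-class pairs are penalised by their squared distance; different-class
    pairs are penalised only when closer than `margin` (an assumed hyperparameter).
    """
    total, pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            d = euclidean(embeddings[i], embeddings[j])
            if labels[i] == labels[j]:
                total += d ** 2                     # pull same-class bags together
            else:
                total += max(0.0, margin - d) ** 2  # push different classes apart
            pairs += 1
    return total / pairs if pairs else 0.0
```

In training, a term like this would be added to the usual slide-level classification loss, so that the bag representation is shaped by both objectives.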
Am. J. Hum. Genet. 110, 1330-1342 (2023)
Allelic series are of candidate therapeutic interest because of the existence of a dose-response relationship between the functionality of a gene and the degree or severity of a phenotype. We define an allelic series as a collection of variants in which increasingly deleterious mutations lead to increasingly large phenotypic effects, and we have developed a gene-based rare-variant association test specifically targeted to identifying genes containing allelic series. Building on the well-known burden test and sequence kernel association test (SKAT), we specify a variety of association models covering different genetic architectures and integrate these into a Coding-Variant Allelic-Series Test (COAST). Through extensive simulations, we confirm that COAST maintains the type I error and improves the power when the pattern of coding-variant effect sizes increases monotonically with mutational severity. We applied COAST to identify allelic-series genes for four circulating-lipid traits and five cell-count traits among 145,735 subjects with available whole-exome sequencing data from the UK Biobank. Compared with optimal SKAT (SKAT-O), COAST identified 29% more Bonferroni-significant associations with circulating-lipid traits, on average, and 82% more with cell-count traits. All of the gene-trait associations identified by COAST have corroborating evidence either from rare-variant associations in the full cohort (Genebass, n = 400,000) or from common-variant associations in the GWAS Catalog. In addition to detecting many gene-trait associations present in Genebass by using only a fraction (36.9%) of the sample, COAST detects associations, such as that between ANGPTL4 and triglycerides, that are absent from Genebass but that have clear common-variant support.
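To make the allelic-series idea concrete, the sketch below builds a weighted burden score in which more severe variant categories carry larger weights, then regresses a phenotype on that score. It illustrates only a single burden-style component; COAST as described above integrates several association models. The three severity categories, the weights `(1, 2, 3)`, and the function names are assumptions introduced for illustration.

```python
def allelic_series_burden(counts, weights=(1.0, 2.0, 3.0)):
    """Weighted burden per individual.

    counts:  per-individual variant counts by severity category, ordered from
             least to most deleterious (hypothetical categories).
    weights: increasing weights, encoding the assumption that more severe
             variants have larger effects.
    """
    return [sum(c * w for c, w in zip(row, weights)) for row in counts]

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx

# Toy data: one non-carrier and one carrier of each severity category.
counts = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
phenotype = [0.0, 0.5, 1.0, 1.5]
burden = allelic_series_burden(counts)
slope = ols_slope(burden, phenotype)
```

When the phenotype shifts monotonically with severity, as in this toy data, the weighted burden lines up with it, which is the pattern the abstract says the test is designed to detect.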
Scientific Article
2015
Nat. Biotechnol. 33, 155-160 (2015)
Recent technical developments have enabled the transcriptomes of hundreds of cells to be assayed in an unbiased manner, opening up the possibility that new subpopulations of cells can be found. However, the effects of potential confounding factors, such as the cell cycle, on the heterogeneity of gene expression and therefore on the ability to robustly identify subpopulations remain unclear. We present and validate a computational approach that uses latent variable models to account for such hidden factors. We show that our single-cell latent variable model (scLVM) allows the identification of otherwise undetectable subpopulations of cells that correspond to different stages during the differentiation of naive T cells into T helper 2 cells. Our approach can be used not only to identify cellular subpopulations but also to tease apart different sources of gene expression heterogeneity in single-cell transcriptomes.
Scientific Article