2024
Sens, D.W. ; Shilova, L. ; Gräf, L. ; Grebenshchikova, M. ; Eskofier, B.M. ; Casale, F.P.
Genome Res. 34, 1276-1285 (2024)
Accurate predictive models of future disease onset are crucial for effective preventive healthcare, yet longitudinal data sets linking early risk factors to subsequent health outcomes are limited. To overcome this challenge, we introduce a novel framework, Predictive Risk modeling using Mendelian Randomization (PRiMeR), which utilizes genetic effects as supervisory signals to learn disease risk predictors without relying on longitudinal data. To do so, PRiMeR leverages risk factors and genetic data from a healthy cohort, along with results from genome-wide association studies of diseases of interest. After training, the learned predictor can be used to assess risk for new patients solely based on risk factors. We validate PRiMeR through comprehensive simulations and in future type 2 diabetes predictions in UK Biobank participants without diabetes, using follow-up onset labels for validation. Moreover, we apply PRiMeR to predict future Alzheimer's disease onset from brain imaging biomarkers and future Parkinson's disease onset from accelerometer-derived traits. Overall, with PRiMeR we offer a new perspective in predictive modeling, showing it is possible to learn risk predictors leveraging genetics rather than longitudinal data.
Scientific Article
Gräf, L. ; Sens, D.W. ; Shilova, L. ; Casale, F.P.
In: Research in Computational Molecular Biology. Cham, Switzerland: Springer International Publishing AG, 2024. 385-389 (Lect. Notes Comput. Sc.; 14758 LNCS)
Predicting future disease onset is crucial in preventive healthcare, yet longitudinal datasets linking early risk factors to subsequent health outcomes are scarce. To address this challenge, we introduce Differentiable Mendelian Randomization (DMR), an extension of the classical Mendelian Randomization framework for disease risk predictions without longitudinal data. To do so, DMR leverages risk factors and genetic profiles from a healthy cohort, along with results from genome-wide association studies (GWAS) of diseases of interest. In this work, we describe the DMR framework and confirm its reliability and effectiveness in simulations and an application to a type 2 diabetes (T2D) cohort.
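As background for the abstract above, the following is a minimal sketch of the classical summary-statistic Mendelian Randomization estimate that DMR is described as extending. It is not the DMR method itself. It shows a fixed-effect inverse-variance-weighted (IVW) estimate of the effect of a risk factor on a disease, computed from per-variant GWAS effect sizes. The function name `ivw_estimate` and its argument names are hypothetical and chosen for illustration only.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance-weighted MR estimate (illustrative sketch).

    beta_exposure: per-variant effects on the risk factor (exposure GWAS)
    beta_outcome:  per-variant effects on the disease (outcome GWAS)
    se_outcome:    per-variant standard errors of the outcome effects
    Returns (causal effect estimate, its standard error).
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")

    # Each variant is weighted by the inverse variance of its outcome effect.
    weights = [1.0 / (se ** 2) for se in se_outcome]

    # Weighted regression of outcome effects on exposure effects, no intercept.
    numerator = sum(w * bx * by
                    for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))
    if denominator == 0:
        raise ValueError("exposure effects carry no information")

    return numerator / denominator, math.sqrt(1.0 / denominator)


if __name__ == "__main__":
    # Toy numbers, not taken from any study.
    estimate, std_err = ivw_estimate([0.1, 0.2, 0.3],
                                     [0.05, 0.10, 0.15],
                                     [0.01, 0.01, 0.01])
    print(estimate, std_err)
```

The sketch assumes independent variants and valid instruments; it only illustrates how genetic effect sizes from two GWAS can be combined without longitudinal data.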
Engelmann, J.P. ; Palma, A. ; Tomczak, J.M. ; Theis, F.J. ; Casale, F.P.
In: Proceedings of Machine Learning Research. 2024. 3664-3672 (238)
Predicting patient features from single-cell data can help identify cellular states implicated in health and disease. Linear models and average cell type expressions are typically favored for this task for their efficiency and robustness, but they overlook the rich cell heterogeneity inherent in single-cell data. To address this gap, we introduce MixMIL, a framework integrating Generalized Linear Mixed Models (GLMM) and Multiple Instance Learning (MIL), upholding the advantages of linear models while modeling cell state heterogeneity. By leveraging predefined cell embeddings, MixMIL enhances computational efficiency and aligns with recent advancements in single-cell representation learning. Our empirical results reveal that MixMIL outperforms existing MIL models in single-cell datasets, uncovering new associations and elucidating biological mechanisms across different domains.
2023
Sens, D. ; Sadafi, A. ; Casale, F.P. ; Navab, N. ; Marr, C.
In: Proceedings - International Symposium on Biomedical Imaging, 18-21 April 2023, Cartagena, Colombia. New York, NY, USA: IEEE, 2023. 5 (2023-April)
Multiple Instance Learning (MIL) has become the predominant approach for classification tasks on gigapixel histopathology whole slide images (WSIs). Within the MIL framework, single WSIs (bags) are decomposed into patches (instances), with only WSI-level annotation available. Recent MIL approaches produce highly informative bag-level representations by utilizing the transformer architecture's ability to model the dependencies between instances. However, when applied to high-magnification datasets, problems emerge due to the large number of instances and the weak supervisory learning signal. To address this problem, we propose to additionally train transformers with a novel Bag Embedding Loss (BEL). BEL forces the model to learn a discriminative bag-level representation by minimizing the distance between bag embeddings of the same class and maximizing the distance between different classes. We evaluate BEL with the Transformer architecture TransMIL on two publicly available histopathology datasets, BRACS and CAMELYON17. We show that with BEL, TransMIL outperforms the baseline models on both datasets, thus contributing to the clinically highly relevant AI-based tumor classification of histological patient material.
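The idea stated in the abstract above, pulling bag embeddings of the same class together and pushing those of different classes apart, can be illustrated with a generic pairwise contrastive loss. This is a sketch under stated assumptions and not the BEL formulation from the paper: the squared-distance and margin form, the function name `pairwise_bag_loss`, and the `margin` parameter are all hypothetical choices made for illustration.

```python
import math
from itertools import combinations


def pairwise_bag_loss(embeddings, labels, margin=1.0):
    """Generic contrastive loss over bag embeddings (illustrative, not BEL).

    embeddings: list of equal-length float vectors, one per bag (slide)
    labels:     list of class labels, one per bag
    margin:     assumed minimum distance wanted between different classes
    """
    if len(embeddings) != len(labels):
        raise ValueError("need exactly one label per embedding")

    losses = []
    for i, j in combinations(range(len(embeddings)), 2):
        dist = math.dist(embeddings[i], embeddings[j])  # Euclidean distance
        if labels[i] == labels[j]:
            # Same class: penalize any separation.
            losses.append(dist ** 2)
        else:
            # Different classes: penalize only pairs closer than the margin.
            losses.append(max(0.0, margin - dist) ** 2)

    # Average over all pairs; a batch with fewer than two bags has no pairs.
    return sum(losses) / len(losses) if losses else 0.0
```

In a training setup such a term would typically be added to the slide-level classification loss; how the paper weights or combines the two is not stated in the abstract.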
McCaw, Z.R. ; O'Dushlaine, C. ; Somineni, H. ; Bereket, M. ; Klein, C. ; Karaletsos, T. ; Casale, F.P. ; Koller, D. ; Soare, T.W.
Am. J. Hum. Genet. 110, 1330-1342 (2023)
Allelic series are of candidate therapeutic interest because of the existence of a dose-response relationship between the functionality of a gene and the degree or severity of a phenotype. We define an allelic series as a collection of variants in which increasingly deleterious mutations lead to increasingly large phenotypic effects, and we have developed a gene-based rare-variant association test specifically targeted to identifying genes containing allelic series. Building on the well-known burden test and sequence kernel association test (SKAT), we specify a variety of association models covering different genetic architectures and integrate these into a Coding-Variant Allelic-Series Test (COAST). Through extensive simulations, we confirm that COAST maintains the type I error and improves the power when the pattern of coding-variant effect sizes increases monotonically with mutational severity. We applied COAST to identify allelic-series genes for four circulating-lipid traits and five cell-count traits among 145,735 subjects with available whole-exome sequencing data from the UK Biobank. Compared with optimal SKAT (SKAT-O), COAST identified 29% more Bonferroni-significant associations with circulating-lipid traits, on average, and 82% more with cell-count traits. All of the gene-trait associations identified by COAST have corroborating evidence either from rare-variant associations in the full cohort (Genebass, n = 400,000) or from common-variant associations in the GWAS Catalog. In addition to detecting many gene-trait associations present in Genebass by using only a fraction (36.9%) of the sample, COAST detects associations, such as that between ANGPTL4 and triglycerides, that are absent from Genebass but that have clear common-variant support.
Scientific Article
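The allelic-series idea in the preceding abstract, that more deleterious variant classes should contribute larger effects, can be sketched as a severity-weighted burden score for one gene in one subject. This is an illustrative sketch of a weighted burden score only, not the COAST test, which the abstract describes as integrating several association models. The category labels and the default weights below are hypothetical.

```python
# Hypothetical severity weights: larger weight for more deleterious classes.
DEFAULT_WEIGHTS = {
    "benign_missense": 1.0,
    "damaging_missense": 2.0,
    "protein_truncating": 3.0,
}


def allelic_series_burden(genotypes, categories, weights=None):
    """Severity-weighted burden score for one subject and one gene.

    genotypes:  minor-allele counts (0, 1 or 2), one per rare variant
    categories: annotation category of each variant (keys of `weights`)
    weights:    mapping from category to weight; defaults are illustrative
    """
    if weights is None:
        weights = DEFAULT_WEIGHTS
    if len(genotypes) != len(categories):
        raise ValueError("need exactly one category per variant")

    # Sum allele counts, each scaled by the weight of its variant category.
    return sum(weights[cat] * count
               for count, cat in zip(genotypes, categories))
```

A score of this kind would then be used as a predictor of the phenotype in a regression-based association test; the specific models used by COAST are given in the paper rather than the abstract.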
2015
Buettner, F. ; Natarajan, K.N. ; Casale, F.P. ; Proserpio, V. ; Scialdone, A. ; Theis, F.J. ; Teichmann, S.A. ; Marioni, J.C. ; Stegle, O.
Nat. Biotechnol. 33, 155-160 (2015)
Recent technical developments have enabled the transcriptomes of hundreds of cells to be assayed in an unbiased manner, opening up the possibility that new subpopulations of cells can be found. However, the effects of potential confounding factors, such as the cell cycle, on the heterogeneity of gene expression and therefore on the ability to robustly identify subpopulations remain unclear. We present and validate a computational approach that uses latent variable models to account for such hidden factors. We show that our single-cell latent variable model (scLVM) allows the identification of otherwise undetectable subpopulations of cells that correspond to different stages during the differentiation of naive T cells into T helper 2 cells. Our approach can be used not only to identify cellular subpopulations but also to tease apart different sources of gene expression heterogeneity in single-cell transcriptomes.
Scientific Article