publications
publications by category in reverse chronological order.
2025
- Scaling Multi-Modal and Multi-Task Transformers for Small Molecule Drug Discovery. David S. Farina Jr, Sai Krishna Sirumalla, Michiel J. M. Niesen, and 19 more authors. In NeurIPS 2025 Workshop on AI for Science, 2025.
We introduce Enchant v2, a large-scale multi-modal transformer for predicting molecular, biochemical, and pharmacological properties from heterogeneous biomedical data. The model addresses a core challenge in drug discovery: generalizing under extreme data sparsity and across incompatible modalities. Diverse inputs including molecular graphs, protein sequences, assay measurements, and free text are represented as unified token sequences processed by a single transformer. Pretraining on a large, curated corpus is followed by parameter-efficient fine-tuning for molecule property prediction. We show that Enchant v2 follows established transformer scaling laws, with performance improving predictably as pre-training compute increases. On public and proprietary benchmarks including drug property prediction and internal pharmacology datasets, it consistently outperforms TxGemma and Enchant v1. Crucially, in real-world applications, Enchant v2 surpasses the current industry standard of in vitro screening: for example, it achieves an AUROC of 0.74 in classifying high versus low in vivo rat clearance, compared to 0.51 when extrapolating from measured in vitro clearance values. In addition, the model produces calibrated uncertainty estimates that closely track observed hit rates in virtual screening tasks, enabling reliable hit identification and efficient prioritization of compounds in early discovery workflows. These findings suggest that scalable, modality-agnostic transformers can deliver robust generalization and substantial performance gains in real-world low-data drug discovery settings.
- Clinically Informed Semi-Supervised Learning Improves Disease Annotation and Equity from Electronic Health Records: A Glaucoma Case Study. Mousa Moradi, Rishi Shah, Asahi Fujita, and 9 more authors. npj Digital Medicine, Dec 2025.
Clinical notes represent a vast but underutilized source of information for disease characterization, whereas structured electronic health record (EHR) data such as ICD codes are often noisy, incomplete, and too coarse to capture clinical complexity. These limitations constrain the accuracy of datasets used to investigate disease pathogenesis and progression and to develop robust artificial intelligence (AI) systems. To address this challenge, we introduce Ci-SSGAN (Clinically Informed Semi-Supervised Generative Adversarial Network), a novel framework that leverages large-scale unlabeled clinical text to reannotate patient conditions with improved accuracy and equity. As a case study, we applied Ci-SSGAN to glaucoma, a leading cause of irreversible blindness characterized by pronounced racial and ethnic disparities. Trained on a demographically balanced subset of 349,587 unlabeled ophthalmology notes and 2,954 expert-annotated notes (drawn from an institutional corpus of 2.1 million notes), Ci-SSGAN achieved 0.85 accuracy and 0.95 AUROC, representing a 10.19% AUROC improvement compared to ICD-based labels (0.74 accuracy, 0.85 AUROC). Ci-SSGAN also narrowed subgroup performance gaps, with F1 gains for Black patients (+0.05), women (+0.06), and younger patients (+0.033). By integrating semi-supervised learning and demographic conditioning, Ci-SSGAN minimizes reliance on expert annotations, making AI development more accessible to resource-constrained healthcare systems.
- OphthaBERT: Automated Glaucoma Diagnosis from Clinical Notes. Rishi Shah, Mousa Moradi, Sedigheh Eslami, and 12 more authors. medRxiv, Jun 2025.
Glaucoma is a leading cause of irreversible blindness worldwide, with early intervention often being crucial. Research into the underpinnings of glaucoma often relies on electronic health records (EHRs) to identify patients with glaucoma and their subtypes. However, current methods for identifying glaucoma patients from EHRs are often inaccurate or infeasible at scale, relying on International Classification of Diseases (ICD) codes or manual chart reviews. To address this limitation, we introduce (1) OphthaBERT, a powerful general clinical ophthalmology language model trained on over 2 million diverse clinical notes, and (2) a fine-tuned variant of OphthaBERT that automatically extracts binary and subtype glaucoma diagnoses from clinical notes. The base OphthaBERT model is a robust encoder, outperforming state-of-the-art clinical encoders in masked token prediction on out-of-distribution ophthalmology clinical notes and binary glaucoma classification with limited data. We report significant binary classification performance improvements in low-data regimes (p < 0.001, Bonferroni corrected). OphthaBERT is also able to achieve superior classification performance for both binary and subtype diagnosis, outperforming even fine-tuned large decoder-only language models at a fraction of the computational cost. We demonstrate a 0.23-point increase in macro-F1 for subtype diagnosis over ICD codes and strong binary classification performance when externally validated at Wilmer Eye Institute. OphthaBERT provides an interpretable, equitable framework for general ophthalmology language modeling and automated glaucoma diagnosis.
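For readers unfamiliar with the macro-F1 metric reported for subtype diagnosis, the following is a minimal sketch of how it is commonly computed: the F1 score is calculated per class and then averaged without weighting by class frequency. The function name `macro_f1` and the toy labels are illustrative assumptions, not taken from the paper or its code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical classes "A", "B", "C"), not data from the study
y_true = ["A", "A", "B", "C"]
y_pred = ["A", "B", "B", "C"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally to the average, macro-F1 is sensitive to performance on rare classes, which is one common reason to prefer it for imbalanced multi-class problems.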