Machine learning for molecular binding affinity
Graph neural networks predicting host–guest binding, then screening all of PubChem (~9–10M compounds) for novel binders.
Helmholtz (HIDA) Israel exchange at Ben-Gurion University, 2022, with follow-on work into 2024–25.
Problem
Cucurbit[7]uril (CB7) is a macrocyclic host molecule of broad interest for drug delivery, sensing, and molecular recognition. The task was to predict the binding affinity (logKa) of arbitrary small-molecule guests to CB7 directly from their chemical structure (SMILES), accurately enough to screen for novel binders at the scale of all known small molecules. The only training data was a small, replicate-heavy dataset (~490 measurements), on which naive evaluation badly overstates performance because replicate measurements of the same molecule can land on both sides of a random split.
Approach
I benchmarked 15+ molecular featurisers (RDKit/DeepChem) spanning physicochemical descriptors, fingerprints (MACCS, ECFP), 2D molecular-graph encoders, and learned embeddings (Mol2Vec, ChemBERTa) against a wide range of regressors, from linear models, random forests, and XGBoost up to graph neural networks (GraphConv, GCN, GAT, Weave, MPNN/DMPNN). Critically, I evaluated under chemically aware, leakage-safe splits: Bemis-Murcko scaffold, molecular-weight, and fingerprint-similarity splits with 5-fold CV, plus molecule-disjoint folds so that replicates of one molecule never straddle train and test. The reported accuracy therefore reflects generalisation to genuinely new chemistry rather than memorised near-duplicates.
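The molecule-disjoint folds described above can be illustrated with a minimal sketch. This is not the project's actual code: the function names, the greedy balancing strategy, and the toy SMILES are all assumptions made for illustration, and it assumes the SMILES strings are already canonicalised so that replicates of one molecule compare equal. It uses only the Python standard library; a grouped splitter from an ML library would serve the same purpose.

```python
from collections import defaultdict


def molecule_disjoint_folds(smiles, n_folds=5):
    """Assign each measurement to a fold so all replicates of a molecule share one fold.

    `smiles` holds one (assumed canonical) SMILES string per measurement.
    Returns a list of fold indices, one per measurement.
    """
    # Collect the measurement indices belonging to each unique molecule.
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles):
        groups[smi].append(idx)

    # Greedy balancing: place the largest groups first, each into the
    # currently smallest fold, so fold sizes stay roughly equal.
    fold_sizes = [0] * n_folds
    assignment = [None] * len(smiles)
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for smi, members in ordered:
        target = fold_sizes.index(min(fold_sizes))
        for idx in members:
            assignment[idx] = target
        fold_sizes[target] += len(members)
    return assignment


def iter_splits(assignment, n_folds=5):
    """Yield (train_indices, test_indices) for each fold."""
    for fold in range(n_folds):
        test = [i for i, f in enumerate(assignment) if f == fold]
        train = [i for i, f in enumerate(assignment) if f != fold]
        yield train, test


# Toy data (hypothetical): three molecules, two of them measured more than once.
toy_smiles = ["CCN", "CCN", "c1ccccc1", "CCN", "C[N+](C)(C)C", "c1ccccc1"]
folds = molecule_disjoint_folds(toy_smiles, n_folds=3)
```

Because whole molecules rather than individual measurements are assigned to folds, no SMILES string in a test fold ever appears in the corresponding training set. A random row-level split gives no such guarantee on replicate-heavy data.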
Result
Graph convolutional networks on 2D molecular graphs performed best (Pearson r ≈ 0.79), with MACCS + Random Forest a strong, interpretable baseline (r ≈ 0.63–0.67). Feature-importance analysis recovered chemically sensible drivers: nitrogen-containing groups and specific ring substructures, consistent with the known cationic-nitrogen-driven binding to CB7. I then scaled the validated pipeline to featurise and score the full PubChem library (~9–10M compounds) on an LSF HPC cluster, producing ranked shortlists of candidate binders for experimental follow-up by collaborating chemists.
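Producing a ranked shortlist from millions of scored compounds does not require holding every prediction in memory. The sketch below shows one way to do it with a bounded min-heap. It is an illustration under stated assumptions, not the pipeline that was run on the cluster: the function name, the `(identifier, predicted_logKa)` tuple layout, and the toy scores are hypothetical.

```python
import heapq


def top_k_binders(scored_stream, k=100):
    """Keep the k highest-scoring compounds from an iterable of predictions.

    `scored_stream` yields (identifier, predicted_logKa) tuples, for example
    one per screened compound. At most k items are held in memory at a time.
    """
    heap = []  # min-heap of (score, identifier); heap[0] is the weakest kept hit
    for identifier, score in scored_stream:
        if len(heap) < k:
            heapq.heappush(heap, (score, identifier))
        elif score > heap[0][0]:
            # New compound beats the weakest kept hit: swap it in.
            heapq.heapreplace(heap, (score, identifier))
    # Return the strongest predicted binders first.
    return [(ident, score) for score, ident in sorted(heap, reverse=True)]


# Hypothetical predictions standing in for model output on one library chunk.
toy_scores = [
    ("cid_1", 4.2),
    ("cid_2", 7.9),
    ("cid_3", 5.5),
    ("cid_4", 9.1),
    ("cid_5", 3.0),
]
shortlist = top_k_binders(iter(toy_scores), k=3)
```

Since each chunk of the library can be reduced to its own top-k list independently, per-job shortlists from a batch scheduler can be merged afterwards by running the same function over their concatenation.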