
Machine learning for molecular binding affinity

Graph neural networks predicting host–guest binding, then screening all of PubChem (~9–10M compounds) for novel binders.

Helmholtz (HIDA) Israel exchange at Ben-Gurion University, 2022, with follow-on work into 2024–25.

Figure: rotating ball-and-stick model of cucurbit[7]uril (CB7), the barrel-shaped macrocyclic "host" molecule whose binding affinities for small-molecule "guests" the models predict.

Problem

Cucurbit[7]uril (CB7) is a macrocyclic host molecule of broad interest for drug delivery, sensing, and molecular recognition. The task: predict the binding affinity (logK_a) of arbitrary small-molecule guests to CB7 directly from their chemical structure (SMILES), accurately enough to screen for novel binders at the scale of all known small molecules. The available data is small and replicate-heavy (~490 measurements), so naive evaluation, in which replicates of one molecule can land on both sides of a train/test split, badly overstates performance.

Approach

I benchmarked 15+ molecular featurisers (RDKit/DeepChem) spanning physicochemical descriptors, fingerprints (MACCS, ECFP), 2D molecular-graph encoders, and learned embeddings (Mol2Vec, ChemBERTa) against a wide range of regressors, from linear models, random forests, and XGBoost up to graph neural networks (GraphConv, GCN, GAT, Weave, MPNN/DMPNN). Critically, I evaluated under chemically aware, leakage-safe splits: Bemis-Murcko scaffold, molecular-weight, and fingerprint-similarity splits with 5-fold CV, plus molecule-disjoint folds so that replicate measurements of one molecule never straddle training and test sets. The reported accuracy therefore reflects generalisation to genuinely new chemistry rather than memorised near-duplicates.
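The molecule-disjoint idea can be illustrated with a minimal, standard-library sketch. It is not the project's actual splitting code: the function name `molecule_disjoint_folds`, the greedy balancing rule, and the toy SMILES are all assumptions made for illustration, and it assumes the SMILES strings have already been canonicalised so that replicates of one molecule share an identical string.

```python
from collections import defaultdict


def molecule_disjoint_folds(smiles_list, n_folds=5):
    """Assign each measurement to a fold so replicates of one molecule stay together.

    Assumes identical molecules have identical (canonical) SMILES strings.
    """
    # Collect the row indices belonging to each distinct molecule.
    groups = defaultdict(list)
    for row, smiles in enumerate(smiles_list):
        groups[smiles].append(row)

    # Place the largest groups first, always into the currently smallest fold,
    # which keeps fold sizes roughly balanced.
    fold_sizes = [0] * n_folds
    fold_of_row = [None] * len(smiles_list)
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _smiles, rows in ordered:
        target = fold_sizes.index(min(fold_sizes))
        for row in rows:
            fold_of_row[row] = target
        fold_sizes[target] += len(rows)
    return fold_of_row


# Toy data: hypothetical guests, some measured more than once.
guests = [
    "C[N+](C)(C)C", "c1ccccc1", "C[N+](C)(C)C",
    "CCO", "c1ccccc1", "CCN", "C[N+](C)(C)C",
]
folds = molecule_disjoint_folds(guests, n_folds=3)
for smiles, fold in zip(guests, folds):
    print(fold, smiles)
```

In practice a library splitter such as scikit-learn's `GroupKFold`, with the canonical SMILES as the group label, would likely serve the same purpose. The hand-rolled version only makes the invariant explicit: every row of a given molecule carries the same fold index.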

Result

Graph convolutional networks on 2D molecular graphs performed best (Pearson r ≈ 0.79), with MACCS + Random Forest a strong, interpretable baseline (r ≈ 0.63–0.67). Feature-importance analysis recovered chemically sensible drivers, namely nitrogen-containing groups and specific ring substructures, consistent with the known cationic-nitrogen-driven binding to CB7. I then scaled the validated pipeline to featurise and score the full PubChem library (~9–10M compounds) on an LSF HPC cluster, producing ranked shortlists of candidate binders for experimental follow-up by collaborating chemists.
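Producing a ranked shortlist from millions of scored compounds does not require holding every prediction in memory. The sketch below shows one way this could work, assuming predictions arrive in chunks (for example, one chunk per cluster job). The function name `top_k_binders`, the compound identifiers, and the scores are hypothetical, and this is not the pipeline's actual code.

```python
import heapq


def top_k_binders(scored_chunks, k):
    """Keep the k highest-scoring compounds from a stream of scored chunks."""
    heap = []  # min-heap of (score, identifier); heap[0] is the weakest kept entry
    for chunk in scored_chunks:
        for identifier, score in chunk:
            item = (score, identifier)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                # Replace the weakest kept entry with the better candidate.
                heapq.heapreplace(heap, item)
    # Highest predicted affinity first.
    return [(identifier, score) for score, identifier in sorted(heap, reverse=True)]


# Hypothetical chunks of (compound id, predicted logK_a); the values are made up.
chunks = [
    [("cid_1", 4.2), ("cid_2", 7.9), ("cid_3", 5.1)],
    [("cid_4", 9.3), ("cid_5", 3.0)],
    [("cid_6", 8.4)],
]
shortlist = top_k_binders(chunks, k=3)
print(shortlist)
```

Memory use stays proportional to `k` rather than to the library size, so the same merge step works whether the chunks come from a single file or from many per-job output files.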

Methods & stack

Graph neural networks (GCN · GAT · MPNN)
RDKit · DeepChem · Mordred
Mol2Vec · ChemBERTa embeddings
scikit-learn · XGBoost
PyTorch · Keras/TF · DGL
Scaffold / leakage-safe CV splits
HPC (LSF / bsub) large-scale screening
Python · R