DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking

ICLR 2023

  • This article covers one of the first research works to formulate molecular docking as a generative problem.

  • It shows very interesting results with a decent performance gain over prior methods.

  • If you are interested in molecular docking and diffusion models, this is definitely a must-read paper!

  • It is highly recommended to watch the YouTube video in which the authors explain the paper.

Summary

  • Molecular docking as a generative problem, not regression!

    • Problem of learning a distribution over ligand poses conditioned on the target protein structure $p(\mathbf{x} | \mathbf{y})$
  • Uses a diffusion process for generation

  • Two separate models

    • Score model: $s(\mathbf{x}, \mathbf{y}, t)$

      Predicts score based on ligand pose $\mathbf{x}$, protein structure $\mathbf{y}$, and timestep $t$

    • Confidence model: $d(\mathbf{x}, \mathbf{y})$

      Predicts whether the ligand pose has RMSD below 2 Å compared to the ground-truth ligand pose

  • Diffusion on Product space $\mathbb{P}$

    • Reduced degrees of freedom $3n \rightarrow (m+6)$

Preliminaries

Molecular Docking

  • Definition:

    Predicting the position, orientation, and conformation of a ligand when bound to a target protein

  • Two types of tasks

    • Known-pocket docking
      • Given: position of the binding pocket
    • Blind docking
      • More general setting: no prior knowledge about binding pocket

Previous works: Search-based / Regression-based

  • Search based docking methods

    • Traditional methods

    • Consist of a parameterized physics-based scoring function and a search algorithm

    • Scoring function

      • Input: 3D structures
      • Output: estimate of the quality/likelihood of the given pose
    • Search algorithm

      • Stochastically modifies the ligand pose (position, orientation, torsion angles)
      • Goal: finding the global optimum of the scoring function.
    • ML has been applied to parameterize the scoring function.

      • But very computationally expensive (large search space)

  • Regression based methods

    • Recent deep learning methods

    • Significant speedup compared to search based methods

    • No improvements in accuracy

    • Example

      • EquiBind

        • Tried to tackle the blind docking task as a regression problem by directly predicting pocket keypoints on both ligand and protein and aligning them.
      • TANKBind

        • Improved over this by independently predicting a docking pose for each possible pocket and then ranking them.
      • E3Bind

        • Used ligand-constrained & protein-constrained update layers to embed ligand atoms and iteratively update coordinates.

Docking objective

  • Standard evaluation metric:
    • $\mathcal{A}_{\epsilon} = \frac{1}{|\mathcal{D}|} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \mathbf{1}[\text{RMSD}(\mathbf{x}, \hat{\mathbf{x}}(\mathbf{y})) < \epsilon]$:

      proportion of predictions with $\text{RMSD} < \epsilon$ → Not differentiable!

    • Instead, the idealized training objective is $\text{argmax}_{\hat{\mathbf{x}}} \lim_{\epsilon \rightarrow 0} \mathcal{A}_\epsilon$.

  • Regression is suitable for docking only if the pose distribution is unimodal.
  • Docking has significant aleatoric (irreducible) & epistemic (reducible) uncertainty
    • Regression methods minimize $\sum \|\mathbf{x} - \hat{\mathbf{x}}\|^2_2$ → they produce the weighted mean of the modes
    • On the other hand, a generative model will populate all or most modes!

  • The regression model (EquiBind) places its conformer in the middle of the modes.
  • Generative samples can populate most or all of the modes.

  • Generative models also produce far fewer steric clashes.
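The multimodality argument can be illustrated with a toy 1-D sketch (my own illustration, not from the paper; `success_rate` is a hypothetical stand-in for the RMSD < ε metric): an L2-trained regression model converges to the mean of the modes and matches none of them, while generative samples land on the modes.

```python
# Toy 1-D illustration: ground-truth "poses" drawn from two modes,
# e.g. two plausible binding sites at x = -1 and x = +1.
poses = [-1.0, -1.0, 1.0, 1.0]

# A regression model trained with squared error converges to the mean
# of the modes, which lies between them and matches neither.
regression_pred = sum(poses) / len(poses)  # -> 0.0

def success_rate(preds, truth, eps=0.5):
    """Fraction of predictions within eps of some true mode (RMSD < eps analogue)."""
    return sum(min(abs(p - t) for t in truth) < eps for p in preds) / len(preds)

# The single regression prediction is 1.0 away from both modes: 0% success.
print(success_rate([regression_pred], poses))   # 0.0

# A generative model samples from both modes, so its samples hit them.
generative_samples = [-1.0, 1.0, -1.0, 1.0]
print(success_rate(generative_samples, poses))  # 1.0
```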

DiffDock Overview

  • Two-step approach
    • Score model: Reverse diffusion over translation, rotation, and torsion
    • Confidence model: Predicts whether each ligand pose has $\text{RMSD} < 2\,\text{Å}$ relative to the ground-truth ligand pose

Score model

  • Ligand pose: $\mathbb{R}^{3n}$ ($n$: number of atoms)

  • But molecular docking requires far fewer degrees of freedom.

    • Reduced degree of freedom: $(m+6)$

      • Local structure: Fixed (rigid) after conformer generation with RDKit EmbedMolecule(mol)

        • Bond length, angles, small rings
      • Position (translation): $\mathbb{R}^3$ - 3D vector

      • Orientation (rotation): $SO(3)$ - represented by three Euler angles


      • Torsion angles: $\mathbb{T}^m$ ($m$: number of rotatable bonds)


    • Can perform diffusion on the product space $\mathbb{P} = \mathbb{R}^3 \times SO(3) \times \mathbb{T}^m$

      • For a given seed conformation $\mathbf{c}$, the map $A(\cdot, \mathbf{c}): \mathbb{P} \rightarrow \mathcal{M}_\mathbf{c}$ is a bijection!
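As a sketch of how a point in the product space acts on the seed conformer (my own illustration, not DiffDock code: torsion updates are omitted and the $SO(3)$ element is reduced to a single z-axis angle), the map applies a rotation and then a translation:

```python
import math

# Minimal sketch of A(., c): a pose in R^3 x SO(3) x T^m acting on a
# seed conformer c. Torsions are omitted; rotation is a z-axis angle
# only, instead of a full SO(3) element.
def apply_pose(conformer, translation, z_angle):
    """Rotate the seed conformer about the z-axis, then translate it."""
    cos_t, sin_t = math.cos(z_angle), math.sin(z_angle)
    out = []
    for x, y, z in conformer:
        xr = cos_t * x - sin_t * y
        yr = sin_t * x + cos_t * y
        out.append((xr + translation[0], yr + translation[1], z + translation[2]))
    return out

seed = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
moved = apply_pose(seed, translation=(0.0, 0.0, 2.0), z_angle=math.pi / 2)
# (1, 0, 0) rotates to approximately (0, 1, 0), then translates to (0, 1, 2)
```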

Confidence Model

  • A generative model can sample an arbitrary number of poses, but practitioners are usually interested in a single pose or a fixed number of them.
  • Confidence predictions are very useful for downstream tasks.
  • Confidence model $d(\mathbf{x}, \mathbf{y})$
    • $\mathbf{x}$: pose of a ligand
    • $\mathbf{y}$: target protein structure
  • Samples are ranked by the confidence model's score, and the score of the best sample is used as the overall confidence score.
  • Training & Inference
    • The trained diffusion model is run to obtain a set of candidate poses for every training example, and binary labels are generated: whether or not each pose has RMSD below $2\,\text{Å}$.
    • The confidence model is then trained with a cross-entropy loss to predict the binary label for each pose.
    • During inference, the diffusion model generates $N$ poses in parallel, which are passed to the confidence model, which ranks them by its confidence that they have RMSD below $2\,\text{Å}$.
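The inference-time ranking step can be sketched as follows (`rank_poses` and the toy confidence function are hypothetical stand-ins, not DiffDock's actual API):

```python
# Sketch of DiffDock-style ranking: the diffusion model proposes N poses,
# the confidence model scores each one, poses are returned best-first,
# and the top score serves as the overall confidence.
def rank_poses(poses, confidence_model):
    scored = sorted(poses, key=confidence_model, reverse=True)
    overall_confidence = confidence_model(scored[0])
    return scored, overall_confidence

# Toy stand-in: poses are floats, confidence prefers values near 0.
toy_poses = [0.8, 0.1, 0.5]
scored, conf = rank_poses(toy_poses, confidence_model=lambda p: -abs(p))
print(scored)  # [0.1, 0.5, 0.8]
print(conf)    # -0.1
```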

DiffDock Workflow

DiffDock Results

  • Standard benchmark: PDBBind

    19k experimentally determined structures of small molecules + proteins

    Baselines: search-based & deep learning

  • Prediction correctness

    Outperforms search-based, deep-learning, and pocket-prediction + search-based methods

  • Runtime

    3 times faster than the most accurate baseline

  • Generalization to unseen receptors

    Able to generalize: outperforms classical methods


  • Confidence score quality

    High selective accuracy: valuable information for practitioners

Personal opinions

  • It is impressive that the authors formulated molecular docking as a generative problem, conditioned on protein structure.
  • But it is not an end-to-end approach, and there is a discrepancy between the inputs and output of the confidence model: the input is the predicted ligand pose $\hat{\mathbf{x}}$ and the protein structure $\mathbf{y}$, but the output is whether the RMSD between the predicted ligand pose $\hat{\mathbf{x}}$ and the ground-truth ligand pose $\mathbf{x}$ is below 2 Å.
  • There is quite a lot of room to improve the performance, but doing so requires heavy GPU workloads.
  • I’m skeptical about the generalizability of this model, since it has almost no physics-informed inductive bias.
