Connecting the dots: 100k protein network graph using AI and GPU-accelerated clustering

Synopsis

Combinatorial mutagenesis (CM) is an established approach to protein engineering in pharma and industrial settings. CM is extremely laborious and relies on human intuition (rational engineering inspired by existing 3D structures of a protein target) or environmental pressure (directed evolution) to guide the development of new functional variants (mutants) of a desired protein target, with the goal of improving a specific property, e.g. thermal stability, solubility or aggregation propensity.

The time inefficiency, human labor and material costs of CM can translate into major issues for big pharma: the average drug costs $2.6B, with a 5% success rate for small-molecule drugs and a 13% success rate for protein therapeutics. Late-stage terminations of drug candidates accounted for $77B in lost revenue in 2011–2012.

Idea

The aim of this project was to develop an automated pipeline for rapid, AI-powered assessment of small peptide developability, as a function of structural disorder and its relationship to protein aggregation behaviour.

Execution

We gathered all protein fragments known to be expressed: 9-amino-acid fragments (3,900,078 out of 512,000,000,000 theoretically possible combinations), 5-amino-acid fragments (3,200,000) and 3-amino-acid fragments (8,000).
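
The fragment counts above can be checked with a few lines of arithmetic. The sketch below assumes the 20 standard amino acids as the alphabet (which is what the quoted 512,000,000,000 figure implies for 9-mers); the function and variable names are illustrative only.

```python
# Size of the peptide sequence space, assuming the 20 standard amino acids.
ALPHABET_SIZE = 20


def sequence_space(length: int) -> int:
    """Number of distinct peptides of the given length (20 ** length)."""
    return ALPHABET_SIZE ** length


# Fragment counts as quoted in the text above, keyed by fragment length.
observed = {9: 3_900_078, 5: 3_200_000, 3: 8_000}

for length, count in observed.items():
    total = sequence_space(length)
    print(f"{length}-mers: {count:,} of {total:,} possible ({count / total:.6%})")

print(f"total fragments: {sum(observed.values()):,}")
```

Under that assumption, the 5-mer and 3-mer sets match their full theoretical spaces, while the 9-mer set covers only a tiny fraction of its space. The three sets together add up to the 7.1M+ peptides mentioned later in this section.
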
We used our proprietary AI structural disorder predictors (trained on dspp-keras, https://github.com/PeptoneInc/dspp-keras) to predict the residual disorder probability for each molecule and cross-correlate it with known solubility data.
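
As a rough illustration of the cross-correlation step, the sketch below computes a Pearson correlation between a per-peptide disorder score and a solubility value. It rests on several assumptions not stated in the text: that the predictor returns one probability per residue, that these are averaged into a single score per peptide, and that Pearson correlation is the statistic of interest. The input values are made-up placeholders, not predictions or measurements.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def mean_disorder(per_residue_probs):
    """Collapse per-residue probabilities into one score per peptide (an assumption)."""
    return sum(per_residue_probs) / len(per_residue_probs)


# Made-up placeholder values, NOT real predictions or solubility measurements.
# They are deliberately linear so the function's behaviour is easy to verify.
toy_disorder = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
toy_solubility = [1.0, 2.0, 3.0]

scores = [mean_disorder(p) for p in toy_disorder]
r = pearson(scores, toy_solubility)
print(f"Pearson r on toy data: {r:.3f}")
```

With real data, a rank-based statistic may be a better fit if the relationship is monotonic but not linear. Which one is appropriate depends on the solubility data, which is not described here.
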
Our models were trained on AWS p3.2xlarge instances running custom versions of Deep Learning Ubuntu 16.04 LTS and equipped with NVIDIA Tesla V100-SXM2 accelerator cards.
We used c5d.18xlarge nodes with 36-core Xeon Platinum processors as the CPU baseline for benchmarking our calculations.
We ran proprietary tSNE algorithms developed specifically for NVIDIA GPUs, namely the Tesla V100-SXM2 available on p3.2xlarge nodes. The tSNE procedures were written using CUDA 9.0 libraries with support for Compute Capability 7.0.
The algorithms allowed us to perform [4,000,000 x 50] classification problem calculations in under 2 hours, a 200x to 1000x performance gain over state-of-the-art CPU-only nodes.
With tSNE calculations done for 7.1M+ peptides, we clustered the data using the Facebook AI Similarity Search library (faiss), compiled with support for NVIDIA Volta-architecture GPUs.
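
To make the clustering step concrete, here is a minimal sketch of k-means (Lloyd's algorithm) on 2-D points. It is a pure-Python stand-in, not the faiss-based pipeline: the text does not say which faiss routine was used, so the choice of k-means is an assumption, and the coordinates are toy values standing in for tSNE output. faiss provides its own k-means implementation with GPU support, which is what one would use at the scale described here.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels


# Toy 2-D coordinates standing in for tSNE output (illustrative only).
embedding = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans(embedding, k=2)
print(labels)
```

The two well-separated groups in the toy input end up in different clusters. For millions of points, this loop would be far too slow, which is the motivation for a GPU-backed library.
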
Subsequently, we built an interactive, massively parallel visualization of the data graphs, which runs under Kubernetes control on EC2 instances.
The frontend of the graph visualization (still under development) uses WebGL. We use NVIDIA Titan Xp GPUs to inspect graphs of 100k+ nodes in real time.

Why does it matter?

With this data in hand, our clients will be able to make rapid and accurate research decisions about the commercial developability of a given protein fragment lead and the upfront R&D capital that may need to be invested.

Why is it unique?

We are the first to offer accurate and rapid prediction of protein properties that are of fundamental importance for protein solubility engineering and commercial developability assessment on such a scale (the underlying network data graph contains ~4M molecules).

Through this project we assessed the horizontal scalability of our AI platform and found that, given existing AWS and NVIDIA solutions, we can readily apply our approach to protein families as large as 1,000,000,000 (one billion) molecules.

What’s next

We are aiming to assess the relationships among the 122M known and annotated proteins.