Introduction

G-protein coupled receptors (GPCRs) control multiple physiological and disease states. Deciphering GPCR-G protein coupling is pivotal to understanding GPCR signaling and, subsequently, to studying their genetic variations. In this study, we leveraged GPCR coupling data generated with a shedding assay to develop a machine learning-based predictor, PRECOG, which allows users to:

1) predict with higher confidence the coupling probabilities of a given input sequence for individual G-proteins;
2) visually inspect the protein sequence and structural features that are responsible for a particular coupling;
3) rationally design artificial GPCRs with new coupling properties (e.g. DREADDs) based on suggested mutations informed by feature analysis.

Publications

The web-server
PRECOG: PREdicting COupling probabilities of G-protein coupled receptors
Singh G, Inoue A, Gutkind JS, Russell RB, Raimondi F. Nucl Acids Res. (Accepted), 2019.

The method
Illuminating G-protein coupling selectivity of GPCRs
Inoue A, Raimondi F, Kadji FMN, Singh G, Kishu T, Uwamizu A, Ono Y, Shinjo Y, Ishida S, Arang N, Kawakami K, Gutkind JS, Aoki J, Russell RB. Cell (Accepted), 2019.

Data set, significant features and ML

As part of a recent effort to systematically quantify and dissect GPCR coupling specificities, we built the predictor by exploiting experimental binding affinities of 144 Class A human GPCRs for 11 G-proteins, obtained through the TGFα (transforming growth factor alpha) shedding assay. For each G-protein, we defined a training set by labeling receptors with binding affinities at or above a given threshold (LogRAi >= -1) as coupled, and the remaining ones as not coupled. Based on this classification, we derived a set of sequence- and structure-based features (refer to the Heatmap) that were statistically associated with a given coupling group and which we used for training. We implemented the predictor using a logistic regression classifier.
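The labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the PRECOG code: the receptor names and LogRAi values are invented, and label_receptors is a hypothetical helper. Only the cutoff (LogRAi >= -1) comes from the text.

```python
# LogRAi cutoff stated in the text: values at or above it count as coupled.
COUPLING_THRESHOLD = -1.0


def label_receptors(log_rai_by_receptor, threshold=COUPLING_THRESHOLD):
    """Return {receptor: 1 if coupled, 0 if not coupled} for one G-protein."""
    return {
        name: int(value >= threshold)
        for name, value in log_rai_by_receptor.items()
    }


# Invented LogRAi values for a single G-protein (illustration only).
example = {"receptor_A": -0.2, "receptor_B": -1.0, "receptor_C": -1.7}
labels = label_receptors(example)
```

A receptor sitting exactly on the cutoff (receptor_B here) is labeled as coupled, because the comparison is inclusive.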

Workflow

Methodology

Sequence-based coupling determinant features

We first generated a multiple sequence alignment (MSA) of the 144 Class A GPCR sequences using hmmalign from the HMMER package and the 7tm_1 Pfam Hidden Markov Model (HMM). Following a previously described procedure, we subdivided the pool of receptor sequences into those coupled and those not coupled to a given G-protein, using the optimal LogRAi cutoff as the lower and upper bound, respectively. These sub-alignments were used to build the corresponding HMM profiles through hmmbuild, leading to 22 models (coupled vs. uncoupled for 11 G-proteins).
From the coupled and uncoupled HMM profiles for each G-protein, we then extracted alignment positions present in both HMM models and showing statistically different distributions (Wilcoxon’s signed-rank test; p-value <= 0.05) of the 20 amino acid bit scores. We also considered alignment positions with consensus columns (i.e. those having a fraction of residues, as opposed to gaps, equal to or greater than hmmbuild’s symfrac parameter, using the default value of 0.5) present in only one of the two HMM models. In detail, if a consensus column was present only in the HMM profile of either the coupled or the uncoupled group, we labelled it as an insertion or a deletion, respectively. As additional features, we also included the length and amino acid composition of the N- and C-termini (N-term and C-term) and of the extra- and intra-cellular loops (ECLs and ICLs). For every G-protein, only statistically significant (p-value < 0.05; Wilcoxon's rank-sum test) features were considered. To identify each position within the alignment, we employed the Ballesteros-Weinstein numbering, using the consensus secondary structure from the 7tm_1 HMM model to number residues within helices consecutively. The most conserved positions within each helix were defined according to GPCRDB.
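The per-position comparison described above can be illustrated with SciPy's paired Wilcoxon signed-rank test. This is a sketch under stated assumptions, not the PRECOG implementation: the bit scores are invented, position_differs is a hypothetical helper, and the scores are assumed to be already parsed from the two HMM profiles into lists of 20 values in the same amino acid order.

```python
from scipy.stats import wilcoxon

P_VALUE_CUTOFF = 0.05  # significance threshold stated in the text


def position_differs(coupled_scores, uncoupled_scores, cutoff=P_VALUE_CUTOFF):
    """Paired Wilcoxon signed-rank test on the 20 per-amino-acid bit scores."""
    _, p_value = wilcoxon(coupled_scores, uncoupled_scores)
    return p_value <= cutoff, p_value


# Invented bit scores for one alignment position in the coupled HMM.
coupled = [1.0 + 0.1 * i for i in range(20)]

# Case 1: every amino acid scores higher in the uncoupled HMM (consistent shift).
shifted = [c + 0.05 * (i + 1) for i, c in enumerate(coupled)]

# Case 2: differences alternate in sign, so there is no consistent shift.
alternating = [
    c + (0.05 * (i + 1) if i % 2 else -0.05 * (i + 1))
    for i, c in enumerate(coupled)
]

shifted_significant, shifted_p = position_differs(coupled, shifted)
alternating_significant, alternating_p = position_differs(coupled, alternating)
```

In this toy example only the consistently shifted position passes the cutoff and would be kept as a candidate feature.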

Implementing Precog

We implemented Precog using a logistic regression (log-reg) classifier available from the scikit-learn package. Log-reg models the probability of the possible outcomes with a logistic function and supports L1 or L2 regularization; in this study we used the L2-penalized form. The log-odds of the target are modeled as a linear combination of the given features, a property that can also be exploited to study the feature weights. We used liblinear as the optimization algorithm, as it has been shown to perform well on relatively small datasets.
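As a rough illustration of this setup, the snippet below fits an L2-penalized logistic regression with the liblinear solver in scikit-learn and reads back the feature weights. The data are synthetic and the value of C is arbitrary; neither is taken from PRECOG. L2 is scikit-learn's default penalty, so it is not passed explicitly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a training matrix: 144 receptors, 5 features in [0, 1].
rng = np.random.default_rng(0)
n_receptors, n_features = 144, 5
X = rng.random((n_receptors, n_features))

# Invented labels that depend mainly on feature 0, plus a little noise.
y = (X[:, 0] + 0.1 * rng.standard_normal(n_receptors) > 0.5).astype(int)

# L2 penalty is the scikit-learn default; liblinear is the solver named in the text.
model = LogisticRegression(solver="liblinear", C=1.0)
model.fit(X, y)

weights = model.coef_[0]                      # one weight per feature
probabilities = model.predict_proba(X)[:, 1]  # coupling probability per receptor
most_important = int(np.argmax(np.abs(weights)))
```

Because the model is linear in the log-odds, the sign and magnitude of each entry in weights indicate how a feature pushes the prediction towards coupling or non-coupling, provided the features are on a comparable scale.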

Training and cross-validation

We used 7TM domain positions and compositional features for the ICL3 and C-term, which prevail over other extra-7TM domain features, to create a training matrix. For significant positional features, two bit scores (one derived from the positive and one from the negative HMM for a given G-protein) are returned for the amino acid found at that position in the input GPCR sequence. If a position was present in only one of the positive or negative HMMs, the single bit score derived from the respective HMM was returned. If a GPCR had no amino acid at the given position, it was assigned the highest bit scores from both models, implying the least conserved scores.
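The three cases above (position in both HMMs, position in one HMM, gap in the input sequence) can be sketched as follows. The function name, the dictionary layout and the scores are hypothetical; the sketch assumes the per-position bit scores have already been read from the two HMM profiles.

```python
def encode_position(residue, positive_scores, negative_scores):
    """Return the bit-score features for one significant alignment position.

    residue: amino acid found at this position in the input sequence,
        or None if the sequence has a gap there.
    positive_scores / negative_scores: {amino_acid: bit_score} taken from the
        coupled / uncoupled HMM, or None if the position is absent from that HMM.
    """
    features = []
    for scores in (positive_scores, negative_scores):
        if scores is None:
            # Position present in only one HMM: a single score is returned.
            continue
        if residue is None:
            # Gap in the input sequence: use the highest bit score of the model.
            features.append(max(scores.values()))
        else:
            features.append(scores[residue])
    return features


# Invented bit scores for a two-letter alphabet, for brevity.
positive = {"A": 1.2, "R": 3.4}
negative = {"A": 2.0, "R": 0.5}

both_models = encode_position("A", positive, negative)
one_model = encode_position("R", positive, None)
gap = encode_position(None, positive, negative)
```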
All the features were scaled to the range [0, 1]. A grid search was performed over a stratified 5-fold cross-validation (CV) to select the best value of C (the inverse of the regularization strength). The value giving the best area under the Receiver Operating Characteristic (ROC) curve (AUC) was chosen to build the model for each G-protein. The feature weights were extracted from the trained models as described elsewhere and are critical to understanding the relative importance of the different features.
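The scaling, grid search and cross-validation steps can be combined in scikit-learn roughly as shown below. This is a sketch on synthetic data: the grid of C values, the random seeds and the choice to place the scaler inside the pipeline (so that it is refitted on each training fold) are assumptions, not details confirmed by the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training matrix of one G-protein.
rng = np.random.default_rng(0)
X = rng.normal(size=(144, 6)) * 3.0
y = (X[:, 0] - X[:, 1] + rng.normal(size=144) > 0).astype(int)

pipeline = Pipeline([
    ("scale", MinMaxScaler(feature_range=(0, 1))),       # features scaled to [0, 1]
    ("logreg", LogisticRegression(solver="liblinear")),  # L2 penalty by default
])

# Hypothetical grid for C, the inverse of the regularization strength.
param_grid = {"logreg__C": [0.01, 0.1, 1.0, 10.0, 100.0]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

best_c = search.best_params_["logreg__C"]
best_auc = search.best_score_  # mean ROC AUC over the 5 folds
weights = search.best_estimator_.named_steps["logreg"].coef_[0]
```

With the default refit behaviour, GridSearchCV retrains the best configuration on the full data set, so weights comes from a model trained on all receptors.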

Randomized test

To assess over-fitting, we performed a randomization test. For every G-protein, the original labels of the training matrix were replaced with randomly assigned labels, while preserving the ratio of positive (coupling) to negative (non-coupling) GPCRs.
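One simple way to randomize labels while keeping the positive-to-negative ratio fixed is to permute the original label vector. The sketch below does this with NumPy; the class sizes are invented and randomize_labels is a hypothetical helper, not the PRECOG code.

```python
import numpy as np


def randomize_labels(labels, seed=None):
    """Shuffle the labels; a permutation keeps the class counts unchanged."""
    rng = np.random.default_rng(seed)
    return rng.permutation(np.asarray(labels))


# Invented example: 40 coupled and 104 non-coupled receptors.
original = np.array([1] * 40 + [0] * 104)
randomized = randomize_labels(original, seed=0)
```

A model retrained on the randomized labels would be expected to perform close to chance; a score near that of the real model would point to over-fitting.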

Test set

To benchmark our method and compare Precog with other web servers for GPCR-G protein coupling prediction, we extracted all the GPCRs from IUPHAR that are present neither in our training set nor in those of the other web servers. A major limitation of IUPHAR is the absence of a definite true negative set; the best measure for comparing the performance of Precog with that of the others is therefore recall, also known as sensitivity or the true positive rate. We combined the performance of the individual G-protein predictors by G-protein family to evaluate Precog, which outperformed the other publicly available web servers on the test set.
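Recall only needs the known positives, which is why it suits a test set without reliable negatives. The snippet below is a minimal sketch of the calculation; the prediction vector is invented and the recall helper is hypothetical.

```python
def recall(true_labels, predicted_labels):
    """Recall = TP / (TP + FN), computed over the known positives only."""
    true_positives = sum(
        1 for t, p in zip(true_labels, predicted_labels) if t == 1 and p == 1
    )
    false_negatives = sum(
        1 for t, p in zip(true_labels, predicted_labels) if t == 1 and p == 0
    )
    positives = true_positives + false_negatives
    if positives == 0:
        raise ValueError("recall is undefined without known positives")
    return true_positives / positives


# Invented example: five known couplings, four of them predicted correctly.
known = [1, 1, 1, 1, 1]
predicted = [1, 1, 0, 1, 1]
score = recall(known, predicted)
```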

Libraries used

The following libraries were used to build the web server:
- Jmol
- jQuery
- neXtProt
- Bootstrap
- Flask
- Scikit-learn

Contact

Russell Lab at the University of Heidelberg, Germany
Gurdeep Singh: gurdeep.singh@bioquant.uni-heidelberg.de
Francesco Raimondi: francesco.raimondi@bioquant.uni-heidelberg.de
Rob Russell: robert.russell@bioquant.uni-heidelberg.de


Inoue Lab at Tohoku University, Japan
Asuka Inoue: iaska@m.tohoku.ac.jp

Precog was developed in the Russell lab at the University of Heidelberg, 2019