VAEResTL: A Novel Generative Model for Designing Complementarity Determining Region of Antibody for SARS-CoV-2

Saeed Khalilian, Zahra Moti, Arian Baloochestani, Yeganeh Hallaj, Alireza Chavosh and Zahra Hemmatian 1,∗

MarWell Bio Inc., California, U.S.A.

Independent Researcher, Iran

Independent Researcher, The Netherlands

Independent Researcher, Norway

Keywords: Antibody, Nanobody, Complementarity Determining Region (CDR), SARS-CoV-2, COVID-19, Deep Generative Models, Transfer Learning, Bioinformatics, in-silico Screening.

Abstract:

The global impact of the COVID-19 pandemic underlines the importance of developing a competent machine learning (ML) approach that can rapidly design therapeutics and prophylactics, such as antibodies/nanobodies, against novel viral infections despite data shortage and sequence complexity. Here, we propose a novel end-to-end deep generative model, named VAEResTL, based on a convolutional Variational Autoencoder (VAE), a Residual Neural Network (Resnet), and Transfer Learning (TL), that can generate sequences of the third complementarity-determining region of the heavy chain (CDR-H3) for a novel target lacking sufficient training data. We further demonstrate that our proposed method generates CDR-H3 sequences for designing and developing therapeutic antibodies/nanobodies that can bind to different variants of SARS-CoV-2 despite the shortage of SARS-CoV-2 training data. The predicted CDR-H3 sequences are then screened and filtered for their developability parameters, namely viscosity, clearance, solubility, stability, and immunogenicity, through several in-silico steps, resulting in a list of highly optimized lead candidates.

1 INTRODUCTION

Antibodies play an important role in therapeutic discovery and vaccine development for a variety of diseases ranging from infectious diseases to cancer and autoimmune diseases (Zohar and Alter, 2020). Wet-lab methods for antibody discovery can be very time-consuming and costly. One of these methods is high-throughput screening, a drug discovery process used to identify antibody leads that bind to their antigen targets and fall within the therapeutic and developability index range (Sharma et al., 2014). The binding site of an antibody/nanobody includes regions known as complementarity determining regions (CDRs) (Murphy et al., 2008). Among the CDRs, CDR-H3 on the heavy chain is the most variable and typically contributes the most to antigen specificity for antibodies and nanobodies (Tsuchiya and Mizuguchi, 2016). The current COVID-19 pandemic underlines the importance of developing approaches capable of rapidly designing and developing therapeutics and prophylactics against novel viral infections. Designing CDR-H3 plays a critical role in antibody/nanobody-based therapeutics. Artificial intelligence (AI) and machine learning (ML) techniques have recently been explored for COVID-19 vaccine development to stop the spread of the virus (Ong et al., 2020); however, the rapid development of antibodies and nanobodies, which can offer therapeutic and prophylactic benefits, is also of critical importance.

∗ Corresponding author email: zara@marwell.bio

Recently, biologically plausible deep learning (Yoo et al., 2020) and computational models (Adolf-Bryfogle et al., 2018) have been successfully applied to design and optimize CDR loops using deep sequencing data (Norman et al., 2020). Deep generative models using Variational Autoencoders (VAE) have also been applied in designing proteins (Friedensohn et al., 2020) and in predicting protein structures (Guo et al., 2020). Moreover, deep Residual Neural Network (Resnet) models have greatly improved protein design and protein structure prediction (Lu et al., 2020) and antibody-epitope classification (Ripoll et al., 2021). Nevertheless, the application of Resnet in antibody/nanobody discovery has rarely been explored. Existing deep learning and VAE-based approaches are often used in conjunction with large datasets and are incapable of discovering and designing new antibodies/nanobodies when facing data shortage for novel targets such as SARS-CoV-2 and its variants. Transfer learning (TL) techniques have succeeded in biomedical image classification and various protein prediction tasks (Heinzinger et al., 2019; Valeri et al., 2020); however, to the best of our knowledge, no study in the literature has empowered the VAE deep generative model with TL to tackle the lack of training data in antibody/nanobody discovery. Here, we leverage the power of deep learning algorithms to predict therapeutic antibodies with binding ability to SARS-CoV-2. We present a novel end-to-end generative model, "VAEResTL", to discover amino acid sequences against novel target antigens that lack training data. Our proposed VAEResTL model is based on a VAE model, a Resnet structure, and a network-based TL technique. We demonstrate that Resnet efficiently improved the performance of our deep generative model, VAERes, while providing a deep neural network capable of learning CDR-H3 sequence complexity. The knowledge learned by VAERes, pre-trained on sufficient antibody CDR-H3 amino acid sequences for different target antigens, can be efficiently transferred to predict antibody/nanobody amino acid sequences with binding ability to SARS-CoV-2. The VAEResTL-predicted CDR-H3 sequences were then subjected to in-silico screening and filtering for developability parameters, namely viscosity, clearance, solubility, stability, and immunogenicity, resulting in potential lead antibody/nanobody CDR-H3 sequence candidates with optimal therapeutic properties.

Khalilian, S., Moti, Z., Baloochestani, A., Hallaj, Y., Chavosh, A. and Hemmatian, Z.
VAEResTL: A Novel Generative Model for Designing Complementarity Determining Region of Antibody for SARS-CoV-2.
DOI: 10.5220/0010823700003123
In Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2022) - Volume 3: BIOINFORMATICS, pages 107-114
ISBN: 978-989-758-552-4; ISSN: 2184-4305

2 METHODOLOGY

2.1 Datasets and Data Pre-processing

We used amino acid sequences of antibody CDR-H3 derived from deep sequencing data to develop and train our proposed methods. We selected CDR-H3 sequences based on their binding ability to SARS-CoV-2 (Raybould et al., 2021), resulting in 2298 sequences, which we used as our primary training data. For our TL method, we pre-trained our algorithms on three large datasets covering three different antigenic targets with a wide variety of antibodies: ranibizumab (Rani; 67769 sequences) (Liu et al., 2020), yeast display scFv (Yeast; 11038 sequences) (Adams et al., 2016), and chicken ovalbumin (OVA; 65638 sequences) (Goldstein et al., 2019).

We converted each amino acid sequence into a 2-dimensional matrix through one-hot encoding. To bring sequences of varying length (8 to 20 residues) to a fixed length of 20, we padded them by adding the null character J to the left and right sides. Since the alphabet contains 24 symbols (20 standard amino acids, two rare amino acids (U, O), one unknown (X), and one null (J)), a CDR-H3 sequence padded to a length of 20 results in a 20 x 24 matrix. Each row contains a single '1' in the column corresponding to the amino acid at that position and '0' in all other columns.
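The encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the column order of the alphabet, the way padding is split between the left and right sides, and the example sequence are all assumptions.

```python
import numpy as np

# 24-symbol alphabet from the text: 20 standard residues, two rare (U, O),
# one unknown (X), and the null/padding character J.
# The column order is an assumption made for this sketch.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYUOXJ"
MAX_LEN = 20
PAD = "J"


def pad_sequence(seq, max_len=MAX_LEN, pad=PAD):
    """Pad a sequence on both sides with the null character.

    Splitting the padding evenly (extra character on the right) is an
    assumption; the text only states that J is added left and right.
    """
    if len(seq) > max_len:
        raise ValueError("sequence longer than the fixed length")
    total = max_len - len(seq)
    left = total // 2
    return pad * left + seq + pad * (total - left)


def one_hot(seq):
    """Return a (20, 24) matrix with a single 1 per row."""
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    matrix = np.zeros((MAX_LEN, len(ALPHABET)), dtype=np.float32)
    for row, aa in enumerate(pad_sequence(seq)):
        matrix[row, index[aa]] = 1.0
    return matrix


def decode(matrix):
    """Invert the encoding and strip the flanking null characters."""
    chars = "".join(ALPHABET[i] for i in matrix.argmax(axis=1))
    return chars.strip(PAD)


# Hypothetical 12-residue CDR-H3, used only to exercise the functions.
encoded = one_hot("ARDYYGSSYFDY")
```

Decoding by row-wise argmax is also how a generated matrix would typically be turned back into a sequence, which is why invalid padding (a J between residues) can appear in model output.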

Figure 1: Overview of the proposed method.

2.2 Model Architecture and Training

The VAEResTL, which is an end-to-end trainable model, comprises a convolutional VAE that adopts a Resnet structure and is enhanced with a TL technique (Figure 1). Our method uses only amino acid sequence data, without the need for structural data.

CDR-H3 sequences are subject to high variation in

the amino acid distribution and offer the highest con-

tribution to antigen speciﬁcity. A deep neural network

capable of learning from complex and highly variant

sequences is essential to model CDR-H3 sequences;

nevertheless, increasing the depth of the neural net-

work by adding more layers leads to a vanishing gra-

dient problem (Goceri, 2019). Resnet introduces skip

connections to jump over some layers. The skip con-

nections allow gradients of a deep neural network to

ﬂow easily from layer to layer and prevent gradients

from vanishing (He et al., 2016). Inspired by the suc-

cessful applications of Resnet in bioinformatics (Xu

et al., 2020), (Ripoll et al., 2021), here we incorpo-

rate Resnet blocks into our convolutional VAE model

to increase the depth of the neural network for mod-

eling and predicting CDR-H3 sequences. We named

our convolutional VAE that adopts a Resnet approach,

VAERes. We further incorporated a network-based

TL technique (Tan et al., 2018), into our VAERes and

named our model VAEResTL (Figure 1).
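To make the role of the skip connection concrete, the following NumPy sketch implements a generic one-dimensional residual block operating on a one-hot encoded sequence. It is an illustration under stated assumptions, not the VAERes architecture: the kernel width, channel counts, activation, and the conv-ReLU-conv layout are hypothetical choices, since the paper does not specify them here.

```python
import numpy as np


def conv1d_same(x, kernel, bias):
    """1D convolution along the sequence axis with zero 'same' padding.

    x:      (length, in_channels)
    kernel: (width, in_channels, out_channels)
    bias:   (out_channels,)
    """
    width = kernel.shape[0]
    pad_left = (width - 1) // 2
    pad_right = width - 1 - pad_left
    padded = np.pad(x, ((pad_left, pad_right), (0, 0)))
    out = np.empty((x.shape[0], kernel.shape[2]))
    for t in range(x.shape[0]):
        window = padded[t:t + width]  # (width, in_channels)
        out[t] = np.tensordot(window, kernel, axes=([0, 1], [0, 1])) + bias
    return out


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, k1, b1, k2, b2):
    """Compute relu(x + F(x)) where F is conv -> relu -> conv.

    The addition of x is the skip connection: even if F contributes
    nothing, the input (and its gradient) passes through unchanged.
    """
    h = relu(conv1d_same(x, k1, b1))
    h = conv1d_same(h, k2, b2)
    return relu(x + h)


# With all-zero weights the block reduces to the identity on non-negative
# inputs, which illustrates why stacking such blocks does not block signal flow.
rng = np.random.default_rng(0)
x = rng.random((20, 24))
zero_k = np.zeros((3, 24, 24))
zero_b = np.zeros(24)
y = residual_block(x, zero_k, zero_b, zero_k, zero_b)
```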

BIOINFORMATICS 2022 - 13th International Conference on Bioinformatics Models, Methods and Algorithms


3 EXPERIMENTS

3.1 Experimental Setup

We executed five experiments and compared VAE, VAERes, and VAEResTL with the baseline models HMM and LSTM (Table 1). In experiments 1 and 2, we trained the convolutional VAE and VAERes directly on the SARS-CoV-2 dataset. For VAEResTL, we designed experiment 3, in which VAERes was first pre-trained on the combined Rani + Yeast + OVA sequences and then trained on the SARS-CoV-2 dataset. In experiments 4 and 5, we trained the baseline models HMM and LSTM directly on the SARS-CoV-2 dataset. HMM and LSTM are sequence models, described as follows:

(a) LSTM: We adopted an LSTM model previously

reported (Gupta et al., 2018), and replaced one-

hot encoding with embedding layer. The network

consists of two layers of LSTM with 100 units,

cross-entropy loss function and Adam optimizer.

(b) HMM: We used HMM model described by Ra-

biner (Rabiner, 1989) as a character-based model

where amino acid characters are considered as

states. Each state has a probability distribution

over a set of possible sequences. Amino acid

characters are then selected to form CDR-H3 se-

quences.
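As a rough illustration of a character-based baseline in which amino acid characters act as states, the sketch below fits and samples a first-order Markov chain. This is a deliberate simplification and an assumption on our part: it is not the HMM used in the experiments, and the start/end markers, function names, and toy training sequences are hypothetical.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # hypothetical boundary markers


def fit_markov_chain(sequences):
    """Estimate transition probabilities between consecutive characters."""
    counts = defaultdict(Counter)
    for seq in sequences:
        symbols = [START] + list(seq) + [END]
        for prev, nxt in zip(symbols, symbols[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
        for prev, ctr in counts.items()
    }


def sample_sequence(transitions, rng, max_len=20):
    """Walk the chain from START until END or the length limit is reached."""
    seq, state = [], START
    while len(seq) < max_len:
        choices = transitions[state]
        state = rng.choices(list(choices), weights=list(choices.values()), k=1)[0]
        if state == END:
            break
        seq.append(state)
    return "".join(seq)


# Toy training data, used only to exercise the functions.
transitions = fit_markov_chain(["ARDY", "ARGY"])
rng = random.Random(0)
samples = [sample_sequence(transitions, rng) for _ in range(10)]
```

A model of this kind only sees local character-to-character statistics, which is consistent with the difficulty such baselines have in capturing longer-range structure in CDR-H3 sequences.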

3.2 Experimental Metrics

We used machine learning (ML) and biophysical met-

rics to evaluate our proposed methods’ performance.

We further performed in-silico screening to assess

the predicted antibody/nanobody CDR-H3 sequences

that may bind to SARS-CoV-2.

3.2.1 Sequence Similarity

We used three different metrics to evaluate sequence similarity. 1) Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002): we employed 2-gram, 3-gram, and 4-gram BLEU, estimated using the nltk Python library. A higher BLEU score indicates a higher degree of similarity between the seed and the generated sequences. 2) Jensen-Shannon divergence (JSD) (Lin, 1991), a statistical measure: when the JSD value is close to zero, the distribution of the generated sequences is very close to that of the seed sequences. 3) The Needleman-Wunsch (NW) pairwise sequence alignment method (Needleman and Wunsch, 1970), which is grouped with the biophysical characteristics in Table 1 and is used to evaluate the sequence similarity between seed and generated sequences (Wang et al., 2020): the higher the NW value, the more similar the two sequences.
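Two of these metrics are simple enough to sketch in pure Python. The example below computes a base-2 Jensen-Shannon divergence between amino acid frequency distributions and a Needleman-Wunsch global alignment score. Several details are assumptions, because the text does not state them: which distribution the JSD is taken over, the logarithm base, and the NW scoring scheme (match +1, mismatch -1, gap -1 here).

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_distribution(sequences):
    """Relative frequency of each standard amino acid across a library."""
    counts = Counter(aa for seq in sequences for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values())
    return [counts[aa] / total for aa in AMINO_ACIDS]


def jensen_shannon_divergence(p, q):
    """Base-2 JSD: 0 for identical distributions, 1 for disjoint ones."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Toy libraries, used only to exercise the functions.
seed_dist = residue_distribution(["ARDY", "ARGY"])
generated_dist = residue_distribution(["ARDY", "ARDW"])
jsd = jensen_shannon_divergence(seed_dist, generated_dist)
```

An average NW value for a generated library could then be obtained by scoring each generated sequence against the seed sequences and averaging; how the pairs are chosen is not specified in the text.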

3.2.2 Sequence Diversity

We estimated sequence diversity by measuring the number of shared n-grams, for different values of n, between generated and seed sequences, referred to as $S$ (Das et al., 2018). Therefore, a value of $S^{\text{model1}}_{\text{model2}} < 1$ (the shared n-gram measure of model 1 relative to that of model 2) implies more diversity of the sequences generated by model 1 at a particular n compared to those of model 2.
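The shared n-gram comparison can be sketched as follows. The normalization (fraction of distinct generated n-grams that also occur in the seed set) and the interpretation of the model-to-model comparison as a simple ratio are assumptions made for this illustration; the function names and toy sequences are hypothetical.

```python
def ngrams(seq, n):
    """Set of all contiguous substrings of length n."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def shared_ngram_fraction(generated, seed, n):
    """Fraction of distinct generated n-grams that also appear in the seed set."""
    seed_grams = set().union(*(ngrams(s, n) for s in seed))
    gen_grams = set().union(*(ngrams(s, n) for s in generated))
    if not gen_grams:
        return 0.0
    return len(gen_grams & seed_grams) / len(gen_grams)


def diversity_ratio(generated_1, generated_2, seed, n):
    """Shared n-gram measure of model 1 relative to model 2.

    A value below 1 means model 1 shares fewer n-grams with the seed
    sequences than model 2, i.e. it is more diverse at this n.
    """
    s2 = shared_ngram_fraction(generated_2, seed, n)
    if s2 == 0:
        raise ValueError("model 2 shares no n-grams with the seed set")
    return shared_ngram_fraction(generated_1, seed, n) / s2


# Toy example: model 1 shares 2 of 4 bigrams with the seed, model 2 shares 3 of 4.
seed = ["ARDYW"]
ratio = diversity_ratio(["ARGYW"], ["ARDYF"], seed, n=2)
```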

3.2.3 Biophysical Metrics

We used the Bio and modlAMP libraries (Müller et al., 2017), which incorporate several modules for calculating descriptors of biophysical characteristics of amino acid sequences, e.g., stability, isoelectric point, charge, and hydrophobicity (H), to evaluate the biophysical properties of the predicted CDR-H3 sequences (Sharma et al., 2014).
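For readers without these libraries at hand, the sketch below shows what two of the descriptors amount to. It is a library-free approximation and not the calculation used in the paper: the hydrophobicity values are the Eisenberg consensus scale as commonly tabulated (they should be checked against the library actually used), averaging per-residue values is an assumption, and the charge is a crude integer count that ignores termini, histidine, and pH-dependent partial charges.

```python
# Eisenberg consensus hydrophobicity scale, as commonly tabulated.
EISENBERG = {
    "A": 0.62, "R": -2.53, "N": -0.78, "D": -0.90, "C": 0.29,
    "Q": -0.85, "E": -0.74, "G": 0.48, "H": -0.40, "I": 1.38,
    "L": 1.06, "K": -1.50, "M": 0.64, "F": 1.19, "P": 0.12,
    "S": -0.18, "T": -0.05, "W": 0.81, "Y": 0.26, "V": 1.08,
}


def mean_hydrophobicity(seq):
    """Average Eisenberg value over standard residues (padding is skipped)."""
    values = [EISENBERG[aa] for aa in seq if aa in EISENBERG]
    if not values:
        raise ValueError("no standard residues in sequence")
    return sum(values) / len(values)


def crude_net_charge(seq):
    """Integer charge estimate: +1 for K/R, -1 for D/E.

    This is only a rough stand-in for a pH-dependent calculation.
    """
    return sum(1 for aa in seq if aa in "KR") - sum(1 for aa in seq if aa in "DE")


# Hypothetical padded CDR-H3, used only to exercise the functions.
h_value = mean_hydrophobicity("JJJJARDYYGSSYFDYJJJJ")
charge = crude_net_charge("JJJJARDYYGSSYFDYJJJJ")
```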

3.2.4 In-Silico Screening

We used the CamSol method, a protein solubility predictor (Sormanni et al., 2015), at pH 7.0 to estimate the protein solubility score for each CDR-H3 sequence. We measured each sequence variant's net charge and hydrophobicity (H) (Sharma et al., 2014) to predict the sequences' viscosity and clearance. To assess immunogenicity, we predicted the peptide binding affinity of the variant CDR-H3 sequences to MHC Class II molecules for a reference set of 26 human leukocyte antigen (HLA) alleles by employing NetMHCIIpan (Jensen et al., 2018). The NetMHCIIpan output provides a percentile rank that reflects a sequence's affinity compared with a set of random natural peptides; a lower percentile rank indicates stronger predicted binding. The percentile rank classifies peptides as weak or strong binders to specific MHC Class II alleles: strong binders have a percentile rank below two, and weak binders have a percentile rank below ten. For each sequence, we calculated the minimum percentile rank (a minimum below ten is likewise classified as a weak binder) and the average percentile rank across all 26 HLA alleles. The weaker the peptide binding affinity, the less immunogenic a sequence is (Mason et al., 2019).
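The percentile-rank bookkeeping can be sketched as below, following the usual NetMHCIIpan convention that a lower %Rank means stronger predicted binding. The sketch assumes the %Rank values have already been produced by the external tool; the cut-offs of 2 and 10, the function names, and the allele labels are illustrative assumptions.

```python
def classify_binder(rank, strong_cutoff=2.0, weak_cutoff=10.0):
    """Label a peptide from its percentile rank (lower rank = stronger binding)."""
    if rank < strong_cutoff:
        return "strong"
    if rank < weak_cutoff:
        return "weak"
    return "non-binder"


def summarize_ranks(ranks_by_allele):
    """Minimum and average %Rank over all peptides and alleles of one sequence.

    ranks_by_allele maps an allele label to the list of %Rank values
    predicted for the peptides derived from a single CDR-H3 sequence.
    """
    all_ranks = [r for ranks in ranks_by_allele.values() for r in ranks]
    if not all_ranks:
        raise ValueError("no rank values supplied")
    return min(all_ranks), sum(all_ranks) / len(all_ranks)


# Hypothetical predictions for one sequence against two placeholder alleles.
min_rank, avg_rank = summarize_ranks({"allele_a": [4.0, 80.0], "allele_b": [36.0]})
label = classify_binder(min_rank)
```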


Table 1: Baseline models comparison. Biological Characteristics and Machine Learning Metrics for generated SARS-CoV-2 CDR-H3 sequences by VAERes, VAEResTL, HMM, LSTM, and Seed SARS-CoV-2. NW, H, ISO, Charge, and Stability are biophysical characteristics; BLEU and JSD are machine learning metrics.

Method     NW     H     ISO   Charge  Stability  BLEU (2-gram)  BLEU (3-gram)  BLEU (4-gram)  JSD
VAERes     25.07  0.26  5.08  -2.15   13.18      76.56          75.66          75.59          0.007
VAEResTL   39.31  0.24  5.77  -1.20   25.46      77.57          76.41          75.20          0.005
LSTM       40.51  0.20  4.20  -1.60   57.48      81.12          80.01          78.54          0.11
HMM        9.12   0.52  9.5   0.68    45.9       82.1           80.50          78.60          0.009
Seed       NA     0.23  5.64  -0.84   33.36      NA             NA             NA             NA

4 EXPERIMENTAL RESULTS

AND DISCUSSION

4.1 VAE Models Comparison

The convolutional VAE algorithm could not learn the complexity of CDR-H3 sequences: 100% of the generated amino acid sequences were abnormal (invalid padding or invalid characters) and inappropriate for further analysis. Nevertheless, when we incorporated the Resnet structure into the convolutional VAE, our VAERes model demonstrated significant improvements in generating CDR-H3 sequences for SARS-CoV-2. When we trained VAERes on the SARS-CoV-2 seed CDR-H3 sequences, 98% of the generated sequences were valid (valid padding with valid characters), and only 2% were abnormal (invalid padding or invalid characters). However, 86.8% of the generated sequences were valid duplicates, and only 11.2% were valid and unique (together accounting for the 98% of valid sequences). These results suggest that VAERes could learn only a small range of CDR-H3 sequences, reflecting the lack of adequate training data for SARS-CoV-2 seed CDR-H3 sequences as a new target. With VAEResTL, 100% of the generated CDR-H3 sequences for SARS-CoV-2 were valid (valid padding with valid characters); 70.7% of them were unique, and only 29.3% were duplicates.
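The validity and uniqueness percentages reported here can be computed along the following lines. This is a sketch under assumptions that the text does not spell out: a sequence is treated as valid if it has the fixed length of 20, the null character J appears only as flanking padding, and the core has 8 to 20 residues from the 23 non-null symbols; duplicates are counted as valid sequences beyond the first occurrence, and all percentages are taken over the full generated set.

```python
import re

# Flanking J padding around a core of 8-20 non-null symbols (assumed rule).
VALID_PATTERN = re.compile(r"J*[ACDEFGHIKLMNPQRSTVWYUOX]{8,20}J*")


def is_valid(seq, total_len=20):
    """Valid padding with valid characters, at the fixed padded length."""
    return len(seq) == total_len and VALID_PATTERN.fullmatch(seq) is not None


def library_stats(generated):
    """Percentages of valid, unique, and duplicate sequences in a library."""
    valid = [s.strip("J") for s in generated if is_valid(s)]
    unique = set(valid)
    n = len(generated)
    return {
        "valid_pct": 100 * len(valid) / n,
        "unique_pct": 100 * len(unique) / n,
        "duplicate_pct": 100 * (len(valid) - len(unique)) / n,
    }


# Hypothetical generated library: one repeated sequence, one sequence with a
# J inside the core (invalid padding), and one further valid sequence.
generated = [
    "JJJJARDYYGSSYFDYJJJJ",
    "JJJJARDYYGSSYFDYJJJJ",
    "JJARJDYYGSSYFDYWWJJJ",
    "JJJJARDYYGSSYFDWJJJJ",
]
stats = library_stats(generated)
```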

4.2 Baseline Models Comparison

When we trained HMM on the SARS-CoV-2 seed sequences, HMM could generate only a very small library of CDR-H3 sequences, 1/56 of the size of the VAEResTL-generated library of SARS-CoV-2 CDR-H3 sequences, in which only 17% of the generated sequences were valid (valid padding with valid characters) and unique. These results may indicate that HMM, as a classic model, is incapable of learning the complexity of CDR-H3 sequences while suffering from a lack of sufficient training data. LSTM could also generate only a small library of sequences, of which 95.5% were valid (valid padding with valid characters). However, 86.1% of the generated sequences were valid duplicates, and only 9.4% were valid and unique (together accounting for the 95.5% of valid sequences). The LSTM-generated library of SARS-CoV-2 CDR-H3 sequences is 1/6 of the size of the VAEResTL-generated library.

We calculated a number of antibody heuristics, including the Needleman-Wunsch score (NW), Hydrophobicity (H), Isoelectric Point (ISO), Charge, and Stability, that give biological clues about how VAERes and VAEResTL perform compared to the baseline models HMM and LSTM (Table 1). The average pairwise sequence similarity (NW) is lower for VAERes and HMM than for VAEResTL. The higher NW for LSTM may be due to the bias introduced by the small ratio of unique sequences in its small library of valid sequences. The average H value of the VAEResTL-generated sequences is closer to that of the seed sequences than the values for VAERes, HMM, and LSTM. The ISO values for VAERes, VAEResTL, and LSTM lie between 4.20 and 5.77, close to the ISO value of the seed sequences (5.64), whereas the HMM-generated sequences have the highest ISO value (9.5). The average Charge values are negative for all models except HMM. The average stability value of the VAERes-generated sequences (13.18) is too low, and those of the LSTM-generated (57.48) and HMM-generated (45.9) sequences are too high. Therefore, the sequences predicted by VAERes, LSTM, and HMM are not appropriate with respect to biophysical stability. In contrast, VAEResTL has a stability value within the range of the seed sequences.

The BLEU values for the VAEResTL-generated sequences are higher than those for VAERes for 2-gram and 3-gram, and slightly lower for 4-gram (75.20 versus 75.59; Table 1). The small ratio of valid and unique sequences for LSTM and HMM may bias their learning toward only a small range of CDR-H3 sequences, which is reflected in the higher BLEU values for LSTM and HMM. Nonetheless, the JSD values for LSTM and HMM are higher than those of the other models presented in Table 1. Overall, the ML and biophysical metrics indicate that VAEResTL can more efficiently predict CDR-H3 sequences with binding ability to SARS-CoV-2. It is likely that, with transfer learning, VAEResTL learns a more "biologically plausible" latent space by utilizing a much larger dataset than VAERes and the baseline models LSTM and HMM. These results may further indicate that the VAEResTL architecture loads more biological context during the CDR-H3 sequence generation process. Our baseline model comparison may suggest that VAEResTL outperforms the baseline techniques by predicting, more accurately, a more extensive library of valid and biologically more valuable antibody/nanobody CDR-H3 sequences with binding ability to SARS-CoV-2, despite the shortage of training data.

4.3 Transfer Learning Impact

Figure 2 compares the sequences predicted by VAERes and VAEResTL with the seed CDR-H3 sequences that bind to SARS-CoV-2 to visualize the impact of TL. We report molecular features, e.g., sequence length, amino acid distribution, charge, and H, as they play a crucial role in determining the membrane-binding affinity and specificity. We report the VAEResTL results obtained when our proposed method is pre-trained on Rani + Yeast + OVA. The average amino acid composition-frequency distribution (Figure 2A), length distribution (Figure 2B), net charge (Figure 2C), and H (Figure 2D) of the VAEResTL-generated SARS-CoV-2 CDR-H3 sequences match the seed sequences more closely than those of the VAERes-generated sequences. This observation may suggest that VAERes with TL performs well in capturing the charge patterning, H, and composition within generated sequences.

Figure 2: Visual comparison of molecular characteristics between Seed (SARS-CoV-2 seed sequences, Yellow), VAERes (VAERes-generated sequences for SARS-CoV-2, Violet), and VAEResTL (VAEResTL-generated sequences for SARS-CoV-2, Turquoise). Horizontal dashed lines mark the mean. Whiskers extend to the most extreme non-outlier data points. (A) amino acid distribution, (B) amino acid length distribution, (C) total charge distribution, (D) Eisenberg hydrophobicity.

We further analyzed the diversity of the generated sequences in terms of their shared n-grams ($S$). The n-gram similarity is lower for VAEResTL with respect to VAERes ($S^{\text{VAEResTL}}_{\text{VAERes}} < 1$) for n > 2. The values for 2-gram, 3-gram, 4-gram, and 5-gram are 0.91, 0.78, 0.63, and 0.44, respectively. These results imply that VAEResTL-generated sequences show strong long-range diversity; however, they are still consistent with biological sequences, as evident from the sequence similarity comparison (Table 1). The VAEResTL-generated sequences show more substantial diversity at higher n-grams (lower $S$ values), a desirable feature that can help prevent viral resistance when designing next-generation antivirals and can provide a larger and more diverse pool of sequences for in-silico screening and therapeutics development studies. We also found from the ML visualization heatmaps (Figure 3A1-3C1) and logo plots (Figure 3A2-3C2) that although VAEResTL changed certain CDR-H3 amino acid positions and their frequency distributions, the overall pattern of the VAEResTL-generated sequences is closer to the seed sequences than that of the VAERes-generated CDR-H3 sequences. The high similarity between predicted and seed CDR-H3 sequences observed in the biophysical and ML analyses may suggest that the VAEResTL-generated CDR-H3 sequences also have binding ability to SARS-CoV-2. Moreover, the ML and biophysical characteristics of the VAEResTL-generated CDR-H3 sequences (Table 1, Figure 2, and Figure 3) indicate that VAEResTL predicts more biologically valuable CDR-H3 sequences with binding ability to SARS-CoV-2, even though we used diverse training databases. These results indicate that the generalization of VAEResTL in antibody/nanobody design is significantly improved.

4.4 In-Silico Screening

Figure 3: Visualization of machine learning (ML). Position proposed map. Heatmap visualization showing: (A1) the count of observed seed sequences for SARS-CoV-2; (B1) the count of VAERes-proposed sequences for SARS-CoV-2 from (left) and to (right) each amino acid at each sequence position; (C1) the count of VAEResTL-proposed sequences for SARS-CoV-2 from (left) and to (right) each amino acid at each sequence position. (Seed, VAERes-generated, and VAEResTL-generated sequences have a length of 20.) Sequence logo visualizations, based on residue frequency: (A2) seed SARS-CoV-2 sequences, (B2) VAERes-generated sequences for SARS-CoV-2, (C2) VAEResTL-generated sequences for SARS-CoV-2. Sequence logos are computed using Skylign.

With current advances in computational forecasts (Raybould et al., 2019), a number of parameters, including viscosity, clearance, solubility, stability, and immunogenicity, are used as guidelines for in-silico screening to select the best-in-class lead candidates. Although the predicted sequences are not yet at the clinical stage, we characterized the VAEResTL-generated SARS-CoV-2 CDR-H3 sequences against the seed CDR-H3 sequences using a number of these in-silico methods. To screen the CDR-H3 sequences for viscosity and clearance, we calculated the net charge and hydrophobicity (H) of every CDR-H3 amino acid sequence. For all sequences in the library, the net charge is calculated at pH 7.0, and the hydrophobicity scale used is "eisenberg" (Müller et al., 2017). The optimal net charge for drug clearance is between 0-6.2 with H of < 0. Therefore, we filtered out the sequences with a net charge of < 0 (Figure 4A, marked with red box) and an H of < 0 (Figure 4B, marked with red box). We also calculated the stability of the CDR-H3 sequences based on the dipeptide composition of proteins and filtered out sequences with stability values of > 40 or < 20 (Figure 4C, marked with red boxes). We then ran the VAEResTL-generated CDR-H3 amino acid sequences through CamSol to estimate their solubility and filtered out sequences with a CamSol score of < 0.2 (Figure 4D, marked with red box), following the guidelines of Sormanni and Vendruscolo (2019). Low immunogenicity is an essential property for the therapeutic developability of antibodies/nanobodies. We predicted the peptide binding affinity of all padded CDR-H3 sequences to MHC Class II by utilizing NetMHCIIpan (Jensen et al., 2018). We then used each peptide's %Rank of predicted affinity, which is obtained by comparing the CDR-H3 sequences with a set of 200,000 random natural peptides. We predicted affinity for a set of 26 HLA alleles, which covers over 98% of the global population. We filtered out the sequences with a %Rank of < 2.5 (Figure 4E, marked with red box) (SARS-CoV-2 minimum %Rank) and a %Rank of < 70 (Figure 4F, marked with red box) (SARS-CoV-2 average %Rank). After the in-silico screening, antibody/nanobody CDR-H3 variants with the desired viscosity, clearance, solubility, stability, and immunogenicity remained as potential lead candidates. All remaining predicted CDR-H3 variants against SARS-CoV-2, which fall outside the red boxes, have values equal or superior to the parameters of the SARS-CoV-2 seed sequences. Through the in-silico screening, we identified CDR-H3 sequence variants with optimized multi-parameter profiles that can be further evaluated in a wet-lab setting. In future work, additional filters, including specificity and humanization, could be implemented to find the most developable therapeutic candidates. In addition, mapping the predicted CDR-H3 sequences in silico onto protein/epitope targets could provide valuable validation.
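The sequence of filters described in this section can be summarized as a single predicate. The sketch below applies the cut-offs exactly as they are stated in the filtering sentences above and assumes that every descriptor has already been computed by the corresponding external tool; the `Candidate` container, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Precomputed descriptors for one generated CDR-H3 sequence."""
    sequence: str
    net_charge: float      # at pH 7.0
    hydrophobicity: float  # Eisenberg scale
    stability: float
    camsol: float          # CamSol solubility score
    min_rank: float        # minimum NetMHCIIpan %Rank
    avg_rank: float        # average NetMHCIIpan %Rank


def passes_screen(c):
    """Keep a candidate only if none of the stated filters removes it."""
    return (
        c.net_charge >= 0            # filtered out if net charge < 0
        and c.hydrophobicity >= 0    # filtered out if H < 0
        and 20 <= c.stability <= 40  # filtered out if > 40 or < 20
        and c.camsol >= 0.2          # filtered out if CamSol score < 0.2
        and c.min_rank >= 2.5        # filtered out if minimum %Rank < 2.5
        and c.avg_rank >= 70         # filtered out if average %Rank < 70
    )


# Hypothetical candidates with made-up descriptor values.
library = [
    Candidate("ARDYYGSSYFDY", 0.5, 0.1, 30.0, 0.4, 5.0, 75.0),
    Candidate("ARDEEGSSYFDY", -1.5, 0.1, 30.0, 0.4, 5.0, 75.0),
    Candidate("ARGYYGSSYFDW", 0.5, 0.1, 45.0, 0.4, 5.0, 75.0),
]
leads = [c.sequence for c in library if passes_screen(c)]
```

Keeping the thresholds in one place makes it straightforward to add further filters, such as the specificity and humanization checks mentioned as future work.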


Figure 4: In-Silico Screening of predicted CDR-H3 se-

quences. Histograms present the parameter distributions of

generated and seed SARS-CoV-2 sequences for different ﬁl-

tering steps. Red boxes show ﬁltering cut-off for the ther-

apeutic index in clinical setting. (A) CDR-H3 net-charge.

(B) CDR-H3 H. (C) CDR-H3 stability score. (D) CamSol

solubility score. (E) the minimum NetMHCIIpan %Rank

across all possible 15-mers for a given sequence and across

all 26 HLA alleles. (F) the average NetMHCIIpan %Rank

across all possible 15-mers for a given sequence and across

all 26 HLA alleles.

5 CONCLUSIONS

The results of our study exhibit the successful application of a Resnet-adopted VAE for generating novel CDR-H3 sequences. We further illustrate how transfer learning techniques can maximize the power of our VAERes model for antibody/nanobody discovery when dealing with the lack of training data for novel targets such as SARS-CoV-2. Our model was trained on more than 140,000 known and diverse CDR-H3 sequences from well-studied targets, creating a readily usable tool with extensive generalization capabilities to discover new antibody/nanobody-based therapeutics. To select antibodies/nanobodies with improved characteristics, we identified the best lead CDR-H3 sequences with binding ability to SARS-CoV-2 through in-silico screening. The outcome of this proof-of-concept study can drive future work on the validation of lead candidates through wet-lab experiments and the expansion of our model to the discovery of other CDR fractions to develop therapeutics against different variants of SARS-CoV-2, including Delta and Omicron, as well as other targets. In future work, we will also employ VAEResTL to design bispecific and trispecific antibodies to develop next-generation cancer therapeutics.

ACKNOWLEDGEMENTS

This work was supported by MarWell Bio Inc. No ex-

ternal grants or funding contributed to the completion

of this work. All present and future rights to any in-

tellectual property arising from this work will be the

sole property of MarWell Bio Inc.

CONFLICT OF INTEREST

The authors declare no conﬂict of interest.

REFERENCES

Adams, R. M., Mora, T., Walczak, A. M., and Kinney, J. B.

(2016). Measuring the sequence-afﬁnity landscape

of antibodies with massively parallel titration curves.

Elife, 5:e23156.

Adolf-Bryfogle, J., Kalyuzhniy, O., Kubitz, M., Weitzner,

B. D., Hu, X., Adachi, Y., Schief, W. R., and Dun-

brack Jr, R. L. (2018). Rosettaantibodydesign (rabd):

A general framework for computational antibody de-

sign. PLoS computational biology, 14(4):e1006112.

Das, P., Wadhawan, K., Chang, O., Sercu, T., Santos, C. D.,

Riemer, M., Chenthamarakshan, V., Padhi, I., and

Mojsilovic, A. (2018). Pepcvae: Semi-supervised

targeted design of antimicrobial peptide sequences.

arXiv preprint arXiv:1810.07743.

Friedensohn, S., Neumeier, D., Khan, T. A., Csepregi, L.,

Parola, C., de Vries, A. R. G., Erlach, L., Mason,

D. M., and Reddy, S. T. (2020). Convergent selection

in antibody repertoires is revealed by deep learning.

bioRxiv.

Goceri, E. (2019). Analysis of deep networks with resid-

ual blocks and different activation functions: classiﬁ-

cation of skin diseases. In 2019 Ninth international

conference on image processing theory, tools and ap-

plications (IPTA), pages 1–6. IEEE.

Goldstein, L. D., Chen, Y.-J. J., Wu, J., Chaudhuri, S.,

Hsiao, Y.-C., Schneider, K., Hoi, K. H., Lin, Z., Guer-

rero, S., Jaiswal, B. S., et al. (2019). Massively

parallel single-cell b-cell receptor sequencing enables

rapid discovery of diverse antigen-reactive antibodies.

Communications biology, 2(1):1–10.


Guo, X., Tadepalli, S., Zhao, L., and Shehu, A. (2020).

Generating tertiary protein structures via an inter-

pretative variational autoencoder. arXiv preprint

arXiv:2004.07119.

Gupta, A., Müller, A. T., Huisman, B. J., Fuchs, J. A.,

Schneider, P., and Schneider, G. (2018). Generative

recurrent networks for de novo drug design. Molecu-

lar informatics, 37(1-2):1700111.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-

ual learning for image recognition. In Proceedings of

the IEEE conference on computer vision and pattern

recognition, pages 770–778.

Heinzinger, M., Elnaggar, A., Wang, Y., Dallago, C.,

Nechaev, D., Matthes, F., and Rost, B. (2019). Mod-

eling aspects of the language of life through transfer-

learning protein sequences. BMC bioinformatics,

20(1):1–17.

Jensen, K. K., Andreatta, M., Marcatili, P., Buus, S., Green-

baum, J. A., Yan, Z., Sette, A., Peters, B., and Nielsen,

M. (2018). Improved methods for predicting peptide

binding afﬁnity to mhc class ii molecules. Immunol-

ogy, 154(3):394–406.

Lin, J. (1991). Divergence measures based on the shannon

entropy. IEEE Transactions on Information theory,

37(1):145–151.

Liu, G., Zeng, H., Mueller, J., Carter, B., Wang, Z., Schilz,

J., Horny, G., Birnbaum, M. E., Ewert, S., and Gif-

ford, D. K. (2020). Antibody complementarity de-

termining region design using high-capacity machine

learning. Bioinformatics, 36(7):2126–2133.

Lu, S., Hong, Q., Wang, B., and Wang, H. (2020). Efﬁ-

cient resnet model to predict protein-protein interac-

tions with gpu computing. IEEE Access, 8:127834–

127844.

Mason, D. M., Friedensohn, S., Weber, C. R., Jordi, C.,

Wagner, B., Meng, S., Gainza, P., Correia, B. E., and

Reddy, S. T. (2019). Deep learning enables thera-

peutic antibody optimization in mammalian cells by

deciphering high-dimensional protein sequence space.

bioRxiv, page 617860.

Müller, A. T., Gabernet, G., Hiss, J. A., and Schneider, G.

(2017). modlAMP: Python for antimicrobial peptides.

Bioinformatics, 33(17):2753–2755.

Murphy, K., Travers, P., Walport, M., and Janeway, C.

(2008). Janeway’s Immunobiology - 7th (Seventh) edi-

tion. Garland Science, New York.

Needleman, S. B. and Wunsch, C. D. (1970). A gen-

eral method applicable to the search for similarities

in the amino acid sequence of two proteins. Journal

of molecular biology, 48(3):443–453.

Norman, R. A., Ambrosetti, F., Bonvin, A. M., Colwell,

L. J., Kelm, S., Kumar, S., and Krawczyk, K. (2020).

Computational approaches to therapeutic antibody de-

sign: established methods and emerging trends. Brief-

ings in bioinformatics, 21(5):1549–1567.

Ong, E., Wong, M. U., Huffman, A., and He, Y. (2020).

COVID-19 coronavirus vaccine design using reverse

vaccinology and machine learning. Frontiers in im-

munology, 11:1581.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).

BLEU: a method for automatic evaluation of machine

translation. In Proceedings of the 40th annual meet-

ing of the Association for Computational Linguistics,

pages 311–318.

Rabiner, L. R. (1989). A tutorial on hidden Markov models

and selected applications in speech recognition. Pro-

ceedings of the IEEE, 77(2):257–286.

Raybould, M. I., Kovaltsuk, A., Marks, C., and Deane,

C. M. (2021). CoV-AbDab: the coronavirus antibody

database. Bioinformatics, 37(5):734–735.

Raybould, M. I., Marks, C., Krawczyk, K., Taddese, B.,

Nowak, J., Lewis, A. P., Bujotzek, A., Shi, J., and

Deane, C. M. (2019). Five computational developability

guidelines for therapeutic antibody profiling.

Proceedings of the National Academy of Sciences,

116(10):4025–4030.

Ripoll, D. R., Chaudhury, S., and Wallqvist, A. (2021).

Using the antibody-antigen binding interface to train

image-based deep neural networks for antibody-epitope

classification. PLoS computational biology,

17(3):e1008864.

Sharma, V. K., Patapoff, T. W., Kabakoff, B., Pai, S., Hi-

lario, E., Zhang, B., Li, C., Borisov, O., Kelley, R. F.,

Chorny, I., et al. (2014). In silico selection of ther-

apeutic antibodies for development: viscosity, clear-

ance, and chemical stability. Proceedings of the Na-

tional Academy of Sciences, 111(52):18601–18606.

Sormanni, P., Aprile, F. A., and Vendruscolo, M. (2015).

The CamSol method of rational design of protein mutants

with enhanced solubility. Journal of molecular

biology, 427(2):478–490.

Sormanni, P. and Vendruscolo, M. (2019). Protein solubility

predictions using the CamSol method in the study

of protein homeostasis. Cold Spring Harbor perspec-

tives in biology, 11(12):a033845.

Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., and Liu,

C. (2018). A survey on deep transfer learning. In

International conference on artificial neural networks,

pages 270–279. Springer.

Tsuchiya, Y. and Mizuguchi, K. (2016). The diversity of H3

loops determines the antigen-binding tendencies of

antibody CDR loops. Protein Science, 25(4):815–825.

Valeri, J. A., Collins, K. M., Ramesh, P., Alcantar, M. A.,

Lepe, B. A., Lu, T. K., and Camacho, D. M. (2020).

Sequence-to-function deep learning frameworks for

engineered riboregulators. Nature communications,

11(1):1–14.

Wang, Y., Yadav, P., Magar, R., et al. (2020). Bio-informed

protein sequence generation for multi-class virus mu-

tation prediction. bioRxiv.

Xu, J., McPartlon, M., and Li, J. (2020). Improved protein

structure prediction by deep learning irrespective of

co-evolution information. bioRxiv.

Yoo, D. K., Lee, S. R., Jung, Y., Han, H., Lee, H. K., Han, J.,

Kim, S., Chae, J., Ryu, T., and Chung, J. (2020). Ma-

chine learning-guided prediction of antigen-reactive

in silico clonotypes based on changes in clonal abun-

dance through bio-panning. Biomolecules, 10(3):421.

Zohar, T. and Alter, G. (2020). Dissecting antibody-mediated

protection against SARS-CoV-2. Nature Reviews

Immunology, 20(7):392–394.

BIOINFORMATICS 2022 - 13th International Conference on Bioinformatics Models, Methods and Algorithms
