This is the reference (TensorFlow) implementation of recurrent geometric networks (RGNs), described in the paper End-to-end differentiable learning of protein structure.
Extract all files in the model directory in a single location and use protling.py, described further below, to train new models and predict structures. Below are the language requirements and package dependencies:
- Python 2.7
- TensorFlow >= 1.4 (tested up to 1.12)
- setproctitle
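As a quick sanity check before training, the TensorFlow requirement above can be tested against an installed version string. The following is a minimal sketch, not part of the RGN codebase; the function names `parse_version` and `tensorflow_version_status` are hypothetical, and the three status labels are an assumption about how to read ">= 1.4 (tested up to 1.12)".

```python
def parse_version(text):
    """Turn a version string such as '1.12.0' into (major, minor)."""
    parts = text.split(".")
    return int(parts[0]), int(parts[1])


def tensorflow_version_status(version_text):
    """Classify a TensorFlow version against the range listed above.

    Assumption: versions below 1.4 are too old, 1.4 through 1.12 are
    tested, and anything newer is simply untested.
    """
    version = parse_version(version_text)
    if version < (1, 4):
        return "too old"
    if version <= (1, 12):
        return "tested"
    return "untested"
```

With TensorFlow installed, the string to check is `tf.__version__`, for example `tensorflow_version_status(tf.__version__)` after `import tensorflow as tf`.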
The protling.py script facilitates training of, and prediction using, RGN models. Below are typical use cases. The script also accepts a number of command-line options whose functionality can be queried using the --help option.
RGN models are described using a configuration file that controls hyperparameters and architectural choices. For a list of available options and their descriptions, see its documentation. Once a configuration file has been created, along with a suitable dataset (download a ready-made ProteinNet dataset or create a new one from scratch using the convert_to_tfrecord.py script), the following directory structure must be created:
<baseDirectory>/runs/<runName>/<datasetName>/<configurationFile>
<baseDirectory>/data/<datasetName>/[training,validation,testing]
The first path points to the configuration file and the second path to the directories containing the training, validation, and possibly test sets. Note that <runName> and <datasetName> are user-defined variables specified in the configuration file that encode the name of the model and dataset, respectively.
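The layout above can be created by hand or with a few lines of Python. The sketch below is one possible way to do it, not a script shipped with RGN; `make_rgn_layout`, `myRun`, and `myDataset` are hypothetical names, and the demo writes into a throwaway temporary directory rather than a real base directory.

```python
import os
import tempfile


def make_rgn_layout(base_dir, run_name, dataset_name):
    """Create the run and data directories; return the run directory."""
    run_dir = os.path.join(base_dir, "runs", run_name, dataset_name)
    data_dir = os.path.join(base_dir, "data", dataset_name)
    wanted = [run_dir]
    for subset in ("training", "validation", "testing"):
        wanted.append(os.path.join(data_dir, subset))
    for path in wanted:
        # os.makedirs raises if the path already exists, so check first.
        if not os.path.isdir(path):
            os.makedirs(path)
    return run_dir


# Demo with placeholder names in a temporary base directory.
demo_base = tempfile.mkdtemp()
demo_run_dir = make_rgn_layout(demo_base, "myRun", "myDataset")
```

The configuration file then goes into the returned run directory, and the dataset files into the matching subdirectories under `data`. The run and dataset names passed here should match the ones specified in the configuration file.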
Training of a new model can then be invoked by calling:
python protling.py [configurationFilePath] -d [baseDirectory]
Download a pre-trained model for an example of a correctly defined directory structure. Note that ProteinNet training sets come in multiple "thinnings" and only one should be used at a time by placing it in the main training directory.
To resume training an existing model, run the command above for a previously trained model with saved checkpoints.
To predict the structure of a new protein using an existing model with a saved checkpoint, call:
python protling.py [configFilePath] -d [baseDirectory] -p
This predicts the structures of the dataset specified in the configuration file. By default only the validation set is predicted, but this can be changed using the -e option.
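When training and prediction are launched from another script, it can help to build the two command lines shown above in one place. This is a minimal sketch under the assumption that `python` on the path is the Python 2.7 interpreter and that protling.py is in the working directory; `protling_command` is a hypothetical helper, and only the `-d` and `-p` flags documented above are used.

```python
def protling_command(config_path, base_dir, predict=False):
    """Build the argument list for a training or prediction run."""
    command = ["python", "protling.py", config_path, "-d", base_dir]
    if predict:
        # -p switches the script from training to prediction.
        command.append("-p")
    return command
```

The resulting list can be passed to `subprocess.call`, for example `subprocess.call(protling_command(configPath, baseDir, predict=True))`, where `configPath` and `baseDir` stand in for your own paths.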
Below we make available pre-trained RGN models using the ProteinNet 7-12 datasets as checkpointed TF graphs. These models are identical to the ones used in reporting results in the bioRxiv preprint, except for the CASP11 model, which is slightly different due to using a newer codebase.
CASP7 | CASP8 | CASP9 | CASP10 | CASP11 | CASP12 |
---|---|---|---|---|---|
To train new models from scratch using the same hyperparameter choices as the above models, use the appropriate configuration file from here.
The reference RGN implementation is currently only available in TensorFlow; however, the OpenProtein project has implementations of various aspects of the RGN model in PyTorch.
End-to-end differentiable learning of protein structure, bioRxiv 2018