A free and open-source program that uses deep learning to visualize proteins and the functions of their individual amino acids.
This interactive visualizer uses a neural network to identify functional and structural clusters within a protein by calculating embeddings, which are numerical representations of each amino acid. The high-dimensional embeddings are then projected into 2D so that the possible functional components of the protein are easier to interpret. Next, a graph convolutional network uses both the 3D structure and the amino acid sequence to predict Gene Ontology (GO) annotations that describe the molecular function of the protein. Through Gradient-weighted Class Activation Mapping (GradCAM), the program calculates how much each amino acid contributed to each GO prediction, which helps identify functional residues.
This program was written entirely in Python, using Pyglet as an OpenGL interface.
Disclaimer: All of the data shown in this program (except for the 3D structure) is generated through machine learning and cannot be guaranteed to be accurate. It is recommended to verify the predicted functions against experimental sources.
Download and extract the latest release from the protein-visualizer release page.
Run protein-visualizer-x.x/main.exe and wait for the help terminal to appear.
When the program is loaded, it will prompt you for a protein file in either the Protein Data Bank (.pdb) format or ModelCIF (.cif) format. The RCSB Protein Data Bank is a great source to find proteins to test out with this program. If your file has more than one amino acid chain, the help terminal will prompt you to type the name of the chain you want to render. After some time computing predictions about your protein, the interactive interface will open with your protein ready to view.
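To see which chain names a .pdb file contains before the help terminal asks for one, the ATOM records can be scanned directly. The sketch below is not part of the program: list_chain_ids is a hypothetical helper, it assumes the chain names the program expects match the chain identifier column of the PDB format (column 22 of each ATOM record), and it does not handle .cif files.

```python
def list_chain_ids(pdb_lines):
    """Return chain IDs from PDB-format ATOM records, in order of first appearance.

    Hypothetical helper for illustration; not part of protein-visualizer.
    """
    chains = []
    for line in pdb_lines:
        # The PDB format is fixed-column: the chain identifier is
        # column 22 of an ATOM record, which is index 21 in Python.
        if line.startswith("ATOM") and len(line) > 21:
            chain = line[21].strip()
            # Skip blank identifiers and chains already recorded.
            if chain and chain not in chains:
                chains.append(chain)
    return chains
```

Any iterable of lines works as input, so an open file handle for a .pdb file can be passed in directly.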
Left Mouse: Translate the camera and navigate the embedding space.
Right/Middle Mouse: Rotate the camera.
Up/Down Arrow: Increase/decrease atom point size.
O: Toggles atom outline.
Before viewing a protein, the program caches amino acid embeddings and Gene Ontology (GO) annotations in protein-visualizer-x.x/data/[protein_name]_data.json to reduce load times the next time the program is run with the selected protein.
If you wish to use the prediction data yourself, you can parse the data file. The following information is stored as a dictionary for each protein:
GO_ids: A list of predicted GO IDs.
GO_names: A list of descriptions for each predicted GO ID.
confidence: A number between 0 and 1 denoting the confidence of each GO prediction.
saliency_maps: A matrix where each row contains numbers between 0 and 1 denoting how strongly each amino acid contributed to the prediction of that row's GO annotation.
sequence: The protein's amino acid sequence in the FASTA format (without a header).
embedding_points: A flattened list of (x, y) points denoting the coordinates of each amino acid in the latent space after being transformed into 2D.
cluster_indices: A list of integers denoting which cluster each amino acid belongs to, with -1 meaning no cluster.
You can also generate the data file without opening the GUI by executing the program from the command line:
protein-visualizer-x.x/main.exe [path_to_pdb_file] [optional_chain_id]
If no chain is selected, the program automatically selects one instead of prompting the user as it does when the GUI is opened.
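As a starting point for working with the cached data, the following Python sketch reads a data file and regroups a few of the fields described above. It rests on assumptions that the description does not confirm: that the listed keys sit at the top level of the JSON file, that embedding_points is interleaved as x0, y0, x1, y1, and so on, and that each saliency_maps row holds one value per residue of sequence. All function names are hypothetical.

```python
import json
from pathlib import Path


def load_prediction_data(path):
    """Read a cached data file and return its contents as a dictionary."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def embedding_pairs(data):
    """Regroup the flattened embedding_points list into (x, y) tuples.

    Assumes the list is interleaved as x0, y0, x1, y1, ...
    """
    flat = data["embedding_points"]
    return list(zip(flat[0::2], flat[1::2]))


def residues_by_cluster(data):
    """Map each cluster index to the 0-based residue positions it contains.

    Residues with no cluster are collected under the key -1.
    """
    clusters = {}
    for position, cluster in enumerate(data["cluster_indices"]):
        clusters.setdefault(cluster, []).append(position)
    return clusters


def top_residues(data, go_index, count=5):
    """Return (position, residue letter, saliency) for the residues that
    contributed most to the GO prediction in row go_index.

    Assumes each saliency_maps row has one value per residue of sequence.
    """
    row = data["saliency_maps"][go_index]
    ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    return [(i, data["sequence"][i], row[i]) for i in ranked[:count]]
```

A typical use would be to call load_prediction_data on a file from the data/ folder, then pass the result to top_residues with the index of a GO term of interest from GO_ids. Positions are 0-based indices into sequence and may not match the residue numbering in the original structure file.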
All of the code is available under the MIT license on the protein-visualizer GitHub repository.
Note that in order to run the code, you must download the most recent release of the program and copy the protein-visualizer-x.x/saved_models/ folder into the main directory.
Feel free to take any part of the code to expand or transform. Just make sure you keep the attributions to ProSE and DeepFRI!
deep_learning/: Contains ProSE and DeepFRI code that generates amino acid embeddings and GO annotations, respectively, using the pre-trained models.
docs/: Contains this website.
gui/: Contains code to draw user interface widgets, such as buttons and the amino acid label.
img/: Contains icons used when running the program.
renderers/: Contains code used to draw the 3D protein and the 2D embedding space.
saved_models/: Contains the pre-trained models created by the ProSE and DeepFRI teams. Must be copied from the latest release due to its large size.
batch_predict.py: Example script for using the program from the command line.
main.py: Entry point to the program.
protein.py: Contains classes that parse and organize data about a protein.
Protein Sequence Embeddings (ProSE): Amino acid embedding code and pre-trained models.
Deep Functional Residue Identification (DeepFRI): GO annotation prediction code and pre-trained models.