protein-visualizer

A free and open-source program that uses deep learning to visualize proteins and the function of their individual amino acids.

Demonstration of running the visualizer

How does it work?

This interactive visualizer uses a neural network to identify functional and structural clusters within a protein by calculating embeddings, which are numerical representations of each amino acid. These high-dimensional embeddings are then projected into 2D so that the possible functional components of a protein are easier to interpret. Next, a graph convolutional network uses both the 3D structure and the amino acid sequence to predict Gene Ontology (GO) annotations that describe the molecular function of the protein. Through Gradient-weighted Class Activation Mapping (GradCAM), the program calculates how much each amino acid contributed to each GO prediction, which makes it possible to identify functional residues.
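The GradCAM step can be illustrated on toy numbers. The sketch below follows the general Grad-CAM formulation (average the gradients per feature channel, take the weighted sum of activations at each residue, apply ReLU) and is not taken from the program's code. The function name `grad_cam_scores`, the list-of-lists layout, and the final rescaling to [0, 1] are assumptions made for illustration.

```python
from typing import List


def grad_cam_scores(activations: List[List[float]],
                    gradients: List[List[float]]) -> List[float]:
    """Per-residue Grad-CAM scores for a single predicted GO term.

    activations[i][k]: activation of feature channel k at residue i.
    gradients[i][k]:   gradient of the GO term's score with respect to
                       that activation.
    Both layouts are illustrative assumptions, not the program's format.
    """
    n_residues = len(activations)
    n_channels = len(activations[0])

    # Channel weight alpha_k: the gradient averaged over all residues.
    alphas = [
        sum(gradients[i][k] for i in range(n_residues)) / n_residues
        for k in range(n_channels)
    ]

    # Weighted sum of channels at each residue, followed by ReLU so that
    # only positive contributions to the prediction are kept.
    raw = [
        max(0.0, sum(alphas[k] * activations[i][k] for k in range(n_channels)))
        for i in range(n_residues)
    ]

    # Rescale to [0, 1] so residues can be compared or color-mapped.
    peak = max(raw)
    return [value / peak for value in raw] if peak > 0 else raw
```

Residues with scores near 1 are the ones that contributed most to the prediction under this formulation; residues at 0 contributed nothing or pushed against it.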

This program was written entirely in Python, using Pyglet as an OpenGL interface.

Disclaimer: All of the data (except for 3D structure) shown in this program is generated through machine learning and thus cannot be verified to be completely accurate. It is recommended to double-check the predicted functions against experimental sources.

Usage

Installation/Running

Download and extract the latest release from the protein-visualizer release page. Run protein-visualizer-x.x/main.exe and wait for the help terminal to appear.

Input

When the program is loaded, it will prompt you for a protein file in either the Protein Data Bank (.pdb) format or ModelCIF (.cif) format. The RCSB Protein Data Bank is a great source to find proteins to test out with this program. If your file has more than one amino acid chain, the help terminal will prompt you to type the name of the chain you want to render. After some time computing predictions about your protein, the interactive interface will open with your protein ready to view.
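If you want to know which chain names a .pdb file contains before the help terminal asks, they can be read from the file directly. This is a minimal sketch that relies only on the fixed-column PDB format, where the chain identifier of an ATOM record sits in column 22. It does not cover .cif files and is not how the program itself reads structures; `chain_ids_in_pdb` is a hypothetical helper name.

```python
from typing import Iterable, List


def chain_ids_in_pdb(lines: Iterable[str]) -> List[str]:
    """Return chain IDs from PDB ATOM records, in order of first appearance."""
    seen: List[str] = []
    for line in lines:
        # Column 22 (index 21) of an ATOM record holds the chain identifier.
        if line.startswith("ATOM") and len(line) > 21:
            chain = line[21]
            if chain != " " and chain not in seen:
                seen.append(chain)
    return seen


# Example usage (the file name is a placeholder):
# with open("my_protein.pdb", encoding="utf-8") as handle:
#     print(chain_ids_in_pdb(handle))
```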

Controls

Using Generated Data

Before viewing a protein, the program caches amino acid embeddings and Gene Ontology (GO) annotations in protein-visualizer-x.x/data/[protein_name]_data.json to reduce load times the next time the program is run with the selected protein.

If you wish to use the prediction data yourself, you can parse through the data file. The following information is stored as a dictionary for each protein:
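As a starting point for parsing, the sketch below loads a cached data file and reports which top-level keys it contains and what type each value has, without assuming any particular key names. It assumes the file is plain JSON with a dictionary at the top level; `load_protein_data` and `summarize_top_level` are hypothetical helper names, not part of the program.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_protein_data(data_dir: str, protein_name: str) -> Dict[str, Any]:
    """Load [protein_name]_data.json from the given data directory."""
    path = Path(data_dir) / f"{protein_name}_data.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize_top_level(data: Dict[str, Any]) -> Dict[str, str]:
    """Map each top-level key to the type name of its value."""
    return {key: type(value).__name__ for key, value in data.items()}


# Example usage (directory and protein name are placeholders):
# data = load_protein_data("protein-visualizer-x.x/data", "my_protein")
# for key, kind in summarize_top_level(data).items():
#     print(f"{key}: {kind}")
```

Inspecting the structure first avoids hard-coding key names that may change between releases.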

You can also generate the data file without opening the GUI by executing the program from the command line: protein-visualizer-x.x/main.exe [path_to_pdb_file] [optional_chain_id]. If no chain is given, the program selects one automatically instead of prompting you as it does when the GUI is opened.
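To generate data files for several proteins in a row, the command-line form can be driven from a script. The sketch below only assembles the two documented positional arguments (structure path, then optional chain ID) and hands them to `subprocess.run`. The helper names `build_command` and `generate_data` are hypothetical, and the executable's exit-code behavior is an assumption.

```python
import subprocess
from typing import List, Optional


def build_command(exe_path: str, structure_path: str,
                  chain_id: Optional[str] = None) -> List[str]:
    """Assemble the argument list: executable, structure file, optional chain."""
    command = [exe_path, structure_path]
    if chain_id is not None:
        command.append(chain_id)
    return command


def generate_data(exe_path: str, structure_path: str,
                  chain_id: Optional[str] = None) -> None:
    """Run the visualizer headlessly to produce the cached data file."""
    # check=True raises CalledProcessError if the process exits with a
    # nonzero status (assuming the executable reports failures that way).
    subprocess.run(build_command(exe_path, structure_path, chain_id), check=True)


# Example usage (paths are placeholders):
# generate_data("protein-visualizer-x.x/main.exe", "my_protein.pdb", "A")
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when a path contains spaces.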

Code

All of the code is available under the MIT license on the protein-visualizer GitHub repository. Note that in order to run the code, you must download the most recent release of the program and copy the protein-visualizer-x.x/saved_models/ folder into the main directory. Feel free to take any part of the code to expand or transform; just make sure you keep the attributions to ProSE and DeepFRI!

File Structure

Attribution