Documentation

1. Introduction

Protein-chemical interactions are rarely viewed from a structural perspective. However, the network of structural interactions, in which nodes represent proteins or small molecules and edges represent resolved 3D structures of protein-ligand complexes, is a rich source of new hypotheses about these interactions. We have recently published a method to predict novel protein-chemical interactions by superimposing known 3D structures (Kalinina et al, PLoS Comp Biol, 2011, 7(5): e1002043). The underlying premise is that if two proteins share a common ligand, and one of the proteins is known to bind another ligand, the 3D structures of the protein-ligand complexes can be superimposed to build and evaluate a model of the complex of the second protein with that other ligand. ProtChemSI provides these data and extends the approach in several directions: we additionally construct models for all interactions with molecules similar to known interaction partners of a protein or chemical of interest, and we provide functionality to traverse the network of interactions and assess the possibility of building a model of any protein with any chemical of interest.

2. Data organization

ProtChemSI is based on two major notions: Molecules and Links.

2.1. Molecules are proteins and chemicals.

2.1.1. Proteins are identified by their UniProt IDs. For each protein, the following information is stored, extracted from UniProt:

  • protein name;
  • name synonyms;
  • organism name;
  • taxonomy.
Each protein is linked out to UniProt.

2.1.2. Chemicals are identified by their PubChem IDs. For each chemical, the following information is stored, extracted from PubChem:

  • chemical name;
  • list of name synonyms.
Each chemical is linked out to PubChem.

2.2. Links can be of three types: between a protein and a chemical; between two proteins; between two chemicals. All links have a weight assigned to them.

2.2.1. A protein and a chemical are considered to be linked if the Protein Data Bank contains a complex of the two, experimentally resolved by X-ray or NMR. All these links have a weight of 1.

2.2.2. Two proteins are considered to be linked if one of them is found as a hit in a BLAST search for the other protein with e-value < 0.01 and identity > 30%. These links are assigned a weight of % identity / 100.

2.2.3. Two chemicals are considered to be linked if they have a Tanimoto score > 0.9 in a comparison using PubChem fingerprints. These links are assigned a weight equal to the Tanimoto score of the two chemicals.

The weights are used when two query molecules from the network are connected by a shortest path in order to build a model: the path is chosen to minimize the sum of (1 - weight) over its links, so that paths made of high-weight links are preferred.
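The weighting rules of section 2.2 can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of ProtChemSI; the fingerprints are represented here simply as sets of "on" bit positions, which is an assumption made for illustration rather than the PubChem fingerprint format itself.

```python
from typing import Optional, Set


def protein_chemical_weight() -> float:
    """An experimentally resolved complex always gets weight 1."""
    return 1.0


def protein_protein_weight(evalue: float, identity_pct: float) -> Optional[float]:
    """Return % identity / 100 if the BLAST hit passes both thresholds, else None."""
    if evalue < 0.01 and identity_pct > 30.0:
        return identity_pct / 100.0
    return None


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto score of two fingerprints given as sets of 'on' bit positions."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def chemical_chemical_weight(bits_a: Set[int], bits_b: Set[int]) -> Optional[float]:
    """Return the Tanimoto score if it exceeds 0.9, else None (no link)."""
    score = tanimoto(bits_a, bits_b)
    return score if score > 0.9 else None
```

A return value of None means that the two molecules are not linked under the corresponding rule.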

3. Sources

The data on experimentally resolved complexes are downloaded from the Protein Data Bank monthly. Protein and chemical assignments, as well as the additional information on them, are downloaded from UniProt and PubChem, respectively. The list of chemicals is manually checked in order to keep solvents and buffer components out of the database. Chlorophyll, heme and other porphyrins are also removed, as they represent a very specific case of protein-ligand interaction and could obscure the more generic interactions made by smaller ligands.

4. Automatically generated data

For each protein-chemical pair, all Protein Data Bank entries, and all instances within these entries in which the protein and the chemical are in contact, are extracted and stored separately as contacting pairs. For each pair of contacting pairs that share a common protein or chemical, or whose proteins or chemicals are connected by a link, a transformation matrix that brings the common component into the same frame of reference is computed. STAMP is used to superimpose proteins, and PINTS to superimpose chemicals.

The model complex structures are built interactively for each query and stored in the cache for two weeks. For a query consisting of a single protein/ligand, three types of models are generated.

4.1. For a protein query, models with all chemicals that are similar to those bound by it. For a chemical query, models with all proteins that are similar to the ones that bind it.

4.2. For a protein query, models with all chemicals that can be bound by proteins similar to it. For a chemical query, models with all proteins that bind chemicals similar to it.

4.3. For a protein query, models with all chemicals that can be bound by a protein that shares a binding partner with the query. For a chemical query, models with all proteins that bind a chemical bound by a protein that also binds the query. These models are termed 1-step superimpositions.
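As an illustration of how candidates for 1-step superimpositions might be enumerated for a protein query, here is a minimal Python sketch. It assumes the known complexes are available as (protein, chemical) pairs; the function name is hypothetical, and excluding chemicals already known to bind the query is a choice made for this example, not something stated above.

```python
from collections import defaultdict


def one_step_chemicals(complexes, query_protein):
    """Chemicals bound by any protein that shares a binding partner with the query.

    complexes: iterable of (protein_id, chemical_id) pairs of known complexes.
    Chemicals already known to bind the query are left out (an assumption).
    """
    by_protein = defaultdict(set)
    by_chemical = defaultdict(set)
    for protein, chemical in complexes:
        by_protein[protein].add(chemical)
        by_chemical[chemical].add(protein)

    known = by_protein.get(query_protein, set())
    candidates = set()
    for shared in known:
        for other in by_chemical[shared]:
            if other != query_protein:
                candidates |= by_protein[other]
    return candidates - known
```

The chemical query case is symmetric: swap the roles of the two dictionaries.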

For a query consisting of a protein-chemical pair, the shortest path in the network is computed as the path that minimizes Σ(1 - weight), where the sum is taken over all links in the path. Then all possible models are built by sequential superimposition of common or similar complex components.
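The path search can be sketched with Dijkstra's algorithm, using 1 - weight as the cost of each link. This is a minimal illustration under the assumption that links are given as (molecule, molecule, weight) triples with weights between 0 and 1; the function name and the input layout are hypothetical, and the algorithm actually used by ProtChemSI is not specified here.

```python
import heapq


def best_path(links, source, target):
    """Return (total_cost, path) minimizing the sum of (1 - weight), or None."""
    # Build an undirected graph whose edge cost is 1 - weight.
    graph = {}
    for a, b, weight in links:
        cost = 1.0 - weight
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, step in graph.get(node, []):
            new_cost = cost + step
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(queue, (new_cost, neighbour))

    if target not in best:
        return None
    # Walk back from the target to recover the path.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[target], path[::-1]
```

Because protein-chemical links have weight 1, they cost nothing under this function, so the cost of a path is determined by its similarity links.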

Each model is scored using the statistics described in (Kalinina et al, PLoS Comp Biol, 2011, 7(5): e1002043). Briefly, a number of physical and chemical parameters of the complex are calculated and combined into a score from 0 to 7, with 7 being the best model. Models with a p-value below 0.05 (which corresponds to a score of 5.6), i.e. having a 5% or less chance of being randomly generated, are highlighted.

5. Website

The index page of the ProtChemSI website contains a search form for proteins or chemicals. Proteins can be searched by protein name, UniProt ID or sequence; chemicals can be searched by ligand name, PubChem ID or SMILES representation. There is also a link to a browsing view of the database, where entries can be grouped by protein, by chemical, or by complex.

The information for a hit molecule can be viewed by clicking the Interactions link on the search results page. Both the known complexes and the predicted models are shown.

After viewing information for a single molecule, the user may select the Find shortest path option and will be presented with a search form for the second molecule from the network. This allows the user to reconstruct and evaluate models of complexes of these two molecules.

6. Downloads

The database can be downloaded as a series of flatfiles. The flatfiles are tab-delimited and include the following fields.

6.1. Proteins.txt

  • mol_id: internal database identifier;
  • real_id: UniProt ID;
  • name: protein name;
  • syn: protein name synonyms;
  • organism: source organism;
  • taxonomy: full taxonomy.

6.2. Chemicals.txt

  • mol_id: internal database identifier;
  • real_id: PubChem Compound ID;
  • name: chemical name;
  • syn: chemical name synonyms.

6.3. Links.txt

  • link_id: internal database identifier of a link between two molecules. Note that link_id is not unique in the table, but is unique for a pair of protein and chemical mol_id values;
  • link_type: 0 for protein-chemical complexes, 1 for similarity pairs of two proteins, 2 for similarity pairs of two chemicals;
  • weight: link weight, 1 for protein-chemical complexes, % identity / 100 for similarity pairs of two proteins, Tanimoto score for similarity pairs of two chemicals;
  • cont_id: identifier for a contacting pair, unique;
  • pdb_id: identifier of an entry in The Protein Data Bank (this and the fields below are blank for links of type 1 and 2);
  • lig_name: ligand name as in the entry from The Protein Data Bank;
  • prot_chain: protein chain name in the entry from The Protein Data Bank;
  • prot_from: number of the first residue in the chain in the entry from The Protein Data Bank;
  • prot_to: number of the last residue in the chain in the entry from The Protein Data Bank;
  • chem_chain: chain for the chemical in the entry from The Protein Data Bank;
  • chem_num: residue number for the chemical in the entry from The Protein Data Bank.
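A minimal Python sketch for reading Links.txt follows. It assumes that every row holds exactly the fields listed above, in that order, with no header line; this layout is an assumption, so the fields argument should be adjusted if the downloaded file differs (for example, if it has a header or extra columns). The function name and the sample rows are made up for illustration.

```python
import csv
import io

# Assumed column order: the order in which the fields are documented above.
LINK_FIELDS = [
    "link_id", "link_type", "weight", "cont_id", "pdb_id", "lig_name",
    "prot_chain", "prot_from", "prot_to", "chem_chain", "chem_num",
]


def read_links(handle, fields=LINK_FIELDS):
    """Yield one dict per row of a tab-delimited links file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        # Pad short rows: PDB-related fields are blank for link types 1 and 2.
        row = row + [""] * (len(fields) - len(row))
        record = dict(zip(fields, row))
        record["link_type"] = int(record["link_type"])
        record["weight"] = float(record["weight"])
        yield record


# Usage with two invented rows; a real file would be opened with open(...).
sample = io.StringIO("1\t0\t1\t10\t1abc\tATP\tA\t1\t300\tA\t401\n2\t1\t0.45\t11\n")
complex_links = [r for r in read_links(sample) if r["link_type"] == 0]
```

Filtering on link_type == 0 keeps only the experimentally resolved protein-chemical complexes.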

6.4. Transformations.txt

  • trans_id: internal database identifier;
  • cont_id1, cont_id2: identifiers of the two contacting pairs from Links.txt involved in the transformation;
  • trans_type: 0 if the proteins are superimposed, 1 if chemicals;
  • score: score reflecting the quality of the superimposition. STAMP score for protein superimpositions (0 to 10, the greater the better), all-atom RMSD for chemicals (the smaller the better);
  • m01, m02, m03, m04, m05, m06, m07, m08, m09, m10, m11, m12: transformation matrix, where
    | m01 m02 m03 |
    | m05 m06 m07 |
    | m09 m10 m11 |
    is the rotation matrix and (m04, m08, m12) is the translation vector.
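
To show how the twelve values might be applied to atomic coordinates, here is a minimal Python sketch. It assumes the common convention x' = R·x + t (rotation first, then translation, with coordinates treated as column vectors); this convention is an assumption and should be checked against the superimposition program that produced the matrix. The function name is hypothetical.

```python
def apply_transformation(m, point):
    """Apply a transformation to one 3D point.

    m: sequence of 12 numbers in the order m01..m12.
    point: (x, y, z) coordinates.
    Assumes x' = R*x + t with R and t laid out as described above.
    """
    x, y, z = point
    return (
        m[0] * x + m[1] * y + m[2] * z + m[3],
        m[4] * x + m[5] * y + m[6] * z + m[7],
        m[8] * x + m[9] * y + m[10] * z + m[11],
    )
```

For example, a 90-degree rotation about the z axis followed by a shift of (1, 2, 3) corresponds to m = [0, -1, 0, 1, 1, 0, 0, 2, 0, 0, 1, 3].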