Structural Bioinformatics and Protein Design
(Gerd Krause)
Summary of Databases
db=database
Nucleic Acid Databases
- db 'Genomes': possibility to search complete genomes of mammals, eukaryotes, chordates...
- SRS to query all available dbs at the EBI
- Simple Sequence Retrieval by accession number
- has been functioning as the international nucleotide sequence database in collaboration with EBI/EMBL and NCBI/GenBank
- provides many search and alignment tools, such as SRS, FASTA, BLAST, PSI-BLAST, ClustalW, malign...
Protein Databases
- contains 3D biological macromolecular structure data
- search by PDB id, text
- PSD: protein sequence database of functionally annotated protein sequences
- iProClass: contains descriptions of protein family, function + structure for PSD/Swissprot-TrEMBL sequences
- The iProClass is an integrated resource that provides comprehensive family relationships and structural/functional features of proteins
- PIR-NREF: db for sequence searching and protein identification of sequences from PSD, Swissprot, TrEMBL, RefSeq, GenPept + PDB
- Sequence/text searches and sequence alignment search possible
- The PIRSF protein classification system is a network with multiple levels of sequence diversity from superfamilies to subfamilies that reflects the evolutionary relationship of full-length proteins and domains. The primary PIRSF classification unit is the homeomorphic family, whose members are both homologous (evolved from a common ancestor) and homeomorphic (sharing full-length sequence similarity and a common domain architecture)
- SwissProt is currently cross-referenced with about 60 dbs
- for each sequence entry, the core data consist of the sequence data, the citation information (bibliographical references) and the taxonomic data (description of the biological source of the protein), while the annotation describes the following items:
- Function(s) of the protein
- Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.
- Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, homeobox, kringle, etc.
- Secondary structure
- Quaternary structure. For example homodimer, heterotrimer, etc.
- Similarities to other proteins
- Disease(s) associated with deficiency(ies) in the protein
- Sequence conflicts, variants, etc.
Nucleic Acid + Protein Databases
- has different databases like nucleotide, protein, structure, genome, 3D domain...
- contains sequence collections including GenBank, RefSeq and PDB
- clusters of orthologous groups of proteins (COGs) were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain
- text and protein/gene name searches are possible
- NCBI's structure database is called MMDB (Molecular Modeling DataBase), and it is a subset of three-dimensional structures obtained from the Protein Data Bank (PDB), excluding theoretical models
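The COG criterion described above (each group must contain proteins from at least 3 lineages) can be illustrated with a minimal sketch. The function name, the protein identifiers and the lineage labels are hypothetical toy values, and the candidate groups are assumed to be given already; the actual COG construction from genome-wide sequence comparisons is not reproduced here.

```python
MIN_LINEAGES = 3  # threshold stated in the text: "at least 3 lineages"


def filter_cogs(candidate_groups, lineage_of):
    """Keep only candidate groups whose members span >= MIN_LINEAGES lineages.

    candidate_groups: dict mapping a group id to a list of protein ids
    lineage_of:       dict mapping a protein id to its lineage label
    Paralogs from the same lineage count only once.
    """
    cogs = {}
    for group_id, proteins in candidate_groups.items():
        lineages = {lineage_of[p] for p in proteins}
        if len(lineages) >= MIN_LINEAGES:
            cogs[group_id] = sorted(lineages)
    return cogs


# Hypothetical toy data (not taken from any real database)
lineage_of = {
    "a1": "lineage_A", "a2": "lineage_A",  # a1 and a2 are paralogs
    "b1": "lineage_B",
    "c1": "lineage_C",
    "d1": "lineage_D",
}
candidate_groups = {
    "group_1": ["a1", "a2", "b1", "c1"],  # 3 lineages: kept
    "group_2": ["a1", "a2", "d1"],        # 2 lineages: rejected
}

cogs = filter_cogs(candidate_groups, lineage_of)
```

Counting distinct lineages rather than proteins is the point of the sketch: a group with many paralogs from one genome does not qualify on its own.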
Protein family resources
- db of protein domain classification
- is an automatic algorithm for domain decomposition and clustering of all protein domain families
- search by sequence identifiers, text, sequence
- method of dissecting proteins into domain-like fragments based on sequence homology
- upon user submission of a protein sequence, CHOP will analyse the protein for its homology to PDB domains, Pfam domains and SwissProt proteins. It will then return an e-mail to the user about the putative domain assignments
- db of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences
- search by name, abstract, method name, method accession and InterPro entry accession
- collection of multiple sequence alignments and hidden Markov models covering protein domains and families
- you can look at multiple alignments, view protein domain architectures, examine species distribution, view protein structures
- search by protein name/sequence, keyword, DNA sequence, domain
- compendium of fingerprints (a fingerprint is a group of conserved motifs used to characterize a protein family)
- search possibilities are accession number, text, sequence, title, author, motif number, BLAST
- is a comprehensive set of protein domain families automatically generated from the SwissProt and TrEMBL sequence databases
- for complete genomes: For each ProDom-CG family with more than 2 domains, we compute an evolutionary tree, with a prediction about the node in which the domain could have appeared. The results are graphically presented
- search whole db, complete genomes only or structural genomics
- automatic hierarchical classification of protein sequences
- ProtoNet classification hierarchically partitions the protein space into clusters of similar proteins. The lower a cluster is situated in its tree, the smaller it is and the more similar its proteins are to each other. Browsing the clustering hierarchy can provide insight into the function and structure of proteins
- allows the identification and annotation of genetically mobile domains and analysis of domain architectures
- normal SMART: db contains SwissProt, SwissProt/TrEMBL and stable Ensembl proteomes
- genomic SMART: only the proteomes of completely sequenced genomes are used (Ensembl for metazoans and SwissProt for the rest)
- search: accession number, protein sequence, domain, GO terms, text
- a collection of graph-based algorithms to hierarchically partition a large set of protein sequences into homologous families and superfamilies. The methods unified now under the name SYSTERS (short for SYSTEmatic Re-Searching) are based on an all-against-all database search (using Smith-Waterman comparisons)
- select protein family by protein accession id, keyword, organism, Pfam domain, protein family id, gene name
- collection of protein families based on hidden Markov models
- tool for identifying functionally related proteins based on sequence homology
- search by sequence or text
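The notion of a fingerprint mentioned above (a group of conserved motifs that together characterize a protein family) can be sketched as follows. This is a deliberately simplified illustration: the motifs are modelled as regular expressions that must all occur in the given order, whereas real fingerprint databases use more elaborate scoring. The function name, the motif patterns and the sequences are hypothetical.

```python
import re


def match_fingerprint(sequence, motifs):
    """Return the start position of every motif if all motifs occur
    in the given order without overlapping, otherwise None."""
    positions = []
    search_from = 0
    for motif in motifs:
        hit = re.compile(motif).search(sequence, search_from)
        if hit is None:
            return None  # one missing motif means no fingerprint match
        positions.append(hit.start())
        search_from = hit.end()  # the next motif must follow this one
    return positions


# Hypothetical two-motif fingerprint and toy sequences
fingerprint = ["G.GK[ST]", "DE.H"]

full_match = match_fingerprint("MAGSGKTLLDEAHQ", fingerprint)
partial = match_fingerprint("MAGSGKTLLQ", fingerprint)     # second motif absent
wrong_order = match_fingerprint("DEAHGSGKT", fingerprint)  # motifs reversed
```

Requiring all motifs together is what distinguishes a fingerprint from a single-motif pattern: a sequence that carries only one of the motifs is not assigned to the family.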
Protein structure family resources
- hierarchical classification of protein domain structures, which clusters proteins at four major levels: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H). Class, derived from secondary structure content, is assigned automatically for more than 90% of protein structures. Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually. The topology level clusters structures according to their topological connections and numbers of secondary structures. The homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons
- search by PDB code, CATH code, text
- db and tools for 3D protein structure comparison and alignment using the combinatorial extension (CE) method
- search by
- structural alignments by selecting from all or representatives from the PDB
- calculate structural alignment for 2 chains either from the PDB or uploaded by the user
- calculate structural neighbours for one protein uploaded by the user against the PDB or calculate multiple structure alignments
- dictionary of homologous superfamilies
- provides structural and functional annotations of domains within each superfamily in CATH
- search by superfamily
- HOMologous STRucture Alignment Database is a curated database of structure-based alignments for homologous protein families. All known protein structures are clustered into homologous families (i.e., common ancestry), and the sequences of representative members of each family are aligned on the basis of their 3D structures
- db provides annotated structural alignments in various formats, superimposed structures, links to other dbs and the alignment
- search by text, BLAST, families
- structural classification of proteins
- nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. The SCOP database, created by manual inspection and abetted by a battery of automated methods, aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known. As such, it provides a broad survey of all known protein folds, detailed information about the close relatives of any particular protein, and a framework for future research and classification
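The four-level hierarchy described earlier in this section (Class, Architecture, Topology, Homologous superfamily) lends itself to a small data-structure sketch. It assumes that a CATH code is written as four dot-separated numbers, one per level; the helper names and the example codes are illustrative only and say nothing about what the codes denote in the database.

```python
from typing import NamedTuple

LEVELS = ["Class", "Architecture", "Topology", "Homologous superfamily"]


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    superfamily: int


def parse_cath_code(code):
    """Split a dotted code such as '3.40.50.300' into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*map(int, parts))


def deepest_shared_level(a, b):
    """Name of the deepest level at which two codes still agree, or None."""
    deepest = None
    for name, x, y in zip(LEVELS, a, b):
        if x != y:
            break
        deepest = name
    return deepest


# Example codes used purely to show the format
first = parse_cath_code("3.40.50.300")
second = parse_cath_code("3.40.50.720")
shared = deepest_shared_level(first, second)
```

Because the levels are nested, two domains that agree at a deeper level necessarily agree at all levels above it, which is why the comparison stops at the first mismatch.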
Protein-Protein Interaction
- Biomolecular Interaction Network Database (BIND) is a collection of records documenting molecular interactions, complexes and pathways
- search by PubMed id, PDB id, GO id, text
- catalogs experimentally determined interactions between proteins
- once the initial protein is found through keyword or sequence searches the interaction network can be explored by interactively following the interaction links
- search by BLAST, Prosite ID or user-defined regular expression for interactions described in selected articles
- provides assignments of gene products to the gene ontology (GO) resource
- GO is a controlled vocabulary that can be applied to all organisms even while knowledge of gene and protein roles in cells is still changing
- in GOA this vocabulary will be applied to proteins described in UniProt and Ensembl
- provides functional information for proteins, e.g. you can find all proteins involved in apoptosis (GO:0006915)
- QuickGO is a browser for the GO data
- search by name-value pair, GO term/id, UniProt id/keyword, InterPro id, text, wildcard
- Human Protein Reference Database represents a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome
- search by accession number, protein name, gene symbol, PTM, cellular component, domain, motif, expression, molecular weight, protein sequence length, disease
- MINT focuses on experimentally verified protein interactions with special emphasis on proteomes from mammalian organisms
- search proteins by protein/gene name, text, InterPro domain, PDB id, GO term
- search interactions by reference, protein/gene name, text, SwissProt id, InterPro domain, PDB id, GO term
- Protein-Protein Interaction Database
- db unifies molecular entries across three species (human, rat and mouse) and is based on sequence databases such as SwissProt, EMBL, TrEMBL (translated EMBL sequences) and Unigene, and on the literature database PubMed
- search by PPID id, protein name/type, drug, diseases
- db of small molecule - domain interactions determined from MMDB records
- is essentially a "listing" of all small molecules that have been shown to bind to any given conserved protein domain. Along with simple small molecule-domain pairwise interactions, detailed information pertaining to the small molecule and protein is stored
- SMID Genomes offers a simple interface to browse, search or compare predicted small molecule interactions in an organism-specific or cross-genome manner
- search by protein id, domain id, PDB id, text, BLAST
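Exploring an interaction network by following interaction links from an initial protein, as described earlier in this section, amounts to a graph traversal. The sketch below uses a breadth-first search over an undirected list of pairwise interactions; the function name and the protein identifiers are hypothetical, and no real database interface is implied.

```python
from collections import deque


def explore(interactions, start, max_steps):
    """Breadth-first walk over pairwise interactions.

    Returns a dict mapping every protein reachable within max_steps
    links to its number of links from the start protein.
    """
    # Build an undirected adjacency map from the interaction pairs
    neighbours = {}
    for a, b in interactions:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_steps:
            continue  # do not follow links beyond the step limit
        for partner in sorted(neighbours.get(current, ())):
            if partner not in distance:
                distance[partner] = distance[current] + 1
                queue.append(partner)
    return distance


# Hypothetical toy network
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P1", "P5")]
within_two = explore(interactions, "P1", 2)
```

Limiting the number of steps mirrors interactive browsing, where a user typically inspects direct partners first and then the partners of those partners.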
