Guide to EcoCyc
Contents
3 The Roles of EcoCyc in Microbial Genome Annotation
4 Conditions of E. coli Growth and Non-Growth
8 Data Sources Incorporated into EcoCyc
9 EcoCyc Accession Numbers
10 Other E. coli and Shigella PGDBs in BioCyc
1 EcoCyc Project Overview
EcoCyc¹ is a bioinformatics database that describes the genome and the biochemical machinery of E. coli K-12 MG1655. The long-term goal of the project is to describe the molecular catalog of the E. coli cell, as well as the functions of each of its molecular parts, to facilitate a system-level understanding of E. coli. EcoCyc is an electronic reference source for E. coli biologists, and for biologists who work with related microorganisms. In addition, a steady-state metabolic flux model is generated from each new version of EcoCyc. This chapter provides an overview of the data content of EcoCyc, and of the procedures by which these data have entered and continue to enter EcoCyc. EcoCyc is designed for several different modes of interactive use via the EcoCyc.org web site and in conjunction with the downloadable Pathway Tools [1] software (Section 12 explains how to learn to use the web site and software):
EcoCyc data are also available for download in multiple file formats [2] and can be queried programmatically via web services [3].
Genome. EcoCyc contains the complete genome sequence of E. coli, and describes the nucleotide position and function of every E. coli gene. A staff of five full-time curators updates the annotation of the E. coli genome on an ongoing basis using a literature-based curation strategy (see below). Mini-review summaries of E. coli gene products can be found in EcoCyc protein and RNA pages. Users can retrieve the nucleotide sequence of a gene, and the amino-acid sequence of a gene product.
Regulation. EcoCyc describes several types of E. coli cellular regulation:
Membrane transporters. EcoCyc annotates E. coli transport proteins, and the associated transport reactions that they mediate.
Metabolism. EcoCyc describes all known metabolic pathways and signal-transduction pathways of E. coli. It describes each metabolic enzyme of E. coli, including its cofactors, activators, inhibitors, and subunit structure. See also the MetaCyc project.
Database links. EcoCyc is linked to other biological databases containing protein and nucleic acid sequence data, bibliographic data, protein structures, and descriptions of different E. coli strains.
Literature-Based Curation. Curation is the process of manually refining and updating a bioinformatics database. The EcoCyc project uses a literature-based curation approach in which database updates are based on evidence in the experimental literature. EcoCyc is largely up to date with respect to its curation activities. As of March 2013, EcoCyc has encoded information from more than 30,224 publications. Curators collect gene, protein, pathway, and compound names and synonyms. They classify genes and gene products using the Gene Ontology and MultiFun ontology, and they classify pathways within the Pathway Tools pathway ontology. Protein complex components and the stoichiometry of these subunits are captured; cellular localization of polypeptides and protein complexes is entered, as are experimentally determined protein molecular weights; enzyme activities and any enzyme prosthetic groups, cofactors, activators, or inhibitors are captured. Operon structure and gene regulation information are encoded. Textual summaries with extensive citations are authored by curators. Within the summaries for proteins, RNAs, pathways, and operons, curators capture additional information not captured in the highly structured database fields of EcoCyc.
For example, curators use the free-text summary sections to capture phenotypes caused by mutation, depletion, or overproduction of each gene product; any genetic interactions known; protein domain architecture and structural studies; similarity to other proteins; or any functional complementation experiments that have been described. Summaries can also be used to note cases in which the published reports present contradictory results. In such cases, both viewpoints are presented with proper attribution. This approach ensures that no information is lost.
Underlying software. The Pathway Tools software that underlies EcoCyc is not specific to E. coli, but has been applied to manage genomic and biochemical data for hundreds of organisms.
2 How to Cite EcoCyc
Please cite EcoCyc in publications that benefited from the use of the EcoCyc database or web site. Please cite EcoCyc as: Keseler et al., Nucleic Acids Res, 39:D583–90, 2011.
3 The Roles of EcoCyc in Microbial Genome Annotation
The EcoCyc database can impact two aspects of microbial genome annotation: annotation of gene function, and annotation of metabolic pathways. We suggest that microbial genome annotation pipelines include a BLAST search (or a search by other sequence similarity tools) against all proteins with experimentally defined functions from EcoCyc. As discussed in our article Multidimensional annotation of the Escherichia coli K-12 genome, E. coli contains more proteins of experimentally determined functions than any other organism. When functions are assigned to newly sequenced genes, strong similarity hits to these proteins should be preferred over hits against other proteins, to minimize the chances of annotation errors due to transitive annotations.
4 Conditions of E. coli Growth and Non-Growth
As of 2011, EcoCyc incorporates media that have been shown experimentally to support or not support growth of both wild type and knock-out strains of E. coli K-12. This work has two goals. The first is to assemble a comprehensive encyclopedia of E. coli growth conditions for experimentalists. The spectrum of environmental conditions supporting the growth of a bacterium is among its most important phenotypic traits. We cannot expect to understand the functions of all genes in an organism unless we understand the full range of environments in which the cell can grow. Second, a comprehensive collection of E. coli growth media will drive more accurate systems biology modeling of E. coli. The larger the set of growth media against which these models are validated, the more accurate and comprehensive the models will be. EcoCyc captures approximately 20 media that are commonly used by E. coli laboratories. It also describes media used in the following high-throughput experiments from Biolog Phenotype Microarrays (PMs) that support respiration in E. coli.
These data on growth conditions can be accessed from the EcoCyc Web site by invoking the command Tools → Search → Growth Media, then clicking on the button “All Growth Media for this Organism.” Individual media are shown in the initial table; PM data are shown in the following tables. The coloring of each cell indicates the degree of growth observed under that condition. Three levels of growth can be recorded: no growth, low growth, and growth (see legend that indicates the colors associated with each level of growth). Click on any growth medium to request a page describing its composition, and to see genes that are essential or not essential for growth under that condition.
5 Essential Gene Information
As of 2011, EcoCyc incorporates several large-scale datasets on gene essentiality in E. coli. Gene essentiality information is useful for
EcoCyc incorporates data on essentiality from the following publications:
When essentiality data are available for a given gene, the EcoCyc gene page includes a table of the conditions under which that gene has been found to be essential, or not essential, for growth. Clicking on a condition navigates to a growth-medium page that lists all essentiality information under that growth condition.
6 EcoCyc Metabolic Flux Model
A quantitative steady-state metabolic flux model has been derived from EcoCyc using Flux-Balance Analysis (FBA). By running this model with different parameters, scientists can model the growth of E. coli under different nutrient conditions and under different gene knock-outs. Every time the model is executed, it is freshly generated from EcoCyc, meaning that as the reactions in EcoCyc are updated due to curation, the model evolves to reflect those changes. To run the model, use the Tools → Metabolism → Run Metabolic Model command. MetaFlux is described in the Metabolic Models section of the website user guide.
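For readers unfamiliar with FBA, the optimization problem behind a steady-state flux model can be written in its generic textbook form as follows. This is a sketch of the standard formulation, not necessarily the exact problem that MetaFlux solves; the symbols S, v, c, l, and u are notation introduced here for illustration.

```latex
\begin{aligned}
\text{maximize}\quad   & c^{\top} v \\
\text{subject to}\quad & S\,v = 0, \\
                       & l_j \le v_j \le u_j \quad \text{for every reaction } j .
\end{aligned}
```

Here S is the stoichiometric matrix (one row per metabolite, one column per reaction), v is the vector of reaction fluxes, and c selects the objective, typically a biomass-producing reaction. The constraint S v = 0 is the steady-state condition: each metabolite is produced as fast as it is consumed. Different nutrient conditions correspond to different bounds on the uptake reactions, and a gene knock-out can be modeled by setting l_j = u_j = 0 for each reaction j that depends on the product of the deleted gene.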
7 Update Frequency
The EcoCyc.org and BioCyc.org Web sites and downloadable files are updated approximately three times per year. A faster, more powerful EcoCyc that you can install locally on your computer (Macintosh, PC/Windows, PC/Linux) is released semiannually (see the EcoCyc release history).
8 Data Sources Incorporated into EcoCyc
8.1 UniProt Features
UniProt protein features (the UniProt KB term is sequence annotations) from the complete proteome of E. coli K-12 MG1655 in SwissProt are imported into EcoCyc for every EcoCyc release. We import all protein features with experimental or non-experimental evidence qualifiers except for the following types:
8.2 Gene Ontology
For several years, EcoCyc and EcoliWiki have been collaborating on improving and maintaining the GO annotations for E. coli. Since the summer of 2008, we have been periodically generating a file containing all E. coli K-12 GO term annotations, called
GO annotation has become a standard part of EcoCyc’s manual literature-based curation process. The GO annotations are added to the database objects that represent the functional gene products or protein complexes, not directly to the gene objects, so as to model the biology as accurately as possible. In parallel, manual annotation of E. coli genes with GO is ongoing at EcoliWiki. On a regular basis, the GO annotations are merged. The latest UniProt and EcoliWiki annotations are imported into EcoCyc. Because electronic annotations are not accepted by the GO consortium as part of the gene association file if they are more than one year old, these UniProt annotations are reimported into EcoCyc on a regular basis.
EcoCyc incorporates many electronic and experimental GO term annotations of E. coli K-12 gene products obtained from the “UniProt [multispecies] GO Annotations @ EBI” file downloaded from the Gene Ontology Consortium. When this import was first performed in 2007, about 30,000 new IEA (“Inferred from Electronic Annotation”) GO term assignments were added to EcoCyc, along with approximately 1,000 assignments with experimental evidence codes, including assignments from high-throughput protein-interaction studies. During the import of GO terms from UniProt into EcoCyc, a filtering operation prunes GO term annotations that have solely computational (IEA) evidence if the EcoCyc gene product already has more specific GO annotations (in other words, GO terms that are children of the GO term being imported) that are supported by experimental evidence. For example, if a gene product already contained an experimental annotation of the term “galactose kinase,” the software would not add the computational annotation “carbohydrate kinase.” This filtering leads to the removal of about 1,000 of these less specific and redundant annotations.
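The filtering rule described above can be sketched in Python as follows. This is a minimal illustration, not the actual EcoCyc import code: the three-term ontology, the `Annotation` class, and the function names are hypothetical, and the sketch treats any ancestor of an experimentally supported term as "less specific" than that term.

```python
from dataclasses import dataclass

# GO evidence codes that denote experimental support.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}

# Hypothetical mini-ontology: each term maps to its direct parent terms.
PARENTS = {
    "galactose kinase": {"carbohydrate kinase"},
    "carbohydrate kinase": {"kinase"},
    "kinase": set(),
}


@dataclass(frozen=True)
class Annotation:
    term: str
    evidence: str  # GO evidence code, e.g. "IEA" or "IDA"


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def filter_imports(existing, incoming, parents):
    """Drop incoming IEA annotations whose term is an ancestor of a term
    that the gene product already carries with experimental evidence."""
    covered = set()
    for ann in existing:
        if ann.evidence in EXPERIMENTAL_CODES:
            covered |= ancestors(ann.term, parents)
    return [
        ann for ann in incoming
        if not (ann.evidence == "IEA" and ann.term in covered)
    ]


# The gene product already has an experimental "galactose kinase" annotation.
existing = [Annotation("galactose kinase", "IDA")]
incoming = [
    Annotation("carbohydrate kinase", "IEA"),  # redundant, pruned
    Annotation("ATP binding", "IEA"),          # unrelated term, kept
    Annotation("carbohydrate kinase", "IDA"),  # experimental, kept
]
kept = filter_imports(existing, incoming, PARENTS)
```

Only annotations whose evidence is solely computational are pruned; an incoming annotation with its own experimental evidence is kept even when it is less specific.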
A gene association file is generated from the quarterly releases of
EcoCyc. This file is sent to the EcoliWiki team at Texas A&M for
further processing. At EcoliWiki, annotations made in the wiki-based
community annotation system since the last EcoCyc update are added to
the file, along with annotations containing qualifiers (mainly
8.3 RefSeq Collaboration
EcoCyc is involved in a collaboration to update the genome annotation of the GenBank (U00096.3) and RefSeq (NC_000913.3) entries for E. coli K-12 MG1655 on an ongoing basis. The primary collaborators include EcoCyc, EcoGene, UniProtKB/Swiss-Prot, and NCBI. The collaborators routinely share their data and resolve conflicts among the data. Updates of gene names, gene positions, and gene product names are shared among all partners.
8.4 MetaCyc
The EcoCyc and MetaCyc databases exchange data as part of the release processes for both databases. Updates that have occurred to enzymes, genes, pathways, reactions, and metabolites are exchanged between the databases based on automated comparisons of update dates to ensure that the latest information and corrections are propagated between databases.
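A date-based exchange of this kind can be sketched as follows. This is an illustration only, assuming that each database is represented as a dictionary from object ID to a record carrying an `updated` date; the record layout, the object IDs, and the function name are hypothetical and do not reflect the actual release tooling.

```python
from datetime import date


def propagate_latest(ecocyc, metacyc):
    """For each object present in both databases, copy the more recently
    updated record over the older one. Returns the list of copies made."""
    changes = []
    for obj_id in sorted(ecocyc.keys() & metacyc.keys()):
        eco_rec, meta_rec = ecocyc[obj_id], metacyc[obj_id]
        if eco_rec["updated"] > meta_rec["updated"]:
            metacyc[obj_id] = dict(eco_rec)
            changes.append((obj_id, "EcoCyc -> MetaCyc"))
        elif meta_rec["updated"] > eco_rec["updated"]:
            ecocyc[obj_id] = dict(meta_rec)
            changes.append((obj_id, "MetaCyc -> EcoCyc"))
    return changes


# Hypothetical records; the IDs and names are placeholders.
ecocyc = {
    "RXN-A": {"name": "reaction A (corrected)", "updated": date(2013, 2, 1)},
    "RXN-B": {"name": "reaction B", "updated": date(2012, 5, 1)},
}
metacyc = {
    "RXN-A": {"name": "reaction A", "updated": date(2012, 11, 1)},
    "RXN-B": {"name": "reaction B (corrected)", "updated": date(2013, 1, 15)},
    "RXN-C": {"name": "reaction C", "updated": date(2013, 1, 1)},
}
changes = propagate_latest(ecocyc, metacyc)
```

Objects that exist in only one database (here `RXN-C`) are left untouched, and records with identical update dates are not copied in either direction.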
9 EcoCyc Accession Numbers
9.1 Gene Accession Numbers
Three systems of accession numbers are typically available for genes within EcoCyc. Any of these accession numbers may be used when querying EcoCyc genes “by name,” and in the Web site Quick Search.
10 Other E. coli and Shigella PGDBs in BioCyc
EcoCyc is part of the larger BioCyc collection of Pathway/Genome Databases (PGDBs). BioCyc version 16.0 (2012) included more than 130 E. coli and Shigella PGDBs. Most of these PGDBs were generated computationally and lack the extensive manual literature-based curation of the EcoCyc K-12 database. Two of these PGDBs have undergone additional curation: the BioCyc PGDBs for strains W3110 and for E. coli B str. REL606. Both strains underwent a computational annotation normalization procedure in which gene names, product names, heteromultimeric protein complexes, and Gene Ontology terms were propagated from EcoCyc to their orthologous genes in these other two strains. This procedure was performed under the assumption that genome annotation pipelines typically introduce syntactically large but semantically insignificant variation in the naming of genes and gene products. In addition, E. coli B str. REL606 is undergoing literature-based curation to incorporate experimental information regarding the genes and pathways present in this strain but not in the EcoCyc strain MG1655. This curation is supported by the PortEco (formerly EcoliHub) project. To select a given genome for querying in the BioCyc Web site, click on the word “change” under the Quick Search and Gene Search buttons in the upper right corner of most Web pages.
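The annotation normalization procedure described in this section can be sketched as follows. This is a simplified illustration under stated assumptions: genes are represented as plain dictionaries, the ortholog mapping is given as input, and protein complexes are left out. The identifiers, field names, and function name are hypothetical.

```python
# Fields copied from the reference database to orthologous genes.
PROPAGATED_FIELDS = ("gene_name", "product_name", "go_terms")


def normalize_annotations(reference, target, orthologs):
    """Return a copy of `target` in which every gene with an ortholog in
    `reference` takes over the reference gene name, product name, and GO
    terms. Genes without an ortholog are returned unchanged."""
    normalized = {gene_id: dict(record) for gene_id, record in target.items()}
    for target_id, reference_id in orthologs.items():
        if target_id in normalized and reference_id in reference:
            for field in PROPAGATED_FIELDS:
                if field in reference[reference_id]:
                    normalized[target_id][field] = reference[reference_id][field]
    return normalized


# Hypothetical reference (curated) and target (pipeline-annotated) genes.
reference = {
    "ref_gene_1": {
        "gene_name": "lacZ",
        "product_name": "beta-galactosidase",
        "go_terms": ["beta-galactosidase activity"],
    },
}
target = {
    "tgt_gene_1": {"gene_name": "tgt_0001", "product_name": "Beta-D-galactosidase"},
    "tgt_gene_2": {"gene_name": "tgt_0002", "product_name": "strain-specific protein"},
}
orthologs = {"tgt_gene_1": "ref_gene_1"}  # target gene -> reference gene

normalized = normalize_annotations(reference, target, orthologs)
```

Returning a modified copy rather than editing `target` in place keeps the pipeline-generated annotation available for comparison.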
11 We Encourage Your Feedback
Feedback from the scientific community has been invaluable to improving EcoCyc during its many years of development. We strongly encourage your comments and suggestions for improvements in areas including the following. Please email suggestions or questions to biocyc-support at ai dot sri dot com.
At every EcoCyc release we email a summary of new developments to our biocyc-users mailing list. To subscribe to this mailing list, please see http://biocyc.org/subscribe.shtml.
12 How to Learn More
13 Acknowledgments
The development of EcoCyc is funded by NIH grants GM77678 and GM71962 from the NIH National Institute of General Medical Sciences. Contributors to EcoCyc are listed on the credits page.
References
1 “EcoCyc” is pronounced “eeko-sike”. It sounds like “ecology” and like “encyclopedia”.