Request a New Genome in BioCyc
If you'd like to see a new genome added to BioCyc, please send a
request to firstname.lastname@example.org. Please include the species name,
strain name, and the RefSeq or GenBank assembly accession ID.
Difficulties Querying BioCyc Genomes by Gene Name
We have received a number of reports from users who are unable to
search certain BioCyc genomes by gene name. Here we explain the
reasons for this situation.
Most of the genomes within BioCyc were obtained from NCBI RefSeq. Many
RefSeq annotations include gene names and protein names, which are the
sources of gene names and protein names that you see in BioCyc
databases (before the additional curation that also occurs in BioCyc).
Periodically we re-download RefSeq genomes, and re-generate the BioCyc
PGDBs. This allows us to integrate improved RefSeq annotations for
the genomes, and to provide improved metabolic reconstructions based
on newer versions of our MetaCyc database and PathoLogic software.
However, it also means that any problems introduced in RefSeq can
appear in BioCyc. Recently released RefSeq genomes omit large numbers
of gene names that were present in earlier versions of those genomes.
We were not aware of this problem at the time we re-generated these
BioCyc databases, and some such genomes were integrated into BioCyc.
NCBI has acknowledged the problem and is working to fix it, but
will not give an estimate of when it will be fixed. In
the meantime, we will not integrate any updated RefSeq genomes into
BioCyc if the updated version contains significantly fewer gene names
than the previous version. Reverting or reliably repairing the
existing BioCyc PGDBs with small numbers of gene names is not practical.
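The safeguard described above, holding back an updated genome that has far fewer gene names than its predecessor, could be sketched roughly as follows. This is an illustration only, not the procedure BioCyc actually uses: the function names, the 10% threshold for "significantly fewer", and the pattern used to recognize accession-like "names" are all assumptions made for the example.

```python
import re

# ASSUMPTION: a hypothetical pattern for "names" that are really
# accession-style identifiers (letters, optional digits, an underscore,
# then digits), e.g. "ABC_0001". Real data would need a better rule.
ACCESSION_LIKE = re.compile(r"^[A-Za-z]+[0-9]*_[A-Za-z]*[0-9]+$")


def has_true_name(name):
    """True if `name` is non-empty and does not look like an accession."""
    if not name:
        return False
    return ACCESSION_LIKE.match(name) is None


def count_named(gene_names):
    """Number of genes in the list that carry a true gene name."""
    return sum(1 for name in gene_names if has_true_name(name))


def named_fraction(gene_names):
    """Fraction of genes with a true gene name (0.0 for an empty list)."""
    if not gene_names:
        return 0.0
    return count_named(gene_names) / len(gene_names)


def should_integrate(old_names, new_names, max_relative_drop=0.10):
    """Accept the update unless the count of named genes fell by more
    than `max_relative_drop` (ASSUMPTION: 10%) relative to the old version."""
    old_count = count_named(old_names)
    if old_count == 0:
        return True  # nothing to lose, so the update cannot be worse
    new_count = count_named(new_names)
    return (old_count - new_count) / old_count <= max_relative_drop


# One entry per gene; None marks a gene with no name in the annotation.
old_version = ["thrA", "thrB", "thrC", None]
new_version = ["thrA", "ABC_0002", "ABC_0003", None]

print(named_fraction(old_version))
print(named_fraction(new_version))
print(should_integrate(old_version, new_version))
```

In this toy case the updated version keeps only one of three true gene names, so the check rejects it. The same `named_fraction` helper is the kind of per-genome figure that coverage statistics for gene names are built from.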
Here are a few statistics regarding gene names in BioCyc that
illustrate the variability in current genome annotations. We do not
know how these numbers compare to the previous version of BioCyc.
However, only about 2,000 of the 11,000 BioCyc databases were
re-generated in our last release, which limits how much these numbers
could have changed from the previous BioCyc release.
In BioCyc 21.5:
- 1334 genomes contain no gene names.
- For 8928 genomes, fewer than 10% of their genes have gene names.
Conversely, for 2052 genomes, more than 10% of their genes have gene names.
- For 208 genomes, more than 50% of their genes have gene names.
However, for a number of these genomes, many of the "names" stored for
the genes are actually accession numbers, not true gene names.
- For EcoCyc, 69.5% of genes contain gene names that do not begin with
the letter "y" (genes beginning with y do not have well established