178 pseudogenes are currently annotated in EcoGene, including 5 RNA pseudogenes. 116 are ygene pseudogenes.



A bacterial pseudogene is a naturally occurring, spontaneously generated gene fragment or mutant allele that can be identified as apparently defective by comparative genomics or that has been shown to be defective by experiment.


The vast majority of pseudogenes are identified in complete genome sequences during the annotation process. A wide variety of annotation styles and pseudogene definitions are used, leading to confusing inconsistencies. In collaboration with Guy Plunkett of ASAP, Mary Berlyn of the Coli Genetic Stock Center, and Amos Bairoch of Swiss-Prot, EcoGene has embraced detailed and comprehensive pseudogene identification and annotation for over a decade, including pseudogene identification in the 1998 E. coli physical linkage map publication (MMBR 62:985). The pseudogene annotations from EcoGene are now in GenBank. They can serve as a starting point for working towards a universally accepted pseudogene definition and rules for consistent bacterial pseudogene annotation.

A major problem in identifying true pseudogenes is distinguishing an apparently mutant allele from a DNA sequencing error. For this reason, and because a pseudogene may still be functional (see below), we refer to pseudogenes identified using comparative genomics as apparent pseudogenes. E. coli K-12 has the advantage of having been resequenced to resolve DNA sequencing errors.

An apparent pseudogene fragment is distinguished from a stand-alone functional domain in practice when no homologs (in another species) or paralogs/xenologs of the stand-alone version can be found.

The working pseudogene definition implemented in EcoGene is elaborated in the pseudogene class definitions. Pseudogenes are classified by mutation type (frameshift, insertion, deletion, stop), by evolutionary origin (ancestral, domestication), or by consequences (functional, chimeric).


Type: Pseudogenes may have a single mutation or multiple mutations.

Deleted: A portion of the gene has been deleted from the genome.

5' truncated: The beginning of the gene is missing.

3' truncated: The end of the gene is missing.

Internal deletion: An in-frame internal deletion (a multiple of three bases deleted) leaves a shortened ORF with both ends intact. An out-of-frame deletion causes a frameshift, so that a heterologous and probably shortened protein C-terminus is synthesized.

Combinatory: Both ends can be truncated, leaving a piece in the middle; an internal deletion can be combined with a truncation.

RNA gene-derived fragment: For example, prophage att sites.

Frameshifted: A short insertion or deletion causes an unintended shift in the reading frame during translation, usually leading to heterologous C-terminal amino acids and premature termination of translation.

Interrupted: Due to an internal insertion of an IS element or other foreign DNA.

In-frame stop: A sense codon has acquired a point mutation that turns it into a stop codon, causing premature termination of translation.

Complex (multiple types): Degenerating pseudogenes can acquire multiple mutations of different types.

Other mutation types, such as inversions or translocations, may occur in pseudogenes, but have not been observed during the EcoGene annotation of strain MG1655.
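The distinction between in-frame and out-of-frame deletions, and the detection of an in-frame stop, can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of EcoGene. The sketch assumes the three standard stop codons (TAA, TAG, TGA) and a coding sequence given in frame from its start codon.

```python
# Illustrative sketch only; these helpers are hypothetical, not EcoGene code.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard stop codons


def classify_internal_deletion(deleted_bases: int) -> str:
    """Classify an internal deletion by its length in bases.

    A multiple of three keeps the reading frame ("in-frame");
    any other length shifts it ("frameshift").
    """
    if deleted_bases <= 0:
        raise ValueError("deleted_bases must be positive")
    return "in-frame" if deleted_bases % 3 == 0 else "frameshift"


def first_premature_stop(cds: str):
    """Return the 0-based codon index of the first stop codon that occurs
    before the final codon, or None if there is none."""
    cds = cds.upper()
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            return index
    return None
```

For example, a 6-base deletion is classified as in-frame and a 4-base deletion as a frameshift. A real annotation pipeline would also need to compare the allele against an intact homolog, which this sketch does not attempt.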

Source: Some pseudogenes can be classified based on their evolutionary history.

Ancestral: An ancestral pseudogene has an identical mutant allele present in another independent natural isolate, i.e. the allele did not arise after strain isolation. Different mutant alleles are considered as different pseudogenes arising from the same parent gene.

Domestication: Pseudogenes with mutations arising under the selective pressures of laboratory cultivation and storage. These domestication mutations are probably spontaneous, but some may have been unintentionally induced, because E. coli K-12 was subjected to mildly mutagenic UV and acridine orange treatments during cultivation.

Early: Pseudogene alleles shared by both the W3110 and MG1655 K-12 strains. These alleles are candidates for domestication mutations, but further genome sequencing of E. coli isolates may identify an ancestral copy of the allele in the future.

Late: Pseudogene alleles that are present only in either the W3110 or the MG1655 strain must have arisen after domestication.
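The source classes above amount to a small decision rule, sketched here in Python. The function and its boolean inputs are hypothetical illustrations rather than an EcoGene interface, and the sketch assumes the presence or absence of an identical mutant allele in each genome is already known.

```python
# Illustrative sketch only; the function and its inputs are hypothetical.
def classify_allele_source(in_w3110: bool, in_mg1655: bool,
                           in_independent_isolate: bool) -> str:
    """Assign a source class to a mutant allele.

    in_independent_isolate: the identical mutant allele is present in
    another independent natural isolate.
    """
    if not (in_w3110 or in_mg1655):
        return "not in K-12"
    if in_independent_isolate:
        # The allele predates strain isolation.
        return "ancestral"
    if in_w3110 and in_mg1655:
        # Shared by both K-12 strains: candidate domestication mutation.
        return "early"
    # Present in only one of the two K-12 strains.
    return "late"
```

An "early" result is provisional: as noted above, further genome sequencing may later reveal the same allele in a natural isolate, which would reclassify it as ancestral.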

Reconstructed: Frameshifted, in-frame stopped and interrupted pseudogenes are reconstructed in EcoGene to facilitate phylogenetic analysis and prediction of ancestral function. The mutations are reversed computationally. Often an intact allele is present in another E. coli genome sequence, validating the reconstruction as accurate. One copy of the target site duplication must be removed for the reconstruction of an IS-interrupted pseudogene. An X can mark an amino acid uncertainty and an N can mark a nucleotide uncertainty at the point of a frameshift or stop codon reconstruction.
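The reconstruction of an IS-interrupted pseudogene can be sketched as follows. This is a minimal Python illustration, not the procedure EcoGene actually runs. It assumes the IS element coordinates and the target site duplication length are already known, and that the two duplication copies immediately flank the element. The function name and parameters are hypothetical.

```python
# Illustrative sketch only; not the EcoGene reconstruction procedure.
def reconstruct_is_interrupted(seq: str, is_start: int, is_end: int,
                               tsd_len: int) -> str:
    """Remove an IS element and one copy of its target site duplication.

    seq      : interrupted gene sequence
    is_start : 0-based start of the IS element (inclusive)
    is_end   : 0-based end of the IS element (exclusive)
    tsd_len  : length of the target site duplication (assumed known)
    """
    if tsd_len < 0 or is_start - tsd_len < 0 or is_end + tsd_len > len(seq):
        raise ValueError("coordinates fall outside the sequence")
    left_copy = seq[is_start - tsd_len:is_start]
    right_copy = seq[is_end:is_end + tsd_len]
    if left_copy != right_copy:
        raise ValueError("flanking sequences are not a direct duplication")
    # Keep the left copy; drop the IS element and the right copy.
    return seq[:is_start] + seq[is_end + tsd_len:]
```

For instance, with a 3-base duplication `AAG` flanking an 8-base insert, the call `reconstruct_is_interrupted("ATGGCTAAGGGGGCCCCAAGCCTGAATAA", 9, 17, 3)` yields the uninterrupted sequence `ATGGCTAAGCCTGAATAA`. Real duplications may have diverged after insertion, so an exact-match check like this one is a simplification.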

Errorgene: An apparent pseudogene that is not a true pseudogene but the result of a DNA sequencing error. An errorgene can be correctly identified only by resequencing the region from the same strain.


Chimeric: Deletions, insertions and frameshifts create potential fusion or extended genes. The partial normal gene can be fused to either coding or non-coding DNA. A fusion pseudogene TopicPage is being developed but is not yet implemented. Chimeric pseudogenes may be expressed and may also be functional. The sequence of the chimeric form is not presented in the protein sequence field of EcoGene GenePages, but will be presented on a chimeric pseudogene TopicPage.

Functional: Apparent pseudogenes could possibly retain full or partial activity or could gain new functions. An example is the prophage att sites derived from tRNA genes. Another example, not yet confirmed, comes from a report on an MG1655 mutation of the apparent pseudogene mdtQ'. A further example is rph', which has a C-terminal frameshift and is expressed with ~1% activity.


Comment on false gene annotations:

A false gene is a misannotation and is not included in the definition of a pseudogene. It is annotated in a database as a protein or structural RNA gene but is very unlikely to be a gene, for a variety of reasons. False genes are especially a problem among annotated ORFs less than 50 codons long, which can reflect conserved DNA sites, such as transcription factor binding sites (TFBSs).

Pseudogenes in EcoGene are prominently marked in red and have an apostrophe (') added to the gene name.
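When processing gene names programmatically, the apostrophe convention can be checked with a short Python sketch. The helper names are hypothetical, and the sketch assumes the apostrophe is always the final character of a pseudogene name, as in rph' and mdtQ'.

```python
# Illustrative sketch only; helper names are hypothetical.
def is_pseudogene_name(name: str) -> bool:
    """Return True if the gene name carries the trailing apostrophe."""
    return name.endswith("'")


def base_gene_name(name: str) -> str:
    """Return the gene name with any trailing apostrophe removed."""
    return name.rstrip("'")
```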