Utility functions

fqfa provides basic utility functions for working with biological sequences as strings. For efficiency, these functions assume that any required validation (such as making sure all the characters in the string are valid bases) has already been performed.

fqfa includes a copy of the standard translation table; alternative translation tables can be imported using ncbi_genetic_code_to_dict().

Nucleotide sequence utility functions

fqfa.util.nucleotide.convert_dna_to_rna(seq: str) → str

Convert a DNA sequence into an RNA sequence by changing “T” to “U”.

Parameters:

seq (str) – String containing DNA bases.

Returns:

The equivalent RNA sequence.

Return type:

str

fqfa.util.nucleotide.convert_rna_to_dna(seq: str) → str

Convert an RNA sequence into a DNA sequence by changing “U” to “T”.

Parameters:

seq (str) – String containing RNA bases.

Returns:

The equivalent DNA sequence.

Return type:

str
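The two conversions above are simple character substitutions. The following is a minimal sketch of that idea, not fqfa's own implementation: the function names ending in _sketch are hypothetical, and the sketch assumes uppercase input that has already been validated.

```python
# Illustrative stand-ins for the conversions described above.
# These are NOT fqfa's functions; the *_sketch names are hypothetical.
_DNA_TO_RNA = str.maketrans("T", "U")
_RNA_TO_DNA = str.maketrans("U", "T")


def dna_to_rna_sketch(seq: str) -> str:
    """Replace every uppercase T with U; other characters pass through."""
    return seq.translate(_DNA_TO_RNA)


def rna_to_dna_sketch(seq: str) -> str:
    """Replace every uppercase U with T; other characters pass through."""
    return seq.translate(_RNA_TO_DNA)


rna = dna_to_rna_sketch("ATGGCCTAA")
print(rna)                     # AUGGCCUAA
print(rna_to_dna_sketch(rna))  # ATGGCCTAA
```

Because no validation is performed, a character outside the expected alphabet is returned unchanged rather than raising an error.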

fqfa.util.nucleotide.reverse_complement(seq: str) → str

Reverse-complement a DNA sequence string and return it.

If a character not in fqfa.iupac.dna.DNA_CHARACTERS is encountered, it is retained.

Parameters:

seq (str) – String containing DNA bases.

Returns:

The reverse complement DNA sequence.

Return type:

str
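As a rough illustration of the behaviour described above, a reverse complement can be written as a character mapping followed by a reversal. This sketch is not fqfa's implementation: the name reverse_complement_sketch is hypothetical, and the mapping is assumed to cover only the four uppercase bases, so any other character is retained as-is.

```python
# Hypothetical stand-in, assuming only A, C, G and T have complements.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement_sketch(seq: str) -> str:
    """Complement each base, then reverse the string.

    Characters missing from the mapping (for example "N") are kept as-is,
    because str.translate leaves unmapped characters untouched.
    """
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement_sketch("ATGCN"))  # NGCAT
```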

Coding sequence translation

fqfa.util.translate.ncbi_genetic_code_to_dict(ncbi_string: str) → Dict[str, str]

Parse a translation table from NCBI into a dictionary.

The five-line table input is parsed into a dictionary representation suitable for translate_dna(). As an example, the standard genetic code (transl_table=1) is defined in CODON_TABLE.

NCBI translation tables are listed on the NCBI Taxonomy page “The Genetic Codes”.

The standard genetic code is encoded by:

  AAs  = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = ---M------**--*----M---------------M----------------------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

Information from the Starts line is not retained in the dictionary representation.

Blank lines or whitespace-only lines are automatically skipped, as are lines beginning with #.

Parameters:

ncbi_string (str) – Multi-line string containing a transl_table from NCBI.

Returns:

Dictionary mapping codons to single-letter amino acid codes.

Return type:

Dict[str, str]

Raises:
  • ValueError – If any of the rows is missing.

  • ValueError – If the row labels do not match the expected format.

  • ValueError – If any row does not have the expected format (<label> = <data>).

  • ValueError – If any of the rows fails to contain the expected number of characters (64).

  • ValueError – If there are duplicate codons in the table.

  • ValueError – If any of the BaseN rows contains a character other than ACGT.

  • ValueError – If the AAs row contains a character other than an amino acid.
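The parsing described above can be sketched as follows. This is a simplified illustration rather than fqfa's code: parse_ncbi_table_sketch is a hypothetical name, and only some of the listed checks (row format, missing rows, row length, duplicate codons) are reproduced. The input text is assembled piece by piece so that each row is exactly 64 characters long.

```python
from typing import Dict

# Rebuild the standard-code text shown above, with a comment line and a
# blank line added to demonstrate that both are skipped.
bases = "TCAG"
ncbi_text = "\n".join([
    "# transl_table=1 (standard code)",
    "",
    "  AAs  = " + "FFLLSSSSYY**CC*W" + "LLLLPPPPHHQQRRRR"
    + "IIIMTTTTNNKKSSRR" + "VVVVAAAADDEEGGGG",
    "Starts = " + "---M------**--*----M" + "-" * 15 + "M" + "-" * 28,
    "Base1  = " + "".join(b * 16 for b in bases),
    "Base2  = " + "".join(b * 4 for b in bases) * 4,
    "Base3  = " + bases * 16,
])


def parse_ncbi_table_sketch(ncbi_string: str) -> Dict[str, str]:
    """Hypothetical, simplified parser for a five-line NCBI table."""
    rows: Dict[str, str] = {}
    for line in ncbi_string.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        label, sep, data = line.partition("=")
        if not sep:
            raise ValueError(f"expected '<label> = <data>', got {line!r}")
        rows[label.strip()] = data.strip()

    expected = ("AAs", "Starts", "Base1", "Base2", "Base3")
    missing = [name for name in expected if name not in rows]
    if missing:
        raise ValueError(f"missing rows: {missing}")
    if any(len(rows[name]) != 64 for name in expected):
        raise ValueError("every row must contain 64 characters")

    table: Dict[str, str] = {}
    # Column i of the three BaseN rows spells codon i; the Starts row is
    # checked for presence and length but otherwise ignored.
    for b1, b2, b3, aa in zip(
        rows["Base1"], rows["Base2"], rows["Base3"], rows["AAs"]
    ):
        codon = b1 + b2 + b3
        if codon in table:
            raise ValueError(f"duplicate codon: {codon}")
        table[codon] = aa
    return table


table = parse_ncbi_table_sketch(ncbi_text)
print(len(table), table["ATG"], table["TAA"])  # 64 M *
```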

fqfa.util.translate.translate_dna(seq: str, table: Dict[str, str] | None = None, frame: int = 0) → Tuple[str, str | None]

Translate a DNA sequence into the corresponding amino acid sequence.

Parameters:
  • seq (str) – String containing DNA bases to translate.

  • table (Optional[Dict[str, str]]) – Map from codon strings to single-letter amino acid codes, or None to use the default translation table.

  • frame (int) – Integer with value in (0, 1, 2) defining the position in the sequence to start at.

Returns:

Tuple whose first element is the string of single-letter amino acid codes and whose second element is a string containing any remaining bases in a trailing partial codon (or None if there was no remainder).

Return type:

Tuple[str, Optional[str]]

Raises:

KeyError – If a full-length codon was not present in the translation table.
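To make the frame and remainder semantics concrete, here is a minimal sketch under stated assumptions. It is not fqfa's implementation: translate_dna_sketch and STANDARD_TABLE_SKETCH are hypothetical names, the table is rebuilt locally from the standard code, and the ValueError for an out-of-range frame is a choice made for this sketch only.

```python
import itertools
from typing import Dict, Optional, Tuple

# Standard code in NCBI order (first base varies slowest, over T, C, A, G).
_AAS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
        "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
STANDARD_TABLE_SKETCH: Dict[str, str] = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product("TCAG", repeat=3), _AAS)
}


def translate_dna_sketch(
    seq: str, table: Optional[Dict[str, str]] = None, frame: int = 0
) -> Tuple[str, Optional[str]]:
    """Hypothetical stand-in for the translation described above."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1, or 2")  # sketch-only guard
    if table is None:
        table = STANDARD_TABLE_SKETCH
    coding = seq[frame:]
    full_length = len(coding) - len(coding) % 3
    # A codon missing from the table raises KeyError, as documented above.
    protein = "".join(
        table[coding[i:i + 3]] for i in range(0, full_length, 3)
    )
    remainder = coding[full_length:] or None
    return protein, remainder


print(translate_dna_sketch("ATGGCCTAAGG"))           # ('MA*', 'GG')
print(translate_dna_sketch("ATGGCCTAAGG", frame=1))  # ('WPK', 'G')
print(translate_dna_sketch("ATGGCC"))                # ('MA', None)
```

Note that the termination codon is reported as “*” in the output and translation continues past it in this sketch.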

fqfa.constants.translation.table.CODON_TABLE: Dict[str, str] = {'AAA': 'K', 'AAC': 'N', 'AAG': 'K', 'AAT': 'N', 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T', 'AGA': 'R', 'AGC': 'S', 'AGG': 'R', 'AGT': 'S', 'ATA': 'I', 'ATC': 'I', 'ATG': 'M', 'ATT': 'I', 'CAA': 'Q', 'CAC': 'H', 'CAG': 'Q', 'CAT': 'H', 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P', 'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R', 'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L', 'GAA': 'E', 'GAC': 'D', 'GAG': 'E', 'GAT': 'D', 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A', 'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G', 'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V', 'TAA': '*', 'TAC': 'Y', 'TAG': '*', 'TAT': 'Y', 'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S', 'TGA': '*', 'TGC': 'C', 'TGG': 'W', 'TGT': 'C', 'TTA': 'L', 'TTC': 'F', 'TTG': 'L', 'TTT': 'F'}

Map from codons to single-letter amino acid codes according to the standard code. Sorted by codon.

Codon  Symbol  Amino Acid
AAA    K       Lysine
AAC    N       Asparagine
AAG    K       Lysine
AAT    N       Asparagine
ACA    T       Threonine
ACC    T       Threonine
ACG    T       Threonine
ACT    T       Threonine
AGA    R       Arginine
AGC    S       Serine
AGG    R       Arginine
AGT    S       Serine
ATA    I       Isoleucine
ATC    I       Isoleucine
ATG    M       Methionine
ATT    I       Isoleucine
CAA    Q       Glutamine
CAC    H       Histidine
CAG    Q       Glutamine
CAT    H       Histidine
CCA    P       Proline
CCC    P       Proline
CCG    P       Proline
CCT    P       Proline
CGA    R       Arginine
CGC    R       Arginine
CGG    R       Arginine
CGT    R       Arginine
CTA    L       Leucine
CTC    L       Leucine
CTG    L       Leucine
CTT    L       Leucine
GAA    E       Glutamic acid
GAC    D       Aspartic acid
GAG    E       Glutamic acid
GAT    D       Aspartic acid
GCA    A       Alanine
GCC    A       Alanine
GCG    A       Alanine
GCT    A       Alanine
GGA    G       Glycine
GGC    G       Glycine
GGG    G       Glycine
GGT    G       Glycine
GTA    V       Valine
GTC    V       Valine
GTG    V       Valine
GTT    V       Valine
TAA    *       termination codon
TAC    Y       Tyrosine
TAG    *       termination codon
TAT    Y       Tyrosine
TCA    S       Serine
TCC    S       Serine
TCG    S       Serine
TCT    S       Serine
TGA    *       termination codon
TGC    C       Cysteine
TGG    W       Tryptophan
TGT    C       Cysteine
TTA    L       Leucine
TTC    F       Phenylalanine
TTG    L       Leucine
TTT    F       Phenylalanine

Type:

Dict[str, str]

Sequence type inference functions

fqfa.util.infer.infer_all_sequence_types(seq: str, report_iupac: bool = True) → List[str] | None

Return all inferred types for the given sequence.

Sequence types include:
  • “dna”

  • “dna-iupac” (DNA sequence that contains ambiguity characters)

  • “rna”

  • “protein”

  • “protein-iupac” (protein sequence that contains ambiguity characters)

Parameters:
  • seq (str) – The string to infer the type of.

  • report_iupac (bool) – If True, report sequence types with extended characters as “<type>-iupac”; else report only the sequence type.

Returns:

List of strings containing the inferred sequence types if any type was inferred. None if the sequence didn’t match any sequence types.

Return type:

Optional[List[str]]
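One way to picture this kind of inference is as a series of alphabet membership tests. The sketch below is an assumption-laden illustration, not fqfa's implementation: infer_all_types_sketch is a hypothetical name, the alphabets are common IUPAC-style sets chosen for the example and may differ from the ones fqfa validates against, and returning None for an empty string is a choice made for this sketch.

```python
from typing import List, Optional

# Assumed alphabets for illustration only; fqfa's own sets may differ.
DNA = set("ACGT")
RNA = set("ACGU")
DNA_IUPAC = DNA | set("RYSWKMBDHVN")
PROTEIN = set("ACDEFGHIKLMNPQRSTVWY")
PROTEIN_IUPAC = PROTEIN | set("BXZ")


def infer_all_types_sketch(
    seq: str, report_iupac: bool = True
) -> Optional[List[str]]:
    """Hypothetical sketch: collect every type whose alphabet fits seq."""
    chars = set(seq)
    if not chars:
        return None  # sketch-only choice for empty input
    found: List[str] = []
    if chars <= DNA:
        found.append("dna")
    elif chars <= DNA_IUPAC:
        found.append("dna-iupac" if report_iupac else "dna")
    if chars <= RNA:
        found.append("rna")
    if chars <= PROTEIN:
        found.append("protein")
    elif chars <= PROTEIN_IUPAC:
        found.append("protein-iupac" if report_iupac else "protein")
    return found or None


print(infer_all_types_sketch("ACGT"))  # ['dna', 'protein']
print(infer_all_types_sketch("ACGN"))  # ['dna-iupac', 'protein']
print(infer_all_types_sketch("ACG1"))  # None
```

The first two results show why several types can be returned at once: the letters A, C, G, T and N are valid both as nucleotide codes and as amino acid codes under these assumed alphabets.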

fqfa.util.infer.infer_sequence_type(seq: str, report_iupac: bool = True) → str | None

Infer the type of the given sequence.

Returns the first sequence type that validates given the following priority order:
  • “dna”

  • “rna”

  • “protein”

  • “dna-iupac” (DNA sequence that contains ambiguity characters)

  • “protein-iupac” (protein sequence that contains ambiguity characters)

Parameters:
  • seq (str) – The string to infer the type of.

  • report_iupac (bool) – If True, report sequence types with extended characters as “<type>-iupac”; else report only the sequence type.

Returns:

String containing the inferred sequence type if a type was inferred. None if the sequence didn’t match any sequence types.

Return type:

Optional[str]
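The priority order listed above can be sketched as an ordered list of checks in which the first match wins. As before, this is only an illustration under assumptions: infer_type_sketch is a hypothetical name, the alphabets are example sets that may not match fqfa's, and empty input returning None is a choice made for this sketch.

```python
from typing import Optional

# Assumed alphabets for illustration only; fqfa's own sets may differ.
_DNA = set("ACGT")
_RNA = set("ACGU")
_PROTEIN = set("ACDEFGHIKLMNPQRSTVWY")

# Checks are tried top to bottom, mirroring the documented priority order.
_PRIORITY = [
    ("dna", _DNA),
    ("rna", _RNA),
    ("protein", _PROTEIN),
    ("dna-iupac", _DNA | set("RYSWKMBDHVN")),
    ("protein-iupac", _PROTEIN | set("BXZ")),
]


def infer_type_sketch(seq: str, report_iupac: bool = True) -> Optional[str]:
    """Hypothetical sketch: return the first type whose alphabet fits."""
    chars = set(seq)
    if not chars:
        return None  # sketch-only choice for empty input
    for label, alphabet in _PRIORITY:
        if chars <= alphabet:
            # Drop the "-iupac" suffix when the caller does not want it.
            return label if report_iupac else label.split("-")[0]
    return None


print(infer_type_sketch("ACGT"))  # dna
print(infer_type_sketch("ACGN"))  # protein
print(infer_type_sketch("ACGB"))  # dna-iupac
```

Under these assumed alphabets, "ACGN" is reported as “protein” rather than “dna-iupac” because N is also an amino acid code and “protein” is checked first.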