Utility functions

fqfa provides basic utility functions for working with biological sequences as strings. For efficiency, these functions assume that any required validation (such as making sure all the characters in the string are valid bases) has already been performed.

fqfa includes a copy of the standard translation table; alternative translation tables can be imported using ncbi_genetic_code_to_dict().

Nucleotide sequence utility functions

fqfa.util.nucleotide.convert_dna_to_rna(seq: str) → str

Convert a DNA sequence into an RNA sequence by changing “T” to “U”.

Parameters: seq (str) – String containing DNA bases.
Returns: The equivalent RNA sequence.
Return type: str

fqfa.util.nucleotide.convert_rna_to_dna(seq: str) → str

Convert an RNA sequence into a DNA sequence by changing “U” to “T”.

Parameters: seq (str) – String containing RNA bases.
Returns: The equivalent DNA sequence.
Return type: str
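The two conversions described above are simple character substitutions. The following is a minimal sketch of that behaviour, not fqfa's implementation: the names dna_to_rna and rna_to_dna are hypothetical stand-ins, and the sketch assumes uppercase input with no validation.

```python
def dna_to_rna(seq: str) -> str:
    """Hypothetical stand-in: replace every "T" with "U"."""
    return seq.replace("T", "U")


def rna_to_dna(seq: str) -> str:
    """Hypothetical stand-in: replace every "U" with "T"."""
    return seq.replace("U", "T")


rna = dna_to_rna("ATGGCTTAA")
print(rna)              # AUGGCUUAA
print(rna_to_dna(rna))  # ATGGCTTAA
```

Because no validation is performed, any other character in the input passes through unchanged in this sketch.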

fqfa.util.nucleotide.reverse_complement(seq: str) → str

Reverse-complement a DNA sequence string and return it.

If a character not in fqfa.iupac.dna.DNA_CHARACTERS is encountered, it is retained.

Parameters: seq (str) – String containing DNA bases.
Returns: The reverse complement DNA sequence.
Return type: str
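Reverse-complementing with unknown characters retained can be sketched with a translation map. This is an illustration under stated assumptions rather than fqfa's code: revcomp is a hypothetical name, and the map covers only the four unambiguous bases, whereas fqfa.iupac.dna.DNA_CHARACTERS may cover more.

```python
# Complement map for the four unambiguous bases only (an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse the string, then complement each base.

    Characters missing from the map are left as they are by str.translate.
    """
    return seq[::-1].translate(_COMPLEMENT)


print(revcomp("ATGC"))   # GCAT
print(revcomp("AAC-G"))  # C-GTT
```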

Coding sequence translation

fqfa.util.translate.ncbi_genetic_code_to_dict(ncbi_string: str) → Dict[str, str]

Parse a translation table from NCBI into a dictionary.

The five-line table input is parsed into a dictionary representation suitable for translate_dna(). As an example, the standard genetic code (transl_table=1) is defined in CODON_TABLE.

NCBI translation tables are listed on the NCBI Taxonomy page “The Genetic Codes”.

The standard genetic code is encoded by:

  AAs  = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = ---M------**--*----M---------------M----------------------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

Information from the Starts line is not retained in the dictionary representation.

Blank lines or whitespace-only lines are automatically skipped, as are lines beginning with #.

Parameters:

ncbi_string (str) – Multi-line string containing a transl_table from NCBI.

Returns:

Dictionary mapping codons to single-letter amino acid codes.

Return type:

Dict[str, str]

Raises:

ValueError – If any of the rows is missing.
ValueError – If the row labels do not match the expected format.
ValueError – If any row does not have the expected format (<label> = <data>).
ValueError – If any of the rows fails to contain the expected number of characters (64).
ValueError – If there are duplicate codons in the table.
ValueError – If any of the BaseN rows contains a character other than ACGT.
ValueError – If the AAs row contains a character other than an amino acid.
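To illustrate how the five-line format maps onto a dictionary, here is a simplified sketch. It is not fqfa's parser: parse_transl_table is a hypothetical name, and it performs only some of the checks listed above (missing rows, row format, row length, duplicate codons). Column i of the Base1, Base2 and Base3 rows spells one codon, and column i of the AAs row gives its amino acid.

```python
from typing import Dict

STANDARD = """
  AAs  = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = ---M------**--*----M---------------M----------------------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
"""


def parse_transl_table(ncbi_string: str) -> Dict[str, str]:
    """Hypothetical, simplified parser for an NCBI transl_table string."""
    rows: Dict[str, str] = {}
    for line in ncbi_string.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        label, sep, data = line.partition("=")
        if not sep:
            raise ValueError(f"expected '<label> = <data>', got {line!r}")
        rows[label.strip()] = data.strip()

    missing = {"AAs", "Starts", "Base1", "Base2", "Base3"} - rows.keys()
    if missing:
        raise ValueError(f"missing rows: {sorted(missing)}")
    if any(len(data) != 64 for data in rows.values()):
        raise ValueError("every row must contain 64 characters")

    # The Starts row is read but deliberately not used.
    table: Dict[str, str] = {}
    for aa, b1, b2, b3 in zip(rows["AAs"], rows["Base1"], rows["Base2"], rows["Base3"]):
        codon = b1 + b2 + b3
        if codon in table:
            raise ValueError(f"duplicate codon {codon}")
        table[codon] = aa
    return table


table = parse_transl_table(STANDARD)
print(len(table), table["ATG"], table["TAA"])  # 64 M *
```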

fqfa.util.translate.translate_dna(seq: str, table: Dict[str, str] | None = None, frame: int = 0) → Tuple[str, str | None]

Translate a DNA sequence into the corresponding amino acid sequence.

Parameters:

seq (str) – String containing DNA bases to translate.
table (Optional[Dict[str, str]]) – Map from codon strings to single-letter amino acid codes, or None to use the default translation table.
frame (int) – Offset of the first base to translate; must be 0, 1, or 2.

Returns:

Tuple whose first element is the string of single-letter amino acid codes and whose second element is the string of remaining bases in a trailing partial codon (or None if there was no remainder).

Return type:

Tuple[str, Optional[str]]

Raises:

KeyError – If a full-length codon was not present in the translation table.
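The relationship between frame, the translated string and the trailing remainder can be sketched as follows. This is an illustration, not fqfa's implementation: translate and STANDARD_TABLE are hypothetical names, and the sketch assumes that translation continues through stop codons (reported as "*"), which the description above does not state either way.

```python
from itertools import product
from typing import Dict, Optional, Tuple

# Standard genetic code in NCBI order: bases cycle T, C, A, G with the third base fastest.
_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE: Dict[str, str] = {
    "".join(codon): aa for codon, aa in zip(product("TCAG", repeat=3), _AAS)
}


def translate(
    seq: str, table: Optional[Dict[str, str]] = None, frame: int = 0
) -> Tuple[str, Optional[str]]:
    """Hypothetical sketch: translate full codons, return leftover bases separately."""
    if table is None:
        table = STANDARD_TABLE
    coding = seq[frame:]                      # drop the bases before the reading frame
    full = len(coding) - len(coding) % 3      # length covered by complete codons
    # A codon missing from the table raises KeyError here.
    protein = "".join(table[coding[i:i + 3]] for i in range(0, full, 3))
    remainder = coding[full:] or None         # partial trailing codon, if any
    return protein, remainder


print(translate("ATGGCCTAA"))             # ('MA*', None)
print(translate("CATGGCCTAAG", frame=1))  # ('MA*', 'G')
```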

fqfa.constants.translation.table.CODON_TABLE: Dict[str, str] = {'AAA': 'K', 'AAC': 'N', 'AAG': 'K', 'AAT': 'N', 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T', 'AGA': 'R', 'AGC': 'S', 'AGG': 'R', 'AGT': 'S', 'ATA': 'I', 'ATC': 'I', 'ATG': 'M', 'ATT': 'I', 'CAA': 'Q', 'CAC': 'H', 'CAG': 'Q', 'CAT': 'H', 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P', 'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R', 'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L', 'GAA': 'E', 'GAC': 'D', 'GAG': 'E', 'GAT': 'D', 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A', 'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G', 'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V', 'TAA': '*', 'TAC': 'Y', 'TAG': '*', 'TAT': 'Y', 'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S', 'TGA': '*', 'TGC': 'C', 'TGG': 'W', 'TGT': 'C', 'TTA': 'L', 'TTC': 'F', 'TTG': 'L', 'TTT': 'F'}

Map from codons to single-letter amino acid codes according to the standard code. Sorted by codon.

Codon	Symbol	Amino Acid
AAA	K	Lysine
AAC	N	Asparagine
AAG	K	Lysine
AAT	N	Asparagine
ACA	T	Threonine
ACC	T	Threonine
ACG	T	Threonine
ACT	T	Threonine
AGA	R	Arginine
AGC	S	Serine
AGG	R	Arginine
AGT	S	Serine
ATA	I	Isoleucine
ATC	I	Isoleucine
ATG	M	Methionine
ATT	I	Isoleucine
CAA	Q	Glutamine
CAC	H	Histidine
CAG	Q	Glutamine
CAT	H	Histidine
CCA	P	Proline
CCC	P	Proline
CCG	P	Proline
CCT	P	Proline
CGA	R	Arginine
CGC	R	Arginine
CGG	R	Arginine
CGT	R	Arginine
CTA	L	Leucine
CTC	L	Leucine
CTG	L	Leucine
CTT	L	Leucine
GAA	E	Glutamic acid
GAC	D	Aspartic acid
GAG	E	Glutamic acid
GAT	D	Aspartic acid
GCA	A	Alanine
GCC	A	Alanine
GCG	A	Alanine
GCT	A	Alanine
GGA	G	Glycine
GGC	G	Glycine
GGG	G	Glycine
GGT	G	Glycine
GTA	V	Valine
GTC	V	Valine
GTG	V	Valine
GTT	V	Valine
TAA	*	termination codon
TAC	Y	Tyrosine
TAG	*	termination codon
TAT	Y	Tyrosine
TCA	S	Serine
TCC	S	Serine
TCG	S	Serine
TCT	S	Serine
TGA	*	termination codon
TGC	C	Cysteine
TGG	W	Tryptophan
TGT	C	Cysteine
TTA	L	Leucine
TTC	F	Phenylalanine
TTG	L	Leucine
TTT	F	Phenylalanine

Type: Dict[str, str]

Sequence type inference functions

fqfa.util.infer.infer_all_sequence_types(seq: str, report_iupac: bool = True) → List[str] | None

Return all inferred types for the given sequence.

Sequence types include:

“dna”
“dna-iupac” (DNA sequence that contains ambiguity characters)
“rna”
“protein”
“protein-iupac” (protein sequence that contains ambiguity characters)

Parameters:

seq (str) – The string to infer the type of.
report_iupac (bool) – If True, report sequence types with extended characters as “<type>-iupac”; else report only the sequence type.

Returns:

List of strings containing the inferred sequence types if any type was inferred. None if the sequence didn’t match any sequence types.

Return type:

Optional[List[str]]

fqfa.util.infer.infer_sequence_type(seq: str, report_iupac: bool = True) → str | None

Infer the type of the given sequence.

Returns the first sequence type that validates given the following priority order:

“dna”
“rna”
“protein”
“dna-iupac” (DNA sequence that contains ambiguity characters)
“protein-iupac” (protein sequence that contains ambiguity characters)

Parameters:

seq (str) – The string to infer the type of.
report_iupac (bool) – If True, report sequence types with extended characters as “<type>-iupac”; else report only the sequence type.

Returns:

String containing the inferred sequence type if a type was inferred. None if the sequence didn’t match any sequence types.

Return type:

Optional[str]
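The priority order matters because the alphabets overlap. The sketch below illustrates first-match inference under stated assumptions: infer_type is a hypothetical name, the character sets are chosen for illustration and are not fqfa's actual constants, and treating an empty string as unmatched is also an assumption.

```python
from typing import Optional

# Illustrative alphabets (assumptions of this sketch, not fqfa's constants).
DNA = set("ACGT")
RNA = set("ACGU")
PROTEIN = set("ACDEFGHIKLMNPQRSTVWY")
DNA_IUPAC = DNA | set("RYSWKMBDHVN")
PROTEIN_IUPAC = PROTEIN | set("BZX")

# Checked in the documented priority order.
_PRIORITY = [
    ("dna", DNA),
    ("rna", RNA),
    ("protein", PROTEIN),
    ("dna-iupac", DNA_IUPAC),
    ("protein-iupac", PROTEIN_IUPAC),
]


def infer_type(seq: str, report_iupac: bool = True) -> Optional[str]:
    """Hypothetical sketch: return the first type whose alphabet covers seq."""
    if not seq:  # assumption: an empty string matches nothing
        return None
    chars = set(seq)
    for name, alphabet in _PRIORITY:
        if chars <= alphabet:
            # With report_iupac=False, "dna-iupac" is reported as "dna", etc.
            return name if report_iupac else name.split("-")[0]
    return None


print(infer_type("ACGT"))                       # dna
print(infer_type("ACGTN"))                      # protein
print(infer_type("ACGTB"))                      # dna-iupac
print(infer_type("ACGTB", report_iupac=False))  # dna
```

With these alphabets, "ACGTN" is reported as "protein" because every one of its letters is also an amino acid code and "protein" is checked before "dna-iupac".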