Utility functions
fqfa provides basic utility functions for working with biological sequences as strings. For efficiency, these functions assume that any required validation (such as making sure all the characters in the string are valid bases) has already been performed.
fqfa has a copy of the standard translation table, and alternative translation tables can be imported using ncbi_genetic_code_to_dict().
Nucleotide sequence utility functions
- fqfa.util.nucleotide.convert_dna_to_rna(seq: str) -> str
Convert a DNA sequence into an RNA sequence by changing “T” to “U”.
- fqfa.util.nucleotide.convert_rna_to_dna(seq: str) -> str
Convert an RNA sequence into a DNA sequence by changing “U” to “T”.
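Both conversions amount to a single character substitution. The following is a minimal sketch, assuming uppercase input that has already been validated. The names `dna_to_rna` and `rna_to_dna` are hypothetical stand-ins written for illustration and are not fqfa's own implementation:

```python
def dna_to_rna(seq: str) -> str:
    """Replace every "T" with "U" (illustrative stand-in, not fqfa code)."""
    return seq.replace("T", "U")


def rna_to_dna(seq: str) -> str:
    """Replace every "U" with "T" (illustrative stand-in, not fqfa code)."""
    return seq.replace("U", "T")


rna = dna_to_rna("ATGTTC")
print(rna)              # AUGUUC
print(rna_to_dna(rna))  # ATGTTC
```

Because no validation is performed, characters other than “T” or “U” pass through unchanged.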
Coding sequence translation
- fqfa.util.translate.ncbi_genetic_code_to_dict(ncbi_string: str) -> Dict[str, str]
Parse a translation table from NCBI into a dictionary.
The five-line table input is parsed into a dictionary representation suitable for translate_dna(). As an example, the standard genetic code (transl_table=1) is defined in CODON_TABLE.
NCBI translation tables can be found on the NCBI Taxonomy genetic codes page.
The standard genetic code is encoded by:
AAs    = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = ---M------**--*----M---------------M----------------------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
Information from the Starts line is not retained in the dictionary representation.
Blank lines or whitespace-only lines are automatically skipped, as are lines beginning with #.
- Parameters:
ncbi_string (str) – Multi-line string containing a transl_table from NCBI.
- Returns:
Dictionary mapping codons to single-letter amino acid codes.
- Return type:
Dict[str, str]
- Raises:
ValueError – If any of the rows is missing.
ValueError – If the row labels do not match the expected format.
ValueError – If any row does not have the expected format (<label> = <data>).
ValueError – If any of the rows does not contain the expected number of characters (64).
ValueError – If there are duplicate codons in the table.
ValueError – If any of the BaseN rows contains a character other than ACGT.
ValueError – If the AAs row contains a character other than an amino acid.
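To illustrate how a five-line NCBI table maps onto a codon dictionary, here is a minimal sketch. `parse_genetic_code` is a hypothetical function written for this example, not fqfa's implementation, and it performs only a subset of the checks listed above (row format, missing rows, row length, duplicate codons):

```python
from typing import Dict

STANDARD_TABLE = """
# transl_table=1 (standard genetic code)
AAs    = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = ---M------**--*----M---------------M----------------------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
"""


def parse_genetic_code(ncbi_string: str) -> Dict[str, str]:
    """Hypothetical parser sketch; not fqfa's implementation."""
    rows: Dict[str, str] = {}
    for line in ncbi_string.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        label, sep, data = line.partition("=")
        if not sep:
            raise ValueError(f"expected '<label> = <data>', got {line!r}")
        rows[label.strip()] = data.strip()

    for label in ("AAs", "Starts", "Base1", "Base2", "Base3"):
        if label not in rows:
            raise ValueError(f"missing row {label!r}")
        if len(rows[label]) != 64:
            raise ValueError(f"row {label!r} must contain 64 characters")

    # Column i of the three BaseN rows spells the codon for column i of AAs.
    # The Starts row is checked above but not kept in the result.
    table: Dict[str, str] = {}
    for aa, b1, b2, b3 in zip(rows["AAs"], rows["Base1"], rows["Base2"], rows["Base3"]):
        codon = b1 + b2 + b3
        if codon in table:
            raise ValueError(f"duplicate codon {codon!r}")
        table[codon] = aa
    return table


table = parse_genetic_code(STANDARD_TABLE)
print(len(table), table["ATG"], table["TAA"])  # 64 M *
```

Reading the table column by column is the key idea: each of the 64 columns contributes one codon and one amino acid symbol.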
- fqfa.util.translate.translate_dna(seq: str, table: Dict[str, str] | None = None, frame: int = 0) -> Tuple[str, str | None]
Translate a DNA sequence into the corresponding amino acid sequence.
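- Parameters:
seq (str) – DNA sequence to translate.
table (Optional[Dict[str, str]]) – Translation table mapping codons to single-letter amino acid codes, such as the output of ncbi_genetic_code_to_dict(). Defaults to None.
frame (int) – Reading frame. Defaults to 0.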
- Returns:
A tuple in which the first string is the translated sequence in single-letter amino acid codes and the second string contains any remaining bases from a trailing partial codon (or None if there was no remainder).
- Return type:
Tuple[str, Optional[str]]
- Raises:
KeyError – If a full-length codon was not present in the translation table.
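The return value is easiest to understand from a worked example. The sketch below is a hypothetical re-implementation named `translate`, not fqfa's code. It assumes that `frame` is an offset into the sequence and that a `table` of None falls back to the standard code; neither behaviour is confirmed by the text above.

```python
from itertools import product
from typing import Dict, Optional, Tuple

# Standard genetic code, built from the NCBI "AAs" row in TCAG order.
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD: Dict[str, str] = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AAS)
}


def translate(
    seq: str, table: Optional[Dict[str, str]] = None, frame: int = 0
) -> Tuple[str, Optional[str]]:
    """Hypothetical translation sketch; not fqfa's implementation."""
    if table is None:
        table = STANDARD  # assumption: fall back to the standard code
    seq = seq[frame:]  # assumption: frame is an offset into the sequence
    full = len(seq) - len(seq) % 3  # length covered by complete codons
    # A codon missing from the table raises KeyError.
    protein = "".join(table[seq[i:i + 3]] for i in range(0, full, 3))
    remainder = seq[full:] or None  # trailing partial codon, if any
    return protein, remainder


print(translate("ATGGCCTAAGG"))          # ('MA*', 'GG')
print(translate("CATGGCCTAA", frame=1))  # ('MA*', None)
```

Returning the leftover bases separately lets the caller decide whether a partial codon is an error or expected.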
- fqfa.constants.translation.table.CODON_TABLE: Dict[str, str] = {'AAA': 'K', 'AAC': 'N', 'AAG': 'K', 'AAT': 'N', 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T', 'AGA': 'R', 'AGC': 'S', 'AGG': 'R', 'AGT': 'S', 'ATA': 'I', 'ATC': 'I', 'ATG': 'M', 'ATT': 'I', 'CAA': 'Q', 'CAC': 'H', 'CAG': 'Q', 'CAT': 'H', 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P', 'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R', 'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L', 'GAA': 'E', 'GAC': 'D', 'GAG': 'E', 'GAT': 'D', 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A', 'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G', 'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V', 'TAA': '*', 'TAC': 'Y', 'TAG': '*', 'TAT': 'Y', 'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S', 'TGA': '*', 'TGC': 'C', 'TGG': 'W', 'TGT': 'C', 'TTA': 'L', 'TTC': 'F', 'TTG': 'L', 'TTT': 'F'}
Map from codons to single-letter amino acid codes according to the standard code. Sorted by codon.
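| Codon | Symbol | Amino Acid |
|-------|--------|------------|
| AAA | K | Lysine |
| AAC | N | Asparagine |
| AAG | K | Lysine |
| AAT | N | Asparagine |
| ACA | T | Threonine |
| ACC | T | Threonine |
| ACG | T | Threonine |
| ACT | T | Threonine |
| AGA | R | Arginine |
| AGC | S | Serine |
| AGG | R | Arginine |
| AGT | S | Serine |
| ATA | I | Isoleucine |
| ATC | I | Isoleucine |
| ATG | M | Methionine |
| ATT | I | Isoleucine |
| CAA | Q | Glutamine |
| CAC | H | Histidine |
| CAG | Q | Glutamine |
| CAT | H | Histidine |
| CCA | P | Proline |
| CCC | P | Proline |
| CCG | P | Proline |
| CCT | P | Proline |
| CGA | R | Arginine |
| CGC | R | Arginine |
| CGG | R | Arginine |
| CGT | R | Arginine |
| CTA | L | Leucine |
| CTC | L | Leucine |
| CTG | L | Leucine |
| CTT | L | Leucine |
| GAA | E | Glutamic acid |
| GAC | D | Aspartic acid |
| GAG | E | Glutamic acid |
| GAT | D | Aspartic acid |
| GCA | A | Alanine |
| GCC | A | Alanine |
| GCG | A | Alanine |
| GCT | A | Alanine |
| GGA | G | Glycine |
| GGC | G | Glycine |
| GGG | G | Glycine |
| GGT | G | Glycine |
| GTA | V | Valine |
| GTC | V | Valine |
| GTG | V | Valine |
| GTT | V | Valine |
| TAA | * | termination codon |
| TAC | Y | Tyrosine |
| TAG | * | termination codon |
| TAT | Y | Tyrosine |
| TCA | S | Serine |
| TCC | S | Serine |
| TCG | S | Serine |
| TCT | S | Serine |
| TGA | * | termination codon |
| TGC | C | Cysteine |
| TGG | W | Tryptophan |
| TGT | C | Cysteine |
| TTA | L | Leucine |
| TTC | F | Phenylalanine |
| TTG | L | Leucine |
| TTT | F | Phenylalanine |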
Sequence type inference functions
- fqfa.util.infer.infer_all_sequence_types(seq: str, report_iupac: bool = True) -> List[str] | None
Return all inferred types for the given sequence.
- Sequence types include:
“dna”
“dna-iupac” (DNA sequence that contains ambiguity characters)
“rna”
“protein”
“protein-iupac” (protein sequence that contains ambiguity characters)
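- Parameters:
seq (str) – Sequence whose types are inferred.
report_iupac (bool) – Whether to report the IUPAC ambiguity types (“dna-iupac”, “protein-iupac”). Defaults to True.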
- Returns:
List of strings containing the inferred sequence types if any type was inferred. None if the sequence didn’t match any sequence types.
- Return type:
Optional[List[str]]
- fqfa.util.infer.infer_sequence_type(seq: str, report_iupac: bool = True) -> str | None
Infer the type of the given sequence.
- Returns the first sequence type that validates given the following priority order:
“dna”
“rna”
“protein”
“dna-iupac” (DNA sequence that contains ambiguity characters)
“protein-iupac” (protein sequence that contains ambiguity characters)
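- Parameters:
seq (str) – Sequence whose type is inferred.
report_iupac (bool) – Whether to report the IUPAC ambiguity types (“dna-iupac”, “protein-iupac”). Defaults to True.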
- Returns:
String containing the inferred sequence type if a type was inferred. None if the sequence didn’t match any sequence types.
- Return type:
Optional[str]
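The priority order matters because the alphabets overlap: a string such as “ACGT” is valid DNA but also consists entirely of amino acid letters. The sketch below shows first-match inference in the order listed above. The function `first_matching_type` and the alphabets in `CHECKS` are assumptions made for illustration; fqfa's own validators may accept different character sets, and the `report_iupac` option is not modelled here.

```python
from typing import FrozenSet, List, Optional, Tuple

# Assumed alphabets for illustration; fqfa's own validators may differ.
CHECKS: List[Tuple[str, FrozenSet[str]]] = [
    ("dna", frozenset("ACGT")),
    ("rna", frozenset("ACGU")),
    ("protein", frozenset("ACDEFGHIKLMNPQRSTVWY*")),
    ("dna-iupac", frozenset("ACGTRYSWKMBDHVN")),
    ("protein-iupac", frozenset("ACDEFGHIKLMNPQRSTVWY*BXZ")),
]


def first_matching_type(seq: str) -> Optional[str]:
    """Return the first type whose alphabet covers seq (hypothetical sketch)."""
    letters = set(seq)
    for name, alphabet in CHECKS:
        if letters <= alphabet:  # every character is in this alphabet
            return name
    return None


print(first_matching_type("ACGT"))   # dna
print(first_matching_type("ACGU"))   # rna
print(first_matching_type("MKVL"))   # protein
print(first_matching_type("ACGTB"))  # dna-iupac
print(first_matching_type("ACGT?"))  # None
```

Under these assumed alphabets, “ACGTN” is reported as “protein” rather than “dna-iupac”, because N is also the symbol for asparagine and “protein” is checked first.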