Sequence validation

fqfa implements regular expression-based sequence validators. It provides several commonly used validators based on IUPAC codes, as well as a function, create_validator(), for creating new callable validators from a string or list of characters. create_validator() can also be used to create case-insensitive versions of the provided validators.

fqfa.validator.validator.amino_acids_all_validator(string, pos=0, endpos=9223372036854775807)

Callable[[str, int, int], Optional[Match[str]]]: validator for amino acids including ambiguous amino acids.

Returns a match object if all characters in the string are single-letter amino acid codes found in AA_CODES_ALL.

fqfa.validator.validator.amino_acids_validator(string, pos=0, endpos=9223372036854775807)

Callable[[str, int, int], Optional[Match[str]]]: validator for amino acids.

Returns a match object if all characters in the string are single-letter amino acid codes found in AA_CODES.

fqfa.validator.validator.dna_bases_validator(string, pos=0, endpos=9223372036854775807)

Callable[[str, int, int], Optional[Match[str]]]: validator for DNA bases.

Returns a match object if all characters in the string are found in DNA_BASES.

fqfa.validator.validator.dna_characters_validator(string, pos=0, endpos=9223372036854775807)

Callable[[str, int, int], Optional[Match[str]]]: validator for DNA bases and ambiguity characters.

Returns a match object if all characters in the string are found in DNA_CHARACTERS.

fqfa.validator.validator.rna_bases_validator(string, pos=0, endpos=9223372036854775807)

Callable[[str, int, int], Optional[Match[str]]]: validator for RNA bases.

Returns a match object if all characters in the string are found in RNA_BASES.
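The signature shared by the validators above, (string, pos=0, endpos=9223372036854775807), has the same shape as the fullmatch method of a compiled Python regular expression. The sketch below does not import fqfa; it uses a hypothetical stand-in, dna_like_validator, built with the standard re module, and it assumes that at least one character is required for a match. It is meant only to illustrate how the return value and the optional pos and endpos arguments behave for a callable of this kind.

```python
import re

# Hypothetical stand-in for a validator such as dna_bases_validator:
# the bound fullmatch method of a compiled character-class pattern.
# Assumption: one or more characters are required ('+').
dna_like_validator = re.compile(r"[ACGT]+").fullmatch

# A Match object is truthy and None is falsy, so the result can be
# used directly in a conditional.
ok = dna_like_validator("ACGTTGCA")
bad = dna_like_validator("ACGTNGCA")  # 'N' is not one of the four bases

# pos and endpos restrict validation to the region [pos, endpos).
partial = dna_like_validator("ACGTNGCA", 0, 4)  # only "ACGT" is checked

print(ok is not None, bad is None, partial.group(0))
```

Because pos and endpos limit the region that is checked, a call that supplies them can succeed even when the full string contains invalid characters.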

fqfa.validator.create.create_validator(valid_characters: str | List[str], case_sensitive: bool = True) → Callable[[str], Match[str] | None]

Generates a callable, regular expression-based sequence validator.

When called on a given string, the validator will return a Match object if every character is one of the valid_characters, else None.

Parameters:

valid_characters (Union[str, List[str]]) – A string or list of single-character strings defining the set of valid characters.
case_sensitive (bool) – If False, both upper- and lower-case forms of the characters in valid_characters are accepted. Defaults to True.

Returns:

Callable validator that uses re.fullmatch.

Return type:

Callable[[str, int, int], Optional[Match[str]]]

Raises:

ValueError – If valid_characters is a list and any entry contains more than one character.
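The behavior described for create_validator() can be approximated with the standard re module. The function below, sketch_create_validator, is a hypothetical illustration written for this page and is not the fqfa source code. In particular, escaping each character and requiring at least one character ('+') are assumptions.

```python
import re
from typing import Callable, List, Optional, Union


def sketch_create_validator(
    valid_characters: Union[str, List[str]], case_sensitive: bool = True
) -> Callable[..., Optional[re.Match]]:
    """Illustrative approximation only; not the fqfa implementation."""
    if isinstance(valid_characters, list):
        # Mirror the documented ValueError for multi-character list entries.
        if any(len(entry) != 1 for entry in valid_characters):
            raise ValueError("list entries must be single characters")
        valid_characters = "".join(valid_characters)
    # Escape each character so symbols such as '*' or '-' stay literal
    # inside the character class (an assumption of this sketch).
    charset = "".join(re.escape(char) for char in valid_characters)
    flags = 0 if case_sensitive else re.IGNORECASE
    # Return the bound fullmatch method, so the validator also accepts
    # the optional pos and endpos arguments.
    return re.compile(f"[{charset}]+", flags).fullmatch


# A case-insensitive validator for the four DNA bases.
dna_ci = sketch_create_validator("ACGT", case_sensitive=False)
print(dna_ci("acgtACGT") is not None)
```

Returning the bound fullmatch method keeps the validator a plain callable while leaving all matching logic to the compiled pattern.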

IUPAC codes

fqfa includes the International Union of Pure and Applied Chemistry (IUPAC) notation for degenerate bases. A mapping between single- and three-letter amino acid codes is also included. Validation based on single-letter amino acid codes can be accomplished by using the keys of the mapping.
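As a rough illustration of validating against the keys of the mapping, the sketch below copies the AA_CODES values listed on this page into a local dictionary so that it runs without fqfa installed. The name aa_validator is hypothetical, and the pattern construction is an assumption rather than the library's own code.

```python
import re

# Local copy of the AA_CODES mapping as listed in this document.
AA_CODES = {
    "*": "Ter", "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu",
    "F": "Phe", "G": "Gly", "H": "His", "I": "Ile", "K": "Lys",
    "L": "Leu", "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln",
    "R": "Arg", "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp",
    "Y": "Tyr",
}

# Iterating over a dict yields its keys, which are the valid
# single-letter codes. '*' is escaped so it is treated literally.
charset = "".join(re.escape(code) for code in AA_CODES)
aa_validator = re.compile(f"[{charset}]+").fullmatch

print(aa_validator("MKV*") is not None)
print(aa_validator("MKVX") is None)  # 'X' is an ambiguity code, not in AA_CODES
```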

DNA sequences

fqfa.constants.iupac.dna.DNA_AMBIGUITY: List[str] = ['W', 'S', 'M', 'K', 'R', 'Y', 'B', 'D', 'H', 'V', 'N']

IUPAC ambiguity characters for DNA sequence.

Symbol	Description	Bases
W	Weak	AT
S	Strong	GC
M	aMino	AC
K	Keto	GT
R	puRine	AG
Y	pYrimidine	CT
B	not A	CGT
D	not C	AGT
H	not G	ACT
V	not T	ACG
N	any Nucleotide	ACGT

Type: List[str]
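One way to put the ambiguity table to use is to expand a degenerate sequence into a regular expression that matches every concrete sequence it represents. The sketch below is not part of fqfa: AMBIGUITY_BASES and degenerate_to_regex are hypothetical names, and the symbol-to-bases mapping is copied from the table above. It assumes the input has already been validated as upper-case DNA characters.

```python
import re

# Symbol -> bases, copied from the ambiguity table above.
AMBIGUITY_BASES = {
    "W": "AT", "S": "GC", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degenerate_to_regex(sequence: str) -> str:
    """Translate a degenerate DNA sequence into a regex pattern string.

    Assumes the sequence contains only valid upper-case DNA characters.
    """
    parts = []
    for symbol in sequence:
        # Unambiguous bases (A, C, G, T) map to themselves.
        bases = AMBIGUITY_BASES.get(symbol, symbol)
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


pattern = degenerate_to_regex("GATNAC")
print(pattern)  # GAT[ACGT]AC
print(re.fullmatch(pattern, "GATCAC") is not None)
```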

fqfa.constants.iupac.dna.DNA_BASES: List[str] = ['A', 'C', 'G', 'T']

The four DNA bases.

Symbol	Description
A	Adenine
C	Cytosine
G	Guanine
T	Thymine

Type: List[str]

fqfa.constants.iupac.dna.DNA_CHARACTERS: List[str] = ['A', 'C', 'G', 'T', 'W', 'S', 'M', 'K', 'R', 'Y', 'B', 'D', 'H', 'V', 'N']

Bases and IUPAC ambiguity characters for DNA sequence.

Symbol	Description	Bases
A	Adenine	A
C	Cytosine	C
G	Guanine	G
T	Thymine	T
W	Weak	AT
S	Strong	GC
M	aMino	AC
K	Keto	GT
R	puRine	AG
Y	pYrimidine	CT
B	not A	CGT
D	not C	AGT
H	not G	ACT
V	not T	ACG
N	any Nucleotide	ACGT

Type: List[str]

fqfa.constants.iupac.dna.DNA_COMPLEMENTS: Dict[str, str] = {'A': 'T', 'B': 'V', 'C': 'G', 'D': 'H', 'G': 'C', 'H': 'D', 'K': 'M', 'M': 'K', 'N': 'N', 'R': 'Y', 'S': 'S', 'T': 'A', 'V': 'B', 'W': 'W', 'Y': 'R'}

Map for complementing DNA sequences.

Symbol	Complement	Bases	Comp. Bases
A	T	A	T
C	G	C	G
G	C	G	C
T	A	T	A
W	W	AT	AT
S	S	GC	GC
M	K	AC	GT
K	M	GT	AC
R	Y	AG	CT
Y	R	CT	AG
B	V	CGT	ACG
D	H	AGT	ACT
H	D	ACT	AGT
V	B	ACG	CGT
N	N	ACGT	ACGT

Type: Dict[str, str]
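A complement map of this kind is typically used to compute a reverse complement. The sketch below copies the DNA_COMPLEMENTS values listed above into a local dictionary so that it runs without fqfa installed. The reverse_complement helper is a hypothetical name introduced here, not a documented fqfa function.

```python
# Local copy of the DNA_COMPLEMENTS mapping as listed in this document.
DNA_COMPLEMENTS = {
    "A": "T", "B": "V", "C": "G", "D": "H", "G": "C",
    "H": "D", "K": "M", "M": "K", "N": "N", "R": "Y",
    "S": "S", "T": "A", "V": "B", "W": "W", "Y": "R",
}


def reverse_complement(sequence: str) -> str:
    """Complement each character and reverse the order.

    Raises KeyError for characters missing from the map, including
    lower-case letters.
    """
    return "".join(DNA_COMPLEMENTS[base] for base in reversed(sequence))


print(reverse_complement("ACGTRN"))  # NYACGT
```

Applying the function twice returns the original sequence, because every symbol in the map is the complement of its own complement.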

RNA sequences

fqfa.constants.iupac.rna.RNA_BASES: List[str] = ['A', 'C', 'G', 'U']

The four RNA bases.

Symbol	Description
A	Adenine
C	Cytosine
G	Guanine
U	Uracil

Type: List[str]

Amino acid sequences

fqfa.constants.iupac.protein.AA_CODES: Dict[str, str] = {'*': 'Ter', 'A': 'Ala', 'C': 'Cys', 'D': 'Asp', 'E': 'Glu', 'F': 'Phe', 'G': 'Gly', 'H': 'His', 'I': 'Ile', 'K': 'Lys', 'L': 'Leu', 'M': 'Met', 'N': 'Asn', 'P': 'Pro', 'Q': 'Gln', 'R': 'Arg', 'S': 'Ser', 'T': 'Thr', 'V': 'Val', 'W': 'Trp', 'Y': 'Tyr'}

Map from single-letter amino acid codes to three-letter codes. The table below is sorted by three-letter code, with the termination codon listed last.

Single-letter	Three-letter	Amino Acid
A	Ala	Alanine
R	Arg	Arginine
N	Asn	Asparagine
D	Asp	Aspartic acid (Aspartate)
C	Cys	Cysteine
Q	Gln	Glutamine
E	Glu	Glutamic acid (Glutamate)
G	Gly	Glycine
H	His	Histidine
I	Ile	Isoleucine
L	Leu	Leucine
K	Lys	Lysine
M	Met	Methionine
F	Phe	Phenylalanine
P	Pro	Proline
S	Ser	Serine
T	Thr	Threonine
W	Trp	Tryptophan
Y	Tyr	Tyrosine
V	Val	Valine
*	Ter	termination codon

Type: Dict[str, str]
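A mapping like this can be used to convert a protein sequence between single-letter and three-letter notation. The sketch below works on a local copy of the AA_CODES values listed above. The names to_three_letter and THREE_TO_ONE are hypothetical and are not part of the documented fqfa API.

```python
# Local copy of the AA_CODES mapping as listed in this document.
AA_CODES = {
    "*": "Ter", "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu",
    "F": "Phe", "G": "Gly", "H": "His", "I": "Ile", "K": "Lys",
    "L": "Leu", "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln",
    "R": "Arg", "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp",
    "Y": "Tyr",
}

# The three-letter codes are unique, so the mapping can be inverted.
THREE_TO_ONE = {three: one for one, three in AA_CODES.items()}


def to_three_letter(protein: str, sep: str = "-") -> str:
    """Convert single-letter codes to three-letter codes joined by sep.

    Raises KeyError for any character that is not a key of AA_CODES.
    """
    return sep.join(AA_CODES[residue] for residue in protein)


print(to_three_letter("MKV*"))  # Met-Lys-Val-Ter
print(THREE_TO_ONE["Ter"])      # *
```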

fqfa.constants.iupac.protein.AA_CODES_ALL: Dict[str, str] = {'*': 'Ter', 'A': 'Ala', 'B': 'Asx', 'C': 'Cys', 'D': 'Asp', 'E': 'Glu', 'F': 'Phe', 'G': 'Gly', 'H': 'His', 'I': 'Ile', 'K': 'Lys', 'L': 'Leu', 'M': 'Met', 'N': 'Asn', 'P': 'Pro', 'Q': 'Gln', 'R': 'Arg', 'S': 'Ser', 'T': 'Thr', 'V': 'Val', 'W': 'Trp', 'X': 'Xaa', 'Y': 'Tyr', 'Z': 'Glx'}

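Map from all single-letter amino acid codes, including ambiguity codes, to three-letter codes. The table below is sorted by three-letter code, with the termination codon and the ambiguity codes listed last.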

Single-letter	Three-letter	Amino Acid
A	Ala	Alanine
R	Arg	Arginine
N	Asn	Asparagine
D	Asp	Aspartic acid (Aspartate)
C	Cys	Cysteine
Q	Gln	Glutamine
E	Glu	Glutamic acid (Glutamate)
G	Gly	Glycine
H	His	Histidine
I	Ile	Isoleucine
L	Leu	Leucine
K	Lys	Lysine
M	Met	Methionine
F	Phe	Phenylalanine
P	Pro	Proline
S	Ser	Serine
T	Thr	Threonine
W	Trp	Tryptophan
Y	Tyr	Tyrosine
V	Val	Valine
*	Ter	termination codon
B	Asx	Aspartic acid or Asparagine
Z	Glx	Glutamine or Glutamic acid
X	Xaa	Any amino acid

Type: Dict[str, str]

fqfa.constants.iupac.protein.AA_CODES_AMBIGUITY: Dict[str, str] = {'B': 'Asx', 'X': 'Xaa', 'Z': 'Glx'}

Map from ambiguous single-letter amino acid codes to three-letter codes. Sorted by three-letter code.

Single-letter	Three-letter	Amino Acid
B	Asx	Aspartic acid or Asparagine
Z	Glx	Glutamine or Glutamic acid
X	Xaa	Any amino acid

Type: Dict[str, str]