Sequence validation
fqfa implements regular expression-based sequence validators.
There are several commonly-used validators based on IUPAC codes,
as well as a function for creating new callable validators from a string or list of characters.
The create_validator() function can also be used to create case-insensitive versions of the provided validators.
- fqfa.validator.validator.amino_acids_all_validator(string, pos=0, endpos=9223372036854775807)
Callable[[str, int, int], Optional[Match[str]]]: validator for amino acids including ambiguous amino acids.
Returns a match object if all characters in the string are single-letter amino acid codes found in AA_CODES_ALL.
- fqfa.validator.validator.amino_acids_validator(string, pos=0, endpos=9223372036854775807)
Callable[[str, int, int], Optional[Match[str]]]: validator for amino acids.
Returns a match object if all characters in the string are single-letter amino acid codes found in AA_CODES.
- fqfa.validator.validator.dna_bases_validator(string, pos=0, endpos=9223372036854775807)
Callable[[str, int, int], Optional[Match[str]]]: validator for DNA bases.
Returns a match object if all characters in the string are found in DNA_BASES.
- fqfa.validator.validator.dna_characters_validator(string, pos=0, endpos=9223372036854775807)
Callable[[str, int, int], Optional[Match[str]]]: validator for DNA bases and ambiguity characters.
Returns a match object if all characters in the string are found in DNA_CHARACTERS.
- fqfa.validator.validator.rna_bases_validator(string, pos=0, endpos=9223372036854775807)
Callable[[str, int, int], Optional[Match[str]]]: validator for RNA bases.
Returns a match object if all characters in the string are found in RNA_BASES.
- fqfa.validator.create.create_validator(valid_characters: str | List[str], case_sensitive: bool = True) -> Callable[[str], Match[str] | None]
Function that generates a callable, regular expression-based sequence validator.
When called on a string, the validator returns a Match object if every character is one of the valid_characters, and None otherwise.
- Parameters:
valid_characters (str | List[str]) – Characters the validator accepts, given as a string or as a list of single-character strings.
case_sensitive (bool) – Whether the validator is case-sensitive. Defaults to True.
- Returns:
Callable validator that uses re.fullmatch.
- Return type:
Callable[[str], Match[str] | None]
- Raises:
ValueError – If valid_characters is a list containing multiple characters per entry.
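The behaviour described for create_validator() can be illustrated with the standard library alone. The function below, `create_validator_sketch`, is a hypothetical stand-in written for this page and is not fqfa's source code. It assumes that the character set is compiled into a single character class and that an empty string does not validate; the real implementation may differ on both points.

```python
import re
from typing import Callable, List, Optional, Union


def create_validator_sketch(
    valid_characters: Union[str, List[str]], case_sensitive: bool = True
) -> Callable[[str], Optional["re.Match[str]"]]:
    """Hypothetical stand-in for create_validator(); not fqfa's actual code."""
    if isinstance(valid_characters, list):
        # Mirror the documented ValueError for multi-character list entries.
        if any(len(c) != 1 for c in valid_characters):
            raise ValueError("each list entry must be a single character")
        valid_characters = "".join(valid_characters)
    # Escape each character so symbols such as '*' or '-' are taken literally.
    charset = "".join(re.escape(c) for c in valid_characters)
    flags = 0 if case_sensitive else re.IGNORECASE
    # Assumption: one or more valid characters, so "" does not validate.
    pattern = re.compile(f"[{charset}]+", flags)
    # The bound fullmatch method also accepts optional pos and endpos arguments.
    return pattern.fullmatch


dna_validator = create_validator_sketch("ACGT")
dna_validator_ci = create_validator_sketch(list("ACGT"), case_sensitive=False)

print(dna_validator("ACGT") is not None)     # True
print(dna_validator("acgt") is not None)     # False
print(dna_validator_ci("acgt") is not None)  # True
```

Returning the compiled pattern's bound `fullmatch` method keeps the validator a plain callable, so a truthiness check on the result is enough to accept or reject a sequence.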
IUPAC codes
fqfa includes the International Union of Pure and Applied Chemistry (IUPAC) notation for degenerate bases. A mapping between single- and three-letter amino acid codes is also included. Validation based on single-letter amino acid codes can be accomplished by using the keys of the mapping.
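As a rough illustration of validating against the keys of a mapping, the sketch below builds a validator from a local copy of the AA_CODES dictionary documented on this page. It uses only the standard `re` module so that it runs without fqfa installed; the name `is_amino_acid_sequence` is hypothetical. With fqfa available, passing `list(AA_CODES)` to create_validator() should be the equivalent call, going by the documented signature.

```python
import re

# Local copy of the AA_CODES mapping documented on this page.
AA_CODES = {
    "*": "Ter", "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu", "M": "Met",
    "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg", "S": "Ser", "T": "Thr",
    "V": "Val", "W": "Trp", "Y": "Tyr",
}

# Iterating over a dict yields its keys, i.e. the single-letter codes.
# re.escape is needed because '*' is a regular expression metacharacter.
charset = "".join(re.escape(code) for code in AA_CODES)
is_amino_acid_sequence = re.compile(f"[{charset}]+").fullmatch

print(is_amino_acid_sequence("MKT*") is not None)  # True
print(is_amino_acid_sequence("MKTX") is not None)  # False: 'X' is an ambiguity code
```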
DNA sequences
- fqfa.constants.iupac.dna.DNA_AMBIGUITY: List[str] = ['W', 'S', 'M', 'K', 'R', 'Y', 'B', 'D', 'H', 'V', 'N']
IUPAC ambiguity characters for DNA sequence.

| Symbol | Description | Bases |
| --- | --- | --- |
| W | Weak | AT |
| S | Strong | GC |
| M | aMino | AC |
| K | Keto | GT |
| R | puRine | AG |
| Y | pYrimidine | CT |
| B | not A | CGT |
| D | not C | AGT |
| H | not G | ACT |
| V | not T | ACG |
| N | any Nucleotide | ACGT |

- Type:
List[str]
- fqfa.constants.iupac.dna.DNA_BASES: List[str] = ['A', 'C', 'G', 'T']
The four DNA bases.

| Symbol | Description |
| --- | --- |
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| T | Thymine |

- Type:
List[str]
- fqfa.constants.iupac.dna.DNA_CHARACTERS: List[str] = ['A', 'C', 'G', 'T', 'W', 'S', 'M', 'K', 'R', 'Y', 'B', 'D', 'H', 'V', 'N']
Bases and IUPAC ambiguity characters for DNA sequence.

| Symbol | Description | Bases |
| --- | --- | --- |
| A | Adenine | A |
| C | Cytosine | C |
| G | Guanine | G |
| T | Thymine | T |
| W | Weak | AT |
| S | Strong | GC |
| M | aMino | AC |
| K | Keto | GT |
| R | puRine | AG |
| Y | pYrimidine | CT |
| B | not A | CGT |
| D | not C | AGT |
| H | not G | ACT |
| V | not T | ACG |
| N | any Nucleotide | ACGT |

- Type:
List[str]
- fqfa.constants.iupac.dna.DNA_COMPLEMENTS: Dict[str, str] = {'A': 'T', 'B': 'V', 'C': 'G', 'D': 'H', 'G': 'C', 'H': 'D', 'K': 'M', 'M': 'K', 'N': 'N', 'R': 'Y', 'S': 'S', 'T': 'A', 'V': 'B', 'W': 'W', 'Y': 'R'}
Map for complementing DNA sequences.

| Symbol | Complement | Bases | Comp. Bases |
| --- | --- | --- | --- |
| A | T | A | T |
| C | G | C | G |
| G | C | G | C |
| T | A | T | A |
| W | W | AT | AT |
| S | S | GC | GC |
| M | K | AC | GT |
| K | M | GT | AC |
| R | Y | AG | CT |
| Y | R | CT | AG |
| B | V | CGT | ACG |
| D | H | AGT | ACT |
| H | D | ACT | AGT |
| V | B | ACG | CGT |
| N | N | ACGT | ACGT |
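One common use of a complement map like DNA_COMPLEMENTS is computing a reverse complement that also handles ambiguity characters. The sketch below works on a local copy of the mapping shown above so that it runs without fqfa installed. The helper `reverse_complement` is a hypothetical name introduced for illustration and is not claimed to be part of fqfa's API.

```python
# Local copy of the DNA_COMPLEMENTS mapping documented on this page.
DNA_COMPLEMENTS = {
    "A": "T", "B": "V", "C": "G", "D": "H", "G": "C", "H": "D", "K": "M",
    "M": "K", "N": "N", "R": "Y", "S": "S", "T": "A", "V": "B", "W": "W",
    "Y": "R",
}


def reverse_complement(seq: str) -> str:
    """Complement each symbol while walking the sequence back to front.

    Raises KeyError for any symbol missing from the map, including
    lowercase letters, since the map holds uppercase keys only.
    """
    return "".join(DNA_COMPLEMENTS[symbol] for symbol in reversed(seq))


print(reverse_complement("AAGC"))  # GCTT
print(reverse_complement("ARN"))   # NYT
```

Because every ambiguity symbol maps to the symbol for the complemented base set (for example R, meaning A or G, maps to Y, meaning C or T), applying the function twice returns the original sequence.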
RNA sequences
Amino acid sequences
- fqfa.constants.iupac.protein.AA_CODES: Dict[str, str] = {'*': 'Ter', 'A': 'Ala', 'C': 'Cys', 'D': 'Asp', 'E': 'Glu', 'F': 'Phe', 'G': 'Gly', 'H': 'His', 'I': 'Ile', 'K': 'Lys', 'L': 'Leu', 'M': 'Met', 'N': 'Asn', 'P': 'Pro', 'Q': 'Gln', 'R': 'Arg', 'S': 'Ser', 'T': 'Thr', 'V': 'Val', 'W': 'Trp', 'Y': 'Tyr'}
Map from single-letter amino acid codes to three-letter codes. Sorted by three-letter code.

| Single-letter | Three-letter | Amino Acid |
| --- | --- | --- |
| A | Ala | Alanine |
| R | Arg | Arginine |
| N | Asn | Asparagine |
| D | Asp | Aspartic acid (Aspartate) |
| C | Cys | Cysteine |
| Q | Gln | Glutamine |
| E | Glu | Glutamic acid (Glutamate) |
| G | Gly | Glycine |
| H | His | Histidine |
| I | Ile | Isoleucine |
| L | Leu | Leucine |
| K | Lys | Lysine |
| M | Met | Methionine |
| F | Phe | Phenylalanine |
| P | Pro | Proline |
| S | Ser | Serine |
| T | Thr | Threonine |
| W | Trp | Tryptophan |
| Y | Tyr | Tyrosine |
| V | Val | Valine |
| * | Ter | termination codon |

- fqfa.constants.iupac.protein.AA_CODES_ALL: Dict[str, str] = {'*': 'Ter', 'A': 'Ala', 'B': 'Asx', 'C': 'Cys', 'D': 'Asp', 'E': 'Glu', 'F': 'Phe', 'G': 'Gly', 'H': 'His', 'I': 'Ile', 'K': 'Lys', 'L': 'Leu', 'M': 'Met', 'N': 'Asn', 'P': 'Pro', 'Q': 'Gln', 'R': 'Arg', 'S': 'Ser', 'T': 'Thr', 'V': 'Val', 'W': 'Trp', 'X': 'Xaa', 'Y': 'Tyr', 'Z': 'Glx'}
Map from all single-letter amino acid codes to three-letter codes. Sorted by three-letter code.

| Single-letter | Three-letter | Amino Acid |
| --- | --- | --- |
| A | Ala | Alanine |
| R | Arg | Arginine |
| N | Asn | Asparagine |
| D | Asp | Aspartic acid (Aspartate) |
| C | Cys | Cysteine |
| Q | Gln | Glutamine |
| E | Glu | Glutamic acid (Glutamate) |
| G | Gly | Glycine |
| H | His | Histidine |
| I | Ile | Isoleucine |
| L | Leu | Leucine |
| K | Lys | Lysine |
| M | Met | Methionine |
| F | Phe | Phenylalanine |
| P | Pro | Proline |
| S | Ser | Serine |
| T | Thr | Threonine |
| W | Trp | Tryptophan |
| Y | Tyr | Tyrosine |
| V | Val | Valine |
| * | Ter | termination codon |
| B | Asx | Aspartic acid or Asparagine |
| Z | Glx | Glutamine or Glutamic acid |
| X | Xaa | Any amino acid |

- fqfa.constants.iupac.protein.AA_CODES_AMBIGUITY: Dict[str, str] = {'B': 'Asx', 'X': 'Xaa', 'Z': 'Glx'}
Map from ambiguous single-letter amino acid codes to three-letter codes. Sorted by three-letter code.

| Single-letter | Three-letter | Amino Acid |
| --- | --- | --- |
| B | Asx | Aspartic acid or Asparagine |
| Z | Glx | Glutamine or Glutamic acid |
| X | Xaa | Any amino acid |
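The three mappings above relate to each other in a simple way: going by the values listed, AA_CODES_ALL contains the entries of AA_CODES plus those of AA_CODES_AMBIGUITY. The sketch below rebuilds the combined map from local copies and uses it to translate a single-letter sequence into three-letter codes. The helper `to_three_letter` and the `-` separator are hypothetical choices made for illustration, not part of fqfa.

```python
# Local copies of the mappings documented on this page.
AA_CODES = {
    "*": "Ter", "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu", "M": "Met",
    "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg", "S": "Ser", "T": "Thr",
    "V": "Val", "W": "Trp", "Y": "Tyr",
}
AA_CODES_AMBIGUITY = {"B": "Asx", "X": "Xaa", "Z": "Glx"}

# Merging the two reproduces the 24 entries listed for AA_CODES_ALL.
AA_CODES_ALL = {**AA_CODES, **AA_CODES_AMBIGUITY}


def to_three_letter(seq: str, sep: str = "-") -> str:
    """Translate single-letter codes; raises KeyError on unknown symbols."""
    return sep.join(AA_CODES_ALL[code] for code in seq)


print(to_three_letter("MKT*"))  # Met-Lys-Thr-Ter
print(to_three_letter("MBX"))   # Met-Asx-Xaa
```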