Sequence validation

fqfa implements sequence validators based on regular expressions. Several commonly used validators built on IUPAC codes are provided, along with a function, create_validator(), that builds new callable validators from a string or list of characters. create_validator() can also be used to create case-insensitive versions of the provided validators.

fqfa.validator.validator.amino_acids_all_validator(string, pos=0, endpos=9223372036854775807)

Callable[[str, int, int], Optional[Match[str]]]: validator for amino acids including ambiguous amino acids.

Returns a match object if all characters in the string are single-letter amino acid codes found in AA_CODES_ALL.

fqfa.validator.validator.amino_acids_validator(string, pos=0, endpos=9223372036854775807)

Callable[[str, int, int], Optional[Match[str]]]: validator for amino acids.

Returns a match object if all characters in the string are single-letter amino acid codes found in AA_CODES.

fqfa.validator.validator.dna_bases_validator(string, pos=0, endpos=9223372036854775807)

Callable[[str, int, int], Optional[Match[str]]]: validator for DNA bases.

Returns a match object if all characters in the string are found in DNA_BASES.

fqfa.validator.validator.dna_characters_validator(string, pos=0, endpos=9223372036854775807)

Callable[[str, int, int], Optional[Match[str]]]: validator for DNA bases and ambiguity characters.

Returns a match object if all characters in the string are found in DNA_CHARACTERS.

fqfa.validator.validator.rna_bases_validator(string, pos=0, endpos=9223372036854775807)

Callable[[str, int, int], Optional[Match[str]]]: validator for RNA bases.

Returns a match object if all characters in the string are found in RNA_BASES.
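The validators above all take (string, pos, endpos) and return either a match object or None, so the result can be used directly in a boolean test. The sketch below does not import fqfa. It builds a stand-in with the standard library, assuming a validator behaves like the fullmatch method of a compiled character-class pattern. The name dna_bases_standin and the "+" quantifier are assumptions made for illustration only.

```python
import re

# Stand-in for fqfa.validator.validator.dna_bases_validator.
# Assumption: the validator is the fullmatch method of a compiled
# character-class pattern. The "+" quantifier is a guess, so the
# behavior on an empty string is deliberately not shown.
DNA_BASES = ["A", "C", "G", "T"]
dna_bases_standin = re.compile("[" + "".join(DNA_BASES) + "]+").fullmatch

# A valid sequence yields a match object, which is truthy.
match = dna_bases_standin("ACGTTGCA")
is_valid = match is not None

# A single character outside the set makes the whole call return None.
has_ambiguity = dna_bases_standin("ACGTN")

# pos and endpos restrict validation to part of the string,
# here skipping a leading ">" without slicing a copy.
after_marker = dna_bases_standin(">ACGT", 1)
```

Because the match must cover the whole string, a validator answers "is every character allowed?" rather than "does an allowed character occur somewhere?".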

fqfa.validator.create.create_validator(valid_characters: str | List[str], case_sensitive: bool = True) → Callable[[str], Match[str] | None]

Function that generates a callable, regular-expression based sequence validator.

When called on a given string, the validator will return a Match object if every character is one of the valid_characters, else None.

Parameters:
  • valid_characters (Union[str, List[str]]) – A string or list of single-character strings defining the set of valid characters.

  • case_sensitive (bool) – Set to False to accept both the upper- and lower-case forms of the characters in valid_characters. Defaults to True.

Returns:

Callable validator that uses re.fullmatch.

Return type:

Callable[[str, int, int], Optional[Match[str]]]

Raises:

ValueError – If valid_characters is a list and any entry is longer than one character.
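The behavior described for create_validator() can be approximated with the standard library alone. The following is a minimal sketch, not fqfa's actual source: the name sketch_create_validator, the "+" quantifier, the per-character escaping, and the error message are all assumptions.

```python
import re
from typing import Callable, List, Optional, Union


def sketch_create_validator(
    valid_characters: Union[str, List[str]],
    case_sensitive: bool = True,
) -> Callable[..., Optional[re.Match]]:
    """Hypothetical re-implementation of the documented behavior."""
    if isinstance(valid_characters, list):
        # Mirror the documented ValueError for multi-character entries.
        if any(len(entry) != 1 for entry in valid_characters):
            raise ValueError("list entries must be single characters")
        valid_characters = "".join(valid_characters)
    # Escape each character so symbols such as "*" or "]" stay literal
    # inside the character class.
    charset = "".join(re.escape(char) for char in valid_characters)
    flags = 0 if case_sensitive else re.IGNORECASE
    # Returning the bound fullmatch method gives the (string, pos, endpos)
    # call signature listed under "Return type".
    return re.compile(f"[{charset}]+", flags).fullmatch


# A case-insensitive DNA validator built from the four bases.
dna_any_case = sketch_create_validator("ACGT", case_sensitive=False)
```

With this sketch, dna_any_case("acgT") returns a match object, while the case-sensitive default rejects the same string.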

IUPAC codes

fqfa includes the International Union of Pure and Applied Chemistry (IUPAC) notation for degenerate bases. A mapping between single- and three-letter amino acid codes is also included. Validation based on single-letter amino acid codes can be accomplished by using the keys of the mapping.
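As a rough illustration of validating against the keys of a mapping, the sketch below copies AA_CODES from the entry under "Amino acid sequences" and builds a pattern from its keys. It uses re directly so that it runs without fqfa installed; with the package, passing the keys to create_validator() would presumably achieve the same thing. The helper name is_protein is hypothetical.

```python
import re

# Copied from the documented value of fqfa.constants.iupac.protein.AA_CODES.
AA_CODES = {
    "*": "Ter", "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu",
    "F": "Phe", "G": "Gly", "H": "His", "I": "Ile", "K": "Lys",
    "L": "Leu", "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln",
    "R": "Arg", "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp",
    "Y": "Tyr",
}

# Iterating a dict yields its keys, i.e. the valid single-letter codes.
# re.escape keeps "*" literal inside the character class.
aa_pattern = re.compile("[" + "".join(re.escape(key) for key in AA_CODES) + "]+")


def is_protein(seq: str) -> bool:
    """Hypothetical helper: True if every character is a key of AA_CODES."""
    return aa_pattern.fullmatch(seq) is not None
```

Here is_protein("MKV*") is True, while a sequence containing the ambiguity code B or X is rejected because those symbols are not keys of AA_CODES.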

DNA sequences

fqfa.constants.iupac.dna.DNA_AMBIGUITY: List[str] = ['W', 'S', 'M', 'K', 'R', 'Y', 'B', 'D', 'H', 'V', 'N']

IUPAC ambiguity characters for DNA sequence.

======  ==============  =====
Symbol  Description     Bases
======  ==============  =====
W       Weak            AT
S       Strong          GC
M       aMino           AC
K       Keto            GT
R       puRine          AG
Y       pYrimidine      CT
B       not A           CGT
D       not C           AGT
H       not G           ACT
V       not T           ACG
N       any Nucleotide  ACGT
======  ==============  =====

Type:

List[str]
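The symbol-to-bases table can be used to expand a degenerate sequence into the concrete sequences it represents. The sketch below is illustrative only: AMBIGUITY_TO_BASES and expand_degenerate are hypothetical names, and the mapping is transcribed by hand from the table, since DNA_AMBIGUITY itself is only a list of symbols.

```python
from itertools import product
from typing import List

# Symbol -> bases, transcribed from the table above.
# The four plain bases map to themselves.
AMBIGUITY_TO_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "GC", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(seq: str) -> List[str]:
    """Return every concrete DNA sequence matched by a degenerate sequence."""
    choices = [AMBIGUITY_TO_BASES[symbol] for symbol in seq]
    # product() walks the Cartesian product of the per-position choices.
    return ["".join(combo) for combo in product(*choices)]


expanded = expand_degenerate("ARY")
# expanded == ["AAC", "AAT", "AGC", "AGT"]
```

The number of results multiplies with every ambiguous position, so this approach is only practical for short sequences such as primers.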

fqfa.constants.iupac.dna.DNA_BASES: List[str] = ['A', 'C', 'G', 'T']

The four DNA bases.

======  ===========
Symbol  Description
======  ===========
A       Adenine
C       Cytosine
G       Guanine
T       Thymine
======  ===========

Type:

List[str]

fqfa.constants.iupac.dna.DNA_CHARACTERS: List[str] = ['A', 'C', 'G', 'T', 'W', 'S', 'M', 'K', 'R', 'Y', 'B', 'D', 'H', 'V', 'N']

Bases and IUPAC ambiguity characters for DNA sequence.

======  ==============  =====
Symbol  Description     Bases
======  ==============  =====
A       Adenine         A
C       Cytosine        C
G       Guanine         G
T       Thymine         T
W       Weak            AT
S       Strong          GC
M       aMino           AC
K       Keto            GT
R       puRine          AG
Y       pYrimidine      CT
B       not A           CGT
D       not C           AGT
H       not G           ACT
V       not T           ACG
N       any Nucleotide  ACGT
======  ==============  =====

Type:

List[str]
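The documented value of DNA_CHARACTERS is DNA_BASES followed by DNA_AMBIGUITY, which makes it easy to tell strict sequences from ambiguous ones. The sketch below copies the three lists from this page; classify_dna is a hypothetical helper, and treating the empty string as invalid is an assumption.

```python
# Values copied from the documented constants on this page.
DNA_BASES = ["A", "C", "G", "T"]
DNA_AMBIGUITY = ["W", "S", "M", "K", "R", "Y", "B", "D", "H", "V", "N"]
DNA_CHARACTERS = DNA_BASES + DNA_AMBIGUITY


def classify_dna(seq: str) -> str:
    """Hypothetical helper: return 'bases', 'ambiguous', or 'invalid'."""
    symbols = set(seq)
    # Assumption: an empty sequence counts as invalid.
    if not symbols or not symbols <= set(DNA_CHARACTERS):
        return "invalid"
    return "bases" if symbols <= set(DNA_BASES) else "ambiguous"
```

For example, classify_dna("ACGT") gives "bases", classify_dna("ACNT") gives "ambiguous", and classify_dna("ACGU") gives "invalid" because U is not a DNA character.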

fqfa.constants.iupac.dna.DNA_COMPLEMENTS: Dict[str, str] = {'A': 'T', 'B': 'V', 'C': 'G', 'D': 'H', 'G': 'C', 'H': 'D', 'K': 'M', 'M': 'K', 'N': 'N', 'R': 'Y', 'S': 'S', 'T': 'A', 'V': 'B', 'W': 'W', 'Y': 'R'}

Map for complementing DNA sequences.

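======  ==========  =====  ===========
Symbol  Complement  Bases  Comp. Bases
======  ==========  =====  ===========
A       T           A      T
C       G           C      G
G       C           G      C
T       A           T      A
W       W           AT     AT
S       S           GC     GC
M       K           AC     GT
K       M           GT     AC
R       Y           AG     CT
Y       R           CT     AG
B       V           CGT    ACG
D       H           AGT    ACT
H       D           ACT    AGT
V       B           ACG    CGT
N       N           ACGT   ACGT
======  ==========  =====  ===========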

Type:

Dict[str, str]
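A complement map like this one is typically applied with str.translate. The sketch below copies the documented dictionary and defines a hypothetical reverse_complement helper; it is an illustration of how the map can be used, not a description of any function fqfa itself provides.

```python
# Copied from the documented value of DNA_COMPLEMENTS.
DNA_COMPLEMENTS = {
    "A": "T", "B": "V", "C": "G", "D": "H", "G": "C",
    "H": "D", "K": "M", "M": "K", "N": "N", "R": "Y",
    "S": "S", "T": "A", "V": "B", "W": "W", "Y": "R",
}

# str.maketrans accepts a dict whose keys are single characters.
_COMPLEMENT_TABLE = str.maketrans(DNA_COMPLEMENTS)


def reverse_complement(seq: str) -> str:
    """Hypothetical helper: complement each symbol, then reverse.

    Characters missing from the map (for example lower-case letters)
    pass through translate() unchanged, so validate the input first.
    """
    return seq.translate(_COMPLEMENT_TABLE)[::-1]


rc = reverse_complement("ATGCN")
# rc == "NGCAT"
```

Because every symbol's complement maps back to the symbol itself, applying reverse_complement twice returns the original sequence.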

RNA sequences

fqfa.constants.iupac.rna.RNA_BASES: List[str] = ['A', 'C', 'G', 'U']

The four RNA bases.

======  ===========
Symbol  Description
======  ===========
A       Adenine
C       Cytosine
G       Guanine
U       Uracil
======  ===========

Type:

List[str]

Amino acid sequences

fqfa.constants.iupac.protein.AA_CODES: Dict[str, str] = {'*': 'Ter', 'A': 'Ala', 'C': 'Cys', 'D': 'Asp', 'E': 'Glu', 'F': 'Phe', 'G': 'Gly', 'H': 'His', 'I': 'Ile', 'K': 'Lys', 'L': 'Leu', 'M': 'Met', 'N': 'Asn', 'P': 'Pro', 'Q': 'Gln', 'R': 'Arg', 'S': 'Ser', 'T': 'Thr', 'V': 'Val', 'W': 'Trp', 'Y': 'Tyr'}

Map from single-letter amino acid codes to three-letter codes. The table is sorted by three-letter code, with the termination symbol listed last.

=============  ============  =========================
Single-letter  Three-letter  Amino Acid
=============  ============  =========================
A              Ala           Alanine
R              Arg           Arginine
N              Asn           Asparagine
D              Asp           Aspartic acid (Aspartate)
C              Cys           Cysteine
Q              Gln           Glutamine
E              Glu           Glutamic acid (Glutamate)
G              Gly           Glycine
H              His           Histidine
I              Ile           Isoleucine
L              Leu           Leucine
K              Lys           Lysine
M              Met           Methionine
F              Phe           Phenylalanine
P              Pro           Proline
S              Ser           Serine
T              Thr           Threonine
W              Trp           Tryptophan
Y              Tyr           Tyrosine
V              Val           Valine
*              Ter           termination codon
=============  ============  =========================

Type:

Dict[str, str]
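Beyond validation, the mapping can convert a protein sequence between notations. The sketch below copies the documented dictionary; to_three_letter and THREE_TO_ONE are hypothetical names introduced for illustration.

```python
# Copied from the documented value of AA_CODES.
AA_CODES = {
    "*": "Ter", "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu",
    "F": "Phe", "G": "Gly", "H": "His", "I": "Ile", "K": "Lys",
    "L": "Leu", "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln",
    "R": "Arg", "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp",
    "Y": "Tyr",
}


def to_three_letter(seq: str, sep: str = "-") -> str:
    """Hypothetical helper: spell out a sequence in three-letter codes.

    Raises KeyError for any character that is not a key of AA_CODES.
    """
    return sep.join(AA_CODES[residue] for residue in seq)


# The three-letter codes are unique, so the map can be inverted safely.
THREE_TO_ONE = {three: one for one, three in AA_CODES.items()}

spelled = to_three_letter("MKV*")
# spelled == "Met-Lys-Val-Ter"
```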

fqfa.constants.iupac.protein.AA_CODES_ALL: Dict[str, str] = {'*': 'Ter', 'A': 'Ala', 'B': 'Asx', 'C': 'Cys', 'D': 'Asp', 'E': 'Glu', 'F': 'Phe', 'G': 'Gly', 'H': 'His', 'I': 'Ile', 'K': 'Lys', 'L': 'Leu', 'M': 'Met', 'N': 'Asn', 'P': 'Pro', 'Q': 'Gln', 'R': 'Arg', 'S': 'Ser', 'T': 'Thr', 'V': 'Val', 'W': 'Trp', 'X': 'Xaa', 'Y': 'Tyr', 'Z': 'Glx'}

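Map from all single-letter amino acid codes, including the ambiguous ones, to three-letter codes. The table lists the standard codes sorted by three-letter code, followed by the termination symbol and the ambiguity codes.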

=============  ============  ===========================
Single-letter  Three-letter  Amino Acid
=============  ============  ===========================
A              Ala           Alanine
R              Arg           Arginine
N              Asn           Asparagine
D              Asp           Aspartic acid (Aspartate)
C              Cys           Cysteine
Q              Gln           Glutamine
E              Glu           Glutamic acid (Glutamate)
G              Gly           Glycine
H              His           Histidine
I              Ile           Isoleucine
L              Leu           Leucine
K              Lys           Lysine
M              Met           Methionine
F              Phe           Phenylalanine
P              Pro           Proline
S              Ser           Serine
T              Thr           Threonine
W              Trp           Tryptophan
Y              Tyr           Tyrosine
V              Val           Valine
*              Ter           termination codon
B              Asx           Aspartic acid or Asparagine
Z              Glx           Glutamine or Glutamic acid
X              Xaa           Any amino acid
=============  ============  ===========================

Type:

Dict[str, str]

fqfa.constants.iupac.protein.AA_CODES_AMBIGUITY: Dict[str, str] = {'B': 'Asx', 'X': 'Xaa', 'Z': 'Glx'}

Map from ambiguous single-letter amino acid codes to three-letter codes. Sorted by three-letter code.

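=============  ============  ===========================
Single-letter  Three-letter  Amino Acid
=============  ============  ===========================
B              Asx           Aspartic acid or Asparagine
Z              Glx           Glutamine or Glutamic acid
X              Xaa           Any amino acid
=============  ============  ===========================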

Type:

Dict[str, str]
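One use for a separate ambiguity map is reporting where a sequence contains residues that are not fully resolved. The sketch below copies the documented dictionary; ambiguous_positions is a hypothetical helper, not part of fqfa.

```python
from typing import List, Tuple

# Copied from the documented value of AA_CODES_AMBIGUITY.
AA_CODES_AMBIGUITY = {"B": "Asx", "X": "Xaa", "Z": "Glx"}


def ambiguous_positions(seq: str) -> List[Tuple[int, str]]:
    """Hypothetical helper: list (0-based index, three-letter code) pairs
    for every ambiguous residue in a protein sequence."""
    return [
        (index, AA_CODES_AMBIGUITY[residue])
        for index, residue in enumerate(seq)
        if residue in AA_CODES_AMBIGUITY
    ]


found = ambiguous_positions("MBKXZ")
# found == [(1, "Asx"), (3, "Xaa"), (4, "Glx")]
```

A sequence for which this returns an empty list contains no ambiguity codes, though it may still contain characters that are invalid for other reasons.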