Usage examples

This page contains some example use cases for fqfa. They are formatted as doctest tests.

Basic sequence validation

The validators in fqfa.validator.validator return a match object if the sequence validates or None if the sequence doesn’t validate. This means that they can be used in simple if statements.

This validator only accepts the standard DNA bases, so the input sequence is invalid.

>>> if dna_bases_validator("ACGTNW"):
...     print("valid!")
... else:
...     print("invalid!")
invalid!

This validator accepts all IUPAC bases, so the input sequence is valid.

>>> if dna_characters_validator("ACGTNW"):
...     print("valid!")
... else:
...     print("invalid!")
valid!

The validators only accept strings (or bytes), as they are based on regular expressions. Attempting to validate anything else results in a TypeError.

>>> if dna_characters_validator(42):
...     print("valid!")
... else:
...     print("invalid!")
Traceback (most recent call last):
    ...
TypeError: expected string or bytes-like object

Default validators only accept uppercase characters, so mixed-case or lowercase input is invalid.

>>> if dna_bases_validator("ACgT"):
...     print("valid!")
... else:
...     print("invalid!")
invalid!

Case-insensitive validators can be created using create_validator().

>>> case_insensitive_validator = create_validator(DNA_BASES, case_sensitive=False)
>>> if case_insensitive_validator("ACgT"):
...     print("valid!")
... else:
...     print("invalid!")
valid!

Translating FASTA sequences

fqfa implements a function to parse individual records from FASTA files.

>>> fasta_string = """
... >test record
... ACGAAA
... TAA
...
... >another record here
... ACANaa
... """
>>> for header, seq in parse_fasta_records(StringIO(fasta_string)):
...     print(header)
...     print(seq)
test record
ACGAAATAA
another record here
ACANaa

These sequences can be validated and/or transformed using utility functions in the library and rewritten as FASTA output.

>>> fasta_string = """
... >test record
... ACGAAA
... TAA
... """
>>> output_file = StringIO()
>>> for header, dna_seq in parse_fasta_records(StringIO(fasta_string)):
...     if dna_bases_validator(dna_seq):
...         protein_seq, _ = translate_dna(dna_seq)
...         write_fasta_record(output_file, header, protein_seq)
>>> output_file.seek(0)
0
>>> print(output_file.read())
>test record
TK*

Filtering paired-end FASTQ reads on sequence quality

fqfa can read single FASTQ files or a pair of FASTQ files in parallel.

>>> fastq_fwd = """\
... @TEST:123:456 AAA
... ACGTAA
... +
... AAA!CD
... @TEST:999:888 AAA
... AAACCC
... +
... ABABAB
... """
>>> fastq_rev = """\
... @TEST:123:456 BBB
... TTTTTT
... +
... ACACAC
... @TEST:999:888 BBB
... GGGCCC
... +
... BBBAAA
... """
>>> for fwd, rev in parse_fastq_pe_reads(StringIO(fastq_fwd), StringIO(fastq_rev)):
...     print(f"{fwd.sequence}-{rev.sequence}")
ACGTAA-TTTTTT
AAACCC-GGGCCC

The parse_fastq_reads() and parse_fastq_pe_reads() functions are generators that return FastqRead objects, so the relevant class methods can be used for filtering or trimming of reads as they are processed.

>>> fastq_fwd = """\
... @TEST:123:456 AAA
... ACGTAA
... +
... AAA!CD
... @TEST:999:888 AAA
... AAACCC
... +
... ABABAB
... """
>>> fastq_rev = """\
... @TEST:123:456 BBB
... TTTTTT
... +
... ACACAC
... @TEST:999:888 BBB
... GGGCCC
... +
... BBBAAA
... """
>>> for fwd, rev in parse_fastq_pe_reads(StringIO(fastq_fwd), StringIO(fastq_rev)):
...     if fwd.min_quality() > 20 and rev.min_quality() > 20:
...         print(f"{fwd.sequence}-{rev.sequence}")
AAACCC-GGGCCC