File handling

fqfa implements several functions to help open FASTA and FASTQ data files. These include functions for validating file names and for opening compressed files. Currently, fqfa supports opening files compressed with bzip2 or gzip. Generally speaking, gzip is faster and more widely supported by other bioinformatics software, but bzip2 offers slightly better compression, which may be relevant for large FASTQ files that are not frequently accessed.

The generator functions for FASTA and FASTQ files take open file handles as their arguments, so they can be used directly with handles returned by open_compressed().

fqfa.util.file.has_fasta_ext(path: str) → bool

Checks whether the file path has the expected file extension for FASTA format.

Recognized file extensions are .fa and .fasta, optionally in combination with a compression file extension supported by open_compressed().

Parameters:

path (str) – File path to be checked.

Returns:

True if the file has a recognized extension, else False.

Return type:

bool

fqfa.util.file.has_fastq_ext(path: str) → bool

Checks whether the file path has the expected file extension for FASTQ format.

Recognized file extensions are .fq and .fastq, optionally in combination with a compression file extension supported by open_compressed().

Parameters:

path (str) – File path to be checked.

Returns:

True if the file has a recognized extension, else False.

Return type:

bool
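The extension rule shared by has_fasta_ext() and has_fastq_ext() can be sketched with the standard library. This is an illustrative re-implementation rather than fqfa's source: the helper name has_ext and the two constant sets are hypothetical, and treating extensions as case-sensitive is an assumption.

```python
from pathlib import PurePath
from typing import Set

# Assumed extension sets, taken from the descriptions above.
COMPRESSION_EXTS = {".gz", ".bz2"}
FASTA_EXTS = {".fa", ".fasta"}
FASTQ_EXTS = {".fq", ".fastq"}


def has_ext(path: str, format_exts: Set[str]) -> bool:
    """Check for a format extension, optionally followed by a compression extension."""
    suffixes = PurePath(path).suffixes
    # Drop a trailing compression extension such as ".gz" before checking the format.
    if suffixes and suffixes[-1] in COMPRESSION_EXTS:
        suffixes = suffixes[:-1]
    return bool(suffixes) and suffixes[-1] in format_exts
```

Under these assumptions, has_ext("genome.fasta.gz", FASTA_EXTS) is True, while has_ext("genome.gz", FASTA_EXTS) is False because no format extension remains once the compression extension is removed.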

fqfa.util.file.open_compressed(path: str, encoding: str | None = None) → IO[Any]

Opens the file for reading, applying the appropriate decompression method if needed.

Compression status is determined by the file extension. Recognized file extensions are .bz2 for bzip2 compression and .gz for gzip compression. If there is any other file extension (or no extension), the file is opened normally. The file is opened in text mode.

Parameters:
  • path (str) – File path to be opened.

  • encoding (Optional[str]) – Text file encoding as described for io.TextIOWrapper.

Returns:

Open text file handle.

Return type:

IO[Any]
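Extension-based dispatch of this kind can be sketched using the standard gzip and bz2 modules. The function name open_by_extension is hypothetical, and the sketch only mirrors the behaviour described above; it is not fqfa's implementation.

```python
import bz2
import gzip
from typing import IO, Any, Optional


def open_by_extension(path: str, encoding: Optional[str] = None) -> IO[Any]:
    """Open a file in text mode, choosing a decompressor from the file extension."""
    if path.endswith(".gz"):
        return gzip.open(path, mode="rt", encoding=encoding)
    if path.endswith(".bz2"):
        return bz2.open(path, mode="rt", encoding=encoding)
    # Any other extension (or none) is treated as an uncompressed text file.
    return open(path, mode="rt", encoding=encoding)
```

Because all three branches return a text-mode handle, the caller can iterate over lines in the same way regardless of compression.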

FASTA files

fqfa has basic support for FASTA files. This is designed for small FASTA files such as those containing gene or plasmid sequences. fqfa does not use or create FASTA index (.fai) files.

The generator function below that parses FASTA files is slightly more flexible than the FASTA specification. Specifically, it ignores any lines before the first FASTA record, allowing for comments or other metadata at the start of the file, and allows any amount of leading or trailing whitespace in the sequence (including blank lines within a record).

No validation is performed on the sequences, but fqfa implements a set of callable validators that can be used.

fqfa.fasta.fasta.parse_fasta_records(handle: TextIO) → Generator[Tuple[str, str], None, None]

Generator function that returns tuples of FASTA headers and their associated sequences.

Lines before the start of the first record are ignored. Any leading and trailing whitespace is removed before the sequence lines are concatenated together. No validation of the characters in the FASTA record is performed.

Parameters:

handle (TextIO) – Open text file handle to parse.

Yields:

Tuple[str, str] – Tuple containing the header line (with leading ‘>’ removed) and the sequence.
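The parsing rules described above (skip lines before the first record, strip whitespace, concatenate sequence lines) can be sketched as follows. The function name iter_fasta is hypothetical and the code is an illustration of the documented behaviour, not fqfa's source; stripping surrounding whitespace from the header is an assumption.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) tuples from an open FASTA text handle."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Blank lines become empty strings and vanish when joined.
            chunks.append(line)
        # Lines seen before the first header are ignored.
    if header is not None:
        yield header, "".join(chunks)


example = io.StringIO("# comment\n>seq1 test\nACGT\n\n  TTGA  \n>seq2\nGG\n")
records = list(iter_fasta(example))
# records == [("seq1 test", "ACGTTTGA"), ("seq2", "GG")]
```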

fqfa.fasta.fasta.write_fasta_record(handle: TextIO, header: str, seq: str, width: int = 60) → None

Writes a FASTA record to an open file handle.

Leading and trailing whitespace will be removed from the header and all whitespace will be removed from the sequence before generating output.

Parameters:
  • handle (TextIO) – Open text file handle to write to.

  • header (str) – Header string for the FASTA record, without the leading ‘>’.

  • seq (str) – Sequence for the FASTA record.

  • width (int) – Width to use when hard-wrapping the sequence. Default 60.

Return type:

None
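Hard-wrapping a sequence at a fixed width, as described above, can be sketched like this. The function name write_wrapped_fasta is hypothetical, and the code illustrates the documented behaviour rather than reproducing fqfa's implementation.

```python
import io
from typing import TextIO


def write_wrapped_fasta(handle: TextIO, header: str, seq: str, width: int = 60) -> None:
    """Write one FASTA record, wrapping the sequence at `width` characters."""
    # str.split() with no arguments splits on any whitespace, so joining the
    # pieces removes spaces, tabs, and newlines from the sequence.
    seq = "".join(seq.split())
    handle.write(f">{header.strip()}\n")
    for i in range(0, len(seq), width):
        handle.write(seq[i:i + width] + "\n")


buffer = io.StringIO()
write_wrapped_fasta(buffer, " seq1 ", "ACGT ACGT\nAC", width=4)
# buffer.getvalue() == ">seq1\nACGT\nACGT\nAC\n"
```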

FASTQ files

fqfa supports reading FASTQ files either singly or as a pair (for paired-end data). Reads are returned as FastqRead objects. These objects support several basic operations, such as in-place read trimming and calculating quality-based values. The sequence and headers are stored as strings, and the quality values are stored as a list of integers.

Note that there are no FASTQ output functions, because the __str__() method formats a FastqRead object as a standard FASTQ record. Generating a FASTQ output file is as simple as printing all the objects.

fqfa.fastq.fastq.parse_fastq_pe_reads(handle_fwd: TextIO, handle_rev: TextIO, revcomp: bool = False) → Generator[Tuple[FastqRead, FastqRead], None, None]

Generator function that returns FASTQ read pairs as a tuple of objects.

Parameters:
  • handle_fwd (TextIO) – Open text file handle to parse for forward reads.

  • handle_rev (TextIO) – Open text file handle to parse for reverse reads.

  • revcomp (bool) – Whether to reverse-complement the reverse reads. Default False.

Returns:

Tuple of forward and reverse FastqRead objects.

Return type:

Tuple[FastqRead, FastqRead]

Raises:
  • ValueError – If a record is incomplete.

  • ValueError – If the two file handles have a different number of reads.

  • ValueError – If the read header portion before the first whitespace doesn’t match between read pairs. This usually contains the machine ID and read coordinates, and is therefore expected to match for PE data.
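The two pairing checks listed above (equal read counts and matching header prefixes) can be sketched on header strings alone. The names read_id and pair_headers are hypothetical, and the sketch operates on plain strings rather than FastqRead objects; it is not fqfa's implementation.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def read_id(header: str) -> str:
    """Return the part of a FASTQ header before the first whitespace."""
    return header.split(maxsplit=1)[0]


def pair_headers(fwd: Iterable[str], rev: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield header pairs, raising ValueError on count or ID mismatches."""
    # zip_longest pads the shorter input with None, which exposes unequal lengths.
    for h_fwd, h_rev in zip_longest(fwd, rev):
        if h_fwd is None or h_rev is None:
            raise ValueError("inputs contain different numbers of reads")
        if read_id(h_fwd) != read_id(h_rev):
            raise ValueError(f"read IDs do not match: {h_fwd!r} vs {h_rev!r}")
        yield h_fwd, h_rev
```

For example, the headers "@M1:1:100:200 1:N:0:1" and "@M1:1:100:200 2:N:0:1" share the prefix "@M1:1:100:200" and would be accepted as a pair. As with any generator, the errors are raised only when the pairs are actually consumed.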

fqfa.fastq.fastq.parse_fastq_reads(handle: TextIO) → Generator[FastqRead, None, None]

Generator function that returns FASTQ reads as objects.

Parameters:

handle (TextIO) – Open text file handle to parse.

Yields:

FastqRead – FastqRead object for the read.

Raises:

ValueError – If a record is incomplete.
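Reading a FASTQ file four lines at a time, and rejecting a truncated final record, can be sketched as below. The function name iter_fastq_records is hypothetical and yields plain tuples of strings rather than FastqRead objects; it illustrates the record structure only and is not fqfa's parser.

```python
import io
from itertools import islice
from typing import Iterator, TextIO, Tuple


def iter_fastq_records(handle: TextIO) -> Iterator[Tuple[str, ...]]:
    """Yield (header, sequence, header2, quality) tuples from a FASTQ handle."""
    while True:
        # Take the next four lines; fewer than four means the file ended early.
        chunk = [line.rstrip("\n") for line in islice(handle, 4)]
        if not chunk:
            return
        if len(chunk) < 4:
            raise ValueError("incomplete FASTQ record")
        yield tuple(chunk)


example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n")
records = list(iter_fastq_records(example))
# records == [("@r1", "ACGT", "+", "IIII"), ("@r2", "GG", "+", "II")]
```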

class fqfa.fastq.fastqread.FastqRead(header: str, sequence: str, header2: str, quality_string: dataclasses.InitVar[str], quality_encoding_value: int = 33)

Dataclass representing a single read from a FASTQ file.

Most methods modify the read in-place rather than returning a modified copy.

Parameters:
  • header (str) – The first header line in the FASTQ read, beginning with ‘@’.

  • sequence (str) – The nucleotide sequence of the FASTQ read, consisting of only bases “ACGTN”.

  • header2 (str) – The second header line in the FASTQ read, beginning with ‘+’.

  • quality_string (str) – The base quality values, ASCII encoded.

  • quality_encoding_value (int) – The ASCII value of base quality 0. Default is 33.

header

The first header line in the FASTQ read, beginning with ‘@’.

Type:

str

sequence

The nucleotide sequence of the FASTQ read, consisting of only bases “ACGTN”.

Type:

str

header2

The second header line in the FASTQ read, beginning with ‘+’.

Type:

str

quality

The base quality values as a list of integers.

Type:

List[int]

quality_encoding_value

The ASCII value of base quality 0.

Type:

int

__eq__(other)

Return self==value.

__hash__ = None

__init__(header: str, sequence: str, header2: str, quality_string: dataclasses.InitVar[str], quality_encoding_value: int = 33) → None

__len__() → int

The object’s length is defined as the length of the sequence.

Returns:

The length of the read’s sequence.

Return type:

int

__post_init__(quality_string: str) → None

Performs basic checks on the input and converts the quality string into a list of integers.

The quality string is converted to integers using the quality_encoding_value. This defaults to Sanger-style quality values (an ASCII offset of 33, so that ‘!’ encodes quality 0).

Parameters:

quality_string (str) – ASCII-encoded quality values.

Return type:

None

Raises:
  • ValueError – If the lengths of the sequence and quality strings are not equal.

  • ValueError – If the header string doesn’t start with ‘@’.

  • ValueError – If the sequence contains characters other than A, C, G, T, or N.

  • ValueError – If the secondary header string doesn’t start with ‘+’.

  • ValueError – If the quality values are outside the allowed range (0-93).
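The quality conversion and range check can be sketched as a standalone function. The name decode_quality is hypothetical; the offset of 33 and the allowed range of 0-93 come from the description above, and the code is an illustration rather than fqfa's implementation.

```python
from typing import List


def decode_quality(quality_string: str, offset: int = 33) -> List[int]:
    """Convert an ASCII-encoded quality string into integer quality values."""
    values = [ord(char) - offset for char in quality_string]
    if any(q < 0 or q > 93 for q in values):
        raise ValueError("quality values must be in the range 0-93")
    return values


qualities = decode_quality("!I~")
# qualities == [0, 40, 93]
```

With the default offset, ‘!’ (ASCII 33) is the lowest valid character and ‘~’ (ASCII 126) is the highest.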

__repr__()

Return repr(self).

__str__() → str

Formats the object as a four-line FASTQ record.

Returns:

Reconstruction of the original FASTQ record.

Return type:

str
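Formatting a four-line record requires re-encoding the integer quality values as characters. A minimal sketch follows; the function name format_fastq is hypothetical, and omitting a trailing newline is an assumption about the output rather than a documented property of __str__().

```python
from typing import List


def format_fastq(header: str, sequence: str, header2: str,
                 quality: List[int], offset: int = 33) -> str:
    """Build a four-line FASTQ record from its parts."""
    # Re-encode each integer quality value as a single ASCII character.
    quality_string = "".join(chr(q + offset) for q in quality)
    return "\n".join((header, sequence, header2, quality_string))


record = format_fastq("@r1", "ACGT", "+", [0, 40, 40, 93])
# record == "@r1\nACGT\n+\n!II~"
```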

__weakref__

list of weak references to the object (if defined)

average_quality() → float

Calculates and returns the read’s mean quality value.

Returns:

Mean quality value.

Return type:

float

min_quality() → int

Calculates and returns the read’s minimum quality value.

Returns:

The lowest quality value.

Return type:

int

reverse_complement() → None

Reverse-complements the sequence and reverses the order of quality values.

Return type:

None
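Reverse-complementing a read means complementing and reversing the bases while also reversing the quality values so that each value stays attached to its base. The sketch below returns new values instead of modifying an object in place, and the name revcomp_with_quality is hypothetical; it is not fqfa's implementation.

```python
from typing import List, Tuple

# Translation table mapping each base to its complement; N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp_with_quality(sequence: str, quality: List[int]) -> Tuple[str, List[int]]:
    """Return the reverse-complemented sequence and the reversed quality list."""
    return sequence.translate(COMPLEMENT)[::-1], quality[::-1]


seq, qual = revcomp_with_quality("AACGN", [1, 2, 3, 4, 5])
# seq == "NCGTT", qual == [5, 4, 3, 2, 1]
```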

trim(start: int = 1, end: int | None = None) → None

Trim the read such that it contains bases between start and end (inclusive).

Bases are numbered starting at 1.

Parameters:
  • start (int) – The first base to retain (1-indexed). Defaults to 1, which will not trim the start.

  • end (Optional[int]) – The last base to retain (1-indexed). Defaults to None, which will not trim the end.

Return type:

None

Raises:
  • ValueError – If the start is greater than the end.

  • ValueError – If the start is less than 1.
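The 1-indexed, inclusive coordinates used by trim() map onto Python's 0-indexed, half-open slices by subtracting one from start and using end unchanged. A minimal sketch under that assumption is shown below; the name trim_inclusive is hypothetical, it returns new values rather than trimming in place, and only the start check is included.

```python
from typing import List, Optional, Tuple


def trim_inclusive(sequence: str, quality: List[int],
                   start: int = 1, end: Optional[int] = None) -> Tuple[str, List[int]]:
    """Keep bases start..end (1-indexed, inclusive) and their quality values."""
    if start < 1:
        raise ValueError("start must be at least 1")
    # 1-indexed inclusive [start, end] equals the 0-indexed slice [start - 1:end].
    return sequence[start - 1:end], quality[start - 1:end]


seq, qual = trim_inclusive("ACGTACGT", [0, 1, 2, 3, 4, 5, 6, 7], start=2, end=5)
# seq == "CGTA", qual == [1, 2, 3, 4]
```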

trim_length(length: int, start: int = 1) → None

Trim the read to a specific length, beginning at start.

Bases are numbered starting at 1.

Parameters:
  • length (int) – The length of the read after trimming.

  • start (int) – The first base to retain (1-indexed). Defaults to 1, which will not trim the start.

Return type:

None

Raises: