File handling
fqfa implements several functions to help open FASTA and FASTQ data files. This includes functions for validating file names as well as for opening compressed file handles. Currently fqfa supports opening files compressed with bzip2 or gzip. Generally speaking, gzip is faster and more widely supported by other bioinformatics software, but bzip2 offers slightly better compression, which may be relevant for large FASTQ files that are not frequently accessed.
The generator functions for FASTA and FASTQ files take open file handles as their arguments, so they can be used with handles returned by open_compressed().
- fqfa.util.file.has_fasta_ext(path: str) → bool
Checks whether the file path has the expected file extension for FASTA format.
Recognized file extensions are .fa and .fasta, optionally in combination with a compression file extension supported by open_compressed().
- fqfa.util.file.has_fastq_ext(path: str) → bool
Checks whether the file path has the expected file extension for FASTQ format.
Recognized file extensions are .fq and .fastq, optionally in combination with a compression file extension supported by open_compressed().
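The extension rules described for has_fasta_ext() and has_fastq_ext() can be sketched with the standard library alone. This is not fqfa's implementation: the helper name has_ext, the constant names, and the choice to compare extensions case-sensitively are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical names, not part of fqfa.
COMPRESSION_EXTS = {".gz", ".bz2"}
FASTA_EXTS = {".fa", ".fasta"}
FASTQ_EXTS = {".fq", ".fastq"}


def has_ext(path: str, format_exts: set[str]) -> bool:
    """Check the format extension, ignoring one trailing compression extension."""
    suffixes = Path(path).suffixes
    # Drop a single compression suffix such as ".gz" if it is the last one.
    if suffixes and suffixes[-1] in COMPRESSION_EXTS:
        suffixes = suffixes[:-1]
    # The last remaining suffix must be one of the format extensions.
    return bool(suffixes) and suffixes[-1] in format_exts


is_fastq = has_ext("reads.fastq.gz", FASTQ_EXTS)
is_fasta = has_ext("reads.fastq.gz", FASTA_EXTS)
```

With this sketch, a bare compression extension such as archive.gz matches neither format, because no format extension remains once the compression suffix is removed.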
- fqfa.util.file.open_compressed(path: str, encoding: str | None = None) → IO[Any]
Opens the file handle for reading using the correct (optional) decompression method.
Compression status is determined by the file extension. Recognized file extensions are .bz2 for bzip2 compression and .gz for gzip compression. If there is any other file extension (or no extension), the file is opened normally. The file is opened in text mode.
- Parameters:
path (str) – File path to be opened.
encoding (Optional[str]) – Text file encoding as described for io.TextIOWrapper.
- Returns:
Open text file handle.
- Return type:
IO[Any]
- Raises:
FileNotFoundError – If path does not correspond to a file.
NotImplementedError – If a recognized compression extension lacks an implementation.
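The extension-based dispatch described above can be approximated with the standard gzip and bz2 modules, both of which accept a text mode and an encoding. The following is a minimal sketch rather than fqfa's own code; the function name open_by_extension and the temporary file are hypothetical.

```python
import bz2
import gzip
import os
import tempfile
from typing import IO, Any, Optional


def open_by_extension(path: str, encoding: Optional[str] = None) -> IO[Any]:
    """Open a file in text mode, choosing the opener from the file extension."""
    if path.endswith(".gz"):
        return gzip.open(path, mode="rt", encoding=encoding)
    if path.endswith(".bz2"):
        return bz2.open(path, mode="rt", encoding=encoding)
    # Any other extension (or none): open as a plain text file.
    return open(path, mode="rt", encoding=encoding)


# Round trip through a temporary gzip-compressed FASTA file.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "example.fa.gz")
    with gzip.open(gz_path, "wt") as out:
        out.write(">seq1\nACGT\n")
    with open_by_extension(gz_path) as handle:
        first_line = handle.readline()
```

Because all three branches return a text-mode handle, downstream code can iterate over lines without knowing whether the file was compressed.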
FASTA files
fqfa has basic support for FASTA files.
This is designed for small FASTA files such as those containing gene or plasmid sequences.
fqfa does not use or create FASTA index (.fai) files.
The generator function below that parses FASTA files is slightly more flexible than the FASTA specification. Specifically, it ignores any lines before the first FASTA record, allowing for comments or other metadata at the start of the file, and allows any amount of leading or trailing whitespace in the sequence (including blank lines within a record).
No validation is performed on the sequences, but fqfa implements a set of callable validators that can be used.
- fqfa.fasta.fasta.parse_fasta_records(handle: TextIO) → Generator[Tuple[str, str], None, None]
Generator function that returns tuples of FASTA headers and their associated sequences.
Lines before the start of the first record are ignored. Any leading and trailing whitespace is removed before the sequence lines are concatenated together. No validation of the characters in the FASTA record is performed.
- Parameters:
handle (TextIO) – Open text file handle to parse.
- Yields:
Tuple[str, str] – Tuple containing the header line (with leading ‘>’ removed) and the sequence.
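The lenient parsing behaviour described for parse_fasta_records() (skipping lines before the first record, stripping whitespace, tolerating blank lines) can be illustrated with a small stand-alone generator. This is a sketch under assumptions, not the library's source: the name parse_fasta_sketch is hypothetical, and stripping each line before checking for '>' is one possible reading of the description.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def parse_fasta_sketch(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) tuples from an open text handle."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence line (possibly blank after stripping).
            chunks.append(line)
        # Lines before the first header are ignored.
    if header is not None:
        yield header, "".join(chunks)


example = io.StringIO("# comment\n>seq1 demo\n  ACGT  \n\nTTAA\n>seq2\nGG\n")
records = list(parse_fasta_sketch(example))
```

Here records contains two tuples: the comment line is dropped, and the blank line inside the first record does not split its sequence.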
- fqfa.fasta.fasta.write_fasta_record(handle: TextIO, header: str, seq: str, width: int = 60) → None
Writes a FASTA record to an open file handle.
Leading and trailing whitespace will be removed from the header and all whitespace will be removed from the sequence before generating output.
- Parameters:
- Return type:
None
- Raises:
ValueError – If the header is empty.
ValueError – If the sequence is empty.
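The whitespace handling and error conditions listed for write_fasta_record() can be sketched as follows. The parameter descriptions are not available here, so treating width as the maximum number of sequence characters per output line is an assumption, and write_fasta_sketch is a hypothetical name rather than the library function.

```python
import io
from typing import TextIO


def write_fasta_sketch(handle: TextIO, header: str, seq: str, width: int = 60) -> None:
    """Write one FASTA record, wrapping the sequence at `width` characters."""
    header = header.strip()
    # Remove all whitespace (spaces, tabs, newlines) from the sequence.
    seq = "".join(seq.split())
    if not header:
        raise ValueError("header is empty")
    if not seq:
        raise ValueError("sequence is empty")
    handle.write(f">{header}\n")
    for i in range(0, len(seq), width):
        handle.write(seq[i:i + width] + "\n")


buffer = io.StringIO()
write_fasta_sketch(buffer, " seq1 ", "ACGT ACGT\nAC", width=4)
output = buffer.getvalue()
```

Writing to an io.StringIO buffer, as above, is a convenient way to inspect the formatted record before writing to a real file.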
FASTQ files
fqfa supports reading FASTQ files either singly or as a pair (for paired-end data).
Reads are returned as FastqRead objects.
These objects support several basic operations, such as in-place read trimming and calculating quality-based values.
The sequence and headers are stored as strings, and the quality values are stored as a list of integers.
Note that there are no FASTQ output functions, because the __str__() method formats a FastqRead object as a standard FASTQ record.
Generating a FASTQ output file is as simple as printing all the objects.
- fqfa.fastq.fastq.parse_fastq_pe_reads(handle_fwd: TextIO, handle_rev: TextIO, revcomp: bool = False) → Generator[Tuple[FastqRead, FastqRead], None, None]
Generator function that returns FASTQ read pairs as a tuple of objects.
- Parameters:
handle_fwd (TextIO) – Open text file handle to parse for forward reads.
handle_rev (TextIO) – Open text file handle to parse for reverse reads.
revcomp (bool) – Whether to reverse-complement the reverse reads. Default False.
- Returns:
Tuple of forward and reverse FastqRead objects.
- Return type:
- Raises:
ValueError – If a record is incomplete.
ValueError – If the two file handles have a different number of reads.
ValueError – If the read header portion before the first whitespace doesn’t match between read pairs. This usually contains the machine ID and read coordinates, and is therefore expected to match for PE data.
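The two pairing checks listed above (equal read counts and matching header prefixes) can be illustrated on plain header strings. This sketch is not how fqfa implements them; pair_headers and the example headers are hypothetical.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def pair_headers(fwd_headers: Iterable[str], rev_headers: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (forward, reverse) header pairs, raising ValueError on a mismatch."""
    for fwd, rev in zip_longest(fwd_headers, rev_headers):
        # zip_longest pads the shorter input with None.
        if fwd is None or rev is None:
            raise ValueError("inputs have a different number of reads")
        # Compare only the portion before the first whitespace.
        if fwd.split(maxsplit=1)[0] != rev.split(maxsplit=1)[0]:
            raise ValueError(f"header mismatch: {fwd!r} vs {rev!r}")
        yield fwd, rev


fwd = ["@M01:7:1101:1000:2000 1:N:0:1"]
rev = ["@M01:7:1101:1000:2000 2:N:0:1"]
pairs = list(pair_headers(fwd, rev))
```

The part after the first whitespace differs between the two example headers (1:N:0:1 versus 2:N:0:1), which is why only the leading portion is compared.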
- fqfa.fastq.fastq.parse_fastq_reads(handle: TextIO) → Generator[FastqRead, None, None]
Generator function that returns FASTQ reads as objects.
- Parameters:
handle (TextIO) – Open text file handle to parse.
- Yields:
FastqRead – FastqRead object for the read.
- Raises:
ValueError – If a record is incomplete.
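A FASTQ record spans four lines, so an "incomplete record" can be detected by reading the handle four lines at a time. The generator below is a rough sketch of that idea, assuming no trailing blank lines in the input; read_four_line_records is a hypothetical name, and it yields plain tuples of strings rather than FastqRead objects.

```python
import io
from itertools import islice
from typing import Iterator, TextIO, Tuple


def read_four_line_records(handle: TextIO) -> Iterator[Tuple[str, ...]]:
    """Yield tuples of four lines, raising ValueError on a short final group."""
    while True:
        lines = [line.rstrip("\n") for line in islice(handle, 4)]
        if not lines:
            return  # Clean end of input.
        if len(lines) < 4:
            raise ValueError("incomplete FASTQ record")
        yield tuple(lines)


example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\n!!!!\n")
raw_records = list(read_four_line_records(example))
```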
- class fqfa.fastq.fastqread.FastqRead(header: str, sequence: str, header2: str, quality_string: dataclasses.InitVar[str], quality_encoding_value: int = 33)
Dataclass representing a single read from a FASTQ file.
Most methods modify the read in-place rather than returning a modified copy.
- Parameters:
header (str) – The first header line in the FASTQ read, beginning with ‘@’.
sequence (str) – The nucleotide sequence of the FASTQ read, consisting of only bases “ACGTN”.
header2 (str) – The second header line in the FASTQ read, beginning with ‘+’.
quality_string (str) – The base quality values, ASCII encoded.
quality_encoding_value (int) – The ASCII value of base quality 0. Default is 33.
- __eq__(other)
Return self==value.
- __hash__ = None
- __init__(header: str, sequence: str, header2: str, quality_string: dataclasses.InitVar[str], quality_encoding_value: int = 33) → None
- __len__() → int
The object’s length is defined as the length of the sequence.
- Returns:
The length of the read’s sequence.
- Return type:
int
- __post_init__(quality_string: str) → None
Performs some basic checks on the input and converts the quality string into a list of integers.
The quality string is converted to integers using the quality_encoding_value. This defaults to Sanger-style quality values (quality 0 is encoded as ASCII value 33).
- Parameters:
quality_string (str) – ASCII-encoded quality values.
- Return type:
None
- Raises:
ValueError – If the length of the sequence and quality strings are not equal.
ValueError – If the header string doesn’t start with ‘@’.
ValueError – If the sequence contains characters other than A, C, G, T, or N.
ValueError – If the secondary header string doesn’t start with ‘+’.
ValueError – If the quality values are outside the allowed range (0-93).
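The conversion from an ASCII-encoded quality string to integer values amounts to subtracting the encoding offset from each character's code point. The sketch below shows this for the default offset of 33, along with the mean and minimum that average_quality() and min_quality() are described as returning; decode_quality is a hypothetical stand-alone function, not a FastqRead method.

```python
from typing import List


def decode_quality(quality_string: str, offset: int = 33) -> List[int]:
    """Convert an ASCII-encoded quality string into integer quality values."""
    values = [ord(char) - offset for char in quality_string]
    # Reject values outside the documented range of 0-93.
    if any(q < 0 or q > 93 for q in values):
        raise ValueError("quality values outside the allowed range (0-93)")
    return values


qualities = decode_quality("II5!")
mean_quality = sum(qualities) / len(qualities)
lowest_quality = min(qualities)
```

In this example 'I' (ASCII 73) decodes to 40, '5' (ASCII 53) to 20, and '!' (ASCII 33) to 0.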
- __repr__()
Return repr(self).
- __str__() → str
Formats the object as a four-line FASTQ record.
- Returns:
Reconstruction of the original FASTQ record.
- Return type:
str
- __weakref__
list of weak references to the object (if defined)
- average_quality() → float
Calculates and returns the read’s mean quality value.
- Returns:
Mean quality value.
- Return type:
float
- min_quality() → int
Calculates and returns the read’s minimum quality value.
- Returns:
The lowest quality value.
- Return type:
int
- reverse_complement() → None
Reverse-complements the sequence and reverses the order of quality values.
- Return type:
None
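Reverse-complementing a read means complementing each base, reversing the sequence, and reversing the quality values so that each value stays attached to its base. A minimal sketch follows; unlike the FastqRead method, which modifies the object in place, this hypothetical function returns new values.

```python
from typing import List, Tuple

# Translation table for the bases accepted by FastqRead (N maps to itself).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence: str, qualities: List[int]) -> Tuple[str, List[int]]:
    """Return the reverse-complemented sequence and the reversed qualities."""
    return sequence.translate(COMPLEMENT)[::-1], qualities[::-1]


rc_seq, rc_quals = reverse_complement("AACGN", [40, 38, 30, 20, 2])
```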
- trim(start: int = 1, end: int | None = None) → None
Trim the read such that it contains bases between start and end (inclusive).
Bases are numbered starting at 1.
- Parameters:
- Return type:
None
- Raises:
ValueError – If the start is greater than or equal to the end.
ValueError – If the start is less than 1.
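Because bases are numbered from 1 and the end position is inclusive, the coordinates map onto a Python slice as [start - 1:end]. The sketch below applies that mapping to a plain string; trim_sequence is a hypothetical helper, it only checks the lower bound on start, and it returns a new string instead of modifying a read in place.

```python
from typing import Optional


def trim_sequence(sequence: str, start: int = 1, end: Optional[int] = None) -> str:
    """Return bases start..end (1-based, inclusive) of the sequence."""
    if start < 1:
        raise ValueError("start must be at least 1")
    if end is None:
        end = len(sequence)
    # 1-based inclusive [start, end] equals 0-based half-open [start - 1, end).
    return sequence[start - 1:end]


trimmed = trim_sequence("ACGTACGT", start=3, end=6)
```

On a FastqRead the same slice would need to be applied to the list of quality values as well, so that sequence and qualities stay the same length.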
- trim_length(length: int, start: int = 1) → None
Trim the read to a specific length, beginning at start.
Bases are numbered starting at 1.
- Parameters:
- Return type:
None
- Raises:
ValueError – If the length is less than 1.
ValueError – If the start is less than 1.
ValueError – If the length is longer than the read.
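Trimming to a fixed length from a 1-based start corresponds to the slice [start - 1:start - 1 + length]. The following sketch works on a plain string and returns a new value. The helper name trim_to_length is hypothetical, and interpreting "longer than the read" as "the requested window extends past the end of the read" is an assumption.

```python
def trim_to_length(sequence: str, length: int, start: int = 1) -> str:
    """Return `length` bases of the sequence beginning at 1-based `start`."""
    if length < 1:
        raise ValueError("length must be at least 1")
    if start < 1:
        raise ValueError("start must be at least 1")
    # Assumption: reject windows that run past the end of the read.
    if start - 1 + length > len(sequence):
        raise ValueError("length is longer than the read")
    return sequence[start - 1:start - 1 + length]


window = trim_to_length("ACGTACGT", length=3, start=2)
```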