Performance comparison¶

This page contains some performance and usage comparisons for processing FASTQ files with fqfa and pyfastx.

In these benchmarks, fqfa is comparable to pyfastx, although pyfastx has made substantial performance improvements since fqfa was written, particularly when reading gzip-compressed input files.

The results are derived from Jupyter notebooks. If you’d like to run this code yourself, the notebooks are available with the fqfa documentation in fqfa/docs/notebooks. The file used in the benchmark is from the Enrich2 example dataset. To run the benchmarks as written, you will have to decompress the bz2 file and also create a gzipped version.
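The file preparation described above can be done with the Python standard library. The following is a minimal sketch, not part of the notebooks: the helper name `prepare_benchmark_files` is hypothetical, and it assumes the bz2 file sits in a writable directory.

```python
import bz2
import gzip
import shutil
from pathlib import Path


def prepare_benchmark_files(bz2_path):
    """Decompress a .bz2 file and also write a gzipped copy next to it.

    Hypothetical helper for illustration; returns (raw_path, gz_path).
    """
    bz2_path = Path(bz2_path)
    raw_path = bz2_path.with_suffix("")  # drops the trailing ".bz2"
    gz_path = raw_path.with_name(raw_path.name + ".gz")

    # Stream the data in chunks so the whole file is never held in memory.
    with bz2.open(bz2_path, "rb") as src, open(raw_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    with open(raw_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return raw_path, gz_path


# Example call (requires the file to be present):
# prepare_benchmark_files("BRCA1_input_sample.fq.bz2")
```

The compression settings used for the gzipped file in the original benchmarks are not stated, so the resulting file size may differ from the listing in this page.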

This section includes examples of usage that are common in my work, primarily in processing files of barcode reads for high-throughput functional genomic assays. pyfastx includes many other functions that are not demonstrated here.

Benchmarking for raw FASTQ files¶

import pyfastx
from fqfa.fastq.fastq import parse_fastq_reads
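The `%time` magic used throughout this page is only available in IPython and Jupyter. To run a similar measurement from a plain Python script, something like the following sketch could be used. The helper name `timed` is hypothetical, and it reports user and system CPU time combined rather than separately.

```python
import time


def timed(func, *args, **kwargs):
    """Call func, print CPU and wall time, and return (result, wall_seconds)."""
    start_wall = time.perf_counter()
    start_cpu = time.process_time()  # user + system CPU time of this process
    result = func(*args, **kwargs)
    cpu = time.process_time() - start_cpu
    wall = time.perf_counter() - start_wall
    print(f"CPU time: {cpu:.2f} s, wall time: {wall:.2f} s")
    return result, wall


# Small stand-in workload; replace the lambda with a benchmark expression.
squares, elapsed = timed(lambda: [n * n for n in range(1000)])
```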

Benchmark 1: list of reads¶

This code creates a list containing all the reads in the file. Note that the data structures for the reads are quite different, with two being package-specific objects and one being a tuple.

pyfastx with index¶

Much of the time spent in the first example is likely spent building the .fxi index file. This file enables direct access into the FASTQ file, which we are not using here. The index is quite large, much larger than the reads in this case:

334M    BRCA1_input_sample.fq
 48M    BRCA1_input_sample.fq.bz2
511M    BRCA1_input_sample.fq.fxi
 68M    BRCA1_input_sample.fq.gz
513M    BRCA1_input_sample.fq.gz.fxi

%time reads = [x for x in pyfastx.Fastq("BRCA1_input_sample.fq")]
for x in reads[:5]:
    print(repr(x))
del reads

CPU times: user 6.69 s, sys: 993 ms, total: 7.68 s
Wall time: 7.73 s
<Read> 140313_SN743_0432_AC3TTHACXX:4:1101:5633:2224:1#0/1 with length of 16
<Read> 140313_SN743_0432_AC3TTHACXX:4:1101:6580:2239:1#0/1 with length of 16
<Read> 140313_SN743_0432_AC3TTHACXX:4:1101:6929:2242:1#0/1 with length of 16
<Read> 140313_SN743_0432_AC3TTHACXX:4:1101:13004:2221:1#0/1 with length of 16
<Read> 140313_SN743_0432_AC3TTHACXX:4:1101:14034:2219:1#0/1 with length of 16

pyfastx without index¶

This is by far the fastest for just reading data from the file, but it doesn’t perform any extra computation or quality value conversion.

%time reads = [x for x in pyfastx.Fastq("BRCA1_input_sample.fq", build_index=False)]
for x in reads[:5]:
    print(x)
del reads

CPU times: user 1.42 s, sys: 417 ms, total: 1.83 s
Wall time: 1.93 s
('140313_SN743_0432_AC3TTHACXX:4:1101:5633:2224:1#0/1', 'CCCGTGGCCTTTTCCA', 'B@CFFFFFHHHHHJJJ')
('140313_SN743_0432_AC3TTHACXX:4:1101:6580:2239:1#0/1', 'TTTGGTAAAGGGTAAC', 'BBCFFDFFHHHHDHIJ')
('140313_SN743_0432_AC3TTHACXX:4:1101:6929:2242:1#0/1', 'AATAATGTATGTACCT', 'BC@FFFFEFHHHHJJJ')
('140313_SN743_0432_AC3TTHACXX:4:1101:13004:2221:1#0/1', 'CTATTGCGTGTGATCT', 'BCCFFFFFHHHHHJJJ')
('140313_SN743_0432_AC3TTHACXX:4:1101:14034:2219:1#0/1', 'ACCCCTACCCTCTGCC', 'BBBFFFFFHHHHHJJJ')

fqfa¶

Unlike pyfastx, fqfa takes an open file handle rather than a file name. In these examples, this is addressed using a context created by a with statement.

with open("BRCA1_input_sample.fq") as handle:
    %time reads = [x for x in parse_fastq_reads(handle)]
for x in reads[:5]:
    print(x)
del reads

CPU times: user 26.7 s, sys: 1.03 s, total: 27.8 s
Wall time: 27.8 s
@140313_SN743_0432_AC3TTHACXX:4:1101:5633:2224:1#0/1
CCCGTGGCCTTTTCCA
+
B@CFFFFFHHHHHJJJ
@140313_SN743_0432_AC3TTHACXX:4:1101:6580:2239:1#0/1
TTTGGTAAAGGGTAAC
+
BBCFFDFFHHHHDHIJ
@140313_SN743_0432_AC3TTHACXX:4:1101:6929:2242:1#0/1
AATAATGTATGTACCT
+
BC@FFFFEFHHHHJJJ
@140313_SN743_0432_AC3TTHACXX:4:1101:13004:2221:1#0/1
CTATTGCGTGTGATCT
+
BCCFFFFFHHHHHJJJ
@140313_SN743_0432_AC3TTHACXX:4:1101:14034:2219:1#0/1
ACCCCTACCCTCTGCC
+
BBBFFFFFHHHHHJJJ
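One consequence of a handle-based interface is that any file-like object can be parsed, not only files on disk. The sketch below illustrates this with a deliberately simplified stand-in parser. `iter_fastq_records` is a hypothetical function, not fqfa's implementation, and it assumes strict four-line records with no validation.

```python
import io
from typing import Iterator, TextIO, Tuple


def iter_fastq_records(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) tuples from a four-line FASTQ handle."""
    while True:
        header = handle.readline().strip()
        if not header:
            # End of input (or a blank line): stop iterating.
            return
        sequence = handle.readline().strip()
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().strip()
        yield header[1:], sequence, quality  # drop the leading '@'


# An in-memory handle stands in for an open file.
example = io.StringIO(
    "@read_1\nCCCGTGGCCTTTTCCA\n+\nB@CFFFFFHHHHHJJJ\n"
    "@read_2\nTTTGGTAAAGGGTAAC\n+\nBBCFFDFFHHHHDHIJ\n"
)
with example as handle:
    records = list(iter_fastq_records(handle))
```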

Benchmark 2: summarized quality statistics¶

This code calculates the median average read quality for all reads in the file.

from statistics import mean, median

pyfastx with index¶

pyfastx provides integer quality values as part of its FASTQ read data structure.

%time read_quals = [mean(x.quali) for x in pyfastx.Fastq("BRCA1_input_sample.fq")]
print(f"Median average quality is {median(read_quals)}")
del read_quals

CPU times: user 54.8 s, sys: 630 ms, total: 55.5 s
Wall time: 55.9 s
Median average quality is 37.5

pyfastx without index¶

The timing here is quite a bit closer to the others, since the conversion and calculation have not already been performed as part of processing the input file.

%time read_quals = [mean([ord(c) - 33 for c in x[2]]) for x in pyfastx.Fastq("BRCA1_input_sample.fq", build_index=False)]
print(f"Median average quality is {median(read_quals)}")
del read_quals

CPU times: user 53.9 s, sys: 95.4 ms, total: 54 s
Wall time: 54 s
Median average quality is 37.5
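The `ord(c) - 33` expression above converts each quality character to an integer, assuming the common Phred+33 ASCII encoding. As a small worked example, the sketch below applies the same conversion to the five quality strings printed in Benchmark 1 and takes the median of the per-read means. The helper name `mean_quality` is hypothetical.

```python
from statistics import median

PHRED_OFFSET = 33  # assumes Phred+33 encoded quality strings


def mean_quality(quality: str) -> float:
    """Return the mean integer quality of one read's quality string."""
    values = [ord(c) - PHRED_OFFSET for c in quality]
    return sum(values) / len(values)


# Quality strings of the first five reads shown in Benchmark 1.
qualities = [
    "B@CFFFFFHHHHHJJJ",
    "BBCFFDFFHHHHDHIJ",
    "BC@FFFFEFHHHHJJJ",
    "BCCFFFFFHHHHHJJJ",
    "BBBFFFFFHHHHHJJJ",
]
per_read_means = [mean_quality(q) for q in qualities]
median_of_means = median(per_read_means)
print(per_read_means)   # [37.5625, 37.125, 37.375, 37.75, 37.625]
print(median_of_means)  # 37.5625
```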

fqfa¶

This code uses the average_quality() method implemented by the FastqRead class.

with open("BRCA1_input_sample.fq") as handle:
    %time read_quals = [x.average_quality() for x in parse_fastq_reads(handle)]
print(f"Median average quality is {median(read_quals)}")
del read_quals

CPU times: user 1min 19s, sys: 146 ms, total: 1min 19s
Wall time: 1min 19s
Median average quality is 37.5

Benchmark 3: filtering reads on quality¶

This code creates a list of reads for which all bases are at least Q20. The usage in this section is similar to Benchmark 2, but it runs considerably faster, following recent performance improvements in pyfastx.

pyfastx with index¶

%time filt_reads = [x for x in pyfastx.Fastq("BRCA1_input_sample.fq") if min(x.quali) >= 20]
print(f"Kept {len(filt_reads)} reads after applying filter.")
del filt_reads

CPU times: user 5.75 s, sys: 556 ms, total: 6.3 s
Wall time: 6.32 s
Kept 3641707 reads after applying filter.

pyfastx without index¶

%time filt_reads = [x for x in pyfastx.Fastq("BRCA1_input_sample.fq", build_index=False) if min([ord(c) - 33 for c in x[2]]) >= 20]
print(f"Kept {len(filt_reads)} reads after applying filter.")
del filt_reads

CPU times: user 6.71 s, sys: 472 ms, total: 7.18 s
Wall time: 7.25 s
Kept 3641762 reads after applying filter.
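For tuple-style reads, the filter does not strictly need the integer conversion. Under the Phred+33 assumption, a Q20 threshold corresponds to the character `chr(20 + 33)`, which is `"5"`, so the smallest character in the quality string can be compared directly. The sketch below shows that both forms give the same answer. The function names and the low-quality example read are hypothetical, and this variant was not part of the benchmarks.

```python
PHRED_OFFSET = 33  # assumes Phred+33 encoded quality strings
MIN_QUALITY = 20
THRESHOLD_CHAR = chr(MIN_QUALITY + PHRED_OFFSET)  # "5"


def passes_by_conversion(quality: str) -> bool:
    """Convert every character to an integer, then compare the minimum."""
    return min(ord(c) - PHRED_OFFSET for c in quality) >= MIN_QUALITY


def passes_by_character(quality: str) -> bool:
    """Compare the smallest character against the threshold character."""
    return min(quality) >= THRESHOLD_CHAR


reads = [
    ("read_from_benchmark", "CCCGTGGCCTTTTCCA", "B@CFFFFFHHHHHJJJ"),
    ("hypothetical_low_quality_read", "ACGT", "II#I"),  # '#' is Q2
]
kept = [r for r in reads if passes_by_character(r[2])]
print(f"Kept {len(kept)} reads after applying filter.")
```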

fqfa¶

This code uses the min_quality() method implemented by the FastqRead class.

with open("BRCA1_input_sample.fq") as handle:
    %time filt_reads = [x for x in parse_fastq_reads(handle) if x.min_quality() >= 20]
print(f"Kept {len(filt_reads)} reads after applying filter.")
del filt_reads

CPU times: user 30.9 s, sys: 4.38 s, total: 35.3 s
Wall time: 1min 15s
Kept 3641762 reads after applying filter.

Benchmarking for gzip-compressed FASTQ files¶

import pyfastx
from fqfa.fastq.fastq import parse_fastq_reads
from fqfa.util.file import open_compressed

Benchmark 1: list of reads¶

This code creates a list containing all the reads in the file. Note that the data structures for the reads are quite different, with two being package-specific objects and one being a tuple.

pyfastx with index¶

Much of the time spent in the first example is likely spent building the .fxi index file. This file enables direct access into the FASTQ file, which we are not using here. The index is quite large, much larger than the reads in this case:

334M    BRCA1_input_sample.fq
 48M    BRCA1_input_sample.fq.bz2
511M    BRCA1_input_sample.fq.fxi
 68M    BRCA1_input_sample.fq.gz
513M    BRCA1_input_sample.fq.gz.fxi

%time reads = [x for x in pyfastx.Fastq("BRCA1_input_sample.fq.gz")]
for x in reads[:5]:
    print(repr(x))
del reads

CPU times: user 9.1 s, sys: 1.05 s, total: 10.1 s
Wall time: 10.2 s
<Read> 140313_SN743_0432_AC3TTHACXX:4:1101:5633:2224:1#0/1 with length of 16
<Read> 140313_SN743_0432_AC3TTHACXX:4:1101:6580:2239:1#0/1 with length of 16
<Read> 140313_SN743_0432_AC3TTHACXX:4:1101:6929:2242:1#0/1 with length of 16
<Read> 140313_SN743_0432_AC3TTHACXX:4:1101:13004:2221:1#0/1 with length of 16
<Read> 140313_SN743_0432_AC3TTHACXX:4:1101:14034:2219:1#0/1 with length of 16

pyfastx without index¶

This is by far the fastest for just reading data from the file, but it doesn’t perform any extra computation or quality value conversion.

%time reads = [x for x in pyfastx.Fastq("BRCA1_input_sample.fq.gz", build_index=False)]
for x in reads[:5]:
    print(x)
del reads

CPU times: user 2.59 s, sys: 312 ms, total: 2.9 s
Wall time: 2.9 s
('140313_SN743_0432_AC3TTHACXX:4:1101:5633:2224:1#0/1', 'CCCGTGGCCTTTTCCA', 'B@CFFFFFHHHHHJJJ')
('140313_SN743_0432_AC3TTHACXX:4:1101:6580:2239:1#0/1', 'TTTGGTAAAGGGTAAC', 'BBCFFDFFHHHHDHIJ')
('140313_SN743_0432_AC3TTHACXX:4:1101:6929:2242:1#0/1', 'AATAATGTATGTACCT', 'BC@FFFFEFHHHHJJJ')
('140313_SN743_0432_AC3TTHACXX:4:1101:13004:2221:1#0/1', 'CTATTGCGTGTGATCT', 'BCCFFFFFHHHHHJJJ')
('140313_SN743_0432_AC3TTHACXX:4:1101:14034:2219:1#0/1', 'ACCCCTACCCTCTGCC', 'BBBFFFFFHHHHHJJJ')

fqfa¶

Unlike pyfastx, fqfa takes an open file handle rather than a file name. In these examples, this is addressed using a context created by a with statement.

with open_compressed("BRCA1_input_sample.fq.gz") as handle:
    %time reads = [x for x in parse_fastq_reads(handle)]
for x in reads[:5]:
    print(x)
del reads

CPU times: user 30.8 s, sys: 881 ms, total: 31.6 s
Wall time: 31.6 s
@140313_SN743_0432_AC3TTHACXX:4:1101:5633:2224:1#0/1
CCCGTGGCCTTTTCCA
+
B@CFFFFFHHHHHJJJ
@140313_SN743_0432_AC3TTHACXX:4:1101:6580:2239:1#0/1
TTTGGTAAAGGGTAAC
+
BBCFFDFFHHHHDHIJ
@140313_SN743_0432_AC3TTHACXX:4:1101:6929:2242:1#0/1
AATAATGTATGTACCT
+
BC@FFFFEFHHHHJJJ
@140313_SN743_0432_AC3TTHACXX:4:1101:13004:2221:1#0/1
CTATTGCGTGTGATCT
+
BCCFFFFFHHHHHJJJ
@140313_SN743_0432_AC3TTHACXX:4:1101:14034:2219:1#0/1
ACCCCTACCCTCTGCC
+
BBBFFFFFHHHHHJJJ
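For readers who want the same handle-based pattern without fqfa, a helper in the spirit of `open_compressed` can be approximated with the standard library. The sketch below is an assumption about how such a helper might work, not fqfa's actual implementation. It chooses the opener from the file extension alone, and `open_by_extension` is a hypothetical name.

```python
import bz2
import gzip


def open_by_extension(path):
    """Return a text-mode handle for a plain, gzip, or bzip2 file.

    The format is guessed from the file name only; contents are not inspected.
    """
    path = str(path)
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "rt")


# Example use (requires the file to be present):
# with open_by_extension("BRCA1_input_sample.fq.gz") as handle:
#     first_line = handle.readline()
```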

Benchmark 2: summarized quality statistics¶

This code calculates the median average read quality for all reads in the file.

from statistics import mean, median

pyfastx with index¶

pyfastx provides integer quality values as part of its FASTQ read data structure.

Note: in an earlier run of this benchmark, this step ran for over an hour without completing. The timing shown here reflects the pyfastx performance improvements for gzip-compressed input.

%time read_quals = [mean(x.quali) for x in pyfastx.Fastq("BRCA1_input_sample.fq.gz")]
print(f"Median average quality is {median(read_quals)}")
del read_quals

CPU times: user 53.9 s, sys: 323 ms, total: 54.2 s
Wall time: 54.2 s
Median average quality is 37.5

pyfastx without index¶

The timing here is quite a bit closer to the others, since the conversion and calculation have not already been performed as part of processing the input file.

%time read_quals = [mean([ord(c) - 33 for c in x[2]]) for x in pyfastx.Fastq("BRCA1_input_sample.fq.gz", build_index=False)]
print(f"Median average quality is {median(read_quals)}")
del read_quals

CPU times: user 55.9 s, sys: 15.4 ms, total: 55.9 s
Wall time: 56 s
Median average quality is 37.5

fqfa¶

This code uses the average_quality() method implemented by the FastqRead class.

with open_compressed("BRCA1_input_sample.fq.gz") as handle:
    %time read_quals = [x.average_quality() for x in parse_fastq_reads(handle)]
print(f"Median average quality is {median(read_quals)}")
del read_quals

CPU times: user 1min 23s, sys: 55.6 ms, total: 1min 23s
Wall time: 1min 23s
Median average quality is 37.5

Benchmark 3: filtering reads on quality¶

This code creates a list of reads for which all bases are at least Q20. The usage in this section is similar to Benchmark 2, but it runs considerably faster, following recent performance improvements in pyfastx.

pyfastx with index¶

Note: in an earlier run of this benchmark, this step ran for over an hour without completing. The timing shown here reflects the pyfastx performance improvements for gzip-compressed input.

%time filt_reads = [x for x in pyfastx.Fastq("BRCA1_input_sample.fq.gz") if min(x.quali) >= 20]
print(f"Kept {len(filt_reads)} reads after applying filter.")
del filt_reads

CPU times: user 6.17 s, sys: 360 ms, total: 6.53 s
Wall time: 6.53 s
Kept 3641707 reads after applying filter.

pyfastx without index¶

%time filt_reads = [x for x in pyfastx.Fastq("BRCA1_input_sample.fq.gz", build_index=False) if min([ord(c) - 33 for c in x[2]]) >= 20]
print(f"Kept {len(filt_reads)} reads after applying filter.")
del filt_reads

CPU times: user 7.24 s, sys: 620 ms, total: 7.86 s
Wall time: 7.87 s
Kept 3641762 reads after applying filter.

fqfa¶

This code uses the min_quality() method implemented by the FastqRead class.

with open_compressed("BRCA1_input_sample.fq.gz") as handle:
    %time filt_reads = [x for x in parse_fastq_reads(handle) if x.min_quality() >= 20]
print(f"Kept {len(filt_reads)} reads after applying filter.")
del filt_reads

CPU times: user 31.2 s, sys: 660 ms, total: 31.9 s
Wall time: 31.9 s
Kept 3641762 reads after applying filter.

Benchmarking for bzip2-compressed FASTQ files¶

from fqfa.fastq.fastq import parse_fastq_reads
from fqfa.util.file import open_compressed

Benchmark 1: list of reads¶

This code creates a list containing all the reads in the file. Only fqfa is benchmarked in this section, so every read is a FastqRead object.

Because pyfastx does not support bzip2, these results are most useful for comparing with fqfa’s gzip benchmarks.

fqfa¶

Unlike pyfastx, fqfa takes an open file handle rather than a file name. In these examples, this is addressed using a context created by a with statement.

with open_compressed("BRCA1_input_sample.fq.bz2") as handle:
    %time reads = [x for x in parse_fastq_reads(handle)]
for x in reads[:5]:
    print(x)
del reads

CPU times: user 42.2 s, sys: 1.05 s, total: 43.3 s
Wall time: 43.4 s
@140313_SN743_0432_AC3TTHACXX:4:1101:5633:2224:1#0/1
CCCGTGGCCTTTTCCA
+
B@CFFFFFHHHHHJJJ
@140313_SN743_0432_AC3TTHACXX:4:1101:6580:2239:1#0/1
TTTGGTAAAGGGTAAC
+
BBCFFDFFHHHHDHIJ
@140313_SN743_0432_AC3TTHACXX:4:1101:6929:2242:1#0/1
AATAATGTATGTACCT
+
BC@FFFFEFHHHHJJJ
@140313_SN743_0432_AC3TTHACXX:4:1101:13004:2221:1#0/1
CTATTGCGTGTGATCT
+
BCCFFFFFHHHHHJJJ
@140313_SN743_0432_AC3TTHACXX:4:1101:14034:2219:1#0/1
ACCCCTACCCTCTGCC
+
BBBFFFFFHHHHHJJJ

Benchmark 2: summarized quality statistics¶

This code calculates the median average read quality for all reads in the file.

from statistics import median

fqfa¶

This code uses the average_quality() method implemented by the FastqRead class.

with open_compressed("BRCA1_input_sample.fq.bz2") as handle:
    %time read_quals = [x.average_quality() for x in parse_fastq_reads(handle)]
print(f"Median average quality is {median(read_quals)}")
del read_quals

CPU times: user 1min 35s, sys: 277 ms, total: 1min 35s
Wall time: 1min 35s
Median average quality is 37.5

Benchmark 3: filtering reads on quality¶

This code creates a list of reads for which all bases are at least Q20. The usage in this section is quite similar to Benchmark 2, although this filter runs in roughly half the time.

fqfa¶

This code uses the min_quality() method implemented by the FastqRead class.

with open_compressed("BRCA1_input_sample.fq.bz2") as handle:
    %time filt_reads = [x for x in parse_fastq_reads(handle) if x.min_quality() >= 20]
print(f"Kept {len(filt_reads)} reads after applying filter.")
del filt_reads

CPU times: user 43 s, sys: 784 ms, total: 43.8 s
Wall time: 43.8 s
Kept 3641762 reads after applying filter.