Why should you clean up your FASTQ files?

Pablo Gómez Sacristán - JUNIOR BIOINFORMATICIAN

You have received a few tens or hundreds of gigabytes of sequencing data in the form of FASTQ files. If you are not yet familiar with this format, the amount of data can feel overwhelming at first. What do you do with these files? How do you extract meaningful biological information from these long strings of letters? 

Whether you have performed a bulk, single-cell, or spatial transcriptomics experiment, your goal is to extract relevant biological insights from your data: for example, to understand which genes are expressed, which are not, and how your samples, cells, or tissues differ. Many technologies can be used for this purpose, but they often converge on an Illumina sequencing library.

In this blog post, we will explore how you can preprocess your sequencing data properly.

Illumina sequencing libraries & adapters

Illumina sequencing is based on short reads, typically 50-300 base pairs (bp) long. This read length range gives you enough information to confidently map a read to the genome, determine which gene it comes from, and quantify it.  

An Illumina sequencing library contains identifiers (indexes) that allow multiple libraries to be mixed in a pool and sequenced together cost-effectively. These indexes are added in the lab during library preparation, using adapters that are attached to the DNA fragments derived from your samples. Once the library is built and pooled with other libraries, it is loaded into the sequencer, which produces the tens or hundreds of gigabytes of data that need to be processed.

Evaluating sequencing data quality

Understanding the quality of your data is essential. At this stage, we use tools such as FastQC and MultiQC. FastQC generates quality metrics for each FASTQ file, while MultiQC compiles the results from all samples into a single HTML report. Together, they show whether Illumina adapters, poly-A, poly-G, and other overrepresented sequences are present in the data.
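As a rough illustration of this step, the sketch below assembles the two command lines in Python. The file names, output folders, and thread count are placeholders we made up for the example; only the `--outdir` and `--threads` options of FastQC and the `--outdir` option of MultiQC are taken from the tools themselves.

```python
import shlex


def build_qc_commands(fastq_files, fastqc_dir="fastqc_out",
                      report_dir="multiqc_out", threads=2):
    """Return the FastQC and MultiQC commands as argument lists.

    Folder names and thread count are illustrative defaults, not recommendations.
    """
    # FastQC writes one report per input file into fastqc_dir.
    fastqc_cmd = ["fastqc", "--outdir", fastqc_dir,
                  "--threads", str(threads), *fastq_files]
    # MultiQC scans fastqc_dir and writes one combined report into report_dir.
    multiqc_cmd = ["multiqc", fastqc_dir, "--outdir", report_dir]
    return fastqc_cmd, multiqc_cmd


fastqc_cmd, multiqc_cmd = build_qc_commands(
    ["sample_R1.fastq.gz", "sample_R2.fastq.gz"]  # hypothetical file names
)
print(shlex.join(fastqc_cmd))
print(shlex.join(multiqc_cmd))
```

To actually run the commands, you could pass each list to `subprocess.run(cmd, check=True)` after creating the FastQC output folder, since FastQC expects it to exist.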

Adapters and homopolymers in the sequencing data can be intrinsic to the protocol, but low sample quality and other factors can increase their occurrence. For example, highly degraded RNA yields short inserts in the Illumina library. In Illumina's two-color chemistry, a G is called when no signal is detected, so when the insert is shorter than the read length and the sequencer runs past the end of the fragment, the remaining cycles appear as a poly-G stretch in the data (see the image below).

James. (2014, January 17). NextSeq 500’s new chemistry described. Enseqlopedia. Retrieved from https://enseqlopedia.com/2014/01/nextseq-500s-new-chemistry-described/

In this example, the highlighted line represents the poly-G in the data. We can see that after 60 bp, its presence increases in the reads, suggesting that up to 35% of the reads have short insert sizes.

FastQC adapter content plot showing rising adapter contamination across read positions in multiple samples, highlighting the need for trimming in FASTQ file preprocessing.
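If you want a quick estimate of poly-G contamination outside of FastQC, a minimal sketch like the one below can help. It counts reads whose 3' end is a run of Gs. The threshold of 10 bases is an arbitrary value chosen for illustration, and the function names are ours, not part of any tool.

```python
# Hypothetical threshold: a read counts as affected when it ends in >= 10 Gs.
MIN_POLYG_RUN = 10


def trailing_polyg_length(seq):
    """Return the length of the run of G bases at the 3' end of a read."""
    seq = seq.strip().upper()
    return len(seq) - len(seq.rstrip("G"))


def polyg_fraction(reads, min_run=MIN_POLYG_RUN):
    """Fraction of reads ending in a poly-G run of at least min_run bases."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(1 for seq in reads if trailing_polyg_length(seq) >= min_run)
    return hits / len(reads)
```

Here `reads` is any iterable of read sequences, for example the sequence line of every FASTQ record in a subsample of your file.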

Cleaning up adapters and homopolymers

The presence of these sequences in your reads can compromise your analysis. Say, for example, that you want to annotate cell types. For that, you need to properly detect and quantify which genes are expressed in each cell. If you do not preprocess your sequencing data, the adapters and homopolymers in your reads can lead to non-specific mapping, which in turn distorts gene quantification.

At SCD, for our plate-based single-cell methods (SORT-seq and VASA-seq) and bulk RNA sequencing methods, we remove reads containing adapters and trim homopolymers. At our default read length, an adapter in a read usually indicates that the insert is too short or absent, so we discard the read. When a homopolymer is found, we trim it from the read and keep the remaining sequence if it is long enough for further analysis. Understanding your library structure is crucial here. For instance, if your 60 bp read contains an Illumina adapter sequence, the sequencer may have read through approximately 50 bp of poly-A tail, a UMI, and a cell or sample barcode. Such a read contains no transcript sequence and should therefore be discarded.
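The decision rules described above can be sketched in a few lines of Python. This is a simplified illustration, not our pipeline: it uses exact string matching (real trimmers tolerate mismatches), and the adapter prefix, homopolymer length, and minimum read length are placeholder values you would replace with those matching your own library.

```python
from typing import Optional

# Placeholder values for illustration only.
ADAPTER = "AGATCGGAAGAGC"   # assumed adapter prefix; check your library prep
MIN_HOMOPOLYMER = 10        # run length treated as a homopolymer
MIN_LENGTH = 20             # shortest read kept after trimming


def clean_read(seq: str, adapter: str = ADAPTER,
               min_homopolymer: int = MIN_HOMOPOLYMER,
               min_length: int = MIN_LENGTH) -> Optional[str]:
    """Return the cleaned read, or None when the read should be discarded."""
    seq = seq.upper()
    # Rule 1: an adapter in the read suggests a short or absent insert.
    if adapter in seq:
        return None
    # Rule 2: cut the read at the first poly-A or poly-G run.
    cut = len(seq)
    for base in "AG":
        pos = seq.find(base * min_homopolymer)
        if pos != -1:
            cut = min(cut, pos)
    trimmed = seq[:cut]
    # Rule 3: keep the remainder only if it is long enough to map.
    return trimmed if len(trimmed) >= min_length else None
```

A read with a clean insert is returned unchanged, a read ending in a poly-A or poly-G run is returned without that tail, and a read that contains the adapter or becomes too short yields `None`.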

Make sure you know which library preparation method you used, as this determines which Illumina adapter sequences are expected in your data and must be removed during preprocessing. 

Several tools can handle these preprocessing steps, including Trimmomatic, SeqKit, Cutadapt, and fastp. Our internal preprocessing pipeline uses Cutadapt to remove adapters and trim homopolymers. Still, we encourage you to benchmark different tools to see which works best for your data and computational resources.

Tip: if you use Cutadapt, the --minimum-length parameter sets a length threshold, and reads that are shorter than this after trimming are discarded.
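To make this concrete, here is a minimal sketch that assembles a single-end Cutadapt call in Python. The adapter prefix, the threshold of 20 bp, and the file names are assumptions for the example and are not the settings of our pipeline. The `-a`, `--discard-trimmed`, `--minimum-length`, and `-o` options are standard Cutadapt options.

```python
import shlex


def build_cutadapt_command(fastq_in, fastq_out,
                           adapter="AGATCGGAAGAGC", min_length=20):
    """Return a Cutadapt command as an argument list.

    adapter and min_length are illustrative defaults; adjust them to your
    library preparation method and read length.
    """
    return [
        "cutadapt",
        "-a", adapter,                        # 3' adapter to search for
        "--discard-trimmed",                  # drop reads in which it was found
        "--minimum-length", str(min_length),  # drop reads shorter than this
        "-o", fastq_out,
        fastq_in,
    ]


# Hypothetical file names.
cmd = build_cutadapt_command("sample_R1.fastq.gz", "sample_R1.clean.fastq.gz")
print(shlex.join(cmd))
```

Because `--discard-trimmed` removes every read in which an adapter was found, homopolymer trimming, where the rest of the read is kept, would need to be a separate Cutadapt pass.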

Removing rRNA Reads

Another important step is removing ribosomal RNA (rRNA) reads. Library preparation protocols include steps to avoid capturing these fragments: SORT-seq, for example, targets polyadenylated fragments with a poly-T capture, and VASA-seq uses rRNA depletion probes. Even so, some rRNA reads can still be expected in the data.

Our internal preprocessing pipeline uses RiboDetector, a deep learning-based tool that rapidly detects and removes rRNA reads from sequencing data.

rRNA genes are often excluded from reference genomes because of their repetitive nature and size. Mapping reads whose source sequence is absent from the reference genome is not good practice, as such reads can align non-specifically elsewhere. Removing them before mapping can therefore improve the accuracy of the results.
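A simple sanity check after this step is to compare read counts before and after rRNA removal. The sketch below assumes standard four-line FASTQ records and works on plain or gzipped files. The function names are ours and are not part of any rRNA removal tool.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def rrna_fraction(raw_fastq, filtered_fastq):
    """Fraction of reads removed between the raw and the filtered file."""
    total = count_fastq_reads(raw_fastq)
    if total == 0:
        return 0.0
    kept = count_fastq_reads(filtered_fastq)
    return (total - kept) / total
```

An unexpectedly high fraction may point to a problem with poly-A selection or rRNA depletion during library preparation, and is worth checking before you continue with mapping.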

Concluding Remarks

Properly preprocessing your sequencing data is a crucial step toward reliable results. By understanding your library preparation method and removing adapters, homopolymers, and rRNA reads, you ensure your data is clean and ready for analysis. These steps help you avoid problems such as non-specific mapping and give you more confidence in the gene expression differences you observe.

For more details, do not hesitate to reach out to our data experts by emailing data@scdiscoveries.com