Step 1: Project set-up
Quality assurance (QA) can mean many things. To us, QA means not only that the raw data files are examined for any issues that could compromise downstream analyses, but also that the data is organized so that others can understand what was done for a given project. This greatly improves transparency and reproducibility. To keep the different file types and analyses for your RNAseq project clear and organized, we recommend an approach we call MInimal Directory for Analysis of Sequencing data (MIDAS). MIDAS_RNAseq is simply a directory structure that provides a framework for organizing your RNAseq experiment.
|-- DATA
    |-- raw
        |-- sample1_mergedLanes.fastq.gz
        |-- sample2_mergedLanes.fastq.gz
        |-- ...
    |-- processed
        |-- preprocessing.sh
        |-- sample1_mergedLanes_trim.fastq.gz
        |-- sample2_mergedLanes_trim.fastq.gz
        |-- ...
|-- ANALYSIS
    |-- code
        |-- myProject.Rproj
        |-- step1.R
        |-- step2.R
        |-- ...
        |-- projectSummary.rmd
        |-- projectDashboard.rmd
    |-- readmapping
        |-- readmapping.sh
        |-- reference.fasta
        |-- reference.index
|-- QA
    |-- library_prep
        |-- RNAquality_tapestation.pdf
        |-- Library_tapestation.pdf
    |-- fastqc
        |-- sample1.fastqc.html
        |-- sample1.fastqc.zip
        |-- sample2.fastqc.html
        |-- sample2.fastqc.zip
        |-- ...
    |-- fastq_screen
    |-- multiqc_report.html
|-- studyDesign.txt
|-- readme.txt
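The empty skeleton for this layout can be created in one go. The sketch below assumes that DATA, ANALYSIS and QA sit side by side at the top level of the project; the function name `make_midas` and the root name `MIDAS_RNAseq` are placeholders chosen for illustration.

```shell
# Create an empty MIDAS-style skeleton under the given project root.
# Assumes DATA, ANALYSIS and QA are top-level siblings.
make_midas() {
    root="$1"
    mkdir -p \
        "$root/DATA/raw" \
        "$root/DATA/processed" \
        "$root/ANALYSIS/code" \
        "$root/ANALYSIS/readmapping" \
        "$root/QA/library_prep" \
        "$root/QA/fastqc" \
        "$root/QA/fastq_screen"
    # Empty placeholder files at the top level of the project
    touch "$root/studyDesign.txt" "$root/readme.txt"
}

# Example: build the skeleton in a scratch location and list its folders
demo_root="$(mktemp -d)/MIDAS_RNAseq"
make_midas "$demo_root"
find "$demo_root" -type d | sort
```

For a real project, call `make_midas` with the path where the project should live instead of a temporary folder.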
Step 2: Connect to a CHMI Linux cluster
Once you have your directory structure set up, you’re ready to begin using some software tools to investigate the quality of the reads in your fastq files. These tools run best on a machine with a fair amount of RAM and disk storage. We use our Linux machine.
Step 3: Check data quality
Begin by using fastqc to check the quality of each of your fastq files. Throughout this protocol, we’ll assume you use a directory structure like the one outlined above. The file paths below reference this directory structure.
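    # navigate to the folder with your raw fastq files
    cd DATA/raw
    # fastqc does not create the output folder, so make sure it exists
    mkdir -p ../../QA/fastqc
    # run fastqc on all files using 24 threads, putting the outputs into the QA/fastqc folder
    fastqc *.gz -t 24 -o ../../QA/fastqc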
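After fastqc finishes, it can be worth confirming that every raw file actually produced a report. The helper below is a minimal sketch: the function name `missing_fastqc` is made up for illustration, and it assumes FastQC's default report naming (`<name>_fastqc.html`), so adjust the pattern if your reports are named differently.

```shell
# Print the base name of every raw fastq.gz file that has no FastQC HTML report yet.
# Usage: missing_fastqc <raw_dir> <fastqc_dir>
missing_fastqc() {
    raw_dir="$1"
    qc_dir="$2"
    for fq in "$raw_dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue             # skip if there are no fastq files at all
        base="$(basename "$fq" .fastq.gz)"   # e.g. sample1_mergedLanes
        # Assumed naming: FastQC writes <base>_fastqc.html
        [ -f "$qc_dir/${base}_fastqc.html" ] || echo "$base"
    done
}
```

Run from the top of the project as `missing_fastqc DATA/raw QA/fastqc`; no output means every file has a report.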
Optional: check for contamination
There are often questions about whether a data file contains reads from something other than the intended sample source. If you suspect a particular kind of contamination (e.g. another host, plasmid, rRNA, or a common lab bacterium), you can run fastq_screen to check a subsample of reads from your raw fastq file against a set of reference genomes.
fastq_screen uses bowtie2 to align reads to the references, so we’ve provided a set of reference genomes on our cluster against which you can easily compare your reads.
We’ve taken care of configuring fastq_screen so that it knows where to find bowtie2 and where to look for the reference genomes. This information is outlined in the fastq_screen configuration file, found at /usr/local/bin/fastq_screen_v0.12.0/fastq_screen.conf
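To see which references a given configuration file points to, you can pull out its DATABASE entries. This is a small sketch that assumes the usual fastq_screen layout of `DATABASE <name> <index path>` on each active line; the function name `list_screen_dbs` is hypothetical.

```shell
# Print the name and index path of each reference listed in a fastq_screen config.
# Usage: list_screen_dbs <path/to/fastq_screen.conf>
list_screen_dbs() {
    # Active entries have DATABASE as their first field;
    # commented-out entries start with '#' and are skipped.
    awk '$1 == "DATABASE" { print $2 "\t" $3 }' "$1"
}
```

For example, `list_screen_dbs /usr/local/bin/fastq_screen_v0.12.0/fastq_screen.conf` prints one tab-separated line per configured genome.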
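    # navigate to the folder with your raw fastq files
    cd DATA/raw
    # screen all files using 24 threads, putting the outputs into the QA/fastq_screen folder
    fastq_screen --threads 24 --outdir ../../QA/fastq_screen *.gz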
The reference genomes available on our cluster include:
- Mouse (Mus musculus)
- Dog (Canis familiaris)
- Cow (Bos taurus)
- Horse (Equus caballus)
- Pig (Sus scrofa)
- Chicken (Gallus gallus)
- Fruit fly (Drosophila melanogaster)
- Yeast (Saccharomyces cerevisiae)
- E. coli (strain K12)
- Staph (Staphylococcus aureus strain NCTC 8325)
- Lambda phage (Enterobacteriophage lambda)
Step 4: Summarize QA results
MultiQC is a fantastic piece of software for aggregating and summarizing the outputs from many different kinds of bioinformatics programs in one convenient and interactive html file. In this case, we’ll use it to summarize the output from fastqc.
    # multiqc searches the current folder and all of its subfolders (DATA, ANALYSIS and QA) for log files;
    # the -d flag prepends each sample's directory to its name in the report
    cd MIDAS/
    multiqc -d .
You should now see a multiqc_report.html file in your project directory. Move this to your QA folder. You can also copy it from our server to your local computer using an FTP client (e.g. FileZilla), then double-click the file to open it in your browser and explore!
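Filing the report can be scripted as well. The sketch below assumes the report was written to the top of the project and that QA is a top-level folder; the function name `file_multiqc_report` is a placeholder.

```shell
# Move a freshly generated MultiQC report into the project's QA folder.
# Usage: file_multiqc_report <project_root>
file_multiqc_report() {
    root="$1"
    report="$root/multiqc_report.html"
    if [ -f "$report" ]; then
        mkdir -p "$root/QA"
        mv "$report" "$root/QA/"
    else
        echo "no multiqc_report.html found in $root" >&2
        return 1
    fi
}
```

From inside the project directory this would be called as `file_multiqc_report .`. Alternatively, multiqc's `-o` option sets the output directory, so `multiqc -d . -o QA` should write the report there directly.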
To document your analysis in a transparent and reproducible way, use the