The following protocol is intended to help our collaborators (and ourselves!) navigate through the steps involved in analysis of 16S rRNA sequencing data via QIIME2. We wrote this during and immediately following one of the QIIME2 workshops, incorporating much of the content we learned there. This page is no way meant to be a replacement for the extensive and excellent QIIME2 documentation. Check back often, as we plan to update this protocol as we refine how we use QIIME2 in the lab.
Step 1: Connect to a CHMI linux cluster
After you’re connected to our server, activate the QIIME enivronment as follows:
source activate qiime2
Step 2: prepare your metadata
Before you get started, there are a few basic QIIME commands that you will want to get familiar with
#take a look at all the plugins you have available to you through QIIME qiime tools #documentation for any QIIME function ca be accessed by appending --help at the end of any command. For example: qiime tools --help
Metadata is information about your samples (e.g. date collected, patient age, sex, pH, etc). This information should be contained within a single spreadsheet that has samples as rows and variables as columns.
In order to use QIIME, you must have your mapping spreadsheet correctly formatted. Set this file up as a google sheet, using this example. In order to check whether your file is correctly formatted, use the Keemei plugin for google sheets. Once Keemei says your spreadsheet is correctly formatted for QIIME2, you’re ready to proceed!
Create a .txt version of your metadata spreadsheet and trasfer it to the server so that it resides in your working directory. You can then inspect further if you want using:
qiime tools inspect-metadata [filename]
Create a visualiation of your metadata on the QIIME2 viewer (.qzv is a qiime zipped visualization).
qiime metadata tabulate --m-input-file sample-metadata.tsv --o-visualization tabulated-metadata.qzv
navigate to QIIME2 viewer in browser to view this visualization
Step 3: prepare your raw data
There are a number of ways you may have your raw data structured, depending on sequencing platform (e.g., Illumina vs Ion Torrent) and sequencing approach (e.g., single-end vs paired-end), and any pre-processing steps that have been performed by sequenencing facilities (e.g., joined paired ends, barcodes in fastq header, etc). Check out the QIIME info page, or consult with someone in our lab for help in preparing your data.
Move your fastq files and your barcode file (if data is not already demultiplexed) to your working directory on the server.
create an artifact of your data:
qiime tools import \ --type EMPSingleEndSequences \ --input-path emp-single-end-sequences \ --output-path emp-single-end-sequences.qza
Check this artifact to make sure that QIIME now recognizes your data (.qza is a qiime zipped artifact)
qiime tools peek emp-single-end-sequences.qza
unzipor the qiime commands
qiime tools extractor
qiime tools export. These unpacked files can then be used in other settings (R, perl, etc). In other words, the QIIME concept of an ‘artifact’ does not permanently alter the structure of your data.
Step 4: demultiplexing
qiime demux emp-singlein the code snipet below to produce a demux.qza file
qiime demux emp-paired \ --i-seqs emp-single-end-sequences.qza \ --m-barcodes-file sample-metadata.tsv \ --m-barcodes-column BarcodeSequence \ --o-per-sample-sequences demux.qza
Now produce a visualization of the demux’ing. This will produce a file called demux.qza, which can be viewed on the QIIME2 viewer as you did earlier for your metadata.qzv file
qiime demux summarize \ --i-data demux.qza \ --o-visualization demux.qzv
Step 5: Denoising and QC filtering
The methods for processing and analysis of 16S marker gene sequencing data continue to improve. One recent and major improvement was the introduction of two methods, Dada2 and Deblur, both of which significantly advance quality control measures by ‘denosing’ your sequences in order to better discriminate between true sequence diversity and sequencing errors. Be sure to check out the Dada2 online methods section for a more indepth (but still very readable) description of the underlying statistical method. These methods have pushed the field from the practice of binning sequences into 97% OTUs, to effectively using the sequences themselves as the unique idenfier for a taxon (also referred to 100% OTU, sub-OTUs or amplicon sequence variants).
denoise-singlein the code below.
qiime dada2 denoise-paired \ --i-demultiplexed-seqs demux.qza \ --p-trim-left 0 \ --p-trunc-len 120 \ --o-representative-sequences rep-seqs-dada2.qza \ --o-table table-dada2.qza
Now you’ll summarize your filtered/denoised data using the
qiime feature table function.
qiime feature-table summarize \ --i-table table.qza \ --o-visualization table.qzv \ --m-sample-metadata-file sample-metadata.tsv qiime feature-table tabulate-seqs \ --i-data rep-seqs.qza \ --o-visualization rep-seqs.qzv
The code above produced two .qzv files that can be explored on the QIIME2 viewer. The “interactive Sample Detail” tab of the viewer gives you a great way to explore how rarefaction depths (subsampling) will impact your data (i.e. which samples will be dropped).
Step 6: build a phylogenetic tree
Tree construction is optional, but is useful if you want to compute alpha diversity metrics that are phylogenetically based. The resulting tree is unrooted and is created using Fasttree. Since some downstream steps will require a rooted tree, we’ll also use the longest branch to root the tree (called ‘midrooting’).
#carry out a multiple seqeunce alignment using Mafft qiime alignment mafft \ --i-sequences rep-seqs.qza \ --o-alignment aligned-rep-seqs.qza #mask (or filter) the alignment to remove positions that are highly variable. These positions are generally considered to add noise to a resulting phylogenetic tree. qiime alignment mask \ --i-alignment aligned-rep-seqs.qza \ --o-masked-alignment masked-aligned-rep-seqs.qza #create the tree using the Fasttree program qiime phylogeny fasttree \ --i-alignment masked-aligned-rep-seqs.qza \ --o-tree unrooted-tree.qza #root the tree using the longest root qiime phylogeny midpoint-root \ --i-tree unrooted-tree.qza \ --o-rooted-tree rooted-tree.qza
Step 7: alpha rarefaction
Produce a graphic that shows the impact of rarefaction (sampling at different depths) on alpha diversity of each sample. These are also called ‘collectors curves’.
qiime diversity alpha-rarefaction \ --i-table table.qza \ --i-phylogeny rooted-tree.qza \ --p-max-depth 4000 \ --m-metadata-file sample-metadata.tsv \ --o-visualization alpha-rarefaction.qzv
Step 8: calculate and explore diversity metrics
diversity core-metrics-phylogenetic function to calculate a whole bunch of diversity metrics all at once. Note that you should input a sample-depth value (set at 1109 reads in the example below) based on the alpha-rarefaction analysis that you ran in step 6.
qiime diversity core-metrics-phylogenetic \ --i-phylogeny rooted-tree.qza \ --i-table table.qza \ --p-sampling-depth 1109 \ --m-metadata-file sample-metadata.tsv \ --output-dir core-metrics-results
Now test for relationships between alpha diversity and study metadata and create .qzv files to view these relationships on QIIME2 viewer.
qiime diversity alpha-group-significance \ --i-alpha-diversity core-metrics-results/faith_pd_vector.qza \ --m-metadata-file sample-metadata.tsv \ --o-visualization core-metrics-results/faith-pd-group-significance.qzv qiime diversity alpha-group-significance \ --i-alpha-diversity core-metrics-results/evenness_vector.qza \ --m-metadata-file sample-metadata.tsv \ --o-visualization core-metrics-results/evenness-group-significance.qzv
Now test for relationships between beta diversity and study metadata and create .qzv files to view these relationships on QIIME2 viewer.
qiime diversity beta-group-significance \ --i-distance-matrix core-metrics-results/unweighted_unifrac_distance_matrix.qza \ --m-metadata-file sample-metadata.tsv \ --m-metadata-column BodySite \ --o-visualization core-metrics-results/unweighted-unifrac-body-site-significance.qzv \ --p-pairwise qiime diversity beta-group-significance \ --i-distance-matrix core-metrics-results/unweighted_unifrac_distance_matrix.qza \ --m-metadata-file sample-metadata.tsv \ --m-metadata-column Subject \ --o-visualization core-metrics-results/unweighted-unifrac-subject-group-significance.qzv \ --p-pairwise
Create a PCoA plot to explore beta diversity metric. To do this you can use Emperor, a powerful tool for interactively exploring scatter plots. You do not need to install Emperor.
#first, use the unweighted unifrac data as input qiime emperor plot \ --i-pcoa core-metrics-results/unweighted_unifrac_pcoa_results.qza \ --m-metadata-file sample-metadata.tsv \ --p-custom-axes DaysSinceExperimentStart \ --o-visualization core-metrics-results/unweighted-unifrac-emperor-DaysSinceExperimentStart.qzv #now repeat with bray curtis qiime emperor plot \ --i-pcoa core-metrics-results/bray_curtis_pcoa_results.qza \ --m-metadata-file sample-metadata.tsv \ --p-custom-axes DaysSinceExperimentStart \ --o-visualization core-metrics-results/bray-curtis-emperor-DaysSinceExperimentStart.qzv
Step 9: assign taxonomy
In this step, you will take the denoised sequences from step 5 (rep-seqs.qza) and assign taxonomy to each sequence (phylum -> class -> …genus -> ). This step requires a trained classifer. You have the choice of either training your own classifier using the q2-feature-classifier or downloading a pretrained classifier.
We’ll use a classifier that has been pretrained on GreenGenes database with 99% OTUs. Download this classifier from the qiime site and place in your working directory.
wget -O "gg-13-8-99-515-806-nb-classifier.qza" "https://data.qiime2.org/2018.2/common/gg-13-8-99-515-806-nb-classifier.qza"
Now you’re ready to assign taxonomy to your sequences
qiime feature-classifier classify-sklearn \ --i-classifier gg-13-8-99-515-806-nb-classifier.qza \ --i-reads rep-seqs.qza \ --o-classification taxonomy.qza qiime metadata tabulate \ --m-input-file taxonomy.qza \ --o-visualization taxonomy.qzv
Now create a visualization of the classified sequences. Again, the resulting .qzv file produced below can be explored in the QIIME2 viewer
qiime taxa barplot \ --i-table table.qza \ --i-taxonomy taxonomy.qza \ --m-metadata-file sample-metadata.tsv \ --o-visualization taxa-bar-plots.qzv
OPTIONAL: export .biom file
#first, export your data as a .biom qiime tools export \ feature-table.qza \ --output-dir exported-feature-table #then export taxonomy info qiime tools export \ taxonomy.qza \ --output-dir exported-feature-table #then combine the two using the biome package (dependence loaded as part of QIIME2 install)
Step 10: differential abundance
The tools and theory for calling differentially abundant taxa between samples is very much an area of active research. Basically, it’s a challenging statistical problem since measurements of abundance are relative, not absolute, and therefore features are not independent (i.e. if one taxa increases in abundance, then the others will go down). This is also a common, and perhaps better studied, problem in RNAseq. QIIME2 uses ANCOM to identify differentially abundant taxa. As is the case with all statistical tests, ANCOM makes certain assumptions about your data and if these assumptions are violated, then the results of the ANCOM analysis are invalid. A key assumption made by ANCOM is that few taxa will be differentially abundant between groups. To ensure that we don’t violate this assumption, we first need to filter our data to focus on a single body site, for example, and then compare treatments, conditions, etc, within that body site
#filter based on body site qiime feature-table filter-samples \ --i-table table.qza \ --m-metadata-file sample-metadata.tsv \ --p-where "BodySite='gut'" \ --o-filtered-table gut-table.qza qiime composition add-pseudocount \ --i-table gut-table.qza \ --o-composition-table comp-gut-table.qza #now run ANCOM qiime composition ancom \ --i-table comp-gut-table.qza \ --m-metadata-file sample-metadata.tsv \ --m-metadata-column Subject \ --o-visualization ancom-Subject.qzv
Once you’re confortable with the individual steps of this workflow, you could use this shell script, to run the entire workflow.
Step 11: supervised machine learning
Beyond differential gene expression analysis, often times you will want to know which taxa (or sets of taxa) do the best job classifying particularly phenotypes, and therefore could serve as biomarkers. There are two general types of supervised learning approaches that can be used to this: ‘Classifiers’ are used to find taxa that are associated with categorical metadata (Sex, disease status, etc), while ‘regressors’ are used with continuous metadata (age, weight, etc). Typically, your data is split into a training set (2/3) and a test set (remaining 1/3 of data that was left out from the training).
We can use a random forest classifier directly within QIIME via the
qiime sample-classifier tool.
qiime sample-classifier classify-samples \ --i-table moving-pictures-table.qza \ --m-metadata-file moving-pictures-sample-metadata.tsv \ --m-metadata-column BodySite \ --p-optimize-feature-selection \ --p-parameter-tuning \ --p-estimator RandomForestClassifier \ --p-n-estimators 100 \ --o-visualization moving-pictures-BodySite.qzv
For continuous data, you can use a random forest regressor.
qiime sample-classifier regress-samples \ --i-table ecam-table.qza \ --m-metadata-file ecam-metadata.tsv \ --m-metadata-column month \ --p-optimize-feature-selection \ --p-parameter-tuning \ --p-estimator RandomForestRegressor \ --p-n-estimators 100 \ --o-visualization ecam-month.qzv
OPTIONAL: other stuff
- qiime taxa collapse - takes taxa table and taxonomy artifact and allows you to specify the level to which you collapse (e.g. phylum)
- check about q2-longitudinal plugin for regression models using longitudinal data