Before starting
The following protocol outlines the use of Sourmash for interpreting microbial whole genome sequence (WGS) or metagenomic data. Sourmash computes a fingerprint or ‘sketch’ from your WGS data using MinHash, and enables comparison of sketches from isolates (to understand strain relatedness, for example). Similarly, a sketch can be compared against a large database of sketches from RefSeq or GenBank, stored as a Sequence Bloom Tree (SBT) for rapid searching, to help assign identity to a sample. SBT databases of microbial genomes from all of GenBank and RefSeq are available directly on our server for convenience.
If you want to learn more about how Sourmash works, you could start with the original Sourmash paper and a recent follow-up paper. Also check out C. Titus Brown’s blog posts on the topic here and here, and take a look at Adam Phillippy’s original Mash paper.
Step 1: Connect to a CHMI Linux cluster
ssh username@130.91.255.137
Step 2: Check data quality
Begin by using FastQC to check the quality of each of your FASTQ files. Throughout this protocol, ‘path/to/your/data’ indicates the path to the folder on our Linux server that contains the raw sequence data from your WGS or metagenomics experiment.
cd path/to/your/data
fastqc *.gz -t 24 #uses all 24 threads available on our machine
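If you would rather keep the FastQC reports apart from the raw reads, the same check can be wrapped in a small shell function. This is only a sketch: the function name run_fastqc and the qc_reports folder are illustrative and not part of the protocol, while -o (output directory) and -t (threads) are standard FastQC options.

```shell
# Hypothetical helper: run FastQC on every gzipped file in the current
# directory and collect the reports in a separate folder.
run_fastqc() {
    outdir="${1:-qc_reports}"   # illustrative default folder name
    threads="${2:-24}"          # 24 matches the thread count quoted above
    mkdir -p "$outdir"          # FastQC expects the output folder to exist
    fastqc -o "$outdir" -t "$threads" *.gz
}

# Usage, from path/to/your/data:
#   run_fastqc qc_reports 24
```

MultiQC searches subfolders, so the summary step that follows should still pick up reports written this way.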
Step 3: Summarize QC results
MultiQC is a fantastic piece of software for aggregating and summarizing the outputs from many different kinds of bioinformatics programs in one convenient and interactive html file. In this case, we’ll use it to summarize the output from fastqc.
multiqc path/to/your/data
You should now see an .html file in your directory. Move it from our server to your local computer using an FTP client (e.g. FileZilla), double click, and explore!
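If you prefer the command line to an FTP client, scp can copy the report as well. A minimal sketch, to be run on your local computer, assuming MultiQC wrote its default file name (multiqc_report.html) and that your account allows scp; fetch_report is a made-up helper name.

```shell
# Hypothetical helper: copy the MultiQC report from the cluster into the
# current local folder.
fetch_report() {
    user="$1"          # your username on the cluster
    remote_dir="$2"    # e.g. path/to/your/data
    scp "${user}@130.91.255.137:${remote_dir}/multiqc_report.html" .
}

# Usage:
#   fetch_report username path/to/your/data
```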
Step 4: Prepare sample ‘sketches’
Here, we’ll use sourmash’s compute function to prepare a sketch of each fastq file in our directory. The --scaled 1000 option applies a 1000:1 compression ratio, which retains the ability to detect regions of similarity in the 10kb range.
sourmash compute --scaled 1000 *.gz
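Since the later steps all pass -k 31, it can be convenient to sketch at that k-mer size only and to name each signature after its sample. The loop below is a hedged sketch of that idea, assuming a sourmash version in which compute accepts -k and -o; sketch_each and the output naming are illustrative choices, not part of the protocol.

```shell
# Hypothetical helper: one signature per input file, k=31 only.
sketch_each() {
    for fq in "$@"; do
        name="${fq%.gz}"         # drop the .gz suffix
        name="${name%.fastq}"    # drop .fastq if present
        name="${name%.fq}"       # drop .fq if present
        sourmash compute --scaled 1000 -k 31 -o "${name}.sig" "$fq"
    done
}

# Usage:
#   sketch_each *.gz
```

The resulting files still match the *.sig pattern used in the steps that follow.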
Step 5: Compare sketches to each other
The sourmash compare command computes the Jaccard index for every pair of sketches generated above.
sourmash compare *.sig -k 31 -o cmp
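For intuition, the Jaccard index of two sets is the number of shared items divided by the number of distinct items overall. The toy example below works this out for two made-up lists of numbers standing in for the hash values in two sketches; it illustrates the arithmetic only and is not how sourmash is implemented. The jaccard function and the file names are hypothetical.

```shell
# Jaccard index = |A intersect B| / |A union B|, for two files with one
# item per line. Assumes at least one file is non-empty.
jaccard() {
    a_sorted=$(mktemp); b_sorted=$(mktemp)
    sort -u "$1" > "$a_sorted"
    sort -u "$2" > "$b_sorted"
    shared=$(comm -12 "$a_sorted" "$b_sorted" | wc -l)   # items in both
    total=$(sort -u "$a_sorted" "$b_sorted" | wc -l)     # items in either
    rm -f "$a_sorted" "$b_sorted"
    awk -v s="$shared" -v t="$total" 'BEGIN { printf "%.3f\n", s / t }'
}

# Two toy "sketches" of four values each, two of which are shared.
demo=$(mktemp -d)
printf '%s\n' 11 22 33 44 > "$demo/sketchA.txt"
printf '%s\n' 33 44 55 66 > "$demo/sketchB.txt"
jaccard "$demo/sketchA.txt" "$demo/sketchB.txt"   # 2 shared / 6 total, prints 0.333
```

A value of 1 means the two sets are identical, and 0 means they share nothing.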
Step 6: Visualize sample relatedness
sourmash plot --pdf --labels cmp
Step 7: Search against RefSeq
This Sequence Bloom Tree (SBT) database contains approximately 60,000 microbial genomes from NCBI’s RefSeq:
- 53865 bacterial genomes
- 5463 viral genomes
- 475 archaeal genomes
- 177 fungal genomes
- 72 protist genomes
sourmash search *.sig /data/reference_db/refseq-d2-k31.sbt.json -n 20
#shows top 20 hits
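One thing to watch: sourmash search takes a single query signature followed by one or more databases, so if *.sig expands to several files, only the first is used as the query and the others are searched alongside the SBT. When you have more than one sample, a loop such as the following runs the search once per signature. It is a sketch only; search_each and the .search.txt output names are illustrative.

```shell
# Hypothetical helper: search each signature separately against one database
# and save the top 20 hits for each sample to its own text file.
search_each() {
    db="$1"; shift
    for sig in "$@"; do
        sourmash search "$sig" "$db" -n 20 > "${sig%.sig}.search.txt"
    done
}

# Usage:
#   search_each /data/reference_db/refseq-d2-k31.sbt.json *.sig
```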
Step 8: Search against GenBank
This Sequence Bloom Tree (SBT) database contains approximately 100,000 microbial genomes (including viral and fungal) from NCBI’s GenBank.
sourmash search *.sig /data/reference_db/genbank-d2-k31.sbt.json -n 20
#shows top 20 hits
Optional: ‘What genomes are in my sample?’
Often there are questions about whether a preparation used for WGS was contaminated with another organism. Perhaps Sanger sequencing of 16S rDNA gave murky results, or restreaking an isolate showed more than one distinct colony morphology. Alternatively, you may know that your sample has multiple genomes present (e.g. metagenomics). In either case, the sourmash gather function allows you to ask exactly which organisms might be present in a sample based only on the sequence data. Like the search function above, gather needs a reference database to search against.
sourmash gather -k 31 *.sig /data/reference_db/refseq-d2-k31.sbt.json #using refseq
sourmash gather -k 31 *.sig /data/reference_db/genbank-d2-k31.sbt.json #using genbank
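Like search, gather expects one query signature followed by the database(s), so with several samples it is safer to call it once per signature. The sketch below also saves each result as a CSV file using gather's -o option; gather_each and the .gather.csv naming are illustrative and not part of the protocol.

```shell
# Hypothetical helper: run gather on each signature separately and keep the
# per-sample results as CSV files.
gather_each() {
    db="$1"; shift
    for sig in "$@"; do
        sourmash gather -k 31 "$sig" "$db" -o "${sig%.sig}.gather.csv"
    done
}

# Usage:
#   gather_each /data/reference_db/genbank-d2-k31.sbt.json *.sig
```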
On the other hand, if you suspect contamination of a particular kind (e.g. host, plasmid, or a common bacterium used in the lab), you can run fastq_screen to check a subsample of reads from your raw fastq file against a set of reference genomes. Although this has nothing to do with Sourmash or MinHash sketches per se, it can be a useful way to confirm findings from Sourmash, or to check for organisms not well represented in the SBT reference databases provided above.
fastq_screen uses bowtie2 to align reads to the references. We’ve taken care of configuring fastq_screen so that it knows where to find bowtie2 and where to look for the reference genomes; this information is outlined in the fastq_screen configuration file found at /usr/local/bin/fastq_screen_v0.12.0/fastq_screen.conf
We’ve provided the following set of reference genomes on our cluster, to which you can easily compare your reads:
- Human
- Mouse (Mus musculus)
- Dog (Canis familiaris)
- Cow (Bos taurus)
- Horse (Equus caballus)
- Pig (Sus scrofa)
- Chicken (Gallus gallus)
- Fruitfly (Drosophila melanogaster)
- Yeast (Saccharomyces cerevisiae)
- E. coli (strain K12)
- Staph (Staphylococcus aureus strain NCTC 8325)
- Clostridium hiranonis
- Porcine retrovirus
- Lambda phage (Enterobacteriophage lambda)
- PhiX
- Contaminants
- plasmids/vectors
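To see which references fastq_screen is actually configured with, and in what order, you can read them out of the configuration file. This is a sketch that assumes the usual fastq_screen.conf layout, in which each reference is declared on a line of the form "DATABASE name path"; list_screen_dbs is a made-up helper name.

```shell
# Hypothetical helper: print a numbered list of the references declared in a
# fastq_screen configuration file. Commented-out entries are skipped because
# their first field is not exactly "DATABASE".
list_screen_dbs() {
    awk '$1 == "DATABASE" { n++; print n ": " $2 }' "$1"
}

# Usage:
#   list_screen_dbs /usr/local/bin/fastq_screen_v0.12.0/fastq_screen.conf
#
# A basic screen of one sample against all configured references:
#   fastq_screen sampleX.fastq.gz
```

The order printed here matters for the filtering option described later, which refers to references by position.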
Optional: filtering reads
Depending on the results you get with sourmash gather or fastq_screen above, you may want to filter reads based on alignment to a particular reference genome of interest. This is particularly useful for removing host reads contaminating a metagenomic sample, for example. To do this, you can use the --tag and --filter options for fastq_screen.
First, tag each read in each fastq with the genome to which it aligns (from the available references described above):
fastq_screen --tag sampleX.fastq.gz
Next, filter based on the tags that were assigned above:
fastq_screen --filter 1000 sampleX.fastq.gz
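According to the fastq_screen documentation, each character of the --filter string refers to one reference genome, in the order the genomes appear in the configuration file: 0 keeps reads that do not align to that genome, 1 reads that align uniquely, 2 reads that multi-map, 3 reads that align at all, and - ignores the genome. Under that reading, 1000 keeps reads that align uniquely to the first reference and not at all to the next three. The helper below builds such a string for a single genome of interest. It is a sketch under stated assumptions: build_filter is a made-up name, and the example assumes 17 references with Human listed first, which you should check against your own configuration file.

```shell
# Hypothetical helper: build a filter string of n characters with `code` at
# position `pos` (1-based) and "-" (ignore this genome) everywhere else.
build_filter() {
    n="$1"; pos="$2"; code="$3"
    out=""
    i=1
    while [ "$i" -le "$n" ]; do
        if [ "$i" -eq "$pos" ]; then
            out="${out}${code}"
        else
            out="${out}-"
        fi
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

# Assumed layout: 17 references, Human first. Code 0 at position 1 keeps only
# reads that do not align to Human, i.e. it removes host reads.
host_free=$(build_filter 17 1 0)
echo "$host_free"

# Usage:
#   fastq_screen --filter "$host_free" sampleX.fastq.gz
```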