| Time | Session | Format |
|------|---------|--------|
| 1pm | Welcome | talk - Kyle and Dan |
| | Launch Instance | activity - Dan |
| | Intro to command line | talk - Dan |
| | Install Sunbeam, get data | activity - Kyle |
| 2pm | Shotgun sequencing intro | talk - Dan |
| | Initialize, configure, run | activity - Kyle |
| 3pm | Explore results, make report | activity - Kyle |
| | Download report | activity - Dan |
| | Discuss results | talk - Kyle and Dan |
In this workshop we’ll use Google Cloud to analyze raw ‘shotgun’ metagenomic sequence data and identify the microbial composition of stool from Crohn’s disease patients. In addition to generating this microbial census, you’ll also assemble sequences into contigs, which can then be used to infer the functional potential of the microbiome. To accomplish these tasks we’ll use Sunbeam, a Snakemake-based metagenomics pipeline developed by Kyle Bittinger and his group at the PennCHOP Microbiome Center.
This page is meant to serve as a guide to walk you through the workshop material, while providing a resource you can revisit after the workshop to practice and begin to adapt this workflow for your own studies.
To participate in this workshop, you’ll only need a few things:
- a laptop computer
- an internet connection
- a google account (free)
- a google cloud account (free sign-up comes with a $300 credit!)
- Dan’s slides covering the basics of working in the terminal and a quick intro to metagenomic sequencing are here
- Kyle’s slides on analysis of metagenomic data are here
Set-up cloud computer
We’ll begin the workshop with a demonstration of how to launch your first Google Cloud instance, building a cloud computer with the following specs:
- 8 cores
- 50 GB of RAM
- a 100 GB solid-state hard drive
- the Ubuntu 18.04 LTS (Linux) operating system
Once you have finalized this instance, you have effectively rented a computer from Google, and everyone in the workshop is using exactly the same type of computer, with the same operating system and compute resources. For the computer we set up above, you will be charged about 36 cents per hour, or roughly $260/month. The more powerful the computer, the more rent you will be charged, regardless of whether or not you actually use these resources.
Connect to your cloud computer using the ‘ssh’ button next to the instance.
Install some software using the Advanced Package Tool (apt), a free program that works with core libraries to handle the installation and removal of software on Debian, Ubuntu, and other Linux distributions:

```
# first, update all current packages
sudo apt-get update
# now install the R programming language
sudo apt-get install r-base
```
- Download Sunbeam from GitHub using the code below:

```
cd ~
git clone -b stable https://github.com/sunbeam-labs/sunbeam sunbeam-stable
ls
```
- Install Sunbeam
```
cd sunbeam-stable
bash install.sh
```
- Notice that we got a few warnings at the end of the Sunbeam installation. Although Conda is now installed on our cloud computer, it has not been added to our PATH. We can fix this with the following code (replace “dbeiting” with your own username):

```
# first, take a look at what is in your PATH
echo $PATH
# now add the location of Conda to your PATH
echo "export PATH=$PATH:/home/dbeiting/miniconda3/bin" >> ~/.bashrc
```
Close your SSH terminal window and open a new one so the updated ~/.bashrc is loaded. Then check your PATH again with `echo $PATH`. Notice that it has been updated with the location of the Conda environments.
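If the PATH mechanics feel unfamiliar, here is a quick sandbox demonstration. The `/tmp/demo-bin` directory is just a stand-in, and `export` only lasts for the current shell session; the `>> ~/.bashrc` step above is what makes the change permanent.

```shell
# show the current PATH, one entry per line
echo "$PATH" | tr ':' '\n'
# append a (hypothetical) directory for this shell session only
export PATH="$PATH:/tmp/demo-bin"
# the new entry is now at the end of the list
echo "$PATH" | tr ':' '\n' | tail -n 1
```

The shell searches these directories in order when you type a command, which is why adding Conda's `bin` directory makes `conda` callable from anywhere.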
Since Sunbeam was installed as a Conda environment, we have to enter this environment to start using the software
source activate sunbeam
This is a command you’ll want to remember for future sessions. Each time you log into your cloud instance, you’ll need to activate the pipeline with `source activate sunbeam`. Upon activation, you should see that your command prompt begins with “(sunbeam)”. Anytime you want to exit out of Sunbeam, simply type `source deactivate` and hit return.
- Let’s install some additional software in our environment. SRA Tools will allow us to easily retrieve raw data from NCBI’s Sequence Read Archive. The data are also available on GitHub here.
conda install -c bioconda sra-tools
- For this workshop, we’ll use data from a recent metagenomics study in Crohn’s disease. This was a large study, but for the purpose of the workshop we’ll only fetch data from 7 patients. Note: contaminating human reads have already been removed from these files. Let’s download these data to our cloud computer using the `fasterq-dump` command from the SRA Tools software.
```
cd ~
mkdir workshop-data
cd workshop-data
fasterq-dump SRR2145310 -e 8
fasterq-dump SRR2145329 -e 8
fasterq-dump SRR2145381 -e 8
fasterq-dump SRR2145353 -e 8
fasterq-dump SRR2145354 -e 8
fasterq-dump SRR2145492 -e 8
fasterq-dump SRR2145498 -e 8
```
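The seven download commands above can also be written as a single shell loop. The `echo` below lets you dry-run the loop and see the commands it would issue; remove it to actually download.

```shell
# loop over the seven SRA accessions used in this workshop
for acc in SRR2145310 SRR2145329 SRR2145381 SRR2145353 SRR2145354 SRR2145492 SRR2145498; do
  echo fasterq-dump "$acc" -e 8   # drop "echo" to run for real
done
```

Loops like this pay off quickly once your own studies involve dozens of samples.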
- Initialize a new Sunbeam project, pointing it at the raw data we just downloaded:

```
cd ~
mkdir workshop-project
sunbeam init workshop-project --data_fp workshop-data
```
- Use the nano text editor to explore the samples file and configuration file.
Download reference data
We need two reference databases to run our analysis: a database of host DNA sequence to remove, and a database of bacterial DNA to match against.
We’ll get the human genome data from UCSC. Filtering against the entire human genome takes too long, so we’ll only filter against chromosome 1.
```
cd ~
mkdir human
cd human
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr1.fa.gz
gunzip chr1.fa.gz
```
- Sunbeam requires that the host DNA sequence files end in “.fasta”, so it can find them automatically. Let’s use the `mv` command to rename this file.
mv chr1.fa chr1.fasta
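If you ever need to rename several .fa files at once (say, multiple host chromosomes), a small loop with shell parameter expansion does it. Here is a sketch on throwaway files in a scratch directory:

```shell
# make a scratch directory with two dummy .fa files
mkdir -p rename-demo
touch rename-demo/chr1.fa rename-demo/chr2.fa
# ${f%.fa} strips the .fa suffix; we then append .fasta
for f in rename-demo/*.fa; do
  mv "$f" "${f%.fa}.fasta"
done
ls rename-demo
```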
- The database of bacterial genomes comes pre-built from the homepage of our taxonomic assignment software, Kraken. We’ll download and unpack it using `wget` and `tar`:

```
cd ~
wget https://ccb.jhu.edu/software/kraken/dl/minikraken_20171101_4GB_dustmasked.tgz
tar xvzf minikraken_20171101_4GB_dustmasked.tgz
```
- Now that we have reference databases, we need to add them to our configuration file. We’ll use `nano` to open and modify this file directly in our terminal:

```
cd ~
nano workshop-project/sunbeam_config.yml
```
- The configuration values are below. You’ll need to navigate to the right spot in your configuration file, and substitute “USERNAME” with your google username.
- The default configuration for Sunbeam uses 4 threads, but we have 16 threads available, since our cloud machine has 8 cores. Let’s use all threads we have.
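As a rough reference, the edited entries in sunbeam_config.yml should end up looking something like the sketch below. Key names and layout vary between Sunbeam versions, and the extracted database directory name is an assumption, so treat this as a guide rather than a copy-paste target, and substitute your own username:

```
qc:
  host_fp: /home/USERNAME/human
classify:
  threads: 16
  kraken_db_fp: /home/USERNAME/minikraken_20171101_4GB_dustmasked
```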
Run the pipeline
- We are ready to actually run the pipeline. All the information about how to run the pipeline is in our configuration file, so we’ll provide that to Sunbeam with the `--configfile` argument. We’ll also let Sunbeam know how many CPU cores we’d like to use with the `--jobs` argument.

```
cd ~
sunbeam run --configfile workshop-project/sunbeam_config.yml --jobs 8
```
Explore your results
- Open your QC results and take note of the number of forward and reverse reads that passed the quality filter, as well as the number of host reads filtered out of each sample.

```
cd ~/workshop-project/sunbeam_output/qc/reports/
nano preprocess_summary.tsv
```
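nano works, but for a quick aligned view of any tab-separated file, `column -t` is handy. A sketch on a stand-in file with made-up numbers (on your instance, point it at preprocess_summary.tsv instead):

```shell
# build a tiny stand-in TSV (hypothetical column names and counts)
printf 'sample\tinput\tboth_kept\thost_filtered\n' >  qc-demo.tsv
printf 'SRR2145310\t100000\t96500\t950\n'          >> qc-demo.tsv
# align the columns for easy reading
column -t qc-demo.tsv
```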
- Take a look at the taxonomic breakdown for one of the samples:

```
cd ~/workshop-project/sunbeam_output/classify/kraken
nano SRR2145310-taxa.tsv
```
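To surface the most abundant taxa first, you can sort on the read-count column. The two-column stand-in file below (with made-up counts) is an assumption; the real SRR2145310-taxa.tsv may have a different column layout, so adjust the `-k` field accordingly:

```shell
# stand-in taxa table: taxon name, read count
printf 'Bacteroides\t5000\nEscherichia\t120\nFaecalibacterium\t900\n' > taxa-demo.tsv
# numeric (-n), reverse (-r) sort on column 2 puts the most abundant taxon on top
sort -k2,2 -nr taxa-demo.tsv
```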
Generate a report
- To look at some of our results, we’ll install a Sunbeam extension and generate a report.
```
cd ~/sunbeam-stable/extensions
git clone https://github.com/sunbeam-labs/sbx_report
conda install --file sbx_report/requirements.txt
```
- Now run the extension you just installed
```
cd ~
sunbeam run --configfile workshop-project/sunbeam_config.yml final_report
```
The summary report you just prepared is conveniently available as a single .html file that can be opened in any browser. The problem is that this file resides on a Google-owned hard drive, and there is no simple way to open and view .html files directly in the terminal. So we need to transfer it to your laptop’s hard drive.
Notice that your SSH terminal window has a small gear icon in the upper right-hand corner of the screen. Click on this and choose Download file from the dropdown menu.
In the pop-up box, enter the path to the summary report, substituting your own username:
`/home/USERNAME/workshop-project/sunbeam_output/reports/final_report.html`
To wrap up the workshop, we’ll explore and discuss the report together. Just in case you had any issues retrieving this file from the cloud instance, you can also view a copy of the report here.
Time to take a live survey!
Practice after workshop
Practice makes perfect (or at least better!). After destroying your instance, try firing up a fresh instance and running through this entire tutorial again from start to finish, exactly as we’ve outlined above. As you do, take some time to really think about each line of code and what it accomplishes. If you don’t understand a command, start investigating via Google.