2016 / October / Metagenomics¶
This workshop was given on October 12th and 13th, 2016, by Harriet Alexander and C. Titus Brown, at the Scripps Institution of Oceanography.
For more information, please contact Titus directly.
We have an EtherPad for sharing text and asking questions.
Rough schedule:
- Wed, 9am-noon: Amazon Web Services; short reads, QC, and trimming.
- Logging into AWS
- Read quality evaluation and FASTQ trimming (FASTQC, Trimmomatic)
- (Optional) K-mer spectral error analysis and trimming (khmer)
- Wed, 1-4pm: Assembly, annotation, and evaluation.
- Short-read assembly with MEGAHIT
- Advantages and drawbacks to assembly (discussion)
- Evaluating assembly results (mapping rates, mapping viz, etc.)
- Annotating assembly (Prokka)
- Abundance calculations
- Thursday: topics TBD from below, or others:
- Biology and bioinformatics:
- Abundance comparisons between samples
- Taxonomic analysis of reads and assemblies
- Genome binning
- CIRCOS plots
- ShotMap for annotating shotgun reads
- Moar computing:
- Jupyter Notebook or RMarkdown
- git for version control
- Docker execution environments
- Workflow strategies (scripting, make, doit, etc.)
- More Amazon Web Services: EC2, S3, ...?
- Miscellany:
- Visualizing assembly graphs
- MinHash sketches for comparing genomes and metagenomes
- Tricks with sequences using Python libraries (screed & khmer)
Tutorials:
Welcome!¶
1. Learning goals¶
For you:
- get a first (or second) look at tools;
- gain some experience in the basic command line;
- get 80% of the way to a complete analysis of some data;
- get an introduction to the philosophy and perspective of data analysis in science.
2. Safe space and code of conduct¶
This is intended to be a safe and friendly place for learning!
Please see the Software Carpentry workshop Code of Conduct: http://software-carpentry.org/conduct.html
In particular, please ask questions, because I guarantee you that your question will help others!
3. Instructor introductions¶
Harriet Alexander - postdoc at UC Davis.
Titus Brown - prof at UC Davis in the School of Vet Med.
4. Amazon and cloud computing - why?!¶
- simplifies software installation;
- can be used for bigger analyses quite easily;
- good for “burst” capacity (just got a data set!)
- accessible everywhere;
5. Sticky notes and how they work... + Minute Cards¶
Basic rules:
- no sticky note - “working on it”
- green sticky note - “all is well”
- red sticky note - “need help!”
Place the sticky notes where we can see them from the back of the room – e.g. on the back of your laptop.
At the end of each session (coffee break, lunch, end of day) please write down on an index card one thing you learned and one thing you’re still confused about.
—
Starting up an Amazon Web Services machine¶
Start an Amazon Web Services computer:¶
This page shows you how to create a new “AWS instance”, or a running computer.
Start at the Amazon Web Services console (http://aws.amazon.com/ and sign in to the console).
0. Select “EC2 - virtual servers in the cloud”¶

1. Switch to zone US West (N California)¶

2. Click on “Launch instance.”¶
3. Select “Community AMIs.”¶

4. Enter “ami-05384865” into the search box and hit enter.¶
5. Click on “Select.”¶
6. Choose m4.xlarge.¶

7. Click “Review and Launch.”¶
8. Click “Launch.”¶

9. Select “Create a new key pair.”¶
Note: you only need to do this the first time you create an instance. If you know where your amazon-key.pem file is, you can select ‘Use an existing key pair’ here. But you can always create a new key pair if you want, too.
If you have an existing key pair, go to step 12, “Launch instance.”

10. Enter name ‘amazon-key’.¶
11. Click “Download key pair.”¶
12. Click “Launch instance.”¶
13. Select View instances (lower right)¶

14. Bask in the glory of your running instance¶
Note that for your instance name you can use either “Public IP” or “Public DNS”. Here, the machine only has a public IP.

You can now Log into your instance with the UNIX shell or Configure your instance firewall.
Log into your instance with the UNIX shell¶
You will need the amazon-key.pem file that was downloaded in step 11 of booting up your new instance (see Start an Amazon Web Services computer:).
Then, you can either Log into your instance from a Mac or Linux machine or Log into your instance from a Windows machine.
Log into your instance via the UNIX shell (Mac/Linux)¶
See: Log into your instance from a Mac or Linux machine
Log into your instance via MobaXTerm (Windows)¶
See: Log into your instance from a Windows machine
Logging in is the starting point for most of the follow-on tutorials. For example, you can now install and run software on your EC2 instance.
Go back to the top page to continue: Starting up an Amazon Web Services machine
Log into your instance from a Mac or Linux machine¶
You’ll need to do two things: first, set the permissions on amazon-key.pem:
chmod og-rwx ~/Downloads/amazon-key.pem
Then, ssh into your new machine using your key:
ssh -i ~/Downloads/amazon-key.pem -l ubuntu MACHINE_NAME
where you should replace MACHINE_NAME with the public IP or hostname
of your EC2 instance, which is located at the top of the host
information box (see screenshot below). It should be something like 54.183.148.114 or ec2-XXX-YYY.amazonaws.com.
Here are some screenshots!
Change permissions and execute ssh¶

Successful login¶

Host information box - MACHINE_NAME location¶

Logging in is the starting point for most of the follow-on tutorials. For example, you can now install and run software on your EC2 instance.
Go back to the top page to continue: Starting up an Amazon Web Services machine
Log into your instance from a Windows machine¶
Go follow the instructions at this URL:
https://angus.readthedocs.org/en/2015/amazon/log-in-with-mobaxterm-win.html
Logging in is the starting point for most of the follow-on tutorials. For example, you can now install and run software on your EC2 instance.
Go back to the top page to continue: Starting up an Amazon Web Services machine
Configure your instance firewall¶
Normally, Amazon computers only allow shell logins via ssh (port 22 access). If we want to run a Web service or something else, we need to give the outside world access to other network locations on the computer.
Below, we will open ports 8000-9000, which will let us run things like RStudio Server. If you want to run other things, like a Web server, you’ll need to find the port(s) associated with those services and open those instead of 8000-9000. (Tip: Web servers run on port 80.)
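If you prefer the command line, the same rule can be added with the AWS CLI. This is a minimal sketch, assuming the CLI is installed and configured with your credentials, and that your security group is named launch-wizard-1 (check the actual name in the console):
# open TCP ports 8000-9000 to the world on the launch-wizard-1 group
aws ec2 authorize-security-group-ingress \
--group-name launch-wizard-1 \
--protocol tcp --port 8000-9000 --cidr 0.0.0.0/0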
1. Select ‘Security Groups’¶
Find “Security Groups” in the lower pane of your instance’s information page, and click on “launch-wizard-N”.

2. Select ‘Inbound’¶

3. Select ‘Edit’¶

4. Select ‘Add Rule’¶

5. Enter rule information¶
Add a new rule: Custom TCP, 8000-9000, Source Anywhere.
6. Select ‘Save’.¶
7. Return to the Instances page.¶

You’re done!
Go back to the index: Starting up an Amazon Web Services machine
Creating your own Amazon Machine Image¶
1. Actions, Create image¶

2. Fill out name and description¶

3. Wait for it to become available¶

Go back to the index: Starting up an Amazon Web Services machine
Working with persistent storage: volumes and snapshots¶
Volumes are basically UNIX disks (“block devices”) that will persist after you terminate your instance. They are tied to a zone within a region and can only be mounted on instances within that zone.
Snapshots are an Amazon-specific thing that let you communicate data on volumes between accounts. They are “read-only” backups that are created from volumes; they can be used to create new volumes in turn, and can also be shared with specific people (or made public). Snapshots are tied to a region but not a zone.
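The console steps below can also be scripted with the AWS CLI. Here is a minimal sketch, assuming the CLI is configured for your account; the volume and instance IDs are placeholders you would replace with your own, and the zone should match your instance's zone:
# create a 100 GB volume in the same zone as your instance
aws ec2 create-volume --size 100 --availability-zone us-west-1b
# attach it to your instance at /dev/sdf (IDs here are placeholders)
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
--instance-id i-0123456789abcdef0 --device /dev/sdf
# later, turn the volume into a read-only snapshot backup
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
--description "workshop data volume backup"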
Creating persistent volumes to store data¶
0. Locate your instance zone¶

1. Click on the volumes tab¶

2. ‘Create Volume’¶

3. Configure your volume to have the same zone as your instance¶

4. Wait for your volume to be available¶

5. Select volume, Actions, Attach volume¶

6. Select instance, attachment point, and Attach¶
Here, your attachment point will be ‘/dev/sdf’ and your block device will be named ‘/dev/xvdf’.

7. On your instance, list block devices¶
Type:
lsblk
You should see something like this:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda 202:0 0 8G 0 disk
`-xvda1 202:1 0 8G 0 part /
xvdf 202:80 0 100G 0 disk
Now format the disk (ONLY ON EMPTY DISKS - THIS WILL ERASE ANY DATA ON THE DISK):
sudo mkfs -t ext4 /dev/xvdf
and mount the disk:
sudo mkdir /disk
sudo mount /dev/xvdf /disk
sudo chmod a+rwxt /disk
and voila, anything you put on /disk will be on the volume that you allocated!
The command ‘df -h’ will show you what disks are actually mounted & where.
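If you want the volume to come back automatically after a reboot, you can optionally add it to /etc/fstab; the nofail option keeps the machine from hanging at boot if the volume isn't attached:
echo '/dev/xvdf /disk ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab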
Detaching volumes¶
1. Unmount it from the instance¶
Change out of the directory, stop any running programs using it, and then:
sudo umount /disk
2. Select volume, Actions, Detach Volume¶

3. Yes, detach.¶

Note, volumes remain attached when you reboot or stop an instance, but are (of course) detached when you terminate an instance.
Terminating your instance¶
Amazon will happily charge you for running instances and/or associated ephemeral storage until the cows come home - it’s your responsibility to turn things off. The Right Way to do this for running instances is to terminate.
The caveat here is that everything ephemeral will be deleted (excluding volumes that you created/attached). So you want to make sure you transfer off anything you care about.
To terminate:
1. Select Actions, Instance State, Terminate¶
In the ‘Instances’ tab, select your instance and then go to the Actions menu.

2. Agree to terminate.¶

3. Verify status on your instance page.¶
Instance state should be either “shutting down” or “terminated”.

Return to index: Starting up an Amazon Web Services machine
Things to mention and discuss¶
When do disks go away?¶
- never on reboot;
- ephemeral disks go away on stop;
- AMI-attached volumes go away on terminate;
- attached volumes never go away on terminate and have to be explicitly deleted;
- snapshots only go away when you explicitly delete them.
What are you charged for?¶
- you are charged for a running instance at that instance type's hourly rate;
- ephemeral storage/instance-specific storage is included within that.
- when you stop an instance, you are charged at disk-space rates for the stopped disk;
- when you create a volume, you are charged for that volume until you delete it;
- when you create a snapshot, you are charged for that snapshot until you delete it.
To make sure you’re not getting charged, go to your Instance view and clear all search filters; anything that is “running” or “stopped” is costing you. Also check your volumes and your snapshots - they should be empty.
Regions vs zones:¶
- AMIs and Snapshots (and keys and security groups) are per region;
- Volumes and instances are per zone;
Running RStudio Server in the cloud¶
In this section, we will run RStudio Server on a remote Amazon machine. This will require starting up an instance, configuring its network firewall, and installing and running some software.
Reference documentation for running RStudio Server on Ubuntu:
1. Start up an Amazon instance¶
Start an ami-05384865 on an m4.xlarge machine, as per the instructions here:
2. Configure your network firewall¶
Normally, Amazon computers only allow shell logins via ssh. Since we want to run a Web service, we need to give the outside world access to other network locations on the computer.
Follow these instructions:
Configure your instance firewall
(You can do this while the computer is booting.)
3. Log in via the shell¶
Follow these instructions to log in via the shell:
4. Set a password for the ‘ubuntu’ account¶
Amazon Web Services computers normally require a key (the .pem file) instead of a login password, but RStudio Server will need us to log in with a password. So we need to configure a password for the account we're going to use (which is 'ubuntu').
Create a password like so:
sudo passwd ubuntu
and set it to something you’ll remember.
5. Install R and the gdebi tool¶
Update the software catalog and install a few things:
sudo apt-get update && sudo apt-get -y install gdebi-core r-base
This will take a few minutes.
6. Download & install RStudio Server¶
wget https://download2.rstudio.org/rstudio-server-0.99.891-amd64.deb
sudo gdebi -n rstudio-server-0.99.891-amd64.deb
Upon success, you should see:
Mar 07 15:20:18 ip-172-31-6-68 systemd[1]: Starting RStudio Server...
Mar 07 15:20:18 ip-172-31-6-68 systemd[1]: Started RStudio Server.
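If you don't see those lines, or the service seems to be down, you can ask the admin utility that ships with the RStudio Server package to check and restart it:
sudo rstudio-server verify-installation
sudo rstudio-server restart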
7. Open your RStudio Server instance¶
Finally, go to 'http://' + your hostname + ':8787' in a browser, e.g.
http://ec2-XX-YY-33-165.us-west-1.compute.amazonaws.com:8787/
and log into RStudio with username 'ubuntu' and the password you set above.
Voila!
You can now just go ahead and use this, or you can "stop" it, or you can freeze it into an AMI for later use.
Note that on reboot, RStudio Server will start up again and all your files will be there.
Go back to the index: Starting up an Amazon Web Services machine.
Short read quality and trimming¶
Start up an instance with ami-05384865 and 500 GB of local storage (Start an Amazon Web Services computer:). You should also configure your firewall (Configure your instance firewall) to pass through TCP ports 8000-9000.
Then, Log into your computer.
—
You should now be logged into your Amazon computer! You should see something like this:
ubuntu@ip-172-30-1-252:~$
This is the command prompt.
Prepping the computer¶
Before we do anything else, we need to set up a place to work and install a few things.
First, let’s set up a place to work. Here, we’ll make /mnt writeable:
sudo chmod a+rwxt /mnt
Note
/mnt is the location we’re going to use on Amazon computers, but if you’re working on a local cluster, it will have a different location. Talk to your local sysadmin and ask them where they recommend putting lots of short-term working files, i.e. the “scratch” space.
Installing some software¶
Run:
sudo apt-get -y update && \
sudo apt-get -y install trimmomatic fastqc python-pip \
samtools zlib1g-dev ncurses-dev python-dev
Install anaconda:
curl -O https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
bash Anaconda3-4.2.0-Linux-x86_64.sh
Then update your environment and install khmer:
source ~/.bashrc
cd
git clone https://github.com/dib-lab/khmer.git
cd khmer
sudo python2 setup.py install
Running Jupyter Notebook¶
Let’s also run a Jupyter Notebook in /mnt. First, configure it a teensy bit more securely, and also have it run in the background:
jupyter notebook --generate-config
cat >>/home/ubuntu/.jupyter/jupyter_notebook_config.py <<EOF
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:5d813e5d59a7:b4e430cf6dbd1aad04838c6e9cf684f4d76e245c'
c.NotebookApp.port = 8000
EOF
Now, run!
cd /mnt
jupyter notebook &
You should be able to visit port 8000 on your AWS computer and see the Jupyter console. (The password is ‘davis’.)
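The long sha1:... string in the config above is a prebaked hash of the password 'davis'. If you'd rather use your own password, you can generate a fresh hash with the notebook.auth.passwd helper that ships with Jupyter, paste the result into c.NotebookApp.password, and restart the notebook. For example (replace MYPASSWORD with your own choice):
python -c "from notebook.auth import passwd; print(passwd('MYPASSWORD'))"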
Data source¶
We’re going to be using a subset of data from Hu et al., 2016. This paper from the Banfield lab samples some relatively low diversity environments and finds a bunch of nearly complete genomes.
(See DATA.md for a list of the data sets we’re using in this tutorial.)
1. Copying in some data to work with.¶
We’ve loaded subsets of the data onto an Amazon location for you, to make everything faster for today’s work. We’re going to put the files on your computer locally under the directory /mnt/data:
mkdir /mnt/data
Next, let’s grab part of the data set:
cd /mnt/data
curl -O -L https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948_1.fastq.gz
curl -O -L https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948_2.fastq.gz
Now if you type:
ls -l
you should see something like:
total 346936
-rw-rw-r-- 1 ubuntu ubuntu 169620631 Oct 11 23:37 SRR1976948_1.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 185636992 Oct 11 23:38 SRR1976948_2.fastq.gz
These are 1-million-read subsets of the original data, taken from the beginning of each file.
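If you want to check that for yourself, you can count the reads; a FASTQ record is always four lines, so the line count divided by four is the read count:
for f in SRR1976948_?.fastq.gz
do
echo "$f: $(( $(zcat "$f" | wc -l) / 4 )) reads"
done
Each file should report roughly 1,000,000 reads.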
One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. Let’s fix that before we go on any further:
chmod u-w *
We’ll talk about what these files are below.
2. Copying data into a working location¶
First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:
mkdir /mnt/work
cd /mnt/work
Now, make a “virtual copy” of the data in your working directory by linking it in –
ln -fs /mnt/data/* .
These are FASTQ files – let’s take a look at them:
less SRR1976948_1.fastq.gz
(use the spacebar to scroll down, and type ‘q’ to exit ‘less’)
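If you just want a quick look at the first couple of records instead of paging through the whole file, you can do:
zcat SRR1976948_1.fastq.gz | head -8
Each record is four lines: a header starting with '@', the sequence, a '+' separator line, and the quality string.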
Questions:
- where does the filename come from?
- why are there 1 and 2 in the file names?
Links:
3. FastQC¶
We’re going to use FastQC to summarize the data. We already installed ‘fastqc’ on our computer for you.
Now, run FastQC on two files:
fastqc SRR1976948_1.fastq.gz
fastqc SRR1976948_2.fastq.gz
Now type:
ls -d *fastqc*
to list the files, and you should see:
SRR1976948_1_fastqc.html
SRR1976948_1_fastqc.zip
SRR1976948_2_fastqc.html
SRR1976948_2_fastqc.zip
You can download these files using your Jupyter Notebook console, if you like; or you can look at these copies of them:
Questions:
- What should you pay attention to in the FastQC report?
- Which is “better”, file 1 or file 2? And why?
Links:
4. Trimmomatic¶
Now we’re going to do some trimming! We’ll be using Trimmomatic, which (as with fastqc) we’ve already installed via apt-get.
The first thing we’ll need are the adapters to trim off:
curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa
Now, to run Trimmomatic:
TrimmomaticPE SRR1976948_1.fastq.gz \
SRR1976948_2.fastq.gz \
SRR1976948_1.qc.fq.gz s1_se \
SRR1976948_2.qc.fq.gz s2_se \
ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
You should see output that looks like this:
...
Input Read Pairs: 1000000 Both Surviving: 885734 (88.57%) Forward Only Surviving: 114262 (11.43%) Reverse Only Surviving: 4 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
Questions:
- How do you figure out what the parameters mean?
- How do you figure out what parameters to use?
- What adapters do you use?
- What version of Trimmomatic are we using here? (And FastQC?)
- Do you think parameters are different for RNAseq and genomic data sets?
- What’s with these annoyingly long and complicated filenames?
- Why are we running R1 and R2 together?
For a discussion of optimal trimming strategies, see MacManes, 2014 – it’s about RNAseq but similar arguments should apply to metagenome assembly.
Links:
5. FastQC again¶
Run FastQC again on the trimmed files:
fastqc SRR1976948_1.qc.fq.gz
fastqc SRR1976948_2.qc.fq.gz
And now view my copies of these files:
Let’s take a look at the output files:
less SRR1976948_1.qc.fq.gz
(again, use spacebar to scroll, ‘q’ to exit less).
Questions:
- Is the quality-trimmed data “better” than before?
- Does it matter that you still have adapters!?
Optional: K-mer Spectral Error Trimming
K-mer Spectral Error Trimming¶
(Optional)
khmer documentation: http://khmer.readthedocs.io/en/latest
If you plot a k-mer abundance histogram of the samples, you’ll notice something: there’s an awful lot of unique (abundance=1) k-mers. These are erroneous k-mers caused by sequencing errors.
In a new Python3 Jupyter Notebook, run:
cd /mnt/work
and then
!abundance-dist-single.py -M 1e9 -k 21 SRR1976948_1.fastq.gz SRR1976948_1.fastq.gz.dist
and in another cell:
%matplotlib inline
import numpy
from pylab import *
dist1 = numpy.loadtxt('SRR1976948_1.fastq.gz.dist', skiprows=1, delimiter=',')
plot(dist1[:,0], dist1[:,1])
axis(xmax=50)
Many of these errors remain even after you do the Trimmomatic run; you can see this with:
!abundance-dist-single.py -M 1e9 -k 21 SRR1976948_1.qc.fq.gz SRR1976948_1.qc.fq.gz.dist
and then plot:
dist2 = numpy.loadtxt('SRR1976948_1.qc.fq.gz.dist', skiprows=1, delimiter=',')
plot(dist1[:,0], dist1[:,1], label='untrimmed')
plot(dist2[:,0], dist2[:,1], label='trimmed')
legend(loc='upper right')
axis(xmax=50)
This is for two reasons:
First, Trimmomatic trims based solely on the quality score, which is a statistical statement about the correctness of a base - a Q score of 30 means that, of 1000 bases with that Q score, 1 of those bases will be wrong. So, a base can have a high Q score and still be wrong! (and many bases will have a low Q score and still be correct)
Second, we trimmed very lightly - only bases that had a very low quality were removed. This was intentional because with assembly, you want to retain as much coverage as possible, and the assembler will generally figure out what the “correct” base is from the coverage.
An alternative to trimming based on the quality scores is to trim based on k-mer abundance - this is known as k-mer spectral error trimming. K-mer spectral error trimming always beats quality score trimming in terms of eliminating errors; e.g. look at this table from Zhang et al., 2014:

The basic logic is this: if you see low abundance k-mers in a high coverage data set, those k-mers are almost certainly the result of errors. (Caveat: strain variation could also create them.)
In metagenomic data sets we do have the problem that we may have very low and very high coverage data. So we don’t necessarily want to get rid of all low-abundance k-mers, because they may represent truly low abundance (but useful) data.
As part of the khmer project in my lab, we have developed an approach that sorts reads into high abundance and low abundance reads, and only error trims the high abundance reads.

This does mean that many errors may get left in the data set, because we have no way of figuring out if they are errors or simply low coverage, but that’s OK (and you can always trim them off if you really care).
To run such error trimming, use the command trim-low-abund.py
(at the command line, or prefix with a ‘!’ in the notebook):
interleave-reads.py SRR1976948_1.qc.fq.gz SRR1976948_2.qc.fq.gz |
trim-low-abund.py -V -M 8e9 -C 3 -Z 10 - -o SRR1976948.trim.fq
Why (or why not) do k-mer trimming?¶
If you can assemble your data set without k-mer trimming, there’s no reason to do it. The reason we’re error trimming here is to speed up the assembler (by removing data) and to decrease the memory requirements of the assembler (by removing a number of k-mers).
To see how many k-mers we removed, you can examine the distribution as above, or use the unique-kmers.py script:
unique-kmers.py SRR1976948_1.qc.fq.gz SRR1976948_2.qc.fq.gz
unique-kmers.py SRR1976948.trim.fq
Run the MEGAHIT assembler¶
MEGAHIT is a very fast, quite good assembler designed for metagenomes.
First, install it:
cd
git clone https://github.com/voutcn/megahit.git
cd megahit
make
Now, download some data:
cd /mnt/data
curl -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948.abundtrim.subset.pe.fq.gz
curl -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1977249.abundtrim.subset.pe.fq.gz
These are data that have been run through k-mer abundance trimming (see K-mer Spectral Error Trimming) and subsampled so that we can run an assembly in a fairly short time period :).
Now, finally, run the assembler!
mkdir /mnt/assembly
cd /mnt/assembly
ln -fs ../data/*.subset.pe.fq.gz .
~/megahit/megahit --12 SRR1976948.abundtrim.subset.pe.fq.gz,SRR1977249.abundtrim.subset.pe.fq.gz \
-o combined
This will take about 25 minutes; at the end you should see output like this:
... 12787984 bp, min 200 bp, max 61353 bp, avg 1377 bp, N50 3367 bp
... ALL DONE. Time elapsed: 1592.503825 seconds
The output assembly will be in combined/final.contigs.fa.
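For a quick sanity check on the assembly, you can count the contigs and peek at a few headers; MEGAHIT records flag, multi (coverage), and len fields in each FASTA header:
# how many contigs did we get?
grep -c '^>' combined/final.contigs.fa
# peek at a few headers
grep '^>' combined/final.contigs.fa | head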
While the assembly runs...¶
- How assembly works - whiteboarding the De Bruijn graph approach.
- Interpreting the MEGAHIT working output :)
- What does, and doesn’t, assemble?
- How good is assembly anyway?
Discussion:
- Why would we assemble, vs looking at raw reads? What are the advantages and disadvantages?
- What are the technology tradeoffs between Illumina HiSeq, Illumina MiSeq, and PacBio? (Also see this paper.)
- What kind of experimental design considerations should you have if you plan to assemble?
Some figures: the first two come from work by Dr. Sherine Awad on analyzing the data from Shakya et al. (2014). The third comes from an analysis of read search vs. contig search of a protein database.



After the assembly is finished¶
At this point we can do a bunch of things:
- annotate the assembly (Annotation with Prokka);
- evaluate the assembly’s inclusion of k-mers and reads;
- set up a BLAST database so that we can search it for genes of interest (see the sketch after this list);
- quantify the abundance of the contigs or genes in the assembly, using the original read data set (Gene Abundance Estimation with Salmon);
- bin the contigs in the assembly into species bins;
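For the BLAST option mentioned above, a minimal sketch looks like this; it assumes you have installed BLAST+ (e.g. via sudo apt-get -y install ncbi-blast+) and that genes-of-interest.fa is a hypothetical FASTA file of query sequences you supply:
cd /mnt/assembly
# build a nucleotide BLAST database from the assembly
makeblastdb -in combined/final.contigs.fa -dbtype nucl -out combined-contigs
# search your queries against it, with tabular output
blastn -query genes-of-interest.fa -db combined-contigs -outfmt 6 -out hits.tsv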
Annotation with Prokka¶
Prokka is a tool that facilitates the fast annotation of prokaryotic genomes.
The goals of this tutorial are to:
- Install Prokka
- Use Prokka to annotate our genomes
Installing Prokka¶
Download and extract the latest version of prokka:
cd ~/
wget http://www.vicbioinformatics.com/prokka-1.11.tar.gz
tar -xvzf prokka-1.11.tar.gz
We will also need some dependencies, such as BioPerl:
sudo apt-get install bioperl libdatetime-perl libxml-simple-perl libdigest-md5-perl
sudo perl -MCPAN -e shell
sudo perl -MCPAN -e 'install "XML::Simple"'
Now, you should be able to add Prokka to your $PATH and set up the index for the sequence database:
export PATH=$PATH:$HOME/prokka-1.11/bin
prokka --setupdb
Prokka should be good to go now; you can check that all is well by typing prokka. This should print the help screen with all available options.
Running Prokka¶
Make a new directory for the annotation:
cd /mnt
mkdir annotation
cd annotation
Link the metagenome assembly file into this directory:
ln -fs /mnt/assembly/combined/final.contigs.fa subset_assembly.fa
Now it is time to run Prokka! There are tons of different ways to specialize the running of Prokka. We are going to keep it simple for now, though. It will take a little bit to run.
prokka subset_assembly.fa --outdir prokka_annotation --prefix metagG
This will generate a new folder called prokka_annotation, in which will be a series of files, which are detailed here.
In particular, we will be using the *.ffn file to assess the relative read coverage within our metagenomes across the predicted genomic regions.
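As a quick check on how much Prokka found, you can count the predicted genes in that file (each FASTA record in the .ffn is one predicted gene):
grep -c '^>' prokka_annotation/metagG.ffn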
Gene Abundance Estimation with Salmon¶
Salmon is one of a breed of new, very fast RNAseq counting packages. Like Kallisto and Sailfish, Salmon counts fragments without doing up-front read mapping. Salmon can be used with edgeR and others to do differential expression analysis (if you are quantifying RNAseq data).
Today we will use it to get a handle on the relative distribution of genomic reads across the predicted protein regions.
The goals of this tutorial are to:
- Install salmon
- Use salmon to estimate gene coverage in our metagenome dataset
Installing Salmon¶
Download and extract the latest version of Salmon and add it to your PATH:
cd
wget https://github.com/COMBINE-lab/salmon/releases/download/v0.7.2/Salmon-0.7.2_linux_x86_64.tar.gz
tar -xvzf Salmon-0.7.2_linux_x86_64.tar.gz
cd Salmon-0.7.2_linux_x86_64
export PATH=$PATH:$HOME/Salmon-0.7.2_linux_x86_64/bin
Running Salmon¶
Go to the data directory and download the prokka annotated sequences, assembled metagenome, and fastq files:
mkdir -p /mnt/data
cd /mnt/data
curl -L -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948.abundtrim.subset.pe.fq.gz
curl -L -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1977249.abundtrim.subset.pe.fq.gz
curl -L -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/prokka_annotation_assembly.tar.gz
tar -xvzf prokka_annotation_assembly.tar.gz
Make a new directory for the quantification of data with Salmon:
mkdir /mnt/quant
cd /mnt/quant
Grab the nucleotide (*.ffn) predicted protein regions from Prokka and link them here. Also grab the trimmed sequence data (*.fq.gz):
ln -fs /mnt/data/prokka_annotation/*ffn .
ln -fs /mnt/data/*.abundtrim.subset.pe.fq.gz .
Create the salmon index:
salmon index -t metagG.ffn -i transcript_index --type quasi -k 31
Salmon requires that paired reads be separated into two files. We can split the reads using the split-paired-reads.py script from the khmer package:
for file in *.abundtrim.subset.pe.fq.gz
do
tail=.fq.gz
BASE=${file/$tail/}
split-paired-reads.py $BASE$tail -1 ${file/$tail/}.1.fq -2 ${file/$tail/}.2.fq
done
Now, we can quantify our reads against this reference:
for file in *.pe.1.fq
do
tail1=.abundtrim.subset.pe.1.fq
tail2=.abundtrim.subset.pe.2.fq
BASE=${file/$tail1/}
salmon quant -i transcript_index --libType IU \
-1 $BASE$tail1 -2 $BASE$tail2 -o $BASE.quant;
done
(Note that --libType must come before the read files!)
This will create a bunch of directories named after the fastq files that we just pushed through. Take a look at what files there are within one of these directories:
find SRR1976948.quant -type f
Working with count data¶
Now, the quant.sf files actually contain the relevant information about expression - take a look:
head -10 SRR1976948.quant/quant.sf
The first column contains the transcript names, and the fourth column is what we will want down the road - the normalized counts (TPM). However, they’re not in a convenient location / format for use; let’s fix that.
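If you just want a quick peek before reformatting, you can pull the name and TPM columns out directly; this assumes quant.sf has a single header line, as it does in this version of Salmon:
# show the ten highest-TPM entries (TPM first, then transcript name)
awk 'NR > 1 {print $4, $1}' SRR1976948.quant/quant.sf | sort -g -r | head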
Download the gather-counts.py script:
curl -L -O https://raw.githubusercontent.com/ngs-docs/2016-metagenomics-sio/master/gather-counts.py
and run it:
python2 ./gather-counts.py
This will give you a bunch of .counts files, which are processed from the quant.sf files and named for the directory from which they emanate.
Plotting the results¶
In Jupyter Notebook, open a new Python3 notebook and enter:
%matplotlib inline
import numpy
from pylab import *
In another cell:
cd /mnt/quant
In another cell:
counts1 = [ x.split()[1] for x in open('SRR1976948.quant.counts')]
counts1 = [ float(x) for x in counts1[1:] ]
counts1 = numpy.array(counts1)
counts2 = [ x.split()[1] for x in open('SRR1977249.quant.counts')]
counts2 = [ float(x) for x in counts2[1:] ]
counts2 = numpy.array(counts2)
plot(counts1, counts2, '*')
Day 2 - installation instructions¶
(Instructions mostly copied from Short read quality and trimming!)
Use ami-05384865, with a 500 GB local disk (see: Start an Amazon Web Services computer:)
Make /mnt/ read/write:
sudo chmod a+rwxt /mnt
Run:
sudo apt-get -y update && \
sudo apt-get -y install trimmomatic fastqc python-pip \
samtools zlib1g-dev ncurses-dev python-dev
Install anaconda:
curl -O https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
bash Anaconda3-4.2.0-Linux-x86_64.sh
Then update your environment and install khmer:
source ~/.bashrc
cd
git clone https://github.com/dib-lab/khmer.git
cd khmer
sudo python2 setup.py install
Running Jupyter Notebook¶
Let’s also run a Jupyter Notebook in /mnt. First, configure it a teensy bit more securely, and also have it run in the background.
Generate a config:
jupyter notebook --generate-config
Add a password, have it not run a browser, and put it on port 8000 by default:
cat >>/home/ubuntu/.jupyter/jupyter_notebook_config.py <<EOF
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:5d813e5d59a7:b4e430cf6dbd1aad04838c6e9cf684f4d76e245c'
c.NotebookApp.port = 8000
EOF
Now, run!
cd /mnt
jupyter notebook &
You should be able to visit port 8000 on your AWS computer and see the Jupyter console.
Technical information¶
The github repository for this workshop is public at https://github.com/ngs-docs/2016-metagenomics-sio
LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.