2016 / October / Metagenomics

This workshop was given on October 12th and 13th, 2016, by Harriet Alexander and C. Titus Brown at the Scripps Institution of Oceanography.

For more information, please contact Titus directly.

We have an EtherPad for sharing text and asking questions.

Rough schedule:

  • Wed, 9am-noon: Amazon Web Services; short reads, QC, and trimming.
      • Logging into AWS
      • Read quality evaluation and FASTQ trimming (FASTQC, Trimmomatic)
      • (Optional) K-mer spectral error analysis and trimming (khmer)
  • Wed, 1-4pm: Assembly, annotation, and evaluation.
      • Short-read assembly with MEGAHIT
      • Advantages and drawbacks to assembly (discussion)
      • Evaluating assembly results (mapping rates, mapping viz, etc.)
      • Annotating assembly (Prokka)
      • Abundance calculations
  • Thursday: topics TBD from below, or others:
      • Biology and bioinformatics:
          • Abundance comparisons between samples
          • Taxonomic analysis of reads and assemblies
          • Genome binning
          • CIRCOS plots
          • ShotMap for annotating shotgun reads
      • Moar computing:
          • Jupyter Notebook or RMarkdown
          • git for version control
          • Docker execution environments
          • Workflow strategies (scripting, make, doit, etc.)
          • More Amazon Web Services: EC2, S3, ...?
      • Miscellany:
          • Visualizing assembly graphs
          • MinHash sketches for comparing genomes and metagenomes
          • Tricks with sequences using Python libraries (screed & khmer)

Tutorials:

Welcome!

1. Learning goals

For you:

  • get a first (or second) look at tools;
  • gain some experience with the basic command line;
  • get 80% of the way to a complete analysis of some data;
  • get an introduction to the philosophy and perspective of data analysis in science.

2. Safe space and code of conduct

This is intended to be a safe and friendly place for learning!

Please see the Software Carpentry workshop Code of Conduct: http://software-carpentry.org/conduct.html

In particular, please ask questions, because I guarantee you that your question will help others!

3. Instructor introductions

Harriet Alexander - postdoc at UC Davis.

Titus Brown - prof at UC Davis in the School of Vet Med.

4. Amazon and cloud computing - why?!

  • simplifies software installation;
  • can be used for bigger analyses quite easily;
  • good for “burst” capacity (just got a data set!)
  • accessible everywhere;

5. Sticky notes and how they work... + Minute Cards

Basic rules:

  • no sticky note - “working on it”
  • green sticky note - “all is well”
  • red sticky note - “need help!”

Place the sticky notes where we can see them from the back of the room – e.g. on the back of your laptop.

At the end of each session (coffee break, lunch, end of day) please write down on an index card one thing you learned and one thing you’re still confused about.

Next: Starting up an Amazon Web Services machine

Starting up an Amazon Web Services machine


Start an Amazon Web Services computer:

This page shows you how to create a new “AWS instance”, or a running computer.


Start at the Amazon Web Services console (http://aws.amazon.com/) and sign in to the console.

0. Select “EC2 - virtual servers in the cloud”
1. Switch to zone US West (N California)
2. Click on “Launch instance.”
3. Select “Community AMIs.”
4. Search for ami-05384865 (ubuntu-wily-15.10-amd64-server)

Use ami-05384865.

5. Click on “Select.”
6. Choose m4.xlarge.
7. Click “Review and Launch.”
8. Click “Launch.”
9. Select “Create a new key pair.”

Note: you only need to do this the first time you create an instance. If you know where your amazon-key.pem file is, you can select ‘Use an existing key pair’ here. But you can always create a new key pair if you want, too.

If you have an existing key pair, go to step 12, “Launch instance.”

10. Enter name ‘amazon-key’.
11. Click “Download key pair.”
12. Click “Launch instance.”
13. Select View instances (lower right)
14. Bask in the glory of your running instance

Note that for your instance name you can use either “Public IP” or “Public DNS”. Here, the machine only has a public IP.

You can now Log into your instance with the UNIX shell or Configure your instance firewall.

Log into your instance with the UNIX shell

You will need the amazon-key.pem file that was downloaded in step 11 of booting up your new instance (see Start an Amazon Web Services computer:).

Then, you can either Log into your instance from a Mac or Linux machine or Log into your instance from a Windows machine.

Log into your instance via the UNIX shell (Mac/Linux)

See: Log into your instance from a Mac or Linux machine

Log into your instance via MobaXTerm (Windows)

See: Log into your instance from a Windows machine


Logging in is the starting point for most of the follow-on tutorials. For example, you can now install and run software on your EC2 instance.

Go back to the top page to continue: Starting up an Amazon Web Services machine

Log into your instance from a Mac or Linux machine

You’ll need to do two things: first, set the permissions on amazon-key.pem:

chmod og-rwx ~/Downloads/amazon-key.pem

Then, ssh into your new machine using your key:

ssh -i ~/Downloads/amazon-key.pem -l ubuntu MACHINE_NAME

where you should replace MACHINE_NAME with the public IP or hostname of your EC2 instance, which is located at the top of the host information box (see screenshot below). It should be something like 54.183.148.114 or ec2-XXX-YYY.amazonaws.com.

(Screenshots in the original tutorial show: changing the key permissions and running ssh; a successful login; and the host information box, where MACHINE_NAME is located.)

Logging in is the starting point for most of the follow-on tutorials. For example, you can now install and run software on your EC2 instance.

Go back to the top page to continue: Starting up an Amazon Web Services machine

Log into your instance from a Windows machine

Go follow the instructions at this URL:

https://angus.readthedocs.org/en/2015/amazon/log-in-with-mobaxterm-win.html

Logging in is the starting point for most of the follow-on tutorials. For example, you can now install and run software on your EC2 instance.

Go back to the top page to continue: Starting up an Amazon Web Services machine

Configure your instance firewall

Normally, Amazon computers only allow shell logins via ssh (port 22 access). If we want to run a Web service or something else, we need to give the outside world access to other network locations on the computer.

Below, we will open ports 8000-9000, which will let us run things like RStudio Server. If you want to run other things, like a Web server, you’ll need to find the port(s) associated with those services and open those instead of 8000-9000. (Tip: Web servers run on port 80.)

1. Select ‘Security Groups’

Find “Security Groups” in the lower pane of your instance’s information page, and click on “launch-wizard-N”.

2. Select ‘Inbound’
3. Select ‘Edit’
4. Select ‘Add Rule’
5. Enter rule information

Add a new rule: Custom TCP, 8000-9000, Source Anywhere.

6. Select ‘Save’.
7. Return to the Instances page.

You’re done!
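
Once the rule is saved, you can sanity-check it from your own laptop (not the instance) with netcat, assuming nc is installed locally; note that the connection will only succeed once something (e.g. Jupyter or RStudio Server) is actually listening on that port:

# Run from your laptop; MACHINE_NAME is a placeholder for your instance's public IP or hostname.
nc -zv MACHINE_NAME 8000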

Go back to the index: Starting up an Amazon Web Services machine

Creating your own Amazon Machine Image

1. Actions, Create image
2. Fill out name and description
3. Wait for it to become available

Go back to the index: Starting up an Amazon Web Services machine

Working with persistent storage: volumes and snapshots

Volumes are basically UNIX disks (“block devices”) that will persist after you terminate your instance. They are tied to a zone within a region and can only be mounted on instances within that zone.

Snapshots are an Amazon-specific mechanism for sharing the data on volumes between accounts. They are “read-only” backups created from volumes; they can in turn be used to create new volumes, and can also be shared with specific people (or made public). Snapshots are tied to a region but not a zone.
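
(If you prefer the command line, the console steps below can also be done with the AWS CLI, assuming it is installed and configured on your laptop; this is only a sketch, and the volume and instance IDs are placeholders to replace with your own.)

# Create a 100 GB general-purpose (gp2) volume in the same zone as your instance.
aws ec2 create-volume --size 100 --volume-type gp2 --availability-zone us-west-1b

# Attach it to your instance at /dev/sdf (placeholder IDs).
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device /dev/sdf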

Creating persistent volumes to store data
0. Locate your instance zone
1. Click on the volumes tab
2. ‘Create Volume’
3. Configure your volume to have the same zone as your instance
4. Wait for your volume to be available
5. Select volume, Actions, Attach volume
6. Select instance, attachment point, and Attach

Here, your attachment point will be ‘/dev/sdf’ and your block device will be named ‘/dev/xvdf’.

7. On your instance, list block devices

Type:

lsblk

You should see something like this:

NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0    8G  0 disk
`-xvda1 202:1    0    8G  0 part /
xvdf    202:80   0  100G  0 disk

Now format the disk (ONLY ON EMPTY DISKS - THIS WILL ERASE ANY DATA ON THE DISK):

sudo mkfs -t ext4 /dev/xvdf

and mount the disk:

sudo mkdir /disk
sudo mount /dev/xvdf /disk
sudo chmod a+rwxt /disk

and voila, anything you put on /disk will be on the volume that you allocated!

The command ‘df -h’ will show you what disks are actually mounted & where.

Detaching volumes
1. Unmount it from the instance

Change out of the directory, stop any running programs using it, and then:

sudo umount /disk
2. Detach

On the ‘volumes’ tab in your EC2 console, go to Actions, Detach.

3. Yes, detach.

Note, volumes remain attached when you reboot or stop an instance, but are (of course) detached when you terminate an instance.

Creating snapshots of volumes
1. Actions, Create snapshot
2. Fill out name and description
3. Click ‘Close’ & wait.

Terminating your instance

Amazon will happily charge you for running instances and/or associated ephemeral storage until the cows come home - it’s your responsibility to turn things off. The Right Way to do this for running instances is to terminate.

The caveat here is that everything ephemeral will be deleted (excluding volumes that you created/attached). So you want to make sure you transfer off anything you care about.

To terminate:

1. Select Actions, Instance State, Terminate

In the ‘Instances’ tab, select your instance and then go to the Actions menu.

2. Agree to terminate.
3. Verify status on your instance page.

Instance state should be either “shutting down” or “terminated”.


Return to index: Starting up an Amazon Web Services machine

Things to mention and discuss

When do disks go away?
  • never on reboot;
  • ephemeral disks go away on stop;
  • AMI-attached volumes go away on terminate;
  • attached volumes never go away on terminate and have to be explicitly deleted;
  • snapshots only go away when you explicitly delete them.
What are you charged for?
  • you are charged for a running instance at the hourly rate for its instance type;
  • ephemeral storage/instance-specific storage is included within that.
  • when you stop an instance, you are charged at disk-space rates for the stopped disk;
  • when you create a volume, you are charged for that volume until you delete it;
  • when you create a snapshot, you are charged for that snapshot until you delete it.

To make sure you’re not getting charged, go to your Instance view and clear all search filters; anything that is “running” or “stopped” is costing you. Also check your volumes and your snapshots - they should be empty.
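
If you have the AWS CLI configured, you can also list leftover resources from the command line (a sketch; note that you need to repeat this for each region you have used):

# Instances that are still running or stopped in the current region.
aws ec2 describe-instances \
    --filters "Name=instance-state-name,Values=running,stopped" \
    --query 'Reservations[].Instances[].[InstanceId,State.Name]' --output text

# Volumes and snapshots that you own.
aws ec2 describe-volumes --query 'Volumes[].[VolumeId,State,Size]' --output text
aws ec2 describe-snapshots --owner-ids self --query 'Snapshots[].[SnapshotId,VolumeSize]' --output text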


Regions vs zones:
  • AMIs and Snapshots (and keys and security groups) are per region;
  • Volumes and instances are per zone;

Running RStudio Server in the cloud

In this section, we will run RStudio Server on a remote Amazon machine. This will require starting up an instance, configuring its network firewall, and installing and running some software.

Reference documentation for running RStudio Server on Ubuntu is available on the RStudio web site.


1. Start up an Amazon instance

Start an ami-05384865 on an m4.xlarge machine, as per the instructions here:

Start an Amazon Web Services computer:.

2. Configure your network firewall

Normally, Amazon computers only allow shell logins via ssh. Since we want to run a Web service, we need to give the outside world access to other network locations on the computer.

Follow these instructions:

Configure your instance firewall

(You can do this while the computer is booting.)

3. Log in via the shell

Follow these instructions to log in via the shell:

Log into your instance with the UNIX shell.

4. Set a password for the ‘ubuntu’ account

Amazon Web Services computers normally require a key (the .pem file) instead of a login password, but RStudio Server will need us to log in with a password. So we need to configure a password for the account we’re going to use (which is ‘ubuntu’).

Create a password like so:

sudo passwd ubuntu

and set it to something you’ll remember.

5. Install R and the gdebi tool

Update the software catalog and install a few things:

sudo apt-get update && sudo apt-get -y install gdebi-core r-base

This will take a few minutes.

6. Download & install RStudio Server
wget https://download2.rstudio.org/rstudio-server-0.99.891-amd64.deb
sudo gdebi -n rstudio-server-0.99.891-amd64.deb

Upon success, you should see:

Mar 07 15:20:18 ip-172-31-6-68 systemd[1]: Starting RStudio Server...
Mar 07 15:20:18 ip-172-31-6-68 systemd[1]: Started RStudio Server.
7. Open your RStudio Server instance

Finally, go to ‘http://‘ + your hostname + ‘:8787’ in a browser, e.g.

http://ec2-XX-YY-33-165.us-west-1.compute.amazonaws.com:8787/

and log into RStudio with username ‘ubuntu’ and the password you set above.
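
If you are not sure what your public IP is, you can ask the instance itself via the EC2 metadata service; this is just a convenience (run it on the instance, and note that port 8787 falls inside the 8000-9000 range we opened earlier):

# Print the RStudio Server URL for this machine.
echo "http://$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4):8787/"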

Voila!


You can now just go ahead and use this, or you can “stop” it, or you can freeze it into an AMI for later use.

Note that on reboot, RStudio Server will start up again and all your files will be there.

Go back to the index: Starting up an Amazon Web Services machine.


Short read quality and trimming

Start up an instance with ami-05384865 and 500 GB of local storage (Start an Amazon Web Services computer:). You should also configure your firewall (Configure your instance firewall) to pass through TCP ports 8000-9000.

Then, Log into your computer.

You should now be logged into your Amazon computer! You should see something like this:

ubuntu@ip-172-30-1-252:~$

This is the command prompt.

Prepping the computer

Before we do anything else, we need to set up a place to work and install a few things.

First, let’s set up a place to work. Here, we’ll make /mnt writeable:

sudo chmod a+rwxt /mnt

Note

/mnt is the location we’re going to use on Amazon computers, but if you’re working on a local cluster, it will have a different location. Talk to your local sysadmin and ask them where they recommend putting lots of short-term working files, i.e. the “scratch” space.


Installing some software

Run:

sudo apt-get -y update && \
sudo apt-get -y install trimmomatic fastqc python-pip \
   samtools zlib1g-dev ncurses-dev python-dev

Install anaconda:

curl -O https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
bash Anaconda3-4.2.0-Linux-x86_64.sh

Then update your environment and install khmer:

source ~/.bashrc

cd
git clone https://github.com/dib-lab/khmer.git
cd khmer
sudo python2 setup.py install

Running Jupyter Notebook

Let’s also run a Jupyter Notebook in /mnt. First, configure it a teensy bit more securely, and also have it run in the background:

jupyter notebook --generate-config

cat >>/home/ubuntu/.jupyter/jupyter_notebook_config.py <<EOF
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:5d813e5d59a7:b4e430cf6dbd1aad04838c6e9cf684f4d76e245c'
c.NotebookApp.port = 8000

EOF

Now, run!

cd /mnt
jupyter notebook &

You should be able to visit port 8000 on your AWS computer and see the Jupyter console. (The password is ‘davis’.)
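
The hash in the config above corresponds to the password ‘davis’. If you would rather pick your own password, you can generate a replacement hash to paste into c.NotebookApp.password; this assumes the Anaconda python (which ships the notebook package) is first on your PATH:

# Prompts for a password and prints the hash to paste into
# c.NotebookApp.password in ~/.jupyter/jupyter_notebook_config.py.
python -c "from notebook.auth import passwd; print(passwd())"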

Data source

We’re going to be using a subset of data from Hu et al., 2016. This paper from the Banfield lab samples some relatively low diversity environments and finds a bunch of nearly complete genomes.

(See DATA.md for a list of the data sets we’re using in this tutorial.)

1. Copying in some data to work with.

We’ve loaded subsets of the data onto an Amazon location for you, to make everything faster for today’s work. We’re going to put the files on your computer locally under the directory /mnt/data:

mkdir /mnt/data

Next, let’s grab part of the data set:

cd /mnt/data
curl -O -L https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948_1.fastq.gz
curl -O -L https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948_2.fastq.gz

Now if you type:

ls -l

you should see something like:

total 346936
-rw-rw-r-- 1 ubuntu ubuntu 169620631 Oct 11 23:37 SRR1976948_1.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 185636992 Oct 11 23:38 SRR1976948_2.fastq.gz

These are 1-million-read subsets of the original data, taken from the beginning of each file.

One problem with these files is that they are writeable - by default, UNIX makes files writeable by their owner. That means we could accidentally modify or delete our raw data; let’s fix that before we go any further:

chmod u-w *

We’ll talk about what these files are below.

2. Copying data into a working location

First, make a working directory; this will be a place where you can futz around with a copy of the data without messing up your primary data:

mkdir /mnt/work
cd /mnt/work

Now, make a “virtual copy” of the data in your working directory by linking it in –

ln -fs /mnt/data/* .

These are FASTQ files – let’s take a look at them:

less SRR1976948_1.fastq.gz

(use the spacebar to scroll down, and type ‘q’ to exit ‘less’)
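
If you just want a quick peek at the first few records without paging through the file, you can decompress to stdout and take the head (each FASTQ record is four lines):

# Show the first two FASTQ records (4 lines each).
zcat SRR1976948_1.fastq.gz | head -8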

Questions:

  • where does the filename come from?
  • why are there 1 and 2 in the file names?


3. FastQC

We’re going to use FastQC to summarize the data. We installed ‘fastqc’ earlier via apt-get (see “Installing some software” above).

Now, run FastQC on two files:

fastqc SRR1976948_1.fastq.gz
fastqc SRR1976948_2.fastq.gz

Now type ‘ls’:

ls -d *fastqc*

to list the files, and you should see:

SRR1976948_1_fastqc.html
SRR1976948_1_fastqc.zip
SRR1976948_2_fastqc.html
SRR1976948_2_fastqc.zip

You can download these files using your Jupyter Notebook console, if you like, and open the HTML reports in your browser.
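
If you prefer to stay at the command line, each .zip archive also contains a plain-text summary of the report that you can inspect directly (you may need to install unzip first):

sudo apt-get -y install unzip
unzip -o SRR1976948_1_fastqc.zip
# PASS/WARN/FAIL status for each FastQC module:
cat SRR1976948_1_fastqc/summary.txt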

Questions:

  • What should you pay attention to in the FastQC report?
  • Which is “better”, file 1 or file 2? And why?


4. Trimmomatic

Now we’re going to do some trimming! We’ll be using Trimmomatic, which (as with fastqc) we’ve already installed via apt-get.

The first thing we’ll need are the adapters to trim off:

curl -O -L http://dib-training.ucdavis.edu.s3.amazonaws.com/mRNAseq-semi-2015-03-04/TruSeq2-PE.fa

Now, to run Trimmomatic:

TrimmomaticPE SRR1976948_1.fastq.gz \
              SRR1976948_2.fastq.gz \
     SRR1976948_1.qc.fq.gz s1_se \
     SRR1976948_2.qc.fq.gz s2_se \
     ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
     LEADING:2 TRAILING:2 \
     SLIDINGWINDOW:4:2 \
     MINLEN:25

You should see output that looks like this:

...
Input Read Pairs: 1000000 Both Surviving: 885734 (88.57%) Forward Only Surviving: 114262 (11.43%) Reverse Only Surviving: 4 (0.00%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully

Questions:

  • How do you figure out what the parameters mean?
  • How do you figure out what parameters to use?
  • What adapters do you use?
  • What version of Trimmomatic are we using here? (And FastQC?) (One way to check is shown below.)
  • Do you think parameters are different for RNAseq and genomic data sets?
  • What’s with these annoyingly long and complicated filenames?
  • Why are we running R1 and R2 together?

For a discussion of optimal trimming strategies, see MacManes, 2014 – it’s about RNAseq but similar arguments should apply to metagenome assembly.
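
Since both Trimmomatic and FastQC were installed with apt-get, one way to answer the version question above is to ask the package manager:

# Show the installed package versions.
dpkg -s trimmomatic | grep '^Version'
dpkg -s fastqc | grep '^Version'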


5. FastQC again

Run FastQC again on the trimmed files:

fastqc SRR1976948_1.qc.fq.gz
fastqc SRR1976948_2.qc.fq.gz

Compare the new FastQC reports with the ones you generated before trimming.

Let’s take a look at the output files:

less SRR1976948_1.qc.fq.gz

(again, use spacebar to scroll, ‘q’ to exit less).

Questions:

  • Is the quality-trimmed data “better” than before?
  • Does it matter that you still have adapters!?

Optional: K-mer Spectral Error Trimming

Next: Run the MEGAHIT assembler

K-mer Spectral Error Trimming

(Optional)

khmer documentation: http://khmer.readthedocs.io/en/latest

If you plot a k-mer abundance histogram of the samples, you’ll notice something: there’s an awful lot of unique (abundance=1) k-mers. These are erroneous k-mers caused by sequencing errors.

In a new Python3 Jupyter Notebook, run:

cd /mnt/work

and then

!abundance-dist-single.py -M 1e9 -k 21 SRR1976948_1.fastq.gz SRR1976948_1.fastq.gz.dist

and in another cell:

%matplotlib inline
import numpy
from pylab import *
dist1 = numpy.loadtxt('SRR1976948_1.fastq.gz.dist', skiprows=1, delimiter=',')
plot(dist1[:,0], dist1[:,1])
axis(xmax=50)

Many of these errors remain even after you do the Trimmomatic run; you can see this with:

!abundance-dist-single.py -M 1e9 -k 21 SRR1976948_1.qc.fq.gz SRR1976948_1.qc.fq.gz.dist

and then plot:

dist2 = numpy.loadtxt('SRR1976948_1.qc.fq.gz.dist', skiprows=1, delimiter=',')
plot(dist1[:,0], dist1[:,1], label='untrimmed')
plot(dist2[:,0], dist2[:,1], label='trimmed')
legend(loc='upper right')
axis(xmax=50)

This is for two reasons:

First, Trimmomatic trims based solely on the quality score, which is a statistical statement about the correctness of a base - a Q score of 30 means that, of 1000 bases with that Q score, 1 of those bases will be wrong. So, a base can have a high Q score and still be wrong! (and many bases will have a low Q score and still be correct)

Second, we trimmed very lightly - only bases that had a very low quality were removed. This was intentional because with assembly, you want to retain as much coverage as possible, and the assembler will generally figure out what the “correct” base is from the coverage.
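
For reference, a Phred quality score Q corresponds to an error probability of 10^(-Q/10), so the threshold of 2 used in the Trimmomatic command above only removes bases with roughly a 63% chance of being wrong; a quick check at the shell:

# Error probabilities for Q=30 and Q=2 (the LEADING/TRAILING threshold used above).
python -c "print(10**(-30/10.0), 10**(-2/10.0))"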

An alternative to trimming based on the quality scores is to trim based on k-mer abundance - this is known as k-mer spectral error trimming. K-mer spectral error trimming always beats quality score trimming in terms of eliminating errors; see, for example, the comparison in Zhang et al., 2014.

The basic logic is this: if you see low abundance k-mers in a high coverage data set, those k-mers are almost certainly the result of errors. (Caveat: strain variation could also create them.)

In metagenomic data sets we do have the problem that we may have very low and very high coverage data. So we don’t necessarily want to get rid of all low-abundance k-mers, because they may represent truly low abundance (but useful) data.

As part of the khmer project in my lab, we have developed an approach that sorts reads into high abundance and low abundance reads, and only error trims the high abundance reads.

This does mean that many errors may get left in the data set, because we have no way of figuring out if they are errors or simply low coverage, but that’s OK (and you can always trim them off if you really care).

To run such error trimming, use the command trim-low-abund.py (at the command line, or prefix with a ‘!’ in the notebook):

interleave-reads.py SRR1976948_1.qc.fq.gz SRR1976948_2.qc.fq.gz |
   trim-low-abund.py -V -M 8e9 -C 3 -Z 10 - -o SRR1976948.trim.fq

Why (or why not) do k-mer trimming?

If you can assemble your data set without k-mer trimming, there’s no reason to do it. The reason we’re error trimming here is to speed up the assembler (by removing data) and to decrease the memory requirements of the assembler (by removing a number of k-mers).

To see how many k-mers we removed, you can examine the distribution as above, or use the unique-kmers.py script:

unique-kmers.py SRR1976948_1.qc.fq.gz SRR1976948_2.qc.fq.gz
unique-kmers.py SRR1976948.trim.fq

Next: Run the MEGAHIT assembler

Run the MEGAHIT assembler

MEGAHIT is a very fast, quite good assembler designed for metagenomes.

First, install it:

cd
git clone https://github.com/voutcn/megahit.git
cd megahit
make

Now, download some data:

cd /mnt/data
curl -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948.abundtrim.subset.pe.fq.gz
curl -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1977249.abundtrim.subset.pe.fq.gz

These are data that have been run through k-mer abundance trimming (see K-mer Spectral Error Trimming) and subsampled so that we can run an assembly in a fairly short time period :).


Now, finally, run the assembler!

mkdir /mnt/assembly
cd /mnt/assembly
ln -fs ../data/*.subset.pe.fq.gz .

~/megahit/megahit --12 SRR1976948.abundtrim.subset.pe.fq.gz,SRR1977249.abundtrim.subset.pe.fq.gz \
    -o combined

This will take about 25 minutes; at the end you should see output like this:

... 12787984 bp, min 200 bp, max 61353 bp, avg 1377 bp, N50 3367 bp
... ALL DONE. Time elapsed: 1592.503825 seconds

The output assembly will be in combined/final.contigs.fa.
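
For a quick, tool-free sanity check on the assembly (your exact numbers will differ a little from run to run):

# Number of contigs in the assembly.
grep -c '^>' combined/final.contigs.fa

# Total assembled bases (sum of sequence line lengths).
awk '!/^>/ { total += length($0) } END { print total }' combined/final.contigs.fa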

While the assembly runs...

How assembly works - whiteboarding the De Bruijn graph approach.

Interpreting the MEGAHIT working output :)

What does, and doesn’t, assemble?

How good is assembly anyway?

Discussion:

Why would we assemble, vs looking at raw reads? What are the advantages and disadvantages?

What are the technology tradeoffs between Illumina HiSeq, Illumina MiSeq, and PacBio? (Also see this paper.)

What kind of experimental design considerations should you have if you plan to assemble?

Some figures: the first two come from work by Dr. Sherine Awad on analyzing the data from Shakya et al (2014). The third comes from an analysis of read search vs contig search of a protein database.

After the assembly is finished

At this point we can do a bunch of things:

  • annotate the assembly (Annotation with Prokka);
  • evaluate the assembly’s inclusion of k-mers and reads;
  • set up a BLAST database so that we can search it for genes of interest (a minimal sketch follows this list);
  • quantify the abundance of the contigs or genes in the assembly, using the original read data set (Gene Abundance Estimation with Salmon);
  • bin the contigs in the assembly into species bins;
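
As a minimal sketch of the BLAST option above - this assumes you install BLAST+ from apt, and query.fa is a hypothetical FASTA file of genes you care about (not part of this tutorial's data):

sudo apt-get -y install ncbi-blast+
cd /mnt/assembly

# Build a nucleotide BLAST database from the assembly.
makeblastdb -in combined/final.contigs.fa -dbtype nucl -out final.contigs

# Search your (hypothetical) query sequences against it, tabular output.
blastn -query query.fa -db final.contigs -outfmt 6 -evalue 1e-5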

Annotation with Prokka

Prokka is a tool that facilitates the fast annotation of prokaryotic genomes.

The goals of this tutorial are to:

  • Install Prokka
  • Use Prokka to annotate our genomes

Installing Prokka

Download and extract the latest version of prokka:

cd ~/
wget http://www.vicbioinformatics.com/prokka-1.11.tar.gz
tar -xvzf prokka-1.11.tar.gz

We also will need some dependencies such as bioperl:

sudo apt-get -y install bioperl libdatetime-perl libxml-simple-perl libdigest-md5-perl
sudo perl -MCPAN -e shell           # first run sets up CPAN; type 'quit' to leave the CPAN shell
sudo perl -MCPAN -e 'install "XML::Simple"'

Now, you should be able to add Prokka to your $PATH and set up the index for the sequence database:

export PATH=$PATH:$HOME/prokka-1.11/bin
prokka --setupdb

Prokka should be good to go now – you can check to make sure that all is well by typing prokka. This should print the help screen with all available options.

Running Prokka

Make a new directory for the annotation:

cd /mnt
mkdir annotation
cd annotation

Link the metagenome assembly file into this directory:

ln -fs /mnt/assembly/combined/final.contigs.fa

Now it is time to run Prokka! There are tons of different ways to customize how Prokka runs. We are going to keep it simple for now, though. It will take a little bit to run.

prokka final.contigs.fa --outdir prokka_annotation --prefix metagG

This will generate a new folder called prokka_annotation containing a series of output files, which are detailed in the Prokka documentation.

In particular, we will be using the *.ffn file to assess the relative read coverage within our metagenomes across the predicted genomic regions.
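
Once Prokka finishes, two quick ways to get a feel for the results (Prokka's .txt output is a short summary of feature counts):

# Summary of what was annotated (contigs, CDS, rRNA, tRNA, ...).
cat prokka_annotation/metagG.txt

# Number of predicted gene sequences in the .ffn file we'll use with Salmon.
grep -c '^>' prokka_annotation/metagG.ffn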

Gene Abundance Estimation with Salmon

Salmon is one of a breed of new, very fast RNAseq counting packages. Like Kallisto and Sailfish, Salmon counts fragments without doing up-front read mapping. Salmon can be used with edgeR and others to do differential expression analysis (if you are quantifying RNAseq data).

Today we will use it to get a handle on the relative distribution of genomic reads across the predicted protein regions.

The goals of this tutorial are to:

  • Install salmon
  • Use salmon to estimate gene coverage in our metagenome dataset

Installing Salmon

Download and extract the latest version of Salmon and add it to your PATH:

cd
wget https://github.com/COMBINE-lab/salmon/releases/download/v0.7.2/Salmon-0.7.2_linux_x86_64.tar.gz
tar -xvzf Salmon-0.7.2_linux_x86_64.tar.gz
cd Salmon-0.7.2_linux_x86_64
export PATH=$PATH:$HOME/Salmon-0.7.2_linux_x86_64/bin

Running Salmon

Go to the data directory and download the prokka annotated sequences, assembled metagenome, and fastq files:

mkdir -p /mnt/data
cd /mnt/data
curl -L -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1976948.abundtrim.subset.pe.fq.gz
curl -L -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/SRR1977249.abundtrim.subset.pe.fq.gz
curl -L -O https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/metagenomics-scripps-2016-10-12/prokka_annotation_assembly.tar.gz
tar -xvzf prokka_annotation_assembly.tar.gz

Make a new directory for the quantification of data with Salmon:

mkdir /mnt/quant
cd /mnt/quant

Grab the nucleotide (.ffn) predicted protein regions from Prokka and link them here. Also grab the trimmed sequence data (.fq.gz files):

ln -fs /mnt/data/prokka_annotation/*ffn .
ln -fs /mnt/data/*.abundtrim.subset.pe.fq.gz .

Create the salmon index:

salmon index -t metagG.ffn -i transcript_index --type quasi -k 31

Salmon requires that paired reads be separated into two files. We can split the reads using the split-paired-reads.py script from the khmer package:

for file in *.abundtrim.subset.pe.fq.gz
do
  # strip the .fq.gz suffix to get the base name for the output files
  tail=.fq.gz
  BASE=${file/$tail/}
  split-paired-reads.py $BASE$tail -1 $BASE.1.fq -2 $BASE.2.fq
done

Now, we can quantify our reads against this reference:

for file in *.pe.1.fq
do
  tail1=.abundtrim.subset.pe.1.fq
  tail2=.abundtrim.subset.pe.2.fq
  BASE=${file/$tail1/}
  salmon quant -i transcript_index --libType IU \
        -1 $BASE$tail1 -2 $BASE$tail2 -o $BASE.quant
done

(Note that --libType must come before the read files!)

This will create a bunch of directories named after the fastq files that we just pushed through. Take a look at what files there are within one of these directories:

find SRR1976948.quant -type f

Working with count data

Now, the quant.sf files actually contain the relevant information about expression – take a look:

head -10 SRR1976948.quant/quant.sf

The first column contains the transcript names, and the fourth column is what we will want down the road - the normalized counts (TPM). However, they’re not in a convenient location / format for use; let’s fix that.
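
If you want a quick look before reformatting, you can pull out the most abundant entries straight from the shell by skipping the header line and sorting on the TPM column:

# Top 5 entries by TPM (column 4).
tail -n +2 SRR1976948.quant/quant.sf | sort -k4,4gr | head -5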

Download the gather-counts.py script:

curl -L -O https://raw.githubusercontent.com/ngs-docs/2016-metagenomics-sio/master/gather-counts.py

and run it:

python2 ./gather-counts.py

This will give you a bunch of .counts files, which are processed from the quant.sf files and named for the directory from which they emanate.

Plotting the results

In Jupyter Notebook, open a new Python3 notebook and enter:

%matplotlib inline
import numpy
from pylab import *

In another cell:

cd /mnt/quant

In another cell:

counts1 = [ x.split()[1] for x in open('SRR1976948.quant.counts')]
counts1 = [ float(x) for x in counts1[1:] ]
counts1 = numpy.array(counts1)

counts2 = [ x.split()[1] for x in open('SRR1977249.quant.counts')]
counts2 = [ float(x) for x in counts2[1:] ]
counts2 = numpy.array(counts2)

plot(counts1, counts2, '*')

Day 2 - installation instructions

(Instructions mostly copied from Short read quality and trimming!)

Use ami-05384865, with a 500 GB local disk (see: Start an Amazon Web Services computer:)

Make /mnt/ read/write:

sudo chmod a+rwxt /mnt

Run:

sudo apt-get -y update && \
sudo apt-get -y install trimmomatic fastqc python-pip \
   samtools zlib1g-dev ncurses-dev python-dev

Install anaconda:

curl -O https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
bash Anaconda3-4.2.0-Linux-x86_64.sh

Then update your environment and install khmer:

source ~/.bashrc

cd
git clone https://github.com/dib-lab/khmer.git
cd khmer
sudo python2 setup.py install

Running Jupyter Notebook

Let’s also run a Jupyter Notebook in /mnt. First, configure it a teensy bit more securely, and also have it run in the background.

Generate a config:

jupyter notebook --generate-config

Add a password, have it not run a browser, and put it on port 8000 by default:

cat >>/home/ubuntu/.jupyter/jupyter_notebook_config.py <<EOF
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:5d813e5d59a7:b4e430cf6dbd1aad04838c6e9cf684f4d76e245c'
c.NotebookApp.port = 8000

EOF

Now, run!

cd /mnt
jupyter notebook &

You should be able to visit port 8000 on your AWS computer and see the Jupyter console.

Technical information

The github repository for this workshop is public at https://github.com/ngs-docs/2016-metagenomics-sio


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.