Mitogenome Data Analysis for Species ID

This document was modified for the GBatNet “My Field for Dummies: Field Sequencing Workshop” from the supplementary material of “Frank, L. E., Lindsey, L. L., Kipp, E. J., Faulk, C., Stone, S., Roerick, T. M., … & Larsen, P. A. (2024). Rapid molecular species identification of mammalian scat samples using nanopore adaptive sampling. Journal of Mammalogy, gyae044.”

Basic in Linux Command line

Change directory (folder)

cd

List out the contents of a directory

ls

Print your working directory (where you are)

pwd

Make a directory

mkdir

Output the contents of a file or concatenate files together

cat <filename>

Output the first 10 lines of a file

head <filename>

Clear your window

clear

Move or rename a file

mv <filename>

Copy a file

cp <filename>

Download and install packages following instructions on linked websites/Github pages or using Conda installation.

Analysis

This analysis workflow is optimized for long-read, Nanopore sequencing data.

Navigate to the directory (folder) with your sequence data (fastq or fasta).

cd /Desktop/directoryfordata

Combine the fastq files from a single sample/individual into one fastq so they’ll be easy to work with. Unzip the compressed fastq file for downstream filtering. * indicates any characters, so *.fastq indicates all .fastq files in a directory.

cat *.fastq.gz > all_barcode10.fastq.gz

gunzip all_barcode10.fastq.gz

Filtering

Use nanofilt v2.6.0 to filter the reads for quality and length. We are targeting the mitochondrial genome, which is around 16kb in length in mammals. So we filter for reads that are between 300 and 17,000kb.

NanoFilt -q 7 -l 300 --maxlength 17000 all_barcode10.fastq > all_barcode10_q7_300_17k.fastq

Consider trying Chopper. It is a recently updated implementation of nanofilt.

Map reads to Reference Database

We mapped our sample’s reads to our reference database using minimap2 v2.17. Here our reference database is a file containing all the NCBI RefSeq mammal mitochondrial genomes (downloaded from NCBI website). See Frank et al. (2024) for how to create this file.

minimap2 -ax map-ont reference.fasta all_barcode10_q7_300_17k.fastq > all_barcode10_mitoreads.sam

Use samtools v.1.9 to manipulate files. Change the .sam file to a .bam file.

samtools view -bS all_barcode10_mitoreads.sam > all_barcode10_mitoreads.bam

Reorganize the order of the .bam files to match the order of the genome/gene.

samtools sort all_barcode10_mitoreads.bam -o all_barcode10_mitoreads_sorted.bam

Pull out reads that mapped to the reference.

samtools view -h -F 4 -b all_barcode10_mitoreads_sorted.bam > all_barcode10_mapped.bam

Change .bam file to .fastq file format.

samtools bam2fq all_barcode10_mapped.bam > all_barcode10_mapped.fastq

Assemble the Mitochondrial Genome

Here we assemble the entire mitochondrial genome using the Flye v2.9.1 assembler.

Use reads that mapped to the reference database file in Flye to assemble the mitogenome.

flye --nano-corr all_barcode10_mapped.fastq --genome-size 16k --out-dir Assembly_barcode10

Check the assembly by opening the assembly_info.txt to see if the mitogenome was assembled. Look for a single contig around 16kb in length and usually circular (but not necessarily). Check the coverage (higher coverage is better, 30X coverage considered to be high quality). Once you determine which contig is likely the mitogenome, open the assembly.fasta file. Copy and paste the sequence into NCBI Blast to search for matches to the contig.

cd Assembly_barcode1

nano assembly_info.txt

cat assembly.fasta

Annotate the Mitochondrial Genome

Annotation will be done on the Galaxy platform. Go to Galaxy MITOS2. Upload the assembled mitogenome, assembly.fasta (must be one contig only!) and use the Ref89 Metazoa and Vertebrate Code options. Output an annotation fasta and .gff. Open using a text editor. Find the COI gene. Copy the sequence. Go to the BOLD COI database, paste sequence and run search to find best matches. The same can be done with NCBI Blast search and the cytb/COI gene.

If no mitogenome was assembled, the reads that mapped to the mitogenome reference file can still be used to generate an ID. See Frank et al. (2024) for more details.

Phylogenetic Tree Generation

Geneious Software

Use the reference genome from NCBI of the best-guess or putative species ID of the sample and align it to the sample’s mapped reads fastq file in Geneious Prime software v2022.2.1. (Mega software is a free alternative to Geneious). This will allow you to find where genes are located within the mitogenome or determine which reads are covering the barcoding genes (cytb or COI). Or annotate the mitogenome in Geneious with the MITOS2 output. Download the .GFF output file from MITOS2. To convert the file to a Genbank file, use Emboss seqret.

seqret -sequence contig_56_mitogenome.fasta -feature -fformat gff -fopenfile contig56mitogenome.gff -osformat genbank -auto

Bring the genbank file into Geneious.

A phylogenetic tree can be created in Geneious using RAxML or in-command line RAxML. In Geneious, gather sequences of entire mitogenomes, cytb, or COI from several different species closely related to the suspected ID of the sample and one outgroup (more distantly related species). Multiple align (Geneious or MAFFT align) these sequences and then use RAxML to generate the phylogenetic tree with GAMMA model with 1,000 bootstrap iterations.

RAxML Phylogenetic Tree without Geneious software

Create a phylogenetic tree based on the entire mitogenome, cytb gene or COI gene. Use the gene sequence from MITOS2 for cytb or COI. Make a file with sample sequence and sequences of closely related species from NCBI.

nano mammal_gene.fasta

Use MAFFT to align sequences.

mafft mammal_gene.fasta > COI_Aln.fasta

Use Fasta2Phylip perl script to change format of alignment to phylip.The phylogenetic program we use will want a phylip format.

perl Fasta2Phylip.pl COI_pero_Aln.fasta COI_pero_Aln.phy

RAxML is the phylogenetic analysis program we want to use to create a tree.

raxml -f a -m GTRGAMMAI -p 12345 -x 98765 -s COI_Aln.phy -n COI_1000 -# 1000 -o Sample

To view the tree, use the bipartitions file created from RAxML and input into FigTree. Add Node labels and use drop down menu to select ‘label’. Adjust the size of tree, branch labels, etc. Export graphic as an enhanced meta file (.emf) which can be edited in powerpoint or as an svg which can be edited in Inkscape v1.2.2. Insert the file, ungroup the image and edit labels, branches, etc.

BOLD database

Copy and paste the consensus sequence of the COI barcoding gene from Geneious software into the BOLD search engine. This generates an identification based on similarity to the BOLD database of COI sequences and a phylogenetic neighbor-joining tree.

NCBI blast

Copy and paste the consensus sequence of the COI/cytb/mitogenome sequence from Geneious software into the Blast search engine. This generates an identification based on similarity to the Blast database of sequences and a phylogenetic neighbor-joining tree.

View/Generate Metadata

To view metadata about a Nanopore sequencing experiment, go to the sequencing report .html file. It will be named something like report_AJL698_20220808_1044_24bdeb57.html.

To generate files on the stats from the sequencing experiment, we ran nanoplot v1.32.1 and then looked at the .html file that is generated.

NanoPlot -t 2 --fastq barcode1.fastq

To generate .html file on other stats from the sequencing experiment, we used fastqc v0.11.7.

fastqc all_barcode1.fastq --out fastqc_out

To generate easy stats about a fastq file, use seqkit v2.8.0.

seqkit stats <file name>