This document was modified for the GBatNet “My Field for Dummies: Field Sequencing Workshop” from the supplementary material of “Frank, L. E., Lindsey, L. L., Kipp, E. J., Faulk, C., Stone, S., Roerick, T. M., … & Larsen, P. A. (2024). Rapid molecular species identification of mammalian scat samples using nanopore adaptive sampling. Journal of Mammalogy, gyae044.”
Change directory (folder)
cd
List out the contents of a directory
ls
Print your working directory (where you are)
pwd
Make a directory
mkdir
Output the contents of a file or concatenate files together
cat <filename>
Output the first 10 lines of a file
head <filename>
Clear your window
clear
Move or rename a file
mv <filename>
Copy a file
cp <filename>
Download and install packages following instructions on linked websites/Github pages or using Conda installation.
This analysis workflow is optimized for long-read, Nanopore sequencing data.
Navigate to the directory (folder) with your sequence data (fastq or fasta).
cd /Desktop/directoryfordata
Combine the fastq files from a single sample/individual into one fastq so they’ll be easy to work with. Unzip the compressed fastq file for downstream filtering. * indicates any characters, so *.fastq indicates all .fastq files in a directory.
cat *.fastq.gz > all_barcode10.fastq.gz
gunzip all_barcode10.fastq.gz
Use nanofilt v2.6.0 to filter the reads for quality and length. We are targeting the mitochondrial genome, which is around 16kb in length in mammals. So we filter for reads that are between 300 and 17,000kb.
NanoFilt -q 7 -l 300 --maxlength 17000 all_barcode10.fastq > all_barcode10_q7_300_17k.fastq
Consider trying Chopper. It is a recently updated implementation of nanofilt.
We mapped our sample’s reads to our reference database using minimap2 v2.17. Here our reference database is a file containing all the NCBI RefSeq mammal mitochondrial genomes (downloaded from NCBI website). See Frank et al. (2024) for how to create this file.
minimap2 -ax map-ont reference.fasta all_barcode10_q7_300_17k.fastq > all_barcode10_mitoreads.sam
Use samtools v.1.9 to manipulate files. Change the .sam file to a .bam file.
samtools view -bS all_barcode10_mitoreads.sam > all_barcode10_mitoreads.bam
Reorganize the order of the .bam files to match the order of the genome/gene.
samtools sort all_barcode10_mitoreads.bam -o all_barcode10_mitoreads_sorted.bam
Pull out reads that mapped to the reference.
samtools view -h -F 4 -b all_barcode10_mitoreads_sorted.bam > all_barcode10_mapped.bam
Change .bam file to .fastq file format.
samtools bam2fq all_barcode10_mapped.bam > all_barcode10_mapped.fastq
Here we assemble the entire mitochondrial genome using the Flye v2.9.1 assembler.
Use reads that mapped to the reference database file in Flye to assemble the mitogenome.
flye --nano-corr all_barcode10_mapped.fastq --genome-size 16k --out-dir Assembly_barcode10
Check the assembly by opening the assembly_info.txt to see if the mitogenome was assembled. Look for a single contig around 16kb in length and usually circular (but not necessarily). Check the coverage (higher coverage is better, 30X coverage considered to be high quality). Once you determine which contig is likely the mitogenome, open the assembly.fasta file. Copy and paste the sequence into NCBI Blast to search for matches to the contig.
cd Assembly_barcode1
nano assembly_info.txt
cat assembly.fasta
Annotation will be done on the Galaxy platform. Go to Galaxy MITOS2. Upload the assembled mitogenome, assembly.fasta (must be one contig only!) and use the Ref89 Metazoa and Vertebrate Code options. Output an annotation fasta and .gff. Open using a text editor. Find the COI gene. Copy the sequence. Go to the BOLD COI database, paste sequence and run search to find best matches. The same can be done with NCBI Blast search and the cytb/COI gene.
If no mitogenome was assembled, the reads that mapped to the mitogenome reference file can still be used to generate an ID. See Frank et al. (2024) for more details.
Use the reference genome from NCBI of the best-guess or putative species ID of the sample and align it to the sample’s mapped reads fastq file in Geneious Prime software v2022.2.1. (Mega software is a free alternative to Geneious). This will allow you to find where genes are located within the mitogenome or determine which reads are covering the barcoding genes (cytb or COI). Or annotate the mitogenome in Geneious with the MITOS2 output. Download the .GFF output file from MITOS2. To convert the file to a Genbank file, use Emboss seqret.
seqret -sequence contig_56_mitogenome.fasta -feature -fformat gff -fopenfile contig56mitogenome.gff -osformat genbank -auto
Bring the genbank file into Geneious.
A phylogenetic tree can be created in Geneious using RAxML or in-command line RAxML. In Geneious, gather sequences of entire mitogenomes, cytb, or COI from several different species closely related to the suspected ID of the sample and one outgroup (more distantly related species). Multiple align (Geneious or MAFFT align) these sequences and then use RAxML to generate the phylogenetic tree with GAMMA model with 1,000 bootstrap iterations.
Create a phylogenetic tree based on the entire mitogenome, cytb gene or COI gene. Use the gene sequence from MITOS2 for cytb or COI. Make a file with sample sequence and sequences of closely related species from NCBI.
nano mammal_gene.fasta
Use MAFFT to align sequences.
mafft mammal_gene.fasta > COI_Aln.fasta
Use Fasta2Phylip perl script to change format of alignment to phylip.The phylogenetic program we use will want a phylip format.
perl Fasta2Phylip.pl COI_pero_Aln.fasta COI_pero_Aln.phy
RAxML is the phylogenetic analysis program we want to use to create a tree.
raxml -f a -m GTRGAMMAI -p 12345 -x 98765 -s COI_Aln.phy -n COI_1000 -# 1000 -o Sample
To view the tree, use the bipartitions file created from RAxML and input into FigTree. Add Node labels and use drop down menu to select ‘label’. Adjust the size of tree, branch labels, etc. Export graphic as an enhanced meta file (.emf) which can be edited in powerpoint or as an svg which can be edited in Inkscape v1.2.2. Insert the file, ungroup the image and edit labels, branches, etc.
Copy and paste the consensus sequence of the COI barcoding gene from Geneious software into the BOLD search engine. This generates an identification based on similarity to the BOLD database of COI sequences and a phylogenetic neighbor-joining tree.
Copy and paste the consensus sequence of the COI/cytb/mitogenome sequence from Geneious software into the Blast search engine. This generates an identification based on similarity to the Blast database of sequences and a phylogenetic neighbor-joining tree.
To view metadata about a Nanopore sequencing experiment, go to the sequencing report .html file. It will be named something like report_AJL698_20220808_1044_24bdeb57.html.
To generate files on the stats from the sequencing experiment, we ran nanoplot v1.32.1 and then looked at the .html file that is generated.
NanoPlot -t 2 --fastq barcode1.fastq
To generate .html file on other stats from the sequencing experiment, we used fastqc v0.11.7.
fastqc all_barcode1.fastq --out fastqc_out
To generate easy stats about a fastq file, use seqkit v2.8.0.
seqkit stats <file name>