Getting Started with BLAST on the Command Line

Bioinformatics relies heavily on sequence identification by comparing nucleotide or protein sequences against known databases. One of the most popular tools for this purpose is BLAST (Basic Local Alignment Search Tool).

In this blog post, we’ll explore the basics of using BLAST on the command line, focusing on the two most common types of searches: blastn and blastp. We’ll also cover how to perform searches against the NCBI databases, as well as creating and searching a custom local database.

What is BLAST?

BLAST is a powerful algorithm and software suite from NCBI that allows you to compare a query sequence against a database of sequences to find regions of similarity. This can help identify homologous genes, infer functional and evolutionary relationships, and discover new genes.

Types of BLAST Searches

There are several types of BLAST searches, the two most common are:

blastn: Compares a nucleotide query sequence against a nucleotide sequence database.
blastp: Compares an amino acid query sequence against a protein sequence database.

Setting Up BLAST

To get started, you’ll need to install the BLAST+ command line tools. You can either download the source files from the NCBI BLAST+ download page or install them though conda.

If using the source files, make sure that you extract the files and add the BLAST directory to your system’s PATH to use the BLAST commands from any directory.

Running BLAST Searches Against NCBI Databases

1. Using `blastn`

Let’s start with a blastn example. Suppose you have a nucleotide sequence in a file called gene.fasta and you want to identify the closest matching hits to your gene of interest in the NCBI nucleotide database:

blastn -query gene.fasta -remote -db nt -out results_blastn.txt -evalue 1e-6 -outfmt 6

This command performs a nucleotide BLAST search of your nucleotide sequence -query gene.fasta against the NCBI nucleotide database -remote -db nt and saves the results with an E-value lower than 1e-6 in the output file results_blastn.txt formatted as a table -outfmt 6.

2. Using `blastp`

Next, we’ll run a blastp search with a protein sequence file called protein.fasta to identify the best hits in the NCBI non-redundant protein database:

blastp -query protein.fasta -remote -db nr -out results_blastp.txt -evalue 1e-6 -outfmt 6

This performs a protein BLAST search of your protein sequence -query protein.fasta against the NCBI non-redundant protein database -remote -db nr and saves the results with an E-value lower than 1e-6 in the output file results_blastp.txt formatted as a table -outfmt 6.

Creating and Searching a Custom Local Database

Sometimes, you may want to perform searches against a custom database. For example, you may have a known gene that you’d like to identify the homolog of in a de novo assembly. Here’s how you can create and search a local custom BLAST database.

1. Creating a Custom Database

First, gather your sequences in a FASTA file, for example, my_sequences.fasta.

To create a nucleotide database:

makeblastdb -in my_sequences.fasta -dbtype nucl -out my_custom_db

For a protein database:

makeblastdb -in my_sequences.fasta -dbtype prot -out my_custom_db

2. Running a BLAST Search Against the Custom Database

Now, you can run BLAST searches against your custom database. For example, using blastn with your custom nucleotide database:

blastn -query gene.fasta -db my_custom_db -out custom_results_blastn.txt -evalue 1e-6 -outfmt 6

Or using blastp with your custom protein database:

blastp -query protein.fasta -db my_custom_db -out custom_results_blastp.txt -evalue 1e-6 -outfmt 6

Interpreting BLAST Results

BLAST results help determine the potential function of your query sequence by identifying similar sequences in known databases. When using the -outfmt 6 flag to output the results in a tabular format, the data in the columns are represented by the following headers:

qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

1. qseqid: Identifier of the query sequence.

2. sseqid: Identifier of the subject sequence that matches your query.

3. pident: Percentage of identical matches between the query and subject sequences.

4. length: Total alignment length between the query and subject sequences.

5. mismatch: Number of positions that are mismatches.

6. gapopen: Number of gap openings in the alignment.

7. qstart: Start position of the alignment for the query sequence.

8. qend: End position of the alignment for the query sequence.

9. sstart: Start position of the alignment for the subject sequence.

10. send: End position of the alignment for the subject sequence.

11. evalue: The number of expected hits of similar quality (lower is better).

12. bitscore: The higher the bit-score, the better the sequence similarity.

Summary

BLAST is an essential tool in bioinformatics for sequence identification and comparison. By understanding how to use blastn and blastp, and knowing how to create and search a custom database, you can leverage BLAST to gain significant insights into your sequences. Whether you are searching the NCBI databases or your custom datasets, BLAST provides the versatility and power needed for effective sequence analysis.

To learn more about the functions and features of the BLAST+ command line tools, take a look at the User Manual or use the -help flag to learn more about the possible arguments for each search.

Support

If you find this blog post helpful, please consider buying me a coffee on Ko-fi. Your support is very much appreciated!