FILE: ncbi_search.pl
AUTH: Paul Stothard stothard@ualberta.ca
DATE: April 18, 2020
VERS: 1.2
This script uses NCBI's Entrez Programming Utilities to search NCBI databases. It can return either the complete database records or only the IDs of the records.
For additional information on NCBI's Entrez Programming Utilities see: https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch
This script requires the LWP::Protocol::https Perl module. This can be installed using conda:
conda install -c bioconda perl-lwp-protocol-https
ncbi_search.pl - search NCBI databases.
DISPLAY HELP AND EXIT:
usage:
perl ncbi_search.pl -help
PERFORM NCBI SEARCH:
usage:
perl ncbi_search.pl -q <string> -o <file> -d <string> [Options]
required arguments:
-q - Entrez query text.
-o - Output file to create. If the -s option is used this is the output
directory to create.
-d - Name of the NCBI database to search, such as 'nuccore', 'protein', or
'gene'.
optional arguments:
-r - Type of information to download. For sequences, 'fasta' is typically
specified. The accepted formats depend on the database being queried. The
default is to specify no format.
-m - The maximum number of records to download. Default is to download all
records.
-s - Save each record as a separate file. This option is only supported for -r
values of 'gb' and 'gbwithparts'.
-v - Provide progress messages.
example usage:
perl ncbi_search.pl -q 'NC_045512[Accession]' -o NC_045512.gbk -d nuccore \
-r gbwithparts
Download a sequence in GenBank format (with the full sequence included), using an accession number:
perl ncbi_search.pl -q 'NC_045512[Accession]' \
-o NC_045512.gbk \
-d nuccore \
-r gbwithparts \
-v
Download the protein sequences encoded by a genome, using the genome's accession number:
perl ncbi_search.pl -q 'NC_012920.1[Accession]' \
-o NC_012920.1.faa \
-d nuccore \
-r fasta_cds_aa \
-v
Download multiple genomes using an accession number range, and save each genome to a file named after its accession number:
perl ncbi_search.pl -q 'NC_009925:NC_009934[Accession]' \
-o outdir1 \
-d nuccore \
-r gbwithparts \
-s \
-v
Download five coronavirus genomes from the RefSeq collection, and save each genome to a separate file:
perl ncbi_search.pl -q 'coronavirus[Organism] AND nucleotide genome[Filter] AND refseq[Filter]' \
-o outdir2 \
-d nuccore \
-r gbwithparts \
-m 5 \
-s \
-v
Download five abstracts from PubMed using an author name:
perl ncbi_search.pl -q 'Stothard P[Author]' \
-o abstracts.txt \
-d pubmed \
-r abstract \
-m 5 \
-v
Download information on the genes located in a genome region of interest:
perl ncbi_search.pl -q 'homo sapiens[Organism] AND 17[Chromosome] AND 7614064:7833711[Base position] AND GRCh38.p13[Assembly name]' \
-o gene_list.txt \
-d gene \
-r gene_table \
-v
Download information about a gene of interest:
perl ncbi_search.pl -q 'homo sapiens[Organism] AND PRNP[Gene name]' \
-o gene_info.txt \
-d gene \
-v
Download information about health-affecting variants for a genome region of interest:
perl ncbi_search.pl -q '17[Chromosome] AND 7614064:7620000[Base Position]' \
-o clinvar_info.xml \
-d clinvar \
-r clinvarset \
-v
Download a sequence record for each accession number in a file of accession numbers:
#preparing sample file of accession numbers
echo $'NP_776246.1\nNP_001073369.1\nNP_995328.2\n' \
> accessions.txt
#performing search for each accession using xargs
< accessions.txt xargs -t -I{} \
perl ncbi_search.pl -q '{}[Accession]' \
-o {}.fasta \
-d protein \
-r fasta \
-v
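The same per-accession search can also be written as a plain shell loop, which makes it easier to skip blank lines and to pause between requests. The sketch below is one possible variant, not something the script provides: the `fetch_accessions` function, the `DRY_RUN` and `PAUSE` variables, and the `accessions_demo.txt` file are invented for illustration. The pause is included because NCBI asks clients without an API key to stay at or below three requests per second.

```shell
# Hypothetical wrapper: run one search per accession listed in a file.
# DRY_RUN=1 prints the commands instead of running them.
# PAUSE is the number of seconds to wait between searches (default 1).
fetch_accessions() {
  while IFS= read -r acc || [ -n "$acc" ]; do
    # skip blank lines
    [ -z "$acc" ] && continue
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "perl ncbi_search.pl -q '${acc}[Accession]' -o ${acc}.fasta -d protein -r fasta -v"
    else
      perl ncbi_search.pl -q "${acc}[Accession]" -o "${acc}.fasta" \
        -d protein -r fasta -v
      sleep "${PAUSE:-1}"
    fi
  done < "$1"
}

# preview the commands for a small demo file without contacting NCBI
printf 'NP_776246.1\n\nNP_995328.2\n' > accessions_demo.txt
DRY_RUN=1 fetch_accessions accessions_demo.txt
```

Leaving `DRY_RUN` unset runs the searches for real, one accession at a time.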
Download sequences in fasta format and then save each sequence as a separate file:
#download fasta file containing multiple sequences
perl ncbi_search.pl -q 'coronavirus[Organism] AND nucleotide genome[Filter] AND refseq[Filter]' \
-o sequences.fasta \
-d nuccore \
-r fasta \
-m 5 \
-v
#create separate file for each sequence
outputdir=sequences/
mkdir -p "$outputdir"
awk '/^>/ {OUT=substr($0,2); split(OUT, a, " "); gsub(/[^A-Za-z_0-9\.\-]/, "", a[1]); OUT = "'"$outputdir"'" a[1] ".fa"}; OUT {print >>OUT; close(OUT)}' \
sequences.fasta
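After splitting, it can be worth confirming that the number of files matches the number of sequences. Because the awk command appends to its output files, two records whose headers reduce to the same name, or a second run into the same directory, would end up in one file. The check below is a minimal sketch that uses a small made-up FASTA file (`demo.fasta`) and directory (`demo_sequences/`) so that it runs without a download; the record names are placeholders, not real accessions.

```shell
# made-up input for illustration; substitute sequences.fasta and sequences/
printf '>DEMO_0001.1 first demo record\nACGT\nACGT\n>DEMO_0002.1 second demo record\nGGCC\n' > demo.fasta
demodir=demo_sequences/
rm -rf "$demodir"
mkdir -p "$demodir"

# one output file per header line, named after the first word of the header
awk -v dir="$demodir" '/^>/ {split(substr($0, 2), a, " "); gsub(/[^A-Za-z_0-9.-]/, "", a[1]); OUT = dir a[1] ".fa"} OUT {print >> OUT; close(OUT)}' demo.fasta

# compare the number of records with the number of files written
n_records=$(grep -c '^>' demo.fasta)
n_files=$(find "$demodir" -type f -name '*.fa' | wc -l | tr -d ' ')
if [ "$n_records" -eq "$n_files" ]; then
  echo "OK: $n_records records, $n_files files"
else
  echo "Mismatch: $n_records records, $n_files files" >&2
fi
```

Starting from an empty output directory, as the `rm -rf` line does for the demo directory here, avoids appending to files left over from an earlier run.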
The supported -d option values are listed below.
- annotinfo
- assembly
- bioproject
- biosample
- biosystems
- blastdbinfo
- books
- cdd
- clinvar
- dbvar
- gap
- gapplus
- gds
- gene
- genome
- geoprofiles
- grasp
- homologene
- ipg
- medgen
- mesh
- ncbisearch
- nlmcatalog
- nuccore
- nucleotide
- omim
- orgtrack
- pcassay
- pccompound
- pcsubstance
- pmc
- popset
- probe
- protein
- proteinclusters
- pubmed
- seqannot
- snp
- sparcle
- sra
- structure
- taxonomy
The supported -r option values are grouped by database type (i.e. -d option value) below. The name of each format is followed by the corresponding -r option value in parentheses. A value of null indicates that the -r option should be omitted in order to obtain that output format.
All databases:
- Document summary (docsum)
- List of UIDs in plain text (uilist)
bioproject:
- Full record XML (xml)
biosample:
- Full record text (full)
biosystems:
- Full record XML (xml)
gds:
- Summary (summary)
gene:
- text ASN.1 (null)
- Gene table (gene_table)
homologene:
- text ASN.1 (null)
- Alignment scores (alignmentscores)
- FASTA (fasta)
- HomoloGene (homologene)
mesh:
- Full record (full)
nlmcatalog:
- Full record (null)
nuccore, nucest, nucgss, protein, popset:
- text ASN.1 (null)
- Full record in XML (native)
- Accession number(s) (acc)
- FASTA (fasta)
- SeqID string (seqid)
Additional options for nuccore, nucest, nucgss, popset:
- GenBank flat file (gb)
- INSDSeq XML (gbc)
Additional option for nuccore and protein:
- Feature table (ft)
Additional options for nuccore:
- GenBank flat file with full sequence (gbwithparts)
- CDS nucleotide FASTA (fasta_cds_na)
- CDS protein FASTA (fasta_cds_aa)
Additional option for nucest:
- EST report (est)
Additional option for nucgss:
- GSS report (gss)
Additional options for protein:
- GenPept flat file (gp)
- INSDSeq XML (gpc)
- Identical Protein XML (ipg)
pmc:
- XML (null)
- MEDLINE (medline)
pubmed:
- text ASN.1 (null)
- MEDLINE (medline)
- PMID list (uilist)
- Abstract (abstract)
sequences:
- text ASN.1 (null)
- Accession number(s) (acc)
- FASTA (fasta)
- SeqID string (seqid)
snp:
- text ASN.1 (null)
- Flat file (flt)
- FASTA (fasta)
- RS Cluster report (rsr)
- SS Exemplar list (ssexemplar)
- Chromosome report (chr)
- Summary (docset)
- UID list (uilist)
sra:
- XML (full)
taxonomy:
- XML (null)
- TaxID list (uilist)
clinvar:
- ClinVar Set (clinvarset)
- UID list (uilist)
gtr:
- GTR Test Report (gtracc)