Biopython

Online databases

There is a lot more information in the official documentation. What follows here is a summary of the key points so that you can start using the online databases.

The Bio.Entrez module give you Python access to the NCBI's online databases. Before you start using it, there are a few things to be aware of:

  • For any series of more than 100 requests, do this at weekends or outside USA peak times. This is up to you to obey.
  • Use the email parameter so the NCBI can contact you if there is a problem. You can set a global email address:
    from Bio import Entrez
    Entrez.email = "A.N.Other@example.com"
    

EInfo - Obtaining information about the Entrez databases

Let's start with the EInfo utility. This allows you to ask the service for its list of available databases:

In [1]:
from Bio import Entrez
Entrez.email = "matt.williams@bristol.ac.uk"  # Always tell NCBI who you are

handle = Entrez.einfo()
result = Entrez.read(handle)
print(result["DbList"])
['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']

So we can see, for example, the "pubmed", "protein" and "structure" databases.

You can delve into any one of the databases individually by passing the name of it to the einfo function. For example, you can ask it for the valid list of query fields which we will be able to use shortly:

In [2]:
handle = Entrez.einfo(db="nucleotide")
result = Entrez.read(handle)

for f in result["DbInfo"]["FieldList"]:
    print(f["Name"], f["Description"])
ALL All terms from all searchable fields
UID Unique number assigned to each sequence
FILT Limits the records
WORD Free text associated with record
TITL Words in definition line
KYWD Nonstandardized terms provided by submitter
AUTH Author(s) of publication
JOUR Journal abbreviation of publication
VOL Volume number of publication
ISS Issue number of publication
PAGE Page number(s) of publication
ORGN Scientific and common names of organism, and all higher levels of taxonomy
ACCN Accession number of sequence
PACC Does not include retired secondary accessions
GENE Name of gene associated with sequence
PROT Name of protein associated with sequence
ECNO EC number for enzyme or CAS registry number
PDAT Date sequence added to GenBank
MDAT Date of last update
SUBS CAS chemical name or MEDLINE Substance Name
PROP Classification by source qualifiers and molecule type
SQID String identifier for sequence
GPRJ BioProject
SLEN Length of sequence
FKEY Feature annotated on sequence
PORG Scientific and common names of primary organism, and all higher levels of taxonomy
COMP Component accessions for an assembly
ASSM Assembly
DIV Division
STRN Strain
ISOL Isolate
CULT Cultivar
BRD Breed
BIOS BioSample

Exercise 1

Look at the database information for another database. What are the valid list of search fields there?

ESearch - Searching the Entrez databases

You can also use ESearch to search GenBank. Here we’ll do a search for the matK gene in Cypripedioideae orchids. We specify the database we want to search, the search terms (based on the fields we found out about earlier) and pass the idtype argument to specify what we want to have returned as the .id field:

In [3]:
handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]", idtype="acc")
record = Entrez.read(handle)
record["IdList"]
Out[3]:
['OQ981989.1', 'NC_063680.1', 'NC_064145.1', 'NC_063681.1', 'NC_066405.1', 'OP465215.1', 'NC_071758.1', 'NC_069974.1', 'NC_069973.1', 'NC_069972.1', 'NC_069971.1', 'NC_069970.1', 'NC_069969.1', 'NC_069968.1', 'NC_069967.1', 'NC_069966.1', 'NC_069965.1', 'NC_069964.1', 'NC_069963.1', 'NC_069962.1']

Each of the IDs (NC_050871.1, NC_050870.1, ...) is a GenBank identifier (Accession number). We'll see shortly how to actually download these GenBank records.

Exercise 2

Try searching for another gene or organism and print its ID list. If you don't know of another then just try replicating the code above.

EFetch - Downloading full records from Entrez

EFetch is what you use when you want to retrieve a full record from Entrez. This covers several possible databases, as described on the main EFetch help page.

For most of their databases, the NCBI support several different file formats. Requesting a specific file format from Entrez using Bio.Entrez.efetch requires specifying the rettype and/or retmode optional arguments. The different combinations are described for each database type on the pages linked to on NCBI efetch webpage.

One common usage is downloading sequences in the FASTA or GenBank/GenPept plain text formats (which can then be parsed with Bio.SeqIO). From the Cypripedioideae example above, we can download GenBank record NC_050871.1 using Bio.Entrez.efetch.

We specify the database we want to fetch from, the ID returned from the search, the return type of the data (FASTA, GenBank etc.) and that we want the result as plain text:

In [4]:
handle = Entrez.efetch(db="nucleotide", id="NC_050871.1", rettype="gb", retmode="text")

This has given us a handle which we can pass to Bio.SeqIO.read in place of the filename:

In [5]:
from Bio import SeqIO
SeqIO.read(handle, "genbank")
Out[5]:
SeqRecord(seq=Seq('GTAAAAAATCCCCATATCTTACAAGATATGGGGATTTTTTACCTTCAAAAATTC...TTT'), id='NC_050871.1', name='NC_050871', description='Paphiopedilum hirsutissimum chloroplast, complete genome', dbxrefs=['BioProject:PRJNA927338'])

If you want to download the file and save a copy of it on disk, you can write the result to file with:

In [6]:
handle = Entrez.efetch(db="nucleotide", id="NC_050871.1", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
SeqIO.write(record, "NC_050871.gbk", "genbank")
Out[6]:
1

Exercise 3

Using one of the results from exercise 2 above, download the full record and save it to a file. Then load it from the local file using SeqIO.read.

EGQuery- Global Query, counts for search terms

EGQuery provides counts for a search term in each of the Entrez databases (i.e. a global query). This is particularly useful to find out how many items your search terms would find in each database without actually performing lots of separate searches with ESearch.

In this example, we use Bio.Entrez.egquery to obtain the counts for "Biopython":

In [7]:
handle = Entrez.egquery(term="biopython")
record = Entrez.read(handle)
for row in record["eGQueryResult"]:
    print(row["DbName"], row["Count"])
pubmed 45
pmc 3148
mesh 0
books 2
pubmedhealth Error
omim 0
ncbisearch 1
nuccore 3
nucgss 0
nucest 0
protein 0
genome 0
structure 0
taxonomy 0
snp 0
dbvar 0
gene 0
sra 2234
biosystems Error
unigene 0
cdd 0
clone 0
popset 0
geoprofiles 0
gds 17
homologene 0
pccompound 0
pcsubstance 0
pcassay 0
nlmcatalog 0
probe 0
gap 0
proteinclusters 0
bioproject 1
biosample 0
biocollections 0

Exercise 4

Run a global query for an organism name to find out how many matches there are in the various databases.