Multiple Sequence Alignments are a collection of multiple sequences which have been aligned together, usually with the insertion of gap characters, such that all the sequence strings are the same length. Alignment can be regarded as a matrix of letters, where each row is held as a SeqRecord
object internally.
The MultipleSeqAlignment
object holds this kind of data, and the AlignIO
module is used for reading and writing them as various file formats.
For this section you need to download some files which we will be reading in. You can either download them by hand from using these links: PF05371_seed.sth and dummy_aln.phy or run the following Python code:
import urllib
for f in ["PF05371_seed.sth", "dummy_aln.phy"]:
urllib.request.urlretrieve(f"https://milliams.com/courses/biopython/{f}", f)
Much like SeqIO
, AlignIO
contains 2 functions for reading in sequence alignments:
read()
- will return a single MultipleSeqAlignment
objectparse()
- will return an iterator which gives MultipleSeqAlignment
objectsBoth functions expect two mandatory arguments:
Let's start with a single alignments file which contains the seed alignment for the Phage_Coat_Gp8 (PF05371) PFAM entry. The file contains a lot of annotation information but let's just go ahead and load it in to see how it looks:
from Bio import AlignIO
aln_seed = AlignIO.read("PF05371_seed.sth", "stockholm")
print(aln_seed)
Note in the above output the sequences have been elided in the middle (...
). We could instead write our own code to format this as we please by iterating over the rows as SeqRecord
objects and printing the first 50 values of each sequence:
for record in aln_seed:
print(f"{record.seq[:50]} - {record.id}")
With any supported file format, we can load an alignment in exactly the same way just by changing the format string. For example, use "phylip"
for PHYLIP files, "nexus"
for NEXUS files or "emboss"
for the alignments output by the EMBOSS tools.
In general alignment files can contain multiples alignments, and to read these files we must use the AlignIO.parse
function.
We have previously downloaded a file called dummy_aln.phy
which contains some dummy alignment information in PHYLIP format. If we wanted to read this in using AlignIO
we could use:
aln_dummy = AlignIO.parse("dummy_aln.phy", "phylip")
for alignment in aln_dummy:
print(alignment)
print("---")
The .parse()
function returns an iterator. If we want to keep all the alignments in memory at once, then we need to turn the iterator into a list, just as we did with SeqIO.parse
:
alignments = list(AlignIO.parse("dummy_aln.phy", "phylip"))
second_aln = alignments[1]
print(second_aln)
Now we’ll look at AlignIO.write()
which is for alignments output (writing files).
This function takes 3 arguments:
MultipleSeqAlignment
objects We start by creating a MultipleSeqAlignment
object the hard way (by hand). Note we create some SeqRecord
objects to construct the alignment from.
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
align1 = MultipleSeqAlignment([
SeqRecord(Seq("ACTGCTAGCTAG"), id="toto"),
SeqRecord(Seq("ACT-CTAGCTAG"), id="titi"),
SeqRecord(Seq("ACTGCTAGDTAG"), id="tata"),
])
print(align1)
Now let's try to output, in PHYLIP format, these alignments in a file with the Phage_Coat_Gp8 alignments.
my_alignments = [align1, aln_seed]
AlignIO.write(my_alignments, "mixed.phy", "phylip")
Read in the alignment in PF05371_seed.sth
and write it out in PHYLIP format.
Biopython also has the ability to call out to lots of different external alignment tools including ClustalW, MUSCLE, EMBOSS, DIALIGN2-2, TCoffee and MSAProbs. Have a look at the classes in Bio.Align.Applications
for more details.
By using the Biopython interfaces to these tools, you can build full pipelines in Python, making use of whatever tool is best for the particular job you want to do.