Biopython

Sequences

The main class to get to grips with in Biopython is Seq. This is the primary interface to sequence data that you will work with. It is imported as:

In [1]:
from Bio.Seq import Seq

Once you have the Seq class, you can create sequences from standard Python strings:

In [2]:
my_dna = Seq("AGTACACTGGTT")

DNA operations

Once you have the DNA in a Seq object, you can perform standard operations on it, such as getting the complement of the sequence:

In [3]:
my_dna.complement()
Out[3]:
Seq('TCATGTGACCAA')

and the reverse complement:

In [4]:
my_dna.reverse_complement()
Out[4]:
Seq('AACCAGTGTACT')
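
To make clear what these two methods compute, here is a minimal plain-Python sketch that works on ordinary strings. It is an illustration only, not Biopython's implementation. The helper names complement_str and reverse_complement_str are hypothetical, and the sketch assumes an uppercase sequence containing only A, C, G and T.

```python
# Illustrative sketch only: plain-string versions of the two operations.
# Assumes uppercase, unambiguous DNA (A, C, G, T); the names are hypothetical.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement_str(dna):
    """Swap each base for its pairing partner (A<->T, C<->G)."""
    return dna.translate(DNA_COMPLEMENT)


def reverse_complement_str(dna):
    """Complement the sequence, then read it in the opposite direction."""
    return complement_str(dna)[::-1]


print(complement_str("AGTACACTGGTT"))          # TCATGTGACCAA
print(reverse_complement_str("AGTACACTGGTT"))  # AACCAGTGTACT
```

The two printed strings match the Seq results above, which shows that the reverse complement is simply the complement read backwards.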

RNA

You can get the corresponding RNA sequence from a DNA sequence by using the transcribe method:

In [5]:
my_rna = my_dna.transcribe()

Once you have an RNA sequence, you can again perform standard operations on it, such as getting the complement. For RNA, use complement_rna, which writes the result with U rather than T:

In [6]:
my_rna.complement_rna()
Out[6]:
Seq('UCAUGUGACCAA')

It is also possible to convert back from an RNA sequence to a DNA sequence:

In [7]:
my_dna_from_rna = my_rna.back_transcribe()

If the round trip works correctly, this should give us back the original sequence:

In [8]:
my_dna == my_dna_from_rna
Out[8]:
True
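
At the level of the letters, the round trip above amounts to swapping T for U and back again. The following plain-Python sketch shows that idea on ordinary strings. The helper names transcribe_str and back_transcribe_str are hypothetical, and the sketch assumes an uppercase coding-strand sequence.

```python
# Illustrative sketch only: transcription as a letter substitution on strings.
# Assumes an uppercase coding-strand sequence; the names are hypothetical.
def transcribe_str(dna):
    """Write a DNA coding strand as RNA by replacing T with U."""
    return dna.replace("T", "U")


def back_transcribe_str(rna):
    """Write an RNA sequence as DNA by replacing U with T."""
    return rna.replace("U", "T")


rna = transcribe_str("AGTACACTGGTT")
print(rna)                                         # AGUACACUGGUU
print(back_transcribe_str(rna) == "AGTACACTGGTT")  # True
```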

Translation

Once you have an RNA sequence, you can get the protein it encodes with translate:

In [9]:
my_protein = my_rna.translate()

In [10]:
my_protein
Out[10]:
Seq('STLV')
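
To see where STLV comes from, the sketch below reads the RNA three letters at a time and looks each codon up in the standard genetic code. It is a plain-Python illustration, not Biopython's implementation. The names CODON_TABLE and translate_str are hypothetical, and the sketch assumes uppercase, unambiguous RNA and simply ignores any trailing bases that do not form a complete codon.

```python
# Illustrative sketch only: translation as a codon-by-codon table lookup.
# Assumes uppercase, unambiguous RNA; the names are hypothetical.
from itertools import product

# Standard genetic code, with codons ordered by base in the order U, C, A, G.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_str(rna):
    """Translate each complete codon, ignoring leftover trailing bases."""
    usable = len(rna) - len(rna) % 3
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, usable, 3))


print(translate_str("AGUACACUGGUU"))  # STLV
```

Here AGU, ACA, CUG and GUU map to S, T, L and V respectively, which matches the Seq result above.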

Exercise

Given a particular sequence:

new_seq = Seq("AAATGGCAAAA")

Use the Biopython documentation to discover how you can count how many times the subsequence AA is present.

Do you get the count you expect? Can you find the way to count all instances of AA, even those that overlap?