Multiple sequence alignment algorithm python

I'm writing a program which has to compute a multiple sequence alignment of a set of strings. I was thinking of doing this in Python, but I could use an external piece of software or another language if that's more practical. The data is not particularly big, I do not have strong performance requirements and I can tolerate approximations (ie. I just need to find a good enough alignment). The only problem is that the strings are regular strings (ie. UTF-8 strings potentially with newlines that should be treated as a regular character); they aren't DNA sequences or protein sequences.

I can find tons of tools and information for the usual cases in bioinformatics with specific complicated file formats and a host of features I don't need, but it is unexpectly hard to find software, libraries or example code for the simple case of strings. I could probably reimplement any one of the many algorithms for this problem or encode my string as DNA, but there must be a better way. Do you know of any solutions?

Thanks!

asked Apr 28, 2011 at 5:06

Multiple sequence alignment algorithm python

6

  • The easiest way to align multiple sequences is to do a number of pairwise alignments.

First get pairwise similarity scores for each pair and store those scores. This is the most expensive part of the process. Choose the pair that has the best similarity score and do that alignment. Now pick the sequence which aligned best to one of the sequences in the set of aligned sequences, and align it to the aligned set, based on that pairwise alignment. Repeat until all sequences are in.

When you are aligning a sequence to the aligned sequences, (based on a pairwise alignment), when you insert a gap in the sequence that is already in the set, you insert gaps in the same place in all sequences in the aligned set.

Lafrasu has suggested the SequneceMatcher() algorithm to use for pairwise alignment of UTF-8 strings. What I've described gives you a fairly painless, reasonably decent way to extend that to multiple sequences.

In case you are interested, it is equivalent to building up small sets of aligned sequences and aligning them on their best pair. It gives exactly the same result, but it is a simpler implementation.

answered May 3, 2011 at 21:25

James CrookJames Crook

1,58012 silver badges17 bronze badges

2

Are you looking for something quick and dirty, as in the following?

from difflib import SequenceMatcher

a = "dsa jld lal"
b = "dsajld kll"
c = "dsc jle kal"
d = "dsd jlekal"

ss = [a,b,c,d]

s = SequenceMatcher()

for i in range(len(ss)):
    x = ss[i]
    s.set_seq1(x)
    for j in range(i+1,len(ss)):

        y = ss[j]
        s.set_seq2(y)

        print
        print s.ratio()
        print s.get_matching_blocks()

answered Apr 28, 2011 at 11:59

lafraslafras

8,1414 gold badges28 silver badges28 bronze badges

3

MAFFT version 7.120+ supports multiple text alignment. Input is like FASTA format but with LATIN1 text instead of sequences and output is aligned FASTA format. Once installed, it is easy to run:

mafft --text input_text.fa > output_alignment.fa

Although MAFFT is a mature tool for biological sequence alignment, the text alignment mode is in the development stage, with future plans including permitting user defined scoring matrices. You can see the further details in the documentation.

answered Apr 4, 2017 at 14:41

Chris_RandsChris_Rands

36.4k12 gold badges79 silver badges110 bronze badges

1

I've pretty recently written a python script that runs the Smith-Waterman algorithm (which is what is used to generate gapped local sequence alignments for DNA or protein sequences). It's almost certainly not the fastest implementation, as I haven't optimized it for speed at all (not my bottleneck at the moment), but it works and doesn't care about the identity of each character in the strings. I could post it here or email you the files if that's the kind of thing you're looking for.

answered Apr 28, 2011 at 19:01

4

Which algorithm is used for multiple sequence alignment?

We developed Kalign, a method employing the Wu-Manber string-matching algorithm, to improve both the accuracy and speed of multiple sequence alignment. We compared the speed and accuracy of Kalign to other popular methods using Balibase, Prefab, and a new large test set.

How do you do multiple sequence alignment?

Creating a Multiple-Sequence Alignment.
List the features of interest and select them..
Invoke the Multiple-Sequence Alignment Tool..
Choose Nucleotide or Amino Acid..
Process the result..

Which tool is best for multiple sequence alignment?

Most recent answer DNA alignment and MEGA X are used conventionally for small-size data but if you are dealing with large data then MAFFT will be best.

What is sequence alignment in Python?

Advertisements. Sequence alignment is the process of arranging two or more sequences (of DNA, RNA or protein sequences) in a specific order to identify the region of similarity between them.