Python script needed to create DNA diversity index
New here? Learn about Bountify and follow @bountify to get notified of new bounties! x

I need a python script to allow me to take DNA distances (in a matrix) and create a diversity index based on these distances.

There is a method developed by watve and gangal which takes the distance between 2 DNA sequences (this distance value is based on the similarity DNA between 2 DNA sequences). It then takes all these distances and adds them and creates an average distance value.

So I need a script which takes two files
File 1: DNA sequences linked to sample name
e.g. sample1 has 5 DNA sequences, sample2 has 5 and sample3 has 4 DNA sequences

seq1 sample1
seq2 sample1
seq3 sample1
seq4 sample1
seq5 sample1
seq6 sample2
seq7 sample2
seq8 sample2
seq9 sample2
seq10 sample2
seq11 sample3
seq12 sample3
seq13 sample3
seq14 sample3

and File 2:

The second file contains the pairwise distances between all the DNA sequences and is in the form of a square matrix. It can be in several forms, including square and those are upper or lower triangle matrix. The square matrix contains duplicate values of the DNA distances so only half the matrix is used (I can send this as a text file as the formatting is lost here)

seq1        seq2    seq3    seq4    seq5    seq6    seq7    seq8    seq9    seq10   seq11   seq12   seq13   seq14

seq1 0.0 1.7 2.0 2.1 1.5 1.8 1.9 2.0 1.9 2.0 2.1 1.7 1.7 1.7
seq2 1.7 0.0 1.2 1.3 1.2 1.3 1.3 1.3 1.3 1.3 1.2 1.3 1.4 1.4
seq3 2.0 1.2 0.0 1.3 1.3 1.4 1.5 1.5 1.5 1.9 1.5 1.5 1.8 1.7
seq4 2.1 1.3 1.3 0.0 1.8 1.3 1.4 1.4 1.5 1.7 1.7 1.6 1.7 1.8
seq5 1.5 1.2 1.3 1.8 0.0 1.1 1.1 1.0 1.1 0.9 1.1 0.9 1.1 1.1
seq6 1.8 1.3 1.4 1.3 1.1 0.0 0.1 0.2 0.4 0.8 0.7 0.7 0.8 0.7
seq7 1.9 1.3 1.5 1.4 1.1 0.1 0.0 0.3 0.4 0.7 0.7 0.7 0.9 0.7
seq8 2.0 1.3 1.5 1.4 1.0 0.2 0.3 0.0 0.4 0.8 0.8 0.9 0.9 0.8
seq9 1.9 1.3 1.5 1.5 1.1 0.4 0.4 0.4 0.0 0.8 0.7 0.8 0.8 0.9
seq10 2.0 1.3 1.9 1.7 0.9 0.8 0.7 0.8 0.8 0.0 0.7 0.6 0.6 0.6
seq11 2.1 1.2 1.5 1.7 1.1 0.7 0.7 0.8 0.7 0.7 0.0 0.5 0.5 0.5
seq12 1.7 1.3 1.5 1.6 0.9 0.7 0.7 0.9 0.8 0.6 0.5 0.0 0.4 0.3
seq13 1.7 1.4 1.8 1.7 1.1 0.8 0.9 0.9 0.8 0.6 0.5 0.4 0.0 0.3
seq14 1.7 1.4 1.7 1.8 1.1 0.7 0.7 0.8 0.9 0.6 0.5 0.3 0.3 0.0

Then using this formulae

Dmean = sum(d(seqi,seqj)/(S(S11)/2)
Where d= the distance between the 2 DNA sequences seqi and seqj in the matrix in a sample and
S=total number of species in a sample.

We can calculate the Dmean for sample1 by adding up the DNA distances for all pairwise comparisons between those DNA sequences only found in sample1 i.e. seq1 to seq5 and using the formula.

This can then be done for sample2 and sample3 and 3 Dmean value produced for the 3 samples.

But in the real world we have 100,000s of DNA sequences in very large matrices and 100s of samples, so doing this is impractical in excel or by hand.

So I need a python script that can take File 1, sample and associated DNA sequences and File2, DNA distance matrix of all pairwise distances, and use the Dmean formula to generate a Dmean value for each sample in File 1.

Can you post a more concrete example? I am having trouble understanding what those sample1, sample2 are. Where are the distances of the sample taken from?
ZeroCool 4 years ago
How do you get the total number of the species in the sample? seq1 sample1 seq2 sample1 seq3 sample1 seq4 sample1 seq5 sample1 Sample 1 contains 5 species? Is this the correct assumption? Can you send me the sample file for the matrix and the species?
ZeroCool 4 years ago
Is the distance always divided by the number of samples, or do you take up the sum and then divide by the number of samples?
ZeroCool 4 years ago
Hi I just realised there is a typo in the equation When you add up all the distances for sequences in a sample, so for the example I showed sample 1 has 5 sequences which is 10 distances added up and then divided by s(s+1)/2 - which I have just realized didn't copy into the example - sorry. here S is the number of sequences per sample in this case 5
Jugsy 4 years ago
Currently the script does the following: For sample 1 it goes from seq1 to seq5 in the matrix and gets all the distances, sums them up and then divides them by the number of samples (here 5). So it sums up the seq1 line till the 5th number, and then seq2 line till the 5th number etc., sums the last line seq5 till the 5th number and then divides the whole sum with 5.
ZeroCool 4 years ago
I have corrected the solution, it now divides the sum by (number of samples * number of samples + 1) / 2.
ZeroCool 4 years ago
I have updated to script to read file lines as tab separated values.
ZeroCool 4 years ago
awarded to MSF

Crowdsource coding tasks.

1 Solution


This is the script. If you run it withour arguments, the delimiter between the data will be a space. But you can run python script.py "\t" and the delimiter will be a tabulator. Works on data 100 000 x 100 000, but the calculation may take some time (on a 10 000 x 10 000 dataset, a minute or two). If you want to output the data into a file, just run it like this:

python script.py > data.txt

import sys

f1 = open("file1.txt")
f2 = open("file2.txt")

seq_file_positions = {}

args = sys.argv
delimiter = " "

if len(args) > 1:
    delimiter = args[1]

def get_seq_positions():
    global seq_file_positions
    kolvo = 0
    length = 0
    for each in f2:
        seq_file_positions[kolvo] = length
        length = length + len(each) + 1
        kolvo = kolvo + 1

def get_seq_data(seq_number):
    global seq_file_positions, f2
    f2.seek(seq_file_positions.get(seq_number))
    line = f2.readline()
    return line

seqs = dict()

for i, line in enumerate(f1):
    l = line.rstrip("\n")
    l = line.split(delimiter)
    key = l[1].rstrip("\n")
    if key in seqs:
        seqs[key][0] += 1
        seqs[key][1].append(i)
    else:
        seqs[key] = [1, [i]]

get_seq_positions()

distances = dict()

for key, data in seqs.items():
    num_species = data[0]
    samples = data[1]

    sum_of_dist = 0

    for smpidx in samples:
        smp_data = get_seq_data(smpidx).rstrip("\n").split(delimiter)
        start = samples[0] + 1
        end = samples[len(samples)-1] + 2
        for i in range(start, end):
            sum_of_dist += (float(smp_data[i]) / (num_species * (num_species + 1)) / 2)

    distances[key] = sum_of_dist

for key in sorted(distances):
    print key + " " + str(distances[key])
another point is that the input files are usually tab seperated, but that's a small issue
Jugsy 4 years ago
I have corrected the issue.
ZeroCool 4 years ago
this looks great do i run by downloading python and then try it?
Jugsy 4 years ago
looks great can I test it on monday and if all OK I'll award you the monies? can't download python onto laptop wifi too slow
Jugsy 4 years ago
Ok, I am just wondering do you have a bigger dataset for me to test on? For a 100x100 matrix it should work quite fine.
ZeroCool 4 years ago
I think that for a 100 000 x 100 000 matrix, a more optimized script would be needed (because you can not add a matrix of that size to the memory itself) which would require more time from my side and far succeed the bounty value.
ZeroCool 4 years ago
I have updated the solution to work with bigger data and added the possibility to specify the delimiter between the data.
ZeroCool 4 years ago
Hi thanks for helping me, but when i change the file inout names and run it I get this File "C:\Users\jmarches\Google Drive\Work docs\Cardiff Projects\Bountify Code\wandg4.py", line 34, in key = l[1].rstrip("\n") IndexError: list index out of range what am I doing wrong?
Jugsy 4 years ago
File 1 is the sequence list and and sample file, file 2 is the matrix file. Try running with: python script.py " " or python script.py "\t" Without actually seeing the files, I can not say what goes wrong.
ZeroCool 4 years ago
I changed your script to this f1 = open("sampleseqlist.txt") f2 = open("matrix_distances.txt") I tried what you suggested I have called your script wandg4.py so I installed Python 2.7 and go this C:\Users\jmarches\Google Drive\Work docs\Cardiff Projects\Bountify Code>wandg4.py "\t" Traceback (most recent call last): File "C:\Users\jmarches\Google Drive\Work docs\Cardiff Projects\Bountify Code\wandg4.py", line 34, in key = l[1].rstrip("\n") IndexError: list index out of range
Jugsy 4 years ago
i need to ask a dumb question - how do I run your script?
Jugsy 4 years ago
I would really need to see your files to make out what went wrong. You run by running the command prompt tool, going to your folder and typing python script.py.
ZeroCool 4 years ago
how can I get the files to you I don't have an email address cheers Julian
Jugsy 4 years ago
I have send you an email already to @cardiff.ac.uk. Reply to that email.
ZeroCool 4 years ago
are you b***.c*@gmail.com? if you are I have your email and I'll send you a real matrix and sample to sequence list tomorro
Jugsy 4 years ago