Scaling Up Peptide Sequencing

  • More Errors
  • More Residues

1


From Last Time

In [135]:
AminoAcid = {
    'A': 'Alanine', 'C': 'Cysteine', 'D': 'Aspartic acid', 'E': 'Glutamic acid',
    'F': 'Phenylalanine', 'G': 'Glycine', 'H': 'Histidine', 'I': 'Isoleucine',
    'K': 'Lysine', 'L': 'Leucine', 'M': 'Methionine', 'N': 'Asparagine',
    'P': 'Proline', 'Q': 'Glutamine', 'R': 'Arginine', 'S': 'Serine',
    'T': 'Theronine', 'V': 'Valine', 'W': 'Tryptophan', 'Y': 'Tyrosine',
    '*': 'STOP'
}

AminoAbbrv = {
    'A': 'Ala', 'C': 'Cys', 'D': 'Asp', 'E': 'Glu',
    'F': 'Phe', 'G': 'Gly', 'H': 'His', 'I': 'Ile',
    'K': 'Lys', 'L': 'Leu', 'M': 'Met', 'N': 'Asn',
    'P': 'Pro', 'Q': 'Gln', 'R': 'Arg', 'S': 'Ser',
    'T': 'Thr', 'V': 'Val', 'W': 'Trp', 'Y': 'Tyr',
    '*': 'STP'    
}

# Now it's time to use this dictionary!
Daltons = { 
    'A':  71, 'C': 103, 'D': 115, 'E': 129,
    'F': 147, 'G':  57, 'H': 137, 'I': 113,
    'K': 128, 'L': 113, 'M': 131, 'N': 114,
    'P':  97, 'Q': 128, 'R': 156, 'S':  87,
    'T': 101, 'V':  99, 'W': 186, 'Y': 163 
}
In [136]:
import itertools

def TheoreticalSpectrum(peptide):
    # Generate every possible fragment of a peptide
    spectrum = set()
    for fragLength in xrange(1,len(peptide)+1):
        for start in xrange(0,len(peptide)-fragLength+1):
            seq = peptide[start:start+fragLength]
            spectrum.add(sum([Daltons[res] for res in seq]))
    return sorted(spectrum)

def LeaderboardFindPeptide(noisySpectrum, cutThreshold=0.05):
    # Golf Tournament Heuristic
    spectrum = set(noisySpectrum)
    target = max(noisySpectrum)
    players = [''.join(peptide) for peptide in itertools.product(Daltons.keys(), repeat=2)]
    round = 1
    currentLeader = [0.0, '']
    while True:
        print "%8d Players in round %d [%5.4f]" % (len(players), round, currentLeader[0])
        leaderboard = []
        for prefix in players:
            testSpectrum = set(TheoreticalSpectrum(prefix))
            totalWeight = max(testSpectrum)
            score = len(spectrum & testSpectrum)/float(len(spectrum | testSpectrum))
            if (score > currentLeader[0]):
                currentLeader = [score, prefix]
            elif (score == currentLeader[0]):
                currentLeader += [prefix]
            if (totalWeight < target):
                leaderboard.append((score, prefix))
        remaining = len(leaderboard)
        if (remaining == 0):
            print "Done, no sequences can be extended"
            break
        leaderboard.sort(reverse=True)
        # Prune the larger of the top 5% or the top 5 players
        cut = leaderboard[max(min(5,remaining-1),int(remaining*cutThreshold))][0]
        players = [p+r for s, p in leaderboard if s >= cut for r in Daltons.iterkeys()]
        round += 1
    return currentLeader

2


Let's try a Noisier Spectrum

In [137]:
# generate a synthetic experimental spectrum with 60% Error
import random
random.seed(1961)

TyrocidineB1 = "VKLFPWFNQY"
print TyrocidineB1
spectrum = TheoreticalSpectrum(TyrocidineB1)
print len(spectrum)
print spectrum
print

# Pick around ~60% at random to remove
missingMass = random.sample(spectrum[:-1], 30)
print "Missing Masses = ", missingMass

# Add back another ~10% of false, but actual, peptide masses
falseMass = []
for i in xrange(5):
    fragment = ''.join(random.sample(Daltons.keys(), random.randint(2,len(TyrocidineB1)-2)))
    weight = sum([Daltons[residue] for residue in fragment])
    falseMass.append(weight)
print "False Masses = ", falseMass

experimentalSpectrum = sorted(set([mass for mass in spectrum if mass not in missingMass] + falseMass))

print len(experimentalSpectrum)
print experimentalSpectrum
VKLFPWFNQY
51
[97, 99, 113, 114, 128, 147, 163, 186, 227, 241, 242, 244, 260, 261, 283, 291, 333, 340, 357, 388, 389, 405, 430, 447, 485, 487, 543, 544, 552, 575, 577, 584, 671, 672, 690, 691, 738, 770, 804, 818, 819, 835, 917, 932, 982, 1031, 1060, 1095, 1159, 1223, 1322]

Missing Masses =  [1159, 114, 691, 186, 819, 357, 291, 543, 1223, 147, 671, 97, 388, 552, 447, 770, 672, 261, 738, 487, 577, 485, 932, 1031, 690, 389, 340, 575, 113, 260]
False Masses =  [356, 160, 572, 879, 244]
25
[99, 128, 160, 163, 227, 241, 242, 244, 283, 333, 356, 405, 430, 544, 572, 584, 804, 818, 835, 879, 917, 982, 1060, 1095, 1322]
In [138]:
spectrum = TheoreticalSpectrum(TyrocidineB1)
experimentalSpectrum = [mass for mass in spectrum if mass not in missingMass] + falseMass
%time winners = LeaderboardFindPeptide(experimentalSpectrum)
print winners
print len(winners) - 1, "Candidate residues with", winners[0], 'matches'
print TyrocidineB1
print TyrocidineB1 in winners
     400 Players in round 1 [0.0000]
     440 Players in round 2 [0.1200]
     600 Players in round 3 [0.1481]
    1480 Players in round 4 [0.2069]
    1840 Players in round 5 [0.2121]
    2320 Players in round 6 [0.2222]
    5200 Players in round 7 [0.2619]
    5360 Players in round 8 [0.2826]
    8640 Players in round 9 [0.2826]
    9040 Players in round 10 [0.3220]
    8320 Players in round 11 [0.3220]
     480 Players in round 12 [0.3220]
Done, no sequences can be extended
CPU times: user 4.76 s, sys: 89.3 ms, total: 4.85 s
Wall time: 5.27 s
[0.3220338983050847, 'VQLDEWFNQY', 'VQLDEWFNKY', 'VQIDEWFNQY', 'VQIDEWFNKY', 'VKLDEWFNQY', 'VKLDEWFNKY', 'VKIDEWFNQY', 'VKIDEWFNKY']
8 Candidate residues with 0.322033898305 matches
VKLFPWFNQY
False

3


A New Idea

  • Maybe we are still not using our spectrum to its fullest extent
  • Is there some information about missing masses that we can extract?

4


Information in the Mass Differences

  • Recall the theoretical spectrum of "PLAY" is [71, 97, 113, 163, 184, 210, 234, 281, 347, 444]
  • Suppose we remove masses 71 and 163, can we get them back?
  • Let's generate a table of all pair-wise differences between the observed peaks
  • Notice that interesting numbers, (71, 97, 113, 137, 163, 234) are repeated in the table
97 113 184 210 234 281 347 444
97 16 87 113 137 184 250 347
113 71 97 121 168 234 331
184 26 50 97 163 260
210 24 71 137 234
234 47 113 210
281 66 163
347 97
  • Why does this work?
  • This table of differences is called the Spectral Convolution

5


Spectral Convolution

  • Spectral Convolution gives us an approach for recovering some missing masses
  • Given a noisy experimental spectrum
    1. Compute its spectral convolution
    2. Add frequent masses above some threshold to the spectrum
    3. Infer the peptide sequence
In [140]:
def SpectralConvolution(spectrum):
    delta = {}
    for i in xrange(len(spectrum)-1):
        for j in xrange(i+1,len(spectrum)):
            diff = abs(spectrum[j] - spectrum[i])
            delta[diff] = delta.get(diff, 0) + 1
    return delta

experimentalSpectrum = sorted(set([mass for mass in spectrum if mass not in missingMass] + falseMass))
specConv = SpectralConvolution(sorted(experimentalSpectrum))
for delta, count in sorted(specConv.iteritems()):
    if (count >= 2) and (delta not in experimentalSpectrum) and (delta > min(Daltons.values())):
        print "Adding", delta, "appeared", count, "times", "*" if delta in missingMass else ""
        experimentalSpectrum.append(delta)

print sorted(missingMass)
Adding 61 appeared 2 times 
Adding 64 appeared 2 times 
Adding 78 appeared 2 times 
Adding 81 appeared 2 times 
Adding 82 appeared 2 times 
Adding 113 appeared 3 times *
Adding 114 appeared 3 times *
Adding 142 appeared 2 times 
Adding 143 appeared 2 times 
Adding 147 appeared 2 times *
Adding 164 appeared 2 times 
Adding 178 appeared 3 times 
Adding 188 appeared 2 times 
Adding 216 appeared 2 times 
Adding 228 appeared 2 times 
Adding 234 appeared 2 times 
Adding 251 appeared 2 times 
Adding 260 appeared 2 times *
Adding 277 appeared 2 times 
Adding 291 appeared 2 times *
Adding 302 appeared 2 times 
Adding 331 appeared 2 times 
Adding 340 appeared 2 times *
Adding 345 appeared 2 times 
Adding 485 appeared 2 times *
Adding 487 appeared 2 times *
Adding 523 appeared 2 times 
Adding 552 appeared 2 times *
Adding 577 appeared 3 times *
Adding 591 appeared 2 times 
Adding 655 appeared 2 times 
Adding 675 appeared 2 times 
Adding 676 appeared 2 times 
Adding 690 appeared 3 times *
Adding 719 appeared 2 times 
Adding 738 appeared 2 times *
Adding 819 appeared 2 times *
Adding 854 appeared 2 times 
Adding 932 appeared 2 times *
[97, 113, 114, 147, 186, 260, 261, 291, 340, 357, 388, 389, 447, 485, 487, 543, 552, 575, 577, 671, 672, 690, 691, 738, 770, 819, 932, 1031, 1159, 1223]
In [129]:
winners = LeaderboardFindPeptide(experimentalSpectrum)
print winners
print len(winners) - 1, "Candidate residues with", winners[0], 'matches'
print TyrocidineB1
print TyrocidineB1 in winners
     400 Players in round 1 [0.0000]
     920 Players in round 2 [0.1000]
     960 Players in round 3 [0.1613]
    1360 Players in round 4 [0.2121]
    1680 Players in round 5 [0.2286]
    3520 Players in round 6 [0.2750]
    4160 Players in round 7 [0.3182]
    4640 Players in round 8 [0.3617]
    5440 Players in round 9 [0.3922]
    1920 Players in round 10 [0.4107]
     160 Players in round 11 [0.4107]
Done, no sequences can be extended
[0.4107142857142857, 'YQNLMWFLQV', 'YQNLMWFLKV', 'YQNLMWFIQV', 'YQNLMWFIKV', 'YQNIMWFLQV', 'YQNIMWFLKV', 'YQNIMWFIQV', 'YQNIMWFIKV', 'YKNLMWFLQV', 'YKNLMWFLKV', 'YKNLMWFIQV', 'YKNLMWFIKV', 'YKNIMWFLQV', 'YKNIMWFLKV', 'YKNIMWFIQV', 'YKNIMWFIKV']
16 Candidate residues with 0.410714285714 matches
VKLFPWFNQY
False

6


A More Realistic Example

For long sequences the exponential foundation underlying our heuristic search becomes more evident.

Note: If we set the cutThreshold to 0.0, we will set our cut at the 4th best score and ties with it

In [132]:
Insulin = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
spectrum = TheoreticalSpectrum(Insulin)
print len(spectrum)
missingMass = random.sample(spectrum[:-1], 300)
experimentalSpectrum = sorted([mass for mass in spectrum if mass not in missingMass])
print len(experimentalSpectrum)

# See slide for the following *hack*
del Daltons['I']
del Daltons['K']

%time winners = LeaderboardFindPeptide(experimentalSpectrum, cutThreshold=0.01)
print winners
print len(winners) - 1, "Candidate residues with", winners[0], 'matches'
print Insulin
print Insulin in winners

# See slide for the following *hack*
Daltons['I'] = Daltons['L']
Daltons['K'] = Daltons['Q']
3407
3107
     324 Players in round 1 [0.0000]
    3060 Players in round 2 [0.0010]
   17496 Players in round 3 [0.0019]
   63540 Players in round 4 [0.0032]
  138204 Players in round 5 [0.0048]
  158076 Players in round 6 [0.0068]
   86256 Players in round 7 [0.0090]
   20376 Players in round 8 [0.0116]
   10422 Players in round 9 [0.0145]
    3438 Players in round 10 [0.0177]
     846 Players in round 11 [0.0206]
     162 Players in round 12 [0.0241]
     108 Players in round 13 [0.0267]
     126 Players in round 14 [0.0302]
     180 Players in round 15 [0.0333]
     216 Players in round 16 [0.0368]
     126 Players in round 17 [0.0410]
     108 Players in round 18 [0.0451]
     108 Players in round 19 [0.0489]
     126 Players in round 20 [0.0526]
     126 Players in round 21 [0.0567]
     108 Players in round 22 [0.0610]
     108 Players in round 23 [0.0655]
     108 Players in round 24 [0.0692]
     108 Players in round 25 [0.0744]
     108 Players in round 26 [0.0787]
     108 Players in round 27 [0.0839]
     108 Players in round 28 [0.0883]
     108 Players in round 29 [0.0934]
     126 Players in round 30 [0.0988]
     108 Players in round 31 [0.1033]
     108 Players in round 32 [0.1082]
     126 Players in round 33 [0.1125]
     108 Players in round 34 [0.1182]
     108 Players in round 35 [0.1239]
     108 Players in round 36 [0.1280]
     108 Players in round 37 [0.1331]
     126 Players in round 38 [0.1384]
     108 Players in round 39 [0.1435]
     108 Players in round 40 [0.1485]
     108 Players in round 41 [0.1542]
     108 Players in round 42 [0.1593]
     108 Players in round 43 [0.1651]
     108 Players in round 44 [0.1694]
     108 Players in round 45 [0.1743]
     108 Players in round 46 [0.1794]
     108 Players in round 47 [0.1857]
     108 Players in round 48 [0.1916]
     126 Players in round 49 [0.1976]
     108 Players in round 50 [0.2031]
     126 Players in round 51 [0.2094]
     126 Players in round 52 [0.2139]
     108 Players in round 53 [0.2193]
     108 Players in round 54 [0.2250]
     108 Players in round 55 [0.2292]
     108 Players in round 56 [0.2343]
     108 Players in round 57 [0.2408]
     108 Players in round 58 [0.2454]
     108 Players in round 59 [0.2508]
     108 Players in round 60 [0.2550]
     108 Players in round 61 [0.2600]
     108 Players in round 62 [0.2670]
     108 Players in round 63 [0.2709]
     108 Players in round 64 [0.2755]
     108 Players in round 65 [0.2808]
     108 Players in round 66 [0.2850]
     108 Players in round 67 [0.2891]
     108 Players in round 68 [0.2927]
     108 Players in round 69 [0.2974]
     108 Players in round 70 [0.3020]
     108 Players in round 71 [0.3063]
     108 Players in round 72 [0.3101]
     108 Players in round 73 [0.3136]
     108 Players in round 74 [0.3172]
     108 Players in round 75 [0.3200]
     108 Players in round 76 [0.3244]
     108 Players in round 77 [0.3270]
     108 Players in round 78 [0.3307]
     108 Players in round 79 [0.3336]
     108 Players in round 80 [0.3361]
     108 Players in round 81 [0.3401]
     108 Players in round 82 [0.3432]
     108 Players in round 83 [0.3450]
     108 Players in round 84 [0.3460]
     108 Players in round 85 [0.3481]
     108 Players in round 86 [0.3499]
     108 Players in round 87 [0.3513]
     108 Players in round 88 [0.3528]
     108 Players in round 89 [0.3562]
     108 Players in round 90 [0.3575]
     108 Players in round 91 [0.3591]
     108 Players in round 92 [0.3610]
Done, no sequences can be extended
CPU times: user 2min 22s, sys: 3.19 s, total: 2min 25s
Wall time: 2min 36s
[0.3614561027837259, 'QTPWAVESLYCFRGPSGFCCQHNQRVWYMLDESELVDVPAASRLMQPWALYAHEAPWLWRNGLRLQLLLPWCSYLQLETMVPLTTHYFRMPSW']
1 Candidate residues with 0.361456102784 matches
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
False

Some of the reasons that things blow up:

  1. The search space got large fast
  2. There must be a LOT of ties
  3. Algorithm tends to keep all (N-k+1) subpeptides as k approaches the sequence's size (k is related to our round)
  4. The I/L and K/Q ambiguities lead to exponential number of ties, hence the hack
  5. Reversed sequences are doubling our leaderboard size

There are bandaids to fix problems 3 and 4, but the problem remains

7


How are peptide sequences really identified?

8


Peptide Identification Problem

Goal: Find a peptide from the database with maximal match between an experimental and theoretical spectrum.

Input:

  • S: experimental spectrum
  • database of peptides
  • Δ: set of possible ion types
  • m: parent mass

Output:

  • A peptide of mass m from the database whose theoretical spectrum matches the experimental S spectrum the best

9


Mass Spec Database Searches

How do you get a database?

  1. Compute theoretical spectrums for all peptides from length N to M
  2. More commonly, store theoretical spectrums for known peptide sequences

Database searches are very effective in identfying known or closely related proteins.

Experimental spectrums are compared with spectra of database peptides to find the best fit (ex. SEQUEST, Yates et al., 1995)

But reliable algorithms for identification of new proteins is a more difficult problem.

10


Essence of the Database Search

  • We need a notion of spectral similarity that correlates well with the sequence similarity.

  • If peptides are a few mutations/modifications apart, the spectral similarity between their spectra should be high.

  • Simplest measure: Shared Peak Counts (SPC)

    • Very similar to the scoring function used in our De novo approach.

11


SPC Diminishes Quickly

12


In [144]:
print TheoreticalSpectrum('PRTEIN')
print TheoreticalSpectrum('PRTEYN')
print TheoreticalSpectrum('PGTEYN')

print set(TheoreticalSpectrum('PRTEIN')) & set(TheoreticalSpectrum('PRTEYN'))
print set(TheoreticalSpectrum('PRTEIN')) & set(TheoreticalSpectrum('PGTEYN'))
[97, 101, 113, 114, 129, 156, 227, 230, 242, 253, 257, 343, 354, 356, 386, 457, 483, 499, 596, 613, 710]
[97, 101, 114, 129, 156, 163, 230, 253, 257, 277, 292, 354, 386, 393, 406, 483, 507, 549, 646, 663, 760]
[57, 97, 101, 114, 129, 154, 158, 163, 230, 255, 277, 287, 292, 384, 393, 406, 450, 507, 547, 564, 661]
set([97, 354, 483, 101, 230, 257, 129, 386, 114, 156, 253])
set([129, 114, 101, 230, 97])

Spectral Convolution to the Rescue!

Difference matrix of spectrums. The elements with multiplicity >2 are colored; the elements with multiplicity =2 are circled. The SPC takes into account only the red entries

12