Determining a Peptide's Sequence

  • From last time we learned that we can't always use DNA to resolve peptide/protein sequences
  • What else can we do?
    • Extract and purify a pure sample of the peptide/protein
    • Try to resolve the peptide sequence by analyzing this sample
  • Today's approach
    • Randomly fracture the peptide
    • Assemble an answer from the peices

1


Molecular Weights are the Puzzle Peices

2


Structure of a Peptide Chain

  • Peptides are chains of amino acids that are joined by peptide bonds
  • These bonds reduce the weight of each amino acid by one H20 molecule
  • The result is called a residue
  • A Mass Spectrograph can precisely measure the molecular weight (and charge and abundance) of any peptide chain
  • Since the molecular weight of each of the possible 20 residues is known precisely, one can ask the question, which combination of residues would give a particular weight?
  • The problem is ambiguous for the entire molecule
    • Consider all permulations of 'PIT':
       'PIT', 'PTI', 'ITP', 'IPT', 'TPI', and 'TIP' all weigh the same
    • But they differ in their 2-peptide fragments:
       'PIT' breaks into 'PI' and 'IT', while 
       'PTI' breaks into 'PT' and 'TI'

3


An Simplified Peptide Weight table

  • The actual molecular weight of an amino acid is a real number. This acounts for the relative abundances of atomic isotopes
  • We will use a simplified version that uses only integer molecular weights
  • Example:

    • Molecular weight of Glycine Amino Acid

      $$W(C_2 H_5 N O_2) = 12 \times 2 + 5 \times 1 + 14 + 16 \times 2 = 75$$

    • Molecular wieght of Glycine Residue (Minus the $H_{2} O$ lost forming the peptide bond)

      $$W(C_2 H_5 N O_2 - H_2 O) = 57$$

  • We can repeat this for all 20 Amino Acids to get a integer molecular weight table, which I call Daltons

4


Table Definitions

In [2]:
AminoAcid = {
    'A': 'Alanine', 'C': 'Cysteine', 'D': 'Aspartic acid', 'E': 'Glutamic acid',
    'F': 'Phenylalanine', 'G': 'Glycine', 'H': 'Histidine', 'I': 'Isoleucine',
    'K': 'Lysine', 'L': 'Leucine', 'M': 'Methionine', 'N': 'Asparagine',
    'P': 'Proline', 'Q': 'Glutamine', 'R': 'Arginine', 'S': 'Serine',
    'T': 'Theronine', 'V': 'Valine', 'W': 'Tryptophan', 'Y': 'Tyrosine',
    '*': 'STOP'
}

AminoAbbrv = {
    'A': 'Ala', 'C': 'Cys', 'D': 'Asp', 'E': 'Glu',
    'F': 'Phe', 'G': 'Gly', 'H': 'His', 'I': 'Ile',
    'K': 'Lys', 'L': 'Leu', 'M': 'Met', 'N': 'Asn',
    'P': 'Pro', 'Q': 'Gln', 'R': 'Arg', 'S': 'Ser',
    'T': 'Thr', 'V': 'Val', 'W': 'Trp', 'Y': 'Tyr',
    '*': 'STP'    
}

# Now it's time to use this dictionary!
Daltons = { 
    'A':  71, 'C': 103, 'D': 115, 'E': 129,
    'F': 147, 'G':  57, 'H': 137, 'I': 113,
    'K': 128, 'L': 113, 'M': 131, 'N': 114,
    'P':  97, 'Q': 128, 'R': 156, 'S':  87,
    'T': 101, 'V':  99, 'W': 186, 'Y': 163 
}

5


Some Issues with our Table

  • We can't distinguish between Leucine (L) and Isoleucine (I). They both weight 113 d
  • Nor can we distinguish Lysine (K) and Glutamine (Q), which weigh 128 d
  • For long peptide chains >50, our errors can build up
  • In reality, peptides can loose or gain one or more small molecules from their side chains and fractured peptide bonds
    • Hydrogen ions (H - 1 Dalton)
    • Water (H2O - 18 Daltons)
    • Ammonia (NH3, 17 Daltons)
  • This leads to measurements that vary around the ideal sums we assume
  • Regardless of these caveats, let's keep going

6


Let's compute the Total Molecular Weight of our Target

In [3]:
TyrocidineB1 = "VKLFPWFNQY"

# The weight of Tyrocidine B1
print sum([Daltons[res] for res in TyrocidineB1])
1322
  • Generally, we will assume that this target weight is known
  • We will use it as a terminating condition for many of our algorithms that attempt to reconstruct the measured set of weights

7


Ideally, what Weights should we get?

  • We will make the optimistic assumption that we will fracture our given petide chain into all of its constituent parts
  • For a 10 peptide chain
    • 10 single peptides
    • 9, 2-peptide chains
    • 8, 3-peptide chains
    • 7, 4-peptide chains
    • 6, 5-peptide chains
    • 5, 6-peptide chains
    • 4, 7-peptide chains
    • 3, 8-peptide chains
    • 2, 9-peptide chains
    • 1, 10-peptide chain
  • This gives an upper bound of ${10 \choose 2} = 55$ molecular weights, but relatity both the peptide chains and their weights may not be unique
  • The collection of all possible sub-peptide molecular weights from a peptide is called the peptide's Theoretical Spectrum

8


Code for computing a Theoretical Spectrum

In [4]:
def TheoreticalSpectrum(peptide):
    # Generate every possible fragment of a peptide
    spectrum = set()
    for fragLength in xrange(1,len(peptide)+1):
        for start in xrange(0,len(peptide)-fragLength+1):
            seq = peptide[start:start+fragLength]
            spectrum.add(sum([Daltons[res] for res in seq]))
    return sorted(spectrum)

print TyrocidineB1
spectrum = TheoreticalSpectrum(TyrocidineB1)
print len(spectrum)
print spectrum
VKLFPWFNQY
51
[97, 99, 113, 114, 128, 147, 163, 186, 227, 241, 242, 244, 260, 261, 283, 291, 333, 340, 357, 388, 389, 405, 430, 447, 485, 487, 543, 544, 552, 575, 577, 584, 671, 672, 690, 691, 738, 770, 804, 818, 819, 835, 917, 932, 982, 1031, 1060, 1095, 1159, 1223, 1322]
  • Why are we using a set rather than a list? Notice that we end up returning a list.
In [5]:
peptide = TyrocidineB1
fragList = []
for fragLength in xrange(1,len(peptide)+1):
    for start in xrange(0,len(peptide)-fragLength+1):
        seq = peptide[start:start+fragLength]
        fragList.append((sum([Daltons[res] for res in seq]), seq))

print len(fragList)
lastWeight = 0
for weight, frag in sorted(fragList):
    print "%10s: %d%s" % (frag, weight, "*" if (weight == lastWeight) else " ")
    lastWeight = weight
55
         P: 97 
         V: 99 
         L: 113 
         N: 114 
         K: 128 
         Q: 128*
         F: 147 
         F: 147*
         Y: 163 
         W: 186 
        VK: 227 
        KL: 241 
        NQ: 242 
        FP: 244 
        LF: 260 
        FN: 261 
        PW: 283 
        QY: 291 
        WF: 333 
       VKL: 340 
       LFP: 357 
       KLF: 388 
       FNQ: 389 
       NQY: 405 
       FPW: 430 
       PWF: 430*
       WFN: 447 
      KLFP: 485 
      VKLF: 487 
      LFPW: 543 
      PWFN: 544 
      FNQY: 552 
      WFNQ: 575 
      FPWF: 577 
     VKLFP: 584 
     KLFPW: 671 
     PWFNQ: 672 
     LFPWF: 690 
     FPWFN: 691 
     WFNQY: 738 
    VKLFPW: 770 
    LFPWFN: 804 
    KLFPWF: 818 
    FPWFNQ: 819 
    PWFNQY: 835 
   VKLFPWF: 917 
   KLFPWFN: 932 
   LFPWFNQ: 932*
   FPWFNQY: 982 
  VKLFPWFN: 1031 
  KLFPWFNQ: 1060 
  LFPWFNQY: 1095 
 VKLFPWFNQ: 1159 
 KLFPWFNQY: 1223 
VKLFPWFNQY: 1322 
  • Here's a smaller example that we will play with first before solving for Tyrocidine B1
In [6]:
spectrum = TheoreticalSpectrum('PLAY')
print len(spectrum)
print spectrum
10
[71, 97, 113, 163, 184, 210, 234, 281, 347, 444]

9


Can we Invert the Process of creating a Spectrum?

  • In essence, the problem of inferring a peptide chain from the set of mass values reported by a Mass Spectrometer is the inverse of the code we just wrote
Easy Problem: Peptide Sequence → Spectrum
Hard Problem: Peptide Sequence ← Spectrum
  • Why is computing a spectrum from a peptide sequence easy? $O(N^2)$?
  • Why is computing a peptide sequence from a specturm hard? $O(?)$

10


How might you approach this problem?

  • Can you think of a Brute-Force way of solving this problem?

  • Here's one:

    1. For every peptide sequence with the target peptide's molecular weight
    2. Compute the sequence's Theoretical Spectrum
    3. If it matches the one given, report this peptide as a possible solution
  • Which step in this algorithm is the hard part?

11


A Brute-Force Attempt

In [7]:
def PossiblePeptide(spectrum, prefix=''):
    """ A brute force method of generating all peptide
       sequences that add up to our target weight 
       from the given spectrum """
    global peptideList
    if (len(prefix) == 0):
        peptideList = []
    current = sum([Daltons[res] for res in prefix])
    target = max(spectrum)  # our target
    if (current == target):
        peptideList.append(prefix)
    elif (current < target):
        for residue in Daltons.iterkeys():
            PossiblePeptide(spectrum, prefix+residue)

def TestPeptides(candidateList, target):
    filteredList = []
    for peptide in candidateList:
        candidateSpectrum = TheoreticalSpectrum(peptide)
        if (candidateSpectrum == target):
            filteredList.append(peptide)
    return filteredList

spectrum = TheoreticalSpectrum('PLAY')
%time PossiblePeptide(spectrum)
print len(peptideList), "candidates"
print "PLAY" in peptideList
%time matches = TestPeptides(peptideList, spectrum)
print matches
print "PLAY" in matches
CPU times: user 4.56 s, sys: 71.3 ms, total: 4.63 s
Wall time: 4.65 s
3687 candidates
True
CPU times: user 84 ms, sys: 331 µs, total: 84.3 ms
Wall time: 84.6 ms
['PIAY', 'PLAY', 'YAIP', 'YALP']
True

12


Impressions?

  • Not so bad for a first attempt, but how will it perform for longer peptides?
  • We are getting the expected answer as well as answers with the indistinguishable amino acids substituted
  • We are also getting the sequence reversed? Is this a surprise?
  • We could code around this, but for today we'll just include the reversed peptide chain as a possible answer

13


Improving on Brute Force

  • The brute force method does not make good use of the spectrum it is given

    • It only ever considers the largest value from this table
    • How might we make use of the other values?
  • We could only extend our prefix with residues that appear in our spectrum

  • The weight of every new prefix that we consider should also be in our spectrum

Actual fragments:

   P L A Y PL LA AY PLA LAY PLAY

Growing and Checking prefixes:

   A             I             L             P             Y

   AI   = LA     IA   = LA     LA   = LA     PI   = PL     YA   = AY
   AIP  = PLA    IAP  = PLA    LAP  = PLA    PIA  = PLA    YAI  = LAY
   AIPY = PLAY   IAPY = PLAY   LAPY = PLAY   PIAY = PLAY   YAIP = PLAY
   AIY  = LAY    IAY  = LAY    LAY  = LAY                  YAL  = LAY
   AIYP = PLAY   IAYP = PLAY   LAYP = PLAY                 YALP = PLAY

   AL   = LA     IP   = PL     LP   = PL     PL   = PL
   ALP  = PLA    IPA  = PLA    LPA  = PLA    PLA  = PLA
   ALPY = PLAY   IPAY = PLAY   LPAY = PLAY   PLAY = PLAY
   ALY  = LAY
   ALYP = PLAY

   AY   = AY
   AYI  = LAY
   AYIP = PLAY
   AYL  = LAY
   AYLP = PLAY

14


Only a Small Change to the Code

In [8]:
def ImprovedPossiblePeptide(spectrum, prefix=''):
    global peptideList
    if (len(prefix) == 0):
        peptideList = []
    current = sum([Daltons[res] for res in prefix])
    target = max(spectrum)
    if (current == target):
        peptideList.append(prefix)
    elif (current < target):
        for residue in Daltons.iterkeys():
            if (Daltons[residue] not in spectrum):
                continue
            extend = prefix + residue
            if (sum([Daltons[res] for res in extend]) not in spectrum):
                continue
            ImprovedPossiblePeptide(spectrum, extend)

spectrum = TheoreticalSpectrum('PLAY')
%time ImprovedPossiblePeptide(spectrum)
print len(peptideList)
print "PLAY" in peptideList
%time matches = TestPeptides(peptideList, spectrum)
print matches
print "PLAY" in matches
CPU times: user 993 µs, sys: 421 µs, total: 1.41 ms
Wall time: 1.09 ms
16
True
CPU times: user 310 µs, sys: 229 µs, total: 539 µs
Wall time: 361 µs
['PIAY', 'PLAY', 'YAIP', 'YALP']
True
  • Provides a HUGE performace difference
  • Yet another example of Branch-and-Bound
  • We improved both the enumeration and verification phases, but the differece was much more significant in the enumeration step
  • Let's look at the list generated by pruning the search tree and see if it gives us any more insights
In [9]:
for peptide in peptideList:
    print peptide
AIPY
AIYP
ALPY
ALYP
AYIP
AYLP
IAPY
IAYP
IPAY
LAPY
LAYP
LPAY
PIAY
PLAY
YAIP
YALP
In [10]:
TheoreticalSpectrum('PLAY')
Out[10]:
[71, 97, 113, 163, 184, 210, 234, 281, 347, 444]
In [11]:
TheoreticalSpectrum('LAPY')
Out[11]:
[71, 97, 113, 163, 168, 184, 260, 281, 331, 444]
In [12]:
print sum([Daltons[res] for res in 'AP'])  # Suffix of 'LAP' prefix
print sum([Daltons[res] for res in 'APY']) # Suffix of 'LAPY'
print sum([Daltons[res] for res in 'PY'])  # Suffix of 'LAPY'
168
331
260
  • There are still differences in the spectrums, yet every prefix was in the spectrum when we added it. What are we missing?
  • Suffixes!

15


We can do Even Better

  • All suffixes of each prefix that we consider should also be in our spectrum
In [13]:
def UltimatePossiblePeptide(spectrum, prefix=''):
    global peptideList
    if (len(prefix) == 0):
        peptideList = []
    current = sum([Daltons[res] for res in prefix])
    target = max(spectrum)
    if (current == target):
        peptideList.append(prefix)
    elif (current < target):
        for residue in Daltons.iterkeys():
            extend = prefix + residue
            suffix = [extend[i:] for i in xrange(len(extend))]
            for fragment in suffix:
                if (sum([Daltons[res] for res in fragment]) not in spectrum):
                    break
            else:
                UltimatePossiblePeptide(spectrum, extend)

spectrum = TheoreticalSpectrum('PLAY')
%time UltimatePossiblePeptide(spectrum)
print len(peptideList)
print "PLAY" in peptideList
%time matches = TestPeptides(peptideList, spectrum)
print matches
print "PLAY" in matches
CPU times: user 2.44 ms, sys: 723 µs, total: 3.17 ms
Wall time: 2.61 ms
4
True
CPU times: user 170 µs, sys: 63 µs, total: 233 µs
Wall time: 195 µs
['PIAY', 'PLAY', 'YAIP', 'YALP']
True
  • A little slower, but our list is pruned significantly
  • All of theses have identical spectrums

16


Now let's return to our real peptide

In [14]:
spectrum = TheoreticalSpectrum(TyrocidineB1)
%time UltimatePossiblePeptide(spectrum)
print len(peptideList)
print TyrocidineB1 in peptideList
%time matches = TestPeptides(peptideList, spectrum)
print len(matches)
print TyrocidineB1 in matches
CPU times: user 40.4 ms, sys: 21.9 ms, total: 62.3 ms
Wall time: 45.4 ms
16
True
CPU times: user 2.92 ms, sys: 1.56 ms, total: 4.47 ms
Wall time: 3.27 ms
16
True
In [15]:
for peptide in peptideList:
    print peptide
VKIFPWFNKY
VKIFPWFNQY
VKLFPWFNKY
VKLFPWFNQY
VQIFPWFNKY
VQIFPWFNQY
VQLFPWFNKY
VQLFPWFNQY
YKNFWPFIKV
YKNFWPFIQV
YKNFWPFLKV
YKNFWPFLQV
YQNFWPFIKV
YQNFWPFIQV
YQNFWPFLKV
YQNFWPFLQV

All of these peptides give also give us our desired spectrum

17


Great, but our assumptions are a little Naïve

  • In relaity, Mass Spectometers don't report the Theoretical Spectrum of a peptide
  • Instead they report a measured or Experimental Spectrum
  • This spectrum might miss some fragments
  • It might also report false fragments
    • From Contaminants
    • New peptides forming from assorted fragments
  • The result is that some of the masses that appear may be misleading, and some that we want might be missing
  • We need to make our algorithms for reporting candidate protein sequences robust to noise

18


An example Experimental Spectrum for Tyrocidine B1

       97,     99,    113,    114,    128,    147,    163,   
      186,    200,    227,    241,    242,    244,    260,   
      261,    283,    291,    333,    340,    357,    388,   
      389,    405,    430,    447,    457,    485,    487,   
      543,    544,    552,    575,    577,    584,    659,   
      671,    672,    690,    691,    731,    738,    770,   
      804,    818,    819,    835,    906,    917,    932,   
      982,   1031,   1060,   1095,   1159,   1223,   1322

False Masses: present in the experimental spectrum, but not in the theoretical spectrum

Missing Masses: present in the theoretical spectrum, but not in the experimental spectrum

19


An example Experimental Spectrum for Tyrocidine B1

       97,     99,    113,            128,    147,    163,   
      186,    200,    227,    241,    242,    244,    260,   
      261,    283,    291,    333,    340,    357,           
              405,    430,    447,    457,            487,   
      543,    544,    552,    575,    577,    584,    659,   
      671,    672,    690,    691,    731,    738,    770,   
      804,    818,    819,    835,    906,    917,    932,   
      982,   1031,           1095,   1159,           1322

False Masses: We don't know which these are

Missing Masses: And these values don't appear

20


An aside: How I faked an Experimental Spectrum

In [20]:
# generate a synthetic experimental spectrum with 10% Error
import random
random.seed(1961)

spectrum = TheoreticalSpectrum(TyrocidineB1)

# Pick around ~10% at random to remove
missingMass = random.sample(spectrum, 6)
print "Missing Masses = ", missingMass

# Add back another ~10% of false, but actual, peptide masses
falseMass = []
for i in xrange(5):
    fragment = ''.join(random.sample(Daltons.keys(), random.randint(2,len(TyrocidineB1)-2)))
    weight = sum([Daltons[residue] for residue in fragment])
    falseMass.append(weight)
print "False Masses = ", falseMass

experimentalSpectrum = sorted(set([mass for mass in spectrum if mass not in missingMass] + falseMass))
Missing Masses =  [1223, 114, 738, 186, 835, 357]
False Masses =  [457, 200, 731, 906, 659]
In [21]:
print experimentalSpectrum
[97, 99, 113, 128, 147, 163, 200, 227, 241, 242, 244, 260, 261, 283, 291, 333, 340, 388, 389, 405, 430, 447, 457, 485, 487, 543, 544, 552, 575, 577, 584, 659, 671, 672, 690, 691, 731, 770, 804, 818, 819, 906, 917, 932, 982, 1031, 1060, 1095, 1159, 1322]

21


A Golf Tournament Analogy

  • After the first couple of rounds of a major golf tournament a cut is made of all golfers who are so far back from the leader that it is deemed they are unlikely to ever finish in the money
  • These cut golfers are removed from further consideration
  • This choice is heuristic
    • It is possible that a player just below the cut could have two exceptional rounds, but that is considered unlikely
  • What is the equivalent of a score in our peptide finding problem?
    • The number of matching masses in the candidate peptide's Theoretical Spectrum and the Experimental Spectrum
    • Normalized score, why?
    • matches / len(candidate peptide's Theoretical Spectrum)
  • In our peptide golf game a round will be considered a one peptide extension of a active set of player peptides
  • We will do cuts on every round, keeping to top 5% of finishers or the top 5 players, which ever is more
  • Why 5%? It is arbitrary, but on each round we will extend the current set of players by one of 20 amino acids, thus increasing the number of peptides by a factor of 20, so reducing by 5% leaves the poolsize realtively stable.

22


An Implementation

In [28]:
import itertools

def LeaderboardFindPeptide(noisySpectrum, cutThreshold=0.05):
    # Golf Tournament Heuristic
    spectrum = set(noisySpectrum)
    target = max(noisySpectrum)
    players = [''.join(peptide) for peptide in itertools.product(Daltons.keys(), repeat=2)]
    round = 1
    currentLeader = [0.0, '']
    while True:
        print "%8d Players in round %d" % (len(players), round) 
        leaderboard = []
        for prefix in players:
            testSpectrum = set(TheoreticalSpectrum(prefix))
            totalWeight = max(testSpectrum)
            score = len(spectrum & testSpectrum)/float(len(testSpectrum))
            if (totalWeight == target):
                if (score > currentLeader[0]):
                    currentLeader = [score, prefix]
                elif (score == currentLeader[0]):
                    currentLeader += [prefix]
            elif (totalWeight < target):
                leaderboard.append((score, prefix))
        remaining = len(leaderboard)
        if (remaining == 0):
            print "Done, no sequences can be extended"
            break
        leaderboard.sort(reverse=True)
        # Prune the larger of the top 5% or the top 5 players
        cut = leaderboard[max(min(5,remaining-1),int(remaining*cutThreshold))][0]
        players = [p+r for s, p in leaderboard if s >= cut for r in Daltons.iterkeys()]
        round += 1
    return currentLeader

spectrum = TheoreticalSpectrum(TyrocidineB1)
experimentalSpectrum = [mass for mass in spectrum if mass not in missingMass] + falseMass
%time winners = LeaderboardFindPeptide(experimentalSpectrum)
print winners
print len(winners) - 1, "Candidate residues with", winners[0], 'matches'
print TyrocidineB1
print TyrocidineB1 in winners
     400 Players in round 1
     480 Players in round 2
    1280 Players in round 3
    1620 Players in round 4
    3060 Players in round 5
    3240 Players in round 6
    3640 Players in round 7
    3680 Players in round 8
    3760 Players in round 9
    3520 Players in round 10
     800 Players in round 11
     320 Players in round 12
Done, no sequences can be extended
CPU times: user 2.04 s, sys: 16 ms, total: 2.06 s
Wall time: 2.11 s
[0.8823529411764706, 'VQLFPWFNQY', 'VQLFPWFNKY', 'VQIFPWFNQY', 'VQIFPWFNKY', 'VKLFPWFNQY', 'VKLFPWFNKY', 'VKIFPWFNQY', 'VKIFPWFNKY', 'YQNFWPFLQV', 'YQNFWPFLKV', 'YQNFWPFIQV', 'YQNFWPFIKV', 'YKNFWPFLQV', 'YKNFWPFLKV', 'YKNFWPFIQV', 'YKNFWPFIKV']
16 Candidate residues with 0.882352941176 matches
VKLFPWFNQY
True

23


Next Time

  • This method works well, but it rely on heuristcs, and thus might miss the best answer
  • Our methods are still make a lot of simplfying assumptions
  • In reality relying only exact matches might lead us
  • We will continue to explore ways of assembling peptide sequences from a given experimental spectrum

24