Compgen Tool Suite

Genetics, Evolution, and the Coalescent Theory

Comp 790-087 meets on Tuesdays and Thursdays at 12:30pm in SN325.

Lecture 1

Download slides

Code for Wright-Fisher Haploid Model

Example of the Wright-Fisher haploid model with 10 genes. At each generation, descendants are chosen at random, with replacement, from the current generation. Eventually all descendants are copies of a single ancestral gene.
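
The resampling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's linked code: the function name `wright_fisher_haploid`, the single-letter gene labels, and the optional `rng` argument are all assumptions made for this example.

```python
import random
import string


def wright_fisher_haploid(n_genes=10, rng=None):
    """Simulate a haploid Wright-Fisher population until fixation.

    Returns the list of generations; each generation is a list holding
    the label of the founder gene that each current gene descends from.
    (Hypothetical helper; labels are letters, so n_genes must be <= 26.)
    """
    if not 1 <= n_genes <= 26:
        raise ValueError("n_genes must be between 1 and 26")
    rng = rng or random.Random()
    generation = list(string.ascii_uppercase[:n_genes])
    history = [generation]
    # Keep resampling until only one founder label survives.
    while len(set(generation)) > 1:
        # Each gene in the next generation picks its parent uniformly
        # at random, with replacement, from the current generation.
        generation = [rng.choice(generation) for _ in range(n_genes)]
        history.append(generation)
    return history


if __name__ == "__main__":
    history = wright_fisher_haploid(10)
    for generation in history:
        print("".join(generation))
    print(len(history) - 1, "Generations")
```

Running the function many times and collecting `len(history) - 1` gives the data for a histogram of convergence times.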

A histogram of the number of generations for convergence of 10 genes.

Code for Wright-Fisher Diploid Model

A histogram of the number of generations for convergence of 10 genes.

           Females            |            Males             
['AA', 'BB', 'CC', 'DD', 'EE' | 'FF', 'GG', 'HH', 'II', 'JJ']
['CJ', 'EG', 'AF', 'AG', 'CH' | 'CF', 'AI', 'BJ', 'DG', 'DJ']
['JD', 'CJ', 'JA', 'FJ', 'GB' | 'EI', 'JJ', 'JA', 'AJ', 'EG']
['JJ', 'GJ', 'JJ', 'JG', 'JA' | 'JJ', 'AJ', 'AJ', 'CJ', 'JE']
['JJ', 'GJ', 'JJ', 'GJ', 'GJ' | 'JA', 'GJ', 'JJ', 'JC', 'JA']
['JJ', 'JG', 'JJ', 'GA', 'JA' | 'JJ', 'GJ', 'JJ', 'GJ', 'JJ']
['JG', 'JJ', 'JJ', 'JG', 'JJ' | 'JJ', 'GJ', 'JJ', 'GJ', 'GJ']
['JJ', 'JG', 'JJ', 'JJ', 'JG' | 'JG', 'JJ', 'JJ', 'JJ', 'JG']
['JJ', 'JJ', 'GJ', 'JJ', 'GG' | 'JG', 'JJ', 'JJ', 'GJ', 'GJ']
['JJ', 'GJ', 'GG', 'GJ', 'JG' | 'JJ', 'JJ', 'JG', 'JJ', 'JJ']
['JJ', 'GJ', 'JG', 'JJ', 'JJ' | 'JJ', 'GJ', 'GJ', 'JJ', 'GJ']
['JJ', 'JJ', 'JJ', 'GJ', 'JJ' | 'JJ', 'JJ', 'JJ', 'JJ', 'GJ']
['JJ', 'GJ', 'GJ', 'JJ', 'JJ' | 'JJ', 'JJ', 'JJ', 'JJ', 'JJ']
['GJ', 'JJ', 'GJ', 'JJ', 'GJ' | 'JJ', 'GJ', 'JJ', 'JJ', 'JJ']
['GG', 'JJ', 'GJ', 'JJ', 'GJ' | 'JJ', 'JJ', 'JJ', 'GJ', 'JJ']
['JJ', 'GG', 'JJ', 'JJ', 'JJ' | 'JJ', 'JJ', 'JJ', 'JG', 'GJ']
['JJ', 'JG', 'JJ', 'JJ', 'JJ' | 'GJ', 'GJ', 'JJ', 'JJ', 'JG']
['JJ', 'JJ', 'GJ', 'JG', 'JJ' | 'JJ', 'JJ', 'JJ', 'JJ', 'JG']
['GJ', 'JJ', 'JJ', 'JJ', 'JJ' | 'JJ', 'JJ', 'GJ', 'JJ', 'JG']
['JJ', 'JJ', 'GJ', 'JJ', 'JJ' | 'JJ', 'JG', 'JG', 'JJ', 'JJ']
['JJ', 'JJ', 'JJ', 'JG', 'JJ' | 'JJ', 'JJ', 'JJ', 'JJ', 'JJ']
['JJ', 'JJ', 'JJ', 'JJ', 'JJ' | 'JJ', 'JJ', 'JJ', 'JJ', 'JJ']
21 Generations
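
A run like the one above could be produced by a sketch along the following lines. It assumes, based only on the printed output, that each child draws one allele at random from a randomly chosen mother and one from a randomly chosen father, and that the founders are homozygous with distinct labels. The names `next_generation` and `wright_fisher_diploid` are hypothetical, and the allele order within a child may differ from the course's linked code.

```python
import random
import string


def next_generation(females, males, rng):
    """Build the next generation of females and males (assumed mating rule)."""
    def child():
        mother = rng.choice(females)  # parents are sampled with replacement
        father = rng.choice(males)
        # One randomly chosen allele from each parent.
        return rng.choice(mother) + rng.choice(father)

    return [child() for _ in females], [child() for _ in males]


def wright_fisher_diploid(n_per_sex=5, rng=None):
    """Simulate until every allele in the population has the same label."""
    if not 1 <= n_per_sex <= 13:
        raise ValueError("n_per_sex must be between 1 and 13")
    rng = rng or random.Random()
    labels = string.ascii_uppercase[:2 * n_per_sex]
    females = [c * 2 for c in labels[:n_per_sex]]  # 'AA', 'BB', ...
    males = [c * 2 for c in labels[n_per_sex:]]    # 'FF', 'GG', ...
    history = [(females, males)]
    while len(set("".join(females + males))) > 1:
        females, males = next_generation(females, males, rng)
        history.append((females, males))
    return history


if __name__ == "__main__":
    history = wright_fisher_diploid(5)
    for females, males in history:
        print(females, "|", males)
    print(len(history) - 1, "Generations")
```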

Project Data

Data Set [1]: Genotypes from 4 mouse populations

Project Ideas

[mcmillan] On page 50 of the text, a variant of Ewens' formula suggests that the number of singleton genes can be used to estimate the mutation rate, θ. I propose to decompose the genome of the laboratory mouse strains (group 3 in our data set) into intervals based on the 4-gamete test, then count the number of haplotypes that occur only once in each interval to estimate the value of θ for that interval. I will plot this value over the entire genome. I will compare this estimate to the other applicable estimators of θ suggested in section 6.2 (Watterson's, Tajima's, and the two of Fu). Where an estimator is not applicable, I will write a short explanation of why. My objective is to investigate the consistency of these estimators.
[mcmillan] I propose to find the effective population sizes for all four groups in our data set. I will first use the simple approach mentioned in section 1.10 and again in 8.2. My goal is to see if coalescent methods can recognize a potential bottleneck in the lab set.
[catie] Within the data there are a number of related sets, mainly in population 3. These strains are referred to as sister strains, and include the 129, C57, CBA, DBA, A/J, C3H, BALB, NOD, and NON mice, to name a few. I propose first to analyze the pairwise differences between these strains, in terms of both the number of differing SNPs and the number of differing genes. Then, assuming the infinite sites model with mutation rate θ/2, I will compare the expected number of mutations over time with the actual number of mutations, and determine an accurate value for θ. To find the expected number of mutations, I will use Watterson's and Tajima's estimators of θ and the two estimators of Fu. I also plan to build coalescent trees based on the historical evolution of the strains, using the mutation models presented in Ch. 2 that obey the molecular clock. If time permits, I'd also like to apply the algorithm presented in Ch. 3, p. 75, to find the gene trees between sister strains. My objective is to compare sister strains and see whether their rate of mutation adheres to the standard model.
[zhaojun] The wiki page for this project is at [2]. Simulation algorithms for genetic data are very important. In this course we learned how to quickly simulate data based on observed samples, but it is unclear to me how good the algorithm in the book is. I would therefore like to develop an efficient simulation algorithm based on the one in the book, including simulation of mutation, gene conversion, and recombination. The simulation is a forward model: it starts with a single gene (the seed) and simulates generation by generation, with the number of samples increasing over time. I will also implement Algorithm 5 in Ch. 5. To verify Algorithm 5, I will simulate a 'dense' data set and randomly select some samples as the observed ones. On this sample I will run Algorithm 5 to see how similar the MRCA and the seed are, and how similar the two evolutionary trees are (one simulated by my algorithm, the other by Algorithm 5). I will start with a naive implementation at small scale. If time permits, I will consider the scalability of the algorithm. The simulation raises efficiency problems, such as (a) how to run in limited memory (what to do when simulating a large population) and (b) how to simulate faster than the naive approach, and a simulation problem, (c) how to simulate with a large population size and a small sample size.
[liuyi] Estimating ρ is difficult because recombination events are only partially observed. Assume that the population is derived from a small set of known founders. Recombination events can then be inferred easily by comparing the sample sequences to the founders. Knowledge of the founders, though unavailable in natural animal populations, is given in lab-raised animal resources such as CC. Thus, we can use the observed recombination rate in the CC population as the ground truth to evaluate the effectiveness of different ρ estimators. In this project, I would like to implement and evaluate the moment-based Hudson 1987 and/or Wakeley 1997 estimators. I am also interested in using the Metropolis-Hastings MCMC method proposed in Kuhner 1999. The pseudo-likelihood in Hudson 2001b (6.3.2.1) will not be assessed, because computing the likelihood of a configuration is extremely difficult. Experiments based on synthetic data from Zhaojun's simulator are also possible. At the moment I have only skimmed a few related papers, so I may refine the scope and approaches in the future.
[kemal] On page 180, the book discusses UPBLUE, an estimator of θ that takes the coalescent tree topology and branch lengths into account. The book says that the variance of UPBLUE is close to the theoretical optimum when the coalescent tree is estimated accurately. Since we know the tree for population 2, this estimator can be used to estimate the mutation rate accurately. I propose to estimate the mutation rate for the Jackson lab mice using UPBLUE and 4 other methods (Watterson's, Tajima's, and the two of Fu). My aim is to measure the accuracy of these 4 estimators, taking UPBLUE as the ground truth. I also plan to observe the mean and variance of the estimators over random samples of the population.
[kemal] ('Is there a Chinese in Turkey' project) I want to check whether there is migration between the Jackson lab mouse population and the other lab mouse population. For this we need an estimator of the migration rate. The book gives no such estimator, but one might be derived using the intuition behind the techniques used to estimate the mutation rate. The second step is to devise a significance test for whether the estimated value differs significantly from 0. In this way we can check whether there is migration between the two lab mouse populations.
[shunping] The recombination rate ρ is an important parameter of the model, but it does not provide an intuitive description of the history. To better understand the coalescent and recombination events, we may need to resort to ancestral recombination graphs (ARGs). However, ARGs can be complex, and generating them is hard. Focusing on attributes of ARGs, e.g., the number of recombination events of each kind, may therefore be a good starting point. In the project, I would like to estimate ρ first. Then I will implement the estimators of the number of recombination events in Ch. 5.7 (pp. 152-153). I will also explore other attributes of ARGs, such as the total length of the graph. Finally, I will try to plot the ARGs to provide a visual depiction of the history.
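
Several of the proposals above rely on Watterson's and Tajima's estimators of θ. As a point of reference, here is a minimal Python sketch of both, using the standard definitions: Watterson's estimator divides the number of segregating sites S by the sum 1 + 1/2 + ... + 1/(n-1) for a sample of n sequences, and Tajima's estimator is the average number of pairwise differences. The function names and the input format (a list of equal-length haplotype strings with no missing data) are assumptions for illustration, and the estimates are per locus, under the infinite sites model.

```python
from itertools import combinations


def segregating_sites(haplotypes):
    """Count the columns in which at least two haplotypes differ."""
    return sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)


def watterson_theta(haplotypes):
    """Watterson's estimator: S / (1 + 1/2 + ... + 1/(n-1))."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(haplotypes) / a_n


def tajima_theta(haplotypes):
    """Tajima's estimator: mean number of pairwise differences."""
    pairs = list(combinations(haplotypes, 2))
    if not pairs:
        raise ValueError("need at least two haplotypes")
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # Toy sample: 4 haplotypes, 5 sites, 4 of which are segregating.
    sample = ["00000", "10000", "11000", "00110"]
    print(watterson_theta(sample))  # 4 / (11/6) = 24/11, about 2.18
    print(tajima_theta(sample))     # 13 / 6, about 2.17
```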

Project Progress

[kemal] I applied Tajima's estimator of θ to synthetic data generated by Zhaojun's program. Interestingly, the estimated values did not change much when I changed the mutation parameter. (Is there a bug in Zhaojun's program?) Here is the code: Media:Kemal_project_codes.zip. Update: Zhaojun fixed the bug, so I now get different estimates for different mutation rates. I am still trying to interpret the results: I get 50 ± 30 for N=1000, mu=0.03, and 5000 generations. I plan to apply the estimator to mitochondrial SNPs. I am also working on implementing the other estimators of θ.

[catie] So far I have parsed the data into .snp files that contain (build36 position, build37 position, snp, inGene), where snp is the allele for every strain in population 3 of the data set at a particular SNP location, and inGene is a binary value indicating whether the SNP falls within a gene, as specified by the data set given in class. I've built MMVs of all pairwise relationships between sister strains, using only the SNPs that were included in build 37, which comes to 581,602 SNPs. I've analyzed those MMVs and posted the number of matches, mismatches, heterozygous calls, and no-calls between each pair of sister strains. That analysis can be seen on the Sister Strains Project page. I'm currently finding the number of mismatches that occur within genes between each pair of sister strains. Those results should be posted soon. I'm also attempting to apply the estimators listed in the project ideas section to find an estimate for θ.
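
The per-pair tally described above (matches, mismatches, heterozygous calls, no-calls) can be illustrated with a short Python sketch. The encoding here is purely an assumption for the example, not the format of the actual .snp files: each strain is a string with one character per SNP, where 'H' marks a heterozygous call and 'N' a no-call. The function name `compare_strains` is likewise hypothetical.

```python
from collections import Counter


def compare_strains(calls_a, calls_b, het="H", no_call="N"):
    """Tally per-SNP agreement between two strains.

    calls_a and calls_b are equal-length strings, one character per SNP
    (assumed encoding: A/C/G/T, `het` for heterozygous, `no_call` for missing).
    A SNP counts as a no-call or heterozygous if either strain is.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("strains must cover the same SNPs")
    counts = Counter(match=0, mismatch=0, heterozygous=0, no_call=0)
    for a, b in zip(calls_a, calls_b):
        if no_call in (a, b):
            counts["no_call"] += 1
        elif het in (a, b):
            counts["heterozygous"] += 1
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


if __name__ == "__main__":
    print(compare_strains("ACGTHN", "ACCTAN"))
```

Restricting the comparison to SNPs flagged as inGene would give the within-gene mismatch counts mentioned above.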

[liuyi] I have implemented Wakeley's estimator and tried it on synthetic data and CC data. It gives reasonable estimates. I have also applied the MCMC estimators, but they fail to produce meaningful results with chains of affordable length. See http://compgen.unc.edu/wiki/index.php/Rho

[shunping] My progress is on this page: [3].

[zhaojun] My project appears here: [4]

Genetics, Evolution, and the Coalescent Theory

Lecture 1

Lecture 2

Lecture 3

Lecture 4

Lecture 5

Lecture 6

Lecture 7

Lecture 8

Lecture 9

Lecture 10 (Catie Welsh)

Lecture 11 (Catie Welsh)

Lecture 12 (Kemal Pakatci)

Lecture 13 (Kemal Pakatci)

Lecture 14 (Shunping Huang)

Lecture 15 (Shunping Huang)

Lecture 16&17 (Liu Yi)

Lecture 18&19 (Zhaojun Zhang)
