  Comp 790-087:
Computational Genetics
Spring 2014

4/11/2014: I have swapped the project presentation times of Couture and Zhang

4/10/2014: I have swapped the project presentation times of Maurizio and Raulerson

2/16/2014:  I have readjusted the course schedule again.

2/5/2014:  I pushed the start of the paper presentations one class period later.

Comp 790-087 meets on Mondays and Wednesdays at 1:00pm-2:15pm in SN011.

 


Date Topic
January 8 Course Preliminaries (slides) (syllabus)
January 13 Pedigree and Genotype (slides) (notebook)
January 15 Inferring Haplotype (slides) (notebook)
January 20 Martin Luther King Day
January 22 John Didion's PhD defense (noon, Pagano Conference Room)
January 27 Go over solutions to haplotype inference, choosing papers. (notebook)
January 29 Cancelled due to snow
February 3 Measuring relatedness of samples (notebook)
February 5 Processing sequence data using Pysam (notebook)
February 10 More experiments with sequence data (notebook)
February 12 Cancelled due to snow
February 17 Searching genomes (slides) (notebook)
February 19 TBD
February 24 Paper Presentations: Orellana, Couture
February 26 Paper Presentations: Christy, Kao
March 3 Paper Presentations: Holt, Zhang
March 5 Paper Presentations: Keele, Bartlett
March 10 Spring Break (No Class)
March 12 Spring Break (No Class)
March 17 Paper Presentations: Harn, Fu
March 19 Paper Presentations: Aggrawal, TBA
Please attend: Lior Pachter's lecture on "Making Sense of RNA-seq", March 19 at 3pm in MBRB G202
March 24 Paper Presentations: Raulerson, Maurizio
March 26 Code-a-thon Day 1
March 31 Code-a-thon Day 2
April 2 Code-a-thon Day 3
April 7

Project proposal:
Orellana (Gene isoform detection and validation by integrating short and long reads)
Holt (De novo DNA assembly using msBWTs)

April 9
Project proposal:
Christy (Xist secondary structure considering variants),
Fu (QTL mapping using microarray intensities as genotype)
April 14
Project proposal:
Zhang (Genome alignments and variant discovery using a read index),
Aggrawal (Using msBWTs to represent read-overlap graphs in paired-end HTS datasets)
April 16

Project proposal:
Raulerson (Find a projection of RNAseq data that exposes phenotype, in particular age),
Couture (RNA expression normalization using histology features)

April 21

Project proposal:
Keele (Modeling founder effects when mapping QTLs in outbred populations),
Kao (Modeling and compressing NGS quality strings)

April 23
Project proposal: 
Harn (RNAseq abundance and isoform estimation), 
Maurizio (Searching for Viral sequences, and differential expression (genetic backgrounds) within RNASeq)

 

Ground Rules for Project Proposals:

  • Explain why your problem is important 
  • You must showcase at least one recent paper that addresses a problem similar to the one you are addressing
    • It is okay to apply the paper's solution to a new problem, but not merely to a new dataset.
    • What limitations of the paper's solution do you address?
  • Describe the data and the algorithms/analysis approaches that you will use.
  • Explain what a positive outcome of your proposed project would be.

Resources:

Sequence Data:
Genomes:
Human reference genome Build 37
Same as used by 23andMe data
Mouse reference genome Build 37
Same as used by RNAseq data

Review Articles:

  1. Martin, Jeffrey A., and Zhong Wang. "Next-generation transcriptome assembly." Nature Reviews Genetics 12.10 (2011): 671-682. (PDF)
  2. (1425 - Presenter: Orellana, 975, Discussant: Zhang, 300)
    Transcriptomics studies often rely on partial reference transcriptomes that fail to capture the full catalogue of transcripts and their variations. Recent advances in sequencing technologies and assembly algorithms have facilitated the reconstruction of the entire transcriptome by deep RNA sequencing (RNA-seq), even without a reference genome. However, transcriptome assembly from billions of RNA-seq reads, which are often very short, poses a significant informatics challenge. This Review summarizes the recent developments in transcriptome assembly approaches — reference-based, de novo and combined strategies — along with some perspectives on transcriptome assembly in the near future.
  3. Lovén, Jakob, et al. "Revisiting global gene expression analysis." Cell 151.3 (2012): 476-482. (PDF)
  4. (1150 - Presenter: Couture, 600, Discussant: Maurizio, 300)
    Gene expression analysis is a widely used and powerful method for investigating the transcriptional behavior of biological systems, for classifying cell states in disease, and for many other purposes. Recent studies indicate that common assumptions currently embedded in experimental and analytical practices can lead to misinterpretation of global gene expression data. We discuss these assumptions and describe solutions that should minimize erroneous interpretation of gene expression data from multiple analysis platforms.
  5. Fatica, Alessandro, and Irene Bozzoni. "Long non-coding RNAs: new players in cell differentiation and development." Nature Reviews Genetics (2013). (PDF)
  6. (950 - Presenter: Christy, 500, Discussant: Raulerson, ??)
    Genomes of multicellular organisms are characterized by the pervasive expression of different types of non-coding RNAs (ncRNAs). Long ncRNAs (lncRNAs) belong to a novel heterogeneous class of ncRNAs that includes thousands of different species. lncRNAs have crucial roles in gene expression control during both developmental and differentiation processes, and the number of lncRNA species increases in genomes of developmentally complex organisms, which highlights the importance of RNA-based levels of control in the evolution of multicellular organisms. In this Review, we describe the function of lncRNAs in developmental processes, such as in dosage compensation, genomic imprinting, cell differentiation and organogenesis, with a particular emphasis on mammalian development.
  7. Rodriguez, Jesse M., Serafim Batzoglou, and Sivan Bercovici. "An accurate method for inferring relatedness in large datasets of unphased genotypes via an embedded likelihood-ratio test." Research in Computational Molecular Biology. Springer Berlin Heidelberg, 2013. (PDF)
  8. (1725 - Presenter: Kao, 500, Discussant: Fu, 400)
    Studies that map disease genes rely on accurate annotations that indicate whether individuals in the studied cohorts are related to each other or not. For example, in genome-wide association studies, the cohort members are assumed to be unrelated to one another. Investigators can correct for individuals in a cohort with previously-unknown shared familial descent by detecting genomic segments that are shared between them, which are considered to be identical by descent (IBD). Alternatively, elevated frequencies of IBD segments near a particular locus among affected individuals can be indicative of a disease-associated gene. As genotyping studies grow to use increasingly large sample sizes and meta-analyses begin to include many data sets, accurate and efficient detection of hidden relatedness becomes a challenge. To enable disease-mapping studies of increasingly large cohorts, a fast and accurate method to detect IBD segments is required.

    We present PARENTE, a novel method for detecting related pairs of individuals and shared haplotypic segments within these pairs. PARENTE is a computationally-efficient method based on an embedded likelihood ratio test. As demonstrated by the results of our simulations, our method exhibits better accuracy than the current state of the art, and can be used for the analysis of large genotyped cohorts. PARENTE’s higher accuracy becomes even more significant in more challenging scenarios, such as detecting shorter IBD segments or when an extremely low false-positive rate is required. PARENTE is publicly and freely available at http://parente.stanford.edu/.
  9. Robasky, Kimberly, Nathan E. Lewis, and George M. Church. "The role of replicates for error mitigation in next-generation sequencing." Nature Reviews Genetics (2013). (PDF)
  10. (810 - Presenter: Holt, 450, Discussant: Bartlett, 150)
    Advances in next-generation sequencing (NGS) technologies have rapidly improved sequencing fidelity and substantially decreased sequencing error rates. However, given that there are billions of nucleotides in a human genome, even low experimental error rates yield many errors in variant calls. Erroneous variants can mimic true somatic and rare variants, thus requiring costly confirmatory experiments to minimize the number of false positives. Here, we discuss sources of experimental errors in NGS and how replicates can be used to abate such errors.
  11. He, Dan, et al. "IPED: Inheritance path based pedigree reconstruction algorithm using genotype data." Research in Computational Molecular Biology. Springer Berlin Heidelberg, 2013. (PDF)
  12. (1045 - Presenter: Zhang, 400, Discussant: Keele, 200)
    The problem of inference of family trees, or pedigree reconstruction, for a group of individuals is a fundamental problem in genetics. Various methods have been proposed to automate the process of pedigree reconstruction given the genotypes or haplotypes of a set of individuals. Current methods, unfortunately, are very time consuming and inaccurate for complicated pedigrees such as pedigrees with inbreeding. In this work, we propose an efficient algorithm which is able to reconstruct large pedigrees with reasonable accuracy. Our algorithm reconstructs the pedigrees generation by generation backwards in time from the extant generation. We predict the relationships between individuals in the same generation using an inheritance path based approach implemented using an efficient dynamic programming algorithm. Experiments show that our algorithm runs in linear time with respect to the number of reconstructed generations and therefore it can reconstruct pedigrees which have a large number of generations. Indeed it is the first practical method for reconstruction of large pedigrees from genotype data.
  13. Cheng, Riyan, et al. "Practical Considerations Regarding the Use of Genotype and Pedigree Data to Model Relatedness in the Context of Genome-Wide Association Studies." G3: Genes|Genomes|Genetics 3.10 (2013): 1861-1867. (PDF)
  14. (485 - Presenter: Keele, 325, Discussant: Kao, 125)
    Genome-wide association studies of complex traits often are complicated by relatedness among individuals. Ignoring or inappropriately accounting for relatedness often results in inflated type I error rates. Either genotype or pedigree data can be used to estimate relatedness for use in mixed-models when undertaking quantitative trait locus mapping. We performed simulations to investigate methods for controlling type I error and optimizing power considering both full and partial pedigrees and, similarly, both sparse and dense marker coverage; we also examined real data sets. (1) When marker density was low, estimating relatedness by genotype data alone failed to control the type I error rate; (2) this was resolved by combining both genotype and pedigree data. (3) When sufficiently dense marker data were used to estimate relatedness, type I error was well controlled and power increased; however, (4) this was only true when the relatedness was estimated using genotype data that excluded genotypes on the chromosome currently being scanned for a quantitative trait locus.
  15. Plass, Christoph, et al. "Mutations in regulators of the epigenome and their connections to global chromatin patterns in cancer." Nature Reviews Genetics 14.11 (2013): 765-780. (PDF)
  16. (1143 - Presenter: Bartlett, 300, Discussant: Couture, 300)
    Malignancies are characterized by extensive global reprogramming of epigenetic patterns, including gains or losses in DNA methylation and changes to histone marks. Furthermore, high-resolution genome-sequencing efforts have discovered a wealth of mutations in genes encoding epigenetic regulators that have roles as ‘writers’, ‘readers’ or ‘editors’ of DNA methylation and/or chromatin states. In this Review, we discuss how these mutations have the potential to deregulate hundreds of targeted genes genome wide. Elucidating these networks of epigenetic factors will provide mechanistic understanding of the interplay between genetic and epigenetic alterations, and will inform novel therapeutic strategies.
  17. Huang, Lin, Victoria Popic, and Serafim Batzoglou. "Short read alignment with populations of genomes." Bioinformatics 29.13 (2013): i361-i370. (PDF)
  18. (969 - Presenter: Harn, 300, Discussant: Holt, 450)
    The increasing availability of high-throughput sequencing technologies has led to thousands of human genomes having been sequenced in the past years. Efforts such as the 1000 Genomes Project further add to the availability of human genome variation data. However, to date, there is no method that can map reads of a newly sequenced human genome to a large collection of genomes. Instead, methods rely on aligning reads to a single reference genome. This leads to inherent biases and lower accuracy. To tackle this problem, a new alignment tool BWBBLE is introduced in this article. We (i) introduce a new compressed representation of a collection of genomes, which explicitly tackles the genomic variation observed at every position, and (ii) design a new alignment algorithm based on the Burrows–Wheeler transform that maps short reads from a newly sequenced genome to an arbitrary collection of two or more (up to millions of) genomes with high accuracy and no inherent bias to one specific genome.
  19. Mackay, Trudy FC. "Epistasis and quantitative traits: using model organisms to study gene-gene interactions." Nature Reviews Genetics (2013). (PDF)
  20. (650 - Presenter: Fu, 300, Discussant: Harn, 0)
    The role of epistasis in the genetic architecture of quantitative traits is controversial, despite the biological plausibility that nonlinear molecular interactions underpin the genotype–phenotype map. This controversy arises because most genetic variation for quantitative traits is additive. However, additive variance is consistent with pervasive epistasis. In this Review, I discuss experimental designs to detect the contribution of epistasis to quantitative trait phenotypes in model organisms. These studies indicate that epistasis is common, and that additivity can be an emergent property of underlying genetic interaction networks. Epistasis causes hidden quantitative genetic variation in natural populations and could be responsible for the small additive effects, missing heritability and the lack of replication that are typically observed for human complex traits.
  21. Keightley, Peter D., et al. "Estimation of the spontaneous mutation rate per nucleotide site in a Drosophila melanogaster full-sib family." Genetics 196.1 (2014): 313-320. (PDF)
  22. (571 - Presenter: Aggrawal, 200, Discussant: Christy, 201)
    We employed deep genome sequencing of two parents and 12 of their offspring to estimate the mutation rate per site per generation in a full-sib family of Drosophila melanogaster recently sampled from a natural population. Sites that were homozygous for the same allele in the parents and heterozygous in one or more offspring were categorized as candidate mutations and subjected to detailed analysis. In 1.23 × 10^9 callable sites from 12 individuals, we confirmed six single nucleotide mutations. We estimated the false negative rate in the experiment by generating synthetic mutations using the empirical distributions of numbers of nonreference bases at heterozygous sites in the offspring. The proportion of synthetic mutations at callable sites that we failed to detect was <1%, implying that the false negative rate was extremely low. Our estimate of the point mutation rate is 2.8 × 10^-9 (95% confidence interval = 1.0 × 10^-9 – 6.1 × 10^-9) per site per generation, which is at the low end of the range of previous estimates, and suggests an effective population size for the species of ~1.4 × 10^6. At one site, point mutations were present in two individuals, indicating that there had been a premeiotic mutation cluster, although surprisingly one individual had a G/A transition and the other a G/T transversion, possibly associated with error-prone mismatch repair. We also detected three short deletion mutations and no insertions, giving a deletion mutation rate of 1.2 × 10^-9 (95% confidence interval = 0.7 × 10^-9 – 11 × 10^-9).
  23. Sims, David, et al. "Sequencing depth and coverage: key considerations in genomic analyses." Nature Reviews Genetics 15.2 (2014): 121-132. (PDF)
  24. (351 - Presenter: Maurizio, 5, Discussant: Orellana, 7)
    Sequencing technologies have placed a wide range of genomic analyses within the capabilities of many laboratories. However, sequencing costs often set limits to the amount of sequences that can be generated and, consequently, the biological outcomes that can be achieved from an experimental design. In this Review, we discuss the issue of sequencing depth in the design of next-generation sequencing experiments. We review current guidelines and precedents on the issue of coverage, as well as their underlying considerations, for four major study designs, which include de novo genome sequencing, genome resequencing, transcriptome sequencing and genomic location analyses (for example, chromatin immunoprecipitation followed by sequencing (ChIP–seq) and chromosome conformation capture (3C)).
  25. Chen, Taiping and Sharon Dent, "Chromatin modifiers and remodellers: regulators of cellular differentiation", Nature Reviews Genetics 15 (2014), 93–106. (PDF)
  26. (??? - Presenter: Raulerson, ??, Discussant: Aggrawal, ??)
    Cellular differentiation is, by definition, epigenetic. Genome-wide profiling of pluripotent cells and differentiated cells suggests global chromatin remodelling during differentiation, which results in a progressive transition from a fairly open chromatin configuration to a more compact state. Genetic studies in mouse models show major roles for a variety of histone modifiers and chromatin remodellers in key developmental transitions, such as the segregation of embryonic and extra-embryonic lineages in blastocyst stage embryos, the formation of the three germ layers during gastrulation and the differentiation of adult stem cells. Furthermore, rather than merely stabilizing the gene expression changes that are driven by developmental transcription factors, there is emerging evidence that chromatin regulators have multifaceted roles in cell fate decisions.
    (Unclaimed)

  27. Browning, Sharon R., and Brian L. Browning. "Haplotype phasing: existing methods and new developments." Nature Reviews Genetics 12.10 (2011): 703-714. (PDF)
  28. (200)
    Determination of haplotype phase is becoming increasingly important as we enter the era of large-scale sequencing because many of its applications, such as imputing low-frequency variants and characterizing the relationship between genetic variation and disease susceptibility, are particularly relevant to sequence data. Haplotype phase can be generated through laboratory-based experimental methods, or it can be estimated using computational approaches. We assess the haplotype phasing methods that are available, focusing in particular on statistical methods, and we discuss the practical aspects of their application. We also describe recent developments that may transform this field, particularly the use of identity-by-descent for computational biology.
  29. Chen, Shijian, Anqi Wang, and Lei M. Li. "SEME: a fast mapper of illumina sequencing reads with statistical evaluation." Research in Computational Molecular Biology. Springer Berlin Heidelberg, 2013. (PDF)
  30. (10)
    Mapping reads to a reference genome is a routine yet computationally intensive task in research based on high-throughput sequencing. In recent years, the sequencing reads of the Illumina platform get longer and their quality scores get higher. According to our calculation, this allows perfect k-mer seed match for almost all reads when a close reference genome is available subject to reasonable specificity. Our another observation is that the majority reads contain at most one short INDEL polymorphism. Based on these observations, we propose a fast mapping approach, referred to as “SEME”, which has two core steps: first it scans a read sequentially in a specific order for a k-mer exact match seed; next it extends the alignment on both sides allowing at most one short-INDEL each, using a novel method “auto-match function”. We decompose the evaluation of the sensitivity and specificity into two parts corresponding to the seed and extension step, and the composite result provides an approximate overall reliability estimate of each mapping. We compare SEME with some existing mapping methods on several data sets, and SEME shows better performance in terms of both running time and mapping rates.
  31. He, Dan, et al. "Optimal algorithms for haplotype assembly from whole-genome sequence data." Bioinformatics 26.12 (2010): i183-i190. (PDF)
  32. (516)
    Haplotype inference is an important step for many types of analyses of genetic variation in the human genome. Traditional approaches for obtaining haplotypes involve collecting genotype information from a population of individuals and then applying a haplotype inference algorithm. The development of high-throughput sequencing technologies allows for an alternative strategy to obtain haplotypes by combining sequence fragments. The problem of ‘haplotype assembly’ is the problem of assembling the two haplotypes for a chromosome given the collection of such fragments, or reads, and their locations in the haplotypes, which are pre-determined by mapping the reads to a reference genome. Errors in reads significantly increase the difficulty of the problem and it has been shown that the problem is NP-hard even for reads of length 2. Existing greedy and stochastic algorithms are not guaranteed to find the optimal solutions for the haplotype assembly problem.

    In this article, we proposed a dynamic programming algorithm that is able to assemble the haplotypes optimally with time complexity O(m × 2^k × n), where m is the number of reads, k is the length of the longest read and n is the total number of SNPs in the haplotypes. We also reduce the haplotype assembly problem into the maximum satisfiability problem that can often be solved optimally even when k is large. Taking advantage of the efficiency of our algorithm, we perform simulation experiments demonstrating that the assembly of haplotypes using reads of length typical of the current sequencing technologies is not practical. However, we demonstrate that the combination of this approach and the traditional haplotype phasing approaches allow us to practically construct haplotypes containing both common and rare variants.
  33. Jiang, Lichun, et al. "Synthetic spike-in standards for RNA-seq experiments." Genome Research 21.9 (2011): 1543-1551. (PDF)
  34. (0)
    High-throughput sequencing of cDNA (RNA-seq) is a widely deployed transcriptome profiling and annotation technique, but questions about the performance of different protocols and platforms remain. We used a newly developed pool of 96 synthetic RNAs with various lengths, and GC content covering a 2^20 concentration range as spike-in controls to measure sensitivity, accuracy, and biases in RNA-seq experiments as well as to derive standard curves for quantifying the abundance of transcripts. We observed linearity between read density and RNA input over the entire detection range and excellent agreement between replicates, but we observed significantly larger imprecision than expected under pure Poisson sampling errors. We use the control RNAs to directly measure reproducible protocol-dependent biases due to GC content and transcript length as well as stereotypic heterogeneity in coverage across transcripts correlated with position relative to RNA termini and priming sequence bias. These effects lead to biased quantification for short transcripts and individual exons, which is a serious problem for measurements of isoform abundances, but that can partially be corrected using appropriate models of bias. By using the control RNAs, we derive limits for the discovery and detection of rare transcripts in RNA-seq experiments. By using data collected as part of the model organism and human Encyclopedia of DNA Elements projects (ENCODE and modENCODE), we demonstrate that external RNA controls are a useful resource for evaluating sensitivity and accuracy of RNA-seq experiments for transcriptome discovery and quantification. These quality metrics facilitate comparable analysis across different samples, protocols, and platforms.
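
The standard-curve idea in the Jiang et al. abstract above can be sketched in a few lines of Python. This is only an illustration and not the authors' method: it assumes an ordinary least-squares fit in log2 space, and the function names and spike-in values are hypothetical.

```python
import math


def fit_standard_curve(concentrations, read_densities):
    """Least-squares fit of log2(read density) = slope * log2(concentration) + intercept."""
    xs = [math.log2(c) for c in concentrations]
    ys = [math.log2(r) for r in read_densities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_concentration(read_density, slope, intercept):
    """Invert the fitted curve to estimate input concentration from a read density."""
    return 2 ** ((math.log2(read_density) - intercept) / slope)


# Hypothetical spike-in controls: concentrations from 2^0 to 2^20 (arbitrary units),
# with read densities made exactly proportional (factor 3) for illustration only.
spike_conc = [2.0 ** e for e in range(0, 21, 4)]
spike_reads = [3.0 * c for c in spike_conc]

slope, intercept = fit_standard_curve(spike_conc, spike_reads)

# Estimate the concentration of a transcript observed at read density 96.
estimated = estimate_concentration(96.0, slope, intercept)
print(slope, estimated)
```

Fitting in log space keeps the lowest-concentration controls from being swamped by the highest ones when the range spans many orders of magnitude. Real spike-in data would be noisy, so the fitted slope would only approximate 1.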


© 2010 Leonard McMillan, Alex Jackson and UNC Computational Genetics