UNC Systems Genetics
Collaborative Cross Genomes
Here we provide genomic sequences for the Collaborative Cross (CC) mouse strains and the eight CC founder strains in the form of FASTA files for the 19 autosomes, sex chromosomes (X and Y), and mitochondria (M). These sequences can be used as reference sequences for high-throughput short-read alignments, or for any other comparative genomic analyses.
Each genome comes with a companion MOD file, which can be used to remap coordinates from the FASTA sequences back to reference coordinates. This is necessary since, in general, all gene and genomic annotations are specified relative to the reference. MOD files are genome and version specific, and therefore should always be downloaded together as a set with their associated FASTA sequence.
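To illustrate why a coordinate map is needed, here is a minimal sketch of how SNPs, insertions, and deletions shift positions between a reference and a pseudogenome. The tuple encoding of variants and the function name `apply_variants` are assumptions made for this example; they are not the MOD file format or the modtools API.

```python
def apply_variants(ref, variants):
    """Build a toy pseudogenome and a pseudogenome-to-reference position map.

    variants: list of (kind, ref_pos, bases) with 0-based positions, where
      "s" replaces the reference base at ref_pos with `bases`,
      "d" deletes the reference base at ref_pos,
      "i" inserts `bases` immediately after ref_pos.
    (Hypothetical encoding for illustration only.)
    """
    snps = {pos: bases for kind, pos, bases in variants if kind == "s"}
    dels = {pos for kind, pos, _ in variants if kind == "d"}
    ins = {pos: bases for kind, pos, bases in variants if kind == "i"}

    seq, pos_map = [], []
    for pos, base in enumerate(ref):
        if pos not in dels:
            seq.append(snps.get(pos, base))
            pos_map.append(pos)
        for inserted in ins.get(pos, ""):
            seq.append(inserted)
            pos_map.append(None)  # inserted bases have no reference position
    return "".join(seq), pos_map


reference = "ACGTACGT"
toy_variants = [("s", 1, "T"), ("d", 3, ""), ("i", 5, "GG")]
pseudo, pos_map = apply_variants(reference, toy_variants)
print(pseudo)   # ATGACGGGT
print(pos_map)  # [0, 1, 2, 4, 5, None, None, 6, 7]
```

After the deletion and the insertion, pseudogenome position 7 corresponds to reference position 6, which is why an annotation given in reference coordinates cannot be applied to the pseudogenome directly.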
We supply two types of genomes: sequenced and imputed. Sequenced genomes result from direct DNA sequencing at a minimum of 30x coverage and an iterative alignment process. Imputed genomes are derived from genotype data: we first construct a haplotype mosaic using MegaMUGA genotypes, and then assemble an imputed genome using segments of DNA sequence from the inferred founders.
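The assembly step for imputed genomes can be pictured with a small sketch. It assumes, as a simplification, that all founder sequences share one coordinate system and that the haplotype mosaic is a list of contiguous half-open intervals, each labeled with a founder. The names `assemble_imputed`, `founder_1`, and `founder_2` are hypothetical, and the real pipeline must also handle indels between founders.

```python
def assemble_imputed(founders, mosaic):
    """Concatenate founder segments according to a haplotype mosaic.

    founders: dict mapping a founder label to its sequence (equal lengths assumed).
    mosaic: list of (start, end, founder_label), half-open and contiguous.
    """
    pieces = []
    expected_start = 0
    for start, end, label in mosaic:
        if start != expected_start or end <= start:
            raise ValueError("mosaic segments must be contiguous and non-empty")
        pieces.append(founders[label][start:end])
        expected_start = end
    return "".join(pieces)


toy_founders = {
    "founder_1": "AAAAAAAA",
    "founder_2": "CCCCCCCC",
}
toy_mosaic = [(0, 3, "founder_1"), (3, 8, "founder_2")]
imputed = assemble_imputed(toy_founders, toy_mosaic)
print(imputed)  # AAACCCCC
```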
Variants and Standard Reference Sequences
The variants included in the MOD files were extracted from the latest Variant Call Format (VCF) files from Sanger's recent sequencing efforts of common and important Mus musculus strains.
The standard Mus musculus reference sequences can be downloaded as follows.
The sequences and MOD files in the table below are from the eight founder strains of the CC and DO genetic reference panels.
We provide a suite of tools that simplify the incorporation of our pseudogenomes into standard analysis and HiSeq pipelines.
All of the latest tools are available on PyPI. It is highly recommended to use the following commands for installation.
Modtools is used to generate standard reference genome and pseudogenome sequences.
The code is hosted at https://pypi.python.org/pypi/modtools.
For the usage of vcf2mod, please refer to the example at http://csbio.unc.edu/~sphuang/vcf2mod/.
Lapels is used to remap pseudogenome alignments, in the form of a BAM file, back to the reference sequence. This entails the removal of all indels (via CIGAR string modifications; the underlying sequence is unaltered) and adjustments to the starting positions of the fragment and its mate. Lapels also annotates the number and types (SNPs, insertions, and deletions) of sequence variants seen in each read.
The input consists of the BAM file of the pseudogenome alignment and the MOD file associated with the FASTA sequences used in the alignment. (Please download the MOD and FASTA files together as a set.)
The output is a BAM file with corrected read positions, CIGAR strings, and annotated tags. It has been tested to be compatible with downstream tools, such as IGV (using the reference genome) and Cufflinks (using any reference-based transcript library).
The code is hosted at https://pypi.python.org/pypi/lapels.
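As a rough illustration of this kind of remapping, the sketch below takes an ungapped alignment in pseudogenome coordinates and expresses it in reference coordinates, absorbing the coordinate shifts into a CIGAR string. It is not Lapels's actual algorithm. The position map (a list giving the reference position of each pseudogenome base, or `None` for inserted bases) and the function name `remap_read` are assumptions for this example.

```python
from itertools import groupby


def remap_read(pos_map, start, length):
    """Remap an ungapped pseudogenome alignment to reference coordinates.

    Returns (ref_start, cigar). Bases inserted in the pseudogenome become "I",
    reference bases skipped by the read become "D", all others become "M".
    ref_start is None if the read covers only inserted bases.
    """
    ops = []
    ref_start = None
    prev_ref = None
    for ref_pos in pos_map[start:start + length]:
        if ref_pos is None:
            ops.append("I")
            continue
        if ref_start is None:
            ref_start = ref_pos
        elif ref_pos > prev_ref + 1:
            # Reference bases missing from the pseudogenome: a deletion.
            ops.extend("D" * (ref_pos - prev_ref - 1))
        ops.append("M")
        prev_ref = ref_pos
    # Run-length encode the per-base operations into a CIGAR string.
    cigar = "".join(f"{len(list(group))}{op}" for op, group in groupby(ops))
    return ref_start, cigar


# Toy map: reference base 3 is deleted, two bases are inserted after base 5.
toy_pos_map = [0, 1, 2, 4, 5, None, None, 6, 7]
ref_start, cigar = remap_read(toy_pos_map, start=1, length=7)
print(ref_start, cigar)  # 1 2M1D2M2I1M
```

The read sequence itself is untouched; only the start position and the CIGAR string change, which matches the behavior described above at a very coarse level.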
Suspenders merges the results of multiple alignments (BAM files) applied to the same set of reads. It is used when working with F1 and RIX crosses, where we suggest performing separate alignments to each parental genome. Suspenders then effectively merges and annotates these separate BAM files into a single consensus BAM file.
When a read maps to the same genomic location in both alignments, it is output only once. Where there are differences in either mapping positions or multiplicity of reads, Suspenders determines the most likely alignment and source genome for the read, which is sent to the output BAM file. When there is no significant difference between the alignments, all of the multiple mappings are output.
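The three cases above can be sketched as a small decision function. This is a simplified illustration, not the Suspenders implementation: it assumes one alignment per parent, already remapped to reference coordinates, and it uses mismatch count with a hypothetical `min_gap` threshold as a stand-in for whatever criteria Suspenders actually applies.

```python
def merge_alignments(maternal, paternal, min_gap=1):
    """Choose which alignment(s) of one read to emit.

    maternal, paternal: (chrom, pos, mismatches) tuples, or None if unaligned.
    Returns a list of (source, alignment) pairs.
    """
    if maternal is None and paternal is None:
        return []
    if maternal is None:
        return [("paternal", paternal)]
    if paternal is None:
        return [("maternal", maternal)]

    # Same location in both alignments: emit a single record.
    if maternal[:2] == paternal[:2]:
        best = min(maternal, paternal, key=lambda aln: aln[2])
        return [("both", best)]

    # Different locations: prefer the clearly better alignment.
    gap = maternal[2] - paternal[2]
    if gap <= -min_gap:
        return [("maternal", maternal)]
    if gap >= min_gap:
        return [("paternal", paternal)]

    # No meaningful difference: keep every mapping.
    return [("maternal", maternal), ("paternal", paternal)]


same_site = merge_alignments(("1", 100, 0), ("1", 100, 2))
clear_winner = merge_alignments(("1", 100, 0), ("2", 500, 3))
ambiguous = merge_alignments(("1", 100, 1), ("2", 500, 1))
print(same_site)     # [('both', ('1', 100, 0))]
print(clear_winner)  # [('maternal', ('1', 100, 0))]
print(ambiguous)     # [('maternal', ('1', 100, 1)), ('paternal', ('2', 500, 1))]
```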
The code for Suspenders is available at https://pypi.python.org/pypi/suspenders/.
For general use cases, please refer to the Suspenders wiki page or contact James Holt (email@example.com).
S. Huang, C.-Y. Kao, L. McMillan, and W. Wang. Transforming genomes using mod files with applications. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM, 2013.
J. Holt, S. Huang, L. McMillan, and W. Wang. Read annotation pipeline for high-throughput sequencing data. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM, 2013.
Using Pseudogenomes with the Collaborative Cross