HTreeQA: Using Semi-perfect Phylogeny Trees in Quantitative Trait Loci Study on Genotype Data

Zhaojun Zhang1, Xiang Zhang2, and Wei Wang1

1Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599

2Department of Electrical Engineering and Computer Science, Case Western Reserve University.

Summary:

With the advances in high-throughput genotyping technology, quantitative trait loci (QTL) study has emerged as a promising tool to understand the genetic basis of complex traits. Methodology development for QTL study has recently attracted significant research attention. Local phylogeny-based methods have been demonstrated to be powerful tools for uncovering significant associations between phenotypes and SNP markers. However, most existing methods are designed for homozygous genotypes, and a separate haplotype reconstruction step is needed to resolve heterozygous genotypes. This approach has limited power to detect non-additive genetic effects and imposes an extensive computational burden. In this paper, we propose a new method, HTreeQA, that uses a tri-state semi-perfect phylogeny tree to approximate the perfect phylogeny used by existing methods. The semi-perfect phylogeny trees are used as high-level markers for association study. HTreeQA uses the genotype data as direct input without any phasing step. HTreeQA can handle complex local population structures. It is suitable for QTL mapping on any mouse population, including the incipient Collaborative Cross (CC) lines. Significant QTLs are found for two phenotypes of the PreCC lines, white head spot and running distance at day 5/6. These findings are consistent with known genes and QTLs discovered in independent studies. Simulation studies under three different genetic models show that HTreeQA can detect a wider range of genetic effects and is more efficient than existing phylogeny-based approaches. We also provide rigorous theoretical analysis to show that HTreeQA has a lower error rate than alternative methods.

Please download HTreeQA (Linux executable) together with its test dataset here.

How to run htreeqa?

Please unzip the downloaded zip file first.

Testing the downloaded zip file:

Try it by running the following command in the terminal: ./htreeqa --config htreeqa_config.cfg .

The program prints some debug information, which you can safely ignore. If no error is reported, HTreeQA has been downloaded successfully. You can also override parameters by adding them to the command line; please see the Configuration section.

Dataset (test.csv and phenotype.txt):

HTreeQA requires two types of input files: a genotype file and a phenotype file. In the phenotype file, each line contains one individual's name and phenotype value, separated by a space. The genotype file is in CSV format. Its first line is a header that begins with meta information such as SNP id, position, etc., followed by the names of the strains/individuals. The genotype columns must be consecutive and must run from some column through to the last column. Please take a look at test.csv as an example. Your CSV file does not need to have exactly the same meta information as the example file: in the configuration file, you can set how many leading columns HTreeQA ignores.
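As a rough illustration of the two input layouts described above, the following Python sketch reads a phenotype listing and a genotype CSV and checks that every individual in the genotype header has a phenotype value. It is not part of the tool: the function names, the sample data, the 0/1/2 genotype codes, and the choice of three meta columns are all assumptions made for this example, so please check test.csv for the real layout.

```python
import csv
import io


def read_phenotypes(lines):
    """Parse 'name value' lines into a dict mapping name -> float."""
    phenotypes = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        phenotypes[parts[0]] = float(parts[1])
    return phenotypes


def read_genotypes(handle, n_meta_columns):
    """Return (individual names, genotype rows) from a CSV file handle.

    n_meta_columns is the number of leading meta columns to ignore
    (hypothetical name; it mirrors the setting in the configuration file).
    """
    reader = csv.reader(handle)
    header = next(reader)
    names = header[n_meta_columns:]
    rows = [row[n_meta_columns:] for row in reader if row]
    return names, rows


# Made-up example data, NOT taken from test.csv or phenotype.txt.
genotype_text = "snp_id,chr,pos,A,B,C\nrs1,1,100,0,1,2\nrs2,1,200,2,2,0\n"
phenotype_text = "A 1.5\nB 2.0\nC 0.25\n"

names, rows = read_genotypes(io.StringIO(genotype_text), n_meta_columns=3)
phenotypes = read_phenotypes(phenotype_text.splitlines())

# Individuals present in the genotype header but absent from the phenotypes.
missing = [name for name in names if name not in phenotypes]
print(names, len(rows), missing)
```

A check like this can catch mismatched individual names before a run; whether the tool itself tolerates such mismatches is not stated here.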

Configuration:

In HTreeQA, configuration can be supplied in two ways: on the command line and in a configuration file (e.g. htreeqa_config.cfg). If there is a conflict (a parameter is supplied in both ways), the value from the command line overrides the value in the configuration file. On the command line, a parameter and its value are specified as "--parameter_name value". In the configuration file, each line specifies one parameter and its value as "parameter_name=value".
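The precedence rule above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the tool's own parser; the function names and the parameter names param_a and param_b are placeholders invented for the example.

```python
def parse_config_lines(lines):
    """Parse 'parameter_name=value' lines into a dict."""
    settings = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and lines without a value
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


def parse_command_line(args):
    """Parse ['--parameter_name', 'value', ...] into a dict."""
    settings = {}
    i = 0
    while i < len(args) - 1:
        if args[i].startswith("--"):
            settings[args[i][2:]] = args[i + 1]
            i += 2
        else:
            i += 1
    return settings


def merge_settings(file_settings, cli_settings):
    """Command-line values override configuration-file values."""
    merged = dict(file_settings)
    merged.update(cli_settings)
    return merged


# Placeholder parameter names, not real HTreeQA parameters.
file_settings = parse_config_lines(["param_a=1", "param_b=from_file"])
cli_settings = parse_command_line(["--param_b", "from_cli"])
merged = merge_settings(file_settings, cli_settings)
print(merged)
```

Here param_b appears in both places, so the merged result keeps the command-line value, while param_a keeps its value from the file.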

Here, we list all parameters supported by HTreeQA and their meanings.

Please leave the other parameters in the configuration file untouched; they are experimental and for internal use only.

Result:

The result file names can be specified; in the first example command, they are sample_pvalue and sample_tree.output. These two files have the same number of lines, and each line in sample_pvalue corresponds to the same line in sample_tree.output. Each line of sample_pvalue contains a single number, -log(P-value); the higher the value, the more significant the association. Each line of sample_tree.output contains three columns: the first two are the begin and end SNPs of the phylogeny tree, indexed by row number in your genotype file, and the third is the phylogeny tree that HTreeQA uses for the calculation.
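Because the two result files are aligned line by line, they can be joined and ranked with a short script. The Python sketch below is an illustration under stated assumptions: the columns of the tree file are taken to be whitespace-separated with integer SNP indices, the tree strings tree_a, tree_b, and tree_c are placeholders rather than the real tree format, and the function name is hypothetical.

```python
def pair_results(pvalue_lines, tree_lines):
    """Join per-line scores with tree records, most significant first."""
    scores = [float(line) for line in pvalue_lines if line.strip()]

    trees = []
    for line in tree_lines:
        if not line.strip():
            continue
        # Assumed layout: begin SNP, end SNP, then the tree description.
        begin, end, tree = line.split(None, 2)
        trees.append((int(begin), int(end), tree.strip()))

    if len(scores) != len(trees):
        raise ValueError("result files must have the same number of lines")

    records = [
        (score, begin, end, tree)
        for score, (begin, end, tree) in zip(scores, trees)
    ]
    # A higher score means a more significant association.
    records.sort(key=lambda record: record[0], reverse=True)
    return records


# Made-up values standing in for sample_pvalue and sample_tree.output.
pvalue_lines = ["1.3", "4.2", "0.7"]
tree_lines = ["0 4 tree_a", "10 14 tree_b", "20 24 tree_c"]

records = pair_results(pvalue_lines, tree_lines)
for score, begin, end, tree in records:
    print(score, begin, end, tree)
```

With real output, the lists would be read from the two result files, for example with open("sample_pvalue").read().splitlines().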