HTreeQA: Using Semi-perfect Phylogeny Trees in Quantitative Trait Loci Study on Genotype Data

<h1 style="text-align: center;">HTreeQA: Using Semi-perfect Phylogeny Trees in Quantitative
  Trait Loci Study on Genotype Data 
</h1>

<h2 style="text-align: center;">Zhaojun Zhang<SUP>1</SUP>, Xiang Zhang<SUP>2</SUP>, and Wei Wang<SUP>1</SUP></h2>

<h3 style="text-align: center;">   <SUP>1</SUP>Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599 </h3> 
<h3 style="text-align: center;"> <SUP>2</SUP>Department of Electrical
  Engineering and Computer Science, Case Western Reserve
  University </h3>

<b> Summary: </b> <p> 

With the advances in high-throughput genotyping technology,
quantitative trait loci (QTL) study has emerged as a promising tool to
understand the genetic basis of complex traits. Methodology
development for QTL study has recently attracted significant research
attention. Local phylogeny-based methods have been demonstrated to be
powerful tools for uncovering significant associations between
phenotypes and SNP markers. However, most existing methods are
designed for homozygous genotypes and a separate haplotype
reconstruction step is needed to resolve heterozygous genotypes. This
approach has limited power to detect non-additive genetic effects and
imposes an extensive computational burden. In this paper, we propose a
new method, HTreeQA, that uses a tri-state semi-perfect phylogeny
tree to approximate the perfect phylogeny used by existing
methods. The semi-perfect phylogeny trees are used as high-level
markers for association study. HTreeQA uses the genotype
data as direct input without any phasing step. HTreeQA can handle
complex local population structures. It is suitable for QTL mapping on
any mouse population, including the incipient Collaborative
Cross (CC) lines. Significant QTLs are found for two phenotypes of the
PreCC lines, white head spot and running distance at day 5/6. These
findings are consistent with known genes and QTLs discovered in
independent studies. Simulation studies under three different genetic
models show that HTreeQA can detect a wider range of genetic effects
and is more efficient than existing phylogeny-based approaches. We
also provide rigorous theoretical analysis to show that HTreeQA has
a lower error rate than alternative methods.
<p>

<b> Download HTreeQA (Linux executable) together with its test dataset <a href="htreeqa.1.0.zip">here</a>. </b> <p>

<p>

<h3> <b> How to run htreeqa? </b> </h3> <p>

<b> Please unzip the downloaded zip file first. </b> <p>

<b> Test the downloaded zip file: </b><p>

Try it by running the following command in the terminal:  <b> ./htreeqa --config htreeqa_config.cfg </b>.
 <p>
The program prints some debug information, which you can safely ignore. If
no error is reported, HTreeQA was downloaded and unpacked
successfully. You can also override parameters by appending them
to the command line; see the Configuration section for details.
 <p>
<b> Dataset (test.csv and phenotype.txt) : </b><p>

HTreeQA requires two input files: a genotype file and a phenotype file. In the phenotype file, each line contains one individual's name and phenotype value, separated by a space. The genotype file is in CSV format. Its first line is a header that begins with meta information (such as SNP id, position, etc.) followed by the names of the strains/individuals. The genotype columns must be consecutive and must run from some starting column to the end of the line. Please take a look at test.csv as an example. Your CSV file does not need to have exactly the same meta information as the example file: in the configuration file, you can set how many leading columns HTreeQA should ignore (see the Configuration section). <p>
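The following Python sketch is not part of HTreeQA; it only illustrates the two file layouts described above and can serve as a quick sanity check before running the tool. The function names, the comma delimiter, the numeric phenotype values, and the small inline dataset are assumptions made for this example and are not taken from test.csv or phenotype.txt. <p>

```python
import csv
import io


def read_phenotypes(handle):
    """Parse whitespace-separated 'name value' lines into a dict."""
    phenotypes = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        name, value = fields
        phenotypes[name] = float(value)  # assumption: phenotypes are numeric
    return phenotypes


def read_genotypes(handle, ignore_column, delimiter=","):
    """Split a genotype CSV into individual names and per-SNP rows.

    Columns before index `ignore_column` are treated as meta information;
    every column from that index to the end holds one individual's genotype.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    names = header[ignore_column:]
    rows = [(row[:ignore_column], row[ignore_column:]) for row in reader if row]
    return names, rows


# Made-up data for illustration only (three meta columns, three individuals).
genotype_text = "snp_id,chr,position,A,B,C\nrs1,1,100,0,1,H\nrs2,1,250,1,1,0\n"
phenotype_text = "A 1.5\nB 2.0\nC 0.7\n"

names, rows = read_genotypes(io.StringIO(genotype_text), ignore_column=3)
phenotypes = read_phenotypes(io.StringIO(phenotype_text))

# Individuals that appear in the genotype header but have no phenotype value.
missing = [name for name in names if name not in phenotypes]
print(names, len(rows), missing)
```

Checking that every individual in the genotype header also appears in the phenotype file is a convenient way to catch naming mismatches between the two inputs. <p>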

<b> Configuration: </b> <p>

HTreeQA accepts configuration in two ways: as command-line parameters and through a configuration file (e.g. htreeqa_config.cfg). If a parameter is supplied both ways, the value given on the command line overrides the value in the configuration file. On the command line, a parameter and its value are specified as "--parameter_name value". In the configuration file, each line specifies one parameter as "parameter_name=value". <p>
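The precedence rule above can be sketched in a few lines of Python. This is an illustration of the rule only, not HTreeQA's actual parser: the function names are hypothetical, the command line is assumed to consist strictly of "--parameter_name value" pairs, and the values 1000 and 100 are made up. <p>

```python
def parse_config_text(text):
    """Parse 'parameter_name=value' lines into a dict."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and lines without an assignment
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def parse_command_line(args):
    """Parse ['--parameter_name', 'value', ...] pairs into a dict."""
    params = {}
    for flag, value in zip(args[0::2], args[1::2]):
        if flag.startswith("--"):
            params[flag[2:]] = value
    return params


def effective_parameters(config_text, args):
    """Merge both sources; command-line values override file values."""
    merged = parse_config_text(config_text)
    merged.update(parse_command_line(args))
    return merged


# Hypothetical values, for illustration only.
config_text = "genotype_file=test.csv\nnum_permutation=1000\n"
args = ["--num_permutation", "100"]

params = effective_parameters(config_text, args)
print(params)
```

Here num_permutation ends up with the command-line value, while genotype_file keeps the value from the configuration file. <p>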

Here, we list all parameters supported by HTreeQA and their meanings. 

<ul> 
<li> <b> genotype_file </b>: Genotype file name. See the Dataset section for the format. </li>
<li> <b> phenotype_file </b>: Phenotype file name. See the Dataset section for the format. </li>
<li> <b> tree_output </b>: Phylogeny tree output file. Each line contains three columns: start SNP row number, end SNP row number, and the corresponding phylogeny tree for the region. </li>
<li> <b> output </b>: P-value output file. It has the same number of lines as tree_output, and each line corresponds to the same line in tree_output. Each line contains two columns: the first is the value of -log(p-value), and the second is the corresponding tree partition. </li>
<li> <b> hvalue</b>: The character that represents a heterozygous site in the data. </li>
<li> <b> mvalue</b>: The character that represents a missing value in the data. </li>
<li> <b> position_column</b>: The index of the column that holds the SNP position (indices start at 0). </li>
<li> <b> ignore_column</b>: The index of the first column of genotype data. </li>
<li> <b> delimiter</b>: The character used as the delimiter in the genotype file. </li>
<li> <b> tree_file</b>: Whether to use an existing tri-state semi-perfect phylogeny tree (0 for no, 1 for yes). </li>
<li> <b> num_permutation</b>: The number of permutations used to test each tree. </li>
<li> <b> need_permutation</b>: Whether to use the permutation test (1 for yes, 0 for no). </li>

</ul>

Please leave all other parameters in the configuration file untouched; they are experimental and intended for internal use only.

 <p>

<b> Result: </b> <p>
 

The result file names can be specified in the configuration. In the example command above, they are sample_pvalue and sample_tree.output. These two files have the same number of lines, and each line in sample_pvalue corresponds to the same line in sample_tree.output. Each line of sample_pvalue contains one number, -log(p-value); the higher the value, the more significant the association. sample_tree.output contains three columns: the first two are the begin and end SNPs of the phylogeny tree, indexed by row number in your genotype file, and the third is the phylogeny tree that HTreeQA uses for the calculation.<p>
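As a rough illustration of how the two result files line up, the Python sketch below pairs them line by line and ranks regions by score. It is not part of HTreeQA. It assumes whitespace-separated columns, reads only the first field of each p-value line, and uses made-up scores and placeholder tree strings (tree_a, tree_b, tree_c) rather than real output. <p>

```python
import io


def top_regions(pvalue_handle, tree_handle, k=3):
    """Pair line i of the p-value file with line i of the tree file.

    Returns up to k (score, start_snp, end_snp, tree) tuples, sorted so
    that the largest -log(p-value) comes first.
    """
    records = []
    for p_line, t_line in zip(pvalue_handle, tree_handle):
        if not p_line.strip():
            continue  # skip empty lines while keeping the files aligned
        score = float(p_line.split()[0])
        # Split off the two SNP row numbers; the rest of the line is the tree.
        start, end, tree = t_line.rstrip("\n").split(None, 2)
        records.append((score, int(start), int(end), tree))
    records.sort(key=lambda record: record[0], reverse=True)
    return records[:k]


# Made-up contents standing in for sample_pvalue and sample_tree.output.
pvalue_text = "0.8\n5.2\n2.1\n"
tree_text = "0 4 tree_a\n5 9 tree_b\n10 14 tree_c\n"

best = top_regions(io.StringIO(pvalue_text), io.StringIO(tree_text), k=2)
for score, start, end, tree in best:
    print(score, start, end, tree)
```

With real output, the two in-memory strings would be replaced by the opened result files, for example open("sample_pvalue") and open("sample_tree.output"). <p>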