Why is heritability squared

The rice data used in the study comprise recombinant inbred lines (RILs). For each line, four traits [yield (YD), grain weight (GW), tiller number (TN), and grain number (GN)] were measured in 4 replicates across different years and locations. High-density markers were used to infer recombination breakpoints [25], which made it possible to construct bins; these bins are treated as new synthetic markers.

The kinship matrix K and the identity matrix for the residual e are specified in the general linear covariance structure for estimating the variance components. The residual variance is fixed at 0. We first calculated the average of the 4 values of each trait for each RIL. The results are presented in Table 1. We then applied the proposed algorithm to estimate the predictability and heritability of each trait from genetic values predicted through cross validation, repeated 10 times with different data partitions; these results are also presented in Table 1.

The cross validation was repeated 10 times, and the numbers in parentheses are the standard deviations of the averages. The results show that the heritability estimated in the non-cross-validation setting is unrealistically large compared with the heritability estimated in the cross-validation setting. The reason is that, in the non-cross-validation setting, a large number of neutral loci enter the regression analysis; they overfit the data, which inflates the estimated genetic variance and hence the heritability.

In the cross-validation setting, by contrast, the genetic and residual variances are calculated from genetic values predicted through cross validation, which provides a degree of control over overfitting in the training process. Note that, in the cross-validation setting, the trait heritability calculated from the predicted genetic values is close to the trait predictability.

To demonstrate that genetic variances are overestimated in the non-cross-validation setting (as indicated in Table 1), we carried out the following simulation study. We adopted the genotypes of the loci for each of the RILs so that the natural genetic relationships among the RILs are preserved.

For each locus, we simulated a genetic effect sampled independently from a normal distribution. We considered a single trait, for which the phenotypic value of each RIL was calculated by multiplying the genotypes by the genetic effects and adding a random error, also sampled independently from a normal distribution. The results are summarized in the figure. By contrast, as more and more neutral genes were added to the data, the heritability calculated in the non-cross-validation setting increased continuously (red curve).

This result supports our hypothesis that, without the control provided by cross validation in GS analysis, including irrelevant loci overfits the data, which leads to overestimation of the genetic variance and, eventually, of the heritability. This is because many relevant but only weakly associated loci had not yet been included, yielding an incomplete model. Once most relevant loci were included in the GS analysis, Hcv and Pcv tended to be quite stable, suggesting that cross validation provides the desired control over overfitting caused by the inclusion of neutral loci.

In general, the heritability calculated in the non-cross-validation setting was much higher than that calculated in the cross-validation setting, which supports the overfitting explanation given above. Moreover, the heritability calculated in the cross-validation setting is very similar to the predictability calculated in the same setting (the green and blue curves are close to each other in the figure).
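The gap between the non-cross-validated and cross-validated estimates can be reproduced in a toy simulation. The sketch below is not the authors' pipeline. It assumes ridge regression on marker genotypes as a stand-in for the GS model, random -1/+1 genotypes instead of the real RIL genotypes, and arbitrary values for the number of lines, the number of loci, and the shrinkage parameter `lam`. Fit is measured as the squared correlation between phenotypes and fitted (or cross-validation-predicted) values; all function names are hypothetical.

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_test, lam):
    """Ridge regression in its dual (kernel) form, cheap when loci >> lines."""
    mu = y_train.mean()
    K = X_train @ X_train.T
    alpha = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train - mu)
    return mu + X_test @ (X_train.T @ alpha)

def squared_corr(a, b):
    return float(np.corrcoef(a, b)[0, 1] ** 2)

def fit_vs_cv(X, y, lam=1.0, folds=5, seed=0):
    """Return (in-sample squared correlation, cross-validated squared correlation)."""
    n = len(y)
    in_sample = squared_corr(ridge_fit_predict(X, y, X, lam), y)
    order = np.random.default_rng(seed).permutation(n)
    pred = np.empty(n)
    for test_idx in np.array_split(order, folds):
        train_idx = np.setdiff1d(order, test_idx)
        pred[test_idx] = ridge_fit_predict(X[train_idx], y[train_idx], X[test_idx], lam)
    return in_sample, squared_corr(pred, y)

# Assumed toy design: 200 lines, 20 causal loci, 2000 neutral loci.
rng = np.random.default_rng(1)
n_lines, n_causal, n_neutral = 200, 20, 2000
X_causal = rng.choice([-1.0, 1.0], size=(n_lines, n_causal))
X_neutral = rng.choice([-1.0, 1.0], size=(n_lines, n_neutral))
g = X_causal @ rng.normal(0.0, 1.0, n_causal)
y = g + rng.normal(0.0, g.std(), n_lines)  # roughly half the variance is genetic

fit_causal, cv_causal = fit_vs_cv(X_causal, y)
fit_all, cv_all = fit_vs_cv(np.hstack([X_causal, X_neutral]), y)
print(f"causal loci only: in-sample {fit_causal:.2f}, cross-validated {cv_causal:.2f}")
print(f"plus neutral loci: in-sample {fit_all:.2f}, cross-validated {cv_all:.2f}")
```

In this setup the in-sample fit climbs toward 1 once the neutral loci are added, while the cross-validated value does not follow it, which is the qualitative pattern described for H versus Hcv and Pcv.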

An alternative approach to calculating Hcv is to use the ratio of the variance of the genetic values predicted via cross validation to the variance of the observed phenotypic values.

Analysis of simulated data with loci being continuously added to the GS analysis. The X axis in each plot represents the percentage of the sorted loci that have been included in the analysis. The Y axis in each plot represents the achieved heritability or predictability with or without cross validation. H: heritability without cross validation; Hcv: heritability with cross validation; Pcv: predictability with cross validation.
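The variance-ratio form of Hcv can be written down in a few lines. The sketch below also computes a predictability value, under the assumption (not stated in this section) that Pcv is the squared correlation between observed phenotypes and cross-validation predictions. The function names and the simulated inputs are hypothetical.

```python
import numpy as np

def hcv_variance_ratio(y_obs, g_pred):
    """Variance of cross-validation-predicted genetic values over phenotypic variance."""
    return float(np.var(g_pred, ddof=1) / np.var(y_obs, ddof=1))

def pcv_squared_correlation(y_obs, g_pred):
    """Assumed definition of predictability: squared correlation of observed and predicted."""
    return float(np.corrcoef(y_obs, g_pred)[0, 1] ** 2)

# Toy input: predictions whose errors are independent of the predictions themselves.
rng = np.random.default_rng(0)
g_pred = rng.normal(0.0, 1.0, 20000)
y_obs = g_pred + rng.normal(0.0, 1.0, 20000)

hcv = hcv_variance_ratio(y_obs, g_pred)
pcv = pcv_squared_correlation(y_obs, g_pred)
print(f"Hcv = {hcv:.3f}, Pcv = {pcv:.3f}")
```

When the residual y - g_pred is uncorrelated with g_pred, the squared correlation reduces algebraically to the same variance ratio, which is one way to see why the two quantities track each other so closely.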

We carried out a further simulation study to demonstrate the deficiency of ANOVA compared with GS analysis with cross validation. We again adopted the genotypes of the loci for each of the RILs. As before, we simulated a genetic effect for each locus from a normal distribution. The phenotypic values of the RILs were calculated in the same manner. Note that this overall heritability is equivalent to the heritability calculated using the GS model without cross validation.

Therefore, the overall heritability only reflects the property of the entire training set; however, it does not indicate how well the genetic model developed from this training set would predict when it is applied to an independent set.

This is the main point that we hope to address in the study. Using GS analysis (either with or without cross validation), we were able to analyze the lines in each of the four replicated experiments separately, or to analyze the average (a single value) of the four replicated measurements.

However, when we analyzed the lines in each of the four replicated experiments separately, the effective sample size was limited. The heritability without cross validation (H), the heritability with cross validation (Hcv), and the predictability with cross validation (Pcv) were calculated for each data analysis. The results from the three approaches are presented in Table 2.

Analysis of simulated data with different levels of heritability. H: heritability; Hcv: heritability calculated through cross validation; Pcv: predictability calculated through cross validation.

In this simulation, we generated 4 replicates for each individual. The results in Table 2 clearly show that (1) the trait predictability is equivalent to the trait heritability with cross validation; (2) the trait heritability calculated in the non-cross-validation setting is consistently higher than that from the cross-validation settings, indicating overfitting due to oversaturated markers in the analysis (as shown in the figure).

In the ANOVA, we no longer analyzed the variance based on the marker genotypes; rather, we analyzed the variance between groups, or RILs, which represent a new independent variable derived from the genotype data. This explains the aforementioned observation (4). In addition, another type of overfitting is possible if more groups than necessary are used in the ANOVA. For example, samples with similar genotypes may be placed into different groups, or RILs.

We calculated the pair-wise correlations of genotypes between the RILs. It shows that the absolute correlation coefficient ranges from 0. Therefore, the heritability calculated using ANOVA with reorganized data is likely to be overestimated.

As in the previous simulation, we simulated a genetic effect for each locus. The genetic effects were sampled independently from a normal distribution. The phenotypic value of each RIL was calculated by multiplying the genotype by the genetic effect and adding a random error, also sampled independently from a normal distribution.

For each of the RILs, we simulated different numbers of simple replicates. The heritability did not change as the sample size increased (Haov in Table 3). We then averaged the replicated measurements for each RIL and analyzed the averaged phenotype using GS analysis with and without cross validation. The heritability calculated without cross validation (H), the heritability calculated with cross validation (Hcv), and the predictability with cross validation (Pcv) are listed in Table 3.

Hcv and Pcv increased as the sample size grew, indicating that using larger samples boosts the statistical power for GS analysis. Moreover, ANOVA required replicated measurements to perform the analysis of variances by comparing the variance between groups and the variances within groups; increasing the number of replicates within groups does not help increase the heritability of the genetic model based on ANOVA.
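To illustrate why adding replicates leaves an ANOVA-based heritability essentially unchanged, the following sketch estimates variance components from a balanced one-way ANOVA with lines as groups. It assumes the textbook estimator (between-line variance taken as the difference of mean squares divided by the number of replicates); the exact formula behind Haov is not given here, so this is an illustration rather than the authors' computation. The function name and simulated values are hypothetical.

```python
import numpy as np

def anova_heritability(y):
    """y has shape (n_lines, n_reps); returns heritability of a single measurement."""
    n_lines, n_reps = y.shape
    line_means = y.mean(axis=1)
    ms_between = n_reps * np.sum((line_means - y.mean()) ** 2) / (n_lines - 1)
    ms_within = np.sum((y - line_means[:, None]) ** 2) / (n_lines * (n_reps - 1))
    var_g = (ms_between - ms_within) / n_reps  # between-line (genetic) variance
    return float(var_g / (var_g + ms_within))

# Assumed toy data: genetic and residual variances both equal to 1.
rng = np.random.default_rng(0)
n_lines = 1000
g = rng.normal(0.0, 1.0, n_lines)
estimates = {}
for n_reps in (2, 4, 8):
    y = g[:, None] + rng.normal(0.0, 1.0, (n_lines, n_reps))
    estimates[n_reps] = anova_heritability(y)
print(estimates)
```

More replicates make the estimate more precise, but its expected value stays at the ratio of genetic to total variance, consistent with the flat Haov reported in Table 3.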

By contrast, GS analyses do not require replicates; a high level of statistical power for detecting genetic effects can be gained by scrutinizing genome-wide high-density markers. Table 3 makes evident the overfitting caused by including neutral loci when cross validation is not applied in GS analysis (H). We have also shown again that the heritability Hcv is equivalent to the predictability Pcv when cross validation is applied to GS analyses.

Analysis of the simulated data with different numbers of replicates. Haov: heritability calculated using ANOVA; H: heritability; Hcv: heritability calculated through cross validation; Pcv: predictability calculated through cross validation.

Zhenyu Jia. Sci Rep. Published online Oct. Received Aug 22; Accepted Oct 4.

I also ran across an old-school paper by Jacquard which presents a similar formula.

Besides the H² vs. For instance, MZ twins may actually share more environmental factors than DZ twins, since being similar makes people treat them similarly. Also, it is assumed that DZ twins share exactly half their genome, but in fact, due to random segregation of alleles, there is variance in the fraction of alleles that siblings actually share (more on this shortly). There are plenty of other study designs as well.
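For reference, the classical twin estimator that leans on exactly these assumptions is Falconer's formula, sketched below. This is the standard textbook decomposition, not necessarily the formula from the Jacquard paper mentioned earlier; the function name and the example correlations are made up for illustration.

```python
def falconer_estimates(r_mz, r_dz):
    """Falconer's decomposition from MZ and DZ twin correlations.

    Assumes MZ and DZ twins share environments equally and that DZ twins
    share exactly half of their additive genetic variation.
    """
    h2 = 2.0 * (r_mz - r_dz)  # additive genetic share
    c2 = 2.0 * r_dz - r_mz    # shared-environment share
    e2 = 1.0 - r_mz           # unique environment plus measurement error
    return h2, c2, e2

h2, c2, e2 = falconer_estimates(r_mz=0.8, r_dz=0.5)  # hypothetical correlations
print(f"h2={h2:.2f}, c2={c2:.2f}, e2={e2:.2f}")
```

If MZ twins really do share more environment than DZ twins, part of the r_mz - r_dz gap is environmental and the h2 value from this formula is inflated.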

Whereas MZ vs. In short, the data are all over the place. This should probably be interpreted as a pretty rough estimate.

Does that bias the heritability estimate? I am still undecided on my answer. However, Visscher presents formulas for controlling for parent inbreeding, implying consanguinity does matter. Leave me a comment if you have the answer. In talking about consanguinity, my concern is with excess IBD.

But you might also ask whether excess identity-by-state (IBS) matters for heritability calculations. So does that mess up the heritability calculations? Again, since it affects both the numerator and denominator (you have extra IBS with your siblings and with random people in the population), I believe the answer should be no.

However, leaving consanguinity behind now, the fact is that different sibling pairs do share different amounts of IBD, and different unrelated individuals do share different amounts of IBS. This variability has enabled a couple of very cool modern approaches to calculating heritability. The first of these is sibling IBD regression. Visscher presents an excellent example (and perhaps the first with any considerable sample size?).

The conventional solution is to treat the binary trait as if it has an underlying continuous liability, as depicted above, and then quantify the heritability of that continuous liability.

In other words, we estimate the genetic contribution to the continuous liability (as shown in the left plot) based on observing the binary outcome of that liability (as shown in the center plot). In some cases we may intentionally select more individuals who have the binary outcome (as shown in the right plot), in which case we have to further adjust the heritability calculations for how that ascertainment has changed the distribution of liability in our sampled individuals.

This adjustment requires making assumptions about the prevalence of the trait in the population, which may or may not be safe in the UK Biobank data depending on the trait. As a result, the estimates of heritability for binary traits should be interpreted carefully, with an expectation that they are at a higher risk of statistical artifacts than heritability estimates for continuous traits.
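As a rough illustration of how the prevalence assumption enters, here is a sketch of the commonly used conversion from an observed-scale (0/1) heritability to the liability scale, including the extra factor for case-control ascertainment. It assumes a standard normal liability with a single threshold; whether this matches the exact adjustment used for any particular set of estimates is an assumption, and the function name and example numbers are hypothetical.

```python
from statistics import NormalDist

def liability_scale_h2(h2_observed, population_prevalence, sample_prevalence=None):
    """Convert observed-scale heritability of a binary trait to the liability scale.

    population_prevalence (K) is an assumed input; sample_prevalence (P) is the
    case fraction in the sample and defaults to K, meaning no ascertainment.
    """
    K = population_prevalence
    P = K if sample_prevalence is None else sample_prevalence
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - K)  # liability cutoff for being a case
    z = std_normal.pdf(threshold)            # normal density at that cutoff
    return h2_observed * (K * (1.0 - K)) ** 2 / (z ** 2 * P * (1.0 - P))

# Hypothetical trait with 1% prevalence and observed-scale h2 of 0.2.
print(liability_scale_h2(0.2, 0.01))       # population sample
print(liability_scale_h2(0.2, 0.01, 0.5))  # sample enriched to 50% cases
```

The result is sensitive to K, so a wrong prevalence assumption shifts the liability-scale estimate directly.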

There will be additional uncertainty in these estimates for binary traits.

As a result, most current work on common genetic effects focuses on this additive component.

The Human Genome Project to provide a map of the human genome, the HapMap Project to identify the structure of genetic variation in the human population, and advances in genotyping and sequencing technology to allow robust and efficient measurement of genetic variants, to name a few.

This is mostly because they are cheaper and easier to observe. Insertions, deletions, duplications, and other genetic modifications are all also areas of active research.

As an aside for those familiar with LD score regression, this is part of the benefit of standardizing analysis to use the common HapMap3 SNPs and pre-computed reference LD scores.

Derivation of this conversion in the genetics context can be found here, but note that it does not perfectly resolve the issue of ascertained binary traits.

Neale lab. Heritability Types of heritability and how we estimate it.


