Chapter 10 Mutations Angela

This chapter describes the analyses of Angela mutations currently done in the angela branch of the detectMutations repository.

10.1 Quality check

The MultiQC results are available for the haplotypes HS1 here and HS2 here. Overall, they all gave a green light for subsequent analyses.

10.2 Genome

We compared the two haplotypes and chose HS1 for its contiguity and length (Fig. 10.1 and Fig. 10.2) as well as its mean coverage and homogeneity (Fig. 10.4).

Figure 10.1: Anchored scaffolds size-distribution for haplotypes HS1 and HS2 in Angela’s genome.

Figure 10.2: BUSCO results for haplotypes HS1 and HS2.

Figure 10.3: Coverage distribution. Distribution of the number of locations in the reference genome with a given depth of coverage.

Figure 10.4: Genome coverage and GC content in 10-kb windows for the HS1 haplotype (scaffolds > 2 Mb).

Figure 10.5: Genome coverage and GC content in 10-kb windows for the HS2 haplotype (scaffolds > 2 Mb).

10.2.1 Annotation

10.2.2 Mutations

10.2.3 Comp

## 
## Call:
## glm(formula = mutations ~ percentage_gene, family = "poisson", 
##     data = all)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.69760    0.02017   34.58   <2e-16 ***
## percentage_gene -2.31318    0.15468  -14.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 8716.4  on 4134  degrees of freedom
## Residual deviance: 8474.1  on 4133  degrees of freedom
## AIC: 15578
## 
## Number of Fisher Scoring iterations: 5

## 
## Call:
## glm(formula = mutations ~ percentage_te, family = "poisson", 
##     data = all)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.82114    0.01764   46.54   <2e-16 ***
## percentage_te -0.37257    0.03857   -9.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 15698  on 4920  degrees of freedom
## Residual deviance: 15602  on 4919  degrees of freedom
## AIC: 24551
## 
## Number of Fisher Scoring iterations: 6
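The two Poisson fits above can be read on the response scale with a little arithmetic. The sketch below is only a reading aid: it copies the slopes and deviances from the glm() summaries and derives the rate ratio per unit increase of the predictor, the share of deviance explained, and the ratio of residual deviance to residual degrees of freedom. Whether percentage_gene and percentage_te are expressed on a 0-1 or 0-100 scale is not stated here, so the "per unit" interpretation is an assumption, and the dictionary and function names are hypothetical.

```python
import math

# Values copied from the two glm() summaries above.
models = {
    "percentage_gene": {
        "slope": -2.31318, "null_dev": 8716.4,
        "resid_dev": 8474.1, "resid_df": 4133,
    },
    "percentage_te": {
        "slope": -0.37257, "null_dev": 15698.0,
        "resid_dev": 15602.0, "resid_df": 4919,
    },
}


def summarise(model):
    """Derive response-scale quantities from a Poisson GLM summary."""
    return {
        # Multiplicative change in expected mutation count per unit of predictor.
        "rate_ratio": math.exp(model["slope"]),
        # Fraction of the null deviance removed by the predictor.
        "deviance_explained": 1 - model["resid_dev"] / model["null_dev"],
        # Values well above 1 are commonly read as a sign of overdispersion.
        "dispersion_ratio": model["resid_dev"] / model["resid_df"],
    }


summaries = {name: summarise(model) for name, model in models.items()}
for name, s in summaries.items():
    print(f"{name}: rate ratio={s['rate_ratio']:.3f}, "
          f"deviance explained={s['deviance_explained']:.1%}, "
          f"dispersion ratio={s['dispersion_ratio']:.2f}")
```

If the dispersion ratios are judged too far above 1, a quasi-Poisson or negative binomial fit may be worth comparing before interpreting the p-values.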

10.3 Heterozygosity

We used a k-mer analysis (21-mers) with jellyfish and GenomeScope to estimate genetic diversity, \(\pi=0.901\%\) (Fig. 10.6). Should I also estimate non-synonymous diversity with ORF annotation using transdecoder and SNPdat (https://phdpages.netlify.app/6b189d4142e8224bff99e28abd11cbfdd50c51b1/symcapture/annotation.html#synonymy)? We detected and filtered SNPs as follows:

  • Raw=5M : raw result from GATK HaplotypeCaller + GenomicsDBImport + GenotypeGVCFs
  • Biallelic=5.3M : biallelic sites with bcftools
  • SNP=4.7M : SNPs with GATK
  • Filtered=3.85M : SNPs kept after excluding sites with QUAL < 30, QD < 2, FS > 60, or SOR > 3 using GATK
  • Non-missing=3.84M : SNPs in all genotypes and individuals with plink
  • Shared=3.84M : SNPs shared by at least 32 out of 33 individuals with bcftools
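
The hard-filtering and sharing steps of the list above can be sketched as plain predicates. This is not the GATK or bcftools invocation used in the repository, only an illustration of the logic, assuming that the thresholds listed for the Filtered step are exclusion criteria. The function names and the example sites are hypothetical.

```python
def passes_hard_filters(site):
    """Return True when a site meets none of the exclusion criteria."""
    excluded = (
        site["QUAL"] < 30
        or site["QD"] < 2
        or site["FS"] > 60
        or site["SOR"] > 3
    )
    return not excluded


def is_shared(n_genotyped, min_individuals=32):
    """Return True when the SNP is genotyped in enough individuals (32 of 33)."""
    return n_genotyped >= min_individuals


# Hypothetical annotation values for three sites.
sites = [
    {"id": "s1", "QUAL": 250.0, "QD": 18.2, "FS": 1.4, "SOR": 0.7, "n_genotyped": 33},
    {"id": "s2", "QUAL": 25.0, "QD": 1.1, "FS": 3.0, "SOR": 0.9, "n_genotyped": 33},
    {"id": "s3", "QUAL": 900.0, "QD": 22.0, "FS": 0.0, "SOR": 1.2, "n_genotyped": 30},
]

kept = [
    s["id"] for s in sites
    if passes_hard_filters(s) and is_shared(s["n_genotyped"])
]
print(kept)
```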

Figure 10.6: GenomeScope Profile. Full results here: http://genomescope.org/analysis.php?code=ayEfgONEP1cFWw2yjhfb.

10.4 Mutations filters

We removed all candidate mutations present in the heterozygous sites and in the raw candidate mutations from the cambium. We kept only mutations with no copy of the mutated allele in the normal sample (NAC, normal_altCount == 0). We then tested four filters independently and looked at their effect on (1) the percentage of mutations kept, (2) the distribution of allelic frequencies, and (3) the overlap between filters:

  1. MAC: a minimum of 5 copies of the mutated allele in the mutated sample (mutation_altCount >= 5)
  2. DP: a read depth for both samples between the 5th quantile and the 95th quantile of the coverage of the corresponding library (normal_DP <= high_cov, normal_DP >= low_cov, mutation_DP <= high_cov, mutation_DP >= low_cov)
  3. BIO: the mutation is present in at least two biological replicates (2 leaves from the crown)
  4. EVS: Strelka 2 automatic filtering based on the empirical variant score (Filter == "PASS")
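
As a minimal sketch of how the four filters above can be evaluated independently, the example below flags each candidate per filter, then derives the percentage kept and the pairwise overlap. It assumes candidates already satisfy the NAC condition. The field n_replicates, the coverage bounds, and the candidate values are hypothetical; the other field names follow the expressions quoted in the list.

```python
from itertools import combinations

# Hypothetical coverage bounds; the text derives them from library quantiles.
LOW_COV, HIGH_COV = 30, 120
FILTER_NAMES = ["MAC", "DP", "BIO", "EVS"]


def filter_flags(m):
    """Evaluate the four filters independently for one candidate mutation."""
    depth_ok = all(
        LOW_COV <= m[key] <= HIGH_COV for key in ("normal_DP", "mutation_DP")
    )
    return {
        "MAC": m["mutation_altCount"] >= 5,
        "DP": depth_ok,
        "BIO": m["n_replicates"] >= 2,  # hypothetical replicate count field
        "EVS": m["Filter"] == "PASS",
    }


# Hypothetical candidates, for illustration only.
candidates = [
    {"mutation_altCount": 8, "normal_DP": 60, "mutation_DP": 70,
     "n_replicates": 2, "Filter": "PASS"},
    {"mutation_altCount": 3, "normal_DP": 60, "mutation_DP": 65,
     "n_replicates": 1, "Filter": "PASS"},
    {"mutation_altCount": 6, "normal_DP": 150, "mutation_DP": 70,
     "n_replicates": 2, "Filter": "LowEVS"},
    {"mutation_altCount": 10, "normal_DP": 50, "mutation_DP": 55,
     "n_replicates": 1, "Filter": "LowEVS"},
]

flags = [filter_flags(m) for m in candidates]

# (1) Percentage of candidates kept by each filter taken alone.
percent_kept = {
    name: 100 * sum(f[name] for f in flags) / len(flags)
    for name in FILTER_NAMES
}

# (3) Number of candidates kept by both filters of each pair.
overlap = {
    (a, b): sum(f[a] and f[b] for f in flags)
    for a, b in combinations(FILTER_NAMES, 2)
}

print(percent_kept)
print(overlap)
```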

MAC and EVS are the most stringent filters (Fig. 10.7), and the MAC filter is the one that most changes the allele frequency distribution (Fig. 10.8). Most filters share mutations, except EVS and BIO, which each individually rejected many mutations shared by the other two (Fig. 10.9). All filters except MAC are insensitive to library coverage (Fig. 10.10). We will thus use two filter sets for the next steps:

  1. base:
    • NAC: no copy of the mutated allele in the normal sample (normal_altCountT1 == 0)
    • MAC: a minimum of 5 copies of the mutated allele in the mutated sample (mutation_altCount >= 5)
    • DP: a read depth for both samples between the 10th quantile and the 90th quantile of the coverage of the mutated library (normal_DP <= high_cov, normal_DP >= low_cov, mutation_DP <= high_cov, mutation_DP >= low_cov)
    • BIO: the mutation is present in at least two biological replicates (2 leaves from the crown)
  2. evs:
    • base
    • EVS: Strelka 2 automatic filtering based on the empirical variant score (Filter == "PASS")
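
The two filter sets above are nested: evs is base plus the Strelka 2 PASS flag. The sketch below shows one way to express this, with depth bounds taken as the 10th and 90th percentiles of a library's coverage. The depth values, the n_replicates field, and the function names are hypothetical, and the normal-sample field is written normal_altCount as an assumption.

```python
import statistics


def coverage_bounds(depths):
    """Return the 10th and 90th percentiles of a library's per-site depth."""
    deciles = statistics.quantiles(depths, n=10, method="inclusive")
    return deciles[0], deciles[-1]


def passes_base(m, low_cov, high_cov):
    """base = NAC + MAC + DP + BIO."""
    return (
        m["normal_altCount"] == 0                      # NAC
        and m["mutation_altCount"] >= 5                # MAC
        and low_cov <= m["normal_DP"] <= high_cov      # DP, normal sample
        and low_cov <= m["mutation_DP"] <= high_cov    # DP, mutated sample
        and m["n_replicates"] >= 2                     # BIO (hypothetical field)
    )


def passes_evs(m, low_cov, high_cov):
    """evs = base + Strelka 2 PASS flag."""
    return passes_base(m, low_cov, high_cov) and m["Filter"] == "PASS"


# Hypothetical per-site depths for the mutated library.
library_depths = list(range(101))
low_cov, high_cov = coverage_bounds(library_depths)

# Hypothetical candidate that passes base but not evs.
candidate = {
    "normal_altCount": 0, "mutation_altCount": 7,
    "normal_DP": 45, "mutation_DP": 60,
    "n_replicates": 2, "Filter": "LowEVS",
}
print(passes_base(candidate, low_cov, high_cov),
      passes_evs(candidate, low_cov, high_cov))
```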

Figure 10.7: Percentage of kept mutations per filter.

Figure 10.8: Allele frequencies of filtered mutations.

Figure 10.9: Filtered mutations sharing across filters.

Figure 10.10: Link between the number of detected mutations and the coverage across samples depending on the filter used.

10.5 Leaf mutations

Only 30 to 89 mutations per sample passed evs filtering (Fig. 10.11). Most mutations have a low allele frequency (Fig. 10.12), and most are not shared by biological replicates (Fig. 10.13 and Fig. 10.14).

Figure 10.11: Number of filtered mutations per leaf.

Figure 10.12: Allele frequencies of filtered mutations in leaf samples.

Figure 10.13: Filtered mutations sharing across leaf samples within tips for evs filtering.

Figure 10.14: Filtered mutations sharing across leaf samples within tips for base filtering.

10.6 Mutations architecture

10.7 Phylogeny

10.8 Light & annotation

Figure 10.15: Number of mutations per filter and light condition across branches.

10.9 Rate

10.10 Mutations for Fruits

Selection of mutations for transmission in fruits.

Table 10.1: Number of mutations per branch.

replicate    N   minAF   meanAF   maxAF
18           1    0.06     0.06    0.06
10           1    0.06     0.06    0.06
8            1    0.05     0.05    0.05
7            3    0.05     0.08    0.10
6            4    0.04     0.05    0.05
5            8    0.03     0.04    0.06
4           26    0.04     0.06    0.21
3           80    0.03     0.06    0.50

10.11 Genes

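# Work in the annotation directory
cd data/mutations/angela/annotation
# Convert the TransDecoder GFF3 annotation to GTF (-T), the format SNPdat expects
~/Tools/gffread/gffread trsc.fa.transdecoder.gff3 -T -o trsc.fa.transdecoder.gtf
# Annotate the SNPs (-i) against the GTF (-g) and the transcript FASTA (-f);
# -s writes the summary file and -o the per-SNP output
perl ~/Tools/SNPdat_package_v1.0.5/SNPdat_v1.0.5.pl \
  -i SNPsOnTrsc.tsv \
  -g trsc.fa.transdecoder.gtf \
  -f trsc.fa \
  -s synonymy.summary \
  -o synonymy.output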
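
For readers unfamiliar with what the synonymy annotation reports, the sketch below illustrates the underlying idea on a single codon using the standard genetic code. It is not SNPdat's implementation, and the function name and example codons are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, position, alt_base):
    """Classify a substitution at a 0-based position within a reference codon."""
    mutated = codon[:position] + alt_base + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "non-synonymous"


# GAA -> GAG keeps glutamate; GAA -> AAA changes glutamate to lysine.
print(classify_snp("GAA", 2, "G"))
print(classify_snp("GAA", 0, "A"))
```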

Figure 10.16: Mutation types.

Figure 10.17: Mutation allelic fraction per type.

Figure 10.18: Synonymy with allelic fraction.

Figure 10.19: GO enrichment for Angela non-synonymous mutations.