Chapter 5 Genome assembly
Assembly of high-quality annotated reference genomes for three trees of three different species.
5.1 Sequencing
- CNRGV, Toulouse
- Genomic reads with PacBio HiFi
- Optical maps with Bionano
- Illumina HiSeq/NovaSeq
5.2 Assembly workflow
5.2.2 V2
A second version can be produced using a singularity
and snakemake
workflow developed by Ludovic Duvaux (possible collaboration).
The wrokflow uses the following steps:
- converting
bam
files tofastq
: converting HiFi reads to regular fastq with the packagesimlink
1 assembly of contigs: usinghifiasm
first but step to test additional assembler such ascanu
andspades
- removing haplotigs:
purge_dups
andpurge_haplotigs
- optical maps: proprietary program of bionano & manual curation by CNRGV to obatin a haploid optical map
- gap closing: using Illumina reads with
SOAPdenovo2
- quality check:
5.3 Annotation workflow
Automatic singularity & snakemake workflow to annotate genomes: https://github.com/sylvainschmitt/genomeAnnotation .
5.4 Project description
In each species, using a single genotype, such as leaves from a single branch, we will extract DNA and obtain long reads -2 runs of each Mini and Prometh ION targeting a total coverage of 100x -and short reads -sequencing one of the libraries to be used for the detection of mutations (below) using Illumina NovaSeq6000 technology and targeting a minimum coverage of 100x. The sequencing services will be subcontracted to CEA Genoscope, Evry. Ideally, we would use cambium tissue to produce the reference genomes, however, since a limited amount of cambium tissue can be sampled per tree, we will use leaf tissue. Assembling the quality-filtered reads from both sequencing technologies will provide long scaffolds (Johnson et al., 2019). In each species, we will build an “optimal map” of the genome using high-resolution restriction maps from single, labelled molecules of DNA obtained through Bionano technology (subcontracted to CNRGV, Toulouse). In combination with the previous scaffolds this will provide a chromosomal assembly.The three genomes will be annotated using automated pipelines (Bolger, Arsova, & Usadel, 2018).
Seedling original genotype (T4.1): Short read sequencing data will be demultiplexed, quality-filtered, and mapped to the reference genomes. The “zero mutation reference” genotype will be constructed from the consensus genotype calls obtained from short read sequencing of the three cambium samples using a standard variant calling approach such as GATK (“Best Practices for Variant Calling with the GATK,” 2015).