Chapter 5 Genome assembly

Assembly of high-quality annotated reference genomes for three trees of three different species.

5.1 Sequencing

  • CNRGV, Toulouse
  • Genomic reads with PacBio HiFi
  • Optical maps with Bionano
  • Illumina HiSeq/NovaSeq

5.2 Assembly workflow

5.2.1 V1

The first assembly will be produced by William Marande in the CNRGV.

5.2.2 V2

A second version can be produced using a singularity and snakemake workflow developed by Ludovic Duvaux (possible collaboration). The wrokflow uses the following steps:

  1. converting bam files to fastq: converting HiFi reads to regular fastq with the package simlink 1 assembly of contigs: using hifiasm first but step to test additional assembler such as canu and spades
  2. removing haplotigs: purge_dups and purge_haplotigs
  3. optical maps: proprietary program of bionano & manual curation by CNRGV to obatin a haploid optical map
  4. gap closing: using Illumina reads with SOAPdenovo2
  5. quality check:

5.3 Annotation workflow

Automatic singularity & snakemake workflow to annotate genomes: https://github.com/sylvainschmitt/genomeAnnotation .

5.4 Project description

In each species, using a single genotype, such as leaves from a single branch, we will extract DNA and obtain long reads -2 runs of each Mini and Prometh ION targeting a total coverage of 100x -and short reads -sequencing one of the libraries to be used for the detection of mutations (below) using Illumina NovaSeq6000 technology and targeting a minimum coverage of 100x. The sequencing services will be subcontracted to CEA Genoscope, Evry. Ideally, we would use cambium tissue to produce the reference genomes, however, since a limited amount of cambium tissue can be sampled per tree, we will use leaf tissue. Assembling the quality-filtered reads from both sequencing technologies will provide long scaffolds (Johnson et al., 2019). In each species, we will build an “optimal map” of the genome using high-resolution restriction maps from single, labelled molecules of DNA obtained through Bionano technology (subcontracted to CNRGV, Toulouse). In combination with the previous scaffolds this will provide a chromosomal assembly.The three genomes will be annotated using automated pipelines (Bolger, Arsova, & Usadel, 2018).

Seedling original genotype (T4.1): Short read sequencing data will be demultiplexed, quality-filtered, and mapped to the reference genomes. The “zero mutation reference” genotype will be constructed from the consensus genotype calls obtained from short read sequencing of the three cambium samples using a standard variant calling approach such as GATK (“Best Practices for Variant Calling with the GATK,” 2015).