TropGBTropical Crops Genome Database

Cassava/木薯

Taxonomy:    Magnoliopsida / Rosidae / Rosanae / Malpighiales / Euphorbiaceae / Manihot /Manihot esculenta

Introduction

1. Manihot esculenta, commonly called cassava , manioc, yuca, or tapioca is a woody shrub of the spurge family, Euphorbiaceae, native to South America, from Brazil and parts of the Andes.

2.Cassava is the third-largest source of food carbohydrates in the tropics, after rice and maize.

3. Cassava is classified as either sweet or bitter. Like other roots and tubers, both bitter and sweet varieties of cassava contain antinutritional factors and toxins, with the bitter varieties containing much larger amounts.

4. The cassava root is long and tapered, with a firm, homogeneous flesh encased in a detachable rind, about 1 millimetre (1⁄16 inch) thick, rough and brown on the outside.

5. Wild populations of M. esculenta subspecies flabellifolia, shown to be the progenitor of domesticated cassava, are centered in west-central Brazil, where it was likely first domesticated no more than 10,000 years ago.

Genomic Version Information

Manihot esculenta v7.1

Genome Overview

Cassava (Manihot esculenta Crantz) is grown throughout tropical Africa, Asia and the Americas for its starchy storage roots, and feeds an estimated 1 billion people each day. Farmers choose it for its high productivity and its ability to withstand a variety of environmental conditions (including significant water stress) in which other crops fail. However, it has low protein content, and is susceptible to a range of biotic stresses. Despite these problems, the crop production potential for cassava is enormous, and its capacity to grow in a variety of environmental conditions makes it a plant of the future for emerging tropical nations. Cassava is also an excellent energy source - its roots contain 20-40% starch that costs 15-30% less to produce per hectare than starch from corn, making it an attractive and strategic source of renewable energy.

The cassava genome project built upon a pilot initiated through the DOE-JGI Community Sequencing Program (CSP) by a 14-member consortium led by Claude Fauquet, Joe Tohme and Pablo Rabinowicz. This pilot project produced a little under 1× coverage from over 700,000 Sanger shotgun reads using plasmid and fosmid libraries, and it provided insights into the overall characteristics of the cassava genome, and a valuable source of Sanger paired-end sequences to be used later. Next, a draft 454-based assembly (v4) was generated in a project led by Steve Rounsley, Dan Rokhsar, Chinnappa Kodira, and Tim Harkins. This began in Spring 2009 when 454 Life Sciences, a Roche company partnered with DOE-JGI to provide the resources for a whole genome shotgun sequencing of cassava using the 454 GS FLX Titanium platform.

The version 6 assembly improved the contiguity and completeness of the AM560-2 genome by starting afresh with high-quality Illumina-based contig sequences ordered and oriented into scaffolds using newly-introduced mate-pair and fosmid-end jumping libraries and Dovetail Genomics in-vitro proximity ligation sequences. The scaffolds were then anchored onto chromosomes with a high-resolution genetic linkage map (International Cassava Genetic Map Consortium, 2014).

This assembly (v7) was constructed following the same strategy, but now with more contiguous underlying sequences assembled de novo from Pacific Biosciences (PacBio) single-molecule real-time (SMRT) continuous long-read (CLR) data. This assembly was supported by grants to Steve Rounsley (Univ. of Arizona), Dan Rokhsar (DOE-JGI, UC Berkeley), and Jean-Luc Jannink (Cornell Univ., NextGen Cassava) from the Bill & Melinda Gates Foundation and UKAID/DFID (OPPGD1493 and OPP1048542).

Genome Information

Assembly Source:UC Berkeley
Assembly Version:v7
Annotation Source:JGI
Annotation Version:v7.1
Total Scaffold Length (bp):669,368,850
Number of Scaffolds:899
Min. Number of Scaffolds containing half of assembly (L50):10
Shortest Scaffold from L50 set (N50):33,084,714
Total Contig Length (bp):667,249,289
Number of Contigs:2,460
Min. Number of Contigs containing half of assembly (L50):276
Shortest Contig from L50 set (N50):693,312
Number of Protein-coding Transcripts:66,486
Number of Protein-coding Genes:33,849
Percentage of Eukaryote BUSCO Genes:98.7
Percentage of Embroyphyte BUSCO Genes:98.1

Sequencing, Assembly, and Annotation

Sequence data were generated from a partially inbred (third generation self, or S3, of MCOL1505) line called AM560-2 which was generated at the International Center for Tropical Agriculture (CIAT) in Cali, Colombia by Hernan Ceballos. Previous assemblies v4 and v6 were based on the same accession.

Rebecca Bart's group at the Donald Danforth Plant Science Center grew and collected etiolated AM560-2 leaves, from which high molecular weight genomic DNA was extracted into low melting agarose plugs by Julia Vrebalov at the Boyce Thompson Institute. DNA was also extracted from fresh leaf tissue by Jessica B. Lyons (UC Berkeley) using the CTAB method. These DNAs were sent to the University of California, Davis DNA Technologies & Expression Analysis Core and sequenced with PacBio RSII machines using P6-C4 chemistry. The resulting 73.3 Gb of CLR data produced, representing 97.7× depth of sequencing coverage, was used as input for assembly.

Contig sequences were constructed with Canu v1.6 (Koren et al., 2017). The median depth of coverage for each contig was used to identify sequence redundancy between two or more contigs, and sequence alignment used to identify the extent of overlap. Shorter redundant sequences were removed. Contigs were then formed into scaffolds with SSPACE v3 (Boetzer et al., 2011) using the mate-pair and fosmid-end sequences generated previously (Bredeson et al., 2016), which were further ordered and oriented along the 22.4k marker framework map (International Cassava Genetic Map Consortium (ICGMC), 2015) into chromosome-scale superscaffolds. Lastly, the assembly was polished to remove base-level errors using Arrow (Chin et al., 2013) and FreeBayes v1.1.0-54-g49413aa (Garrison & Marth, 2012) with 121× of Illumina whole-genome shotgun paired-end sequences. The vast majority of the assembly (605.7 Mb of 667.2 Mb (90.8%)) could be anchored to the 18 chromosomes.

Although cassava has an estimated genome size of ~772 Mbp (Awoleye et al., 1994), this assembly spans only 669.4 Mbp. Despite the discrepancy, we believe that the assembly represents nearly all of the genic regions of the genome (see below), and that the missing portion is repetitive sequence that could not so far be assembled.

We were able to map 95.6% of the 122.7k EST sequences available in the NCBI nucleotide database, showing near-complete coverage of protein-coding genes in the assembly.

In order to find and mask repeats in the genome sequence, RepeatModeler v1.0.8 (Smit & Hubley, 2008-2015) was run on the v7 assembly and generated 1,122 repetitive sequences (median length 178 bp and totaling 1,405,165 bp). Over half of these sequences were unknown, but ~37% could be classified LTR; ~7% DNA and ~3% LINE/SINE elements. Nonetheless, the software is unaware that plants contain large families of genes with similar sequence and these often are included in the putative repeat library. In order to remove these gene-like sequences from the repeat library, PFAM and PANTHER annotations were predicted on the repeat sequences and those with annotations known to correspond to genes were removed from the repeat library. The remaining 929 sequences were combined with cassava sequences in RepBase and used to mask the genome with RepeatMasker (Smit et al., 2013-2015). This masked 60.2% of the genome.

Gene prediction

To produce the current "Cassava v7.1" gene set, we used the homology-based gene prediction programs FgenesH and GenomeScan, along with the PASA program to integrate expression information from cassava ESTs and RNA-Seq.

Transcript assemblies were made from 0.8B pairs of 2×150, 1.2B pairs of 2×100 paired-end Illumina RNA-seq reads, and approximately 5M Roche 454 reads using PERTRAN (Shu, unpublished). PASA (Haas et al., 2003) constructed 173,456 transcript assemblies from NCBI ESTs, and the RNA-seq transcript assemblies above. Loci were determined by transcript assembly alignments and/or EXONERATE (Slater & Birney, 2005) alignments of proteins from arabi (Arabidopsis thaliana), soybean, poplar, Kitaake rice, sorghum, Setaria viridis, grape, and Swiss-Prot proteomes to repeat-soft-masked Manihot esculenta genome using RepeatMasker with up to 2kb extension on both ends unless extending into another locus on the same strand. Repeat library consists of do novo repeats by RepeatModeler on Manihot esculenta genome and M. esculenta repeats in RepBase. Gene models were predicted by homology-based predictors, FGENESH+ (Salamov & Solovyev, 2000), FGENESH_EST (similar to FGENESH+, EST as splice site and intron input instead of protein/translated ORF), and EXONERATE, PASA assembly ORFs (in-house homology constrained ORF finder) and from AUGUSTUS via BRAKER1 (Hoff et al., 2016). The best scored predictions for each locus are selected using multiple positive factors including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA. Improvement includes adding UTRs, splicing correction, and adding alternative transcripts. PASA-improved gene model proteins were subject to protein homology analysis to above mentioned proteomes to obtain Cscore and protein coverage. Cscore is a protein BLASTP score ratio to MBH (mutual best hit) BLASTP (Altschul et al., 1990; Camacho et al., 2009) score and protein coverage is the highest percentage of protein aligned to the best of homologs. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and its CDS overlapping with repeats. The transcripts were selected if its Cscore is larger than or equal to 0.5 and protein coverage larger than or equal to 0.5, or it has EST coverage, but its CDS overlapping with repeats is less than 20%. For gene models whose CDS overlaps with repeats for more that 20%, its Cscore must be at least 0.9 and homology coverage at least 70% to be selected. The selected gene models were subject to Pfam analysis and gene models whose protein is more than 30% in Pfam TE domains were removed and weak gene models. Incomplete gene models, low homology supported without fully transcriptome supported gene models and short single exon (< 300 BP CDS) without protein domain nor good expression gene models were manually filtered out. The quality of the protein-coding gene annotation was benchmarked using the BUSCO tool and OrthoDB9 embryophyta dataset (Simão et al., 2015; Zdobnov et al., 2017), estimating that 98.1% of BUSCOs are sampled completely, 0.9% are fragmented, and only 1.0% missing.

The v7 reference was assembled by Jessen V. Bredeson at UC Berkeley and annotated by Shengqiang Shu at DOE-JGI.

Reference Publication(s)

Bredeson, J. V., Lyons, J. B., Prochnik, S. E., Wu, G. A., Ha, C. M., Edsinger-Gonzales, E., … Rokhsar, D. S. (2016). Sequencing wild and cultivated cassava and related species reveals extensive interspecific hybridization and genetic diversity. Nature Biotechnology, 34(5), 562–570.https://doi.org/10.1038/nbt.3535