TropGB: Tropical Crops Genome Database

Cassava

Taxonomy: Magnoliopsida / Rosidae / Rosanae / Malpighiales / Euphorbiaceae / Manihot / Manihot esculenta

Introduction

1. Manihot esculenta, commonly called cassava, manioc, yuca, or tapioca, is a woody shrub of the spurge family, Euphorbiaceae, native to South America, from Brazil and parts of the Andes.

2. Cassava is the third-largest source of food carbohydrates in the tropics, after rice and maize.

3. Cassava is classified as either sweet or bitter. Like other roots and tubers, both bitter and sweet varieties of cassava contain antinutritional factors and toxins, with the bitter varieties containing much larger amounts.

4. The cassava root is long and tapered, with a firm, homogeneous flesh encased in a detachable rind, about 1 millimetre (1⁄16 inch) thick, rough and brown on the outside.

5. Wild populations of M. esculenta subspecies flabellifolia, shown to be the progenitor of domesticated cassava, are centered in west-central Brazil, where cassava was likely first domesticated no more than 10,000 years ago.

Genomic Version Information

Manihot esculenta v8.1

Genome Overview

Cassava (Manihot esculenta Crantz) is grown throughout tropical Africa, Asia, and the Americas for its starchy storage roots, and feeds an estimated 1 billion people each day. Farmers choose it for its high productivity and its ability to withstand a variety of environmental conditions (including significant water stress) in which other crops fail. However, it has low protein content and is susceptible to a range of biotic stresses. Despite these problems, the crop production potential for cassava is enormous, and its capacity to grow in a variety of environmental conditions makes it a plant of the future for emerging tropical nations. Cassava is also an excellent energy source: its roots contain 20–40% starch that costs 15–30% less to produce per hectare than starch from corn, making it an attractive and strategic source of renewable energy.

The cassava genome project built upon the pilot initiated through the DOE-JGI Community Sequencing Program (CSP) by a 14-member consortium led by Claude Fauquet, Joe Tohme, and Pablo Rabinowicz. This pilot project produced a little under 1× coverage from over 700,000 Sanger shotgun reads using plasmid and fosmid libraries, provided insights into the overall characteristics of the cassava genome, and was a valuable source of Sanger paired-end sequences used later. Next, a draft 454-based assembly (v4) was generated in a project led by Steve Rounsley, Dan Rokhsar, Chinnappa Kodira, and Tim Harkins. The project began in Spring 2009 when 454 Life Sciences, a Roche company, partnered with DOE-JGI to provide the resources for whole-genome shotgun sequencing of cassava using the 454 GS FLX Titanium platform.

The v6 assembly improved the contiguity and completeness of the AM560-2 genome by starting afresh with high-quality Illumina-based contig sequences, ordered and oriented into scaffolds using newly-introduced mate-pair and fosmid-end jumping libraries and Dovetail Genomics in-vitro proximity ligation sequences. Scaffolds were arranged into chromosomes with a high-resolution genetic linkage map (1). The v7 assembly followed the same strategy as v6, but with more contiguous underlying sequences assembled de novo from Pacific Biosciences (PacBio) single-molecule real-time (SMRT) continuous long-read (CLR) data.

The v8 assembly described here leverages the same PacBio CLR and fosmid-end data used above; however, instead of using linkage maps to form chromosome-scale scaffolds, it now incorporates newly generated high-throughput chromatin conformation capture (Hi-C) and PacBio data contributed by Sarah Dyer (NIAB, Cambridge, UK; BBSRC GCRF Foundation Award BB/P022804/1). The higher-resolution Hi-C anchored an additional 187.8 Mb and 26.0 Mb of sequence compared to the v6 and v7 assemblies, respectively. The improved resolution and completeness further augmented assembly contiguity by nearly 123.6× and 4.8× over v6 and v7 after gap filling. Grants to Steve Rounsley (formerly at Univ. of Arizona), Dan Rokhsar (DOE-JGI, UC Berkeley), and Jean-Luc Jannink (Cornell Univ., NextGen Cassava) from the Bill & Melinda Gates Foundation and UKAID/DFID (OPPGD1493 and OPP1048542) supported these efforts.

Genome Information

Assembly Source: UCB
Assembly Version: v8.0
Annotation Source: JGI
Annotation Version: v8.1
Total Scaffold Length (bp): 639,586,700
Number of Scaffolds: 84
Min. Number of Scaffolds containing half of assembly (L50): 9
Shortest Scaffold from L50 set (N50): 34,668,054
Total Contig Length (bp): 637,114,355
Number of Contigs: 756
Min. Number of Contigs containing half of assembly (L50): 56
Shortest Contig from L50 set (N50): 3,258,867
Number of Protein-coding Transcripts: 59,204
Number of Protein-coding Genes: 32,858
Percentage of Eukaryote BUSCO Genes: 98.3
M.esculenta v8.1: Phytozome98.7
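
The L50 and N50 rows above follow the usual definitions: sort the sequences from longest to shortest, accumulate their lengths until half of the total is reached, then report how many sequences were needed (L50) and the length of the last one added (N50). The sketch below illustrates this on a toy list of lengths; the function name `n50_l50` and the toy values are illustrative assumptions, not part of the database or the assembly pipeline. It also derives the gap content implied by the table, assuming the difference between total scaffold and total contig length is gap sequence.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths in bp."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # `length` is the shortest sequence in the set covering half
            # of the assembly (N50); `count` is the size of that set (L50).
            return length, count
    raise ValueError("lengths must be non-empty")


# Hypothetical toy lengths, not the real cassava scaffolds.
toy_lengths = [50, 40, 30, 20, 10]
print(n50_l50(toy_lengths))  # (40, 2)

# Gap bases implied by the table (assumes scaffold total = contigs + gaps).
total_scaffold_bp = 639_586_700
total_contig_bp = 637_114_355
gap_bp = total_scaffold_bp - total_contig_bp
print(gap_bp, f"{gap_bp / total_scaffold_bp:.2%}")
```

Applied to the real scaffold lengths (for example, read from a FASTA index), the same function should reproduce the table values, provided the table was computed with this definition.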

Sequencing, Assembly, and Annotation

Sequence data were generated from a partially inbred (third-generation self, or S3, of MCOL1505) line called AM560-2, which was developed at the International Center for Tropical Agriculture (CIAT) in Cali, Colombia by Hernan Ceballos. Previous reference assemblies (v4+) are all based on the same accession.

Rebecca Bart's group at the Donald Danforth Plant Science Center grew and collected etiolated AM560-2 leaves, from which high molecular weight (HMW) genomic DNA was extracted into low-melting agarose plugs by Julia Vrebalov at the Boyce Thompson Institute. Jessica B. Lyons (UC Berkeley) also extracted HMW DNA from fresh leaf tissue using the CTAB method. These DNAs were sent to the University of California, Davis DNA Technologies & Expression Analysis Core for sequencing with PacBio RSII machines using P6-C4 chemistry. Additionally, HMW DNA was extracted from AM560-2 leaves by Tatiana Ovalle and Monica Carvajal-Yepes at CIAT and sent for PacBio Sequel (v2 chemistry) sequencing at the University of Exeter (UK). These combined data totalled 88.8 Gbp of sequence, representing 115× depth of coverage. Lastly, over 389 million Hi-C pairs (254 million non-redundant contacts) were generated by Dovetail Genomics from fresh AM560-2 leaf tissue collected by Monica Carvajal-Yepes and Ericson Aranzales (CIAT Genebank).
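
As a quick check of the depth figure above, coverage can be approximated as total sequenced bases divided by genome size. The sketch assumes the denominator is the ~772 Mbp estimated genome size quoted elsewhere on this page; the source does not state which genome size was used, and the function name is illustrative.

```python
def depth_of_coverage(total_bases, genome_size):
    """Approximate sequencing depth as total bases / genome size."""
    return total_bases / genome_size


pacbio_bases = 88.8e9   # combined PacBio yield stated above
genome_size = 772e6     # assumed: estimated cassava genome size (~772 Mbp)

depth = depth_of_coverage(pacbio_bases, genome_size)
print(f"{depth:.1f}x")  # 115.0x
```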

Contigs were assembled with Canu v1.6 (2) from raw PacBio SMRT CLR data. Redundant contig sequences were identified by their median depth of coverage and the extent of their sequence overlap assessed by alignment. The shorter redundant sequences were then discarded. Misassembled contigs were broken in JuiceBox using the JuiceBox Assembly Tools (JBAT) (3-5). Contigs were ordered and oriented into chromosome-scale scaffolds with SSPACE v3 (6) and 3D-DNA (7) using, respectively, 40 kb fosmid-end sequencing reads (8) and the newly-generated Hi-C sequences. The resulting scaffolds were manually curated with JBAT to remedy contig inversion errors and incorrect placements, and then to incorporate additional small contigs. Scaffold gap sizes were estimated using custom scripts. Gaps were patched via the local de novo assembly of gap-flanking long reads. Base-level errors were corrected by two rounds of signal-based polishing with Arrow (9) (from SMRT Link Suite v6.0.0.47841), followed by four rounds of Illumina short-read polishing with FreeBayes v1.3.1-17-gaa2ace8 (10) and custom scripts. Evaluated by Merqury v1.0.0-g9917ad8 (11), the polishing achieved an average base-level quality value (QV) of 43.5, or less than one error in 22 kb.
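
The relation between the QV and the "one error in 22 kb" figure follows from the Phred scale, where QV = -10 · log10(error probability). The short sketch below converts in both directions; the function names are illustrative and this is not Merqury's own code.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


rate = qv_to_error_rate(43.5)
# Expected spacing between errors in bp; roughly 22.4 kb for QV 43.5.
print(f"one error per ~{1 / rate:,.0f} bp")
```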

Although cassava has an estimated genome size of ~772 Mbp (12), this assembly spans 637.2 Mbp. Despite the discrepancy, we believe the assembly represents nearly all of the genic regions of the genome (see below), and that the missing portion consists of repetitive sequences that could not be assembled. Demonstrating this near-complete protein-coding gene coverage, 95.7% of the 122.7k cassava ESTs available in NCBI map to the v8 assembly. The 18 chromosomal scaffolds are numbered and oriented according to those in the v6 assembly.
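
To put the size discrepancy in concrete terms, the figures above imply that the assembly covers roughly 83% of the estimated genome, leaving about 135 Mbp unassembled. The snippet below is simple arithmetic on the numbers quoted in this paragraph; the variable names are illustrative.

```python
estimated_genome_mbp = 772.0  # estimated genome size quoted above
assembly_span_mbp = 637.2     # assembly span quoted above

fraction = assembly_span_mbp / estimated_genome_mbp
missing_mbp = estimated_genome_mbp - assembly_span_mbp

print(f"assembled: {fraction:.1%}, unassembled: {missing_mbp:.1f} Mbp")
```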

Protein-coding gene and repeat prediction

To produce the current "Cassava v8.1" gene set, homology-based gene prediction programs FgenesH and GenomeScan were leveraged, along with the PASA program to integrate expression information from cassava ESTs and RNA-Seq.

Transcript assemblies (TAs) were constructed with PERTRAN (Shu, unpublished) from roughly three billion strand-specific and five billion standard paired-end Illumina RNA-seq read pairs and five million LS454 ESTs. Subsequently, 282,674 transcript assemblies were constructed from the TAs above and ESTs with PASA (13). Protein-coding loci were inferred from TA alignments and protein homology to Hevea brasiliensis, Jatropha curcas, Arabidopsis, soybean, poplar, peach, rice, sorghum, foxtail millet, tomato, grape, castor bean, and Swiss-Prot proteins. Gene model prediction leveraged the following homology-based methods: FGENESH+ (14), FGENESH_EST, and assembly sequence-based open-reading frames inferred by EXONERATE (15) and refined with PASA. The best-scoring predictions for each locus were selected using a heuristic, rule-based method based on multiple positive factors, including EST and peptide coverages, homology scores, and mRNA expression levels. Negative factors included overlap with annotated genomic repeats (see below), presence of known transposable element domains, and minimum single-exon coding sequence length. Filtered gene models were then improved with PASA to refine splice sites and add UTRs and alternative transcripts. This v8.1 annotation represents 32,447 protein-coding gene loci and, using BUSCO v3 benchmarking, is estimated to be 99.2% complete, with only 0.2% of BUSCO genes fragmented and 0.6% missing (eudicotyledons_odb10, N=2,121).

To find and mask repeats in the genome sequence, RepeatModeler v1.0.8 (16) was run on the v7 assembly and generated 1,093 repetitive sequences (totaling 901,271 bp). Over half of these sequences were unknown, but ~37% could be classified as LTR, ~7% as DNA, and ~3% as LINE/SINE elements. These 1,093 sequences were combined with cassava sequences in RepBase and used to mask the genome with RepeatMasker (17), which masked roughly 61% of the genome.

The v8 reference was annotated by Shengqiang Shu at DOE-JGI.

Submitted / In Press Manuscripts

Bredeson, J.V., Shu, S., Berkoff, K., Lyons, J.B., Caccamo, M., Santos, B., Ovalle, T., Bart, R.S., Augusto Becerra Lopez-Lavalle, L., Carvajal Yepes, M., Aranzales, E., Wenzl, P., Jannink, J.-L., Dyer, S., Rokhsar, D.S. "An improved reference assembly for cassava (Manihot esculenta Crantz)". In preparation.