TropGB: Tropical Crops Genome Database

Sugarcane /甘蔗

Taxonomy: Magnoliopsida / Liliidae / Commelinanae / Poales / Poaceae / Saccharum / Saccharum officinarum

Introduction

1. Sugarcane is a tall, perennial, solid-stemmed herb. The rhizome is stout and well developed. The stalk is 3-5 (-6) meters tall. It is widely cultivated in tropical regions of China, including Taiwan, Fujian, Guangdong, Hainan, Guangxi, Sichuan, and Yunnan.

2. Sugarcane is a temperate and tropical crop. It is the raw material for cane sugar, and ethanol can be refined from it as an alternative energy source. More than 100 countries around the world produce sugarcane, and the largest sugarcane producers are Brazil, India, and China.

3. Sugarcane is an annual or perennial tropical and subtropical herb and is a C4 crop.

4. Sugarcane is a tall, perennial, solid-stemmed herb. The rhizome is stout and well developed.

5. Land preparation provides deep, loose, and fertile soil for sugarcane, so that the root system can grow fully and absorb water and nutrients more effectively. Land preparation can also reduce diseases, insect pests, and weeds in sugarcane fields.

Genomic Version Information

Saccharum officinarum x spontaneum R570 v2.1

Genome Overview

Sugarcane (Saccharum spp.) was first domesticated approximately 8000 years ago and is one of the world's most valuable crops, responsible for 80% of the world's sugar production. Modern-day cultivars are mainly derived from interspecific hybridization, combining the high sugar production of Saccharum officinarum with the disease resistance, vigor, and environmental adaptability of S. spontaneum. However, this breeding design has resulted in a large (2n = 10 Gb) and highly complex genome that is highly redundant and aneuploid, with approximately 12 copies of each chromosome that undergo frequent interspecific recombination.

Due to this complexity, the sugarcane community has focused on developing genomic resources for cultivar R570, which has broad environmental adaptability and is the best-characterized sugarcane cultivar to date. The 10 Gb genome of R570 was sequenced with a combination of sequencing technologies and techniques, including a Bionano optical map, a genetic map derived from selfed offspring, single flow-sorted chromosomes, HiC libraries, and Illumina and PacBio (CLR and HiFi) libraries. The genome assembly was found to be highly redundant, with large stretches of identical sequence among chromosomes. To increase the utility of the assembly, the genome was separated into a mainly unique (primary) assembly for annotation and an alternate assembly of mostly redundant sequences. The R570 genome will serve as a resource for continued breeding improvement of sugarcane.

Genome Information

Assembly Source: JGI
Assembly Version: v2.0
Annotation Source: JGI
Annotation Version: v2.1
Total Scaffold Length (bp): 5,046,770,891
Number of Scaffolds: 144
Min. Number of Scaffolds containing half of assembly (L50): 28
Shortest Scaffold from L50 set (N50): 79,221,035
Total Contig Length (bp): 5,042,101,904
Number of Contigs: 842
Min. Number of Contigs containing half of assembly (L50): 99
Shortest Contig from L50 set (N50): 15,340,496
Number of Protein-coding Transcripts: 299,731
Number of Protein-coding Genes: 194,593
Percentage of Eukaryote BUSCO Genes: 98.7
DB Xrefs: 99.8
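
The L50 and N50 values listed above follow the usual definitions: sort sequences from longest to shortest, accumulate lengths until half of the total is reached, and report how many sequences were needed (L50) and the length of the last one added (N50). A minimal sketch is given here, assuming only a plain list of lengths; the function name `n50_l50` and the toy lengths are illustrative and are not taken from the R570 assembly or from JGI tooling.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first sequence that pushes the running sum to half of the
        # assembly defines N50; the number of sequences used so far is L50.
        if running >= half_total:
            return length, count
    raise ValueError("lengths must contain at least one sequence")


# Toy lengths (hypothetical, not R570 scaffolds): total = 300, half = 150
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50_l50(toy_lengths))  # (70, 2)
```

Applied to the real scaffold lengths, the same procedure would be expected to reproduce the reported scaffold L50 of 28 and N50 of 79,221,035 bp.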

Sequencing, Assembly, and Annotation

The main assembly consisted of 66.55x single-haplotype PacBio CCS coverage (17,192 bp average read size). It was assembled using HiFiAsm+HIC, and the resulting sequence was polished using RACON. Initial ordering of the V2 primary genome assembly was performed using the R570 V1 build: unique 2 kb markers were extracted from the V1 build and aligned to the HiFi contigs using BLAT. Additionally, genetic map information (237 linkage groups; ~1.8 million correlated markers; 80 bp) and R570 single-chromosome Illumina libraries (n=81, assembled using HipMER; 500 bp non-overlapping markers extracted from >1 kb contigs) were projected onto the HiFi contigs to identify misjoins. A total of 558 misjoins were initially identified and broken. Additional misjoins were identified using the Bionano R570 optical map, using the 'cut' version of the genome.

V1 build markers were realigned to the 'cut' genome assembly and used to define a primary path along each chromosome, selecting the minimum number of contigs required to complete the chromosome path. Secondary contigs were marked 'alternative' and moved into a separate bin. Sorghum bicolor (v3.1.1; n=34,211) primary protein sequences were then aligned to the primary assembly to order and orient contigs within chromosomes and to trim sequence overlaps. The S. bicolor protein sequences were also used to visualize gaps in the chromosome builds and to find contigs that could potentially fill them. After these contigs were inserted into gaps, the overlap and trimming step was repeated.

Lastly, to identify contigs that belong in the primary build (representing unique sequence) but had not previously been anchored, 48 bp k-mers were extracted from the primary chromosomes and used to mask the remaining contigs. Any non-duplicative contigs were placed in the primary assembly. After fixing potential misjoins identified from HiC, homozygous SNPs and indels were corrected in the release sequence using 45x of Illumina reads (2x150, 400 bp insert).
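
The k-mer masking step can be illustrated with a small sketch. This is not the JGI pipeline: it keeps all k-mers in an in-memory set (impractical for a multi-gigabase genome), ignores reverse complements, and uses a hypothetical `max_masked` threshold, since the source does not state the cutoff used to call a contig non-duplicative. All function names are illustrative.

```python
def kmers(seq, k):
    """Yield every k-length substring of seq, uppercased."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_set(primary_seqs, k):
    """Collect all k-mers present in the primary chromosome sequences."""
    seen = set()
    for seq in primary_seqs:
        seen.update(kmers(seq, k))
    return seen


def masked_fraction(contig, primary_kmers, k):
    """Fraction of contig bases covered by a k-mer found in the primary set."""
    n = len(contig)
    if n < k:
        return 0.0
    covered = [False] * n
    upper = contig.upper()
    for i in range(n - k + 1):
        if upper[i:i + k] in primary_kmers:
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / n


def is_non_duplicative(contig, primary_kmers, k, max_masked=0.5):
    """True if most of the contig is NOT masked (max_masked is an assumed cutoff)."""
    return masked_fraction(contig, primary_kmers, k) <= max_masked


# Toy example with k=4 instead of 48 so the sequences stay readable.
primary = build_kmer_set(["ACGTACGTAA"], k=4)
print(masked_fraction("ACGTACGT", primary, k=4))     # fully masked contig
print(is_non_duplicative("TTTTGGGG", primary, k=4))  # shares no k-mers
```

A production implementation would more likely use a dedicated k-mer counter and canonical (strand-independent) k-mers; the sketch only shows the masking logic.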

Transcript assemblies were made from ~3.7B pairs of 2x150 stranded paired-end Illumina RNA-seq reads using PERTRAN, which conducts genome-guided transcriptome short-read assembly via GSNAP (Wu and Nacu, 2010) and builds splice alignment graphs after alignment validation, realignment, and correction. To obtain ~1.5M putative full-length transcripts, about 31M PacBio Iso-Seq CCS reads were corrected and collapsed by a genome-guided correction pipeline, which aligns CCS reads to the genome with GMAP (Wu and Nacu, 2010), corrects small indels at splice junctions where present, and clusters alignments when all introns are the same or, for single-exon alignments, when they overlap by 95%. Subsequently, 1,763,610 transcript assemblies were constructed using PASA (Haas, 2003) from the RNA-seq transcript assemblies above.

Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis thaliana, soybean, rice, sorghum, Brachypodium, Panicum hallii, Joinvillea ascendens, Acorus americanus, Paspalum vaginatum, Phoenix dactylifera, Musa acuminata, Ananas comosus, Asparagus officinalis, Phalaenopsis equestris, Aquilegia, grape, and Swiss-Prot proteomes to the repeat-soft-masked Saccharum officinarum x spontaneum R570 genome (masked using RepeatMasker; Smit, 2013-2015), with up to 2 kb extension on both ends unless extending into another locus on the same strand. The repeat library consists of de novo repeats identified by RepeatModeler (Smit, 2008-2015) on the Saccharum officinarum x spontaneum R570 genome and repeats in RepBase. Gene models were predicted by the homology-based predictors FGENESH+ (Salamov, 2000), FGENESH_EST (similar to FGENESH+, but using ESTs to compute splice site and intron input instead of protein/translated ORFs), and EXONERATE (Slater and Birney, 2005); from PASA assembly ORFs (in-house homology-constrained ORF finder); and from AUGUSTUS (Stanke, 2006), trained on the high-confidence PASA assembly ORFs and with intron hints from short-read alignments.

The best-scored predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA; improvements include adding UTRs, correcting splicing, and adding alternative transcripts. PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain a Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to the mutual best hit (MBH) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best homolog. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats. A transcript was selected if its Cscore was at least 0.5 and its protein coverage at least 0.5, or if it had EST coverage, provided that less than 20% of its CDS overlapped with repeats. For gene models whose CDS overlapped with repeats by more than 20%, the Cscore had to be at least 0.9 and the homology coverage at least 70% for the model to be selected. This was the first round of gene annotation.
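
The first-round selection thresholds can be written as a small predicate. The sketch below is one reading of the rule, not the annotation pipeline's code: it treats "or it has EST coverage" as an alternative to the Cscore/coverage test when repeat overlap is below 20%, and it assumes that a repeat overlap of exactly 20% falls under the stricter rule, which the source leaves unspecified. Function and parameter names are hypothetical; coverage and overlap are expressed as fractions between 0 and 1.

```python
def cscore(blastp_score, mbh_blastp_score):
    """Cscore: a protein's BLASTP score divided by the mutual-best-hit BLASTP score."""
    if mbh_blastp_score <= 0:
        return 0.0
    return blastp_score / mbh_blastp_score


def keep_transcript(cscore_value, protein_coverage, has_est_support, repeat_overlap):
    """Apply the first-round thresholds to one PASA-improved transcript."""
    if repeat_overlap < 0.20:
        # Low repeat overlap: homology support OR EST support is sufficient.
        homology_ok = cscore_value >= 0.5 and protein_coverage >= 0.5
        return homology_ok or has_est_support
    # High repeat overlap (assumed to include exactly 20%): stricter homology
    # thresholds, and EST support alone is not enough.
    return cscore_value >= 0.9 and protein_coverage >= 0.70


# Hypothetical transcripts, not taken from the R570 annotation.
print(keep_transcript(cscore(150, 200), 0.6, False, 0.10))  # homology-supported
print(keep_transcript(0.3, 0.2, True, 0.10))                # EST-supported only
print(keep_transcript(0.8, 0.9, True, 0.30))                # repeat-rich, fails 0.9
```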

Each subgenome (with all recombinant chromosomes treated as one subgenome) was hard-masked with its respective high-confidence gene models (fully transcriptome-supported, highly homology-supported, and complete gene models). The masked genome was searched with BLASTX and EXONERATE against non-self, high-confidence, non-redundant peptide proteomes from the first round to make EXONERATE gene predictions. The predicted gene models were scored with BLASTP using the homology proteomes. These models were compared to those from the first round, and models with better homology support (not contradicted by transcriptome evidence) were kept, either to improve first-round gene models or as additions where no first-round gene model existed.

The selected gene models were subjected to Pfam analysis, and gene models whose protein was more than 30% covered by Pfam TE domains were removed, as were weak gene models. Incomplete gene models, gene models with low homology support and without full transcriptome support, and short single-exon gene models (< 300 bp CDS) with neither a protein domain nor good expression were manually filtered out.
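
The 30% Pfam TE-domain criterion can be sketched as an interval-coverage calculation. The example below assumes that domain hits are available as half-open `(start, end)` residue intervals on the protein and that overlapping hits should be counted once; both are assumptions, as the source does not describe how coverage was computed. The short single-exon check only flags candidates, because the source states that this filtering was done manually. All names are illustrative.

```python
def covered_fraction(protein_length, domain_intervals):
    """Fraction of residues covered by the union of half-open (start, end) intervals."""
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(domain_intervals):
        # Clip each hit to the protein and skip empty intervals.
        start, end = max(start, 0), min(end, protein_length)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent hit: extend the current merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / protein_length


def is_te_like(protein_length, te_domain_intervals, threshold=0.30):
    """True if more than 30% of the protein lies in TE-associated Pfam domains."""
    return covered_fraction(protein_length, te_domain_intervals) > threshold


def is_short_single_exon(exon_count, cds_length_bp):
    """Flag single-exon models with a CDS shorter than 300 bp for manual review."""
    return exon_count == 1 and cds_length_bp < 300


# Hypothetical 100-residue protein with two overlapping TE-domain hits.
print(is_te_like(100, [(0, 20), (10, 35)]))  # union covers 35 residues
print(is_short_single_exon(1, 240))
```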

References

1.  Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res., 31, 5654-5666.

2.  Slater, G.S., and Birney, E. (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6:31

3.  Salamov, A. A. and Solovyev, V. V. (2000). Ab initio gene finding in Drosophila genomic DNA. Genome Res 10, 516-22.

4.  Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-4.0. 2013-2015 http://www.repeatmasker.org

5.  Stanke, M., Schoffmann, O., Morgenstern, B. et al. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7, 62 (2006). https://doi.org/10.1186/1471-2105-7-62.

Reference Publication(s)

Healey A L, Garsmeur O, Lovell J T, et al. The complex polyploid genome architecture of sugarcane. Nature, 2024, 628(8009): 804-810.