...assembly of a gene, for example a frameshift indel in a coding sequence, can produce an incorrect annotation [3]. Since the publication of rheMac2, another rhesus macaque genome has been produced from a Chinese-origin animal: CR_1.0 [8] (referred to as rheMac3 in the University of California, Santa Cruz Genome Browser). Whole genome shotgun sequencing was performed on the Illumina platform, generating 142 billion bases of sequence data. Scaffolds were assembled with SOAPdenovo [8]. These scaffolds were assigned to chromosomes based partly on rheMac2 and partly on human chromosome synteny [8]. Hence, this was not a completely new assembly, as errors in scaffold assignment to chromosomes in rheMac2 were propagated to the CR_1.0 assembly. Further, the CR_1.0 contig N50 was much lower than that of rheMac2, indicating a more fragmented genome. Annotations for CR_1.0 are available in the form of a GFF file. Although Ensembl gene IDs are provided in this file, gene names and gene descriptions are not, which limits the use of these annotations for NGS.

We have produced a new rhesus genome (MacaM) with an assembly that is not dependent on rheMac2. Further, we provide an annotation in a form that can be immediately and productively used for NGS studies, i.e., a GTF file that provides meaningful gene names and gene descriptions for a significant portion of the rhesus macaque genome. We demonstrate that both the assembly and the annotation of our new rhesus genome, MacaM, offer significant improvements over rheMac2 and CR_1.0.

Methods

Genomic DNA sequencing

We obtained genomic DNA from the reference rhesus macaque (animal 17573) [1] and performed whole genome Illumina sequencing on a GAIIx instrument, yielding 107 billion bases of sequence data. We deposited these sequences in the Sequence Read Archive (SRA) under accessions [GenBank:SRX112027, GenBank:SRX113068, GenBank:SRX112904].
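As a rough check on the sequencing depth implied by this yield, average fold coverage can be estimated as total sequenced bases divided by genome size. The sketch below assumes a genome size of about 3.0 Gb, a value that is not stated in the text; the function and constant names are illustrative only.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Return average fold coverage: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Assumption: a genome size of roughly 3.0 Gb (not given in the text).
ASSUMED_GENOME_SIZE_BP = 3.0e9

# Reported yield of the whole genome Illumina sequencing run.
WGS_BASES = 107e9

wgs_coverage = fold_coverage(WGS_BASES, ASSUMED_GENOME_SIZE_BP)
print(f"Estimated whole genome coverage: {wgs_coverage:.1f}x")
```

Under the assumed genome size, the estimate comes out at roughly 36-fold, in line with the approximately 35-fold Illumina coverage reported for the assembly input.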
In addition, we used a human exome capture kit (Illumina TruSeq Exome Enrichment) to enrich exonic sequence from the reference rhesus macaque genomic DNA. Illumina HiSeq2000 sequencing of exonic fragments from this animal generated a total of 17.7 billion bases of data. We deposited these sequences in the SRA under accession [GenBank:SRX115899].

Contig and scaffold assembly

We assembled the combined set of Sanger (approximately 6× coverage), Illumina whole genome shotgun (approximately 35× coverage) and exome reads using the MaSuRCA (then called MSR-CA) assembler, version 1.8.3 [9]. We pre-screened and pre-trimmed the Sanger data with the standard set of vector and contaminant sequences used by the GenBank submission validation pipeline. The MaSuRCA assembler is based on the concept of super-read reduction, whereby the high-coverage Illumina data is transformed into 3-4× coverage by much longer super-reads. This transformation is done by uniquely extending the Illumina reads using k-mers and then combining the reads that extend to the same sequence. We transformed the exome sequence data from the reference animal into a separate set of exome super-reads. We then used these exome super-reads along with the Sanger and whole genome shotgun Illumina data in the assembly. The exome super-reads were marked as non-random and were therefore excluded from the contig coverage evaluation step, which is designed to distinguish between unique and repeat contigs.

Chromosome assembly steps

A flowchart (Figure 1) illustrates the overall process of assembly and annotation. We used BLAST+ (version 2.2.25) for all BLASTn [10] alignments.
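The super-read reduction described under contig and scaffold assembly can be illustrated with a toy sketch. This is not MaSuRCA's implementation. It assumes error-free reads from a single strand, uses a plain set of k-mers instead of a k-mer count table, and ignores reverse complements. All function names and the toy sequence are hypothetical. Each read is extended base by base while exactly one k-mer continues it, and reads that extend to the same sequence are then grouped under one super-read.

```python
from collections import defaultdict


def build_kmer_set(reads, k):
    """Collect every k-mer observed in the reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def unique_next_base(kmer, kmers):
    """Return the only base that extends kmer to the right, else None."""
    candidates = [b for b in "ACGT" if kmer[1:] + b in kmers]
    return candidates[0] if len(candidates) == 1 else None


def unique_prev_base(kmer, kmers):
    """Return the only base that extends kmer to the left, else None."""
    candidates = [b for b in "ACGT" if b + kmer[:-1] in kmers]
    return candidates[0] if len(candidates) == 1 else None


def extend_read(read, kmers, k, max_steps=10_000):
    """Extend a read in both directions while the extension is unique."""
    if len(read) < k:
        return read  # too short to anchor an extension
    seq = read
    for _ in range(max_steps):  # max_steps guards against cycles
        base = unique_next_base(seq[-k:], kmers)
        if base is None:
            break  # fork or dead end on the right
        seq += base
    for _ in range(max_steps):
        base = unique_prev_base(seq[:k], kmers)
        if base is None:
            break  # fork or dead end on the left
        seq = base + seq
    return seq


def build_super_reads(reads, k):
    """Group reads by the sequence they extend to (the super-read)."""
    kmers = build_kmer_set(reads, k)
    groups = defaultdict(list)
    for read in reads:
        groups[extend_read(read, kmers, k)].append(read)
    return groups


# Toy data: overlapping 8 bp reads sampled from a made-up 20 bp sequence.
toy_genome = "ATGGCGTACCTTAGCAGTCA"
reads = [toy_genome[s:s + 8] for s in range(0, 13, 3)]

super_reads = build_super_reads(reads, k=5)
for super_read, members in super_reads.items():
    print(super_read, "<-", len(members), "reads")
```

In this toy case every read extends to the same sequence, so five short reads collapse into a single longer super-read. This mirrors, at a very small scale, how high-coverage short-read data can be replaced by lower-coverage, longer sequences.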
