Bioinformatics tools in the annotation and assessment of next-generation sequencing data 2185003 MSc. Biotechnology Bhanu Krishan
Contents: Next Generation Sequences; The need for Annotation; Annotation and tools; Assessment in NGS technology; Assessment tools; Conclusion; References
Next Generation Sequencing Next-generation sequencing (NGS) has made great strides in sequencing technology. It enables high-throughput sequencing of genes at low cost. One of its important applications is early disease diagnosis. It can identify mutations that cannot be identified using conventional sequencing technologies. Various NGS platforms, such as Illumina, Roche and ABI SOLiD, are used for wet-lab analysis of NGS data, and computational tools such as BWA, Bowtie, Galaxy and SanGeniX are used for dry-lab analysis of NGS data (Wadapurkar et al., 2018).
Next Generation Sequencing Source: Illumina.com
Annotation Genome annotation is the process of identifying functional elements along the sequence of a genome, thus giving meaning to it. It is necessary because the sequencing of DNA produces sequences of unknown function. In the last three decades, genome annotation has evolved from the computational annotation of long protein-coding genes on single genomes and the experimental annotation of short regulatory elements.
Figure 1: Illustration of an annotated sequence; source: The G-cat
The need for Annotation in NGS Sequencing a genome or DNA generates sequence information without any indication of its functional role. After the genome is sequenced, it must be annotated to provide: more logical information about its structure; its functional role. Annotation consists of three steps: recognizing portions of the genome that do not code for protein; recognizing the elements on the genome; attaching biological information to these elements (Harbola et al., 2022).
Annotation Given the sequence of a genome, we can identify: exon boundaries and splice sites; the beginning and end of translation; alternative splicing; regulatory elements. The only certain way to do this is experimentally. Computational methods can: achieve moderate accuracy quickly and cheaply; help direct experimental approaches.
Methods of gene prediction Sequence similarity searches / homology-based approaches; ab initio methods (signal-based and content-based). Methods for Gene Prediction and Annotation
Bioinformatics Tools for Annotation GeneMark A combination of several gene prediction programs developed at the Georgia Institute of Technology, USA. An effective tool for predicting genes in varied organisms and sequence types such as prokaryotes, eukaryotes, viruses, phages, plasmids and transcripts. Available for download and local installation. Based on hidden Markov models (HMMs) and heuristic algorithms. It is part of the genome annotation pipelines at NCBI, JGI and the Broad Institute (Besemer et al., 2001).
http://opal.biology.gatech.edu/GeneMark Figure 2 : Home page of GeneMark
Figure 3: Open accessible GeneMark tool for Prokaryotic gene
Figure 4 : Output options in GeneMark
Figure 5: Coordinates of Predicted genes Figure 6: Protein sequence of query sequence
Figure 7: Prediction of a single gene (with seven exons) made by the eukaryotic version of GeneMark.hmm for a fragment of the Arabidopsis thaliana genome; Source: Oxford Academic
Limitations of GeneMark The output of the GeneMark program consists of a list of ORFs predicted as genes. The GeneMark programs will not find genes in masked areas (sequences of ‘N’ characters); thus, the predictions will be compatible with this extrinsic information. Detecting exact gene starts remains a challenging problem in gene finding, as many genes have relatively weak patterns indicating sites of translation and transcription initiation.
GENSCAN Designed to predict complete gene structures. Uses a generalised HMM: the structure of the genomic sequence is modelled by explicit state duration HMMs. Signals are modelled by weight matrices, weight arrays and maximal dependence decomposition. A probability score is provided for each predicted exon.
http://genes.mit.edu/GENSCAN.html Figure 8: GenScan web server
Figure 9 : GenScan output result for given sequence
Limitations of GenScan Accepts input sequences up to 200 kilobases in length. Designed primarily for human/vertebrate sequences; accuracy may be lower for non-vertebrates. Resulting statistics may not be representative.
GeneID One of the oldest gene prediction programs, used to predict genes in anonymous genomic sequences and designed with a hierarchical structure. 1. Splice sites and start and stop codons are predicted and scored along the sequence using position weight matrices (PWMs). 2. Exons are built from the sites. Exons are scored as the sum of the scores of their defining sites, plus the log-likelihood ratio of a Markov model for coding DNA. 3. From the predicted exons, the gene structure is assembled (Parra et al., 2000). Very fast, scaling linearly with the length of the sequence (both in time and memory). Trained with Drosophila and human.
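Step 1 of the hierarchy above (scoring candidate signals with a position weight matrix) can be illustrated with a minimal sketch. The matrix values, the 4-base window and the function names `pwm_score` and `scan_sites` are hypothetical and chosen only for illustration; they are not GeneID's trained parameters or its code. The score is a sum of per-position log-likelihood ratios of a site model against a uniform background.

```python
import math

# Hypothetical toy PWM for a 4-base window around a donor splice site.
# Each entry is P(base | position); the values are illustrative assumptions,
# not parameters taken from GeneID.
SITE_PROBS = [
    {"A": 0.60, "C": 0.10, "G": 0.20, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},  # G of the GT dinucleotide
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},  # T of the GT dinucleotide
]
# Assumed uniform background model.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def pwm_score(window):
    """Sum of log2 likelihood ratios (site model vs background) over the window."""
    if len(window) != len(SITE_PROBS):
        raise ValueError("window length must match the PWM width")
    return sum(
        math.log2(SITE_PROBS[i][base] / BACKGROUND[base])
        for i, base in enumerate(window)
    )


def scan_sites(sequence, threshold=0.0):
    """Slide the PWM along the sequence and keep windows scoring above threshold."""
    width = len(SITE_PROBS)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BACKGROUND for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = pwm_score(window)
        if score > threshold:
            hits.append((start, window, score))
    return hits


for start, window, score in scan_sites("CCAGGTCC"):
    print(start, window, round(score, 2))
```

In a real gene finder the matrix would be estimated from aligned known sites, and the site scores would then be combined with a coding-potential score to rank candidate exons, as described in step 2.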
Figure 10: Predicted results for the given sequence
Key features of GeneID Currently, geneid v1.2 analyzes the whole human genome in 3 hours (approx. 1 Gbp/hour) on an Intel(R) Xeon 2.80 GHz CPU. Geneid output can be customized to different levels of detail, including an exhaustive listing of potential signals and exons. Furthermore, several output formats, such as GFF or XML, are available. Parameter files are available in geneid v1.2 for Drosophila melanogaster, human (which can also be used for vertebrate genomes), Dictyostelium discoideum and Tetraodon nigroviridis (which can be used for Fugu rubripes), among many others for species spanning the four "classical" kingdoms. The additional currently available parameter files can be found under the section "geneid parameter files".
HMMgene HMMgene is a program for prediction of genes in anonymous DNA. The program predicts whole genes, so the predicted exons always splice correctly. It can predict whole or partial genes in one sequence. It can also be used to predict splice sites and start/stop codons. It is based on a hidden Markov model.
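To make the hidden Markov model idea concrete, here is a minimal sketch of Viterbi decoding with a two-state toy model (N = non-coding, C = coding). The states, the probabilities and the assumption that coding regions are simply GC-rich are hypothetical simplifications for illustration; HMMgene's actual model is far richer and is not reproduced here.

```python
import math

# Hypothetical two-state toy HMM; all probabilities are illustrative assumptions.
STATES = ("N", "C")  # N = non-coding, C = coding
START = {"N": 0.5, "C": 0.5}
TRANS = {
    "N": {"N": 0.9, "C": 0.1},
    "C": {"N": 0.1, "C": 0.9},
}
EMIT = {
    "N": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},  # AT-rich background
    "C": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35},  # GC-rich "coding" state
}


def viterbi(sequence):
    """Return the most probable state path for the sequence (log-space Viterbi)."""
    if not sequence:
        return ""
    # Initialisation: start probability times emission of the first base.
    score = {
        s: math.log(START[s]) + math.log(EMIT[s][sequence[0]]) for s in STATES
    }
    backpointers = []
    # Recursion: for each base, keep the best predecessor of every state.
    for base in sequence[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointer[s] = prev
        backpointers.append(pointer)
        score = new_score
    # Traceback from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("AAAAAAGGGGGG"))
```

Because the whole path is decoded jointly rather than base by base, the labels change only where the evidence outweighs the cost of a state transition. This is the property that lets HMM-based gene finders return structurally consistent predictions.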
Figure 11 : HMMgene open accessible webpage
Figure 12 : HMMgene server output
Key features of HMMgene The sensitivity and specificity at the whole-gene level are 99% to 95%. One of the main problems for HMMgene is that it does not perform well on sequences with a low G/C content, which is a common problem for automated gene finding methods; see e.g. Snyder & Stormo (1995).
Other software/tools include: AUGUSTUS, FGENESH, GENIE, GeneMark, VEIL, PROCRUSTES
Assessment Tools Sequencing generates large volumes of data, and the assessment required can be intimidating. In this step, the quality of NGS reads is evaluated in order to remove, correct or trim reads that do not meet the standards. For this, various computational tools are used, which assess quality by taking sequencing errors into account and calculating quality scores. After the quality of the NGS reads has been assessed, the reads are aligned to the reference genome.
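As an illustration of how quality scores drive trimming, the sketch below decodes Phred scores from a FASTQ quality string and trims low-quality bases from the 3' end of a read. It assumes the Phred+33 (Sanger / Illumina 1.8+) encoding and a cut-off of Q20; both are assumptions for the example, and the function names are hypothetical rather than taken from any particular tool. A Phred score Q corresponds to an error probability P = 10^(-Q/10).

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Convert a Phred score Q into the base-call error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_3prime(sequence, quality_string, min_q=20, offset=33):
    """Remove bases from the 3' end until a base with quality >= min_q is reached."""
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality_string[:end]


read, quals = "ACGTAC", "IIII##"
print(phred_scores(quals))
print(trim_3prime(read, quals))
```

Production trimmers usually apply sliding-window or running-sum criteria rather than this simple end scan, so the example only shows the principle.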
Assessment Tools The NGS data analysis process includes three main steps: primary, secondary, and tertiary data analysis.
BFAST BFAST (BLAT-like Fast Accurate Search Tool) is a fast and accurate alignment tool. It provides a powerful and complete means to perform billions of short sequence alignments within the context of large genomes in a highly sensitive and tuneable manner. BFAST performs alignment in two steps. First, using multiple indexes of the reference genome, BFAST identifies candidate alignment locations (CALs) for each read. Next, the read is aligned at each CAL using gapped local alignment to identify the best match. These processes are supported for direct sequence reads (the typical output of platforms based on sequencing-by-synthesis, such as the Illumina, 454 or Helicos sequencers) as well as reads in two-base color-encoded form, which is the primary output of the ligation-based ABI SOLiD platform (Homer et al., 2009).
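The two-step strategy described above can be sketched in a few lines. This is a simplified illustration, not BFAST's implementation: it assumes a single contiguous k-mer index (BFAST uses multiple, typically spaced, indexes), a plain Smith-Waterman score with hypothetical scoring values, and invented function names.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_locations(read, index, k):
    """Step 1: each k-mer hit proposes the reference position where the read would start."""
    cals = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            cals.add(pos - offset)
    return sorted(cals)


def local_align_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score of a against b (score only)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def align_read(read, reference, index, k, pad=2):
    """Step 2: gapped local alignment around each CAL; return (best CAL, score) or None."""
    best = None
    for cal in candidate_locations(read, index, k):
        start = max(0, cal - pad)
        end = min(len(reference), cal + len(read) + pad)
        score = local_align_score(read, reference[start:end])
        if best is None or score > best[1]:
            best = (cal, score)
    return best


reference = "TTGACCGTAAGCTTGGA"
index = build_index(reference, 3)
print(align_read("CGTTAGCT", reference, index, 3))  # read carries one mismatch
```

The index lookup keeps the expensive dynamic programming restricted to a handful of short reference windows, which is what makes the approach tractable for billions of reads.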
Figure 13 : Steps involved in BFAST
Figure 14: Comparative analysis of BFAST with other assessment tools; source: NCBI
FASTQC A quality control tool for high-throughput sequence data. Requires a suitable Java Runtime Environment. FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high-throughput sequencing pipelines. It provides a modular set of analyses which you can use to get a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. Current version: 0.11.9, released on 8th January 2019. Changes in this version: fixed a bug when analysing empty files; added support for multi-read fast5 files; fixed a corner case bug in adapter detection; bundled a JRE with the OSX build so you don't have to install it; fixed a hang if the program runs out of memory.
FASTQC The main functions of FastQC are: import of data from BAM, SAM or FastQ files (any variant); providing a quick overview to tell you in which areas there may be problems; summary graphs and tables to quickly assess your data; export of results to an HTML-based permanent report; offline operation to allow automated generation of reports without running the interactive application.
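The kind of summary such a report contains can be approximated with a short sketch that reads FASTQ records and computes the mean quality at each read position and the fraction of duplicated reads. This is not FastQC's code or its exact duplication method; it assumes well-formed four-line FASTQ records with no blank lines and Phred+33 quality encoding, and the function names and the embedded example data are hypothetical.

```python
import io
from collections import Counter


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip()
        yield header[1:], sequence, quality


def per_position_mean_quality(records, offset=33):
    """Mean Phred score at each read position (Phred+33 encoding assumed)."""
    totals, counts = [], []
    for _, _, quality in records:
        for i, ch in enumerate(quality):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def duplicate_fraction(records):
    """Fraction of reads that are exact copies of an earlier read."""
    counts = Counter(sequence for _, sequence, _ in records)
    total = sum(counts.values())
    return 0.0 if total == 0 else 1 - len(counts) / total


# Tiny in-memory example; the reads are made up for illustration.
EXAMPLE_FASTQ = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII55\n@r3\nTTTT\n+\n5555\n"
records = list(read_fastq(io.StringIO(EXAMPLE_FASTQ)))
print(per_position_mean_quality(records))
print(duplicate_fraction(records))
```

A falling mean quality towards the 3' end or a high duplicate fraction are typical signals that would prompt trimming or closer inspection before alignment.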
Figure 14 : Quality score for the given sequence
Figure 14 : Per Base sequence content, Source: Babraham bioinformatics
Figure 15 : Duplicate sequences
BOWTIE Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. Bowtie uses a novel indexing strategy to create an ultrafast, memory-efficient short read aligner geared toward mammalian re-sequencing. Bowtie aligns 35-base pair (bp) reads at a rate of more than 25 million reads per CPU-hour, which is more than 35 times faster than Maq and 300 times faster than SOAP under the same conditions.
BOWTIE Bowtie employs a Burrows-Wheeler index based on the full-text minute-space (FM) index, which has a memory footprint of only about 1.3 gigabytes (GB) for the human genome. The small footprint allows Bowtie to run on a typical desktop computer with 2 GB of RAM. Bowtie has been used to align 14.3× coverage worth of human Illumina reads from the 1,000 Genomes project in about 14 hours on a single desktop computer with four processor cores (Langmead et al., 2009).
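The idea behind a Burrows-Wheeler / FM index can be shown with a minimal exact-match sketch. It is an illustration under simplifying assumptions, not Bowtie's implementation: the suffix array is built by naive sorting, the occurrence table is stored uncompressed, mismatches are not handled, and the function names are hypothetical.

```python
def suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_fm_index(reference):
    """Build the suffix array, first-column offsets and occurrence table."""
    text = reference + "$"  # sentinel; sorts before A, C, G and T
    sa = suffix_array(text)
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text that are smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first, occ


def find(pattern, fm_index):
    """Backward search: return sorted reference positions of exact matches."""
    sa, first, occ = fm_index
    if not pattern:
        return []
    lo, hi = 0, len(sa)  # half-open range of matching suffix-array rows
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


fm_index = build_fm_index("ACGTACGT")
print(find("ACG", fm_index))
```

Each pattern character narrows the matching range with two table lookups, so the search time depends on the read length rather than on the genome size. The small memory footprint quoted above comes from compressing and sampling these tables, which this sketch omits.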
Conclusion Next-generation sequencing (NGS) has made great strides in sequencing technology. Sequencing a genome or DNA generates sequence information without its functional role. Through annotation, the functional elements of a given sequence can be determined. This sophisticated process requires the use of computational tools such as GeneMark, GenScan and others. Assessment of next-generation sequences involves evaluating the quality of NGS reads in order to remove, correct or trim reads that do not meet the standards. Assessment tools are also used to assess the quality of genome assemblies and gene structure by assessing gene- and repeat-space completeness of an input genome assembly and screening for both vector and adaptor contamination. Thus, these tools allow researchers to benchmark their metrics relative to a gold standard reference genome.
References Wadapurkar RM, Vyas R. Computational analysis of next generation sequencing data and its applications in clinical oncology. Informatics in Medicine Unlocked. 2018 Jan 1;11:75-82. Harbola A, Negi D, Manchanda M, Kesharwani RK. Bioinformatics and biological data mining. In: Bioinformatics. 2022 Jan 1 (pp. 457-471). Academic Press. Parra G, Blanco E, Guigó R. GeneID in Drosophila. Genome Research. 2000 Apr 1;10(4):511-5. Homer N, Merriman B, Nelson SF. BFAST: an alignment tool for large scale genome resequencing. PLoS ONE. 2009 Nov 11;4(11):e7767. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009 Mar;10(3):R25. Besemer J, Lomsadze A, Borodovsky M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Research. 2001 Jun 15;29(12):2607-18.