Bioinformatics tools for gene annotation

dxx7bhanu 26 views 47 slides Mar 01, 2025


Bioinformatics tools in the annotation and assessment of next-generation sequencing data
Bhanu Krishan (2185003), MSc Biotechnology

Contents
Next Generation Sequencing
The need for annotation
Annotation and tools
Assessment in NGS technology
Assessment tools
Conclusion
References

Next Generation Sequencing
Next-generation sequencing (NGS) has made great strides in sequencing technology.
It enables high-throughput sequencing of genes at low cost.
One important application is early disease diagnosis.
It can identify mutations that conventional sequencing technologies cannot detect.
NGS platforms such as Illumina, Roche and ABI/SOLiD are used for the wet-lab part of an NGS analysis, and computational tools such as BWA, Bowtie, Galaxy and SanGeniX are used for the dry-lab part (Wadapurkar et al., 2018).

Figure: Next Generation Sequencing; source: Illumina.com

Annotation
Genome annotation is the process of identifying functional elements along the sequence of a genome, thus giving meaning to it.
It is necessary because DNA sequencing produces sequences of unknown function.
Over the last three decades, genome annotation has evolved from its starting point: the computational annotation of long protein-coding genes on single genomes and the experimental annotation of short regulatory elements.

Figure 1: Illustration of an annotated sequence; source: The G-cat

The need for Annotation in NGS
Sequencing a genome or DNA yields sequence information without its functional role.
After a genome is sequenced, it must be annotated to bring out:
More logical information about its structure
Its functional role
Annotation consists of three steps:
1. Identifying portions of the genome that do not code for protein
2. Identifying elements on the genome
3. Attaching biological information to these elements (Harbola et al., 2022).

Annotation
Given the sequence of a genome, we can identify:
Exon boundaries and splice sites
Beginning and end of translation
Alternative splicing
Regulatory elements
The only certain way to do this is experimentally. Computational methods can:
Achieve moderate accuracy quickly and cheaply
Help direct experimental approaches

Methods of gene prediction
Methods for gene prediction and annotation fall into two groups:
Sequence similarity searches / homology-based approaches
Ab initio methods, which are either signal-based or content-based

Bioinformatics Tools for Annotation
GeneMark
A combination of several gene prediction programs developed at the Georgia Institute of Technology, USA.
An effective tool for predicting genes in varied sequences: prokaryotes, eukaryotes, viruses, phages, plasmids and transcripts.
Available for download and local installation.
Based on HMMs and heuristic algorithms.
It is part of the genome annotation pipelines at NCBI, JGI and the Broad Institute (Besemer et al., 2001).

http://opal.biology.gatech.edu/GeneMark
Figure 2: Home page of GeneMark

Figure 3: Openly accessible GeneMark tool for prokaryotic genes

Figure 4: Output options in GeneMark

Figure 5: Coordinates of predicted genes
Figure 6: Protein sequence of query sequence

Figure 7: Prediction of a single gene (with seven exons) made by the eukaryotic version of GeneMark.hmm for a fragment of the Arabidopsis thaliana genome; source: Oxford Academic

Limitations of GeneMark
The output of the GeneMark program consists of a list of ORFs predicted as genes.
The GeneMark programs will not find genes in masked areas (sequences of ‘N’ characters); thus, the predictions will be compatible with this extrinsic information.
Detecting exact gene starts remains a challenging problem in gene finding, as many genes have relatively weak patterns indicating the sites of translation and transcription initiation.
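
To make "a list of ORFs" and the effect of masking concrete, here is a minimal Python sketch of a forward-strand ORF scan. It is not GeneMark's method (GeneMark relies on Markov models, not a plain start/stop scan); the function name `find_orfs` and the `min_codons` cutoff are assumptions made for illustration only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) 0-based half-open coordinates of forward-strand ORFs.

    An ORF here is ATG ... stop codon, read in frame, with no 'N' in any codon.
    Illustrative only: reverse strand and alternative start codons are ignored.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if "N" in codon:
                start = None  # masked region: abandon any open ORF
            elif start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGGG"))   # one ORF spanning ATG..TAG
    print(find_orfs("CCATGAAANNNTAGGG"))   # same region masked: nothing reported
```

Because any codon containing `N` closes the open reading frame, a masked stretch can never be part of a reported ORF, which mirrors the behaviour described above.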

GENSCAN
Designed to predict complete gene structures.
Uses a generalised HMM: the structure of the genomic sequence is modelled by explicit state duration HMMs.
Signals are modelled by weight matrices, weight arrays and maximal dependence decomposition.
A probability score is provided for each predicted exon.

http://genes.mit.edu/GENSCAN.html
Figure 8: GENSCAN web server

Figure 9: GENSCAN output for a given sequence

Limitations of GENSCAN
Accepts input sequences of up to 200 kilobases.
Primarily for human/vertebrate sequences; accuracy may be lower for non-vertebrates.
The resulting statistics may not be representative.

GeneID
One of the oldest gene prediction programs, used to predict genes in anonymous genomic sequences and designed with a hierarchical structure:
1. Splice sites and start and stop codons are predicted and scored along the sequence using position weight matrices (PWMs).
2. Exons are built from the sites. Exons are scored as the sum of the scores of their defining sites, plus the log-likelihood ratio of a Markov model for coding DNA.
3. From the predicted exons, the gene structure is assembled (Parra et al., 2000).
Very fast; it scales linearly with the length of the sequence (in both time and memory).
Trained on Drosophila and human.
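
The first step, scoring candidate sites with a position weight matrix, can be sketched in Python as follows. This is a toy illustration and not GeneID's code: the training sites, the pseudocount, the uniform background of 0.25 and the function names `build_pwm`, `score_site` and `best_site` are all assumptions, and sequences are assumed to contain only A, C, G and T.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned example sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # log2 of (observed frequency / background frequency) per base
        pwm.append({b: math.log2((c / total) / background) for b, c in counts.items()})
    return pwm

def score_site(pwm, window):
    """Sum of per-position log-odds scores for one candidate site."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_site(pwm, seq):
    """Return (score, position) of the highest-scoring window in seq."""
    width = len(pwm)
    scored = [(score_site(pwm, seq[i:i + width]), i)
              for i in range(len(seq) - width + 1)]
    return max(scored)

if __name__ == "__main__":
    # Hypothetical donor-like training sites, made up for this example.
    training = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT"]
    pwm = build_pwm(training)
    score, position = best_site(pwm, "TTTTCAGGTAAGTCCCC")
    print(position, round(score, 2))
```

In the hierarchy described above, scores like these would then be summed per exon and combined with a coding-potential term; that second step is not shown here.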

Accessible via: https://genome.crg.es/cgi-bin/GeneID_cgi/geneid_2002/geneid_2002.cgi
Figure 10: GeneID web server

Figure 11: Predicted results for the given sequence

Key features of GeneID
Currently, geneid v1.2 analyzes the whole human genome in 3 hours (approx. 1 Gbp/hour) on an Intel(R) Xeon 2.80 GHz processor.
geneid output can be customized to different levels of detail, including an exhaustive listing of potential signals and exons.
Several output formats, such as GFF or XML, are available.
Parameter files are available in geneid v1.2 for Drosophila melanogaster, human (which can also be used for vertebrate genomes), Dictyostelium discoideum and Tetraodon nigroviridis (which can be used for Fugu rubripes), among many others for species spanning the four "classical" kingdoms.
The additional currently available parameter files can be found under the section "geneid parameter files".

HMMgene
HMMgene is a program for the prediction of genes in anonymous DNA.
The program predicts whole genes, so the predicted exons always splice correctly.
It can predict whole or partial genes in one sequence.
It can also be used to predict splice sites and start/stop codons.
It is based on a hidden Markov model.
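
To illustrate how a hidden Markov model labels a sequence, here is a minimal Python sketch of Viterbi decoding on a toy two-state model (coding versus noncoding). It is not HMMgene's model, which has many more states for exons, introns and signals; every probability below is invented for the example, and the input is assumed to contain only A, C, G and T.

```python
import math

STATES = ("noncoding", "coding")
# All probabilities are made-up illustration values, not trained parameters.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}

def viterbi(seq):
    """Most probable state path for seq under the toy HMM (computed in log space)."""
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        prev = scores[-1]
        col, ptr = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            col[s] = prev[best_prev] + math.log(TRANS[best_prev][s]) + math.log(EMIT[s][base])
            ptr[s] = best_prev
        scores.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

if __name__ == "__main__":
    print(viterbi("ATATATATGCGCGCGCGC"))
```

Because the whole path is chosen jointly, the labels are consistent along the sequence; this is the property that lets a gene-level HMM guarantee that predicted exons fit together.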

Figure 11: HMMgene openly accessible web page

Figure 12: HMMgene server output

Key features of HMMgene
Sensitivity and specificity at the whole-gene level are given as 99% to 95%.
One of the main problems for HMMgene is that it does not do a good job on sequences with a low G/C content, which is a common problem for automated gene finding methods; see e.g. Snyder & Stormo (1995).
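
Since low G/C content is named as a problem case, it can be useful to check a sequence's G/C fraction before running a gene finder. The following Python sketch does that; the window size and the 0.35 threshold are arbitrary assumptions, not values taken from HMMgene, and the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C among called bases (A, C, G, T); 0.0 if none are called."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called

def low_gc_windows(seq, window=100, threshold=0.35):
    """Start positions of non-overlapping windows whose G/C fraction is below threshold."""
    return [i for i in range(0, len(seq) - window + 1, window)
            if gc_content(seq[i:i + window]) < threshold]

if __name__ == "__main__":
    example = "AT" * 50 + "GC" * 50
    print(gc_content(example))
    print(low_gc_windows(example, window=100))
```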

Other software tools include: AUGUSTUS, FGENESH, GENIE, GeneMark, VEIL, PROCRUSTES

Assessment Tools
Sequencing generates large volumes of data, and the assessment required can be intimidating.
In this step, the quality of NGS reads is evaluated so that reads not meeting the standards can be removed, corrected or trimmed.
Various computational tools are used for this; they assess read quality by calculating quality scores that reflect sequencing errors.
After the quality of the NGS reads has been assessed, the reads are aligned to the reference genome.
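
As a small illustration of quality scores and trimming, the Python sketch below decodes a FASTQ quality string, converts a Phred score Q to an error probability via P = 10^(-Q/10), and trims low-quality bases from the 3' end of a read. It assumes the common Phred+33 encoding; the cutoff of Q20 and the function names are assumptions for the example, not the behaviour of any particular tool.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def trim_3prime(seq, quality_string, min_q=20, offset=33):
    """Drop bases from the 3' end until a base with score >= min_q is reached."""
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality_string[:end]

if __name__ == "__main__":
    print(phred_scores("I5#"))
    print(error_probability(20))
    print(trim_3prime("ACGTAC", "IIII#!"))
```

A read that is trimmed to length zero by such a rule would be a candidate for removal, which corresponds to the "remove" case above.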

Assessment Tools The NGS data analysis process includes three main steps: primary, secondary, and tertiary data analysis. 

BFAST
BFAST (BLAT-like Fast Accurate Search Tool) is a fast and accurate alignment tool.
It provides a powerful and complete means to perform billions of short sequence alignments within the context of large genomes in a highly sensitive and tuneable manner.
BFAST performs alignment in two steps. First, using multiple indexes of the reference genome, BFAST identifies candidate alignment locations (CALs) for each read. Next, the read is aligned at each CAL using gapped local alignment to identify the best match.
These processes are supported for direct sequence reads (the typical output of platforms based on sequencing-by-synthesis, such as the Illumina, 454 or Helicos sequencers) as well as reads in two-base color-encoded form, which is the primary output of the ligation-based ABI SOLiD platform (Homer et al., 2009).
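
The two-step idea (index lookup for candidate locations, then gapped local alignment at each candidate) can be sketched in Python as below. This is a simplified illustration under stated assumptions, not BFAST's implementation: it uses a single contiguous k-mer index where BFAST uses multiple indexes, a plain Smith-Waterman score with a linear gap penalty, and hypothetical function names and scoring values throughout.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_locations(read, index, k):
    """Step 1: reference start positions implied by any exact k-mer hit of the read."""
    cals = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            if hit - offset >= 0:
                cals.add(hit - offset)
    return sorted(cals)

def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

def align_read(read, reference, index, k, pad=2):
    """Step 2: score the read at each candidate location; return (score, location)."""
    best = None
    for cal in candidate_locations(read, index, k):
        lo = max(0, cal - pad)
        window = reference[lo:cal + len(read) + pad]
        score = local_alignment_score(read, window)
        if best is None or score > best[0]:
            best = (score, cal)
    return best  # None if no k-mer of the read occurs in the reference

if __name__ == "__main__":
    reference = "GATTACAGGCTTCAGTCCATGA"
    index = build_index(reference, k=4)
    print(align_read("CAGGCTTCAG", reference, index, k=4))   # exact read
    print(align_read("CAGGCATCAG", reference, index, k=4))   # read with one mismatch
```

The index keeps the expensive local alignment confined to a few short reference windows instead of the whole genome, which is the point of the two-step design.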

Figure 13: Steps involved in BFAST

Figure 14: Comparative analysis of BFAST with other assessment tools; source: NCBI

FASTQC
A quality control tool for high-throughput sequence data.
Requires a suitable Java Runtime Environment.
FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high-throughput sequencing pipelines.
It provides a modular set of analyses which you can use to get a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.
Current version: 0.11.9, released on 8 January 2020. Changes in this version:
Fixed a bug when analysing empty files
Added support for multi-read fast5 files
Fixed a corner case bug in adapter detection
Bundled a JRE with the OSX build so you don't have to install it
Fixed a hang if the program runs out of memory

FASTQC
The main functions of FastQC are:
Import of data from BAM, SAM or FastQ files (any variant)
Providing a quick overview to tell you in which areas there may be problems
Summary graphs and tables to quickly assess your data
Export of results to an HTML-based permanent report
Offline operation to allow automated generation of reports without running the interactive application

Figure 14: Quality score for the given sequence

Figure 14: Per base sequence content; source: Babraham Bioinformatics

Figure 15: Duplicate sequences

BOWTIE
Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes.
Bowtie uses a novel indexing strategy to create an ultrafast, memory-efficient short-read aligner geared toward mammalian re-sequencing.
Bowtie aligns 35-base-pair (bp) reads at a rate of more than 25 million reads per CPU-hour, which is more than 35 times faster than Maq and 300 times faster than SOAP under the same conditions.

BOWTIE
Bowtie employs a Burrows-Wheeler index based on the full-text minute-space (FM) index, which has a memory footprint of only about 1.3 gigabytes (GB) for the human genome.
The small footprint allows Bowtie to run on a typical desktop computer with 2 GB of RAM.
Bowtie has been used to align 14.3× coverage worth of human Illumina reads from the 1,000 Genomes project in about 14 hours on a single desktop computer with four processor cores (Langmead et al., 2009).
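
The idea behind a Burrows-Wheeler/FM index can be shown with a short Python sketch: build the Burrows-Wheeler transform of a text, then count exact occurrences of a pattern by backward search. This is a teaching-sized illustration, not Bowtie's implementation: it sorts all rotations and stores full occurrence tables, which would not fit the memory figures quoted above, it handles exact matches only, and the function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations; '$' marks the end of the text."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def fm_index(bwt_string):
    """Return (C, occ) for backward search.

    C[c]      = number of characters in the text that sort before c
    occ[c][i] = number of occurrences of c in bwt_string[:i]
    """
    alphabet = sorted(set(bwt_string))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt_string.count(c)
    occ = {c: [0] for c in alphabet}
    for ch in bwt_string:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return C, occ

def count_occurrences(pattern, C, occ, n):
    """Backward search: number of exact matches of pattern in the indexed text."""
    lo, hi = 0, n  # half-open range of matching rows in the sorted rotations
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

if __name__ == "__main__":
    transformed = bwt("GATTACATTA")
    C, occ = fm_index(transformed)
    for pattern in ("TTA", "ATTA", "GAT", "GG"):
        print(pattern, count_occurrences(pattern, C, occ, len(transformed)))
```

Each pattern character costs a constant number of table lookups regardless of genome size, which is why this family of indexes suits aligning many short reads to one large reference.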

Downloadable via: http://bowtie-bio.sourceforge.net/index.shtml

Figure 15: Alignment algorithm in Bowtie; source: Nucleic acids research

Other software tools include: DESeq, FASTX Clipper, VCAKE, SOAP3, Velvet, Trinity, Cufflinks

Conclusion
Next-generation sequencing (NGS) has made great strides in sequencing technology.
Sequencing a genome or DNA yields sequence information without its functional role.
Through annotation, the functional elements of a given sequence can be determined.
This sophisticated process requires computational tools such as GeneMark, GENSCAN and others.
Assessment of next-generation sequencing data involves evaluating the quality of NGS reads so that reads not meeting the standards can be removed, corrected or trimmed.
Assessment tools are also used to assess the quality of genome assemblies and gene structure by assessing the gene- and repeat-space completeness of an input genome assembly and screening for both vector and adaptor contamination.
Thus, these tools allow researchers to benchmark their metrics relative to a gold-standard reference genome.

References
Wadapurkar RM, Vyas R. Computational analysis of next generation sequencing data and its applications in clinical oncology. Informatics in Medicine Unlocked. 2018 Jan 1;11:75-82.
Harbola A, Negi D, Manchanda M, Kesharwani RK. Bioinformatics and biological data mining. In: Bioinformatics. 2022 Jan 1 (pp. 457-471). Academic Press.
Parra G, Blanco E, Guigó R. Geneid in Drosophila. Genome Research. 2000 Apr 1;10(4):511-5.
Homer N, Merriman B, Nelson SF. BFAST: an alignment tool for large scale genome resequencing. PLoS ONE. 2009 Nov 11;4(11):e7767.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009 Mar;10(3):R25.
Besemer J, Lomsadze A, Borodovsky M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Research. 2001 Jun 15;29(12):2607-18.