Gene identification using bioinformatic tools.pptx

4,334 views 14 slides Mar 02, 2023

About This Presentation

Being able to identify, compare and analyze genes has applications in research areas ranging from medicine to industry.
This presentation is designed to help health science and computational biology students understand gene identification using bioinformatics tools.


Slide Content

1 Gene identification using bioinformatic tools. Okechukwu Francis. Programme: PhD Biotechnology. School of Health Science and Technology (SoHST)

2 What is a Gene? A gene is a region of DNA that encodes a functional product, either a protein or an RNA such as tRNA or rRNA. Genes are the units of inheritance for physical traits, and in some genomes neighbouring genes overlap. What is a Genome? The genome is the complete set of genetic material in an organism (DNA in bacteria, humans and other cellular life; RNA in some viruses). In people, almost every cell in the body contains a complete copy of the genome. The genome contains all of the information needed for a person to develop and grow.

3 What is RefSeq? RefSeq (the NCBI Reference Sequence database) is an open-access, annotated and curated collection of publicly available nucleotide sequences and their protein products. RefSeq was first introduced in 2000. Why is RefSeq important? RefSeq sequences form a foundation for medical, functional, and diversity research. They provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses. A sequence obtained from a sample can therefore be compared with the corresponding RefSeq record, and the differences can give important information such as:
Mutations in cancer patients
The identity of bacteria responsible for an infection
Differences between strains of an organism
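
As a rough illustration of comparing a sample sequence with a reference, the Python sketch below lists single-base differences between two sequences. It assumes the two sequences are already aligned and of equal length, with no insertions or deletions; real comparisons against RefSeq records would normally go through an alignment tool first. The function name point_differences is hypothetical.

```python
def point_differences(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch.

    Assumption: the sequences are pre-aligned and equal in length (no indels).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned to the same length")
    ref, smp = reference.upper(), sample.upper()
    return [(i + 1, r, s) for i, (r, s) in enumerate(zip(ref, smp)) if r != s]


if __name__ == "__main__":
    # Toy sequences invented for illustration only.
    for pos, ref_base, sample_base in point_differences("ATGGCC", "ATGACC"):
        print(f"position {pos}: {ref_base} -> {sample_base}")
```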

4 How to identify genes using bioinformatics tools

5

6 What is GENSCAN? GENSCAN is a bioinformatics program for identifying complete gene structures in genomic DNA. It was developed by Chris Burge and Samuel Karlin at Stanford University, and its web server has been hosted at MIT. It is a Generalized Hidden Markov Model (GHMM) based program that predicts the locations of genes and their exon-intron boundaries in genomic sequences from a variety of organisms (mainly vertebrates, Arabidopsis and maize).

7 Promoter region identification. Promoter regions are DNA sequences, usually located upstream of a gene, that regulate gene expression by binding transcription factors and RNA polymerase to initiate transcription.

8 Promoter region identification can be done using various bioinformatics tools and approaches, such as:
Promoter prediction tools: These are software tools that predict promoter regions based on sequence features such as GC content, TATA boxes, and transcription factor binding sites. Examples of promoter prediction tools include PromoterScan, Neural Network Promoter Prediction, and CpG Island Searcher.
Comparative genomics: This involves comparing the genomes of related organisms to identify conserved promoter regions. Promoter regions are often more conserved across species than other non-coding regions of the genome, which can help identify potential promoter regions. Comparative genomics can be done using tools such as BLAST and ClustalW.
Chromatin immunoprecipitation (ChIP): This is an experimental technique that can be used to identify protein-DNA interactions in vivo, including transcription factor binding to promoter regions. ChIP can be used to identify known and novel promoter regions in a specific cell type or tissue.
Gene expression analysis: Promoter regions are often associated with gene expression levels, and genes with similar expression patterns may share common promoter elements. Gene expression analysis, such as RNA sequencing or microarray analysis, can be used to identify potential promoter regions based on co-expression patterns.
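
To make the sequence-feature idea concrete, here is a minimal Python sketch that scans a sequence for windows with high GC content and a high observed/expected CpG ratio. The default thresholds (200 bp window, GC above 50%, ratio above 0.6) follow the commonly cited Gardiner-Garden and Frommer criteria; they are assumptions here and are not the internals of any tool named on the slide. All function names are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def candidate_windows(seq, window=200, step=1, min_gc=0.5, min_ratio=0.6):
    """Return 0-based start positions of windows that pass both thresholds."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        if gc_fraction(w) > min_gc and cpg_obs_exp(w) > min_ratio:
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Synthetic sequence: a CpG-rich block followed by an AT-rich block.
    toy = "CG" * 100 + "AT" * 100
    print(candidate_windows(toy, window=200, step=100))
```

Windows flagged this way are only candidates; a CpG-rich region is suggestive of a promoter in vertebrate genomes but is not proof of one.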

9 There are several methods that can be used to identify repeats in a genome, including:
Sequence Alignment: Repeats in a genome can be identified by aligning the genome sequence to itself. This can be done using tools such as BLAST or the Smith-Waterman algorithm. Repeats appear as regions with high sequence similarity to other parts of the same genome.
K-mer Analysis: K-mers are short DNA sequences of a fixed length k. Very short values such as 3-6 nucleotides are convenient for small examples, while genome-scale analyses typically use larger k so that matches are less likely to arise by chance. By counting the occurrences of each k-mer in the genome, we can identify regions that are highly repetitive.
De Novo Assembly: De novo assembly is the process of reconstructing the genome sequence from a set of short reads. Repeats can be identified by examining the assembly graph and identifying regions where the reads form loops or bubbles.
RepeatMasker: RepeatMasker is a tool that identifies and masks repetitive elements in genomic sequences. It compares the genome sequence to a library of known repetitive elements and identifies regions that match.
RepeatExplorer: RepeatExplorer is a tool that uses clustering algorithms to identify and classify repetitive elements in genomic sequences. It can be used to visualize the repeat landscape of a genome and identify novel repeat families.
These methods can be used individually or in combination to identify and annotate repeats in a genome.
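
The k-mer counting approach described above can be sketched in a few lines of Python. This is a toy illustration using the standard library only; the function names count_kmers and repeated_kmers are hypothetical, and dedicated k-mer counters are used in practice for whole genomes.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeated_kmers(seq: str, k: int, min_count: int = 2) -> dict:
    """Return the k-mers that occur at least min_count times."""
    return {kmer: n for kmer, n in count_kmers(seq, k).items() if n >= min_count}


if __name__ == "__main__":
    # Toy sequence invented for illustration: a short AT repeat followed by GC.
    print(repeated_kmers("ATATATGC", k=3))
```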

10 How to identify repeats in a genome. Identifying repeats in a genome is an important task in genomic analysis.

11 ORF Prediction ORF (Open Reading Frame) prediction is the process of identifying potential protein-coding regions within a genomic sequence. Here are some common methods for predicting ORF sequences:

12
Start and Stop Codon Detection: ORFs typically begin with a start codon (ATG in DNA, AUG in mRNA; rarely GTG in prokaryotes) and end with a stop codon (TAA, TAG, or TGA). One approach to ORF prediction is to scan each reading frame of the genomic sequence for start and stop codons and report every ORF flanked by them.
Codon Usage Bias: ORFs in prokaryotic genomes often show a strong bias towards certain codons. This bias can be used to predict ORFs by identifying regions of the genome with codon usage patterns consistent with protein-coding regions.
Comparative Genomics: ORF prediction can also be aided by comparing the genome sequence to related genomes. ORFs conserved between species are more likely to be protein-coding and can be used to guide ORF prediction in the target genome.
Machine Learning: Statistical and machine learning models, such as Hidden Markov Models (HMMs) and neural networks, can be trained on known protein-coding regions to predict ORFs in a genome. These methods can also incorporate information from other genomic features, such as codon usage and RNA secondary structure.
Gene Finding Software: Many software tools use a combination of the above methods to predict ORFs in a genome, such as Glimmer, GeneMark, and Augustus.
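
The start and stop codon scan described above can be sketched in Python as follows. This is a simplified illustration: it considers only ATG as a start codon, reports the longest ORF for each stop codon, and works on one strand at a time (pass the reverse complement to cover the other strand). The function names and the min_codons parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence containing only A, C, G, T."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq: str, min_codons: int = 30) -> list[tuple[int, int, int]]:
    """Return (frame, start, end) for each ORF on the given strand.

    Coordinates are 0-based and half-open, and end includes the stop codon.
    min_codons counts the codons from the start codon up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first ATG after the previous stop in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    # Toy sequence invented for illustration.
    toy = "CC" + "ATG" + "GCT" * 3 + "TAA" + "G"
    for frame, start, end in find_orfs(toy, min_codons=3):
        print(frame, start, end, toy[start:end])
    # To scan the opposite strand: find_orfs(reverse_complement(toy), min_codons=3)
```

A length threshold such as min_codons is the simplest filter against spurious short ORFs; the codon usage and comparative methods listed above are what gene finders add on top of this basic scan.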

13 Glimmer (Gene Locator and Interpolated Markov ModelER). Glimmer is a popular software tool used in bioinformatics for gene prediction in bacterial and archaeal genomes. Here are the steps for using Glimmer for gene prediction:
Input Sequence: The first step is to provide the genomic sequence to Glimmer in FASTA format.
Training: A set of known or likely gene sequences is used to build the interpolated Markov model that Glimmer uses for prediction. The training data can be obtained from long ORFs in the genome itself, from a related genome, or from experimental data such as RNA-seq or proteomics.
Running Glimmer: Glimmer is then run with the trained model to predict genes in the genome. It outputs a list of predicted genes with information such as the start and stop coordinates, the reading frame, and a score. Coding sequences can be extracted from these coordinates; gene products are assigned during post-processing.
Post-Processing: The predicted genes can be further processed to remove false positives and to annotate the genes with additional information such as gene ontology (GO) terms and functional annotations. This can be done using tools such as BLAST and InterProScan.
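
To illustrate the train-then-score idea behind these steps, the Python sketch below trains a fixed-order Markov model on known coding sequences and scores candidate sequences by log-likelihood. This is a deliberately simplified stand-in and is not Glimmer's algorithm: Glimmer uses interpolated Markov models that combine several context lengths and model each codon position. The function names, the order of 2, and the pseudocount are assumptions made for this example.

```python
import math
from collections import defaultdict

BASES = "ACGT"


def train_markov(sequences, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            if seq[i] in BASES:  # skip ambiguous bases such as N
                counts[seq[i - order:i]][seq[i]] += 1
    model = {}
    for context, nxt in counts.items():
        total = sum(nxt.values()) + len(BASES) * pseudocount
        model[context] = {b: (nxt.get(b, 0.0) + pseudocount) / total for b in BASES}
    return model


def log_likelihood(seq, model, order=2):
    """Log-probability of seq under the model; unseen contexts count as uniform."""
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        p = probs.get(seq[i], 0.25) if probs else 0.25
        total += math.log(p)
    return total


if __name__ == "__main__":
    # Toy training data invented for illustration only.
    model = train_markov(["ATGATGATGATG"], order=2)
    print(log_likelihood("ATGATG", model), log_likelihood("ATCATC", model))
```

A candidate ORF that resembles the training genes receives a higher log-likelihood than one that does not, which is the basic signal a gene finder uses to separate real genes from chance ORFs.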

14 CONCLUSION. Identifying genes using bioinformatics tools requires a combination of computational and experimental techniques, as well as expertise in genomics, molecular biology, and bioinformatics. No single method can accurately predict all ORFs, so multiple approaches should be used to increase the accuracy of ORF prediction. Additionally, experimental validation, such as transcriptome sequencing or proteomics, is necessary to confirm predicted ORFs. Finally, mastering bioinformatics tools requires regular practice to understand how they work and how to interpret their results.