Introduction to Bioinformatics 2025.....pdf

omniaabdo276 138 views 50 slides Oct 05, 2024

Prof. Naglaa Abdallah

Course contents
•12 weeks
•Lectures, exercises, discussion.
•Materials (presentations, links, books, etc).
Class Structure
2 hours lecture
1 hour tutorial

Grading
•Final exam 60%
•Practical exam 20%
•Quizzes (Homework assignments), midterm 10%
•Oral 10%

Course description
An introduction to the theory and practice of bioinformatics and
computational biology.

Goals for the course:
The course will familiarize the students with the tools and principles of
contemporary bioinformatics. By the end of the course, students will
have a working knowledge of a variety of publicly available data and
computational tools important in bioinformatics and a grasp of the
underlying principles that is adequate for them to evaluate and use
novel techniques as they arise in the future.

What is Bioinformatics?
•Bioinformatics: Collection and storage of biological information.
•It is the field of science in which biology, computer science, and
information technology merge to form a single discipline.
•Computational Biology: Development of statistical models to analyze
biological data.
•Ultimate goal: to enable the discovery of new biological insights as well
as to create a global perspective from which unifying principles in
biology can be discerned.

•Bioinformatics: any use of computers to handle biological information.
•Bioinformatics (Oxford English Dictionary): The branch of science
concerned with information and information flow in biological systems,
esp. the use of computational methods in genetics and genomics.
•Molecular Bioinformatics: involves the use of computational tools to
discover new information in complex data sets (from the one-dimensional
information of DNA through the two-dimensional information of RNA
and the three-dimensional information of proteins, to the four-dimensional
information of evolving living systems).

The field of science in which biology, computer science and
information technology merge into a single discipline
Biologists
collect molecular data:
DNA & Protein sequences,
gene expression, etc.
Bioinformaticians
Study biological questions
by analyzing molecular
data


Computer scientists
(+Mathematicians, Statisticians, etc.)
Develop tools, software, and algorithms
to store and analyze the data.

•Store data
•Create Interfaces to the data
•Build tools to analyze data

The objective of biological experimentation is not just to generate
biological data, but also to analyze the data and extract information and
knowledge from it. The high complexity and volume of these data
require the use of computers for storage and analysis,
and make bioinformatics an integral part of modern molecular
science.

Origin of Bioinformatics
•Bioinformatics started in the 1960s; the term ‘Bioinformatics’
appears in 1989.
•Bioinformatics was created to serve molecular biology, or
studying life at the molecular level.
•Development of fast computers, good algorithmic techniques.
•Bioinformatics is the application of many sciences (Applied
mathematics, Statistics, Physics, Biology, Genetics and
Biochemistry).

1953: DNA structure discovered.
1956: First protein sequenced (insulin).
1960: Assembly of protein sequence databases.
1972: Protein Data Bank (PDB).
1977: Sanger sequencing technique developed.
1979: First DNA data bank (GenBank).
1987: Multiple sequence alignment.
1987: Perl is released by Larry Wall.
1988: National Center for Biotechnology Information (NCBI) is established.

1990: Human Genome Project started.
1990: BLAST program introduced by Altschul and colleagues.
1993: The first genome database (C. elegans).
1995: Haemophilus influenzae genome sequenced (1.8 Mb).
2000: Drosophila genome sequenced (180 Mb).
2001: The human genome (3,000 Mbp) is published.

The evolution of bioinformatics as seen in the 90’s

The requirements of bioinformatics
•Data collection techniques (DNA sequencing, protein
sequencing, microarrays)
•Theoretical concepts (concepts of DNA structure, protein
structure, evolution)
•Programs (BLAST, FASTA)
•Databases
•Institutions
•Complex genomic and high throughput data

The importance of Bioinformatics
•Application areas include
•Medicine
•Pharmaceutical drug design
•Toxicology
•Molecular evolution
•Biosensors
•Biomaterials
•Biological computing models
•DNA computing

What could Bioinformatics offer?
Analyze and interpret biological data: genomic sequences, transcriptomic
sequences, RNA structure, and protein sequences and structures.
Develop new algorithms and tools to: assess the biological
information, handle large datasets, find relationships between data
sources, etc.
Basic Science :
-Understand the living cell
-Find the function of a new protein
-Find the genes/proteins that are unique to human
Medical applications: identify the mutations (SNPs) that cause
genetic diseases, disease diagnosis & find and develop new and better
drugs
Agricultural applications: higher-yield crops, increased shelf life, etc.

•Structural genomics is a field of genomics that involves the
characterization of genome structures.
•This knowledge can be useful in the practice of manipulating the
genes and DNA segments of a species.
•Functional genomics is a field of molecular biology that
attempts to describe gene functions and interactions.
•Functional genomics makes use of the large amounts of data generated by
genomic and transcriptomic projects.

Bioinformatics could be used to:
•Sequence complete genomes
•Identify protein-coding regions
•Identify unique genes
•Gene knockout
•Functional analysis (phenotype,
detailed functional characterization)

•Structural studies, drug development

•The most logical way to see how bioinformatics assists
molecular biology is to follow the central dogma.
•Bioinformatics plays a role at each stage of the central dogma of
molecular biology.
First, there is DNA
•DNA is the most basic data gathered from molecular experiments,
and data types associated with DNA are genomes, genes, and gene
features.

Central Paradigm in Molecular Biology
Gene (DNA) → mRNA → Protein
21st century: Genome → Transcriptome → Proteome

The second part in the dogma is mRNA.
•Data is generated in many areas of experimentation involving
mRNA, and this includes the levels of expression of the mRNA.

•Typically these would be microarray experiments.
•Data associated with the structure of RNA, and then data associated
with other RNA.
•This includes ribosomal RNA, transfer RNA and studies involving
RNAi.

Next are data associated with proteins.
•Especially in modern molecular biology and biochemistry,
proteomics is a growing field with many resources being allocated
to the molecular study of proteins.
•The data types most associated with proteins are sequence data,
structural data and phylogenetic data.

•Today, the field of Systems Biology is very high on the list of important research
topics. It is a field that studies the complex interactions of genes, proteins and
other cellular elements and is very important for the advancement of
knowledge regarding the function of life. Metabolic pathway
determination and modeling is a very important aspect of systems biology.
•The highest level at which molecular biologists are working, is at the
phenotypic level, trying to explain the reasons for phenotypes given genetic and
proteomic makeup of cells.
•Human disease manifestation as a result of genetic flaws is a big topic of study,
and bioinformatics is playing a major role in determining the causes of genetic
disease, as well as helping with the search for cures for these conditions.


•A genome browser is a resource that exists only in silico, inside computers. These
resources provide an integrated look at entire organisms, and at how everything
known about their biology is interrelated.

Central Dogma and Genome Browsers
DNA: Genomes, Genes, Features
RNA: Expression, Structure, Other RNA (tRNA, rRNA & iRNA)
Protein: Sequence, Structure, Phylogenetics
Systems: Metabolic Pathways
Phenotype: Disease
Integrated Resources


DNA
Genomes
•Automated sequencing
•200 million by 1998
•1.5 billion by 2003 (Human Genome complete)
•Roche 454: >1Gb/day
•Cost drops too....
•ATCGATCGATCATGCTAGCTAGCTAGCTAGCTAGCG
CTATGCTAGCTCGTGCTAGCATGATCGATCATG .......

DNA
ACCTGTGTTCATCGGTCATGC
TCATCGGTCA
TCATCGGTCATGCACGGTTA
TCATCGGTCATGC
•Huge number of sequences from library (1,000,000,000 clones)

•Sequence must be determined

•Must be assembled

•Must be stored
TCATCGGTCA
TCATCGGTCATGC
TCATCGGTCATGCATGC TCATCGGTCATGCAATCGA
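
The assembly step above can be sketched in a few lines of Python. This is a minimal greedy overlap-merge that assumes error-free reads from a single strand. The names overlap and greedy_assemble and the min_len parameter are illustrative choices, not something the slides define. Real assemblers build overlap or de Bruijn graphs and have to cope with sequencing errors and repeats.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:  # no remaining overlaps: leave separate contigs
            break
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return reads


# The four reads shown on the slide.
slide_reads = [
    "ACCTGTGTTCATCGGTCATGC",
    "TCATCGGTCA",
    "TCATCGGTCATGCACGGTTA",
    "TCATCGGTCATGC",
]
contigs = greedy_assemble(slide_reads)
print(contigs)
```

The two short reads are contained in the longer ones, so only the two long reads are merged, on their 13-base overlap.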

Genome Storage –what we need?
Computational tools are needed to distill pathways of interest
from large molecular interaction databases

•Need modern computer systems
•Files
•Specific formats
•Databases
•Manage data
•Searchable
•Fast, reliable, available
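
As an illustration of "files" and "specific formats", the sketch below reads sequences in FASTA format (a header line starting with ">" followed by sequence lines) into a dictionary that can be searched by name. FASTA is chosen here as an assumption; the slide does not name a format. The function name read_fasta and the example records are hypothetical.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text into {record_id: sequence}."""
    records = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else "unnamed"  # id = first word of header
            chunks = []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Hypothetical two-record file, held in memory for the example.
example = StringIO(">seq1 hypothetical record\nATCGATCG\nATCA\n>seq2\nGGCC\n")
records = read_fasta(example)
print(records["seq1"])
```

A dictionary lookup stands in for the "searchable" requirement. A real genome store would sit in an indexed file or a database.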

DNA

•Genes (Eukaryotic cell)

DNA
•Genome annotation
•Finding genes and the features associated with the genes
•Must be done on computers
TTTATATAGGAGGAGATTCTCGCGATATCGGATCATCGAGCGCGCGCTATATATAGCTAGCTAGCTAGCTACTAGATCATCGATCGATCGCGCGATTACGATGCTAGCTACTACGGAAT
TACGATCGATCGA TGCATCGATGCATGCATGCATGCATGCTAGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTGATCGATCGGCGATCGATGCATCGATGCATCGATCGTAGCTAGCT
AAAGAGAGAGAGATCTCTCTTATAATTATAGCGCGATATATATGCGCATATATATATGCGATCATCGACTGCGCTATATACGATCGATCTAGCATCTAGCGCTATATACGATCATCTAA
TATGCTAGCTACTACTATCATCGATGCTATCAGCTCGGCG CAATTATATATATTTTCTCT TATATAACTCGATAGCTACTAC TACCTACTAGTAGCTAGCTAGCTAGCTAGCATCGTACG
TTTATATAGGAGGAGATTCTCGCGATATCGGATCATCGAGCGCGCGCTATATATAGCTAGCTAGCTAGCTACTAGATCATCGATCGATCGCGCGATTACGATGCTAGCTACTACGGAAT
TACGATCGATCGATGCATCGATGCATGCATGCATGCATGCTAGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTGATCGATCGGCGATCGATGCATCGATGCATCGATCGTAGCTAGCT
AAAGAGAGAGAGATCTCTCTTATAATTATAGCGCGATATATATGCGCATATATATATGCGATCATCGACTGCGCTATATACGATCGATCTAGCATCTAGCGCTATATACGATCATCTAA
TATGCTAGCTACTACTATCATCGATGCTATCAGCTCGGCGCGCATTATATATATTTTCTCTTCTCTCTCTCGATAGCTACTACTAGCTACTAGTAGCTAGCTAGCTAGCTAGCATCGTACG

CCTGACAAATTCGACGTGCGGCATTGCATGC AGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGAC GGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT
.................................
.............. TGAAAAACGTA
TF binding site
Promoter
Ribosome binding site
ORF = Open Reading Frame
CDS = Coding Sequence
Transcription Start Site
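
One of the features labelled above, the open reading frame, is simple enough to find with a short script. The sketch below scans the three forward reading frames for an ATG followed in-frame by a stop codon. It is a simplified illustration under stated assumptions: forward strand only, standard stop codons, and a hypothetical min_codons threshold. Real gene finders also use promoters, ribosome binding sites, and statistical models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    `start` is the index of the A in ATG; `end` is one past the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Hypothetical toy sequence with one ORF (ATG GCA TTT TAG) in frame 2.
toy = "CCATGGCATTTTAGGG"
orfs = find_orfs(toy)
print(orfs, [toy[s:e] for s, e, _ in orfs])
```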


RNA
•Poly-nucleotide molecules
•mRNA is the best known
•Carries gene information for translation
•Expression analysis for genes
•Comparison between different environmental stimuli or different
cellular states
•Level of protein production is roughly proportional to RNA level
•Gives a good indication of gene expression
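
Comparing expression between two cellular states is commonly summarized as a log2 fold change per gene. The sketch below is a minimal illustration. The gene names, the counts, and the pseudocount of 1 (added to avoid division by zero) are assumptions made for the example, not data from the course.

```python
import math


def log2_fold_change(treated, control, pseudocount=1.0):
    """log2((treated + p) / (control + p)) for genes present in both samples."""
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
        if gene in treated
    }


# Hypothetical expression levels in two conditions.
control = {"geneA": 99, "geneB": 7}
treated = {"geneA": 399, "geneB": 7}
changes = log2_fold_change(treated, control)
print(changes)
```

A value of 2 means a four-fold increase, 0 means no change, and negative values mean lower expression in the treated state.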

RNA
Structure – Function relationship important
Primary Secondary Tertiary
5’AACUCGAGC
UACUAGCUAG
GCGCGUUAAU
UAUCGUACUA
UAGCUACUAC
UUCGCGUAAU
UAUUACGAUG
UUCGGCUAGA
UUAGCGAUAU
UAUUACGAUA
UAUAUGCGCA
UAUCAGAUU3’


Protein
•Relatively easy to find DNA sequence
•More than 200 organisms sequenced
•Putative proteomes are abundant

•Bioinformatics: Computational tools to determine protein
structure and function from sequence

•Proteomics → analyze protein sequences

•Translation of RNA into proteins. The process of genome
sequencing is relatively simple, and to date several organisms have had
their entire genomes sequenced.
•Through various annotation processes using computers, it is possible
to predict the genes, and the proteins they will encode.
•This creates the need for bioinformatics tools to computationally
determine protein structure and function from sequence, with the end
goal being that the accuracy of predicting the structure and
function of proteins would be extremely high.
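
Predicting the protein a gene encodes starts with translating its coding sequence. The sketch below uses the standard genetic code, written in the compact TCAG ordering, and stops at the first stop codon. The function name translate and the toy sequence are illustrative. Unknown codons are reported as "X", which is an assumption made for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a DNA or RNA coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # "X" marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCATTTTAG"))
```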

•Classify proteins (Database of protein
motifs)
•Choose and express representative
proteins from all families
•Determine structure by X-ray
•Predict the rest by homology modeling
Protein function

•When comparing protein sequences, it can be assumed that for
proteins with similar sequences, the proteins should have similar
functions.
•This also holds for subsections of a particular sequence, and not
only for entire sequences.
•In Proteomics, many discoveries as to the specific functionality of a
particular stretch of sequence (motif) have been made, and these
discoveries were stored, and are being used today in sequence
function assignment.
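
"Similar sequences" is usually quantified as percent identity over an alignment. The sketch below computes it for two sequences that are already aligned, with "-" marking gaps. The function name and the toy alignment are hypothetical. Producing the alignment itself (for example with BLAST or FASTA) is a separate step not shown here.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over the gap-free columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"
    ]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


# Hypothetical aligned protein fragments.
identity = percent_identity("MKT-AY", "MKSLAY")
print(identity)
```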

•Protein structures have been available in the public domain for
longer than any other type of biological data, with the first public
repository being created in the early 70’s.
•More than 33000 structures are available for download via the
internet.
•These structures can be downloaded for the purpose of homology
modeling.
•If two sequences are homologous, and functionally similar as well,
then their 3-dimensional structures should be similar too.
•This approach can be useful during drug discovery, or in studies
of how a particular protein functions.

Protein
•Similar sequence → Similar function
•Smaller stretches of sequence carry similar function
•Motifs or signature sequences
•DNA binding motifs
Sequence A
Sequence B

Protein
•Motifs and signatures → Identify unknown proteins
•Search protein database for proteins with probable functionality

•Databases of protein signatures and motifs
•Pfam, Prosite, Prints, BLOCKS
•Various methods of representation
•HMMs – Pfam
•Regular expressions - Prosite
•PSSMs - BLOCKS
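
The regular-expression representation used by Prosite can be illustrated by converting a Prosite-style pattern into a Python regular expression and scanning a sequence with it. The sketch below handles only a subset of the pattern syntax (single residues, [..] sets, {..} exclusions, x wildcards, and (n) or (n,m) repeats). The helper names and the test sequence are hypothetical. The pattern N-{P}-[ST]-{P} is the Prosite N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple Prosite-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                      # any residue
            part = "."
        elif core.startswith("{"):           # any residue except these
            part = "[^" + core[1:-1] + "]"
        else:                                # single residue or [ABC] set
            part = core
        if low:
            part += "{" + low + ("," + high if high else "") + "}"
        parts.append(part)
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return (position, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


hits = find_motif("MKNGSAQNPTL", "N-{P}-[ST]-{P}")
print(prosite_to_regex("N-{P}-[ST]-{P}"), hits)
```

The second asparagine in the toy sequence is followed by a proline, so it is correctly skipped.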

Central Dogma and Genome Browsers
DNA RNA Protein Systems Phenotype
Integrated Resources

•Knowing the function of one protein is not enough. If we are to
understand the way life functions, we must know and understand the way
in which all proteins in the cell function together.
•We must understand the complex relationships between the
molecules in the cell that make it function.
• The way in which this is done, is first to understand the metabolic
pathways.
•These are networks into which functional groups of proteins are divided.

•To do that, we must add all the different pathways and their effects
together.
•The field of study concerned with doing this is systems biology, in
which the biological system, as a whole, is studied with the end goal
of studying the effects of all the pathways eventually put together.

•Bioinformatics is an essential component of systems biology.

•Many software packages exist for the study of metabolic pathways
and the greater system as a whole.
•One such package is CellDesigner (http://www.celldesigner.org/).
•It is a structured diagram editor for drawing gene-regulatory and
biochemical networks.

•It is a package used to create in silico metabolic systems.
•These models can then be assigned metabolic characteristics, for example
the effect of the up regulation of one component on another. This can
eventually lead to the modeling of entire cascading effects in the system.
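
The idea of a cascading effect can be sketched by treating a pathway as a directed graph and asking which components lie downstream of the one being up-regulated. This is a generic illustration in plain Python, not CellDesigner's own model format or API. The network, its node names, and the function name downstream are hypothetical.

```python
from collections import deque


def downstream(network, source):
    """Breadth-first list of every component reachable from `source`."""
    visited = {source}
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for target in network.get(node, []):
            if target not in visited:
                visited.add(target)
                order.append(target)
                queue.append(target)
    return order


# Hypothetical pathway: A activates B and C, which both act on D.
network = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
affected = downstream(network, "A")
print(affected)
```

A quantitative model would attach rates or regulatory strengths to each edge. The graph traversal only shows which components can be affected at all.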

CellDesigner (http://www.celldesigner.org/)


Phenotype
•All individual systems together → phenotype
•Phenotype studies → big money required

•Human Health → Phenotype studies
•Agriculture → Phenotype studies