Bioinformatics, as related to genetics and genomics, is a scientific subdiscipline that involves using computer technology to collect, store, analyze and disseminate biological data and information, such as DNA and amino acid sequences or annotations about those sequences. Scientists and clinicians use databases that organize and index such biological information to increase our understanding of health and disease and, in certain cases, as part of medical care. The role of bioinformatics in biological research can be compared with the role of data analysis in the age of information and the Internet. In earlier days, the primary challenge was getting to the information. Advances in reading DNA sequences have lowered that barrier substantially. Going forward, the challenge is how to understand and interpret the information that has been collected. Because the data sets are large, whether you're talking about information about website visits or the human genome, computer-based methods are the default approach. In the end, bioinformatics work with human genomes seeks to discover practical insights about human health and biology with all its complexity.
Coursework
1 coursework – worth 20 marks
–Work in pairs
Retrieving information from a database
Using Perl to manipulate that information
The Robot Scientist
Performs experiments
Learns from results
–Using machine learning
Plans more experiments
Saves time and money
Team member:
–Stephen Muggleton
Biological Nomenclature
Need to know the meaning of:
–Species, organism, cell, nucleus, chromosome, DNA
–Genome, gene, base, residue, protein, amino acid
–Transcription, translation, messenger RNA
–Codons, genetic code, evolution, mutation, crossover
–Polymer, genotype, phenotype, conformation
–Inheritance, homology, phylogenetic trees
Substructure and Effect (Top Down/Bottom Up)
Species
Organism
Cell
Nucleus
Chromosome
DNA strand
Gene
Base
Protein
Amino Acid
[Diagram arrow labels: "Folds into", "Affects the function of", "Affects the behaviour of", "Prescribes"]
Cells
Basic unit of life
Different types of cell:
–Skin, brain, red/white blood
–Different biological function
Cells produced by cells
–Cell division (mitosis)
–2 daughter cells
Eukaryotic cells
–Have a nucleus
Nucleus and Chromosomes
Each cell has nucleus
Rod-shaped particles inside
–Are chromosomes
–Which we think of in pairs
Different number for species
–Human (46), tobacco (48)
–Goldfish (94), chimp (48)
–Usually paired up
X & Y Chromosomes
–Humans: Male (XY), Female (XX)
–Birds: Male (ZZ), Female (ZW); the bird sex chromosomes are named Z and W
DNA Strands
Chromosomes are same in every cell of organism
–Supercoiled DNA (Deoxyribonucleic acid)
Take a human, take one cell
–Determine the structure of all chromosomal DNA
–You’ve just read the human genome (for 1 person)
–Human genome project
13 years, 3.2 billion chemicals (bases) in human genome
Other genomes being/been decoded:
–Pufferfish, fruit fly, mouse, chicken, yeast, bacteria
DNA Structure
Double Helix (Crick & Watson)
–2 coiled matching strands
–Backbone of sugar phosphate pairs
Nitrogenous Base Pairs
–Roughly 20 atoms in a base
–Adenine Thymine [A,T]
–Cytosine Guanine [C,G]
–Weak bonds (can be broken)
–Form long chains called polymers
Read the sequence on 1 strand
–GATTCATCATGGATCATACTAAC
Differences in DNA
[Figure: differences in DNA. Recoverable labels: "2%", "tiny", "roughly 4%", "share material"]
DNA differentiates:
–Species/race/gender
–Individuals
We share DNA with
–Primates, mammals
–Fish, plants, bacteria
Genotype
–DNA of an individual
–Genetic constitution
Phenotype
–Characteristics of the resulting organism
–Nature and nurture
Genes
Chunks of DNA sequence
–Between 600 and 1200 bases long
–32,000 human genes, 100,000 genes in tulips
Large percentage of human genome
–Is “junk”: does not code for proteins
“Simpler” organisms such as bacteria
–Are much more evolved (have hardly any junk)
–Viruses have overlapping genes (zipped/compressed)
Often the active part of a gene is split into exons
–Separated by introns
The Synthesis of Proteins
Instructions for generating Amino Acid sequences
–(i) DNA double helix is unzipped
–(ii) One strand is transcribed to messenger RNA
–(iii) RNA acts as a template: ribosomes translate the RNA into the sequence of amino acids
Amino acid sequences fold into a 3d molecule
Gene expression
–Every cell has every gene in it (has all chromosomes)
–Which ones produce proteins (are expressed) & when?
Transcription
Take one strand of DNA
Write out the counterparts to each base
–G becomes C (and vice versa)
–A becomes T (and vice versa)
Change Thymine [T] to Uracil [U]
You have transcribed DNA into messenger RNA
Example:
Start: GGATGCCAATG
Intermediate: CCTACGGTTAC
Transcribed: CCUACGGUUAC
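The recipe on this slide can be sketched in a few lines of Python. This is only a minimal illustration of the slide's simplified steps (complement each base, then replace T with U); it ignores strand direction, and the names `COMPLEMENT` and `transcribe` are chosen for this example rather than taken from the course.

```python
# Base pairing used on the slide: G<->C and A<->T.
COMPLEMENT = {"G": "C", "C": "G", "A": "T", "T": "A"}

def transcribe(dna: str) -> str:
    """Follow the slide's recipe: complement each base, then swap T for U."""
    intermediate = "".join(COMPLEMENT[base] for base in dna)
    return intermediate.replace("T", "U")

print(transcribe("GGATGCCAATG"))  # CCUACGGUUAC, matching the slide's example
```

A dictionary lookup keeps the base pairing explicit, and any character other than G, C, A or T raises a `KeyError` instead of passing through silently.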
Genetic Code
How the translation occurs
Think of this as a function:
–Input: triples of base letters (codons)
–Output: amino acid
–Example: ACC becomes threonine (T)
Gene sequences end with:
–TAA, TAG or TGA
Example Synthesis
TCGGTGAATCTGTTTGAT
Transcribed to:
AGCCACUUAGACAAACUA
Translated to:
SHLDKL
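The "function" view of the genetic code can be made concrete with a lookup table. The sketch below builds the standard genetic code from a compact 64-letter string (codons ordered with bases U, C, A, G) and translates the messenger RNA from this slide. The names `CODON_TABLE` and `translate` are illustrative choices, and `*` marks a stop codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG order; '*' = stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna: str) -> str:
    """Read codons left to right; stop at a stop codon or an incomplete codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AGCCACUUAGACAAACUA"))  # SHLDKL, as on the slide
print(translate("ACC"))                 # T (threonine)
```

Because the table is keyed on RNA letters, the stop codons appear as UAA, UAG and UGA rather than TAA, TAG and TGA.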
Proteins
DNA codes for
–strings of amino acids
Amino acid strings
–Fold up into complex 3d molecule
–3d structures: conformations
–Between 200 & 400 “residues”
–Folds are proteins
Residue sequences
–Always fold to same conformation
Proteins play a part
–In almost every biological process
Evolution of Genes: Inheritance
Evolution of species
–Caused by reproduction and survival of the fittest
But actually, it is the genotype which evolves
–Organism has to live with it (or die before reproduction)
–Three mechanisms: inheritance, mutation and crossover
Inheritance: properties from parents
–Embryo has cells with 23 pairs of chromosomes
–Each pair: 1 chromosome from father, 1 from mother
–Most important factor in offspring’s genetic makeup
Evolution of Genes: Mutation
Genes alter (slightly) during reproduction
–Caused by errors, from radiation, from toxicity
–3 possibilities: deletion, insertion, substitution
Deletion: ACGTTGACTC → ACGTGACTC
Insertion: ACGTTGACTC → AGCGTTGACTC
Substitution: ACGTTGACTC → ACGATGACTT
Mutations are almost always deleterious
–A single change has a massive effect on translation
–Causes a different protein conformation
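The three kinds of change listed on this slide can be reproduced as simple string operations, and splitting the result into triples shows why one lost base can matter so much. This is a hedged sketch: the helper names (`codons`, `delete`, `insert`, `substitute`) and the chosen positions are assumptions made for illustration, picked so that the outputs match the slide's example sequences.

```python
def codons(seq: str) -> list:
    """Split a sequence into consecutive triples, dropping leftover bases."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def delete(seq: str, pos: int) -> str:
    """Remove the base at index pos."""
    return seq[:pos] + seq[pos + 1:]

def insert(seq: str, pos: int, base: str) -> str:
    """Insert a base before index pos."""
    return seq[:pos] + base + seq[pos:]

def substitute(seq: str, pos: int, base: str) -> str:
    """Replace the base at index pos."""
    return seq[:pos] + base + seq[pos + 1:]

original = "ACGTTGACTC"
deleted = delete(original, 3)                                   # ACGTGACTC
inserted = insert(original, 1, "G")                             # AGCGTTGACTC
substituted = substitute(substitute(original, 3, "A"), 9, "T")  # ACGATGACTT

print(codons(original))  # ['ACG', 'TTG', 'ACT']
print(codons(deleted))   # ['ACG', 'TGA', 'CTC']
```

After the deletion every later triple is read from a shifted position. In this example the second triple becomes TGA, one of the end signals listed on the Genetic Code slide.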
Evolution of Genes: Crossover (Recombination)
DNA sections are swapped
–From male and female genetic input to offspring DNA
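As a rough illustration of swapping DNA sections, the sketch below performs a single-point crossover on two equal-length strings. Biological recombination is considerably more involved; the function name `crossover`, the single crossover point and the toy sequences are all assumptions made for this example.

```python
def crossover(parent_a: str, parent_b: str, point: int) -> tuple:
    """Swap the tails of two equal-length sequences at the given position."""
    if len(parent_a) != len(parent_b):
        raise ValueError("sequences must have the same length")
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2

# Toy sequences invented for illustration only.
print(crossover("AAAAAAAA", "CCCCCCCC", 3))  # ('AAACCCCC', 'CCCAAAAA')
```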
Bioinformatics Application #1
Phylogenetic trees
Understand our evolution
Genes are homologous
–If they share a common ancestor
By looking at DNA sequences
–For particular genes
–See who evolved from whom
Example:
–Mammoth most related to African or Indian elephants?
LUCA:
–Last Universal Common Ancestor
–Roughly 4 billion years ago
Genetic Disorders
Disorders have fuelled much genetics research
–Remember that genes have evolved to function, not to malfunction
Different types of genetic problems
Down's syndrome: three chromosome 21s
Cystic fibrosis:
–Mutation in a single gene (CFTR) disables a protein; the most common form deletes three bases
–Restricts the flow of ions into certain lung cells
–Lung is less able to expel fluids
Bioinformatics Application #2
Predicting Protein Structure
Proteins fold to set up an active site
–Small, but highly effective (sub)structure
–Active site(s) determine the activity of the protein
Remember that translation is a function
–Always same structure given same set of codons
–Is there a set of rules governing how proteins fold?
–No one has found one yet
–“Holy Grail” of bioinformatics
Protein Structure Knowledge
Both protein sequence and structure
–Are being determined at an exponential rate
1.3+ Million protein sequences known
–Found with projects like Human Genome Project
20,000+ protein structures known
–Found using techniques like X-ray crystallography
Takes between 1 month and 3 years
–To determine the structure of a protein
–Process is getting quicker
Sequence versus Structure
[Figure: number of known protein sequences and protein structures by year, 1985 to 2000; vertical axis "Number", 0 to 500,000]
Database Approaches
Slow(er) rate of finding protein structure
–Still a good idea to pursue the Holy Grail
Structure is much more conservative than sequence
–1.3m genes, but only 2,000 – 10,000 different conformations
First approach to structure prediction:
–Store [sequence,structure] pairs in a database
–Find ways to score similarity of residue sequences
–Given a new sequence, find closest matches
A good match will possibly mean similar protein shape
E.g., sequence identity > 35% will give a good match
–The rest of the first half of the course is about these issues
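The database idea can be sketched as a nearest-match lookup. The example below scores similarity as plain percent identity between equal-length sequences, which is a deliberate simplification: practical methods align sequences and allow gaps. The function names, the residue sequences and the structure labels are all hypothetical placeholders; only the 35% threshold comes from the slide.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions with the same residue; assumes equal lengths."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def closest_match(query: str, database: list) -> tuple:
    """Return (identity, sequence, structure) for the best-scoring entry."""
    return max(
        (percent_identity(query, seq), seq, structure)
        for seq, structure in database
    )

# Hypothetical [sequence, structure] pairs; the labels are placeholders.
DATABASE = [
    ("SHLDKLAAGT", "structure_A"),
    ("MKVLDRWEGT", "structure_B"),
    ("SHIDKLQAGS", "structure_C"),
]

identity, seq, structure = closest_match("SHLDKLQAGT", DATABASE)
if identity > 35:  # threshold quoted on the slide
    print(f"{structure}: {identity:.0f}% identical")
```

A linear scan with `max` is enough for a toy database; it says nothing about how a real sequence database is indexed or searched.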
Potential (Big) Payoffs of Protein Structure Prediction
Protein function prediction
–Protein interactions and docking
Rational drug design
–Inhibit or stimulate protein activity with a drug
Systems biology
–Putting it all together: “E-cell” and “E-organism”
–In-silico modelling of biological entities and processes
Further Reading
Human Genome Project at Sanger Centre
–http://www.sanger.ac.uk/HGP/
Talking glossary of genetic terms
–http://www.genome.gov/glossary.cfm
Primer on molecular genetics
–http://www.ornl.gov/TechResources/Human_Genome/publicat/primer/toc.html