Bioinformatics: An Overview. Soumitra Nath, mail: [email protected], Department of Biotechnology, Gurucharan College, Silchar
Bioinformatics = Biological Data + Computer Calculations
What is Bioinformatics? “The field of science in which biology, computer science, and information technology merge to form a single discipline”
Central Dogma in Molecular Biology: Gene (DNA) → mRNA → Protein. 21st century: Genome → Transcriptome → Proteome
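The flow from gene to mRNA to protein can be sketched in a few lines of Python. This is a minimal illustration, not a full implementation: the codon table below is deliberately tiny (an assumption for brevity; the real genetic code has 64 codons), and the example gene and the function names `transcribe` and `translate` are made up for the demonstration.

```python
# Minimal codon table covering only the codons used in the example below;
# the full genetic code has 64 entries.
CODON_TABLE = {
    "AUG": "M", "GCU": "A", "UUC": "F", "GGA": "G",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}

def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translate codon by codon until a stop codon or an unknown codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3])
        if amino_acid is None or amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

gene = "ATGGCTTTCGGATAA"      # hypothetical toy gene
mrna = transcribe(gene)       # AUGGCUUUCGGAUAA
protein = translate(mrna)     # MAFG
print(mrna, protein)
```

Real transcription reads the template strand and real genes may contain introns; the sketch ignores both and works directly on the coding strand.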
The Human Genome Project: initiated in 1986 (formally launched in 1990), completed in 2003. Project goals were to: identify all the genes in human DNA; determine the sequences of the 3 billion chemical base pairs that make up human DNA; store this information in databases; improve tools for data analysis and develop new tools; and address the ethical, legal, and social issues that may arise from the project.
What makes us human? CHIMP GENOME Chimpanzees are similar to humans in so many ways: they are socially complex, sensitive and communicative, and yet indisputably on the animal side of the man/beast divide. Scientists have now sequenced the genetic code of our closest living relative, showing the striking concordances and divergences between the two species, and perhaps holding up a mirror to our own humanity.
Perhaps not surprising! Comparison between the full drafts of the human and chimp genomes revealed that they differ by only 1.23% (counting single-nucleotide substitutions). How human are chimps?
Annotation: open reading frames; functional sites; structure and function
Annotated sequence (figure): CCTGACAAATTCGACGTGCGGCATTGCATGC AGACGTGCAT G CGTGCAAA TAATCA ATGTGGACTTTTCTGC GATTAT GGA AGA A CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA AGAGCCAGTAAAAGACGCAGTGAC GGAG ATGTCTG ATG CAA TAT GGA CAA TTG GTT TCT TCT CTG AAT ... TGA AAAACGTA. Features marked in the figure: transcription factor binding site, promoter, transcription start site, ribosome binding site, and the coding region. ORF = Open Reading Frame; CDS = Coding Sequence.
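Finding open reading frames is one of the simplest annotation steps to automate. The sketch below is a simplified illustration under stated assumptions: it scans only the forward strand, treats every ATG as a possible start, and stops at the first in-frame TAA, TAG or TGA. The function name `find_orfs` and the `min_codons` cut-off are hypothetical choices for this example; real gene finders also use promoter and ribosome-binding-site signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 3):
    """Return (start, end, sequence) for each ATG...stop stretch.

    start is the 0-based index of the A in ATG; end is one past the
    stop codon. Only the forward strand is scanned, in all three frames.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                # Walk codon by codon until an in-frame stop codon appears.
                for j in range(i + 3, len(dna) - 2, 3):
                    if dna[j:j + 3] in STOP_CODONS:
                        if (j + 3 - i) // 3 >= min_codons:
                            orfs.append((i, j + 3, dna[i:j + 3]))
                        i = j  # resume scanning after this stop codon
                        break
            i += 3
    return orfs

print(find_orfs("CCATGAAATTTGGGTAACC"))
# → [(2, 17, 'ATGAAATTTGGGTAA')]
```

The codon count used for `min_codons` includes the start and stop codons.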
Molecular biology data types (Lei Liu): organisms; genome maps; DNA sequences and RNA sequences (e.g. ...AATGGTACCGATGACCTGGAGCTTGGTTCGA...); protein sequences (e.g. ...TRLRPLLALLALWPPPPARAFVNQHLCGSHLVEA...); RNA structures; protein structures; DNA motifs; protein motifs; RNA expression
Bioinformatics
Sequence Analysis
What do we want to know about a sequence? Is this sequence similar to any known genes? How close is the best match, and how significant is it? What do we know about that gene? Genomic (chromosomal location, allelic information, regulatory regions, etc.); structural (known structure? structural domains? etc.); functional (molecular, cellular and disease). Evolutionary information: is this gene found in other organisms? What is its taxonomic tree? (Larry Hunter)
Biological databases: data is of different types. Raw data (DNA, RNA, protein sequences); curated data (annotated DNA, RNA and protein sequences and structures, expression data)
EMBL / GenBank / DDBJ: serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from genome projects, sequencing centers, individual scientists, and patent offices (e.g. the European Patent Office, EPO). Non-confidential data are exchanged daily. Currently: 18 × 10^6 sequences, over 20 × 10^9 bp; over the last 12 months the database size has tripled. Sequences from > 50,000 different species. The three databases contain essentially the same information within 2-3 days of each other (with a few differences in format and syntax).
NCBI (www.ncbi.nlm.nih.gov): created in 1988 as part of the National Library of Medicine at NIH. Its roles: establish public databases; conduct research in computational biology; develop software tools for sequence analysis; disseminate biomedical information.
NCBI and Entrez: NCBI provides summaries, browsers for genome data, and search tools. Entrez is its database search interface (http://www.ncbi.nlm.nih.gov/Entrez). It can search on gene names, sequences, chromosomal location, diseases, keywords, ...
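Entrez searches can also be scripted. The sketch below only builds a search URL and does not contact the network. It assumes NCBI's E-utilities ESearch endpoint (the programmatic route into Entrez, which is not mentioned on the slide) and its `db`, `term`, `retmax` and `retmode` parameters; the helper name `build_esearch_url` and the example query are illustrative.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(database: str, term: str, max_results: int = 20) -> str:
    """Build an ESearch URL for the given Entrez database and query term."""
    params = {
        "db": database,         # Entrez database to search, e.g. "gene"
        "term": term,           # query text, optionally with field tags
        "retmax": max_results,  # maximum number of IDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"

url = build_esearch_url("gene", "BRCA1[Gene Name] AND human[Organism]", max_results=5)
print(url)
```

Opening the printed URL in a browser, or fetching it with `urllib.request.urlopen`, should return the matching record IDs, subject to NCBI's usage limits.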
Sequence Comparison: DNA is the blueprint for living organisms. Evolution is related to changes in DNA. By comparing DNA sequences we can infer evolutionary relationships between the sequences.
Copyright 2004 Limsoon Wong. Sequence Alignment: a key aspect of sequence comparison is sequence alignment. A sequence alignment maximizes the number of positions that are in agreement in two sequences. (Figure: Sequence U aligned with Sequence V, with positions labelled match, mismatch, and indel.)
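One standard way to compute such an alignment is dynamic programming. The sketch below uses the Needleman-Wunsch global alignment scheme, which is a swapped-in technique rather than anything named on the slide. The scoring values (+1 for a match, -1 for a mismatch, -1 for an indel) and the function name `global_align` are assumptions chosen for illustration.

```python
def global_align(u: str, v: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_u, aligned_v) for a global alignment of u and v."""
    rows, cols = len(u) + 1, len(v) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of match/mismatch, or an indel.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if u[i - 1] == v[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_u, aligned_v = [], []
    i, j = len(u), len(v)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if u[i - 1] == v[j - 1] else mismatch
        ):
            aligned_u.append(u[i - 1])
            aligned_v.append(v[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_u.append(u[i - 1])
            aligned_v.append("-")   # indel: gap in v
            i -= 1
        else:
            aligned_u.append("-")   # indel: gap in u
            aligned_v.append(v[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_u)), "".join(reversed(aligned_v))

print(global_align("ACGT", "ACT"))
# → (2, 'ACGT', 'AC-T')
```

Here three positions match and one indel is needed, giving a score of 3 - 1 = 2. Database search tools use faster heuristics and more detailed scoring matrices, but the idea of scoring matches, mismatches and indels is the same.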
Multiple Alignment: An Example. A multiple sequence alignment maximizes the number of positions in agreement across several sequences. Sequences belonging to the same “family” usually have more conserved positions in a multiple sequence alignment. (Figure: alignment with conserved sites marked.)
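Once sequences are aligned, conserved sites are simply the columns where every sequence agrees. A minimal sketch, assuming the alignment is already given as equal-length strings with "-" marking gaps; the function name `conserved_columns` and the toy alignment are made up for illustration.

```python
def conserved_columns(alignment):
    """Return 0-based indices of columns where all sequences share one residue."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")
    conserved = []
    for col in range(length):
        residues = {seq[col] for seq in alignment}
        # A column counts as conserved only if it holds one residue and no gap.
        if len(residues) == 1 and "-" not in residues:
            conserved.append(col)
    return conserved

msa = ["ACG-TA", "ACGATA", "ACC-TA"]   # hypothetical toy alignment
print(conserved_columns(msa))
# → [0, 1, 4, 5]
```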
Phylogeny: An Example. By looking at the extent of conserved positions in the multiple sequence alignment of different groups of sequences, we can infer when they last shared an ancestor and construct a “family tree”, or phylogeny.
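One simple way to turn an alignment into a tree is to measure how different each pair of sequences is and repeatedly join the closest groups. The sketch below uses the fraction of differing positions as the distance and UPGMA-style average-linkage clustering. Both are assumptions chosen for illustration, not the method behind the slide's figure, and the four sequences are invented toy data, not real genes.

```python
from itertools import combinations

def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def upgma(aligned: dict):
    """Join the closest clusters step by step; return the tree as nested tuples."""
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in aligned]

    def cluster_distance(c1, c2):
        # Average distance over all pairs of members, one from each cluster.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(p_distance(aligned[a], aligned[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]

# Invented toy sequences, assumed to be already aligned.
seqs = {
    "human": "ACGTACGTAC",
    "chimp": "ACGTACGTAT",
    "mouse": "ACGAACTTAT",
    "fly":   "TCGAAGTTCT",
}
print(upgma(seqs))
# → ('fly', ('mouse', ('human', 'chimp')))
```

The nesting shows which groups were joined first, and so which are inferred to share the most recent ancestor. Branch lengths are not computed in this sketch.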
Visualizing the 3D Structure of Proteins
Levels of protein structure: primary (1°), secondary (2°), tertiary (3°), quaternary (4°). From: Branden & Tooze, “Introduction to Protein Structure”
X-ray crystallography: the X-ray source is either small-scale in the lab or at a national synchrotron facility. Getting crystals of proteins or nucleic acids is no small feat! The crystal yields a diffraction pattern. Problem: there is no way to “focus” the diffracted X-rays, so the phases need to be determined. Computers aid in model building, phase determination, and visualization.
Software for 3D structure visualization: Cn3D. Cn3D is a visualization tool for macromolecules. It allows you to view 3-D structures from NCBI's Entrez retrieval service. Cn3D is able to correlate structure and sequence information; for example, you can find the residues in a crystal structure that correspond to known disease mutations.
Software for 3D structure visualization: RasMol. RasMol is a molecular graphics program intended for the visualization of proteins, nucleic acids, and small molecules. It is aimed at display, teaching, and generation of publication-quality images.
Software for 3D structure visualization: Swiss-PdbViewer. Swiss-PdbViewer can be used to calculate the distance and angle between atoms. It allows browsing a rotamer library in order to change amino acid side chains, which can be very useful to quickly evaluate the assumed effect of a mutation before actually doing the lab work. It also allows altering the torsion angles of amino acids and hetero-atoms, as well as the backbone omega, phi and psi angles.
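The distance and angle measurements that such viewers report come straight from the atomic coordinates. A minimal sketch of the geometry, assuming each atom is given as an (x, y, z) tuple in ångströms; the function names are illustrative and are not part of any viewer's interface.

```python
import math

def distance(a, b):
    """Euclidean distance between two atoms given as (x, y, z) coordinates."""
    return math.dist(a, b)

def angle_degrees(a, b, c):
    """Angle at atom b formed by atoms a-b-c, in degrees."""
    ba = [x - y for x, y in zip(a, b)]   # vector from b to a
    bc = [x - y for x, y in zip(c, b)]   # vector from b to c
    dot = sum(x * y for x, y in zip(ba, bc))
    cos_theta = dot / (math.hypot(*ba) * math.hypot(*bc))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

print(distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))                        # 5.0
print(angle_degrees((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)))  # about 90
```

Torsion angles such as phi and psi need four atoms rather than three and are not covered by this sketch.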
CADD (Computer-Aided Drug Design)
What is a drug target? A drug target may be a native protein (or sometimes DNA/RNA) in the body whose activity is modified by a drug, resulting in a desirable therapeutic effect. Drug targets may be: enzymes, hormone receptors, ion channel proteins, and sometimes DNA or RNA.
The Drug Designing Pathway (flowchart): Disease → Drug Target; Ligand Database, Natural Product, Combinatorial Library → Ligand; Lipinski & ADMET Filters; Docking (in silico binding study); -ve Docking Result → Side chain modification; +ve Docking Result → Synthesis → In vitro screening; +ve Result → In vivo screening → Clinical Trials; -ve Result.
There are two major types of drug design. Ligand (analog) based drug design: 1. Receptor structure is not known; 2. Mechanism is known/unknown; 3. Ligands and their biological activities are known. Target (structure) based drug design: 1. Receptor structure is known; 2. Mechanism is known; 3. Ligands and their biological activities are known/unknown. Computational tools are used to: identify and study drug targets of various diseases; study and identify suitable ligands that bind to the drug target; predict the toxicity and drug-likeness of small molecules (Lipinski filters and ADMET screening); generate combinatorial libraries.
Prerequisites of a docking experiment: (1) 3D structure of the protein (drug target): download from the Protein Data Bank (www.rcsb.org/pdb), a macromolecular structure database. If it is not available in the PDB, predict the structure (homology modeling, ab initio prediction, threading, etc.). (2) 3D structure of the small molecule (ligand): 2D structures of small molecules are available in databases like PubChem, KEGG-Ligand, etc. The structure of an isolated natural product or synthetic compound may also be derived using NMR spectroscopy and/or X-ray crystallography (XRC). Convert the 2D small molecule to its 3D structure using software such as CORINA (the name stands for COoRdINAtes).
Lipinski's Rule of Five: molecular weight ≤ 500 Da; C logP ≤ 5 (octanol/water partition coefficient); H-bond donors ≤ 5; H-bond acceptors (sum of N and O atoms) ≤ 10. A rotatable bond count ≤ 10 is often applied alongside these; it comes from Veber's rules rather than Lipinski's original four criteria. Lipinski's Rule of Five is applicable to orally active compounds.
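The thresholds on this slide translate directly into a filter. The sketch below assumes the descriptors have already been computed by some chemistry toolkit and are simply passed in. The `Descriptors` class, the `rule_violations` function and the two example compounds (with approximate, illustrative values) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Descriptors:
    """Pre-computed molecular descriptors (hypothetical container for this sketch)."""
    molecular_weight: float   # in daltons
    clogp: float              # calculated octanol/water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int     # counted as the sum of N and O atoms
    rotatable_bonds: int

def rule_violations(d: Descriptors) -> list:
    """Return the names of the slide's thresholds that the compound exceeds."""
    checks = {
        "molecular weight <= 500": d.molecular_weight <= 500,
        "C logP <= 5": d.clogp <= 5,
        "H-bond donors <= 5": d.h_bond_donors <= 5,
        "H-bond acceptors <= 10": d.h_bond_acceptors <= 10,
        "rotatable bonds <= 10": d.rotatable_bonds <= 10,
    }
    return [name for name, passed in checks.items() if not passed]

# Approximate, illustrative values only.
aspirin_like = Descriptors(180.2, 1.2, 1, 4, 3)
bulky_compound = Descriptors(720.0, 6.3, 6, 12, 14)

print(rule_violations(aspirin_like))     # no violations
print(rule_violations(bulky_compound))   # violates every threshold
```

Returning the list of violations, rather than a single pass/fail flag, leaves the caller free to decide how many violations to tolerate.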
ADME-Tox Screening: Absorption: the compound must be easily absorbed by the body. Distribution: the compound needs to be easily transferred and distributed to its target site. Metabolism: the compound should be processed by the body's metabolic pathways. Excretion: byproducts need to be excreted from the body. Toxicity: any toxic effect must be minimal or neutralized.
Example: Tubulin as a Cancer Drug Target. The tubulin heterodimer (α + β) is the basic structural unit of the microtubule: Tubulin-α + Tubulin-β → Heterodimer → Microtubule. The drug molecule Taxol binds to β-tubulin within the microtubule and stabilizes it, so the microtubule cannot disassemble normally. As a result, cell division ceases.
Benefits of Bioinformatics: To the patient: better drug, better treatment. To the pharma industry: save time, save cost, make more $. To the scientist: better science.
Programme Designing
PERL: Practical Extraction and Report Language. Perl 1.0.0, Larry Wall, 1987. http://www.perl.org/ Perl is a programming language that is offered at no cost.
Why Perl? Fairly easy to learn the basics. Many powerful functions for working with text: search and extract, modify, combine. Can control other programs. Free and available for all operating systems. Most popular language in bioinformatics. Many pre-built “modules” are available that do useful things.
Get Perl: You can install Perl on any type of computer. Download and install Perl on your own computer from www.perl.org (Windows version: http://www.activestate.com/Products/ActivePerl/). On your desktop, set up a shortcut to the Command Prompt (Programs/Accessories/Command Prompt), then edit the properties of the shortcut and set the “Start in” field to blank.
Extension and Path: On Windows systems, it's usual to associate the filename extension .pl with Perl. This is done as part of the Perl installation process, which modifies the registry settings to include this file association. You can then launch this_program.pl directly. In MS-DOS, type perl followed by the complete pathname to the program, for instance perl c:\windows\desktop\my_program.pl. For writing programs, Notepad works satisfactorily as an editor.
(Computers are VERY dumb: they do exactly what you tell them to do, so be careful what you ask for...)
Program details: Perl programs always start with the line #!/usr/bin/perl. This tells Linux that this is a Perl program and where to find the Perl interpreter. On Windows this line is not needed, since the .pl extension is enough, but it is a good idea to include it anyway. All other lines that start with # are considered comments and are ignored by Perl. Lines that are Perl commands end with a ;
The simplest program:
#!/usr/bin/perl
# Print a greeting followed by a newline
print "Hello\n";