Comparative genomics is the comprehensive, systematic comparison of genome sequences. It begins with computer programs that identify homologous regions within the genomes under comparison. Sets of homologous sequences are then grouped and aligned at the base-pair level in an attempt to build whole-genome sequence alignments. The aim is to discover what lies hidden in genomic sequences by comparing sequence information.
By comparing the human genome with the genomes of other organisms, researchers can better understand the structure and function of human genes and thereby develop new strategies in the battle against human disease. In addition, comparative genomics provides a powerful tool for studying evolutionary changes among organisms, helping to identify the genes that are conserved among species along with the genes that give each organism its own unique characteristics.
What are some questions that comparative genomics can address?
Phylogenetic distance
The information that can be gained by comparing genomes depends largely on the phylogenetic distance between them. Phylogenetic distance is a measure of the degree of separation between two organisms or genomes on an evolutionary scale, usually expressed as the number of accumulated sequence changes, the number of years, or the number of generations. The greater the distance, the lower the sequence similarity and the fewer the shared genomic features.
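One way to express distance as a number of accumulated sequence changes is sketched below. This is an illustration under stated assumptions, not a method prescribed here: the two sequences are assumed to be already aligned, the fraction of differing sites (p-distance) is computed over ungapped columns, and the Jukes-Cantor (1969) correction is applied to account for multiple substitutions at the same site. The function names and the toy sequences are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(seq_a, seq_b):
    """Estimated substitutions per site under the Jukes-Cantor model."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        # The correction is undefined once sequences look random to each other.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy example: one difference in ten aligned sites.
print(p_distance("ACGTACGTAC", "ACGTACGTTC"))
print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTTC"))
```

The corrected distance is always at least as large as the raw fraction of differences, and the gap between the two widens as sequences diverge, which is one reason very distant genomes become hard to compare at the nucleotide level.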
Comparisons of Genomes at Different Phylogenetic Distances Are Appropriate to Address Different Questions
Broad insights about types of genes can be gleaned by genomic comparisons at very long phylogenetic distances, e.g., greater than 1 billion years since their separation. For example, comparing the genomes of yeast, worms, and flies reveals that these eukaryotes encode many of the same proteins, and the non-redundant protein sets of flies and worms are about the same size, being only twice that of yeast. The more complex developmental biology of flies and worms is reflected in the greater number of signaling pathways in these two species than in yeast. Over such very large distances, the order of genes and the sequences regulating their expression are generally not conserved.
At moderate phylogenetic distances (roughly 70–100 million years of divergence), both functional and nonfunctional DNA is found within the conserved DNA. In these cases, the functional sequences will show a signature of purifying or negative selection, which is that the functional sequences will have changed less than the nonfunctional or neutral DNA (Jukes and Kimura 1984).
Commonly used tools
• UCSC Browser: contains the reference sequence and working draft assemblies for a large collection of genomes.
• Ensembl: the Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online.
• MapView: the Map Viewer provides a wide variety of genome mapping and sequencing data.
• VISTA: a comprehensive suite of programs and databases for comparative analysis of genomic sequences. It was built to visualize the results of comparative analysis based on DNA alignments, and its presentation of comparative data suits both small-scale and large-scale data.
• BlueJay Genome Browser: a stand-alone visualization tool for the multi-scale viewing of annotated genomes and other genomic elements.
How are genomes compared?
• Chromosome level
• Number of genes
• Genome size
• Content (sequence)
• Location (map position)
• Gene order
• Gene clusters (genes that are part of a known metabolic pathway and are found to exist as a group)
• Translocation: movement of part of a genome from one position to another
GENOME ALIGNMENT
Alignment of DNA sequences is the core process in comparative genomics. An alignment is a mapping of the nucleotides in one sequence onto the nucleotides in the other sequence, with gaps introduced into one or the other sequence to increase the number of positions with matching nucleotides. Several powerful alignment algorithms have been developed to align two or more sequences. Popular alignment programs such as BLAST and FASTA or the multiple alignment program Clustal W are essentially optimized for the alignment
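The definition of an alignment given above (a mapping between nucleotides, with gaps inserted to increase the number of matching positions) can be illustrated with a small global alignment by dynamic programming, in the style of Needleman and Wunsch. This is a teaching sketch only: it is not how BLAST, FASTA, or Clustal W work internally, it uses quadratic time and memory and so does not scale to genomes, and the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns (aligned_a, aligned_b, score)."""
    n, m = len(seq_a), len(seq_b)
    # score[i][j] is the best score for aligning seq_a[:i] with seq_b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two bases
                              score[i - 1][j] + gap,       # gap in seq_b
                              score[i][j - 1] + gap)       # gap in seq_a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, total = needleman_wunsch("ACGT", "AGT")
print(aligned_a)
print(aligned_b)
print(total)
```

With these scores the single gap is placed opposite the unmatched C, so three positions match at the cost of one gap.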
Computational tools for genome-scale sequence alignment
Human PKLR gene region compared to the macaque, dog, mouse, chicken, and zebrafish genomes. Numbers on the vertical axis represent the proportion of identical nucleotides in a 100-bp window for a point on the plot. Numbers on the horizontal axis indicate the nucleotide position from the beginning of the 12-kilobase human genomic sequence. Peaks shaded in blue correspond to the PKLR coding regions. Peaks shaded in light blue correspond to PKLR mRNA untranslated regions. Peaks shaded in red correspond to conserved non-coding regions (CNSs), defined as areas where the average identity is > 75%. Alignment was generated using the sequence comparison tool VISTA (http://pipeline.lbl.gov).
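The quantity plotted in such a figure can be approximated with a sliding window over a pairwise alignment. The sketch below is not VISTA's implementation: it assumes two pre-aligned sequences of equal length, indexes windows by reference (human) positions, and flags windows whose identity exceeds the 75% threshold quoted in the caption. Separating coding exons from conserved non-coding regions would additionally require gene annotation, which is left out. All function names are hypothetical.

```python
def identity_profile(ref_aln, other_aln, window=100):
    """Percent identity for each window of `window` reference positions."""
    if len(ref_aln) != len(other_aln):
        raise ValueError("sequences must be aligned to the same length")
    # Drop columns where the reference has a gap so that list indices
    # correspond to positions in the reference sequence.
    matches = [int(r == o)
               for r, o in zip(ref_aln.upper(), other_aln.upper()) if r != "-"]
    if len(matches) < window:
        return []
    running = sum(matches[:window])
    profile = [100.0 * running / window]
    for start in range(1, len(matches) - window + 1):
        # Slide by one position: add the entering column, drop the leaving one.
        running += matches[start + window - 1] - matches[start - 1]
        profile.append(100.0 * running / window)
    return profile


def conserved_regions(profile, window, threshold=75.0):
    """Merge consecutive windows above threshold into (start, end) intervals.

    Coordinates are 0-based reference positions; `end` is exclusive.
    """
    regions = []
    start = None
    for pos, value in enumerate(profile):
        if value > threshold:
            if start is None:
                start = pos
        elif start is not None:
            regions.append((start, pos - 1 + window))
            start = None
    if start is not None:
        regions.append((start, len(profile) - 1 + window))
    return regions


# Toy example with a 4-bp window instead of 100 bp.
profile = identity_profile("ACGTACGT", "ACGTTTTT", window=4)
print(profile)
print(conserved_regions(profile, window=4))
```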
Notice the high degree of sequence similarity between human and macaque (two primates) in PKLR exons (blue) as well as in introns (red) and untranslated regions (light blue) of the gene. In contrast, the chicken and zebrafish alignments with human show similarity only in the coding exons; the rest of the sequence has diverged to a point where it can no longer be reliably aligned with the human DNA sequence. Using such computer-based analysis to zero in on the genomic features that have been preserved in multiple organisms over millions of years, researchers are able to locate the signals that mark the location of genes, as well as sequences that may regulate gene expression. Indeed, many of the functional parts of the human genome have been discovered or verified by this type of sequence comparison (Lander et al. 2001), and it is now a standard component of the analysis of every new genome sequence.
Comparison of overall nucleotide statistics
• Overall nucleotide statistics, such as
– Genome size
– Overall (G+C) content
– Regions of different (G+C) content
– Genome signatures such as codon usage biases, amino acid usage biases, and the ratio of observed to expected dinucleotide frequencies
• These all present a global view of the similarities and differences of the genomes.
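Two of the statistics listed above are simple enough to sketch directly: overall G+C content and the dinucleotide observed-to-expected ratio, where the expected frequency of a dinucleotide is the product of the frequencies of its two bases. The code below is an illustration with hypothetical function names. It counts on a single strand only and skips ambiguous bases, whereas published genome-signature analyses often combine both strands.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("G") + seq.count("C")) / total


def dinucleotide_odds_ratios(seq):
    """Observed/expected ratio f(xy) / (f(x) * f(y)) for each dinucleotide seen."""
    seq = seq.upper()
    mono = Counter(base for base in seq if base in "ACGT")
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1)
                 if seq[i] in "ACGT" and seq[i + 1] in "ACGT")
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    return {
        pair: (count / n_di) / ((mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono))
        for pair, count in di.items()
    }


print(gc_content("GGCCAATT"))
print(dinucleotide_odds_ratios("ACAC"))
```

A ratio well below 1 indicates an under-represented dinucleotide and a ratio well above 1 an over-represented one; comparing these profiles between genomes gives a quick, alignment-free view of how similar their composition is.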
SYNTENY
Synteny refers to regions of two genomes that show considerable similarity in sequence and in the order of their genes, and that are therefore likely to be related by common descent. By mapping syntenic regions in corresponding genomes, genome rearrangement events such as fission, translocation, inversion, and transposition can be identified.
Analysis of Breakpoints
Once syntenic regions are detected, one can obtain the breakpoints (a.k.a. syntenic boundaries) between them. Analysis of various genomic features of the breakpoints, such as G+C content, gene density, and the density of various DNA repeats, provides understanding of the evolution of genomes. For instance, Mural et al. observed sharp discontinuity of features around some syntenic boundaries but not others. They hypothesized that syntenic boundaries that do not show sharp transitions in these various features may provide evidence for conservation of the ancestral pattern in the lineage.
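The idea of syntenic regions and the breakpoints between them can be sketched on gene order alone. The example below makes strong simplifying assumptions that real synteny tools do not: each genome is a single chromosome given as an ordered list of gene identifiers, orthologous genes share the same identifier, and a block is extended only when the next gene is immediately adjacent in the second genome (in the same or the reversed direction). The function names are hypothetical.

```python
def syntenic_blocks(order_a, order_b):
    """Split genome A's gene order into maximal runs that are collinear in B."""
    pos_b = {gene: index for index, gene in enumerate(order_b)}
    shared = [gene for gene in order_a if gene in pos_b]
    blocks = []
    current = []
    direction = 0  # +1 same orientation, -1 inverted, 0 not yet known
    for gene in shared:
        if current:
            step = pos_b[gene] - pos_b[current[-1]]
            if step in (1, -1) and direction in (0, step):
                direction = step
                current.append(gene)
                continue
            blocks.append(current)
        current = [gene]
        direction = 0
    if current:
        blocks.append(current)
    return blocks


def breakpoints(blocks):
    """Pairs of genes in genome A that flank each boundary between blocks."""
    return [(left[-1], right[0]) for left, right in zip(blocks, blocks[1:])]


# Genome B carries an inversion of the segment g4-g5-g6.
genome_a = ["g1", "g2", "g3", "g4", "g5", "g6"]
genome_b = ["g1", "g2", "g3", "g6", "g5", "g4"]
blocks = syntenic_blocks(genome_a, genome_b)
print(blocks)
print(breakpoints(blocks))
```

Features such as G+C content or repeat density could then be compared in windows on either side of each reported boundary.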
GENE CENTRIC COMPARISON
• Homologs: genes that have the same ancestor; in general they retain the same function
• Orthologs: homologs from different species (arise from speciation)
• Paralogs: homologs from the same species (arise from duplication)
• Duplication before speciation (ancient duplication): out-paralogs; may not have the same function
• Duplication after speciation (recent duplication): in-paralogs; likely to have the same function
Gene clusters
In prokaryotes, groups of functionally related genes tend to be located in close proximity to each other, and often in a specific order, as exemplified by operons. Although gene order conservation beyond the level of operons is much less prevalent, conservation of clusters and gene order can be an important indicator of function. Several approaches have been used to determine functionally related "clusters" of genes. Overbeek et al. use the constructs of a "pair of close bidirectional best hits" (PCBBH) and "pairs of close homologs" (PCHs) to represent pairs of genes that are closely conserved between two species and likely to be functionally related.
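A rough sketch of the "pair of close bidirectional best hits" idea follows. It is a simplification and not Overbeek et al.'s implementation: similarity scores are assumed to be precomputed (for example from a sequence search) and supplied as a dictionary, gene positions are single start coordinates, strand is ignored, and the 300 bp distance cutoff is an assumed default parameter. All names are hypothetical.

```python
from itertools import combinations


def bidirectional_best_hits(scores):
    """Return {gene_a: gene_b} for pairs that are each other's best hit.

    `scores` maps (gene_a, gene_b) to a similarity score between a gene of
    genome A and a gene of genome B.
    """
    best_for_a, best_for_b = {}, {}
    for (a, b), score in scores.items():
        if a not in best_for_a or score > scores[(a, best_for_a[a])]:
            best_for_a[a] = b
        if b not in best_for_b or score > scores[(best_for_b[b], b)]:
            best_for_b[b] = a
    return {a: b for a, b in best_for_a.items() if best_for_b.get(b) == a}


def close_bbh_pairs(bbh, pos_a, pos_b, max_gap=300):
    """Pairs of bidirectional best hits that are close together in both genomes."""
    pairs = []
    for (xa, xb), (ya, yb) in combinations(sorted(bbh.items()), 2):
        close_in_a = abs(pos_a[xa] - pos_a[ya]) <= max_gap
        close_in_b = abs(pos_b[xb] - pos_b[yb]) <= max_gap
        if close_in_a and close_in_b:
            pairs.append(((xa, xb), (ya, yb)))
    return pairs


scores = {
    ("a1", "b1"): 90, ("a1", "b2"): 40,
    ("a2", "b2"): 85, ("a2", "b1"): 30,
    ("a3", "b3"): 70, ("a3", "b1"): 60,
}
pos_a = {"a1": 1000, "a2": 1200, "a3": 50000}
pos_b = {"b1": 7000, "b2": 7250, "b3": 900}
bbh = bidirectional_best_hits(scores)
print(bbh)
print(close_bbh_pairs(bbh, pos_a, pos_b))
```

In the toy data, a1/a2 sit next to each other in genome A and their best hits b1/b2 sit next to each other in genome B, so they are reported as a candidate functionally coupled pair; a3 has an ortholog but no conserved neighbor.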
COGs
Clusters of orthologous genes (COGs) are groups of three or more orthologous genes, meaning that they are direct evolutionary counterparts and are considered to be part of an 'ancient conserved domain'. A COG is defined as three or more proteins from the genomes of distant species that are more similar to each other than to any other protein within the individual genomes. COGs can be used to predict the function of homologous proteins in poorly studied species and to track evolutionary divergence from a common ancestor, which makes them a powerful tool for functional annotation of uncharacterized proteins and an important resource in comparative genomics studies.
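The "three or more proteins from distant species" criterion can be illustrated by looking for triangles of mutual best hits that span three genomes and then merging triangles that share a side. The sketch below assumes the best-hit pairs have already been computed and that each protein is labelled with its genome; it omits the handling of paralogs and the manual curation used for the real database. The function names and protein identifiers are hypothetical.

```python
from itertools import combinations


def best_hit_triangles(best_hit_pairs, genome_of):
    """Triangles of mutually best-hit proteins drawn from three different genomes."""
    neighbors = {}
    for a, b in best_hit_pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    triangles = set()
    for a, b in best_hit_pairs:
        for c in neighbors[a] & neighbors[b]:
            if len({genome_of[a], genome_of[b], genome_of[c]}) == 3:
                triangles.add(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share a side (two proteins) into clusters."""
    tris = sorted(triangles, key=sorted)
    parent = list(range(len(tris)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(tris)), 2):
        if len(tris[i] & tris[j]) >= 2:
            parent[find(i)] = find(j)
    clusters = {}
    for i, tri in enumerate(tris):
        clusters.setdefault(find(i), set()).update(tri)
    return sorted(sorted(members) for members in clusters.values())


genome_of = {"ecoX": "E", "bsuX": "B", "mjaX": "M", "hinX": "H",
             "ecoY": "E", "bsuY": "B"}
pairs = [("ecoX", "bsuX"), ("ecoX", "mjaX"), ("bsuX", "mjaX"),
         ("hinX", "ecoX"), ("hinX", "bsuX"), ("ecoY", "bsuY")]
print(merge_triangles(best_hit_triangles(pairs, genome_of)))
```

The pair ecoY/bsuY is left out of any cluster because it is supported by only two genomes, which mirrors the three-genome minimum in the definition above.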
Application of COGs
The most straightforward application of the COGs is the prediction of functions of individual proteins or protein sets, including those from newly completed genomes.
COG database
NCBI provides a COG database that consists of 4,873 COGs comprising about 138,000 proteins from the genomes of 50 bacteria, 13 archaea, and 3 unicellular eukaryotes. This database uses completely sequenced genomes to classify proteins using the orthology concept.
MBGD
MBGD is a database for comparative analysis of completely sequenced microbial genomes, the number of which is now growing rapidly. The aim of MBGD is to facilitate comparative genomics from various points of view, such as ortholog identification, paralog clustering, motif analysis, and gene order comparison.
Comparative analysis of coding regions typically involves the identification of gene-coding regions, comparison of gene content, and comparison of protein content. Recently, a number of algorithms have also been developed that use comparative genomics to aid function prediction of genes. The analysis and comparison of coding regions starts with, and depends heavily upon, the gene identification algorithm that is used to infer which portions of the genomic sequence encode genes.
Multiple gene identification approaches are often combined in large-scale analyses to improve the overall accuracy.
Comparative analysis of noncoding regions
Noncoding regions, which may comprise as much as 97% of the genome length, as in the human genome, have gained a lot of attention in recent years because of their predicted role in the regulation of transcription, DNA replication, and other biological functions. However, identification of regulatory elements in the noncoding portion of a genome remains a challenge. Comparative genomics greatly aids the identification of regulatory segments by comparing noncoding genomic DNA sequences from diverse species to identify conserved regions. This approach is based on the presumption that selective pressure causes regulatory elements to evolve at a slower rate than nonregulatory sequences in the noncoding regions.
Analysis of mutations
• Search and display of mutations within multiple alignments, with discrimination between intergenic, synonymous, non-synonymous, and indel mutations
• Additional filtering based on SNP quality scores
• Display colors based on mutation type or quality; sorting based on position, gene, nucleotide (NA) change, amino acid (AA) change, or quality
• Direct clustering based upon mutations, or export of the mutation list for further analysis
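The discrimination between intergenic, synonymous, non-synonymous, and indel mutations listed above can be sketched for a pairwise alignment as follows. This is an illustration, not the behavior of any particular tool. It assumes the standard genetic code, a single forward-strand coding region given as alignment columns (start inclusive, end exclusive), and no gaps upstream inside that coding region, since the reading frame is computed from column positions. The names are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_mutations(ref_aln, alt_aln, cds_start, cds_end):
    """Label every differing alignment column as one of four mutation types."""
    ref_aln, alt_aln = ref_aln.upper(), alt_aln.upper()
    calls = []
    for pos, (ref_base, alt_base) in enumerate(zip(ref_aln, alt_aln)):
        if ref_base == alt_base:
            continue
        if ref_base == "-" or alt_base == "-":
            kind = "indel"
        elif not cds_start <= pos < cds_end:
            kind = "intergenic"
        else:
            # Find the codon containing this column and translate both versions.
            codon_start = pos - (pos - cds_start) % 3
            ref_aa = CODON_TABLE.get(ref_aln[codon_start:codon_start + 3])
            alt_aa = CODON_TABLE.get(alt_aln[codon_start:codon_start + 3])
            if ref_aa is None or alt_aa is None:
                kind = "unclassified"  # codon contains a gap or ambiguous base
            elif ref_aa == alt_aa:
                kind = "synonymous"
            else:
                kind = "non-synonymous"
        calls.append((pos, ref_base, alt_base, kind))
    return calls


# Columns 3-11 hold the coding region ATG GCT TAA.
reference = "CCCATGGCTTAAGG"
sample = "TCCAAGGCCTAAG-"
for call in classify_mutations(reference, sample, cds_start=3, cds_end=12):
    print(call)
```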
Nonfunctional protein-coding genes
• Mutations introduce “sequence problems” (frameshifts, in-frame stops, absence of a stop codon): pseudogenes?
• “Normal” bacterial genomes have 1–5% pseudogenes [Liu et al]
• Pseudogenes can give interesting clues to evolutionary pathways
• High fractions of pseudogenes suggest a “genome degradation” process
• May be cause or effect of niche restriction
• Examples
– Mycobacterium leprae: 36% (~1,100 genes)
– Leifsonia xyli subsp. xyli: 13% (~300 genes)
• Pseudogenes do not show up in BLAST searches
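The three "sequence problems" named above can be checked mechanically on a candidate coding sequence. The sketch below is a crude screen rather than a pseudogene caller: it assumes the sequence is given 5' to 3' on the coding strand starting at the first codon, uses the three standard stop codons (which would be wrong for organisms that reassign a stop codon), and treats a length that is not a multiple of three as a sign of a possible frameshift. Real pipelines compare against an intact ortholog instead. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def sequence_problems(cds):
    """List the sequence problems found in a candidate coding sequence."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("possible frameshift")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("in-frame stop")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing stop")
    return problems


for candidate in ("ATGGCTTAA", "ATGTAAGCTTAA", "ATGGCTGCT", "ATGGCTTA"):
    print(candidate, sequence_problems(candidate))
```

Dividing the number of genes with at least one reported problem by the total gene count gives a rough pseudogene fraction of the kind quoted in the examples above.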
Applications
Gene identification
Comparative genomics can aid gene identification by recognizing real genes based on their patterns of nucleotide conservation across evolutionary time. With genome-wide alignments available across the genomes compared, the different ways in which sequences change in known genes and in intergenic regions can be analyzed. The alignments of known genes reveal the conservation of the reading frame of protein translation.
Regulatory motif discovery
Regulatory motifs are short DNA sequences, about 6 to 15 bp long, that are used to control the expression of genes, dictating the conditions under which a gene will be turned on or off. Each motif is typically recognized by a specific DNA-binding protein called a transcription factor (TF). A transcription factor binds precise sites in the promoter region of target genes in a sequence-specific way, but this contact can tolerate some degree of sequence variation. Comparative genomics provides a powerful way to distinguish regulatory motifs from non-functional patterns based on their conservation.
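As a minimal illustration of using conservation to pick out candidate regulatory motifs, the sketch below scans a multiple alignment of promoter regions for ungapped runs of columns that are identical in every species and at least 6 bp long, the lower end of the motif length range given above. It is deliberately strict: as noted, real binding sites tolerate some sequence variation, so practical motif finders score degenerate positions rather than demanding perfect identity. The function name and the toy sequences are hypothetical.

```python
def conserved_blocks(alignment, min_len=6):
    """Return (start_column, sequence) for runs of perfectly conserved columns."""
    rows = [row.upper() for row in alignment]
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    blocks = []
    start = None
    # Iterate one column past the end so that a run reaching the end is closed.
    for col in range(length + 1):
        conserved = (
            col < length
            and rows[0][col] != "-"
            and all(row[col] == rows[0][col] for row in rows)
        )
        if conserved:
            if start is None:
                start = col
        elif start is not None:
            if col - start >= min_len:
                blocks.append((start, rows[0][start:col]))
            start = None
    return blocks


promoters = [
    "AATGACTCATTG",  # species 1
    "CCTGACTCAGGA",  # species 2
    "GTTGACTCACCT",  # species 3
]
print(conserved_blocks(promoters))
```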
Applications
Comparative genomics has wide applications in the fields of molecular medicine and molecular evolution. The most significant application of comparative genomics in molecular medicine is the identification of drug targets for many infectious diseases. For example, comparative analyses of fungal genomes have led to the identification of many putative targets for novel antifungals. This discovery can aid target-based drug design to treat fungal diseases in humans. Comparative genomics also helps in the clustering of regulatory sites, which can help in the recognition of unknown regulatory regions in other genomes. The regulation of metabolic pathways can also be recognized by means of comparative genomics. Agriculture is another field that reaps the benefits of comparative genomics: identifying the loci of advantageous genes is a key step in breeding crops that are optimized for greater yield, cost-efficiency, quality, and disease resistance.