David Lipman
NCBI/NIH, Bethesda, MD, USA

Margaret Dayhoff and Molecular Evolution in the 21st Century

Vineet Bafna
University of California at San Diego


Mass spectrometry is commonly used to identify and quantify expressed proteins in a sample when the protein sequence has previously been determined. The discovery of protein sequences is usually done through an analysis of genomic and protein-coding regions. In this talk, we discuss how mass spectrometry can be used to identify and correct eukaryotic gene structures. A large study of Arabidopsis resulted in the discovery of 900 new genes. Moreover, a genome is at best an imperfect template for genes. As an example, antibodies are generated by combining templated (via recombination and splicing) and non-templated sequence. We describe template proteogenomics, a technique that uses mass spectrometry to predict proteins using an imperfect genomic template. Our analysis of antibody data resulted in the near-perfect reconstruction of many different targets.

Gill Bejerano
Department of Developmental Biology and
Department of Computer Science, Stanford University

Genomics and the evolution of human-specific traits

The availability of several primate whole genome sequences has spurred great excitement for the prospect of understanding the molecular basis of what makes us human. Recent investigations have discovered conserved non-protein-coding genomic loci that have experienced accelerated base-pair changes in the human lineage, as well as protein-coding genes that show similar evidence of positive selection.

We expand these studies in search of human-specific events particularly likely to produce functional effects. I will share a computational screen resulting in nearly 600 such regions lying in proximity to genes involved in development, morphogenesis, neural function, and steroid hormone signaling.

We have functionally tested a subset of these regions in mice, and have found intriguing examples of regulatory alterations in humans that appear to be associated with evolution of specific anatomical differences between humans and other animals.

Jeffrey Bennetzen
Department of Genetics, University of Georgia

The hyperevolution of artifacts and realities in the structure and function of higher plant genomes

Plant genomes change their structure and apparent gene content at an exceptional rate, thereby providing excellent model systems for the study of genome rearrangement and its effects upon gene and genome function. However, the numerous gene presence/absence and gene movement polymorphisms are mostly annotation artifacts that, though easily identified, are never quantified in full genome analyses. Once these artifacts have been detected and removed, the actual rate of genome restructuring is still quite high, but involves primarily differences in gene copy number and local arrangement. Most or all of the mechanisms that rearrange plant genomes are now known, and none is fully random. Gene loss is common, as is exon shuffling to create candidate novel genes. I will describe the processes that restructure plant nuclear genomes, including some specific cases that lead to hotspots for genomic instability even within these generally unstable genomes.

Mark Borodovsky
Georgia Tech and Emory University

Gene Finding in the Era of Next Generation Sequencing

Next generation sequencing is quickly changing long-standing paradigms of genomics in terms of what is feasible to accomplish within a research career span and what is supposed to remain beyond the limits of reliable experimental analysis. Rapid sequencing of large plant and animal genomes, accumulation of metagenomic data, and advances in RNA-seq drive the complexity of the task of consistent and accurate annotation of biological "Big data" beyond any expectations. Gene finding plays a fundamental role in bioinformatics. Accurate gene calling is crucial for the success of a number of computational and experimental approaches. Comparative genomic characterization of protein function, manufacturing of DNA microarrays, and building of biomolecular networks all rely on genes and proteins identified by gene finding tools. In the talk I will describe the machine learning methods that we have recently developed for ab initio gene finding in novel eukaryotic genomes as well as in metagenomes. These methods provide the fastest way to interpret sequence data, as they do not require accumulation of experimental or expert knowledge related to the novel genomic data. This is joint work with Alexandre Lomsadze, Paul Burns, Ivan Antonov and Wenhan Zhu.

Nick Grishin
Howard Hughes Medical Institute, University of Texas

Evolutionary Classification of Protein Structures

Recent advances in automatic methods to deduce evolutionary relationships between proteins using both sequences and spatial structures allowed us to classify domains with known structure into ca. 1500 homologous groups. The logic behind this classification, its differences from SCOP, and approaches we developed to combine sequence and structure signals for the support of homology between weakly similar proteins will be discussed, and some examples of unexpected and non-trivial relationships between domains will be shown.

Curtis Huttenhower
Harvard University

Large scale genomic data mining

Computational biology deals with a wide range of biological scales: molecular data describing cellular function, population studies incorporating genomic data, and the systems biology between these extremes that allows us to better understand human health and disease. At all of these levels, the scale of available data is large; public repositories of genomic data currently contain billions of experimental results from a variety of assays. While modern search engines have organized the size and heterogeneity of other complex systems such as the Internet, it remains an open question how machine learning can be used to mine large genomic data collections for answers to specific biological questions.

I will discuss algorithmic approaches to large scale genomic data integration and preliminary results applying this methodology to two additional extremes of scale: a clinical cohort and microbial communities. HEFalMp, a human genomic data mining system, incorporates information from ~30,000 genome-scale experiments to provide functional maps in over 200 areas of human cellular biology. Each map relies on efficient machine learning and network analysis to vertically integrate individual experimental results, complete experimental datasets, whole-genome functional interaction networks, and systems-level maps of pathway co-regulation and disease linkage.

In a population context, applying this system to human genomic data from a ~1,000-subject colorectal cancer cohort has begun to explain the mechanisms connecting BRAF and KRAS mutations, LINE-1 hypomethylation, and CpG island methylation. Applying similar computational methodology to data from microbial populations, I will discuss preliminary results on data integration for gene and protein function transfer in pathogen communities and uncultured metagenomic samples.

King Jordan
Georgia Tech

MIR elements provide chromatin boundaries to the human genome

Boundary elements function as higher-order eukaryotic regulatory sequences by partitioning active and repressive chromatin domains. Previous studies indicated that a number of chromatin boundary elements bear tRNA-related gene features, including RNA Pol III B-box promoter elements and insulator protein binding sites. Boundary elements are also distinguished by their epigenetic characteristics, including specific histone tail modifications. We developed and applied an algorithm for the discovery of novel boundary elements based on these known sequence and epigenetic features. We computationally screened the human genome for novel boundary element sequences by focusing on MIRs, a family of tRNA-derived SINE transposable elements. By co-locating B-box-containing MIR elements with experimentally characterized binding sites for the insulator binding protein CTCF, we obtained 15,072 potential boundary elements. The distributions of histone tail modifications at and around these putative MIR-derived boundary elements are consistent with those of known CTCF-bound insulators. The list of putative boundary elements was further narrowed to 223 MIR sequences that best match the sequence and epigenetic features of tRNA-related boundaries. This set of MIR insertions was shown to partition active versus repressive histone modifications along the genome, consistent with a functional role as chromatin boundaries. Furthermore, genes located in the active chromatin domains delineated by the putative MIR-derived boundary elements are more highly expressed than those found in the adjacent repressively modified domains. Our boundary element prediction algorithm was used to generate a prioritized list of the MIR elements that are most likely to contribute functional boundary elements to the human genome. Two of these predicted MIR boundaries were then experimentally validated by using an enhancer-blocking assay (EBA). The enhancer-blocking activities of our predicted MIR human boundary elements are similar to the activity of a known SINE-derived boundary element in mouse. These results underscore the potential for interspersed repeats to delineate active and repressive chromatin domains genome-wide.
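
The co-location of MIR elements with CTCF binding sites mentioned in this abstract is, at its core, an interval-overlap question. The following is a minimal sketch of that idea only, not the authors' algorithm: the function name `colocated`, the tuple layout and all coordinates are hypothetical, and a genome-scale screen would need sorted or indexed intervals rather than this quadratic loop.

```python
from typing import List, Tuple

# (chromosome, start, end) with half-open coordinates; layout is an assumption.
Interval = Tuple[str, int, int]


def colocated(mirs: List[Interval], ctcf_sites: List[Interval]) -> List[Interval]:
    """Return the MIR intervals that overlap at least one CTCF site."""
    hits = []
    for chrom, start, end in mirs:
        for c_chrom, c_start, c_end in ctcf_sites:
            # Two half-open intervals overlap when each starts before the other ends.
            if chrom == c_chrom and start < c_end and c_start < end:
                hits.append((chrom, start, end))
                break  # one overlapping site is enough
    return hits


# Made-up toy coordinates, for illustration only.
mirs = [("chr1", 100, 250), ("chr1", 900, 1100), ("chr2", 100, 250)]
ctcf = [("chr1", 240, 260), ("chr2", 500, 520)]
print(colocated(mirs, ctcf))
```

In this toy input only the first MIR interval overlaps a CTCF site on the same chromosome.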

Igor Jouline (Zhulin)
University of Tennessee - Oak Ridge National Laboratory

Molecular Evolution of a Complex Signal Transduction System in Prokaryotes

Molecular machinery that governs bacterial motility (chemotaxis) is one of the best studied signal transduction systems in Nature. Sophisticated behavior of Earth’s smallest organisms fascinated naturalists of the last century and modern biochemists, who hoped that the properties of the underlying molecular navigation system would resemble those of higher organisms. However, the latest structural and functional studies revealed no such similarity in the molecular design. Using the wealth of information encoded in hundreds of bacterial genomes, we reconstructed the evolutionary history of this system. Here we show that the chemotaxis system is the evolutionarily youngest and most sophisticated signal transduction pathway in prokaryotes. It appeared in Bacteria after the separation of the three domains of Life and later radiated into Archaea, but not into Eucarya. It developed from classical bacterial two-component regulatory systems that, in turn, originated from simple one-component systems consisting of a single protein with sensory and regulatory capabilities that were likely present in the Last Universal Common Ancestor. Through a series of domain innovations the chemotaxis system differentiated into several functional classes that evolved to control not only motility, but also other cellular functions. Detailed evolutionary analysis of individual system components allowed us to reveal novel structural and functional insights that have not been identified by previous experimental studies.

Eugene Koonin

Systems biology and the prospects of a post-modern evolutionary synthesis

Comparative genomics and systems biology have revealed several quantitative relationships between evolutionary and phenomic variables that are surprisingly conserved across a broad range of life forms. These universal dependencies include the distribution of evolutionary rates among orthologous genes; the distribution of the sizes of paralogous gene families; a strong negative correlation between expression level and gene evolutionary rate; differential scaling of genes of different functional classes with genome size; and more. At least some of these dependencies are accurately reproduced by simple stochastic models that incorporate fundamental processes of gene duplication, mutation and protein folding but not specific biological functions. Such models might contribute to the formulation of a new synthesis of evolutionary biology.

Boris Lenhard
Bergen Center for Computational Science and Sars Centre for Marine Molecular Biology,
University of Bergen, Norway

Long-, short- and mid-range gene regulation: lessons from genome-wide patterns of sequence conservation and transcription factor binding

The accumulation of data on highly conserved long-range enhancers, as well as the genome-wide binding data from ChIP experiments, has begun to dismantle the textbook picture of transcriptional regulation by transcription factors binding to either proximal promoter regions or distal upstream enhancers. Highly conserved enhancers of developmental genes can drive the expression of their target genes from megabase distances, often with one or more unaffected genes in between. As the new ChIP-seq-derived binding data show, transcription factors are often found to bind inside long introns, especially the first intron of metazoan genes, which is also the longest on average. At the same time, many expression patterns of tissue-specific genes can still be recapitulated using only a short upstream sequence of the gene driving the expression pattern of a reporter gene. We have investigated the distribution of regulatory elements driving expression of different functional categories of genes and the effect of their position and gene properties on their target genes and other genes in their neighborhood. The results indicate that the responsiveness of genes to long-range regulation strongly depends on the type of their core promoter and represents a defining property of several functionally distinct classes of genes. This dependence between core promoter type and responsiveness to long-range regulation is confirmed across Metazoa, using several independent approaches and data sets. The results have far-reaching implications for gene regulation studies and for explaining the evolutionary patterns of gene and genome duplications.

Jian Ma
University of Illinois at Urbana Champaign

Unraveling the ancestral mammalian genome yields insights into the human genome

Molecular evolution teaches us quite a bit about our own genome biology. For example, regions of the genome that code for proteins may be recognized by the distinctive way they evolve. The genomic data generated from many sequencing projects in the past decade have provided an unprecedented opportunity for us to understand the trajectory of the genetic changes leading to modern species using a comparative genomics approach, enabling us to explore the patterns of specific genomic innovations that occurred on different lineages. In this talk, I will discuss computational methods we developed to reconstruct the ancestral genomic sequences at different scales, ranging from small genomic changes (substitutions and indels) to larger chromosomal operations (rearrangements and duplications), to render explicit the genetic history that is implicit in the genomes of living mammals. In addition, I will discuss several computational challenges and applications related to genome reconstruction analysis.

Yael Mandel-Gutfreund
Technion, Israel Institute of Technology, Haifa, Israel

Deciphering the Role of Alternative Splicing in Modulating the Human Gene Regulatory Network

Alternative splicing is a post-transcriptional process which is considered to be responsible for the huge diversity of human proteins. In this study we have analyzed a unique set of conserved alternative splicing events, including alternative splice sites and cassette exons, concentrating on genes encoding proteins which are involved in the regulation of the gene-expression pathway. As observed previously for alternative splicing in general, we show that alternative splicing events which are conserved between human and mouse affect protein regions which are predicted to be significantly more disordered than the rest of the proteins. Accordingly, these regions are predicted to be located at the protein surface, ranked as highly exposed regions. This phenomenon was strikingly more apparent for the subset of genes coding for regulatory proteins, specifically those related to transcription regulation. Furthermore, by applying a reliable computational tool for predicting post-translational modifications, we found that these proteins are predicted to have a significantly higher density of phosphorylation sites compared to control sets of proteins. Moreover, we found that the predicted phosphorylation sites themselves were predominantly conserved at the alternatively spliced regions. To study the global relationship between splicing regulation and transcription regulation, we built a co-regulatory network, where the nodes of the network are the splicing and transcription factors and the edges represent predicted regulatory interactions between the factors. Based on our predictions we uncovered an interesting network of interactions between alternative splicing and transcription regulation. Overall, our results suggest that alternative splicing plays an important role in regulating the gene expression pathway, presumably by modifying the regulatory regions of the proteins involved in the process.

Joanna Masel
University of Arizona

The origin of new coding sequences

Random polypeptide sequences are likely to be strongly deleterious. It has therefore long been thought that new protein sequences are always derived from old ones, through duplication and divergence. More recent evidence suggests that noncoding sequences have sometimes been converted into coding sequences through changes in splicing, in start codons, or in stop codons. There are even case studies of complete genes derived de novo from noncoding sequences. We suggest that de novo gene birth may be possible if and only if it happens in stages, with errors in each molecular process leading to an evolutionary "preview" of the next stage at very low levels of accidental expression. These previews allow "preadaptation" to occur by screening out the most deleterious sequences. To begin this evolutionary process, novel transcripts appear due to intrinsic bidirectional transcription. In subsequent stages, an ever-smaller proportion of transcripts escape degradation, are exported from the nucleus, associate with ribosomes, and acquire an open reading frame that is, at minimum, non-toxic. This evolutionary pathway is consistent with patterns in transcript orientations and lengths at each stage in Saccharomyces cerevisiae.

Andrey Mironov
Moscow State University, Russia

Conserved Intronic RNA Secondary Structures

Accurate and efficient recognition of splice sites during pre-mRNA splicing is essential for proper transcriptome expression. Splice site usage can be modulated by secondary structures, but it is not clear whether this type of modulation is commonly used or whether it occurs to a significant degree with secondary structures forming over long distances. Using phylogenetic comparisons of intronic sequences among twelve Drosophila genomes, we elucidated a group of 202 highly conserved pairs of sequences, each at least nine nucleotides long, capable of forming stable stem structures. This set was highly enriched in alternatively spliced introns, introns with weak acceptor sites, and long introns, and most pairs occurred over long distances (>150 nucleotides). Experimentally, we analyzed the splicing of several of these introns using mini-genes in Drosophila S2 cells. Wild-type splicing patterns were changed by mutations that opened the stem structure, and restored by compensatory mutations that re-established the base-pairing potential, demonstrating that these secondary structures were indeed implicated in the splice site choice. Mechanistically, the RNA structures masked splice sites, brought together distant splice sites, and/or looped out introns. Thus, base-pairing interactions within introns, even those occurring over long distances, are more frequent modulators of alternative splicing than is currently assumed.
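
To make the notion of sequence pairs "capable of forming stable stem structures" concrete, here is a minimal sketch that scans a single RNA sequence for two segments of length k that are perfect reverse complements and lie at least a given distance apart. It is an illustration under simplifying assumptions, not the method used in the study: it ignores cross-species conservation, G-U wobble pairs and thermodynamic stability, and the function names, the toy sequence and the default parameters (k = 9, gap of 150 nucleotides, chosen to echo the numbers in the abstract) are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Watson-Crick reverse complement of an RNA string (A, C, G, U only)."""
    comp = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(comp[base] for base in reversed(seq))


def stem_pairs(rna: str, k: int = 9, min_gap: int = 150):
    """Return (i, j) pairs where rna[i:i+k] can pair perfectly with rna[j:j+k]
    and at least min_gap nucleotides separate the two segments."""
    # Index every k-mer by its start positions.
    positions = {}
    for i in range(len(rna) - k + 1):
        positions.setdefault(rna[i:i + k], []).append(i)

    pairs = []
    for i in range(len(rna) - k + 1):
        partner = reverse_complement(rna[i:i + k])
        for j in positions.get(partner, []):
            if j - (i + k) >= min_gap:  # downstream partner, far enough away
                pairs.append((i, j))
    return pairs


# Toy sequence: a 9-nt arm, a 150-nt spacer, then the arm's reverse complement.
arm = "GGGAUCCAC"
rna = arm + "A" * 150 + reverse_complement(arm)
print(stem_pairs(rna))
```

A realistic screen would additionally require that both arms, and their pairing potential, are conserved across the aligned genomes.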

Jason Miller
J. Craig Venter Institute

Studies of The Human Microbiome

It has become increasingly evident that the advent of metagenomics holds significant promise for increasing our understanding of microbial diversity on humans, in agriculture and in the environment. We stand to gain knowledge on the many microbial diseases associated with the human body, inclusive of those that are yet to be characterized. Current estimates are that the multitudes of microbial species that inhabit the human body vastly outnumber the host's somatic cells. The role of the majority of these species remains to be characterized. Recent technological advances are allowing us to generate in-depth sequence information on the diversity of these populations and how they change over time. Some of these species are being correlated with the onset and development of a number of diseases, and assembly, data mining, and annotation tools are trying to keep pace with the rapid pace of data generation. It is anticipated that the surveys of the human body will enable tremendous advances in this realm of science. Initial studies on the human microbiome conducted by our group and others have allowed for the identification of homologs to virulence factors, and the reconstruction of metabolic pathways from a range of microbial species. The availability of high-throughput metagenomic approaches now allows us to address medically relevant diseases that have been thought to have a microbial association but that we have not been in a position to investigate before. The recent NIH Roadmap initiative focused on the human microbiome, efforts leading up to this initiative, and additional spin-off projects will be presented.

Andrei Osterman
Burnham Institute for Medical Research

Integrated Genomic Reconstruction of Metabolic and Regulatory Networks in Bacteria

Metabolic reconstruction, an ability to infer metabolic pathways and networks directly from genomes, is one of the most important and successful applications of comparative genomics. Despite many limitations, especially in the analysis of diverse species, this technology sets the stage for predictive modeling of organisms’ behavior. A comparative genomics-based reconstruction of transcriptional regulons is another powerful technology enabling ab initio inference of genome-scale regulatory networks. Combination of these two technologies, in addition to improving our understanding of cellular networks, strongly impacts the accuracy of both reconstruction layers. Indeed, reconstructed metabolic pathways provide a starting point and impose consistency on regulatory inferences. At the same time, reconstructed regulons provide genomic context evidence for accurate functional assignment and prediction of novel genes in metabolic pathways. Synergy emerging from integration of metabolic and regulatory reconstructions is illustrated here by the analysis of carbohydrate catabolic machinery in a group of bacteria from the Shewanella genus. Briefly, we used a subsystems-based comparative approach implemented in The SEED database to reconstruct complete pathways for utilization of 17 distinct sugar substrates showing mosaic distribution among 19 Shewanella species with completely sequenced genomes. Of ~170 protein families (metabolic enzymes, transporters and transcriptional regulators) implicated in these pathways, ~60 families were previously unknown, and their functions were inferred using genomic context analysis (operons and regulons). Moreover, ~2/3 of reconstructed pathways represent novel variants that include nonorthologous gene replacements and alternative biochemical routes as compared to their canonical prototypes. Most prominent are variations in transport and transcriptional regulation. Some of these bioinformatic predictions as well as most of the predicted growth phenotypes were experimentally verified. In contrast to highly variable peripheral sugar utilization pathways, most enzymes of the central carbon metabolism (CCM) are conserved within all Shewanella species as well as between Shewanella and E. coli. However, the transcriptional regulation of CCM in Shewanella species is completely different, with HexR playing an unexpected role of a global regulator as inferred by genomic reconstruction and confirmed by qPCR analysis of the S. oneidensis ΔhexR mutant.

Natasa Przulj
Imperial College London, UK

From Network Topology to Biological Function and Disease

Historically, a new natural science proceeds through three stages of development: first, amassing observations about the world; second, developing simplistic models capable of approximately reproducing the observations; and finally, the development of accurate predictive theoretical models under which the observations and earlier models become evident. Our current understanding of biological networks can be likened to the state of physics before Newton: although Copernicus, Kepler, Galileo and others had amassed a huge corpus of observations, and even some simplistic, case-specific models describing various phenomena, there was no theoretical framework tying it all together to provide understanding. Systems biology is currently somewhere between the first and second stages: we can hardly even describe the observational data mathematically, much less understand it theoretically.

Analysis and comparison of genetic sequences is well into the second stage mentioned above and making tentative steps into the third, but network analysis is just barely entering the second stage. In this talk I discuss new tools, developed in my lab, which are advancing network analysis into this second stage, and possibly giving hints towards the third stage: a theoretical understanding of the structure of biological networks. Analogous to tools for analyzing and comparing genetic sequences, we are developing new tools that decipher large network data sets, with the goal of improving biological understanding and contributing to development of new therapeutics.

Because nature is variable and the data are noisy, traditional graph isomorphism is of little use for graph comparison, and a more flexible, intentionally approximate approach is necessary. We introduce a systematic measure of a network's local structure that imposes a large number of local similarity constraints on networks being compared. In particular, we generalize the degree of a node to a degree vector describing the local topology around a node. We demonstrate that this local node similarity corresponds to similarity in biological function and involvement in disease. We also show how to use the degree vectors to design a network alignment algorithm that produces correct phylogenetic trees. Next, we demonstrate how statistics from large numbers of these local similarity measures can be combined to provide a global network similarity measure. Using this global similarity measure, we demonstrate that protein-protein interaction (PPI) networks are better modeled by geometric graphs than by any previous model. The geometric model is further corroborated by demonstrating that PPI networks can explicitly be embedded into a low-dimensional geometric space. Finally, we argue for a theoretical reason why PPI networks might be geometric.
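
The idea of generalizing a node's degree to a vector describing its local topology can be illustrated with a small sketch. The version below is deliberately simplified and should not be read as the speaker's implementation: it counts only patterns on two and three nodes (the degree, the number of induced three-node paths the node ends, the number it sits in the middle of, and the number of triangles it touches), and the function name, the adjacency-dictionary layout and the toy graph are all assumptions made for illustration.

```python
from itertools import combinations


def degree_vector(adj, v):
    """Return (degree, path_end, path_middle, triangles) for node v.

    adj maps each node to the set of its neighbours in an undirected
    graph without self-loops.
    """
    nbrs = adj[v]
    degree = len(nbrs)

    triangles = 0
    path_middle = 0
    for a, b in combinations(nbrs, 2):
        if b in adj[a]:
            triangles += 1      # v, a, b are mutually connected
        else:
            path_middle += 1    # a - v - b with no a-b edge

    path_end = 0
    for u in nbrs:
        for w in adj[u]:
            if w != v and w not in nbrs:
                path_end += 1   # v - u - w with no v-w edge
    return (degree, path_end, path_middle, triangles)


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
for node in sorted(adj):
    print(node, degree_vector(adj, node))
```

Two nodes can then be compared by a distance between their vectors, so that nodes with the same plain degree but different neighbourhoods (for example, inside versus outside a triangle) are told apart.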

John Reinitz
State University of New York at Stony Brook

When Two Plus Two Doesn't Equal Four: Modeling Non-Modular Enhancer Behavior in the Eve Promoter

The prediction of expression patterns from genomic sequence is an important unsolved problem in modern molecular genetics. Its solution requires an understanding of the transcriptional consequences of particular configurations of bound factors. An important aspect of the problem is to understand how modular enhancers arise from binding sites. We are currently using the eve gene of Drosophila as a testbed for finding the general rules by which sequence controls gene expression in metazoa. We believe that the most informative experimental materials for such studies are instances where the usual additive behavior of enhancers breaks down. Such instances can reveal underlying rules, but the complexity of the experimental phenomena requires precise quantitative models for their interpretation. We consider two experimental situations in which modularity breaks down. In one case, a modular enhancer for stripe 2 fused to proximal sequences that do not drive any expression results in a fragment that expresses stripe 7, demonstrating nonadditive behavior. In another case, placing enhancers for stripes 2 and 3 adjacent to one another gives rise to a novel expression pattern, an example of another type of nonadditive behavior. I will show how both types of nonadditive behavior can be understood using a quantitative model in conjunction with quantitative data from promoter-reporter constructs.

Pierre Rouze
Gent University, Gent, Belgium

From Protists to Plants, Fungi and Animals: Eukaryote Genomes Are Not Born Equal

Since the incipience of genome-wide sequencing, more than a thousand genomes from eukaryotes have been sequenced, and the low cost of the “new sequencing” technologies will suddenly bring many more onto the shelves. There is a clear issue in making the best use of these data, finding and annotating the genes and other features from these new genomes. This issue has mainly been seen from a computer science perspective. I would like here to pinpoint another issue which has to do with biology. Most of the organisms which have been sequenced until lately were either fungi, animals or plants. Although all model organisms used and documented from a cell and molecular biology perspective are among these, this is nevertheless a minor fraction of the eukaryote phylogenetic spectrum. Annotation traditionally proceeds by analogy, either by searching for genes that are known or found elsewhere or ab initio by looking at recurrent features according to the knowledge we have of the molecular mechanisms of genome expression. Do we care about and know well enough these features and mechanisms, i.e. the way the information is structured and the way it is encoded in the lesser documented organisms which are soon going to be the bulk of genome sequences? Having been involved in the annotation of such organisms, e.g. green algae, brown algae, diatoms and haptophytes, we indeed came across unexpected findings which come as a warning about our capability to properly decipher the genome information content in such organisms.