Prediction of conserved cis-regulatory elements in the Xenopus genome using genome
alignment tools: Bioinformatic methods in use in the Grainger Laboratory.

Sarah Louie, Hajime Ogino, Robert Grainger

Details of these steps are discussed below in conjunction with a "Genomial Tutorial" .pdf file.

INTRODUCTION

     Conserved non-coding regions of genomes are of great interest to many biologists because, in many cases, they function as gene regulatory sequences. Comparative genomics utilizes the tendency of functional sequences to be selected for during evolution and therefore resolves them from nonfunctional orthologous sequences through multispecies comparisons. The evolutionary distance between the Xenopus tropicalis and the human genomes is approximately 350 million years (Hedges 2006), making the Xenopus genome well positioned for use in identifying conserved non-coding elements by phylogenetic analysis. For example, enhancer sequences can be identified through alignments between Xenopus and mammals (Ogino, 2008). However, many enhancers (and some exons) cannot be identified through alignments comparing fish and mammalian genomes because of further evolutionary distance, and, in the case of zebrafish, genome duplication (Schwartz, 2003; and our observations). Because our lab is interested in the regulatory networks involved in eye formation, we have evaluated many potentially useful bioinformatic methods to identify potential regulatory elements for eye genes using the Xenopus genome. The particular set of programs and processes described here have enabled us to rapidly and accurately predict functional regulatory elements in genes involved in eye development, as validated with transgenic methods in Xenopus . What follows is a description of a bioinformatic protocol, which we continually refine over time as various tools emerge and are updated. Please send comments and suggestions to grainger@virginia.edu .

RATIONALE

     The goal of phylogenetic analysis is to identify conserved non-coding sequences that are likely to function in gene regulation. Since enhancer sequences may be relatively short and distant from the coding region of a gene, they can be difficult to predict. By using genome alignment tools with appropriate levels of sensitivity and specificity, and incorporating genomes spanning several hundred million years of evolutionary pressure, it has become possible to resolve functional noncoding sequences from non-regulatory sequence within Mb sized regions of genomic sequence. Several steps are involved in predicting evolutionary conserved regulatory regions (ECRs) with bioinformatics: 1) identify large syntenic regions in multiple genomes of interest; 2) mask repeats and build sequence structure files; 3) perform pairwise and multiple alignments; and 4) extract short conserved regions for further study.
      Quick and easy ways to extract and analyze orthologous sequences near a gene of interest are available using pre-made genome alignments at genome browser websites such as ECR Browser and VISTA Browser. These tools were developed for rapid and adjustable analysis, nice visualization of genomic alignments, and easy sequence extraction and motif analysis. (ECR Browser (http://ecrbrowser.dcode.org/ , VISTA Browser http://pipeline.lbl.gov/cgi-bin/gateway2 , or the UCSC Genome Browser http://genome.ucsc.edu/).
     As an example exercise, explore the ECR Browser tools to see how easy it can be to generate a preliminary assessment of conserved regions and predicted transcription factor binding sites within. Go to the ECR Browser page, then choose a base genome for the alignment (“frog” for Xenopus tropicalis ; or leave “human” as the default setting), and enter a gene name or genome location for analysis (“Pax6” or “scaffold_399:628577-650781”, respectively, for example ). You will be taken to a gene locus in the genome of interest after clicking on a source for the genomic sequence of interest, isoform a or b, for example on the next page. A multiple genomic alignment of the sequence you requested will be displayed (see slide 1 of the Genome tutorial.pdf). On the right you will see a + or an x next to the genomes shown, which allows you to add or remove genomes from the alignment. Adjust the length of the alignment using buttons such as “ -10X” at the bottom of the screen, or adjust the genomic position in the window at the upper right, so that the browser window displays the desired range of 5' and/or 3' flanking sequences surrounding the genomic region you selected. The multiple genome alignment shown in the main window of ECR Browser consists of multiple “pair-wise” alignments, each of which was generated by aligning a base sequence (such as Xenopus) with one of its orthologous sequences using a local alignment program (Schwartz et al. 2003). The result, using alignments of the 200kb surrounding the Pax6 coding region in multiple vertebrate genomes as an example, is at least three ECRs displayed as red peaks on the 5' end of the Pax6 coding region (labelled in Slide1) and several more on the region 3' of the coding sequence.
     BLASTZ is a local nucleotide alignment program utilized by ECR Browser and other alignment tools, PipMaker and MultiPipMaker (http://pipmaker.bx.psu.edu/pipmaker/). BLASTZ was developed to compare neutrally evolving regions, which requires greater sensitivity and specificity than the prediction of protein-coding regions (Schwartz 2003). The overall process to generate the pairwise alignments used by these tools is to a) label repetitive elements, b) perform pairwise mapping to identify syntenic regions, c) align syntenic regions using BLAT or Blast programs to identify short non-gapped aligned segments or “seeds” for further alignment, d) perform a gapped Blast to extend end gaps to join short neighboring blocks of aligned sequence and e) perform the final alignment of syntenic blocks. Importantly, BLASTZ has been optimized to identify homologous sequence apart from paralogous or duplicated sequence and so the alignments for ECR Browser represent sequences that have the same order and orientation of conserved blocks.
     Studies in our laboratory have led to the conclusion that, while the ECR and VISTA browsers are very useful, they are not nearly as sensitive for the identification of ECR's as the less user-friendly, but sensitive programs PipMaker and MultiPipMaker. The PipMaker tools require creation of some files by the user, but produce alignments of greater sensitivity than ECR Browser. In fact, ECR and VISTA browsers miss some ECRs such as the Pax6CE1 (Slide 2 and green arrows in Slide 3), which have been identified using MultiPipMaker in our lab, and shown to drive the earliest Pax6 expression in the presumptive lens ectoderm and retina (Ogino and Grainger, unpublished). This is due to the fact that the BLASTZ settings used by MultiPipMaker allow for duplications and inversions (Schwartz, 2003), find all matches and do not focus on the identification of othologous versus paralogous segments. Again, matches that are nonsyntenic or are in the reverse orientation are not included in alignments for ECR Browser (Ovcharenko, 2004). Based on our experience of identifying functional ECRs that were not predicted by ECR browser, we hereby present our protocol to identify cis-regulatory elements in the Xenopus tropicalis genome using MultiPipMaker. We further present a discussion on the prediction of transcription factor binding motifs (TFBMs) within ECRs of interest, which can be used to generate hypotheses for experimentation.

OVERVIEW (Slide 4)

  1. Extract genomic sequences of interest from genome assembly data.
  2. Build accompanying text files.
  3. Generate a multiple genomic sequence alignment.
  4. Perform a local realignment of ECR sequences.
  5. Identify conserved putative transcription factor binding sites in the ECR.


PROTOCOL

(1) Extract genomic sequences of interest from genome assembly data.

     The first step in generating an alignment with MultiPipMaker is creating all the sequence files needed by this tool (Slide 5). Raw sequence data is needed for each genome to be included in the alignment, all in the simple text format.
      There are multiple sources of genomic sequence, but we have found that the UCSC genome browser (http://genome.ucsc.edu/ and Slide 6) is a convenient way to access orthologous loci in multiple vertebrate genomes. The Genome Browser uses a similar access porthole to ECR browser (Slide 7), where one can request to view any gene or location in 20 vertebrate genomes, and adjust the size and content of the region viewed. Once a desired region is displayed, click on the “DNA” tab (Slide 8) to download the genomic DNA sequence represented in the display (Slide 9) with the “get DNA” button. Save the genomic DNA sequence in fasta format (Slide 10) as a simple text file. Once you have the sequence from the first genome of interest, go back to the genome browser window and click on “Convert” (Slide 8) to retrieve the orthologous sequences from additional genomes, such as zebrafish, chicken, mouse and human (Slide 11).

(2) Build accompanying text files.

     The second step in preparing for the alignment is building the exon, repeat and underlay files to accompany the sequence you choose to be the base sequence for the alignment (Slide 12). The repeats and exons files are text files that indicate location of repetitive sequences and exons in the base sequence of the alignment, respectively. The underlay file is a text file that instructs the program to shade specific gene structures (ex. exons and introns) with colors in the output. First retrieve the exon and underlay files from PipHelper (http://pipmaker.bx.psu.edu/cgi-bin/piphelper), by entering the genome and location corresponding to the base sequence downloaded from the UCSC genome browser (Slide 13). Click on the exon and underlay file links (Slide 14), and save this information as separate simple text files as shown in the examples for the Pax6CE1 (Slides 15-16). The repeat file produced by PipHelper can be used, however the RepeatMasker tool at the ISB website (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker) (Slide 17) includes low complexity regions in the report in addition to the simple repeats documented by PipHelper for the Pax6CE1 example. Simply enter the sequence file and source genome (Slide 18) and open the text file (Slide 19), then save the resulting documentation (Slide 20) as a simple text file. If these files for your base sequence are not available from PipHelper (as occasionally is the case for Xenopus ), you have to make them yourself by analyzing gene structures using standard DNA analysis tools (such as Entrez and Vector NTI) and use the examples as a template to create the necessary files.

(3) Generate a multiple genomic sequence alignment.

     To generate an alignment of the genomic sequences retrieved from the genome browser(s) (Slide 21) we use the more sensitive global alignment programs, PipMaker (http://pipmaker.bx.psu.edu/pipmaker/). At the PipMaker website (http://pipmaker.bx.psu.edu/pipmaker/), there are three programs, (basic) PipMaker, Advanced PipMaker, and MultiPipMaker (Slide 22). PipMaker and Advanced PipMaker generate pairwise alignments, and MultiPipMaker is used to align three or more sequences. Links to detailed instructions of these programs are found in the website. Click on MultiPipMaker and enter the number of sequences to be aligned on the next page (Slide 23). Select “Generate nucleotide level view” option to see a raw sequence alignment data in addition to a schematic “Pip” view. Next, upload DNA sequence files, repeat, exon and underlay files for the base sequence (Slide 24) and check the “use as default in pip” box. An annotation file is not necessary for the analysis. Enter names and upload sequence files for the additional orthologous loci of interest in the spaces below and check the “Search both strands” and “High sensitivity and low time limit” options. Detailed instructions of the “Show all matches”, “Chaining”, and “Single coverage” selections are found on the Advanced PipMaker instructions website (http://pipmaker.bx.psu.edu/pipmaker/pip-instr3.html). “Show all matches” is the setting we often use, though it can lead to a somewhat complex output view if regions of the base sequence align with multiple sequences in the second genome.
      In the example shown in Slide 24, the base sequence (the first sequence entered) is from Xenopus tropicalis . Our studies indicate that it is optimal to use X. tropicalis as the base sequence because this genome is positioned between human and fish evolutionarily, and thus often identifies regions of conservation spanning all vertebrates, which may not be detected when human or fish is used as the base sequence. (A discussion of why using mouse or human as the base genome can be convenient in initial comparative genomic studies is included below).
      Inspect the results of the MultiPipMaker analysis by first opening the resulting pip.pdf file for the plot (Slide 25) to see a schematic representation of sequences conserved between 50-100% for alignments of the Xenopus base sequence with each of the other sequences submitted. The Pax6CE1 not depicted in the ECR Browser plot is picked up in the pip plots for human, mouse and chicken as shown. Next open the acgt.pdf file (Slide 26) to see the nucleotide alignment corresponding to the Pax6CE1 location on the plot. It can be laborious to create exon and underlay files for Xenopus genomic sequences, particularly if several hundred kb and many genes are present in the sequence of interest. It is often convenient to perform a preliminary scan using mouse or human as the base sequence in MultiPipMaker, as the level of annotation for these genomes is very high, and this annotation can provide useful reference points in the resulting pip plot to determine the relative location of an ECR (example in Slide 27 (Ogino, unpublished)). In this view one can see the Pax6 coding sequence with exons labeled and the neighboring gene, Elp4, at the 3' end without having to make exon and underlay files by hand. Enhancers for a particular gene may be a megabase or more away from the coding sequence as reported for the Shh enhancer (Lettice, 2008). A summary diagram illustrating all of the conserved elements in 235kb surrounding the Xenopus tropicalis Pax6 gene, is shown in Slide 28. An enhancer for a gene of interest may be found adjacent to a neighboring gene, as is the case for the Pax6 SIMO enhancer, which lies at the 3' end of the Elp4 gene, or within an intron of a neighboring gene, as is the case for the more distant diencephalon enhancers (Kleinjen, 2006). Thus in general it is advisable to scan a minimum of a few hundred kb of genomic sequence alignments when searching for putative enhancers.

(4) Perform a local realignment of ECR sequences.

     It is helpful to realign the ECR region of interest and shade conserved residues with more a more accurate local alignment program before attempting to phylogenetically footprint the TFBMs predicted in step (5) (Slide 29). VectorNTI (Invitrogen) is an easy way to extract, realign, and shade in one program. ClustalW (http://www.ebi.ac.uk/clustalw/) is a proven local alignment program, and can be used in combination with BOXSHADE to generate virtually the same results as with VectorNTI. To use ClustalW, first open the genome sequences that you used for PipMaker analysis with a standard DNA analysis tool (Vector NTI, etc.), and create new text files in fasta format for just the ECR sequence in each genome. The Sequence Manipulation Suite also has a DNA sequence Range Extractor function that is a helpful online alternative (http://www.bioinformatics.org/sms2/range_extract_dna.html). An easy way to access the ClustalW and BOXSHADE programs in tandem is via the SDSC Biology WorkBench (http://workbench.sdsc.edu/) (Slide 30). Create a free account (Slide 31) and enter the WorkBench and click “Nucleic Tools”. In the next window, click “Add” to upload your ECR sequence files, then click “Save” to save each file to your account (Slide 32). Next select the sequence files and click “ClustalW-multiple sequence alignment”, then “Run” and “Save” to store your alignment in the WorkBench. Click on “Alignment Tools” (Slide 33) and click on “BOXSHADE-color-coded plots of pre-aligned sequences” and your saved alignment file then click “run”. In the next window, shading options for conserved and similar residues are presented and can be adjusted as desired. Click “Submit” to show the shaded ECR alignment (Slide 34). This can be saved and opened as a pdf file.

(5) Identify putative transcription factor binding sites conserved in the ECR.

     The final step of our protocol involves the in silico prediction of transcription factor binding motifs (TFBMs) that are conserved in the ECR of interest (Slide 35). Many programs that search for TFBMs in DNA sequences exist and are classified into three groups: pattern search programs, weight matrix search programs, and HMM-based programs. Because HMM-based programs require some special mathematical knowledge, pattern and weight matrix search programs are widely used for conventional analysis. This may be done using either the TRANSFAC or other available databases or a user-defined list of motifs (Slide 36). The available databases such as TRANSFAC give an extremely inclusive view of many possible TFBMs, and may provide excellent insight into unexpected mechanisms. However the TFBMs included in the databases are not all selected by the most stringent means, or by criteria that reflect accurate in vivo binding sites. A more limited but highly useful strategy is to develop a user-defined list of TFBMs that may be curated for a particular tissue, stage of development, etc. by searching the literature for TFBMs validated functionally, preferably by Chromatin Immuno-precipitation (ChIP) analysis in vivo.
      A pair of sequences can be analyzed together to derive a list of predicted and conserved TFBMs using rVISTA. An advantage that rVISTA has is that either multiple user-defined motifs can be evaluated or all of the TRANSFAC motifs can be chosen for an analysis of conserved sites between multiple sequences. However, rVISTA relies on a BLAST-type alignment that may not be sensitive enough to use the Xenopus sequence in the search. By using the human and mouse together, an inclusive list may be generated, and conservation in the Xenopus sequence can be confirmed or rejected visually using the local alignment generated in step (4).
      Alternatively, the TRANSFAC database can be used on individual sequences through P-Match (a combination of position and weight matrix search programs using the TRANSFAC public version) from the BIOBASE Gene Regulation website (http://www.gene-regulation.com/pub/programs.html) (Slide 37). This site also requires a free registration. Here, add sequences for analysis, select the vertebrate set of matrices and threshold similarity levels and “submit the form”. The result is a long file of predicted binding sites (Slide 38), which are linked to references and position-weight matrices for each factor in the results list (Slide 39).
     The analysis of an ECR with a user-defined list of TFBMs curated from the literature for a specific tissue can be more useful. To do this on a sequence or multiple sequences in parallel, go to the RSA tools DNA pattern website (http://www.bi.up.ac.za/rsa-tools/dna-pattern_form.cgi)(Slide 40) (or rVISTA). There, input a list of TFBMs in the “sequence<tab>NAME” format and ECR sequences in the fasta format and click “Go”. The results will be displayed separately for each sequence (Slide 41), and can be visualized and results for multiple sequences compared by clicking on the “feature map” button. These results can be used to analyze the conservation of TFBMs (Slide 42) in conjunction with the shaded local sequence alignment. Another possibly more comprehensive weight matrix search program, MatInspector (http://www.genomatix.de/products/MatInspector/index.html) can also be used with a subscription. Using results from the approaches described above, visually inspect the shaded alignment file for TFBMs within the conserved regions of the realigned ECR. The shaded alignment file can be easily decorated in MSWord for presentation purposes (Slide 43).

CONCLUSION

     The protocol described here is only a starting point for analysis of conserved elements as they are only untested predictions of possible function. Once an ECR is identified, it is necessary to test whether an ECR has gene regulatory activity. This can rapidly be done using transgenesis in Xenopus, using a protocol provided on our lab website (http://faculty.virginia.edu/xtropicalis/overview/transgen_protocol.html). Perhaps the greatest utility of the Xenopus system for this kind of study is the extraordinary efficiency with which one can make transgenic embryos, and thereby analyze the activity of putative enhancers in reporter constructs, either in their native form or after mutagenesis to test the significance of particular sites within a regulatory element. Finally, in vitro mutagenesis and in vivo ChIP assays are necessary to test the functionality of predicted TFBMs within a conserved regulatory element. This has been successfully done for the Lens1/Foxe3 enhancer by Hajime Ogino (Ogino, 2008), and the protocol for this is also provided on the Grainger lab website (http://faculty.virginia.edu/xtropicalis/chIP_analysis.htm).


REFERENCES

Ivan Ovcharenko, Marcelo Nobrega, Gabriela Loots, Lisa Stubbs. ECR Browser: a Tool for Visualizing and Accessing Data from Comparisons of Multiple Vertebrate Genomes. Nucleic Acids Research (2004) 32: W280-W286.

Ivan Ovcharenko, Gabriela Loots, Ross Hardison, Webb Miller, Lisa Stubbs. zPicture: Dynamic Alignment and Visualization Tool for Analyzing Conservation Profiles. Genome Research (2004) 14:472-477.

Kleinjan DA, van Heyningen V. Long-range control of gene expression: emerging mechanisms and disruption in disease. Am J Hum Genet. (2005) Jan;76(1):8-32.

Hedges SB, Dudley J, & Kumar S. TimeTree: A public knowledge-base. (www.timetree.org). Pennsylvania and Arizona State Universities, USA (2006).

Lettice LA, Hill AE, Devenney PS, Hill RE. Point mutations in a distant sonic hedgehog cis-regulator generate a variable regulatory output responsible for preaxial polydactyly. Hum Mol Genet. (2008) Apr 1;17(7):978-85.

Ogino H, Fisher M, Grainger RM. Convergence of a head-field selector Otx2 and Notch signaling: a mechanism for lens specification. Development. (2008) Jan;135(2):249-58.

Scott Schwartz, Zheng Zhang, Kelly A. Frazer, Arian Smit, Cathy Riemer, John Bouck, Richard Gibbs, Ross Hardison, and Webb Miller. PipMaker---A Web Server for Aligning Two Genomic DNA Sequences. Genome Research (2000) Vol. 10, Issue 4; 577-586.

Scott Schwartz, Laura Elnitski, Mei Li, Matt Weirauch, Cathy Riemer, Arian Smit, Eric Green, Ross Hardison, NISC Comparative Sequencing program, Webb Miller. MultiPipMaker and Supporting Tools: Alignments and Analysis of Multiple Genomic DNA sequences. Nucleic Acids Research (2003) 31: 3518-3524.

Scott Schwartz, W. James Kent, Arian Smit, Zheng Zhang, Robert Baertsch, Ross C. Hardison, David Haussler, and Webb Miller. Human-Mouse Alignments with BLASTZ. Genome Res. (2003) 13: 103-107.