Molecular Phylogenetics Exercise
Definition of molecular phylogenetics:
The study of evolutionary relationships among organisms
or genes by a combination of molecular biology and statistical techniques
(Li, 1997. Molecular Evolution. Sinauer Associates, Sunderland, Massachusetts)
1904 Nuttall used serological cross-reactions to infer
relationships among organisms. Showed that humans are closely related to apes.
1950’s Molecular techniques such as protein sequencing
and starch gel electrophoresis introduced into evolutionary studies.
1960’s-1970’s Molecular data used in phylogeny reconstruction
at higher levels such as orders and classes.
1985 Development of PCR (polymerase chain reaction)
has led to unprecedented levels of activity in phylogenetic research.
2. Why use DNA sequence data instead of morphological data?
- DNA sequences evolve at a much more constant rate than morphological characters.
This can give a clearer picture of the relationships between organisms.
- DNA sequence data is more amenable to quantitative treatment (molecular
phylogenetics is a statistical technique)
- Convergence and parallelism are much more likely with morphological
data (e.g. size)
- Molecular data can resolve relationships at all different levels
of organization, from species and populations to phyla and kingdoms.
- Less subjective
- Molecular data is much more abundant
- Uses of morphology:
  - Clock for molecules
  - Can trace characters on molecular phylogenies
  - Fossils (no DNA)
3. Rules of molecular evolution
- Nucleotide substitution rates vary
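- Transitions : transversions roughly 2:1 (a transition swaps one purine for
the other, A and G, or one pyrimidine for the other, C and T; a transversion
swaps a purine for a pyrimidine or vice versa)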
- Deletions and insertions rare
- Unequal substitution rates between genes
- Plants: nuclear > chloroplast > mitochondrial (rearrangements
- Animals: mitochondrial > nuclear;
- Viruses: RNA viruses > retroviruses > DNA based
- RNA polio virus 1 x 10^-4 mutations per replication event, retroviruses
3-6 x 10^-5 per replication event, DNA-based E. coli 5.4 x 10^-10
mutations per replication
- What to do about substitution rates? Use more than one sequence: one from
a fast-evolving gene, one from a slow-evolving gene.
- Know your gene’s substitution rate
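The transition/transversion rule above can be checked on any pair of aligned sequences. The following is a minimal sketch, not part of the lab protocol: the function names and the toy sequences are made up for illustration, and gaps or ambiguous bases are simply skipped. Note that counting differences between two sequences is only an approximation of the changes that occurred along a tree.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Return 'transition', 'transversion', or None.

    None means the two bases are identical or one of them is not A, C, G or T
    (for example a gap '-' or a missing-data symbol '?').
    """
    a, b = a.upper(), b.upper()
    valid = PURINES | PYRIMIDINES
    if a == b or a not in valid or b not in valid:
        return None
    # Both bases in the same chemical class means a transition.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_ts_tv(seq1, seq2):
    """Count (transitions, transversions) between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    ts = tv = 0
    for a, b in zip(seq1, seq2):
        kind = classify_substitution(a, b)
        if kind == "transition":
            ts += 1
        elif kind == "transversion":
            tv += 1
    return ts, tv


# Toy sequences, invented for this example (not GenBank data).
print(count_ts_tv("ACGTAC", "GCTTAT"))  # (2, 1)
```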
4. DNA sequences: Alignment is crucial
- all subsequent analyses depend on the final alignment
- homology =/= similarity
- alignment programs maximize similarity
- show GCG alignment, point out problems of contamination,
- (point out problems with bad sequences on GENBANK)
5. Phylogeny reconstruction techniques
- Multiple techniques with pros and cons
- PARSIMONY- minimizes number of evolutionary changes in
a tree (used with fastener and bible exercises)
- Minimizes homoplasy (parallels, convergence, reversals)
- Only uses shared derived characters
- If degree of divergence between sequences is large, homoplasies
may be common, making parsimony phylogeny inaccurate
- DISTANCE- evolutionary distances computed for all pairs of taxa
(usually number of substitutions between sequences), phylogeny computed based
on relationship of distance values.
- If distances are large or sequences short, distance
estimates are subject to large statistical errors. Can compromise
accuracy of tree.
- MAXIMUM LIKELIHOOD- computes the likelihood of the data for each candidate
tree under a model of substitution; the best tree is the one with the maximum likelihood.
- Uses character info at all sites, but has a lot of assumptions
on model of substitution
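To make the DISTANCE approach above concrete, here is a small sketch of how pairwise distances might be computed before any tree is built. It is an illustration under stated assumptions, not the method used by PAUP: the p-distance is the proportion of differing sites, and the Jukes-Cantor correction (one simple substitution model among many) converts it to an estimated number of substitutions per site. The taxon names and sequences are invented.

```python
import math
from itertools import combinations


def p_distance(s1, s2):
    """Proportion of compared sites that differ (gaps and '?' are skipped)."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    compared = diffs = 0
    for a, b in zip(s1.upper(), s2.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def jukes_cantor(p):
    """Jukes-Cantor distance d = -(3/4) ln(1 - 4p/3).

    The correction is undefined once p reaches 0.75 (saturation),
    so infinity is returned in that case.
    """
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(seqs):
    """Corrected distance for every pair of taxa in a dict of aligned sequences."""
    return {
        (a, b): jukes_cantor(p_distance(seqs[a], seqs[b]))
        for a, b in combinations(seqs, 2)
    }


# Toy alignment, invented for this example.
toy = {"TaxonA": "ACGTACGT", "TaxonB": "ACGTACGA", "TaxonC": "ACCTTCGA"}
for pair, d in distance_matrix(toy).items():
    print(pair, round(d, 3))
```

The corrected distance grows faster than the raw proportion as sequences diverge, which is one way to see why large distances carry large statistical errors.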
Phylogeny reconstruction using molecular data
This lab will teach you how to:
- Acquire nucleotide sequence information from GenBank, a collection
of publicly available DNA sequences;
- Enter the sequence information into the computer program PAUP
(Phylogenetic Analysis Using Parsimony) and use the program to generate
a phylogenetic tree using parsimony; and
- Explore evolutionary patterns on this tree using the computer program MacClade.
Today you will acquire the sequences from GenBank and create a PAUP file
in MS Word. Next week you will import your MS Word file to PAUP, and
from PAUP to MacClade.
Epidemiological studies have increasingly used DNA sequence information.
To illustrate the utility of phylogenetic methods and sequence data, you
will explore the following true case. A French patient is hospitalized
for an operation. Prior to hospitalization, she is HIV negative and
has no risk factors associated with HIV. Shortly after hospitalization,
however, she is found to be HIV positive. Of the hospital staff, two
nurses are HIV positive and so one of them may have infected the patient.
GOAL: Determine which nurse, if either, is responsible for infecting
the patient. You will provide printouts of your computer results and
an explanation of your findings.
GenBank to PAUP: Getting sequences
- The first step is to acquire the sequences from GenBank.
- Start by going to the GenBank web site:
- All sequences have an associated identifier, called an accession
number. To start, acquire the nucleotide sequence of the patient, which
has the identifier: AF125604.
- Type in the accession number in the "Search GenBank" box on the
web page and click on GO. A summary description of the sequence information
can be found by clicking the "DISPLAY" button with "GENBANK" chosen in the next
field. Print the page and identify (1) the number of base pairs (bp),
(2) the organism, (3) the gene that was sequenced, (4) the nucleotide base
sequence, and (5) the protein sequence (given as amino acid codes).
- We are mainly interested in the DNA sequence, which needs to be
pasted into a text editor. This is most easily done by choosing "FASTA"
in place of "GENBANK" and clicking the TEXT button (not DISPLAY).
Then, select the sequence information using your mouse and copy your selection.
- Now, open Microsoft Word. When a new window opens, paste
in your sequence data. Now you will need to do two things. First,
delete the "carriage returns" (indicated by paragraph marks in MS Word).
Your whole sequence will probably not fit on one line in MS Word (but in
PAUP it will). Second, you need to identify your sequence as being from
the patient so that you can easily see how the patient's HIV sequence compares
to the other samples. For example, type "Patient" in front of your
sequence, and hit tab once or twice to give yourself some room between this
label and the sequence information.
- This is a good point at which to save your MS Word file.
In the save dialog box, save the file as a text file to facilitate its importation
to PAUP next week.
- Next, you need to acquire information for the nurses using the
same basic procedure. Here are the relevant accession numbers and labels:
Nurse 1: AF125605
Nurse 2: AF125606
These are the three main "players" in this study. But in addition,
you need a sample of HIV sequences from the French population at large in
order to test whether the patient acquired the disease outside the hospital.
Here are some samples and accompanying labels for your file.
Sample 4: AF125607
Sample 5: AF125608
Sample 6: AF125609
Sample 7: AF125610
Sample 8: AF125611
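The manual clean-up described above (dropping the FASTA header, deleting the carriage returns, and putting a label in front of the sequence) can also be sketched in a few lines of code. This is only an illustration of the same steps, assuming the FASTA text has already been copied from GenBank; the function name is hypothetical and the sequence shown is a made-up placeholder, not the real patient sequence.

```python
def fasta_to_labelled_line(fasta_text, label):
    """Turn one FASTA record into a single 'label<TAB>sequence' line.

    The header line (starting with '>') is dropped and the remaining
    lines are joined, which removes the line breaks inside the sequence.
    """
    sequence_lines = [
        line.strip()
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")
    ]
    return f"{label}\t{''.join(sequence_lines)}"


# Placeholder record, invented for this example.
example = """>toy_record made-up sequence for illustration
ACGTAC
GTTA
"""
print(fasta_to_labelled_line(example, "Patient"))
```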
After completing these steps, you should have eight "blocks" of data,
each with a distinctive label. Next, you need to add the computer
code that will provide PAUP with information about your data file.
Follow the syntax below very carefully.
At the start of your file, add the following: #NEXUS
This simply identifies the type of input file that you have (i.e. a Nexus file).
Now you have to provide code that will tell PAUP about your data matrix.
The program also needs details on the dimensions of this data matrix in terms
of the number of taxa (NTAX), which in this case are different HIV samples
rather than species. It also needs information on the number of characters
for each taxon (NCHAR). A line or two below your Nexus statement, type the following:
BEGIN DATA;
DIMENSIONS NTAX=8 NCHAR=477;
FORMAT MISSING=? GAP=- DATATYPE=DNA;
A line or two below that block of code, type the following to identify
your data matrix: MATRIX
Now, go to the end of your file and add a semicolon (indicating the end
of the data matrix), then type: END;
to indicate the end of the file.
Print out your file and save it to disk. You will need it next week
for running PAUP.
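If you want to double-check your hand-built file, the sketch below assembles the same kind of Nexus text from a set of labelled sequences. It is an illustration only: the function name is hypothetical, the layout follows the standard Nexus DATA block (BEGIN DATA; ... END;) as generally documented, and the two short sequences are placeholders. Because Nexus taxon labels are whitespace-delimited, the sketch replaces spaces in labels with underscores.

```python
def build_nexus(sequences):
    """Return Nexus text for a dict mapping taxon label -> aligned DNA sequence."""
    if not sequences:
        raise ValueError("no sequences given")
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        # NCHAR must be the same for every taxon, so unequal lengths are an error.
        raise ValueError("all sequences must have the same length")
    nchar = lengths.pop()
    width = max(len(label) for label in sequences) + 2
    lines = [
        "#NEXUS",
        "",
        "BEGIN DATA;",
        f"DIMENSIONS NTAX={len(sequences)} NCHAR={nchar};",
        "FORMAT MISSING=? GAP=- DATATYPE=DNA;",
        "MATRIX",
    ]
    for label, seq in sequences.items():
        safe_label = label.replace(" ", "_")  # labels cannot contain spaces
        lines.append(f"{safe_label.ljust(width)}{seq}")
    lines += [";", "END;"]
    return "\n".join(lines) + "\n"


# Placeholder sequences, invented for this example.
print(build_nexus({"Patient": "ACGTACGT", "Nurse 1": "ACGTACGA"}))
```

A quick benefit of building the file this way is that NTAX and NCHAR are computed from the data, so a sequence with a missing or extra base is caught before PAUP refuses to execute the file.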
Next week you will also be generating a tree from your homework.
Create a PAUP file in MS Word using the above protocol. Remember that
each base (T, A, C or G) in your sequence is a different character; you therefore
need to create a matrix similar to that created above, but with 1's and
0's as character states instead of nucleotide bases. What coding do
you need to change or delete for this different type of data? What
other PAUP codes need to be changed for your homework problem?
Last week you acquired nucleotide sequences from GenBank. This week
you will use those sequences to generate three phylogenetic trees, one for
each of: the "French patient" epidemiological study, the "bible passage"
homework problem, and the "fasteners" homework problem. For your research
project, you will follow this general process of acquiring sequences and
generating trees, so carefully doing this exercise in class will help you
with your independent project.
The first step is to import your file from MS Word into PAUP. Start
with the epidemiological study. First, make sure that the file is saved
as “Text” (use the pull-down menu in the “Save As” dialog box).
Next, open PAUP. When the dialog box comes up, choose your file
and click on the EXECUTE BUTTON. (If there are any problems with your
file, it will not execute. PAUP will open the file for you to edit.
Correct any mistakes, using last week’s handout as an example. Once
your file is corrected, go to Execute FILE NAME in the File menu.)
Open your file by going to the Window menu and choosing your file.
Take a look at your data matrix. Use spaces and tabs to align your
sequences in PAUP. Scroll to the right to see each of the bases.
Save your file (under a different name) within PAUP while it is the active window.
- Can you tell, at this point, which nurse infected the patient?
- Can you rule out the possibility of the patient acquiring the
infection outside the hospital? _______________
(Phylogenetic analysis can more definitively test the hypothesis that
the patient acquired the infection in the hospital.)
Now you are ready to generate a tree.
Go to Search menu, choose Exhaustive. Take a look at the dialog
box. Change the default settings so that PAUP retains all trees with
fewer than 685 steps (a change from one base to another counts as one step;
the sum of these steps equals “tree length”). Once you have made this
change, hit the Search button.
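To see what "tree length" means in practice, the sketch below counts the minimum number of steps a given tree requires, one character at a time, using Fitch's method for unordered characters. It is a simplified illustration and not PAUP's own code: the tree is written as nested pairs of taxon names and rooted arbitrarily (the count does not depend on where the root is placed), gaps are treated as just another state, and the four-taxon alignment is invented.

```python
def fitch_site(tree, states):
    """Return (possible_states, steps) for one character on one tree.

    `tree` is either a taxon name (a leaf) or a pair (left, right).
    `states` maps each taxon name to its base at this character.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_site(left, states)
    right_set, right_steps = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        # The two subtrees can agree on a state: no extra change needed here.
        return shared, left_steps + right_steps
    # No state in common: one change must have happened at this node.
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, alignment):
    """Sum the per-character step counts over the whole alignment."""
    nchar = len(next(iter(alignment.values())))
    total = 0
    for i in range(nchar):
        site = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, site)[1]
    return total


# Toy alignment and two candidate trees, invented for this example.
toy = {"W": "AAG", "X": "AAG", "Y": "ACT", "Z": "ACT"}
print(tree_length((("W", "X"), ("Y", "Z")), toy))  # 2
print(tree_length((("W", "Y"), ("X", "Z")), toy))  # 4
```

The first tree needs fewer steps, so it is the more parsimonious of the two.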
The program will evaluate every possible tree, keeping only those with
fewer than 685 steps. It will also print out a histogram of the tree lengths.
Use the output to answer the following questions:
- What is the most parsimonious tree, and how many of these trees
did PAUP find? ________ (Remember that it is possible to have multiple
equally parsimonious trees).
- How many trees were evaluated in your search? __________
- What is the mean tree-length of all possible trees? ___________
At this point, you have not viewed any of your trees. Examine the
trees by going to the Trees menu; select “Show Trees” and choose one of your trees.
The tree you viewed may or may not be the most parsimonious tree. Get
information on your stored trees by going to the Trees menu and selecting “Length
and fit measures”. Identify the tree with the fewest steps; this is
the most parsimonious tree. Display that tree and use it to answer
the main question:
- Which nurse most likely infected the patient? Why?
Now, systematically go through the other trees.
- Do your conclusions from the last question change based on these other trees?
Save your most parsimonious tree and one other, of your choosing, to bring
into MacClade for the next section of this lab. Under the Trees menu,
go to “Save Trees to File”. At the bottom of the dialog box, change
“All trees” to "Tree ___ to ___". Type in the tree numbers that you
want to save, if they are consecutive; otherwise, you may have to save to
two files (e.g. to save tree 1, type 1 in both boxes; to then save tree 21,
type 21 in both boxes on a subsequent save).
Before moving on to MacClade, there is one more thing to cover in PAUP.
The nurse example is relatively simple because there are very few “taxa.”
In your independent project, however, you may find that it takes too long
to search every possible tree (** you may also have to align your sequences
**). For larger data matrices, you can sample the possible trees by
doing a “heuristic” search…
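The reason an exhaustive search quickly becomes impractical is that the number of possible trees explodes as taxa are added. For n taxa there are (2n-5)!! = 3 x 5 x 7 x ... x (2n-5) distinct unrooted, fully bifurcating trees. The short sketch below (the function name is just for illustration) computes this number so you can see how fast it grows.

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees for n taxa: (2n - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, "taxa:", count_unrooted_trees(n), "trees")
```

Try the function with the number of taxa in your own data matrix before deciding between an exhaustive and a heuristic search.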
First, go to “Clear Trees” under the “Trees” menu. Go to the Search
menu and choose “heuristic”. Keep all trees less than or equal to 685
steps. Click search. Examine these trees and their characteristics.
- How many trees were saved? _________
- Is this equal to or less than the number found in the exhaustive search?
- Did your heuristic search find the most parsimonious tree?
- Did the heuristic search take more or less time than the exhaustive search?
As the final step for using PAUP, run your data matrices from the homework
problems. Are the trees equivalent to the trees that you calculated in your homework?
MacClade cannot be used to generate trees, but it is useful for understanding
patterns of evolution. The first step is to import your file from PAUP.
This is easy: just open it in MacClade. This will bring up your
data file, which looks something like an Excel file. Your sequences
can be seen in this window. Each column represents a different character
(indicated by a number), with the various character states (T, A, C, or
G) for each base provided in the cells below.
To see the tree, go to the Display menu and choose “Go to Tree Window”.
In the dialog box, choose “Open tree file”, and choose the file that represents
your most parsimonious tree.
There is your tree! Now, trace a character on that tree. Under
the Trace menu, choose "trace character." It will start with character
1. The character states of the different "taxa" are shown by boxes
at the "tips" of the phylogeny (top of the window). Different bases
are indicated by the different colors. The key to these colors is shown
in the small box at the bottom right of the window.
Character one is not very interesting in this context. All sequences
are the same for this position: they all have character state A.
In this case, then, character 1 is not "informative."
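Rather than scrolling through every character, you could pre-screen an alignment for the positions worth looking at. The sketch below is an illustration with made-up names and toy data. It separates constant characters from variable ones, and also flags the stricter category usually called "parsimony-informative" (at least two different bases, each present in at least two taxa); that stricter definition is the standard one and is an addition here, since this handout uses "informative" simply to mean that the samples differ.

```python
from collections import Counter


def classify_sites(alignment):
    """Return (constant, variable, parsimony_informative) character numbers.

    Character numbers start at 1, as in MacClade. Gaps and missing
    data ('-' and '?') are ignored when counting bases.
    """
    seqs = list(alignment.values())
    nchar = len(seqs[0])
    constant, variable, informative = [], [], []
    for i in range(nchar):
        counts = Counter(s[i].upper() for s in seqs if s[i].upper() in "ACGT")
        number = i + 1
        if len(counts) <= 1:
            constant.append(number)
            continue
        variable.append(number)
        # Informative for parsimony: two or more bases each seen in two or more taxa.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            informative.append(number)
    return constant, variable, informative


# Toy alignment, invented for this example.
toy_alignment = {"W": "AAGA", "X": "AAGC", "Y": "ACTA", "Z": "ACTA"}
print(classify_sites(toy_alignment))  # ([1], [2, 3, 4], [2, 3])
```

In the toy data, character 4 varies (one sample has C) but cannot favor one tree over another, because the change can be placed on a single tip branch of any tree.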
Scroll through the characters to identify situations where samples have
different character states. Do this by going to the box with the "key"
to the colors (bottom right), and click on the scroll bar at the bottom
of the box.
Click through the characters until you find an informative character (indicated
by two colors on the tree).
- Is this change (mutation) a transition or transversion?
- Go through your characters, counting transitions and transversions,
until your total count equals 30 or so.
- What is the ratio of transitions to transversions in your sequences?
___ : ___
- Scroll through the characters and identify a reversal and a case
of convergence. Provide character numbers for each here
- Reversal: character # ______
- Convergence: character # ______
It is also useful to examine the total number of changes that occur along each branch.
View the total changes by going to the Trace menu and choosing "Trace
all changes". The number of changes will be illustrated by colors
on the different branches. You can also view the number of changes
directly by changing the "trace all changes" option, also under the Trace
menu. Click "graphics options" and change the setting to "label by
amount of change."
- Print out this tree, with the amounts of change labeled.
- Identify the two "longest" branches and the two "shortest" branches
on your printout.
- Does the output strengthen or weaken your conclusions regarding
which nurse infected the patient?
This lab has only covered a small subset of the options available in MacClade.
Explore the windows and their options. One particularly neat trick
is to drag the branches around the screen to create new tree topologies.
For example, if you force the patient and nurse 1 to be sister taxa…
- How many total changes have occurred between them? _______
- What is the total length of this tree? (Look in the other
"box" in the lower right hand corner.) _________ .
MacClade may be particularly useful for your independent project -- for
more practice, try out the different tools with your other "homework" trees,
imported as described above.