fastphylo: Fast tools for phylogenetics


Table of Contents

1. Introduction
2. Algorithms
2.1. Fast Computation of Distance Estimators
2.1.1. About the published article
2.1.2. Abstract of the published article
2.1.3. Supplementary Material
2.2. Fast Neighbor Joining
2.2.1. About the published article
2.2.2. Abstract of the published article
2.2.3. Supplementary Material
3. Software
3.1. Download
3.2. Installation
3.2.1. Installation with prebuilt package
3.2.1.1. Installation on Ubuntu and Linux
3.2.1.2. Installation on Mac OS X
3.2.2. Building from source
3.2.2.1. Building from source on Unix
3.2.2.2. Building install packages
3.2.2.2.1. Building an rpm
3.2.2.2.2. Building a deb package
3.2.2.2.3. Building install package for MacOS X
3.3. Usage
3.3.1. fastdist
3.3.1.1. Command line options
3.3.1.2. fastdist input file formats
3.3.1.3. fastdist output file formats
3.3.1.4. Examples
3.3.2. fastprot
3.3.2.1. Command line options
3.3.2.2. fastprot input file formats
3.3.2.3. fastprot output file formats
3.3.2.4. Examples
3.3.3. fastprot_mpi
3.3.3.1. Command line options
3.3.3.2. fastprot_mpi input file formats
3.3.3.3. fastprot_mpi output file formats
3.3.3.4. Examples
3.3.4. fnj
3.3.4.1. Command line options
3.3.4.2. fnj input file formats
3.3.4.3. fnj output file formats
3.3.4.4. Examples
3.4. File formats
3.4.1. Fastphylo sequence XML format
3.4.2. phylip format
3.4.3. Fasta format
3.4.4. Fastphylo distance matrix XML format
3.4.5. Phylip distance matrix format
3.4.6. Binary distance matrix format
3.4.7. Fastphylo tree count XML format

List of Tables

1. fastdist input file formats
2. fastdist output file formats
3. fastprot input file formats
4. fastprot output file formats
5. fastprot_mpi input file formats
6. fastprot_mpi output file formats
7. fnj input file formats
8. fnj output file formats

List of Examples

1. fastdist with input in file phylip format
2. fastdist with output in file binary format connected to fnj through pipe
3. fastdist with input in file fasta format
4. fastdist with input file in XML format
5. fastdist with an XML stream on stdin
6. reading the fastdist XML output stream with python
7. fastprot with input in file phylip format
8. fastprot with output in file binary format connected to fnj through pipe
9. fastprot with input in file fasta format
10. fastprot_mpi with input in file phylip format
11. fastprot_mpi with output in file binary format connected to fnj through pipe
12. fastprot_mpi with input in file fasta format
13. fnj with input file in Phylip distance matrix format
14. fnj with input file in XML format
15. connecting fastdist to fnj with a pipe
16. reading the fnj XML output stream with python
17. Example files in Fastphylo sequence XML format
18. Example files in phylip format
19. seq.fasta, an example file in fasta format
20. protein_seq.fasta, an example file in fasta format
21. dm.xml, an example file in Fastphylo distance matrix XML format
22. dm.phylip, an example file in phylip distance matrix format

1. Introduction

fastphylo is a software project containing implementations of the algorithms "Fast Computation of Distance Estimators" and "Fast Neighbor Joining". The software is licensed under the MIT license.

The primary URL for this document is http://fastphylo.sourceforge.net.

2. Algorithms

2.1. Fast Computation of Distance Estimators

2.1.1. About the published article

Isaac Elias and Jens Lagergren published the algorithm in the journal BMC Bioinformatics in 2007.

BibTeX

@Article{EliasLagergren_fastdist,
  author =      {Isaac Elias and Jens Lagergren},
  title =	{Fast Computation of Distance Estimators},
  journal =	{BMC Bioinformatics},
  year =        {2007},
  pages =       {89},
  volume =      {8}
}

2.1.2. Abstract of the published article

Background: Some distance methods are among the most commonly used methods for reconstructing phylogenetic trees from sequence data. The input to a distance method is a distance matrix, containing estimated pairwise distances between all pairs of taxa. Distance methods themselves are often fast, e.g., the famous and popular Neighbor Joining (NJ) algorithm reconstructs a phylogeny of n taxa in time O(n^3). Unfortunately, the fastest practical algorithms known for computing the distance matrix, from n sequences of length l, takes time proportional to l·n^2. Since the sequence length typically is much larger than the number of taxa, the distance estimation is the bottleneck in phylogeny reconstruction. This bottleneck is especially apparent in reconstruction of large phylogenies or in applications where many trees have to be reconstructed, e.g., bootstrapping and genome wide applications.

Results: We give an advanced algorithm for computing the number of mutational events between DNA sequences which is significantly faster than both Phylip and Paup. Moreover, we give a new method for estimating pairwise distances between sequences which contain ambiguity symbols. This new method is shown to be more accurate as well as faster than earlier methods.

Conclusions: Our novel algorithm for computing distance estimators provides a valuable tool in phylogeny reconstruction. Since the running time of our distance estimation algorithm is comparable to that of most distance methods, the previous bottleneck is removed. All distance methods, such as NJ, require a distance matrix as input and, hence, our novel algorithm significantly improves the overall running time of all distance methods. In particular, we show for real world biological applications how the running time of phylogeny reconstruction using NJ is improved from a matter of hours to a matter of seconds.

2.1.3. Supplementary Material

Supplementary Material - Fast Computation of Distance Estimators. Contains additional figures for the tests run on the ambiguity approaches. (PDF)

Simulated Test Data for Ambiguities (Tar archive)

Biological Test Data (Tar archive)

Command file used for running Paup (Nexus file)

2.2. Fast Neighbor Joining

2.2.1. About the published article

Isaac Elias and Jens Lagergren published the algorithm in the book "Proc. of the 32nd International Colloquium on Automata, Languages and Programming (ICALP'05)" in 2005.

BibTeX

@InProceedings{ICALP05:EliasLagergren_FNJ,
  author =      {Isaac Elias and Jens Lagergren},
  title =	{Fast Neighbor Joining},
  booktitle =	{Proc. of the 32nd International Colloquium on Automata, 
                Languages and Programming ({ICALP}'05)},
  pages =	{1263--1274},
  year =	{2005},
  volume =	{3580},
  series =	{Lecture Notes in Computer Science},
  month =	{July},
  publisher =	{Springer-Verlag},
  ISBN =	{3-540-27580-0},
}

2.2.2. Abstract of the published article

Reconstructing the evolutionary history of a set of species is a fundamental problem in biology and methods for solving this problem are gaged based on two characteristics: accuracy and efficiency. Neighbor Joining (NJ) is a so-called distance-based method that, thanks to its good accuracy and speed, has been embraced by the phylogeny community. It takes the distances between n taxa and produces in Θ(n^3) time a phylogenetic tree, i.e., a tree which aims to describe the evolutionary history of the taxa. In addition to performing well in practice, the NJ algorithm has optimal reconstruction radius.

The contribution of this paper is twofold: (1) we present an algorithm called Fast Neighbor Joining (FNJ) with optimal reconstruction radius and optimal run time complexity O(n^2) and (2) we present a greatly simplified proof for the correctness of NJ. Initial experiments show that FNJ in practice has almost the same accuracy as NJ, indicating that the property of optimal reconstruction radius has great importance to their good performance. Moreover, we show how improved running time can be achieved for computing the so-called correction formulas.

2.2.3. Supplementary Material

In Proc. of the 32nd Int. Coll. on Automata, Languages and Programming (ICALP'05), volume 3580 of Lecture Notes in Computer Science, pages 1263-1274. Springer-Verlag, July 2005. (PDF,Springer)

Slides from presentation at Technion, Israel 2006 (PDF)

Slides from presentation at ICALP 2005 (PDF)

Google Scholar citations, CiteSeer

3. Software

3.1. Download

Download the software from the sourceforge project page. The latest version of fastphylo is 1.0.0.

3.2. Installation

3.2.1. Installation with prebuilt package

3.2.1.1. Installation on Ubuntu and Linux

To install fastphylo on Ubuntu or another Linux distribution, first download fastphylo-1.0.0.-Linux.tar.gz, then log in as root and unpack the archive:

# tar xfz fastphylo-1.0.0.-Linux.tar.gz 

3.2.1.2. Installation on Mac OS X

To install fastphylo on Mac OS X v10.6.8 (Snow Leopard), first download fastphylo-1.0.0-MacOSX10.6.8.tar.gz and then unpack the archive:

$ tar xfz fastphylo-1.0.0-MacOSX10.6.8.tar.gz  

To install fastphylo on Mac OS X v10.8.4 (Mountain Lion), first download fastphylo-1.0.0-MacOSX10.8.4.tar.gz and then unpack the archive:

$ tar xfz fastphylo-1.0.0-MacOSX10.8.4.tar.gz

3.2.2. Building from source

3.2.2.1. Building from source on Unix

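To build fastphylo on Unix (e.g. Linux, Mac OS X) you need the build dependencies installed. The package lists in Section 3.2.2.2, “Building install packages” name them for common Linux distributions.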

If you have the fastphylo source code in the directory /tmp/source and you want to install fastphylo into the directory /tmp/install, first run cmake, then make, and then make install:

$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DCMAKE_INSTALL_PREFIX=/tmp/install /tmp/source && make && make install
-- A library with BLAS API found.
-- A library with BLAS API found.
-- A library with LAPACK API found.
-- A library with BLAS API found.
-- A library with BLAS API found.
-- A library with LAPACK API found.
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/build
Scanning dependencies of target fastphylo
[  1%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/BitVector.cpp.o
[  2%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/Exception.cpp.o
[  3%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/InitAndPrintOn_utils.cpp.o
[  4%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/Object.cpp.o
[  5%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/Sequence.cpp.o
[  7%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/SequenceTree.cpp.o
[  8%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/SequenceTree_MostParsimonious.cpp.o
[  9%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/Simulator.cpp.o
[ 10%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/arg_utils_ext.cpp.o
[ 11%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/file_utils.cpp.o
[ 13%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/stl_utils.cpp.o
[ 14%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/DNA_b128/DNA_b128_String.cpp.o
[ 15%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/DNA_b128/Sequences2DistanceMatrix.cpp.o
[ 16%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/distance_methods/LeastSquaresFit.cpp.o
[ 17%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/distance_methods/NeighborJoining.cpp.o
[ 19%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/sequence_likelihood/Kimura2parameter.cpp.o
[ 20%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/sequence_likelihood/TamuraNei.cpp.o
[ 21%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/sequence_likelihood/ambiguity_nucleotide.cpp.o
[ 22%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/sequence_likelihood/dna_pairwise_sequence_likelihood.cpp.o
[ 23%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/sequence_likelihood/string_compare.cpp.o
[ 25%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/DistanceMatrix.cpp.o
[ 26%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/FloatDistanceMatrix.cpp.o
[ 27%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/DistanceRow.cpp.o
[ 28%] Building C object src/c++/CMakeFiles/fastphylo.dir/arg_utils.c.o
cc1: warning: command line option "-fno-default-inline" is valid for C++/ObjC++ but not for C
[ 29%] Building C object src/c++/CMakeFiles/fastphylo.dir/std_c_utils.c.o
cc1: warning: command line option "-fno-default-inline" is valid for C++/ObjC++ but not for C
[ 30%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/xml_output_global.cpp.o
[ 32%] Building C object src/c++/CMakeFiles/fastphylo.dir/DNA_b128/sse2_wrapper.c.o
[ 33%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/DNA_b128/computeTAMURANEIDistance_DNA_b128_String.cpp.o
[ 34%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/DNA_b128/computeDistance_DNA_b128_String.cpp.o
Linking CXX static library libfastphylo.a
[ 34%] Built target fastphylo
[ 35%] Generating programs/fastdist/gengetopt/fastdist_gengetopt.c, programs/fastdist/gengetopt/fastdist_gengetopt.h
Scanning dependencies of target fastdist
[ 36%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/main.cpp.o
[ 38%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/PhylipMaInputStream.cpp.o
[ 39%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/FastaInputStream.cpp.o
[ 40%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/DataOutputStream.cpp.o
[ 41%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/XmlOutputStream.cpp.o
[ 42%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/PhylipDmOutputStream.cpp.o
[ 44%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/BinaryDmOutputStream.cpp.o
[ 45%] Building C object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/gengetopt/fastdist_gengetopt.c.o
cc1: warning: command line option "-fno-default-inline" is valid for C++/ObjC++ but not for C
[ 46%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/XmlInputStream.cpp.o
Linking CXX executable fastdist
[ 48%] Built target fastdist
[ 50%] Generating programs/fastprot/gengetopt/fastprot_gengetopt.c, programs/fastprot/gengetopt/fastprot_gengetopt.h
Scanning dependencies of target fastprot
[ 51%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/main.cpp.o
[ 52%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/FastaInputStream.cpp.o
[ 53%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/DataOutputStream.cpp.o
[ 54%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/XmlOutputStream.cpp.o
[ 55%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/PhylipMaInputStream.cpp.o
[ 57%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/ProtDistCalc.cpp.o
[ 58%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/ModelMatrix.cpp.o
[ 59%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/ExpectedDistance.cpp.o
[ 60%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/Matrix.cpp.o
[ 61%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/MaximumLikelihood.cpp.o
[ 63%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/ProtSeqUtils.cpp.o
[ 64%] Building C object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/gengetopt/fastprot_gengetopt.c.o
cc1: warning: command line option "-fno-default-inline" is valid for C++/ObjC++ but not for C
[ 65%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/XmlInputStream.cpp.o
Linking CXX executable fastprot
[ 67%] Built target fastprot
[ 69%] Generating programs/fastprot_mpi/gengetopt/fastprot_mpi_gengetopt.c, programs/fastprot_mpi/gengetopt/fastprot_mpi_gengetopt.h
Scanning dependencies of target fastprot_mpi
[ 70%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/main.cpp.o
[ 71%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/FastaInputStream.cpp.o
[ 72%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/DataOutputStream.cpp.o
[ 73%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/XmlOutputStream.cpp.o
[ 75%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/PhylipMaInputStream.cpp.o
[ 76%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/ProtDistCalc.cpp.o
[ 77%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/ModelMatrix.cpp.o
[ 78%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/ExpectedDistance.cpp.o
[ 79%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/Matrix.cpp.o
[ 80%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/MaximumLikelihood.cpp.o
[ 82%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/ProtSeqUtils.cpp.o
[ 83%] Building C object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/gengetopt/fastprot_mpi_gengetopt.c.o
cc1: warning: command line option "-fno-default-inline" is valid for C++/ObjC++ but not for C
[ 84%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/XmlInputStream.cpp.o
Linking CXX executable fastprot_mpi
[ 86%] Built target fastprot_mpi
[ 88%] Generating programs/fnj/gengetopt/fnj_gengetopt.c, programs/fnj/gengetopt/fnj_gengetopt.h
Scanning dependencies of target fnj
[ 89%] Building CXX object src/c++/CMakeFiles/fnj.dir/programs/fnj/main.cpp.o
[ 90%] Building CXX object src/c++/CMakeFiles/fnj.dir/programs/fnj/DataInputStream.cpp.o
[ 91%] Building CXX object src/c++/CMakeFiles/fnj.dir/programs/fnj/DataOutputStream.cpp.o
[ 92%] Building CXX object src/c++/CMakeFiles/fnj.dir/programs/fnj/XmlOutputStream.cpp.o
[ 94%] Building CXX object src/c++/CMakeFiles/fnj.dir/programs/fnj/PhylipDmInputStream.cpp.o
[ 95%] Building CXX object src/c++/CMakeFiles/fnj.dir/programs/fnj/BinaryInputStream.cpp.o
[ 96%] Building C object src/c++/CMakeFiles/fnj.dir/programs/fnj/gengetopt/fnj_gengetopt.c.o
cc1: warning: command line option "-fno-default-inline" is valid for C++/ObjC++ but not for C
[ 97%] Building CXX object src/c++/CMakeFiles/fnj.dir/programs/fnj/XmlInputStream.cpp.o
Linking CXX executable fnj
[100%] Built target fnj
[ 34%] Built target fastphylo
[ 48%] Built target fastdist
[ 67%] Built target fastprot
[ 86%] Built target fastprot_mpi
[100%] Built target fnj
Install the project...
-- Install configuration: ""
-- Installing: /tmp/bin/fastdist
-- Removed runtime path from "/tmp/bin/fastdist"
-- Installing: /tmp/bin/fnj
-- Removed runtime path from "/tmp/bin/fnj"
-- Installing: /tmp/bin/fastprot
-- Removed runtime path from "/tmp/bin/fastprot"
-- Installing: /tmp/bin/fastprot_mpi
-- Removed runtime path from "/tmp/bin/fastprot_mpi"

If you want to build the html documentation ( i.e. this page ) you need to pass the -DBUILD_DOCBOOK=ON option to cmake.

3.2.2.2. Building install packages

This section is mainly intended for package maintainers.

3.2.2.2.1. Building an rpm

On a CentOS or Fedora machine, first log in as root and install the dependencies

# yum install xmlto libxml2-devel cmake gcc-c++ binutils gengetopt

Check that cmake is version 2.6 or later

$ cmake --version
cmake version 2.6-patch 0

If it is older you could download a cmake binary directly from www.cmake.org. Now build the rpm package.

$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DCMAKE_INSTALL_PREFIX=/ -DBUILD_DOCBOOK=ON /tmp/source && make package

3.2.2.2.2. Building a deb package

On a Debian or Ubuntu machine, first log in as root and install the dependencies

# apt-get install libxml2-dev cmake g++ binutils gengetopt

Check that cmake is version 2.6 or later

$ cmake --version
cmake version 2.6-patch 0

If it is older you could download a cmake binary directly from www.cmake.org. Now build the deb package.

$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DCMAKE_INSTALL_PREFIX=/ -DBUILD_DOCBOOK=ON /tmp/source && make package

3.2.2.2.3. Building install package for MacOS X

To build the fastphylo install package for MacOS X you need to have installed all the dependencies mentioned in Section 3.2.2.1, “Building from source on Unix” on your MacOS X computer.

Check that cmake is version 2.6 or later

$ cmake --version
cmake version 2.6-patch 0

$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DSTATIC=ON -DCPACK_GENERATOR="TGZ" /tmp/source && make package

3.3. Usage

3.3.1. fastdist

fastdist implements the algorithm Fast Computation of Distance Estimators ( see Section 2.1, “Fast Computation of Distance Estimators” )

3.3.1.1. Command line options

Type fastdist --help to see the command line options

[user@saturn ~]$ fastdist --help
fastdist 1.0.0

Usage: fastdist [OPTIONS]... [FILE]...

Computes distance matrices out of multialignments

  -h, --help                    Print help and exit
  -V, --version                 Print version and exit
If FILE is not specified the input is read from stdin 
  -o, --outfile=filename        output filename. If not specifed, output is 
                                  written to stdout
  -I, --input-format=ENUM       input format. xml means the Fastphylo sequence 
                                  XML format  (possible values="fasta", 
                                  "phylip", "xml" default=`fasta')
  -e, --memory-efficient         memory efficient. Use less memory space and 
                                  fast implementation. Only used with fasta and 
                                  phylip format  (default=off)
  -O, --output-format=ENUM      output format. xml means the Fastphylo distance 
                                  matrix XML format  (possible 
                                  values="phylip", "xml", "binary" 
                                  default=`xml')
  -D, --distance-function=ENUM  Distance function  (possible values="JC", 
                                  "K2P", "TN93", "HAMMING" default=`K2P')
  -b, --bootstraps=INT          Bootstrap num times and create matrix for each  
                                  (default=`0')
  -k, --no-incl-orig            If the distance matrix from the original 
                                  sequences should not be included  
                                  (default=off)
  -s, --seed=INT                Random seed. If not specified the current 
                                  timestamp will be used
  -A, --no-ambiguities          Ignore ambiguities  (default=off)
  -R, --no-ambig-resolve        Specifies that ambigious symbols should not be 
                                  resolved by nearest neighbor  (default=off)
  -t, --no-transprob            Specifies that the transition probabilities 
                                  should not be used in the ambiguity model  
                                  (default=off)
  -a, --ambiguity-frequency-model=ENUM
                                Ambiguity frequency model  (possible 
                                  values="UNI", "BASE" default=`UNI')
  -T, --tstvratio=FLOAT         Transition/transvertion ratio for purine 
                                  transitions ( for the TN model )  
                                  (default=`2.0')
  -P, --pyrtvratio=FLOAT        Transition/transvertion ratio for  pyrimidines 
                                  transitions ( for the TN model )  
                                  (default=`2.0')
  -N, --no-tstvratio            If given fixed ts/tv ratios will not be used  
                                  (default=off)
  -F, --fixfactor=FLOAT         Float specifying what factor to use for 
                                  saturated data. If not given -1 in the entry. 
                                   (default=`1')
  -r, --number-of-runs=INT      nr of runs ( datasets ) in input. This option 
                                  is only used if the input format is 
                                  phylip_multialignment.  (default=`1')
  -p, --print-relaxng-input     print the Relax NG schema for the XML input 
                                  format ( Fastphylo sequence XML format ) and 
                                  then exit  (default=off)
  -w, --print-relaxng-output    print the Relax NG schema for the XML output 
                                  format ( Fastphylo distance matrix XML format 
                                  ) and then exit.  (default=off)

Example usage of this program can be found at its home page
http://fastphylo.sourceforge.net/
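
The -D option selects the distance function. As an illustration of what such a function computes, the following Python sketch applies the textbook Jukes-Cantor (JC) and Kimura two-parameter (K2P) corrections to one pair of aligned DNA sequences. It is not the fastdist implementation: the function names are made up for this example, columns containing gaps or ambiguity symbols are simply skipped (fastdist has its own ambiguity model), and saturated pairs raise an error instead of being handled as the --fixfactor option describes.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq_a, seq_b):
    """Return (sites, transitions, transversions) over unambiguous columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption of this sketch: skip gaps and ambiguity symbols
        sites += 1
        if a == b:
            continue
        # A transition stays within purines or within pyrimidines.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable columns")
    return sites, transitions, transversions


def jc_distance(seq_a, seq_b):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3), p = fraction of differing sites."""
    sites, ts, tv = count_differences(seq_a, seq_b)
    p = (ts + tv) / sites
    # math.log raises ValueError when the pair is saturated (argument <= 0).
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance: d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    sites, ts, tv = count_differences(seq_a, seq_b)
    p = ts / sites  # P: fraction of sites with a transition
    q = tv / sites  # Q: fraction of sites with a transversion
    return -0.5 * math.log((1.0 - 2.0 * p - q) * math.sqrt(1.0 - 2.0 * q))
```

Calling k2p_distance("ACGTACGTAC", "GCGTACGTAA") counts one transition and one transversion over ten sites before applying the correction. Looping such a function over all pairs costs time proportional to l·n^2, which is the bottleneck the published algorithm addresses.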



3.3.1.2. fastdist input file formats

Table 1. fastdist input file formats

file format                     short option   description
fasta format                    -I fasta       Section 3.4.3, “Fasta format”
phylip format                   -I phylip      Section 3.4.2, “phylip format”
fastphylo sequence XML format   -I xml         Section 3.4.1, “Fastphylo sequence XML format”


3.3.1.3. fastdist output file formats

Table 2. fastdist output file formats

file format                            short option   description
fastphylo distance matrix XML format   -O xml         Section 3.4.4, “Fastphylo distance matrix XML format”
Binary distance matrix format          -O binary      Section 3.4.6, “Binary distance matrix format”
phylip distance matrix format          -O phylip      Section 3.4.5, “Phylip distance matrix format”


3.3.1.4. Examples

Example 1. fastdist with input in file phylip format

We use the DNA file described in Example 18, “Example files in phylip format” as input file. The file has two datasets; pass the option -r 2 to fastdist to process both. The command below keeps the default of -r 1, and its output contains a single run. By default the output is given in XML format.

[user@saturn ~]$ fastdist -I phylip seq.phylip

<?xml version="1.0"?>
<root>
 <runs>
  <run id="" dim="3">
   <identities>
    <identity name="Alpha"/>
    <identity name="Beta"/>
    <identity name="Gamma"/>
   </identities>
   <dms>
   <dm>
    <row>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>0.299650</entry>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>0.733169</entry>
     <entry>0.309520</entry>
     <entry>0.000000</entry>
    </row>
   </dm>
   </dms>
  </run>
 </runs>
</root>
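
The XML output above stores each distance matrix as a lower triangle: row i holds the distances to the taxa listed before it, followed by the diagonal entry. The Python sketch below expands every dm element into a full symmetric matrix. It relies only on the element names visible in the example output; the function name and the SAMPLE constant are made up for this illustration, and the row layout is inferred from the example rather than from the schema.

```python
import xml.etree.ElementTree as ET

# The output of Example 1, shortened to one line per element.
SAMPLE = """<?xml version="1.0"?>
<root>
 <runs>
  <run id="" dim="3">
   <identities>
    <identity name="Alpha"/>
    <identity name="Beta"/>
    <identity name="Gamma"/>
   </identities>
   <dms>
   <dm>
    <row><entry>0.000000</entry></row>
    <row><entry>0.299650</entry><entry>0.000000</entry></row>
    <row><entry>0.733169</entry><entry>0.309520</entry><entry>0.000000</entry></row>
   </dm>
   </dms>
  </run>
 </runs>
</root>
"""


def read_distance_matrices(xml_text):
    """Yield (names, matrix) for every <dm> element in the document."""
    root = ET.fromstring(xml_text)
    for run in root.iter("run"):
        names = [identity.get("name") for identity in run.iter("identity")]
        for dm in run.iter("dm"):
            lower = [[float(entry.text) for entry in row.findall("entry")]
                     for row in dm.findall("row")]
            size = len(lower)
            full = [[0.0] * size for _ in range(size)]
            # Mirror each lower-triangle value across the diagonal.
            for i, row in enumerate(lower):
                for j, value in enumerate(row):
                    full[i][j] = full[j][i] = value
            yield names, full
```

For example, `names, matrix = next(read_distance_matrices(SAMPLE))` gives the taxon names in order and a 3x3 matrix in which `matrix[0][2]` and `matrix[2][0]` both hold the Alpha-Gamma distance.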


Example 2. fastdist with output in file binary format connected to fnj through pipe

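We use the DNA file described in Example 18, “Example files in phylip format” as input file. fastdist writes the distance matrix in binary format to stdout, and fnj reads it from stdin and prints the tree in Newick format.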

[user@saturn ~]$ cat seq.phylip | fastdist -I phylip -O binary | fnj -I binary -O newick
(Gamma,Beta,Alpha);
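
The same pipe can be set up from a script. The following Python sketch connects the two programs with the standard subprocess module and uses only the command line options shown in the example above. The function names are made up for this illustration, and the sketch assumes that both programs are on the PATH and signal failure with a non-zero exit status.

```python
import subprocess


def pipeline_commands(input_format="phylip"):
    """Return the fastdist and fnj argument lists used in Example 2."""
    fastdist_cmd = ["fastdist", "-I", input_format, "-O", "binary"]
    fnj_cmd = ["fnj", "-I", "binary", "-O", "newick"]
    return fastdist_cmd, fnj_cmd


def alignment_to_newick(alignment_path, input_format="phylip"):
    """Pipe an alignment file through fastdist and fnj; return the Newick text."""
    fastdist_cmd, fnj_cmd = pipeline_commands(input_format)
    with open(alignment_path, "rb") as alignment:
        # fastdist reads the alignment on stdin, like "cat file | fastdist".
        producer = subprocess.Popen(fastdist_cmd, stdin=alignment,
                                    stdout=subprocess.PIPE)
        consumer = subprocess.Popen(fnj_cmd, stdin=producer.stdout,
                                    stdout=subprocess.PIPE)
        # Close our copy of the pipe so fastdist notices if fnj exits early.
        producer.stdout.close()
        newick, _ = consumer.communicate()
        producer_status = producer.wait()
    if producer_status != 0 or consumer.returncode != 0:
        raise RuntimeError("fastdist or fnj reported a failure")
    return newick.decode().strip()
```

With the example file in the current directory, `alignment_to_newick("seq.phylip")` runs the pipeline of Example 2 and returns the tree that fnj prints.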


Example 3. fastdist with input in file fasta format

We use the file described in Example 19, “seq.fasta, an example file in fasta format” as input file. By default the output is given in XML format.

[user@saturn ~]$ fastdist -I fasta seq.fasta

<?xml version="1.0"?>
<root>
 <runs>
  <run id="" dim="3">
   <identities>
    <identity name="Alpha"/>
    <identity name="Beta"/>
    <identity name="Gamma"/>
   </identities>
   <dms>
   <dm>
    <row>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>0.299650</entry>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>0.733169</entry>
     <entry>0.309520</entry>
     <entry>0.000000</entry>
    </row>
   </dm>
   </dms>
  </run>
 </runs>
</root>


Example 4. fastdist with input file in XML format

We use the file described in Example 17, “Example files in Fastphylo sequence XML format” containing DNA sequences as input file.

[Note]Note

The -r option can only be used if the input is in phylip format. For XML files, fastdist computes all data sets (runs). Fasta files can only contain one data set, so the -r option has no meaning there.

[user@saturn ~]$ fastdist -I xml -O xml seq.xml
<?xml version="1.0"?>
<root>
 <runs>
  <run id="run1" dim="3">
   <identities>
    <identity name="Alpha"/>
    <identity name="Beta">
     <extrainfo myattr="" species="penguin">
          <foo bar="1"/>
        </extrainfo>
    </identity>
    <identity name="Gamma"/>
   </identities>
   <dms>
   <dm>
    <row>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>0.299650</entry>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>0.733169</entry>
     <entry>0.309520</entry>
     <entry>0.000000</entry>
    </row>
   </dm>
   </dms>
  </run>
  <run id="run2" dim="3">
   <identities>
    <identity name="Alpha"/>
    <identity name="Beta"/>
    <identity name="Gamma"/>
   </identities>
   <dms>
   <dm>
    <row>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>0.299650</entry>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>0.733169</entry>
     <entry>0.309520</entry>
     <entry>0.000000</entry>
    </row>
   </dm>
   </dms>
  </run>
 </runs>
</root>


Example 5. fastdist with an XML stream on stdin

If you leave out the input filename, the input will be read from stdin. fastdist doesn't wait for the whole XML file to be read before it starts; it starts a computation as soon as a closing </run> tag has been read. The memory consumption will not grow over time, so the input can be arbitrarily large. A never-ending input stream only works in the fastphylo sequence XML format, because the phylip input format requires you to specify in advance how many data sets will be sent to fastdist (the -r option).

[user@saturn ~]$ cat seq.xml | fastdist -I xml -O xml
<?xml version="1.0"?>
<root>
 <runs>
  <run id="run1" dim="3">
   <identities>
    <identity name="Alpha"/>
    <identity name="Beta">
     <extrainfo myattr="" species="penguin">
          <foo bar="1"/>
        </extrainfo>
    </identity>
    <identity name="Gamma"/>
   </identities>
   <dms>
   <dm>
    <row>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>0.299650</entry>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>0.733169</entry>
     <entry>0.309520</entry>
     <entry>0.000000</entry>
    </row>
   </dm>
   </dms>
  </run>
  <run id="run2" dim="3">
   <identities>
    <identity name="Alpha"/>
    <identity name="Beta"/>
    <identity name="Gamma"/>
   </identities>
   <dms>
   <dm>
    <row>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>0.299650</entry>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>0.733169</entry>
     <entry>0.309520</entry>
     <entry>0.000000</entry>
    </row>
   </dm>
   </dms>
  </run>
 </runs>
</root>


Example 6. reading the fastdist XML output stream with python

If the XML output is very large you might want to use an XML parser that doesn't hold the whole file in memory. This Python script shows one way to do that:

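#!/usr/bin/env python3
import sys
from copy import deepcopy

from lxml import etree

# Stream the XML from stdin and handle each <dm> element once it is complete.
for action, element in etree.iterparse(sys.stdin.buffer, tag="dm"):
    # Work on a detached copy of the element.
    dm_copy = deepcopy(element)
    # Count the entries of this matrix whose value is below 0.1.
    print(dm_copy.xpath('count(row/entry[ number(.) < 0.1 ])'))
    # Release the parsed element so memory use stays bounded.
    element.clear()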

For each distance matrix the script counts the number of elements with a value below 0.1

[user@saturn ~]$ cat seq.xml | fastdist -I xml -O xml | python fastdist_lxml.py
3.0
3.0

Read more about lxml and xpath.
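
If lxml is not available, the iterparse function in Python's standard library can process the stream in a similar way. The sketch below handles one run at a time and reports the closest pair of taxa in each distance matrix. The function name is made up for this illustration, and the row layout (a lower triangle including the diagonal) is taken from the example output above, not from the schema.

```python
import xml.etree.ElementTree as ET


def closest_pairs(stream):
    """Yield (run_id, name_a, name_b, distance) for each <dm> in the stream."""
    for _event, element in ET.iterparse(stream, events=("end",)):
        if element.tag != "run":
            continue  # wait until a whole run has been read
        names = [identity.get("name") for identity in element.iter("identity")]
        for dm in element.iter("dm"):
            best = None
            for i, row in enumerate(dm.findall("row")):
                for j, entry in enumerate(row.findall("entry")):
                    if j >= i:
                        continue  # skip the diagonal entry
                    distance = float(entry.text)
                    if best is None or distance < best[2]:
                        best = (names[i], names[j], distance)
            if best is not None:
                yield (element.get("id"),) + best
        # Drop the parsed children of this run before reading the next one.
        element.clear()
```

To use it on the output of fastdist, pass a binary stream, for example `closest_pairs(sys.stdin.buffer)` in a script that reads the pipe from `fastdist -I xml -O xml`.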


3.3.2. fastprot

fastprot estimates the evolutionary distance between aligned protein sequences. It implements two methods for calculating the distance between protein sequences: the maximum likelihood estimate of the distance and the expected distance (for details, see the paper by Agarwal and States).

3.3.2.1. Command line options

Type fastprot --help to see the command line options

[user@saturn ~]$ fastprot --help
fastprot 1.0.0

Usage: fastprot [OPTIONS]... [FILE]...

Computes distance matrices out of multialignments of protein sequences

  -h, --help                    Print help and exit
      --detailed-help           Print help, including all details and hidden 
                                  options, and exit
  -V, --version                 Print version and exit
If FILE is not specified the input is read from stdin 
  -o, --outfile=filename        output filename. If not specified, output is 
                                  written to stdout
  -I, --input-format=ENUM       input format. xml means the Fastphylo sequence 
                                  XML format  (possible values="fasta", 
                                  "phylip", "xml" default=`fasta')
  -e, --memory-efficient         memory efficient. Use less memory space and 
                                  fast implementation. Only used with fasta and 
                                  phylip format  (default=off)
  -O, --output-format=ENUM      output format. xml means the Fastphylo distance 
                                  matrix XML format  (possible 
                                  values="phylip", "xml", "binary" 
                                  default=`xml')
  -b, --bootstraps=INT          Bootstrap num times and create matrix for each  
                                  (default=`0')
  -k, --no-incl-orig            If the distance matrix from the original 
                                  sequences should NOT be included - for 
                                  bootstrapping  (default=off)
  -R, --seed=INT                Random seed. If not specified the current 
                                  timestamp will be used
  -D, --distance-function=ENUM  Distance function  (possible values="ID", 
                                  "JC", "JCK", "JCSS", "WAG", "JTT", 
                                  "DAY", "ARVE", "MVR" default=`WAG')
  -F, --model-file=filename     Read matrix and equilibrium distribution from 
                                  file, when used --distance-function is 
                                  disregarded
  -i, --remove-indels           Remove gap columns. A gap is denoted by '-'.  
                                  (default=off)
  -m, --maximum-likelihood      Compute a Maximum Likelihood estimate instead. 
                                  Can not be used with --distance-function=ID, 
                                  JC, JCK or JCSS or --sd  (default=off)
  -S, --sd                      Not yet implemented! Output a matrix with 
                                  standard deviations after the distance 
                                  matrix. Can not be used with 
                                  --distance-function=ID, JC, JCK or JCSS or 
                                  --maximum-likelihood  (default=off)
  -p, --pfam                    use a normal distribution as distance prior, 
                                  estimated from Pfam 7.2  (default=off)
  -s, --speed=INT               'Speed'. High speed results in low precision, 
                                  only affects ED calculations. Default is 5. 
                                  Valid range is [1,10].  (possible 
                                  values="1", "2", "3", "4", "5", 
                                  "6", "7", "8" default=`4')
  -P, --print-relaxng-input     print the Relax NG schema for the XML input 
                                  format ( Fastphylo protein sequence XML 
                                  format ) and then exit  (default=off)
  -w, --print-relaxng-output    print the Relax NG schema for the XML output 
                                  format ( Fastphylo distance matrix XML format 
                                  ) and then exit.  (default=off)

Example usage of this program can be found at its home page
http://fastphylo.sourceforge.net/



3.3.2.2. fastprot input file formats

Table 3. fastprot input file formats

file format                    short option  description
fasta format                   -I fasta      Section 3.4.3, “Fasta format”
phylip format                  -I phylip     Section 3.4.2, “phylip format”
fastphylo sequence XML format  -I xml        Section 3.4.1, “Fastphylo sequence XML format”


3.3.2.3. fastprot output file formats

Table 4. fastprot output file formats

file format                           short option  description
fastphylo distance matrix XML format  -O xml        Section 3.4.4, “Fastphylo distance matrix XML format”
Binary distance matrix format         -O binary     Section 3.4.6, “Binary distance matrix format”
phylip distance matrix format         -O phylip     Section 3.4.5, “Phylip distance matrix format”


3.3.2.4. Examples

Example 7. fastprot with input in file phylip format

We use the protein sequence file protein_seq.phylip described in Example 18, “Example files in phylip format” as input file.

[user@saturn ~]$ fastprot -I phylip protein_seq.phylip -O xml


<?xml version="1.0"?>
<root>
 <runs>
  <run id="" dim="4">
   <identities>
    <identity name="Cow"/>
    <identity name="Carp"/>
    <identity name="Chicken"/>
    <identity name="Human"/>
   </identities>
   <dms>
   <dm>
    <row>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>0.402252</entry>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>2.622102</entry>
     <entry>2.334973</entry>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>2.919533</entry>
     <entry>2.733489</entry>
     <entry>0.903515</entry>
     <entry>0.000000</entry>
    </row>
   </dm>
   </dms>
  </run>
 </runs>
</root>
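
The XML output stores only the lower triangle of each distance matrix, one row element per sequence. The following is a minimal sketch of how such output could be expanded into a full symmetric matrix using only the Python standard library. The function name read_runs and the embedded DM_XML string (a compacted copy of the output above) are illustrative assumptions, not part of fastphylo.

```python
import xml.etree.ElementTree as ET

# Compacted copy of the fastprot output shown above (illustrative input).
DM_XML = """<?xml version="1.0"?>
<root>
 <runs>
  <run id="" dim="4">
   <identities>
    <identity name="Cow"/>
    <identity name="Carp"/>
    <identity name="Chicken"/>
    <identity name="Human"/>
   </identities>
   <dms>
   <dm>
    <row><entry>0.000000</entry></row>
    <row><entry>0.402252</entry><entry>0.000000</entry></row>
    <row><entry>2.622102</entry><entry>2.334973</entry><entry>0.000000</entry></row>
    <row><entry>2.919533</entry><entry>2.733489</entry><entry>0.903515</entry><entry>0.000000</entry></row>
   </dm>
   </dms>
  </run>
 </runs>
</root>
"""


def read_runs(xml_text):
    """Return a list of (names, matrices) tuples, one per <run>."""
    root = ET.fromstring(xml_text)
    runs = []
    for run in root.iter("run"):
        names = [ident.get("name") for ident in run.iter("identity")]
        n = len(names)
        matrices = []
        for dm in run.iter("dm"):
            full = [[0.0] * n for _ in range(n)]
            for i, row in enumerate(dm.findall("row")):
                for j, entry in enumerate(row.findall("entry")):
                    value = float(entry.text)
                    full[i][j] = value
                    full[j][i] = value  # mirror across the diagonal
            matrices.append(full)
        runs.append((names, matrices))
    return runs


names, matrices = read_runs(DM_XML)[0]
dist = matrices[0]
# Distance between Chicken (index 2) and Human (index 3).
print(names[2], names[3], dist[2][3])
```

Because the whole document is parsed at once, this approach only suits output that fits comfortably in memory.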


Example 8. fastprot with output in file binary format connected to fnj through pipe

We use the protein file described in Example 20, “protein_seq.fasta, an example file in fasta format” as input file.

[user@saturn ~]$ cat protein_seq.fasta | fastprot -I fasta -O binary | fnj -I binary -O newick
((Human,Chicken),Carp,Cow);


Example 9. fastprot with input in file fasta format

We use the file described in Example 20, “protein_seq.fasta, an example file in fasta format” as input file. By default, the output is given in XML format.

[user@saturn ~]$ fastprot -I fasta protein_seq.fasta
<?xml version="1.0"?>
<root>
 <runs>
  <run id="" dim="4">
   <identities>
    <identity name="Cow"/>
    <identity name="Carp"/>
    <identity name="Chicken"/>
    <identity name="Human"/>
   </identities>
   <dms>
   <dm>
    <row>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>0.402252</entry>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>2.622102</entry>
     <entry>2.334973</entry>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>2.919533</entry>
     <entry>2.733489</entry>
     <entry>0.903515</entry>
     <entry>0.000000</entry>
    </row>
   </dm>
   </dms>
  </run>
 </runs>
</root>


3.3.3. fastprot_mpi

fastprot_mpi is an implementation of the fastprot program using MPI libraries. fastprot_mpi can scale linearly with the number of nodes available on a cluster and can handle very large protein families.

3.3.3.1. Command line options

Type fastprot_mpi --help to see the command line options

[user@saturn ~]$ fastprot_mpi --help
fastprot_mpi 1.0.0

Usage: fastprot_mpi [OPTIONS]... [FILE]...

Computes distance matrices out of multialignments of protein sequences

  -h, --help                    Print help and exit
      --detailed-help           Print help, including all details and hidden 
                                  options, and exit
  -V, --version                 Print version and exit
If FILE is not specified the input is read from stdin 
  -o, --outfile=filename        output filename. If not specified, output is 
                                  written to stdout
  -I, --input-format=ENUM       input format. xml means the Fastphylo sequence 
                                  XML format  (possible values="fasta", 
                                  "phylip", "xml" default=`fasta')
  -e, --memory-efficient         memory efficient. Use less memory space and 
                                  fast implementation. Only used with fasta and 
                                  phylip format  (default=off)
  -O, --output-format=ENUM      output format. xml means the Fastphylo distance 
                                  matrix XML format  (possible 
                                  values="phylip", "xml", "binary" 
                                  default=`xml')
  -b, --bootstraps=INT          Bootstrap num times and create matrix for each  
                                  (default=`0')
  -k, --no-incl-orig            If the distance matrix from the original 
                                  sequences should not be included - for 
                                  bootstrapping  (default=off)
  -R, --seed=INT                Random seed. If not specified the current 
                                  timestamp will be used
  -D, --distance-function=ENUM  Distance function  (possible values="WAG", 
                                  "JTT", "DAY", "ARVE", "MVR" 
                                  default=`WAG')
  -F, --model-file=filename     Read matrix and equilibrium distribution from 
                                  file, when used --distance-function is 
                                  disregarded
  -i, --remove-indels           Remove gap columns. A gap is denoted by '-'.  
                                  (default=off)
  -m, --maximum-likelihood      Compute a Maximum Likelihood estimate instead. 
                                  Can not be used with --distance-function=ID, 
                                  JC, JCK or JCSS  (default=off)
  -p, --pfam                    use a normal distribution as distance prior, 
                                  estimated from Pfam 7.2  (default=off)
  -s, --speed=INT               'Speed'. High speed results in low precision, 
                                  only affects ED calculations. Default is 5. 
                                  Valid range is [1,10].  (possible 
                                  values="1", "2", "3", "4", "5", 
                                  "6", "7", "8" default=`4')
  -P, --print-relaxng-input     print the Relax NG schema for the XML input 
                                  format ( Fastphylo protein sequence XML 
                                  format ) and then exit  (default=off)
  -w, --print-relaxng-output    print the Relax NG schema for the XML output 
                                  format ( Fastphylo distance matrix XML format 
                                  ) and then exit.  (default=off)

Example usage of this program can be found at its home page
http://fastphylo.sourceforge.net/



3.3.3.2. fastprot_mpi input file formats

Table 5. fastprot_mpi input file formats

file format                    short option  description
fasta format                   -I fasta      Section 3.4.3, “Fasta format”
phylip format                  -I phylip     Section 3.4.2, “phylip format”
fastphylo sequence XML format  -I xml        Section 3.4.1, “Fastphylo sequence XML format”


3.3.3.3. fastprot_mpi output file formats

Table 6. fastprot_mpi output file formats

file format                           short option  description
fastphylo distance matrix XML format  -O xml        Section 3.4.4, “Fastphylo distance matrix XML format”
Binary distance matrix format         -O binary     Section 3.4.6, “Binary distance matrix format”
phylip distance matrix format         -O phylip     Section 3.4.5, “Phylip distance matrix format”


3.3.3.4. Examples

Example 10. fastprot_mpi with input in file phylip format

We use the protein sequence file protein_seq.phylip described in Example 18, “Example files in phylip format” as input file.

[user@saturn ~]$ mpirun -n 2 fastprot_mpi -I phylip protein_seq.phylip -O xml
<?xml version="1.0"?>
<root>
 <runs>
  <run id="" dim="4">
   <identities>
    <identity name="Cow"/>
    <identity name="Carp"/>
    <identity name="Chicken"/>
    <identity name="Human"/>
   </identities>
   <dms>
   <dm>
    <row>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>40.199802</entry>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>262.210205</entry>
     <entry>233.497314</entry>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>291.953308</entry>
     <entry>273.348907</entry>
     <entry>90.351432</entry>
     <entry>0.000000</entry>
    </row>
   </dm>
   </dms>
  </run>
 </runs>
</root>


Example 11. fastprot_mpi with output in file binary format connected to fnj through pipe

We use the protein file described in Example 20, “protein_seq.fasta, an example file in fasta format” as input file.

[user@saturn ~]$ mpirun -n 2 fastprot_mpi -I fasta protein_seq.fasta -O binary | fnj -I binary -O newick
((Human,Chicken),Carp,Cow);


Example 12. fastprot_mpi with input in file fasta format

We use the file described in Example 20, “protein_seq.fasta, an example file in fasta format” as input file. By default, the output is given in XML format.

[user@saturn ~]$ mpirun -n 2 fastprot_mpi -I fasta protein_seq.fasta
<?xml version="1.0"?>
<root>
 <runs>
  <run id="" dim="4">
   <identities>
    <identity name="Cow"/>
    <identity name="Carp"/>
    <identity name="Chicken"/>
    <identity name="Human"/>
   </identities>
   <dms>
   <dm>
    <row>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>40.199802</entry>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>262.210205</entry>
     <entry>233.497314</entry>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>291.953308</entry>
     <entry>273.348907</entry>
     <entry>90.351432</entry>
     <entry>0.000000</entry>
    </row>
   </dm>
   </dms>
  </run>
 </runs>
</root>


3.3.4. fnj

fnj implements the Fast Neighbor Joining algorithm (see Section 2.2, “Fast Neighbor Joining”).

3.3.4.1. Command line options

Type fnj --help to see the command line options

[user@saturn ~]$ fnj --help
fnj 1.0.0

Usage: fnj [OPTIONS]... [FILE]...

builds phylogenetic trees

  -h, --help                    Print help and exit
  -V, --version                 Print version and exit
  -o, --outfile=filename        output filename. If not specifed, output is 
                                  written to stdout
  -I, --input-format=ENUM       input format. 'xml' means the 'Fastphylo 
                                  distance matrix XML format'  (possible 
                                  values="phylip", "xml", "binary" 
                                  default=`xml')
  -O, --output-format=ENUM      output format. 'xml' means the 'Fastphylo tree 
                                  count XML format'  (possible 
                                  values="newick", "xml" default=`xml')
  -c, --print-counts            print the tree count before each the newick 
                                  tree. This flag has no effect on the XML 
                                  output format.  (default=off)
  -a, --analyze-run-number=INT  Determines which dataset should be analyzed 
                                  with 1 being the first dataset. By default 
                                  all are analyzed
  -m, --method=ENUM             reconstruction method to apply  (possible 
                                  values="NJ", "FNJ", "BIONJ" 
                                  default=`FNJ')
  -d, --dm-per-run=INT          nr of Distance matrices per run. Is only used 
                                  if the input format is phylip  (default=`1')
  -r, --number-of-runs=INT      nr of runs. Is only used if the input format is 
                                  phylip  (default=`1')
  -b, --bootstraps=INT          number of boot straps  (default=`0')
  -p, --print-relaxng-input     print the Relax NG schema for the XML input 
                                  format ( Fastphylo distance matrix XML format 
                                  ) and then exit  (default=off)
  -w, --print-relaxng-output    print the Relax NG schema for the XML output 
                                  format ( Fastphylo tree count XML format ) 
                                  and then exit.  (default=off)

Example usage of this program can be found at its home page
http://fastphylo.sourceforge.net/


3.3.4.2. fnj input file formats

Table 7. fnj input file formats

file format                           short option  description
fastphylo distance matrix XML format  -I xml        Section 3.4.4, “Fastphylo distance matrix XML format”
Binary distance matrix format         -I binary     Section 3.4.6, “Binary distance matrix format”
phylip distance matrix format         -I phylip     Section 3.4.5, “Phylip distance matrix format”

3.3.4.3. fnj output file formats

Table 8. fnj output file formats

file format                      short option  description
fastphylo count tree XML format  -O xml        Section 3.4.7, “Fastphylo tree count XML format”

3.3.4.4. Examples

Example 13. fnj with input file in Phylip distance matrix format

We use the file described in Example 22, “dm.phylip, an example file in phylip distance matrix format” as input file. The file has two data sets, so we pass the option -r 2 to fnj. By default, the output is given in the "fastphylo count tree XML format" (-O xml).

[user@saturn ~]$ fnj -r 2 -I phylip dm.phylip
<?xml version="1.0"?>
 <root>
  <runs>
   <run id="" dim="3">
    <identities>
     <identity name="Alpha"/>
     <identity name="Beta"/>
     <identity name="Gamma"/>
    </identities>
    <tree>
     <count>2</count>
     <newick-xml><branch><leaf>Gamma</leaf><leaf>Beta</leaf><leaf>Alpha</leaf></branch></newick-xml>
     <newick>(Gamma,Beta,Alpha);</newick>
    </tree>
   </run>
   <run id="" dim="3">
    <identities>
     <identity name="Alpha"/>
     <identity name="Beta"/>
     <identity name="Gamma"/>
    </identities>
   </run>
  </runs>
 </root>


Example 14. fnj with input file in XML format

We use the file described in Example 21, “dm.xml, an example file in Fastphylo distance matrix XML format” as input file. By default, the output is given in the "fastphylo count tree XML format" (-O xml).

Note

The -r option is not available, and not needed, when the input is in XML format. fnj processes all data sets (runs).

[user@saturn ~]$ fnj -I xml dm.xml

<?xml version="1.0"?>
 <root>
  <runs>
   <run id="a" dim="3">
    <identities>
     <identity name="Alpha"/>
     <identity name="Beta"/>
     <identity name="Gamma"/>
    </identities>
    <tree>
     <count>1</count>
     <newick-xml><branch><leaf>Gamma</leaf><leaf>Beta</leaf><leaf>Alpha</leaf></branch></newick-xml>
     <newick>(Gamma,Beta,Alpha);</newick>
    </tree>
   </run>
   <run id="b" dim="3">
    <identities>
     <identity name="Alpha"/>
     <identity name="Beta"/>
     <identity name="Gamma"/>
    </identities>
    <tree>
     <count>1</count>
     <newick-xml><branch><leaf>Gamma</leaf><leaf>Beta</leaf><leaf>Alpha</leaf></branch></newick-xml>
     <newick>(Gamma,Beta,Alpha);</newick>
    </tree>
   </run>
  </runs>
 </root>


Example 15. connecting fastdist to fnj with a pipe

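We use the DNA file seq.phylip described in Example 18, “Example files in phylip format” as input file. The file has two data sets, so both programs are given -r 2. We bootstrap 3 times; together with the matrix computed from the original sequences, this gives 4 distance matrices per run, which is why fnj is given -d 4. First we send the data in phylip format through the pipe: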

[user@saturn ~]$ cat seq.phylip | fastdist -I phylip -O phylip -b 3 -r 2 | fnj -I phylip -O xml -r 2 -d 4
<?xml version="1.0"?>
 <root>
  <runs>
   <run id="" dim="3">
    <identities>
     <identity name="Alpha"/>
     <identity name="Beta"/>
     <identity name="Gamma"/>
    </identities>
    <tree>
     <count>8</count>
     <newick-xml><branch><leaf>Gamma</leaf><leaf>Beta</leaf><leaf>Alpha</leaf></branch></newick-xml>
     <newick>(Gamma,Beta,Alpha);</newick>
    </tree>
   </run>
   <run id="" dim="3">
    <identities>
     <identity name="Alpha"/>
     <identity name="Beta"/>
     <identity name="Gamma"/>
    </identities>
   </run>
  </runs>
 </root>

We could also send the data in XML format through the pipe:

[user@saturn ~]$ cat seq.phylip | fastdist -I phylip  -O xml -b 3 -r 2 | fnj -I xml -O xml -m FNJ
<?xml version="1.0"?>
 <root>
  <runs>
   <run id="" dim="3">
    <identities>
     <identity name="Alpha"/>
     <identity name="Beta"/>
     <identity name="Gamma"/>
    </identities>
    <tree>
     <count>4</count>
     <newick-xml><branch><leaf>Gamma</leaf><leaf>Beta</leaf><leaf>Alpha</leaf></branch></newick-xml>
     <newick>(Gamma,Beta,Alpha);</newick>
    </tree>
   </run>
   <run id="" dim="3">
    <identities>
     <identity name="Alpha"/>
     <identity name="Beta"/>
     <identity name="Gamma"/>
    </identities>
    <tree>
     <count>4</count>
     <newick-xml><branch><leaf>Gamma</leaf><leaf>Beta</leaf><leaf>Alpha</leaf></branch></newick-xml>
     <newick>(Gamma,Beta,Alpha);</newick>
    </tree>
   </run>
  </runs>
 </root>

As the XML format is more descriptive, the flags -d and -r are no longer needed by fnj.
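
With the phylip format, the value passed to fnj -d has to match the number of distance matrices produced for each data set. The following is a minimal sketch of that arithmetic. It assumes that the matrix from the original sequences is emitted in addition to the bootstrap matrices unless it is explicitly excluded; the helper name matrices_per_run is hypothetical.

```python
def matrices_per_run(bootstraps, include_original=True):
    """Number of distance matrices expected for each data set.

    Assumption: one matrix per bootstrap replicate, plus one for the
    original sequences unless that matrix has been excluded.
    """
    if bootstraps < 0:
        raise ValueError("bootstraps must be non-negative")
    return bootstraps + (1 if include_original else 0)


# The pipeline above bootstraps 3 times and keeps the original matrix.
dm_per_run = matrices_per_run(3)
print("fnj -I phylip -O xml -r 2 -d %d" % dm_per_run)
```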


Example 16. reading the fnj XML output stream with python

If the XML output is very large, you may want to use an XML parser that does not hold the whole file in memory. The following Python script shows one way to do this:

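#!/usr/bin/python
import sys
from lxml import etree

# Read bytes where available (Python 3); fall back to sys.stdin on Python 2.
stdin = getattr(sys.stdin, "buffer", sys.stdin)

maxcount = 0
# iterparse yields each <run> element as soon as its end tag has been read.
for action, element in etree.iterparse(stdin, tag="run"):
  # A run without a <tree> element contributes no count.
  for count in element.xpath('tree/count'):
    maxcount = max(maxcount, int(count.text))
  # Free the processed run so memory use stays bounded.
  element.clear()
print(maxcount)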

The script prints the maximum count (just as an example).

[user@saturn ~]$ fnj -I xml dm.xml | python fnj_lxml.py
1

Read more about lxml and xpath.
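
If lxml is not installed, the same streaming approach can be sketched with the standard library module xml.etree.ElementTree. The generator name iter_tree_counts and the embedded SAMPLE document (a shortened copy of the output in Example 14) are illustrative assumptions.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the fnj output from Example 14 (illustrative input).
SAMPLE = b"""<?xml version="1.0"?>
<root>
 <runs>
  <run id="a" dim="3">
   <tree>
    <count>1</count>
    <newick>(Gamma,Beta,Alpha);</newick>
   </tree>
  </run>
  <run id="b" dim="3">
   <tree>
    <count>1</count>
    <newick>(Gamma,Beta,Alpha);</newick>
   </tree>
  </run>
 </runs>
</root>
"""


def iter_tree_counts(stream):
    """Yield (run id, newick string, count) for each <tree> in the stream."""
    for event, element in ET.iterparse(stream, events=("end",)):
        if element.tag != "run":
            continue
        for tree in element.findall("tree"):
            yield (element.get("id"),
                   tree.findtext("newick"),
                   int(tree.findtext("count")))
        element.clear()  # release the finished run before reading on


results = list(iter_tree_counts(io.BytesIO(SAMPLE)))
for run_id, newick, count in results:
    print(run_id, count, newick)
```

To read from a pipe instead of the embedded sample, sys.stdin.buffer could be passed as the stream.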


3.4. File formats

This software package handles the following file formats

3.4.1. Fastphylo sequence XML format

The Fastphylo sequence XML format is chosen by the option -I xml to fastdist, fastprot or fastprot_mpi. For instance, type fastdist --print-relaxng-input to see its Relax NG schema:

[user@saturn ~]$ fastdist --print-relaxng-input
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="root">
      <element name="runs">
        <zeroOrMore>
          <element name="run">
            <attribute name="id">
              <text/>
            </attribute>
            <oneOrMore>
              <element name="seq">
                <attribute name="seq">
                  <data type="string">
                    <param name="pattern">[acgtumrwsykvhdbnxACGTUMRWSYKVHDBNX -.?]+</param>
                  </data>
                </attribute>
                <attribute name="name">
                  <text/>
                </attribute>
                <optional>
                  <element name="extrainfo">
                    <ref name="anyContent"/>
                  </element>
                </optional>
              </element>
            </oneOrMore>
          </element>
        </zeroOrMore>
      </element>
    </element>
  </start>
  <define name="anyContent">
    <mixed>
      <zeroOrMore>
        <choice>
          <attribute>
            <anyName/>
          </attribute>
          <ref name="anyElement"/>
        </choice>
      </zeroOrMore>
    </mixed>
  </define>
  <define name="anyElement">
    <element>
      <anyName/>
      <ref name="anyContent"/>
    </element>
  </define>
</grammar>


The Relax NG schema specifies that the extrainfo element is optional and can be inserted as a child to a seq element. The extrainfo element may contain any content and will be passed on to the output XML format.

Example 17. Example files in Fastphylo sequence XML format

The example file seq.xml contains DNA sequences:

<?xml version="1.0"?>
<root>
  <runs>
    <run id="run1">
      <seq name="Alpha" seq="AACGTGGCCACAT"/>
      <seq name="Beta" seq="AAGGTCGCCACAC">
        <extrainfo myattr="" species="penguin">
          <foo bar="1"/>
        </extrainfo>
      </seq>
      <seq name="Gamma" seq="CAGTTCGCCACAA"/>
    </run>
    <run id="run2">
      <seq name="Alpha" seq="AACGTGGCCACAT"/>
      <seq name="Beta" seq="AAGGTCGCCACAC"/>
      <seq name="Gamma" seq="CAGTTCGCCACAA"/>
    </run>
  </runs>
</root>

protein_seq.xml contains protein sequences:

<?xml version="1.0"?>
<root>
  <runs>
    <run id="run1">
      <seq name="Cow" seq="MAYPMQLGFQDA"/>
      <seq name="Carp" seq="MAHPTQLGFKDA"/>
      <seq name="Chicken" seq="MALLTLMLMEKL"/>
      <seq name="Human" seq="MAHLFLTLTTKL"/>
    </run>
  </runs>
</root>
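
Files in this format can also be generated programmatically. The following is a minimal sketch, using only the Python standard library, that builds a DNA sequence document and checks each sequence against the character class from the Relax NG schema above. The function name build_sequence_xml is a hypothetical helper, and the check applies only to the DNA schema printed by fastdist, not to protein sequences.

```python
import re
import xml.etree.ElementTree as ET

# Character class copied from the Relax NG schema printed by fastdist.
DNA_PATTERN = re.compile(r"[acgtumrwsykvhdbnxACGTUMRWSYKVHDBNX -.?]+")


def build_sequence_xml(runs):
    """Build a sequence XML document.

    runs is a list of (run_id, [(name, sequence), ...]) tuples.
    """
    root = ET.Element("root")
    runs_el = ET.SubElement(root, "runs")
    for run_id, sequences in runs:
        run_el = ET.SubElement(runs_el, "run", {"id": run_id})
        for name, seq in sequences:
            if not DNA_PATTERN.fullmatch(seq):
                raise ValueError("invalid DNA sequence for %s" % name)
            ET.SubElement(run_el, "seq", {"name": name, "seq": seq})
    return ET.tostring(root, encoding="unicode")


xml_text = build_sequence_xml([
    ("run1", [
        ("Alpha", "AACGTGGCCACAT"),
        ("Beta", "AAGGTCGCCACAC"),
        ("Gamma", "CAGTTCGCCACAA"),
    ]),
])
print(xml_text)
```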


3.4.2. phylip format

The phylip input format is chosen by the option -I phylip to fastdist, fastprot or fastprot_mpi.

Example 18. Example files in phylip format

The DNA example file seq.phylip contains two datasets:

   3   13
Alpha     AAC GTGG
Beta      AAG GTCG
Gamma     CAG TTCG
          CCAC AT
          CCAC AC
          CCAC AA

   3   13
Alpha     CCACGGG
Beta      AAGGTCG
Gamma     CAGTTCG
          CGACAT
          CCACAC
          CCGCAA

The example file protein_seq.phylip contains protein sequences:

4	12
Cow         MAYPMQLGFQDA
Carp        MAHPTQLGFKDA
Chicken     MALLTLMLMEKL
Human       MAHLFLTLTTKL
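
The layout of seq.phylip (a header line, one block with names, then continuation blocks without names) can be read with a short parser. The following is a minimal sketch, not the parser used by fastphylo. It assumes that names occupy the first 10 columns, that whitespace inside sequence data is insignificant, and that blank lines carry no meaning; the function name parse_phylip is hypothetical.

```python
# Copy of the seq.phylip example above (illustrative input).
SEQ_PHYLIP = """   3   13
Alpha     AAC GTGG
Beta      AAG GTCG
Gamma     CAG TTCG
          CCAC AT
          CCAC AC
          CCAC AA

   3   13
Alpha     CCACGGG
Beta      AAGGTCG
Gamma     CAGTTCG
          CGACAT
          CCACAC
          CCGCAA
"""


def parse_phylip(text):
    """Return a list of data sets, each a list of (name, sequence) pairs."""
    lines = [line for line in text.splitlines() if line.strip()]
    datasets = []
    pos = 0
    while pos < len(lines):
        n_seqs, seq_len = (int(field) for field in lines[pos].split())
        pos += 1
        names = []
        seqs = []
        # First block: 10-column name followed by sequence data.
        for line in lines[pos:pos + n_seqs]:
            names.append(line[:10].strip())
            seqs.append("".join(line[10:].split()))
        pos += n_seqs
        # Continuation blocks carry sequence data only.
        while seqs and len(seqs[0]) < seq_len and pos < len(lines):
            for i, line in enumerate(lines[pos:pos + n_seqs]):
                seqs[i] += "".join(line.split())
            pos += n_seqs
        datasets.append(list(zip(names, seqs)))
    return datasets


datasets = parse_phylip(SEQ_PHYLIP)
for number, dataset in enumerate(datasets, start=1):
    for name, seq in dataset:
        print(number, name, seq)
```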


3.4.3. Fasta format

The Fasta input format is chosen by the option -I fasta to fastdist, fastprot or fastprot_mpi. Fasta files can contain only one data set. Read more about the Fasta format on Wikipedia. The parser takes the whole header line as the sequence identifier, i.e. all characters after the greater-than character (">").

Example 19. seq.fasta, an example file in fasta format

The example file seq.fasta contains DNA:

>Alpha
AAC-GTGGCCAC-AT
>Beta
AAG-GTCGCCAC-AC
>Gamma
CAG-TTCGCCAC-AA


Example 20. protein_seq.fasta, an example file in fasta format

The example file protein_seq.fasta contains protein sequences:

>Cow
MAYPMQLGFQDA
>Carp
MAHPTQLGFKDA
>Chicken
MALLTLMLMEKL
>Human
MAHLFLTLTTKL
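
The header rule described in this section (everything after ">" is the identifier) can be illustrated with a small reader. This is a minimal sketch, not the parser used by fastphylo; the function name parse_fasta is hypothetical, and sequences split over several lines are assumed to be concatenated.

```python
# Copy of the protein_seq.fasta example above (illustrative input).
PROTEIN_FASTA = """>Cow
MAYPMQLGFQDA
>Carp
MAHPTQLGFKDA
>Chicken
MALLTLMLMEKL
>Human
MAHLFLTLTTKL
"""


def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA text."""
    records = []
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name = line[1:]  # the whole header line is the identifier
            chunks = []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


records = parse_fasta(PROTEIN_FASTA)
for name, seq in records:
    print(name, seq)
```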


3.4.4. Fastphylo distance matrix XML format

The Fastphylo distance matrix XML format is chosen by the option -O xml to fastdist, fastprot and fastprot_mpi, and by the option -I xml to fnj. For instance, type fastdist --print-relaxng-output to see its Relax NG schema:

[user@saturn ~]$ fastdist --print-relaxng-output
<?xml version="1.0"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="root">
      <element name="runs">
        <zeroOrMore>
          <element name="run">
            <attribute name="dim">
              <data type="integer"/>
            </attribute>
            <attribute name="id">
              <text/>
            </attribute>
            <element name="identities">
              <oneOrMore>
                <element name="identity">
                  <attribute name="name">
                    <text/>
                  </attribute>
                  <optional>
                    <element name="extrainfo">
                      <ref name="anyContent"/>
                    </element>
                  </optional>
                </element>
              </oneOrMore>
            </element>
            <element name="dms">
              <oneOrMore>
                <element name="dm">
                  <oneOrMore>
                    <element name="row">
                      <oneOrMore>
                        <element name="entry">
                          <data type="float"/>
                        </element>
                      </oneOrMore>
                    </element>
                  </oneOrMore>
                </element>
              </oneOrMore>
            </element>
          </element>
        </zeroOrMore>
      </element>
    </element>
  </start>
  <define name="anyContent">
    <mixed>
      <zeroOrMore>
        <choice>
          <attribute>
            <anyName/>
          </attribute>
          <ref name="anyElement"/>
        </choice>
      </zeroOrMore>
    </mixed>
  </define>
  <define name="anyElement">
    <element>
      <anyName/>
      <ref name="anyContent"/>
    </element>
  </define>
</grammar>


The Relax NG schema specifies that the extrainfo element is optional and can be inserted as a child of an identity element. The extrainfo element may contain any content.

Example 21. dm.xml, an example file in Fastphylo distance matrix XML format

The example file dm.xml contains

<?xml version="1.0"?>
<root>
 <runs>
  <run id="a" dim="3">
   <identities>
    <identity name="Alpha"/>
    <identity name="Beta"/>
    <identity name="Gamma"/>
   </identities>
   <dms>
   <dm>
    <row>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>0.299650</entry>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>0.733169</entry>
     <entry>0.309520</entry>
     <entry>0.000000</entry>
    </row>
   </dm>
   </dms>
  </run>
  <run id="b" dim="3">
   <identities>
    <identity name="Alpha"/>
    <identity name="Beta"/>
    <identity name="Gamma"/>
   </identities>
   <dms>
   <dm>
    <row>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>3.258005</entry>
     <entry>0.000000</entry>
    </row>
    <row>
     <entry>1.873653</entry>
     <entry>0.459840</entry>
     <entry>0.000000</entry>
    </row>
   </dm>
   </dms>
  </run>
 </runs>
</root>


3.4.5. Phylip distance matrix format

The Phylip distance matrix format is chosen by the option -O phylip to fastdist, fastprot or fastprot_mpi, or by the option -I phylip to fnj.

Example 22. dm.phylip, an example file in phylip distance matrix format

The example file dm.phylip contains

    3
Alpha       0.000000  0.299650  0.733169
Beta        0.299650  0.000000  0.309520
Gamma       0.733169  0.309520  0.000000
    3
Alpha       0.000000  3.258005  1.873653
Beta        3.258005  0.000000  0.459840
Gamma       1.873653  0.459840  0.000000

It contains two data sets.
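
A file such as dm.phylip can be read back with a few lines of Python. The following is a minimal sketch, assuming square matrices with one row per line and names that contain no whitespace; the function name parse_phylip_dm is hypothetical and this is not the parser used by fnj.

```python
# Copy of the dm.phylip example above (illustrative input).
DM_PHYLIP = """    3
Alpha       0.000000  0.299650  0.733169
Beta        0.299650  0.000000  0.309520
Gamma       0.733169  0.309520  0.000000
    3
Alpha       0.000000  3.258005  1.873653
Beta        3.258005  0.000000  0.459840
Gamma       1.873653  0.459840  0.000000
"""


def parse_phylip_dm(text):
    """Return a list of (names, matrix) tuples, one per data set."""
    lines = [line for line in text.splitlines() if line.strip()]
    datasets = []
    pos = 0
    while pos < len(lines):
        dim = int(lines[pos].split()[0])  # header line holds the dimension
        pos += 1
        names = []
        matrix = []
        for line in lines[pos:pos + dim]:
            fields = line.split()
            names.append(fields[0])
            matrix.append([float(value) for value in fields[1:]])
        pos += dim
        datasets.append((names, matrix))
    return datasets


dm_sets = parse_phylip_dm(DM_PHYLIP)
for names, matrix in dm_sets:
    print(names, matrix[0][1])
```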


3.4.6. Binary distance matrix format

The Binary distance matrix format is chosen by the option -O binary to fastdist, fastprot and fastprot_mpi, or by the option -I binary to fnj. With the binary format, fastphylo works row by row and computes only the upper triangular part of the distance matrix, which is then stored in binary form instead of plain text. The main advantages of the binary format are that it reduces disk space usage and speeds up fastphylo, since only half of the matrix is computed instead of the whole distance matrix.

In a binary format output file, fastphylo's current version is stored first, followed by the number of sequences, then the accessions, and finally the rows of the upper triangular distance matrix. A colon is used as the delimiter between the components.

3.4.7. Fastphylo tree count XML format

The Fastphylo tree count XML format is chosen by the option -O xml to fnj. An example of the format is shown in Example 14, “fnj with input file in XML format”. Type fnj --print-relaxng-output to see the format's Relax NG schema.

[user@saturn ~]$ fnj --print-relaxng-output
<?xml version="1.0"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="root">
      <element name="runs">
        <zeroOrMore>
          <element name="run">
            <attribute name="id">
              <text/>
            </attribute>
            <attribute name="dim">
              <data type="integer"/>
            </attribute>
            <element name="identities">
              <oneOrMore>
                <element name="identity">
                  <attribute name="name">
                    <text/>
                  </attribute>
                  <optional>
                    <element name="extrainfo">
                      <ref name="anyContent"/>
                    </element>
                  </optional>
                </element>
              </oneOrMore>
            </element>
            <element name="tree">
              <element name="count">
                <data type="integer"/>
              </element>
              <element name="newick-xml">
                <ref name="branch"/>
              </element>
              <element name="newick">
                <text/>
              </element>
            </element>
          </element>
        </zeroOrMore>
      </element>
    </element>
  </start>
  <define name="anyContent">
    <mixed>
      <zeroOrMore>
        <choice>
          <attribute>
            <anyName/>
          </attribute>
          <ref name="anyElement"/>
        </choice>
      </zeroOrMore>
    </mixed>
  </define>
  <define name="anyElement">
    <element>
      <anyName/>
      <ref name="anyContent"/>
    </element>
  </define>
  <define name="branch">
    <element name="branch">
      <optional>
        <attribute name="length">
          <data type="float"/>
        </attribute>
      </optional>
      <oneOrMore>
        <choice>
          <element name="leaf">
            <optional>
              <attribute name="length">
                <data type="float"/>
              </attribute>
            </optional>
            <text/>
          </element>
          <ref name="branch"/>
        </choice>
      </oneOrMore>
    </element>
  </define>
</grammar>
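
The recursive branch definition in the schema mirrors the nesting of a Newick string. The following is a minimal sketch of how a newick-xml element could be turned back into Newick text with the Python standard library. The function names are hypothetical, and rendering the optional length attribute as ":length" is an assumption based on the usual Newick convention, not something stated by the schema.

```python
import xml.etree.ElementTree as ET


def branch_to_newick(element):
    """Render a <branch> or <leaf> element as a Newick subtree."""
    if element.tag == "leaf":
        label = element.text or ""
    else:
        children = [branch_to_newick(child) for child in element
                    if child.tag in ("branch", "leaf")]
        label = "(" + ",".join(children) + ")"
    length = element.get("length")
    # Assumption: a length attribute maps to the Newick ":length" suffix.
    return label if length is None else "%s:%s" % (label, length)


def newick_xml_to_newick(newick_xml):
    """Convert a <newick-xml> element into a Newick string."""
    return branch_to_newick(newick_xml.find("branch")) + ";"


# The newick-xml element from Example 14.
example = ET.fromstring(
    "<newick-xml><branch><leaf>Gamma</leaf><leaf>Beta</leaf>"
    "<leaf>Alpha</leaf></branch></newick-xml>"
)
newick = newick_xml_to_newick(example)
print(newick)
```

For the element from Example 14, the result agrees with the newick element that fnj writes next to it.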



fastphylo is hosted at SourceForge.net.