fastphylo is a software project containing implementations of the algorithms "Fast Computation of Distance Estimators" and "Fast Neighbor Joining". The software was published in the journal BMC Bioinformatics in 2013 and is licensed under the MIT license.
The primary URL for this document is http://fastphylo.sourceforge.net.
BibTeX
@Article{24255987,
  AUTHOR = {Khan, Mehmood and Elias, Isaac and Sjolund, Erik and Nylander, Kristina and Guimera, Roman and Schobesberger, Richard and Schmitzberger, Peter and Lagergren, Jens and Arvestad, Lars},
  TITLE = {Fastphylo: Fast tools for phylogenetics},
  JOURNAL = {BMC Bioinformatics},
  VOLUME = {14},
  YEAR = {2013},
  NUMBER = {1},
  PAGES = {334},
  URL = {http://www.biomedcentral.com/1471-2105/14/334},
  DOI = {10.1186/1471-2105-14-334},
  PubMedID = {24255987},
  ISSN = {1471-2105},
}
Isaac Elias and Jens Lagergren published the algorithm in the journal BMC Bioinformatics in 2007.
BibTeX
@Article{EliasLagergren_fastdist,
  author = {Isaac Elias and Jens Lagergren},
  title = {Fast Computation of Distance Estimators},
  journal = {BMC Bioinformatics},
  year = {2007},
  pages = {89},
  volume = {8}
}
Background: Some distance methods are among the most commonly used methods for reconstructing phylogenetic trees from sequence data. The input to a distance method is a distance matrix, containing estimated pairwise distances between all pairs of taxa. Distance methods themselves are often fast, e.g., the famous and popular Neighbor Joining (NJ) algorithm reconstructs a phylogeny of n taxa in time O(n^3). Unfortunately, the fastest practical algorithms known for computing the distance matrix, from n sequences of length l, take time proportional to l·n^2. Since the sequence length typically is much larger than the number of taxa, the distance estimation is the bottleneck in phylogeny reconstruction. This bottleneck is especially apparent in reconstruction of large phylogenies or in applications where many trees have to be reconstructed, e.g., bootstrapping and genome wide applications.
Results: We give an advanced algorithm for computing the number of mutational events between DNA sequences which is significantly faster than both Phylip and Paup. Moreover, we give a new method for estimating pairwise distances between sequences which contain ambiguity symbols. This new method is shown to be more accurate as well as faster than earlier methods.
Conclusions: Our novel algorithm for computing distance estimators provides a valuable tool in phylogeny reconstruction. Since the running time of our distance estimation algorithm is comparable to that of most distance methods, the previous bottleneck is removed. All distance methods, such as NJ, require a distance matrix as input and, hence, our novel algorithm significantly improves the overall running time of all distance methods. In particular, we show for real world biological applications how the running time of phylogeny reconstruction using NJ is improved from a matter of hours to a matter of seconds.
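To make the l·n^2 cost mentioned in the abstract concrete, here is a minimal Python sketch of the naive baseline: every pair of sequences is compared site by site and the raw counts are turned into a corrected distance. It uses the textbook Jukes-Cantor and Kimura two-parameter formulas; the function names are hypothetical, gaps and ambiguity symbols are simply skipped, and none of this is the accelerated algorithm from the paper or necessarily what fastdist computes internally.

```python
import math

# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def mismatch_fractions(seq_a, seq_b):
    """Return (P, Q): the fractions of compared sites that differ by a
    transition (P) or by a transversion (Q)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # this naive sketch skips gaps and ambiguity symbols
        compared += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return transitions / float(compared), transversions / float(compared)


def jc_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance (textbook formula)."""
    p, q = mismatch_fractions(seq_a, seq_b)
    # math.log raises ValueError for saturated data (p + q >= 3/4)
    return -0.75 * math.log(1.0 - 4.0 * (p + q) / 3.0)


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter corrected distance (textbook formula)."""
    p, q = mismatch_fractions(seq_a, seq_b)
    return -0.5 * math.log((1.0 - 2.0 * p - q) * math.sqrt(1.0 - 2.0 * q))


def distance_matrix(sequences, distance=k2p_distance):
    """Lower-triangular matrix: row i holds the distances to sequences 0..i.
    Each pair costs O(l), so the whole matrix costs O(l * n^2)."""
    return [
        [0.0 if i == j else distance(sequences[i], sequences[j])
         for j in range(i + 1)]
        for i in range(len(sequences))
    ]
```

Calling `distance_matrix(["AAAAAAAAAA", "AAAAAAAAAG", "AAAAAAAAAC"])` returns three rows of length 1, 2 and 3, the same lower-triangular layout that the fastdist XML output uses.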
Supplementary Material - Fast Computation of Distance Estimators. Contains additional figures for the tests run on the ambiguity approaches. (PDF)
Simulated Test Data for Ambiguities (Tar archive)
Biological Test Data (Tar archive)
Command file used for running Paup (Nexus file)
Isaac Elias and Jens Lagergren published the algorithm in the proceedings "Proc. of the 32nd International Colloquium on Automata, Languages and Programming (ICALP'05)" in 2005.
BibTeX
@InProceedings{ICALP05:EliasLagergren_FNJ,
  author = {Isaac Elias and Jens Lagergren},
  title = {Fast Neighbor Joining},
  booktitle = {Proc. of the 32nd International Colloquium on Automata, Languages and Programming ({ICALP}'05)},
  pages = {1263--1274},
  year = {2005},
  volume = {3580},
  series = {Lecture Notes in Computer Science},
  month = {July},
  publisher = {Springer-Verlag},
  ISBN = {3-540-27580-0},
}
Reconstructing the evolutionary history of a set of species is a fundamental problem in biology and methods for solving this problem are gauged based on two characteristics: accuracy and efficiency. Neighbor Joining (NJ) is a so-called distance-based method that, thanks to its good accuracy and speed, has been embraced by the phylogeny community. It takes the distances between n taxa and produces in Θ(n^3) time a phylogenetic tree, i.e., a tree which aims to describe the evolutionary history of the taxa. In addition to performing well in practice, the NJ algorithm has optimal reconstruction radius.
The contribution of this paper is twofold: (1) we present an algorithm called Fast Neighbor Joining (FNJ) with optimal reconstruction radius and optimal run time complexity O(n^2) and (2) we present a greatly simplified proof for the correctness of NJ. Initial experiments show that FNJ in practice has almost the same accuracy as NJ, indicating that the property of optimal reconstruction radius has great importance to their good performance. Moreover, we show how improved running time can be achieved for computing the so-called correction formulas.
Download the software from the sourceforge project page. The latest version of fastphylo is 1.0.1.
To install fastphylo on Ubuntu or Debian, first download fastphylo-1.0.1.deb, then log in as root and run
# dpkg -i fastphylo-1.0.1.deb
To install fastphylo on CentOS or Fedora, first download fastphylo-1.0.1.Linux.rpm, then log in as root and run
# yum localinstall fastphylo-1.0.1.Linux.rpm
To install fastphylo on Mac OS X v10.6.8 (Snow Leopard) on an Intel-based Mac, first download fastphylo-1.0.0-MacOSX10.5.tar.gz and then unpack it
$ tar xfz fastphylo-1.0.0-MacOSX10.6.8.tar.gz
To install fastphylo on Mac OS X v10.4 (Tiger) on an Intel-based Mac, first download fastphylo-1.0.0-MacOSX10.4.tar.gz and then unpack it
$ tar xfz fastphylo-1.0.0-MacOSX10.4.tar.gz
To build fastphylo on Unix (e.g. Linux, Mac OS X) you need to have the build prerequisites installed.
On Ubuntu, you can install them using the following commands:
sudo apt-get install cmake
sudo apt-get install libxml2 libxml2-dev
sudo apt-get install -y autotools-dev g++ build-essential openmpi1.6 libopenmpi1.6-dbg
sudo apt-get install libcr-dev mpich2 mpich2-doc
sudo apt-get install libblas-dev libblas-doc liblapack-dev liblapack-doc
You can download the source code using svn:
svn checkout svn://svn.code.sf.net/p/fastphylo/code/trunk fastphylo-code
If you have the fastphylo source code in the directory /tmp/source and you want to install fastphylo into the directory /tmp/install, first run cmake, then make, and then make install
$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DCMAKE_INSTALL_PREFIX=/tmp/install /tmp/source && make && make install
-- A library with BLAS API found.
-- A library with BLAS API found.
-- A library with LAPACK API found.
-- A library with BLAS API found.
-- A library with BLAS API found.
-- A library with LAPACK API found.
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/build
Scanning dependencies of target fastphylo
[ 1%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/BitVector.cpp.o
[ 2%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/Exception.cpp.o
[ 3%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/InitAndPrintOn_utils.cpp.o
[ 4%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/Object.cpp.o
[ 5%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/Sequence.cpp.o
[ 7%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/SequenceTree.cpp.o
[ 8%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/SequenceTree_MostParsimonious.cpp.o
[ 9%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/Simulator.cpp.o
[ 10%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/arg_utils_ext.cpp.o
[ 11%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/file_utils.cpp.o
[ 13%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/stl_utils.cpp.o
[ 14%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/DNA_b128/DNA_b128_String.cpp.o
[ 15%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/DNA_b128/Sequences2DistanceMatrix.cpp.o
[ 16%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/distance_methods/LeastSquaresFit.cpp.o
[ 17%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/distance_methods/NeighborJoining.cpp.o
[ 19%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/sequence_likelihood/Kimura2parameter.cpp.o
[ 20%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/sequence_likelihood/TamuraNei.cpp.o
[ 21%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/sequence_likelihood/ambiguity_nucleotide.cpp.o
[ 22%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/sequence_likelihood/dna_pairwise_sequence_likelihood.cpp.o
[ 23%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/sequence_likelihood/string_compare.cpp.o
[ 25%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/DistanceMatrix.cpp.o
[ 26%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/FloatDistanceMatrix.cpp.o
[ 27%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/DistanceRow.cpp.o
[ 28%] Building C object src/c++/CMakeFiles/fastphylo.dir/arg_utils.c.o
cc1: warning: command line option "-fno-default-inline" is valid for C++/ObjC++ but not for C
[ 29%] Building C object src/c++/CMakeFiles/fastphylo.dir/std_c_utils.c.o
cc1: warning: command line option "-fno-default-inline" is valid for C++/ObjC++ but not for C
[ 30%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/xml_output_global.cpp.o
[ 32%] Building C object src/c++/CMakeFiles/fastphylo.dir/DNA_b128/sse2_wrapper.c.o
[ 33%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/DNA_b128/computeTAMURANEIDistance_DNA_b128_String.cpp.o
[ 34%] Building CXX object src/c++/CMakeFiles/fastphylo.dir/DNA_b128/computeDistance_DNA_b128_String.cpp.o
Linking CXX static library libfastphylo.a
[ 34%] Built target fastphylo
[ 35%] Generating programs/fastdist/gengetopt/fastdist_gengetopt.c, programs/fastdist/gengetopt/fastdist_gengetopt.h
Scanning dependencies of target fastdist
[ 36%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/main.cpp.o
[ 38%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/PhylipMaInputStream.cpp.o
[ 39%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/FastaInputStream.cpp.o
[ 40%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/DataOutputStream.cpp.o
[ 41%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/XmlOutputStream.cpp.o
[ 42%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/PhylipDmOutputStream.cpp.o
[ 44%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/BinaryDmOutputStream.cpp.o
[ 45%] Building C object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/gengetopt/fastdist_gengetopt.c.o
cc1: warning: command line option "-fno-default-inline" is valid for C++/ObjC++ but not for C
[ 46%] Building CXX object src/c++/CMakeFiles/fastdist.dir/programs/fastdist/XmlInputStream.cpp.o
Linking CXX executable fastdist
[ 48%] Built target fastdist
[ 50%] Generating programs/fastprot/gengetopt/fastprot_gengetopt.c, programs/fastprot/gengetopt/fastprot_gengetopt.h
Scanning dependencies of target fastprot
[ 51%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/main.cpp.o
[ 52%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/FastaInputStream.cpp.o
[ 53%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/DataOutputStream.cpp.o
[ 54%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/XmlOutputStream.cpp.o
[ 55%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/PhylipMaInputStream.cpp.o
[ 57%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/ProtDistCalc.cpp.o
[ 58%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/ModelMatrix.cpp.o
[ 59%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/ExpectedDistance.cpp.o
[ 60%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/Matrix.cpp.o
[ 61%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/MaximumLikelihood.cpp.o
[ 63%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/ProtSeqUtils.cpp.o
[ 64%] Building C object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/gengetopt/fastprot_gengetopt.c.o
cc1: warning: command line option "-fno-default-inline" is valid for C++/ObjC++ but not for C
[ 65%] Building CXX object src/c++/CMakeFiles/fastprot.dir/programs/fastprot/XmlInputStream.cpp.o
Linking CXX executable fastprot
[ 67%] Built target fastprot
[ 69%] Generating programs/fastprot_mpi/gengetopt/fastprot_mpi_gengetopt.c, programs/fastprot_mpi/gengetopt/fastprot_mpi_gengetopt.h
Scanning dependencies of target fastprot_mpi
[ 70%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/main.cpp.o
[ 71%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/FastaInputStream.cpp.o
[ 72%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/DataOutputStream.cpp.o
[ 73%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/XmlOutputStream.cpp.o
[ 75%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/PhylipMaInputStream.cpp.o
[ 76%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/ProtDistCalc.cpp.o
[ 77%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/ModelMatrix.cpp.o
[ 78%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/ExpectedDistance.cpp.o
[ 79%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/Matrix.cpp.o
[ 80%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/MaximumLikelihood.cpp.o
[ 82%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/ProtSeqUtils.cpp.o
[ 83%] Building C object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/gengetopt/fastprot_mpi_gengetopt.c.o
cc1: warning: command line option "-fno-default-inline" is valid for C++/ObjC++ but not for C
[ 84%] Building CXX object src/c++/CMakeFiles/fastprot_mpi.dir/programs/fastprot_mpi/XmlInputStream.cpp.o
Linking CXX executable fastprot_mpi
[ 86%] Built target fastprot_mpi
[ 88%] Generating programs/fnj/gengetopt/fnj_gengetopt.c, programs/fnj/gengetopt/fnj_gengetopt.h
Scanning dependencies of target fnj
[ 89%] Building CXX object src/c++/CMakeFiles/fnj.dir/programs/fnj/main.cpp.o
[ 90%] Building CXX object src/c++/CMakeFiles/fnj.dir/programs/fnj/DataInputStream.cpp.o
[ 91%] Building CXX object src/c++/CMakeFiles/fnj.dir/programs/fnj/DataOutputStream.cpp.o
[ 92%] Building CXX object src/c++/CMakeFiles/fnj.dir/programs/fnj/XmlOutputStream.cpp.o
[ 94%] Building CXX object src/c++/CMakeFiles/fnj.dir/programs/fnj/PhylipDmInputStream.cpp.o
[ 95%] Building CXX object src/c++/CMakeFiles/fnj.dir/programs/fnj/BinaryInputStream.cpp.o
[ 96%] Building C object src/c++/CMakeFiles/fnj.dir/programs/fnj/gengetopt/fnj_gengetopt.c.o
cc1: warning: command line option "-fno-default-inline" is valid for C++/ObjC++ but not for C
[ 97%] Building CXX object src/c++/CMakeFiles/fnj.dir/programs/fnj/XmlInputStream.cpp.o
Linking CXX executable fnj
[100%] Built target fnj
[ 34%] Built target fastphylo
[ 48%] Built target fastdist
[ 67%] Built target fastprot
[ 86%] Built target fastprot_mpi
[100%] Built target fnj
Install the project...
-- Install configuration: ""
-- Installing: /tmp/bin/fastdist
-- Removed runtime path from "/tmp/bin/fastdist"
-- Installing: /tmp/bin/fnj
-- Removed runtime path from "/tmp/bin/fnj"
-- Installing: /tmp/bin/fastprot
-- Removed runtime path from "/tmp/bin/fastprot"
-- Installing: /tmp/bin/fastprot_mpi
-- Removed runtime path from "/tmp/bin/fastprot_mpi"
If you want to build the html documentation ( i.e. this page ) you need to pass the -DBUILD_DOCBOOK=ON option to cmake.
This section is mainly intended for package maintainers.
On a CentOS or Fedora machine, first log in as root and install the dependencies
# yum install xmlto libxml2-devel cmake gcc-c++ binutils gengetopt
Check that cmake is version 2.6 or later
$ cmake --version
cmake version 2.6-patch 0
If it is older, you can download a cmake binary directly from www.cmake.org. Now build the rpm package.
$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DCMAKE_INSTALL_PREFIX=/ -DBUILD_DOCBOOK=ON /tmp/source && make package
On a Debian or Ubuntu machine, first log in as root and install the dependencies
# apt-get install libxml2-dev cmake g++ binutils gengetopt
Check that cmake is version 2.6 or later
$ cmake --version
cmake version 2.6-patch 0
If it is older you could download a cmake binary directly from www.cmake.org. Now build the deb package.
$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DCMAKE_INSTALL_PREFIX=/ -DBUILD_DOCBOOK=ON /tmp/source && make package
To build the fastphylo install package for Mac OS X you need to have installed all the dependencies mentioned in Section 3.2.2.1, “Building from source on Unix” on your Mac OS X computer.
Check that cmake is version 2.6 or later
$ cmake --version
cmake version 2.6-patch 0
$ mkdir /tmp/build
$ cd /tmp/build
$ cmake -DSTATIC=ON -DCPACK_GENERATOR="TGZ" /tmp/source && make package
fastdist implements the algorithm Fast Computation of Distance Estimators ( see Section 2.1, “Fast Computation of Distance Estimators” )
Type fastdist --help to see the command line options
[user@saturn ~]$ fastdist --help
fastdist 1.0.1
Usage: fastdist [OPTIONS]... [FILE]...
Computes distance matrices out of multialignments
  -h, --help  Print help and exit
  -V, --version  Print version and exit
If FILE is not specified the input is read from stdin
  -o, --outfile=filename  output filename. If not specifed, output is written to stdout
  -I, --input-format=ENUM  input format. xml means the Fastphylo sequence XML format (possible values="fasta", "phylip", "xml" default=`fasta')
  -e, --memory-efficient  memory efficient. Use less memory space and fast implementation. Only used with fasta and phylip format (default=off)
  -O, --output-format=ENUM  output format. xml means the Fastphylo distance matrix XML format (possible values="phylip", "xml", "binary" default=`xml')
  -D, --distance-function=ENUM  Distance function (possible values="JC", "K2P", "TN93", "HAMMING" default=`K2P')
  -b, --bootstraps=INT  Bootstrap num times and create matrix for each (default=`0')
  -k, --no-incl-orig  If the distance matrix from the original sequences should not be included (default=off)
  -s, --seed=INT  Random seed. If not specified the current timestamp will be used
  -A, --no-ambiguities  Ignore ambiguities (default=off)
  -R, --no-ambig-resolve  Specifies that ambigious symbols should not be resolved by nearest neighbor (default=off)
  -t, --no-transprob  Specifies that the transition probabilities should not be used in the ambiguity model (default=off)
  -a, --ambiguity-frequency-model=ENUM  Ambiguity frequency model (possible values="UNI", "BASE" default=`UNI')
  -T, --tstvratio=FLOAT  Transition/transvertion ratio for purine transitions ( for the TN model ) (default=`2.0')
  -P, --pyrtvratio=FLOAT  Transition/transvertion ratio for pyrimidines transitions ( for the TN model ) (default=`2.0')
  -N, --no-tstvratio  If given fixed ts/tv ratios will not be used (default=off)
  -F, --fixfactor=FLOAT  Float specifying what factor to use for saturated data. If not given -1 in the entry. (default=`1')
  -r, --number-of-runs=INT  nr of runs ( datasets ) in input. This option is only used if the input format is phylip_multialignment. (default=`1')
  -p, --print-relaxng-input  print the Relax NG schema for the XML input format ( Fastphylo sequence XML format ) and then exit (default=off)
  -w, --print-relaxng-output  print the Relax NG schema for the XML output format ( Fastphylo distance matrix XML format ) and then exit. (default=off)
Example usage of this program can be found at its home page http://fastphylo.sourceforge.net/
Table 1. fastdist input file formats
file format | short option | description |
---|---|---|
fasta format | -I fasta | Section 3.4.3, “Fasta format” |
phylip format | -I phylip | Section 3.4.2, “phylip format” |
fastphylo sequence XML format | -I xml | Section 3.4.1, “Fastphylo sequence XML format” |
Table 2. fastdist output file formats
file format | short option | description |
---|---|---|
fastphylo distance matrix XML format | -O xml | Section 3.4.4, “Fastphylo distance matrix XML format” |
Binary distance matrix format | -O binary | Section 3.4.6, “Binary distance matrix format” |
phylip distance matrix format | -O phylip | Section 3.4.5, “Phylip distance matrix format” |
Example 1. fastdist with input file in phylip format
We use the DNA file described in Example 13, “Example files in phylip format” as input file.
The file has two datasets so we pass the option -r 2 to fastdist. By default the output is given in XML format
[user@saturn ~]$ fastdist -I phylip seq.phylip
<?xml version="1.0"?>
<root>
  <runs>
    <run id="" dim="3">
      <identities>
        <identity name="Alpha"/>
        <identity name="Beta"/>
        <identity name="Gamma"/>
      </identities>
      <dms>
        <dm>
          <row> <entry>0.000000</entry> </row>
          <row> <entry>0.299650</entry> <entry>0.000000</entry> </row>
          <row> <entry>0.733169</entry> <entry>0.309520</entry> <entry>0.000000</entry> </row>
        </dm>
      </dms>
    </run>
  </runs>
</root>
Example 2. fastdist with input file in fasta format
We use the file described in Example 14, “seq.fasta, an example file in fasta format” as input file. By default the output is given in XML format
[user@saturn ~]$ fastdist -I fasta seq.fasta
<?xml version="1.0"?>
<root>
  <runs>
    <run id="" dim="3">
      <identities>
        <identity name="Alpha"/>
        <identity name="Beta"/>
        <identity name="Gamma"/>
      </identities>
      <dms>
        <dm>
          <row> <entry>0.000000</entry> </row>
          <row> <entry>0.299650</entry> <entry>0.000000</entry> </row>
          <row> <entry>0.733169</entry> <entry>0.309520</entry> <entry>0.000000</entry> </row>
        </dm>
      </dms>
    </run>
  </runs>
</root>
Example 3. fastdist with input file in XML format
We use the file described in Example 12, “Example files in Fastphylo sequence XML format” containing DNA sequences as input file.
Note: The -r option can only be used if the input is in phylip format. fastdist will for XML files compute all data sets ( runs ). Fasta files can only contain one data set so the -r option does not make any sense there.
[user@saturn ~]$ fastdist -I xml -O xml seq.xml
<?xml version="1.0"?>
<root>
  <runs>
    <run id="run1" dim="3">
      <identities>
        <identity name="Alpha"/>
        <identity name="Beta">
          <extrainfo myattr="" species="penguin">
            <foo bar="1"/>
          </extrainfo>
        </identity>
        <identity name="Gamma"/>
      </identities>
      <dms>
        <dm>
          <row> <entry>0.000000</entry> </row>
          <row> <entry>0.299650</entry> <entry>0.000000</entry> </row>
          <row> <entry>0.733169</entry> <entry>0.309520</entry> <entry>0.000000</entry> </row>
        </dm>
      </dms>
    </run>
    <run id="run2" dim="3">
      <identities>
        <identity name="Alpha"/>
        <identity name="Beta"/>
        <identity name="Gamma"/>
      </identities>
      <dms>
        <dm>
          <row> <entry>0.000000</entry> </row>
          <row> <entry>0.299650</entry> <entry>0.000000</entry> </row>
          <row> <entry>0.733169</entry> <entry>0.309520</entry> <entry>0.000000</entry> </row>
        </dm>
      </dms>
    </run>
  </runs>
</root>
Example 4. fastdist with an XML stream on stdin
If you leave out the input filename, the input will be read from stdin. fastdist doesn't wait for the whole xml file to be read before it starts. It starts a computation as soon as an ending </run> has been read. The memory consumption will not grow over time so the input can be arbitrarily large. A never ending input stream only works in the fastphylo sequence XML format, because the phylip input format needs you to specify in advance how many data sets are to be sent to fastdist ( the -r option ).
[user@saturn ~]$ cat seq.xml | fastdist -I xml -O xml
<?xml version="1.0"?>
<root>
  <runs>
    <run id="run1" dim="3">
      <identities>
        <identity name="Alpha"/>
        <identity name="Beta">
          <extrainfo myattr="" species="penguin">
            <foo bar="1"/>
          </extrainfo>
        </identity>
        <identity name="Gamma"/>
      </identities>
      <dms>
        <dm>
          <row> <entry>0.000000</entry> </row>
          <row> <entry>0.299650</entry> <entry>0.000000</entry> </row>
          <row> <entry>0.733169</entry> <entry>0.309520</entry> <entry>0.000000</entry> </row>
        </dm>
      </dms>
    </run>
    <run id="run2" dim="3">
      <identities>
        <identity name="Alpha"/>
        <identity name="Beta"/>
        <identity name="Gamma"/>
      </identities>
      <dms>
        <dm>
          <row> <entry>0.000000</entry> </row>
          <row> <entry>0.299650</entry> <entry>0.000000</entry> </row>
          <row> <entry>0.733169</entry> <entry>0.309520</entry> <entry>0.000000</entry> </row>
        </dm>
      </dms>
    </run>
  </runs>
</root>
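The same stdin/stdout pipe can also be driven from a script. The following Python sketch builds a fastdist command line from the -I, -O and -D options listed in the help text above and sends data to the program on stdin. The helper names are hypothetical, and for simplicity the sketch passes the whole input in one piece rather than as an open-ended stream; it assumes that the fastdist executable is on the PATH.

```python
import subprocess

# Allowed values copied from the output of `fastdist --help`.
INPUT_FORMATS = ("fasta", "phylip", "xml")
OUTPUT_FORMATS = ("phylip", "xml", "binary")
DISTANCE_FUNCTIONS = ("JC", "K2P", "TN93", "HAMMING")


def fastdist_args(input_format="fasta", output_format="xml",
                  distance_function="K2P"):
    """Build the argument list for a fastdist call that reads from stdin."""
    if input_format not in INPUT_FORMATS:
        raise ValueError("unknown input format: %r" % (input_format,))
    if output_format not in OUTPUT_FORMATS:
        raise ValueError("unknown output format: %r" % (output_format,))
    if distance_function not in DISTANCE_FUNCTIONS:
        raise ValueError("unknown distance function: %r" % (distance_function,))
    return ["fastdist", "-I", input_format, "-O", output_format,
            "-D", distance_function]


def run_fastdist(data, **options):
    """Send `data` (bytes) to fastdist on stdin and return its stdout (bytes).

    No FILE argument is given, so fastdist reads its input from stdin.
    """
    result = subprocess.run(fastdist_args(**options), input=data,
                            stdout=subprocess.PIPE, check=True)
    return result.stdout
```

For example, `run_fastdist(open("seq.xml", "rb").read(), input_format="xml")` corresponds to the shell pipeline shown in Example 4, with the distance function spelled out explicitly.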
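Example 5. Reading the fastdist XML output stream with Python
If the XML output is very large you might want to use an XML parser that doesn't hold the whole file in memory. This Python script is an example of how to do this
#!/usr/bin/python
import sys
from copy import deepcopy
from lxml import etree

# lxml needs a byte stream: sys.stdin.buffer on Python 3, sys.stdin on Python 2
stdin_bytes = getattr(sys.stdin, "buffer", sys.stdin)

# Each <dm> element is delivered as soon as its closing tag has been read
for action, element in etree.iterparse(stdin_bytes, tag="dm"):
    dm_copy = deepcopy(element)
    # Count the entries of this distance matrix with a value below 0.1
    print(dm_copy.xpath('count(row/entry[ number(.) < 0.1 ])'))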
For each distance matrix the script counts the number of elements with a value below 0.1
[user@saturn ~]$ cat seq.xml | fastdist -I xml -O xml | python fastdist_lxml.py
3.0
3.0
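The XML output stores each distance matrix as a lower triangle: row i contains the entries for columns 0 to i. When the output is small enough to hold in memory, it can be expanded into a full symmetric matrix with the Python standard library alone. The sketch below assumes only the element names visible in the example output above (run, identity, dm, row, entry); the function name and the embedded sample are for illustration.

```python
import xml.etree.ElementTree as ET

# One run, copied from the fastdist example output in this section.
SAMPLE_XML = """<?xml version="1.0"?>
<root><runs><run id="" dim="3">
<identities>
<identity name="Alpha"/> <identity name="Beta"/> <identity name="Gamma"/>
</identities>
<dms><dm>
<row> <entry>0.000000</entry> </row>
<row> <entry>0.299650</entry> <entry>0.000000</entry> </row>
<row> <entry>0.733169</entry> <entry>0.309520</entry> <entry>0.000000</entry> </row>
</dm></dms>
</run></runs></root>
"""


def read_distance_matrices(xml_text):
    """Return a list of (names, matrix) tuples, one per <dm> element.

    `matrix` is a full symmetric list of lists built from the
    lower-triangular rows in the XML.
    """
    root = ET.fromstring(xml_text)
    results = []
    for run in root.iter("run"):
        names = [identity.get("name") for identity in run.iter("identity")]
        n = len(names)
        for dm in run.iter("dm"):
            matrix = [[0.0] * n for _ in range(n)]
            for i, row in enumerate(dm.findall("row")):
                for j, entry in enumerate(row.findall("entry")):
                    value = float(entry.text)
                    matrix[i][j] = value
                    matrix[j][i] = value  # mirror into the upper triangle
            results.append((names, matrix))
    return results


names, matrix = read_distance_matrices(SAMPLE_XML)[0]
print(names)         # ['Alpha', 'Beta', 'Gamma']
print(matrix[0][2])  # 0.733169 (Alpha to Gamma, mirrored from row 2)
```

A run that contains several dm elements (for example when bootstrapping with -b) yields one tuple per matrix, all sharing the same list of names.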
fastprot estimates the evolutionary distance between aligned protein sequences. It implements two methods for calculating the distance between protein sequences: the maximum likelihood estimate of the distance and the expected distance (see the paper by Agarwal and States for details).
Type fastprot --help to see the command line options
[user@saturn ~]$ fastprot --help
fastprot 1.0.1
Usage: fastprot [OPTIONS]... [FILE]...
Computes distance matrices out of multialignments of protein sequences
  -h, --help  Print help and exit
      --detailed-help  Print help, including all details and hidden options, and exit
  -V, --version  Print version and exit
If FILE is not specified the input is read from stdin
  -o, --outfile=filename  output filename. If not specified, output is written to stdout
  -I, --input-format=ENUM  input format. xml means the Fastphylo sequence XML format (possible values="fasta", "phylip", "xml" default=`fasta')
  -e, --memory-efficient  memory efficient. Use less memory space and fast implementation. Only used with fasta and phylip format (default=off)
  -O, --output-format=ENUM  output format. xml means the Fastphylo distance matrix XML format (possible values="phylip", "xml", "binary" default=`xml')
  -b, --bootstraps=INT  Bootstrap num times and create matrix for each (default=`0')
  -k, --no-incl-orig  If the distance matrix from the original sequences should NOT be included - for bootstrapping (default=off)
  -R, --seed=INT  Random seed. If not specified the current timestamp will be used
  -D, --distance-function=ENUM  Distance function (possible values="ID", "JC", "JCK", "JCSS", "WAG", "JTT", "DAY", "ARVE", "MVR", "LG" default=`WAG')
  -F, --model-file=filename  Read matrix and equilibrium distribution from file, when used --distance-function is disregarded
  -i, --remove-indels  Remove gap columns. A gap is denoted by '-'. (default=off)
  -m, --maximum-likelihood  Compute a Maximum Likelihood estimate instead. Can not be used with --distance-function=ID, JC, JCK or JCSS or --sd (default=off)
  -S, --sd  Not yet implemented! Output a matrix with standard deviations after the distance matrix. Can not be used with --distance-function=ID, JC, JCK or JCSS or --maximum-likelihood (default=off)
  -p, --pfam  use a normal distribution as distance prior, estimated from Pfam 7.2 (default=off)
  -s, --speed=INT  'Speed'. High speed results in low precision, only affects ED calculations. Default is 5. Valid range is [1,10]. (possible values="1", "2", "3", "4", "5", "6", "7", "8" default=`4')
  -P, --print-relaxng-input  print the Relax NG schema for the XML input format ( Fastphylo protein sequence XML format ) and then exit (default=off)
  -w, --print-relaxng-output  print the Relax NG schema for the XML output format ( Fastphylo distance matrix XML format ) and then exit. (default=off)
Example usage of this program can be found at its home page http://fastphylo.sourceforge.net/
Table 3. fastprot input file formats
file format | short option | description |
---|---|---|
fasta format | -I fasta | Section 3.4.3, “Fasta format” |
phylip format | -I phylip | Section 3.4.2, “phylip format” |
fastphylo sequence XML format | -I xml | Section 3.4.1, “Fastphylo sequence XML format” |
Table 4. fastprot output file formats
file format | short option | description |
---|---|---|
fastphylo distance matrix XML format | -O xml | Section 3.4.4, “Fastphylo distance matrix XML format” |
Binary distance matrix format | -O binary | Section 3.4.6, “Binary distance matrix format” |
phylip distance matrix format | -O phylip | Section 3.4.5, “Phylip distance matrix format” |
Example 6. fastprot with input file in phylip format
We use the protein sequence file described in Example 13, “Example files in phylip format” as input file.
[user@saturn ~]$ fastprot -I phylip protein_seq.phylip -O xml
<?xml version="1.0"?>
<root>
  <runs>
    <run id="" dim="4">
      <identities>
        <identity name="Cow"/>
        <identity name="Carp"/>
        <identity name="Chicken"/>
        <identity name="Human"/>
      </identities>
      <dms>
        <dm>
          <row> <entry>0.000000</entry> </row>
          <row> <entry>0.402252</entry> <entry>0.000000</entry> </row>
          <row> <entry>2.622102</entry> <entry>2.334973</entry> <entry>0.000000</entry> </row>
          <row> <entry>2.919533</entry> <entry>2.733489</entry> <entry>0.903515</entry> <entry>0.000000</entry> </row>
        </dm>
      </dms>
    </run>
  </runs>
</root>
Example 7. fastprot with input file in fasta format
We use the file described in Example 15, “protein_seq.fasta, an example file in fasta format” as input file. By default the output is given in XML format
[user@saturn ~]$ fastprot -I fasta protein_seq.fasta
<?xml version="1.0"?>
<root>
  <runs>
    <run id="" dim="4">
      <identities>
        <identity name="Cow"/>
        <identity name="Carp"/>
        <identity name="Chicken"/>
        <identity name="Human"/>
      </identities>
      <dms>
        <dm>
          <row> <entry>0.000000</entry> </row>
          <row> <entry>0.402252</entry> <entry>0.000000</entry> </row>
          <row> <entry>2.622102</entry> <entry>2.334973</entry> <entry>0.000000</entry> </row>
          <row> <entry>2.919533</entry> <entry>2.733489</entry> <entry>0.903515</entry> <entry>0.000000</entry> </row>
        </dm>
      </dms>
    </run>
  </runs>
</root>
fnj implements the algorithm Fast Neighbor Joining ( see Section 2.2, “Fast Neighbor Joining” )
Type fnj --help to see the command line options
[user@saturn ~]$ fnj --help
fnj 1.0.1
Usage: fnj [OPTIONS]... [FILE]...
builds phylogenetic trees
  -h, --help  Print help and exit
  -V, --version  Print version and exit
  -o, --outfile=filename  output filename. If not specifed, output is written to stdout
  -I, --input-format=ENUM  input format. 'xml' means the 'Fastphylo distance matrix XML format' (possible values="phylip", "xml", "binary" default=`xml')
  -O, --output-format=ENUM  output format. 'xml' means the 'Fastphylo tree count XML format' (possible values="newick", "xml" default=`xml')
  -c, --print-counts  print the tree count before each the newick tree. This flag has no effect on the XML output format. (default=off)
  -a, --analyze-run-number=INT  Determines which dataset should be analyzed with 1 being the first dataset. By default all are analyzed
  -m, --method=ENUM  reconstruction method to apply (possible values="NJ", "FNJ", "BIONJ" default=`FNJ')
  -d, --dm-per-run=INT  nr of Distance matrices per run. Is only used if the input format is phylip (default=`1')
  -r, --number-of-runs=INT  nr of runs. Is only used if the input format is phylip (default=`1')
  -b, --bootstraps=INT  number of boot straps (default=`0')
  -p, --print-relaxng-input  print the Relax NG schema for the XML input format ( Fastphylo distance matrix XML format ) and then exit (default=off)
  -w, --print-relaxng-output  print the Relax NG schema for the XML output format ( Fastphylo tree count XML format ) and then exit. (default=off)
Example usage of this program can be found at its home page http://fastphylo.sourceforge.net/
Table 5. fnj input file formats
file format | short option | description |
---|---|---|
Fastphylo distance matrix XML format | -I xml | Section 3.4.4, “Fastphylo distance matrix XML format” |
Binary distance matrix format | -I binary | Section 3.4.6, “Binary distance matrix format” |
phylip distance matrix format | -I phylip | Section 3.4.5, “Phylip distance matrix format” |
Table 6. fnj output file formats
file format | short option | description |
---|---|---|
Fastphylo tree count XML format | -O xml | Section 3.4.7, “Fastphylo tree count XML format” |
Example 8. fnj with input file in Phylip distance matrix format
We use the file described in Example 17, “dm.phylip, an example file in phylip distance matrix format” as input file. The file has two datasets so we pass the option -r 2
to fnj. By default the output is given in the "Fastphylo tree count XML format" ( -O xml ).
[user@saturn ~]$ fnj -r 2 -I phylip dm.phylip <?xml version="1.0"?> <root> <runs> <run id="" dim="3"> <identities> <identity name="Alpha"/> <identity name="Beta"/> <identity name="Gamma"/> </identities> <tree> <count>2</count> <newick-xml><branch><leaf>Gamma</leaf><leaf>Beta</leaf><leaf>Alpha</leaf></branch></newick-xml> <newick>(Gamma,Beta,Alpha);</newick> </tree> </run> <run id="" dim="3"> <identities> <identity name="Alpha"/> <identity name="Beta"/> <identity name="Gamma"/> </identities> </run> </runs> </root>
Example 9. fnj with input file in XML format
We use the file described in Example 16, “dm.xml, an example file in Fastphylo distance matrix XML format” as input file. By default the output is given in the "Fastphylo tree count XML format" ( -O xml ).
Note | |
---|---|
The -r option is not available and also not needed when the input is in XML format. fnj computes all data sets ( runs ). |
[user@saturn ~]$ fnj -I xml dm.xml <?xml version="1.0"?> <root> <runs> <run id="a" dim="3"> <identities> <identity name="Alpha"/> <identity name="Beta"/> <identity name="Gamma"/> </identities> <tree> <count>1</count> <newick-xml><branch><leaf>Gamma</leaf><leaf>Beta</leaf><leaf>Alpha</leaf></branch></newick-xml> <newick>(Gamma,Beta,Alpha);</newick> </tree> </run> <run id="b" dim="3"> <identities> <identity name="Alpha"/> <identity name="Beta"/> <identity name="Gamma"/> </identities> <tree> <count>1</count> <newick-xml><branch><leaf>Gamma</leaf><leaf>Beta</leaf><leaf>Alpha</leaf></branch></newick-xml> <newick>(Gamma,Beta,Alpha);</newick> </tree> </run> </runs> </root>
Example 10. connecting fastdist to fnj with a pipe
We use the DNA file described in Example 13, “Example files in phylip format” as input file. The file has two data sets. We will bootstrap 3 times. First we send the data in phylip format through the pipe:
[user@saturn ~]$ cat seq.phylip | fastdist -I phylip -O phylip -b 3 -r 2 | fnj -I phylip -O xml -r 2 -d 4 <?xml version="1.0"?> <root> <runs> <run id="" dim="3"> <identities> <identity name="Alpha"/> <identity name="Beta"/> <identity name="Gamma"/> </identities> <tree> <count>8</count> <newick-xml><branch><leaf>Gamma</leaf><leaf>Beta</leaf><leaf>Alpha</leaf></branch></newick-xml> <newick>(Gamma,Beta,Alpha);</newick> </tree> </run> <run id="" dim="3"> <identities> <identity name="Alpha"/> <identity name="Beta"/> <identity name="Gamma"/> </identities> </run> </runs> </root>
We could also send the data in XML format through the pipe:
[user@saturn ~]$ cat seq.phylip | fastdist -I phylip -O xml -b 3 -r 2 | fnj -I xml -O xml -m FNJ <?xml version="1.0"?> <root> <runs> <run id="" dim="3"> <identities> <identity name="Alpha"/> <identity name="Beta"/> <identity name="Gamma"/> </identities> <tree> <count>4</count> <newick-xml><branch><leaf>Gamma</leaf><leaf>Beta</leaf><leaf>Alpha</leaf></branch></newick-xml> <newick>(Gamma,Beta,Alpha);</newick> </tree> </run> <run id="" dim="3"> <identities> <identity name="Alpha"/> <identity name="Beta"/> <identity name="Gamma"/> </identities> <tree> <count>4</count> <newick-xml><branch><leaf>Gamma</leaf><leaf>Beta</leaf><leaf>Alpha</leaf></branch></newick-xml> <newick>(Gamma,Beta,Alpha);</newick> </tree> </run> </runs> </root>
As the XML format is more descriptive, the flags -d and -r are no longer needed by fnj.
Example 11. reading the fnj XML output stream with python
If the XML output is very large, you might want to use an XML parser that doesn't hold the whole file in memory. This Python script is an example of how to do this:
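#!/usr/bin/env python3
import sys
from lxml import etree

maxcount = 0
# Stream the <run> elements one at a time instead of loading the whole document.
for action, element in etree.iterparse(sys.stdin.buffer, tag="run"):
    counts = element.xpath("tree/count")
    if counts:  # a run may come without a tree element
        maxcount = max(maxcount, int(counts[0].text))
    element.clear()  # release the memory held by the processed run
print(maxcount)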
The script prints the maximum count ( just as an example ).
[user@saturn ~]$ fnj -I xml dm.xml | python fnj_lxml.py 1
This software package handles the following file formats
The Fastphylo sequence XML format is chosen by the option -I xml
to fastdist, fastprot or fastprot_mpi.
For instance, type fastdist --print-relaxng-input
to see its Relax NG schema:
[user@saturn ~]$ fastdist --print-relaxng-input <?xml version="1.0" encoding="UTF-8"?> <grammar xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"> <start> <element name="root"> <element name="runs"> <zeroOrMore> <element name="run"> <attribute name="id"> <text/> </attribute> <oneOrMore> <element name="seq"> <attribute name="seq"> <data type="string"> <param name="pattern">[acgtumrwsykvhdbnxACGTUMRWSYKVHDBNX -.?]+</param> </data> </attribute> <attribute name="name"> <text/> </attribute> <optional> <element name="extrainfo"> <ref name="anyContent"/> </element> </optional> </element> </oneOrMore> </element> </zeroOrMore> </element> </element> </start> <define name="anyContent"> <mixed> <zeroOrMore> <choice> <attribute> <anyName/> </attribute> <ref name="anyElement"/> </choice> </zeroOrMore> </mixed> </define> <define name="anyElement"> <element> <anyName/> <ref name="anyContent"/> </element> </define> </grammar>
The Relax NG schema specifies that the extrainfo element is optional and can be inserted as a child to a seq element. The extrainfo element may contain any content and will be passed on to the output XML format.
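As a rough illustration of the structure this schema describes (a root element containing runs, each run holding seq elements with name and seq attributes), the following Python sketch builds such a document with the standard library. The function name build_sequence_xml and the shape of its input are assumptions made for this example and are not part of fastphylo; the optional extrainfo element is left out.

```python
import xml.etree.ElementTree as ET


def build_sequence_xml(runs):
    """Build a document shaped like the Fastphylo sequence XML format.

    `runs` maps a run id to a list of (name, sequence) pairs
    (a hypothetical input layout chosen for this sketch).
    """
    root = ET.Element("root")
    runs_el = ET.SubElement(root, "runs")
    for run_id, seqs in runs.items():
        run_el = ET.SubElement(runs_el, "run", id=run_id)
        for name, seq in seqs:
            # Name and sequence are both stored as attributes of <seq>.
            ET.SubElement(run_el, "seq", name=name, seq=seq)
    return ET.tostring(root, encoding="unicode")
```

The returned string could be written to a file and passed to fastdist with -I xml; validating it against the schema printed by fastdist --print-relaxng-input would be a sensible extra check.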
Example 12. Example files in Fastphylo sequence XML format
The example file seq.xml contains DNA sequences:
<?xml version="1.0"?> <root> <runs> <run id="run1"> <seq name="Alpha" seq="AACGTGGCCACAT"/> <seq name="Beta" seq="AAGGTCGCCACAC"> <extrainfo myattr="" species="penguin"> <foo bar="1"/> </extrainfo> </seq> <seq name="Gamma" seq="CAGTTCGCCACAA"/> </run> <run id="run2"> <seq name="Alpha" seq="AACGTGGCCACAT"/> <seq name="Beta" seq="AAGGTCGCCACAC"/> <seq name="Gamma" seq="CAGTTCGCCACAA"/> </run> </runs> </root>
protein_seq.xml contains protein sequences:
<?xml version="1.0"?> <root> <runs> <run id="run1"> <seq name="Cow" seq="MAYPMQLGFQDA"/> <seq name="Carp" seq="MAHPTQLGFKDA"/> <seq name="Chicken" seq="MALLTLMLMEKL"/> <seq name="Human" seq="MAHLFLTLTTKL"/> </run> </runs> </root>
The phylip input format is chosen by the option -I phylip
to fastdist.
Example 13. Example files in phylip format
The DNA example file seq.phylip contains two datasets:
3 13
Alpha     AAC GTGG
Beta      AAG GTCG
Gamma     CAG TTCG
CCAC AT
CCAC AC
CCAC AA
3 13
Alpha     CCACGGG
Beta      AAGGTCG
Gamma     CAGTTCG
CGACAT
CCACAC
CCGCAA
The example file protein_seq.phylip contains protein sequences:
4 12
Cow       MAYPMQLGFQDA
Carp      MAHPTQLGFKDA
Chicken   MALLTLMLMEKL
Human     MAHLFLTLTTKL
The Fasta input format is chosen by the option -I fasta
to fastdist.
Fasta files can only contain one data set. Read more about the Fasta format on Wikipedia.
The parser will take the whole header line as the sequence identifier name, i.e. all characters after the greater-than character ( ">" ).
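The header rule above can be sketched in Python as follows. This is an illustrative reader, not fastphylo's own parser: the function name read_fasta is hypothetical, and joining sequence text that wraps over several lines is an assumption based on the common Fasta convention.

```python
def read_fasta(text):
    """Return a list of (name, sequence) pairs from Fasta-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Everything after ">" is used as the sequence identifier.
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

Because the whole header line becomes the name, a header such as ">Alpha some description" would yield the identifier "Alpha some description" in this sketch.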
Example 14. seq.fasta, an example file in fasta format
The example file seq.fasta contains DNA:
>Alpha
AAC-GTGGCCAC-AT
>Beta
AAG-GTCGCCAC-AC
>Gamma
CAG-TTCGCCAC-AA
The Fastphylo distance matrix XML format is chosen by the option -O xml
to fastdist, fastprot or fastprot_mpi, or by the option -I xml
to fnj.
For instance, type fastdist --print-relaxng-output
to see its Relax NG schema:
[user@saturn ~]$ fastdist --print-relaxng-output <?xml version="1.0"?> <grammar xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"> <start> <element name="root"> <element name="runs"> <zeroOrMore> <element name="run"> <attribute name="dim"> <data type="integer"/> </attribute> <attribute name="id"> <text/> </attribute> <element name="identities"> <oneOrMore> <element name="identity"> <attribute name="name"> <text/> </attribute> <optional> <element name="extrainfo"> <ref name="anyContent"/> </element> </optional> </element> </oneOrMore> </element> <element name="dms"> <oneOrMore> <element name="dm"> <oneOrMore> <element name="row"> <oneOrMore> <element name="entry"> <data type="float"/> </element> </oneOrMore> </element> </oneOrMore> </element> </oneOrMore> </element> </element> </zeroOrMore> </element> </element> </start> <define name="anyContent"> <mixed> <zeroOrMore> <choice> <attribute> <anyName/> </attribute> <ref name="anyElement"/> </choice> </zeroOrMore> </mixed> </define> <define name="anyElement"> <element> <anyName/> <ref name="anyContent"/> </element> </define> </grammar>
The Relax NG schema specifies that the extrainfo element is optional and can be inserted as a child of an identity element. The extrainfo element may contain any content.
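To show how the row and entry elements can be read back, here is a minimal Python sketch that expands each dm element into a full symmetric matrix. It assumes that row i lists the distances to taxa 0..i (a lower triangle including the diagonal), which is inferred from the dm.xml file in Example 16 and not stated by the schema itself; the function name read_distance_matrices is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_distance_matrices(xml_text):
    """Yield (run id, taxon names, full matrix) for every <dm> element."""
    root = ET.fromstring(xml_text)
    for run in root.iter("run"):
        names = [i.get("name") for i in run.findall("identities/identity")]
        n = len(names)
        for dm in run.findall("dms/dm"):
            full = [[0.0] * n for _ in range(n)]
            for i, row in enumerate(dm.findall("row")):
                for j, entry in enumerate(row.findall("entry")):
                    # Assumed layout: row i holds columns 0..i, so mirror
                    # each value across the diagonal.
                    full[i][j] = full[j][i] = float(entry.text)
            yield run.get("id"), names, full
```

Applied to run "a" of dm.xml, this sketch should reproduce the first matrix of dm.phylip in Example 17.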
Example 16. dm.xml, an example file in Fastphylo distance matrix XML format
The example file dm.xml contains
<?xml version="1.0"?> <root> <runs> <run id="a" dim="3"> <identities> <identity name="Alpha"/> <identity name="Beta"/> <identity name="Gamma"/> </identities> <dms> <dm> <row> <entry>0.000000</entry> </row> <row> <entry>0.299650</entry> <entry>0.000000</entry> </row> <row> <entry>0.733169</entry> <entry>0.309520</entry> <entry>0.000000</entry> </row> </dm> </dms> </run> <run id="b" dim="3"> <identities> <identity name="Alpha"/> <identity name="Beta"/> <identity name="Gamma"/> </identities> <dms> <dm> <row> <entry>0.000000</entry> </row> <row> <entry>3.258005</entry> <entry>0.000000</entry> </row> <row> <entry>1.873653</entry> <entry>0.459840</entry> <entry>0.000000</entry> </row> </dm> </dms> </run> </runs> </root>
The Phylip distance matrix format is chosen by the option -O phylip
to fastdist or the option -I phylip
to fnj.
Example 17. dm.phylip, an example file in phylip distance matrix format
The example file dm.phylip contains
3
Alpha      0.000000 0.299650 0.733169
Beta       0.299650 0.000000 0.309520
Gamma      0.733169 0.309520 0.000000
3
Alpha      0.000000 3.258005 1.873653
Beta       3.258005 0.000000 0.459840
Gamma      1.873653 0.459840 0.000000
It contains two data sets.
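A file like dm.phylip can be read back with a few lines of Python. The sketch below is only an illustration: the function name read_phylip_dms is hypothetical, and it assumes full square matrices and taxon names without embedded whitespace, as in the example file.

```python
def read_phylip_dms(text):
    """Return a list of (names, matrix) pairs, one per data set."""
    tokens = text.split()
    pos = 0
    datasets = []
    while pos < len(tokens):
        n = int(tokens[pos])  # number of taxa in this data set
        pos += 1
        names, matrix = [], []
        for _ in range(n):
            # Each row is a taxon name followed by n distances.
            names.append(tokens[pos])
            matrix.append([float(t) for t in tokens[pos + 1:pos + 1 + n]])
            pos += 1 + n
        datasets.append((names, matrix))
    return datasets
```

The number of data sets found this way corresponds to the value that has to be passed to fnj with the -r option, since the phylip format itself does not announce it.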
The Binary distance matrix format is chosen by the option -O binary
to fastdist, fastprot and fastprot_mpi or the option -I binary
to fnj.
When the binary format option is used, fastphylo computes the upper triangular distance matrix row by row and
stores it in a binary format instead of plain text. The main advantage of the binary format is that it reduces
disk space usage and speeds up fastphylo, since only half of the matrix is computed instead of the whole distance matrix.
In the binary output file, fastphylo's current version is stored first, followed by the number of sequences, then the accessions and
finally the rows of the upper triangular distance matrix. A colon is used as the delimiter that separates the components.
The Fastphylo tree count XML format is chosen by the option -O xml
to fnj.
You can see an example of the format in Example 9, “fnj with input file in XML format”.
Type fnj --print-relaxng-output
to see the format's Relax NG schema.
[user@saturn ~]$ fnj --print-relaxng-output <?xml version="1.0"?> <grammar xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"> <start> <element name="root"> <element name="runs"> <zeroOrMore> <element name="run"> <attribute name="id"> <text/> </attribute> <attribute name="dim"> <data type="integer"/> </attribute> <element name="identities"> <oneOrMore> <element name="identity"> <attribute name="name"> <text/> </attribute> <optional> <element name="extrainfo"> <ref name="anyContent"/> </element> </optional> </element> </oneOrMore> </element> <element name="tree"> <element name="count"> <data type="integer"/> </element> <element name="newick-xml"> <ref name="branch"/> </element> <element name="newick"> <text/> </element> </element> </element> </zeroOrMore> </element> </element> </start> <define name="anyContent"> <mixed> <zeroOrMore> <choice> <attribute> <anyName/> </attribute> <ref name="anyElement"/> </choice> </zeroOrMore> </mixed> </define> <define name="anyElement"> <element> <anyName/> <ref name="anyContent"/> </element> </define> <define name="branch"> <element name="branch"> <optional> <attribute name="length"> <data type="float"/> </attribute> </optional> <oneOrMore> <choice> <element name="leaf"> <optional> <attribute name="length"> <data type="float"/> </attribute> </optional> <text/> </element> <ref name="branch"/> </choice> </oneOrMore> </element> </define> </grammar>
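The recursive branch definition in this schema maps naturally onto a recursive function. The Python sketch below turns a newick-xml element into a Newick string; the function names are hypothetical, and the way optional length attributes are rendered (as ":length" after the node) follows the usual Newick convention and is an assumption, not a description of how fnj itself formats its newick element.

```python
import xml.etree.ElementTree as ET


def branch_to_newick(node):
    """Recursively render a <branch> or <leaf> element as Newick text."""
    if node.tag == "leaf":
        label = node.text or ""
    else:
        # A branch contains one or more leaf or branch children.
        label = "(" + ",".join(branch_to_newick(child) for child in node) + ")"
    length = node.get("length")  # optional attribute on both element types
    return label + (":" + length if length is not None else "")


def newick_from_xml(newick_xml_text):
    """Convert the text of a <newick-xml> element into a Newick string."""
    root = ET.fromstring(newick_xml_text)
    return branch_to_newick(root.find("branch")) + ";"
```

For the newick-xml element shown in Example 9, this yields the same string as the accompanying newick element, (Gamma,Beta,Alpha);.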