Quick-start
    -lib LIBNAME -read1 READFILE_F.fq(.gz) -read2 READFILE_R.fq(.gz) [options]
    -help    # Help page
    -man     # Manual page in pager
Running phyloFlash without arguments will show the basic help message.
1. Basic usage
To screen paired-end 100 bp read files named reads_F.fq.gz and reads_R.fq.gz
for SSU rRNA sequences, and have output files labeled as “run01”:
    -lib run01 -read1 reads_F.fq.gz -read2 reads_R.fq.gz
Use the recommended pipeline settings:
    -lib run01 -read1 reads_F.fq.gz -read2 reads_R.fq.gz -almosteverything
Interleaved reads:
    -lib run01 -read1 reads_FR.fq.gz -interleaved
Longer read lengths (e.g. 150 bp):
    -lib run01 -read1 reads_F.fq.gz -read2 reads_R.fq.gz -readlength 150
Limit number of processors used to 8:
    -lib run01 -read1 reads_F.fq.gz -read2 reads_R.fq.gz -CPUs 8
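When runs are launched from a script, the option combinations above can be assembled programmatically. The following is a minimal sketch in Python; the helper name build_phyloflash_args is hypothetical and not part of phyloFlash, and it only covers the options shown in the basic usage examples.

```python
from typing import List, Optional


def build_phyloflash_args(
    lib: str,
    read1: str,
    read2: Optional[str] = None,
    interleaved: bool = False,
    readlength: Optional[int] = None,
    cpus: Optional[int] = None,
    almosteverything: bool = False,
) -> List[str]:
    """Assemble the option list for one run (hypothetical helper)."""
    args = ["-lib", lib, "-read1", read1]
    if read2 is not None:
        args += ["-read2", read2]
    if interleaved:
        args.append("-interleaved")
    if readlength is not None:
        args += ["-readlength", str(readlength)]
    if cpus is not None:
        args += ["-CPUs", str(cpus)]
    if almosteverything:
        args.append("-almosteverything")
    return args


# Example: 150 bp paired-end reads, limited to 8 processors.
print(" ".join(build_phyloflash_args(
    "run01", "reads_F.fq.gz", "reads_R.fq.gz", readlength=150, cpus=8)))
```

The returned list contains only the options; the path to the phyloFlash executable on your system would need to be prepended before passing it to something like subprocess.run.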
2. Full description of program options
You can access help on the command line with the following options:
Invokes checking of working environment and dependencies without
data input. Use to test setup.
-help
Print brief help message
-man
Show manual
Show detailed list of output and temporary files and exit.
2.1. Standard input arguments
-lib LIBNAME
Library name to use as a filename prefix for the output files
for this phyloFlash run. The name must be one word comprising only letters,
numbers, and _ or - (no whitespace or other punctuation).
-read1 READFILE_F.fq(.gz)
Forward reads in FASTA or FASTQ format. May be compressed
with Gzip (.gz extension). If interleaved reads are provided, please also use
the -interleaved flag for paired-end processing.
-read2 READFILE_R.fq(.gz)
Reverse reads, for paired-end reads. If this option is
omitted, phyloFlash will run in experimental single-end mode.
-interleaved
Use this flag if the read file is in interleaved format.
-readlength N
Set expected read length (between 50 and 500). Always use this option if
your read length differs from 100. Default: 100.
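The constraints on the library name and read length described above can be checked before a run is started. This is a minimal sketch, not phyloFlash's own validation code: the function names are hypothetical, and "letters" is assumed here to mean ASCII letters only.

```python
import re

# Assumption: "letters" means ASCII letters; digits, "_" and "-" are also allowed.
LIB_NAME_RE = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_lib_name(name: str) -> bool:
    """Return True if the name is one word of letters, numbers, _ or -."""
    return LIB_NAME_RE.fullmatch(name) is not None


def check_readlength(n: int) -> int:
    """Return n if it lies in the documented range 50..500, else raise."""
    if not 50 <= n <= 500:
        raise ValueError(f"-readlength must be between 50 and 500, got {n}")
    return n


print(is_valid_lib_name("run01"))       # a valid name
print(is_valid_lib_name("run 01.test")) # whitespace and "." are rejected
print(check_readlength(150))
```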
2.2. Performance-related
-CPUs N
Number of threads to use. Defaults to all available CPU cores.
-readlimit N
Limits processing to the first N reads in each input file that
map to the reference database. Use this for transcriptomes with a lot of rRNA
reads, and use values below 1000000. Default: unlimited.
-amplimit N
Set the limit of SSU read pairs to switch from to This feature is not reliable as has been
problematic to run (use values >100000). Default: 500000.
2.3. Customizing the run
Do not use SPAdes to assemble full-length sequences from
extracted reads
-emirge
Use EMIRGE to reconstruct full-length sequences from extracted reads.
(Default: Off)
Use SortMeRNA instead of BBmap to extract SSU rRNA reads. Insert
size and %id to reference statistics will not be available. (Default: No)
-poscov
Use Nhmmer to find positional coverage of reads across Barrnap’s HMM
model of the 16S and 18S rRNA genes from a subsample of reads, as an estimate
of coverage evenness. (Default: Off)
-id N
Minimum % identity of reads to map against reference database. Must be
between 50 and 98. Set to a lower value for very divergent taxa. Default: 70.
-clusterid N
% identity threshold for reference sequence clustering step.
Must be between 50 and 100. Default: 97.
-taxlevel N
Level in the taxonomy string to use for taxonomic units (NTUs),
for the taxonomic summary and to estimate diversity. Must be an integer, and
starts with 1 for the highest taxonomic level (Domain). Default: 4.
-maxinsert N
Maximum insert size allowed for paired end read mapping. Must be
between 0 and 1200. Default: 1200.
Use if data are from single-cell MDA libraries; the option is passed to the
SPAdes assembler. (Default: Off)
-dbhome DIR
Directory containing prepared phyloFlash reference databases. If not
specified, phyloFlash will check for an environment variable
$PHYLOFLASH_DBHOME, then look in the current directory, the home directory,
and the directory where the script is located, for a suitable database
directory containing the necessary files. If there is more than one database
folder, it will pick the one with the highest SILVA version number.
-trusted FILENAME
User-supplied FASTA file of trusted contigs containing SSU
rRNA sequences. The SSU sequences will be extracted with Barrnap, and the input
read files will be screened against these extracted “trusted” SSU sequences.
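The database lookup order described under -dbhome can be illustrated with a short sketch. This is not phyloFlash's actual implementation. It assumes, purely for illustration, that database folders are named after their SILVA version number (e.g. 128 or 138.1), and it omits the check that a folder contains the necessary files. The function names are hypothetical.

```python
import os
import re
from pathlib import Path
from typing import List, Mapping, Optional, Tuple

# Assumption: a database folder is named after its version, e.g. "138.1".
VERSION_RE = re.compile(r"\d+(?:\.\d+)*")


def version_key(name: str) -> Optional[Tuple[int, ...]]:
    """Turn '138.1' into (138, 1); return None for non-version names."""
    if VERSION_RE.fullmatch(name) is None:
        return None
    return tuple(int(part) for part in name.split("."))


def search_roots(script_dir: Path,
                 env: Mapping[str, str] = os.environ) -> List[Path]:
    """Documented order: $PHYLOFLASH_DBHOME, current dir, home dir, script dir."""
    roots: List[Path] = []
    if env.get("PHYLOFLASH_DBHOME"):
        roots.append(Path(env["PHYLOFLASH_DBHOME"]))
    roots += [Path.cwd(), Path.home(), script_dir]
    return roots


def pick_database(roots: List[Path]) -> Optional[Path]:
    """Return the highest-versioned folder in the first root that has any."""
    for root in roots:
        if not root.is_dir():
            continue
        versioned = [(version_key(p.name), p)
                     for p in root.iterdir()
                     if p.is_dir() and version_key(p.name) is not None]
        if versioned:
            return max(versioned, key=lambda item: item[0])[1]
    return None
```

Comparing versions as integer tuples rather than strings avoids ranking, for example, 99 above 138.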
2.4. Localization and compatibility options
Use CRLF as the line terminator in CSV output, to be RFC4180 compliant
(Default: Off)
Use decimal comma instead of decimal point to fix locale
problems for some European systems (Default: Off)
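To make the effect of these two options concrete, the sketch below writes a small CSV table with either line terminator and either decimal separator. It is an illustration using Python's standard csv module, not phyloFlash's own output code; the function name write_csv, the two-decimal formatting, and the example columns are assumptions.

```python
import csv
import io
from typing import Iterable, Sequence


def write_csv(rows: Iterable[Sequence[object]],
              crlf: bool = False,
              decimal_comma: bool = False) -> str:
    """Render rows as CSV text with the chosen line ending and decimal mark."""

    def fmt(value: object) -> object:
        # Only floats are reformatted; two decimals is an arbitrary choice.
        if isinstance(value, float):
            text = f"{value:.2f}"
            return text.replace(".", ",") if decimal_comma else text
        return value

    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\r\n" if crlf else "\n")
    for row in rows:
        writer.writerow([fmt(v) for v in row])
    return buf.getvalue()


table = [["NTU", "fraction"], ["Bacteria", 0.5]]
print(repr(write_csv(table)))
print(repr(write_csv(table, crlf=True, decimal_comma=True)))
```

Note that a value written with a decimal comma contains the field delimiter, so the csv module quotes it. Spreadsheet software set to a locale with decimal commas may expect a different delimiter altogether.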
2.5. Configuring output
Produce an HTML-formatted version of the report file. This helps
improve readability, and individual sections of the report can be collapsed.
(Default: On, turn off with -nohtml)
-treemap
Include an interactive treemap of the NTU counts in the HTML report.
This uses the Google Visualization API, which requires an Internet connection
and agreement to its terms of service. The API is free to use but not
open-source. (Default: Off)
Write status messages printed to STDERR also to a log file (Default:
-zip
Compress output into a tar.gz archive file. Overridden by
-almosteverything and -everything. (Default: Off)
Keep temporary/intermediate files (Default: Off)
-everything
Turn on all the optional analyses and output options. Options
without defaults and any local settings must still be specified. Equivalent to
-emirge -poscov -treemap -zip -log
-almosteverything
Like -everything except without -emirge
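The relationship between the two shortcut options can be written out as a small expansion function. This is only a sketch of the documented equivalence, not how phyloFlash parses its arguments; the function name is hypothetical, and it assumes the shortcut without -emirge is the -almosteverything flag used in the basic usage examples.

```python
from typing import List, Sequence

# Flags that -everything is documented to be equivalent to.
EVERYTHING = ("-emirge", "-poscov", "-treemap", "-zip", "-log")


def expand_shortcuts(args: Sequence[str]) -> List[str]:
    """Replace the shortcut flags with the individual flags they stand for."""
    expanded: List[str] = []
    for arg in args:
        if arg == "-everything":
            expanded.extend(EVERYTHING)
        elif arg == "-almosteverything":
            # Assumption: same set as -everything, minus -emirge.
            expanded.extend(f for f in EVERYTHING if f != "-emirge")
        else:
            expanded.append(arg)
    return expanded


print(expand_shortcuts(["-lib", "run01", "-almosteverything"]))
```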
3. Testing phyloFlash
You will find test data in the test_files folder. The test data provided
contains subsampled SSU reads from SRA ERR138446, a Caenorhabditis sample
with associated bacteria. You can test if phyloFlash is working properly with
these files:
    -lib TEST -read1 test_files/test_F.fq.gz -read2 test_files/test_R.fq.gz
4. Expected performance
10 million 100 bp paired-end reads of a metagenomic library are processed in less than 5 minutes on a normal 2014 desktop PC with 4 CPU cores and 8 GB of RAM.
phyloFlash usually detects most lifeforms on Earth that have an SSU rRNA
sequence at least 70% identical to anything in the databases. More
exotic organisms might be problematic, but test data is hard to come by. If you
happen to have such a test case and are willing to share it, please drop me a
line. If you think phyloFlash is not detecting a certain organism that is
very distant from the known SSU rRNA sequences, please try lowering the minimum
sequence identity for a mapping hit, e.g. -id 63
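Section 2.3 states that -id takes a percentage between 50 and 98, so a lowered threshold is written as a whole percentage such as 63 rather than as a fraction such as 0.63. The sketch below checks a value against that documented range and flags fraction-style input. The function name is hypothetical and this is not phyloFlash's own argument checking.

```python
def check_min_identity(value: float) -> float:
    """Return value if it is a percentage in the documented range 50..98."""
    if 0 < value <= 1:
        # A fraction was probably intended as a percentage.
        raise ValueError(
            f"-id expects a percentage: use {round(value * 100)} "
            f"instead of {value}")
    if not 50 <= value <= 98:
        raise ValueError(f"-id must be between 50 and 98, got {value}")
    return value


print(check_min_identity(63))
```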