Quick-start

phyloFlash.pl -lib LIBNAME -read1 READFILE_F.fq(.gz) -read2 READFILE_R.fq(.gz) [options]
phyloFlash.pl -help # Help page
phyloFlash.pl -man # Manual page in pager

Running phyloFlash.pl without arguments will show the basic help message.

1. Basic usage

To screen paired-end 100 bp read files named reads_F.fq.gz and reads_R.fq.gz for SSU rRNA sequences, and have output files labeled as “run01”:

phyloFlash.pl -lib run01 -read1 reads_F.fq.gz -read2 reads_R.fq.gz

Use the recommended pipeline settings:

phyloFlash.pl -lib run01 -read1 reads_F.fq.gz -read2 reads_R.fq.gz -almosteverything

Interleaved reads:

phyloFlash.pl -lib run01 -read1 reads_FR.fq.gz -interleaved

Longer read lengths (e.g. 150 bp):

phyloFlash.pl -lib run01 -read1 reads_F.fq.gz -read2 reads_R.fq.gz -readlength 150

Limit number of processors used to 8:

phyloFlash.pl -lib run01 -read1 reads_F.fq.gz -read2 reads_R.fq.gz -CPUs 8
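
When several libraries need screening, the commands above can be generated in a loop. The following is a minimal sketch, not part of phyloFlash: the helper name phyloflash_cmd and the file naming scheme SAMPLE_F.fq.gz / SAMPLE_R.fq.gz are assumptions made for illustration. It only prints the commands so they can be inspected before running.

```shell
# Hypothetical helper (not part of phyloFlash): print, but do not run, the
# phyloFlash command for one sample. Assumes the read files are named
# SAMPLE_F.fq.gz and SAMPLE_R.fq.gz and that the sample name is a valid -lib name.
phyloflash_cmd() {
    sample=$1
    cpus=${2:-8}   # second argument is optional; 8 is an arbitrary example value
    printf 'phyloFlash.pl -lib %s -read1 %s_F.fq.gz -read2 %s_R.fq.gz -CPUs %s\n' \
        "$sample" "$sample" "$sample" "$cpus"
}

# Print one command per sample, limiting each run to 4 threads.
for s in run01 run02; do
    phyloflash_cmd "$s" 4
done
```

Once the printed commands look right, the output of the loop can be piped to sh to execute them one after another.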

2. Full description of program options

You can access help on the command line with the following options:

-check_env Invokes checking of working environment and dependencies without data input. Use to test setup.

-help Print brief help message

-man Show manual

-outfiles Show detailed list of output and temporary files and exit.

2.1. Standard input arguments

-lib LIBNAME Library name to use as a filename prefix for the output files for this phyloFlash run. The name must be one word comprising only letters, numbers and _ or - (no whitespace or other punctuation).
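
In scripted workflows it can help to check a library name against this rule before launching a run. The sketch below is a hypothetical helper, not something phyloFlash provides; it simply encodes the rule as stated above (letters, numbers, _ and - only).

```shell
# Hypothetical helper (not part of phyloFlash): succeed only if the name is
# non-empty and contains nothing but letters, digits, underscores and hyphens.
is_valid_libname() {
    case "$1" in
        '' | *[!A-Za-z0-9_-]*) return 1 ;;   # empty, or contains a disallowed character
        *) return 0 ;;
    esac
}

# Example: warn about a name that contains whitespace.
if ! is_valid_libname "run 01"; then
    echo "invalid library name: 'run 01'" >&2
fi
```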

-read1 FILENAME Forward reads in FASTA or FASTQ format. May be compressed with Gzip (.gz extension). If the file contains interleaved reads, also pass the -interleaved flag to enable paired-end processing.

-read2 FILENAME Reverse reads, for paired-end reads. If this option is omitted, phyloFlash will run in experimental single-end mode.

-interleaved Use this flag if read file is in interleaved format

-readlength N Set the expected read length (between 50 and 500). Always set this if your read length differs from 100. Default: 100.
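
If the read length of a library is not known, it can be estimated from the read file before choosing a value for -readlength. The following is a rough sketch under stated assumptions: max_readlength is a hypothetical helper, not part of phyloFlash, and it assumes standard four-line FASTQ records (it will not work on FASTA input or on FASTQ with wrapped sequence lines).

```shell
# Hypothetical helper (not part of phyloFlash): print the longest read length
# among the first N records (default 1000) of a FASTQ file, plain or gzipped.
# Assumes four lines per record, with the sequence on the second line.
max_readlength() {
    file=$1
    nreads=${2:-1000}
    case "$file" in
        *.gz) gzip -dc "$file" ;;
        *)    cat "$file" ;;
    esac | head -n "$((nreads * 4))" |
        awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
             END { print max + 0 }'
}

# Usage sketch:
#   phyloFlash.pl -lib run01 -read1 reads_F.fq.gz -read2 reads_R.fq.gz \
#       -readlength "$(max_readlength reads_F.fq.gz)"
```

The result still needs to fall within the 50 to 500 range accepted by -readlength, and trimmed libraries with variable read lengths may warrant a closer look than this quick check gives.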

-CPUs N Number of threads to use. Defaults to all available CPU cores.

-readlimit N Limits processing to the first N reads in each input file that map to the reference database. Use this for transcriptomes with a lot of rRNA reads, and use values below 1000000. Default: unlimited.

-amplimit N Set the limit of SSU read pairs to switch from emirge.py to emirge_amplicon.py. This feature is not reliable as emirge_amplicon.py has been problematic to run (use values >100000). Default: 500000.

2.3. Customizing the run

-skip_spades Do not use SPAdes to assemble full-length sequences from extracted reads

-emirge Use EMIRGE to reconstruct full-length sequences from extracted reads. (Default: Off)

-sortmerna Use SortMeRNA instead of BBmap to extract SSU rRNA reads. Insert size and %id to reference statistics will not be available. (Default: No)

-poscov Use Nhmmer to find positional coverage of reads across Barrnap’s HMM model of the 16S and 18S rRNA genes from a subsample of reads, as an estimate of coverage evenness. (Default: Off)

-id N Minimum % identity of reads to map against reference database. Must be between 50 and 98. Set to a lower value for very divergent taxa. Default: 70.

-clusterid N % identity threshold for reference sequence clustering step. Must be between 50 and 100. Default: 97.

-taxlevel N Level in the taxonomy string to use for taxonomic units (NTUs), for the taxonomic summary and to estimate diversity. Must be an integer, and starts with 1 for the highest taxonomic level (Domain). Default: 4.

-maxinsert N Maximum insert size allowed for paired end read mapping. Must be between 0 and 1200. Default: 1200.

-sc Use if data are from single-cell MDA libraries, option is passed to the SPAdes assembler. (Default: Off)

-dbhome DIR Directory containing phyloFlash reference databases, prepared with phyloFlash_makedb.pl. If not specified, phyloFlash will check for an environment variable $PHYLOFLASH_DBHOME, then look in the current directory, the home directory, and the directory where the phyloFlash.pl script is located, for a suitable database directory containing the necessary files. If there is more than one database folder, it will pick the one with the highest SILVA version number.

-trusted FILENAME User-supplied Fasta file of trusted contigs containing SSU rRNA sequences. The SSU sequences will be extracted with Barrnap, and the input read files will be screened against these extracted “trusted” SSU sequences

2.4. Localization and compatibility options

-crlf Use CRLF as the line terminator in CSV output, to be RFC4180 compliant (Default: Off)

-decimalcomma Use decimal comma instead of decimal point to fix locale problems for some European systems (Default: Off)

2.5. Configuring output

-html Produce an HTML-formatted version of the report file. This helps improve readability and individual sections of the report can be collapsed. (Default: On, turn off with -nohtml)

-treemap Include an interactive treemap of the NTU counts in the HTML report. This uses the Google Visualization API, which requires an Internet connection and that you agree to their terms of service, and is not open-source although it is free to use. (Default: Off)

-log Write status messages printed to STDERR also to a log file (Default: Off)

-zip Compress output into a tar.gz archive file. Overridden by -almosteverything and -everything (Default: Off)

-keeptmp Keep temporary/intermediate files (Default: Off)

-everything Turn on all the optional analyses and output options. Options without defaults and any local settings must still be specified. Equivalent to -emirge -poscov -treemap -zip -log

-almosteverything Like -everything except without -emirge

3. Testing phyloFlash

You will find test data in the test_files folder. The test data provided contains subsampled SSU reads from SRA ERR138446, a Caenorhabditis sample with associated bacteria. You can test if phyloFlash is working properly with these files:

phyloFlash.pl -lib TEST -read1 test_files/test_F.fq.gz -read2 test_files/test_R.fq.gz

4. Expected performance

10 million 100 bp paired-end reads of a metagenomic library are processed in less than 5 minutes on a normal 2014 desktop PC with 4 CPU cores and 8 GB of RAM.

phyloFlash usually detects most lifeforms on earth that have an SSU rRNA sequence at least 70% identical to something in the databases. More exotic organisms might be problematic, but test data for them is hard to come by. If you happen to have such a test case and are willing to share it, please drop me a line. If you think phyloFlash is not detecting a certain organism that is very distant from the known SSU rRNA sequences, try lowering the minimum sequence identity for a mapping hit, e.g. with -id 63.