Falco: FastQC Alternative Code

This program is an emulation of the popular FastQC software, used to check large sets of sequencing reads for common problems.

Installing falco

Installing through conda

If you use anaconda to manage your packages, and the conda binary is in your path, you can install the most recent release of falco by running

$ conda install -c bioconda falco

falco can be found inside the bin directory of your anaconda installation.

Installing from source (code release)

Compilation from source can be done by downloading a falco release from the releases section above. After downloading the release (in .tar.gz or .zip format) and extracting it to a directory (e.g. falco), move into that directory (e.g. cd falco). You should see a configure file in it. In this directory, run

$ ./configure CXXFLAGS="-O3 -Wall"
$ make all
$ make install

If you wish to install the falco binaries in a specific directory, you can use the --prefix argument when running ./configure, for instance:

$ ./configure CXXFLAGS="-O3 -Wall" --prefix=/path/to/installation/directory

The falco binary will be found in the bin directory inside the specified prefix.

Installing from a cloned repository

We strongly recommend using falco through stable releases as described above, as the latest commits might contain undocumented bugs. For the more advanced users who wish to test the most recent code, falco can be installed by first cloning the repository

$ git clone https://github.com/smithlabcode/falco.git
$ cd falco

Once inside the generated repository directory, run

$ make all
$ make install

This should create a bin directory inside the cloned repository containing falco.

Required C++ dependencies

zlib is required to read gzip-compressed FASTQ files. It is usually installed by default on most UNIX systems and is part of the htslib setup, but it can also be installed with standard package managers like apt, brew or conda.

On Ubuntu, zlib C++ libraries can be installed with apt:

$ sudo apt install zlib1g zlib1g-dev

Optional C++ dependencies

htslib is required to process bam files. If not provided, bam files will be treated as unrecognized file formats.

If htslib is installed, falco can be compiled with it by adding the --enable-hts flag to the configure command above:

$ ./configure CXXFLAGS="-O3 -Wall" --enable-hts

If falco was cloned from the repository, run the following commands to allow BAM file reading:

$ make HAVE_HTSLIB=1 all
$ make HAVE_HTSLIB=1 install

If successfully compiled, falco can be used on BAM files the same way as it is used with fastq and sam files.

Running falco

Run falco with the following command, where the provided example.fq file can be replaced with the path to any FASTQ file you want to analyze:

$ falco example.fq

This will generate three files in the same directory as the input fastq file:

  • fastqc_data.txt is a text file with a summary of the QC metrics

  • fastqc_report.html is the visual HTML report showing plots of the QC metrics summarized in the text summary.

  • summary.txt is a tab-separated file describing the pass/warn/fail result for each module. If multiple files are provided, only one summary file is generated, with one of the columns being the file name associated with each module result.
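
As a minimal sketch of how summary.txt could be post-processed, the Python snippet below collects every module that did not pass, grouped by input file. It assumes each line holds three tab-separated fields in the order status, module name, file name. That column order, the function name flagged_modules, and the sample rows are assumptions made for illustration and should be checked against the output of your falco version.

```python
from collections import defaultdict


def flagged_modules(summary_text):
    """Group modules whose status is not PASS by file name.

    Assumed row layout (not confirmed by the falco documentation):
    status <TAB> module name <TAB> file name
    """
    flagged = defaultdict(list)
    for line in summary_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, module, filename = fields[0].upper(), fields[1], fields[2]
        if status != "PASS":
            flagged[filename].append((module, status))
    return dict(flagged)


# Hypothetical summary content, for illustration only.
example_summary = (
    "PASS\tBasic Statistics\texample.fq\n"
    "WARN\tPer base sequence content\texample.fq\n"
    "FAIL\tAdapter Content\texample.fq\n"
)

for filename, modules in flagged_modules(example_summary).items():
    for module, status in modules:
        print(f"{filename}: {module} ({status})")
```

To apply this to a real run, read the file contents first and pass them to the function, for example flagged_modules(open("summary.txt").read()).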

The full list of arguments and options can be seen by running falco without any arguments, as well as falco -? or falco --help. This will print the following list:

Usage: falco [OPTIONS] <seqfile1> <seqfile2> ...
Options:
  -h, --help               Print this help file and exit
  -v, --version            Print the version of the program and exit
  -o, --outdir             Create all output files in the specified
                           output directory. FALCO-SPECIFIC: If the
                           directory does not exists, the program will
                           create it. If this option is not set then
                           the output file for each sequence file is
                           created in the same directory as the
                           sequence file which was processed.
      --casava             [IGNORED BY FALCO] Files come from raw
                           casava output. Files in the same sample
                           group (differing only by the group number)
                           will be analysed as a set rather than
                           individually. Sequences with the filter flag
                           set in the header will be excluded from the
                           analysis. Files must have the same names
                           given to them by casava (including being
                           gzipped and ending with .gz) otherwise they
                           won't be grouped together correctly.
      --nano               [IGNORED BY FALCO] Files come from nanopore
                           sequences and are in fast5 format. In this
                           mode you can pass in directories to process
                           and the program will take in all fast5 files
                           within those directories and produce a
                           single output file from the sequences found
                           in all files.
      --nofilter           [IGNORED BY FALCO] If running with --casava
                           then don't remove read flagged by casava as
                           poor quality when performing the QC
                           analysis.
      --extract            [ALWAYS ON IN FALCO] If set then the zipped
                           output file will be uncompressed in the same
                           directory after it has been created. By
                           default this option will be set if fastqc is
                           run in non-interactive mode.
  -j, --java               [IGNORED BY FALCO] Provides the full path to
                           the java binary you want to use to launch
                           fastqc. If not supplied then java is assumed
                           to be in your path.
      --noextract          [IGNORED BY FALCO] Do not uncompress the
                           output file after creating it. You should
                           set this option if you do not wish to
                           uncompress the output when running in
                           non-interactive mode.
      --nogroup            Disable grouping of bases for reads >50bp.
                           All reports will show data for every base in
                           the read. WARNING: When using this option,
                           your plots may end up a ridiculous size. You
                           have been warned!
      --min_length         [NOT YET IMPLEMENTED IN FALCO] Sets an
                           artificial lower limit on the length of the
                           sequence to be shown in the report. As long
                           as you set this to a value greater or equal
                           to your longest read length then this will
                           be the sequence length used to create your
                           read groups. This can be useful for making
                           directly comaparable statistics from
                           datasets with somewhat variable read
                           lengths.
  -f, --format             Bypasses the normal sequence file format
                           detection and forces the program to use the
                           specified format. Valid formats are bam, sam,
                           bam_mapped, sam_mapped, fastq, fq, fastq.gz
                           or fq.gz.
  -t, --threads            [NOT YET IMPLEMENTED IN FALCO] Specifies the
                           number of files which can be processed
                           simultaneously. Each thread will be
                           allocated 250MB of memory so you shouldn't
                           run more threads than your available memory
                           will cope with, and not more than 6 threads
                           on a 32 bit machine [1]
  -c, --contaminants       Specifies a non-default file which contains
                           the list of contaminants to screen
                           overrepresented sequences against. The file
                           must contain sets of named contaminants in
                           the form name[tab]sequence. Lines prefixed
                           with a hash will be ignored. Default:
                           Configuration/contaminant_list.txt
  -a, --adapters           Specifies a non-default file which contains
                           the list of adapter sequences which will be
                           explicity searched against the library. The
                           file must contain sets of named adapters in
                           the form name[tab]sequence. Lines prefixed
                           with a hash will be ignored. Default:
                           Configuration/adapter_list.txt
  -l, --limits             Specifies a non-default file which contains
                           a set of criteria which will be used to
                           determine the warn/error limits for the
                           various modules. This file can also be used
                           to selectively remove some modules from the
                           output all together. The format needs to
                           mirror the default limits.txt file found in
                           the Configuration folder. Default:
                           Configuration/limits.txt
  -k, --kmers              [IGNORED BY FALCO AND ALWAYS SET TO 7]
                           Specifies the length of Kmer to look for in
                           the Kmer content module. Specified Kmer
                           length must be between 2 and 10. Default
                           length is 7 if not specified.
  -q, --quiet              Supress all progress messages on stdout and
                           only report errors.
  -d, --dir                [IGNORED: FALCO DOES NOT CREATE TMP FILES]
                           Selects a directory to be used for temporary
                           files written when generating report images.
                           Defaults to system temp directory if not
                           specified.
  -s, -subsample           [Falco only] makes falco faster (but
                           possibly less accurate) by only processing
                           reads that are multiple of this value (using
                           0-based indexing to number reads). [1]
  -b, -bisulfite           [Falco only] reads are whole genome
                           bisulfite sequencing, and more Ts and fewer
                           Cs are therefore expected and will be
                           accounted for in base content.
  -r, -reverse-complement  [Falco only] The input is a
                           reverse-complement. All modules will be
                           tested by swapping A/T and C/G
      -skip-data           [Falco only] Do not create FastQC data text
                           file.
      -skip-report         [Falco only] Do not create FastQC report
                           HTML file.
      -skip-summary        [Falco only] Do not create FastQC summary
                           file
  -D, -data-filename       [Falco only] Specify filename for FastQC
                           data output (TXT). If not specified, it will
                           be called fastq_data.txt in either the input
                           file's directory or the one specified in the
                           --output flag. Only available when running
                           falco with a single input.
  -R, -report-filename     [Falco only] Specify filename for FastQC
                           report output (HTML). If not specified, it
                           will be called fastq_report.html in either
                           the input file's directory or the one
                           specified in the --output flag. Only
                           available when running falco with a single
                           input.
  -S, -summary-filename    [Falco only] Specify filename for the short
                           summary output (TXT). If not specified, it
                           will be called fastq_report.html in either
                           the input file's directory or the one
                           specified in the --output flag. Only
                           available when running falco with a single
                           input.
  -K, -add-call            [Falco only] add the command call call to
                           FastQC data output and FastQC report HTML
                           (this may break the parse of fastqc_data.txt
                           in programs that are very strict about the
                           FastQC output format).

Help options:
  -?, -help                print this help message
      -about               print about message

PROGRAM: falco
A high throughput sequence QC analysis tool
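
The -c and -a options above expect files in the form name[tab]sequence, with lines prefixed by a hash ignored. The Python sketch below reads that layout and can help validate a custom list before passing it to falco. The function name read_named_sequences and the sample entries are hypothetical. Rejecting lines without a tab and tolerating several consecutive tabs are assumptions of this sketch, not documented falco behavior.

```python
def read_named_sequences(text):
    """Parse name<TAB>sequence lines, skipping blank lines and '#' comments."""
    entries = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment
        # Split on the first tab; any extra tabs are stripped from the sequence.
        name, tab, sequence = stripped.partition("\t")
        if not tab or not sequence.strip():
            raise ValueError(f"line {lineno}: expected name<TAB>sequence")
        entries[name.strip()] = sequence.strip()
    return entries


# Hypothetical adapter list, for illustration only.
sample = (
    "# example entries, not real adapters\n"
    "Example Adapter A\tACGTACGTACGT\n"
    "\n"
    "Example Adapter B\t\tTTGGCCAATTGG\n"
)

for name, sequence in read_named_sequences(sample).items():
    print(f"{name}: {sequence}")
```

A file that parses cleanly could then be supplied with -a (adapters) or -c (contaminants), as described in the option list.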

Citing falco

If falco was helpful for your research, you can cite us as follows:

de Sena Brandine G and Smith AD. Falco: high-speed FastQC emulation for quality control of sequencing data. F1000Research 2021, 8:1874 (https://doi.org/10.12688/f1000research.21142.2)

Please do not cite this manuscript if you used FastQC directly and not falco!

Copyright and License Information

Copyright (C) 2019-2024 Guilherme de Sena Brandine and Andrew D. Smith

Authors: Guilherme de Sena Brandine and Andrew D. Smith

This is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.