GRAbB (Genome Region Assembly by Baiting) is program designed to assemble selected regions of the genome or transcriptome using reference sequences and NGS data.
- Usage
- Prerequisites
- Helper programs
- Algorithm overview
- Arguments
- Using custom assembler program
- Citation
- Contact
Asciinema casts:
Steps:
-
Install prerequisites (If this step is skipped, then configure_GRAbB.pl tries to use prerequisites included in the package)
-
Minimal set:
- Baiting program: mirabait (recommended) OR kmer_bait.pl (no installation needed)
- Read collecting program: seqtk (recommended) OR create_readpool.pl (no installation needed)
- Assembler: Edena OR Vevlet OR external_scaffold.pl (needs to be modified and requires a working installation of an assembler)
-
Assemblers:
- Edena: default assembler for GRAbB.pl
- Vevlet
- Other assembler: external_scaffold.pl has to be also edited
-
Exonerate:
- Required for running GRAbB.pl in exonerate mode
-
-
Configure GRAbB.pl
./configure_GRAbB.pl
Recommended to add prerequisites to the Path Or add the absolute path of the executables to the source code of GRAbB.pl before running configure_GRAbB.pl
Configured GRAbB.pl can be found in bin directory
Bug: On some systems the exonerate binary included in the package runs extremely slow. Configure can get stuck at 'Testing exonerate' block, then issue Ctrl + c. On these systems exonerate has to be installed or build from source code (See Prerequisites::exonerate, before rerunning the configuration script.
-
To test installation run the following (assembler has to be adjusted unless GRAbB.pl is configured without Edena)
bin/GRAbB.pl --ref for_testing/assembly.fas --reads for_testing/read* --folder test --prefix test
OR
Use Docker (See Docker.md for more detailed instructions)
-
Either download the docker repository via docker interface
docker pull brankovics/grabb
-
Or create a local docker image:
git clone https://github.com/b-brankovics/grabb cd grabb/docker docker build -t localhost:5000/$USER/grabb .
Run GRAbB.pl without any arguments and it prints the Usage information
The documentation is this file and the files mentioned at the examples.
See the wiki or the files Docker.md, Examples.md and Tutorial.md.
-
Download MIRA (4.0) assembler (http://sourceforge.net/projects/mira-assembler/)
-
Extract it. The executable files are in the bin folder.
-
Copy/move or symlink 'mirabait' into somewhere in the path or add the folder to the path (This program uses only mirabait)
Warning: the name of the executable file has to be mirabait!
Bug: mira 4.0.2 does not work properly, but mira 4.0 does
OR you may also use kmer_bait.pl, which is less efficient, but uses only perl and standard Unix commands
- Download EDENA and extract it or use the copy in the 3rd_party_programs
- Change to the directory
- Type
make
on the command line (g++ needs to be installed, on ubuntu typesudo apt-get install g++
) - Copy/move or symlink 'edena' into somewhere in the path or add the folder to the path (The files in the bin folder)
- Download Velvet and extract it or use the copy in the 3rd_party_programs
- Change to the directory. First zlib needs to be installed
- Change to 'third-party/zlib-1.2.3/'
- Type
make
on the command line - Type
sudo make install
on the command line - Go back to the parent directory (
cd ../..
) - Type
make
on the command line - Copy/move or symlink 'velveth' and 'velvetg' into somewhere in the path or add the folder to the path
- Download Seqtk from github and uncompress it or
git clone https://github.com/lh3/seqtk.git
- Change to the directory
- Type
make
on the command line (zlib needs to be installed, see 1.3) for instructions) - Copy/move or symlink 'seqtk' into somewhere in the path or add the folder to the path (The files in the bin folder)
OR you may also use create_readpool.pl, which is less efficient, but uses only perl and standard Unix commands
For Ubuntu run sudo apt-get install exonerate
Else:
-
Download exonerate from the EBI website and uncompress it or use the version included in the GRAbB package (3rd_party_programs)
-
Change to the directory
-
Type the following commands (The following packages have to be installed on the system before running
./configure
: gcc, make and glib2)./configure make make check make install
-
The executable is found at
src/program/exonerate
. Copy/move or symlink 'exonerate' into somewhere in the path or add the folder to the path (The files in the src/program folder)
- Download PRINSEQ lite and extract it or use the copy in the 3rd_party_programs
- Copy/move or symlink 'prinseq-lite.pl' into somewhere in the path or add the folder to the path
This program creates a FASTA format read file for each FASTQ read file specified.
Usage:
./fastq2fasta <reads_1.fastq> <reads_2.fastq>
This program reads the contigs from a fasta file and checks if they are overlapping with each other by using a minimal overlap size that is specified at invocation. Finally, prints all the overlaps found.
Usage:
./get_overlaps <contigs.fasta> <overlap>
This program creates a forward and reverse read file from an interleaved file
Usage:
./interleaved2pairs <reads.fastq>
This program reads the contigs from a fasta file and checks if they are overlapping with each other by using a minimal overlap size that is specified at invocation. Afterwards it loops through all the contigs and merges contig pairs that only overlap with each other at the given side. In the end it saves the contigs that were created into a file.
Usage:
./merge_contigs <contigs.fasta> <overlap> <output.fas>
This program creates a FASTA format read file for each FASTQ read file
Usage:
./rename_fastq <reads_1.fastq> <reads_2.fastq>
This program creates paired-end read files from single-end files
- Selects the reads that are at least as long as the specified length (<int>)
- Gets the first <int> characters to be used as forward read
- Gets the last <int> characters to be used as reverse read (reverse complement)
All the produced reads are $len long.
The output file will be created in the current working directory
Usage:
./single2pairs <int> <reads.fastq>
This program creates a read file with reads with uniform length
- Selects the reads that are at least as long as the specified length (<int>)
- Trims the reads to the specified length
All the produced reads are $len long
The output file will be created in the current working directory
Usage:
./uniform_length <int> <reads.fastq>
This program takes a fasta file and shifts the sequence in it according to a position value or a reference file the output is printed to STDOUT
-
Using position value n is the position value
Usage:
./fasta_shift -i <input.fas> -p <int> ><output.fas>
-
Using a reference file: the output will start with the first occurrence of the sequence in the reference file
Usage:
./fasta_shift -i <input.fas> -r <ref.fas> ><output.fas>
This program compares two sequences and prints a few metrics
Output: <# of identical bases> <# of identical bases, not counting '-'> <percentage of identity w/o '-'>
Usage:
./pairwise_alignment_score <input1.fas> <input2.fas>
This program prints the reverse complement of the input to STDOUT
The input may be files or STDIN
Usage:
./reverse_complement <input.fas> ><output.fas>
GRAbB is written in Perl and it uses only modules that are part of the core distribution of Perl. In addition to basic UNIX commands, the following third-party programs are used by GRAbB: mirabait (from the MIRA package), Seqtk, EDENA, Velvet and Exonerate.
The program is designed to be versatile and flexible with the following functionalities:
- Use pairing information
- Use additional bait sequences
- Assemble multiple regions separately in a single run
- Use any of a range of completion criteria (may use different ones for each region)
These functionalities are detailed below at the appropriate steps of the algorithm.
The main loop of GRAbB can be summarized as follows:
- creating the bait file
- finding reads by baiting
- collecting reads
- de novo assembly of selected reads
- testing completion
If multi-mode is selected, then the general baiting step is followed by specific baiting, de novo assembling and completion testing for each of the threads. The threads are generated by splitting the reference file into single-entry FASTA files. These newly created reference files are used as bait for the initial specific baiting steps for the given thread. At the end of each cycle of the main loop, the program checks whether there is any thread that is not completed yet, then it continues or stops accordingly.
At the invocation of GRAbB it is possible to specify a length filter that excludes all contigs from the assembly that are shorter than the specified length from being used for generating the bait file for the next iteration.
-
Initial bait file
At the start of the run an initial (general) bait file is created by concatenating the reference and bait files. This bait files is used for the first general baiting step.
-
General bait file
At the start of each iteration a general bait file is created by concatenating the latest assembly file(s)
-
Specific bait file
The specific bait file is only created in multi-mode. At the start of the run the reference file is split into single-entry FASTA files and these are used as initial specific bait files. In latter iterations the latest assembly for the give thread is used as the new bait file.
Reads belonging to the specified sequence are identified by using exact k-mer (31 bp) matching that is implemented by mirabait (from the MIRA package). The names of the reads thus identified, are collected and added to the list of read names from previous iterations (There is a separate list for the general baiting and for each thread.) If there are no new reads identified, then the program stops the iteration, either for the given thread (specific) or for all the threads (general).
By using a general baiting step before the specific baiting it is possible to reduce the required run time, because the large input read file(s) is/are only screened once per iteration and the specific baiting is confined to screening the reads that are already identified to be specific during the general baiting.
The identified reads are collected from the read files using Seqtk into internal read files. The program identifies reads based on the read names, thus the first word of the identifier line should be unique for each read if they are single-end or should be the same for both reads of the pair.
The program can use two assemblers, EDENA and Velvet, by default, but there is a skeleton code to add a new assembler to the source code of the program. Also it is also possible to write an external perl script that is used by GRAbB for the assembly. By default EDENA is used for assembly, but using command line options the other assemblers can be selected, as well.
The fact that single- or paired-mode is selected is passed on to the assembler program that assembles the specific reads de novo. Also, it is possible to pass additional arguments, such as overlap size (EDENA), to the assembler program at the invocation of the main program.
There are multiple completion criteria that can be specified for the program. These can be specified for each thread separately (multi-diff-mode or for all of the threads at once. Also, it is possible to specify multiple criteria for the same thread or run. In this case the program stops when any one of these criteria is met.
-
Exhaustive run
The first completion criterion is implicit, the program stops if there is no new information found. This means that either there are no new reads found or that the new assembly is identical to the bait used for the current iteration. If no other criterion is specified then the run is an exhaustive run, since it iterates until it cannot find any new information.
-
Length criteria (total length, longest contig or N50)
This is the simplest explicit completion criterion. There are three options that can be used for this setting: total assembly size, the length of the longest contig or the N50 value of the assembly. This criterion is tested independently for each of the threads in multi-mode or, otherwise, for the single thread. As mentioned before, multiple criteria can be used in a single run, this also applies for the different size criteria. These settings are useful when exploring the vicinity of a specified sequence region.
-
Matching homologous sequence
In this case the specific reference sequence is used to identify the homologous region within the assembly. To identify the matching region, GRAbB uses Exonerate with settings that ensure that the whole reference sequence is aligned to the assembly contigs. This makes it possible to match sequences that are somewhat dissimilar to the reference and may also contain indels, causing gaps in the alignment. The possible results of the matching can be divided into two groups: the whole sequence is matched—also if there are internal regions that correspond to gaps in the other sequence—or it is not. In the first case the completion criterion is met, and the matched region is extracted from the assembly in the same orientation as the reference sequence and saved to an output file. In the latter case there are two possibilities. If the matched region is larger than the one for the previous iteration, then the thread or run will continue, but if the size of the match has not improved than the thread or run will stop.
-
This information is passed on to the assembler program.
-
Paired-mode
It is the default if two read files are specified.
In paired-mode the read files are tested whether GRAbB can identify pairs as expected.
-
Single-mode
It is the default mode if there is only one or more than two read files specified. If two read files are specified than single-mode can be chosen by using the --single argument.
-
-
Multi-mode
Multi-mode is selected by using --type multi option at invocation. In multi-mode the reference file is split into single-entry FASTA files, these are referred to as specific references files. Also for each entry a separate thread is created. The individual threads are independent from each other, thus multiple regions can be assembled in a single run without interference from each other. The specific reference file is used as initial specific bait file. Also in exonerate-mode the specific reference file is used for the homology matching.
-
Multi-diff-mode
Multi-diff-mode is selected by using --type multi-diff option at invocation. It also belongs to the multi-mode with all its properties. The difference is that when the specific reference files are created if there is completion criterion specified in the identification line of the FASTA entry, than this is added to the completion criteria to be used for the given thread.
-
Exonerate-mode
Exonerate-mode is selected by using --type exonerate option at invocation. In this case the specific reference sequence is used to identify the homologous region within the assembly. To identify the matching region, GRAbB uses Exonerate with settings that ensure that the whole reference sequence is aligned to the assembly contigs. This makes it possible to match sequences that are somewhat dissimilar to the reference and may also contain indels, causing gaps in the alignment. The possible results of the matching can be divided into two groups: the whole sequence is matched—also if there are internal regions that correspond to gaps in the other sequence—or it is not. In the first case the completion criterion is met, and the matched region is extracted from the assembly in the same orientation as the reference sequence and saved to an output file. In the latter case there are two possibilities. If the matched region is larger than the one for the previous iteration, then the thread or run will continue, but if the size of the match has not improved than the thread or run will stop.
-
Clean-mode
-
Clean-mode
It is selected by --clean. GRAbB will remove some internal files to save disk space. But there is no information lost, because all the deleted files can be reconstructed using the remaining files.
-
Double clean-mode
It is selected by --clean --clean. GRAbB will remove some internal files to save disk space. At the end of the run all the output files and folders are deleted except for the result files.
-
GRAbB.pl --ref <reference file> --reads <read file 1> [<read file 2>] --folder <directory> --prefix <prefix> [options]
There are four mandatory arguments: reference file, read file(s), output folder and prefix for the log and results
The order of the arguments is not important.
--ref <ref.fas>
The reference file is a FASTA formatted file that contains one or more sequences. The sequence IDs have to be unique for each sequence (as required by mirabait). If the file contains multiple sequences and the program is run in multi-mode then the reference file is split into separate reference files that contain only a single sequence, the handling of these files is discussed in the segment on the main loop. Furthermore, the description lines may contain specification for the completion criterion to be used for the given sequence that is used if multi-diff mode is selected. Because the read selection is based on exact k-mer (31 bp) matching, the reference sequence does not have to be highly similar to the target sequence.
--bait <bait.fas>
A separate bait file can be specified besides the reference file, this file together with the reference file will be used as first bait. Useful when using special criterion for the assembly, such as homology.
--reads <r1.fastq> [<r2.fastq> ...]
Multiple read files can be specified as input. If two read files are given, then it is assumed that reads are paired, but in single-mode, reads are considered as single reads. The program identifies read pairs based on the read names, thus the first word of the identifier line should be the same for both sequences. The read files may be in FASTA or FASTQ format and may be compressed (using gnuzip). If EDENA is selected as assembler program, then all the reads should be of the same length.
--folder <folder_name>
The directory where all the output will be saved. If the directory is non-empty then the files it contains can be used like internal files. In this manner previous runs can be continued, make sure to remove or replace files that would suggest completion:
Folder structure:
<folder>/
-
reference.fas
-
bait.fas
-
extra_bait.fas
-
<prefix>_assembly_thread_<int>.fas
-
<prefix>_result_thread_<int>.fas (if exonerate-mode is selected and the sequence was matched)
-
<prefix>.log
-
old_collection.list
-
reads*<int>.fastq or reads<int>*.fasta
-
Round*<int>*/
-
hashstat.bin
-
mirabait.log
-
new_collection.list
-
positive_<int>.txt
-
readpool*<int>*.fastq
-
reads*<int>*.fastq
-
thread_<int>/
- assembly.fas
- new_collection.list
- readpool*<int>*.fastq
- (files or folders generated by the assembler)
- exonerate.log (if exonerate-mode is selected)
- result.fas (if exonerate-mode is selected and the sequence was matched)
-
-
thread_<int>/
- assembly.fas
- assembly_<int>.fas
- bait.fas
- final_assembly.fas
- old_collection.list
- reference.fas
- reference.fas.exonerate (if exonerate-mode is selected)
- result.fas (if exonerate-mode is selected and the sequence was matched)
--prefix <prefix_of_output>
The prefix for the output files:
- log file
- assembly file
- result file
--single
Treat reads as unpaired reads even if two read files are specified
--min_length=<int>
Minimum size required for a contig to be included for completion testing and baiting
--type <run_type_string>
-
When should the iteration stop?
-
The intrinsic criteria are:
- There are no new reads
- There is no assembly
- The bait sequence did not change
-
Extra criteria:
-
Length of the assembly
--type total=<int>
-
Length of the longest contig of the assembly
--type longest=<int>
-
N50 value of the assembly
--type N50=<int>
-
Reference matches to assembly (uses exonerate)
--type exonerate
-
-
-
Or to have parallel runs (threads)
--type multi
-
To get the completion criterion for each thread from the reference file (These criteria overwrite the globally defined criteria. Write "exhaustive" in the identifier line if exhaustive should be used.)
(Need to be used together with multi)
--type multi_diff
Multiple options can be used at the same time, but then they have to be typed as a concatenated string (or using quotation)
--type <multiexonerate | multi_exonerate | "multi exonerate">
--arg1 "argument1 argument2 argument3"
Arguments needed for the assembler program used for graph/hash generation
--arg2 "argument1 argument2 argument3"
Arguments needed for the assembler program used for the assembly
--assembler assembler_name
Specify the assembler to be used:
--clean
Remove some internal files to save disk space.
--clean --clean
If specified twice, then only result files are kept, the rest is deleted
GRAbB may use either EDENA or Velvet to preform the de novo assembly, by default. There are two possibilities to add other assemblers to be used by the program.
First the absolute path or the command for the assembler has to be added to the source code of GRAbB
my $alt_1_cmd = "";
my $alt_2_cmd = "";
These lines should be modified so that the command or absolute path of the executable of the assembler are written between the quotation marks. The first one is the command used for the first step of the assembly, the second is the command used for the second step of the assembly. If the assembly requires only one step, then use the other method to add the assembler.
In addition, there is a skeleton code within the source code of GRAbB
marked by # Assemble (Alternative)
at the end of the lines. Within
this block of code there are six lines that need to be changed to add
the new assembler. These lines have exclamation marks at the end.
$alt_1_cmd
and $alt_2_cmd
are the commands defined above. The
four command line strings ($single_first
, $single_second
,
$paired_first
and $paired_second
) have to be completed with
all necessary arguments that are required by the assembler for the
different assembly steps.
-
The name of the assembler, this will be displayed by GRAbB during the run
my $name = "";
-
The code to execute the first step of the assembly for single-end sequences (
@arg1
: the arguments specified at invocation after the --arg1 argument;@$readpool_ref
: list of the readpool files created by GRAbB)my $single_first = "$alt_1_cmd @arg1 @$readpool_ref";
-
The code to execute the second step of the assembly for single-end sequences (
@arg2
: the arguments specified at invocation after the --arg2 argument)my $single_second = "$alt_2_cmd @arg2";
-
The code to execute the first step of the assembly for paired-end sequences (
@arg1
: the arguments specified at invocation after the --arg1 argument;@$readpool_ref
: list of the readpool files created be GRAbB)my $paired_first = "$alt_1_cmd @arg1 @$readpool_ref";
-
The code to execute the second step of the assembly for paired-end sequences (
@arg2
: the arguments specified at invocation after the --arg2 argument)my $paired_second = "$alt_2_cmd @arg2";
-
The relative path of the assembly file created by the assembler. The working directory is identical to where the assembly commands have been issued
my $result = "";
After all these modifications are completed, then GRAbB maybe run with the newly specified assembler program. Using the proper argument:
--assembler alternative
There is a skeleton script that maybe copied and modified to run the assembly if external is selected as assembler.
#!/usr/bin/perl -w
use strict;
# A perl script that allows you to use an assembler that is not
specified in the GRAbB source code
# Get all the parameters
my ($reads, $paired, $parameters1, $parameters2, $outfile, $format) = @ARGV;
# Create an array containing the read files
my @reads = split / /, $reads;
my @param1 = split / /, $parameters1;
my @param2 = split / /, $parameters2;
# If there were no extra arguments passed for the assembler at the invocation of GRAbB
# then the value passed to this script is "-" ($param1[0] and/or $param2[0])
# For mapping-assemblers a reference is needed
#my $ref = &get_reference(); # For reference based assembly uncomment this line
# The program to be used for assembly
my $assmebler = ""; # ADD
# Run the assembly
`$assmebler @reads`; # ADD
# Specify the expected result file
my $result = ""; # ADD
# Copy the result file to the expected position
`perl -ne 'if (/^>/) {print} else {tr/acgtwmrskybvdh/ACGTWMRSKYBVDH/; print}' $result >$outfile`;
sub get_reference {
# This subroutine finds the reference file for the current thread
my $ref;
my $pwd = $ENV{"PWD"};
$pwd =~ /(thread_\d+)$/;
my $thread = $1;
$ref = "../../$thread/bait.fas";
return $ref;
}
The lines marked with # ADD
have to be adjusted:
-
If the user wishes to add a mapping-assembler then the following command has to be uncommented. Afterwards the reference file will be referred to by
$ref
.#my $ref = &get_reference(); # For reference based assembly uncomment this line
-
The absolute path or the command for the assembler
my $assmebler = ""; # ADD
-
The command line to execute the assembly. The list of read files is stored in the array
@reads
, the arguments specified after --arg1 are stored in the array@param1
, the arguments specified after --arg2 are stored in the array@param1
, the fact that the assembly is for a single-end or paired-end library is stored in the variable$paired
(the value of the variable is either "single" or "paired") and the variable$format
stores the file format of the read files (either "fasta" or "fastq").
$assmebler @reads
; # ADD
```
-
The relative path of the assembly file created by the assembler. The working directory is identical to where the assembly commands have been issued
my $result = ""; # ADD
After all these modifications are completed and save to the files <external.pl> (the file name can be anything that does not already exist in the directory), then GRAbB maybe run with the newly specified assembler program. Using the proper argument:
--assembler external <external.pl>
external_SPAdes.pl is a skeleton script that
maybe downloaded (and modified) to run the assembly if external
is selected as assembler (--assembler external external_SPAdes.pl
).
The script is configured to use SPAdes by calling
spades.py
. Therefore, if you have the spades.py
command in your
PATH, then you do not have to modify the script and use it
directly by specifying it as the assembler script by including the
following in the GRAbB.pl
invocation:
--assembler external external_SPAdes.pl
Where external_SPAdes.pl
stands for the relative path (you can also
use the absolute path) of the script. In the above example
external_SPAdes.pl
has to be in the current working directory.
To specify arguments for the SPAdes assembly within the GRAbB.pl run
use --arg1
option. For example:
--arg1 '-k 31,61,91 --only-assembler --cov-cutoff 200'
If spades.py
command is not in the PATH, then you can
- either add it to the PATH
- or modify the
external_SPAdes.pl
script.
- Modify line 22 of
external_SPAdes.pl
, fromto specify where your SPAdes executable is located. If it is atmy $assmebler = "spades.py"; # ADD executable here!
~/SPAdes-3.10.0/bin/spades.py
, then line 22 should look like this:my $assmebler = "~/SPAdes-3.10.0/bin/spades.py"; # ADD executable here!
- Save the modified file
- And use it as described above.
If you use GRAbB.pl
or any of the helper programs, please,
cite the paper describing the program.
Brankovics B, Zhang H, van Diepeningen AD, van der Lee TAJ, Waalwijk C, et al. (2016) GRAbB: Selective Assembly of Genomic Regions, a New Niche for Genomic Research. PLoS Comput Biol 12(6): e1004753. doi: 10.1371/journal.pcbi.1004753
Balázs Brankovics balazs.brankovics@wur.nl
- ORCID
- ResearchGate
- Contact details at Wageningen Plant Research