
UGENE integrates a set of tools for taxonomic classification of microorganisms from whole-genome shotgun sequencing data, including Kraken, CLARK, and DIAMOND. For an example, see the "Parallel NGS Reads Classification" sample workflow, which classifies input FASTQ files with these tools in parallel and then merges their output with another tool, WEVOTE.

Warning
The tools are available only on 64-bit macOS and Linux operating systems. Because the tools are resource-intensive, at least 16 Gb of RAM is recommended.
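The requirements in the warning above can be checked before installation. The following is a minimal sketch, not part of UGENE: the function names are hypothetical, the 64-bit test is a heuristic based on the machine string, "16 Gb" is read as 16 gigabytes, and the RAM query relies on `os.sysconf`, which may be unavailable on some systems (the function then returns `None`).

```python
import os
import platform

MIN_RAM_GB = 16  # recommendation from the warning above, read as gigabytes


def is_supported_os(system_name, machine):
    """Heuristic: True for macOS ("Darwin") or Linux on a 64-bit machine string."""
    return system_name in ("Darwin", "Linux") and "64" in machine


def total_ram_gb():
    """Return physical RAM in GB (10**9 bytes), or None if it cannot be determined."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    if pages <= 0 or page_size <= 0:
        return None
    return pages * page_size / 10**9


if __name__ == "__main__":
    print("Supported OS:", is_supported_os(platform.system(), platform.machine()))
    ram = total_ram_gb()
    if ram is None:
        print("RAM size could not be determined")
    else:
        print("RAM, GB: %.1f (recommended: at least %d)" % (ram, MIN_RAM_GB))
```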

To use these tools, provide the taxonomy data and the reference data appropriate for each tool. Some reference databases are provided for each of the tools. Use these data as is or build a custom database (see, for example, the "Build Kraken Database" workflow element).

It is recommended to use the UGENE Online Installer package to install and automatically configure the data. However, if the target computer has no Internet access, or another UGENE package must be used for some other reason, follow the instructions below to download and configure the data.

Download NGS Classification data

Use the links in the "Data for NGS taxonomy classification" section of the "Download UGENE and components" page to download the data.

Warning
Make sure to have enough disk space on the target computer.
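The free space on the target disk can be compared with the sizes listed in the table below. The following is a minimal sketch under stated assumptions: the two constants are copied from the "Total" row of that table and read as gigabytes, the function names are hypothetical, and the requirement is taken conservatively as archive plus unpacked size, on the assumption that the archives stay on disk while they are unpacked.

```python
import shutil

# Values copied from the "Total" row of the table below (approximate).
TOTAL_ARCHIVE_GB = 161
TOTAL_UNPACKED_GB = 226


def has_enough_space(free_bytes, required_gb):
    """Return True if free_bytes covers required_gb (1 GB taken as 10**9 bytes)."""
    return free_bytes >= required_gb * 10**9


def check_target(path, required_gb):
    """Return (is_enough, free_bytes) for the file system that contains path."""
    free = shutil.disk_usage(path).free
    return has_enough_space(free, required_gb), free


if __name__ == "__main__":
    # Conservative assumption: archives and unpacked data coexist on one disk.
    needed = TOTAL_ARCHIVE_GB + TOTAL_UNPACKED_GB
    ok, free = check_target(".", needed)
    print("Free, GB: %.1f; needed, GB: %d; enough: %s" % (free / 10**9, needed, ok))
```

If only some of the data sets are needed, replace the totals with the sum of the corresponding rows.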

See the list of the available downloads in the table below.

Data	Archive size	Unpacked data size	Description	Data source
NCBI taxonomy classification	2.5 Gb	31 Gb	A set of taxonomy data files from NCBI. These data must be present for any type of NGS classification analysis.	The data were downloaded from the NCBI FTP (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/).
NCBI RefSeq bacterial genomes	130 Gb	132 Gb	The data can be used to build a database for CLARK-l (the light version of CLARK), CLARK, or Kraken. Because UGENE integrates a modified version of CLARK/CLARK-l, .gz archives can be provided directly as input for building the database. Building a Kraken database from .gz archives is not supported: each *.gz file must be unpacked first, so even more disk space is required. Note that these data were used to build "CLARK-l DB: RefSeq bacterial+viral genomes" (see below).	The data were downloaded from the NCBI FTP (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/bacteria*.genomic.fna.gz).
NCBI RefSeq viral genomes	7.4 Gb	11 Gb
NCBI RefSeq GRCh38 human genome	837 Mb	838 Mb
Kraken DB: MiniKraken 4Gb database	2.5 Gb	4.3 Gb
CLARK-l DB: RefSeq bacterial+viral genomes	7.4 Gb	11 Gb
CLARK-l DB: RefSeq viral genomes	16 Mb	72 Mb
DIAMOND DB: UniRef50	5.2 Gb	13 Gb
DIAMOND DB: UniRef90	13 Gb	34 Gb
Total:	161 Gb	226 Gb
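As noted in the table, the *.gz files must be unpacked before they can be used to build a Kraken database. The following is a minimal sketch of that step using only the Python standard library. The function name and the directory layout are hypothetical. The archives are left in place, so the unpacked files need disk space in addition to the archives.

```python
import gzip
import shutil
from pathlib import Path


def unpack_gz_files(src_dir, dst_dir):
    """Decompress every *.gz file found directly in src_dir into dst_dir.

    The original archives are not deleted. Returns the list of unpacked paths.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    unpacked = []
    for archive in sorted(src.glob("*.gz")):
        target = dst / archive.name[:-3]  # drop the trailing ".gz"
        # Stream the data in chunks so that large genomes do not fill the RAM.
        with gzip.open(archive, "rb") as fin, open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        unpacked.append(target)
    return unpacked
```

For example, `unpack_gz_files("refseq_bacteria", "refseq_bacteria_unpacked")` would write one .fna file per archive into the second directory; both directory names are placeholders.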
