SARS-CoV-2 dataset

Purpose

1. To provide a compact redistributable dataset containing the entire set of public SARS-CoV-2 genome data (up to June 17, 2022).

2. To demonstrate the efficiency of the NAF format.

How to use

1. Download the data:


2. Download the accompanying .md5 checksum file into the same directory, then verify the downloaded file:

md5sum -c SARS-CoV-2-NCBI-2022-06-17.naf.md5

3. Install NAF tools:

3.a. Either from GitHub:

git clone --recurse-submodules <URL of the NAF GitHub repository>
cd naf && make && make test && sudo make install

3.b. Or using bioconda:

conda install naf

4. Decompress the SARS-CoV-2 dataset into FASTA format (after checking that 171 GB of free disk space is available):

unnaf SARS-CoV-2-NCBI-2022-06-17.naf -o SARS-CoV-2-NCBI-2022-06-17.fasta

5. Use the data directly from NAF format, e.g., to count sequences:

unnaf SARS-CoV-2-NCBI-2022-06-17.naf | grep '>' | wc -l
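Steps 4 and 5 can be combined into a small helper script. The sketch below is only an illustration: the function names has_free_space and fasta_stats are hypothetical (not part of NAF tools), and it assumes that "171 GB" means 171 x 10^9 bytes and that a POSIX df and awk are available.

```shell
# Succeed if DIR has at least REQUIRED_GB gigabytes free.
# Assumption: 1 GB = 10^9 bytes; adjust if the figure is meant as GiB.
has_free_space() {
    required_gb=$1
    dir=${2:-.}
    # df -P gives one-line-per-filesystem POSIX output; -k reports 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    required_kb=$(( required_gb * 1000000000 / 1024 ))
    [ "$avail_kb" -ge "$required_kb" ]
}

# Read FASTA on stdin; print "<sequence count> <total sequence length>".
fasta_stats() {
    awk '
        /^>/ { n++; next }              # header line: one more sequence
             { total += length($0) }    # sequence line: add its length
        END  { printf "%d %.0f\n", n, total }
    '
}
```

For example, "has_free_space 171 . && unnaf SARS-CoV-2-NCBI-2022-06-17.naf -o SARS-CoV-2-NCBI-2022-06-17.fasta" decompresses only when enough space is free, while "unnaf SARS-CoV-2-NCBI-2022-06-17.naf | fasta_stats" summarizes the data as a stream, without writing the FASTA file to disk.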

More details

The dataset was compressed with ennaf 1.3.0, using this command:

ennaf -22 --dna -o SARS-CoV-2-NCBI-2022-06-17.naf sequences.fasta

Adding "--long 31" to the compression command produces a slightly smaller archive of 250,981,625 bytes. We decided not to use this option this time, because the difference is small and because it requires more memory during decompression.

Using the "-1 --dna" options instead compresses this data to 644,120,465 bytes.
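
To put the two quoted sizes side by side, the ratio can be computed with a short script. This is a minimal sketch; the variable names are illustrative, and only the two byte counts stated above are used.

```shell
# Archive sizes in bytes, as quoted in the text above.
size_long31=250981625   # -22 --dna --long 31
size_level1=644120465   # -1 --dna

# awk performs the floating-point division that POSIX sh arithmetic lacks.
ratio=$(awk -v a="$size_level1" -v b="$size_long31" 'BEGIN { printf "%.2f", a / b }')
echo "The -1 archive is ${ratio} times the size of the -22 --long 31 archive."
```

The lowest compression level therefore produces an archive roughly two and a half times larger than the strongest setting mentioned here.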