SARS-CoV-2 dataset

Download

Contents

Rationale

1. To provide a compact redistributable dataset containing the entire set of public SARS-CoV-2 genome data (up to June 17, 2022).

2. To demonstrate the efficiency of the NAF format.

How to use

1. Download the data:

wget http://sayer.nig.ac.jp/kirill/SARS-CoV-2/sequence-data/SARS-CoV-2-NCBI-2022-06-17.naf

2. Verify the downloaded file:

wget http://sayer.nig.ac.jp/kirill/SARS-CoV-2/sequence-data/SARS-CoV-2-NCBI-2022-06-17.naf.md5
md5sum -c SARS-CoV-2-NCBI-2022-06-17.naf.md5

3. Install NAF tools:

3.a. Either from GitHub:

git clone --recurse-submodules https://github.com/KirillKryukov/naf.git
cd naf && make && make test && sudo make install

3.b. Or using bioconda:

conda install -c bioconda naf

4. Decompress the SARS-CoV-2 dataset into FASTA format (after checking that 171 GB of free disk space is available):

unnaf SARS-CoV-2-NCBI-2022-06-17.naf -o SARS-CoV-2-NCBI-2022-06-17.fasta

5. Use the data directly from NAF format, without writing the FASTA file to disk. E.g., counting sequences:

unnaf SARS-CoV-2-NCBI-2022-06-17.naf | grep '>' | wc -l
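
The disk space check from step 4 and the sequence count from step 5 can be sketched as two small shell helpers. This is only a minimal sketch and not part of NAF tools: the function names has_free_space and count_fasta_records are hypothetical, the 171 GB figure is treated as 171 GiB to err on the safe side, and header lines are matched only when ">" is the first character of the line.

```shell
# Hypothetical helper: succeed if directory $1 has at least $2 GiB available.
has_free_space() {
    # df -P prints one POSIX-format line per filesystem; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space in KiB.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# Hypothetical helper: count FASTA records read from standard input.
count_fasta_records() {
    # A FASTA record starts with a header line beginning with ">".
    grep -c '^>'
}

# Example usage (commented out because it needs the downloaded archive):
# has_free_space . 171 && unnaf SARS-CoV-2-NCBI-2022-06-17.naf -o SARS-CoV-2-NCBI-2022-06-17.fasta
# unnaf SARS-CoV-2-NCBI-2022-06-17.naf | count_fasta_records
```

Anchoring the pattern at the start of the line avoids counting a ">" that appears inside a header, and grep -c makes the separate wc -l step unnecessary.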

More details

The dataset was compressed with ennaf 1.3.0, using this command:

ennaf -22 --dna -o SARS-CoV-2-NCBI-2022-06-17.naf sequences.fasta

Adding "--long 31" to the compression command produces a slightly smaller archive of 250,981,625 bytes. We decided not to use this option this time, because the difference is small and because it requires more memory during decompression.

With the "-1 --dna" options, the same data compresses to 644,120,465 bytes.
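
To put the two quoted sizes in proportion, the ratio between them can be computed directly. This is an illustrative sketch: size_ratio is a hypothetical helper, and the only inputs are the two byte counts given above.

```shell
# Hypothetical helper: print how many times larger $1 is than $2, to two decimals.
size_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Archive sizes in bytes, copied from the text above.
fast_size=644120465   # compressed with "-1 --dna"
long_size=250981625   # compressed with "-22 --dna --long 31"

# Print the ratio of the "-1" archive to the "--long 31" archive.
size_ratio "$fast_size" "$long_size"
```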