SARS-CoV-2 dataset
Download
- SARS-CoV-2-NCBI-2022-06-17.naf (251 MB)
- SARS-CoV-2-NCBI-2022-06-17.naf.md5
- SARS-CoV-2-NCBI-2022-06-17.fasta.md5
Contents
- 5,588,048 sequences
- Includes all SARS-CoV-2 nucleotide sequences from GenBank, up to 2022-06-17
- Compressed size in NAF format: 251 MB (251,236,122 bytes)
- Decompressed size in FASTA format: 170 GB (170,254,229,547 bytes)
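As a rough illustration of the figures listed above, the compression ratio and the average per-sequence sizes can be derived from the quoted byte counts. The variable names below are illustrative only; the numbers are taken from the list above.

```python
# Figures quoted in the Contents list.
NAF_BYTES = 251_236_122        # compressed size in NAF format
FASTA_BYTES = 170_254_229_547  # decompressed size in FASTA format
SEQUENCES = 5_588_048          # number of sequences

# How many times smaller the NAF archive is than the FASTA file.
ratio = FASTA_BYTES / NAF_BYTES

# Average bytes per sequence record in each representation.
naf_bytes_per_sequence = NAF_BYTES / SEQUENCES
fasta_bytes_per_sequence = FASTA_BYTES / SEQUENCES

print(f"Compression ratio: {ratio:.1f}x")
print(f"Average NAF bytes per sequence: {naf_bytes_per_sequence:.1f}")
print(f"Average FASTA bytes per sequence: {fasta_bytes_per_sequence:.1f}")
```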
Rationale
1. To provide a compact redistributable dataset containing the entire set of public SARS-CoV-2 genome data (up to June 17, 2022).
2. To demonstrate the efficiency of the NAF format.
How to use
1. Download the data:
2. Verify the downloaded file:
3. Install NAF tools:
3.a. Either from GitHub:
3.b. Or using bioconda:
4. Decompress the SARS-CoV-2 dataset into FASTA format (after checking that 171 GB of free disk space is available):
5. Use the data directly from the NAF format, e.g., to count sequences:
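Some of the checks in the steps above can also be scripted. The following is a minimal Python sketch, not the commands from this page. It assumes that the .md5 file uses the common md5sum layout (hex digest first, then the file name) and that the downloaded files sit in the current directory. All function names are hypothetical helpers introduced for illustration.

```python
import hashlib
import shutil
from pathlib import Path

NAF_FILE = Path("SARS-CoV-2-NCBI-2022-06-17.naf")
MD5_FILE = Path("SARS-CoV-2-NCBI-2022-06-17.naf.md5")
FASTA_BYTES = 170_254_229_547  # decompressed size quoted in Contents


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5_path):
    """Read the digest from an md5sum-style file (assumed: '<digest>  <name>')."""
    return Path(md5_path).read_text().split()[0].lower()


def has_free_space(directory, required_bytes):
    """Check whether the file system holding `directory` has enough free bytes."""
    return shutil.disk_usage(directory).free >= required_bytes


def count_fasta_records(lines):
    """Count FASTA records in an iterable of text lines (headers start with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


if __name__ == "__main__":
    if NAF_FILE.exists() and MD5_FILE.exists():
        ok = md5_of_file(NAF_FILE) == expected_md5(MD5_FILE)
        print("MD5 check:", "OK" if ok else "MISMATCH")
    else:
        print("Dataset files not found in the current directory; skipping MD5 check.")
    print("Enough space for FASTA:", has_free_space(".", FASTA_BYTES))
```

`count_fasta_records` accepts any iterable of text lines, for example `sys.stdin` when FASTA output is piped into the script, so the sequences can be counted without writing the 170 GB file to disk.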
More details
The dataset was compressed with ennaf 1.3.0, using this command:
Adding "--long 31" to the compression command produces a slightly smaller archive of 250,981,625 bytes. We decided not to use this option this time, because the difference is small (254,497 bytes, about 0.1%) and because it requires more memory during decompression.
Using the "-1 --dna" options compresses this data to 644,120,465 bytes.
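To put the three archive sizes mentioned above side by side, here is a small Python sketch of the arithmetic. The constant names are illustrative; the byte counts are the ones quoted in this section and in Contents.

```python
# Archive sizes quoted on this page, in bytes.
DEFAULT_BYTES = 251_236_122  # the distributed archive
LONG31_BYTES = 250_981_625   # with "--long 31" added
FAST_BYTES = 644_120_465     # with "-1 --dna"

# Space saved by "--long 31", in bytes and relative to the distributed archive.
saved = DEFAULT_BYTES - LONG31_BYTES
saved_percent = 100 * saved / DEFAULT_BYTES

# How much larger the "-1 --dna" archive is than the distributed one.
fast_vs_default = FAST_BYTES / DEFAULT_BYTES

print(f"--long 31 saves {saved:,} bytes ({saved_percent:.2f}%)")
print(f"-1 --dna archive is {fast_vs_default:.2f}x the distributed size")
```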