Here, we present a set of manually curated datasets of psychrotolerant prokaryotic genomes, their plasmids, and metagenomes from cold environments, recovered from public repositories, mainly PATRICBRC (currently BV-BRC), GenBank and ENA Metagenomics. The datasets comprise 3978 psychrotolerant genomes (Bacteria: 3822, Archaea: 156), 484 plasmids of psychrotolerants (PsychroPlasDb) and 2831 metagenomes.
We further analyzed these datasets to identify plasmid-like contigs, phage genomes (including prophages), and transposases. Briefly, metagenomes were quality-filtered with fastp and then assembled with MEGAHIT. Genes were called on the assembled contigs with prodigal-gv, and the predicted proteins were searched with MMseqs2 against the following reference databases to select contigs possibly representing plasmids (MOB-suite's MOB and REP reference protein sequences; E-value 1e-5, 75% sequence identity, 95% query coverage), phages (PHROGs head and packaging, connector, tail, and lysis protein profiles; E-value 1e-5, 70% query coverage), and transposases (TnCentral's transposase proteins; E-value 1e-5, 75% sequence identity, 95% query coverage). Plasmid and phage candidate contigs were then annotated with Bakta to remove contaminants (e.g. contigs encoding rRNA genes). Additionally, phage contigs were analyzed with CheckV to determine the quality of the recovered phage genomes and their contamination with host sequences (so far, only genomes of at least medium quality and shorter than 200 kb have been considered). Psychrotolerant genomes were also searched with PhiSpy and SigMa to identify prophages; predictions for Psychrobacter representatives were additionally curated manually. As a result, we present sets of 14523 plasmids, 23608 phages, 6979 MOB and TRA proteins, 7770 REP proteins, 4554 phage TerL proteins and 9899 phage MCP proteins, as well as 129905 transposases.
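
The threshold-based selection of candidate contigs described above can be illustrated with a minimal Python sketch. It is not the pipeline used to build the datasets. It assumes that the MMseqs2 hits were exported as tab-separated lines with the columns query, target, pident, evalue and qcov (identity in percent, coverage as a fraction), that thresholds are inclusive, and that protein identifiers follow the Prodigal convention of the contig name plus a numeric suffix. The names `Thresholds`, `THRESHOLDS`, `contig_of` and `candidate_contigs` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_evalue: float        # maximum accepted E-value
    min_qcov: float          # minimum query coverage, as a fraction (0-1)
    min_pident: float = 0.0  # minimum sequence identity, in percent (0-100)


# Thresholds as stated in the text; treating them as inclusive is an assumption.
THRESHOLDS = {
    "plasmid": Thresholds(max_evalue=1e-5, min_qcov=0.95, min_pident=75.0),
    "phage": Thresholds(max_evalue=1e-5, min_qcov=0.70),
    "transposase": Thresholds(max_evalue=1e-5, min_qcov=0.95, min_pident=75.0),
}


def contig_of(protein_id: str) -> str:
    """Strip the trailing gene index (assumed format: <contig>_<n>)."""
    return protein_id.rsplit("_", 1)[0]


def candidate_contigs(lines, thr: Thresholds) -> set:
    """Return contigs with at least one protein hit passing all thresholds.

    Each line is assumed to hold: query, target, pident, evalue, qcov.
    """
    contigs = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        query, _target, pident, evalue, qcov = line.split("\t")[:5]
        if (
            float(evalue) <= thr.max_evalue
            and float(pident) >= thr.min_pident
            and float(qcov) >= thr.min_qcov
        ):
            contigs.add(contig_of(query))
    return contigs


if __name__ == "__main__":
    # Made-up hits, for illustration only.
    example_hits = [
        "k141_1_1\tMOB_A\t80.0\t1e-20\t0.97",
        "k141_2_1\tMOB_C\t90.0\t1e-3\t0.99",
    ]
    print(candidate_contigs(example_hits, THRESHOLDS["plasmid"]))
```

In this sketch a contig is kept as soon as one of its proteins passes the thresholds; the subsequent Bakta and CheckV filtering steps are not covered.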