An early variant of a coronavirus in a sample from Antarctica
More and more large scientific datasets collected by researchers are becoming publicly available in various fields. In astronomy, for example, the Sloan Digital Sky Survey was a project that prepared the first 3D map of the universe. The survey mapped the positions, shapes and, in some cases, the spectra of 300 million galaxies. This dataset is now used as a reference map by many other research projects, just as Google Maps is used for orientation on a daily basis.
Thanks to the rapid development of technology over the past decade, next-generation sequencing (the reading of the genetic material of living things) has advanced at a rate that surpasses even the already exponential development of microelectronics and computers. This has allowed the sequencing of thousands of species and the digitisation of the genomes of hundreds of thousands of cancer patients, as well as animals, plants, and environmental samples, each containing thousands of species, mostly bacteria. The majority of these data have been uploaded to the EBI ENA, the NCBI SRA, and other online archives of sequence data. These data are used as a kind of “Google Maps” by many disciplines, from cancer immunotherapy to epidemiology. The volume of data is growing day by day and currently takes up hundreds of petabytes. This is a very rich and valuable dataset for all relevant research.
“It is interesting that whenever we take a sample, it can contain not only what we intend to collect, but also many other things,” says physicist István Csabai, explaining the background of the recently published study. “For example, in one of our studies published some years ago, we analysed the sequencing dataset of ‘global wastewater’. It was a pioneering project organised by our Danish colleagues. They collected wastewater samples from about one hundred cities around the world. At that time, their aim was not to observe pathogens but to study the worldwide distribution of antimicrobial resistance. They were successful, but we did further analysis and looked for human mitochondrial DNA. Our colleagues reminded us that the laboratory procedures were designed for bacteria, and that wastewater normally contains so much bacterial genetic material that we would end up empty-handed. However, this is not what happened. Not only did we find traces of human DNA, but we were also able to reconstruct so-called human haplogroups that corresponded to the actual distribution of ethnic groups on Earth.”
The current study is similar. “Making use of the incredibly rich sequence read archives, we looked for samples that were collected before December 2019 and contained genomic traces of SARS-CoV-2.
This presented a big technical challenge due to the enormous amount of data mentioned above, but thanks to the resourcefulness of IT specialists, algorithms applying clever mathematical tricks, and the ever-increasing speed of computers, it could be overcome. This was a kind of Google search for genetic sequences.
We hoped that traces of SARS-CoV-2 would be found in some samples collected for other reasons, and that this could bring us closer to the origins of COVID-19. It was an exciting discovery that one of the best hits was a set of soil samples collected in Antarctica. The samples came from places where seals and penguins congregate, so in theory it is plausible that seals, like other carnivores such as cats or ferrets, may carry the virus. It might then have reached Wuhan’s infamous seafood market through infected fish. These are the strange and, in most cases, false hypotheses that researchers often raise when they see data that is hard to explain.
As we dug deeper, it became apparent that the genetic traces of SARS-CoV-2 had nothing to do with Antarctica, but could rather be explained by contamination from other samples.
Thus, ‘freedom of speech of data’ forced us to modify our hypothesis. I am deeply sorry for this because it would have been a much more intriguing scientific story and a far less politicised issue. With further tests, we were able to identify the potential host species: humans, green monkeys, and Chinese hamsters,” said Professor Csabai. This genetic material probably did not come from living animals, but rather from cell cultures often used in virological experiments.
The missing piece of the mystery is the exact date of sequencing. “It is certain, at least, that the SARS-CoV-2 genome we discovered is very similar, but not identical, to the earliest known strain. Furthermore, we know that the DNA was extracted in December 2019 and sent for sequencing,” the professor added. Depending on when the contaminant samples were taken, the study could simply be an interesting story about the incredible possibilities inherent in big data analytics, but it could also lead us closer to understanding the origins of an ongoing pandemic.
The research is related to the ongoing projects Horizon 2020 (VEO No. 874735), Horizon Europe (BY-COVID No. 101046203), and FIEK_16-1-2016-0005 carried out at ELTE.