I need to tread carefully here because this study is highly technical, obviously above my pay grade, and was written by someone who’s not drawing firm conclusions about the virus’s origins based on his findings. But the basic conclusion is fascinating and the author, Jesse Bloom, believes his approach is fertile ground for learning more about where the virus might have come from.
Bloom went looking for genome sequences from early samples of the virus in Wuhan. He dug through the NIH database and found something — only to discover the data curiously missing when he tried to download it. That’s not necessarily NIH’s fault, he made clear in a Twitter thread last night. Sometimes the people who provided the data will email the agency and request deletion for whatever reason. Undaunted, Bloom went looking online to see if he could recover the missing data in an archive somewhere. And he did, finding data for 34 samples from Google Cloud that allowed him to partially reconstruct the genomes of the virus from those samples.
And when he did, what he found surprised him. Normally, says Bloom, we’d expect the earliest samples of SARS-CoV-2 to most closely resemble whichever bat virus it originated from. As the virus spreads among the human population it’ll inevitably mutate little by little, making it less like the “progenitor” virus that infected patient zero. If the virus really did come from the Huanan Seafood Market in Wuhan then we should expect samples taken from the market to be the most bat-like and then, as people became infected around town, the samples taken from them to look progressively less bat-like.
That’s not what the samples show, says Bloom.
Instead, early Huanan Seafood Market #SARSCoV2 viruses are more different from bat coronaviruses than #SARSCoV2 viruses collected later in China and even other countries. @lpipes @ras_nielsen give nice technical analysis at https://t.co/d18fFjNyPX (9/n)
— Bloom Lab (@jbloom_lab) June 22, 2021
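For readers who want to picture what "more different from bat coronaviruses" means in practice, here is a toy sketch in Python. It is not Bloom's method, which relies on phylogenetic analysis of real, carefully annotated genomes. It only counts the positions at which two already-aligned sequences differ. The sequences and the names `bat_relative`, `market_sample` and `later_sample` are invented for illustration and are far shorter than a real genome.

```python
def mutation_count(seq_a, seq_b):
    """Count positions where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# Hypothetical toy data: made-up 10-letter strings, not real viral genomes.
bat_relative = "ACGTACGTAC"
samples = {
    "market_sample": "ACGAACTTAG",
    "later_sample": "ACGTACTTAC",
}

# Distance of each sample from the bat relative, in number of differences.
distances = {
    name: mutation_count(seq, bat_relative) for name, seq in samples.items()
}

# Samples ordered from most bat-like to least bat-like.
ranked = sorted(distances, key=distances.get)

for name in ranked:
    print(name, distances[name])
```

In this invented data the market sample sits further from the bat relative than the later sample does, which is the kind of pattern Bloom says he found and the opposite of what we'd expect if the market were where it all started.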
If there were two different “types” of SARS-CoV-2 circulating early, he reasons, then it may be that they’re both offshoots of a “progenitor” virus that was circulating in Wuhan before the outbreak at the Huanan market:
Both progenitors suggest #SARSCoV2 was circulating in Wuhan before December outbreak at Huanan Seafood Market, which is corroborated by lots of other evidence, including news articles from China in early 2020 (see intro to my paper linked in first Tweet in this thread). (15/n)
— Bloom Lab (@jbloom_lab) June 22, 2021
Bloom put it this way in the abstract of his study: “Phylogenetic analysis of these sequences in the context of carefully annotated existing data suggests that the Huanan Seafood Market sequences that are the focus of the joint WHO-China report are not fully representative of the viruses in Wuhan early in the epidemic. Instead, the progenitor of known SARS-CoV-2 sequences likely contained three mutations relative to the market viruses that made it more similar to SARS-CoV-2’s bat coronavirus relatives.” He’s careful not to say that that implies a lab leak, but if I’m understanding his results correctly then they must be a point in favor of a lab leak, no? If the viral samples linked to the market are less similar to bat viruses than the samples taken from people not linked to the market then logically it should be less likely that the virus made the jump from animals to humans at the market.
In theory, I guess, the “progenitor” virus could have jumped from an animal to a human naturally at some other spot in Wuhan and then mutated a bit before arriving somehow at the market. But it’d be awfully strange for a bat virus to make the journey across China to a city that doesn’t have bats and then leap to humans in the wild — but not at the most logical place in the wild for that leap to be made. Which leaves us to consider the logical alternative: Did the “progenitor” virus infect someone at the Wuhan Institute of Virology instead?
Would that explain why it looked less like a bat virus when it finally ended up at the seafood market, because it had already passed through a number of people and mutated in a divergent way?
As I say, this is above my pay grade. This part of Bloom’s study is clear, though:
The fact that such an informative data set was deleted has implications beyond those gleaned directly from the recovered sequences. Samples from early outpatients in Wuhan are a gold mine for anyone seeking to understand spread of the virus. Even my analysis of the partial sequences is revealing, and it clearly would have been more scientifically informative to fully sequence the samples rather than surreptitiously delete the partial sequences. There is no plausible scientific reason for the deletion: the sequences are perfectly concordant with the samples described in Wang et al. (2020a,b), there are no corrections to the paper, the paper states human subjects approval was obtained, and the sequencing shows no evidence of plasmid or sample-to-sample contamination. It therefore seems likely the sequences were deleted to obscure their existence. Particularly in light of the directive that labs destroy early samples (Pingui 2020) and multiple orders requiring approval of publications on COVID-19 (China CDC 2020; Kang et al. 2020a), this suggests a less than wholehearted effort to trace early spread of the epidemic.
Why was the data deleted if there’s no reason to believe it was faulty? Maybe that was just the Chinese government being the Chinese government and following its instinct to suppress information in a crisis even when that information isn’t necessarily inculpatory. But it could also be that whoever deleted it had reason to know there was a lab leak and wanted to obscure the genomic evidence of early samples precisely in order to prevent genomic detectives like Bloom from figuring out that the virus at the market didn’t look the way we’d expect it to look if in fact the leap from bats (or some intermediary host) took place there.
It also raises the question of why NIH agreed to the deletion. Bloom thinks there’s probably an innocent explanation for that:
In case of data set I describe above, it seems possible that trust that the NIH Sequence Read Archive grants to scientific authors to delete data may have been used to obscure sequences informative for understanding early #SARSCoV2. (20/n)
— Bloom Lab (@jbloom_lab) June 22, 2021
NIH processes a lot of data and doesn’t have the resources to scrutinize every request for a deletion. Although … one would think that a request to delete data on early samples from the worst pandemic in a century might have drawn extra attention from the agency. Particularly considering what sort of government is calling the shots in the country where those samples came from.
Why did they agree to delete in this case? Are there archived versions of other early samples that might help us pinpoint where, exactly, the virus most likely leaped to humans? Inquiring minds want to know.