Colonbiome Project: Pilot Finds Two Sequencing Methods Compatible
By Deborah Borfitz
April 28, 2020 | A growing body of research points to the potential of gut microbes as markers for early diagnosis of severe illnesses, including colon cancer. For researchers with the Bellvitge Biomedical Research Institute (IDIBELL) and Catalan Institute of Oncology (ICO) in Barcelona, the collection of gut microbiome data is just the starting point for the vast statistical stew they’ll eventually be mining for answers. Genetic, metabolomic and lifestyle information will all be in the mix to control for risk factors.
The idea is to join as many data types as possible to identify the ones that can most easily foretell disease in future patient populations, says Joan Mas-Lloret, a bioinformatician with the oncology data analytics program at ICO, IDIBELL and the multi-institutional Colonbiome Project. If that turns out to be a person’s diet, perhaps there will be no need to sequence their genome.
Mas-Lloret is first author of a study recently published in Scientific Data describing results of the first pilot test focused on two gene sequencing techniques used to detect microbiome diversity. The pilot was designed to learn what insights could be gleaned using the shotgun (examining all metagenomic DNA) and 16S (targeting a specific marker gene) sequencing approaches, he says. Microbiome profiles from the same samples can differ when analyzed by one technique versus the other.
Shotgun technologies are more expensive and may not be feasible when the amount of host DNA in a sample is high, Mas-Lloret explains, but can distinguish more species of microorganisms. With 16S, more samples can be sequenced because of the lower price point and biopsy samples can be analyzed without the interference of the human genome. On the other hand, less detailed information is returned.
Also important is “the amount of microbial sequencing reads necessary to get the full picture of the microbiome in a sample,” he adds. “However, it is not well established which is the required depth to get the full picture, and that was one of the things we meant to explore.”
For the pilot study, cross-sectional colon biopsies and fecal samples were obtained from nine participants, says Mas-Lloret. The metagenomic analysis involved between 47 and 92 million reads per sample and the targeted sequencing covered more than 300,000 reads per sample across seven hypervariable regions of the 16S gene.
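One common way to look at the depth question is a rarefaction curve: subsample the reads at increasing depths and count how many distinct taxa turn up. The Python sketch below is only an illustration of that idea, not the Colonbiome pipeline; the taxon labels, the counts and the `rarefaction_curve` helper are all invented for the example.

```python
import random


def rarefaction_curve(taxon_counts, depths, seed=0):
    """Map each depth to the number of distinct taxa seen in a random
    subsample of that many reads (drawn without replacement)."""
    rng = random.Random(seed)
    # Expand the count table into one taxon label per read.
    reads = [taxon for taxon, n in taxon_counts.items() for _ in range(n)]
    curve = {}
    for depth in depths:
        if depth > len(reads):
            raise ValueError("depth exceeds the number of available reads")
        subsample = rng.sample(reads, depth)
        curve[depth] = len(set(subsample))
    return curve


# Hypothetical read counts for one sample (10,000 reads in total).
toy_counts = {
    "taxon_A": 5000,
    "taxon_B": 3000,
    "taxon_C": 1500,
    "taxon_D": 400,
    "taxon_E": 90,
    "taxon_F": 10,
}

curve = rarefaction_curve(toy_counts, depths=[10, 100, 1000, 10000])
for depth, n_taxa in curve.items():
    print(depth, n_taxa)
```

When the curve flattens, extra reads are no longer revealing new taxa; rare taxa such as the hypothetical `taxon_F` are the ones most likely to be missed at shallow depth.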
Results of the bioinformatic exercise reveal that the two sequencing techniques are consistent with each other and that results of complete (shotgun) sequencing do not contradict those of single-gene (16S) sequencing, Mas-Lloret reports.
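Readers who want to put a number on how closely two profiles of the same sample agree could start from a dissimilarity measure such as Bray-Curtis, which is 0 for identical profiles and 1 for profiles with no taxa in common. The Python sketch below assumes that choice of metric; the article does not say which measures the authors used, and the genus-level counts are invented.

```python
def relative_abundance(counts):
    """Convert raw read counts per taxon into proportions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two abundance profiles."""
    taxa = set(p) | set(q)
    numerator = sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in taxa)
    denominator = sum(p.get(t, 0.0) + q.get(t, 0.0) for t in taxa)
    return numerator / denominator


# Hypothetical genus-level counts for one sample profiled two ways.
shotgun_counts = {"Bacteroides": 50, "Faecalibacterium": 30, "Escherichia": 20}
s16_counts = {"Bacteroides": 60, "Faecalibacterium": 40}

dissimilarity = bray_curtis(
    relative_abundance(shotgun_counts),
    relative_abundance(s16_counts),
)
print(round(dissimilarity, 3))
```

In this toy case the genus detected by only one method contributes half of the total difference, which is the kind of effect the public dataset lets others examine on real samples.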
“Anyone interested in how different technologies or types of samples can affect detection can use our data to answer those questions,” Mas-Lloret says. All of the data have been entered into the European Nucleotide Archive (mirrored by the Sequence Read Archive of the National Institutes of Health) and are publicly available. Data collaboration is common in the field of microbiome research because data confidentiality is not an issue as it is with gene sequencing, he notes.
The computer code that enabled researchers to produce bacterial profiles from the sequences has also been uploaded to the repository, Mas-Lloret adds. Bioinformatics analysis of microbiome data is complex and, in the absence of standards, not well established.
Four Aims
The Colonbiome Project began in early 2018 and will take at least a few more years to complete, says Mas-Lloret. The shotgun data have been collected and are being analyzed, and researchers have most of the 16S data they will need.
Since they’ve classified sequences from the DNA, they know which bacteria are present and are now looking at differences between people with and without colon cancer, he continues. The difficulty is that microbiome data include many microorganisms and are “highly variable,” he says, meaning “we need to ensure the methodology we use is prepared for the particularities of the dataset.”
The project is enrolling participants who have gone through colorectal cancer screening, Mas-Lloret says. Shotgun metagenomics will be assessed for 50 subjects in each of three groups—patients with cancer, those with high-risk adenoma (benign tumor) and healthy controls. The 16S sequencing method will be used for another 150 subjects in each group.
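The planned numbers add up as follows, assuming that “another” means the 16S subjects are distinct from the shotgun subjects. The short Python tally below encodes only that reading; the variable names are illustrative.

```python
# Study groups named in the article.
GROUPS = ("colorectal cancer", "high-risk adenoma", "healthy control")

SHOTGUN_PER_GROUP = 50   # subjects per group assessed by shotgun metagenomics
S16_PER_GROUP = 150      # additional subjects per group assessed by 16S

shotgun_total = len(GROUPS) * SHOTGUN_PER_GROUP
s16_total = len(GROUPS) * S16_PER_GROUP
# Assumption: no subject is counted under both methods.
overall_total = shotgun_total + s16_total

print(shotgun_total, s16_total, overall_total)
```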
Following assessment of the microbiome data, the first aim of the project, multi-omic analysis begins, he says. This will include the search for associations between lifestyle, diet, medications, genetic variation and gene expression data (from a collaboration with researchers at the University of Virginia) and microbiome diversity. “It gets trickier and trickier when you get into multi-omics assessment and start mixing data from genetic and biology domains [endeavoring to reach] … interesting and meaningful conclusions.”
Mas-Lloret says the plan is to use familiar statistical approaches and adapt them to the dataset. One potential approach is Mendelian randomization, which could help determine whether the presence of bacteria is causing colorectal cancer or colorectal cancer is favoring the presence of the bacteria.
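The logic of Mendelian randomization can be illustrated with simulated data. In the Python sketch below, a genetic variant influences a bacterial abundance (the exposure), an unmeasured confounder influences both the exposure and the outcome, and the simplest estimator, the Wald ratio, divides the variant-outcome association by the variant-exposure association. Everything here is an assumption made for illustration: the effect sizes, the continuous outcome and the single variant are invented, and a real analysis of cancer risk would need a binary outcome model and checks of the method's assumptions.

```python
import random


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def simulate(n=20000, true_effect=0.3, seed=1):
    """Simulate genotype G, exposure X and outcome Y with a hidden confounder."""
    rng = random.Random(seed)
    genotypes, exposures, outcomes = [], [], []
    for _ in range(n):
        # Genotype: number of copies (0, 1 or 2) of a hypothetical variant.
        g = int(rng.random() < 0.3) + int(rng.random() < 0.3)
        u = rng.gauss(0, 1)                            # unmeasured confounder
        x = 1.0 * g + u + rng.gauss(0, 1)              # exposure (bacterial abundance)
        y = true_effect * x + u + rng.gauss(0, 1)      # outcome (continuous risk score)
        genotypes.append(g)
        exposures.append(x)
        outcomes.append(y)
    return genotypes, exposures, outcomes


G, X, Y = simulate()

# Naive regression slope of outcome on exposure, biased by the confounder.
naive_slope = covariance(X, Y) / covariance(X, X)
# Wald ratio: variant-outcome association over variant-exposure association.
wald_ratio = covariance(G, Y) / covariance(G, X)

print(round(naive_slope, 2), round(wald_ratio, 2))
```

Because the variant is assigned independently of the confounder, the Wald ratio should land near the simulated effect of 0.3 while the naive slope overshoots it. Testing the reverse direction would mean repeating the exercise with variants tied to cancer risk as the instrument.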
A metabolome assessment will follow to look for connections between metabolites, the microbiome and host characteristics. “The idea here is to analyze the metabolome of plasma … and that can be done in several ways—a targeted approach where we search for specific metabolites or untargeted where we look at what is there in a more unsupervised way,” says Mas-Lloret. A combination approach is most likely, he adds, or the data may be too complex to analyze at all.
The ultimate goal is to come up with colorectal cancer risk models that incorporate lifestyle, genetics, microbiome and metabolomics data. “In the future, we may be able to produce a model you can feed microbiome data or diet data to learn the likelihood of a person developing adenoma or colon cancer.
“But we still need all the data,” he quickly adds, “and we have to establish statistical relationships and then find a good method to develop a predictive model.” Delays in data acquisition due to the COVID-19 pandemic have made it hard to say whether the model will be achieved by the original 2021 target date.