The potential for advancing algal biofuels and bioproducts relies on using algae strains that are best suited for industrial production. Genomic sequence data (the functional information in the DNA of a specific organism such as algae) can reveal the genes and regulatory mechanisms that control how a given strain grows and responds to stress. By screening genomes from a wide range of diverse algae, scientists can learn how to cultivate fast-growing, high-quality strains. A high-quality genome has no gaps in its sequence and accurately reflects all of the DNA in the strain.
The importance of genomic information for optimizing biomanufacturing, a process that uses the growth of plants and/or micro-organisms (e.g., algae, yeast, or bacteria) to create bioproducts, is well known and widely accepted. However, analyses by scientists from the Los Alamos National Laboratory (LANL) suggest that, as the availability of algae genomes expands, the data about novel algae genome sequences are becoming increasingly unreliable and may leave out critical information about the strains’ DNA. Although more algae sequences are available now than ever before, the quality of the genomic data published in the literature is increasingly inconsistent, with gaps and mistakes. This lack of quality can misrepresent which genes, and therefore which functions, are available in a given species.
Data, Sequencing, and Databases Make Algae More Accessible
The natural variety of algae springs from their vast genetic diversity and complexity. The identification of genes and pathways in different strains presents many opportunities for the development of biofuels, bioproducts, and even therapeutics. By identifying key genes through genomic sequencing, new species of algae can be tapped for these applications. Over the last decade, genome sequencing has become “democratized” as faster, less expensive sequencing machines have become available and the resulting data have become more readily shareable through publicly available databases. Such databases make it possible to quickly analyze a wide breadth of algal biodiversity.
“We hope to use this data to better understand algal biology and evolution,” said LANL evolutionary biologist Erik Hanschen. “We also hope to discover novel proteins, biochemical pathways, and untapped natural products by studying these algal genomes.”
Not All Genome Sequencing Is Created Equal
Hanschen, along with Blake Hovde, LANL computational biologist and Applied Genomics Team Leader, and Shawn Starkenburg, LANL deputy group leader, recently published a technical paper and a review article on the current state of public databases and the accepted methodology for assessing genomes. The researchers evaluated algal genomes for contiguity, or how completely the genome has been assembled into long, unbroken stretches of sequence, as well as for gene content.
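Contiguity is often summarized with a single number. The following is a minimal sketch, assuming the widely used N50 statistic (the article does not say which contiguity metric the researchers used) and made-up contig lengths. N50 is the length of the contig at which the running total, counted from the longest contig down, first reaches half of the assembly size.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given the lengths of its contigs."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running total to half of the
        # assembly size defines N50.
        if running >= half_total:
            return length


# Hypothetical assemblies of the same total size (1,000 bases each).
fragmented = [100] * 10          # many short pieces
contiguous = [600, 300, 100]     # a few long pieces

print(n50(fragmented))   # 100
print(n50(contiguous))   # 600
```

Both example assemblies contain the same amount of sequence, but the one built from fewer, longer pieces scores higher, which is the sense in which longer contigs and scaffolds indicate a more contiguous genome.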
In addition to concerns about contiguity, the authors describe a benchmarking tool called Benchmarking Universal Single-Copy Orthologs (BUSCO) that helps evaluate gene content. This tool determines how many genes from a well-curated gene set exist in a given species’ sequence, indicating whether that sequence is high or low quality. A novel genome is compared against the known reference data, and its quality is assessed by counting how many of the known genes appear in the novel sequence data. High-quality genomes include most of the known genes, while lower-quality genomes include fewer of them.
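The scoring idea described above can be sketched in a few lines. This is an illustration only, not BUSCO itself: the function name, the gene identifiers, and the gene lists are all hypothetical, and the real tool finds genes by sequence search rather than by matching names.

```python
def gene_content_score(reference_genes, detected_genes):
    """Report how many genes from a curated reference set appear in a genome."""
    reference = set(reference_genes)
    if not reference:
        raise ValueError("reference gene set must not be empty")
    found = reference & set(detected_genes)
    missing = reference - found
    return {
        "found": len(found),
        "missing": sorted(missing),
        "percent_found": 100.0 * len(found) / len(reference),
    }


# Hypothetical curated gene set and two hypothetical genome annotations.
reference = ["geneA", "geneB", "geneC", "geneD"]
high_quality = ["geneA", "geneB", "geneC", "geneD", "geneX"]
low_quality = ["geneA", "geneX"]

print(gene_content_score(reference, high_quality)["percent_found"])  # 100.0
print(gene_content_score(reference, low_quality)["percent_found"])   # 25.0
```

Genes outside the reference set (here `geneX`) do not affect the score, so the result depends entirely on how well the reference set fits the organism being assessed.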
“Algae strains of interest that have low-quality genomes are more likely to have overlooked genes and pathways, information which may be critical to the question we’re investigating,” said Hanschen. “A genome with an incomplete read of the genes gives a misleading view of the actual genetic pathways available and functioning in the strain.”
Improving Public Databases: When More Is Sometimes Less
BUSCO has proved to be a helpful quality assurance tool for specific algal lineages, such as Chlorophyta for green algae and Stramenopiles for brown algae, but it is not reliable for all algal genomes. To improve the quality of public databases, the researchers offered additional strategies, such as using, whenever possible, sequencing technologies that yield longer contigs and scaffolds. Until additional lineage-specific databases are developed, the researchers promote the use of the Eukaryota database of BUSCO genes, as it is intended for all eukaryotes. However, lineage-specific datasets remain the most valuable and reliable.
Hanschen emphasized the need for high-quality data sets: “Such high-quality data sets, which could look like completed genomes or even provide missing annotations, will be really useful to our own work, but also provide a roadmap for others to produce similarly high-quality data sets.”
The priority for future algal genomics is clear: the quality of data is just as important as the quantity of data.
Funding and Mission
This research was funded by the U.S. Department of Energy Office of Energy Efficiency and Renewable Energy Bioenergy Technologies Office. The work supports the Complex Natural and Engineered Systems and Science of Signatures capability pillars.
Erik R. Hanschen, Shawn R. Starkenburg. The State of Algal Genome Quality and Diversity. Algal Research 50 (2020) 101968. https://doi.org/10.1016/j.algal.2020.101968
Erik R. Hanschen, Blake T. Hovde, Shawn R. Starkenburg. An Evaluation of Methodology to Determine Algal Genome Completeness. Algal Research 51 (2020) 102019. https://doi.org/10.1016/j.algal.2020.102019